mirror of
https://github.com/remsky/Kokoro-FastAPI.git
synced 2025-04-13 09:39:17 +00:00
Bump version to v0.2.0-pre, enhance Docker configurations for GPU support, and refine text processing settings
This commit is contained in:
parent
3ee43cea23
commit
d452a6e114
11 changed files with 136 additions and 98 deletions
CHANGELOG.md (25 lines changed)

@@ -2,6 +2,31 @@
Notable changes to this project will be documented in this file.

## [v0.2.0-pre] - 2025-02-06

### Added
- Complete Model Overhaul:
  - Upgraded to Kokoro v1.0 model architecture
  - Pre-installed multi-language support from Misaki:
    - English (en), Japanese (ja), Korean (ko), Chinese (zh), Vietnamese (vi)
  - All voice packs included for supported languages, along with the original versions
- Enhanced Audio Generation Features:
  - Per-word timestamped caption generation
  - Phoneme-based audio generation capabilities
  - Detailed phoneme generation
- Web UI Improvements:
  - Improved voice mixing with weighted combinations
  - Text file upload support
  - Enhanced formatting and user interface
  - Cleaner UI (in progress)
- Integration with the https://github.com/hexgrad/kokoro and https://github.com/hexgrad/misaki packages

### Removed
- Deprecated support for the Kokoro v0.19 model

### Changes
- The Combine Voices endpoint now returns a .pt file; otherwise, voice combinations are generated on the fly

## [v0.1.4] - 2025-01-30

### Added
- Smart Chunking System:
README.md (166 lines changed)

@@ -5,19 +5,21 @@
# <sub><sub>_`FastKoko`_ </sub></sub>

[](https://huggingface.co/hexgrad/Kokoro-82M/tree/c3b0d86e2a980e027ef71c28819ea02e351c2667) [](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)
[](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)

> Support for Kokoro-82M v1.0 coming very soon! Dev build on the `v0.1.5-integration` branch

[](https://huggingface.co/hexgrad/Kokoro-82M/commit/9901c2b79161b6e898b7ea857ae5298f47b8b0d6)

Dockerized FastAPI wrapper for the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model
- OpenAI-compatible Speech endpoint, with inline voice combination and mapped naming/models for strict systems
- NVIDIA GPU-accelerated or CPU inference (ONNX or PyTorch for either)
- Very fast generation time:
  - ~35x-100x+ real-time speed via 4060Ti+
  - ~5x+ real-time speed via M3 Pro CPU
- Streaming support and tempfile generation, phoneme-based dev endpoints
- (new) Integrated web UI on localhost:8880/web
- (new) Debug endpoints for monitoring threads, storage, and session pools
- Multi-language support (English, Japanese, Korean, Chinese, Vietnamese)
- OpenAI-compatible Speech endpoint, NVIDIA GPU-accelerated or CPU inference with PyTorch
- ONNX support coming soon; see v0.1.5 and earlier for legacy ONNX support in the interim
- Debug endpoints for monitoring threads, storage, and session pools
- Integrated web UI on localhost:8880/web
- Phoneme-based audio generation, phoneme generation
- (new) Per-word timestamped caption generation
- (new) Voice mixing with weighted combinations

## Get Started
@@ -49,18 +51,16 @@ docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.4 #NVI
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

cd docker/gpu # OR
# cd docker/cpu # Run this or the above
cd docker/gpu # For GPU support
# or cd docker/cpu # For CPU support
docker compose up --build
# if you are missing any models, run:
# python ../scripts/download_model.py --type pth # for GPU
# python ../scripts/download_model.py --type onnx # for CPU
```

```bash
# Or directly via UV
./start-cpu.sh
./start-gpu.sh
# Models will auto-download, but if needed you can manually download:
python docker/scripts/download_model.py --output api/src/models/v1_0

# Or run directly via UV:
./start-gpu.sh # For GPU support
./start-cpu.sh # For CPU support
```
</details>

<details>
@@ -111,7 +111,6 @@ with client.audio.speech.with_streaming_response.create(
- API Documentation: http://localhost:8880/docs
- Web Interface: http://localhost:8880/web
- Gradio UI (deprecating) can be accessed at http://localhost:7860 if enabled in the docker compose file (it is a separate image!)
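For reference, a minimal sketch of calling the endpoint through the `openai` Python client, matching the `with_streaming_response` usage referenced in the hunk header above; the model and voice names are taken from other examples in this README:

```python
# A minimal sketch, assuming the server is running at localhost:8880 and the
# openai package (>= 1.x) is installed; model/voice values follow the other
# examples in this README.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!",
) as response:
    response.stream_to_file("output.mp3")
```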

<div align="center" style="display: flex; justify-content: center; gap: 10px;">
  <img src="assets/docs-screenshot.png" width="40%" alt="API Documentation" style="border: 2px solid #333; padding: 10px;">
@@ -172,9 +171,10 @@ python examples/assorted_checks/test_voices/test_all_voices.py # Test all availa
<details>
<summary>Voice Combination</summary>

- Averages model weights of any existing voicepacks
- Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for a 67%/33% mix)
- Ratios are automatically normalized to sum to 100%
- Available through any endpoint by adding weights in parentheses
- Saves generated voicepacks for future use
- (new) Available through any endpoint; simply concatenate the desired packs with "+"

Combine voices and generate audio:
```python
@@ -182,22 +182,46 @@ import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Create combined voice (saves locally on server)
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json=[voices[0], voices[1]]
)
combined_voice = response.json()["voice"]

# Generate audio with combined voice (or, simply pass multiple directly with `+`)
# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": combined_voice,  # or skip the above step with f"{voices[0]}+{voices[1]}"
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)
```

<p align="center">
  <img src="assets/voice_analysis.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
@@ -220,46 +244,6 @@ response = requests.post(

</details>

<details>
<summary>Gradio Web Utility</summary>

Access the interactive web UI at http://localhost:7860 after starting the service. Features include:
- Voice/format/speed selection
- Audio playback and download
- Text file or direct input

If you only want the API, just comment out everything in the docker-compose.yml under and including `gradio-ui`

Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added

Running the UI Docker Service [deprecating]
- If you only want to run the Gradio web interface separately and connect it to an existing API service:
```bash
docker run -p 7860:7860 \
    -e API_HOST=<api-hostname-or-ip> \
    -e API_PORT=8880 \
    ghcr.io/remsky/kokoro-fastapi-ui:v0.1.4
```

- Replace `<api-hostname-or-ip>` with:
  - `kokoro-tts` if the UI container is running in the same Docker Compose setup.
  - `localhost` if the API is running on your local machine.

### Disabling Local Saving

You can disable local saving of audio files and hide the file view in the UI by setting the `DISABLE_LOCAL_SAVING` environment variable to `true`. This is useful when running the service on a server where you don't want to store generated audio files locally.

When using Docker Compose:
```yaml
environment:
  - DISABLE_LOCAL_SAVING=true
```

When running the Docker image directly:
```bash
docker run -p 7860:7860 -e DISABLE_LOCAL_SAVING=true ghcr.io/remsky/kokoro-fastapi-ui:v0.1.4
```
</details>

<details>
<summary>Streaming Support</summary>
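The streaming example itself is elided from this hunk; as a rough sketch of consuming the endpoint with chunked transfer, assuming the speech endpoint streams its response body as in other revisions of this README:

```python
# A rough sketch, assuming the speech endpoint streams its response body;
# voice and format values follow the other examples in this README.
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3"
    },
    stream=True,
)
with open("streamed_output.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
```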
@@ -357,10 +341,13 @@ Key Performance Metrics:

```bash
# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build

# CPU: ONNX optimized inference (~5x+ realtime speed on M3 Pro)
docker compose -f docker-compose.cpu.yml up --build
```
*Note: Overall speed may have decreased somewhat with the structural changes to accommodate streaming. Looking into it.*
</details>
@@ -372,6 +359,37 @@ docker compose -f docker-compose.cpu.yml up --build
- Helps to reduce artifacts and allows long-form processing, as the base model is currently configured for only approximately 30s of output
</details>

<details>
<summary>Timestamped Captions & Phonemes</summary>

Generate audio with word-level timestamps:
```python
import requests
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "wav"
    }
)

# Get timestamps from header
timestamps = json.loads(response.headers['X-Word-Timestamps'])
print("Word-level timestamps:")
for ts in timestamps:
    print(f"{ts['word']}: {ts['start_time']:.3f}s - {ts['end_time']:.3f}s")

# Save audio
with open("output.wav", "wb") as f:
    f.write(response.content)
```
</details>

<details>
<summary>Phoneme & Token Routes</summary>
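The body of this section is cut off by the hunk; purely as an illustrative sketch, assuming dev routes named /dev/phonemize and /dev/generate_from_phonemes (verify the exact paths and schemas against http://localhost:8880/docs):

```python
# Illustrative sketch only; endpoint paths and payload fields are assumptions,
# so check the live API docs for the authoritative schema.
import requests

# Text -> phonemes and tokens
resp = requests.post(
    "http://localhost:8880/dev/phonemize",
    json={"text": "Hello world!", "language": "a"}  # "a" matches process_text_chunk's default
)
print(resp.json())

# Phonemes -> audio
resp = requests.post(
    "http://localhost:8880/dev/generate_from_phonemes",
    json={"phonemes": "həlˈoʊ wˈɜːld", "voice": "af_bella"}
)
with open("phoneme_output.wav", "wb") as f:
    f.write(resp.content)
```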

VERSION (2 lines changed)

@@ -1 +1 @@
-v0.1.5-pre
+v0.2.0-pre
@@ -22,7 +22,11 @@ class Settings(BaseSettings):

    # Audio Settings
    sample_rate: int = 24000
    max_chunk_size: int = 400  # Maximum size of text chunks for processing
    # Text Processing Settings
    target_min_tokens: int = 175  # Target minimum tokens per chunk
    target_max_tokens: int = 250  # Target maximum tokens per chunk
    absolute_max_tokens: int = 450  # Absolute maximum tokens per chunk

    gap_trim_ms: int = 250  # Amount to trim from streaming chunk ends in milliseconds

    # Web Player Settings
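Since these are pydantic BaseSettings fields, they should be overridable through environment variables; the env-var names below assume pydantic's default prefix-free, case-insensitive mapping, which is worth verifying against the rest of config.py:

```python
# Minimal sketch: override the new chunking knobs before the settings
# singleton is imported. Env-var names assume pydantic's default mapping.
import os

os.environ["TARGET_MIN_TOKENS"] = "150"
os.environ["TARGET_MAX_TOKENS"] = "300"
os.environ["ABSOLUTE_MAX_TOKENS"] = "450"

from api.src.core.config import settings  # import path inferred from this diff

print(settings.target_min_tokens, settings.target_max_tokens, settings.absolute_max_tokens)
```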

@@ -7,11 +7,7 @@ from loguru import logger
from .phonemizer import phonemize
from .normalizer import normalize_text
from .vocabulary import tokenize

# Target token ranges
TARGET_MIN = 175
TARGET_MAX = 250
ABSOLUTE_MAX = 450
from ...core.config import settings

def process_text_chunk(text: str, language: str = "a", skip_phonemize: bool = False) -> List[int]:
    """Process a chunk of text through normalization, phonemization, and tokenization.

@@ -94,7 +90,7 @@ def get_sentence_info(text: str) -> List[Tuple[str, List[int], int]]:

    return results

async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerator[Tuple[str, List[int]], None]:
async def smart_split(text: str, max_tokens: int = settings.absolute_max_tokens) -> AsyncGenerator[Tuple[str, List[int]], None]:
    """Build optimal chunks targeting 300-400 tokens, never exceeding max_tokens."""
    start_time = time.time()
    chunk_count = 0

@@ -138,7 +134,7 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    count = len(tokens)

    # If adding clause keeps us under max and not optimal yet
    if clause_count + count <= max_tokens and clause_count + count <= TARGET_MAX:
    if clause_count + count <= max_tokens and clause_count + count <= settings.target_max_tokens:
        clause_chunk.append(full_clause)
        clause_tokens.extend(tokens)
        clause_count += count

@@ -161,7 +157,7 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    yield chunk_text, clause_tokens

    # Regular sentence handling
    elif current_count >= TARGET_MIN and current_count + count > TARGET_MAX:
    elif current_count >= settings.target_min_tokens and current_count + count > settings.target_max_tokens:
        # If we have a good sized chunk and adding next sentence exceeds target,
        # yield current chunk and start new one
        chunk_text = " ".join(current_chunk)

@@ -171,12 +167,12 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    current_chunk = [sentence]
    current_tokens = tokens
    current_count = count
    elif current_count + count <= TARGET_MAX:
    elif current_count + count <= settings.target_max_tokens:
        # Keep building chunk while under target max
        current_chunk.append(sentence)
        current_tokens.extend(tokens)
        current_count += count
    elif current_count + count <= max_tokens and current_count < TARGET_MIN:
    elif current_count + count <= max_tokens and current_count < settings.target_min_tokens:
        # Only exceed target max if we haven't reached minimum size yet
        current_chunk.append(sentence)
        current_tokens.extend(tokens)
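For orientation, a hypothetical usage sketch of the refactored smart_split generator; the module path is an assumption based on the relative imports shown in this hunk:

```python
# Hypothetical usage sketch; the module name is assumed, so adjust the import
# to wherever smart_split lives in api/src/services/text_processing.
import asyncio
from api.src.services.text_processing.text_processor import smart_split

async def main():
    text = "Some long-form input. " * 200
    # Chunks now honor settings.target_min_tokens / target_max_tokens /
    # absolute_max_tokens instead of the removed module-level constants.
    async for chunk_text, tokens in smart_split(text):
        print(f"{len(tokens):4d} tokens | {chunk_text[:60]}")

asyncio.run(main())
```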

Binary file not shown. (Before: 385 KiB)
Binary file not shown. (Before: 283 KiB, After: 420 KiB)

@@ -46,14 +46,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app:/app/api \
    PATH="/app/.venv/bin:$PATH" \
    UV_LINK_MODE=copy
    UV_LINK_MODE=copy \
    USE_GPU=false

# Core settings that differ from config.py defaults
ENV USE_GPU=false

# Model download flags (container-specific)
ENV DOWNLOAD_MODEL=false

ENV DOWNLOAD_MODEL=true
# Download model if enabled
RUN if [ "$DOWNLOAD_MODEL" = "true" ]; then \
        python download_model.py --output api/src/models/v1_0; \
@@ -44,10 +44,9 @@ ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app:/app/api \
    PATH="/app/.venv/bin:$PATH" \
    UV_LINK_MODE=copy \
    USE_GPU=true \
    USE_ONNX=false \
    DOWNLOAD_MODEL=true

    USE_GPU=true

ENV DOWNLOAD_MODEL=true
# Download model if enabled
RUN if [ "$DOWNLOAD_MODEL" = "true" ]; then \
        python download_model.py --output api/src/models/v1_0; \
@@ -12,7 +12,6 @@ services:
    environment:
      - PYTHONPATH=/app:/app/api
      - USE_GPU=true
      - USE_ONNX=false
      - PYTHONUNBUFFERED=1
    deploy:
      resources:
@@ -186,7 +186,7 @@
    transform: translateY(-1px);
    box-shadow: 0 4px 12px rgba(99, 102, 241, 0.2);
}

/* Cancel Button Styles */
.player-btn.cancel {
    background: #976161;