mirror of
https://github.com/remsky/Kokoro-FastAPI.git
synced 2025-04-13 09:39:17 +00:00
Bump version to v0.2.0-pre, enhance Docker configurations for GPU support, and refine text processing settings
This commit is contained in:
parent
3ee43cea23
commit
d452a6e114
11 changed files with 136 additions and 98 deletions
CHANGELOG.md (25 lines changed)

@@ -2,6 +2,31 @@
Notable changes to this project will be documented in this file.

## [v0.2.0-pre] - 2025-02-06

### Added
- Complete Model Overhaul:
  - Upgraded to Kokoro v1.0 model architecture
  - Pre-installed multi-language support from Misaki:
    - English (en), Japanese (ja), Korean (ko), Chinese (zh), Vietnamese (vi)
  - All voice packs included for supported languages, along with the original versions
- Enhanced Audio Generation Features:
  - Per-word timestamped caption generation
  - Phoneme-based audio generation capabilities
  - Detailed phoneme generation
- Web UI Improvements:
  - Improved voice mixing with weighted combinations
  - Text file upload support
  - Enhanced formatting and user interface
  - Cleaner UI (in progress)
- Integration with the https://github.com/hexgrad/kokoro and https://github.com/hexgrad/misaki packages

### Removed
- Deprecated support for the Kokoro v0.19 model

### Changes
- The Combine Voices endpoint now returns a .pt file; otherwise, voice combinations are generated on the fly

## [v0.1.4] - 2025-01-30

### Added
- Smart Chunking System:
README.md (166 lines changed)

@@ -5,19 +5,21 @@
# <sub><sub>_`FastKoko`_ </sub></sub>

[](https://huggingface.co/hexgrad/Kokoro-82M/tree/c3b0d86e2a980e027ef71c28819ea02e351c2667) [](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)
[](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)

> Support for Kokoro-82M v1.0 coming very soon! Dev build on the `v0.1.5-integration` branch

[](https://huggingface.co/hexgrad/Kokoro-82M/commit/9901c2b79161b6e898b7ea857ae5298f47b8b0d6)

Dockerized FastAPI wrapper for the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model
- OpenAI-compatible Speech endpoint, with inline voice combination and mapped naming/models for strict systems
- NVIDIA GPU-accelerated or CPU inference (ONNX or PyTorch for either)
- Very fast generation time:
  - ~35x-100x+ real-time speed via 4060Ti+
  - ~5x+ real-time speed via M3 Pro CPU
- Streaming support and tempfile generation, phoneme-based dev endpoints
- (new) Integrated web UI on localhost:8880/web
- (new) Debug endpoints for monitoring threads, storage, and session pools
- Multi-language support (English, Japanese, Korean, Chinese, Vietnamese)
- OpenAI-compatible Speech endpoint, NVIDIA GPU-accelerated or CPU inference with PyTorch
- ONNX support coming soon; see v0.1.5 and earlier for legacy ONNX support in the interim
- Debug endpoints for monitoring threads, storage, and session pools
- Integrated web UI on localhost:8880/web
- Phoneme-based audio generation, phoneme generation
- (new) Per-word timestamped caption generation
- (new) Voice mixing with weighted combinations

## Get Started
@@ -49,18 +51,16 @@ docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.4 #NVI
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

cd docker/gpu # OR
# cd docker/cpu # Run this or the above
cd docker/gpu # For GPU support
# or cd docker/cpu # For CPU support
docker compose up --build
# if you are missing any models, run:
# python ../scripts/download_model.py --type pth # for GPU
# python ../scripts/download_model.py --type onnx # for CPU
```

```bash
# Or directly via UV
./start-cpu.sh
./start-gpu.sh
# Models will auto-download, but if needed you can manually download:
python docker/scripts/download_model.py --output api/src/models/v1_0

# Or run directly via UV:
./start-gpu.sh # For GPU support
./start-cpu.sh # For CPU support
```
</details>

<details>
@@ -111,7 +111,6 @@ with client.audio.speech.with_streaming_response.create(
- API Documentation: http://localhost:8880/docs
- Web Interface: http://localhost:8880/web
- Gradio UI (deprecating) can be accessed at http://localhost:7860 if enabled in the docker compose file (it is a separate image!)
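For reference, a minimal sketch of calling the endpoint through the `openai` Python client, matching the `with_streaming_response` usage referenced in the hunk header above; the model and voice names are taken from other examples in this README:

```python
# A minimal sketch, assuming the server is running at localhost:8880 and the
# openai package (>= 1.x) is installed; model/voice values follow the other
# examples in this README.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!",
) as response:
    response.stream_to_file("output.mp3")
```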

<div align="center" style="display: flex; justify-content: center; gap: 10px;">
  <img src="assets/docs-screenshot.png" width="40%" alt="API Documentation" style="border: 2px solid #333; padding: 10px;">
@@ -172,9 +171,10 @@ python examples/assorted_checks/test_voices/test_all_voices.py # Test all availa
<details>
<summary>Voice Combination</summary>

- Averages model weights of any existing voicepacks
- Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for a 67%/33% mix)
- Ratios are automatically normalized to sum to 100%
- Available through any endpoint by adding weights in parentheses
- Saves generated voicepacks for future use
- (new) Available through any endpoint; simply concatenate the desired packs with "+"

Combine voices and generate audio:
```python
@@ -182,22 +182,46 @@ import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Create combined voice (saves locally on server)
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json=[voices[0], voices[1]]
)
combined_voice = response.json()["voice"]

# Generate audio with combined voice (or, simply pass multiple directly with `+`)
# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": combined_voice,  # or skip the above step with f"{voices[0]}+{voices[1]}"
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)
```

<p align="center">
  <img src="assets/voice_analysis.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
@@ -220,46 +244,6 @@ response = requests.post(

</details>

<details>
<summary>Gradio Web Utility</summary>

Access the interactive web UI at http://localhost:7860 after starting the service. Features include:
- Voice/format/speed selection
- Audio playback and download
- Text file or direct input

If you only want the API, just comment out everything in the docker-compose.yml under and including `gradio-ui`

Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added

Running the UI Docker Service [deprecating]
- If you only want to run the Gradio web interface separately and connect it to an existing API service:
```bash
docker run -p 7860:7860 \
    -e API_HOST=<api-hostname-or-ip> \
    -e API_PORT=8880 \
    ghcr.io/remsky/kokoro-fastapi-ui:v0.1.4
```

- Replace `<api-hostname-or-ip>` with:
  - `kokoro-tts` if the UI container is running in the same Docker Compose setup.
  - `localhost` if the API is running on your local machine.

### Disabling Local Saving

You can disable local saving of audio files and hide the file view in the UI by setting the `DISABLE_LOCAL_SAVING` environment variable to `true`. This is useful when running the service on a server where you don't want to store generated audio files locally.

When using Docker Compose:
```yaml
environment:
  - DISABLE_LOCAL_SAVING=true
```

When running the Docker image directly:
```bash
docker run -p 7860:7860 -e DISABLE_LOCAL_SAVING=true ghcr.io/remsky/kokoro-fastapi-ui:v0.1.4
```
</details>

<details>
<summary>Streaming Support</summary>
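The streaming example itself is elided from this hunk; as a rough sketch of consuming the endpoint with chunked transfer, assuming the speech endpoint streams its response body as in other revisions of this README:

```python
# A rough sketch, assuming the speech endpoint streams its response body;
# voice and format values follow the other examples in this README.
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3"
    },
    stream=True,
)
with open("streamed_output.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
```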
@@ -357,10 +341,13 @@ Key Performance Metrics:

```bash
# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build

# CPU: ONNX optimized inference (~5x+ realtime speed on M3 Pro)
docker compose -f docker-compose.cpu.yml up --build
```
*Note: Overall speed may have decreased somewhat with the structural changes to accommodate streaming. Looking into it.*
</details>
@@ -372,6 +359,37 @@ docker compose -f docker-compose.cpu.yml up --build
- Helps to reduce artifacts and allows long-form processing, as the base model is currently configured for only approximately 30s of output
</details>

<details>
<summary>Timestamped Captions & Phonemes</summary>

Generate audio with word-level timestamps:
```python
import requests
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "wav"
    }
)

# Get timestamps from header
timestamps = json.loads(response.headers['X-Word-Timestamps'])
print("Word-level timestamps:")
for ts in timestamps:
    print(f"{ts['word']}: {ts['start_time']:.3f}s - {ts['end_time']:.3f}s")

# Save audio
with open("output.wav", "wb") as f:
    f.write(response.content)
```
</details>

<details>
<summary>Phoneme & Token Routes</summary>
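The body of this section is cut off by the hunk; purely as an illustrative sketch, assuming dev routes named /dev/phonemize and /dev/generate_from_phonemes (verify the exact paths and schemas against http://localhost:8880/docs):

```python
# Illustrative sketch only; endpoint paths and payload fields are assumptions,
# so check the live API docs for the authoritative schema.
import requests

# Text -> phonemes and tokens
resp = requests.post(
    "http://localhost:8880/dev/phonemize",
    json={"text": "Hello world!", "language": "a"}  # "a" matches process_text_chunk's default
)
print(resp.json())

# Phonemes -> audio
resp = requests.post(
    "http://localhost:8880/dev/generate_from_phonemes",
    json={"phonemes": "həlˈoʊ wˈɜːld", "voice": "af_bella"}
)
with open("phoneme_output.wav", "wb") as f:
    f.write(resp.content)
```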

VERSION (2 lines changed)

@@ -1 +1 @@
-v0.1.5-pre
+v0.2.0-pre
@@ -22,7 +22,11 @@ class Settings(BaseSettings):

    # Audio Settings
    sample_rate: int = 24000
    max_chunk_size: int = 400  # Maximum size of text chunks for processing
    # Text Processing Settings
    target_min_tokens: int = 175  # Target minimum tokens per chunk
    target_max_tokens: int = 250  # Target maximum tokens per chunk
    absolute_max_tokens: int = 450  # Absolute maximum tokens per chunk

    gap_trim_ms: int = 250  # Amount to trim from streaming chunk ends in milliseconds

    # Web Player Settings
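Since these are pydantic BaseSettings fields, they should be overridable through environment variables; the env-var names below assume pydantic's default prefix-free, case-insensitive mapping, which is worth verifying against the rest of config.py:

```python
# Minimal sketch: override the new chunking knobs before the settings
# singleton is imported. Env-var names assume pydantic's default mapping.
import os

os.environ["TARGET_MIN_TOKENS"] = "150"
os.environ["TARGET_MAX_TOKENS"] = "300"
os.environ["ABSOLUTE_MAX_TOKENS"] = "450"

from api.src.core.config import settings  # import path inferred from this diff

print(settings.target_min_tokens, settings.target_max_tokens, settings.absolute_max_tokens)
```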

@@ -7,11 +7,7 @@ from loguru import logger
from .phonemizer import phonemize
from .normalizer import normalize_text
from .vocabulary import tokenize

# Target token ranges
TARGET_MIN = 175
TARGET_MAX = 250
ABSOLUTE_MAX = 450
from ...core.config import settings

def process_text_chunk(text: str, language: str = "a", skip_phonemize: bool = False) -> List[int]:
    """Process a chunk of text through normalization, phonemization, and tokenization.

@@ -94,7 +90,7 @@ def get_sentence_info(text: str) -> List[Tuple[str, List[int], int]]:

    return results

async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerator[Tuple[str, List[int]], None]:
async def smart_split(text: str, max_tokens: int = settings.absolute_max_tokens) -> AsyncGenerator[Tuple[str, List[int]], None]:
    """Build optimal chunks targeting 300-400 tokens, never exceeding max_tokens."""
    start_time = time.time()
    chunk_count = 0

@@ -138,7 +134,7 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    count = len(tokens)

    # If adding clause keeps us under max and not optimal yet
    if clause_count + count <= max_tokens and clause_count + count <= TARGET_MAX:
    if clause_count + count <= max_tokens and clause_count + count <= settings.target_max_tokens:
        clause_chunk.append(full_clause)
        clause_tokens.extend(tokens)
        clause_count += count

@@ -161,7 +157,7 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    yield chunk_text, clause_tokens

    # Regular sentence handling
    elif current_count >= TARGET_MIN and current_count + count > TARGET_MAX:
    elif current_count >= settings.target_min_tokens and current_count + count > settings.target_max_tokens:
        # If we have a good sized chunk and adding next sentence exceeds target,
        # yield current chunk and start new one
        chunk_text = " ".join(current_chunk)

@@ -171,12 +167,12 @@ async def smart_split(text: str, max_tokens: int = ABSOLUTE_MAX) -> AsyncGenerat
    current_chunk = [sentence]
    current_tokens = tokens
    current_count = count
    elif current_count + count <= TARGET_MAX:
    elif current_count + count <= settings.target_max_tokens:
        # Keep building chunk while under target max
        current_chunk.append(sentence)
        current_tokens.extend(tokens)
        current_count += count
    elif current_count + count <= max_tokens and current_count < TARGET_MIN:
    elif current_count + count <= max_tokens and current_count < settings.target_min_tokens:
        # Only exceed target max if we haven't reached minimum size yet
        current_chunk.append(sentence)
        current_tokens.extend(tokens)
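For orientation, a hypothetical usage sketch of the refactored smart_split generator; the module path is an assumption based on the relative imports shown in this hunk:

```python
# Hypothetical usage sketch; the module name is assumed, so adjust the import
# to wherever smart_split lives in api/src/services/text_processing.
import asyncio
from api.src.services.text_processing.text_processor import smart_split

async def main():
    text = "Some long-form input. " * 200
    # Chunks now honor settings.target_min_tokens / target_max_tokens /
    # absolute_max_tokens instead of the removed module-level constants.
    async for chunk_text, tokens in smart_split(text):
        print(f"{len(tokens):4d} tokens | {chunk_text[:60]}")

asyncio.run(main())
```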

Binary file not shown. (Before: 385 KiB)
Binary file not shown. (Before: 283 KiB, After: 420 KiB)

@@ -46,14 +46,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app:/app/api \
    PATH="/app/.venv/bin:$PATH" \
    UV_LINK_MODE=copy
    UV_LINK_MODE=copy \
    USE_GPU=false

# Core settings that differ from config.py defaults
ENV USE_GPU=false

# Model download flags (container-specific)
ENV DOWNLOAD_MODEL=false

ENV DOWNLOAD_MODEL=true
# Download model if enabled
RUN if [ "$DOWNLOAD_MODEL" = "true" ]; then \
        python download_model.py --output api/src/models/v1_0; \
@@ -44,10 +44,9 @@ ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app:/app/api \
    PATH="/app/.venv/bin:$PATH" \
    UV_LINK_MODE=copy \
    USE_GPU=true \
    USE_ONNX=false \
    DOWNLOAD_MODEL=true

    USE_GPU=true

ENV DOWNLOAD_MODEL=true
# Download model if enabled
RUN if [ "$DOWNLOAD_MODEL" = "true" ]; then \
        python download_model.py --output api/src/models/v1_0; \
@@ -12,7 +12,6 @@ services:
    environment:
      - PYTHONPATH=/app:/app/api
      - USE_GPU=true
      - USE_ONNX=false
      - PYTHONUNBUFFERED=1
    deploy:
      resources:
@@ -186,7 +186,7 @@
    transform: translateY(-1px);
    box-shadow: 0 4px 12px rgba(99, 102, 241, 0.2);
}

/* Cancel Button Styles */
.player-btn.cancel {
    background: #976161;