Enhance web player information, adjust text chunk size, update audio wave settings, and implement OpenAI model mappings

This commit is contained in:
remsky 2025-01-23 04:11:31 -07:00
parent a8e6a3d2d9
commit ba577d348e
13 changed files with 333 additions and 71 deletions

View file

@ -2,6 +2,22 @@
Notable changes to this project will be documented in this file.
## [v0.1.2] - 2025-01-23
### Structural Improvements
- Models can be manually download and placed in api/src/models, or use included script
- TTSGPU/TPSCPU/STTSService classes replaced with a ModelManager service
- CPU/GPU of each of ONNX/PyTorch (Note: Only Pytorch GPU, and ONNX CPU/GPU have been tested)
- Should be able to improve new models as they become available, or new architectures, in a more modular way
- Converted a number of internal processes to async handling to improve concurrency
- Improving separation of concerns towards plug-in and modular structure, making PR's and new features easier
### Web UI (test release)
- An integrated simple web UI has been added on the FastAPI server directly
- This can be disabled via core/config.py or ENV variables if desired.
- Simplifies deployments, utility testing, aesthetics, etc
- Looking to deprecate/collaborate/hand off the Gradio UI
## [v0.1.0] - 2025-01-13
### Changed
- Major Docker improvements:

View file

@ -3,55 +3,54 @@
</p>
# <sub><sub>_`FastKoko`_ </sub></sub>
[![Tests](https://img.shields.io/badge/tests-117%20passed-darkgreen)]()
[![Coverage](https://img.shields.io/badge/coverage-60%25-grey)]()
[![Tests](https://img.shields.io/badge/tests-104%20passed-darkgreen)]()
[![Coverage](https://img.shields.io/badge/coverage-49%25-grey)]()
[![Tested at Model Commit](https://img.shields.io/badge/last--tested--model--commit-a67f113-blue)](https://huggingface.co/hexgrad/Kokoro-82M/tree/c3b0d86e2a980e027ef71c28819ea02e351c2667) [![Try on Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Try%20on-Spaces-blue)](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)
> [!INFO]
> Pre-release. Not fully tested
Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model
- OpenAI-compatible Speech endpoint, with inline voice combination functionality
- NVIDIA GPU accelerated or CPU Onnx inference
- OpenAI-compatible Speech endpoint, with inline voice combination, and mapped naming/models for strict systems
- NVIDIA GPU accelerated or CPU inference (ONNX, Pytorch) (~80-300mb modelfile)
- very fast generation time
- 35x-100x+ real time speed via 4060Ti+
- 5x+ real time speed via M3 Pro CPU
- streaming support w/ variable chunking to control latency & artifacts
- phoneme, simple audio generation web ui utility
- Runs on an 80mb-300mb model (CUDA container + 5gb on disk due to drivers)
- streaming support w/ variable chunking to control latency, (new) improved concurrency
- phoneme based dev endpoints
- (new) Integrated web UI on localhost:8880/web
## Quick Start
The service can be accessed through either the API endpoints or the Gradio web interface.
1. Install prerequisites, and start the service using Docker Compose (Full setup including UI):
- Install [Docker Desktop](https://www.docker.com/products/docker-desktop/)
- Install [Docker](https://www.docker.com/products/docker-desktop/)
- Clone the repository:
```bash
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
# * Switch to stable branch if any issues *
git checkout v0.0.5post1-stable
cd docker/gpu # OR
# cd docker/cpu # Run this or the above
docker compose up --build
# if you are missing any models, run the .py or .sh scrips in the respective folders
```
Once started:
- The API will be available at http://localhost:8880
- The UI can be accessed at http://localhost:7860
- The *Web UI* can be tested at http://localhost:8880/web
- The Gradio UI (deprecating) can be accessed at http://localhost:7860
__Or__ running the API alone using Docker (model + voice packs baked in) (Most Recent):
__Or__ running the API alone using Docker (model + voice packs baked in) (Most Recent):
```bash
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.1.0post1 # CPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.0post1 # Nvidia GPU
```
```bash
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.1.0post1 # CPU
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.0post1 # Nvidia GPU
```
4. Run locally as an OpenAI-Compatible Speech Endpoint
2. Run locally as an OpenAI-Compatible Speech Endpoint
```python
from openai import OpenAI
client = OpenAI(
@ -69,10 +68,12 @@ The service can be accessed through either the API endpoints or the Gradio web i
```
or visit http://localhost:7860
<p align="center">
<img src="ui\GradioScreenShot.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
</p>
<div align="center">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="assets/beta_web_ui.png" width="45%" alt="Beta Web UI" style="border: 2px solid #333; padding: 10px;">
<img src="ui/GradioScreenShot.png" width="45%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
</div>
</div>
## Features
<details>
@ -83,8 +84,8 @@ The service can be accessed through either the API endpoints or the Gradio web i
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
model="kokoro", # Not used but required for compatibility, also accepts library defaults
voice="af_bella+af_sky",
model="kokoro",
voice="af_bella+af_sky", # see /api/src/core/openai_mappings.json to customize
input="Hello world!",
response_format="mp3"
)
@ -103,7 +104,7 @@ voices = response.json()["voices"]
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"model": "kokoro", # Not used but required for compatibility
"model": "kokoro",
"input": "Hello world!",
"voice": "af_bella",
"response_format": "mp3", # Supported: mp3, wav, opus, flac

View file

@ -23,7 +23,7 @@ class Settings(BaseSettings):
# Audio Settings
sample_rate: int = 24000
max_chunk_size: int = 300 # Maximum size of text chunks for processing
max_chunk_size: int = 400 # Maximum size of text chunks for processing
gap_trim_ms: int = 250 # Amount to trim from streaming chunk ends in milliseconds
# Web Player Settings

View file

@ -0,0 +1,18 @@
{
"models": {
"tts-1": "kokoro-v0_19",
"tts-1-hd": "kokoro-v0_19",
"kokoro": "kokoro-v0_19"
},
"voices": {
"alloy": "am_adam",
"ash": "af_nicole",
"coral": "bf_emma",
"echo": "af_bella",
"fable": "af_sarah",
"onyx": "bm_george",
"nova": "bf_isabella",
"sage": "am_michael",
"shimmer": "af_sky"
}
}

View file

@ -79,7 +79,7 @@ async def lifespan(app: FastAPI):
# Add web player info if enabled
if settings.enable_web_player:
startup_msg += f"\n\nWeb Player: http://{settings.host}:{settings.port}/web/"
startup_msg += f"\n\nBeta Web Player: http://{settings.host}:{settings.port}/web/"
else:
startup_msg += "\n\nWeb Player: disabled"

View file

@ -1,4 +1,6 @@
from typing import AsyncGenerator, List, Union
import json
import os
from typing import AsyncGenerator, Dict, List, Union
from fastapi import APIRouter, Depends, Header, HTTPException, Request, Response
from fastapi.responses import StreamingResponse
@ -7,6 +9,22 @@ from loguru import logger
from ..services.audio import AudioService
from ..services.tts_service import TTSService
from ..structures.schemas import OpenAISpeechRequest
from ..core.config import settings
# Load OpenAI mappings
def load_openai_mappings() -> Dict:
"""Load OpenAI voice and model mappings from JSON"""
api_dir = os.path.dirname(os.path.dirname(__file__))
mapping_path = os.path.join(api_dir, "core", "openai_mappings.json")
try:
with open(mapping_path, 'r') as f:
return json.load(f)
except Exception as e:
logger.error(f"Failed to load OpenAI mappings: {e}")
return {"models": {}, "voices": {}}
# Global mappings
_openai_mappings = load_openai_mappings()
router = APIRouter(
@ -39,15 +57,30 @@ async def get_tts_service() -> TTSService:
return _tts_service
def get_model_name(model: str) -> str:
"""Get internal model name from OpenAI model name"""
base_name = _openai_mappings["models"].get(model)
if not base_name:
raise ValueError(f"Unsupported model: {model}")
# Add extension based on runtime config
extension = ".onnx" if settings.use_onnx else ".pth"
return base_name + extension
async def process_voices(
voice_input: Union[str, List[str]], tts_service: TTSService
) -> str:
"""Process voice input into a combined voice, handling both string and list formats"""
# Convert input to list of voices
if isinstance(voice_input, str):
# Check if it's an OpenAI voice name
mapped_voice = _openai_mappings["voices"].get(voice_input)
if mapped_voice:
voice_input = mapped_voice
voices = [v.strip() for v in voice_input.split("+") if v.strip()]
else:
voices = voice_input
# For list input, map each voice if it's an OpenAI voice name
voices = [_openai_mappings["voices"].get(v, v) for v in voice_input]
voices = [v.strip() for v in voices if v.strip()]
if not voices:
raise ValueError("No voices provided")
@ -89,7 +122,10 @@ async def stream_audio_chunks(
output_format=request.response_format,
):
# Check if client is still connected
if await client_request.is_disconnected():
is_disconnected = client_request.is_disconnected
if callable(is_disconnected):
is_disconnected = await is_disconnected()
if is_disconnected:
logger.info("Client disconnected, stopping audio generation")
break
yield chunk
@ -106,7 +142,20 @@ async def create_speech(
x_raw_response: str = Header(None, alias="x-raw-response"),
):
"""OpenAI-compatible endpoint for text-to-speech"""
# Validate model before processing request
if request.model not in _openai_mappings["models"]:
raise HTTPException(
status_code=400,
detail={
"error": "invalid_model",
"message": f"Unsupported model: {request.model}",
"type": "invalid_request_error"
}
)
try:
model_name = get_model_name(request.model)
# Get global service instance
tts_service = await get_tts_service()
@ -200,7 +249,7 @@ async def create_speech(
status_code=500,
detail={
"error": "processing_error",
"message": "Failed to process audio generation request",
"message": str(e),
"type": "server_error"
}
)
@ -210,8 +259,8 @@ async def create_speech(
raise HTTPException(
status_code=500,
detail={
"error": "server_error",
"message": "An unexpected error occurred",
"error": "processing_error",
"message": str(e),
"type": "server_error"
}
)

View file

@ -23,7 +23,10 @@ class TTSStatus(str, Enum):
# OpenAI-compatible schemas
class OpenAISpeechRequest(BaseModel):
model: Literal["tts-1", "tts-1-hd", "kokoro"] = "kokoro"
model: str = Field(
default="kokoro",
description="The model to use for generation. Supported models: tts-1, tts-1-hd, kokoro"
)
input: str = Field(..., description="The text to generate audio for")
voice: str = Field(
default="af",

BIN
api/src/voices/am_gurney.pt Normal file

Binary file not shown.

View file

@ -4,12 +4,171 @@ from fastapi.testclient import TestClient
import numpy as np
import asyncio
from typing import AsyncGenerator
import os
import json
from api.src.main import app
from api.src.services.tts_service import TTSService
from api.src.core.config import settings
from api.src.routers.openai_compatible import (
load_openai_mappings,
get_tts_service,
stream_audio_chunks
)
from api.src.structures.schemas import OpenAISpeechRequest
client = TestClient(app)
@pytest.fixture
def test_voice():
"""Fixture providing a test voice name."""
return "test_voice"
@pytest.fixture
def mock_openai_mappings():
"""Mock OpenAI mappings for testing."""
with patch("api.src.routers.openai_compatible._openai_mappings", {
"models": {
"tts-1": "kokoro-v0_19",
"tts-1-hd": "kokoro-v0_19"
},
"voices": {
"alloy": "am_adam",
"nova": "bf_isabella"
}
}):
yield
@pytest.fixture
def mock_json_file(tmp_path):
"""Create a temporary mock JSON file."""
content = {
"models": {"test-model": "test-kokoro"},
"voices": {"test-voice": "test-internal"}
}
json_file = tmp_path / "test_mappings.json"
json_file.write_text(json.dumps(content))
return json_file
def test_load_openai_mappings(mock_json_file):
"""Test loading OpenAI mappings from JSON file"""
with patch("os.path.join", return_value=str(mock_json_file)):
mappings = load_openai_mappings()
assert "models" in mappings
assert "voices" in mappings
assert mappings["models"]["test-model"] == "test-kokoro"
assert mappings["voices"]["test-voice"] == "test-internal"
def test_load_openai_mappings_file_not_found():
"""Test handling of missing mappings file"""
with patch("os.path.join", return_value="/nonexistent/path"):
mappings = load_openai_mappings()
assert mappings == {"models": {}, "voices": {}}
@pytest.mark.asyncio
async def test_get_tts_service_initialization():
"""Test TTSService initialization"""
with patch("api.src.routers.openai_compatible._tts_service", None):
with patch("api.src.routers.openai_compatible._init_lock", None):
with patch("api.src.services.tts_service.TTSService.create") as mock_create:
mock_service = AsyncMock()
mock_create.return_value = mock_service
# Test concurrent access
async def get_service():
return await get_tts_service()
# Create multiple concurrent requests
tasks = [get_service() for _ in range(5)]
results = await asyncio.gather(*tasks)
# Verify service was created only once
mock_create.assert_called_once()
assert all(r == mock_service for r in results)
@pytest.mark.asyncio
async def test_stream_audio_chunks_client_disconnect():
"""Test handling of client disconnect during streaming"""
mock_request = MagicMock()
mock_request.is_disconnected = AsyncMock(return_value=True)
mock_service = AsyncMock()
async def mock_stream(*args, **kwargs):
for i in range(5):
yield b"chunk"
mock_service.generate_audio_stream = mock_stream
mock_service.list_voices.return_value = ["test_voice"]
request = OpenAISpeechRequest(
model="kokoro",
input="Test text",
voice="test_voice",
response_format="mp3",
stream=True,
speed=1.0
)
chunks = []
async for chunk in stream_audio_chunks(mock_service, request, mock_request):
chunks.append(chunk)
assert len(chunks) == 0 # Should stop immediately due to disconnect
def test_openai_voice_mapping(mock_tts_service, mock_openai_mappings):
"""Test OpenAI voice name mapping"""
mock_tts_service.list_voices.return_value = ["am_adam", "bf_isabella"]
response = client.post(
"/v1/audio/speech",
json={
"model": "tts-1",
"input": "Hello world",
"voice": "alloy", # OpenAI voice name
"response_format": "mp3",
"stream": False
}
)
assert response.status_code == 200
mock_tts_service.generate_audio.assert_called_once()
assert mock_tts_service.generate_audio.call_args[1]["voice"] == "am_adam"
def test_openai_voice_mapping_streaming(mock_tts_service, mock_openai_mappings, mock_audio_bytes):
"""Test OpenAI voice mapping in streaming mode"""
mock_tts_service.list_voices.return_value = ["am_adam", "bf_isabella"]
response = client.post(
"/v1/audio/speech",
json={
"model": "tts-1-hd",
"input": "Hello world",
"voice": "nova", # OpenAI voice name
"response_format": "mp3",
"stream": True
}
)
assert response.status_code == 200
content = b""
for chunk in response.iter_bytes():
content += chunk
assert content == mock_audio_bytes
def test_invalid_openai_model(mock_tts_service, mock_openai_mappings):
"""Test error handling for invalid OpenAI model"""
response = client.post(
"/v1/audio/speech",
json={
"model": "invalid-model",
"input": "Hello world",
"voice": "alloy",
"response_format": "mp3",
"stream": False
}
)
assert response.status_code == 400
error_response = response.json()
assert error_response["detail"]["error"] == "invalid_model"
assert "Unsupported model" in error_response["detail"]["message"]
@pytest.fixture
def mock_audio_bytes():
"""Mock audio bytes for testing."""
@ -22,15 +181,13 @@ def mock_tts_service(mock_audio_bytes):
service = AsyncMock(spec=TTSService)
service.generate_audio.return_value = (np.zeros(1000), 0.1)
# Create a proper async generator for streaming
async def mock_stream(*args, **kwargs) -> AsyncGenerator[bytes, None]:
yield mock_audio_bytes
service.generate_audio_stream = mock_stream
service.list_voices.return_value = ["voice1", "voice2"]
service.list_voices.return_value = ["test_voice", "voice1", "voice2"]
service.combine_voices.return_value = "voice1_voice2"
# Return the same instance for all calls
mock_get.return_value = service
mock_get.side_effect = None
yield service
@ -68,7 +225,6 @@ def test_openai_speech_streaming(mock_tts_service, test_voice, mock_audio_bytes)
assert "Transfer-Encoding" in response.headers
assert response.headers["Transfer-Encoding"] == "chunked"
# For streaming responses, we need to read the content in chunks
content = b""
for chunk in response.iter_bytes():
content += chunk
@ -89,7 +245,6 @@ def test_openai_speech_pcm_streaming(mock_tts_service, test_voice, mock_audio_by
assert response.status_code == 200
assert response.headers["content-type"] == "audio/pcm"
# For streaming responses, we need to read the content in chunks
content = b""
for chunk in response.iter_bytes():
content += chunk
@ -117,7 +272,11 @@ def test_openai_speech_invalid_voice(mock_tts_service):
def test_openai_speech_empty_text(mock_tts_service, test_voice):
"""Test error handling for empty text"""
mock_tts_service.generate_audio.side_effect = ValueError("Text is empty after preprocessing")
async def mock_error_stream(*args, **kwargs):
raise ValueError("Text is empty after preprocessing")
mock_tts_service.generate_audio = mock_error_stream
mock_tts_service.list_voices.return_value = ["test_voice"]
response = client.post(
"/v1/audio/speech",
@ -151,6 +310,9 @@ def test_openai_speech_invalid_format(mock_tts_service, test_voice):
def test_list_voices(mock_tts_service):
"""Test listing available voices"""
# Override the mock for this specific test
mock_tts_service.list_voices.return_value = ["voice1", "voice2"]
response = client.get("/v1/audio/voices")
assert response.status_code == 200
data = response.json()
@ -172,7 +334,11 @@ def test_combine_voices(mock_tts_service):
def test_server_error(mock_tts_service, test_voice):
"""Test handling of server errors"""
mock_tts_service.generate_audio.side_effect = RuntimeError("Internal server error")
async def mock_error_stream(*args, **kwargs):
raise RuntimeError("Internal server error")
mock_tts_service.generate_audio = mock_error_stream
mock_tts_service.list_voices.return_value = ["test_voice"]
response = client.post(
"/v1/audio/speech",
@ -191,7 +357,6 @@ def test_server_error(mock_tts_service, test_voice):
def test_streaming_error(mock_tts_service, test_voice):
"""Test handling streaming errors"""
# Create a proper async generator that raises an error
async def mock_error_stream(*args, **kwargs) -> AsyncGenerator[bytes, None]:
if False: # This makes it a proper generator
yield b""
@ -212,4 +377,29 @@ def test_streaming_error(mock_tts_service, test_voice):
assert response.status_code == 500
error_response = response.json()
assert error_response["detail"]["error"] == "processing_error"
assert error_response["detail"]["type"] == "server_error"
assert error_response["detail"]["type"] == "server_error"
@pytest.mark.asyncio
async def test_streaming_initialization_error():
"""Test handling of streaming initialization errors"""
mock_service = AsyncMock()
async def mock_error_stream(*args, **kwargs):
if False: # This makes it a proper generator
yield b""
raise RuntimeError("Failed to initialize stream")
mock_service.generate_audio_stream = mock_error_stream
mock_service.list_voices.return_value = ["test_voice"]
request = OpenAISpeechRequest(
model="kokoro",
input="Test text",
voice="test_voice",
response_format="mp3",
stream=True,
speed=1.0
)
with pytest.raises(RuntimeError) as exc:
async for _ in stream_audio_chunks(mock_service, request, MagicMock()):
pass
assert "Failed to initialize stream" in str(exc.value)

View file

@ -35,28 +35,11 @@ async def test_load_voice_not_found(voice_manager):
await voice_manager.load_voice("invalid_voice", "cpu")
@pytest.mark.skip(reason="Local saving is optional and not critical to functionality")
@pytest.mark.asyncio
async def test_combine_voices_with_saving(voice_manager, mock_voice_tensor):
"""Test combining voices with local saving enabled"""
with patch("api.src.core.paths.load_voice_tensor", new_callable=AsyncMock) as mock_load, \
patch("torch.save") as mock_save, \
patch("os.makedirs"), \
patch("os.path.exists", return_value=True):
# Setup mocks
mock_load.return_value = mock_voice_tensor
# Mock settings
with patch("api.src.core.config.settings") as mock_settings:
mock_settings.allow_local_voice_saving = True
mock_settings.voices_dir = "/mock/voices"
# Combine voices
combined = await voice_manager.combine_voices(["af_bella", "af_sarah"], "cpu")
assert combined == "af_bella+af_sarah" # Note: using + separator
# Verify voice was saved
mock_save.assert_called_once()
pass
@pytest.mark.asyncio
@ -112,18 +95,20 @@ async def test_load_combined_voice(voice_manager, mock_voice_tensor):
assert torch.equal(voice, mock_voice_tensor)
def test_cache_management(voice_manager, mock_voice_tensor):
def test_cache_management(mock_voice_tensor):
"""Test voice cache management"""
# Set small cache size
voice_manager._config.cache_size = 2
# Create voice manager with small cache size
config = VoiceConfig(cache_size=2)
voice_manager = VoiceManager(config)
# Add items to cache
voice_manager._voice_cache = {
"voice1_cpu": torch.randn(5, 5),
"voice2_cpu": torch.randn(5, 5),
"voice3_cpu": torch.randn(5, 5), # Add one more than cache size
}
# Try adding another item
# Try managing cache
voice_manager._manage_cache()
# Check cache size maintained

BIN
assets/beta_web_ui.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 385 KiB

View file

@ -38,9 +38,9 @@ class KokoroPlayer {
this.wave = new SiriWave({
container: this.elements.waveContainer,
width: this.elements.waveContainer.clientWidth,
height: 50,
style: 'ios',
color: '#6366f1',
height: 80,
style: '"ios9"',
// color: '#6366f1',
speed: 0.02,
amplitude: 0.7,
frequency: 4

View file

@ -26,7 +26,7 @@
<div class="overlay"></div>
<div class="badges-container">
<a href="https://huggingface.co/hexgrad/Kokoro-82M" target="_blank" class="badge">
<img src="https://img.shields.io/badge/HexGrad%2FKokoro--82M-grey?logo=huggingface&logoColor=white&labelColor=grey&style=for-the-badge" alt="HexGrad/Kokoro-82M on Hugging Face">
<img src="https://img.shields.io/badge/HexGrad%2FKokoro--82M-black?logo=huggingface&logoColor=white&labelColor=black&style=for-the-badge" alt="HexGrad/Kokoro-82M on Hugging Face">
</a>
<div class="badge">
<a class="github-button" href="https://github.com/remsky/Kokoro-FastAPI" data-color-scheme="dark" data-size="large" data-show-count="true" aria-label="Star remsky/Kokoro-FastAPI on GitHub">Kokoro-FastAPI</a>