mirror of https://github.com/remsky/Kokoro-FastAPI.git synced 2025-04-13 09:39:17 +00:00

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching

fastapi huggingface-spaces kokoro kokoro-tts onnx onnxruntime openai-compatible-api openwebui pytorch sillytavern tts tts-api uv

remsky f800c4ecf9 Added mp3 samples		2024-12-31 03:48:26 -07:00
api	Refactor Docker setup to use a dedicated model-fetcher service and update schemas for additional voice support	2024-12-31 03:41:45 -07:00
examples	Added mp3 samples	2024-12-31 03:48:26 -07:00
.coverage	Refactor TTS API and enhance testing setup with coverage and logging improvements	2024-12-31 02:55:51 -07:00
.coveragerc	Refactor TTS API and enhance testing setup with coverage and logging improvements	2024-12-31 02:55:51 -07:00
.gitignore	Add initial implementation of Kokoro TTS API with Docker GPU support	2024-12-30 04:17:50 -07:00
.ruff.toml	Refactor TTS API and enhance testing setup with coverage and logging improvements	2024-12-31 02:55:51 -07:00
docker-compose.yml	Refactor Docker setup to use a dedicated model-fetcher service and update schemas for additional voice support	2024-12-31 03:41:45 -07:00
Dockerfile	Refactor Docker setup to use a dedicated model-fetcher service and update schemas for additional voice support	2024-12-31 03:41:45 -07:00
githubbanner.png	Update README with performance benchmarks and usage examples; add benchmark plotting script	2024-12-30 04:53:29 -07:00
pytest.ini	Refactor TTS API and enhance testing setup with coverage and logging improvements	2024-12-31 02:55:51 -07:00
README.md	Update README and tests to clarify audio format support and enhance documentation	2024-12-31 03:46:31 -07:00
requirements-test.txt	Added basic pytest on the fastapi side	2024-12-30 13:25:30 -07:00
requirements.txt	Enhance TTS API with logging, voice pack loading, and schema updates	2024-12-31 01:57:00 -07:00

README.md

Kokoro TTS API

FastAPI wrapper for Kokoro-82M text-to-speech model, providing an OpenAI-compatible endpoint with:

NVIDIA GPU acceleration enabled
automatic chunking/stitching for long texts
very fast generation time (~35-49x RTF)

Quick Start

Install prerequisites:
- Install Docker Desktop
- Install Git (or download and extract zip)

Clone and run:

# Clone repository
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

# Start the API (will automatically clone source HF repo via git-lfs)
docker compose up --build

Test all voices:

python examples/test_all_voices.py

Test OpenAI compatibility:

python examples/test_openai_tts.py

OpenAI-Compatible API

List available voices:

import requests

response = requests.get("http://localhost:8000/audio/voices")
response.raise_for_status()  # Fail early on an HTTP error
voices = response.json()["voices"]

Generate speech:

import requests

response = requests.post(
    "http://localhost:8000/audio/speech",
    json={
        "model": "kokoro",  # Not used but required for compatibility
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio (check the status first so an error body is not written to disk)
response.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(response.content)
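
The request body above can also be assembled by a small helper that rejects unsupported formats before anything is sent. This is a minimal sketch, not part of the project: `build_speech_request` and `SUPPORTED_FORMATS` are hypothetical names, and the format list is copied from the comment in the example above.

```python
# Hypothetical helper (not part of this repository): builds the JSON body
# for POST /audio/speech and validates it locally.

# Formats taken from the example above; aac and pcm are listed as not implemented.
SUPPORTED_FORMATS = {"mp3", "wav", "opus", "flac"}


def build_speech_request(text, voice="af_bella", response_format="mp3", speed=1.0):
    """Return a request body for the speech endpoint, or raise ValueError."""
    if not text.strip():
        raise ValueError("input text must not be empty")
    fmt = response_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return {
        "model": "kokoro",  # Not used but required for compatibility
        "input": text,
        "voice": voice,
        "response_format": fmt,
        "speed": speed,
    }
```

The returned dictionary can be passed as `json=build_speech_request("Hello world!")` to the `requests.post` call shown above.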

Using OpenAI's Python library:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")

response = client.audio.speech.create(
    model="kokoro",  # Not used but required for compatibility, also accepts library defaults
    voice="af_bella",
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Performance Benchmarks

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

Windows 11 Home w/ WSL2
NVIDIA 4060Ti 16 GB GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5 GHz
64 GB RAM
WAV native output
H.G. Wells - The Time Machine (full text)
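
For reference, the realtime factor can be computed as in the sketch below. It assumes the factor is the duration of the generated audio divided by the time spent generating it, so values above 1 mean faster than realtime. The function name is hypothetical and the numbers are illustrative only, not measurements from this benchmark.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio produced per second of processing (assumed definition)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Illustrative numbers only: 90 minutes of audio generated in 2 minutes.
print(realtime_factor(90 * 60, 2 * 60))  # 45.0
```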

[Figures: Processing Time; Realtime Factor]

Key Performance Metrics:

Realtime Factor: Ranges between 35x and 49x (output audio length relative to generation time)
Average Processing Rate: 137.67 tokens/second
Efficient Scaling: Maintains performance with long texts through automatic chunking
Natural Boundary Detection: Automatically splits and stitches at sentence boundaries to prevent artifacts
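
Splitting at sentence boundaries can be sketched as follows. This is an illustration of the general idea, not the project's actual chunking code: `chunk_text`, the `max_chars` limit, and the punctuation-based regular expression are all assumptions made for the example.

```python
import re

# Hypothetical helper; the repository's real chunking logic may differ.
# A sentence boundary is assumed to be whitespace following ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments concatenated in order. Because every chunk ends on a sentence boundary, the joins fall in natural pauses rather than mid-word.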

Features

OpenAI-compatible API endpoints
Multiple audio formats: mp3, wav, opus, flac (aac and pcm not implemented)
Automatic text chunking and audio stitching
GPU-accelerated inference

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0