# Kokoro TTS API
FastAPI wrapper for the Kokoro-82M text-to-speech model.
Dockerized with NVIDIA GPU support, simple request queueing via SQLite, and automatic chunking/stitching of long inputs and outputs.
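The chunking/stitching step keeps long inputs manageable: text is split into smaller pieces, each piece is synthesized, and the resulting audio is concatenated. Below is a minimal sketch of sentence-based chunking; the `max_chars` budget and splitting rule are illustrative assumptions, not the API's actual implementation.

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries so each chunk stays under max_chars.

    Illustrative only: the real API's chunking rules may differ.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```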
## Quick Start
```bash
# Start the API (will automatically clone source HF repo via git-lfs)
docker compose up --build
```
Test it out:
```bash
# From host terminal
python examples/test_tts.py "Hello world" --voice af_bella
```
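You can also hit the API directly. The sketch below lists the available voices with `requests`; it assumes the service is published on `http://localhost:8000` (check docker-compose.yml for the actual port mapping) and that the voices endpoint returns JSON.

```python
import requests

# Assumed base URL; adjust to whatever port docker-compose.yml publishes.
BASE_URL = "http://localhost:8000"

resp = requests.get(f"{BASE_URL}/tts/voices", timeout=10)
resp.raise_for_status()
print(resp.json())  # expected: the available voice names, e.g. "af_bella"
```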
## Performance Benchmarks
Benchmarking covered generation through the local API only (file transfers excluded), using text lengths up to roughly 10 minutes of output audio and measuring processing time, token count, and output audio length. The test text consisted of randomized chunks from H.G. Wells' *The Time Machine*. Tests were run on:
- Windows 11 Home with WSL2
- NVIDIA RTX 4060 Ti 16GB GPU, CUDA 12.1
- 11th Gen Intel i7-11700 @ 2.5GHz
- 64GB RAM

Results:
- Average processing speed: ~3.4 seconds per minute of audio output (computed as in the sketch below)
- Token throughput: ~0.01 seconds per token
- Scales well with longer texts; performance stays consistent
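The headline figures are simple ratios of wall-clock processing time to output size. The snippet below shows that arithmetic with placeholder numbers; the measured values come from the benchmark runs, not from this code.

```python
# Illustrative arithmetic only; these numbers are placeholders, not benchmark data.
processing_seconds = 20.4   # wall-clock time spent generating
audio_seconds = 6 * 60      # duration of the generated audio
token_count = 2040          # tokens processed for this request

seconds_per_audio_minute = processing_seconds / (audio_seconds / 60)
seconds_per_token = processing_seconds / token_count

print(f"{seconds_per_audio_minute:.1f} s per minute of audio")  # ~3.4
print(f"{seconds_per_token:.3f} s per token")                   # ~0.010
```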
## API Endpoints
```
GET  /tts/voices             # List available voices
POST /tts                    # Generate speech
GET  /tts/{request_id}       # Check generation status
GET  /tts/file/{request_id}  # Download audio file
```
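A typical round trip is: submit text with `POST /tts`, poll the status endpoint until generation finishes, then download the file. The sketch below walks through that flow with `requests`. The base URL and the JSON field names (`text`, `voice`, `request_id`, `status`) are assumptions about the wrapper's schema; see `examples/test_tts.py` for the exact request and response shapes.

```python
import time
import requests

# Assumed base URL; adjust to the port published in docker-compose.yml.
BASE_URL = "http://localhost:8000"

# 1. Submit a generation request (field names are assumed).
submit = requests.post(f"{BASE_URL}/tts", json={"text": "Hello world", "voice": "af_bella"})
submit.raise_for_status()
request_id = submit.json()["request_id"]

# 2. Poll the status endpoint until the request finishes.
while True:
    status = requests.get(f"{BASE_URL}/tts/{request_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(1)

# 3. Download the audio once generation has completed.
if status.get("status") == "completed":
    audio = requests.get(f"{BASE_URL}/tts/file/{request_id}")
    audio.raise_for_status()
    with open("output.wav", "wb") as f:
        f.write(audio.content)
```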
## Example Usage
List available voices:
```bash
python examples/test_tts.py
```
Generate speech:
```bash
# Default voice
python examples/test_tts.py "Your text here"

# Specific voice
python examples/test_tts.py --voice af_bella "Your text here"

# Get file path without downloading
python examples/test_tts.py --no-download "Your text here"
```
Generated files are saved in:
- With download: `examples/output/`
- Without download: `src/output/` (in the API container)
## Requirements
- Docker
- NVIDIA GPU + CUDA
- nvidia-container-toolkit installed on the host
## Model
This API uses the Kokoro-82M model from Hugging Face.
Visit the model page for more details about training, architecture, and capabilities. I am not affiliated with the model's authors; this wrapper was produced for ease of use and personal projects.
## License
This project is licensed under the Apache License 2.0 - see below for details:
- The Kokoro model weights are licensed under Apache 2.0 (see model page)
- The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
- The inference code adapted from StyleTTS2 is MIT licensed
The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0