mirror of https://github.com/remsky/Kokoro-FastAPI.git synced 2025-04-13 09:39:17 +00:00

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching

fastapi huggingface-spaces kokoro kokoro-tts onnx onnxruntime openai-compatible-api openwebui pytorch sillytavern tts tts-api uv

remsky 0fb36bb1b2 fix: update benchmark results for processing time and output length		2024-12-30 06:16:55 -07:00
api/src	feat: enabled support for stitching long outputs in TTS requests	2024-12-30 06:16:18 -07:00
examples	fix: update benchmark results for processing time and output length	2024-12-30 06:16:55 -07:00
.gitignore	Add initial implementation of Kokoro TTS API with Docker GPU support	2024-12-30 04:17:50 -07:00
docker-compose.yml	Update Dockerfile and docker-compose.yml to add versions, specify Kokoro commit	2024-12-30 05:29:35 -07:00
Dockerfile	fix: finalize pytorch version lock, git lfs	2024-12-30 06:09:17 -07:00
githubbanner.png	Update README with performance benchmarks and usage examples; add benchmark plotting script	2024-12-30 04:53:29 -07:00
README.md	Merge branch 'master' of https://github.com/remsky/Kokoro-FastAPI	2024-12-30 05:29:45 -07:00
requirements.txt	Update Dockerfile and docker-compose.yml to add versions, specify Kokoro commit	2024-12-30 05:29:35 -07:00

README.md

Kokoro TTS API

FastAPI wrapper for Kokoro-82M text-to-speech model.

Dockerized with NVIDIA GPU support, simple queue handling via SQLite, and automatic chunking/stitching of lengthy inputs and outputs
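To illustrate what chunking and stitching can look like, here is a minimal sketch. It is not the code used in this repo: `split_into_chunks`, `stitch`, and the `max_chars` limit are hypothetical names chosen for the example, and audio is represented as plain lists of samples.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Assumption: sentences end with '.', '!' or '?'. A single sentence longer
    than max_chars is kept whole as its own (oversized) chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(segments: list[list[float]]) -> list[float]:
    """Concatenate per-chunk audio samples in their original order."""
    stitched: list[float] = []
    for segment in segments:
        stitched.extend(segment)
    return stitched
```

Each chunk would be synthesized separately and the resulting audio segments passed to `stitch` in order. Splitting on sentence boundaries is a design choice for the sketch, intended to keep joins at natural pauses.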

Quick Start

# Start the API (will automatically clone source HF repo via git-lfs)
docker compose up --build

Test it out:

# From host terminal
python examples/test_tts.py "Hello world" --voice af_bella

Performance Benchmarks

Benchmarking was performed solely on generation via the API (no download), using text lengths from 100 to ~10,000 characters and measuring processing time, token count, and output audio length. Tests were run on:

NVIDIA 4060Ti 16 GB GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5GHz
64 GB RAM
Randomized chunks from H.G. Wells - The Time Machine

Figures: Processing Time vs Output Length; Processing Time vs Token Count

Average processing speed: ~3.4 seconds per minute of audio output
Efficient token processing: ~0.01 seconds per token
Scales well with longer texts while maintaining consistent performance
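As a rough worked example of what these averages imply, the sketch below turns them into estimates. The function names are made up for illustration, and the numbers only apply to hardware comparable to the benchmark machine.

```python
# Averages reported in the benchmarks above.
SECONDS_PER_AUDIO_MINUTE = 3.4
SECONDS_PER_TOKEN = 0.01


def estimate_seconds_from_audio(audio_minutes: float) -> float:
    """Estimated processing time for a given length of output audio."""
    return audio_minutes * SECONDS_PER_AUDIO_MINUTE


def estimate_seconds_from_tokens(token_count: int) -> float:
    """Estimated processing time for a given number of input tokens."""
    return token_count * SECONDS_PER_TOKEN


def realtime_factor() -> float:
    """Seconds of audio produced per second of processing."""
    return 60.0 / SECONDS_PER_AUDIO_MINUTE
```

For instance, `estimate_seconds_from_audio(10)` gives about 34 seconds for ten minutes of audio, and `realtime_factor()` works out to roughly 17.6x faster than real time.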

API Endpoints

GET /tts/voices           # List available voices
POST /tts                 # Generate speech
GET /tts/{request_id}     # Check generation status
GET /tts/file/{request_id} # Download audio file
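The endpoints above suggest a submit, poll, download flow. The following is a sketch of a client for that flow using only the standard library. Several details are assumptions not confirmed by this README: the base URL and port, the JSON field names (`text`, `voice`, `request_id`, `status`), and the status values `completed` and `failed`. Check `examples/test_tts.py` for the actual request and response shapes.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def status_url(request_id, base=BASE_URL):
    return f"{base}/tts/{request_id}"


def file_url(request_id, base=BASE_URL):
    return f"{base}/tts/file/{request_id}"


def http_json(method, url, payload=None):
    """Send an optional JSON body and decode a JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def submit(text, voice, fetch=http_json, base=BASE_URL):
    """POST /tts. The 'text' and 'voice' field names are assumptions."""
    return fetch("POST", f"{base}/tts", {"text": text, "voice": voice})


def wait_for_completion(request_id, fetch=http_json, interval=1.0, max_polls=60):
    """Poll GET /tts/{request_id} until an (assumed) terminal status appears."""
    for _ in range(max_polls):
        info = fetch("GET", status_url(request_id))
        if info.get("status") in ("completed", "failed"):
            return info
        time.sleep(interval)
    raise TimeoutError(f"request {request_id} did not finish in time")
```

A typical call sequence would be `submit(...)`, then `wait_for_completion(...)` with the returned id, then a download from `file_url(...)`. The `fetch` parameter exists so the polling logic can be exercised without a running server.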

Example Usage

List available voices:

python examples/test_tts.py

Generate speech:

# Default voice
python examples/test_tts.py "Your text here"

# Specific voice
python examples/test_tts.py --voice af_bella "Your text here"

# Get file path without downloading
python examples/test_tts.py --no-download "Your text here"

Generated files are saved in:

With download: examples/output/
Without download: src/output/ (in API container)

Requirements

Docker
NVIDIA GPU + CUDA
nvidia-container-toolkit installed on host
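Before running `docker compose up --build`, a quick preflight check can confirm the expected tools are on the host's PATH. This is only a sketch: `check_host_tools` is a hypothetical helper, and finding a binary does not prove that the driver, CUDA, or the container runtime is configured correctly.

```python
import shutil

# Command-line tools that normally accompany the requirements above.
REQUIRED_TOOLS = {
    "docker": "Docker CLI",
    "nvidia-smi": "NVIDIA driver utility",
    "nvidia-ctk": "NVIDIA Container Toolkit CLI",
}


def check_host_tools(which=shutil.which):
    """Return {tool_name: True/False} depending on whether it is on PATH."""
    return {tool: which(tool) is not None for tool in REQUIRED_TOOLS}
```

Calling `check_host_tools()` and printing any entries that are `False` gives a short list of what to install first.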

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0