mirror of https://github.com/remsky/Kokoro-FastAPI.git synced 2025-04-13 09:39:17 +00:00

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching

fastapi huggingface-spaces kokoro kokoro-tts onnx onnxruntime openai-compatible-api openwebui pytorch sillytavern tts tts-api uv


remsky 0e9f77fc79 WIP: open ai compatible streaming		2025-01-04 17:55:36 -07:00
.github/workflows	ci: update docker workflow to only build on releases	2025-01-04 02:50:45 -07:00
api	WIP: open ai compatible streaming	2025-01-04 17:55:36 -07:00
examples	WIP: open ai compatible streaming	2025-01-04 17:55:36 -07:00
ui	Ruff Check + Format	2025-01-01 21:50:41 -07:00
.coverage	WIP: open ai compatible streaming	2025-01-04 17:55:36 -07:00
.coveragerc	Allow ONNX support optimizations for CPU inference and update benchmarking scripts; modify README for clarity on performance metrics	2025-01-04 02:46:27 -07:00
.dockerignore	-Removed commit lock on HF repo	2025-01-01 17:38:22 -07:00
.gitignore	First streaming attempt	2025-01-04 17:54:54 -07:00
.ruff.toml	Refactor TTS API and enhance testing setup with coverage and logging improvements	2024-12-31 02:55:51 -07:00
CHANGELOG.md	Allow ONNX support optimizations for CPU inference and update benchmarking scripts; modify README for clarity on performance metrics	2025-01-04 02:46:27 -07:00
docker-compose.cpu.yml	Allow ONNX support optimizations for CPU inference and update benchmarking scripts; modify README for clarity on performance metrics	2025-01-04 02:46:27 -07:00
docker-compose.yml	WIP: open ai compatible streaming	2025-01-04 17:55:36 -07:00
Dockerfile	-Removed commit lock on HF repo	2025-01-01 17:38:22 -07:00
Dockerfile.cpu	WIP, Functional for CPU: Updated for ONNX runtime support, Dockerfile and TTS Service	2025-01-03 00:53:41 -07:00
githubbanner.png	Update README with performance benchmarks and usage examples; add benchmark plotting script	2024-12-30 04:53:29 -07:00
pytest.ini	Add Gradio web interface + tests	2025-01-01 21:50:00 -07:00
README.md	Allow ONNX support optimizations for CPU inference and update benchmarking scripts; modify README for clarity on performance metrics	2025-01-04 02:46:27 -07:00
requirements-test.txt	Ruff Check + Format	2025-01-01 21:50:41 -07:00
requirements.txt	Enhance TTS API with logging, voice pack loading, and schema updates	2024-12-31 01:57:00 -07:00

README.md


Kokoro TTS API

Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model

OpenAI-compatible Speech endpoint, with voice combination functionality
NVIDIA GPU-accelerated inference, with a CPU option
Very fast generation (~35x realtime factor on a 4060Ti)
Automatic chunking/stitching for long texts
Simple web UI utility for audio generation

Quick Start

The service can be accessed through either the API endpoints or the Gradio web interface.

Install prerequisites:

Install Docker Desktop + Git

Clone and start the service:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
docker compose up --build

Run locally as an OpenAI-Compatible Speech Endpoint

from openai import OpenAI

# The OpenAI client appends /audio/speech to base_url, so include the /v1 prefix
client = OpenAI(
    base_url="http://localhost:8880/v1",
    api_key="not-needed",
)

response = client.audio.speech.create(
    model="kokoro", 
    voice="af_bella",
    input="Hello world!",
    response_format="mp3"
)
response.stream_to_file("output.mp3")

or visit http://localhost:7860


Features

OpenAI-Compatible Speech Endpoint

# Using OpenAI's Python library
from openai import OpenAI

# base_url includes /v1 because the client appends /audio/speech to it
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  # Not used but required for compatibility; library defaults are also accepted
    voice="af_bella",
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Or Via Requests:

import requests

# List the available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",  # Not used but required for compatibility
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

Quick tests (run from another terminal):

python examples/test_openai_tts.py # Test OpenAI Compatibility
python examples/test_all_voices.py # Test all available voices

Voice Combination

Averages model weights of any existing voicepacks
Saves generated voicepacks for future use
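
As a rough illustration of what averaging voicepacks means, the sketch below takes the element-wise mean of equally shaped weight vectors. It is not the repository's code: plain Python lists stand in for the real voicepack tensors, and the function name `average_voicepacks` is hypothetical.

```python
def average_voicepacks(voicepacks: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized weight vectors (illustrative only)."""
    if not voicepacks:
        raise ValueError("need at least one voicepack")
    length = len(voicepacks[0])
    if any(len(pack) != length for pack in voicepacks):
        raise ValueError("voicepacks must all have the same shape")
    # zip(*voicepacks) groups the i-th weight of every pack together
    return [sum(values) / len(voicepacks) for values in zip(*voicepacks)]


# Two toy "voices" with two weights each
combined = average_voicepacks([[0.0, 2.0], [2.0, 4.0]])
print(combined)  # [1.0, 3.0]
```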

Combine voices and generate audio:

import requests

# List the available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Create combined voice (saves locally on server)
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json=[voices[0], voices[1]]
)
combined_voice = response.json()["voice"]

# Generate audio with combined voice
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": combined_voice,
        "response_format": "mp3"
    }
)

[Figure: Voice Analysis Comparison]

Multiple Output Audio Formats

mp3
wav
opus
flac
aac
pcm
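
A small client-side helper can catch an unsupported format before any request is sent. This is a sketch under assumptions: it treats the six formats listed above as accepted by the server, reuses the request fields from the earlier examples, and the helper names `build_speech_request` and `output_filename` are hypothetical.

```python
# Assumption: the server accepts exactly the formats listed above
SUPPORTED_FORMATS = {"mp3", "wav", "opus", "flac", "aac", "pcm"}


def build_speech_request(text: str, voice: str,
                         response_format: str = "mp3",
                         speed: float = 1.0) -> dict:
    """Build the JSON body for POST /v1/audio/speech, validating the format."""
    fmt = response_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    return {
        "model": "kokoro",  # not used by the server, kept for compatibility
        "input": text,
        "voice": voice,
        "response_format": fmt,
        "speed": speed,
    }


def output_filename(stem: str, response_format: str) -> str:
    """Name the output file after the requested format."""
    return f"{stem}.{response_format.lower()}"


payload = build_speech_request("Hello world!", "af_bella", "WAV")
print(payload["response_format"], output_filename("output", "WAV"))
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the examples above.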

[Figure: Audio Format Comparison]

Gradio Web Utility

Access the interactive web UI at http://localhost:7860 after starting the service. Features include:

Voice/format/speed selection
Audio playback and download
Text file or direct input

If you only want the API, comment out the gradio-ui service (the gradio-ui entry and everything under it) in docker-compose.yml

Voices created via the API are currently available in the web UI, but creating or combining voices from the UI itself is not yet supported

Processing Details

Performance Benchmarks

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

Windows 11 Home w/ WSL2
NVIDIA 4060Ti 16 GB GPU @ CUDA 12.1
11th Gen i7-11700 @ 2.5GHz
64 GB RAM
WAV native output
H.G. Wells - The Time Machine (full text)

[Figures: Processing Time, Realtime Factor]

Key Performance Metrics:

Realtime Factor: Ranges between 35-49x (output audio length divided by generation time)
Average Processing Rate: 137.67 tokens/second (cl100k_base)
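
Reading the realtime factor as output audio length divided by generation time, the arithmetic can be worked through as below. The function names are illustrative, and the 1.5-hour figure is the approximate output length quoted above, not a measured value.

```python
def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of processing."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the ratio: how long a given amount of audio takes to generate."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


audio = 1.5 * 3600  # ~1.5 hours of output, in seconds
for rtf in (35, 49):
    print(f"{rtf}x -> {estimated_generation_seconds(audio, rtf):.0f} s")
```

At the reported 35-49x range, roughly 1.5 hours of audio would take about 154 seconds at the low end and about 110 seconds at the high end.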

GPU vs. CPU

# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x realtime speed)
docker compose up --build

# CPU: ONNX optimized inference (~2.4x realtime speed)
docker compose -f docker-compose.cpu.yml up --build
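
If the choice between the two commands above needs to be scripted, something like the following could work. It is a sketch only: checking for `nvidia-smi` on the PATH is an assumed heuristic, and it does not confirm CUDA 12.1 support or that Docker can actually reach the GPU. Both function names are hypothetical.

```python
import shutil


def gpu_tooling_present() -> bool:
    """Heuristic only: True if the nvidia-smi binary is on PATH."""
    return shutil.which("nvidia-smi") is not None


def compose_command(use_gpu: bool) -> list[str]:
    """Return the docker compose invocation for the GPU or CPU setup."""
    cmd = ["docker", "compose"]
    if not use_gpu:
        # The CPU (ONNX) setup uses its own compose file
        cmd += ["-f", "docker-compose.cpu.yml"]
    return cmd + ["up", "--build"]


print(" ".join(compose_command(gpu_tooling_present())))
```

The returned list can be handed to `subprocess.run` from the repository root.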

Natural Boundary Detection

Automatically splits text at sentence boundaries and stitches the generated audio back together
Helps reduce artifacts and enables long-form processing, since the base model is currently configured for only about 30s of output
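
The split-and-stitch idea can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the punctuation regex, the `max_chars` budget, and the names `split_sentences`, `group_chunks`, and `stitch` are all assumptions, and plain lists of floats stand in for real audio buffers.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (a simplistic rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def group_chunks(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(segments: list[list[float]]) -> list[float]:
    """Concatenate per-chunk audio samples in their original order."""
    return [sample for segment in segments for sample in segment]


sentences = split_sentences("Hello world! How are you? Fine.")
print(group_chunks(sentences, max_chars=26))
# ['Hello world! How are you?', 'Fine.']
```

Each chunk would be synthesized separately and the resulting audio passed to `stitch`, so no cut falls in the middle of a sentence.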

Model and License

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:

The Kokoro model weights are licensed under Apache 2.0 (see model page)
The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0