Mirror of https://github.com/remsky/Kokoro-FastAPI.git, synced 2025-04-13 09:39:17 +00:00
Merge branch 'master' of https://github.com/remsky/Kokoro-FastAPI
This commit is contained in: commit 6134802d2c
1 changed file with 10 additions and 8 deletions
README.md (16)
@@ -7,16 +7,15 @@
[](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)
[](https://huggingface.co/hexgrad/Kokoro-82M/commit/9901c2b79161b6e898b7ea857ae5298f47b8b0d6)
Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model

- Multi-language support (English, Japanese, Korean, Chinese, Vietnamese)
- OpenAI-compatible Speech endpoint, NVIDIA GPU accelerated or CPU inference with PyTorch
- ONNX support coming soon; see v0.1.5 and earlier for legacy ONNX support in the interim
- Debug endpoints for monitoring threads, storage, and session pools
- Integrated web UI on localhost:8880/web
- Debug endpoints for monitoring system stats, integrated web UI on localhost:8880/web
- Phoneme-based audio generation, phoneme generation
- (new) Per-word timestamped caption generation
- (new) Voice mixing with weighted combinations

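The feature list above advertises an OpenAI-compatible Speech endpoint and weighted voice mixing, and the diff context below shows the OpenAI Python client being used against it. A minimal sketch of building such a request follows; the voice names (`af_bella`, `af_sky`), the `+` mixing syntax, and the `kokoro` model name are illustrative assumptions, not guarantees — check the running server's voice listing. Only the port (8880) comes from the README itself.

```python
# Build the JSON body for the OpenAI-compatible POST /v1/audio/speech route.
# Voice names and the "+"-mixing syntax below are assumptions for illustration.
def speech_request(text: str, voice: str = "af_bella", fmt: str = "mp3") -> dict:
    return {"model": "kokoro", "input": text, "voice": voice, "response_format": fmt}

# Sending it would require a running server (not done here):
# import requests
# body = speech_request("Hello!", voice="af_bella+af_sky")  # hypothetical weighted mix
# audio = requests.post("http://localhost:8880/v1/audio/speech", json=body).content
```

Because the endpoint mirrors OpenAI's schema, the official `openai` Python client can also be pointed at `http://localhost:8880/v1`, as the streaming example in the diff below does.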
@@ -113,8 +112,8 @@ with client.audio.speech.with_streaming_response.create(
- Web Interface: http://localhost:8880/web

<div align="center" style="display: flex; justify-content: center; gap: 10px;">
    <img src="assets/docs-screenshot.png" width="40%" alt="API Documentation" style="border: 2px solid #333; padding: 10px;">
    <img src="assets/webui-screenshot.png" width="49%" alt="Web UI Screenshot" style="border: 2px solid #333; padding: 10px;">
    <img src="assets/docs-screenshot.png" width="42%" alt="API Documentation" style="border: 2px solid #333; padding: 10px;">
    <img src="assets/webui-screenshot.png" width="42%" alt="Web UI Screenshot" style="border: 2px solid #333; padding: 10px;">
</div>

</details>
@@ -357,6 +356,9 @@ docker compose up --build
- Automatically splits and stitches at sentence boundaries
- Helps reduce artifacts and allows long-form processing, as the base model is currently configured for only about 30s of output
The model can process up to 510 phonemized tokens per chunk; however, maxed-out chunks often produce 'rushed' speech or other artifacts. The server therefore applies an additional chunking layer that builds flexible chunks governed by `TARGET_MIN_TOKENS`, `TARGET_MAX_TOKENS`, and `ABSOLUTE_MAX_TOKENS`, configurable via environment variables and set to 175, 250, and 450 by default.
</details>
<details>