# Changelog
Notable changes to this project will be documented in this file.
## [v0.2.0post1] - 2025-02-07
- Fixed: build Kokoro from source, with adjustments to avoid a CUDA lock
- Fixed ARM64 compatibility for the Spacy dependency to avoid emulation slowdown
- Added g++ for Japanese language support
- Temporarily disabled Vietnamese language support due to ARM64 compatibility issues
## [v0.2.0-pre] - 2025-02-06
### Added
- Complete Model Overhaul:
  - Upgraded to Kokoro v1.0 model architecture
  - Pre-installed multi-language support from Misaki:
    - English (en), Japanese (ja), Korean (ko), Chinese (zh), Vietnamese (vi)
  - All voice packs included for supported languages, along with the original versions
- Enhanced Audio Generation Features:
  - Per-word timestamped caption generation
  - Phoneme-based audio generation capabilities
  - Detailed phoneme generation
- Web UI Improvements:
  - Improved voice mixing with weighted combinations
  - Text file upload support
  - Enhanced formatting and user interface
  - Cleaner UI (in progress)
- Integration with the https://github.com/hexgrad/kokoro and https://github.com/hexgrad/misaki packages
### Removed
- Deprecated support for Kokoro v0.19 model
### Changed
- The Combine Voices endpoint now returns a .pt voice file; combined voices are otherwise generated on the fly per request
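As a usage note, a minimal sketch of saving the returned voice pack, assuming the server runs on the default port 8880 and exposes the combine route at /v1/audio/voices/combine accepting a JSON list of voice names (path and payload shape are assumptions; check the API docs for the exact contract):

```python
# Hedged sketch: download a combined voice pack as a .pt file.
# The route path, payload shape, and port are assumptions, not confirmed API details.
import requests

resp = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",  # assumed route
    json=["af_bella", "af_sky"],                      # assumed payload: voices to merge
    timeout=60,
)
resp.raise_for_status()

# Per this release, the endpoint returns the combined voice tensor as a .pt file
with open("af_bella+af_sky.pt", "wb") as f:
    f.write(resp.content)
```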
## [v0.1.4] - 2025-01-30
### Added
- Smart Chunking System:
  - New text_processor with smart_split for improved sentence boundary detection
  - Dynamically adjusts chunk sizes based on sentence structure, using phoneme/token information in an initial pass
  - Should avoid ever exceeding the 510 phoneme/token limit per chunk, while preserving natural cadence
- Web UI Added (to replace Gradio):
  - Integrated streaming with tempfile generation
  - Download links available in the X-Download-Path header
  - Configurable cleanup triggers for temp files
- Debug Endpoints (see the example after this list):
  - /debug/threads for thread information and stack traces
  - /debug/storage for temp file and output directory monitoring
  - /debug/system for system resource information
  - /debug/session_pools for ONNX/CUDA session status
- Automated Model Management:
  - Auto-download from the releases page
  - Included download scripts for manual installation
  - Pre-packaged voice models in the repository
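The debug endpoints above are plain GET routes, so they can be polled from a small monitoring script; a minimal sketch, assuming the API is reachable at the default localhost:8880:

```python
# Poll the debug endpoints listed above and print their JSON payloads.
# Only the host/port are assumptions; the paths come from this changelog entry.
import requests

BASE_URL = "http://localhost:8880"
DEBUG_PATHS = (
    "/debug/threads",        # thread information and stack traces
    "/debug/storage",        # temp file and output directory monitoring
    "/debug/system",         # system resource information
    "/debug/session_pools",  # ONNX/CUDA session status
)

for path in DEBUG_PATHS:
    resp = requests.get(BASE_URL + path, timeout=10)
    print(path, resp.status_code)
    if resp.ok:
        print(resp.json())
```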
### Changed
- Significant architectural improvements:
  - Multi-model architecture support
  - Enhanced concurrency handling
  - Improved streaming header management
  - Better resource/session pool management
## [v0.1.2] - 2025-01-23
### Structural Improvements
- Models can be manually downloaded and placed in api/src/models, or fetched with the included script
- TTSGPU/TTSCPU/STTSService classes replaced with a ModelManager service
  - Supports CPU and GPU backends for both ONNX and PyTorch (note: only PyTorch GPU and ONNX CPU/GPU have been tested)
  - Should make it easier to adopt new models or architectures in a more modular way as they become available
- Converted a number of internal processes to async handling to improve concurrency
- Improved separation of concerns toward a plug-in, modular structure, making PRs and new features easier
### Web UI (test release)
- A simple integrated web UI has been added directly to the FastAPI server
  - Can be disabled via core/config.py or ENV variables if desired
  - Simplifies deployment, utility testing, aesthetics, etc.
- Looking to deprecate, collaborate on, or hand off the Gradio UI
## [v0.1.0] - 2025-01-13
### Changed
- Major Docker improvements:
  - Baked model directly into Dockerfile for improved deployment reliability
  - Switched to uv for dependency management
  - Streamlined container builds and reduced image sizes
- Dependency Management:
  - Migrated from pip/poetry to uv for faster, more reliable package management
  - Added uv.lock for deterministic builds
  - Updated dependency resolution strategy
## [v0.0.5post1] - 2025-01-11
### Fixed
- Docker image tagging and versioning improvements (-gpu, -cpu, -ui)
- Minor VRAM management improvements
- Fixed a Gradio bug that caused crashes and errant warnings
- Updated GPU and UI container configurations
## [v0.0.5] - 2025-01-10
### Fixed
- Stabilized image tagging and structure issues from v0.0.4
- Added automatic master-to-develop branch synchronization
- Improved release tagging and structures
- Initial CI/CD setup
## 2025-01-04
### Added
- ONNX Support:
  - Added single-batch ONNX support for CPU inference
  - Roughly 0.4 RTF (2.4x real-time speed)
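For context, RTF (real-time factor) here is synthesis time divided by the duration of the generated audio, so values below 1.0 mean audio is produced faster than real time. A small illustrative helper for measuring it; the `synthesize` callable and the 24 kHz sample rate are placeholders, not the project's API:

```python
# Illustrative RTF measurement: RTF = wall-clock synthesis time / audio duration.
# `synthesize` and the sample rate are placeholders for whatever engine is being timed.
import time

def measure_rtf(synthesize, text: str, sample_rate: int = 24000) -> float:
    start = time.perf_counter()
    samples = synthesize(text)                 # returns raw audio samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds             # e.g. ~0.4 for the CPU ONNX path above
```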
### Modified
- Code Refactoring:
  - Work on modularizing phonemizer and tokenizer into separate services
  - Incorporated these services into a dev endpoint
- Testing and Benchmarking:
  - Cleaned up benchmarking scripts
  - Cleaned up test scripts
  - Added auto-WAV validation scripts
## 2025-01-02
- Audio Format Support:
  - Added comprehensive audio format conversion support (mp3, wav, opus, flac)
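A minimal sketch of selecting a format per request, assuming the OpenAI-compatible /v1/audio/speech route on the default port 8880 (route, port, model name, and field names are assumptions; check the API docs for the exact schema):

```python
# Hedged sketch: request the same text in each supported output format.
# Route, port, model name, and voice are assumptions, not confirmed values.
import requests

for fmt in ("mp3", "wav", "opus", "flac"):
    resp = requests.post(
        "http://localhost:8880/v1/audio/speech",     # assumed OpenAI-compatible route
        json={
            "model": "kokoro",                       # assumed model name
            "input": "Hello from Kokoro.",
            "voice": "af",                           # default voice per later entries
            "response_format": fmt,
        },
        timeout=120,
    )
    resp.raise_for_status()
    with open(f"sample.{fmt}", "wb") as f:
        f.write(resp.content)
```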
## 2025-01-01
### Added
- Gradio Web Interface:
  - Added a simple web UI utility for audio generation from text input or a txt file
### Modified
#### Configuration Changes
- Updated Docker configurations:
  - Changes to `Dockerfile`:
    - Improved layer caching by separating dependency and code layers
  - Updates to `docker-compose.yml` and `docker-compose.cpu.yml`:
    - Removed commit lock from model fetching to allow automatic model updates from HF
    - Added git index lock cleanup
#### API Changes
- Modified `api/src/main.py`
- Updated TTS service implementation in `api/src/services/tts.py`:
  - Added device management for better resource control:
    - Voices are now copied from the model repository to the api/src/voices directory for persistence
  - Refactored voice pack handling:
    - Removed the static voice pack dictionary
    - Voices are now loaded on demand from disk
  - Added model warm-up functionality:
    - Model now initializes with a dummy text generation
    - Uses the default voice (af.pt) for warm-up
    - Model is ready for inference on the first request
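A minimal sketch of what that warm-up amounts to; the `tts_service` object and its method names below are illustrative placeholders, not the actual module API:

```python
# Illustrative warm-up sketch: run one throwaway generation at startup so the
# first real request does not pay model-initialization cost.
# `load_voice` and `generate` are placeholder names, not the project's actual API.
DEFAULT_VOICE = "af.pt"                     # default warm-up voice, per the changelog
WARMUP_TEXT = "This is a warm-up request."

def warm_up(tts_service) -> None:
    voice = tts_service.load_voice(DEFAULT_VOICE)   # loaded on demand from api/src/voices
    tts_service.generate(WARMUP_TEXT, voice)        # output is discarded
```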