# Kokoro v1.0 Technical Integration Notes

## Core Components

- KModel Class
  - Main model class with unified interface
  - Handles both weights and inference
  - Language-blind design (phoneme focused)
  - No external language processing
- Key Architecture Changes
  - Uses CustomAlbert instead of PLBert
  - New ProsodyPredictor implementation
  - Different phoneme handling approach
  - Built-in vocab management
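As a rough illustration of the built-in vocab management, phonemes can be mapped straight to token ids from the config's `vocab` table. The helper below is a sketch under that assumption, not the actual KModel method:

```python
# Sketch only: phoneme string -> token ids via the built-in vocab table.
# The function name and the lenient skip of unknown symbols are assumptions,
# not the actual KModel implementation.
def phonemes_to_ids(vocab: dict, phonemes: str) -> list:
    return [vocab[p] for p in phonemes if p in vocab]


vocab = {"h": 1, "o": 2}               # stand-in for config["vocab"]
print(phonemes_to_ids(vocab, "ho"))    # [1, 2]
```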
## Integration Points

- Model Loading

  ```python
  # v1.0 approach
  model = KModel(config_path, model_path)

  # vs our current
  model = await build_model(path, device)
  ```

- Forward Pass Differences

  ```python
  # v1.0
  audio = model(phonemes, ref_s, speed=1.0)

  # vs our current
  audio = model.decoder(asr, F0_pred, N_pred, ref_s)
  ```
- Key Dependencies
  - transformers (for AlbertConfig)
  - torch
  - No external phoneme processing
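For context on the transformers dependency, the sketch below shows how an Albert encoder can be built from an AlbertConfig. The CustomAlbert shown here, a thin wrapper that returns token-level hidden states, is an assumption about its role rather than the ported implementation:

```python
# Sketch only: building an Albert encoder from config values via transformers.
# The CustomAlbert behaviour (returning last_hidden_state) is an assumption,
# not the ported code; the config numbers are dummy values for illustration.
import torch
from transformers import AlbertConfig, AlbertModel


class CustomAlbert(AlbertModel):
    def forward(self, *args, **kwargs):
        outputs = super().forward(*args, **kwargs)
        return outputs.last_hidden_state


albert = CustomAlbert(AlbertConfig(vocab_size=64, hidden_size=128,
                                   num_attention_heads=2, num_hidden_layers=1,
                                   intermediate_size=256))
hidden = albert(input_ids=torch.zeros(1, 8, dtype=torch.long))
print(hidden.shape)  # torch.Size([1, 8, 128])
```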
## Configuration Changes

- v1.0 Config Structure

  ```
  {
      "vocab": {...},      # Built-in phoneme mapping
      "n_token": X,
      "plbert": {...},     # Albert config
      "hidden_dim": X,
      "style_dim": X,
      "istftnet": {...}
  }
  ```
- Voice Management
  - No Hugging Face downloads
  - Local voice file management
  - Simpler voice structure
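A minimal sketch of the local side of this, assuming the config is the JSON structure shown above and voices are stored as torch tensor files on disk; the directory layout and file names are illustrative, not the project's actual paths:

```python
# Sketch only: load the v1.0 config and a local voice file. The paths, the
# voices directory layout, and the .pt voice format are assumptions here.
import json
from pathlib import Path

import torch

MODEL_DIR = Path("api/src/models/v1_0")    # assumed layout
VOICES_DIR = Path("api/src/voices/v1_0")   # assumed layout

with open(MODEL_DIR / "config.json") as f:
    config = json.load(f)

vocab = config["vocab"]        # built-in phoneme -> token id mapping
n_token = config["n_token"]


def load_voice(name: str, device: str = "cpu") -> torch.Tensor:
    # Voices are only a few hundred KB, so loading on demand is cheap.
    return torch.load(VOICES_DIR / f"{name}.pt", map_location=device, weights_only=True)
```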
## Implementation Strategy

- Core Changes
  - Keep our streaming infrastructure
  - Adapt to new model interface
  - Maintain our voice management
- Key Adaptations Needed
  - Wrap KModel in our build system
  - Handle phoneme mapping internally
  - Adapt to new prosody prediction
- Compatibility Layer
class V1ModelWrapper:
def __init__(self, kmodel):
self.model = kmodel
async def forward(self, phonemes, ref_s):
# Adapt v1.0 interface to our system
return self.model(phonemes, ref_s)
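The main point of the wrapper is the call signature: it keeps the async entry point the rest of the pipeline already uses while delegating to the synchronous KModel call, so the streaming infrastructure does not need to know which model version is loaded.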
## Technical Considerations

- Memory Usage
  - Models ~few hundred MB
  - Voices ~few hundred KB
  - No need for complex memory management
- Performance
  - Similar inference speed expected
  - No major architectural bottlenecks
  - Keep existing streaming optimizations
- Integration Points
  - Model loading/initialization
  - Voice file management
  - Inference pipeline
  - Streaming output
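Since the streaming output path is meant to stay as-is, one way the v1.0 wrapper could feed it is to generate the full waveform and then yield fixed-size chunks. The sketch below assumes this approach; the chunk size and float32 byte format are illustrative, not the project's actual streaming code.

```python
# Sketch only: chunk the generated waveform for the existing streaming path.
# Chunk size and raw float32 output are illustrative assumptions.
import numpy as np


async def stream_audio(wrapper, phonemes, ref_s, chunk_samples: int = 4800):
    audio = await wrapper.forward(phonemes, ref_s)        # full waveform tensor
    samples = audio.detach().cpu().numpy().astype(np.float32)
    for start in range(0, len(samples), chunk_samples):
        yield samples[start:start + chunk_samples].tobytes()
```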
## Migration Notes

- Key Files to Port
  - model.py -> v1_0/models.py
  - istftnet.py -> v1_0/istftnet.py
  - Add albert.py for CustomAlbert
- Config Updates
  - Add version selection
  - Keep config structure similar
  - Add v1.0-specific params
- Testing Focus
  - Basic inference
  - Voice compatibility
  - Streaming performance
  - Version switching
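To make the basic inference and version switching checks concrete, a minimal pytest sketch is shown below; the fixtures (v1_wrapper, sample_phonemes, sample_voice) and the mono-waveform assertion are hypothetical stand-ins, not the existing test suite.

```python
# Sketch only: shape of a basic-inference test for the v1.0 path. The fixtures
# are hypothetical (they would come from conftest.py) and pytest-asyncio is
# assumed for the async test.
import pytest
import torch


@pytest.mark.asyncio
async def test_basic_inference_v1_0(v1_wrapper, sample_phonemes, sample_voice):
    audio = await v1_wrapper.forward(sample_phonemes, sample_voice)
    assert isinstance(audio, torch.Tensor)
    assert audio.ndim == 1       # mono waveform
    assert audio.numel() > 0     # produced some samples
```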