mirror of
https://github.com/remsky/Kokoro-FastAPI.git
synced 2025-08-05 16:48:53 +00:00
116 lines
2.4 KiB
Markdown
116 lines
2.4 KiB
Markdown
![]() |
# Kokoro v1.0 Technical Integration Notes
|
||
|
|
||
|
## Core Components
|
||
|
|
||
|
1. KModel Class
|
||
|
- Main model class with unified interface
|
||
|
- Handles both weights and inference
|
||
|
- Language-blind design (phoneme focused)
|
||
|
- No external language processing
|
||
|
|
||
|
2. Key Architecture Changes
|
||
|
- Uses CustomAlbert instead of PLBert
|
||
|
- New ProsodyPredictor implementation
|
||
|
- Different phoneme handling approach
|
||
|
- Built-in vocab management
|
||
|
|
||
|
## Integration Points
|
||
|
|
||
|
1. Model Loading
|
||
|
```python
|
||
|
# v1.0 approach
|
||
|
model = KModel(config_path, model_path)
|
||
|
# vs our current
|
||
|
model = await build_model(path, device)
|
||
|
```
|
||
|
|
||
|
2. Forward Pass Differences
|
||
|
```python
|
||
|
# v1.0
|
||
|
audio = model(phonemes, ref_s, speed=1.0)
|
||
|
# vs our current
|
||
|
audio = model.decoder(asr, F0_pred, N_pred, ref_s)
|
||
|
```
|
||
|
|
||
|
3. Key Dependencies
|
||
|
- transformers (for AlbertConfig)
|
||
|
- torch
|
||
|
- No external phoneme processing
|
||
|
|
||
|
## Configuration Changes
|
||
|
|
||
|
1. v1.0 Config Structure
|
||
|
```json
|
||
|
{
|
||
|
"vocab": {...}, # Built-in phoneme mapping
|
||
|
"n_token": X,
|
||
|
"plbert": {...}, # Albert config
|
||
|
"hidden_dim": X,
|
||
|
"style_dim": X,
|
||
|
"istftnet": {...}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
2. Voice Management
|
||
|
- No HF downloads
|
||
|
- Local voice file management
|
||
|
- Simpler voice structure
|
||
|
|
||
|
## Implementation Strategy
|
||
|
|
||
|
1. Core Changes
|
||
|
- Keep our streaming infrastructure
|
||
|
- Adapt to new model interface
|
||
|
- Maintain our voice management
|
||
|
|
||
|
2. Key Adaptations Needed
|
||
|
- Wrap KModel in our build system
|
||
|
- Handle phoneme mapping internally
|
||
|
- Adapt to new prosody prediction
|
||
|
|
||
|
3. Compatibility Layer
|
||
|
```python
|
||
|
class V1ModelWrapper:
|
||
|
def __init__(self, kmodel):
|
||
|
self.model = kmodel
|
||
|
|
||
|
async def forward(self, phonemes, ref_s):
|
||
|
# Adapt v1.0 interface to our system
|
||
|
return self.model(phonemes, ref_s)
|
||
|
```
|
||
|
|
||
|
## Technical Considerations
|
||
|
|
||
|
1. Memory Usage
|
||
|
- Models ~few hundred MB
|
||
|
- Voices ~few hundred KB
|
||
|
- No need for complex memory management
|
||
|
|
||
|
2. Performance
|
||
|
- Similar inference speed expected
|
||
|
- No major architectural bottlenecks
|
||
|
- Keep existing streaming optimizations
|
||
|
|
||
|
3. Integration Points
|
||
|
- Model loading/initialization
|
||
|
- Voice file management
|
||
|
- Inference pipeline
|
||
|
- Streaming output
|
||
|
|
||
|
## Migration Notes
|
||
|
|
||
|
1. Key Files to Port
|
||
|
- model.py -> v1_0/models.py
|
||
|
- istftnet.py -> v1_0/istftnet.py
|
||
|
- Add albert.py for CustomAlbert
|
||
|
|
||
|
2. Config Updates
|
||
|
- Add version selection
|
||
|
- Keep config structure similar
|
||
|
- Add v1.0 specific params
|
||
|
|
||
|
3. Testing Focus
|
||
|
- Basic inference
|
||
|
- Voice compatibility
|
||
|
- Streaming performance
|
||
|
- Version switching
|