Kokoro-FastAPI/docs/architecture/kokoro_v1_technical_notes.md

116 lines
2.4 KiB
Markdown
Raw Permalink Normal View History

# Kokoro v1.0 Technical Integration Notes
## Core Components
1. KModel Class
- Main model class with unified interface
- Handles both weights and inference
- Language-blind design (phoneme focused)
- No external language processing
2. Key Architecture Changes
- Uses CustomAlbert instead of PLBert
- New ProsodyPredictor implementation
- Different phoneme handling approach
- Built-in vocab management
## Integration Points
1. Model Loading
```python
# v1.0 approach
model = KModel(config_path, model_path)
# vs our current
model = await build_model(path, device)
```
2. Forward Pass Differences
```python
# v1.0
audio = model(phonemes, ref_s, speed=1.0)
# vs our current
audio = model.decoder(asr, F0_pred, N_pred, ref_s)
```
3. Key Dependencies
- transformers (for AlbertConfig)
- torch
- No external phoneme processing
## Configuration Changes
1. v1.0 Config Structure
```json
{
"vocab": {...}, # Built-in phoneme mapping
"n_token": X,
"plbert": {...}, # Albert config
"hidden_dim": X,
"style_dim": X,
"istftnet": {...}
}
```
2. Voice Management
- No HF downloads
- Local voice file management
- Simpler voice structure
## Implementation Strategy
1. Core Changes
- Keep our streaming infrastructure
- Adapt to new model interface
- Maintain our voice management
2. Key Adaptations Needed
- Wrap KModel in our build system
- Handle phoneme mapping internally
- Adapt to new prosody prediction
3. Compatibility Layer
```python
class V1ModelWrapper:
def __init__(self, kmodel):
self.model = kmodel
async def forward(self, phonemes, ref_s):
# Adapt v1.0 interface to our system
return self.model(phonemes, ref_s)
```
## Technical Considerations
1. Memory Usage
- Models ~few hundred MB
- Voices ~few hundred KB
- No need for complex memory management
2. Performance
- Similar inference speed expected
- No major architectural bottlenecks
- Keep existing streaming optimizations
3. Integration Points
- Model loading/initialization
- Voice file management
- Inference pipeline
- Streaming output
## Migration Notes
1. Key Files to Port
- model.py -> v1_0/models.py
- istftnet.py -> v1_0/istftnet.py
- Add albert.py for CustomAlbert
2. Config Updates
- Add version selection
- Keep config structure similar
- Add v1.0 specific params
3. Testing Focus
- Basic inference
- Voice compatibility
- Streaming performance
- Version switching