Kokoro-FastAPI/docs/architecture/kokoro_v1_technical_notes.md

# Kokoro v1.0 Technical Integration Notes

## Core Components

1. KModel Class
- Main model class with unified interface
- Handles both weights and inference
- Language-blind design (phoneme focused)
- No external language processing

2. Key Architecture Changes
- Uses CustomAlbert instead of PLBert
- New ProsodyPredictor implementation
- Different phoneme handling approach
- Built-in vocab management

## Integration Points

1. Model Loading
```python
# v1.0 approach
model = KModel(config_path, model_path)
# vs our current
model = await build_model(path, device)
```

2. Forward Pass Differences
```python
# v1.0
audio = model(phonemes, ref_s, speed=1.0)
# vs our current
audio = model.decoder(asr, F0_pred, N_pred, ref_s)
```

3. Key Dependencies
- transformers (for AlbertConfig)
- torch
- No external phoneme processing

## Configuration Changes

1. v1.0 Config Structure
```json
{
  "vocab": {...},  # Built-in phoneme mapping
  "n_token": X,
  "plbert": {...},  # Albert config
  "hidden_dim": X,
  "style_dim": X,
  "istftnet": {...}
}
```

2. Voice Management
- No HF downloads
- Local voice file management
- Simpler voice structure

## Implementation Strategy

1. Core Changes
- Keep our streaming infrastructure
- Adapt to new model interface
- Maintain our voice management

2. Key Adaptations Needed
- Wrap KModel in our build system
- Handle phoneme mapping internally
- Adapt to new prosody prediction

3. Compatibility Layer
```python
class V1ModelWrapper:
    def __init__(self, kmodel):
        self.model = kmodel
        
    async def forward(self, phonemes, ref_s):
        # Adapt v1.0 interface to our system
        return self.model(phonemes, ref_s)
```

## Technical Considerations

1. Memory Usage
- Models ~few hundred MB
- Voices ~few hundred KB
- No need for complex memory management

2. Performance
- Similar inference speed expected
- No major architectural bottlenecks
- Keep existing streaming optimizations

3. Integration Points
- Model loading/initialization
- Voice file management
- Inference pipeline
- Streaming output

## Migration Notes

1. Key Files to Port
- model.py -> v1_0/models.py
- istftnet.py -> v1_0/istftnet.py
- Add albert.py for CustomAlbert

2. Config Updates
- Add version selection
- Keep config structure similar
- Add v1.0 specific params

3. Testing Focus
- Basic inference
- Voice compatibility
- Streaming performance
- Version switching
WIP: 1.0 integration - Introduced v1.0 model build system integration. - Updated imports to reflect new directory structure for versioned models. - Modified environment variables - Added version selection in the frontend for voice management. - Enhanced Docker build scripts for multi-platform support. - Updated configuration settings for default voice and model paths. 2025-01-31 05:55:57 -07:00			`# Kokoro v1.0 Technical Integration Notes`

			`## Core Components`

			`1. KModel Class`
			`- Main model class with unified interface`
			`- Handles both weights and inference`
			`- Language-blind design (phoneme focused)`
			`- No external language processing`

			`2. Key Architecture Changes`
			`- Uses CustomAlbert instead of PLBert`
			`- New ProsodyPredictor implementation`
			`- Different phoneme handling approach`
			`- Built-in vocab management`

			`## Integration Points`

			`1. Model Loading`
			```python
			`# v1.0 approach`
			`model = KModel(config_path, model_path)`
			`# vs our current`
			`model = await build_model(path, device)`
			```

			`2. Forward Pass Differences`
			```python
			`# v1.0`
			`audio = model(phonemes, ref_s, speed=1.0)`
			`# vs our current`
			`audio = model.decoder(asr, F0_pred, N_pred, ref_s)`
			```

			`3. Key Dependencies`
			`- transformers (for AlbertConfig)`
			`- torch`
			`- No external phoneme processing`

			`## Configuration Changes`

			`1. v1.0 Config Structure`
			```json
			`{`
			`"vocab": {...}, # Built-in phoneme mapping`
			`"n_token": X,`
			`"plbert": {...}, # Albert config`
			`"hidden_dim": X,`
			`"style_dim": X,`
			`"istftnet": {...}`
			`}`
			```

			`2. Voice Management`
			`- No HF downloads`
			`- Local voice file management`
			`- Simpler voice structure`

			`## Implementation Strategy`

			`1. Core Changes`
			`- Keep our streaming infrastructure`
			`- Adapt to new model interface`
			`- Maintain our voice management`

			`2. Key Adaptations Needed`
			`- Wrap KModel in our build system`
			`- Handle phoneme mapping internally`
			`- Adapt to new prosody prediction`

			`3. Compatibility Layer`
			```python
			`class V1ModelWrapper:`
			`def __init__(self, kmodel):`
			`self.model = kmodel`

			`async def forward(self, phonemes, ref_s):`
			`# Adapt v1.0 interface to our system`
			`return self.model(phonemes, ref_s)`
			```

			`## Technical Considerations`

			`1. Memory Usage`
			`- Models ~few hundred MB`
			`- Voices ~few hundred KB`
			`- No need for complex memory management`

			`2. Performance`
			`- Similar inference speed expected`
			`- No major architectural bottlenecks`
			`- Keep existing streaming optimizations`

			`3. Integration Points`
			`- Model loading/initialization`
			`- Voice file management`
			`- Inference pipeline`
			`- Streaming output`

			`## Migration Notes`

			`1. Key Files to Port`
			`- model.py -> v1_0/models.py`
			`- istftnet.py -> v1_0/istftnet.py`
			`- Add albert.py for CustomAlbert`

			`2. Config Updates`
			`- Add version selection`
			`- Keep config structure similar`
			`- Add v1.0 specific params`

			`3. Testing Focus`
			`- Basic inference`
			`- Voice compatibility`
			`- Streaming performance`
			`- Version switching`