Kokoro-FastAPI/docs/architecture/kokoro_v1_technical_notes.md
remsky 9a588a3483 WIP: 1.0 integration
- Introduced v1.0 model build system integration.
- Updated imports to reflect new directory structure for versioned models.
- Modified environment variables
- Added version selection in the frontend for voice management.
- Enhanced Docker build scripts for multi-platform support.
- Updated configuration settings for default voice and model paths.
2025-01-31 05:55:57 -07:00

Kokoro v1.0 Technical Integration Notes

Core Components

  1. KModel Class
  • Main model class with unified interface
  • Handles both weights and inference
  • Language-blind design (phoneme focused)
  • No external language processing
  2. Key Architecture Changes
  • Uses CustomAlbert instead of PLBert
  • New ProsodyPredictor implementation
  • Different phoneme handling approach
  • Built-in vocab management
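
To make the "built-in vocab management" point concrete, here is a minimal sketch of phoneme-to-ID mapping. It assumes the model looks each phoneme symbol up in the config's `vocab` table, skips symbols it does not know, and wraps the sequence in a boundary token. The toy vocabulary, the function name `phonemes_to_ids`, and the boundary ID `0` are illustrative assumptions, not taken from the v1.0 source.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the real table lives under "vocab" in the v1.0 config.
TOY_VOCAB: Dict[str, int] = {"h": 1, "ə": 2, "l": 3, "o": 4, "ʊ": 5, " ": 6}


def phonemes_to_ids(phonemes: str, vocab: Dict[str, int], pad_id: int = 0) -> List[int]:
    """Map a phoneme string to token IDs, skipping symbols missing from the vocab."""
    ids = [vocab[p] for p in phonemes if p in vocab]
    # Assumed convention: wrap the sequence in a boundary token on both sides.
    return [pad_id, *ids, pad_id]


ids = phonemes_to_ids("həloʊ", TOY_VOCAB)
```

Since this lookup is the only text-side step in the sketch, any language can be passed through as long as its phoneme symbols exist in the table.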

Integration Points

  1. Model Loading
# v1.0 approach
model = KModel(config_path, model_path)
# vs. our current approach
model = await build_model(path, device)
  2. Forward Pass Differences
# v1.0
audio = model(phonemes, ref_s, speed=1.0)
# vs. our current approach
audio = model.decoder(asr, F0_pred, N_pred, ref_s)
  3. Key Dependencies
  • transformers (for AlbertConfig)
  • torch
  • No external phoneme processing
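
The two loading calls above differ in that the v1.0 constructor is synchronous while our current `build_model` is awaited. One possible way to bridge them, sketched below, is to run the constructor in a worker thread. `StubKModel`, `build_model_v1`, and the file names are hypothetical stand-ins so the example runs without the real weights; the actual `KModel` constructor may take different arguments.

```python
import asyncio


class StubKModel:
    """Stand-in for KModel so the sketch runs without torch or real weights."""

    def __init__(self, config_path: str, model_path: str):
        self.config_path = config_path
        self.model_path = model_path


async def build_model_v1(config_path: str, model_path: str) -> StubKModel:
    # Run the synchronous constructor in a worker thread so the event loop
    # stays responsive while the (potentially slow) load is in progress.
    return await asyncio.to_thread(StubKModel, config_path, model_path)


# Hypothetical file names, for illustration only.
model = asyncio.run(build_model_v1("config.json", "model.pth"))
```

This keeps the awaited call shape that the rest of the service already uses. `asyncio.to_thread` requires Python 3.9 or later.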

Configuration Changes

  1. v1.0 Config Structure
{
  "vocab": {...},  # Built-in phoneme mapping
  "n_token": X,
  "plbert": {...},  # Albert config
  "hidden_dim": X,
  "style_dim": X,
  "istftnet": {...}
}
  2. Voice Management
  • No Hugging Face (HF) downloads
  • Local voice file management
  • Simpler voice structure
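
The top-level keys in the v1.0 config structure above can be checked before the model is built, so that a malformed config fails early with a clear message. The following is a minimal sketch; the helper name `missing_config_keys` and the sample values are assumptions for illustration, and only the key names come from the structure listed above.

```python
import json
from typing import List

# Top-level keys named in the v1.0 config structure.
REQUIRED_KEYS = ("vocab", "n_token", "plbert", "hidden_dim", "style_dim", "istftnet")


def missing_config_keys(config: dict) -> List[str]:
    """Return the required top-level keys that are absent from a parsed config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values only; real values come from the shipped config file.
sample = json.loads('{"vocab": {"a": 1}, "n_token": 2, "plbert": {}, "hidden_dim": 8}')
missing = missing_config_keys(sample)
```

An empty result means every expected section is present; it says nothing about whether the values inside each section are valid.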

Implementation Strategy

  1. Core Changes
  • Keep our streaming infrastructure
  • Adapt to new model interface
  • Maintain our voice management
  2. Key Adaptations Needed
  • Wrap KModel in our build system
  • Handle phoneme mapping internally
  • Adapt to new prosody prediction
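  3. Compatibility Layer
class V1ModelWrapper:
    """Adapts the v1.0 KModel call signature to our inference pipeline."""

    def __init__(self, kmodel):
        self.model = kmodel

    async def forward(self, phonemes, ref_s, speed=1.0):
        # KModel is called synchronously here, so the event loop is blocked
        # for the duration of inference.
        return self.model(phonemes, ref_s, speed=speed)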

Technical Considerations

  1. Memory Usage
  • Model weights: a few hundred MB
  • Voices: a few hundred KB
  • No complex memory management needed
  2. Performance
  • Similar inference speed expected
  • No major architectural bottlenecks
  • Keep existing streaming optimizations
  3. Integration Points
  • Model loading/initialization
  • Voice file management
  • Inference pipeline
  • Streaming output
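
As a rough illustration of the "streaming output" integration point, the sketch below slices a finished audio buffer into fixed-size chunks and yields them from an async generator. It is not our existing streaming code: the function names, the chunk size, and the use of a plain list of floats in place of a tensor are all assumptions made so the example runs standalone.

```python
import asyncio
from typing import AsyncIterator, List, Sequence


async def stream_chunks(
    samples: Sequence[float], chunk_size: int
) -> AsyncIterator[Sequence[float]]:
    """Yield consecutive slices of an audio buffer."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]
        # Give other tasks a chance to run between chunks.
        await asyncio.sleep(0)


async def collect(samples: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    return [chunk async for chunk in stream_chunks(samples, chunk_size)]


chunks = asyncio.run(collect([0.0] * 10, chunk_size=4))
```

The last chunk is shorter when the buffer length is not a multiple of the chunk size, which a consumer of the stream needs to tolerate.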

Migration Notes

  1. Key Files to Port
  • model.py -> v1_0/models.py
  • istftnet.py -> v1_0/istftnet.py
  • Add albert.py for CustomAlbert
  2. Config Updates
  • Add version selection
  • Keep config structure similar
  • Add v1.0 specific params
  3. Testing Focus
  • Basic inference
  • Voice compatibility
  • Streaming performance
  • Version switching
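
Version selection and the "version switching" test focus could be approached with a small registry that maps a version string to a loader. This is only a sketch: the version labels, the `MODEL_LOADERS` registry, and the string-returning loaders are hypothetical placeholders for the real per-version build functions.

```python
from typing import Callable, Dict

# Hypothetical registry; each callable stands in for a real per-version loader.
MODEL_LOADERS: Dict[str, Callable[[], str]] = {
    "legacy": lambda: "legacy model",
    "v1_0": lambda: "v1_0 model",
}


def select_loader(version: str, loaders: Dict[str, Callable[[], str]]) -> Callable[[], str]:
    """Look up the loader for a version, failing loudly on unknown versions."""
    try:
        return loaders[version]
    except KeyError:
        known = ", ".join(sorted(loaders))
        raise ValueError(f"Unknown model version {version!r}; expected one of: {known}") from None


selected = select_loader("v1_0", MODEL_LOADERS)()
```

Raising on an unknown version, rather than falling back silently, makes a misconfigured version setting show up at startup instead of as unexpected audio output.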