mirror of https://github.com/remsky/Kokoro-FastAPI.git synced 2025-08-05 16:48:53 +00:00

remsky 9a588a3483 WIP: 1.0 integration

- Introduced v1.0 model build system integration.
- Updated imports to reflect new directory structure for versioned models.
- Modified environment variables.
- Added version selection in the frontend for voice management.
- Enhanced Docker build scripts for multi-platform support.
- Updated configuration settings for default voice and model paths.

2025-01-31 05:55:57 -07:00


Kokoro v1.0 Technical Integration Notes

Core Components

KModel Class

Main model class with unified interface
Handles both weights and inference
Language-blind design: the model operates on phonemes only
No external language processing inside the model

Key Architecture Changes

Uses CustomAlbert instead of PLBert
New ProsodyPredictor implementation
Different phoneme handling approach
Built-in vocab management

Integration Points

Model Loading

# v1.0 approach
model = KModel(config_path, model_path)
# vs our current
model = await build_model(path, device)
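One way to keep an awaitable loading call while using a synchronous constructor is to build the model in a worker thread. The following is a minimal sketch under assumptions: `FakeKModel` is a stand-in defined here so the example runs without the real package, `build_v1_model` is a hypothetical helper name, and the file names are placeholders.

```python
import asyncio


class FakeKModel:
    """Stand-in for KModel so this sketch runs without the real package."""

    def __init__(self, config_path, model_path):
        # A real constructor would read the config and load weights here.
        self.config_path = config_path
        self.model_path = model_path


async def build_v1_model(config_path, model_path, model_cls=FakeKModel):
    """Hypothetical async loader: construct the model off the event loop."""
    # asyncio.to_thread runs the blocking constructor in a worker thread,
    # so other coroutines keep running while the model loads.
    return await asyncio.to_thread(model_cls, config_path, model_path)


# Placeholder paths for illustration only.
model = asyncio.run(build_v1_model("config.json", "model.pth"))
```

Passing the class as `model_cls` keeps the loader testable with a stub and lets the real class be swapped in without changing call sites.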

Forward Pass Differences

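# v1.0: a single call takes phonemes and a reference style vector
audio = model(phonemes, ref_s, speed=1.0)
# vs our current: the decoder is called directly on separately predicted features
audio = model.decoder(asr, F0_pred, N_pred, ref_s)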

Key Dependencies

transformers (for AlbertConfig)
torch
No external phoneme processing

Configuration Changes

v1.0 Config Structure

{
  "vocab": {...},  # Built-in phoneme mapping
  "n_token": X,
  "plbert": {...},  # Albert config
  "hidden_dim": X,
  "style_dim": X,
  "istftnet": {...}
}
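A quick check that a loaded config contains the top-level keys listed above might look like the sketch below. The helper name `missing_config_keys` and the sample values are assumptions made for illustration; only the key names come from the structure above.

```python
import json

# Top-level keys taken from the config structure listed above.
REQUIRED_KEYS = ("vocab", "n_token", "plbert", "hidden_dim", "style_dim", "istftnet")


def missing_config_keys(config):
    """Return the required top-level keys that are absent from config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only; these are not real model settings.
sample = json.loads('{"vocab": {"a": 1}, "n_token": 2, "plbert": {}, "hidden_dim": 8}')
missing = missing_config_keys(sample)
```

Failing early on a missing key tends to give a clearer error than a `KeyError` raised deep inside model construction.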

Voice Management

No Hugging Face (HF) downloads
Local voice file management
Simpler voice structure
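Local voice management can be as simple as resolving names against a directory. The sketch below assumes one file per voice with a `.pt` suffix; the suffix, the function names, and the voice names are all hypothetical and not taken from the project.

```python
from pathlib import Path
import tempfile


def resolve_voice(voices_dir, name, suffix=".pt"):
    """Return the path of a local voice file, or raise if it is missing."""
    path = Path(voices_dir) / f"{name}{suffix}"
    if not path.is_file():
        raise FileNotFoundError(f"voice '{name}' not found in {voices_dir}")
    return path


def list_voices(voices_dir, suffix=".pt"):
    """List available voice names by scanning the directory."""
    return sorted(p.stem for p in Path(voices_dir).glob(f"*{suffix}"))


# Demo with a temporary directory and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for voice in ("voice_b", "voice_a"):
        (Path(tmp) / f"{voice}.pt").touch()
    available = list_voices(tmp)
    found = resolve_voice(tmp, "voice_a").name
```

If voice names ever come from request input, they should be validated before being joined into a path.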

Implementation Strategy

Core Changes

Keep our streaming infrastructure
Adapt to new model interface
Maintain our voice management

Key Adaptations Needed

Wrap KModel in our build system
Handle phoneme mapping internally
Adapt to new prosody prediction
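Handling phoneme mapping internally could amount to a lookup against the built-in vocab. This is a minimal sketch, assuming the vocab is a plain symbol-to-id dictionary and that unknown symbols are dropped rather than rejected; `phonemes_to_ids` and the toy vocabulary are hypothetical.

```python
def phonemes_to_ids(phonemes, vocab):
    """Map a phoneme string to token ids, skipping symbols not in vocab."""
    ids = []
    for symbol in phonemes:
        token_id = vocab.get(symbol)
        if token_id is not None:  # assumption: unknown symbols are dropped
            ids.append(token_id)
    return ids


# Toy vocabulary for illustration only; real ids come from the model config.
toy_vocab = {"h": 1, "ə": 2, "l": 3, "o": 4}
ids = phonemes_to_ids("həlo!", toy_vocab)
```

Dropping unknown symbols keeps inference running on imperfect input, at the cost of hiding phonemizer mismatches; raising an error instead is the stricter alternative.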

Compatibility Layer

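class V1ModelWrapper:
    """Adapts the v1.0 KModel call interface to our async pipeline."""

    def __init__(self, kmodel):
        self.model = kmodel

    async def forward(self, phonemes, ref_s, speed=1.0):
        # Adapt v1.0 interface to our system.
        # Note: the underlying call is synchronous and blocks the event loop.
        return self.model(phonemes, ref_s, speed=speed)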
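To connect a callable model to chunked streaming, each blocking inference call can be pushed to a worker thread and its result yielded as soon as it is ready. The sketch below is illustrative only: `FakeModel` is a stand-in that returns dummy samples, and `stream_audio` and `collect` are hypothetical names, not part of the existing streaming code.

```python
import asyncio


class FakeModel:
    """Stand-in for a v1.0-style model: callable, returns a list of samples."""

    def __call__(self, phonemes, ref_s, speed=1.0):
        # One fake sample per phoneme, purely so the sketch is runnable.
        return [0.0] * len(phonemes)


async def stream_audio(model, phoneme_chunks, ref_s, speed=1.0):
    """Yield audio chunk by chunk, running each blocking call in a thread."""
    for chunk in phoneme_chunks:
        yield await asyncio.to_thread(model, chunk, ref_s, speed=speed)


async def collect(model, chunks):
    """Drain the stream into a list (a real server would send each chunk)."""
    return [audio async for audio in stream_audio(model, chunks, ref_s=None)]


lengths = [len(a) for a in asyncio.run(collect(FakeModel(), ["abc", "de"]))]
```

Yielding per chunk lets the first audio reach the client before later chunks have been synthesized.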

Technical Considerations

Memory Usage

Models: a few hundred MB each
Voices: a few hundred KB each
No need for complex memory management

Performance

Similar inference speed expected
No major architectural bottlenecks
Keep existing streaming optimizations

Integration Points

Model loading/initialization
Voice file management
Inference pipeline
Streaming output

Migration Notes

Key Files to Port

model.py -> v1_0/models.py
istftnet.py -> v1_0/istftnet.py
Add albert.py for CustomAlbert

Config Updates

Add version selection
Keep config structure similar
Add v1.0-specific params
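Version selection could be handled with a small registry that maps a version string to a builder. The sketch below is one possible shape, not the project's implementation; the version labels, builder functions, and the example path are all hypothetical placeholders.

```python
MODEL_BUILDERS = {}


def register(version):
    """Decorator that records a builder function under a version string."""
    def decorator(builder):
        MODEL_BUILDERS[version] = builder
        return builder
    return decorator


@register("legacy")
def build_legacy(path):
    return ("legacy", path)  # placeholder for the current build path


@register("v1.0")
def build_v1(path):
    return ("v1.0", path)  # placeholder for a KModel-based build path


def build(version, path):
    """Look up the builder for a version and call it, or fail clearly."""
    try:
        builder = MODEL_BUILDERS[version]
    except KeyError:
        raise ValueError(f"unknown model version: {version!r}") from None
    return builder(path)


selected = build("v1.0", "models/example.pth")
```

A registry keeps version switching in one place, so adding a later version means registering one more builder rather than editing every call site.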

Testing Focus

Basic inference
Voice compatibility
Streaming performance
Version switching
