# Kokoro v1.0 Wrapper Technical Specification

## Overview

This document details the technical implementation of the KokoroV1Wrapper class that integrates the Kokoro v1.0 KModel/KPipeline architecture with our existing system.

## Class Implementation

```python
from pathlib import Path

from kokoro import KModel, KPipeline


class KokoroV1Wrapper:
    """Wrapper for Kokoro v1.0 KModel/KPipeline integration.

    This wrapper manages:
    1. Model initialization and weight loading
    2. Pipeline creation and caching per language
    3. Streaming audio generation
    """

    def __init__(self, config_path: str, model_path: str):
        """Initialize KModel with config and weights.

        Args:
            config_path: Path to config.json in builds/v1_0/
            model_path: Path to model weights in models/v1_0/
        """
        self.model = KModel(config=config_path, model=model_path)  # load config and weights
        self.pipelines = {}  # lang_code -> KPipeline cache

    def get_pipeline(self, lang_code: str) -> KPipeline:
        """Get or create a KPipeline for the given language code.

        Args:
            lang_code: Language code for phoneme processing

        Returns:
            KPipeline instance for the language
        """
        if lang_code not in self.pipelines:
            self.pipelines[lang_code] = KPipeline(
                lang_code=lang_code,
                model=self.model
            )
        return self.pipelines[lang_code]

    async def forward(self, text: str, voice: str, lang_code: str):
        """Generate audio using the appropriate pipeline.

        Args:
            text: Input text to synthesize
            voice: Voice ID to use
            lang_code: Language code for phoneme processing

        Yields:
            Audio chunks as torch.FloatTensor
        """
        pipeline = self.get_pipeline(lang_code)
        generator = pipeline(text, voice=voice)
        for gs, ps, audio in generator:
            yield audio


class ModelManager:
    """Manages multiple model versions and their initialization."""

    def __init__(self):
        self.models = {}

    async def get_model(self, version: str):
        """Get or initialize a model for the specified version.

        Args:
            version: Model version ("v0.19" or "v1.0")

        Returns:
            Initialized model instance
        """
        if version not in self.models:
            if version == "v0.19":
                from ..builds.v0_19.models import build_model
                self.models[version] = await build_model()
            elif version == "v1.0":
                from ..builds.v1_0.wrapper import KokoroV1Wrapper

                # Config in builds directory
                config_path = Path(__file__).parent / "builds/v1_0/config.json"

                # Model weights in models directory
                model_path = Path(__file__).parent / "models/v1_0/kokoro-v1_0.pth"

                self.models[version] = KokoroV1Wrapper(
                    config_path=str(config_path),
                    model_path=str(model_path)
                )
        return self.models[version]
```

## Key Design Points

1. Model Management
   - KModel handles weights and inference
   - Config and weights loaded from separate directories
   - Language-blind design (phoneme focused)

2. Pipeline Caching
   - One KPipeline per language code
   - Pipelines created on demand and cached (see the sketch after this list)
   - Reuses single KModel instance

3. Streaming Integration
   - Maintains compatibility with existing streaming system
   - Yields audio chunks progressively
   - Handles both quiet and loud pipeline modes

4. Version Control
   - Clear separation between v0.19 and v1.0
   - Version-specific model initialization
   - Shared model manager interface

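The snippet below is a minimal sketch of the caching behavior in point 2; the paths and the `"en"` language code are placeholders that follow the conventions used elsewhere in this document.

```python
# Illustrative only: paths and language code are placeholders.
wrapper = KokoroV1Wrapper(
    config_path="builds/v1_0/config.json",
    model_path="models/v1_0/kokoro-v1_0.pth",
)

first = wrapper.get_pipeline("en")   # KPipeline built on first request
second = wrapper.get_pipeline("en")  # served from the cache

assert first is second               # one pipeline per language code
assert len(wrapper.pipelines) == 1   # cache holds a single entry
```
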
## Usage Example

```python
# Initialize model manager
manager = ModelManager()

# Get v1.0 model
model = await manager.get_model("v1.0")

# Generate audio
async for audio in model.forward(
    text="Hello world",
    voice="af_bella",
    lang_code="en"
):
    # Process audio chunk
    process_audio(audio)
```
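
`forward` yields raw `torch.FloatTensor` chunks; packaging them for the existing streaming layer is left to the caller. Below is a minimal sketch assuming 16-bit PCM output at Kokoro's 24 kHz sample rate; `audio_to_pcm_bytes` and `stream_pcm` are illustrative helpers, not part of the wrapper.

```python
import numpy as np
import torch

SAMPLE_RATE = 24_000  # assumed Kokoro v1.0 output rate


def audio_to_pcm_bytes(audio: torch.FloatTensor) -> bytes:
    """Convert a float waveform chunk in [-1, 1] to 16-bit PCM bytes."""
    samples = audio.detach().cpu().numpy()
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16).tobytes()


async def stream_pcm(model, text: str, voice: str, lang_code: str):
    """Yield PCM byte chunks suitable for a streaming response."""
    async for audio in model.forward(text=text, voice=voice, lang_code=lang_code):
        yield audio_to_pcm_bytes(audio)
```
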

## Error Handling

1. File Access (illustrated in the sketch after this list)
   - Verify config.json exists in builds/v1_0/
   - Verify model weights exist in models/v1_0/
   - Handle missing or corrupt files

2. Pipeline Creation
   - Validate language codes
   - Handle initialization failures
   - Clean up failed pipeline instances

3. Voice Loading
   - Verify voice file existence
   - Handle voice format compatibility
   - Manage voice loading failures

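A minimal sketch of the file-access check and pipeline-creation cleanup described above, assuming the path layout from the Class Implementation section; the helper names and exception types are illustrative.

```python
from pathlib import Path


def verify_model_files(config_path: str, model_path: str) -> None:
    """Fail fast if the v1.0 config or weights are missing."""
    for label, path in (("config", config_path), ("model weights", model_path)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Kokoro v1.0 {label} not found at {path}")


def get_pipeline_safe(wrapper, lang_code: str):
    """Fetch or create a pipeline, dropping any partial cache entry on failure."""
    try:
        return wrapper.get_pipeline(lang_code)
    except Exception as exc:
        wrapper.pipelines.pop(lang_code, None)  # defensive cleanup of a failed entry
        raise RuntimeError(f"Failed to build pipeline for lang_code={lang_code!r}") from exc
```
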

## Testing Strategy

1. Unit Tests (see the sketch after this list)
   - Model initialization
   - Pipeline creation and caching
   - Audio generation
   - Error handling

2. Integration Tests
   - End-to-end audio generation
   - Streaming performance
   - Memory usage
   - Multi-language support

3. Performance Tests
   - Pipeline creation overhead
   - Memory usage patterns
   - Streaming latency
   - Voice loading speed

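As a starting point for the unit tests above, a pytest sketch of the pipeline-caching check; the import path and the fake classes are assumptions, chosen so the test does not load real weights.

```python
# The import path is an assumption; adjust to wherever KokoroV1Wrapper lives.
import builds.v1_0.wrapper as wrapper_module


class _FakeModel:
    """Stand-in for KModel so the test avoids loading real weights."""
    def __init__(self, *args, **kwargs):
        pass


class _FakePipeline:
    """Stand-in for KPipeline that records its constructor arguments."""
    def __init__(self, lang_code, model):
        self.lang_code = lang_code
        self.model = model


def test_pipeline_created_once_per_language(monkeypatch):
    # Swap the heavy Kokoro classes for lightweight fakes.
    monkeypatch.setattr(wrapper_module, "KModel", _FakeModel)
    monkeypatch.setattr(wrapper_module, "KPipeline", _FakePipeline)

    wrapper = wrapper_module.KokoroV1Wrapper(
        config_path="config.json", model_path="model.pth"
    )
    first = wrapper.get_pipeline("en")
    second = wrapper.get_pipeline("en")

    assert first is second                    # cached instance is reused
    assert list(wrapper.pipelines) == ["en"]  # exactly one cache entry
```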