
Kokoro V1 Integration Architecture

Overview

This document outlines the architectural approach for integrating the new Kokoro V1 library into our existing inference system. The goal is to bypass most of the legacy model machinery while maintaining compatibility with our existing interfaces, particularly the OpenAI-compatible streaming endpoint.

Current System

The current system uses a ModelBackend interface with multiple implementations (ONNX CPU/GPU, PyTorch CPU/GPU). This interface requires:

  • Async model loading
  • Audio generation from tokens and voice tensors
  • Resource cleanup
  • Device management
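
A rough sketch of what that contract implies is shown below; load_model and generate mirror the signatures used later in this document, while the cleanup and device members are illustrative names rather than the real BaseModelBackend API.

import abc
import numpy as np
import torch

class BaseModelBackend(abc.ABC):
    """Contract each inference backend must satisfy (illustrative sketch)."""

    @abc.abstractmethod
    async def load_model(self, path: str) -> None:
        """Asynchronously load model weights from the given path."""

    @abc.abstractmethod
    def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
        """Generate audio samples from token ids and a voice embedding."""

    @abc.abstractmethod
    def unload(self) -> None:
        """Release model weights and cached resources (illustrative name)."""

    @property
    @abc.abstractmethod
    def device(self) -> str:
        """Device the backend is bound to, e.g. 'cuda' or 'cpu'."""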

Integration Approach

1. KokoroV1 Backend Implementation

We'll create a KokoroV1 class implementing the ModelBackend interface that wraps the new Kokoro library:

class KokoroV1(BaseModelBackend):
    def __init__(self):
        super().__init__()
        self._model = None       # KModel instance, created in load_model
        self._pipeline = None    # KPipeline wrapping the model
        # Use the GPU only when it is enabled in settings and actually available
        self._device = "cuda" if settings.use_gpu and torch.cuda.is_available() else "cpu"

2. Model Loading

The load_model method will initialize both KModel and KPipeline:

async def load_model(self, path: str) -> None:
    # Resolve the configured filename to an absolute path on disk
    model_path = await paths.get_model_path(path)
    # KModel holds the weights; KPipeline drives text processing and synthesis
    self._model = KModel(model_path).to(self._device).eval()
    self._pipeline = KPipeline(model=self._model, device=self._device)

3. Audio Generation

The generate method will adapt our token/voice tensor format to work with KPipeline:

def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    # Convert tokens to text using pipeline's tokenizer
    # Use voice tensor as voice embedding
    # Return generated audio
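
A possible shape for this adaptation is sketched below. It assumes the pipeline's tokenizer exposes a decode helper for the legacy token ids and that KPipeline accepts a voice tensor directly (as the streaming example below does with a voice path); both are assumptions rather than confirmed API.

def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    # Hypothetical decode step: recover text from the legacy token ids so the
    # pipeline can run its own phonemization and chunking
    text = self._pipeline.tokenizer.decode(tokens)
    chunks = []
    for result in self._pipeline(text, voice=voice, speed=speed):
        # Each result carries one chunk of synthesized audio
        chunks.append(result.audio.numpy())
    return np.concatenate(chunks)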

4. Streaming Support

The Kokoro V1 backend must maintain compatibility with our OpenAI-compatible streaming endpoint. Key requirements:

  1. Chunked Generation: The pipeline's output should be compatible with our streaming infrastructure:

    async def generate_stream(self, text: str, voice_path: str) -> AsyncGenerator[np.ndarray, None]:
        # KPipeline yields results chunk by chunk; forward each chunk's audio
        # so downstream code can convert it to the requested output format
        results = self._pipeline(text, voice=voice_path)
        for result in results:
            yield result.audio.numpy()
    
  2. Format Conversion: Support for various output formats:

    • MP3
    • Opus
    • AAC
    • FLAC
    • WAV
    • PCM
  3. Voice Management:

    • Support for voice combination (mean of multiple voice embeddings; sketched after this list)
    • Dynamic voice loading and caching
    • Voice listing and validation
  4. Error Handling:

    • Proper error propagation for client disconnects
    • Format conversion errors
    • Resource cleanup on failures (see the disconnect-handling sketch after this list)
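
A minimal sketch of the voice combination step, assuming each voice file loads as a torch tensor of identical shape:

import torch

def combine_voices(voice_paths: list[str]) -> torch.Tensor:
    # Load each saved voice embedding and average them element-wise to
    # produce a single blended voice
    voices = [torch.load(path, map_location="cpu") for path in voice_paths]
    return torch.stack(voices).mean(dim=0)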
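
For the error-handling requirements, the streaming generator can propagate client disconnects and still release resources; a sketch of that pattern, building on the generate_stream example above:

import asyncio

async def generate_stream(self, text: str, voice_path: str):
    try:
        for result in self._pipeline(text, voice=voice_path):
            yield result.audio.numpy()
    except asyncio.CancelledError:
        # Raised when the client disconnects mid-stream; re-raise so the
        # endpoint stops generation instead of silently swallowing it
        raise
    finally:
        # Runs on success, failure, and disconnect alike; per-request cleanup
        # (e.g. dropping cached tensors) belongs here
        pass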

5. Configuration Integration

We'll use the existing configuration system:

config = model_config.pytorch_kokoro_v1_file  # Model file path
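
For example, wiring the backend to that setting could look like the following (a minimal sketch; the real call site lives in the model manager):

async def startup() -> None:
    backend = KokoroV1()
    await backend.load_model(model_config.pytorch_kokoro_v1_file)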

Benefits

  1. Simplified Pipeline: Direct use of Kokoro library's built-in pipeline
  2. Better Language Support: Access to Kokoro's wider language capabilities (all languages are currently installed; the language is selected from the prefix of the first voice chosen in the request)
  3. Automatic Chunking: Built-in text chunking and processing
  4. Phoneme Generation: Access to phoneme output for better analysis
  5. Streaming Compatibility: Maintains existing streaming functionality

Migration Strategy

  1. Implement KokoroV1 backend with streaming support
  2. Add it to the model manager's available backends (see the registry sketch after this list)
  3. Make it the default for new requests
  4. Keep legacy backends available for backward compatibility
  5. Update voice management to handle both legacy and new voice formats
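
Steps 2 and 3 could take the shape of a simple registry in the model manager. The sketch below is hypothetical; the commented-out legacy entries are placeholders, not the manager's real class names:

# Hypothetical registry mapping backend names to implementations
AVAILABLE_BACKENDS = {
    "kokoro_v1": KokoroV1,
    # "onnx_cpu": OnnxCPUBackend,       # existing backends stay registered
    # "pytorch_gpu": PyTorchGPUBackend,
}
DEFAULT_BACKEND = "kokoro_v1"  # new requests default to the Kokoro V1 backend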

Next Steps

  1. Implement the KokoroV1 backend
  2. Ensure streaming compatibility with OpenAI endpoint
  3. Add tests to verify both streaming and non-streaming functionality
  4. Update documentation for new capabilities
  5. Add monitoring for streaming performance