
*Note: all languages are currently installed; the language is selected by the prefix of the first voice chosen in the call.*
# Kokoro V1 Integration Architecture

## Overview

This document outlines the architectural approach for integrating the new Kokoro V1 library into our existing inference system. The goal is to bypass most of the legacy model machinery while maintaining compatibility with our existing interfaces, particularly the OpenAI-compatible streaming endpoint.
## Current System

The current system uses a `ModelBackend` interface with multiple implementations (ONNX CPU/GPU, PyTorch CPU/GPU). This interface requires:

- Async model loading
- Audio generation from tokens and voice tensors
- Resource cleanup
- Device management
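
For reference, here is a minimal sketch of what that contract might look like. The method and property names below are assumptions inferred from the list above, not the actual interface definition:

```python
from abc import ABC, abstractmethod

import numpy as np
import torch


class ModelBackend(ABC):
    """Assumed shape of the existing backend contract (names are illustrative)."""

    @abstractmethod
    async def load_model(self, path: str) -> None:
        """Asynchronously load model weights from the given path."""

    @abstractmethod
    def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
        """Generate audio from token IDs and a voice embedding tensor."""

    @abstractmethod
    def unload(self) -> None:
        """Release model resources."""

    @property
    @abstractmethod
    def device(self) -> str:
        """Device the backend runs on ("cpu" or "cuda")."""
```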
## Integration Approach

### 1. KokoroV1 Backend Implementation

We'll create a `KokoroV1` class implementing the `ModelBackend` interface that wraps the new Kokoro library:
```python
class KokoroV1(BaseModelBackend):
    def __init__(self):
        super().__init__()
        self._model = None
        self._pipeline = None
        # Prefer the GPU when enabled in settings and available on the host
        self._device = "cuda" if settings.use_gpu and torch.cuda.is_available() else "cpu"
```
### 2. Model Loading

The `load_model` method will initialize both `KModel` and `KPipeline`:
```python
async def load_model(self, path: str) -> None:
    # Resolve the configured path to the model file on disk
    model_path = await paths.get_model_path(path)
    # Load the model onto the target device in inference mode
    self._model = KModel(model_path).to(self._device).eval()
    # Build a pipeline bound to the loaded model and device
    self._pipeline = KPipeline(model=self._model, device=self._device)
```
### 3. Audio Generation

The `generate` method will adapt our token/voice tensor format to work with `KPipeline`:
```python
def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    # Convert tokens to text using the pipeline's tokenizer
    # Use the voice tensor as the voice embedding
    # Return the generated audio
    ...
```
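
One possible body for this adapter is sketched below. It assumes the pipeline's tokenizer exposes a `decode` method and that `KPipeline` accepts a preloaded voice tensor directly; neither assumption is confirmed here, so treat this as a shape, not an implementation:

```python
def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    # Assumed API: recover text from token IDs via the pipeline's tokenizer
    text = self._pipeline.tokenizer.decode(tokens)
    # Assumed: the pipeline accepts a voice tensor directly as the embedding
    chunks = [result.audio.numpy() for result in self._pipeline(text, voice=voice, speed=speed)]
    # Concatenate per-segment audio into a single waveform
    return np.concatenate(chunks)
```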
### 4. Streaming Support

The Kokoro V1 backend must maintain compatibility with our OpenAI-compatible streaming endpoint. Key requirements:
- **Chunked Generation**: The pipeline's output should be compatible with our streaming infrastructure:

  ```python
  async def generate_stream(self, text: str, voice_path: str) -> AsyncGenerator[np.ndarray, None]:
      # Run the pipeline and yield each audio segment as it is produced
      results = self._pipeline(text, voice=voice_path)
      for result in results:
          yield result.audio.numpy()
  ```
- **Format Conversion**: Support for various output formats (a conversion sketch follows this list):
  - MP3
  - Opus
  - AAC
  - FLAC
  - WAV
  - PCM
- **Voice Management** (a combination sketch follows this list):
  - Support for voice combination (mean of multiple voice embeddings)
  - Dynamic voice loading and caching
  - Voice listing and validation
- **Error Handling**:
  - Proper error propagation for client disconnects
  - Format conversion errors
  - Resource cleanup on failures
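
As referenced above, a minimal sketch of format conversion, assuming the `soundfile` package and Kokoro's nominal 24 kHz output rate; in practice the endpoint would route through our existing conversion layer for the full format list:

```python
import io

import numpy as np
import soundfile as sf


def to_wav_bytes(audio: np.ndarray, sample_rate: int = 24000) -> bytes:
    """Encode a float waveform as WAV bytes (one of the supported output formats)."""
    buf = io.BytesIO()
    sf.write(buf, audio, sample_rate, format="WAV")
    return buf.getvalue()
```

And voice combination as the mean of multiple embeddings, per the requirement above:

```python
import torch


def combine_voices(voices: list[torch.Tensor]) -> torch.Tensor:
    """Combine voice embeddings by averaging them element-wise."""
    return torch.mean(torch.stack(voices), dim=0)
```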
### 5. Configuration Integration

We'll use the existing configuration system:

```python
config = model_config.pytorch_kokoro_v1_file  # Model file path
```
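
For example, the model manager could wire the backend up as follows. This is a hypothetical usage sketch: `KokoroV1` and the config field come from this document, while the surrounding call site is assumed:

```python
# Hypothetical call site inside the model manager
backend = KokoroV1()
await backend.load_model(model_config.pytorch_kokoro_v1_file)
```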
## Benefits

- **Simplified Pipeline**: Direct use of the Kokoro library's built-in pipeline
- **Better Language Support**: Access to Kokoro's wider language capabilities
- **Automatic Chunking**: Built-in text chunking and processing
- **Phoneme Generation**: Access to phoneme output for better analysis
- **Streaming Compatibility**: Maintains existing streaming functionality
## Migration Strategy

- Implement the `KokoroV1` backend with streaming support
- Add it to the model manager's available backends
- Make it the default for new requests
- Keep legacy backends available for backward compatibility
- Update voice management to handle both legacy and new voice formats
## Next Steps

- Switch to Code mode to implement the `KokoroV1` backend
- Ensure streaming compatibility with the OpenAI endpoint
- Add tests verifying both streaming and non-streaming functionality
- Update documentation for the new capabilities
- Add monitoring for streaming performance