# Kokoro V1 Integration Architecture
## Overview
This document outlines the architectural approach for integrating the new Kokoro V1 library into our existing inference system. The goal is to bypass most of the legacy model machinery while maintaining compatibility with our existing interfaces, particularly the OpenAI-compatible streaming endpoint.
## Current System
The current system uses a `ModelBackend` interface with multiple implementations (ONNX CPU/GPU, PyTorch CPU/GPU). This interface requires (see the sketch after this list):
- Async model loading
- Audio generation from tokens and voice tensors
- Resource cleanup
- Device management
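These requirements correspond roughly to an abstract contract like the sketch below. The names and exact signatures are illustrative and may differ from the repository's actual definitions:
```python
from abc import ABC, abstractmethod

import numpy as np
import torch


class ModelBackend(ABC):
    """Simplified, illustrative sketch of the backend contract described above."""

    @abstractmethod
    async def load_model(self, path: str) -> None:
        """Asynchronously load model weights from the given path."""

    @abstractmethod
    def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
        """Generate audio samples from token IDs and a voice embedding."""

    @abstractmethod
    def unload(self) -> None:
        """Release model resources (GPU memory, caches)."""

    @property
    @abstractmethod
    def device(self) -> str:
        """Device identifier ("cuda" or "cpu") the backend runs on."""
```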
## Integration Approach
### 1. KokoroV1 Backend Implementation
We'll create a `KokoroV1` class implementing the `ModelBackend` interface that wraps the new Kokoro library:
```python
class KokoroV1(BaseModelBackend):
    """Backend wrapping the Kokoro V1 library (KModel + KPipeline)."""

    def __init__(self):
        super().__init__()
        self._model = None
        self._pipeline = None
        # Prefer GPU when enabled in settings and actually available.
        self._device = "cuda" if settings.use_gpu and torch.cuda.is_available() else "cpu"
```
### 2. Model Loading
The `load_model` method will initialize both `KModel` and `KPipeline`:
```python
async def load_model(self, path: str) -> None:
    model_path = await paths.get_model_path(path)
    self._model = KModel(model_path).to(self._device).eval()
    self._pipeline = KPipeline(model=self._model, device=self._device)
```
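For orientation, a minimal usage sketch of the loading step. `model_config` is the existing configuration object (see "Configuration Integration" below); nothing else here is new API:
```python
import asyncio


async def main() -> None:
    backend = KokoroV1()
    # The model file path comes from the existing config system.
    await backend.load_model(model_config.pytorch_kokoro_v1_file)


asyncio.run(main())
```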
### 3. Audio Generation
The `generate` method will adapt our token/voice-tensor format to work with `KPipeline`:
```python
def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    # Convert tokens to text using the pipeline's tokenizer
    # Use the voice tensor as the voice embedding
    # Return the generated audio as a numpy array
    ...
```
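One possible shape for that adapter is sketched below. It assumes `KPipeline` accepts a voice embedding tensor directly and that we keep a way to map legacy token IDs back to text; the `_decode_tokens` helper is hypothetical, not an existing API:
```python
def generate(self, tokens: list[int], voice: torch.Tensor, speed: float = 1.0) -> np.ndarray:
    if self._pipeline is None:
        raise RuntimeError("Backend not loaded; call load_model() first")

    # Hypothetical helper: recover text from legacy token IDs so the
    # pipeline can run its own tokenization end to end.
    text = self._decode_tokens(tokens)

    # The pipeline yields audio chunk by chunk; concatenate the chunks
    # for the non-streaming path. Passing the voice tensor directly is
    # an assumption about the pipeline's API.
    chunks = [result.audio.numpy() for result in self._pipeline(text, voice=voice, speed=speed)]
    return np.concatenate(chunks)
```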
### 4. Streaming Support
The Kokoro V1 backend must maintain compatibility with our OpenAI-compatible streaming endpoint. Key requirements:
1. **Chunked Generation**: The pipeline's output should be compatible with our streaming infrastructure:
```python
async def generate_stream(self, text: str, voice_path: str) -> AsyncGenerator[np.ndarray, None]:
    results = self._pipeline(text, voice=voice_path)
    for result in results:
        # Yield raw audio chunks; conversion to the requested byte format
        # happens downstream in the streaming infrastructure.
        yield result.audio.numpy()
```
2. **Format Conversion**: Support for various output formats:
- MP3
- Opus
- AAC
- FLAC
- WAV
- PCM
3. **Voice Management**:
- Support for voice combination (mean of multiple voice embeddings; see the sketch after this list)
- Dynamic voice loading and caching
- Voice listing and validation
4. **Error Handling**:
- Proper error propagation for client disconnects
- Format conversion errors
- Resource cleanup on failures
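For the voice-combination requirement in particular, the "mean of multiple voice embeddings" is a simple tensor operation. A minimal sketch (loading, caching, and validation omitted):
```python
import torch


def combine_voices(voice_tensors: list[torch.Tensor]) -> torch.Tensor:
    """Combine several voice embeddings by element-wise averaging."""
    if not voice_tensors:
        raise ValueError("At least one voice embedding is required")
    # Stack along a new leading dimension and average it away,
    # preserving the original embedding shape.
    return torch.stack(voice_tensors, dim=0).mean(dim=0)
```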
### 5. Configuration Integration
We'll use the existing configuration system:
```python
config = model_config.pytorch_kokoro_v1_file # Model file path
```
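For context, the field referenced above would live in the existing model configuration, roughly as below. The surrounding model and default value are illustrative, assuming the config uses pydantic models as is typical for FastAPI projects:
```python
from pydantic import BaseModel


class ModelConfig(BaseModel):
    """Illustrative subset of the existing model configuration."""

    pytorch_kokoro_v1_file: str = "kokoro_v1.pth"  # placeholder default
```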
## Benefits
1. **Simplified Pipeline**: Direct use of Kokoro library's built-in pipeline
2. **Better Language Support**: Access to Kokoro's wider language capabilities
3. **Automatic Chunking**: Built-in text chunking and processing
4. **Phoneme Generation**: Access to phoneme output for better analysis
5. **Streaming Compatibility**: Maintains existing streaming functionality
## Migration Strategy
1. Implement KokoroV1 backend with streaming support
2. Add to model manager's available backends
3. Make it the default for new requests
4. Keep legacy backends available for backward compatibility
5. Update voice management to handle both legacy and new voice formats
## Next Steps
1. Implement the `KokoroV1` backend
2. Ensure streaming compatibility with OpenAI endpoint
3. Add tests to verify both streaming and non-streaming functionality
4. Update documentation for new capabilities
5. Add monitoring for streaming performance