Kyutai's Pocket TTS offers a lightweight, efficient solution for text-to-speech, eliminating the need for GPUs and web APIs. It's designed for ease of use and speed on CPUs.
Kyutai Labs has released Pocket TTS, a text-to-speech (TTS) model built to run efficiently on CPUs, with no GPU or cloud-based API required. It ships as a Python package that can be installed with pip and used with a function call. Pocket TTS targets low-latency audio streaming (about 200 ms to the first audio chunk) and faster-than-real-time generation (about 6x real time on a MacBook Air M4 CPU) while using only 2 CPU cores. Key features include a small model (100M parameters), a Python API and CLI, voice cloning (English only), and support for text inputs of any length.
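To make the latency and speed figures concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Pocket TTS: the function names are hypothetical, the two constants are simply the approximate numbers quoted above, and the model is idealized (it assumes a constant real-time factor and ignores model loading time).

```python
# Approximate figures quoted in the text above; real numbers vary by machine.
FIRST_CHUNK_LATENCY_S = 0.2  # ~200 ms until the first audio chunk is available
REALTIME_FACTOR = 6.0        # ~6x real time on a MacBook Air M4 CPU


def synthesis_seconds(audio_seconds: float,
                      realtime_factor: float = REALTIME_FACTOR) -> float:
    """Estimate wall-clock time needed to generate `audio_seconds` of speech."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def seconds_until_playback(audio_seconds: float, streaming: bool) -> float:
    """Estimate how long a listener waits before hearing anything.

    Streaming: playback can start once the first chunk arrives.
    Non-streaming: the whole clip has to be generated first.
    """
    if streaming:
        return FIRST_CHUNK_LATENCY_S
    return synthesis_seconds(audio_seconds)


if __name__ == "__main__":
    clip = 60.0  # one minute of speech
    print(f"Full generation: ~{synthesis_seconds(clip):.1f} s")
    print(f"Wait (streaming): ~{seconds_until_playback(clip, True):.1f} s")
    print(f"Wait (non-streaming): ~{seconds_until_playback(clip, False):.1f} s")
```

Under these assumptions, a one-minute clip takes roughly 10 seconds to generate in full, while a streaming listener waits only for the first chunk. Because generation runs faster than playback, streamed audio should not stall once it has started.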
Pocket TTS can be used from the command-line interface (CLI), through a local web server, or from Python. The generate command produces audio from text in a single call and supports choosing a different voice or cloning one from a custom WAV file. The serve command launches a local web server that keeps the model in memory between requests, so repeated generations are faster and can be tried interactively. The Python API lets developers integrate Pocket TTS directly into their applications.
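As an illustration of the generate workflow described above, the following Python sketch assembles a CLI invocation and runs it with the standard library. Several details are assumptions rather than facts taken from this summary: the executable name `pocket-tts`, the `--text` and `--voice` flag names, and the idea that `--voice` accepts a path to a WAV file. The helper functions are hypothetical. Check the tool's own help output before relying on any of these.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional

CLI_NAME = "pocket-tts"  # assumed executable name after installing with pip


def build_generate_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argument list for a one-shot `generate` call.

    `--text` and `--voice` are assumed flag names, not confirmed here.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    command = [CLI_NAME, "generate", "--text", text]
    if voice is not None:
        # Assumed: a WAV file containing the voice to clone.
        command += ["--voice", voice]
    return command


def run_generate(text: str, voice: Optional[str] = None) -> int:
    """Run the command if the CLI is installed; return its exit code."""
    command = build_generate_command(text, voice)
    if shutil.which(CLI_NAME) is None:
        print(f"{CLI_NAME} not found on PATH; would run: {shlex.join(command)}")
        return 1
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Only prints the command; call run_generate(...) to execute it.
    print(shlex.join(build_generate_command("Hello from the command line.")))
```

Each call made this way starts a fresh process and reloads the model, which is the overhead the serve command avoids by keeping the model in memory.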
Key Takeaways:
Pocket TTS is a compelling alternative to GPU-dependent TTS solutions, but it has limitations. It currently supports English only, does not run in the browser (there is no WebAssembly build), and has no native way to control pauses or silences in the generated speech. The developers welcome contributions that address these limitations. They also note that they tried running the model on a GPU and observed no speedup over CPU execution.
The project also includes a section on prohibited use that requires compliance with applicable laws and regulations. Voice impersonation or cloning without explicit consent, and the generation of misinformation or harmful content, are strictly prohibited. Prospective users should take these terms into account before adopting the tool.