Chatterbox

Premium

Zero-shot voice cloning with expressive speech in 23 languages

Speed: Fast
Quality: Very Good
Voice Cloning: Yes
Languages: 23

About Chatterbox

Chatterbox is a zero-shot voice cloning text-to-speech (TTS) model from Resemble AI. It clones a voice from just a few seconds of reference audio, with no per-voice training, and generates expressive speech in 23 languages. It also supports paralinguistic tags that insert non-verbal sounds such as laughter and coughs into the generated speech.

Key Features

Zero-Shot Voice Cloning

Clone any voice from a few seconds of audio - no training required.

23 Languages

From Arabic to Chinese, covering most major world languages.

Expressive Tags

Add [laugh], [cough], or [chuckle] tags to insert natural paralinguistic sounds.

Fast Inference

Sub-200ms latency with the Turbo variant for real-time applications.

Use Cases

  • Voice cloning for content creation
  • Multilingual voice applications
  • Character voice design for games
  • Personalized voice assistants

How to Use Chatterbox

  1. Sign up or open the demo

     Create a free TextToSpeechAI account to claim 200 starter credits, or use the on-page demo to try Chatterbox without signing in.

  2. Select Chatterbox and add a reference clip

     Choose the Chatterbox engine, then upload a short audio clip (a few seconds is enough) of the voice you want to clone. Chatterbox clones the voice zero-shot, with no training step.

  3. Enter your text with optional tags

     Type or paste the text to speak in any of the 23 supported languages, and drop in [laugh], [cough], or [chuckle] tags wherever you want natural paralinguistic sounds.

  4. Generate the speech

     Click generate and TextToSpeechAI renders your text in the cloned Chatterbox voice on hosted GPU infrastructure, spending 25 credits per 1,000 characters.

  5. Download or use the API

     Download the finished audio file, or automate generation through the TextToSpeechAI REST API at api.texttospeechai.com using your account token.

Chatterbox API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Chatterbox can clone your voice from just a few seconds of audio and speak in 23 languages.",
    "voice": "en_US-lessac-medium"
  }'
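
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; it reuses the endpoint, headers, and the two JSON fields ("text" and "voice") exactly as they appear in the curl example, including the voice ID, which is copied verbatim. The function names are illustrative. The example does not show how to select the Chatterbox engine or attach a reference clip, and the response format is not documented here, so the sketch returns the raw response body without interpreting it.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.texttospeechai.com/v1/generate/"


def build_generate_request(text: str, voice: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body.

    Assumption: the response format (audio bytes, JSON, a job ID) is not
    described on this page, so the body is returned untouched.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    req = build_generate_request(
        "Chatterbox can clone your voice from just a few seconds of audio.",
        "en_US-lessac-medium",  # voice ID copied from the curl example
        "YOUR_API_TOKEN",       # placeholder; use your account token
    )
    # Printing only; call send(req) to actually hit the API.
    print(req.get_method(), req.full_url)
```

Separating request construction from sending makes it possible to inspect or log the payload before spending any credits.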

Frequently Asked Questions

Chatterbox is a zero-shot voice cloning text-to-speech model from Resemble AI. It can replicate any voice from just a few seconds of reference audio and generate natural, expressive speech in 23 languages, all without any per-voice training.

Chatterbox is fully MIT licensed, covering both the code and the model weights, so you can use it freely in commercial products. Generated audio includes an optional neural watermark that can be disabled, and there are no usage royalties.

You provide a short reference clip of any voice (a few seconds is enough) and Chatterbox extracts that voice's timbre and style into a speaker embedding. It then generates brand-new speech in that voice with no fine-tuning or training step, which is what "zero-shot" means.

Chatterbox reads special inline tags in your text to add natural non-verbal sounds: [laugh] inserts laughter, [cough] inserts a cough, and [chuckle] inserts a soft chuckle. Just place a tag where you want the sound, for example "That is hilarious [laugh] but seriously...".

Type the tag directly inside your input text at the spot where the sound should occur, surrounded by the rest of your sentence. Chatterbox renders the paralinguistic sound in the cloned voice, blending it into the surrounding speech so it sounds spontaneous rather than spliced in.
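
Because tags are plain bracketed words inside the input text, a typo such as [laughs] would presumably be passed through as ordinary text rather than rendered as a sound. The sketch below checks a script for bracketed tokens before submission. It treats only the three tags named on this page as known; whether Chatterbox accepts other tags is not stated here, and the helper names are illustrative.

```python
import re

# The three tags named on this page; any others are treated as unknown.
KNOWN_TAGS = {"laugh", "cough", "chuckle"}

# Matches a bracketed token such as "[laugh]" and captures the inner word.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list[str]:
    """Return every bracketed token in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tokens that are not in KNOWN_TAGS."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


if __name__ == "__main__":
    script = "That is hilarious [laugh] but seriously [laughs] listen."
    print("tags found:", find_tags(script))
    print("unrecognized:", unknown_tags(script))
```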

Chatterbox supports 23 languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese. A single cloned voice can speak across these languages.

Chatterbox generates speech quickly on a GPU, and the Turbo variant reaches sub-200ms latency for real-time conversational use. Quality is very good, with natural prosody and faithful voice reproduction from even short reference clips.

Chatterbox needs roughly 4-8GB of VRAM depending on the variant, with the Turbo model running comfortably in about 4GB. On TextToSpeechAI you do not need any local GPU - generation runs on our hosted infrastructure.

Chatterbox is a premium-tier engine that costs 25 credits per 1,000 characters. New accounts get 200 free credits to try voice cloning, and you only spend credits on the text you actually generate.
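
At 25 credits per 1,000 characters, the 200 starter credits cover about 8,000 characters of generated speech. The sketch below works through that arithmetic. Two points are assumptions, since the page does not specify them: that fractional credits are rounded up per request, and that every character of the input, including tags, counts toward the total. The function names are illustrative.

```python
CREDITS_PER_1000_CHARS = 25  # premium-tier rate stated on this page
STARTER_CREDITS = 200        # free credits for new accounts


def estimate_credits(num_chars: int) -> int:
    """Estimate the credits for a text of num_chars characters.

    Assumption: partial credits round up to the next whole credit.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_chars * CREDITS_PER_1000_CHARS // 1000)


def max_characters(credits: int) -> int:
    """Return how many characters a credit balance covers at this rate."""
    return credits * 1000 // CREDITS_PER_1000_CHARS


if __name__ == "__main__":
    text = "Chatterbox can clone your voice from just a few seconds of audio."
    print("estimated credits:", estimate_credits(len(text)))
    print("characters covered by starter credits:", max_characters(STARTER_CREDITS))
```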

Chatterbox and F5-TTS both support zero-shot voice cloning, but Chatterbox covers far more languages (23 vs 2) and adds expressive paralinguistic tags. F5-TTS can produce slightly more natural English prosody, so pick Chatterbox for multilingual cloning and expressive sounds, and F5-TTS for English-only fidelity.

Chatterbox and OpenVoice both offer high-quality voice cloning. Chatterbox supports 23 languages and inline expressive tags, while OpenVoice adds tone-style controls (friendly, sad, angry, and more) that Chatterbox lacks. Choose Chatterbox for broad language coverage and OpenVoice when you need explicit emotional tone styling.

You can try Chatterbox for free. Sign up for a free TextToSpeechAI account to receive 200 starter credits, or use the on-page demo to hear Chatterbox without signing in. Upload a short reference clip, type your text, and generate a cloned voice in seconds.

Technical Specs

  • Generation Speed: Fast
  • Output Quality: Very Good
  • Voice Cloning: Supported
  • Languages: 23
  • GPU VRAM: 4-8GB
  • Credits per 1,000 characters: 25

Try Chatterbox Now

Generate your first audio free. No credit card required.
