Chatterbox

Premium

Zero-shot voice cloning with expressive speech in 23 languages

Speed: Fast
Quality: Very Good
Voice Cloning: Yes
Languages: 23

About Chatterbox

Chatterbox is a powerful voice cloning TTS model from Resemble AI. It performs zero-shot voice cloning from just a few seconds of reference audio, supporting 23 languages with natural expression. Chatterbox includes paralinguistic tags for adding natural sounds like laughter and coughs to generated speech.

Key Features

Zero-Shot Voice Cloning

Clone any voice from a few seconds of audio - no training required.

23 Languages

From Arabic to Chinese, covering most major world languages.

Expressive Tags

Add [laugh], [cough], [chuckle] for natural paralinguistic sounds.

Fast Inference

Sub-200ms latency with the Turbo variant for real-time applications.

Use Cases

Voice cloning for content creation
Multilingual voice applications
Character voice design for games
Personalized voice assistants

How to Use Chatterbox

1. Sign up or open the demo

Create a free TextToSpeechAI account to claim 200 starter credits, or use the on-page demo to try Chatterbox without signing in.

2. Select Chatterbox and add a reference clip

Choose the Chatterbox engine, then upload a short audio clip (a few seconds) of the voice you want to clone. Chatterbox clones it zero-shot, with no training required.

3. Enter your text with optional tags

Type or paste the text to speak in any of the 23 supported languages, and drop in [laugh], [cough], or [chuckle] tags wherever you want natural paralinguistic sounds.

4. Generate the speech

Click generate and TextToSpeechAI renders your text in the cloned Chatterbox voice on hosted GPU infrastructure, spending 25 credits per 1,000 characters.

5. Download or use the API

Download the finished audio file, or automate generation through the TextToSpeechAI REST API at api.texttospeechai.com using your account token.

Chatterbox API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Chatterbox can clone your voice from just a few seconds of audio and speak in 23 languages.",
    "voice": "en_US-lessac-medium"
  }'
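
For callers who prefer Python, the curl request above can be mirrored with the standard library. This is a minimal sketch: the endpoint, headers, and JSON fields are copied from the curl example, while the helper name build_generate_request is hypothetical. The response format is not described on this page, so the sketch only builds the request and leaves sending and parsing to the caller.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
API_URL = "https://api.texttospeechai.com/v1/generate/"


def build_generate_request(text: str, voice: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example.

    The "text" and "voice" fields are the only ones shown on this page;
    any other parameters would need to come from the API docs.
    """
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example (not executed here, since it needs a real token and network access):
#   req = build_generate_request("Hello [laugh] world", "en_US-lessac-medium", "YOUR_API_TOKEN")
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read()  # response format: see the API docs
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending credits.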

Frequently Asked Questions

What is Chatterbox TTS?

Chatterbox is a zero-shot voice cloning text-to-speech model from Resemble AI. It can replicate any voice from just a few seconds of reference audio and generate natural, expressive speech in 23 languages, all without any per-voice training.

Is Chatterbox free to use commercially?

Yes, Chatterbox is fully MIT licensed - both the code and the model weights - so you can use it freely in commercial products. Generated audio includes an optional neural watermark that can be disabled, and there are no usage royalties.

How does Chatterbox zero-shot cloning work?

You provide a short reference clip of any voice (a few seconds is enough) and Chatterbox extracts that voice's timbre and style into a speaker embedding. It then generates brand-new speech in that voice with no fine-tuning or training step, which is what "zero-shot" means.

What are Chatterbox paralinguistic tags?

Chatterbox reads special inline tags in your text to add natural non-verbal sounds: [laugh] inserts laughter, [cough] inserts a cough, and [chuckle] inserts a soft chuckle. Just place a tag where you want the sound, for example "That is hilarious [laugh] but seriously...".

How do I use the [laugh], [cough], and [chuckle] tags?

Type the tag directly inside your input text at the spot where the sound should occur, surrounded by the rest of your sentence. Chatterbox renders the paralinguistic sound in the cloned voice, blending it into the surrounding speech so it sounds spontaneous rather than spliced in.
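
Because the tags are plain bracketed words in the input text, a typo such as [laughs] could slip through unnoticed. The sketch below checks a string for bracketed tags outside the three named on this page. The helper name find_unsupported_tags is hypothetical, and the check assumes tags are lowercase and matched exactly; how the service itself treats unknown or differently cased tags is not stated here.

```python
import re

# The three tags named on this page.
SUPPORTED_TAGS = {"laugh", "cough", "chuckle"}

# Matches any [bracketed] token that contains no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in SUPPORTED_TAGS.

    Assumption: tags are case-sensitive and must match exactly.
    """
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]


# Usage: an empty list means every tag in the text is one of the three supported ones.
print(find_unsupported_tags("That is hilarious [laugh] but seriously..."))
print(find_unsupported_tags("Excuse me [cough] that was odd [sigh]"))
```

Running a check like this before generation avoids spending credits on text where a mistyped tag might be read aloud or ignored.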

What languages does Chatterbox support?

Chatterbox supports 23 languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese. A single cloned voice can speak across these languages.

How fast is Chatterbox and how good is the quality?

Chatterbox generates speech quickly on a GPU, and the Turbo variant reaches sub-200ms latency for real-time conversational use. Quality is very good, with natural prosody and faithful voice reproduction from even short reference clips.

How much GPU memory does Chatterbox need?

Chatterbox needs roughly 4-8GB of VRAM depending on the variant, with the Turbo model running comfortably in about 4GB. On TextToSpeechAI you do not need any local GPU - generation runs on our hosted infrastructure.

How many credits does Chatterbox cost on TextToSpeechAI?

Chatterbox is a premium-tier engine that costs 25 credits per 1,000 characters. New accounts get 200 free credits to try voice cloning, and you only spend credits on the text you actually generate.
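
The pricing above lends itself to a quick budget estimate. The sketch below assumes credits are charged in proportion to character count and rounded up to a whole credit; the actual rounding rule is not stated on this page, and both function names are hypothetical.

```python
# Rate stated on this page: 25 credits per 1,000 characters.
CREDITS_PER_1000_CHARS = 25


def estimate_credits(text: str) -> int:
    """Estimate the credit cost of `text`.

    Assumption: cost is proportional to length and rounded up to a whole credit.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-len(text) * CREDITS_PER_1000_CHARS // 1000)


def max_characters(credits: int) -> int:
    """Return how many characters a credit balance covers at the stated rate."""
    return credits * 1000 // CREDITS_PER_1000_CHARS


print(estimate_credits("a" * 1000))  # 25
print(max_characters(200))           # 8000
```

Under these assumptions, the 200 starter credits cover about 8,000 characters of Chatterbox output.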

Chatterbox vs F5-TTS: which should I choose?

Both support zero-shot voice cloning, but Chatterbox covers far more languages (23 vs 2) and adds expressive paralinguistic tags. F5-TTS can produce slightly more natural English prosody, so pick Chatterbox for multilingual cloning and expressive sounds, and F5-TTS for English-only fidelity.

How does Chatterbox compare to OpenVoice?

Both offer high-quality voice cloning. Chatterbox supports 23 languages and inline expressive tags, while OpenVoice adds tone-style controls (friendly, sad, angry, and more) that Chatterbox lacks. Choose Chatterbox for broad language coverage and OpenVoice when you need explicit emotional tone styling.

Can I try Chatterbox for free on TextToSpeechAI?

Yes. Sign up for a free TextToSpeechAI account to receive 200 starter credits, or use the on-page demo to hear Chatterbox without signing in. Upload a short reference clip, type your text, and generate a cloned voice in seconds.

Technical Specs

Generation Speed: Fast
Output Quality: Very Good
Voice Cloning: Supported
Languages: 23
GPU VRAM: 4-8GB
Credits per 1,000 characters: 25

Try Chatterbox Now

Generate your first audio free. No credit card required.

Other TTS Engines

Bark

CosyVoice2

Dia