CosyVoice2

Premium

Zero-shot multilingual voice cloning with streaming support

Speed: Fast
Quality: Very Good
Voice cloning: Yes
Languages: 5

About CosyVoice2

CosyVoice2 is a next-generation speech synthesis model from FunAudioLLM (Alibaba). It delivers natural-sounding zero-shot voice cloning across multiple languages, with a streaming mode for low-latency applications. Its speech tokenizer is built on finite scalar quantization, and it achieves high voice similarity from just a few seconds of reference audio.

Key Features

Zero-Shot Voice Cloning

Clone any voice from 3-10 seconds of reference audio with high fidelity.

Multilingual

Supports Chinese, English, Japanese, Korean, and Cantonese with cross-lingual synthesis.

Streaming Support

Low-latency streaming mode for real-time applications and interactive systems.

Natural Prosody

Advanced prosody modeling produces natural-sounding speech with appropriate intonation.

Use Cases

  • Multilingual content creation
  • Real-time voice assistants
  • Cross-lingual dubbing
  • Personalized voice applications

How to Use CosyVoice2

  1. Sign up and claim free credits

    Create a free TextToSpeechAI account to claim your starter credits, or try the demo first. No GPU or local CosyVoice2 install is needed - everything runs on our infrastructure.

  2. Select CosyVoice2 and add a reference clip

    Choose CosyVoice2 as your engine, then upload a clean 3-10 second reference recording of the voice you want to clone. CosyVoice2 will extract the speaker characteristics for zero-shot multilingual cloning.

  3. Enter your text in any supported language

    Type or paste your script in Chinese, English, Japanese, Korean, or Cantonese. CosyVoice2 supports cross-lingual synthesis, so the cloned voice can speak a language different from the reference clip.

  4. Generate the speech

    Click generate and CosyVoice2 synthesizes natural, multilingual speech in the cloned voice, usually within seconds for short text. Premium-tier usage costs 25 credits per 1,000 characters.

  5. Download or use the API

    Download the finished audio as MP3 or WAV from your history, or automate CosyVoice2 voice cloning at scale through the TextToSpeechAI REST API.

CosyVoice2 API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "CosyVoice2 delivers natural multilingual speech with zero-shot voice cloning capability.",
    "voice": "en_US-lessac-medium"
  }'
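
The same request can also be sent from Python. The sketch below is a minimal illustration that uses only the standard library and the endpoint, headers, and JSON fields shown in the curl example. The helper names `build_generate_request` and `send` are hypothetical, and because the response format is not described here, `send` simply returns the raw response bytes.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.texttospeechai.com/v1/generate/"


def build_generate_request(text, voice, api_token):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the raw response body.

    Assumption: the response format is not documented here, so the
    bytes are returned unparsed.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending lets you inspect or log the payload before spending credits. To make a real call, pass the built request to `send` with your own API token.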

Frequently Asked Questions

CosyVoice2 is a next-generation text-to-speech and voice cloning model from FunAudioLLM (Alibaba). It supports zero-shot voice cloning from just a few seconds of reference audio and can synthesize natural speech in Chinese, English, Japanese, Korean, and Cantonese. On TextToSpeechAI you can run CosyVoice2 in the browser without any local setup.

CosyVoice2 is released under the Apache 2.0 license, which covers both the code and the model weights. It can therefore be used in commercial products, paid content, and client work without licensing fees or non-commercial restrictions.

CosyVoice2 supports five languages: Chinese (Mandarin), English, Japanese, Korean, and Cantonese. It also handles cross-lingual synthesis, so you can clone a voice from a recording in one language and generate speech in another.

Provide 3-10 seconds of clean reference audio of the target speaker. CosyVoice2 uses the clip as a prompt to capture the speaker's characteristics (its speech tokenizer is built on finite scalar quantization), then generates new speech in that cloned voice in any of its supported languages. No model training or fine-tuning is required.
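
If you prepare reference clips locally, you can check their length before uploading. The sketch below is a minimal example that assumes the clip is an uncompressed WAV file and uses Python's standard `wave` module. The function names are hypothetical, and the limits simply mirror the 3-10 second guidance above rather than any check the service is known to enforce.

```python
import wave

# Limits taken from the 3-10 second guidance for reference audio.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(source):
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(source):
    """True if the clip length falls inside the recommended range."""
    duration = wav_duration_seconds(source)
    return MIN_SECONDS <= duration <= MAX_SECONDS
```

This checks duration only. Whether the recording is clean (no background noise, a single speaker) still needs a listen.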

CosyVoice2 is one of the stronger multilingual cloning models, preserving speaker identity even when generating speech in a language different from the reference clip. It produces natural prosody and intonation, which makes it well suited for cross-lingual dubbing and localized content.

CosyVoice2 is a fast model with a streaming mode that produces audio at low latency, which makes it suitable for real-time use in voice assistants and interactive applications. On TextToSpeechAI, generations typically complete in seconds for short text.

CosyVoice2 requires about 4-6GB of VRAM for the 0.5B parameter model, so a GPU with 6GB or more is recommended when self-hosting. On TextToSpeechAI the model runs on our GPU infrastructure, so you do not need any hardware of your own.

CosyVoice2 is a premium-tier model and costs 25 credits per 1,000 characters of text. Every new account gets free starter credits, so you can try CosyVoice2 voice cloning before deciding on a paid plan.
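
To budget credits for a script, the rate above can be turned into a quick estimate. The sketch below is illustrative only: the function name is hypothetical, and it assumes that characters are counted as Python string length and that partial credits are rounded up. How TextToSpeechAI actually counts characters and rounds is not stated here.

```python
CREDITS_PER_1000_CHARS = 25  # premium-tier rate quoted above


def estimate_credits(text):
    """Estimate the credit cost of a text at 25 credits per 1,000 characters.

    Assumptions: characters are counted with len(), and any fractional
    credit is rounded up to the next whole credit.
    """
    chars = len(text)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-chars * CREDITS_PER_1000_CHARS // 1000)
```

For example, a 1,000-character script comes out at 25 credits and a 2,400-character script at 60 credits under these assumptions.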

CosyVoice2 and GPT-SoVITS are both premium voice cloning engines. GPT-SoVITS often reaches the highest raw similarity for a single target voice, while CosyVoice2 is stronger for multilingual and cross-lingual cloning and adds a low-latency streaming mode. Choose CosyVoice2 when you need one cloned voice to speak several languages.

CosyVoice2 and F5-TTS both offer high-quality zero-shot voice cloning. CosyVoice2 supports more languages (5 versus 2) and adds streaming for real-time use, while F5-TTS can be slightly faster for English-only workloads. For multilingual projects, CosyVoice2 is usually the better fit.

TextToSpeechAI lets you export CosyVoice2 generations in common formats such as MP3 and WAV. You can download the file directly from your history page or retrieve it programmatically through the TextToSpeechAI API.

You can test CosyVoice2 for free on TextToSpeechAI using the demo and your starter credits, without installing anything. Just sign up, upload a short reference clip, type your text in any supported language, and generate.

Technical Specs

  • Generation speed: Fast
  • Output quality: Very Good
  • Voice cloning: Supported
  • Languages: 5
  • GPU VRAM: 4-6GB
  • Credits per 1,000 characters: 25

Try CosyVoice2 Now

Generate your first audio free. No credit card required.
