OpenVoice

Ultra

Instant Voice Cloning with Granular Tone Control

Speed: Moderate
Quality: Very Good
Voice Cloning: Yes
Languages: 10

About OpenVoice

OpenVoice is an instant voice cloning model with fine-grained control over speaking style. Unlike many cloning models, it separates voice identity from speaking style, so you can take a cloned voice and apply different tones - cheerful, sad, angry, excited, or whispering - without recording new reference audio.

Key Features

Instant Cloning

Clone any voice from just a few seconds of audio.

Tone Control

Apply any of nine tone styles, including cheerful, sad, angry, excited, and whispering.

Style Transfer

Separate voice identity from speaking style for flexibility.

Cross-Lingual

Use cloned voices across different languages.

Fast Processing

Efficient inference for quick voice generation.

Open Source

MIT-licensed, so it can be used in commercial applications.

Use Cases

  • Emotional Content
  • Character Animation
  • Interactive Games
  • Audiobook Narration
  • Marketing Videos
  • Virtual Assistants

How to Use OpenVoice

  1. Sign up free or try the demo

    Create a free TextToSpeechAI account to get starter credits, or use the on-page demo to hear OpenVoice before committing. No local GPU or install is needed - everything runs on our servers.

  2. Choose OpenVoice and upload a reference clip

    Select the OpenVoice engine, then upload a few seconds of clean reference audio to instantly clone the target voice. OpenVoice captures the speaker identity so you can reuse it across any text and tone.

  3. Enter your text

    Type or paste the script you want spoken in the cloned voice. OpenVoice supports around 10 languages and cross-lingual delivery, so you can write in a different language than the reference clip.

  4. Pick a tone style and generate

    Choose one of the nine OpenVoice tone styles - default, friendly, cheerful, excited, sad, angry, terrified, shouting, or whispering - then generate. The same cloned voice will speak with that emotional delivery.

  5. Download or use the API

    Download your audio as MP3, WAV, or OGG, or automate generation through the TextToSpeechAI API by passing your cloned voice and tone style in each request.

OpenVoice API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "OpenVoice can speak in any tone - cheerful, sad, or even whispering.",
    "voice": "en_US-lessac-medium"
  }'
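
For scripted use, the same request can be built in Python. This is a minimal sketch that mirrors the curl call above using only the standard library. The endpoint, headers, and the `text` and `voice` fields are taken from that example; the helper name `build_request` is illustrative. The request is constructed but not sent, because the response format is not described on this page.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://api.texttospeechai.com/v1/generate/"


def build_request(text: str, voice: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "OpenVoice can speak in any tone - cheerful, sad, or even whispering.",
    "en_US-lessac-medium",
    "YOUR_API_TOKEN",
)

# Sending is left commented out: it needs a real token, and how the audio
# is returned (raw bytes, URL, etc.) is an open question here.
# with urllib.request.urlopen(req) as resp:
#     data = resp.read()
```

Replace `YOUR_API_TOKEN` with a real token before uncommenting the final lines.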

Frequently Asked Questions

What is OpenVoice?

OpenVoice is a text-to-speech and voice cloning model that separates voice identity from speaking style. This lets you clone a voice and then apply different emotional tones without needing new reference audio for each emotion. It is built for expressive, controllable speech generation.

Can OpenVoice clone voices?

Yes, OpenVoice performs instant voice cloning from just a few seconds of reference audio - no training run required. Once a voice is captured, OpenVoice can reuse that identity across any text and any tone style you select.

How does tone control work in OpenVoice?

OpenVoice uses a two-stage architecture that splits base speech synthesis from tone conversion. After cloning a voice, you can apply any of nine tone styles - default, friendly, cheerful, excited, sad, angry, terrified, shouting, or whispering - and the same cloned voice speaks differently based on your chosen tone without re-recording.

What tone styles does OpenVoice support?

OpenVoice supports nine speaking styles: default, friendly, cheerful, excited, sad, angry, terrified, shouting, and whispering. Each style reshapes the emotional delivery while preserving the cloned speaker identity, giving you fine-grained control over how a line is read.
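
When tone names come from user input or a config file, it can help to check them against the nine styles before generating. The sketch below is a hypothetical client-side helper: the style names are the ones listed above, but `normalize_tone` is not part of any TextToSpeechAI or OpenVoice API, and the exact spelling the service expects is an assumption.

```python
# The nine tone styles named on this page. Whether the service expects
# exactly these lowercase strings is an assumption.
TONE_STYLES = (
    "default", "friendly", "cheerful", "excited", "sad",
    "angry", "terrified", "shouting", "whispering",
)


def normalize_tone(name: str) -> str:
    """Return a lowercase, trimmed tone name, or raise if it is not listed."""
    key = name.strip().lower()
    if key not in TONE_STYLES:
        raise ValueError(
            f"unknown tone style {name!r}; expected one of {', '.join(TONE_STYLES)}"
        )
    return key


tone = normalize_tone("  Cheerful ")  # "cheerful"
```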

Is OpenVoice free for commercial use?

OpenVoice is open-source under the permissive MIT license, so it is free for commercial use. As with any cloning model, make sure you have proper rights to any voice you clone for commercial projects.

What languages does OpenVoice support?

OpenVoice supports around 10 languages including English, Chinese, Japanese, Korean, and several European languages. It also offers cross-lingual cloning, so you can clone a voice in one language and have it speak naturally in another.

How fast is OpenVoice, and how good is the output?

OpenVoice has moderate generation speed, typically rendering a sentence in 2-4 seconds on a GPU. Output quality is very good, with clear voice reproduction and tone transfer that keeps the speaker identity intact while convincingly changing emotional delivery.

How much VRAM does OpenVoice need?

OpenVoice typically requires 6-8GB of VRAM depending on batch size and tone conversion load. It runs comfortably on mid-range to upper mid-range GPUs, and on TextToSpeechAI all of this is handled on our servers so you do not need any local hardware.

How much does OpenVoice cost on TextToSpeechAI?

OpenVoice is an Ultra-tier engine, priced at 50 credits per 1000 characters. The Ultra tier reflects its advanced tone control and the extra compute needed for the cloning plus style-conversion pipeline.
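
To budget credits for a script, the stated rate can be applied to its character count. This is a rough sketch: the rate of 50 credits per 1000 characters is from the text above, but prorating partial blocks and rounding up to a whole credit are assumptions, as is counting every character (including spaces) toward the total.

```python
CREDITS_PER_1000_CHARS = 50  # Ultra-tier rate stated above


def estimate_credits(text_length: int) -> int:
    """Estimate credits for a script, rounding up to a whole credit.

    Assumes pro-rata billing with round-up; actual billing may differ.
    """
    if text_length < 0:
        raise ValueError("text_length must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-text_length * CREDITS_PER_1000_CHARS // 1000)


script = "x" * 2400
print(estimate_credits(len(script)))  # 2400 * 50 / 1000 = 120
```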

How does OpenVoice compare to F5-TTS?

OpenVoice stands out for its tone and style control: you can take one cloned voice and re-deliver it as cheerful, sad, angry, or whispering. F5-TTS is faster and is our default cloning engine for natural, neutral speech. Choose OpenVoice when you need emotional style control, and F5-TTS when you want the quickest natural clone.

How do I use OpenVoice through the API?

Create a cloned voice by uploading reference audio, then specify a tone style in your API request. The API applies your chosen emotional tone to the cloned voice automatically and returns the audio in MP3, WAV, or OGG format.
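
A request body for a cloned voice with a tone style might be assembled as follows. This is a sketch under stated assumptions: only the `text` and `voice` fields appear in the documented example, so the `style` and `format` field names and the voice identifier `my-cloned-voice` are hypothetical placeholders. Check the TextToSpeechAI API reference for the real parameter names.

```python
import json

# Output formats named on this page.
OUTPUT_FORMATS = {"mp3", "wav", "ogg"}


def build_payload(text: str, cloned_voice_id: str,
                  tone: str = "default", audio_format: str = "mp3") -> str:
    """Return a JSON request body for a cloned voice with a tone style.

    "style" and "format" are hypothetical field names used for illustration.
    """
    fmt = audio_format.lower()
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported format {audio_format!r}")
    payload = {
        "text": text,
        "voice": cloned_voice_id,  # documented field; the value is a placeholder
        "style": tone,             # hypothetical field name
        "format": fmt,             # hypothetical field name
    }
    return json.dumps(payload)


body = build_payload("Welcome back!", "my-cloned-voice", tone="cheerful", audio_format="WAV")
```

The resulting string can be sent as the `-d` body of a POST request to the generate endpoint once the field names are confirmed.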

Can I try OpenVoice for free?

Yes. Sign up for a free TextToSpeechAI account to receive starter credits and try OpenVoice cloning and tone control, or use the on-page demo first. There is no local setup - upload a reference clip, pick a tone, and generate in the browser.

Technical Specs

  • Generation Speed: Moderate
  • Output Quality: Very Good
  • Voice Cloning: Supported
  • Languages: 10
  • GPU VRAM: 3-6GB
  • Credits/1000 chars: 50

Try OpenVoice Now

Generate your first audio free. No credit card required.
