VITS

Standard

Fast End-to-End TTS with Natural Speech

Very Fast Speed

Good Quality

No Cloning

10 Languages

About VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a fast, end-to-end neural TTS model that generates natural-sounding speech. It combines variational autoencoders with adversarial training for efficient synthesis. VITS is excellent for batch processing and applications requiring both quality and speed.

Key Features

Fast Synthesis

End-to-end architecture for rapid speech generation.

Batch Processing

Efficiently process multiple texts simultaneously.

Natural Speech

VAE+GAN training produces natural prosody and rhythm.

Multi-Speaker

Single model supports multiple speaker voices.

Efficient

Low memory footprint with good performance.

Open Source

MIT licensed for any use case.

Use Cases

Batch Audio Generation

E-Learning Platforms

News Readers

Automated Announcements

IVR Systems

High-Volume Content

VITS Voices

View All 109

LJSpeech (English Female)

VCTK Speaker 225 (English Female)

VCTK Speaker 226 (English Male)

VCTK Speaker 227 (English Male)

VCTK Speaker 228 (English Female)

VCTK Speaker 229

VCTK Speaker 230

VCTK Speaker 231

VCTK Speaker 232

VCTK Speaker 233

VCTK Speaker 234

VCTK Speaker 236

How to Use VITS

1. Sign up free or try the demo

Create a free TextToSpeechAI account to get starter credits, or use the on-page demo to hear VITS before signing up.

2. Pick a VITS voice or speaker

Browse the voice library and choose a voice marked with the VITS badge. The multi-speaker VITS library, including the VCTK speaker set, lets you select from many distinct voices.

3. Enter your text

Type or paste the text you want spoken into the editor. VITS handles long passages well and is suited to batch and high-volume content.

4. Generate the audio

Click generate to synthesize speech with VITS. VITS is very fast and priced at the Standard tier (10 credits per 1000 characters), so results return quickly at low cost.

5. Download or use the API

Download the finished audio as MP3, WAV, or OGG, or call the same VITS voice through the TextToSpeechAI REST API to automate generation in your own application.
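
For the batch and high-volume use these steps describe, one request per text is a simple starting point. The Python sketch below is a hedged outline, not the service's own client: `synthesize_batch`, the `send` callable, and the `max_workers` default are hypothetical, and the only API details it relies on are the `text` and `voice` fields shown in the API example on this page. Passing the transport in as a function keeps the batching logic testable without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_batch(texts, voice, send, max_workers=4):
    """Send one payload per non-empty text and return results in input order.

    `send` is a caller-supplied function (hypothetical) that takes a payload
    dict and returns whatever the API call produces. The worker count is an
    assumption; check the service's rate limits before raising it.
    """
    payloads = [{"text": t, "voice": voice} for t in texts if t.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the payloads.
        return list(pool.map(send, payloads))
```

In practice `send` would wrap the HTTP call to the REST API; for a dry run it can be any function that accepts a dict, which makes it easy to check the payloads before spending credits.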

VITS API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "VITS delivers fast, natural speech for high-volume applications.",
    "voice": "vits-ljspeech"
  }'
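
The same request can be sketched in Python with only the standard library. The URL, headers, and the `text` and `voice` fields mirror the curl example above; the helper names `build_generate_request` and `send` are hypothetical. The response format is not described on this page, so the sketch reports only the status and content type instead of assuming the body is audio or JSON.

```python
import json
import urllib.request

API_URL = "https://api.texttospeechai.com/v1/generate/"  # from the curl example


def build_generate_request(text, voice, token):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper: only the URL, the two headers, and the
    "text"/"voice" fields come from the documented curl call.
    """
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=30):
    """Send the request and return (status, content_type, body_bytes).

    The response schema is an unknown here, so the body is returned as raw
    bytes for the caller to inspect. Requires a real token and network access.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.headers.get("Content-Type"), resp.read()
```

A call such as `send(build_generate_request("Hello from VITS.", "vits-ljspeech", "YOUR_API_TOKEN"))` would issue the request; inspect the returned content type before deciding how to parse or save the body.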


Frequently Asked Questions

What is VITS TTS?

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end neural TTS model that combines a variational autoencoder with adversarial (GAN) training. It generates natural-sounding speech in a single pass, which makes it fast and efficient. You can try VITS free on TextToSpeechAI.

Is VITS free for commercial use?

Yes. VITS is open source under the MIT license, which permits commercial use. It is widely used in commercial products and services. On TextToSpeechAI, VITS costs 10 credits per 1000 characters on the Standard tier.

How many VITS voices are there?

TextToSpeechAI offers a large multi-speaker VITS library, including the VCTK voice set with dozens of distinct English speakers. A single VITS model can host many speakers, so you can choose from many different voices without switching engines.

What languages does VITS support?

Language support in VITS depends on the trained model. Common VITS models cover English, Chinese, Japanese, Korean, German, French, and other major languages, with multi-speaker English coverage from the VCTK dataset.

How fast is VITS?

VITS is very fast, generating speech in real time or faster on a GPU. Its end-to-end architecture avoids the multiple processing stages of other models, which is why VITS is well suited to batch and high-volume synthesis.

Does VITS support voice cloning?

No, VITS does not support voice cloning. It uses pre-trained multi-speaker models rather than copying a target voice from a sample. For voice cloning on TextToSpeechAI, use F5-TTS or GPT-SoVITS instead.

What is the audio quality of VITS?

VITS produces good-quality audio with natural prosody and rhythm. It is not at the level of StyleTTS 2 or Tortoise, but it offers excellent quality for its speed, especially for batch processing.

How much GPU memory does VITS need?

VITS is memory-efficient, typically needing only a few GB of VRAM (around 4GB). It runs comfortably on consumer GPUs, and on TextToSpeechAI all rendering happens on our servers, so you do not need any hardware of your own.

VITS vs Piper: which should I use?

VITS and Piper are both fast, MIT-licensed Standard-tier engines on TextToSpeechAI. Piper is the lightest and fastest option, while VITS offers a large multi-speaker library (including VCTK) with slightly more natural prosody. Neither supports voice cloning.

How many credits does VITS cost on TextToSpeechAI?

VITS is a Standard-tier engine, costing 10 credits per 1000 characters. This is our lowest pricing tier, made possible by the speed and efficiency of the VITS model.

What audio formats does VITS output?

VITS generates audio at 22050 Hz natively. Through TextToSpeechAI you can request MP3, WAV, or OGG formats, with automatic conversion handled for you.

How do I try VITS for free?

Sign up on TextToSpeechAI to receive free starter credits, then pick a VITS voice, enter your text, and generate audio. You can also use the demo to hear VITS before creating an account, and access VITS through our REST API once you sign up.
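
Since the FAQ gives 22050 Hz as the native VITS sample rate, a downloaded WAV file can be checked locally. The sketch below uses Python's standard `wave` module; `wav_info` and `make_silence` are hypothetical helper names, and the silent clip is stand-in data only. Whether a converted download keeps the native rate is an assumption worth verifying against real output.

```python
import io
import wave


def wav_info(data):
    """Return (sample_rate_hz, channels, duration_seconds) for WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def make_silence(seconds, rate=22050):
    """Build a mono, 16-bit silent WAV in memory as stand-in test data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples: an assumption for this demo
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

To inspect a real download, read the file as bytes and pass them to `wav_info`; a first element other than 22050 would indicate the audio was resampled somewhere along the way.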

Technical Specs

Generation Speed: Very Fast
Output Quality: Good
Voice Cloning: Not Supported
Languages: 10
GPU VRAM: 1-2GB
Credits/1000 chars: 10
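
The credit rate in the specs makes a rough cost estimate straightforward. This is a minimal sketch under stated assumptions: `estimate_credits` is a hypothetical helper, every character (including spaces) is assumed to count, and no rounding or minimum charge is applied, so actual billing may differ.

```python
def estimate_credits(text, credits_per_1000_chars=10):
    """Estimate credits as a pro-rata share of the per-1000-character rate.

    Assumptions (not confirmed by the page): all characters count, and
    the charge scales linearly with no rounding or minimum.
    """
    return len(text) * credits_per_1000_chars / 1000
```

Under these assumptions a 2,500-character script comes to 2500 × 10 / 1000 = 25 credits.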

Try VITS Now

Generate your first audio free. No credit card required.

Start Free

Other TTS Engines

Bark

Chatterbox

CosyVoice2