VITS

Standard

Fast End-to-End TTS with Natural Speech

Very Fast Speed
Good Quality
No Cloning
10 Languages

About VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a fast, end-to-end neural TTS model that generates natural-sounding speech. It combines a conditional variational autoencoder with adversarial training and produces the waveform directly from text in a single model, without a separate vocoder stage. It suits batch processing and applications that need both quality and speed.

Key Features

Fast Synthesis

End-to-end architecture for rapid speech generation.

Batch Processing

Efficiently process multiple texts simultaneously.

Natural Speech

VAE+GAN training produces natural prosody and rhythm.

Multi-Speaker

Single model supports multiple speaker voices.

Efficient

Low memory footprint with good performance.

Open Source

MIT licensed for any use case.

Use Cases

  • Batch Audio Generation
  • E-Learning Platforms
  • News Readers
  • Automated Announcements
  • IVR Systems
  • High-Volume Content

VITS Voices

12 of 109 voices shown:

  • LJSpeech (English Female), EN
  • VCTK Speaker 225 (English Female), EN
  • VCTK Speaker 226 (English Male), EN
  • VCTK Speaker 227 (English Male), EN
  • VCTK Speaker 228 (English Female), EN
  • VCTK Speaker 229, EN
  • VCTK Speaker 230, EN
  • VCTK Speaker 231, EN
  • VCTK Speaker 232, EN
  • VCTK Speaker 233, EN
  • VCTK Speaker 234, EN
  • VCTK Speaker 236, EN

How to Use VITS

  1. Sign up free or try the demo

    Create a free TextToSpeechAI account to get starter credits, or use the on-page demo to hear VITS before signing up.

  2. Pick a VITS voice or speaker

    Browse the voice library and choose a voice marked with the VITS badge. The multi-speaker VITS library, including the VCTK speaker set, lets you select from many distinct voices.

  3. Enter your text

    Type or paste the text you want spoken into the editor. VITS handles long passages well and is ideal for batch and high-volume content.

  4. Generate the audio

    Click generate to synthesize speech with VITS. Because VITS is very fast and Standard-tier (10 credits per 1000 characters), results return quickly at low cost.

  5. Download or use the API

    Download the finished audio as MP3, WAV, or OGG, or call the same VITS voice through the TextToSpeechAI REST API to automate generation in your own application.
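For batch or long-form content, steps 3 to 5 can be prepared in code before any request is sent. The following is a minimal sketch, not an official client: it splits long text at sentence boundaries and builds one JSON body per chunk, using the `text` and `voice` fields from the API example on this page. The 1000-character chunk size and the function names are assumptions for illustration; this page does not state a per-request length limit.

```python
import json
import re

VOICE_ID = "vits-ljspeech"  # voice ID taken from the API example on this page
MAX_CHARS = 1000            # hypothetical chunk size, not a documented limit


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_payloads(texts, voice=VOICE_ID, max_chars=MAX_CHARS):
    """Return one JSON request body (as a string) per chunk of every input text."""
    payloads = []
    for text in texts:
        for chunk in split_text(text, max_chars):
            payloads.append(json.dumps({"text": chunk, "voice": voice}))
    return payloads
```

Each returned string can be sent as the request body of one API call, so a long document becomes a sequence of short requests whose audio is concatenated afterwards.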

VITS API

Generate speech programmatically using the TextToSpeechAI REST API.

curl -X POST "https://api.texttospeechai.com/v1/generate/" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "VITS delivers fast, natural speech for high-volume applications.",
    "voice": "vits-ljspeech"
  }'
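The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above (same URL, headers, and JSON fields); the function names are illustrative, and because this page does not describe the response format, `send` simply returns the HTTP status and the raw response bytes without interpreting them.

```python
import json
import urllib.request

# Endpoint copied from the curl example on this page.
API_URL = "https://api.texttospeechai.com/v1/generate/"


def build_request(text, voice, token):
    """Build the POST request without sending it."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=30):
    """Send the request and return (HTTP status, raw response bytes).

    The response format is not documented on this page, so the bytes
    are returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

A call would look like `send(build_request("Hello from VITS.", "vits-ljspeech", "YOUR_API_TOKEN"))`. Keeping request construction separate from sending makes the payload easy to inspect or log before any credits are spent.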

Frequently Asked Questions

What is VITS?

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end neural TTS model that combines a variational autoencoder with adversarial GAN training. It generates natural-sounding speech in a single pass, which makes it fast and efficient. You can try VITS free on TextToSpeechAI.

Can I use VITS commercially?

Yes. VITS is open source under the MIT license, which permits commercial use as long as the copyright and license notice are kept with copies of the software. It is widely used in commercial products and services. On TextToSpeechAI, VITS costs 10 credits per 1000 characters on the Standard tier.

What voices are available with VITS?

TextToSpeechAI offers a large multi-speaker VITS library, including the VCTK voice set with dozens of distinct English speakers. A single VITS model can host many speakers, so you can choose from many different voices without switching engines.

What languages does VITS support?

Language support depends on the trained model. Common VITS models cover English, Chinese, Japanese, Korean, German, French, and other major languages, with multi-speaker English coverage from the VCTK dataset.

How fast is VITS?

VITS is very fast, generating speech in real time or faster on a GPU. Its end-to-end architecture avoids the multiple processing stages of other models, which is why VITS is well suited to batch and high-volume synthesis.
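"Real time or faster" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value at or below 1.0 means real time or faster. A minimal sketch of that arithmetic follows; the function name and the figures in the usage note are illustrative, not measurements of VITS.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

For example, producing 10 seconds of audio in 2 seconds gives `real_time_factor(2.0, 10.0)`, an RTF of 0.2, which is five times faster than real time.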

Does VITS support voice cloning?

No, VITS does not support voice cloning. It uses pre-trained multi-speaker models rather than copying a target voice from a sample. For voice cloning on TextToSpeechAI, use F5-TTS or GPT-SoVITS instead.

How good is the audio quality?

VITS produces good quality audio with natural prosody and rhythm. While it is not at the level of StyleTTS 2 or Tortoise, it offers excellent quality for its speed, especially for batch processing.

How much GPU memory does VITS need?

VITS is memory-efficient, typically needing only a few GB of VRAM (around 4GB). It runs comfortably on consumer GPUs, and on TextToSpeechAI all rendering happens on our servers so you do not need any hardware of your own.

How does VITS compare to Piper?

VITS and Piper are both fast, MIT-licensed Standard-tier engines on TextToSpeechAI. Piper is the lightest and fastest option, while VITS offers a large multi-speaker library (including VCTK) with slightly more natural prosody. Neither supports voice cloning.

How much does VITS cost?

VITS is a Standard-tier engine, costing 10 credits per 1000 characters. This is our lowest pricing tier thanks to the efficient, fast nature of the VITS model.
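The quoted rate makes cost easy to estimate before submitting text. The sketch below shows the arithmetic in two forms, because this page does not say how partial blocks of 1000 characters are billed: a pro-rata estimate and a version that assumes rounding up to whole blocks. Both rounding behaviors, and counting every character including spaces, are assumptions.

```python
import math

CREDITS_PER_1000_CHARS = 10  # Standard-tier rate quoted on this page


def estimate_credits(text, rate=CREDITS_PER_1000_CHARS):
    """Pro-rata estimate: characters times rate, divided by 1000."""
    return len(text) * rate / 1000


def estimate_credits_rounded_up(text, rate=CREDITS_PER_1000_CHARS, block=1000):
    """Estimate assuming billing in whole 1000-character blocks (an assumption)."""
    return math.ceil(len(text) / block) * rate
```

A 2500-character script comes to 25 credits pro rata, or 30 credits if partial blocks are rounded up.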

What audio formats and sample rate does VITS output?

VITS generates audio at 22,050 Hz natively. Through TextToSpeechAI you can request MP3, WAV, or OGG formats, with automatic conversion handled for you.
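The native sample rate also determines how large an uncompressed WAV download will be. The sketch below estimates the size and checks it against a silent file written with Python's standard `wave` module. It assumes 16-bit mono PCM with a plain 44-byte header; this page states only the sample rate, so bit depth and channel count are assumptions.

```python
import io
import wave

SAMPLE_RATE_HZ = 22050  # native VITS rate stated on this page
WAV_HEADER_BYTES = 44   # size of a canonical PCM WAV header


def pcm_wav_size_bytes(seconds, sample_rate=SAMPLE_RATE_HZ, sample_width=2, channels=1):
    """Estimate WAV file size; 16-bit mono is an assumption, not a documented format."""
    frames = round(seconds * sample_rate)
    return WAV_HEADER_BYTES + frames * sample_width * channels


def silent_wav(seconds, sample_rate=SAMPLE_RATE_HZ, sample_width=2, channels=1):
    """Return the bytes of a silent PCM WAV, useful for checking the estimate."""
    frames = round(seconds * sample_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * frames * sample_width * channels)
    return buffer.getvalue()
```

Under these assumptions one second of audio takes 44,100 bytes plus the header, so a minute of WAV is roughly 2.6 MB. MP3 and OGG are compressed and will be considerably smaller.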

How do I get started with VITS?

Sign up on TextToSpeechAI to receive free starter credits, then pick a VITS voice, enter your text, and generate audio. You can also use the demo to hear VITS before creating an account, and access VITS through our REST API once you sign up.

Technical Specs

  • Generation Speed: Very Fast
  • Output Quality: Good
  • Voice Cloning: Not Supported
  • Languages: 10
  • GPU VRAM: 1-2 GB
  • Credits per 1000 characters: 10

Try VITS Now

Generate your first audio free. No credit card required.
