Open-source voice cloning had a quiet revolution between mid-2025 and now. XTTS v2 was the only serious option for years; in the last six months five new models have shown up that genuinely move the state of the art forward. I spent a weekend installing all six on the same Windows machine and putting them through the same prompts so you don't have to.
The hardware: Ryzen 7 5800X, 32 GB RAM, RTX 3060 12 GB. Pretty average gaming PC, three years old. If something runs here it'll run on most modern setups.
The test set: a 60-second voice sample of myself, three prompts in English (a short ad read, a long narrative paragraph, and a question with rising intonation), one prompt in Spanish, one in Mandarin. Same five inputs across all six models.
Quick verdict before the detail
- Best overall English quality: F5-TTS
- Best Asian-language quality: IndexTTS-2 (Mandarin/Japanese), CosyVoice 2 close behind
- Easiest to install on Windows for a non-developer: XTTS v2 (via RBS Voice Cloner V2 — disclosure, mine)
- Lowest VRAM requirement: XTTS v2 (4 GB), Chatterbox (4 GB)
- Best emotional delivery: CosyVoice 2
- Best documentation / friendliest dev experience: Chatterbox
F5-TTS
The newest of the bunch and the most impressive on raw quality. F5-TTS is a flow-matching model from a Hong Kong research group, released late 2025 and refined through Q1 2026. The architectural change versus XTTS v2 is significant — flow matching produces noticeably more natural prosody on long inputs, and the silences between sentences sound right.
Quality on the English narrative prompt: best in the test. The model picks up on subtle phrasing in the source sample (a slight pause before emphasising a word) and reproduces it. On the ad-read prompt, the energy and pacing felt almost professional.
Setup pain: medium-high. There's no installer; you clone the GitHub repo, set up a Python environment, install the right PyTorch for your CUDA version, download the model weights from Hugging Face. About 90 minutes if you've never done this before, 15 minutes if you have.
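The manual steps look roughly like this. Treat it as a sketch, not a recipe: the repo URL, the CUDA wheel index, and the weights repo name are assumptions on my part, so check the project's README for the current instructions before running anything.

```shell
# Sketch of the F5-TTS manual setup on Windows (PowerShell / Git Bash).
# Repo URL, cu121 index, and weights repo name are assumptions.
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS

# Isolated Python environment so the PyTorch build doesn't clash
python -m venv .venv
.venv\Scripts\activate    # `source .venv/bin/activate` on Linux/macOS

# PyTorch matched to your CUDA version (cu121 shown; pick yours)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

# Project dependencies, then model weights from Hugging Face
pip install -e .
huggingface-cli download SWivid/F5-TTS
```

The CUDA-matched wheel index is the step that trips up first-timers: installing plain `pip install torch` often pulls a CPU-only build, and everything runs 10x slower with no error message to explain why.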
Hardware reality: I had to use the smaller checkpoint to fit in 12 GB VRAM. The full model wants 16 GB. Generation took about 4 seconds per sentence on my card.
XTTS v2 (via RBS Voice Cloner V2)
Disclosure first: this is what I build. The model itself is Coqui's XTTS v2, which has been around since 2023 and is still the workhorse of the open-source TTS world. The packaging is mine — I bundle PyTorch and CUDA 12.8 inside the installer so you skip the Python setup step entirely.
Quality on the English narrative prompt: very good. Not quite at F5-TTS level on the long paragraph, but close enough that most listeners couldn't reliably tell them apart. On the short prompts the gap effectively disappears.
Setup pain: zero. Run installer, click through, wait for the model to download on first launch (~2 GB, one time), start cloning. That's the whole story.
Hardware reality: runs on 4 GB VRAM, falls back to CPU if no GPU is present (~10x slower but still usable). Generation took about 2 seconds per sentence on the RTX 3060.
Where it loses to F5-TTS: long-form prosody and the very subtle emotional cues. Where it wins: install pain, hardware floor, and language coverage (17 languages with a built-in translator).
IndexTTS-2
From a Chinese research lab and very obviously trained primarily on Chinese-language data. The Mandarin output is the best of any model in the test — tones are accurate, the prosody sounds genuinely native. Japanese was a close second.
English output is competent but a bit monotone compared to F5-TTS or XTTS v2. If your audience is bilingual or Asia-focused this is the model to consider.
Setup pain: similar to F5-TTS. Hardware: 8 GB VRAM minimum. Generation: ~3 seconds per sentence on the RTX 3060.
Chatterbox
Chatterbox surprised me. It's been topping the Hugging Face Spaces trending list for a few months and I assumed it was hype, but the developer experience is genuinely the best of the open-source options. Clean web UI, good documentation, easy to demo to a non-technical friend.
Quality is a step below F5-TTS and XTTS v2 on long-form, similar on short clips. Where Chatterbox wins is approachability — if you're trying to get a marketing colleague or a podcaster to try open-source voice cloning, point them here. They'll have something working in 10 minutes.
Hardware: 4 GB VRAM, runs on CPU if needed. Generation ~3 seconds per sentence on RTX 3060.
Fish Speech V1.5
Fish Speech has been iterating quickly. V1.5 (released February 2026) added multi-speaker dialogue support and improved Chinese-language quality. Quality on English is roughly tied with XTTS v2 — not as polished as F5-TTS, not as approachable as Chatterbox.
Where Fish Speech is interesting is the multi-speaker support. If you're generating dialogue between two characters and don't want to stitch outputs from separate single-voice generations, this is one of the few open-source options that handles it natively.
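For contrast, here is what the stitching workflow looks like without native support: you parse a dialogue script into per-speaker turns, generate each turn with that speaker's cloned voice, then concatenate the audio. The script format and helper below are hypothetical illustrations, not Fish Speech's actual API.

```python
# Hypothetical helper: split a "NAME: line" dialogue script into
# (speaker, text) turns, the unit you'd feed to per-speaker TTS
# calls before concatenating the clips. Not Fish Speech's API.
def parse_dialogue(script: str) -> list[tuple[str, str]]:
    turns = []
    for raw in script.strip().splitlines():
        if not raw.strip():
            continue  # skip blank lines between turns
        speaker, _, text = raw.partition(":")
        turns.append((speaker.strip(), text.strip()))
    return turns

script = """
ALICE: Did you hear about the new release?
BOB: I did. The dialogue support looks promising.
ALICE: Let's test it this weekend.
"""

for speaker, text in parse_dialogue(script):
    # In the stitching workflow you'd call your single-voice model
    # here, once per turn, with the voice registered for `speaker`.
    print(f"[{speaker}] {text}")
```

Native multi-speaker support collapses this loop into a single generation call, which also tends to produce more natural turn-to-turn pacing than concatenating independent clips.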
Setup pain: medium. Hardware: 6 GB VRAM. Generation ~3-4 seconds per sentence.
CosyVoice 2
Alibaba's open-source TTS, tested here in the 0.5B-parameter version. The thing that surprised me was the emotional control. CosyVoice 2 has explicit "happy", "sad", "angry", and "surprised" tags that you can include inline with the text, and it actually changes the delivery in a way that feels natural. ElevenLabs has had this for a while; CosyVoice 2 is the first open-source model I've tried where it works.
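To make the inline-tag idea concrete, here is a minimal sketch of how tagged text decomposes into (emotion, segment) pairs that an engine could render per segment. The `<happy>`-style tag syntax is my own illustration; check CosyVoice 2's documentation for the exact markup it accepts.

```python
import re

# Illustrative inline emotion tags: split tagged text into
# (emotion, segment) pairs. Tag syntax here is hypothetical,
# not CosyVoice 2's actual markup.
TAG = re.compile(r"<(happy|sad|angry|surprised)>")

def split_by_emotion(text: str, default: str = "neutral"):
    segments = []
    emotion = default
    # re.split with a capture group interleaves text and tag names
    parts = TAG.split(text)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            emotion = part  # odd indices are the captured tag names
        elif part.strip():
            segments.append((emotion, part.strip()))
    return segments

print(split_by_emotion(
    "Welcome back. <happy>We hit the milestone! <sad>But we lost a teammate."
))
```

Each segment carries its active emotion until the next tag, which is why a single tag mid-sentence shifts the delivery of everything after it.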
Quality on the neutral English prompts is decent — not best in the test, but solid. On Mandarin it's strong, second to IndexTTS-2.
Setup pain: medium-high, with thinner English-language documentation than the alternatives. Hardware: 8 GB VRAM. Generation ~4 seconds per sentence.
Side-by-side
| Model | Setup | Min VRAM | Speed (RTX 3060) | Best at |
|---|---|---|---|---|
| F5-TTS | Medium-high | 12 GB | ~4s / sentence | Peak English quality |
| XTTS v2 (RBS) | Easy (installer) | 4 GB | ~2s / sentence | Approachability + coverage |
| IndexTTS-2 | Medium-high | 8 GB | ~3s / sentence | Asian languages |
| Chatterbox | Easy | 4 GB | ~3s / sentence | Dev experience |
| Fish Speech V1.5 | Medium | 6 GB | ~3-4s / sentence | Multi-speaker dialogue |
| CosyVoice 2 | Medium-high | 8 GB | ~4s / sentence | Emotional control |
Which one should you actually pick?
If you're a creator who just wants something that works on Windows without spending an afternoon on installation, the answer is XTTS v2, either via RBS Voice Cloner V2 or installed directly. The quality gap to F5-TTS is real but small; the install gap is enormous.
If you're a developer building voice into your own app and absolute peak quality matters more than 90 minutes of setup, F5-TTS is the answer.
If you're working primarily in Mandarin, Japanese or Korean, IndexTTS-2.
If you're producing dialogue between multiple characters, Fish Speech V1.5.
If you need explicit emotional delivery, CosyVoice 2.
If you want the smoothest demo experience for showing AI voice cloning to a non-technical person, Chatterbox.
A note on benchmarks
Most published TTS benchmarks (MOS scores, character error rate, speaker similarity) are research metrics that don't always correlate with how the audio actually sounds to a listener. I weighted my comparison heavily toward the listening experience because that's what your audience will care about. If you want the rigorous metrics, the Hugging Face TTS Arena leaderboard is the place to look — and the rankings there match my subjective rankings reasonably well as of April 2026.