There are two assumptions baked into almost every voice-cloning tutorial: that you need a cloud service with an account and a credit card on file, and that running it yourself needs a gaming GPU. I build a voice cloner for a living, so let me say it plainly — both are wrong. A five-year-old office laptop can clone a voice entirely offline. It just does it slower than an RTX card, and slower is fine for most of what people actually use this for.
This post is the honest version of "how": why local matters more than it did a year ago, what hardware you really need, and the actual steps.
Why offline matters more in 2026
Your voice is biometric data. Unlike a password, you can't rotate it after a leak: you only get one, and it now unlocks things. Banks use voice verification, and families implicitly trust a familiar voice on the phone.
That trust is precisely what's being attacked. The FTC has been warning about harmful voice cloning since at least 2023, and reporting this year (CNN covered it in May) keeps landing on the same uncomfortable finding: a few seconds of audio is enough to build a convincing clone of someone. I've written before about how to tell a cloned voice from a real one, and the short answer is that your ears alone can't.
Here's the part that connects to today's topic: every time you upload a voice recording to a cloud cloning service, you're trusting that company's storage, retention policy, and breach record with a biometric you can never change. Some services are careful. Some train on what you upload. Most sit somewhere undisclosed in between. When the processing happens on your own PC, that entire question evaporates — the sample never leaves the machine.
The hardware truth: no GPU required
Modern local voice models (the XTTS family and its descendants) run on a plain CPU. What a GPU buys you is speed — with a CUDA card the same generation runs roughly 5–10× faster. So the honest framing is:
- Any 64-bit Windows 10/11 PC with 8 GB+ RAM — works. Short clips take a coffee-length wait on CPU rather than seconds on GPU. For narrating a paragraph or making a voicemail greeting, completely usable.
- An NVIDIA GPU (RTX 3060 or better) — comfortable. Generation feels close to real-time, and long scripts stop being a patience exercise.
- Disk space — the real requirement nobody mentions. Model files are gigabytes; budget ~5 GB free.
That's it. No account, no API key, no monthly quota. The "you need a gaming rig" idea comes from model training, which genuinely does need serious hardware — but cloning a voice with a pre-trained model is inference, and inference is cheap.
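If you'd rather check a machine before committing to a multi-gigabyte download, the requirements above fit in a few lines of Python. Treat this as an illustrative sketch, not code from Voice Cloner V2: the function names and the `MIN_FREE_GB` constant are made up for the example, the 64-bit test inspects the Python interpreter rather than Windows itself, and the GPU check assumes PyTorch is installed (it reports "cpu" when it isn't). RAM is left out because the standard library has no portable way to read it.

```python
import shutil
import struct

MIN_FREE_GB = 5  # assumption: the ~5 GB disk budget suggested above


def is_64bit():
    # Pointer size of the running Python build: 8 bytes means 64-bit.
    return struct.calcsize("P") * 8 == 64


def free_disk_gb(path="."):
    # Free space on the drive that holds `path`, in GiB.
    return shutil.disk_usage(path).free / 1024**3


def pick_device():
    """Return "cuda" if PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch  # optional; a missing or broken install means CPU mode
    except (ImportError, OSError):
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def preflight(path="."):
    # Collect the checks into one small report.
    free = free_disk_gb(path)
    return {
        "64bit": is_64bit(),
        "free_gb": round(free, 1),
        "enough_disk": free >= MIN_FREE_GB,
        "device": pick_device(),
    }
```

Calling `print(preflight())` shows at a glance whether you land in the "works" tier (device is "cpu") or the "comfortable" tier (device is "cuda").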
Your three realistic options
1. RBS Voice Cloner V2: the free Windows app I build. Full disclosure of bias, obviously, but it exists precisely for this use case. It's a ~2 GB download that bundles the entire AI runtime (PyTorch + CUDA 12.8, with automatic CPU fallback if you have no NVIDIA card), and it includes 16 built-in voices, custom clones from a ~30-second sample, 17 languages with auto-translate, and a built-in audio editor. After a one-time model download on first launch, it runs fully offline. Details and download here; no signup, no watermark.
2. RBS Voice Cloner V1 (legacy): the older, smaller build (~248 MB) that runs CPU-only. If you're on a weak laptop or slow internet and just want text-to-speech in 28+ languages with basic cloning, V1 still does the job. It's simply less polished than V2.
3. The DIY open-source route: if you're comfortable with Python environments, the open-source ecosystem gives you maximum control. I compared the serious contenders in my open-source voice cloning roundup. Fair warning: the time cost is real, and half the GitHub issues on these projects are Windows dependency problems, which is the exact pain V2's bundled runtime exists to remove.
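For a sense of what option 3 involves, here is a minimal sketch using the open-source Coqui TTS package, which is one way to run an XTTS model from Python. The package choice, the model id, and the `clone_voice` helper are my assumptions for illustration; other projects in the roundup have different APIs. It also assumes you have already installed PyTorch and the TTS package, which is exactly the part that tends to go wrong on Windows.

```python
from pathlib import Path

# Assumption: the XTTS v2 model id as published for the Coqui TTS package.
XTTS_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, sample_wav, out_wav, language="en"):
    """Speak `text` in the voice from `sample_wav` and save it to `out_wav`."""
    sample = Path(sample_wav)
    if not sample.is_file():
        raise FileNotFoundError(f"voice sample not found: {sample}")

    # Heavy imports are deferred so the path check above fails fast.
    import torch
    from TTS.api import TTS

    # Use the GPU when PyTorch can see one, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # The first call fetches the model weights; later runs use the local copy.
    tts = TTS(XTTS_MODEL).to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(sample),
        language=language,
        file_path=str(out_wav),
    )
    return Path(out_wav)
```

A call such as `clone_voice("Hello from my own PC.", "my_sample.wav", "hello.wav")` would then write the result next to your script; both file names are placeholders.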
Step by step: fully offline cloning
1. Download and install. Grab Voice Cloner V2 (free). It's a big download because everything is bundled; that's the trade for never touching a Python environment.
2. Let it fetch models once. First launch downloads the voice models. This is the only moment internet is required; afterwards the app works with Wi-Fi off. (Press Ctrl+D inside the app for the Diagnose page, which shows whether you're running in GPU or CPU mode.)
3. Feed it a clean 30-second sample. One speaker, quiet room, no music. Sample quality matters more than sample length: 30 clean seconds beat 5 noisy minutes.
4. Type, pick a voice, generate. On CPU, be patient with your first generation; it's the slowest one. Short sentences render much faster than essays.
5. Edit and export. Trim silences, fade the ends, normalise the volume in the built-in editor, and export. The file, like everything else in this workflow, never leaves your PC.
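The edit step is less magic than it sounds. As a rough illustration (not the app's actual editor code), here is how trimming silence, fading the ends, and peak-normalising could look for a 16-bit mono WAV using only Python's standard library. The function names, the silence threshold of 500, the 20 ms fade, and the 0.9 peak target are all assumptions picked for the example.

```python
import array
import sys
import wave


def read_mono16(path):
    """Read a 16-bit mono PCM WAV and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM WAV")
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def write_mono16(path, samples, rate):
    """Write samples back out as a 16-bit mono PCM WAV."""
    out = array.array("h", samples)
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(out.tobytes())


def trim_silence(samples, threshold=500):
    # Keep everything between the first and last sample louder than threshold.
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return samples[:0]
    return samples[loud[0]:loud[-1] + 1]


def fade_ends(samples, rate, fade_ms=20):
    # Linear fade-in and fade-out so the clip does not start or end with a click.
    out = array.array("h", samples)
    n = min(len(out) // 2, rate * fade_ms // 1000)
    for i in range(n):
        gain = i / n
        out[i] = int(out[i] * gain)
        out[-1 - i] = int(out[-1 - i] * gain)
    return out


def normalise_peak(samples, target=0.9):
    # Scale so the loudest sample sits at `target` of 16-bit full scale.
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array.array("h", samples)
    scale = target * 32767 / peak
    return array.array("h", (int(s * scale) for s in samples))
```

Usage would look something like `samples, rate = read_mono16("take.wav")` followed by `write_mono16("clean.wav", normalise_peak(fade_ends(trim_silence(samples), rate)), rate)`, where both file names are placeholders. Everything runs locally, so the audio stays on your disk here too.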
What it sounds like — honest limits
A good local clone from a clean sample is convincing for narration, presentations, voiceovers, and personal projects. It is not a perfect replica: emotional range is narrower than a real performance, very long passages can drift in tone, and strong accents clone less faithfully than neutral speech. Cloud services with billion-parameter models still edge out local tools on raw naturalness — that's the genuine trade-off you're making for privacy and zero cost. For most practical uses, the local result is more than good enough; for a film performance, it isn't. Anyone who tells you otherwise is selling something.
Keep it legal — clone your own voice, or get consent
Offline doesn't mean lawless. Cloning your own voice is fine everywhere. Cloning someone else's without consent ranges from rude to criminal depending on where you live and what you do with it — impersonation, fraud, and deepfake laws all apply regardless of which tool you used. I wrote a full plain-English breakdown in the voice cloning legality guide; the one-line version is: consent, disclosure, and don't be a scammer. The offline part protects your privacy — it doesn't change your obligations to everyone else's.
Bottom line
Voice cloning without the cloud is not a compromise setup anymore — it's the sensible default for anyone who thinks about where their biometric data goes. A normal Windows PC handles it, a GPU merely makes it faster, and the whole thing costs nothing. Keep your voice on your own disk.
Everything I build runs offline for the same reason this post exists — RBS Voice Cloner V2 for voice work, RBS PDF Editor for documents, RBS PC Cleaner for upkeep. Free, no accounts, no telemetry. Made by Rai, solo dev, Singapore.