A local-first, low-latency, professional AI dubbing extension for YouTube.
LocalDub is a Chrome Extension and Python backend combo that translates and dubs YouTube videos into your native language (English, Turkish, German, Spanish, French) in real time. You do not pay for any cloud APIs. Audio extraction, AI speech-to-text, and LLM translation happen entirely on your local machine; speech is synthesized with Edge-TTS, which sends the translated text to Microsoft's online voice service.
Important
LocalDub features a Zero-Drift HTML5 Time-Stretching Algorithm. When the AI generates a translation that is longer or shorter than the original video scene, the browser adjusts the audio's playback speed to fit the scene while preserving the speaker's pitch. Because every dubbed clip is fitted to its own scene, the dub stays aligned with the video frames and delay does not accumulate from one scene to the next.
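The arithmetic behind this kind of time-stretching can be sketched as follows. This is a minimal illustration, not LocalDub's actual code: the function name and the clamp limits (0.5x to 2.0x) are assumptions. In the browser, the resulting value would be assigned to the audio element's `playbackRate`, with `preservesPitch` left enabled.

```python
def playback_rate(dub_seconds: float, slot_seconds: float,
                  min_rate: float = 0.5, max_rate: float = 2.0) -> float:
    """Return the speed at which a dubbed clip must play to fill its slot.

    dub_seconds  -- duration of the synthesized (translated) audio
    slot_seconds -- duration of the original scene it has to fit
    min_rate / max_rate -- hypothetical clamp limits, chosen for illustration
    """
    if dub_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    # A clip longer than its slot must play faster (rate > 1), and vice versa.
    rate = dub_seconds / slot_seconds
    # Clamp so extreme mismatches do not produce unintelligible speech.
    return max(min_rate, min(max_rate, rate))


# A 4.5 s dub squeezed into a 3.0 s scene plays at 1.5x.
fast = playback_rate(4.5, 3.0)
# A 1.5 s dub stretched over a 3.0 s scene plays at 0.5x.
slow = playback_rate(1.5, 3.0)
```

Since each clip gets its own rate and starts at its own scene boundary, a timing error in one clip is not carried into the next.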
The system operates using a seamless pipeline between the browser and your local GPU/CPU:
sequenceDiagram
participant YT as YouTube (Chrome Extension)
participant FF as FFmpeg (Downloader)
participant ASR as Faster-Whisper (Speech-to-Text)
participant LLM as Ollama Gemma2 (Translator)
participant TTS as Edge-TTS (Voice Synthesizer)
YT->>FF: "Video is at 0:03. Give me the audio."
FF->>ASR: Extracts 3-second raw audio chunk
ASR->>LLM: Transcribes: "Hello guys"
LLM->>TTS: Translates: "Merhaba arkadaşlar"
TTS->>YT: Sends perfectly timed MP3 via WebSockets
Note over YT,TTS: The extension receives the audio, dynamically calculates playbackRate, and plays it in flawless lip-sync!
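The first hop in the diagram, cutting a short audio chunk at the current video position, could look roughly like this. It is a sketch under assumptions: the helper name and the `source` argument are hypothetical, and how LocalDub obtains the audio stream is not described here. The FFmpeg flags themselves are standard (`-ss` seek, `-t` duration, `-vn` drop video, `-ac`/`-ar` channel count and sample rate).

```python
def ffmpeg_chunk_args(source: str, start_seconds: float,
                      chunk_seconds: float = 3.0) -> list[str]:
    """Build an FFmpeg command that extracts one mono 16 kHz WAV chunk.

    source        -- hypothetical path or URL of the audio stream
    start_seconds -- position in the video where the chunk begins
    chunk_seconds -- chunk length; 3 s mirrors the diagram above
    """
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-ss", f"{start_seconds:.3f}",   # seek before opening the input (fast)
        "-t", f"{chunk_seconds:.3f}",    # read only this much audio
        "-i", source,
        "-vn",                           # ignore the video stream
        "-ac", "1", "-ar", "16000",      # mono, 16 kHz: what Whisper models expect
        "-f", "wav", "pipe:1",           # write WAV bytes to stdout
    ]


args = ffmpeg_chunk_args("input.webm", start_seconds=3)
```

Passing the list to `subprocess.run(args, capture_output=True, check=True)` would return the chunk's bytes in `.stdout`, ready to hand to the speech-to-text stage.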
- Faster-Whisper: A fast, accurate offline speech-to-text (ASR) engine. Includes a Voice Activity Detection (VAD) filter to strip silent frames.
- Ollama (Gemma2 / Llama3): A tightly constrained local LLM pipeline. It uses few-shot completion prompting, temperature 0.0, and strict stop tokens (`\n`) to act as a pure machine translator. This keeps it from adding conversational text and helps it preserve technical terms (e.g., Firewall, React, Cheatsheet).
- Edge-TTS: Microsoft's natural-sounding neural voice synthesis engine. Note that the `edge-tts` package calls Microsoft's online speech service, so this step needs an internet connection.
- HTML5 Time-Stretching: The frontend engine that scales each audio clip's length to fit its visual scene while preserving pitch, so the speech does not sound distorted.
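To make the translator settings concrete, here is one way a few-shot completion request to Ollama could be assembled. Treat it as a sketch: the helper names and the second example pair are illustrative assumptions, not LocalDub's actual prompt. The payload fields (`model`, `prompt`, `stream`, `options.temperature`, `options.stop`) are those of Ollama's `/api/generate` endpoint, which listens on `http://localhost:11434` by default.

```python
# Illustrative few-shot pairs; the first comes from the diagram above,
# the second is a made-up example.
FEW_SHOT = [
    ("Hello guys", "Merhaba arkadaşlar"),
    ("Thank you", "Teşekkür ederim"),
]


def build_prompt(text: str, source: str = "English", target: str = "Turkish",
                 examples=FEW_SHOT) -> str:
    """Lay out the examples as 'Source: ... / Target: ...' lines and leave
    the final target line open so the model simply completes it."""
    lines = []
    for src, dst in examples:
        lines.append(f"{source}: {src}")
        lines.append(f"{target}: {dst}")
    # Collapse newlines so the input stays on one line, matching the examples.
    lines.append(f"{source}: {' '.join(text.split())}")
    lines.append(f"{target}:")
    return "\n".join(lines)


def build_request(text: str, model: str = "gemma2:9b") -> dict:
    """Payload for POST /api/generate with deterministic, one-line output."""
    return {
        "model": model,
        "prompt": build_prompt(text),
        "stream": False,
        "options": {
            "temperature": 0.0,  # always pick the most likely token
            "stop": ["\n"],      # end generation at the first line break
        },
    }


payload = build_request("Open the Firewall settings")
```

The newline stop token is what keeps the model from continuing with another invented example or a chatty explanation after the translated line.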
Follow these instructions carefully to set up your local AI dubbing studio.
- Download and install Python 3.12 from Python.org.
- Critical: During installation, make sure to check the box that says "Add Python to PATH".
- Download FFmpeg. Extract the folder and add the `bin` directory to your Windows Environment Variables (PATH). Open your Command Prompt (CMD) and type `ffmpeg` to verify it is installed correctly.
- Download and install Ollama from Ollama.com.
- Open your terminal (CMD or PowerShell) and pull the translation model by running:
  `ollama run gemma2:9b`
  (Note: This is a 5GB+ download. Wait for it to finish and then close the terminal).
- Clone or download this repository to your computer.
- Open a terminal inside the downloaded folder and navigate to the `backend` directory.
- Install the required Python libraries:
  `cd backend`
  `pip install -r requirements.txt`
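Before moving on, you may want to confirm that the prerequisites above are visible from your terminal. The script below is a hypothetical helper, not part of the repository: it only checks that `ffmpeg` and `ollama` resolve on PATH and that the running interpreter is Python 3.12, the version this guide asks for.

```python
import shutil
import sys

REQUIRED_TOOLS = ("ffmpeg", "ollama")  # executables installed in the steps above
EXPECTED_PYTHON = (3, 12)              # version named in this guide


def preflight(which=shutil.which, version=sys.version_info) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good.

    `which` and `version` are parameters only so the check can be tested
    without touching the real system.
    """
    problems = []
    if tuple(version[:2]) != EXPECTED_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, guide targets 3.12"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["All prerequisites found."]:
        print(line)
```

A different Python version is reported but not treated as fatal, since this guide does not say which other versions work.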
- Open Google Chrome and go to `chrome://extensions/`.
- Enable Developer mode using the toggle in the top right corner.
- Click the Load unpacked button in the top left.
- Select the `extension` folder located inside the LocalDub project directory.
- The LocalDub logo will now appear in your browser's extension bar!
- Start the AI Server: Open a terminal in the `backend` folder and run:
  `python main.py`
  (Wait until you see "WebSocket connected" and "Uvicorn running" in the logs).
- Start Dubbing:
  - Open any foreign language video on YouTube.
  - Click the LocalDub extension icon in your browser.
  - Select your desired target language from the dropdown menu.
  - Click Enable Dubbing.
The video will pause for 1-2 seconds to buffer the initial AI generation, and then play continuously with perfectly synced, natural-sounding AI dubbing!
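If dubbing does not start, a quick check is whether the backend is accepting connections at all. The snippet below is a sketch with assumed values: port 8000 is Uvicorn's default, but the port LocalDub's `main.py` actually binds is not stated here, so adjust it to match your logs.

```python
import socket


def backend_is_listening(host: str = "127.0.0.1", port: int = 8000,
                         timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    The default port is an assumption (Uvicorn's default), not a value
    taken from the LocalDub source.
    """
    try:
        # A successful connect means a server is bound and listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if backend_is_listening() else "not reachable"
    print(f"Backend is {state} on port 8000")
```

This only proves that a process is listening on the port; it does not verify that the WebSocket endpoint or the AI models are ready.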
We are actively working on upgrading LocalDub to compete with multi-million dollar AI dubbing startups:
- Zero-Shot Voice Cloning: Instead of using Microsoft's default voices, the backend will dynamically clone the original YouTuber's voice using `XTTSv2` or `CosyVoice`.
- Audio Separation (Background Music Preservation): Integrating Facebook's `Demucs` AI to strip ONLY the vocals from the video. Background music, explosions, and sound effects will be preserved and mixed beneath the translated AI dub.
- Speaker Diarization: Integrating `Pyannote.audio` to detect multiple speakers in interviews or podcasts (Speaker A vs. Speaker B) and assign them distinct AI voices automatically.
- Semantic Chunking: Advanced VAD logic that chunks audio based on breathing and sentence completion rather than strict 3-second blocks, so that each chunk sent to the translator is a complete sentence.
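As a rough idea of what the planned semantic chunking could involve, the sketch below groups transcribed segments into chunks that end at sentence punctuation or at a long pause. Everything in it is an assumption made for illustration (the `Segment` type, the 0.6 s pause threshold, the 8 s length cap); the feature is on the roadmap and its real design may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def chunk_segments(segments, max_pause: float = 0.6, max_len: float = 8.0):
    """Group segments into chunks that close on a pause or a finished sentence.

    max_pause -- hypothetical silence length treated as a breath or break
    max_len   -- hypothetical cap so a chunk never grows without bound
    """
    chunks, current = [], []
    for seg in segments:
        if current:
            pause = seg.start - current[-1].end
            too_long = seg.end - current[0].start > max_len
            if pause > max_pause or too_long:
                chunks.append(current)   # close the chunk before this segment
                current = []
        current.append(seg)
        if seg.text.rstrip().endswith((".", "?", "!")):
            chunks.append(current)       # sentence finished: close right away
            current = []
    if current:
        chunks.append(current)           # flush whatever is left
    return chunks


demo = [
    Segment(0.0, 1.0, "Hello guys,"),
    Segment(1.1, 2.0, "welcome back."),
    Segment(2.1, 3.0, "Today we"),
    Segment(4.5, 5.0, "start"),
]
grouped = chunk_segments(demo)
```

In this example the first two segments form one chunk because the sentence ends at "welcome back.", while the 1.5 s silence before "start" splits the remaining two segments apart.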
This project is licensed under the MIT License. Feel free to use, modify, and distribute it.
