Real-time, Fully Local Speech-to-Text with Speaker Diarization
This project is based on Whisper Streaming and SimulStreaming, allowing you to transcribe audio directly from your browser. WhisperLiveKit provides a complete backend solution for real-time speech transcription with a functional, simple and customizable frontend. Everything runs locally on your machine ✨
WhisperLiveKit consists of three main components:
- Frontend: A basic html + JS interface that captures microphone audio and streams it to the backend via WebSockets. You can use and adapt the provided template at whisperlivekit/web/live_transcription.html.
- Backend (Web Server): A FastAPI-based WebSocket server that receives streamed audio data, processes it in real time, and returns transcriptions to the frontend. This is where the WebSocket logic and routing live.
- Core Backend (Library Logic): A server-agnostic core that handles audio processing, ASR, and diarization. It exposes reusable components that take in audio bytes and return transcriptions (see the minimal sketch below).
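As a rough idea of how the core backend pieces fit together outside any web server, here is a minimal sketch that reuses the `TranscriptionEngine`/`AudioProcessor` calls from the FastAPI example later in this README; the audio source (`chunks`) and the way the loop ends are illustrative assumptions, not part of the library's documented API.

```python
# Sketch: feed raw audio bytes to the core library without any web server.
# The calls mirror the FastAPI example later in this README; `chunks` stands
# in for your own audio source (an iterable of byte strings).
import asyncio
from whisperlivekit import TranscriptionEngine, AudioProcessor

async def transcribe(chunks):
    engine = TranscriptionEngine(model="tiny", lan="en")
    processor = AudioProcessor(transcription_engine=engine)
    results = await processor.create_tasks()

    async def feed():
        for chunk in chunks:
            await processor.process_audio(chunk)

    feeder = asyncio.create_task(feed())
    async for update in results:   # transcription updates arrive as they are produced
        print(update)
    await feeder
```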
- 🎙️ Real-time Transcription - Locally (or on-prem) convert speech to text instantly as you speak
- 👥 Speaker Diarization - Identify different speakers in real time using Diart
- Multi-User Support - Handle multiple users simultaneously with a single backend/server
- Automatic Silence Chunking - Automatically chunks when no audio is detected to limit buffer size
- ✅ Confidence Validation - Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
- 👁️ Buffering Preview - Displays unvalidated transcription segments (not yet compatible with SimulStreaming)
- Punctuation-Based Speaker Splitting [BETA] - Align speaker changes with natural sentence boundaries for more readable transcripts
- ⚡ SimulStreaming Backend - Ultra-low latency transcription using the state-of-the-art AlignAtt policy. To use it, copy the simul_whisper content into `whisperlivekit/simul_whisper`. You must comply with the Polyform license!
```bash
# Install the package
pip install whisperlivekit

# Start the transcription server
whisperlivekit-server --model tiny.en

# Open your browser at http://localhost:8000 to see the interface.
# Use the --ssl-certfile public.crt and --ssl-keyfile private.key parameters to enable SSL
```
That's it! Start speaking and watch your words appear on screen.
```bash
# Install from PyPI
pip install whisperlivekit

# Or install from source
git clone https://github.com/QuentinFuxa/WhisperLiveKit
cd WhisperLiveKit
pip install -e .
```
FFmpeg is required:
```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html and add to PATH
```
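If you want to double-check that FFmpeg is actually visible before launching the server, a tiny standard-library check (not part of WhisperLiveKit) is:

```python
# Convenience check: make sure the ffmpeg binary is on PATH.
import shutil

if shutil.which("ffmpeg") is None:
    raise RuntimeError("FFmpeg not found on PATH; install it before starting the server.")
```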
```bash
# Voice Activity Controller (prevents hallucinations)
pip install torch

# Sentence-based buffer trimming
pip install mosestokenizer wtpsplit
pip install tokenize_uk  # If you work with Ukrainian text

# Speaker diarization
pip install diart

# Alternative Whisper backends (default is faster-whisper)
pip install whisperlivekit[whisper]              # Original Whisper
pip install whisperlivekit[whisper-timestamped]  # Improved timestamps
pip install whisperlivekit[mlx-whisper]          # Apple Silicon optimization
pip install whisperlivekit[openai]               # OpenAI API
pip install whisperlivekit[simulstreaming]       # SimulStreaming backend
```
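When embedding the library in Python rather than using the CLI, the backend is presumably selected the same way the `--backend` flag does it; the `backend` keyword below is an assumption based on the `parse_args()`/`TranscriptionEngine(**vars(args))` pattern shown later in this README.

```python
# Hedged sketch: pick an alternative backend programmatically.
# The `backend` keyword mirrors the CLI's --backend flag (an assumption).
from whisperlivekit import TranscriptionEngine

engine = TranscriptionEngine(model="base", lan="en", backend="mlx-whisper")
```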
For diarization, you need access to the pyannote.audio models:

- Accept the user conditions for the `pyannote/segmentation` model
- Accept the user conditions for the `pyannote/segmentation-3.0` model
- Accept the user conditions for the `pyannote/embedding` model
- Log in with Hugging Face:

```bash
pip install huggingface_hub
huggingface-cli login
```
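If you prefer to authenticate from Python rather than the CLI, `huggingface_hub` also exposes a `login()` helper; the token below is a placeholder.

```python
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

login(token="hf_xxx")  # replace with your own Hugging Face access token
```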
Start the transcription server with various options:
```bash
# Basic server with English model
whisperlivekit-server --model tiny.en

# Advanced configuration with diarization
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto

# SimulStreaming backend for ultra-low latency
whisperlivekit-server --backend simulstreaming-whisper --model large-v3 --frame-threshold 20
```
Check basic_server.py for a complete example.
```python
from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from contextlib import asynccontextmanager
import asyncio
import logging

logger = logging.getLogger(__name__)

# Global variable for the transcription engine
transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global transcription_engine
    # Example: initialize with specific parameters directly.
    # You can also load from command-line arguments using parse_args():
    # args = parse_args()
    # transcription_engine = TranscriptionEngine(**vars(args))
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

# Serve the web interface
@app.get("/")
async def get():
    return HTMLResponse(get_web_interface_html())

# Forward transcription results to the connected client
async def handle_websocket_results(websocket: WebSocket, results_generator):
    try:
        async for response in results_generator:
            await websocket.send_json(response)
        await websocket.send_json({"type": "ready_to_stop"})
    except WebSocketDisconnect:
        logger.info("WebSocket disconnected during results handling.")

# Process WebSocket connections
@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        logger.info(f"Client disconnected: {websocket.client}")
    except Exception as e:
        await websocket.close(code=1011, reason=f"Server error: {e}")
    finally:
        results_task.cancel()
        try:
            await results_task
        except asyncio.CancelledError:
            logger.info("Results task successfully cancelled.")
```
The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it in `whisperlivekit/web/live_transcription.html`, or load its content with the `get_web_interface_html()` function from `whisperlivekit`:
```python
from whisperlivekit import get_web_interface_html

# ... later in your code, where you need the HTML string ...
html_content = get_web_interface_html()
```
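One simple way to start customizing the interface is to dump the bundled template to a file of your own and edit it; the output filename here is arbitrary.

```python
# Save the bundled template locally as a starting point for customization.
from pathlib import Path
from whisperlivekit import get_web_interface_html

Path("my_live_transcription.html").write_text(get_web_interface_html(), encoding="utf-8")
```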
WhisperLiveKit offers extensive configuration options:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--model` | Whisper model size. Caution: `.en` models do not work with SimulStreaming | `tiny` |
| `--language` | Source language code or `auto` | `en` |
| `--task` | `transcribe` or `translate` | `transcribe` |
| `--backend` | Processing backend | `faster-whisper` |
| `--diarization` | Enable speaker identification | `False` |
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
| `--vac` | Use Voice Activity Controller | `False` |
| `--no-vad` | Disable Voice Activity Detection | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--segmentation-model` | Hugging Face model ID for the pyannote.audio segmentation model | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for the pyannote.audio embedding model | `speechbrain/spkrec-ecapa-voxceleb` |
SimulStreaming-specific Options:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
| `--beams` | Number of beams for beam search (`1` = greedy decoding) | `1` |
| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |
| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |
| `--cif-ckpt-path` | Path to a CIF model for word boundary detection | `None` |
| `--never-fire` | Never truncate incomplete words | `False` |
| `--init-prompt` | Initial prompt for the model | `None` |
| `--static-init-prompt` | Static prompt that does not scroll | `None` |
| `--max-context-tokens` | Maximum number of context tokens | `None` |
| `--model-path` | Direct path to the `.pt` model file; downloaded if not found | `./base.pt` |
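When embedding WhisperLiveKit rather than running `whisperlivekit-server`, the same options can be consumed via `parse_args()`, as the commented-out snippet in the FastAPI example above suggests; forwarding every parsed option as a keyword is an assumption based on that snippet.

```python
# Sketch: reuse the CLI options above in an embedded setup.
from whisperlivekit import TranscriptionEngine, parse_args

args = parse_args()                          # reads --model, --diarization, --language, ...
engine = TranscriptionEngine(**vars(args))   # forward the parsed options to the engine
```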
- Audio Capture: Browser's MediaRecorder API captures audio in webm/opus format
- Streaming: Audio chunks are sent to the server via WebSocket
- Processing: Server decodes audio with FFmpeg and streams into Whisper for transcription
- Real-time Output:
- Partial transcriptions appear immediately in light gray (the 'aperçu')
- Finalized text appears in normal color
- (When enabled) Different speakers are identified and highlighted
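The same flow can be exercised without a browser. Below is a hedged sketch of a Python client that streams a pre-recorded WebM/Opus file to the `/asr` WebSocket endpoint (the endpoint path matches the FastAPI example above); the file name, chunk size, and pacing are illustrative assumptions, and the snippet relies on the third-party `websockets` package.

```python
# Hypothetical non-browser client: stream a WebM/Opus file to the server and
# print the JSON updates it sends back. "sample.webm" is a placeholder.
import asyncio
import json
import websockets

async def stream_file(path="sample.webm", url="ws://localhost:8000/asr"):
    async with websockets.connect(url) as ws:
        async def receive():
            async for message in ws:
                print(json.loads(message))    # transcription updates

        receiver = asyncio.create_task(receive())
        with open(path, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)          # raw audio bytes, like the browser sends
                await asyncio.sleep(0.1)      # rough pacing to mimic a live stream
        await asyncio.sleep(5)                # give the server time to flush final results
        receiver.cancel()

asyncio.run(stream_file())
```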
To deploy WhisperLiveKit in production:
1. **Server Setup (Backend)**:

   ```bash
   # Install a production ASGI server
   pip install uvicorn gunicorn

   # Launch with multiple workers
   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
   ```

2. **Frontend Integration**:
   - Host your customized version of the example HTML/JS in your web application
   - Ensure the WebSocket connection points to your server's address

3. **Nginx Configuration** (recommended for production):

   ```nginx
   server {
       listen 80;
       server_name your-domain.com;

       location / {
           proxy_pass http://localhost:8000;
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection "upgrade";
           proxy_set_header Host $host;
       }
   }
   ```

4. **HTTPS Support**: For secure deployments, use `wss://` instead of `ws://` in the WebSocket URL.
### Docker
A basic Dockerfile is provided that lets you re-use the Python package installation options. See the usage examples below:
**NOTE:** For **larger** models, ensure that your **docker runtime** has enough **memory** available.
#### All defaults
- Create a reusable image with only the basics and then run as a named container:
```bash
docker build -t whisperlivekit-defaults .
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
docker start -i whisperlivekit
```
Note: If you're running on a system without NVIDIA GPU support (such as a Mac with Apple Silicon or any system without CUDA capabilities), remove the `--gpus all` flag from the `docker create` command. Without GPU acceleration, transcription runs on the CPU only, which may be significantly slower; consider using small models on CPU-only systems. A quick way to check GPU visibility is sketched below.
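A quick diagnostic for whether CUDA is actually visible (inside the container or on the host) is to ask PyTorch, assuming it is installed:

```python
# Diagnostic: check whether PyTorch can see a CUDA device.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```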
- Customize the container options:
```bash
docker build -t whisperlivekit-defaults .
docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
docker start -i whisperlivekit-base
```
`--build-arg` options:
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set the necessary container options!
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
- `HF_TOKEN="./token"` - Add your Hugging Face Hub access token to download gated models
- Meeting Transcription: Capture discussions in real-time
- Accessibility Tools: Help hearing-impaired users follow conversations
- Content Creation: Transcribe podcasts or videos automatically
- Customer Service: Transcribe support calls with speaker identification
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Here's how to get started:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to your branch: `git push origin feature/amazing-feature`
- Open a Pull Request
This project builds upon the foundational work of:
- Whisper Streaming
- SimulStreaming (BETA backend)
- Diart
- OpenAI Whisper
We extend our gratitude to the original authors for their contributions.