I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.
Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
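As a rough illustration of the streaming idea (the endpoint path and framing below are placeholders, not the exact protocol the repo uses):

    # Placeholder sketch: stream 16-bit mono PCM chunks to the server as binary
    # WebSocket frames; the real client/server protocol in the repo differs in detail.
    import asyncio
    import websockets

    async def stream_audio(chunks, url="ws://localhost:8000/ws"):
        async with websockets.connect(url) as ws:
            for chunk in chunks:       # chunk: bytes of raw PCM from the microphone
                await ws.send(chunk)   # server transcribes, queries the LLM, streams TTS back

    # asyncio.run(stream_audio(mic_chunks()))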
Key aspects: Designed for local LLMs (Ollama primarily, OpenAI connector included). Interruptible conversation. Smart turn detection to avoid cutting the user off mid-thought. Dockerized setup available for easier dependency management.
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
The code is here: https://github.com/KoljaB/RealtimeVoiceChat
Neat! I'm already using openwebui/ollama with a 7900 xtx but the STT and TTS parts don't seem to work with it yet:
2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.
I've given up trying to locally use LLMs on AMD
Basically anything llama.cpp-based (Vulkan backend) should work out of the box without much fuss (LM Studio, Ollama, etc.).
The HIP backend can have a big prefill speed boost on some architectures (high-end RDNA3 for example). For everything else, I keep notes here: https://llm-tracker.info/howto/AMD-GPUs
Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models
In case it's not clear, I'm talking about the models referenced here. https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...
Yeah I really dislike the whisperiness of this voice "Lasinya". It sounds too much like an erotic phone service. I wonder if there's any alternative voice? I don't see Lasinya even mentioned in the public coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.
I tried to select kokoro in the python module but it says in the logs that only coqui is available. I do have to say the coqui models sound really good, it's just the type of voice that puts me off.
The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.
PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!
Yeah I know the voice polarizes, I trained it for myself, so it's not an official release. You can change the voice here:
https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...
Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy the speakers_xtts.pth from Lasinya, it's the same for every voice).
Then change the specific_model="Lasinya" line in audio_module.py into specific_model="some_folder_name".
If you change TTS_START_ENGINE to "kokoro" in server.py, it's supposed to work. What happens when you try? Can you post the log message?
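In sketch form, the two edits look like this (the exact surrounding lines may differ a bit from the current repo):

    # audio_module.py -- point the Coqui engine at your copied voice folder:
    specific_model = "some_folder_name"   # was: specific_model = "Lasinya"

    # server.py -- or switch the TTS engine instead:
    TTS_START_ENGINE = "kokoro"           # instead of the Coqui default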
Thank you!
I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for Coqui? I'm having some trouble finding them. From the demo page, it seems the idea is that you clone someone else's voice with that engine, because I don't see any voices listed. I hadn't come across it before.
And yes, I switched to Kokoro now. I thought it was the default already, but then I saw there were three lines configuring the same thing. So that's working. Kokoro isn't quite as good as Coqui though, which is why I'm asking about this. I also used Kokoro in Open WebUI and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case), and it looks like Coqui might be able to do that.
Haven't found many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my Hugging Face; quality is medium.
Coqui can do 17 languages. The problem in the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English-only corpus.
https://huggingface.co/coqui/XTTS-v2
Seems like they are out of business. Their homepage says "Coqui is shutting down"*. That's probably the reason you can't find much.
*https://coqui.ai/
The Lasinya voice is an XTTS 2.0.2 finetune I made with a self-created, synthesized dataset. I used https://github.com/daswer123/xtts-finetune-webui for training.
Have you looked at Pipecat? It seems similar, trying to do standardized backend/WebRTC turn-detection pipelines.
Did not look into that one. Looks quite good, I will try that soon.
Would you say you are using the best-in-class speech to text libs at the moment? I feel like this space is moving fast because the last time I was headed down this track, I was sure whisper-cpp was the best.
I'm not sure tbh. Whisper has been king for a long time now, especially with the CTranslate2 implementation from faster_whisper. But NVIDIA open-sourced Parakeet TDT today and it instantly went to no. 1 on the Open ASR Leaderboard. I'll have to evaluate these latest models; they look strong.
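For reference, the faster_whisper path looks like this in isolation (a minimal standalone sketch, not the repo's actual wiring):

    from faster_whisper import WhisperModel

    # base.en on the GPU via CTranslate2; float16 keeps transcription latency low
    model = WhisperModel("base.en", device="cuda", compute_type="float16")
    segments, _info = model.transcribe("utterance.wav", language="en")
    print(" ".join(segment.text for segment in segments))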
https://yummy-fir-7a4.notion.site/dia is the new hotness.
Tried that one. Quality is great, but generations sometimes fail and it's rather slow. It also needs ~13 GB of VRAM, so it's not my first choice for voice agents tbh.
alright, dumb question.
(1) I assume these things can do multiple languages
(2) Given (1), can you strip all the languages you aren't using and speed things up?
Actually a good question.
I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.
To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable it would introduce other issues in the synthesis. In a perfect world the model would use all the "free" parameters currently spent on other languages for better synthesis of the single trained language. That might be true to a degree, but it's not exactly how AI parameter scaling works.
I don't know what I'm talking about, but could you use distillation techniques?
Maybe. I didn't look into that much for Coqui XTTS. What I do know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
Parakeet is English-only. Stick with Whisper.
The core innovation is happening in TTS at the moment.
Yeah, I figured you would know. Thanks for that, bookmarking that asr leaderboard.
Very cool, thanks for sharing.
A couple questions:
- Any thoughts about wake word engines, to have something that listens without consuming resources all the time? The landscape for open solutions doesn't seem good.
- Any plan to allow using external services for STT/TTS for people who don't have a 4090 ready (at the cost of privacy, by going through SaaS providers)?
FWIW, wake words are a stopgap; if we want Star Trek-level voice interfaces, where the computer responds only when you actually mean to call it, as opposed to treating the wake word as a normal word in the conversation, the computer needs to be constantly listening.
A good analogy here is to think of the computer (assistant) as another person in the room, busy with their own stuff but paying attention to the conversations happening around them, in case someone suddenly requests their assistance.
This, of course, could be handled by a more lightweight LLM running locally and listening for explicit mentions/addressing the computer/assistant, as opposed to some context-free wake words.
Home Assistant is much nearer to this than other solutions.
You have a wake word, but it can also speak to you based on automations. You come home and it could tell you that the milk is empty, but with a holiday coming up you probably should go shopping.
I want that for privacy reasons and for resource reasons.
And having this as a small hardware device should not add relevant latency to it.
Privacy isn't a concern when everything is local
Yes it is.
Malware, bugs etc can happen.
And I also might not want to disable it for every guest either.
If the AI is local, it doesn't need to be on an internet connected device. At that point, malware and bugs in that stack don't add extra privacy risks* — but malware and bugs in all your other devices with microphones etc. remain a risk, even if the LLM is absolutely perfect by whatever standard that means for you.
* unless you put the AI on a robot body, but that's then your own new and exciting problem.
There is no privacy difference between a local LLM listening versus a local wake word model listening.
That would be quite easy to integrate. RealtimeSTT already has wakeword support for both pvporcupine and openwakewords.
Modify it with an ultra-light LLM agent that always listens and uses a wake word to agentically call the paid API?
You could use openWakeWord, which Home Assistant developed for its own Voice Assistant.
It was developed by David Scripka: https://github.com/dscripka/openWakeWord
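A minimal openWakeWord detection loop looks roughly like this (model download step and threshold are just examples):

    import numpy as np
    import openwakeword
    from openwakeword.model import Model

    openwakeword.utils.download_models()   # fetch the pretrained wake word models (first run only)
    oww = Model()                          # loads the downloaded pretrained wake words

    def heard_wake_word(frame: np.ndarray) -> bool:
        """frame: ~80 ms of 16 kHz, 16-bit mono audio (1280 int16 samples)."""
        scores = oww.predict(frame)                           # {model_name: score}
        return any(score > 0.5 for score in scores.values())  # 0.5 is an arbitrary threshold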
Neat!
I built something almost identical last week (closed source, not my IP) and I recommend: NeMo Parakeet (even faster than insanely_fast_whisper), F5-TTS (fast + very good quality voice cloning), and Qwen3-4B for the LLM (amazing quality).
This looks great, I'll definitely have a look. I'm just wondering: have you tested FastRTC from Hugging Face? I haven't, and I'm curious about speed between this vs. FastRTC vs. Pipecat.
Yes, I tested it. I'm not quite sure what they created there. It adds some noticeable latency compared to using raw WebSockets. IMHO it's not supposed to, but it did nevertheless in my tests.
Do you have any information on how long each step takes? Like how many ms are spent on each step of the pipeline?
I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?
LLM and TTS latency gets determined and logged at the start. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually something between 3 and 10 words). Then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en is negligible at under 5 ms, VAD the same. The turn detection model adds around 20 ms. I have zero clue if and how fast this runs on a Mac.
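Summed up, those per-stage figures give a rough time-to-first-audio budget (plain addition, ignoring any overlap between stages):

    stage_ms = {
        "vad_and_stt": 5,            # base.en Whisper + VAD, effectively negligible
        "turn_detection": 20,        # BERT-based end-of-turn classifier
        "llm_first_fragment": 220,   # first synthesizable sentence fragment from the LLM
        "tts_first_chunk": 80,       # first audio chunk from the TTS engine
    }
    print(sum(stage_ms.values()), "ms")   # ~325 ms to first audio under good conditions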
What is the min VRAM needed on the GPU to run this? I did not see that on the github
With the current 24B LLM it's 24 GB. I have no clue how far down you can go on the GPU by using smaller models; you can set the model in server.py. Quite sure 16 GB will work, but at some point it will probably fail.
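As a sketch of that change (the variable name here is illustrative, not necessarily what server.py actually calls it):

    # server.py -- swap the Ollama model tag for something smaller to reduce VRAM use;
    # LLM_START_MODEL is a placeholder name for whichever setting holds the model tag.
    LLM_START_MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"
    # LLM_START_MODEL = "llama3.1:8b"   # example of a smaller model for GPUs with less VRAM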
This looks great. What hardware do you use, or have you tested it on?
I only tested it on my 4090 so far
Are you using all local models, or does it also use cloud inference? Proprietary models?
Which models are running in which places?
Cool utility!
All local models:
- VAD: webrtcvad (first fast check) followed by Silero VAD (high-compute verification); see the sketch below
- Transcription: base.en Whisper (CTranslate2)
- Turn detection: KoljaB/SentenceFinishedClassification (self-trained BERT model)
- LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
- TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (the latter is slower)
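For illustration, the two-stage VAD idea in isolation (a standalone sketch, not the repo's actual code; frame sizes are chosen to satisfy both libraries):

    import numpy as np
    import torch
    import webrtcvad

    vad_fast = webrtcvad.Vad(2)   # WebRTC VAD, aggressiveness 0-3: cheap first-pass check
    silero, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")  # neural second pass

    SAMPLE_RATE = 16000

    def is_speech(chunk: bytes) -> bool:
        """chunk: 512 samples (32 ms) of 16 kHz, 16-bit mono PCM (1024 bytes)."""
        # Stage 1: webrtcvad only accepts 10/20/30 ms frames, so check the first 30 ms (960 bytes)
        if not vad_fast.is_speech(chunk[:960], SAMPLE_RATE):
            return False
        # Stage 2: Silero VAD verifies the full 512-sample chunk with a neural model
        audio = torch.from_numpy(np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0)
        return silero(audio, SAMPLE_RATE).item() > 0.5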
That's excellent. Really amazing bringing all of these together like this.
Hopefully we get an open weights version of Sesame [1] soon. Keep watching for it, because that'd make a killer addition to your app.
[1] https://www.sesame.com/
That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I feel they aren't planning to give us their top model soon.