If folks are interested, @antirez has open-sourced a C implementation of Voxtral Mini 4B here: https://github.com/antirez/voxtral.c
I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.
There is also another Mistral implementation: https://github.com/EricLBuehler/mistral.rs Not sure what the difference is, but it just seems to be better received overall.
mistral.rs is more like llama.cpp: it's a full inference library written in Rust that supports a ton of models and many hardware architectures, not just Mistral models.
hey,
How does someone get started with things like this (writing inference code, CUDA, etc.)? Any guidance is appreciated. I understand one doesn't just sit down and write these things directly; it takes some background reading. Would be great to receive some pointers.
You know, I love this comment because you are where I was 15 years ago when I naively decided that I wanted to do my master's in medical biophysics and try to use NVIDIA CUDA to help accelerate some of the work that we were doing. So I have a very... storied history with NVIDIA CUDA, but frankly, it's been years since I've actually written C code at all, let alone CUDA.
I have to admit that I wrote none of the code in this repo. I asked Codex to go and do it for me. I did a lot of prompting and guidance through some of the benchmarking and tools that I expected it to use to get the result that I was looking for.
Most of the plans that it generated were outside of my wheelhouse and not something I'm particularly familiar with, but I know the domain well enough to see that its plan roughly made sense, so I just let it go. The fact that this worked at all is a miracle, and I cannot take credit for it beyond telling the AI what I wanted and, in loose terms, how to do it, and helping it when it got stuck.
BTW, everything above was dictated with the code we generated, except for this sentence. And I added line breaks for paragraphs. That's it.
These are good lectures, and there is also a Discord: https://github.com/gpu-mode/lectures
Same! Would love any resources. I'm interested more in making models run vs making the models themselves :)
I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.
Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?
The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.
It's not fast enough to be realtime, though you could do a more advanced UI with a ring buffer and have it work as you describe (e.g. I do this with Whisper in Flutter, and also run GGUF inference through llama.cpp via Dart). A rough sketch of the ring-buffer approach is below.
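A minimal sketch of that idea in Rust, assuming 16 kHz mono samples; `transcribe_window` is a hypothetical stand-in for the actual model call:

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the actual model call.
fn transcribe_window(samples: &[f32]) -> String {
    format!("<transcript of {} samples>", samples.len())
}

const SAMPLE_RATE: usize = 16_000; // 16 kHz mono audio
const WINDOW_SECS: usize = 5;      // keep the last 5 seconds
const CAPACITY: usize = SAMPLE_RATE * WINDOW_SECS;

/// Fixed-size ring buffer: old samples fall off the front as new ones arrive.
struct AudioRing {
    buf: VecDeque<f32>,
}

impl AudioRing {
    fn new() -> Self {
        Self { buf: VecDeque::with_capacity(CAPACITY) }
    }

    /// Append a microphone chunk, dropping the oldest samples if full.
    fn push_chunk(&mut self, chunk: &[f32]) {
        for &s in chunk {
            if self.buf.len() == CAPACITY {
                self.buf.pop_front();
            }
            self.buf.push_back(s);
        }
    }

    /// Snapshot the current window for the (slower) transcription pass.
    fn window(&self) -> Vec<f32> {
        self.buf.iter().copied().collect()
    }
}

fn main() {
    let mut ring = AudioRing::new();
    // Simulate ~100 ms microphone callbacks of silence.
    for _ in 0..100 {
        ring.push_chunk(&vec![0.0f32; SAMPLE_RATE / 10]);
    }
    println!("{}", transcribe_window(&ring.window()));
}
```

A real UI would re-run this on a timer and only display the stable prefix of the latest result.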
This isn't even close to realtime on M4 Max. Whisper's ~realtime on any device post-2022 with an ONNX implementation. The extra inference cost isn't worth the WER decrease on consumer hardware, or at least, wouldn't be worth the time implementing.
Hello, I pushed up and merged a PR that greatly improves performance on CUDA, Metal, and in WASM.
Depending on your hardware, the model is definitely real time (able to transcribe audio faster than the length of the audio).
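For anyone unfamiliar with the metric: real-time factor (RTF) is processing time divided by audio duration, so anything below 1.0 is faster than realtime. A tiny Rust sketch with made-up numbers:

```rust
use std::time::Duration;

/// Real-time factor: processing time divided by audio duration.
/// RTF < 1.0 means the transcriber keeps up with the incoming audio.
fn real_time_factor(processing: Duration, audio: Duration) -> f64 {
    processing.as_secs_f64() / audio.as_secs_f64()
}

fn main() {
    // Made-up example: 25 s of processing for a 60 s clip.
    let rtf = real_time_factor(Duration::from_secs(25), Duration::from_secs(60));
    println!("RTF = {rtf:.3}"); // 0.417, i.e. faster than realtime
    assert!(rtf < 1.0, "not realtime on this hardware");
}
```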
Kudos, this is where it's at: open models running on-premise. Preferred by users and businesses. Glad Mistral's got that figured out.
Mistral can really end up having its Red Hat moment. I think open models will only get more interesting from here.
Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM as well as improve stability.
Here are the latest benchmarks running on DGX Spark:
https://github.com/TrevorS/voxtral-mini-realtime-rs#benchmar...
Awesome work! Would be good to have it work with handy.computer. Also, are there plans to support streaming?
Just tried out Handy. This is a much better and more lightweight UI than the previous solutions I've tried! I know it wasn't your intention, but thank you for the recommendation!
That said, I now agree with your original statement and really want Voxtral support...
Handy is awesome and easy to fork! I highly recommend building it from source and submitting PRs if there are any features you want. The author is highly responsive and open to vibe-coded PRs as long as you do a good job. (Obviously you should read the code and stand by it before you submit a PR; I just mean he doesn't flatly reject all AI code like some other projects do.) I submitted a PR recently to add an onboarding flow for Macs that just got merged, so now I'm hooked.
thanks for your contribution :)
I'm looking into porting this into transcribe-rs so handy can use it.
The first cut will probably not be a streaming implementation
Okay... so I cannot get this to run on my Mac. Maybe something with the Burn kernels for the quantized path?
Will file a GitHub issue.
this should be fixed
I don't know anything about these models, but I've been trying Nvidia's Parakeet and it works great. For a model like this that's 9GB for the full model, do you have to keep it loaded into GPU memory at all times for it to really work realtime? Or what's the delay like to load all the weights each time you want to use it?
Same here. I haven’t found an ASR/STT/transcription setup that beats Parakeet V3 on the speed/accuracy tradeoff spectrum: transcription is extremely fast (near instant for a couple sentences, 1-3 seconds for long ramblings), and the slight accuracy drop relative to heavier/slower models is immaterial for the use case of talking to AIs that can “read between the lines” (terminal coding agents etc).
I use Parakeet V3 in the excellent Handy [1] open source app. I tried incorporating the C-language implementation mentioned by others, into Handy, but it was significantly slower. Speed is absolutely critical for good UX in STT.
[1] https://github.com/cjpais/Handy
Can you use Handy exclusively via the CLI if you have a file to feed it?
Not sure about that
Not currently
Personally I run an ollama server. Models load pretty quickly.
There's a distinction between tokens per second and time to first token.
Delays come for me when I have to load a new model, or if I'm swapping in a particularly large context.
Most of the time, since the model is already loaded and I'm starting with a small context that builds over time, tokens per second is the biggest factor.
It's worth noting I don't do much fancy stuff, just a tiny bit of agent work; I mainly use qwen-coder 30a3b or qwen2.5-coder instruct/base 7b.
I'm finding more complex agent setups, where multiple agents are used, can really slow things down if they're swapping large contexts. ik_llama has prompt caching, which helps speed this up when swapping between agent contexts, up to a point.
tl;dr: loading weights each time isn't much of a problem, unless you're having to switch between models and contexts a lot, which modern agent setups are starting to require. (A rough way to see the TTFT vs. tokens-per-second difference is sketched below.)
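To make that distinction concrete, here is a toy Rust sketch that measures time to first token and decode speed over a simulated token stream; `fake_token_stream` is a stand-in for whatever your server actually returns:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Stand-in for a streaming generation: a slow first token (prompt
/// processing / weight loading), then steady decoding.
fn fake_token_stream() -> impl Iterator<Item = String> {
    (0..50).map(|i| {
        sleep(if i == 0 { Duration::from_millis(800) } else { Duration::from_millis(40) });
        format!("tok{i}")
    })
}

fn main() {
    let start = Instant::now();
    let mut first_token_at = None;
    let mut count = 0usize;

    for _token in fake_token_stream() {
        if first_token_at.is_none() {
            first_token_at = Some(start.elapsed()); // time to first token
        }
        count += 1;
    }

    let total = start.elapsed();
    let ttft = first_token_at.unwrap_or(total);
    // Decode speed, excluding the wait for the first token.
    let tok_per_sec = count.saturating_sub(1) as f64 / (total - ttft).as_secs_f64();
    println!("time to first token: {ttft:?}, decode speed: {tok_per_sec:.1} tok/s");
}
```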
hm, seems broken on my machine (Firefox, Asahi Linux, M1 Pro). I said hello into the mic, and it churned for a minute or so before giving me:
panorama panorama panorama panorama panorama panorama panorama panorama� molest rist moundothe exh� Invothe molest Yan artist��������� Yan Yan Yan Yan Yanothe Yan Yan Yan Yan Yan Yan Yan
Please try again. The model weights are unchanged, but the inference code is improved.
Look, I think it's great that it runs in the browser and all, but I don't want to live in a world where it's normal for a website to download 2.5 GB in the background to run something.
I recently dug into this while trying to benchmark using Gemini Nano (Chrome's built-in AI model) vs. a server-side solution for a side project.
Nano is stored in localStorage with shared access across sites (because Google), so users only need to download it once, which I don't think is the case with Mistral, etc.
There are some other production stats around adoption, availability, and performance that were interesting as well:
https://sendcheckit.com/blog/ai-powered-subject-line-alterna...
You have already gotten used to loading multiple megabytes just to display a static landing page. You'll get used to this as well... just give it some time :-D
It's obviously not something you'd want to happen _passively_ when visiting a web page, but if the alternative is installing an executable / using a package manager / etc., why not? At least the browser is a more secure, sandboxed environment for running untrusted code than most people's native OS.
It's cool but do I really want a single browser tab downloading 2.5 GB of data and then just leaving it to be ephemerally deleted? I know the internet is fast now and disk space is cheap but I have trouble bringing myself around to this way of doing things. It feels so inefficient. I do like the idea of client-side compute, but I feel like a model (or anything) this big belongs on the server.
I don't think local inference in browsers, as it stands, will take off, simply because of the lead time of downloading the model, but a new web API for LLMs could change that: some standard API to communicate with the user's preferred model, abstracting over local inference (like what Chrome does with Gemini Nano?) and remote inference (LM Studio, or calling out to a provider). This way, every site that wants a language model would just ask the browser for it, and the weights would be shared on disk across sites.
There will always be someone unhappy about literally any aspect of something new. If 2.5 GB for a local LLM is problematic in 2026, I really cannot think of what is safe anymore.
We went from impossible, to centralised, to local in a couple of years, and the "cost" is 2.5 GB of disk space.
Neat, and neat to see the Burn framework getting used. I tried this on the latest Chromium, but my system froze until my OS killed Chromium. My VPN connection died right after downloading the model too (it doesn't have a bandwidth cap either, so I'm not sure what's happening).
This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)
Nice!
I'm interested in your cubecl-wgpu patches. I've been struggling to get lower-than-FP32 safetensors models working on Burn. Did you write the patches to cubecl-wgpu to get around this restriction, to add support for GGUF files, or both?
I've been working on something similar, but for Whisper, and as a library for other projects: https://github.com/Scronkfinkle/quiet-crab
The cubecl-wgpu patches were only needed to reduce the number of kernel workgroups; otherwise I was getting errors in WASM.
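I'm not familiar with the actual patch, but for anyone hitting the same limit: the general workaround is to split one huge dispatch into several smaller ones that each stay under WebGPU's per-dimension workgroup cap. A rough Rust sketch of just the chunking math, where the `dispatch` closure stands in for the real wgpu/CubeCL call:

```rust
/// WebGPU's default maxComputeWorkgroupsPerDimension is 65_535, and some
/// WASM/WebGPU stacks reject dispatches above it. Split the work instead.
const MAX_WORKGROUPS_PER_DIM: u32 = 65_535;

/// Run `dispatch(offset, count)` as many times as needed to cover
/// `total_workgroups` without any single dispatch exceeding the cap.
fn dispatch_chunked(total_workgroups: u32, mut dispatch: impl FnMut(u32, u32)) {
    let mut offset = 0;
    while offset < total_workgroups {
        let count = (total_workgroups - offset).min(MAX_WORKGROUPS_PER_DIM);
        // The kernel would add `offset` to its workgroup id (e.g. via a
        // push constant or uniform) so each chunk covers a different range.
        dispatch(offset, count);
        offset += count;
    }
}

fn main() {
    let mut launched = Vec::new();
    dispatch_chunked(150_000, |offset, count| launched.push((offset, count)));
    // Three dispatches: 65_535 + 65_535 + 18_930 = 150_000 workgroups.
    println!("{launched:?}");
}
```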
Naive, semi-related question: what is the state of stuff like Mistral when compared to OpenAI, Anthropic, etc?
Could I reasonably use this to get LLM-capability privately on a machine (and get decent output), or is it still in the "yeah it does the thing, but not as well as the commercial stuff" category?
This stuff is cool. So is Whisper. But I keep hoping for something that can run close to real time on a Raspberry Pi Zero 2 with a reasonable English vocabulary.
Right now everything is either archaic or requires too much RAM. CPU isn't as big an issue as you'd think, because the Pi Zero 2 is comparable to a Pi 3.
For those exploring browser STT, this sits in an interesting space between Whisper.wasm and the Deepgram KC client. The 2.5GB quantized footprint is notably smaller than most Whisper variants — any thoughts on accuracy tradeoffs compared to Whisper base/small?
I wonder if there's a metric or measure of how much jargon goes into a README or other document.
Reading the first three sentences of this README: out of 43 words, I would consider 15 terms to be jargon incomprehensible to the layman.
> Streaming speech recognition running natively and in the browser. A pure Rust implementation of Mistral's Voxtral Mini 4B Realtime model using the Burn ML framework.
> The Q4 GGUF quantized path (2.5 GB) runs entirely client-side in a browser tab via WASM + WebGPU. Try it live.
Excluding names (Mistral's Voxtral Mini 4B Realtime), you have one pretty normal sentence introducing what this is (streaming speech recognition running natively and in the browser), and the rest is technical details.
It's like complaining that a car description would mention engine size and output in the third sentence.
Just curious: is there any smaller version of this model capable of running on edge devices? Even my M1 Mac with 8 GB of RAM couldn't run the C version.
I'm curious: are you able to run the model from the CLI now?
This semi-quantized version targets the Jetson Orin Nano, but only comes with a simple inference engine.
https://huggingface.co/Teaspoon-AI/Voxtral-Mini-4B-INT4-Jets...
https://kyutai.org/stt has an implementation for MLX and mentions iPhones, so it should work on edge devices, Macs and iPhones.
Man, I'd love to fine-tune this, but alas the Hugging Face implementation isn't out as far as I can tell.
Uggh. I had just started working on this. Congratulations to the author!
>init failed: Worker error: Uncaught RuntimeError: unreachable
Anything I can do to fix/try it on Brave?
Would check memory and ensure you have free RAM. Tested here: https://imgur.com/a/3vLJ6no (not perfect dictation, but close enough).
I have 92GB of RAM
I had to enable the following Chrome flag for this to load:
chrome://flags/#enable-unsafe-webgpu
Does disabling shields help?
No, that's usually what I try but it still didn't work.
(no speech detected)
or... when I'm not saying anything, it generates random German sentences.
Notably, this isn't even close to realtime. M4 Max.
True :)
After some performance improvements, it is realtime on my DGX Spark with an RTF of 0.416 -- now getting ~19.5 tokens per second. Check it out and see if it's better for you.
I just tried it, I said "what's up buddy, hey hey stop" and it transcribed this for me: " وطبعا هاي هاي هاي ستوب" No, I'm not in any arabic or middle eastern country. The second test was better, it detected english.
FWIW, that is the right-ish transliteration into Arabic. It just picked the wrong language to transcribe to, lol.
I read that the embedding vectors for the same phrase in multiple languages are very similar, to the point where if a Russian speaker writes English, the model might think it's Russian.
Yep it's called the "universal geometry of embeddings".
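If you want to sanity-check that claim on a model you have locally, the usual measure is cosine similarity between the two phrase embeddings. A toy Rust sketch with made-up vectors:

```rust
/// Cosine similarity: 1.0 means the vectors point the same way,
/// 0.0 means unrelated, -1.0 means opposite.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Made-up embeddings standing in for the same phrase in two languages.
    let english = [0.8, 0.1, 0.3, 0.5];
    let russian = [0.7, 0.2, 0.3, 0.6];
    println!("similarity: {:.3}", cosine_similarity(&english, &russian));
}
```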
Impressive, but to state the obvious, this is not yet practical for browser use due to its (at least) 2.5 GB memory footprint.