Co-founder of Krisp here. 1.5B non-native English speakers in the workforce, 4x native — yet all comms infra is optimized for native accents. We spent 3 years building listener-side, on-device accent understanding. The hard parts: no parallel training data exists, the accent space is infinite, accent is entangled with voice identity, and it runs on CPU under 250ms latency. Built in Yerevan, Armenia. Beta is live and free. Happy to go deep on the ML side.
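To make the latency side concrete, here's a minimal sketch of the shape of the problem (not our actual pipeline): process short frames with only past context, and keep worst-case per-frame compute under the budget.

```python
import time
import numpy as np

SAMPLE_RATE = 16_000
FRAME_MS = 20
FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per 20ms frame

def convert_frame(frame: np.ndarray, state: dict) -> np.ndarray:
    """Stand-in for the model forward pass. A real streaming model
    carries causal state here (conv caches, RNN state) instead of
    re-reading past audio."""
    return frame  # identity placeholder

def stream(audio: np.ndarray, budget_ms: float = 250.0) -> np.ndarray:
    state: dict = {}
    out, worst = [], 0.0
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        t0 = time.perf_counter()
        out.append(convert_frame(audio[start:start + FRAME], state))
        worst = max(worst, (time.perf_counter() - t0) * 1000)
    assert worst < budget_ms, f"worst frame {worst:.1f}ms blew the budget"
    return np.concatenate(out)

print(stream(np.zeros(SAMPLE_RATE)).shape)  # one second of silence
```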
The real achievement here isn't just quality, it's doing this streaming with tight latency on CPU while preserving speaker identity. Most voice-conversion work looks great offline, then falls apart once you go real-time. Nice work getting this to hold up in streaming.
Kinda wild to think accent friction is basically a tech problem. Doing this in real time on CPU sounds tough. Curious how well it holds up in messy, real calls.
This is a game-changer! I remember every call I had with an investor, feeling shy about asking "can you repeat that?"... Thanks Krisp, you changed my life!!
I'd like to use a model like this, but only if it really preserves my voice; otherwise people would realize it's not me, or I'd have to use it all the time.
What do you think about the misuse potential (by scammers for example)?
Aside from that, I like that this exists now.
This runs on the listener's side, not the speaker's, so there's no misuse case here: a scammer can't use it to change how they sound to you.
On-device CPU inference is the real flex here. Optimization probably mattered as much as modeling.
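For reference, the usual first lever for this kind of CPU inference is dynamic int8 quantization. A minimal PyTorch sketch with a hypothetical stand-in model (no claim this is what Krisp does):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a Linear-heavy streaming speech model.
model = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256)
).eval()

# Dynamic quantization: weights stored as int8, activations quantized
# on the fly at runtime. Usually ~4x smaller and notably faster on CPU
# for Linear/LSTM-heavy models, with little quality loss.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    print(qmodel(torch.randn(1, 256)).shape)
```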
This feels adjacent to voice conversion research, but with stricter latency constraints.
Yeah, this would be helpful for my Singlish friends out there!
Finally, Krisp built it! I'll understand the users I interview better, without the cognitive load and the constant "could you please repeat that?" phrasing.
Parallel data is the hard problem here: you can't crowdsource ground truth, because no one can record themselves with a different accent.
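One workaround the accent-conversion literature leans on is pseudo-parallel data: synthesize a native-accent target for each accented utterance's transcript with TTS. A rough sketch, where the `tts` interface is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PseudoPair:
    accented_wav: str  # real recording by a non-native speaker
    native_wav: str    # synthetic native-accent render of the same text
    transcript: str

def build_pairs(corpus, tts):
    """`corpus` yields (wav_path, transcript); `tts` is any
    native-accent text-to-speech system (hypothetical interface).
    Pairs match in content but not in timing, so training needs an
    alignment-free objective (attention, soft-DTW, etc.)."""
    return [PseudoPair(wav, tts.synthesize(text), text)
            for wav, text in corpus]
```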
Latency can destroy conversational rhythm. What's your p95 inference time? Also, are there any benchmarks we can see?
Really cool to see accent adaptation in real time. Curious about the benchmarks and how well this handles messy, real-world Zoom calls.
Curious whether wav2vec-style embeddings played a role in your representation learning.
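For anyone unfamiliar, wav2vec 2.0 features are a popular content representation in voice and accent conversion because they capture phonetics while being relatively speaker-agnostic. Extracting them takes a few lines with torchaudio (illustrative only; no claim Krisp uses this):

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

wave, sr = torchaudio.load("sample.wav")  # mono wav; channel dim doubles as batch
wave = torchaudio.functional.resample(wave, sr, bundle.sample_rate)

with torch.inference_mode():
    # List of per-layer features, each of shape (batch, frames, 768).
    features, _ = model.extract_features(wave)
print(len(features), features[-1].shape)
```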
Local CPU inference stands out. Careful optimization likely rivaled the modeling effort.
Nice to finally see this direction of accent conversion (i.e., applied to incoming calls) in the Krisp app. This is a very meaningful feature.
Great work. Natural + clear is the combo that matters.
This is built for international, privacy-first teams!
How did you estimate the number of IQ points?
The sub-200ms streaming constraint changes everything. Causal modeling in speech is brutal to get right.
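Concretely, "causal" means every layer may only look left. A minimal sketch of a causal Conv1d, which just left-pads instead of centering the kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Conv1d that sees only the past: pad (k-1)*dilation on the left."""
    def __init__(self, c_in, c_out, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # left-pad only

# Output at time t depends only on inputs <= t, so the layer can run
# frame-by-frame in a live stream with zero lookahead.
y = CausalConv1d(1, 8, kernel_size=3)(torch.randn(1, 1, 100))
print(y.shape)  # torch.Size([1, 8, 100])
```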
Accent space is effectively infinite. Generalization must rely on invariants rather than enumeration.
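One standard way to learn invariants instead of enumerating accents is domain-adversarial training: a gradient-reversal layer pushes the encoder to discard accent-identifying information (a sketch of the general technique, not necessarily what Krisp does):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward, negated gradient backward (Ganin & Lempitsky, 2015)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

# encoder features -> GradReverse -> accent classifier: the classifier
# learns to spot the accent, while the reversed gradient trains the
# encoder to make accent unpredictable, i.e. to keep only
# accent-invariant content.
feats = torch.randn(4, 256, requires_grad=True)
rev = GradReverse.apply(feats, 1.0)
rev.sum().backward()
print(feats.grad[0, 0])  # -1.0: gradient arrives sign-flipped
```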
Will it finally help the barista at Starbucks get my name right?
This is a huge game changer!