The state of modern AI text to speech systems for screen reader users

(stuff.interfree.ca)

101 points | by tuukkao 2 days ago

49 comments

  • superkuh 2 days ago

    What use is human sounding TTS when your desktop cannot read the contents of windows?

    As someone with progressive retinal tearing who's used the linux desktop for 20 years I'm terrified. The forcing of the various incompatible waylands by the big linux corps has meant the end of support for screen readers. The only wayland compositor that supports screen readers in linux is GNOME's mutter and they literally only added that support last year (after 15 years of waylands) and instead of supporting standard at-spi and existing protocols that Orca and the like use GNOME decided to come up with two new in-house GNOME proprietary protocols (which themselves don't send the full window tree or anything on request but instead push only info about single windows, etc, etc) for doing it. No other wayland compositor supports screen readers. And without any standardization no developers will ever support screenreaders on waylands. Basically only GNOME's userspace will sort of support it. There's no hope for non-X11 based screen readers and all the megacorps are say they're dropping X11 support.

    The only options I have are to use and maintain old X11 linux distros myself. But eventually things like CA TLS and browsers just won't be feasible for me to backport and compile myself. Eventually I'm going to have to switch to using Windows. It's a sad, sad state of things.

    And regarding AI based text to speech: almost all of it kind of sucks for screen readers. Particularly the random garbled ai-noises that happen between and at the end of utterances, inaccurate readings, etc in many models. Not to mention requiring the use of a GPU and lots of system resources. The old Festival 1.96 Nitech HTS voices on (core2duo) CPU from the early 2000 are incomparibly faster, more accurate, and sound decent enough to understand.

    • noosphr 2 days ago

      >The only options I have are to use and maintain old X11 linux distros myself. But eventually things like CA TLS and browsers just won't be feasible for me to backport and compile myself. Eventually I'm going to have to switch to using Windows. It's a sad, sad state of things.

      Gentoo, Devuan, and all the BSDs will keep X11 around until the heat death of the universe. Anyone who doesn't force systemd on their users also doesn't force Wayland. You have plenty of options before Windows.

    • lukastyrychtr 2 days ago

      What? This description makes no sense. Nothing changed with AT-SPI2, which is X.org/Wayland independent. The only thing that got added (and is already supported by KDE) is a protocol to inform the screen reader about keyboard events, as it previously used the "anyone in my session can read my keyboard" capability of X.org.

  • cachius 2 days ago

    Gloomy bottom line:

    So what's the way forward for blind screen reader users? Sadly, I don't know.

    Modern text to speech research has little overlap with our requirements. Using Eloquence [32-bit voice last compiled in 2003], the system that many blind people find best, is becoming increasingly untenable. ESpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios [...] is a closed-source product with a single maintainer, that also suffers from a lack of pronunciation accuracy.

    In an ideal world, someone would re-implement Eloquence as a set of open source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack that is preferred by blind power-users is an effort that would require several million dollars of funding at minimum.

    Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient [800 to 900 words per minute] as what we have currently.

    • SequoiaHope 2 days ago

      My big takeaway was that a great way AI could help would be to aid in decompiling Eloquence, though I don’t know if there are gotchas there.

      I found some sample audio from Eloquence. I like this type of voice!

      https://youtu.be/bBp8NP3JTpI

  • Jeff_Brown 2 days ago

    This surprises me: "These modern systems are developed to sound human, natural, and conversational. Unfortunately this seems to come at the expense of accuracy. In my testing, both models had a tendency to skip words, read numbers incorrectly, chop off short utterances, and ignore prosody hints from text punctuation. "

    • ethin 2 days ago

      They also have built-in abbreviation dictionaries. For example, Acapela likes to expand AST to Atlantic Standard Time, even when the context is so obviously (not) talking about time zones.

    • layer8 2 days ago

      Why does it surprise you?

  • nuc1e0n 2 days ago

    Has anyone considered decompiling Eloquence with something like Ghidra or IDA Pro? Mario 64 was turned back into high-level language source code this way.

    • miki123211 2 days ago

      This wouldn't be easy due to Eloquence's internal architecture. eci.[dll|so|dylib] only contains the low-level platform abstraction layer, things like threads, queues, mutexes etc, as well as utility classes for .ini file handling and such. It then loads a language module (from a path specified in eci.ini). The actual speech stack is statically linked separately into each language module (possibly with modifications, not sure about that); in theory, if you reverse-engineered the API between the main and language libraries, you could write an Eloquence wrapper for any arbitrary speech synthesizer. This means you'd have to reverse-engineer this separately for each language.
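
      For what it's worth, here is a purely hypothetical sketch of how that split could be poked at from Python. The eci.ini section/key names, the fallback path, and the idea of binding entry points this way are all placeholder assumptions, since the internal interface between the main library and the language modules is undocumented.

        # Hypothetical sketch only: the eci.ini section/key names and the module
        # path below are placeholders, not Eloquence's real layout. The internal
        # entry points of a language module would first have to be reverse-engineered.
        import configparser
        import ctypes

        config = configparser.ConfigParser()
        config.read("eci.ini")

        # Placeholder lookup of the language-module path described above.
        module_path = config.get("languages", "enu", fallback="./enu50.so")

        # Load the language module, which statically links the actual speech stack.
        lang = ctypes.CDLL(module_path)

        # Reverse-engineered symbols would then be bound with ctypes prototypes, e.g.:
        # lang.some_entry_point.argtypes = [...]; lang.some_entry_point.restype = ...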

      From what we know, Eloquence was compiled in two stages, stage1 compiled a proprietary language called Delta (for text-to-phoneme rules) to C++, which was then compiled to machine code. A lot of the existing code is likely autogenerated from a much more compact representation, probably via finite state transducers or some such.

      • TheAceOfHearts 2 days ago

        I'm bullish on LLMs being able to help with this kind of reverse engineering effort, if not current models then in a few more years. I've had conversations with people where they managed to get Claude to help reverse engineer old weird binaries with very little input. I wouldn't hype it up as being a magical tool that'll definitely work, but it can't hurt to try.

      • nuc1e0n 2 days ago

        I gather decompiling Mario 64 wasn't easy either. Just having C++ that can be recompiled for other architectures would seem to be useful. The original ELIZA chatbot was converted to modern C++ in a similar way recently, and that used a compact representation for its logic as well.

  • nowittyusername 2 days ago

    I have been working on playing around with over 10 stt systems in the last 25 days, and it's really weird to read this article, as my experience is the opposite. Stt models are amazing today. They are stupid fast, sound great, and are very simple to implement, as Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, "Supertonic", was exactly the model I would have recommended if people wanted to see how amazing the tech has become. The model is tiny, runs 55x real time on any potato, and sounds amazing. Also, I think he is implementing his models wrong. He mentions that some models don't have streaming and you have to wait for the whole chunk to be processed, but that's not a limit in any meaningful way, as you can define the chunk. You can simply make the first n characters of the first sentence the first chunk, process that, and play it immediately while the rest of the text is being processed. TTFS and TTFA on all modern-day models are well below 0.5 seconds, and for Supertonic it was 0.05 in my tests.
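
    For what it's worth, a minimal sketch of that chunking approach is below. synthesize() is a placeholder for whatever model you use (Supertonic, Chatterbox, ...), playback goes through sounddevice, and the sentence splitter and first-chunk size are arbitrary illustrative choices.

      # Minimal sketch of the chunking approach described above. synthesize() is a
      # placeholder for your TTS model; splitter and chunk size are arbitrary.
      import queue
      import re
      import threading

      import numpy as np
      import sounddevice as sd

      SAMPLE_RATE = 24000  # assumed output rate of the model

      def synthesize(text: str) -> np.ndarray:
          """Placeholder: return mono float32 samples for `text` from your TTS model."""
          raise NotImplementedError

      def speak_streaming(text: str, first_chunk_chars: int = 60) -> None:
          sentences = re.split(r"(?<=[.!?])\s+", text.strip())
          # Keep the very first chunk short so playback starts almost immediately.
          first = sentences[0][:first_chunk_chars]
          rest_of_first = sentences[0][first_chunk_chars:]
          chunks = [c for c in [first, rest_of_first, *sentences[1:]] if c]

          audio_q = queue.Queue()

          def producer() -> None:
              for chunk in chunks:
                  audio_q.put(synthesize(chunk))
              audio_q.put(None)  # sentinel: nothing left to render

          threading.Thread(target=producer, daemon=True).start()

          # Play each chunk as soon as it is ready while the next one renders.
          while (audio := audio_q.get()) is not None:
              sd.play(audio, SAMPLE_RATE)
              sd.wait()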

    • jdp23 2 days ago

      What screenreaders are you using to test the models with?

    • cachius 2 days ago

      What's your experience at high speeds, with garbled speech artifacts and pronunciation accuracy?

      • nowittyusername 2 days ago

        With Supertonic, or overall? If overall, most do pretty well, though some are funky; suprano was so bad no matter what I did that I had to rule it out from my top contenders for anything. Supertonic was close to my number one choice for my agentic pipeline, as it was so insanely fast and the quality was great, but it didn't have the other bells and whistles that some other models have, so I held it off for CPU-only projects in the future. If you are going to use it on a GPU, I would suggest Chatterbox or Pocket TTS. Chatterbox is my top contender as of now because it sounds amazing, has cloning, and I got it down to 0.26 TTFA/TTSA once I quantized it and implemented Pipecat into it. Pocket TTS is probably my second choice for similar reasons.

    • pixl97 2 days ago

      >Also I think he is implementing his models wrong.

      This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive. That, and anything that doesn't publish how it was fully implemented is suspect. That goes for both affirmative and negative findings.

      It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas occurring, but quite often the implementation and testing of those ideas left much to be desired.

    • swores a day ago

      Minor nitpick, but you mean "tts" not "stt" both times.

      Is supertonic the best sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?

      • nowittyusername 18 hours ago

        Yes, sorry, I mixed those up. Supertonic is not the best sounding in my tests; it was by far the fastest, but its audio quality for something so fast was decent. If you want something that sounds better AND is also extremely fast, Pocket TTS is the choice: amazing quality and also crazy fast on both GPU and CPU. If you care mainly about quality, Chatterbox was the best fit in my tests, but it's slower than the others. Qwen 3 TTS was also great, but it's unusable as a real-time agentic voice as it's too slow. They haven't released the code for streaming yet; once they do, it will be my top contender.

        • swores 18 hours ago

          Thanks!

    • 8bitsrule 2 days ago

      Just found this video ... it looks to sound and work -very- well. (RasPI & Onyx)

      https://www.youtube.com/watch?v=bZ3I76-oJsc

    • noosphr 2 days ago

      Are you using them at 1000 wpm?

      • nowittyusername 2 days ago

        Supertonic is probably way faster than that; I wouldn't be surprised if, measured, it came out to something like 14k wpm. On my 4090 I was getting about 175x real time, while on CPU only it was 55x real time. I stopped optimizing it, but I'm sure it could be pushed further. Anyway, you should check out their repo and test it yourself; it's crazy what that team accomplished!
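
        For context, a figure like "55x real time" is usually just the duration of the audio produced divided by the wall-clock time it took to synthesize it. A small sketch, with synthesize() and the sample rate as placeholders rather than Supertonic's actual API:

          import time

          import numpy as np

          SAMPLE_RATE = 24000

          def synthesize(text: str) -> np.ndarray:
              """Placeholder: return mono samples for `text` from your TTS model."""
              raise NotImplementedError

          def real_time_factor(text: str) -> float:
              start = time.perf_counter()
              audio = synthesize(text)
              elapsed = time.perf_counter() - start
              audio_seconds = len(audio) / SAMPLE_RATE
              return audio_seconds / elapsed  # e.g. 55.0 means "55x real time"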

        • gia_ferrari 2 days ago

          Audio synthesis speed is one thing, but is the output _intelligible to a human_ at 1,000wpm? That's the sort of thing Eloquence is being used for, according to the article.

        • mrbukkake 16 hours ago

          Did you even read the article, bud?

  • dqv 2 days ago

    Does having it sound "natural" even matter for high-speed reading? I assumed it would be a hindrance at higher speeds because natural variation and randomness in a voice makes it harder to scan the voice (similar to how reading something handwritten tends to be harder than something that has been typeset). At least that's how I always feel whenever I listen to audiobooks that use "natural" voices - I always switch to the more robotic sounding ones because, in my experience, it's easier to scan once at 2x and beyond.

    My takeaway from the article is that accuracy of pronunciation, tweakability, and "time to first utterance" are what matter most.

    • ClawsOnPaws 2 days ago

      You are correct. At least in my case, more synthetic voices like Eloquence are easier to understand at high speeds especially because of their 'formulaic' nature. You don't listen to each individual phoneme or letter, you listen more for groups of syllables, tone, etc. The more unpredictable the text to speech, the harder this is. Also, performance is another big point. If you have large bits of silence at the beginning of the audio, or slow attacks, then the responsiveness will suffer, whether that's because of the actual audio itself, or the generation time.

      Some of this is surely subjective, but I'm pretty sure I'm not the only screen reader user with these opinions.
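
      As a small aside, one mitigation for the leading-silence problem is to trim the quiet head of each generated chunk before playback. A sketch, with the amplitude threshold as an arbitrary assumption:

        import numpy as np

        def trim_leading_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
            """Drop samples before the first one whose magnitude exceeds `threshold`."""
            loud = np.flatnonzero(np.abs(audio) > threshold)
            return audio if loud.size == 0 else audio[loud[0]:]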

  • rhdunn 2 days ago

    It's not just screen reader users. I use TTS to listen to text content, and the AI TTS voices I've tried have issues with skipping words or generating garbled output in sections.

    I don't know if this is a data/transcription issue, an issue with noisy audio, or what.

  • ctoth 2 days ago

    Funny, I've actually been digging into this problem recently. I have a WebAudio reimplementation of Klatt 1980 driven by cmudict. It still sounds pretty ass, but it's very early days. This weekend I intend to do a deep dive on the Delta rule system that powers Eloquence. There are so many interesting papers from the late '90s and early 2000s that I bet we could get something pretty remarkable that sounds even better than Eloquence, is incredibly fast, and runs anywhere.
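
    For anyone curious what the core of Klatt 1980 looks like, here is a minimal cascade-formant sketch (in Python rather than WebAudio): an impulse-train glottal source filtered through second-order resonators. The formant frequencies and bandwidths are generic textbook values for an /a/-like vowel, not anything taken from Eloquence.

      import numpy as np
      from scipy.signal import lfilter

      SAMPLE_RATE = 16000

      def resonator(x: np.ndarray, freq: float, bw: float) -> np.ndarray:
          """Klatt digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
          t = 1.0 / SAMPLE_RATE
          c = -np.exp(-2 * np.pi * bw * t)
          b = 2 * np.exp(-np.pi * bw * t) * np.cos(2 * np.pi * freq * t)
          a = 1 - b - c
          return lfilter([a], [1, -b, -c], x)

      def vowel(f0: float = 110.0,
                formants=((730, 90), (1090, 110), (2440, 170)),
                seconds: float = 0.5) -> np.ndarray:
          n = int(SAMPLE_RATE * seconds)
          source = np.zeros(n)
          source[::int(SAMPLE_RATE / f0)] = 1.0  # impulse train at the pitch period
          out = source
          for freq, bw in formants:              # cascade the formant resonators
              out = resonator(out, freq, bw)
          return out / np.max(np.abs(out))       # normalize to +/-1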

  • WarmWash 2 days ago

    This almost perfectly encapsulates the problems that create friction for new technology. People want/expect the new technology to be an upgraded version of the old technology.

    "AI is going to make screen readers amazing!"

    No, that is not what AI is going to do. That is the exact kind of missing the forest for the trees that comes with new tech.

    AI will be used to act as a sighted person sitting next to the blind person, whom they converse with (at whatever speed they wish) to interpret and do things on the screen. It's a total misapplication of AI to think the goal is to leverage it to make screen readers better.

    They can have a sighted servant who gleefully collaborates with them to use their computer. You don't need 900 words per minute read to you so you can build a full mental model of every webpage. You can just say "Let's go on Amazon and look for paper towels" or "Let's check the top stories on HN".

    • tuukkao 2 days ago

      Can you elaborate on how a user interface based on conversation is even remotely as efficient as a keyboard-operated screen reader? With a screen reader I can get information out of a web page much quicker than the time it takes me to think about how to "ask" for it. The only advantage I could see with this approach (assuming there would be no hallucinating, etc.) is that AI can extract things out of an inaccessible or unfamiliar interface. However, in all other respects this approach would effectively lock blind people into using only the capabilities the AI offers. As a blind software developer, this idea of a supposedly viable user interface sounds patronising more than anything.

      • ClawsOnPaws 2 days ago

        Not to mention that this seems to completely ignore all the things we might use computers for. Browsing websites is only one of the things I do, and many of the others would be extraordinarily clunky through natural language. Also, I just do not feel comfortable talking to my computer out loud, especially when I'm anywhere with other people around, or, I don't know... playing games with friends on voice chat. It seems to be common for people to assume that a fix is very easy and simple: LLMs, OCR for screen readers, etc. If it really were as simple as just slapping OCR on everything, it would already have happened. Also, I definitely like some privacy and would prefer my computing not to happen entirely through OpenAI, Anthropic, or Google; whether or not someone can use computers well, we shouldn't force them to do that exact thing. At least in my opinion. And that doesn't even go into the costs associated with all of that LLM usage.

      • WarmWash 2 days ago

        Then the problem was solved 30 years ago, and you can continue to use it indefinitely.

        No one will force a blind person to use a computer that converses in natural English. But even sighted people are likely to move away from dense, visually heavy UIs towards natural conversational interfaces with digital systems. I suspect that if that comes to fruition (unlike us nerds, regular folks hate visually info-dense clutter), young blind people won't even perceive much impediment in that area of life.

        This isn't far off from the CLI vs. GUI debate, where CLIs are way faster and more efficient, but regular people overwhelmingly despise them and use GUIs. Ease over efficiency is the goal for them.

      • ALittleLight 2 days ago

        I agree with you that someone who is good with a screen reader can efficiently move through web interfaces. A good screen reader user is faster than the typical user.

        However, not all blind people are good with screen readers. For them, an AI assistant would be useful. Even for good screen reader users an AI could be useful.

        An example: Yesterday, I needed to buy new valve caps for my car's tires. The screen reader path would be something like Walmart -> jump to search field, type "valve cap car tire" and submit -> jump to results section -> iterate through a few results to make sure I'm getting the right thing at a good price -> go to the result I want -> checkout flow. Alternatively, the AI flow would be telling my AI assistant that I need new car tire valve caps. The assistant could then simultaneously search many provider options, select one based on criteria it inferred, and order it by itself.

        The AI path, in other words, gets a better result (looking through more providers means it's likelier to find a better price, faster delivery, whatever) and is also much easier and faster. And of course, not only for screen reader users, but for everyone.

    • vunderba 2 days ago

      Sure but that's only half the equation. Screen readers with realistic high-speed AI voices are still VERY much necessary since users are not always going to be in an environment where they can talk out loud.

    • rhdunn a day ago

      AI in this sense means using Machine Learning (ML)/Neural Networks (NN) to convert the text (or phonemes) to audio.

      There are effectively two approaches to voice synthesis: time-domain and pitch-domain.

      In time-domain synthesis you are concatenating short waveforms together. These are variations of Overlap and Add: OLA [1], PSOLA [2], MBROLA [3], etc.
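
      As a toy illustration of the time-domain family (assuming nothing beyond numpy): Hann-windowed waveform units summed at overlapping offsets. Real PSOLA/MBROLA systems additionally align those overlaps to pitch periods so that pitch and duration can be modified.

        import numpy as np

        def overlap_add(units: list, hop: int) -> np.ndarray:
            """Sum Hann-windowed waveform units spaced `hop` samples apart."""
            length = hop * (len(units) - 1) + max(len(u) for u in units)
            out = np.zeros(length)
            for i, unit in enumerate(units):
                start = i * hop
                out[start:start + len(unit)] += unit * np.hanning(len(unit))
            return out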

      In pitch-domain synthesis, the analysis and synthesis happen in the pitch domain through the Fast Fourier Transform (visualized as a spectrogram [4]), often adjusted to the Mel scale [5] to better highlight the pitches and overtones. The TTS synthesizer then generates these pitches and converts them back to the time domain.
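
      The corresponding analysis step on the pitch-domain side can be sketched with librosa; the frame sizes and the 80 mel bands are typical values, not ones required by any particular synthesizer.

        import librosa
        import numpy as np

        y, sr = librosa.load("speech.wav", sr=22050)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                             hop_length=256, n_mels=80)
        log_mel = np.log(np.clip(mel, 1e-5, None))
        # A neural vocoder such as WaveGrad is trained to map log_mel back to audio.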

      The basic idea is to extract the formants (pitch bands for the fundamental frequency and overtones) and have models for these. Some techniques include:

      1. Klatt formant synthesis [6]

      2. Linear Predictive Coding (LPC) [7]

      3. Hidden Markov Model (HMM) [8]

      4. WaveGrad NN/ML [9]

      [1] https://en.wikipedia.org/wiki/Overlap%E2%80%93add_method

      [2] https://en.wikipedia.org/wiki/PSOLA -- Pitch-synchronous Overlap and Add

      [3] https://en.wikipedia.org/wiki/MBROLA -- Multi-Band Resynthesis Overlap and Add

      [4] https://en.wikipedia.org/wiki/Spectrogram

      [5] https://en.wikipedia.org/wiki/Mel_scale

      [6] https://en.wikipedia.org/wiki/Dennis_H._Klatt

      [7] https://en.wikipedia.org/wiki/Linear_predictive_coding

      [8] https://www.cs.cmu.edu/~awb/papers/ssw6/ssw6_294.pdf

      [9] https://arxiv.org/abs/2009.00713 -- WaveGrad: Estimating Gradients for Waveform Generation

  • aaronbrethorst 2 days ago

    Who owns Eloquence and why hasn’t a new version been released since 2003?

    I feel like there’s a lot of backstory I’m missing.

    • 46493168 2 days ago

      Microsoft. A new version hasn't been released because Microsoft, like most companies, doesn't take accessibility seriously.

      The original Eloquence TTS was developed as ETI-Eloquence. ScanSoft acquired the speech recognition company SpeechWorks in 2003, and in October 2005 ScanSoft merged with Nuance Communications, with the combined company adopting the Nuance name. Currently, Code Factory distributes ETI-Eloquence for Windows as a SAPI 5 TTS synthesizer, though I can't figure out the exact licensing relationship between Code Factory and Nuance, which was acquired by Microsoft in 2022.

      • miki123211 2 days ago

        This is missing large parts of the story.

        Microsoft only bought the speech recognition / med tech parts of Nuance; everything else, notably the Vocalizer speech stack (and likely also Eloquence), was spun off as Cerence. We know that somebody still has source code for Eloquence somewhere, as Apple licenses it and compiles it natively for aarch64 (yes, I've looked at those dylibs; no, there's no emulation). Not sure why nobody is recompiling the Windows versions; either there's just no need to do so, or some Windows-specific part of the code was lost in all the mergers and would need to be rewritten.

        A lot of Eloquence IP was also licensed by IBM, and the text-to-phoneme processing stuff is still in use in IBM Watson to some extent (it's vulnerable to the same crash strings and has similar pronunciation quirks).

        With that said, I'm not sure whether Eloquence system integrators are getting the Delta code and the tools to compile it to C++, or just the pre-generated C++. Either would be consistent with the fact that Apple compiles it for their own platforms but doesn't introduce any changes to the pronunciation rules. It is entirely within the realm of possibility that this part of the stack has been lost, at least to Cerence, though there's nothing that specifically indicates that this is the case.

        • layer8 2 days ago

          > We know that somebody still has source code for Eloquence somewhere, as Apple licenses it and compiles it natively for aarch64 (yes I've looked at those dylibs, no there's no emulation).

          It’s not impossible that Apple might have transpiled the x86 machine code.

        • 46493168 2 days ago

          Good catch, you're right. I found this open letter that mentions that Cerence owns Eloquence [0]. That also seems to be confirmed by the update to the letter.

          [0]https://openletter.earth/to-cerence-inc-hims-inc-hims-intern...

  • dfajgljsldkjag 2 days ago

    Natural-sounding AI is like a fancy cursive font for writing code: it slows things down. The right tool fits the job, and the job here is information retrieval.

  • visarga 2 days ago

    I've been using a screen reader Chrome extension for 15 years with the Alex voice on macOS. Some people find it robotic, but I haven't been able to replace it yet. I speed it up to 1.4x. When I tried the Eloquence voice just now it sounded even more robotic, but I can relate to that.

  • noosphr 2 days ago

    I've been using eSpeak for 20 years.

    There doesn't need to be a way forward when the software 'just works' on every platform; I'm happily using it from my phone now.
