328 comments

  • qbow883 3 days ago ago

    Setting aside the various formatting problems and the LLM writing style, this just seems all kinds of wrong throughout.

    > “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.

    10Mbps should be way more than enough for a mostly static image with some scrolling text. (And 40Mbps is ridiculous.) This is very likely to be caused by bad encoding settings and/or a bad encoder.

    > “What if we only send keyframes?”

    The post goes on to explain that this does not work because some other component needs to see P-frames. If that is the case, just configure your encoder to use very short keyframe intervals.

    > And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.

    A single H.264 keyframe can be whatever size you want, *depending on how you configure your encoder*, which was apparently never seriously attempted. Why are we badly reinventing MJPEG instead of configuring the tools we already have? Lower the bitrate and keyint, use a better encoder for higher quality, lower the frame rate if you need to. (If 10 fps JPEGs are acceptable, surely you should try 10 fps H.264 too?)
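    For example, a rough sketch of the kind of knobs I mean, as a GStreamer pipeline like the one the article apparently uses (element choices and numbers are illustrative, not tuned):

        // encode.ts -- cap fps, bitrate, and keyframe interval instead of reinventing MJPEG
        import { spawn } from "node:child_process";

        const pipeline = [
          "ximagesrc", "use-damage=0", "!",
          "videoconvert", "!",
          "videorate", "!", "video/x-raw,framerate=10/1", "!",  // 10 fps is plenty for a terminal
          "x264enc",
          "bitrate=1000",          // kbit/s -- nowhere near 40000
          "key-int-max=30",        // a keyframe every 3 s at 10 fps
          "tune=zerolatency",
          "speed-preset=veryfast", "!",
          "matroskamux", "!", "filesink", "location=out.mkv",
        ];
        spawn("gst-launch-1.0", pipeline, { stdio: "inherit" });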

    But all in all the main problem seems to be squeezing an entire video stream through a single TCP connection. There are plenty of existing solutions for this. For example, this article never mentions DASH, which is made for these exact purposes.

    • Sesse__ 3 days ago ago

      > Why are we badly reinventing MJPEG instead of configuring the tools we already have?

      Is it much of a stretch to assume that in the AI gold rush, there will be products made by people who are not very experienced engineers, but just push forward and assume the LLM will fix all their problems? :-)

      • qingcharles 18 hours ago ago

        I built a little tool using AI recently and it worked great but it was brittle as hell and I was constantly waiting for it to fail. A few days later I realized there was a much better way of writing it. I'd boxed the LLM in by proposing the way to code it.

        I've changed my AGENTS.md now so it basically says "Assume user is ignorant to other better solutions to the problem they are asking. Don't assume their given solution to the problem is the best one, look at the problem itself and propose other ways to solve it."

    • ozim 3 days ago ago

      *Why are we badly reinventing MJPEG instead of configuring the tools we already have?*

      Getting to know and understand existing tools costs time/money. Whether that is less or more expensive than reinventing something badly is very complicated to judge and depends on loads of factors.

      It might be that reinventing something badly, but well enough for the case at hand, is the best use of resources.

      • antisol 3 days ago ago

        From TFA:

            Implementation complexity: 
             h264 Stream: 3 months of rust
             JPEG Spam: fetch() in a loop
        
        I don't see how it could have taken 3 months to read up on existing technologies. And that "3 month" number is before we start factoring in time spent on:

        * Writing code for JPEG Spam / "fetch() in a loop" method

        * Mechanisms to switch between h264 / jpeg modes

        * Debugging implementation of 2 modes

        * Debugging switching back and forth between the 2 modes

        * Maintenance of 2 modes into the future

    • bugufu8f83 3 days ago ago

      >Setting aside...the LLM writing style

      I don't want to set that aside either. Why is AI generated slop getting voted to the top of HN? If you can't be bothered to spend the time writing a blog post, why should I be bothered spending my time reading it? It's frankly a little bit insulting.

      • piskov 3 days ago ago

        Don’t assume something you cannot prove. It was great writing.

        • npunt 3 days ago ago

          Normally the 1 sentence per para LinkedIn post for dummies writing style bugs me to no end, but for a technical article that's continually hopping between questions, results, code, and explanations, it fits really well and was a very easy article to skim and understand.

          • netsharc 3 days ago ago

            It's action-thriller writing for something that in reality is super dull (my question is loaded with outdated cliches, but would you be telling a girl you're trying to impress at a party about this problem you faced of trying to push some data over a network?). I had to skim over it, like watching a YouTube video at 2x so I don't start evaluating how obnoxious the narrator is.

        • bugufu8f83 3 days ago ago

          >Don’t assume something you cannot prove.

          Well it's an inherently unprovable accusation, so assumption will have to do. It reeks of LLM-ese in certain word choices, phrases, and structure, though. I thought it was quite clear.

          >It was great writing

          Err... no accounting for taste, I suppose.

          • afiori 3 days ago ago

            Just saying, but since LLM-ese is the common denominator of how people write, it is likely the writing style of a lot of people.

        • tylervigen 2 days ago ago

          The author replied below that they used Opus to write the blog post.

        • rasz 3 days ago ago

          You mean other than this being an AI slop company, whose use case is monitoring AI slop output, and the author confirming the blog is AI slop? https://news.ycombinator.com/item?id=46372060

        • PunchyHamster 3 days ago ago

          Looked like typical medium.com slop but with a bit more technical detail. Not sure where you see greatness

    • 3 days ago ago
      [deleted]
    • mschuster91 3 days ago ago

      > For example, this article never mentions DASH, which is made for these exact purposes.

      DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...

      But in either case: you need ffmpeg somewhere in your pipeline for that experience to be even remotely enjoyable. No ffmpeg? No luck, good luck implementing all of that shit yourself.

      • rezonant 3 days ago ago

        Or Gstreamer, which the article says they were using.

        > DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...

        They said they implemented a custom WebCodecs-over-websocket setup, so surely they can use Dash.js here. Or rather, their LLM can, since it's doubtful they are writing any actual code.

        They would need to use LL-DASH or Low-Latency HLS, but it's quite achievable.
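        A minimal sketch of that client side (assuming dash.js's settings API; exact key names vary between versions, and the manifest URL is made up):

            // ll-dash.ts -- low-latency DASH playback in the browser
            import dashjs from "dashjs";

            const video = document.querySelector("video")!;
            const player = dashjs.MediaPlayer().create();
            player.updateSettings({
              streaming: {
                lowLatencyEnabled: true,     // CMAF chunked transfer
                delay: { liveDelay: 1.5 },   // target ~1.5 s behind live
              },
            });
            player.initialize(video, "https://example.com/agent/stream.mpd", true);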

    • rdsubhas 3 days ago ago

      Huh? This is the least LLM writing style I've encountered. Extraordinary claims require extraordinary proof.

      • nathan82 3 days ago ago

        It's not an extraordinary claim, it's a mundane and plausible one. This is exactly what you get when you ask an LLM to write in an "engaging conversational" style and skip any editing after the fact. You could never prove it, but there are a LOT of tells.

        "The key insight" - LLMs love key insights! "self-contained corruption-free" - they also love over-hyphenating, as much as they love em-dashing. Both abundant here. "X like it's 2005" and also "Y like it's 2009" - what a cool casual turn of phrase, so natural! The architecture diagram is definitely unedited AI; Claude always messes up the border alignment on ASCII boxes.

        I wouldn't mind except the end result is imprecise and sloppy, as pointed out by the GP comment. And the tone is so predictable/boring at this point, I'd MUCH rather read poorly written human output with some actual personality.

        • 3 days ago ago
          [deleted]
      • Tenobrus 3 days ago ago

        ai detectors are never totally accurate but this one is quite good and it suggests something like 80% of this article is llm generated. honestly idk how you didn't get that just by reading it tho, maybe you haven't been exposed to much modern llm-generated content?

        https://www.pangram.com/history/5cec2f02-6fd6-4c97-8e71-d509...

  • mikepavone 3 days ago ago

    > When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.

    This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is the way they implemented their JPEG screenshots, they have some mechanism that minimizes the number of frames that are in-flight. This is not some inherent property of JPEG though.

    > And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.

    h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.

    Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
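    The control loop is simpler than it sounds. A sketch of the idea (encoder.setBitrate and the measured send latency are stand-ins for whatever your stack actually exposes):

        // rate-control.ts -- back off when transmission latency creeps up
        let kbps = 4000;            // seeded by an initial bandwidth probe
        let baselineMs = Infinity;  // best send latency seen so far

        function onFrameSent(sendMs: number, encoder: { setBitrate(k: number): void }) {
          baselineMs = Math.min(baselineMs, sendMs);
          if (sendMs > baselineMs * 1.5) {
            kbps = Math.max(250, kbps * 0.7);    // queues building: back off hard
          } else {
            kbps = Math.min(8000, kbps * 1.02);  // stable: probe slowly upward
          }
          encoder.setBitrate(Math.round(kbps));
        }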

    WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use WebSockets to get through dumb corporate firewall rules and just use WebRTC for everything else.

    • auxiliarymoose 3 days ago ago

      They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading. UDP is not necessary to write a loop.
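      The shape of it is roughly this (my paraphrase, not the article's exact code):

          // poll.ts -- one frame in flight at a time; bad network => fewer JPEGs
          async function pollFrames(img: HTMLImageElement) {
            let prev: string | null = null;
            while (true) {
              const blob = await (await fetch("/screenshot.jpg")).blob();
              if (prev) URL.revokeObjectURL(prev);  // don't leak old frames
              prev = img.src = URL.createObjectURL(blob);
              // the next request only starts after this one completed,
              // so backpressure falls out of the loop itself
            }
          }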

      • mikepavone 3 days ago ago

        > They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading.

        You're right, I don't know how I managed to skip over that.

        > UDP is not necessary to write a loop.

        True, but this doesn't really have anything to do with using JPEG either. They basically implemented a primitive form of rate control by only allowing a single frame to be in flight at once. It was easier for them to do that using JPEG because they (to their own admission) seem to have limited control over their encode pipeline.

        • londons_explore 3 days ago ago

          > have limited control over their encode pipeline.

          Frustratingly this seems common in many video encoding technologies. The code is opaque, often has special kernel, GPU and hardware interfaces which are often closed source, and by the time you get to the user API (native or browser) it seems all knobs have been abstracted away and simple things like choosing which frame to use as a keyframe are impossible to do.

          I had what I thought was a simple usecase for a video codec - I needed to encode two 30 frame videos as small as possible, and I knew the first 15 frames were common between the videos so I wouldn't need to encode that twice.

          I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

          • orisho 3 days ago ago

            A 15-frame min and max GOP size would do the trick; then you'd get two 15-frame GOPs. Each GOP can be concatenated with another GOP with the same properties (resolution, format, etc.) as if they were independent streams. So there is actually a way to do this. This is how video splitting and joining without re-encoding works: at GOP boundaries.

            • londons_explore 3 days ago ago

              In my case, bandwidth really mattered, so I wanted it all in one GOP.

              Ended up making a bunch of patches to libx264 to do it, but the compute cost of all the encoding on CPU is crazy high. On the decode side (which runs on consumer devices), we just make the user decode the prefix many times.

          • Sesse__ 3 days ago ago

            > I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

            fork()? :-)

            But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?

            • londons_explore 3 days ago ago

              A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document-writing process!

              In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

              However this is not the case with video codecs - but this is just one of many examples of where the video codec landscape is limiting.

              Another example: on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame, for nearly all use cases, ends up downloaded twice - once as a JPEG, and again inside the video content. There is no reasonable way to avoid that - but doing so would reduce the latency to play videos by quite a lot!

              • tsimionescu 3 days ago ago

                > A word processor can save it's state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!

                No, they generally can't save their whole internal state to be resumed later, and definitely not in the document you were editing. For example, when you save a document in vim it doesn't store the mode you were in, or the keyboard macro step that was executing, or the search buffer, or anything like that.

                > In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

                Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.

                > Another for example is that on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame for nearly all usecases ends up downloaded twice - once as a jpeg, and again inside the video content.

                Actually, it's extremely common for a video thumbnail to contain extra edits such as overlaid text and other graphics that don't end up in the video itself. It's also very common for the thumbnail to not be the first frame in the video.

                • Sesse__ 3 days ago ago

                  > Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.

                  If you should ever look for an actual example; Cubemap, my video reflector (https://manpages.debian.org/testing/cubemap/cubemap.1.en.htm...), works like that. It supports both config change and binary upgrade by serializing its entire state down to a file and then re-execing itself.

                  It's very satisfying; you can have long-running HTTP connections and upgrade everything mid-flight without a hitch (serialization, exec and deserialization typically takes 20–30 ms or so). But it means that I can hardly use any libraries at all; I have to use a library for TLS setup (the actual bytes are sent through kTLS, but someone needs to do the asymmetric crypto and I'm not stupid enough to do that myself), but it was a pain to find one that could serialize its state. TLSe, which I use, does, but not if you're at certain points in the middle of the key exchange.

                  So yes, it's extremely rare.

                  • vlovich123 3 days ago ago

                    Why not hand off the fd to the new process spawned as a child? That’s how a lot of professional 0 downtime upgrades work: spawn a process, hand off fd & state, exit.

                    • Sesse__ 2 days ago ago

                      That's exactly what it's doing. The tricky part is the “hand off state” part.

                • lelanthran 3 days ago ago

                  > No, they generally can't save their whole internal state to be resumed later, and definitely not in the document you were editing.

                  I broadly agree, but I feel you chose a poor example - Vim.

                  > For example, when you save a document in vim it doesn't store the mode you were in,

                  Without user-mods, it does in fact start up in the mode that you were in when you saved, because you can only save in command/normal mode.

                  > or the keyboard macro step that was executing,

                  Without user-mods, you aren't able to interrupt a macro that is executing anyway, so if you cannot save mid-macro, why would you load mid-macro?

                  > or the search buffer,

                  Vim, by default, "remembers" all my previous searches, all the macros, and all my undos, even across sessions. The undo history is remembered per file.

              • PunchyHamster 3 days ago ago

                > A word processor can save it's state at an arbitrary point...

                As its ENTIRE STATE. Video codecs operate on essentially a full frame + a stream of differences. You might say it's similar to git and you'd be incorrect again, because while with git you can take the current state and "go back" using diffs, that is not the case for video: it always goes forward from the keyframe and resets at the next keyframe.

                It's a fundamentally more complex problem to handle, by an order of magnitude.

          • andrewf 2 days ago ago

            I'm on a media engineering team and agree that applying the tech to a new use case often involves people with deep expertise spending a lot of time in the code.

            I'd guess there are fewer media/codec engineers around today than there were web developers in 2006. In 2006, Gmail existed, but today's client- and server-side frameworks did not. It was a major bespoke lift to do many things which are "hello world" demos with a modern framework in 2025.

            It'd be nice to have more flexible, orthogonal and adaptable interfaces to a lot of this tech, but I don't think the demand for it reaches critical mass.

            • xp84 2 days ago ago

              > It was a major bespoke lift to do many things which are "hello world" demos with a modern framework in 2025.

              This brings back a lot of memories -- I remember teaching myself how to use plain XMLHTTPRequest and PHP/MySQL to implement "AJAX" chat. Boy was that ugly JavaScript code. But on the other hand, it was so fast and cool and I could hardly believe that I had written that.

            • dpe82 2 days ago ago

              I started doing media/codec work around 2007 and finding experienced media engineers at the time was difficult and had been for quite some time. It's always been hard - super specialized knowledge that you can only really pick up working at a company that does it often enough to invest in folks learning it. In my case we were at a company that did desktop video editing software so it made sense, but that's obviously uncommon.

          • 6r17 3 days ago ago

            I wonder if we could scan/test/dig out these hidden features somehow, in a scraping/fuzzing fashion.

      • cma 3 days ago ago

        So for US->Australia/Asia, wouldn't that limit you to 6fps or so due to the half-RTT? Each time a frame finishes arriving, you have 150ms or so for your new request to reach the server.

        • littlestymaar 3 days ago ago

          That sounds fine for most screen-sharing use cases.

    • eichin 3 days ago ago

      Probably either (1) they don't request another jpeg until they have the previous one on-screen (so everything is completely serialized and there are no frames "in-flight" ever), or (2) they're doing a fresh GET for each frame and getting a new connection anyway (unless that kind of thing is pipelined these days? In which case it still falls back to (1) above.)

      • 01HNNWZ0MV43FF 3 days ago ago

        You can still get this backpressure properly even if you're doing it push-style. The TCP socket will eventually fill up its buffer and start blocking your writes. When that happens, you stop encoding new frames until the socket is able to send again.

        The trick is to not buffer frames on the sender.
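        In Node terms, a sketch (encodeFrame is a stand-in for your encoder):

            // push.ts -- let the TCP send buffer gate the encoder
            import type { Socket } from "node:net";

            function streamFrames(sock: Socket, encodeFrame: () => Buffer) {
              const tick = () => {
                // write() returns false once the send buffer is full
                while (sock.write(encodeFrame())) { /* drain into the socket */ }
                sock.once("drain", tick);  // encode again only when it can send
              };
              tick();
            }

        (Real code would also pace encodeFrame to the capture frame rate rather than spinning.)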

        • mikepavone 3 days ago ago

          You probably won't get acceptable latency this way since you have no control over buffer sizes on all the boxes between you and the receiver. Buffer bloat is a real problem. That said, yeah if you're getting 30-45 seconds behind at 40 Mbps you've probably got a fair bit of sender-side buffering happening.

          • Sesse__ 2 days ago ago

            > you have no control over buffer sizes on all the boxes between you and the receiver

            You certainly do; the amount of data buffered can never be larger than the actual number of bytes you've sent out. Bufferbloat happens when you send too much stuff at once and nothing (typically the candidate to do so would be either the congestion window or some intermediate buffer) stops it from piling up in an intermediate buffer. If you just send less from userspace in the first place (which isn't a good thing to do for e.g. a typical web server, but _can_ be for this kind of video conference-like application), it can't pile up anywhere.

            (You could argue that strictly speaking, you have no control over the _buffer_ sizes, but that doesn't matter in practice if you're bounding the _buffered data_ sizes.)

    • chrisweekly 3 days ago ago

      Related tangent: it's remarkable to me how a given jpeg can be literally visually indistinguishable from another (by a human on a decent monitor) yet consist of 10-15% as many bytes. I got pretty deep into web performance and image optimization in the late 2000s and it was gratifying to have so much low-hanging fruit.

    • lelanthran 3 days ago ago

      > Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.

      They said playing around with bitrate didn't reduce the latency; all that happened was they got blocky videos with the latency remaining the same.

    • afiori 3 days ago ago

      I am almost sure that the ideal solution would involve using a video codec, but the issue is implementation complexity and having to implement a production encoder yourself if your use case is unusual.

      This is exactly the point of the article: they tried keyframes-only, but their library had a bug that broke it.

    • nazgul17 3 days ago ago

      Regarding the encoding efficiency, I imagine the problem is that the compromise in quality shows in the space dimension (aka fewer or blurry pixels) rather than in time. Users need to read text clearly, so the compromise in the time dimension (fewer frames) sounds just fine.

      • mikepavone 3 days ago ago

        Nothing is stopping you from encoding h264 at a low frame rate like 5 or 10 fps. In WebRTC, you can actually specify how you want to handle low-bitrate situations with degradationPreference. If set to maintain-resolution, it will prefer sacrificing frame rate.
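        For example (a sketch; browser support for degradationPreference varies, and pc/track/stream are assumed to already exist):

            // sharp-text.ts -- trade frame rate, not resolution
            declare const pc: RTCPeerConnection;
            declare const track: MediaStreamTrack;
            declare const stream: MediaStream;

            track.contentHint = "text";  // hint: screen text, not motion video

            const sender = pc.addTrack(track, stream);
            const params = sender.getParameters();
            (params as any).degradationPreference = "maintain-resolution";
            await sender.setParameters(params);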

  • adamjs 3 days ago ago

    They might want to check out what VNC has been doing since 1998: keep the client-pull model, break the framebuffer up into tiles and, when the client requests an update, perform a diff against the last frame sent, compositing the updated tiles client-side. (This is what VNC falls back to when it doesn’t have damage-tracking from the OS compositor.)

    This would really cut down on the bandwidth of static coding terminals where 90% of screen is just cursor flashing or small bits of text moving.

    If they really wanted to be ambitious they could also detect scrolling and do an optimization client-side where it translates some of the existing areas (look up CopyRect command in VNC).
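    A sketch of the tile-diff part (illustrative; real VNC adds damage tracking and smarter per-tile encodings):

        // tiles.ts -- find the 64x64 RGBA tiles that changed since the last frame
        const TILE = 64;

        function changedTiles(prev: Uint8Array, next: Uint8Array, w: number, h: number) {
          const dirty: { x: number; y: number }[] = [];
          for (let ty = 0; ty < h; ty += TILE) {
            for (let tx = 0; tx < w; tx += TILE) {
              scan: for (let y = ty; y < Math.min(ty + TILE, h); y++) {
                for (let x = tx; x < Math.min(tx + TILE, w); x++) {
                  const i = (y * w + x) * 4;
                  if (prev[i] !== next[i] || prev[i + 1] !== next[i + 1] || prev[i + 2] !== next[i + 2]) {
                    dirty.push({ x: tx, y: ty });  // tile differs: ship it, skip the rest
                    break scan;
                  }
                }
              }
            }
          }
          return dirty;  // encode each dirty tile (e.g. as a small JPEG) and send
        }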

    • ryukoposting 3 days ago ago

      Of all the suggestions in the comments here, this seems like the best one to start with.

      Also... I get that the dumb solution to "ugly text at low bitrates" is "make the bitrate higher." But still, nobody looked at a 40M minimum and wondered if they might be looking at this problem from the wrong angle entirely?

      • martinald 3 days ago ago

        In fairness VNC-style approaches are bloody awful even over my 2.5gbit/sec lan on very fast hardware. It just cannot do 4K well (not sure if they need 4k or not).

        I spent some time compiling the "new" xrdp with x264 and it is incredibly good, basically cannot really tell that I'm remote desktoping.

        The bandwidth was extremely low as well. You are correct on that part, 40mbit/sec is nuts for high quality. I suspect if they are using moonlight it's optimized for extremely low latency at the expense of bandwidth?

        • Scaevolus 3 days ago ago

          Moonlight is mostly designed to stream your gaming desktop to a portable device or your TV at minimal latency and maximum quality within a LAN. For that, 40Mbps is quite reasonable. It's obviously absurd for mundane VNC/productivity workloads.

        • adastra22 2 days ago ago

          They are streaming AI coding agents. They are not streaming 4K video.

    • djmips 3 days ago ago

      The blog post did smell of inexperience. Glad to hear there are other approaches - is something like that open source?

      • cogman10 3 days ago ago

        Yup. Go look into tigervnc if you want to see the source. But also you can just search for "tigervnc h.264" and you'll see extensive discussions between the devs on h.264 and integrating it into tiger. This is something that people spent a LOT of brainpower on.

      • tombert 3 days ago ago

        I'm not sure; sometimes being an experienced dev gravitates you towards the lazy solutions that are "good enough". Senior engineers are often expected to work at a rate that precludes solving interesting problems, and so the dumber solution will often win; at least that's been my experience, and what I tell myself to go to sleep at night when I get told for the millionth time that the company can't justify formal verification.

        • djmips 3 days ago ago

          I understand what you're saying and certainly I've come up against that myself. I didn't intend my comment to be super pejorative.

    • Sean-Der 3 days ago ago

      Before reaching for VNC, check out neko: https://github.com/m1k1o/neko

      I worked on a project that started with VNC and had lots of problems: slow connect times and backpressure/latency. Switching to neko was a quick/easy win.

      • majorchord 16 hours ago ago

        if you want something more lightweight... rustdesk has been great for me, it supports multiple adaptable video codecs and can optimize for latency vs image quality.

    • any1 3 days ago ago

      Yes, in fact, the protocol states that the client can queue up multiple requests. The purpose of this is to fill up the gap created by the RTT. It is actually quite elegant in its simplicity.

      An extension was introduced for continuous updates that allows the server to push frames without receiving requests, so this isn't universally true for all RFB (VNC) software. This is implemented in TigerVNC and noVNC to name a few.

      Of course, continuous updates have the buffer-bloat problem that we're all discussing, so they also implemented fairly complex congestion control on top of the whole thing.

      Effectively, they just moved the role of congestion control over to the server from the client while making things slightly more complicated.

    • krater23 2 days ago ago

      Maybe it would be easier to just USE VNC instead. But they mentioned they have written their software in Rust. Looks like nothing is good enough for Rust coders; they need to fail by reimplementing things in Rust before they accept that there are already tools for exactly that.

    • klipklop 3 days ago ago

      Copying how VNC does it is exactly how my first attempt would go. Seems odd to try something like Moonlight which is designed for low latency remote gameplay.

  • Dylan16807 3 days ago ago

    > When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.

    You can still have weird broken stallouts though.

    I dunno, this article has some good problem solving but the biggest and mostly untouched issue is that they set the minimum h.264 bandwidth too high. H.264 can do a lot better than JPEG with a lot less bandwidth. But if you lock it at 40Mbps of course it's flaky. Try 1Mbps and iterate from there.

    And going keyframe-only is the opposite of how you optimize video bandwidth.

    • HelloUsername 3 days ago ago

      > Try 1Mbps and iterate from there.

      From the article:

      “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.

      • martinald 3 days ago ago

        The problem, I think, is that they are using moonlight, which is "designed" to stream games at very low latency. I very much doubt that people need <30ms response times watching an agent terminal or whatever they are showing!

        When you try and use h264 et al at low latency you have to get rid of a lot of optimisations to encode it as quickly as possible. I also highly suspect the vaapi encoder is not very good esp at low bitrates.

        I _think_ moonlight also forces CBR instead of VBR, which is pretty awful for this use case - imagine you have 9 seconds of 'nothing changing' and then the window moves for 0.25 seconds. If you had VBR the encoder could basically send ~0kbit/sec apart from control metadata, and then spike the bitrate up when the window moved (for brevity I'm simplifying here, it's more complicated than this but hopefully you get the idea).

        Basically they've used the wrong software entirely. They should try and look at xrdp with x264 as a start.

        • phire 3 days ago ago

          Yeah, i think the author has been caught out by the fact that there simply isn’t a canonical way to encode h264.

          JPEG is nice and simple, most encoders will produce (more or less) the same result for any given quality settings. The standard tells you exactly how to compress the image. Some encoders (like mozjpeg) use a few non-standard tricks to produce 5-20% better compression, but it’s essentially just a clever lossy preprocessing pass.

          With h264, the standard essentially just says how decompressors should work, and it’s up to the individual encoders to work out how to make the best use of the available functionality for their intended use case. I’m not sure any encoder uses the full functionality (x264 refuses to use arbitrary frame order without b-frames, and I haven’t found an encoder that takes advantage of that). Which means the output of different encoders has wildly different results.

          I’m guessing moonlight makes the assumption that most of its compression will come from motion prediction, and then takes massive shortcuts when encoding iframes.

      • Dylan16807 3 days ago ago

        Rejecting it out of hand isn't actually trying it.

        10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.

        And it would not be blocky garbage, it would still look a lot better than JPEG.

        • vscode-rest 3 days ago ago

          1Mbps for video is the rule of thumb I use. Of course, that will depend on customer expectations. 500K can work, but it won’t be pretty.

          • Dylan16807 3 days ago ago

            For normal video I think that's a good rule of thumb.

            For mostly-static content at 4fps you can cut a bunch more bitrate corners before it looks bad. (And 2-3 JPEGs per second won't even look good at 1Mbps.)

            • qilo 2 days ago ago

              For mostly static content like screencasts, by dropping duplicate frames and producing variable-framerate, losslessly encoded h.264 yuv444 videos, I was getting <100 kbps files at 1024x768 resolution more than a decade ago.

            • 3 days ago ago
              [deleted]
            • jcalvinowens 3 days ago ago

              >> 10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.

              > And 2-3 JPEGs per second won't even look good at 1Mbps.

              Unqualified claims like these are utterly meaningless. It depends too much on exactly what you're doing, some sorts of images will compress much better than others.

          • hn_acker 2 days ago ago

            I can confirm that 500Kbps is not pretty. But when I'm sending screen recordings where text doesn't have to be readable (or isn't present), I try to approach 500K from above.

        • TiredOfLife 3 days ago ago

          Youtube 4k uses VP9 and AV1 codecs that are multiple generations ahead of H.264

          • antonkochubey 2 days ago ago

            VP9 is inferior to H.264

            • Dylan16807 2 days ago ago

              VP8 sucked, VP9 is somewhat better than H.264, and AV1 is a lot better.

      • brigade 3 days ago ago

        Proper rate control for such realtime streaming would also lower framerate and/or resolution to maintain the best quality and latency they can over dynamic network conditions and however little bandwidth they have. The fundamental issue is that they don't have this control loop at all, and are badly simulating it by polling JPEGs.

      • cyberrock 3 days ago ago

        10Mbits is more than the maximum ingest bitrate allowed on Twitch. Granted, anyone who watches a recent game or an IRL stream there might tell you that it should go up to 12 or 15, but I don't think an LLM interface should have trouble. This feels like someone on a 4K monitor defeating themselves through their hedonic treadmill.

    • j45 3 days ago ago

      It might be possible to buffer and queue jpegs for playback as well to help with weird broken stall outs.

      Video players used to call this buffering, and when it went wrong you had "buffering issues."

      Players today can keep an eye on network quality while playing too, which is neat.

  • kccqzy 3 days ago ago

    There are so many things that I would have done differently.

    > We added a keyframes_only flag. We modified the video decoder to check FrameType::Idr. We set GOP to 60 (one keyframe per second at 60fps). We tested.

    Why muck around with P-frames and keyframes? Just make your video 1fps.

    > Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.

    10 Mbps is way too much. I occasionally watch YouTube videos where someone writes code. I set my quality to 1080p to be comparable with the article and YouTube serves me the video at way less than 1Mbps. I did some quick napkin math for a random coding video and it came to 0.6Mbps. It’s not blocky garbage at all.

    • kalleboo 3 days ago ago

      > I occasionally watch YouTube videos

      My experience is that at the same bitrate, real-time hardware encoding is way worse quality than offline CPU encoding (what YouTube does when you upload a video) so you can't compare them directly.

      10 Mbps is still crazy high, and the target should still be around 1 Mbps.

    • taberiand 3 days ago ago

      This blog post smells of LLM, both in the language style and the muddled explanations / bad technical justifications. I wouldn't be surprised if their code is also vibe coded slop.

      • nwallin 3 days ago ago

        > I wouldn't be surprised if their code is also vibe coded slop.

        That's my takeaway from this too. I think they tried the first thing the LLM suggested, it didn't work, they asked the LLM to fix it, and ended up with this crap. They never tried to really understand the problems they were facing.

        Video is really fiddly. You have all sorts of parameters to fiddle with. If you don't dig into that and figure out what tradeoffs you need to make, you'll easily end up in the position where *checks notes* you think you need 40Mbps for 1080p video and 10Mbps is just too shitty.

        There's various points in the article where they talk about having 30 seconds of latency. Whatever's causing this, this is a solved problem. We all have experience dealing with video teleconferencing, this isn't anything new, it's nothing special, they're just doing it wrong. They say it doesn't work because of corporate network policy, but we all use Teams or Slack.

        I think you're right. They just did a bunch of LLM slop and decided to just send it. At no point did they understand any of their problems any deeper than the LLM tried to understand the problem.

        • mrguyorama 2 days ago ago

          >Video is really fiddly.

          But it's really not! Not for "Tweak a few of the default knobs for your use case".

          It takes five minutes to play around with whatever FFMPEG gui front end (like even OBS) to get some intuition about those knobs.

          Like, people stream coding all the time with OBS itself.

          Every twitch streamer and Youtube creator figured out video encoding options, why couldn't they?

          They are using a copy of a game streaming code base for this, which is entirely the opposite set of optimizations they should have sought out.

          Like, this is rank incompetence. Your average influencer knows more about video encoding than these people. So much for LLMs helping people learn!

    • mdavid626 3 days ago ago

      Setting it to 1 FPS might not be enough. The GOP or P-frame setting needs to be adjusted to make every frame a keyframe.

      • Dylan16807 3 days ago ago

        Why would you do that?

        Nearly-static content is where you want even fewer keyframes than usual. In a situation like this you need them when the connection is interrupted and you reset things, and not much of anywhere else.

        • mdavid626 2 days ago ago

          1 FPS with GOP 60 might just simply not play in some players.

          • kccqzy 2 days ago ago

            You wouldn’t use 1fps in conjunction with GOP 60. The original article wanted exactly one key frame every 60 frames and the server drops all frames other than keyframes. I was pointing out that this is a roundabout way of achieving 1 fps.

          • Dylan16807 2 days ago ago

            Is this based on experience? I can't think of a reason for a decoder to care.

      • 3 days ago ago
        [deleted]
    • jcelerier 3 days ago ago

      One man's not-blocky-garbage is another's insufferable hell. Even at 4k I find YouTube quality to be just awful with artefacts everywhere.

  • andai 3 days ago ago

    Many moons ago I was using this software which would screenshot every five seconds and give you a little time lapse at the end of the day. So you could see how you were spending your computer time.

    My hard disk ended up filling up with tens of gigabytes of screenshots.

    I lowered the quality. I lowered the resolution, but this only delayed the inevitable.

    One day I was looking through the folder and I noticed that almost all the image data in almost all of these screenshots was identical.

    What if I created some sort of algorithm which would allow me to preserve only the changes?

    I spent embarrassingly long thinking about this before realizing that I had begun to reinvent video compression!

    So I just wrote a ffmpeg one-liner and got like 98% disk usage reduction :)
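    For the curious, the one-liner is something along these lines (a guess at the flags, not andai's actual command):

        // timelapse.ts -- fold a folder of screenshots into one compressed video
        import { spawn } from "node:child_process";

        spawn("ffmpeg", [
          "-framerate", "1",                             // one screenshot per output second
          "-pattern_type", "glob", "-i", "shots/*.jpg",
          "-c:v", "libx264",
          "-crf", "30",                                  // mostly-identical stills compress very well
          "-pix_fmt", "yuv420p",
          "timelapse.mp4",
        ], { stdio: "inherit" });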

    • kccqzy 2 days ago ago

      I did something similar about fifteen years ago. A command line utility to take screenshots. And QuickTime to convert the image sequence to a video. This was before I had heard of ffmpeg so the last bit was done with the QuickTime UI. I remember also adding the screenshot timestamps as a subtitle track.

    • afiori 3 days ago ago

      You could do better than video compression, because you could use older keyframes or a combination of keyframes.

      I am pretty sure it might be NP-complete to find the best combination.

  • nemothekid 3 days ago ago

    I'm very familiar with the stack and the pain of trying to livestream video to a browser. If JPEG screenshots work for your clients, then I would just stick with that.

    The problem with wolf, gstreamer, moonlight, $third party, is you need to be familiar with how the underlying stack handles backpressure and error propagation, or else things will just "not work" and you will have no idea why. I've worked on 3 projects in the last 3 years where I started with gstreamer, got up and running - and while things worked in the happy path, the unhappy path was incredibly brittle and painful to debug. All 3 times I opted to just use the lower level libraries myself.

    Given all of OP's requirements, I think something like the NVIDIA Video Codec SDK, to a websocket, to Media Source Extensions would fit.

    However, given that even this post seems to be LLM-generated, I don't think the author would care to learn about the actual internals. I don't think this is a solution that could be vibe coded.

    • colechristensen 3 days ago ago

      This is where LLMs shine, where you need to dip your toes into really complex systems but basically just to do one thing with pretty straightforward requirements.

      • stefan_ 3 days ago ago

        The peak of irony, because you know that how these people arrived at their 40 Mbit H264 bitrate in the first place, and their ineffective tinkering with the same, is guaranteed to be some LLM's expert suggestions. As is often the case, because they had no understanding of the really complex system subject matter whatsoever, they were unable to guide the LLM and ended up with... slop. Which then turned into a slop blog post.

        God knows what process led them to do video streaming for showing their AI agent work in the first place. Some fool must have put "I want to see video of the agent working" in.. and well, the LLM obliged!

        • antisol 3 days ago ago

          > God knows what process led them to do video streaming for showing their AI agent work in the first place.

          This was my first thought, too.

          • oskarw85 2 days ago ago

            How else are they going to sell it to all those micromanagers who micromanage things?

        • mrguyorama 2 days ago ago

          >As is often the case, because they had no understanding of the really complex system subject matter whatsoever

          Something I want to harp on because people keep saying this:

          Video streaming is not complicated. Every youtuber and twitch streamer and influencer can manage it. By this I mean the actual act of tweaking your encoding settings to get good quality for low bitrate.

          In 3 months with an LLM, they learned less about video streaming than you can learn from a 12 year old's 10 minute youtube video about how to set up Hypercam2

          Millions and millions of literal children figured this out.

          Keep this in mind next time anyone says LLMs are good for learning new things!

          • colechristensen 2 days ago ago

            ... have you ever tried to do anything with ffmpeg? Tried to back up a DVD to a compressed file?

            Video codecs are some of the most complex software I've ever encountered, with the most numerous and most opaque options.

            It's easy for streamers because they don't have options, twitch et al give you about three total choices, there's nothing to figure out.

            • nemothekid 2 days ago ago

              Video streaming has surprisingly little overlap with video codecs. Once you choose input/output options, there's little to change about the codec. The vast majority of options available to ffmpeg aren't supported in the browser. Streamers don't have options for precisely the same reason OP doesn't have options - you are limited entirely to what the browser supports.

              I've built the exact pipeline OP has done - Video, over TCP, over Websockets, precisely because I had to deliver video to through a corporate firewall. Wolf, Moonlight and maybe even gstreamer just shows they didn't even try to understand what they were doing, and just threw every buzzword into an LLM.

              To give you some perspective, 40Mbps is an incredible amount of bandwidth. Blu-ray is 40Mbps. This video, in 8K on YouTube, is 20Mbps: https://www.youtube.com/watch?v=1La4QzGeaaQ

              There's really no explanation for this.

            • mrguyorama 2 hours ago ago

              I have done a bit with ffmpeg and video encoding. I've been encoding videos using ffmpeg (from a GUI) since I was a child. I hate ffmpeg though, the UX is just insane, so I tend more towards tools that produce the arcane command structures for me.

              I had a situation where I wanted to chop one encoded video into multiple parts without re-encoding (I had a deadline), and the difficulty of getting ffmpeg to do sensible things in that context was insane. One way of splitting the video without re-encoding just left the first GOP without an I-frame, so the first seconds of video were broken. Then another attempt left me with video that just got re-timed, and the audio was desynced entirely. I know encoding some frames will be necessary to fix where cuts would break P and B frames, but why is it so hard to get it to "smartly" encode only those broken GOPs when trying to splice and cut video? Clearly I was missing some other parameters or knowledge or incantation that would have done exactly that.

              The few knobs that actual video encoder users need to tweak are clearly exposed and usable in every application I have ever used.

              >twitch et al give you about three total choices

              You don't configure your video encoding through twitch, you do it in OBS. OBS has a lot of configuration available. Also, those three options (bitrate type, bitrate value, profile, "how much encoding time to take" and """quality""" magic number) are the exact knobs they should have been tweaking to come up with an intuition about what was happening.

              Regardless, my entire point is that they were screwing around with video encoding pipelines despite having absolutely no intuition at all about video encoding.

              They weren't even using FFMPEG. They were using an open source implementation of a video game streaming encoder. Again, they demonstrably have no freaking clue even the basics of the space. Even that encoder should be capable of better than what they ended up with.

              We've been doing this exact thing for decades. None of this is new. None of this is novel. There's immense literature and expertise and tons of entry level content to build up intuition and experience with what you should expect encoded video to take bandwidth wise. Worse, Microsoft RDP and old fashioned X apps were doing this over shitty dial up connections decades ago, mostly by avoiding video encoding entirely. Like, we made video with readable text work off CDs in a 2x drive!

              Again, Twitch has a max bandwidth much lower than 40mb/s and people stream coding on it all the time with no issue. That they never noticed how obscenely off the mark they are is sad.

              It would be like if a car company wrote a blog post about how "We replaced tires on our car with legs and it works so much better" and they mention all the trouble they had with their glass tires in the blog.

              They are charging people money for this, and don't seem to have any desire to fix massive gaps in their knowledge, or even wonder if someone else has done this before. It's lame. At any point, did they even say "Okay, we did some research and in the market we are targeting we should expect a bandwidth budget of X mb/s"?

              "AI" people often say they are super helpful for research, and then stuff like this shows up.

      • PunchyHamster 3 days ago ago

        ...and apparently waste 3 months doing it wrong thanks to it, without doing anything as basic as "maybe fix your bitrate, it's far higher than any gameplay streaming site allows, and that's for video games, stuff with much more movement."

        40Mbit is 1080p Blu-ray bitrate territory.

  • somehnguy 3 days ago ago

    40mbps for video of an LLM typing text didn't immediately fire off alarm bells in anyone's head that their approach was horribly wrong? That's an insane amount of bandwidth for what they're trying to do.

    • giantrobot 3 days ago ago

      And they for some reason need a 60fps stream to...watch a computer type. No one stopped for a second and asked "maybe we don't know anything about the problem domain". They seem to have given a vague description to an LLM and assumed it knew what it was talking about.

    • throw-12-16 2 days ago ago

      If all you know is vibe coding and the LLM didn’t tell you 40Mbps is too much, how would you know?

      • somehnguy 2 days ago ago

        I'm afraid to see just how poorly we can utilize computing resources in the future due to cluelessness.

    • lomase 3 days ago ago

      That is where LLMs shine. They let you know who is a fraud.

      40mbps to stream a terminal? Are you kidding me?

  • Tarean 3 days ago ago

    Having pair programmed over some truly awful and locked down connections before, dropped frames are infinitely better than blurred frames which make text unreadable whenever the mouse is moved. But 40mbps seems an awful lot for 1080p 60fps.

    Temporal SVC (reduce framerate if bandwidth-constrained) is pretty widely supported by now, right? Though maybe not for H.264, so it probably would have scaled nicely, but only over WebRTC?

  • dotancohen 3 days ago ago

    They're just streaming a video feed of an LLM running in a terminal? Why not stream the actual text? Or fetch it piecemeal over AJAX requests? They complain that corporate networks support only HTTPS and nothing else? Do they not understand what the first T stands for?

    • eterm 3 days ago ago

      Indeed, live text streaming is well over 100 years old:

      https://en.wikipedia.org/wiki/Teleprinter

    • TZubiri 3 days ago ago

      Suppose an LLM opens a browser, or opens a corporate .exe and GUI and starts typing in there and clicking buttons.

      • worksonmine 3 days ago ago

        You don't give it a browser or buttons to click.

        • j-me 3 days ago ago

          I think we've crossed the Rubicon when it comes to that

  • keerthiko 3 days ago ago

    > The fix was embarrassingly simple: once you fall back to screenshots, stay there until the user explicitly clicks to retry.

    There is another recovery option:

    - increase the JPEG framerate every couple of seconds until the bandwidth consumption approaches the H264 stream bandwidth estimate

    - keep track of latency changes. If the client reports a stable latency range, and it is acceptable (<1s latency, <200ms variance?), and bandwidth use has reached 95% of the H264 estimate, re-activate the stream (a rough sketch follows below)
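
    A rough sketch of that ramp-and-promote loop (names and thresholds are mine, purely illustrative):

        // recover.ts -- ramp JPEG polling up, promote back to H264 when stable
        interface Stats { latencyMs: number; latencyVarMs: number; kbps: number }

        let jpegFps = 2;

        function onStats(s: Stats, h264EstimateKbps: number, switchToH264: () => void) {
          if (s.kbps < h264EstimateKbps && jpegFps < 15) {
            jpegFps += 1;  // probe upward every couple of seconds
          }
          const stable = s.latencyMs < 1000 && s.latencyVarMs < 200;
          if (stable && s.kbps >= 0.95 * h264EstimateKbps) {
            switchToH264();  // the pipe has proven it can carry the stream
          }
        }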

    Given that text/code is what is being viewed, lower res and adaptive streaming (HLS) are not really viable solutions since they become unreadable at lower res.

    If remote screen sharing is a core feature of the service, I think this is a reasonable next step for the product.

    That said, IMO, at a higher level: if you know what you're streaming is human-readable text, it's better to send the application data over the pipe rather than encoding screen-space video. That does however require building bespoke decoders and client viewers if real-time collaboration network clients don't already exist for the tools (but SSH and RTC code editors exist).

  • lewq 3 days ago ago

    Hi, author of the post here. Just fixed up some formatting issues from when we copied it into Substack, sorry about that. Yeah, I used Opus 4.5 to help me write it (and it actually made me laugh!). But the struggle was real.

    Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested, whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames.

    Yes, I wish we could UDP in enterprise networks too, but we can't. The problem actually isn't opening the UDP port, it's hosting UDP on their Kubernetes cluster. "You want to what?? We have ingress. For HTTPS."

    Join our discord for private beta in January! https://discord.gg/VJftd844GE

    (This post written by human)

    • kixelated 3 days ago ago

      Hey lewq, 40Mbps is an absolutely ridiculous bitrate. For context, Twitch maxes out around 8.5Mbps for 1440p60. Your encoder was poorly configured, that's it. Also, it sounds like your mostly static content would greatly benefit from VBR; you could get the bitrate down to 1Mbps or something for screen sharing.

      And yeah, the usual approach is to adapt your bitrate to network conditions, but it's also common to modify the frame rate. There's actually no requirement for a fixed frame rate with video codecs. You could also do the same "encode on demand" approach with a codec like H.264, provided you're okay with it being low-FPS on high-RTT connections (poor Australians).

      Overall, using keyframes only is a very bad idea. It's how the low quality animated GIFs used to work before they were secretly replaced with video files. Video codecs are extremely efficient because of delta encoding.

      But I totally agree with ditching WebRTC. WebSockets + WebCodecs is fine provided you have a plan for bufferbloat (ex. adaptive bitrate, ABR, GoP skipping).
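      The receive side of that can be small. A sketch (assumes one encoded H.264 frame per binary WebSocket message; paint and isKeyframe are stand-ins for your renderer and framing protocol):

          // ws-decode.ts -- WebSocket + WebCodecs playback
          declare function paint(f: VideoFrame): void;
          declare function isKeyframe(b: ArrayBuffer): boolean;

          const decoder = new VideoDecoder({
            output: (frame) => { paint(frame); frame.close(); },
            error: (e) => console.error(e),
          });
          decoder.configure({ codec: "avc1.42E01E", optimizeForLatency: true });

          const ws = new WebSocket("wss://example.com/stream");
          ws.binaryType = "arraybuffer";
          ws.onmessage = (ev) => {
            decoder.decode(new EncodedVideoChunk({
              type: isKeyframe(ev.data) ? "key" : "delta",
              timestamp: performance.now() * 1000,  // microseconds
              data: ev.data,
            }));
          };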

    • Dylan16807 3 days ago ago

      > Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested. Whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames.

      I understand that logic but I don't really agree with it. Very aggressive bitrate controls can do a lot to keep that buffer tiny while still looking better than JPEG, and if it bloats beyond 1-2 seconds you can reset. A reset like that wouldn't look notably worse than JPEG mode always looks.

      If you use a video encoder that gives you good insight into what it's doing you could guarantee that the buffer never gets bigger than 1-2 JPEGs by dynamically deciding when to add frames. That would give you the huge benefits of P-frames with no downside.

    • SunlitCat 2 days ago ago

      Hi lewq, commentator of your post here.

      Yeah, I used ChatGPT to help me write this answer ;) (Unlike JPEGs, it works at the right abstraction level for text.)

      I think the core issue isn’t push vs pull or frame scheduling, but why you’re sending frames at all. Your use case reads much more like replicating textual/stateful UI than streaming video.

      The fact that JPEG “works” because the client pulls frames on demand is kind of the tell — you’ve built a demand-driven protocol, then used it to fetch pixels. That avoids queuing, sure, but it’s also sidestepping video semantics you don’t actually need.

      Most of what users care about here is text, cursor position, scroll state, and low interaction latency. JPEG succeeds not because it’s old and robust, but because it accidentally approximates an event-driven model.

      Totally fair points about UDP + Kubernetes + enterprise ingress. But those same constraints apply just as well to structured state updates or terminal-style protocols over HTTPS — without dragging a framebuffer along.

      Pragmatic solution, real struggle — but it feels like a text/state problem being forced through a video abstraction, and JPEG is just the least bad escape hatch.

      — a human (mostly)

  • toledocavani 3 days ago ago

    This thread is great; truly, the only way to get great answers on HN is to post a wrong blog. But stupid wrong blogs are unlikely to get onto the HN front page, so kudos to the writer for striking the right balance: easy to understand, working, interesting, but faulty.

  • laurencerowe 3 days ago ago

    If you are OK with a second or so of latency then MPEG-DASH (the standardized counterpart to Apple's HTTP Live Streaming) is likely the best bet. You simply serve the video chunks over HTTP, so it should be just as compatible as the JPEG solution used here but provide 60fps video rather than crappy JPEGs.

    The standard supports adaptive bit rate playback so you can provide both low quality and high quality videos and players can switch depending on bandwidth available.

  • robrain 3 days ago ago

    "Think “screen share, but the thing being shared is a robot writing code.”"

    Thinks: why not send text instead of graphics, then? I'm sure it's more complicated than that...

    • jodrellblank 3 days ago ago

      Thinks: this video[1] is the processed feed from the Huygens space probe landing on Saturn's moon Titan circa 2005. Relayed through the Cassini probe orbiting Saturn, 880 million miles from the Sun. At a total mission cost of 3.25 billion dollars. This is the sensor data: altitude, speed, spin, ultraviolet, and hundreds of photos. (Read the description for what the audio is encoding, it's neat!)

      Look at the end of the video, the photometry data count stops at "7996 kbytes received"(!)

      > "Turns out, 40Mbps video streams don’t appreciate 200ms+ network latency. Who knew. “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage"

      Who could do anything useful with 10Mbps. :/

      [1] https://en.wikipedia.org/wiki/File:Huygens_descent.ogv

      • stefan_ 3 days ago ago

        This is a great new "you can land on the moon with 10 MHz".

    • bambax 3 days ago ago

      Yeah, I'm thinking the same thing. Capture the text somehow and send that, and reconstruct it on the other end; and the best part is you only need to send each new character, not the whole screen, so it should be very small and lightning fast?

      • Snild 3 days ago ago

        Sounds kind of like https://asciinema.org/ (which I've never used, but it seems cool).

        • ku1ik 3 days ago ago

          Which has featured terminal live streaming since the recently released 3.0 :)

  • rekshaw 3 days ago ago

    I remember 12 years ago, while the Flash vs HTML war was still raging on (pre-HTML5), I created a framework for web video playback using CSS and JPEGs. It would expect a set of big JPEGs, each containing the frames of the video in a grid (a "reel"), and play it by changing the CSS background position (and swap out the background with the next JPEG once a "reel" was complete).

    It worked really well, and I also cloned the (at the time) Youtube player UI. Seeking, keyframes, flexible framerate, etc were all supported out of the box thanks to the simple underlying architecture.

    https://github.com/VAS/animite

    • kalleboo 2 days ago ago

      I built a similar solution around the same time for semi-transparent PNG sprite animations. I remember the biggest issue was working around the GPU texture limitations on the early iPads which were exposed by mobile Safari.

  • karhuton 3 days ago ago

    I made this because I got tired of screensharing issues in corporate environments: https://bluescreen.live (code via github).

    Screenshot once per second. Works everywhere.

    I’m still waiting for mobile screenshare API support, so I could quickly use it to show stuff from my phone to other phones with the QR link.

  • materialpoint 2 days ago ago

    The fact that they considered transmitting only keyframes speaks volumes about how inept they are. It can be a cool baseline test, but celebrating trendy choices, like Rust, and not understanding that keyframes and efficient differentials are key to achieving high video compression makes me go completely numb.

  • MBCook 3 days ago ago

    So it’s video of an AI typing text?

    Why not just send text? Why do you need video at all?

    • bogwog 3 days ago ago

      Why send anything at all if the AI isn't even good enough to solve their own problems?

      (Although the fact they decided to use Moonlight in an enterprise product makes me wonder if their product actually was vibe coded)

    • TacticalCoder 3 days ago ago

      You apparently need video for the 45-second window you then get to prevent catastrophic things from happening. From TFA:

      > You’re watching the AI type code from 45 seconds ago
      >
      > By the time you see a bug, the AI has already committed it to main
      >
      > Everything is terrible forever

      Is this satire? I mean: if the solution for things to not be terrible forever consists in catching what an AI is doing in 45 seconds (!) before the AI commits to trunk, I'm sorry but you should seriously re-evaluate your life plans.

      • kimixa 3 days ago ago

        If you can realistically notice and reason out a bug within ~45 seconds of seeing the diff, then they are really shallow "dumb" bugs. The sort that even a junior would be expected to avoid.

        And I wonder how many other massive issues are being committed to main that would take longer to reason out, while you're already looking at the next 45-second shallow bug.

        This has to be a joke, right?

  • plqbfbv 3 days ago ago

    I dabbled a bit with re-encoding videos in the past: 40Mbps is basically Blu-ray quality (1080p/4K depending on content), and it's being used to stream a mostly-static background with some text scrolling in front of it.

    A 3-minute chat with Claude suggests 30FPS should be plenty (perhaps minor cursor lag can be noticed if it's drawn), with a GOP of 2s (60 frames) for fast recovery, VBR at 1Mbps average with a max bitrate of 1.2Mbps for crappy connections, and B-frames to minimize bandwidth usage (because we have hw encoding).

    The crappiest of internet cafes should still be able to guarantee 1.2Mbps (150KB/s). If they can do 5-10FPS with 150KB frames, they have 6-12Mbps available. Worst case, the GOP can be reduced to 15 frames, so that there are two I-frames every second and the latency is 500ms tops.
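
    Roughly what those settings look like as an ffmpeg invocation, for illustration only (an untested sketch; the input file is a placeholder and the numbers are the ones from the chat above, not anything verified against the article):

        import subprocess

        # ~30fps screen content, 2s GOP, ~1Mbps VBR capped at 1.2Mbps.
        # -tune zerolatency drops lookahead and B-frames; keep B-frames
        # instead if your latency budget allows it.
        subprocess.run([
            "ffmpeg", "-i", "capture.mkv",   # placeholder source
            "-c:v", "libx264",
            "-preset", "veryfast",
            "-tune", "zerolatency",
            "-r", "30",                      # frame rate
            "-g", "60",                      # GOP length: 2 seconds at 30fps
            "-b:v", "1M",                    # average bitrate
            "-maxrate", "1.2M", "-bufsize", "2M",  # VBV cap for crappy connections
            "-f", "mpegts", "out.ts",
        ], check=True)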

  • jayd16 3 days ago ago

    So they replaced a TCP stream with no application-level backpressure with a synchronous poll of an endpoint, which is inherently congestion controlled.

    I wonder if they just tried restarting the stream at a lower bitrate once it got too delayed.

    The talk about how the images look crisper at a lower FPS is just tuning that I guess they didn't bother with.

  • Jakob 3 days ago ago

    Yes, this is unfortunately still the way and was very common back when iOS Safari did not allow embedded video.

    For a fast start of the video, reverse the implementation: instead of downgrading from WebSockets to polling when the connection fails, you should upgrade from polling to WebSockets when the network allows.

    Socket.io was one of the first libraries to do that switching, and it had it the wrong way around at first, too. They learned how enterprise networks behave and switched the implementation.

  • zipy124 3 days ago ago

    This is just poor engineering. H.264 streaming is obviously superior to JPEG streaming, else MJPEG (motion jpeg) would be standard for screen sharing. In addition if all you're sharing is a picture of text, and you have access to the text, you can just send the damn text instead and render it locally.

  • andai 3 days ago ago

    I recognize this voice :) This is Claude.

  • rcarmo 3 days ago ago

    This was the most entertaining thing I read all day. Kudos.

    I've had similar experiences in the past when trying to do remote desktop streaming for digital signage (which is not particularly demanding in bandwidth terms). Multicast streaming video was the most efficient, but annoying to decode when you dropped data. I now wonder how far I could have gone with JPEGs...

    • j45 3 days ago ago

      When playing with Chromecast-type devices, multicast or manually streaming one frame at a time worked pretty well.

      • rcarmo 2 days ago ago

        I have already started hacking at a proof of concept… let’s see how fun it turns out to be.

  • tcherasaro 3 days ago ago

    Reminds me of when I was working on the video system for a mast on a submarine 20 years ago.

    The customer had an impossible set of latency, resolution, processing and storage requirements for their video. They also insisted we use this new H.264 standard that had just come out, though it was not a requirement.

    We quickly found MJPEG was superior for meeting their requirements in every way. It took a lot of convincing though. H.264 was and would still be a complete non-starter for them.

  • any1 3 days ago ago

    I have some experience with pushing video frames over TCP.

    It appears that the writer has jumped to conclusions at every turn and it's usually the wrong one.

    The reason that the simple "poll for jpeg" method works is that polling is actually a very crude congestion control mechanism. The sender only sends the next frame when the receiver has received the last frame and asks for more. The downside of this is that network latency affects the frame rate.

    The frame rate issue with the polling method can be solved by sending multiple frame requests at a time, but only as many as will fit within one RTT, so the client needs to know the minimum RTT and the sender's maximum frame rate.

    The RFB (VNC) protocol does this, by the way. Well, the thing about rtt_min and frame rate isn't in the spec though.
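
    A client-side sketch of that pipelined pull in Python (the frame endpoint and display() are made up for illustration):

        from concurrent.futures import ThreadPoolExecutor

        import requests  # any HTTP client works

        FRAME_URL = "https://example.com/frame.jpg"  # hypothetical endpoint

        def fetch_frame() -> bytes:
            return requests.get(FRAME_URL, timeout=5).content

        # Keep enough requests in flight to cover one RTT at the target rate:
        # in_flight ~= ceil(rtt_min * max_fps). E.g. 200ms RTT at 10fps -> 2.
        IN_FLIGHT = 2
        with ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
            pending = [pool.submit(fetch_frame) for _ in range(IN_FLIGHT)]
            while True:
                frame = pending.pop(0).result()  # wait for the oldest request
                display(frame)                   # hypothetical render call
                pending.append(pool.submit(fetch_frame))  # one new request per frame shown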

    Now, I will not go through every wrong assumption, but as for this nonsense about P-frames and I-frames: with TCP, you only need one I-frame. The rest can be all P-frames. I don't understand how they came to the conclusion that sending only I-frames over TCP might help with their latency problem. Just turn off B-frames and you should be OK.

    The actual problem with the latency was that they had frames piling up in buffers between the sender and the receiver. If you're pushing video frames over TCP, you need feedback. The server needs to know how fast it can send. Otherwise, you get pile-up and a bunch of latency. That's all there is to it.

    The simplest, absolutely foolproof way to do this is to use TCP's own congestion control. Spin up a thread that does two things: encodes video frames and sends them out on the socket using a blocking send/write call. Set SO_SNDBUF on that socket to a value that's proportional to your maximum latency tolerance and the rough size of your video frames.
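
    In Python the whole scheme is a handful of lines (a sketch; encode_next_frame() stands in for your encoder, and the buffer math is approximate):

        import socket

        MAX_LATENCY_S = 0.5
        BITRATE_BPS = 1_000_000  # rough encoder output rate

        sock = socket.create_connection(("example.com", 443))  # hypothetical receiver
        # Cap the kernel send buffer to roughly one latency budget worth of video.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF,
                        int(MAX_LATENCY_S * BITRATE_BPS / 8))

        while True:
            frame = encode_next_frame()  # hypothetical: one encoded frame as bytes
            sock.sendall(frame)  # blocks when the buffer fills; that blocking IS the feedback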

    One final bit of advice: use ffmpeg (libavcodec, libavformat, etc). It's much simpler to actually understand what you're doing with that than some convoluted gstreamer pipeline.

  • egorfine 3 days ago ago

    > The constraint that ruined everything: It has to work on enterprise networks.

    > You know what enterprise networks love? HTTP. HTTPS. Port 443. That’s it. That’s the list.

    That's not enough.

    Corporate networks also love to MITM their own workstations and reinterpret HTTP traffic. So, no WebSockets and no Server-Sent Events either, because their corporate firewall is a piece of software no one in the world wants and everyone in the world hates, including its own developers. Thus it only supports a subset of HTTP/1.1, and sometimes it likes to change the content while keeping Content-Length intact.

    And you have to work around that, because IT dept of the corporation will never lift restrictions.

    I wish I was kidding.

    • streptomycin 3 days ago ago

      Back when I had a job at a big old corporation, a significant part of my value to the company was that I knew how to bypass their shitty MITM thing that broke tons of stuff, including our own software that we wrote. So I could solve a lot of problems people had that otherwise seemed intractable because IT was not allowed to disable it, and they didn't even understand the myriad ways it was breaking things.

    • Aurornis 3 days ago ago

      > So, no WebSockets

      The corporate firewall debate came up when we considered websockets at a previous company. Everyone has parroted the same information for so long that it was just assumed that websockets and corporate firewalls were going to cause us huge problems.

      We went with websockets anyway and it was fine. Almost no traffic to the no-websockets fallback path, and the traffic that did arrive appeared to be from users with intermittent internet connections (cellular providers, foreign countries with poor internet).

      I'm 100% sure there are still corporate firewalls out there blocking or breaking websocket connections, but it's not nearly the same problem in 2025 as it was in 2015.

      If your product absolute must, no exceptions, work perfectly in every possible corporate environment then a fallback is necessary if you use websockets. I don't think it's a hard rule that websockets must be avoided due to corporate firewalls any more, though.

      • kevlened 3 days ago ago

        I've had to switch from SSE to WebSockets to navigate a corporate network (the proxy buffered the stream, so the entire SSE response had to close before the user received any of it).

        Then we ran into a network where WebSockets were blocked, so we switched to streaming http.

        No trouble with streaming http using a standard content-type yet.

    • michaelt 3 days ago ago

      > And you have to work around that, because IT dept of the corporation will never lift restrictions.

      Unless the corporation is 100% in-office, I’d wager they do in fact make exceptions - otherwise they wouldn’t have a working videoconferencing system.

      The challenge is getting corporate insiders to like your product enough to get it through the exception process (a total hassle) when the firewall’s restrictions mean you can’t deliver a decent demo.

      • darthwalsh 3 days ago ago

        I think our corporate VPN doesn't send Zoom video traffic through the VPN. So even with the VPN enabled, you didn't see any dropped frames.

        Split tunnelling means the UDP packets just go through the normal internet.

    • isoprophlex 3 days ago ago

      Request URL has a query parameter with more than 64 characters? Fuck you.

      Request lives for longer than 15 sec? Fuck you.

      Request POSTs some JSON? Maybe fuck you just a little bit, when we find certain strings in the payload. We won't tell you which though.

    • rcarmo 3 days ago ago

      They even break server-sent events (which is still my default for most interactive apps)

      • j45 3 days ago ago

        There are other ways to make server-sent events work.

        I try to remember that many of these environments once likely supported Flash.

      • 3 days ago ago
        [deleted]
    • thescriptkiddie 3 days ago ago

      > it likes to change the content while keeping Content-Length intact

      thanks, i had repressed that memory

    • ris 3 days ago ago

      Corporate IT needs to die.

      • j45 3 days ago ago

        It's not corporate IT's fault; it's usually corporate leadership's fault, who often cosplay at leading technology without understanding it.

        Wherever tech is a first-class citizen with a seat at the corporate table, it can be different.

        • pmontra 3 days ago ago

          Sometimes they have checkboxes to tick in some compliance document, and they must run the software that lets them tick those checkboxes, no exceptions, because those compliance regimes allow the company to be on the market. Regulatory capture, etc.

        • michaelt 3 days ago ago

          Believe me, the average Fortune 500 CEO does not know or care what “SSL MITM” is, or whether passwords should contain symbols and be changed monthly, or what the difference is between ‘VPN’ and ‘Zero Trust’.

          They delegate that stuff. To the corporate IT department.

          • esseph 3 days ago ago

            But they also say "Here, this is Sarah your auditor. Answer these questions and resolve the findings." - every year

            It's all cybersecurity insurance compliance that in many cases deviates from security best practices.

            • cogman10 3 days ago ago

              This is where the problems come from. Auditors are definitely what ultimately causes IT departments to make dumb decisions.

              For example, we got dinged on an audit because instead of using RSA-4096, we used Ed25519. I kid you not, their main complaint was that there weren't enough bits, which meant it wasn't secure.

              Auditors are snake oil salesmen.

            • RankingMember 3 days ago ago

              This is 100% it- the auditor is confirming the system is configured to a set of requirements, and those requirements are rarely in lockstep with actual best practices.

      • convolvatron 3 days ago ago

        where else are you going to find customers that are so sticky it will take years for them to select another solution regardless of how crappy you are. that will staff teams to work around your failures. who, when faced with obvious evidence of the dysfunction of your product, will roundly blame themselves for not holding it properly. gaslight their own users. pay obscene amounts for support when all you provide is a voice mailbox that never gets emptied. will happily accept your estimate about the number of seats they need. when holding a retro about your failure will happily proclaim that there wasn't anything _they_ could have done, so case closed.

        • egorfine 3 days ago ago

          Oh yes you can absolutely profit off that but you have to be dead inside a little bit.

          And produce a piece of software no one in the world wants and everyone in the world hates. Yourself included.

      • embedding-shape 3 days ago ago

        I think the general idea/flow of things is "numbers go up, until $bubble explodes, and we built up smaller things from the ground up, making numbers go up, bloating go up, until $bubble explodes..." and then repeat that forever. Seems to be the end result of capitalism.

        If you wanna kill corporate IT, you have to kill capitalism first.

        • mananaysiempre 3 days ago ago

          I’d say there’s nothing inherently capitalist about large and stupid bureaucracies (but I repeat myself) spending money in stupid ways. Military bureaucracies in capitalist countries do it. Military bureaucracies in socialist countries did it. Everything else in end-stage socialist countries did it too. I’m sorry, it’s not the capitalism—things’d be much easier if it were.

          • queenkjuul 2 days ago ago

            Maybe military people are just uniquely stupid

            • mananaysiempre 2 days ago ago

              Not at all, no. I gave that example because, first, even in a profoundly capitalist country (whatever that means) the military itself is not particularly motivated by profit; and second, because it’s one of the few bureaucratic organizations that will not (be allowed to) collapse under the weight of its own inefficiencies and so easily grows much larger than is otherwise typical.

        • gspr 3 days ago ago

          I don't believe that. I don't necessarily love capitalism (though I can't say I see very many realistic better alternatives either), but if HN is full of people who could do corporate IT better (read: sanely), then the conclusion is just that corporate IT is run by morons. Maybe that's because the corporate owners like morons, but nothing about capitalism inherently makes it so.

          • dylan604 3 days ago ago

            > corporate IT is run by morons

            playing devil's advocate for a second, but corpIT is also working with morons as employees. most draconian rules used by corpIT have a basis in at least one real world example. whether that example happened directly by one of the morons they manage or passed along from corpIT lore, people have done some dumb ass things on corp networks.

            • mananaysiempre 3 days ago ago

              Yes, and the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid. Morons will always find a way to stupid, and most organizations push the impediments well past the point of diminishing returns.

              • KPGv2 3 days ago ago

                > the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid

                I would say the problem in the picture is your belief that corporate IT is introducing technical impediments against every instance of stupidity. I bet there's loads of stupidity they don't introduce technical impediments against. It would just not meet the cost-benefit analysis to spend thousands of tech man-hours introducing a new impediment that didn't cost the company much if any money.

          • KPGv2 3 days ago ago

            It's because corporate IT has to service non-tech people, and non-tech people get pwned by tech savvy nogoodniks. So the only sane behavior of corporate IT is to lock everything down and then whitelist things rarely.

          • layer8 3 days ago ago

            Apparently capitalism doesn’t pay enough for corporate IT admin jobs.

          • 3 days ago ago
            [deleted]
    • j45 3 days ago ago

      At the same time, enterprise is where the revenue is.

      • isoprophlex 3 days ago ago

        Against all odds, you're right, that's where somehow revenue is being generated. IT idiocy notwithstanding.

        • j45 3 days ago ago

          Often, enterprises create moats and then profit from them.

          It's not usually IT idiocy, that usually comes from higher up cosplaying their inner tech visionaries.

    • gruez 3 days ago ago

      >And you have to work around that, because IT dept of the corporation will never lift restrictions.

      Because otherwise people do dumb stuff like pasting proprietary designs or PII into deepseek

      • kbelder 3 days ago ago

        Oh, they'll do that anyway, once they find the workaround (Oh... you can paste a credit card if you put periods instead of dashes! Oh... I have to save the file and do it from my phone! Oh... I'll upload it as a .txt file and change the extension on the server!)

        It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.

        • gruez 3 days ago ago

          >Oh, they'll do that anyway, once they find the workaround ...

          This is assuming the DLP service blocks the request, rather than doing something like logging it and reporting it to your manager and/or CIO.

          >It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.

          Because you can't ask deepseek to extract some unstructured data for you? I'm not sure what the alternative is, just let everyone paste info into deepseek? If you found out that your data got leaked because some employee pasted some data into some random third party service, and that the company didn't have any policies/technological measures against it, would your response still be "yeah it's fine, it's purely illusory security"?

        • unethical_ban 3 days ago ago

          What's the term for the ideology that "laws are silly because people sometimes break them"?

          • jeltz 3 days ago ago

            Posting stuff into Deepseek is banned. The corporate firewall is like putting a camera in your home because you may break the law. But, yeah, arguing against cameras in homes because people find blind spots where they can hide may not be the strongest argument.

            • unethical_ban 3 days ago ago

              Disclaimer: I work in corporate cybersecurity.

              I know that some guardrails and restrictions in a corporate setting can backfire. I know that onerous processes to get approval for needed software access can drive people to break the rules or engage in shadow IT. As a member of a firewall team, I did it myself! We couldn't get access to Python packages or PHP for a local webserver we had available to us from a grandfather clause. My team hated our "approved" Sharepoint service request system. So a few of us built a small web app with Bottle (single file web server microframework, no dependencies) and Bootstrap CSS and SQLite backend. Everyone who interacted with our team loved it. Had we more support from corporate it might have been a lot easier.

              Good cybersecurity needs to work with IT to facilitate peoples' legitimate use cases, not stand in the way all the time just because it's easier that way.

              But saying "corporate IT controls are all useless" is just as foolish to me. It is reasonable and moral for a business to put controls and visibility on what data is moving between endpoints, and to block unsanctioned behavior.

              • unethical_ban 2 days ago ago

                Gotta wonder who objects to this and why, and if they have any experience managing IT or business.

          • collingreen 3 days ago ago

            I don't think that's a good read of the post you're implying this at. I think a more charitable read would be something like "people break rules for convenience, so if your security relies on nobody breaking rules then you don't have thorough security".

            You and op can be right at the same time. You imply the rules probably help a lot even while imperfect. They imply that pretending rules alone are enough to be perfect is incomplete.

          • pigeonhole123 3 days ago ago

            It's called black and white thinking

  • benterix 3 days ago ago

    > And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.

    I believe the latter can be adjusted in codec settings.

    • mdavid626 3 days ago ago

      Of course. But a same-quality H.264 keyframe will not be much smaller than a JPEG.

  • imiric 3 days ago ago

    > By the time you see a bug, the AI has already committed it to main

    If you have given your "AI" full control over your repo so that it can commit unreviewed code to the main branch, you have far greater problems than a 45 second video stream delay. Besides, you'd need superhuman abilities to spot a bug in hundreds of lines of generated code in under 45 seconds.

    I know this example is rhetorical and likely produced by an LLM, but this entire project seems misguided. They're streaming video of a graphical text editor to a web browser client, instead of streaming text itself, or using a web-based editor. These are solved problems. This shouldn't be so complicated...

  • refulgentis 3 days ago ago

    The LinkedIn slop tone, random bolding, and miscopied Markdown tables make me invoke: "please read the copy you worked on with AI"

    smaller thing: many, many moons ago, I did a lot of work with H.264. "A single H.264 keyframe is 200-500KB." is fantastical.

    Can't prove it wrong, because it will be correct given arbitrary dimensions and encoding settings, but it's a pretty hard number to end up with.

    Just pulled a couple 1080p's off YouTube, biggest I-frame is 150KB, median is 58KB (`ffprobe $FILE -show_frames -of compact -show_entries frame=pict_type,pkt_size | grep -i "|pict_type=I"`)

    • jamiesonbecker 3 days ago ago

      at least it had a minimum of Clause. Clause. Punchline.

  • algesten 3 days ago ago

    WebSockets over TCP is probably always going to cause problems for streaming media.

    WebRTC over UDP is one choice for lossy situations. Media over QUIC might be another (is the future here?), and it might be more enterprise-firewall friendly since HTTP3 is over QUIC.

  • petcat 3 days ago ago

    so did they reinvent mjpeg

  • Terretta 3 days ago ago

    Helix is a commercial multi-protocol streaming server:

    https://en.wikipedia.org/wiki/Helix_Universal_Server

    HTTP Live Streaming is already a thing:

    https://en.wikipedia.org/wiki/HTTP_Live_Streaming

    See also DASH, M-JPEG, progressive download, etc.

    > "Who knew?"

    Everyone in the streaming industry, and not so long ago that it's been forgotten.

  • wewewedxfgdf 3 days ago ago

    webp is smaller than jpeg

    https://developers.google.com/speed/webp/docs/webp_study

    ALSO - the blog author could simplify - you don't need any code at all in the web browser.

    The <img> tag does Motion JPEG streaming automatically, as long as the server responds with multipart/x-mixed-replace.
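
    A minimal multipart server in Python, as a sketch (grab_screenshot_jpeg() is a hypothetical capture function; a real handler also needs to cope with client disconnects):

        import time
        from http.server import BaseHTTPRequestHandler, HTTPServer

        BOUNDARY = "frame"

        class MJPEGHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                self.send_response(200)
                self.send_header("Content-Type",
                                 f"multipart/x-mixed-replace; boundary={BOUNDARY}")
                self.end_headers()
                while True:
                    jpeg = grab_screenshot_jpeg()  # hypothetical
                    self.wfile.write(f"--{BOUNDARY}\r\n".encode())
                    self.wfile.write(b"Content-Type: image/jpeg\r\n")
                    self.wfile.write(f"Content-Length: {len(jpeg)}\r\n\r\n".encode())
                    self.wfile.write(jpeg + b"\r\n")
                    time.sleep(0.1)  # ~10 fps

        HTTPServer(("", 8080), MJPEGHandler).serve_forever()

    On the page, <img src="http://localhost:8080/"> then keeps replacing itself with each new part.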

    • 3 days ago ago
      [deleted]
    • F3nd0 3 days ago ago

      … and JPEG XL is smaller than WebP.

      • wewewedxfgdf 3 days ago ago

        JPEG XL looks to have pretty poor support.

        https://caniuse.com/jpegxl

        • F3nd0 3 days ago ago

          Yes, though hopefully not for long; unfortunately not all codecs are given equal treatment...

          If having native support in a web browser is important, though, then yes, WebP is a better choice (as is JPEG).

  • bArray 2 days ago ago

    I've literally been here (many) years ago whilst trying to stream video from a potato Linux SBC via WiFi. As you walked further away, the H264 stream would just die and hang, no matter what you did. Stream JPEGs? Worked excellently and adjusted the number of JPEGs per second depending on connection (only requested the next frame after the current one arrived or a timeout occurred).

    This got me thinking about video calls, which are notoriously bad on bad connections. Half the time I am just streaming a screen with static information on it; we're not watching videos together. And yet the streaming pipeline is optimised, as this article suggests, for the higher bandwidth modes, when we're never really using them at all.

    The most important part of a video call is rarely the video; it's usually the audio. It's counter-intuitive, but you are better off having your call without video than without sound, and yet when the video falls over it takes the audio with it. Insanity!

  • didibus 3 days ago ago

    What I'm wondering is, why couldn't the AI generate this solution? And implement it all?

    Why did they need to spend human time and effort to experiment, arrive at this solution and implement it?

    I'm asking genuinely. I use GenAI a lot, every day, multiple times a day. It helps me write emails, documents, produce code, make configuration changes, create diagrams, research topics, etc.

    Still, it's all assisted, I never use its output as is, the asks from me to the AI are small, so small, I wouldn't ever assign someone else a task this small. We're not talking 1 story point, we're talking 0.1 story point. And even with those, I have to review, re-prompt, dissect, and often manually fix up or complete the work.

    Are there use-cases where this isn't true that I'm simply not tackling? Are there context engineering techniques that I simply fail to grasp? Are there agentic workflows that I don't have the patience to try?

    How, then, do models score so high on some of those tests? Are the prompts for each question they solve hand-crafted, rewritten multiple times until they find a prompt that one-shots the problem? Do they not consider all that human babysitting as the model not truly solving the problem? Do they run the models with a GPU budget 100x what they sell us?

    • akersten 3 days ago ago

      > What I'm wondering is, why couldn't the AI generate this solution? And implement it all?

      My read of the blog post is that this is exactly what happened, and the human time was mostly spent being confused about why 40Mbps streams don't work well at a coffee shop.

  • rezonant 3 days ago ago

    I guess their LLM doesn't have much training data on how to do video engineering. The result? A "video" stack that looks like a junior engineer wrote it.

  • liampulles 2 days ago ago

    I appreciate the honesty in this article, hacking a solution together that works is ultimately what counts. Having said that, why H264?

    If I understand correctly, the clients of the video stream are web browsers and perhaps mobile devices, and the servers are Helix's. Would SVT-AV1 with low-latency mode not be an option?

  • epx 3 days ago ago

    Would HLS be an option? I publish my home security cameras via WebRTC, but I keep HLS as an escape hatch for hotel/cafe WiFi situations (MediaMTX makes it easy to offer both).

    • originalvichy 3 days ago ago

      Thought of the same. I have not set it up outside of hobby projects, but it should work over HTTP as it says on the box, even inside a strict network?

      • epx 3 days ago ago

        Yes, it is strictly HTTP, not even persistent connections required.

  • binocarlos 3 days ago ago

    > I mashed F5 like a degenerate.

    I love the style of this blog-post, you can really tell that Luke has been deep down in the rabbit hole, encountered the Balrog and lived to tell the tale.

    • jamiesonbecker 3 days ago ago

      I like it too, even though it has that distinctive odor of being totally written by ChatGPT. (a bit distracting tbh)

    • KptMarchewa 2 days ago ago

      That's amazing! You're going to see a lot more of those AI generated blogs in the coming century!

    • 3 days ago ago
      [deleted]
  • vincepaulushook 3 days ago ago

    Hi, I would concur with some of the comments. A keyframe in H.264 is already encoded in a similar way to a JPEG. The major differences are the "defaults": JPEG is more flexible in terms of color depth and color mapping, but that can be addressed with a video codec too. And a video codec like H.264 will also produce differential frames, which only encode the differences. It depends on the content, but these frames can be significantly smaller than a keyframe, like 10x.

    So the math is that H.264 can almost only be better than JPEG, assuming proper parameters for the type of content, the targeted transmission challenges, and the transmission type.

    Using JPEG is close to using only keyframes from a compression standpoint (not to say it is exactly like that), which is close to older standards like MPEG-1, or to intra-frame-only codecs (like those used as intermediate formats for editing or preservation). And the difference in size is a no-brainer; ultimately this is the amount of data that needs to be sent to every user.

    In my opinion, the first consequence of using JPEG only is the cost per device, the number of concurrent streams from a server and what not.

    If the perceived quality is low with H.264 compared to JPEG, some parameters need to be adjusted. And ultimately, H.264 is already an old codec anyway, not the one I would recommend; newer ones address visual perception and bandwidth in a much better way. The VP8/VP9/AV1 family will reduce the "macroblock" effect of the H.26x codecs. Using HDR will dramatically improve the quality and crush any benefit from JPEG (benefits related to the bits per pixel and the poor 8-bit color maps), with much higher efficiency.

    Should the volume of users and the cost per user be of any consideration, a lossy video codec will prevail.

    Video projects are challenging in the details: wish you the best.

  • throwaway173738 3 days ago ago

    This article reminds me so much of so many hardware providers I deal with at work who want to put equipment on-site and then spend the next year not understanding that our customers manage their own firewall. No, you can’t just add a new protocol or completely change where your stuff is deployed because then our support team has to contact hundreds of customers about thousands of sites.

  • Eduard 3 days ago ago

    > A JPEG screenshot is self-contained. It either arrives complete, or it doesn’t. There’s no “partial decode.”

    What about Progressive JPEG?

  • avsn 3 days ago ago

    We did something similar at one of the places I've worked. We sent x/y coordinates and pointer events from our frontend app to our backend/3D renderer and received JPEG frames back, all of that wrapped in protobuf messages and sent over a WS connection. Surprisingly, it kinda worked, though obviously not "60fps worked".

  • socketcluster 3 days ago ago

    Next phase would be to do diffs between the JPEGs and if the diff is smaller than the next JPEG, only send the (gzipped) diff and reconstruct the next JPEG on the client side.
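
    Note the diff has to be computed on raw pixels, not on the JPEG bytes themselves (re-encoding changes almost every byte even for visually identical frames). A sketch of the idea, with framing/type flags omitted and same-size raw RGB buffers assumed:

        import zlib

        def frame_update(prev_raw: bytes, cur_raw: bytes, cur_jpeg: bytes) -> bytes:
            # XOR the framebuffers: unchanged pixels become zeros,
            # which compress extremely well.
            diff = bytes(a ^ b for a, b in zip(prev_raw, cur_raw))
            packed = zlib.compress(diff, 6)
            # Ship whichever is smaller; the client XORs the diff onto its last frame.
            return packed if len(packed) < len(cur_jpeg) else cur_jpeg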

    TBH, the obsession with standards is kind of nutty. It's not that hard to implement custom solutions that are better adapted to specific problems. Standards make sense when you want maximum interoperability, but not everything requires this degree of interoperability these days. It's not such a hassle to just provide a lightweight client in those cases.

    For example, it's not ideal to use HTTP2 server push for realtime chat use cases. It was primarily intended for file push to avoid round-trip latency but HTTP is such a powerful and widespread protocol that people feel the need to use it for everything.

  • nico 3 days ago ago

    Super interesting. Some time ago I wrote some code that breaks down a jpeg image into smaller frames of itself, then creates an h.264 video with the frames, outputting a smaller file than the original image

    You can then extract the frames from the video and reconstruct the original jpeg

    Additionally, instead of converting to video, you can use the smaller images of the original to progressively load the bigger image, i.e. when you get the first frame, you have a lower quality version of the whole image, then as you get more frames, the code progressively adds detail with the extra pixels contained in each frame

    It was a fun project, but the extra compression doesn’t work for all images, and I also discovered how amazing jpeg is - you can get amazing compression just by changing the quality/size ratio parameter when creating a file

  • wood_spirit 3 days ago ago

    A long time ago I was trying to get video multiplexing to work over 3G mobile. We struggled with H.264, which had broad enough hardware support but almost no tooling and software support on the phones we were targeting. Even with engineers from the phone manufacturer as liaisons we struggled to get access to any kind of SDK etc. We ended up doing JPEG streaming instead, much like the article describes. And it worked great, but we discovered we were getting a fraction of the framerate reported in Flash players: the call to refresh the screen was async, and the act of receiving and decoding the next frame starved the redraw, so the phone spent more time receiving frames than showing them. Super annoying, and I don't think the project survived long enough for us to find a fix.

  • dimatura 3 days ago ago

    About eight years ago I was trying to stream several videos of a drone over the internet for remote product demos. Since we were talking to customers while the demo happened, the latency needed to be less than a few seconds. I couldn't get that latency with the more standard streaming video options I tried, and at the time setting up something based on WebRTC seemed pretty daunting. I ended up doing something pretty much like JPEGs as well, via the jsmpeg library [1]. Worked great.

    [1] https://jsmpeg.com/ (tagline: "decode like it's 1999")

  • josephernest 3 days ago ago

    Related: for some hardware project, I have a backend server (either C++ or python) receiving frames from an industrial camera, uncompressed.

    And I need these frames displayed in a web browser client, but on the same computer (no network trip like in this article).

    How would you do this ?

    I eventually did it more or less like OP, with uncompressed frames.

    My goal is to minimize CPU usage on the computer. Would h264 compression be a good thing here given source and destination are the same machine?

    Other ideas?

    NB: this camera cannot be directly accessed by the browser.

    • antisol 3 days ago ago

      > How would you do this ?

      It depends. I have many questions.

      > My goal is to minimize CPU usage on the computer. Would h264 compression be a good thing here given source and destination are the same machine?

      No.

      > Other ideas?

      1. Why does it need to be displayed in a web browser (as opposed to more appropriate / better performing software specifically built for video)?

      2. via what interface/library is the camera connected to the machine? What format/codec is the uncompressed stream you're getting from the camera?

      3. I am available at very reasonable consulting rates

      • josephernest 3 days ago ago

        Thanks.

        1. It is part of a bigger web-browser dashboard/control interface and this camera display is just one component among many others.

        2. Some of the (USB) cameras can have proprietary interfaces such as https://www.ximea.com/support/wiki/apis/python

        How would you do it in this situation, to get the video stream into the browser with as low CPU usage as possible?

        3. Not for this project but for a future project, feel free to put a link to your portfolio or contact page (even if you remove the comment later)

        • antisol 3 days ago ago

          1. fair enough

          2. "How would you do in this situation, to have the video stream in the browser, with as low CPU usage as possible?"

          Since it's being consumed on (only) the local machine you've got an excellent situation where you can use any obscure codec you like, as long as the browser you're using supports it. Also you don't need to care at all about network bandwidth. If minimising CPU usage is the #1 priority then something fairly lightweight like mjpeg might do the trick. Alternatively you might get away with not compressing the video at all (but this might cause issues due to dealing with huge amounts of data). If I wanted to minimise CPU usage, I wouldn't be doing it in python.

          3. You can find me if you look.

  • praveen9920 3 days ago ago

    This reminds me of the time we built a big angular3 codebase for a content platform. When we had to launch, the search engines expected the content to be part of the page HTML, while we were calling APIs to fetch the content (angular3 didn’t have server-side rendering at that point).

    So the only plausible thing to do was to pre-build HTML pages for the content pages and let Angular's JS take its time to load (for UX functionality). The page flickered when the JS loaded for the first time, but we solved the search engine problem.

  • dehrmann 3 days ago ago

    > What if we only send keyframes?

    I think the author reached this conclusion, but individual JPEGs are essentially keyframes only.

    > We don’t spam HTTP requests for individual frames like it’s 2009.

    Uncompressed frames are huge, somewhere between 5 MB and 50 MB. The overhead of a request is negligible. It's also different when you're optimizing for latency and reliability, where dropped frames are OK. Really, the lesson is they should have tried the easy thing first to see how good it was.
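
    (For reference: a raw 24-bit 1080p frame is 1920 × 1080 × 3 ≈ 6.2 MB, and 4K is 3840 × 2160 × 3 ≈ 24.9 MB, before any alpha channel or higher bit depths.)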

  • saagarjha 3 days ago ago

    I’m currently doing this in one of my side projects: https://github.com/saagarjha/Ensemble. It works, kinda; it’s good enough for demos at least and I haven’t had much time to improve it. At some point you would really want to use an actual video encoder though because JPEGs are not cheap to encode and send even with hardware acceleration.

  • bob1029 3 days ago ago

    > Why JPEGs Actually Slap

    JPEG is extremely efficient to [de/en]code on modern CPUs. You can get close to 1080p60 per core if you use a library that leverages SIMD.

    I sometimes struggle with the pursuit of perfect codec efficiency when our networks have become this fast. You can employ half-assed compression and still not max out a 1Gbps pipe. From Netflix and Google's perspective it totally makes sense, but unless you are building a streaming video platform with billions of customers I don't see the point.

  • gametheory87 3 days ago ago

    It’s always TCP_NODELAY seems relevant here: https://news.ycombinator.com/item?id=40310896

  • dehrmann 3 days ago ago

    > We’re building Helix, an AI platform where autonomous coding agents work in cloud sandboxes. Users need to watch their AI assistants work. Think “screen share, but the thing being shared is a robot writing code.”

    This feels like a fast dead end. Agents will get much faster pretty quickly, so synchronous human supervision isn't going to scale. I'd focus on systems that make high-signal asks of humans asynchronously.

  • xnx 2 days ago ago

    You see a company that is bad at video streaming. I see a smart application of Cunningham's Law https://meta.wikimedia.org/wiki/Cunningham%27s_Law

  • kiririn7 2 days ago ago

    "By the time you see a bug, the AI has already committed it to main" does anybody actually actively watch the code their agent is writing? i am watching movie recaps on my 2nd monitor. this seems like a problem that they assume exists because they dont actually use their product

  • cwt137 2 days ago ago

    Everyone talks about Websockets for pushing real time data to the browser. This article highlights some of its drawbacks. I use Server Sent Events (SSE) instead. A lot of the problems the author of the article faced are solved with SSE. Also, SSE scales way better than polling all the time.

  • lostmsu 3 days ago ago

    If you already have latency detection, why not pause H.264 frames, then when the ack comes just force a keyframe and resume (perhaps with an adjusted target bitrate)?

    • brigade 3 days ago ago

      That would require that they understand the protocol stack they're using to send H.264 frames

    • bobmcnamara 3 days ago ago

      Yeah, monitor the send queue length and reduce bit rate accordingly.
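
      On Linux you can read that queue directly (a sketch; the SIOCOUTQ/TIOCOUTQ ioctl is Linux-specific):

          import fcntl
          import struct
          import termios

          def unsent_bytes(sock) -> int:
              # Bytes still sitting in the kernel send queue for this socket.
              buf = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, b"\x00" * 4)
              return struct.unpack("i", buf)[0]

          # In the encode loop: if unsent_bytes(sock) keeps growing,
          # lower the target bitrate or skip encoding the next frame.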

  • STELLANOVA 3 days ago ago

    We did something similar 12+ years ago, "streaming" an app running on AWS into the browser. Basically you could run 3D Studio Max on a Chromebook. The app actually ran on an AWS instance that just sent JPEGs to the browser to "stream" it. We did a lot of QoS logic and other stuff, but it actually worked pretty nicely. Adobe used it for some time to let users run Photoshop in the browser. Good old days..

    • ronyfadel 3 days ago ago

      Who’s “we” in this case? Amazon (AWS)?

  • breve 3 days ago ago

    WebP is well supported in browsers these days. Use WebP for the screenshots instead of JPEG and it will reduce the file size:

    https://developers.google.com/speed/webp/gallery1

    https://caniuse.com/webp
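
    A quick way to sanity-check the savings on your own screenshots, sketched with Pillow (assumes a Pillow build with WebP support; the file name is a placeholder):

        import io

        from PIL import Image

        shot = Image.open("screenshot.png").convert("RGB")  # hypothetical capture
        for fmt in ("JPEG", "WEBP"):
            buf = io.BytesIO()
            shot.save(buf, fmt, quality=70)
            print(fmt, len(buf.getvalue()), "bytes")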

  • K0nserv 3 days ago ago

    You can do TURN using TLS/TCP over port 443. This can fool some firewalls, but will still fail for instances when an intercepting HTTP proxy is used.

    The neat thing about ICE is that you get automatic fallbacks and best path selection. So best case IPv6 UDP, worst case TCP/TLS
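
    Sketched with aiortc in Python (hostnames and credentials are placeholders; a browser's RTCPeerConnection takes the same shape of config in JS):

        from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

        config = RTCConfiguration(iceServers=[
            RTCIceServer("stun:stun.example.com:3478"),                # plain STUN
            RTCIceServer("turn:turn.example.com:3478?transport=udp",  # best case
                         username="user", credential="secret"),
            RTCIceServer("turns:turn.example.com:443?transport=tcp",  # worst case: TLS on 443
                         username="user", credential="secret"),
        ])
        pc = RTCPeerConnection(configuration=config)
        # ICE gathers candidates across all servers and picks the best working path.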

    One of the nice things about HTTP3 and QUIC will be that UDP port 443 will be more likely to be open in the future.

    • HackerThemAll 2 days ago ago

      It's going to take those conservative netadmins another 10 to 20 years to learn that HTTP/3 or QUIC works over UDP and that it needs to be enabled. So... happy buffering and watching spinners until then.

  • 3 days ago ago
    [deleted]
  • Sean-Der 3 days ago ago

    Doesn’t matter now, but what led you to TURN?

    You can run all WebRTC traffic over a single port. It’s a shame you spent so much time/were frustrated by ICE errors

    That’s great you got something better and with less complexity! I do think people push ‘you need UDP and BWE’ a little too zealously. If you have a homogeneous set of clients, stuff like RTMP/WebSockets seems to serve people well

  • mschuster91 3 days ago ago

    > We are professionals. We implement proper video codecs. We don’t spam HTTP requests for individual frames like it’s 2009.

    I distinctly 'member doing CGI stuff with HTTP multipart responses... although I bet that with the exception of Apache, server (and especially: reverse proxy) side support for that has gone down the drain.

  • sevensor 3 days ago ago

    No mention of PNGs? I don’t usually go to jpegs first for screenshots of text. Did png have worse compression? Burn more cpu? I’m sure there are good reasons, but it seems like they’ve glossed over the obvious choice here.

    edit: Thanks for the answers! The consensus is that PNG en/de -coding is too expensive compared to jpeg.

    • dimatura 3 days ago ago

      PNGs of screenshots would probably compress well, and the quality to size ratio would definitely be better than JPG, but the size would likely still be larger than a heavily compressed JPG. And PNG encoding/decoding is relatively slow compared to JPG.

    • wewewedxfgdf 3 days ago ago

      PNG is VERY slow compared to other formats. Not suitable for this sort of thing.

    • StilesCrisis 3 days ago ago

      PNGs are lossless so you can’t really dial up the compression. You can save space by reducing to 8-bit color (or grayscale!) but it’s basically the equivalent of raw pixels plus zlib.

      • vikingerik 3 days ago ago

        PNG can be lossy. It can be done by first discarding some image detail, to make adjacent almost-matching pixel values actually match, to be more amenable to PNG's compression method. pngquant.org has a tool that does it.

        There are usage cases where you might want lossy PNG over other formats; one is for still captures of 2d animated cartoon content, where H.264 tended to blur the sharp edges and flat color areas and this approach can compensate for that.

    • j45 3 days ago ago

      PNGs would likely perform great; existing enterprise network filters, browser controls, etc. might not handle them as well, even given how old PNG is now.

  • tracker1 2 days ago ago

    My only real curiosity is whether .png or .webp are supported, and how much slower and/or faster they are in practice versus JPEG, given the quality level needed to avoid artifacts.

  • monus 3 days ago ago

    Well, we are serving latency-sensitive remote control to <one of the biggest banks in US> via WebRTC, which uses TURN over TLS, so you get 443 HTTPS for the whole traffic.

    No NAT, no UDP, just pure TURN traffic over Cloudflare TURN with TLS.

  • colechristensen 3 days ago ago

    H.264 can be used to encode a single frame as an effective image with better compression than JPEG.
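
    For instance, with ffmpeg (a sketch; quality/CRF to taste, and the file names are placeholders):

        import subprocess

        # Encode a single intra frame: effectively "H.264 as a still image codec".
        subprocess.run([
            "ffmpeg", "-i", "screenshot.png",
            "-frames:v", "1", "-c:v", "libx264", "-crf", "30",
            "-f", "h264", "still.264",
        ], check=True)

    This is essentially what HEIC and AVIF do with newer codecs.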

  • poly2it 3 days ago ago

    Why is video streaming so difficult? We've been doing it for decades, so why is there seemingly no FOSS library which lets me encode an arbitrary dynamic-frame-rate image stream in Rust and get HD data with delta encoding in a browser receiver? This is insanity.

  • andrewstuart 3 days ago ago

    I wrote a motion jpeg server for precisely this use case.

    https://github.com/crowdwave/maryjane

    The secret to a great user experience is that you return the current video frame at the time of the request.

  • moralestapia 3 days ago ago

    >A single H.264 keyframe is 200-500KB.

    Hmm they must be doing something wrong, they're not usually that heavy.

  • julik 3 days ago ago

    Having built an image sequence player using JPEGs back in the day - I can attest that it slappps.

  • Animats 3 days ago ago

    This is a screen-sharing system, correct? Sharing screens with text? JPEG compression of text is bad. JPEG is terrible at hard edges. PNG is fine with them, and good at uniform areas of color, like text.

  • yanngagnon 2 days ago ago

    If you don't need sound, or don't need to present the sound to the user in sync with the content, the still image solution is obvious.

  • 3 days ago ago
    [deleted]
  • nrhrjrjrjtntbt 3 days ago ago

    That's fun. I take it JPEG (what settings, lolz!) is compressing harder than a keyframe.

    But you are watching code. Why not send the code? Plus any CSS/HTML used to render it pretty. Or in other words, why not a VS Code tunnel?

  • abujazar 3 days ago ago
  • elzbardico 3 days ago ago

    There's no real reason other than bad configuration/coding for an H.264 1080p 30fps screen-share stream to sustainably use 40Mbps. You can watch an action movie at the same frame rate but at 4K resolution while using less than half this bandwidth.

    The real solution is using WebRTC, like every single other fucking company that has to stream video is doing. Yes, enterprise consumers require additional configuration. Yes, sometimes you need to provide a "network requirements" sheet to your customer so they can open a ticket with their IT to configure an exception.

    Second problem: usually enterprise networks are not as bad as internet cafe networks, but then internet cafe networks usually are not locked down, so you should always try the best-case scenario first, WebRTC with TURN servers on 3478. That will also be the best option for really bad networks, but usually those networks are not enterprise networks.

    Please configure your encoder, 40mbps bit rate for what you're doing is way way too much.

    Test whether TURN is accessible. Try it first with UDP (the best option, and it will also work at an internet cafe); if not, try TCP on port 443; not working? Try TLS on port 443.

  • visiondude 3 days ago ago

    I'm very confused. Couldn't they have achieved a much better outcome with existing HLS tech and adaptive bitrate playlists? It seems they both created the problem and found a suboptimal solution.

    • mschuster91 2 days ago ago

      They don't have ffmpeg in the video pipeline, making anything a true PITA. ffmpeg has long solved the task of outputting HLS, DASH or whatever else standard and it Just Works No Matter Where. But if you're not using ffmpeg... you're bound to learn all the mistakes the ffmpeg people had to learn, and there are so many hidden traps.

    • dicroce 3 days ago ago

      Exactly my thoughts.

  • escapecharacter 3 days ago ago

    I guess this is great as long as you don't worry about audio sync?

    • htrp 3 days ago ago

      at least the ai agents aren't talking back to us

      • lostmsu 3 days ago ago

        You're behind by 1.5 years on that thought. They certainly can.

  • keepamovin 3 days ago ago

    This is similar to what BrowserBox does for the same reasons outlined. Glad to see the control afforded by "ye olde ways" is recognized and widely appreciated.

  • ddtaylor 3 days ago ago

    A very stupid hack that can work to "fix" this could be to buffer the h264 stream at the data center using a proxy before sending it to the real client, etc.

    • jamiesonbecker 3 days ago ago

      One of the big issues was latency.

      • ddtaylor 3 days ago ago

        Yes, but the real issue (IMO) is that something is causing an avalanche of some kind. You would much rather have a consistent 100ms increased latency for this application if it works much better for users with high loss, etc. Also, to be clear, this is basically just a memory cache. I doubt it would add any "real" latency like that.

        The idea is that if the fancy system works well on connection A and poorly on connection B, what are the differences, and how can we modify the system so that A and B are the same from its perspective?

  • CrossVR 3 days ago ago

    This isn't a hack though, MJPEG (Motion JPEG) is an actual video format and has long been used for security camera footage.

  • willseth 3 days ago ago

    “We didn’t have the expertise to build the thing we were building, got in way over our heads, and built a basic POC using legacy technology, which is fine.”

  • ErroneousBosh 3 days ago ago

    So, they've invented MJPEG?

    Or is it intra-only H.264?

    I mean, none of this is especially new. It's an interesting trick though!

  • Dwedit 3 days ago ago

    "Helix" also happens to be the name of an open-source project created by RealPlayer.

  • kuon 2 days ago ago

    Wait, what? 40Mbps for a remote desktop? Even 10Mbps is insane. I remember deploying Sun Rays over dialup and the image wasn't that bad; yes, it was low resolution, and I think it was UDP, but the desktop was usable with surprisingly low latency.

    To monitor an AI you can lower the bit depth considerably and not lose much detail about what is happening. If you control the web renderer, disable text anti-aliasing; there might be other optimizations that can help too. Tile & diff the image... but video encoders already do that, so it might just work out of the box.

    Also, if your single H.264 image is larger than the JPEG, then you are doing something wrong; JPEG is a very poor encoding compared to what we have today.

    Look at how other remote desktop protocols do it: VNC, RDP...

    Managing streams over corporate networks is well documented; many web frameworks include a "longpoll" fallback (or SSE) for streaming to play nice even without WebSockets. "Discovering" that you cannot deploy whatever you want on an enterprise network is quite alarming.

    I really don't want to be the graybeard saying "young engineers are bad", as I am more on the side of believing in the new generations, but please, don't act like computers spawned into existence in 2020 and nothing was done before.

    • throw-12-16 2 days ago ago

      I’ll say it.

      Young engineers are bad.

  • tverbeure 3 days ago ago

    I’m surprised that I-frame-only H264 compresses worse than JPEG.

    Maybe because the basic frequency transform is 4x4 vs 8x8 for JPG?

    • plorkyeran 3 days ago ago

      Their h264 iframes were bigger than the jpegs because they told the h264 encoder to produce bigger images. If they had set it to produce images the same size as the jpegs it most likely would have resulted in higher quality.

  • dicroce 3 days ago ago

    They should have used HLS. It's still pulling, and the client controls the downshifts if required...
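
    Client side it's a handful of lines with hls.js (a real library; the stream URL is made up):

        import Hls from "hls.js";

        const video = document.querySelector("video")!;
        const hls = new Hls({ lowLatencyMode: true }); // LL-HLS if the server supports it
        hls.loadSource("https://example.com/agent/stream.m3u8");
        hls.attachMedia(video);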

  • notpushkin 3 days ago ago

    Considering you already have a WebSocket open, why not just send JPEGs over it?
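
    The receiving side is tiny; a sketch, with the endpoint made up:

        const canvas = document.querySelector("canvas")!;
        const ctx = canvas.getContext("2d")!;

        const ws = new WebSocket("wss://agent.example.com/screen"); // placeholder
        ws.binaryType = "blob"; // each message is one whole JPEG

        ws.onmessage = async (ev) => {
          const frame = await createImageBitmap(ev.data as Blob); // async JPEG decode
          ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
          frame.close();
        };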

  • inDigiNeous 3 days ago ago

    That was a fun read. Kudos to the writer. This is software development life.

  • mring33621 3 days ago ago

    This is such a great post. I love the "The Oscillation Problem"!

  • ddtaylor 3 days ago ago

    > I mashed F5 like a degenerate

    Bargaining.

  • boggyb 3 days ago ago

    You can make webrtc work on enterprise networks by tunneling turn tcp traffic over websocket. The flow looks like this.

        client's webrtc app using turn (pointing to the same machine IP)
          <-> tcp server / websocket client (runs on the client machine)
          <-> websocket server (relays turn packets)
          <-> real turn server
          <-> host's webrtc app

    https://github.com/amitv87/turn_ws_proxy

    I implemented a similar technique at Browserstack more than a decade ago to bypass enterprise firewalls by tunneling TURN packets over websockets/SSE/socket.io etc. The `tcp server / websocket/sse/socket.io client` was shipped as part of a packaged Chrome app / Firefox extension. The WebSocket and TURN servers were hosted on the same machine to minimize latency (they could have been embedded in the same process to reduce it further).
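
    The local half of that chain is only a few lines; a sketch with Node's net module and the ws package (the relay URL is a placeholder):

        import * as net from "net";
        import WebSocket from "ws";

        net.createServer((turnConn) => {
          turnConn.pause(); // don't read until the tunnel is up
          const tunnel = new WebSocket("wss://relay.example.com/turn"); // placeholder relay
          tunnel.on("open", () => {
            turnConn.on("data", (chunk) => tunnel.send(chunk)); // TURN client -> relay
            turnConn.resume();
          });
          tunnel.on("message", (msg) => turnConn.write(msg as Buffer)); // relay -> TURN client
          const teardown = () => { turnConn.destroy(); tunnel.close(); };
          turnConn.on("close", teardown);
          tunnel.on("close", teardown);
        }).listen(3478, "127.0.0.1"); // the local webrtc app points its TURN url here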

  • gethly 3 days ago ago

    Oof. I knew instantly what the problem was and realised these people have no clue about how video even works. Yet another vibe-coded AI startup.

  • ErroneousBosh 2 days ago ago

    You know what else I don't quite get? Why isn't "Your network is broken. Fix your network. Blocking UDP is idiotic. Get someone to set it up who has at least stood within hailing distance of a clue" an acceptable thing to say here?

  • 3 days ago ago
    [deleted]
  • mgaunard 3 days ago ago

    RTP is specifically designed for real-time video.

  • worksonmine 3 days ago ago

    I'm confused, do people actually watch their agents code like it was a screen share? Why does the AI even mess with that, just send a diff over text? Is it getting a keyboard next?

    This is the definition of over-engineering. I don't usually criticize ideas but this is so stupid my head hurts.

    • krater23 2 days ago ago

      And working in this setup at all sounds horrible. Who wants that?

  • bandamo 3 days ago ago

    I'd like to see what alternatives were considered. RDP with an HTML client (Guacamole) seems like a good match.

  • the8472 3 days ago ago

    > looks at TCP congestion control literature

    > closes tab

    Eh, there are a few easy things one can try. Make sure to use a non-ancient kernel on the sender side (to get the necessary features), then enable BBR and NOTSENT_LOWAT (https://blog.cloudflare.com/http-2-prioritization-with-nginx...) to avoid buffering more than what's in flight, and then start dropping websocket frames when the socket says it's full.
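
    The dropping part is the easy bit: a sketch assuming Node's ws package and an arbitrary 256KB threshold:

        import WebSocket from "ws";

        const MAX_BUFFERED = 256 * 1024; // arbitrary: bytes we tolerate sitting unsent

        // Skip stale video frames instead of queueing them behind a full socket.
        function sendFrame(sock: WebSocket, frame: Buffer): boolean {
          if (sock.bufferedAmount > MAX_BUFFERED) {
            return false; // backed up: drop this frame, tell the encoder it was skipped
          }
          sock.send(frame);
          return true;
        }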

    Also, with tighter integration with the h264 encoder loop, one could tell it which frames weren't sent and account for that in P-frame generation. But I guess that wasn't available with that stack.

  • bayindirh 3 days ago ago

    I love this:

    - It's 2025! We don't need to think like the savages of yore. Use video at 60FPS. Computing is cheap, the network is reliable. Why do we need to remember old ways like savages?

    it turns out that the network is not reliable...

    - We will do as our ancestors did, and will send JPEGs, and that works?! Whoa, who guessed it!

    Come on. Everything is new but nothing has changed. Sometimes the older tech is vastly better, and saves our butts or lives or both. We shouldn't be ashamed of using things proven to work.

  • tylertyler 3 days ago ago

    I've found that WebM works much better because of the structure of the data in the container. I've gone down similar routes using outdated tech, and even invented my own encoders and decoders trying to smooth things out, but the best approach I've currently found is WebM, because it's easier to lean on hardware encoders and decoders, including across browsers with the new WebCodecs APIs. What I've been working on is a little different from what's in this post, but I'm pretty sure this logic still stands.
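
    For the curious, the encode side with WebCodecs is roughly this (the API is real; the settings and the socket are my assumptions, not what the post used):

        const ws = new WebSocket("wss://viewer.example.com"); // placeholder viewer socket

        const encoder = new VideoEncoder({
          output: (chunk) => {
            const buf = new ArrayBuffer(chunk.byteLength);
            chunk.copyTo(buf);
            ws.send(buf); // each chunk is one already-compressed VP8 frame
          },
          error: (e) => console.error("encode failed:", e),
        });

        encoder.configure({
          codec: "vp8",
          width: 1920,
          height: 1080,
          bitrate: 2_000_000,      // 2 Mbps, nothing like 40
          framerate: 10,
          latencyMode: "realtime", // bias the encoder toward low delay
        });

        // Feed it VideoFrames (e.g. from canvas capture), forcing periodic keyframes:
        let n = 0;
        function push(frame: VideoFrame) {
          encoder.encode(frame, { keyFrame: n++ % 30 === 0 });
          frame.close();
        }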

  • krater23 2 days ago ago

    "By the time you see a bug, the AI has already committed it to main"

    Besides that, the author has no clue at all about encoding, MJPEG, VNC, ...

    Really, THIS is the product they sell?! It sounds like horrible work: observing a coding agent that does my job, but faster and crappier than me, and stopping it whenever it does total bullshit, to prevent it from committing to main?

    • rasz 2 days ago ago

      I can totally see some middle managers swallowing "You get used to it, I don't even see the code, All I see is blond, brunette, redhead" sales pitch.

  • j45 3 days ago ago

    One thing this article points to, indirectly: sometimes simple scales and complex fails.

  • mannyv 3 days ago ago

    Awesome!

    Good engineering: when you're not too proud to do the obvious, but sort of cheesy-sounding solution.

  • almog 3 days ago ago

    Posts like this on the front page make me miss N-Gate so bad...

  • nicman23 3 days ago ago

    close enough, welcome back mpeg

  • JohnCClarke 2 days ago ago

    +1 - I made the same technology choice back in 2014. Seems like nothing has changed.

    TL;DR: You can't keep things too simple.

  • mxkyb 2 days ago ago

    why not media over quic

  • piyushpr134 3 days ago ago

    how about using MJPEG?

  • develatio 3 days ago ago

    I cried. Then I laughed. Then I cried again. I can feel all the pain of the entire thing (don't ask me why). Amazing. Bravo!!

  • HocusLocus 3 days ago ago

    This is a beautiful cope. Every time technology rolls out something that works great 90% of the time for 90% of the people, those 10%s pile up big time in support and lost productivity. You need functional systems that fall back gracefully to 1994 if necessary.

    I started the first ISP in my area. We had two T1s to Miami. When HD audio and the rudiments of video started to increase in popularity, I'd always tell our modem customers, "A few minutes of video is a lifetime of email. Remember how exciting email was?"

  • dengolius 3 days ago ago

    what about AV1?

    • krater23 2 days ago ago

      What about configuring the codec correctly? Using MJPEG? Using VNC? Throwing all this shit in the trashcan and just getting diffs in UTF-8 from the coding agent?

      Or maybe writing the code yourself, to avoid creating the hell of bullshit code that led to the issues the article describes?

      Wish you a Merry Christmas! :D

  • hmontazeri 3 days ago ago

    Another case of we’re going backwards. The boring stuff is what works every time…

  • dinobones 3 days ago ago

    You spent 3 months on this hacked together garbage when you probably could’ve just configured a pre-existing solution off the shelf with like 10 minutes of reading and understanding documentation.

    This blog post reeks of “you can just do things” type of engineering. This is the quality of engineering I would expect from “TPOT” (that part of Twitter) where people talk about working 12 hour days. It’s cause they’re working 12 hours on bullshit like this.

    Building some sweet custom codec or binary transportation algorithm was barely cute in like 1989. It definitely ain’t cute now.

    How many of these AI and “agentic” companies are just misled engineers thinking they are cracked and writing needlessly complex solutions to problems that don't even exist?

    Just burn it all down. Let it pop already.

    • krater23 2 days ago ago

      Thanks! Exactly what I think about their work and their idea of people watching AI agents code.