You Don’t Know Jack about Bandwidth

(cacm.acm.org)

196 points | by sohkamyung 15 hours ago ago

87 comments

  • thepuppet33r 13 hours ago ago

    I have spent hours arguing with someone at my work that the issue we are experiencing at our remote locations is not due to bandwidth, but latency. These graphics are exactly what I've been looking for to help get my point across.

    People run a speed test, see low (sub-100 Mbps) numbers, and think that's why their video call is failing. Never mind the fact that Zoom only needs 3 Mbps for 1080p video.

    • Roark66 2 hours ago ago

      That is why I've been successfully working from home for almost a decade, starting on an LTE connection that was 5 Mb up and 10 Mb down (note the small b, as in bits). No problem at all... Why? Because most of the time the latency was good.

      I'm still on the same LTE connection, but everyone kept telling me my speeds were crap and that I should upgrade to a new LTE Cat 21 router. So I got one of the more popular models, the ZTE MF289F, and the speed increased to 50 Mb up and 75 Mb down on a speed test. But all my calls suddenly felt very choppy and web browsing felt unbearably slow... What happened? Well, the router would decide every day or so to raise its latency to google.com from 15 ms to 150 ms until it was restarted. But that's not all: even when the ping latency was fine, it still felt slower than my ancient TP-Link LTE router. So the ZTE went into a drawer until I have time to put Linux on it, and the TP-Link went back on top of my antenna mast.

    • EvanAnderson 11 hours ago ago

      Latency is a cruel mistress. Had a Customer who was using an old Win32 app that did a ton of individual SELECT queries against the database server to render the UI. They tried to put it on the end of a VPN connection and it was excruciating. The old vendor kept trying to convince them to add bandwidth. Fortunately the Customer asked the question "Why does the app work fine across the 1.544Mbps T1 to our other office?" (The T1 had sub-5ms latency.)

      • chrismorgan 10 hours ago ago

        I was involved in some physical network inventory software a dozen years ago. One team produced some new web-based software for it, for use by field agents and such. The first version we got to review was rather bad in various important areas; my favourite was that search would take over thirty seconds in a common real-world deployment environment: it was implemented in .NET stuff that makes server calls easy to do by accident and in an unnecessarily blocking fashion, and it searched each of the 34 entity types individually, in sequence; and some .NET session thing meant that if the connection had been idle for ten minutes or something, it would then also need to retry every request that got queued while the session was “stale”, which was all of them. So you ended up with 68 sequential requests, on up to half a second’s latency (high-latency 3G or 2G or geostationary satellite)… so yeah, 30 seconds.

        They’d only developed it with sub-millisecond latency to the server, so they never noticed this.

        I don’t think it was a coincidence that the team was US-based: in Australia, we’re used to internet stuff having hundreds of milliseconds of latency, since so much of the internet is US-hosted, so I think Australians would be more likely to notice such issues early on. All those studies about people abandoning pages if they take more than two seconds to load… at those times, it was a rare page that has even started rendering that soon, because of request waterfalls and high latency. (These days, it’s somewhat more common to have CDNs fix the worst of the problem.)

        • dietr1ch 39 minutes ago ago

          I think the latency people grow up with has a huge impact on how careful they are when using the internet.

          I got my hands on an experimental 128 kbps link early on, but later moving to the countryside with a really spotty 56 kbps-1 Mbps connection made me really appreciate local state, as every time something blocked on the network it was very noticeable.

          I'm glad there's a push for synchronized, local-first state now, as roaming around on mobile or with a laptop hopping between WiFi networks can only perform nicely with local state.

      • akira2501 9 hours ago ago

        I test all my web dev on the other side of a 4G modem connected to an MVNO. It forces you to be considerate of both bandwidth and latency as it's about 5-20Mbps in the city with 120ms average latency.

        It's not at all impossible to design fast and responsive sites and single-page applications under these constraints; you just have to be aware of them and actively target them during the full course of development.
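        If you don't have a spare 4G modem, you can approximate similar conditions locally with Linux netem; a minimal sketch (the interface name and the exact figures are assumptions, and it needs root plus iproute2):

        ```python
        import subprocess

        DEV = "eth0"  # assumption: change to your outbound interface

        def tc(*args):
            subprocess.run(["tc", *args], check=True)

        # Roughly "city 4G on an MVNO": ~120 ms added delay with jitter, ~10 Mbit/s, a little loss.
        def degrade():
            tc("qdisc", "replace", "dev", DEV, "root", "netem",
               "delay", "120ms", "30ms", "rate", "10mbit", "loss", "0.5%")

        def restore():
            tc("qdisc", "del", "dev", DEV, "root")

        if __name__ == "__main__":
            degrade()  # browse/test, then call restore() to go back to normal
        ```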

        • martyvis 3 hours ago ago

          Many years ago I was called in to troubleshoot the rollout of a new web-based customer management application that was replacing a terminal green-screen one. I was flummoxed to find that the developers had only ever tested their app from workstations on 100 Mbps switches, while the target offices for this application were connected by 128 kbps ISDN lines. I was able to demonstrate that each 12 kB of their application was going to take about a second. (It was amazing to see their HTML still full of comments, long variable names, etc.) I don't think they had even discovered gzip compression. This was after many millions of dollars had already been spent on the development project.
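          The arithmetic that made the point, as a rough sketch (the framing overhead is an assumed round figure):

          ```python
          # Time to push 12 kB over a 128 kbps ISDN line vs. a 100 Mbps LAN.
          chunk_bytes = 12 * 1024        # 12 kB of comment-laden, uncompressed HTML
          overhead = 1.15                # assumed ~15% TCP/IP/PPP framing overhead

          for name, bits_per_sec in (("128 kbps ISDN", 128_000), ("100 Mbps LAN", 100_000_000)):
              seconds = chunk_bytes * 8 * overhead / bits_per_sec
              print(f"{name:>13}: {seconds * 1000:8.1f} ms per 12 kB")
          # ISDN: ~880 ms (call it a second); LAN: ~1 ms -- a gap the developers never saw
          ```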

        • slt2021 9 hours ago ago

          Good for you, you are doing your job very well.

          Part of the reason modern software is so crappy is that developers often have the most powerful machines (MacBook Pro class) and don't even realize how resource-hungry and crappy their software is on lower-end devices.

          • Escapado 8 hours ago ago

            Another part of the reason is that at every company I have ever worked for, some SEO dude insists we need to add literally 10 different tracking scripts, insists they need to load first, and they consume megabytes of data. At my last gig the landing page was 190 kB of HTML, CSS and JS from us, 800 kB of images, and literally 8 MB in vendor scripts we were mandated to load as early as possible. Of course we fought it; of course nobody cared.

            • justmarc 8 hours ago ago

              Try an e-banking website which literally loads over 25MB of crap just to show the login page.

              This is by far the worst offender I've seen.

              Madness.

          • huijzer 8 hours ago ago

            Yes, a MacBook Pro on a fibre connection with a ping below 10 ms to the server.

            I’m usually on an old copper line (16 ms ping to Amsterdam) in the Netherlands (130 ms to San Francisco).

            Some sites are just consistently slow. Especially GitHub or cloud dashboards. My theory is that the round trip to the database slows things down.

            • FridgeSeal 6 hours ago ago

              GitHub and everything Atlassian deserve to be thrown in the pit-of-shame and laughed at for perpetuity.

              Jira is so agonisingly slow it’s a wonder anyone actually pays for it. Are the devs who work on it held against their will or something? It’s ludicrous.

              GitHub gets worse every day, with the worst sin being their cutesy little homegrown loading bar, which is literally never faster than simply reloading the page.

            • oefrha 5 hours ago ago

              These days people on HN love to advocate sockety frontend solutions like Phoenix LiveView, where every stateful interaction causes a roundtrip, including ones that don’t require any new data (or all required data can be sent in a batch at the beginning / milestones). It’s like they forgot the network exists and is uneven and think everything’s happening in-process.

              To ward off potential criticism: I know you can mix client-side updates with server-side updates in LiveView and co. I’ve tried. Maintaining client-side changes on a managed DOM, sort of like maintaining a long-lived topic branch that diverges from master, sucks.

        • eru 7 hours ago ago

          In the Google office we had (perhaps they still have) a deliberately crappy WiFi network you can connect to with your device, to experience extra latency, latency spikes, and random low bandwidth. All in the name of this kind of testing.

      • chickenbig 5 hours ago ago

        > Latency is a cruel mistress.

        Yes, Bloomberg had fun with latency because of their datacenter locations (about a decade ago they still only had two and a half close to New York). Pages that would paint acceptably in London would be unacceptable in Tokyo because, when poorly designed, they would require several round trips to render. Once the page rendered, there was still the matter of updating the prices, which was handled by separately streaming data from servers close to the markets to the terminals. A very different architecture, but rather difficult to test because of the significant terminal-side functionality.

      • Hikikomori 5 hours ago ago

        Had some devs in another country complaining that their database query was taking hours to complete, while running it from a server in the same datacenter took a few minutes. It took some weeks of emails and a meeting or two until they understood that we couldn't do anything; I had to actually say that we couldn't do anything about latency unless they physically moved their country closer to us.

        • jstanley 5 hours ago ago

          Could you replicate the data to their country and let them run queries locally?

          Could they run their client from your country and operate the UI remotely?

          There are more options than moving the country!

          • Hikikomori 3 hours ago ago

            I was the network engineer. It was their server and database, I couldn't solve the latency problem for them.

        • rodrigodlu 4 hours ago ago

          It's easier to have a replica. It doesn't matter if it's "realtime" sync or using CDC, from backups, etc.

          You can even ask one of these guys to do the setup for you. They'll do it in a pinch with a happy face.

          I know because I did.

          • Hikikomori 3 hours ago ago

            It is, but not my problem as a network engineer. We did suggest that, though, but they refused to believe that we couldn't solve the latency "problem".

            • martyvis 3 hours ago ago

              I have been there many times. While the network guy can solve some of the latency in things like TCP handshakes, and use compression and caching with magic black boxes, you can't fix the actual application query and acknowledgement requirements that might be there.

      • mschuster91 4 hours ago ago

        > Fortunately the Customer asked the question "Why does the app work fine across the 1.544Mbps T1 to our other office?" (The T1 had sub-5ms latency.)

        That reminds me of the atrocious performance of Apple's Time Machine with small files. Running backups on SSDs is fast, but over cabled Ethernet it's noticeably worse, and even WiFi 6 is utterly disgraceful.

        To my knowledge you can't even go and say "do not include any folder named vendor (PHP) or node_modules (JS)", because (at least on my machine) these stacks tend to be the worst offenders in creating hundreds of thousands of small files.

    • dtaht 12 hours ago ago

      Speedtest.net added support for tracking latency under load a few years ago; it now shows ping during the upload/download phases. That's the number to show your colleague.

      However, it tends to use something like the 75th percentile and throw out real data. The Waveform bufferbloat test uses the 95th percentile and supplies whisker charts; Cloudflare's test does as well.

      No web test exercises upload and download at the same time, which is the worst-case scenario. Crusader and flent.org's RRUL test do.

      Rather than argue with your colleague, why not just slap an OpenWrt box inline as a transparent bridge and configure CAKE SQM?
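      For reference, the heart of what such a box does is a single CAKE qdisc on the bottleneck link; a minimal egress-only sketch (interface name and rate are assumptions, ingress shaping needs an extra IFB redirect not shown here, and it requires root):

      ```python
      import subprocess

      WAN = "eth1"          # assumption: the interface facing the modem
      UP_RATE = "45mbit"    # assumption: ~90-95% of the measured upload rate

      def tc(*args):
          subprocess.run(["tc", *args], check=True)

      # Shape slightly below the real bottleneck so queuing happens here, under CAKE's
      # control, instead of in the modem's oversized FIFO buffer.
      tc("qdisc", "replace", "dev", WAN, "root", "cake",
         "bandwidth", UP_RATE, "diffserv4", "nat", "ack-filter")

      # Inspect drops/marks; with CAKE, some drops are normal and healthy.
      subprocess.run(["tc", "-s", "qdisc", "show", "dev", WAN])
      ```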

      • imp0cat 9 hours ago ago

        Next time you need to assess your connection's capabilities, try https://speed.cloudflare.com/ instead of speedtest.net. Much more informative.

      • thomasjudge 10 hours ago ago

        Would you put the "OpenWrt box as a transparent bridge inline" between your home router and the cable modem, or on the house side of the home router?

        • dtaht 10 hours ago ago

          I would replace the home router with an OpenWrt router.

          • matheusmoreira 10 hours ago ago

            One of the best things I've ever done. OpenWrt is so good. SQM helps a lot with latency.

    • danpalmer 10 hours ago ago

      Funnily enough I have found since moving from London to Sydney that people here are far more understanding of the difference between latency and throughput. Being 200ms from anyone else on the internet will do that to you!

    • izacus 4 hours ago ago

      Not just latency, but also jitter. Jitter was the biggest issue we had when broadcasting and streaming video. You don't need a lot of bandwidth, and latency can be survived... but jitter will ruin your experience like nothing else.

    • guappa an hour ago ago

      "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

      -- Andrew S. Tanenbaum

    • dtaht 11 hours ago ago

      BITAG published this a while back.

      https://www.bitag.org/latency-explained.php

      It's worth a read.

    • buginprod 5 hours ago ago

      I don't see how people miss latency. It's the only other number shown on the speed test screen! No curiosity as to why it's there?

      I mean, I bet they do care about litres/100 km for their car AND 0-100 km/h acceleration (and many other stats).

  • iscoelho 5 hours ago ago

    This article appears to be written from a solid Linux networking background, but not from an ISP networking background.

    ISPs at scale do not use software routers. They use ASIC routers (Juniper/Arista/Cisco/etc.) for many reasons: 1) features, 2) capacity, 3) reliability.

    ASIC routers are capable of handling 100-1000x the throughput of the most over-provisioned Linux server (and that may even be an understatement). ASIC routers can also route packets with latency between 750us (0.75ms) and 10us (0.01ms!), complemented by multi-second (>GB) packet buffers.

    QoS is rarely used at scale, if anything only on the access layer, because transit has become so cheap that ISPs have more bandwidth than they know what to do with. These days, if a link is congested, it's not cost saving, but instead poor network planning. QoS also has very limited benefits at >100G scale.

    With that said, I feel that this article is definitely missing the full picture.

  • cycomanic 12 hours ago ago

    I know it's common to say bandwidth casually, but I really wish a blog trying to explain the difference between data rate and latency would not conflate bandwidth and data rate (one could also say throughput or capacity, although the latter is also technically incorrect). The term bandwidth really denotes the spectral width occupied by a signal, and while it is related to the actual data rate, it is much less so nowadays, when we use advanced modulation, than back when everything was just OOK (on-off keying).

    Coincidentally, the difference between latency and data rate is also much clearer using these two terms.

  • ajb 9 hours ago ago

    This article is confused, or at best unclear, about how AQM works:

    "CAKE then added Active Queue Management (AQM), which performs the same kind of bandwidth probing that vanilla TCP does but adds in-band congestion signaling to detect congestion as soon as possible. The transmission rate is slowly raised until a congestion signal is received,[...]"

    This appears to suggest that Cake (an in-network AQM process) takes over some of the functionality of TCP (implemented in the endpoints). What's actually happening is that the AQM provides a better signal to allow TCP to do a better job.

    The rest of the article is more or less accurate, albeit that it's marketing for one particular tool rather than giving you the level of understanding needed to choose one.

    The dig at PIE (another AQM) is also a bit misleading, in that their main complaint is not PIE itself but the lack of all the other features they think necessary. If CAKE used PIE instead of CoDel I don't think it would be noticeably different.

  • jrs235 2 hours ago ago

    Bandwidth is the number of traffic lanes. Latency (speed) is determined by the material and quality of the lanes (gravel road vs. asphalt, etc.). Throughput is determined by the number of lanes and the speed at which vehicles can travel.
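    One concrete place that relationship shows up: a single TCP stream can never move faster than its window divided by the round-trip time, so latency alone caps throughput no matter how many lanes the road has. A rough illustration (the window size is an assumed, unscaled 64 KB):

    ```python
    # Max throughput of one TCP stream ~= window_size / round_trip_time
    window_bytes = 64 * 1024   # assumed 64 KB window, no window scaling
    for rtt_ms in (5, 50, 130):
        mbps = window_bytes * 8 / (rtt_ms / 1000) / 1e6
        print(f"RTT {rtt_ms:3d} ms -> at most {mbps:6.1f} Mbit/s per stream")
    # 5 ms -> ~105 Mbit/s, 50 ms -> ~10.5 Mbit/s, 130 ms -> ~4 Mbit/s, however fat the pipe
    ```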

  • PeterStuer 6 hours ago ago

    ISPs don't mind customers hating them as long as they don't leave, and in many places customers can't, because theirs is the only game in town, or there is one other player that screws over its clients in exactly the same way.

    They have used deep packet inspection and traffic shaping for ages to screw over over-the-top competition to their own services, or to tier their offerings into higher-priced, slightly less artificially sabotaged package deals.

    I really like what the LibreQoS people are aiming for, but let's not pretend ISPs are trying to be great and are just technically hampered (and yes, I'm sure there are exceptions to this rule).

  • declan_roberts 11 hours ago ago

    Working from home has really put a spotlight on the terrible asymmetric upload speeds of most cable internet.

    I can get 1 Gb down but only 50 Mb upload. Certain tasks (like uploading a Docker image) I can't do at all from my personal computer.

    The layman has no idea of the difference, and even most legislators don't understand the issue ("isn't 1 Gb fast enough?").

    • packetlost 10 hours ago ago

      This. I've been fighting AT&T for a while because they told the FCC (via their broadband maps [0]) that they supply fiber to my condo, so I bought it expecting to get fiber. Well, when I finally went to set up internet service, they only offered 50/5 DSL. Fortunately I can get cable that has usable down speeds, but the up is substantially less than 50, with garbage routing.

      I'm not very happy.

      [0]: https://broadbandmap.fcc.gov/

      • dsissitka 9 hours ago ago

        I'm not sure if it's true but I've heard they take that very seriously.

        Have you filed a complaint with the FCC? Both times I had to do it things got sorted very quickly.

        https://consumercomplaints.fcc.gov/hc/en-us/articles/1150022...

      • silisili 5 hours ago ago

        Good luck.

        At one time I was experiencing high ping times and near-nonexistent speed from AT&T Fiber to Online.fr's network. I did 80% of the diagnostics for them, provided the details, and of course gave a nudge as to what I felt the issue could be.

        It's extremely frustrating to be a networking person having to deal with home internet CS.

        To my surprise, it actually did get to their networking team who replied saying the peer was fine and try again. The problem with that was that it came 8 months later, long after I'd left the area and didn't even have service with them anymore.

    • __MatrixMan__ 11 hours ago ago

      I got lucky and fiber became available in my neighborhood around the same time I noticed how painful pushing images over cable was. Hopefully you'll get that option soon too.

      For the unlucky, maybe we can take advantage of the fact that most image pushes have a predecessor which they are 99% similar to. With some care about the image contents (nar instead of tar, gzip --rsyncable, etc) we ought to be able to save a lot of bandwidth by using rsync on top of the previous version instead of transferring each image independently.

    • sneak 9 hours ago ago

      Why are you building images at home for upload?

      In my experience, it’s much easier to upload code or commits and build/push artifacts in/from the datacenter, whether manually or via CI.

      It can be as simple as exporting DOCKER_HOST=“ssh://root@host”. Docker handles uploading the relevant parts of your cwd to the server.

      I have a wickedly fast workstation but spot instances that are way way faster (and on 10gbps symmetric) are pennies. Added bonus: I can use them from a slow computer with no degradation.

    • imp0cat 9 hours ago ago

      YMMV, but this can be usually mitigated by connecting to a machine in a datacentre at work and doing all the stuff there (like building and uploading docker images).

    • LoganDark 11 hours ago ago

      > I can get 1 Gb down but only 50 Mb upload. Certain tasks (like uploading a Docker image) I can't do at all from my personal computer.

      As someone who used to work with LLMs, I feel this pain. It would take days for me to upload models. Other community members rent GPU servers to do the training on just so that their data will already be in the cloud, but that's not really a sustainable solution for me since I like tinkering at home.

      I have around the same speeds, btw. 1Gb down and barely 40Mb up. Factor of 25!

      • latency-guy2 10 hours ago ago

        I feel your pain, I haven't been in ML world directly for a few years now but I've done the same exercise multiple times.

        The worst part is that block compression actually does not help if it doesn't do a significantly good job of compression AND decompression. My use case had to immediately deploy the models across a few nodes in a live environment at customer sites. Cloud wasn't an option for us and fiber was also unavailable many times.

        The fastest transport protocol was someone's car and a workday of wages.

        • LoganDark 8 hours ago ago

          > The fastest transport protocol was someone's car and a workday of wages.

          This is actually the entire premise of AWS Snowball: send someone a bunch of storage space, have them copy their data to that storage, then just ship the storage back with the data on it. It can be several orders of magnitude faster and easier than an internet transfer.

          Sneakernet really works. https://en.wikipedia.org/wiki/Sneakernet
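          Tanenbaum's quip still pencils out with modern numbers; a back-of-the-envelope sketch (drive count, capacity, and transit time are all assumptions):

          ```python
          # Effective bandwidth of shipping drives: enormous throughput, one-day latency.
          drives = 10
          tb_per_drive = 18        # assumed modern HDDs
          transit_hours = 24       # assumed overnight shipping

          total_bits = drives * tb_per_drive * 1e12 * 8
          gbps = total_bits / (transit_hours * 3600) / 1e9
          print(f"~{gbps:.0f} Gbit/s sustained, with ~{transit_hours} h of latency")
          # -> ~17 Gbit/s: faster than most uplinks, useless for anything interactive
          ```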

          • fragmede 5 hours ago ago

            It would be totally cyberpunk to have a data café where you bring your hard drive to upload to the cloud and pay by the terabyte per second of transfer speed. Have all day? Cheap. Need to do it in 30 minutes? Pay up.

  • kortilla 12 hours ago ago

    > Now a company with bad performance can ask its ISP to fix it and point at the software and people who have already used it. If the ISP already knows it has a performance complaint, it can get ahead of the problem by proactively implementing LibreQoS.

    The post was a pretty good explanation about a new distro ISPs can use to help with fair queuing, but this statement is laughably naive.

    A distro existing is only a baby first step to an ISP adopting this. They need to train on how to monitor these, scale them, take them out for maintenance, and operate them in a highly available fashion.

    It's a huge opex barrier and capex is not why ISPs didn’t bother to solve it in the first place.

    • dtaht 11 hours ago ago

      We have seen small ISPs get LibreQoS running in under an hour, which includes installing Ubuntu. Configuring it right and getting it fully integrated with the customer management system takes longer.

      We're pretty sure most of those ISPs see reduced opex from support calls.

      Capex, until the appearance of fq_codel (Preseem, Bequant) or CAKE (LibreQoS, Paraqum) middleboxes, was essentially infinite. Now it's pennies per subscriber, and many just get a suitable box off of eBay.

      I agree, btw, that how to monitor and scale is a learned thing. For example, many naive operators look at "drops" as reported by CAKE as a bad thing, when they are actually needed for good congestion control.

      • kortilla 8 hours ago ago

        > We have seen small ISPs get LibreQos running in under an hour, which includes installing ubuntu.

        Slapped together as a PoC is different than something production ready. Unless those ISPs are so small they don’t care about uptime, a single Ubuntu box in the only hot path of the network is no bueno.

        > We're pretty sure most of those ISPs see reduced opex from support calls.

        I highly doubt this. As someone who worked in an ISP, the things that people call their ISP for are really unrelated to the ISP (poor WiFi placement, computer loaded with malware, can’t find their WiFi password, can’t get into their gmail/bank/whatever). When Zoom sucks they don’t even think to blame their ISP, they just think zoom sucks.

        There is a tiny fraction of power users who might suspect congestion, but they aren’t the type to go into ISP support for help.

        > Capex until the appearance of fq_codel (Preseem, Bequant) or cake (LibreQos, Paraqum) middleboxes was essentially infinite. Now it's pennies per subscriber and many just a get a suitable box off of ebay.

        These tools have been around for a while now. My point is that the ISPs that haven’t done something about this yet aren’t holding out for a cheaper capex option. They are in the mode of not wanting to change anything at all.

        So this attitude that you only need to tell them “there is an open source thing you can run on an old server that will help with something that isn’t costing you money anyway” is out of touch with how most ISPs are run.

        The ones that care don’t need their customers to tell them. The ones that don’t care aren’t going to do anything that requires change.

  • codesections 12 hours ago ago

    How does OpenWrt fare on these metrics? Does it count as a "debloated router" in the sense used in TFA? Or is additional software above and beyond the core OpenWrt system needed to handle congestion properly?

    • wmf 12 hours ago ago

      OpenWRT has SQM but you have to enable it. https://openwrt.org/docs/guide-user/network/traffic-shaping/...

    • dtaht 11 hours ago ago

      OpenWrt deprecated pfifo_fast in favor of fq_codel in 2012 and has not looked back. It (and BQL) is ever-present on all their Ethernet hardware and most of their WiFi, no configuration required. It's just there.

      That said, many OpenWrt chips have offloads that bypass that, and while speedier and lower power, they tend to be over-buffered.

  • NoPicklez 10 hours ago ago

    In my experience higher latency due to bufferbloat occurs when my internet connection is saturated, like the example in the article of downloading a game.

    However, people can still have latency issues from their ISP even if their connection isn't fully saturated at home. Bufferbloat is just one situation in which higher latency is created.

    Yes, my Zoom call was terrible BECAUSE I was also downloading Diablo saturating my connection. But my Zoom call could also be terrible without anything else being downloaded if my ISP is bad or any number of other things.

    As someone who worked in a large ISP: if a customer says their bandwidth is terrible but they are getting their line saturated, most ISPs will test for latency issues.

    Bufferbloat is one of many many reasons why someone's network might be causing them high latency.

    • ynik 4 hours ago ago

      The really horrible bufferbloat usually happens when the upload bandwidth is saturated -- upload bandwidth tends to be lower, so it will cause more latency for the same buffer size. I used to have issues with my cable modem, where occasionally the upload bandwidth would drop to ~100 kbit/s (from the normal 5 Mbit/s), and if this tiny upload bandwidth was fully used, latency would jump from the normal 20 ms to 5500 ms. My ISP's customer support (Vodafone Germany) refused to understand the issue and only wanted to upsell me on a plan with more bandwidth. In the end I relented and accepted their upgrade offer because it also came with a new cable modem, which fixed the issue. (Back then ISPs didn't allow users to bring their own cable modem -- nowadays German law requires them to allow it.)
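      Those numbers are consistent with a fixed-size buffer draining at the degraded rate; a rough sketch (the buffer size is an assumption picked to illustrate the effect):

      ```python
      # Queuing delay added by a full FIFO buffer = buffer_size / drain_rate
      buffer_bytes = 64 * 1024                    # assumed ~64 KB modem upload buffer
      for uplink_bps in (5_000_000, 100_000):     # healthy 5 Mbit/s vs. degraded ~100 kbit/s
          delay_ms = buffer_bytes * 8 / uplink_bps * 1000
          print(f"uplink {uplink_bps / 1e6:.1f} Mbit/s -> buffer adds ~{delay_ms:.0f} ms")
      # 5.0 Mbit/s -> ~105 ms;  0.1 Mbit/s -> ~5243 ms, right around the observed 5500 ms
      ```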

    • jonathanlydall 8 hours ago ago

      It is true that there is an interplay between bandwidth utilization and latency.

      However (assuming no prioritisation), if your bandwidth is at least double your video conference bandwidth requirements then a download shouldn’t significantly affect the video conference since TCP tends to be fair between streams.

      Even when I was on a 10Mb/s line I found gaming and voice was generally fine even with a download.

      However, if you’re using peer to peer (like BitTorrent), then that is utilizing dozens or hundreds of individual TCP streams and then your video conference bandwidth getting equal amount per all other streams is too slow.

      Bufferbloat exacerbates high utilisation symptoms because it confounds the TCP algorithms which struggle to find the correct equalibrium due to “erratic” feedback on if you’re transmitting too much.

      It’s like queuing in person at a government office and not being able to see past a door or corner how bad the queue really is, if you could see it’s bad you might come back later, but because you can’t you stand a while on the queue only to realize quite a bit later you’ll have to wait much longer than you initially expected, but if you’d known upfront it would be bad you might have opted to come back later when it’s more quiet. Most people feel that since they’ve sunk the time already they may as well wait as long as it takes, further keeping the queue long.

      Higher throughput would help, but just knowing ahead that now’s a bad time would help a lot too.

      I do wish most consumer ISPs supported deprioritising packets of my choice, which would allow you to download things heavily at low priority and your video call would be fine.
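      Roughly what the fairness argument above looks like in numbers, as a sketch (it assumes ideal per-stream fairness, which bufferbloat undermines in practice):

      ```python
      # Per-stream fair share on a saturated link, assuming ideal TCP fairness.
      link_mbps = 10
      call_needs_mbps = 3

      for other_streams in (1, 5, 100):                 # one download vs. a busy BitTorrent client
          share = link_mbps / (other_streams + 1)       # +1 for the call's own stream
          verdict = "fine" if share >= call_needs_mbps else "starved"
          print(f"{other_streams:3d} competing streams -> call gets ~{share:4.1f} Mbit/s ({verdict})")
      # 1 -> 5.0 (fine), 5 -> 1.7 (starved), 100 -> 0.1 (starved)
      ```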

  • jiggawatts 13 hours ago ago

    Even IT professionals can't tell the difference between latency and bandwidth, or capacity and speed.

    A simple rule of thumb is: If a single user experiences poor performance with your otherwise idle cluster of servers, then adding more servers will not help.

    You can't imagine how often I have to have this conversation with developers, devops people, architects, business owners, etc...

    "Let's just double the cores and see what happens."

    "Let's not, it'll just double your costs and do nothing to improve things."

    Also another recent conversation:

    "Your network security middlebox doesn't use a proper network stack and is adding 5 milliseconds of latency to all datacentre communications."

    "We can scale out by adding more instances if capacity is a concern!"

    "That's... not what I said."

    • floating-io 9 hours ago ago

      That is the perfect high-level overview of my previous job.

      I'm semi-retired now...

      (edit: I forgot to note that the "let's not" part was always overridden by "You're wrong, this will fix it. Do it!" by management. Then we would eventually find and fix the actual problem (because it didn't go away), but the cluster size -- and the cost -- would remain because "No, it was too slow with so few replicas".)

    • dtaht 11 hours ago ago

      I share your pain. I really really really share your pain.

    • JohnMakin 12 hours ago ago

      It’s astounding how many people that work in infrastructure should understand things like this but don’t, particularly network bottlenecks or bottlenecks in general. I’ve seen situations where someone wants to increase the number of replicas for some service because the frontend is 504’ing, but the real reason is because the database has become saturated by the calls from the service. It is possible (a little unlikely, but possible, and the rule with infra at scale is “unlikely” always becomes “certain”) to actually make the problem worse by scaling up there. The number of blank stares I get when explaining things like this is demoralizing sometimes, especially in consulting situations where you have some pig headed engineer manager that thinks he knows everything about everything.

    • hinkley 7 hours ago ago

      “Just try it anyway.”

    • ipython 12 hours ago ago

      As they say, if you’re getting impatient for your baby to arrive, just get more pregnant ladies together! The cluster of pregnant women make the process move along quicker!

      /s

      • codesections 12 hours ago ago

        And the "they" in question is Warren Buffet, https://nymag.com/intelligencer/2009/06/you_cant_make_a_baby...

        • vitus 11 hours ago ago

          I thought this was first attributed to Fred Brooks in the 70s.

          > Brooks points out this limited divisibility with another example: while it takes one woman nine months to make one baby, "nine women can't make a baby in one month".

          https://en.wikipedia.org/wiki/Brooks%27s_law

          • fragmede 4 hours ago ago

            You can't take a random 9 women and have a baby in one month, but that makes me wonder: statistically, how many women from the whole population would you need? Given 9 million women, the chances are one of them is giving birth right now. If I needed a baby tomorrow, what do the statistics say about how many women it would take?

            Taking the population of the earth and the birth rate and doing some math, you end up needing around 12,000 women of reproductive age for one of them to have a baby tomorrow.

            12,000 is a lot of women! It's well above Dunbar's number. Think about that next time the nine-women-one-month-baby topic comes up.

  • voidwtf 10 hours ago ago

    These type of solutions don’t scale to large ISPs, and gets costly to deploy at the edge. It’s also not just about throughput in Gbps, but Mpps.

    Also, this doesn’t take into account that the congestion/queueing issue might be at an upstream. I could have 100g from the customers local co to my core routers, but if the route is going over a 20g link to a local IX that’s saturated it probably won’t help to have fq/codel at the edge to the customer.

  • panosv 12 hours ago ago

    macOS now has a built-in dedicated tool called networkQuality that tries to capture these variables: https://netbeez.net/blog/measure-network-quality-on-macos/

    Also take a look at the Measurement Swiss Army Knife (MSAK): https://netbeez.net/blog/msak/

  • globalnode 6 hours ago ago

    Could I put the appropriate algorithm onto a Raspberry Pi and put it inline with my cheap router to fix the issue? In theory?

  • sandworm101 3 hours ago ago

    >> For example, I once measured the time to send a “ping” to downtown Toronto from my home office in the suburbs. It took 0.13s to get downtown and back. That is the normal ping time for Istanbul, Turkey, roughly 8,000 km away.

    Ya. Canada is like that. Lack of choice in ISPs, high costs and horrible uptime performance.
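    For scale, the physics-only part of those two ping times, as a sketch (refractive index and distances are round figures):

    ```python
    # Propagation in fibre runs at roughly c / 1.47, about 204,000 km/s.
    FIBRE_KM_PER_S = 300_000 / 1.47

    for name, one_way_km in (("suburb to downtown", 30), ("home to Istanbul", 8_000)):
        rtt_ms = 2 * one_way_km / FIBRE_KM_PER_S * 1000
        print(f"{name:>18}: ~{rtt_ms:5.1f} ms RTT from propagation alone")
    # suburb to downtown: ~0.3 ms -- the other ~130 ms is access tech, queuing, and detours
    #   home to Istanbul: ~78 ms  -- so ~130 ms there is at least the right order of magnitude
    ```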

  • moffkalast 3 hours ago ago

    > If you are an ISP and your customers hate you

    So... every ISP that exists then? Networking is one of those fields where the results are just varying shades of terrible no matter how hard you try.

  • buginprod 5 hours ago ago

    130 ms latency within a city? Are you using sound or something?

    Yeah, yeah, I know that's only 40 m-ish for sound.

  • imp0cat 5 hours ago ago

    The Waveform Bufferbloat test page (https://www.waveform.com/tools/bufferbloat) has some recommended routers for mitigating bufferbloat, but the first recommended one (Eero) has apparently already been revised and the new version's capabilities are not as good as they once were. The second one (Netgear Nighthawk) seems to have terrible software and support.

    So I'm looking for some opinions: what's your experience? Casual googling seems to suggest that the best way to implement traffic management would be either a dedicated machine running something like OpenWrt, or an all-in-one solution (i.e. a Firewalla Gold plus some AP to provide WiFi).

    • archi42 3 hours ago ago

      A router you can flash with a modern OpenWRT is likely a good option. Check the project website and/or forums and/or reddit for recent recommendations. That's what I did in the past.

      Personally I've moved to OPNsense: some run it natively on refurbished low-power SFF hardware (6000- or 7000-series Intel should be fine, or some Ryzen), so even in countries with high electricity costs that's feasible these days.

      More specifically, I run OPNsense in a qemu/libvirt VM (2 cores of an E5-2690v4) and do WiFi with popular prosumer APs. Mind that VMs are likely to introduce latency, so if you try this route, make sure to PCIe-passthrough your network devices to the VM -- I was prepared to ditch the VM for a dedicated SFF box.

  • msla 6 hours ago ago

    Previously:

    It's The Latency, Stupid: http://www.stuartcheshire.org/rants/latency.html

  • tonymet 13 hours ago ago

    There are three parameters of concern with your ISP: bandwidth, latency (and jitter), and data caps.

    Bandwidth is less of a concern for most people now that data rates are 500 Mbps+. That's enough to comfortably stream 5 concurrent 4K streams (at 20 Mbps each).

    Latency and jitter have a bigger impact on real-time applications, particularly video conferencing, VoIP, gaming, and to a lesser extent video streaming when you are scrubbing the feed. You can test yours at https://speed.cloudflare.com/ . If your video is jittery or laggy and you are having trouble with natural conversation, latency/jitter are likely the issue.

    Data caps are a real concern for most people. At 1 Gbps, sustained transfers can burn through a 1-1.5 TB data cap within a few hours.

    Assuming you are around 500 Mbps or more, latency and data caps are the bigger concern.
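    The arithmetic behind those figures, roughly (using the per-stream bitrate and cap sizes above):

    ```python
    # 4K streaming load vs. a 500 Mbps line, and how long a cap lasts at full tilt.
    streams, stream_mbps = 5, 20
    print(f"{streams} x 4K ~= {streams * stream_mbps} Mbit/s -- comfortable on 500 Mbit/s")

    cap_tb, line_gbps = 1.0, 1.0
    hours = cap_tb * 8e12 / (line_gbps * 1e9) / 3600
    print(f"A {cap_tb:.0f} TB cap lasts ~{hours:.1f} h of fully saturated {line_gbps:.0f} Gbit/s transfer")
    # -> ~2.2 hours at line rate; in practice it takes sustained heavy use over the month
    ```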

    • lxgr 13 hours ago ago

      > At 1 Gbps, sustained transfers can burn through a 1-1.5 TB data cap within a few hours.

      Assuming you're talking about consumers: How? All that data needs to go somewhere!

      Even multiple 4K streams only take a fraction of one gigabit/s, and while downloads can often saturate a connection, the total transmitted amount of data is capped by storage capacities.

      That's not to say that data caps are a good thing, but conversely it also doesn't mean that gigabit connections with terabyte-sized data caps are useless.

    • daemonologist 8 hours ago ago

      Another problem around data caps is that even if you have/pay extra for "unlimited" data, there's still a point where your ISP will fire you as a customer (or threaten to do so) for using too much data - I've heard of it around 8-10 TB on Comcast for example. Unlike with mobile plans there's no soft cap in the contract, they just decide when you've breached the ToS/AUP and can cut you off at their sole discretion.

    • Izkata 12 hours ago ago

      > gaming

      Gamers tend to have an intuitive understanding of latency, they just use the words "lag" and "ping" instead.

  • readingnews 13 hours ago ago

    ACM, come on, stop spreading disinformation. You know well and good nothing travels at the speed of light down the wire or fiber. We have converters on the end, and in fact in glass it is the speed of light divided by the refractive index of the glass. Even in the best of times, not c. I just hate that, when a customer is yelling at me telling me that the latency should be absolute 0, they start pointing at articles like this, "see, even the mighty ACM says it should be c".

    Ugh.

    • anotherhue 13 hours ago ago

      And that's before you consider the actual cable length vs the straight line distance.

    • lxgr 13 hours ago ago

      It's a reasonable approximation for most calculations. It seems unfair to call that "disinformation".

      Serialization delay, queuing delay etc. often dominate, but these have little to do with the actual propagation delay, which also can't be neglected.

      > when a customer is yelling at me telling me that the latency should be absolute 0

      The speed of light isn't infinity, is it?

    • thowawatp302 12 hours ago ago

      I don’t think you’re going to have an issue with Cherenkov radiation in the fiber and that fiber is not going to be a straight line over a non trivial distance so the approximation is close enough.