Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

(netflixtechblog.com)

66 points | by vquemener 4 days ago

31 comments

  • yjftsjthsd-h 21 hours ago

    Okay, I'll ask the dumb question: couldn't you also reduce the number of layers per container? Sure, if you can reuse layers you should, but unless you've done something very clever like 1 package per layer, I struggle to see how 50 layers are really useful.

    • ActorNightly 20 hours ago

      It's not a dumb question. With these supposedly high-tech enterprise solutions, a lot of churn goes into doing something very complex and impressive, like investigating architecture performance for kernel-level operations and figuring out the kernel specifics that are causing slowdowns. Instead, they could put that talent into just writing software without containers that can max out any EC2 instance at delivering streamed content, and then you don't have to worry about why your containers take so long to load.

      • hvb2 20 hours ago

        I have seen these comments quite a bit but they gloss over a major feature of a large company.

        In a large company you can have thousands of developers just coding away at their features without worrying about how any of it runs. You can dislike that, but that's how that goes.

        From a company perspective this is preferable as those developers are supposedly focussed on building the things that make the company money. It also allows you to hire people that might be good at that but have no idea how the deployment actually works or how to optimize that. Meanwhile with all code running sort of the same way, that makes the operations side easier.

        When the company grows and you're dealing with thousands of people contributing code, these optimizations might save a lot of money/time. But those savings might be peanuts compared with every 10 devs coming up with their own deployment, and the ops overhead of that.

      • Hikikomori 18 hours ago

        Content is not streamed from these containers.

        • ActorNightly 2 hours ago

          Then there is even less reason to spin up new containers at the rate they are doing it.

    • gucci-on-fleek 21 hours ago

      > unless you've done something very clever like 1 package per layer I struggle to think that 50 is really useful?

      1 package per layer can actually be quite nice, since it means that any package updates will only affect that layer, meaning that downloading container updates will use much less network bandwidth. This is nice for things like bootc [0] that are deployed on the "edge", but less useful for things deployed in a well-connected server farm.

      [0]: https://bootc-dev.github.io/bootc/

      • seabrookmx 20 hours ago

        It doesn't really work this way, does it?

        It's called a layer because each layer on top depends on the layers below.

        If you change the package defined in the bottommost layer, all 49 layers above it are invalidated and need to be re-pulled or re-built.

        • minitech 19 hours ago

          That’s mostly a Dockerism (and even Docker has `COPY --link` these days). The underlying tech supports independent layers.

        • gucci-on-fleek 20 hours ago

          > If you change the package defined in the bottom most layer, all 49 above it are invalid and need re-pulled or re-built.

          I also initially thought that that was the case, but some tools are able to work around that [0] [1] [2]. I have no idea how it works, but it works pretty well in my experience.

          [0]: https://github.com/hhd-dev/rechunk/

          [1]: https://coreos.github.io/rpm-ostree/container/#creating-chun...

          [2]: https://coreos.github.io/rpm-ostree/build-chunked-oci/

        • SR2Z 18 hours ago

          Layering in the container spec is achieved by overlaying each layer's filesystem (a tarball, I think) over each layer below it. If file "a" is modified in layers 3 and 5, the resulting container will have data for both versions but reading "a" in the container will return version 5.

          Docker exploits this to figure out when it can cache a layer, but building a container is different from running one, because changing the underlying filesystem can change what a command outputs. When you're running a container, changing one deeply buried layer doesn't change the layers above it, because they're already saved.
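
          To make that "topmost layer wins" rule concrete, here's a toy Python sketch of the lookup (my own illustration, not real overlayfs code): each layer is a dict mapping path to contents, and a read walks the layers from top to bottom.

```python
# Toy model of overlay lookup, not real overlayfs: each layer is a
# dict of path -> contents, and a read returns the topmost version.
def read_file(layers, path):
    """Search layers from top (last in the list) to bottom (first)."""
    for layer in reversed(layers):
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

layers = [
    {"a": "v1", "b": "base"},  # layer 1 (bottom) creates "a" and "b"
    {},                        # layer 2
    {"a": "v3"},               # layer 3 modifies "a"
    {},                        # layer 4
    {"a": "v5"},               # layer 5 modifies "a" again
]

print(read_file(layers, "a"))  # "v5": the topmost write wins
print(read_file(layers, "b"))  # "base": untouched files fall through
```

          Note that both versions of "a" still exist in the stored layer data; only the read resolves to the topmost one.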

      • yjftsjthsd-h 19 hours ago

        Yes, my intended meaning was that if you're doing that or something similar then I totally get having lots of layers because it's useful. Mostly I've only seen it come up with nix, but I can see how bootc would have a similar deal. That said, most container images I've ever seen aren't doing anything that clever and probably should be like... 2-3 layers? (One base layer, one with all your dependencies shoved in, and maybe one on top for the actual application.)

    • solatic 19 hours ago

      This is Netflix, they have thousands of engineers. So you have two approaches to solve the problem: either write enforced policy-as-code to prevent people from deploying images with too-high layer count (and pray they never need to rollback to an image from before the policy was written), thus incurring political alignment costs around the new policy and forcing non-compliant teams to adapt (which is time not spent on features); or, solve the problem entirely at the infrastructure level.

      It's hardly surprising that companies consider infrastructure-level solutions to be better.

    • redanddead 19 hours ago

      Here’s an even dumber question: why didn’t they make a documentary instead of an article?

  • rixed 21 hours ago

    I am not familiar with the nitty gritty of container instance building process, so maybe I'm just not the intended audience, but this is particularly unclear to me:

      > To avoid the costly process of untarring and shifting UIDs for every container, the new runtime uses the kernel’s idmap feature. This allows efficient UID mapping per container without copying or changing file ownership, which is why containerd performs many mounts
    
    Why does using idmap require performing more mounts?

    • nineteen999 19 hours ago

      The costly process probably explains why they just started injecting ads in my plan where there previously weren't any.

      And it also explains why, rather than be pushed into a more expensive plan to help them pay for their containers, I cancelled my subscription. Not like there's more than 1% of the content there worth paying for these days anyway.

    • martijnvds 20 hours ago

      This kind of id mapping works as a mount option (it can also be used on bind mounts). You give it a mapping of "id in filesystem on disk" to "id to return to filesystem APIs" and it's all translated on the fly.
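
      As a rough model of that translation (my own sketch, loosely following the kernel's uid_map triples of on-disk id, presented id, and count; the real mapping happens in the VFS with no copying or chown):

```python
# Rough model of the id translation an idmapped mount performs.
# Each mapping is (id_on_disk, id_presented, count); ids outside
# every range fall back to the overflow id ("nobody").
OVERFLOW_UID = 65534

def map_uid(uid, mappings):
    for id_on_disk, id_presented, count in mappings:
        if id_on_disk <= uid < id_on_disk + count:
            return id_presented + (uid - id_on_disk)
    return OVERFLOW_UID

# Present the image's ids 0..65535 as 100000..165535 for one container.
mappings = [(0, 100000, 65536)]
print(map_uid(0, mappings))       # 100000: root on disk appears shifted
print(map_uid(1000, mappings))    # 101000
print(map_uid(200000, mappings))  # 65534: unmapped ids become "nobody"
```

      Per container you just attach a different mapping at mount time, instead of untarring the image or chowning its files.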

      • rixed 14 hours ago

        Thank you! Going to ask an LLM to lecture me on this when I have some time; good to see that humans are still the best at giving just the right amount of explanation :)

  • ViktorRay a day ago

    Articles like this are pretty cool. It's so interesting to see the behind-the-scenes work that happens whenever we watch a Netflix movie.

  • haneul a day ago

    Interesting, another case of disabling HT improving performance. Reminds me of doing that on Intel CPUs from a few generations ago.

    • ahoka 18 hours ago

      It's quite logical: once you've saturated your CPU, hyperthreading can only decrease performance.

      • spockz 17 hours ago

        In this case the CPU wasn't really saturated with work but with contention on global locks. The contention is lessened by reducing the number of concurrent mounts being done.

        I wonder if simply setting a maximum number of concurrent mounts in the code, or letting containerd think there are only half the number of cores, would have reduced the contention to the same degree.
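
        As a hypothetical sketch of that first idea (not containerd's actual code), capping in-flight mounts with a semaphore would look something like this, with a short sleep standing in for the mount syscall:

```python
import threading
import time

# Hypothetical sketch: cap concurrent "mount" operations with a
# semaphore, independent of how many worker threads the host's
# core count would otherwise allow.
MAX_CONCURRENT_MOUNTS = 4

gate = threading.BoundedSemaphore(MAX_CONCURRENT_MOUNTS)
lock = threading.Lock()
in_flight = 0
peak = 0

def mount_layer(layer_id):
    global in_flight, peak
    with gate:  # at most MAX_CONCURRENT_MOUNTS threads get past this
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for the actual mount syscall
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=mount_layer, args=(i,)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrent mounts:", peak)  # never exceeds MAX_CONCURRENT_MOUNTS
```

        Whether that would actually help presumably depends on whether the lock contention scales with the number of concurrent mounts or with the total mount count.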

  • spockz 18 hours ago

    Interestingly, this enhancement was proposed in July 2025, accepted and merged in August 2025, and released in November 2025. The blog post is also from November, and now it shows up here.

  • s_ting765 17 hours ago

    Interesting blog post. For what it's worth, I count 7 em-dashes used.

  • parliament32 a day ago

    It took them this long to move from docker to containerd?

  • DeathArrow 20 hours ago

    So using the "old" container architecture could have been better than spending time implementing the new architecture, dealing with the performance issues, and spending more time fixing them?

    • seabrookmx 11 hours ago

      This completely ignores all their reasons to move to the new architecture in the first place.

      My understanding is that they had a mostly in-house architecture (that predated Kubernetes' rise) and by moving to this new platform, they are now much more closely aligned with standard Kubernetes. They can now use EKS for their control plane and leverage the many community-provided features previously unavailable to them.

  • vivzkestrel a day ago

    - can someone kindly explain why there are 2 websites that both claim to be the Netflix tech blog?

    - website 1 https://netflixtechblog.medium.com/

    - website 2 https://netflixtechblog.com/

    • hhh 19 hours ago

      The second one is a hosted custom domain for the medium blog iirc

    • geodel a day ago

      I mean, Netflix is dealing with big, important things like container scaling, creating a million microservices talking to each other, and so on. Having multiple tech blogging platforms on Medium is not something they have a spare moment to think about.

  • owenthejumper a day ago

    Why is this so badly AI-written? Netflix can surely pay for writers.

    At this point I refuse to read any content in the AI format of:

    - The problem

    - The solution

    - Why it matters

    • 18 hours ago
      [deleted]