Netflix Simplified Batch Compute with Kueue

(netflixtechblog.com)

53 points | by dalvrosa 4 days ago

11 comments

  • lukax a day ago

    It's refreshing to see a tech article that isn't about AI. It feels like 5 years ago.

  • __turbobrew__ a day ago

    Anyone know if Netflix does anything for the k8s storage layer? I imagine they are at the scale where etcd starts to go kaboom? Or maybe they have enough cells where that isn’t a problem?

    Given Amazon and Google have their own secret sauce for replacing etcd, I am wondering if Netflix does anything special?

    • scripni a day ago

      This runs on AWS-managed EKS these days. This talk goes into more detail about Netflix's special sauce around the k8s control plane: https://www.youtube.com/watch?v=vaTOiXR2KSM

      Netflix actually has far fewer cells than you'd expect btw; their special sauce IMO is federation and using a small subset of k8s APIs.

      • __turbobrew__ 15 hours ago

        I am surprised a company at that scale is running on managed EKS. Maybe I underestimate how large the clusters are.

        • zbentley 3 hours ago

          EKS can get pretty damn big, well into the thousands of nodes without much special tuning, and beyond that with some care and control plane monitoring. Expensive, though.

    • stackskipton a day ago

      It's possible they are using kine: https://github.com/k3s-io/kine

  • whinvik a day ago

    I see Netflix pumping out tech articles but can't help but notice how much worse the UI experience is getting: videos erroring out, general slowness, etc.

    Did they just give up?

  • jamesblonde 20 hours ago

    It certainly feels like Netflix is now a k8s shop. And it is probably only a matter of time until they start repatriating workloads to optimize for costs. Then the world will sit up and notice.

    • beng-nl 18 hours ago

      I don’t get what you’re implying. What is repatriating? Do you think they will move their workloads to on-prem?

      Has something changed in the world that shifts the cloud vs on-prem trade-off calculus from where it stood over the last 15 years?

      (I’m as anti-cloud-overspend as the next guy on hn btw. Just trying to make sense of your comment’s worldview.)

      • jamesblonde 7 hours ago

        Yes, coding agents have reduced the skills and knowledge required to operate workloads on virtualized hardware. K8S and its ecosystem have changed so that they now provide 90% of what you need from the public cloud providers. These are big changes that make 8-15X savings possible by running your own workloads. I think it will be the big players who move first, as they have the most to save and the resources to make it happen.

  • scripni a day ago

    Congrats, this is awesome!