Show HN: K8s Cleaner – Roomba for Kubernetes

(sveltos.projectsveltos.io)

68 points | by pescerosso 7 days ago

61 comments

  • zzyzxd 7 days ago

    For resources that are supposed to be cleaned up automatically, fixing your operator/finalizer is a better approach. Using this tool is just kicking the can down the road, which may cause even bigger problems.

    If you have resources that need to be regularly created and deleted, I feel a cronjob running `kubectl delete -l <your-label-selector>` should be more than enough, and less risky than installing third-party software with cluster-wide list/delete permissions.
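
    For what it's worth, the scheduled job itself stays tiny. A minimal sketch, assuming a `cleanup=true` label and an arbitrary set of kinds (both placeholders); in-cluster you would wrap this in a batch/v1 CronJob running under a ServiceAccount with list/delete RBAC on those kinds:

      # Sketch only: preview with a server-side dry run, then delete for real.
      kubectl delete deployments,jobs,configmaps -A -l cleanup=true --dry-run=server
      kubectl delete deployments,jobs,configmaps -A -l cleanup=true --ignore-not-found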

    • pcl 7 days ago

      How should I discover the things that need deletion?

      Presumably running some sort of analysis tool in a dry-run mode would help, no?

    • __turbobrew__ 6 days ago

      I think it can be useful as a discovery tool. How do you know that your operator is leaking resources in the first place? What if one of your cluster operators manually modified or created resources?

    • devops76 6 days ago

      Most of the time you don't own the operator/finalizer.

      And most of the time, when it's a shared cluster, you don't even know what else is being deployed.

  • devops99 7 days ago

    If you find yourself using something like this, you seriously fucked up as DevOps / cloud admin / whatever.

    • pescerosso 7 days ago

      I understand where you’re coming from, and ideally we strive for well-managed Kubernetes environments. However, as DevOps practitioners, we often face complexities that lead to stale or orphaned resources due to fast deployment cycles, changing application needs, or changing teams. Even the public clouds make lots of money from services that are left running unused, and some companies make a living helping clean that up.

      K8s-cleaner serves as a helpful safety net to automate the identification and cleanup of these resources, reducing manual overhead and minimizing human error. It allows teams to focus on strategic tasks instead of day-to-day resource management.

      • devops99 7 days ago

        > However, as DevOps practitioners, we often face complexities that lead to stale or orphaned resources due to fast deployment cycles

        So, as a DevOps practitioner myself, I had enough say within the organizations I worked at, who are now clients, and also with my other clients, that anything not in a dev environment goes through our GitOps pipeline. Other than the GitOps pipeline, there is zero write access to anything that isn't dev.

        If we stop using a resource, we remove a line or two (usually just one) in a manifest file, and the GitOps pipeline takes care of the rest.

        Not a single thing is unaccounted for, even if indirectly.

        That said, the DevOps-in-name-only clowns far outnumber actual DevOps people, and there is no doubt a large market for your product.

        edited: added clarity

        • photonthug 7 days ago

          > I had enough say within the organizations I worked at, who are now clients

          This sounds like experience mainly at small/medium-sized orgs. At large orgs the devops/cloud people are constantly under pressure to install random stuff from random vendors. That pressure comes from every direction, because every department head (infosec/engineering/data science) is trying to spend huge budgets to justify their own salary/headcount and maintain job security; it’s harder to fire someone if you’re in the middle of a migrate-to-vendor process they championed and you’re locked into the vendor contract, etc. People will also seek to undermine every reasonable standard about isolation and break down the walls you design between environments, so that even QA or QC type vendors want their claws in prod. Best practice or not, you can’t really say no to all of it all the time, or it’s perceived as obstructionist.

          Thus there’s constant churn of junk you don’t want and don’t need that’s “supposed to be available” everywhere and the list is always changing. Of course in the limit there is crusty unused junk and we barely know what’s running anywhere in clouds or clusters. Regardless of the state of the art with Devops, most orgs are going to have clutter because those orgs are operating in a changing world and without a decisive or even consistent vision of what they want/need.

          • devops99 7 days ago

            > At large orgs the devops/cloud people are constantly

            Two of our clients are large orgs (15,000+ and 22,000+ employees). Their tech execs are happy with our work, specifically our software delivery pipeline with guard rails, where we emphasize a "Heroku-like experience".

            One of their projects needed HITRUST, and we made it happen for them in under four weeks (no, we're not geniuses; we stole the smarts of the AWS DoD-compliant architecture & deployment patterns), and the tone of the execs seemed to change pretty significantly after that.

            One of these clients laid off more than half their IT staff suddenly this year.

            When I was in an individual contributor role at a mid-size org (just under 3,000 employees), I wrote up my thoughts, an "internal whitepaper" or whatever, being fully candid about the absurd struggles we were having (why does instantiating a VM take over three weeks?), and sent it to the CTO (and also the CEO, but the CTO didn't know about that), and some things changed pretty quickly.

            But yeah, things suck in large orgs; that's why large orgs are outsourcing, which is in the most-downstream customers' (the American people's) best interests too -- a win-win-win all around.

            • 7 days ago
              [deleted]
        • eigenvalue 7 days ago

          A tool for organizations that don't have superstar, ultra-competent devops people on the full-time payroll sounds pretty useful in general. There are a lot of companies that just aren't at the scale to justify hiring someone for that kind of role, and even then, it's hard to find really good people who know what they are doing.

          • devops99 7 days ago

            > for organizations that don't have superstar, ultra-competent

            Just outsource.

            Outsource to those who do have DevOps people who know what they're doing -- most companies do this already in one form or another.

          • 7 days ago
            [deleted]
        • nine_k 7 days ago

          How do you realize that you have stopped using a resource? Can there be cases when you're hesitant to remove a resource just yet, because you want a possible rollback to be fast, and then it lingers, forgotten?

          • devops99 7 days ago

            With our GitOps patterns, anything you could call an "environment" has a Git branch that reflects "intended state".

            So a K8s namespace named test02, another named test-featurebranch, and another named staging-featurebranch all have their own Git branch with whatever Helm, OpenTofu, etc. they need.

            With this pattern, and other patterns, we have a principle: "if it's in the Git branch, we intend for it to be there; if it's not in the Git branch, it can't be there".

            We use Helm to apply and remove things -- and we loved version 3 when it came out -- so there's not really any way for anything to linger.
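
            To make that concrete, the reconcile step at the end of such a pipeline can be a few lines of shell. A sketch with placeholder paths and namespace (the real pipeline is whatever your CI runs):

              # Sketch: the branch is the source of truth for namespace "test02".
              # envs/test02/releases/ holds one directory per intended Helm release.
              desired=$(ls envs/test02/releases)
              for rel in $(helm list -n test02 -q); do
                echo "$desired" | grep -qx "$rel" || helm uninstall "$rel" -n test02
              done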

    • benreesman 7 days ago

      No one likes leaking resources, but it happens to the very best teams.

      It seems like a tool that can propose a good guess about where you landed is strictly useful and good?

      • devops99 7 days ago

        > but it happens to the very best teams.

        I am very convinced it does not. I think the crux of the difference between our viewpoints is what we recognize (or not) as "best" teams.

        • __turbobrew__ 6 days ago

          Sometimes you show up to your job and find that you have 10,000 square pegs that need to go through a circular hole. You can fix this problem over time, but sometimes you need a stopgap. You may or may not subscribe to the Google SRE creed, but the goal is to get stuff working today with what you have, however ugly it may be. Some hacky tool to restart programs to fix a memory leak is sometimes necessary due to time constraints or as a stopgap. Migrations at large companies can take multiple years, whereas I can install this tool with helm in approximately half a day.

          • devops99 6 days ago

            > Some hacky tool to restart programs to fix a memory leak is sometimes necessary due to time constraints or as a stopgap.

            This makes sense for K8s resources that ARE still serving production traffic. But this overall thread is about a tool to remove applications that ARE NOT serving production traffic.

            > Migrations at large companies can take multiple years

            Depends who is in charge and who management considers worth listening to (some of us don't struggle so hard in this area).

            > I can install this tool with helm in approximately half a day.

            A script I wrote to find unused resources took less than 10 minutes to write.
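
            Something in that spirit, read-only and with arbitrary kinds (it prints candidates rather than deleting anything; column positions assume default kubectl output):

              # Sketch: list likely-unused resources; review before deleting anything.
              kubectl get pv --no-headers | awk '$5 == "Released" {print "released PV: " $1}'
              kubectl get pvc -A --no-headers | awk '$3 != "Bound" {print "unbound PVC: " $1 "/" $2}'
              kubectl get pods -A --field-selector=status.phase=Succeeded --no-headers \
                | awk '{print "completed pod: " $1 "/" $2}'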

        • benreesman 7 days ago

          If you know how to do serious software without spilling any memory then you should write it up and collect the most significant Turing Award ever.

          Hackers would be immediately bifurcated into those who followed that practice and those who are helpless.

          • devops99 6 days ago

            > If you know how to do serious software without spilling any memory

            This thread is about K8s Pods (and other K8s resources) that have been sitting idle, not memory leaks in software.

            As far as "spilling" memory goes, that problem has already been solved by Rust, which does not do garbage collection because memory management is resolved statically through its ownership model. Does this mean some Rust programs won't use egregious amounts of memory? No. But unlike languages with garbage collection, when Rust is using that memory it is actually doing something with it.

            • benreesman 4 days ago

              I can assure you that memory leaks specifically and resource leaks generally are possible in any language more expressive than a stack machine with an arena.

              Rust makes an interesting and demonstrably pragmatic set of tradeoffs here as opposed to e.g. C/C++: treating std::move as a default (aka linear typing) prevents a lot of leaks. But it still has pointers (Arc, etc.) and it still has tables: it’s still easy to leak in a long-lived process doing interesting things. For a lot of use cases it’s the better default and it’s popular as a result.

              But neither Rust nor k8s have solved computer science.

    • thephyber 7 days ago

      Or the person working that role before you did, and you are trying to manage the situation as best you can.

      • devops99 7 days ago

        As a cloud/DevOps consultant, I don't believe in letting management drag on the life support of failed deployments.

        We (the team I've built out over the past couple of years) carve out a new AWS subaccount / GCP Project or bare-metal K8s or whatever the environment is, instantiate our GitOps pipeline, and services get cut over in order to be supported.

        When I was working in individual contributor roles, I managed to "manage upward" enough that I could establish boundaries on what could be supported or not. Yes this did involve a lot of "soft skills" (something I'm capable of doing even though my posts on this board are rather curt).

        "Every DevOps job is a political job" rings true.

        Do not support incumbent K8s clusters; as a famous -- more like infamous -- radio host used to say on 97.1 FM in the 90s: dump that bitch.

        • remram 7 days ago

          It's really hard to parse what you are trying to say in the middle of this colorful imagery and quotations, but if you are spinning up a new Kubernetes cluster (and AWS/GCP account) per deployment, it's obvious you don't need this tool.

          • devops99 7 days ago

            Specifically, leaving the corporate political dynamics aside, we move the workloads to the deployments that are up to standard and then archive+drop the old ones. Very simple.

            • remram 6 days ago

              A more cynical man than I would say that if you need to recreate all your workloads on a new cluster to bring them up to standard, "you seriously fucked up as DevOps / cloud admin / whatever"

              • devops99 5 days ago

                Heh, they aren't ""our"" workloads before we get there, and not until they get deployed correctly for the first time ever; at that point, and only then, are they ""our"" workloads.

    • remram 7 days ago

      I don't agree; you can see it as just another linting tool, just like your IDE will warn of unused variables and functions and offer a quick-fix to remove them. You wouldn't call someone a fucked up programmer for using an IDE.

      • devops99 7 days ago

        [flagged]

        • remram 7 days ago

          I am assuming that the tool provides a list and asks for confirmation before deleting? Does it just go and automatically delete stuff in the background at any time?

          edit: yes, I see a "dry run" option, which is how I would use it. I also see a "scheduled" option, which is probably what you're criticizing. Hard to tell; you're quicker with the insulting than the arguing.

    • jrsdav 7 days ago

      Not all third party Kubernetes controllers are created equal, unfortunately.

      And it is also the reality that not every infra team gets final say on what operators we have to deploy to our clusters (yet we're stuck cleaning up the messes they leave behind).

    • llama052 7 days ago

      Agreed, this feels like the Kubernetes descheduler as a service or something. Wild.

    • paolop 7 days ago

      Possible, but if someone needs it and it works well, why not?

    • cameronh90 7 days ago

      Or you're budget constrained, and haven't yet been able to allocate resources to things like GitOps.

    • devops99 7 days ago

      [flagged]

      • 7 days ago
        [deleted]
  • darkwater 7 days ago

    How does it work in an IaC/CD scenario, with things like Terraform or ArgoCD creating and syncing resources lifecycle inside the cluster? A stale resource, as identified and cleaned by K8s Cleaner, would be recreated in the next sync cycle, right?

    • mgianluc 7 days ago

      In that case I would use it in DryRun mode and have it generate a report. Then look at the report and, if it makes sense, fix the Terraform or ArgoCD configuration.

      More on report: https://gianlucam76.github.io/k8s-cleaner/reports/k8s-cleane...

      • darkwater 6 days ago

        Nice, that's something already. But, as someone else was saying, it would be cool to iterate on and improve this.

        Because, if you want to have a product to sell out of this (and I guess you do?), your no. 1 clients will be (IMO) big enterprises, which usually have a lot of cruft in their clusters and go after cost optimizations. But they also usually don't shove manifests in directly; they have at least one solution on top of it that takes care of resource lifecycles. So, IMO, if you want your product to be useful for those kinds of customers, this is something that needs to be improved.

      • imcritic 7 days ago

        Improve it! Teach it to figure out if the resource is managed by ArgoCD or FluxCD and then suspend reconciliation.
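
        The ownership check is mostly a matter of looking at well-known labels and annotations. A rough sketch for a single object (my-app/my-ns are placeholders, and the exact keys depend on how Argo CD's tracking is configured; Flux's kustomize-controller stamps its own labels):

          # Sketch: does anything suggest this object is tracked by Argo CD or Flux?
          kubectl get deploy my-app -n my-ns -o json \
            | grep -E 'app\.kubernetes\.io/instance|argocd\.argoproj\.io/tracking-id|kustomize\.toolkit\.fluxcd\.io/name' \
            && echo "looks GitOps-managed: fix it in Git instead of deleting it here"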

        • devops76 6 days ago

          It is open source. The least we could do, if we find something else that would be useful, is file an enhancement request. Be nice.

  • paolop 7 days ago

    I've been using it for a bit now and I'm very happy with it. The stale-persistent-volume-claim detection has had almost a 100% hit rate in my case; it's a real game-changer for cleaning up disk space.

    Kubernetes clutter can quickly become a headache, and having a tool like this to identify and remove unused resources has made my workflow so much smoother.
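
    For anyone who wants to sanity-check the same thing by hand, a read-only sketch that lists claims no pod currently mounts (assumes jq; a claim that is merely unmounted right now, e.g. behind a scaled-down StatefulSet, is not necessarily garbage):

      # Sketch: PVCs that no running pod references -- candidates only, review first.
      kubectl get pods -A -o json \
        | jq -r '.items[] | .metadata.namespace as $ns | .spec.volumes[]?
                 | select(.persistentVolumeClaim)
                 | "\($ns)/\(.persistentVolumeClaim.claimName)"' | sort -u > /tmp/pvcs-in-use
      kubectl get pvc -A --no-headers | awk '{print $1 "/" $2}' | sort \
        | comm -23 - /tmp/pvcs-in-use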

  • S0y 7 days ago

    When I saw the headline I was pretty excited, but looking at your examples, I'm really curious why you decided to make everything work via CRDs. Also, having to write code inside those CRDs for the cleanup logic seems like a pretty steep learning curve, and honestly I'd be pretty scared to end up writing something that would delete my entire cluster.

    Any reason why you chose this approach over something like a CLI tool you can run on your cluster?

    • mgianluc 7 days ago

      It has a DryRun mode that generates a report on which resources would be affected. So you can see what resources it identifies as unused before asking it to remove them. I would be scared otherwise as well. Agree.

    • resouer 7 days ago

      A CRD controller is expected, especially since it lets your cleanup logic run long-lived in the cluster, or periodically. But writing code inside YAML is very weird; at least CEL would be an option here.

    • mgianluc 7 days ago

      One more thing: it comes with a large library of use cases already covered, so you don't necessarily need to write any new instances.

      You would need to do that only if you have your own proprietary CRDs or some use case that is not already covered.

  • siva7 7 days ago

    Don’t let the naysayers here discourage you. I used CCleaner on Windows 20 years ago, so why not finally have one for my kube cluster now.

  • empath75 7 days ago

    Why isn't this just a CLI tool? I don't see any reason it needs to be installed on a cluster. There should at least be an option to scan a cluster from the CLI.

    • sethhochberg 7 days ago

      Ironically, I'd bet the environments most desperately in need of tools like this are some of the ones where there has been lots of "just run this Helm chart/manifest!" admining over the years...

    • mgianluc 7 days ago

      Agreed, the ability to also run it as a binary would be helpful.

  • caust1c 7 days ago

    So is this the first instance of a Cloud C-Cleaner then? You could call it CCCleaner!

  • fyodor0 6 days ago

    Looking at the other comments and drawing the Windows parallel, I propose kkCleaner.

    Useful project nevertheless!

  • mia_villarreal 6 days ago

    Seems like a simple and effective tool!

  • kiney 7 days ago

    I've been saying for a while that most of the time we didn't replace pets with cattle, but pet servers with pet clusters. The need for a tool like this proves my point.

  • brianecox 7 days ago

    Sounds useful.

  • cjk 7 days ago

    I feel like the fact that this even needs to exist is a damning indictment of k8s.

    • mdaniel 7 days ago

      Oh, good to know that your environment never loses track of any cloud resource. Maybe apply to Netflix, since it seems they're still trying to solve that problem https://www.theregister.com/2024/12/18/netflix_aws_managemen... <https://news.ycombinator.com/item?id=42448541>

      • cjk 7 days ago

        Sure, I’ll grant you that at huge companies it’s probably easy to lose track of who is responsible for what and when resources ought to be cleaned up. But for small/medium-sized companies, using Terraform is sufficient. And while you can also use Terraform to manage resources on k8s, there’s a lot more friction to doing so than using YAML + `kubectl apply`. It’s far too easy to `kubectl apply` and create a bunch of garbage that you forget to clean up.
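
        (For the ad-hoc `kubectl apply` case there is at least a partial built-in mitigation: label what you apply and prune whatever disappears from the manifests. A sketch; the app=myapp label is a placeholder, the objects in manifests/ must carry it themselves, and --prune has long been flagged as alpha and is being superseded by apply sets, so dry-run it first.)

          # Sketch: apply a directory and delete previously-applied objects with this
          # label that are no longer present in the manifests.
          kubectl apply -f manifests/ -l app=myapp --prune --dry-run=server
          kubectl apply -f manifests/ -l app=myapp --prune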