How Cloudflare responded to the “Copy Fail” Linux vulnerability

(blog.cloudflare.com)

100 points | by mobeigi 2 days ago

85 comments

  • electra2012 2 days ago

    > Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line.

    Hopefully a wake-up call to those who believe older distro LTS kernels get all the security fixes, as Canonical and Red Hat would have you believe.

  • sammy2255 2 days ago

    Any Cloudflare employees reading this: your network map (https://www.cloudflare.com/network/) has a few PoPs missing, notably Perth (PER), Australia; Hobart (HBA), Australia; Wellington (WLG), New Zealand; Christchurch (CHC), New Zealand; and Nausori (SUV), Fiji.

    • antonvs a day ago

      Perhaps those aren’t Cloudflare PoPs, they’re attacker MitM/honeypots.

  • skinfaxi 2 days ago

    Would love to learn more about their internal behavioural detection program.

    > One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.

    • CGamesPlay 2 days ago

      Would certainly be interesting to learn more about. A simple check: allowlist of known "processes that run as root". Any new process shows up, something happened.
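
A minimal sketch of the allowlist idea above, assuming a Linux `/proc` tree and a hand-maintained set of expected executable paths. The names `Proc`, `scan_proc`, and `unexpected_root_procs` are hypothetical, and nothing here reflects how Cloudflare's detection actually works. The result is an indicator rather than proof, since the path a process reports can be misleading.

```python
import os
from typing import Iterable, NamedTuple


class Proc(NamedTuple):
    pid: int
    uid: int
    exe: str  # target of the /proc/<pid>/exe symlink


def unexpected_root_procs(procs: Iterable[Proc], allowlist: set[str]) -> list[Proc]:
    """Return root-owned processes whose executable path is not allowlisted."""
    return [p for p in procs if p.uid == 0 and p.exe not in allowlist]


def scan_proc(proc_root: str = "/proc") -> list[Proc]:
    """Best-effort snapshot of processes from a Linux-style /proc tree."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        path = os.path.join(proc_root, name)
        try:
            # Owner of the /proc/<pid> directory, normally the effective UID.
            uid = os.stat(path).st_uid
            exe = os.readlink(os.path.join(path, "exe"))
        except OSError:
            # Process exited, is a kernel thread, or we lack permission.
            continue
        found.append(Proc(int(name), uid, exe))
    return found
```

Usage would look like `unexpected_root_procs(scan_proc(), {"/usr/lib/systemd/systemd", "/usr/sbin/sshd"})`, run periodically, with any non-empty result raised as an alert.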

      • jeffbee 2 days ago

        Based on what? Proc title?

        • CGamesPlay 2 days ago

          Proc title is very easily forged (without root even). Obviously a real privileged process could modify the kernel and do whatever it wants, but if I were trying to detect this I would start with /proc/$id/exe.

          • Retr0id 2 days ago

            /proc/pid/exe is also easily forged, without root. For example you can do LD_PRELOAD=evil.so /bin/foo on any dynamic executable, or spawn /bin/foo unmodified and inject code via ptrace or /proc/pid/mem.

            I have a fileless, execless copyfail exploit that works by injecting shellcode directly into systemd's pid 1. (I should probably publish it at some point...)

            • jeffbee 2 days ago

              Yeah the whole system is based on the ability of one task to apparently become another task, that's how Unix works. So the indicators in /proc are just that: indicative at best.

              There's no reason the task should even be assumed to be executing code in a file. A process can map code into anonymous memory and continue executing there without even branching. Again this is considered a feature of the system rather than a flaw.
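
To make the point about anonymous memory concrete, here is a rough sketch that lists executable mappings with no backing file, assuming input in the Linux `/proc/<pid>/maps` text format. The function names are hypothetical. JIT runtimes create such regions legitimately, so a hit is only a hint.

```python
def anonymous_exec_regions(maps_text: str) -> list[str]:
    """Return address ranges of executable mappings that have no pathname.

    Expects lines in the /proc/<pid>/maps layout:
    address perms offset dev inode [pathname]
    """
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        addr, perms = fields[0], fields[1]
        pathname = fields[5].strip() if len(fields) > 5 else ""
        # Pseudo-paths such as [vdso] or [stack] count as named here.
        if "x" in perms and pathname == "":
            regions.append(addr)
    return regions


def read_maps(pid: int) -> str:
    """Read a live process's mappings (Linux only, needs permission)."""
    with open(f"/proc/{pid}/maps") as f:
        return f.read()
```

On a Linux host, `anonymous_exec_regions(read_maps(1))` would inspect PID 1, given sufficient privileges.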

          • jeffbee 2 days ago

            Maybe, but there's a prctl to change that reference which a root process can use.

        • dboreham 2 days ago

          They might just compute a hash over the binary, or the code space in memory.
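
A minimal sketch of the on-disk half of this idea, assuming a precomputed set of known-good SHA-256 digests. The helper names are hypothetical. It covers only the file contents, not the code space in memory, so code injected into an already running process would go unnoticed.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def binary_is_known(path: str, known_digests: set[str]) -> bool:
    """True if the file's digest is in the known-good set."""
    return sha256_of_file(path) in known_digests
```

On Linux, passing `f"/proc/{pid}/exe"` as `path` hashes the binary a process was started from, even if that file has since been deleted or replaced on disk.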

        • parliament32 2 days ago

          It's curious they're just "monitoring" rather than preventing.

          In a serious environment you'd run IPE with dm-verity/fs-verity to ensure binaries are whitelisted and integrity-checked at every execution.

          • staticassertion 2 days ago

            lol no one does that (edit: or, rather, that is extremely uncommon, even in "serious" environments, for a ton of reasons).

            • parliament32 2 days ago

              Look at the FedRAMP requirements around integrity protection, then look at how massive the list of compliant products is. I promise, pretty much everyone in regulated environments is. It's so prevalent Azure is even pushing a turnkey solution for k8s https://learn.microsoft.com/en-us/azure/aks/use-azure-linux-...

              • staticassertion 2 days ago

                Nothing about fedramp requires that you enable any of the features you're talking about. Linking to a public preview of an Azure product that doesn't even run with enforcement on is not great supporting evidence.

              • jeffbee 2 days ago

                If you have much experience with fedramp, and it sounds like you do, perhaps you might agree that it is a huge list of things that superficially indicate doing something, without actually doing anything. As the documentation for IPE freely admits, it has no protective benefits because it is unaware of anonymous executable regions.

                • parliament32 2 days ago

                  It sure has limitations, but "no protective benefits" is pretty wrong. In a real world example, if your containerized application has an RCE, you're preventing the attacker from executing binaries they tampered with or down/up-loaded. Combined with minimal distroless containers, it's a very effective attack surface reduction strategy, and works much better than the legacy scan-occasionally integrity-checking methods (rkhunter et al).

    • staticassertion 2 days ago

      Syscalls and kernel module loading can both be logged, I assume that's sufficient here.

      • skinfaxi 2 days ago

        Yes but I am interested in hearing about Cloudflare's implementation, how they scale it to their whole fleet, and what kinds of heuristics they are using to classify behavior as anomalous.

    • mobeigi 2 days ago

      I'd very much like to learn more about this too, deserves its own blog post.

  • srcreigh 2 days ago

    It’s fascinating that they already had a system which could identify the exploit at runtime. How can I learn more about that?

  • mkj 2 days ago

    If they're already running a custom Linux kernel build, why did they have AF_ALG enabled? Seems the perfect situation to limit features to only those actually being used.

    • computerfriend 2 days ago

      In the article they explain that some of their services use it.

      • mixdup 2 days ago

        And also as part of this, they have learned the lesson the parent comment is getting at: they called out that they are going to review their deployments and make sure there are no unused modules being deployed.

  • PunchyHamster 2 days ago

    for us it was

    * Get list of modules from Puppet's facts, confirm module isn't used anywhere (it wasn't)
    * `install algif_aead /bin/false` in /etc/modprobe.d/disable-algif.conf
    * Run a check using exploit code to check it is no longer working

    I imagine CF runs more stuff that could use it, but apparently it's not an often-used API.
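
The second and third steps above can be approximated without exploit code. This sketch parses modprobe configuration text for an `install` override and then probes whether an AF_ALG socket of type `aead` can still be bound. It is a lighter check swapped in for illustration, not the exploit-based test described above, and the function names are hypothetical. On a host where the module is not blocked, the bind may itself cause the kernel to autoload `algif_aead`.

```python
import socket


def module_install_blocked(conf_text: str, module: str) -> bool:
    """True if modprobe config text overrides loading `module` with a no-op
    command (an `install <module> /bin/false` or `/bin/true` line)."""
    wanted = module.replace("-", "_")  # modprobe treats - and _ alike
    for raw in conf_text.splitlines():
        parts = raw.split("#", 1)[0].split()  # drop comments, tokenize
        if len(parts) >= 3 and parts[0] == "install":
            if parts[1].replace("-", "_") == wanted and parts[2] in ("/bin/false", "/bin/true"):
                return True
    return False


def aead_socket_usable() -> bool:
    """Try to bind an AF_ALG socket of type "aead" (Linux only)."""
    if not hasattr(socket, "AF_ALG"):
        return False  # not Linux, or Python built without AF_ALG support
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as s:
            s.bind(("aead", "gcm(aes)"))
        return True
    except OSError:
        # Module missing or blocked, or socket creation denied.
        return False
```

After the mitigation, one would expect `module_install_blocked(...)` to return `True` for the config file's contents and `aead_socket_usable()` to return `False`, provided the module was not already loaded.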

  • tptacek 2 days ago

    This is an interesting post from Cloudflare, as usual, but it's not clear to me why they would have been vulnerable to CopyFail. Did I miss the point in this blog where that's addressed? What triggered the threat hunting and mitigation exploit? At what points in their architecture were they reliant on Linux user-based access control?

    • js2 a day ago

      They weren't vulnerable to it in anything but an academic sense. They call that out up front: "There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point."

      This was probably written by their security team. Security teams are paranoid. They want everything patched everywhere all at once at severity level zero. Also, PR. Also, also, if through some lack of imagination, this was somehow involved in an exploit of their services, it would look really really bad. So, CYA.

      • tptacek a day ago

        Yeah I think what I'm trying to clarify here is: are they doing a threat hunting exercise out of concern for multitenant exposures, or out of concern for internal privilege escalation?

        Cross-tenant would be very surprising! But I don't know enough about their architecture.

        It's weird, right? The underlying CNE primitive here, for CopyFail, is not novel. These happen all the time. Why the announcement? Is it just because CopyFail got so much attention?

        • fragmede a day ago

          I can upload arbitrary code to Cloudflare workers, which they run on their systems. It's sandboxed, but in the big bad Internet, if you were Cloudflare, how much would you really trust that sandbox?

          • js2 17 hours ago

            Let's say an attacker escapes the sandbox and gets a local non-root shell on the machine. At that point, how much more access does escaping to root gain the attacker? (This is a rhetorical question. Cloudflare doesn't say, which I think is the point of this line of questioning.)

            • fragmede 12 hours ago

              Not actually knowing anything about their architecture, but if you somehow gained root on a Cloudflare worker box, the attack that I'm sure they've designed the system against is for that attacker to then be able to steal the private keys for all the TLS traffic hitting that machine, and then exfiltrate all data going through it and also inject their own content to visitors.

              • js2 12 hours ago

                Why are you sure of that? I wouldn't design a critical system that relied on the difference between root and non-root accounts to protect private keys. I would design a system assuming the attacker can trivially escalate to root privilege. Because historically you just cannot rely on the difference. LPE attacks simply happen on too regular a basis.

          • tptacek a day ago

            It's not running with direct access to Linux kernel system calls, is it?

    • aduwah 2 days ago

      The whole IT industry is reliant on Linux user-based access controls, it is not a Cloudflare thing.

      Also leaving a massive gap like this behind would be a mistake on multiple levels. For example, it might get combined with another exploit that can achieve unprivileged access to some piece of metal, or you can have a disgruntled employee without admin access escalating their permissions on a box where they aren't supposed to see all the secrets.

      • TacticalCoder 2 days ago

        > For example, it might get combined with another exploit that can achieve unprivileged access ...

        Yeah. TFA mentions datacenters in 330 cities. That's a lot of Linux boxen. And many of those have, by definition, ports opened to the big bad Internet. These Linux servers are running services. They answer to ping, for a start. I even heard some are running DNS servers. Remote local exploits are a thing.

        What does Cloudflare prefer: that when the next remote local exploit surfaces, their whole fleet is one copy.fail away from privilege escalation to root, or that they get the time (given that they obviously have quite advanced detection measures in place) to detect the intruder before it gains root everywhere?

        It's Linux. It's datacenters in 330 cities. Linux powers the world and that's how things work.

        I, for one, am glad to have owned Cloudflare stock since right after the 2022 crash and, for two, I'm happy they don't leave their huge fleet of Linux servers with an unpatched exploit.

        • tptacek 2 days ago

          I'm not asking why they'd need to go threat-hunting if there was an ICMP kernel RCE in Linux. CopyFail requires someone untrusted running shell commands somewhere. Where is that exposure in their architecture?

          I'm asking because I don't think they have such an exposure.

          • Yoric 2 days ago

            At the very least, Cloudflare hosts web workers, which let a customer execute more-or-less arbitrary wasm code on their servers. If there's an exploit that lets you escape the wasm sandbox, copy.fail can be chained into (afaiu) an exploit against the Linux host. That's a pretty big risk.

            Also, Cloudflare hosts some AI services, so it's possible that some consumers are running Python code in their containers, without the wasm sandbox.

            • tptacek 2 days ago

              If there's a direct link from Cloudflare workers / WASM to uid=nobody execve or arbitrary syscalls on their hosts, they're already fucked, so I don't think that's true.

              • HDBaseT a day ago

                I don't understand your point.

                You seem so pressed on the question "why would they even patch this!!!" Maybe because it's best practice to patch things? You never know what things could be chained together, so you might as well patch this, given it's so obviously bad.

                • js2 a day ago

                  That's a straw man and not what he asked. Literally, he asked: "why they would have been vulnerable to CopyFail?"

                  I've been a sysadmin/programmer since the mid-90s. Local root exploits are a dime a dozen. If your infrastructure relies upon the tenuous difference between root and non-root accounts, you've already lost. Cloudflare isn't an ISP handing out shell accounts on Unix machines.

                  So again, yes, of course you should patch your Linux machines. Defense in depth and all that. But the question remains: "why Cloudflare would have been vulnerable to CopyFail?" (in anything but an academic sense). Because I do not believe that they can possibly be relying on the difference between root and non-root account.

                  • HDBaseT a day ago

                    I don't care about your credentials. It doesn't take a genius to realize that having known major security holes is not ideal.

                    It is pretty clear they aren't too concerned about this being an issue for their business, given the first paragraph in bold on the blog:

                    "There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point. Read on to learn how our preparedness paid off."

                    As mentioned, you never want to give options to a potential attacker/exploit by keeping known vulnerabilities present in your system. You cannot always predict every single avenue an attack could leverage.

                    Imagine having a data center with barbed wire fences, guard posts, security and cameras covering every square meter of the facility. You wouldn't just leave a door right open because in theory, people shouldn't be able to walk right in. But why would you willingly leave a door open? Even if the possibility is 0.000001%?

                    People like you would be the first to turn and say "Cloudflare are morons for not patching this!!! Me and my 1 billion years experience and goat status would have prevented this" when some major Cloudflare hack occurs and it was found that phishing 30 different people and using 9 different exploits (including Copyfail) allowed the attacker to bring down Cloudflare.

                  • saurik a day ago

                    I mean, in some sense, Cloudflare simply accepts the security posture of "already lost", right? They run workloads for multiple users within the same process separated by nothing more than V8 boundaries, which even Chrome (which always claimed to run tabs in separate processes but actually didn't due to various edge cases) finally stopped doing (now afaik they do fence origins within processes) as it was so risky... Cloudflare's best lines of defense past "we patch often" are merely that they sort of KYC at least most of their users so they can log everything they run with their identity and that they take users of similar trust levels (age of account, level of KYC, amount of usage, etc.) and group those into processes... but, at the end of the day, they rely on something that I would certainly never consider reasonable to ship in production.

                    • js2 17 hours ago

                      > They run workloads for multiple users within the same process

                      Ah, then the root/non-root distinction means even less. They don't even distinguish between non-root accounts! Again, I'm not arguing against them defensively patching their systems against known exploits (they'd be crazy not to); just agreeing with Thomas that they can't be relying upon protecting root from non-root accounts as part of a normal operational security boundary.

                      To wit: if an attacker escapes V8, it's unclear that leveraging "Copy Fail" to escape from non-root to root buys the attacker a whole lot more.

    • robotbikes 2 days ago

      I would assume it was about protecting their servers from internal sources escalating privileges vs. them providing publicly accessible Linux shells.

      • tptacek 2 days ago

        I mean, that's a real project, but Linux LPEs kind of grow on trees, so you can't literally rely on threat intelligence for this problem; presumably you handle it by drastically scoping down and surveilling what people do on prod hosts.

  • cluckindan 2 days ago

    Has anyone figured out whether this CVE was intentional?

  • jmclnx 2 days ago

    > Linux kernel build based on the community's Long-Term Support (LTS)

    CopyFail only highlights why companies want LTS. If there was a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.

    The corporate mindset is usually "never upgrade unless there is new hardware needed or critical software failure". All CopyFail did was reinforce that mindset.

    I wonder if CopyFail will cause enterprises to put pressure on the Linux Foundation to maintain an "ultra LTS" where it is supported for 20 years?

    • PunchyHamster 2 days ago

      > CopyFail only highlights why companies want LTS. If there was a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.

      Sadly that's not really how it works for, say, Red Hat. They routinely backport features while keeping whatever "stable" version number on the kernel. We even had the displeasure of them backporting a bug... the same bug to 2 different RHEL versions.

    • tempest_ 2 days ago

      The longer you wait the more painful the switch will eventually be.

      • em-bee 2 days ago

        for the kernel? hardly. only if the kernel breaks userspace. which it shouldn't.

  • dboreham 2 days ago

    The "Hunting for Exploitation" section is unclear to me: "The exploit leaves a distinctive trace in kernel logs when it runs." Hmm. Wouldn't a system with a compromised kernel also log exactly what the attacker wanted logged?

    • cube00 2 days ago

      I guess the hope is the kernel has been able to successfully transmit that log message to the immutable central logging infra before it gets compromised.

      Although given the tendency for end point logging agents to run on buffers to reduce their network chattiness I do wonder if a fast acting exploit could dump that buffer before it manages to be transmitted.

      I don't think any of the agents are complex enough to immediately transmit permission elevation log messages over the regular background noise.
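
A toy sketch of what "transmit urgent messages immediately" could look like in a buffering log forwarder. Everything here is hypothetical (the class, the `urgent` predicate, the `send` callback) and is not modeled on any real logging agent. It only illustrates the trade-off between batching and flushing early.

```python
from typing import Callable


class BufferedForwarder:
    """Batches log lines, but flushes at once when a line is marked urgent."""

    def __init__(
        self,
        send: Callable[[list[str]], None],
        batch_size: int = 100,
        urgent: Callable[[str], bool] = lambda line: False,
    ) -> None:
        self._send = send            # ships one batch to the remote collector
        self._batch_size = batch_size
        self._urgent = urgent        # predicate for lines that bypass batching
        self._buffer: list[str] = []

    def log(self, line: str) -> None:
        self._buffer.append(line)
        if self._urgent(line) or len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)

    def pending(self) -> int:
        """Lines still held locally, i.e. what an attacker could drop."""
        return len(self._buffer)
```

With `urgent=lambda line: "uid=0" in line`, a line containing that marker ships together with everything buffered before it, so the window in which local tampering can hide it shrinks to the time of a single send.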

    • QuantumNoodle 2 days ago

      Also, 48 hours prior to the disclosure is a very narrow window. I wonder if their logs don't go back further or if there was another reason to look back only two days.

    • rithdmc 2 days ago

      The attack itself creates the logs, which - reading between the lines - are shipped to a central log server. A compromised server might not send any new indicators to the logs, but existing logs moved off device would still be available.

      I'd like to know what those distinctive traces are, which is also missing :(

    • PunchyHamster 2 days ago

      Your exploit would have to get root and kill/exploit the logging daemon near instantly, else the log will already be sent to remote before you can change it locally

  • john_strinlai 2 days ago

    this is a technical dive into how cloudflare responded, not a confirmation that they responded

    for whatever reason, unknown to me, hn automatically strips "how" from the start of titles. i cant remember ever seeing a title where this was an improvement.

    • dang 2 days ago

      Of course you can't, because the cases it improves don't get noticed, while the ones that break stick out like sore thumbs.

      • john_strinlai 12 hours ago

        i mean... its pretty easy to tell in either direction because i read the article titles when i click on them. given the rule about matching the article title, any discrepancy is noticeable.

        but its your world dang, we're just living in it. do whatever you want with the titles. you have previously made your position clear to me about receiving feedback on hn; im not under any illusions about the value of my opinion.

        • dang 12 hours ago

          > but its your world dang, we're just living in it. do whatever you want with the titles. you have previously made your position clear to me about receiving feedback on hn; im not under any illusions about the value of my opinion.

          Is that how you felt about https://news.ycombinator.com/item?id=47328465? I can't find any other post that you might be referring to.

          • john_strinlai 10 hours ago

            no, its how i felt from a few different emails i have sent, with one of the more recent ones having what i felt was a pretty off-putting reply. but there is really no need to hash it out. in the grand scheme, i understand the approach you take, despite feeling frustration over it. you've got thousands of people offering their opinions, all of them thinking they are correct. my last comment should have probably been one that i wrote in a notepad and erased afterwards, rather than one that i posted. sorry.

    • gamegoblin 2 days ago

      I learned a few years ago that HN also editorializes by dropping "world's" from titles

      Before: Teens break record for world's longest kickball game

      After: Teens break record for longest kickball game

      • Velocifyer 2 days ago

        I do actually agree with that change.

        • gamegoblin 2 days ago

          It occasionally leads to kinda ambiguous headlines, e.g.

          "China opens world's longest undersea tunnel"

          vs

          "China opens longest undersea tunnel"

          It's a little unclear if it's the longest undersea tunnel in the world, or just in China

        • jmalicki 2 days ago

          It doesn't give enough recognition to the true longest game of space kickball.

      • buredoranna 2 days ago

        ... what a world.

    • varun_ch 2 days ago

      I'm yet to see a good example of the title stripping, at least for "how" and "how to" (although perhaps this is survivorship bias).

    • dpoloncsak 2 days ago

      Interestingly, there's a current post on the front page with "How" at the start of the title.

      > https://news.ycombinator.com/item?id=48018715 "How do I inform Windows that I’m writing a binary file?"

      I wonder if it ending in a '?' has anything to do with it?

      edit: Upon review, at the time of posting it was actually on the 2nd page

      • john_strinlai 2 days ago

        not sure about that specific case or if '?' has anything to do with it, but there is a short editing window where the submitter can re-add the "how" or whatever back in

      • GavinAnderegg 2 days ago

        I’ve been hit by this when posting links. If you edit the post, you can re-add the stripped word and it will stay. “Why” is another that is often stripped.

    • trollbridge 2 days ago

      Starting a title with “How” is standard clickbait.

      • gilrain 2 days ago

        Starting a sentence with “How” is standard English, too.

        • trollbridge 19 hours ago

          Much of clickbait is standard English. HN takes a policy of applying editorial discretion to headlines, which makes the site more valuable.

      • Goronmon 2 days ago

        If we are taking that attitude why not go all the way?

        Titles are standard clickbait.

        • miki123211 2 days ago

          With LLMs, you could actually do anti-clickbait titles. Extract the article text with something like r.jina.ai, and ask an LLM to generate a ~80-character summary that explains the main point of the article for people too busy to read it.

          I do think this would genuinely be useful.

          • senko 2 days ago

            You're absolutely right! (errm...oops....anyways...)

            The fact that LLMs usually generate anodyne summaries is actually a benefit here.

            I used my website-to-markdown tool[0] to get the text, piped the output to claude -p and got a pretty decent "Patching Copy Fail at scale: how bpf-lsm bought us time before the kernel reboot" result.

            [0] https://markshot.dev

          • john_strinlai 2 days ago

            back in my day, people just used the thing that rattles around inside their skull for such tasks

            • senko 2 days ago

              To do that, you need to read the article first, which is the point of click-bait titles. The point of the defense is to avoid exposing your neurons to that stuff.

              • john_strinlai 2 days ago

                i would hope that people are reading articles first and submitting them to hn because they are interesting, rather than submitting articles to hn blindly.

                • senko 2 days ago

                  I agree with you on that, but that just holds true (we hope) for the OP.

                  HN already editorializes the title, to help everyone other than the OP (not all people agree over what's interesting to them). Now we're just arguing over the degree.

  • cube00 2 days ago

    > At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version

    That could be as low as 50.1%, I wish they'd provide an actual percentage.