Eliminating Cold Starts 2: shard and conquer

(blog.cloudflare.com)

61 points | by cmsparks 5 days ago

18 comments

  • Havoc 2 days ago

    Surprised none of the big clouds have duplicated the handshake delay thing yet - CF is noticeably better at cold starts than the rest for small scripts. I guess the big cloud functions are aimed at bigger workloads, perhaps.

    • no_wizard 2 days ago

      I never understood this either, even if, due to constraints, it had to be a different product line.

      I thought Lambda@Edge was going in this direction, but it's just a slightly faster, more constrained version of Lambda with all the same potential downsides.

  • smacker 2 days ago

    While I really appreciate the Workers platform, the "eliminated cold starts" advertising has always bothered me.

    This is a curl request from my machine right now to an SSR React app hosted on a CF Worker:

    ```
    DNS lookup:     0.296826s
    Connect:        0.320031s
    Start transfer: 2.710684s
    Total:          2.710969s
    ```

    Second request:

    ```
    DNS lookup:     0.002970s
    Connect:        0.015917s
    Start transfer: 0.176399s
    Total:          0.176621s
    ```

    A 2.5-second difference.

    • kentonv 2 days ago

      Does this app make any network requests that might have their own cold start or caching effects?

      2.5 seconds seems way too long to be attributed to the Worker cold start alone.

      • smacker 2 days ago

        It makes requests to an API server deployed to k8s, which doesn't have a cold start. Clearly, some caching by the runtime and framework is involved here.

        My point is that a "cold start" is often more than just booting a VM instance.

        And I've noticed not everybody understands that. I used to have conversations in which people argued that there is no difference between deploying a web frontend to Cloudflare vs a stateful solution, because of this confusing advertising.

        • 0x696C6961 2 days ago

          Likely reusing HTTP keep-alive connections.

    • omk 2 days ago

      I'm not well versed in curl's design, but curious - is your first connection handling the TLS handshake while the second relies on the previously established session?

      • smacker 2 days ago

        I'm not very well versed in curl's design either, but AFAIK it does reuse connections - only inside the same process, though (e.g. downloading 10 files with 1 command). In this case it shouldn't be reusing them, as I ran 2 different commands. I should have included the TLS handshake time in the output, though. You can see it here (overall time is lower because I hit a preview env that is slightly different from staging/prod):

        First hit:

        ```
        DNS lookup:        0.026284s
        Connect (TCP):     0.036498s
        App connect (TLS): 0.059136s
        Start transfer:    1.282819s
        Total:             1.282928s
        ```

        Second hit:

        ```
        DNS lookup:        0.003575s
        Connect (TCP):     0.016697s
        App connect (TLS): 0.032679s
        Start transfer:    0.242647s
        Total:             0.242733s
        ```

        Metrics description:

        time_namelookup: The time, in seconds, it took from the start until the name resolving was completed.

        time_connect: The time, in seconds, it took from the start until the TCP connect to the remote host (or proxy) was completed.

        time_appconnect: The time, in seconds, it took from the start until the SSL/SSH/etc connect/handshake to the remote host was completed.

        time_starttransfer: The time, in seconds, it took from the start until the first byte was just about to be transferred. This includes time_pretransfer and also the time the server needed to calculate the result.
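
        For anyone who wants to reproduce timings like these, here's a sketch using curl's documented `-w` write-out variables (the hostname is a placeholder - substitute your own Worker URL):

        ```shell
        # Format string built from curl's -w write-out variables.
        FMT='DNS lookup:        %{time_namelookup}s
        Connect (TCP):     %{time_connect}s
        App connect (TLS): %{time_appconnect}s
        Start transfer:    %{time_starttransfer}s
        Total:             %{time_total}s
        '
        # Run twice as two separate processes, as above, so no connection
        # is reused between the cold and warm hits:
        # curl -s -o /dev/null -w "$FMT" https://example.workers.dev
        # curl -s -o /dev/null -w "$FMT" https://example.workers.dev
        ```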

      • swiftcoder 2 days ago

        TLS handshakes (outside of embedded hardware) should be measured in milliseconds, not seconds

        Edit: you can kind of tell this from the connect timings listed above. TLS is faster the second time around, but not enough to make much difference to the overall speedup

        • scottlamb 2 days ago

          I think you're right that TLS doesn't explain the difference shown above, but for completeness: TLS 1.3 can reduce the round trips from 3 to 2 on session resumption. [1] Depending on your Internet connection, that could be a lot more than milliseconds. I don't think `curl` uses it by default though.

          [1] https://blog.cloudflare.com/introducing-0-rtt/
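
          For what it's worth, session resumption is easy to observe with `openssl s_client` - a sketch, with a placeholder hostname (the resumed connection reports "Reused" in its session summary):

          ```shell
          HOST=example.workers.dev  # placeholder - substitute your own host
          # First connection: full handshake, save the session ticket to a file.
          # openssl s_client -connect "$HOST:443" -sess_out /tmp/sess.pem </dev/null
          # Second connection: resume from the saved ticket; the output's
          # session summary should say "Reused" instead of "New".
          # openssl s_client -connect "$HOST:443" -sess_in /tmp/sess.pem </dev/null
          ```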

    • samschooler 2 days ago

      I would say Cloudflare "eliminated cold starts" in the sense of bringing the server online, not in the sense of rendering + caching the SSR page.

  • bluelightning2k 2 days ago

    I love Cloudflare's engineering writeups & the Workers platform.

    • jtbaker 2 days ago

      Same. Really wish they had a managed k8s offering to complement it though. Some workloads just don't fit into the workers paradigm.

      • arcfour 2 days ago

        Containers+Workers is now a thing, at least.

        It's very much still maturing as an offering. But it does exist!

        • jtbaker 2 days ago

          Yeah, I tinkered with it when it first rolled out. The dev experience and performance left a lot to be desired: poor visibility into the status of containers, weird phantom bugs, having to learn new platform idiosyncrasies. Just embrace k8s - roll it up into a wrapper that runs on GCP or whatever under the hood, which would let people who need more control over the infra adopt the platform.

          • arcfour 2 days ago

            I agree there's a lot of room for improvement, but it's definitely worth keeping an eye on; the experience of Workers now vs when I started using them 2ish years ago is massively improved. They definitely like to ship their MVPs, but CF is pretty good about actually improving them IME.

  • candiddevmike 2 days ago

    Maybe a hot take, but if you're that concerned about cold starts, you probably shouldn't use a service that scales to zero. If your service is really not used that heavily, it's like $5-10/month for an instance that can handle more than your traffic needs 24/7 - hell, you can even host other similarly unused services on it!

    • stackskipton 2 days ago

      Because the attraction of Workers/Lambdas/Functions is the whole "write a small amount of code and pay pennies to run it." The downsides are cold starts, the knots you twist yourself into at scale to make them work, and vendor lock-in.

      Once you start saying "if you're using this in production, $5-10/month plus transactions is a real cost you need to pay," well, now the cost is about the same as deploying a fly.io shared-CPU container, which doesn't come with cold starts or vendor lock-in and can run as long as you want. Cloudflare knows that, so they don't want to introduce that charge or even talk about it.