Ask HN: What do SREs do at your company?

5 points | by petemc_ 13 hours ago

8 comments

  • TurboHaskal an hour ago

    I have worked at several places where SRE meant different things:

    - A virtual, first responder team on call rotation for all the services because developers didn't want to be on call for their own stuff.

    - Another name for a platform engineering / DevOps team.

    - A team that built nothing, maintained nothing and was not paged for anything. They just collected metrics, built dashboards and spread the gospel of SRE with a big focus on form over function. IME they were annoying to work with, frequently missed the point, especially when it came to stateful workloads, and were ultimately just an incident factory.

  • coldfloor 11 hours ago

    I was an SRE at Yahoo until around the end of 2024. Not sure if things have changed - last I heard my former team had been laid off - but when I was there it was pretty easy. We had three tiers in the org, with increasing specificity and expertise: Operations Center -> SRE -> Product Engineers.

    The OC collectively monitored everything across the company. Each alert that paged had an associated runbook. If they couldn't clear the alert with the runbook, they'd escalate to the SRE responsible for the alerting server/component. Our job was essentially to fix anything that broke that OC couldn't solve. For my domain this often just came down to basic Linux troubleshooting, but sometimes would actually involve specific knowledge about our component. For others (e.g. networking) I imagine the ratio of domain-specific-knowledge problems was higher.

    If we determined something was fundamentally broken, like someone pushed an update and now the service won't start, we'd escalate that to PE. PE did a lot of what I think falls under SRE purview at other places: Managing deployments, building out infrastructure, etc. At Yahoo we were really just "tier 2 ops."

    We'd also be paged for outages if our service went down or another team was blaming our service for their outage. The job here was essentially the same thing, just with more pressure and people yelling at you, or with you arguing and trying to prove your stuff was working and that they should find someone else to blame. If we were involved in an outage, we'd also have to join the "post mortem" (I'll never be able to say that without air quotes) and help with RCA/take on remediation tasks.

    Secondarily, we created the monitoring/alerts that went to OC and wrote and maintained their runbooks. In our downtime we were also supposed to do simple automation/scripting to help us or OC with repetitive tasks. Sometimes I think I made useful stuff, but often this felt like self-imposed busy work, because we always - especially under Marissa's stack ranking regime - had to demonstrate that we were doing more than just our job. I swear one quarter between us and OC we ended up with like 10 redundant Slack bots because everyone was rushing to make something to pad their review with.

  • natyoung 10 hours ago

    Call APIs that 3rd party vendors provide. Talk about AI, because AI. Be silo.

  • TimXare 6 hours ago

    Mostly turning unknown unknowns into known incidents.

  • VirusNewbie 12 hours ago

    I'm a SWE SRE at Google. That means we had to do a SWE interview with an emphasis on system design.

    So I'm expected to be able to handle operations while on call, and also to do RCA and implement fixes and changes that make the systems our team is responsible for more reliable.

    We're able to throttle the release cadence of binaries, so we work together with dev teams (SWEs who develop features) to come up with appropriate monitoring, metrics, mitigation, and scaling capabilities.

    Some SREs are not SWE SREs; they usually have a specialty related to the team they're on, such as networking, low-level Linux internals, etc. They're still expected to be able to write production-level Python/Go code.

    They are more likely to send a bug to the devs rather than fix it themselves, whereas I will often (but not always) just go right in and send a CL to the devs fixing or optimizing something.

    • petemc_ 11 hours ago

      Hey VirusNewbie, thank you for your response! A few more questions if you don't mind.

      Did you read the SRE handbook before applying?

      How do you decide who gets alerts (or are devs never on call)?

      • VirusNewbie 9 hours ago

        >Did you read the SRE handbook before applying?

        Yes, but it wasn't really necessary. Google isn't trying to hire SREs; they're trying to hire SWEs and specialists who they can turn into SREs, if that makes sense. Though I'm sure SRE experience elsewhere helps, they are still expecting SWE-level coding, algorithms, etc.

        >How do you decide who gets alerts

        SREs will be assigned to a team that has a rapidly growing service or essential service (or both), and they will be the primary team that takes the alerts, while the developers have a slower SLO (though still oncall rotations for them).

  • decatur 12 hours ago

    Produce hot air, check boxes, and 'see, I told you so'