SRE becomes the most critical layer because it's the only discipline focused on 'does this actually run reliably?' rather than 'did we ship the feature?'. We're moving from a world of 'crafting logic' to 'managing logic flows'.
They are very different. If your SREs are spending much of their time chasing fires, they are doing it wrong.
But how much of current-day software complexity is inherent in the problem space vs. just bad design and too many (human) chefs in the kitchen? I'm guessing most of it is the latter.
We might get more software but with less complexity overall, assuming LLMs become good enough.
Would we say that we as humans have captured the "best" way to reduce complexity and write great code? Maybe there are patterns and guidelines, but no hard and fast rules. Until we have a better understanding around that, LLMs may not arrive at those levels either. Most of that knowledge is gleaned by sticking with a system - dealing with past choices and the changes and tweaks the code, its complexity, and the solution require over time. Maybe the right "memory" or compaction could help LLMs get better over time, but we're just scratching the surface there today.
LLMs output code only as good as their training data. They can reason about the parts of code they're prompted with and offer ideas, but they're inherently bound to the data and concepts they've trained on. And unfortunately, it's likely much more average code than highly respected code that floods the training data, at least for now.
Ideally I'd love to see better code written and complexity driven down by _whatever_ writes the code. But there will always be verification required when the writer is probabilistic.
SREs usually don't know the first thing about whether particular logic within the product is working according to a particular set of business requirements. That's just not their role.
Without that it's impossible to correctly prioritise your work.
In a well-run org, the software engineers (along with QA if you have them) are responsible for validation of requirements.
Mature SRE teams get involved with the development of systems before they've even launched, to ensure that they have reliability and supportability baked in from the start, rather than shoddily retrofitted.
And they drive the cost of validating the correctness of such code towards infinity...
There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.
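For a sense of how little that can be: a hypothetical Cloud Run service spec (Knative-style YAML) for exactly this shape of shop - one containerized monolith on a managed platform. The service and image names are placeholders.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-monolith
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/monolith:latest  # placeholder image
          ports:
            - containerPort: 8080
```

Hook that up to the platform's built-in logging/alerting and you're most of the way there.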
Then you have large shops: the ones running at a scale where the cost premium of container platforms is higher than the salary of an engineer to move you off them. The ones where you have to:

- get the systems from different companies pre-M&A to talk to each other;
- constrain N development teams by SLAs signed by sales and legal teams that sit organizationally far away from them;
- figure out which band-aids to throw at a system architected for X scale after the business has sold 100X of it, while telling the devs they need to re-architect;
- build your Alertmanager routing tree configuration dynamically, because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager;
- make sure devs can self-service the creation of new services;
- progressively roll out new alerts across the organization;

and so on. Even Alertmanager config needs to be owned by an engineer.
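On the dynamic Alertmanager point, a minimal sketch of what that means in practice - the team names, receivers, and pager flag are all made up for illustration; real generators are much hairier:

```python
# Generate an Alertmanager routing tree from code instead of
# hand-maintained YAML. Inputs here are invented for illustration.
import yaml  # pip install pyyaml

def build_routes(team_names, sre_holds_pager: bool):
    routes = []
    for name in team_names:
        routes.append({
            "matchers": [f'team="{name}"'],
            # While SRE holds the pager, everything routes to SRE;
            # once the pager is "returned", teams page themselves.
            "receiver": "sre-pagerduty" if sre_holds_pager else f"{name}-pagerduty",
        })
    return {"route": {"receiver": "default", "routes": routes}}

print(yaml.safe_dump(
    build_routes(["checkout", "payments"], sre_holds_pager=False),
    sort_keys=False,
))
```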
I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.
According to the specified goals of SRE, this is actually not just a small fraction - it's something that shouldn't happen at all. To be clear, I'm fully aware that it will always be necessary sometimes - but whenever it happens, it's because the site reliability engineer (SRE) overlooked something.
Hence, if that's considered a large part of the job... then you're just not an SRE as Google defined the role:
https://sre.google/sre-book/table-of-contents/
Very little connection to the blog post we're commenting on though - at least as far as I can tell.
At least I didn't find any focus on debugging there. The post puts forward that the capability to produce reliable software is what will be the differentiator in the future, and I think that holds up and is in line with the official definition of SRE.
Still not a great rendition of this thought, but closer.
what do you mean "progressive rollout of new alerts across the organization"? what kind of alerts?
If you're deploying alerts, then yeah you want a progressive rollout just like anything else, or you run the risk of alert fatigue from false positives, which is Really Bad because it undermines faith in the alerting system.
For example, say you want to start tracking, per team, how many code quality issues they have, and set thresholds above which they get alerted. The alert makes a Jira ticket - getting code quality under control is the kind of work that can wait to be scheduled into a sprint. You probably need different alert thresholds for different teams, and you want to test the waters before you start having Alertmanager create real Jira issues. So, yeah, progressive rollout.
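A rough sketch of what the first rollout stage might look like - metric name, threshold, and labels are all invented; the point is the per-team threshold plus a severity that routes to a log-only receiver until the alert graduates to making real Jira tickets:

```yaml
# Hypothetical Prometheus rule for the code-quality example above.
groups:
  - name: code-quality
    rules:
      - alert: CodeQualityIssuesHigh
        expr: code_quality_open_issues{team="checkout"} > 50
        for: 24h
        labels:
          team: checkout
          severity: dry-run   # flip to "ticket" once the threshold is trusted
        annotations:
          summary: "checkout has {{ $value }} open code-quality issues"
```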
Kubernetes is a huge problem. IMO it's a shitty prototype that industry ran away with (because Google tried to throw a wrench at Docker/AWS when containers and cloud were the hot new things, pretending Kubernetes is basically the same as Borg). Then the community calcified around the prototype state, bought all this SaaS, and structured their production environments around it, and now all these SaaS providers and Platform Engineers/DevOps people who make a living off of milking money out of Kubernetes users are guarding their gold mines.
Part of the K8s marketing push was rebranding Infrastructure Engineering = building atop Kubernetes (vs operating at the layers at and beneath it), and K8s leaks abstractions/exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to slackbot/email/2FA to our release scripts.
At my new company we're working on fixing this, but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes - but we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions around scaling and state (including the entire control plane). Also, more official tooling - and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.
In your example, I think
1. you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better
2. YAML is indeed garbage, but availability reporting and alerting need better official support; it doesn't make sense for every ecommerce shop and bank to build this stuff
3. a huge amount of alerts and configs could actually be expressed in business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run.
If you think about it, so so so many problems devops teams deal with are literally just
1. We need to be able to handle scaling events
2. We need to control costs
3. Sometimes these conflict and we struggle to translate between the two.
4. Nobody lets me set hard billing limits/enforcement at the platform level.
(I implemented enforcement for something close to this for Run/App Engine/Functions; it truly is a very difficult problem, but I do think it's possible. Real-time usage -> billing -> balance debits was one of the first things we implemented on our platform - rough sketch after this list.)
5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant)
6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.
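To make point 4 concrete, here's a toy sketch of a hard billing limit enforced in real time. Everything in it (the rates, the `scale_to_zero` hook) is hypothetical; real usage -> billing pipelines are far messier, which is exactly what makes this hard.

```python
# Hard billing limit: debit a prepaid balance as resources burn,
# and scale to zero instead of issuing a surprise invoice.
import time

class TenantBudget:
    def __init__(self, balance: float, rate_per_sec: float):
        self.balance = balance            # prepaid credit, in dollars
        self.rate_per_sec = rate_per_sec  # burn rate of provisioned resources

    def debit(self, elapsed: float) -> bool:
        """Debit usage; return False once the balance is exhausted."""
        self.balance -= self.rate_per_sec * elapsed
        return self.balance > 0

def enforce(budget: TenantBudget, scale_to_zero):
    last = time.monotonic()
    while True:
        time.sleep(1)
        now = time.monotonic()
        if not budget.debit(now - last):
            scale_to_zero()  # hard stop at the platform level
            return
        last = now
```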
You gotta go under the hood to fix this stuff.
To me this shows that Cloud Run is more of an end product than a building block, and that hinders adoption: we'd basically need to replicate most of Cloud Run ourselves just to add the tiny bit of also running our sidecar.
How do you see this going in your new solution?
I'm not exactly sure what this means; a few different interpretations make sense to me. If this is purely a Run <-> other-GCP-product-in-a-VPC problem, I'm not sure how much info about that is considered proprietary and shareable, or whether my understanding of it is even accurate anymore. If it's that Cloud Run can't run in your service mesh, then it's just that these are both managed services. But yes, I do think it's possible to run into a situation/configuration that is impossible to express in Run but doesn't seem like it should be.
This is why designing around multitenancy is important. I think with hierarchical namespacing and a transparent resource model you could offer better escape hatches for integrating managed services/products that don't know how to talk to each other. Even though your project may be a single "tenant", because these managed services are probably implemented in different ways under the hood and have opaque resource models (ie run doesn't fully expose all underlying primitives), they end up basically being multitenant relative to each other.
That being said, I don't see why you couldn't use mTLS to talk to Cloud Run instances, you just might have to implement it differently from how you're doing it elsewhere? This almost just sounds like a shortcoming of your service mesh implementation that it doesn't bundle something exposing run-like semantics by default (which is basically what we're doing), because why would it know how to talk to a proprietary third party managed service?
Managed k8s services like EKS have been super reliable the last few years.
YAML is fine, it's just a configuration language.
> you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better
I'm not sure what you mean here. Managed k8s services, and even k8s clusters you deploy yourself, can autoscale across AZs. This has been a feature for many years now. You just set a topology key on your pod template spec and your pods will spread across the AZs - easy.
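For reference, the relevant pod template snippet (the `app: my-app` label is a placeholder):

```yaml
# Spread replicas across availability zones using the standard
# well-known node label as the topology key.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
```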
For most tasks you'd want to do to deploy an application, there's an out-of-the-box solution for k8s that already exists. Millions of labor-hours have been poured into k8s as a platform; unless you have some extremely niche use case, you are wasting your time building an alternative.
I will just say, based on recent experience, that the fix is not "Kubernetes bad"; it's that Kubernetes is not a product platform. It's a substrate, and most orgs actually want a platform.
We recently ripped out a barebones Kubernetes product (like Rancher, but not Rancher). It was hosting a lot of our software development apps like GitLab, Nexus, Keycloak, etc.
But in order to run those things, you have to build an entire platform and wire it all together. This is on premises running on vxRail.
We ended up discovering that our company had an internal software development platform based on EKS-A and it comes with auto installers with all the apps and includes ArgoCD to maintain state and orchestrate new deployments.
The previous team did a shitty job DIY-ing the prior platform. So we switched to something more maintainable.
If someone made a product like that then I am sure a lot of people would buy it.
This is one of the things that excites me about TigerBeetle; the reason why so much billing by cloud providers is reported only on an hourly granularity at best is because the underlying systems are running batch jobs to calculate final billed sums. Having a billing database that is efficient enough to keep up with real-time is a game-changer and we've barely scratched the surface of what it makes possible.
This feels like a single-tenant, centralized ACH, but I think what you actually want for a multitenant, multizonal cloud platform is not ACH but something more capability-based. The problem is that cloud resources are billed as subscriptions/rates, and you can't centralize anything on the hot path (like this does), because then any availability problem on that node causes a lack of availability for everything else. Also, the business logic for computing a cloud customer's actual final bill is quite complex, because it depends on so many different kinds of things, including pricing models that can get very complex or bespoke, and it doesn't seem like TigerBeetle wants calculating prices to be part of its transactions (I think)
The way we're modelling this is with hierarchical sub-ledgers (eg per-zone, per-tenant, per-resourcegroup) and something which you could think of as a line of credit. In my opinion the pricing and resource modelling + integration with the billing tx are much more challenging because they need to be able to handle a lot of business logic. Anyway, if someone chooses to opt-in to invoice billing there's an escape hatch and way for us to handle things we can't express yet.
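A toy sketch of the hierarchical sub-ledger idea, to make it concrete - the names and the credit-line check are invented, and a real implementation would live in the ledger itself rather than application code:

```python
# Debits roll up from resource group -> tenant -> zone; a debit is
# rejected if any level of the hierarchy would exceed its credit line.
class Ledger:
    def __init__(self, name: str, credit_limit: float, parent=None):
        self.name, self.credit_limit, self.parent = name, credit_limit, parent
        self.balance = 0.0

    def debit(self, amount: float) -> bool:
        node = self
        while node:  # first pass: check every ancestor's credit line
            if node.balance + amount > node.credit_limit:
                return False
            node = node.parent
        node = self
        while node:  # second pass: apply the debit all the way up
            node.balance += amount
            node = node.parent
        return True

zone = Ledger("us-east1", credit_limit=10_000)
tenant = Ledger("acme", credit_limit=1_000, parent=zone)
group = Ledger("acme/web", credit_limit=200, parent=tenant)
group.debit(5.0)  # charged against acme/web, acme, and us-east1 at once
```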
The incentives are just really messed up all around. Think about all the actual people working in devops who have their careers/job tied to Kubernetes, and how many developers get drawn in by the allure and marketing because it lets them work on more fun problems than their actual job, and all the provisioned instances and vendor software and certs and conferences, and all the money that represents.
The article's premise (AI makes code cheap, so operations becomes the differentiator) has some truth to it. But I'd frame it differently: the bottleneck was never really "writing code." It was understanding what to build and keeping it running. AI helps with one of those. Maybe.
I didn't find that particularly true during my tenure, but obviously Google is huge, so there probably exist teams that actually can afford to behave this way...
I actually argue that AI will therefore impact these levels of management the most.
Think about it, if you were employed as a transformational CEO would you risk trying to fight existing managers or just replace them with AI?
Not AI, but a bad economy and mass layoffs tend to wipe out management positions the most. As a decent IC in a bad economy, you'll always find some place to work if you're flexible on location and salary, because everyone still needs people who know how to actually build shit; nobody needs to add more managers to their ranks to consume payroll and add no value.
AI will transform this.
At all the large MNCs I worked at, management got hired and fired mostly on their connections (or lack thereof) and less on what they actually did. Once they were let go, they had a near-impossible time finding another management position without connections elsewhere.
I wonder if capitalism and democracy will be just a short chapter in history that will be replaced by something else. Autocratic governments seem to be the most prevalent form of government in history.
Edit: Or maybe he is fully aware and just needs to push some books before it's too late.
Look at the 'Product Engineer' roles we are seeing spreading in forward-thinking startups and scaleups.
That's the future of SWE I think. SWEs take on more PM and design responsibilities as part of the existing role.
Of course things might look different when the product is something that requires really deep domain knowledge.
However, as in automated factories, only a small percentage is required to stick around.
That said, Claude has absolutely no problem not only answering questions, but finding bugs and adding new features to it.
In short, I feel they're just as screwed as us devs.
But not defining what an SRE is feels like a glaring, almost suffocating, omission.
That's what they used to say about software engineering and yet this is becoming less and less obvious as capabilities increase.
There are no hiding places for any of us.
Not sure management is eager to give software owned by other companies (inference providers) permission to delete prod DBs.
Also, these roles usually involve talking to other teams and stakeholders more often than a traditional SWE role does.
Though
> There are no hiding places for any of us.
I agree with this statement. While the timeline is unclear (LLM use is heavily subsidized), I think this will translate into less demand for engineers, overall.
We just created a benchmark on adding distributed logs (OpenTelemetry instrumentation) to small services, around 300 lines of code.
Claude Opus 4.5 succeeded at 29%, GPT 5.2 at 26%, Gemini 3 Pro at 16%.
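For context, this is roughly the flavor of change the benchmark asks for - a minimal Python sketch with arbitrary service and span names; a real service would export to a collector rather than the console:

```python
# Minimal OpenTelemetry tracing setup of the kind models are asked to add.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("example-service")

def handle_order(order_id: str):
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic; nested spans pick up context automatically
```

Getting this right across real service boundaries (context propagation, sensible span attributes) is presumably where most of the failures come from.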
I don't think LLM context will be able to digest large codebases, and their algorithms are not going to reason like SREs in the coming years. And given the current hype and market, investors are going to pull out, with recessions all over the world, and we will see another AI winter.
Code has become a commodity. The corporate engineering hierarchy will be much flatter in the coming years, both horizontally and vertically - one staff engineer commanding two senior engineers with two juniors each, each orchestrating N agents.
I think that’s it - this is the end of bootcamp devs. This will act as a great filter and probably decrease the mass influx of bootcamp devs.
Now, there are way too many computer science grads in a time when code is easy and cheap. Not much to gain from hiring a bootcamp dev over the real deal.
But I would say if you truly enjoy coding and you didn’t get to study CS in a university, a bootcamp is probably a fun experience to go through just for your own enjoyment, not for job seeking purposes. Just don’t pay too much.
- testing
- reviewing, and reading/understanding/explaining
- operations / SRE

As for the more Ops-dedicated side: it's garbage in, garbage out. I've already had too many outages caused by AI slop being fed into production; calling all Developers = SRE won't change the fact that AI can't program right now without experienced people heavily controlling it.
SREs are great when the problem is “the network is down” or “kubernetes won’t run my pods”, but expecting a random engineer to know all the failure modes of software they didn’t build and don’t have context on never seems to work out well.
A tricky part comes when you don't have both roles for something (like SRE-developed tools that are maintained by the ones writing them) and you need to strike the balance yourself until/unless you wind up with that split. If you're not aware of both hats and juggling them intentionally, you can wind up with tools out of SRE that are worse than any SWE-only tool would ever be, because SREs sometimes think they won't make the same mistakes; but all the same feature-focus pitfalls apply to SRE-written tools too.
From a people perspective that means excellence when working with outside teams and gathering requirements on your own. It also means always knowing the status of your work in all environments, even in production after deployment. If your soft skills are strong and you can independently program work streams that touch multiple external parties you are golden. It seems this is the future.
On a different note, I do see what you mention about some operational-excellence skills (e.g. project management, requirements gathering, etc.) being areas of concern at my $dayjob. But I always saw those as skills that are valuable in any era, not just the AI era - though everyone's mileage and environment can certainly vary that expectation. Also, at my $dayjob the business lacks the funding to pay software vendors fairly and properly, so we get what we pay for: often low-quality output. It's not low *code*, because we employ and contract regular, full-code devs, but it certainly often is poor quality. And I wonder, as low-code offerings paired with more solid AI development assistance continue to emerge, whether something like an SRE role becomes that much more important, regardless of whether one works in the low-code or low-cost arena.
If your place is indeed the least sweatshop job, then congrats and enjoy the good parts! :-)
What people failed to grasp about low-code/no-code tools (and what I believe the author ultimately says) is that it was never about technical ability. It was about time.
The people who were "supposed" to be the targets of these tools didn't have the time to begin with, let alone the technical experience to round out the rough edges. It's a chore maintaining these types of things.
These tools don't change that equation. I truly believe that we'll see a new golden age of targeted, bespoke software that can now be developed cheaply, instead of small/medium businesses using off-the-shelf, one-size-fits-all solutions.
> ...
> Are you keeping up with security updates? Will you leak all my data? Do I trust you? Can I rely on you?
IMO, if the answers to those questions matter to you, then you damn well should care how it works. Because even if you aren't sufficiently technically minded to audit the system, having someone be able to describe it to you coherently is an important starting point in building that trust and having reason to believe that security and privacy will work as advertised.
Sounds... reliable.
Why (imo)? Senior leaders still like to say: "I run a 500-headcount finance EMEA organization for Siemens", "I am the Chief People Officer of Meta and I lead an org of 1000 smart HR pros." Most of their status is still tied to org headcount.
AI will not get much better than what we have today, and what we have today is not enough to totally transform software engineering. It is a little easier to be a software engineer now, but that’s it. You can still fuck everything up.
Wow, where did this come from?
From what comes to mind based on recent research, I'd expect at least the following this year or next:
* Continuous learning via an architectural change like Titans or TTT-E2E.
* Advancement in World Models (many labs focusing on them now)
* Longer-running agentic systems, with Gas Town being a recent proof of concept.
* Advances in computer and browser usage - tons of money being poured into this, and RL with self-play is straightforward
* AI integration into robotics, especially when coupled with world models
The stuff you mention is unproven in usefulness or is so far away that most software engineers have enough time to wrap up their careers and retire gainfully.
AI has already been integrated with robotics. We have entire factories running entirely with robots in the dark. For mass consumer markets, a floor vacuuming and mopping robot that can also climb stairs is probably peak robotics. They already build world models that map out your entire home and reason about materials and cleanliness.
There’s not much more juice left to squeeze here. The next frontier is genetic programming (biological).
It's called AI SRE, and for now, it's mostly targeted at helping on-call engineers investigate and solve incidents. But of course, these agents can also be used proactively to improve reliability.
That OS on your laptop? Software. The terminal your SSH runs in? Software. The browser you’re reading this take in? Software. The editor you wrote your last 10k LOC in? Software.
The only “service” I buy is email — and even that I run myself. It’s still just software, plus ops.
Yes, running things is hard. Nobody serious disputes that. But pretending this is some new revelation is ahistorical. We used to call this systems engineering, operations, reliability, or just doing your job before SRE needed a brand deck.
And let’s be clear about the direction of value:
Software without SRE still has value. SRE without software has none.
A binary I can run, copy, fork, and understand beats a perfectly monitored nothing. A CLI tool with zero uptime guarantees still solves problems. A library still ships value. A game still runs. A compiler still compiles.
Ops exists to serve software, not replace it. Reliability amplifies value — it does not create it.
If “writing code is easy,” why is the world drowning in unreliable, unmaintainable, over-engineered trash with immaculate dashboards and flawless incident postmortems?
People buy software. They appreciate service when the software becomes infrastructure. Confusing the two is how you end up worshipping uptime graphs while shipping nothing worth running.
Every 5 hours, 24/7. Rinse, repeat.
With vibecoding, I imagine the LLM will get an MCP that lets it schedule jobs on Kubernetes or whatever IaaS, and a fleet of agents will do the basic troubleshooting and whack-a-mole activities, leaving only the hard problems for human SREs. Before and after AI, the corporate incentive will always be to ship slop unless there is a counterbalancing force keeping the shipping team accountable to higher standards.
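A hypothetical sketch of that MCP server, using the official Python SDK - the deployment names are placeholders, and a real setup would want far tighter guardrails than shelling out to kubectl:

```python
# MCP server exposing a restart tool an agent can call during triage:
# the classic first move in whack-a-mole ops.
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("k8s-ops")

@mcp.tool()
def rollout_restart(deployment: str, namespace: str = "default") -> str:
    """Restart a deployment and report kubectl's output."""
    result = subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}",
         "-n", namespace],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```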
Also this doesn't cover most of the jobs, which are actually in consulting, and not product development.
Who probably has never written anything of value in his life and therefore approves the theft of other people's valuable work.
Ultimately hardware, software, QA, etc is all about delivering a system that produces certain outputs for certain inputs, with certain penalties if it doesn’t. If you can, great, if you can’t, good luck. Whether you achieve the “can” with human development or LLM is of little concern as long as you can pay out the penalties of “can’t”.
Every industry has been undergoing digital transformation for decades. There are SREs ensuring service levels for everything, from your electrical meter, to satellite navigation systems. Someone wrote the code that boots your phone and starts your car. Somebody's wireless code is passing through your body as you read this, while an SRE ensures the packet loss isn't too high.
And no, as an SRE I won't read DEV code, but I can help my team test it.
I mean to each their own. Sometimes if I catch a page and the rabbit hole leads to the devs code, I look under the covers.
And sometimes it's a bug I can identify and fix pretty quickly. Sometimes faster than the dev team because I just saw another dev team make the same mistake a month prior.
You gotta know when to cut your losses and stop searching the rabbit hole though, that's true.
that doesn't sound like my definition of an SRE. How is what you're doing different than Ops?
Spoken like a true SRE. I'm mostly writing code, rather than working on keeping it in production, but I've had websites up since 2006 (hope that counts as long time in this corner of the internet) with very little down time and frankly not much effort.
My experience with SREs was largely that they're glorified SSH: they tell me I'm the programmer and I should know what to type into their shell to debug the problem (despite them SREing those services for years, while I joined two months ago and haven't even seen the particular service). But no I can't have shell access, and yes I should be the one spelling out what needs to be typed in.