It was 11:47pm on a Thursday when our auto-remediation system did exactly what it was programmed to do — and nearly took down our entire production environment in the process.
A latency spike triggered an automatic scale-up. The scale-up increased traffic to a downstream service. The downstream service had a bug introduced that afternoon that caused it to fail under increased load. That failure triggered more latency, which triggered more scaling, which triggered more failures. A perfectly designed automation loop had created a perfectly designed cascading disaster.
I caught it because I was awake, happened to glance at Grafana, and noticed the pattern didn't look like a normal traffic spike. I killed the automation, rolled back the downstream service, and we recovered in 11 minutes. Without human intuition recognising that something felt wrong, we would have been down for hours.
The Central Tension in Modern DevOps
We are caught between two imperatives that are both correct and in constant tension:
Imperative 1: Automate everything. Manual processes don't scale, are error-prone, and create single points of human failure. Every task a human does repeatedly should eventually be automated.
Imperative 2: Never automate judgment. Complex systems behave in unexpected ways. Novel failure modes require pattern recognition, contextual awareness, and the kind of "this doesn't feel right" instinct that only experience provides.
The gap between these two imperatives is where many DevOps disasters happen. Either we under-automate and burn out our teams, or we over-automate and find ourselves passengers in a system we can no longer steer.
What Automation Actually Does Well
Automation excels at:
- Known patterns at scale: "When metric X crosses threshold Y, do Z." It can do this a million times without fatigue.
- Eliminating toil: Repetitive, low-stakes tasks that consume human attention without creating human value.
- Speed in well-defined situations: An automated failover can complete in seconds or less. A human-initiated failover typically takes minutes.
- Consistency: Automation doesn't have bad days, doesn't skip steps when it's tired, doesn't forget to run the smoke tests at 3am.
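As a minimal sketch of the "when metric X crosses threshold Y, do Z" pattern from the list above: the class name `ThresholdRule`, the metric name `p99_latency_ms`, the 500 ms threshold, and the `scale_up` action are all hypothetical, chosen only for illustration and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ThresholdRule:
    """When `metric` exceeds `threshold`, run `action` (hypothetical rule shape)."""
    metric: str
    threshold: float
    action: Callable[[], str]

    def evaluate(self, sample: Dict[str, float]) -> Optional[str]:
        # A missing metric is treated as "no decision", not as a breach.
        value = sample.get(self.metric)
        if value is not None and value > self.threshold:
            return self.action()
        return None


def scale_up() -> str:
    # Placeholder for a real scaling call; returns a label so the sketch is testable.
    return "scale_up"


rule = ThresholdRule(metric="p99_latency_ms", threshold=500.0, action=scale_up)
fired = rule.evaluate({"p99_latency_ms": 750.0})   # threshold crossed
quiet = rule.evaluate({"p99_latency_ms": 120.0})   # below threshold
```

The rule does the same thing every time it is evaluated. That is its strength for known patterns, and also why it cannot tell a normal traffic spike from a feedback loop like the one described at the start.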
What Human Intuition Does That Automation Cannot
Human intuition, built from years of watching systems behave, does things that automation frameworks still struggle to replicate:
- Pattern recognition across context: "This looks like the incident we had 18 months ago, but the root cause then was a certificate expiry — let me check that first."
- Anomaly detection in novel situations: Recognising that a metric pattern is "wrong" even before you can articulate why.
- Risk calibration under uncertainty: "The automation wants to restart these pods, but this is Black Friday. I'm going to override it and investigate first."
- Knowing when the rules don't apply: Every automation system has blind spots. Human judgment is what catches the situations that fall outside the rules.
A Decision Framework: What to Automate and What to Keep Human
Building Systems That Know Their Own Limits
The best-designed automation I've built shares one characteristic: it knows when to stop and ask for help. Rather than trying to handle every scenario, it handles the scenarios it knows, and escalates the rest to humans with enough context for the human to make a good decision quickly.
This is also the design philosophy I apply to AI agents in infrastructure. The agent should be able to say: "I've identified three possible causes for this alert. I can automatically resolve cause #1 (disk space) and #2 (stale cache). Cause #3 (possible data corruption) requires human review. Here's what I know so far." That's a system I trust. One that just keeps acting without bounds is one I'm nervous about.
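The triage behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a real agent: the `Cause` type, the `AUTO_REMEDIATIONS` allow-list, and the remediation strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Cause:
    name: str
    evidence: str  # context handed to the human if the cause is escalated


# Hypothetical allow-list: the only causes the agent may resolve on its own.
AUTO_REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "disk_space": lambda: "pruned old logs",
    "stale_cache": lambda: "flushed cache",
}


def triage(causes: List[Cause]) -> Tuple[List[str], List[str]]:
    """Resolve allow-listed causes; escalate everything else with its evidence."""
    actions: List[str] = []
    escalations: List[str] = []
    for cause in causes:
        fix = AUTO_REMEDIATIONS.get(cause.name)
        if fix is not None:
            actions.append(f"{cause.name}: {fix()}")
        else:
            # Anything not explicitly allow-listed goes to a human, with context.
            escalations.append(
                f"{cause.name}: needs human review ({cause.evidence})"
            )
    return actions, escalations


causes = [
    Cause("disk_space", "volume at 97%"),
    Cause("stale_cache", "cache age exceeds TTL"),
    Cause("data_corruption", "checksum mismatch on 3 blocks"),
]
actions, escalations = triage(causes)
```

The design choice worth noting is the default: an unrecognised cause is escalated, not attempted. The agent's scope grows only when someone deliberately adds an entry to the allow-list.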
Design Principles for Trustworthy Automation
- Scope limits: Every automation has a maximum blast radius. Define it explicitly. Don't let automation touch more than X instances, or more than Y% of the cluster, without human approval.
- Confidence thresholds: Automation should act when it is confident and escalate when it is not. Build in explicit "I'm not sure" paths.
- Audit trails: Every automated action should be logged with its triggering condition and the full context at the time of action. You need this for post-mortems.
- Kill switches that actually work: Test your ability to override automation under realistic conditions. A kill switch you can't activate quickly during an incident is not a kill switch.
- Graduated rollout: New automation should prove itself in low-stakes environments before being trusted in production. Give it a probationary period.
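Several of these principles can be combined in a single gate that every automated action passes through. The sketch below is one possible shape, assuming a hypothetical `Guardrails` class with made-up limits (5 instances, 10% of the cluster, 0.8 confidence). A real kill switch would be an external flag that operators can flip during an incident, not an in-process attribute.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Guardrails:
    max_instances: int = 5              # scope limit: absolute cap (assumed value)
    max_cluster_fraction: float = 0.10  # scope limit: share of cluster (assumed value)
    min_confidence: float = 0.8         # below this, escalate instead of acting
    kill_switch_on: bool = False        # stand-in for an external override flag
    audit_log: List[str] = field(default_factory=list)

    def decide(self, action: str, targets: List[str], cluster_size: int,
               confidence: float, context: Dict[str, Any]) -> str:
        """Return 'execute', an 'escalate: ...' reason, or 'blocked: kill switch'."""
        too_many = (len(targets) > self.max_instances
                    or len(targets) > self.max_cluster_fraction * cluster_size)
        if self.kill_switch_on:
            decision = "blocked: kill switch"
        elif too_many:
            decision = "escalate: blast radius"
        elif confidence < self.min_confidence:
            decision = "escalate: low confidence"
        else:
            decision = "execute"
        # Audit trail: record every decision, including the ones that did nothing.
        self.audit_log.append(json.dumps({
            "time": time.time(),
            "action": action,
            "targets": targets,
            "confidence": confidence,
            "context": context,
            "decision": decision,
        }))
        return decision


guard = Guardrails()
small = guard.decide("restart", ["pod-1", "pod-2"], 100, 0.95,
                     {"trigger": "p99 latency above threshold"})
wide = guard.decide("restart", [f"pod-{i}" for i in range(20)], 100, 0.95, {})
unsure = guard.decide("restart", ["pod-1"], 100, 0.5, {})
guard.kill_switch_on = True
halted = guard.decide("restart", ["pod-1"], 100, 0.99, {})
```

The kill switch is checked first so that an operator override wins over every other rule, and blocked or escalated decisions are logged alongside executed ones so a post-mortem can see what the automation wanted to do as well as what it did.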
Closing the Loop: What This Means for Your Career
As AI-powered automation becomes more capable, the engineers who will be most valued are not the ones who build the most automation — they're the ones who design systems where automation and human judgment work together effectively. That requires deep technical skill, but it also requires something less common: the wisdom to know the limits of both.
Your years of watching systems fail in unexpected ways, of developing the gut feeling that something is wrong before you can prove it, of knowing when to trust the playbook and when to throw it out — that is not made obsolete by better automation. It becomes more valuable. Because better automation needs better oversight.
The balance between automation and intuition is not a technical problem with a technical solution. It's a design problem that requires human wisdom to navigate. And that's the kind of problem that will always need us.
— Naveed Ahmed, Lead DevOps Engineer @ DigitalOcean
Key Takeaways from the Full Series
- AI anxiety is a signal that your mental model of your value needs updating — not that your value is gone.
- Job security now comes from judgment, trust, and AI leverage — not knowledge scarcity.
- AI enhances DevOps work most where speed and consistency matter. Human judgment dominates where context and stakes are high.
- DevOps culture was built for the AI era: automate everything, iterate constantly, stay curious.
- Upskilling means building, not just reading. Start with one real pain point.
- The best automation knows its own limits and escalates to humans with context, not just with alerts.