AI agents DevOps Developer Tools

AI Agents On-Call Engineering: Automate K8s Incidents

AI agents are handling first-response Kubernetes incidents autonomously, cutting MTTR in half and letting engineers sleep. Here's the architecture, real code, and production considerations to build your own on-call AI agent this weekend.

Oktay Ateş


May 27, 2026 7 min read


Your phone buzzes at 2 AM. PagerDuty. A Kubernetes pod is crash-looping in production. You groggily open your laptop, SSH into the cluster, start running kubectl describe commands, and spend 45 minutes diagnosing something that turns out to be a misconfigured memory limit. Again.

This is the on-call engineering tax that every team pays — and it's brutal. But right now, on Hacker News and across DevOps Slack communities, a serious conversation is happening: AI agents are starting to handle first-response Kubernetes incidents autonomously. Not perfectly. Not without oversight. But well enough that some teams are cutting their mean time to resolution (MTTR) in half and letting engineers actually sleep.

Let me show you what's actually working, why it's blowing up now, and how to build a basic version yourself.

Why This Is Exploding Right Now

Three things converged at once:

1. LLMs got good at tool use. Function calling and structured tool invocation — the backbone of any useful agentic workflow — matured enough that agents can reliably call kubectl, query Prometheus, read logs, and make decisions across multiple steps without falling apart.

2. Kubernetes runbooks are actually structured knowledge. Unlike open-ended software engineering tasks, incident response has patterns. OOMKilled means check memory limits. CrashLoopBackOff means check logs and exit codes. ImagePullBackOff means check credentials or image name. This is exactly the kind of procedural, step-by-step reasoning that LLMs handle well.

3. The on-call problem is genuinely unsustainable. With platform teams shrinking and microservice counts growing, the alert-to-engineer ratio is getting worse. Teams are desperate for anything that handles tier-1 incidents without waking a human.
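To make the second point concrete, here is a minimal sketch of how those patterns can be read straight off a Pod object. It assumes the input is the parsed JSON from `kubectl get pod <name> -o json`; the helper names `failure_reasons` and `first_checks` and the `RUNBOOK_HINTS` table are hypothetical, and the hints simply restate the patterns above.

```python
# First-check hints taken from the patterns described above.
RUNBOOK_HINTS = {
    "OOMKilled": "check memory limits",
    "CrashLoopBackOff": "check logs and exit codes",
    "ImagePullBackOff": "check registry credentials or image name",
}


def failure_reasons(pod: dict) -> list[str]:
    """Collect waiting and last-terminated reasons from a Pod object."""
    reasons = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        terminated = cs.get("lastState", {}).get("terminated") or {}
        for reason in (waiting.get("reason"), terminated.get("reason")):
            if reason and reason not in reasons:
                reasons.append(reason)
    return reasons


def first_checks(pod: dict) -> dict[str, str]:
    """Map each recognized failure reason to its first runbook step."""
    return {r: RUNBOOK_HINTS[r] for r in failure_reasons(pod) if r in RUNBOOK_HINTS}


# Trimmed-down example of a crash-looping pod that was last killed for memory.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(first_checks(sample_pod))
```

A lookup like this will not replace the LLM, but it shows why the problem is tractable: much of tier-1 triage starts from a small, well-known set of status reasons.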

What an AI On-Call Agent Actually Does

Think of it as a very senior SRE that never sleeps and has perfect recall of your runbooks. When an alert fires, the agent:

1. Receives the alert payload (via PagerDuty webhook, Alertmanager, or similar)
2. Queries the cluster to understand current state
3. Pulls recent logs and events for affected resources
4. Checks related metrics in Prometheus/Grafana
5. Cross-references your runbook knowledge base
6. Attempts a remediation action (or escalates with a full diagnostic summary)
7. Posts a detailed incident report to Slack
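The first step, receiving the alert payload, might look roughly like this for Alertmanager. This sketch assumes the standard Alertmanager webhook body (a top-level `alerts` list whose entries carry `status`, `labels`, `annotations`, and `startsAt`) and assumes your alert rules set `namespace`, `pod`, and `severity` labels. The function name `alerts_from_webhook` is hypothetical; the output keys match the flat alert dictionary used by the agent example later in this article.

```python
def alerts_from_webhook(payload: dict) -> list[dict]:
    """Flatten firing alerts from an Alertmanager webhook body."""
    alerts = []
    for item in payload.get("alerts", []):
        if item.get("status") != "firing":
            continue  # resolved notifications need no diagnosis
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        alerts.append({
            "alertname": labels.get("alertname"),
            "namespace": labels.get("namespace"),
            "pod": labels.get("pod", "unknown"),
            "severity": labels.get("severity"),
            "description": annotations.get("description"),
            "startsAt": item.get("startsAt"),
        })
    return alerts


# Example body with one firing and one resolved alert.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "production",
                "pod": "payment-service-7d9f4b-xkp2m",
                "severity": "critical",
            },
            "annotations": {"description": "Pod has been crash looping for 15 minutes"},
            "startsAt": "2026-05-27T02:14:00Z",
        },
        {"status": "resolved", "labels": {"alertname": "KubePodNotReady"}},
    ],
}

firing = alerts_from_webhook(sample_payload)
print(firing)
```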

The critical design decision: the agent should always escalate with context rather than fail silently. Even if it can't fix the problem, waking you up with a complete diagnosis is far more useful than waking you up cold.

The Architecture

Here's the stack that's emerging as the practical standard:

Agent framework: LangGraph or raw function-calling loop (OpenAI/Anthropic)
Tools: kubectl wrapper, Prometheus HTTP API, log aggregator (Loki/CloudWatch), Slack API
Knowledge base: Vector store of your runbooks (see semantic search with embeddings)
Memory: Incident history for pattern recognition (see persistent agent memory)
Safety layer: Approval gates for destructive actions

The safety layer is non-negotiable. Read our piece on AI agent production safety before you deploy anything that touches your cluster.

Let's Build It: A Minimal K8s Incident Agent

Here's a working skeleton using OpenAI function calling. This handles the most common Kubernetes failure modes.

Step 1: Define the tools

import subprocess
import json
import requests
from openai import OpenAI

client = OpenAI()

# Tool definitions for the agent
tools = [
    {
        "type": "function",
        "function": {
            "name": "kubectl_get",
            "description": "Run kubectl get command to inspect Kubernetes resources",
            "parameters": {
                "type": "object",
                "properties": {
                    "resource": {"type": "string", "description": "Resource type: pods, deployments, services, events"},
                    "namespace": {"type": "string", "description": "Kubernetes namespace"},
                    "name": {"type": "string", "description": "Resource name (optional)"},
                    "output": {"type": "string", "description": "Output format: json, yaml, wide"}
                },
                "required": ["resource", "namespace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_pod_logs",
            "description": "Fetch recent logs from a Kubernetes pod",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod_name": {"type": "string"},
                    "namespace": {"type": "string"},
                    "tail_lines": {"type": "integer", "default": 100},
                    "previous": {"type": "boolean", "description": "Get logs from previous crashed container"}
                },
                "required": ["pod_name", "namespace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "kubectl_describe",
            "description": "Describe a Kubernetes resource for detailed status and events",
            "parameters": {
                "type": "object",
                "properties": {
                    "resource": {"type": "string"},
                    "name": {"type": "string"},
                    "namespace": {"type": "string"}
                },
                "required": ["resource", "name", "namespace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "query_prometheus",
            "description": "Run an instant PromQL query against Prometheus",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "PromQL query string; put any range selector such as [5m] inside the query itself"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Escalate incident to on-call human with full diagnostic context",
            "parameters": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                    "summary": {"type": "string"},
                    "findings": {"type": "string"},
                    "recommended_action": {"type": "string"}
                },
                "required": ["severity", "summary", "findings"]
            }
        }
    }
]

Step 2: Implement the tool execution layer

def execute_tool(tool_name: str, args: dict) -> str:
    if tool_name == "kubectl_get":
        cmd = ["kubectl", "get", args["resource"], "-n", args["namespace"]]
        if args.get("name"):
            cmd.append(args["name"])
        if args.get("output"):
            cmd.extend(["-o", args["output"]])
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return result.stdout or result.stderr

    elif tool_name == "get_pod_logs":
        cmd = ["kubectl", "logs", args["pod_name"], "-n", args["namespace"],
               "--tail", str(args.get("tail_lines", 100))]
        if args.get("previous"):
            cmd.append("--previous")
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return result.stdout or result.stderr

    elif tool_name == "kubectl_describe":
        cmd = ["kubectl", "describe", args["resource"], args["name"], "-n", args["namespace"]]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return result.stdout or result.stderr

    elif tool_name == "query_prometheus":
        # Adjust URL to your Prometheus endpoint
        prometheus_url = "http://prometheus:9090"
        params = {"query": args["query"]}
        try:
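            resp = requests.get(f"{prometheus_url}/api/v1/query", params=params, timeout=10)
            body = resp.json()
            # Prometheus reports failures as {"status": "error", "error": "..."}
            if body.get("status") != "success":
                return f"Prometheus query failed: {body.get('error', 'unknown error')}"
            return json.dumps(body.get("data", {}))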
        except Exception as e:
            return f"Prometheus query failed: {str(e)}"

    elif tool_name == "escalate_to_human":
        # In production: trigger PagerDuty + post to Slack
        message = f"""🚨 ESCALATION [{args['severity'].upper()}]
**Summary**: {args['summary']}
**Findings**: {args['findings']}
**Recommended Action**: {args.get('recommended_action', 'Manual investigation required')}"""
        print(message)  # Replace with actual Slack/PagerDuty call
        return "Escalation sent to on-call engineer"

    return f"Unknown tool: {tool_name}"

Step 3: The agent loop

SYSTEM_PROMPT = """You are an expert SRE agent responding to Kubernetes incidents.
Your job is to diagnose the root cause and either remediate it or escalate with full context.

For every incident:
1. Check the affected resource status
2. Inspect recent events and logs
3. Check related metrics if relevant
4. Identify root cause from these patterns:
   - OOMKilled → memory limit too low or memory leak
   - CrashLoopBackOff → check exit code in logs
   - ImagePullBackOff → image name wrong or registry credentials expired
   - Pending → node resources exhausted or PVC not bound
   - Evicted → node memory/disk pressure
5. You only have read-only tools: recommend a remediation in your final answer, or escalate with complete findings.

Never make destructive changes. Always escalate if uncertain."""

def run_incident_agent(alert: dict) -> str:
    incident_description = f"""
    INCIDENT ALERT:
    - Alert Name: {alert.get('alertname')}
    - Namespace: {alert.get('namespace')}
    - Pod: {alert.get('pod', 'unknown')}
    - Severity: {alert.get('severity')}
    - Description: {alert.get('description')}
    - Started: {alert.get('startsAt')}
    """

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": incident_description}
    ]

    max_iterations = 10  # Safety limit
    iteration = 0

    while iteration < max_iterations:
        iteration += 1
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            timeout=60,  # seconds; a stalled LLM call must not hang the incident
        )

        message = response.choices[0].message
        messages.append(message)

        # No more tool calls — agent is done
        if not message.tool_calls:
            return message.content

        # Execute each tool call; failures go back to the model as text
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            try:
                args = json.loads(tool_call.function.arguments)
                print(f"[Agent] Calling tool: {tool_name} with {args}")
                result = execute_tool(tool_name, args)
            except Exception as e:  # bad JSON, missing argument, subprocess timeout
                result = f"Tool {tool_name} failed: {e}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

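    # Iteration budget exhausted: actually escalate instead of only saying so
    execute_tool("escalate_to_human", {
        "severity": "high",
        "summary": f"Agent hit the {max_iterations}-iteration limit on {alert.get('alertname')}",
        "findings": "Diagnosis incomplete; review the agent's tool-call log for partial findings.",
    })
    return "Max iterations reached. Escalated for manual review."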


# Example usage — triggered by Alertmanager webhook
example_alert = {
    "alertname": "KubePodCrashLooping",
    "namespace": "production",
    "pod": "payment-service-7d9f4b-xkp2m",
    "severity": "critical",
    "description": "Pod has been crash looping for 15 minutes",
    "startsAt": "2026-05-27T02:14:00Z"
}

result = run_incident_agent(example_alert)
print(result)

What the Agent Handles Well (and What It Doesn't)

Handles well:

OOMKilled diagnosis and memory limit recommendations
CrashLoopBackOff root cause from log analysis
ImagePullBackOff identification
Generating detailed incident summaries for human review
Checking if issues are widespread vs. isolated

Still needs humans:

Anything requiring cluster-wide changes
Novel failure modes not in runbooks
Application-level bugs (the infra is fine, the code is broken)
Multi-service cascading failures requiring system-wide context

This is a tier-1 responder, not a replacement for your SRE team. The goal is that 40-60% of alerts get fully resolved or triaged before a human wakes up.

Production Considerations You Can't Skip

RBAC lockdown. The agent's service account should have read-only access to most resources, with write access only to safe remediation actions (restart a deployment, not delete a namespace). Least privilege is critical here.
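RBAC is enforced by the cluster, but a client-side allowlist adds a second layer, since the tool arguments come from the model. The sketch below validates arguments for the `kubectl_get` tool before any command is built. The resource set mirrors the types named in that tool's description; the name pattern is a rough approximation of Kubernetes naming rules, and the function name `validate_kubectl_get` is hypothetical.

```python
import re

# Resource types the agent may read; mirrors the kubectl_get tool description.
READ_ONLY_RESOURCES = {"pods", "deployments", "services", "events"}
OUTPUT_FORMATS = {"json", "yaml", "wide"}

# Approximation of Kubernetes object names: lowercase alphanumerics, '-' and '.',
# starting and ending with an alphanumeric. Rejects anything that looks like a flag.
NAME_PATTERN = re.compile(r"^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$")


def validate_kubectl_get(args: dict) -> list[str]:
    """Return a safe kubectl argv, or raise ValueError for disallowed input."""
    resource = args.get("resource", "")
    namespace = args.get("namespace", "")
    name = args.get("name")
    output = args.get("output")

    if resource not in READ_ONLY_RESOURCES:
        raise ValueError(f"resource not allowed: {resource!r}")
    for label, value in (("namespace", namespace), ("name", name)):
        if value is not None and not NAME_PATTERN.match(value):
            raise ValueError(f"invalid {label}: {value!r}")
    if output is not None and output not in OUTPUT_FORMATS:
        raise ValueError(f"output format not allowed: {output!r}")

    cmd = ["kubectl", "get", resource, "-n", namespace]
    if name:
        cmd.append(name)
    if output:
        cmd.extend(["-o", output])
    return cmd


print(validate_kubectl_get({"resource": "pods", "namespace": "production", "output": "json"}))
```

With this in place, a model that asks for `secrets` or passes a flag as a resource name gets an error string back instead of a command being run.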

Audit logging. Every kubectl command the agent runs should be logged with the full incident context. You need this for post-mortems and for debugging why the agent did something unexpected.
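One simple way to do this is a JSON Lines audit trail, one record per tool call. The sketch below is a starting point under a few assumptions: the `incident_id` comes from your alerting system, the field names are hypothetical, and tool output is truncated at an arbitrary 2,000 characters to keep records small.

```python
import io
import json
from datetime import datetime, timezone

MAX_RESULT_CHARS = 2000  # arbitrary cap so one noisy log dump cannot bloat the trail


def audit_record(incident_id: str, tool_name: str, args: dict, result: str) -> dict:
    """Build one audit entry tying a tool call to its incident."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "incident_id": incident_id,
        "tool": tool_name,
        "args": args,
        "result": result[:MAX_RESULT_CHARS],
        "truncated": len(result) > MAX_RESULT_CHARS,
    }


def write_audit(stream, record: dict) -> None:
    """Append the record as a single JSON line."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# In production the stream would be a file or log shipper; StringIO keeps this runnable.
buffer = io.StringIO()
write_audit(buffer, audit_record(
    "inc-0001",
    "kubectl_describe",
    {"resource": "pod", "name": "payment-service-7d9f4b-xkp2m", "namespace": "production"},
    "x" * 5000,
))
print(buffer.getvalue()[:120])
```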

Approval gates for destructive actions. Any write action, even restarting a deployment, should start behind a gate: post a Slack message with approve/reject buttons and execute only on approval. This human-in-the-loop pattern gives you speed without catastrophic risk.
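The core of such a gate is small: park the proposed command, hand back an ID, and run it only when a human approves. The sketch below leaves out the Slack message and button handling entirely; `ApprovalGate` and its methods are hypothetical names, and the command runner is injected so the example runs without a cluster.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ApprovalGate:
    """Holds proposed commands until a human approves or rejects them."""
    pending: dict = field(default_factory=dict)

    def request(self, command: list[str]) -> str:
        """Park a command and return the ID to embed in the approval message."""
        request_id = uuid.uuid4().hex
        self.pending[request_id] = command
        return request_id

    def resolve(self, request_id: str, approved: bool,
                runner: Callable[[list[str]], str]) -> str:
        """Run the command only on approval; each request can be resolved once."""
        command = self.pending.pop(request_id)  # KeyError if unknown or already used
        if not approved:
            return "rejected: " + " ".join(command)
        return runner(command)


# Stand-in for the real executor so the example runs without kubectl.
executed = []


def fake_runner(command: list[str]) -> str:
    executed.append(command)
    return "executed: " + " ".join(command)


gate = ApprovalGate()
restart = ["kubectl", "rollout", "restart", "deployment/payment-service", "-n", "production"]
request_id = gate.request(restart)
print(gate.resolve(request_id, approved=True, runner=fake_runner))
```

Popping the request on resolution means a replayed button click cannot execute the same command twice.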

Timeout everything. subprocess calls, Prometheus queries, LLM calls — all need hard timeouts. An agent that hangs during an incident is worse than no agent.
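For subprocess calls, note that `subprocess.run` raises `subprocess.TimeoutExpired` when its timeout elapses, so the exception needs handling or the agent process dies mid-incident. Here is a minimal wrapper, assuming you want failures returned as text the model can read; the name `run_with_timeout` is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float = 30.0) -> str:
    """Run a command and always return a string, even on timeout or a missing binary."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return f"Command timed out after {timeout_s}s: {' '.join(cmd)}"
    except FileNotFoundError:
        return f"Command not found: {cmd[0]}"
    return result.stdout or result.stderr


# Demonstrated with the Python interpreter so it runs without kubectl installed.
print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5))
```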

Cost awareness. GPT-4o with up to 10 tool-call iterations per incident adds up, because every iteration resends the full conversation, including all earlier tool output. For high-volume alert environments, consider using a cheaper model for initial triage and only escalating to GPT-4o for complex cases. See our guide on reducing OpenAI API costs.
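A rough way to see how the iterations add up: in a chat-completions loop the whole message history is sent on every call, so prompt tokens grow with each tool result. The sketch below is back-of-the-envelope arithmetic only. The token counts and the price are placeholder assumptions, not measured values or real pricing; plug in your own.

```python
def cumulative_prompt_tokens(base_tokens: int, tokens_per_step: int, iterations: int) -> int:
    """Total prompt tokens when each call resends the history plus prior tool output."""
    return sum(base_tokens + step * tokens_per_step for step in range(iterations))


def estimated_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given per-million-token price."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder assumptions: 1,000 tokens of system prompt and alert,
# 1,500 tokens added per tool round trip, 10 iterations.
total = cumulative_prompt_tokens(base_tokens=1000, tokens_per_step=1500, iterations=10)
print(total)  # 77500 prompt tokens across the ten calls

# Placeholder price of 1.0 USD per million tokens; substitute your model's rate.
print(estimated_cost_usd(total, usd_per_million_tokens=1.0))
```

Because the total grows roughly with the square of the iteration count, trimming verbose tool output or lowering the iteration cap tends to help more than it first appears.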

The Bigger Picture

This isn't just a cool demo. Teams at companies like Cortex, Incident.io, and various stealth startups are shipping this in production today. Engineers in the HN thread that sparked this trend reported, anecdotally, a 60% reduction in 2 AM pages and MTTR dropping from 45 minutes to under 10 for common failure patterns.

The on-call engineering role is not going away. But the tier-1, pattern-matching, runbook-execution part of that role? That's automatable right now, with tools that exist today, deployed in a weekend.

The engineers who build these systems aren't replacing their teams — they're making their teams' lives dramatically better. And in a market where on-call burnout is a top reason senior engineers quit, that's not just a technical win. It's a retention strategy.

Start small. Pick your three most common alert types. Write the runbook logic into the system prompt. Add the kubectl tools. Test it on staging alerts. Deploy with read-only permissions first. Then iterate.

The 2 AM pages aren't going to stop. But they can stop waking you up for the ones that don't need you.


Oktay Ateş

Systems Architect building autonomous systems and modern web infrastructure in the open. Creator of autonode.tech and aixsap.com.
