SRE Agent

v1.0.0 | Monitoring & Observability | stable

Autonomous agent for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows. Integrates LangChain, LangGraph, and MCP servers to enable automated SRE tasks in cloud-native environments.

aiops, automated-diagnosis-system, autonomous-agents, cloud-monitoring, cloud-native
Stars: 14 | Downloads: 0 | Weekly: 0

What is SRE Agent?

SRE Agent is an autonomous agent for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows. It integrates LangChain, LangGraph, and MCP servers to enable automated SRE tasks in cloud-native environments. Its Model Context Protocol (MCP) server lets AI assistants and clients such as Claude, Cursor, and VS Code reach those capabilities.

This server falls under the Monitoring & Observability category on MCPgee, the world's largest MCP server directory with 33,000+ servers.

Features

  • Autonomous agent for Kubernetes incident detection, diagnosis, and mitigation

Use Cases

Kubernetes incident detection and mitigation
Autonomous cloud-native problem diagnosis
License: MIT
Language: Jupyter Notebook
Version: v1.0.0
Updated: May 5, 2026
Status: healthy
Maintenance: active

Works with

Claude, OpenAI, Windows, macOS, Linux

Installation

Manual Installation

git clone https://github.com/martinimarcello00/SRE-agent.git
cd SRE-agent
poetry install

Configuration

Configuration Details

Config File

claude_desktop_config.json

Performance

Response Metrics

Response Time: < 200ms
Throughput: Medium

Resource Usage

Memory Usage: Low
CPU Usage: Low

How to Set Up and Use SRE Agent

SRE Agent is an autonomous multi-agent system for automated Kubernetes incident response, combining LangChain, LangGraph, and a custom MCP server to diagnose and mitigate faults in cloud-native environments. It implements a Divide and Conquer strategy with parallel RCA Worker agents guided by a topology-aware Planner and a hybrid Triage Agent grounded in the Four Golden Signals (Latency, Errors, Traffic, Saturation). The custom MCP server standardizes access to observability tools including Prometheus, Jaeger, and the Kubernetes API, while reducing context window usage through pre-digested data summaries.
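The Divide and Conquer idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the topology, the service names, and the functions `plan_tasks`, `rca_worker`, and `run_rca` are all hypothetical, and the worker is a stub where a real one would query observability tools through MCP and reason with an LLM.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical service dependency graph (caller -> callees), for illustration only.
TOPOLOGY = {
    "frontend": ["search", "reservation"],
    "search": ["geo", "rate"],
    "reservation": ["mongodb-reservation"],
    "geo": [],
    "rate": [],
    "mongodb-reservation": [],
}


def plan_tasks(symptomatic, topology):
    """Planner: one RCA task per symptomatic service, scoped to it and its direct dependencies."""
    return [
        {"target": svc, "scope": [svc, *topology.get(svc, [])]}
        for svc in sorted(symptomatic)
    ]


def rca_worker(task):
    """Stub worker: returns what it would inspect instead of performing a real analysis."""
    return {
        "target": task["target"],
        "inspected": task["scope"],
        "finding": "not analyzed (stub)",
    }


def run_rca(symptomatic, topology, max_workers=4):
    """Fan the planned tasks out to parallel workers and collect results in task order."""
    tasks = plan_tasks(symptomatic, topology)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rca_worker, tasks))


if __name__ == "__main__":
    for result in run_rca({"search", "reservation"}, TOPOLOGY):
        print(result)
```

Scoping each task to a service and its direct dependencies keeps every worker's context small, which is the point of splitting the investigation.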

Prerequisites

  • Python 3.13 or higher and Poetry package manager installed
  • Docker and Kind (Kubernetes in Docker) for spinning up test clusters
  • Make utility for AIOpsLab commands
  • OpenAI API key (used for GPT-based LLM reasoning in the agent)
  • AIOpsLab framework available for fault injection benchmarks (optional but needed for automated experiments)

1. Clone the repository

Clone the SRE-agent repository to your local machine.

git clone https://github.com/martinimarcello00/SRE-agent.git
cd SRE-agent

2. Install dependencies with Poetry

Use Poetry to install all project dependencies including LangChain, LangGraph, and the custom MCP server package.

poetry install

3. Configure environment variables

Copy the example environment file and add your API keys. At minimum, set your OpenAI API key for the LLM reasoning components.

cp .env.example .env
# Edit .env and set:
# OPENAI_API_KEY=your_openai_api_key_here
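Before starting the agent, it can help to confirm that the key was actually filled in. The following is a small sketch, assuming a plain KEY=value file; the helper names `load_env_file` and `missing_keys` are hypothetical and not part of the project.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=("OPENAI_API_KEY",)):
    """Return required keys that are absent, empty, or still set to the placeholder."""
    placeholder = "your_openai_api_key_here"
    return [k for k in required if not values.get(k) or values[k] == placeholder]


if __name__ == "__main__":
    if Path(".env").exists():
        print("Missing or placeholder keys:", missing_keys(load_env_file()))
    else:
        print("No .env file found in the current directory")
```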

4. Start the MCP server for observability tools

The custom MCP server in the MCP-server/ directory exposes Prometheus metrics, Jaeger traces, and the Kubernetes API as MCP tools. Start it before running the agent.

cd MCP-server
poetry run python -m mcp_server
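The overview mentions that the MCP server reduces context window usage through pre-digested data summaries. One way such a summary could look is sketched below; the function `summarize_series` and its output fields are assumptions for illustration, not the server's actual tool output.

```python
from statistics import mean


def summarize_series(name, samples):
    """Condense (unix_timestamp, value) samples, oldest first, into a few statistics."""
    if not samples:
        return {"metric": name, "points": 0}
    values = [value for _, value in samples]
    if values[-1] > values[0]:
        trend = "rising"
    elif values[-1] < values[0]:
        trend = "falling"
    else:
        trend = "flat"
    return {
        "metric": name,
        "points": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 3),
        "last": values[-1],
        "trend": trend,
    }


if __name__ == "__main__":
    # Hypothetical latency samples taken 15 seconds apart.
    print(summarize_series("latency_seconds", [(0, 1.0), (15, 2.0), (30, 6.0)]))
```

Handing an LLM a handful of statistics instead of every raw sample is what keeps the prompt short.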

5. Run the SRE agent interactively

Launch the multi-agent system using LangGraph Studio for a visual development experience, or run the agent script directly for a specific experiment scenario.

# Option A: LangGraph Studio (visual dev UI)
cd sre-agent
poetry run langgraph dev

# Option B: Direct script (from the repository root, inside the Poetry environment)
poetry run python sre-agent/sre-agent.py

6. Run automated experiments (optional)

Use the automated_experiment.py script to run batch experiments: it provisions a Kind cluster, injects faults via AIOpsLab, runs the agent, evaluates results, and cleans up.

poetry run python automated_experiment.py
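The experiment lifecycle described in this step (provision, inject, run, evaluate, clean up) can be sketched as a small driver. This is not the content of automated_experiment.py; `run_experiment` and the stub callables are hypothetical and only show how cleanup can be guaranteed when a step fails.

```python
def run_experiment(provision, inject_fault, run_agent, evaluate, cleanup):
    """Run one experiment and always tear the cluster down, even on failure."""
    provision()
    try:
        fault = inject_fault()
        diagnosis = run_agent()
        return evaluate(fault, diagnosis)
    finally:
        cleanup()


if __name__ == "__main__":
    # Stub steps standing in for Kind provisioning, AIOpsLab fault injection, and the agent.
    outcome = run_experiment(
        provision=lambda: print("provisioning cluster"),
        inject_fault=lambda: "pod-failure",   # hypothetical fault label
        run_agent=lambda: "pod-failure",      # hypothetical agent diagnosis
        evaluate=lambda fault, diagnosis: fault == diagnosis,
        cleanup=lambda: print("deleting cluster"),
    )
    print("diagnosis correct:", outcome)
```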

SRE Agent Examples

Client configuration

MCP client configuration for the SRE Agent's custom observability MCP server running locally over stdio. This assumes the client launches the command from the MCP-server/ directory, as in step 4.

{
  "mcpServers": {
    "sre-observability": {
      "command": "poetry",
      "args": ["run", "python", "-m", "mcp_server"],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here"
      }
    }
  }
}

Prompts to try

Example interactions for driving automated Kubernetes incident diagnosis through the SRE agent.

- "Detect any anomalies in the Hotel Reservation service cluster and report the root cause"
- "Query Prometheus for error rate spikes in the last 10 minutes across all services"
- "Check Jaeger traces for high-latency requests in the payment service"
- "List all unhealthy pods in the default namespace and suggest mitigation steps"
- "Run a full RCA on the current cluster topology starting from the Four Golden Signals"
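A prompt such as the Prometheus one above ultimately has to become a query against the Prometheus HTTP API. The sketch below builds an instant-query URL for the standard /api/v1/query endpoint without sending it. The metric name `http_requests_total`, the `status` and `service` labels, and the base URL are assumptions that depend on how the cluster's services are instrumented.

```python
from urllib.parse import urlencode


def error_rate_query(window="10m", metric="http_requests_total", status_label="status"):
    """Build a PromQL expression for the per-service rate of 5xx responses."""
    selector = f'{metric}{{{status_label}=~"5.."}}'
    return f"sum by (service) (rate({selector}[{window}]))"


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    # Hypothetical local endpoint; use the one configured in .env.
    print(instant_query_url("http://localhost:9090", error_rate_query()))
```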

Troubleshooting SRE Agent

Poetry install fails with Python version mismatch

This project requires Python 3.13+. Install the correct version and set it for the project with 'poetry env use python3.13' before running 'poetry install'.

MCP server cannot connect to Prometheus or Kubernetes API

Ensure your Kind cluster is running ('kind get clusters') and that kubectl is configured to point to it ('kubectl config current-context'). Prometheus must be deployed in the cluster and accessible at the configured endpoint in .env.
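The two checks above can be scripted. The sketch below wraps the same `kind get clusters` and `kubectl config current-context` commands and relies on Kind naming its kubectl contexts `kind-<cluster name>`. The helpers `run_cmd` and `check_cluster` are hypothetical, and the command runner is injectable so the logic can be exercised without a cluster.

```python
import subprocess


def run_cmd(args):
    """Run a command and return (exit_code, stripped stdout); missing tools count as failure."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return 1, ""
    return proc.returncode, proc.stdout.strip()


def check_cluster(run=run_cmd):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    code, out = run(["kind", "get", "clusters"])
    clusters = out.split() if code == 0 else []
    if not clusters:
        problems.append("no Kind cluster found")
    code, context = run(["kubectl", "config", "current-context"])
    if code != 0 or not context:
        problems.append("kubectl has no current context")
    elif clusters and context not in {f"kind-{name}" for name in clusters}:
        problems.append(f"kubectl context '{context}' does not match a Kind cluster")
    return problems


if __name__ == "__main__":
    issues = check_cluster()
    print("OK" if not issues else "\n".join(issues))
```

This covers only the cluster and kubectl side; whether Prometheus is reachable still depends on the endpoint configured in .env.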

Agent produces hallucinated root cause diagnoses

The Triage Agent uses deterministic heuristics on Four Golden Signals to ground diagnoses. If results seem unreliable, ensure Prometheus metrics are being scraped correctly and that the Datagraph topology file reflects the actual cluster service dependencies.
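To make the idea of deterministic heuristics on the Four Golden Signals concrete, here is a minimal sketch. The thresholds, the `GoldenSignals` fields, and the `triage` function are illustrative assumptions rather than the Triage Agent's actual rules.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_p99_ms: float  # 99th percentile request latency
    error_rate: float      # failed requests as a fraction between 0 and 1
    traffic_rps: float     # requests per second
    saturation: float      # resource utilization as a fraction between 0 and 1


# Hypothetical thresholds; real values would be tuned per service.
THRESHOLDS = {"latency_p99_ms": 500.0, "error_rate": 0.05, "saturation": 0.9}
TRAFFIC_DROP_RATIO = 0.5  # flag traffic below half of the baseline


def triage(service, current, baseline):
    """Flag a service using fixed rules, with no LLM involved."""
    flags = []
    if current.latency_p99_ms > THRESHOLDS["latency_p99_ms"]:
        flags.append("latency")
    if current.error_rate > THRESHOLDS["error_rate"]:
        flags.append("errors")
    if baseline.traffic_rps > 0 and current.traffic_rps < TRAFFIC_DROP_RATIO * baseline.traffic_rps:
        flags.append("traffic")
    if current.saturation > THRESHOLDS["saturation"]:
        flags.append("saturation")
    return {"service": service, "symptomatic": bool(flags), "signals": flags}


if __name__ == "__main__":
    now = GoldenSignals(latency_p99_ms=800, error_rate=0.12, traffic_rps=40, saturation=0.5)
    usual = GoldenSignals(latency_p99_ms=120, error_rate=0.01, traffic_rps=100, saturation=0.4)
    print(triage("search", now, usual))
```

Because rules like these depend entirely on the scraped metrics, missing or stale Prometheus data leads directly to missed or spurious flags.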

Frequently Asked Questions about SRE Agent

What is SRE Agent?

SRE Agent is an autonomous agent for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows. It integrates LangChain, LangGraph, and MCP servers to enable automated SRE tasks in cloud-native environments, and its Model Context Protocol (MCP) server connects AI assistants to external tools and data sources through a standardized interface.

How do I install SRE Agent?

Follow the installation instructions on the SRE Agent GitHub repository. Clone the repo, install dependencies, and add the server config to your AI client.

Which AI clients work with SRE Agent?

SRE Agent works with all major MCP-compatible AI clients including Claude Desktop, Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, and Cline.

Is SRE Agent free to use?

Yes, SRE Agent is open source and available under the MIT license. You can use it freely in both personal and commercial projects.


Quick Config Preview

{
  "mcpServers": {
    "sre-agent": {
      "command": "npx",
      "args": ["-y", "sre-agent"]
    }
  }
}

Add this to your claude_desktop_config.json or .cursor/mcp.json
