SRE Agent
Autonomous agent for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows. Integrates LangChain, LangGraph, and MCP servers to enable automated SRE tasks in cloud-native environments.
What is SRE Agent?
SRE Agent is an autonomous agent for Kubernetes incident detection, diagnosis, and mitigation, built on LLMs and modular workflows. It integrates LangChain, LangGraph, and MCP servers to automate SRE tasks in cloud-native environments, and its MCP server can be used from AI assistants such as Claude, Cursor, and VS Code.
SRE Agent is listed under the Monitoring & Observability category on MCPgee, an MCP server directory with 33,000+ servers.
Features
- Autonomous agent for Kubernetes incident detection, diagnosis, and mitigation
Installation
Manual Installation
npx sre-agent
How to Set Up and Use SRE Agent
SRE Agent is an autonomous multi-agent system for automated Kubernetes incident response, combining LangChain, LangGraph, and a custom MCP server to diagnose and mitigate faults in cloud-native environments. It implements a Divide and Conquer strategy with parallel RCA Worker agents guided by a topology-aware Planner and a hybrid Triage Agent grounded in the Four Golden Signals (Latency, Errors, Traffic, Saturation). The custom MCP server standardizes access to observability tools including Prometheus, Jaeger, and the Kubernetes API, while reducing context window usage through pre-digested data summaries.
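The divide-and-conquer flow described above can be illustrated with a small sketch. This is not the project's implementation: the topology, the service names, and the functions `plan_investigations`, `rca_worker`, and `run_parallel_rca` are all hypothetical, and the worker is a placeholder where a real one would call MCP tools and an LLM.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical service dependency graph (caller -> callees); not taken from the project.
TOPOLOGY = {
    "frontend": ["search", "reservation"],
    "search": ["geo", "rate"],
    "reservation": ["mongodb-reservation"],
}


def plan_investigations(topology, symptomatic):
    """Planner: scope one task per symptomatic service to it and its direct dependencies."""
    return [
        {"target": svc, "scope": [svc, *topology.get(svc, [])]}
        for svc in sorted(symptomatic)
    ]


def rca_worker(task):
    """Worker placeholder: a real worker would query observability tools and reason with an LLM."""
    return {
        "target": task["target"],
        "checked": task["scope"],
        "finding": "not implemented in this sketch",
    }


def run_parallel_rca(topology, symptomatic, max_workers=4):
    """Fan the planned tasks out to workers running in parallel."""
    tasks = plan_investigations(topology, symptomatic)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves task order, so reports line up with the plan.
        return list(pool.map(rca_worker, tasks))


if __name__ == "__main__":
    for report in run_parallel_rca(TOPOLOGY, {"search", "reservation"}):
        print(report["target"], "->", report["checked"])
```

Scoping each task to a service and its direct dependencies is one simple way to make the plan topology-aware; the actual Planner may scope work differently.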
Prerequisites
- Python 3.13 or higher and Poetry package manager installed
- Docker and Kind (Kubernetes in Docker) for spinning up test clusters
- Make utility for AIOpsLab commands
- OpenAI API key (used for GPT-based LLM reasoning in the agent)
- AIOpsLab framework available for fault injection benchmarks (optional but needed for automated experiments)
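Before cloning, it can help to verify the prerequisites listed above. The following is a minimal sketch, not part of the project: `check_prerequisites` is a hypothetical helper that checks the Python version, looks for the command-line tools on PATH, and checks whether OPENAI_API_KEY is set in the current process environment.

```python
import os
import shutil
import sys

# Executable names are assumptions based on the prerequisites list.
REQUIRED_TOOLS = ("poetry", "docker", "kind", "make")


def check_prerequisites(min_python=(3, 13), tools=REQUIRED_TOOLS, env=None,
                        which=shutil.which, python_version=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if python_version is None else python_version
    problems = []
    if version < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for tool in tools:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

The key check only inspects the process environment, so it will report a missing key if the key is stored only in a .env file that has not been loaded.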
Clone the repository
Clone the SRE-agent repository to your local machine.
git clone https://github.com/martinimarcello00/SRE-agent.git
cd SRE-agent
Install dependencies with Poetry
Use Poetry to install all project dependencies including LangChain, LangGraph, and the custom MCP server package.
poetry install
Configure environment variables
Copy the example environment file and add your API keys. At minimum, set your OpenAI API key for the LLM reasoning components.
cp .env.example .env
# Edit .env and set:
# OPENAI_API_KEY=your_openai_api_key_here
Start the MCP server for observability tools
The custom MCP server in the MCP-server/ directory exposes Prometheus metrics, Jaeger traces, and the Kubernetes API as MCP tools. Start it before running the agent.
cd MCP-server
poetry run python -m mcp_server
Run the SRE agent interactively
Launch the multi-agent system using LangGraph Studio for a visual development experience, or run the agent script directly for a specific experiment scenario.
# Option A: LangGraph Studio (visual dev UI)
cd sre-agent
poetry run langgraph dev
# Option B: Direct script (run from the repository root)
python sre-agent/sre-agent.py
Run automated experiments (optional)
Use the automated_experiment.py script to run batch experiments: it provisions a Kind cluster, injects faults via AIOpsLab, runs the agent, evaluates results, and cleans up.
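Conceptually, the batch flow described above is a provision, inject, run, evaluate, clean-up loop in which cleanup must happen even when a step fails. The sketch below illustrates that shape only; `run_experiment`, `run_batch`, and the step callables are hypothetical and do not reflect the actual contents of automated_experiment.py.

```python
def run_experiment(scenario, provision, inject_fault, run_agent, evaluate, cleanup):
    """Run one scenario; the cluster is cleaned up even if a step raises."""
    cluster = provision()
    try:
        inject_fault(cluster, scenario)
        diagnosis = run_agent(cluster)
        return evaluate(scenario, diagnosis)
    finally:
        cleanup(cluster)


def run_batch(scenarios, **steps):
    """Run every scenario, recording failures instead of aborting the batch."""
    results = {}
    for scenario in scenarios:
        try:
            results[scenario] = run_experiment(scenario, **steps)
        except Exception as exc:  # keep going so one bad run does not stop the batch
            results[scenario] = {"error": str(exc)}
    return results


if __name__ == "__main__":
    # Stub steps stand in for Kind provisioning, fault injection, and the agent itself.
    print(run_batch(
        ["example-fault"],
        provision=lambda: "stub-cluster",
        inject_fault=lambda cluster, scenario: None,
        run_agent=lambda cluster: "stub diagnosis",
        evaluate=lambda scenario, diagnosis: {"diagnosis": diagnosis},
        cleanup=lambda cluster: None,
    ))
```

To run the project's real pipeline, use the script itself: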
python automated_experiment.py
SRE Agent Examples
Client configuration
MCP client configuration for the SRE Agent's custom observability MCP server running locally over stdio.
{
  "mcpServers": {
    "sre-observability": {
      "command": "poetry",
      "args": ["run", "python", "-m", "mcp_server"],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here"
      }
    }
  }
}
Prompts to try
Example interactions for driving automated Kubernetes incident diagnosis through the SRE agent.
- "Detect any anomalies in the Hotel Reservation service cluster and report the root cause"
- "Query Prometheus for error rate spikes in the last 10 minutes across all services"
- "Check Jaeger traces for high-latency requests in the payment service"
- "List all unhealthy pods in the default namespace and suggest mitigation steps"
- "Run a full RCA on the current cluster topology starting from the Four Golden Signals"
Troubleshooting SRE Agent
Poetry install fails with Python version mismatch
This project requires Python 3.13+. Install the correct version and set it for the project with 'poetry env use python3.13' before running 'poetry install'.
MCP server cannot connect to Prometheus or Kubernetes API
Ensure your Kind cluster is running ('kind get clusters') and that kubectl is configured to point to it ('kubectl config current-context'). Prometheus must be deployed in the cluster and accessible at the configured endpoint in .env.
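The checks above can be scripted. The sketch below is an illustration, not part of the project: it runs the two CLI commands mentioned above and probes Prometheus's `/-/healthy` endpoint. The helper names and the example URL `http://localhost:9090` are assumptions; substitute the endpoint configured in your .env.

```python
import subprocess
import urllib.request


def run_cli(args):
    """Return stripped stdout of a command, or None if it is missing, fails, or times out."""
    try:
        done = subprocess.run(args, capture_output=True, text=True, timeout=15, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return done.stdout.strip()


def prometheus_healthy(base_url, timeout=5):
    """True if Prometheus answers its health endpoint with HTTP 200."""
    try:
        url = base_url.rstrip("/") + "/-/healthy"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        return False


def diagnose(prometheus_url, run=run_cli, probe=prometheus_healthy):
    """Collect the facts needed to see why the MCP server cannot reach its backends."""
    clusters = (run(["kind", "get", "clusters"]) or "").split()
    context = run(["kubectl", "config", "current-context"])
    return {
        "kind_clusters": clusters,
        "kubectl_context": context,
        # Kind names its kubectl contexts "kind-<cluster name>".
        "context_matches_kind": context in {f"kind-{name}" for name in clusters},
        "prometheus_healthy": probe(prometheus_url),
    }


if __name__ == "__main__":
    # The URL is an assumed example, not the project's configured endpoint.
    for key, value in diagnose("http://localhost:9090").items():
        print(f"{key}: {value}")
```

If Prometheus runs only inside the cluster, the health probe will fail from the host unless the endpoint is exposed, for example through a port-forward.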
Agent produces hallucinated root cause diagnoses
The Triage Agent uses deterministic heuristics on Four Golden Signals to ground diagnoses. If results seem unreliable, ensure Prometheus metrics are being scraped correctly and that the Datagraph topology file reflects the actual cluster service dependencies.
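To make the idea of deterministic grounding concrete, here is a minimal sketch of threshold-based triage over the Four Golden Signals. It is not the Triage Agent's actual logic: the `GoldenSignals` type, the `triage` function, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    p99_latency_ms: float
    error_rate: float   # fraction of failed requests, 0..1
    traffic_rps: float  # requests per second
    saturation: float   # fraction of the resource limit in use, 0..1


# Hypothetical thresholds chosen for illustration only.
THRESHOLDS = {"p99_latency_ms": 500.0, "error_rate": 0.05, "saturation": 0.9}


def triage(current, baseline, traffic_drop_ratio=0.5):
    """Return the signals that breach a threshold, in a fixed and repeatable order."""
    flags = []
    if current.p99_latency_ms > THRESHOLDS["p99_latency_ms"]:
        flags.append("latency")
    if current.error_rate > THRESHOLDS["error_rate"]:
        flags.append("errors")
    # Traffic is judged relative to a baseline: a sharp drop suggests an upstream failure.
    if baseline.traffic_rps > 0 and current.traffic_rps < baseline.traffic_rps * traffic_drop_ratio:
        flags.append("traffic")
    if current.saturation > THRESHOLDS["saturation"]:
        flags.append("saturation")
    return flags


if __name__ == "__main__":
    baseline = GoldenSignals(120.0, 0.01, 200.0, 0.4)
    degraded = GoldenSignals(900.0, 0.20, 50.0, 0.95)
    print(triage(degraded, baseline))
```

Because the same inputs always yield the same flags, the output depends entirely on the metrics being correct, which is why missing or stale Prometheus scrapes lead to unreliable diagnoses downstream.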
Frequently Asked Questions about SRE Agent
What is SRE Agent?
SRE Agent is an autonomous agent for Kubernetes incident detection, diagnosis, and mitigation that uses LLMs and modular workflows. It integrates LangChain, LangGraph, and MCP servers to automate SRE tasks in cloud-native environments. Its MCP server connects AI assistants to external tools and data sources through a standardized interface.
How do I install SRE Agent?
Follow the installation instructions on the SRE Agent GitHub repository. Clone the repo, install dependencies, and add the server config to your AI client.
Which AI clients work with SRE Agent?
SRE Agent works with all major MCP-compatible AI clients including Claude Desktop, Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, and Cline.
Is SRE Agent free to use?
Yes, SRE Agent is open source and available under the MIT license. You can use it freely in both personal and commercial projects.
SRE Agent Alternatives — Similar Monitoring & Observability Servers
Looking for alternatives to SRE Agent? Here are other popular monitoring & observability servers you can use with Claude, Cursor, and VS Code.
Netdata
★ 78.9k
Real-time infrastructure monitoring with metrics, logs, alerts, and ML-based anomaly detection.
Kubeshark
★ 11.9k
eBPF-powered network observability for Kubernetes. Indexes L4/L7 traffic with full K8s context, decrypts TLS without keys. Queryable by AI agents via MCP and humans via dashboard.
Mission Control
★ 4.9k
Self-hosted AI agent orchestration platform: dispatch tasks, run multi-agent workflows, monitor spend, and govern operations from one mission control dashboard.
Grafana
★ 3.0k
This MCP server enables natural-language querying of Grafana logs by automatically detecting log sources and service labels. It provides read-only access to log data with intelligent caching for efficient repeat queries.
Sentrux
★ 2.4k
Real-time architectural sensor that helps AI agents close the feedback loop, enabling recursive self-improvement of code quality. Pure Rust.
OpenInference
★ 986
OpenTelemetry Instrumentation for AI Observability