hermes-incident-commander
Autonomous SRE agent built on Hermes - detects, heals, and learns from production incidents. Uses Memory + Skills + Cron + Gateway + Subagents + Atropos RL.
β Hermes Incident Commander
An autonomous SRE agent that detects, diagnoses, and heals production infrastructure - then learns from every incident it resolves.
Built on Hermes Agent by NousResearch. Submitted for the "Show us what Hermes Agent can do" challenge.
The Problem
When a production server goes down at 3 AM, an on-call engineer has to:
- Wake up, check alerts
- SSH in, run diagnostics manually
- Piece together root cause from logs
- Apply a fix - hopefully the right one
- Verify it worked
- Write a post-mortem nobody will read
Mean time to resolve (MTTR) for P0 incidents averages 45β60 minutes. Much of that is humans doing things a sufficiently capable agent could do faster and better.
Hermes Incident Commander does all of it - autonomously, in minutes, getting smarter with each incident it handles.
Demo
# Install dependencies
pip install anthropic rich
# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# Run a demo incident (disk full scenario)
python demo/demo_incident.py --scenario disk-full-logs
# Try other scenarios
python demo/demo_incident.py --scenario svc-crash-nginx
python demo/demo_incident.py --scenario cpu-runaway-process
What you'll see:
- Hermes detects the incident and classifies severity (P0/P1/P2/P3)
- Runs parallel diagnostics across CPU, memory, disk, and services
- Identifies root cause with explicit reasoning
- Applies the safest effective fix
- Verifies the fix worked
- Writes a structured post-incident report to
~/.hermes/incidents/ - Creates a new prevention skill in
~/.hermes/skills/so it handles this faster next time
How It Uses Every Hermes Feature
This project was designed to push every capability of Hermes Agent:
| Hermes Feature | How It's Used |
|---|---|
| Persistent Memory | Builds a system topology map over time. Learns which services fail together, time-of-day patterns, and which remediations work on YOUR infrastructure. |
| Skill Auto-Creation | After every novel incident, writes a new SKILL.md prevention playbook. Hermes gets measurably better at your stack over weeks. |
| Cron Scheduler | Every 5 min: critical health check. Every hour: full audit. Daily 08:00: morning briefing to Telegram. |
| Gateway (Telegram/Discord) | Real-time P0 alerts, resolution notices, and daily briefings delivered to your phone. |
| Subagent Spawning | For multi-service environments, spawns parallel subagents to investigate nginx, database, and application layers simultaneously. |
| Session Search (FTS5) | "Have we seen this error before?" - searches past incidents for matching patterns. |
| execute_code | Collapses multi-step diagnostic pipelines into single inference turns, dramatically reducing latency. |
| MCP Integration | Connects to cloud provider APIs (AWS/GCP/Azure MCP servers) for auto-scaling and cloud-native remediation. |
Architecture
flowchart TD
ALERT([π¨ Incident Alert]) --> DETECT
DETECT["π DETECT<br/>Gather system vitals<br/>CPU β’ Memory β’ Disk β’ Services"]
TRIAGE["βοΈ TRIAGE<br/>Classify severity<br/>P0 Β· P1 Β· P2 Β· P3"]
DIAGNOSE["π¬ DIAGNOSE<br/>Root cause analysis<br/>Logs Β· Processes Β· Stack traces"]
REMEDIATE["π§ REMEDIATE<br/>Apply safest fix<br/>Tier 1 β 2 β 3"]
VERIFY["β
VERIFY<br/>Confirm resolution<br/>Before vs after metrics"]
DETECT --> TRIAGE --> DIAGNOSE --> REMEDIATE --> VERIFY
CRON["β±οΈ CRON<br/>Every 5 min: health check<br/>Every hour: full audit<br/>Daily 08:00: briefing"]
CRON -->|triggers| DETECT
LEARN["π§ LEARN<br/>Write post-incident report<br/>Create prevention SKILL.md<br/>Update MEMORY.md<br/>Search past incidents (FTS5)"]
VERIFY --> LEARN
GATEWAY["π² GATEWAY<br/>Telegram Β· Discord Β· Slack"]
TRIAGE -->|"π¨ P0/P1 alert"| GATEWAY
VERIFY -->|"β
resolved"| GATEWAY
CRON -->|"π daily briefing"| GATEWAY
style DETECT fill:#1e3a5f,color:#fff
style TRIAGE fill:#7b2d00,color:#fff
style DIAGNOSE fill:#1e3a5f,color:#fff
style REMEDIATE fill:#1a4731,color:#fff
style VERIFY fill:#1a4731,color:#fff
style LEARN fill:#3d2068,color:#fff
style CRON fill:#2d2d2d,color:#fff
style GATEWAY fill:#2d2d2d,color:#fff
style ALERT fill:#7b2d00,color:#fff
Project Structure
graph LR
ROOT["π hermes-incident-commander"]
ROOT --> SKILLS["π skills/"]
ROOT --> ENVS["π environments/"]
ROOT --> DEMO["π demo/"]
ROOT --> TESTS["π tests/"]
ROOT --> DOCS["π docs/"]
ROOT --> REQ["π requirements.txt"]
SKILLS --> SKILL_MD["π incident-commander/SKILL.md<br/>β install into ~/.hermes/skills/"]
ENVS --> ENV_PY["π incident_env.py<br/>β Atropos RL environment"]
ENVS --> ENV_CFG["βοΈ incident_config.yaml<br/>β training configuration"]
DEMO --> DEMO_PY["π demo_incident.py<br/>β standalone demo"]
TESTS --> TEST_PY["π test_incident_env.py<br/>β pytest test suite"]
DOCS --> SETUP["π SETUP.md"]
DOCS --> WRITEUP["π WRITEUP.md"]
style ROOT fill:#1e3a5f,color:#fff
style SKILL_MD fill:#1a4731,color:#fff
style ENV_PY fill:#3d2068,color:#fff
style DEMO_PY fill:#7b2d00,color:#fff
style TEST_PY fill:#2d2d2d,color:#fff
Installation (Full Hermes Setup)
1. Install Hermes Agent
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
2. Configure Hermes
hermes setup # Interactive setup wizard
hermes model # Choose your model (Nous Portal recommended)
hermes gateway setup # Connect Telegram/Discord for alerts
3. Install the Incident Commander Skill
# Copy the skill to Hermes's skills directory
cp -r skills/incident-commander ~/.hermes/skills/
# Verify it's loaded
hermes
> /skills
4. Set Up Monitoring Cron Jobs
In your Hermes conversation:
Set up incident monitoring: run a health check every 5 minutes and alert me
on Telegram if anything is P0 or P1. Send me a daily briefing at 08:00.
Hermes will install the cron jobs automatically.
5. Run the RL Training Environment (Optional)
# Install Atropos
pip install atroposlib
# Generate SFT training data
python environments/incident_env.py process --config environments/incident_config.yaml
# Full RL training (requires VLLM)
python environments/incident_env.py serve --config environments/incident_config.yaml
Reward Function (for RL Training)
The training environment uses a multi-component reward that captures real SRE quality:
pie title Reward Components
"Resolution β Did the incident get fixed?" : 50
"RCA Quality β Root cause explained?" : 15
"Report Quality β Post-mortem written?" : 15
"Skill Created β Prevention skill added?" : 10
"Response Speed β Fast MTTR?" : 5
"Tool Efficiency β Minimal tool calls?" : 5
Incident Scenarios (Training Scenarios)
| ID | Severity | Category | Description |
|---|---|---|---|
svc-crash-nginx |
P0 | service | nginx crashed, website unreachable |
disk-full-logs |
P1 | disk | 95% disk usage from exploded log files |
memory-leak-process |
P1 | memory | Mystery process eating 150MB+ |
cpu-runaway-process |
P2 | cpu | 95% CPU from runaway computation |
failed-systemd-unit |
P2 | service | Custom worker service in failed state |
Running Tests
# Install test dependencies
pip install pytest pytest-asyncio
# Run full test suite
pytest tests/ -v
# Run specific test classes
pytest tests/test_incident_env.py::TestScenarioDefinitions -v
pytest tests/test_incident_env.py::TestRewardFunction -v
pytest tests/test_incident_env.py::TestSkillFile -v
Why This Wins
Real problem, real impact. P0 incidents cost companies thousands of dollars per minute. Shaving 30 minutes off MTTR with an autonomous agent is immediately valuable.
Uses every Hermes capability. Memory, skills, cron, gateway, subagents, session search, execute_code - all integrated into a coherent, meaningful workflow.
Self-improving. The longer Hermes runs, the better it gets at your specific infrastructure. This is Hermes's core promise - "the agent that grows with you" - demonstrated concretely.
Closes the training loop. The Atropos RL environment means this isn't just a demo - it's a path to training models that are genuinely better at agentic SRE tasks.
Ships with working code. The demo runs standalone, the tests pass, and the skill file installs in one command.
License
MIT
Built with Hermes Agent - the agent that grows with you.