Lethe044/hermes-incident-commander

Name: hermes-incident-commander
Author: Lethe044

Autonomous SRE agent built on Hermes - detects, heals, and learns from production incidents. Uses Memory + Skills + Cron + Gateway + Subagents + Atropos RL.

★ 37 · Language: Python · License: MIT · Updated: 2026-04-04


overview

Hermes Incident Commander is an autonomous SRE agent that automates the detection, diagnosis, and remediation of production infrastructure failures. It uses the Hermes Agent framework to monitor system vitals, classify incident severity, and apply fixes across CPU, memory, disk, and service layers. After resolution, the agent generates a structured post-mortem report and creates a new prevention skill so that similar incidents are handled more efficiently in the future. The system integrates with Telegram and Discord for real-time alerting and uses cron scheduling for continuous health audits.

Automates incident triage, root cause analysis, and remediation actions.
Generates persistent prevention skills and post-incident reports automatically.
Integrates with Telegram and Discord for real-time P0 alerts.

full readme from github

⚕ Hermes Incident Commander

An autonomous SRE agent that detects, diagnoses, and heals production infrastructure - then learns from every incident it resolves.

Built on Hermes Agent by NousResearch. Submitted for the "Show us what Hermes Agent can do" challenge.

The Problem

When a production server goes down at 3 AM, an on-call engineer has to:

1. Wake up, check alerts
2. SSH in, run diagnostics manually
3. Piece together root cause from logs
4. Apply a fix - hopefully the right one
5. Verify it worked
6. Write a post-mortem nobody will read

Mean time to resolve (MTTR) for P0 incidents averages 45–60 minutes. Much of that is humans doing things a sufficiently capable agent could do faster and better.

Hermes Incident Commander does all of it - autonomously, in minutes, getting smarter with each incident it handles.

Demo

# Install dependencies
pip install anthropic rich

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run a demo incident (disk full scenario)
python demo/demo_incident.py --scenario disk-full-logs

# Try other scenarios
python demo/demo_incident.py --scenario svc-crash-nginx
python demo/demo_incident.py --scenario cpu-runaway-process

What you'll see:

1. Hermes detects the incident and classifies severity (P0/P1/P2/P3)
2. Runs parallel diagnostics across CPU, memory, disk, and services
3. Identifies root cause with explicit reasoning
4. Applies the safest effective fix
5. Verifies the fix worked
6. Writes a structured post-incident report to ~/.hermes/incidents/
7. Creates a new prevention skill in ~/.hermes/skills/ so it handles this faster next time
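
The report-writing step above could look roughly like the sketch below. Only the `~/.hermes/incidents/` directory comes from this README; the `Incident` fields, the file naming scheme, and the report layout are assumptions made for illustration and may not match what the demo actually writes.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class Incident:
    # Hypothetical fields; the real report schema is not shown in this README.
    scenario_id: str
    severity: str
    root_cause: str
    fix_applied: str
    verified: bool


def write_report(incident: Incident, base_dir: Path, now: datetime) -> Path:
    """Write a plain-text post-incident report and return its path."""
    base_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp prefix keeps reports sorted chronologically (assumed naming scheme).
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    path = base_dir / f"{stamp}-{incident.scenario_id}.md"
    lines = [
        f"Incident: {incident.scenario_id}",
        f"Severity: {incident.severity}",
        f"Root cause: {incident.root_cause}",
        f"Fix applied: {incident.fix_applied}",
        f"Verified: {'yes' if incident.verified else 'no'}",
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Calling `write_report(incident, Path.home() / ".hermes" / "incidents", datetime.now(timezone.utc))` (with `timezone` imported from `datetime`) would place the file in the directory named above. Passing `base_dir` and `now` explicitly keeps the function easy to test against a temporary directory.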

How It Uses Every Hermes Feature

This project was designed to push every capability of Hermes Agent:

| Hermes Feature | How It's Used |
| --- | --- |
| Persistent Memory | Builds a system topology map over time. Learns which services fail together, time-of-day patterns, and which remediations work on YOUR infrastructure. |
| Skill Auto-Creation | After every novel incident, writes a new `SKILL.md` prevention playbook. Hermes gets measurably better at your stack over weeks. |
| Cron Scheduler | Every 5 min: critical health check. Every hour: full audit. Daily 08:00: morning briefing to Telegram. |
| Gateway (Telegram/Discord) | Real-time P0 alerts, resolution notices, and daily briefings delivered to your phone. |
| Subagent Spawning | For multi-service environments, spawns parallel subagents to investigate nginx, database, and application layers simultaneously. |
| Session Search (FTS5) | "Have we seen this error before?" - searches past incidents for matching patterns. |
| execute_code | Collapses multi-step diagnostic pipelines into single inference turns, dramatically reducing latency. |
| MCP Integration | Connects to cloud provider APIs (AWS/GCP/Azure MCP servers) for auto-scaling and cloud-native remediation. |
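
To illustrate the parallel-investigation idea from the Subagent Spawning row, here is a minimal sketch using `asyncio.gather`. It is not how Hermes spawns subagents (that API is not shown in this README); `investigate`, `investigate_all`, and the canned findings are hypothetical stand-ins.

```python
import asyncio


async def investigate(layer: str) -> dict:
    """Hypothetical stand-in for one subagent investigating one layer."""
    await asyncio.sleep(0.01)  # simulate I/O-bound diagnostic work
    # Canned results for illustration only; a real check would inspect the host.
    canned = {"nginx": "down", "database": "healthy", "application": "healthy"}
    return {"layer": layer, "status": canned.get(layer, "unknown")}


async def investigate_all(layers: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(investigate(layer) for layer in layers))


def unhealthy(findings: list[dict]) -> list[str]:
    """Return the layers whose status is anything other than healthy."""
    return [f["layer"] for f in findings if f["status"] != "healthy"]
```

Running `asyncio.run(investigate_all(["nginx", "database", "application"]))` investigates the three layers concurrently, so total wall time is roughly that of the slowest check rather than the sum of all three.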

Architecture

flowchart TD
    ALERT([🚨 Incident Alert]) --> DETECT

    DETECT["🔍 DETECT<br/>Gather system vitals<br/>CPU • Memory • Disk • Services"]
    TRIAGE["⚖️ TRIAGE<br/>Classify severity<br/>P0 · P1 · P2 · P3"]
    DIAGNOSE["🔬 DIAGNOSE<br/>Root cause analysis<br/>Logs · Processes · Stack traces"]
    REMEDIATE["🔧 REMEDIATE<br/>Apply safest fix<br/>Tier 1 → 2 → 3"]
    VERIFY["✅ VERIFY<br/>Confirm resolution<br/>Before vs after metrics"]

    DETECT --> TRIAGE --> DIAGNOSE --> REMEDIATE --> VERIFY

    CRON["⏱️ CRON<br/>Every 5 min: health check<br/>Every hour: full audit<br/>Daily 08:00: briefing"]
    CRON -->|triggers| DETECT

    LEARN["🧠 LEARN<br/>Write post-incident report<br/>Create prevention SKILL.md<br/>Update MEMORY.md<br/>Search past incidents (FTS5)"]
    VERIFY --> LEARN

    GATEWAY["📲 GATEWAY<br/>Telegram · Discord · Slack"]
    TRIAGE -->|"🚨 P0/P1 alert"| GATEWAY
    VERIFY -->|"✅ resolved"| GATEWAY
    CRON -->|"📋 daily briefing"| GATEWAY

    style DETECT fill:#1e3a5f,color:#fff
    style TRIAGE fill:#7b2d00,color:#fff
    style DIAGNOSE fill:#1e3a5f,color:#fff
    style REMEDIATE fill:#1a4731,color:#fff
    style VERIFY fill:#1a4731,color:#fff
    style LEARN fill:#3d2068,color:#fff
    style CRON fill:#2d2d2d,color:#fff
    style GATEWAY fill:#2d2d2d,color:#fff
    style ALERT fill:#7b2d00,color:#fff
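
The REMEDIATE and VERIFY stages in the flowchart ("Apply safest fix, Tier 1 → 2 → 3") can be read as an escalation loop: try the least invasive fix first and only move up a tier if verification fails. The sketch below shows that reading. What each tier contains is not defined in this README, so the tier names and actions are supplied by the caller, and `remediate` itself is a hypothetical helper.

```python
from typing import Callable, Optional

# Each tier pairs a label with an action; earlier tiers are assumed to be safer.
Tier = tuple[str, Callable[[], None]]


def remediate(tiers: list[Tier], verify: Callable[[], bool]) -> Optional[str]:
    """Apply tiers in order, verifying after each one.

    Returns the name of the tier that resolved the incident,
    or None if every tier was tried without success.
    """
    for name, action in tiers:
        action()
        if verify():
            return name
    return None  # nothing worked; a real system would escalate to a human
```

Verifying after every tier, instead of once at the end, is what keeps the loop from applying a riskier fix when a safer one has already worked.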

Project Structure

graph LR
    ROOT["📁 hermes-incident-commander"]

    ROOT --> SKILLS["📁 skills/"]
    ROOT --> ENVS["📁 environments/"]
    ROOT --> DEMO["📁 demo/"]
    ROOT --> TESTS["📁 tests/"]
    ROOT --> DOCS["📁 docs/"]
    ROOT --> REQ["📄 requirements.txt"]

    SKILLS --> SKILL_MD["📄 incident-commander/SKILL.md<br/>← install into ~/.hermes/skills/"]

    ENVS --> ENV_PY["🐍 incident_env.py<br/>← Atropos RL environment"]
    ENVS --> ENV_CFG["⚙️ incident_config.yaml<br/>← training configuration"]

    DEMO --> DEMO_PY["🐍 demo_incident.py<br/>← standalone demo"]

    TESTS --> TEST_PY["🐍 test_incident_env.py<br/>← pytest test suite"]

    DOCS --> SETUP["📄 SETUP.md"]
    DOCS --> WRITEUP["📄 WRITEUP.md"]

    style ROOT fill:#1e3a5f,color:#fff
    style SKILL_MD fill:#1a4731,color:#fff
    style ENV_PY fill:#3d2068,color:#fff
    style DEMO_PY fill:#7b2d00,color:#fff
    style TEST_PY fill:#2d2d2d,color:#fff

Installation (Full Hermes Setup)

1. Install Hermes Agent

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

2. Configure Hermes

hermes setup        # Interactive setup wizard
hermes model        # Choose your model (Nous Portal recommended)
hermes gateway setup  # Connect Telegram/Discord for alerts

3. Install the Incident Commander Skill

# Copy the skill to Hermes's skills directory
cp -r skills/incident-commander ~/.hermes/skills/

# Verify it's loaded
hermes
> /skills

4. Set Up Monitoring Cron Jobs

In your Hermes conversation:

Set up incident monitoring: run a health check every 5 minutes and alert me
on Telegram if anything is P0 or P1. Send me a daily briefing at 08:00.

Hermes will install the cron jobs automatically.
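
For reference, the schedule requested above, together with the hourly audit listed in the feature table, maps onto standard five-field cron expressions as sketched below. How Hermes stores or names its jobs is not documented here, so the job names are illustrative, and `is_due` is a deliberately tiny matcher that handles only the patterns used in this schedule.

```python
from datetime import datetime

# Standard five-field cron expressions (minute hour day-of-month month day-of-week).
# Job names are hypothetical; only the intervals come from this README.
SCHEDULES = {
    "critical-health-check": "*/5 * * * *",  # every 5 minutes
    "full-audit": "0 * * * *",               # every hour, on the hour
    "morning-briefing": "0 8 * * *",         # daily at 08:00
}


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field; supports only '*', '*/n', and a single integer."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def is_due(expr: str, when: datetime) -> bool:
    """Return True if `when` matches the minute and hour fields of `expr`.

    Day-of-month, month, and day-of-week must be '*' in this sketch.
    """
    minute, hour, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles minute and hour fields")
    return _field_matches(minute, when.minute) and _field_matches(hour, when.hour)
```

At 08:00 all three jobs are due at once, which is worth keeping in mind if the health check and the briefing compete for the same model or gateway quota.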

5. Run the RL Training Environment (Optional)

# Install Atropos
pip install atroposlib

# Generate SFT training data
python environments/incident_env.py process --config environments/incident_config.yaml

# Full RL training (requires vLLM)
python environments/incident_env.py serve --config environments/incident_config.yaml

Reward Function (for RL Training)

The training environment uses a multi-component reward that captures real SRE quality:

pie title Reward Components
    "Resolution — Did the incident get fixed?" : 50
    "RCA Quality — Root cause explained?" : 15
    "Report Quality — Post-mortem written?" : 15
    "Skill Created — Prevention skill added?" : 10
    "Response Speed — Fast MTTR?" : 5
    "Tool Efficiency — Minimal tool calls?" : 5

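Read as weights, the pie chart describes a weighted sum. The sketch below shows that arithmetic under two assumptions: each component is scored in the range 0 to 1, and the dictionary keys are hypothetical names chosen here. The actual scoring logic lives in `environments/incident_env.py` and may differ.

```python
# Weights taken from the pie chart above (percentages expressed as fractions).
WEIGHTS = {
    "resolution": 0.50,
    "rca_quality": 0.15,
    "report_quality": 0.15,
    "skill_created": 0.10,
    "response_speed": 0.05,
    "tool_efficiency": 0.05,
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1].

    Missing components count as 0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * score
    return total
```

With these weights, an episode that fixes the incident but produces no root cause analysis, report, or skill scores at most 0.60, so resolution alone is not enough to reach a high reward.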
Incident Scenarios (Training Scenarios)

| ID | Severity | Category | Description |
| --- | --- | --- | --- |
| `svc-crash-nginx` | P0 | service | nginx crashed, website unreachable |
| `disk-full-logs` | P1 | disk | 95% disk usage from exploded log files |
| `memory-leak-process` | P1 | memory | Mystery process eating 150MB+ |
| `cpu-runaway-process` | P2 | cpu | 95% CPU from runaway computation |
| `failed-systemd-unit` | P2 | service | Custom worker service in failed state |
Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run full test suite
pytest tests/ -v

# Run specific test classes
pytest tests/test_incident_env.py::TestScenarioDefinitions -v
pytest tests/test_incident_env.py::TestRewardFunction -v
pytest tests/test_incident_env.py::TestSkillFile -v

Why This Wins

Real problem, real impact. P0 incidents cost companies thousands of dollars per minute. Shaving 30 minutes off MTTR with an autonomous agent is immediately valuable.
Uses every Hermes capability. Memory, skills, cron, gateway, subagents, session search, execute_code - all integrated into a coherent, meaningful workflow.
Self-improving. The longer Hermes runs, the better it gets at your specific infrastructure. This is Hermes's core promise - "the agent that grows with you" - demonstrated concretely.
Closes the training loop. The Atropos RL environment means this isn't just a demo - it's a path to training models that are genuinely better at agentic SRE tasks.
Ships with working code. The demo runs standalone, the tests pass, and the skill file installs in one command.

License

MIT

Built with Hermes Agent - the agent that grows with you.