Hermes Agent Hackathon — Skill Distillation Pipeline

Demo concept: Use Hermes agent's real-world tool usage to generate high-quality agentic training trajectories for Hermes 4 fine-tuning.

The Idea

Hermes agent already runs real tasks for real users. Every session is a potential training example. This project turns that latent signal into a closed learning loop:

Hermes agent runs tasks → trajectories captured → judge scores them → Atropos fine-tunes Hermes 4 → better model → better agent

The key insight: real-world grounded trajectories beat synthetic benchmarks. A model that learned to use tools by actually using them in production is fundamentally different from one trained on curated academic tasks.

What We Built

A RealWorldTaskEnv Hermes environment that:

  1. Runs a diverse task battery — 30 tasks across coding, web research, file ops, data analysis, sysadmin. Things users actually ask agents to do.
  2. Scores trajectories automatically — multi-dimensional reward: task completion (via ToolContext verification), efficiency, error recovery.
  3. Exports SFT-ready JSONL — drop-in for Atropos process mode.
  4. Connects to Atropos for live RL: serve mode wires directly into GRPO training.
  5. Provides a before/after comparison: demo/compare_models.py shows Hermes 4-14B vanilla vs. fine-tuned on 500 trajectories.
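
The exact schema of the exported JSONL is defined by Atropos and not reproduced here; purely as an illustration, one trajectory record might look like the sketch below (every field name — "task_id", "messages", "reward" — is an assumption, not the actual format):

```python
import json

# Hypothetical SFT record: one scored trajectory as a chat-style example.
# Field names are illustrative only; the real schema comes from Atropos.
record = {
    "task_id": "coding-003",
    "messages": [
        {"role": "user", "content": "Debug the failing test in utils.py"},
        {"role": "assistant", "content": "<tool_call>...</tool_call>"},
        {"role": "tool", "content": "1 passed in 0.02s"},
        {"role": "assistant", "content": "Fixed: off-by-one in the slice bounds."},
    ],
    "reward": 0.84,
}

line = json.dumps(record)  # one record per line in trajectories.jsonl
assert json.loads(line)["reward"] == 0.84
```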

Quickstart

# Install
pip install git+https://github.com/NousResearch/hermes-agent.git
pip install git+https://github.com/NousResearch/atropos.git

# Generate SFT data (no training server needed)
python environments/real_world_task_env/real_world_task_env.py process \
  --config environments/real_world_task_env/default.yaml \
  --env.data_path_to_save_groups trajectories.jsonl \
  --openai.model_name NousResearch/Hermes-4-14B

# Run benchmark (evaluate before/after)
python environments/real_world_task_env/real_world_task_env.py evaluate \
  --config environments/real_world_task_env/default.yaml \
  --openai.model_name NousResearch/Hermes-4-14B

# Live RL training (connect to Atropos)
run-api &  # start Atropos API server
python environments/real_world_task_env/real_world_task_env.py serve \
  --config environments/real_world_task_env/default.yaml \
  --openai.model_name NousResearch/Hermes-4-14B

Demo Script

# Full before/after comparison demo
bash demo/run_demo.sh

This runs 10 held-out tasks on both the baseline and fine-tuned model, prints a side-by-side score table, and saves trajectories for inspection.
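compare_models.py itself is not reproduced here; a minimal sketch of the side-by-side aggregation it describes might look like this (model labels and all scores are placeholders, not measured results):

```python
# Hypothetical sketch: collect per-task scores for two models and print
# a side-by-side table. Real scores come from the environment's reward function.
from statistics import mean

baseline = {"task-01": 0.42, "task-02": 0.55, "task-03": 0.61}    # placeholder scores
finetuned = {"task-01": 0.71, "task-02": 0.68, "task-03": 0.80}   # placeholder scores

print(f"{'task':<10}{'baseline':>10}{'fine-tuned':>12}")
for task in sorted(baseline):
    print(f"{task:<10}{baseline[task]:>10.2f}{finetuned[task]:>12.2f}")
print(f"{'mean':<10}{mean(baseline.values()):>10.2f}{mean(finetuned.values()):>12.2f}")
```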

Architecture

real_world_task_env.py   — Environment class (HermesAgentBaseEnv subclass)
tasks.json               — 30 diverse real-world tasks with verification specs
judge_prompt.py          — LLM judge for open-ended task scoring
default.yaml             — Default config (Modal backend, tool subset)
demo/
  run_demo.sh            — One-command demo
  compare_models.py      — Side-by-side model comparison
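
The schema of tasks.json is not shown above; a single entry with a verification spec could plausibly look like the sketch below (every key — "category", "prompt", "verify" — is an assumption for illustration, not the real schema):

```python
# Hypothetical task entry mirroring "tasks with verification specs".
# Keys and values are illustrative only.
task = {
    "id": "file-ops-002",
    "category": "file_ops",
    "prompt": "Parse input.csv and write the sum of the 'price' column to total.txt",
    "max_turns": 12,
    "verify": {
        "type": "file_contents",   # checked via ToolContext after the run
        "path": "total.txt",
        "expected": "1492.50",
    },
}

assert task["verify"]["type"] == "file_contents"
```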

Reward Function

Three-component reward (all 0.0–1.0, weighted):

| Component  | Weight | How |
|------------|--------|-----|
| Completion | 0.6    | ToolContext verification (file exists, tests pass, output correct) |
| Efficiency | 0.2    | 1.0 - (turns_used / max_turns) — fewer steps = higher score |
| Recovery   | 0.2    | Judge model assesses whether the agent handled errors gracefully |
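
Putting the components together, the combined reward could be computed roughly as below. The function and argument names are assumptions; only the weights (0.6 / 0.2 / 0.2) and the efficiency formula come from the description above:

```python
def combined_reward(completed: float, turns_used: int, max_turns: int,
                    recovery: float) -> float:
    """Weighted sum of the three components, each in [0.0, 1.0].

    completed: ToolContext verification result (1.0 = task verified complete)
    recovery:  judge-model score for graceful error handling
    """
    efficiency = 1.0 - (turns_used / max_turns)   # fewer steps = higher score
    return 0.6 * completed + 0.2 * efficiency + 0.2 * recovery

# A verified task finished in 5 of 20 turns with perfect recovery:
# 0.6*1.0 + 0.2*0.75 + 0.2*1.0 = 0.95
assert abs(combined_reward(1.0, 5, 20, 1.0) - 0.95) < 1e-9
```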

Task Categories

| Category      | Examples                                          | Count |
|---------------|---------------------------------------------------|-------|
| Coding        | Debug a script, add tests, refactor a function    | 8 |
| Web research  | Summarise a topic, extract data from a URL        | 6 |
| File ops      | Organise files, parse CSV, transform data         | 6 |
| Sysadmin      | Find large files, check processes, write a cron   | 5 |
| Data analysis | Analyse a dataset, plot a chart, compute stats    | 5 |

Why This Matters

Hermes agent's loop is: real task → tool use → outcome. That's the exact data distribution Hermes 4 needs to become an agentic model. This pipeline closes the loop — the agent that serves users becomes the teacher that trains the next version.

Built for the Nous Research Hermes Agent Hackathon.