# Hermes Agent Hackathon — Skill Distillation Pipeline
Demo concept: Use Hermes agent's real-world tool usage to generate high-quality agentic training trajectories for Hermes 4 fine-tuning.
## The Idea
Hermes agent already runs real tasks for real users. Every session is a potential training example. This project turns that latent signal into a closed learning loop:
Hermes agent runs tasks → trajectories captured → judge scores them → Atropos fine-tunes Hermes 4 → better model → better agent
The key insight: real-world grounded trajectories beat synthetic benchmarks. A model that learned to use tools by actually using them in production is fundamentally different from one trained on curated academic tasks.
## What We Built
A `RealWorldTaskEnv` Hermes environment that:
- Runs a diverse task battery — 30 tasks across coding, web research, file ops, data analysis, sysadmin. Things users actually ask agents to do.
- Scores trajectories automatically — multi-dimensional reward: task completion (via ToolContext verification), efficiency, error recovery.
- Exports SFT-ready JSONL — `process` mode writes data that drops straight into Atropos.
- Connects to Atropos for live RL — `serve` mode wires directly into GRPO training.
- Before/after comparison — `demo/compare_models.py` shows Hermes 4-14B vanilla vs. fine-tuned on 500 trajectories.
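A single exported trajectory line might look like the following. This is an illustrative sketch of one SFT-ready record — the field names (`messages`, `reward`) are assumptions, not the environment's actual export schema:

```python
import json

# Hypothetical shape of one trajectory record in trajectories.jsonl
# (one JSON object per line).
record = {
    "messages": [
        {"role": "user", "content": "Find the three largest files under /var/log."},
        {"role": "assistant", "content": "<tool_call>...</tool_call>"},
        {"role": "tool", "content": "..."},
        {"role": "assistant", "content": "The three largest files are ..."},
    ],
    "reward": 0.87,
}
line = json.dumps(record)  # serialises to a single JSONL line
```

Keeping the full multi-turn tool exchange (not just the final answer) is what makes the data usable for agentic SFT rather than plain instruction tuning.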
## Quickstart
```bash
# Install
pip install git+https://github.com/NousResearch/hermes-agent.git
pip install git+https://github.com/NousResearch/atropos.git

# Generate SFT data (no training server needed)
python environments/real_world_task_env/real_world_task_env.py process \
  --config environments/real_world_task_env/default.yaml \
  --env.data_path_to_save_groups trajectories.jsonl \
  --openai.model_name NousResearch/Hermes-4-14B

# Run benchmark (evaluate before/after)
python environments/real_world_task_env/real_world_task_env.py evaluate \
  --config environments/real_world_task_env/default.yaml \
  --openai.model_name NousResearch/Hermes-4-14B

# Live RL training (connect to Atropos)
run-api &  # start Atropos API server
python environments/real_world_task_env/real_world_task_env.py serve \
  --config environments/real_world_task_env/default.yaml \
  --openai.model_name NousResearch/Hermes-4-14B
```
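After the `process` run, the exported file can be inspected with a few lines of Python. A minimal sketch, assuming one JSON object per line (the exact fields depend on the Atropos export format):

```python
import json

def load_trajectories(lines) -> list[dict]:
    """Parse JSONL content: one trajectory group per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Usage:
#   with open("trajectories.jsonl") as f:
#       groups = load_trajectories(f)
#   print(len(groups), "trajectory groups")
```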
## Demo Script
```bash
# Full before/after comparison demo
bash demo/run_demo.sh
```
This runs 10 held-out tasks on both the baseline and fine-tuned model, prints a side-by-side score table, and saves trajectories for inspection.
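The side-by-side table could be produced by something as simple as the sketch below — `compare_models.py` may structure its output differently, and the task names and scores here are made up for illustration:

```python
def comparison_rows(task_ids, base, tuned):
    """Build (task, baseline_score, tuned_score) rows plus a final mean row."""
    rows = [(t, base[t], tuned[t]) for t in task_ids]
    mean = lambda d: sum(d[t] for t in task_ids) / len(task_ids)
    rows.append(("mean", mean(base), mean(tuned)))
    return rows

# Hypothetical scores for two held-out tasks:
for task, b, ft in comparison_rows(
    ["debug_script", "parse_csv"],
    {"debug_script": 0.55, "parse_csv": 0.70},
    {"debug_script": 0.80, "parse_csv": 0.90},
):
    print(f"{task:<16}{b:>10.2f}{ft:>12.2f}")
```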
## Architecture
- `real_world_task_env.py` — Environment class (`HermesAgentBaseEnv` subclass)
- `tasks.json` — 30 diverse real-world tasks with verification specs
- `judge_prompt.py` — LLM judge for open-ended task scoring
- `default.yaml` — Default config (Modal backend, tool subset)
- `demo/`
  - `run_demo.sh` — One-command demo
  - `compare_models.py` — Side-by-side model comparison
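An entry in `tasks.json` pairs a prompt with a machine-checkable verification spec. A plausible shape — every field name here is an illustrative assumption, not the file's actual schema:

```python
import json

# Hypothetical tasks.json entry: the prompt plus a verification spec
# that ToolContext can evaluate after the agent's run.
task = {
    "id": "fileops_parse_csv",
    "category": "file_ops",
    "prompt": "Parse data.csv and write the rows with value > 100 to filtered.csv.",
    "max_turns": 12,
    "verify": {"type": "file_exists", "path": "filtered.csv"},
}
print(json.dumps(task, indent=2))
```

Encoding verification declaratively keeps scoring deterministic for tasks with checkable outcomes, leaving the LLM judge only for open-ended ones.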
## Reward Function
Three-component reward (all 0.0–1.0, weighted):
| Component | Weight | How |
|---|---|---|
| Completion | 0.6 | ToolContext verification (file exists, tests pass, output correct) |
| Efficiency | 0.2 | `1.0 - (turns_used / max_turns)` — fewer steps = higher score |
| Recovery | 0.2 | Judge model assesses whether agent handled errors gracefully |
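The weighting above can be sketched as a simple function (argument names are illustrative, not the environment's actual field names):

```python
def efficiency_score(turns_used: int, max_turns: int) -> float:
    """Fewer turns relative to the budget yields a higher score."""
    return 1.0 - (turns_used / max_turns)

def combine_reward(completion: float, efficiency: float, recovery: float) -> float:
    """Weighted sum of the three reward components, each in [0.0, 1.0]."""
    assert all(0.0 <= x <= 1.0 for x in (completion, efficiency, recovery))
    return 0.6 * completion + 0.2 * efficiency + 0.2 * recovery

# A fully verified run that used half its turn budget and recovered
# cleanly from errors: 0.6 + 0.2 * 0.5 + 0.2 = 0.9
score = combine_reward(1.0, efficiency_score(5, 10), 1.0)
```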
## Task Categories
| Category | Examples | Count |
|---|---|---|
| Coding | Debug a script, add tests, refactor a function | 8 |
| Web research | Summarise a topic, extract data from a URL | 6 |
| File ops | Organise files, parse CSV, transform data | 6 |
| Sysadmin | Find large files, check processes, write a cron | 5 |
| Data analysis | Analyse a dataset, plot a chart, compute stats | 5 |
## Why This Matters
Hermes agent's loop is: real task → tool use → outcome. That's the exact data distribution Hermes 4 needs to become an agentic model. This pipeline closes the loop — the agent that serves users becomes the teacher that trains the next version.
Built for the Nous Research Hermes Agent Hackathon.