
Hermes Agent Meta-Harness

hermes-agent-metaharness is the standalone outer-loop Meta-Harness repo for Hermes.

It treats hermes-agent as the execution backend for benchmark harness candidates and focuses on the outer optimization loop: searching over, evaluating, and comparing candidate harnesses.

Origin

This project is directly inspired by the paper Meta-Harness: End-to-End Optimization of Model Harnesses and the companion project page.

The paper’s core argument is that LLM system quality depends not only on model weights, but also on the harness: the surrounding code that decides what context to collect, store, retrieve, and show to the model. Instead of hand-tuning that harness, Meta-Harness proposes an outer-loop optimizer that searches over harness code. Its proposer has access to the source code, scores, and execution traces of prior candidates through a filesystem, which gives it much richer diagnostic context than methods that only optimize from scores or short summaries. The paper reports gains on online text classification, retrieval-augmented math reasoning, and agentic coding, including improved TerminalBench-2 harnesses.
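
The outer loop the paper describes can be sketched in miniature. Everything below (the numeric "candidate", the toy score function) is a stand-in for real harness source files and benchmark runs; it illustrates the propose → evaluate → archive shape, not the paper's actual implementation:

```python
import random

def propose(parent, archive, rng):
    # Hypothetical proposer. In the paper this is an LLM that reads the
    # archive (prior source code, scores, traces) from a filesystem;
    # here it is a toy numeric mutation of the current best candidate.
    return parent + rng.uniform(-1.0, 1.0)

def evaluate(candidate):
    # Hypothetical benchmark score standing in for a real harness run.
    return -(candidate - 3.0) ** 2

def outer_loop(steps=50, seed=0):
    rng = random.Random(seed)
    archive = []  # every attempt stays visible to the proposer
    best, best_score = 0.0, evaluate(0.0)
    for _ in range(steps):
        cand = propose(best, archive, rng)
        score = evaluate(cand)
        archive.append((cand, score))
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

The point of the archive is exactly the paper's argument: the proposer sees full prior candidates and their outcomes, not just a scalar score history.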

How Hermes Adapts Meta-Harness

Hermes uses the same high-level idea, but adapts it to a research-safe benchmark workflow: the project applies Meta-Harness to Hermes by optimizing how Hermes is run on benchmarks, not by changing model weights and not by letting the production runtime self-modify.

Boundary

hermes-agent owns the inner Meta-Harness runtime and acts as the execution backend for each candidate.

hermes-agent-metaharness owns the outer loop: candidate search, evaluation, run comparison, and frontier tracking.

Current Scope

The current release provides the CLI workflow shown under Quick Start: evaluating candidates, comparing runs, running a small deterministic search, and inspecting the frontier.

Quick Start

git clone https://github.com/howdymary/hermes-agent-metaharness.git
cd hermes-agent-metaharness
pip install -e ".[dev]"

Point it at Hermes with --hermes-repo, optionally selecting an interpreter with --python-executable and a benchmark config with --hermes-config-path.

If Hermes needs to run inside a managed environment, Meta-Harness can launch it through a shell-style prefix passed as --launcher-prefix, for example "uv run --python 3.12 --extra rl".

Choosing a Backend

Users should choose the strongest coding backend available in their Hermes benchmark config. Meta-Harness does not hardcode a model provider; it delegates backend choice to Hermes through --hermes-config-path.

For example, you can point Meta-Harness at any Hermes benchmark config that defines a stronger coding model:

python -m meta_harness evaluate-candidate \
  --candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --python-executable /path/to/hermes-agent/.mh-venv/bin/python \
  --hermes-config-path /path/to/your_stronger_backend.yaml

Dry-run a built-in Hermes candidate on TBLite:

python -m meta_harness evaluate-candidate \
  --candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl" \
  --dry-run

Compare two Hermes Meta-Harness run directories:

python -m meta_harness compare-runs \
  --baseline-run /path/to/baseline-run \
  --candidate-run /path/to/candidate-run
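
Conceptually, a run comparison reduces to loading each run directory's summary metrics and diffing them. The metrics.json filename and the "score" field in this sketch are assumptions made for illustration, not the tool's actual run-directory schema:

```python
import json
from pathlib import Path

def compare_runs(baseline_dir, candidate_dir):
    """Diff two run directories by their summary score.

    Assumes each directory contains a metrics.json with a top-level
    "score" field -- a hypothetical layout for illustration only.
    """
    def score(run_dir):
        return json.loads((Path(run_dir) / "metrics.json").read_text())["score"]

    baseline, candidate = score(baseline_dir), score(candidate_dir)
    return {"baseline": baseline, "candidate": candidate,
            "delta": candidate - baseline}
```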

Run a candidate directly against a baseline and emit a richer report:

python -m meta_harness evaluate-vs-baseline \
  --candidate candidates/template_candidate.py \
  --baseline-candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl"

Reuse an existing baseline run instead of rerunning baseline:

python -m meta_harness evaluate-vs-baseline \
  --candidate candidates/template_candidate.py \
  --baseline-run /path/to/baseline-run \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent

Run a small deterministic search over generated wrapper candidates:

python -m meta_harness search-candidates \
  --seed-candidate candidates/template_candidate.py \
  --baseline-candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl"
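
A deterministic search of this shape can be sketched as exhaustive enumeration over a fixed mutation space applied to a seed candidate. The mutation axes below (retry_on_error, context_window, trace_verbosity) are invented for illustration and are not the repo's actual mutation space:

```python
import itertools

# Hypothetical mutation space: each axis toggles one wrapper behavior.
MUTATIONS = {
    "retry_on_error": [False, True],
    "context_window": [4000, 8000],
    "trace_verbosity": ["low", "high"],
}

def generate_candidates(seed_config):
    # Deterministic: sorted key order plus itertools.product means every
    # run enumerates the same candidates in the same order.
    keys = sorted(MUTATIONS)
    for values in itertools.product(*(MUTATIONS[k] for k in keys)):
        candidate = dict(seed_config)
        candidate.update(zip(keys, values))
        yield candidate

def search(seed_config, score_fn):
    # Score every generated candidate and keep the best.
    return max(generate_candidates(seed_config), key=score_fn)
```

Because enumeration order is fixed, two runs with the same seed candidate and scorer always pick the same winner, which is what makes the search reproducible.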

Inspect the current frontier for a benchmark:

python -m meta_harness show-frontier \
  --frontier-path output/frontier.json \
  --benchmark tblite
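
Reading a frontier file reduces to loading the JSON and ranking the entries recorded for one benchmark. The schema assumed here (benchmark name mapped to a list of {"candidate", "score"} entries) is illustrative, not necessarily frontier.json's real layout:

```python
import json
from pathlib import Path

def show_frontier(frontier_path, benchmark):
    """Return a benchmark's frontier entries, best score first.

    Assumes frontier.json maps benchmark names to lists of
    {"candidate": ..., "score": ...} dicts -- a hypothetical schema.
    """
    data = json.loads(Path(frontier_path).read_text())
    entries = data.get(benchmark, [])
    return sorted(entries, key=lambda e: e["score"], reverse=True)
```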

Repo Layout

meta_harness/
├── archive_reader.py
├── baseline.py
├── benchmark_runner.py
├── candidate_registry.py
├── cli.py
├── comparison.py
├── config.py
├── frontier.py
├── models.py
├── mutation.py
├── search.py
└── __main__.py

Candidate files can live in candidates/, with an example in candidates/template_candidate.py.

Two local benchmark configs are also included in configs/ for smoke-testing against an Ollama OpenAI-compatible endpoint on http://localhost:11434/v1.
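
A minimal config of that shape might look like the following. The field names are assumptions, not the actual Hermes config schema; only the endpoint URL comes from the repo's own description:

```yaml
# Illustrative smoke-test config -- field names are hypothetical.
backend:
  api_base: http://localhost:11434/v1   # Ollama's OpenAI-compatible endpoint
  api_key: ollama                       # Ollama accepts any placeholder key
  model: qwen2.5-coder:7b               # any locally pulled coding model tag
```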

Release Notes

This repo is intentionally research-oriented; it optimizes benchmark workflows rather than the production runtime.

Near-Term Roadmap

  1. Better ranking/reporting and frontier-backed baseline policies
  2. More expressive mutation spaces and composition
  3. Trace-driven reflective candidate improvement
  4. Frontier-aware search strategies
  5. Stronger benchmark-aware candidate generation