howdymary/hermes-agent-metaharness

Name: hermes-agent-metaharness
Author: howdymary

Meta-Harness for Hermes Agent: meta-optimization of the agent harness, with a reference to the arXiv paper

Overview

Hermes Agent Meta-Harness is a standalone outer-loop optimization framework for improving the performance of Hermes agents on verifiable coding benchmarks. It treats the agent as an execution backend and searches over harness configurations (the code governing context collection and retrieval) rather than modifying model weights. The system orchestrates benchmark evaluations, tracks performance frontiers, and applies structured candidate mutations to find better harness variants. It currently supports the TBLite and TB2 benchmarks through a research-safe workflow that emphasizes archive analysis and deterministic search.

Optimizes agent harness code through automated outer-loop search
Orchestrates TBLite and TB2 benchmark evaluations via Hermes
Tracks performance frontiers with structured candidate mutation and comparison

Full README from GitHub

Hermes Agent Meta-Harness

hermes-agent-metaharness is the standalone outer-loop Meta-Harness repo for Hermes.

It treats hermes-agent as the execution backend for benchmark harness candidates and focuses on:

candidate resolution
benchmark evaluation orchestration
archive reading
run comparison
richer baseline-vs-candidate reporting
frontier tracking
structured candidate mutation and search

Origin

This project is directly inspired by the paper Meta-Harness: End-to-End Optimization of Model Harnesses and the companion project page.

The paper’s core argument is that LLM system quality depends not only on model weights, but also on the harness: the surrounding code that decides what context to collect, store, retrieve, and show to the model. Instead of hand-tuning that harness, Meta-Harness proposes an outer-loop optimizer that searches over harness code. Its proposer has access to the source code, scores, and execution traces of prior candidates through a filesystem, which gives it much richer diagnostic context than methods that only optimize from scores or short summaries. The paper reports gains on online text classification, retrieval-augmented math reasoning, and agentic coding, including improved TerminalBench-2 harnesses.

How Hermes Adapts Meta-Harness

Hermes uses the same high-level idea, but adapts it to a research-safe benchmark workflow:

hermes-agent owns the inner runtime: candidate protocol, benchmark integration, loop hooks, and archive writing.
hermes-agent-metaharness owns the outer loop: candidate evaluation, archive analysis, baseline reuse, frontier tracking, and search.
The current target is verifiable coding benchmarks such as TBLite and TB2, not general production chat behavior.
Candidate search is intentionally conservative today: this repo generates deterministic wrapper candidates around a seed candidate instead of rewriting Hermes core.

In other words, the project applies Meta-Harness to Hermes by optimizing how Hermes is run on benchmarks, not by changing model weights and not by letting the production runtime self-modify.
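As a rough illustration of what "deterministic wrapper candidates around a seed candidate" could look like, the sketch below enumerates a small grid of overrides in a stable order. The knob names in MUTATION_SPACE and the function generate_wrapper_specs are hypothetical and are not taken from this repo's mutation.py.

```python
from itertools import product

# Hypothetical mutation space: these knob names are illustrative only.
MUTATION_SPACE = {
    "max_context_files": [4, 8],
    "retry_on_failure": [False, True],
}


def generate_wrapper_specs(seed_path, space):
    """Enumerate wrapper specs in a stable order so repeated runs agree."""
    keys = sorted(space)  # sorted keys make the enumeration order deterministic
    specs = []
    for values in product(*(space[key] for key in keys)):
        overrides = dict(zip(keys, values))
        name = "wrap_" + "_".join(f"{k}-{v}" for k, v in overrides.items())
        specs.append({"seed": seed_path, "name": name, "overrides": overrides})
    return specs


specs = generate_wrapper_specs("candidates/template_candidate.py", MUTATION_SPACE)
for spec in specs:
    print(spec["name"], spec["overrides"])
```

Because nothing in the enumeration is random, running the search twice over the same seed and space yields the same candidate list, which keeps reports comparable across runs.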

Boundary

hermes-agent owns the inner Meta-Harness runtime:

candidate protocol
TB2/TBLite integration
optional loop hooks
per-task archive writing

hermes-agent-metaharness owns the outer loop:

candidate evaluation and comparison
archive analysis
baseline helpers
frontier management
mutation and search

Current Scope

The current release provides:

candidate resolution by explicit path or Hermes built-in candidate name
TBLite and TB2 benchmark orchestration through Hermes
archive parsing for manifest.json, summary.json, and tasks/*.json
paired baseline-vs-candidate evaluation and reporting
baseline reuse from an existing run or the current frontier-best entry
task-selection comparability metadata for reused baselines
outer-loop provenance metadata with candidate/config hashes and launcher details
explicit task-set comparability checks for baseline-vs-candidate reports
a simple JSON-backed frontier with cross-platform locking
deterministic wrapper-mutation search over generated candidate variants
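The archive-parsing item above can be pictured with a minimal sketch that loads the three kinds of files named in the list. Only the file names (manifest.json, summary.json, tasks/*.json) come from this README; the function read_run_archive and the returned dictionary layout are assumptions, and the real archive_reader.py may work differently.

```python
import json
from pathlib import Path


def read_run_archive(run_dir):
    """Load manifest.json, summary.json, and every tasks/*.json from a run directory."""
    run_dir = Path(run_dir)
    archive = {
        "manifest": json.loads((run_dir / "manifest.json").read_text(encoding="utf-8")),
        "summary": json.loads((run_dir / "summary.json").read_text(encoding="utf-8")),
        "tasks": {},
    }
    # Sort the task files so the resulting mapping is built in a stable order.
    for task_file in sorted((run_dir / "tasks").glob("*.json")):
        archive["tasks"][task_file.stem] = json.loads(task_file.read_text(encoding="utf-8"))
    return archive
```

The sketch makes no assumption about the keys inside each JSON file; it only mirrors the directory shape described above.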

Quick Start

git clone https://github.com/howdymary/hermes-agent-metaharness.git
cd hermes-agent-metaharness
pip install -e ".[dev]"

Point it at a Hermes checkout in one of three ways:

setting HERMES_AGENT_REPO=/path/to/hermes-agent
a sibling checkout at ../hermes-agent
a checkout at ~/.hermes/hermes-agent
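One way to picture the lookup is the sketch below, which tries each location in turn and returns the first directory that exists. The precedence order, the function name resolve_hermes_repo, and the treatment of the current working directory as the Meta-Harness repo root are all assumptions; the README does not state how the real resolver orders these locations.

```python
import os
from pathlib import Path


def resolve_hermes_repo(explicit=None, cwd=None, env=None, home=None):
    """Return the first existing candidate location, or None.

    The precedence order used here is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates = [
        explicit,                          # e.g. a --hermes-repo value
        env.get("HERMES_AGENT_REPO"),      # environment variable
        cwd.parent / "hermes-agent",       # sibling checkout at ../hermes-agent
        home / ".hermes" / "hermes-agent",
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    return None
```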

Check that the Hermes checkout exposes the Meta-Harness benchmark/runtime surface before running evaluations:

python -m meta_harness check-hermes --hermes-repo /path/to/hermes-agent

This repo currently targets the legacy Hermes Meta-Harness surface:

environments/benchmarks/tblite/tblite_env.py
environments/benchmarks/terminalbench_2/terminalbench2_env.py
environments/meta_harness/{candidate.py,loader.py,types.py}

As of 2026-05-23, neither the latest published upstream release, NousResearch/hermes-agent v0.14.0 (v2026.5.16), nor current main ships that legacy surface. Run check-hermes against the exact Hermes checkout you intend to use; if it reports missing benchmark/runtime files, a Hermes-side port or restoration of that inner runtime is required before TBLite/TB2 evaluations can execute.
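To see by hand which of the legacy files a checkout lacks, a small script along the following lines may help. It is not the implementation of check-hermes, only a file-existence sketch built from the three paths listed above; the names LEGACY_SURFACE and missing_surface_files are made up for illustration.

```python
from pathlib import Path

# Paths copied from the legacy surface listed above, with the brace group expanded.
LEGACY_SURFACE = [
    "environments/benchmarks/tblite/tblite_env.py",
    "environments/benchmarks/terminalbench_2/terminalbench2_env.py",
    "environments/meta_harness/candidate.py",
    "environments/meta_harness/loader.py",
    "environments/meta_harness/types.py",
]


def missing_surface_files(hermes_repo):
    """Return the legacy-surface paths that are absent from the given checkout."""
    repo = Path(hermes_repo)
    return [rel for rel in LEGACY_SURFACE if not (repo / rel).is_file()]
```

An empty list means every listed file is present; it does not prove the files are compatible with this repo, which is what check-hermes is for.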

If Hermes needs to run inside a managed environment, Meta-Harness can launch it through a shell-style launcher prefix or an explicit Python executable, for example:

--launcher-prefix "uv run --python 3.12 --extra rl"
--python-executable /path/to/hermes-agent/.mh-venv/bin/python
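A shell-style prefix is typically split into an argument list before being handed to a subprocess. The sketch below shows that step with shlex.split from the standard library. How Meta-Harness actually combines the prefix, the Python executable, and the Hermes entry point is not documented here, so build_command and its argument order are assumptions.

```python
import shlex


def build_command(module_args, launcher_prefix=None, python_executable="python"):
    """Compose an argv list: [prefix tokens...] + [python] + [module args...].

    The composition order is an assumption made for illustration.
    """
    prefix = shlex.split(launcher_prefix) if launcher_prefix else []
    return prefix + [python_executable] + list(module_args)


# "some_hermes_entry_point" is a placeholder, not a real Hermes module name.
command = build_command(
    ["-m", "some_hermes_entry_point"],
    launcher_prefix="uv run --python 3.12 --extra rl",
)
print(command)
```

Passing the resulting list to subprocess.run without shell=True avoids re-parsing the prefix through a shell.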

Choosing a Backend

Users should choose the strongest coding backend available in their Hermes benchmark config. Meta-Harness does not hardcode a model provider; it delegates backend choice to Hermes through --hermes-config-path.

Common options:

OpenRouter or other OpenAI-compatible hosted backends via a Hermes YAML config
local vLLM servers for stronger self-hosted coding models
local Ollama endpoints for smoke tests and low-cost local iteration

For example, you can point Meta-Harness at any Hermes benchmark config that defines a stronger coding model:

python -m meta_harness evaluate-candidate \
  --candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --python-executable /path/to/hermes-agent/.mh-venv/bin/python \
  --hermes-config-path /path/to/your_stronger_backend.yaml

Dry-run a built-in Hermes candidate on TBLite:

python -m meta_harness evaluate-candidate \
  --candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl" \
  --dry-run

Compare two Hermes Meta-Harness run directories:

python -m meta_harness compare-runs \
  --baseline-run /path/to/baseline-run \
  --candidate-run /path/to/candidate-run
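Conceptually, a run comparison boils down to a per-task diff between two result sets. The sketch below assumes each run has already been reduced to a mapping from task id to a pass/fail boolean; that reduction, the function name compare_task_results, and the report keys are illustrative and do not describe the actual compare-runs output.

```python
def compare_task_results(baseline, candidate):
    """Diff two {task_id: passed} mappings over the tasks they share."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "shared_tasks": shared,
        "fixed": [t for t in shared if candidate[t] and not baseline[t]],
        "regressed": [t for t in shared if baseline[t] and not candidate[t]],
        "only_in_baseline": sorted(set(baseline) - set(candidate)),
        "only_in_candidate": sorted(set(candidate) - set(baseline)),
    }


# Made-up task ids and outcomes, purely for demonstration.
baseline = {"task-a": True, "task-b": False, "task-c": True}
candidate = {"task-a": True, "task-b": True, "task-c": False, "task-d": True}
report = compare_task_results(baseline, candidate)
print(report["fixed"], report["regressed"])
```

Tasks that appear in only one run are reported separately rather than counted as wins or losses, since they are not comparable.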

Run a candidate directly against a baseline and emit a richer report:

python -m meta_harness evaluate-vs-baseline \
  --candidate candidates/template_candidate.py \
  --baseline-candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl"

Reuse an existing baseline run instead of rerunning the baseline:

python -m meta_harness evaluate-vs-baseline \
  --candidate candidates/template_candidate.py \
  --baseline-run /path/to/baseline-run \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent

Run a small deterministic search over generated wrapper candidates:

python -m meta_harness search-candidates \
  --seed-candidate candidates/template_candidate.py \
  --baseline-candidate snapshot_baseline \
  --benchmark tblite \
  --hermes-repo /path/to/hermes-agent \
  --launcher-prefix "uv run --python 3.12 --extra rl"

Inspect the current frontier for a benchmark:

python -m meta_harness show-frontier \
  --frontier-path output/frontier.json \
  --benchmark tblite
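For intuition about what a JSON-backed frontier involves, here is a minimal sketch that keeps one best entry per benchmark and writes the file atomically. The file layout, the "candidate" and "score" keys, and the function names are assumptions rather than the schema used by frontier.py. The sketch also omits the cross-platform locking that this repo provides and relies only on an atomic os.replace.

```python
import json
import os
import tempfile
from pathlib import Path


def load_frontier(path):
    """Read the frontier file, or return an empty frontier if it does not exist."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def record_if_better(path, benchmark, candidate, score):
    """Store the candidate as the benchmark's best entry if it beats the current best."""
    path = Path(path)
    frontier = load_frontier(path)
    best = frontier.get(benchmark)
    if best is not None and best["score"] >= score:
        return False
    frontier[benchmark] = {"candidate": candidate, "score": score}
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(frontier, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, path)
    return True
```

Atomic replacement prevents readers from seeing a half-written file, but two concurrent writers could still overwrite each other's updates, which is the gap that locking closes.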

Repo Layout

meta_harness/
├── archive_reader.py
├── baseline.py
├── benchmark_runner.py
├── candidate_registry.py
├── cli.py
├── comparison.py
├── config.py
├── frontier.py
├── hermes_compat.py
├── models.py
├── mutation.py
├── search.py
└── __main__.py

Candidate files can live in candidates/, with an example in candidates/template_candidate.py.

Two local benchmark configs are also included in configs/ for smoke-testing against Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1.

Best Practices Notes

See docs/META_HARNESS_BEST_PRACTICES.md for a May 2026 scan of recent Meta-Harness and agent-benchmarking developments and how they map onto this repo's next steps.

Release Notes

This repo is intentionally research-oriented:

it optimizes harness procedure, not model weights
it is designed around verifiable benchmark feedback
it keeps Hermes core stable by treating Hermes as the execution backend
it validates that a reused baseline has the same task-selection hash as the candidate run before comparing them
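The task-selection hash idea can be sketched as hashing a canonical form of the selected task ids and refusing to compare runs whose hashes differ. The README does not say how the hash is computed, so SHA-256 over sorted ids, and the names task_selection_hash and assert_comparable, are assumptions.

```python
import hashlib
import json


def task_selection_hash(task_ids):
    """Hash the set of task ids so that ordering and duplicates do not matter."""
    canonical = json.dumps(sorted(set(task_ids)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_comparable(baseline_task_ids, candidate_task_ids):
    """Raise ValueError when the two runs did not cover the same task set."""
    if task_selection_hash(baseline_task_ids) != task_selection_hash(candidate_task_ids):
        raise ValueError("baseline and candidate ran different task sets")


# Same tasks in a different order are comparable.
assert_comparable(["task-a", "task-b"], ["task-b", "task-a"])
```

Storing the hash alongside a baseline run lets a later evaluation reject a stale baseline without re-reading every task file.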

Near-Term Roadmap

Better ranking/reporting and frontier-backed baseline policies
More expressive mutation spaces and composition
Trace-driven reflective candidate improvement
Frontier-aware search strategies
Stronger benchmark-aware candidate generation