autocontext

a recursively self-improving harness designed to help your agents (and future iterations of those agents) succeed on any task


turn repeated agent work into validated, reusable execution

autocontext runs LLM agents through structured scenarios, evaluates their outputs, and accumulates the knowledge that improved results β€” so repeated runs get better, not just different. Point the harness at a real task in plain language, let it work the problem, and then inspect the traces, reports, artifacts, datasets, playbooks, and optional distilled model it produces.

What's New

What actually is autocontext?

Most agent systems still start every run cold. They do not reliably preserve what worked, separate signal from noise, or turn repeated success into a reusable asset.

autocontext is built to close that loop:

How People Use It

How It Works

The product model centers on a few stable ideas:

Inside a run, autocontext uses a structured multi-agent loop:

Strategies are then evaluated through scenario execution, staged validation, and gating. Weak changes are rolled back. Successful changes accumulate into reusable knowledge.
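The propose-evaluate-gate-rollback cycle can be sketched in miniature. This is an illustrative toy, not autocontext's actual API: `evaluate` and `improvement_loop` are hypothetical names, and a noisy random score stands in for real scenario execution.

```python
import copy
import random

def evaluate(strategy: dict) -> float:
    """Stand-in scorer; in autocontext this would be scenario execution."""
    return strategy["quality"] + random.uniform(-0.05, 0.05)

def improvement_loop(strategy: dict, generations: int = 3, gate: float = 0.0) -> dict:
    """Propose a change each generation; keep it only if it clears the gate."""
    best = copy.deepcopy(strategy)
    best_score = evaluate(best)
    knowledge: list[str] = []
    for gen in range(generations):
        candidate = copy.deepcopy(best)
        candidate["quality"] += random.uniform(-0.1, 0.2)  # mutated proposal
        score = evaluate(candidate)
        if score > best_score + gate:  # staged validation + gating
            best, best_score = candidate, score
            knowledge.append(f"gen {gen}: kept change, score {score:.2f}")
        # else: the weak change is rolled back (candidate is discarded)
    best["knowledge"] = knowledge
    return best
```

The key property mirrored here is that rejected candidates never touch the accepted state, so repeated runs can only accumulate changes that survived the gate.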

Which Surface Fits Which Job

| Surface | When to use it |
| --- | --- |
| `run` | Improve behavior inside a reusable scenario or task across generations |
| `simulate` | Model a system, explore parameter sweeps, or compare replayable outcomes |
| `investigate` | Evidence-driven diagnosis with hypotheses and confidence scoring |
| `analyze` | Inspect or compare runs, simulations, investigations, or missions after the fact |
| `mission` | Verifier-driven goal advanced step by step with checkpoints and completion criteria |
| `campaign` | Coordinate multiple missions with budget tracking, dependencies, and progress aggregation |
| `train` | Distill stable exported data into a cheaper local runtime |
| `replay` | Inspect what happened before deciding what knowledge should persist |

`campaign` now ships as a TypeScript CLI/API/MCP workflow for multi-mission coordination. The Python package does not yet expose a campaign control-plane surface.

Choose An Entry Point

Scenario Families

All 11 families are executable in both Python and TypeScript. TypeScript uses V8 isolate codegen for secure execution; Python uses subprocess-based executors.
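As a rough illustration of the subprocess-executor idea (not autocontext's actual executor implementation; `run_strategy_code` is a hypothetical helper), strategy code can be run in a child interpreter so a crash or hang cannot take down the harness:

```python
import subprocess
import sys

def run_strategy_code(code: str, timeout: float = 10.0) -> str:
    """Run strategy code in a separate Python process and capture its stdout.

    A non-zero exit raises instead of silently returning partial output,
    and the timeout bounds runaway strategies.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

The isolation guarantee is weaker than a V8 isolate (the child still shares the filesystem), which is one reason the two language stacks use different execution backends.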

| Family | Evaluation | What it tests |
| --- | --- | --- |
| `game` | Tournament with Elo | Turn-based strategy (`grid_ctf`, `othello`) |
| `agent_task` | LLM judge | Prompt-centric tasks with optional improvement loops |
| `simulation` | Trace evaluation | Action-trace scenarios with mock environments and fault injection |
| `artifact_editing` | Artifact validation | File, config, and schema modification with diff tracking |
| `investigation` | Evidence chains | Diagnosis accuracy with red-herring detection |
| `workflow` | Workflow evaluation | Transactional flows with compensation, retry, and side-effect tracking |
| `negotiation` | Negotiation evaluation | Hidden preferences, BATNA constraints, and opponent modeling |
| `schema_evolution` | Schema adaptation | Mid-run state changes where agents must detect stale context |
| `tool_fragility` | Drift adaptation | APIs that drift, requiring agents to adapt to changed tool behavior |
| `operator_loop` | Judgment evaluation | Escalation and clarification judgment in operator-in-the-loop workflows |
| `coordination` | Coordination evaluation | Multi-agent partial context, handoff, merge, and duplication detection |
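For the `game` family, tournament ranking with Elo follows the standard update rule. A minimal sketch (the harness's actual K-factor and pairing scheme may differ):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update: score_a is 1.0 for an A win, 0.5 draw, 0.0 loss.

    Returns the new ratings for A and B; the points A gains are exactly
    the points B loses, so total rating is conserved.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Run over a round-robin of matches, this yields a stable ranking even when individual game outcomes are noisy, which is why it suits turn-based strategy scenarios.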

Core Capabilities

Providers

Runtime routing across multiple LLM backends:

Runtimes

Agent runtimes control how agents execute during runs:

Executors

Strategy and code execution backends:

Quick Start From Source

The Python application lives in `autocontext/`, and most `uv`, `pytest`, `ruff`, and `mypy` commands should be run from there.

```shell
cd autocontext
uv venv
source .venv/bin/activate
uv sync --group dev

AUTOCONTEXT_AGENT_PROVIDER=deterministic uv run autoctx solve \
  --description "improve customer-support replies for billing disputes" \
  --gens 3
```

That hands the harness a real task, materializes the working scenario, runs the loop, and writes traces and artifacts under `runs/` and `knowledge/`. With the deterministic provider, no external API keys are needed.

Run with Anthropic:

```shell
cd autocontext
AUTOCONTEXT_AGENT_PROVIDER=anthropic \
ANTHROPIC_API_KEY=your-key \
uv run autoctx solve --description "improve customer-support replies for billing disputes" --gens 3
```

ANTHROPIC_API_KEY is the preferred Anthropic credential env var. AUTOCONTEXT_ANTHROPIC_API_KEY remains supported as a compatibility alias.

Run with Claude CLI:

```shell
cd autocontext
AUTOCONTEXT_AGENT_PROVIDER=claude-cli \
AUTOCONTEXT_CLAUDE_MODEL=sonnet \
uv run autoctx solve --description "improve customer-support replies for billing disputes" --gens 3
```

Run with Codex CLI:

```shell
cd autocontext
AUTOCONTEXT_AGENT_PROVIDER=codex \
AUTOCONTEXT_CODEX_MODEL=o4-mini \
uv run autoctx solve --description "improve customer-support replies for billing disputes" --gens 3
```

Start the API server:

```shell
cd autocontext
uv run autoctx serve --host 127.0.0.1 --port 8000
```

Then open http://127.0.0.1:8000/ for the API index, or run `npx autoctx tui` for the interactive terminal UI.

Use the repo-level `.env.example` as the reference for available `AUTOCONTEXT_*` settings and supported provider-native credential aliases such as `ANTHROPIC_API_KEY`.
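Putting the settings shown in the quick-start commands above into a single file might look like the following. Values are illustrative only; treat the repo's `.env.example` as authoritative:

```
# Provider selection: deterministic, anthropic, claude-cli, or codex
AUTOCONTEXT_AGENT_PROVIDER=anthropic

# Preferred Anthropic credential (AUTOCONTEXT_ANTHROPIC_API_KEY is a compatibility alias)
ANTHROPIC_API_KEY=your-key

# Model overrides for the CLI providers
AUTOCONTEXT_CLAUDE_MODEL=sonnet
AUTOCONTEXT_CODEX_MODEL=o4-mini
```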

Installable Packages

The repo publishes two installable packages with different scopes:

Important:

The Python package exposes the full autoctx control-plane CLI for scenario execution, API serving, exports, training, and operator workflows. The TypeScript package exposes the autoctx CLI and library surface for simulations, investigations, analysis, mission control, MCP serving, and Node integrations.

Which Package Should You Use?

| If you want to... | Start here | Why |
| --- | --- | --- |
| Run the full multi-generation control plane | `autocontext/README.md` | Python has the API server, training loop, scenario scaffolding, export/import, and full CLI surface. |
| Run simulations, investigations, analysis, or missions from Node | `ts/README.md` | The TypeScript package is focused on operator-facing workflows, integrations, mission control, and MCP serving. |
| Embed autocontext in a Node app or operator workflow | `ts/README.md` | The TypeScript package also exposes library surfaces for evaluation, artifacts, publishing, and integrations. |
| Point an external agent at autocontext | `autocontext/docs/agent-integration.md` | It documents the CLI-first contract, JSON output, MCP usage, and SDK options. |
| Grab copy-paste integration snippets | `examples/README.md` | The examples cover Python CLI, Claude Code MCP, Python SDK, and TypeScript library usage. |
| Catch up on recent repo evolution | `CHANGELOG.md` | It summarizes recent public releases and notable changes. |

Common Workflows

Representative TypeScript operator workflows:

The operator-in-the-loop family (`operator_loop`) is fully runnable in both Python and TypeScript. It tests escalation and clarification judgment with real escalation/clarification hooks and behavioral-contract signals across multi-run, sweep, and replay flows.

MLX training is host-only on Apple Silicon macOS. If you want a sandboxed OpenClaw agent to trigger training, use the file-based host watcher flow documented in autocontext/docs/mlx-training.md.

Repository Layout

Where To Look Next

Acknowledgments

Thanks to George for generously donating the autocontext name on PyPI.
