dev tutorial · module 2 of 7 · foundations

The agent loop — the heartbeat

In Module 1 we landed the one big idea: an agent is a loop, not a chatbot. That was the cartoon version. Now we trace the real loop — what actually happens, in which file, when you press Enter — so you can follow a single message from your keyboard, out to the model, and back to the screen.

Hermes Atlas dev tutorial · ~14 min · for builders, no CS degree assumed

Goal for this module — trace one message from keyboard to model and back: the entry path, one turn start to finish, the trajectory, provider-neutrality, and the sacred caching rule.

Quick recap, then the real thing

Last module's picture was: think → call a tool → see the result → repeat → answer. That's true, and it's the whole soul of the system. But a real codebase has to handle messy reality: the model running out of room, asking for eight tools at once, the conversation needing to be saved, and the same loop somehow talking to OpenAI, Anthropic, Gemini, and a model running on your own laptop.

This module opens the hood. We won't read all 3,900 lines of the loop — we'll build a mental model accurate enough that, when you do open the file, every part has an obvious job. The official write-up is the Agent Loop developer guide; this is the friendly companion to it.

The entry path (how a keystroke becomes a loop)

You don't have to memorize this — just see that there's a clean, short trail from typing hermes to the loop starting. Each hop hands off to the next:

Step	What happens
You type `hermes`	A small launcher script runs.
`hermes_cli/main.py`	The `main` entry point. With no command given, it defaults to a chat session.
`HermesCLI` in `cli.py`	Wires up your config and builds an `AIAgent` (defined in `run_agent.py`).
`run_conversation()`	The agent's heartbeat, living in `agent/conversation_loop.py`. This is where we'll spend the rest of the module.

So: hermes → hermes_cli/main.py → HermesCLI → builds an AIAgent → calls run_conversation(). Everything above the loop is just plumbing to get you there. The interesting machine is the loop itself.
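If it helps to see the hand-off as code, here is a minimal sketch of that chain. Everything in it is a hypothetical stand-in. Only the names main, HermesCLI, AIAgent, and run_conversation() come from the table above, and the real signatures will differ. The one behavior it does model is the default: with no subcommand, main falls through to a chat session.

```python
from __future__ import annotations

import argparse


class AIAgent:
    """Hypothetical stand-in for the AIAgent defined in run_agent.py."""

    def __init__(self, config: dict) -> None:
        self.config = config

    def run_conversation(self, message: str) -> str:
        # The real heartbeat lives in agent/conversation_loop.py; this stub only echoes.
        return f"(reply to: {message})"


class HermesCLI:
    """Hypothetical stand-in: wire up config, then build one AIAgent."""

    def __init__(self, config: dict) -> None:
        self.agent = AIAgent(config)

    def send(self, message: str) -> str:
        return self.agent.run_conversation(message)


def main(argv: list[str] | None = None, message: str = "hello") -> str:
    parser = argparse.ArgumentParser(prog="hermes")
    subcommands = parser.add_subparsers(dest="command")
    subcommands.add_parser("chat")
    args = parser.parse_args(argv)
    # With no command given, argparse leaves `command` as None: default to chat.
    command = args.command or "chat"
    cli = HermesCLI(config={"command": command})
    return cli.send(message)
```

Running `main([])` and `main(["chat"])` take the same path, which is the "defaults to a chat session" row of the table.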

One turn, start to finish

The key word is turn. A turn is everything that happens for one user message — from "you hit Enter" to "Hermes replies." Inside one turn, the model might call ten tools and talk to the provider five times. You see one question and one answer; under the hood, a lot of laps happened.

A turn has three parts: a prologue that runs once at the start, a main loop that repeats until there's an answer, and an epilogue that runs once at the end. Here's the whole thing:

You hit Enter on a message
        |
        v
  +----------------------------------------------------------+
  |  PROLOGUE — runs once   (agent/turn_context.py)          |
  |  build_turn_context(): clean up the message, build or    |
  |  restore the system prompt, set up the turn.             |
  +----------------------------------------------------------+
        |
        v
  +----------------------------------------------------------+
  |  MAIN LOOP — repeats (capped at a max # of laps)         |
  |                                                          |
  |   build the message list  -->  call the LLM (streaming)  |
  |                                       |                  |
  |                          read finish_reason  <--         |
  |                                       |                  |
  |      +--------------------------------+----------+       |
  |      v                v                          v       |
  |  "tool_calls"        "stop"                  "length"    |
  |  run the tools,      final answer —          out of room:|
  |  append results,     append it &             compress    |
  |  loop again          exit loop               history,    |
  |      |                                       retry       |
  |      +------------- back to top -------------------+     |
  +----------------------------------------------------------+
        |
        v
  +----------------------------------------------------------+
  |  EPILOGUE — runs once                                    |
  |  save the conversation to SQLite, run post-turn hooks    |
  |  (memory review, skill suggestions).                     |
  +----------------------------------------------------------+
        |
        v
Hermes replies on your screen

Let's walk the three parts.

Prologue — set the stage (once)

build_turn_context() (in agent/turn_context.py) does the pre-flight: it sanitizes your message (strips characters that would choke a provider), builds or restores the system prompt (the standing instructions that tell the model who it is and what tools exist), and sets up the turn's bookkeeping. This happens once per turn, not once per lap. Hold onto that: it matters in the caching section below.
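A minimal sketch of that pre-flight, assuming a session object that remembers the system prompt after the first turn. The names (Session, TurnContext, sanitize, build_system_prompt) are hypothetical. The sanitizer's rule (drop control characters other than newline and tab, plus lone surrogates) is an illustrative guess at "characters that would choke a provider", not the actual filter.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass


@dataclass
class TurnContext:
    """Hypothetical per-turn bookkeeping."""
    user_message: str
    system_prompt: str


@dataclass
class Session:
    # Remembered across turns, so later turns restore the prompt instead of rebuilding it.
    system_prompt: str | None = None
    build_count: int = 0


def sanitize(text: str) -> str:
    # Keep newline and tab; drop other control chars ("Cc") and lone surrogates ("Cs"),
    # which JSON encoders and provider APIs commonly reject.
    return "".join(
        ch for ch in text
        if ch in ("\n", "\t") or unicodedata.category(ch) not in ("Cc", "Cs")
    )


def build_system_prompt(tool_names: list[str]) -> str:
    return f"You are Hermes. Available tools: {', '.join(sorted(tool_names))}."


def build_turn_context(session: Session, message: str, tool_names: list[str]) -> TurnContext:
    if session.system_prompt is None:  # first turn: build the prompt
        session.system_prompt = build_system_prompt(tool_names)
        session.build_count += 1
    # Every later turn gets the exact same string back, untouched.
    return TurnContext(user_message=sanitize(message), system_prompt=session.system_prompt)
```

Note that the second turn ignores a changed tool list on purpose: restoring the prompt verbatim is the point, for the caching reasons covered later.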

Main loop — think, act, repeat (many times)

Now the laps begin. Each lap is the same three beats:

Build the message list — assemble the running conversation so far (more on this in the trajectory section).
Call the LLM — send it off, streaming the response token-by-token by default so you see text appear live.
Read the finish_reason — the model tells you why it stopped talking. This one field decides what happens next.

There are three outcomes worth knowing, and the whole loop pivots on them:

`finish_reason`	Meaning	What the loop does
`tool_calls`	"I want to use some tools."	Execute them (in `agent/tool_executor.py`) — one at a time, or concurrently up to ~8 at once. Append each result as a `tool` message. Loop again.
`stop`	"Here's my final answer."	Append the answer, exit the loop, show it to you. Turn done.
`length`	"I ran out of room."	Quietly compress older history (`agent/context_compressor.py`) and retry. You never see this happen.
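The tool_calls row is the busiest one. Here is a sketch of how a batch of tool calls could run concurrently with a cap of 8 workers, using Python's standard ThreadPoolExecutor. Two assumptions to flag: each call carries an id, and each result message echoes it back as tool_call_id (the OpenAI chat format's convention; the simplified trajectory later in this module leaves both out). The function and registry names are hypothetical, not tool_executor.py's API.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

MAX_PARALLEL_TOOLS = 8  # mirrors the "~8 at once" figure; the real limit may differ


def run_tool_calls(tool_calls: list[dict[str, Any]],
                   registry: dict[str, Callable[..., str]],
                   parallel: bool = True) -> list[dict[str, Any]]:
    """Run each requested tool; return one `tool` message per call, in request order."""

    def run_one(call: dict[str, Any]) -> dict[str, Any]:
        fn = registry.get(call["name"])
        if fn is None:
            content = f"error: unknown tool {call['name']!r}"
        else:
            try:
                content = fn(**call["arguments"])
            except Exception as exc:  # report the failure to the model instead of crashing
                content = f"error: {exc}"
        return {"role": "tool", "tool_call_id": call["id"], "content": content}

    if not parallel or len(tool_calls) <= 1:
        return [run_one(call) for call in tool_calls]
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TOOLS) as pool:
        # map() yields results in input order, even if tools finish out of order.
        return list(pool.map(run_one, tool_calls))
```

Keeping results in request order means the trajectory reads in the same order the model asked, and returning errors as tool messages gives the model a chance to recover on the next lap.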

That tool_calls → run tools → loop again cycle is the engine. It's exactly the cartoon loop from Module 1, just with a real name for the signal (finish_reason) and a real file doing the work (tool_executor.py). And there's a safety belt: the loop is capped at a maximum number of laps, so a confused model can't spin forever.
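Here is the main loop as a short sketch: call the model, read finish_reason, branch, and stop after a fixed number of laps. The model call, tool runner, and compressor are passed in as plain functions so the sketch runs on its own. The names and the cap of 20 are hypothetical, not conversation_loop.py's actual API or default.

```python
from __future__ import annotations

from typing import Any, Callable

Message = dict[str, Any]
MAX_LAPS = 20  # hypothetical cap; the real default is not given in this module


def main_loop(trajectory: list[Message],
              call_llm: Callable[[list[Message]], dict[str, Any]],
              run_tools: Callable[[list[dict[str, Any]]], list[Message]],
              compress: Callable[[list[Message]], list[Message]],
              max_laps: int = MAX_LAPS) -> str:
    for _ in range(max_laps):
        # Beat 1: in this sketch the message list is simply the trajectory itself.
        response = call_llm(trajectory)      # beat 2: call the model
        reason = response["finish_reason"]   # beat 3: read why it stopped
        if reason == "tool_calls":
            request = response["message"]
            trajectory.append(request)       # the assistant's tool request
            trajectory.extend(run_tools(request["tool_calls"]))
        elif reason == "stop":
            answer = response["message"]
            trajectory.append(answer)
            return answer["content"]
        elif reason == "length":
            # Discard the truncated reply, shrink history in place, and retry.
            trajectory[:] = compress(trajectory)
        else:
            raise RuntimeError(f"unexpected finish_reason: {reason!r}")
    raise RuntimeError(f"gave up after {max_laps} laps without a final answer")
```

With a scripted fake model that returns tool_calls once and then stop, this reproduces the Tokyo example from the trajectory section: the trajectory ends as user, assistant, tool, assistant.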

The length case is quietly clever — Models have a finite context window — a budget for how much text they can consider at once. Long conversations eventually blow past it. Instead of erroring out in your face, Hermes catches the length signal, summarizes/trims the older parts of the conversation, and tries again. From your seat it just... keeps working. We dig into how in the compression & caching docs.
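One common way to do that compression, shown as a sketch: keep the system prompt and the most recent messages, and fold everything in between into a single summary message. The keep_recent window, the summary's role and wording, and the summarize callback (in practice usually another model call) are all assumptions here, not how context_compressor.py actually decides.

```python
from __future__ import annotations

from typing import Any, Callable

Message = dict[str, Any]


def compress_history(trajectory: list[Message],
                     summarize: Callable[[list[Message]], str],
                     keep_recent: int = 6) -> list[Message]:
    """Replace the older middle of the conversation with one summary message."""
    head = trajectory[:1] if trajectory and trajectory[0]["role"] == "system" else []
    body = trajectory[len(head):]
    if len(body) <= keep_recent:
        return list(trajectory)  # nothing old enough to fold away
    cut = len(body) - keep_recent
    # Never start the kept tail on a tool result: it would be orphaned from the
    # assistant message that requested it.
    while cut < len(body) and body[cut]["role"] == "tool":
        cut += 1
    old, recent = body[:cut], body[cut:]
    summary = {"role": "user",
               "content": "[Summary of earlier conversation] " + summarize(old)}
    return head + [summary] + recent
```

The small while loop matters: a tool result kept without its request is the kind of malformed history providers generally refuse. Note too that the summary replaces messages near the front of the history, which is exactly why compression is the one sanctioned exception to the caching rule discussed below.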

Epilogue — clean up (once)

Once the loop exits with an answer, the turn wraps up: the conversation is saved to SQLite (that's the state.db file from Module 1), and post-turn hooks fire — the background nudges that review what just happened for memory-worthy facts and suggest new skills. That self-improving behavior we flagged in Module 1? This is where it gets kicked off, after you already have your answer, so it never slows down your reply.
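A sketch of the epilogue using Python's built-in sqlite3 module. The table layout is made up for illustration (the real state.db schema isn't described in this module). What it shows is the shape: persist the turn's new messages in one transaction, then hand them to the post-turn hooks.

```python
from __future__ import annotations

import json
import sqlite3
from typing import Any, Callable

Message = dict[str, Any]
Hook = Callable[[list[Message]], None]


def save_turn(conn: sqlite3.Connection, session_id: str, messages: list[Message]) -> None:
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " session_id TEXT, seq INTEGER, role TEXT, body TEXT,"
        " PRIMARY KEY (session_id, seq))"
    )
    (start,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE session_id = ?", (session_id,)
    ).fetchone()
    with conn:  # one transaction: the whole turn is saved, or none of it is
        conn.executemany(
            "INSERT INTO messages VALUES (?, ?, ?, ?)",
            [(session_id, start + i, m["role"], json.dumps(m)) for i, m in enumerate(messages)],
        )


def epilogue(conn: sqlite3.Connection, session_id: str,
             new_messages: list[Message], hooks: list[Hook]) -> None:
    save_turn(conn, session_id, new_messages)
    # Post-turn hooks (e.g. memory review, skill suggestions). This sketch calls them
    # inline after saving; the real ones may run in the background.
    for hook in hooks:
        hook(new_messages)
```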

The trajectory — the conversation is a list

Here's the data structure that makes the whole thing tick, and it's simpler than you'd guess. The conversation is just a growing list of messages. Each message has a role, some content, and — when the model is requesting tools — a tool_calls field. This growing list has a name in the codebase:

Define: trajectory — The trajectory is the full, ordered list of messages in a conversation — every user message, every assistant reply, every tool request, and every tool result, in the order they happened. It's the agent's working memory for the current turn. The loop reads the trajectory to build each model call, and appends to it after every lap.

There are three roles you'll see constantly. Here's a tiny slice of a trajectory after the model decided to check the time:

[
  { "role": "user",
    "content": "what time is it in tokyo?" },

  { "role": "assistant",           // the model asks for a tool
    "content": null,
    "tool_calls": [{ "name": "get_time",
                    "arguments": { "tz": "Asia/Tokyo" } }] },

  { "role": "tool",                // the result comes back as its own message
    "content": "2026-06-15 22:14 JST" }
]

Read it top to bottom and the loop's logic falls out: the user asks; the assistant replies, but instead of prose its reply is a request to run a tool; the tool result is appended as a brand-new message. Then the loop runs again, the model sees that tool message in the trajectory, and now it can answer "It's 10:14 PM in Tokyo" with finish_reason: "stop". The trajectory only ever grows: the first lap appended two messages (the tool request and its result), the second appended one (the answer). That's the entire dance.
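The slice above is deliberately trimmed. If the trajectory follows the OpenAI chat-completions message shape (an assumption; this module doesn't say which shape Hermes stores), the real thing carries two more details: every tool request has an id that the tool result points back to via tool_call_id, and the arguments travel as a JSON-encoded string rather than an object. This sketch builds the same Tokyo exchange in that shape and counts what each lap adds. The id "call_1" and the helper unanswered_calls are made up for illustration.

```python
import json

trajectory = [{"role": "user", "content": "what time is it in tokyo?"}]

# Lap 1, finish_reason == "tool_calls": the assistant's request, then one
# `tool` message per call it made.
trajectory.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",  # made-up id
        "type": "function",
        "function": {"name": "get_time",
                     "arguments": json.dumps({"tz": "Asia/Tokyo"})},  # a JSON *string*
    }],
})
trajectory.append({"role": "tool", "tool_call_id": "call_1",
                   "content": "2026-06-15 22:14 JST"})
after_lap_1 = len(trajectory)

# Lap 2, finish_reason == "stop": one message, the final answer.
trajectory.append({"role": "assistant", "content": "It's 10:14 PM in Tokyo."})


def unanswered_calls(messages):
    """Ids of tool calls that have no matching tool result yet."""
    asked = {c["id"] for m in messages for c in m.get("tool_calls") or []}
    answered = {m["tool_call_id"] for m in messages if m["role"] == "tool"}
    return asked - answered
```

unanswered_calls is a handy invariant check: between laps it should come back empty, because every request the model made has its result in the trajectory.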

One loop, every provider

Reasonable question: if there's just one loop, how does it talk to OpenAI, Anthropic, Gemini, and a model on your laptop — each of which speaks a slightly different dialect? The answer is a small abstraction that keeps the dialects out of the loop.

Think of a universal power adapter — Your laptop charger ends in USB-C. The wall socket is different in every country. You don't rewire your laptop per country — you snap on the right plug adapter and the laptop never knows the difference. In Hermes, the loop is the laptop, and each provider is a different country's socket. A provider profile is the adapter that makes them all fit.

Concretely: providers/base.py defines a ProviderProfile — a declaration of how a given provider behaves. Per-provider adapters (like agent/anthropic_adapter.py) translate between the loop's neutral message format and what that specific provider expects on the wire. The payoff for you:

Why a builder cares — Adding support for a new provider means declaring a profile and writing an adapter — not touching conversation_loop.py. The loop stays provider-blind. This is the "narrow waist" principle from Module 1 in action: capability lives at the edges, the core stays small. More in the architecture overview.
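To make the adapter idea concrete, here is a sketch of one translation: from OpenAI-style neutral messages to the body shape of Anthropic's Messages API (a top-level system field, tool_use blocks in assistant turns, tool_result blocks inside user turns). ProviderProfile here is a hypothetical two-field stand-in, not the class in providers/base.py, and agent/anthropic_adapter.py will handle more cases than this.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any, Callable

Message = dict[str, Any]


def to_anthropic(messages: list[Message]) -> dict[str, Any]:
    """Translate neutral (OpenAI-style) messages into an Anthropic Messages API body."""
    system_parts: list[str] = []
    out: list[Message] = []
    for m in messages:
        if m["role"] == "system":
            system_parts.append(m["content"])  # Anthropic takes `system` as a top-level field
        elif m["role"] == "tool":
            block = {"type": "tool_result", "tool_use_id": m["tool_call_id"],
                     "content": m["content"]}
            last = out[-1] if out else None
            if last and last["role"] == "user" and isinstance(last["content"], list):
                last["content"].append(block)  # batch consecutive results into one user turn
            else:
                out.append({"role": "user", "content": [block]})
        elif m["role"] == "assistant" and m.get("tool_calls"):
            blocks = [{"type": "text", "text": m["content"]}] if m.get("content") else []
            for call in m["tool_calls"]:
                blocks.append({"type": "tool_use", "id": call["id"],
                               "name": call["function"]["name"],
                               "input": json.loads(call["function"]["arguments"])})
            out.append({"role": "assistant", "content": blocks})
        else:
            out.append({"role": m["role"], "content": m["content"]})
    body: dict[str, Any] = {"messages": out}
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    return body


@dataclass(frozen=True)
class ProviderProfile:
    """Hypothetical shape, for illustration only."""
    name: str
    adapter: Callable[[list[Message]], dict[str, Any]]


ANTHROPIC = ProviderProfile("anthropic", to_anthropic)
OPENAI = ProviderProfile("openai", lambda msgs: {"messages": msgs})  # already neutral


def build_request(profile: ProviderProfile, messages: list[Message]) -> dict[str, Any]:
    # The loop would only ever call this; it never checks which provider it is.
    return profile.adapter(messages)
```

build_request never inspects profile.name, so supporting another provider means adding another profile and adapter while the loop's code stays the same.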

The one sacred rule: the system prompt is built once

Remember the prologue — it builds the system prompt once, and the loop reuses it on every lap. That's not laziness; it's a deliberate, load-bearing rule, and understanding it will explain a surprising number of design choices later in this series.

Per-conversation prompt caching is sacred — Model providers can cache a prompt prefix they've seen before and charge you far less to reuse it. Hermes leans on this hard: the system prompt (often huge: all the tool definitions, instructions, your profile) is assembled once and reused verbatim every single turn, so the provider can serve it from cache instead of re-reading it from scratch. Because that same large prefix is resent on every lap of every turn, the savings add up quickly. It's also why the architecture treats past context as immutable: it does not rewrite earlier messages mid-conversation, since changing one would invalidate the cached prefix from that message onward. The only sanctioned exception is compression (the length case), which only happens when there's no choice.

Keep this in your pocket. When you later wonder "why doesn't Hermes just edit that earlier message?" or "why is the prompt assembled in this rigid order?" — the answer is almost always "to protect the cache." The prompt-assembly docs go deep on how that prompt is built.
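A toy model of why editing history is expensive. Real providers match cached prefixes on tokens and each has its own rules, so treat this as an illustration, not a cost calculator: here the reusable part of a request is simply the run of leading messages identical to the previous request. The helper name reusable_prefix is made up.

```python
from __future__ import annotations

from typing import Any

Message = dict[str, Any]


def reusable_prefix(previous: list[Message], current: list[Message]) -> int:
    """How many leading messages `current` shares verbatim with `previous`.

    Everything before the first difference could be served from cache;
    everything from the first difference onward must be processed again.
    """
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


system = {"role": "system", "content": "You are Hermes. Tools: get_time, web_search."}
lap1 = [system, {"role": "user", "content": "what time is it in tokyo?"}]

# Append-only (what Hermes does): the whole previous request is still a prefix.
lap2 = lap1 + [{"role": "assistant", "content": "It's 10:14 PM in Tokyo."}]

# Editing an earlier message (what Hermes avoids): reuse stops at the edit.
edited = [system, {"role": "user", "content": "what time is it in osaka?"}] + lap2[2:]

# Changing the system prompt: nothing can be reused at all.
changed_system = [{"role": "system", "content": "You are Hermes. Tools: get_time."}] + lap2[1:]
```

Appending keeps all of the previous request reusable. Rewriting the first user message drops reuse to just the system prompt, and touching the system prompt drops it to zero, which is why the prompt is assembled once and then left alone.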

Why this matters for you

Here's the freeing part. You almost never need to touch conversation_loop.py. The loop is the stable heart, and it's meant to stay that way. When you extend Hermes you work at the two edges the loop already reaches for:

Give the loop a new tool to call — a new ability the model can invoke during a tool_calls lap (Module 3).
Give the model better instructions — sharper prompts and skills about when to reach for things (Module 4).

Either way, the loop keeps doing its thing — think, act, repeat — and your work plugs into the edges it already exposes. That's the payoff of a narrow waist: you get to build a lot without ever risking the heartbeat.

Key takeaways

A turn is everything that happens for one user message: a one-time prologue, a repeating main loop, a one-time epilogue.
The loop pivots on the model's finish_reason: tool_calls → run tools & loop, stop → answer & exit, length → compress & retry. A max-lap cap prevents infinite spinning.
The conversation is a growing list of messages called the trajectory, with roles user / assistant / tool.
One loop serves every provider via provider profiles + adapters — add a provider by declaring a profile, not editing the loop.
Prompt caching is sacred: the system prompt is built once and reused verbatim, so past context is treated as immutable (compression is the only exception).
You extend at the edges (tools, instructions); the loop stays stable.

Quick check — The model returns finish_reason: "tool_calls". Does Hermes show the user a reply now?

Answer: No. tool_calls means the model wants to use tools, not that it's finished. The loop runs those tools (in tool_executor.py), appends each result to the trajectory as a tool message, and loops again so the model can react to the results. The user only gets a reply when the model returns finish_reason: "stop" — its final answer.

Pair with your AI — Trace the real loop in your own clone to make this concrete: "Open agent/conversation_loop.py in github.com/NousResearch/hermes-agent. Walk me through one turn of run_conversation() at a conceptual level: where it calls build_turn_context (the prologue), where the main loop branches on finish_reason (tool_calls vs stop vs length), and where the epilogue saves to SQLite and fires post-turn hooks. Point me to the line ranges but explain it like I'm a product builder, not a compiler."
