# Self-Hosted Honcho for Hermes Agent
Self-host Honcho (Plastic Labs' memory layer) on your own server instead of using their cloud. Works with Hermes Agent out of the box.
No fork required — just 3 config files on top of upstream Honcho.
## Background: Hermes L4 Memory
Hermes Agent has a 4-layer memory system. The cross-session memory layer is powered by Honcho, which builds a deepening model of the user across conversations — extracting observations, recalling context, and consolidating memories over time.
By default, Hermes uses Plastic Labs' managed cloud (honcho.dev) + their Neuromancer models. This works out of the box but means your conversation data and user profile live on their servers.
### What are Neuromancer models?
Neuromancer XR is a specialized 8B model fine-tuned from Qwen3-8B specifically for extracting logical conclusions from conversations. Unlike general-purpose LLMs which are optimized for plausible text generation, Neuromancer is trained on ~10,000 curated social reasoning traces to follow formal logic — extracting both explicit facts ("user said they like Python") and deductive conclusions ("user is likely a developer").
It scores 86.9% on the LoCoMo memory benchmark vs. 69.6% for base Qwen3-8B and 80.0% for Claude 4 Sonnet.
Tradeoff of not using it: General-purpose models work well for observation extraction and memory recall — Honcho's prompts and tool-calling pipeline compensate for much of the gap. You may get slightly less precise deductive reasoning, but capable models (GLM-5, Grok 4.1) with strong function calling largely close the difference. The main advantage of self-hosting is data sovereignty, not matching Neuromancer's exact reasoning quality.
## Deployment Options
| Option | Privacy | Data location | LLM for memory | Setup | Cost |
|---|---|---|---|---|---|
| Managed cloud (default) | Low — data + inference on 3rd party | Plastic Labs servers | Neuromancer (Plastic Labs) | None — built into Hermes | Free tier / paid |
| Self-hosted + API (this repo) | Medium — data on your machine, inference via API | Your machine | Any OpenAI-compatible API | ~3 minutes | API usage only |
| Self-hosted + local model | High — nothing leaves your network | Your machine | Local LLM (Ollama, vLLM) | More setup | Hardware only |
**Managed cloud** — Zero setup. Best for getting started. Your data is on Plastic Labs' infrastructure.

**Self-hosted + API** — This repo. Your data stays on your machine. LLM calls go to a cloud API for inference only — the provider sees request content but doesn't store your memory data. Best balance of privacy and capability.

**Self-hosted + local model** — Maximum privacy. No data leaves your network. Requires a GPU or capable CPU on your LAN running an inference server (Ollama, vLLM, llama.cpp). Set `LLM_VLLM_BASE_URL` to your local server. Trade-off: smaller models may produce lower-quality observations and reasoning than cloud APIs.
## What this does
- Runs Honcho's full memory stack (API, Deriver, PostgreSQL, Redis) on your machine
- Routes LLM calls through any OpenAI-compatible provider (primary + backup)
- All your data stays on your machine — no third-party cloud storage
- Works with OpenRouter, Venice, Routstr, Together, Ollama, or any other provider
## Architecture

```text
Hermes Agent ──► localhost:8000 (self-hosted Honcho API)
                        │
                        ├── PostgreSQL + pgvector (your machine)
                        ├── Redis cache (your machine)
                        │
                        └── Deriver/Dialectic/Dream workers
                                │
                                ├── Primary LLM provider (any OpenAI-compatible API)
                                └── Backup LLM provider (optional)
```
## Prerequisites
- Ubuntu 22.04+ (VM, VPS, bare metal, or any Linux server — tested on 22.04, 6GB RAM, 80GB disk)
- Docker Engine + Compose plugin
- API key from any OpenAI-compatible provider (openrouter.ai, venice.ai, together.ai, etc.)
- Second API key for backup (optional)
## Quick Start

```bash
curl -sL https://raw.githubusercontent.com/elkimek/honcho-self-hosted/main/setup.sh -o /tmp/setup.sh
bash /tmp/setup.sh
```
This installs Docker (if needed), clones Honcho, copies configs, prompts for API keys, starts everything, configures Hermes, and optionally sets up the MCP server. ~3 minutes.
## Manual Setup

### 1. Install Docker

```bash
sudo apt-get update && sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
```
Log out and back in for the group change to take effect.
### 2. Clone repos + copy configs

```bash
# Clone this config repo
git clone https://github.com/elkimek/honcho-self-hosted.git ~/honcho-self-hosted

# Clone upstream Honcho
git clone --depth 1 https://github.com/plastic-labs/honcho.git ~/honcho

# Copy config files into the Honcho clone
cp ~/honcho-self-hosted/docker-compose.yml ~/honcho/
cp ~/honcho-self-hosted/config.toml ~/honcho/
cp ~/honcho-self-hosted/env.example ~/honcho/.env
```
### 3. Set your API keys

Edit `~/honcho/.env`:

```bash
nano ~/honcho/.env
```

Replace the placeholder values with your actual API keys:

- `LLM_VLLM_API_KEY` — primary LLM provider
- `LLM_VLLM_BASE_URL` — primary provider's API URL
- `LLM_EMBEDDING_API_KEY` — embedding provider (can be the same as the primary)
- `LLM_EMBEDDING_BASE_URL` — embedding provider's API URL
- `LLM_EMBEDDING_MODEL` — embedding model name (default: `openai/text-embedding-3-small`)
- `LLM_OPENAI_COMPATIBLE_API_KEY` — backup LLM provider (optional)
- `LLM_OPENAI_API_KEY` — same as your primary key (needed for client init)

Any OpenAI-compatible provider works (OpenRouter, Venice, Routstr, Together, etc.) — just set the key and URL. See [Using different providers](#using-different-providers) for details.

**Embedding fallback:** if `LLM_EMBEDDING_API_KEY` or `LLM_EMBEDDING_BASE_URL` is left empty, Honcho falls back to the backup provider credentials (`LLM_OPENAI_COMPATIBLE_*`). This is useful if your backup provider (e.g. Venice) supports embeddings at negligible cost.

If you don't want a backup provider: remove all `BACKUP_PROVIDER` and `BACKUP_MODEL` lines from `config.toml`, and set `LLM_OPENAI_COMPATIBLE_API_KEY` + `LLM_OPENAI_COMPATIBLE_BASE_URL` to the same values as your primary. The setup script handles this automatically.
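Putting it together, a filled-in `.env` might look like the sketch below. The key names come from `env.example`; the key values are placeholders, and the OpenRouter base URL and embedding model are assumptions for an OpenRouter-primary setup — substitute your own provider's URL and models:

```env
# Primary LLM provider (OpenRouter in this example)
LLM_VLLM_API_KEY=sk-or-v1-xxxxxxxx
LLM_VLLM_BASE_URL=https://openrouter.ai/api/v1

# Embeddings (can reuse the primary provider)
LLM_EMBEDDING_API_KEY=sk-or-v1-xxxxxxxx
LLM_EMBEDDING_BASE_URL=https://openrouter.ai/api/v1
LLM_EMBEDDING_MODEL=openai/text-embedding-3-small

# Needed for client init — same as your primary key
LLM_OPENAI_API_KEY=sk-or-v1-xxxxxxxx
```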
### 4. Start Honcho

```bash
cd ~/honcho
docker compose up -d
```

First run builds images and runs DB migrations (~2 minutes). Check status:

```bash
docker compose ps
docker compose logs -f api deriver
```

Wait ~10 seconds for the API to start, then verify:

```bash
curl -s http://localhost:8000/openapi.json | head -1
```
### 5. Configure Hermes

```bash
mkdir -p ~/.honcho
cp ~/honcho-self-hosted/honcho-config.json ~/.honcho/config.json
hermes gateway restart
```
Hermes will now use your local Honcho instead of `api.honcho.dev`.
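Conceptually, the Hermes-side config is just a pointer at your instance. The sketch below is a hypothetical illustration of that shape — the actual key names may differ, so use the shipped `honcho-config.json` as the source of truth:

```json
{
  "baseURL": "http://localhost:8000",
  "workspace": "hermes"
}
```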
## Model Configuration
Honcho has 4 background components that use LLM calls:
- **Deriver** — Reads every message and extracts observations about the user ("prefers Python", "privacy-focused"). Memory formation.
- **Dialectic** — Answers questions about the user on demand, with 5 reasoning levels (minimal → max). Memory recall.
- **Summary** — Compresses long sessions into short/long summaries to keep context manageable.
- **Dream** — Runs every ~8 hours. Merges redundant observations, deletes outdated ones, infers higher-level patterns. Memory consolidation.
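To make the Deriver's role concrete, here is a hypothetical example of an extracted observation — the shape is illustrative only, not Honcho's actual internal schema:

```json
{
  "observation": "User is likely a developer",
  "kind": "deductive",
  "source_message": "I usually just write a quick Python script for that"
}
```

Explicit observations record what the user said directly; deductive ones, like the example above, are conclusions inferred from it.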
LLM calls are tiered by task complexity. Defaults are chosen for function-calling reliability (the primary requirement for Honcho's tool-using agents):
| Component | Default model | Tier | When it runs |
|---|---|---|---|
| Deriver | `z-ai/glm-4.7-flash` | Light — fast, cheap, 79.5% tau-bench | Every message |
| Summary | `z-ai/glm-4.7-flash` | Light | Every 20/60 messages |
| Dialectic (low) | `z-ai/glm-4.7-flash` | Light | Per Hermes turn |
| Dialectic (med/high) | `x-ai/grok-4.1-fast` | Medium — built for tool use, 2M context | Complex queries |
| Dialectic (max) | `z-ai/glm-5` | Heavy — 89.7% tau2-bench | Hardest queries |
| Dream | `z-ai/glm-5` | Heavy | Every ~8 hours |
These are OpenRouter model IDs. Any model your provider supports will work — just change the name in `config.toml`. Each component also has a backup provider that kicks in automatically if the primary still fails after its retries.

To change models, edit `~/honcho/config.toml` and rebuild:

```bash
cd ~/honcho
docker compose up -d --build
```
## Using different providers
Honcho supports these provider slots natively:
| Slot | Config key | How to use |
|---|---|---|
| `custom` | `OPENAI_COMPATIBLE_BASE_URL` + `OPENAI_COMPATIBLE_API_KEY` | Any OpenAI-compatible API |
| `vllm` | `VLLM_BASE_URL` + `VLLM_API_KEY` | Any OpenAI-compatible API |
| `openai` | `OPENAI_API_KEY` | OpenAI direct |
| `anthropic` | `ANTHROPIC_API_KEY` | Anthropic direct |
| `google` | `GEMINI_API_KEY` | Google Gemini |
| `groq` | `GROQ_API_KEY` | Groq |
You can mix providers per component in `config.toml`:

```toml
[deriver]
PROVIDER = "groq"        # fast for frequent tasks
MODEL = "llama-3.3-70b"

[dream]
PROVIDER = "anthropic"   # best reasoning for rare tasks
MODEL = "claude-sonnet-4-6"
```
## Local / LAN Inference
For maximum privacy, run a local model instead of a cloud API. Any server with an OpenAI-compatible endpoint works.
Local servers have different model catalogs than cloud APIs, so model names will differ. The setup script asks you for the model name your server provides.
Recommended local models for reliable function calling:
| Model | Params | Ollama name | Notes |
|---|---|---|---|
| GLM-4.7 Flash | 30B MoE | `glm-4.7-flash` | Same family as cloud default, best tool use in 30B class |
| Llama 3.3 | 70B | `llama3.3:70b` | Battle-tested tool calling, needs ~40GB VRAM |
### Ollama (easiest)

```bash
# Install Ollama on this machine or another on your LAN
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model with strong function calling
ollama pull glm-4.7-flash
```

Then run the setup script and choose option 2 (Local / LAN):

```text
Server URL: http://localhost:11434/v1
Model name for light tasks: glm-4.7-flash
Model name for heavy tasks: glm-4.7-flash
```
For a separate LAN machine, use its IP: `http://192.168.x.x:11434/v1`
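In local mode, the same env vars simply point at your LAN server instead of a cloud API. A sketch, assuming Ollama on the same machine (the dummy key value is an assumption — Ollama's OpenAI-compatible endpoint does not validate it, but the client still requires a non-empty value):

```env
LLM_VLLM_BASE_URL=http://localhost:11434/v1
LLM_VLLM_API_KEY=ollama
```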
### vLLM

```bash
# Serve a model with tool-calling support
vllm serve THUDM/GLM-4.7-Flash --port 8001 --enable-auto-tool-choice
```

Setup: `http://localhost:8001/v1` with model name `THUDM/GLM-4.7-Flash`
### Considerations

- **Model size matters** — Honcho's agents need reliable function calling and structured JSON. Models under 14B may miss tool calls or malform output. 32B+ recommended.
- **Embeddings need a cloud API** — local servers typically can't serve embedding models. The setup script asks for a separate cloud API key, URL, and model name for embeddings (e.g. OpenRouter with `openai/text-embedding-3-small`, or Venice with `text-embedding-bge-m3`), or lets you disable embeddings entirely (Honcho works, but without vector search).
- **Same model for all tiers** — locally you'll typically run one model. The script sets it for all components. You can differentiate later in `config.toml` if you serve multiple models.
- **No backup provider** — local mode uses a single server. If it goes down, Honcho's deriver queues work until it's back.
## MCP Server (optional)
Honcho includes an MCP server that exposes memory tools (search, chat, observations, peer cards) to any MCP-compatible client like Claude Code or Claude Desktop.
The hosted version at `mcp.honcho.dev` points at Plastic Labs' cloud. For self-hosted, run the MCP server locally and point it at your Honcho instance.
The setup script offers to configure MCP automatically. To set it up manually:
### Setup

Requires Node.js 22+ and Bun on the server:

```bash
cd ~/honcho/mcp

# Patch to use local Honcho instead of honcho.dev
sed -i 's|https://api.honcho.dev|http://localhost:8000|' src/config.ts

bun install
```
### Run as a service

Create `/etc/systemd/system/honcho-mcp.service`:

```ini
[Unit]
Description=Honcho MCP Server
After=network.target docker.service

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username/honcho/mcp
ExecStart=/usr/bin/npx wrangler dev --port 8787 --ip 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Then:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now honcho-mcp
```
### Connect Claude Code

If the MCP server is on a remote machine, tunnel the port over SSH:

```bash
ssh -f -N -L 8787:localhost:8787 user@your-server
```

Then add to Claude Code:

```bash
claude mcp add --transport http honcho http://localhost:8787 \
  --header "Authorization: Bearer local" \
  --header "X-Honcho-User-Name: your-name" \
  --header "X-Honcho-Workspace-ID: hermes"
```
### Connect Claude Desktop

Add to your Claude Desktop config (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "honcho": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "http://localhost:8787",
        "--header", "Authorization:${AUTH_HEADER}",
        "--header", "X-Honcho-User-Name:${USER_NAME}"
      ],
      "env": {
        "AUTH_HEADER": "Bearer local",
        "USER_NAME": "your-name"
      }
    }
  }
}
```
## Maintenance

Update Honcho:

```bash
cd ~/honcho
docker compose down
git checkout mcp/src/config.ts   # restore upstream MCP file if patched
git pull
docker compose up -d --build
```

If you use the MCP server, re-apply the patch after pulling:

```bash
sed -i 's|https://api.honcho.dev|http://localhost:8000|' mcp/src/config.ts
sudo systemctl restart honcho-mcp
```

View logs:

```bash
docker compose logs -f api deriver
```

Check queue status:

```bash
curl -s http://localhost:8000/v3/workspaces/hermes/queue/status | python3 -m json.tool
```

Backup data:

```bash
docker compose exec database pg_dump -U honcho honcho > backup.sql
```
## Known Limitations

- **Embedding fallback shares backup config** — if `LLM_EMBEDDING_API_KEY` / `LLM_EMBEDDING_BASE_URL` are empty, Honcho falls back to `LLM_OPENAI_COMPATIBLE_*` (backup provider). This is intentional and works well when your backup supports embeddings (e.g. Venice with `text-embedding-bge-m3`). Set the embedding env vars explicitly if you want embeddings routed separately.
- **One backup per component** — Honcho supports a primary plus one backup provider, not a full failover chain. Using a multi-provider router (e.g. OpenRouter) as primary mitigates this.
- **No E2EE** — Honcho's agents use function calling, which isn't compatible with end-to-end encryption. LLM request content is visible to the provider, but your stored data (sessions, observations, embeddings) stays on your machine.
## Files

| File | Purpose |
|---|---|
| `docker-compose.yml` | Docker deployment — API, Deriver, PostgreSQL, Redis |
| `config.toml` | Honcho config — providers, models, feature flags |
| `env.example` | API keys template — copy to `~/honcho/.env` and fill in |
| `honcho-config.json` | Hermes-side config — tells Hermes to use `localhost:8000` |
| `setup.sh` | One-command installer — handles everything |
## License
GPL-3.0
## Credits

- [Honcho](https://github.com/plastic-labs/honcho) by Plastic Labs
- Hermes Agent by Nous Research