hermes-wsl-ubuntu
Instructions to use Hermes AI agents on WSL2 - Ubuntu
Hermes Agent + llama.cpp + Qwen3.5 Integration Guide
Author: Antonio Martinez Date: March 18, 2026 Updated: April 9, 2026
End-to-end setup for running Hermes Agent + llama.cpp + Qwen3.5 locally with GPU acceleration (CUDA for Linux/WSL, Metal for macOS).
Table of Contents
- System Requirements
- Architecture Overview
- Automatic installation
- Prerequisites & Setup
- Hermes Agent Installation
- llama.cpp Installation
- Camofox Browser Server
- Hermes HUDUI (Web UI)
- Hermes API Server
- Model Setup (Qwen3.5)
- Running the Server
- Usage Examples
- Performance & Memory
- Troubleshooting
- Optimization Tips
- Best Practices
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | GTX 1060 (6GB) | RTX 30/40 series |
| VRAM | 6GB | 12GB+ |
| RAM | 16GB | 32GB |
| CPU | 4 cores | 8+ cores |
| OS | Linux / WSL / macOS | Apple Silicon / Ubuntu |
Platform Requirements:
- Linux / WSL2 (Ubuntu): NVIDIA GPU with CUDA drivers installed (optional).
- macOS: Apple Silicon (M1/M2/M3) for Metal acceleration.
- Node.js: 18+
- Package Manager: `apt` (Linux) or `brew` (macOS).
Architecture Overview
flowchart LR
User -->|Browser| WebUI[Hermes HUDUI]
WebUI --> Hermes
Hermes -->|OpenAI API| LlamaServer
Hermes -->|Browser Control| Camofox
LlamaServer --> Model[Qwen3.5 GGUF]
Model --> GPU[GPU / CPU]
Camofox --> Web[Internet]
Automatic installation
If you are on Linux or macOS, run the following script to install CUDA (if applicable), Hermes, llama.cpp, Camofox, and the Web UI. The script automatically detects your platform.
curl -fsSL https://raw.githubusercontent.com/metantonio/hermes-wsl-ubuntu/master/setup-wsl.sh -o setup-wsl.sh && bash setup-wsl.sh
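After the script finishes, a quick way to see which of this guide's services actually came up is to probe their default ports. This is a sketch, not part of the installer; the port numbers are the defaults used throughout this guide.

```shell
#!/usr/bin/env bash
# Probe each default port from this guide using bash's /dev/tcp feature.
for entry in "llama.cpp:8080" "Camofox:9377" "Hermes HUDUI:3001" "Hermes API:8642"; do
  name="${entry%:*}"; port="${entry##*:}"
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "${name}: listening on port ${port}"
  else
    echo "${name}: nothing on port ${port} (not started yet?)"
  fi
done
```

Services that show "nothing on port" can be started manually using the per-component sections below.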
Prerequisites & Setup
If you are on Windows, first read and install WSL2 - Ubuntu
In Ubuntu:
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl wget htop
Verify GPU
nvidia-smi
nvcc --version
If both commands work, continue with the Hermes Agent installation. If not, you need the CUDA Toolkit:
CUDA Toolkits for WSL - Ubuntu
check: Link
You may also have to install nvcc with:
sudo apt install nvidia-cuda-toolkit
Hermes Agent Installation
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
Config
# Do the setup to create the necessary files
hermes setup
And edit these values:
hermes config set OPENAI_BASE_URL http://localhost:8080/v1
hermes config set OPENAI_API_KEY dummy
hermes config set LLM_MODEL Qwen3.5-9B-Q5_K_M
hermes config set TELEGRAM_BOT_TOKEN Your_Telegram_API_Token
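Hermes only works once the endpoint behind `OPENAI_BASE_URL` is actually serving. A quick hedged check (llama-server exposes the standard OpenAI-style `/v1/models` listing):

```shell
# Ask the local server which models it serves; prints a hint instead
# if llama-server has not been started yet (see the sections below).
BASE_URL=http://localhost:8080/v1
curl -fsS "${BASE_URL}/models" 2>/dev/null \
  || echo "No server at ${BASE_URL} yet; install and start llama.cpp first"
```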
llama.cpp Installation
Recommended Path
sudo mkdir -p /opt/llama.cpp
sudo chown -R $USER:$USER /opt/llama.cpp
cd /opt/llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
Acceleration Backend
- Linux/WSL: Build with `GGML_CUDA=ON` (requires an NVIDIA GPU).
- macOS: Build with `GGML_METAL=ON` (optimized for Apple Silicon).
- Fallback: Standard CPU build.
Build with CUDA (you need an NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Build with CPU (your CPU needs AVX2 instructions, available on virtually any x86 CPU released since 2013)
This makes inference slower, but still functional:
cmake -B build
cmake --build build --config Release
Note that for CPU-only builds you may want to enable a BLAS backend (e.g., OpenBLAS) for faster prompt processing.
Note: remove any previous failed build with:
rm -rf build
Verify
./build/bin/llama-server -h
Camofox Browser Server
Camofox is a browser-automation server used by the Hermes Agent to perform web-browsing tasks.
Interaction
- Port: `9377`
- URL: `http://localhost:9377`
Manual Management
The setup-wsl.sh script automatically starts Camofox in the background. If you need to manage it manually:
Start:
cd /opt/camofox
npm start > camofox.log 2>&1 &
Stop:
sudo fuser -k 9377/tcp
Check Status:
ps aux | grep camofox
Hermes HUDUI (Web UI)
Hermes HUDUI is the graphical interface for interacting with the Hermes Agent.
Interaction
- Port: `3001`
- URL: `http://localhost:3001`
- Environment: Requires Python 3.11+ and a virtual environment (handled by the setup script).
Manual Management
Start:
cd ~/hermes-hudui
source venv/bin/activate
hermes-hudui > hermes-hudui.log 2>&1 &
Stop:
sudo fuser -k 3001/tcp
Hermes API Server
The Hermes Agent includes an optional API server for remote integrations.
Interaction
- Port: `8642`
- URL: `http://localhost:8642`
Configuration
The setup script configures this in ~/.hermes/.env:
API_SERVER_ENABLED=true
API_SERVER_KEY=change-me-local-dev
> [!IMPORTANT]
> Change the `API_SERVER_KEY` manually for security in production environments.
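One way to rotate the key is to generate a random hex string and substitute it in place. The `~/.hermes/.env` path is the one the setup script creates; back the file up first if unsure.

```shell
# Generate a 64-character hex key and write it into the env file.
ENV_FILE="$HOME/.hermes/.env"
NEW_KEY=$(openssl rand -hex 32)
if [ -f "$ENV_FILE" ]; then
  sed -i "s/^API_SERVER_KEY=.*/API_SERVER_KEY=${NEW_KEY}/" "$ENV_FILE"
  echo "API_SERVER_KEY rotated"
else
  echo "$ENV_FILE not found; run 'hermes setup' first"
fi
```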
Model Setup (Qwen3.5)
The model will be stored in a models directory created in your home directory:
mkdir -p ~/models
cd ~/models
wget https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q5_K_M.gguf
Other model links you could consider: https://huggingface.co/Tesslate/OmniCoder-9B-GGUF/resolve/main/omnicoder-9b-q5_k_m.gguf
Note: you may want to check Hugging Face for Qwen3.5 models of 0.8B, 2B, and 4B if you want to run this on CPU at a reasonable speed.
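Interrupted wget downloads can silently leave truncated files behind. A sanity check after the download finishes, assuming the default path used above (the ~6 GB lower bound comes from the Q5_K_M size in the comparison table below; `stat -c %s` is the GNU/Linux form):

```shell
# Compare the downloaded file's size against a rough lower bound.
MODEL="$HOME/models/Qwen3.5-9B-Q5_K_M.gguf"
MIN_BYTES=$(( 6 * 1024 * 1024 * 1024 ))   # ~6 GB
if [ -f "$MODEL" ]; then
  size=$(stat -c %s "$MODEL")
  if [ "$size" -ge "$MIN_BYTES" ]; then
    echo "Download looks complete (${size} bytes)"
  else
    echo "File suspiciously small (${size} bytes); re-run wget"
  fi
else
  echo "Model not downloaded yet"
fi
```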
Notes
Model Sizes Comparison
| Model | Parameters | Q4_K_M | Q5_K_M | Q5_K_L | Q6_K_L |
|---|---|---|---|---|---|
| Qwen3.5 9B | 9B | 5.5GB | 6.5GB | 6.8GB | 7.7GB |
| Qwen3.5 14B | 14B | 9.0GB | 10.5GB | 11.0GB | 12.5GB |
Model Selection Matrix
| Use Case | Model | Quantization | VRAM | Speed | Precision Loss |
|---|---|---|---|---|---|
| General Chat | Qwen3.5 9B | Q4_K_M | 5.5GB | 35-50 tok/s | ~7-8% |
| Development | Qwen3.5 9B | Q5_K_M | 6.5GB | 28-42 tok/s | ~4-5% |
| Research | Qwen3.5 9B | Q5_K_L | 6.8GB | 22-35 tok/s | ~3-4% |
| Complex Tasks | Qwen3.5 14B | Q4_K_M | 9.0GB | 25-40 tok/s | ~7-8% |
| Maximum Quality | Qwen3.5 14B | Q5_K_M | 10.5GB | 22-35 tok/s | ~4-5% |
Running the Server
/opt/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
-c 8192 \
-ngl 32 \
--threads 8 \
-fa on \
--host 127.0.0.1 \
--port 8080
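llama-server ships a `/health` endpoint that reports ok once the model has finished loading, which is handy for scripting startup waits:

```shell
# One-shot readiness probe; prints the status JSON or a hint.
BASE=http://127.0.0.1:8080
curl -fsS "${BASE}/health" 2>/dev/null && echo \
  || echo "llama-server not ready on ${BASE}"
```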
Usage Examples
API (OpenAI Compatible)
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a Python function",
"max_tokens": 128
}'
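The same server also serves the chat-style endpoint `/v1/chat/completions`, which is what most OpenAI-compatible clients use. A sketch that builds the payload in a variable first so it can be inspected or reused:

```shell
# Build the JSON payload, then POST it to the chat endpoint.
read -r -d '' PAYLOAD <<'JSON' || true
{
  "messages": [{"role": "user", "content": "Write a Python function"}],
  "max_tokens": 128
}
JSON
curl -sS http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "server not reachable; start llama-server first"
```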
CLI
/opt/llama.cpp/build/bin/llama-cli \
-m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
-ngl 32 \
-c 4096 \
-i
Recommended Commands (For a 12 GB VRAM)
Qwen3.5 9B - General Purpose (Recommended)
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-9B-Instruct-Q5_K_M.gguf \
-c 8192 -n 4096 -ngl 32 \
--port 8080 \
--host 127.0.0.1 \
--threads 8 \
--flash-attn 1 \
--cache-type-k q4_0
Qwen3.5 14B - Complex Reasoning
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-14B-Q4_K_M.gguf \
-c 131072 -n 4096 -ngl 34 \
-np 1 -fa on \
--port 8080 \
--host 127.0.0.1 \
--threads 8 \
--flash-attn 1 \
--cache-type-k q4_0
Qwen3.5 9B - Maximum Speed
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-9B-Instruct-Q4_K_M.gguf \
-c 4096 -n 2048 -ngl 32 \
--port 8080 \
--host 127.0.0.1 \
--threads 12 \
--flash-attn 1 \
--cache-type-k q4_0
Integration with Hermes Agent
# Option A: Hermes using CLI
hermes chat --model Qwen3.5-9B-Q5_K_M
# Option B: Using Telegram
hermes gateway
Performance & Memory
Memory Breakdown
pie
title VRAM Usage (Qwen 9B Q5_K_M has 32 layers)
"Model Weights" : 65
"KV Cache" : 30
"Overhead" : 5
KV Cache Rule
- ~1.2 GB per 4096 tokens
- Scales linearly with context
| Context | KV Cache |
|---|---|
| 4096 | ~1.2 GB |
| 8192 | ~2.4 GB |
| 10240 | ~3.0 GB |
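The table values follow directly from the rule of thumb; a tiny script reproducing them (integer math in tenths of a GB):

```shell
# KV cache ~= 1.2 GB per 4096 tokens, scaling linearly with context.
for ctx in 4096 8192 10240; do
  tenths=$(( ctx * 12 / 4096 ))   # context * 1.2 GB / 4096, in 0.1 GB units
  echo "context ${ctx}: ~$(( tenths / 10 )).$(( tenths % 10 )) GB KV cache"
done
# -> context 4096: ~1.2 GB KV cache, 8192: ~2.4 GB, 10240: ~3.0 GB
```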
Important
- KV cache = input + output tokens
- Total tokens = `-c` (llama.cpp flag)
Tips
RAM Calculation
Each LLM transformer layer needs roughly 180-320 MB of VRAM (GPU) / RAM (CPU), depending on the model and quantization (rough estimates):
- Q4_K_M: around 200 MB per layer.
- Q5_K_M: around 300 MB per layer.
Every 4096 tokens of context needs a KV cache of ~1.2 GB VRAM/RAM.
System + overhead requires ~0.5 GB of VRAM/RAM.
Some example math for Qwen3.5-9B-Q5_K_M.gguf:
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (32 layers) | ~9.4 GB |
| KV Cache (Context 4096) | ~1.2 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |
A little tight for a 12 GB VRAM GPU, but entirely functional.
Depending on the task you may need more context. With, say, 8192 tokens of context, you can sacrifice some inference speed by loading fewer layers onto the GPU:
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (Context 8192) | ~2.4 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |
If you have more GPU VRAM available, you could try adding even more context (2048 more tokens):
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (Context 10240) | ~3.0 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.7 GB |
But that leaves only ~2.5% of the GPU memory free; it is recommended to keep at least 5% free.
KV-cache memory depends on the total active context (`-c`), not separately on input or output tokens.
Note: Qwen3.5-14B has 40 layers.
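The worked examples above can be reproduced with a small budget calculator. The MB-per-layer figures are the rough estimates from this section, not exact values:

```shell
# VRAM budget = layers_on_gpu * MB/layer + KV cache + overhead.
layers_on_gpu=32
mb_per_layer=300                    # ~Q5_K_M estimate from above
ctx=4096
kv_mb=$(( ctx * 1200 / 4096 ))      # ~1.2 GB per 4096 tokens
overhead_mb=500
total_mb=$(( layers_on_gpu * mb_per_layer + kv_mb + overhead_mb ))
echo "Estimated VRAM: ${total_mb} MB"
```

Note how dropping to 28 layers (`-ngl 28`) frees about 1.2 GB, which is exactly what doubling the context to 8192 costs in KV cache, matching the second table above.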
Additional Parameters for llama.cpp
| Parameter | Value | Description |
|---|---|---|
| `-c` | 8192 | Context size (prompt + output, max buffer) |
| `-n` | 4096 | Max output tokens |
| `-ngl` | 32 | Layers offloaded to GPU (depends on the model) |
| `--port` | 8080 | Server port |
| `--host` | 127.0.0.1 | Listen on localhost |
| `--threads` | 8 | CPU threads for parallelization |
| `--flash-attn` | 1 | Enable Flash Attention (speed boost) |
Troubleshooting
CUDA Out of Memory
-ngl 28
-c 4096
Slow Performance
--threads 12
-fa on
GPU Not Detected
nvidia-smi
wsl --update
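To see whether VRAM, rather than compute, is the bottleneck, nvidia-smi can report the memory counters directly (these query flags are standard nvidia-smi options):

```shell
# One-shot VRAM/utilization snapshot; wrap in `watch -n 1` to monitor live.
QUERY="memory.used,memory.total,utilization.gpu"
nvidia-smi --query-gpu="$QUERY" --format=csv 2>/dev/null \
  || echo "nvidia-smi not available (no GPU, or driver not visible in WSL)"
```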
Optimization Tips
Context vs Performance
graph LR
A[Small Context] -->|Fast| B[High Speed]
C[Large Context] -->|Slow| D[More Memory Usage]
Best Flags
-c 8192
-ngl 32
-fa on
--threads 8
Best Practices
Model Strategy
| Use Case | Model |
|---|---|
| Chat | Qwen3.5 9B Q4_K_M |
| Dev | Qwen3.5 9B Q5_K_M |
| Research | Qwen3.5 9B Q5_K_L |
| Complex | Qwen3.5 14B |
Security
chmod 700 ~/models
chmod 755 /opt/llama.cpp/build/bin/llama-server
- Use `127.0.0.1`
- Do NOT expose publicly
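A quick way to verify the binding: list the listening sockets and check that the llama-server port shows `127.0.0.1:8080` (an entry like `0.0.0.0:8080` would mean the server is reachable from the network):

```shell
# List listening TCP sockets and filter for the llama-server port.
PORT=8080
ss -tln 2>/dev/null | grep ":${PORT}" || echo "nothing listening on ${PORT}"
```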
Workflow
sequenceDiagram
participant User
participant Hermes
participant LlamaServer
participant Model
User->>Hermes: Prompt
Hermes->>LlamaServer: API Call
LlamaServer->>Model: Inference
Model-->>LlamaServer: Tokens
LlamaServer-->>Hermes: Response
Hermes-->>User: Output
Production Checklist
- GPU working (`nvidia-smi`)
- llama.cpp built with CUDA
- Model downloaded
- Server running
- Hermes connected
- Context optimized
- Permissions secured
Resources
- llama.cpp https://github.com/ggerganov/llama.cpp
- Hermes Agent https://hermes-agent.nousresearch.com/
- Qwen Models https://huggingface.co/Qwen