hermes-wsl-ubuntu
Instructions to use Hermes AI agents on WSL2 - Ubuntu
Hermes Agent + llama.cpp + Qwen3.5 Integration Guide
Author: Antonio Martinez Date: March 18, 2026 Updated: April 9, 2026
End-to-end setup for running Hermes Agent + llama.cpp + Qwen3.5 locally with GPU acceleration (CUDA for Linux/WSL, Metal for macOS).
Table of Contents
- System Requirements
- Architecture Overview
- Automatic installation
- Prerequisites & Setup
- Hermes Agent Installation
- llama.cpp Installation
- Camofox Browser Server
- Hermes HUDUI (Web UI)
- Hermes API Server
- Model Setup (Qwen3.5)
- Running the Server
- Usage Examples
- Performance & Memory
- Troubleshooting
- Optimization Tips
- Best Practices
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | GTX 1060 (6GB) | RTX 30/40 series |
| VRAM | 6GB | 12GB+ |
| RAM | 16GB | 32GB |
| CPU | 4 cores | 8+ cores |
| OS | Linux / WSL / macOS | Apple Silicon / Ubuntu |
Platform Requirements:
- Linux / WSL2 (Ubuntu): NVIDIA GPU with CUDA drivers installed (optional).
- macOS: Apple Silicon (M1/M2/M3) for Metal acceleration.
- Node.js: 18+
- Package Manager: `apt` (Linux) or `brew` (macOS).
Architecture Overview
flowchart LR
User -->|Browser| WebUI[Hermes HUDUI]
WebUI --> Hermes
Hermes -->|OpenAI API| LlamaServer
Hermes -->|Browser Control| Camofox
LlamaServer --> Model[Qwen3.5 GGUF]
Model --> GPU[GPU / CPU]
Camofox --> Web[Internet]
Automatic installation
If you are on Linux or macOS, run the following script to install CUDA (if applicable), Hermes, llama.cpp, Camofox, and the Web UI. The script automatically detects your platform.
curl -fsSL https://raw.githubusercontent.com/metantonio/hermes-wsl-ubuntu/master/setup-wsl.sh -o setup-wsl.sh && bash setup-wsl.sh
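After the script finishes, a quick way to see which of this guide's services actually came up is to probe their default ports. This is a sketch, not part of the installer; the port numbers are the defaults used throughout this guide.

```shell
#!/usr/bin/env bash
# Probe each default port from this guide using bash's /dev/tcp feature.
for entry in "llama.cpp:8080" "Camofox:9377" "Hermes HUDUI:3001" "Hermes API:8642"; do
  name="${entry%:*}"; port="${entry##*:}"
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "${name}: listening on port ${port}"
  else
    echo "${name}: nothing on port ${port} (not started yet?)"
  fi
done
```

Services that show "nothing on port" can be started manually using the per-component sections below.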
Prerequisites & Setup
If you are on Windows, first read and install WSL2 - Ubuntu
In Ubuntu:
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl wget htop
Verify GPU
nvidia-smi
nvcc --version
If both commands work, continue with the Hermes Agent installation. If not, you need the CUDA Toolkit:
CUDA Toolkits for WSL - Ubuntu
check: Link
You may also have to install nvcc with:
sudo apt install nvidia-cuda-toolkit
Hermes Agent Installation
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
Config
# Do the setup to create the necessary files
hermes setup
And edit these values:
hermes config set OPENAI_BASE_URL http://localhost:8080/v1
hermes config set OPENAI_API_KEY dummy
hermes config set LLM_MODEL Qwen3.5-9B-Q5_K_M
hermes config set TELEGRAM_BOT_TOKEN Your_Telegram_API_Token
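Hermes only works once the endpoint behind `OPENAI_BASE_URL` is actually serving. A quick hedged check (llama-server exposes the standard OpenAI-style `/v1/models` listing):

```shell
# Ask the local server which models it serves; prints a hint instead
# if llama-server has not been started yet (see the sections below).
BASE_URL=http://localhost:8080/v1
curl -fsS "${BASE_URL}/models" 2>/dev/null \
  || echo "No server at ${BASE_URL} yet; install and start llama.cpp first"
```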
llama.cpp Installation
Recommended Path
sudo mkdir -p /opt/llama.cpp
sudo chown -R $USER:$USER /opt/llama.cpp
cd /opt/llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
Acceleration Backend
- Linux/WSL: Build with `GGML_CUDA=ON` (requires an NVIDIA GPU).
- macOS: Build with `GGML_METAL=ON` (optimized for Apple Silicon).
- Fallback: Standard CPU build.
Build with CUDA (you need an NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Build with CPU (your CPU needs AVX2 instructions, available on virtually any x86 CPU released since 2013)
This makes inference slower, but still functional:
cmake -B build
cmake --build build --config Release
Note that for CPU-only builds you may want to enable a BLAS backend (e.g., OpenBLAS) for faster prompt processing.
Note: remove any previous failed build with:
rm -rf build
Verify
./build/bin/llama-server -h
Camofox Browser Server
Camofox is a browser-automation server used by the Hermes Agent to perform web-browsing tasks.
Interaction
- Port: `9377`
- URL: `http://localhost:9377`
Manual Management
The setup-wsl.sh script automatically starts Camofox in the background. If you need to manage it manually:
Start:
cd /opt/camofox
npm start > camofox.log 2>&1 &
Stop:
sudo fuser -k 9377/tcp
Check Status:
ps aux | grep camofox
Hermes HUDUI (Web UI)
Hermes HUDUI is the graphical interface for interacting with the Hermes Agent.
Interaction
- Port: `3001`
- URL: `http://localhost:3001`
- Environment: Requires Python 3.11+ and a virtual environment (handled by the setup script).
Manual Management
Start:
cd ~/hermes-hudui
source venv/bin/activate
hermes-hudui > hermes-hudui.log 2>&1 &
Stop:
sudo fuser -k 3001/tcp
Hermes API Server
The Hermes Agent includes an optional API server for remote integrations.
Interaction
- Port: `8642`
- URL: `http://localhost:8642`
Configuration
The setup script configures this in ~/.hermes/.env:
API_SERVER_ENABLED=true
API_SERVER_KEY=change-me-local-dev
> [!IMPORTANT]
> Change the `API_SERVER_KEY` manually for security in production environments.
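One way to rotate the key is to generate a random hex string and substitute it in place. The `~/.hermes/.env` path is the one the setup script creates; back the file up first if unsure.

```shell
# Generate a 64-character hex key and write it into the env file.
ENV_FILE="$HOME/.hermes/.env"
NEW_KEY=$(openssl rand -hex 32)
if [ -f "$ENV_FILE" ]; then
  sed -i "s/^API_SERVER_KEY=.*/API_SERVER_KEY=${NEW_KEY}/" "$ENV_FILE"
  echo "API_SERVER_KEY rotated"
else
  echo "$ENV_FILE not found; run 'hermes setup' first"
fi
```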
Model Setup (Qwen3.5)
The model will be stored in a models directory created in your home directory:
mkdir -p ~/models
cd ~/models
wget https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q5_K_M.gguf
Other model links you could consider: https://huggingface.co/Tesslate/OmniCoder-9B-GGUF/resolve/main/omnicoder-9b-q5_k_m.gguf
Note: you may want to check Hugging Face for Qwen3.5 models of 0.8B, 2B, and 4B if you want to run this on CPU at a reasonable speed.
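Interrupted wget downloads can silently leave truncated files behind. A sanity check after the download finishes, assuming the default path used above (the ~6 GB lower bound comes from the Q5_K_M size in the comparison table below; `stat -c %s` is the GNU/Linux form):

```shell
# Compare the downloaded file's size against a rough lower bound.
MODEL="$HOME/models/Qwen3.5-9B-Q5_K_M.gguf"
MIN_BYTES=$(( 6 * 1024 * 1024 * 1024 ))   # ~6 GB
if [ -f "$MODEL" ]; then
  size=$(stat -c %s "$MODEL")
  if [ "$size" -ge "$MIN_BYTES" ]; then
    echo "Download looks complete (${size} bytes)"
  else
    echo "File suspiciously small (${size} bytes); re-run wget"
  fi
else
  echo "Model not downloaded yet"
fi
```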
Notes
Model Sizes Comparison
| Model | Parameters | Q4_K_M | Q5_K_M | Q5_K_L | Q6_K_L |
|---|---|---|---|---|---|
| Qwen3.5 9B | 9B | 5.5GB | 6.5GB | 6.8GB | 7.7GB |
| Qwen3.5 14B | 14B | 9.0GB | 10.5GB | 11.0GB | 12.5GB |
Model Selection Matrix
| Use Case | Model | Quantization | VRAM | Speed | Precision Loss |
|---|---|---|---|---|---|
| General Chat | Qwen3.5 9B | Q4_K_M | 5.5GB | 35-50 tok/s | ~7-8% |
| Development | Qwen3.5 9B | Q5_K_M | 6.5GB | 28-42 tok/s | ~4-5% |
| Research | Qwen3.5 9B | Q5_K_L | 6.8GB | 22-35 tok/s | ~3-4% |
| Complex Tasks | Qwen3.5 14B | Q4_K_M | 9.0GB | 25-40 tok/s | ~7-8% |
| Maximum Quality | Qwen3.5 14B | Q5_K_M | 10.5GB | 22-35 tok/s | ~4-5% |
Running the Server
/opt/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
-c 8192 \
-ngl 32 \
--threads 8 \
-fa on \
--host 127.0.0.1 \
--port 8080
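llama-server ships a `/health` endpoint that reports ok once the model has finished loading, which is handy for scripting startup waits:

```shell
# One-shot readiness probe; prints the status JSON or a hint.
BASE=http://127.0.0.1:8080
curl -fsS "${BASE}/health" 2>/dev/null && echo \
  || echo "llama-server not ready on ${BASE}"
```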
Usage Examples
API (OpenAI Compatible)
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a Python function",
"max_tokens": 128
}'
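The same server also serves the chat-style endpoint `/v1/chat/completions`, which is what most OpenAI-compatible clients use. A sketch that builds the payload in a variable first so it can be inspected or reused:

```shell
# Build the JSON payload, then POST it to the chat endpoint.
read -r -d '' PAYLOAD <<'JSON' || true
{
  "messages": [{"role": "user", "content": "Write a Python function"}],
  "max_tokens": 128
}
JSON
curl -sS http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "server not reachable; start llama-server first"
```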
CLI
/opt/llama.cpp/build/bin/llama-cli \
-m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
-ngl 32 \
-c 4096 \
-i
Recommended Commands (For a 12 GB VRAM)
Qwen3.5 9B - General Purpose (Recommended)
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-9B-Instruct-Q5_K_M.gguf \
-c 8192 -n 4096 -ngl 32 \
--port 8080 \
--host 127.0.0.1 \
--threads 8 \
--flash-attn 1 \
--cache-type-k q4_0
Qwen3.5 14B - Complex Reasoning
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-14B-Q4_K_M.gguf \
-c 131072 -n 4096 -ngl 34 \
-np 1 -fa on \
--port 8080 \
--host 127.0.0.1 \
--threads 8 \
--flash-attn 1 \
--cache-type-k q4_0
Qwen3.5 9B - Maximum Speed
./llama.cpp/build/bin/llama-server \
-m /home/antonio/models/Qwen3.5-9B-Instruct-Q4_K_M.gguf \
-c 4096 -n 2048 -ngl 32 \
--port 8080 \
--host 127.0.0.1 \
--threads 12 \
--flash-attn 1 \
--cache-type-k q4_0
Integration with Hermes Agent
# Option A: Hermes using CLI
hermes chat --model Qwen3.5-9B-Q5_K_M
# Option B: Using Telegram
hermes gateway
Performance & Memory
Memory Breakdown
pie
title VRAM Usage (Qwen 9B Q5_K_M has 32 layers)
"Model Weights" : 65
"KV Cache" : 30
"Overhead" : 5
KV Cache Rule
- ~1.2 GB per 4096 tokens
- Scales linearly with context
| Context | KV Cache |
|---|---|
| 4096 | ~1.2 GB |
| 8192 | ~2.4 GB |
| 10240 | ~3.0 GB |
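The table values follow directly from the rule of thumb; a tiny script reproducing them (integer math in tenths of a GB):

```shell
# KV cache ~= 1.2 GB per 4096 tokens, scaling linearly with context.
for ctx in 4096 8192 10240; do
  tenths=$(( ctx * 12 / 4096 ))   # context * 1.2 GB / 4096, in 0.1 GB units
  echo "context ${ctx}: ~$(( tenths / 10 )).$(( tenths % 10 )) GB KV cache"
done
# -> context 4096: ~1.2 GB KV cache, 8192: ~2.4 GB, 10240: ~3.0 GB
```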
Important
- KV cache = input + output tokens
- Total tokens = `-c` (llama.cpp flag)
Tips
RAM Calculation
Each LLM transformer layer needs roughly 180-320 MB of VRAM (GPU) / RAM (CPU), depending on the model and quantization (rough estimates):
- Q4_K_M: around 200 MB per layer.
- Q5_K_M: around 300 MB per layer.
Every 4096 tokens of context needs a KV cache of ~1.2 GB VRAM/RAM.
System + overhead requires ~0.5 GB of VRAM/RAM.
Some example math for Qwen3.5-9B-Q5_K_M.gguf:
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (32 layers) | ~9.4 GB |
| KV Cache (Context 4096) | ~1.2 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |
A little tight for a 12 GB VRAM GPU, but entirely functional.
Depending on the task you may need more context. With, say, 8192 tokens of context, you can sacrifice some inference speed by loading fewer layers onto the GPU:
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (Context 8192) | ~2.4 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |
If you have more GPU VRAM available, you could try adding even more context (2048 more tokens):
| Component | VRAM Required (Q5_K_M) |
|---|---|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (Context 10240) | ~3.0 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.7 GB |
But that leaves only ~2.5% of the GPU memory free; it is recommended to keep at least 5% free.
KV-cache memory depends on the total active context (`-c`), not separately on input or output tokens.
Note: Qwen3.5-14B has 40 layers.
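The worked examples above can be reproduced with a small budget calculator. The MB-per-layer figures are the rough estimates from this section, not exact values:

```shell
# VRAM budget = layers_on_gpu * MB/layer + KV cache + overhead.
layers_on_gpu=32
mb_per_layer=300                    # ~Q5_K_M estimate from above
ctx=4096
kv_mb=$(( ctx * 1200 / 4096 ))      # ~1.2 GB per 4096 tokens
overhead_mb=500
total_mb=$(( layers_on_gpu * mb_per_layer + kv_mb + overhead_mb ))
echo "Estimated VRAM: ${total_mb} MB"
```

Note how dropping to 28 layers (`-ngl 28`) frees about 1.2 GB, which is exactly what doubling the context to 8192 costs in KV cache, matching the second table above.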
Additional Parameters for llama.cpp
| Parameter | Value | Description |
|---|---|---|
| `-c` | 8192 | Context size (prompt + output, max buffer) |
| `-n` | 4096 | Max output tokens |
| `-ngl` | 32 | Layers offloaded to GPU (depends on the model) |
| `--port` | 8080 | Server port |
| `--host` | 127.0.0.1 | Listen on localhost |
| `--threads` | 8 | CPU threads for parallelization |
| `--flash-attn` | 1 | Enable Flash Attention (speed boost) |
Troubleshooting
CUDA Out of Memory
-ngl 28
-c 4096
Slow Performance
--threads 12
-fa on
GPU Not Detected
nvidia-smi
wsl --update
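To see whether VRAM, rather than compute, is the bottleneck, nvidia-smi can report the memory counters directly (these query flags are standard nvidia-smi options):

```shell
# One-shot VRAM/utilization snapshot; wrap in `watch -n 1` to monitor live.
QUERY="memory.used,memory.total,utilization.gpu"
nvidia-smi --query-gpu="$QUERY" --format=csv 2>/dev/null \
  || echo "nvidia-smi not available (no GPU, or driver not visible in WSL)"
```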
Optimization Tips
Context vs Performance
graph LR
A[Small Context] -->|Fast| B[High Speed]
C[Large Context] -->|Slow| D[More Memory Usage]
Best Flags
-c 8192
-ngl 32
-fa on
--threads 8
Best Practices
Model Strategy
| Use Case | Model |
|---|---|
| Chat | Qwen3.5 9B Q4_K_M |
| Dev | Qwen3.5 9B Q5_K_M |
| Research | Qwen3.5 9B Q5_K_L |
| Complex | Qwen3.5 14B |
Security
chmod 700 ~/models
chmod 755 /opt/llama.cpp/build/bin/llama-server
- Use `127.0.0.1`
- Do NOT expose publicly
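A quick way to verify the binding: list the listening sockets and check that the llama-server port shows `127.0.0.1:8080` (an entry like `0.0.0.0:8080` would mean the server is reachable from the network):

```shell
# List listening TCP sockets and filter for the llama-server port.
PORT=8080
ss -tln 2>/dev/null | grep ":${PORT}" || echo "nothing listening on ${PORT}"
```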
Workflow
sequenceDiagram
participant User
participant Hermes
participant LlamaServer
participant Model
User->>Hermes: Prompt
Hermes->>LlamaServer: API Call
LlamaServer->>Model: Inference
Model-->>LlamaServer: Tokens
LlamaServer-->>Hermes: Response
Hermes-->>User: Output
Production Checklist
- GPU working (`nvidia-smi`)
- llama.cpp built with CUDA
- Model downloaded
- Server running
- Hermes connected
- Context optimized
- Permissions secured
Resources
- llama.cpp https://github.com/ggerganov/llama.cpp
- Hermes Agent https://hermes-agent.nousresearch.com/
- Qwen Models https://huggingface.co/Qwen