hermes-embodied

Self-improving robotics via Hermes Agent. Three skills that provision cloud GPUs, fine-tune VLA models, run sim evaluation, and autonomously improve robot policies. Built for the Nous Research Hermes Agent Hackathon.


Hermes Embodied: Self-Improving Robotics via Hermes Agent

"Any robot owner can fine-tune a state-of-the-art VLA by talking to their agent. No ML expertise needed."

What Is This?

Hermes Embodied turns Hermes Agent into a self-improving robotics trainer. It adds three Hermes skills that close the loop between robot execution, training data collection, and model improvement β€” all orchestrated through natural language.

The same self-improvement loop that Hermes uses to get better at coding tasks (via Tinker-Atropos RL) now extends to physical robot control via Vision-Language-Action models.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   HERMES AGENT                       β”‚
β”‚  (Reasoning Layer β€” plans, monitors, orchestrates)   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ vast-gpu  β”‚  β”‚  vla-trainer β”‚  β”‚  robot-loop   β”‚  β”‚
β”‚  β”‚  (skill)  β”‚  β”‚   (skill)    β”‚  β”‚   (skill)     β”‚  β”‚
β”‚  β”‚           β”‚  β”‚              β”‚  β”‚               β”‚  β”‚
β”‚  β”‚ Provision β”‚  β”‚ SmolVLA /    β”‚  β”‚ Deploy model  β”‚  β”‚
β”‚  β”‚ & manage  β”‚  β”‚ GR00T fine-  β”‚  β”‚ Collect traj  β”‚  β”‚
β”‚  β”‚ cloud GPU β”‚  β”‚ tuning on    β”‚  β”‚ Auto-retrain  β”‚  β”‚
β”‚  β”‚ instances β”‚  β”‚ LeRobot data β”‚  β”‚ when improved β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚              SIMULATION / HARDWARE                   β”‚
β”‚                                                      β”‚
β”‚  MuJoCo + LeRobot gym_hil    OR    SO-ARM101 + USB  β”‚
β”‚  (Franka Panda sim tasks)          (Physical arm)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Self-Improvement Loop

  1. Deploy β€” Hermes loads a VLA checkpoint and runs it in sim (or on hardware)
  2. Collect β€” Every rollout is recorded as a LeRobot trajectory (state, action, camera, reward)
  3. Curate β€” Hermes filters successful trajectories (reward > threshold)
  4. Train β€” Provisions a GPU on Vast.ai and fine-tunes SmolVLA on the new data
  5. Evaluate β€” Runs open-loop eval comparing new checkpoint vs. old
  6. Promote β€” If new model is better, it becomes the active policy
  7. Repeat β€” Scheduled via Hermes cron, runs autonomously
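The loop above can be sketched as a single iteration. `run_rollouts`, `finetune`, and `evaluate` are passed in as hypothetical stand-ins for the actual skill implementations; the threshold and episode count are illustrative defaults, not values from this repo:

```python
def improvement_iteration(active_ckpt, run_rollouts, finetune, evaluate,
                          reward_threshold=0.8, episodes=50):
    """One pass of deploy -> collect -> curate -> train -> evaluate -> promote.

    The three callables are hypothetical stand-ins for the Hermes skills.
    """
    # Deploy the active checkpoint and record every rollout as a trajectory
    trajectories = run_rollouts(active_ckpt, episodes)

    # Curate: keep only rollouts that cleared the reward threshold
    curated = [t for t in trajectories if t["reward"] > reward_threshold]
    if not curated:
        return active_ckpt  # nothing worth training on this round

    # Train: fine-tune on the curated data (GPU provisioning elided here)
    candidate = finetune(active_ckpt, curated)

    # Evaluate both checkpoints; promote only if the candidate improves
    if evaluate(candidate) > evaluate(active_ckpt):
        return candidate
    return active_ckpt
```

In the real system the "Repeat" step would call this under a Hermes cron schedule; the no-improvement branch simply keeps the current policy active.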

Skills

vast-gpu β€” Cloud GPU Infrastructure

Provision, monitor, and teardown GPU instances on Vast.ai through natural language.
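Under the hood this boils down to invocations of the `vastai` CLI. A minimal sketch of the commands involved, with the query string and flags approximating (not quoting) the actual CLI syntax:

```python
def search_cmd(gpu="A100", min_vram_gb=24):
    # Approximate filter syntax; check `vastai search offers --help`
    query = f"gpu_name={gpu} gpu_ram>={min_vram_gb}"
    return ["vastai", "search", "offers", query]

def create_cmd(offer_id, image="pytorch/pytorch:latest", disk_gb=60):
    # Rent a specific offer; image and disk size are illustrative
    return ["vastai", "create", "instance", str(offer_id),
            "--image", image, "--disk", str(disk_gb)]

def destroy_cmd(instance_id):
    # Teardown, so idle instances stop billing
    return ["vastai", "destroy", "instance", str(instance_id)]
```

These would be run via `subprocess` by the skill; Hermes's role is deciding when to provision and, importantly, when to tear down.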

vla-trainer β€” VLA Fine-Tuning Pipeline

End-to-end fine-tuning of Vision-Language-Action models.
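The knobs for a run would live in `configs/training.yaml`; a sketch of what such a config might contain, with every value illustrative (the steps figure mirrors the SmolVLA row of the models table, the dataset id is hypothetical):

```yaml
# configs/training.yaml β€” illustrative values only
model: smolvla                     # 450M-param base
dataset: lerobot/pick_place_demo   # hypothetical LeRobot dataset id
steps: 20000                       # ~4h on a single A100
batch_size: 64
learning_rate: 1.0e-4
output_dir: checkpoints/smolvla_ft
```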

robot-loop β€” Continuous Improvement

The autonomous improvement cycle.
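The open-loop evaluation at the heart of this cycle can be sketched as replaying recorded episodes and scoring how closely a policy reproduces the recorded actions. `policy` is any callable standing in for the loaded VLA; the `(observation, action)` pair format is an assumption about the episode layout:

```python
def open_loop_mse(policy, trajectory):
    """Mean squared error between policy actions and recorded actions.

    `trajectory` is a list of (observation, action) pairs, a stand-in
    for a LeRobot episode; lower is better.
    """
    total, n = 0.0, 0
    for obs, recorded_action in trajectory:
        predicted = policy(obs)
        total += sum((p - a) ** 2 for p, a in zip(predicted, recorded_action))
        n += len(recorded_action)
    return total / n
```

Comparing this score for the new vs. old checkpoint on a held-out set is what gates the "Promote" step.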

Quick Start

# Tell Hermes what you want
"Set up a simulation environment for pick-and-place tasks"

# Hermes installs MuJoCo, LeRobot, configures the Franka Panda env

"Train SmolVLA on the pick-and-place demo dataset"

# Hermes provisions a Vast.ai GPU, downloads data, runs fine-tuning

"Deploy the trained model and start the improvement loop"

# Hermes runs inference in sim, collects trajectories, schedules retraining

Hardware Support (Optional)

For physical deployment, the same skills drive an SO-ARM101 arm connected over USB instead of the MuJoCo simulation.
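The arm enumerates as a USB serial device. A minimal port-detection sketch; the `(device, description)` pairs stand in for pyserial's `list_ports.comports()` output, and the keyword match is an assumption (a real setup would match the device's actual VID/PID):

```python
def find_arm_port(ports, keyword="SO-ARM"):
    """Return the first serial device whose description mentions the arm.

    `ports` is a list of (device, description) pairs, e.g. built from
    pyserial's serial.tools.list_ports.comports().
    """
    for device, description in ports:
        if keyword.lower() in (description or "").lower():
            return device
    return None
```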

Models Supported

Model       Params   Train Time (A100)   VRAM    Best For
SmolVLA     450M     ~4h / 20k steps     22GB    Fast iteration, prototyping
GR00T N1.5  3B       ~4h / 10k steps     25GB    Production, complex tasks
GR00T N1.6  3B       ~4h / 10k steps     25GB    Latest, best performance

Cost Estimate

Project Structure

hermes-embodied/
β”œβ”€β”€ README.md
β”œβ”€β”€ skills/
β”‚   β”œβ”€β”€ vast-gpu/
β”‚   β”‚   └── SKILL.md
β”‚   β”œβ”€β”€ vla-trainer/
β”‚   β”‚   └── SKILL.md
β”‚   └── robot-loop/
β”‚       └── SKILL.md
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ setup_sim.py          # MuJoCo + LeRobot environment setup
β”‚   β”œβ”€β”€ collect_trajectories.py # Run VLA in sim, save rollouts
β”‚   β”œβ”€β”€ train_smolvla.py      # Fine-tuning wrapper
β”‚   β”œβ”€β”€ evaluate.py           # Open-loop eval + metrics
β”‚   └── improvement_loop.py   # Full autonomous loop
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ sim_env.json          # Simulation environment config
β”‚   β”œβ”€β”€ training.yaml         # Training hyperparameters
β”‚   └── vast_instance.yaml    # GPU instance specs
└── docs/
    └── ARCHITECTURE.md

Built With

License

MIT