Borrowing From Hermes Agent: A Self-Improvement Stack for a Multi-Agent Claude Code Fleet

NousResearch shipped Hermes Agent recently — a self-improving open-source AI agent with persistent memory, auto-generated skills, and a closed learning loop. I read the architecture, cross-referenced it against what we already had on our four-worker Claude Code fleet, and ended up borrowing six patterns. They were small enough to build and wire in a single session. This post is what I built, why, and what surprised me.

What we already had

Our setup is four Claude Code processes running in a single Docker container — cc1 through cc4 — sharing a filesystem (/work) but coordinating via lock files. Each worker reads a tiered model policy (Ollama free → Haiku cheap → Opus only when needed) and respects a shared POLICIES.md. We have persistent memory under /root/.claude/projects/-work/memory/ organized hot/warm/cold, a failure journal, and a nightly daily-improve.sh cron that extracts steering rules from recent retros.

So we had the bones of a self-improving system. What we didn’t have:

A real search index over session history. We were grepping markdown files.
A dynamic user model — observed preferences and patterns, distinct from static personality config.
Auto-generated skills from successful tasks. We auto-extracted rules from failures only.
Trajectory data for fine-tuning or auditing.
A way to spawn long-running parallel work without inflating the calling worker’s context.
A summarization layer so I could ask “what’s been happening this week” without reading every session log.

Hermes does all of these. So I copied them.

What I built

Six tools, all in /work/_SHARED/claude-config/. Each is a thin bash wrapper around Python where Python does the real work — the bash layer just exists so workers can muscle-memory the command form.

1. FTS5 session recall

A Python module maintains a SQLite database with FTS5 full-text indexes over four sources: session logs, gotchas, failures, learnings, and the new Charles user model. A bash wrapper auto-reindexes before searching, so results are always fresh.

bash recall.sh "garagelog migration"           # top 10 across all sources
bash recall.sh --kind gotcha "expo 54"         # filter to GOTCHAS.md
bash recall.sh --days 7 "STIM"                 # last week only

Initial index: 1,300 chunks across 28 files in under a second. Search latency: a few milliseconds. Returns ranked snippets with matched terms highlighted. This single tool replaces most of the grep -r work the workers were doing.

2. Insights summarizer

Queries the same FTS5 index for recent activity, packs the chunks into a token budget, and asks the local Ollama model (qwen3-coder-next, 80B MoE) to produce a structured digest:

Themes — what topics dominated
Decisions made — concrete choices
Blockers — stuck items / pending Charles actions
New rules — anything added to GOTCHAS / FAILURES
Worth Charles’s attention — surprises and contradictions

Wired into cron at 7:30 AM EST. The output gets pushed to my phone via Nova as a daily briefing. First run produced a sharp summary of the last week including patterns I’d half-noticed but hadn’t named — like “Cloudflare Pages deploys silently incomplete” showing up across three workers’ logs.

3. CHARLES.md dialectic user model

The closest copy of Hermes. Hermes maintains an evolving “user profile” that tracks observed preferences, patterns, and vocabulary — distinct from settings. I built a CHARLES.md with sections: Confirmed Patterns, Preferences, Rejections, Vocabulary, Decision Style, Frustrations, plus an append-only Journal.

Workers append observations:

bash update-charles.sh \
    "Wants results not plans — said 'do all of them' when given a 5-step proposal" \
    --kind decision

The --promote mode is where it gets interesting. When the journal accumulates many observations:

bash update-charles.sh --promote

This sends the journal to Ollama with instructions to find groups of 3+ entries describing the same underlying pattern, write a single consolidated bullet, and assign it to the right top-level section. The script then edits CHARLES.md atomically: appends the bullet, removes the original journal entries.

First live run promoted four “Charles always picks all options when given a list” entries into a single Decision Style bullet. The journal got shorter; the structured sections got smarter. Hermes calls this dialectic memory — built up through observation, periodically consolidated.

4. Auto-skill drafting in retro.sh

Our existing retro.sh already wrote a daily learning note and (optionally) appended a new rule to GOTCHAS. I extended it with a --propose-skill <name> flag that asks Ollama to draft a Claude Code SKILL.md from the task summary:

bash retro.sh "task" "what worked" "what didn't" "rule" \
    --rating 5 --propose-skill recall-worker-memory

The drafted skill goes to a proposed/ directory. I review and cp -r to ~/.claude/skills/ if it’s worth promoting.

This is where I learned the most important lesson of the session: auto-drafted skills initially described how to BUILD the artifact, not how to USE it. The first draft of the FTS5 skill said “Step 1: Chunk markdown files by H2/H3 headings using a Python script…” — useless to a future agent, that work is already done. I rewrote the prompt to be explicit:

The task that just completed produced ARTIFACTS (a tool, a script, a workflow). The skill teaches future agents how to USE the artifacts that already exist. Phrase every step as “invoke X with these args”, not “build X like this”.

After the prompt fix, the same task produced a draft that correctly said “Use WHEN you need to search across worker memory; MUST use recall.sh before manually inspecting raw logs.” Big difference. One of those one-line prompt tweaks that makes a generative tool actually useful.

5. Trajectory logging + export

Every retro now also writes a trajectory tuple to a monthly JSONL file:

{"ts": "...", "worker": "2", "task": "...", "worked": "...",
 "failed": "...", "rule": "...", "rating": 5}

A separate exporter consolidates the JSONL into:

SFT — Anthropic-shaped messages array (system + user + assistant) ready for fine-tuning
CSV — for spreadsheet review
Stats — counts by worker / rating / tag

Filterable by date, rating, or worker. The SFT export is forward-looking — at this scale it’s audit data more than training data, but if we ever wanted to fine-tune a small model on “how this fleet operates”, the data is already structured.

6. RPC batch executor

The smallest piece, but the one I expect to use most. batch-exec.sh takes commands (via args or stdin), spawns each as a nohup background process, writes per-command logs and exit codes to a batch directory, and returns immediately with the directory path.

echo "cd app1 && ./gradlew bundleRelease
cd app2 && ./gradlew bundleRelease
cd app3 && ./gradlew bundleRelease" | \
    bash batch-exec.sh aab-builds

bash batch-status.sh aab-builds --tail 5 --wait

This solves a specific Claude Code problem: when a worker spawns a 10-minute parallel task, every line of stdout it captures bloats its conversation context. The batch executor sidesteps that — the worker dispatches work and only reads results when they’re ready.

What it cost

Roughly two hours of one worker’s time, mostly Opus for code generation, mostly Ollama for the auto-drafted content. The total token bill for the build session was small enough not to register on daily monitoring. None of these tools required new infrastructure — they all run inside the existing Docker container, write to the existing filesystem, and use the existing Ollama server.

What surprised me

The smallest tool was the most underused capability. Workers had spawned Agent subprocesses for parallel work many times. Each call still inflated the parent context with the subagent’s results. The plain-bash batch-exec.sh was the obvious right answer for the build-many-AABs case the workers had been brute-forcing for weeks.

Prompt framing matters more than model selection. The build-vs-use distinction in the skill drafter was a one-paragraph prompt change. It transformed the output quality more than swapping models would have. Bad prompts produce bad outputs no matter how good the model is.

Hermes’s “dialectic” framing for the user model is the right name. It’s not a fixed profile — it’s a working hypothesis updated by observation, periodically consolidated. That’s a different mental model than “settings” or “preferences”, and it produces a different file shape (Journal section, then sections you promote into).

If you want to see the larger systems these tools support:

Nova — Personal AI Assistant — the 56-tool assistant the daily insights push goes to.
Multi-Agent App Factory — the fleet that produced 48 mobile apps, including RxLog in Play Store production.
Local Content Pipeline — the cron-based publishing system this very post goes through.

What’s next

The obvious next layer is consumption: building the muscle memory to actually reach for these tools when they’re cheaper than the alternative. I added a hermes-self-improvement-stack skill so future cold-start sessions discover the whole stack at once. We’ll see how often it gets invoked.

If any of this resembles something your team would benefit from — multi-agent dev fleets, dialectic user models, or local-LLM content pipelines — I’m taking on consulting work in this space. Services and pricing here, or email [email protected] with a one-paragraph description.