Self-Hosting Honcho: Agent Memory That Actually Works
I spent two days debugging why my Honcho instance wasn't creating memories. The deriver was running, logs looked fine, but no observations appeared. The fix was one boolean flag that defaults to the wrong value for personal use. Here's the complete working configuration I wish I'd had from day one.
What is Honcho?
Honcho is an agent memory system by Plastic Labs. Unlike simple chat history, it extracts facts about the user, builds a persistent profile, and enables dialectic reasoning — the agent can answer questions about you based on accumulated observations across sessions.
The managed cloud version uses their proprietary Neuromancer model. I self-host because I want my conversation data local and I want to choose my own LLM backend.
Architecture
Hermes Agent → Honcho API (localhost:8000)
    ├── PostgreSQL (messages, sessions)
    ├── Redis (cache)
    ├── LanceDB (vector embeddings)
    └── Deriver (background worker)
        └── OpenRouter (LLM API)
The deriver is the magic. It processes every message, extracts facts, generates embeddings, and stores observations. Without it running correctly, Honcho is just a chat logger.
The Critical Fix: FLUSH_ENABLED
Here's what blocked me for hours. Honcho's deriver has a FLUSH_ENABLED setting that defaults to false.
When false, the deriver batches representation work until REPRESENTATION_BATCH_MAX_TOKENS (1024 tokens) is reached. In high-volume production, this saves API costs by grouping multiple observations into a single LLM call.
In a personal deployment? Messages trickle in slowly. The 1024-token threshold is never reached. Observations never appear.
Fix: Set FLUSH_ENABLED = true in config.toml and DERIVER_FLUSH_ENABLED=true in .env.
# config.toml
[deriver]
FLUSH_ENABLED = true
# .env
DERIVER_FLUSH_ENABLED=true
After enabling this, I saw Observation Count: 4 within 20 seconds of sending a message. Problem solved.
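To see why the default stalls on a low-traffic instance, here is a toy model of threshold batching. It is not Honcho's deriver code: the `ToyBatcher` class, the per-message token counts, and the assumption that an enabled flush processes work on every message are all made up for illustration. Only the 1024-token threshold comes from the setting described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatcher:
    """Toy model of threshold batching; NOT Honcho's actual deriver logic."""

    max_tokens: int = 1024        # mirrors REPRESENTATION_BATCH_MAX_TOKENS
    flush_enabled: bool = False   # mirrors FLUSH_ENABLED
    pending: list[int] = field(default_factory=list)
    batches_processed: int = 0

    def add_message(self, token_count: int) -> bool:
        """Queue a message; return True if a batch was processed."""
        self.pending.append(token_count)
        # Assumption: with flushing on, work is processed immediately.
        if self.flush_enabled or sum(self.pending) >= self.max_tokens:
            self.pending.clear()
            self.batches_processed += 1
            return True
        return False


def simulate(flush_enabled: bool, message_tokens: list[int]) -> int:
    """Feed messages through a fresh batcher and count processed batches."""
    batcher = ToyBatcher(flush_enabled=flush_enabled)
    for tokens in message_tokens:
        batcher.add_message(tokens)
    return batcher.batches_processed


# Ten short messages of 40 tokens each: 400 tokens total, below 1024.
short_messages = [40] * 10
print(simulate(False, short_messages))  # 0
print(simulate(True, short_messages))   # 10
```

With flushing off, ten short messages never reach the threshold and nothing is processed, which matches the symptom of a running deriver that produces no observations.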
Complete Working Configuration
config.toml
[app]
LOG_LEVEL = "INFO"
EMBED_MESSAGES = true
MAX_EMBEDDING_TOKENS = 8192
[db]
CONNECTION_URI = "postgresql+psycopg://honcho:honcho@database:5432/honcho"
[auth]
USE_AUTH = false
[cache]
ENABLED = true
URL = "redis://redis:6379/0?suppress=true"
[vector_store]
TYPE = "lancedb"
DIMENSIONS = 1024
URI = "/app/lancedb_data"
[deriver]
FLUSH_ENABLED = true
[deriver.model_config]
transport = "openai"
model = "deepseek/deepseek-v4-flash"
[deriver.model_config.overrides]
base_url = "https://openrouter.ai/api/v1"
api_key_env = "LLM_OPENAI_API_KEY"
Key Configuration Details
Vector store: I use LanceDB instead of pgvector. Why? pgvector has a hardcoded 1536-dimension validation that cannot be overridden without source changes. My embedding model (baai/bge-m3) produces 1024 dimensions. LanceDB is file-based, accepts any dimension, and swaps in by changing TYPE.
TOML nesting: This tripped me up. Overrides must be under [*.model_config.overrides], not [*.overrides]. The model_config layer is essential.
Config mounts: Both api and deriver containers need ./config.toml:/app/config.toml:ro in docker-compose.yml. Without this, the deriver uses a baked-in config from the Docker image and ignores all your changes.
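The mount requirement can be checked mechanically once docker-compose.yml is loaded into a dictionary. The sketch below assumes the compose file uses the short volume syntax and that the services are named `api` and `deriver`, as described above; the sample dictionary is hypothetical. Loading real YAML needs a third-party parser such as PyYAML (`yaml.safe_load`), which is left out to keep the example standard-library only.

```python
REQUIRED_MOUNT = "./config.toml:/app/config.toml:ro"


def services_missing_mount(compose: dict, services=("api", "deriver")) -> list[str]:
    """Return the names of services that lack the config.toml bind mount."""
    missing = []
    for name in services:
        volumes = compose.get("services", {}).get(name, {}).get("volumes", [])
        if REQUIRED_MOUNT not in volumes:
            missing.append(name)
    return missing


# Hypothetical parsed docker-compose.yml where only the api service has the mount.
compose = {
    "services": {
        "api": {"volumes": [REQUIRED_MOUNT]},
        "deriver": {"volumes": []},
    }
}
print(services_missing_mount(compose))  # ['deriver']
```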
Model Selection
I use DeepSeek v4-flash via OpenRouter for all tiers. Here is how it compares to Gemini 2.5 Flash Lite:
| Metric | Gemini Flash Lite | DeepSeek v4 Flash |
|---|---|---|
| Tool Calls | 1 | 7 |
| Input Tokens | 6,330 | 21,151 |
| Duration | 5.2s | 24.9s |
| Quality | Good | Excellent (more detailed) |
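DeepSeek took nearly five times as long (24.9s versus 5.2s) and consumed about 3.3 times the input tokens, but it made seven tool calls to Gemini's one and produced more thorough memory extractions. For a background worker where latency does not matter, the quality trade-off is worth it.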
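The ratios behind that trade-off follow directly from the table. This small calculation uses only the figures listed above; the dictionary keys are just labels chosen for the example.

```python
# Figures copied from the comparison table above.
gemini = {"tool_calls": 1, "input_tokens": 6330, "duration_s": 5.2}
deepseek = {"tool_calls": 7, "input_tokens": 21151, "duration_s": 24.9}

# DeepSeek value divided by Gemini value, rounded to one decimal place.
ratios = {key: round(deepseek[key] / gemini[key], 1) for key in gemini}
print(ratios)  # {'tool_calls': 7.0, 'input_tokens': 3.3, 'duration_s': 4.8}
```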
Verification Workflow
After setup, verify everything works:
# 1. Health check
curl -s http://localhost:8000/health
# → {"status":"ok"}
# 2. Create a session and message
curl -s -X POST http://localhost:8000/v3/workspaces/hermes/sessions \
-H "Content-Type: application/json" \
-d '{"id":"test","peer_id":"cosmo"}'
curl -s -X POST http://localhost:8000/v3/workspaces/hermes/sessions/test/messages \
-H "Content-Type: application/json" \
-d '{"messages":[{"content":"I build agent systems.","peer_id":"cosmo"}]}'
# 3. Wait 20-30 seconds, check deriver logs
docker compose logs deriver --tail 20 | grep "Observation Count"
# → Observation Count: 4 count ✅
# 4. Test dialectic reasoning
curl -s -X POST http://localhost:8000/v3/workspaces/hermes/peers/cosmo/chat \
-H "Content-Type: application/json" \
-d '{"query":"What do you know about this user?","agentic":true}'
Troubleshooting Checklist
- No observations? Check FLUSH_ENABLED. This is the #1 cause.
- 401 AuthenticationError? Check that PROVIDER is not set to "vllm"; only "openai", "anthropic", and "gemini" are valid.
- Embedding dimension mismatch? Switch from pgvector to LanceDB.
- Config changes not applying? Verify volume mounts in docker-compose.yml for both api and deriver.
- LanceDB permission errors? Use a named Docker volume instead of bind mount.
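For the first item on the checklist, it helps to pull the observation counts out of the deriver logs rather than scanning them by eye. The sketch below assumes log lines contain the text `Observation Count: <number>`, as in the output quoted in the verification step; the `deriver-1  |` prefix and the second sample line are hypothetical.

```python
import re

# Matches the "Observation Count: N" text quoted in the verification step.
OBSERVATION_RE = re.compile(r"Observation Count:\s*(\d+)")


def observation_counts(log_text: str) -> list[int]:
    """Return every observation count found in the given log text, in order."""
    return [int(value) for value in OBSERVATION_RE.findall(log_text)]


# Hypothetical log excerpt; real deriver logs may carry other prefixes.
sample_logs = (
    "deriver-1  | Observation Count: 4 count\n"
    "deriver-1  | some unrelated log line\n"
)
print(observation_counts(sample_logs))  # [4]
```

An empty list after sending a message and waiting is the signal to go back and check FLUSH_ENABLED.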
Key Takeaways
- FLUSH_ENABLED=true is mandatory for personal use. The default false is for high-volume production only.
- LanceDB > pgvector for flexible dimensions. No hardcoded 1536-dim constraint.
- Config must be volume-mounted in both containers. The deriver will not see host changes otherwise.
- DeepSeek v4-flash is excellent for deriver work. Slower than Gemini but far more thorough.
- Verify end-to-end before declaring victory. Health check → message → observation → dialectic response.
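The end-to-end check can also be scripted. The sketch below only builds the three POST requests from the verification workflow, using the same paths and payloads as the curl commands; it does not send them. The function names are hypothetical, and the workspace, session, and peer values are the ones used in this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request against the local API."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def verification_requests(workspace="hermes", session="test", peer="cosmo"):
    """Return the session, message, and dialectic requests in order."""
    base = f"/v3/workspaces/{workspace}"
    message = {"content": "I build agent systems.", "peer_id": peer}
    return [
        build_post(f"{base}/sessions", {"id": session, "peer_id": peer}),
        build_post(f"{base}/sessions/{session}/messages", {"messages": [message]}),
        build_post(
            f"{base}/peers/{peer}/chat",
            {"query": "What do you know about this user?", "agentic": True},
        ),
    ]


for request in verification_requests():
    print(request.get_method(), request.full_url)
```

Each request can be sent with `urllib.request.urlopen(request)`. Leave 20 to 30 seconds between the message and the chat request so the deriver has time to produce observations.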
Self-hosted Honcho is powerful once configured correctly. The documentation exists, but critical details are scattered across GitHub issues and source code. I hope this guide saves you the two days I spent debugging.