Memory & Token Guide
Understanding context windows, token budgets, and the four memory layers.
Last updated: March 21, 2026
📊 Understanding the Context Window
Every LLM has a fixed context window — the total number of tokens it can see at once. Everything you send (system prompt, memory, conversation history, tool calls, tool results) must fit inside this window.
Anatomy of a 1M token context window (Claude Opus)
🚨 When context fills up
At 85%+ capacity, the model triggers compaction — it summarizes the conversation and discards older messages. Each compaction costs tokens (the model reads everything to summarize it) and loses detail. 35 compactions in one session = massive token waste.
✅ Goal: stay under 60%
A healthy session stays below 600K tokens on a 1M window. This leaves room for long tool outputs, avoids compaction storms, and keeps the model responsive. The settings in this guide target this.
📚 The Four Memory Layers
OpenClaw has four independent memory systems. Each adds tokens to the context window. Understanding what each one does — and when to turn it off — is the key to token efficiency.
Layer 1: QMD (Curated Memory)
Markdown files injected into the system prompt every turn. MEMORY.md, SOUL.md, TOOLS.md, USER.md, IDENTITY.md. You write these manually. Always loaded — the "always on" layer.
Token cost: Fixed per turn. Depends on file sizes. Typically 3K–8K tokens.
Layer 2: LCM (Lossless Context Management)
Reconstructs context from previous compactions. When the model compacts, LCM saves what was lost. On the next turn, it selectively recalls relevant chunks. This is the most powerful — and most expensive — memory layer.
Token cost: Variable. Can inject 100K–500K+ tokens if uncapped. The #1 source of token burn.
Layer 3: TrueMem (Knowledge Graph)
Searches a Neo4j knowledge graph for facts relevant to the current conversation.
Injects structured entity/relationship data via the before_agent_start hook.
Only useful when the graph has data.
Token cost: Low per turn (~500–2K tokens for 8 facts). But if the graph is empty, it's pure overhead.
Layer 4: Session Memory (memory_search)
Hybrid search across past session transcripts and QMD files. Triggered by memory_search tool calls. Returns relevant snippets from previous conversations.
Token cost: On-demand only. Cost depends on result count and size. Cached results reduce repeat lookups.
💰 Where Tokens Go
A typical turn on Claude Opus with all memory systems enabled. Understanding this breakdown tells you exactly where to cut.
Token Budget Anatomy (per turn)
| Component | Typical Range | Can You Control It? |
|---|---|---|
| System prompt (OpenClaw core) | ~15K–25K | No — framework overhead |
| QMD default memory files | 3K–8K | Yes — trim file sizes |
| LCM recalled context | 0–500K+ | Yes — depth + threshold |
| TrueMem auto-recall | 0–2K | Yes — on/off, maxRecallFacts |
| Conversation history | Grows over session | Partial — pruning TTL |
| Tool calls + results | Varies wildly | Partial — softTrim |
| Reserve floor | 70K–200K | Yes — reserveTokensFloor |
⚠️ The compaction death spiral
When context hits the ceiling: compact (costs tokens) → LCM saves chunks → next turn, LCM recalls chunks back → context fills again → compact again → repeat. A session that compacts 35 times has spent more tokens on compaction overhead than on actual work.
🔄 LCM — Lossless Context Management
LCM is the most impactful setting for token usage. It controls how much historical context gets reconstructed after each compaction.
incrementalMaxDepth
How many compaction layers deep LCM will go to reconstruct context.
- -1 = unlimited (dangerous)
- 0 = disabled (no recall)
- 3 = recall from last 3 compactions
- 5 = deeper recall for long sessions
Recommended: 3–5
contextThreshold
Minimum relevance score for recalled chunks. Higher = more selective, fewer tokens.
- 0.5 = loose (pulls in marginal context)
- 0.6 = moderate
- 0.75 = selective (good balance)
- 0.9 = very strict (may miss things)
Recommended: 0.7–0.8
freshTailCount
Number of most recent messages always kept in context (never compacted). Higher values preserve more immediate context but consume more tokens.
- 8 = minimal (fast conversations)
- 16 = balanced (default)
- 24 = generous (complex multi-step tasks)
Recommended: 12–16
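Taken together, the three settings above might sit in a config like the following sketch. The lcm section name and key placement are assumptions about OpenClaw's settings schema; the key names and recommended values come from this guide:

```json
{
  "lcm": {
    "incrementalMaxDepth": 3,
    "contextThreshold": 0.75,
    "freshTailCount": 16
  }
}
```

Depth 3 with a 0.75 threshold keeps recall selective; raise freshTailCount toward 24 only for complex multi-step tasks.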
📝 QMD — Curated Memory
QMD files are injected into the system prompt on every single turn, so every byte counts. The goal: keep auto-injected files under 3KB each, and move reference material to searchable files.
❌ Anti-pattern: bloated defaults
- MEMORY.md at 6.5KB (detailed bug lists, test results)
- TOOLS.md at 4.9KB (build commands, OAuth re-auth steps)
- SOUL.md at 8KB (full personality essay)
- Total: ~23KB = ~6K tokens every turn
✅ Best practice: lean defaults
- MEMORY.md: identity, team, rules, project one-liners (<3KB)
- TOOLS.md: core tools, security rules, key lessons (<2.5KB)
- SOUL.md: essential personality only (<3KB)
- Total: ~8KB = ~2K tokens every turn
The reference file pattern
Move detailed information to separate files that QMD can search on demand:
- TOOLS.md (auto-injected) → TOOLS-REFERENCE.md (searchable)
- MEMORY.md (auto-injected) → memory/archive-*.md (searchable)
Searchable files are indexed by QMD's hybrid search. The agent can find them
via memory_search when needed, but they don't consume tokens on every turn.
QMD search settings
| Setting | Value | Notes |
|---|---|---|
| maxResults | 8 | Max search results per query. Lower = fewer tokens. |
| timeoutMs | 6000 | Search timeout. 6s is generous. |
| vectorWeight | 0.7 | Semantic search weight. Higher favors meaning over keywords. |
| mmr.enabled | true | Deduplicates similar results. Always keep on. |
| temporalDecay | 30-day half-life | Older memories rank lower. Good default. |
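As a config fragment, the table above could look like this sketch. The nesting under qmd.search and the halfLifeDays key name are assumptions about the schema; the values are the ones recommended above:

```json
{
  "qmd": {
    "search": {
      "maxResults": 8,
      "timeoutMs": 6000,
      "vectorWeight": 0.7,
      "mmr": { "enabled": true },
      "temporalDecay": { "halfLifeDays": 30 }
    }
  }
}
```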
🧠 TrueMem — Knowledge Graph
TrueMem searches a Neo4j knowledge graph and injects facts into the conversation. Great when the graph has data. Wasteful when it doesn't.
When to enable auto-recall
- Graph has 50+ entities and 100+ facts
- Conversations reference people, projects, decisions
- Multiple agents share a knowledge base
When to disable auto-recall
- Graph is empty or has <20 facts
- You're doing pure coding tasks
- Token budget is tight
TrueMem settings
| Setting | Recommended | Why |
|---|---|---|
| autoRecall | false until graph is populated | Empty searches waste a tool call + response tokens |
| maxRecallFacts | 5–8 | More facts = more context. 8 is a good ceiling. |
| minRelevanceScore | 0.3–0.5 | Filters low-quality results. 0.3 is permissive. |
| debug | false | Debug logging adds tokens to output. |
| autoCapture | false | Let the Librarian handle batch extraction instead. |
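A conservative TrueMem block matching the table, as a sketch (section name and nesting are assumed; flip autoRecall to true once the graph passes roughly 50 entities):

```json
{
  "truemem": {
    "autoRecall": false,
    "maxRecallFacts": 8,
    "minRelevanceScore": 0.3,
    "debug": false,
    "autoCapture": false
  }
}
```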
✂️ Compaction & Pruning
Compaction summarizes and discards old context when the window fills up. Pruning removes stale intermediate data (tool results, old messages) proactively. Both reduce token usage — but compaction is expensive and pruning is cheap.
Compaction settings
| Setting | Default | Recommended | Effect |
|---|---|---|---|
| reserveTokensFloor | 70,000 | 150,000 | Compacts earlier (at 850K vs 930K on a 1M window). Fewer compactions per session. |
| memoryFlush.softThresholdTokens | 4,000 | 35,000 | Saves important context to QMD files before compaction. Insurance against data loss. |
| mode | "default" | "default" | Standard compaction. Summarizes then truncates. Predictable and well-tested. |
Pruning settings
| Setting | Default | Recommended | Effect |
|---|---|---|---|
| contextPruning.ttl | "5m" | "30m" | Prunes tool results older than 30 minutes. Prevents stale data from bloating context. |
| keepLastAssistants | 3 | 2 | Keeps fewer assistant messages in full. Saves tokens on long responses. |
| softTrim.maxChars | 4,000 | 4,000 | Trims long messages to head + tail. 4K is reasonable. |
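Combining both tables into one fragment, with the same caveat that the compaction and contextPruning section names are assumptions about the schema (the values are the recommended column):

```json
{
  "compaction": {
    "reserveTokensFloor": 150000,
    "memoryFlush": { "softThresholdTokens": 35000 },
    "mode": "default"
  },
  "contextPruning": {
    "ttl": "30m",
    "keepLastAssistants": 2,
    "softTrim": { "maxChars": 4000 }
  }
}
```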
💡 The math: why 150K floor matters
On a 1M token window with reserveTokensFloor: 70000:
- Compaction triggers at ~930K → model reads 930K to summarize → costs ~930K input tokens
- If this happens 35 times: 35 × 930K = 32.5M tokens just on compaction
With reserveTokensFloor: 150000:
- Compaction triggers at ~850K → happens less often (LCM recalls less)
- Combined with LCM depth cap: maybe 8–12 compactions → ~8M tokens
- ~75% reduction in compaction overhead
⚙️ Recommended Settings
Copy-paste ready config for a balanced setup. Adjust based on your use case.
Profile: Balanced (recommended)
Good for general use. Smart recall, controlled growth, early compaction.
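A sketch of what this profile could contain, assembled from the recommendations earlier in this guide; the section names and exact schema are assumptions about OpenClaw's settings format:

```json
{
  "lcm": { "incrementalMaxDepth": 3, "contextThreshold": 0.75, "freshTailCount": 16 },
  "truemem": { "autoRecall": false, "maxRecallFacts": 8, "debug": false },
  "compaction": { "reserveTokensFloor": 150000, "memoryFlush": { "softThresholdTokens": 35000 } },
  "contextPruning": { "ttl": "30m", "keepLastAssistants": 2 }
}
```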
Profile: Minimal (cost-sensitive)
For tight budgets or lightweight tasks.
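A possible shape for this profile under the same assumed schema: LCM recall disabled, TrueMem off, and pruning more aggressive than the balanced defaults (the "10m" TTL is an illustrative choice, not a value from this guide):

```json
{
  "lcm": { "incrementalMaxDepth": 0, "freshTailCount": 8 },
  "truemem": { "autoRecall": false },
  "compaction": { "reserveTokensFloor": 150000 },
  "contextPruning": { "ttl": "10m", "keepLastAssistants": 2, "softTrim": { "maxChars": 4000 } }
}
```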
Profile: Deep Memory (research)
For long research sessions that need full history.
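One way this profile might look: deeper LCM recall, a looser threshold, a bigger fresh tail, and a longer pruning TTL, while still keeping the depth capped rather than -1 (schema assumed as above; the "60m" TTL is illustrative):

```json
{
  "lcm": { "incrementalMaxDepth": 5, "contextThreshold": 0.7, "freshTailCount": 24 },
  "truemem": { "autoRecall": true, "maxRecallFacts": 8 },
  "compaction": { "reserveTokensFloor": 100000, "memoryFlush": { "softThresholdTokens": 35000 } },
  "contextPruning": { "ttl": "60m", "keepLastAssistants": 3 }
}
```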
✅ Quick Checklist
Run through this before starting a long session or after noticing high token usage.
Before session
- LCM incrementalMaxDepth is capped (not -1)
- QMD default files are under 3KB each
- TrueMem autoRecall is off if graph has <50 entities
- TrueMem debug is off in production
- reserveTokensFloor is at least 100K (150K preferred)
- Reference material moved to searchable files, not auto-injected
Warning signs during session
- Context above 70% — session is growing fast
- 5+ compactions — LCM depth or threshold may be too aggressive
- Cache hit rate below 10% — context changing too much between turns
- Tokens out << tokens in — mostly reading context, not generating
Never do
- Set LCM incrementalMaxDepth: -1 (unlimited) in production
- Put build commands, OAuth steps, or debug logs in auto-injected files
- Enable TrueMem autoRecall with an empty knowledge graph
- Run reserveTokensFloor below 70K on Opus (compacts too late)
- Keep debug: true on any plugin in production