Memory & Token Guide
Understanding context windows, token budgets, and the four memory layers.
Last updated: March 21, 2026
📊 Understanding the Context Window
Every LLM has a fixed context window — the total number of tokens it can see at once. Everything you send (system prompt, memory, conversation history, tool calls, tool results) must fit inside this window.
Anatomy of a 1M token context window (Claude Opus)
🚨 When context fills up
At 85%+ capacity, the model triggers compaction — it summarizes the conversation and discards older messages. Each compaction costs tokens (the model reads everything to summarize it) and loses detail. 35 compactions in one session = massive token waste.
✅ Goal: stay under 60%
A healthy session stays below 600K tokens on a 1M window. This leaves room for long tool outputs, avoids compaction storms, and keeps the model responsive. The settings in this guide target this.
📚 The Four Memory Layers
OpenClaw has four independent memory systems. Each adds tokens to the context window. Understanding what each one does — and when to turn it off — is the key to token efficiency.
Layer 1: QMD (Curated Memory)
Markdown files injected into the system prompt every turn. MEMORY.md, SOUL.md, TOOLS.md, USER.md, IDENTITY.md. You write these manually. Always loaded — the "always on" layer.
Token cost: Fixed per turn. Depends on file sizes. Typically 3K–8K tokens.
Layer 2: LCM (Lossless Context Management)
Reconstructs context from previous compactions. When the model compacts, LCM saves what was lost. On the next turn, it selectively recalls relevant chunks. This is the most powerful — and most expensive — memory layer.
Token cost: Variable. Can inject 100K–500K+ tokens if uncapped. The #1 source of token burn.
Layer 3: TrueMem (Knowledge Graph)
Searches a Neo4j knowledge graph for facts relevant to the current conversation.
Injects structured entity/relationship data via the before_agent_start hook.
Only useful when the graph has data.
Token cost: Low per turn (~500–2K tokens for 8 facts). But if the graph is empty, it's pure overhead.
Layer 4: Session Memory (memory_search)
Hybrid search across past session transcripts and QMD files. Triggered by memory_search tool calls. Returns relevant snippets from previous conversations.
Token cost: On-demand only. Cost depends on result count and size. Cached results reduce repeat lookups.
💰 Where Tokens Go
A typical turn on Claude Opus with all memory systems enabled. Understanding this breakdown tells you exactly where to cut.
Token Budget Anatomy (per turn)
| Component | Typical Range | Can You Control It? |
|---|---|---|
| System prompt (OpenClaw core) | ~15K–25K | No — framework overhead |
| QMD default memory files | 3K–8K | Yes — trim file sizes |
| LCM recalled context | 0–500K+ | Yes — depth + threshold |
| TrueMem auto-recall | 0–2K | Yes — on/off, maxRecallFacts |
| Conversation history | Grows over session | Partial — pruning TTL |
| Tool calls + results | Varies wildly | Partial — softTrim |
| Reserve floor | 70K–200K | Yes — reserveTokensFloor |
⚠️ The compaction death spiral
When context hits the ceiling: compact (costs tokens) → LCM saves chunks → next turn, LCM recalls chunks back → context fills again → compact again → repeat. A session that compacts 35 times has spent more tokens on compaction overhead than on actual work.
🔄 LCM — Lossless Context Management
LCM is the most impactful setting for token usage. It controls how much historical context gets reconstructed after each compaction.
incrementalMaxDepth
How many compaction layers deep LCM will go to reconstruct context.
- -1 = unlimited (dangerous)
- 0 = disabled (no recall)
- 3 = recall from last 3 compactions
- 5 = deeper recall for long sessions
Recommended: 3–5
contextThreshold
Minimum relevance score for recalled chunks. Higher = more selective, fewer tokens.
- 0.5 = loose (pulls in marginal context)
- 0.6 = moderate
- 0.75 = selective (good balance)
- 0.9 = very strict (may miss things)
Recommended: 0.7–0.8
freshTailCount
Number of most recent messages always kept in context (never compacted). Higher values preserve more immediate context but consume more tokens.
- 8 = minimal (fast conversations)
- 16 = balanced (default)
- 24 = generous (complex multi-step tasks)
Recommended: 12–16
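Taken together, the three settings above might sit in a config like the following sketch. The lcm section name and key placement are assumptions about OpenClaw's settings schema; the key names and recommended values come from this guide:

```json
{
  "lcm": {
    "incrementalMaxDepth": 3,
    "contextThreshold": 0.75,
    "freshTailCount": 16
  }
}
```

Depth 3 with a 0.75 threshold keeps recall selective; raise freshTailCount toward 24 only for complex multi-step tasks.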
📝 QMD — Curated Memory
QMD files are injected into the system prompt on every single turn, so every byte counts. The goal: keep auto-injected files under 3KB each, and move reference material to searchable files.
❌ Anti-pattern: bloated defaults
- MEMORY.md at 6.5KB (detailed bug lists, test results)
- TOOLS.md at 4.9KB (build commands, OAuth re-auth steps)
- SOUL.md at 8KB (full personality essay)
- Total: ~23KB = ~6K tokens every turn
✅ Best practice: lean defaults
- MEMORY.md: identity, team, rules, project one-liners (<3KB)
- TOOLS.md: core tools, security rules, key lessons (<2.5KB)
- SOUL.md: essential personality only (<3KB)
- Total: ~8KB = ~2K tokens every turn
The reference file pattern
Move detailed information to separate files that QMD can search on demand:
- TOOLS.md (auto-injected) → TOOLS-REFERENCE.md (searchable)
- MEMORY.md (auto-injected) → memory/archive-*.md (searchable)
Searchable files are indexed by QMD's hybrid search. The agent can find them
via memory_search when needed, but they don't consume tokens on every turn.
QMD search settings
| Setting | Value | Notes |
|---|---|---|
| maxResults | 8 | Max search results per query. Lower = fewer tokens. |
| timeoutMs | 6000 | Search timeout. 6s is generous. |
| vectorWeight | 0.7 | Semantic search weight. Higher favors meaning over keywords. |
| mmr.enabled | true | Deduplicates similar results. Always keep on. |
| temporalDecay | 30-day half-life | Older memories rank lower. Good default. |
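As a config fragment, the table above could look like this sketch. The nesting under qmd.search and the halfLifeDays key name are assumptions about the schema; the values are the ones recommended above:

```json
{
  "qmd": {
    "search": {
      "maxResults": 8,
      "timeoutMs": 6000,
      "vectorWeight": 0.7,
      "mmr": { "enabled": true },
      "temporalDecay": { "halfLifeDays": 30 }
    }
  }
}
```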
🧠 TrueMem — Knowledge Graph
TrueMem searches a Neo4j knowledge graph and injects facts into the conversation. Great when the graph has data. Wasteful when it doesn't.
When to enable auto-recall
- Graph has 50+ entities and 100+ facts
- Conversations reference people, projects, decisions
- Multiple agents share a knowledge base
When to disable auto-recall
- Graph is empty or has <20 facts
- You're doing pure coding tasks
- Token budget is tight
TrueMem settings
| Setting | Recommended | Why |
|---|---|---|
| autoRecall | false until graph is populated | Empty searches waste a tool call + response tokens |
| maxRecallFacts | 5–8 | More facts = more context. 8 is a good ceiling. |
| minRelevanceScore | 0.3–0.5 | Filters low-quality results. 0.3 is permissive. |
| debug | false | Debug logging adds tokens to output. |
| autoCapture | false | Let the Librarian handle batch extraction instead. |
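A conservative TrueMem block matching the table, as a sketch (section name and nesting are assumed; flip autoRecall to true once the graph passes roughly 50 entities):

```json
{
  "truemem": {
    "autoRecall": false,
    "maxRecallFacts": 8,
    "minRelevanceScore": 0.3,
    "debug": false,
    "autoCapture": false
  }
}
```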
✂️ Compaction & Pruning
Compaction summarizes and discards old context when the window fills up. Pruning removes stale intermediate data (tool results, old messages) proactively. Both reduce token usage — but compaction is expensive and pruning is cheap.
Compaction settings
| Setting | Default | Recommended | Effect |
|---|---|---|---|
| reserveTokensFloor | 70,000 | 150,000 | Compacts earlier (at 850K vs 930K on a 1M window). Fewer compactions per session. |
| memoryFlush.softThresholdTokens | 4,000 | 35,000 | Saves important context to QMD files before compaction. Insurance against data loss. |
| mode | "default" | "default" | Standard compaction. Summarizes then truncates. Predictable and well-tested. |
Pruning settings
| Setting | Default | Recommended | Effect |
|---|---|---|---|
| contextPruning.ttl | "5m" | "30m" | Prunes tool results older than 30 minutes. Prevents stale data from bloating context. |
| keepLastAssistants | 3 | 2 | Keeps fewer assistant messages in full. Saves tokens on long responses. |
| softTrim.maxChars | 4,000 | 4,000 | Trims long messages to head + tail. 4K is reasonable. |
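Combining both tables into one fragment, with the same caveat that the compaction and contextPruning section names are assumptions about the schema (the values are the recommended column):

```json
{
  "compaction": {
    "reserveTokensFloor": 150000,
    "memoryFlush": { "softThresholdTokens": 35000 },
    "mode": "default"
  },
  "contextPruning": {
    "ttl": "30m",
    "keepLastAssistants": 2,
    "softTrim": { "maxChars": 4000 }
  }
}
```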
💡 The math: why 150K floor matters
On a 1M token window with reserveTokensFloor: 70000:
- Compaction triggers at ~930K → model reads 930K to summarize → costs ~930K input tokens
- If this happens 35 times: 35 × 930K = 32.5M tokens just on compaction
With reserveTokensFloor: 150000:
- Compaction triggers at ~850K → happens less often (LCM recalls less)
- Combined with LCM depth cap: maybe 8–12 compactions → ~8M tokens
- ~75% reduction in compaction overhead
⚙️ Recommended Settings
Copy-paste ready config for a balanced setup. Adjust based on your use case.
Profile: Balanced (recommended)
Good for general use. Smart recall, controlled growth, early compaction.
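A sketch of what this profile could contain, assembled from the recommendations earlier in this guide; the section names and exact schema are assumptions about OpenClaw's settings format:

```json
{
  "lcm": { "incrementalMaxDepth": 3, "contextThreshold": 0.75, "freshTailCount": 16 },
  "truemem": { "autoRecall": false, "maxRecallFacts": 8, "debug": false },
  "compaction": { "reserveTokensFloor": 150000, "memoryFlush": { "softThresholdTokens": 35000 } },
  "contextPruning": { "ttl": "30m", "keepLastAssistants": 2 }
}
```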
Profile: Minimal (cost-sensitive)
For tight budgets or lightweight tasks.
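A possible shape for this profile under the same assumed schema: LCM recall disabled, TrueMem off, and pruning more aggressive than the balanced defaults (the "10m" TTL is an illustrative choice, not a value from this guide):

```json
{
  "lcm": { "incrementalMaxDepth": 0, "freshTailCount": 8 },
  "truemem": { "autoRecall": false },
  "compaction": { "reserveTokensFloor": 150000 },
  "contextPruning": { "ttl": "10m", "keepLastAssistants": 2, "softTrim": { "maxChars": 4000 } }
}
```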
Profile: Deep Memory (research)
For long research sessions that need full history.
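One way this profile might look: deeper LCM recall, a looser threshold, a bigger fresh tail, and a longer pruning TTL, while still keeping the depth capped rather than -1 (schema assumed as above; the "60m" TTL is illustrative):

```json
{
  "lcm": { "incrementalMaxDepth": 5, "contextThreshold": 0.7, "freshTailCount": 24 },
  "truemem": { "autoRecall": true, "maxRecallFacts": 8 },
  "compaction": { "reserveTokensFloor": 100000, "memoryFlush": { "softThresholdTokens": 35000 } },
  "contextPruning": { "ttl": "60m", "keepLastAssistants": 3 }
}
```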
✅ Quick Checklist
Run through this before starting a long session or after noticing high token usage.
Before session
- LCM incrementalMaxDepth is capped (not -1)
- QMD default files are under 3KB each
- TrueMem autoRecall is off if graph has <50 entities
- TrueMem debug is off in production
- reserveTokensFloor is at least 100K (150K preferred)
- Reference material moved to searchable files, not auto-injected
Warning signs during session
- Context above 70% — session is growing fast
- 5+ compactions — LCM depth or threshold may be too aggressive
- Cache hit rate below 10% — context changing too much between turns
- Tokens out << tokens in — mostly reading context, not generating
Never do
- Set LCM incrementalMaxDepth: -1 (unlimited) in production
- Put build commands, OAuth steps, or debug logs in auto-injected files
- Enable TrueMem autoRecall with an empty knowledge graph
- Run reserveTokensFloor below 70K on Opus (compacts too late)
- Keep debug: true on any plugin in production