
What Is Wrong with OpenClaw's Memory Layer


I have been spending a lot of time inside OpenClaw's codebase lately. Not because something is broken, but because OpenClaw is, by any honest measure, a turning point for the open-source AI community. It is a full agent framework, open and forkable, that takes seriously the problems most commercial tools wave away.

And the problem I keep coming back to, the one we have been fighting for the better part of a year, is the one OpenClaw tried hardest to solve: how do you give an AI agent a memory that actually works?

OpenClaw matters because it gave the open-source world a real, production-grade agent framework at a time when the alternatives were either closed-source, half-baked, or both. The architecture is sophisticated. The community around it is growing fast. And their memory layer, the part I want to examine here, is in many ways the most successful attempt at agent memory I have seen.

They did not reach for a vector database and call it a day. They did not build some massive, over-engineered knowledge graph with a query language that would take a PhD to master. They did something simpler, and in its simplicity, something genuinely interesting: they made the agent write things down in Markdown files, like a person keeping notes.

To be clear: my critique does not diminish OpenClaw's value. It is a vital project, one that has done more to advance open-source AI agents than most well-funded startups. And the specific decision to store memory as plain files in ~/.openclaw/workspace/, readable by humans, editable by humans, versioned by git if you want, is a choice that respects the user in a way that most AI memory systems do not.

You can open the folder and see what your agent remembers. You can delete a line, and the agent forgets. There is something almost radical about that transparency. But I have been reading their code, file by file, and the closer I look, the more I see the places where the architecture quietly breaks its own promises.

the librarian paradox

The fundamental issue begins with who is responsible for the memory structure. The system prompt contains an instruction that says, roughly: if someone says “remember this,” write it down. Do not keep it in context. Put it in a file.

This sounds reasonable until you think about what it implies. The agent is deciding what to remember. Not a separate system with its own logic, not a structured pipeline, not a set of rules. The same model tasked with being helpful and managing the conversation is also forced to act as its own gatekeeper for long-term storage.

The OpenClaw memory pipeline, end to end: user says something → agent decides to remember → writes to a .md file → file gets chunked (400 tokens) → embedded into SQLite → searchable (maybe).

We ran into exactly this problem in our own experiments. When the AI is both the worker and the librarian, the library suffers. It is not that the model is bad at deciding what matters; it is that there is no framework for deciding what does not matter. Everything feels important in the moment.

So the memory files grow, and the daily logs accumulate entries like “User prefers concise answers” next to “Decided to use PostgreSQL instead of SQLite for prod,” as if those two facts carry equal weight. There is no value scoring. No decay. A preference noted six months ago is stored at the same level as a decision made this morning. The Markdown files are append-only in practice, and nobody, not the agent and not the system, ever goes back to slim them down.
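To make the gap concrete, here is a minimal sketch of the kind of recency decay the paragraph says is missing. Nothing like this exists in OpenClaw; the function name and the 30-day half-life are invented for illustration.

```typescript
// Hypothetical decay scoring; not part of OpenClaw.
const HALF_LIFE_DAYS = 30; // illustrative choice, not a tuned value

function decayedScore(baseScore: number, ageDays: number): number {
  // Exponential decay: a memory's weight halves every HALF_LIFE_DAYS,
  // so a six-month-old preference no longer outranks this morning's decision.
  return baseScore * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}
```

With something this simple, "User prefers concise answers" from six months ago and "use PostgreSQL for prod" from this morning would at least stop carrying equal weight.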

the flush

The most technically interesting piece of OpenClaw's memory architecture is what they call the pre-compaction flush. When the context window approaches its limit, a silent agentic turn fires automatically. The agent reviews what it knows, writes anything important to the memory files, and then context compaction wipes the conversation down to a summary.

The threshold is straightforward. They take the context window (128,000 tokens), subtract a reserve floor (20,000) and a soft threshold (4,000), and get 104,000. Cross that line and the flush fires.
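The arithmetic is easy to sketch. The constant names below are mine, not OpenClaw's actual identifiers; only the three numbers come from the post.

```typescript
// Flush-threshold arithmetic as described above; names are illustrative.
const CONTEXT_WINDOW = 128_000; // model context window, in tokens
const RESERVE_FLOOR = 20_000;   // tokens held back as a hard reserve
const SOFT_THRESHOLD = 4_000;   // soft margin before the reserve

function flushThreshold(): number {
  return CONTEXT_WINDOW - RESERVE_FLOOR - SOFT_THRESHOLD; // 104,000
}

function shouldFlush(tokensUsed: number): boolean {
  return tokensUsed >= flushThreshold();
}
```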

[figure: context window usage gauge, 0 to 128,000 tokens, flush threshold marked at 104K — "a silent agentic turn fires, hoping to save what matters"]

This is clever. It is also a band-aid on a wound that will not close. The problem is that the flush is itself an agentic turn, which means the agent must decide, under pressure, what to save. The context window is almost full. The clock is ticking. And now the model must review an entire conversation, identify what has long-term value, and write it down, all within a single turn, using tokens that are themselves eating into the remaining space.

The default prompt for this flush says: “Pre-compaction memory flush. Store durable memories now. If there is nothing to store, reply with NO_REPLY.” That prompt is doing an enormous amount of work for how little guidance it provides.

What counts as durable? The model has to guess. And it guesses under the worst possible conditions, when the context is bloated and the remaining budget is thin. In practice, this means that the quality of what gets remembered depends on when the conversation hits the token limit. An important decision discussed at token 50,000 might be saved perfectly. The same decision discussed at token 103,000, right before the flush, might be saved in a rushed, partial form, or missed entirely because the agent was focused on something else at that moment.

search problem

OpenClaw uses a hybrid search approach, and I want to give them credit here because most systems do not bother. They combine vector search (semantic similarity) with BM25 (keyword matching), blending the scores with a fixed formula: 70% vector, 30% text.

The hybrid approach is correct in principle. Vector search alone cannot find a commit hash. BM25 alone cannot understand that “the machine running the gateway” refers to the Mac Studio you mentioned three weeks ago. You need both. Arguably a third as well, but more on that later.

What each search finds:

- "the machine running the gateway" — vector: finds it; BM25: misses it
- commit hash a828e60 — vector: misses it; BM25: finds it
- memorySearch.query.hybrid — vector: weak match; BM25: exact match

In every case: finalScore = 0.7 * vectorScore + 0.3 * textScore, always.

The issue is the fixed weights. That 0.7/0.3 split is hardcoded in src/memory/hybrid.ts, and it does not adapt. When you are searching for a config key like memorySearch.query.hybrid, the BM25 score should dominate because you want an exact match. When you are searching for “that conversation about database choices last week,” the vector score should dominate. The system treats both queries the same way.
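The blend is simple enough to sketch. The interface and function names here are illustrative; only the 0.7/0.3 split and the file path come from OpenClaw.

```typescript
// Sketch of the fixed-weight blend in src/memory/hybrid.ts; names are mine.
interface ScoredChunk {
  id: string;
  vectorScore: number; // semantic similarity from the embedding index
  textScore: number;   // normalized BM25 score
}

const VECTOR_WEIGHT = 0.7;
const TEXT_WEIGHT = 0.3;

function hybridScore(chunk: ScoredChunk): number {
  // The weights never adapt to the query: an exact-match lookup for a
  // config key is blended exactly like a fuzzy semantic search.
  return VECTOR_WEIGHT * chunk.vectorScore + TEXT_WEIGHT * chunk.textScore;
}
```

Notice what happens for an exact keyword hit: a chunk with a perfect BM25 score but a weak vector score tops out at a blended score well below a mediocre semantic match, because 0.3 caps the keyword signal's contribution.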

Then there is the chunking. Memory files are split into roughly 400-token chunks with an 80-token overlap, and each chunk is embedded separately into a per-agent SQLite database. This is standard, but it means that a decision recorded across two paragraphs might be split across chunks, with the context in one chunk and the conclusion in the other.
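The chunking scheme itself is easy to reproduce. This is a sketch under the stated parameters (400-token chunks, 80-token overlap); splitting on a pre-tokenized array stands in for whatever real tokenizer OpenClaw uses.

```typescript
// Fixed-size chunking with overlap, as described above. The tokenizer
// is abstracted away: we assume the text is already an array of tokens.
function chunkTokens(tokens: string[], size = 400, overlap = 80): string[][] {
  const chunks: string[][] = [];
  const step = size - overlap; // each chunk starts 320 tokens after the last
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last chunk reached the end
  }
  return chunks;
}
```

A decision whose context sits at token 390 and whose conclusion sits at token 410 lands in two different chunks, which is exactly the split the paragraph describes: the overlap softens the problem but does not eliminate it.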

The search might return the conclusion without the reasoning, or the reasoning without the conclusion. The memory_get tool exists to compensate for this, allowing the agent to read specific lines from a file after the search narrows down the location. It is a two-step process: search, then read. This works, but it costs two tool calls and two round-trips, where a system with better indexing would need zero.
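The two round-trips look roughly like this. Only the memory_get name comes from the post; the tool signatures and the 10-line padding are assumptions for illustration.

```typescript
// Sketch of the two-step retrieval described above; signatures are assumed.
interface SearchHit { file: string; startLine: number; endLine: number; }

async function recall(
  search: (q: string) => Promise<SearchHit[]>,                              // round-trip 1
  memoryGet: (file: string, start: number, end: number) => Promise<string>, // round-trip 2
  query: string,
): Promise<string | undefined> {
  const hits = await search(query);
  if (hits.length === 0) return undefined;
  const top = hits[0];
  // Read surrounding lines too, since chunking may have separated the
  // reasoning from the conclusion it led to.
  return memoryGet(top.file, Math.max(1, top.startLine - 10), top.endLine + 10);
}
```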

isolated indexes

Each agent in OpenClaw gets its own SQLite database at ~/.openclaw/memory/<agentId>.sqlite. The memory files themselves might be shared, since they sit in the workspace, but the search index is per-agent. Agent A's embeddings live in one database. Agent B's live in another.

For a single-agent setup, this is fine. But the moment you have multiple agents, which is increasingly the norm for anything beyond simple chat, the memories become siloed. Agent A cannot search Agent B's index.

If you want cross-agent memory, you have to hope both agents indexed the same files, and even then, their embeddings might differ because they were generated at different times or from different model versions. This is the kind of limitation that does not show up in demos but becomes a wall in production. The whole point of memory is continuity across contexts. A memory system that only works within a single agent's context is barely a memory system at all.

what accumulates

There is a deeper problem that runs beneath all of these specifics, and it is the same problem we found in our own $3,000 experiment. Memory without curation is not memory. It is hoarding.

OpenClaw's daily logs are append-only. The MEMORY.md file grows but is never automatically pruned. Every session adds, nothing subtracts. Over weeks and months, the memory files become a sediment layer of every decision, preference, and offhand remark the agent thought was worth recording.

The search might still find what you need, for a while. But as the corpus grows, the signal-to-noise ratio degrades. A search for “database decision” that once returned one clean result now returns six, from different dates, some contradicting each other because the decision changed. The system has no concept of supersession, no way to know that the Redis decision from last week replaced the in-memory caching decision from three weeks ago.
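Supersession does not have to be exotic. Here is a hypothetical sketch of the minimal bookkeeping the paragraph says is absent; none of these types or functions exist in OpenClaw.

```typescript
// Hypothetical supersession tracking; entirely invented for illustration.
interface MemoryEntry {
  id: string;
  topic: string;         // e.g. "caching strategy"
  recordedAt: number;    // epoch milliseconds
  supersededBy?: string; // id of the entry that replaced this one
}

// Mark the older entry as replaced when a newer entry covers the same topic.
function supersede(older: MemoryEntry, newer: MemoryEntry): void {
  if (older.topic === newer.topic && newer.recordedAt > older.recordedAt) {
    older.supersededBy = newer.id;
  }
}

// Search should only surface entries that nothing has replaced.
function activeEntries(entries: MemoryEntry[]): MemoryEntry[] {
  return entries.filter((e) => e.supersededBy === undefined);
}
```

With this in place, the Redis decision from last week would shadow the in-memory caching decision from three weeks ago instead of contradicting it in the same result list.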

Human memory is not perfect, but it has something OpenClaw's system lacks: the ability to forget. We forget the irrelevant, the outdated, the superseded. This is not a bug in cognition; it is what makes recall useful. A memory system that remembers everything remembers nothing well.

what they got right

I want to be clear: this is not a takedown. OpenClaw is one of the most important open-source AI projects running right now, and the team behind it deserves real respect for what they have built.

OpenClaw made several choices that I think are genuinely correct and that the rest of the industry, open-source and commercial both, would do well to learn from. The file-first approach, where memory is something you can see, touch, and edit, just makes sense. The hybrid search, even with its fixed weights, is more thoughtful than the pure-vector approach that most systems use. The pre-compaction flush, even with its limitations, acknowledges a problem that most systems ignore entirely: that context windows are finite, and you need a plan for when they fill up. The fact that all of this is open, inspectable, and forkable is itself a statement about how AI tooling should be built.

The configuration system is also top-notch. You can swap embedding providers, point the indexer at external paths, and disable the flush if it does not suit your workflow. There is an openclaw memory status --deep command that tells you whether your index is healthy. These are the marks of a team that has actually used its own system in production and knows where the rough edges are.

The problems I have described are not failures of execution. They are limitations of the underlying approach, one where the agent is trusted to be its own archivist. OpenClaw, as an open-source project, pushed that approach further than anyone else has, and in doing so, they have made the limitations visible in a way that benefits everyone working on this problem.

what is next

The memory problem in AI is not a search problem. It is not an embedding, chunking, or storage problem, though it touches all of them. Ultimately, it is a curation problem. Who decides what to keep? Who decides what to discard? Who reconciles contradictions?

OpenClaw's answer is: the agent. Our experience at PAM suggests that this answer is necessary but not enough. The agent should participate in memory, but it cannot be the sole authority. You need infrastructure that enforces retention policies, resolves conflicts, tracks provenance, and, crucially, forgets. Not randomly, not by dropping old files, but deliberately, the way a well-maintained database trims stale records.

We are building that infrastructure. It is harder than it sounds, because “what should be forgotten” is not a question with a formula. It depends on the user, the domain, the recency, the frequency of access, and the presence of newer information that supersedes the old. Getting this right requires engineering discipline applied to an inherently fuzzy problem.

OpenClaw showed that file-based memory can work, that agents can write their own notes, and that hybrid search is worth the complexity. These are real findings that advanced the conversation.

The next step is to build a system in which the agent's notes are not the end of the story but the beginning.

Why is PAM different?

PAM is building a memory infrastructure that works. Structured, cost-efficient, and actually useful. See how proactive AI memory can transform your workflow.