Why Claude Code + Markdown Memory Is Not Enough for Agent Memory

A lot of people building with Claude Code end up with the same memory setup. You create a folder of markdown files. You ask the agent to write down decisions, facts, names, dates, bugs, preferences and project context. You keep an index. Later, when the agent needs context, it reads and searches those files.

It feels right because it fits Claude Code well. The agent already knows how to work with files. It can read, write, grep and move around a project tree. Markdown is also easy to inspect. When something looks wrong, you can open the file and see what the agent thinks it knows.

We like this direction. Pam also uses files as part of the memory layer. The question is whether a common Claude Code + markdown memory setup is enough on its own.

What we tested

In our 2026 run, Denys Herasymuk, AI Researcher at Harmix, tested Pam against a concrete baseline from our benchmark repo: Claude Code + Markdown memory. This baseline uses Claude Code with a directory of external Markdown files as memory. The agent writes structured notes during ingestion, then reads and searches those notes when answering questions.

The Pam side used Harmix's memory layer through the Pam API. The same memory layer is also exposed through Pam MCP and can be used from OpenClaw or Claude Code. The result below, however, refers to the Pam API run in our public benchmark harness.

The test was LoCoMo, a widely used benchmark for long-term conversational memory. Both systems answered the same 1,986 questions and were scored by the same GPT-4o judge.

Pam: 82.8% (1,644 / 1,986 correct)

Claude Code + Markdown memory: 50.4% (1,000 / 1,986 correct)

That is the claim in this article. Not every possible markdown memory system. Not every possible custom agent setup. Just this common, transparent baseline: Claude Code managing a directory of markdown memory files.

The implementation is deliberately simple. During ingestion, the agent can use file tools to read, write, edit, grep, glob, and list files. During answering, it can only read and search the memory. It cannot edit memory while answering, and it does not get to read the original transcript again. The backend also does not use an external memory server. It uses Claude Code's native file tools.
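The split between the two phases can be pictured as a per-phase tool allowlist. The sketch below is only an illustration: the tool names come from the description above, the read-only subset for answering is our assumption, and `PHASE_TOOLS` and `is_allowed` are hypothetical names rather than the benchmark harness's actual configuration.

```python
# Hypothetical per-phase tool allowlist. Names are illustrative assumptions,
# not the benchmark harness's real configuration.
PHASE_TOOLS = {
    # Ingestion may create and change memory files.
    "ingestion": {"read", "write", "edit", "grep", "glob", "list"},
    # Answering is assumed to keep only the read-and-search subset.
    "answering": {"read", "grep", "glob", "list"},
}


def is_allowed(phase: str, tool: str) -> bool:
    """Return True if `tool` may be used during `phase`."""
    return tool in PHASE_TOOLS.get(phase, set())


print(is_allowed("ingestion", "edit"))  # True
print(is_allowed("answering", "edit"))  # False
```

Keeping answering read-only means a wrong answer cannot quietly rewrite memory, so every score reflects what ingestion actually stored.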

How the baseline builds memory

The prompt asks the agent to build memory as Obsidian-style markdown notes. It writes one file per salient entity — person, place, event, or recurring topic. Each note starts with YAML frontmatter (title, type, tags) followed by atomic facts as bullets, each tagged with the date or session it came from when known. Related notes are linked with wikilinks. The agent also maintains an index file as a table of contents that links every note with a one-line summary.
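To make that layout concrete, here is a minimal sketch of what one such note and the index could look like when rendered. The field names follow the description above (title, type, tags, dated facts, wikilinks). The `Note` and `Fact` classes, the helper functions, and the example people and places are all invented for illustration and are not taken from the benchmark prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    text: str
    source: str = ""  # date or session label, when known


@dataclass
class Note:
    title: str
    type: str  # e.g. "person", "place", "event", "topic"
    tags: list[str] = field(default_factory=list)
    facts: list[Fact] = field(default_factory=list)
    summary: str = ""  # one-line summary used by the index


def render_note(note: Note) -> str:
    """Render YAML frontmatter followed by one bullet per atomic fact."""
    lines = [
        "---",
        f"title: {note.title}",
        f"type: {note.type}",
        f"tags: [{', '.join(note.tags)}]",
        "---",
        "",
    ]
    for fact in note.facts:
        suffix = f" ({fact.source})" if fact.source else ""
        lines.append(f"- {fact.text}{suffix}")
    return "\n".join(lines) + "\n"


def render_index(notes: list[Note]) -> str:
    """Render a table of contents that wikilinks every note."""
    lines = ["# Index", ""]
    for note in notes:
        lines.append(f"- [[{note.title}]]: {note.summary}")
    return "\n".join(lines) + "\n"


# Invented example data.
alice = Note(
    title="Alice",
    type="person",
    tags=["friend"],
    facts=[Fact("Planned the [[Berlin trip]] with Ben", "session 3")],
    summary="Friend who organizes trips",
)
trip = Note(title="Berlin trip", type="event", summary="Trip planned in session 3")

print(render_note(alice))
print(render_index([alice, trip]))
```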

So this is not a lazy baseline where everything gets dumped into one messy text file. The setup has structure. It has dates. It has links. It has an index. It still broke down.

Why this setup is attractive

The appeal is obvious. A markdown memory folder is cheap. You can build it today. You can read it without special tooling. You can keep it in git. You can edit it by hand if the agent writes something strange.

For a small project, that may be fine. If the agent only needs a few stable facts, markdown notes can help a lot. A file saying "use pnpm, not npm" or "the API client lives in src/lib" can save time.

The trouble starts when memory has to survive a moving project. Real work changes. A decision gets reversed. A customer changes scope. A bug gets fixed, then reopened. Someone says "ignore what I said last week." Another detail only matters if you remember the order of events.

A folder can store all of that. It does not automatically know what should win when two notes disagree. That burden falls back on the model during answering.
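A small illustration of that point: a plain search over stored bullets returns both the old and the new statement, and nothing in the result says which one is current. The note contents and the `search` helper are invented for this sketch.

```python
# Invented note contents: the same decision recorded, then reversed later.
NOTES = {
    "deployment.md": [
        "- Deploy target is AWS (session 3)",
        "- Deploy target is GCP, AWS plan dropped (session 9)",
    ],
    "team.md": ["- Dana owns the deploy pipeline (session 4)"],
}


def search(notes: dict[str, list[str]], keyword: str) -> list[tuple[str, str]]:
    """Case-insensitive substring search, similar in spirit to grep."""
    hits = []
    for filename, lines in notes.items():
        for line in lines:
            if keyword.lower() in line.lower():
                hits.append((filename, line))
    return hits


hits = search(NOTES, "deploy target")
for filename, line in hits:
    print(f"{filename}: {line}")
```

Both bullets come back as equally valid matches. Picking the session 9 line over the session 3 line is left to whoever reads the results.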

The result

On the full LoCoMo run, Pam scored 82.8%. The Claude Code + Markdown memory baseline scored 50.4%.

Pam beats the Markdown memory baseline

Overall LLM-as-a-Judge accuracy on the LoCoMo benchmark:

• Pam (Harmix memory layer): 82.8%
• Claude Code + Markdown memory (memory.md files): 50.4%

Figure 1. Overall accuracy across all 1,986 LoCoMo questions, scored by GPT-4o as the judge. Pam answers 1,644/1,986 correctly vs 1,000/1,986 for the Claude Code + Markdown memory baseline, a +32.4 pt gap.

The interesting part is not only the overall score. It is where the markdown setup fails.

The gap widens on the hard question types

LLM-as-a-Judge accuracy by LoCoMo question category (Pam vs Claude Code + Markdown memory):

• Single-hop (282 questions): 52.1% vs 41.5%
• Temporal (321 questions): 85.4% vs 52.0%
• Open-domain (96 questions): 57.3% vs 39.6%
• Multi-hop (841 questions): 86.2% vs 54.8%
• Adversarial (446 questions): 99.3% vs 48.7%

Figure 2. Accuracy broken down by question category, scored by GPT-4o as the judge. Pam leads on all five categories, with the largest gaps on Adversarial (+50.6 pt), Temporal (+33.4 pt) and Multi-hop (+31.4 pt).
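The per-category figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in Figure 2: it sums the category sizes, computes each gap in percentage points, and recovers the approximate number of correct answers implied by the rounded per-category percentages. The variable names are ours.

```python
# (category, question count, Pam accuracy %, baseline accuracy %) from Figure 2.
CATEGORIES = [
    ("Single-hop", 282, 52.1, 41.5),
    ("Temporal", 321, 85.4, 52.0),
    ("Open-domain", 96, 57.3, 39.6),
    ("Multi-hop", 841, 86.2, 54.8),
    ("Adversarial", 446, 99.3, 48.7),
]

# Category sizes should add up to the full question set.
total = sum(count for _, count, _, _ in CATEGORIES)

# Gap per category, in percentage points.
gaps = {name: round(pam - base, 1) for name, _, pam, base in CATEGORIES}

# Correct-answer counts implied by the rounded per-category percentages.
pam_correct = round(sum(count * pam / 100 for _, count, pam, _ in CATEGORIES))
base_correct = round(sum(count * base / 100 for _, count, _, base in CATEGORIES))

print(total)
print(gaps)
print(pam_correct, base_correct)
```

The category sizes add up to the 1,986 questions, and the weighted sums land on the 1,644 and 1,000 correct answers reported for the overall result, so the breakdown and the headline numbers agree.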

Single-hop questions are simple recall. The gap exists there, but it is not the main story. The larger gaps show up when the system has to handle time, combine facts across sessions, or recognize that the answer was never in memory.

That matches what we see in real agent work. The painful errors are rarely "the agent forgot one obvious fact." The painful errors are more subtle. It uses an old decision as if it were current. It answers from the wrong part of the history. It finds something related and treats it as enough evidence.

The failure mode is maintenance

A markdown folder can hold facts. It can even hold well-written facts. But memory quality depends on what happens after the fact is written.

• Was this fact later updated?
• Was it contradicted?
• Was it only true for one session?
• Did another note supersede it?
• Should the agent answer, or should it say the information is missing?

A basic markdown setup leaves too much of this to the final answering step. That is the wrong place to solve it. At answer time, the model is already under pressure to produce a response. If the notes contain stale, partial or conflicting evidence, the model has to notice that, search again, compare dates, decide which source is current and still answer cleanly.
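One way to picture moving that work earlier is to settle conflicts when a fact is written rather than when a question is asked. The sketch below rests on deliberately simple assumptions: one current value per entity and attribute, facts arriving in session order, and the newest session winning. It is not a description of how Pam works, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    entity: str
    attribute: str
    value: str
    session: int
    superseded_by: Optional[int] = None  # session that replaced this fact


class MemoryStore:
    """Toy store that resolves conflicts at write time (illustrative only)."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def write(self, new: Fact) -> None:
        """Store a fact and mark older facts on the same key as superseded.

        Assumes facts arrive in session order.
        """
        for old in self.facts:
            same_key = (old.entity, old.attribute) == (new.entity, new.attribute)
            if same_key and old.superseded_by is None and old.session < new.session:
                old.superseded_by = new.session
        self.facts.append(new)

    def current(self, entity: str, attribute: str) -> Optional[Fact]:
        """Return the fact on this key that has not been superseded, if any."""
        for fact in self.facts:
            same_key = (fact.entity, fact.attribute) == (entity, attribute)
            if same_key and fact.superseded_by is None:
                return fact
        return None


# Invented example: a decision made in session 3 and reversed in session 9.
store = MemoryStore()
store.write(Fact("project", "deploy target", "AWS", session=3))
store.write(Fact("project", "deploy target", "GCP", session=9))

print(store.current("project", "deploy target"))
print(store.facts[0])
```

The old fact is kept with a pointer to what replaced it, so history stays inspectable while the answering step reads a single current value.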

Sometimes it can do that. The benchmark shows that it often cannot. Pam tries to move more of this work into the memory layer itself. The goal is not to replace files. The goal is to keep the memory usable as the project changes.

The result that made us pause

The adversarial category is the clearest signal. These are questions where the right answer is basically: the conversation did not say that. This sounds easy, but it is one of the most useful tests for an agent memory system. A memory system should not only recall facts. It should also protect the agent from answering when there is no evidence.

Pam: 99.3% on adversarial questions

Claude Code + Markdown memory: 48.7% on adversarial questions

That means the markdown baseline was often unable to draw a clean boundary around what it knew. In real work, this is where small memory errors become expensive. An agent that invents a plausible customer detail, deadline or product constraint can create more work than an agent that simply says it does not know.
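That boundary can be sketched as an explicit abstention rule at answer time. The version below is a crude keyword check, offered only as an assumption about what such a guard could look like. It is not the benchmark's scoring logic or Pam's retrieval method, and the memory lines and function names are invented.

```python
NO_EVIDENCE = "The conversation did not mention that."

# Invented memory bullets for illustration.
MEMORY = [
    "Alice adopted a cat named Miso (session 2)",
    "Alice moved to Lisbon (session 5)",
]


def supporting_lines(memory: list[str], required_terms: list[str]) -> list[str]:
    """Keep only lines that mention every required term, not just one of them."""
    return [
        line
        for line in memory
        if all(term.lower() in line.lower() for term in required_terms)
    ]


def answer_or_abstain(memory: list[str], required_terms: list[str]) -> str:
    """Return the first supporting line, or abstain when nothing supports an answer."""
    evidence = supporting_lines(memory, required_terms)
    if not evidence:
        return NO_EVIDENCE
    return evidence[0]


print(answer_or_abstain(MEMORY, ["alice", "cat"]))
print(answer_or_abstain(MEMORY, ["alice", "dog"]))
```

A question about Alice's dog matches notes about Alice, which is related but not sufficient. Requiring every key term before answering is one simple way to turn "something related" into "no evidence" instead of a plausible guess.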

What this does and does not prove

This test does not prove that markdown is a bad memory substrate. It also does not prove that no one can build a stronger custom markdown memory system. A team could add stricter ingestion, validation, review, better update rules and more careful answer-time prompts. That would be a different system, and it should be tested separately.

Our claim is narrower. We tested a common Claude Code + markdown memory pattern with structured notes, dated facts, wikilinks and an index. It was transparent, inspectable and easy to run. On LoCoMo, it scored 50.4%. Pam scored 82.8% on the same 1,986 questions with the same judge. That is enough to say that the simple version is not reliable enough for serious long-running agent work.

What Pam changes

Pam still respects the file-first instinct. Files are useful. The difference is the memory operations around them. Pam is built to maintain context as it changes. It can onboard from company sources, organize memory, update it when new information arrives, preserve time, resolve conflicts, and make retrieval safer for the agent using it.

The benchmark does not say Pam has solved memory completely. Single-hop and open-domain results still leave room to improve. That is worth saying plainly. But the same-run comparison does show that architecture matters. A readable folder of notes is helpful. A maintained memory layer is different.

The practical takeaway

If your agent needs to remember a handful of stable project facts, Claude Code + markdown may be enough.

If your agent is supposed to work across many sessions, changing requirements and incomplete information, the folder becomes only the starting point. You still need a way to keep memory current, handle old facts, preserve order, resolve conflicts, and stop the agent from answering when the evidence is not there.

That is the gap Pam is built for. Not a replacement for markdown. A way to make markdown-style memory reliable enough for real agent work.

FAQs

How do you build a memory system in Claude?

The simplest way is to use Claude Code with a folder of Markdown memory files. During ingestion, ask Claude to write structured notes about people, decisions, dates, bugs, preferences, and project context. Each note should have clear headings, dated facts, links to related notes, and an index file so Claude can find the right context later. During answering, Claude should only read and search those memory files. It should not rewrite memory while answering, and it should not rely on the original transcript again.

This setup is easy to build and inspect, which is why many teams start there. But it is not enough for reliable long-term agent memory on its own. In our benchmark, Claude Code + Markdown memory scored 50.4% on LoCoMo, while Pam scored 82.8% on the same 1,986 questions. The hard part is not storing notes. The hard part is keeping memory current, handling outdated facts, resolving conflicts, preserving time, and knowing when the evidence is missing.

What is Claude Code + Markdown memory?

Claude Code + Markdown memory is a simple way to give an AI coding agent persistent context. The agent writes important facts, decisions, dates, preferences, bugs and project notes into .md files. Later, it searches and reads those files when it needs context. It is easy to build and inspect, which is why many AI builders start there.

Why is Claude Code + Markdown memory not enough for long-term agent memory?

Claude Code + Markdown memory starts to break when the project changes over time. A note may be true one week and outdated the next. Two notes may disagree. A customer requirement may change. A decision may get reversed. A Markdown folder can store those facts, but it does not automatically know which fact is current, which one was superseded, or when the agent should say that the answer is missing.

What did you test in the benchmark?

We tested Pam against a concrete Claude Code + Markdown memory baseline from the Harmix benchmark repo. The baseline used Claude Code with a directory of external Markdown files as memory. The agent wrote structured notes during ingestion, then searched and read those notes while answering questions. It could not edit memory during answering and could not re-read the original transcript.

How did Pam perform compared with Claude Code + Markdown memory?

Pam answered 1,644 out of 1,986 questions correctly, or 82.8%. The Claude Code + Markdown memory baseline answered 1,000 out of 1,986 questions correctly, or 50.4%.

Where did the Markdown memory setup fail most?

The largest gaps appeared on temporal, multi-hop and adversarial questions. Temporal questions require the system to understand what happened when. Multi-hop questions require it to combine facts from different parts of the conversation. Adversarial questions test whether the system can say that the answer was never given. The adversarial result was the clearest warning sign: Pam scored 99.3%, while Claude Code + Markdown memory scored 48.7%.

Why are adversarial questions important for agent memory?

Adversarial questions test whether the memory system knows when it does not know something. This matters in real work. If an agent invents a customer detail, deadline or product constraint, it can create more damage than an agent that simply says it does not have enough evidence. Good agent memory should support recall, but it should also reduce false confidence.

Can I build a better custom Markdown memory system?

Yes. You could add stricter ingestion, validation, review, update rules, conflict handling and better answer-time prompts. That would be a different memory system and should be tested separately. The article only claims that the common Claude Code + Markdown memory baseline was not enough in this benchmark.

What does Pam add on top of file-based memory?

Pam keeps the file-first direction, but adds memory operations around the files. It is built to organize context, update memory when new information arrives, preserve time, resolve conflicts and make retrieval safer for the agent using it. The goal is not to replace Markdown. The goal is to make Markdown-style memory more reliable for real agent work.

Give your agent memory that survives a moving project

Pam keeps the file-first workflow you already like and adds the memory operations markdown can't — staying current, preserving time, and resolving conflicts. Start free and see the difference on your own work.