Why CLAUDE.md + Obsidian Setup Is Not Enough for Long-Term Company Memory for AI Agents

Context windows, CLAUDE.md, vector databases. Every fix for agent memory runs into the same wall. The LoCoMo benchmark shows you exactly where it is. And you don't have to take our word for any of it: the whole benchmark repo is open, so you can reproduce every number here yourself.

If you build with Claude Code, you know this cycle inside and out. Day one is great. You write a CLAUDE.md that lays out the project: what you're building, the conventions, the infrastructure. Claude Code runs with it. Come back a week later, and it leans on a decision you reversed on Monday, redoes work you already finished, and points at a part of the system you renamed. By then CLAUDE.md has grown to 40-60k tokens of text, and it eats into your context window in every conversation.

The problem is not CLAUDE.md itself, but how poorly its structure holds up over longer periods. What Claude Code lacks is a persistent memory layer: something that holds your project's state between sessions so it doesn't start cold every time.

What you're really building is an AI brain

What most builders are really after is an AI brain: one place where an agent can reason over your whole history, the decisions you made, the experiments that failed, the why behind how the project got here. Try to build one by hand and you tend to end up on one of two paths:

  1. You never get a structure that works: You wire up Obsidian, Claude, and a Drive folder and still can't get the agent to reliably recall anything from the vault. The choice comes down to a junk drawer, where you dump everything in and it turns to noise, or a second job, where you hand-curate every note. Most people give up there.
  2. You do get it working, but you pay twice: Getting there takes a full, dedicated week or two of designing the architecture and shaping data. Then a real task lands, and the agent handles in minutes what would have taken 3 hours or more. That build time is the first cost. The second comes later. As the project moves and the notes drift out of date, are the answers still right? A flat file gives you no signal. The quality erodes quietly, and you only find out when the agent confidently hands you something wrong.

Why doesn't a bigger context window fix Claude Code's memory?

Larger context windows sound like a solution, but they're not. Past about 400k tokens, you enter the dumb zone, where recall degrades: what researchers call being lost in the middle.

The underlying problem is that the model has to re-process the whole context window each time new tokens are added. Anthropic's infrastructure partially solves this, but not entirely: 400k is all you have, and sometimes even less than that. So if 60k of that window goes to CLAUDE.md, roughly 15% of it is gone before the conversation starts.
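As a back-of-the-envelope check on that budget, the sketch below uses the 400k window and the upper end of the 40-60k CLAUDE.md range quoted in this article. The function name is ours and purely illustrative.

```python
def context_overhead(window_tokens: int, fixed_tokens: int) -> float:
    """Fraction of the window consumed by a file that is loaded on every turn."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return fixed_tokens / window_tokens


WINDOW = 400_000    # usable window, per the figure quoted in this article
CLAUDE_MD = 60_000  # upper end of the 40-60k range quoted earlier

share = context_overhead(WINDOW, CLAUDE_MD)
remaining = WINDOW - CLAUDE_MD

print(f"{share:.0%} of the window is spent before the first message")
print(f"{remaining:,} tokens left for code, tool output, and conversation")
```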

RAG doesn't save you either. Chunking notes into a vector index works for static docs, but it is blind to time: it can't tell that last year's rule is stale and today's patch is current when both sound similar. So it returns both. It has no sense of which source should take priority and no way to resolve the conflict, so the contradiction lands back on the model.
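A toy illustration of that failure mode, assuming a bag-of-words cosine similarity in place of a real embedding model. The notes, dates, and function names are invented for the example; real RAG stacks use learned embeddings, but the ranking step has the same shape.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    written: date


def bow(text: str) -> Counter:
    """Bag-of-words vector: lowercase word counts, punctuation ignored."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical notes: the first rule was later replaced by the second.
chunks = [
    Chunk("Deploys go through the staging cluster first.", date(2024, 3, 1)),
    Chunk("Deploys go straight to production, staging cluster is retired.", date(2025, 3, 1)),
    Chunk("The office coffee machine is on the second floor.", date(2025, 1, 10)),
]


def retrieve(query: str, k: int = 2) -> list:
    """Rank purely by similarity; the `written` date plays no part."""
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c.text)), reverse=True)[:k]


hits = retrieve("where do deploys go staging cluster")
for hit in hits:
    print(hit.written, hit.text)  # both the stale rule and the current one come back

# The crudest possible fix: prefer the most recent of the conflicting hits.
newest = max(hits, key=lambda c: c.written)
```

Picking the newest hit only works here because the two chunks are known to describe the same rule. Deciding that two chunks conflict in the first place is the hard part, and similarity search alone does not do it.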

This isn't just our experience. The original LoCoMo paper put long-context models and retrieval head-to-head, and both still trailed humans by a wide margin, with long-context models doing far worse than the base model on adversarial questions. A bigger window didn't close the gap.

Why are plain files a good substrate for agent memory?

Here is the part most people skip past: plain text files are already a great memory substrate. You do not need a proprietary store or a database layer to get good agent memory. A clean directory of markdown files does the job and plays to the model's strengths instead of fighting them.

Claude Code was trained on enormous amounts of code and file trees. It knows how to move around a filesystem. Its native tools are file tools: ls, grep, cat, and find. When your memory lives as a directory of markdown notes, the model can read and search its own history with the same skills it uses on your code, no middleware in between.
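To make that concrete, here is a minimal sketch of what searching a notes directory amounts to, written in Python rather than shell so it runs anywhere. The directory layout, file names, and the `grep_notes` helper are hypothetical; an agent would simply call `grep` itself.

```python
import tempfile
from pathlib import Path


def grep_notes(root: Path, needle: str) -> list[tuple[str, int, str]]:
    """Case-insensitive search over every .md file under root.

    Returns (relative path, line number, line text) for each matching line.
    """
    hits = []
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle.lower() in line.lower():
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


# Build a tiny throwaway vault so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "decisions").mkdir()
    (root / "decisions" / "api.md").write_text(
        "# API\nWe chose REST over gRPC.\n", encoding="utf-8"
    )
    (root / "infra.md").write_text(
        "# Infra\nPostgres 16 on the shared cluster.\n", encoding="utf-8"
    )
    results = grep_notes(root, "grpc")

for rel_path, lineno, text in results:
    print(f"{rel_path}:{lineno}: {text}")
```

Every hit is a path and a line number, so anything the agent retrieves can be opened and corrected by hand.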

Files are also honest. A vector space is opaque. If your agent starts doing something strange, you cannot open an embedding and read it. You can open a Markdown file. You see exactly what the agent thinks it knows, and you can fix it in your editor.

Why not a database?

It doesn't actually solve the problem, for three reasons.

  1. Claude Code is at home in a filesystem, not a database schema, so working against that grain has a real cost.
  2. Even when the agent writes a valid query, a database does lexical search over stored strings: if you describe a concept with synonyms or related terms that don't appear verbatim in the records, the query comes back empty, not because the fact is missing but because string matching has no sense of meaning.
  3. The hardest parts of long-running project memory are temporal and relational, what was true when and how facts depend on each other, and databases treat those as separate concerns rather than one. Our benchmark below makes this concrete: Pam's biggest margins over the plain-files baseline land on exactly the temporal and multi-hop questions, where a database would hit the same structural ceiling.
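The second point can be reproduced in a few lines with Python's built-in `sqlite3` module. The table name, schema, and stored sentence are invented for the example; the point is only that a `LIKE` match sees characters, not meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisions (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO decisions (body) VALUES (?)",
    ("Users sign in with a magic link; passwords were dropped.",),
)


def search(term: str) -> list[str]:
    """Plain substring match, the way a naive agent-written query would do it."""
    rows = conn.execute(
        "SELECT body FROM decisions WHERE body LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [row[0] for row in rows]


verbatim = search("magic link")     # the words appear in the record
synonym = search("authentication")  # same concept, different words

print(len(verbatim), "hit(s) for the verbatim phrase")
print(len(synonym), "hit(s) for the synonym")
```

The decision about authentication is in the table, but the query for "authentication" comes back empty because that word never appears in the stored text.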

What are the limits of CLAUDE.md as memory?

CLAUDE.md has three limitations as memory:

  • Read only: The model reads CLAUDE.md to bootstrap itself, but left alone, it won't keep the file up to date. Spend four hours figuring out a nasty race condition, and none of that lands in the file unless you stop and type it in yourself. Next session, the lesson is gone.
  • Bloats your context: Every line you add sits in the prompt on every single turn. As you pile in edge cases and quirks, the file grows, and you walk straight back into the attention and latency problems it was supposed to prevent.
  • No short vs long-term split: Out of the box, there's no split between the variable that's breaking the build right now and the rule that you always encrypt before migrating. Throwaway debugging notes and permanent rules get the same weight, so the model treats a detail as if it were architecture. You can rig a split yourself, but it's not something the file does for you.

A static file cannot keep up with a project that changes every day. That is the gap.

Claude Code's newer auto-memory does write markdown notes across sessions on its own, which goes past a static CLAUDE.md, but it is single-tier and doesn't reconcile conflicting facts, so the gap in structure and currency stays open.

Is Obsidian a good memory system for AI agents?

Obsidian gets closer: a vault is a folder of markdown files, exactly the substrate you want. The catch isn't that you fill it by hand. It's that you own the structure, how notes get organized, linked, and tiered, and designing a memory structure that an agent can reason well over is a specialist skill. Unless you happen to be in the rare slice of knowledge workers who do exactly that, the schema you land on probably isn't the right one.

As AI researcher Vitalii Ratushnyi says:

Obsidian is just a file structure. The real difference is the algorithm layered on top of it: how you actually store different types of information, and how you handle different sources and file types. What we built is a specialized, scalable system. Claude Code plus Obsidian stays free-form.

— Vitalii Ratushnyi, AI Researcher

How do agent memory solutions compare?

Here is how the common approaches stack up on the things that actually break over a long project.

Approach         | Self-updating | Stays current | Short vs long-term split | Resolves conflicts
CLAUDE.md        | No*           | No*           | No*                      | No*
Obsidian vault   | No*           | No*           | No*                      | No*
RAG over vectors | No*           | Yes           | No                       | No, returns both
Pam              | Yes           | Yes           | Yes                      | Yes

* Not native, DIY only. You can build some of this with your own instructions and file layout. Maintenance of architecture is on you; a plain file has no built-in tiering or conflict handling.

What happens when you run agent loops without memory?

Without memory, a loop wakes up cold. It runs on schedule whether or not anyone is watching, reads whatever state it can find, and acts on it. If that state is stale, it doesn't stumble once and get corrected: it repeats the wrong thing on a schedule, and with subagents it repeats it in five places at once.
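One cheap guard, sketched below, is to have the loop check how old its state is before acting on it. The 24-hour budget and the function name are assumptions for illustration only; a freshness check catches state that has not been touched, not state that was updated with the wrong facts.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # hypothetical freshness budget


def should_act(state_updated_at: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True only if the state the loop woke up to is recent enough to trust."""
    return now - state_updated_at <= max_age


now = datetime(2025, 6, 10, 9, 0, tzinfo=timezone.utc)
fresh_state = datetime(2025, 6, 10, 1, 0, tzinfo=timezone.utc)  # updated 8 hours ago
stale_state = datetime(2025, 6, 3, 9, 0, tzinfo=timezone.utc)   # updated a week ago

for label, updated_at in [("fresh", fresh_state), ("stale", stale_state)]:
    action = "run the task" if should_act(updated_at, now) else "stop and flag for review"
    print(f"{label}: {action}")
```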

A loop is just a prompt that fires itself, and builders are leaning on them hard right now (Claire Vo's walkthrough on Lenny's How I AI covers crons, hooks, and goal loops that spawn their own subagents). But a loop is only as reliable as the state it wakes up to, and hand-maintained files can't stay current once you're no longer the one maintaining them. The harness around the model, the goals and checks and memory, is what makes loops work.

What is Pam, and how does it work?

Pam is a self-onboarding memory system for AI agents. What a folder of files can't do on its own, Pam does for you.

  • It onboards itself from the tools you already use: Your project memory is not only in your repo. It is in the Slack thread where you settled an API decision, the Notion page with the spec, and the emails where a constraint changed. Point Pam at those sources, and it pulls the context in and files it. You are not hand-curating a vault, and you are not wiring an ingestion pipeline per source.
  • It updates itself: When something worth remembering happens, Pam updates the right note instead of waiting for you to do it. The hard-won lesson from a long debugging session ends up in the project file, not lost at the end of the session.
  • It splits short-term from long-term: Active task notes stay in a working layer while the task is open. When the task closes, Pam reviews those notes, keeps the parts that matter for next time, writes them to the long-term project file, and clears out the scratch. Your permanent memory stays clean instead of filling up with trace logs.
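The short-term/long-term split in the last bullet can be sketched as a two-tier store. This is a minimal illustration of the idea, not Pam's implementation: the class names are invented, and the `durable` flag stands in for the review step, which in a real system would be a judgment call rather than a boolean set up front.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    text: str
    durable: bool = False  # hypothetical flag: worth keeping after the task closes


@dataclass
class Memory:
    working: dict[str, list[Note]] = field(default_factory=dict)  # scratch notes per open task
    long_term: list[str] = field(default_factory=list)            # permanent project memory

    def jot(self, task: str, text: str, durable: bool = False) -> None:
        """Record a note in the working layer for an open task."""
        self.working.setdefault(task, []).append(Note(text, durable))

    def close_task(self, task: str) -> list[str]:
        """Promote durable notes to long-term memory and discard the rest."""
        notes = self.working.pop(task, [])
        kept = [note.text for note in notes if note.durable]
        self.long_term.extend(kept)
        return kept


memory = Memory()
memory.jot("fix-race", "Retry #3 still fails with the lock held")
memory.jot("fix-race", "Stack trace points at the cache warmer")
memory.jot("fix-race", "Cache warmer must acquire the lock before reading", durable=True)

promoted = memory.close_task("fix-race")
print(promoted)
print(memory.working)
```

After the task closes, the one lesson that matters next time sits in `long_term`, and the trace-level scratch is gone.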

How does Pam turn files into a context?

Pam turns the files into a context by linking the people, decisions, modules, and constraints across your notes into one structure, and keeping it current as things change. Keeping the files accurate is the foundation; this structure is the next step on top.

A folder of accurate notes is still a flat pile of text. To reason well, an agent needs to know how the pieces relate: which decision depends on which, which constraint replaced an older one, who changed what and when, what is true now versus what was superseded last week. Pam builds that layer on top of the files. It links the people, decisions, modules, and constraints across your notes into a living model of your project and keeps it current as things move. Where two facts disagree, the model holds the one that is true now instead of carrying both.
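One way to picture "which constraint replaced an older one" is a set of facts with explicit supersession links. The sketch below is our own illustration under that assumption, not a description of Pam's internals; the `Fact` fields and example facts are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    id: str
    subject: str
    claim: str
    recorded: date
    supersedes: Optional[str] = None  # id of the fact this one replaces, if any


def current_facts(facts: list) -> list:
    """Facts that nothing else has superseded, i.e. what is true now."""
    superseded = {f.supersedes for f in facts if f.supersedes}
    return [f for f in facts if f.id not in superseded]


def history(facts: list, fact_id: str) -> list:
    """Walk the supersession chain from a fact back to the original."""
    by_id = {f.id: f for f in facts}
    chain = []
    cursor = by_id.get(fact_id)
    while cursor is not None:
        chain.append(cursor)
        cursor = by_id.get(cursor.supersedes) if cursor.supersedes else None
    return chain


facts = [
    Fact("f1", "rate-limit", "API limit is 100 requests/min", date(2025, 1, 5)),
    Fact("f2", "rate-limit", "API limit is 500 requests/min", date(2025, 4, 2), supersedes="f1"),
    Fact("f3", "auth", "Tokens expire after 24 hours", date(2025, 2, 11)),
]

now_true = current_facts(facts)
trail = history(facts, "f2")

print([f.claim for f in now_true])
print([f.id for f in trail])
```

The agent is handed only the current claim about the rate limit, while the older one stays reachable as history instead of competing with it.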

This is file-first growing up, not file-first thrown out. The files stay the ground truth you can open, read, and version. The model is the structure Pam derives and maintains over them, so the agent can follow a thread across the project without re-reading everything, and you can even look at it as a graph of how the pieces connect. There is no opaque vector store underneath, nothing you cannot open and check. The plain files are still there. Pam stops treating them as a flat pile and starts treating them as a connected model.

Does Pam actually improve memory? LoCoMo benchmark results

Yes, and the comparison that matters most for this audience is Pam against the exact setup you are probably running now: Claude Code with a folder of .md files.

In our 2026 run, Denys Herasymuk, AI Researcher, put both on LoCoMo (Long Conversational Memory), an ACL 2024 benchmark from Snap Research that tests recall across long, multi-session conversations, using the same 1,986 questions and the same GPT-4o judge. The baseline is Claude Code with a folder of memory .md files. The Pam side is Pam's memory MCP, run in OpenClaw (it works with Claude Code too).

Pam memory beats plain memory .md files

Overall LLM-as-a-Judge accuracy on the LoCoMo benchmark

Setup                                         | Accuracy
Pam memory (OpenClaw + Pam MCP)               | 82.8% (+32.4 pt)
Memory .md files (Claude Code + memory files) | 50.4%

Figure 1. Overall LLM-as-a-Judge accuracy across all 1,986 LoCoMo questions, scored by GPT-4o as the judge. Pam memory answers 1,644/1,986 correctly vs 1,000/1,986 for plain memory .md files, a +32.4 pt improvement.

That is a 32.4-point gap on identical questions.
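The headline figures follow directly from the raw counts in Figure 1, as this short check shows. The counts are the ones reported above; the error-reduction figure is our own derived arithmetic rather than a number from the benchmark report.

```python
TOTAL_QUESTIONS = 1986
correct = {"Pam memory": 1644, "Memory .md files": 1000}

# Accuracy in percent, rounded to one decimal as in the figure.
accuracy = {name: round(100 * n / TOTAL_QUESTIONS, 1) for name, n in correct.items()}
gap_points = round(accuracy["Pam memory"] - accuracy["Memory .md files"], 1)

# The same result seen from the other side: how many answers were wrong.
wrong = {name: TOTAL_QUESTIONS - n for name, n in correct.items()}
error_reduction = 1 - wrong["Pam memory"] / wrong["Memory .md files"]

print(accuracy)
print(f"gap: {gap_points} points")
print(f"wrong answers: {wrong}")
print(f"relative error reduction: {error_reduction:.0%}")
```

Read as errors rather than accuracy, the plain-files setup gets 986 questions wrong against 342 for Pam, which is roughly 65% fewer wrong answers on the same question set.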

The plain-files setup lands at 50.4%. It can find a fact, but over a long, shifting project it loses the thread. The failures in the log are the kind you have probably hit yourself. It hallucinated dates, dropping events into 2026 that actually happened in 2023.

In one conversation, it lost the context entirely and answered "No information available in memory" for facts that were sitting right there in the notes. And when the correct answer was that something had never been mentioned, it tended to invent one instead, filling the gap with made-up brand names and details. A folder of notes holds the information. It does nothing to stop the model from drifting or making things up once the project runs long.

Pam reaches 82.8% on the same questions, because it keeps the files structured along a timeline and hands the model a non-conflicting context. This is our own run on the public LoCoMo dataset, scored with the same GPT-4o judge for both setups.

The gap holds across every question type, and it is widest exactly where plain files fail hardest. LoCoMo splits its questions into five kinds:

  1. Single-hop: a plain recall of a single fact.
  2. Multi-hop: combining facts scattered across several sessions.
  3. Temporal: when things happened and in what order.
  4. Open-domain: blends something from the conversation with outside world knowledge.
  5. Adversarial: asks about things that were never said, to see whether the system makes up an answer or admits it does not know.

Pam wins across every question type

LLM-as-a-Judge accuracy by LoCoMo question category

Category    | Questions | Pam memory (OpenClaw + Pam MCP) | Memory .md files (Claude Code)
Single-hop  | 282       | 52.1%                           | 41.5%
Temporal    | 321       | 85.4%                           | 52.0%
Open-domain | 96        | 57.3%                           | 39.6%
Multi-hop   | 841       | 86.2%                           | 54.8%
Adversarial | 446       | 99.3%                           | 48.7%

Figure 2. LLM-as-a-Judge accuracy broken down by question category on the LoCoMo benchmark, scored by GPT-4o as the judge. Pam memory leads on all five categories, with the largest gaps on Adversarial (+50.6 pt), Temporal (+33.4 pt) and Multi-hop (+31.4 pt).

The widest gap is on adversarial questions, the ones built to bait a confident wrong answer (see the chart above). Plain files score 48.7%, no better than a coin flip, because they let the model invent something when the honest answer is "not mentioned." Pam reaches 99.3%.

The next two gaps are temporal reasoning, knowing when things happened, and multi-hop, stitching facts across sessions, both over 31 points. Those are exactly the cases that break a project that has been running for weeks.
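The per-category figures can be cross-checked against the overall score: weighting each category's accuracy by its question count should land back on 82.8% and 50.4%. The numbers below are copied from Figure 2; the variable names are ours.

```python
# category: (questions, Pam accuracy %, plain .md files accuracy %)
results = {
    "Single-hop": (282, 52.1, 41.5),
    "Temporal": (321, 85.4, 52.0),
    "Open-domain": (96, 57.3, 39.6),
    "Multi-hop": (841, 86.2, 54.8),
    "Adversarial": (446, 99.3, 48.7),
}

# Gap in percentage points per category, largest first.
gaps = {name: round(pam - files, 1) for name, (_, pam, files) in results.items()}
ranked = sorted(gaps, key=gaps.get, reverse=True)

# Question-weighted averages should reproduce the overall accuracy.
total = sum(n for n, _, _ in results.values())
pam_overall = sum(n * pam for n, pam, _ in results.values()) / total
files_overall = sum(n * files for n, _, files in results.values()) / total

for name in ranked:
    print(f"{name:<12} +{gaps[name]} pt")
print(f"{total} questions, weighted overall: {pam_overall:.1f}% vs {files_overall:.1f}%")
```

The weighted averages match the overall numbers, so the category breakdown and the headline result describe the same run.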

A LoCoMo score depends on the judge model, the prompt, and which questions you run, so numbers from different setups don't compare cleanly. The only honest comparison is a same-run one, which is exactly what the table above is.

What are the trade-offs of using Pam?

Pam is not free, and it is early. A few things to know before you rely on it.

  • It takes time to build a context layer: Anywhere from 2 hours to 2 days, depending on how much context data you have and how many sources you've added.
  • It is in early access: The benchmarks are solid, the proprietary architecture is real, and we are constantly improving toward a Full Self Driving mode for memory. If you want to push on it, you will be one of the early people doing so.

The bottom line

Plain memory .md files don't fail loudly. They fail at about 50%. On the LoCoMo benchmark, that setup gets just over half the questions right. The AI brain you build on top of it, or on any long, messy business project, gets it wrong about half the time when Claude Code reaches into memory, and it answers like it's certain, because a flat file has no way to know a stale note from a current one. Every decision you or the agent makes on top of that brain inherits the same odds: you're flipping a coin. For a quick script you'll never revisit, that's fine. For an AI brain meant to run real work for weeks across your tools, it's not something you can build on. The fix isn't a bigger prompt or a tidier file. It's a memory layer that stays true as the project moves. That's what Pam is.

Tired of re-explaining your own project every few days?

If you build with Claude Code, this is built for you. Pam keeps your project's memory current across sessions, so your agent never starts cold. We're onboarding early users now.

FAQ

What is a persistent memory layer for an AI agent?

It is a system that stores what an agent learns and keeps it available across separate sessions. A context window only lasts for one conversation. A persistent memory layer holds your project's state over days and weeks, updates it as things change, and feeds the agent the relevant part on each run.

Does Claude have built-in memory, and is it enough?

Claude can remember small, stable facts about you across chats, which helps with preferences. It is not built to hold the full, shifting state of a software project across sessions and tools. For that, you need a dedicated memory layer.

Why not just use a CLAUDE.md file?

CLAUDE.md is great for setup, but left alone, the agent won't keep it up to date, so what it learns is lost at the end of the session. It also sits in the prompt at every turn, which bloats context, and it has no split between throwaway notes and permanent rules. It cannot keep up with a project that changes daily.

How is Pam different from Obsidian?

An Obsidian vault is a folder of markdown files, but the structure is yours to design, and it won't keep itself current or resolve contradictions. Pam uses the same file substrate with an expert structure on top: it writes notes as things happen, keeps them current, splits short-term from long-term, and resolves conflicts.

Is Pam the same as RAG?

No. RAG retrieves chunks from a static index, is only as current as your last re-index, and returns matches by similarity rather than truth. Pam maintains a living model of your project, writes back what it learns, and resolves conflicting facts instead of handing the agent both.

Does a bigger context window solve agent memory?

No. A bigger window gives the model more room within a single session. It does nothing the next morning when the session is gone. Memory across sessions is a different problem from context within one.

How does Pam score on the LoCoMo benchmark?

On a same-setup run of 1,986 questions with a GPT-4o judge, Pam scored 82.8% against 50.4% for Claude Code with .md files. The full benchmark is open source, so you can reproduce it.