The LoCoMo benchmark, explained: testing long-term conversational memory in LLM agents
If you are building memory for an AI agent, LoCoMo is the test people cite first. Here is what it actually measures, what the ACL 2024 paper found, how systems like Mem0 score on it, and why those scores disagree more than the leaderboards admit.
The short version
LoCoMo (Long Conversational Memory) is a benchmark for very long-term conversational memory in LLM agents, from the ACL 2024 paper Evaluating Very Long-Term Conversational Memory of LLM Agents by Snap Research.
It asks a model questions about conversations running up to 35 sessions and around 300 turns, then checks whether the model can recall, connect, and reason across all of it.
The headline result: long-context models and RAG help, but they still trail humans by a wide margin, especially on time-based questions. And the published scores disagree depending on who runs them, which matters more than most leaderboards let on.
What is the LoCoMo benchmark?
LoCoMo measures whether an AI agent can remember a conversation that goes on for months.
Most older dialogue benchmarks stop at about five chat sessions. That is not how a real assistant gets used. People talk to an agent across weeks, refer back to things they said in March, change their minds, and expect it to keep up. LoCoMo was built to test exactly that gap. The name is short for Long Conversational Memory, and the full paper is titled Evaluating Very Long-Term Conversational Memory of LLM Agents.
The setup is simple to state. You give a system a very long conversation history, you ask it questions whose answers are buried somewhere in that history, and you score how often it gets them right. The hard part is that the answer might depend on a single line from session 2, or on stitching together three facts from sessions 4, 9, and 22.
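The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: `Question`, `evaluate`, `naive_answer`, and `contains_gold` are hypothetical names, the "memory system" is a toy word-overlap lookup, and the correctness check is a simple substring match.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    text: str
    gold: str  # ground-truth answer


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used only by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_answer(history: List[str], question: str) -> str:
    """Toy stand-in for a memory system: return the turn sharing the most words."""
    q = tokens(question)
    return max(history, key=lambda turn: len(q & tokens(turn)))


def contains_gold(prediction: str, gold: str) -> bool:
    """Toy correctness check: the gold answer appears in the prediction."""
    return gold.lower() in prediction.lower()


def evaluate(
    history: List[str],
    questions: List[Question],
    answer_fn: Callable[[List[str], str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of questions the system answers correctly."""
    if not questions:
        return 0.0
    hits = sum(is_correct(answer_fn(history, q.text), q.gold) for q in questions)
    return hits / len(questions)


history = [
    "[session 2] Ana: I adopted a dog named Pixel.",
    "[session 9] Ben: I moved to Lisbon in May.",
]
questions = [
    Question("What is the name of Ana's dog?", "Pixel"),
    Question("Where did Ben move?", "Lisbon"),
    Question("What is Ben's job?", "teacher"),  # never stated in the history
]
score = evaluate(history, questions, naive_answer, contains_gold)
print(f"accuracy: {score:.2f}")
```

A real run would swap `naive_answer` for the memory system under test and `contains_gold` for whatever scorer the evaluation uses.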
The paper and where to find it
LoCoMo comes from Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. It was published at ACL 2024 (the 62nd Annual Meeting of the Association for Computational Linguistics) and first appeared on arXiv in February 2024.
- Project page: snap-research.github.io/locomo
- Code and data: github.com/snap-research/locomo
If you only have time for one part of the paper, read the question answering results. That is the task almost everyone uses to compare memory systems, and it is the part the public dataset is annotated for.
Inside the LoCoMo dataset
LoCoMo was not scraped from real chat logs. The authors built it with a machine-human pipeline. Two LLM agents are each given a persona and a timeline of causally connected life events, then made to talk over many sessions. The agents can share and react to images, and human annotators check and edit the resulting conversations for long-range consistency.
The scale is what sets it apart:
| Property | LoCoMo |
|---|---|
| Avg. turns per conversation | ~300 |
| Avg. length | ~9,000 tokens |
| Sessions per conversation | up to 35 |
| Public GitHub release | 10 conversations (locomo10.json) |
| QA questions | ~1,540 |
The authors note that a LoCoMo conversation is about 16 times longer than one in MSC, the prior standard, spread over roughly 10 times more turns. The public release is a curated subset: they kept the longest, best-annotated conversations so that evaluating closed-source models did not get prohibitively expensive. The images themselves are not redistributed, but their captions and URLs are.
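If you want to check figures like these against your own copy of the data, a small counting helper is enough. The sketch below assumes each conversation is a dictionary with a `conversation` object whose `session_N` keys hold lists of turns, plus a `qa` list. Those field names are assumptions about the file layout, so verify them against the release before relying on the output. The toy sample is invented for illustration.

```python
import re
from typing import Any, Dict

# Matches "session_1", "session_12", but not "session_1_date_time".
SESSION_KEY = re.compile(r"^session_\d+$")


def conversation_stats(sample: Dict[str, Any]) -> Dict[str, int]:
    """Count sessions, turns, and QA items in one conversation record."""
    conversation = sample.get("conversation", {})
    sessions = [
        turns
        for key, turns in conversation.items()
        if SESSION_KEY.match(key) and isinstance(turns, list)
    ]
    return {
        "sessions": len(sessions),
        "turns": sum(len(turns) for turns in sessions),
        "questions": len(sample.get("qa", [])),
    }


# Invented toy record using the assumed layout.
toy_sample = {
    "conversation": {
        "speaker_a": "Ana",
        "speaker_b": "Ben",
        "session_1_date_time": "1 May 2023",
        "session_1": [
            {"speaker": "Ana", "text": "I adopted a dog."},
            {"speaker": "Ben", "text": "What is its name?"},
        ],
        "session_2": [{"speaker": "Ana", "text": "Pixel is settling in."}],
    },
    "qa": [{"question": "What pet did Ana adopt?", "answer": "a dog"}],
}

stats = conversation_stats(toy_sample)
print(stats)
```

To run it on the public file, load it with `json.load` and call `conversation_stats` on each record, assuming the top level of the file is a list of conversations.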
The question types LoCoMo tests
The benchmark defines three tasks. Question answering is the one almost everyone runs.
1. Question answering
This is the memory and reasoning test. Each question is tagged with the turns that contain its answer, and the categories are where the difficulty lives:
- Single-hop: The answer sits in one session. Pure recall.
- Multi-hop: You have to combine facts from several sessions. This is one of the harder categories.
- Temporal: Reasoning about when things happened and in what order. This is where models struggle most.
- Open-domain: Blend something from the conversation with outside world knowledge.
- Adversarial: Questions designed to bait a confident wrong answer, so a model that hallucinates gets caught.
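Because the difficulty lives in the categories, a single overall number hides most of the story. Here is a minimal sketch of a per-category breakdown. The category labels are plain strings chosen for readability, and the results list is made up; neither reflects how the dataset encodes categories or any real system's output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to the fraction of its questions answered correctly."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


# Invented (category, was_correct) pairs for illustration only.
results = [
    ("single-hop", True),
    ("single-hop", True),
    ("multi-hop", False),
    ("multi-hop", True),
    ("temporal", False),
    ("adversarial", True),
]

per_category = accuracy_by_category(results)
overall = sum(correct for _, correct in results) / len(results)

for category, accuracy in per_category.items():
    print(f"{category:12s} {accuracy:.2f}")
print(f"{'overall':12s} {overall:.2f}")
```

In this toy run the overall figure looks respectable while the temporal row sits at zero, which is exactly the kind of gap a headline score conceals.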
2. Event graph summarization
Each speaker has an underlying graph of causally and temporally linked events. The model reads the conversation and tries to reconstruct that event structure. It is a test of whether a system tracks cause and time, not just isolated facts.
3. Multi-modal dialog generation
Given the history, including shared images, the model has to produce a reply that stays consistent with everything said so far. This is the part that uses the image captions and URLs in the data.
What the original LoCoMo results showed
The first paper is more sobering than the vendor blogs that came after it. Long-context models and RAG both helped, improving over base models by 22% to 66% on QA. But the gap to people stayed large.
- Long-context and RAG systems still lagged human performance by about 36% and 44% overall, respectively.
- On temporal reasoning the gap widened to about 41% and 52%, respectively.
- Long-context models were notably fragile on adversarial questions, performing about 64% worse there than the base model.
The takeaway from the authors was not that any one method wins. It was that very long-term memory is still an open problem, and that throwing a bigger context window at it does not close the gap on its own.
The LoCoMo leaderboard: how memory systems score
After the paper, LoCoMo became the number memory startups quote. The most cited figures come from Mem0. In their 2025 paper, scored with an LLM as judge, the comparison looked like this:
| Method | Accuracy | Notes |
|---|---|---|
| Full context | 72.9% | Highest raw accuracy, ~26K tokens, p95 latency ~17s |
| Mem0 (graph) | 68.4% | Graph-enhanced variant |
| Mem0 | 66.9% | ~1.8K tokens, p95 latency ~1.4s |
| RAG (standard) | 61.0% | Retrieval over chunks |
| Baseline memory feature | 52.9% | Reference point in the paper |
The interesting line there is not the top score. It is that full context wins on raw accuracy but costs roughly 14 times the tokens and an order of magnitude more latency. A memory layer trades a few points of accuracy for a system you can actually afford to run. In 2026 Mem0 reported a new token-efficient algorithm scoring 92.5 on LoCoMo at under 7K tokens per call, with the biggest jumps on the two categories that matter most in production, temporal and multi-hop.
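The trade-off is easy to verify from the table. The snippet below only redoes the arithmetic on the rounded figures quoted above (~26K vs ~1.8K tokens, ~17s vs ~1.4s p95 latency), so the ratios are approximate.

```python
# Rounded figures from the table above; variable names are illustrative.
full_context = {"accuracy": 72.9, "tokens": 26_000, "p95_latency_s": 17.0}
mem0 = {"accuracy": 66.9, "tokens": 1_800, "p95_latency_s": 1.4}

token_ratio = full_context["tokens"] / mem0["tokens"]
latency_ratio = full_context["p95_latency_s"] / mem0["p95_latency_s"]
accuracy_gap = full_context["accuracy"] - mem0["accuracy"]

print(f"tokens:   {token_ratio:.1f}x")      # about 14.4x
print(f"latency:  {latency_ratio:.1f}x")    # about 12.1x
print(f"accuracy: {accuracy_gap:.1f} points")  # about 6.0 points
```

So full context buys roughly six points of accuracy for about 14 times the tokens and about 12 times the tail latency.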
Which AI memory layer performs best on the LoCoMo benchmark?
The honest answer is that it depends on who ran the test. That is not a dodge, it is the most important thing to understand about LoCoMo.
A separate evaluation from Memori Labs, which included their own system, scored the same class of systems and got a different ranking: full context highest at about 87.5%, their own system around 82%, Zep and LangMem in the high 70s, and Mem0 lower at about 62%. Compare that to Mem0's own numbers and the order changes. Same benchmark, different conclusion.
Read this before quoting a number
A LoCoMo score is a function of the judge model, the judging prompt, the retrieval settings, and which conversations were used. When a vendor cites a single percentage, ask what setup produced it. The relative gaps inside one consistent run are meaningful. Cross-vendor comparisons usually are not.
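One way to enforce that habit is to store the setup next to every score and refuse to compare across setups. This is a sketch under assumptions: `RunSetup`, `Score`, and `comparable` are hypothetical names, the fields mirror the factors listed above, and the model names and numbers are placeholders rather than real results.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunSetup:
    """Everything that can move a score besides the system under test."""
    judge_model: str
    judge_prompt_id: str
    retrieval_top_k: int
    conversation_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Score:
    system: str
    accuracy: float
    setup: RunSetup


def comparable(a: Score, b: Score) -> bool:
    """Two scores are only comparable if they came from an identical setup."""
    return a.setup == b.setup


# Placeholder values for illustration only.
setup_a = RunSetup("judge-model-x", "prompt-v1", 10, ("conv-1", "conv-2"))
setup_b = RunSetup("judge-model-y", "prompt-v1", 10, ("conv-1", "conv-2"))

system_1 = Score("system-1", 0.70, setup_a)
system_2 = Score("system-2", 0.65, setup_a)
system_3 = Score("system-3", 0.80, setup_b)

print(comparable(system_1, system_2))  # same setup
print(comparable(system_1, system_3))  # different judge model
```

Frozen dataclasses compare field by field, so any difference in judge, prompt, retrieval depth, or conversation subset makes the pair non-comparable.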
Why LoCoMo scores disagree, and what synthetic benchmarks miss
Two things make LoCoMo numbers slippery.
First, the scoring. The original paper scored QA with token-level F1 against the ground truth, but most vendor results, including the figures above, use an LLM acting as judge to decide whether a generated answer matches. Swap the judge, reword its prompt, or change the temperature and the score moves. There is no single official harness everyone runs, so two honest teams can publish two different leaderboards.
Second, the data is synthetic. The conversations are generated from personas and event timelines, then cleaned up by annotators. That is a reasonable way to build a large, consistent benchmark, and the human editing pass is real work. But generated personas are tidy in a way real users are not. People contradict themselves, drop context, change jobs, and bring up things no event graph predicted. A system tuned to ace 10 curated conversations can still stumble on the messiness of one real account.
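To make the scoring point concrete, here is the same set of answers graded two ways: strict exact match and token-level F1, a common automatic QA metric. This is a simplified sketch, not the official LoCoMo scorer or any vendor's judge. The normalization is deliberately minimal and the answer pairs are invented.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric tokens only (a simplified normalizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented (prediction, gold) pairs for illustration only.
pairs = [
    ("7 May 2023", "7 May 2023"),               # exact answer
    ("She moved on 7 May 2023", "7 May 2023"),  # correct but verbose
    ("In June", "7 May 2023"),                  # wrong
]

em_score = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
f1_score = sum(token_f1(p, g) for p, g in pairs) / len(pairs)

print(f"exact match: {em_score:.2f}")
print(f"token F1:    {f1_score:.2f}")
```

The answers never change, yet the verbose but correct reply counts for nothing under exact match and gets partial credit under F1. An LLM judge adds a third opinion on top of that, which is why the metric has to be stated alongside the number.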
None of this means LoCoMo is useless. It is a good, hard, public test of long-range recall and temporal reasoning, and it moved the field forward. It just is not a substitute for measuring a memory system on the actual data it will run against. The benchmark tells you a system can remember. It does not tell you it will remember your users.
Where Pam fits
Pam is a self-onboarding memory system for AI agents.
Instead of asking you to hand-build an index or fill out a profile, Pam reads the data your work already produces and builds a living model of your context: who the people are, what changed, and what is true now. We care about LoCoMo because long-range recall and temporal reasoning are real, and the benchmark stresses both. We also take its limits seriously, which is why we treat any single score as a signal, not a finish line, and weigh behaviour on real, evolving data alongside it.
If you want the detail, we published Pam's LoCoMo results separately, and a broader piece on why persistent memory matters for agents.
Frequently asked questions
What does LoCoMo stand for?
Long Conversational Memory. The full paper title is Evaluating Very Long-Term Conversational Memory of LLM Agents, published at ACL 2024 by Snap Research.
How long are LoCoMo conversations?
On average about 300 turns and 9,000 tokens, spread over up to 35 sessions. That is roughly 16 times longer than conversations in the earlier MSC benchmark.
How many conversations are in the LoCoMo dataset?
The public GitHub dataset (locomo10.json) is a curated subset of 10 long conversations annotated for question answering and event summarization.
What tasks does LoCoMo evaluate?
Three: question answering (single-hop, multi-hop, temporal, open-domain, and adversarial), event graph summarization, and multi-modal dialog generation. QA is the one most people run.
Which memory system is best on LoCoMo?
There is no settled answer. Mem0's paper puts Mem0 ahead of RAG, with full context highest on raw accuracy, and a 2026 update reports 92.5. A separate Memori Labs run ranks the systems differently. The result depends on the judge, the prompt, and the data subset, so compare only within one consistent run.
Is LoCoMo a real or synthetic benchmark?
Synthetic, with a human pass. Conversations are generated from personas and event timelines, then verified and edited by annotators. It is consistent and scalable, but it does not capture how real users actually behave, so strong LoCoMo scores do not guarantee strong real-world memory.
Where can I download the LoCoMo dataset?
From the official repository at github.com/snap-research/locomo. The conversations are in ./data/locomo10.json. The images are not redistributed, but their captions and URLs are included.
