RESEARCH

PAM Outperforms ChatGPT by 40% on LoCoMo Benchmark


Executive Summary

PAM (Proactive AI Manager) has set a new performance record on the LoCoMo benchmark, achieving 74.35% accuracy, surpassing previous leaders such as Mem0 and OpenAI's ChatGPT native memory features. By using a "File-First" architecture rather than traditional vector databases, PAM allows AI agents to organize information into structured local files and use native tools.

This shift addresses common bottlenecks like retrieval fragility and abstraction overhead, resulting in a significant 80.1% accuracy in complex multi-hop reasoning tasks.

Introduction: The Persistence of Memory in the Age of Agents

In the rapidly evolving landscape of artificial intelligence, the concept of memory is frequently discussed yet rarely mastered. The prevailing industry narrative suggests that the problem of AI memory has been solved through longer context windows or complex vector database implementations. Large Language Models (LLMs) can now process hundreds of thousands of tokens in a single prompt, leading many to believe that "remembering" is simply a matter of data retrieval.

However, researchers and developers working on complex, multi-week projects have discovered a significant gap between mere data processing and knowledge maintenance. It is one thing for an agent to retrieve a specific fact from a recent conversation; it is another challenge entirely for that agent to maintain a coherent understanding of a project's evolution over months of interaction. This distinction marks the fundamental difference between a standard reactive chatbot and a truly proactive AI manager.

Today, we are proud to announce that PAM (Proactive AI Manager) has set a new state of the art on the LoCoMo benchmark, which is currently recognized as the most rigorous industry standard for evaluating long-term conversational memory. By achieving an accuracy score of 74.35%, PAM has officially surpassed previous leaders such as Mem0 and has decisively outperformed the native memory features provided by OpenAI.

Understanding the LoCoMo Benchmark: The Gold Standard for Long-Term Recall

The LoCoMo (Long-term Conversational Memory) benchmark was published at ACL 2024 by researchers from Snap Research. Unlike traditional Question-Answering (QA) tests that rely on static datasets, LoCoMo is designed to simulate the complexities of real-world human-AI interactions. The benchmark specifically evaluates a system's capacity to recall isolated facts, synthesize fragmented information into a cohesive understanding, and reason about complex temporal sequences within conversations that often extend over several days or even weeks.

The benchmark consists of 10 massive conversations, each averaging approximately 26,000 tokens. This scale is intentional because it pushes models beyond their immediate context limits, forcing them to rely on an external memory architecture. The evaluation is divided into four critical categories that test different facets of memory:

  • Single-hop Retrieval: Tests the ability to find a single, specific fact from a precise moment in the past.
  • Multi-hop Reasoning: Perhaps the most difficult category, requiring the system to synthesize information fragments scattered across multiple sessions.
  • Temporal Reasoning: Tests the agent's ability to understand dates, sequences, and the relative timing of events.
  • Open-domain Inference: Evaluates how the system infers new answers from the collective conversational evidence gathered over time.

To ensure that the evaluation measures actual understanding rather than simple keyword matching, we enhanced the benchmark with an LLM-as-a-Judge protocol using GPT-4o-mini, similar to Mem0's approach. This method prioritizes semantic correctness. For instance, if an agent recalls that an event happened on "last Friday" and the correct answer is "May 12th," the judge can recognize that these phrases refer to the same reality even if the tokens do not match.
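As a rough sketch of what such a judging step looks like, consider the snippet below. The prompt wording and function names are illustrative assumptions, not PAM's actual implementation; in practice the prompt would be sent to the judge model (GPT-4o-mini) and the reply parsed.

```python
# Hypothetical LLM-as-a-Judge scaffolding. The judge sees the question,
# the gold answer, and the system's answer, and grades on semantic
# equivalence rather than token overlap.

JUDGE_PROMPT = """\
You are grading a memory system's answer.
Question: {question}
Gold answer: {gold}
System answer: {answer}

Reply with exactly one word: CORRECT if the system answer conveys the
same fact as the gold answer (e.g. "last Friday" and "May 12th" may
refer to the same date), otherwise INCORRECT.
"""

def build_judge_prompt(question: str, gold: str, answer: str) -> str:
    """Fill the judge template for one (question, gold, answer) triple."""
    return JUDGE_PROMPT.format(question=question, gold=gold, answer=answer)

def parse_verdict(reply: str) -> bool:
    """Map the judge's one-word reply to a boolean 'correct' flag."""
    return reply.strip().upper().startswith("CORRECT")
```

The key design choice is that the judge grades meaning, so paraphrases and equivalent date references count as correct.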

A Comparative Analysis of Performance

To validate our findings, we evaluated PAM against the current industry leaders using the exact experimental framework established in the Mem0 research paper (Chhikara et al., 2025).

| System | LLM-as-a-Judge Accuracy | Absolute Improvement |
| --- | --- | --- |
| OpenAI (ChatGPT Memory) | 52.90% | Baseline |
| Mem0 | 66.88% | Baseline |
| PAM (Ours) | 74.35% | +7.47 points over Mem0 |

The results demonstrate a clear hierarchy in memory performance. The OpenAI built-in memory feature, despite having direct access to its own generated memory bank, struggled to maintain consistency across the massive LoCoMo conversations. PAM outperformed the OpenAI baseline by a staggering 21.45 absolute points, which represents a 40.5% relative improvement.
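The deltas quoted above follow directly from the raw accuracy scores:

```python
# Reproduce the headline gaps from the LLM-as-a-Judge accuracy scores.
pam, mem0, openai = 74.35, 66.88, 52.90

print(f"PAM vs OpenAI: +{pam - openai:.2f} points "
      f"({100 * (pam - openai) / openai:.1f}% relative)")
print(f"PAM vs Mem0:   +{pam - mem0:.2f} points "
      f"({100 * (pam - mem0) / mem0:.1f}% relative)")
```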


Figure 1. Comparison of long-term conversational memory systems on the LoCoMo benchmark by Snap Research. We use the same experimental setup as Mem0, evaluating all methods with LLM-as-a-Judge (GPT-4o-mini) on 1,540 questions across 10 conversation samples. PAM achieves 74.35% accuracy, outperforming Mem0 (66.88%) by +7.47 absolute points.

Furthermore, PAM achieved an 11.2% relative improvement over Mem0, which was previously considered the gold standard for agentic memory.

The Architecture of Recall: Why Databases Often Fail

The core secret to PAM's success is not found in a more complex mathematical retrieval algorithm. Instead, it is the result of a fundamental architectural shift that we detailed in our previous research titled The Architecture of Recall.

For years, the technology industry has operated under the fundamental assumption that AI memory should be managed as a structured database system. Systems like Mem0 and OpenAI's native features typically convert conversational turns into vector embeddings or short text snippets stored in a structured database. Our research indicates that this "Memory-as-a-Database" approach creates three critical bottlenecks that hinder long-term performance:

Abstraction Overhead

Database reliance forces agents to translate thoughts into formal queries. This creates cognitive friction because any imperfection in the query leads to retrieval errors.

Lossy Compression

Summarizing rich conversations into short entries strips away vital context and the relationships between facts. When these relationships are lost, the agent cannot effectively reason about the information later.

Retrieval Fragility

Systems fail when agents do not use exact keywords. This fragility is the main reason AI assistants often forget details provided just days prior.
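A toy illustration of this failure mode (the stored note and queries are made up for demonstration): a store keyed on exact wording answers the original phrasing but returns nothing for a semantically equivalent paraphrase.

```python
# A keyword-keyed memory store only answers queries that reuse the
# original wording; an equivalent paraphrase comes back empty.
memory = {
    "car maintenance": "Oil change scheduled for May 12th.",
}

def exact_lookup(query: str):
    """Return the stored note only on an exact key match."""
    return memory.get(query)

print(exact_lookup("car maintenance"))    # hit: returns the note
print(exact_lookup("vehicle servicing"))  # miss: None, despite same topic
```

Real systems use embeddings rather than literal string keys, but the underlying sensitivity to how a memory was phrased versus how it is later queried is the same fragility described above.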

The File-First Philosophy: Aligning with LLM Strengths

Instead of hiding memory behind an API or a database layer, PAM utilizes a "File-First" philosophy. PAM's Memory Agent organizes all acquired information into structured local files. It operates much like a high-level human assistant would: it takes detailed notes, organizes them into folders categorized by project and organization, and maintains a chronological timeline.

The key insight behind this design is simple: LLMs are exceptionally good at reading files and navigating directories. These capabilities are fundamental skills baked into their training data from the very beginning. By using a file-first architecture, we allow the agent to use its "native" tools, such as ls, grep, and cat, to locate and process information.
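A minimal sketch of the idea, assuming a memory directory of plain-text notes organized by project (the helper names here are hypothetical, not PAM's API):

```python
from pathlib import Path

def write_note(root: Path, project: str, date: str, text: str) -> Path:
    """Append a dated note under <root>/<project>/<date>.md."""
    note = root / project / f"{date}.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    with note.open("a", encoding="utf-8") as f:
        f.write(text + "\n")
    return note

def grep_memory(root: Path, keyword: str) -> list[str]:
    """Rough equivalent of `grep -ri keyword <root>`: matching lines."""
    hits = []
    for path in sorted(root.rglob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if keyword.lower() in line.lower():
                hits.append(f"{path.relative_to(root)}: {line}")
    return hits
```

Because the notes are ordinary files, an agent can navigate them with the same listing, reading, and searching operations it has seen throughout its training data, with no query language in between.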

This approach yielded PAM's most impressive result, an 80.1% accuracy score in Multi-hop reasoning.

Because related facts are stored together in organized documents rather than being fragmented into isolated database entries, the agent can "see" the logical connections between scattered pieces of information that other systems inevitably lose.


Figure 2. PAM's LLM-as-a-Judge accuracy broken down by question category on the LoCoMo benchmark. Number of correct/total questions per category: Single-hop (197/282), Temporal (233/321), Open-domain (37/96), Multi-hop (678/841).
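The headline number can be checked directly against these counts; the snippet below recomputes per-category and overall accuracy from the figures in the caption.

```python
# Recompute LoCoMo accuracy from the correct/total counts in Figure 2.
counts = {
    "Single-hop":  (197, 282),
    "Temporal":    (233, 321),
    "Open-domain": (37, 96),
    "Multi-hop":   (678, 841),
}

for category, (correct, total) in counts.items():
    print(f"{category}: {100 * correct / total:.2f}%")

correct_total = sum(c for c, _ in counts.values())
question_total = sum(t for _, t in counts.values())
print(f"Overall: {correct_total}/{question_total} = "
      f"{100 * correct_total / question_total:.2f}%")  # 1145/1540 = 74.35%
```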

Why This Matters for the Future of Proactive AI

PAM's achievement on the LoCoMo benchmark is more than a technical win. This milestone marks a significant change in how we develop AI agents to ensure they can truly manage workflows rather than just answering isolated prompts. For a Proactive AI Manager to be effective, it must do more than just answer questions; it must anticipate needs, track project milestones, and remember users' specific preferences over months of collaboration.

When we simplify the architecture, we free up the model's reasoning capacity. The agent no longer has to "think" about how to use its memory. It simply looks at its files, much like a human professional would look at their notes before a meeting. This allows the AI to focus its cognitive resources on high-level strategy and problem-solving rather than database management.

The broader lesson for the AI industry is that the interface between an agent and its memory matters significantly more than the sophistication of the storage system itself. While the world remains obsessed with larger context windows and more complex RAG (Retrieval-Augmented Generation) pipelines, our results suggest that the path to SOTA performance lies in architectural simplicity and alignment with the model's natural strengths.

Conclusion: A New Era for AI Agents

The 74.35% accuracy achievement on the LoCoMo benchmark confirms that PAM is at the forefront of the next generation of AI agents. By moving away from the limitations of traditional database-centric memory and embracing a file-first architecture, we have created a system that truly understands the value of long-term context.

As we continue to develop PAM, we remain committed to the philosophy that simplicity and native integration are the keys to unlocking the full potential of artificial intelligence. We invite you to read more about our technical approach in our research on why local files beat abstractions and to explore the research behind Mem0 and the LoCoMo benchmark to see how the industry is moving toward a more persistent and reliable form of AI memory.

The era of the "forgetful" AI is coming to an end. With PAM, the promise of an AI that truly knows you and your work is finally becoming a reality.

Experience AI That Truly Remembers

PAM's memory architecture doesn't just pass benchmarks; it transforms how AI manages your workflows over months of collaboration. See the difference persistent memory makes.