RESEARCH

I Spent $3,000 on Memories
and It Wasn't a Vacation


A while ago, I ran an experiment. I gave an AI agent access to a person's email, Google Drive, Notion workspace, and internal chat history. I told it to build a structured knowledge base from all of that. I gave it no instructions beyond the goal. No pipeline. No rules. Just here is the data, figure it out.

The idea behind this was not reckless, or at least I didn't think so at the time. We are building PAM, an AI manager that needs to remember things about you. Your ongoing projects, the decisions you made last Tuesday, that email thread where the client changed scope for the third time. For PAM to be useful, it needs to hold all of this in its head, structured, searchable, connected like a colleague who actually reads the briefs.

The question I wanted to answer was simple: if you give a state-of-the-art AI model full autonomy over this problem, what does it do? Can it look at a user's scattered digital life and impose order on it? Can it decide, on its own, what matters and what doesn't?

I connected the sources, pointed the agent at the data, and stepped back.

It ran for four days.

What we expected

Data Sources (email, Drive, Notion, chat) → Agent reads and processes all sources → Knowledge Base (facts.json)

What actually happened

- 668 .md reports
- 258 checklists
- 304 fact files
- 209 .py scripts
- 232 status files
- 290 backups

= 6,299 files / 190 MB / $3,000+

What happened

The agent created 6,299 files.

I had expected roughly ten. A facts file, a graph file, a timeline, maybe a few sync cursors. The kind of clean, consolidated output you'd design if you sat down with a whiteboard for an afternoon.

Instead, I got 190 megabytes of output. The API bill came to a little over $3,000. And when I opened the directory to see what had been produced, I found something that looked less like a knowledge base and more like the aftermath of a small bureaucratic explosion.

There were 668 markdown reports, one per processed page, each a small essay about what the model had found. There were 258 “verification checklists,” tiny text files that seemed to exist only because the model felt it should verify its own work; they added nothing. There were 304 intermediate fact files that should have been merged into the main knowledge base and then deleted, but instead sat in the root directory alongside everything else. There were 209 single-use Python scripts, one per page, which handled the merges perfectly and then skipped the cleanup phase entirely. And there were 232 status-tracking files, little JSON breadcrumbs the model left to mark its progress, except that it never read them back, so they served no purpose at all.

All of this was dumped flat into a single folder.

No subdirectories.

No separation between intermediate artifacts and final outputs.

If you opened the folder in Finder, the file list scrolled for what felt like minutes. Names like PROCESSING_COMPLETE_notion_page_1c4e5fc3.md repeated with minor variations, hundreds of times, stretching to the horizon.

907 of the files were under 100 bytes. Empty, or nearly so. Created and never written to.

The backups

Perhaps the most telling detail was the backup situation. The model, showing a kind of nervous diligence, decided to back up the knowledge base before processing each page. Not once, at the start of the run. Before every single page.

This produced 290 backup files, each named according to whatever convention the model felt like using at that moment. Some had date stamps. Some had UUIDs. Some had page IDs. Some had Unix timestamps. A few samples:

facts.json.backup_20260206_202948
facts.json.backup_notion_1c4e5fc3
facts.json.backup_1770401928
facts.json.corrupted_attempt_3

That last one is worth pausing on. There were four .corrupted files. The model didn't just fail; it archived its failures. It merged, corrupted the output, backed up the corruption, and tried again. It had the presence of mind to label each file as corrupted. It did not have the presence of mind to stop and reconsider its approach.

1. Attempt merge
2. Merge fails, output corrupted
3. Back up the corruption
4. Try again
...
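The corruption loop above has a standard cure: never write the knowledge base in place. A minimal sketch of the write-then-rename pattern, assuming a JSON facts file (the function name `atomic_write_json` is mine, not the agent's):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename
    it into place. os.replace is atomic on POSIX, so a crash mid-merge
    leaves the previous facts file intact instead of half-written."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)  # atomic swap; no .corrupted leftovers
    except BaseException:
        os.unlink(tmp_path)  # discard the partial write, keep the old file
        raise
```

With this in place there is nothing to back up before each page: the last successfully replaced file is always valid.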

The agent's arrogance

The architecture the agent designed for itself used two models. A fast, cheap one for initial classification, sorting incoming data by importance. And an expensive one for deep processing, the actual extraction of facts and integration into the knowledge base. This is a sensible pattern, in theory. You use the cheap model as a filter, so the expensive model only touches what matters.
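The two-tier pattern itself fits in a few lines. A minimal sketch, where `classify_cheap` and `extract_expensive` are hypothetical stand-ins for the two model calls (not real API signatures):

```python
from dataclasses import dataclass

@dataclass
class Item:
    id: str
    text: str

def route(items, classify_cheap, extract_expensive, threshold=0.8):
    """Cheap model scores each item; the expensive model only touches
    items at or above the threshold. Both callables are placeholders
    for real model calls."""
    facts = []
    for item in items:
        score = classify_cheap(item.text)      # fast, cheap, fallible
        if score >= threshold:
            facts.extend(extract_expensive(item.text))  # slow, costly
    return facts
```

The pattern is sound; the failure in my run was that the cheap scorer's confidence numbers meant nothing, so the filter passed garbage through with a 0.95 attached.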

In practice, the cheap model was fast and confident and frequently wrong. It assigned confidence scores between 0.75 and 1.0 with no discernible methodology. It extracted the same data multiple times in different formats across different runs. In one case, it reversed a known relationship, flipping a mentor and mentee, which then propagated into the knowledge base as a factual contradiction.

The expensive model downstream, receiving this contradictory input, did what any diligent bureaucrat would do: it created more files to track the contradictions.

A conflict file. Then clarifying questions. Then more processing to resolve the questions. Each step generated its own artifacts, none of them cleaned up.

Groundhog Day

The strangest part, looking at the output afterward, was the dates.

Files were created across a four-day window, February 5 through 9, and you could see overlapping waves of processing. The same data re-extracted, re-classified, re-processed.

The model had no memory of what it had already done. It left thirty individual summary files but no consolidated summary, no master log, no completion tracker.

There was no way to answer the most basic question: how much of the data has been processed?

The model just kept going.
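Answering "how much has been processed?" takes one persistent ledger, checked before every page. A minimal sketch, assuming page IDs as keys (the class name and `progress.json` path are my invention):

```python
import json
import os

class ProgressLedger:
    """Single source of truth for completed work. Checked before each
    page, so a re-run skips finished pages instead of re-extracting
    the same data in overlapping waves."""

    def __init__(self, path="progress.json"):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path) as f:
                self.done = set(json.load(f))

    def is_done(self, page_id: str) -> bool:
        return page_id in self.done

    def mark_done(self, page_id: str) -> None:
        self.done.add(page_id)
        with open(self.path, "w") as f:  # persist after every page
            json.dump(sorted(self.done), f)
```

Thirty scattered summary files could not answer the question; one file like this answers it in a single read.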

Outcome

I want to be careful here, because the easy conclusion is that the models are bad at this. They are not bad at this. They are, in fact, remarkably good at the individual tasks involved. Classify this email, extract facts from this document, summarize this thread.

At the level of a single operation, the output was often excellent.

What they cannot do is manage themselves. They have no concept of cleanup, consolidation, state tracking, or cost. They optimize for precision, which in practice means they will create a file for every thought they have and never go back to delete unnecessary ones. They are brilliant employees who leave every light on in every room and never close a door behind them.

The uncomfortable realization is that better models will not fix this. Without guardrails and architecture, a more capable model just produces an even bigger mess, faster.

The next generation of Claude or GPT will still produce 6,000 files if you give it the same instructions I gave, because the problem is not intelligence but discipline. Discipline is not a property of the model. It is a property of the system around the model.

There is a narrative in the industry right now that goes something like: give AI access to your data, and it will figure everything out. My experiment suggests that this narrative breaks down the moment you try it at any real scale.

What you get is not order. What you get is a very expensive mess, created with extraordinary confidence.

What's next

The $3,000 experiment did not kill the idea. It changed how I think about it.

I am now building memory infrastructure where the AI is not the architect but the worker.

A powerful, capable worker, but one that operates within strict constraints designed by humans. Every processing step declares exactly what files it will produce, and the pipeline enforces that contract. Intermediate artifacts are deleted before the next step runs, not when the model feels like tidying up. Facts from different sources are cross-referenced against each other at merge time, so you don't end up with the same information stored four times in four slightly different formats. The system tracks extraction costs and prioritizes accordingly.
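The "declare outputs, enforce the contract" idea can be sketched in a few lines. This is an illustrative skeleton under my own naming (`Step`, `run_pipeline`), not the production system:

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[], None]
    outputs: List[str]                                # files the step promises
    intermediates: List[str] = field(default_factory=list)  # deleted after run

def run_pipeline(steps, workdir="."):
    """Run each step, fail loudly on undeclared files, and delete
    declared intermediates before the next step starts."""
    for step in steps:
        before = set(os.listdir(workdir))
        step.run()
        created = set(os.listdir(workdir)) - before
        undeclared = created - set(step.outputs) - set(step.intermediates)
        if undeclared:
            raise RuntimeError(f"{step.name} created undeclared files: {undeclared}")
        for tmp in step.intermediates:                # enforced cleanup
            tmp_path = os.path.join(workdir, tmp)
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
```

The point of the sketch: cleanup is not a behavior you hope the model exhibits, it is a property the harness enforces.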

The goal has not changed. I still want an AI that remembers what matters about your work.

But I now know that the path to get there does not run through “let the model get loose on your Notion and see what happens.” It runs through engineering discipline applied to AI capabilities, through treating these models the way you would treat a brilliant new hire who has never worked in an office before. You do not hand them the keys and walk away. You give them clear tasks, check their work, and clean up the conference room after they leave.

There is an irony here that I keep coming back to. The AI models in this experiment could write beautiful, articulate essays on why file management matters, the importance of cleanup stages in data pipelines, and the dangers of redundant state.

They just cannot do it themselves. Not yet.

Why is PAM different?

PAM is building a memory infrastructure that works. Structured, cost-efficient, and actually useful. See how proactive AI memory can transform your workflow.