You Have a Corpus. You Just Don't Treat It Like One.

Start with two words, because the whole series rests on them and almost nobody outside a research lab has had to define either one.

A corpus is a body of documents you treat as one thing. Bounded, chosen, handled as a unit, not a folder and not a pile. The word is Latin for body. Linguists used it first, a deliberate collection of texts studied together: the complete works of an author, a century of one newspaper, a language sampled on purpose. Machine learning borrowed it, and that is where most people run into the word now, train a model on a corpus, point retrieval at a corpus, almost always with no definition attached.

Corpora is the plural, and it is the half nobody explains. It means many of those bodies at once: a client corpus, a finance corpus, a product corpus, one body per project. You almost never have a single corpus. You have corpora, whether you have ever named them or not.

The difference between the two is the part that actually bites. One corpus you can organize any way you like, by yourself, and change your mind later. Corpora you cannot, because the moment there are two, an address only works if it means the same thing in every body and to every system that reads it. One body you keep by preference. Many bodies you keep by agreement. That jump, from corpus to corpora, is where most ways of organizing quietly fall apart, and it earns a post of its own later.

Here is why this matters to you and not just to a lab. You already own a corpus, probably several, and have never drawn a line around any of them. Every document you have ever saved belongs to some body. You just never declared the body, so it behaves like a crowd instead of a collection. You do not have a pile of files. You have a corpus. You have probably never called it that, you almost certainly do not treat it like one, and that gap is the most expensive thing in your entire document estate.

A pile of files is not a corpus

What separates a corpus from the files on your drive is not size, and it is not tidiness. A messy corpus is still a corpus. A neat folder can still be just a folder. The difference is four properties, and a pile of files has none of them, not by accident but because nothing ever asked it to.

A corpus has a boundary. Someone has decided what is in it and what is not. That sounds trivial until you try it. Is last year’s archive in or out? Are the half-finished drafts in? The vendor’s PDFs you never read? A boundary is a decision, and a folder lets you skip the decision forever, which is why folders sprawl until nobody can say what they even contain.

A corpus has addressing. Any item can be pointed at, referenced, cited, by something more durable than “the third file I think I saved last spring.” This is the subject of the first two posts, scaled up. There it was about naming one thing well. Here it is about being able to reach any thing in the body on purpose, by an address that does not change when you reorganize.

A corpus has governance. There are rules about who and what may read it, write to it, change it, and remove from it. A folder has permissions, which is not the same thing. Permissions say who can open the door. Governance says what is allowed to happen inside the room, and keeps a record of what did.

A corpus is queryable as a whole. You can ask a question of the entire body and get an answer assembled from across it, instead of hunting one file at a time and hoping you remember where you looked. This is the property people most want and least have, and it is the one the other three make possible. You cannot query a body you have not bounded, addressed, and governed. You can only search it, badly, and searching because you cannot navigate is the system admitting it failed.

A folder has a location, and that is the end of its ambitions. Most people manage files. Most companies manage files. Almost nobody manages a corpus, and the tools are why. The drive hands you folders. The wiki hands you pages. The chat app hands you a search box. Not one of them asks you to draw a boundary or pick an addressing scheme or write a rule, so you never do, and the body of everything you own stays a body you cannot actually use.

Make it concrete with the contract from the last post. The Acme renewal, six months on. In a pile of files it is wherever you happened to drop it, under whichever logic made sense that afternoon, and the only working index is your memory. A new hire cannot find it. A colleague cannot find it. The version that was actually signed sits somewhere near three that were not. In a corpus the same document has a home that does not depend on you. It belongs to the Acme body. It carries an address a colleague can reach without asking. The pricing inside it is governed, so the wrong people do not see it. And the question, show me every contract that renews this quarter, has an answer, because the body can be asked. Same document. The difference is whether anyone but you can use it.

The plural is the hard part

We named this jump at the start, from one body you keep by preference to many that need a shared rule. Here is where it bites. The moment you have more than one corpus you inherit a question a single body never raises: how do you address something in one corpus from inside another, and have the reference still hold after someone reorganizes? The instant two people, or two corpora, disagree about what an address points to, the scheme is worth less than the folders it replaced.

This is the point where most organizational systems quietly give up. They are excellent inside one body and have nothing to say about the second. We are going to spend a whole post on what a real cross-corpus addressing scheme has to do, and why almost none of the popular ones survive contact with the plural. For now, just notice that “where do I put it” was never the real question. The real question was always “how does anyone else, including a machine, find it again without me in the room.”

Three things happen to a corpus

Once you see your files as a corpus, the single question you used to ask splits into three. You used to ask where you put a file. A corpus replaces that with: how does anything get into this body, how is everything in it addressed, and how does anyone, or anything, read across it. Those are not three features to shop for. They are three stages, and they run in order.

Things have to get in. Documents arrive from everywhere, drives and inboxes and scanners and other people’s systems, in a dozen formats, full of duplicates and sensitive content and noise. Getting them in is not copying them into a folder. It is bringing them in, cleaning them, classifying them, and making them actually belong to the body instead of sitting next to it. A corpus that anything can fall into is not a corpus, it is the pile again with a nicer name.

Everything in it has to be addressed. This is the naming series scaled from one item to the entire body, and it is where the boundary and the addressing scheme earn their keep. A body you cannot reference item by item is a body you can only ever read end to end, which means you cannot read it at all once it is bigger than an afternoon.

And it has to be read. This is the payoff and the reason anyone bothers. A person, and now increasingly an agent, asks the body a question and the body answers, drawing from across everything in it, in the right order, with the right pieces. Everything upstream exists to make this stage possible.

Each stage depends entirely on the one before it, and that dependency is the whole game. A corpus is only as good as what was brought into it. You can only read out as cleanly as you addressed going in. Skip the intake discipline and the body fills with garbage no addressing can rescue. Skip the addressing and the reading stage returns confident nonsense. The posts that follow take each of these apart, one at a time, because each one is its own deep and slightly expensive problem.

Something has to translate the question

There is a small piece hiding between addressing and reading, and the library had it all along: the catalog. An address gets you to the shelf. It does not tell you which shelf you want. Something has to translate the question in your head, the signed Acme renewal, into the address where that thing actually lives. In a library that is the card catalog and the staff who learned it. In most companies it is one person who has been there long enough to know where everything is, which means the catalog is a human being, and a human being leaves.

The last post called this the legend: the key that makes a numbered map mean anything, kept by hand and lost when the keeper goes. A corpus has to do better than that. The thing that turns a question into an address has to live in the body itself, written down and machine-readable, or the body stays legible only to the person who built it. That is the line between a filing system and a corpus. It is also the line between a drive an AI can use and one it cannot.

Why this matters now

None of this is new. Libraries have run corpora for centuries, with boundaries and call numbers and trained staff and a catalog. What changed, and changed recently, is that you now have a reader who is not human.

Retrieval systems and agents operate over a corpus. That is the unit they consume. And they inherit exactly the corpus you actually have, not the one you meant to keep. Point a capable model at a body of documents whose meaning lives in your head, in the logic you used to name things and the reasons you filed them where you did, and the model gets none of it, because it cannot read your head. The naming posts ended on this: humans externalize meaning into a person, machines externalize it into an index, and an undefined corpus has done neither. It left the meaning nowhere.

Here is what that looks like in practice. Connect an assistant to your company drive and ask it for the current signed master agreement with a client. If the body never recorded which version is authoritative, the assistant does what you would do with none of your own memory: it finds several files with similar names, picks one, and answers with total confidence. It cannot tell the executed copy from the third draft, because nothing in the corpus ever said which was which. You knew. You were the catalog. The agent only has what the body wrote down, and here the body wrote down nothing.

The corpus mindset is the fix, and it is not a tool you buy. It is the decision to force the meaning into the body itself, into the boundary and the addresses and the governance, so that the body is legible to a reader who was not in the room when it was built. That is also the bridge between the internet built for people and the one being built for agents. People can ask you what you meant. An agent only has what you wrote down.

A corpus is a product, not a junk drawer

Underneath all four properties and all three stages is one shift in posture. A junk drawer fills by accretion. Nobody owns it, nothing is decided, and it grows until you cannot find the thing you need and buy a second one. A corpus is the opposite. It is something with an owner and a standard, curated and maintained on purpose, the way you would maintain a product, because that is what it is.

We learned this on ourselves before we believed it. We stopped calling the top of our own file tree “projects” and started calling those bodies corpora, one per part of the business, each with a boundary and an owner and a single way of addressing what is in it. It sounds like a relabeling. It was not. It changed what it means to save a document. A document is no longer a file you dropped somewhere convenient. It is an entry you added to a body that someone is responsible for, addressed so that a colleague, or a system, or a model can reach it next year without asking you where it went. The discipline is small. The compounding is not.

You already have one

You do not get to decide whether you have a corpus. You have one. Everything you have ever saved is a body, defined or not, governed or not, legible or not. The only decision in front of you is whether you treat it like one, because the AI you are about to point at it will treat it exactly as well as you did, and not one bit better.

This is the hub of a series on naming, addressing, and actually finding things. It builds on the first two posts, What’s in a Filename? on how machines name, and Where Did You Put It? on how humans do. The posts after it take the corpus lifecycle apart: how a body gets fed, how it gets addressed, and how it gets read.

Aaron Lamb Co-Founder, Hexaxia Technologies