What's in a Filename?

Every file on every system you have ever touched has a name. That name is doing more work than you think, and usually doing some of it badly.

Open a browser’s network tab on almost any modern site and watch the names scroll past. a8f3c2b1.chunk.js. 2f4c39_1cd6b8246de54bbdb2e49256956da0dd~mv2.jpg. A string of hex with no extension at all. No human wrote those names, and no human can read them. Yet the machines passing them around never get confused, never grab the wrong file, never lose one. The system works perfectly. It is just completely illegible to you.

That gap, between a name a machine can use and a name a person can read, is the subject of this series. It is a deeper problem than it looks, and almost everyone who has tried to solve it has fixed one piece by quietly breaking another.

Start with what a name is actually for.

What does a filename have to choose between?

The three jobs of a name

A filename, or an ID, or an address, can be asked to do three different jobs. Hold all three in mind, because every scheme we look at, in this post and the ones after it, is really a bet on which one matters most.

Identity. Is this thing distinct from every other thing? If two files end up with the same name, can the system tell them apart, and what happens when it cannot? A good identity never collides.

Integrity. Can you trust that the thing is what it claims to be, and that it has not silently changed underneath you? Integrity is about proof: ideally the name itself gives you a way to verify the contents.

Meaning. Can a person, or now a machine reasoning about a topic rather than a stream of bytes, tell what the thing is and where it belongs, without asking anyone or consulting a separate index? Meaning is legibility and navigability rolled together.

Here is the uncomfortable part the rest of this series unpacks: the three jobs pull against each other. The names best at identity are worst at meaning. The names that carry the most meaning are the least reliable as identity. You can have a name that is globally unique and tamper-evident, or a name a human can read at a glance, but getting both in one string, at scale, is genuinely hard. Most systems do not even try. They pick a job, win it decisively, and push the others onto something else.

To see how, look at the side that actually finished the work: machines. Machine naming is, by any fair measure, a solved problem. It just solved a problem that has nothing to do with you being able to read the answer.

Sequential: the counter that knows too much

The oldest naming scheme in computing is also the simplest. Count. One, two, three. Database primary keys, invoice numbers, support-ticket numbers, the auto-increment column every developer reaches for first.

Sequential IDs have real virtues. They are trivially ordered, so you always know which came first. They are compact. And a human can actually hold them: “ticket 12” means something to a person in a way that a thirty-two-character hash never will. For a small system run by a few people, a counter is often exactly right.

The trouble starts when the system grows up. A counter needs someone to turn it, which means a single authority handing out the next number. The moment two servers, or two databases, or two anything try to mint IDs at once, they both reach for “1,002” and you have a collision, the one thing identity is supposed to prevent. Distributed systems spend real engineering effort working around this, with sharded ranges and coordination services, precisely because the naive counter does not survive being split.
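A minimal sketch of that failure, assuming two servers that each remember the same last-issued number and never talk to each other. The class name `NaiveCounter` and the helper `shard_range` are hypothetical, invented here for illustration; the second shows one simple form of the "sharded ranges" workaround, not any particular system's design.

```python
from itertools import count


class NaiveCounter:
    """An auto-increment ID source that knows only its own local state."""

    def __init__(self, last_issued: int) -> None:
        self._ids = count(last_issued + 1)

    def mint(self) -> int:
        return next(self._ids)


# Two servers start from the same last-known value and never coordinate.
server_a = NaiveCounter(last_issued=1001)
server_b = NaiveCounter(last_issued=1001)

id_a = server_a.mint()
id_b = server_b.mint()
print(id_a, id_b, "collision" if id_a == id_b else "ok")


def shard_range(shard: int, block_size: int = 1_000_000) -> range:
    """Give each shard its own disjoint block of IDs (a hypothetical layout)."""
    start = shard * block_size + 1
    return range(start, start + block_size)
```

Both servers mint the same number, which is the collision. The range version avoids it only because something decided in advance which shard owns which block, and that decision is itself a form of coordination.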

And a sequential ID leaks. This is the quiet one. If a competitor places an order and gets confirmation number 1,043, then buys again a week later and gets 1,061, they know you took eighteen orders that week, seventeen of them from someone else. Sequential IDs broadcast your volume, your growth rate, sometimes your customers’ place in line, to anyone patient enough to sample them. The technique has a name in security circles: enumeration. It is why serious APIs stopped exposing raw incrementing IDs and reached for something with no order and no meaning at all.
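The arithmetic an outsider needs is almost embarrassingly small. A sketch, assuming the observer holds two of their own confirmation numbers and knows how many days separate them; the function names are hypothetical.

```python
def ids_issued_after(first_sample: int, second_sample: int) -> int:
    """IDs handed out after the first sample, up to and including the second."""
    return second_sample - first_sample


def estimated_daily_orders(first_sample: int, second_sample: int,
                           days_apart: float) -> float:
    """Average order rate implied by two sampled IDs."""
    return ids_issued_after(first_sample, second_sample) / days_apart


gap = ids_issued_after(1043, 1061)   # includes the observer's second order
others = gap - 1                     # orders that belong to someone else
print(gap, others, estimated_daily_orders(1043, 1061, days_apart=7))
```

Two samples give a volume. A handful of samples over a few months give a growth curve.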

Which brings us to the opposite extreme.

UUID: unique, and unreadable

A UUID looks like this: 550e8400-e29b-41d4-a716-446655440000. One hundred and twenty-eight bits, usually shown as thirty-two hex digits in five groups. Its entire reason to exist is to let any machine, anywhere, generate an identifier with no coordination at all and effectively no chance of ever colliding with one generated by any other machine.

“Effectively no chance” is doing real work in that sentence, and it is worth pausing on, because it is the heart of why this scheme wins on identity. A random UUID carries 122 bits of randomness. The count you would have to generate before a collision became likely is not millions or billions; it is on the order of a billion billions. You could mint a million UUIDs a second for a century and never expect a clash. So every machine in a system can create IDs independently, offline, in parallel, and simply trust they are unique. No counter, no authority, no coordination. For distributed systems that is close to a miracle, and it is why UUIDs sit underneath so much modern software.
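Those numbers can be checked with the standard birthday-bound approximation. The sketch below assumes 122 uniformly random bits per ID and a perfect random source; the function names are made up for this example.

```python
import math

RANDOM_BITS = 122  # random bits in a version 4 UUID


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-(n * (n - 1)) / 2 ** (bits + 1))


def count_for_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """How many IDs before a collision has probability p (approximate)."""
    return math.sqrt(2 * 2 ** bits * math.log(1 / (1 - p)))


# One million IDs per second for one hundred (365-day) years.
ids_in_a_century = 1_000_000 * 100 * 365 * 24 * 3600

print(f"{count_for_probability(0.5):.2e}")
print(f"{collision_probability(ids_in_a_century):.2e}")
```

Under these assumptions, an even chance of a collision takes roughly 2.7 × 10^18 IDs, and the century-long run at a million per second comes out to a little under a one-in-a-million chance of a single clash.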

They are also, to a human, pure noise. You cannot look at 550e8400-... and know whether it points to a customer, an invoice, a photo, or a log line. The identity is perfect and the meaning is zero. A UUID only becomes legible when something else, a database row, a lookup table, an index, stands next to it and translates. The machine does not mind, because the machine never needed to read it; it had the address. You mind, because you do.

The UUID world keeps evolving, and the direction is telling. Newer schemes like ULID, KSUID, and Snowflake bolt a timestamp onto the front so the IDs sort in roughly the order they were created, recovering the one nice property the plain counter had. That is a genuine improvement for machines. It does nothing for you. A time-sortable unreadable string is still unreadable. The trend line is clear: identity is so thoroughly solved that the remaining work is optimizing it for other machines, not for the people who have to live alongside it.
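To make the timestamp-in-front idea concrete, here is a rough sketch of a time-sortable ID. It assumes a 48-bit millisecond timestamp followed by 80 random bits, printed as fixed-width hex. That layout is an illustrative assumption; it is not the actual encoding of ULID, KSUID, or Snowflake, each of which has its own format.

```python
import secrets
import time
from typing import Optional

TIMESTAMP_BITS = 48
RANDOM_BITS = 80


def time_sortable_id(timestamp_ms: Optional[int] = None) -> str:
    """Return a 128-bit ID whose leading bits are a millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    timestamp_ms &= (1 << TIMESTAMP_BITS) - 1      # keep it to 48 bits
    value = (timestamp_ms << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)
    return f"{value:032x}"                         # fixed width keeps sort order


earlier = time_sortable_id(timestamp_ms=1_000)
later = time_sortable_id(timestamp_ms=2_000)
print(earlier, later, earlier < later)
```

Because the string is fixed-width and the timestamp sits in the high bits, plain string sorting puts IDs from different milliseconds in creation order. The result is still thirty-two characters of hex that tell a person nothing.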

Content addresses: the name that is the proof

There is a third machine approach, and it is the cleverest of the three, because it makes the name do double duty as identity and integrity at once.

The idea is content addressing. Instead of assigning a file a name, you compute the name from the file’s own contents by running the bytes through a hash function. The same contents always produce the same name. Different contents produce a different name. MD5 was the early workhorse; SHA-1 came next, and SHA-256 is the usual choice today.

This buys two things you cannot get any other way. The first is deduplication: if two files have identical contents, they hash to the identical name, so the system stores one copy and knows it is one copy. The second, and the more powerful, is integrity. Because the name is a fingerprint of the contents, the name is a verification. Change one comma in the file and the hash changes completely. So if a file still hashes to the name you expect, you have strong cryptographic evidence that it is byte-for-byte the thing you think it is: nobody tampered with it in transit and nothing corrupted it on disk, provided the hash function itself still resists collisions.
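Both properties fit in a few lines. This is a toy in-memory sketch using SHA-256 from Python's standard `hashlib`; the `ContentStore` class and its methods are hypothetical, and real systems such as Git add headers and structure around the raw bytes before hashing.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive the name from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """A toy store where the key is always the hash of the value."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        name = content_address(data)
        self._blobs.setdefault(name, data)   # identical bytes: stored once
        return name

    def get(self, name: str) -> bytes:
        data = self._blobs[name]
        if content_address(data) != name:    # integrity check on every read
            raise ValueError("stored bytes no longer match their name")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"hello, world")
second = store.put(b"hello, world")    # same contents, same name
changed = store.put(b"hello world")    # one comma removed, different name
print(first == second, first == changed, len(store))
```

Storing the same bytes twice yields one entry, and removing a single comma yields an unrelated name. The check in `get` is the verification step: the name is all you need to confirm the bytes.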

This is not a niche trick. It is the foundation under Git, where every commit, tree, and blob is named by its hash and chained together so the whole history is tamper-evident. It is the foundation under IPFS, where a file’s content ID is its address on the network. It is how your CDN busts caches, by fingerprinting an asset so a changed file gets a new name and the browser cannot serve a stale copy. Content addressing runs an enormous amount of the modern world without anyone noticing it.
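The cache-busting case is the easiest one to picture. A sketch, assuming a build step that inserts a short SHA-256 prefix before the file extension; the function name and the eight-character digest length are illustrative choices, not any specific bundler's convention.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, contents: bytes,
                       digest_chars: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js"""
    path = PurePosixPath(filename)
    digest = hashlib.sha256(contents).hexdigest()[:digest_chars]
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old_name = fingerprinted_name("static/app.js", b"console.log('v1');")
new_name = fingerprinted_name("static/app.js", b"console.log('v2');")
print(old_name)
print(new_name)
```

A changed file gets a new name, so a browser holding the old name in its cache simply never asks for it again. Truncating the hash keeps names short at the cost of the integrity guarantee; it is enough to tell versions apart, not to prove anything about them.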

It has two costs, and they matter for where this series is going. The first is that a content address is even less of a handle than a UUID: a hash looks like a hash and tells you nothing. The second, subtler cost is that a content address is, by design, not stable across change. The name is the contents, so the moment the contents change, the name changes, and the same logical document is suddenly sitting behind a different hash. That is what you want for verifying a fixed artifact, and the last thing you want for a living document, a thing meant to be edited, revised, and revisited for years under one address. Content addressing names snapshots beautifully. It cannot name a thing that is supposed to grow.

A footnote that matters: MD5 is broken for this purpose. Researchers can manufacture two different files with the same MD5, so a matching MD5 no longer proves a file is untampered. SHA-1 has since fallen the same way, with a practical collision demonstrated in 2017. The principle of content addressing is sound; the choice of hash is not a detail you get to be lazy about.

Object keys: the flat world the cloud lives in

One more, because it is how most files actually sit today. When you store something in cloud object storage, an S3 bucket and its many imitators, you do not get a filesystem. You get a flat namespace: one enormous pool of objects, each addressed by a single string called a key. That key is whatever you make it. It can be a UUID, a hash, a slug, or a path that looks like folders.

That last part fools almost everyone. You will see keys like clients/acme/2026/invoice.pdf and assume there is a folder structure underneath. There is not. The slashes are just characters in one long label. The storage system sees a flat sea of keys and fakes the folders for you when you ask it to list a “directory,” which is really a search for keys that share a prefix. The hierarchy you think you have is a convention you painted onto a string, and the system will happily ignore it.
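The folder illusion is easy to reproduce over a plain list of strings. The sketch below is a pure-Python imitation of a prefix-and-delimiter listing, loosely modeled on how object stores answer a "list this directory" request; it does not call any real storage API, and the sample keys beyond the one in the text are invented.

```python
from typing import Iterable, List, Tuple


def list_directory(keys: Iterable[str], prefix: str = "",
                   delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Fake a directory listing over flat keys.

    Returns (folders, files): "folders" are just the distinct prefixes
    that appear one delimiter deeper than the requested prefix.
    """
    folders = set()
    files = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            folders.add(prefix + head + delimiter)   # something deeper exists
        else:
            files.append(key)                        # a key at this "level"
    return sorted(folders), files


keys = [
    "clients/acme/2026/invoice.pdf",
    "clients/acme/2026/receipt.pdf",
    "clients/acme/notes.txt",
    "clients/globex/2026/invoice.pdf",
    "readme.txt",
]

print(list_directory(keys, prefix="clients/"))
print(list_directory(keys, prefix="clients/acme/"))
```

Nothing in `keys` is a folder. The "folders" in the result are computed at query time from shared prefixes, and a different delimiter would produce a different hierarchy from the same strings.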

Object storage is the purest expression of the machine’s worldview about names. The key is an opaque handle. Meaning, hierarchy, organization, all of that is your problem, layered on top, not something the system understands or enforces.

The web’s compromise: an address that moves

There is one more machine name you read constantly, even if you never think of it as a name: the URL in your address bar. It is worth a look, because it makes a different bet than everything above, and the bet is instructive.

A URL does not name what a thing is. It names where the thing lives: a host, a path, a position in someone’s directory layout. That is a powerful idea. It is the reason the web is one connected space instead of a pile of islands. But it ties the name to a location, and locations move. Reorganize a site, rename a folder, migrate a host, and every URL that pointed at the old spot now points at nothing. We have a tired phrase for this: link rot. The most-used naming system humanity has ever built is also the one that breaks the most often, because it named the shelf instead of the book.

The web’s own designers saw this coming and proposed a fix: the URN, a name for a thing’s identity rather than its location, so the name would survive the thing being moved. It is a good idea. It never caught on. The location-based URL was simply too useful in the moment, and naming identity-instead-of-location turned out to require the very kind of stable, meaningful address that, as the rest of this series will show, is hard to build. So the web runs on locations, and we paper over the rot with redirects, canonical tags, and 404 pages.

URLs make one tension explicit that has been lurking under everything so far: identity versus location. A UUID is pure identity with no location. A URL is pure location with no identity. And that tension carries straight into the human schemes we look at next, because a folder path, the thing most people actually use to organize their files, is a location too. It tells you where something sits, not what it is, and it breaks the moment you move it.

What machines decided

Step back and the pattern is unmistakable. Across counters, UUIDs, hashes, object keys, and URLs, machines made the same choice every time. They optimized for identity and integrity, and they won both, decisively. Uniqueness at global scale with no coordination: solved. Cryptographic proof that a file is what it claims: solved. These are extraordinary achievements that we lean on constantly without a second thought.

And in every case they paid for it with the third job. Not one of these names tells a person, or an AI pointed at a pile of files, what the thing is or where it belongs. Meaning was not lost by accident. It was deliberately externalized, pushed out of the name and into a separate index, a database, a manifest, a lookup table, because the machine experiences no cost from an opaque name. It has the address; it does not need to understand. The opacity is free to the machine and expensive only to the humans, and now the agents, standing next to it trying to make sense of the pile.

That trade was the right one for machines. It is a dead end for everyone else. And “everyone else” is a larger group than it used to be, because the reader you now most want to make sense of your files, the AI you just pointed at your own document store, reasons about meaning, not byte-level identity. Hand it a folder of UUIDs and hashes and it is exactly as lost as you are.

So if machines went all in on identity and threw meaning over the wall, the obvious question is what happens when you do the reverse. Humans did exactly that. They gave up guaranteed uniqueness and built names you can read and walk through: folders, decimal systems, tags. Those schemes recover meaning, and they break in their own quiet, predictable ways.

That is the next post.

The next post in this series: Where Did You Put It?, on how humans name, with folders, conventions, Johnny Decimal, and tags.

Aaron Lamb, Co-Founder, Hexaxia Technologies