Can UUIDs Really Collide? Understanding UUID Collision Probability

"The probability of a UUID collision is negligible." Every developer has heard this. Few have actually run the numbers. When I first did the math during a system design review, I was surprised by both how small the probability is and how many developers misunderstand why.

The short answer: yes, UUIDs can theoretically collide. The practical answer: you're more likely to be struck by a meteor while your data center experiences a simultaneous cosmic-ray-induced bit flip. But understanding the difference between "impossible" and "astronomically unlikely" matters when you're deciding whether to add a uniqueness constraint or retry logic.

The Math: How Big Is 2^122?

UUID v4 reserves 6 bits for version and variant markers, leaving 122 bits for randomness:

2^122 = 5,316,911,983,139,663,491,615,228,241,121,378,304
      ≈ 5.3 × 10^36
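If you want to check that figure yourself, BigInt arithmetic gives the exact value, where an ordinary floating-point number would round off the trailing digits. A minimal sketch (the constant names are mine, not from any library):

```javascript
// UUID v4: 128 bits total, minus 4 version bits and 2 variant bits.
const RANDOM_BITS = 128n - 4n - 2n; // 122n

// BigInt exponentiation is exact, unlike formatting 2 ** 122 as a double.
const V4_SPACE = 2n ** RANDOM_BITS;

console.log(V4_SPACE.toString());
// 5316911983139663491615228241121378304

// The same number at three significant digits, for the "≈" line.
console.log((2 ** 122).toPrecision(3)); // 5.32e+36
```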

That number is hard to internalize. Here's a scale reference:

Comparison	Approximate Size
Atoms in the observable universe	~10^80
UUID v4 possibilities	~10^36
Stars in the observable universe	~10^24
Grains of sand on Earth	~10^20
Seconds since the Big Bang	~10^17
World population	~10^10

There are more possible UUID v4 values than there are grains of sand on every beach on Earth, by a factor of roughly 10^17. If each grain of sand on Earth generated a billion UUIDs, we'd still be nowhere near exhausting the space.

The Birthday Problem: Why Raw Numbers Are Misleading

The naive intuition is: "I have N possible values, so I need to generate roughly N/2 before a collision." That's wrong. The birthday problem shows that collisions become likely much sooner than N/2 -- but "much sooner" for UUIDs still means "never in practice."

The formula for the probability of at least one collision after generating k UUIDs from a space of n possible values is approximately:

p(k, n) ≈ 1 - e^(-k² / (2n))

Let's calculate for realistic generation volumes:

1 billion UUIDs (10^9) with n = 2^122:

p ≈ 1 - e^(-(10^9)² / (2 × 2^122))
  ≈ 1 - e^(-10^18 / (2 × 5.3×10^36))
  ≈ 1 - e^(-10^18 / 1.06×10^37)
  ≈ 1 - e^(-9.4×10^-20)
  ≈ 9.4 × 10^-20

That's about 9.4 × 10^-18 percent, or roughly 1 chance in 10^19. For perspective, you're many billions of times more likely to be killed by a vending machine this year.

1 trillion UUIDs (10^12):

p ≈ 1 - e^(-(10^12)² / (2 × 2^122))
  ≈ 9.4 × 10^-14

Still effectively zero.

Generating 1 billion UUIDs per second for 100 years:

Total generated ≈ 3.15 × 10^18
p ≈ 1 - e^(-(3.15×10^18)² / (2 × 2^122))
  ≈ 1 - e^(-9.9×10^36 / 1.06×10^37)
  ≈ 1 - e^(-0.93)
  ≈ 0.6

After 100 years of generating a billion UUIDs every second, you'd have roughly a 60% chance of having seen at least one collision. No production system operates at that scale for that duration.
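The three estimates above can be reproduced in a few lines of JavaScript. This is a sketch of the birthday approximation only, not an exact calculation, and collisionProbability and V4_SPACE are names I made up for the example. It uses Math.expm1 because the naive 1 - Math.exp(-x) rounds to 0 when x is this small.

```javascript
// Number of possible UUID v4 values. A power of two, so it is
// represented exactly as a double.
const V4_SPACE = 2 ** 122;

// Birthday approximation: p ≈ 1 - e^(-k² / (2n)).
// -Math.expm1(-x) equals 1 - e^(-x) but stays accurate when x is tiny.
function collisionProbability(k, n = V4_SPACE) {
  return -Math.expm1(-(k * k) / (2 * n));
}

const SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60;

console.log(collisionProbability(1e9));  // about 9.4e-20
console.log(collisionProbability(1e12)); // about 9.4e-14
console.log(collisionProbability(1e9 * SECONDS_PER_YEAR * 100)); // about 0.6
```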

UUID v4 vs v7: Does v7 Collide More?

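UUID v7 has at most 74 random bits (vs v4's 122): the first 48 bits are a millisecond timestamp and 6 more are version and variant markers. Implementations that spend some of the remaining bits on a counter or extra clock precision have fewer. For IDs generated within the same millisecond, the collision space is therefore at best 2^74: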

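2^74 ≈ 1.9 × 10^22

Two processes generating UUID v7 in the same millisecond each pick from ~1.9 × 10^22 possibilities, so the probability they pick the same value is roughly 1 in 2 × 10^22. Even at a million UUIDs per millisecond (which no single machine can realistically do), the collision probability within that millisecond is:

p ≈ 1 - e^(-(10^6)² / (2 × 2^74))
  ≈ 1 - e^(-10^12 / 3.8×10^22)
  ≈ 2.6 × 10^-11

Still negligible for any single millisecond. That figure is per millisecond, though, and it accumulates: sustaining a million IDs per millisecond for a full year would make a collision more likely than not, while a still-aggressive 1,000 per millisecond keeps the yearly probability below one in a million. Between different milliseconds, the timestamp prefix rules out a collision, provided the clock never repeats a value. So UUID v7 is weaker than v4 on paper, but its collision resistance is still ample for any realistic workload.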
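To see how the v7 numbers behave at other rates, here is a rough sketch using the exact value of 2^74. It assumes all 74 non-timestamp bits are random and that IDs arrive at a constant rate, neither of which is guaranteed for a real generator. The function and constant names are hypothetical.

```javascript
// Random values available within one millisecond, assuming all 74
// non-timestamp, non-marker bits are random.
const V7_SPACE_PER_MS = 2 ** 74;

// Birthday approximation for a single millisecond.
function v7CollisionPerMs(idsPerMs) {
  return -Math.expm1(-(idsPerMs * idsPerMs) / (2 * V7_SPACE_PER_MS));
}

// Chance of at least one collision across many independent milliseconds
// at a constant rate: the per-millisecond exponents simply add up.
function v7CollisionOver(idsPerMs, milliseconds) {
  const exponentPerMs = (idsPerMs * idsPerMs) / (2 * V7_SPACE_PER_MS);
  return -Math.expm1(-exponentPerMs * milliseconds);
}

const MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;

console.log(v7CollisionPerMs(1e6));             // about 2.6e-11
console.log(v7CollisionOver(1e3, MS_PER_YEAR)); // about 8e-7
```

The second call is the more useful one for capacity planning, since it gives the total over a year rather than for one millisecond.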

Why Most Reported "UUID Collisions" Aren't Collisions at All

Every "UUID collision" postmortem I've read traces back to one of these:

Weak Random Number Generators

Using Math.random() instead of crypto.randomUUID():

// Don't do this: Math.random() is not a CSPRNG and makes no uniqueness promises
function brokenUUID() {
  return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, c => {
    const r = Math.random() * 16 | 0;
    return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16);
  });
}

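Math.random() in V8 (Chrome/Node.js) uses xorshift128+, which has 128 bits of internal state and a period of 2^128 - 1, but it is not cryptographically secure: its state can be recovered from a handful of observed outputs, and the language spec leaves the algorithm and its seeding up to the implementation. Older V8 versions used a weaker generator still. It's not suitable for generating unique identifiers at scale. Our UUID generator uses crypto.getRandomValues() for all v4 generation, which draws from a cryptographically secure source.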
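For comparison, a v4 generator built on the Web Crypto API looks something like this. It's a sketch: in real code I'd just call crypto.randomUUID(), and secureUuidV4 and cryptoApi are names invented for the example. The fallback to require("node:crypto").webcrypto is only there for older Node.js versions that don't expose a global crypto object.

```javascript
// Web Crypto in browsers and recent Node.js; fallback for older Node.js.
const cryptoApi = globalThis.crypto ?? require("node:crypto").webcrypto;

function secureUuidV4() {
  // 16 bytes from the platform's cryptographically secure generator.
  const bytes = new Uint8Array(16);
  cryptoApi.getRandomValues(bytes);

  // Overwrite the 6 fixed bits: version 4 and the RFC variant (10xx).
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  bytes[8] = (bytes[8] & 0x3f) | 0x80;

  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

console.log(secureUuidV4());
console.log(cryptoApi.randomUUID()); // the built-in equivalent
```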

Fixed Seeds in Test Environments

A test suite that seeds its PRNG to get reproducible results will generate the same "random" UUIDs on every run. This isn't a UUID collision -- it's a testing artifact.
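Here's a small demonstration of that artifact. mulberry32 is a well-known tiny seedable PRNG, used here purely for illustration, and seededUuid is a hypothetical helper. Neither should be used for real ID generation.

```javascript
// mulberry32: a tiny seedable PRNG. Fine for reproducible tests,
// never for production identifiers.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build a v4-shaped string from whatever random source is passed in.
function seededUuid(random) {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = (random() * 16) | 0;
    return (c === "x" ? r : (r & 0x3) | 0x8).toString(16);
  });
}

// Two "independent" test runs that happen to share a seed.
const runA = mulberry32(42);
const runB = mulberry32(42);

console.log(seededUuid(runA) === seededUuid(runB)); // true
```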

Duplicate Data Import

Importing the same CSV twice, replaying an event stream, or restoring a backup incorrectly creates rows with duplicate UUIDs. The UUIDs didn't collide -- the same UUIDs were inserted twice. A UNIQUE constraint catches this; the UUID generation itself wasn't at fault.

Cloned Virtual Machines

UUID v1 includes a timestamp and MAC address. Cloning a VM preserves the MAC address, so two clones can generate identical v1 UUIDs. This is one reason v1 has fallen out of favor. UUID v4 and v7 don't embed a MAC address, so this particular failure mode doesn't apply to them.

For verifying that two files or datasets are truly identical (not just UUID-labeled), a checksum comparison or hash verification is the right tool.

Should You Add a Uniqueness Constraint?

Yes. Always. Even though the probability of a genuine random collision is effectively zero:

CREATE TABLE users (
    id UUID PRIMARY KEY,  -- PRIMARY KEY implies UNIQUE NOT NULL
    email TEXT NOT NULL
);

The uniqueness constraint protects against:

Accidental duplicate inserts (the most common real cause).
Buggy UUID generation code (if someone uses Math.random()).
Data import mistakes.
Manual database edits that create duplicates.

The cost of the uniqueness check is negligible -- the database already enforces it for primary keys through the key's unique index, so the check rides along with the index insert the row needs anyway. There's no scenario where skipping the uniqueness constraint is the right call.

When Do You Actually Need to Worry?

For 99.99% of applications, UUID collision probability is not a concern. But there are edge cases where the math deserves a second look:

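Extremely high-throughput systems generating billions of UUIDs daily. At 10 billion per day (about 116,000/sec), you'll generate ~3.65 × 10^12 per year, or ~3.65 × 10^14 over a century. Even then, the collision probability for UUID v4 over that century is approximately 1.3 × 10^-8, about 1 in 80 million. For UUID v7 at this rate, the 74-bit random space applies per millisecond, so the probability depends on how many IDs land in the same millisecond (about 116 here) and then accumulates across milliseconds; at a steady rate it works out to roughly 10^-6 over a century.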

Custom UUID generation algorithms. If you're not using a well-tested library and instead implementing UUID generation from scratch, the algorithm bug risk dominates the collision probability. Use crypto.randomUUID() or a vetted library.

Financial or compliance-critical systems where any collision is unacceptable. These systems should already have uniqueness constraints and error handling. In practice, some compliance frameworks also require documented analysis of collision probability -- and the math above is a reasonable starting point for that analysis.

A Practical Mental Model

Here's how I think about UUID collisions in system design:

Don't add retry logic. The expected number of UUIDs you'd need to generate before experiencing a collision exceeds the number your system will generate in its lifetime by many orders of magnitude. Retry logic adds complexity for a scenario that won't happen.
Do add uniqueness constraints. Not because UUID generation will collide, but because duplicate inserts from operational errors are a real risk.
Don't choose between UUID v4 and v7 based on collision probability. Both are effectively collision-proof for practical workloads. Choose based on database performance, as covered in the UUID v4 vs v7 comparison.
Do use cryptographically secure randomness. crypto.randomUUID() or equivalent. Not Math.random(). If you're validating UUIDs from untrusted sources, regex is okay for format checking, but a parser-based approach (like Python's uuid.UUID()) is safer.
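As a rough illustration of the difference between checking the format and actually parsing, here's a sketch. UUID_PATTERN, isUuidFormat, and parseUuid are hypothetical names, and the pattern only accepts the canonical hyphenated form with the RFC variant bits.

```javascript
// Canonical 8-4-4-4-12 form, version nibble 1-8, RFC variant (8, 9, a, b).
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Format check only: answers "does this look like a UUID?"
function isUuidFormat(value) {
  return typeof value === "string" && UUID_PATTERN.test(value);
}

// Parser-style check: returns the 16 bytes and the version, or null.
function parseUuid(value) {
  if (!isUuidFormat(value)) return null;
  const hex = value.replace(/-/g, "");
  const bytes = Uint8Array.from(hex.match(/../g), (pair) => parseInt(pair, 16));
  return { bytes, version: bytes[6] >> 4 };
}

console.log(isUuidFormat("not-a-uuid")); // false
console.log(parseUuid("f47ac10b-58cc-4372-a567-0e02b2c3d479").version); // 4
```

Working from the parsed bytes rather than the raw string means later code compares and stores one normalized representation, regardless of the casing the caller sent.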

FAQ

Has anyone ever actually observed a UUID v4 collision?

I'm not aware of any verifiably documented case of a genuine random UUID v4 collision caused by two independent calls to a CSPRNG-based generator. Every reported case I've investigated was traced to implementation bugs, duplicate data, or seeded testing environments.

What happens if two systems generate the same UUID?

If the column has a uniqueness constraint, the second INSERT fails with a constraint violation; without one, you silently end up with two rows sharing an ID. In practice, at UUID scale, you'll encounter this error because of an application bug, not because of random chance.

Is UUID v7 collision risk meaningfully higher than v4?

No. UUID v7 has 74 random bits per millisecond vs v4's 122 random bits total. For practical workloads, both provide collision probabilities indistinguishable from zero.

How many UUIDs would I need to generate to have a 50% chance of collision?

For UUID v4: approximately 2.7 × 10^18 (2.7 quintillion). For context, that's generating 1 billion UUIDs per second for about 86 years.
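That number falls out of inverting the birthday approximation: k ≈ sqrt(-2n × ln(1 - p)). A quick sketch, with idsForProbability as a made-up helper name:

```javascript
// Invert p ≈ 1 - e^(-k² / (2n)) to get k ≈ sqrt(-2n · ln(1 - p)).
// Math.log1p(-p) computes ln(1 - p) accurately even for small p.
function idsForProbability(p, randomBits) {
  const n = 2 ** randomBits;
  return Math.sqrt(-2 * n * Math.log1p(-p));
}

const idsForHalf = idsForProbability(0.5, 122);
const yearsAtBillionPerSecond = idsForHalf / 1e9 / (365.25 * 24 * 60 * 60);

console.log(idsForHalf.toExponential(2));        // 2.71e+18
console.log(yearsAtBillionPerSecond.toFixed(0)); // 86
```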

Should I use a uniqueness constraint with UUID primary keys?

Yes. PRIMARY KEY already enforces uniqueness. If you use UUIDs in a non-PK column, add a UNIQUE constraint. The constraint costs effectively nothing and catches real-world issues (duplicate inserts, import errors) that are far more common than genuine collisions.

Can I rely on UUIDs being unique without checking?

In practice, yes. The probability of collision is so low that checking individual UUIDs for duplicates before inserting is wasted compute. The database uniqueness constraint handles the rare operational error without per-insert checks.

Do Nano IDs collide more or less than UUIDs?

Similar probability. A 21-character Nano ID with the default 64-character alphabet provides ~126 bits of entropy (slightly more than UUID v4's 122 bits). Both are effectively collision-proof. The choice between them should be based on compatibility and readability, not collision risk.
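The entropy figure is just the ID length times log2 of the alphabet size. A tiny sketch (entropyBits is a hypothetical helper, and it assumes every character is drawn uniformly and independently):

```javascript
// Bits of entropy in an ID of `length` symbols from `alphabetSize` choices.
function entropyBits(length, alphabetSize) {
  return length * Math.log2(alphabetSize);
}

console.log(entropyBits(21, 64).toFixed(0)); // 126 (Nano ID defaults)
console.log(128 - 4 - 2);                    // 122 (UUID v4 random bits)
```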

If you're building a system that relies on unique identifiers and want to see the actual generation behavior, the UUID generator lets you generate batches of UUIDs v4, v7, and v1. Generate a few hundred thousand IDs and check for duplicates yourself -- you won't find any, but seeing the output reinforces the math better than any formula.
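If you'd rather run that experiment in code, a sketch like this does it locally. It assumes a runtime with the Web Crypto API (modern browsers in a secure context, or a recent Node.js), and countDuplicates is a name I made up.

```javascript
// Web Crypto in browsers and recent Node.js; fallback for older Node.js.
const cryptoApi = globalThis.crypto ?? require("node:crypto").webcrypto;

// Generate `count` v4 UUIDs and report how many were repeats.
function countDuplicates(count) {
  const seen = new Set();
  let duplicates = 0;
  for (let i = 0; i < count; i++) {
    const id = cryptoApi.randomUUID();
    if (seen.has(id)) duplicates++;
    else seen.add(id);
  }
  return duplicates;
}

console.log(countDuplicates(200_000)); // expect 0
```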