Pigeons, Birthdays, and Collisions

Somehow this relates to Feature Hashing

Author

Joost de Theije + LLM

Published

January 20, 2026

Abstract
Ever stuffed too many socks in a drawer? That’s the Pigeonhole Principle in action, hiding everywhere from birthday parties to hash tables. This article explores collision mathematics through pigeons, birthdays, and feature hashing. We’ll show why collisions are inevitable (spoiler: sooner than you think) with interactive visualizations showing that even 512-bit hashes eventually collide. Buckle up for a fun ride through the math governing how our digital world handles too much stuff in too little space.

1 Introduction

Have you ever wondered how seemingly unrelated topics such as pigeons, birthdays, and collisions possibly share a connection? At first glance, they appear to belong to entirely different worlds. Pigeons are birds, birthdays are personal milestones, and collisions are unfortunately a common sight in modern-day traffic. Yet, beneath the surface, these concepts are tied together in a quirky way: they all describe what happens when things must fit into spaces too small to contain them.

This principle isn’t just an observation; it’s a building block of mathematics. It explains why, in a room of just 23 people, there’s a 50% chance that two people share the same birthday 🎂. It reveals why cramming too many pigeons into too few holes guarantees that some will have to share a pigeonhole.

And it underpins a challenge in machine learning: how to efficiently map vast amounts of data into a manageable space without everything colliding into a mess.

2 The Pigeonhole Principle and the Birthday Paradox

2.1 The Pigeonhole Principle: When Space Runs Out

Imagine you are a pigeon keeper: you have 9 spaces for pigeons to nest in, and you have 3 birds. No problem whatsoever. There is space for all the birds.

enough space for all the birds; nobody has to share a hole

Now imagine you have 10 birds, but still only 9 holes. No matter how hard you try, at least one of those holes must end up with more than one pigeon.

not enough space for all the birds; at least one hole must have more than one bird

Welcome to the pigeonhole principle (see Wikipedia: Pigeonhole principle), one of the simplest ideas in mathematics: if we have \(n\) items and \(m\) containers, with \(n > m\), then at least one container must hold more than one item.

It’s an obvious idea, but its implications are widely applicable. Whether you’re stuffing pigeons into holes, assigning students to classrooms, or mapping data to storage bins, when you force too many things into too little space, collisions are inevitable.
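The principle is almost trivial to state in code. Here is a minimal sketch (the function name `min_max_occupancy` is my own) computing the guaranteed size of the fullest container:

```python
import math

def min_max_occupancy(n_items: int, m_containers: int) -> int:
    """Smallest possible size of the fullest container.

    By the pigeonhole principle, spreading n items over m containers
    as evenly as possible still leaves some container with ceil(n/m) items.
    """
    return math.ceil(n_items / m_containers)

print(min_max_occupancy(10, 9))  # 10 pigeons, 9 holes -> some hole holds 2
print(min_max_occupancy(27, 9))  # 27 pigeons, 9 holes -> some hole holds 3
```

With fewer items than containers (`min_max_occupancy(3, 9)` returns 1), nothing forces a collision; the guarantee only kicks in once \(n > m\).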

2.2 The Birthday Paradox: A Surprising Twist

Now, let’s trade pigeons for people and holes for birthdays. You’re at a party with 23 guests, and someone asks: “What’s the chance that two people here share the same birthday?” Most people guess the odds are low—after all, there are 365 possible birthdays, so surely 23 people wouldn’t be enough to guarantee a match, right?

The birthday paradox reveals that with only 23 people there’s actually a 50% chance two of them share a birthday. By the time you reach 70 people, the probability skyrockets to 99.9%. How is this possible?

🎂➡️📅

The key lies in shifting perspective. Most of us instinctively think about the odds of someone else sharing our birthday. That is indeed unlikely: only about a 6% chance with 23 people. But the birthday paradox isn’t about matching your birthday. It’s about any two people in the room matching each other. Suddenly, the math flips, and collisions become likely far sooner than intuition suggests (and guaranteed once the group exceeds 365 people). We are stuffing pigeons into holes again.
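A few lines of Python make the perspective shift concrete, comparing the two questions for a 23-person room (variable names are illustrative):

```python
# Two different questions about a room of n = 23 people, d = 365 days.
d, n = 365, 23

# 1) Someone shares *your* birthday: all 22 others must miss one fixed date.
p_mine = 1 - (1 - 1 / d) ** (n - 1)

# 2) *Any* two people match: track the chance that all 23 birthdays differ.
p_distinct = 1.0
for k in range(n):
    p_distinct *= (d - k) / d
p_any = 1 - p_distinct

print(f"someone matches you: {p_mine:.1%}")  # about 6%
print(f"any pair matches:    {p_any:.1%}")   # about 50.7%
```

The difference comes from the number of chances: question 1 has only 22 shots at a match, while question 2 compares all \(\binom{23}{2} = 253\) pairs.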

2.3 Connecting the Dots: Why These Ideas Matter

At first glance, the pigeonhole principle and the birthday paradox seem like unrelated curiosities. One is about cramming pigeons into holes; the other is about people sharing birthdays. But strip away the context, and you’ll find they’re two sides of the same coin:

  • Pigeonhole Principle: “If you have more items than containers, collisions are guaranteed.”
  • Birthday Paradox: “Even with far fewer items than containers, collisions become likely faster than you’d expect.”

Both principles expose the same truth: when you map a large set to a smaller space, overlaps are inevitable. The pigeonhole principle proves it’s certain with enough items, while the birthday paradox shows it happens earlier than intuition suggests.

For now, let’s zoom in on the math behind the birthday paradox to see just how quickly collisions can sneak up on you.

Click to expand the math behind the birthday paradox

Let’s zoom in on the formula for the probability that at least two people in a group of \(n\) will share a birthday, assuming \(d\) possible birthdays (e.g., \(d=365\)):


\[ P_{\text{distinct}}(n) = \prod_{k=0}^{n-1}\frac{d-k}{d} = \frac{d(d-1)\cdots(d-n+1)}{d^{n}} = \frac{d!}{(d-n)!\,d^{n}} \qquad (n \le d)\] \[ P_{\text{shared}}(n) = 1 - P_{\text{distinct}}(n)\] \[ P_{\text{shared}}(n) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2d}\right) \qquad (n \ll d) \]

Where:

  • \(n\) is the number of people,
  • \(d\) is the number of possible birthdays (365, or 366 counting leap days).
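As a quick sanity check, here is a sketch comparing the exact product formula against the exponential approximation for a few group sizes (function names are my own):

```python
import math

def p_shared_exact(n: int, d: int = 365) -> float:
    """Exact: 1 minus the product of (d - k) / d for k = 0..n-1."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (d - k) / d
    return 1 - p_distinct

def p_shared_approx(n: int, d: int = 365) -> float:
    """Approximation: 1 - exp(-n(n-1) / 2d), valid when n << d."""
    return 1 - math.exp(-n * (n - 1) / (2 * d))

for n in (10, 23, 50, 70):
    print(f"n={n:>3}  exact={p_shared_exact(n):.4f}  "
          f"approx={p_shared_approx(n):.4f}")
```

At \(n = 23\) the exact value is about 0.507 and the approximation about 0.500; the two track each other closely across the whole range we plot below.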


If we visualize the probability of a shared birthday, we get the following chart:

Code
import numpy as np
import pandas as pd
import altair as alt

days = 365
n_max = 100
n = np.arange(1, n_max + 1)

# Compute probability of at least one shared birthday
k = np.arange(n_max)
log_p_distinct = np.cumsum(np.log((days - k) / days))
p_shared = 1 - np.exp(log_p_distinct)

df = pd.DataFrame(
    {
        "Number of people": n,
        "Probability of two or more sharing a birthday": p_shared,
        "Probability (rounded)": np.round(p_shared, 3),
    }
)

# Create a selection that chooses the nearest point & selects based on x-value
nearest = alt.selection_point(
    nearest=True, on="pointerover", fields=["Number of people"], empty=False
)

# The basic line
line = (
    alt.Chart(df)
    .mark_line(interpolate="basis")
    .encode(
        x=alt.X("Number of people", title="Number of people"),
        y=alt.Y(
            "Probability of two or more sharing a birthday",
            scale=alt.Scale(domain=[0, 1]),
            title="Probability of two or more people sharing a birthday",
        ),
    )
)

when_near = alt.when(nearest)

# Draw points on the line and highlight based on selection
points = line.mark_point().encode(
    opacity=when_near.then(alt.value(1)).otherwise(alt.value(0))
)

# Draw a rule at the location of the selection
rules = (
    alt.Chart(df)
    .mark_rule(color="gray")
    .encode(
        x="Number of people:Q",
        opacity=when_near.then(alt.value(0.3)).otherwise(alt.value(0)),
        tooltip=[
            alt.Tooltip("Number of people", title="Number of people"),
            alt.Tooltip("Probability (rounded)", title="Probability", format=".2f"),
        ],
    )
    .add_params(nearest)
)


chart = (
    alt.layer(
        line,
        points,
        rules,
    )
    .properties(
        title="Chance of a birthday collision is higher than intuition suggests",
        width=500,
        height=300,
    )
    .configure_axis(grid=True, gridOpacity=0.3)
)

chart

3 Hashing Basics: When Data Runs Out of Room

3.1 From Pigeons to Data: The Birth of Hashing

We’ve seen how the pigeonhole principle and the birthday paradox reveal a fundamental truth: when you try to fit too many things into too little space, collisions are inevitable. But what does this have to do with computers and data? Enter hashing, a concept that takes the pigeonhole principle and applies it to the digital world.

At its core, hashing is a way to take a large, unwieldy piece of data—like a PDF, a photo, or even an entire book—and map it to a smaller, fixed-size value called a hash. A hash is similar to a human fingerprint: each one is (in theory) unique, and they are all about the same size. The fixed-size hash value can be thought of as a pigeonhole, and the data we want to hash is the pigeon.

3.2 How Hashing Works: The Magic Black Box

Hash functions are like black boxes. You toss something in—whether it’s a PDF, a photo, a book, or even a single word—and out pops a fixed-length string of seemingly random characters. For example:

  • Input: "hello" → Output: "4acdf636-caee-88b8-ad0e-cb17fe41d7f7"
  • Input: The entire text of "Dune: Messiah" → Output: "d52492b0-440d-88c5-af3a-cce3d3afeb00"

No matter what you feed into a hash function, the output is always the same length. It’s like a digital fingerprint: unique (in theory) and compact. But here’s the catch: the number of possible outputs isn’t infinite. Just like the pigeonhole principle, if you have more inputs than possible outputs, collisions can and will happen.
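To make the fixed-length property concrete, here is a small sketch using Python’s standard `hashlib` (a real SHA-256, rather than the illustrative outputs above): tiny and huge inputs produce digests of identical length.

```python
import hashlib

# SHA-256 always produces a 256-bit digest (64 hex characters),
# no matter how large the input is.
for text in ("hello", "hello!", "a" * 1_000_000):
    digest = hashlib.sha256(text.encode()).hexdigest()
    print(f"{len(text):>9} bytes -> {digest[:16]}... ({len(digest)} hex chars)")
```

Note also that `"hello"` and `"hello!"` differ by one character yet produce completely unrelated digests; a good hash function scatters nearby inputs across the whole output space.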

3.2.1 Why Collisions Are Inevitable

Hash functions map an infinite (or near-infinite) number of possible inputs to a finite number of outputs.

For example:

  • A hash function that outputs 64-bit hashes has \(2^{64}\) possible outputs.
  • A hash function that outputs 256-bit hashes has \(2^{256}\) possible outputs.

But no matter how large the output space, it’s still finite. And if you have enough inputs, collisions will occur—just like cramming too many pigeons into too few holes. The birthday paradox tells us that collisions happen way sooner than you’d expect, too.
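We can watch the birthday paradox strike a real hash function by folding SHA-256 digests into a small number of buckets; theory predicts the first collision after roughly \(\sqrt{\pi d / 2}\) inputs. A sketch (the helper names are my own):

```python
import hashlib
import math

def bucket(s: str, d: int) -> int:
    """Map a string to one of d buckets via SHA-256 (stable across runs)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big") % d

def first_collision(d: int, max_items: int = 100_000):
    """Count how many inputs it takes before two of them share a bucket."""
    seen = set()
    for i in range(max_items):
        b = bucket(f"item-{i}", d)
        if b in seen:
            return i + 1
        seen.add(b)
    return None

for d in (1_000, 100_000):
    n = first_collision(d)
    mean = math.sqrt(math.pi * d / 2)
    print(f"d={d:>7}: first collision after {n} inputs "
          f"(theoretical mean ≈ {mean:.0f})")
```

The exact count depends on the input strings, but for \(d = 1000\) it lands in the neighborhood of 40, nowhere near the 1001 inputs the pigeonhole principle alone would require.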

3.3 Hashing Collision Probabilities

Code
import numpy as np
import pandas as pd
import altair as alt

# --- Parameters ---
n_elements = [5, 10, 20, 50, 100, 200, 500, 1000]
n_buckets = [50, 100, 500, 1_000, 5_000, 10_000, 50_000, 100_000]

# --- Build long-form DataFrame ---
rows = []
for n in n_elements:
    for d in n_buckets:
        prob = (1 - np.exp(-n * (n - 1) / (2 * d))) * 100
        rows.append(
            {
                "Elements (n)": n,
                "Buckets (d)": d,
                "Collision %": round(prob, 1),
            }
        )
df = pd.DataFrame(rows)

# String labels for ordinal axes
df["n_label"] = df["Elements (n)"].apply(lambda x: f"{x:,}")
df["d_label"] = df["Buckets (d)"].apply(lambda x: f"{x:,}")
df["cell_text"] = df["Collision %"].apply(lambda x: f"{x:.1f}%")

# Sort orders (elements: high to low on x; buckets: low to high on y)
elements_sorted = [f"{x:,}" for x in sorted(n_elements, reverse=True)]
buckets_sorted = [f"{x:,}" for x in n_buckets]

# --- Shared encodings ---
base = alt.Chart(df).encode(
    x=alt.X("n_label:N", sort=elements_sorted, title="Number of Elements (n)"),
    y=alt.Y("d_label:N", sort=buckets_sorted, title="Output Dimension of Hash (d)"),
)

# --- Heatmap layer ---
heatmap = base.mark_rect(stroke="white", strokeWidth=1).encode(
    color=alt.Color(
        "Collision %:Q",
        scale=alt.Scale(domain=[0, 50, 100], range=["#2ecc71", "#f1c40f", "#e74c3c"]),
        legend=alt.Legend(title="Collision %"),
    ),
    tooltip=[
        alt.Tooltip("Elements (n):Q", title="Elements (n)"),
        alt.Tooltip("Buckets (d):Q", title="Buckets (d)", format=","),
        alt.Tooltip("Collision %:Q", title="Collision %", format=".1f"),
    ],
)

# --- Text label layer ---
text = base.mark_text(fontSize=10, fontWeight="bold").encode(
    text="cell_text:N",
    color=alt.condition(
        alt.datum["Collision %"] >= 50,  # white text on dark cells, black on light
        alt.value("white"),
        alt.value("black"),
    ),
)

# --- Combine ---
chart = (heatmap + text).properties(
    title="Collision Probability - Birthday Paradox: P ≈ 1 − exp(−n(n−1) / 2d)",
    width=550,
    height=350,
)
chart

In the plot above we can see the probability of a collision for a hash function with an output dimension of \(d\) as the number of inputs \(n\) increases. If the output dimension is large enough, the probability of a collision remains low even as the number of inputs grows. However, the output dimension must be large enough to accommodate the number of inputs; otherwise, hash collisions become increasingly likely.
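Inverting the approximation makes the trade-off explicit: setting \(P_{\text{shared}} = 0.5\) and solving for \(n\) gives \(n \approx \sqrt{2d\ln 2}\), the number of inputs at which a collision becomes a coin flip. A quick sketch:

```python
import math

# Solving 1 - exp(-n(n-1) / 2d) = 0.5 for n gives n ≈ sqrt(2 d ln 2).
for bits in (32, 64, 128, 256):
    d = 2 ** bits
    n_half = math.sqrt(2 * d * math.log(2))
    print(f"{bits:>3}-bit hash: ~{n_half:.2e} inputs for a 50% collision chance")
```

For a 32-bit hash the threshold is only about 77,000 inputs, while a 64-bit hash holds out to roughly 5 billion; this square-root scaling is exactly why the heatmap below needs astronomically large \(n\) before the wide hashes turn red.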

Code
import numpy as np
import pandas as pd
import altair as alt

# --- Parameters ---
n_elements = [
    1.0 * 10**4,
    6.1 * 10**8,
    2.6 * 10**18,
    7.4 * 10**32,
    4.8 * 10**37,
    8.9 * 10**56,
    1.6 * 10**76,
]
output_powers = [
    32,
    64,
    128,
    224,
    256,
    384,
    512,
]
n_output_dims = [2**p for p in output_powers]

# --- Build long-form DataFrame ---
rows = []
for n in n_elements:
    for power, d in zip(output_powers, n_output_dims):
        prob = (1 - np.exp(-n * (n - 1) / (2 * d))) * 100
        rows.append(
            {
                "Elements (n)": n,
                "Output Dimensions (d)": d,
                "Output Dimension": power,
                "Collision %": round(prob, 1),
            }
        )

df = pd.DataFrame(rows)


# String labels for ordinal axes
def sci_notation(x):
    # Format in scientific notation, e.g. 1e+06
    return f"{int(x):.0e}"


df["n_label"] = df["Elements (n)"].apply(sci_notation)
df["d_label"] = df["Output Dimension"].apply(lambda p: f"2^{p}")
df["cell_text"] = df["Collision %"].apply(lambda x: f"{x:.1f}%")

# Sort orders (output powers: low to high on y; elements: high to low on x)
output_powers_sorted = [f"2^{p}" for p in output_powers]  # low at top
elements_sorted = [
    sci_notation(x) for x in n_elements[::-1]
]  # scientific notation, high to low, left to right

# --- Shared encodings (TRANSPOSE axes) ---
base = alt.Chart(df).encode(
    x=alt.X(
        "n_label:N",
        sort=elements_sorted,
        title="Number of Elements (n)",
        axis=alt.Axis(labelAngle=-45),
    ),
    y=alt.Y(
        "d_label:N",
        sort=output_powers_sorted,
        title="Output Dimension of Hash (d)",
    ),
)

# --- Heatmap layer ---
heatmap = base.mark_rect(stroke="white", strokeWidth=1).encode(
    color=alt.Color(
        "Collision %:Q",
        scale=alt.Scale(domain=[0, 50, 100], range=["#2ecc71", "#f1c40f", "#e74c3c"]),
        legend=alt.Legend(title="Collision %"),
    ),
    tooltip=[
        alt.Tooltip("Elements (n):Q", title="Elements (n)"),
        alt.Tooltip(
            "Output Dimensions (d):Q", title="Output Dimension (d)", format=","
        ),
        alt.Tooltip("Output Dimension:Q", title="Hash Output Bits"),
        alt.Tooltip("Collision %:Q", title="Collision %", format=".1f"),
    ],
)

# --- Text label layer ---
text = base.mark_text(fontSize=10, fontWeight="bold").encode(
    text="cell_text:N",
    color=alt.condition(
        alt.datum["Collision %"] >= 50,  # white text on dark cells, black on light
        alt.value("white"),
        alt.value("black"),
    ),
)

# --- Combine ---
chart = (heatmap + text).properties(
    title="Collision Probability - P ≈ 1 − exp(−n(n−1) / 2d)",
    width=550,
    height=350,
)

chart