I want to share a couple of interesting examples of randomness that might surprise your intuition (if you haven’t heard of them before).
The birthday paradox
The birthday paradox describes the unintuitively high chance that two people in a group share the same birthday, even when the group is relatively small. See Wikipedia for more details. See my code on GitHub for the entirety of the code examined below.
Empirically simulate birthday collision
First, let’s create a function that empirically simulates how many people must be added to the group to trigger the first birthday collision.
This function works by creating a list (allows duplicates) and a set (does not allow duplicates) to record the randomly chosen birthday ordinal (day number of the year) for the most recently added person to the group. We can quickly determine if this new person has the same birthday as someone else by comparing the number of items in the list and set. If they are different, the group has two people with the same birthday.
The above counting strategy is an example of the pigeonhole principle in action.
So, what is the Pigeon Principle? The pigeonhole principle says that if you have n items (pigeons) and m containers (pigeonholes), if n > m, at least one container (pigeonhole) contains more than one item (pigeon).
What are some other implications of this principle? Well, if you live in a major city, you can use the pigeonhole principle to prove that at least two people in your city have the same number of hairs on their head. On average, humans have between 90,000 and 150,000 hairs on their head, depending on hair color. Assuming there is an upper biologic limit of 1 million hairs on a head (over 6 times the upper end of the average), if you live in a city of more than 1 million people, you can prove via the pigeonhole principle that at least two people have the exact same number of hairs on their heads, without examining any heads at all. In reality, many people share the same number of hairs on their head because 1) the birthday paradox and 2) hair count (likely) follows a normal distribution.
Run the simulation
Okay, so back to the birthday paradox.
Let’s run 100,000 instances of the simulation so we can get a good understanding of the distribution.
Construct a histogram
Now that we’ve ran the simulation 100,000 times (it took me 3.5s on my 2015 MBP), let’s construct a histogram to see what happened.
I’ve added a vertical line (dashed, in red) to show the average number of people in the room when the first birthday collision occurred.
Construct a CDF
I’ve added a horizontal line to show where the CDF crosses the 50% threshold.
As you can see above, it is even money that an average size classroom has two people with the same birthday. It is very rare to get 50 people in a room and not have at least one shared birthday.
Now, let’s look at a slightly flawed implementation of the Fisher-Yates shuffle via a demonstration.
Number of times a is in position 0: 33179 Number of times a is in position 1: 33583 Number of times a is in position 2: 33238 Number of times b is in position 0: 37213 Number of times b is in position 1: 29807 Number of times b is in position 2: 32980 Number of times c is in position 0: 29608 Number of times c is in position 1: 36610 Number of times c is in position 2: 33782
At first glance, our implementation seems simple and robust. We loop over the entire list, exchanging the current element we are visiting with one in a randomly chosen position. If we just look at the positional frequency of the first element in the deck of items that we are shuffling, our implementation seems fine. But, when we look at the position frequency for all the remaining items, we notice that this is actually a biased implementation. It is easy to empirically see that our implementation is flawed.
But, is there a counting strategy (similar to the pigeonhole principle) that can help us understand what is going on? If we look at our implementation for the example input, we see that we “shuffle” three times and in each of those shuffles, the current item that we are visiting may be exchanged to any of the three positions. So, our algorithm has 3^3 = 27 potential outcomes. But, if we consider the possible permutations (order matters) of three items, it is 3! = 6. Since the 27 equally likely outcomes of our algorithm cannot be evenly divided into the 6 possible permutations of 3 items, our algorithm must be biased.
- Randomness can surprise our intuition
- Randomness and cryptography are first cousins; it isn’t obvious when you make a mistake, but it is fatal
- Counting strategies can help us understand problems where our intuition fails us