# Shared Birthdays in Groups

## Intro

For the last several years I’ve participated in a reading challenge with a few family members. It’s a custom derivative of the POPSUGAR list and in general it’s great for encouraging reading outside of my normal choices (though I suppose I’ve been doing it long enough that maybe I have broader taste now?). This year one item is to read a book published the month of your birthday. As I was researching options, my 9 year old started trying to estimate the probability that a book was published *on* your birthday. This led to a great discussion about assumptions in these types of calculations and eventually settled on the topic of the required group size needed to have a 50% chance of at least two people sharing a birthday. He originally proposed 366 but quickly realized that would be a 100% chance. He settled on 182 or 183 (half of 365) which intuitively makes sense. When I told him it was just 23 he couldn’t believe it.

## The Math

The math is a bit beyond what he’s been learning, but I got to introduce basic combinatorics and critical thinking with probabilities. The first thing we needed to cover was how many pairs there are in a group, and then how to easily arrive at the probability of at least one pair sharing a birthday.

### Pairs in a Group

Since order doesn’t matter when selecting pairs from a group to compare birthdays, we need to know how many ways there are to choose 2 from a group of `N`

people. `N choose m`

is `N!/((N-m)!*m!)`

which is implemented in the R function `choose(N, m)`

. So, for a group of 23 people there are 253 possible pairs.

`choose(23, 2)`

`## [1] 253`

### Probabilities

Next we discussed independence in probabilities. Because the probability that a pair of people share a birthday isn’t affected by the shared state of any other pair (that is, if Dick and Jane share a birthday, that has no bearing on if Tom and Sue share a birthday), we have independence. This makes calculating the probability of `A & B`

much easier.

The hardest part was understanding the easy way to calculate the probability that *at least one* pair would share a birthday. In a simple case of 3 people, that would be `AB | AC | BC | (AB & AC) | (AB & BC) | (AC & BC) | (AB & AC & BC)`

which is a lot of work. But, the antithesis of *at least one* is *none*, or `P(at least one) = 1-P(none)`

. The probability of none is easy in this case because each pair has a `364/365`

probability of not sharing a birthday. Because these probabilities are independent, we can easily calculate the probability.

`1-(364/365)^choose(23, 2)`

`## [1] 0.5004772`

## Reality

The math, while involved, is ultimately straightfoward. This is because we’ve made a few of key assumptions. First, we’re ignoring leap years and all the people who are born on February 29, so we already know the math isn’t perfect. Second, we’re assuming that birthdays are uniformly distributed across the days of the year. There are more assumptions, but these are the big ones. So, how does it hold up? I suggested to my son that he could work with his teachers to check in each class at his school in how many instances there are shared birthdays and look at average class size to see if the probabilities we calculate are reflected acurately in reality.

I found a dump from a large dataset that included birthday (month and day only) and the individuals were organized by functional group. I realized I could use this to assess how well the probabilities held up.

### Data

I have pre-agreggated the data so the only information is the size of the group and the number of unique birthdays in the first file and the count of individuals with each birthday in the second. I’m not sure where the original data came from and I don’t want to be leaking PII.

```
library(data.table)
library(ggplot2)
library(Hmisc)
shared_birthdays = fread("https://thug-r-data-share.s3.amazonaws.com/birthday_data.txt")
birthday_counts = fread("https://thug-r-data-share.s3.amazonaws.com/birthday_data_dist.txt")
```

Here’s a sample of the data.

`head(shared_birthdays)`

```
## group_size unique_birthdays
## 1: 5 5
## 2: 6 6
## 3: 3 3
## 4: 5 5
## 5: 2 2
## 6: 7 7
```

`head(birthday_counts)`

```
## birthday count
## 1: 101 53
## 2: 102 48
## 3: 103 41
## 4: 104 46
## 5: 105 42
## 6: 106 58
```

### Visualize It!

```
viz = ggplot(shared, aes(x=group_size)) +
geom_errorbar(aes(ymin=lower, ymax=upper, color=in_bound)) +
geom_point(aes(y=observed, size=n), color='red', shape=1) +
geom_line(aes(y=expected), color='blue', lwd=1.1) +
labs(color="In Bound", x="Group Size", y="Likelihood of Shared Birthday", size="Number of Groups") +
scale_y_continuous(labels=scales::percent)
```

## Uniformity of Birthdays

First, we need to explode the counts of each birthday into each observation. Then we’ll run a Kolmogorov-Smirnov omnibus test on our observations vs a uniform distribution. This presents a few challenges - first, birthdays, even when reexpressed as values from 1/366 to 1, are discrete, not uniform, so we already know it’s not really a uniform distribution. Second, K-S doesn’t like ties.

```
psuedo_dist = unlist(lapply(1:nrow(birthday_counts), function(i){
rep(i/366, birthday_counts[["count"]][i])
}))
ks.test(psuedo_dist, "punif")
```

```
## Warning in ks.test(psuedo_dist, "punif"): ties should not be present for the
## Kolmogorov-Smirnov test
```

```
##
## One-sample Kolmogorov-Smirnov test
##
## data: psuedo_dist
## D = 0.0085357, p-value = 0.2125
## alternative hypothesis: two-sided
```

A superior choice is to run the Cramer-von Mises test which only works on discrete distributions, doesn’t mind ties, and is generally better in that it evaluates the distance between the two ECDFs holistically where K-S only looks at the maximum distance between the two.

`library(dgof)`

`## Warning: package 'dgof' was built under R version 3.5.2`

```
##
## Attaching package: 'dgof'
```

```
## The following object is masked from 'package:stats':
##
## ks.test
```

`cvm.test(psuedo_dist, ecdf((1:366)/366))`

```
##
## Cramer-von Mises - W2
##
## data: psuedo_dist
## W2 = 0.16641, p-value = 0.3432
## alternative hypothesis: Two.sided
```

Ultimately the conclusion is that there is no evidence that this group of people’s birthdays follow something other than a uniform distribution. Visually we can see that it generally does look like a swarm - February 29 being the obvious exception.

```
birthday_counts[, c("month", "day") := list(
gsub(".{2}$", "", birthday),
gsub(".*(.{2})$", "\\1", birthday)
)]
birthday_counts[, "birthday_date" := as.Date(paste0("2020-",month,"-",day))]
birthday_counts[, "expected" := sum(count)/366]
exp_viz = ggplot(birthday_counts, aes(x=birthday_date)) +
geom_line(aes(y=expected), color='red') +
geom_point(aes(y=count)) +
labs(x="Birthday", y="Count")
```

## Final Thoughts

Overall, it was fun to take a significant number of birthdays that were assigned to groups based on characteristics other than birthday and see the power a few simple assumptions gives us to make assessments about a group.