Berkson’s Paradox (also known as Berkson’s fallacy or Berkson’s bias) is a statistical quirk which makes it appear that there is an association between two events or variables which are actually unrelated. Notably, it shows that two values can be negatively correlated in a sample of a population when they are in fact uncorrelated or positively correlated in that population. It arises because of a type of selection bias, which is caused by the observation of some events more than others.

Take the case of a college which admits students based on either musical excellence or sporting excellence. For the sake of argument, assume that there is no link between the two in the total relevant population (say, all students in the country). In other words, a musically talented individual is no more nor less likely to be talented at sport. Because the college admits only students who are excellent at music, or excellent at sport, or both, this creates a group or subset of the population which displays a negative association between musical and sporting excellence.

To illustrate why, let’s make the simplifying assumption that the college admits students who score 9 or 10 out of 10 (on a scale of 0 to 10) on either sporting excellence or musical excellence. In the entire population, however, the average rating of the worst musician and the best sportsman would be equal, i.e. 5 out of 10. Yet within the group of student entrants, the average rating for sporting ability of those admitted for musical ability is still 5 (the population average) compared to an average rating of 9.5 for musical ability. The effect is to imply a negative correlation between sporting and musical ability where no such correlation exists in the wider population.

This has been shown to have important implications for medical statistics. Say, for example, that a hospital conducts a study which admits patients onto the study who are suffering from either eye cataracts or diabetes. In this case, there will appear an association (albeit spurious) between cataracts and diabetes in the set of patients included in the study which does not appear in the wider population. The reason that this paradox occurs is that the probability of one event happening (cataracts, in this example) is higher in the presence of the other event (diabetes, in this example) because cases whether neither occur are excluded.

Similarly, take the idea that there is a negative association in our minds between the quality of movies based on really good books and the quality of the books. One explanation can be derived from Berkson’s Paradox. This interpretation is that we remember the instances where the book is really good or the movie is really good or both. But we forget those cases where both the book and the movie were bad. In this case we find a (spurious) negative correlation between how good the movie is and how good the book is, because the bad movies/bad books element of the population are not included in the set of movies and books under analysis.

Another example of Berkson’s paradox was proposed by Jordan Ellsberg. This is the ‘good looking people are jerks’ example and is similar to the movies/books illustration. Say that someone only associates with people who are either pleasant or good looking or both. That eliminates from the sample pool those who are neither pleasant not good looking. That leaves a sample with good looking people who are unpleasant, and pleasant people who are not good looking, but eliminates those who are neither pleasant nor good looking. So an association is noted between being attractive and being unpleasant, but this is because the unattractive people who are also unpleasant are not observed. So even if no link exists between attractiveness and unpleasantness in the population, it does in an observed world where the counter-examples who exist in the population are avoided and ignored.

To put it more formally, assume there are two independent events, X and Y. These events are not correlated when observed in nature. If one conditions on the fact that either event X or event Y occurred (call this condition Z), however, these events are now correlated. This arises because of selection bias. If we condition on Z (that X OR Y occurs), then if we know that event X did not occur, we know that event Y did occur. This conditioning on Z, what we can call the union of X and Y, leads to a correlation.

Put mathematically, if P (XIY) = P (X), then P (XIY, Z) is less than P (XIZ) where Z = X U Y.

10% of the population swim and 5% play squash weekly, but there is no correlation between swimming and playing squash in the general population. So someone who plays squash is as likely to swim as any other member of the population and vice-versa.

Of the 200 members of a local health club, 30% swim and 20% play squash.

Based on the health club statistics, is there any evidence of a correlation between those who do not swim and those who play squash?

To answer this, we use the assumption that someone who plays squash is as likely to swim as any other member of the population, i.e. swimming and squash playing can be treated as independent events. So, the percentage of health club members who play squash who also swim would be 10% x 5% = 0.5% of 200 members, i.e. 1 member.

A randomly chosen health club member, however, has a 30% chance of swimming and a 20% chance of playing squash. So, 60 out of 200 members will swim and 40 play squash.

Now, what is the chance that a member who is not a swimmer plays squash?

Of the 60 members who swim, we have calculated above that only 1 also plays squash, i.e. of the 200 members in total, 60 swim and one swims and plays squash.

So, of the remaining 140 patients who do not swim, 39 play squash, i.e. 40 members in total play squash minus one who swims and plays squash. Thus, 39 members who do not swim play squash.

So 39 of the 140 health club members who do not swim play squash, i.e. 39/140 (27.9%). This is higher than the 20% in the population who play squash.

Even though the two events (swimming and squash) are independent, therefore, the health club statistics make it appear that swimming reduces the likelihood of playing squash, i.e. there is a negative correlation between swimming and playing squash. The reason is that we excluding from consideration those members of the general population who neither swim nor pay squash, and only considering those who either swim or play squash or both.