What is Simpson’s Paradox? And why it matters.

February 14, 2019

Was the University of California, Berkeley, guilty of discrimination in its admissions? This was a cause of concern in the early 1970s. To show what lay behind the concern, we can look at the admission figures for the Fall term of 1973. They show that male applicants to the University were markedly more likely to be accepted than female applicants.

| | Applicants | Admitted |
|---|---|---|
| Men | 8442 | 44% |
| Women | 4321 | 35% |

This looks pretty damning, until the admission figures are broken down by department. Doing so reveals a paradox.

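| Dept. | Men: applicants | Men: admitted | Women: applicants | Women: admitted |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |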

In other words, in four of the six departments a higher proportion of women than men were admitted.

So what was going on? Those with statistical training soon realised that this was a simple example of Simpson’s Paradox. Simpson’s Paradox arises when frequency data from different groups are combined and the overall rate tells a different story from the rates within each group. Put another way, Simpson’s Paradox is the appearance of a trend within each of several groups which disappears, or even reverses, when the data for the groups are combined.

In the case of Berkeley, a study by Bickel, Hammel and O’Connell, published in ‘Science’ in 1975, reached the conclusion that women tended to apply to the more competitive departments with low rates of admission, such as the English Department, while men tended to apply to less competitive departments with high rates of admission, such as engineering and chemistry. As such, the University was not actively discriminating against women, at least not on the basis of the statistics used to make the charge.
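The mechanism can be checked against the six departments in the table. The sketch below is only an illustration: it pools the per-department rates, weighting each by its number of applicants, and also reports what share of each sex applied to the two departments with the highest admission rates (A and B). The function names `pooled_rate` and `share_applying_to` are my own, and because the table gives rounded percentages the results are approximate. The six departments also cover only part of the 8,442 and 4,321 applicants in the first table, so the pooled rates will not match the university-wide figures exactly.

```python
# Per-department figures from the table above: (applicants, admission rate).
# The rates are rounded percentages, so pooled results are approximate.
men = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
       "D": (417, 0.33), "E": (191, 0.28), "F": (373, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (593, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}


def pooled_rate(groups):
    """Admission rate across all departments, weighted by applicant numbers."""
    applicants = sum(n for n, _ in groups.values())
    admitted = sum(n * rate for n, rate in groups.values())
    return admitted / applicants


def share_applying_to(groups, departments):
    """Fraction of a group's applications sent to the named departments."""
    total = sum(n for n, _ in groups.values())
    return sum(groups[d][0] for d in departments) / total


print(f"Pooled rate, men:   {pooled_rate(men):.1%}")
print(f"Pooled rate, women: {pooled_rate(women):.1%}")
print(f"Men applying to A or B:   {share_applying_to(men, ('A', 'B')):.1%}")
print(f"Women applying to A or B: {share_applying_to(women, ('A', 'B')):.1%}")
```

On these figures, more than half of the men applied to departments A or B, against fewer than one in ten of the women, which is enough to pull the pooled rate for men well above the pooled rate for women.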

Ignorance of the implications of Simpson’s Paradox might also generate false conclusions in the case of medical trials.

Take the following drugs, and their success rate in medical trials over two different days.

| | Drug A | Drug B |
|---|---|---|
| Day 1 | 63/90 = 70% | 8/10 = 80% |
| Day 2 | 4/10 = 40% | 45/90 = 50% |

Overall, Drug A = 67% success rate; Drug B = 53% success rate.

But Drug B performs better on both days.
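The arithmetic is easy to verify. The following is a minimal sketch using exact fractions; the dictionary layout and the helper names `rate` and `overall` are assumptions of mine, chosen for illustration.

```python
from fractions import Fraction

# (successes, patients) for each drug on each day, from the table above.
trials = {
    "Drug A": {"Day 1": (63, 90), "Day 2": (4, 10)},
    "Drug B": {"Day 1": (8, 10), "Day 2": (45, 90)},
}


def rate(successes, patients):
    """Success rate as an exact fraction."""
    return Fraction(successes, patients)


def overall(days):
    """Pool the raw counts across days before dividing."""
    return rate(sum(s for s, _ in days.values()),
                sum(n for _, n in days.values()))


for day in ("Day 1", "Day 2"):
    a = rate(*trials["Drug A"][day])
    b = rate(*trials["Drug B"][day])
    print(f"{day}: A = {float(a):.0%}, B = {float(b):.0%}")

print(f"Overall: A = {float(overall(trials['Drug A'])):.0%}, "
      f"B = {float(overall(trials['Drug B'])):.0%}")
```

The reversal comes from the weights: 90 of Drug A's 100 patients were treated on Day 1, when success rates were high for both drugs, while 90 of Drug B's 100 were treated on Day 2, when they were low for both.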

So which is the better drug? In the medical trials, I would certainly choose to be treated with Drug A. Others might differ, but I doubt they would persuade any reasonable judge of the outcome of the trials.

Take another example. In this trial, there are two sets of patients: a control group of 240 patients who are given a placebo, such as a sugar pill, which is known to have no effect on the illness under evaluation, and a test group of 240 patients who are given the real drug. Each set of 240 patients is made up of four groups. Group A is elderly adults, Group B is middle-aged adults, Group C is young adults and Group D is children.

Here are the results, with success rate measured by the proportion recovering from the illness within two days of taking the drug:

Those taking the placebo.

Number of patients in each group:

Group A: 20; Group B: 40; Group C: 120; Group D: 60

Success rates are:

Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%

Overall success rate for those taking the placebo = (2 + 8 + 48 + 18) / 240 = 76/240 = 31.7%.

Those taking the real drug.

Number of patients in each group:

Group A: 120; Group B: 60; Group C: 20; Group D: 40

Success rates are:

Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%

Overall success rate for those taking the real drug = (18 + 18 + 12 + 18) / 240 = 66/240 = 27.5%.

This compares with an overall success rate for those taking the placebo of 31.7%.

So the placebo, over the whole sample, produced a higher success rate than the real drug.

Breaking the numbers down by group, however, reveals a discrepancy.

For the placebo

Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%

For the real drug

Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%

So, in each individual group (elderly adults, middle-aged adults, young adults, children) the success rate is greater for those taking the real drug, although in the sample as a whole it is lower.

How can we resolve the paradox?

The answer lies in the age mix of the two sets of patients, which differs sharply between those who received the real drug and those who received the placebo. In this study, half of those given the placebo (120 of 240) are young adults, compared with just 20 of those given the real drug. This is important because natural recovery rates from this illness (as defined in the test) are normally higher in this demographic than in the other groups, whether patients receive the real drug or the placebo. Conversely, the elderly (whose recovery rates are normally lower than average) are much more heavily represented among those taking the real drug (120 patients) than among those taking the placebo (20 patients).
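One way to see the effect of the age mix is to ask what each set of patients would have shown had both shared the same mix. This is usually called direct standardisation; it is my own framing rather than a method used in the example above, and the choice of the placebo group's sizes as the common reference mix is an arbitrary assumption made for illustration. The name `standardised_rate` is likewise hypothetical.

```python
# Group sizes among those taking the placebo, used here as the common
# reference mix (an arbitrary choice made for illustration).
reference_mix = {"A": 20, "B": 40, "C": 120, "D": 60}

# Recovery rates by group, as quoted in the text.
placebo_rates = {"A": 0.10, "B": 0.20, "C": 0.40, "D": 0.30}
drug_rates = {"A": 0.15, "B": 0.30, "C": 0.60, "D": 0.45}


def standardised_rate(rates, mix):
    """Overall recovery rate if the patients had the given group mix."""
    total = sum(mix.values())
    return sum(rates[group] * size for group, size in mix.items()) / total


print(f"Placebo:   {standardised_rate(placebo_rates, reference_mix):.1%}")
print(f"Real drug: {standardised_rate(drug_rates, reference_mix):.1%}")
```

With the age mix held fixed, the real drug comes out ahead overall, as it does in every individual group. The placebo line reproduces the 31.7% quoted earlier, since the reference mix is the placebo group's own.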

Take another example, this time from baseball. Over the 1995 and 1996 seasons, fans were divided between those who claimed Derek Jeter as the best performing player and those who claimed that title for David Justice. It is easy to see why. Here are their batting averages.

| | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | 12/48 (.250) | 183/582 (.314) | 195/630 (.310) |
| David Justice | 104/411 (.253) | 45/140 (.321) | 149/551 (.270) |

Here we see that Jeter has the better overall batting average but Justice records a better average in each of the two years making up that overall average. To anyone conversant with Simpson’s Paradox this is nothing weird. It is certainly possible in theory for one player to score a better batting average in successive years than another, yet record a worse batting average overall. The case of Jeter and Justice is an example where the theory clearly shows up in practice.

Indeed, move forward to 1997 and the paradox grows even stronger. In that year, Jeter averaged 0.291 (190/654), while Justice recorded a better average of 0.329 (163/495). So, in three successive years, Justice recorded a better average than Jeter. Over the whole period, though, the batting average for Derek Jeter was 0.300 (385/1284), superior to that of David Justice, on 0.298 (312/1046).
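The three-season comparison can be reproduced from the hits and at-bats quoted above. This is a small sketch, with `average` and `career` as illustrative helper names of my own.

```python
# (hits, at-bats) per season, taken from the figures quoted above.
jeter = {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)}
justice = {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)}


def average(hits, at_bats):
    """Batting average for a single season."""
    return hits / at_bats


def career(seasons):
    """Batting average over all seasons, pooling hits and at-bats."""
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(ab for _, ab in seasons.values())
    return average(hits, at_bats)


for year in sorted(jeter):
    print(year,
          f"Jeter {average(*jeter[year]):.3f}",
          f"Justice {average(*justice[year]):.3f}")
print("All ", f"Jeter {career(jeter):.3f}", f"Justice {career(justice):.3f}")
```

The weights explain the result. Justice had 411 of his 1,046 at-bats in 1995, the weakest of the three seasons for both players, whereas Jeter had only 48 of his 1,284 at-bats that year.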

For those more familiar with cricket than baseball, let’s take the following example of two mythical matches played by Harold Larwood and Bill Voce.

First Match:

Harold Larwood takes 3 wickets while bowling but concedes 60 runs off his bowling (an average of 20 runs conceded per wicket).

Bill Voce takes 2 wickets while bowling but concedes 68 runs (an average of 34 runs conceded per wicket).

Second Match:

Harold Larwood takes 1 wicket and concedes 8 runs (an average of 8 runs conceded per wicket).

Bill Voce takes 6 wickets and concedes 60 runs (an average of 10 runs conceded per wicket).

Here, Larwood has the superior performance in both matches (20 runs conceded per wicket compared to Voce’s 34 per wicket, and 8 runs conceded per wicket compared to Voce’s 10 per wicket). Over the two matches combined, however, Larwood took 4 wickets for 68 runs (17 runs per wicket) while Voce did slightly better, taking 8 wickets for 128 runs (16 runs per wicket).
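The same check works for the bowling figures, bearing in mind that here a lower number is better. The helper names `bowling_average` and `combined` below are illustrative assumptions.

```python
# (wickets, runs conceded) in each match, from the example above.
larwood = [(3, 60), (1, 8)]
voce = [(2, 68), (6, 60)]


def bowling_average(wickets, runs):
    """Runs conceded per wicket; lower is better."""
    return runs / wickets


def combined(matches):
    """Bowling average over all matches, pooling wickets and runs."""
    return bowling_average(sum(w for w, _ in matches),
                           sum(r for _, r in matches))


for number, (l, v) in enumerate(zip(larwood, voce), start=1):
    print(f"Match {number}: Larwood {bowling_average(*l):.0f}, "
          f"Voce {bowling_average(*v):.0f}")
print(f"Combined: Larwood {combined(larwood):.0f}, Voce {combined(voce):.0f}")
```

Voce took six of his eight wickets in the second match, when wickets came cheaply for both bowlers, while Larwood took three of his four in the more expensive first match.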

So who is the better baseball player? Who is the better bowler? Was the University of California, Berkeley, discriminating on the basis of gender? Which is the better drug? All of these questions are examples of Simpson’s Paradox at work.

Reference and links

P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975), Sex Bias in Graduate Admissions: Data from Berkeley, Science, 187, 398-404.

Prof. Leighton Vaughan Williams