Bayes’ Theorem: The Most Powerful Equation in the World.
Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.
How should we change our beliefs about the world when we encounter new data or information? This is one of the most important questions we can ask. A theorem bearing the name of Thomas Bayes, an eighteenth-century clergyman, is central to the way we should answer this question.
The original presentation of the Reverend Thomas Bayes’ work, ‘An Essay toward Solving a Problem in the Doctrine of Chances’, was given in 1763, after Bayes’ death, to the Royal Society, by Bayes’ friend and confidant, Richard Price.
In explaining Bayes’ work, Price proposed as a thought experiment the example of a person who enters the world and sees the sun rise for the first time. As this person has had no opportunity to observe the sunrise before (perhaps he has spent his life to that point entombed in a dark cave), he is not able to decide whether this is a typical or unusual occurrence. It might even be a unique event. Every day that he sees the same thing happen, however, the degree of confidence he assigns to this being a permanent aspect of nature increases. His estimate of the probability that the sun will rise again tomorrow as it did yesterday and the day before, and so on, gradually approaches, although never quite reaches, 100 per cent.
The Bayesian viewpoint is just like that: the idea that we learn about the world and everything in it through a process of gradually updating our beliefs, edging incrementally ever closer to the truth as we obtain more data, more information, more evidence.
As such, the perspective of Reverend Bayes on cause and effect is essentially different to that of philosopher David Hume, the logic of whose argument on this issue is contained in ‘An Enquiry Concerning Human Understanding,’ published in 1748. According to Hume, we cannot justify our assumptions about the future based on past experience unless there is a law that the future will always resemble the past. No such law exists. Therefore, we have no fundamentally rational support for believing in causation. For Hume, therefore, predicting that the sun will rise again after seeing it rise a hundred times in a row is no more rational than predicting that it will not. Bayes instead sees reason as a practical matter, in which we can apply the laws of probability to the issue of cause and effect.
To Bayes, therefore, rationality is a matter of probability, by which we update our predictions based on new evidence, thereby edging closer and closer to the truth. This is called Bayesian reasoning. According to this approach, probability can be seen as a bridge between ignorance and knowledge. The particularly wonderful thing about the world of Bayesian reasoning is that the mathematics of it is so simple. Bayes’ Theorem is in this way concerned with conditional probability. It tells us the probability, or updates the probability, that a theory or hypothesis is true given that some event has taken place, that some new evidence has been observed. The problem with intuition is that people are not naturally probability thinkers, but instead are cause-and-effect thinkers. We have to be trained to think in a Bayesian way about the world.
Essentially, Bayes’ Theorem is just an algebraic expression with three known variables and one unknown. It is true by construction. Yet this simple formula is the foundation stone of that bridge I referred to between ignorance and knowledge, which can lead to important predictive insights. As noted, it allows us to update the probability that a theory or hypothesis is true when some new evidence comes to light, based on the probability we attach to the theory or hypothesis being true before the new evidence is known.
There are three things a Bayesian needs to estimate.
- A Bayesian’s first task is to assign a starting point probability to a hypothesis being true, before some new evidence arises. This is known as the ‘prior’ probability. Let’s assign the letter ‘a’ to this.
- A Bayesian’s second task is to estimate the probability that the new evidence would have arisen if the hypothesis was true. Let’s assign the letter ‘b’ to this.
- A Bayesian’s third task is to estimate the probability that the new evidence would have arisen if the hypothesis was false. Let’s assign the letter ‘c’ to this.
Based on these three probability estimates, Bayes’ Theorem offers a way to calculate the revised probability of the hypothesis being true given the new evidence. The notable point about it is that the equation is true as a matter of logic. The result it produces will therefore be as accurate as the values inputted into the equation. The formula is also so straightforward it can be jotted on the back of your hand.
The formula for Bayes’ Theorem can be represented as:
Updated (posterior) probability given new evidence = ab / [ab + c(1-a)]
Essentially, then, Bayesian updating is a straightforward solution to the problem of how to combine pre-existing (prior) beliefs with observed new evidence. The solution is essentially to combine the probabilities together. To do this properly, we use Bayes’ Theorem. It is of particular use when we have a conditional probability of two events, and we are interested in the reversed conditional probability. For example, when we have P (A given B) and want to find P (B given A).
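The updating rule above can be sketched in a few lines of Python (the function name `bayes_update` is my own, purely for illustration):

```python
def bayes_update(a, b, c):
    """Posterior probability that a hypothesis is true, given new evidence.

    a: prior probability the hypothesis is true
    b: probability of the evidence arising if the hypothesis is true
    c: probability of the evidence arising if the hypothesis is false
    """
    return (a * b) / (a * b + c * (1 - a))

# A 50% prior, with evidence twice as likely under the hypothesis as not,
# updates to a posterior of 2/3.
print(bayes_update(0.5, 0.8, 0.4))
```

Notice that when b and c are similar (the evidence is about as likely either way), the posterior stays close to the prior; the evidence only moves our belief sharply when b and c differ sharply.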
The key contributions of Bayesian analysis to our understanding of the world are threefold.
- Bayes’ Theorem makes clear the importance not just of new evidence but also the (prior) probability that the hypothesis was true before the new evidence was observed. This prior probability is often given too little weight compared to the new evidence in common intuition about probability. Bayes’ Theorem makes the prior probability explicit and shows how much weight to attach to it.
- Bayes’ Theorem allows us a way to calculate the updated probability based on the prior probability that the hypothesis is true and the probability of the new evidence arising given that the hypothesis is true and also given that the hypothesis is false.
- Bayes’ Theorem shows that the probability that a hypothesis is true given the evidence is not equal to the probability of the evidence arising given that the hypothesis is true. Put another way, P (H given E) does not equal P (E given H).
Often the conclusions it generates are highly counter-intuitive, but that’s because the world is in many ways a counterintuitive place. Accepting that fact is the first step towards mastering life’s logical maze.
In summary, intuition lets us down because our in-built judgment of the weight we should attach to new evidence tends to be skewed, not least against pre-existing evidence. New evidence also tends to colour our perception of the pre-existing evidence. Moreover, we tend to see evidence that is consistent with something being true as evidence that it is actually true. Bayes’ Theorem is the map that helps guide us through this maze.
Appendix
Bayes’ Theorem consists of three variables.
a is the prior probability of the hypothesis being true (the probability we attach before new evidence arises). In traditional notation, this is represented as P(H).
b is the probability that the new evidence would arise if the hypothesis is true. In traditional notation, this is represented as P(E|H). We use the notation P(A|B) to represent the probability of A given B, i.e. the probability of A given that B has occurred.
c is the probability that the new evidence would arise if the hypothesis is not true. In traditional notation, this is represented as P(E|H’). H’ is the notation for H not being true.
(1-a) is the prior probability that the hypothesis is not true. In traditional notation, this is represented as P(H’). It is derived from 1 - P(H), i.e. 1 minus the probability that the hypothesis is true.
Using this notation, the probability that a hypothesis is true given some new evidence (‘Posterior Probability’) = ab / [ab + c(1-a)].
Bayes’ Theorem can be derived from the equation P(H|E) P(E) = P(H) P(E|H), by dividing both sides by P(E). The intuition underlying this is that both sides of the equation are equal to the combined probability of the evidence arising and the hypothesis being true, P(H and E). They are two ways of looking at the same thing.
In particular, P(H|E) P(E) is the probability of a hypothesis being true given the evidence times the probability of the evidence. This is logically equivalent to P(H) P(E|H), which is the probability of a hypothesis being true times the probability of the evidence given that the hypothesis is true.
So, P(H|E) P(E) = P(H) P(E|H)
Dividing the left and right sides of the equation by P(E),
P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
Since P(E) = P(E|H) P(H) + P(E|H’) P(H’), this can be expanded to:
P(H|E) = P(H) P(E|H) / [P(H) P(E|H) + P(E|H’) P(H’)] … Bayes’ Theorem
This is equivalent to the formula:
Posterior probability = ab / [ab + c(1-a)], where a = P(H); b = P(E|H); c = P(E|H’)
Technical Proof
We write the conditional probability of A given B as P (A∣B) and define it as the probability that A has occurred, given that B has occurred.
The probability that A and B have both occurred is the conditional probability of A given B multiplied by the probability that B has occurred.
P(A∩B) = P (A∣B) P(B)
Hence:
P (A∣B) = P(A∩B) / P(B)
Similarly,
P(A∩B) = P (B∣A) P(A)
Hence:
P (B∣A) = P (A∩B) / P(A)
So:
P (A∣B) P (B) = P (A∩B) = P (B∣A) P(A), which is sometimes called the product rule for probabilities.
Dividing both sides by P(A) (which we take to be non-zero), the result follows:
P (B∣A) = P (A∣B) P(B) / P(A)
Where A represents the evidence, and B represents the hypothesis being true, this becomes:
P (H∣E) = P (E∣H) P(H) / P(E) … Bayes’ Formula
Now, P(E) = P(E|H) P(H) + P(E|H’) P(H’), where P(H’) represents the probability that the hypothesis is not true, i.e. P(H’) = 1 - P(H)
In traditional notation, the Prosecutor’s Fallacy is the fallacy of treating the probability of a hypothesis being true given the evidence, P(H|E), as the same thing as P(E|H), the probability of the evidence arising given that the hypothesis is true. In fact, P(H|E) = P(H) P(E|H) / P(E).
Examples
Is the probability that a selected card is the Ace of Spades (the hypothesis) given the evidence (it is a black card) equal to the probability it is a black card given that it is the Ace of Spades?
In this example, P(H|E) = 1/26 (the probability the hypothesis is true given the evidence), since there is one Ace of Spades among the 26 black cards.
However, the probability of observing the evidence (a black card) given that the hypothesis is true (it is the Ace of Spades) is P(E|H) = 1, since it is certain to be a black card if it is the Ace of Spades.
So, P(H|E) = 1/26 is not equal to P(E|H) = 1.
There follow some examples to illustrate that P(H|E) P(E) does indeed equal P(H) P(E|H).
Example 1: Take a deck of 52 cards, 26 red and 26 black, including one Ace of Spades. We are testing the hypothesis that a chosen card is the Ace of Spades. The probability that the drawn card is the Ace of Spades (the hypothesis is true) given that the card is black (the evidence) = 1/26 (there are 26 black cards, one of which is the Ace of Spades).
So P(H|E) = 1/26
The proportion of black cards in the deck = 1/2. So P(E) = 1/2
So, P(H|E) P(E) = 1/26 × 1/2 = 1/52.
Now P(E|H) is the probability that the card is black given that it is the Ace of Spades. This is certain, as the Ace of Spades is a black card.
So P(E|H) = 1.
P(H) is the probability the card is the Ace of Spades before we know what colour it is. There are 52 cards in the deck, so P(H) = 1/52.
So P(H) P(E|H) = 1/52 × 1 = 1/52
So P(H|E) P(E) = P(H) P(E|H) – both equal 1/52 in this case.
Therefore, P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
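The arithmetic of Example 1 can be checked with a short Python sketch (variable names are my own):

```python
# Ace of Spades example: P(H|E) * P(E) should equal P(H) * P(E|H)
p_h_given_e = 1 / 26   # one Ace of Spades among 26 black cards
p_e = 1 / 2            # half the deck is black
p_h = 1 / 52           # one Ace of Spades in 52 cards
p_e_given_h = 1        # the Ace of Spades is certainly black

# Both sides of the product rule give the joint probability, 1/52
assert abs(p_h_given_e * p_e - p_h * p_e_given_h) < 1e-12

# Bayes' Theorem recovers P(H|E) from the other three quantities
print(p_h * p_e_given_h / p_e)  # 1/26
```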
Example 2: There are in this example just four cards in our deck: the Ace of Spades, Ace of Clubs, Ace of Diamonds and Ace of Hearts. We are testing the hypothesis that the selected card is the Ace of Spades. The prior probability of the Ace of Spades (AS) = 1/4, as it is one of the four cards in the deck. What is the posterior probability that it is the Ace of Spades given the evidence that the card is black?
P(H) = 1/4
P(E|H) = 1
P(E) = 1/2
So, P(H|E) = P(H) P(E|H) / P(E) = (1/4 × 1) / (1/2) = 1/2
Note that: P(H|E) P(E) = P(H) P(E|H)
P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
Example 3: Two dice are thrown. The hypothesis is that two sixes will be thrown. The new evidence is that a six is thrown on the first die.
P(H) = 1/36
P(E|H) = 1 (for a double six, a six must be thrown on the first die).
P(E) = 1/6 (there is a 1 in 6 chance of throwing a six on the first die)
P(H|E) = posterior probability = P(E|H) P(H) / P(E) = (1 × 1/36) / (1/6) = 1/6 (there is a 1 in 6 chance of a double six if the first die lands on a six).
Note: P(H) P(E|H) = P(E) P(H|E) = 1/36
Note also: P(E) = P(H) P(E|H) + P(H’) P(E|H’) = 1/36 × 1 + 35/36 × 5/35 = 1/36 + 5/36 = 1/6
Similarly, Posterior probability = ab / [ab + c(1-a)] = 1/6
Note: c = P(E|H’) = 5/35 because if the dice do not land 6,6, so that the hypothesis is not true (H’), 35 equally likely outcomes remain (from 1,1 to 6,5), and a six on the first die occurs in 5 of them, i.e. 6,1; 6,2; 6,3; 6,4; 6,5.
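Because two dice have only 36 equally likely outcomes, the probabilities in Example 3 can be verified by brute-force enumeration. A quick Python check:

```python
from itertools import product

# All 36 equally likely outcomes of two dice
outcomes = list(product(range(1, 7), repeat=2))

hypothesis = [o for o in outcomes if o == (6, 6)]  # a double six
evidence = [o for o in outcomes if o[0] == 6]      # a six on the first die

p_h = len(hypothesis) / 36                 # 1/36
p_e = len(evidence) / 36                   # 6/36 = 1/6
# Every double six shows a six on the first die, so conditioning on the
# evidence leaves 6 outcomes, of which 1 is the double six
p_h_given_e = len(hypothesis) / len(evidence)  # 1/6

print(p_h, p_e, p_h_given_e)
```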
As for the likelihood that the sun will rise again, there is a way of estimating this, which was proposed by Pierre-Simon Laplace. What is known as Laplace’s Law gives us a rule-of-thumb way of calculating how likely it is that something that has happened before will happen again, whether it be the sun rising, your favourite team winning, or the bus arriving on time. Simply count the number of times it has happened in the past plus one (successes, S+1), and divide that by the number of opportunities there has been for it to happen plus two (trials, T+2). For a person emerging from a dark cave into the world for the first time, and watching the sun rise seven times, for example, the estimate that it will rise again is: (S+1)/(T+2) = (7+1)/(7+2) = 8/9 = 88.9%. Every time it rises again makes it even more likely that the pattern will be repeated, so that by the end of a year, the estimated probability goes up to (365+1)/(365+2) = 99.7%. And so on. The 1 and 2 in the Laplace equation, (S+1)/ (T+2), essentially represent the Bayesian ‘prior.’ The 1 and 2 can be replaced by any numbers in the same proportion, such as 5 and 10 or 10 and 20, depending on the weight we wish to assign to the prior probabilities (probabilities assigned before encountering new evidence).
Larger numbers (e.g. S+10, T+20) bias the estimate towards the assigned prior probability. So, (S+10)/ (T+20) after seven days updates to a probability of (7+10)/ (7+20) = 17/27 = 63.0%, compared to 88.9% for (S+1)/(T+2). Smaller numbers bias the estimate, therefore, towards the observed record. Another way of looking at this is that larger numbers indicate we are more confident in our baseline estimates and need more evidence to change our prior beliefs. Smaller numbers indicate that we are less sure about our beliefs and are more open to quickly updating our beliefs based on new evidence. In other words, learning takes place more quickly with smaller numbers in the Laplace equation.
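Laplace’s rule and the effect of the prior weights can be sketched in a few lines (the function name and default arguments are my own illustration):

```python
def laplace(successes, trials, prior_s=1, prior_t=2):
    """Rule-of-succession estimate: (S + prior_s) / (T + prior_t)."""
    return (successes + prior_s) / (trials + prior_t)

print(round(laplace(7, 7) * 100, 1))          # seven sunrises: 88.9
print(round(laplace(365, 365) * 100, 1))      # a year of sunrises: 99.7
print(round(laplace(7, 7, 10, 20) * 100, 1))  # heavier prior weights: 63.0
```

With the heavier prior (10 and 20), seven sunrises move the estimate only to 63.0%; with the default (1 and 2), the same evidence already gives 88.9%, illustrating how smaller prior weights let the observed record dominate sooner.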
Exercise
Question a. Write the Bayesian equation (using a, b and c) for deriving the posterior (updated) probability of a hypothesis being true after you encounter new evidence. Explain what a, b and c represent.
Question b. If P(H) is the probability that a hypothesis is true before the observation of new evidence (E), what is the updated (or posterior) probability of the hypothesis being true after the observation of the new evidence? Use the terms P(H), P(E|H), P(H|E), P(H’), P(E|H’) to construct the Bayesian equation using each of these terms. Note that P(E|H) is the probability of encountering the evidence given that the hypothesis is true. P(H’) is the probability that the hypothesis is not true. P(H|E) is the probability the hypothesis is true after encountering the evidence.
Question c. How do these terms relate to a, b and c in the Bayesian formula you have studied?
Question d. Is the probability that a hypothesis is true, given the evidence, P(H|E), equal to the probability of encountering the evidence, given that the hypothesis is true, P(E|H)? In other words, does P(H|E) = P(E|H)?
Question e. You are presented with two dice. One is fair, one is biased. The fair die (A) lands on all numbers (1 to 6) with equal probability. The biased die (B) lands on 6 with a 50% chance and each of the other numbers (1 to 5) with an equal 10% chance each.
Now, choose one die. You can’t tell by inspection whether it is the fair or the biased die. You now roll the die, and it lands on 6. What is the probability that the die you rolled is the biased die?
Question f. You are presented with two coins. One is fair, the other is weighted. The fair coin (Coin 1) lands on heads and tails with equal likelihood, the weighted coin (Coin 2) lands on tails with a 75% chance.
Now, choose one coin. You can’t tell by inspection whether it is the fair or the weighted coin. You select a coin and toss it, and it lands on tails. What is the probability that you tossed Coin 2 (the weighted coin)?
Some Reading and Links
Puga, J., Krzywinski, M. and Altman, N. (2015). Points of Significance: Bayes’ Theorem. Nature Methods, 12, 4, April, 277-278. https://www.nature.com/articles/nmeth.3335.pdf?origin=ppub
Hooper, M. (2013). Richard Price, Bayes’ Theorem and God. Significance, February, 36-39. https://www.york.ac.uk/depts/maths/histstat/price.pdf
Maths in a minute: The prosecutor’s fallacy. + plus magazine. https://plus.maths.org/content/maths-minute-prosecutor-s-fallacy
Lee, M. and King, B. (2017). Bayes’ Theorem: the maths tool we probably use every day. But what is it? The Conversation. April 23. https://theconversation.com/bayes-theorem-the-maths-tool-we-probably-use-every-day-but-what-is-it-76140
Ellerton, P. (2014). Why facts alone don’t change minds in our public debates. The Conversation. May 13. https://theconversation.com/why-facts-alone-dont-change-minds-in-our-big-public-debates-25094
Bayes’ Theorem. A Take Five Primer. An Iterative Quantification of Probability (2016). Corsair’s Publishing, March 24. http://comprehension360.corsairs.network/bayes-theorem-a-take-five-primer-fc7f7ade7abe
Bayes’ Theorem. Wikipedia. https://en.m.wikipedia.org/wiki/Bayes%27_theorem
Was the University of California, Berkeley, guilty of discrimination in their entry standards? This was a cause of concern in the early 1970s. To show what was behind the concern, we can highlight the admission figures for the Fall term of 1973. This shows that male applicants to the University were significantly more likely to be accepted than females.
        Applicants   Admitted
Men     8442         44%
Women   4321         35%
It looks pretty damning, until the admittance figures are broken down by department. Doing so revealed a paradox.
Dept.   Men: Applicants   Men: Admitted   Women: Applicants   Women: Admitted
A       825               62%             108                 82%
B       560               63%             25                  68%
C       325               37%             593                 34%
D       417               33%             375                 35%
E       191               28%             393                 24%
F       373               6%              341                 7%
In other words, a higher proportion of women were admitted to four of the six departments than men.
So what was going on? Those with statistical training soon realised that this was a simple example of Simpson’s Paradox. Simpson’s Paradox arises when different groups of frequency data are combined, revealing a different performance rate overall than is the case when examining a breakdown of the performance rate. Put another way, Simpson’s paradox is the appearance of trends within different groups which disappear when data for the groups are combined together.
In the case of Berkeley, a study published in 1975 by Bickel, Hammel and O’Connell, in ‘Science’ reached the conclusion that women tended to apply to the more competitive departments with low rates of admission, such as the English Department, while men tended to apply to less competitive departments with high rates of admission, such as engineering and chemistry. As such the University was not actively discriminating against women, at least not on the basis of the statistics used to make the charge.
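Pooling the six departments tabulated above reproduces the reversal. A quick Python sketch (treating the published percentages as exact rates):

```python
# (applicants, admission rate) per department, from the Berkeley table above
men = [(825, 0.62), (560, 0.63), (325, 0.37), (417, 0.33), (191, 0.28), (373, 0.06)]
women = [(108, 0.82), (25, 0.68), (593, 0.34), (375, 0.35), (393, 0.24), (341, 0.07)]

def pooled_rate(groups):
    """Overall admission rate: total admitted divided by total applicants."""
    admitted = sum(apps * rate for apps, rate in groups)
    applicants = sum(apps for apps, rate in groups)
    return admitted / applicants

men_rate = pooled_rate(men)      # roughly 0.445
women_rate = pooled_rate(women)  # roughly 0.303
print(round(men_rate, 3), round(women_rate, 3))
```

Even though women have the higher admission rate in four of the six departments, the pooled figures favour men, because women applied disproportionately to the departments with low admission rates.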
Ignorance of the implications of Simpson’s Paradox might also generate false conclusions in the case of medical trials.
Take the following drugs, and their success rate in medical trials over two different days.
        Drug A          Drug B
Day 1   63/90 = 70%     8/10 = 80%
Day 2   4/10 = 40%      45/90 = 50%
Overall, Drug A = 67% success rate; Drug B = 53% success rate.
But Drug B performs better on both days.
So which is the better drug? In the medical trials, I would certainly choose to be treated by Drug A. Others might differ, but I doubt they would persuade any reasonable judge of the outcome of the trials.
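The reversal can be confirmed directly from the daily figures. A short Python check (variable names are my own):

```python
# Daily trial results as (successes, patients)
drug_a = {"Day 1": (63, 90), "Day 2": (4, 10)}
drug_b = {"Day 1": (8, 10), "Day 2": (45, 90)}

def rate(successes, patients):
    return successes / patients

# Drug B wins on each individual day...
for day in ("Day 1", "Day 2"):
    assert rate(*drug_b[day]) > rate(*drug_a[day])

# ...yet Drug A wins once the days are combined
total_a = rate(sum(s for s, n in drug_a.values()), sum(n for s, n in drug_a.values()))
total_b = rate(sum(s for s, n in drug_b.values()), sum(n for s, n in drug_b.values()))
print(total_a, total_b)  # Drug A: 0.67, Drug B: 0.53
```

The driver is the unequal day sizes: Drug A’s strong 70% result came from 90 patients, while Drug B’s strong 80% result came from only 10.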
Take another example. In this trial, there are two groups: a control group of 240 patients who are supplied with a placebo drug, such as a sugar pill, which is known to have no effect on the illness under evaluation, and a test group of 240 patients who are supplied with the real drug. Each set of 240 patients is made up of four groups: Group A (elderly adults), Group B (middle-aged adults), Group C (young adults) and Group D (children).
Here are the results, with success rate measured by the proportion recovering from the illness within two days of taking the drug:
Those taking the placebo.
Group A: 20; Group B: 40; Group C: 120; Group D: 60
Success rates are:
Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%
Overall success rate for those taking the placebo = (2 + 8 + 48 + 18)/240 = 76/240 = 31.7% (each term is the group size multiplied by its success rate).
Those taking the real drug.
Group A: 120; Group B: 60; Group C: 20; Group D: 40
Success rates are:
Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%
Overall success rate for those taking the real drug = (18 + 18 + 12 + 18)/240 = 66/240 = 27.5%.
This compares with an overall success rate for those taking the placebo of 31.7%.
So the placebo, over the whole sample, produced a higher success rate than the real drug.
Breaking the numbers down by group, however, reveals a discrepancy.
For the placebo
Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%
For the real drug
Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%
So, in each individual group (elderly adults, middle-aged adults, young adults, children) the success rate is greater for those taking the real drug, although across the sample as a whole it is less.
How can we resolve the paradox?
The answer lies in the size and age distribution of each group, which differs between those who received the real drug and those who received the placebo. In this study, the group which received the placebo consists of a whole lot more young adults, for example, than the other groups, in contrast with the number taking the real drug. This is important because the natural recovery rates from this illness (as defined in the test) are normally higher in this demographic than the other groups, whether they receive the real drug or the placebo. Again, the elderly (whose recovery rates are normally lower than average) are much more heavily represented among those taking the real drug than the placebo.
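This resolution can be checked numerically. A Python sketch using the group sizes and success rates from the trial above (I take Group D of the drug arm as 40 patients, so that the arm totals 240 and the stated overall rate of 27.5% is reproduced):

```python
# Group sizes and success rates for each arm of the trial
sizes_placebo = {"A": 20, "B": 40, "C": 120, "D": 60}
rates_placebo = {"A": 0.10, "B": 0.20, "C": 0.40, "D": 0.30}
sizes_drug = {"A": 120, "B": 60, "C": 20, "D": 40}
rates_drug = {"A": 0.15, "B": 0.30, "C": 0.60, "D": 0.45}

def overall(sizes, rates):
    """Overall success rate: recoveries in every group over total patients."""
    total = sum(sizes.values())
    successes = sum(sizes[g] * rates[g] for g in sizes)
    return successes / total

# The real drug beats the placebo in every single group...
for g in "ABCD":
    assert rates_drug[g] > rates_placebo[g]

# ...but the placebo arm has the higher overall success rate
print(round(overall(sizes_placebo, rates_placebo), 3))  # 0.317
print(round(overall(sizes_drug, rates_drug), 3))        # 0.275
```

The placebo arm is weighted towards young adults (high natural recovery), and the drug arm towards the elderly (low natural recovery), which is exactly what flips the aggregate comparison.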
Take another example, from baseball. Across the 1995 and 1996 seasons, fans were divided between those who claimed Derek Jeter as the best-performing player and those who claimed that title for David Justice. It is easy to see why. Here are their batting averages.
                1995            1996            Combined
Derek Jeter     12/48 (.250)    183/582 (.314)  195/630 (.310)
David Justice   104/411 (.253)  45/140 (.321)   149/551 (.270)
Here we see that Jeter has the better overall batting average but Justice records a better average in each of the two years making up that overall average. To anyone conversant with Simpson’s Paradox this is nothing weird. It is certainly possible in theory for one player to score a better batting average in successive years than another, yet record a worse batting average overall. The case of Jeter and Justice is an example where the theory clearly shows up in practice.
Indeed, fast forward to 1997 and the paradox grows even stronger. In that year, Jeter averaged .291 (190/654), while Justice again scored a better average, .329 (163/495). So, in three successive years, Justice recorded a better average than Jeter. Over the whole period, though, the batting average for Derek Jeter was .300 (385/1284), superior to David Justice, on .298 (312/1046).
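The three-season reversal can be confirmed from the raw hits and at-bats. A short Python check:

```python
# (hits, at-bats) for 1995, 1996, 1997
jeter = [(12, 48), (183, 582), (190, 654)]
justice = [(104, 411), (45, 140), (163, 495)]

def avg(hits, at_bats):
    return hits / at_bats

# Justice has the better average in each individual season...
for (jh, ja), (dh, da) in zip(jeter, justice):
    assert avg(dh, da) > avg(jh, ja)

# ...but Jeter has the better average over the three seasons combined
j_total = avg(sum(h for h, a in jeter), sum(a for h, a in jeter))
d_total = avg(sum(h for h, a in justice), sum(a for h, a in justice))
print(round(j_total, 3), round(d_total, 3))
```

Jeter’s weakest season (.250 in 1995) came in only 48 at-bats, while Justice’s came in 411, so it drags Justice’s combined average down far more heavily.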
For those more familiar with cricket than baseball, let’s take the following example of two mythical matches played by Harold Larwood and Bill Voce.
First Match:
Harold Larwood takes 3 wickets while bowling but concedes 60 runs off his bowling (an average of 20 runs conceded per wicket).
Bill Voce takes 2 wickets while bowling but concedes 68 runs (an average of 34 runs conceded per wicket).
Second Match:
Harold Larwood takes 1 wicket and concedes 8 runs (an average of 8 runs conceded per wicket).
Bill Voce takes 6 wickets and concedes 60 runs (an average of 10 runs conceded per wicket).
Here, Larwood has the superior performance in both matches (20 runs conceded per wicket compared to Voce’s 34 per wicket, and 8 runs conceded per wicket compared to Voce’s 10 per wicket). Over the two matches combined, however, Larwood took 4 wickets for 68 runs (17 runs per wicket) while Voce did slightly better, taking 8 wickets for 128 runs (16 runs per wicket).
So who is the better baseball player? Who is the better bowler? Were the University of California, Berkeley, discriminating on the basis of gender? Which is the better drug? All of these questions are examples of Simpson’s Paradox.
Reference and links
P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975), Sex Bias in Graduate Admissions: Data from Berkeley, Science, 187, 398-404.
In William Shakespeare’s ‘Merchant of Venice’, potential suitors of young Portia are offered a choice of three caskets, one gold, one silver and one lead. Inside one of them is a miniature portrait of her. Portia knows it is in the lead casket.
Now, according to her father’s will, a suitor must choose the casket containing the portrait to win Portia’s hand in marriage. The first suitor, the Prince of Morocco, must choose from one of the three caskets. Each is engraved with a cryptic inscription. The gold casket reads, “Who chooseth me shall gain what many men desire.” The silver casket reads, “Who chooseth me shall get as much as he deserves.” The lead casket reads, “Who chooseth me must give and hazard all he hath”. He chooses the gold casket, hoping to find “an angel in a golden bed.” Instead, he finds a skull and a scroll inserted into the skull’s “empty eye.” The message he reads on the scroll says, “All that glisters is not gold.” The Prince beats a hasty exit. “A gentle riddance”, says Portia.

The next suitor is the Prince of Arragon. “Who chooseth me shall get as much as he deserves”, he reads on the silver casket. “I’ll assume I deserve the very best”, he declares, and opens the casket. Inside he finds a picture of a fool with a sharp dismissive note which says “With one fool’s head I came to woo, But I go away with two.”
Now let us think about a plot twist where Portia must open one of the other caskets and give Arragon a chance to switch his choice of casket if he wishes. She is not allowed to indicate where the portrait is, and in this case must open the gold casket (she knows the portrait is in the lead casket, so she can’t open that) and show it is not in there. She now asks the Prince whether he wants to stick with his original choice of the silver casket or switch to the lead casket.
Let us imagine that he believes that Portia has no better idea than he has of which casket contains the prize. In that case, should he switch from his original choice of the silver casket to the lead casket? Well, if Portia had no knowledge of the location of the portrait, she might have inadvertently opened the casket containing it, so her opening an empty casket does add new information: the silver and lead caskets are now equally likely, at 1/2 each, and switching confers no advantage. But if he knows that she is aware of the location of the portrait, her decision to open the gold casket and not the lead casket has doubled the chance that the lead casket contains the portrait compared to his original choice, other things equal. This is because there was just a one-third chance that his original choice (silver) was correct and a two-thirds chance that one of the other caskets (gold, lead) was correct. She is forced to eliminate the losing casket of the two (in this case, gold), so the two-thirds chance converges on the lead casket.
So should he switch to the lead casket or stay with the silver? It depends whether things actually are equal. In particular, it depends on how valuable any information contained in the inscriptions is. If he has little faith in the inscriptions to arbitrate, he should definitely switch and improve his chance of winning fair Portia’s hand from 1/3 to 2/3. If he thinks, however, that he has unlocked the secret from the inscriptions, the decision is more difficult. If so, he might stick with his choice in good conscience.
In summary, the key to the problem is the new information Portia introduced by opening a casket which she knew did not contain the portrait. By acting on this new information, the Prince can potentially improve his chance of correctly predicting which casket will reveal the portrait from 1 in 3 to 2 in 3 – by switching caskets when given the chance. Unless, that is, he has other information which makes the initial probabilities different to 1/3 for each casket, such as those cryptic inscriptions. If this information is potentially valuable, or at least if the Prince thinks so, that complicates matters!
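The stick-versus-switch probabilities are easy to verify by simulation, assuming Portia knows where the portrait is and always opens a losing casket. A quick Python sketch (the function is my own illustration):

```python
import random

def casket_game(switch, trials=100_000):
    """Simulate the plot twist: Portia opens a losing casket; suitor may switch."""
    wins = 0
    for _ in range(trials):
        caskets = ["gold", "silver", "lead"]
        prize = random.choice(caskets)
        choice = random.choice(caskets)
        # Portia opens a casket that is neither the suitor's choice nor the prize
        opened = random.choice([c for c in caskets if c != choice and c != prize])
        if switch:
            # Switch to the one remaining unopened casket
            choice = next(c for c in caskets if c != choice and c != opened)
        wins += (choice == prize)
    return wins / trials

print(casket_game(switch=False))  # close to 1/3
print(casket_game(switch=True))   # close to 2/3
```

This is, of course, the Monty Hall problem in Shakespearean dress: sticking wins about a third of the time, switching about two-thirds.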
Let us invent a little crime story in which you are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police investigation into a case of vandalism of a shop window in a street adjoining the one where you know she lives. It took place at noon that day, which you know is her day off work. You had heard about the incident earlier but had no good reason at the time to believe that your friend was in any way linked to it.
She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity. You must evaluate the probability that she did commit the offence before deciding how to advise her. So the condition is that she has been charged with criminal damage; the hypothesis you are interested in evaluating is the probability that she did it. Bayes’ Theorem, of course, helps to answer this type of question.
There are three things to estimate. The first is the Bayesian prior probability (which we represent as ‘a’). This is the probability you assign to the hypothesis being true before you become aware of the new information. In this case, it means the probability you would assign to your friend breaking the shop window immediately before you got the new information from her on the telephone that she had been charged on the basis of the witness evidence.
The second is the probability that the new evidence would have arisen if the hypothesis was true (which we represent as ‘b’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend actually did break the window.
The third is to estimate the probability that the new evidence would have arisen if the hypothesis was false (which we represent as ‘c’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend did NOT break the window.
According to Bayes’ Theorem, Posterior probability = ab/ [ab+c(1-a)]
So let’s apply Bayes’ Theorem to the case of the shattered shop window. Let’s start with a. Well, you have known her for years, and it is totally out of character, although she does live just a stone’s throw from the shop, and it is her day off work, so she could in principle have done it. Let’s say 5% (0.05). Assigning the prior probability is fraught with problems, however, as awareness of the new information might easily affect the way you assess the prior information. You need to make every effort to estimate this probability as it would have been before you received the new information. You also have to be precise as to the point in the chain of evidence at which you establish the prior probability.
What about b? This is the probability of the new evidence if the hypothesis was true. What is the hypothesis? That your friend broke the window. What is the new evidence? That the police officer has identified your friend as the person who smashed the window. So b is an estimate of the probability that the police officer would have identified your friend if she was indeed guilty. If she threw the brick, it’s easy to imagine how she came to be identified by the police officer. Still, he wasn’t close enough to catch the culprit at the time, which should be borne in mind. Let’s say that the probability he would have identified her given that she is guilty is 80% (0.8).
Let’s move on to c. This is the probability of the new evidence if the hypothesis was false. What is the hypothesis again? That your friend broke the window. What is the new evidence again? That the police officer has identified your friend as the person who did it. So c is an estimate of the probability that the police officer would have identified her if she was not the guilty party, i.e. a false identification. If your friend didn’t shatter the window, how likely is the police officer to have wrongly identified her when he saw her in the street later that day? It is possible that he would see someone of similar age and appearance, wearing similar clothes, and jump to the wrong conclusion, or he may just want to identify someone to advance his career. Let us estimate the probability as 15% (0.15).
Once we’ve assigned these values, Bayes’ theorem can now be applied to establish a posterior probability. This is the number that we’re interested in. It is the measure of how likely is it that your friend broke the window, given that she’s been identified as the culprit by the police officer and charged on the basis of this evidence.
Given these estimates, we can use Bayes’ Theorem to update our probability that our friend is guilty to 21.9%, despite assigning a reliability of 80% to the police officer’s identification.
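The arithmetic can be checked directly in a few lines of Python; this is just a minimal transcription of the formula above, with the example's estimates plugged in:

```python
def posterior(a, b, c):
    """Bayes' Theorem in the a, b, c form used here.

    a -- prior probability that the hypothesis is true
    b -- probability of the evidence if the hypothesis is true
    c -- probability of the evidence if the hypothesis is false
    """
    return (a * b) / (a * b + c * (1 - a))

# The shop-window example: 5% prior, 80% chance of a correct
# identification, 15% chance of a false one.
print(round(posterior(0.05, 0.8, 0.15), 3))  # 0.219
```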
The most interesting takeaway from this application of Bayes’ Theorem is the relatively low probability you should assign to the guilt of your friend even though you were 80% sure that the police officer would identify her if she was guilty, and the small 15% chance you assigned that he would falsely identify her. The clue to the intuitive discrepancy is in the prior probability (or ‘prior’) you would have attached to the guilt of your friend before you were met face to face with the charge based on the evidence of the police officer. If a new piece of evidence now emerges (say a second witness), you should again apply Bayes’ Theorem to update to a new posterior probability, gradually converging, based on more and more pieces of evidence, ever nearer to the truth.
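Chaining evidence works the same way: the posterior after the first witness becomes the prior for the second. In the sketch below, the second witness's reliability figures (70% if guilty, 10% if innocent) are purely illustrative assumptions, not taken from the story:

```python
def posterior(a, b, c):
    # a: prior; b: P(evidence | guilty); c: P(evidence | not guilty)
    return (a * b) / (a * b + c * (1 - a))

after_officer = posterior(0.05, 0.8, 0.15)         # ~0.219
# Hypothetical second witness: assumed 70% likely to identify her
# if guilty, 10% likely to do so if innocent.
after_second = posterior(after_officer, 0.7, 0.1)  # ~0.663
print(round(after_officer, 3), round(after_second, 3))
```

Each new, reasonably reliable identification pushes the probability of guilt sharply upwards from its low starting point, which is exactly the gradual convergence described above.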
It is, of course, all too easy to dismiss the implications of this hypothetical case on the grounds that it was just too difficult to assign reasonable probabilities to the variables. But that is what we do implicitly when we don’t assign numbers. Bayes’ Theorem is not at fault for this in any case. It will always correctly update the probability of a hypothesis being true whenever new evidence is identified, based on the estimated probabilities. In some cases, such as the crime case illustrated here, that is not easy, though the approach you adopt to revising your estimate will always be better than using intuition to steer a path to the truth.
In many other cases, we do know with precision what the key probabilities are, and in those cases we can use Bayes’ Theorem to identify with precision the revised probability based on the new evidence, often with startlingly counter-intuitive results. In seeking to steer the path from ignorance to knowledge, the application of Bayes is always the correct method.
Appendix
The calculation uses the simple algebraic expression that we have identified in this setting:
ab/[ab+c(1-a)]
a is the prior probability of the hypothesis (she’s guilty) being true. This is more traditionally represented by the notation P(H). In the example, a = 0.05.
b is the probability the police officer identifies her conditional on the hypothesis being true, i.e. she’s guilty. This is more traditionally represented by the notation P(E|H), i.e. the probability of E (the evidence) given that the hypothesis, H, is true. In the example, b = 0.8.
c is the probability the police officer identifies her conditional on the hypothesis not being true, i.e. she’s not guilty. This is more traditionally represented by the notation P(E|H’), i.e. the probability of E (the evidence) given that the hypothesis is false (H’). In the example, c = 0.15.
In our example, a = 0.05, b = 0.8, c = 0.15
Using Bayes’ Theorem, the updated (posterior) probability that the friend is guilty is:
ab/[ab + c(1 - a)] = 0.04/(0.04 + 0.1425) = 0.04/0.1825
Posterior probability = 0.219 = 21.9%
An entomologist spots what might be a rare category of beetle, due to the pattern on its back. In the rare category, 98% have the pattern. In the common category, only 5% have the pattern. The rare category accounts for only 0.1% of the population. How likely is the beetle to be rare?
Since only 5 per cent of the common beetles bear the distinctive pattern and 98 per cent of the rare beetles do, intuition would tell you that you have come across a rare insect when you espy the pattern. Bayes’ Theorem tells you something quite different.
To calculate just how likely the beetle is to be rare given that we see the pattern on its back, we apply Bayes’ Theorem.
Posterior probability = ab/ [ab+c(1-a)]
a is the prior probability of the hypothesis (beetle is rare) being true. b is the probability we observe the pattern given that the beetle is rare (hypothesis true). c is the probability we observe the pattern given that the beetle is not rare (hypothesis false).
In this case, a = 0.001 (0.1%); b = 0.98 (98%); c = 0.05 (5%).
So, updated probability = ab/ [ab+c(1-a)] = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.
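The same three-number form of Bayes' Theorem, transcribed into Python, reproduces the beetle result:

```python
def posterior(a, b, c):
    # a: prior P(rare); b: P(pattern | rare); c: P(pattern | common)
    return (a * b) / (a * b + c * (1 - a))

print(round(posterior(0.001, 0.98, 0.05), 4))  # 0.0192
```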
Why the counterintuitive result? Because so few of the population of all beetles are rare. Specifically, the prior probability that the beetle is rare is very small, and it would take a lot more evidence than that acquired to make a reasonable case for the beetle being rare.
So what is the probability that the beetle is rare given that we observe the distinctive pattern? In other words, what is the probability that the hypothesis (the beetle is rare) is true given the evidence (the pattern). That is 1.92 per cent. What is the probability that we will observe the distinctive pattern if the beetle is rare? In other words, what is the probability of observing the evidence (the pattern) if the hypothesis (the beetle is rare) is true. That is 98 per cent.
To conflate these, to believe these two concepts are the same, is to commit the classic Prosecutor’s Fallacy, i.e. to falsely equate the probability that the defendant is guilty given the observed evidence with the probability of observing the evidence given that the defendant is guilty. It’s a potentially very dangerous fallacy to commit, especially when you happen to be the defendant and the jury has never heard of the Reverend Thomas Bayes!
Appendix
We can also solve the Beetle problem using the traditional notation version of Bayes’ Theorem.
P(H|E) = P(E|H) · P(H) / [P(E|H) · P(H) + P(E|H’) · P(H’)]
In this case, P(H) = 0.001 (0.1%); P(E|H) = 0.98 (98%); P(E|H’) = 0.05 (5%).
So, P(H|E) = 0.98 × 0.001 / [0.98 × 0.001 + 0.05 × 0.999] = 0.00098 / (0.00098 + 0.04995) = 0.00098 / 0.05093 = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.
Note also that P(H|E) = 0.0192, while P(E|H) = 0.98. The Prosecutor’s Fallacy is to conflate these two expressions.
December 21st, 2018 is the shortest day of the year, at least in the UK, located in the Northern hemisphere of our planet.
So does that mean that the mornings should start to get lighter after today (earlier sunrise), as well as the evenings (later sunset)? Not so, and there’s a simple reason for that. The length of a solar day, i.e. the period of time between the solar noon (the time when the sun is at its highest elevation in the sky) on one day and the next, is not 24 hours in December, but about 30 seconds longer than that.
For this reason, each solar day through December runs about 30 seconds over 24 hours, and the effect accumulates: by the end of the month, solar noon falls roughly 15 minutes later by a standard 24-hour clock than it did at the start.
Let’s say just for a moment that the hours of daylight (the time difference between sunrise and sunset) stayed constant through December. Since each solar day runs 30 seconds over 24 hours, a sunset timed at 3.50pm one day would fall at 30 seconds past 3.50pm the next day, and about ten days later the sun would not set until 3.55pm by the clock. So the sunset would actually get later through all of December and, for the same reason, so would the sunrise.
In fact, the sunset doesn’t get progressively later through all of December, because the hours of daylight shorten for about the first three weeks. Taken on its own, that shortening would make the sun set earlier and rise later each day.
These two things (the shortening hours of daylight and the extended solar day) work in opposite directions. The overall effect is that the sun starts to set later from a week or so before the shortest day, but doesn’t start to rise earlier till about a week or so after the shortest day.
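The interplay of the two effects can be illustrated with a toy model. All the numbers below are invented for illustration (a fixed 30-second daily drift in solar noon, and a parabolic daylight curve with its minimum on day 21); they are not real ephemeris data:

```python
# Toy model of December: solar noon drifts 30 s later each day,
# while daylight shrinks to a minimum at the solstice (day 21).
def solar_noon(day):
    return 12 * 60 + 0.5 * day             # minutes past midnight

def daylight(day):
    return 7 * 60 + 0.1 * (day - 21) ** 2  # minutes of daylight

def sunrise(day):
    return solar_noon(day) - daylight(day) / 2

def sunset(day):
    return solar_noon(day) + daylight(day) / 2

earliest_sunset = min(range(1, 32), key=sunset)
latest_sunrise = max(range(1, 32), key=sunrise)
print(earliest_sunset, latest_sunrise)  # 16 26
```

Even in this crude model, the earliest sunset falls several days before the solstice and the latest sunrise several days after it, which is exactly the asymmetry described above.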
So the old adage that the evenings will start to draw out, and the mornings get lighter, only after the end of the third week of December or so, is false. The evenings have already been drawing out for several days before the shortest day, and the mornings will continue to grow darker for several days after it.
There’s one other curious thing. The solar noon coincides with noon on our 24-hour clocks just four times a year. One of those days is Christmas Day! So set your clock to noon on December 25th, look up to the sky and you will see the sun at its highest point. Just perfect!
Links
http://www.timeanddate.com/astronomy/uk/nottingham
http://www.bbc.co.uk/news/magazine-30549149
http://www.rmg.co.uk/explore/astronomy-and-time/time-facts/the-equation-of-time
http://en.wikipedia.org/wiki/Solar_time
http://earthsky.org/earth/everything-you-need-to-know-december-solstice
The results of the US midterm elections are now largely in and they came as a shock to many seasoned forecasters.
This wasn’t the kind of shock that occurred in 2016, when the EU referendum tipped to Brexit and the US presidential election to Donald Trump. Nor the type that followed the 2015 and 2017 UK general elections, which produced a widely unexpected Conservative majority and a hung parliament respectively.
On those occasions, the polls, pundits and prediction markets got it, for the most part, very wrong, and confidence in political forecasting took a major hit. The shock on this occasion was of a different sort – surprise related to just how right most of the forecasts were.
Take the FiveThirtyEight political forecasting methodology, most closely associated with Nate Silver, famed for the success of his 2008 and 2012 US presidential election forecasts.
In 2016, even that trusted methodology failed to predict Trump’s narrow triumph in some of the key swing states. This was reflected widely across other forecasting methodologies, too, causing a crisis of confidence in political forecasting. And things only got worse when much academic modelling of the 2017 UK general election was even further off target than it had been in 2015.
How did it go so right?
So what happened in the 2018 US midterm elections? This time, the FiveThirtyEight “Lite” forecast, based solely on local and national polls weighted by past performance, predicted that the Democrats would pick up a net 38 seats in the House of Representatives. The “Classic” forecast, which also includes fundraising, past voting and historical trends, predicted that they would pick up a net 39 seats. They needed 23 to take control.
With almost all results now declared, those forecasts look pretty near spot on: the projected tally is a net gain of 40 seats by the Democrats. In the Senate, meanwhile, the Republicans were forecast to hold the Senate by 52 seats to 48. The final count is likely to be 53-47. There is also an argument that the small error in the Senate forecast can be accounted for by poor ballot design in Florida, which disadvantaged the Democrat in a very close race.
Some analysts currently advocate looking at the turnout of “early voters”, broken down by party affiliation, who cast their ballot before polling day. They argue this can be used as an alternative or supplementary forecasting methodology. This year, a prominent advocate of this methodology went with the Republican Senate candidate in Arizona, while FiveThirtyEight chose the Democrat. The Democrat won. Despite this, the jury is still out over whether “early vote” analysis can add any value.
There has also been research into the forecasting efficiency of betting/prediction markets compared to polls. This tends to show that the markets have the edge over polls in key respects, although they can themselves be influenced by and overreact to new poll results.
There are a number of theories to explain what went wrong with much of the forecasting prior to the Trump and Brexit votes. But looking at the bigger picture, which stretches back to the US presidential election of 1868 (in which Republican Ulysses S Grant defeated Democrat Horatio Seymour), forecasts based on markets (with one notable exception, in 1948) have proved remarkably accurate, as have other forecasting methodologies. To this extent, the accurate forecasting of the 2018 midterms is a return to the norm.
And the next president is …
But what do the results mean for politics in the US more generally? The bottom line is that there was a considerable swing to the Democrats across most of the country, especially among women and in the suburbs, such that the Republican advantage of almost 1% in the House popular vote in 2016 was turned into a Democrat advantage of about 8% this time. If reproduced in a presidential election, it would be enough to provide a handsome victory for the candidate of the Democratic Party.
The size of this swing, and the demographics underpinning it, were identified with a good deal of accuracy by the main forecasting methodologies. This success has clearly restored some confidence in them, and they will now be used to look forward to 2020. Useful current forecasts for the 2020 election include PredictIt, OddsChecker, Betfair and PredictWise.
Taken together, they indicate that the Democratic candidate for the presidency will most likely come from a field including Senators Kamala Harris (the overall favourite), Bernie Sanders, Elizabeth Warren, Amy Klobuchar, Kirsten Gillibrand and Cory Booker. Outside the Senate, the frontrunners are former vice-president, Joe Biden, and the recent (unsuccessful) candidate for the Texas Senate, Beto O’Rourke.
Whoever prevails is most likely to face sitting president, Donald Trump, who is close to even money to face impeachment during his current term of office. If Trump isn’t the Republican nominee, the vice-president, Mike Pence, and former UN ambassador Nikki Haley are attracting the most support in the markets. The Democrats are currently about 57% to 43% favourites over the Republicans to win the presidency.
With the midterms over, our faith in political forecasting, at least in the US, has been somewhat restored. The focus now turns to 2020 – and whether they’ll accurately predict the next leader of the free world, or be left floundering by the unpredictable forces of a new world politics.
Is it possible to be both alive and dead at the same time? This is the question central to the famous Schrödinger’s Cat thought experiment. In the version posed by Erwin Schrödinger, a cat is placed in an opaque box for an hour with a small piece of radioactive material which has an equal probability of decaying or not in that time period. If some radioactivity is detected by a Geiger counter also placed in the box, a relay releases a hammer which breaks a flask of hydrocyanic acid, killing the cat. If no radioactivity is detected, the cat lives. Before we open the box at the end of the hour, we estimate the chance that the radioactive material will decay and the cat will be dead at 50/50, the same as that it will be alive. Before we open the box, however, is the cat alive (and we don’t know it yet), dead (and we don’t know it yet), or both alive and dead (until we open the box and find out)?
Common sense would seem to indicate that it is either alive or dead, but we don’t know until we open the box. Traditional quantum theory suggests otherwise. The cat is both alive, with a certain probability, and dead, with a certain probability, until we open the box and find out, when it has to become one or the other with a probability of 100 per cent. In quantum terminology, the cat is in a superposition (two states at the same time) of being alive and dead, which only collapses into one state (dead or alive) when the cat is observed. This might seem absurd when applied to a cat. After all, surely it was either alive or dead before we opened the box and found out. It was simply that we didn’t know which. That may be true, when applied to cats. But when applied to the microscopic quantum world, such common sense goes out the window as a description of reality. For example, photons (the smallest measure of light) can exist simultaneously in both wave and particle states, and travel in both clockwise and anti-clockwise directions at the same time. Each state exists in the same moment. As soon as the photon is observed, however, it must settle on one unique state. In other words, the common sense that we can apply to cats we cannot apply to photons or other particles at the quantum level.
So what is going on? The traditional explanation as to why the same quantum particle can exist in different states simultaneously is known as the Copenhagen Interpretation. First proposed by Niels Bohr in the early twentieth century, the Copenhagen interpretation states that a quantum particle does not exist in any one state but in all possible states at the same time, with various probabilities. It is only when we observe it that it must in effect choose which of these states it exists as. At the sub-atomic level, then, particles seem to exist in a state of what is called ‘coherent superposition’, in which they can be two things at the same time, and only become one when they are forced to do so by the act of being observed. The total of all possible states is known as the ‘wave function.’ When the quantum particle is observed, the superposition ‘collapses’ and the object is forced into one of the states that make up its wave function.
The problem with this explanation concerns the fate of all these different states. By observing the object, we might reduce it to one of these states, but what has happened to the others? Where have they disappeared to?
This question lies at the heart of the so-called ‘Quantum Suicide’ thought experiment.
It goes like this. A man (not a cat) sits down in front of a gun which is linked to a machine that measures the spin of a quantum particle (a quark). If it is measured as spinning clockwise, the gun will fire and kill the man. If it is measured as spinning anti-clockwise, it will not fire and the man will survive to undergo the same experiment again.
The question is – will the man survive, and how long will he survive for? This thought experiment, proposed by Max Tegmark, has been answered in different ways by quantum theorists depending on whether or not they adhere to the Copenhagen Interpretation. In that interpretation, the gun will go off with a certain probability, depending on which way the quark is spinning. Eventually, by the laws of chance, the man will be killed, probably sooner rather than later. A growing number of theorists believe something else, however. They see both states (the particle is spinning clockwise and spinning anti-clockwise) as equally real, so there are two real outcomes. In one world, the man dies and in the other he lives. The experiment repeats, and the same split occurs. In one world there will exist a man who survives an indefinite number of rounds. In the other worlds, he is dead.
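Under the Copenhagen reading, each round is an independent 50/50 measurement, so the experimenter's survival chance shrinks geometrically. A two-line sketch makes the arithmetic explicit:

```python
# Chance of surviving n rounds of the thought experiment, assuming
# an even 50/50 spin measurement on each round.
def survival_probability(n):
    return 0.5 ** n

for n in (1, 5, 10, 20):
    print(n, survival_probability(n))
# After 20 rounds the survival chance is under one in a million.
```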
The difference between these alternative approaches is critical. The Copenhagen approach is to propose that the simultaneously existing states (for example, the quark that is spinning both clockwise and anti-clockwise simultaneously) exist in one world, and collapse into one of these states when observed. Meanwhile, the other states mysteriously disappear. The other approach is to posit that these simultaneously existing states are real states, and neither magically disappears, but branch off into different realities when observed. What is happening is that in one world, the particle is observed spinning clockwise (in the Quantum Suicide thought experiment, the man dies) and in the other world the particle is observed spinning the other way (and the man lives). Crucially, according to this interpretation both worlds are real. In other words, they are not notional states of one world but alternative realities. This is the so-called ‘Many Worlds Theory.’
Where is the burden of proof in trying to determine which interpretation of reality is correct? This depends on whether we take the one world that we can observe as the default position or the wave function of all possible states as represented in the mathematics of the wave function as the reality. Adherents to the Many Worlds position argue that the default is to go with what is described in the mathematics underpinning quantum theory – that the wave function represents all of reality. According to this argument, the minimal mathematical structure needed to make sense of quantum mechanics is the existence of many worlds which branch off, each of which contains an alternative reality. Moreover, these worlds are real. To say that our world, the one that we are observing, is the only real one, despite all the other possible worlds or measurement outcomes, has been likened to when we believed that the Earth was at the centre of the universe. There is no real justification, according to this interpretation, for saying that our branch of all possible states is the only real one, and that all other branches are non-existent or are ‘disappeared worlds.’ Put another way, the mathematics of quantum mechanics describes these different worlds. Nothing in the maths says that this world that we observe is more real than another world. So the burden of proof is on those who say it is. The viewpoint of the Copenhagen school is diametrically opposite. They argue that the hard evidence is of the world we are in, and the burden of proof is on those positing other worlds containing other branches of reality.
Which default position we choose to adopt will determine whether we are adherents of the Copenhagen or the ‘Many Worlds’ school.
For me personally, the logic of the argument points to the Many Worlds school. But to believe that they are right, and the Copenhagen school is wrong, seems kind of crazy, and totally counter-intuitive. In another world, of course, I’m probably saying the exact opposite.
Do we live in a simulation, created by an advanced civilisation, in which we are part of some sophisticated virtual reality experience? For this to be a possibility we can make the obvious assumption that sufficiently advanced civilisations will possess the requisite computing and programming power to create what philosopher Nick Bostrom termed such ‘ancestor simulations’. These simulations would be complex enough for the minds that are simulated to be conscious and able to experience the type of experiences that we do. The creators of these simulations could exist at any stage in the development of the universe, even billions of years into the future.
The argument around simulation goes like this. One of the following three statements must be correct.
a. That civilisations at our level of development always or almost always disappear before becoming technologically advanced enough to create these simulations.
b. That the proportion of these technologically advanced civilisations that wish to create these simulations is zero or almost zero.
c. That we are almost sure to be living in such a simulation.
To see this, let’s examine each proposition in turn.
a. Suppose that the first is not true. In that case, a significant proportion of civilisations at our stage of technology go on to become technologically advanced enough to create these simulations.
b. Suppose that the second is not true. In this case, a significant proportion of these civilisations run such simulations.
c. If both of the above propositions are not true, then there will be countless simulated minds indistinguishable to all intents and purposes from ours, as there is potentially no limit to the number of simulations these civilisations could create. The number of such simulated minds would almost certainly be overwhelmingly greater than the number of minds that created them. Consequently, we would be quite safe in assuming that we are almost certainly inside a simulation created by some form of advanced civilisation.
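The "overwhelmingly greater" step is easy to make concrete with invented numbers. Suppose, purely for illustration, that 1 in 1,000 civilisations reaches maturity and that each mature one runs a million ancestor simulations, each containing as many minds as the original population:

```python
f = 0.001      # assumed fraction of civilisations reaching maturity
N = 1_000_000  # assumed ancestor simulations per mature civilisation

# Per unsimulated population there are, on average, f * N simulated
# populations, so the share of minds that are simulated is:
share_simulated = (f * N) / (f * N + 1)
print(round(share_simulated, 4))  # 0.999
```

Even with a tiny survival rate, the simulated minds dominate as soon as f × N is large, which is the crux of the argument.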
For the first proposition to be untrue, civilisations must pass through the phase in which they could wipe themselves out, whether deliberately or by accident, carelessness or neglect, and never or almost never do so. This might perhaps seem unlikely based on our experience of this world, but becomes more likely if we consider all other possible worlds.
For the second proposition to be untrue, we would have to assume that virtually all civilisations that were able to create these simulations would decide not to do so. This again is possible, but would seem unlikely.
If we consider both propositions, and we think it is unlikely that no civilisations survive long enough to achieve what Bostrom calls ‘technological maturity’, and that it is unlikely that hardly any would create ‘ancestor simulations’ if they could, then anyone considering the question is left with a stark conclusion. They really are living in a simulation.
To summarise. An advanced ‘technologically mature’ civilisation would have the capability of creating simulated minds. Based on this, at least one of three propositions must be true.
a. The proportion of civilisations that survive to reach this technologically advanced state is zero or close to zero.
b. The proportion of these advanced civilisations that wish to run these simulations is close to zero.
c. The proportion of those consciously considering the question who are living in a simulation is close to one.
If the first of these propositions is true, we will almost certainly not survive to become ‘technologically mature.’ If the second proposition is true, virtually no advanced civilisations are interested in using their power to create such simulations. If the third proposition is true, then conscious beings considering the question are almost certainly living in a simulation.
Through the veil of our ignorance, it might seem sensible to assign equal credence to all three, and to conclude that unless we are currently living in a simulation, descendants of this civilisation will almost certainly never be in a position to run these simulations.
Strangely indeed, the probability that we are living in a simulation increases as we draw closer to the point at which we are able and willing to do so. At the point that we would be ready to create our own simulations, we would paradoxically be at the very point when we were almost sure that we ourselves were simulations. Only by refraining from doing so could we in a certain sense make it less likely that we were simulated, as it would show that at least one civilisation that was able to create simulations refrained from doing so. Once we took the plunge, we would know that we were almost certainly only doing so as simulated beings. And yet there must have been someone or something that created the first simulation. Could that be us, we would be asking ourselves? In our simulated hearts and minds, we would already know the answer!
In ‘The Merchant of Venice’, by William Shakespeare, Portia sets her suitors a problem to solve to find who is right for her. In the play, there are just three suitors and they are asked to choose between a gold, a silver and a lead casket, one of which contains a portrait which is the key to her hand in marriage.
Let us build a thought experiment around Portia’s quest for love in which she meets the successive suitors in turn. Her problem is when to stop looking and start choosing. To make the problem more general, let’s say she has 100 suitors to choose from. Each will be presented to her in random order and she has twenty minutes to decide whether he is the one for her. If she turns someone down there is no going back, but the good news is that she is guaranteed not to be turned down by anyone she selects. If she comes to the end of the line and has still not chosen a partner, she will have to take whoever is left, even if he is the worst of the hundred. All she has to go on in guiding her decision are the relative merits of the pool of suitors.
Let’s say that the first presented to her, whom we shall call No.1, is perfectly charming but she has some doubts. Should she choose him anyway, in case those to follow will be worse? With 99 potential matches left, it seems more than possible that there will be at least one who is a better match than No.1. The problem facing Portia is that she knows that if she dismisses No. 1, he will be gone forever, to be betrothed to someone else.
She decides to move on. The second suitor turns out to be far worse than the first, as does the third and fourth. She starts to think that she may have made a mistake in not accepting the first. Still, there are potentially 96 more to see. This goes on until she sees No. 20, whom she actually prefers to No. 1. Should she now grasp her opportunity before it is too late? Or should she wait for someone even better?
She is looking for the best of the hundred, and this is the best so far. But there are still 80 suitors left, one of whom might be better than No. 20. Should she take a chance? What is Portia’s optimal strategy in finding Mr. Right?
This is an example of an ‘Optimal Stopping Problem’, which has come to be known as the ‘Secretary Problem.’ In this variation, you are interviewing for a secretary, with your aim being to maximise your chance of hiring the single best applicant out of the pool of applicants. Your only criterion to measure suitability is their relative merits, i.e. who is better than whom. As with Portia’s Problem, you can offer the post to any of the applicants at any time before seeing any more candidates, but you lose the opportunity to hire that applicant if you decide to move on to the next in line.
This sort of stopping strategy can be extended to almost any search, including the search for a place to live, a place to eat, the choice of a used car, and so on.
In each of these cases, there are two ways you can fail to meet your goal of finding the best option out there: stopping too early and stopping too late. By stopping too early, you may never see the best option. By stopping too late, you have held out for a better option that turns out not to exist. So how do you find the right balance?
Let’s consider the intuition. The first option is automatically the best seen so far, and the second option (assuming we are taking the options in a random order) has a 50% chance of being the best yet. Likewise, the tenth option has only a 10% chance of being the best to that point. In general, the chance that any given option is the best so far is one divided by the number of options seen, so it declines as we go: encounters with a ‘best yet’ become rarer and rarer as we work through the process.
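This intuition is easy to check numerically. The sketch below (in Python, purely for illustration; the trial count and random seed are arbitrary choices) estimates by simulation the chance that the option in a given position of a random ordering is the best seen to that point, and compares it with one over that position:

```python
import random

random.seed(42)

def best_so_far_rate(position, trials=100_000):
    """Estimate the chance that the option at `position` (1-indexed)
    is the best of all options seen up to that point."""
    hits = 0
    for _ in range(trials):
        values = [random.random() for _ in range(position)]
        if values[-1] == max(values):
            hits += 1
    return hits / trials

for n in (1, 2, 10):
    print(n, round(best_so_far_rate(n), 3), "expected", round(1 / n, 3))
```

The simulated rates cluster around 1, 0.5 and 0.1 respectively, matching the one-in-n intuition.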
To see how we might best approach the problem, let’s go back to Portia and her suitors and look at her best strategy when faced with different-sized pools of suitors. Can she do better using some strategy other than choosing at some random position in the order of presentation to her? It can be shown mathematically that she can certainly expect to do better, given that there are more than two to choose from.
Let’s return to the original play where there are three suitors. If she chooses No. 1, she has no information with which to compare the relative merits of her suitors. On the other hand, by the time she reaches No. 3, she must choose him, even if he’s the worst of the three. In this way, she has maximum information but no choice. In the case of No. 2, she has more information than she did when she saw No. 1, as she can compare the two. She also has more control over her choice than she will if she leaves it until she meets No. 3.
So she turns down No. 1 to give herself more information about the relative merits of those available. But what if she finds that No. 2 is worse than No. 1? What should she do? It can in fact be shown that she should wait and take the risk of ending up with No. 3, as she must do if she leaves it to the last. On the other hand, if she finds that she prefers No. 2 to No. 1, she should choose him on the spot and forgo the chance that No. 3 will be a better match.
It can also be shown that in the three-suitor scenario, she will succeed in finding her best available match exactly half the time by selecting No. 2 if he is better than No. 1. If she chooses No. 1 or No. 3, on the other hand, she will only have met that aim one time in three.
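We can verify these figures exactly by enumerating all six possible orderings of three suitors. The Python sketch below (an illustration, not anything from the original text) scores three strategies: always take No. 1, always wait for No. 3, and the skip-then-choose rule just described:

```python
from itertools import permutations

def success_rate(strategy):
    """Fraction of the 6 orderings in which `strategy` picks the best suitor.
    Suitors are ranked 1 (worst) to 3 (best); `strategy` returns the chosen rank."""
    orders = list(permutations([1, 2, 3]))
    wins = sum(strategy(order) == 3 for order in orders)
    return wins / len(orders)

def skip_one(order):
    # Pass on No. 1, take the first later suitor who beats him; else take the last.
    for candidate in order[1:]:
        if candidate > order[0]:
            return candidate
    return order[-1]

print(success_rate(lambda o: o[0]))   # always take No. 1
print(success_rate(lambda o: o[-1]))  # always wait for No. 3
print(success_rate(skip_one))         # skip one, then choose
```

The enumeration confirms the text: one in three for the first two strategies, one in two for the skip-then-choose rule.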
If there are four suitors, Portia should use No. 1 as her benchmark, and select No. 2 if he is better than No. 1. If he is not, she should apply the same test to No. 3; and if No. 3 also fails to better No. 1, she must go to No. 4 and hope for the best. The same strategy can be applied to any number of people in the pool.
So, in the case of a hundred suitors, how many should she see purely to gain information before being willing to choose someone? It can, in fact, be demonstrated mathematically that the optimal point at which to turn from looking to leaping (her ‘stopping point’) is 37. She should meet the first 37 suitors, then choose the first of those who follow who is better than the best of the first 37. By following this rule, she will find the best of the princely bunch of a hundred with a probability, strangely enough, of about 37 per cent. By choosing randomly, on the other hand, she has a chance of just 1 in 100 (1 per cent) of settling upon the best.
This 37% stopping rule applies to any similar decision, such as the secretary problem or looking for a house in a fast-moving market, and it doesn’t matter how many options are on the table. You should always use the first 37 per cent as your baseline, and then select the first option to come along afterwards that is better than any of them.
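A short simulation illustrates the rule for a hundred candidates. The sketch below (in Python; the trial count and random seed are arbitrary choices) plays out the ‘see 37, then take the first improvement’ strategy many times and reports how often it lands on the single best candidate:

```python
import random

random.seed(7)

def look_then_leap(n=100, cutoff=37, trials=20_000):
    """Estimate the chance the 'see `cutoff`, then take the first improvement'
    rule lands on the single best of `n` randomly ordered candidates."""
    wins = 0
    for _ in range(trials):
        candidates = random.sample(range(n), n)   # ranks 0..n-1; n-1 is the best
        benchmark = max(candidates[:cutoff])
        choice = candidates[-1]                   # forced to take the last if no one beats it
        for c in candidates[cutoff:]:
            if c > benchmark:
                choice = c
                break
        wins += (choice == n - 1)
    return wins / trials

print(round(look_then_leap(), 3))   # close to 0.37, versus 0.01 for a random pick
```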
The mathematical proof is based on the mathematical constant e (sometimes known as Euler’s number), and specifically on 1/e, which can be shown to be the optimal stopping point along a range from 0 to 1, after which it is best to choose the first option that is better than any of those seen before. The value of e is approximately 2.71828, so 1/e is about 0.36788, or 36.788 per cent; this is simply rounded to 37 per cent in stating the stopping rule. It can also be shown that the chance that implementing this stopping rule will yield the very best outcome is itself equal to 1/e, i.e. about 37 per cent.
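For those who want to see where 37 comes from without the calculus: the chance of success when skipping the first r of n candidates and then taking the first improvement works out as (r/n) × (1/r + 1/(r+1) + … + 1/(n−1)), since the best candidate must sit after position r and the best of those before it must sit among the first r. Maximising this over r can be done by brute force, as in this short Python sketch:

```python
import math

n = 100

def p_success(r, n):
    """P(the skip-first-r rule picks the overall best of n candidates)
    = (r/n) * sum_{j=r}^{n-1} 1/j."""
    return (r / n) * sum(1 / j for j in range(r, n))

best_r = max(range(1, n), key=lambda r: p_success(r, n))
print(best_r, round(p_success(best_r, n), 4))   # optimal cutoff and its success chance
print(round(1 / math.e, 4))                     # 1/e for comparison
```

For n = 100 the optimal cutoff is indeed 37, with a success probability of about 0.371, agreeing with the 1/e limit of roughly 0.3679.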
If there is a chance that your selection might turn out to be unavailable, the rule can be adapted to give a different cut-off, but the principle remains. For example, if there is a 50% chance that your chosen option will turn you down, the 37% rule becomes a 25% rule; the rest of the strategy stays the same. Following it gives you a 25% chance of finding the best of the options, compared with 37% when your choice is always accepted. That is still far better than the 1 per cent chance of picking the best of a hundred options at random. The lower figure (25% rather than 37%) reflects the additional uncertainty introduced by the fact that your choice might not be final. There are other variations on the same theme, in which you may go back to an option you previously passed over, with some probability that it is still available. Take the case, for example, where an immediate proposal is certain to be accepted but a belated proposal is accepted half of the time: the cut-off proportion in that scenario rises to 61% as the possibility of going back becomes real.
There is also a rule of thumb that can be derived when the aim is to maximise the chance of selecting a good option, if not necessarily the very best; it has the advantage of reducing the chance of ending up with one of the worst options. This is the square root rule, which simply replaces the 37% criterion with the square root of the number of options available. In the case of Portia’s choice, she would meet the first ten of the hundred (since the square root of 100 is 10) and choose the first of the remaining 90 who is better than the best of those ten. Whatever variation you adopt, the numbers change but the principle stays the same.
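The trade-off between the two rules can be illustrated by simulation. The Python sketch below (trial counts and seed are arbitrary choices) compares the 37 per cent rule and the square root rule on two scores: how often each lands the very best of a hundred options, and how often each ends up with one of the ten worst:

```python
import random

random.seed(11)

def pick(candidates, cutoff):
    # Look-then-leap: benchmark on the first `cutoff`, take the first improvement.
    benchmark = max(candidates[:cutoff])
    for c in candidates[cutoff:]:
        if c > benchmark:
            return c
    return candidates[-1]   # forced to take the last candidate

def outcome_profile(cutoff, n=100, trials=20_000):
    """Return (chance of getting the very best, chance of one of the 10 worst)."""
    best = worst_decile = 0
    for _ in range(trials):
        candidates = random.sample(range(n), n)   # ranks 0..n-1; n-1 is the best
        choice = pick(candidates, cutoff)
        best += (choice == n - 1)
        worst_decile += (choice < n // 10)
    return best / trials, worst_decile / trials

for cutoff in (37, 10):   # 37% rule versus square root rule
    b, w = outcome_profile(cutoff)
    print(cutoff, round(b, 3), round(w, 3))
```

The 37% rule wins the very best option more often, but the square root rule is forced down to the last (possibly dreadful) candidate far less often, which is precisely its appeal.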
All this assumes that we lack an objective standard against which to measure each of our options, so that all we can do is compare options with one another. Suppose instead that Portia is simply interested in choosing the richest of the suitors and knows the distribution of wealth across all potential suitors, which ranges evenly from the bankrupt to those worth 100,000 ducats.
This means that the top percentile of potential suitors in the whole population is worth upwards of 99,000 ducats, the lowest percentile is worth up to 1,000 ducats, and the 50th percentile is worth between 49,000 and 50,000 ducats.
Now Portia is presented with a hundred out of this population of potential suitors, and let’s assume that the suitors presented to her are representative of this population. Say now that the first to be presented to her is worth 99,500 ducats. Since wealth is her only criterion, and he is in the upper percentile in terms of wealth, her optimal decision is to accept his proposal of marriage. It is possible that one of the next 99 is worth more than 99,500 ducats but that isn’t the way to bet.
On the other hand, say that the first suitor is worth 60,000 ducats. Since there are 99 more to come, it is a good bet that at least one of them will be worth more than this. If, however, she has turned down every suitor until she is presented with the 99th, only one choice remains after him, so she should accept him if he is above the 50th percentile, in this case 50,000 ducats. In other words, Portia’s decision as to whether to accept a proposal comes down to how many potential matches she has left to see: the more there are to come, the higher the wealth percentile she should demand before accepting, though she should never accept anyone below the average unless she is out of choices. In this version of the stopping problem, the probability that Portia will end up with the wealthiest of the available suitors turns out to be about 58 per cent. More information, of course, increases the chance of success. Indeed, any criterion that tells us where an option stands relative to the relevant population as a whole will increase the probability of finding the best choice of those available. As such, it seems that if Portia is only interested in the money, she is more likely to find it than if she is looking for love.
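The 58 per cent figure can be checked numerically. The sketch below is a rough backward-induction calculation in Python (my own illustration, not a method from the original text), assuming wealth is uniformly distributed and discretising the 0-to-1 wealth scale into buckets (the grid size is an arbitrary choice). It computes the probability that a fully optimal threshold strategy stops on the very wealthiest of 100 suitors:

```python
# Full-information version: suitor values are i.i.d. uniform on (0, 1), and
# Portia wins only by stopping on the overall maximum. Work backwards from
# the last suitor, tracking the running maximum on a discrete grid.
G, n = 2000, 100
xs = [(j + 0.5) / G for j in range(G)]

f = [0.0] * G      # f[j]: win probability after stage i with running max xs[j]
ans = 0.0
for i in range(n, 0, -1):
    rem = n - i    # suitors still to come after position i
    # If the current value x beats the running max, accepting wins with
    # probability x**rem (all later values fall below x); otherwise continue.
    act = [max(x ** rem, f[j]) for j, x in enumerate(xs)]
    suffix = [0.0] * (G + 1)
    for j in range(G - 1, -1, -1):
        suffix[j] = suffix[j + 1] + act[j]
    ans = suffix[0] / G                # value at stage i before anything is seen
    f = [((j + 1) / G) * f[j] + suffix[j + 1] / G for j in range(G)]
print(round(ans, 3))   # roughly 0.58
```

The answer lands close to 0.58, comfortably above the 0.37 achievable when only relative rankings are available.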
And who did fair Portia choose in the original play? Well, there are no spoilers here. But I can reveal that it was the best of the three.
