The Monty Hall Problem is a famous, perhaps the most famous, probability puzzle ever to have been posed. It is based on an American game show, Let’s Make a Deal, hosted by Monty Hall, and came to public prominence as a question quoted in an ‘Ask Marilyn’ column in Parade magazine in 1990.

‘Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind all the doors, opens another door, say No. 3, which reveals a goat. He then says to you, “Do you want to switch to door No. 2?”’ This is not a strategic decision on his part based on knowing that you chose the car: he always opens one of the doors concealing a goat and offers the contestant the chance to switch. It is part of the rules of the game.

So should you switch doors?

Consider the probability that you chose the correct door the first time, i.e. that Door No. 1 is the door to the car. What is that probability? Well, clearly it is 1/3, in that you have three doors to choose from, all equally likely.

But what happens to the probability that Door No. 1 is the key to the car once Monty has opened one of the other doors?

This again seems quite straightforward. There are now two doors left unopened, and there is no way to tell behind which of these two doors lies the car. So the probability that Door 1 offers the star prize now that Door 2 (or else Door 3) has been opened would seem to be 1/2. So should you switch? Since the two remaining doors would seem to be equally likely paths to the car, it would seem to make no difference whether you stick with your original choice of Door 1 or switch to the only other door that is unopened.

But is this so? Marilyn vos Savant, in her ‘Ask Marilyn’ column, declared that you should switch doors to boost your chances of winning the car. This answer was howled down by the great majority of the readers who wrote in, and was rejected even by Paul Erdős, one of the most prolific mathematicians of all time. Their reasoning was that once the door was opened, only two doors remained closed, so the chance that the car was behind either of the doors was identical, i.e. 1/2. For that reason, whether the contestant switched or stuck should make no difference to the chance of winning the car. Vos Savant argued, in contrast, that switching doubled the chance of winning the car.

Let’s think it through.

When you choose Door 1, there is a 1 in 3 chance that you have won your way to the car if you stick with it. There is a 2 in 3 chance that Door 1 leads to a goat. If Door 1 is the lucky door, the host can open either of the other two doors, as both conceal goats. If it is not, he has no choice: he must open the one other door concealing a goat, leaving the car behind the door he does not open. He knows that. You know that. So he is introducing useful new information into the game.

Before he opened a door, there was a 2 in 3 chance that the lucky door was EITHER Door 2 or Door 3 (as there was a 1 in 3 chance it was Door 1). Now he is telling you that there is a 2 in 3 chance that the lucky door is EITHER Door 2 or Door 3 BUT it is not the door he just opened. So there is a 2 in 3 chance that it is the door he didn’t open. So, if he opened Door 2, there is a 2 in 3 chance that Door 3 leads to the car. Likewise, if he opened Door 3, there is a 2 in 3 chance that Door 2 leads to the car. Either way, you are doubling your chance of winning the car by switching from Door 1 (probability of car = 1/3) to whichever of the other doors he does not open (probability of car = 2/3).

It is because the host knows what is behind the doors that his actions, which are constrained by the fact that he can’t open the door to the car, introduce valuable new information. Because he can’t open the door to the car, he is obliged to point to a door that isn’t concealing the car, increasing the probability that the door he doesn’t open is the lucky one (from 1/3 to 2/3).
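One way to check this reasoning is to play the game many times by computer. The following is a minimal simulation sketch in Python, not part of the original argument; the function names `play` and `win_rate`, the number of trials and the random seed are all illustrative assumptions. The host is modelled as the rules state: he never opens the contestant’s door or the door hiding the car.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host knows where the car is: he opens a door that is neither
    # the contestant's pick nor the car (choosing at random if he has two options).
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the first pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of rounds won over many simulated games."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"stick:  {win_rate(switch=False):.3f}")  # close to 1/3
print(f"switch: {win_rate(switch=True):.3f}")   # close to 2/3
```

Under these assumptions the sticking strategy should win about a third of the time and the switching strategy about two thirds of the time.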

If this is not intuitively clear, there is a way of making it more so. Let’s say there were 20 doors, with a car behind one of them and goats behind 19 of them. Now say we choose Door 1. This means that the probability that this is the winning door is 1 in 20. There is a 19 in 20 probability that one of the other doors conceals the car. Now Monty starts opening one door at a time, taking care not to reveal the car each time. After opening a carefully chosen 18 doors (chosen because they didn’t conceal a car), just one door remains. This could be the door to the car or your original choice of Door 1 could be the path to the car. But your original choice had an original probability of 1/20 of being the winning door. Nothing has changed that, because every time he opens a door he is sure to avoid opening a door leading to a car. So the chance that the door he leaves unopened points to the car is 19/20. So, by switching, you raise the probability of winning the car from 1/20 to 19/20.

If he didn’t know what lay behind the doors, he could inadvertently have opened the door to the car. So when he opens a door at random and it happens to reveal a goat, this adds no new information save that one of the doors has been randomly eliminated. If he randomly opens 18 doors, not knowing what is behind them, and none of them happens to reveal the car, the two doors that remain each offer a 1 in 2 chance of the car. So you might as well just flip a coin – and hope!
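The difference between a host who knows and a host who guesses can also be checked by simulation. The sketch below is illustrative only; `switch_win_rate` and its parameters are assumed names, not anything from the original puzzle. With an ignorant host, any round in which he accidentally reveals the car is discarded, so that only rounds with two closed doors and no car shown are counted.

```python
import random


def switch_win_rate(n_doors: int, host_knows: bool,
                    trials: int = 200_000, seed: int = 1) -> float:
    """Estimate the chance of winning by switching to the one other closed door."""
    rng = random.Random(seed)
    wins = valid = 0
    for _ in range(trials):
        car = rng.randrange(n_doors)
        pick = 0                         # by symmetry, always pick the first door
        others = list(range(1, n_doors))
        if host_knows:
            # He opens only goat doors, so the door he leaves shut is the car
            # door unless the contestant already holds the car.
            left_closed = car if car != pick else rng.choice(others)
        else:
            # He opens doors blindly, so the door he leaves shut is random.
            left_closed = rng.choice(others)
            if car != pick and car != left_closed:
                continue                 # he revealed the car: round is void
        valid += 1
        wins += left_closed == car
    return wins / valid


print(f"20 doors, host knows:  {switch_win_rate(20, True):.3f}")   # close to 19/20
print(f"20 doors, host guesses: {switch_win_rate(20, False):.3f}")  # close to 1/2
```

The design choice here is to condition on the car not having been revealed, which matches the situation described: two doors remain and the contestant must decide.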

Even when it is explained this way, many people still find it impossible to grasp the intuition. So here’s the clincher.

Say I have a pack of 52 playing cards, which I lay face down. If you choose the Ace of Spades, you win the car. Every other playing card, you win nothing. Go on, choose one. This is now laid aside from the rest of the deck, still face down. The probability that the card you have chosen is the Ace of Spades is clearly 1/52.

Now I, as the host, know exactly where the Ace of Spades is. There is a 51/52 chance that it must be somewhere in the rest of the deck, and if it is I know where. Now, I carefully turn over the cards in the deck one at a time, taking care never to turn over the Ace of Spades, until there is just one card left. What is the chance that the one remaining card from the deck is the Ace of Spades? It is 51/52 because I have carefully sifted out all the losing cards to leave just one card, the Ace of Spades. In other words, I have presented you with the one card out of the remaining deck of 51 that is the Ace of Spades, assuming that it was not the card you chose in the first place. The chance that the card you chose in the first place was the Ace of Spades is 1/52. So the card I have selected for you out of the remaining deck has a probability of 51/52 of being the Ace of Spades. So should you switch when I offer you the chance to give up your original card for the one that I have filtered out of the remaining 51 cards (taking care each time never to reveal the Ace of Spades)? Of course you should. And that’s what you should tell Monty Hall every single time. Switch!

**Appendix**

In the standard description of the Monty Hall Problem, Monty can open door 1 or door 2 or door 3. The car can be behind door 1, door 2 or door 3. The contestant can choose any door.

We can apply Bayes’ Theorem to solve this.

D1: Monty Hall opens Door 1.

D2: Monty Hall opens Door 2.

D3: Monty Hall opens Door 3.

C1: The car is behind Door 1.

C2: The car is behind Door 2.

C3: The car is behind Door 3.

The prior probability of the car being behind any particular door is 1/3, i.e. P(C1) = P(C2) = P(C3) = 1/3.

Assume the contestant chooses Door 1 and Monty Hall randomly opens one of the two doors he knows the car is not behind.

The conditional probabilities given the car being behind either Door 1 or Door 2 or Door 3 are as follows.

P(D3 | C1) = 1/2 … as he is free to open Door 2 or Door 3, as he knows the car is behind the contestant’s chosen door, Door 1. He does so randomly.

P(D3 | C3) = 0 … as he cannot open a door that a car is behind (Door 3) or the contestant’s chosen door, so he must choose Door 2.

P(D3 | C2) = 1 … as he cannot open a door that a car is behind (Door 2) or the contestant’s chosen door (Door 1).

These are equally probable, so the probability he will open D3, i.e. P(D3) = (1/2 + 0 + 1) / 3 = 1/2

So, P(C1 | D3) = P(D3 | C1) x P(C1) / P(D3) = (1/2 x 1/3) / (1/2) = 1/3

Therefore, there is a 1/3 chance that the car is behind the door originally chosen by the contestant (Door 1) when Monty opens Door 3.

But P(C2 | D3) = P(D3 | C2) x P(C2) / P(D3) = (1 x 1/3) / (1/2) = 2/3

Therefore, there is twice the chance of the contestant winning the car by switching doors after Monty Hall has opened a door.
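The same Bayesian calculation can be reproduced with exact fractions. This is a small sketch rather than a definitive implementation; the dictionary names `prior`, `likelihood_d3` and `posterior` are assumptions made for illustration, while the numbers are those given above for the case where the contestant chooses Door 1 and Monty opens Door 3.

```python
from fractions import Fraction

# Prior probability that the car is behind Door 1, 2 or 3.
prior = {door: Fraction(1, 3) for door in (1, 2, 3)}

# Probability that Monty opens Door 3, given the car's location,
# when the contestant has chosen Door 1.
likelihood_d3 = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

# Total probability that Monty opens Door 3.
p_d3 = sum(likelihood_d3[k] * prior[k] for k in prior)

# Bayes' Theorem: probability of each car location given that Door 3 was opened.
posterior = {k: likelihood_d3[k] * prior[k] / p_d3 for k in prior}

print(p_d3)          # 1/2
print(posterior[1])  # 1/3
print(posterior[2])  # 2/3
```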

You are presented with three identical boxes. You are made aware that one of the boxes contains two gold coins, another contains two silver coins, and the third contains one gold coin and one silver coin. You do not know which box contains which.

Now, choose a box at random. Reach without looking under the cloth covering the coins and take out one of the coins. Now you can look. It is gold.

So you can be sure that the box you chose cannot be the box containing the two silver coins. It must be either the box containing two gold coins or the box containing one gold coin and one silver coin.

Withdrawing the gold coin from the box doesn’t provide you with the information to identify which of these two boxes it is. So the other coin must either be a gold coin or a silver coin.

Given what you now know, what is the probability the other coin in the box is also gold, and what odds would you take to bet on it?

This is essentially the so-called ‘Bertrand’s Box’ paradox, first proposed by Joseph Bertrand in 1889 in his opus, ‘Calcul des probabilités’.

After withdrawing the gold coin, there are only two boxes left. One is the box containing the two gold coins and the other is the box containing one gold and one silver coin. It seems intuitively clear that each of these boxes is equally likely to be the one you chose at random, and that therefore the chance it is the box with two gold coins is 1/2, and the chance that it is the box containing one gold and one silver coin is also 1/2. Therefore, the probability that the other coin is gold must be 1/2.

This sounds right, but is this right?

Well, let’s look a little closer. There are three equally likely scenarios that might have led to you choosing that shiny gold coin. Let us separately label all the coins in the boxes to make this clear.

In the box containing two gold coins, there will be Gold Coin 1 and Gold Coin 2. These are both gold coins but they are distinct, different coins.

In the box containing the gold and silver coins, we have Gold Coin 3, which is a different coin to Gold Coin 1 and Gold Coin 2. There is also what we might label Silver Coin 3 in the box with Gold Coin 3. This silver coin is distinct and different to what we might label Silver Coin 1 and Silver Coin 2, which are in the box containing two silver coins, which was not selected.

So here are the equally likely scenarios when you withdrew a gold coin from the box.

- You chose Gold Coin 1.
- You chose Gold Coin 2.
- You chose Gold Coin 3.

You do not know which of these gold coins you withdrew from the box.

If it was Gold Coin 1, the other coin in the box is also gold.

If it was Gold Coin 2, the other coin in the box is also gold.

If it was Gold Coin 3, the other coin in the box is silver.

Each of these possible scenarios is equally likely (i.e. each has a probability of being the true state of the world of 1/3), so the probability that the other coin is gold is 2/3 and the probability that the other coin is silver is 1/3. So, if you are offered even money about the other coin being gold, the edge is very much with you.
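The counting argument can be checked by listing every equally likely draw. The following is an illustrative sketch; the box labels `GG`, `SS` and `GS` and the variable names are assumptions introduced here.

```python
from fractions import Fraction

boxes = {
    "GG": ("gold", "gold"),
    "SS": ("silver", "silver"),
    "GS": ("gold", "silver"),
}

# Six equally likely outcomes: (box, coin drawn, coin left behind).
draws = []
for name, coins in boxes.items():
    for i in (0, 1):
        draws.append((name, coins[i], coins[1 - i]))

# Keep only the outcomes consistent with what was observed: a gold coin drawn.
gold_draws = [d for d in draws if d[1] == "gold"]
p_other_gold = Fraction(sum(d[2] == "gold" for d in gold_draws), len(gold_draws))

print(len(gold_draws))  # 3
print(p_other_gold)     # 2/3
```

At even money, a one-unit bet on the other coin being gold therefore has an expected profit of 2/3 - 1/3 = 1/3 of a unit.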

Before withdrawing the gold coin, the chance that the box you had selected was that containing two gold coins was 1/3. By revealing the gold coin, however, you not only excluded the box containing two silver coins but also introduced the new information that you could potentially have chosen a silver coin (if the selected box was that containing one gold and one silver coin) but in fact did not. That made it more likely (twice as likely) that the box you withdrew the gold coin from was that containing the two gold coins than the box containing one gold and one silver coin.

And that is the solution to the Bertrand’s Box paradox.


On the 9th of November, 1999, Sally Clark, a 35-year-old solicitor and mother of a young child, was convicted of murdering two of her children. The presiding Judge, Mr. Justice Harrison, declared that “… we do not convict people in these courts on statistics. It would be a terrible day if that were so.” As it turned out, it was indeed a terrible day, for Sally Clark and for the justice system.

The background to the case is that the death of the babies was put down to natural causes, probably SIDS (‘Sudden Infant Death Syndrome’). Later the Home Office pathologist charged with the case became suspicious and Sally Clark was charged with murder and tried at Chester Crown Court. It eventually transpired that essential evidence in her favour had not been disclosed to the defence, but not before a failed appeal in 2000. At a second Appeal, in 2003, she was set free, and the case is now recognised as a huge miscarriage of justice.

So what went wrong?

A turning point in the trial was the evidence given by a key prosecution witness, who argued that the probability of a baby dying of SIDS was 1 in 8,543, so that the probability of two babies dying of SIDS was that fraction squared, or 1 in about 73 million. It’s the chance, he argued, “… of backing that long odds outsider at the Grand National … let’s say it’s a 80 to 1 chance, you back the winner last year, then the next year there’s another horse at 80 to 1 and it is still 80 to 1 and you back it again and it wins. Now we’re here in a situation that, you know, to get to these odds of 73 million you’ve got to back that 1 in 80 chance four years running … So it’s the same with these deaths. You have to say two unlikely events have happened and together it’s very, very, very unlikely.”

Perhaps unsurprisingly in the face of this interpretation of the evidence, the jury convicted her and she was sentenced to life in prison.

But the evidence was flawed, as anyone with a basic understanding of probability would have been aware. Even assuming that the proposed probability of 1 in 8,543 was accurate (there are separate reasons to doubt this), one of the basic laws of probability is that you can only multiply probabilities in this way if the events are independent of each other. That would be true here only if the cause of death of the first child was totally independent of the cause of death of the second child. There is no reason to believe this. It assumes no genetic, familial or other innocent link between these sudden deaths at all. That is a basic error of classical probability. The other error is much more sinister, in that it is harder for the layman to detect the flaw in the reasoning. It is the ‘Prosecutor’s Fallacy’, a well-known problem in the theory of conditional probability, and in particular in the application of what is known as Bayesian reasoning, which is discussed in the context of Bayes’ Theorem elsewhere.
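The multiplication point can be written out in symbols. In this sketch the labels S_1 and S_2, for the first and second deaths being due to SIDS, are introduced here for illustration. The general rule involves a conditional probability, and it collapses to a simple square only under independence:

```latex
P(S_1 \cap S_2) = P(S_1)\, P(S_2 \mid S_1)

\text{Only if } S_1 \text{ and } S_2 \text{ are independent: }\quad
P(S_1 \cap S_2) = P(S_1)\, P(S_2)
  = \left(\tfrac{1}{8543}\right)^{2}
  = \tfrac{1}{72\,982\,849}
  \approx \tfrac{1}{73\ \text{million}}
```

If a shared genetic or familial factor makes a second death more likely once a first has occurred, then P(S_2 | S_1) is larger than 1/8,543 and the true joint probability is larger than 1 in 73 million.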

The ‘Prosecutor’s Fallacy’ is to conflate the probability of innocence given the available evidence with the probability of the evidence arising given the fact of innocence. In particular, the following propositions are very different:

- The probability of observing some evidence (the dead children) given that a hypothesis is true (here that Sally Clark is guilty).
- The probability that a hypothesis is true (here that Sally Clark is guilty) given that we observe some evidence (the dead children).

These are totally different propositions, the probabilities of which can and do diverge widely.

Notably, the probability of the former proposition is much higher than of the latter. Indeed, the probability of the children dying given that Sally Clark is a child murderer is effectively 1 (100%). However, the probability that she is a child murderer given that the children have died is a whole different picture.

Critically, we need to consider the prior probability that she would kill both babies, i.e. the probability that she would kill her children, before we are given this evidence of sudden death. This is the concept of ‘prior probability’, which is central to Bayesian reasoning. This prior probability must not be viewed through the lens of the later emerging evidence. It must be established on its own merits and then merged through what is known as Bayes’ Theorem with the new evidence.

In establishing this prior probability, we need to ask whether there was any other past indication or evidence to suggest that she was a child murderer, as the number of mothers who murder their children is almost vanishingly small. Without such evidence, the prior probability of guilt should correspond to something like the proportion of mothers in the general population who serially kill their children. This prior probability of guilt is close to zero. In order to update the probability of guilt, given the evidence of the dead children, the jury needs to weigh up the relative likelihood of the two competing explanations for the deaths. Which is more likely: double infant murder by a mother, or double SIDS? In fact, double SIDS is hugely more common than double infant murder. That is not a question that the jury, unversed in Bayesian reasoning or conditional probability, seems to have asked itself. If it did, it reached the wrong conclusion.

More generally, it is likely in any large enough population that one or more cases will occur of something which is improbable in any particular case. Out of the entire population, there is a very good chance that some random family will suffer a case of double SIDS. This is no ground to suspect murder, however, unless there was a particular reason why the mother in this particular family was, before the event, likely to turn into a double child killer.

To look at the problem another way, consider the wholly fictional case of Lottie Jones, who is charged with winning the National Lottery by cheating. The prosecution expert gives the following evidence. The probability of winning the Lottery jackpot without cheating, he tells the jury, is 1 in 45 million. Lottie won the Lottery. What’s the chance she could have done so without cheating in some way? So small as to be laughable. The chance is 1 in 45 million. So she must be guilty. Sounds ridiculous put like that, but it is exactly the same sort of reasoning that sent Sally Clark, and sends many other innocent people, to prison in real life.

As in the Sally Clark case, the prosecution witness in this fictional parody committed the classic ‘Prosecutor’s Fallacy’, assuming that the probability that Lottie is innocent of cheating given the evidence (she won the Lottery) was the same thing as the probability of the evidence (she won the Lottery) given that she didn’t cheat. The former is much higher than the latter, unless we have some other indication that Lottie has cheated to win the Lottery. Once again, it is an example of how it is likely that in any large enough population one or more cases will occur of something which is improbable in any particular case. The probability that needed to be established in the Lottie case was the probability that she would win the Lottery before she did. If she is innocent, that probability is 1 in tens of millions. The fact that she did, in fact, win the Lottery does not change that.
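To see how far apart the two probabilities can be, here is a rough sketch of the Lottie calculation. Only the 1 in 45 million figure comes from the story. The prior proportion of cheats (one player in a billion) and the assumption that cheating guarantees a win are hypothetical numbers chosen purely for illustration, and `p_innocent_given_win` is an assumed function name.

```python
def p_innocent_given_win(p_win_if_innocent: float,
                         p_win_if_cheat: float,
                         prior_cheat: float) -> float:
    """Bayes' Theorem: probability of innocence given that the player won."""
    innocent = p_win_if_innocent * (1 - prior_cheat)
    cheat = p_win_if_cheat * prior_cheat
    return innocent / (innocent + cheat)


# From the text: an honest ticket wins with probability 1 in 45 million.
# Hypothetical: a cheat always wins, and one player in a billion is a cheat.
p = p_innocent_given_win(1 / 45_000_000, 1.0, 1e-9)
print(f"P(win | innocent) = {1 / 45_000_000:.2e}")
print(f"P(innocent | win) = {p:.3f}")  # about 0.957
```

Under these made-up inputs the probability of the evidence given innocence is tiny, yet the probability of innocence given the evidence is above 95%. Different assumed priors would give different figures, but the gap between the two quantities is the point.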

Lottie just got very, very lucky. Just as Sally Clark got very, very unlucky.

Sally Clark never recovered from the trauma of losing her children and spending years in prison falsely convicted of killing them. She died on 16th March, 2007, of acute alcohol intoxication.

The majestic tragedy, Othello, was written by William Shakespeare in about 1603. The play revolves around four central characters: Othello, a Moor who is a General in the Venetian army; his beloved wife, Desdemona; his loyal lieutenant, Cassio; and his trusted ensign, Iago.

A key element of the play is Iago’s plot to convince Othello that Desdemona is conducting an affair with Cassio, by planting in Cassio’s lodgings a treasured keepsake that Othello gave to Desdemona, for Othello to come upon ‘accidentally’.

We playgoers know she is not cheating on him, as does Iago, but Othello, while reluctant to believe it of Desdemona, is also very reluctant to believe that Iago could be making it up.

If Othello refuses to contemplate any possibility of betrayal, then we would have a play in which no amount of evidence, however overwhelming, including finding them together, could ever change his mind. We would have a farce or a comedy instead of a tragedy.

A shrewder Othello would concede that there is at least a possibility that Desdemona is betraying him, however small that chance might be. This means that there does exist some level of evidence, however great it would need to be, that would leave him no alternative. If his prior trust in Desdemona is almost, but not absolutely total, then this would permit of some level of evidence, logically incompatible with her innocence, changing his mind. This might be called ‘Smoking Gun’ evidence.

On the other hand, Othello might adopt a more balanced position, trying to assess the likelihood objectively and without emotion. But how? Should he try and find out the proportion of female Venetians who conduct extra-marital affairs? This would give him the probability for a randomly selected Venetian woman but no more than that. Hardly a convincing approach when surely Desdemona is not just an average Venetian woman. So should he limit the reference class to women who are similar to Desdemona? But what does that mean?

And this is where it is easy for Othello to come unstuck. Because it is so difficult to choose a prior probability (as Bayesians would term it), the temptation is to assume that since it might or might not be true, the likelihood is 50-50. This is known as the ‘Prior Indifference Fallacy’. Once Othello falls victim to this common fallacy, any evidence against Desdemona now becomes devastating. It is the same problem as that facing the defendant in the dock.

Extreme, though not blind, trust is one way to avoid this mistake. But an alternative would be to find evidence that is logically incompatible with Desdemona’s guilt, in effect the opposite of the ‘Smoking Gun.’ The ‘Perfect Alibi’ would fit the bill.

Perhaps Othello would love to find evidence that is logically incompatible with Desdemona conducting an affair with Cassio, but holds her guilty unless he can find it. He needs evidence that admits no True Positives.

Lacking extreme trust and a Perfect Alibi, what else could have saved Desdemona?

To find the answer, we can turn to Bayes’ Theorem.

The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:

ab/[ab+c(1-a)]

a is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence. b is the probability you would see the new evidence if the hypothesis is true. c is the probability you would see the new evidence if the hypothesis is false.

In the case of the Desdemona problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio. Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.

So a = 0.04

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%. There’s quite a good chance she would secretly hand Cassio the keepsake as proof of her love for him and not for Othello.

So b = 0.5

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%. Why would it be there if Desdemona had not been to his lodgings secretly, and why would she take the keepsake along in any case? It could have been stolen and ended up there, but how likely is that?

So c = 0.05

Substituting into Bayes’ equation gives:

Posterior probability = ab/[ab+c(1-a)] = 0.02/[0.02 + 0.048] = 0.294.

So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%, worryingly high for the tempestuous Moor but perhaps low enough to prevent tragedy. The power of Bayes here lies in demonstrating to Othello that the finding of the keepsake in the living quarters of Cassio might only have a 1 in 20 chance of being consistent with Desdemona’s innocence, but in the bigger picture, there is less than a 3 in 10 chance that she actually is culpable.

If this is what Othello concludes, the task of the evil Iago is to lower c in the eyes of Othello, by arguing that the chance of the keepsake ending up with Cassio for an innocent reason is so astoundingly small that 1 in 100 is nearer the mark than 1 in 20. In other words, to convince Othello to lower his estimate of c from 0.05 to 0.01.

The new Bayesian probability of Desdemona’s guilt now becomes:

ab/[ab+c(1-a)]

a = 0.04 (the prior probability of Desdemona’s guilt, as before)

b = 0.5 (as before)

c = 0.01 (down from 0.05)

Substituting into Bayes’ equation gives:

New probability = 0.676 = 67.6%.
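Both of these figures can be reproduced in a few lines. This is a minimal sketch of the a, b, c formula ab/[ab+c(1-a)]; the function name `posterior` is an assumption, and the inputs are the estimates used above.

```python
def posterior(a: float, b: float, c: float) -> float:
    """Bayes' Theorem in a, b, c form.

    a: prior probability that the hypothesis is true
    b: probability of the evidence if the hypothesis is true
    c: probability of the evidence if the hypothesis is false
    """
    return a * b / (a * b + c * (1 - a))


print(round(posterior(0.04, 0.5, 0.05), 3))  # 0.294
print(round(posterior(0.04, 0.5, 0.01), 3))  # 0.676
```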

So, if Othello can be convinced that 5% is too high a probability that there is an innocent explanation for the appearance of the keepsake in Cassio’s lodgings – let’s say he’s persuaded by Iago that the true probability is 1% – then Desdemona’s fate, as that of many a defendant whom a juror thinks has more than a 2 in 3 chance of being guilty, is all but sealed. Her best hope now is to try and convince Othello that the chance of the keepsake being found in Cassio’s place if she were guilty is much lower than 0.5. For example, she could try a common sense argument that there is no way that she would take the keepsake if she were actually having an affair with Cassio, nor be so careless as to leave it behind. In other words, she could argue that the presence of the keepsake where it was found actually provides testimony to her innocence. In Bayesian terms, she should try to reduce Othello’s estimate of b. What level of b would have prevented tragedy? That is another question.
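That question can at least be framed algebraically. Setting ab/[ab+c(1-a)] equal to a target probability p and rearranging gives b = pc(1-a)/[a(1-p)]. The sketch below uses this rearrangement; the function names and the target of 0.5 are assumptions for illustration only, since nothing here says what probability of guilt would have stayed Othello’s hand.

```python
def posterior(a: float, b: float, c: float) -> float:
    """Bayes' Theorem in a, b, c form: ab / [ab + c(1 - a)]."""
    return a * b / (a * b + c * (1 - a))


def b_for_target(a: float, c: float, target: float) -> float:
    """Value of b at which the posterior probability equals `target`."""
    return target * c * (1 - a) / (a * (1 - target))


# With a = 0.04 and c = 0.01 (Iago's figure), find the b that gives a 50% posterior.
b = b_for_target(a=0.04, c=0.01, target=0.5)
print(round(b, 3))                         # 0.24
print(round(posterior(0.04, b, 0.01), 3))  # 0.5
```

On these assumed figures, Desdemona would need to persuade Othello that b is below about 0.24 just to bring the probability of her guilt under evens.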

William Shakespeare wrote Othello about a hundred years before the Reverend Thomas Bayes was born. That is true. But to my mind the Bard was always, in every inch of his being, a true Bayesian. Othello was not, and therein lies the tragedy.

**Appendix**

In the case of the Othello problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio. Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.

So P(H) = 0.04

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%.

So P(E | H) = 0.5

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%.

So P(E | H’) = 0.05

Substituting into Bayes’ Theorem:

P(H | E) = P(E | H) x P(H) / [P(E | H) x P(H) + P(E | H’) x P(H’)]

P(H | E) = 0.5 x 0.04 / [0.5 x 0.04 + 0.05 x 0.96]

P(H | E) = 0.02 / [0.02 + 0.048] = 0.294

Posterior probability = 0.294.

So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%.

If P(E | H’) = 0.01

The new Bayesian probability of Desdemona’s guilt now becomes:

P(H | E) = 0.5 x 0.04 / [0.5 x 0.04 + 0.01 x 0.96]

P(H | E) = 0.02 / (0.02 + 0.0096) = 0.02 / 0.0296 = 0.676

Updated probability = 0.676 = 67.6%.
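An equivalent way to see both results is the odds form of Bayes’ Theorem, in which the posterior odds are the prior odds multiplied by the likelihood ratio. This is offered as a cross-check sketch rather than as part of the original derivation; H, H' and E are as defined above.

```latex
\frac{P(H \mid E)}{P(H' \mid E)}
  = \frac{P(H)}{P(H')} \times \frac{P(E \mid H)}{P(E \mid H')}

\text{With } P(E \mid H') = 0.05:\quad
\frac{0.04}{0.96} \times \frac{0.5}{0.05} = \frac{1}{24} \times 10 = \frac{5}{12}
\;\Rightarrow\; P(H \mid E) = \frac{5}{17} \approx 0.294

\text{With } P(E \mid H') = 0.01:\quad
\frac{0.04}{0.96} \times \frac{0.5}{0.01} = \frac{1}{24} \times 50 = \frac{25}{12}
\;\Rightarrow\; P(H \mid E) = \frac{25}{37} \approx 0.676
```

Seen this way, Iago’s whole task is to inflate the likelihood ratio from 10 to 50.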


Bobby Smith, aged 8, is a good schoolboy footballer, but you know that only one in a thousand such 8-year-olds go on to become professional players. So you would like to get an unbiased assessment of his real chance of developing into a top player. A coach tells you there is a test, taken by all good 8-year-old footballers, that can measure the child’s potential. The test, you learn, is 95% accurate in identifying future professional footballers, and these always receive a grade of A+.

Bobby takes the test and is graded A+.

How often does the test award an A+ to a child who fails to develop into a top player, you ask. Now the coach imparts the good news. All current professional players scored A+ when they took the test in their own school days, and we can take it that anyone who scores below that can be ruled out as a future professional player. And the test is 95% accurate, so only 5% of those who fail to develop into professional footballers get the A+ grade. So what is the actual chance that Bobby will become a top player?

If you are like most people, you will think the chance is very high.

This is your reasoning: I don’t really know whether Bobby is likely to turn into a professional player or not. But he has taken this test. In fact, no current professional player scored below A+, and the test only very rarely allocates a top grade to a child who will not become a professional footballer. If the test is really this good, therefore, it looks like Bobby will have a bright future as a football star.

Is this true? Think of it this way. If there were no test, you would have asked the coach a very basic question: in your experience, what is the chance that Bobby will become a professional player? The coach would have dampened your enthusiasm: one in a thousand, he would have said. But with the test result in hand, there’s no need to ask this question. It’s irrelevant in the face of a very accurate test result, isn’t it?

In fact, this is a well-known fallacy, another example of the *Inverse Fallacy*, or *Prosecutor’s Fallacy*. The fallacy is to confuse the probability of a hypothesis being true, given some evidence, with the probability of the evidence arising given the hypothesis is true.

In our example, the hypothesis is that Bobby will become a top player, and the evidence is the high test score. What we want to know is the probability that Bobby will become a top player, given that the test says he will be. What we know, on the other hand, is the probability that the test says Bobby will be a top player, given that he will be. The coach told you this probability, on all available evidence, is 100%: the test is in this sense infallible, in that all professional players score A+ on the test. In answering your other question, the coach also told you the probability of an A+ test score, given that the child will not become a top player, is only 5%. You take this information and conclude that Bobby is very likely to turn into a top player.

In fact, of the thousand children who took the test, only one (statistically speaking) will become a professional footballer. The test is 95% accurate, so 5% of the 1,000 children will score A+ and not become top players, i.e. there will be 50 ‘false positives.’ Anyone who will become a top player, on the other hand, will score A+ on the test.

So what is the chance that Bobby will become a professional footballer if he scores A+ on the test?

**Solution:** 50 kids who will not become top footballers score A+ (the 50 ‘false positives’). Only one of the one thousand eight-year-olds who take the test develops into a professional player, and that child will score A+. Look at it this way. A thousand 8-year-olds take the test, and of these 50 of them will receive a letter telling them they have scored A+ on the test but will not develop into top players. One child will receive a letter with a score of A+ and actually will go on to become a professional player. Therefore the probability you will become a top footballer if you score A+ is just 1 in 51, i.e. 1.96%.

This is exactly the same idea as the medical ‘false positives’ problem.

In the equivalent flu version of the problem, a thousand people go to the doctor and all are tested for flu. Only one actually has the flu, and those with the flu always test positive. We know that the test for flu is 95% accurate, so about 5% of the people without the flu will test positive, i.e. there will be roughly 50 ‘false positives’. So what is the chance that you have the flu if you test positive? About 50 people who do not have the flu test positive, and one person who has the flu tests positive. Therefore, the probability you have the flu if you test positive is 1 in 51, i.e. 1.96%.

We can also solve the Bobby Smith problem using Bayes’ Theorem. The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:

ab/ [ab+c(1-a)]

a is the prior probability, i.e. the probability that a hypothesis is true before the new evidence. b is the probability of the new evidence if the hypothesis is true. c is the probability of the new evidence if the hypothesis is false.

In the case of the Bobby Smith problem, the hypothesis is that Bobby will develop into a professional player.

Before the new evidence (the test), the chance that Bobby will become a professional player is 1 in 1000 (0.001).

So a = 0.001

The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.

So b = 1

The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional footballers.

So c = 0.05

Substituting into Bayes’ equation gives:

Posterior probability = ab / [ab + c(1 - a)] = (0.001 × 1) / [0.001 × 1 + 0.05 × (1 - 0.001)] = 0.001 / 0.05095 = 0.0196
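As a quick check of the substitution, here is a minimal Python sketch of the a, b, c formula. The function name `posterior` is chosen for this illustration only.

```python
def posterior(a, b, c):
    """Bayes' Theorem in the a, b, c form: ab / [ab + c(1 - a)].

    a: prior probability that the hypothesis is true
    b: probability of the evidence if the hypothesis is true
    c: probability of the evidence if the hypothesis is false
    """
    numerator = a * b
    return numerator / (numerator + c * (1 - a))

# Bobby Smith: prior of 1 in 1,000, every future professional scores A+,
# and 5% of the other children score A+ as well.
bobby = posterior(a=0.001, b=1.0, c=0.05)
print(f"{bobby:.4f}")  # 0.0196
```

The same function can be reused for any problem of this shape by changing the three inputs.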

So, using Bayes’ Theorem, the chance that Bobby Smith, who scored A+ on the test which is 95% accurate, will actually become a top player, is not 95% as intuition might suggest, but just 1.96%, as we have shown previously by a different route.

There is, therefore, just a 1.96 per cent chance that Bobby Smith will go on to become a professional footballer, despite scoring A+ on that very accurate test of player potential.

That’s the statistics, the cold Bayesian logic. Now for the good news. Bobby Smith was the lucky one. He currently plays for Barcelona, under a different name.

**Appendix**

We can also solve the Bobby Smith problem using the traditional notation version of Bayes’ Theorem.

P(H|E) = P(E|H) × P(H) / [P(E|H) × P(H) + P(E|H’) × P(H’)]

Here H is the hypothesis (Bobby will become a professional player), H’ is the hypothesis being false, and E is the evidence (the A+ score on the test).

Before the new evidence (the test), the chance that Bobby will become a professional player is 1 in 1000 (0.001).

So P(H) = 0.001

The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.

So P(E|H) = 1

The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional footballers.

So P(E|H’) = 0.05

Substituting into Bayes’ equation gives:

P(H|E) = (0.001 × 1) / [0.001 × 1 + 0.05 × (1 - 0.001)] = 0.0196

An entomologist spots what might be a rare category of beetle, due to the pattern on its back. In the rare category, 98% have the pattern. In the common category, only 5% have the pattern. The rare category accounts for only 0.1% of the population. How likely is the beetle to be rare?

Since only 5 per cent of the common beetles bear the distinctive pattern and 98 per cent of the rare beetles do, intuition would tell you that you have come across a rare insect when you espy the pattern. Bayes’ Theorem tells you something quite different.

To calculate just how likely the beetle is to be rare given that we see the pattern on its back, we apply Bayes’ Theorem.

Posterior probability = ab/ [ab+c (1-a)]

a is the prior probability of the hypothesis (beetle is rare) being true. b is the probability we observe the pattern given that the beetle is rare (hypothesis is true). c is the probability we observe the pattern given that the beetle is not rare (hypothesis is false).

In this case, a = 0.001 (0.1%); b = 0.98 (98%); c = 0.05 (5%).

So, updated probability = ab/ [ab+c (1-a)] = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.

Why the counterintuitive result? Because so few of the population of all beetles are rare, i.e. the prior probability that the beetle is rare is almost vanishingly small, and it would take a lot more evidence than that acquired to make a reasonable case for the beetle being rare.

So what is the probability that the beetle is rare given that we observe the distinctive pattern? In other words, what is the probability that the hypothesis (the beetle is rare) is true given the evidence (the pattern)? That is 1.92 per cent. What is the probability that we will observe the distinctive pattern if the beetle is rare? In other words, what is the probability of observing the evidence (the pattern) if the hypothesis (the beetle is rare) is true? That is 98 per cent.
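The gap between these two probabilities can also be seen by simulating a large beetle population. The sketch below is only illustrative: the function `simulate_beetles`, the population size and the random seed are assumptions made for this example, while the three input probabilities are the ones given in the text.

```python
import random

def simulate_beetles(n, prior_rare=0.001, p_pattern_rare=0.98,
                     p_pattern_common=0.05, seed=0):
    """Simulate n beetles and estimate both conditional probabilities.

    Returns (P(rare | pattern), P(pattern | rare)) as observed frequencies.
    """
    rng = random.Random(seed)
    rare_total = pattern_total = rare_with_pattern = 0
    for _ in range(n):
        is_rare = rng.random() < prior_rare
        # The chance of showing the pattern depends on the category.
        p_pattern = p_pattern_rare if is_rare else p_pattern_common
        has_pattern = rng.random() < p_pattern
        rare_total += is_rare
        pattern_total += has_pattern
        rare_with_pattern += is_rare and has_pattern
    return rare_with_pattern / pattern_total, rare_with_pattern / rare_total

p_rare_given_pattern, p_pattern_given_rare = simulate_beetles(500_000)
print(f"P(rare | pattern) is roughly {p_rare_given_pattern:.3f}")
print(f"P(pattern | rare) is roughly {p_pattern_given_rare:.3f}")
```

Because the run is random, the estimates will only be close to 1.92 per cent and 98 per cent rather than exactly equal to them, but the two numbers should land nowhere near each other.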

To conflate these, to believe these two concepts are the same, is to commit the classic Prosecutor’s Fallacy, i.e. to falsely equate the probability that the defendant is guilty given the observed evidence with the probability of observing the evidence given that the defendant is guilty. It’s a potentially very dangerous fallacy to commit, especially when you happen to be the defendant and the jury has never heard of the Reverend Thomas Bayes.

**Appendix**

We can also solve the Beetle problem using the traditional notation version of Bayes’ Theorem.

P(H|E) = P(E|H) × P(H) / [P(E|H) × P(H) + P(E|H’) × P(H’)]

In this case, P(H) = 0.001 (0.1%); P(E|H) = 0.98 (98%); P(E|H’) = 0.05 (5%).

So, P(H|E) = 0.98 × 0.001 / [0.98 × 0.001 + 0.05 × 0.999] = 0.00098 / (0.00098 + 0.04995) = 0.00098 / 0.05093 = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.

Note also that P(H|E) = 0.0192, while P(E|H) = 0.98.

The Prosecutor’s Fallacy is to conflate these two expressions.

Let us invent a little crime story in which you are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police with their investigation into a case of vandalism of a shop window in a street adjoining the one where you know she lives. It took place at noon that day, which you know is her day off work. You had heard about the incident earlier but had no good reason at the time to believe that your friend was in any way linked to it.

She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity. You must evaluate the probability that she did commit the offence before deciding how to advise her. So the condition is that she has been charged with criminal damage; the hypothesis you are interested in evaluating is that she did it. Bayes’ Theorem, of course, helps to answer this type of question.

There are three things to estimate. The first is the Bayesian prior probability (which we represent as ‘a’). This is the probability you assign to the hypothesis being true before you become aware of the new information. In this case, it means the probability you would assign to your friend breaking the shop window immediately before you got the new information from her on the telephone that she had been charged on the basis of the witness evidence.

The second is the probability that the new evidence would have arisen if the hypothesis was true (which we represent as ‘b’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend actually did break the window.

The third is to estimate the probability that the new evidence would have arisen if the hypothesis was false (which we represent as ‘c’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend did NOT break the window.

According to Bayes’ Theorem, Posterior probability = **ab/ [ab+c(1-a)]**

So let’s apply Bayes’ Theorem to the case of the shattered shop window. Let’s start with a. Well, you have known her for years, and it is totally out of character, although she does live just a stone’s throw from the shop, and it is her day off work, so she could in principle have done it. Let’s say 5% (0.05). Assigning the prior probability is fraught with problems, however, as awareness of the new information might easily affect the way you assess the prior information. You need to make every effort to estimate this probability as it would have been before you received the new information. You also have to be precise as to the point in the chain of evidence at which you establish the prior probability.

What about b? This is the probability of the new evidence if the hypothesis was true. What is the hypothesis? That your friend broke the window. What is the new evidence? That the police officer has identified your friend as the person who smashed the window. So b is an estimate of the probability that the police officer would have identified your friend if she was indeed guilty. If she threw the brick, it’s easy to imagine how she came to be identified by the police officer. Still, he wasn’t close enough to catch the culprit at the time, which should be borne in mind. Let’s say that the probability he would identify her, given that she is guilty, is 80% (0.8).

Let’s move on to c. This is the probability of the new evidence if the hypothesis was false. What is the hypothesis again? That your friend broke the window. What is the new evidence again? That the police officer has identified your friend as the person who did it. So c is an estimate of the probability that the police officer would have identified her if she was not the guilty party, i.e. a false identification. If your friend didn’t shatter the window, how likely is the police officer to have wrongly identified her when he saw her in the street later that day? It is possible that he would see someone of similar age and appearance, wearing similar clothes, and jump to the wrong conclusion, or he may just want to identify someone to advance his career. Let us estimate the probability as 15% (0.15).

Once we’ve assigned these values, Bayes’ Theorem can be applied to establish a posterior probability. This is the number that we’re interested in. It is the measure of how likely it is that your friend broke the window, given that she’s been identified as the culprit by the police officer and charged on the basis of this evidence.

Given these estimates, we can use Bayes’ Theorem to update our probability that our friend is guilty to 21.9%, despite assigning a reliability of 80% to the police officer’s identification.

The most interesting takeaway from this application of Bayes’ Theorem is the relatively low probability you should assign to the guilt of your friend even though you were 80% sure that the police officer would identify her if she was guilty, and the small 15% chance you assigned that he would falsely identify her. The clue to the intuitive discrepancy is in the prior probability (or ‘prior’) you would have attached to the guilt of your friend before you were met face to face with the charge based on the evidence of the police officer. If a new piece of evidence now emerges (say a second witness), you should again apply Bayes’ Theorem to update to a new posterior probability, gradually converging, based on more and more pieces of evidence, ever nearer to the truth.
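The idea of repeated updating can be sketched as follows. The first update uses the estimates from the story. The second witness is entirely hypothetical: for illustration only, it is assumed that this witness is as reliable as the police officer (b = 0.8, c = 0.15) and that the two identifications are independent of each other, given guilt or innocence.

```python
def posterior(a, b, c):
    """Bayes' Theorem in the a, b, c form: ab / [ab + c(1 - a)]."""
    numerator = a * b
    return numerator / (numerator + c * (1 - a))

# First update: the police officer's identification (estimates from the story).
after_officer = posterior(a=0.05, b=0.8, c=0.15)

# Second update: a hypothetical second witness, assumed equally reliable and
# independent. Yesterday's posterior becomes today's prior.
after_second_witness = posterior(a=after_officer, b=0.8, c=0.15)

print(f"{after_officer:.3f}")         # 0.219
print(f"{after_second_witness:.3f}")  # 0.600
```

Under these assumed figures, a second identification would lift the probability of guilt from about 22% to about 60%, which shows how quickly the estimate can move as evidence accumulates.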

It is, of course, all too easy to dismiss the implications of this hypothetical case on the grounds that it was just too difficult to assign reasonable probabilities to the variables. But that is what we do implicitly when we don’t assign numbers. Bayes’ Theorem is not at fault for this in any case. It will always correctly update the probability of a hypothesis being true whenever new evidence is identified, based on the estimated probabilities. In some cases, such as the crime case illustrated here, that is not easy, though the approach you adopt to revising your estimate will always be better than using intuition to steer a path to the truth.
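One way to handle uncertainty about the inputs is to see how much the answer moves when they change. The sketch below holds b = 0.8 and c = 0.15 at the values used in the story and varies only the prior; the alternative priors are arbitrary values picked for illustration.

```python
def posterior(a, b, c):
    """Bayes' Theorem in the a, b, c form: ab / [ab + c(1 - a)]."""
    numerator = a * b
    return numerator / (numerator + c * (1 - a))

# Candidate priors: 0.05 is the estimate from the story, the rest are
# illustrative alternatives.
priors = (0.01, 0.05, 0.10, 0.25, 0.50)
table = {prior: posterior(prior, b=0.8, c=0.15) for prior in priors}

for prior, updated in table.items():
    print(f"prior {prior:.2f} -> posterior {updated:.3f}")
```

The posterior rises steadily with the prior, so a reader who disagrees with the 5% starting point can see directly what their own estimate would imply.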

In many other cases, we do know with precision what the key probabilities are, and in those cases we can use Bayes’ Theorem to identify with precision the revised probability based on the new evidence, often with startlingly counter-intuitive results. In seeking to steer the path from ignorance to knowledge, the application of Bayes is always the correct method.

**Appendix**

The calculation and the simple algebraic expression that we have identified in this setting is:

ab/[ab+c(1-a)]

a is the prior probability of the hypothesis (she’s guilty) being true. This is more traditionally represented by the notation P(H). In the example, a = 0.05.

b is the probability the police officer identifies her conditional on the hypothesis being true, i.e. she’s guilty. This is more traditionally represented by the notation P(E|H), i.e. the probability of E (the evidence) given that the hypothesis H is true. In the example, b = 0.8.

c is the probability the police officer identifies her conditional on the hypothesis not being true, i.e. she’s not guilty. This is more traditionally represented by the notation P(E|H’), i.e. the probability of E (the evidence) given that the hypothesis is false (H’). In the example, c = 0.15.

In our example, a = 0.05, b = 0.8, c = 0.15

Using Bayes’ Theorem, the updated (posterior) probability that the friend is guilty is:

ab/[ab+c(1-a)] = (0.05 × 0.8) / [0.05 × 0.8 + 0.15 × (1 - 0.05)] = 0.04/(0.04 + 0.1425) = 0.04/0.1825

Posterior probability = 0.219 = 21.9%