
The Bertrand’s Box Paradox – in a nutshell.

You are presented with three identical boxes. You are made aware that one of the boxes contains two gold coins, another contains two silver coins, and the third contains one gold coin and one silver coin. You do not know which box contains which.

Now, choose a box at random. Reach without looking under the cloth covering the coins and take out one of the coins. Now you can look. It is gold.

So you can be sure that the box you chose cannot be the box containing the two silver coins. It must be either the box containing two gold coins or the box containing one gold coin and one silver coin.

Withdrawing the gold coin from the box doesn’t provide you with the information to identify which of these two boxes it is. So the other coin must either be a gold coin or a silver coin.

Given what you now know, what is the probability the other coin in the box is also gold, and what odds would you take to bet on it?

This is essentially the so-called ‘Bertrand’s Box’ paradox, first proposed by Joseph Bertrand in 1889 in his opus, ‘Calcul des probabilités’.

After withdrawing the gold coin, there are only two boxes left. One is the box containing the two gold coins and the other is the box containing one gold and one silver coin. It seems intuitively clear that each of these boxes is equally likely to be the one you chose at random, and that therefore the chance it is the box with two gold coins is 1/2, and the chance that it is the box containing one gold and one silver coin is also 1/2. Therefore, the probability that the other coin is gold must be 1/2.

This sounds right, but is this right?

Well, let’s look a little closer. There are three equally likely scenarios that might have led to you choosing that shiny gold coin. Let us separately label all the coins in the boxes to make this clear.

In the box containing two gold coins, there will be Gold Coin 1 and Gold Coin 2. These are both gold coins but they are distinct, different coins.

In the box containing the gold and silver coins, we have Gold Coin 3, which is a different coin to Gold Coin 1 and Gold Coin 2. There is also what we might label Silver Coin 3 in the box with Gold Coin 3. This silver coin is distinct and different to what we might label Silver Coin 1 and Silver Coin 2, which are in the box containing two silver coins, which was not selected.

So here are the equally likely scenarios when you withdrew a gold coin from the box.

  1. You chose Gold Coin 1.
  2. You chose Gold Coin 2.
  3. You chose Gold Coin 3.

You do not know which of these gold coins you withdrew from the box.

If it was Gold Coin 1, the other coin in the box is also gold.

If it was Gold Coin 2, the other coin in the box is also gold.

If it was Gold Coin 3, the other coin in the box is silver.

Each of these possible scenarios is equally likely (i.e. each has a probability of being the true state of the world of 1/3), so the probability that the other coin is gold is 2/3 and the probability that the other coin is silver is 1/3. So, if you are offered even money about the other coin being gold, the edge is very much with you.
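The counting argument above lends itself to a quick Monte Carlo check. The sketch below (in Python, with an arbitrary fixed seed) draws a box and a coin at random many times and, among the draws where the first coin is gold, counts how often the other coin is gold too:

```python
import random

# Boxes: two gold, one of each, two silver.
boxes = [("gold", "gold"), ("gold", "silver"), ("silver", "silver")]

rng = random.Random(42)   # fixed seed so the run is reproducible
gold_first = 0            # draws where the first coin out is gold
gold_other = 0            # of those, cases where the other coin is also gold

for _ in range(100_000):
    box = list(rng.choice(boxes))   # pick a box at random
    rng.shuffle(box)                # pick a coin at random from that box
    drawn, other = box
    if drawn == "gold":
        gold_first += 1
        if other == "gold":
            gold_other += 1

estimate = gold_other / gold_first
# estimate comes out close to 2/3, not 1/2
```

Conditioning on the first coin being gold is the crucial step: the simulation discards the silver-first draws, just as the argument discards the box with two silver coins.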

Before withdrawing the gold coin, the chance that the box you had selected was that containing two gold coins was 1/3. By revealing the gold coin, however, you not only excluded the box containing two silver coins but also introduced the new information that you could potentially have chosen a silver coin (if the selected box was that containing one gold and one silver coin) but in fact did not. That made it more likely (twice as likely) that the box you withdrew the gold coin from was that containing the two gold coins than the box containing one gold and one silver coin.

And that is the solution to the Bertrand’s Box paradox.

 

Exercise

You are presented with three identical boxes. You are made aware that one of the boxes contains two gold coins, another contains two silver coins, and the third contains one silver coin and one bronze coin. You do not know which box contains which. Now, choose a box at random and withdraw one of the coins. It is bronze. So the other coin must either be a silver coin or another bronze coin.

Given what you now know, what is the probability the other coin in the box is also bronze?

 

References and Links

Untrammeled Mind. November 9, 2018. Bertrand’s Box Paradox (with and without Bayes’ Theorem). https://www.untrammeledmind.com/2018/11/bertrands-box-paradox/

Steemit. Bertrand’s Box Problem. https://steemit.com/science/@galotta/bertrand-s-box-problem

Zymergi. Bertrand’s Box Paradox. http://blog.zymergi.com/2013/06/bertrands-box-paradox.html

Bertrand’s Box Paradox: the answer is 2/3!!! https://whyevolutionistrue.wordpress.com/2018/02/20/the-answer-is-2-3/

Bertrand’s box paradox. Wikipedia. https://en.m.wikipedia.org/wiki/Bertrand%27s_box_paradox

Bayes in the Courtroom – in a nutshell.


On the 9th of November, 1999, Sally Clark, a 35-year-old solicitor and mother of a young child, was convicted of murdering two of her children. The presiding Judge, Mr. Justice Harrison, declared that “… we do not convict people in these courts on statistics. It would be a terrible day if that were so.” As it turned out, it was indeed a terrible day, for Sally Clark and for the justice system.

The background to the case is that the death of the babies was put down to natural causes, probably SIDS (‘Sudden Infant Death Syndrome’). Later the Home Office pathologist charged with the case became suspicious and Sally Clark was charged with murder and tried at Chester Crown Court. It eventually transpired that essential evidence in her favour had not been disclosed to the defence, but not before a failed appeal in 2000. At a second Appeal, in 2003, she was set free, and the case is now recognised as a huge miscarriage of justice.

So what went wrong?

A turning point in the trial was the evidence given by a key prosecution witness, who argued that the probability of a baby dying of SIDS was 1 in 8,543. So the probability of two babies dying of SIDS was that fraction squared, or 1 in about 73 million. It’s the chance, he argued, “… of backing that long odds outsider at the Grand National … let’s say it’s a 80 to 1 chance, you back the winner last year, then the next year there’s another horse at 80 to 1 and it is still 80 to 1 and you back it again and it wins. Now we’re here in a situation that, you know, to get to these odds of 73 million you’ve got to back that 1 in 80 chance four years running … So it’s the same with these deaths. You have to say two unlikely events have happened and together it’s very, very, very unlikely.”

Perhaps unsurprisingly in face of this interpretation of the evidence, the jury convicted her and she was sentenced to life in prison.

But the evidence was flawed, as anyone with a basic understanding of probability would have been aware. Even assuming that the proposed 1 in 8,543 figure was accurate (there are separate reasons to doubt this), one of the basic laws of probability is that you can only multiply probabilities if they are independent of each other. Multiplying here would be valid only if the cause of death of the first child was totally independent of the cause of death of the second child. There is no reason to believe this: it assumes no genetic, familial or other innocent link between these sudden deaths at all. That is a basic error of classical probability. The other error is much more sinister, in that it is harder for the layman to detect the flaw in the reasoning. It is the ‘Prosecutor’s Fallacy’, a well-known problem in the theory of conditional probability, and in particular in the application of Bayesian reasoning, which is discussed in the context of Bayes’ Theorem elsewhere.

The ‘Prosecutor’s Fallacy’ is to conflate the probability of innocence given the available evidence with the probability of the evidence arising given the fact of innocence. In particular, the following propositions are very different:

  1. The probability of observing some evidence (the dead children) given that a hypothesis is true (here that Sally Clark is guilty).
  2. The probability that a hypothesis is true (here that Sally Clark is guilty) given that we observe some evidence (the dead children).

These are totally different propositions, the probabilities of which can and do diverge widely.

Notably, the probability of the former proposition is much higher than of the latter. Indeed, the probability of the children dying given that Sally Clark is a child murderer is effectively 1 (100%). However, the probability that she is a child murderer given that the children have died is a whole different picture.

Critically, we need to consider the prior probability that she would kill both babies, i.e. the probability that she would kill her children, before we are given this evidence of sudden death. This is the concept of ‘prior probability’, which is central to Bayesian reasoning. This prior probability must not be viewed through the lens of the later emerging evidence. It must be established on its own merits and then merged through what is known as Bayes’ Theorem with the new evidence.

In establishing this prior probability, we need to ask whether there was any other past indication or evidence to suggest that she was a child murderer, as the number of mothers who murder their children is almost vanishingly small. Without such evidence, the prior probability of guilt should correspond to something like the proportion of mothers in the general population who serially kill their children. This prior probability of guilt is close to zero. In order to update the probability of guilt, given the evidence of the dead children, the jury needs to weigh up the relative likelihood of the two competing explanations for the deaths. Which is more likely? Double infant murder by a mother or double SIDS. In fact, double SIDS is hugely more common than double infant murder. That is not a question that the jury, unversed in Bayesian reasoning or conditional probability, seems to have asked themselves. If they did, they reached the wrong conclusion.
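The update the jury needed to perform can be sketched with Bayes’ Theorem. The numbers below are purely illustrative assumptions of my own (a prior of 1 in 1,000,000 for double murder and 1 in 100,000 for double SIDS), chosen only to show the mechanics, not figures from the case:

```python
# Illustrative Bayesian update for the Sally Clark reasoning.
# ALL numbers here are hypothetical assumptions, used only to show the mechanics.
prior_murder = 1 / 1_000_000        # assumed prior P(H): mother murders both children
p_deaths_if_murder = 1.0            # P(E|H): two deaths are certain if she killed them
p_deaths_if_innocent = 1 / 100_000  # assumed P(E|H'): probability of double SIDS

posterior = (p_deaths_if_murder * prior_murder) / (
    p_deaths_if_murder * prior_murder
    + p_deaths_if_innocent * (1 - prior_murder)
)
# With these assumptions the posterior probability of guilt is about 9%,
# nothing like the near-certainty suggested by '1 in 73 million'.
```

The point is not the particular numbers but the structure: a tiny prior of guilt is not overwhelmed by evidence that also has an innocent explanation of comparable rarity.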

More generally, it is likely in any large enough population that one or more cases will occur of something which is improbable in any particular case. Out of the entire population, there is a very good chance that some random family will suffer a case of double SIDS. This is no ground to suspect murder, however, unless there was a particular reason why the mother in this particular family was, before the event, likely to turn into a double child killer.

To look at the problem another way, consider the wholly fictional case of Lottie Jones, who is charged with winning the National Lottery by cheating. The prosecution expert gives the following evidence. The probability of winning the Lottery jackpot without cheating, he tells the jury, is 1 in 45 million. Lottie won the Lottery. What’s the chance she could have done so without cheating in some way? So small as to be laughable. The chance is 1 in 45 million. So she must be guilty. Sounds ridiculous put like that, but it is exactly the same sort of reasoning that sent Sally Clark, and sends many other innocent people, to prison in real life.

As in the Sally Clark case, the prosecution witness in this fictional parody committed the classic ‘Prosecutor’s Fallacy’, assuming that the probability that Lottie is innocent of cheating given the evidence (she won the Lottery) was the same thing as the probability of the evidence (she won the Lottery) given that she didn’t cheat. The former is much higher than the latter, unless we have some other indication that Lottie has cheated to win the Lottery. Once again, it is an example of how it is likely that in any large enough population one or more cases will occur of something which is improbable in any particular case. The probability that needed to be established in the Lottie case was the probability that she would win the Lottery before she did. If she is innocent, that probability is 1 in tens of millions. The fact that she did, in fact, win the Lottery does not change that.

Lottie just got very, very lucky. Just as Sally Clark got very, very unlucky.

Sally Clark never recovered from the trauma of losing her children and spending years in prison falsely convicted of killing them. She died on 16th March, 2007, of acute alcohol intoxication.

 

Exercise

What is the Prosecutor’s Fallacy, using an equation or equations to illustrate your answer? How might this fallacy lead to false convictions?

 

References and Links

Scheurer, V. Understanding Uncertainty. Convicted on Statistics? https://understandinguncertainty.org/node/545

Joyce, H. (2002). Beyond Reasonable Doubt. +Plus Magazine. Sept. 1. https://plus.maths.org/content/beyond-reasonable-doubt

Centre for Evidence-Based Medicine. (2018). The Prosecutor’s Fallacy. July 19. https://www.cebm.net/2018/07/the-prosecutors-fallacy/

Fenton, N., Neil, M. and Berger, D. (2016). Bayes and the Law. Annual Review of Statistics and Its Application. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934658/


McGrayne, S.B. Simple Bayesian Problems. The Sally Clark Case. http://www.mcgrayne.com/disc.htm

Using Statistical Evidence in Courts. A Case Study. http://ben-israel.rutgers.edu/711/Sally_Clark.pdf

Brown, R.J. Sally Clark – What went wrong? http://www.mathestate.com/Sally%20Clark%20-%20What%20went%20wrong.pdf

In the Dark. Archive for Sally Clark. https://telescoper.wordpress.com/tag/sally-clark/

Mike Disney. Cot Deaths, Bayes’ Theorem and Plain Thinking. http://www2.geog.ucl.ac.uk/~mdisney/teaching/GEOGG121/bayes/COT%20DEATHS.doc

Statistical Methods 2013. Sally Clark Case. h02_sally_clark_cot_death.pdf

Coursera. The Sad Story of Sally Clark. https://www.coursera.org/lecture/introductiontoprobability/the-sad-story-of-sally-clark-bII6g

Sally Clark. Wikipedia. https://en.m.wikipedia.org/wiki/Sally_Clark

Bayes and the Othello Problem – in a nutshell.

The majestic tragedy, Othello, was written by William Shakespeare in about 1603. The play revolves around four central characters: Othello, a Moor who is a General in the Venetian army; his beloved wife, Desdemona; his loyal lieutenant, Cassio; and his trusted ensign, Iago.

A key element of the play is Iago’s plot to convince Othello that Desdemona is conducting an affair with Cassio, by planting in Cassio’s lodgings a treasured keepsake that Othello gave to Desdemona, for Othello ‘accidentally’ to come upon.

We playgoers know she is not cheating on him, as does Iago, but Othello, while reluctant to believe it of Desdemona, is also very reluctant to believe that Iago could be making it up.

If Othello refuses to contemplate any possibility of betrayal, then we would have a play in which no amount of evidence, however overwhelming, including finding them together, could ever change his mind. We would have a farce or a comedy instead of a tragedy.

A shrewder Othello would concede that there is at least a possibility that Desdemona is betraying him, however small that chance might be. This means that there does exist some level of evidence, however great it would need to be, that would leave him no alternative. If his prior trust in Desdemona is almost, but not absolutely total, then this would permit of some level of evidence, logically incompatible with her innocence, changing his mind. This might be called ‘Smoking Gun’ evidence.

On the other hand, Othello might adopt a more balanced position, trying to assess the likelihood objectively and without emotion. But how? Should he try and find out the proportion of female Venetians who conduct extra-marital affairs? This would give him the probability for a randomly selected Venetian woman but no more than that. Hardly a convincing approach when surely Desdemona is not just an average Venetian woman. So should he limit the reference class to women who are similar to Desdemona? But what does that mean?

And this is where it is easy for Othello to come unstuck. Because it is so difficult to choose a prior probability (as Bayesians would term it), the temptation is to assume that since it might or might not be true, the likelihood is 50-50. This is known as the ‘Prior Indifference Fallacy’. Once Othello falls victim to this common fallacy, any evidence against Desdemona now becomes devastating. It is the same problem as that facing the defendant in the dock.

Extreme, though not blind, trust is one way to avoid this mistake. But an alternative would be to find evidence that is logically incompatible with Desdemona’s guilt, in effect the opposite of the ‘Smoking Gun.’ The ‘Perfect Alibi’ would fit the bill.

Perhaps Othello would love to find evidence that is logically incompatible with Desdemona conducting an affair with Cassio, but holds her guilty unless he can find it. He needs evidence that admits no True Positives.

Lacking extreme trust and a Perfect Alibi, what else could have saved Desdemona?

To find the answer, we can turn to Bayes’ Theorem.

The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:

ab/[ab+c(1-a)]

a is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence. b is the probability you would see the new evidence if the hypothesis is true. c is the probability you would see the new evidence if the hypothesis is false.

In the case of the Desdemona problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio. Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.

So a = 0.04

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%. There’s quite a good chance she would secretly hand Cassio the keepsake as proof of her love for him and not of Othello.

So b = 0.5

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%. Why would it be there if Desdemona had not been to his lodgings secretly, and why would she take the keepsake along in any case? It could have been stolen and ended up there, but how likely is that?

So c = 0.05

Substituting into Bayes’ equation gives:

Posterior probability = ab/[ab+c(1-a)] = 0.294.

So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%, worryingly high for the tempestuous Moor but perhaps low enough to prevent tragedy. The power of Bayes here lies in demonstrating to Othello that the finding of the keepsake in the living quarters of Cassio might have only a 1 in 20 chance of occurring if Desdemona is innocent, but in the bigger picture there is less than a 3 in 10 chance that she actually is culpable.

If this is what Othello concludes, the task of the evil Iago is to lower c in the eyes of Othello, by arguing that the chance of the keepsake ending up with Cassio without a nefarious reason is so astoundingly small that 1 in 100 is nearer the mark than 1 in 20. In other words, to convince Othello to lower his estimate of c from 0.05 to 0.01.

The new Bayesian probability of Desdemona’s guilt now becomes:

ab/[ab+c(1-a)]

a = 0.04 (the prior probability of Desdemona’s guilt, as before)

b = 0.5 (as before)

c = 0.01 (down from 0.05)

Substituting into Bayes’ equation gives:

New probability = 0.676 = 67.6%.
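Both calculations can be checked with a short helper implementing the a,b,c formula used above (a minimal Python sketch):

```python
def posterior(a, b, c):
    """Bayes' a,b,c formula: ab / [ab + c(1 - a)]."""
    return (a * b) / (a * b + c * (1 - a))

# Othello's first assessment: a = 0.04, b = 0.5, c = 0.05
first = posterior(0.04, 0.5, 0.05)   # ≈ 0.294
# After Iago talks c down from 0.05 to 0.01
second = posterior(0.04, 0.5, 0.01)  # ≈ 0.676
```

Note how sensitive the posterior is to c: cutting the innocent explanation from 1 in 20 to 1 in 100 more than doubles the probability of guilt.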

So, if Othello can be convinced that 5% is too high a probability that there is an innocent explanation for the appearance of the keepsake in Cassio’s lodgings – let’s say he’s persuaded by Iago that the true probability is 1% – then Desdemona’s fate, as that of many a defendant whom a juror thinks has more than a 2 in 3 chance of being guilty, is all but sealed. Her best hope now is to try and convince Othello that the chance of the keepsake being found in Cassio’s place if she were guilty is much lower than 0.5. For example, she could try a common sense argument that there is no way that she would take the keepsake if she were actually having an affair with Cassio, nor be so careless as to leave it behind. In other words, she could argue that the presence of the keepsake where it was found actually provides testimony to her innocence. In Bayesian terms, she should try to reduce Othello’s estimate of b. What level of b would have prevented tragedy? That is another question.

William Shakespeare wrote Othello about a hundred years before the Reverend Thomas Bayes was born. That is true. But to my mind the Bard was always, in every inch of his being, a true Bayesian. Othello was not, and therein lies the tragedy.

 

Appendix

In the case of the Othello problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio. Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.

So P(H) = 0.04

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%.

So P(E|H) = 0.5

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%.

So P(E|H’) = 0.05

Substituting into Bayes’ Theorem:

P(H|E) = P(E|H) × P(H) / [P(E|H) × P(H) + P(E|H’) × P(H’)]

P(H|E) = 0.5 × 0.04 / [0.5 × 0.04 + 0.05 × 0.96]

P(H|E) = 0.02 / [0.02 + 0.048] = 0.294

Posterior probability = 0.294.

So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%.

If P(E|H’) = 0.01

The new Bayesian probability of Desdemona’s guilt now becomes:

P(H|E) = 0.5 × 0.04 / [0.5 × 0.04 + 0.01 × 0.96]

P(H|E) = 0.02 / (0.02 + 0.0096) = 0.02 / 0.0296 = 0.676

Updated probability = 0.676 = 67.6%.

 

Exercise

Othello believes that there is just a 5 per cent chance that his wife, Desdemona, would be unfaithful to him. Until he comes upon one of her treasured keepsakes in Cassio’s lodgings. He believes that there is a fifty-fifty chance that he would have come across this keepsake if Desdemona was conducting an affair with Cassio, but just a one in ten chance he would have come across it if she was not.

On the basis of these estimates, what is the probability that Othello should assign to Desdemona being unfaithful to him? A reduction in which subjective probabilities would reduce the overall probability that Othello would assign to Desdemona being unfaithful to him?

 

References and Links

Iago’s trick

Bayes and the Bobby Smith Problem – in a nutshell.


Bobby Smith, aged 8, is a good schoolboy basketball player, but you know that only one in a thousand such 8-year-olds go on to become professional players. So you would like to get an unbiased assessment of his real chance of developing into a top player. A coach tells you there is a test, taken by all good 8-year-old players, that can measure the child’s potential. If the test were perfect, everyone who received an A+ on the test would go on to become a pro player. In fact, it is 95% accurate, in the sense that 5% of those taking the test will receive an A+ score and fail to become professional basketball players. Still, this is a very small percentage. Unfortunately, though, anyone failing to score A+ has no chance of becoming a pro player.

Bobby takes the test and is graded A+.

So what is the actual chance that Bobby will become a professional basketball player?

If you are like most people, you will think the chance is very high.

This is your reasoning: I don’t really know whether Bobby is likely to turn into a professional player or not. But he has taken this test. In fact, no professional player could have scored below A+, and the test only very rarely allocates a top grade to a child who will not become a professional basketball player. If the test is really this good, therefore, it looks like Bobby will have a bright future as a basketball star.

Is this true? Think of it this way. If there were no test, you would have asked the coach a very basic question: in your experience, what is the chance that Bobby will become a professional player? The coach would have dampened your enthusiasm: one in a thousand, he would have said. But with the test result in hand, there’s no need to ask this question. It’s irrelevant in the face of a very accurate test result, isn’t it?

In fact, this is a well-known fallacy, another example of the Inverse Fallacy, or Prosecutor’s Fallacy. The fallacy is to confuse the probability of a hypothesis being true, given some evidence, with the probability of the evidence arising given the hypothesis is true.

In our example, the hypothesis is that Bobby will become a professional player, and the evidence is the high test score. What we want to know is the probability that Bobby will become a pro player, given that the test says he will be. What we know, on the other hand, is the probability that Bobby will score A+ on the test, given that he will become a professional player. The coach told you that this probability is 100%: all future professional players will score A+ on the test. In answering your other question, the coach also told you that 5% of those taking the test will score A+ yet fail to progress to the professional game. This is a small percentage. So you take this information and conclude that Bobby is very likely to turn into a top player.

In fact, of the thousand children who took the test, only one (statistically speaking) will become a professional player. The test for an A+ is 95% accurate in identifying a future pro player, in the sense that 5% of the 1,000 children will score A+ and not become professional players, i.e. there will be 50 ‘false positives.’ Anyone who will become a pro basketball player, on the other hand, will score A+ on the test.

So what is the chance that Bobby will become a professional basketball player if he scores A+ on the test?

Solution: 50 children who will not become professional basketball players score A+ (the 50 ‘false positives’). Only one of the one thousand eight-year-olds who take the test develops into a professional player, and that child will score A+. Look at it this way. A thousand 8-year-olds take the test and of these 50 of them will receive a letter telling them they have scored A+ on the test but will not develop into top players. One child will receive a letter with a score of A+ and actually will go on to become a professional player. Therefore the probability Bobby will become a top basketball player if he scores A+ is just 1 in 51, i.e. 1.96%.
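The counting in the solution above can be reproduced directly (a minimal sketch, using the figures from the text):

```python
# Count the A+ scorers among 1,000 good 8-year-olds taking the test.
cohort = 1000
future_pros = 1                      # 1 in 1,000 becomes a professional
true_positives = future_pros         # every future pro scores A+
false_positives = cohort * 5 // 100  # 5% score A+ yet never turn pro

p_pro_given_aplus = true_positives / (true_positives + false_positives)
# 1 / 51, i.e. about 1.96%
```

The denominator is all 51 A+ letters sent out; only one of them goes to a future professional.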

This is a similar idea to the medical ‘false positives’ problem.

In the equivalent flu version of the problem, a thousand people go to the doctor and all are tested for flu. Only one actually has the flu, and those with the flu always test positive. We know that the test for flu is 95% accurate, in the sense that 5% of the 1,000 people will test positive and not have the flu, i.e. there will be 50 ‘false positives’. So what is the chance someone has the flu if they test positive? In this case, 50 people who do not have the flu test positive, and one person who has the flu tests positive. Therefore, the probability you have the flu if you test positive is 1 in 51, i.e. 1.96%.

We can also solve the Bobby Smith problem using Bayes’ Theorem. The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:

ab/[ab+c(1-a)]

a is the prior probability, i.e. the probability that a hypothesis is true before the new evidence. b is the probability of the new evidence if the hypothesis is true. c is the probability of the new evidence if the hypothesis is false.

In the case of the Bobby Smith problem, the hypothesis is that Bobby will develop into a professional player.

Before the new evidence (the test), this chance is 1 in 1000 (0.001)

So a = 0.001

The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.

So b =1

The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional players.

So c = 0.05

Substituting into Bayes’ equation gives:

Posterior probability = ab/[ab+c(1-a)] = 0.001 × 1 / [0.001 × 1 + 0.05 × (1 – 0.001)] = 0.0196

So, using Bayes’ Theorem, the chance that Bobby Smith, who scored A+ on the test which is 95% accurate, will actually become a top player, is not 95% as intuition might suggest, but just 1.96%, as we have shown previously by a different route.

There is, therefore, just a 1.96 per cent chance that Bobby Smith will go on to become a professional basketball player, despite scoring A+ on that very accurate test of player potential.

That’s the statistics, the cold Bayesian logic. Now for the good news. Bobby Smith was the lucky one. He currently plays for the New York Knicks under a different name.

 

Appendix

We can also solve the Bobby Smith problem using the traditional notation version of Bayes’ Theorem.

P(H|E) = P(E|H) × P(H) / [P(E|H) × P(H) + P(E|H’) × P(H’)]

Before the new evidence (the test), this chance is 1 in 1000 (0.001)

So P(H) = 0.001

The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.

So P(E|H) = 1

The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional players.

So P(E|H’) = 0.05

Substituting into Bayes’ equation gives:

P(H|E) = 0.001 × 1 / [0.001 × 1 + 0.05 × (1 – 0.001)] = 0.0196

 

Exercise

Lucy Jones, aged 10, is a good school tennis player, but you know that only one in a thousand such 10-year-olds go on to become professional players. So you would like to get an unbiased assessment of her real chance of developing into a top player. A coach tells you there is a test, taken by all good 10-year-old tennis players, that can measure the child’s potential. The test, you learn, is 98 per cent accurate in identifying future professional tennis players, and these always receive a grade of A+.

Lucy takes the test and is graded A+.

How many of the 10-year-olds tested who get an A+ fail to develop into top players, you ask? Now the coach imparts the good news. All professional players score A+ on the test as 10-year-olds, and we can take it that anyone who scores below that can be ruled out as a future professional player. And the test is 98 per cent accurate, so only 2 per cent of those who take the test will get the A+ grade and fail to develop into professional players. So what is the actual chance that Lucy will become a professional tennis player?

 

References and Links

Is your child a football star?

How rare is the specimen? A Bayesian puzzler.


Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.

An entomologist spots what might be a rare category of beetle, due to the pattern on its back. In the rare category, 98% have the pattern. In the common category, only 5% have the pattern. The rare category accounts for only 0.1% of the population. How likely is the beetle to be rare?

Since only 5 per cent of the common beetles bear the distinctive pattern and 98 per cent of the rare beetles do, intuition would tell you that you have come across a rare insect when you espy the pattern. Bayes’ Theorem tells you something quite different.

To calculate just how likely the beetle is to be rare given that we see the pattern on its back, we apply Bayes’ Theorem.

Posterior probability = ab/ [ab+c (1-a)]

a is the prior probability of the hypothesis (beetle is rare) being true. b is the probability we observe the pattern if the beetle is rare (hypothesis is true). c is the probability we observe the pattern if the beetle is not rare (hypothesis is false).

In this case, a = 0.001 (0.1%); b = 0.98 (98%); c = 0.05 (5%).

So, updated probability = ab/ [ab+c (1-a)] = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.

Why the counterintuitive result? Because so few of the population of all beetles are rare, i.e. the prior probability that the beetle is rare is almost vanishingly small, and it would take a lot more evidence than that acquired to make a reasonable case for the beetle being rare.

So what is the probability that the beetle is rare given that we observe the distinctive pattern? In other words, what is the probability that the hypothesis (the beetle is rare) is true given the evidence (the pattern). That is 1.92 per cent. What is the probability that we will observe the distinctive pattern if the beetle is rare? In other words, what is the probability of observing the evidence (the pattern) if the hypothesis (the beetle is rare) is true. That is 98 per cent.

To conflate these, to believe these two concepts are the same, is to commit the classic Prosecutor’s Fallacy, i.e. to falsely equate the probability that the defendant is guilty given the observed evidence with the probability of observing the evidence given that the defendant is guilty. It’s a potentially very dangerous fallacy to commit, especially when you happen to be the defendant and the jury has never heard of the Reverend Thomas Bayes.

Appendix

We can also solve the Beetle problem using the traditional notation version of Bayes’ Theorem.

P(H|E) = P(E|H) x P(H) / [P(E|H) x P(H) + P(E|H’) x P(H’)]

In this case, P(H) = 0.001 (0.1%); P(E|H) = 0.98 (98%); P(E|H’) = 0.05 (5%).

So, P(H|E) = 0.98 x 0.001 / [0.98 x 0.001 + 0.05 x 0.999] = 0.00098 / (0.00098 + 0.04995) = 0.00098 / 0.05093 = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.

Note also that P(H|E) = 0.0192, while P(E|H) = 0.98.

The Prosecutor’s Fallacy is to conflate these two expressions.
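The appendix arithmetic can be reproduced in a few lines (a sketch; the `posterior` function name is ours):

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    # P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|H')P(H')]
    num = p_e_given_h * p_h
    return num / (num + p_e_given_not_h * (1 - p_h))

# Beetle: P(H) = 0.001, P(E|H) = 0.98, P(E|H') = 0.05
print(round(posterior(0.001, 0.98, 0.05), 4))  # 0.0192
```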

Exercise

An entomologist spots what might be a rare category of beetle, due to the pattern on its back. In the rare category, 95% have the pattern. In the common category, only 2% have the pattern. The rare category accounts for only 0.5% of the population. How likely is the beetle to be rare?

References and Links

CS201 – Bayes’ Theorem – Excerpts from Wikipedia


Jeff Thompson. Bayes’ Theorem. November 20, 2011. https://www.jeffreythompson.org/blog/2011/11/20/bayes-theorem/

Is your friend guilty? A case for Bayes.


Let us invent a little crime story in which you are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police investigation into a case of vandalism of a shop window in a street adjoining where you know she lives. It took place at noon that day, which you know is her day off work. You had heard about the incident earlier but had no good reason at the time to believe that your friend was in any way linked to it.

She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity. You must evaluate the probability that she did commit the offence before deciding how to advise her. So the condition is that she has been charged with criminal damage; the hypothesis you are interested in evaluating is the probability that she did it. Bayes’ Theorem, of course, helps to answer this type of question.

There are three things to estimate. The first is the Bayesian prior probability (which we represent as ‘a’). This is the probability you assign to the hypothesis being true before you become aware of the new information. In this case, it means the probability you would assign to your friend breaking the shop window immediately before you got the new information from her on the telephone that she had been charged on the basis of the witness evidence.

The second is the probability that the new evidence would have arisen if the hypothesis was true (which we represent as ‘b’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend actually did break the window.

The third is to estimate the probability that the new evidence would have arisen if the hypothesis was false (which we represent as ‘c’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend did NOT break the window.

According to Bayes’ Theorem, Posterior probability = ab/ [ab+c(1-a)]

So let’s apply Bayes’ Theorem to the case of the shattered shop window. Let’s start with a. Well, you have known her for years, and it is totally out of character, although she does live just a stone’s throw from the shop, and it is her day off work, so she could in principle have done it. Let’s say 5% (0.05). Assigning the prior probability is fraught with problems, however, as awareness of the new information might easily affect the way you assess the prior information. You need to make every effort to estimate this probability as it would have been before you received the new information. You also have to be precise as to the point in the chain of evidence at which you establish the prior probability.

What about b? This is the probability of the new evidence if the hypothesis was true. What is the hypothesis? That your friend broke the window. What is the new evidence? That the police officer has identified your friend as the person who smashed the window. So b is an estimate of the probability that the police officer would have identified your friend if she was indeed guilty. If she threw the brick, it’s easy to imagine how she came to be identified by the police officer. Still, he wasn’t close enough to catch the culprit at the time, which should be borne in mind. Let’s say that the probability he has identified her and that she is guilty is 80% (0.8).

Let’s move on to c. This is the probability of the new evidence if the hypothesis was false. What is the hypothesis again? That your friend broke the window. What is the new evidence again? That the police officer has identified your friend as the person who did it. So c is an estimate of the probability that the police officer would have identified her if she was not the guilty party, i.e. a false identification. If your friend didn’t shatter the window, how likely is the police officer to have wrongly identified her when he saw her in the street later that day? It is possible that he would see someone of similar age and appearance, wearing similar clothes, and jump to the wrong conclusion, or he may just want to identify someone to advance his career. Let us estimate the probability as 15% (0.15).

Once we’ve assigned these values, Bayes’ theorem can now be applied to establish a posterior probability. This is the number that we’re interested in. It is the measure of how likely is it that your friend broke the window, given that she’s been identified as the culprit by the police officer and charged on the basis of this evidence.

Given these estimates, we can use Bayes’ Theorem to update our probability that our friend is guilty to 21.9%, despite assigning a reliability of 80% to the police officer’s identification.

The most interesting takeaway from this application of Bayes’ Theorem is the relatively low probability you should assign to the guilt of your friend even though you were 80% sure that the police officer would identify her if she was guilty, and the small 15% chance you assigned that he would falsely identify her. The clue to the intuitive discrepancy is in the prior probability (or ‘prior’) you would have attached to the guilt of your friend before you were met face to face with the charge based on the evidence of the police officer. If a new piece of evidence now emerges (say a second witness), you should again apply Bayes’ Theorem to update to a new posterior probability, gradually converging, based on more and more pieces of evidence, ever nearer to the truth.

It is, of course, all too easy to dismiss the implications of this hypothetical case on the grounds that it was just too difficult to assign reasonable probabilities to the variables. But that is what we do implicitly when we don’t assign numbers. Bayes’ Theorem is not at fault for this in any case. It will always correctly update the probability of a hypothesis being true whenever new evidence is identified, based on the estimated probabilities. In some cases, such as the crime case illustrated here, that is not easy, though the approach you adopt to revising your estimate will always be better than using intuition to steer a path to the truth.

In many other cases, we do know with precision what the key probabilities are, and in those cases we can use Bayes’ Theorem to identify with precision the revised probability based on the new evidence, often with startlingly counter-intuitive results. In seeking to steer the path from ignorance to knowledge, the application of Bayes is always the correct method.

Appendix

The calculation and the simple algebraic expression that we have identified in this setting is:

ab/[ab+c(1-a)]

a is the prior probability of the hypothesis (she’s guilty) being true. This is more traditionally represented by the notation P(H). In the example, a = 0.05.

b is the probability the police officer identifies her conditional on the hypothesis being true, i.e. she’s guilty. This is more traditionally represented by the notation P(E|H), i.e. the probability of E (the evidence) given that the hypothesis, H, is true. In the example, b = 0.8.

c is the probability the police officer identifies her conditional on the hypothesis not being true, i.e. she’s not guilty. This is more traditionally represented by the notation P(E|H’), i.e. the probability of E (the evidence) given that the hypothesis is false (H’). In the example, c = 0.15.

In our example, a = 0.05, b = 0.8, c = 0.15

Using Bayes’ Theorem, the updated (posterior) probability that the friend is guilty is:

ab/[ab+c(1-a)] = 0.04/(0.04+ 0.1425) = 0.04/0.1825

Posterior probability = 0.219 = 21.9%
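A minimal check of this calculation (a sketch; variable names follow the a, b, c notation above):

```python
a = 0.05   # prior probability of guilt
b = 0.80   # probability of identification if guilty
c = 0.15   # probability of identification if not guilty

posterior = (a * b) / (a * b + c * (1 - a))
print(round(posterior, 3))  # 0.219
```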

Exercise

You are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police investigation into a case of vandalism of a shop window in a street adjoining where you know she lives. It took place at noon that day, which you know is her day off work. You had heard about the incident earlier but had no good reason at the time to believe that your friend was in any way linked to it.

She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity. You must evaluate the probability that she did commit the offence before deciding how to advise her.

What three probabilities do you need to estimate in order to use Bayes’ Theorem to evaluate this probability?

References and Links

Murder Cases, Evidence and Logical Rigor. http://ucanalytics.com/blogs/logical-rigor/

The Reverend Bayes Investigates.


A murder has been committed. There are five suspects, all of whom we consider equally likely to be guilty at the start of the investigation. We know that one of these suspects is the guilty party, and we know that whoever it was acted alone.

So 20 per cent is the prior probability of guilt for each of the five possible killers, before any new evidence is found. The names of the suspects are: Reverend Green, Colonel Mustard, Miss Scarlett, Professor Plum and Mrs. Peacock. The codename for the murder investigation is Operation Cluedo. The victim was Sir Caliban Mackenzie, a famed anthropologist, who was shot in the library while examining a rare first edition of Newton’s Principia.

Four hours into the investigation, evidence turns up which eliminates Reverend Green. He was leading the Holy Communion Service in the chapel at the time of the murder. There are now four remaining suspects, and so the probability that each of the remaining four suspects is guilty rises to 25 per cent (one chance in four).

Two hours later, a new clue now arises which casts some doubt on the alibi of Colonel Mustard, whose probability of guilt we now judge to rise from 25 per cent to 40 per cent.

As a result, the probability that one of the other three suspects is guilty falls by 15 percentage points, down from a total of 75 per cent to 60 per cent. Since each of the three is equally likely to be guilty, we can now assign each a probability of guilt of 20 per cent, down from 25 per cent.

After a further 45 minutes, a third clue emerges, which eliminates Mrs. Peacock. She had been spotted by a number of very reliable witnesses at the Communion service in the chapel along with Reverend Green.

The big question is how we should now adjust the probabilities that Colonel Mustard, Miss Scarlett and Professor Plum pulled the trigger?

In other words, now that Mrs. Peacock has been eliminated, and taking account of the evidence which doubled the original likelihood that Colonel Mustard wielded the murder weapon (to 40 per cent), what is the best estimate of the revised probability that each of Mustard, Scarlett and Plum committed the murder?

Solution

One possibility would be to take the 20 per cent probability of guilt we had previously attached to Mrs. Peacock, and divide this equally between the three remaining suspects.

But to do so would be wrong, and notably at variance with the toolkit of a Bayesian detective, i.e. a detective who conducts investigations using the Bayesian approach to evidence and probability.

The Bayesian approach to detective work tells us always to consider the prior probability that each suspect is guilty before updating the probability after some new evidence is brought to bear on it. Applying this method, the correct way to adjust the probabilities attached to the remaining suspects is to do so in a way that is proportional to their prior probability of guilt before Mrs. Peacock was eliminated from the enquiry.

Since Colonel Mustard was the prime suspect, with a probability of guilt of 40 per cent before Peacock’s elimination (compared to 20 per cent for Miss Scarlett and Professor Plum), a good Bayesian needs to increase the probability we assign to his guilt by twice as much as we increase theirs. So we should now raise the estimate of the probability that Colonel Mustard shot Sir Caliban from 40 per cent to 50 per cent, while we should increase the probability we assign to Miss Scarlett and Professor Plum from 20 per cent to 25 per cent.

This is all derived from Bayes’ Theorem, which tells us that in order to calculate the probability of a hypothesis being true given new evidence, we must filter this evidence through the baseline of the probability of the hypothesis being true before we became aware of the new evidence (Mrs. Peacock’s elimination from the enquiry). This prior probability was twice as big for Colonel Mustard as for either of the other remaining suspects.
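The renormalisation step can be sketched in a few lines of Python (the dictionary and variable names are ours):

```python
# Priors just before Mrs. Peacock is eliminated
priors = {"Mustard": 0.40, "Scarlett": 0.20, "Plum": 0.20, "Peacock": 0.20}

# Peacock is ruled out; the survivors are rescaled in proportion to their priors
priors.pop("Peacock")
total = sum(priors.values())
posteriors = {name: round(p / total, 4) for name, p in priors.items()}
print(posteriors)  # {'Mustard': 0.5, 'Scarlett': 0.25, 'Plum': 0.25}
```

Because Mustard's prior was twice that of Scarlett or Plum, his probability rises by twice as much, exactly as the Bayesian argument requires.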

Epilogue

The estimated 50 per cent probability of guilt was more than sufficient to persuade the Crown Prosecution Service to haul the Colonel before a jury of his peers. In the event the jury convicted him, falling victim to the classic Prosecutor’s Fallacy. Like so many juries before them, they confused the probability that someone is guilty in light of the evidence with the probability of the evidence arising if they were guilty. The likelihood of Sir Caliban being shot in the library if the Colonel was guilty of murder was quite high, and this led to his conviction. Unfortunately for the Colonel, the relevant probability (that he was guilty of murder given that Sir Caliban was shot in the library) was rather smaller but bypassed in the jury’s deliberations.

Meanwhile, the actual killer, Miss Scarlett, got away scot-free. She had concealed an incriminating letter in the Principia, thinking it would be safe there, until Sir Caliban unhappily chanced upon it. This left her no option, in her mind, but to use the pistol hidden in the Georgian chest of drawers gracing the back wall of the library.

The Colonel’s appeal was unanimously rejected. He is serving a life sentence. Miss Scarlett is living as a tax exile in Belize.

Exercise

A murder has been committed and there are only five people who could have done it. There are no clues, no prior history that we know of. So we consider each suspect equally likely at the start of the investigation. The names of the suspects are: Reverend Green, Colonel Mustard, Miss Scarlett, Professor Plum, Mrs. Peacock.

  1. What is the prior probability of guilt for each individual suspect?
  2. Now the first clue is found, which eliminates Reverend Green. What is the new probability that each of the remaining individual suspects is guilty?

A new clue now arises which casts doubt upon the alibi of Colonel Mustard, whose probability of guilt we now judge to rise to 40 per cent.

3. What is the new probability that each of the other suspects is guilty?

The third clue now eliminates Mrs. Peacock.

4. What are the new probabilities of guilt that you, as a Bayesian, will attribute to Colonel Mustard, Miss Scarlett and Professor Plum?

References and Links

Books to teach yourself probability and Bayesian statistics. http://ucanalytics.com/blogs/probability-bayesian-statistics-books-self-taught/

Bayes and the Testing Problem – in a nutshell.


Let’s say a patient goes to see the doctor. The doctor performs a test on all his patients, for a flu virus, estimating that only 1 per cent of the people who visit his surgery have the virus. The test he gives them, however, is 99 percent accurate – that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. Now the question is: if the patient tests positive, what chances should the doctor give to the patient having the flu virus?
The intuitive answer is 99 percent. But is that right?
The information we are given is ‘the probability of testing positive given that you have the virus’. What we want to know, however, is ‘the probability of having the virus given that you tested positive.’ Common intuition conflates these two probabilities, but they are in fact very different. If the test is 99% accurate, this means that 99% of sick people test positive. But this is NOT the same thing as saying that 99% of people who test positive are sick. Conflating the two is known as the ‘Inverse Fallacy’ or ‘Prosecutor’s Fallacy’. It is the fallacy, to which jurors are very susceptible, of believing that the probability of a defendant being guilty of a crime in light of the observation of some piece of evidence is the same as the probability of observing that piece of evidence if the defendant was guilty. They are in fact very different things, and the two probabilities can diverge markedly.
So what is the probability of having the virus if you test positive, given that the test is 99% accurate (i.e. 99% of people who have the virus test positive and 99% of people who do not have the virus test negative)?
To answer this we can use Bayes’ Theorem.
The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:
ab/ [ab+c(1-a)]
a is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence. Before the new evidence (the test), this chance is estimated at 1 in 100 (0.01), as we are told that 1 per cent of the people who visit his surgery have the virus. So, a = 0.01
b is the probability of the new evidence if the hypothesis is true. The probability of the new evidence (the positive result on the test) if the hypothesis is true (the patient is sick) is 99%, since the test is 99% accurate. So, b =0.99
c is the probability of the new evidence if the hypothesis is false. The probability of the new evidence (the positive result on the test) if the hypothesis is false (the patient is not sick) is just 1% (because the test is 99% accurate, and we can only expect a false positive 1 time in 100). So, c = 0.01
Using Bayes’ Theorem, the updated (posterior) probability = ab/ [ab+c(1-a)] = 1/2
So there is actually a 50% chance that the test, which is 99% accurate and has tested positive, has misdiagnosed you and you are actually flu-free.
Basically, it is a competition between how rare the disease is and how rarely the test is wrong. In this case, there is a 1 in 100 chance that you have the flu before undertaking the test, and the test is wrong 1 time in 100. These two probabilities are equal, so the chance that you actually have the flu when testing positive is actually 1 in 2, despite the test being 99% accurate.
But what if the patient is showing symptoms of the disease before being tested?
In this case, the prior probability should be updated to something higher than the prevalence rate of the disease in the entire tested population, and the chance you are actually sick when you test positive rises accordingly. To the extent that a doctor only tests for something that there is corroborating support for, the likelihood that the test result is correct grows. For this reason, any positive test result should be taken very seriously, statistics aside.
More generally, the ‘False Positive’ problem can easily lead to false convictions based on forensic evidence. Let’s say that we have a theft based on access to a secure storage facility, and we test everyone who could potentially have had access, which is 100 people. Without any other evidence, we can now assign a prior probability that the suspect currently being questioned is guilty of the crime at 1 in 100 or 0.01.
Forensic evidence now comes in the way of a partial fingerprint inside the office safe. It is scientifically determined that the probability the suspect’s fingerprint matches the partial print is 95% (0.95). So there’s just a 5% chance that the print was left by another of the suspects. Applying Bayes’ Theorem, we find that when the 95% accurate forensic test provides a match, the actual probability that the suspect is guilty is just 16%. This makes sense when we consider that testing all 100 suspects would (given that the test has a false positive rate of 5%) provide an estimated five false matches. With larger trawls of forensic testing, the likelihood of a false match becomes commensurately higher.
More generally, to differentiate truth from scare we really do need to understand and employ Bayes’ Theorem. Whether at the doctor’s surgery or in the jury room, understanding it really could save a life.

Appendix

In the original setting with the test results showing positive for a flu virus, a = 0.01, b = 0.99, c = 0.01. Substituting into Bayes’ equation, ab/[ab+c(1-a)], gives:

Posterior probability = 0.01 x 0.99 / [0.01 x 0.99 + 0.01 x (1 – 0.01)] = 0.0099 / (0.0099 + 0.0099) = 1/2

Another way of visualising this problem is by constructing a simple box diagram for a population of 10,000 patients. Of these, 1%, or 100, have the flu virus and 9900 do not. These are inserted into the Total column. There is a 1% error rate, so 1% of the 9900 who do not have the flu virus test positive. Hence the remaining 9801 test negative. Of the 100 who actually have the flu virus, one tests negative (because of the error rate) and the remaining 99 correctly test positive. See below.

               Test positive   Test negative   Total
Has flu virus        99               1           100
No flu virus         99            9801          9900
Total               198            9802         10000

It is now easy to see that of the 198 who test positive, exactly half (99) actually have the flu virus. The other half are false positives.
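The box-diagram figures can be reproduced with a short script (a sketch, using the same 10,000-patient population):

```python
population = 10_000
with_flu = population // 100             # 1% prevalence -> 100 patients
without_flu = population - with_flu      # 9,900 patients

true_pos = int(0.99 * with_flu)          # 99 sick patients test positive
false_pos = int(0.01 * without_flu)      # 99 healthy patients test positive

p_flu_given_positive = true_pos / (true_pos + false_pos)
print(p_flu_given_positive)  # 0.5
```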

Let’s take another example.

The probability of a true positive (test comes back positive for virus and the patient has the virus) is 90%. The chance that it gives a false negative (test comes back negative yet the patient has the virus) is 10%. The chance of a false positive (test comes back positive yet the patient does not have the virus) is 7%. The chance of a true negative (test comes back negative and the patient does not have the virus) is 93%.

The probability that a random patient has the virus based on the prevalence of the virus in the tested population is 0.8%.

Here, a = 0.8% (0.008) – this is the prior probability

b =90% (0.9) – probability of a true positive

c = 7% (0.07) – probability of a false positive

So, updated probability that the patient has the virus given the positive test result =

ab / [ab + c(1-a)] = 0.008 x 0.9 / [0.008 x 0.9 + 0.07 x (1 – 0.008)]

= 0.0072 / [0.0072 + 0.06944] = 0.0072 / 0.07664 = 0.0939 = 9.39%

This can be shown using the raw figures to produce the same result. We can choose any number for total tested, and the result is the same. Let’s choose 1 million, say, as the number tested.

So total tested = 1,000,000

Total with virus = 0.008 x 1,000,000 = 8000

True positive = 0.9 x 8000 = 7200

False positive = 0.07 x 992,000 = 69,440

Tested positive = 69,440 + 7200 = 76,640

Updated (posterior) probability that the patient who tests positive has the virus = True positives / Total positives = 7200 / 76640 = 0.0939 = 9.39%
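The raw-figure calculation can be checked in code (a sketch, using the same 1,000,000 tested):

```python
tested = 1_000_000
prevalence, sensitivity, false_pos_rate = 0.008, 0.9, 0.07

with_virus = prevalence * tested                     # 8,000
true_pos = sensitivity * with_virus                  # 7,200
false_pos = false_pos_rate * (tested - with_virus)   # 69,440

posterior = true_pos / (true_pos + false_pos)
print(round(posterior, 4))  # 0.0939
```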

In the forensic match example, we can construct a box table. In the example, out of a population of 100 suspects, one is guilty and 99 are not guilty. These are inserted into the Total column. There is a 5% error rate in the forensic match, so there is a 0.95 chance of a match if the suspect is guilty (top left). There’s a 5% chance that each of the 99 innocent suspects will provide a match (0.05 x 99 = 4.95), leaving 94.05 as the number for the Not guilty/No match cell.

            Match   No match   Total
Guilty       0.95       0.05       1
Not guilty   4.95      94.05      99
Total        5.90      94.10     100

So the chance that the suspect provides a match and is actually guilty is the proportion of those guilty and matching out of all those matching (0.95/5.9 = 0.16).

So the 95% accurate forensic match provides a hit when matched to the suspect but his actual probability of guilt on these figures is just 16%.

Using Bayes’ Theorem, we reach the same conclusion:

Substituting into Bayes’ equation gives:

P(Guilty|Match) = 0.01 x 0.95 / [0.01 x 0.95 + 0.05 x (1 – 0.01)] = 0.0095 / (0.0095 + 0.0495) = 0.0095 / 0.059 = 0.16

So P(Guilty|Match) = 0.16

P(Not guilty|Match) = 0.84
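A quick check of the forensic-match figures (a sketch; the variable names are ours):

```python
prior = 0.01                # 1 of the 100 people with access
p_match_if_guilty = 0.95    # accuracy of the forensic match
p_match_if_innocent = 0.05  # false positive rate

p_guilty_given_match = (prior * p_match_if_guilty) / (
    prior * p_match_if_guilty + p_match_if_innocent * (1 - prior)
)
print(round(p_guilty_given_match, 2))  # 0.16
```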

Sensitivity and Specificity

In terms of false positive analysis, especially in a medical context, the concepts of sensitivity and specificity are often used.

Sensitivity (also termed the true positive rate) is the proportion of actual positives who receive a positive test result. In a medical context, it is for example the proportion of people with a condition who are correctly identified (test positive) as having the condition.

Specificity (also termed the true negative rate) is the proportion of people who don’t have the disease who have a negative test result. In a medical context, it is for example the proportion of people without a condition that are correctly identified (test negative) as not having the condition.

Thus, sensitivity quantifies the avoidance of false negatives and specificity does the same for false positives. There is usually a trade-off between these measures. For example, an airport security scanner set so that it is triggered even by low-risk items such as keys has low specificity (many false alarms on harmless items) but high sensitivity (it will almost certainly detect high-risk items, such as guns). A perfect predictor would identify all genuine cases while triggering no false alarms.

Say that TP is someone who has a disease and tests positive for it (True Positive). FN is someone who has a disease and tests negative for it (False Negative). FP is someone who does not have the disease but tests positive for it (False Positive). TN is someone who does not have the disease and tests negative for it (True Negative).

In this case, Sensitivity (True Positive Rate) = TP/(TP+FN), i.e. the probability of a positive test given that the patient has the disease. It is a function of the characteristics of the test itself. Because it is calculated only over those who actually have the disease, it is not affected at all by the prevalence of the disease.

Specificity (True Negative Rate) = TN/(TN+FP), i.e. the probability of a negative test given that the patient does not have the disease.

Sensitivity is not the same as Precision (Positive Predictive Value, PPV), which is the ratio of true positives to combined true and false positives.

PPV = TP/(TP+FP)

PPV is the proportion of those testing positive who are actual positives, i.e. the probability that you have the disease if you have tested positive for it.

NPV (Negative Predictive Value) = TN/(TN+FN)

So positive and negative predictive values are affected by the prevalence of the disease in the community and are not simply a function of the characteristics of the test itself. So, when comparing one test with another in terms of positive and negative predictive value, you need to be looking at the same population group, or at least population groups with the same incidence of disease.

Now, the Likelihood Ratio is the probability that a test is correct divided by the probability that it is incorrect. In medicine, Likelihood Ratios can be used to determine whether a test result usefully changes the probability that a condition exists.

Two versions of the Likelihood Ratio (Positive LR and Negative LR) exist, one for positive and one for negative test results.

The positive likelihood ratio is calculated as:

LR+ = sensitivity/(1-specificity), which is equivalent to:

LR+ = P(T+|D+) / P(T+|D-)

i.e. LR+ is the probability of a person who has the condition testing positive divided by the probability of a person who does not have the disease testing positive.

The negative likelihood ratio is calculated as:

LR- = (1-sensitivity)/specificity, which is equivalent to:

LR- = P(T-|D+) / P(T-|D-)

i.e. LR- is the probability of a person who has the condition testing negative divided by the probability of a person who does not have the condition testing negative.

The pre-test odds of a particular diagnosis, multiplied by the likelihood ratio, determines the post-test odds.

Post-test odds = Pre-test odds x LR+

Odds = P (something is true) / P (something is false)

Probability = Odds / (1 + Odds)

Exercise

Question 1.

A patient goes to see the doctor. The doctor performs a test on all his patients, for a flu virus, estimating that only 1 per cent of the people who visit his surgery have the flu. The test he gives them, however, is 95 per cent reliable – that is, 95 per cent of people who are sick test positive and 95 per cent of the healthy people test negative. If the patient tests positive, what chance should the doctor give to the patient having the flu virus?

Question 2.

A tennis tournament administers a test for banned drugs to all of the tournament entrants. The test is 90% accurate if the person is using the banned drugs, and 85% accurate if the person is not using them. 10 per cent of all tournament entrants are in fact using the banned drugs. Now, what is the probability that an entrant is using drugs if they test positive?

Question 3.

66 people have the flu and test positive for it. Four people have the flu and test negative for it. Three people don’t have the flu but test positive for it. 827 people don’t have the flu and test negative for it.

    1. What is the Sensitivity of the test?
    2. What is the Specificity of the test?
    3. What is the Positive Predictive Value?
    4. What is the Negative Predictive Value?
    5. What is the Positive Likelihood Ratio?
    6. What is the Negative Likelihood Ratio?
    7. What are the Pre-Test Odds a person has the flu?
    8. What are the Post-Test Odds a person has the flu?
Question 4.

1,000 people are tested for the flu. 100 people have the flu. Of these, 90 test positive and 10 test negative. 900 do not have the flu. 150 of these test positive, and 750 test negative.

    1. What is the Sensitivity of the test?
    2. What is the Specificity of the test?

Question 5.

610 people have the virus and test positive. 118 people have the virus and test negative. 13,212 people do not have the virus but test positive. 127,344 people do not have the virus and test negative.

    1. What is the Sensitivity of the test?
    2. What is the Specificity of the test?
    3. What is the Positive Likelihood Ratio?
    4. What is the Negative Likelihood Ratio?
    5. What are the Pre-Test Odds a person has the virus?
    6. What are the Post-Test Odds a person has the virus if he tests positive?
    7. Now, say that the doctor examines the person before administering the test and assigns a 30% pre-test probability that he has the virus. Assuming this estimate is accurate, what are the pre-test Odds that the person has the virus?
    8. What are the post-test Odds that this person has the virus if he tests positive?
    9. What is the post-test probability that this person has the virus?
    10. Say the person who has been assigned a 30% pre-test probability of having the virus instead tests negative. What are the Post-test Odds now that he has the virus?
    11. What is the Post-test probability that he has the virus?

References and Links

The Role of Probability. Bayes’ Theorem. Boston University School of Public Health. http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html

What is Bayes’ Theorem? Scientific American. https://www.scientificamerican.com/article/what-is-bayess-theorem-an/

Su, Francis E., et al. “Medical Tests and Bayes’ Theorem.” Math Fun Facts https://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml

Base Rate Fallacy. In: Paradoxes of probability and other statistical strangeness. S. Woodcock. April 4, 2017. https://theconversation.com/paradoxes-of-probability-and-other-statistical-strangeness-74440

Sensitivity vs Specificity and Predictive Value. Statistics HowTo. https://www.statisticshowto.datasciencecentral.com/sensitivity-vs-specificity-statistics/

Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value. https://newonlinecourses.science.psu.edu/stat507/node/71/

Sensitivity and Specificity. Science Direct. https://www.sciencedirect.com/topics/medicine-and-dentistry/sensitivity-and-specificity

Sensitivity and Specificity. Wikipedia. https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Likelihood Ratios. CEBM. https://www.cebm.net/2014/02/likelihood-ratios/

Diagnostics and Likelihood Ratios, Explained. http://www.thennt.com/diagnostics-and-likelihood-ratios-explained/

Likelihood Ratios in Diagnostic Testing. Wikipedia. https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

Does Hollywood ruin good books?

Berkson’s Paradox (also known as Berkson’s fallacy or Berkson’s bias) is a statistical quirk which makes it appear that there is an association between two events or variables which are actually unrelated. Notably, it shows that two values can be negatively correlated in a sample of a population when they are in fact uncorrelated or positively correlated in that population. It arises because of a type of selection bias, which is caused by the observation of some events more than others.

Take the case of a college which admits students based on either musical excellence or sporting excellence. For the sake of argument, assume that there is no link between the two in the total relevant population (say, all students in the country). In other words, a musically talented individual is no more or less likely than anyone else to be talented at sport. Because the college admits only students who are excellent at music, or excellent at sport, or both, this creates a group or subset of the population which displays a negative association between musical and sporting excellence.

To illustrate why, let’s make the simplifying assumption that the college admits students who score 9 or 10 out of 10 (on a scale of 0 to 10) on either sporting excellence or musical excellence. In the entire population, because the two abilities are unrelated, a student’s expected sporting rating is the population average of 5 out of 10 regardless of their musical rating. Yet within the group of student entrants, those admitted for musical ability have an average sporting rating of just 5 (the population average), compared with an average musical rating of 9.5. The effect is to imply a negative correlation between sporting and musical ability where no such correlation exists in the wider population.
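A quick simulation (my own sketch, not part of the original example) shows the effect: music and sport scores drawn independently look negatively correlated once we keep only admitted students.

```python
# Berkson's paradox in the college example: draw independent (music, sport)
# scores, admit anyone scoring 9 or 10 on either scale, compare correlations.
import random

random.seed(1)
population = [(random.randint(0, 10), random.randint(0, 10))
              for _ in range(100_000)]  # (music, sport) scores

# Admission rule: excellent at music OR sport (score of 9 or 10).
admitted = [(m, s) for m, s in population if m >= 9 or s >= 9]

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(m for m, _ in pairs) / n
    my = sum(s for _, s in pairs) / n
    cov = sum((m - mx) * (s - my) for m, s in pairs) / n
    vx = sum((m - mx) ** 2 for m, _ in pairs) / n
    vy = sum((s - my) ** 2 for _, s in pairs) / n
    return cov / (vx * vy) ** 0.5

r_pop = correlation(population)   # roughly zero, by construction
r_adm = correlation(admitted)     # clearly negative among admits
```

The admitted group also reproduces the figures in the text: students admitted for musical ability average about 5 on sporting ability, the population average.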

This has been shown to have important implications for medical statistics. Say, for example, that a hospital conducts a study which admits patients who are suffering from either eye cataracts or diabetes. In this case, a (spurious) association between cataracts and diabetes will appear in the set of patients included in the study which does not appear in the wider population. The paradox occurs because the probability of one event (cataracts, in this example) is higher in the presence of the other event (diabetes) within the study group, since cases where neither occurs are excluded.

Similarly, take the idea that there is a negative association in our minds between the quality of movies based on really good books and the quality of the books. One explanation can be derived from Berkson’s Paradox. This interpretation is that we remember the instances where the book is really good or the movie is really good or both. But we forget those cases where both the book and the movie were bad. In this case we find a (spurious) negative correlation between how good the movie is and how good the book is, because the bad movies/bad books element of the population are not included in the set of movies and books under analysis.

Another example of Berkson’s paradox was proposed by Jordan Ellenberg. This is the ‘good looking people are jerks’ example, and it is similar to the movies/books illustration. Say that someone only associates with people who are either pleasant or good looking or both. That eliminates from the sample pool those who are neither pleasant nor good looking, leaving a sample containing good looking people who are unpleasant and pleasant people who are not good looking. An association is therefore noted between being attractive and being unpleasant, but only because the people who are both unattractive and unpleasant are never observed. So even if no link exists between attractiveness and unpleasantness in the population, one appears in an observed world where the counter-examples who exist in the population are avoided and ignored.

To put it more formally, assume there are two independent events, X and Y. These events are not correlated when observed in nature. If one conditions on the fact that either event X or event Y occurred (call this condition Z), however, these events are now correlated. This arises because of selection bias. If we condition on Z (that X OR Y occurs), then if we know that event X did not occur, we know that event Y did occur. This conditioning on Z, what we can call the union of X and Y, leads to a correlation.

Put mathematically, if P(X | Y) = P(X), then P(X | Y, Z) is less than P(X | Z), where Z = X ∪ Y.
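This claim can be checked numerically with made-up numbers, say two independent events each with probability 0.5:

```python
# Exact check of the Berkson inequality for two independent events
# X and Y, each with probability 0.5 (illustrative numbers, my own).
p_x = p_y = 0.5

# Independence: P(X | Y) = P(X).
p_x_given_y = p_x

# Z = "X or Y occurred": P(Z) = P(X) + P(Y) - P(X)P(Y) = 0.75.
p_z = p_x + p_y - p_x * p_y

# X implies Z, so P(X | Z) = P(X) / P(Z) = 2/3.
p_x_given_z = p_x / p_z

# Y implies Z, so P(X | Y, Z) = P(X | Y) = 1/2.
p_x_given_y_and_z = p_x_given_y

# Conditioning on Z has induced a negative dependence:
assert p_x_given_y_and_z < p_x_given_z
```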

Numerical example of Berkson’s Paradox

10% of the population swim and 5% play squash weekly, but there is no correlation between swimming and playing squash in the general population. So someone who plays squash is as likely to swim as any other member of the population and vice-versa.

Of the 200 members of a local health club, 30% swim and 20% play squash.

Based on the health club statistics, is there any evidence of a correlation between those who do not swim and those who play squash?

To answer this, we use the assumption that someone who plays squash is as likely to swim as any other member of the population, i.e. swimming and squash playing can be treated as independent events. So the expected percentage of health club members who both swim and play squash is 10% x 5% = 0.5%, and 0.5% of 200 members is 1 member.

A randomly chosen health club member, however, has a 30% chance of swimming and a 20% chance of playing squash. So, 60 out of 200 members will swim and 40 play squash.

Now, what is the chance that a member who is not a swimmer plays squash?

Of the 60 members who swim, we have calculated above that only 1 also plays squash, i.e. of the 200 members in total, 60 swim and one swims and plays squash.

So, of the remaining 140 members who do not swim, 39 play squash, i.e. the 40 members who play squash in total minus the one who both swims and plays squash.

So 39 of the 140 health club members who do not swim play squash, i.e. 39/140 (27.9%). This is higher than the 20% in the population who play squash.

Even though the two events (swimming and squash) are independent, therefore, the health club statistics make it appear that swimming reduces the likelihood of playing squash, i.e. there is a negative correlation between swimming and playing squash. The reason is that we are excluding from consideration those members of the general population who neither swim nor play squash, and only considering those who either swim or play squash or both.
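The arithmetic above can be traced step by step (a sketch using the figures given in the text):

```python
# The health-club example, step by step.
members = 200
swimmers = round(0.30 * members)        # 60 members swim
squash_players = round(0.20 * members)  # 40 members play squash

# Under population-level independence, P(swim and squash) = 10% x 5% = 0.5%.
both = round(0.10 * 0.05 * members)     # 0.5% of 200 = 1 member

non_swimmers = members - swimmers                # 140 members do not swim
squash_non_swimmers = squash_players - both      # 39 of them play squash
share = squash_non_swimmers / non_swimmers       # 39/140, about 27.9%
```

The 27.9% figure among non-swimmers, against 20% in the club overall, is the spurious negative association.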

 

References and Links

Numberphile. Does Hollywood Ruin Books? https://www.youtube.com/watch?v=FUD8h9JpEVQ

Jordan Ellenberg (2014), Why are Handsome Men Such Jerks? June 3. Slate.com https://slate.com/human-interest/2014/06/berksons-fallacy-why-are-handsome-men-such-jerks.html

Bayes and the Taxi Problem – in a nutshell.




To help explain how Bayes’ Theorem can be applied in practice, let’s start with the classic Bayesian Taxi Problem. It goes like this. New Amsterdam has 1,000 taxis. 850 are yellow, 150 are green. One of these taxis knocks down a pedestrian and then is driven away without stopping. We have no reason to believe that drivers of green taxis are any more or any less likely than drivers of yellow taxis to knock down a pedestrian and drive away. Neither do we have any reason to believe that green or yellow taxis are disproportionately represented in the area of New Amsterdam where the hit and run took place. There is one witness, however, who did see the event. The witness says the colour of the taxi was green. The witness is given a rigorous observation test, which recreates as closely as possible the event in question, and her judgment proves correct 80 per cent of the time. We have no reason to doubt the integrity of the witness.

So what is the probability that the taxi was green?

The intuitive answer is in the region of 80 per cent, as the only evidence is that of the witness, and the test of her powers of observation shows that she is right 80 per cent of the time. That is not the Bayesian approach, however, which is to consider the evidence in the light of the baseline, or prior, probability that the taxi was green before the witness evidence came to light. The prior probability can be derived from an identification of the proportion of taxis in New Amsterdam that are green. This is 15 per cent (of the 1,000 taxis, 150 are green).

Now, the (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:

ab/ [ab + c(1-a)]

In this case, the hypothesis is that the taxi that knocked down the pedestrian was green.

a is the prior probability, i.e. the probability that a hypothesis is true before the new evidence arises. This is 0.15 (15%) because 15% of the taxis in New Amsterdam are green.

b is the probability the new evidence would arise if the hypothesis is true. This is 0.8 (80%). There is an 80% chance that the witness would say the taxi was green if it was indeed green.

c is the probability the new evidence would arise if the hypothesis is false. This is 0.2 (20%). There is a 20% chance that the witness would be wrong and identify the taxi as green if it was in fact yellow.

Inserting these numbers into the formula, ab/ [ab + c(1-a)], gives:

Posterior probability = 0.15 x 0.8/ [0.15 x 0.8 + 0.2 (1 – 0.15)] = 0.41 = 41 per cent.

In other words, the true probability that the taxi that knocked down the pedestrian was green is not 80 per cent (despite the witness evidence) but about half that. The baseline probability is that important.

If new evidence subsequently arises, Bayesians are not content to leave the probabilities alone. Say, for example, that a new witness appears, totally independent of the other, and is also given the observation test, revealing a reliability score of 90 per cent. Again, we have no reason to doubt the integrity of this witness. What a Bayesian does now is to insert that number into Bayes’ formula (b = 0.9), so that c (the probability that the witness is mistaken) = 0.1. The new baseline (or prior) probability, a, is no longer 0.15, as it was before the first witness appeared, but 0.41 (the probability incorporating the evidence of the first witness). In this sense, yesterday’s posterior probabilities are today’s prior probabilities.

Inserting into Bayes’ Theorem, the new posterior probability = 0.86 = 86%. This is also the new baseline probability underpinning any new evidence which might arise.
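The two updates can be sketched in a few lines of code (the function name is my own):

```python
# The a, b, c updating rule: posterior = ab / [ab + c(1 - a)].

def update(a, b, c):
    """a: prior; b: P(evidence | hypothesis true); c: P(evidence | false)."""
    return (a * b) / (a * b + c * (1 - a))

p = 0.15                  # prior: 15 per cent of taxis are green
p = update(p, 0.8, 0.2)   # first witness (80% reliable): about 0.41
p = update(p, 0.9, 0.1)   # second witness (90% reliable): about 0.86
```

Note that each posterior becomes the prior for the next piece of evidence, exactly as described above.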

There are three key illustrative cases of the Bayesian Taxi Problem which bear highlighting. The first is a scenario where the new witness scores 50 per cent on the observation test. Here is a case where intuition and Bayes’ formula converge. Intuition tells us that a witness who is right only half the time about the colour of the taxi is also wrong half the time, and so any evidence they give is worthless. Bayes’ Theorem tells us that this is indeed so, as the posterior probability ends up being equal to the prior probability.

The second illustrative case is where a new witness is 100 per cent reliable about the colour of the taxi. In this case, b =1 and c =0. Intuition tells us that the evidence of such a witness solves the case. If the infallible witness says the taxi was green, it was green. Bayes’ Theorem agrees.

This leads directly to the third illustrative case. If the new witness scores 0 per cent on the observation test, this indicates that they always identify the wrong colour for the taxi. If they say it is green, it is definitely not green. So the chance (posterior probability) that the taxi is green if they say so is zero. This accords with Bayes’ Theorem.

More generally, information that a witness is usually wrong is valuable, as it can be reversed to useful effect. So if the witness says the taxi is yellow, we can now identify the taxi as definitely green. This now converges on the second illustrative case.

Similarly, a witness who is right, say, only 25 per cent of the time in identifying the colour of the taxi in the observation test also yields us valuable information. By reversing the identified colour, this yields a 75 per cent reliability score, which can be inserted accordingly into Bayes’ Theorem to update the probability that the taxi that knocked down the pedestrian was green.

The only observation evidence that is worthless, therefore, is evidence that could have been produced by the flip of a fair coin.
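The no-information and reversal cases can be checked with the same updating rule (again, a sketch):

```python
# Illustrative cases of the taxi problem, using the a, b, c rule.

def update(a, b, c):
    return (a * b) / (a * b + c * (1 - a))

a = 0.15  # prior probability the taxi is green

# A 50 per cent reliable witness (b = c = 0.5) changes nothing:
# the posterior equals the prior.
after_coin_flip = update(a, 0.5, 0.5)

# Reversal: a witness right only 25 per cent of the time who says
# "yellow" is treated as a 75 per cent reliable witness saying "green".
after_reversed = update(a, 0.75, 0.25)
```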

The Bayesian taxi problem is in fact an instance of what is known as the Base Rate Fallacy, which occurs when we disregard or undervalue prior information when making a judgment on how likely something is. If presented with related base rate information (i.e. generic, general information) and specific information (information pertaining only to a certain case), the fallacy can also arise from a tendency to focus on the latter at the expense of the former. For example, if we are informed that someone is an avid book-lover, we might think it more likely that they are a librarian than a nurse. There are, however, many more nurses than librarians. In this example, we have not taken sufficient account of the base rate for the number of nurses relative to librarians.

And the conclusion to the case? CCTV evidence was later produced in court which was able to identify conclusively the taxi and the driver. The pedestrian never regained consciousness. The driver of what turned out to be a yellow taxi told the jury that the pedestrian unexpectedly stepped out and brushed against the passenger side door. He thought at the time that it was a very minor incident, and was completely unaware that the victim had slipped and hit his head awkwardly.

This was rejected by the jury, who accepted the prosecution’s contention that the taxi driver had acted with premeditation and malicious intent. They based their decision on their acceptance that a driver motivated by premeditated malice would indeed have driven off. They equated this with accepting that anyone who drove off must have been motivated by premeditated malice. It was all the evidence they needed to reach their unanimous verdict of first degree murder. Each member of the jury later left the court unaware that they had committed the classic Prosecutor’s Fallacy.

James Parker, a 29-year-old long-time resident of New Amsterdam, of previous good character, with no previous convictions or any known motive for the crime, is currently serving a sentence of life in a maximum security prison with no possibility of parole.

 

Appendix

In the original taxi problem scenario:

a = 0.15 (15 per cent of taxis are green)

b = 0.8 (the witness is correct 80 per cent of the time)

c = 0.2 (the witness is wrong 20 per cent of the time)

Inserting these numbers into the formula gives:

Posterior probability = (0.15 x 0.8)/ (0.15 x 0.8 + 0.2 x 0.85) = 0.12/ (0.12 + 0.17) = 0.41 = 41 per cent (rounded to the nearest per cent).

This is the new baseline probability underpinning any new evidence which might arise.

If new evidence subsequently arises, such that a = 0.41, b = 0.9, c = 0.1, the new posterior probability = (0.41 x 0.9)/ (0.41 x 0.9 + 0.1 x 0.59) = 0.369/ (0.369 + 0.059) = 0.86 = 86% (rounded to the nearest per cent). This is also the new baseline probability underpinning any new evidence which might arise.

Solution to the three illustrative cases of the Bayesian Taxi Problem.

  1. A scenario where the new witness scores 50 per cent on the observation test. In terms of the equation, such a witness would be accorded b = 0.5 and c = 0.5.

PP = ab/ [ab+ c (1-a)] = 0.5a / [0.5a + 0.5 (1-a)] = 0.5a / (0.5 + 0.5a – 0.5a) = 0.5a / 0.5 = a

So when b and c both equal 0.5 in regard to new evidence, this evidence has no impact on the probability of the hypothesis being tested being true. The posterior probability equals the prior probability. In this case, the witness’s evidence can be discounted.

  2. The second illustrative case is where a new witness is 100 per cent accurate about the colour of the taxi. In this case, b = 1 and c = 0. Intuition tells us that the evidence of such a witness solves the case. If the infallible witness says the taxi was green, it was green. Bayes’ formula agrees. Inserting b = 1, c = 0 into the formula gives:

ab/[ab + c(1-a)] = a / (a + 0) = a/a = 1.

So the new (posterior) probability that the taxi is green = 1.

  3. This leads directly to the third illustrative case. If the new witness scores 0 per cent on the observation test, this indicates that they always identify the wrong colour for the taxi. If they say it is green, it is definitely not green. So the chance (posterior probability) that the taxi is green if they say so is zero. This accords with the formula.

ab/ [ab + c(1-a)] = 0/ [0 + (1-a)] = 0, assuming a is not equal to 1 (here b = 0 and c = 1, since the witness always misidentifies the colour).

If a = 1 and b = 0, the question is meaningless, as we are saying both that the taxi is definitely green (a = 1) and that the witness would never say it was green if it were (b = 0), so the equation reduces to 0/0 and is undefined.

 

Exercise

Question a. New Amsterdam has 1,000 taxis. 800 are yellow, 200 are green. There is no reason for us to believe that one particular colour of taxi is more likely to knock down a pedestrian in the area where the accident occurred, or to believe that the behaviour of green or yellow taxi drivers is likely to differ in the event of knocking down a pedestrian.

One of these taxis now knocks down a pedestrian and drives away. There is one witness, who saw the event. The witness says the colour of the taxi was green.

The witness is given a well-respected observation test, and is right 80% of the time. We can be quite sure from the result that there is a probability of 80% that the witness identifies the colour of the taxi correctly.

What is our best estimate now of the probability that the taxi was green?

 

Question b. What if a second witness, independent of the first, now comes forward?

We determine that the probability that this witness is correct when identifying the colour of the taxi is 70%.

The witness says the colour of the taxi was green.

What is the new posterior (updated) probability that the taxi that knocked down the pedestrian is green?

 

Question c. What if a third witness, independent of the first and second, now comes forward?

We determine that the probability that this witness is correct when identifying the colour of the taxi is 50%.

The witness says the colour of the taxi was green.

What is the new posterior (updated) probability that the taxi that knocked down the pedestrian is green?

 

References and Links

Salop, S.C. (1987). Evaluating uncertain evidence with Sir Thomas Bayes: A Note for Teachers. Economic Perspectives, 1, 1, Summer, 155-160. https://pubs.aeaweb.org/doi/pdf/10.1257/jep.1.1.155

Bedwell, M. (2015). Slow thinking and deep learning: Tversky and Kahneman’s Cabs. Global Journal of Human-Social Science,15,12. https://www.socialscienceresearch.org/index.php/GJHSS/article/download/1634/1575

Base Rate Fallacy. The Decision Lab. https://thedecisionlab.com/bias/base-rate-fallacy/


Base Rate Fallacy. In: Paradoxes of probability and other statistical strangeness. UTS, 5 April, 2017. S. Woodcock. http://newsroom.uts.edu.au/news/2017/04/paradoxes-probability-and-other-statistical-strangeness

Tversky, A. and Kahneman, D. (1982), Evidential Impact of Base Rates. In: Kahneman, D., Slovic, P. and Tversky, A., Judgment Under Uncertainty: Heuristics and Biases. https://www.cambridge.org/core/books/judgment-under-uncertainty/evidential-impact-of-base-rates/CC35C9E390727085713C4E6D0D1D4633

Base Rate Fallacy. Wikipedia. https://en.wikipedia.org/wiki/Base_rate_fallacy

Know Your Bias: Base Rate Neglect. https://youtu.be/YuURK_q2NR8

Base Rate Fallacy. https://youtu.be/Fs8cs0gUjGY

Counting Carefully. The Base Rate Fallacy. https://youtu.be/VeQXXzEJQrg