How much should we bet when we believe the odds are in our favour? The answer to this question was first formalised in 1956 by daredevil pilot, recreational gunslinger and physicist John L. Kelly, Jr. at Bell Labs. The so-called Kelly Criterion is a formula used to determine the optimal size of a series of bets when we have the advantage, in other words when the odds favour us. It takes account of the size of our edge over the market as well as the adverse impact of volatility. Even when we have the edge, we can still go bankrupt along the way if we stake too much on any individual wager or series of wagers.
Essentially, the Kelly strategy is to wager a proportion of our capital which is equivalent to our advantage at the available odds. So if we are being offered even money, and we back heads, and we are certain that the coin will come down heads, we have a 100% advantage. So the recommended wager is the total of our capital. If there is a 60% chance of heads, and a 40% chance of tails, our advantage is now 20%, and we are advised to stake accordingly. This is a simplified representation of the literature on Kelly, Half-Kelly, and other derivatives of same, but the bottom line is clear. It is just as important to know how much to stake as it is to gauge when we have the advantage. But it’s not easy unless we can accurately identify that advantage.
Put more technically, the Kelly criterion is the fraction of capital to wager to maximise compounded growth of capital. The problem it seeks to address is that even when there is an edge, beyond some threshold larger bets will result in lower compounded return because of the adverse impact of volatility. The Kelly criterion defines the threshold, and indicates the fraction that should be wagered to maximise compounded return over the long run (F), which is given by:
F = Pw – (Pl/W)
where
F = Kelly criterion fraction of capital to bet
W = Amount won per amount wagered (i.e. win size divided by lose size)
Pw = Probability of winning
Pl = Probability of losing
When win size and loss size are equal, W = 1, and the formula reduces to:
F = Pw – Pl
For example, if a trader loses £1,000 on losing trades and gains £1,000 on winning trades, and 60 per cent of all trades are winning trades, the Kelly criterion indicates an optimal trade size equal to 20 per cent (0.60-0.40 = 0.20). As another example, if a trader wins £2,000 on winning trades and loses £1,000 on losing trades, and the probability of winning and losing are both equal to 50 per cent, the Kelly criterion indicates an optimal trade size equal to 25 per cent of capital: 0.50- (0.50/2) = 0.25.
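The two trading examples above can be checked with a short Python sketch (the function is a minimal illustration of the formula in the text, not code from any particular library):

```python
def kelly_fraction(p_win, win_loss_ratio):
    """Kelly fraction: F = Pw - (Pl / W), where W is win size / loss size."""
    p_lose = 1.0 - p_win
    return p_win - p_lose / win_loss_ratio

# Even-money trades, 60% winners: F = 0.60 - 0.40/1 = 0.20
print(round(kelly_fraction(0.60, 1.0), 2))  # 0.2

# 2:1 win/loss size, 50% winners: F = 0.50 - 0.50/2 = 0.25
print(round(kelly_fraction(0.50, 2.0), 2))  # 0.25
```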
In other words, Kelly argues that, in the long run, we should wager a percentage of our bankroll equal to the expected profit divided by the amount we would receive if we win.
Proportional over-betting is more harmful than under-betting. For example, betting half the Kelly criterion will reduce compounded return by 25 per cent, while betting double the Kelly criterion will eliminate 100 per cent of the gain. Betting more than double the Kelly criterion will result in an expected negative compounded return, regardless of the edge on any individual bet. The Kelly criterion implicitly assumes that there is no minimum bet size. This assumption prevents the possibility of total loss. If there is a minimum trade size, as is the case in most practical investment and trading situations, then ruin is possible if the amount falls below the minimum possible bet size.
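These claims about over- and under-betting can be illustrated by computing the expected log-growth per bet, g(f) = Pw·ln(1 + fW) + Pl·ln(1 − f), for different multiples of the Kelly fraction. Here is a minimal Python sketch using the even-money 60/40 example; note that for a discrete bet like this, double Kelly gives a growth rate of approximately, though not exactly, zero:

```python
import math

def growth_rate(f, p_win, win_loss_ratio=1.0):
    """Expected log-growth per bet when staking fraction f of capital."""
    p_lose = 1.0 - p_win
    return p_win * math.log(1 + f * win_loss_ratio) + p_lose * math.log(1 - f)

full_kelly = 0.20  # Kelly fraction for Pw = 0.6 at even money
for multiple in (0.5, 1.0, 2.0, 2.5):
    g = growth_rate(multiple * full_kelly, 0.6)
    print(f"{multiple} x Kelly: g = {g:+.4f}")
```

Half Kelly retains roughly three quarters of the full-Kelly growth rate, double Kelly is close to zero, and anything much beyond that has a negative expected compounded growth rate despite the positive edge on each bet.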
So should we bet the full amount recommended by the Kelly Criterion? Not so according to sports betting legend, Bill Benter. Betting the full amount recommended by the Kelly formula, he says, is unwise for a number of reasons. Notably, he warns that accurate estimation of the advantage of the bets is critical; if we overestimate the advantage by more than a factor of two, Kelly betting will cause a negative rate of capital growth, and he says this is easily done. So, as he puts it “… full Kelly betting is a rough ride.” According to Benter, and I for one will defer to his advice in these matters, a fractional Kelly betting strategy is advisable, that is, a strategy wherein one bets some fraction of the recommended Kelly bet (e.g. one half or one third). Ironically, John Kelly himself died in 1965, never having used his own criterion to make money.
So that’s the Kelly criterion. In a nutshell, the advice is only to bet when you believe you have the edge, and to do so using a stake size related to the size of the edge. Mathematically, it means betting a fraction of your capital equal to the size of your advantage. So, if you have a 20% edge at the odds, bet 20% of your capital. In the real world, however, we need to allow for errors that can creep in, like uncertainty as to the true edge, if any, that we have at the odds. So, unless we’re happy to risk a very bumpy ride, and we have total confidence in our judgment, a preferred strategy will be to stake a defined fraction of that amount, known as a fractional Kelly strategy. Purists will hate us for it, but it’s not their capital at risk. So if we are going to bet, the advice is to use Kelly, but with due caution, not least in the assessment of our advantage. And when the fun of betting stops, the best advice of all may of course be to just stop. Good luck!
Further Reading and Links
William of Occam (also spelled William of Ockham) was a 14th-century English philosopher. At the heart of Occam’s philosophy is the principle of simplicity, and Occam’s Razor has come to embody the method of eliminating unnecessary hypotheses. Essentially, Occam’s Razor holds that the theory which explains all (or the most) while assuming the least is the most likely to be correct. This is the principle of parsimony – explain more, assume less. Put more elegantly, it is the principle of ‘pluralitas non est ponenda sine necessitate’ (plurality must never be posited beyond necessity).
Empirical support for the Razor can be drawn from the principle of ‘overfitting.’ In statistics, ‘overfitting’ occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Critically, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. For example, a complex polynomial function might after the fact be used to pass through each data point, including those generated by noise, but a linear function might be a better fit to the signal in the data. By this we mean that the linear function would predict new and unseen data points better than the polynomial function, although the polynomial which has been devised to capture signal and noise would describe/fit the existing data better.
Turning now to ‘ad hoc’ hypotheses and the Razor. In science and philosophy, an ‘ad hoc hypothesis’ is a hypothesis added to a theory in order to save it from being falsified. Ad hoc hypothesising is compensating for anomalies not anticipated by the theory in its unmodified form. For example, you say that there is a leprechaun in your garden shed. A visitor to the shed sees no leprechaun. This is because he is invisible, you say. He spreads flour on the ground to see the footprints. He floats, you declare. He wants you to ask him to speak. He has no voice, you say. More generally, for each accepted explanation of a phenomenon, there is generally an infinite number of possible, more complex alternatives. Each true explanation may therefore have had many alternatives that were simpler and false, but also approaching an infinite number of alternatives that are more complex and false.
This leads us to the idea of what I term ‘Occam’s Leprechaun.’ Any new and more complex theory can still possibly be true. For example, if an individual claims that leprechauns were responsible for breaking a vase that he is suspected of breaking, the simpler explanation is that he is not telling the truth, but ongoing ad hoc explanations (e.g. “That’s not me on the CCTV, it’s a leprechaun disguised as me”) prevent outright falsification. An endless supply of elaborate competing explanations, called ‘saving hypotheses’, prevents ultimate falsification of the leprechaun hypothesis, but appeal to Occam’s Razor helps steer us toward the probable truth. Another way of looking at this is that simpler theories are more easily falsifiable, and hence possess more empirical content.
All assumptions introduce possibilities for error; if an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.
It can also be looked at this way. The prior probability that a theory based on n+1 assumptions is true must be less than a theory based on n assumptions, unless the additional assumption is a consequence of the previous assumptions. For example, the prior probability that Jack is a train driver must be less than the prior probability that Jack is a train driver AND that he owns a Mini Cooper, unless all train drivers own Mini Coopers, in which case the prior probabilities are identical.
Again, the prior probability that Jack is a train driver and a Mini Cooper owner and a ballet dancer is less than the prior probability that he is just the first two, unless all train drivers are not only Mini Cooper owners but also ballet dancers. In the latter case, the prior probabilities of the n and n+1 assumptions are the same.
From Bayes’ Theorem, we know that reducing the prior probability will reduce the posterior probability, i.e. the probability that a proposition is true after new evidence arises.
Science prefers the simplest explanation that is consistent with the data available at a given time, but even so the simplest explanation may be ruled out as new data become available. This does not invalidate the Razor, which does not state that simpler theories are necessarily more true than more complex theories, but that when more than one theory explains the same data, the simpler should be accorded more probabilistic weight.
The theory which explains all (or the most) and assumes the least is most likely. So Occam’s Razor advises us to keep explanations simple. But it is also consistent with multiplying entities necessary to explain a phenomenon. A simpler explanation which fails to explain as much as another more complex explanation is not necessarily the better one. So if leprechauns don’t explain anything they cannot be used as proxies for something else which can explain something. This is the classic riposte to the materialist who holds that there is nothing beyond what we observe in the natural or material world. If a non-materialist explanation better explains the origin of the universe, for example, that explanation may be true and consistent with Occam’s Razor. I explore this issue separately in my blog – ‘Why is there Something Rather than Nothing? A Solution’.
More generally, we can now unify Epicurus and Occam. From Epicurus’ Principle we need to keep open all hypotheses consistent with the known evidence which are true with a probability of more than zero. From Occam’s Razor we prefer from among all hypotheses that are consistent with the known evidence, the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability, and more complex ones lower probability.
From here we can move to the wider problem of induction: reasoning about the unknown by extrapolating a pattern from the known. Specifically, the problem of induction is how we can justify inductive inference. According to Hume’s ‘Enquiry Concerning Human Understanding’ (1748), if we justify induction on the basis that it has worked in the past, then we have to use induction to justify why it will continue to work in the future. This is circular reasoning, and as a justification it fails. “Induction is just a mental habit, and necessity is something in the mind and not in the events.” Yet in practice we cannot help but rely on induction. We are working from the idea that it works in practice, if not in theory – so far. Induction is thus related to an assumption about the uniformity of nature. Of course, induction can be turned into deduction by adding principles about the world (such as ‘the future resembles the past’, or ‘space-time is homogeneous’). We can also assign to inductive generalisations probabilities that increase as the generalisations are supported by more and more independent events. This is the Bayesian approach, and it is a response to the perspective pioneered by Karl Popper. From the Popperian perspective, a single observational event may prove hypotheses wrong, but no finite sequence of events can verify them correct. Induction is from this perspective theoretically unjustifiable, and becomes in practice the choice of the simplest generalisation that resists falsification. The simpler a hypothesis, the easier it is to falsify. Induction and falsifiability are, from this viewpoint, as good as it gets in science. Take an inductive inference problem where there is some observed data and a set of hypotheses, one of which may be the true hypothesis generating the data. The task then is to decide which hypothesis, or hypotheses, are the most likely to be responsible for the observations.
A better way of looking at this seems to be to abandon certainties and think probabilistically. Entropy is the tendency of isolated systems to move toward disorder, and a quantification of that disorder; e.g. assembling a deck of cards in a defined order requires introducing some energy to the system. If you drop the deck, the cards become disorganised and won’t re-organise themselves automatically. This tendency of all systems toward disorder is the Second Law of Thermodynamics, which implies that time is asymmetrical with respect to the amount of order: as a system advances through time, it will statistically become more disordered. By ‘order’ and ‘disorder’ we mean how compressed the information is that describes the system. So if all your papers are in one neat pile, the description is “All papers in one neat pile.” If you drop them, the description becomes “One paper to the right, another to the left, one above, one below, etc. etc.” The longer the description, the higher the entropy. According to Occam’s Razor, we want a theory with low entropy, i.e. low disorder, high simplicity. The lower the entropy, the more likely it is that the theory is the true explanation of the data, and hence that theory should be assigned a higher probability.
More generally, whatever theory we develop, say to explain the origin of the universe, or consciousness, or non-material morality, must itself be based on some theory, which is based on some other theory, and so on. At some point we need to rely on some statement which is true but not provable, and so we think may be false, although it is actually true. We can never solve the ultimate problem of induction, but Occam’s Razor combined with Epicurus, Bayes and Popper is as good as it gets if we accept that. So Epicurus, Occam, Bayes and Popper help us pose the right questions, and help us to establish a good framework for thinking about the answers.
At least that applies to the realm of established scientific enquiry and the pursuit of scientific truth. How far it can properly be extended beyond that is a subject of intense and continuing debate.
Further Reading and Links
Bayes’ Theorem: The Most Powerful Equation in the World. https://leightonvw.com/2017/03/12/bayes-theorem-the-most-powerful-equation-in-the-world/
Why is there Something Rather than Nothing https://wordpress.com/post/leightonvw.com/639
A patient goes to see the doctor. The doctor performs a test for a flu virus on all his patients, estimating that only 1 per cent of the people who visit his surgery have the virus. The test he gives them, however, is 99 per cent accurate – that is, 99 per cent of people who are sick test positive and 99 per cent of the healthy people test negative. Now the question is: if the patient tests positive, what chances should the doctor give to the patient having the virus?
The intuitive answer is 99 percent.
But is that right?
The information we are given is ‘the probability of testing positive given that you are sick’. What we want to know, however, is ‘the probability of being sick given that you tested positive.’ Common intuition conflates these two probabilities, but they are in fact very different. If a test is 95% accurate, this means that 95% of sick people test positive. But this is NOT the same thing as saying that 95% of people who test positive are sick. Confusing the two is known as the ‘Inverse Fallacy’ or ‘Prosecutor’s Fallacy’. It is the fallacy, to which jurors are very susceptible, of believing that the probability of a defendant being guilty of a crime given the observation of some piece of evidence is the same as the probability of observing that piece of evidence if the defendant was guilty. They are in fact very different things, and the two probabilities can diverge markedly, markedly enough in fact to send many innocent people to the place of execution or to a life without possibility of parole.
So what is the probability of being sick if you test positive, given that the test is 99% accurate (i.e. 99% of people who are sick test positive and 99% of people who are not sick test negative)?
To answer this we can use Bayes’ Theorem.
The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the x,y,z formula of Bayes’ Theorem, is equal to:
xy/[xy+z(1-x)]
x is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence.
y is the probability you would see the new evidence if the hypothesis is true.
z is the probability you would see the new evidence if the hypothesis is false.
In the case of the flu test, the hypothesis is that the patient is sick.
Before the new evidence (the test), this chance is estimated at 1 in 100 (0.01)
So x = 0.01
The probability we would see the new evidence (the positive result on the test) if the hypothesis is true (the patient is sick) is 99%, since the test is 99% accurate.
So y = 0.99
The probability we would see the new evidence (the positive result on the test) if the hypothesis is false (the patient is not sick) is just 1% (because the test is 99% accurate, and will only give a false positive 1 time in 100).
So z = 0.01
Substituting into Bayes’ equation gives:
0.01 × 0.99 / [0.01 × 0.99 + 0.01 × (1 – 0.01)] = 0.0099 / [0.0099 + 0.0099] = 1/2
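The calculation can be sketched as a small Python function (a minimal illustration of the x,y,z formula, with the flu-test numbers from the text):

```python
def bayes_posterior(x, y, z):
    """Posterior probability = xy / [xy + z(1 - x)]."""
    return (x * y) / (x * y + z * (1 - x))

# Flu test: prior 1%, sensitivity 99%, false-positive rate 1%
p_sick = bayes_posterior(x=0.01, y=0.99, z=0.01)
print(round(p_sick, 2))  # 0.5
```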
So there is actually a 50% chance that the test, which is 99% reliable and has tested positive, has misdiagnosed you and you are actually flu-free.
Basically, it is a competition between how rare the disease is and how rarely the test is wrong. In this case, there is a 1 in 100 chance that you have the flu before undertaking the test, and the test is wrong 1 time in 100. These two probabilities are equal, so the chance that you actually have the flu when testing positive is 1 in 2.
But what if the patient is showing symptoms of the disease before being tested?
In this case, the prior probability should be updated to something higher than the prevalence rate of the disease in the entire tested population, and the chance you are actually sick when you test positive rises accordingly. To the extent that a doctor only tests for something that there is corroborating support for, the likelihood that the test result is correct grows. For this reason, any positive test result should be taken very seriously, statistics aside.
More generally, to differentiate truth from scare we really do need to understand and employ Bayes’ Theorem. Whether at the doctor’s surgery or in the jury room, understanding it really could save a life.
Appendix
In the original setting with the test results showing positive for a flu virus, a = 0.01, b = 0.99, c = 0.01. Substituting into Bayes’ equation, ab/[ab+c(1-a)], gives:
Posterior probability = 0.01 × 0.99 / [0.01 × 0.99 + 0.01 × (1 – 0.01)] = 0.0099 / [0.0099 + 0.0099] = 1/2
Another way of visualising this problem is by constructing a simple box diagram for a population of 10,000 patients. Of these, 1%, or 100, have the flu virus and 9900 do not. These are inserted into the Total column. There is a 1% error rate, so 1% of the 9900 who do not have the flu virus test positive. Hence the remaining 9801 test negative. Of the 100 who actually have the flu virus, one tests negative (because of the error rate) and the remaining 99 correctly test positive. See below.
| | Test positive | Test negative | Total |
| Has flu virus | 99 | 1 | 100 |
| No flu virus | 99 | 9801 | 9900 |
| Total | 198 | 9802 | 10000 |
It is now easy to see that of the 198 who test positive, exactly half (99) actually have the flu virus. The other half are false positives.
Let’s take another example.
The probability of a true positive (test comes back positive for virus and the patient has the virus) is 90%. The chance that it gives a false negative (test comes back negative yet the patient has the virus) is 10%. The chance of a false positive (test comes back positive yet the patient does not have the virus) is 7%. The chance of a true negative (test comes back negative and the patient does not have the virus) is 93%.
The probability that a random patient has the virus based on the prevalence of the virus in the tested population is 0.8%.
Here, a = 0.8% (0.008) – this is the prior probability
b =90% (0.9) – probability of a true positive
c = 7% (0.07) – probability of a false positive
So, updated probability that the patient has the virus given the positive test result =
ab / [ab + c(1 – a)] = 0.008 × 0.9 / [0.008 × 0.9 + 0.07 × (1 – 0.008)]
= 0.0072 / [0.0072 + 0.06944] = 0.0072 / 0.07664 = 0.0939 = 9.39%
This can be shown using the raw figures to produce the same result. We can choose any number for total tested, and the result is the same. Let’s choose 1 million, say, as the number tested.
So total tested = 1,000,000
Total with virus = 0.008 x 1,000,000 = 8000
True positive = 0.9 x 8000 = 7200
False positive = 0.07 x 992,000 = 69,440
Tested positive = 69,440 + 7200 = 76,640
Updated (posterior) probability that the patient who tests positive has the virus = True positives / Total positives = 7200 / 76640 = 0.0939 = 9.39%
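This raw-figures check is easy to reproduce in Python (a minimal sketch using the numbers in the text):

```python
total_tested = 1_000_000
prevalence = 0.008        # prior probability of having the virus
p_true_positive = 0.90    # P(positive | has virus)
p_false_positive = 0.07   # P(positive | no virus)

with_virus = prevalence * total_tested                            # 8,000
true_positives = p_true_positive * with_virus                     # 7,200
false_positives = p_false_positive * (total_tested - with_virus)  # 69,440
posterior = true_positives / (true_positives + false_positives)
print(f"{posterior:.2%}")  # 9.39%
```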
In the forensic match example, we can construct a box table. In the example, out of a population of 100 suspects, one is guilty and 99 are not guilty. These are inserted into the Total column. There is a 5% error rate in the forensic match, so there is a 0.95 chance of a match if the suspect is guilty (top left). There’s a 5% chance that any one of the 99 innocent suspects will provide a match (0.05 x 99 = 4.95), leaving 94.05 as the number for the Not guilty/No match cell.
| | Match | No match | Total |
| Guilty | 0.95 | 0.05 | 1 |
| Not guilty | 4.95 | 94.05 | 99 |
| Total | 5.9 | 94.1 | 100 |
So the chance that the suspect provides a match and is actually guilty is the proportion of those guilty and matching out of all those matching (0.95/5.9 = 0.16).
So the 95% accurate forensic match provides a hit when matched to the suspect but his actual probability of guilt on these figures is just 16%.
Using Bayes’ Theorem, we reach the same conclusion:
Substituting into Bayes’ equation gives:
P(Guilty | Match) = 0.01 × 0.95 / [0.01 × 0.95 + 0.05 × (1 – 0.01)] = 0.0095 / (0.0095 + 0.0495) = 0.0095 / 0.059 = 0.16
So P(Guilty | Match) = 0.16
P(Not guilty | Match) = 0.84
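The forensic-match figures can be verified with the same x,y,z form of Bayes’ Theorem in Python (a minimal sketch; the estimates are those used in the text):

```python
def bayes_posterior(x, y, z):
    """Posterior probability = xy / [xy + z(1 - x)]."""
    return (x * y) / (x * y + z * (1 - x))

# Prior: 1 guilty suspect in 100; P(match | guilty) = 0.95; P(match | not guilty) = 0.05
p_guilty = bayes_posterior(x=0.01, y=0.95, z=0.05)
print(round(p_guilty, 2))      # 0.16
print(round(1 - p_guilty, 2))  # 0.84
```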
Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.
The majestic tragedy, Othello, was written by William Shakespeare in about 1603. The play revolves around four central characters: Othello, a Moor who is a general in the Venetian army; his beloved wife, Desdemona; his loyal lieutenant, Cassio; and his trusted ensign, Iago.
A key element of the play is Iago’s plot to convince Othello that Desdemona is conducting an affair with Cassio, by planting a treasured keepsake Othello gave to Desdemona in Cassio’s lodgings, for Othello to ‘accidentally’ come upon.
We playgoers know she is not cheating on him, as does Iago, but Othello, while reluctant to believe it of Desdemona, is also very reluctant to believe that Iago could be making it up.
If Othello refuses to contemplate any possibility of betrayal, then we would have a play in which no amount of evidence, however overwhelming, including finding them together, could ever change his mind. We would have a farce or a comedy instead of a tragedy.
A shrewder Othello would concede that there is at least a possibility that Desdemona is betraying him, however small that chance might be. This means that there does exist some level of evidence, however great it would need to be, that would leave him no alternative. If his prior trust in Desdemona is almost, but not absolutely total, then this would permit of some level of evidence, logically incompatible with her innocence, changing his mind. This might be called ‘Smoking Gun’ evidence.
On the other hand, Othello might adopt a more balanced position, trying to assess the likelihood objectively and without emotion. But how? Should he try and find out the proportion of female Venetians who conduct extra-marital affairs? This would give him the probability for a randomly selected Venetian woman but no more than that. Hardly a convincing approach when surely Desdemona is not just an average Venetian woman. So should he limit the reference class to women who are similar to Desdemona? But what does that mean?
And this is where it is easy for Othello to come unstuck. Because it is so difficult to choose a prior probability (as Bayesians would term it), the temptation is to assume that since it might or might not be true, the likelihood is 50-50. This is known as the ‘Prior Indifference Fallacy’. Once Othello falls victim to this common fallacy, any evidence against Desdemona now becomes devastating. It is the same problem as that facing the defendant in the dock.
Extreme, though not blind, trust is one way to avoid this mistake. But an alternative would be to find evidence that is logically incompatible with Desdemona’s guilt, in effect the opposite of the ‘Smoking Gun.’ The ‘Perfect Alibi’ would fit the bill.
Perhaps Othello would love to find evidence that is logically incompatible with Desdemona conducting an affair with Cassio, but holds her guilty unless he can find it. He needs evidence that admits no True Positives.
Lacking extreme trust and a Perfect Alibi, what else could have saved Desdemona?
To find the answer, we shall turn as usual to Bayes and Bayes’ Theorem. Bayes’ Theorem, otherwise known as the most important equation in the world, solves these sorts of problems very adeptly every time, using the wonderfully simple x,y,z formula.
The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the x,y,z formula of Bayes’ Theorem, is equal to:
xy/[xy+z(1-x)]
x is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence.
y is the probability you would see the new evidence if the hypothesis is true.
z is the probability you would see the new evidence if the hypothesis is false.
In the case of the Desdemona problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio.
Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.
So x = 0.04
The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%. There’s quite a good chance she would secretly hand Cassio the keepsake as proof of her love for him rather than for Othello.
So y = 0.5
The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%. Why would it be there if Desdemona had not been to his lodgings secretly, and why would she take the keepsake along in any case?
So z = 0.05
Substituting into Bayes’ equation gives:
0.04 x 0.5 / [0.04 x 0.5 + 0.05 (1 – 0.04)] = 0.294.
So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%, worryingly high for the tempestuous Moor but perhaps low enough to prevent tragedy. The power of Bayes here lies in demonstrating to Othello that the finding of the keepsake in the living quarters of Cassio might have only a 1 in 20 chance of being consistent with Desdemona’s innocence, but that in the bigger picture there is less than a 3 in 10 chance that she actually is culpable.
If this is what Othello concludes, the task of the evil Iago is to lower z in the eyes of Othello, by arguing that the chance of the keepsake ending up with Cassio for an innocent reason is so astoundingly small that 1 in 100 is nearer the mark than 1 in 20. In other words, to convince Othello to lower his estimate of z from 0.05 to 0.01.
The new Bayesian probability of Desdemona’s guilt now becomes:
xy/[xy+z(1-x)]
x = 0.04 (the prior probability of Desdemona’s guilt, as before)
y = 0.5 (as before)
z = 0.01 (down from 0.05)
Substituting into Bayes’ equation gives:
0.04 x 0.5 / [0.04 x 0.5 + 0.01 (1 – 0.04)] = 0.676.
So, if Othello can be convinced that 5% is too high a probability that there is an innocent explanation for the appearance of the keepsake in Cassio’s lodgings – let’s say he’s persuaded by Iago that the true probability is 1% – then Desdemona’s fate, as that of many a defendant whom a juror thinks has more than a 2 in 3 chance of being guilty, is all but sealed. Her best hope now is to try and convince Othello that the chance of the keepsake being found in Cassio’s place if she were guilty is much lower than 0.5. For example, she could try a common sense argument that there is no way that she would take the keepsake if she were actually having an affair with Cassio, nor be so careless as to leave it behind. In other words, she could argue that the presence of the keepsake where it was found actually provides testimony to her innocence. In Bayesian terms, she should try to reduce Othello’s estimate of y. What level of y would have prevented tragedy? That is another question.
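Othello’s two scenarios can be compared with a short Python sketch (the probabilities are the illustrative estimates from the text, not canonical values):

```python
def bayes_posterior(x, y, z):
    """Posterior probability = xy / [xy + z(1 - x)]."""
    return (x * y) / (x * y + z * (1 - x))

x, y = 0.04, 0.5         # prior guilt 4%; P(keepsake found | guilty) = 50%
for z in (0.05, 0.01):   # Othello's estimate vs Iago's suggested estimate
    print(f"z = {z}: P(guilty | keepsake) = {bayes_posterior(x, y, z):.3f}")
```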
William Shakespeare wrote Othello about a hundred years before the Reverend Thomas Bayes was born. That is true. But to my mind the Bard was always, in every inch of his being, a true Bayesian. Othello was not, and therein lies the tragedy.
Appendix
In the case of the Othello problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio. Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.
So P(H) = 0.04
The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%.
So P(E|H) = 0.5
The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%.
So P(E|H’) = 0.05
Substituting into Bayes’ Theorem:
P(H|E) = P(E|H) . P(H) / [P(E|H) . P(H) + P(E|H’) . P(H’)]
P(H|E) = 0.5 x 0.04 / [0.5 x 0.04 + 0.05 x 0.96]
P(H|E) = 0.02 / [0.02 + 0.048] = 0.294
Posterior probability = 0.294.
So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%.
If P(E|H’) = 0.01
The new Bayesian probability of Desdemona’s guilt now becomes:
P(H|E) = 0.5 x 0.04 / [0.5 x 0.04 + 0.01 x 0.96]
P(H|E) = 0.02 / (0.02 + 0.0096) = 0.02 / 0.0296 = 0.676
Updated probability = 0.676 = 67.6%.
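For readers who prefer to check such sums by machine, the two calculations above can be reproduced in a few lines of Python (the inputs are just the illustrative estimates from the text):

```python
# Bayes' Theorem: P(H|E) = P(E|H).P(H) / [P(E|H).P(H) + P(E|H').P(H')]
def posterior(p_h, p_e_given_h, p_e_given_noth):
    top = p_e_given_h * p_h
    return top / (top + p_e_given_noth * (1 - p_h))

# Othello's estimates: prior 4%, P(keepsake found | guilty) = 50%
print(round(posterior(0.04, 0.5, 0.05), 3))  # 0.294 when P(E|H') = 0.05
print(round(posterior(0.04, 0.5, 0.01), 3))  # 0.676 when P(E|H') = 0.01
```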
Bobby Smith, aged 8, is a good schoolboy footballer, but you know that only one in a thousand such 8-year-olds go on to become professional players. So you would like to get an unbiased assessment of his real chance of developing into a top player. A coach tells you there is a test, taken by all good 8-year-old footballers, that can measure the child’s potential. The test, you learn, is 95% accurate in identifying future professional footballers, and these always receive a grade of A+.
Bobby takes the test and is graded A+.
How many of the 8-year-olds tested, who get an A+, fail to develop into top players, you ask. Now the coach imparts the good news. All current professional players scored A+ when they took the test in their own school days, and we can take it that anyone who scores below that can be ruled out as a future professional player. And the test is 95% accurate, so only 5% of those who get the A+ grade fail to develop into professional footballers. So what is the actual chance that Bobby will become a top player?
If you are like most people, you will think the chance is very high.
This is your reasoning: I don’t really know whether Bobby is likely to turn into a professional player or not. But he has taken this test. In fact, no current professional player scored below A+, and the test only very rarely allocates a top grade to a child who will not become a professional footballer. If the test is really this good, therefore, it looks like Bobby will have a bright future as a football star.
Is this true? Think of it this way. If there were no test, you would have asked the coach a very basic question: in your experience, what is the chance that Bobby will become a professional player? The coach would have dampened your enthusiasm: one in a thousand, he would have said. But with the test result in hand, there’s no need to ask this question. It’s irrelevant in the face of a very accurate test result, isn’t it?
In fact, this is a well-known fallacy, which psychologists call the Inverse Fallacy, or Prosecutor’s Fallacy. The fallacy is to confuse the probability of a hypothesis being true, given some evidence, with the probability of the evidence arising, given the hypothesis is true.
In our example, the hypothesis is that Bobby will become a top player, and the evidence is the high test score. What we want to know is the probability that Bobby will become a top player, given that the test says he will be. What we know, on the other hand, is the probability that the test says Bobby will be a top player, given that he will be. The coach told you this probability, on all available evidence, is 100%: the test is in this sense infallible, in that all professional players score A+ on the test. In answering your other question, the coach also told you the probability of an A+ test score, given that the child will not become a top player, is only 5%. You take this information and conclude that Bobby is very likely to turn into a top player.
In fact, of the thousand children who took the test, only one (statistically speaking) will become a professional footballer. The test is 95% accurate, so 5% of the 1,000 children will score A+ and not become top players, i.e. there will be 50 ‘false positives.’ Anyone who will become a top player, on the other hand, will score A+ on the test.
So what is the chance that Bobby will become a professional footballer if he scores A+ on the test?
Solution: 50 kids who will not become top footballers score A+ (the 50 ‘false positives’). Only one of the one thousand eight-year-olds who take the test develops into a professional player, and that child will score A+. Look at it this way. A thousand 8-year-olds take the test, and of these 50 of them will receive a letter telling them they have scored A+ on the test but will not develop into top players. One child will receive a letter with a score of A+ and actually will go on to become a professional player. Therefore the probability you will become a top footballer if you score A+ is just 1 in 51, i.e. 1.96%.
This is the same idea as the medical ‘false positives’ problem.
In that problem, a thousand people go to the doctor and all are tested for flu. Only one actually has the flu. Those with the flu always test positive. We know that the test for flu is 95% accurate, so 5% of the 1,000 people will test positive and not have the flu, i.e. there will be 50 ‘false positives’. One will test positive who does have the flu. Those with the flu all test positive. So what is the chance that you have the flu if you test positive?
Solution: 50 people who do not have the flu test positive. One person who has the flu tests positive. Therefore, the probability you have the flu if you test positive is 1 in 51, i.e. 1.96%.
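The counting in both the footballer and the flu versions can be sketched in code, using the same round numbers as the text:

```python
population = 1000        # good 8-year-olds tested (or patients tested for flu)
true_positives = 1       # the one who really will turn professional (or has flu)
false_positives = 0.05 * population   # the test wrongly flags 5%, i.e. 50

# Of everyone who gets the A+ (or the positive flu test), what fraction
# is genuinely positive? It is 1 out of 51.
p_true_given_positive = true_positives / (true_positives + false_positives)
print(round(p_true_given_positive, 4))  # 0.0196
```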
We can also solve the Bobby Smith problem using Bayes’ Theorem. The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the a,b,c formula of Bayes’ Theorem, is equal to:
ab/[ab+c(1-a)]
a is the prior probability, i.e. the probability that a hypothesis is true before the new evidence. b is the probability of the new evidence if the hypothesis is true. c is the probability of the new evidence if the hypothesis is false.
In the case of the Bobby Smith problem, the hypothesis is that Bobby will develop into a professional player.
Before the new evidence (the test), this chance is 1 in 1000 (0.001)
So a = 0.001
The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.
So b =1
The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional footballers.
So c = 0.05
Substituting into Bayes’ equation gives:
Posterior probability = ab/[ab+c(1-a)] = 0.001 x 1 / [0.001 x 1 + 0.05 x (1 – 0.001)] = 0.0196
So, using Bayes’ Theorem, the chance that Bobby Smith, who scored A+ on the test which is 95% accurate, will actually become a top player, is not 95% as intuition might suggest, but just 1.96%, as we have shown previously by a different route.
So there is just a 1.96 per cent chance that Bobby Smith will go on to become a professional player, despite scoring A+ on that very accurate test of player potential.
That’s the statistics, the cold Bayesian logic. Now for the good news. Bobby Smith was the lucky one. He currently plays for Barcelona, under a different name.
Appendix
We can also solve the Bobby Smith problem using the traditional notation version of Bayes’ Theorem.
P(H|E) = P(E|H) . P(H) / [P(E|H) . P(H) + P(E|H’) . P(H’)]
Before the new evidence (the test), this chance is 1 in 1000 (0.001).
So P(H) = 0.001
The probability of the new evidence (the A+ score on the test) if the hypothesis is true (Bobby will become a professional player) is 100%, since all professional players score A+ on the test.
So P(E|H) = 1
The probability we would see the new evidence (the A+ score on the test) if the hypothesis is false (Bobby will not become a professional player) is 5%, since the test is 95% accurate in spotting future professional footballers.
So P(E|H’) = 0.05
Substituting into Bayes’ equation gives:
P(H|E) = 0.001 x 1 / [0.001 x 1 + 0.05 x (1 – 0.001)] = 0.0196
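The substitution is easy to verify in a few lines of Python (same illustrative figures):

```python
# P(H|E) = P(E|H).P(H) / [P(E|H).P(H) + P(E|H').P(H')]
p_h = 0.001        # prior: 1 in 1000 good schoolboy players turn professional
p_e_h = 1.0        # all future professionals score A+
p_e_noth = 0.05    # 5% of the rest also score A+
p_h_e = p_e_h * p_h / (p_e_h * p_h + p_e_noth * (1 - p_h))
print(round(p_h_e, 4))  # 0.0196
```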
How should we change our beliefs about the world when we encounter new data or information? This is one of the most important questions we can ask. A theorem bearing the name of Thomas Bayes, an eighteenth century clergyman, is central to the way we should answer this question.
The original presentation of the Reverend Thomas Bayes’ work, ‘An Essay toward Solving a Problem in the Doctrine of Chances’, was given in 1763, after Bayes’ death, to the Royal Society, by Bayes’ friend and confidant, Richard Price.
In framing Bayes’ work, Price gave the example of a person who emerges into the world and sees the sun rise for the first time. As he has had no opportunity to observe this before (perhaps he has spent his life to that point entombed in a dark cave), he is not able to decide whether this is a typical or unusual occurrence. It might even be a unique event. Every day that he sees the same thing happen, however, the degree of confidence he assigns to this being a permanent aspect of nature increases. His estimate of the probability that the sun will rise again tomorrow as it did yesterday and the day before, and so on, gradually approaches, although never quite reaches, 1.
The Bayesian viewpoint is just like that, the idea that we learn about the universe and everything in it through a process of gradually updating our beliefs, edging incrementally ever closer and closer to the truth as we obtain more data, more information, more evidence.
As such, the perspective of Rev. Bayes on cause and effect is essentially different to that of philosopher David Hume, the logic of whose argument on this issue is contained in ‘An Enquiry Concerning Human Understanding’. According to Hume, we cannot justify our assumptions about the future based on past experience unless there is a law that the future will always resemble the past. No such law exists. Therefore, we have no fundamentally rational support for believing in causation. For Hume, therefore, predicting that the sun will rise again after seeing it rise a hundred times in a row is no more rational than predicting that it will not. Bayes instead sees reason as a practical matter, in which we can apply the laws of probability to the issue of cause and effect.
To Bayes, therefore, rationality is a matter of probability, by which you update your predictions based on new evidence, thereby edging closer and closer to the truth. This is called Bayesian reasoning. According to this approach, probability can be seen as a bridge between ignorance and knowledge. The particularly wonderful thing about the world of Bayesian reasoning is that the mathematics of operationalising it are so simple.
Essentially, Bayes’ Theorem is just an algebraic expression with three known variables and one unknown. Yet this simple formula is the foundation stone of that bridge I referred to between ignorance and knowledge.
Bayes’ Theorem is in this way concerned with conditional probability. That is, it tells us the probability, or updates the probability, that a theory or hypothesis is true given that some event has taken place.
To help explain how it works, let us invent a little crime story in which you are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police investigation into a case of vandalism of a shop window in a street adjoining the one where you know she lives. It took place at noon that day, which you know is her day off work.
She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity.
You must evaluate the probability that she did commit the offence before deciding how to advise her.
So the condition is that she has been charged with criminal damage; the hypothesis you are interested in evaluating is the probability that she did it.
Bayes’ Theorem helps you answer this type of question.
There are three things you need to estimate.
- A Bayesian’s first task is to estimate the probability that the new evidence would have arisen if the hypothesis was true. In this case, you need to estimate the probability of the police officer identifying your friend if your friend actually did break the window.
- A Bayesian’s second task is to estimate the probability that the new evidence would have arisen if the hypothesis was false. In this case, you need to estimate the probability of the police officer identifying your friend if your friend did NOT break the window.
- You need what Bayesians call a prior probability.
This is the probability you would have assigned to her smashing the shop window before she told you that she had been charged on the basis of the witness evidence. This is not always easy, since the new information might colour the way you assess the prior information, but ideally you should estimate this probability as it would have been before you received the new information.
A practical definition of a Bayesian prior is the odds at which you would be willing to place or offer a bet before the new information is disclosed.
Based on these three probability estimates, Bayes’ Theorem offers you the way to calculate accurately the revised probability you should assign to your friend’s guilt. The wonderful part about it is that the equation is true as a matter of logic. So the result it produces will be as accurate as the values inputted into the equation.
The formula is also so straightforward it can be jotted on the back of your hand. Actually, that’s not such a bad idea for such a powerful tool. Indeed, if you are attracted to tattoos, this is as good an idea for one as any. And it’s as simple as x,y,z.
The formula has xy on the top of the equation and xy+z(1-x) on the bottom.
And that’s it!
Bayes’ rule is:
Probability of hypothesis being true after obtaining new evidence = xy/[xy+z(1-x)]
This is known as the Posterior Probability.
So we have three variables.
x is the prior probability, i.e. the probability you assign to the hypothesis being true before you obtain the new evidence.
y is the probability that the new evidence would have arisen if the hypothesis was true.
z is the probability that the new evidence would have arisen if the hypothesis was false.
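As a sketch, the whole of Bayes’ rule in this x,y,z form fits into a one-line Python function:

```python
def bayes(x, y, z):
    """Bayes' rule: posterior = xy / [xy + z(1-x)].

    x: prior probability that the hypothesis is true
    y: probability of the new evidence if the hypothesis is true
    z: probability of the new evidence if the hypothesis is false
    Returns the posterior probability of the hypothesis.
    """
    return x * y / (x * y + z * (1 - x))
```

Note one reassuring property: evidence that is equally likely either way (y = z) leaves the prior unchanged, so `bayes(0.05, 0.8, 0.8)` returns 0.05.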
So let’s apply Bayes’ Rule to the case of the shattered shop window.
Let’s start with y. This is the probability that the new evidence would have arisen if the hypothesis was true. What is the hypothesis? That your friend broke the window. What is the new evidence? That the police officer has identified your friend as the person who smashed the window. So y is an estimate of the probability that the police officer would have identified your friend if she was indeed guilty.
If she threw the brick, it’s easy to imagine how she came to be identified by the police officer. Still, he wasn’t close enough to catch the culprit at the time, which should be borne in mind. Let’s say that the probability he would have identified her, given that she is guilty, is 80% (0.8).
Let’s move on to z. This is the probability that the new evidence would have arisen if the hypothesis was false. What is the hypothesis again? That your friend broke the window. What is the new evidence again? That the police officer has identified your friend as the person who did it. So z is an estimate of the probability that the police officer would have identified her if she was not the guilty party, i.e. a false identification.
If your friend didn’t shatter the window, how likely is the police officer to have wrongly identified her when he saw her in the street later that day? It is possible that he would see someone of similar age and appearance, wearing similar clothes, and jump to the wrong conclusion, or he may just want to identify someone to advance his career. Let us give him credit and say the probability is just 15% (0.15).
Finally, what is x? This is the probability you assign to the hypothesis being true before you obtain the new evidence. In this case, it means the probability you would have assigned to your friend breaking the shop window before you got the new information from her on the telephone about the evidence of the police officer. Well, you have known her for years, and it is totally out of character, although she does live just a stone’s throw from the shop, and was not at work that day, so she could have done it. Let’s say 5% (0.05). That’s just before you learn from her on the telephone about the witness evidence and the charge. Assigning the prior probability is fraught with problems, however, as awareness of the new information might easily colour the way you assess the prior information. You need to make every effort to estimate this probability as it would have been before you received the new information. You also have to be precise as to the point in the chain of evidence at which you establish the prior probability.
Once we’ve assigned these values, Bayes’ theorem can now be applied to establish a posterior probability. This is the number that we’re interested in. It is the measure of how likely is it that your friend broke the window, given that she’s been identified as the culprit by the police officer.
The calculation and the simple algebraic expression that we have identified is:
xy/[xy+z(1-x)]
where x is the prior probability of the hypothesis (she’s guilty) being true.
where y is the probability the police officer identifies her conditional on the hypothesis being true, i.e. she’s guilty.
where z is probability the police officer identifies her conditional on the hypothesis not being true, i.e. she’s not guilty.
In our example, x = 0.05, y = 0.8, z = 0.15
The rest is simple arithmetic.
xy = 0.05 x 0.8 = 0.04
z(1-x) = 0.15 x 0.95 = 0.1425
xy/[xy+z(1-x)] = 0.04/(0.04 + 0.1425) = 0.04/0.1825
Posterior probability = 0.219 = 21.9%
The most interesting takeaway from this is the relatively low probability you should assign to the guilt of your friend even though you were 80% sure that the police officer would get it right if she was guilty, and the small 15% chance you assigned that he would falsely identify her. The clue to the intuitive discrepancy is in the prior probability (or ‘prior’) you would have attached to the guilt of your friend before you were met face to face with the evidence of the police officer. If a new piece of evidence now emerges (say a second witness), you should again apply Bayes’ Theorem to update to a new posterior probability, gradually converging, based on more and more pieces of evidence, ever nearer to the truth.
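The arithmetic above, and the sequential updating just described, can be sketched in Python; the second witness’s 0.8 and 0.15 figures below are hypothetical, chosen only to show the posterior becoming the new prior:

```python
def bayes(x, y, z):
    """Bayes' rule: posterior = xy / [xy + z(1-x)]."""
    return x * y / (x * y + z * (1 - x))

# First witness: prior x = 0.05, y = 0.8, z = 0.15
p = bayes(0.05, 0.8, 0.15)
print(round(p, 3))  # 0.219

# A second, independent witness (hypothetical 0.8 and 0.15 figures):
# the posterior from the first update becomes the prior for the second.
p = bayes(p, 0.8, 0.15)
print(round(p, 3))
```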
It is, of course, all too easy to dismiss the implications of this hypothetical case on the grounds that it was just too difficult to assign reasonable probabilities to the variables. But that is what we do implicitly when we don’t assign numbers. Bayes’ rule is not at fault for this in any case. It will always correctly update the probability of a hypothesis being true whenever new evidence is identified, based on the estimated probabilities. In some cases, such as the crime case illustrated here, that is not easy, though the approach you adopt to revising your estimate will always be better than using intuition to steer a path to the truth.
In many other cases, we do know with precision what the key probabilities are, and in those cases we can use Bayes’ Rule to identify with precision the revised probability based on the new evidence, often with startlingly counter-intuitive results. In seeking to steer the path from ignorance to knowledge, the application of Bayes’ Theorem is always the correct method.
Thanks to Bayes, the path to the truth really is as easy as x,y,z. What remains is the wit and will to apply it.
Further Reading and Links
The most important idea in probability. Truth and justice depend on us getting it right. https://leightonvw.com/2014/12/13/this-is-probably-the-most-important-idea-in-probability-truth-and-justice-depends-on-us-getting-it-right/
A Visual Guide to Bayesian Thinking. YouTube. https://youtu.be/BrK7X_XlGB8
Bayes’ Theorem and Conditional Probabilities https://brilliant.org/wiki/bayes-theorem/
The Monty Hall Problem is a famous, perhaps the most famous, probability puzzle ever to have been posed. It is based on an American game show, Let’s Make a Deal, first hosted by Monty Hall. It came to public prominence as a question quoted in a column penned by mega-intellect Marilyn vos Savant, in Parade magazine in 1990. The question itself is quite straightforward.
‘Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car: behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind all the doors, opens another door, say No. 3, which reveals a goat. He then says to you, “Do you want to switch to door No. 2?”’ This is not a strategic decision on his part based on knowing that you chose the car, in that he always opens one of the doors concealing a goat and offers the contestant the chance to switch. It is part of the rules of the game.
So should you switch doors?
Consider the probability that you chose the correct door the first time, i.e. No 1 is the door to a car. What is that probability? Well, clearly it is 1/3 in that you have three doors to choose from, all equally likely.
But what happens to the probability that Door No. 1 is the key to the car once Monty has opened one of the other doors?
This again seems quite straightforward. There are now two doors left unopened, and there is no way to tell behind which of these two doors lies the car. So the probability that Door 1 offers the star prize now that Door 2 (or else Door 3) has been opened would seem to be 1/2. So should you switch? Since the two remaining doors would seem to be equally likely paths to the car, it would seem to make no difference whether you stick with your original choice of Door 1 or switch to the only other door that is unopened.
But is this so?
Let’s think it through.
When you choose Door 1, there is a 1 in 3 chance that you have won your way to the car if you stick with it. There is a 2 in 3 chance that Door 1 leads to a goat.
On the other hand, whether or not Door 1 is the lucky door, the host is forced to open one of the remaining doors concealing a goat. He knows that. You know that. So he is introducing useful new information into the game.
Before he opened a door, there was a 2 in 3 chance that the lucky door was EITHER Door 2 or Door 3 (as there was a 1 in 3 chance it was Door 1). Now he is telling you that there is a 2 in 3 chance that the lucky door is EITHER Door 2 or Door 3 BUT it is not the door he just opened. So there is a 2 in 3 chance that it is the door he didn’t open. So, if he opened Door 2, there is a 2 in 3 chance that Door 3 leads to the car. Likewise, if he opened Door 3, it is a 2 in 3 chance that Door 2 leads to the car. Either way, you are doubling your chance of winning the car by switching from Door 1 (probability of car = 1/3) to whichever of the other doors he does not open (probability of car = 2/3).
It is because the host knows what is behind the doors, and because his actions are constrained by the fact that he can’t open the door to the car, that he introduces valuable new information. Because he can’t open the door to the car, he is forced to point to a door that isn’t concealing the car, increasing the probability that the door he doesn’t open is the lucky one (from 1/3 to 2/3).
If this is not intuitively clear, there is a way of making it more so. Let’s say there were 20 doors, with a car behind one of them and goats behind 19 of them. Now say we choose Door 1. This means that the probability that this is the winning door is 1 in 20. There is a 19 in 20 probability that one of the other doors conceals the car. Now Monty starts opening one door at a time, taking care not to reveal the car each time. After opening a carefully chosen 18 doors (chosen because they didn’t conceal a car), just one door remains. This could be the door to the car or your original choice of Door 1 could be the path to the car. But your original choice had an original probability of 1/20 of being the winning door. Nothing has changed that, because every time he opens a door he is sure to avoid opening a door leading to a car. So the chance that the door he leaves unopened points to the car is 19/20. So, by switching, you multiply the probability that you have won the car from 1/20 to 19/20.
If he didn’t know what lay behind the doors, he could inadvertently have opened the door to the car, so when he happens not to, this adds no new information save that he has randomly eliminated one of the doors. If he randomly opens 18 doors, not knowing what is behind them, and two doors now remain, they each offer a 1 in 2 chance of the car. So you might as well just flip a coin – and hope!
Even when it is explained this way, I find that many people find it impossible to grasp the intuition. So here’s the clincher.
Say I have a pack of 52 playing cards, which I lay face down. If you choose the Ace of Spades, you win the car. Every other playing card, you win nothing. Go on, choose one. This is now laid aside from the rest of the deck, still face down. The probability that the card you have chosen is the Ace of Spades is clearly 1/52.
Now I, as the host, know exactly where the Ace of Spades is. There is a 51/52 chance that it is somewhere in the rest of the deck, and if it is I know where. Now, I carefully turn over the cards in the deck one at a time, taking care never to turn over the Ace of Spades, until there is just one card left. What is the chance that the one remaining card from the deck is the Ace of Spades? It is 51/52, because I have carefully sifted out all the losing cards to leave just one card, the Ace of Spades. In other words, I have presented you with the one card out of the remaining deck of 51 that is the Ace of Spades, assuming that it was not the card you chose in the first place. The chance that the card you chose in the first place was the Ace of Spades is 1/52. So the card I have selected for you out of the remaining deck has a probability of 51/52 of being the Ace of Spades. So should you switch when I offer you the chance to give up your original card for the one that I have filtered out of the remaining 51 cards (taking care each time never to reveal the Ace of Spades)? Of course you should. And that’s what you should tell Monty Hall every single time. Switch!
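For the sceptical, the game is also easy to simulate. A Python sketch, assuming the standard rules (the host always opens a goat door and always offers the switch):

```python
import random

def play(switch):
    """One round of the game: returns True if the contestant wins the car."""
    car = random.randrange(3)
    choice = random.randrange(3)
    # The host opens a goat door. If the contestant picked the car, the door
    # left unopened for a switch is a random goat door; otherwise it must be
    # the car door, since the host can never reveal the car.
    if choice == car:
        unopened = random.choice([d for d in range(3) if d != choice])
    else:
        unopened = car
    return (unopened if switch else choice) == car

trials = 100_000
print(sum(play(True) for _ in range(trials)) / trials)   # ~2/3 when switching
print(sum(play(False) for _ in range(trials)) / trials)  # ~1/3 when sticking
```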
Appendix
In the standard description of the Monty Hall Problem, Monty can open door 1 or door 2 or door 3. The car can be behind door 1, door 2 or door 3. The contestant can choose any door.
We can apply Bayes’ Theorem to solve this.
D1: Monty Hall opens Door 1.
D2: Monty Hall opens Door 2.
D3: Monty Hall opens Door 3.
C1: The car is behind Door 1.
C2: The car is behind Door 2.
C3: The car is behind Door 3.
The prior probability that the car is behind any particular door is 1/3, i.e. P(C1) = P(C2) = P(C3) = 1/3.
Assume the contestant chooses Door 1 and Monty Hall opens one of the two remaining doors, never revealing the car (and choosing at random between them when both conceal goats).
The conditional probabilities of his opening Door 3, given the car being behind Door 1, Door 2 or Door 3 respectively, are as follows.
P(D3|C1) = 1/2 … as he is free to open Door 2 or Door 3, since the car is behind the contestant’s chosen door, Door 1, and he chooses between them at random.
P(D3|C3) = 0 … as he cannot open a door the car is behind (Door 3) or the contestant’s chosen door, so he must open Door 2.
P(D3|C2) = 1 … as he cannot open the door the car is behind (Door 2) or the contestant’s chosen door (Door 1), so he must open Door 3.
By the law of total probability, P(D3) = P(D3|C1).P(C1) + P(D3|C2).P(C2) + P(D3|C3).P(C3) = 1/2 x 1/3 + 1 x 1/3 + 0 x 1/3 = 1/2.
So, P(C1|D3) = P(D3|C1).P(C1) / P(D3) = (1/2 x 1/3) / (1/2) = 1/3
Therefore, there is a 1/3 chance that the car is behind the door originally chosen by the contestant (Door 1) when Monty opens Door 3.
But P(C2|D3) = P(D3|C2).P(C2) / P(D3) = (1 x 1/3) / (1/2) = 2/3
Therefore, there is twice the chance of the contestant winning the car by switching doors after Monty Hall has opened a door.
Further Reading and Links
Related blog post on leightonvw.com The Deadly Doors Problem: Monty Hall Plus. https://leightonvw.com/2014/11/27/the-four-doors-problem/
Related blog post on leightonvw.com Open the Box or Take the Money https://leightonvw.com/2011/11/25/open-the-box-or-take-the-money/
Wikipedia on the Monty Hall Problem. https://en.wikipedia.org/wiki/Monty_Hall_problem
The Monty Hall Problem. http://www.montyhallproblem.com/
Understanding the Monty Hall Problem. https://betterexplained.com/articles/understanding-the-monty-hall-problem/
The Monty Hall Problem. http://mathforum.org/dr.math/faq/faq.monty.hall.html
The Official Let’s Make a Deal Website: The Monty Hall Problem. http://www.letsmakeadeal.com/problem.htm
Probability and the Monty Hall Problem. Khan Academy. https://www.khanacademy.org/math/precalculus/prob-comb/dependent-events-precalc/v/monty-hall-problem
The Monty Hall Problem. Numberphile. YouTube. https://www.youtube.com/watch?v=4Lb-6rxZxx0
The Monty Hall Problem. YouTube. https://www.youtube.com/watch?v=mhlc7peGlGg
Testing out the Monty Hall Problem. YouTube. https://www.youtube.com/watch?v=o_djTy3G0pg
Monty Hall Problem. Singing Banana. YouTube. https://www.youtube.com/watch?v=njqrSvGz8Ps
Monty Hall II: Revenge of Monty Hall. Singing Banana. YouTube. https://www.youtube.com/watch?v=fYPXYzymUqI
Bayes’ Theorem and Conditional Probabilities. https://brilliant.org/wiki/bayes-theorem/
Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening.
A fair coin will be tossed on Sunday evening after she is put to sleep, to determine which experimental procedure to undertake: if the coin comes up heads, Beauty will be awakened and interviewed on Monday only. If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday. In either case, she will be awakened on Wednesday without interview and the experiment ends.
Any time Sleeping Beauty is awakened and interviewed, she is asked, “What is your belief now, as a percentage, in the proposition that the coin landed heads?”
What should Beauty’s answer be?
To one way of thinking about this, the answer is clear. The coin was tossed once prior to her awakening, however many times she is woken, whether once (if it landed heads) or twice (if it landed tails).
Since the fair coin was tossed just once, and no further information is obtained by Beauty at the time she is awoken and interviewed, the answer she should give should be 50 per cent, i.e. a 1 in 2 chance that the fair coin landed heads.
To another way of thinking about it, she is interviewed just once if it landed heads (on the Monday) but she is interviewed twice if it landed tails (on Monday and Tuesday). She does not know which day it is when she is woken and interviewed but from her point of view there are three possibilities. These are:
- It landed heads and it is Monday.
- It landed tails and it is Monday.
- It landed tails and it is Tuesday.
So there are three possibilities, of equal likelihood, and two of these involve the coin landing tails and just one for the coin landing heads. So the answer she should give should be 33.3 per cent, i.e. a 1 in 3 chance that the fair coin landed heads.
So which answer is correct? The world of probability is by and large divided into those who are adamant that she should go with ½ (the so-called ‘halfers’) and those who are equally adamant that she should go with 1/3 (the so-called ‘thirders’). Are they both right, are they both wrong, or somewhere in between?
A way that I usually advocate to resolve seemingly intractable probability paradoxes is to ask at what odds Beauty should be willing to place a bet.
So, if in this experiment Beauty is offered odds of 1.5 to 1 that the coin landed heads, should she take those odds? If the correct answer is a half, those odds are attractive as the correct odds should be 1 to 1 (evens). If the correct answer is a third, those odds are unattractive as the correct odds should be 2 to 1.
So what should Beauty do if offered odds of 1.5 to 1? Bet or decline the bet?
The simplest way to resolve this is to ask what would happen if she accepted the odds of 1.5 to 1 and placed a bet of £10 at each awakening. When the coin came up heads, she would be awoken just once, place the £10 bet and win £15. When the coin landed tails, however, she would be awoken twice and place two bets of £10, i.e. a total of £20, and lose both bets.
So her net outcome of this betting strategy would be a loss of £5.
This suggests that a half is the wrong answer as to the probability that the coin landed heads. At odds of 2 to 1, on the other hand, she would place £10 on the one occasion she was awoken when the coin came up heads, i.e. Monday, and would win £20. When the coin came up tails, she would lose £10 on the Monday and £10 on the Tuesday, i.e. £20 in total. Her expected outcome in this case would be to break even. This suggests that odds of 2 to 1 are the correct odds, which is consistent with a probability of 1/3. Some ‘Halfers’ argue that Beauty should be assigned a chip of half the value when the coin lands Tails compared with when it lands Heads, although she will be unaware of the value of the chip when she stakes it. In that case she would indeed break even by betting at even-money odds, but there seems no reasonable case for applying this arbitrary fix to the experiment.
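The betting test can be sketched in a few lines of Python (the helper name is mine), simulating Beauty staking £10 on Heads at every awakening, at given odds:

```python
import random

def average_profit(odds: float, runs: int, seed: int = 0) -> float:
    """Average net profit per experiment if Beauty stakes £10 on Heads
    at the given fractional odds at every awakening."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        if rng.random() < 0.5:     # Heads: one awakening, the bet wins
            total += 10 * odds
        else:                      # Tails: two awakenings, both bets lose
            total -= 20
    return total / runs

print(average_profit(1.5, 200_000))  # about -2.5: a losing strategy
print(average_profit(2.0, 200_000))  # about 0: the break-even odds
```

At 1.5 to 1 she loses on average £2.50 per experiment, while at 2 to 1 she breaks even, matching the reasoning above.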
Applying the ‘betting test’ to this problem, therefore, suggests that Beauty’s answer when she is woken up should be that there is a 1 in 3 chance that the coin landed heads when tossed after she was put to sleep on the Sunday.
But how can this be right, when the fair coin was tossed just once, and we know that the chance of a fair coin landing heads is ½? If this is the ‘prior probability’ Beauty should assign to the coin landing heads, and she is given no further information about what happened to the coin when she is woken and questioned, on what grounds should the probability she assigns change? The only information she acquires is that she has been woken and questioned, but she knew that would happen in advance, so this is not new information. Given she assigns a prior probability of ½ to the coin coming up heads, and she acquires no new information, it is perhaps difficult to see on what grounds she should change her opinion. The posterior probability she assigns (after she acquires all new information) should be identical to the prior probability, because she has acquired no new information after being put to sleep to change anything.
This is the kernel of the conundrum, and it is why there is a long-standing and ongoing debate between fervent so-called ‘Halfers’ and ‘Thirders.’
So the question is whether there is a correct answer, with one school of thought simply wrong, or whether there is no single correct answer and each school of thought is right only under one interpretation of the question.
It seems to me that there is, in fact, a straightforward answer, which resolves the problem. To see this, we need to identify the actual ‘prior probability’ that the coin tossed after Beauty goes to sleep is Heads.
This depends on the question we are seeking to answer, and what information is available to Beauty before she goes to sleep.
If she is simply told that a coin will be tossed after she goes to sleep, and nothing else, then her correct estimate that the fair coin will land on heads is ½. This is the answer to a simple question of how likely a fair coin is to land Heads with no conditions, i.e. the unconditional probability that the coin will land Heads is 1/2.
If she is given the additional information, however, that she will be woken just once if the coin lands Heads but twice if it lands Tails (albeit she will remember just one of the awakenings), then we are posing a very different question.
The new question she is being asked to answer is the probability that any given awakening resulted from the coin toss landing Heads. Since she has just one awakening when the coin lands Heads, but two awakenings when it lands Tails, the probability that any particular awakening occurred from a Heads flip is 1/3, i.e. the conditional probability that the coin landed Heads given any particular awakening is 1/3.
By extension, if she is told she will be woken 1,000 times if the coin lands Tails but only once if the coin lands Heads, then her correct estimate of the probability that any particular awakening resulted from the coin landing Heads is 1/1001.
So the ‘prior probability’ Beauty should assign to the chance of a coin landing Heads after any particular awakening is actually 1/3 within the terms of the experiment, even before she goes to sleep. It is true that she has access to no new information whenever she awakens, but that simply means that her ‘prior probability’ of being awakened by a Heads flip remains at 1/3 after she is woken. This is totally consistent with Bayesian reasoning which states the prior probability of an event will not change unless there is new information.
Given, therefore, that she assigns a prior probability of 1/3 to any particular awakening arising from a Heads flip, this should be the answer she gives whenever she awakens, and also before she goes to sleep.
So the paradox resolves to the question Beauty is being asked to answer. What is the probability that a fair coin will land Heads? Answer = ½. What is the probability that whenever she is woken this awakening has resulted from a Heads flip? Answer = 1/3. She is consistent in these answers both before she goes to sleep and whenever she wakes. In other words, because Beauty knows that she will correctly answer 1/3 whenever she is woken, given the rules of the experiment, of which she is aware, she will answer 1/3 before she goes to sleep.
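This counting argument is easy to verify with a quick simulation, tallying what fraction of all awakenings follow a Heads flip:

```python
import random

rng = random.Random(42)
heads_awakenings = 0
total_awakenings = 0
for _ in range(100_000):
    if rng.random() < 0.5:    # Heads: Beauty is woken once
        heads_awakenings += 1
        total_awakenings += 1
    else:                     # Tails: Beauty is woken twice
        total_awakenings += 2

fraction = heads_awakenings / total_awakenings
print(fraction)  # close to 1/3
```

Roughly one awakening in three follows a Heads flip, which is exactly the probability Beauty should attach to Heads at any given awakening.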
The resolution of the Sleeping Beauty Problem has implications for the so-called ‘anthropic principle’ more generally.
The ‘anthropic principle’ is the consideration that theories of the universe are constrained by the necessity to allow human existence, because our existence as conscious observers of the universe is a given. So any theory or model of the universe must have our existence as at least one possibility.
The simplest state of affairs would be a situation in which nothing had ever existed. This would also be the least arbitrary, and certainly the easiest to understand. Indeed, if nothing had ever existed, there would have been nothing to be explained. Most critically, it would solve the mystery of how things could exist without their existence having some cause. In particular, while it is not possible to propose a causal explanation of why the whole Universe exists, if nothing had ever existed, that state of affairs would not have needed to be caused. This is not helpful to us, though, as we know that in fact this Universe does exist.
In fact, we are faced with the fact that the positive and negative contributions to the cosmological constant cancel to 120-digit accuracy, yet fail to cancel beginning at the 121st digit. The cosmological constant must be zero to within one part in roughly 10^120 (and yet be nonzero), or else the universe either would have dispersed too fast for stars and galaxies to form, or would have collapsed upon itself long ago. How likely is this by chance? Essentially, it is the equivalent of tossing a coin and needing to get heads about 400 times in a row and achieving it. And that is just one constant that needs to be just right for galaxies and stars and planets and life to exist. There are quite a few others, independent of this, which have to be equally just right, but this I think sets the stage. This is sometimes called the fine-tuning argument.
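The coin-flip comparison is simple arithmetic: a chance of one part in 10^120 equals (1/2)^n for n = 120 x log2(10), which can be checked in one line:

```python
import math

# Number of consecutive heads with the same probability as
# a 1-in-10^120 chance: (1/2)^n = 10^-120  =>  n = 120 * log2(10)
flips = 120 * math.log2(10)
print(round(flips))  # 399, i.e. roughly 400 heads in a row
```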
The parallel with the Sleeping Beauty Problem is that Beauty knows she has been awakened and so any explanation of this must have that awakening as at least one possibility, just as any theory of the Universe must have our conscious state as one possibility.
In terms of modelling the Universe, we might pose two possible theories. In one, all the physical constants we observe today are explained. They were designed that way or they have to be that way for some unknown reason. The second theory is that there could have been countless trillions of different ways that the physical constants could have arranged themselves, and only one of these is consistent with the Universe (and us) existing.
For simplicity of exposition, let us assume that the two theories are otherwise equal in terms of empirical evidence, scientific rigour, and so on, but the general point stands whatever.
In other words, from the perspective of an observer outside the Universe, these theories would be equally likely. Heads or Tails. ½.
But we as conscious observers of our existence are like Sleeping Beauty when she wakes. From our perspective, there is only one chance in countless trillions that we would be asking the question if the second theory is correct, which means from our ‘anthropic’ perspective the chance that the first theory is correct (the constants were designed that way or have to be that way) is trillions of times more plausible.
This has, of course, very important scientific, philosophical and theological implications, which demonstrates the power and importance of the Sleeping Beauty Problem as more than just a simple mind-bender.
Let us within this context now tackle the criticism of those who reject the larger importance of this vanishingly small possibility of the physical constants randomly falling trillions to one in our favour, on the grounds that if it wasn’t so, we would not have been around to even ask the question. This take on the ‘anthropic principle’ sounds like a clever point, but in fact it is not. For example, it would be absolutely bewildering how I could have survived a fall from an aeroplane at 39,000 feet onto tarmac without a parachute, but it would still be a question very much in need of an answer. To say that I couldn’t have posed the question if I hadn’t survived the fall is no answer at all.
Others propose the argument that since there must be some initial conditions, the conditions which made the Universe and the life within it possible were just as likely to prevail as any others, so there is no puzzle to be explained.
But this is like saying that there are two people, Jack and Jill, who are arguing over whether Jill can control whether a fair coin lands heads or tails. Jack challenges Jill to toss the coin 400 times. He says he will be convinced of Jill’s amazing skill if she can toss heads followed by tails 200 times in a row, and she proceeds to do so. Jack could now argue that a head was equally likely as a tail on every single toss of the coin, so this sequence of heads and tails was, in retrospect, just as likely as any other outcome. But clearly that would be a very poor explanation of the pattern that just occurred. That particular pattern was clearly not produced by coincidence. Yet it’s the same argument as saying that it is just as likely that the initial conditions were just right to produce the Universe and life to exist as that any of the other pattern of billions of initial conditions that would not have done so. There may be a reason for the pattern that was produced, but it needs a more profound explanation than proposing that it was just coincidence.
A second example. There is one lottery draw, devised by an alien civilisation. The lottery balls, numbered from 1 to 49, are to be drawn, and the only way that we will escape destruction, we are told, is if the first 49 balls out of the drum emerge as 1 to 49 in sequence. The numbers duly come out in that exact sequence. Now that outcome is no less likely than any other particular sequence, so if it came out that way a sceptic could claim that we were just lucky. That would clearly be nonsensical. A much more reasonable and sensible conclusion, of course, is that the aliens had rigged the draw to allow us to survive, or else that the draw had to be that way because no other possible sequence of balls could physically emerge.
So the answer to the Sleeping Beauty Problem is that there is a 1/3 chance she is in the Heads world if she is awakened once when the coin lands Heads and twice when it lands Tails. If she is awakened a million times in the Tails world but just once in the Heads world, the chance she awakes to a Heads world is 1 in 1,000,001. The bigger question for humanity is which world we exist in: Heads (we had to exist) or Tails (there was effectively no chance that we would exist). I call that the Possibility Problem, and it is a problem which would seem to have a probabilistic solution.
Appendix
Using Bayes’ Theorem:
P(Heads | Wake up) = P(Wake up | Heads) . P(Heads) / P(Wake up)
If you adopt the Self-Sampling Assumption (SSA), you sample an observer at random from within whichever world you are in, so the prior probability of Heads remains 1/2.
So, P(Heads | Wake up) = 1 x 1/2 / 1 = 1/2
If you adopt the Self-Indication Assumption (SIA), you take into account that you are more likely to exist in a world with more beings (or more opportunities to experience) than in one with fewer. In this case, there are twice as many opportunities to experience waking up if the coin lands Tails as if it lands Heads, so the prior probability of a Heads awakening is 1/3.
So, P(Heads | Wake up) = 1 x 1/3 / 1 = 1/3
Further Reading and Links
Sections of this blog relating to the ‘anthropic principle’ and ‘fine-tuning’ have appeared in my related blog, ‘Why is there Something Rather than Nothing?’ Link at: https://leightonvw.com/2015/08/03/why-is-there-something-rather-than-nothing/
Bayes’ Theorem: The Most Powerful Equation in the World. Related blog. https://leightonvw.com/2017/03/12/bayes-theorem-the-most-powerful-equation-in-the-world/
Wikipedia entry on the Sleeping Beauty Problem https://en.wikipedia.org/wiki/Sleeping_Beauty_problem
The Sleeping Beauty Problem. By Julia Galef. YouTube.
Philosophy- Epistemology. The Sleeping Beauty Problem. By Michael Campbell. YouTube.
https://www.youtube.com/watch?v=5Cqbf86jTro
Probably Overthinking It: A Blog by Allen Downey
http://allendowney.blogspot.co.uk/2015/06/the-sleeping-beauty-problem.html
Wikipedia entry on the Anthropic Principle
https://en.wikipedia.org/wiki/Anthropic_principle
Wikipedia entry on Fine-Tuned Universe
https://en.wikipedia.org/wiki/Fine-tuned_Universe
Blog entry on The Vaughan Williams ‘Possibility Theorem’ and related applications.
https://wordpress.com/post/leightonvw.com/445
Derek Parfit, ‘Why anything? Why this? Part 1. London Review of Books, 20, 2, 22 January 1998, pp. 24-27.
https://www.lrb.co.uk/v20/n02/derek-parfit/why-anything-why-this
Derek Parfit, ‘Why anything? Why this? Part 2. London Review of Books, 20, 3, 5 February 1998, pp. 22-25.
https://www.lrb.co.uk/v20/n03/derek-parfit/why-anything-why-this
John Horgan, ‘Science will never explain why there’s something rather than nothing’, Scientific American, April 23, 2012.
http://www.johnpiippo.com/2012/04/krausss-much-ado-about-nothing.html
David Bailey, What is the cosmological constant paradox, and what is its significance? 1 January 2017.
http://www.sciencemeetsreligion.org/physics/cosmo-constant.php
David Albert, ‘On the Origin of Everything’, Sunday Book Review, The New York Times, March 23, 2012.
https://nicolaelogofatu.wordpress.com/2014/04/26/on-the-origin-of-everything/
Suppose that a family has two children. What is the probability that both are girls? Well, this is straightforward because there are four equally likely possibilities (assuming the chances of a boy and a girl are 50-50).
Let us assume that the two children are concealed from view, one behind a red curtain and one behind a yellow curtain.
Put like this, there are four possibilities:
- Boy behind both curtains.
- Boy behind red curtain and girl behind yellow curtain.
- Girl behind red curtain and boy behind yellow curtain.
- Girl behind both curtains.
So the probability that there is a girl behind both curtains = ¼.
This answers the first question. Given the information that a family has two children, the chance that both are girls is 1 in 4.
Now what if we are told that at least one of the children is a girl? This is like saying that there is at least one girl behind the curtains, possibly two.
This eliminates option 1, i.e. a boy behind both curtains, leaving three equally likely possibilities, only one of which is a girl behind both curtains. So the chance that there is a girl behind both curtains given that you know that there is a girl behind at least one curtain is 1 in 3.
This is equivalent to asking the probability that both children are girls if you know that at least one of the children is a girl. The answer is 1 in 3.
Now what if you are told that at least one of the children is a girl who has a chin? This adds little or no new information, insofar as presumably all (or the vast majority of) girls have a chin. So if I tell you that a family has two children, at least one of whom is a girl with a chin, I am giving you effectively no new information. So the probability that both children are girls, given that at least one is a girl, is still 1 in 3.
What if instead I tell you that one of the children is a girl called Florida? This is pretty much equivalent to telling you that the family has a daughter behind the red curtain, insofar as it is not just identifying that there is at least one girl in the family, but identifying who or where she is. When now asked the probability that there is a girl behind the yellow curtain, options 1 and 2 (above) disappear, leaving just option 3 (a girl behind the red curtain and a boy behind the yellow curtain) and option 4 (a girl behind both curtains). So the new probability, given the additional information which identifies or locates one particular girl in advance, is 1 in 2.
In other words, knowing that there is a girl behind the red curtain, or else knowing that her name is Florida, is like meeting her in the street with her parents, who introduce her. If you know they have another child at home, the chance it is a girl is 1 in 2. By meeting her, you have identified a feature particular to that individual girl, i.e. that she is standing in front of you and not at home (or behind the red curtain, or named Florida rather than simply possessed of a chin).
If, on the other hand, you meet a man in the pub who mentions his two children and you find out that at least one of them is a daughter, but nothing more than that, you are back to knowing that there is a girl behind at least one curtain, but not which, i.e. Options 2, 3 and 4 above. In only one of these equally likely options, i.e. Option 4, is there a girl behind both curtains, so the chance of the other child being a girl is 1 in 3.
So does it matter that the daughter has this unusual name? It does. If you know that the man in the pub has two children and at least one daughter, but nothing more, the chance his other child is a girl is 1 in 3. If instead he tells you that one of his children is a girl called Florida, you are left with (to all intents and purposes) just two options. His other child is either a boy or else a girl not called Florida, which is pretty much equivalent to saying his other child is a boy or a girl. So the probability that his other child is a girl is now effectively 1 in 2.
The different information sets can be compared to tossing a coin twice. The possible outcomes are HH, HT, TH, TT. If you already know there is ‘at least’ one head, that leaves HH, HT, TH. The probability that the remaining coin is a Tail is 2 in 3. If, on the other hand, you identify that the coin in your left hand is a Head, the probability that the coin in your right hand is a Head is now 1 in 2. It is because you have pre-identified a unique characteristic of the coin, in this case its location. Identifying the girl as Florida does the same thing. In terms of two coins it is like marking one of the coins with a blue felt tip pen. You now declare that there are two coins in your hands, and one of them contains a Head with a blue mark on it. Such coins are rare, perhaps as rare as girls called Florida. So you are now asked what the chance is that the other coin is Heads (without a blue felt mark). Well, there are two possibilities. The other coin is either Heads (almost surely with no blue felt mark on it) or Tails. So the chance the other coin is Heads is 1 in 2. Without marking one of the coins, to make it unique, the chance of the other coin being Heads is 1 in 3.
Put another way, there are four possibilities without marking one of the coins:
- Heads in left hand, Tails in right hand.
- Tails in left hand, Heads in right hand.
- Heads in both hands.
- Tails in both hands.
If you declare that at least one of the coins in your hands is Heads, this means the chance the other is Heads is 1 in 3. This is equivalent to declaring that one of the two children is a girl but saying nothing further. The chance the other child is a girl is 1 in 3.
Now if you identify one of the coins in some unique way, for example by declaring that Heads is in your left hand, the chance that Heads is also in your right hand is 1 in 2, not 1 in 3.
Similarly, declaring that one of the coins is a Heads marked with a blue felt tip pen, the chance that the other coin is Heads, albeit not marked with a blue felt tip, is 1 in 2. Marking the coin with the blue felt tip is like pre-identifying a girl (her name is Florida) as opposed to simply declaring that at least one of the children is some generic girl (for example, a girl with a chin).
In other words, there are four possibilities without identifying either child.
- Boy, Boy
- Girl, Girl
- Boy, Girl
- Girl, Boy
If at least one of the children is a girl, Option 1 disappears, and the chance the other child is a girl is 1 in 3.
If you identify one of the children, say a girl whom you name as Florida, it is like marking the Heads with blue felt tip or declaring which hand you are holding the coin in.
Your options now reduce to:
- Boy, Boy
- Boy, Girl named Florida
- Boy, Girl not named Florida
- Girl named Florida, Girl not named Florida.
Options 1 and 3 can be discarded, leaving Options 2 and 4. In this scenario, the chance that the other child is a girl (not named Florida) is 1 in 2. By pre-identifying one of the girls, Option 3 disappears, changing the probability that the other child is a girl from 1 in 3 to 1 in 2.
The new information changes everything.
So what is the probability of the family having two girls if you know that one of the two children is a girl, but no more than that? The answer is 1 in 3.
But what is the probability of the family having two girls if one of the two children is a girl named Florida? Armed with this new information, the answer is, to all intents and purposes, 1 in 2.
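Both of these cases can be verified by enumerating the four equally likely birth orders, as a quick sketch:

```python
from itertools import product

pairs = list(product("BG", repeat=2))   # (older, younger): BB, BG, GB, GG

# Case 1: at least one child is a girl, but she is not identified
with_girl = [p for p in pairs if "G" in p]
p_unidentified = with_girl.count(("G", "G")) / len(with_girl)

# Case 2: an identified child (say, the older) is a girl
older_girl = [p for p in pairs if p[0] == "G"]
p_identified = older_girl.count(("G", "G")) / len(older_girl)

print(p_unidentified)  # 1/3
print(p_identified)    # 1/2
```

Identifying which child is the girl eliminates one of the three remaining possibilities, moving the probability from 1/3 to 1/2.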
Another way to look at this is to consider a set of 4,000 families, each made up of two children. Choose a single unique identifier for each child, say age (it could equally be height or alphabetical order, anything that uniquely distinguishes one child from the other). 1,000 of these will be two boys, an older boy and a younger boy (BB); 1,000 will be two girls, an older girl and a younger girl (GG); 1,000 will be Boy-Girl, an older boy and a younger girl (BG); and 1,000 will be Girl-Boy, an older girl and a younger boy (GB). If you identify at least one of the children as a boy, there remain 3,000 families (1,000 BB, 1,000 BG, 1,000 GB). 2/3 of these families contain a girl, so the probability the other child is a girl is 2/3.
Now, add into the mix the fact that one girl in a thousand in your set of 4,000 families is named Florida, and there are no families with two daughters named Florida.
In this case, 1,000 of these will be two boys, an older boy and a younger boy (BB); 1 will be an older boy and a younger girl named Florida (BF); 1 will be an older girl named Florida and a younger boy (FB); 1 will be an older girl named Florida and a younger girl not named Florida (FG); 1 will be an older girl not named Florida and a younger girl named Florida (GF); 999 will be an older boy and a younger girl not named Florida (BG); 999 will be an older girl not named Florida and a younger boy (GB); and 998 will be an older girl not named Florida and a younger girl not named Florida (GG). There will be no families with both girls named Florida.
This can be summarised as (given that B is a boy, G is a girl not named Florida, F is a girl named Florida, and the sequence is older-younger):
1,000 BB; 1 BF; 999 BG; 1 FB; 0 FF; 1 FG; 999 GB; 1 GF; 998 GG.
Given that at least one child is a girl named Florida, 4 possible pairs remain:
BF; FB; FG; GF.
Of these, 2 contain a second girl as well as the girl named Florida:
FG and GF.
So, if 1 in 1,000 girls is named Florida and we are told that one of the two children in a family is a girl named Florida, what is the chance that the other child is a girl? It is 1/2.
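This counting argument can also be checked by simulation. The sketch below assumes, as above, that each child is independently a girl with probability 1/2 and that 1 in 1,000 girls is named Florida:

```python
import random

rng = random.Random(7)
florida_families = 0   # two-child families with at least one girl named Florida
other_is_girl = 0      # ...in which the other child is also a girl

for _ in range(2_000_000):
    children = []
    for _ in range(2):
        if rng.random() < 0.5:                       # the child is a girl
            children.append("F" if rng.random() < 0.001 else "G")
        else:                                        # the child is a boy
            children.append("B")
    if "F" in children:
        florida_families += 1
        other = children[1] if children[0] == "F" else children[0]
        if other != "B":                             # the other child is a girl
            other_is_girl += 1

ratio = other_is_girl / florida_families
print(ratio)  # close to 1/2
```

Even over millions of simulated families only a few thousand contain a Florida, but among those the other child is a girl close to half the time, not a third.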
Appendix
The solution to the ‘Girl Named Florida’ problem can be demonstrated using a Bayesian approach.
Let GG denote the event that both children are girls, and G the event that there is at least one girl in the family.
Let P(GG | 2 children) be the probability of two girls given there are two children.
Let P(GG | G) be the probability of two girls GIVEN THAT at least one is a girl.
Then, P(GG | 2 children) = 1/4
P(GG | G) = P(G | GG) . P(GG) / P(G) … by Bayes’ Rule
So P(GG | G) = 1 x 1/4 / (3/4) = 1/3
P(GG | 2 children, older child is a girl)
Now there are only two possibilities, GB and GG (older girl and younger boy, or older girl and younger girl), so the conditional probability of two girls given that the older child is a girl, P(GG | older child is G), is 1/2.
GIRL NAMED FLORIDA PROBLEM
P(GG | 2 children, at least one being a girl named Florida).
For any one child:
B = 1/2
G = 1/2 – x
GF (girl named Florida) = x
where x is the fraction of children who are girls named Florida.
Of families with at least one girl named Florida, there are the following possible combinations, with associated probabilities.
B GF = 1/2 x
GF B = 1/2 x
G GF = x (1/2 – x)
GF G = x (1/2 – x)
GF GF = x^2
Probability of two girls if at least one is a girl named Florida
= [G.GF + GF.G + GF.GF] / [G.GF + GF.G + GF.GF + B.GF + GF.B]
= [x(1/2 – x) + x(1/2 – x) + x^2] / [x(1/2 – x) + x(1/2 – x) + x^2 + x]
= (x – x^2) / (2x – x^2)
= x(1 – x) / [x(2 – x)]
= (1 – x) / (2 – x)
Assuming that Florida is not a common name, x approaches zero and the answer approaches 1/2.
So it turns out that the name of the girl is relevant information.
As x approaches 1/2, the answer converges on 1/3. For example, if we know that at least one child is a girl with a chin, x is close to 1/2 and the problem reduces to the standard P(GG | G) problem outlined above, i.e.
P(GG | G) = P(G | GG) . P(GG) / P(G) … by Bayes’ Theorem
So P(GG | G) = 1 x 1/4 / (3/4) = 1/3
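The algebra above, and its two limits, can be checked numerically with a short sketch (the function name is mine):

```python
def p_two_girls_given_florida(x: float) -> float:
    """P(GG | at least one girl named Florida), from the algebra above,
    where x is the fraction of children who are girls named Florida."""
    numerator = x * (0.5 - x) + x * (0.5 - x) + x ** 2      # G.GF + GF.G + GF.GF
    denominator = numerator + 0.5 * x + 0.5 * x             # ... + B.GF + GF.B
    return numerator / denominator

print(p_two_girls_given_florida(0.001))   # close to 1/2 for a rare name
print(p_two_girls_given_florida(0.499))   # close to 1/3 for a near-universal trait
```

The function reproduces (1 – x)/(2 – x) exactly, approaching 1/2 as the name becomes rare and 1/3 as the trait becomes universal.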
Further Reading and Links
Since records began in 1868, no clear favourite for the White House has lost, except in the case of the 1948 election, when 8 to 1 longshot Harry Truman defeated his Republican rival, Thomas Dewey.
We can now add 2016 to that list, thanks to Donald Trump, who has beaten 1 to 5 favourite, Hillary Clinton, to take the presidency. In so doing, he also defied the polls, the experts and the wisdom of crowds.
I have been tracking various forecasting methodologies and prognosticators over the past few months, right up to election day, and can confirm that the rout of conventional wisdom was almost total.
Odds on
On the morning of the election, the best price available about Hillary Clinton was 2 to 7, equal to an implied win probability of about 78%. The spread betting markets made her a little over an 80% favourite, and gave her a head start over Trump of more than 80 electoral votes. The PredictIt prediction market assigned her a 79% chance of victory, and estimated her likely advantage as 323 electoral votes to 215 for Trump. Meanwhile, the Predictwise crowd wisdom platform assessed her chance of winning at a solid 89%, compared to 75% by the Hypermind prediction site.
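As an aside on the arithmetic, fractional odds of a to b convert to an implied win probability of b/(a + b). A minimal helper (the name is mine) reproduces the figures quoted:

```python
def implied_prob(a: int, b: int) -> float:
    """Implied win probability from fractional odds of a to b
    (a bettor stakes b to win a)."""
    return b / (a + b)

print(f"2 to 7 on Clinton: {implied_prob(2, 7):.1%}")   # about 78%
print(f"1 to 5 favourite:  {implied_prob(1, 5):.1%}")   # about 83%
print(f"8 to 1 longshot:   {implied_prob(8, 1):.1%}")   # about 11%
```

Note that these implied probabilities ignore the bookmaker's margin, so they slightly overstate the market's true assessment.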
The polling aggregation services fared no better. The RealClearPolitics and HuffPost Pollster polling averages gave Hillary Clinton a lead of between 3% and 6%. The FiveThirtyEight platform, which removes bias from polls based on their previous performance, gave her a popular vote lead on the day of 3.6% and an electoral vote advantage of 67 over Trump. Her chance of winning was assessed as 71.9% based on this polling.
Perhaps the biggest failure of the night, however, was Sam Wang’s Princeton Election Consortium, which gave Clinton more than a 99% chance of victory. Still, it must be said that his topline figures (an electoral college advantage of 307 to 231 for Clinton, and 2.5% in the popular vote) were less far off than a number of the other forecasting methodologies.
The New York Times Upshot elections model, which bases its estimates on state and national polls, gave Clinton an 84% chance of victory, which they helpfully compared to the chance of an NFL kicker making a 38-yard field goal. About 16% of the time such kicks miss. That, they suggested, was the same as the chance of Hillary Clinton losing.
Talking heads
Expert opinion was also woefully off. One of the most high-profile providers of expert political opinion is the Sabato Crystal Ball, run by Larry Sabato of the University of Virginia’s Center for Politics. This service has a very good track record. Yet, in line with the polls and the markets, the Crystal Ball got it badly wrong this time. Its final prediction was a win for Hillary Clinton by 322 electoral votes to 216.
Perhaps the most broad-based expert opinion survey, however, is provided by the PollyVote election forecasting service, which calls on its own panel of political experts to periodically update its forecast of the likely two-way vote share of the main candidates. The final expert panel survey, conducted on the eve of the election, put Clinton 4.4% up over Trump (52.2% to 47.8%).
In attempting to estimate the final vote share tallies of the candidates, PollyVote provides not just the estimates of experts, but also evidence gathered from a range of other methodologies, including prediction markets, poll aggregators, econometric models, citizen forecasts and index models. The idea is that aggregating and combining the wisdom of each and taking an average should provide a better estimate than any in isolation. It is a methodology which has served well over the past three election cycles.
This time, however, the combined approach broke down as badly as any of its component methods in isolation. Taking them in turn, the prediction market indicator (based on trading in the Iowa electronic markets) gave Hillary Clinton a lead of 54.6% to 45.4%. The poll aggregation metric, constructed using data from RealClearPolitics and HuffPost Pollster, gave Clinton the lead by 52% to 48%.
PollyVote also highlights the various econometric forecasting models available, which typically use variables such as growth, unemployment, incumbency, and so on, to provide an aggregated estimate. That estimate was, this time, quite successful, giving Clinton the advantage in the popular vote of 50.2% to 49.8%. Winning the popular vote is, however, not the same thing as winning the electoral college, as Democrats in particular have learned in recent years.
The final two methodologies used to make up the PollyVote forecast are index models, which use information about the candidates, and citizen forecasts, which ask people whom they expect to win. The index models this time gave Clinton the edge over Trump by 53.5% to 46.5%, and the citizen forecasts by 52.2% to 47.8%. Combining all these methodologies together produced an estimated advantage for Clinton over Trump of 52.5% to 47.5%.
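PollyVote's combination step is, at heart, just a simple average of the two-way vote-share estimates from each component method. A small sketch using the component figures quoted above (treating each component with equal weight, which is an assumption; PollyVote's exact weighting scheme is not described here):

```python
# Two-way Clinton vote-share estimates from each PollyVote component,
# as quoted in the text (percentages).
components = {
    "expert panel": 52.2,
    "prediction markets": 54.6,
    "poll aggregation": 52.0,
    "econometric models": 50.2,
    "index models": 53.5,
    "citizen forecasts": 52.2,
}

# The combined forecast is a simple mean of the component estimates.
clinton_share = sum(components.values()) / len(components)
trump_share = 100 - clinton_share

# The mean comes out at about 52.45, within rounding of the
# published combined figure of 52.5 to 47.5.
print(f"Clinton {clinton_share:.1f} : Trump {trump_share:.1f}")
```

The appeal of this design is that errors in the individual methods should partially cancel; 2016 showed its weakness, which is that when all the components share the same underlying bias, averaging them preserves it.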
The bottom line, therefore, is that most of the tried and tested forecasting methodologies failed this time. Election 2016 truly demonstrated, on a grand scale, the madness of crowds, polls and experts.
Further Reading and Links
