Bayes’ Theorem: The Most Powerful Equation in the World.
Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.
How should we change our beliefs about the world when we encounter new data or information? This is one of the most important questions we can ask. A theorem bearing the name of Thomas Bayes, an eighteenth-century clergyman, is central to the way we should answer this question.
The original presentation of the Reverend Thomas Bayes’ work, ‘An Essay toward Solving a Problem in the Doctrine of Chances’, was given in 1763, after Bayes’ death, to the Royal Society, by Bayes’ friend and confidant, Richard Price.
In explaining Bayes’ work, Price proposed as a thought experiment the example of a person who enters the world and sees the sun rise for the first time. As this person has had no opportunity to observe the sunrise before (perhaps he has spent his life to that point entombed in a dark cave), he is not able to decide whether this is a typical or unusual occurrence. It might even be a unique event. Every day that he sees the same thing happen, however, the degree of confidence he assigns to this being a permanent aspect of nature increases. His estimate of the probability that the sun will rise again tomorrow as it did yesterday and the day before, and so on, gradually approaches, although never quite reaches, 100 per cent.
The Bayesian viewpoint is just like that: the idea that we learn about the world and everything in it through a process of gradually updating our beliefs, edging incrementally ever closer to the truth as we obtain more data, more information, more evidence.
As such, the perspective of Reverend Bayes on cause and effect is essentially different to that of philosopher David Hume, the logic of whose argument on this issue is contained in ‘An Enquiry Concerning Human Understanding,’ published in 1748. According to Hume, we cannot justify our assumptions about the future based on past experience unless there is a law that the future will always resemble the past. No such law exists. Therefore, we have no fundamentally rational support for believing in causation. For Hume, therefore, predicting that the sun will rise again after seeing it rise a hundred times in a row is no more rational than predicting that it will not. Bayes instead sees reason as a practical matter, in which we can apply the laws of probability to the issue of cause and effect.
To Bayes, therefore, rationality is a matter of probability, by which we update our predictions based on new evidence, thereby edging closer and closer to the truth. This is called Bayesian reasoning. According to this approach, probability can be seen as a bridge between ignorance and knowledge. The particularly wonderful thing about the world of Bayesian reasoning is that the mathematics of it is so simple. Bayes’ Theorem is in this way concerned with conditional probability. It tells us the probability, or updates the probability, that a theory or hypothesis is true given that some event has taken place, that some new evidence has been observed. The problem with intuition is that people are not naturally probability thinkers, but instead are cause-and-effect thinkers. We have to be trained to think in a Bayesian way about the world.
Essentially, Bayes’ Theorem is just an algebraic expression with three known variables and one unknown. It is true by construction. Yet this simple formula is the foundation stone of that bridge I referred to between ignorance and knowledge, which can lead to important predictive insights. As noted, it allows us to update the probability that a theory or hypothesis is true when some new evidence comes to light, based on the probability we attach to the theory or hypothesis being true before the new evidence is known.
There are three things a Bayesian needs to estimate.
- A Bayesian’s first task is to assign a starting point probability to a hypothesis being true, before some new evidence arises. This is known as the ‘prior’ probability. Let’s assign the letter ‘a’ to this.
- A Bayesian’s second task is to estimate the probability that the new evidence would have arisen if the hypothesis was true. Let’s assign the letter ‘b’ to this.
- A Bayesian’s third task is to estimate the probability that the new evidence would have arisen if the hypothesis was false. Let’s assign the letter ‘c’ to this.
Based on these three probability estimates, Bayes’ Theorem offers a way to calculate the revised probability of the hypothesis being true given the new evidence. The notable point about it is that the equation is true as a matter of logic. The result it produces will therefore be as accurate as the values inputted into the equation. The formula is also so straightforward it can be jotted on the back of your hand.
The formula for Bayes’ Theorem can be represented as:
Updated (posterior) probability given new evidence = ab / [ab + c(1-a)]
Essentially, then, Bayesian updating is a straightforward solution to the problem of how to combine pre-existing (prior) beliefs with observed new evidence. The solution is essentially to combine the probabilities together. To do this properly, we use Bayes’ Theorem. It is of particular use when we have a conditional probability of two events, and we are interested in the reversed conditional probability. For example, when we have P (A given B) and want to find P (B given A).
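The updating rule above can be sketched in a few lines of Python (the function name `bayes_update` is my own, purely for illustration):

```python
def bayes_update(a, b, c):
    """Posterior probability that a hypothesis is true, given new evidence.

    a: prior probability the hypothesis is true
    b: probability of the evidence arising if the hypothesis is true
    c: probability of the evidence arising if the hypothesis is false
    """
    return (a * b) / (a * b + c * (1 - a))

# A 50% prior, with evidence twice as likely under the hypothesis as not,
# updates to a posterior of 2/3.
print(bayes_update(0.5, 0.8, 0.4))
```

Notice that when b and c are similar (the evidence is about as likely either way), the posterior stays close to the prior; the evidence only moves our belief sharply when b and c differ sharply.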
The key contributions of Bayesian analysis to our understanding of the world are threefold.
- Bayes’ Theorem makes clear the importance not just of new evidence but also the (prior) probability that the hypothesis was true before the new evidence was observed. This prior probability is often given too little weight compared to the new evidence in common intuition about probability. Bayes’ Theorem makes the prior probability explicit and shows how much weight to attach to it.
- Bayes’ Theorem allows us a way to calculate the updated probability based on the prior probability that the hypothesis is true and the probability of the new evidence arising given that the hypothesis is true and also given that the hypothesis is false.
- Bayes’ Theorem shows that the probability that a hypothesis is true given the evidence is not equal to the probability of the evidence arising given that the hypothesis is true. Put another way, P (H given E) does not equal P (E given H).
Often the conclusions it generates are highly counter-intuitive, but that’s because the world is in many ways a counterintuitive place. Accepting that fact is the first step towards mastering life’s logical maze.
In summary, intuition lets us down because our in-built judgment of the weight we should attach to new evidence tends to be skewed, not least against pre-existing evidence. New evidence also tends to colour our perception of the pre-existing evidence. Moreover, we tend to see evidence that is consistent with something being true as evidence that it is actually true. Bayes’ Theorem is the map that helps guide us through this maze.
Appendix
Bayes’ Theorem consists of three variables.
a is the prior probability of the hypothesis being true (the probability we attach before new evidence arises). In traditional notation, this is represented as P(H).
b is the probability that the new evidence would arise if the hypothesis is true. In traditional notation, this is represented as P(E|H). We use the notation P(A|B) to represent the probability of A given B, i.e. the probability of A given that B has occurred.
c is the probability that the new evidence would arise if the hypothesis is not true. In traditional notation, this is represented as P(E|H’). H’ is the notation for H not being true.
(1-a) is the prior probability that the hypothesis is not true. In traditional notation, this is represented as P(H’). It is derived from 1 - P(H), i.e. 1 minus the probability that the hypothesis is true.
Using this notation, the probability that a hypothesis is true given some new evidence (‘Posterior Probability’) = ab / [ab + c(1-a)].
Bayes’ Theorem can be derived from the equation P(H|E) P(E) = P(H) P(E|H), by dividing both sides by P(E). The intuition underlying this is that both sides of the equation are equal to the combined probability of the evidence arising and the hypothesis being true, P(H and E). They are two ways of looking at the same thing.
In particular, P(H|E) P(E) is the probability of a hypothesis being true given the evidence times the probability of the evidence. This is logically equivalent to P(H) P(E|H), which is the probability of a hypothesis being true times the probability of the evidence given that the hypothesis is true.
So, P(H|E) P(E) = P(H) P(E|H)
Dividing the left and right sides of the equation by P(E),
P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
Since P(E) = P(E|H) P(H) + P(E|H’) P(H’), this can be expanded to:
P(H|E) = P(H) P(E|H) / [P(H) P(E|H) + P(E|H’) P(H’)] … Bayes’ Theorem
This is equivalent to the formula:
Posterior probability = ab / [ab + c(1-a)], where a = P(H); b = P(E|H); c = P(E|H’)
Technical Proof
We write the conditional probability of A given B as P (A∣B) and define it as the probability that A has occurred, given that B has occurred.
The probability that A and B have both occurred is the conditional probability of A given B multiplied by the probability that B has occurred.
P(A∩B) = P (A∣B) P(B)
Hence:
P (A∣B) = P(A∩B) / P(B)
Similarly,
P(A∩B) = P (B∣A) P(A)
Hence:
P (B∣A) = P (A∩B) / P(A)
So:
P (A∣B) P (B) = P (A∩B) = P (B∣A) P(A), which is sometimes called the product rule for probabilities.
Dividing both sides by P(A) (which we take to be non-zero), the result follows:
P (B∣A) = P (A∣B) P(B) / P(A)
Where A represents the evidence, and B represents the hypothesis being true, this becomes:
P (H∣E) = P (E∣H) P(H) / P(E) … Bayes’ Formula
Now, P(E) = P(E|H) P(H) + P(E|H’) P(H’), where P(H’) represents the probability that the hypothesis is not true, i.e. P(H’) = 1 - P(H)
In traditional notation, the Prosecutor’s Fallacy is the fallacy of treating the probability of a hypothesis being true given the evidence, P(H|E), as the same thing as P(E|H), the probability of the evidence arising given that the hypothesis is true. In fact, P(H|E) = P(H) P(E|H) / P(E).
Examples
Is the probability that a selected card is the Ace of Spades (the hypothesis) given the evidence (it is a black card) equal to the probability it is a black card given that it is the Ace of Spades?
In this example, P(H|E) = 1/26 (the probability the hypothesis is true given the evidence), since there is one Ace of Spades among the 26 black cards.
However, the probability of observing the evidence (a black card) given that the hypothesis is true (it is the Ace of Spades) is P(E|H) = 1, since it is certain to be a black card if it is the Ace of Spades.
So, P(H|E) = 1/26 is not equal to P(E|H) = 1.
There follow some examples to illustrate that P(H|E) P(E) does indeed equal P(H) P(E|H).
Example 1: Take a deck of 52 cards, 26 red and 26 black, including one Ace of Spades. We are testing the hypothesis that a chosen card is the Ace of Spades. The probability that the drawn card is the Ace of Spades (the hypothesis is true) given that the card is black (the evidence) = 1/26 (there are 26 black cards, one of which is the Ace of Spades).
So P(H|E) = 1/26
The proportion of black cards in the deck = 1/2. So P(E) = 1/2
So, P(H|E) P(E) = 1/26 × 1/2 = 1/52.
Now P(E|H) is the probability that the card is black given that it is the Ace of Spades. This is certain, as the Ace of Spades is a black card.
So P(E|H) = 1.
P(H) is the probability the card is the Ace of Spades before we know what colour it is. There are 52 cards in the deck, so P(H) = 1/52.
So P(H) P(E|H) = 1/52 × 1 = 1/52
So P(H|E) P(E) = P(H) P(E|H) – both equal 1/52 in this case.
Therefore, P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
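The arithmetic of Example 1 can be checked with a short Python sketch (variable names are my own):

```python
# Ace of Spades example: P(H|E) * P(E) should equal P(H) * P(E|H)
p_h_given_e = 1 / 26   # one Ace of Spades among 26 black cards
p_e = 1 / 2            # half the deck is black
p_h = 1 / 52           # one Ace of Spades in 52 cards
p_e_given_h = 1        # the Ace of Spades is certainly black

# Both sides of the product rule give the joint probability, 1/52
assert abs(p_h_given_e * p_e - p_h * p_e_given_h) < 1e-12

# Bayes' Theorem recovers P(H|E) from the other three quantities
print(p_h * p_e_given_h / p_e)  # 1/26
```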
Example 2: There are in this example just four cards in our deck: the Ace of Spades, Ace of Clubs, Ace of Diamonds and Ace of Hearts. We are testing the hypothesis that the selected card is the Ace of Spades. The prior probability of the Ace of Spades (AS) = 1/4, as it is one of the four cards in the deck. What is the posterior probability that it is the Ace of Spades given the evidence that the card is black?
P(H) = 1/4
P(E|H) = 1
P(E) = 1/2
So, P(H|E) = P(H) P(E|H) / P(E) = (1/4 × 1) / (1/2) = 1/2
Note that: P(H|E) P(E) = P(H) P(E|H)
P(H|E) = P(H) P(E|H) / P(E) … Bayes’ Theorem
Example 3: Two dice are thrown. The hypothesis is that two sixes will be thrown. The new evidence is that a six is thrown on the first die.
P(H) = 1/36
P(E|H) = 1 (for a double six, a six must be thrown on the first die).
P(E) = 1/6 (there is a 1 in 6 chance of throwing a six on the first die)
P(H|E) = posterior probability = P(E|H) P(H) / P(E) = (1 × 1/36) / (1/6) = 1/6 (there is a 1 in 6 chance of a double six if the first die lands on a six).
Note: P(H) P(E|H) = P(E) P(H|E) = 1/36
Note also: P(E) = P(H) P(E|H) + P(H’) P(E|H’) = 1/36 × 1 + 35/36 × 5/35 = 1/36 + 5/36 = 1/6
Similarly, Posterior probability = ab / [ab + c(1-a)] = 1/6
Note: c = P(E|H’) = 5/35 because if the dice do not land 6,6, so that the hypothesis is not true (H’), 35 equally likely outcomes remain (from 1,1 to 6,5), and a six on the first die occurs in 5 of them, i.e. 6,1; 6,2; 6,3; 6,4; 6,5.
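Because two dice have only 36 equally likely outcomes, the probabilities in Example 3 can be verified by brute-force enumeration. A quick Python check:

```python
from itertools import product

# All 36 equally likely outcomes of two dice
outcomes = list(product(range(1, 7), repeat=2))

hypothesis = [o for o in outcomes if o == (6, 6)]  # a double six
evidence = [o for o in outcomes if o[0] == 6]      # a six on the first die

p_h = len(hypothesis) / 36                 # 1/36
p_e = len(evidence) / 36                   # 6/36 = 1/6
# Every double six shows a six on the first die, so conditioning on the
# evidence leaves 6 outcomes, of which 1 is the double six
p_h_given_e = len(hypothesis) / len(evidence)  # 1/6

print(p_h, p_e, p_h_given_e)
```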
As for the likelihood that the sun will rise again, there is a way of estimating this, which was proposed by Pierre-Simon Laplace. What is known as Laplace’s Law gives us a rule-of-thumb way of calculating how likely it is that something that has happened before will happen again, whether it be the sun rising, your favourite team winning, or the bus arriving on time. Simply count the number of times it has happened in the past plus one (successes, S+1), and divide that by the number of opportunities there has been for it to happen plus two (trials, T+2). For a person emerging from a dark cave into the world for the first time, and watching the sun rise seven times, for example, the estimate that it will rise again is: (S+1)/(T+2) = (7+1)/(7+2) = 8/9 = 88.9%. Every time it rises again makes it even more likely that the pattern will be repeated, so that by the end of a year, the estimated probability goes up to (365+1)/(365+2) = 99.7%. And so on. The 1 and 2 in the Laplace equation, (S+1)/ (T+2), essentially represent the Bayesian ‘prior.’ The 1 and 2 can be replaced by any numbers in the same proportion, such as 5 and 10 or 10 and 20, depending on the weight we wish to assign to the prior probabilities (probabilities assigned before encountering new evidence).
Larger numbers (e.g. S+10, T+20) bias the estimate towards the assigned prior probability. So, (S+10)/ (T+20) after seven days updates to a probability of (7+10)/ (7+20) = 17/27 = 63.0%, compared to 88.9% for (S+1)/(T+2). Smaller numbers bias the estimate, therefore, towards the observed record. Another way of looking at this is that larger numbers indicate we are more confident in our baseline estimates and need more evidence to change our prior beliefs. Smaller numbers indicate that we are less sure about our beliefs and are more open to quickly updating our beliefs based on new evidence. In other words, learning takes place more quickly with smaller numbers in the Laplace equation.
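Laplace’s rule and the effect of the prior weights can be sketched in a few lines (the function name and default arguments are my own illustration):

```python
def laplace(successes, trials, prior_s=1, prior_t=2):
    """Rule-of-succession estimate: (S + prior_s) / (T + prior_t)."""
    return (successes + prior_s) / (trials + prior_t)

print(round(laplace(7, 7) * 100, 1))          # seven sunrises: 88.9
print(round(laplace(365, 365) * 100, 1))      # a year of sunrises: 99.7
print(round(laplace(7, 7, 10, 20) * 100, 1))  # heavier prior weights: 63.0
```

With the heavier prior (10 and 20), seven sunrises move the estimate only to 63.0%; with the default (1 and 2), the same evidence already gives 88.9%, illustrating how smaller prior weights let the observed record dominate sooner.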
Exercise
Question a. Write the Bayesian equation (using a, b and c) for deriving the posterior (updated) probability of a hypothesis being true after you encounter new evidence. Explain what a, b and c represent.
Question b. If P(H) is the probability that a hypothesis is true before the observation of new evidence (E), what is the updated (or posterior) probability of the hypothesis being true after the observation of the new evidence? Use the terms P(H), P(E|H), P(H|E), P(H’), P(E|H’) to construct the Bayesian equation using each of these terms. Note that P(E|H) is the probability of encountering the evidence given that the hypothesis is true. P(H’) is the probability that the hypothesis is not true. P(H|E) is the probability the hypothesis is true after encountering the evidence.
Question c. How do these terms relate to a, b and c in the Bayesian formula you have studied?
Question d. Is the probability that a hypothesis is true, given the evidence, P(H|E), equal to the probability of encountering the evidence, given that the hypothesis is true, P(E|H)? In other words, does P(H|E) = P(E|H)?
Question e. You are presented with two dice. One is fair, one is biased. The fair die (A) lands on all numbers (1 to 6) with equal probability. The biased die (B) lands on 6 with a 50% chance and each of the other numbers (1 to 5) with an equal 10% chance each.
Now, choose one die. You can’t tell by inspection whether it is the fair or the biased die. You now roll the die, and it lands on 6. What is the probability that the die you rolled is the biased die?
Question f. You are presented with two coins. One is fair, the other is weighted. The fair coin (Coin 1) lands on heads and tails with equal likelihood, the weighted coin (Coin 2) lands on tails with a 75% chance.
Now, choose one coin. You can’t tell by inspection whether it is the fair or the weighted coin. You select a coin and toss it, and it lands on tails. What is the probability that you tossed Coin 2 (the weighted coin)?
Some Reading and Links
Puga, J., Krzywinski, M. and Altman, N. (2015). Points of Significance: Bayes’ Theorem. Nature Methods, 12, 4, April, 277-278. https://www.nature.com/articles/nmeth.3335.pdf?origin=ppub
Hooper, M. (2013). Richard Price, Bayes’ Theorem and God. Significance, February, 36-39. https://www.york.ac.uk/depts/maths/histstat/price.pdf
Maths in a minute: The prosecutor’s fallacy. + plus magazine. https://plus.maths.org/content/maths-minute-prosecutor-s-fallacy
Lee, M. and King, B. (2017). Bayes’ Theorem: the maths tool we probably use every day. But what is it? The Conversation. April 23. https://theconversation.com/bayes-theorem-the-maths-tool-we-probably-use-every-day-but-what-is-it-76140
Ellerton, P. (2014). Why facts alone don’t change minds in our public debates. The Conversation. May 13. https://theconversation.com/why-facts-alone-dont-change-minds-in-our-big-public-debates-25094
Bayes’ Theorem. A Take Five Primer. An Iterative Quantification of Probability (2016). Corsair’s Publishing, March 24. http://comprehension360.corsairs.network/bayes-theorem-a-take-five-primer-fc7f7ade7abe
Bayes’ Theorem. Wikipedia. https://en.m.wikipedia.org/wiki/Bayes%27_theorem
Was the University of California, Berkeley, guilty of discrimination in their entry standards? This was a cause of concern in the early 1970s. To show what was behind the concern, we can highlight the admission figures for the Fall term of 1973. This shows that male applicants to the University were significantly more likely to be accepted than females.
        Applicants   Admitted
Men     8442         44%
Women   4321         35%
It looks pretty damning, until the admittance figures are broken down by department. Doing so revealed a paradox.
Dept.   Men: Applicants   Men: Admitted   Women: Applicants   Women: Admitted
A       825               62%             108                 82%
B       560               63%             25                  68%
C       325               37%             593                 34%
D       417               33%             375                 35%
E       191               28%             393                 24%
F       373               6%              341                 7%
In other words, a higher proportion of women were admitted to four of the six departments than men.
So what was going on? Those with statistical training soon realised that this was a simple example of Simpson’s Paradox. Simpson’s Paradox arises when different groups of frequency data are combined, revealing a different performance rate overall than is the case when examining a breakdown of the performance rate. Put another way, Simpson’s paradox is the appearance of trends within different groups which disappear when data for the groups are combined together.
In the case of Berkeley, a study published in 1975 by Bickel, Hammel and O’Connell, in ‘Science’ reached the conclusion that women tended to apply to the more competitive departments with low rates of admission, such as the English Department, while men tended to apply to less competitive departments with high rates of admission, such as engineering and chemistry. As such the University was not actively discriminating against women, at least not on the basis of the statistics used to make the charge.
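Pooling the six departments tabulated above reproduces the reversal. A quick Python sketch (treating the published percentages as exact rates):

```python
# (applicants, admission rate) per department, from the Berkeley table above
men = [(825, 0.62), (560, 0.63), (325, 0.37), (417, 0.33), (191, 0.28), (373, 0.06)]
women = [(108, 0.82), (25, 0.68), (593, 0.34), (375, 0.35), (393, 0.24), (341, 0.07)]

def pooled_rate(groups):
    """Overall admission rate: total admitted divided by total applicants."""
    admitted = sum(apps * rate for apps, rate in groups)
    applicants = sum(apps for apps, rate in groups)
    return admitted / applicants

men_rate = pooled_rate(men)      # roughly 0.445
women_rate = pooled_rate(women)  # roughly 0.303
print(round(men_rate, 3), round(women_rate, 3))
```

Even though women have the higher admission rate in four of the six departments, the pooled figures favour men, because women applied disproportionately to the departments with low admission rates.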
Ignorance of the implications of Simpson’s Paradox might also generate false conclusions in the case of medical trials.
Take the following drugs, and their success rate in medical trials over two different days.
        Drug A          Drug B
Day 1   63/90 = 70%     8/10 = 80%
Day 2   4/10 = 40%      45/90 = 50%
Overall, Drug A = 67% success rate; Drug B = 53% success rate.
But Drug B performs better on both days.
So which is the better drug? In the medical trials, I would certainly choose to be treated by Drug A. Others might differ, but I doubt they would persuade any reasonable judge of the outcome of the trials.
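The reversal can be confirmed directly from the daily figures. A short Python check (variable names are my own):

```python
# Daily trial results as (successes, patients)
drug_a = {"Day 1": (63, 90), "Day 2": (4, 10)}
drug_b = {"Day 1": (8, 10), "Day 2": (45, 90)}

def rate(successes, patients):
    return successes / patients

# Drug B wins on each individual day...
for day in ("Day 1", "Day 2"):
    assert rate(*drug_b[day]) > rate(*drug_a[day])

# ...yet Drug A wins once the days are combined
total_a = rate(sum(s for s, n in drug_a.values()), sum(n for s, n in drug_a.values()))
total_b = rate(sum(s for s, n in drug_b.values()), sum(n for s, n in drug_b.values()))
print(total_a, total_b)  # Drug A: 0.67, Drug B: 0.53
```

The driver is the unequal day sizes: Drug A’s strong 70% result came from 90 patients, while Drug B’s strong 80% result came from only 10.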
Take another example. In this trial, there are two groups: a control group of 240 patients who are supplied with a placebo drug, such as a sugar pill, which is known to have no effect on the illness under evaluation, and a test group of 240 patients who are supplied with the real drug. Each set of 240 patients is made up of four groups: Group A (elderly adults), Group B (middle-aged adults), Group C (young adults) and Group D (children).
Here are the results, with success rate measured by the proportion recovering from the illness within two days of taking the drug:
Those taking the placebo.
Group A: 20; Group B: 40; Group C: 120; Group D: 60
Success rates are:
Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%
Overall success rate for those taking the placebo = (2 + 8 + 48 + 18)/240 = 76/240 = 31.7% (each term is the group size multiplied by its success rate).
Those taking the real drug.
Group A: 120; Group B: 60; Group C: 20; Group D: 40
Success rates are:
Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%
Overall success rate for those taking the real drug = (18 + 18 + 12 + 18)/240 = 66/240 = 27.5%.
This compares with an overall success rate for those taking the placebo of 31.7%.
So the placebo, over the whole sample, produced a higher success rate than the real drug.
Breaking the numbers down by group, however, reveals a discrepancy.
For the placebo
Group A: 10%; Group B: 20%; Group C: 40%; Group D: 30%
For the real drug
Group A: 15%; Group B: 30%; Group C: 60%; Group D: 45%
So, in each individual group (elderly adults, middle-aged adults, young adults, children) the success rate is greater for those taking the real drug, although across the sample as a whole it is less.
How can we resolve the paradox?
The answer lies in the size and age distribution of each group, which differs between those who received the real drug and those who received the placebo. In this study, the group which received the placebo consists of a whole lot more young adults, for example, than the other groups, in contrast with the number taking the real drug. This is important because the natural recovery rates from this illness (as defined in the test) are normally higher in this demographic than the other groups, whether they receive the real drug or the placebo. Again, the elderly (whose recovery rates are normally lower than average) are much more heavily represented among those taking the real drug than the placebo.
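This resolution can be checked numerically. A Python sketch using the group sizes and success rates from the trial above (I take Group D of the drug arm as 40 patients, so that the arm totals 240 and the stated overall rate of 27.5% is reproduced):

```python
# Group sizes and success rates for each arm of the trial
sizes_placebo = {"A": 20, "B": 40, "C": 120, "D": 60}
rates_placebo = {"A": 0.10, "B": 0.20, "C": 0.40, "D": 0.30}
sizes_drug = {"A": 120, "B": 60, "C": 20, "D": 40}
rates_drug = {"A": 0.15, "B": 0.30, "C": 0.60, "D": 0.45}

def overall(sizes, rates):
    """Overall success rate: recoveries in every group over total patients."""
    total = sum(sizes.values())
    successes = sum(sizes[g] * rates[g] for g in sizes)
    return successes / total

# The real drug beats the placebo in every single group...
for g in "ABCD":
    assert rates_drug[g] > rates_placebo[g]

# ...but the placebo arm has the higher overall success rate
print(round(overall(sizes_placebo, rates_placebo), 3))  # 0.317
print(round(overall(sizes_drug, rates_drug), 3))        # 0.275
```

The placebo arm is weighted towards young adults (high natural recovery), and the drug arm towards the elderly (low natural recovery), which is exactly what flips the aggregate comparison.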
Take another example, from baseball. Across the 1995 and 1996 seasons, fans were divided between those who claimed Derek Jeter as the best-performing player and those who claimed that title for David Justice. It is easy to see why. Here are their batting averages.
                1995            1996            Combined
Derek Jeter     12/48 (.250)    183/582 (.314)  195/630 (.310)
David Justice   104/411 (.253)  45/140 (.321)   149/551 (.270)
Here we see that Jeter has the better overall batting average but Justice records a better average in each of the two years making up that overall average. To anyone conversant with Simpson’s Paradox this is nothing weird. It is certainly possible in theory for one player to score a better batting average in successive years than another, yet record a worse batting average overall. The case of Jeter and Justice is an example where the theory clearly shows up in practice.
Indeed, fast forward to 1997 and the paradox grows even stronger. In that year, Jeter averaged .291 (190/654), while Justice again scored a better average, .329 (163/495). So, in three successive years, Justice recorded a better average than Jeter. Over the whole period, though, the batting average for Derek Jeter was .300 (385/1284), superior to David Justice, on .298 (312/1046).
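The three-season reversal can be confirmed from the raw hits and at-bats. A short Python check:

```python
# (hits, at-bats) for 1995, 1996, 1997
jeter = [(12, 48), (183, 582), (190, 654)]
justice = [(104, 411), (45, 140), (163, 495)]

def avg(hits, at_bats):
    return hits / at_bats

# Justice has the better average in each individual season...
for (jh, ja), (dh, da) in zip(jeter, justice):
    assert avg(dh, da) > avg(jh, ja)

# ...but Jeter has the better average over the three seasons combined
j_total = avg(sum(h for h, a in jeter), sum(a for h, a in jeter))
d_total = avg(sum(h for h, a in justice), sum(a for h, a in justice))
print(round(j_total, 3), round(d_total, 3))
```

Jeter’s weakest season (.250 in 1995) came in only 48 at-bats, while Justice’s came in 411, so it drags Justice’s combined average down far more heavily.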
For those more familiar with cricket than baseball, let’s take the following example of two mythical matches played by Harold Larwood and Bill Voce.
First Match:
Harold Larwood takes 3 wickets while bowling but concedes 60 runs off his bowling (an average of 20 runs conceded per wicket).
Bill Voce takes 2 wickets while bowling but concedes 68 runs (an average of 34 runs conceded per wicket).
Second Match:
Harold Larwood takes 1 wicket and concedes 8 runs (an average of 8 runs conceded per wicket).
Bill Voce takes 6 wickets and concedes 60 runs (an average of 10 runs conceded per wicket).
Here, Larwood has the superior performance in both matches (20 runs conceded per wicket compared to Voce’s 34 per wicket, and 8 runs conceded per wicket compared to Voce’s 10 per wicket). Over the two matches combined, however, Larwood took 4 wickets for 68 runs (17 runs per wicket) while Voce did slightly better, taking 8 wickets for 128 runs (16 runs per wicket).
So who is the better baseball player? Who is the better bowler? Were the University of California, Berkeley, discriminating on the basis of gender? Which is the better drug? All of these questions are examples of Simpson’s Paradox.
Reference and links
P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975), Sex Bias in Graduate Admissions: Data from Berkeley, Science, 187, 398-404.
In William Shakespeare’s ‘Merchant of Venice’, potential suitors of young Portia are offered a choice of three caskets, one gold, one silver and one lead. Inside one of them is a miniature portrait of her. Portia knows it is in the lead casket.
Now, according to her father’s will, a suitor must choose the casket containing the portrait to win Portia’s hand in marriage. The first suitor, the Prince of Morocco, must choose from one of the three caskets. Each is engraved with a cryptic inscription. The gold casket reads, “Who chooseth me shall gain what many men desire.” The silver casket reads, “Who chooseth me shall get as much as he deserves.” The lead casket reads, “Who chooseth me must give and hazard all he hath”. He chooses the gold casket, hoping to find “an angel in a golden bed.” Instead, he finds a skull and a scroll inserted into the skull’s “empty eye.” The message he reads on the scroll says, “All that glisters is not gold.” The Prince beats a hasty exit. “A gentle riddance”, says Portia.

The next suitor is the Prince of Arragon. “Who chooseth me shall get as much as he deserves”, he reads on the silver casket. “I’ll assume I deserve the very best”, he declares, and opens the casket. Inside he finds a picture of a fool with a sharp dismissive note which says “With one fool’s head I came to woo, But I go away with two.”
Now let us think about a plot twist where Portia must open one of the other caskets and give Arragon a chance to switch his choice of casket if he wishes. She is not allowed to indicate where the portrait is, and in this case must open the gold casket (she knows the portrait is in the lead casket, so she can’t open that) and show it is not in there. She now asks the Prince whether he wants to stick with his original choice of the silver casket or switch to the lead casket.
Let us imagine that he believes that Portia has no better idea than he has of which casket contains the prize. In that case, should he switch from his original choice of the silver casket to the lead casket? Well, if Portia had no knowledge of the location of the portrait, she might have inadvertently opened the casket containing it, so her opening an empty casket does add new information: the silver and lead caskets are now equally likely, at 1/2 each, and switching confers no advantage. But if he knows that she is aware of the location of the portrait, her decision to open the gold casket and not the lead casket has doubled the chance that the lead casket contains the portrait compared to his original choice, other things equal. This is because there was just a one-third chance that his original choice (silver) was correct and a two-thirds chance that one of the other caskets (gold, lead) was correct. She is forced to eliminate the losing casket of the two (in this case, gold), so the two-thirds chance converges on the lead casket.
So should he switch to the lead casket or stay with the silver? It depends whether things actually are equal. In particular, it depends on how valuable any information contained in the inscriptions is. If he has little faith in the inscriptions to arbitrate, he should definitely switch and improve his chance of winning fair Portia’s hand from 1/3 to 2/3. If he thinks, however, that he has unlocked the secret from the inscriptions, the decision is more difficult. If so, he might stick with his choice in good conscience.
In summary, the key to the problem is the new information Portia introduced by opening a casket which she knew did not contain the portrait. By acting on this new information, the Prince can potentially improve his chance of correctly predicting which casket will reveal the portrait from 1 in 3 to 2 in 3 – by switching caskets when given the chance. Unless, that is, he has other information which makes the initial probabilities different to 1/3 for each casket, such as those cryptic inscriptions. If this information is potentially valuable, or at least if the Prince thinks so, that complicates matters!
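The stick-versus-switch probabilities are easy to verify by simulation, assuming Portia knows where the portrait is and always opens a losing casket. A quick Python sketch (the function is my own illustration):

```python
import random

def casket_game(switch, trials=100_000):
    """Simulate the plot twist: Portia opens a losing casket; suitor may switch."""
    wins = 0
    for _ in range(trials):
        caskets = ["gold", "silver", "lead"]
        prize = random.choice(caskets)
        choice = random.choice(caskets)
        # Portia opens a casket that is neither the suitor's choice nor the prize
        opened = random.choice([c for c in caskets if c != choice and c != prize])
        if switch:
            # Switch to the one remaining unopened casket
            choice = next(c for c in caskets if c != choice and c != opened)
        wins += (choice == prize)
    return wins / trials

print(casket_game(switch=False))  # close to 1/3
print(casket_game(switch=True))   # close to 2/3
```

This is, of course, the Monty Hall problem in Shakespearean dress: sticking wins about a third of the time, switching about two-thirds.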
Let us invent a little crime story in which you are a follower of Bayes and you have a friend in a spot of trouble. In this story, you receive a telephone call from your local police station. You are told that your best friend of many years is helping the police investigation into a case of vandalism of a shop window in a street adjoining the one where you know she lives. It took place at noon that day, which you know is her day off work. You had heard about the incident earlier but had no good reason at the time to believe that your friend was in any way linked to it.
She next comes to the telephone and tells you she has been charged with smashing the shop window, based on the evidence of a police officer who positively identified her as the culprit. She claims mistaken identity. You must evaluate the probability that she did commit the offence before deciding how to advise her. So the condition is that she has been charged with criminal damage; the hypothesis you are interested in evaluating is the probability that she did it. Bayes’ Theorem, of course, helps to answer this type of question.
There are three things to estimate. The first is the Bayesian prior probability (which we represent as ‘a’). This is the probability you assign to the hypothesis being true before you become aware of the new information. In this case, it means the probability you would assign to your friend breaking the shop window immediately before you got the new information from her on the telephone that she had been charged on the basis of the witness evidence.
The second is the probability that the new evidence would have arisen if the hypothesis was true (which we represent as ‘b’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend actually did break the window.
The third is to estimate the probability that the new evidence would have arisen if the hypothesis was false (which we represent as ‘c’). In this case, you need to estimate the probability of the police officer identifying your friend if your friend did NOT break the window.
According to Bayes’ Theorem, Posterior probability = ab/ [ab+c(1-a)]
So let’s apply Bayes’ Theorem to the case of the shattered shop window. Let’s start with a. Well, you have known her for years, and it is totally out of character, although she does live just a stone’s throw from the shop, and it is her day off work, so she could in principle have done it. Let’s say 5% (0.05). Assigning the prior probability is fraught with problems, however, as awareness of the new information might easily affect the way you assess the prior information. You need to make every effort to estimate this probability as it would have been before you received the new information. You also have to be precise as to the point in the chain of evidence at which you establish the prior probability.
What about b? This is the probability of the new evidence if the hypothesis was true. What is the hypothesis? That your friend broke the window. What is the new evidence? That the police officer has identified your friend as the person who smashed the window. So b is an estimate of the probability that the police officer would have identified your friend if she was indeed guilty. If she threw the brick, it’s easy to imagine how she came to be identified by the police officer. Still, he wasn’t close enough to catch the culprit at the time, which should be borne in mind. Let’s say that the probability he would have identified her given that she is guilty is 80% (0.8).
Let’s move on to c. This is the probability of the new evidence if the hypothesis was false. What is the hypothesis again? That your friend broke the window. What is the new evidence again? That the police officer has identified your friend as the person who did it. So c is an estimate of the probability that the police officer would have identified her if she was not the guilty party, i.e. a false identification. If your friend didn’t shatter the window, how likely is the police officer to have wrongly identified her when he saw her in the street later that day? It is possible that he would see someone of similar age and appearance, wearing similar clothes, and jump to the wrong conclusion, or he may just want to identify someone to advance his career. Let us estimate the probability as 15% (0.15).
Once we’ve assigned these values, Bayes’ theorem can now be applied to establish a posterior probability. This is the number that we’re interested in. It is the measure of how likely is it that your friend broke the window, given that she’s been identified as the culprit by the police officer and charged on the basis of this evidence.
Given these estimates, we can use Bayes’ Theorem to update our probability that our friend is guilty to 21.9%, despite assigning a reliability of 80% to the police officer’s identification.
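The arithmetic can be checked directly in a few lines of Python; this is just a minimal transcription of the formula above, with the example's estimates plugged in:

```python
def posterior(a, b, c):
    """Bayes' Theorem in the a, b, c form used here.

    a -- prior probability that the hypothesis is true
    b -- probability of the evidence if the hypothesis is true
    c -- probability of the evidence if the hypothesis is false
    """
    return (a * b) / (a * b + c * (1 - a))

# The shop-window example: 5% prior, 80% chance of a correct
# identification, 15% chance of a false one.
print(round(posterior(0.05, 0.8, 0.15), 3))  # 0.219
```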
The most interesting takeaway from this application of Bayes’ Theorem is the relatively low probability you should assign to the guilt of your friend even though you were 80% sure that the police officer would identify her if she was guilty, and the small 15% chance you assigned that he would falsely identify her. The clue to the intuitive discrepancy is in the prior probability (or ‘prior’) you would have attached to the guilt of your friend before you were met face to face with the charge based on the evidence of the police officer. If a new piece of evidence now emerges (say a second witness), you should again apply Bayes’ Theorem to update to a new posterior probability, gradually converging, based on more and more pieces of evidence, ever nearer to the truth.
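Chaining evidence works the same way: the posterior after the first witness becomes the prior for the second. In the sketch below, the second witness's reliability figures (70% if guilty, 10% if innocent) are purely illustrative assumptions, not taken from the story:

```python
def posterior(a, b, c):
    # a: prior; b: P(evidence | guilty); c: P(evidence | not guilty)
    return (a * b) / (a * b + c * (1 - a))

after_officer = posterior(0.05, 0.8, 0.15)         # ~0.219
# Hypothetical second witness: assumed 70% likely to identify her
# if guilty, 10% likely to do so if innocent.
after_second = posterior(after_officer, 0.7, 0.1)  # ~0.663
print(round(after_officer, 3), round(after_second, 3))
```

Each new, reasonably reliable identification pushes the probability of guilt sharply upwards from its low starting point, which is exactly the gradual convergence described above.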
It is, of course, all too easy to dismiss the implications of this hypothetical case on the grounds that it was just too difficult to assign reasonable probabilities to the variables. But that is what we do implicitly when we don’t assign numbers. Bayes’ Theorem is not at fault for this in any case. It will always correctly update the probability of a hypothesis being true whenever new evidence is identified, based on the estimated probabilities. In some cases, such as the crime case illustrated here, that is not easy, though the approach you adopt to revising your estimate will always be better than using intuition to steer a path to the truth.
In many other cases, we do know with precision what the key probabilities are, and in those cases we can use Bayes’ Theorem to identify with precision the revised probability based on the new evidence, often with startlingly counter-intuitive results. In seeking to steer the path from ignorance to knowledge, the application of Bayes is always the correct method.
Appendix
The calculation uses the simple algebraic expression that we have identified in this setting:
ab/[ab+c(1-a)]
a is the prior probability of the hypothesis (she’s guilty) being true. This is more traditionally represented by the notation P(H). In the example, a = 0.05.
b is the probability the police officer identifies her conditional on the hypothesis being true, i.e. she’s guilty. This is more traditionally represented by the notation P(E|H), i.e. the probability of E (the evidence) given that the hypothesis, H, is true. In the example, b = 0.8.
c is the probability the police officer identifies her conditional on the hypothesis not being true, i.e. she’s not guilty. This is more traditionally represented by the notation P(E|H’), i.e. the probability of E (the evidence) given that the hypothesis is false (H’). In the example, c = 0.15.
In our example, a = 0.05, b = 0.8, c = 0.15
Using Bayes’ Theorem, the updated (posterior) probability that the friend is guilty is:
ab/[ab + c(1 - a)] = 0.04/(0.04 + 0.1425) = 0.04/0.1825
Posterior probability = 0.219 = 21.9%
An entomologist spots what might be a rare category of beetle, due to the pattern on its back. In the rare category, 98% have the pattern. In the common category, only 5% have the pattern. The rare category accounts for only 0.1% of the population. How likely is the beetle to be rare?
Since only 5 per cent of the common beetles bear the distinctive pattern and 98 per cent of the rare beetles do, intuition would tell you that you have come across a rare insect when you espy the pattern. Bayes’ Theorem tells you something quite different.
To calculate just how likely the beetle is to be rare given that we see the pattern on its back, we apply Bayes’ Theorem.
Posterior probability = ab/ [ab+c(1-a)]
a is the prior probability of the hypothesis (beetle is rare) being true. b is the probability we observe the pattern given that the beetle is rare (hypothesis true). c is the probability we observe the pattern given that the beetle is not rare (hypothesis false).
In this case, a = 0.001 (0.1%); b = 0.98 (98%); c = 0.05 (5%).
So, updated probability = ab/ [ab+c(1-a)] = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.
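The same three-number form of Bayes' Theorem, transcribed into Python, reproduces the beetle result:

```python
def posterior(a, b, c):
    # a: prior P(rare); b: P(pattern | rare); c: P(pattern | common)
    return (a * b) / (a * b + c * (1 - a))

print(round(posterior(0.001, 0.98, 0.05), 4))  # 0.0192
```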
Why the counterintuitive result? Because so few of the population of all beetles are rare. Specifically, the prior probability that the beetle is rare is very small, and it would take a lot more evidence than that acquired to make a reasonable case for the beetle being rare.
So what is the probability that the beetle is rare given that we observe the distinctive pattern? In other words, what is the probability that the hypothesis (the beetle is rare) is true given the evidence (the pattern). That is 1.92 per cent. What is the probability that we will observe the distinctive pattern if the beetle is rare? In other words, what is the probability of observing the evidence (the pattern) if the hypothesis (the beetle is rare) is true. That is 98 per cent.
To conflate these, to believe these two concepts are the same, is to commit the classic Prosecutor’s Fallacy, i.e. to falsely equate the probability that the defendant is guilty given the observed evidence with the probability of observing the evidence given that the defendant is guilty. It’s a potentially very dangerous fallacy to commit, especially when you happen to be the defendant and the jury has never heard of the Reverend Thomas Bayes!
Appendix
We can also solve the Beetle problem using the traditional notation version of Bayes’ Theorem.
P(H|E) = P(E|H) · P(H) / [P(E|H) · P(H) + P(E|H’) · P(H’)]
In this case, P(H) = 0.001 (0.1%); P(E|H) = 0.98 (98%); P(E|H’) = 0.05 (5%).
So, P(H|E) = 0.98 × 0.001 / [0.98 × 0.001 + 0.05 × 0.999] = 0.00098 / (0.00098 + 0.04995) = 0.00098 / 0.05093 = 0.0192. So there is just a 1.92 per cent chance that the beetle is rare when the entomologist spots the distinctive pattern on its back.
Note also that P(H|E) = 0.0192, while P(E|H) = 0.98. The Prosecutor’s Fallacy is to conflate these two expressions.
December 21st, 2018 is the shortest day of the year, at least in the UK, located in the Northern hemisphere of our planet.
So does that mean that the mornings should start to get lighter after today (earlier sunrise), as well as the evenings (later sunset)? Not so, and there’s a simple reason for that. The length of a solar day, i.e. the period of time between the solar noon (the time when the sun is at its highest elevation in the sky) on one day and the next, is not 24 hours in December, but about 30 seconds longer than that.
For this reason, each solar day through December runs about 30 seconds over 24 hours, and the effect accumulates: by the end of the month, solar noon falls roughly 15 minutes later by a standard 24-hour clock than it did at the start.
Let’s say just for a moment that the hours of daylight (the time difference between sunrise and sunset) stayed constant through December. Since each solar day runs 30 seconds over 24 hours, a sunset timed at 3.50pm one day would fall at 30 seconds past 3.50pm the next day, and about ten days later the sun would not set until 3.55pm by the clock. So the sunset would actually get later through all of December and, for the same reason, so would the sunrise.
In fact, the sunset doesn’t get progressively later through all of December, because the hours of daylight shorten for about the first three weeks. Taken on its own, that shortening would make the sun set earlier and rise later each day.
These two things (the shortening hours of daylight and the extended solar day) work in opposite directions. The overall effect is that the sun starts to set later from a week or so before the shortest day, but doesn’t start to rise earlier till about a week or so after the shortest day.
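The interplay of the two effects can be illustrated with a toy model. All the numbers below are invented for illustration (a fixed 30-second daily drift in solar noon, and a parabolic daylight curve with its minimum on day 21); they are not real ephemeris data:

```python
# Toy model of December: solar noon drifts 30 s later each day,
# while daylight shrinks to a minimum at the solstice (day 21).
def solar_noon(day):
    return 12 * 60 + 0.5 * day             # minutes past midnight

def daylight(day):
    return 7 * 60 + 0.1 * (day - 21) ** 2  # minutes of daylight

def sunrise(day):
    return solar_noon(day) - daylight(day) / 2

def sunset(day):
    return solar_noon(day) + daylight(day) / 2

earliest_sunset = min(range(1, 32), key=sunset)
latest_sunrise = max(range(1, 32), key=sunrise)
print(earliest_sunset, latest_sunrise)  # 16 26
```

Even in this crude model, the earliest sunset falls several days before the solstice and the latest sunrise several days after it, which is exactly the asymmetry described above.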
So the old adage that the evenings will start to draw out, and the mornings get lighter, only after the end of the third week of December or so, is false. The evenings have already been drawing out for several days before the shortest day, and the mornings will continue to grow darker for several days after it.
There’s one other curious thing. The solar noon coincides with noon on our 24-hour clocks just four times a year. One of those days is Christmas Day! So set your clock to noon on December 25th, look up to the sky and you will see the sun at its highest point. Just perfect!
Links
http://www.timeanddate.com/astronomy/uk/nottingham
http://www.bbc.co.uk/news/magazine-30549149
http://www.rmg.co.uk/explore/astronomy-and-time/time-facts/the-equation-of-time
http://en.wikipedia.org/wiki/Solar_time
http://earthsky.org/earth/everything-you-need-to-know-december-solstice
The results of the US midterm elections are now largely in and they came as a shock to many seasoned forecasters.
This wasn’t the kind of shock that occurred in 2016, when the EU referendum tipped to Brexit and the US presidential election to Donald Trump. Nor the type that followed the 2015 and 2017 UK general elections, which produced a widely unexpected Conservative majority and a hung parliament respectively.
On those occasions, the polls, pundits and prediction markets got it, for the most part, very wrong, and confidence in political forecasting took a major hit. The shock on this occasion was of a different sort – surprise related to just how right most of the forecasts were.
Take the FiveThirtyEight political forecasting methodology, most closely associated with Nate Silver, famed for the success of his 2008 and 2012 US presidential election forecasts.
In 2016, even that trusted methodology failed to predict Trump’s narrow triumph in some of the key swing states. This was reflected widely across other forecasting methodologies, too, causing a crisis of confidence in political forecasting. And things only got worse when much academic modelling of the 2017 UK general election was even further off target than it had been in 2015.
How did it go so right?
So what happened in the 2018 US midterm elections? This time, the FiveThirtyEight “Lite” forecast, based solely on local and national polls weighted by past performance, predicted that the Democrats would pick up a net 38 seats in the House of Representatives. The “Classic” forecast, which also includes fundraising, past voting and historical trends, predicted that they would pick up a net 39 seats. They needed 23 to take control.
With almost all results now declared, those forecasts look pretty near spot on: the projected tally is a net gain of 40 seats by the Democrats. In the Senate, meanwhile, the Republicans were forecast to hold the Senate by 52 seats to 48. The final count is likely to be 53-47. There is also an argument that the small error in the Senate forecast can be accounted for by poor ballot design in Florida, which disadvantaged the Democrat in a very close race.
Some analysts currently advocate looking at the turnout of “early voters”, broken down by party affiliation, who cast their ballot before polling day. They argue this can be used as an alternative or supplementary forecasting methodology. This year, a prominent advocate of this methodology went with the Republican Senate candidate in Arizona, while FiveThirtyEight chose the Democrat. The Democrat won. Despite this, the jury is still out over whether “early vote” analysis can add any value.
There has also been research into the forecasting efficiency of betting/prediction markets compared to polls. This tends to show that the markets have the edge over polls in key respects, although they can themselves be influenced by and overreact to new poll results.
There are a number of theories to explain what went wrong with much of the forecasting prior to the Trump and Brexit votes. But looking at the bigger picture, which stretches back to the US presidential election of 1868 (in which Republican Ulysses S Grant defeated Democrat Horatio Seymour), forecasts based on markets (with one notable exception, in 1948) have proved remarkably accurate, as have other forecasting methodologies. To this extent, the accurate forecasting of the 2018 midterms is a return to the norm.
And the next president is …
But what do the results mean for politics in the US more generally? The bottom line is that there was a considerable swing to the Democrats across most of the country, especially among women and in the suburbs, such that the Republican advantage of almost 1% in the House popular vote in 2016 was turned into a Democrat advantage of about 8% this time. If reproduced in a presidential election, it would be enough to provide a handsome victory for the candidate of the Democratic Party.
The size of this swing, and the demographics underpinning it, were identified with a good deal of accuracy by the main forecasting methodologies. This success has clearly restored some confidence in them, and they will now be used to look forward to 2020. Useful current forecasts for the 2020 election include PredictIt, OddsChecker, Betfair and PredictWise.
Taken together, they indicate that the Democratic candidate for the presidency will most likely come from a field including Senators Kamala Harris (the overall favourite), Bernie Sanders, Elizabeth Warren, Amy Klobuchar, Kirsten Gillibrand and Cory Booker. Outside the Senate, the frontrunners are former vice-president, Joe Biden, and the recent (unsuccessful) candidate for the Texas Senate, Beto O’Rourke.
Whoever prevails is most likely to face sitting president, Donald Trump, who is close to even money to face impeachment during his current term of office. If Trump isn’t the Republican nominee, the vice-president, Mike Pence, and former UN ambassador Nikki Haley are attracting the most support in the markets. The Democrats are currently about 57% to 43% favourites over the Republicans to win the presidency.
With the midterms over, our faith in political forecasting, at least in the US, has been somewhat restored. The focus now turns to 2020 – and whether they’ll accurately predict the next leader of the free world, or be left floundering by the unpredictable forces of a new world politics.
Is it possible to be both alive and dead at the same time? This is the question central to the famous Schrödinger’s Cat thought experiment. In the version posed by Erwin Schrödinger, a cat is placed in an opaque box for an hour with a small piece of radioactive material which has an equal probability of decaying or not in that time period. If some radioactivity is detected by a Geiger counter also placed in the box, a relay releases a hammer which breaks a flask of hydrocyanic acid, killing the cat. If no radioactivity is detected, the cat lives. Before we open the box at the end of the hour, we estimate the chance that the radioactive material will decay and the cat will be dead at 50/50, the same as that it will be alive. Before we open the box, however, is the cat alive (and we don’t know it yet), dead (and we don’t know it yet), or both alive and dead (until we open the box and find out)?
Common sense would seem to indicate that it is either alive or dead, but we don’t know until we open the box. Traditional quantum theory suggests otherwise. The cat is both alive, with a certain probability, and dead, with a certain probability, until we open the box and find out, when it has to become one or the other with a probability of 100 per cent. In quantum terminology, the cat is in a superposition (two states at the same time) of being alive and dead, which only collapses into one state (dead or alive) when the cat is observed. This might seem absurd when applied to a cat. After all, surely it was either alive or dead before we opened the box and found out. It was simply that we didn’t know which. That may be true, when applied to cats. But when applied to the microscopic quantum world, such common sense goes out the window as a description of reality. For example, photons (the smallest measure of light) can exist simultaneously in both wave and particle states, and travel in both clockwise and anti-clockwise directions at the same time. Each state exists in the same moment. As soon as the photon is observed, however, it must settle on one unique state. In other words, the common sense that we can apply to cats we cannot apply to photons or other particles at the quantum level.
So what is going on? The traditional explanation as to why the same quantum particle can exist in different states simultaneously is known as the Copenhagen Interpretation. First proposed by Niels Bohr in the early twentieth century, the Copenhagen interpretation states that a quantum particle does not exist in any one state but in all possible states at the same time, with various probabilities. It is only when we observe it that it must in effect choose which of these states it exists as. At the sub-atomic level, then, particles seem to exist in a state of what is called ‘coherent superposition’, in which they can be two things at the same time, and only become one when they are forced to do so by the act of being observed. The total of all possible states is known as the ‘wave function.’ When the quantum particle is observed, the superposition ‘collapses’ and the object is forced into one of the states that make up its wave function.
The problem with this explanation concerns the fate of all these different states. By observing the object, we might reduce it to one of these states, but what has happened to the others? Where have they disappeared to?
This question lies at the heart of the so-called ‘Quantum Suicide’ thought experiment.
It goes like this. A man (not a cat) sits down in front of a gun which is linked to a machine that measures the spin of a quantum particle (a quark). If it is measured as spinning clockwise, the gun will fire and kill the man. If it is measured as spinning anti-clockwise, it will not fire and the man will survive to undergo the same experiment again.
The question is – will the man survive, and how long will he survive for? This thought experiment, proposed by Max Tegmark, has been answered in different ways by quantum theorists depending on whether or not they adhere to the Copenhagen Interpretation. In that interpretation, the gun will go off with a certain probability, depending on which way the quark is spinning. Eventually, by the laws of chance, the man will be killed, probably sooner rather than later. A growing number of theorists believe something else, however. They see both states (the particle is spinning clockwise and spinning anti-clockwise) as equally real, so there are two real outcomes. In one world, the man dies and in the other he lives. The experiment repeats, and the same split occurs. In one world there will exist a man who survives an indefinite number of rounds. In the other worlds, he is dead.
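Under the Copenhagen reading, each round is an independent 50/50 measurement, so the experimenter's survival chance shrinks geometrically. A two-line sketch makes the arithmetic explicit:

```python
# Chance of surviving n rounds of the thought experiment, assuming
# an even 50/50 spin measurement on each round.
def survival_probability(n):
    return 0.5 ** n

for n in (1, 5, 10, 20):
    print(n, survival_probability(n))
# After 20 rounds the survival chance is under one in a million.
```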
The difference between these alternative approaches is critical. The Copenhagen approach is to propose that the simultaneously existing states (for example, the quark that is spinning both clockwise and anti-clockwise simultaneously) exist in one world, and collapse into one of these states when observed. Meanwhile, the other states mysteriously disappear. The other approach is to posit that these simultaneously existing states are real states, and neither magically disappears, but branch off into different realities when observed. What is happening is that in one world, the particle is observed spinning clockwise (in the Quantum Suicide thought experiment, the man dies) and in the other world the particle is observed spinning the other way (and the man lives). Crucially, according to this interpretation both worlds are real. In other words, they are not notional states of one world but alternative realities. This is the so-called ‘Many Worlds Theory.’
Where is the burden of proof in trying to determine which interpretation of reality is correct? This depends on whether we take the one world that we can observe as the default position or the wave function of all possible states as represented in the mathematics of the wave function as the reality. Adherents to the Many Worlds position argue that the default is to go with what is described in the mathematics underpinning quantum theory – that the wave function represents all of reality. According to this argument, the minimal mathematical structure needed to make sense of quantum mechanics is the existence of many worlds which branch off, each of which contains an alternative reality. Moreover, these worlds are real. To say that our world, the one that we are observing, is the only real one, despite all the other possible worlds or measurement outcomes, has been likened to when we believed that the Earth was at the centre of the universe. There is no real justification, according to this interpretation, for saying that our branch of all possible states is the only real one, and that all other branches are non-existent or are ‘disappeared worlds.’ Put another way, the mathematics of quantum mechanics describes these different worlds. Nothing in the maths says that this world that we observe is more real than another world. So the burden of proof is on those who say it is. The viewpoint of the Copenhagen school is diametrically opposite. They argue that the hard evidence is of the world we are in, and the burden of proof is on those positing other worlds containing other branches of reality.
Which default position we choose to adopt will determine whether we are adherents of the Copenhagen or the ‘Many Worlds’ school.
For me personally, the logic of the argument points to the Many Worlds school. But to believe that they are right, and the Copenhagen school is wrong, seems kind of crazy, and totally counter-intuitive. In another world, of course, I’m probably saying the exact opposite.
Do we live in a simulation, created by an advanced civilisation, in which we are part of some sophisticated virtual reality experience? For this to be a possibility we can make the obvious assumption that sufficiently advanced civilisations will possess the requisite computing and programming power to create what philosopher Nick Bostrom termed such ‘ancestor simulations’. These simulations would be complex enough for the minds that are simulated to be conscious and able to experience the type of experiences that we do. The creators of these simulations could exist at any stage in the development of the universe, even billions of years into the future.
The argument around simulation goes like this. One of the following three statements must be correct.
a. That civilisations at our level of development always or almost always disappear before becoming technologically advanced enough to create these simulations.
b. That the proportion of these technologically advanced civilisations that wish to create these simulations is zero or almost zero.
c. That we are almost sure to be living in such a simulation.
To see this, let’s examine each proposition in turn.
a. Suppose that the first is not true. In that case, a significant proportion of civilisations at our stage of technology go on to become technologically advanced enough to create these simulations.
b. Suppose that the second is not true. In this case, a significant proportion of these civilisations run such simulations.
c. If both of the above propositions are not true, then there will be countless simulated minds indistinguishable to all intents and purposes from ours, as there is potentially no limit to the number of simulations these civilisations could create. The number of such simulated minds would almost certainly be overwhelmingly greater than the number of minds that created them. Consequently, we would be quite safe in assuming that we are almost certainly inside a simulation created by some form of advanced civilisation.
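The "overwhelmingly greater" step is easy to make concrete with invented numbers. Suppose, purely for illustration, that 1 in 1,000 civilisations reaches maturity and that each mature one runs a million ancestor simulations, each containing as many minds as the original population:

```python
f = 0.001      # assumed fraction of civilisations reaching maturity
N = 1_000_000  # assumed ancestor simulations per mature civilisation

# Per unsimulated population there are, on average, f * N simulated
# populations, so the share of minds that are simulated is:
share_simulated = (f * N) / (f * N + 1)
print(round(share_simulated, 4))  # 0.999
```

Even with a tiny survival rate, the simulated minds dominate as soon as f × N is large, which is the crux of the argument.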
For the first proposition to be untrue, civilisations must pass through the phase in which they could wipe themselves out, whether deliberately or by accident, carelessness or neglect, and never or almost never do so. This might perhaps seem unlikely based on our experience of this world, but becomes more likely if we consider all other possible worlds.
For the second proposition to be untrue, we would have to assume that virtually all civilisations that were able to create these simulations would decide not to do so. This again is possible, but would seem unlikely.
If we consider both propositions, and we think it is unlikely that no civilisations survive long enough to achieve what Bostrom calls ‘technological maturity’, and that it is unlikely that hardly any would create ‘ancestor simulations’ if they could, then anyone considering the question is left with a stark conclusion. They really are living in a simulation.
To summarise. An advanced ‘technologically mature’ civilisation would have the capability of creating simulated minds. Based on this, at least one of three propositions must be true.
a. The proportion of civilisations that survive to reach this technologically advanced state is zero or close to zero.
b. The proportion of these advanced civilisations that wish to run these simulations is close to zero.
c. The proportion of those consciously considering the question who are living in a simulation is close to one.
If the first of these propositions is true, we will almost certainly not survive to become ‘technologically mature.’ If the second proposition is true, virtually no advanced civilisations are interested in using their power to create such simulations. If the third proposition is true, then conscious beings considering the question are almost certainly living in a simulation.
Through the veil of our ignorance, it might seem sensible to assign equal credence to all three, and to conclude that unless we are currently living in a simulation, descendants of this civilisation will almost certainly never be in a position to run these simulations.
Strangely indeed, the probability that we are living in a simulation increases as we draw closer to the point at which we are able and willing to do so. At the point that we would be ready to create our own simulations, we would paradoxically be at the very point when we were almost sure that we ourselves were simulations. Only by refraining from doing so could we in a certain sense make it less likely that we were simulated, as it would show that at least one civilisation that was able to create simulations refrained from doing so. Once we took the plunge, we would know that we were almost certainly only doing so as simulated beings. And yet there must have been someone or something that created the first simulation. Could that be us, we would be asking ourselves? In our simulated hearts and minds, we would already know the answer!
In ‘The Merchant of Venice’, by William Shakespeare, Portia sets her suitors a problem to solve to find who is right for her. In the play, there are just three suitors and they are asked to choose between a gold, a silver and a lead casket, one of which contains a portrait which is the key to her hand in marriage.
Let us build a thought experiment around Portia’s quest for love in which she meets the successive suitors in turn. Her problem is when to stop looking and start choosing. To make the problem more general, let’s say she has 100 suitors to choose from. Each will be presented to her in random order and she has twenty minutes to decide whether he is the one for her. If she turns someone down there is no going back, but the good news is that she is guaranteed not to be turned down by anyone she selects. If she comes to the end of the line and has still not chosen a partner, she will have to take whoever is left, even if he is the worst of the hundred. All she has to go on in guiding her decision are the relative merits of the pool of suitors.
Let’s say that the first presented to her, whom we shall call No.1, is perfectly charming but she has some doubts. Should she choose him anyway, in case those to follow will be worse? With 99 potential matches left, it seems more than possible that there will be at least one who is a better match than No.1. The problem facing Portia is that she knows that if she dismisses No. 1, he will be gone forever, to be betrothed to someone else.
She decides to move on. The second suitor turns out to be far worse than the first, as does the third and fourth. She starts to think that she may have made a mistake in not accepting the first. Still, there are potentially 96 more to see. This goes on until she sees No. 20, whom she actually prefers to No. 1. Should she now grasp her opportunity before it is too late? Or should she wait for someone even better?
She is looking for the best of the hundred, and this is the best so far. But there are still 80 suitors left, one of whom might be better than No. 20. Should she take a chance? What is Portia’s optimal strategy in finding Mr. Right?
This is an example of an ‘Optimal Stopping Problem’, which has come to be known as the ‘Secretary Problem.’ In this variation, you are interviewing for a secretary, with your aim being to maximise your chance of hiring the single best applicant out of the pool of applicants. Your only criterion to measure suitability is their relative merits, i.e. who is better than whom. As with Portia’s Problem, you can offer the post to any of the applicants at any time before seeing any more candidates, but you lose the opportunity to hire that applicant if you decide to move on to the next in line.
This sort of stopping strategy can be extended to almost any search, including the search for a place to live, a place to eat, the choice of a used car, and so on.
In each of these cases, there are two ways you can fail to meet your goal of finding the best option out there: stopping too early and stopping too late. By stopping too early, you may never see the best option. By stopping too late, you have held out for a better option that turns out not to exist. So how do you find the right balance?
Let’s consider the intuition. The first option is automatically the best seen so far, and the second option (assuming we are taking the options in a random order) has a 50% chance of being the best yet. Likewise, the tenth option has only a 10% chance of being the best to that point. In general, the chance that any given option is the best so far is one divided by the number of options seen, so it declines as we go: encounters with a ‘best yet’ become rarer and rarer as we work through the process.
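This intuition is easy to check numerically. The sketch below (in Python, purely for illustration; the trial count and random seed are arbitrary choices) estimates by simulation the chance that the option in a given position of a random ordering is the best seen to that point, and compares it with one over that position:

```python
import random

random.seed(42)

def best_so_far_rate(position, trials=100_000):
    """Estimate the chance that the option at `position` (1-indexed)
    is the best of all options seen up to that point."""
    hits = 0
    for _ in range(trials):
        values = [random.random() for _ in range(position)]
        if values[-1] == max(values):
            hits += 1
    return hits / trials

for n in (1, 2, 10):
    print(n, round(best_so_far_rate(n), 3), "expected", round(1 / n, 3))
```

The simulated rates cluster around 1, 0.5 and 0.1 respectively, matching the one-in-n intuition.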
To see how we might best approach the problem, let’s go back to Portia and her suitors and look at her best strategy when faced with different-sized pools of suitors. Can she do better using some strategy other than choosing at some random position in the order of presentation to her? It can be shown mathematically that she can certainly expect to do better, given that there are more than two to choose from.
Let’s return to the original play where there are three suitors. If she chooses No. 1, she has no information with which to compare the relative merits of her suitors. On the other hand, by the time she reaches No. 3, she must choose him, even if he’s the worst of the three. In this way, she has maximum information but no choice. In the case of No. 2, she has more information than she did when she saw No. 1, as she can compare the two. She also has more control over her choice than she will if she leaves it until she meets No. 3.
So she turns down No. 1 to give herself more information about the relative merits of those available. But what if she finds that No. 2 is worse than No. 1? What should she do? It can in fact be shown that she should wait and take the risk of ending up with No. 3, as she must do if she leaves it to the last. On the other hand, if she finds that she prefers No. 2 to No. 1, she should choose him on the spot and forgo the chance that No. 3 will be a better match.
It can also be shown that in the three-suitor scenario, she will succeed in finding her best available match exactly half the time by selecting No. 2 if he is better than No. 1. If she chooses No. 1 or No. 3, on the other hand, she will only have met that aim one time in three.
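We can verify these figures exactly by enumerating all six possible orderings of three suitors. The Python sketch below (an illustration, not anything from the original text) scores three strategies: always take No. 1, always wait for No. 3, and the skip-then-choose rule just described:

```python
from itertools import permutations

def success_rate(strategy):
    """Fraction of the 6 orderings in which `strategy` picks the best suitor.
    Suitors are ranked 1 (worst) to 3 (best); `strategy` returns the chosen rank."""
    orders = list(permutations([1, 2, 3]))
    wins = sum(strategy(order) == 3 for order in orders)
    return wins / len(orders)

def skip_one(order):
    # Pass on No. 1, take the first later suitor who beats him; else take the last.
    for candidate in order[1:]:
        if candidate > order[0]:
            return candidate
    return order[-1]

print(success_rate(lambda o: o[0]))   # always take No. 1
print(success_rate(lambda o: o[-1]))  # always wait for No. 3
print(success_rate(skip_one))         # skip one, then choose
```

The enumeration confirms the text: one in three for the first two strategies, one in two for the skip-then-choose rule.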
If there are four suitors, Portia should use No. 1 as her benchmark, and select No. 2 if he is better than No. 1. If he is not, she should apply the same test to No. 3; and if No. 3 also fails to better No. 1, she must go to No. 4 and hope for the best. The same strategy can be applied to any number of people in the pool.
So, in the case of a hundred suitors, how many should she see purely to gain information before being willing to choose someone? It can, in fact, be demonstrated mathematically that the optimal point at which to turn from looking to leaping (her ‘stopping point’) is 37. She should meet the first 37 suitors, then choose the first of those who follow who is better than the best of the first 37. By following this rule, she will find the best of the princely bunch of a hundred with a probability, strangely enough, of about 37 per cent. By choosing randomly, on the other hand, she has a chance of just 1 in 100 (1 per cent) of settling upon the best.
This 37% stopping rule applies to any similar decision, such as the secretary problem or looking for a house in a fast-moving market, and it doesn’t matter how many options are on the table. You should always use the first 37 per cent as your baseline, and then select the first option to come along afterwards that is better than any of them.
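A short simulation illustrates the rule for a hundred candidates. The sketch below (in Python; the trial count and random seed are arbitrary choices) plays out the ‘see 37, then take the first improvement’ strategy many times and reports how often it lands on the single best candidate:

```python
import random

random.seed(7)

def look_then_leap(n=100, cutoff=37, trials=20_000):
    """Estimate the chance the 'see `cutoff`, then take the first improvement'
    rule lands on the single best of `n` randomly ordered candidates."""
    wins = 0
    for _ in range(trials):
        candidates = random.sample(range(n), n)   # ranks 0..n-1; n-1 is the best
        benchmark = max(candidates[:cutoff])
        choice = candidates[-1]                   # forced to take the last if no one beats it
        for c in candidates[cutoff:]:
            if c > benchmark:
                choice = c
                break
        wins += (choice == n - 1)
    return wins / trials

print(round(look_then_leap(), 3))   # close to 0.37, versus 0.01 for a random pick
```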
The mathematical proof is based on the mathematical constant e (sometimes known as Euler’s number), and specifically on 1/e, which can be shown to be the optimal stopping point along a range from 0 to 1, after which it is best to choose the first option that is better than any of those seen before. The value of e is approximately 2.71828, so 1/e is about 0.36788, or 36.788 per cent; this is simply rounded to 37 per cent in stating the stopping rule. It can also be shown that the chance that implementing this stopping rule will yield the very best outcome is itself equal to 1/e, i.e. about 37 per cent.
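For those who want to see where 37 comes from without the calculus: the chance of success when skipping the first r of n candidates and then taking the first improvement works out as (r/n) × (1/r + 1/(r+1) + … + 1/(n−1)), since the best candidate must sit after position r and the best of those before it must sit among the first r. Maximising this over r can be done by brute force, as in this short Python sketch:

```python
import math

n = 100

def p_success(r, n):
    """P(the skip-first-r rule picks the overall best of n candidates)
    = (r/n) * sum_{j=r}^{n-1} 1/j."""
    return (r / n) * sum(1 / j for j in range(r, n))

best_r = max(range(1, n), key=lambda r: p_success(r, n))
print(best_r, round(p_success(best_r, n), 4))   # optimal cutoff and its success chance
print(round(1 / math.e, 4))                     # 1/e for comparison
```

For n = 100 the optimal cutoff is indeed 37, with a success probability of about 0.371, agreeing with the 1/e limit of roughly 0.3679.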
If there is a chance that your selection might turn out to be unavailable, the rule can be adapted to give a different cut-off, but the principle remains. For example, if there is a 50% chance that your chosen option will turn you down, the 37% rule becomes a 25% rule; the rest of the strategy stays the same. Following it gives you a 25% chance of finding the best of the options, compared with 37% when your choice is always accepted. That is still far better than the 1 per cent chance of picking the best of a hundred options at random. The lower figure (25% rather than 37%) reflects the additional uncertainty introduced by the fact that your choice might not be final. There are other variations on the same theme, in which you may go back to an option you previously passed over, with some probability that it is still available. Take the case, for example, where an immediate proposal is certain to be accepted but a belated proposal is accepted half of the time: the cut-off proportion in that scenario rises to 61% as the possibility of going back becomes real.
There is also a rule of thumb that can be derived when the aim is to maximise the chance of selecting a good option, if not necessarily the very best; it has the advantage of reducing the chance of ending up with one of the worst options. This is the square root rule, which simply replaces the 37% criterion with the square root of the number of options available. In the case of Portia’s choice, she would meet the first ten of the hundred (since the square root of 100 is 10) and choose the first of the remaining 90 who is better than the best of those ten. Whatever variation you adopt, the numbers change but the principle stays the same.
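The trade-off between the two rules can be illustrated by simulation. The Python sketch below (trial counts and seed are arbitrary choices) compares the 37 per cent rule and the square root rule on two scores: how often each lands the very best of a hundred options, and how often each ends up with one of the ten worst:

```python
import random

random.seed(11)

def pick(candidates, cutoff):
    # Look-then-leap: benchmark on the first `cutoff`, take the first improvement.
    benchmark = max(candidates[:cutoff])
    for c in candidates[cutoff:]:
        if c > benchmark:
            return c
    return candidates[-1]   # forced to take the last candidate

def outcome_profile(cutoff, n=100, trials=20_000):
    """Return (chance of getting the very best, chance of one of the 10 worst)."""
    best = worst_decile = 0
    for _ in range(trials):
        candidates = random.sample(range(n), n)   # ranks 0..n-1; n-1 is the best
        choice = pick(candidates, cutoff)
        best += (choice == n - 1)
        worst_decile += (choice < n // 10)
    return best / trials, worst_decile / trials

for cutoff in (37, 10):   # 37% rule versus square root rule
    b, w = outcome_profile(cutoff)
    print(cutoff, round(b, 3), round(w, 3))
```

The 37% rule wins the very best option more often, but the square root rule is forced down to the last (possibly dreadful) candidate far less often, which is precisely its appeal.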
All this assumes that we lack an objective standard against which to measure each of our options, so that all we can do is compare options with one another. Suppose instead that Portia is simply interested in choosing the richest of the suitors and knows the distribution of wealth across all potential suitors, which ranges evenly from the bankrupt to those worth 100,000 ducats.
This means that the top percentile of potential suitors in the whole population is worth upwards of 99,000 ducats, the lowest percentile is worth up to 1,000 ducats, and the 50th percentile is worth between 49,000 and 50,000 ducats.
Now Portia is presented with a hundred out of this population of potential suitors, and let’s assume that the suitors presented to her are representative of this population. Say now that the first to be presented to her is worth 99,500 ducats. Since wealth is her only criterion, and he is in the upper percentile in terms of wealth, her optimal decision is to accept his proposal of marriage. It is possible that one of the next 99 is worth more than 99,500 ducats but that isn’t the way to bet.
On the other hand, say that the first suitor is worth 60,000 ducats. Since there are 99 more to come, it is a good bet that at least one of them will be worth more than this. If, however, she has turned down every suitor until she is presented with the 99th, only one choice remains after him, so she should accept him if he is above the 50th percentile, in this case 50,000 ducats. In other words, Portia’s decision as to whether to accept a proposal comes down to how many potential matches she has left to see: the more there are to come, the higher the wealth percentile she should demand before accepting, though she should never accept anyone below the average unless she is out of choices. In this version of the stopping problem, the probability that Portia will end up with the wealthiest of the available suitors turns out to be about 58 per cent. More information, of course, increases the chance of success. Indeed, any criterion that tells us where an option stands relative to the relevant population as a whole will increase the probability of finding the best choice of those available. As such, it seems that if Portia is only interested in the money, she is more likely to find it than if she is looking for love.
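The 58 per cent figure can be checked numerically. The sketch below is a rough backward-induction calculation in Python (my own illustration, not a method from the original text), assuming wealth is uniformly distributed and discretising the 0-to-1 wealth scale into buckets (the grid size is an arbitrary choice). It computes the probability that a fully optimal threshold strategy stops on the very wealthiest of 100 suitors:

```python
# Full-information version: suitor values are i.i.d. uniform on (0, 1), and
# Portia wins only by stopping on the overall maximum. Work backwards from
# the last suitor, tracking the running maximum on a discrete grid.
G, n = 2000, 100
xs = [(j + 0.5) / G for j in range(G)]

f = [0.0] * G      # f[j]: win probability after stage i with running max xs[j]
ans = 0.0
for i in range(n, 0, -1):
    rem = n - i    # suitors still to come after position i
    # If the current value x beats the running max, accepting wins with
    # probability x**rem (all later values fall below x); otherwise continue.
    act = [max(x ** rem, f[j]) for j, x in enumerate(xs)]
    suffix = [0.0] * (G + 1)
    for j in range(G - 1, -1, -1):
        suffix[j] = suffix[j + 1] + act[j]
    ans = suffix[0] / G                # value at stage i before anything is seen
    f = [((j + 1) / G) * f[j] + suffix[j + 1] / G for j in range(G)]
print(round(ans, 3))   # roughly 0.58
```

The answer lands close to 0.58, comfortably above the 0.37 achievable when only relative rankings are available.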
And who did fair Portia choose in the original play? Well, there are no spoilers here. But I can reveal that it was the best of the three.
