Bayes’ Theorem: The Most Powerful Equation in the World.

Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.

How should we change our beliefs about the world when we encounter new data or information? This is one of the most important questions we can ask. A theorem bearing the name of Thomas Bayes, an eighteenth-century clergyman, is central to the way we should answer this question.

The original presentation of the Reverend Thomas Bayes’ work, ‘An Essay towards Solving a Problem in the Doctrine of Chances’, was given to the Royal Society in 1763, after Bayes’ death, by his friend and confidant, Richard Price.

In explaining Bayes’ work, Price proposed as a thought experiment the example of a person who enters the world and sees the sun rise for the first time. As this person has had no opportunity to observe the sunrise before (perhaps he has spent his life to that point entombed in a dark cave), he is not able to decide whether this is a typical or unusual occurrence. It might even be a unique event. Every day that he sees the same thing happen, however, the degree of confidence he assigns to this being a permanent aspect of nature increases. His estimate of the probability that the sun will rise again tomorrow as it did yesterday and the day before, and so on, gradually approaches, although never quite reaches, 100 per cent.

The Bayesian viewpoint is just like that: the idea that we learn about the world and everything in it through a process of gradually updating our beliefs, edging incrementally ever closer to the truth as we obtain more data, more information, more evidence.

As such, the perspective of Reverend Bayes on cause and effect is essentially different to that of philosopher David Hume, the logic of whose argument on this issue is contained in ‘An Enquiry Concerning Human Understanding,’ published in 1748. According to Hume, we cannot justify our assumptions about the future based on past experience unless there is a law that the future will always resemble the past. No such law exists. Therefore, we have no fundamentally rational support for believing in causation. For Hume, therefore, predicting that the sun will rise again after seeing it rise a hundred times in a row is no more rational than predicting that it will not. Bayes instead sees reason as a practical matter, in which we can apply the laws of probability to the issue of cause and effect.

To Bayes, therefore, rationality is a matter of probability, by which we update our predictions based on new evidence, thereby edging closer and closer to the truth. This is called Bayesian reasoning. According to this approach, probability can be seen as a bridge between ignorance and knowledge. The particularly wonderful thing about the world of Bayesian reasoning is that the mathematics of it is so simple. Bayes’ Theorem is concerned with conditional probability. It tells us the probability, or updates the probability, that a theory or hypothesis is true given that some event has taken place, that some new evidence has been observed. The problem with intuition is that people are not naturally probabilistic thinkers, but instead are cause-and-effect thinkers. We have to be trained to think in a Bayesian way about the world.

Essentially, Bayes’ Theorem is just an algebraic expression with three known variables and one unknown. It is true by construction. Yet this simple formula is the foundation stone of that bridge I referred to between ignorance and knowledge, which can lead to important predictive insights. As noted, it allows us to update the probability that a theory or hypothesis is true when some new evidence comes to light, based on the probability we attach to the theory or hypothesis being true before the new evidence is known.

There are three things a Bayesian needs to estimate.

1. A Bayesian’s first task is to assign a starting point probability to a hypothesis being true, before some new evidence arises. This is known as the ‘prior’ probability. Let’s assign the letter ‘a’ to this.
2. A Bayesian’s second task is to estimate the probability that the new evidence would have arisen if the hypothesis was true. Let’s assign the letter ‘b’ to this.
3. A Bayesian’s third task is to estimate the probability that the new evidence would have arisen if the hypothesis was false. Let’s assign the letter ‘c’ to this.

Based on these three probability estimates, Bayes’ Theorem offers a way to calculate the revised probability of the hypothesis being true given the new evidence. The notable point about it is that the equation is true as a matter of logic. The result it produces will therefore be as accurate as the values inputted into the equation. The formula is also so straightforward it can be jotted on the back of your hand.

The formula for Bayes’ Theorem can be represented as:

Updated (posterior) probability given new evidence = ab / [ab + c (1-a)]

Essentially, then, Bayesian updating is a straightforward solution to the problem of how to combine pre-existing (prior) beliefs with observed new evidence. The solution is essentially to combine the probabilities together. To do this properly, we use Bayes’ Theorem. It is of particular use when we have a conditional probability of two events, and we are interested in the reversed conditional probability. For example, when we have P (A given B) and want to find P (B given A).
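The formula above can be sketched in a few lines of Python. This is only an illustration: the function name bayes_posterior and the input values are assumptions made for the example, not figures from the text. It also shows the "gradual updating" idea, in which the posterior from one piece of evidence becomes the prior for the next (assuming the two pieces of evidence are independent given the hypothesis).

```python
def bayes_posterior(a: float, b: float, c: float) -> float:
    """Posterior probability ab / [ab + c(1 - a)].

    a: prior probability that the hypothesis is true
    b: probability of the evidence if the hypothesis is true
    c: probability of the evidence if the hypothesis is false
    """
    numerator = a * b
    denominator = a * b + c * (1 - a)
    if denominator == 0:
        raise ValueError("the evidence has zero probability under these inputs")
    return numerator / denominator


# Made-up inputs: a prior of 0.1, and evidence four times as likely
# if the hypothesis is true (0.8) as if it is false (0.2).
posterior = bayes_posterior(a=0.1, b=0.8, c=0.2)
print(round(posterior, 3))

# A second, similar piece of evidence: the previous posterior is the new prior.
posterior_2 = bayes_posterior(a=posterior, b=0.8, c=0.2)
print(round(posterior_2, 3))
```

With these illustrative numbers the probability rises from 0.1 to roughly 0.31 after the first observation and to 0.64 after the second, which is the edging-towards-the-truth behaviour described earlier.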

The key contributions of Bayesian analysis to our understanding of the world are threefold.

1. Bayes’ Theorem makes clear the importance not just of new evidence but also the (prior) probability that the hypothesis was true before the new evidence was observed. This prior probability is often given too little weight compared to the new evidence in common intuition about probability. Bayes’ Theorem makes the prior probability explicit and shows how much weight to attach to it.
2. Bayes’ Theorem allows us a way to calculate the updated probability based on the prior probability that the hypothesis is true and the probability of the new evidence arising given that the hypothesis is true and also given that the hypothesis is false.
3. Bayes’ Theorem shows that the probability that a hypothesis is true given the evidence is not equal to the probability of the evidence arising given that the hypothesis is true. Put another way, P (H given E) does not equal P (E given H).

Often the conclusions Bayes’ Theorem generates are highly counterintuitive, but that’s because the world is in many ways a counterintuitive place. Accepting that fact is the first step towards mastering life’s logical maze.

In summary, intuition lets us down because our in-built judgment of the weight we should attach to new evidence tends to be skewed, not least against pre-existing evidence. New evidence also tends to colour our perception of the pre-existing evidence. Moreover, we tend to see evidence that is consistent with something being true as evidence that it is actually true. Bayes’ Theorem is the map that helps guide us through this maze.

Appendix

Bayes’ Theorem consists of three variables.

a is the prior probability of the hypothesis being true (the probability we attach before new evidence arises). In traditional notation, this is represented as P(H).

b is the probability that the new evidence would arise if the hypothesis is true. In traditional notation, this is represented as P(E|H). We use the notation P(A|B) to represent the probability of A given B, i.e. the probability of A if B.

c is the probability that the new evidence would arise if the hypothesis is not true. In traditional notation, this is represented as P(E|H’). H’ is the notation for H not being true.

(1-a) is the prior probability that the hypothesis is not true. In traditional notation, this is represented as P(H’). It is derived from 1 – P(H), i.e. 1 minus the probability that the hypothesis is true.

Using this notation, the probability that a hypothesis is true given some new evidence (‘Posterior Probability’) = ab / [ab + c (1-a)].

Bayes’ Theorem can be derived from the equation P(H|E) × P(E) = P(H) × P(E|H), by dividing both sides by P(E). The intuition underlying this is that both sides of the equation are equal to the joint probability that the hypothesis is true and the evidence arises, P(H and E). They are two ways of looking at the same thing.

In particular, P(H|E) × P(E) is the probability of a hypothesis being true given the evidence times the probability of the evidence. This is logically equivalent to P(H) × P(E|H), which is the probability of a hypothesis being true times the probability of the evidence given that the hypothesis is true.

So, P(H|E) × P(E) = P(H) × P(E|H)

Dividing the left and right sides of the equation by P(E),

P(H|E) = P(H) × P(E|H) / P(E) … Bayes’ Theorem

The evidence can arise either with the hypothesis true or with the hypothesis false, so:

P(E) = P(E|H) × P(H) + P(E|H’) × P(H’)

Substituting this for P(E),

P(H|E) = P(H) × P(E|H) / [P(H) × P(E|H) + P(E|H’) × P(H’)] … Bayes’ Theorem

This is equivalent to the formula:

Posterior probability = ab / [ab + c (1-a)], where a = P(H); b = P(E|H); c = P(E|H’)
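As a quick check on the equivalence claimed above, the two-step route (first P(E) by summing over H and H’, then dividing) can be compared with the a, b, c shorthand. The sketch below uses exact fractions; the function names and the input values are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def evidence_probability(p_h, p_e_given_h, p_e_given_not_h):
    """P(E) = P(E|H) P(H) + P(E|H') P(H'), with P(H') = 1 - P(H)."""
    return p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)


def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) = P(H) P(E|H) / P(E)."""
    p_e = evidence_probability(p_h, p_e_given_h, p_e_given_not_h)
    return p_h * p_e_given_h / p_e


# Hypothetical inputs: a = P(H), b = P(E|H), c = P(E|H').
a, b, c = Fraction(1, 5), Fraction(9, 10), Fraction(3, 10)

long_form = posterior(a, b, c)                # via P(E)
short_form = a * b / (a * b + c * (1 - a))    # ab / [ab + c(1 - a)]
print(long_form, short_form)
```

Both expressions give the same fraction, as the algebra says they must.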

Technical Proof

We write the conditional probability of A given B as P(A|B) and define it as the probability that A has occurred, given that B has occurred.

The probability that A and B have both occurred, P(A∩B), is the conditional probability of A given B multiplied by the probability that B has occurred.

P(A∩B) = P(A|B) P(B)

Hence:

P(A|B) = P(A∩B) / P(B)

Similarly,

P(A∩B) = P(B|A) P(A)

Hence:

P(B|A) = P(A∩B) / P(A)

So:

P(A|B) P(B) = P(A∩B) = P(B|A) P(A), which is sometimes called the product rule for probabilities.

Dividing both sides by P(A) (which we take to be non-zero), the result follows:

P(B|A) = P(A|B) P(B) / P(A)

Where A represents the evidence, and B represents the hypothesis being true, this becomes:

P(H|E) = P(E|H) P(H) / P(E) … Bayes’ Formula

Now, P(E) = P(E|H) P(H) + P(E|H’) P(H’), where P(H’) represents the probability that the hypothesis is not true, i.e. P(H’) = 1 – P(H)

Therefore, P(H|E) = P(E|H) P(H) / [P(E|H) P(H) + P(E|H’) P(H’)]

In traditional notation, the Prosecutor’s Fallacy is the fallacy of representing the probability of a hypothesis being true given the evidence, P(H|E), as being the same thing as P(E|H), the probability of the evidence arising given the hypothesis is true. In fact, P(H|E) = P(H) × P(E|H) / P(E).

Examples

Is the probability that a selected card is the Ace of Spades (the hypothesis) given the evidence (it is a black card) equal to the probability it is a black card given that it is the Ace of Spades?

In this example, P(H|E) = 1/26 (probability the hypothesis is true given the evidence), since there is one Ace of Spades out of 26 black cards.

However, the probability of observing the evidence (it is a black card) given the hypothesis being true (it is the Ace of Spades) is P(E|H) = 1, since a card that is the Ace of Spades is certain to be black.

So, P(H|E) = 1/26 is not equal to P(E|H) = 1.

There follow some examples to illustrate that P(H|E) × P(E) does indeed equal P(H) × P(E|H).

Example 1: Take a deck of 52 cards, 26 red cards and 26 black cards, including one Ace of Spades. We are testing the hypothesis that the selected card is the Ace of Spades. Now the probability a drawn card is the Ace of Spades (hypothesis is true) given that the card is black (the evidence) = 1/26 (there are 26 black cards, one of which is the Ace of Spades).

So P(H|E) = 1/26

The proportion of black cards in the deck = 1/2. So P(E) = 1/2

So, P(H|E) × P(E) = 1/26 × 1/2 = 1/52.

Now P(E|H) is the probability that the card is black given that it is the Ace of Spades. This is certain, as the Ace of Spades is a black card.

So P(E|H) = 1.

P(H) is the probability the card is the Ace of Spades before we know what colour it is. There are 52 cards in the deck, so P(H) = 1/52.

So P(H) × P(E|H) = 1/52 × 1 = 1/52

So P(H|E) × P(E) = P(H) × P(E|H): they both equal 1/52 in this case.

Therefore, P(H|E) = P(H) × P(E|H) / P(E) … Bayes’ Theorem
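The card example can also be checked by brute-force counting. The following is a minimal sketch that builds a standard 52-card deck and counts cards directly; the helper names (prob, is_black, is_ace_of_spades) are invented for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["Spades", "Clubs", "Hearts", "Diamonds"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def is_ace_of_spades(card):
    return card == ("A", "Spades")


def is_black(card):
    return card[1] in ("Spades", "Clubs")


def prob(event, given=None):
    """Probability of `event` among the cards satisfying `given` (all cards if None)."""
    pool = [card for card in DECK if given is None or given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))


p_h = prob(is_ace_of_spades)                          # P(H)
p_e = prob(is_black)                                  # P(E)
p_h_given_e = prob(is_ace_of_spades, given=is_black)  # P(H|E)
p_e_given_h = prob(is_black, given=is_ace_of_spades)  # P(E|H)

print(p_h_given_e, p_e_given_h)              # the two conditionals differ
print(p_h_given_e * p_e, p_h * p_e_given_h)  # the two products agree
```

Counting confirms both points made above: the two conditional probabilities are different, while the two products are identical.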

Example 2: There are in this example just four cards in our deck. These are the Ace of Spades, Ace of Clubs, Ace of Diamonds and Ace of Hearts. We are testing the hypothesis that the selected card is the Ace of Spades. Prior probability of Ace of Spades (AS) = ¼, as this is one of the four cards in our deck. What is the posterior probability it is Ace of Spades given the evidence that the card is black?

P(H) = 1/4

P(E|H) = 1

P(E) = 1/2

P(H|E) = 1/2 (by direct counting, the Ace of Spades is one of the two black cards)

So, P(H|E) = P(H) × P(E|H) / P(E) = (1/4 × 1) / (1/2) = 1/2

Note that: P(H|E) × P(E) = P(H) × P(E|H)

P(H|E) = P(H) × P(E|H) / P(E) … Bayes’ Theorem

Example 3: Two dice are thrown. The hypothesis is that two sixes will be thrown. The new evidence is that a six is thrown on the first one.

P(H) = a = 1/36

P(E|H) = b = 1 (for a double six, a six must be thrown on the first one).

P(E) = 1/6 (there is a 1 in 6 chance of throwing a six on the first die)

P(H|E) = posterior probability (PP) = P(E|H) × P(H) / P(E) = (1 × 1/36) / (1/6) = 1/6 (there is a 1 in 6 chance of a double six if the first die lands on a six).

Note: P(H) × P(E|H) = P(E) × P(H|E) = 1/36

Note also: P(E) = P(H) × P(E|H) + P(H’) × P(E|H’) = 1/36 × 1 + 35/36 × 5/35 = 1/36 + 5/36 = 1/6

Similarly, Posterior Probability = ab / [ab + c (1-a)] = (1/36) / (1/36 + 5/36) = 1/6

Note: c = P(E|H’) = 5/35 because if the dice do not land 6,6, so that the hypothesis is not true (H’), then 35 outcomes are left (from 1,1 to 6,5) and a six on the first die occurs in 5 of them, i.e. 6,1; 6,2; 6,3; 6,4; 6,5.
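The values of a, b and c in the dice example can be recovered by listing all 36 outcomes. This is a rough sketch under the usual assumption of two fair, independent dice; the helper names are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

ROLLS = list(product(range(1, 7), repeat=2))  # 36 (first die, second die) pairs


def double_six(roll):
    return roll == (6, 6)


def not_double_six(roll):
    return roll != (6, 6)


def first_is_six(roll):
    return roll[0] == 6


def prob(event, given=None):
    """Probability of `event` among the rolls satisfying `given` (all rolls if None)."""
    pool = [roll for roll in ROLLS if given is None or given(roll)]
    return Fraction(sum(1 for roll in pool if event(roll)), len(pool))


a = prob(double_six)                          # prior, P(H)
b = prob(first_is_six, given=double_six)      # P(E|H)
c = prob(first_is_six, given=not_double_six)  # P(E|H')

posterior = a * b / (a * b + c * (1 - a))     # ab / [ab + c(1 - a)]
direct = prob(double_six, given=first_is_six) # P(H|E) by direct counting
print(a, b, c, posterior, direct)
```

The formula and the direct count of outcomes give the same posterior, and c comes out as 5/35 (printed in lowest terms as 1/7).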

As for the likelihood that the sun will rise again, there is a way of estimating this, which was proposed by Pierre-Simon Laplace. What is known as Laplace’s Law gives us a rule-of-thumb way of calculating how likely it is that something that has happened before will happen again, whether it be the sun rising, your favourite team winning, or the bus arriving on time. Simply count the number of times it has happened in the past plus one (successes, S+1), and divide that by the number of opportunities there have been for it to happen plus two (trials, T+2). For a person emerging from a dark cave into the world for the first time, and watching the sun rise seven times, for example, the estimate that it will rise again is: (S+1)/(T+2) = (7+1)/(7+2) = 8/9 = 88.9%. Every time it rises again makes it even more likely that the pattern will be repeated, so that by the end of a year, the estimated probability goes up to (365+1)/(365+2) = 366/367 = 99.7%. And so on. The 1 and 2 in the Laplace equation, (S+1)/(T+2), essentially represent the Bayesian ‘prior.’ The 1 and 2 can be replaced by any numbers in the same proportion, such as 5 and 10 or 10 and 20, depending on the weight we wish to assign to the prior probabilities (probabilities assigned before encountering new evidence).

Larger numbers (e.g. S+10, T+20) bias the estimate towards the assigned prior probability. So, (S+10)/ (T+20) after seven days updates to a probability of (7+10)/ (7+20) = 17/27 = 63.0%, compared to 88.9% for (S+1)/(T+2). Smaller numbers bias the estimate, therefore, towards the observed record. Another way of looking at this is that larger numbers indicate we are more confident in our baseline estimates and need more evidence to change our prior beliefs. Smaller numbers indicate that we are less sure about our beliefs and are more open to quickly updating our beliefs based on new evidence. In other words, learning takes place more quickly with smaller numbers in the Laplace equation.
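Laplace’s rule and the effect of a heavier prior can be sketched as follows. The function name laplace_estimate and its parameter names are assumptions made for this illustration; the numbers are the ones worked through above.

```python
def laplace_estimate(successes, trials, prior_successes=1, prior_trials=2):
    """Estimate (S + prior_successes) / (T + prior_trials).

    The defaults give Laplace's (S + 1) / (T + 2); larger prior values in the
    same 1:2 proportion pull the estimate more strongly towards 1/2.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return (successes + prior_successes) / (trials + prior_trials)


week = laplace_estimate(7, 7)                  # seven sunrises in seven days
year = laplace_estimate(365, 365)              # a full year of sunrises
week_heavy_prior = laplace_estimate(7, 7, 10, 20)

print(f"{week:.1%} {year:.1%} {week_heavy_prior:.1%}")
```

With no observations at all, every version of the rule returns the prior itself (1/2 here), which is one way to see why the added numbers act as the Bayesian prior.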

Exercise

Question a. Write the Bayesian equation (using a, b and c) for deriving the posterior (updated) probability of a hypothesis being true after you encounter new evidence. Explain what a, b and c represent.

Question b. If P(H) is the probability that a hypothesis is true before the observation of new evidence (E), what is the updated (or posterior) probability of the hypothesis being true after the observation of the new evidence? Use the terms P(H), P(E|H), P(H|E), P(H’), P(E|H’) to construct the Bayesian equation using each of these terms. Note that P(E|H) is the probability of encountering the evidence given that the hypothesis is true. P(H’) is the probability that the hypothesis is not true. P(H|E) is the probability the hypothesis is true after encountering the evidence.

Question c. How do these terms relate to a, b and c in the Bayesian formula you have studied?

Question d. Is the probability that a hypothesis is true, given the evidence, P(H|E), equal to the probability of encountering the evidence, given that the hypothesis is true, P(E|H)? In other words, does P(H|E) = P(E|H)?

Question e. You are presented with two dice. One is fair, one is biased. The fair die (A) lands on all numbers (1 to 6) with equal probability. The biased die (B) lands on 6 with a 50% chance and each of the other numbers (1 to 5) with an equal 10% chance each.

Now, choose one die. You can’t tell by inspection whether it is the fair or the biased die. You now roll the die, and it lands on 6. What is the probability that the die you rolled is the biased die?

Question f. You are presented with two coins. One is fair, the other is weighted. The fair coin (Coin 1) lands on heads and tails with equal likelihood, the weighted coin (Coin 2) lands on tails with a 75% chance.

Now, choose one coin. You can’t tell by inspection whether it is the fair or the weighted coin. You select a coin and toss it and it lands on tails. What is the probability that you tossed Coin 2 (the weighted coin)?
