# Bayes’ Theorem – in a nutshell.

How should we change our beliefs about the world when we encounter new data or information? This is one of the most important questions we can ask. A theorem bearing the name of Thomas Bayes, an eighteenth-century clergyman, is central to the way we should answer this question.

The Reverend Thomas Bayes’ work, ‘An Essay towards Solving a Problem in the Doctrine of Chances’, was first presented to the Royal Society in 1763, after Bayes’ death, by his friend and confidant Richard Price.

In explaining Bayes’ work, Price proposed as a thought experiment the example of a person who enters the world and sees the sun rise for the first time. As he has had no opportunity to observe this before (perhaps he has spent his life to that point entombed in a dark cave), he is not able to decide whether this is a typical or unusual occurrence. It might even be a unique event. Every day that he sees the same thing happen, however, the degree of confidence he assigns to this being a permanent aspect of nature increases. His estimate of the probability that the sun will rise again tomorrow as it did yesterday and the day before, and so on, gradually approaches, although never quite reaches, 100 per cent.

The Bayesian viewpoint is just like that: we learn about the world and everything in it through a process of gradually updating our beliefs, edging incrementally ever closer to the truth as we obtain more data, more information, more evidence.

As such, the perspective of Reverend Bayes on cause and effect is essentially different to that of philosopher David Hume, the logic of whose argument on this issue is contained in ‘An Enquiry Concerning Human Understanding’. According to Hume, we cannot justify our assumptions about the future based on past experience unless there is a law that the future will always resemble the past. No such law exists. Therefore, we have no fundamentally rational support for believing in causation. For Hume, therefore, predicting that the sun will rise again after seeing it rise a hundred times in a row is no more rational than predicting that it will not. Bayes instead sees reason as a practical matter, in which we can apply the laws of probability to the issue of cause and effect.

To Bayes, therefore, rationality is a matter of probability: we update our predictions based on new evidence, thereby edging closer and closer to the truth. This is called Bayesian reasoning. According to this approach, probability can be seen as a bridge between ignorance and knowledge. The particularly wonderful thing about Bayesian reasoning is that the mathematics of it is so simple. Bayes’ Theorem is concerned with conditional probability. It tells us the probability, or rather the updated probability, that a theory or hypothesis is true given that some event has taken place or some new evidence has been observed. The problem with intuition is that people are not naturally probabilistic thinkers but cause-and-effect thinkers. We have to be trained to think in a Bayesian way about the world.

Essentially, Bayes’ Theorem is just an algebraic expression with three known variables and one unknown. It is true by construction. Yet this simple formula is the foundation stone of that bridge I referred to between ignorance and knowledge, and can lead to important predictive insights. As noted, it allows us to calculate the probability that a theory or hypothesis is true if some new evidence comes to light, based on the probability we attach to it being true before the new evidence is known, updated in light of new information.

There are three things a Bayesian needs to estimate.

- A Bayesian’s first task is to assign a starting point probability to a hypothesis being true, before some new evidence arises. This is known as the ‘prior’ probability. Let’s assign the letter ‘a’ to this.
- A Bayesian’s second task is to estimate the probability that the new evidence would have arisen if the hypothesis was true. Let’s assign the letter ‘b’ to this.
- A Bayesian’s third task is to estimate the probability that the new evidence would have arisen if the hypothesis was false. Let’s assign the letter ‘c’ to this.

Based on these three probability estimates, Bayes’ Theorem offers you a way to calculate the revised probability of the hypothesis being true given new evidence. The wonderful part about it is that the equation is true as a matter of logic. So the result it produces will be exactly as accurate as the values fed into it, and no more. The formula is also so straightforward it can be jotted on the back of your hand.

The formula for Bayes’ Theorem can be represented as:

Updated (posterior) probability given new evidence = **ab/ [ab+ c (1-a)]**

More traditionally, we express this equation as:

P (HIE) = P (H). P (EIH) / [P (H). P (EIH) + P (EIH’). P (H’)]

P (HIE) is the probability that the hypothesis is true given the new evidence. This is the posterior (or updated) probability.

P (H) is the probability that the hypothesis is true before encountering the new evidence. This is the prior probability (**a**).

P (EIH) is the probability of encountering the evidence given that the hypothesis is true (**b**).

P (H’) is the probability that the hypothesis is not true **(1-a)**.

P (EIH’) is the probability of encountering the evidence given that the hypothesis is not true **(c)**.

Essentially, then, Bayesian updating is a straightforward solution to the problem of how to combine pre-existing (prior) beliefs with observed new evidence. The solution is essentially to combine the probabilities together. To do this properly, we use Bayes’ Theorem.
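The formula ab / [ab + c (1-a)] can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and the input values below are assumptions made up for the example, not figures from any real problem.

```python
def posterior(a, b, c):
    """Return the updated probability ab / [ab + c(1 - a)].

    a: prior probability that the hypothesis is true
    b: probability of the evidence if the hypothesis is true
    c: probability of the evidence if the hypothesis is false
    """
    for name, p in (("a", a), ("b", b), ("c", c)):
        if not 0 <= p <= 1:
            raise ValueError(f"{name} must be a probability between 0 and 1")
    evidence = a * b + c * (1 - a)  # overall probability of seeing the evidence
    if evidence == 0:
        raise ValueError("the evidence has zero probability under these inputs")
    return a * b / evidence


# Hypothetical inputs, chosen purely for illustration.
a = 0.01  # the hypothesis is thought unlikely beforehand
b = 0.90  # the evidence is very likely if the hypothesis is true
c = 0.05  # the evidence is unlikely, but possible, if it is false
updated = posterior(a, b, c)
print(f"{updated:.3f}")  # prints 0.154
```

With these made-up inputs, evidence that is eighteen times more likely under the hypothesis than against it still lifts the probability only from 1 per cent to about 15 per cent, because the prior was so low.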

**The key contributions of Bayesian analysis to our understanding of the world are threefold.**

- Bayes’ Theorem makes clear the importance not just of new evidence but also the (prior) probability that the hypothesis was true before the new evidence was observed. This prior probability is generally given too little weight compared to the new evidence in common intuition about probability. Bayes’ Theorem makes it explicit and shows how much weight to give to it.
- Bayes’ Theorem allows us a way to calculate the updated probability based on the prior probability that the hypothesis is true and the probability of the evidence arising given that the hypothesis is true and also given that the hypothesis is false.
- Bayes’ Theorem shows that the probability that a hypothesis is true given the evidence is not equal to the probability of the evidence arising given that the hypothesis is true. Put another way, P (HIE) does not equal P (EIH).

Often the conclusions Bayes’ Theorem generates are highly counter-intuitive, but that’s because the world is in many ways a counter-intuitive place. Accepting that fact is the first step towards mastering life’s logical maze.

In summary, intuition lets us down because our in-built judgment of the weight we should attach to new evidence tends to be skewed, not least against pre-existing evidence. New evidence also tends to colour our perception of the pre-existing evidence. Moreover, we tend to see evidence that is consistent with something being true as evidence that it is actually true. Bayes’ Theorem is the map that helps guide us through this maze.

**Appendix**

Bayes’ Theorem consists of three variables.

a is the prior probability of the hypothesis being true (the probability we attach before new evidence arises). In traditional notation, this is represented as P (H).

b is the probability that the new evidence would arise if the hypothesis is true. In traditional notation, this is represented as P (EIH). We use the notation P (AIB) to represent the probability of A given B, i.e. the probability of A If B.

c is the probability that the new evidence would arise if the hypothesis is not true. In traditional notation, this is represented as P (EIH’). H’ is the notation for H not being true.

(1-a) is the prior probability that the hypothesis is not true. In traditional notation, this is represented as P (H’). It is derived from 1 – P (H), i.e. 1 minus the probability that the hypothesis is true.

Using this notation, the probability that a hypothesis is true given some new evidence (‘Posterior Probability’) = **ab/ [ab+ c (1-a)].**

Bayes’ Theorem can be derived from the equation P (HIE).P (E) = P (H).P (EIH), by dividing both sides by P (E). The intuition underlying this is that both sides of the equation are equal to the joint probability that the hypothesis is true and the evidence is observed, P (H and E). They are two ways of looking at the same thing.

In particular, P (HIE).P (E) is the probability of a hypothesis being true given the evidence times the probability of the evidence. This is logically equivalent to P (H). P (EIH), which is the probability of a hypothesis being true times the probability of the evidence given that the hypothesis is true.

So, P (HIE).P (E) = P (H). P (EIH)

Dividing the left and right sides of the equation by P (E),

*P (HIE) = P (H). P (EIH) / P (E) … Bayes’ Theorem*

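Because the evidence can arise either when the hypothesis is true or when it is false, P (E) can be expanded as:

P (E) = P (EIH). P (H) + P (EIH’). P (H’)

Substituting this for P (E) in the denominator gives:

**P (HIE) = P (H). P (EIH) / [P (H). P (EIH) + P (EIH’). P (H’)] … Bayes’ Theorem**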

This is equivalent to the formula:

*Posterior probability = ab / [ab + c (1-a)], where a = P (H); b = P (EIH); c = P (EIH’)*

In traditional notation, the **Prosecutor’s Fallacy** is the fallacy of representing the probability of a hypothesis being true given the evidence, P (HIE), as being the same thing as P (EIH), the probability of the evidence arising given the hypothesis is true. In fact, P (HIE) = P (H). P (EIH) / P (E).

This can easily be shown with an example.

Is the probability that a selected card is the Ace of Spades (the hypothesis) given the evidence (it is a black card) equal to the probability it is a black card given that it is the Ace of Spades?

In this example, P (HIE) = 1/26 (probability the hypothesis is true given the evidence), since there is one Ace of Spades out of 26 black cards.

However, the probability of observing the evidence (it is a black card) given the hypothesis being true (it is the Ace of Spades) is P (EIH) = 1, since the probability it is a black card if it is the Ace of Spades is certain.

*So, P (HIE) = 1/26 is not equal to P (EIH) = 1.*
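The same asymmetry can be checked by brute-force counting over a 52-card deck. The sketch below assumes every card is equally likely to be drawn; the helper names (`conditional`, `is_black`, `is_ace_of_spades`) are invented for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUIT_COLOURS = {"Spades": "black", "Clubs": "black",
                "Hearts": "red", "Diamonds": "red"}

# Every (rank, suit) pair: 13 ranks x 4 suits = 52 cards.
deck = list(product(RANKS, SUIT_COLOURS))


def is_ace_of_spades(card):
    return card == ("A", "Spades")


def is_black(card):
    return SUIT_COLOURS[card[1]] == "black"


def conditional(event, given, outcomes):
    """Probability of `event` among the equally likely outcomes where `given` holds."""
    matching = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in matching if event(o)), len(matching))


# Probability of the hypothesis given the evidence.
p_h_given_e = conditional(is_ace_of_spades, is_black, deck)
# Probability of the evidence given the hypothesis.
p_e_given_h = conditional(is_black, is_ace_of_spades, deck)

print(p_h_given_e, p_e_given_h)  # prints 1/26 1
```

Swapping the two arguments of `conditional` swaps the question being asked, which is exactly the step the Prosecutor’s Fallacy overlooks.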

There follow some examples to illustrate that P (HIE). P (E) does indeed equal P (H). P (EIH).

Example 1: Take a deck of 52 cards, 26 red cards and 26 black cards, including one Ace of Spades. We are testing the hypothesis that the selected card is the Ace of Spades. Now the probability a drawn card is the Ace of Spades (hypothesis is true) given that the card is black (the evidence) = 1/26 (there are 26 black cards, one of which is the Ace of Spades).

So P (HIE) = 1/26

The proportion of black cards in the deck = 1/2. So P (E) = 1/2

So, P (HIE). P (E) = 1/26 x ½ = 1/52.

Now P (EIH) is the probability that the card is black given that it is the Ace of Spades. This is certain, as the Ace of Spades is a black card.

So P (EIH) = 1.

P (H) is the probability the card is the Ace of Spades before we know what colour it is. There are 52 cards in the deck, so P (H) = 1/52.

So P (H). P (EIH) = 1/52 x 1 = 1/52

So P (HIE). P (E) = P (H). P (EIH) – they both equal 1/52 in this case.

Therefore, P (HIE) = P (H). P(EIH) / P (E) … Bayes’ Theorem

Example 2: There are in this example just four cards in our deck. These are the Ace of Spades, Ace of Clubs, Ace of Diamonds and Ace of Hearts. We are testing the hypothesis that the selected card is the Ace of Spades. Prior probability of Ace of Spades (AS) = ¼, as this is one of the four cards in our deck. What is the posterior probability it is Ace of Spades given the evidence that the card is black?

So, P (H) = ¼

P (EIH) = 1

P (E) = ½

P (HIE) = 1/2

So, P (HIE) = P (H). P (EIH) / P (E) = (¼ x 1) / ½ = ½

Note that: P (HIE). P (E) = P (H). P (EIH)

P (HIE) = P (H). P (EIH) / P (E) … Bayes’ Theorem
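Example 2 can also be run through the a, b, c version of the formula. The text does not state c for this deck, so the value used below is derived here as an assumption to check: if the card is not the Ace of Spades, three aces remain and only the Ace of Clubs is black, giving c = 1/3. A minimal sketch with exact fractions:

```python
from fractions import Fraction

a = Fraction(1, 4)  # prior: one of the four aces is the Ace of Spades
b = Fraction(1)     # the Ace of Spades is certainly black
c = Fraction(1, 3)  # derived here: of the other three aces, only Clubs is black

# Probability of the evidence, built from the two ways a black card can arise.
p_e = a * b + c * (1 - a)

short_form = a * b / Fraction(1, 2)       # dividing by P(E) = 1/2 as given in the example
long_form = a * b / (a * b + c * (1 - a)) # ab / [ab + c(1 - a)]

print(p_e, short_form, long_form)  # prints 1/2 1/2 1/2
```

Both routes give the same posterior of ½, and the expanded denominator reproduces the P (E) = ½ quoted in the example.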

Example 3: Two dice are thrown. The hypothesis is that two sixes will be thrown. The new evidence is that a six is thrown on the first one.

P (H) = a = 1/36

P (EIH) = b = 1 (for a double six, a six must be thrown on the first one).

P (E) = 1/6 (there is a 1 in 6 chance of throwing a six on the first die)

P (HIE) = posterior probability (PP) = P (EIH). P (H) / P (E) = (1 x 1/36) / (1/6) = 1/6 (there is a 1 in 6 chance of a double six if the first die lands on a six).

Note: P (H). P (EIH) = P (E). P (HIE) = 1/36

Note also: P (E) = P (H). P (EIH) + P(H’). P(EIH’) = 1/36 . 1 + 35/36 . 5/35 = 1/36 + 5/36 = 1/6

Similarly, Posterior Probability = ab/[ab+c(1-a)] = 1/6

Note: c = P (EIH’) = 5/35 because if the dice do not land 6,6, so that the hypothesis is not true (H’), then 35 outcomes are left (from 1,1 to 6,5) and a six on the first die occurs in 5 of them, i.e. 6,1; 6,2; 6,3; 6,4; 6,5.
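The figures in Example 3 can be confirmed by listing all 36 outcomes of two dice and counting. This is a sketch under the assumption of fair dice; the helper names (`prob`, `double_six`, `first_six`) are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
rolls = list(product(range(1, 7), repeat=2))


def prob(event, outcomes):
    """Share of the equally likely outcomes in which `event` holds."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def double_six(roll):   # the hypothesis
    return roll == (6, 6)


def first_six(roll):    # the evidence
    return roll[0] == 6


h_true = [r for r in rolls if double_six(r)]
h_false = [r for r in rolls if not double_six(r)]

a = prob(double_six, rolls)    # prior, 1/36
b = prob(first_six, h_true)    # evidence if hypothesis true, 1
c = prob(first_six, h_false)   # evidence if hypothesis false, 5/35

via_formula = a * b / (a * b + c * (1 - a))
# Direct count: double sixes among rolls whose first die is a six.
direct = prob(double_six, [r for r in rolls if first_six(r)])

print(a, b, c, via_formula, direct)  # prints 1/36 1 1/7 1/6 1/6
```

Note that `Fraction` reduces 5/35 to 1/7 when printing; the formula and the direct count agree on 1/6.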

As for the likelihood that the sun will rise again, there is a way of estimating this, which was proposed by Pierre-Simon Laplace. What is known as Laplace’s Law gives us a rule-of-thumb way of calculating how likely it is that something that has happened before will happen again, whether it be the sun rising, your favourite team winning, or the bus arriving on time. Simply count the number of times it has happened in the past plus one (successes, S+1), and divide that by the number of opportunities there have been for it to happen plus two (trials, T+2). For a person emerging from a dark cave into the world for the first time, and watching the sun rise seven times, for example, the estimate that it will rise again is: (S+1)/(T+2) = (7+1)/(7+2) = 8/9 = 88.9%. Every time it rises again makes it even more likely that the pattern will be repeated, so that by the end of a year, the estimated probability goes up to (365+1)/(365+2) = 99.7%. And so on. The 1 and 2 in the Laplace equation, (S+1)/(T+2), essentially represent the Bayesian ‘prior.’ The 1 and 2 can be replaced by any numbers in the same proportion, such as 5 and 10 or 10 and 20, depending on the weight we wish to assign to the prior probabilities (probabilities assigned before encountering new evidence).

Larger numbers (e.g. S+10, T+20) bias the estimate towards the assigned prior probability. So, (S+10)/ (T+20) after seven days updates to a probability of (7+10)/ (7+20) = 17/27 = 63.0%, compared to 88.9% for (S+1)/(T+2). Smaller numbers bias the estimate, therefore, towards the observed record. Another way of looking at this is that larger numbers indicate we are more confident in our baseline estimates and need more evidence to change our prior beliefs. Smaller numbers indicate that we are less sure about our beliefs and are more open to quickly updating our beliefs based on new evidence. In other words, learning takes place more quickly with smaller numbers in the Laplace equation.
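A short sketch of Laplace’s rule makes the effect of the prior weights easy to experiment with. The function and parameter names (`laplace_estimate`, `prior_successes`, `prior_trials`) are assumptions chosen for this illustration; the defaults of 1 and 2 correspond to (S+1)/(T+2).

```python
from fractions import Fraction


def laplace_estimate(successes, trials, prior_successes=1, prior_trials=2):
    """Return (S + prior_successes) / (T + prior_trials) as an exact fraction."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return Fraction(successes + prior_successes, trials + prior_trials)


# The sun has risen on every one of the observed days.
for days in (7, 365):
    light_prior = laplace_estimate(days, days)          # (S+1)/(T+2)
    heavy_prior = laplace_estimate(days, days, 10, 20)  # (S+10)/(T+20)
    print(days, f"{float(light_prior):.1%}", f"{float(heavy_prior):.1%}")
# prints:
# 7 88.9% 63.0%
# 365 99.7% 97.4%
```

With the heavier prior the estimate starts closer to one half and takes longer to approach certainty, which is the slower learning described above.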

**Exercise**

**Question a.** Write the Bayesian equation (using a, b and c) for deriving the posterior (updated) probability of a hypothesis being true after you encounter new evidence. Explain what a, b and c represent.

**Question b.** If P (H) is the probability that a hypothesis is true before the observation of new evidence (E), what is the updated (or posterior) probability of the hypothesis being true after the observation of the new evidence? Use the terms P (H), P(EIH), P(HIE), P(H’), P(EIH’) to construct the Bayesian equation using each of these terms. Note that P(EIH) is the probability of encountering the evidence given that the hypothesis is true. P(H’) is the probability that the hypothesis is not true. P(HIE) is the probability the hypothesis is true after encountering the evidence.

**Question c.** How do these terms relate to a, b and c in the Bayesian formula you have studied?

**Question d.** Is the probability that a hypothesis is true, given the evidence, P (HIE), equal to the probability of encountering the evidence, given that the hypothesis is true, P (EIH)? In other words, does P (HIE) = P (EIH)?

__Some Reading and Links__

Maths in a minute: The prosecutor’s fallacy. + plus magazine. https://plus.maths.org/content/maths-minute-prosecutor-s-fallacy