Two suspected witches of Salem are subjected to a test by the Witchfinder General.

To ascertain whether they have magical powers of telepathy (they haven’t, by the way), they will be separated and seated at a table in the blue room (Suspect 1) and the yellow room (Suspect 2). They will be unable to see each other or communicate in any way.

Before being separated they are allowed a few private moments together.

After being separated, they are given a deck of cards each and asked to extract one card from the deck.

They are allowed to look at their chosen card if they wish, but what they must actually do is to name the colour of the card that the other suspect has drawn.

It is a standard deck of cards, so there is a 1 in 2 chance the chosen card is black, and the same that it is red.

The game will be repeated ten times, to reduce the chance that they will survive by simple good fortune.

If in any round they both correctly identify the colour of the other person’s card, then they will both die.

If both suspects are wrong, or one is wrong, in every round, then both are free to go.

There are two questions:

1. What is the probability they will survive by chance?

2. Is there a co-operative strategy they could agree on before being separated to guarantee they both survive?

Think about it: In any round, what is the chance that each suspect will correctly name the colour of the other suspect’s card? A half? A quarter? What about over ten successive rounds?

To survive, they must avoid this over ten rounds. Is there a way they can take chance out of it, and make sure that at least one of them names the wrong colour for the other suspect’s card, for ten rounds in a row?

If so, that is the door to freedom. Remember that they can secretly hatch a joint strategy and they either both survive or both die, so they can trust each other to stick to the plan, if there is one.

Spoiler Alert (The Solution)

In the first round, the chance that the suspect in the blue room will correctly name the colour of the other suspect’s card is ½. Similarly for the suspect in the yellow room.

These are independent events, so the probability of being condemned after first hands are dealt (i.e. both name the colour of the other suspect’s card correctly) = ½ x ½ = ¼.

So probability of surviving first hand = ¾

Probability of surviving 10 hands = (3/4)^{10} = 0.0563, i.e. 5.63%

But there is a strategy to ensure survival, if they can agree on it before.

Can you work it out?

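The solution is for player 1 to guess the same colour as his own card, and player 2 to guess the opposite colour to his own card. If the two cards match, player 2 is wrong; if they differ, player 1 is wrong. This way they will always survive.

Thus, listing the two cards drawn followed by the two colours named (player 1 first in each case):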

Red Red gives Red Black – they survive.

Black Black gives Black Red – they survive.

Black Red gives Black Black – they survive.

Red Black gives Red Red – they survive.
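
The four cases above, and the chance-only figure from earlier, can be checked with a short script. This is only a sketch: the function names are my own, and it assumes (as the puzzle does) that red and black are equally likely and that the two draws are independent.

```python
from itertools import product

COLOURS = ("Red", "Black")


def opposite(colour):
    """Return the other colour."""
    return "Black" if colour == "Red" else "Red"


def both_correct(card1, card2, guess1, guess2):
    """True if each suspect names the other's card correctly (the fatal outcome)."""
    return guess1 == card2 and guess2 == card1


def strategy_round(card1, card2):
    """Suspect 1 names their own colour; Suspect 2 names the opposite of their own."""
    return card1, opposite(card2)


# Check the agreed strategy against every possible pair of cards.
strategy_is_safe = all(
    not both_correct(c1, c2, *strategy_round(c1, c2))
    for c1, c2 in product(COLOURS, repeat=2)
)

# Chance alone: each guess is right with probability 1/2, independently.
p_condemned_one_round = 0.5 * 0.5
p_survive_ten_rounds = (1 - p_condemned_one_round) ** 10

print("Strategy safe in all four cases:", strategy_is_safe)
print("Chance of surviving ten rounds by luck:", round(p_survive_ten_rounds, 4))
```

Because the strategy never depends on which round is being played, safety in one round carries over to all ten.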

To better conceal the strategy, they could also decide to alternate roles.

This is the optimal outcome in a game where the two players are able to co-ordinate a strategy in advance, and where trust is guaranteed because they both stand to gain by sticking to the strategy.

There are other scenarios in which the superior strategy from the point of view of one or both players is to defect from the strategy they would adopt if they were free to strike an enforceable deal. One such scenario is known as the Prisoner’s Dilemma. In this problem, the best strategy for each player, when a deal cannot be enforced, leaves both worse off than they could be by co-operating. This is in turn an example of a Nash equilibrium, a situation in which neither player stands to gain by unilaterally switching strategy from the current one.

These cases will be examined another time when we look at Game Theory and the important role of the Nash Equilibrium in it.

Benford’s Law is one of those laws of statistics that defies common intuition. Essentially, it states that if we randomly select a number from a table of real-life data, the first digit is not equally likely to be any of the digits from 1 to 9: small digits turn up far more often than large ones. For example, the probability that the first digit will be a ‘1’ is about 30 per cent, rather than the intuitive 11 per cent or so, which assumes that all digits from 1 to 9 are equally likely.

In particular, Benford’s Law applies to the distribution of leading digits in naturally occurring phenomena, such as the population of different countries or the heights of mountains. For example, choose a paper with a lot of numbers, and now circle the numbers that occur naturally, such as stock prices. So lengths of rivers and lakes could be included, but not artificial numbers like telephone numbers. About 30 per cent of these numbers will start with a 1, and it doesn’t matter what units they are in. So the lengths of rivers could be denominated in kilometres, miles, feet or centimetres, without it making a difference to the distribution frequency of the digits.

Empirical support for this distribution can be traced to the man after whom the Law is named, physicist Frank Benford, in a paper he published in 1938, called ‘The Law of Anomalous Numbers.’ In that paper he examined 20,229 numbers drawn from sources as diverse as baseball statistics, the areas of rivers, numbers in magazine articles and so forth, confirming the 30 per cent rule for number 1. For information, the chance of throwing up a ‘2’ as first digit is 17.6 per cent, and of a ‘9’ just 4.6 per cent.

This has clear implications for fraud detection. In particular, if declared returns or receipts deviate significantly from the Benford distribution, we have an automatic red flag which those tackling fraud are, or should be, aware of.

To explain the basis of Benford’s Law, take £1 as a base. Assume this now grows at 10% per day.

£1.10, £1.21, £1.33, £1.46, £1.61, £1.77, £1.94, £2.14, £2.35, £2.59, £2.85, £3.13, £3.45, £3.80, £4.18, £4.59, £5.05, £5.56, £6.11, £6.72, £7.40, £8.14, £8.95, £9.84, £10.83, £11.92, £13.11, £14.42, £15.86, £17.45, £19.19, £21.11, £23.22, £25.55, £28.10, £30.91, £34.00, £37.40, £41.14, £45.26, £49.79, £54.74, £60.24, £66.26, £72.89, £80.18, £88.20, £97.02 …

So we see that the leading digits stay a long time in the teens, less in the 20s, and so on through the 90s, and this pattern continues through three digits and so forth. Benford noticed that the probability that a number starts with the digit n is log_{10}(n+1) – log_{10}(n), so that:

NB log_{10} 1 = 0; log_{10} 2 = 0.301; log_{10} 3 = 0.4771 … log_{10} 10 = 1.

| Leading digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
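
The table can be reproduced from the logarithm formula, and compared with the leading digits of the 10%-a-day growth sequence. The sketch below is illustrative only: the 1,000-day horizon and the helper names are my own assumptions, not part of Benford’s paper.

```python
import math
from collections import Counter


def benford_probability(digit):
    """Probability that the leading digit equals `digit` (1 to 9)."""
    return math.log10(digit + 1) - math.log10(digit)


def leading_digit(value):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{value:.15e}"[0])


# Grow £1 by 10% per day. The 1,000-day horizon is an arbitrary assumption.
DAYS = 1000
counts = Counter(leading_digit(1.1 ** day) for day in range(1, DAYS + 1))
observed = {d: counts[d] / DAYS for d in range(1, 10)}

for d in range(1, 10):
    print(d, f"Benford {benford_probability(d):.1%}", f"observed {observed[d]:.1%}")
```

The observed frequencies will not match the formula exactly over a finite run, but they should sit close to it, with 1 by far the most common leading digit.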

One of the classic problems of Mathemagistics, or Mathematical Magic, is the Bus Problem. It goes like this:

**Question:**

Every day, Fred gets the solitary 8 am bus to work. There is no other bus that will get him to his destination.

10 per cent of the time the bus is early and leaves before he arrives at 8 am.

10 per cent of the time the bus is late and leaves after 8.10 am.

The rest of the time the bus departs between 8 am and 8.10 am.

One morning Fred arrives at the bus stop at 8 am, sees no bus, and waits for 10 minutes without the bus arriving.

Now, what is the probability that Fred’s bus will still arrive?

**Think about it:**

Fred’s bus could yet arrive or he might have missed it. So there are two possibilities. So is it correct to assume that in the absence of further evidence the chance of each must be equal, so the probability at 8.10am that his bus will still arrive is 50 per cent?

But if that is the answer at 8.10am, was it also the correct answer at 8 am?

Or was 50 per cent the correct answer at 8am but not at 8.10am?

Or is it the wrong answer at both times, but was correct at 8.05am?

The solution is posted below.

**Spoiler Alert (Solution):**

When Fred arrives at 8am, there is a 10 per cent chance that his bus will have already left. After Fred has waited for 10 minutes, he can eliminate the 80 per cent chance of the bus arriving in the period between 8 am and 8.10 am. So only two possibilities remain.

Either the bus has arrived ahead of schedule or it will arrive more than ten minutes late.

Both outcomes are unusual, but they are mutually exclusive and equally likely (a 10 per cent chance of each), and there are no other possibilities. So we should update the probability that the bus will still arrive from 10 per cent (the prior probability, when Fred woke up) to 50 per cent: once the 80 per cent probability is eliminated, the remaining 20 per cent is split equally between the bus still turning up and Fred having missed it. So there is a 1 in 2 chance that he will still catch his bus if he has the patience to wait further, and a 1 in 2 chance that he will wait in vain. The follow-up question is how long he should wait. That’s for another day.
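
The update can be written out as a small calculation. This is a sketch of the reasoning rather than anything more general; the dictionary labels are mine, and the likelihoods simply encode whether each possibility is consistent with Fred seeing no bus between 8 am and 8.10 am.

```python
# Prior probabilities for when the bus leaves, as given in the problem.
prior = {"before 8.00": 0.10, "8.00 to 8.10": 0.80, "after 8.10": 0.10}

# Probability of Fred's observation (no bus seen from 8.00 to 8.10) in each case.
likelihood = {"before 8.00": 1.0, "8.00 to 8.10": 0.0, "after 8.10": 1.0}

# Weight each prior by its likelihood, then rescale so the results sum to 1.
unnormalised = {k: prior[k] * likelihood[k] for k in prior}
total = sum(unnormalised.values())
posterior = {k: v / total for k, v in unnormalised.items()}

print("Chance the bus is still to come:", posterior["after 8.10"])
```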

How much should we bet when we believe the odds are in our favour? The answer to this question was first formalised in 1956, by daredevil pilot, recreational gunslinger and physicist John L. Kelly, Jr. at Bell Labs. The so-called Kelly Criterion is a formula employed to determine the optimal size of a series of bets when we have the advantage, in other words when the odds favour us. It takes account of the size of our edge over the market as well as the adverse impact of volatility. In other words, even when we have the edge, we can still go bankrupt along the way if we stake too much on any individual wager or series of wagers.

Essentially, the Kelly strategy is to wager a proportion of our capital which is equivalent to our advantage at the available odds. So if we are being offered even money, and we back heads, and we are certain that the coin will come down heads, we have a 100% advantage. So the recommended wager is the total of our capital. If there is a 60% chance of heads, and a 40% chance of tails, our advantage is now 20%, and we are advised to stake 20% of our capital. This is a simplified representation of the literature on Kelly, Half-Kelly, and other variants, but the bottom line is clear. It is just as important to know how much to stake as it is to gauge when we have the advantage. But it’s not easy unless we can accurately identify that advantage.

Put more technically, the Kelly criterion is the fraction of capital to wager to maximise compounded growth of capital. The problem it seeks to address is that even when there is an edge, beyond some threshold larger bets will result in lower compounded return because of the adverse impact of volatility. The Kelly criterion defines the threshold, and indicates the fraction that should be wagered to maximise compounded return over the long run (F), which is given by:

F = Pw – (Pl/W)

where

F = Kelly criterion fraction of capital to bet

W = Pounds won per pound wagered (i.e. win size divided by loss size)

Pw = Probability of winning

Pl = Probability of losing

When win size and loss size are equal, W = 1, and the formula reduces to:

F = Pw – Pl

For example, if a trader loses £1,000 on losing trades and gains £1,000 on winning trades, and 60 per cent of all trades are winning trades, the Kelly criterion indicates an optimal trade size equal to 20 per cent (0.60-0.40 = 0.20). As another example, if a trader wins £2,000 on winning trades and loses £1,000 on losing trades, and the probability of winning and losing are both equal to 50 per cent, the Kelly criterion indicates an optimal trade size equal to 25 per cent of capital: 0.50- (0.50/2) = 0.25.
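
Both worked examples follow directly from the formula F = Pw – (Pl/W). As a minimal sketch (the function and variable names are my own):

```python
def kelly_fraction(p_win, win_per_unit_staked):
    """Kelly fraction F = Pw - Pl / W.

    p_win is the probability of winning (Pw); win_per_unit_staked is W,
    the amount won per unit lost. A result of zero or less signals that
    there is no edge at these odds.
    """
    p_lose = 1.0 - p_win
    return p_win - p_lose / win_per_unit_staked


# First example: win £1,000 or lose £1,000, winning 60 per cent of the time.
even_money = kelly_fraction(p_win=0.60, win_per_unit_staked=1.0)

# Second example: win £2,000 or lose £1,000, winning 50 per cent of the time.
two_to_one = kelly_fraction(p_win=0.50, win_per_unit_staked=2.0)

print(f"Even-money trade: stake {even_money:.0%} of capital")
print(f"Two-to-one trade: stake {two_to_one:.0%} of capital")
```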

Proportional over-betting is more harmful than under-betting. For example, betting half the Kelly criterion will reduce compounded return by 25 per cent, while betting double the Kelly criterion will eliminate 100 per cent of the gain. Betting more than double the Kelly criterion will result in an expected negative compounded return, regardless of the edge on any individual bet. The Kelly criterion implicitly assumes that there is no minimum bet size. This assumption prevents the possibility of total loss. If there is a minimum trade size, as is the case in most practical investment and trading situations, then ruin is possible if the amount falls below the minimum possible bet size.
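
The asymmetry between under-betting and over-betting can be seen by comparing the expected logarithmic growth per bet at different stake sizes. The sketch below assumes the even-money, 60 per cent example used earlier; the function name is my own, and the 25 per cent and 100 per cent figures quoted above are approximations that this single example only roughly reproduces.

```python
import math


def expected_log_growth(fraction, p_win, win_per_unit_staked=1.0):
    """Expected log growth of capital per bet when staking `fraction` of it."""
    p_lose = 1.0 - p_win
    return (p_win * math.log(1.0 + fraction * win_per_unit_staked)
            + p_lose * math.log(1.0 - fraction))


P_WIN = 0.60          # assumed win probability, at even money
FULL_KELLY = 0.20     # Pw - Pl for this bet

g_half = expected_log_growth(FULL_KELLY / 2, P_WIN)
g_full = expected_log_growth(FULL_KELLY, P_WIN)
g_double = expected_log_growth(FULL_KELLY * 2, P_WIN)

print(f"Half Kelly keeps {g_half / g_full:.0%} of the full-Kelly growth rate")
print(f"Double Kelly growth rate per bet: {g_double:+.4f}")
```

Halving the stake gives up about a quarter of the growth rate, whereas doubling it wipes the growth out altogether, and anything beyond that turns it clearly negative.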

So should we bet the full amount recommended by the Kelly Criterion? Not so according to sports betting legend, Bill Benter. Betting the full amount recommended by the Kelly formula, he says, is unwise for a number of reasons. Notably, he warns that accurate estimation of the advantage of the bets is critical; if we overestimate the advantage by more than a factor of two, Kelly betting will cause a negative rate of capital growth, and he says this is easily done. So, as he puts it “… full Kelly betting is a rough ride.” According to Benter, and I for one will defer to his advice in these matters, a fractional Kelly betting strategy is advisable, that is, a strategy wherein one bets some fraction of the recommended Kelly bet (e.g. one half or one third). Ironically, John Kelly himself died in 1965, never having used his own criterion to make money.

So that’s the Kelly criterion. In a nutshell, the advice is only to bet when you believe you have the edge, and to do so using a stake size related to the size of the edge. Mathematically, at even-money odds, it means betting a fraction of your capital equal to the size of your advantage. So, if you have a 20% edge at the odds, bet 20% of your capital. In the real world, however, we need to allow for errors that can creep in, like uncertainty as to the true edge, if any, that we have at the odds. So, unless we’re happy to risk a very bumpy ride, and we have total confidence in our judgment, a preferred strategy will be to stake a defined fraction of that amount, known as a fractional Kelly strategy. Purists will hate us for it, but it’s not their capital at risk. So if we are going to bet, the advice is to use Kelly, but with due caution, not least in the assessment of our advantage. And when the fun of betting stops, the best advice of all may of course be to just stop. Good luck!

William of Occam (also spelled William of Ockham) was a 14th-century English philosopher. At the heart of Occam’s philosophy is the principle of simplicity, and Occam’s Razor has come to embody the method of eliminating unnecessary hypotheses. Essentially, Occam’s Razor holds that the theory which explains all (or the most) while assuming the least is the most likely to be correct. This is the principle of parsimony – explain more, assume less. Put more elegantly, it is the principle of ‘pluralitas non est ponenda sine necessitate’ (plurality must never be posited beyond necessity).

Empirical support for the Razor can be drawn from the principle of ‘overfitting.’ In statistics, ‘overfitting’ occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Critically, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. For example, a complex polynomial function might after the fact be used to pass through each data point, including those generated by noise, but a linear function might be a better fit to the signal in the data. By this we mean that the linear function would predict new and unseen data points better than the polynomial function, although the polynomial which has been devised to capture signal and noise would describe/fit the existing data better.
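
A toy illustration of the point, using only made-up data: the ‘signal’ is assumed to be y = 2x, the ‘noise’ is a hand-picked alternating ±1, and the function names are my own. The polynomial passes through every observed point, while the straight line does not, yet the line predicts the next point far better.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial that passes exactly through every data point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def signal(x):
    """The assumed underlying relationship."""
    return 2.0 * x


xs = [0, 1, 2, 3, 4, 5]
noise = [1, -1, 1, -1, 1, -1]          # hand-picked, purely for illustration
ys = [signal(x) + e for x, e in zip(xs, noise)]

line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

# Fit to the existing data: the polynomial is perfect, the line is not.
in_sample_line = max(abs(line(x) - y) for x, y in zip(xs, ys))
in_sample_poly = max(abs(poly(x) - y) for x, y in zip(xs, ys))

# Prediction for a new, unseen point.
new_x = 6
error_line = abs(line(new_x) - signal(new_x))
error_poly = abs(poly(new_x) - signal(new_x))

print("Worst in-sample miss   line:", round(in_sample_line, 2), " polynomial:", round(in_sample_poly, 2))
print("Error at the new point line:", round(error_line, 2), " polynomial:", round(error_poly, 2))
```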

Consider now ‘ad hoc’ hypotheses and the Razor. In science and philosophy, an ‘ad hoc hypothesis’ is a hypothesis added to a theory in order to save it from being falsified. Ad hoc hypothesising is compensating for anomalies not anticipated by the theory in its unmodified form. For example, you say that there is a leprechaun in your garden shed. A visitor to the shed sees no leprechaun. This is because the leprechaun is invisible, you say. The visitor spreads flour on the ground to see the footprints. The leprechaun floats, you declare. The visitor wants you to ask the leprechaun to speak. He has no voice, you say. More generally, for each accepted explanation of a phenomenon, there is generally an infinite number of possible, more complex alternatives. Each true explanation may therefore have had many alternatives that were simpler and false, but also an almost infinite number of alternatives that are more complex and false.

This leads us to the idea of what I term ‘Occam’s Leprechaun.’ Any new and more complex theory can always be possibly true. For example, if an individual claims that leprechauns were responsible for breaking a vase that he is suspected of breaking, the simpler explanation is that he is not telling the truth, but ongoing ad hoc explanations (e.g. “That’s not me on the CCTV, it’s a leprechaun disguised as me”) prevent outright falsification. An endless supply of elaborate competing explanations, called ‘saving hypotheses’, prevents ultimate falsification of the leprechaun hypothesis, but appeal to Occam’s Razor helps steer us toward the probable truth. Another way of looking at this is that simpler theories are more easily falsifiable, and hence possess more empirical content.

All assumptions introduce possibilities for error; if an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.

It can also be looked at this way. The prior probability that a theory based on n+1 assumptions is true must be less than that of a theory based on n assumptions, unless the additional assumption is a consequence of the previous assumptions. For example, the prior probability that Jack is a train driver AND that he owns a Mini Cooper must be less than the prior probability that Jack is a train driver, unless all train drivers own Mini Coopers, in which case the prior probabilities are identical.

Again, the prior probability that Jack is a train driver and a Mini Cooper owner and a ballet dancer is less than the prior probability that he is just the first two, unless all train drivers are not only Mini Cooper owners but also ballet dancers. In the latter case, the prior probabilities of the n and n+1 assumptions are the same.

From Bayes’ Theorem, we know that reducing the prior probability will reduce the posterior probability, i.e. the probability that a proposition is true after new evidence arises.

Science prefers the simplest explanation that is consistent with the data available at a given time, but even so the simplest explanation may be ruled out as new data become available. This does not invalidate the Razor, which does not state that simpler theories are necessarily more true than more complex theories, but that when more than one theory explains the same data, the simpler should be accorded more probabilistic weight.

The theory which explains all (or the most) and assumes the least is most likely. So Occam’s Razor advises us to keep explanations simple. But it is also consistent with multiplying entities necessary to explain a phenomenon. A simpler explanation which fails to explain as much as another more complex explanation is not necessarily the better one. So if leprechauns don’t explain anything they cannot be used as proxies for something else which can explain something. This is the classic riposte to the materialist who holds that there is nothing beyond what we observe in the natural or material world. If a non-materialist explanation better explains the origin of the universe, for example, that explanation may be true and consistent with Occam’s Razor. I explore this issue separately in my blog – ‘Why is there Something Rather than Nothing? A Solution’.

More generally, we can now unify Epicurus and Occam. From Epicurus’ Principle we need to keep open all hypotheses consistent with the known evidence which are true with a probability of more than zero. From Occam’s Razor we prefer from among all hypotheses that are consistent with the known evidence, the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability, and more complex ones lower probability.

From here we can move to the wider problem of induction about the unknown by extrapolating a pattern from the known. Specifically, the problem of induction is how we can justify inductive inference. According to Hume’s ‘Enquiry Concerning Human Understanding’ (1748), if we justify induction on the basis that it has worked in the past, then we have to use induction to justify why it will continue to work in the future. This is circular reasoning, and so faulty as theory. “Induction is just a mental habit, and necessity is something in the mind and not in the events.” Yet in practice we cannot help but rely on induction. We are working from the idea that it works in practice if not in theory – so far. Induction is thus related to an assumption about the uniformity of nature. Of course, induction can be turned into deduction by adding principles about the world (such as ‘the future resembles the past’, or ‘space-time is homogeneous’).

We can also assign to inductive generalisations probabilities that increase as the generalisations are supported by more and more independent events. This is the Bayesian approach, and it is a response to the perspective pioneered by Karl Popper. From the Popperian perspective, a single observational event may prove hypotheses wrong, but no finite sequence of events can verify them correct. Induction is from this perspective theoretically unjustifiable and becomes in practice the choice of the simplest generalisation that resists falsification. The simpler a hypothesis, the easier it is to be falsified. Induction and falsifiability are in practice, from this viewpoint, as good as it gets in science.

Take an inductive inference problem where there is some observed data and a set of hypotheses, one of which may be the true hypothesis generating the data. The task then is to decide which hypothesis, or hypotheses, are the most likely to be responsible for the observations.

A better way of looking at this seems to be to abandon certainties and think probabilistically. Entropy is the tendency of isolated systems to move toward disorder and a quantification of that disorder, e.g. assembling a deck of cards in a defined order requires introducing some energy to the system. If you drop the deck, the cards become disorganised and won’t re-organise themselves automatically. This is the tendency in all systems to disorder. This is the Second Law of Thermodynamics, which implies that time is asymmetrical with respect to the amount of order: as the system advances through time, it will statistically become more disordered. By ‘Order’ and ‘Disorder’ we mean how compressed the information is that is describing the system. So if all your papers are in one neat pile, then the description is ‘All paper in one neat pile.’ If you drop them, the description becomes ‘One paper to the right, another to the left, one above, one below, etc. etc.’ The longer the description, the higher the entropy. According to Occam’s Razor, we want a theory with low entropy, i.e. low disorder, high simplicity. The lower the entropy, the more likely it is that the theory is the true explanation of the data, and hence that theory should be assigned a higher probability.

More generally, whatever theory we develop, say to explain the origin of the universe, or consciousness, or non-material morality, must itself be based on some theory, which is based on some other theory, and so on. At some point we need to rely on some statement which is true but not provable, and so we think may be false, although it is actually true. We can never solve the ultimate problem of induction, but Occam’s Razor combined with Epicurus, Bayes and Popper is as good as it gets if we accept that. So Epicurus, Occam, Bayes and Popper help us pose the right questions, and help us to establish a good framework for thinking about the answers.

At least that applies to the realm of established scientific enquiry and the pursuit of scientific truth. How far it can properly be extended beyond that is a subject of intense and continuing debate.

Further Reading

Bayes’ Theorem: The Most Powerful Equation in the World. https://leightonvw.com/2017/03/12/bayes-theorem-the-most-powerful-equation-in-the-world/

Why is there Something Rather than Nothing https://wordpress.com/post/leightonvw.com/639

A patient goes to see the doctor. The doctor performs a test on all his patients, for a flu bug, estimating that only 1 per cent of the people who visit his surgery have the flu bug. The test he gives them, however, is 99 percent reliable – that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. Now the question is: if the patient tests positive, what chances should the doctor give to the patient having the flu bug?

The intuitive answer is 99 percent.

But is that right?

The information we are given is ‘the probability of testing positive given that you are sick’. What we want to know, however, is ‘the probability of being sick given that you tested positive.’ Common intuition conflates these two probabilities, but they are in fact very different. If the test is 99% reliable, this means that 99% of sick people test positive. But this is NOT the same thing as saying that 99% of people who test positive are sick. This is known as the ‘Inverse Fallacy’ or ‘Prosecutor’s Fallacy’. It is the fallacy, to which jurors are very susceptible, of believing that the probability of a defendant being guilty of a crime given the observation of some piece of evidence is the same as the probability of observing that piece of evidence if the defendant was guilty. They are in fact very different things, and the two probabilities can diverge markedly, markedly enough in fact to send many people to the place of execution or to a life without possibility of parole.

So what is the probability of being sick if you test positive, given that the test is 99% reliable (i.e. 99% of people who are sick test positive and 99% of people who are not sick test negative)?

To answer this we can use Bayes’ Theorem.

The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the x,y,z formula of Bayes’ Theorem, is equal to:

xy/[xy+z(1-x)]

x is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence.

y is the probability you would see the new evidence if the hypothesis is true.

z is the probability you would see the new evidence if the hypothesis is false.

In the case of the flu test, the hypothesis is that the patient is sick.

Before the new evidence (the test), this chance is estimated at 1 in 100 (0.01)

So x = 0.01

The probability we would see the new evidence (the positive result on the test) if the hypothesis is true (the patient is sick) is 99%, since the test is 99% reliable.

So y =0.99

The probability we would see the new evidence (the positive result on the test) if the hypothesis is false (the patient is not sick) is just 1% (because the test is 99% reliable, and will only give a false positive 1 time in 100).

So z = 0.01

Substituting into Bayes’ equation gives:

(0.01 × 0.99) / [0.01 × 0.99 + 0.01 × (1 – 0.01)] = (0.01 × 0.99) / [0.01 × 0.99 + 0.01 × 0.99] = 1/2

So there is actually a 50% chance that the test, which is 99% reliable and has tested positive, has misdiagnosed you and you are actually flu-free.

Basically, it is a competition between how rare the disease is and how rarely the test is wrong. In this case, there is a 1 in 100 chance that you have the flu before undertaking the test, and the test is wrong 1 time in 100. These two probabilities are equal, so the chance that you actually have the flu when testing positive is 1 in 2.

But what if the patient is showing symptoms of the disease before being tested?

In this case, the prior probability should be updated to something higher than the prevalence rate of the disease in the entire tested population, and the chance you are actually sick when you test positive rises accordingly. To the extent that a doctor only tests for something that there is corroborating support for, the likelihood that the test result is correct grows. For this reason, any positive test result should be taken very seriously, statistics aside.
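
The x,y,z formula is easy to turn into a few lines of code, which makes it simple to see how the answer moves with the prior. In this sketch the function name is mine, and the 10 per cent prior for a patient with symptoms is a purely hypothetical figure chosen for illustration; the text gives no number for that case.

```python
def posterior(x, y, z):
    """Bayes' Theorem in x,y,z form: xy / [xy + z(1 - x)].

    x: prior probability that the hypothesis is true
    y: probability of the evidence if the hypothesis is true
    z: probability of the evidence if the hypothesis is false
    """
    return (x * y) / (x * y + z * (1 - x))


# The flu test from the text: 1% prior, 99% reliable test.
flu_given_positive = posterior(x=0.01, y=0.99, z=0.01)

# Hypothetical: symptoms raise the prior to 10% before the same test is taken.
flu_given_positive_with_symptoms = posterior(x=0.10, y=0.99, z=0.01)

print(f"No symptoms:   {flu_given_positive:.1%}")
print(f"With symptoms: {flu_given_positive_with_symptoms:.1%}")
```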

More generally, to differentiate truth from scare we really do need to understand and employ Bayes’ Theorem. Whether at the doctor’s surgery or in the jury room, understanding it really could save a life.

The majestic tragedy, Othello, was written by William Shakespeare in about 1603. The play revolves around four central characters: Othello, a Moor who is a general in the Venetian army; his beloved wife, Desdemona; his loyal lieutenant, Cassio; and his trusted ensign, Iago.

A key element of the play is Iago’s plot to convince Othello that Desdemona is conducting an affair with Cassio, by planting a treasured keepsake Othello gave to Desdemona, in Cassio’s lodgings, for Othello ‘accidentally’ to come upon.

We playgoers know she is not cheating on him, as does Iago, but Othello, while reluctant to believe it of Desdemona, is also very reluctant to believe that Iago could be making it up.

If Othello refuses to contemplate any possibility of betrayal, then we would have a play in which no amount of evidence, however overwhelming, including finding them together, could ever change his mind. We would have a farce or a comedy instead of a tragedy.

A shrewder Othello would concede that there is at least a possibility that Desdemona is betraying him, however small that chance might be. This means that there does exist some level of evidence, however great it would need to be, that would leave him no alternative. If his prior trust in Desdemona is almost, but not absolutely total, then this would permit of some level of evidence, logically incompatible with her innocence, changing his mind. This might be called ‘Smoking Gun’ evidence.

On the other hand, Othello might adopt a more balanced position, trying to assess the likelihood objectively and without emotion. But how? Should he try and find out the proportion of female Venetians who conduct extra-marital affairs? This would give him the probability for a randomly selected Venetian woman but no more than that. Hardly a convincing approach when surely Desdemona is not just an average Venetian woman. So should he limit the reference class to women who are similar to Desdemona? But what does that mean?

And this is where it is easy for Othello to come unstuck. Because it is so difficult to choose a prior probability (as Bayesians would term it), the temptation is to assume that since it might or might not be true, the likelihood is 50-50. This is known as the ‘Prior Indifference Fallacy’. Once Othello falls victim to this common fallacy, any evidence against Desdemona now becomes devastating. It is the same problem as that facing the defendant in the dock.

Extreme, though not blind, trust is one way to avoid this mistake. But an alternative would be to find evidence that is logically incompatible with Desdemona’s guilt, in effect the opposite of the ‘Smoking Gun.’ The ‘Perfect Alibi’ would fit the bill.

Perhaps Othello would love to find evidence that is logically incompatible with Desdemona conducting an affair with Cassio, but holds her guilty unless he can find it. He needs evidence that admits no True Positives.

Lacking extreme trust and a Perfect Alibi, what else could have saved Desdemona?

To find the answer, we shall turn as usual to Bayes and Bayes’ Theorem. Bayes’ Theorem, otherwise known as the most important equation in the world, solves these sorts of problems very adeptly every time, using the wonderfully simple x,y,z formula.

The (posterior) probability that a hypothesis is true after obtaining new evidence, according to the x,y,z formula of Bayes’ Theorem, is equal to:

xy/[xy+z(1-x)]

x is the prior probability, i.e. the probability that a hypothesis is true before you see the new evidence.

y is the probability you would see the new evidence if the hypothesis is true.

z is the probability you would see the new evidence if the hypothesis is false.

In the case of the Desdemona problem, the hypothesis is that Desdemona is guilty of betraying Othello with Cassio.

Before the new evidence (the finding of the keepsake), let’s say that Othello assigns a chance of 4% to Desdemona being unfaithful.

So x = 0.04

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is true (Desdemona and Cassio are conducting an affair) is, say, 50%. There’s quite a good chance she would secretly hand Cassio the keepsake as proof of her love for him and not for Othello.

So y = 0.5

The probability we would see the new evidence (the keepsake in Cassio’s lodgings) if the hypothesis is false is, say, just 5%. Why would it be there if Desdemona had not been to his lodgings secretly, and why would she take the keepsake along in any case?

So z = 0.05

Substituting into Bayes’ equation gives:

0.04 x 0.5 / [0.04 x 0.5 + 0.05 (1 – 0.04)] = 0.294.

So, using Bayes’ Rule, and these estimates, the chance that Desdemona is guilty of betraying Othello is 29.4%, worryingly high for the tempestuous Moor but perhaps low enough to prevent tragedy. The power of Bayes here lies in demonstrating to Othello that the finding of the keepsake in the living quarters of Cassio might only have a 1 in 20 chance of being consistent with Desdemona’s innocence, but in the bigger picture, there is a less than 3 in 10 chance that she actually is culpable.

If this is what Othello concludes, the task of the evil Iago is to lower z in the eyes of Othello by arguing that the true chance of the keepsake ending up with Cassio without a nefarious reason is so astoundingly small that 1 in 100 is nearer the mark than 1 in 20. In other words, to convince Othello to lower his estimate of z from 0.05 to 0.01.

The new Bayesian probability of Desdemona’s guilt now becomes:

xy/[xy+z(1-x)]

x = 0.04 (the prior probability of Desdemona’s guilt, as before)

y = 0.5 (as before)

z = 0.01 (down from 0.05)

Substituting into Bayes’ equation gives:

0.04 x 0.5 / [0.04 x 0.5 + 0.01 (1 – 0.04)] = 0.676.

So, if Othello can be convinced that 5% is too high a probability that there is an innocent explanation for the appearance of the keepsake in Cassio’s lodgings – let’s say he’s persuaded by Iago that the true probability is 1% – then Desdemona’s fate, as that of many a defendant whom a juror thinks has more than a 2 in 3 chance of being guilty, is all but sealed. Her best hope now is to try and convince Othello that the chance of the keepsake being found in Cassio’s place if she were guilty is much lower than 0.5. For example, she could try a common sense argument that there is no way that she would take the keepsake if she were actually having an affair with Cassio, nor be so careless as to leave it behind. In other words, she could argue that the presence of the keepsake where it was found actually provides testimony to her innocence. In Bayesian terms, she should try to reduce Othello’s estimate of y. What level of y would have prevented tragedy? That is another question.
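
For anyone who wants to experiment with the numbers, here is a minimal sketch of the two calculations, together with one way of exploring that closing question. The function names are mine, and the 50 per cent threshold is a purely hypothetical stand-in for the level of suspicion at which tragedy is averted; the play, needless to say, specifies no such figure.

```python
def posterior(x, y, z):
    """Bayes' Theorem in x,y,z form: xy / [xy + z(1 - x)]."""
    return (x * y) / (x * y + z * (1 - x))


def y_for_posterior(target, x, z):
    """Value of y at which the posterior equals `target`, for given x and z.

    Obtained by rearranging target = xy / [xy + z(1 - x)] to solve for y.
    """
    return target * z * (1 - x) / (x * (1 - target))


# Othello's own estimates, then the estimates after Iago has lowered z.
before_iago = posterior(x=0.04, y=0.5, z=0.05)
after_iago = posterior(x=0.04, y=0.5, z=0.01)

# Hypothetical threshold: suppose guilt below 50% is enough to stay his hand.
THRESHOLD = 0.5
y_needed = y_for_posterior(THRESHOLD, x=0.04, z=0.01)

print(f"Before Iago's intervention: {before_iago:.1%}")
print(f"After Iago's intervention:  {after_iago:.1%}")
print(f"Desdemona must push y below {y_needed:.2f} to get under {THRESHOLD:.0%}")
```

On these assumed figures, Desdemona would need to persuade Othello that y is below about a quarter, rather than the one half he started with.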

William Shakespeare wrote Othello about a hundred years before the Reverend Thomas Bayes was born. That is true. But to my mind the Bard was always, in every inch of his being, a true Bayesian. Othello was not, and therein lies the tragedy.