Further and deeper exploration of paradoxes and challenges of intuition and logic can be found in my recently published book, Probability, Choice and Reason.

William of Occam (also spelled William of Ockham) was a 14th-century English philosopher. At the heart of Occam’s philosophy is the principle of simplicity, and Occam’s Razor has come to embody the method of eliminating unnecessary hypotheses. Essentially, Occam’s Razor holds that the theory which explains all (or the most) while assuming the least is the most likely to be correct. This is the principle of parsimony: explain more, assume less. Put more elegantly, it is the principle of ‘pluralitas non est ponenda sine necessitate’ (plurality should not be posited without necessity).

Empirical support for the Razor can be drawn from the phenomenon of ‘overfitting.’ In statistics, ‘overfitting’ occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Critically, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
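As a rough illustration of overfitting, the sketch below compares a two-parameter straight line with a six-parameter polynomial that passes exactly through every observation. The data, the ‘noise’ offsets and the function names are all made up for the purpose of the example; it is a toy under those assumptions, not a general statistical procedure.

```python
# Illustrative data: the "true" relationship is y = 2x + 1. The small
# offsets below are invented stand-ins for random noise (an assumption).
xs = [0, 1, 2, 3, 4, 5]
noise = [0.3, -0.4, 0.2, -0.3, 0.4, -0.2]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]


def fit_line(xs, ys):
    """Ordinary least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Lagrange polynomial through every point (one parameter per observation)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Predict a new observation that neither model has seen.
new_x = 6
true_y = 2 * new_x + 1

slope, intercept = fit_line(xs, ys)
line_error = abs(slope * new_x + intercept - true_y)
poly_error = abs(interpolate(xs, ys, new_x) - true_y)

print("simple model error: ", line_error)
print("complex model error:", poly_error)
```

The complex model reproduces the six observations perfectly, noise included, and it is precisely this that makes its prediction for the new point far worse than that of the simple line.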

We can also look at it through the lens of what is known as Solomonoff Induction. Whether a detective trying to solve a crime, a physicist trying to discover a new universal law, or an entrepreneur seeking to interpret some latest sales figures, all are involved in collecting information and trying to infer the underlying causes. The problem of induction is this: We have a set of observations (or data), and want to find the underlying causes of those observations, i.e. to find hypotheses that explain our data. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events. In doing so, we need to create a set of defined steps to arrive at the truth, a so-called algorithm for truth.

In particular, if all of the hypotheses are possible but some are more likely than others, how do you weight the various hypotheses? This is where Occam’s Razor comes in.

Consider, for example, the two 32-character sequences:

abababababababababababababababab

4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first can be written “ab 16 times”. The second probably cannot be simplified further.
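One crude way to check this intuition is to hand both sequences to a general-purpose compressor and compare the sizes of the results. The sketch below uses Python’s standard zlib module as a stand-in for ‘shortest description’; that is an assumption made for illustration, since a compressor only gives an upper bound on how short a description can be.

```python
import zlib

regular = "ab" * 16
irregular = "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"


def compressed_size(text):
    """Size in bytes of the zlib-compressed text (a rough proxy for description length)."""
    return len(zlib.compress(text.encode("ascii"), 9))


print(len(regular), compressed_size(regular))
print(len(irregular), compressed_size(irregular))
```

Both strings are 32 characters long, but the repetitive one compresses to fewer bytes than the irregular one.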

Now consider the following problem. A computer program outputs the following sequence of numbers: 1, 3, 5, 7. What rule do you think gave rise to the number sequence 1, 3, 5, 7? If we know this, it will help us to predict what the next number in the sequence is likely to be, if there is one. Two hypotheses spring instantly to mind. It could be: 2n-1, where n is the step in the sequence. So the third step, for example, gives 2×3-1 = 5. If this is the correct rule generating the observations, the next step in the sequence will be 9 (2×5-1).

But it’s possible that the rule generating the number sequence is: 2n-1 + (n-1)(n-2)(n-3)(n-4). So the third step, for example, gives 2×3-1 + (3-1)(3-2)(3-3)(3-4) = 5, because the added product is zero for each of the first four steps. In this case, however, the next step in the sequence will be 33.
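The two candidate rules can be checked directly. In this short sketch the function names are my own labels for the two hypotheses; it confirms that both reproduce the four observations and only part company at the fifth step.

```python
def simple_rule(n):
    """First hypothesis: 2n - 1."""
    return 2 * n - 1


def complex_rule(n):
    """Second hypothesis: 2n - 1 + (n-1)(n-2)(n-3)(n-4)."""
    return 2 * n - 1 + (n - 1) * (n - 2) * (n - 3) * (n - 4)


observed = [1, 3, 5, 7]
steps = range(1, len(observed) + 1)

# Both rules agree with everything seen so far...
print([simple_rule(n) for n in steps] == observed)
print([complex_rule(n) for n in steps] == observed)

# ...but they disagree about what comes next.
print(simple_rule(5), complex_rule(5))
```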

But doesn’t the first hypothesis seem more likely? Occam’s Razor is the principle behind this intuition. “Among all hypotheses consistent with the observations, the simplest is the most likely.”

More generally, say we have two different hypotheses about the rule generating the data. How do we decide which is more likely to be true? To start, is there a language in which we can express all problems, all data, all hypotheses? Let’s look at binary data. This is the name for representing information using only the characters ‘0’ and ‘1’. In a sense, binary is the simplest possible alphabet. With these two characters we can encode information. Each 0 or 1 in a binary sequence (e.g. 01001011) can be considered the answer to a yes-or-no question. And in principle, all information can be represented in binary sequences. Indeed, being able to do everything in the language of binary sequences simplifies things greatly, and gives us great power. We can treat everything contained in the data in the same way.

Now that we have a simple way to deal with all types of data, we need to look at the hypotheses, in particular how to assign prior probabilities to the hypotheses. When we encounter new data, we can then use Bayes’ Theorem to update these probabilities.

To be complete, to guarantee we find the real explanation for our data, we have to consider all possible hypotheses. But how could we ever find all possible explanations for our data?

By using the language of binary, we can do so.

Here we look to the concept of Solomonoff induction, in which the assumption we make about our data is that it was generated by some algorithm, i.e. the hypothesis that explains the data is an algorithm. Now we can find all the hypotheses that would predict the data we have observed. Given our data, we find potential hypotheses to explain it by running every hypothesis, one at a time. If the output matches our data, we keep it. Otherwise, we discard it. We now have a methodology, at least in theory, to examine the whole list of hypotheses that might be the true cause behind our observations.
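The ‘run every hypothesis and keep the ones that match’ procedure can be sketched with a deliberately tiny stand-in for the space of all algorithms. Here a hypothesis is just a bit pattern, and ‘running’ it means repeating the pattern indefinitely. This toy language, and the names used, are assumptions made for illustration; genuine Solomonoff induction ranges over all programs for a universal computer and cannot be carried out in practice.

```python
from itertools import product


def run_hypothesis(pattern, length):
    """Toy 'program': repeat the bit pattern until `length` bits are produced."""
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]


def consistent_hypotheses(data, max_bits):
    """Enumerate every pattern up to max_bits long; keep those that reproduce the data."""
    survivors = []
    for n_bits in range(1, max_bits + 1):
        for bits in product("01", repeat=n_bits):
            pattern = "".join(bits)
            if run_hypothesis(pattern, len(data)) == data:
                survivors.append(pattern)  # output matches: keep it
            # otherwise the hypothesis is discarded
    return survivors


observed = "010101"
survivors = consistent_hypotheses(observed, max_bits=4)
print(survivors)
```

With these observations and a four-bit limit, two hypotheses survive: the pattern 01 and the longer pattern 0101.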

The first thing is to imagine that for each bit of the hypothesis, we toss a coin. Heads will be 0, and tails will be 1. Take as an example, 01001101, so the coin landed heads, tails, heads, heads and so on. Because each outcome of a coin toss has a 50% probability, each bit contributes ½ to the final probability. Therefore, an algorithm that is one bit longer is half as likely to be the true algorithm. This intuitively fits with Occam’s Razor: a hypothesis that is 8 bits long is much more likely than a hypothesis that is 34 bits long. Why bother with extra bits? We’d need evidence to show that they were necessary. So why not take the shortest hypothesis and call that the truth? Because all of the hypotheses predict the data we have so far, and in the future we might get data to rule out the shortest one. The more data we get, the easier it is likely to become to pare down the number of competing hypotheses which fit the data.
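The coin-tossing prior is simple enough to compute exactly. The sketch below uses exact fractions; the function names, and the pair of surviving hypotheses at the end (one of 2 bits, one of 4 bits), are hypothetical choices made for the example.

```python
from fractions import Fraction


def coin_toss_prior(hypothesis_bits):
    """Probability of tossing exactly this bit string: one half per bit."""
    return Fraction(1, 2 ** len(hypothesis_bits))


# The 8-bit example from the text.
print(coin_toss_prior("01001101"))

# How much more probable is any 8-bit hypothesis than any 34-bit one?
ratio = coin_toss_prior("0" * 8) / coin_toss_prior("0" * 34)
print(ratio)


def normalised_weights(hypotheses):
    """Rescale the priors of the surviving hypotheses so that they sum to one."""
    raw = {h: coin_toss_prior(h) for h in hypotheses}
    total = sum(raw.values())
    return {h: weight / total for h, weight in raw.items()}


# Two hypothetical hypotheses that both fit the data so far.
weights = normalised_weights(["01", "0101"])
print(weights)
```

The shorter hypothesis receives most of the weight, but the longer one is not thrown away: it keeps a share in case later data rule out its shorter rival.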

We turn now to ‘ad hoc’ hypotheses and the Razor. In science and philosophy, an ‘ad hoc hypothesis’ is a hypothesis added to a theory in order to save it from being falsified. Ad hoc hypothesising is compensating for anomalies not anticipated by the theory in its unmodified form. For example, you say that there is a leprechaun in your garden shed. A visitor to the shed sees no leprechaun. This is because the leprechaun is invisible, you say. The visitor spreads flour on the ground to reveal the footprints. The leprechaun floats, you declare. The visitor asks you to get the leprechaun to speak. He has no voice, you say. More generally, for each accepted explanation of a phenomenon, there is generally an infinite number of possible, more complex alternatives. Each true explanation may therefore have had many alternatives that were simpler and false, but also approaching an infinite number of alternatives that are more complex and false.

This leads us to the idea of what I term ‘Occam’s Leprechaun.’ Any new and more complex theory can always be possibly true. For example, if an individual claims that leprechauns were responsible for breaking a vase that he is suspected of breaking, the simpler explanation is that he is not telling the truth, but ongoing ad hoc explanations (e.g. “That’s not me on the CCTV, it’s a leprechaun disguised as me”) prevent outright falsification. An endless supply of elaborate competing explanations, called ‘saving hypotheses’, prevents ultimate falsification of the leprechaun hypothesis, but appeal to Occam’s Razor helps steer us towards the probable truth. Another way of looking at this is that simpler theories are more easily falsifiable, and hence possess more empirical content.

All assumptions introduce possibilities for error; if an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.

It can also be looked at this way. The prior probability that a theory based on n+1 assumptions is true must be less than that of a theory based on n assumptions, unless the additional assumption is a consequence of the previous assumptions. For example, the prior probability that Jack is a train driver AND that he owns a Mini Cooper must be less than the prior probability that Jack is a train driver, unless all train drivers own Mini Coopers, in which case the prior probabilities are identical.

Again, the prior probability that Jack is a train driver and a Mini Cooper owner and a ballet dancer is less than the prior probability that he is just the first two, unless all train drivers are not only Mini Cooper owners but also ballet dancers. In the latter case, the prior probabilities of the n and n+1 assumptions are the same.
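The arithmetic behind the Jack example is just the multiplication rule for joint probabilities. In the sketch below every number is invented purely for illustration, and the helper name is my own.

```python
from fractions import Fraction


def joint_probability(first, *conditionals):
    """Multiply a prior by the conditional probability of each added assumption."""
    result = first
    for p in conditionals:
        result *= p
    return result


# Hypothetical figures, chosen only to make the arithmetic visible.
p_driver = Fraction(1, 1000)           # Jack is a train driver
p_mini_given_driver = Fraction(1, 20)  # ...and owns a Mini Cooper, given that
p_dancer_given_both = Fraction(1, 50)  # ...and is a ballet dancer, given both

p_two = joint_probability(p_driver, p_mini_given_driver)
p_three = joint_probability(p_driver, p_mini_given_driver, p_dancer_given_both)
print(p_driver, p_two, p_three)

# If every train driver owned a Mini Cooper, the extra assumption would cost nothing.
print(joint_probability(p_driver, Fraction(1)) == p_driver)
```

Each added assumption multiplies the probability by a number no greater than one, so the total can only shrink or, at best, stay the same.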

From Bayes’ Theorem, we know that, other things being equal, reducing the prior probability will reduce the posterior probability, i.e. the probability that a proposition is true after new evidence arises.
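To see how the prior feeds through to the posterior, here is a minimal sketch of Bayes’ Theorem applied to two competing hypotheses. The labels ‘simple’ and ‘complex’ and all of the probabilities are assumptions chosen for the example.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Bayes' Theorem: posterior is prior times likelihood, rescaled to sum to one."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())
    return {h: j / evidence for h, j in joint.items()}


# Hypothetical priors favouring the simpler hypothesis.
priors = {"simple": Fraction(4, 5), "complex": Fraction(1, 5)}

# Case 1: both hypotheses predict the new data equally well.
same_fit = posteriors(priors, {"simple": Fraction(1), "complex": Fraction(1)})
print(same_fit)

# Case 2: the new data are ten times more probable under the complex hypothesis.
better_fit = posteriors(priors, {"simple": Fraction(1, 10), "complex": Fraction(1)})
print(better_fit)
```

When the evidence does not discriminate between the hypotheses, the ordering set by the priors carries straight through to the posteriors. When the evidence strongly favours the more complex hypothesis, it can overturn the initial preference for the simpler one.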

Science prefers the simplest explanation that is consistent with the data available at a given time, but even so the simplest explanation may be ruled out as new data become available. This does not invalidate the Razor, which does not state that simpler theories are necessarily more true than more complex theories, but that when more than one theory explains the same data, the simpler should be accorded more probabilistic weight. The theory which explains all (or the most) and assumes the least is most likely. So Occam’s Razor advises us to keep explanations simple. But it is also consistent with multiplying entities necessary to explain a phenomenon. A simpler explanation which fails to explain as much as another more complex explanation is not necessarily the better one. So if leprechauns don’t explain anything they cannot be used as proxies for something else which can explain something.

More generally, we can now unify Epicurus and Occam. From Epicurus’ Principle we need to keep open all hypotheses consistent with the known evidence which are true with a probability of more than zero. From Occam’s Razor we prefer from among all hypotheses that are consistent with the known evidence, the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability, and more complex ones lower probability.

From here we can move to the wider problem of induction: reasoning about the unknown by extrapolating a pattern from the known. Specifically, the problem of induction is how we can justify inductive inference. According to Hume’s ‘Enquiry Concerning Human Understanding’ (1748), if we justify induction on the basis that it has worked in the past, then we have to use induction to justify why it will continue to work in the future. This is circular reasoning, and so faulty as theory. “Induction is just a mental habit, and necessity is something in the mind and not in the events.” Yet in practice we cannot help but rely on induction. We are working from the idea that it works in practice if not in theory, at least so far. Induction is thus related to an assumption about the uniformity of nature. Of course, induction can be turned into deduction by adding principles about the world (such as ‘the future resembles the past’, or ‘space-time is homogeneous’).

We can also assign to inductive generalisations probabilities that increase as the generalisations are supported by more and more independent events. This is the Bayesian approach, and it offers a response to the perspective pioneered by Karl Popper. From the Popperian perspective, a single observational event may prove a hypothesis wrong, but no finite sequence of events can verify it as correct. Induction is from this perspective theoretically unjustifiable and becomes in practice the choice of the simplest generalisation that resists falsification. The simpler a hypothesis, the easier it is to falsify. From this viewpoint, induction combined with falsifiability is in practice as good as it gets in science.

Take an inductive inference problem where there is some observed data and a set of hypotheses, one of which may be the true hypothesis generating the data. The task then is to decide which hypothesis, or hypotheses, are the most likely to be responsible for the observations.
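Both perspectives can be seen in a small worked example. Suppose the generalisation is ‘all ravens are black’ and its only rival says that nine ravens in ten are black. The ravens, the rival hypothesis and the numbers are all hypothetical, chosen only to make the sketch concrete.

```python
from fractions import Fraction


def probability_all_black(prior, p_black_under_rival, observations):
    """Posterior for 'all ravens are black' against a single rival hypothesis."""
    p_general = prior
    p_rival = 1 - prior
    for colour in observations:
        # The generalisation says a black raven is certain, anything else impossible.
        like_general = Fraction(1) if colour == "black" else Fraction(0)
        like_rival = p_black_under_rival if colour == "black" else 1 - p_black_under_rival
        p_general *= like_general
        p_rival *= like_rival
        total = p_general + p_rival
        p_general, p_rival = p_general / total, p_rival / total
    return p_general


prior = Fraction(1, 2)
rival = Fraction(9, 10)

# Each additional black raven raises the probability of the generalisation...
for k in (0, 1, 5, 20):
    print(k, float(probability_all_black(prior, rival, ["black"] * k)))

# ...but a single white raven refutes it, however many black ones came before.
print(probability_all_black(prior, rival, ["black"] * 20 + ["white"]))
```

The probability climbs towards one as confirming observations accumulate but never reaches it, while one contrary observation sends it to zero.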

A better way of looking at this seems to be to abandon certainties and think probabilistically. Entropy is the tendency of isolated systems to move toward disorder, and a quantification of that disorder. For example, assembling a deck of cards in a defined order requires introducing some energy to the system. If you drop the deck, the cards become disorganised and won’t re-organise themselves automatically. This tendency of isolated systems towards disorder is the Second Law of Thermodynamics, which implies that time is asymmetrical with respect to the amount of order: as the system advances through time, it will statistically become more disordered. By ‘order’ and ‘disorder’ we mean how compressed the information describing the system is. So if all your papers are in one neat pile, then the description is ‘All papers in one neat pile.’ If you drop them, the description becomes ‘One paper to the right, another to the left, one above, one below, etc. etc.’ The longer the description, the higher the entropy. According to Occam’s Razor, we want a theory with low entropy, i.e. low disorder, high simplicity. The lower the entropy, the more likely it is that the theory is the true explanation of the data, and hence that theory should be assigned a higher probability.
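The link between disorder and length of description can be made concrete with a run-length description of a small deck of eight red (R) and eight black (B) cards. Run-length encoding is used here only as a crude stand-in for ‘the shortest description’, and the particular disordered arrangement is one I have made up; this is not a calculation of thermodynamic entropy.

```python
from itertools import groupby


def run_length_description(sequence):
    """Describe a sequence as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(sequence)]


neat_pile = "R" * 8 + "B" * 8        # deck sorted by colour
dropped = "RBBRBRRBRBBRRBBR"         # the same cards after being dropped (made up)

print(run_length_description(neat_pile))
print(len(run_length_description(neat_pile)), "runs needed when ordered")
print(len(run_length_description(dropped)), "runs needed when disordered")
```

The ordered deck needs only two entries to describe it, while the same sixteen cards in disarray need eleven.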

More generally, whatever theory we develop, say to explain the origin of the universe, or consciousness, or non-material morality, must itself be based on some theory, which is based on some other theory, and so on. At some point we need to rely on some statement which is true but not provable, so that we may suspect it to be false even though it is actually true. We can never solve the ultimate problem of induction, but Occam’s Razor combined with Epicurus, Bayes and Popper is as good as it gets if we accept that. So Epicurus, Occam, Bayes and Popper help us pose the right questions, and help us to establish a good framework for thinking about the answers.

At least that applies to the realm of established scientific enquiry and the pursuit of scientific truth. How far it can properly be extended beyond that is a subject of intense and continuing debate.
