Do we live in a simulation, created by an advanced civilisation, in which we are part of some sophisticated virtual reality experience? For this to be a possibility we can make the obvious assumption that sufficiently advanced civilisations will possess the requisite computing and programming power to create what philosopher Nick Bostrom termed such ‘ancestor simulations’. These simulations would be complex enough for the minds that are simulated to be conscious and able to experience the type of experiences that we do. The creators of these simulations could exist at any stage in the development of the universe, even billions of years into the future.

The argument around simulation goes like this. One of the following three statements must be correct.

a. That civilisations at our level of development always or almost always disappear before becoming technologically advanced enough to create these simulations.

b. That the proportion of these technologically advanced civilisations that wish to create these simulations is zero or almost zero.

c. That we are almost sure to be living in such a simulation.

To see this, let’s examine each proposition in turn.

a. Suppose that the first is not true. In that case, a significant proportion of civilisations at our stage of technology go on to become technologically advanced enough to create these simulations.

b. Suppose that the second is not true. In this case, a significant proportion of these civilisations run such simulations.

c. If both of the above propositions are not true, then there will be countless simulated minds indistinguishable to all intents and purposes from ours, as there is potentially no limit to the number of simulations these civilisations could create. The number of such simulated minds would almost certainly be overwhelmingly greater than the number of minds that created them. Consequently, we would be quite safe in assuming that we are almost certainly inside a simulation created by some form of advanced civilisation.

For the first proposition to be untrue, civilisations must be able to go through the phase of being able to wipe themselves out, either deliberately or by accident, carelessness or neglect, and never or almost never do so. This might perhaps seem unlikely based on our experience of this world, but becomes more likely if we consider all other possible worlds.

For the second proposition to be untrue, we would have to assume that virtually all civilisations that were able to create these simulations would decide not to do so. This again is possible, but would seem unlikely.

If we consider both propositions, and we think it is unlikely that no civilisations survive long enough to achieve what Bostrom calls ‘technological maturity’, and that it is unlikely that hardly any would create ‘ancestor simulations’ if they could, then anyone considering the question is left with a stark conclusion. They really are living in a simulation.

To summarise. An advanced ‘technologically mature’ civilisation would have the capability of creating simulated minds. Based on this, at least one of three propositions must be true.

a. The proportion of these advanced civilisations is close to zero or zero.

b. The proportion of these advanced civilisations that wish to run these simulations is close to zero.

c. The proportion of those consciously considering the question who are living in a simulation is close to one.

If the first of these propositions is true, we will almost certainly not survive to become ‘technologically mature.’ If the second proposition is true, virtually no advanced civilisations are interested in using their power to create such simulations. If the third proposition is true, then conscious beings considering the question are almost certainly living in a simulation.

Through the veil of our ignorance, it might seem sensible to assign equal credence to all three, and to conclude that unless we are currently living in a simulation, descendants of this civilisation will almost certainly never be in a position to run these simulations.

Strangely indeed, the probability that we are living in a simulation increases as we draw closer to the point at which we are able and willing to do so. At the point that we would be ready to create our own simulations, we would paradoxically be at the very point when we were almost sure that we ourselves were simulations. Only by refraining to do so could we in a certain sense make it less likely that we were simulated, as it would show that at least one civilisation that was able to create simulations refrained from doing so. Once we took the plunge, we would know that we were almost certainly only doing so as simulated beings. And yet there must have been someone or something that created the first simulation. Could that be us, we would be asking ourselves? In our simulated hearts and minds, we would already know the answer!

It was a puzzle first posed by the Swiss mathematician, Nicolas Bernoulli, in a letter to Pierre Raymond de Montmort, on Sept. 9, 1713, and published in the Commentaries of the Imperial Academy of Science of St. Petersburg. Mercifully it is simple to state. Less mercifully, it is supposedly a nightmare to solve. To state the paradox, imagine tossing a coin until it lands heads-up, and suppose that the payoff grows exponentially according to the number of tosses you make. If the coin lands heads-up on the first toss, then the payoff is £2. If it lands tails on the first toss, you receive £1. If it lands heads-up on the second toss, the payoff is £4; if it takes three tosses, the payoff is £8; and so forth, ad infinitum. You can play as many rounds of the game as you wish.

Now the odds of the game ending on the first toss is ½; of it ending on the second toss is (1/2)^2 = ¼; on the third, (1/2)^3 = 1/8, etc., so your expected win from playing the game = (1/2 x £1) + (1/2 x £2) + (1/4 x £4) + (1/8 x £8) + …, i.e. £0.5 + £1 + £1 + £1 … = infinity. It follows that you should be willing to pay any finite amount for the privilege of playing this game. Yet it seems irrational to pay very much at all.

According to this reasoning, any finite stake is justified because the eventual payout increases infinitely through time, so you must end up with a profit whenever the game ends. Yet most people are only willing to pay a few pounds, or at least not much more than this. So is this yet further evidence of our intuition letting us down?

That depends on why most people are not willing to pay much. There have been very many explanations proposed over the years, some more satisfying than others, but none has been universally accepted as getting near to a convincing explanation.

The best attempt, and one which I find the most convincing, is to address the issue of infinity. It is true, of course, that you will, if you play an infinite number of rounds of the game, win an infinite amount. But what happens in the real finite world? And here is the problem. Because playing to infinity pays an infinite amount, this does not mean that the game in finite time never stops paying out money. The key question in finite time is WHEN does the game turn profitable? The answer depends on the size of the stake per round. If this stake is £2, and you repeat the game over and over again, you are likely to make a lot of money very quickly. As the stake size increases, the number of rounds it takes to turn a profit becomes increasingly longer. Take the example of a stake of £4. In this case, you only make a profit if you throw three heads in a row, which is a 1 in 8 chance. You now need to factor in the losses you made in rounds where you didn’t throw three heads in a row. This extends the number of rounds it will take to turn a profit. So the game is not profitable at any stake size unless we are willing and able to play an infinite number of rounds. It is, in theoretical terms, profitable at any stake size, however large, but it will take forever to guarantee a profit. In a world of finite rounds and time scales, however, winnings generated by the game are easily countervailed by some specified level of stake size.

So what is the optimal stake size for playing the St. Petersburg game?

This depends on how many rounds you are willing to play and how likely you wish to be to come out ahead in that timescale.

This has been modelled empirically, using a computer program to calculate the outcome at different staking levels. What does it show? Well, if you stake a pound a round, you have a better than even chance of being in profit after just three rounds. If you pay £2 a round, the even-money chance of coming out ahead takes rather more rounds – about seven. At £3 a round we are looking at more than 20 rounds, at £4 approaching 100 rounds and £5 more than 300 rounds. By the time we are staking £10 a go, more than 350,000 rounds are needed to give you more than an even chance of being ahead of the game. An approximation that generates the 50-50 point to any staking level is 4 to the power of the stake, divided by 2.9. So what’s a reasonable spend per round to play the game? That depends on the person and the exact configuration of the game. Either way, it’s not that high.

Perhaps the median (the mean of the two middle values of the series), rather than the mean offers a pretty good approximation to the way most people think about this.

Let’s say that in the game as proposed, the game is run 1000 times. In this case, 500 of the values result in tails on the first toss with a return of £1. The next 25% of values result in tails on the second toss with a return of 2. The rest of the values are not then relevant. The 500^{th} value is 1 and the 501^{st} value is 2. The median is the mean of £1 and £2, i.e. £1.50.

Whichever of the two ways proposed here we look at it, the solution is much closer to most people’s intuitive answer than it is to the answer implied by the classic formulation of the St. Petersburg problem.

Reading

Koelman, J. Statistical Physics Attacks St. Petersburg: Paradox Resolved.

Fine, T.A. The Saint Petersburg Paradox is a Lie.

https://medium.com/@thomasafine/the-saint-petersburg-paradox-is-a-lie-62ed49aeca0b

Hayden, B.Y. and Platt, M.L. (2009), The mean, the median, and the St. Petersburg Paradox. Judgment and Decision Making, 4 (4), June, 256-272.

In ‘The Merchant of Venice’, by William Shakespeare, Portia sets her suitors a problem to solve to find who is right for her. In the play, there are just three suitors and they are asked to choose between a gold, a silver and a lead casket, one of which contains a portrait which is the key to her hand in marriage.

Let us base a thought experiment based around Portia’s quest for love in which she meets the successive suitors in turn. Her problem is when to stop looking and start choosing. To make the problem more general, let’s say she has 100 suitors to choose from. Each will be presented to her in random order and she has twenty minutes to decide whether he is the one for her. If she turns someone down there is no going back, but the good news is that she is guaranteed not to be turned down by anyone she selects. If she comes to the end of the line and has still not chosen a partner, she will have to take whomever is left, even if he is the worst of the hundred. All she has to go on in guiding her decision are the relative merits of the pool of suitors.

Let’s say that the first presented to her, whom we shall call No.1, is perfectly charming but she has some doubts. Should she choose him anyway, in case those to follow will be worse? With 99 potential matches left, it seems more than possible that there will be at least one who is a better match than No.1. The problem facing Portia is that she knows that if she dismisses No. 1, he will be gone forever, to be betrothed to someone else.

She decides to move on. The second suitor turns out to be far worse than the first, as does the third and fourth. She starts to think that she may have made a mistake in not accepting the first. Still, there are potentially 96 more to see. This goes on until she sees No. 20, whom she actually prefers to No. 1. Should she now grasp her opportunity before it is too late? Or should she wait for someone even better?

She is looking for the best of the hundred, and this is the best so far. But there are still 80 suitors left, one of whom might be better than No. 20. Should she take a chance? What is Portia’s optimal strategy in finding Mr. Right?

This is an example of an ‘Optimal Stopping Problem’, which has come to be known as the ‘Secretary Problem.’ In this variation, you are interviewing for a secretary, with your aim being to maximise your chance of hiring the single best applicant out of the pool of applicants. Your only criterion to measure suitability is their relative merits, i.e. who is better than whom. As with Portia’s Problem, you can offer the post to any of the applicants at any time before seeing any more candidates, but you lose the opportunity to hire that applicant if you decide to move on to the next in line.

This sort of stopping strategy can be extended to anything including the search for a place to live, a place to eat, the choice of a used car, and so on.

In each of these cases, there are two ways you can fail to meet your goal of finding the best option out there. The first is by stopping too early, and the second is by stopping too late. By stopping too early, you leave the best option out there. By stopping too late, you have waited for a better option that turns out not to exist. So how do you find the right balance?

Let’s consider the intuition. Obviously, the first option is the best yet, and the second option (assuming we are taking the options in a random order) has a 50% chance of being the best yet. Likewise, the tenth option has a 10% chance of being the best to that point. It follows logically that the chance of any given option being the best to that point declines as the number of options there have been before increases. So the chance of coming across the ‘best yet’ becomes more and more infrequent as we go through the process.

To see how we might best approach the problem, let’s go back to Portia and her suitors and look at her best strategy when faced with different-sized pools of suitors. Can she do better using some strategy other than choosing at some random position in the order of presentation to her? It can be shown mathematically that she can certainly expect to do better, given that there are more than two to choose from.

Let’s return to the original play where there are three suitors. If she chooses No. 1, she has no information with which to compare the relative merits of her suitors. On the other hand, by the time she reaches No. 3, she must choose him, even if he’s the worst of the three. In this way, she has maximum information but no choice. In the case of No. 2, she has more information than she did when she saw No. 1, as she can compare the two. She also has more control over her choice than she will if she leaves it until she meets No. 3.

So she turns down No. 1 to give herself more information about the relative merits of those available. But what if she finds that No. 2 is worse than No. 1? What should she do? It can in fact be shown that she should wait and take the risk of ending up with No. 3, as she must do if she leaves it to the last. On the other hand, if she finds that she prefers No. 2 to No. 1, she should chose him on the spot and forego the chance that No. 3 will be a better match.

It can also be shown that in the three-suitor scenario, she will succeed in finding her best available match exactly half the time by selecting No. 2 if he is better than No. 1. If she chooses No. 1 or No. 3, on the other hand, she will only have met that aim one time in three.

If there are four suitors, Portia should use No. 1 to gain information on what she should be measuring her standards against, and select No. 2 if he is a better choice than No. 1. If he is not, do the same with No. 3. If he is still not better than No. 1, go to No. 4 and hope for the best. The same strategy can be applied to any number of people in the pool.

So, in the case of a hundred suitors, how many should she see to gain information before deciding to choose someone? It can, in fact, be demonstrated mathematically that her optimal strategy (‘stopping strategy’) before turning looking into leaping is 37. She should meet with 37 of the suitors, then choose the first of those to come after who is better than the best of the first 37. By following this rule, she will find the best of the princely bunch of a hundred with a probability, strangely enough, of 37 per cent. By choosing randomly, on the other hand, she has a chance of 1 in 100 (1%) of settling upon the best.

This stopping rule of 37% applies to any similar decision, such as the secretary problem or looking for a house in a fast-moving market. It doesn’t matter how many options are on the table. You should always use the first 37% as your baseline, and then select the first of those coming after that is better than any of the first 37 per cent.

The mathematical proof is based on the mathematical constant, e (sometimes known as Euler’s number) and specifically 1/e, which can be shown to be the stopping point along a range from 0 to 1, after which it is optimal to choose the first option that is better than any of those before. The value of e is approximately equal to 2.71828, so 1/e is about 0.36788 or 36.788%. This has simply been rounded up to 37 per cent in explaining the stopping rule. It can also be shown that the chance that implementing this stopping rule will yield the very best outcome is also equal to 1/e, i.e. about 37 per cent.

If there is a chance that your selection might actually become unavailable, the rule can be adapted to give a different stopping rule, but the principle remains. For example, if there is a 50% chance that your selection might turn out to be unavailable, than the 37% rule is converted into a 25% rule. The rest of the strategy remains the same. By doing this, you will have a 25% chance of finding the best of the options, however, compared to a 37% chance if you always get to make the final choice. This is still a lot better than the 1 per cent chance of selecting the best out of a hundred options if you choose randomly. The lower percentage here (25% compared to 37%) reflects the additional variable (your choice might not be final) which adds uncertainty into the mix. There are other variations on the same theme, where it is possible to go back with a given probability that the option you initially passed over is no longer available. Take the case, for example, where an immediate proposal will certainly be accepted but a belated proposal is accepted half of the time. The cut-off proportion in one such scenario rises to 61% as the possibility of going back becomes real.

There is also a rule-of-thumb which can be derived when the aim is to maximise the chance of selecting a good option, if not the very best. This strategy has the advantage of reducing the chance of ending up with one of the worst options. It is the square root rule, which simply replaces the 37% criterion with the square root of the number of options available. In the case of Portia’s choice, she would meet the first ten of the hundred (instead of 37) and choose the first of the remaining 90 who is better than the best of those ten. Whatever variation you adopt, the numbers will change but the principle stays the same.

All this assumes that we are lacking in some objective standard about which we can measure each of our options objectively, without needing to compare which option is better than which. For example, Portia might simply be interested in choosing the richest of the suitors and she knows the distribution of wealth of all potential suitors. This ranges evenly from the bankrupt suitors to those worth 100,000 ducats.

This means that the upper percentile of potential suitors in the whole population are worth upwards of 99,000 ducats. The lowest percentile is worth up to 1,000 ducats. The 50^{th} percentile is worth between 49,000 and 50,000 ducats.

Now Portia is presented with a hundred out of this population of potential suitors, and let’s assume that the suitors presented to her are representative of this population. Say now that the first to be presented to her is worth 99,500 ducats. Since wealth is her only criterion, and he is in the upper percentile in terms of wealth, her optimal decision is to accept his proposal of marriage. It is possible that one of the next 99 is worth more than 99,500 ducats but that isn’t the way to bet.

On the other hand, say that the first suitor is worth 60,000 ducats. Since there are 99 more to come, it is a good bet that at least one of them will be worth more than this. If she has turned down all suitors, however, until she is being presented with the 99^{th}, her optimal decision now is to accept him. In other words, Portia’s decision as to whether to accept the proposal comes down to how many potential matches she has left to see. When down to the last two, she should choose him if he is above the 50^{th} percentile, in this case 50,000 ducats. The more there are to come the higher the percentile of wealth at which she should accept. She can set a higher threshold. She should never accept anyone who is below the average unless she is out of choices. In this version of the stopping problem, the probability that Portia will end up with the wealthiest of the available suitors turns out to be 58 per cent. More information, of course, increases the chance of success. Indeed, any criterion that provides information on where an option is relative to the relevant population as a whole will increase the probability of finding the best choice of those available. As such, it seems that if Portia is only interested in the money, she is more likely to find it than if she is looking for love.

And who did fair Portia choose in the original play? Well, there are no spoilers here. But I can reveal that it was the best of the three.

William of Occam (also spelled William of Ockham) was a 14^{th} century English philosopher. At the heart of Occam’s philosophy is the principle of simplicity, and Occam’s Razor has come to embody the method of eliminating unnecessary hypotheses. Essentially, Occam’s Razor holds that the theory which explains all (or the most) while assuming the least is the most likely to be correct. This is the principle of parsimony – explain more, assume less. Put more elegantly, it is the principle of ‘pluritas non est ponenda sine necessitate’ (plurality must never be posited beyond necessity).

Empirical support for the Razor can be drawn from the principle of ‘overfitting.’ In statistics, ‘overfitting’ occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Critically, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. For example, a complex polynomial function might after the fact be used to pass through each data point, including those generated by noise, but a linear function might be a better fit to the signal in the data. By this we mean that the linear function would predict new and unseen data points better than the polynomial function, although the polynomial which has been devised to capture signal and noise would describe/fit the existing data better.

We can also look at it through the lens of what is known as Solomonoff Induction. Whether a detective trying to solve a crime, a physicist trying to discover a new universal law, or an entrepreneur seeking to interpret some latest sales figures, all are involved in collecting information and trying to infer the underlying causes. The problem of induction is this: We have a set of observations (or data), and want to find the underlying causes of those observations, i.e. to find hypotheses that explain our data. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events. In doing so, we need to create a set of defined steps to arrive at the truth, a so-called algorithm for truth.

Ray Solomonoff’s algorithmic approach takes in data (observations) and outputs the rule by which the data was created. That is, it will give us an explanation of the observations; the causes. Suppose there are many hypotheses that could explain the data. All of the hypotheses are possible but some are more likely than others. How do you weight the various hypotheses? This depends on prior knowledge. But what if you have no prior knowledge of which hypothesis is likely to be better than another. This is where Occam’s Razor comes in. Solomonoff’s theory is one of prediction based on logical observations, such as predicting the next symbol based upon a given series of symbols. The only assumption that the theory makes is that the environment follows some unknown but computable probability distribution. It is as such a mathematical formalisation of Occam’s Razor.

All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Shorter computable theories have more weight when calculating the expected reward to an action across all computable theories which perfectly describe previous observations. At any time, given the limited observation sequence so far, what is the optimal way of selecting the next action? The answer is to use Solomonoff’s method of calculating the prior probabilities in order to predict the probability of each possible future, and to execute the policy which, on a weighted average of all possible futures, maximises the predicted reward up to the horizon. This requires a way, however, of measuring the complexity of a theory.

Here we can turn to methods discovered to digitalise communication into a series of 0s and 1s. These series of bits of binary information can be termed strings, of a given length, say y, for a given language, the length of which will differ depending on the complexity of what is being communicated. This is where the idea of so-called Kolomogorov complexity, K(y), comes in. K(y) is the shortest possible description of string y for a given language. The upper bounds on the Kolomogorov complexity can be simple. Consider, for example, the two 32 character sequences:

abababababababababababababababab

4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first can be written “ab 16 times”. The second probably cannot be simplified further.

Now consider the following inductive problem. A computer program outputs the following sequence of numbers: 1, 3, 5, 7.

What rule gives rise to the number sequence 1,3,5,7? If we know this, it will help us to predict what the next number in the sequence is likely to be, if there is one. Two hypotheses spring instantly to mind. It could be: 2n-1, where n is the step in the sequence. So the third step, for example, gives 2×3-1 = 5. If this is the correct rule generating the observations, the next step in the sequence will be 9 (5×2-1).

But it’s possible that the rule generating the number sequence is: 2n-1 + (n-1)(n-2)(n-3)(n-4). So the third step, for example, gives 2×3-1 + (3-1)(3-2)(3-3)(3-4) = 7. In this case, however, the next step in the sequence will be 33.

But doesn’t the first hypothesis seem more likely? Occam’s Razor is the principle behind this intuition. “Among all hypotheses consistent with the observations, the simplest is the most likely.” This sounds right, but can it be made more precise, and can it be justified? How do we find all consistent hypotheses, and how do we judge their simplicity?

Probability theory is the mathematics of reasoning with uncertainty. The keystone of this subject is Bayes’ Theorem. This tells you how likely something is given some other knowledge. Bayes’ Theorem can tell us how likely a hypothesis is, given evidence (or data, or observations). This is helpful because we want to know which model of the world is correct so that we can successfully predict the future.

It calculates this probability based on the prior probability of the hypothesis alone, the probability of the evidence alone, and the probability of the evidence *given* the hypothesis. It is just a matter of plugging the numbers in, although it is not always easy to identify these. But you can do your best. With enough evidence, it should become clear which hypothesis is correct. But guesses are not well-suited to an exact algorithm, so how can we construct this algorithm? Most situations in real life are complex, so that your “priors” (as used in Bayes’ Theorem) are actually probabilities that have been updated several times with past evidence.

But what would our ideal reasoning computer do before it knew anything? What would the probabilities be set to before we turned it on? How can we determine the probability of a hypothesis before seeing any data?

The answer is Occam’s Razor; simpler hypotheses more likely. But how rigorous is this? It’s usually difficult to find a measure of complexity, even for mathematical hypotheses. Is a normal curve simpler than an exponential curve, for example? Bayesian probability theory doesn’t have anything to say about choosing priors. Thus, the same probability is often assigned to each of the various hypotheses that might explain the observations. Of course this is a good approach if all the hypotheses actually are equally likely. But some hypotheses are more complex than others, and this makes them less likely than the other hypotheses. So when distributing your probability across several hypotheses, you shouldn’t necessarily distribute it evenly.

But we need a method that all can agree provides the correct priors in all situations. This helps us perform induction correctly and instils more honesty into the process. Since priors partly determine what people believe, they can sometimes choose priors that help “prove” what they want to prove, intentionally or unintentionally. To solve the problem of priors once and for all, we’d like to have an acceptable, universal prior distribution, so that there’s no vagueness in the process of induction. We need a recipe, an algorithm, for selecting our priors. For that we turn to the subject of binary sequences.

So if this is all the information we have, we have two different hypotheses about the rule generating the data. How do we decide which is more likely to be true? In general, when we have more than one hypothesis, each of which could be true, how can we decide which one actually is true? To start, is there a language in which we can express all problems, all data, all hypotheses? Let’s look at binary data. This is the name for representing information using only the characters ‘0’ and ‘1’. In a sense, binary is the simplest possible alphabet. With these two characters we can encode information. Each 0 or 1 in a binary sequence (e. g. 01001011) can be considered the answer to a yes-or-no question. And in principle, all information can be represented in binary sequences. Indeed, being able to do everything in the language of binary sequences simplifies things greatly, and gives us great power. We can treat everything contained in the data in the same way. Now, which of the three is more likely to be the true hypothesis that generated the data in the first place? How do we decide what the probability is of each of these hypotheses being true?

Now that we have a simple way to deal with all types of data, we need to look at the hypotheses, in particular how to assign prior probabilities to the hypotheses. When we encounter new data, we can then use Bayes’ Theorem to update these probabilities. To be complete, to guarantee we find the real explanation for our data, we have to consider all possible hypotheses. But how could we ever find all possible explanations for our data? By using the language of binary, we can do so.

Here we look to the concept of Solomonoff induction, in which the assumption we make about our data is that it was generated by some algorithm, i.e. the hypothesis that explains the data is an algorithm. Now we can find all the hypotheses that would predict the data we have observed. Given our data, we find potential hypotheses to explain it by running every hypothesis, one at a time. If the output matches our data, we keep it. Otherwise, we discard it. We now have a methodology, at least in theory, to examine the whole list of hypotheses that might be the true cause behind our observations.

Since they are algorithms, these hypotheses look like binary sequences. For example, the first few might be 01001101, 0011010110000110100100110, and 100011111011111110001110100101000001.

That is, for each of these three, when you give them as input, the output is our data. But which of the three is more likely to be the true hypothesis that generated the data in the first place? The first thing is to imagine that the true algorithm is produced in an unbiased way, by tossing a coin. For each bit of the hypothesis, we toss a coin. Heads will be 0, and tails will be 1. In the previous example, 01001101, the coin landed heads, tails, heads, tails and so on. Because each toss of the coin has a 50% probability, each bit contributes ½ to the final probability. Therefore, an algorithm that is one bit longer is half as likely to be the true algorithm. This intuitively fits with Occam’s Razor: a hypothesis that is 8 bits long is much more likely than a hypothesis that is 34 bits long. Why bother with extra bits? We’d need evidence to show that they were necessary. So why not take the shortest hypothesis and call that the truth? Because all of the hypotheses predict the data we have so far, and in the future we might get data to rule out the shortest one. The more data we get, the easier it is likely to become to pare down the number of competing hypotheses which fit the data.

With the data we have, we keep all consistent hypotheses, but weight the shorter ones with higher probability. So in our eight-bit example, the probability of 01001101 being the true algorithm is 1/256, although this isn’t a probability in the normal sense, since the sum of the probabilities have not been normalised to add to one. But these probabilities can still be used to compare how likely different hypotheses are. Solomonoff induction is the process that describes the scientific method made into an algorithm.

To summarize, Solomonoff induction works by starting with all possible hypotheses (sequences) as represented by computer programs (that generate those sequences), weighted by their simplicity (2^{–n}, where **n** is the program length), and discarding those hypotheses that are inconsistent with the data. Weighting hypotheses by simplicity, the system automatically incorporates a form of Occam’s Razor.

Turning now to ‘ad hoc’ hypotheses and the Razor. In science and philosophy, an ‘ad hoc hypothesis’ is a hypothesis added to a theory in order to save it from being falsified. Ad hoc hypothesising is compensating for anomalies not anticipated by the theory in its unmodified form. For example, you say that there is a leprechaun in your garden shed. A visitor to the shed sees no leprechaun. This is because he is invisible, you say. He spreads flour on the ground to see the footprints. He floats, you declare. He wants you to ask him to speak. He has no voice, you say. More generally, for each accepted explanation of a phenomenon, there is generally an infinite number of possible, more complex alternatives. Each true explanation may therefore have had many alternatives that were simpler and false, but also approaching an infinite number of alternatives that are more complex and false.

This leads us the idea of what I term ‘Occam’s Leprechaun.’ Any new and more complex theory can always be possibly true. For example, if an individual claims that leprechauns were responsible for breaking a vase that he is suspected of breaking, the simpler explanation is that he is not telling the truth, but ongoing ad hoc explanations (e.g. “That’s not me on the CCTV, it’s a leprechaun disguised as me) prevent outright falsification. An endless supply of elaborate competing explanations, called ‘saving hypotheses’, prevent ultimate falsification of the leprechaun hypothesis, but appeal to Occam’s Razor helps steer us toward the probable truth. Another way of looking at this is that simpler theories are more easily falsifiable, and hence possess more empirical content.

All assumptions introduce possibilities for error; if an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong. It can also be looked at this way. The prior probability that a theory based on n+1 assumptions is true must be less than a theory based on n assumptions, unless the additional assumption is a consequence of the previous assumptions. For example, the prior probability that Jack is a train driver must be less than the prior probability that Jack is a train driver AND that he owns a Mini Cooper, unless all train drivers own Mini Coopers, in which case the prior probabilities are identical.

Again, the prior probability that Jack is a train driver and a Mini Cooper owner and a ballet dancer is less than the prior probability that he is just the first two, unless all train drivers are not only Mini Cooper owners but also ballet dancers. In the latter case, the prior probabilities of the n and n+1 assumptions are the same.

From Bayes’ Theorem, we know that reducing the prior probability will reduce the posterior probability, i.e. the probability that a proposition is true after new evidence arises. Science prefers the simplest explanation that is consistent with the data available at a given time, but even so the simplest explanation may be ruled out as new data become available. This does not invalidate the Razor, which does not state that simpler theories are necessarily more true than more complex theories, but that when more than one theory explains the same data, the simpler should be accorded more probabilistic weight. The theory which explains all (or the most) and assumes the least is most likely. So Occam’s Razor advises us to keep explanations simple. But it is also consistent with multiplying entities necessary to explain a phenomenon. A simpler explanation which fails to explain as much as another more complex explanation is not necessarily the better one. So if leprechauns don’t explain anything they cannot be used as proxies for something else which can explain something.

More generally, we can now unify Epicurus and Occam. From Epicurus’ Principle we need to keep open all hypotheses consistent with the known evidence which are true with a probability of more than zero. From Occam’s Razor we prefer from among all hypotheses that are consistent with the known evidence, the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher ‘a priori’ probability, and more complex ones lower probability.

From here we can move to the wider problem of induction about the unknown by extrapolating a pattern from the known. Specifically, the problem of induction is how we can justify inductive inference. According to Hume’s ‘Enquiry Concerning Human Understanding’ (1748), if we justify induction on the basis that it has worked in the past, then we have to use induction to justify why it will continue to work in the future. This is circular reasoning. This is faulty theory. “Induction is just a mental habit, and necessity is something in the mind and not in the events.” Yet in practice we cannot help but rely on induction. We are working from the idea that it works in practice if not in theory – so far. Induction is thus related to an assumption about the uniformity of nature. Of course, induction can be turned into deduction by adding principles about the world (such as ‘the future resembles the past’, or ‘space-time is homogeneous.’) We can also assign to inductive generalisations probabilities that increase as the generalisations are supported by more and more independent events. This is the Bayesian approach, and it is a response to the perspective pioneered by Karl Popper. From the Popperian perspective, a single observational event may prove hypotheses wrong, but no finite sequence of events can verify them correct. Induction is from this perspective theoretically unjustifiable and becomes in practice the choice of the simplest generalisation that resists falsification. The simpler a hypothesis, the easier it is to be falsified. Induction and falsifiability are in practice, from this viewpoint, as good as it gets in science. Take an inductive inference problem where there is some observed data and a set of hypotheses, one of which may be the true hypothesis generating the data. The task then is to decide which hypothesis, or hypotheses, are the most likely to be responsible for the observations.

A better way of looking at this seems to be to abandon certainties and think probabilistically. Entropy is the tendency of isolated systems to move toward disorder and a quantification of that disorder, e.g. assembling a deck of cards in a defined order requires introducing some energy to the system. If you drop the deck, they become disorganised and won’t re-organise themselves automatically. This is the tendency in all systems to disorder. This is the Second Law of Thermodynamics, which implies that time is asymmetrical with respect to the amount of order: as the system, advances through time, it will statistically become more disordered. By ‘Order’ and ‘Disorder’ we mean how compressed the information is that is describing the system. So if all your papers are in one neat pile, then the description is “All paper in one neat pile.” If you drop them, the description becomes ‘One paper to the right, another to the left, one above, one below, etc. etc.” The longer the description, the higher the entropy. According to Occam’s Razor, we want a theory with low entropy, i.e. low disorder, high simplicity. The lower the entropy, the more likely it is that the theory is the true explanation of the data, and hence that theory should be assigned a higher probability.

More generally, whatever theory we develop, say to explain the origin of the universe, or consciousness, or non-material morality, must itself be based on some theory, which is based on some other theory, and so on. At some point we need to rely on some statement which is true but not provable, and so we think may be false, although it is actually true. We can never solve the ultimate problem of induction, but Occam’s Razor combined with Epicurus, Bayes and Popper is as good as it gets if we accept that. So Epicurus, Occam, Bayes and Popper help us pose the right questions, and help us to establish a good framework for thinking about the answers.

At least that applies to the realm of established scientific enquiry and the pursuit of scientific truth. How far it can properly be extended beyond that is a subject of intense and continuing debate.

Bayes’ theorem concerns how we should update our beliefs about the world when we encounter new evidence. The original presentation of Rev. Thomas Bayes’ work, ‘An Essay toward Solving a Problem in the Doctrine of Chances’, was given in 1763, after Bayes’ death, to the Royal Society, by Richard Price. In framing Bayes’ work, Price gave the example of a person who emerges into the world and sees the sun rise for the first time. At first, he does not know whether this is typical or unusual, or even a one-off event. However, each day that he sees the sun rise again, his confidence increases that it is a permanent feature of nature. Gradually, through this purely statistical form of inference, the probability he assigns to his prediction that the sun will rise again tomorrow approaches (although never exactly reaches) 100 per cent. The Bayesian viewpoint is that we learn about the universe and everything in it through approximation, getting closer to the truth as we gather more evidence, and thus rationality is regarded as a probabilistic matter. As such, Bayes applies and formalises the laws of probability to the science of reason, to the issue of cause and effect.

In its most basic form, Bayes’ Theorem is simply an algebraic expression with three known variables and one unknown. It is true by construction. But this simple formula can lead to important predictive insights. So Bayes’ Theorem is concerned with conditional probability. That is, it tells us the probability that a theory or hypothesis is true if some new information comes to light, based on the probability we attach to it being true before the new information is known, updated in light of the new information.

Presented most simply, it looks like this:

Probability that a hypothesis is true given some new evidence (‘Posterior Probability’) =

**xy/[xy+z(1-x)], **where:

x is the prior probability of the hypothesis being true (the probability we attach before the new evidence arises)

y is the probability that the new evidence would arise if the hypothesis is true

z is probability the new evidence would arise if the hypothesis is not true

1-x is the prior probability that the hypothesis is not true

Using more traditional notation,

P(H) = probability the hypothesis is true (x)

P(E) = probability of the evidence

P(H’) = probability the hypothesis is not true

P(EIH) = probability of the evidence given that the hypothesis is true (y)

P(EIH’) = probability of the evidence given that the hypothesis is not true (z)

P(HIE) = posterior (updated) probability (PP)

The equation is easily derived.

The entry point is this equation, both sides of which are equal to the probability of the evidence and hypothesis taken together.

P(HIE).P(E) = P(H).P(EIH)

To derive Bayes’ Theorem, divide through by P (E).

P (HIE) = P (H).P(EIH) / P(E) … Bayes’ Theorem

P (E) = P (H). P (EIH) + P (EIH’).P (H’)

So P (HIE) = P (H).P (EIH) / [P (H). P (EIH) + P (EIH’).P(H’)]

*Take a simple card example.*

There are 52 cards in the deck. 26 are black, 26 are red. One of the cards is the Ace of Spades.

Hypothesis: A selected card is the Ace of Spades.

Evidence: The card is revealed to be black.

So P (HIE) = 1/26 (there are 26 black cards, of which one is the Ace of Spades).

P (E) = ½ (1/2 of the cards are black); P(H) = 1/52 (there are 52 cards of which one is the Ace of Spades)

P (EIH) = 1 (if it’s the Ace of Spades, it must be black).

So P (HIE). P (E) = 1/52

And P (H). P (EIH) = 1/52

As expected, and as always, P (HIE). P (E) = P (EIH).P(H). This is the fundamental expression from which Bayes’ Theorem is easily derived, as above by dividing both sides by P (E).

Thus, Bayes’ Theorem states that: P (HIE) = P (EIH).P (H) / P(E)

As before, P (E) = P (H). P (EIH) + P (EIH’).P (H’), i.e. Probability of the evidence = Probability of the evidence if the hypothesis is true times the probability the hypothesis is true PLUS probability of the evidence if the hypothesis is not true times the probability the hypothesis is not true.

So, P (HIE) = P (H).P (EIH) / [P (H).P(EIH) + P(EIH’).P(H’)] – Longer expression of Bayes’ Theorem

OR, P (HIE) = xy / [xy + z (1-x)] – Bayes’ Theorem.

*Does P (HIE) = P (EIH)?*

Is the probability that a selected card is the Ace of Spades given the evidence (it is a black card) equal to the probability it is a black card given that it is the Ace of Spades?

In this example, P(HIE) = 1/26, there is one Ace of Spades out of 26 black cards.

P(EIH) = 1, since the probability it is a black card if it is the Ace of Spades is certain.

So, P(HIE) is not equal to P(EIH).

To claim they are equal is more generally known as the Prosecutor’s Fallacy (also known as the Inverse Fallacy).

*What is the chance that a card in a four-card deck is the Ace of Spades?*

Four cards in the deck. Ace of Spades, Ace of Clubs, Ace of Diamonds, Ace of Hearts

Hypothesis: The selected card is the Ace of Spades.

Prior probability of Ace of Spades (AS) = ¼

What is the posterior probability it is Ace of Spades given evidence that the card is black?

P(HIE) = P(H).P(EIH)/P(E) = ¼.1 / (1/2) = ½

PP = xy/[xy+z(1-x)] = ¼.1/[1/4 + 1/3(3/4)] = ¼ / ½ = ½

NB z = P(EIH’). This is the probability of a black card if the card is not the Ace of Spades. There are three other cards, only one of which is black, so z = 1/3.

So either formula generates the same correct answer, that the posterior probability that the hypothesis is true (the card is the Ace of Spades given that it is black) is ½.

*Dice Example *

Two dice are thrown. The hypothesis is that two sixes will be thrown. The new evidence is that a six is thrown on the first one.

P (H) = x = 1/36

P (EIH) = y = 1 (for a double six, a six must be thrown on the first one).

P (E) = 1/6 (there is a 1 in 6 chance of throwing a six on the first die)

P (HIE) = posterior probability = P(EIH). P(H) /P(E) = 1/36 / 1/6 = 1/6 (there is a 1 in 6 chance of a double six if the first die lands on a six).

Note : P(H).P(EIH) = P(E).P(HIE) = 1/36

Note also: P(E) = P(H).P(EIH) + P(H’).P(EIH’) = 1/36 + 35/36 . 5/35 = 1/36 + 5/36 = 1/6

Similarly, PP = xy/[xy+z(1-x)] = 1/6

Note: z = P(EIH’) = 5/35 because if not a 6,6 (H’), 35 options left and chance of a single six occurs in 5 of them, i.e. 6,1; 6,2; 6,3; 6,4; 6,5

*Does P(HIE) = P(EIH)?*

Is the probability of obtaining six on two dice, if the first comes up six, the same as the probability of the first coming up six if both come up six?

In this example, P (HIE) = 1/6, which is the chance the second die will come up six if the first does.

P (EIH) = 1, since the first die must come up six if both dice are to come up six.

So, P (HIE) is not equal to P (EIH), highlighting again the classic Prosecutor’s Fallacy.

The key contributions of Bayesian analysis to our understanding of the world are fivefold

- It clearly shows that P (HIE) is not the same thing as P (EIH). The conflation of these two expressions is known as the Prosecutor’s Fallacy and has been sufficient in itself to cause countless miscarriages of justice and to reach erroneous conclusions more generally about the likelihood that a hypothesis is true in the context of observed evidence.
- So what is P (HIE) equal to?
- P (HIE) = P (EIH). P (H) /P (E), where P (E) = P (H). P (EIH) + P (H’). P(EIH’).
- Bayes’ Theorem makes clear the importance not just of new evidence but also the (prior) probability that the hypothesis was true before the new evidence was observed. This prior probability is generally given far too little weight compared to the new evidence in common intuition about probability. Bayes’ Theorem makes it explicit.
- Bayes’ Theorem allows us a way to calculate the updated probability as accurately as allowed by our assessment of the prior probability of the hypothesis being true and the probability of the evidence arising given the hypothesis being true and being false.

In all these ways, Bayes’ Theorem replaces often faulty intuition and logic with a rigorous application of conditional probability theory, to give us as accurate as possible a representation of how probable a hypothesis is to be true given the available evidence at any given point in time.

What is the fair division of stakes in a game which is interrupted before its conclusion? This was the problem posed in 1654 by French gambler Chevalier de Mere to philosopher and mathematician, Blaise Pascal, of Pascal’s Triangle and Pascal’s Wager fame, who shared it in now-famous correspondence with mathematician Pierre de Fermat (best known these days perhaps for Fermat’s Last Theorem). It has come to be known as the Problem of Points.

The question had first been formally addressed by Franciscan Friar and Leonardo Da Vinci collaborator, Luca Bartolomeo de Pacioli, father of the double-entry system of bookkeeping. Pacioli’s method was to divide the stakes in proportion to the number of rounds won by each player to that point. There is an obvious problem with this method, however. What happens, for example, if only a single round of many has been played. Should the entire pot be allocated to the winner of that single round? In the mid-1500s, Venetian mathematician and founder of the theory of ballistics, Niccolo Fontana Tartaglia, proposed basing the division on the ratio between the size of the lead and the length of the game. But this method is not without its own problems. For example, it would split the stakes in the same proportion whether one player was ahead by 40-30 or by 99-89, although the latter situation is hugely more advantageous than the former.

The solution adopted by Pascal and Fermat based the division of the stakes not on the history of the interrupted game to that point as on the possible ways the game might have continued were it not interrupted. In this method, a player leading by 6-4 in a game to 10 would have the same chance of winning as a player leading by 16-14 in a game to 20, so that an interruption at either point should lead to the same division of stakes. As such, what is important in the Pascal-Fermat solution is not the number of rounds each player has yet won but the number of rounds each player still needs to win.

Take another example. Suppose that two players agree to play a game of coin-tossing repeatedly to won £32, and the winner is the first player to win four times.

If the game is interrupted when one of the players is ahead by two games to one, how should the £32 be divided fairly between the players?

In Fermat’s method, imagine playing another four games. Outcomes of each coin-tossing game are equally likely and are P (won by Player 1) and Q (won by Player 2).

The possible outcomes of the next four games are as follows:

**PPPP; PPPQ; PPQP; PPQQ**

**PQPP; PQPQ; PQQP**; PQQQ

**PQQQ; PQQP; PQPQ**; PQPP

**PPQQ**; PPQP; PQQP; QQQQ

The probability that Player 1 would have won is 11/16 (in bold) = 68.75%.

The probability that Player 2 would have won is 5/16 = 31.25%.

Pascal simplified the method by devising a principle of smaller steps, which dispenses with the need to consider possible steps after the game had already been won. In doing so, he was able to devise a relatively simple formula which would solve all possible Problems of Points, without needing to go beyond the point at which the game resolves in favour of one or other of the players, based on Pascal’s Triangle, demonstrated below.

1 = 1

1 1 = 2

1 2 1 = 4

1 3 3 1 = 8

**1 4 6** * 4 1 = *16

1 5 10 10 5 1 = 32

1 6 15 20 15 6 1 = 64

1 7 21 35 35 21 7 1 = 128

Each of the numbers in Pascal’s Triangle is the sum of the adjacent numbers immediately above it.

If the game is interrupted, as above, 2-1, after three games, in a first to four match, the resolution is 1+4+6 / 16 to 4+1 / 16, i.e. 11/16 to Player 1 and 5/16 to Player 2.

More generally, Pascal’s method establishes the modern method of expected value when reasoning about probability.

To show this, consider the probability that Player 1 would win if leading 3-2 in a game in which the first player to win four games is the outright winner. If Player 1 wins the next coin toss, he goes ahead 4-2 and wins outright (value = 1). There is a 50% chance of this. There is a 50% chance of player 2 winning the coin toss, however, in which case the game is level (3-3). If the game is level, there is a 50% chance of player 1 winning (and a 50% chance of player 2 winning).

So the expected chance of player 1 winning when leading 3-2 = 50% x 1 + 50% x 50% = 50% + 25% = 75%. Expected chance of player 2 winning = 25%.

Now consider the probability that Player 1 would win if leading 3-1 in a game in which the first player to win four games is the outright winner. If Player 1 wins the next coin toss, he goes ahead 4-1 and wins outright (value = 1). There is a 50% chance of this. There is a 50% chance of player 2 winning the coin toss, however, in which case the game goes to 3-2. We know that the expected chance of player 1 winning if ahead 3-2 is 75% (derived above).

So the expected chance of player 1 winning when leading 3-1 = 50% x 1 + 50% x 75% = 50% + 25% = 87.5%. Expected chance of player 2 winning = 12.5%.

Now consider the question that we solved using Fermat’s method, i.e. the probability that Player 1 would win if leading 2-1 in a game in which the first player to win four games is the outright winner. If Player 1 wins the next coin toss, he goes ahead 3-1, and has an expected chance of winning of 87.5%. and wins outright (value = 1). There is a 50% chance of this. There is a 50% chance of player 2 winning the coin toss, however, in which case the game goes to 2-2. We know that the expected chance of player 1 winning if tied is 50%. There is a 50% chance of this.

So the expected chance of player 1 winning when leading 2-1 = 50% x 87.5% + 50% x 50% = 43.75% + 25% = 68.75% (i.e. 11/16). Expected chance of player 2 winning = 31.25% (i.e. 5/16).

So both Fermat’s method and Pascal’s method yield the same solution, by different routes, and will always do so in determining the correct division of stakes in an interrupted game of this nature.

Reference:

Can you solve the problem that inspired probability theory? (Problem of the Points). https://m.youtube.com/watch?v=C_nV3cVNjog&feature=youtu.be

Is the line next to you at the check-in at the airport or the check-out at the supermarket really always quicker than the one you are in? Is the traffic in the neighbouring lane always moving a bit more quickly than your lane? Or does it just seem that way?

One explanation is to appeal to basic human psychology. For example, is it an illusion caused by us being more likely to glance over at the neighbouring lane when we are progressing forward slowly than quickly? Is it a consequence of the fact that we tend to look forwards rather than backwards, so vehicles that are overtaken become forgotten very quickly, whereas those that remain in front continue to torment us? Do we take more notice, or remember for longer the times we are passed than when we pass others? If this is the complete explanation, it seems we should passively accept our lot. On the other hand, perhaps we really are more often than not in the slower lane. If so, there may be a reason. Let me explain using an example.

How big is the smallest fish in the pond? You catch sixty fish, all of which are more than six inches long. Does this evidence add support to a hypothesis that all the fish in the pond are longer than six inches? Only if your net is able to catch fish smaller than six inches. What if the holes in the net allow smaller fish to pass through? This may be described as a selection effect, or an observation bias.

Apply the same principle to your place in the line or the lane.

To understand the effect in this context we need to ask, ‘For a randomly selected person, are the people or vehicles in the next line or lane actually moving faster?’

Well, one obvious reason why we might be in a slower lane is that there are more vehicles in it than in the neighbouring lane. This means that more of our time is spent in the slower lane. In particular, cars travelling at greater speeds are normally more spread out than slower cars, so that over a given stretch of road there are likely to be more cars in the slower lane, which means that more of the average driver’s time is spent in the slower lane or lanes. This is known as an observer selection effect, a key idea in the theory of which is that observers should reason as if they were a random sample from the set of all observers. In other words, when making observations of the speed of cars in the next lane, or the progress of the neighbouring line to the cashier, it is important to consider yourself as a random observer, and think about the implications of this for your observation.

To put it another way, if you are in a line and think of your present observation as a random sample from all the observations made by all relevant observers, then the probability is that your observation will be made from the perspective that most drivers have, which is the viewpoint of the slower moving queue, as that is where more observers are likely to be. It is because most observers are in the slower lane, therefore, that a typical or randomly selected driver will not only seem to be in the slower lane but actually will be in the slower lane. Let’s put it this way. If there are 20 in the slower lane and 10 in the equivalent section of the fast lane, there is a 2/3 chance that you are in the slow lane.

So the next time you think that the other lane is faster, be aware that it very probably is.

Reference

Nick Bostrom, Anthropic Bias: Observation Selection Effects in Science and Philosophy, Chapter 1, Routledge, 2002. http://www.anthropic-principle.com/?q=book/chapter_1