Working In Uncertainty
First published 28 July 2003.
Practical ideas on the open questions of probability
The mathematics of probability has been developed over hundreds of years to extraordinary heights of advanced modeling.
But, despite this, some simple, fundamental questions remain unanswered and in dispute. ‘What is probability?’ has not been settled. A related question that also generates heated argument is ‘Does randomness really exist?’. And of course there's ‘What is randomness anyway?’ It is still unclear how to distinguish useless guesswork from well informed judgments, which raises the question ‘What is an accurate probability?’
These questions do not seem to cause much trouble in the problems that gave birth to probability theory and still provide its most common illustrations.
Unfortunately, the real world in which we make decisions about our careers, health, relationships, investments, and so on is not a world of vigorously flipped coins, fair dice, and large urns from which coloured balls are drawn by people wearing blindfolds.
In our real world these apparently philosophical loose ends of probability have very practical implications and we need to do things that are not mentioned in the average probability textbook.
What is probability?
There have been two main schools of thought on this question, which I will explain before taking the analysis a bit further than usual and moving on to practical implications.
The first mathematicians to think about probability saw it as subjective. A probability is the extent to which a person believes a proposition (i.e. a statement that can be true or false), based on the evidence they have. The proposition could be something about the past or present of which the thinker is unsure (e.g. ‘The Loch Ness monster exists.’) or something about the future (e.g. ‘England will win the World Cup next time.’)
In this approach the probability is a measure of certainty, is held by someone, and depends on evidence.
The mathematicians appear to have shown that if your beliefs are uncertain then probability theory is the only logical way to reason about them.
However, after a while people became uncomfortable with the apparent subjectivity of the certainty concept of probability. How could you tell if a probability was accurate without a more objective definition? A more objective concept was proposed in terms of ‘relative frequencies’. Repeat an ‘experiment’ lots of times and record the outcomes. The relative frequency of each outcome over these many repetitions gradually approaches the probability. For example, flip a fair coin many times and the proportion of heads and tails will get closer and closer to 50:50 as you record the results of more flips.
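The coin example can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the coin is simulated with a pseudo-random number generator, and the function name, the seed, and the number of flips are all my own choices for illustration.

```python
import random

def running_frequency(n_flips, p_heads=0.5, seed=0):
    """Return the proportion of heads seen after each flip of a simulated coin."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    heads = 0
    proportions = []
    for i in range(1, n_flips + 1):
        if rng.random() < p_heads:     # this flip came up heads
            heads += 1
        proportions.append(heads / i)  # relative frequency so far
    return proportions

props = running_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(props[n - 1], 4))
```

The early proportions can wander a long way from 0.5, while the later ones should sit much closer to it, which is the behaviour the relative frequency concept relies on.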
The two concepts of probability are often used without making it clear which is intended – and often without having thought about the issue at all. This can be confusing, and even lead to errors. However, if you are careful about language it is possible to use the certainty concept alone, or a combination of the certainty concept and a form of relative frequencies. It is not right to use relative frequencies alone, as I will explain.
Here's an example to show how the two concepts are combined in typical probabilistic modelling. Imagine you are given a very odd looking coin by someone who looks untrustworthy. For reasons we won't go into, your task is to decide the probability of getting heads if you flip it vigorously. You are allowed as many trials as you like but must state your view before each trial. What is your initial view and how should you update it depending on the results of trial flips?
The coin is not necessarily a fair coin, but whatever its tendencies they are fixed and should gradually reveal themselves if the coin is tossed many times. It is natural to think of this tendency as a relative frequency of heads within the experiment of tossing the coin. However, you don't know what that relative frequency is. There is a probability of heads, in the relative-frequencies sense, but you are not certain what it is.
One good way to analyse this is to start with a probability density function for the proportion of heads that would come from many tosses of the coin. This captures your initial views about the fairness of the coin (i.e. about the relative frequency model that should be used). Each time you flip the coin and notice the result you can update that distribution. Gradually your probability density function will change from being very spread out to being very pointed and located around the proportion of heads found in your trials.
(This analysis is shown beautifully by a series of graphs in ‘Data analysis: a Bayesian tutorial’ by D S Sivia. Sivia even shows the effect of different initial beliefs.)
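For readers who would like to see the updating in action, here is a minimal sketch of the odd coin analysis. It assumes a flat initial view over the proportion of heads, approximates the probability density function on a grid of 201 candidate proportions, and uses a made-up sequence of 50 flips. None of these choices comes from Sivia's book; they are simply convenient for illustration.

```python
def update(belief, grid, outcome):
    """Apply Bayes' rule for one flip; outcome is 'H' or 'T'."""
    likelihood = [h if outcome == "H" else 1.0 - h for h in grid]
    unnormalised = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def mean_and_sd(belief, grid):
    """Centre and spread of the current belief about the proportion of heads."""
    mean = sum(b * h for b, h in zip(belief, grid))
    variance = sum(b * (h - mean) ** 2 for b, h in zip(belief, grid))
    return mean, variance ** 0.5

grid = [i / 200 for i in range(201)]    # candidate proportions of heads
belief = [1 / len(grid)] * len(grid)    # flat initial view (an assumption)
flips = "HHTHHHTHHH" * 5                # hypothetical data: 40 heads, 10 tails

start = mean_and_sd(belief, grid)
for outcome in flips:
    belief = update(belief, grid, outcome)
end = mean_and_sd(belief, grid)

print("before the trials: mean %.3f, spread %.3f" % start)
print("after the trials:  mean %.3f, spread %.3f" % end)
```

Starting instead from a belief concentrated near 0.5 (a strong initial view that the coin is fair) would slow the movement towards the observed proportion, which is the kind of effect of different initial beliefs that Sivia illustrates.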
Constructing models using relative frequencies of outcomes within defined experiments exploits the characteristic consistency of how the world works from one day to the next, though of course there are many ways it can go wrong. Such models have been elaborated far beyond single situations by:
Forgetting about your uncertainty over the right model, or failing to reason about it rationally, is an error.
Imagine a man who likes gambling is waiting at an airport for a delayed flight. He sees the two people next to him playing a very simple betting game with a pair of dice. He soon gets into conversation with them, which leads to a game, and soon they are betting for money. They explain their rules and our gambler thinks frantically about how to bet. The dice are the usual 6 sided kind so obviously the probability (in the relative frequency sense) of any one number coming up for a single die is 1/6. But is it? Our gambler has ignored any possible uncertainty about whether the dice used are fair or loaded. He loses money to the two tricksters.
Uncertainty over whether the model is correct in its structure and parameters can be captured and computed using Bayesian Model Averaging. Every variation in model that we think is possible is applied and the results weighted by the probability (i.e. certainty) that each variation is the correct one. Bayesian Model Averaging makes better predictions and comes up with more accurate (useful) probabilities.
I suspect one should also consider uncertainty about whether the model should be applied to a future situation and uncertainty over starting conditions (for models that try to project forward in time from a given starting state).
New evidence about the model can be assimilated using Bayes' rule.
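To make this concrete, here is a small sketch applied to the gambler's dice. It is a toy version of the idea, not the full machinery of Bayesian Model Averaging, and every number in it is an assumption: two candidate models only (a fair die and a die loaded so that six comes up one time in three), an initial certainty of 0.9 that the die is fair, and an observation of 8 sixes in 20 rolls.

```python
from math import comb

def posterior_weights(priors, p_six_by_model, sixes, rolls):
    """Bayes' rule: update the certainty attached to each candidate model."""
    unnormalised = {}
    for name, p in p_six_by_model.items():
        likelihood = comb(rolls, sixes) * p ** sixes * (1 - p) ** (rolls - sixes)
        unnormalised[name] = priors[name] * likelihood
    total = sum(unnormalised.values())
    return {name: w / total for name, w in unnormalised.items()}

def averaged_p_six(weights, p_six_by_model):
    """Probability of a six on the next roll, averaged over the models."""
    return sum(weights[name] * p_six_by_model[name] for name in weights)

p_six = {"fair": 1 / 6, "loaded": 1 / 3}   # hypothetical candidate models
prior = {"fair": 0.9, "loaded": 0.1}       # hypothetical initial certainty

before = averaged_p_six(prior, p_six)
post = posterior_weights(prior, p_six, sixes=8, rolls=20)
after = averaged_p_six(post, p_six)

print("certainty in each model after the rolls:", post)
print("probability of a six, before and after:", round(before, 3), round(after, 3))
```

Even before any rolls are seen, the averaged probability of a six is a little above 1/6 because some weight is given to the possibility of a loaded die. That is exactly the allowance our gambler failed to make.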
What is an accurate probability?
Defining probabilities in terms of relative frequencies (the ‘Frequentist’ approach) seemed to provide an objective basis for probability. It did not matter who was doing the thinking, or what evidence they had, and it only referred to things that happened in the future.
Unfortunately, it doesn't work.
The usual objection is that there are many unrepeatable situations for which we would like to think about the probability of different outcomes, and this is about as far as most textbooks take it.
The problem of ‘unrepeatable’ experiments is not a fatal objection. Sensible versions of the frequencies definition say that probabilities are what you would get if you repeated an experiment enough times, while accepting that no experiment will ever actually be repeated infinitely many times, which is the number that would be enough.
Also, the frequencies definition is not saying that you need to repeat the experiment with all potentially relevant conditions replicated. The entire previous history of the universe might be relevant so no experiment would be repeatable if that were the intention. The experiment has to be defined by criteria that specify what is held constant for each repetition. Consequently, ‘one off’ situations do not exist because you can always think of an experiment whose criteria are sufficiently loose to allow repetitions.
But here lies the problem with frequencies as a definition of objective probabilities. The relative frequency depends on the definition of the experiment, and for any situation, past or future, you could define any number of different experiments that the situation could be an example of.
Even for coin tossing, a situation for which there seems only one logical ‘experiment’, a future coin toss could be an example of many different experiments such as: ‘coin tossed vigorously’, ‘florin tossed vigorously’, ‘coin tossed on earth’ (as opposed to out in space where the lack of strong gravity might make a difference to the outcomes), ‘coin tossed in air’ (as opposed to under water, say), coin tossed by a right handed person, coin tossed on a Tuesday, random event with two outcomes, throwing something in the air to see how it lands, and so on.
If a Frequentist is asked to state a probability for an outcome of some future situation the probability they give should depend on what experiment they choose. Hence it is not entirely objective. The Frequentist might make a poor choice or a good one.
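The dependence on the chosen experiment is easy to demonstrate. The sketch below uses twelve invented records of past coin tosses (the coins, days, hands, and outcomes are all made up) and computes the relative frequency of heads under several different definitions of the experiment.

```python
# Invented records of past tosses: (coin, day, hand of the tosser, outcome).
RAW = [
    ("florin", "Tue", "right", "H"), ("florin", "Tue", "right", "H"),
    ("florin", "Mon", "left", "T"),  ("florin", "Wed", "right", "H"),
    ("penny", "Tue", "left", "T"),   ("penny", "Tue", "right", "T"),
    ("penny", "Mon", "right", "H"),  ("penny", "Wed", "left", "T"),
    ("florin", "Tue", "left", "H"),  ("penny", "Thu", "right", "T"),
    ("florin", "Thu", "right", "T"), ("penny", "Mon", "left", "H"),
]
records = [dict(zip(("coin", "day", "hand", "outcome"), row)) for row in RAW]

def relative_frequency(records, **criteria):
    """Proportion of heads among past tosses that fit the experiment's criteria.

    Returns (frequency, number of matching records); frequency is None
    when no past toss matches.
    """
    matching = [r for r in records
                if all(r[key] == value for key, value in criteria.items())]
    if not matching:
        return None, 0
    heads = sum(1 for r in matching if r["outcome"] == "H")
    return heads / len(matching), len(matching)

print("any coin toss:          ", relative_frequency(records))
print("florin tossed:          ", relative_frequency(records, coin="florin"))
print("coin tossed on a Tuesday:", relative_frequency(records, day="Tue"))
print("florin, Tuesday, right: ",
      relative_frequency(records, coin="florin", day="Tue", hand="right"))
```

The same future toss could be counted as an instance of any of these experiments, and each gives a different answer. Notice also that the narrower the definition, the fewer past cases there are to go on.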
The problem of finding an objective way to define probabilities has not been solved by frequencies, and nor has the practical problem of deciding whether a probability is accurate.
Good probabilities are ones that permit good decisions to be taken. People who take good decisions tend to be more successful. A gambler who makes money more often than other gamblers probably works on better probabilities. A businessman who enjoys better success than others may be better at probabilities. We cannot be certain because some decisions can be taken without probabilities and other decision making skills also affect the overall result.
The objective reliability of probabilities has been assessed in various ways. People have tested their probabilities against reality (e.g. when statisticians went gambling and famously made money at blackjack by card counting), and against simulated reality as in many experiments in machine learning (a branch of artificial intelligence).
Experiments to test the effectiveness of alternative methods of estimating probabilities have given some very interesting results. In some of the situations tested simple methods compete very well with complicated mathematics. These simple methods tend to work by looking for knowledge of similar situations to the one for which a probability estimate is needed, for which the result is known. These similar situations are then used as the basis for the probability estimate, either taking the best matching situation, or taking a kind of average, perhaps modified by similarity.
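As a rough illustration of the two variants just described (taking the best match, or taking an average modified by similarity), here is a generic sketch. It is not the method of any particular paper. The similarity measure, the two numeric features per case, and the six past cases are all assumptions made up for the example.

```python
def similarity(a, b):
    """Similarity in (0, 1]: 1 for identical cases, falling as they differ."""
    distance = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + distance)

def best_match_estimate(cases, query):
    """Use the known result of the single most similar past situation."""
    _, outcome = max(cases, key=lambda case: similarity(case[0], query))
    return float(outcome)

def weighted_estimate(cases, query):
    """Average the known results, weighting each by similarity to the query."""
    weights = [similarity(features, query) for features, _ in cases]
    weighted_sum = sum(w * outcome for w, (_, outcome) in zip(weights, cases))
    return weighted_sum / sum(weights)

# Hypothetical past situations: (features, outcome), outcome 1 = it happened.
cases = [
    ((1.0, 2.0), 1), ((1.5, 1.8), 1), ((2.0, 2.5), 1),
    ((5.0, 6.0), 0), ((6.0, 5.5), 0), ((5.5, 7.0), 0),
]
query = (1.2, 2.1)   # the new situation needing a probability estimate

print("best match estimate:", best_match_estimate(cases, query))
print("similarity weighted estimate:", round(weighted_estimate(cases, query), 3))
```

The best match version can only ever say 0 or 1, whereas the weighted version gives a graded estimate that still allows some influence to the dissimilar cases.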
Some of these methods may become widely used, especially in situations where the available data is limited.
Since situations are so important it is obvious that if you want to build up a database of past experience with which to estimate future probabilities it is vital to capture reliably as much relevant situation information as possible. This will include recording systematically more aspects of the situation than you think are relevant at that time, in order to support future theories and requirements.
Our attempts to learn from experience are often undermined by consistent but unnoticed conditions that applied to past occasions but do not apply in a new prediction situation. We are often unaware of just how narrow and unreliable our experience is.
It is also important to look carefully at any uncertainties around future situations for which we need to estimate probabilities.
There will be difficult choices about which past situations to match to future prediction situations. Typically, if you choose to say the prediction situation is an example of a loose experiment there will be more data than if you choose to see it as an instance of a narrow experiment, but the narrow situation is more likely to be appropriate. Which should you choose?
What is randomness?
Is randomness something real, or just behaviour that is very difficult (perhaps even impossible) to predict? Among experts the debate is far from over. It seems possible that deterministic events can be very hard to predict because of sensitivity to conditions, as explained by chaos theory. Other thinkers point to quantum mechanics as an example of something that seems truly random.
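Sensitivity to conditions is easy to see in a standard textbook example from chaos theory, the logistic map. The sketch below is my own choice of illustration, with an assumed parameter value of 4 and two starting values that differ by one part in ten billion. The rule is completely deterministic, yet the two trajectories soon bear no resemblance to each other.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the deterministic rule x -> r * x * (1 - x) from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)      # almost identical starting point
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(step, gaps[step])
```

In practice we never know starting conditions to ten decimal places, so behaviour like this is unpredictable for us beyond a short horizon whether or not anything truly random is involved.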
Whatever the philosophical answer it is fairly clear to most people that there are phenomena that appear unpredictable and haphazard, and which can be modeled very accurately by tossing a coin vigorously. This behaviour appears random. This kind of behaviour can be modeled in terms of probabilities but it seems impossible to predict the outcome of each experiment.
It is very useful to be able to tell when you have reached random behaviour which is going to defeat every attempt at predictions of individual outcomes. We would like to avoid wasting time on problems that cannot be solved. We also need to avoid giving up while there is still a reasonable chance of making better predictions.
It is harder to tell if you have reached hard core randomness if you have only a few past data points to work with. On the other hand there are many situations where it is generally believed that the system is ‘random’ such as dice and roulette wheels.
In many situations we are a long way from hard core randomness, but the language of the mathematical techniques often used to try to make sense of data tends to encourage premature capitulation. Behaviour that fits the model is described as predictable, whereas the differences between the model's predictions and actual results are called ‘random error’ or ‘noise’. They might not be!
In statistical hypothesis testing the normal starting assumption is that the data are random and, usually, normally distributed. Hypothesis testing then estimates how likely it is that chance alone would produce a result at least as extreme as the one observed, assuming the underlying behaviour is randomly distributed.
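Here is a minimal sketch of that kind of calculation for a coin. It assumes the starting assumption is a fair coin, so the relevant distribution is binomial rather than normal, and the observed result (16 heads in 20 flips) is made up. The function names are mine.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips if each flip has P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(heads, flips, p_null=0.5):
    """Chance, under the starting assumption, of a result at least as unlikely
    as the one observed (an exact two-sided binomial test)."""
    observed = binomial_pmf(heads, flips, p_null)
    return sum(
        binomial_pmf(k, flips, p_null)
        for k in range(flips + 1)
        # small tolerance so equally likely results are not lost to rounding
        if binomial_pmf(k, flips, p_null) <= observed * (1 + 1e-9)
    )

p = two_sided_p_value(16, 20)
print("chance of a result this lopsided from a fair coin:", round(p, 4))
```

Note what the calculation takes for granted: everything the fair coin model does not explain is treated as chance. It says nothing about whether some better model could have predicted the individual flips.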
How do you say how much evidence you have used?
Distinguishing between probability judgments based on very little evidence and judgments based on lots of relevant empirical data is still a problem. Information theory calculates the amount of information received as the reduction in uncertainty, where uncertainty is calculated from the probability distribution of possible ‘states’ of a ‘system’.
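The calculation is short enough to show in full. This sketch uses Shannon's entropy, measured in bits, as the measure of uncertainty. The four states and their probabilities are invented for the example.

```python
from math import log2

def entropy(dist):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def information_received(before, after):
    """Information received, taken as the reduction in uncertainty."""
    return entropy(before) - entropy(after)

# Hypothetical system with four possible states.
before = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
after = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

print("uncertainty before:", entropy(before), "bits")
print("uncertainty after: ", entropy(after), "bits")
print("information received:", information_received(before, after), "bits")
print("if nothing had changed:", information_received(before, before), "bits")
```

On this measure, evidence that leaves the distribution exactly where it started counts as zero information, however much effort went into collecting it.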
Does this match our usual concept of information? Here are two ways in which it does not.
Firstly, imagine a crime has been committed and several people saw the perpetrator. You are a detective and interview them for a description. Let's just consider hair colour alone. Before you conduct the interviews you might have a probability distribution for different hair colours (perhaps based on hair colours in the general population of the area). Let's imagine that after interviewing all the witnesses your probability distribution is unchanged because the witnesses contradict each other. In a sense Information Theory is quite right that you have obtained no information as to hair colour. The situation is as if you had not done the interviews.
But you did do the interviews and surely that makes some kind of difference. You have exhausted that line of enquiry. Perhaps we should say that you obtained evidence but it contained no information.
Secondly, evidence can change the reasons behind our probability distribution for future outcomes of a situation. In the clearest case, the effect of evidence – useful evidence this time – could be to reduce our uncertainties about which model is most likely to be true but guide us to a model that predicts a broader spread of outcomes than the models we initially favoured. In other words, once our uncertainty about models is taken into account as well, our overall probability distribution for outcomes could end up being the same as it was before. According to Information Theory we are no further forward, but I think most people would distinguish between predicting a broad spread of outcomes because you just don't know, and predicting a broad spread of outcomes because a mountain of data says that's a good model.
Psychological experiments show that we prefer to bet on probabilities we ‘know’ rather than probabilities we don't know, even though rationally there should be no preference.
In practical situations it is very useful to give some indication of the evidence you have been able to use in reaching a probability. Almost any description is better than none. At least say whose guess it was. Despite the limitation mentioned above, knowing what evidence was used helps people judge the prospects of reducing uncertainty through further research.
Links to further information
An interesting paper speculating about how probabilities might be estimated using similarity is ‘Probability from similarity’ by Sergey Blok, Douglas Medin, and Daniel Osherson.
An influential paper on simple alternatives to complicated mathematics for estimating probabilities is ‘Reasoning the Fast and Frugal Way: Models of Bounded Rationality’ by Gerd Gigerenzer and Daniel G Goldstein.
Magnus Persson and Peter Juslin have extended this with their PROBEX algorithm. See ‘Fast and Frugal Use of Cue Direction in States of Limited Knowledge’.
If you want to get into Bayesian Model Averaging you might start by reading ‘Bayesian Model Averaging: A Tutorial’ by Jennifer Hoeting, David Madigan, Adrian Raftery, and Chris Volinsky.
Words © 2003 Matthew Leitch.