Working In Uncertainty

Probability axioms in Bayesian language

The gradual swing back from Frequentist probability thinking to Bayesian probability thinking has been quite slow for a number of reasons. One of these is that the language of the most famous, most copied introduction to basic probability theory is set in Frequentist language. The aim of this simple paper is to present the usual axiomatic basis of probability theory using simple language consistent with a Bayesian approach.

A typical Frequentist presentation is shown side-by-side with a Bayesian version so that you can see the differences in language. Frequentist language is highlighted in red while the Bayesian alternative language I am suggesting is in green. (In case you don't know the Greek alphabet very well, note that σ is sigma and Ω is capital Omega.)

Frequentist version

Bayesian version

Probabilities are defined within probability spaces. A probability space, (Ω, F, P), has three components.

1) Ω, a set of outcomes (also known as elementary events), for an experiment;

2) F, a set of subsets of Ω, representing events; and

3) P, a function that gives a real number for each event in F.

When outcomes are put into a set to make an event, the event occurs if and only if one of the outcomes in it occurs.

Probabilities are defined within probability spaces. A probability space, (Ω, F, P), has three components.

1) Ω, a set of possible answers to an unsettled question;

2) F, a set of subsets of Ω, representing disjunctions of answers; and

3) P, a function that gives a real number for each set of possible answers in F.

A disjunction of answers is where the answers are combined with the logical ‘or’. For example, given two answers a₁ and a₂, the logical expression a₁ ∨ a₂ is the disjunction of the two answers, pronounced ‘a₁ or a₂’.

The outcomes in Ω must be chosen so that they are exhaustive and mutually exclusive. This means that the outcome that actually happens must be exactly one of the elements of Ω.

The definition of the experiment is important, even though the experiment is not identified in the statement of the probability space.

The possible answers in Ω must be chosen so that they are exhaustive and mutually exclusive. This means that exactly one of the elements of Ω must be the true answer to the question.

Each answer should be a proposition, which is a clear, internally consistent statement that can only be true or false, if its truth is known.

The question is important, even though it is not identified in the statement of the probability space.

For the probability space to support probability theory, F and P must meet a number of requirements.

F must be a σ-algebra, which is a set of subsets having special properties so that F is sufficiently comprehensive, while P must be a probability measure, which is a function with other special properties.

For the probability space to support probability theory, F and P must meet a number of requirements.

Specifically, F must be such that:

1) Ω is itself an event in F, (Ω ∈ F );

2) if A is a subset of Ω and is in F, then the set of outcomes not in A but still in Ω is also in F, (∀ A ⊆ Ω, A ∈ F ⇔ (Ω \ A) ∈ F ); and

3) if each of a countable set of subsets of Ω is in F, then so is the union of those subsets.

Specifically, F must be such that:

1) Ω is itself a set of possible answers in F, (Ω ∈ F );

2) if A is a subset of Ω and is in F, then the set of possible answers not in A but still in Ω is also in F, (∀ A ⊆ Ω, A ∈ F ⇔ (Ω \ A) ∈ F ); and

3) if each of a countable set of subsets of Ω is in F, then so is the union of those subsets.
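For a finite Ω the three requirements on F can be checked mechanically. The following is only a minimal sketch: the function name is_sigma_algebra is my own, and it assumes Ω is finite, in which case closure under pairwise unions is enough to give closure under countable unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the three requirements on F for a finite Omega (illustrative helper)."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # Every member of F must be a subset of Omega.
    if any(not s <= omega for s in family):
        return False
    # 1) Omega itself is in F.
    if omega not in family:
        return False
    # 2) F is closed under complement relative to Omega.
    if any(omega - s not in family for s in family):
        return False
    # 3) F is closed under union. For a finite family, checking every
    #    pair is enough to cover all countable unions.
    return all(a | b in family for a, b in combinations(family, 2))

two_answers = {"H", "T"}
print(is_sigma_algebra(two_answers, [set(), {"H"}, {"T"}, {"H", "T"}]))  # True
print(is_sigma_algebra(two_answers, [set(), {"H"}, {"H", "T"}]))         # False: {"T"} is missing
```

The second call fails requirement 2, because the complement of {"H"} is not in the family.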

The probability measure, P, must be such that:

1) P gives a real number between 0 and 1 inclusive for all events in F ;

2) P(Ω) = 1; and

3) the union of any countable set of elements of F, where no pair overlaps, has a probability given by P that is equal to the sum of the probabilities given by P to each of the elements.

The probability measure, P, must be such that:

1) P gives a real number between 0 and 1 inclusive for all sets of possible answers in F ;

2) P(Ω) = 1; and

3) the union of any countable set of elements of F, where no pair overlaps, has a probability given by P that is equal to the sum of the probabilities given by P to each of the elements.
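The requirements on P can be checked in the same way when F is finite. This sketch assumes P is stored as a Python dictionary from sets in F to numbers. The name is_probability_measure and the tolerance parameter are illustrative choices, not part of the usual theory.

```python
from itertools import combinations
import math

def is_probability_measure(omega, p, tol=1e-9):
    """Check the three requirements on P; p maps frozensets in F to numbers."""
    omega = frozenset(omega)
    # 1) Every value lies between 0 and 1 inclusive.
    if any(not (0.0 <= v <= 1.0) for v in p.values()):
        return False
    # 2) P(Omega) = 1.
    if not math.isclose(p.get(omega, 0.0), 1.0, abs_tol=tol):
        return False
    # 3) Additivity: for non-overlapping A and B whose union is in F,
    #    P(A or B) = P(A) + P(B). With a finite F, checking pairs is enough.
    for a, b in combinations(p, 2):
        if not (a & b) and (a | b) in p:
            if not math.isclose(p[a | b], p[a] + p[b], abs_tol=tol):
                return False
    return True

fair = {frozenset(): 0.0, frozenset("H"): 0.5, frozenset("T"): 0.5, frozenset("HT"): 1.0}
bad = {frozenset(): 0.0, frozenset("H"): 0.6, frozenset("T"): 0.6, frozenset("HT"): 1.0}
print(is_probability_measure("HT", fair))  # True
print(is_probability_measure("HT", bad))   # False: 0.6 + 0.6 is not 1
```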

The probability measure, P, gives probabilities for events rather than the more elementary outcomes because there are some sets of outcomes, Ω, that have so many elements, with the probability so evenly spread, that all outcomes have a probability that can be said to be zero (or infinitesimal). Considering probabilities for sets of outcomes only is a way to deal with this situation more easily.

Unfortunately, existing explanations of how this works are very hard to understand.

The probability measure, P, gives probabilities for sets of possible answers rather than individual answers because there are some sets of possible answers, Ω, that have so many elements, with the probability so evenly spread, that all answers have a probability that can be said to be zero (or infinitesimal). Considering probabilities for sets of answers only is a way to deal with this situation more easily.

Unfortunately, existing explanations of how this works are very hard to understand.
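One standard illustration, which is my addition rather than part of the usual presentation, is a number picked from the interval 0 to 1 with the probability spread evenly. In this sketch the probability of an interval is simply its length, so an interval that shrinks to a single point gets probability zero, while wider intervals still get useful probabilities. The function name uniform_prob is hypothetical.

```python
def uniform_prob(a, b):
    """P([a, b]) when probability is spread evenly over [0, 1]: the overlap length."""
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0)

print(uniform_prob(0.25, 0.75))  # 0.5
print(uniform_prob(0.3, 0.3))    # 0.0, a single point

# Intervals shrinking around 0.5 get probabilities shrinking towards zero.
for width in (0.1, 0.01, 0.001):
    print(width, uniform_prob(0.5 - width / 2, 0.5 + width / 2))
```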

Most books covering this basic theory illustrate the ideas with some examples. Here are some in Frequentist and Bayesian language. Again, the changes are easy to make.

Frequentist version

Bayesian version

Example 1: Consider the experiment of flipping a fair coin once. A reasonable choice of outcomes would be

Ω = {H, T}.

The σ-algebra, F, would be defined with

F = {{}, {H}, {T}, {H, T}}.

The probability measure, P, would usually be defined by

P = {{} → 0, {H} → 0.5, {T} → 0.5, {H, T} → 1}.

Example 1: Consider a fair coin that is to be flipped once and the question is, ‘Which side will be on top on that flip?’ A reasonable choice of possible answers would be

Ω = {H, T}.

The σ-algebra, F, would be defined with

F = {{}, {H}, {T}, {H, T}}.

The probability measure, P, would usually be defined by

P = {{} → 0, {H} → 0.5, {T} → 0.5, {H, T} → 1}.

Example 2: Two men are in court and charged with robbing a bank. The experiment is the discovery of the truth of their guilt or innocence. A reasonable set of potential outcomes, using G and N for ‘guilty’ and ‘not guilty’ respectively, is:

Ω = {(N,N), (N,G), (G,N), (G,G)}

The σ-algebra, F, would be defined with

F = { {}, {(N,N)}, {(N,G)}, {(G,N)}, {(G,G)},
{(N,N), (N,G)}, {(N,N), (G,N)}, {(N,N), (G,G)},
{(N,G), (G,N)}, {(N,G), (G,G)},
{(G,N), (G,G)},
{(N,N), (N,G), (G,N)}, {(N,N), (N,G), (G,G)},
{(N,N), (G,N), (G,G)}, {(N,G), (G,N), (G,G)},
{(N,N), (N,G), (G,N), (G,G)} }.

The probability measure, P, would be defined to give a probability for each of the possible sets of outcomes.

Example 2: Two men are in court and charged with robbing a bank. The question is: ‘Which if any of them are guilty?’ A reasonable set of possible answers, using G and N for ‘guilty’ and ‘not guilty’ respectively, is:

Ω = {(N,N), (N,G), (G,N), (G,G)}

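The σ-algebra, F, would be defined with

F = { {}, {(N,N)}, {(N,G)}, {(G,N)}, {(G,G)},
{(N,N), (N,G)}, {(N,N), (G,N)}, {(N,N), (G,G)},
{(N,G), (G,N)}, {(N,G), (G,G)},
{(G,N), (G,G)},
{(N,N), (N,G), (G,N)}, {(N,N), (N,G), (G,G)},
{(N,N), (G,N), (G,G)}, {(N,G), (G,N), (G,G)},
{(N,N), (N,G), (G,N), (G,G)} }.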

The probability measure, P, would be defined to give a probability for each of the possible sets of answers.
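Writing out all sixteen sets by hand is error-prone, so here is a small sketch that generates F as every subset of Ω and then builds P by adding up probabilities for the individual answers. The four numbers in answer_prob are made up purely for illustration and say nothing about any real case.

```python
from itertools import chain, combinations

omega = [("N", "N"), ("N", "G"), ("G", "N"), ("G", "G")]

def power_set(items):
    """Return every subset of items as a frozenset."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

F = power_set(omega)
print(len(F))  # 16

# Hypothetical probabilities for the four individual answers (illustrative only).
answer_prob = {("N", "N"): 0.1, ("N", "G"): 0.2, ("G", "N"): 0.3, ("G", "G"): 0.4}

# P gives each set of possible answers the sum of its members' probabilities.
P = {s: sum(answer_prob[a] for a in s) for s in F}

at_least_one_guilty = frozenset(omega) - {("N", "N")}
print(round(P[at_least_one_guilty], 10))  # 0.9
```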

You probably noticed that the second example feels much more naturally ‘Bayesian’ than the first, which is a typical Frequentist example.

One final element of some introductions to probability theory is an attempt to explain what probabilities mean. This of course is where the Bayesian approach is fundamentally different to the Frequentist approach so I haven't added colour. Here are two alternative explanations.

Frequentist version

Bayesian version

Probabilities represent relative frequencies in the long run and are defined by the results of many similar experiments. The experiments do not have to be identical in every possible respect. They just have to meet defined conditions that specify a set of experiments as being within the same set for this purpose, sometimes called the reference class, and have the same possible outcomes and events.

If many similar experiments are performed and the actual outcomes recorded, the proportion of experiments where an event occurs will tend to move towards the true probability of that event. In many applications of probability theory the task is to estimate that true probability.
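The long-run idea can be sketched with a simulation. This assumes a simulated coin with a known chance of heads, which is exactly the thing a real analyst would not know. The function name running_proportion, the checkpoints, and the seed are all arbitrary choices for illustration.

```python
import random

def running_proportion(n_flips, p_heads=0.5, seed=0):
    """Simulate coin flips and record the proportion of heads at a few checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # True counts as 1
        if i in (10, 100, 1000, 10000, 100000):
            checkpoints[i] = heads / i
    return checkpoints

for n, proportion in running_proportion(100000).items():
    print(n, proportion)
```

With more flips the recorded proportion should usually sit closer to 0.5, although any single run can wander.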

Probabilities represent degrees of belief that an answer is the true answer to the question. Put another way, they represent degrees of belief in a proposition (i.e. a statement that is clear and can be only true or false, if its truth is known).

A probability of 1 for a set of possible answers represents complete certainty that one of them is the true answer. A probability of 0 for a set of possible answers represents complete certainty that none of them is the true answer.

However, probabilities are not just a matter of opinion. A good source of probabilities produces probabilities that agree with the relative frequency of true propositions for which the source has given probabilities. For example, over all the instances where the probabilities have been stated as 0.7, say, about 70% should turn out to be true. A good source of probabilities also produces probabilities that are responsive to circumstances and experience. In combination, these two properties mean that good probabilities are informative, and that the better a source of probabilities the more informative its probabilities are.
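The first property, agreement with relative frequency, can be checked from a record of past probabilities and what turned out to be true. The sketch below assumes the record is a list of (stated probability, came true) pairs. The function name calibration_table and the record itself are invented for illustration.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Group (stated_probability, came_true) pairs by stated probability and
    return the observed proportion of true propositions for each group."""
    groups = defaultdict(list)
    for prob, came_true in forecasts:
        groups[prob].append(came_true)
    return {prob: sum(outcomes) / len(outcomes) for prob, outcomes in groups.items()}

# Made-up record: ten propositions given 0.7 and five given 0.2.
record = (
    [(0.7, True)] * 7 + [(0.7, False)] * 3
    + [(0.2, True)] * 1 + [(0.2, False)] * 4
)

for stated, observed in calibration_table(record).items():
    print(stated, observed)
```

In this invented record the observed proportions match the stated probabilities exactly. A real source would only match approximately.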

Having got this basic introduction of axioms out of the way, there is still much about the modern Bayesian approach that needs to be explained. In particular, the approach usually focuses on some kind of system or process that can be observed, generating data from those observations, and which the analyst wants to represent with a mathematical model.

What is the question, and what is the set of answers that would be used in the probability space? The question is a compound question (really two questions in one) that asks: ‘Which model is best and what data could the process produce?’ The answers will be every possible combination of model paired with a set of data that might be observed.

The analyst will, in effect, use information about the probability of each combination of model and observed data to deduce how likely it is that each model is the best model, given the data actually observed. These probabilities will usually be represented by a distribution that says how likely it is that each model is the best of the set, and another, conditional, probability distribution that says how likely each set of data is given that each model is true.
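As a rough sketch of that calculation, suppose there are just two candidate models of a coin and a short run of observed flips. Everything specific here is an assumption made for illustration: the model names, their chances of heads, the equal starting probabilities, and the data. The function combines the starting distribution over models with the conditional probability of the data under each model, then rescales so the results add up to 1.

```python
def posterior(prior, likelihood, data):
    """Return P(model | data) from P(model) and P(data | model)."""
    joint = {m: prior[m] * likelihood(m, data) for m in prior}
    evidence = sum(joint.values())  # P(data), summed over all models
    return {m: j / evidence for m, j in joint.items()}

# Hypothetical models: each is just a chance of heads on one flip.
heads_prob = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}

def likelihood(model, data):
    """P(data | model) for a string of independent flips such as 'HHTH'."""
    p = heads_prob[model]
    result = 1.0
    for flip in data:
        result *= p if flip == "H" else 1.0 - p
    return result

post = posterior(prior, likelihood, "HHTH")
for model, prob in post.items():
    print(model, round(prob, 3))
```

Three heads in four flips shifts belief towards the hypothetical biased model, but not overwhelmingly, because the fair model also explains the data reasonably well.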
