# Learning and teaching probability and statistics


Over hundreds of years the basic language and notation of probability and statistics have evolved very, very slowly. Ideas have progressed faster than the language and notation, which have tended to lag behind as will be shown by the examples below. Each generation of writers, teachers, and students has carried the burden of obsolete and unhelpful junk passed on by the previous generation. To make sweeping changes across all the world's probability and statistics experts would be nearly impossible, so we have struggled on.

Today, if you are learning or teaching probability theory, each bit of misleading, confusing, or just wrong terminology or notation will cause you a bit of pain — a bit of extra effort and puzzlement that you need not have suffered.

The purpose of this article is to give you an edge by pinpointing the most common causes of confusion and offering some ways to overcome them more easily. If you still find you have to use the old words and symbols then at least you will not be misled by them so easily. If you have more freedom than that then you can choose to use better words and symbols in your own work, making it clearer and more compelling.

Here, grouped by their source, are some of the problems and, occasionally, suggested solutions.

## Some general mathematical notation glitches

### Curved parentheses

One quite well known ambiguity in common mathematical notation is a particular problem in probability work. When you learn algebra for the first time you learn how brackets control order of calculation and learn things like,

a(b) = ab and a(b + c) = ab + ac.

When you first see something like ‘sin(x)’ or ‘log(a)’ there is obviously scope for some confusion because here identical brackets are being used to identify the inputs to functions, not to control order of calculation. It is not true that

sin(a + b) = sin(a) + sin(b).

We deal with this potential confusion by learning to recognize the names of the common mathematical functions, such as sin, cos, tan, log, ln, and exp. It also helps that these functions have names that consist of more than one character.

Unfortunately, in probability work it is very common to use functions that have names that are single characters, such as P, p, E, and fx, and then go on to complicate matters even further by inventing new functions as a problem is solved, with names like φ, ψ, L, and X. The potential for confusion is again increased.

A simple solution to this is to use square brackets for functions. For example, ‘sin(x)’ becomes ‘sin[x]’. Now it is easy to see that ‘f(x)’ means f times x while ‘f[x]’ means the value returned by f given x as input. See Wolfram (2000) for more discussion of general notation challenges.

### Not stating types

In mathematical writing you will sometimes see things like ‘Let h ∈ R be ...’ to introduce a new variable, h, which is an element of the set of real numbers, which means that h is a real number. A different style of saying the same thing is to use a colon, as in ‘Let h : R be ...’ The ‘∈ R’ and ‘: R’ parts tell you what type of object the h is.

The idea of explicitly saying what type of object you have invented in your mathematics can be taken much further and it is very useful for bringing clarity to more complex situations. Here are some examples of type statements:

| Type statement | What the object is |
| --- | --- |
| x : R | x is a real number |
| y : R × R | y is a pair of real numbers, perhaps coordinates on a two dimensional graph |
| f : R → R | f is a function that takes a real number as input and returns a real number as output |
| p : Ω → R | p is a function that takes an outcome as an input and returns a real number as output; perhaps it is a function that gives a probability for each possible outcome in some situation |
| c : Ω → (Ω → R) | c is a function that takes an outcome as input and returns a function that itself takes an outcome as input and returns a real number as output; perhaps it is a conditional probability distribution |
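
As a rough illustration, the same type statements can be written as type hints in a programming language. The sketch below uses Python; the `Outcome` alias, the coin labels, and all the numbers are assumptions made up for the example, not part of any standard notation.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for Ω: an outcome is just a label such as "heads".
Outcome = str

x: float = 1.5                                   # x : R
y: Tuple[float, float] = (0.0, 2.0)              # y : R × R
f: Callable[[float], float] = lambda t: t * t    # f : R → R


def p(outcome: Outcome) -> float:
    """p : Ω → R, a probability for each outcome of an illustrative fair coin."""
    return {"heads": 0.5, "tails": 0.5}[outcome]


def c(given: Outcome) -> Callable[[Outcome], float]:
    """c : Ω → (Ω → R), given one outcome, return a function over outcomes."""
    # Illustrative numbers only: a second flip that tends to repeat the first.
    table = {
        "heads": {"heads": 0.7, "tails": 0.3},
        "tails": {"heads": 0.4, "tails": 0.6},
    }
    return lambda outcome: table[given][outcome]


print(p("heads"))           # 0.5
print(c("heads")("tails"))  # 0.3
```

Writing `c("heads")("tails")` makes the two-step nature of c visible: the first call returns a function and the second call uses it.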

Unfortunately, it is common to omit type statements when introducing objects in probability work. If this were not so I think the notation of probability would be very different. Checking types for mis-matches and other problems reveals some deep problems with the probability notation we have inherited from previous generations. Examples are given below.

## From Kolmogorov

In his book, Foundations of The Theory of Probability, Andrey Kolmogorov provided a presentation of basic probability theory that has been translated and copied repeatedly ever since. Although the book has been an academic hit, the cost has been high for later students and teachers.

### ‘Experiment’

In elementary probability theory the word ‘experiment’ is used to refer to anything that produces data, not just to refer to experiments. Kolmogorov may not have been the first person to use the word ‘experiment’ in this odd way, but his legacy helped to entrench this usage. Here are some examples from recent, authoritative sources.

| Source | Quote |
| --- | --- |
| Online Encyclopedia Britannica, entry on Probability theory. | ‘The fundamental ingredient of probability theory is an experiment that can be repeated, at least hypothetically, under essentially identical conditions and that may lead to different outcomes on different trials.’ |
| Wikipedia entry for Probability theory. | ‘Consider an experiment that can produce a number of outcomes.’ |
| A first course in probability, by Sheldon Ross, 9th edition (2013). | ‘Consider an experiment whose outcome is not predictable with certainty.’ |

This odd use of the word ‘experiment’ probably comes from probabilities being used in proper experiments, before the theory was generalized more widely.

Clearly it is misleading to use the word ‘experiment’ when an experiment is not involved.

### ‘Sample space’

The phrase that so often appears along with ‘experiment’ is ‘sample space’, even when no sampling is involved and there is no physical space to grapple with either. (Kolmogorov seems to have avoided this phrase and I cannot find out where it came from.) Here are some examples:

| Source | Quote |
| --- | --- |
| Online Encyclopedia Britannica, entry on Probability theory. | ‘The set of all possible outcomes of an experiment is called a “sample space.” The experiment of tossing a coin once results in a sample space with two possible outcomes, “heads” and “tails.”’ |
| Wikipedia entry for Probability theory. | ‘The set of all outcomes is called the sample space of the experiment.’ |
| A first course in probability, by Sheldon Ross, 9th edition (2013). | ‘This set of all possible outcomes is known as the sample space of the experiment and is denoted by S.’ |

Again, this use of ‘sample space’ probably comes from using probabilities in situations where sampling is involved, then generalizing the theory without updating the language.

The phrase is misleading in a number of ways:

• The ‘space’ is metaphorical, and just a synonym for ‘set’.

• The ‘sample space’ sounds like the set of items from which a sample is taken, but instead it refers to the possible values of the sampled items.

• There are applications of probability theory where there is no sampling.

### Other Frequentist terms

Kolmogorov was intent on applying the fashionable new ‘measure theory’ to probability and on providing axioms as a basis for probability theory. He phrased his axioms from a Frequentist perspective. Frequentism is the idea that ‘probability’ is no more than a synonym for ‘long run relative frequency’. For example, toss a bent coin enough times and you will discover the true Frequentist probability of heads with that coin from the proportion of total tosses when the result is heads.

Frequentism was popular in the early and mid-twentieth century but has been losing ground since then to the Bayesian perspective, where probabilities represent degrees of belief. To me it seems pointless to use ‘probability’ as a synonym for ‘frequency’. Why not say ‘frequency’ when you mean ‘frequency’ and reserve ‘probability’ for degrees of belief so that you can reason logically about your uncertainty as to the true frequencies? The debate between Frequentists and Bayesians has been dragging on for many decades but as Frequentists have died of old age Bayesianism has become more and more important.

The problem today for students and teachers is that the first introduction to probability theory usually comes with a dose of Frequentism, even though that's not the perspective most people will want to use in future. To provide an alternative, I have written a version of the usual axioms couched in Bayesian language.

The terms ‘experiment’ and ‘sample space’ are Frequentist, as are ‘trials’, ‘outcomes’, and ‘events’. This language tends to focus attention on using probabilities for repeated situations where the goal is to make probabilistic predictions about future results from continuing to repeat the ‘experiment’. This language tends to focus attention away from useful applications to past events (e.g. deciding who committed a crime) and current situations (e.g. deciding what disease a sick person has).

Here are some more examples of Frequentist language.

| Source | Quote |
| --- | --- |
| Online Encyclopedia Britannica, entry on Probability theory. | ‘The fundamental ingredient of probability theory is an experiment that can be repeated, at least hypothetically, under essentially identical conditions and that may lead to different outcomes on different trials.’ |
| Wikipedia entry for Probability theory. | ‘In this case, {1,3,5} is the event that the die falls on some odd number. If the results that actually occur fall in a given event, that event is said to have occurred.’ |
| A first course in probability, by Sheldon Ross, 9th edition (2013). | ‘Any subset E of the sample space is known as an event.’ |

### Lack of attention to choices that must be made

A more subtle consequence of the early twentieth century swerve towards Frequentism was a tendency to think of probability as much more objective than it is, even when used in a Frequentist way. In Frequentism, probabilities are really just the relative frequencies that would result from repeating something lots of times. Repeating what? That is where the subjectivity comes in. Suppose a famous conjuror offers you a huge bet on the flip of a coin. Is the next flip of that coin an example of repeatedly flipping a coin, or is it an example of a bet against a famous conjuror? The same flip, two different probabilities.

We also have choices about the set of possible outcomes, and about how those are grouped into events, and how outcomes are mapped to numbers, and what evidence to use, and so on. These choices are usually ignored, and that is wrong.

Online Encyclopedia Britannica, entry on Probability theory:

‘The experiment of tossing a coin once results in a sample space with two possible outcomes, “heads” and “tails.”’

No it does not. The decision to represent the possible outcomes as ‘heads’ and ‘tails’ was the result of someone thinking about what to choose and deciding to ignore the possibility of a failed toss, or the coin landing in a crack, edge up. That choice was not the inevitable result of tossing a coin once. The thinker also chose not to say the outcomes were ‘win’ and ‘lose’, which might also have been a reasonable choice.

A first course in probability, by Sheldon Ross, 9th edition (2013):

‘If the experiment consists of measuring (in hours) the lifetime of a transistor, then the sample space consists of all non-negative real numbers.’

Not if you choose to record the life to the nearest hour, or nearest day, or anything other than infinitely exact numbers. Also, not if you choose to measure the lifetime relative to a planned lifetime, so that sometimes the transistors last less time than planned and sometimes more. In that approach you will have some negative numbers as well as positive ones.

Frequentism may not be the only reason for this tendency. When probability is taught to children today we tend to start with coin flips, die rolls, and drawing coloured balls out of bags ‘at random’ so that the possible outcomes seem obvious and so are the probability numbers to use. In more typical real-world applications of probability theory the choices are much more difficult but instead of thinking carefully about them we tend to just grab the first idea that comes into our heads, acting on long habit.

### Unstated probability spaces with ambiguous symbols

Kolmogorov's much-copied foundation for probability theory started by setting up a ‘probability space’ consisting of three elements (here given the usual modern names):

1. Ω, the set of all possible outcomes from the experiment.

2. F, a set of events defined using the outcomes in Ω.

3. P, a probability measure, which is a function giving a probability for each event in F.

You could imagine a probability space being set up for the flip of a coin, another for pulling coloured balls out of a bag, another for the sex of babies, and so on. If this was done in the same paper or chapter of a book then, of course, you would need to give different names to their parts, such as Ωc (c for coins), Ωb (b for balls), and Ωs (s for sex). Likewise you would need Pc, Pb, and Ps.

(An alternative approach has been suggested by Fokkinga, which is to keep the letter P but require the set of outcomes – the Ω – to be stated explicitly. Fokkinga's paper illustrates the errors that can be made without even this level of clarity. Unfortunately, this style does not allow different views of the probability measure for the same Ω, so the lack of distinct names for probability measures is still a problem.)

Unfortunately, it is now normal practice to omit an explicit probability space and to call all probability measures P. What started out as a genuine mathematical object has become little more than an abbreviation. We just pronounce P as ‘the probability of’.

Another reason for having different probability spaces is that different people might start off with different probability distributions, even for the same experiment. Without a habit of using distinctive names for probability measures it is hard to even think of this possibility, let alone write about it.
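
To make this concrete, here is a minimal sketch in Python of a probability space written out as an explicit object, with two differently named probability measures for the same Ω. The class name `ProbabilitySpace`, the helper functions, the names Alice and Bob, and all the numbers are assumptions invented for this illustration.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import chain, combinations
from typing import Callable, FrozenSet

Event = FrozenSet[str]


def power_set(outcomes: FrozenSet[str]) -> FrozenSet[Event]:
    """Every subset of the outcomes, used here as the set of events F."""
    items = sorted(outcomes)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return frozenset(frozenset(s) for s in subsets)


@dataclass(frozen=True)
class ProbabilitySpace:
    omega: FrozenSet[str]                 # Ω, the possible outcomes
    events: FrozenSet[Event]              # F, the events
    measure: Callable[[Event], Fraction]  # P, a probability for each event


def measure_from_weights(weights: dict) -> Callable[[Event], Fraction]:
    """Build a measure that adds up the weights of the outcomes in an event."""
    def measure(event: Event) -> Fraction:
        return sum((weights[o] for o in event), Fraction(0))
    return measure


omega_c = frozenset({"heads", "tails"})

# Two people, the same Ω, different measures, each with its own name.
P_c_alice = measure_from_weights({"heads": Fraction(1, 2), "tails": Fraction(1, 2)})
P_c_bob = measure_from_weights({"heads": Fraction(3, 5), "tails": Fraction(2, 5)})

space_alice = ProbabilitySpace(omega_c, power_set(omega_c), P_c_alice)
space_bob = ProbabilitySpace(omega_c, power_set(omega_c), P_c_bob)

heads = frozenset({"heads"})
print(space_alice.measure(heads), space_bob.measure(heads))  # prints 1/2 3/5
```

Because each measure has its own name, a statement such as "Alice and Bob disagree about heads" can be written down directly instead of being hidden behind a single overloaded P.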

## Concerning ‘random variables’

### The term ‘random variable’

The name ‘random variable’ is particularly misleading because random variables are neither random nor variables, and because people use the phrase to mean something other than what it really means. It's hard to see how it could be any more confusing.

According to the most reliable sources I can find, a random variable is a function that maps outcomes in a sample space to real numbers. It is a function, not a variable, and it always returns the same real number given a particular outcome as input, so there is nothing random about it. The number returned by a random variable could be seen as random, however, because the input to the function is random.

Many people, including the authors of books, write about random variables as if they are variables whose values are different each time you think about them, according to a probability distribution. The feeling is that a random variable is like a person's weight, hour by hour, in that it keeps changing, except that the value of a random variable is more random than that.

This common (but, strictly speaking, incorrect) idea introduces subtle confusion too because when values of a so-called random variable are recorded they are not random. They are just the values of a variable. The randomness lies only in the probability distributions linked to the variable in a model.

This muddle would not have persisted if mathematicians were always careful to state explicitly the types of all the objects they invent. They haven't been, so here we are, confused.

When writing about probabilities it is best to avoid the term ‘random variable’ completely and just use ‘variable’ then talk about its probability distribution. Here's an example from Ross (2013). First, the original version:

‘Suppose our experiment consists of tossing three fair coins. If we let Y denote the number of heads that appear, then Y is a random variable taking on one of the values 0, 1, 2, and 3 with respective probabilities:

P(Y = 0) = P(T, T, T) = 1/8
P(Y = 1) = P((T, T, H), (T, H, T), (H, T, T)) = 3/8
P(Y = 2) = P((T, H, H), (H, T, H), (H, H, T)) = 3/8
P(Y = 3) = P(H, H, H) = 1/8’

Now here it is with the reference to a ‘random variable’ removed and some other minor tweaks done too:

‘Suppose we toss three fair coins. Let Y denote the number of heads that appear. The probability distribution in our model for Y, here called d, is:’

d[0] = p[{(T, T, T)}] = 1/8
d[1] = p[{(T, T, H), (T, H, T), (H, T, T)}] = 3/8
d[2] = p[{(T, H, H), (H, T, H), (H, H, T)}] = 3/8
d[3] = p[{(H, H, H)}] = 1/8
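
The calculation can be checked with a short Python sketch. It treats Y as an ordinary function from outcomes to numbers and builds a distribution, called d here, by asking p for the probability of each set of outcomes. The equal weighting of the eight outcomes is the modelling assumption behind ‘fair coins’; the function and variable names are choices made for this example.

```python
from fractions import Fraction
from itertools import product

# Ω: all ordered results of tossing three coins
omega = list(product("HT", repeat=3))


def p(event):
    """p : EVENT → probability; each outcome is assumed to carry weight 1/8."""
    return Fraction(len(event), len(omega))


def Y(outcome):
    """Y : Ω → number; an ordinary function that counts the heads."""
    return outcome.count("H")


# d[k] = p[{ω | Y[ω] = k}]
d = {k: p({w for w in omega if Y(w) == k}) for k in range(4)}

for k, prob in d.items():
    print(k, prob)
```

Nothing in Y is random: calling it twice on the same outcome always gives the same count. The probabilities come entirely from p.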

### P(X = 2) notation

The probability measure in Kolmogorov's approach is a function that provides a probability number for any given event in the probability space. This is usually called P, as I have just pointed out. The argument to P is supposed to be an event, which is a set of outcomes. Consequently, it only makes sense to provide sets of outcomes as arguments to P, as in the following examples.

P[{(H, H, T)}]

P[A] where A = {win, lose} and P[B] where B = {draw}

Sadly, this notation is routinely abused, without sufficient reason. Here are some errors, next to the correct notation:

| Wrong | Right |
| --- | --- |
| P[X = 2] | P[{ω \| X[ω] = 2}] where X is a function that maps outcomes to real numbers |
| P[X < x] | P[{ω \| X[ω] < x}] where X is a function that maps outcomes to real numbers |

The text inside the brackets of P(   ) tends to be used to write anything at all that describes the set of outcomes in some way. It is quite convenient as a personal shorthand, but unsystematic and confusing when used for reasoning and for explaining to others. You are probably thinking that the Right examples in the table above look more complicated than the Wrong versions. They are, but the Right versions also make clear the true purpose and type of X and remind us that the probability distribution of numbers returned by X is driven by the probability distribution of the underlying outcomes, and in such a way that many outcomes might meet the conditions of set membership.
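
A programming language forces the Right form, because a function that expects a set of outcomes has to be given one. The following Python sketch uses two dice as an illustrative Ω with equal weights assumed; the names `omega`, `P`, and `X` are choices for this example only.

```python
from fractions import Fraction
from itertools import product

# Ω for two six-sided dice (an illustrative choice of outcomes)
omega = frozenset(product(range(1, 7), repeat=2))


def P(event):
    """P takes a set of outcomes; equal weights are assumed for illustration."""
    if not event <= omega:
        raise ValueError("an event must be a set of outcomes from omega")
    return Fraction(len(event), len(omega))


def X(outcome):
    """X : Ω → R, the total shown by the two dice."""
    return sum(outcome)


# P[{ω | X[ω] = 2}] and P[{ω | X[ω] < 4}]
print(P({w for w in omega if X(w) == 2}))  # 1/36
print(P({w for w in omega if X(w) < 4}))   # 1/12
```

In this sketch the expression `X == 2` on its own would compare a function with a number and produce a boolean, not a set of outcomes, which is the type mismatch that the shorthand P[X = 2] hides.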

In some of the above examples, X is a ‘random variable’, but thought of as a variable rather than as the function it really is. The same confusion is the reason for the next issue.

### E[X] and Var[X] notation

These functions are somewhat confusing for four reasons:

• They are often shown with square brackets instead of the usual (ambiguous) curved parentheses. I have never seen this explained.

• They also tend to be called ‘operators’, but again the reason for this is never explained and is probably historical.

• The X is a ‘random variable’, which is a function from outcomes to real numbers, and not in fact what E and Var use as inputs. What they actually use are the probability distributions or probability density distributions associated with the real numbers produced by the ‘random variable’.

• There is usually no mention of the fact that E and Var require different calculations depending on whether the distribution given as input is a probability distribution or a probability density distribution.

Further confusion is possible because the word variance also gets used in the phrases ‘population variance’ and ‘sample variance’ and these numbers are calculated differently again, this time from the actual values in the population or sample, not from a probability distribution.
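
The difference in inputs can be seen in a short Python sketch. The first two functions take a discrete probability distribution as input (for a probability density the sums would have to become integrals, which is not shown). The last two calls use the standard library `statistics` module on a list of made-up recorded values, with no distribution involved. The function names `expectation` and `variance` are choices for this example.

```python
from fractions import Fraction
import statistics


def expectation(d):
    """E for a discrete distribution given as {value: probability}."""
    return sum(x * prob for x, prob in d.items())


def variance(d):
    """Var for a discrete distribution: the expected squared distance from E."""
    mean = expectation(d)
    return sum((x - mean) ** 2 * prob for x, prob in d.items())


# Distribution of the number of heads in three fair coin tosses
d = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
print(expectation(d), variance(d))  # 3/2 3/4

# Made-up recorded values: these use different formulas and no distribution.
values = [1, 2, 2, 3]
print(statistics.pvariance(values))  # population variance, divides by n
print(statistics.variance(values))   # sample variance, divides by n - 1
```

Note that the input to `expectation` and `variance` is the distribution d, not the function that maps outcomes to numbers.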

## Linked to Bayes' rule and Bayesianism

### Missing ‘For all...’

A typical statement of Bayes' law looks something like this:

P(A|B) = P(B|A)P(A) / P(B)

Ignoring for a moment the potentially confusing use of curved parentheses, the potential overuse of P, and the failure to mention that P(B) must be non-zero, there is yet another potential problem here. The statement says that this rule is true for two events, A and B, from the same probability space. Actually this is true for all pairs of events in a probability space, not just for two of them.

A more comprehensive statement of the law would be something like this:

∀ (Ω, F, P) : PROBABILITY_SPACE, a, b : EVENT | a ∈ F ∧ b ∈ F ∧ P[b] ≠ 0   •   P[a|b] = P[b|a]×P[a] / P[b]

In words, that translates as ‘For all probability spaces and pairs of events, such that the events are within the probability space and the probability of the second event is not zero, Bayes' law holds.’

When we are interested in just two events this is not much of an omission. However, in a typical paper or book chapter about Bayesian methods, the omission is more important. That is because the main thrust of Bayesian methods is to update probabilities for each of a comprehensive set of alternative hypotheses (misleadingly called ‘events’, as explained earlier).

Sadly, it is common to miss out the ‘for all...’ part and just state the rule for one hypothesis.
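
A minimal Python sketch of the ‘for all’ version applies the law to every hypothesis in a comprehensive set at once. The hypotheses about a coin, the evidence (one flip landing heads), and all the numbers are made up for illustration, and the function name `update` is a choice for this example.

```python
from fractions import Fraction


def update(prior, likelihood):
    """Apply Bayes' law to every hypothesis in the set, not just one.

    prior:      {hypothesis: P[hypothesis]}
    likelihood: {hypothesis: P[evidence | hypothesis]}
    """
    p_evidence = sum(likelihood[h] * prior[h] for h in prior)
    if p_evidence == 0:
        raise ValueError("P[evidence] must be non-zero")
    return {h: likelihood[h] * prior[h] / p_evidence for h in prior}


# Made-up numbers: three hypotheses about a coin; evidence is one flip landing heads.
prior = {
    "fair": Fraction(1, 2),
    "two-headed": Fraction(1, 4),
    "two-tailed": Fraction(1, 4),
}
likelihood = {
    "fair": Fraction(1, 2),
    "two-headed": Fraction(1),
    "two-tailed": Fraction(0),
}

posterior = update(prior, likelihood)
for hypothesis, prob in posterior.items():
    print(hypothesis, prob)
```

The check for a zero denominator corresponds to the P[b] ≠ 0 condition, and the loop over every hypothesis corresponds to the ‘for all’.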

### Ignoring the difference between probability and probability density

Another typically Bayesian bad habit is to state theorems for probabilities and then breezily generalise them to analogous theorems about probability densities.

Typically, there are analogous laws for probabilities and probability densities, but it is unsettling and confusing to flip between the two without acknowledging the fundamental difference in meaning and the lack of proof or even explanation.
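
One quick way to see that the two are different kinds of number is that a probability density can be greater than 1. The Python sketch below uses `NormalDist` from the standard library; the mean of 0 and standard deviation of 0.1 are arbitrary values picked for the illustration.

```python
from statistics import NormalDist

# Illustrative model: a normal distribution with mean 0 and standard deviation 0.1
model = NormalDist(mu=0.0, sigma=0.1)

density_at_mean = model.pdf(0.0)                    # a density, roughly 3.99
prob_near_mean = model.cdf(0.1) - model.cdf(-0.1)   # a probability, roughly 0.68

print(density_at_mean > 1)       # True: densities are not limited to [0, 1]
print(0 <= prob_near_mean <= 1)  # True: probabilities are
```

A density only becomes a probability once it has been integrated over a range of values, which is why swapping one for the other in a theorem needs some justification.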

### P(A|B) notation

The notation ‘P(A|B)’ is pronounced ‘the probability of A given B’. If you want to be more Bayesian you could say ‘the probability that A is true given that B is true’ and if you want to be more Frequentist you could say ‘the probability of A happening given that B has happened.’ The vertical line indicates a conditional probability.

So what is P? In the expression P[A], P was a function that took one event as input and returned the probability of that event. In P(A|B) the result is again a probability but this time either:

• P takes as input two events, one that has happened (B) and one that is still to happen (A); or

• the event that has happened is a given and, in effect, P( |B) is the name of a function that returns probabilities given events.

Either one object is changing its type or the notation for a function name is highly irregular. This is another of those oddities that could not happen if types were explicitly stated. If I do that then the logical approach seems to be something like this, where a conditional probability distribution is seen as a function that returns a function from events to probabilities:

Pc : EVENT → (EVENT → PROB)

If B is given, the probability distribution it selects from the conditional probability distribution can be written as Pc[B].
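
Here is a minimal Python sketch of that type. The die, the equal weights, and the event names are assumptions made for the example; `Pc` follows the type statement above by taking the given event first and returning a function from events to probabilities.

```python
from fractions import Fraction
from typing import Callable, FrozenSet

Event = FrozenSet[int]
Prob = Fraction

# Illustrative Ω: one roll of a six-sided die, equal weights assumed.
OMEGA: Event = frozenset(range(1, 7))


def P(event: Event) -> Prob:
    """Unconditional probability of an event."""
    return Fraction(len(event), len(OMEGA))


def Pc(given: Event) -> Callable[[Event], Prob]:
    """Pc : EVENT → (EVENT → PROB)."""
    if P(given) == 0:
        raise ValueError("the given event must have non-zero probability")

    def conditional(event: Event) -> Prob:
        return P(event & given) / P(given)

    return conditional


even = frozenset({2, 4, 6})
high = frozenset({4, 5, 6})

P_given_even = Pc(even)      # the distribution selected by B = even
print(P_given_even(high))    # 2/3
```

With this design P keeps a single type throughout, and the conditional distribution Pc[B] is an object in its own right that can be named and reused.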

### p(x; θ) notation

This notation is used often in Bayesian theory in various ways. The items after the semicolon are parameters in a probability model that returns probabilities (or probability densities) for given events (or real numbers), x. As with the P(A|B) notation, the type of ‘p’ shifts.

To me the logical choice of type for the probability density version would be something like:

p : Real → (Real → Real)

This says that the p here is a function that takes as input a real number (the θ, a model parameter value) and returns a function that gives probability densities for given values of a number.

Once again, p is something different and explicit types would show what is really going on.
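
As a sketch of that type in Python, the function below takes the parameter first and returns a density function. The exponential density is just a convenient example chosen here, with θ as its rate parameter; nothing in the notation itself says which family of densities is meant.

```python
import math
from typing import Callable


def p(theta: float) -> Callable[[float], float]:
    """p : Real → (Real → Real), using an exponential density as the example.

    theta is the rate parameter; the returned function gives densities for x.
    """
    if theta <= 0:
        raise ValueError("theta must be positive for this example")

    def density(x: float) -> float:
        return theta * math.exp(-theta * x) if x >= 0 else 0.0

    return density


density_for_rate_2 = p(2.0)      # fix the parameter first
print(density_for_rate_2(0.0))   # 2.0, a density rather than a probability
```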

### X ~ N(μ, σ²) notation

This notation is pronounced ‘X is normally distributed, where the normal distribution has the usual parameters, μ and σ².’ I have also seen small samples of numbers described as ‘~ N(μ, σ²)’ which raises questions about what the notation really means.

## Miscellaneous confusions

### ‘Error’

Picture an early physicist or astronomer measuring the temperature of some gas or the position of the moon perhaps. There is a true value that the measurements are trying to ascertain, but those measurements are not quite accurate. Measure the same thing over and over again and the result will be lots of slightly different measurements. This is the situation that the language of probability and statistics most often reflects, and the same language is used even when there is no experiment and when measurements are accurate, or where their errors are unimportant.

A good example of this is regression. Imagine you have measured the heights of thousands of boys of different ages. For any given age you have measured many boys' heights and know that they vary quite widely even for boys of the same age. If a regression line was fitted to this data the differences between the line and the actual heights of boys would be called ‘error’ or perhaps even ‘random error’, but they are not the result of your inaccurate measurement. They are there because height is driven by more than just age and your regression only uses age. There is no error to speak of.

It is better to use the term ‘residuals’ for these differences.
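
The following Python sketch fits a straight line by ordinary least squares and computes the differences between the line and the actual heights. The ages and heights are made-up numbers for illustration only, and the variable names are choices for this example.

```python
# Made-up ages (years) and heights (cm), for illustration only
ages = [5, 5, 6, 6, 7, 7, 8, 8]
heights = [108.0, 112.0, 113.0, 119.0, 120.0, 124.0, 125.0, 131.0]

n = len(ages)
mean_age = sum(ages) / n
mean_height = sum(heights) / n

# Ordinary least squares for a straight line: height ≈ intercept + slope * age
slope = (
    sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
    / sum((a - mean_age) ** 2 for a in ages)
)
intercept = mean_height - slope * mean_age

# Residual = actual height minus the height on the fitted line
residuals = [h - (intercept + slope * a) for a, h in zip(ages, heights)]
print([round(r, 2) for r in residuals])
```

Every height in this made-up data is taken as exactly right, yet the differences are not zero: they reflect whatever the line leaves out, not mistakes in measurement.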

### The problem of zero

What does a probability of zero mean? It is usually said to mean something is certainly not true, or certain not to happen, or that it is impossible. And yet, such impossibilities routinely happen. Take any of the famous probability density functions, such as the normal distribution, and the probability of any particular value is said to be zero even though one must be true. We can be certain that an ‘impossible’ result will occur. That makes no sense.

One alternative approach to resolve this is to say that, with something like the normal distribution, the probability of any particular value is infinitesimal, not zero. It's a deep topic for mathematicians, but I don't think they can regard the matter as closed until the linguistic problem of impossible events happening has been resolved.

## References

Fokkinga, M (2006). Z-style notation for probabilities.

Kolmogorov, A (1933). Foundations of The Theory of Probability.

Ross, S (2013). A first course in probability (9th edition).

Wolfram, S (2000). Mathematical notation past and future.
