1.1 Introduction to Probability
Probability theory is a branch of mathematics concerned with the study of uncertainty and randomness. It provides a framework for quantifying uncertainty and making predictions in situations where outcomes are undetermined. Probability theory is widely used in fields as diverse as statistics, finance, science and engineering to model and analyse uncertain situations and to make informed decisions based on the likelihood of different outcomes.
In this chapter, we will introduce some basic concepts of probability, including sample spaces, events, the axioms of probability, conditional probability, and Bayes’ theorem.
1.1.1 What is probability?
Probability is a branch of mathematics that deals with the study of uncertainty and randomness. It provides a framework for quantifying uncertainty and making predictions in situations where outcomes are not certain.
Probability theory is widely applied in various fields such as statistics, finance, science, and engineering to model and analyze uncertain situations and make informed decisions based on the likelihood of different outcomes.
There are two kinds of models used to describe phenomena in the real world: deterministic models and probabilistic models. These are two different types of mathematical models used in various fields to represent and analyze systems or phenomena.
(1) Deterministic Model:
In a deterministic model, the outcome is completely determined by the initial conditions and the parameters of the model.
There is no randomness or uncertainty involved in the system being modeled.
Given the same set of initial conditions and parameters, a deterministic model will always produce the same result.
Deterministic models are often used when the system being studied is assumed to be predictable and without random variations.
Example 1.1
A simple mathematical equation like \(y=mx+b\) represents a deterministic model. If you know the values of \(m\), \(x\), and \(b\), you can precisely determine the value of \(y\).
(2) Probabilistic Model:
In a probabilistic model, the outcome is described using probability distributions, reflecting the inherent uncertainty or randomness in the system.
These models acknowledge that certain events or parameters are subject to variation and are not known with certainty.
Probabilistic models are commonly used when dealing with systems involving random or unpredictable elements.
They provide a range of possible outcomes and the likelihood of each outcome occurring.
Example 1.2
A model predicting the likelihood of rain tomorrow could be probabilistic. It may provide a probability distribution of possible rainfall amounts, rather than a single deterministic prediction.
In summary, deterministic models assume perfect predictability and do not account for randomness, while probabilistic models incorporate uncertainty and provide a range of possible outcomes with associated probabilities. The choice between these models depends on the nature of the system being studied and the degree of certainty or randomness involved in its behavior.
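A minimal sketch of this contrast in R (the function names are illustrative): a deterministic model returns the same value every time for the same inputs, while a probabilistic model returns a different draw on each call.
# Deterministic model: y = m*x + b, completely determined by its inputs
deterministic_y <- function(m, x, b) m * x + b
deterministic_y(2, 3, 1)   # always 7
## [1] 7
# Probabilistic model: the same inputs plus random noise give varying outcomes
probabilistic_y <- function(m, x, b) m * x + b + rnorm(1, mean = 0, sd = 1)
probabilistic_y(2, 3, 1)   # a random value near 7
probabilistic_y(2, 3, 1)   # very likely a different value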
We cannot predict outcomes perfectly in probabilistic models. However, the different outcomes still follow certain patterns, and these patterns are what probability theory studies.
1.1.2 Basic Mathematics
1.1.2.1 Combinatorics
Before reading further about probability, we first need some basic mathematical notation from combinatorics.
(1) With replacement
List
Suppose we have a list of length \(r\), denoted by \((x_1,\cdots,x_r)\). If there are \(n_i\) choices for \(x_i\), \(i=1, \cdots, r\), then there are \[ n_1 \times n_2 \times \cdots \times n_r \] possible choices for \((x_1,\cdots,x_r)\).
Example 1.3 A seat number in a train has the form \(\square\bigcirc\bigcirc\), where \(\square\) is a capital letter and each \(\bigcirc\) is a digit \(0\sim9\). Therefore, there are in total \(26 \times 10 \times 10 = 2600\) possible seat numbers.
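A quick check of this count in R, enumerating every letter-digit-digit combination with expand.grid() (the variable name seats is illustrative):
seats <- expand.grid(letter = LETTERS, digit1 = 0:9, digit2 = 0:9)  # all possible seat numbers
nrow(seats)   # 26 * 10 * 10
## [1] 2600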
(2) Without replacement
Permutations
Suppose now that we have \(n\) objects, then there are \[ n! := n(n − 1)(n − 2) \cdots 3 \cdot 2 \cdot 1 \] different permutations of the \(n\) objects.1
Example 1.4 Imagine you have three cups of different colors: red \(R\), blue \(B\), and green \(G\). You want to arrange these three cups in a straight line. How many ways can you do that?
Listing all possible arrangements: \(RBG\), \(RGB\), \(BRG\), \(BGR\), \(GRB\), \(GBR\)
The number of permutations is: \[ 3! = 3 \times 2 \times 1 = 6 \] Therefore, there are exactly \(3!=6\) unique ways to arrange these cups.
In statistics, permutations are used to calculate the number of possible arrangements of a set of items when the order of the items matters. Permutations account for all possible ways of ordering a specific number of items from a larger set. This concept is widely used in probability, cryptography, and various fields of data analysis. The general formula for a permutation is: \[ P(n, r) = \frac{n!}{(n-r)!}, \] where \(n\) is the total number of items, \(r\) is the number of items to arrange, \(n!\) is the product of all positive integers up to \(n\).
Example 1.5 How many different ways can you arrange 3 books from a set of 5 books on a shelf?
This is a permutation problem because the order of the books matters. We can use the permutation formula to calculate: \[ P(5, 3) = \frac{5!}{(5-3)!} = \frac{5 \times 4 \times 3 \times 2!}{2!} = 5 \times 4 \times 3 = 60 \] Therefore, there are 60 different ways to arrange 3 books from a set of 5.
Combinations
The number of different groups of \(r\) items that could be formed from a set of \(n\) items is \[ {{n}\choose{r}} = \frac{n!}{(n-r)!r!} \] which represents the number of possible combinations of \(n\) objects taken \(r\) at a time.
Example 1.6 Imagine you have five different fruits: apple \(A\), banana \(B\), cherry \(C\), date \(D\), and elderberry \(E\). You want to choose 3 fruits to make a fruit salad. How many different ways can you choose 3 fruits from the 5 available?
The number of combinations is calculated by: \[ \binom{5}{3} = \frac{5!}{3! \cdot (5-3)!} = \frac{5 \cdot 4 \cdot 3!}{3! \cdot 2 \cdot 1} = \frac{5 \cdot 4}{2 \cdot 1} = 10 \]
Listing all possible groups of 3 fruits: \(ABC\), \(ABD\), \(ABE\), \(ACD\), \(ACE\), \(ADE\), \(BCD\), \(BCE\), \(BDE\), \(CDE\)
Therefore, there are exactly \(\binom{5}{3} = 10\) unique ways to choose 3 fruits from the 5 available.
Note that in the definition \(n \geq r \geq 1\); for other ranges we have: \[ {{n}\choose{r}} = \begin{cases} 1, & r=0 \\ 0, & (0\leq n<r) \lor (r<0\leq n) \\ (-1)^{r} {{|n|+r-1}\choose{r}}, & (n<0) \land (r>0) \\ (-1)^{n+r} {{|r|-1}\choose{|n|-1}}, & (n<0) \land (r<0) \end{cases} \]
In R programming, you can use factorial(x) and choose(n, k) to calculate factorials, permutations, and combinations; for instance, the \(3!\) arrangements of the cups in Example 1.4:
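factorial(3)   # 3! ways to arrange the three cups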
## [1] 6
permute <- function(n, r) {
return(factorial(n) / factorial(n - r))
}
permute(5,3) # 5! / (5-3)! = (5 * 4 * 3 * 2 * 1) / (2 * 1) = 60
## [1] 60
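Similarly, the number of combinations in Example 1.6:
choose(5, 3)   # ways to choose 3 fruits from 5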
## [1] 10
Multinomial coefficients
A set of \(n\) distinct items is to be divided into \(r\) distinct groups of respective sizes \(n_1, n_2, \cdots, n_r\), where \(\sum_{i=1}^r n_i = n\). How many different divisions are possible?
Note that there are \({{n}\choose{n_1}}\) possible choices for the first group; for each choice of the first group, there are \({{n-n_1}\choose{n_2}}\) possible choices for the second group; for each choice of the first two groups, there are \({{n-n_1-n_2}\choose{n_3}}\) possible choices for the third group; and so on. Hence, there are \[ {{n}\choose{n_1}} {{n-n_1}\choose{n_2}} \cdots {{n-n_1-\cdots-n_{r-1}}\choose{n_r}} = \frac{n!}{n_1! \cdots n_r!} \] possible divisions.
If \(\sum_{i=1}^r n_i = n\), we define multinomial coefficients by \[ {{n}\choose{n_1,n_2,\cdots,n_r}} = \frac{n!}{n_1! \cdots n_r!} \]
Example 1.7 Imagine you have a box containing 6 balls: 2 red balls, 3 blue balls, and 1 green ball.
You want to know how many ways you can arrange all these balls in a line, considering that some of them are indistinguishable (the red and blue balls).
The number of ways to arrange them is calculated using the multinomial coefficient: \[ \binom{6}{2,3,1} = \frac{6!}{2!3!1!} = \frac{720}{12} = 60 \] So there are exactly 60 ways to arrange 2 red balls, 3 blue balls, and 1 green ball in a line.
Calculation in R, for instance directly from factorials:
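factorial(6) / (factorial(2) * factorial(3) * factorial(1))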
## [1] 60
You can also define a custom function to compute it:
multinom <- function(counts) {
factorial(sum(counts)) / prod(factorial(counts))
}
multinom(c(2,3,1))
## [1] 60
Cyclic permutation
A cyclic permutation is a rearrangement of elements where the order is shifted circularly. Unlike standard permutations, where all orderings are considered distinct, cyclic permutations treat rotations of a sequence as the same arrangement. For \(r\) elements selected from a set of \(n\) elements and arranged in a circle, the number of cyclic permutations is \(Q(n,r)=\frac{P(n,r)}{r} = \frac{n!}{r \cdot (n-r)!}\); in particular, arranging all \(n\) elements in a circle gives \(Q(n,n)=(n-1)!\) arrangements. This reduction occurs because rotating a sequence does not create a new permutation.
Example 1.8 Consider three friends sitting around a circular table: Alice (\(A\)), Bob (\(B\)), and Charlie (\(C\)).
In a circular arrangement, rotating the group doesn’t create a new permutation. For example:
\(ABC\), \(BCA\), \(CAB\) are all considered the same cyclic permutation.
However, flipping the order changes it: \(ACB\), \(CBA\), \(BAC\)
The number of cyclic permutations for 3 people is calculated as: \[ (3-1)! = 2! = 2 \]
The two unique cyclic permutations for this example are: \(A, B, C\) (or any rotation of this order) and \(A, C, B\) (or any rotation of this order)
Thus, there are only 2 distinct ways to seat three friends around a circular table.
Let’s see a more complex case:
Example 1.9 Seating Arrangement with Restrictions. There are 5 people: Alice (\(A\)), Bob (\(B\)), Charlie (\(C\)), Diana (\(D\)), and Eva (\(E\)). Since Alice and Bob are a couple, they must sit together.
The goal is to find how many distinct ways we can arrange these 5 people around a circular table, with the restriction that Alice and Bob must sit together.
First, treat Alice and Bob as a single unit. Since Alice and Bob must sit together, treat them as a “block” or a single unit. Now, instead of arranging 5 people, we have 4 units to arrange: 1. Alice and Bob (as a block), 2. Charlie, 3. Diana, 4. Eva
Second, count the cyclic permutations of 4 units. Since we are arranging the 4 units around a circular table, the number of cyclic permutations is: \[ (n-1)! = (4-1)! = 3! = 6 \] So, there are 6 ways to arrange the 4 units around the table.
Note that we also need to account for Alice and Bob’s internal arrangement: within their block, Alice and Bob can be arranged in 2 ways, \(AB\) or \(BA\).
Finally, to get the total number of distinct arrangements, we multiply the number of cyclic permutations of the 4 units by the number of ways Alice and Bob can be arranged within their block: \[ \text{Total arrangements} = 3! \times 2 = 12 \]
Thus, there are 12 distinct ways to arrange these 5 people around the table, with Alice and Bob sitting together.
In R programming, we can write a function to calculate cyclic permutations; one possible implementation of \(Q(n,r)\) is sketched below, applied to the 4 units of Example 1.9:
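cyclic_permute <- function(n, r) {
  factorial(n) / (r * factorial(n - r))   # Q(n, r) = P(n, r) / r
}
cyclic_permute(4, 4)   # all 4 units around a circle: (4 - 1)! = 6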
## [1] 6
Why do we introduce permutations and combinations? Because in classical probability we need to count the sample points in an event and in the sample space, as will be introduced below.
1.1.2.2 Set Theory
The following introduces some notation from set theory.
Suppose two sets \(A,B \subseteq \Omega\), define
- \(\varnothing=\{\}\) (empty set)
- \(A^c = \{ \omega | \omega \in \Omega \land \omega \notin A\} = \Omega \setminus A\) (complement of the set \(A\))
- \(A \cup B = \{ \omega | \omega \in A \vee \omega \in B\}\) (union of the sets \(A,B\))
- \(A \cap B = \{ \omega | \omega \in A \land \omega \in B\}\) (intersection of the sets \(A,B\))
- \(A \setminus B = A - B = \{ \omega | \omega \in A \land \omega \notin B\} = A \cap B^c\) (difference of the sets \(A,B\))
- \(A \Delta B = (A-B) \cup (B-A) = (A \cup B) - (A \cap B)\) (symmetric difference of the sets \(A,B\))
- If \(A \cap B = \varnothing\), then we write \(A+B = A \cup B\) (sum of the disjoint sets \(A,B\))
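These operations can be tried directly in R, which provides union(), intersect(), and setdiff() for vectors treated as sets (the sets below are illustrative):
Omega <- 1:8
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
union(A, B)                           # A union B
## [1] 1 2 3 4 5 6
intersect(A, B)                       # A intersect B
## [1] 3 4
setdiff(A, B)                         # A minus B
## [1] 1 2
setdiff(Omega, A)                     # complement of A relative to Omega
## [1] 5 6 7 8
union(setdiff(A, B), setdiff(B, A))   # symmetric difference of A and B
## [1] 1 2 5 6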
If \(A,B \subseteq \Omega\), and \(A \subseteq B, B \subseteq A\), then \(A=B\).
Proposition:
Suppose \(A,B,C \subseteq \Omega\), then
- \(\left(A^c \right)^c = A\)
- \(A \cap A^c = \varnothing, \quad A \cup A^c = \Omega\)
- \(A \cap \Omega = A, \quad A \cup \varnothing = A\) (\(\Omega\) and \(\varnothing\) are the identity elements of \(\cap\) and \(\cup\), respectively)
- \(A \cap A = A, \quad A \cup A = A\) (idempotency law)
- \(A \cap B = B \cap A, \quad A \cup B = B \cup A\) (commutative law)
- \(A \cap (B \cap C) = (A \cap B) \cap C, \quad A \cup (B \cup C) = (A \cup B) \cup C\) (associative law)
- \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C), \quad A \cup (B \cap C) = (A \cup B) \cap (A \cup C)\) (distributive law)
- If \(A \subseteq B\), then \(A \cap B = A, \quad A \cup B = B\)
- \((A \cap B)^c = A^c \cup B^c, \quad (A \cup B)^c = A^c \cap B^c\) (De Morgan’s laws)
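As a quick numerical check of De Morgan’s laws in R with small illustrative sets (setequal() compares vectors as sets):
Omega <- 1:8
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
comp <- function(E) setdiff(Omega, E)                      # complement within Omega
setequal(comp(intersect(A, B)), union(comp(A), comp(B)))   # (A intersect B)^c equals A^c union B^c
## [1] TRUE
setequal(comp(union(A, B)), intersect(comp(A), comp(B)))   # (A union B)^c equals A^c intersect B^c
## [1] TRUE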
Let \(\{A_i\}_{i=1}^\infty\) be a sequence of sets; then \(\bigcup_{i=1}^\infty A_i = A_1 + \sum_{k=2}^\infty A_1^cA_2^c \cdots A_{k-1}^cA_k\).
Here we only discuss the most basic concepts of set theory. More advanced topics will be introduced in Chapter 47.
Venn Diagrams
A Venn diagram is a visual tool that displays the relationships between sets. It uses circles or other shapes to show the extent to which different sets overlap. Each circle usually represents a set, and the overlap between circles represents the elements that are common to those sets.
1.1.3 History of probability
1.1.3.1 Experiment, Sample space and Events
Experiment
An experiment refers to a process or procedure that generates an outcome (result). For many experiments, repeating the procedure under the same conditions yields essentially the same result, apart from small measurement errors.
However, in the context of probability and statistics, the outcome is often uncertain or random; such an experiment is called a random experiment. A random experiment is a specific type of experiment in which the outcome is not predictable with certainty. This means that, even if the experiment is performed under the same conditions, the outcome may vary. The purpose of studying experiments is to analyze the probability of different outcomes.
More precisely, experiments possessing the following three characteristics are referred to as random experiments:
- It is possible to completely list and describe all possible outcomes before conducting the experiment.
- The result of the experiment cannot be predicted in advance among the numerous possible outcomes.
- The experiment can be repeated under identical conditions.
Sample space
The sample space, represented by \(\Omega\) or \(\mathcal{S}\), is the set of all possible outcomes in a random experiment. Each element in the sample space is called a sample point. The sample space can be divided into two types based on the number of sample points: discrete sample space, where the number of sample points is finite or countable, and continuous sample space, where the number of sample points is uncountable.
Countable means that there is a one-to-one correspondence between the set \(A\) and (a subset of) \(\mathbb{N}\); this notion is covered in advanced calculus.
Events
A subset of a sample space is called an event \(E\), denoted by \(E \subset \Omega\). If \(E,F\) are two events in a sample space \(\Omega\) and \(E \cap F = \varnothing\), then we say two events \(E,F\) are disjoint (or mutually exclusive).
Example 1.10 Tossing two fair coins is a random experiment because the outcome (heads \(H\) or tails \(T\)) is uncertain and not predictable with certainty. The sample space is \(\Omega=\{HH,HT,TH,TT\}\), and \(HT \in \Omega\) is a sample point. The event that one outcome is heads and the other is tails is \(E=\{HT,TH\}\). Let \(F\) be the event that the two coins show the same side, so \(F=\{HH,TT\}\); then \(E,F\) are disjoint.
1.1.4 Definitions of Probability
There are several definitions of probability, mainly including classical probability, geometric probability, relative frequency probability, and subjective probability. These are different approaches or interpretations of probability, each with its own characteristics and applications.
Classical probability
Suppose the sample space \(\Omega\) is finite. The probability \(P(E)\) of an event \(E\) is defined as \[ P(E)=\frac{N(E)}{N(\Omega)} \] where \(N(\cdot)\) represents the number of elements (sample points) in a set (event).
Classical probability is based on the assumption that all outcomes in the sample space are equally likely. It is applicable to situations where each outcome is equally likely to occur.
Example 1.11 Tossing two fair coins. The probability that one outcome is heads and the other is tails is \(P(E)=\frac{N(E)}{N(\Omega)}=\frac{2}{4}=0.5\)
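A sketch of this computation in R, enumerating the sample space of the two-coin experiment explicitly (the variable names are illustrative):
Omega <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                     stringsAsFactors = FALSE)   # the 4 equally likely outcomes
E <- Omega$coin1 != Omega$coin2                  # event: one heads and one tails
sum(E) / nrow(Omega)                             # P(E) = N(E) / N(Omega)
## [1] 0.5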
Geometric probability
Geometric probability is used in situations where the sample space is geometrically defined, and the probability is determined by measuring geometric properties.
Example 1.12 If you throw a dart at a square target and want to find the probability of hitting a specific region within the square, the ratio of the area of that region to the total area of the square represents the geometric probability.
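A Monte Carlo sketch of this idea in R (illustrative only, not part of the original example): throw many uniform random darts at the unit square and estimate the probability of hitting the inscribed circle, whose exact geometric probability is the area ratio \(\pi (0.5)^2 / 1 = \pi/4 \approx 0.785\).
set.seed(42)                                  # for reproducibility
n <- 100000
x <- runif(n)                                 # dart positions, uniform on the unit square
y <- runif(n)
hit <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.5^2     # inside the inscribed circle of radius 0.5
mean(hit)                                     # estimate; close to pi / 4 = 0.7853982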
Relative frequency probability
Relative frequency probability is based on observed frequencies in the real world. It involves conducting experiments and calculating the probability of an event based on the frequency of its occurrence in those experiments.
Suppose the sample space is \(\Omega\), an experiment is repeatedly performed under exactly the same conditions. Let \(n\) be the number of repetitions of the experiment, and \(n_E\) be the number of times that the event \(E\) occurs. Then the probability \(P(E)\) of the event \(E\), is \[ P(E) = \lim_{n \to \infty} \frac{n_E}{n} \]
Example 1.13 Conducting multiple trials of flipping a coin and determining the probability of getting heads based on the observed frequency.
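A simulation sketch of Example 1.13 in R (the particular numbers depend on the random seed): as the number of flips \(n\) grows, the relative frequency \(n_E/n\) of heads settles near \(0.5\).
set.seed(123)
flips <- sample(c("H", "T"), size = 10000, replace = TRUE)     # 10,000 fair coin flips
n_E <- cumsum(flips == "H")                    # running number of heads
n <- seq_along(flips)                          # number of flips so far
round((n_E / n)[c(10, 100, 1000, 10000)], 3)   # relative frequency after 10, 100, 1000, 10000 flips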
The advantage of relative frequency probability is that the number of sample points in the sample space can be infinite (even uncountable). Also, we do not have to assume that all outcomes in the sample space are equally likely. However, we cannot repeat an experiment infinitely many times in the real world, and the limit may not converge. Most importantly, we cannot apply relative frequency probability to an experiment that cannot be repeated.
Note that relative frequency probability is also known as objective probability.
Subjective probability
Here the probability of an event \(E\) is simply a number \(P(E) \in [0,1]\) assigned by an individual. Subjective probability is based on personal judgment or belief about the likelihood of an event occurring. It is often used in situations where there is uncertainty, and probabilities are assigned subjectively.
Subjective probability can be applied to a case that can’t be repeated, but it’s really “subjective”. Given the same information and under the same conditions, two people can have different subjective probabilities. This is why subjective probability is also called personal probability.
Example 1.14 Predicting the likelihood of rain tomorrow based on personal experience and intuition. Different individuals may assign different probabilities to the same event based on their subjective beliefs.
Remark. Classical probability is also called a priori probability, and relative frequency probability is also called a posteriori probability. This is because classical probabilities are determined or predicted before any actual experimentation or observation, whereas relative frequency probabilities are determined or estimated from empirical evidence, observations, or data obtained through experimentation.
Each type of probability has its strengths and weaknesses, and the choice of which to use depends on the nature of the problem and the available information.
Axioms of probability
Axiom 1: (non-negativity) For any event \(E \subset \Omega\), \(P(E) \geq 0\).
Axiom 2: (normalization) \(P(\Omega)=1\).
Axiom 3: (countable additivity) For any sequence of mutually exclusive events \(E_1,E_2,\cdots\) in a sample space \(\Omega\), i.e. \(E_i \cap E_j = \varnothing\) for all \(i \neq j\), we have \(\displaystyle P \left(\bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i)\).
A sample space \(\Omega\) together with a probability \(P\) satisfying these three axioms is called a probability space.
Proposition:
\(P(A^c) = 1 - P(A)\)
\(P(A) \leq 1\)
\(P(\varnothing)=0, \quad P(\Omega)=1\)
If \(A \subseteq B\), then \(P(A) \leq P(B)\).
\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\). If \(A,B\) are disjoint, then \(P(A \cup B) = P(A) + P(B)\) (addition rule)
The law of total probability \(P(A)=P(A \cap B)+P(A \cap B^c)\)
Boole’s inequality \(P(A \cup B) \leq P(A) + P(B)\)
Bonferroni’s inequality \(P(A \cap B) \geq P(A)+P(B)-1\)
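These propositions can be checked numerically on a small equally likely sample space; a sketch in R with a fair die and two illustrative events:
Omega <- 1:6                                   # a fair die roll
P <- function(E) length(E) / length(Omega)     # classical probability of a subset of Omega
A <- c(1, 2, 3)                                # roll at most 3
B <- c(2, 4, 6)                                # roll an even number
P(union(A, B))                                 # addition rule: equals P(A) + P(B) - P(A intersect B)
## [1] 0.8333333
P(A) + P(B) - P(intersect(A, B))
## [1] 0.8333333
P(union(A, B)) <= P(A) + P(B)                  # Boole's inequality
## [1] TRUE
P(intersect(A, B)) >= P(A) + P(B) - 1          # Bonferroni's inequality
## [1] TRUE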
1.1.5 Conditional Probability and Independence
1.1.5.1 Conditional Probability
Conditional probability deals with the likelihood of an event occurring given that another event has already occurred. Here we give the formal definition:
Let events \(A,B \subseteq \Omega\). If \(P(B) \neq 0\), then the conditional probability of \(A\) on \(B\) is \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
The event \(B\) can be viewed as a new sample space: conditional probability also satisfies the three axioms of probability, so \(P(\cdot|B)\) is itself a probability.
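A sketch of the definition in R, using two fair dice (the events are chosen for illustration): the conditional probability that the sum is 8 given that the first die shows at least 4.
Omega <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 equally likely outcomes
P <- function(E) sum(E) / nrow(Omega)          # E is a logical vector indexing Omega
A <- Omega$die1 + Omega$die2 == 8              # event: the sum is 8
B <- Omega$die1 >= 4                           # event: the first die is at least 4
P(A & B) / P(B)                                # P(A|B) = P(A and B) / P(B) = (3/36) / (18/36)
## [1] 0.1666667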
Proposition:
Suppose events \(A,B,C \subseteq \Omega\), then
\(P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\) (multiplication rule)
\(P(C|C)=1\)
\(P(A^c|C) = 1 - P(A|C)\)
\(P(\varnothing|C)=0, \quad P(\Omega|C)=1\)
If \(A \subseteq B\), then \(P(A|C) \leq P(B|C)\).
\(P(A \cup B|C) = P(A|C) + P(B|C) - P(A \cap B|C)\). If \(A,B\) are disjoint, then \(P(A \cup B|C) = P(A|C) + P(B|C)\) (addition rule)
The law of total probability \(P(A|C)=P(A \cap B|C)+P(A \cap B^c|C)\)
Boole’s inequality \(P(A \cup B|C) \leq P(A|C) + P(B|C)\)
Bonferroni’s inequality \(P(A \cap B|C) \geq P(A|C)+P(B|C)-1\)
Note that these rules hold only when all probabilities are conditioned on the same event; combining probabilities conditioned on different events is not meaningful.
1.1.5.2 Independence
Suppose events \(A,B \subseteq \Omega\) with \(P(B) > 0\). We say the two events \(A,B\) are independent if \[ P(A|B)=P(A) \] denoted by \(A \perp\!\!\!\perp B\).
Note that the following are equivalent
\[ \begin{split} A \perp\!\!\!\perp B &\Longleftrightarrow P(A|B)=P(A) \\ &\Longleftrightarrow P(B|A)=P(B) \\ &\Longleftrightarrow P(A \cap B)=P(A)P(B) \end{split} \]
It is important to note that independent events and mutually exclusive events are different concepts.
Proposition:
Suppose two events \(A,B \subseteq \Omega\), then \[ A \perp\!\!\!\perp B \Longleftrightarrow A^c \perp\!\!\!\perp B \Longleftrightarrow A \perp\!\!\!\perp B^c \Longleftrightarrow A^c \perp\!\!\!\perp B^c \]
Pairwise independence
The following discusses independence for three or more events.
Suppose there are three events \(A,B,C \subseteq \Omega\). We say \(A,B,C\) are mutually independent if and only if the following two conditions are satisfied:
- \(P(A \cap B)=P(A)P(B), \quad P(A \cap C)=P(A)P(C), \quad P(B \cap C)=P(B)P(C)\)
- \(P(A \cap B \cap C)=P(A)P(B)P(C)\)
If only the first condition is satisfied, we say \(A,B,C\) are pairwise independent.
It is clear that mutually independent events must be pairwise independent, but the converse is not necessarily true.
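The classic two-coin construction makes this concrete; a sketch in R (events chosen for illustration): with \(A\) = first coin is heads, \(B\) = second coin is heads, and \(C\) = exactly one head, the three events are pairwise independent but not mutually independent.
Omega <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                     stringsAsFactors = FALSE)     # 4 equally likely outcomes
P <- function(E) sum(E) / nrow(Omega)
A <- Omega$coin1 == "H"                            # first coin is heads
B <- Omega$coin2 == "H"                            # second coin is heads
C <- (Omega$coin1 == "H") != (Omega$coin2 == "H")  # exactly one head
P(A & B) == P(A) * P(B)                            # TRUE  (pairwise)
P(A & C) == P(A) * P(C)                            # TRUE  (pairwise)
P(B & C) == P(B) * P(C)                            # TRUE  (pairwise)
P(A & B & C) == P(A) * P(B) * P(C)                 # FALSE (not mutually independent)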
1.1.6 Bayes’ Theorem
Suppose a sequence of sets \(A_1,A_2,\cdots,A_n \subseteq \Omega\) satisfies \(A_i \cap A_j = \varnothing\) for all \(i \neq j\) and \(\bigcup_{i=1}^n A_i = \Omega\). Then we say this sequence of sets is a partition of \(\Omega\), denoted by \(\sum_{i=1}^n A_i = \Omega\).
Suppose a sequence of sets \(\{A_i\}_{i=1}^n\) is a partition of \(\Omega\). Then for any event \(B \subseteq \Omega\), we have \[ P(B) = \sum_{i=1}^n P(A_i \cap B) = \sum_{i=1}^n P(B|A_i)P(A_i) \]
Bayes’ Theorem is a fundamental principle in probability theory and statistics. It provides a way to update the probability of a hypothesis based on new evidence or information. The theorem is named after the Reverend Thomas Bayes, who introduced the concept in the 18th century.
Theorem 1.1 (Bayes’ Theorem) Suppose a sequence of sets \(\{A_i\}_{i=1}^n\) is a partition of \(\Omega\). Then for any event \(B \subseteq \Omega\) with \(P(B) > 0\), we have \[ P(A_k|B)=\frac{P(B|A_k)P(A_k)}{\sum_{i=1}^n P(B|A_i)P(A_i)}, \quad k = 1,\cdots,n. \]
\[ P(A_k|B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(A_k \cap B)}{\sum_{i=1}^n P(A_i \cap B)} =\frac{P(B|A_k)P(A_k)}{\sum_{i=1}^n P(B|A_i)P(A_i)} \]
Bayes’ Theorem states that the probability of a hypothesis being true given some new evidence is proportional to the product of the prior probability of the hypothesis and the likelihood of the evidence given that hypothesis, divided by the overall probability of the evidence.
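A numerical sketch of Theorem 1.1 in R, using an illustrative (made-up) diagnostic-test setting: \(A_1\) = disease present with prior \(P(A_1)=0.01\), \(A_2\) = disease absent, and \(B\) = a positive test, with \(P(B|A_1)=0.95\) and \(P(B|A_2)=0.05\).
P_A1 <- 0.01                         # prior P(A_1); A_1 and A_2 form a partition of Omega
P_A2 <- 0.99                         # prior P(A_2) = 1 - P(A_1)
P_B_given_A1 <- 0.95                 # illustrative likelihoods, not real data
P_B_given_A2 <- 0.05
P_B <- P_B_given_A1 * P_A1 + P_B_given_A2 * P_A2   # law of total probability
P_B_given_A1 * P_A1 / P_B                          # Bayes' theorem: P(A_1 | B)
## [1] 0.1610169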
As a generalization, the factorial can be extended beyond non-negative integers via the gamma function \(\Gamma(x)=\int_0^\infty t^{x-1}e^{-t}\, dt\), which satisfies \(\Gamma(n+1) = n!\) for non-negative integers \(n\).↩︎