Probability Fundamentals You Need To Know: Getting Up To Speed In One Article!
What is Probability Theory?¶
Probability theory is the mathematical framework for reasoning about the random nature of our world. It is a powerful tool that can be used to make predictions about the future and to make decisions under uncertainty. But what do we mean by "random"? Maybe we mean that our knowledge of an outcome is incomplete or uncertain. And what do we mean by "probability"? Maybe we mean subjective belief, or the relative frequency of an event. For example, the probability of a coin landing heads is 0.5. One might argue that this is the case because we know that the coin is fair, or because we have observed that the coin lands heads 50% of the time. The first is the perspective of a Bayesian interpretation of probability; the second is the perspective of a frequentist interpretation. Regardless of your philosophical interpretation of "random" and "probability", you can still successfully use probability theory to make predictions about the future and to make decisions under uncertainty.
Let's kick off with some mathematics that will help us understand the logic of probability theory. This will begin with set theory, which is the mathematical framework for reasoning about collections of objects. We will briefly discuss experiments in the probabilistic sense. Then we will introduce the Kolmogorov axioms, which are the axioms that define probability and are the foundations upon which probability theory is built. Finally, we will focus on the key probability measure known as the conditional probability, which is the probability of an event given another event has occurred. This will give us the tools we need to understand the law of total probability, which will then lead us to the Pythagorean theorem of probability: Bayes' Theorem. You should then have a good foundation for proceeding to more advanced topics in probability theory!
The Basics of Set Theory You Need to Know!¶
A set $S$ is simply a collection of objects and its notation, using curly brackets, goes like
$$S = \{1, 2, 3, 4\}$$
for example. Order does not matter, and we can also define a set by stating a property its elements satisfy:
$$S = \{x \in \mathbb{Z} | x > 0\}$$
where $\mathbb{Z}$ is the set of integers, $\in$ is the "element of" symbol, and $|$ is the "such that" symbol. This is read as "the set of all integers $x$ such that $x$ is greater than 0". But what exactly is $\mathbb{Z}$? It is the set of all integers, which is defined as:
$$\mathbb{Z} = \{..., -2, -1, 0, 1, 2, ...\}$$
Some of the most important sets are:
- $\mathbb{N}$: The set of all natural numbers, $\mathbb{N} = \{1, 2, 3, ...\}$.
- $\mathbb{Z}$: The set of all integers, defined above.
- $\mathbb{Q}$: The set of all rational numbers (any number which can be expressed as a ratio of integers), $\mathbb{Q} = \{\frac{a}{b} | a, b \in \mathbb{Z}, b \neq 0\}$.
- $\mathbb{R}$: The set of all real numbers, which includes the rationals as well as irrational numbers like $\sqrt{2}$ and $\pi$.
- $\mathbb{C}$: The set of all complex numbers, $\mathbb{C} = \{a + bi | a, b \in \mathbb{R}\}$, where $i = \sqrt{-1}$.
Additionally, a set $A$ is a subset of $B$ if all of $A$'s elements are also in $B$. This is denoted by $A \subseteq B$. For example,
$$ \{1, 2, 3\} \subseteq \{1, 2, 3, 4\} $$
Other examples include $\mathbb{N} \subseteq \mathbb{Z}$, and $\mathbb{Z} \subseteq \mathbb{Q}$.
The Universal Set is the set of all possible outcomes of an experiment. For example, if we flip a coin, the universal set is $\{H, T\}$. If we roll a die, the universal set is $\{1, 2, 3, 4, 5, 6\}$. If we roll two dice, the universal set is $\{1, 2, 3, 4, 5, 6\} \times \{1, 2, 3, 4, 5, 6\} = \{(1, 1), (1, 2), (1, 3), ..., (6, 6)\}$.
With regards to set operations, there are four main operations: Union, Intersection, Complement, and Subtraction (a short Python sketch follows these definitions). Here's the quick rundown:
- The Union of two sets $A$ and $B$ is the set of all elements that are in $A$ or $B$ or both. This is denoted by $A \cup B$. For example, $\{1, 2\} \cup \{2, 3\} = \{1, 2, 3\}$.
- The Intersection of two sets $A$ and $B$ is the set of all elements that are in both $A$ and $B$. This is denoted by $A \cap B$. For example, $\{1, 2\} \cap \{2, 3\} = \{2\}$.
- The Complement of a set $A$ is the set of all elements (of the universal set) that are not in $A$. This is denoted by $A^c$ (or $\bar{A}$). For example, if the universal set is $\{1, 2, 3, 4\}$ and $A = \{1, 2\}$, then $A^c = \{3, 4\}$.
- Subtraction is defined as the set of all elements that are in $A$ but not in $B$. This is denoted by $A - B$ and is equivalent to $A \cap B^c$. For example, $\{1, 2, 3\} - \{2, 3\} = \{1\}$.
Mutually Exclusive (or disjoint) sets are sets that have no elements in common. For example, $\{1, 2, 3\}$ and $\{4, 5, 6\}$ are mutually exclusive. Formally, two sets $A$ and $B$ are mutually exclusive if their intersection is the empty set, $A \cap B = \emptyset$, where $\emptyset$ is the empty set, $\{\}$.
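To make these operations concrete, here's a minimal Python sketch (the particular sets are just for illustration) using Python's built-in `set` type:

```python
# Illustrative sets; the universal set here is the faces of a die.
S = {1, 2, 3, 4, 5, 6}           # universal set
A = {1, 2, 3}
B = {2, 3, 4}

print(A | B)                     # union: {1, 2, 3, 4}
print(A & B)                     # intersection: {2, 3}
print(S - A)                     # complement of A relative to S: {4, 5, 6}
print(A - B)                     # subtraction A - B = A ∩ B^c: {1}
print(A.isdisjoint({5, 6}))      # True: {1, 2, 3} and {5, 6} are mutually exclusive
```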
The Partition of a set $A$ is a collection of sets that are pairwise disjoint (any pair is disjoint) and their union is $A$. For example,
$$ \text{a partition of } \{1, 2, 3, 4\} \text{ is } \{\{1, 2\}, \{3, 4\}\} $$
Another example would be the US states, which form a partition of the US, since they are pairwise disjoint and their union is the US.
To define our higher dimensional space, we use the Cartesian product. The Cartesian product of two sets $A$ and $B$ is the set of all ordered pairs $(a, b)$ where $a \in A$ and $b \in B$. For example,
$$ \{1, 2, 3\} \times \{4, 5, 6\} = \{(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)\} $$
and is formally defined as:
$$ A \times B = \{(a, b) | a \in A, b \in B\} $$
When we speak about Cardinality, we are referring to the number of elements in a set. For example, the cardinality of $\{1, 2, 3\}$ is 3. The cardinality of a set $A$ is denoted by $|A|$. Countability means that there is a one-to-one correspondence between the elements of the set and the natural numbers. There are infinite sets which are countable, like $\mathbb{N}$, $\mathbb{Z}$, and $\mathbb{Q}$. There are also infinite sets that are not countable, like any interval on the real line, for example, $\left[0, 1\right)$.
For finite sets, we have a useful equation known as the inclusion-exclusion principle:
$$ |A \cup B| = |A| + |B| - |A \cap B| $$
Intuitively, when we count the elements of $A$ and the elements of $B$ separately, the elements in $A \cap B$ get counted twice, so we subtract $|A \cap B|$ to get the correct answer.
The nice thing about Cartesian products and cardinality is that the cardinality of a Cartesian product is the product of the cardinalities of the sets. For example, with $A = \{1, 2, 3\}$ and $B = \{4, 5, 6\}$, the cardinality of $A \times B$ is $|A| \times |B| = 3 \times 3 = 9$. This is formally stated as:
$$ |A \times B| = |A| \times |B| $$
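Both counting facts are easy to verify in code. Here's a small sketch (again with illustrative sets) checking the inclusion-exclusion principle and the cardinality of a Cartesian product:

```python
from itertools import product

A = {1, 2, 3}
B = {4, 5, 6}

# Cardinality of the Cartesian product equals the product of cardinalities.
cartesian = set(product(A, B))            # {(1, 4), (1, 5), ..., (3, 6)}
assert len(cartesian) == len(A) * len(B)  # 9 == 3 * 3

# Inclusion-exclusion: |A ∪ C| = |A| + |C| - |A ∩ C|.
C = {3, 4, 5}
assert len(A | C) == len(A) + len(C) - len(A & C)  # 5 == 3 + 3 - 1
```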
One last thing is functions. A function, $f$, is a mapping from one set, the domain, to another set, the codomain. Calling $A$ the domain and $B$ the codomain, we write:
$$ f: A \rightarrow B $$
For example, the function $f(x) = x^2$ is a function from $\mathbb{R}$ to $\mathbb{R}$, since the domain and codomain are both the set of real numbers. The set of all possible outputs of a function is called the range of the function and is denoted by $range(f)$. For example, if
$$ f: \mathbb{R} \rightarrow \mathbb{R} \text{ defined as } f(x) = |x| $$
$$ \text{then } range(f) = \mathbb{R}^{+} = \{x \in \mathbb{R} | x \geq 0\} $$

Probabilistic Conception of Experiments¶
There are five main concepts to get started with thinking of experiments in a probabilistic way:
Experiment: An experiment is a procedure where we observe the outcome of a random phenomenon, for example the flipping of a coin.
Outcome: The outcome of an experiment is essentially the result of the experiment. For example, if we flip a coin, the outcome is either heads or tails.
Sample Space: The set of all possible outcomes of an experiment is called the sample space of the experiment, which is the universal set of the experiment. For example, if we draw a card from a deck and record only its rank, the sample space is $\{A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K\}$.
Trial: When we start repeating an experiment, we call each one of them a trial. For example, if we flip a coin 10 times, we have 10 trials.
Event: An event of an experiment is a collection of possible outcomes. It is a subset of the sample space and this is typically what we're trying to assign a probability to. For example, the event of getting an odd number when we roll a die is the set $\{1, 3, 5\}$ and the probability of this event is $\frac{1}{2}$.
What's Probability Again?¶
The Probability of an event $A$ is a number between 0 and 1 that tells us how likely it is that the event will occur. If it is close to 0, then the event is unlikely to occur. If it is close to 1, then the event is highly likely to occur. The probability of an event $A$ is denoted by $P(A)$. This leads us into the Kolmogorov axioms of probability:
The Axioms of Probability
1. $P(A) \geq 0$ for all events $A$.
2. $P(\Omega) = 1$, where $\Omega$ is the sample space.
3. If $A_1, A_2, A_3, ...$ are mutually exclusive events, then
$$ P(A_1 \cup A_2 \cup A_3 \cup ...) = P(A_1) + P(A_2) + P(A_3) + ... $$
Number 1 is self-explanatory. Number 2 says that the sample space must contain every possible outcome, so the probability that something in it happens is 1. Number 3 says that if some events are mutually exclusive, then the probability that any of them happens is the sum of their individual probabilities.
Let's use the axioms to solve a basic problem. If you roll a die, what is the probability of rolling an odd number?
We know that the sample space is $\{1, 2, 3, 4, 5, 6\}$ and that the event of getting an odd number is $\{1, 3, 5\}$.
Since the events are mutually exclusive, we can use axiom 3:
$$ P(\{1, 3, 5\}) = P(\{1\}) + P(\{3\}) + P(\{5\}) $$
Using axiom 2, we know that $P(\Omega) = 1$, which means that
$$P(\{1\}\cup\{2\}\cup\{3\}\cup\{4\}\cup\{5\}\cup\{6\}) = 1$$
$$P(\{1\}) + P(\{2\}) + P(\{3\}) + P(\{4\}) + P(\{5\}) + P(\{6\}) = 1$$
and knowing that all outcomes are equally likely, we can say that
$$P(\{1\}) = P(\{2\}) = P(\{3\}) = P(\{4\}) = P(\{5\}) = P(\{6\}) = \frac{1}{6}$$
Therefore,
$$P(\{1\}) + P(\{3\}) + P(\{5\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
Useful formulas which follow from the axioms are:
$$ P(A - B) = P(A) - P(A \cap B) $$
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
The second formula extends the third axiom to events that are not mutually exclusive, for example the events "even" and "at least 4" on a die roll, which overlap at $\{4, 6\}$.
Another key result which follows from the axioms is that in a finite sample space, where all outcomes are equally likely, the probability of an event is the number of outcomes in the event divided by the number of outcomes in the sample space:
$$ P(A) = \frac{|A|}{|S|} $$
In practice, we naturally prefer to start from this formula instead of from the axioms. But as the formula suggests, being able to count outcomes is key for calculating probabilities.
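Here's a short sketch of the die example above using this counting formula, with Python's `Fraction` for exact arithmetic:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}             # sample space of one die roll
A = {s for s in S if s % 2 == 1}   # event: the roll is odd -> {1, 3, 5}

def prob(event, sample_space):
    """Classical probability with equally likely outcomes: P(A) = |A| / |S|."""
    return Fraction(len(event & sample_space), len(sample_space))

print(prob(A, S))  # 1/2
```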
Conditional Probability: A Key Measure¶
As we start doing experiments in the world, we might find that we already have some information that might impact the probability of certain events. For example, let's say you find yourself in Fairbanks attempting to see the northern lights. Would the fact that it is a cloudy night impact the probability of your seeing the northern lights? Surely it would. We call the probability of an event given some information the conditional probability of the event. The conditional probability of an event $A$ given some information $B$ is denoted by $P(A|B)$. It is helpful to first build intuition for this concept by looking at the following example:
Example: Suppose you're interested in the event of getting an even number when you roll a die. Now suppose the die is rolled, but before seeing the result, you're told that the number is less than or equal to 3. What is the probability that you rolled an even number? In this example, $A$ is the event of getting an even number, $B$ is the event of getting a number less than or equal to 3, and we want to find $P(A|B)$. The intuition here is that we are now in the space of $B$, aka the space of numbers less than or equal to 3. So we can just count the number of even numbers in this space and divide by the number of elements in the space. In this case, there is one even number in the space, 2, and there are 3 elements in the space, so the probability is $\frac{1}{3}$. What we actually did in the numerator is count the number of outcomes in $A$ that are also in $B$, aka the intersection of the two. Formally, we solved this problem by using the following formula:
$$ P(A|B) = \frac{|A \cap B|}{|B|} $$
We can then divide both numerator and denominator by the cardinality of the sample space and rewrite this formula as:
$$ P(A|B) = \frac{\frac{|A \cap B|}{|S|}}{\frac{|B|}{|S|}} = \frac{P(A \cap B)}{P(B)} $$
It's easy to see that for disjoint events, $P(A \cap B) = 0$ and so $P(A|B) = 0$. This is because if $A$ and $B$ are disjoint, then there are no outcomes in $A$ that are also in $B$. Also, if $B$ is a subset of $A$, then $P(A \cap B) = P(B)$ and so $P(A|B) = 1$. This is because if $B$ is a subset of $A$, then all outcomes in $B$ are also in $A$.
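The worked example above translates directly into code; here's a sketch of the counting definition $P(A|B) = \frac{|A \cap B|}{|B|}$:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event: rolled an even number
B = {1, 2, 3}   # information: the number is <= 3

def cond_prob(a, b):
    """P(A|B) = |A ∩ B| / |B| for equally likely outcomes."""
    return Fraction(len(a & b), len(b))

print(cond_prob(A, B))  # 1/3 -- only 2 is even among {1, 2, 3}
```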
Conditional probability also satisfies the Kolmogorov axioms of probability. For example, if you have a set of mutually exclusive events $A_1, A_2, A_3, ...$, then
$$ P(\bigcup_{i=1}^{\infty} A_i|B) = \sum_{i=1}^{\infty} P(A_i|B) $$
In addition, the useful formulas from before also hold for conditional probability:
$$ P(A - B|C) = P(A|C) - P(A \cap B|C) $$
$$ P(A \cup B|C) = P(A|C) + P(B|C) - P(A \cap B|C) $$
$$ P(A^c|B) = 1 - P(A|B) $$
Furthermore, one additional formula follows from a basic rearranging of the equation for conditional probability, and it is known as the Chain Rule:
$$ P(A \cap B) = P(A|B)P(B) $$
and in general for any number of events:
$$ P(A_1 \cap A_2 \cap A_3 \cap ... \cap A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1 \cap A_2) ... P(A_n|A_1 \cap A_2 \cap ... \cap A_{n-1}) $$
For example, when drawing two cards from a deck without replacement, the probability of getting two aces is $P(\text{ace}_1 \cap \text{ace}_2) = P(\text{ace}_1)P(\text{ace}_2|\text{ace}_1) = \frac{4}{52} \cdot \frac{3}{51}$. Now let's look at a key concept known as Independence. Essentially, two events are independent if the probability of one event does not change when we know the outcome of the other event. Note that this is not the same as saying that the two events are disjoint or that they cannot occur simultaneously. For example, given the event $B$ that a person has blue eyes, what are the odds of the event $A$ that they have a brother? There are many people with blue eyes who have brothers, so the two events are not disjoint. However, the probability of having a brother is independent of the fact that you have blue eyes. So basically we're saying $B$ does not impact the probability of $A$:
$$ P(A|B) = P(A) $$But what does this imply given our equation for conditional probability? Well, we can use the chain rule:
$$ P(A \cap B) = P(A|B)P(B) $$
and we know that $P(A|B) = P(A)$, so we can substitute this into the equation above and get:
$$ P(A \cap B) = P(A)P(B) $$
This equation is a necessary and sufficient condition for independence (the two statements are equivalent). In other words, when events are independent, we can multiply their probabilities together to get the probability of the intersection of the two events. Furthermore, it also means that the probability of one event given the other, independent, event is the same as the prior probability of the event ($P(A|B) = P(A)$).
For more than two events, mutual independence requires more than pairwise independence: the product rule must hold for every finite subcollection of the events, i.e. $P(A_{i_1} \cap ... \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$ for every choice of distinct indices.
In addition, independence tells us that the probability of the union of two independent events is related to their complements:
$$ P(A \cup B) = 1 - P(A^c)P(B^c) $$
It is important to keep in mind that if two events with nonzero probabilities are disjoint, then they are dependent, i.e.
$$ P(A \cap B) = 0 \Rightarrow P(A|B) = 0 \neq P(A) $$
So knowing that $B$ happened tells you that $A$ did not happen.
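A quick way to internalize the product criterion is to check it by enumeration. In this sketch the die events are chosen for illustration: "even" and "at most 4" come out independent, while two disjoint events with nonzero probabilities do not:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(S))

def independent(a, b):
    """Check the product criterion P(A ∩ B) == P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}            # even
B = {1, 2, 3, 4}         # at most 4
print(independent(A, B))  # True: 1/3 == (1/2) * (2/3)

C, D = {1, 2}, {5, 6}    # disjoint, each with nonzero probability
print(independent(C, D))  # False: 0 != (1/3) * (1/3), so they are dependent
```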
One useful tool that conditional probability gives us is the Law of Total Probability. Consider an event $B$. $B$ and $B^c$ form a partition of the sample space. Now consider an event $A$. It's trivial to show that:
$$ P(A) = P(A\cap B) + P(A\cap B^c) $$
$$ \Rightarrow P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) $$
This then generalizes to any number of events:
$$ P(A) = \sum_{i} P(A\cap B_i) = \sum_{i} P(A|B_i)P(B_i) $$
where $B_1, B_2, B_3, ...$ are pairwise disjoint and form a partition of the sample space.
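Here's a small numerical check of the law on the die's sample space (the partition is again illustrative):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                        # event of interest: even
partition = [{1, 2, 3}, {4, 5, 6}]   # B_1, B_2: pairwise disjoint, union is S

def prob(event):
    return Fraction(len(event), len(S))

def cond_prob(a, b):
    return Fraction(len(a & b), len(b))

# P(A) = sum_i P(A|B_i) P(B_i)
total = sum(cond_prob(A, B) * prob(B) for B in partition)
assert total == prob(A) == Fraction(1, 2)
print(total)  # 1/2
```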
Furthermore, conditional independence is a useful concept. If two events $A$ and $B$ are conditionally independent given $C$, then:
$$ P(A|B,C) = P(A|C) $$
We are now ready to state one of the most important results in probability theory and one that is used in many areas of machine learning and science.
Bayes' Theorem¶
"Bayes' theorem is to the theory of probability what the pythagorean theorem is to geometry." - Sir Harold Jeffreys
In many instances, you know the conditional probability $P(A|B)$ but are interested in the probability $P(B|A)$. For example, you may know the probability of a positive test result given that the person has the disease but you may be interested in the probability of the person having the disease given that they have a positive test result. The derivation is quite simple and we start from the definition of conditional probability:
$$ P(A\cap B) = P(A|B)P(B) = P(B|A)P(A) $$
$$ \Rightarrow P(B|A) = \frac{P(A|B)P(B)}{P(A)} $$
This is the famous equation of Bayes' theorem. Practically, Bayes' rule is used to "flip" a conditional probability, i.e. to find the probability of event 1 given event 2, when you know the probability of event 2 given event 1. Many times, you don't know the denominator $P(A)$, but you do know its occurrence under several disjoint events $B_1, B_2, B_3, ...$, so one can use the law of total probability to give the alternative form of Bayes' rule that is often used:
$$ P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j} P(A|B_j)P(B_j)} $$
It is also used to update the probability of an event given new information.
The canonical example typically used is test results for an unlikely disease. For example, suppose that the probability of having the disease is 0.01%, the probability of a positive test result given that you have the disease is 98% (so the probability of a negative result given that you have the disease is 2%), and the probability of a positive result given that you do not have the disease (the false positive rate) is also 2%. What is the probability that you have the disease given that you have a positive test result? We can use Bayes' rule to solve this problem:
$$ P(D|T) = \frac{P(T|D)P(D)}{P(T)} $$Where $D$ is the event that you have the disease and $T$ is the event that you have a positive test result.
The denominator, $P(T)$, is the probability of a positive test result. We can use the law of total probability to find this:
$$ P(T) = P(T|D)P(D) + P(T|D^c)P(D^c) $$
Where $D^c$ is the complement of the event $D$ (i.e. the event that you do not have the disease). We know that $P(D^c) = 1 - P(D) = 0.9999$ and, from the setup, the false positive rate is $P(T|D^c) = 0.02$, so we can plug the numbers into the equation above:
$$ P(T) = 0.98 \times 0.0001 + 0.02 \times 0.9999 \approx 0.0201 $$
$$ P(D|T) = \frac{0.98 \times 0.0001}{0.98 \times 0.0001 + 0.02 \times 0.9999} \approx 0.0049 $$
So the probability that you have the disease given that you have a positive test result is only about 0.49%. This is surprising to our intuition, given that the test is quite accurate when you have the illness. But you have to keep in mind that the disease is also quite rare.
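The arithmetic here is easy to fumble by hand, so here's the same computation as a short sketch, with the numbers taken straight from the example:

```python
p_disease = 0.0001        # P(D): prevalence of 0.01%
p_pos_given_d = 0.98      # P(T|D): sensitivity (2% false negatives)
p_pos_given_not_d = 0.02  # P(T|D^c): false positive rate, per the setup

# Law of total probability for the denominator P(T).
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' rule: P(D|T) = P(T|D) P(D) / P(T).
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(f"{p_d_given_pos:.4f}")  # ~0.0049, i.e. about 0.49%
```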
Within the context of Bayesian inference, Bayes' rule is used to update the prior probability of an event given new information. Looking at Bayes' rule again:
$$ P(B|A) = \frac{P(A|B)P(B)}{P(A)} $$
Within the language of Bayesian inference, $P(B|A)$ is the posterior probability of event $B$ given event $A$, $P(A|B)$ is the likelihood of event $A$ given event $B$, $P(B)$ is the prior probability of event $B$, and $P(A)$ is the marginal probability of event $A$.
Conclusion¶
In this article, we started from set theory and covered concepts such as unions, intersections, complements, partitions, mutual exclusivity, and cardinality. We then used this framework to define experiments, events, and sample spaces. We axiomatically defined probability through the Kolmogorov axioms and then dove into the important concepts of independence and conditional probability. Finally, we derived Bayes' theorem and used it to solve a simple example. These are the fundamental concepts of probability theory, used in many areas of machine learning and science, and now you know them!
Sources¶
Wikipedia Contributors. (2022, March 19). Inclusion–exclusion principle. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle
Probability, Statistics & Random Processes | Free Textbook | Course. (n.d.). Www.probabilitycourse.com. https://www.probabilitycourse.com/
Wikipedia Contributors. (2019, December 5). Probability axioms. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Probability_axioms
Cardinality. (2021, January 6). Wikipedia. https://en.wikipedia.org/wiki/Cardinality
Chain rule (probability). (2023, January 3). Wikipedia. https://en.wikipedia.org/wiki/Chain_rule_(probability)
Conditional Probability and Independence. (n.d.). Bolt.mph.ufl.edu. Retrieved January 31, 2023, from https://bolt.mph.ufl.edu/6050-6052/unit-3/module-7/