The Data Scientist's Odyssey

The organized chaos of all things data science, machine learning, and social science


Multiple Random Variables: Marginal, Joint, and Conditional Distributions

Multivariate randomness is ubiquitous in nature

A Simple Example

A probability distribution is a mathematical function that describes how likely each outcome of a random event is. When more than one random variable is involved, three kinds of distribution come up again and again: marginal, joint, and conditional. Each describes a different aspect of the probability of an outcome or set of outcomes.

Let's weave in an example as we build our analytical definitions. Suppose a survey of people was taken every year from 2015 to 2022 to determine an aggregate top 500 classic rock songs. We have such a dataset from Kaggle (https://www.kaggle.com/datasets/juliotorniero/classic-rock-top-500-songs) and can load it into a pandas dataframe. Let's do an analysis of the year and the probability of a certain genre being in the top 500.

Note: This dataset, like most datasets, comes in the form of discrete data. Therefore, the analytical forms of our probability distributions shown below will be for the discrete case.

In [2]:
import pandas as pd
data = pd.read_csv("../data/classic_rock_playlist.csv")
# let's grab the genre and the years
data = data[['Genre','2022','2021','2020','2019','2018','2017','2016','2015']]
# let's count the times each genre made it into the top 500 in each year
# (count() tallies the non-missing entries in every year column)
data_agg = data.groupby('Genre').count()
data_agg
Out[2]:
Genre              2022  2021  2020  2019  2018  2017  2016  2015
A Cappella            1     1     1     1     1     0     1     0
Acoustic Rock         1     1     1     1     1     0     1     0
Alternative Metal     1     0     1     1     0     1     0     0
Alternative Rock     33    18    14    15    16    15     7     3
Arena Rock            3     2     1     1     0     1     1     1
...                 ...   ...   ...   ...   ...   ...   ...   ...
Symphonic Rock        2     1     1     1     1     1     1     1
Synth Rock            1     0     0     0     0     0     0     0
Synthpop              6     5     1     3     2     1     2     0
Thrash Metal          1     1     0     0     0     0     0     0
Worldbeat             1     1     0     0     0     0     0     0

61 rows × 8 columns

To convert these numbers into probabilities, we have to divide each count by the total number of observations.

In [3]:
data_probdist = data_agg / data_agg.sum().sum()
data_probdist
Out[3]:
Genre                  2022      2021      2020      2019      2018      2017      2016      2015
A Cappella         0.000447  0.000447  0.000447  0.000447  0.000447  0.000000  0.000447  0.000000
Acoustic Rock      0.000447  0.000447  0.000447  0.000447  0.000447  0.000000  0.000447  0.000000
Alternative Metal  0.000447  0.000000  0.000447  0.000447  0.000000  0.000447  0.000000  0.000000
Alternative Rock   0.014739  0.008039  0.006253  0.006699  0.007146  0.006699  0.003126  0.001340
Arena Rock         0.001340  0.000893  0.000447  0.000447  0.000000  0.000447  0.000447  0.000447
...                     ...       ...       ...       ...       ...       ...       ...       ...
Symphonic Rock     0.000893  0.000447  0.000447  0.000447  0.000447  0.000447  0.000447  0.000447
Synth Rock         0.000447  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
Synthpop           0.002680  0.002233  0.000447  0.001340  0.000893  0.000447  0.000893  0.000000
Thrash Metal       0.000447  0.000447  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
Worldbeat          0.000447  0.000447  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

61 rows × 8 columns
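
To make the normalization step easy to follow by hand, here is a minimal sketch on a tiny made-up table. The genres, years, and counts below are hypothetical and are not taken from the Kaggle dataset; the names `counts` and `joint` are just for illustration. The idea is the same as above: divide every cell by the grand total, then confirm the resulting probabilities add up to 1.

```python
import pandas as pd

# Hypothetical counts (not from the Kaggle dataset):
# rows are genres (X), columns are years (Y).
counts = pd.DataFrame(
    {"2021": [3, 1, 0], "2022": [4, 1, 1]},
    index=pd.Index(["Hard Rock", "Blues Rock", "Thrash Metal"], name="Genre"),
)

# Divide every cell by the grand total to estimate P(X=x, Y=y).
grand_total = counts.to_numpy().sum()
joint = counts / grand_total

print(joint)
# The cells of a probability table should add up to 1 (up to float rounding).
print("total probability:", joint.to_numpy().sum())
```

With only ten entries in total, each cell is simply its count divided by 10, which makes the table easy to verify by eye.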

In the examples below, let $X$ be the random variable representing the genre of a song and $Y$ be the random variable representing the year in which the song appeared in the top 500.

Marginal Probability

A marginal probability distribution describes the likelihood of the outcomes of a single random variable, regardless of the values of any other variables. For example, if we have a random variable $X$, the marginal probability distribution of $X$ is represented by $P(X=x)$, where $x$ is a possible outcome of the variable $X$. The probabilities of all possible outcomes always sum to 1. So in our example, what is the likelihood of thrash metal being in the top 500? We can calculate this by dividing the number of thrash metal entries by the total number of entries in the dataset. The other way to calculate this marginal probability is to sum the probabilities of thrash metal across all years. The two approaches agree because of the following relation between marginal and joint probability distributions:

$$P(X=x) = \sum_{y} P(X=x, Y=y)$$

What this means in practice is we set $X$ to be fixed at the thrash metal genre and sum those probabilities over all possible values of $Y$. This number should match our intuitive calculation of dividing the number of thrash metal songs by the total number of songs in the dataset.

In [20]:
# check whether summing probabilities across years matches the probability of the genre
data_probdist.loc['Thrash Metal'].sum() == (data_agg.loc['Thrash Metal'].sum() / data_agg.sum().sum())
Out[20]:
True
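
The same sum can be taken for every genre and every year at once. Below is a small sketch on a hypothetical table (the counts are made up, and `p_genre` and `p_year` are illustrative names): summing the joint table across columns gives the marginal of $X$, and summing down rows gives the marginal of $Y$.

```python
import pandas as pd

# Hypothetical counts (not from the Kaggle dataset).
counts = pd.DataFrame(
    {"2021": [3, 1, 0], "2022": [4, 1, 1]},
    index=pd.Index(["Hard Rock", "Blues Rock", "Thrash Metal"], name="Genre"),
)
joint = counts / counts.to_numpy().sum()

# P(X=x): fix the genre and sum over all years (across the columns).
p_genre = joint.sum(axis=1)

# P(Y=y): fix the year and sum over all genres (down the rows).
p_year = joint.sum(axis=0)

print(p_genre)
print(p_year)
```

Each marginal is itself a probability distribution, so both `p_genre` and `p_year` should sum to 1 up to floating-point rounding.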

Joint Probability

A joint probability distribution describes the likelihood of two or more events occurring simultaneously. For example, if we have two random variables $X$ and $Y$, the joint probability distribution of $X$ and $Y$ is represented by $P(X=x, Y=y)$ (sometimes written as $P(\{X=x\}\cap\{Y=y\})$), where $x$ and $y$ are possible outcomes of the variables $X$ and $Y$ respectively. The probabilities over all possible pairs of outcomes also always sum to 1. The table of probability values above is already an estimate of the joint probability distribution, so we simply need to read the values off of it! For example, the probability of thrash metal being in the top 500 in 2022 is 0.000447.

Conditional Probability

A conditional probability distribution gives the probability of each outcome of a random variable given that another event has already occurred. For example, if we have two random variables $X$ and $Y$, the conditional probability distribution of $X$ given $Y$ is represented by $P(X=x|Y=y)$, where $x$ and $y$ are possible outcomes of the variables $X$ and $Y$ respectively. It gives the likelihood of each outcome of $X$ once the value of $Y$ is known. An essential property used when calculating conditional probability is the following:

$$P(X=x|Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}$$

This follows directly from the chain rule of probability, $P(X=x, Y=y) = P(X=x|Y=y) * P(Y=y)$, and requires $P(Y=y) > 0$. From our example, if we wanted to calculate the probability of Thrash Metal being in the top 500 given that it is 2022, we would divide the probability of Thrash Metal being in the top 500 in 2022 by the probability of 2022 being in the dataset. Remembering how we calculate a marginal probability, we simply sum the probabilities in the 2022 column to get its marginal probability, that is, the probability that an entry in the dataset comes from 2022.

In [21]:
data_probdist.loc['Thrash Metal','2022'] / data_probdist['2022'].sum()
Out[21]:
0.0019999999999999996
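
The division above handles one genre and one year. As a rough sketch, the whole conditional distribution $P(X=x|Y=y)$ can be computed by dividing each column of the joint table by that column's marginal. The table below is hypothetical (made-up counts), and `cond_genre_given_year` is an illustrative name.

```python
import pandas as pd

# Hypothetical counts (not from the Kaggle dataset).
counts = pd.DataFrame(
    {"2021": [3, 1, 0], "2022": [4, 1, 1]},
    index=pd.Index(["Hard Rock", "Blues Rock", "Thrash Metal"], name="Genre"),
)
joint = counts / counts.to_numpy().sum()

# Marginal of the year, P(Y=y): sum each column.
p_year = joint.sum(axis=0)

# P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y).
# axis=1 lines the entries of p_year up with the column labels of joint.
cond_genre_given_year = joint.div(p_year, axis=1)

print(cond_genre_given_year)
# Each column is a distribution over genres, so each column sums to 1.
print(cond_genre_given_year.sum(axis=0))
```

Because the grand total cancels in the ratio, dividing the raw counts by their column totals, `counts.div(counts.sum(axis=0), axis=1)`, should give the same table.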

To reiterate, the relationships between these three probability distributions that are essential for solving problems involving them are as follows:

$$P(X=x) = \sum_{y} P(X=x, Y=y)$$

$$P(X=x|Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}$$

The above equations are essentially definitions and will always be useful in the scientist's toolbox. One additional equation that comes in handy has to do with independence. When random variables are independent, the following equation holds for every pair of outcomes $x$ and $y$:

$$P(X=x, Y=y) = P(X=x) * P(Y=y)$$
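
One way to see this relation in action is to compare a joint table against the product of its own marginals. The sketch below uses two hypothetical tables (made-up counts) and a helper named `looks_independent`, which is an illustrative name and not part of pandas or NumPy. It only checks the identity on the table it is given; it is not a statistical test, and with real survey data the equality will rarely hold exactly.

```python
import numpy as np
import pandas as pd


def looks_independent(counts: pd.DataFrame) -> bool:
    """Return True if every joint probability equals the product of its marginals."""
    joint = counts / counts.to_numpy().sum()
    p_x = joint.sum(axis=1).to_numpy()  # marginal over rows (genre)
    p_y = joint.sum(axis=0).to_numpy()  # marginal over columns (year)
    product = np.outer(p_x, p_y)        # P(X=x) * P(Y=y) for every cell
    return bool(np.allclose(joint.to_numpy(), product))


# Hypothetical tables, not taken from the Kaggle dataset.
dependent = pd.DataFrame(
    {"2021": [3, 1, 0], "2022": [4, 1, 1]},
    index=["Hard Rock", "Blues Rock", "Thrash Metal"],
)
independent = pd.DataFrame(
    {"2021": [2, 1], "2022": [4, 2]},
    index=["Hard Rock", "Blues Rock"],
)

print(looks_independent(dependent))    # False
print(looks_independent(independent))  # True
```

In the second table every row has the same split between the two years, which is exactly what independence of genre and year means here.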

To sum up, the marginal probability distribution describes the likelihood of the outcomes of a single random variable, the joint probability distribution describes the likelihood of two or more events occurring simultaneously, and the conditional probability distribution describes the likelihood of a certain outcome of a random variable given that another event has already occurred. These three probability distributions are related to each other through fundamental probability definitions and are the key building blocks for solving problems with multiple random variables.

About
Hello! Name's Eddie. I finished my PhD in experimental fusion physics in 2019 and have been working as an operations research scientist for a few years. I enjoy all things science and hope to understand my world better by reading the best sources and summarizing the information to the best of my ability for all to enjoy. I hope you enjoy my blog!