Probability basics for Bayesian analysis

Background

What distinguishes measure theory and probability theory?

Sample space

N <- 100      # number of outcomes in our finite sample space
Omega <- 1:N  # the sample space, represented as outcome indices
  • \(\Omega\) - a set

  • Each \(\omega\) in \(\Omega\) is one possible outcome in our uncertain world.

  • In computational uses, \(\Omega\) is finite.

  • But when we phrase a statistical model, it is often an infinite mathematical construct.

Sample space - cont.

  • Usually we do not actually care about the elements \(\omega\) in \(\Omega\).

  • We think in terms of events and random variables.

  • The sample space behind them is implicit.

Events

  • Subsets of \(\Omega\) that we care about are called “events”.

  • E.g., \(H\)=“all outcomes \(\omega\) in which Piglet meets a Heffalump” may be an event.
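For the finite sample space defined above, an event is literally a subset of Omega. A minimal sketch (the event and the name E_even are ours, a numeric stand-in for the whimsical Heffalump event):

E_even <- Omega[Omega %% 2 == 0]  # the event: outcomes with an even index
head(E_even)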

The event space

  • \(\mathcal{F}\) - the event space, is the set of events that can be conceptualized in our model of the world.

  • In other words, events where it makes sense to ask whether they occur or not.

  • E.g., if we do not know what Heffalumps are like, then the event \(H\) above should not be in our \(\mathcal{F}\).

  • \(\mathcal{F}\) is assumed to be a so-called \(\sigma\)-algebra.

  • It means that \(\mathcal{F}\) contains \(\Omega\) and is closed under complements and countable unions, so the usual logical combinations of events are themselves events.

Varying the event space

  • When we talk about conditional probability, etc.,
  • .. it can always be phrased by conditioning on a different event space.
  • .. but we will not use this terminology today.

Random variables

  • A random variable is a function \(\Omega \to \mathbb{R}\).
  • .. which is “measurable” with respect to the event space \(\mathcal{F}\).

Random variables examples

X <- function(omega) {
  # the value of X at the outcome omega; set.seed(omega) makes it a
  # deterministic function of omega, taking values in {0.7, 0.8, 0.9}
  set.seed(omega)
  sample((7:9)/10, 1)
}

Y <- function(omega) {
  # the value of Y at the same outcome omega: a geometric draw whose
  # success probability is X(omega)
  set.seed(omega)
  rgeom(1, X(omega))
}

list(x=X(52),
     y=Y(52))
$x
[1] 0.9

$y
[1] 0

Following McElreath’s example:

  • \(X\) is the proportion of the planet covered by water.
  • \(Y\) is the number of land points we sample on the globe before we hit water.

Random variables coexist

Omega_XY <- tibble(
  omega=Omega,
  X=sapply(omega, X),
  Y=sapply(omega, Y))

Omega_XY %>% kable()
omega X Y
1 0.7 0
2 0.7 0
3 0.7 0
4 0.9 0
5 0.8 1
6 0.7 0
7 0.8 0
8 0.9 0
9 0.9 0
10 0.9 0
11 0.8 2
12 0.8 1
13 0.9 0
14 0.7 1
15 0.7 3
16 0.7 0
17 0.8 0
18 0.8 1
19 0.7 0
20 0.8 0
21 0.9 0
22 0.8 1
23 0.7 1
24 0.9 0
25 0.9 0
26 0.9 0
27 0.7 2
28 0.7 1
29 0.7 1
30 0.8 0
31 0.7 1
32 0.9 0
33 0.8 0
34 0.7 0
35 0.8 1
36 0.7 0
37 0.8 1
38 0.7 0
39 0.7 0
40 0.9 0
41 0.9 0
42 0.7 0
43 0.7 0
44 0.7 0
45 0.7 0
46 0.8 0
47 0.9 0
48 0.7 0
49 0.9 0
50 0.9 0
51 0.9 0
52 0.9 0
53 0.8 1
54 0.9 0
55 0.8 0
56 0.7 0
57 0.8 0
58 0.9 0
59 0.9 0
60 0.9 0
61 0.9 0
62 0.8 0
63 0.9 0
64 0.9 0
65 0.8 0
66 0.7 0
67 0.9 0
68 0.7 1
69 0.7 0
70 0.9 0
71 0.9 0
72 0.7 1
73 0.7 1
74 0.8 1
75 0.7 0
76 0.7 0
77 0.8 0
78 0.9 0
79 0.8 0
80 0.9 0
81 0.9 0
82 0.7 0
83 0.7 1
84 0.8 1
85 0.7 0
86 0.9 0
87 0.8 0
88 0.9 1
89 0.7 1
90 0.8 0
91 0.7 0
92 0.8 0
93 0.7 0
94 0.8 0
95 0.9 0
96 0.9 0
97 0.7 1
98 0.8 0
99 0.7 2
100 0.8 0

Coexisting

Omega_XY %>%
ggplot(aes(x=X, y=Y)) + geom_point(size=10)

Random vectors

\((X,Y)\) may be considered a random vector, viewed as a function \(\Omega \to \mathbb{R}^2\) \[\omega \mapsto (X(\omega), Y(\omega))\]
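Using the functions X and Y defined earlier, the random vector can be written as a single function of omega; a small sketch (the name XY is ours):

XY <- function(omega) c(X=X(omega), Y=Y(omega))  # omega maps to the pair (X(omega), Y(omega))
XY(52)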

Events of random variables

\(Y>0\) means the subset of \(\Omega\): \[\{\omega \in \Omega \vert Y(\omega)>0\}\]

Y_is_positive <- Omega_XY %>% filter(Y>0)

nrow(Y_is_positive) / N
[1] 0.25

Events of random variables - cont.

\(Y \in [1,2], X=0.9\) means the subset of \(\Omega\): \[\{\omega \in \Omega \vert Y(\omega) \in [1,2], X(\omega)=0.9\}\] \[= \{\omega \in \Omega \vert 1 \leq Y(\omega) \leq 2, X(\omega)=0.9\}\]
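In the finite example this event is, again, just a filtered subset of the outcomes. A sketch continuing from Omega_XY above (the name E is ours):

E <- Omega_XY %>% filter(Y >= 1, Y <= 2, X == 0.9)  # the outcomes with Y in [1,2] and X = 0.9
nrow(E) / N                                         # the share of such outcomes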

Efficient sampling

(the relationship to the sample space is not explicit anymore)

set.seed(1987)
Omega_XY_ <- tibble(
  X = sample((4:6)/10, N, replace=TRUE),
  Y = rgeom(N, X))

Omega_XY_ %>%
kable()
X Y
0.4 0
0.5 0
0.5 0
0.4 0
0.6 0
0.4 0
0.4 4
0.5 0
0.4 3
0.6 0
0.6 0
0.6 0
0.6 1
0.6 0
0.4 2
0.5 3
0.5 0
0.6 0
0.5 0
0.5 0
0.5 0
0.6 2
0.5 3
0.5 1
0.5 0
0.5 3
0.4 3
0.5 2
0.6 1
0.6 1
0.5 1
0.5 0
0.6 2
0.4 1
0.6 0
0.6 0
0.5 0
0.4 2
0.6 0
0.5 2
0.6 1
0.4 0
0.4 2
0.6 3
0.6 1
0.6 0
0.6 0
0.4 1
0.4 3
0.4 0
0.5 0
0.6 1
0.4 0
0.4 1
0.4 1
0.4 0
0.5 0
0.5 0
0.6 1
0.5 0
0.5 0
0.6 0
0.4 0
0.5 1
0.4 1
0.4 0
0.5 0
0.6 5
0.5 0
0.6 0
0.5 1
0.6 0
0.5 0
0.4 0
0.5 0
0.6 0
0.4 0
0.6 1
0.4 1
0.6 0
0.6 1
0.5 0
0.6 1
0.4 0
0.5 0
0.5 6
0.5 1
0.6 0
0.6 0
0.4 2
0.4 9
0.6 0
0.6 0
0.4 0
0.5 0
0.4 9
0.4 0
0.5 1
0.4 1
0.5 1

Probability

  • A probability measure \(\mathbb{P}\) is a function \(\mathcal{F} \to [0,1]\), satisfying Kolmogorov’s axioms:
    • \(\mathbb{P}(E) \geq 0\) for all \(E\) in \(\mathcal{F}\)
    • \(\mathbb{P}(\Omega) = 1\)
    • \(\mathbb{P}(\bigcup_{i=1,2,...} E_i) = \sum_{i=1,2,...} \mathbb{P}(E_i)\) for pairwise disjoint events
  • A sample space with a probability measure is called a probability space.

Probability example

For our finite example, we may define the probability of an event to be proportional to its number of outcomes.

# the probability of an event is its share of the N equally likely outcomes
P <- function(event) nrow(event) / N

P(Y_is_positive)
[1] 0.25
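A quick sanity check of the axioms on this finite example; a sketch, where A and B are two disjoint events we picked for illustration:

P(Omega_XY)                                 # P(Omega) = 1
A <- Omega_XY %>% filter(Y == 1)            # the disjoint events {Y = 1} ...
B <- Omega_XY %>% filter(Y == 2)            # ... and {Y = 2}
all.equal(P(bind_rows(A, B)), P(A) + P(B))  # additivity for disjoint events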

Distribution

A random variable \(Y: \Omega \to \mathbb{R}\) pushes a probability measure \(\mathbb{P}\) over \(\Omega\) to a probability measure \(P_Y\) over \(\mathbb{R}\), called its distribution.

\[P_Y((0,\infty)) = \mathbb{P}(Y \in (0,\infty)) = \mathbb{P}(Y > 0)\]
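In the finite example, the distribution of \(Y\) can be tabulated directly by counting outcomes; a sketch:

Omega_XY %>%
  count(Y) %>%          # how many outcomes map to each value of Y
  mutate(prob = n / N)  # the probability P_Y assigns to each value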

Distribution examples

Omega_XY_ %>% helper_plot_X_probs()

Distribution examples - cont.

Omega_XY_ %>% helper_plot_Y_probs()

Joint distribution

Similarly, the distribution of a random vector \((X,Y)\) is a probability measure over the plane \(\mathbb{R}^2\).

It is also called the joint distribution of \(X\) and \(Y\).

Omega_XY_ %>% helper_plot_XY_probs()
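The same joint distribution can also be tabulated rather than plotted; a sketch:

Omega_XY_ %>%
  count(X, Y) %>%       # how many outcomes map to each (X, Y) pair
  mutate(prob = n / N)  # the joint probability of each pair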

Types of distribution

A probability distribution \(P\) is

  • discrete, if it is defined by a sequence of values \(y_1, y_2, ...\) such that for every region \(D\), \[P(D) = \sum_{i \vert y_i \in D} P(\{y_i\})\]

  • continuous, if \(P(\{y\})=0\) for all \(y\)

  • absolutely continuous, if there is a density function \(f_Y\) such that for every region \(D\), \[P(D) = \int_D f_Y(y) \mathrm{d} y\]
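As a small illustration of the first and third types, R's built-in distributions give the point masses of a discrete distribution and the density of an absolutely continuous one; a sketch:

dgeom(0:5, prob = 0.5)             # point masses P({y_i}) of a Geometric(0.5) distribution
integrate(dexp, 0, Inf, rate = 1)  # the Exponential(1) density integrates to 1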

Relationships between types

  • continuous \(\Rightarrow\) not discrete
  • continuous and discrete can approximate each other
  • absolutely continuous \(\Rightarrow\) continuous
  • continuous \(\nRightarrow\) absolutely continuous
    • can we think of an example for this gap?
      • in \(\mathbb{R}^2\)?
      • in \(\mathbb{R}\)?
  • we can have mixtures

Another example

In this example, we approximate an absolutely continuous joint distribution.

set.seed(1987)
Omega_XY_continuous <- tibble(
  X = rexp(N, 1),
  Y = rexp(N, X))

Omega_XY_continuous %>% helper_plot_XY_density() +
  xlim(0,5) + ylim(0,10)

Expectation

  • For a discrete random variable \(Y\) taking values \(y_1, y_2, ...\), the expectation of \(Y\) is defined (when the series converges): \[ \mathbb{E}(Y) = \sum_i y_i \mathbb{P}(Y=y_i)\]

  • For an absolutely continuous random variable \(Y\) with density \(f_Y\), it is defined (when the integral is well-defined): \[\mathbb{E}(Y) = \int y f_Y(y) \mathrm{d} y\]

Expectation - cont.

  • These are both special cases of the general notion of a Lebesgue integral (when \(Y\) is Lebesgue integrable): \[\mathbb{E}(Y) = \int_\Omega Y \,\mathrm{d}\mathbb{P}\]

  • These notions depend only on the distribution of \(Y\).

Probability as expectation

  • Given an event \(A\), we define \(1_A\) to be the random variable such that:
    • \(1_A(\omega)=1\) if \(\omega \in A\)
    • \(1_A(\omega)=0\) if \(\omega \notin A\)
  • Then \[\mathbb{P}(A) = \mathbb{E}(1_A)\]
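On the finite example this is the familiar fact that the mean of an indicator is a proportion; a sketch:

mean(Omega_XY$Y > 0)  # E(1_A) for A = {Y > 0}; the same value as P(Y_is_positive) above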

Conditioning

Conditional probability given an event

  • Given events \(A\), \(B\), such that \(\mathbb{P}(B)>0\), the conditional probability of \(A\) given \(B\) is \[ \mathbb{P}(A|B) = \frac {\mathbb{P}(A \cap B)} {\mathbb{P}(B)}\]

  • When also \(\mathbb{P}(A)>0\), we get Bayes’ formula: \[ \mathbb{P}(A|B) = \mathbb{P}(B|A) \frac {\mathbb{P}(A)} {\mathbb{P}(B)}\]
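A sketch of the definition on the finite example, taking \(A = \{X=0.9\}\) and \(B = \{Y>0\}\) (the name A_and_B is ours):

A_and_B <- Omega_XY %>% filter(X == 0.9, Y > 0)  # the intersection of A and B
P(A_and_B) / P(Y_is_positive)                    # P(A|B) = P(A and B) / P(B)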

Conditional distribution given an event

  • Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional distribution of \(X\) given \(B\) is defined by pushing the probability measure \(\mathbb{P}(\cdot |B)\) from \(\Omega\) to \(\mathbb{R}\): \[P_{X|B} (D) = \mathbb{P}(X \in D|B)\] for every region \(D \subset \mathbb{R}\).

  • For example: \[P_{Y|B} ((0,\infty)) = \mathbb{P}(Y>0 |B)\] \[P_{X|B} (\{0.5\}) = \mathbb{P}(X=0.5 |B)\]

Conditional distribution - cont.

Omega_XY_ %>%
filter(Y>2) %>%
helper_plot_X_probs

Conditional expectation given an event

Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional expectation \(\mathbb{E}(X \vert B)\) is defined as the expectation of the conditional distribution.

Omega_XY_ %>%
filter(Y>0) %>%
pull(X) %>%
mean
[1] 0.4931818
Omega_XY_ %>%
filter(Y==2) %>%
pull(X) %>%
mean
[1] 0.475

Conditional expectation given a discrete random variable

  • If \(X\) is a random variable, and \(Y\) is a discrete random variable, then we have \[y \mapsto \mathbb{E}(X|Y=y)\] defined over all values \(y\) such that \(\mathbb{P}(Y=y)>0\).

  • We may now use the composition: \[\omega \xrightarrow[]{Y} y \xrightarrow[]{\mathbb{E}(X|Y=\cdot)} \mathbb{E}(X|Y=y)\] which is defined for all \(\omega\) in \(\Omega\), except for a negligible set. This defines a new random variable, called \(\mathbb{E}(X|Y)\).

  • This is a special case of a more general notion.

  • Later we can make this concrete for the absolutely continuous case as well.
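Empirically, the new random variable \(\mathbb{E}(X|Y)\) can be built exactly as in the composition above: average \(X\) within each level of \(Y\) and attach that average to every matching outcome. A sketch on the finite example (the name E_XY is ours):

E_XY <- Omega_XY_ %>%
  group_by(Y) %>%                    # for each value y of Y ...
  mutate(E_X_given_Y = mean(X)) %>%  # ... attach E(X | Y = y) to every matching outcome
  ungroup()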

Characteristics of conditional expectation - I

For the case above, we see the following:

  • There is a function \(g\) such that: \[\omega \xrightarrow[]{Y} y \xrightarrow[]{g} \mathbb{E}(X|Y)(\omega)\] We usually write: \[\mathbb{E}(X|Y) = g(Y)\] meaning \(\mathbb{E}(X|Y)(\omega) = g(Y(\omega))\) for all \(\omega \in \Omega\).

Characteristics of conditional expectation - II

  • For every function \(f\) such that the expectation is defined: \[\mathbb{E} (f(Y) \mathbb{E}(X|Y)) = \mathbb{E} (f(Y) X)\]

  • Equivalently, for every event \(A\) that depends only on \(Y\), \[\mathbb{E} (1_A \mathbb{E}(X|Y)) = \mathbb{E} (1_A X)\]
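An empirical check of property II on the finite example; a sketch with \(f(Y) = 1_{\{Y>0\}}\), reusing E_XY from the sketch above:

with(E_XY, mean((Y > 0) * E_X_given_Y))  # E(f(Y) E(X|Y))
with(E_XY, mean((Y > 0) * X))            # E(f(Y) X), equal up to floating point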

Characterization

The conditional expectation \(\mathbb{E}(X|Y)\) exists for a general \(Y\), assuming \(X\) has a finite expectation.

It is characterized (up to a change on a negligible set of outcomes) by the properties I,II above.

Conditional probability given a random variable

Given an event \(A\) and a random variable \(Y\), we define the random variable: \[\mathbb{P}(A\vert Y) = \mathbb{E}(1_A\vert Y)\]
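A sketch on the finite example, with \(A = \{X=0.9\}\): since \(\mathbb{P}(A \vert Y) = \mathbb{E}(1_A \vert Y)\), it is the within-group proportion of outcomes belonging to \(A\):

Omega_XY %>%
  group_by(Y) %>%
  mutate(P_A_given_Y = mean(X == 0.9)) %>%  # E(1_A | Y) evaluated at each outcome
  ungroup()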

Conditional distribution given a random variable

  • Given random variables \(X\), \(Y\), for every \(\omega \in \Omega\) we can look at the distribution of \(X\) under the probability measure \(\mathbb{P}(\cdot \vert Y)(\omega)\).

  • This defines a mapping from \(\Omega\) to the set of probability distributions over \(\mathbb{R}\).

  • In other words, it is a random distribution. We call it the conditional distribution of \(X\) given \(Y\).

Examples

  • Our first finite, computational discrete example approximates the joint distribution defined as follows: \[X \sim Uniform(\{0.7, 0.8, 0.9\})\] \[Y \vert X \sim Geometric(X)\]
  • And the other example: \[X \sim Exponential(1)\] \[Y \vert X \sim Exponential(X)\]

Conditional density given a random variable

  • Concretely, if \((X,Y)\) is an absolutely continuous random vector whose joint distribution has a density \(f_{X,Y}\), then for every \(y\) where \(f_Y(y)>0\), we can define the conditional density of \(X\) given \(Y=y\) by \[f_{X|Y=y}(x) = \frac {f_{X,Y}(x,y)}{f_Y(y)}\] for every \(x\).

  • Note that “given \(Y=y\)” is just a name here; remember that \(\mathbb{P}(Y=y)=0\) for every \(y\).
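For the absolutely continuous example above (\(X \sim Exponential(1)\), \(Y \vert X \sim Exponential(X)\)), the conditional density can be evaluated numerically; a sketch (the function names are ours, and the marginal is computed by numerical integration):

f_XY <- function(x, y) dexp(x, rate = 1) * dexp(y, rate = x)        # joint density f_{X,Y}
f_Y <- function(y) integrate(function(x) f_XY(x, y), 0, Inf)$value  # marginal density f_Y
f_X_given_Y <- function(x, y) f_XY(x, y) / f_Y(y)                   # conditional density f_{X|Y=y}(x)
f_X_given_Y(1, 2)  # e.g., the conditional density of X at x = 1, given Y = 2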

Conditional density given a random variable - cont.

  • By composition, we define \[\omega \xrightarrow[]{Y} y \xrightarrow[]{f_{X|Y=\cdot}} f_{X|Y=y}\] which is defined for all \(\omega\) in \(\Omega\), except for a negligible set.

  • In other words, we defined a mapping from \(\Omega\) to the set of densities over \(\mathbb{R}\).

  • In other words, this defines a new random density, called \(f_{X|Y}\).

Bayesian setting

Given random variables or vectors such as \(X\), \(Y\) in our examples above, we may call \(X\) the parameters, and \(Y\) the observed.

  • We typically define the model using:
    • the prior - a density \(f_X\) over the parameters \(X\)
    • the likelihood - a family of conditional densities \(x \mapsto f_{Y|X=x}\)
  • Our goal is to estimate:
    • the posterior - the conditional density \(f_{X|Y=y}\), specifically for the data \(y\)
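A minimal sketch of this on the discrete example above: condition on a hypothetical observed value \(y\) and tabulate the resulting distribution of \(X\), which approximates the posterior (y_observed = 3 is an arbitrary illustrative choice):

y_observed <- 3                   # a hypothetical observed data point
Omega_XY_ %>%
  filter(Y == y_observed) %>%     # keep only the outcomes compatible with the data
  count(X) %>%
  mutate(posterior = n / sum(n))  # empirical approximation of the posterior f_{X|Y=y}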