Probability basics for Bayesian analysis

A motivating example

To estimate how many times I sneeze per hour, I wait and measure the time till my fourth sneeze.

  • \(X\) - unobserved (parameter) - my rate of sneezes per hour

  • \(Y\) - observed - the time till my fourth sneeze

  • Prior: \(X \sim \Gamma(2,6)\)

  • Likelihood: \(Y \vert X \sim \Gamma(4,X)\)

  • Posterior: \(X \vert Y \sim \Gamma(2+4,6+Y)\) (a code sketch of this update follows)
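
As a rough sketch of how this conjugate update could look in code (the observed waiting time y_obs below is a hypothetical value chosen only for illustration):

y_obs <- 10   # hypothetical observed time (hours) till the fourth sneeze

prior_draws     <- rgamma(1000, 2, 6)              # X ~ Gamma(2, 6)
posterior_draws <- rgamma(1000, 2 + 4, 6 + y_obs)  # X | Y = y_obs ~ Gamma(2 + 4, 6 + y_obs)

c(prior_mean = mean(prior_draws),
  posterior_mean = mean(posterior_draws))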

A motivating example - cont.

rgamma(1000, 2, 6) %>% qplot()

A motivating example - cont.

(tibble(x=seq(0,2,0.01),
        density=dgamma(x, 2, 6)) %>%
  ggplot(aes(x,density)) + mytheme
  + geom_area())

A motivating example - cont.

  • Waiting for one event of rate \(x\) is distributed \(\mathrm{Exp}(x)\), and \(\mathrm{Exp}(x) = \Gamma(1,x)\) (1-“shape”, \(x\)-“rate”).

  • Waiting for 4 events of rate \(x\) is distributed \(\Gamma(4,x)\) (4-shape, \(x\)-rate); a small simulation check follows.
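
A minimal simulation sketch of this fact (the rate x below is an arbitrary illustrative value): the sum of 4 independent \(\mathrm{Exp}(x)\) waiting times should behave like a \(\Gamma(4,x)\) draw.

x <- 0.5   # an illustrative rate

sum_of_waits <- replicate(10000, sum(rexp(4, rate = x)))  # wait for 4 events, one at a time
direct_gamma <- rgamma(10000, shape = 4, rate = x)        # draw the total wait directly

c(mean(sum_of_waits), mean(direct_gamma))   # both should be close to 4 / x = 8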

Background

What distinguishes probability theory from measure theory?

Kolmogorov’s Foundations of the Theory of Probability

Sample space

  • \(\Omega\) - a set

  • Each \(\omega\) in \(\Omega\) is one possible outcome in our uncertain world.

  • In computational uses, \(\Omega\) is finite.

  • But when we phrase a statistical model, it is often an infinite mathematical construct.

Sample space example

N <- 100
set.seed(31)
Omega_finite <- tibble(X = rgamma(N, 2, 6),
                       Y = rgamma(N, 4, X))

Omega_finite %>% kable()
X Y
0.2614719 6.375819
0.2138068 13.882671
0.2260581 31.840656
0.6520004 3.498244
0.1674620 31.098390
0.3659386 24.811430
0.2862107 16.811617
0.0819444 23.694354
0.0397472 210.982678
0.2405861 31.212829
0.4348955 6.498370
0.1159619 49.549171
1.2929998 2.958485
0.1931940 37.932899
0.2535873 34.449692
0.2134041 9.452900
0.2050184 23.788414
0.4423407 4.345751
0.4228450 14.891966
0.0856366 26.527150
0.0741577 53.736278
0.4213621 12.743175
0.1966911 20.973970
0.0180597 193.221812
0.1155568 34.940974
0.2431928 17.340992
0.2724593 10.477737
0.0442079 118.209978
0.2262941 17.727235
0.7087579 4.953093
0.3510747 6.627882
0.4372039 8.045685
0.3626183 4.192516
0.4547042 3.198129
0.2749661 16.162527
0.5460077 5.778064
0.2386606 11.486853
0.1236775 41.201705
0.5858767 7.095772
0.3355599 9.935324
0.2707641 16.941205
0.0692056 48.188094
0.5268671 4.399818
0.8421145 3.800355
0.4928962 6.256317
0.1952022 53.689639
0.4827437 11.122870
0.4343758 8.980154
0.9312470 8.886762
0.1548861 19.690230
0.6090559 9.066599
0.5128833 10.457306
0.2859819 7.267587
0.1985435 14.223205
0.6939549 4.034420
0.5590719 4.592477
0.2787074 10.043072
0.2683953 17.215678
0.3662903 15.457434
0.5969167 2.958815
0.1678776 9.688477
0.3537971 7.852522
0.4488835 14.703632
0.1498778 39.694465
0.1815088 16.617960
0.5520281 7.827161
0.5138644 11.601048
0.8351950 2.462199
0.8163396 4.603685
0.1543613 8.869716
0.1891107 19.166799
0.1784519 21.352133
0.5623886 8.372720
0.2314933 11.100859
0.3914930 9.101875
0.1936554 30.328917
0.3082267 16.012841
0.6272152 6.813958
0.3532382 11.866320
0.3038131 10.103965
0.1858272 18.287051
0.1946054 15.774182
0.2667366 25.697924
0.1965498 33.094091
0.0441445 36.165160
0.2619324 9.368694
0.1944936 25.235366
0.2751434 5.828970
0.2183396 13.909186
0.2330591 4.418328
0.4273974 12.804499
0.3440438 19.628506
0.2807141 21.522079
0.2276425 15.781731
0.1098608 43.928327
0.2856966 9.183774
0.4002635 11.916558
0.1680154 13.956158
0.3150993 10.917061
0.1901269 15.844127

Sample space example - cont.

(Omega_finite %>%
  ggplot(aes(X,Y))
  + geom_point(size=5))

Sample space - remarks

  • Usually we do not actually care about the elements \(\omega\) in \(\Omega\).

  • We think in terms of events and random variables.

  • The sample space behind them is implicit.

Events

  • Subsets of \(\Omega\) that we care about are called “events”.

Events example

“all outcomes \(\omega\) in \(\Omega_{finite}\) in which I waited at least five hours”

Omega_finite %>%
  filter(Y>5) %>%
  kable()
X Y
0.2614719 6.375819
0.2138068 13.882671
0.2260581 31.840656
0.1674620 31.098390
0.3659386 24.811430
0.2862107 16.811617
0.0819444 23.694354
0.0397472 210.982678
0.2405861 31.212829
0.4348955 6.498370
0.1159619 49.549171
0.1931940 37.932899
0.2535873 34.449692
0.2134041 9.452900
0.2050184 23.788414
0.4228450 14.891966
0.0856366 26.527150
0.0741577 53.736278
0.4213621 12.743175
0.1966911 20.973970
0.0180597 193.221812
0.1155568 34.940974
0.2431928 17.340992
0.2724593 10.477737
0.0442079 118.209978
0.2262941 17.727235
0.3510747 6.627882
0.4372039 8.045685
0.2749661 16.162527
0.5460077 5.778064
0.2386606 11.486853
0.1236775 41.201705
0.5858767 7.095772
0.3355599 9.935324
0.2707641 16.941205
0.0692056 48.188094
0.4928962 6.256317
0.1952022 53.689639
0.4827437 11.122870
0.4343758 8.980154
0.9312470 8.886762
0.1548861 19.690230
0.6090559 9.066599
0.5128833 10.457306
0.2859819 7.267587
0.1985435 14.223205
0.2787074 10.043072
0.2683953 17.215678
0.3662903 15.457434
0.1678776 9.688477
0.3537971 7.852522
0.4488835 14.703632
0.1498778 39.694465
0.1815088 16.617960
0.5520281 7.827161
0.5138644 11.601048
0.1543613 8.869716
0.1891107 19.166799
0.1784519 21.352133
0.5623886 8.372720
0.2314933 11.100859
0.3914930 9.101875
0.1936554 30.328917
0.3082267 16.012841
0.6272152 6.813958
0.3532382 11.866320
0.3038131 10.103965
0.1858272 18.287051
0.1946054 15.774182
0.2667366 25.697924
0.1965498 33.094091
0.0441445 36.165160
0.2619324 9.368694
0.1944936 25.235366
0.2751434 5.828970
0.2183396 13.909186
0.4273974 12.804499
0.3440438 19.628506
0.2807141 21.522079
0.2276425 15.781731
0.1098608 43.928327
0.2856966 9.183774
0.4002635 11.916558
0.1680154 13.956158
0.3150993 10.917061
0.1901269 15.844127

Events example - cont.

Omega_finite %>%
  filter(Y>5) %>%
  nrow()
[1] 86

Event space

  • \(\mathcal{F}\) - the event space - is the set of events that can be conceptualized in our model of the world.
    • In other words, the events for which it makes sense to ask whether they occur or not.
  • \(\mathcal{F}\) is assumed to be a so-called \(\sigma\)-algebra.
    • That is, it contains \(\Omega\) and is closed under complements and countable unions, so events can be combined in the usual ways.

Event space example

In this sample space:

tibble(U=c(1,1,2,2),
       V=c(3,4,3,4)) %>%
kable()
U V
1 3
1 4
2 3
2 4

if \(U\) is part of our model of the world, but \(V\) is not, then our event space contains events such as \(U=1\) but not events such as \(V=3\).
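
In code, an event like \(U=1\) is simply a subset of the rows of the sample space; a minimal sketch (the name Omega_small is introduced here just for this illustration):

Omega_small <- tibble(U = c(1, 1, 2, 2),
                      V = c(3, 4, 3, 4))

Omega_small %>% filter(U == 1)   # the event "U = 1": the two outcomes with U equal to 1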

Varying the event space

  • When we talk about conditional probability, etc.,
  • .. it can always be phrased by conditioning on a different event space.
  • .. but we will not use this terminology today.

Random variables

  • A random variable is a function \(\Omega \to \mathbb{R}\).
  • .. which is “measurable” with respect to the event space \(\mathcal{F}\).

Random variables example

Omega_finite %>% kable()
X Y
0.2614719 6.375819
0.2138068 13.882671
0.2260581 31.840656
0.6520004 3.498244
0.1674620 31.098390
0.3659386 24.811430
0.2862107 16.811617
0.0819444 23.694354
0.0397472 210.982678
0.2405861 31.212829
0.4348955 6.498370
0.1159619 49.549171
1.2929998 2.958485
0.1931940 37.932899
0.2535873 34.449692
0.2134041 9.452900
0.2050184 23.788414
0.4423407 4.345751
0.4228450 14.891966
0.0856366 26.527150
0.0741577 53.736278
0.4213621 12.743175
0.1966911 20.973970
0.0180597 193.221812
0.1155568 34.940974
0.2431928 17.340992
0.2724593 10.477737
0.0442079 118.209978
0.2262941 17.727235
0.7087579 4.953093
0.3510747 6.627882
0.4372039 8.045685
0.3626183 4.192516
0.4547042 3.198129
0.2749661 16.162527
0.5460077 5.778064
0.2386606 11.486853
0.1236775 41.201705
0.5858767 7.095772
0.3355599 9.935324
0.2707641 16.941205
0.0692056 48.188094
0.5268671 4.399818
0.8421145 3.800355
0.4928962 6.256317
0.1952022 53.689639
0.4827437 11.122870
0.4343758 8.980154
0.9312470 8.886762
0.1548861 19.690230
0.6090559 9.066599
0.5128833 10.457306
0.2859819 7.267587
0.1985435 14.223205
0.6939549 4.034420
0.5590719 4.592477
0.2787074 10.043072
0.2683953 17.215678
0.3662903 15.457434
0.5969167 2.958815
0.1678776 9.688477
0.3537971 7.852522
0.4488835 14.703632
0.1498778 39.694465
0.1815088 16.617960
0.5520281 7.827161
0.5138644 11.601048
0.8351950 2.462199
0.8163396 4.603685
0.1543613 8.869716
0.1891107 19.166799
0.1784519 21.352133
0.5623886 8.372720
0.2314933 11.100859
0.3914930 9.101875
0.1936554 30.328917
0.3082267 16.012841
0.6272152 6.813958
0.3532382 11.866320
0.3038131 10.103965
0.1858272 18.287051
0.1946054 15.774182
0.2667366 25.697924
0.1965498 33.094091
0.0441445 36.165160
0.2619324 9.368694
0.1944936 25.235366
0.2751434 5.828970
0.2183396 13.909186
0.2330591 4.418328
0.4273974 12.804499
0.3440438 19.628506
0.2807141 21.522079
0.2276425 15.781731
0.1098608 43.928327
0.2856966 9.183774
0.4002635 11.916558
0.1680154 13.956158
0.3150993 10.917061
0.1901269 15.844127

Random variables example - cont.

Omega_finite$X[13]
[1] 1.293

Random variables coexist

c(Omega_finite$X[13],
  Omega_finite$Y[13])
[1] 1.293000 2.958485

Random vectors

\((X,Y)\) may be considered a random vector, viewed as a function \(\Omega_{finite} \to \mathbb{R}^2\) \[\omega \mapsto (X(\omega), Y(\omega))\]

Events of random variables

\[Y>5\] means the subset of \(\Omega_{finite}\): \[\{\omega \in \Omega_{finite} \vert Y(\omega)>5\}\]

Y_is_more_than_five <- 
  Omega_finite %>% 
  filter(Y>5)
nrow(Y_is_more_than_five)
[1] 86

Events of random variables - cont.

\[(Y \in [5,9], X<0.3)\] means the subset of \(\Omega_{finite}\): \[= \{\omega \in \Omega_{finite} \vert Y(\omega) \in [5,9], X(\omega)<0.3\}\] \[= \{\omega \in \Omega_{finite} \vert 5 \leq Y(\omega) \leq 9, X(\omega)<0.3\}\]
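
Following the same pattern as before, this event can be extracted from \(\Omega_{finite}\) (count not shown here):

Omega_finite %>%
  filter(Y >= 5, Y <= 9, X < 0.3) %>%
  nrow()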

Probability

  • A probability measure \(\mathbb{P}\) is a function \(\mathcal{F} \to [0,1]\), satisfying Kolmogorov’s axioms:
    • \(\mathbb{P}(E) \geq 0\) for all \(E\) in \(\mathcal{F}\)
    • \(\mathbb{P}(\Omega) = 1\)
    • \(\mathbb{P}(\bigcup_{i=1,2,...} E_i) = \sum_{i=1,2,...} \mathbb{P}(E_i)\) for pairwise disjoint events
  • A sample space with a probability measure is called a probability space.

Probability example

For our finite example, we may define the probability of an event to be proportional to its number of outcomes.

P_finite <- (function(event)
  nrow(event)/N)

P_finite(Y_is_more_than_five)
[1] 0.86
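
As a quick check of the additivity axiom in this finite setting: the event \(Y>5\) and its complement \(Y \leq 5\) are disjoint and together cover \(\Omega_{finite}\), so their probabilities should sum to 1.

P_finite(Omega_finite %>% filter(Y > 5)) +
  P_finite(Omega_finite %>% filter(Y <= 5))   # 0.86 + 0.14 = 1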

Distribution

A random variable \(Y: \Omega \to \mathbb{R}\) pushes a probability measure \(\mathbb{P}\) over \(\Omega\) to a probability measure \(P_Y\) over \(\mathbb{R}\), called its distribution.

Distribution example

\[P_X((0,0.3)) =\] \[\mathbb{P}(X \in (0,0.3)) =\] \[\mathbb{P}(0 < X < 0.3)\]

Distribution example - cont.

(Omega_finite %>% ggplot() + mytheme
  + geom_histogram(aes(X,..density..), bins=100)
  + geom_segment(x=0, xend=0.3, y=0, yend=0, size=10,
                 color="darkgreen", alpha=0.01)
  + geom_vline(xintercept=c(0,0.3), color="darkgreen"))
Omega_finite %>% filter(0 < X & X < 0.3) %>% P_finite()
[1] 0.57

Joint distribution

Similarly, the distribution of a random vector \((X,Y)\) is a probability measure over the plane \(\mathbb{R}^2\).

It is also called the joint distribution of \(X\) and \(Y\).

Joint distribution example

(ggplot(Omega_finite) + mytheme
  + geom_point(aes(X,Y), size=5)
  + geom_rect(xmin=0, xmax=0.3, ymin=30, ymax=1000,
              fill="darkgreen", alpha=0.01))

Density

A probability distribution \(P_X\) is absolutely continuous, if there is a density function \(f_X\) such that for every region \(D\), \[P_X(D) = \int_D f_X(x) \mathrm{d} x\]
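
A numerical sketch using the \(\Gamma(2,6)\) prior density from the motivating example: integrating \(f_X\) over \((0,0.3)\) should agree with the probability of that interval (computed here with pgamma).

integrate(function(x) dgamma(x, 2, 6), lower = 0, upper = 0.3)$value
pgamma(0.3, 2, 6)   # P(X < 0.3) under Gamma(2, 6); the two numbers should agree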

Density example

(tibble(X=seq(0,2,0.01),
        density=dgamma(X,2,6)) %>%
  ggplot(aes(X,density)) + mytheme
  + geom_area(size=3, alpha=0.4)
  + geom_segment(x=0, xend=0.3, y=0, yend=0, size=10,
                 color="darkgreen", alpha=0.01)
  + geom_vline(xintercept=c(0,0.3), color="darkgreen"))

Joint density example

(expand.grid(X=seq(0,2,0.01), Y=seq(0,100,1)) %>%
  mutate(density = dgamma(X,2,6) * dgamma(Y,4,X)) %>%
  ggplot(aes(X,Y,z=density)) + mytheme
  + geom_raster(aes(fill=density))
  + geom_rect(xmin=0, xmax=0.3, ymin=30, ymax=1000,
              color="lightgreen", alpha=0.001))

Expectation

  • For a random variable \(X\) with density \(f_X\), its expectation is defined (when the integral is well-defined) by: \[\mathbb{E}(X) = \int x f_X(x) \mathrm{d} x\]

  • It is actually a special case of a more general notion, which does not require a density.

  • Note that the expectation is determined by the distribution (a numerical check follows).
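
For the \(\Gamma(2,6)\) prior this can be checked numerically: the integral, the known mean \(2/6\), and the sample average over \(\Omega_{finite}\) should all be close.

integrate(function(x) x * dgamma(x, 2, 6), lower = 0, upper = Inf)$value   # = 2/6
mean(Omega_finite$X)   # sample average of X; close to 1/3 up to sampling noise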

Probability as expectation

  • Given an event \(A\), we define \(1_A\) to be the random variable such that:
    • \(1_A(\omega)=1\) if \(\omega \in A\)
    • \(1_A(\omega)=0\) if \(\omega \notin A\)
  • Then \[\mathbb{P}(A) = \mathbb{E}(1_A)\] as the check below illustrates.
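
In the finite example this is just an average of zeros and ones; it reproduces the probability 0.86 computed earlier for \(Y>5\).

mean(Omega_finite$Y > 5)   # E(1_{Y > 5}) = P(Y > 5) = 0.86 in the finite example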

Conditioning

Conditional probability given an event

  • Given events \(A\), \(B\), such that \(\mathbb{P}(B)>0\), the conditional probability of \(A\) given \(B\) is \[ \mathbb{P}(A|B) = \frac {\mathbb{P}(A \cap B)} {\mathbb{P}(B)}\]

  • When also \(\mathbb{P}(A)>0\), we get Bayes’ formula (checked numerically below): \[ \mathbb{P}(A|B) = \mathbb{P}(B|A) \frac {\mathbb{P}(A)} {\mathbb{P}(B)}\]
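
A small numerical check with P_finite, using the events \(A = \{X<0.3\}\) and \(B = \{Y>10\}\) (chosen here only for illustration); the definition and Bayes’ formula should give the same number (values not shown).

A       <- Omega_finite %>% filter(X < 0.3)
B       <- Omega_finite %>% filter(Y > 10)
A_and_B <- Omega_finite %>% filter(X < 0.3, Y > 10)

P_finite(A_and_B) / P_finite(B)                                # P(A | B) by definition
(P_finite(A_and_B) / P_finite(A)) * P_finite(A) / P_finite(B)  # P(B | A) * P(A) / P(B)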

Conditional distribution given an event

  • Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional distribution of \(X\) given \(B\) is defined by pushing the probability measure \(\mathbb{P}(\cdot |B)\) from \(\Omega\) to \(\mathbb{R}\): \[P_{X|B} (D) = \mathbb{P}(X \in D|B)\] for every region \(D \subset \mathbb{R}\).

  • For example (computed for the finite sample below): \[P_{X|Y>5} ((0,0.3)) = \mathbb{P}(0<X<0.3 | Y>5)\]
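
In the finite sample, this is the share of the \(Y>5\) outcomes that also have \(X<0.3\) (value not shown):

P_finite(Y_is_more_than_five %>% filter(X < 0.3)) /
  P_finite(Y_is_more_than_five)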

Conditional distribution example

(Omega_finite %>%
  ggplot(aes(X,..density..)) + mytheme
  + geom_histogram())

Conditional distribution example - cont.

(ggplot(Omega_finite) + mytheme
  + geom_point(aes(X,Y), size=5)
  + geom_rect(xmin=0, xmax=2, ymin=10, ymax=1000,
              fill="#00BFC4", alpha=0.01))

Conditional distribution example - cont.

(ggplot(Omega_finite) + mytheme
  + geom_point(aes(X,Y,color=Y>10), size=5))

Conditional distribution - cont.

(Omega_finite %>%
    ggplot(aes(X,..density..)) + mytheme
  + geom_histogram(aes(fill=factor(Y>10)),
                   position="identity",
                   alpha=0.8))

Conditional expectation given an event

Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional expectation \(\mathbb{E}(X \vert B)\) is defined as the expectation of the conditional distribution.

Omega_finite %>%
  filter(Y>10) %>%
  pull(X) %>%
  mean()
[1] 0.2342998

Conditional density given an event

If a conditional distribution has a density, we call it a “conditional density”.

Conditional density - cont.

  • Suppose \((X,Y)\) is an absolutely continuous random vector whose joint distribution has a density \(f_{X,Y}\), and assume that \(\mathbb{P}(1.9<Y<2.1)>0\).

  • Then we can look at the conditional density of \(X\) given the event \(1.9<Y<2.1\): \[f_{X|1.9<Y<2.1}(x) = \frac {\int_{1.9}^{2.1} f_{(X,Y)}(x,y) \mathrm{d}y} {\mathbb{P}(1.9<Y<2.1)}\] for every \(x\).

Conditional density - cont.

  • Indeed, for every \(a,b\) such that \(a<b\), \[\mathbb{P}(a<X<b | 1.9<Y<2.1) =\] \[\frac {\mathbb{P}(a<X<b , 1.9<Y<2.1)} {\mathbb{P}(1.9<Y<2.1)} =\]

\[\frac {\int_a^b \int_{1.9}^{2.1} f_{(X,Y)}(x,y) \mathrm{d}y \mathrm{d}x} {\mathbb{P}(1.9<Y<2.1)}\]

Conditional density - cont.

Now, what happens when we replace \(1.9\) and \(2.1\) with numbers that approach a limit \(y_0\)?

Conditional density - cont.

Intuitively, for a given \(x\) and \(y_0\),

\[\frac {\int_{y_0-\delta}^{y_0+\delta} f_{(X,Y)}(x,y) \mathrm{d}y} {\mathbb{P}(y_0-\delta<Y<y_0+\delta)} = \] \[ \frac {\int_{y_0-\delta}^{y_0+\delta} f_{(X,Y)}(x,y) \mathrm{d}y} {\int_{y_0-\delta}^{y_0+\delta} f_Y(y) \mathrm{d}y} \approx \] \[ \frac {2 \delta \, f_{(X,Y)}(x,y_0)} {2 \delta \, f_Y(y_0)} = \] \[ \frac {f_{(X,Y)}(x,y_0)} {f_Y(y_0)}\] where the approximation holds for small \(\delta>0\).

Conditional density given a random variable

  • Assume \((X,Y)\) is a random vector whose joint distribution has a density \(f_{X,Y}\). Then for every \(y\) where \(f_Y(y)>0\), we can define the conditional density of \(X\) given \(Y=y\) by \[f_{X|Y=y}(x) = \frac {f_{X,Y}(x,y)}{f_Y(y)}\] for every \(x\).

  • Note that this is just a name; remember that \(\mathbb{P}(Y=y)=0\) for every \(y\). (A numerical check follows.)
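
For the motivating model, this definition can be checked numerically against the posterior \(\Gamma(2+4,6+Y)\) claimed at the start; the point (x0, y0) below is an arbitrary illustrative choice.

x0 <- 0.4; y0 <- 10   # illustrative point

f_XY <- function(x, y) dgamma(x, 2, 6) * dgamma(y, 4, x)   # joint density of (X, Y)
f_Y  <- function(y)                                        # marginal density of Y
  integrate(function(x) f_XY(x, y), lower = 1e-10, upper = Inf)$value   # lower bound just above 0 avoids a zero rate

f_XY(x0, y0) / f_Y(y0)      # f_{X | Y = y0}(x0) from the definition
dgamma(x0, 2 + 4, 6 + y0)   # posterior density Gamma(2 + 4, 6 + y0); should match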

Conditional density given a random variable - cont.

  • Now, for every \(x\) we compose the mapping \(y \mapsto f_{X|Y=y}(x)\) with the random variable \(Y\): \[\omega \xrightarrow[]{Y} y \xrightarrow[]{} f_{X|Y=y}(x) = \frac {f_{X,Y}(x,y)}{f_Y(y)}\]

  • This way, we get a random variable that we call \(f_{X|Y}(x)\): \[f_{X|Y}(x) = \frac {f_{X,Y}(x,Y)}{f_Y(Y)}\]

  • And we may view this as a random density function.

Conditioning on a random variable

  • This way, we can also define conditional distributions, probabilities, and expectations given a random variable.

  • These are all random objects defined in our probability space.

  • We can characterize them in a way that generalizes to more general cases (without a density).