\(\Omega\) - a set
Each \(\omega\) in \(\Omega\) is one possible outcome in our uncertain world.
In computational uses, \(\Omega\) is finite.
But when we phrase a statistical model, it is often an infinite mathematical construct.
Usually we do not actually care about the elements \(\omega\) in \(\Omega\).
We think in terms of events and random variables.
The sample space behind them is implicit.
Subsets that we care about are called “events”.
E.g., \(H\)=“all outcomes \(\omega\) in which Piglet meets a Heffalump” may be an event.
\(\mathcal{F}\) - the event space, is the set of events that can be conceptualized in our model of the world.
In other words, events for which it makes sense to ask whether or not they occur.
E.g., if we do not know what Heffalumps are like, then the event \(H\) above should not be in our \(\mathcal{F}\).
\(\mathcal{F}\) is assumed to be a so-called \(\sigma\)-algebra: it contains \(\Omega\) itself and is closed under complements and countable unions.
These closure properties make it a sensible collection of events: we can combine events with "not", "or", and "and" and stay inside \(\mathcal{F}\).
```r
# X(omega): for each outcome omega, a value in {0.7, 0.8, 0.9}
X <- function(omega) {
  set.seed(omega)
  sample((7:9)/10, 1)
}

# Y(omega): a geometric count with success probability X(omega)
Y <- function(omega) {
  set.seed(omega)
  rgeom(1, X(omega))
}
```
```r
list(x = X(52),
     y = Y(52))
```

```
$x
[1] 0.9

$y
[1] 0
```
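The table below can be reproduced by evaluating \(X\) and \(Y\) over all of \(\Omega = \{1, \ldots, 100\}\) (a sketch; the exact values depend on R's random number generator):

```r
# tabulate the random variables over the finite sample space Omega = 1:100
Omega <- 1:100
tbl <- data.frame(omega = Omega,
                  X = sapply(Omega, X),
                  Y = sapply(Omega, Y))
head(tbl)
```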
Following McElreath’s example:
omega | X | Y |
---|---|---|
1 | 0.7 | 0 |
2 | 0.7 | 0 |
3 | 0.7 | 0 |
4 | 0.9 | 0 |
5 | 0.8 | 1 |
6 | 0.7 | 0 |
7 | 0.8 | 0 |
8 | 0.9 | 0 |
9 | 0.9 | 0 |
10 | 0.9 | 0 |
11 | 0.8 | 2 |
12 | 0.8 | 1 |
13 | 0.9 | 0 |
14 | 0.7 | 1 |
15 | 0.7 | 3 |
16 | 0.7 | 0 |
17 | 0.8 | 0 |
18 | 0.8 | 1 |
19 | 0.7 | 0 |
20 | 0.8 | 0 |
21 | 0.9 | 0 |
22 | 0.8 | 1 |
23 | 0.7 | 1 |
24 | 0.9 | 0 |
25 | 0.9 | 0 |
26 | 0.9 | 0 |
27 | 0.7 | 2 |
28 | 0.7 | 1 |
29 | 0.7 | 1 |
30 | 0.8 | 0 |
31 | 0.7 | 1 |
32 | 0.9 | 0 |
33 | 0.8 | 0 |
34 | 0.7 | 0 |
35 | 0.8 | 1 |
36 | 0.7 | 0 |
37 | 0.8 | 1 |
38 | 0.7 | 0 |
39 | 0.7 | 0 |
40 | 0.9 | 0 |
41 | 0.9 | 0 |
42 | 0.7 | 0 |
43 | 0.7 | 0 |
44 | 0.7 | 0 |
45 | 0.7 | 0 |
46 | 0.8 | 0 |
47 | 0.9 | 0 |
48 | 0.7 | 0 |
49 | 0.9 | 0 |
50 | 0.9 | 0 |
51 | 0.9 | 0 |
52 | 0.9 | 0 |
53 | 0.8 | 1 |
54 | 0.9 | 0 |
55 | 0.8 | 0 |
56 | 0.7 | 0 |
57 | 0.8 | 0 |
58 | 0.9 | 0 |
59 | 0.9 | 0 |
60 | 0.9 | 0 |
61 | 0.9 | 0 |
62 | 0.8 | 0 |
63 | 0.9 | 0 |
64 | 0.9 | 0 |
65 | 0.8 | 0 |
66 | 0.7 | 0 |
67 | 0.9 | 0 |
68 | 0.7 | 1 |
69 | 0.7 | 0 |
70 | 0.9 | 0 |
71 | 0.9 | 0 |
72 | 0.7 | 1 |
73 | 0.7 | 1 |
74 | 0.8 | 1 |
75 | 0.7 | 0 |
76 | 0.7 | 0 |
77 | 0.8 | 0 |
78 | 0.9 | 0 |
79 | 0.8 | 0 |
80 | 0.9 | 0 |
81 | 0.9 | 0 |
82 | 0.7 | 0 |
83 | 0.7 | 1 |
84 | 0.8 | 1 |
85 | 0.7 | 0 |
86 | 0.9 | 0 |
87 | 0.8 | 0 |
88 | 0.9 | 1 |
89 | 0.7 | 1 |
90 | 0.8 | 0 |
91 | 0.7 | 0 |
92 | 0.8 | 0 |
93 | 0.7 | 0 |
94 | 0.8 | 0 |
95 | 0.9 | 0 |
96 | 0.9 | 0 |
97 | 0.7 | 1 |
98 | 0.8 | 0 |
99 | 0.7 | 2 |
100 | 0.8 | 0 |
\((X,Y)\) may be considered a random vector, viewed as a function \(\Omega \to \mathbb{R}^2\) \[\omega \mapsto (X(\omega), Y(\omega))\]
\(Y>0\) means the subset of \(\Omega\): \[\{\omega \in \Omega \vert Y(\omega)>0\}\]
\(Y \in [1,2], X=0.9\) means the subset of \(\Omega\): \[\{\omega \in \Omega \vert Y(\omega) \in [1,2], X(\omega)=0.9\}\] \[= \{\omega \in \Omega \vert 1 \leq Y(\omega) \leq 2, X(\omega)=0.9\}\]
(the relationship to the sample space is not explicit anymore)
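Under the finite \(\Omega = \{1, \ldots, 100\}\) from the table above, these subsets can be computed explicitly (a sketch using the \(X\) and \(Y\) functions defined earlier):

```r
Omega <- 1:100
Xv <- sapply(Omega, X)  # X(omega) for every outcome
Yv <- sapply(Omega, Y)  # Y(omega) for every outcome

Omega[Yv > 0]                         # the event Y > 0 as a subset of Omega
Omega[Yv >= 1 & Yv <= 2 & Xv == 0.9]  # the event Y in [1,2], X = 0.9
```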
```r
library(tibble)
library(dplyr)
library(knitr)

set.seed(1987)
N <- 100  # number of outcomes; the table below has 100 rows
Omega_XY_ <- tibble(
  X = sample((4:6)/10, N, replace = TRUE),
  Y = rgeom(N, X))

Omega_XY_ %>%
  kable()
```
X | Y |
---|---|
0.4 | 0 |
0.5 | 0 |
0.5 | 0 |
0.4 | 0 |
0.6 | 0 |
0.4 | 0 |
0.4 | 4 |
0.5 | 0 |
0.4 | 3 |
0.6 | 0 |
0.6 | 0 |
0.6 | 0 |
0.6 | 1 |
0.6 | 0 |
0.4 | 2 |
0.5 | 3 |
0.5 | 0 |
0.6 | 0 |
0.5 | 0 |
0.5 | 0 |
0.5 | 0 |
0.6 | 2 |
0.5 | 3 |
0.5 | 1 |
0.5 | 0 |
0.5 | 3 |
0.4 | 3 |
0.5 | 2 |
0.6 | 1 |
0.6 | 1 |
0.5 | 1 |
0.5 | 0 |
0.6 | 2 |
0.4 | 1 |
0.6 | 0 |
0.6 | 0 |
0.5 | 0 |
0.4 | 2 |
0.6 | 0 |
0.5 | 2 |
0.6 | 1 |
0.4 | 0 |
0.4 | 2 |
0.6 | 3 |
0.6 | 1 |
0.6 | 0 |
0.6 | 0 |
0.4 | 1 |
0.4 | 3 |
0.4 | 0 |
0.5 | 0 |
0.6 | 1 |
0.4 | 0 |
0.4 | 1 |
0.4 | 1 |
0.4 | 0 |
0.5 | 0 |
0.5 | 0 |
0.6 | 1 |
0.5 | 0 |
0.5 | 0 |
0.6 | 0 |
0.4 | 0 |
0.5 | 1 |
0.4 | 1 |
0.4 | 0 |
0.5 | 0 |
0.6 | 5 |
0.5 | 0 |
0.6 | 0 |
0.5 | 1 |
0.6 | 0 |
0.5 | 0 |
0.4 | 0 |
0.5 | 0 |
0.6 | 0 |
0.4 | 0 |
0.6 | 1 |
0.4 | 1 |
0.6 | 0 |
0.6 | 1 |
0.5 | 0 |
0.6 | 1 |
0.4 | 0 |
0.5 | 0 |
0.5 | 6 |
0.5 | 1 |
0.6 | 0 |
0.6 | 0 |
0.4 | 2 |
0.4 | 9 |
0.6 | 0 |
0.6 | 0 |
0.4 | 0 |
0.5 | 0 |
0.4 | 9 |
0.4 | 0 |
0.5 | 1 |
0.4 | 1 |
0.5 | 1 |
For our finite example, we may define probabilities proportional to the number of outcomes, i.e., the uniform measure on \(\Omega\): \(\mathbb{P}(A) = |A| / |\Omega|\).
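With the uniform measure, probabilities become proportions of rows (a sketch assuming the `Omega_XY_` tibble from the chunk above):

```r
# P(A) = |A| / |Omega|: probabilities are proportions of outcomes
mean(Omega_XY_$Y > 0)     # P(Y > 0)
mean(Omega_XY_$X == 0.5)  # P(X = 0.5)
```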
A random variable \(Y: \Omega \to \mathbb{R}\) pushes a probability measure \(\mathbb{P}\) on \(\Omega\) forward to a probability measure \(P_Y\) on \(\mathbb{R}\), called the distribution of \(Y\).
\[P_Y((0,\infty)) = \mathbb{P}(Y \in (0,\infty)) = \mathbb{P}(Y > 0)\]
Similarly, the distribution of a random vector \((X,Y)\) is a probability measure over the plane \(\mathbb{R}^2\).
It is also called the joint distribution of \(X\) and \(Y\).
A probability distribution \(P\) is

- discrete, if it is defined by a sequence of values \(y_1, y_2, \ldots\) such that for every region \(D\), \[P(D) = \sum_{i \vert y_i \in D} P(\{y_i\})\]
- continuous, if \(P(\{y\})=0\) for every \(y\)
- absolutely continuous, if there is a density function \(f\) such that for every region \(D\), \[P(D) = \int_D f(y) \, \mathrm{d} y\]
In this example, the joint distribution of \((X,Y)\) is discrete: \(X\) takes three values and \(Y\) is integer-valued.
For a discrete random variable \(Y\) taking values \(y_1, y_2, ...\), the expectation of \(Y\) is defined (when the series converges): \[ \mathbb{E}(Y) = \sum_i y_i \mathbb{P}(Y=y_i)\]
For an absolutely continuous random variable \(Y\) with density \(f_Y\), it is defined (when the integral is well-defined): \[\mathbb{E}(Y) = \int y f_Y(y) \mathrm{d} y\]
These are both special cases of a general notion, the Lebesgue integral (defined when \(Y\) is Lebesgue integrable): \[\mathbb{E}(Y) = \int_\Omega Y \, \mathrm{d}\mathbb{P}\]
These notions depend only on the distribution of \(Y\).
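For the empirical table, the discrete-sum definition and the plain average agree (a sketch assuming `Omega_XY_` from above):

```r
# E(Y) via the discrete-sum definition, with empirical probabilities P(Y = y_i)
p <- table(Omega_XY_$Y) / nrow(Omega_XY_)
sum(as.numeric(names(p)) * p)  # equals mean(Omega_XY_$Y)
```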
Given events \(A\), \(B\), such that \(\mathbb{P}(B)>0\), the conditional probability of \(A\) given \(B\) is \[ \mathbb{P}(A|B) = \frac {\mathbb{P}(A \cap B)} {\mathbb{P}(B)}\]
When also \(\mathbb{P}(A)>0\), we get Bayes’ formula: \[ \mathbb{P}(A|B) = \mathbb{P}(B|A) \frac {\mathbb{P}(A)} {\mathbb{P}(B)}\]
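Bayes' formula can be checked on the empirical measure (a sketch; `A` and `B` are illustrative events over `Omega_XY_`):

```r
A <- Omega_XY_$X == 0.5  # the event X = 0.5
B <- Omega_XY_$Y > 0     # the event Y > 0
P <- function(E) mean(E)         # uniform measure on the finite Omega
P(A & B) / P(B)                  # P(A | B)
(P(A & B) / P(A)) * P(A) / P(B)  # P(B | A) * P(A) / P(B): the same number
```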
Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional distribution of \(X\) given \(B\) is defined by pushing the probability measure \(\mathbb{P}(\cdot |B)\) from \(\Omega\) to \(\mathbb{R}\): \[P_{X|B} (D) = \mathbb{P}(X \in D|B)\] for every region \(D \subset \mathbb{R}\).
For example: \[P_{Y|B} ((0,\infty)) = \mathbb{P}(Y>0 |B)\] \[P_{X|B} (\{0.5\}) = \mathbb{P}(X=0.5 |B)\]
Given a random variable \(X:\Omega \to \mathbb{R}\) and an event \(B\) such that \(\mathbb{P}(B)>0\), the conditional expectation \(\mathbb{E}(X \vert B)\) is defined as the expectation of the conditional distribution.
If \(X\) is a random variable, and \(Y\) is a discrete random variable, then we have \[y \mapsto \mathbb{E}(X|Y=y)\] defined over all values \(y\) such that \(\mathbb{P}(Y=y)>0\).
We may now use the composition: \[\omega \xrightarrow[]{Y} y \xrightarrow[]{\mathbb{E}(X|Y=\cdot)} \mathbb{E}(X|Y=y)\] which is defined for all \(\omega\) in \(\Omega\), except for a negligible set. This defines a new random variable, called \(\mathbb{E}(X|Y)\).
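On the empirical example, this composition amounts to a grouped average (a sketch assuming `Omega_XY_` and the dplyr package):

```r
library(dplyr)
# E(X|Y): each outcome gets the average of X over all outcomes
# sharing its value of Y
Omega_XY_ %>%
  group_by(Y) %>%
  mutate(E_X_given_Y = mean(X)) %>%
  ungroup()
```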
This is a special case of a more general notion.
Later we can make this concrete for the absolutely continuous case as well.
For the case above, we see the following:
For every function \(f\) such that the expectation is defined: \[\mathbb{E} (f(Y)\, \mathbb{E}(X|Y)) = \mathbb{E} (f(Y)\, X)\]
Equivalently, for every event \(A\) that depends only on \(Y\), \[\mathbb{E} (1_A\, \mathbb{E}(X|Y)) = \mathbb{E} (1_A\, X)\]
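On the finite empirical measure, the first identity holds exactly, for any \(f\) (a sketch assuming `Omega_XY_` and dplyr):

```r
library(dplyr)
d <- Omega_XY_ %>%
  group_by(Y) %>%
  mutate(EXY = mean(X)) %>%  # the random variable E(X|Y)
  ungroup()

f <- function(y) exp(-y)  # an arbitrary function of Y
all.equal(mean(f(d$Y) * d$EXY),  # E(f(Y) E(X|Y))
          mean(f(d$Y) * d$X))    # E(f(Y) X) -- the same number
```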
The conditional expectation \(\mathbb{E}(X|Y)\) exists for a general \(Y\), assuming \(X\) has an expectation.
It is characterized (up to a change on a negligible set of outcomes) by the two properties above.
Given an event \(A\) and a random variable \(Y\), we define the random variable: \[\mathbb{P}(A\vert Y) = \mathbb{E}(1_A\vert Y)\]
Given random variables \(X\), \(Y\), for every \(\omega \in \Omega\) we can consider the distribution of \(X\) according to the probability measure \(\mathbb{P}(\cdot \vert Y)(\omega)\).
This defines a mapping from \(\Omega\) to the set of probability distributions over \(\mathbb{R}\).
In other words, it is a random distribution. We call it the conditional distribution of \(X\) given \(Y\).
Concretely, if \((X,Y)\) is an absolutely continuous random vector whose joint distribution has a density \(f_{X,Y}\), then for every \(y\) where \(f_Y(y)>0\), we can define the conditional density of \(X\) given \(Y=y\) by \[f_{X|Y=y}(x) = \frac {f_{X,Y}(x,y)}{f_Y(y)}\] for every \(x\).
Note this is just a name; remember that \(\mathbb{P}(Y=y)=0\) for every \(y\).
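A standard concrete instance (a textbook example, not taken from the simulation above): if \((X,Y)\) is bivariate standard normal with correlation \(\rho\), then \(f_Y\) is the standard normal density and \[f_{X|Y=y}(x) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left(-\frac{(x-\rho y)^2}{2(1-\rho^2)}\right)\] i.e., the conditional distribution of \(X\) given \(Y=y\) is \(N(\rho y,\, 1-\rho^2)\).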
By composition, we define \[\omega \xrightarrow[]{Y} y \xrightarrow[]{f_{X|Y=\cdot}} f_{X|Y=y}\] which is defined for all \(\omega\) in \(\Omega\), except for a negligible set (where \(f_Y(Y(\omega))=0\)).
In other words, we defined a mapping from \(\Omega\) to the set of densities over \(\mathbb{R}\).
In other words, this defines a new random density, called \(f_{X|Y}\).
Given random variables or vectors such as \(X\), \(Y\) in our examples above, we may call \(X\) the parameters and \(Y\) the observed data.