Joint vs. Marginal Expectation

Jun 11 2016

Let \(p\) be a (joint) pdf over a set of random variables, which we partition into two random vectors \((\mathbf{x}, \mathbf{y})\). Then by definition, the expected value of a real-valued function \(g\) is

$$\mathbb{E}_{p(\mathbf{x}, \mathbf{y})}[g(\mathbf{x},\mathbf{y})] = \int \int g(\mathbf{x},\mathbf{y}) p(\mathbf{x},\mathbf{y}) d \mathbf{x} d \mathbf{y}$$

which we abbreviate as just \(\mathbb{E}[g(\mathbf{x},\mathbf{y})]\) when it's clear which underlying distribution we're averaging over. Note the shorthand \(\int \cdot \, d \mathbf{x}\) (which I borrowed from the PRML book) means taking iterated integrals over all the scalar random variables in \(\mathbf{x}\): for example, if \(\mathbf{x}=(x_1,x_2,x_3)\), it means \(\int \int \int \cdot \, dx_1 dx_2 dx_3\).
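As a quick numerical sanity check of this definition, here is a sketch that computes \(\mathbb{E}[g(\mathbf{x},\mathbf{y})]\) by iterated integration for a simple hypothetical choice of joint pdf: two independent standard normals, with \(g(x, y) = x^2 + y^2\), whose exact expectation is \(\mathrm{Var}(x) + \mathrm{Var}(y) = 2\).

```python
from scipy.integrate import dblquad
from scipy.stats import norm

# Hypothetical joint pdf: two independent standard normals.
p = lambda x, y: norm.pdf(x) * norm.pdf(y)
g = lambda x, y: x**2 + y**2

# E[g(x, y)] = double integral of g * p; dblquad expects func(y, x),
# and truncating the plane to [-8, 8]^2 loses only negligible tail mass.
val, _ = dblquad(lambda y, x: g(x, y) * p(x, y),
                 -8, 8, lambda x: -8, lambda x: 8)
print(val)  # close to 2.0
```

The integration limits here stand in for the full real plane; for heavier-tailed pdfs they would need to be widened accordingly.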

Now consider the case where \(g\) happens to be a function of only a subset of the random variables, say \(\mathbf{x}\). Then by our earlier definition, \(\mathbb{E}[g(\mathbf{x})]\) should be defined as

$$\mathbb{E}_{p(\mathbf{x}, \mathbf{y})}[g(\mathbf{x})] = \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{x} d\mathbf{y}$$

Fubini's theorem says that under suitable conditions, we're allowed to switch the order of integration, so we can simplify the above as

\begin{align} \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{x} d\mathbf{y} &= \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{y} d\mathbf{x} \\ &= \int \lbrace \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{y} \rbrace d\mathbf{x} \\ &= \int g(\mathbf{x}) \lbrace \int p(\mathbf{x},\mathbf{y}) d\mathbf{y} \rbrace d\mathbf{x} \\ &= \int g(\mathbf{x}) p(\mathbf{x}) d\mathbf{x} \\ &= \mathbb{E}_{p(\mathbf{x})}[g(\mathbf{x})] \end{align}

where we used the fact that \(g(\mathbf{x})\) is a constant in the integral with respect to \(\mathbf{y}\), as well as the definition of the marginal distribution \(p(\mathbf{x})\).

Therefore, in most cases we can safely drop the "irrelevant" variables (\(\mathbf{y}\)) from the joint distribution \(p\) over which we take the expectation, and treat \(\mathbb{E}[g(\mathbf{x})]\) as an expectation with respect to the marginal distribution of \(\mathbf{x}\) alone. This fact is used in many variational-inference-style derivations.
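The equality of the two expectations can also be checked by Monte Carlo. In this sketch (a hypothetical example, not from the post), the joint is a correlated bivariate normal, so the marginal of \(x\) is \(\mathcal{N}(0,1)\); averaging \(g(x) = x^2\) over joint samples while simply ignoring \(y\) should agree with averaging over samples drawn from the marginal alone, both approximating \(\mathbb{E}[x^2] = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint p(x, y): bivariate normal with correlation 0.8.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
x = samples[:, 0]

g = lambda x: x**2  # a function of x only

# Expectation over the joint: average g over (x, y) samples, ignoring y.
joint_est = g(x).mean()

# Expectation over the marginal p(x) = N(0, 1): sample x on its own.
marg_est = g(rng.standard_normal(500_000)).mean()

print(joint_est, marg_est)  # both close to E[x^2] = 1
```

Note that the correlation between \(x\) and \(y\) plays no role in either estimate, which is exactly the point: once \(g\) depends only on \(\mathbf{x}\), the distribution of \(\mathbf{y}\) integrates out.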