## Joint vs. Marginal Expectation

Jun 11 2016

Let $$p$$ be a (joint) pdf over a set of random variables, which we partition into two random vectors $$(\mathbf{x}, \mathbf{y})$$. Then by definition, the expected value of a real-valued function $$g$$ is

$$\mathbb{E}_{p(\mathbf{x}, \mathbf{y})}[g(\mathbf{x},\mathbf{y})] = \int \int g(\mathbf{x},\mathbf{y}) p(\mathbf{x},\mathbf{y}) d \mathbf{x} d \mathbf{y}$$

which we abbreviate as just $$\mathbb{E}[g(\mathbf{x},\mathbf{y})]$$ when it's clear which underlying distribution we're averaging over. Note the shorthand $$\int \cdot d \mathbf{x}$$ (which I borrowed from the PRML book) means taking iterated integrals over all the scalar random variables in $$\mathbf{x}$$; for example, if $$\mathbf{x}=(x_1,x_2,x_3)$$, it stands for $$\int \int \int \cdot dx_1 dx_2 dx_3$$.
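As a quick sanity check of the definition, we can approximate such an expectation by Monte Carlo: draw joint samples and average $$g$$ over them. Below is a minimal sketch assuming an illustrative bivariate Gaussian $$p(\mathbf{x}, \mathbf{y})$$ and $$g(x, y) = xy$$ (the distribution, its parameters, and $$g$$ are my own choices, not from the derivation above).

```python
import numpy as np

# Illustrative joint distribution p(x, y): a bivariate Gaussian
# (parameters chosen arbitrarily for this example).
rng = np.random.default_rng(0)
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])

# Draw joint samples (x_i, y_i) ~ p(x, y).
samples = rng.multivariate_normal(mean, cov, size=200_000)
x, y = samples[:, 0], samples[:, 1]

# Monte Carlo estimate of E[g(x, y)] with g(x, y) = x * y.
estimate = np.mean(x * y)

# For a Gaussian, the closed form is
# E[xy] = Cov(x, y) + E[x] E[y] = 0.5 + 1.0 * 2.0 = 2.5.
print(estimate)  # ≈ 2.5
```

The sample average converges to the double integral in the definition as the number of samples grows (by the law of large numbers).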

Now consider the case where $$g$$ happens to be a function of only a subset of the random variables, say $$\mathbf{x}$$. Then by the definition above, $$\mathbb{E}[g(\mathbf{x})]$$ is

$$\mathbb{E}_{p(\mathbf{x}, \mathbf{y})}[g(\mathbf{x})] = \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{x} d\mathbf{y}$$

Fubini's theorem says that under suitable conditions, we're allowed to switch the order of integration, so we can simplify the above as

\begin{align} \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{x} d\mathbf{y} &= \int \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{y} d\mathbf{x} \\ &= \int \lbrace \int g(\mathbf{x}) p(\mathbf{x},\mathbf{y}) d\mathbf{y} \rbrace d\mathbf{x} \\ &= \int g(\mathbf{x}) \lbrace \int p(\mathbf{x},\mathbf{y}) d\mathbf{y} \rbrace d\mathbf{x} \\ &= \int g(\mathbf{x}) p(\mathbf{x}) d\mathbf{x} \\ &= \mathbb{E}_{p(\mathbf{x})}[g(\mathbf{x})] \end{align}

where we used the fact that $$g(\mathbf{x})$$ is a constant in the integral w.r.t. $$\mathbf{y}$$, as well as the definition of the marginal distribution $$p(\mathbf{x})$$.

Therefore, whenever Fubini's theorem applies, we can safely drop the "irrelevant" variables ($$\mathbf{y}$$) from the underlying joint distribution $$p$$ over which we take the expectation, and treat $$\mathbb{E}[g(\mathbf{x})]$$ as an expectation with respect to the marginal distribution of $$\mathbf{x}$$ only. This fact is used in many variational-inference-style derivations.
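The identity above is easy to verify numerically: the average of $$g(\mathbf{x})$$ over joint samples should match the average over samples from the marginal alone. Here is a minimal sketch, again assuming an illustrative bivariate Gaussian (my own choice of distribution, parameters, and $$g$$), whose marginal over $$x$$ is simply $$\mathcal{N}(1, 1)$$.

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])

def g(x):
    return x ** 2  # g depends on x only, not on y

# Expectation under the joint: sample (x, y) ~ p(x, y),
# then average g(x), simply ignoring the y column.
joint_samples = rng.multivariate_normal(mean, cov, size=200_000)
e_joint = np.mean(g(joint_samples[:, 0]))

# Expectation under the marginal: for this Gaussian, x ~ N(1, 1).
marginal_samples = rng.normal(loc=1.0, scale=1.0, size=200_000)
e_marginal = np.mean(g(marginal_samples))

# Both estimate E[x^2] = Var(x) + E[x]^2 = 1 + 1 = 2,
# matching the result E_{p(x,y)}[g(x)] = E_{p(x)}[g(x)].
print(e_joint, e_marginal)
```

Both estimates agree up to Monte Carlo error, which is exactly the point: the $$\mathbf{y}$$ samples play no role once $$g$$ ignores them.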