Let \(p\) be a (joint) pdf over a set of random variables, which we partition into two random vectors \((\mathbf{x}, \mathbf{y})\). Then by definition, the expected value of a real-valued function \(g\) is

which we abbreviate as just \(\mathbb{E}[g(\mathbf{x},\mathbf{y})]\), as it's clear which underlying distribution we're averaging over. Note the shorthand \(\int \cdot d \mathbf{x}\) (which I borrowed from the PRML book) means taking iterated integrals over all the scalar random variables in \(\mathbf{x}\), like \(\int \int \int \cdot dx_1 dx_2 dx_3\) if for example \(\mathbf{x}=(x_1,x_2,x_3)\).

Now consider the case where \(g\) happens to be a function of only a subset of the random variables, say \(\mathbf{x}\). Then by our earlier definition, \(\mathbb{E}[g(\mathbf{x})]\) should be defined as

Fubini's theorem says that under suitable conditions, we're allowed to switch the order of integration, so we can simplify the above as

where we used the fact that \(g(\mathbf{x})\) is a constant in the integral w.r.t. to \(\mathbf{y}\), as well as the definition of marginal distribution \(p(\mathbf{x})\).

Therefore in most cases, we can safely drop the "irrelevant" variables (\(\mathbf{y}\)) in the underlying joint distribution \(p\) over which we take expectation, and treat \(\mathbb{E}[g(\mathbf{x})]\) as expectation with respect to the marginal distribution of \(\mathbf{x}\) only. This fact will be used in a lot of variational inference type of derivations.