Gibbs Sampling in Pairwise Markov Networks

Feb 01 2017

Discrete pairwise Markov networks are pretty straightforward to work with analytically. Let's use the canonical overcomplete representation, where the sufficient statistics functions are indicator functions for events \(\{X_s = j\}\) and \(\{X_s = j,X_t = k \}\), and define an exponential family of the form

\begin{align} p_\theta(x) = \exp \{ \sum_{s\in V} \theta_s(x_s) + \sum_{(s,t)\in E} \theta_{st}(x_s,x_t) - A(\theta) \} \end{align}

where we have introduced the convenient shorthand notation

\begin{align} \theta_s(x_s) &:= \sum_j \theta_{s;j} \mathbb{I}_{s;j}(x_s)\\ \theta_{st}(x_s,x_t) &:= \sum_{(j,k)} \theta_{st;jk} \mathbb{I}_{st;jk}(x_s, x_t) \end{align}

i.e. when \(x_s = j\), \(\theta_s(x_s=j)\) selects the number \(\theta_{s;j}\); similarly \(\theta_{st}(x_s=j,x_t=k)\) selects the number \(\theta_{st;jk}\). So \(\theta_s\) is a vector of size \(|\chi_s|\), and \(\theta_{st}\) is a matrix of size \(|\chi_s| \times |\chi_t|\); in fact, they are potential tables for the factors over \(X_s\) and \((X_s, X_t)\). To see this, rewrite our exponential representation as a standard Gibbs distribution:

\begin{align} p_\theta(x) = \frac{1}{Z} \prod_{s\in V} \exp \{ \theta_s(x_s) \} \prod_{(s,t)\in E} \exp \{ \theta_{st}(x_s,x_t) \} \end{align}

If we define singleton potential \(\psi_s(x_s) := \exp \{ \theta_s(x_s) \}\), and pair-wise potential \(\psi_{st}(x_s,x_t) :=\exp \{ \theta_{st}(x_s,x_t) \}\), then we see that our canonical parameters are precisely the \(\log\) potential values: \(\theta_{s;j} = \theta_s(x_s=j) = \log \psi_s(x_s=j)\), and \(\theta_{st;jk} = \theta_{st}(x_s=j,x_t=k) = \log \psi_{st}(x_s=j,x_t=k)\).
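To make the bookkeeping concrete, here is a minimal NumPy sketch of these log-potential tables for a small three-node chain. The variable names (`theta_node`, `theta_edge`, `unnormalized_log_prob`) and the random parameter values are illustrative assumptions of this sketch, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]          # chain: X0 - X1 - X2
n_states = 2                      # |chi_s| = 2 states per node

# theta_s: vector of size |chi_s|; theta_st: matrix of size |chi_s| x |chi_t|
theta_node = {s: rng.normal(size=n_states) for s in nodes}
theta_edge = {(s, t): rng.normal(size=(n_states, n_states)) for (s, t) in edges}

def unnormalized_log_prob(x):
    """sum_s theta_s(x_s) + sum_(s,t) theta_st(x_s, x_t), i.e. log p(x) up to A(theta)."""
    score = sum(theta_node[s][x[s]] for s in nodes)
    score += sum(theta_edge[s, t][x[s], x[t]] for (s, t) in edges)
    return score

print(unnormalized_log_prob({0: 1, 1: 0, 2: 1}))
```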

Let's now consider conditional inference in this model, which is what we need for Gibbs sampling. More precisely, we're interested in the marginal distribution of one node, call it \(s\), conditioned on an instantiation of all the remaining nodes (this distribution is typically called the full conditional distribution). By Bayes' rule,

\begin{align} p(x_s | x_{V \setminus s}) &= \frac{p(x_s, x_{V \setminus s})}{p(x_{V \setminus s})}\\ &= \frac{p(x)}{\sum_{x_s'} p(x_s', x_{V \setminus s})}\\ &= \frac{ \exp\{\theta_s(x_s) + \sum_{n \in V: n \neq s}\theta_n(x_n) + \sum_{n\in N(s)} \theta_{ns}(x_n, x_s) + \sum_{(m,n) \in E: m\neq s, n\neq s} \theta_{mn}(x_m, x_n) - A(\theta) \} }{ \sum_{x_s'} \exp\{\theta_s(x_s') + \sum_{n \in V: n \neq s}\theta_n(x_n) + \sum_{n\in N(s)} \theta_{ns}(x_n, x_s') + \sum_{(m,n) \in E: m\neq s, n\neq s} \theta_{mn}(x_m, x_n) - A(\theta) \} }\\ &= \frac{ \exp\{\theta_s(x_s) + \sum_{n\in N(s)} \theta_{ns}(x_n, x_s) \} }{ \sum_{x_s'} \exp\{\theta_s(x_s') + \sum_{n\in N(s)} \theta_{ns}(x_n, x_s') \} } \end{align}

where \(N(s)\) means \(\{n \in V : (n, s) \in E \}\), i.e. the neighbor nodes of \(s\). Not surprisingly, the conditional distribution is a softmax function, as we're working with exponential family distributions.

Equivalently, we can write the above as a Gibbs distribution:

$$ p(x_s | x_{V \setminus s}) \propto \psi_s(x_s) \prod_{n\in N(s)} \psi_{ns}(x_n, x_s) $$
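Continuing the NumPy sketch above (and reusing its `theta_node` and `theta_edge` tables), the full conditional for a node \(s\) can be computed by gathering \(\theta_s\) together with the relevant row or column of each neighboring \(\theta_{ns}\) table and taking a softmax. This is a hedged illustration under the same assumptions as before, not a reference implementation.

```python
def full_conditional(s, x):
    """Softmax of theta_s(x_s) + sum over neighbours n of theta_ns(x_n, x_s)."""
    logits = theta_node[s].copy()            # theta_s(x_s) for every value of x_s
    for (m, n), table in theta_edge.items():
        if m == s:                           # edge (s, n): fix x_n, keep column over x_s
            logits = logits + table[:, x[n]]
        elif n == s:                         # edge (m, s): fix x_m, keep row over x_s
            logits = logits + table[x[m], :]
    logits -= logits.max()                   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(full_conditional(1, {0: 1, 1: 0, 2: 1}))   # distribution over the states of node 1
```

Notice that only the potentials touching \(s\) ever enter the computation.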

Generally, in an arbitrary Gibbs distribution \(p\), the full conditional distribution for some variable \(s\), \(p(x_s|x_{V \setminus s})\), involves precisely the factors over the Markov blanket of \(s\). The proof of this isn't too different from what we showed above for the pairwise case (i.e. expanding the factorized form of \(p\) in the numerator and denominator of \(p(x_s|x_{V \setminus s})\), then canceling out common factors).

In the specific example of an Ising model, we restrict the range of all the node values to \(\chi = \{0, 1\}\)\(^1\), and let the sufficient statistics simply be the node values and their pairwise products, i.e., \(x_s\) for \(s \in V\) and \(x_s x_t\) for \((s, t) \in E\). The full conditional distribution then becomes

$$p(x_s | x_{V \setminus s}) = \frac{ \exp\{\theta_s x_s + \sum_{n\in N(s)} \theta_{ns}x_n x_s \} }{ \sum_{x_s'} \exp\{\theta_s x_s' + \sum_{n\in N(s)} \theta_{ns}x_n x_s' \} }$$

In particular, we have

\begin{align} p(x_s=1| x_{V \setminus s}) &= \frac{\exp\{\theta_s + \sum_{t\in N(s)} \theta_{ts}x_t \}}{\exp\{\theta_s + \sum_{t\in N(s)} \theta_{ts}x_t \}+1} \\ &= \left( 1 + \exp\{- (\theta_s + \sum_{t\in N(s)} \theta_{ts}x_t) \} \right)^{-1} \\ &= \sigma\left(\theta_s + \sum_{t\in N(s)} \theta_{ts}x_t\right) \end{align}

where \(\sigma\) denotes the logistic sigmoid. This suggests that this type of distribution can be viewed as a stochastic recurrent neural network (with sigmoid activations)!
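As a closing illustration, here is a minimal sketch of a systematic-scan Gibbs sampler for a small Ising model on a square grid, using the sigmoid full conditional above. The grid size, parameter values, and helper names are illustrative assumptions, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10                                   # nodes laid out on an n x n grid
theta_s = 0.0                            # shared singleton parameter
theta_st = 0.5                           # shared pairwise (coupling) parameter

x = rng.integers(0, 2, size=(n, n))      # random initial state in {0, 1}

def neighbours(i, j):
    """Grid neighbours of node (i, j)."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield i + di, j + dj

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for sweep in range(100):                 # systematic-scan Gibbs sweeps
    for i in range(n):
        for j in range(n):
            field = theta_s + theta_st * sum(x[a, b] for a, b in neighbours(i, j))
            x[i, j] = rng.random() < sigmoid(field)   # draw from p(x_s = 1 | rest)

print(x.mean())                          # fraction of nodes in state 1
```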

1: Traditionally the variables in an Ising model take values in \(\{-1, 1\}\).