## Linear Regression as a Conditional Random Field

Feb 07 2017

Generally we can represent a CRF as a log-linear model:

$$p_\theta(y|x) = \exp \{ \langle \theta, \phi(y,x) \rangle - A(\theta, x) \} \label{eq:crf_exp_fam}$$

which can be obtained by conditioning a joint Markov network over $$(Y,X)$$ on the observation $$X=x$$. For simplicity we only consider a scalar r.v. $$Y$$. For each fixed $$x \in \mathbb{R}^d$$, the CRF has $$\phi(\cdot, x)$$ as the sufficient statistics function, and $$A(\theta, x)$$ as the log partition function. As before, the derivative of $$A$$ yields the expected sufficient statistics, here taken w.r.t. the conditional distribution $$p_\theta(y|x)$$:

\begin{align*} \frac{\partial A(\theta, x)}{\partial \theta_i}:= \frac{\partial} {\partial \theta_i} \log \int \exp \{ \langle \theta, \phi(y,x) \rangle \} dy = \mathbb{E}[\phi_i(Y,x)|x] \end{align*}
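This property is easy to check numerically. The sketch below is purely illustrative (the choice $$\phi(y,x)=(xy, y^2)$$ with a scalar $$x$$, the grid, and the helper names are all mine): it approximates $$A(\theta, x)$$ by a Riemann sum over $$y$$ and compares a finite-difference gradient of $$A$$ with the expected sufficient statistics.

```python
import numpy as np

# Illustrative check of dA/dtheta_i = E[phi_i(Y, x) | x] for phi(y, x) = (x*y, y^2),
# using a Riemann sum over y for the integral and finite differences for the gradient.

def phi(y, x):
    return np.stack([x * y, y ** 2])                  # shape (2, len(y))

def log_partition(theta, x, grid, dy):
    scores = theta @ phi(grid, x)                     # <theta, phi(y, x)> on the grid
    return np.log(np.sum(np.exp(scores)) * dy)        # A(theta, x) = log ∫ exp{...} dy

theta = np.array([1.5, -0.5])                         # second coordinate < 0 so the integral converges
x = 2.0
grid = np.linspace(-20.0, 20.0, 40001)
dy = grid[1] - grid[0]

# Expected sufficient statistics under p_theta(y | x).
scores = theta @ phi(grid, x)
density = np.exp(scores - log_partition(theta, x, grid, dy))
expected_phi = (phi(grid, x) * density).sum(axis=1) * dy

# Finite-difference gradient of the log partition function.
eps = 1e-5
grad_A = np.array([
    (log_partition(theta + eps * e, x, grid, dy) -
     log_partition(theta - eps * e, x, grid, dy)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(grad_A, expected_phi, atol=1e-4))   # True
```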

In the case of linear regression, we assume that conditioned on $$X=x$$, the target variable $$Y$$ is generated in the following way:

$$Y = f(x,\beta) + \epsilon$$

where $$f(x,\beta)$$ is a deterministic function parameterized by $$\beta$$, and $$\epsilon$$ is random noise distributed as $$\mathcal{N}(0, \sigma^2)$$. For simplicity, we consider a function $$f$$ that is linear in $$\beta$$ and $$x$$, and append a dummy variable $$x_0=1$$ to the vector $$x$$, so that $$f(x,\beta) = \beta_0 + \beta_1 x_1 + ... + \beta_d x_d = \beta^T x$$.
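For concreteness, here is a tiny simulation of this generative assumption; the dimensions, coefficients, and noise level are made up for illustration, and later sketches reuse these arrays.

```python
import numpy as np

# Toy data from the assumed model Y = beta^T x + eps, eps ~ N(0, sigma^2),
# with the dummy feature x_0 = 1 prepended to every input.
rng = np.random.default_rng(0)
N, d = 200, 3
beta_true = np.array([1.0, -2.0, 0.5, 3.0])                  # (beta_0, beta_1, ..., beta_d)
sigma_true = 0.7

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])    # N x (d+1) design matrix
y = X @ beta_true + sigma_true * rng.normal(size=N)          # observed targets
```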

Equivalently, $$Y \sim \mathcal{N}(f(x,\beta), \sigma ^2)$$, and we can write

$$p(y|x) = \frac{1}{ \sqrt{2 \pi \sigma^2}} \exp \{ - \frac{(\beta^T x - y)^2}{2\sigma^2} \}$$

To put it in the standard log-linear form, we consider the log conditional likelihood:

\begin{align*} \log p(y|x) &= -\frac{1}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2}(\beta^T x - y)^2 \\ & = \frac{\beta^T x}{\sigma^2}y -\frac{1}{2\sigma^2}y^2 - \frac{(\beta^Tx)^2}{2\sigma^2} -\frac{1}{2} \log(2\pi \sigma^2) \\ & = \sum_{i=0}^d \frac{\beta_i}{\sigma^2} x_i y + \frac{-1}{2\sigma^2} y^2 - \frac{(\beta^Tx)^2}{2\sigma^2} -\frac{1}{2} \log(2\pi \sigma^2)\\ & = \begin{bmatrix} \frac{\beta_0}{\sigma^2} & \frac{\beta_1}{\sigma^2} & \cdots & \frac{\beta_d}{\sigma^2} & \frac{-1}{2\sigma^2} \end{bmatrix} \begin{bmatrix} y\\ x_1y\\ \vdots\\ x_dy\\ y^2 \end{bmatrix} - A(\beta, \sigma, x) \end{align*}

from which we recognize the vector of canonical parameters $$\theta = (\frac{\beta_0}{\sigma^2}, \frac{\beta_1}{\sigma^2}, ... , \frac{\beta_d}{\sigma^2}, \frac{-1}{2\sigma^2})$$ and sufficient statistics $$\phi(y,x)=(y, x_1y,...,x_dy,y^2)$$.
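To make the correspondence concrete, the sketch below (function and variable names are mine, not standard) maps $$(\beta, \sigma^2)$$ to the canonical parameters and sufficient statistics, and checks that $$\langle \theta, \phi(y,x) \rangle - A(\theta, x)$$ reproduces the Gaussian log density.

```python
import numpy as np
from scipy.stats import norm

def canonical_params(beta, sigma2):
    # theta = (beta_0/sigma^2, ..., beta_d/sigma^2, -1/(2 sigma^2))
    return np.append(beta / sigma2, -0.5 / sigma2)

def suff_stats(y, x):
    # phi(y, x) = (x_0*y, x_1*y, ..., x_d*y, y^2); x already contains the dummy x_0 = 1
    return np.append(x * y, y ** 2)

def log_partition(beta, sigma2, x):
    # A collects the terms that do not involve y:
    # (beta^T x)^2 / (2 sigma^2) + 0.5 * log(2 pi sigma^2)
    return (beta @ x) ** 2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2)

beta, sigma2 = np.array([1.0, -2.0, 0.5, 3.0]), 0.7 ** 2
x, yv = np.array([1.0, 0.3, -1.2, 0.8]), 2.5

log_lin = canonical_params(beta, sigma2) @ suff_stats(yv, x) - log_partition(beta, sigma2, x)
gaussian = norm.logpdf(yv, loc=beta @ x, scale=np.sqrt(sigma2))
print(np.isclose(log_lin, gaussian))   # True
```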

Suppose we have a data set $$\mathcal{D}$$ of i.i.d. observations $$(x^{(1)}, y^{(1)}), ..., (x^{(N)}, y^{(N)})$$; from ($$\ref{eq:crf_exp_fam}$$) we obtain the data log likelihood under the CRF:

$$l_{\mathbf{y}|\mathbf{x}} (\theta; \mathcal{D}) := \log p_\theta(y^{(1)},...,y^{(N)}|x^{(1)},...,x^{(N)}) = \sum_{n=1}^N \langle \theta, \phi(y^{(n)},x^{(n)}) \rangle - A(\theta, x^{(n)})$$

Note each data case defines a Markov network with its own sufficient statistics function and partition function.
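Continuing the toy example above (reusing the hypothetical arrays `X`, `y`, `beta_true`, `sigma_true`), the conditional data log likelihood can be evaluated directly in this log-linear form and checked against a sum of Gaussian log densities.

```python
import numpy as np
from scipy.stats import norm

def conditional_log_lik(beta, sigma2, X, y):
    # l_{y|x}(theta; D) = sum_n <theta, phi(y_n, x_n)> - A(theta, x_n)
    theta = np.append(beta / sigma2, -0.5 / sigma2)
    phi = np.hstack([X * y[:, None], (y ** 2)[:, None]])            # N x (d+2)
    A = (X @ beta) ** 2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2)
    return np.sum(phi @ theta - A)

# Agrees with summing Gaussian log densities N(y_n | beta^T x_n, sigma^2).
check = norm.logpdf(y, loc=X @ beta_true, scale=sigma_true).sum()
print(np.isclose(conditional_log_lik(beta_true, sigma_true ** 2, X, y), check))   # True
```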

We find MLE for the parameters by differentiating the log likelihood function, making use of the cumulant-generating property of $$A$$:

$$\frac{\partial l_{\mathbf{y}|\mathbf{x}} (\theta; \mathcal{D}) }{\partial \theta_i} = \sum_{n=1}^N \phi_i(y^{(n)},x^{(n)}) - \mathbb{E}[\phi_i(Y,x^{(n)})|x^{(n)}]$$

Note the above deals directly with the canonical parameters $$\theta$$ in the log-linear model (whereas conventionally linear regression performs MLE w.r.t. the parameters $$\beta$$ and $$\sigma^2$$). To apply it to our conditional Gaussian distribution, we define the subvector $$\theta_\beta:= (\frac{\beta_0}{\sigma^2}, \frac{\beta_1}{\sigma^2}, ... , \frac{\beta_d}{\sigma^2})$$ of canonical parameters, which is associated with the subvector $$\phi_\beta(y,x) := (x_0y, x_1y,...,x_dy) = y x \in \mathbb{R}^{d+1}$$ of sufficient statistics. Then

\begin{align*} \frac{\partial l_{\mathbf{y}|\mathbf{x}} (\theta; \mathcal{D}) }{\partial \theta_\beta} &= \sum_{n=1}^N \phi_\beta (y^{(n)}, x^{(n)}) - \mathbb{E}[\phi_\beta(Y, x^{(n)})|x^{(n)}] \\ &= \sum_{n=1}^N y^{(n)} x^{(n)} - \mathbb{E}[Y x^{(n)}|x^{(n)}] \\ &= \sum_{n=1}^N y^{(n)} x^{(n)} - \mathbb{E}[ Y|x^{(n)}] x^{(n)}\\ &= \sum_{n=1}^N y^{(n)} x^{(n)} - (\beta^T x^{(n)}) x^{(n)}\\ &= \sum_{n=1}^N (y^{(n)} - (\beta^T x^{(n)})) x^{(n)} \end{align*}

where we made use of the linearity of expectation, and the fact that the mean parameter $$\mathbb{E}[Y|x]$$ of our Gaussian distribution is by design $$\beta^T x$$. Setting the gradient to zero recovers the normal equations, yielding the MLE solution $$\beta_{ML}$$. The gradient encourages the model prediction (given by the conditional expectation $$\mathbb{E}[Y|x^{(n)}]$$) to match the observed data $$y^{(n)}$$. This form of gradient holds more generally for generalized linear models (GLMs), and indeed much of the derivation above carries over unchanged, except for the specific form of the mean parameters.
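In code, setting this gradient block to zero and solving gives the familiar least-squares estimate. The sketch below continues the toy example (with `np.linalg.lstsq` standing in for an explicit normal-equations solve) and confirms that the $$\theta_\beta$$-block of the gradient vanishes at $$\beta_{ML}$$.

```python
import numpy as np

# beta_ML solves the normal equations X^T X beta = X^T y;
# lstsq computes the same least-squares solution in a numerically stable way.
beta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# The theta_beta-block of the gradient: sum_n (y_n - beta^T x_n) x_n.
grad_beta = ((y - X @ beta_ml)[:, None] * X).sum(axis=0)
print(np.allclose(grad_beta, 0.0))   # True: residuals are orthogonal to the columns of X
```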

Similarly, define $$\theta_{\sigma}:= \frac{-1}{2\sigma^2}$$ and $$\phi_\sigma(y,x) := y^2$$. Taking the derivative w.r.t. $$\theta_{\sigma}$$:

\begin{align*} \frac{\partial l_{\mathbf{y}|\mathbf{x}} (\theta; \mathcal{D}) }{\partial \theta_\sigma} &= \sum_{n=1}^N \phi_\sigma (y^{(n)}, x^{(n)}) - \mathbb{E}[\phi_\sigma(Y, x^{(n)})|x^{(n)}]\\ &= \sum_{n=1}^N (y^{(n)})^2 - \mathbb{E}[Y^2|x^{(n)}]\\ &= \sum_{n=1}^N (y^{(n)})^2 - (\sigma^2 + (\beta^T x^{(n)})^2) \end{align*}

where we made use of the formula for the second moment of the Gaussian. Again, setting the derivative to zero results in a moment matching condition for $$\sigma$$, from which we can solve for the MLE (after plugging in $$\beta_{ML}$$). After some algebra it can be verified that $$\sigma^2_{ML}$$ is indeed the residual sum of squares averaged over the $$N$$ data cases. The optimal "conventional" parameters $$(\beta_{ML}, \sigma^2_{ML})$$ implicitly define the optimal canonical parameters $$\theta_{ML}$$.
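Finishing the toy example (reusing `beta_ml` from the previous sketch), the moment matching condition gives $$\sigma^2_{ML}$$ as the mean squared residual at $$\beta_{ML}$$; the check below is again under the same made-up data.

```python
import numpy as np

# Setting the theta_sigma derivative to zero: sum_n y_n^2 = sum_n (sigma^2 + (beta^T x_n)^2),
# which, after plugging in beta_ML, rearranges to the average squared residual.
residuals = y - X @ beta_ml
sigma2_ml = np.mean(residuals ** 2)

# Moment-matching check: empirical and model second moments agree at the MLE.
print(np.isclose(np.sum(y ** 2), np.sum(sigma2_ml + (X @ beta_ml) ** 2)))   # True
```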