Generally we can represent a CRF as a log-linear model:
which can be obtained by conditioning a joint Markov network over \((Y,X)\) on the observation \(X=x\). For simplicity, we consider only a scalar r.v. \(Y\). For each fixed \(x \in \mathbb{R}^d\), the CRF has \(\phi(\cdot, x)\) as its sufficient statistics function and \(A(\theta, x)\) as its log partition function. As before, the derivative of \(A\) yields the expected sufficient statistics, here taken w.r.t. the conditional distribution \(p_\theta(y|x)\):
In the case of linear regression, we assume that conditioned on \(X=x\), the target variable \(Y\) is generated in the following way:
where \(f(x,\beta)\) is a deterministic function parameterized by \(\beta\), and \(\epsilon\) is random noise distributed as \(\mathcal{N}(0, \sigma^2)\). For simplicity, we consider a function \(f\) that is linear in \(\beta\) and \(x\), and append a dummy variable \(x_0=1\) to the vector \(x\), so that \(f(x,\beta) = \beta_0 + \beta_1 x_1 + ... + \beta_d x_d = \beta^T x\).
Equivalently, \(Y \sim \mathcal{N}(f(x,\beta), \sigma ^2)\), and we can write
To put it in the standard log-linear form, we consider the log conditional likelihood:
from which we recognize the vector of canonical parameters \(\theta = (\frac{\beta_0}{\sigma^2}, \frac{\beta_1}{\sigma^2}, ... , \frac{\beta_d}{\sigma^2}, \frac{-1}{2\sigma^2})\) and sufficient statistics \(\phi(y,x)=(y, x_1y,...,x_dy,y^2)\).
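As an aside (a worked check added here for concreteness; the slicing notation \(\theta_{1:d+1}\) for the first \(d+1\) components of \(\theta\) is introduced only for this purpose), the log partition function of this conditional Gaussian can be written explicitly in terms of the canonical parameters:
\[
A(\theta, x) = \frac{(\beta^T x)^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)
= -\frac{\big(\theta_{1:d+1}^T x\big)^2}{4\,\theta_{d+2}} + \frac{1}{2}\log\!\left(\frac{\pi}{-\theta_{d+2}}\right),
\]
where \(\theta_{d+2} = \frac{-1}{2\sigma^2} < 0\) is the last component. Differentiating w.r.t. \(\theta_{1:d+1}\) gives \(-\frac{\theta_{1:d+1}^T x}{2\theta_{d+2}}\, x = (\beta^T x)\, x = \mathbb{E}[Y|x]\, x = \mathbb{E}[(x_0 Y, x_1 Y, ..., x_d Y)|x]\), consistent with the cumulant-generating property of \(A\).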
Suppose we have a data set \(\mathcal{D}\) of i.i.d. observations \((x^{(1)}, y^{(1)}), ..., (x^{(N)}, y^{(N)})\); from (\(\ref{eq:crf_exp_fam}\)) we obtain the data log likelihood under the CRF:
Note each data case defines a Markov network with its own sufficient statistics function and partition function.
We find the MLE of the parameters by differentiating the log likelihood function, making use of the cumulant-generating property of \(A\):
Note that the above deals directly with the canonical parameters \(\theta\) in the log-linear model (whereas, conventionally, linear regression performs MLE w.r.t. the parameters \(\beta\) and \(\sigma^2\)). To apply it to our conditional Gaussian distribution, we define the subvector \(\theta_\beta:= (\frac{\beta_0}{\sigma^2}, \frac{\beta_1}{\sigma^2}, ... , \frac{\beta_d}{\sigma^2})\) of canonical parameters, which is associated with the subvector \(\phi_\beta(y,x) := (x_0y, x_1y,...,x_dy) = y x \in \mathbb{R}^{d+1}\) of sufficient statistics. Then
where we made use of the linearity of expectation, and the fact that the mean parameter \(\mathbb{E}[Y|x]\) of our Gaussian distribution is by design \(\beta^T x\). Setting the gradient to zero recovers the normal equations, yielding the MLE solution \(\beta_{ML}\). The gradient encourages the model prediction (given by the conditional expectation \(\mathbb{E}[Y|x^{(n)}]\)) to match the observed data \(y^{(n)}\). This form of the gradient holds more generally for GLMs, and indeed much of the derivation above goes through unchanged, except for the specific form of the mean parameters.
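The normal equations mentioned above can be spelled out in matrix notation (notation introduced here purely for illustration: \(\mathbf{X} \in \mathbb{R}^{N \times (d+1)}\) is the design matrix whose \(n\)-th row is \(x^{(n)T}\), and \(\mathbf{y} = (y^{(1)}, ..., y^{(N)})^T\)): the zero-gradient condition reads
\[
\sum_{n=1}^N \big(y^{(n)} - \beta^T x^{(n)}\big)\, x^{(n)} = 0
\quad\Longleftrightarrow\quad
\mathbf{X}^T \mathbf{X}\, \beta = \mathbf{X}^T \mathbf{y},
\]
so that \(\beta_{ML} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\) whenever \(\mathbf{X}^T \mathbf{X}\) is invertible.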
Similarly, define \(\theta_{\sigma}:= \frac{-1}{2\sigma^2}\) and \(\phi_\sigma(y,x) := y^2\). Taking the derivative w.r.t. \(\theta_{\sigma}\):
where we made use of the formula for the second moment of a Gaussian, \(\mathbb{E}[Y^2|x] = \sigma^2 + (\beta^T x)^2\). Again, setting the derivative to zero results in a moment matching condition for \(\sigma\), from which we can solve for the MLE (after plugging in \(\beta_{ML}\)). After some algebra it can be verified that \(\sigma^2_{ML}\) is indeed the average residual sum of squares. The optimal ``conventional'' parameters \((\beta_{ML}, \sigma^2_{ML})\) implicitly define the optimal canonical parameters \(\theta_{ML}\).
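This is the ``some algebra'' alluded to above, spelled out for completeness. Setting the derivative w.r.t. \(\theta_\sigma\) to zero gives \(\sum_n (y^{(n)})^2 = \sum_n \mathbb{E}[Y^2|x^{(n)}] = N\sigma^2 + \sum_n (\beta^T x^{(n)})^2\), so after plugging in \(\beta_{ML}\),
\[
\begin{aligned}
\sigma^2_{ML} &= \frac{1}{N}\sum_{n=1}^N \left[ \big(y^{(n)}\big)^2 - \big(\beta_{ML}^T x^{(n)}\big)^2 \right] \\
&= \frac{1}{N}\sum_{n=1}^N \big(y^{(n)} - \beta_{ML}^T x^{(n)}\big)^2,
\end{aligned}
\]
where the second equality uses the normal equations: \(\sum_n (y^{(n)} - \beta_{ML}^T x^{(n)})\, x^{(n)} = 0\) implies \(\sum_n y^{(n)} \beta_{ML}^T x^{(n)} = \sum_n (\beta_{ML}^T x^{(n)})^2\), so that \(\sum_n (y^{(n)} - \beta_{ML}^T x^{(n)})^2 = \sum_n (y^{(n)})^2 - \sum_n (\beta_{ML}^T x^{(n)})^2\).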