Estimating the rate-distortion function of real-world data, part 2

As we saw in part 1 of this post, given a data (source) distribution $P_X$ and a distortion function $\rho$, the rate-distortion (R-D) function of the data quantifies its compressibility. It gives us an idea about the best possible compression performance we can expect, and can help us design better compression algorithms and data communication systems in general.

In this part 2 of the post, I'll introduce a simple method to estimate the R-D function, i.e., to train a family of $\beta$-VAE-like models. This method is much more scalable and widely-applicable than the classic Blahut-Arimoto algorithm, and works in an online fashion as it is based on stochastic gradient descent (SGD). Perhaps not coincidentally, a subset of these $\beta$-VAEs are also the basis of many learned lossy compression models today. You can find more details in the ICLR2022 paper, and a quick demo for how this all works in this ipython notebook.

Estimating the R-D function.

Recall the definintion of the rate-distortion function from last time:

$$\begin{aligned} R(D) := \inf_{Q_{Y|X}: \ \mathbb{E} [\rho(X, Y) ]\leq D} I(X; Y).\end{aligned}$$

To be able to carry out the integrals, we need knowledge of $P_X$, i.e., the probability that the distribution assigns to any event. For a Bernoulli or Gaussian source, the R-D function can be worked out mathematically. Unfortunately, outside of these special cases, the R-D function has no analytical form, and we have little knowledge of the R-D function of most data sources.

If the source and reproduction alphabets are finite (thus necessarily discrete), and if we know the source probability mass function $P_X$, then the Blahut-Arimoto (BA) algorithm can numerically compute the R-D function by solving the mathematical optimization problem defining it. However, there are two major difficulties with establishing the R-D function of real-world sources, which make the BA algorithm inapplicable:

1. the source distribution (e.g., the distribution of natural images) is almost always unknown; ¹

2. the data and the lossy reproductions may be continuously valued, e.g., in scientific applications.

The availability of massive amounts of data prompts us to consider estimating the R-D function from data samples. Indeed, Harrison & Kontoyiannis (2008) proved that that in many general settings, the R-D estimator from replacing the true data distribution by an empirical distribution of samples will asymptotically recover the true R-D function. Thus, one might try running the BA algorithm on an empirical distribution of data samples, after discretizing the source and reproduction alphabets if they are continuous. However, besides the inaccuracy from discretization, this approach also quickly runs into computational difficulties. In higher dimensions, an exceedingly large number of samples is required to faithfully represent the source distribution. This in turn creates a blowup in the number of discretized states in the source and reproduction alphabets. As the BA algorithm operates on tabular representations of the various quantities over all the possible discrete states (e.g., the source PMF $p(x_i)$, and the matrix of pairwise distortion values $\rho(x_i, y_j)$, where $x_i$ and $y_j$ range over all values of source and reproduction alphabets), the memory requirement of BA alone can quickly become infeasible in just a handful of dimensions.

Alternative approach based on an R-D-VAE.

Now I'll introduce an alternative to the BA algorithm, which works with general alphabets, and can be viewed as an SGD version of the BA algorithm. Besides being much more scalable, this method can also free us from working with a fixed (possibly discretized) empirical data distribution: if a stream of i.i.d. samples from the source is available, we will be able to continually improve our estimate of the R-D function in an online fashion, overcoming the finite-sample bias of the BA algorithm.

The basic idea is straightforward, and may appear unrelated at first sight: fitting a VAE to the data source. For now, let's use a common Gaussian likelihood model in the VAE, so that $p(x|z) \sim \mathcal{N}(x|\mu(z), \sigma^2(z))$, with the mean and variance parameters computed by a decoder neural network. Denoting the prior and posterior densities by $q(z)$ and $q(z|x)$, the standard maximum likelihood training objective is the aggregate ELBO (evidence lower bound):

$$\mathbb{E}_{x \sim P_X}[- KL(q(z|x) \|q(z))] + \mathbb{E}_{x \sim P_X, z \sim q(z|x)}[ \log p(x|z)]$$

As I wrote in a previous note, fitting such a density may be problematic if the true data distribution does not have full support over its ambient space. For instance, the true distribution of MNIST images has no probability mass outside of a collection of integer-valued points in $\mathbb{R}^{784}$. In this case, the maximum likelihood training objective is ill-defined (indeed, it seems silly to fit a density model to discrete data), and may reach arbitrarily high values.

One way to avoid this nonsensical scenario is to restrict, or regularize the density model in some way, so that the training objective has a finite maximum. For the Gaussian VAE, it is common to fix the variance of the likelihood model to a constant, like $\sigma^2 I$, and only learn its mean. The authors of this paper did just that: they fit a series of Gaussian VAE models to MNIST digits, keeping $\sigma^2$ fixed for each model, and varying it across models; they observe that models with smaller $\sigma^2$ achieve better/higher aggregate ELBO and exhibit less posterior collapse.

How should we interpret the aggregate ELBO in this model, which could be unbounded had we not fixed $\sigma^2$? And how should we set $\sigma^2$? Similar questions can framed in terms of $\beta$-VAEs; indeed, since the log-likelihood has the form $\log p(x|z) = \frac{1}{2\sigma^2} \|x - \mu(z) \|^2 + const$, the VAE with a fixed variance is essentially equivalent to a $\beta$-VAE with a unit variance in the likelihood model, with $2\sigma^2$ playing the same role as $\beta$.

The answer: this $\beta$-VAE is optimizing an upper bound on the R-D function of the data distribution, under a suitable distortion metric. The key is to look at the aggregate ELBO objective just the right way, and identify a correspondence between the likelihood model and the distortion function. A Gaussian $\beta$-VAE, for example, gives us an estimate of the compressibility of the data samples under a squared error distortion. A $\beta$-VAE with a Laplace likelihood model then corresponds to an $L_1$ distortion, $\rho(x,y) = \|x - y \|_1$.

The general statement is something like this (see Theorem A.3 in the paper for details): Let $q(z), q(z|x)$ be arbitrary choices of prior and posterior distributions in a VAE. Suppose the likelihood density $p(x|z)$ has the form of an exponentiated distortion,

$$p(x|z) = \frac{1}{const} \exp\{\rho(x, \omega(z))\},$$

where $\omega$ is an arbitrary decoder function, and the normalizing constant $const = \int \exp\{\rho(x, \omega(z))\} d x$ does not depend on $z$ and is allowed to be infinite. Then, the point $(\mathcal{D}, \mathcal{R})$ formed by the two terms of the negative ELBO,

$$\mathcal{D} := \mathbb{E}_{x \sim P_X, z \sim q(z|x)}[ \rho(x, \omega(z))] ,$$

and

$$\mathcal{R} := \mathbb{E}_{x \sim P_X}[ KL(q(z|x) \|q(z))],$$

lies on an upper bound of the rate-distortion function of the source under distortion function $\rho$. Moreover, by optimizing the objective $\mathcal{R} + \lambda \mathcal{D}$ (or, equivalently in the $\beta$-VAE formulation, $\beta \mathcal{R} + \mathcal{D}$) over all possible choices of the prior, posterior, and decoder, the resulting point $(\mathcal{D}, \mathcal{R})$ will lie exactly on the source R-D function. All of this pertains to a fixed $\lambda$ value; the optimal $(\mathcal{D}, \mathcal{R})$ pairs resulting from different settings of $\lambda$ then trace out the full $R(D)$ curve.

I refer to a $\beta$-VAE-like model satisfying the above condition on the likelihood-model an R-D (rate-distortion) VAE. The rate and distortion terms purposefully mirror the quantities in the definition of the R-D function, and it's not hard to prove the above result by noting that $\mathcal{R}$ is an upper bound on the mutual information $I(X; Z)$, which in turn is an upper bound on $I(X; Y), Y:= \omega(Z)$ by data processing inequality. The above result essentially allows us to train an R-D-VAE on a NELBO-like objective to optimize an upper bound on the source R-D function. The lossless counterpart of this result is that when the data is discrete, we can train a standard VAE with a discrete likelihood by optimizing the standard negative ELBO, which upper bounds the entropy of the source (this is explained towards the end of my note on ELBO decompositions).

Thus, we can think of an R-D-VAE as a theoretical data compressor. Compared to the lossless case, where a single algorithm, bits-back coding, has made it possible (and in many cases practical) to do lossless compression with generic VAEs and achieve compression rates close to the negative ELBO, the lossy case appears more challenging, and research is still ongoing to develop practical compression algorithms that realize the theoretical performance of R-D-VAEs, often with restrictions on the "prior" and "posterior" distributions to make things practical. You can learn more about these topics in Section 3.4 ("Compression without quantization") of my intro article to neural data compression with Stephan Mandt and Lucas Theis.

The need for a lower bound.

Although training an R-D-VAE is straightforward with deep learning libraries like Tensorflow or PyTorch, at the end of the day this only gives us an upper bound on the R-D function. We know that in theory, this upper bound becomes arbitrarily tight as we increase the expressive power of the R-D-VAEs, but deeper and bigger neural networks may also mean more local minima in the SGD optimization landscape, and not necessarily tighter bounds in practice.

To verify the tightness of an upper bound (and ultimately, to decide if an algorithm's R-D performance is close to theoretically optimal), we also need to estimate a lower bound on the R-D function. This the second half of the "sandwich bounds" proposed in the paper. I won't go into the details as they are a bit involved, but the basic idea is similar -- optimizing over functions parameterized by neural networks with SGD. It's really amazing how well SGD scales to large-scale datasets in the R-D-VAE upper bound method (afterall, it's just training a VAE), and by comparison, how much more challenging it appears to solve a large-scale optimization problem when SGD cannot be directly applied. Indeed, the bulk of my technical work on the lower bound went into reformulating the problem to make it suitable for stochastic optimization.

In some cases where we are in full control of the data generation process, we might be able to assert knowledge of the true data distribution, e.g., the toss of a fair coin is, for all intents and purposes, Bernoulli(0.5); but these cases tend to be rare or toyish. ↩