Estimating the rate-distortion function of real-world data, part 1

Mar 02 2022

This is part 1 of an informal introduction to some of the ideas I have been working on in the past year or so, which have been published as a workshop paper at ICLR 2021 and conference paper at ICLR 2022. The high-level idea is to use neural networks and machine learning to estimate a basic quantity in information theory -- the rate-distortion function, for real-world data such as images.

To start with, let's get some jargon out of the way, and set the stage for lossy compression. In information theory, the source is simply the random variable that we'd like to compress or transmit. The corresponding concept in machine learning and statistics is the data random variable, whose unknown distribution $$P_X$$ we'd like to model or infer. For our purpose they are really the same thing, and I'll talk about the (data) source and distribution interchangeably. Let's also only consider a memoryless source, so that the data samples are i.i.d. --- the standard setting for statistical estimation 1. As a concrete example, the data samples may correspond to the photos stored on your smartphone. A lossy compression algorithm can be viewed as a blackbox that takes a data sample as input, encodes it into a binary representation, based on which a reconstruction (or, reproduction) is decoded and returned. The set of values that the data (or, reproduction) can take is called the source (or, reproduction) alphabet, and can be thought of as the "domain" or "co-domain" of the compression blackbox; e.g., this might be $$\{0, 1, ..., 255\}^{H \times W \times 3}$$ for digital images.

Rate v.s. distortion.

If you've ever used a lossy data compression algorithm, say JPEG, you're likely already familiar with the so-called rate-distortion tradeoff. "Rate" is simply the average size of compressed files, and "distortion" is the loss of quality from the lossy compression process, on average. As you compress more aggressively (e.g., by dialing down the quality parameter in JPEG), you end up with smaller file sizes, but also have to live with increased distortion. It turns out there exist information-theoretic "laws" that govern this rate-distortion tradeoff. To quantiy the idea of distortion, let's also suppose we are given a distortion function $$\rho$$, which receives a data sample $$x$$ and its lossy reconstruction $$y$$, and returns a non-negative number $$\rho(x,y)$$ representing the amount of distortion/deviation in $$y$$ relative to $$x$$; e.g., this can be a squared error, $$\rho(x,y) = \| x- y\|^2$$.

Keep in mind throughout the discussion, we always have a fixed, target data distribution in mind. It doesn't really make sense to talk about the compression performance of an algorithm without referencing a data distribution, as a compression algorithm that does well on one data distribution may do worse on another, and it's simply impossible to compress all possible data equally well.

We now have a way to quantitatively evaluate the performance of a lossy compression algorithm on a data distribution: collect a large number of random samples from the distribution, compress and decompress each of them, and compute the rate and distortion averaged over the samples. The resulting pair of numbers, (distortion, rate), then corresponds to a point on the two-dimensional plane, which may look like this:

In this picture, the green point $$(D_1, R_1)$$ corresponds to an algorithm with relatively high rate but low distortion, while the orange point $$(D_2, R_2)$$ trades off higher distortion for a lower rate. The two points might also be of the same algorithm, but operating at two different settings. I also plotted the source rate-distortion function -- to be introduced below -- in the blue curve.

The rate-distortion function.

Next, let's talk about the rate-distortion (R-D) function, denoted $$R(D)$$. Intuitively, the R-D function captures the fundamental "lossy compressibility" of a data source. Thus, the R-D function is to lossy function what entropy is to lossless compression, and it generalizes the latter concept to lossy compression. Formally, $$R(D)$$ is a real-valued function of a single parameter $$D$$, the distortion tolerance, and is given by the mathematical definition,

\begin{aligned} R(D) := \inf_{Q_{Y|X}: \ \mathbb{E} [\rho(X, Y) ]\leq D} I(X; Y).\end{aligned}

In words, we consider all conditional distributions $$Q_{Y|X}$$ whose expected distortion is within the given threshold $$D$$, and minimize the mutual information between the source $$X$$ and its reproduction $$Y$$ (as induced by $$P_X Q_{Y|X}$$). This minimum value is then $$R(D)$$. Note that the R-D function is entirely determined by the data distribution $$P_X$$ and distortion metric $$\rho$$.

So far, $$R(D)$$ is nothing but a purely theoretical construction. However, it amazingly turns out to have a practical significance -- for each given distortion threshold $$D$$, the number $$R(D)$$ is the minimum achievable rate (as measured in nats per sample) with which any lossy compression algorithm can compress the data to an averaged distortion not exceeding $$D$$. The "minimum" part asserts that no compression algorithm can perform better than the R-D function of the source, while the "achievable" part says that conversely, the performance $$(D, R(D))$$ is also realizable by a (potentially expensive) compression algorithm. In my example illustration, this means the rate-distortion performance (or more precisely, distortion-rate performance) of all compression algorithms must lie above the blue curve; thus the red point is logically unattainable. Moreover, the orange point can be considered more optimal than the green point, in the sense that it lies closer to the source R-D curve. These results, known as lossy source coding theorems, were proven by Shannon in his two landmark papers from the 1950s and 60s, and formed the basis of the branch of information theory known as rate-distortion theory.

Why do we care about the rate-distortion function? Because knowledge of it can help researchers and engineers build more effective data communication systems. Here are two example applications:

1. Assessing the optimality of lossy compression algorithms: If the (distortion, rate) performance of a compression algorithm lies high above the source $$R(D)$$ curve, then further performance improvement may be expected; otherwise, its R-D performance is already close to theoretically optimal, and we may decide to focus our attention on other aspects of the algorithm. Data compression is big business, and one day, sooner or later, compression algorithms may run into their information-theoretic limit, as determined by the ultimate compressibility of the data. It would be helpful to know when we are getting close to this limit, before deciding to devote more time and resources to improving the compression performance of an algorithm further.

2. Assessing the minimum channel capacity required to transmit the compressed source. Consider a common data communication system that first subjects the data to compression, i.e., source coding, and then transmits the bits through an information channel with channel coding. For example, this could be a sensor repeatedly generating data and sending it to a user over a wifi network. And suppose the user can only tolerate an average distortion of at most $$D$$ in the received data, under some notion of distortion. Then, it turns out that reliable real-time communication in this setup is possible only if the channel capacity is at least $$R(D)$$, as given by the R-D function of the data source. If the channel capacity is less than $$R(D)$$, then this communication goal is simply infeasible (one workaround is to buffer the bitstream to be sent over the channel, but this introduces latency). This is a consequence of Shannon's source-channel separation theorem.

In part 2 of this post, I'll introduce a simple and scalable way to estimate the rate-distortion function from data.

1. One way to extend the setup to a non i.i.d. (e.g., a higher order Markovian) source is by defining an R-D function on blocks of $$k$$ data samples, resulting in something called the $$k$$-th order R-D function.