JPEG Compression Algorithm

Introduction to Compression

Compression is a technique used to reduce the size (in bits) of data, so that it is cheaper to store and faster to transmit. It is ubiquitous in computing and an active area of research; there are hundreds of compression algorithms in use today. It should be noted, however, that compression algorithms are specialized in the kind of data they compress. The reason is that unless we make some assumptions about the input data (which is simply a string), it is impossible to reduce the data's size while at the same time ensuring full recovery of the original string.

Example [1]: consider a string s of length 1000 over a binary alphabet. There are 2^1000 such strings. Suppose there were an algorithm that could reduce the string's length to n, n < 1000, i.e. produce a string s' of length n that is an encoding of s. But how can we recover the original string? There are only 2^n < 2^1000 strings of length n, so by the pigeonhole principle, 2 or more input strings s map to the same string s'. To put it the other way around, there is a string s' that encodes more than one string s. If the decompression algorithm (the algorithm that produces s from its encoding s') is given such an s', it has no way of deciding which of the possible strings s is encoded by s'.

Now suppose, however, that every maximal consecutive subsequence of 0s in s is of even length (i.e. s cannot be ...10001..., or 110010100...). A simple compression algorithm would replace each pair of consecutive zeros 00 by a single 0 to produce s'. The decompression algorithm would then replace each single 0 in s' by a pair of zeros 00 to recover, unambiguously, the string s. If s has k zeros, then its encoding s' is shorter by k/2 characters. For strings with a large number of zeros, we would save a lot of storage space and significantly speed up s's transmission across networks.
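As a concrete illustration, here is a minimal Python sketch of this scheme (the function names are my own; the code assumes, as stated above, that every maximal run of zeros in the input has even length):

```python
def compress(s: str) -> str:
    """Replace each pair of consecutive zeros '00' by a single '0'.
    Assumes every maximal run of zeros in s has even length."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == '0':
            out.append('0')   # by assumption s[i + 1] is also '0'
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

def decompress(t: str) -> str:
    """Replace each single '0' by '00', undoing compress()."""
    return ''.join('00' if c == '0' else c for c in t)

s = '110011000011'
assert decompress(compress(s)) == s   # lossless round trip
```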

To make the above example more interesting, suppose that a string s with a maximal consecutive subsequence of 0s of odd length can occur, but only rarely, say 0.1% of the time. Also assume that the string represents samples of some analog signal. Since the original signal is already corrupted with noise, and the digitization process introduces some uncertainty too, we cannot expect to have a 100% accurate representation of the signal. Therefore, if once in a while a 0 is replaced by a 1 or vice versa, chances are good that the change will go unnoticed. Hence, we can adapt the above compression algorithm: first identify all maximal consecutive subsequences of 0s of odd length, then replace an end 0 in each such subsequence by 1, so that the string conforms to the requirement of the original compression algorithm that all maximal consecutive subsequences of 0s be of even length.
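One possible realization of this lossy fix-up step, again as a hedged sketch (the regular-expression approach and the choice to flip the last zero of each odd-length run are my own):

```python
import re

def fix_odd_runs(s: str) -> str:
    """Flip the final '0' of every odd-length maximal run of zeros to '1',
    so that the lossless compress() above becomes applicable.
    Lossy: at most one bit is changed per odd-length run."""
    def patch(match: re.Match) -> str:
        run = match.group(0)
        return run if len(run) % 2 == 0 else run[:-1] + '1'
    return re.sub(r'0+', patch, s)

assert fix_odd_runs('1100010011') == '1100110011'
```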

Compression is therefore all about probability: it assumes that the input string is likely to have some forms and quite unlikely (or even impossible) to have others. It is now clear why compression algorithms are specialized: every particular kind of data has a different probability distribution over its representation strings. The art of designing a good compression algorithm lies in a thorough understanding of the probability distribution over the input strings.

In many cases, it is impossible to guarantee that a particular string will never occur as input. However, there are usually many strings s' whose probability of occurrence is low but which are "similar" to other strings s whose probability of occurrence is high. A compression algorithm might therefore code s' the same way as s, effectively forcing the probability of occurrence of s' to be 0. There is then some loss of information, and such a compression algorithm is said to be lossy. Compression algorithms that preserve all information are called lossless.

It is not easy to define when 2 strings are "similar" - an obvious choice would be the Hamming distance, but it is not always appropriate. To design a good lossy compression algorithm, one must have a good understanding of the structure of the underlying data and know when 2 strings can be considered "similar" and when they cannot.
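For reference, the Hamming distance between two equal-length strings is simply the number of positions at which they differ; a minimal sketch:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

assert hamming('110010', '100011') == 2
```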

Transform Coding [1]

The idea behind transform coding is to transform the input string into a form that captures the data's features better. For example, the simplest way to store an image in a computer is as a matrix of pixels, with each pixel being a number of bits representing the pixel's color. If we use the RGB representation, with each of the 3 colors taking on 256 shades, we need to store 24 bits per pixel. For large, high-resolution images this scheme is not practical, even more so for images that need to be transmitted over a network, such as the Internet. This simple bitmap representation ignores some inherent features of images, such as the fact that the color varies very slowly or not at all over some relatively large areas of the image. Using 24 bits for each pixel in such an area thus means storing information redundantly.

Actually, there is a fundamental difference between 2 classes of images: naturalistic images, such as photographs, and simple computer-generated images, such as icons, cursors, etc. The former exhibit continuous variation of color throughout the image, whereas the latter have relatively sharp color discontinuities in some places and large areas of uniform color in others. Not surprisingly, an efficient compression algorithm for one class will not likely be very efficient for the other. From now on, I shall only be talking about compression of naturalistic images, and to save some typing, the word "image" will mean "naturalistic image."

Mathematically, an RGB image is a function I : D ⊆ R² → Z³ mapping each point in a subset D of the plane to a triple of integers indicating the intensities of the red, green, and blue colors. In a computer, of course, this function is represented approximately by its values on a grid of discrete, "closely" spaced points. It turns out that every such function I that is continuous except at a finite number of curves in R² can be represented as an (infinite) sum of basis functions. A set S of basis functions has 2 important properties (which in fact constitute the definition of a basis set):

  1. no function in S can be represented as a linear combination of the other functions in S (linear independence of S), and
  2. every function I as defined above that is continuous except at a finite number of curves in R² can be written as a linear combination of functions in S (spanning property of S).

The basis functions are chosen to be simple, such as cosines, sines, or polynomials. Unfortunately, given a "realistic" function I and a basis set S of functions, the expansion of I in terms of functions in S is infinite. But not all hope is lost: in practice (and in theory, too) we find that we can often truncate the infinite expansion of I after the nth term and still obtain a very good match between I and the truncated expansion. Of course, the higher the n, the better the approximation, but also the more computational time and memory required. A judicious choice of n is therefore very important. Equally important is the choice of the basis functions - if we choose functions that are similar to I in some way, we can obtain a very good approximation to I with a relatively small n.
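Written out schematically (in LaTeX notation; the symbols φ_k and c_k are mine, not from the original source), the truncation reads:

```latex
I(x, y) \;=\; \sum_{k=1}^{\infty} c_k\, \varphi_k(x, y)
\;\approx\; \sum_{k=1}^{n} c_k\, \varphi_k(x, y),
\qquad\text{where } c_k = \langle I, \varphi_k \rangle \text{ if the basis } \{\varphi_k\} \text{ is orthonormal.}
```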

Here, then, is an idea for transforming a bitmap image into something more manageable: choose a set S of basis functions and expand the image I in terms of S, but truncate after the nth term. This will result in a loss of information (since our truncated expansion is only an approximation), but if we choose our basis functions and n carefully, the loss will be minimal.

JPEG Compression Algorithm

I can now finally begin describing the JPEG (Joint Photographic Experts Group) algorithm, which is a lossy compression algorithm for (naturalistic) images.

There is an optional preprocessing step [1]: the image's RGB representation is converted into the YIQ representation, which better accounts for some known facts about human vision. The Y plane represents the brightness of each pixel and is a weighted average of 59% G, 30% R, and 11% B (the human eye is more sensitive to green than it is to red, and more sensitive to red than it is to blue). The I and Q planes carry the chrominance information (hue and saturation).
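A minimal sketch of the luminance computation, using exactly the weights quoted above (the function name is mine; the I and Q conversions are omitted here):

```python
def rgb_to_y(r: float, g: float, b: float) -> float:
    """Brightness (Y) of a pixel as a weighted average of its RGB components:
    59% green, 30% red, 11% blue."""
    return 0.59 * g + 0.30 * r + 0.11 * b

assert abs(rgb_to_y(255, 255, 255) - 255.0) < 1e-9   # pure white has full brightness
```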

The "zeroth" step is to separate the image into the 3 color planes representing R (red), G (green), and B (blue) (or YIQ if the preprocessing step has been carried out). If the image consists of just a single grayscale plane, we don't need to do anything. Each plane is compressed and decompressed separately. The process discussed below applies to each color plane, called the image in the subsequent discussion.
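Assuming the image is held as an H x W x 3 NumPy array (an assumption of this sketch, not a requirement of JPEG itself), the plane separation is a one-liner:

```python
import numpy as np

def split_planes(img: np.ndarray):
    """Split an H x W x 3 image array (RGB or YIQ) into its 3 color planes;
    each plane is then compressed independently."""
    return img[..., 0], img[..., 1], img[..., 2]
```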

Now we would like to apply a transform to the image, as discussed in the previous section. If we were to apply the transform to the whole image, we would in general need a huge number of basis functions. The reason is simple: although the variation of color within a small subregion of the image is likely to be small, it is in general quite large over the whole image. A given subregion of the image would force a certain linear combination of basis functions to be included, but that combination can be very inappropriate for another "far-away" subregion, thus forcing more basis functions to be included to cancel off the effect of the former ones. This effect is called frequency spilling, since a function required in one region "spills" over to a region where it is not wanted. A solution is to subdivide the image into smaller parts and apply the transform to each part separately - of course, we could subdivide the image all the way down to individual pixels, but that wouldn't buy us much...

JPEG subdivides the image into subregions of 8 x 8 pixels each (strictly speaking, the subregions on the boundary of the image will have different dimensions if the image's x and y dimensions are not divisible by 8; for simplicity I will assume that x and y are divisible by 8 - the modifications required for the general case are easily accommodated afterwards). If our image has dimensions x by y (in pixels), we obtain xy / 64 subregions, which we can arrange into a (y / 8) x (x / 8) matrix A of blocks. JPEG then applies a transform to each of those subregions separately. The transform used is the discrete cosine transform (DCT). The basis functions are cosines, hence the name; and it is discrete because we are working with a discrete image function I, so integration is replaced by summation. Each 8 x 8 subregion is transformed by multiplying it by the DCT matrix T on the left and by its transpose on the right (the two-dimensional DCT), where T is defined as follows [1]:

Tij = (1/8)^(1/2) cos[(2j + 1) i π / 16]   if i = 0 and 0 ≤ j < 8
Tij = (1/2) cos[(2j + 1) i π / 16]         if 0 < i < 8 and 0 ≤ j < 8

Note that T is an 8 x 8 matrix and that the numbering starts at 0 (as is usual in computer science). Let's denote the DCT transform of the block Aij by Aij', so that we have (with T^T denoting the transpose of T):

Aij' = T Aij T^T

This transformation is one-to-one; in fact T is orthogonal, so T^-1 = T^T, and we can get Aij back from Aij' by applying the inverse DCT:

Aij = T^T Aij' T

Aij' is called the frequency-domain matrix, while Aij is the spatial-domain matrix.

This transformation is not lossy - i.e. we have not lost any information (this follows from the fact that T is invertible). Note, however, that Aij' is band-limited - only 8 x 8 = 64 frequency components are included, while in general there are infinitely many of them.
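Here is a small NumPy sketch (not production JPEG code) that builds T as defined above and applies the forward and inverse transform to a single 8 x 8 block:

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """The n x n DCT matrix T defined above (orthonormal DCT-II)."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    T = np.sqrt(2.0 / n) * np.cos((2 * j + 1) * i * np.pi / (2 * n))
    T[0, :] = np.sqrt(1.0 / n)              # row i = 0 uses the (1/8)^(1/2) factor
    return T

T = dct_matrix(8)
assert np.allclose(T @ T.T, np.eye(8))      # T is orthogonal, so T^-1 = T^T

block = np.random.randint(0, 256, (8, 8)).astype(float)   # one 8 x 8 subregion Aij
freq  = T @ block @ T.T                     # forward DCT:  Aij' = T Aij T^T
back  = T.T @ freq @ T                      # inverse DCT:  Aij  = T^T Aij' T
assert np.allclose(back, block)             # nothing lost at this stage
```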

Now comes the crux of JPEG compression: scalar quantization of the frequencies. Studies have shown that the human eye is more sensitive to lower frequencies than to higher frequencies. The idea is therefore to scale the amplitude of each frequency in Aij' by a quantization factor that takes this fact into account; in general, higher frequencies are scaled more than lower frequencies. The scaling factors are given in an 8 x 8 matrix Q (known as the quantization matrix), where Qpq is the scaling factor to be used for frequency component (Aij')pq. The entries of Q have been determined by empirical studies of human visual perception, and the recommended choice for Q is [2]:

16   12   10   16   24   40   51   61
11   12   14   19   26   58   60   55
14   13   16   24   40   57   69   56
14   17   22   29   51   87   80   62
18   22   37   56   68  109  103   77
24   35   55   64   81  104  113   92
49   64   78   87  103  121  120  103
72   92   95   98  112  100  101   99

The scaled matrix Aij'' is obtained from Aij' by dividing each entry of Aij' by the corresponding entry of Q and rounding the result to the nearest integer, i.e.

(Aij'')pq = round[(Aij')pq / Qpq]

As can be seen in the above matrix, higher frequencies are scaled more (since the human eye is less sensitive to them) than lower frequencies. When Qpq = 1 for all p, q, Aij'' is just Aij' rounded to integers and the image is essentially preserved intact. If, however, Qpq > 1 for some p, q, then the scaling transformation is not one-to-one and we have lost information (i.e. degraded the image quality) - in fact, the quantization step is what makes the JPEG compression algorithm lossy. In the extreme case Qpq = 256 for all p, q, almost every entry of Aij'' is rounded to zero and essentially all information is lost!

To further increase the compression ratio, we might additionally divide each entry of Aij'' by a constant greater than 1 (the so-called quality control constant [1]). This further degrades the image, but improves the compression ratio.
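A sketch of the quantization step in NumPy, continuing the earlier snippet (freq is the transformed block Aij'; the parameter c stands for the quality control constant and defaults to 1, i.e. no extra scaling - the names are mine):

```python
Q = np.array([[16, 12, 10, 16, 24, 40, 51, 61],
              [11, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 103],
              [72, 92, 95, 98, 112, 100, 101, 99]], dtype=float)

def quantize(freq: np.ndarray, Q: np.ndarray, c: float = 1.0) -> np.ndarray:
    """(Aij'')pq = round[(Aij')pq / Qpq], optionally divided again by c > 1."""
    quant = np.rint(freq / Q)
    if c > 1.0:
        quant = np.rint(quant / c)
    return quant.astype(int)

quant = quantize(freq, Q)   # typically leaves many high-frequency entries at 0
```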

After all this has been done, it is not unusual for more than half of the frequency amplitudes to be zero. To take advantage of this fact, the JPEG algorithm records only the non-zero amplitudes; their positions within Aij'' are captured by recording the number of zeros preceding each of them. In order to take advantage of long runs of zero entries toward the bottom-right part of Aij'', the algorithm processes Aij'' in a zigzag fashion, starting at the top-left entry and sweeping back and forth along the anti-diagonals toward the bottom-right corner [3].

This traversal produces a sequence of pairs p, where p = (number of zeros preceding the value, the value). A special pair (0, 0) is used to indicate the end of the current matrix Aij''. As an example, consider the following matrix (only a 4 x 4 matrix is used for brevity):

10   6   4   0
 7   0   1   0
 0   1   2   0
 0   0   0   0

First of all, the zigzag traversal will visit entries in the order

10, 6, 7, 0, 0, 4, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0

which is coded as

<(0, 10), (0, 6), (0, 7), (2, 4), (1, 1), (0, 1), (2, 2), (0, 0)>
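The following Python sketch implements this simplified zigzag/run-length scheme and reproduces the example above (real JPEG entropy coding is more involved; see the next paragraph):

```python
def zigzag(block):
    """Visit the entries of a 2-D list in the zigzag order described above."""
    rows, cols = len(block), len(block[0])
    order = []
    for d in range(rows + cols - 1):                 # d = i + j indexes the anti-diagonals
        lo, hi = max(0, d - cols + 1), min(d, rows - 1)
        rng = range(hi, lo - 1, -1) if d % 2 == 0 else range(lo, hi + 1)
        order.extend(block[i][d - i] for i in rng)
    return order

def run_length(values):
    """Encode as (number of zeros preceding value, value) pairs, ending with (0, 0)."""
    pairs, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    pairs.append((0, 0))                             # end-of-block marker
    return pairs

block = [[10, 6, 4, 0],
         [7, 0, 1, 0],
         [0, 1, 2, 0],
         [0, 0, 0, 0]]
assert zigzag(block) == [10, 6, 7, 0, 0, 4, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0]
assert run_length(zigzag(block)) == [(0, 10), (0, 6), (0, 7), (2, 4),
                                     (1, 1), (0, 1), (2, 2), (0, 0)]
```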

Each such sequence is encoded using Huffman or arithmetic coding [5] (which one is used is indicated in the compressed file's header; the header also includes information on y and x or, equivalently, on the number of matrix blocks in the x and y directions, as well as on the number of color planes encoded). The algorithm starts with matrix A00''. After finishing with Aij'', it goes on to matrix Ai,j+1'' if j < x/8 - 1, to Ai+1,0'' if i < y/8 - 1, or else goes on to encode the next color plane. After all color planes have been processed, the algorithm writes the EOF (end-of-file) marker and saves the new file.

Some versions of the JPEG algorithm choose to compress the DC (direct current, i.e. constant) frequency component (Aij'')00 separately from the AC (alternating current, i.e. oscillating) frequency components (Aij'')pq with p ≠ 0 or q ≠ 0. The reason is that the DC component varies slowly from block to block across the image, so its value can be coded with very few bits. But since the DC component is only 1/64 of each matrix, the improvement in compression is minor.

To summarize, the JPEG compression pipeline consists of the following stages: (optional) conversion from RGB to YIQ, separation into color planes, subdivision of each plane into 8 x 8 blocks, the DCT of each block, quantization, zigzag run-length coding, and finally Huffman or arithmetic coding [3].

JPEG Decompression Algorithm

If we want to display or otherwise use a JPEG-compressed image, we need to decompress it. The decompression algorithm performs the same steps as the compression algorithm, but in reverse. All the transformations in the JPEG compression algorithm are invertible (i.e. one-to-one) except the quantization of the frequency amplitudes, since rounding is involved in that step and we therefore cannot determine the original amplitudes exactly. However, this step most seriously affects the high frequencies, to which the human eye is not very sensitive anyway.
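Continuing the earlier NumPy sketches, the per-block decompression step amounts to undoing the scaling and applying the inverse DCT (the rounding itself cannot be undone; function and parameter names are mine):

```python
def decode_block(quant: np.ndarray, Q: np.ndarray, T: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Approximately reconstruct an 8 x 8 spatial block from its quantized coefficients."""
    freq_approx = quant * c * Q     # undo the quality constant and the quantization matrix
    return T.T @ freq_approx @ T    # inverse DCT:  Aij = T^T Aij' T

restored = decode_block(quant, Q, T)
print(np.max(np.abs(restored - block)))   # non-zero in general: quantization made the process lossy
```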

References

  1. Guy E. Blelloch, Introduction to Data Compression, 2001.
  2. Hareesh Kesavan, Choosing a DCT Quantization Matrix for JPEG Encoding, 1997.
  3. JPEG Tutorial, 2000.
  4. Data Compression.
  5. T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms, MIT Press, 1999, pp. 337-343.

 
