In this article I will attempt to illustrate some of the pitfalls of thinking about down-scaling in the spatial domain, and highlight the benefits of thinking about it in the frequency domain. I will attempt to keep it free of maths, and instead concentrate on practical examples.
This image has a number of features that make it particularly suitable for evaluating scaling performance, for example the hyperbolic zone plates at the corners and the frequency sweep just below the centre. That frequency sweep extends from zero frequency (DC) on the left to the maximum possible horizontal frequency (320 cycles-per-picture-width) on the right. The image will also test your PC display, so if you see any dark bands or other strange effects towards the right-hand end of the frequency sweep, or in the zone plates, bear in mind that they're not really there!
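If you would like to experiment along with this article, a single row of a comparable frequency sweep can be generated with a few lines of Python and numpy. The exact formula behind the test image isn't given here, so treat this as a rough stand-in rather than a reproduction of it:

    import numpy as np

    W = 640                                 # picture width in pixels
    x = (np.arange(W) + 0.5) / W            # 0..1 across the picture, pixel centres

    # A linear chirp: the instantaneous frequency rises from 0 (DC) at the left
    # edge to 320 cycles-per-picture-width at the right edge.  The phase is the
    # integral of the frequency, i.e. 2*pi * 160 * x**2.
    sweep = 0.5 + 0.5 * np.cos(2 * np.pi * 160 * x ** 2)

    row = np.uint8(np.round(255 * sweep))   # one row of an 8-bit grey-scale sweep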
Now we can get to the heart of what this article is all about, and because it's so important I'm going to put it in a box:
We know that, in the case of this example, the source image contains horizontal frequencies up to 320 cycles-per-picture-width, yet the scaled-down output image can only contain horizontal frequencies up to 240 cycles-per-picture-width. What, then, happens to those high-frequency components in the original image? The answer is that they must either be removed (i.e. attenuated) by the down-scaling process, or they get converted to aliases in the output image. In this context an alias is rather like a reflection: for example a frequency of 300 cycles in the source image – 60 cycles higher than the Nyquist Limit for the output image – will (if not removed) become an alias frequency of 180 cycles – 60 cycles lower than the Nyquist Limit. You will see later what aliases look like, but suffice it to say that they are not desirable!

Many down-scaling algorithms do not take account of the necessity to remove spectral components that cannot be carried in the output image, and therefore they suffer from aliasing. This article will demonstrate that, by thinking about the problem in the frequency domain, this aliasing can be avoided (or at least reduced).
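To make that 'reflection' concrete, here is a small numerical experiment in Python with numpy (the resampling method and names are my own illustration, not taken from any particular scaler). A 300-cycle sinusoid is resampled from 640 to 480 samples using plain linear interpolation with no pre-filtering, and the strongest frequency in the result is measured:

    import numpy as np

    W_IN, W_OUT = 640, 480            # input and output widths in pixels
    F_SRC = 300                       # source frequency, cycles-per-picture-width

    # Pixel-centre positions, expressed as a fraction of the picture width.
    x_in = (np.arange(W_IN) + 0.5) / W_IN
    x_out = (np.arange(W_OUT) + 0.5) / W_OUT

    row_in = np.sin(2 * np.pi * F_SRC * x_in)

    # Naive down-scaling: linear interpolation with no anti-alias filtering.
    row_out = np.interp(x_out, x_in, row_in)

    # Locate the strongest frequency in the 480-sample output row.
    spectrum = np.abs(np.fft.rfft(row_out))
    print(np.argmax(spectrum[1:]) + 1)       # prints 180, not 300

The 300-cycle component has folded about the output Nyquist Limit of 240 cycles and landed at 180 cycles, exactly the reflection described above.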
Let's analyse in a little more detail what's involved with down-scaling the image from 640 pixels across to 480 pixels across. For every eight samples in the input image we need to create six samples in the output image; that can be visualised as follows:
Input: | a-----b-----c-----d-----e-----f-----g-----h |
Output: | -A-------B-------C-------D-------E-------F- |
Here a, b, c, d, e, f, g, h represent 8 input samples, and A, B, C, D, E, F represent 6 output samples. Note the relative positions of the samples: none of the output samples is at the same position in the image as an input sample. This ensures that the output image is correctly centered with respect to the input image (some scaling algorithms may get this wrong).
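If you prefer numbers to diagrams, the output sample positions can be computed directly. The short sketch below (my own, assuming the usual convention that samples sit at pixel centres) shows where the six output samples fall within a group of eight input samples:

    import numpy as np

    GROUP_IN, GROUP_OUT = 8, 6    # 8 input samples map to 6 output samples

    # Positions measured in input-sample pitches, with input sample 'a' at 0.
    # Centring both grids on the same picture gives:
    pos_out = (np.arange(GROUP_OUT) + 0.5) * GROUP_IN / GROUP_OUT - 0.5

    print(pos_out)
    # [0.1667  1.5  2.8333  4.1667  5.5  6.8333] (approximately)
    # i.e. A sits 1/6 of the way from a to b, B exactly half-way between b and c,
    # C sits 5/6 of the way from c to d, and the pattern repeats for D, E and F.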
The simplest possible approach is to set each output sample equal to the nearest input sample. Output samples B and E are actually positioned exactly half-way between two input samples, so here I have arbitrarily set them equal to the input sample to their left. The output samples are therefore:

A = a
B = b
C = d
D = e
E = f
F = h

You may guess that such a simple algorithm doesn't give very good results, and you would be right! Here is the output image:
An obvious improvement is to calculate each output sample as a weighted average of the two input samples either side of it, with the weights determined by its position between them. Recall the sample positions:

Input: | a-----b-----c-----d-----e-----f-----g-----h |
Output: | -A-------B-------C-------D-------E-------F- |
The values of the output samples are calculated as follows:
A = a*0.833333 + b*0.166667
B = b*0.500000 + c*0.500000
C = c*0.166667 + d*0.833333
D = e*0.833333 + f*0.166667
E = f*0.500000 + g*0.500000
F = g*0.166667 + h*0.833333

The output image is as follows:
This is certainly a big improvement, but there are still problems. The zone plates in the corners are covered in dark bands which shouldn't be there, and the right-hand (high frequency) end of the frequency sweep is a real mess.
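As an aside, weights like these are very easy to apply in code. The loop below is my own minimal illustration (not an excerpt from any particular scaler): it walks along one 640-sample row in groups of eight input samples and produces six output samples per group using the three pairs of weights above.

    import numpy as np

    # Two-point averaging weights for the three output phases, each given as
    # (offset of the left-hand input sample within the half-group, weights).
    PHASES = [
        (0, (0.833333, 0.166667)),   # used for A and D
        (1, (0.500000, 0.500000)),   # used for B and E
        (2, (0.166667, 0.833333)),   # used for C and F
    ]

    def downscale_row_4_to_3(row):
        """Down-scale one row by 4:3 using the two-point averaging weights."""
        row = np.asarray(row, dtype=float)
        out = []
        for group in range(len(row) // 8):          # every 8 inputs -> 6 outputs
            base = group * 8
            for half in (0, 4):                     # a..d, then e..h
                for offset, (w0, w1) in PHASES:
                    i = base + half + offset
                    out.append(w0 * row[i] + w1 * row[i + 1])
        return np.array(out)

    print(len(downscale_row_4_to_3(np.arange(640.0))))   # 480

The same structure works for the four-tap filters that appear later in the article; each output sample then draws on four consecutive input samples instead of two.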
Let's think a bit more deeply about why it isn't very good. The averaging processes represented by the above equations can be thought of as filters; in fact they belong to a special class called Finite Impulse Response (FIR) filters. Such filters can be analysed in the frequency domain to reveal information about why the output image isn't as good as it could be. In particular one can consider how the magnitude response and the group delay response of the filters vary with frequency:
You will note that in this particular example there are only three different filters used for the down-sampling process (the filter for sample A is the same as the filter for D, that for B is the same as that for E, and that for C is the same as that for F). Those three filters are represented in the graphs above in black, red and green; in the left-hand graph the green curve is covered by the black curve. The horizontal axis in each case is frequency, in cycles-per-picture-width.
What exactly are these graphs telling us? We know that frequencies in excess of 240 cycles cannot be represented in the output image, and if not attenuated will result in aliasing. You can see that one of the filters (the red curve) does a reasonable job of attenuating those frequencies - its response is down to about 0.4 at 240 cycles. However the other two filters (the black and the green) do not attenuate the high frequencies very much. This immediately tells us that aliasing is likely to be a problem; something that wouldn't be apparent without analysing the process in the frequency domain.
The group delay graph tells us about how accurately the output samples are positioned. This will probably not be obvious, but the effective position of the output samples with respect to the input samples depends on the frequency of the signal (at least, it does for the black and green filters). At DC the samples are correctly positioned at half-way between the input samples (red curve), one-third of a sample to the left of that position (green curve) and one-third of a sample to the right of that position (black curve). However at higher frequencies the upper and lower curves diverge from the wanted ± one-third positions. This will impair the quality of the image.
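Curves like these are straightforward to reproduce. The sketch below is my own, using scipy; the conversion to cycles-per-picture-width assumes the filters are applied at the 640-sample input rate, so the input Nyquist limit is 320 cycles:

    import numpy as np
    from scipy.signal import freqz, group_delay

    # The three two-point averaging filters (A/D, B/E and C/F above).
    filters = {
        "A/D": [0.833333, 0.166667],
        "B/E": [0.500000, 0.500000],
        "C/F": [0.166667, 0.833333],
    }

    INPUT_NYQUIST = 320   # cycles-per-picture-width for a 640-sample input row

    for name, taps in filters.items():
        w, h = freqz(taps, worN=512)                # w runs from 0 towards pi rad/sample
        f_cpw = w / np.pi * INPUT_NYQUIST           # convert to cycles-per-picture-width
        _, gd = group_delay((taps, [1.0]), w=w)     # group delay in input samples
        mag_240 = np.interp(240, f_cpw, np.abs(h))
        print(f"{name}: |H| at 240 cycles = {mag_240:.2f}, delay at DC = {gd[0]:.2f} samples")

    # Expected output: magnitudes of roughly 0.73 / 0.38 / 0.73 at 240 cycles, and
    # DC delays of about 0.17, 0.50 and 0.83 samples (i.e. 1/6, 1/2 and 5/6).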
Input: | a-----b-----c-----d-----e-----f-----g-----h |
Output: | -A-------B-------C-------D-------E-------F- |
Cubic interpolation operates on the principle of fitting a set of cubic polynomials to the input samples, such that the result is a continuous smooth curve. The output sample values can be 'read off' the curve where it crosses their positions in the image. The result in this case is that the output samples are calculated as follows:
B = a*-0.125000 + b*0.625000 + c*0.625000 + d*-0.125000
C = b*-0.023148 + c*0.189815 + d*0.949074 + e*-0.115741
D = d*-0.115741 + e*0.949074 + f*0.189815 + g*-0.023148
E = e*-0.125000 + f*0.625000 + g*0.625000 + h*-0.125000

Because each output sample is derived from four consecutive input samples we can't show the equations for samples A and F, but in just the same way as equation B has the same coefficients as equation E, equation A is the same as equation D and equation F is the same as equation C. In fact, when down-sampling in the ratio 4:3 there can only ever be three different sets of coefficients.
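Incidentally, the coefficients quoted above are exactly what the standard cubic-convolution (Keys) kernel gives when its free parameter is set to a = -1 and it is evaluated at the 1/2 and 5/6 sample offsets. The sketch below (my own) reproduces them:

    import numpy as np

    def cubic_kernel(x, a=-1.0):
        """Cubic-convolution (Keys) kernel; a = -1 reproduces the weights above."""
        x = abs(x)
        if x < 1:
            return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
        if x < 2:
            return a * (x ** 3 - 5 * x ** 2 + 8 * x - 4)
        return 0.0

    def weights(t):
        """Weights on the four input samples around an output at offset t (0..1)."""
        return [cubic_kernel(t + 1), cubic_kernel(t), cubic_kernel(t - 1), cubic_kernel(t - 2)]

    print(np.round(weights(1 / 2), 6))   # [-0.125     0.625     0.625    -0.125   ]  -> B and E
    print(np.round(weights(5 / 6), 6))   # [-0.023148  0.189815  0.949074 -0.115741]  -> C and F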
The resulting image is as follows:
A bit better perhaps, but still not perfect and the right-hand end of the frequency sweep still looks a mess. Perhaps analysing this method of interpolation in terms of filter responses will tell us why:
This is interesting. Looking at the magnitude curve we can see that the response is maintained closer to 1.0 than the previous method, in fact over much of the range it is greater than 1.0. This explains why an image generated by cubic interpolation can look nice and sharp. However it is even worse than the two-point average filter in attenuating those high frequencies that cause aliases, so it's not surprising the high-frequency end of the frequency sweep is still a mess.
The group delay curve is revealing, because at low frequencies the output sample positions are less accurate than those of the averaging filter, at approximately ± one-quarter of a sample rather than the correct value of ± one-third of a sample. As the frequency increases the positions become more accurate, approaching the correct values at about 110 cycles; above that they diverge. Again, this is hardly ideal.
FIR filter synthesis is a complicated subject, and it is beyond the scope of this article to explain how it is done. Suffice it to say that there are various algorithms (a well-known one being the Parks-McClellan algorithm) which can synthesise optimum filters against a set of criteria. Just such a set of filters for the present application is as follows:
B = a*-0.103516 + b*0.603516 + c*0.603516 + d*-0.103516
C = b*-0.045105 + c*0.187988 + d*0.923279 + e*-0.066182
D = d*-0.066182 + e*0.923279 + f*0.187988 + g*-0.045105
E = e*-0.103516 + f*0.603516 + g*0.603516 + h*-0.103516

The filter response curves are as follows:
Not dramatically different, to be sure, but the magnitude response is slightly flatter than the cubic and the delay response is significantly closer to the ideal than any of the previous efforts. The resulting image is as follows (it's hard to see the differences from the last one, but it is a little better):
So if reducing aliasing is the primary requirement, an alternative set of synthesised FIR filters is as follows:
B = a*0.122803 + b*0.377197 + c*0.377197 + d*0.122803
C = b*0.037048 + c*0.308289 + d*0.437622 + e*0.217041
D = d*0.217041 + e*0.437622 + f*0.308289 + g*0.037048
E = e*0.122803 + f*0.377197 + g*0.377197 + h*0.122803

These give rise to the following response curves:
Here we've managed to attenuate those frequencies which would cause aliasing (above 240 cycles) to well below a magnitude of 0.1, but of course the price we pay is in the sharpness of the resulting image:
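You can verify that with the same sort of response calculation as shown earlier; for example the B/E filter above comes out at roughly 0.06 at 240 cycles:

    import numpy as np
    from scipy.signal import freqz

    taps = [0.122803, 0.377197, 0.377197, 0.122803]   # the B/E filter above
    w, h = freqz(taps, worN=2048)
    f_cpw = w / np.pi * 320                           # input Nyquist = 320 cycles
    print(np.interp(240, f_cpw, np.abs(h)))           # approximately 0.06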
The way to get better attenuation of the aliasing frequencies without sacrificing as much sharpness is simply to use more coefficients (taps) per output sample. Here are the results from some filters synthesised with eight coefficients:
Here you can see that the right-hand end of the frequency sweep has been significantly attenuated. Of course if you want to do even better you can synthesise filters with even more taps, at the cost of increased processing time. It's not unusual to use filters with 16 taps or more in critical applications, such as broadcast television.
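For readers who want to experiment with synthesising their own filter sets, here is one way a multi-tap set like this can be produced in Python: design a single low-pass prototype with the Parks-McClellan (Remez) algorithm at three times the input rate, then split it into three phases of eight taps each. The band edges, tap count and scaling below are my own illustrative choices and will not reproduce the exact coefficients used in this article:

    import numpy as np
    from scipy.signal import remez

    # Down-scaling 640 -> 480 is a 4:3 ratio: conceptually up-sample by 3,
    # low-pass filter, then keep every 4th sample.  Design the prototype at
    # the up-sampled rate of 3 * 640 = 1920 samples per picture width.
    FS = 1920
    NUM_TAPS = 24                       # 24 taps -> three phases of 8 taps each

    # Pass frequencies up to about 160 cycles, and reject everything above the
    # output Nyquist limit of 240 cycles (band edges chosen for illustration).
    prototype = remez(NUM_TAPS, [0, 160, 240, FS / 2], [1, 0], fs=FS)
    prototype *= 3                      # so each phase sums to approximately 1

    # Polyphase split: each output phase uses every 3rd prototype coefficient.
    phases = [prototype[p::3] for p in range(3)]
    for p, taps in enumerate(phases):
        print(f"phase {p}: {np.round(taps, 6)} (sum = {taps.sum():.3f})")

Matching each phase to the correct output position (the 1/6, 1/2 and 5/6 offsets discussed earlier) and handling the edges of the picture take a little more book-keeping, but the essence of the method is just this: one good low-pass prototype, shared out between the output phases.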