Low-level optimization tips

Now that you've chosen a fast, low-complexity algorithm, it's time to optimize the #*($& out of it.  Read on...

Pixel accumulation

A lot of image processing algorithms require you to accumulate pixel sums.  The most straightforward way is to use separate integer accumulators for each channel; however, you can gain some speed by accumulating channels in parallel; do this by accumulating red and blue together, and by not shifting green:

rb_sum += (pixel & 0xff00ff);
g_sum += (pixel & 0xff00);

This allows you to accumulate up to 257 unscaled pixels.  Pixels can be prescaled by regular integer multiplication, and if you restrict yourself to 256 unscaled pixels, you can round the result as well:

rb_sum = 0x040004;
g_sum = 0x0400;

for(int x=0; x<8; x++) {
	Pixel32 p = src[x];

	rb_sum += p&0xff00ff;
	g_sum += p&0xff00;
}

*dst++ = ((rb_sum & 0x07f807f8) + (g_sum & 0x0007f800)) >> 3;

If you have to accumulate the alpha channel as well, you will have to sacrifice some speed to shift the alpha and green channels down.  Also, this trick won't work if your scaling coefficients don't add up to a power of two, because you can't shift to do the division; in this case, you'll have to separate the rb_sum result into separate channels before doing the non-power-of-two divide.

Kicking out the loop counter

Loop counters are expensive in scalar code because of the Intel register shortage, so you should always look for ways to eliminate one or more counting variables.  The first way is to use the destination pointer as the counter:

dst_limit = dst + w;

do {
    ...
} while(++dst < dst_limit);

If you have multiple pointers which are walking the same amounts, you can also convert some into pointer offsets:

ptrdiff_t   srcoffset = src - dst;
Pixel32    *dst_limit = dst + w;

do {
	p = *(dst + srcoffset);
	...
	*dst++ = p;
} while(dst < dst_limit);

You can also access memory by constant offsets from the counter:

x = -w;
dst += w;

do {
	...
	dst[x] = result;
} while(++x);

Averaging pixels

The most straightforward way to average pixels is to separate the pixels into channels, average all the channels together, and refold the channels.  This is sloooowwww.  You can get better performance by averaging all channels in parallel:

pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + (p1&p2&0x01010101);   /* round down */
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + ((p1|p2)&0x01010101); /* round up */

You'll notice that p1 and p2 have opposite shift-mask orders.  This is done to help the compiler reduce pressure on the shifting unit; there are two ALUs in the CPU but only one shift unit.

Don't drop the rounding term!  If you try, and end up with code like this:

pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f);
Your average will be biased toward zero, and more importantly, it will be impossible to achieve a FF result.

Alignment

VirtualDub guarantees that all input and output scanlines are aligned to 4-byte boundaries, but not to 8-byte boundaries.  You need to be careful about this if you are using MMX code, since 64-bit MMX memory accesses can incur misalignment penalties.  Therefore, for best results, you should replace a move-unpack instruction pair:

movq       mm0,[esi]
pxor       mm7,mm7
movq       mm1,mm0
punpcklbw  mm0,mm7
punpckhbw  mm1,mm7
with a dual load pair:
movd       mm0,[esi]
pxor       mm7,mm7
movd       mm1,[esi+4]
punpcklbw  mm0,mm7
punpckhbw  mm1,mm7

[up] back to main page


VirtualDub external filter SDK 1.03©1999-2000 Avery Lee (uleea05@umail.ucsb.edu)