Low-level optimization tips |
Now that you've chosen a fast, low-complexity algorithm, it's time to optimize the #*($& out of it. Read on...
Pixel accumulation |
A lot of image processing algorithms require you to accumulate pixel sums. The most straightforward way is to use separate integer accumulators for each channel; however, you can gain some speed by accumulating channels in parallel; do this by accumulating red and blue together, and by not shifting green:
rb_sum += (pixel & 0xff00ff);
g_sum += (pixel & 0xff00);
This allows you to accumulate up to 257 unscaled pixels. Pixels can be prescaled by regular integer multiplication, and if you restrict yourself to 256 unscaled pixels, you can round the result as well:
rb_sum = 0x040004;
g_sum = 0x0400;
for(int x=0; x<8; x++) {
Pixel32 p = src[x];
rb_sum += p&0xff00ff;
g_sum += p&0xff00;
}
*dst++ = ((rb_sum & 0x07f807f8) + (g_sum & 0x0007f800)) >> 3;
If you have to accumulate the alpha channel as well, you will have to sacrifice some speed to shift the alpha and green channels down. Also, this trick won't work if your scaling coefficients don't add up to a power of two, because you can't shift to do the division; in this case, you'll have to separate the rb_sum result into separate channels before doing the non-power-of-two divide.
Kicking out the loop counter |
Loop counters are expensive in scalar code because of the Intel register shortage, so you should always
look for ways to eliminate one or more counting variables. The first way is to use the destination pointer as the
counter:
dst_limit = dst + w;
do {
...
} while(++dst < dst_limit);
If you have multiple pointers which are walking the same amounts, you can also convert some into pointer offsets:
ptrdiff_t srcoffset = src - dst;
Pixel32 *dst_limit = dst + w;
do {
p = *(dst + srcoffset);
...
*dst++ = p;
} while(dst < dst_limit);
You can also access memory by constant offsets from the counter:
x = -w;
dst += w;
do {
...
dst[x] = result;
} while(++x);
Averaging pixels |
The most straightforward way to average pixels is to separate the pixels into channels, average all the channels
together, and refold the channels. This is sloooowwww. You can get better performance by averaging
all channels in parallel:
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + (p1&p2&0x01010101); /* round down */
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + ((p1|p2)&0x01010101); /* round up */
You'll notice that p1 and p2 have opposite shift-mask orders. This is done to help the compiler reduce pressure on the shifting unit; there are two ALUs in the CPU but only one shift unit.
Don't drop the rounding term! If you try, and end up with code like this:
Your average will be biased toward zero, and more importantly, it will be impossible to achieve a FF
result.
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f);
Alignment |
VirtualDub guarantees that all input and output scanlines are aligned to 4-byte boundaries, but not to 8-byte
boundaries. You need to be careful about this if you are using MMX code, since 64-bit MMX memory accesses
can incur misalignment penalties. Therefore, for best results, you should replace a move-unpack instruction
pair:
with a dual load pair:
movq mm0,[esi]
pxor mm7,mm7
movq mm1,mm0
punpcklbw mm0,mm7
punpckhbw mm1,mm7
movd mm0,[esi]
pxor mm7,mm7
movd mm1,[esi+4]
punpcklbw mm0,mm7
punpckhbw mm1,mm7
VirtualDub external filter SDK 1.03 | ©1999-2000 Avery Lee (uleea05@umail.ucsb.edu) |