Letter by Mike Edwards

from    Michael K. Edwards m.k.edwards@gmail.com via lists.sourceforge.net 
reply-to        libjpeg-turbo Developers <libjpeg-turbo-devel@lists.sourceforge.net>
to      libjpeg-turbo Developers <libjpeg-turbo-devel@lists.sourceforge.net>
cc      Linaro Development <linaro-dev@lists.linaro.org>
date    Thu, Oct 27, 2011 at 6:23 PM
subject Re: [Libjpeg-turbo-devel] libjpeg8c vs libjpeg-turbo with libjpeg8 compat on
mailing list    libjpeg-turbo-devel.lists.sourceforge.net Filter messages from this mailing list
mailed-by       lists.sourceforge.net
unsubscribe     Unsubscribe from this mailing-list
        Important mainly because of your interaction with messages in the conversation.
hide details Oct 27 (6 days ago)
Here's one way that NEON could be employed to accelerate Huffman
decoding.  The most common 32 symbols typically account for over 99%
of Huffman codes in a JPEG image, and are typically encoded with
codons of length 2-10 bits.  Four 128-bit registers can hold these 32
codons as left-justified 16-bit values.  You can pull 64 bits of the
decode buffer into a NEON register, and act on it as follows:

 * replicate its leading 16 bits into all 8 slots in a uint16x8_t
with a VDUP instruction;
 * compare against the 32 left-justified codons using four VCGE instructions;
 * sum up the resulting 0's and -1's with four VADDs and two VPADDs;
 * negate the result with a VSUB to get either a table index from
0-31 or a 32 (indicating that it wasn't one of the first 32 codons);
 * look this up in a table of codon sizes (32 bytes stored in two
128-bit registers) with a VTBL (this is the largest VTBL operation,
and the reason why we limit to the first 32 codons); this gets you a
number of bits by which to shift the decode buffer (64 bits is the
largest block you can shift with a VSHL);
 * in addition to performing the shift, VADD the shift to a running
total of bits consumed, and VTST it into a "logical accumulator".  The
purpose of this is to detect whether the result of the VTBL look up is
ever zero, i. e., if the index was ever out of range, implying that we
hit a codon outside the first 32.

You might as well repeat this sequence 8 times, slotting the table
indices one by one into a vector, without pausing to test for misses.
One more VTBL maps these 8 indices to symbols (from another table of
32 bytes stored in two 128-bit registers).  Then move the whole batch
over to the ARM side along with the "logical accumulator" and the
count of consumed bits, fix up the corner cases (codon outside the
first 32, too many bits consumed), and set up for the next round.

I obviously haven't tested this yet; but I would expect it to
significantly reduce the time spent in Huffman decoding, largely
because it should drastically reduce the number of branch operations.
On typical images, 9 times out of 10 through the loop, a single
test-and-branch verifies that all 8 lookups succeeded and loops back.
There may be a few branches in the code that shifts data into the
decode buffer, but they are normally executed once per batch of 8
codons -- and even those can probably be converted to conditional
operations by a sufficiently smart compiler.

Anyone want to play at coding this up with me next week during Linaro Connect?

- Michael

TomGall/LibJpegTurboHuffmanDecode (last modified 2011-11-03 15:56:15)