Overview

One of the Linaro toolchain projects was to make GCC's automatic vectoriser take advantage of the NEON multi-vector vldN and vstN instructions. We have now added this functionality to FSF GCC trunk (which will become GCC 4.7). We've also included it in Linaro's GCC 4.6 toolchain.

This page contains examples of libav functions that were affected by the change. I compiled the code using FSF GCC revision 177400, passing the following options:

  • -O2 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ffast-math -funsafe-loop-optimizations -ftree-vectorize -mvectorize-with-neon-quad -std=c99

Taking each option in turn:

  • -O2 optimises for speed while avoiding the excessive code growth that can sometimes be seen with -O3.
  • -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon enables NEON and optimises for the Cortex-A8 (I used a standard arm-linux-gnueabi configuration, without the Linaro --with-* configure options).
  • -ffast-math enables -funsafe-math-optimizations, which is needed before floating-point reductions can be reassociated and therefore vectorised (see the example after this list).
  • -funsafe-loop-optimizations is needed so that loops with "unsigned int" counters can be assumed not to overflow
  • -ftree-vectorize enables automatic vectorisation
  • -mvectorize-with-neon-quad allows the automatic vectoriser to use quad (128-bit) registers.
  • -std=c99 is needed for the "restrict" keyword.
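
To see why the floating-point options matter, consider a plain dot product (this fragment is my own illustration rather than libav code):

    float dot(const float *restrict a, const float *restrict b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

Vectorising this loop changes the order in which the partial sums are accumulated, so without -ffast-math (specifically the -fassociative-math it implies) GCC has to keep the additions in source order and leaves the loop scalar.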

For each loop I wrote a simple microbenchmark and ran it on a BeagleBoard (Cortex-A8). The "before" results are without -ftree-vectorize, and the "after" results are with it.
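
The harnesses themselves aren't reproduced on this page; each one just repeats the loop many times and reports the elapsed wall-clock time. A minimal sketch of the aacsbr-1 harness, assuming a lag of 2 and zero-initialised input (both my choices, not anything taken from libav), might look like this:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    #define RUNS 5000000
    #define LAG  2

    static float x[38 + LAG][2];   /* zero-initialised test input */

    int main(void)
    {
        float real_sum = 0.0f, imag_sum = 0.0f;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int run = 0; run < RUNS; run++)
            for (int i = 1; i < 38; i++) {
                real_sum += x[i][0] * x[i+LAG][0] + x[i][1] * x[i+LAG][1];
                imag_sum += x[i][0] * x[i+LAG][1] - x[i][1] * x[i+LAG][0];
            }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Printing the sums stops the whole computation from being
           optimised away. */
        printf("%d runs take %fs (%f %f)\n", RUNS,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9,
               real_sum, imag_sum);
        return 0;
    }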

aacsbr.c:autocorrelate()

Source code:

    float real_sum = 0.0f;
    float imag_sum = 0.0f;
    ...
    for (i = 1; i < 38; i++) {
        real_sum += x[i][0] * x[i+lag][0] + x[i][1] * x[i+lag][1];
        imag_sum += x[i][0] * x[i+lag][1] - x[i][1] * x[i+lag][0];
    }

Output:

.L181:
          vld2.32   {d22-d25}, [r3]
          add       ip, r4, r3
          add       r3, r3, #32
          vld2.32   {d18-d21}, [ip]
          vmov      q8, q11  @ v4sf
          vmov      q11, q12  @ v4sf
          cmp       r3, r5
          vmov      q12, q9  @ v4sf
          vmul.f32  q9, q9, q8
          vmul.f32  q8, q10, q8
          vmla.f32  q9, q10, q11
          vmls.f32  q8, q12, q11
          vadd.f32  q14, q14, q9
          vadd.f32  q13, q13, q8
          bne       .L181

Micro benchmark:

aacsbr-1
  before:  5000000 runs take 37.725s
  after:   5000000 runs take 4.08365s
  speedup: x9.24

Another loop in the same function.

Source code:

    for (i = 1; i < 38; i++) {
        real_sum += x[i][0] * x[i][0] + x[i][1] * x[i][1];
    }

Output:

.L589:
          vld2.32   {d18-d21}, [r2]!
          vmov      q8, q9  @ v4sf
          cmp       r2, r1
          vmul.f32  q8, q8, q8
          vmla.f32  q8, q10, q10
          vadd.f32  q11, q11, q8
          bne       .L589

Micro benchmark:

aacsbr-2
  before:  5000000 runs take 21.1694s
  after:   5000000 runs take 3.02008s
  speedup: x7.01

aacsbr.c:ff_aac_sbr_init()

This function is explicitly marked as cold code, but even so:

Source code:

    for (n = 0; n < 320; n++)
        sbr_qmf_window_ds[n] = sbr_qmf_window_us[2*n];

Output:

.L372:
          vld2.16   {d16-d19}, [r2]!
          add       r3, r3, #1
          cmp       r0, r3
          vst1.16   {q8}, [r1]!
          bhi       .L372

Micro benchmark:

aacsbr-3
  before:  4000000 runs take 10.436s
  after:   4000000 runs take 5.20148s
  speedup: x2.01

avs.c:avs_decode_frame()

Source code:

    for (i=first; i<last; i++, buf+=3)
        pal[i] = (buf[0] << 18) | (buf[1] << 10) | (buf[2] << 2);

Output:

.L9:
          mov       r0, r4
          add       ip, ip, #1
          vld3.8    {d16, d18, d20}, [r0]!
          add       r6, r2, #32
          add       r5, r2, #48
          cmp       ip, r7
          mov       r1, r2
          add       r4, r4, #48
          add       r2, r2, #64
          vld3.8    {d17, d19, d21}, [r0]
          vmov      q13, q8  @ v16qi
          vmov      q8, q10  @ v16qi
          vmovl.u8  q11, d19
          vmovl.u8  q12, d26
          vmovl.u8  q10, d27
          vmovl.u8  q13, d18
          vmovl.u8  q9, d16
          vmovl.u8  q8, d17
          vmovl.u16 q15, d26
          vmovl.u16 q14, d24
          vmovl.u16 q1, d22
          vmovl.u16 q2, d20
          vmovl.u16 q11, d23
          vmovl.u16 q10, d21
          vmovl.u16 q13, d27
          vmovl.u16 q12, d25
          vmovl.u16 q3, d18
          vmovl.u16 q0, d19
          vmovl.u16 q9, d16
          vmovl.u16 q8, d17
          vshl.i32  q4, q11, #10
          vshl.i32  q5, q10, #18
          vshl.i32  q15, q15, #10
          vshl.i32  q14, q14, #18
          vshl.i32  q13, q13, #10
          vshl.i32  q12, q12, #18
          vshl.i32  q1, q1, #10
          vshl.i32  q2, q2, #18
          vshl.i32  q3, q3, #2
          vshl.i32  q10, q0, #2
          vshl.i32  q9, q9, #2
          vshl.i32  q8, q8, #2
          vorr      q11, q15, q14
          vorr      q12, q13, q12
          vorr      q2, q1, q2
          vorr      q4, q4, q5
          vorr      q11, q11, q3
          vorr      q10, q12, q10
          vorr      q9, q2, q9
          vorr      q8, q4, q8
          vst1.32   {q11}, [r1]!
          vst1.32   {q10}, [r1]
          vst1.32   {q9}, [r6]
          vst1.32   {q8}, [r5]
          bcc       .L9

Micro benchmark:

avs
  before:  1000000 runs take 3.65213s
  after:   1000000 runs take 2.26584s
  speedup: x1.61

cdgraphics.c:cdg_load_palette()

Probably cold code again.
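Multiplying a 4-bit colour component by 17 scales it to the full 8-bit range (15 * 17 = 255).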

Source code:

    for (i = 0; i < 8; i++) {
        color = (data[2 * i] << 6) + (data[2 * i + 1] & 0x3F);
        r = ((color >> 8) & 0x000F) * 17;
        g = ((color >> 4) & 0x000F) * 17;
        b = ((color ) & 0x000F) * 17;
        palette[i + array_offset] = r << 16 | g << 8 | b;
    }

Output:

          vld2.8    {d18-d19}, [fp:64]
          vmov.i8   d26, #63  @ v8qi
          vmov.i16  d16, #15  @ v4hi
          add       r1, r3, #16
          vand      d26, d19, d26
          add       r2, r3, #24
          vmovl.u8  q12, d18
          vmov.i32  q11, #0  @ v8hi
          vmovl.u8  q13, d26
          vmov      d28, d24
          vmov      q10, q11  @ v8hi
          vmov      d24, d25
          vmov      q9, q11  @ v8hi
          vmov      d29, d26
          vshl.i16  d28, d28, #6
          vadd.i16  d25, d29, d28
          vmov      d26, d27
          vshl.i16  d27, d24, #6
          vadd.i16  d24, d26, d27
          vshr.u16  d29, d25, #8
          vshr.u16  d28, d25, #4
          vand      d29, d29, d16
          vand      d28, d28, d16
          vshr.u16  d27, d24, #8
          vshr.u16  d26, d24, #4
          vmov      d22, d29
          vand      d27, d27, d16
          vmov      d20, d28
          vand      d26, d26, d16
          vand      d25, d25, d16
          vmov      d23, d27
          vand      d16, d24, d16
          vmov      d21, d26
          vmov.i8   d17, #17  @ v8qi
          vmovn.i16 d22, q11
          vmov      d18, d25
          vmovn.i16 d20, q10
          vmov      d19, d16
          vmul.i8   d21, d22, d17
          vmul.i8   d22, d20, d17
          vmovn.i16 d16, q9
          vmul.i8   d16, d16, d17
          vmovl.u8  q10, d21
          vmovl.u8  q9, d22
          vmov      d26, d20
          vmov      d24, d18
          vmovl.u8  q8, d16
          vmovl.u16 q13, d26
          vmov      d18, d19
          vmovl.u16 q12, d24
          vmov      d20, d21
          vmov      d19, d16
          vmovl.u16 q11, d20
          vmov      d16, d17
          vmovl.u16 q10, d18
          vmov      d28, d26
          vmov      d29, d24
          vmovl.u16 q9, d19
          vmovl.u16 q8, d16
          vmov      d24, d25
          vmov      d26, d27
          vshl.i32  d25, d29, #8
          vmov      d27, d22
          vshl.i32  d28, d28, #16
          vmov      d22, d23
          vorr      d28, d28, d25
          vmov      d25, d20
          vshl.i32  d23, d24, #8
          vmov      d20, d21
          vshl.i32  d24, d27, #16
          vshl.i32  d21, d25, #8
          vmov      d27, d18
          vmov      d25, d19
          vorr      d24, d24, d21
          vshl.i32  d18, d22, #16
          vmov      d21, d17
          vshl.i32  d26, d26, #16
          vorr      d26, d26, d23
          vmov      d23, d16
          vshl.i32  d17, d20, #8
          vorr      d16, d18, d17
          vorr      d19, d28, d27
          vorr      d18, d26, d25
          vorr      d17, d24, d23
          vorr      d16, d16, d21
          vst1.32   {d19}, [r3]!
          vst1.32   {d18}, [r3]
          vst1.32   {d17}, [r1]
          vst1.32   {d16}, [r2]

Micro benchmark:

cdgraphics
  before:  1000000 runs take 7.25052s
  after:   1000000 runs take 2.40619s
  speedup: x3.01

dwt.c:horizontal_decompose53i()

Source code:

    for(x=0; x<width2; x++){
        temp[x ]= b[2*x ];
        temp[x+w2]= b[2*x + 1];
    }

Output:

.L104:
          vld2.32   {d16-d19}, [ip]!
          add       r2, r2, #1
          add       r4, r3, r7
          cmp       r2, r6
          vst1.32   {q8}, [r4]
          vst1.32   {q9}, [r3]!
          bcc       .L104

Micro benchmark:

dwt
  before:  2000000 runs take 9.25055s
  after:   2000000 runs take 9.25723s
  speedup: x0.999

dxa.c:decode_frame()

Another variation on the palette code.

Source code:

    for(i = 0; i < 256; i++){
        r = *buf++;
        g = *buf++;
        b = *buf++;
        c->pal[i] = (r << 16) | (g << 8) | b;
    }

Output:

.L13:
          mov       r0, r2
          add       r2, r2, #48
          vld3.8    {d16, d18, d20}, [r0]!
          add       lr, r3, #32
          add       ip, r3, #48
          cmp       r2, r5
          mov       r1, r3
          add       r3, r3, #64
          vld3.8    {d17, d19, d21}, [r0]
          vmov      q11, q8  @ v16qi
          vmov      q8, q10  @ v16qi
          vmovl.u8  q12, d22
          vmovl.u8  q10, d23
          vmovl.u8  q11, d18
          vmovl.u8  q9, d19
          vmovl.u16 q0, d24
          vmovl.u16 q15, d22
          vmovl.u16 q2, d25
          vmovl.u16 q14, d23
          vmovl.u16 q13, d18
          vmovl.u16 q12, d21
          vmovl.u16 q3, d19
          vmovl.u16 q1, d20
          vmovl.u8  q9, d16
          vshl.i32  q15, q15, #8
          vmovl.u8  q8, d17
          vshl.i32  q13, q13, #8
          vshl.i32  q11, q0, #16
          vshl.i32  q2, q2, #16
          vshl.i32  q14, q14, #8
          vshl.i32  q10, q1, #16
          vshl.i32  q12, q12, #16
          vshl.i32  q3, q3, #8
          vorr      q11, q11, q15
          vorr      q10, q10, q13
          vmovl.u16 q15, d18
          vmovl.u16 q13, d16
          vorr      q14, q2, q14
          vmovl.u16 q9, d19
          vorr      q12, q12, q3
          vmovl.u16 q8, d17
          vorr      q11, q11, q15
          vorr      q9, q14, q9
          vorr      q10, q10, q13
          vorr      q8, q12, q8
          vst1.32   {q11}, [r1]!
          vst1.32   {q9}, [r1]
          vst1.32   {q10}, [lr]
          vst1.32   {q8}, [ip]
          bne       .L13

Micro benchmark:

dxa
  before:  2000000 runs take 7.28299s
  after:   2000000 runs take 4.55399s
  speedup: x1.6

mjpegenc.c:escape_FF()
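
In a JPEG bitstream every 0xFF byte must be escaped, so this loop counts the 0xFF bytes a word at a time: for each byte of v, the low nibble of (v & (v>>4)) is 0xF exactly when that byte is 0xFF, and adding 0x01010101 then carries into the bit kept by the 0x10101010 mask, giving per-byte counts that the final shifts sum up.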

Source code:

    for(; i<size-15; i+=16){
        int acc, v;

        v= *(uint32_t*)(&buf[i]);
        acc= (((v & (v>>4))&0x0F0F0F0F)+0x01010101)&0x10101010;
        v= *(uint32_t*)(&buf[i+4]);
        acc+=(((v & (v>>4))&0x0F0F0F0F)+0x01010101)&0x10101010;
        v= *(uint32_t*)(&buf[i+8]);
        acc+=(((v & (v>>4))&0x0F0F0F0F)+0x01010101)&0x10101010;
        v= *(uint32_t*)(&buf[i+12]);
        acc+=(((v & (v>>4))&0x0F0F0F0F)+0x01010101)&0x10101010;

        acc>>=4;
        acc+= (acc>>16);
        acc+= (acc>>8);
        ff_count+= acc&0xFF;
    }

Output:

.L212:
          mov       r1, ip
          add       r0, r0, #1
          vld4.32   {d16, d18, d20, d22}, [r1]!
          cmp       r0, r4
          add       ip, ip, #64
          vld4.32   {d17, d19, d21, d23}, [r1]
          vmov      q2, q8  @ v4si
          vshr.s32  q5, q8, #4
          vshr.s32  q4, q9, #4
          vmov      q15, q10  @ v4si
          vand      q3, q9, q14
          vand      q2, q2, q14
          vshr.s32  q10, q10, #4
          vand      q2, q2, q5
          vand      q3, q3, q4
          vand      q15, q15, q14
          vshr.s32  q9, q11, #4
          vand      q8, q11, q14
          vadd.i32  q2, q2, q13
          vadd.i32  q3, q3, q13
          vand      q15, q15, q10
          vand      q9, q8, q9
          vand      q2, q2, q12
          vand      q3, q3, q12
          vadd.i32  q15, q15, q13
          vadd.i32  q8, q2, q3
          vand      q15, q15, q12
          vadd.i32  q9, q9, q13
          vadd.i32  q8, q8, q15
          vand      q9, q9, q12
          vadd.i32  q8, q8, q9
          vshr.s32  q8, q8, #4
          vshr.s32  q9, q8, #16
          vadd.i32  q8, q8, q9
          vshr.s32  q9, q8, #8
          vadd.i32  q8, q8, q9
          vand      q8, q8, q0
          vadd.i32  q1, q1, q8
          bcc       .L212

Micro benchmark:

mjpegenc
  before:  500000 runs take 8.25491s
  after:   500000 runs take 3.28369s
  speedup: x2.51

qtrle.c:qtrle_decode_32bpp()

This example is basically a copy on big-endian targets and a byte-swap on little-endian targets. (I did all these tests for armel rather than armeb.)

Source code:

    while (rle_code--) {
        a = s->buf[stream_ptr++];
        r = s->buf[stream_ptr++];
        g = s->buf[stream_ptr++];
        b = s->buf[stream_ptr++];
        argb = (a << 24) | (r << 16) | (g << 8) | (b << 0);
        *(unsigned int *)(&rgb[pixel_ptr]) = argb;
        pixel_ptr += 4;
    }

Output:

.L218:
          add       r1, r3, r7
          add       r0, r0, #1
          add       r4, r3, #32
          add       lr, r3, #48
          vld4.8    {d16, d18, d20, d22}, [r1]!
          cmp       r0, r6
          mov       r2, r3
          add       r3, r3, #64
          vld4.8    {d17, d19, d21, d23}, [r1]
          vmovl.u8  q3, d16
          vmovl.u8  q14, d17
          vmovl.u8  q15, d18
          vmovl.u8  q13, d19
          vmov      q8, q11  @ v16qi
          vmovl.u16 q0, d28
          vmovl.u16 q1, d26
          vmovl.u16 q2, d6
          vmovl.u16 q4, d30
          vmovl.u16 q3, d7
          vmovl.u16 q15, d31
          vmovl.u16 q14, d29
          vmovl.u16 q13, d27
          vmovl.u8  q11, d20
          vmovl.u8  q10, d21
          vmovl.u8  q9, d16
          vshl.i32  q2, q2, #24
          vmovl.u8  q8, d17
          vshl.i32  q12, q0, #24
          vshl.i32  q5, q1, #16
          vshl.i32  q13, q13, #16
          vshl.i32  q4, q4, #16
          vshl.i32  q3, q3, #24
          vshl.i32  q15, q15, #16
          vshl.i32  q14, q14, #24
          vmovl.u16 q0, d22
          vmovl.u16 q1, d20
          vmovl.u16 q11, d23
          vmovl.u16 q10, d21
          vorr      q4, q2, q4
          vmovl.u16 q6, d16
          vmovl.u16 q2, d18
          vorr      q14, q14, q13
          vorr      q15, q3, q15
          vmovl.u16 q9, d19
          vorr      q12, q12, q5
          vmovl.u16 q8, d17
          vshl.i32  q0, q0, #8
          vshl.i32  q11, q11, #8
          vshl.i32  q1, q1, #8
          vshl.i32  q10, q10, #8
          vorr      q2, q4, q2
          vorr      q9, q15, q9
          vorr      q12, q12, q6
          vorr      q8, q14, q8
          vorr      q13, q2, q0
          vorr      q9, q9, q11
          vorr      q12, q12, q1
          vorr      q8, q8, q10
          vst1.32   {q13}, [r2]!
          vst1.32   {q9}, [r2]
          vst1.32   {q12}, [r4]
          vst1.32   {q8}, [lr]
          bcc       .L218

Micro benchmark:

qtrle
  before:  1000000 runs take 8.27667s
  after:   1000000 runs take 4.54797s
  speedup: x1.82

resample.c:stereo_to_mono()

Source code:

    while (n > 0) {
        q[0] = (p[0] + p[1]) >> 1;
        q++;
        p += 2;
        n--;
    }

Output:

.L53:
          vld2.16   {d16-d19}, [ip]!
          add       r1, r1, #1
          add       lr, r8, r2
          cmp       r1, r7
          vst1.16   {q8}, [r2]
          add       r2, r2, #16
          vst1.16   {q9}, [lr]
          bcc       .L53

Micro benchmark:

resample
  before:  1000000 runs take 7.24585s
  after:   1000000 runs take 2.32373s
  speedup: x3.12

rgb2rgb.c:rgb24tobgr32_C()

Source code:

    while (s < end) {
        *dest++ = *s++;
        *dest++ = *s++;
        *dest++ = *s++;
        *dest++ = 255;
    }

Output:

.L4:
          mov       ip, r6
          vmov.i32  q11, #4294967295  @ v16qi
          vld3.8    {d24, d26, d28}, [ip]!
          add       r4, r4, #1
          cmp       r4, r7
          mov       r3, r5
          add       r6, r6, #48
          add       r5, r5, #64
          vld3.8    {d25, d27, d29}, [ip]
          vmov      q8, q12  @ v16qi
          vmov      q9, q13  @ v16qi
          vmov      q10, q14  @ v16qi
          vst4.8    {d16, d18, d20, d22}, [r3]!
          vst4.8    {d17, d19, d21, d23}, [r3]
          bcc       .L4

Micro benchmark:

rgb2rgb-rgb24tobgr32
  before:  2000000 runs take 7.30218s
  after:   2000000 runs take 4.55164s
  speedup: x1.6

rgb2rgb.c:rgb32tobgr24_C()

Source code:

    while (s < end) {
        *dest++ = *s++;
        *dest++ = *s++;
        *dest++ = *s++;
        s++;
    }

Output:

.L15:
          mov       ip, r6
          add       r4, r4, #1
          vld4.8    {d16, d18, d20, d22}, [ip]!
          cmp       r4, r7
          mov       r3, r5
          add       r6, r6, #64
          add       r5, r5, #48
          vld4.8    {d17, d19, d21, d23}, [ip]
          vmov      q12, q8  @ v16qi
          vmov      q13, q9  @ v16qi
          vmov      q14, q10  @ v16qi
          vst3.8    {d24, d26, d28}, [r3]!
          vst3.8    {d25, d27, d29}, [r3]
          bcc       .L15

Micro benchmark:

rgb2rgb-rgb32tobgr24
  before:  2000000 runs take 6.2756s
  after:   2000000 runs take 3.62195s
  speedup: x1.73

rgb2rgb.c:rgb24tobgr16_C()

Source code:

    while (s < end) {
        const int b = *s++;
        const int g = *s++;
        const int r = *s++;
        *d++ = (b>>3) | ((g&0xFC)<<3) | ((r&0xF8)<<8);
    }

Output:

.L87:
          mov       ip, r6
          add       r4, r4, #1
          vld3.8    {d16, d18, d20}, [ip]!
          cmp       r4, r7
          mov       r3, r5
          add       r6, r6, #48
          add       r5, r5, #32
          vld3.8    {d17, d19, d21}, [ip]
          vand      q12, q9, q14
          vand      q11, q10, q13
          vshr.u8   q8, q8, #3
          vmovl.u8  q10, d24
          vmovl.u8  q9, d22
          vmovl.u8  q12, d25
          vmovl.u8  q11, d23
          vshl.i16  q10, q10, #3
          vshl.i16  q9, q9, #8
          vshl.i16  q12, q12, #3
          vshl.i16  q11, q11, #8
          vorr      q9, q10, q9
          vorr      q11, q12, q11
          vmovl.u8  q10, d16
          vmovl.u8  q8, d17
          vorr      q9, q9, q10
          vorr      q8, q11, q8
          vst1.16   {q9}, [r3]!
          vst1.16   {q8}, [r3]
          bcc       .L87

Micro benchmark:

rgb2rgb-rgb24tobgr16
  before:  1000000 runs take 3.65076s
  after:   1000000 runs take 1.11987s
  speedup: x3.26

rgb2rgb.c:yv12touyvy_C()

Source code:

    for (i = 0; i < chromWidth; i++) {
        *idst++ = uc[0] + (yc[0] << 8) +
           (vc[0] << 16) + (yc[1] << 24);

        yc += 2;
        uc++;
        vc++;
    }

Output:

.L196:
          vld2.8    {d16-d19}, [r5]!
          add       ip, sl, r1
          add       r4, r4, #1
          add       r7, r2, #32
          vmov      q12, q8  @ v16qi
          vld1.8    {q0}, [ip]
          vmov      q8, q9  @ v16qi
          add       r6, r2, #48
          vld1.8    {q9}, [r1]
          vmovl.u8  q5, d0
          vmovl.u8  q15, d24
          cmp       fp, r4
          vmovl.u8  q12, d25
          mov       ip, r2
          vmovl.u8  q0, d1
          add       r2, r2, #64
          vmovl.u16 q2, d30
          add       r1, r1, #16
          vmovl.u16 q13, d24
          vmovl.u16 q15, d31
          vmovl.u16 q12, d25
          vmovl.u8  q3, d18
          vmovl.u8  q10, d16
          vmovl.u8  q9, d19
          vmovl.u8  q8, d17
          vmovl.u16 q6, d10
          vmovl.u16 q4, d0
          vmovl.u16 q5, d11
          vmovl.u16 q0, d1
          vshl.i32  q2, q2, #8
          vshl.i32  q15, q15, #8
          vshl.i32  q13, q13, #8
          vshl.i32  q12, q12, #8
          vmovl.u16 q1, d6
          vmovl.u16 q11, d20
          vmovl.u16 q14, d18
          vmovl.u16 q7, d16
          vmovl.u16 q3, d7
          vmovl.u16 q10, d21
          vmovl.u16 q9, d19
          vmovl.u16 q8, d17
          vshl.i32  q6, q6, #16
          vshl.i32  q5, q5, #16
          vshl.i32  q4, q4, #16
          vshl.i32  q0, q0, #16
          vadd.i32  q2, q1, q2
          vadd.i32  q15, q3, q15
          vadd.i32  q13, q14, q13
          vadd.i32  q9, q9, q12
          vshl.i32  q11, q11, #24
          vshl.i32  q10, q10, #24
          vshl.i32  q7, q7, #24
          vshl.i32  q8, q8, #24
          vadd.i32  q6, q2, q6
          vadd.i32  q5, q15, q5
          vadd.i32  q4, q13, q4
          vadd.i32  q0, q9, q0
          vadd.i32  q6, q6, q11
          vadd.i32  q5, q5, q10
          vadd.i32  q4, q4, q7
          vadd.i32  q8, q0, q8
          vst1.32   {q6}, [ip]!
          vst1.32   {q5}, [ip]
          vst1.32   {q4}, [r7]
          vst1.32   {q8}, [r6]
          bhi       .L196

Micro benchmark:

rgb2rgb-yv12touyvy
  before:  1500000 runs take 6.24701s
  after:   1500000 runs take 3.52701s
  speedup: x1.77

rgb2rgb.c:yuy2toyv12_C()

Source code:

    for (i=0; i<chromWidth; i++) {
        ydst[2*i+0] = src[4*i+0];
        udst[i] = src[4*i+1];
        ydst[2*i+1] = src[4*i+2];
        vdst[i] = src[4*i+3];
    }

Output:

.L223:
          mov       r0, r4
          add       ip, ip, #1
          vld4.8    {d16, d18, d20, d22}, [r0]!
          add       r7, r8, r1
          cmp       ip, r9
          add       r4, r4, #64
          vld4.8    {d17, d19, d21, d23}, [r0]
          vmov      q12, q8  @ v16qi
          vmov      q13, q10  @ v16qi
          vst1.8    {q9}, [r1]
          add       r1, r1, #16
          vst2.8    {d24-d27}, [r6]!
          vst1.8    {q11}, [r7]
          bcc       .L223

Micro benchmark:

rgb2rgb-yuy2toyv12
  before:  500000 runs take 52.5894s
  after:   500000 runs take 4.51532s
  speedup: x11.6

rgb2rgb.c:shuffle_bytes_0321()

Source code:

    for (i = 0; i < src_size; i+=4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 3];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 1];
    }

Output:

.L568:
          mov       r5, ip
          add       r6, r6, #1
          vld4.8    {d16, d18, d20, d22}, [r5]!
          add       r4, r8, ip
          cmp       r7, r6
          add       ip, ip, #64
          vld4.8    {d17, d19, d21, d23}, [r5]
          vmov      q12, q8  @ v16qi
          vmov      q13, q11  @ v16qi
          vmov      q14, q10  @ v16qi
          vmov      q15, q9  @ v16qi
          vst4.8    {d24, d26, d28, d30}, [r4]!
          vst4.8    {d25, d27, d29, d31}, [r4]
          bhi       .L568

Micro benchmark:

rgb2rgb-shuffle-bytes
  before:  500000 runs take 4.13925s
  after:   500000 runs take 2.29379s
  speedup: x1.8

twinvq.c:eval_lpc_spectrum()

Source code:

    for (j = 0; j + 1 < order; j += 2*2) {
        q *= lsp[j] - two_cos_w;
        p *= lsp[j+1] - two_cos_w;

        q *= lsp[j+2] - two_cos_w;
        p *= lsp[j+3] - two_cos_w;
    }

Output:

.L4:
          mov       r3, ip
          add       r2, r2, #1
          vld4.32   {d16, d18, d20, d22}, [r3]!
          cmp       r4, r2
          add       ip, ip, #64
          vld4.32   {d17, d19, d21, d23}, [r3]
          vsub.f32  q3, q8, q12
          vsub.f32  q2, q10, q12
          vsub.f32  q8, q9, q12
          vsub.f32  q15, q11, q12
          vmul.f32  q9, q2, q3
          vmul.f32  q8, q15, q8
          vmul.f32  q14, q14, q9
          vmul.f32  q13, q13, q8
          bhi       .L4

Micro benchmark:

twinvq
  before:  500000 runs take 5.75027s
  after:   500000 runs take 0.447266s
  speedup: x12.9

wmavoice.c:wiener_denoise()

Source code:

    for (n = 1; n < 64; n++) {
        float v1 = synth_pf[n * 2], v2 = synth_pf[n * 2 + 1];
        synth_pf[n * 2] = v1 * coeffs[n * 2] - v2 * coeffs[n * 2 + 1];
        synth_pf[n * 2 + 1] = v2 * coeffs[n * 2] + v1 * coeffs[n * 2 + 1];
    }
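
Each synth_pf[2n], synth_pf[2n+1] pair holds the real and imaginary parts of a complex value, so the loop is an element-wise complex multiplication by coeffs; that interleaved layout is exactly what vld2 and vst2 handle.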

Output:

.L102:
          sub       r1, r3, #576
          vld2.32   {d20-d23}, [r3]
          vld2.32   {d16-d19}, [r1]
          vmov      q12, q10  @ v4sf
          vmov      q10, q11  @ v4sf
          vmov      q11, q8  @ v4sf
          vmov      q8, q9  @ v4sf
          vmul.f32  q9, q11, q12
          vmul.f32  q11, q11, q10
          vmov      q4, q9  @ v4sf
          vmov      q5, q11  @ v4sf
          vmls.f32  q4, q8, q10
          vmla.f32  q5, q8, q12
          vst2.32   {d8-d11}, [r3]!
          cmp       r3, r2
          bne       .L102

Micro benchmark:

wmavoice
  before:  500000 runs take 5.61029s
  after:   500000 runs take 0.934143s
  speedup: x6.01
