Summary

This specification covers porting and optimizing the VP6 decoder for Linaro on ARM Cortex-A9 with NEON. The primary goal is to port the codec to a multi-core system such as the Cortex-A9, which will involve architectural changes in the codec. The secondary goal is NEON assembly optimization.

Assumptions

  • A Cortex-A9 board is available to work on.
  • GDB, the perf tool, and OProfile are available on Cortex-A9 (required for profiling; otherwise look for a workaround such as DS-5).

Design

  • Comparative analysis of Google's VP6 and FFmpeg's VP6 on Cortex-A9.
  • Check the feasibility of using OpenMP or OpenCL for multi-threading VP6. The focus should be on tool support and the overheads of using these standards.
  • Have the selected baseline VP6 running with a test setup on the board. This is required to benchmark the received code and to check the codec's conformance with the VP6 standard. Ideally, code coverage should be measured, with new test vectors generated and a complete test setup in place.
  • Check the performance parameters and update them here.

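Table 1: Performance Statistics

| Date/Release | Test file name and parameter | Program memory (KB) | Scratch (KB) | Stack (KB) | Static memory (KB) | Codec MCPS/FPS** |
|--------------|------------------------------|---------------------|--------------|------------|--------------------|------------------|
| 10-11-10     |                              |                     |              |            |                    |                  |
| ..           |                              |                     |              |            |                    |                  |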

  • Document the basic design (code analysis and reading of the standard). Since this is a performance optimization specification, the design will not follow a traditional approach. The design will be updated here on reaching the design milestone as given in the blueprint's whiteboard.
  • Profiled codec data, listing each function with its percentage load and absolute load, will be updated here.

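Table 2: Codec Profiled Data

| Serial No | Function Name | % of Codec MCPS* | Absolute MCPS | Remarks |
|-----------|---------------|------------------|---------------|---------|
| 1         |               |                  |               |         |
| 2         |               |                  |               |         |
| ..        |               |                  |               |         |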

*Million Cycles per second. **Frames per second

Implementation

C code Changes

  • Implement multi-threading on the x86 host system in pure C. The changes can be made using Linux system calls and semaphores, so that the codec can be split architecturally into multiple threads.
  • Port to the target board and measure performance as shown in the performance statistics chart (Table 1). The measurement will read system timers before and after the decoder function call to calculate MCPS. The API should be measured in a single thread.
    1. No file I/O inside the codec
    2. Input is the encoded stream
    3. Output is decoded raw video frames
    4. General measurement algorithm:

   Test_Wrapper(args)
   {
       Time1 = Gettime();
       Call Codec(args);
       Time2 = Gettime();
       Time = Time2 - Time1;
       TotalTime = TotalTime + Time;
   }
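
The measurement steps above can be sketched in C as follows. This is a minimal sketch, not the project's test harness: the `decode_fn` callback type, the `test_wrapper` and `codec_mcps` names, and the choice of `clock_gettime(CLOCK_MONOTONIC)` as the system timer are all assumptions made for illustration. The conversion to MCPS also assumes that the figure of interest is the number of cycles needed to decode one second of video at the stream's frame rate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical decoder entry point; stands in for the real VP6 decode call. */
typedef int (*decode_fn)(void *ctx);

static uint64_t total_ns;  /* accumulated time spent inside the codec, in ns */
static uint64_t frames;    /* number of timed decode calls */

/* Read a monotonic system timer in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time one decode call. Only the codec call sits between the two timestamps,
 * so file I/O done by the caller is excluded from the total. */
int test_wrapper(decode_fn decode, void *ctx)
{
    uint64_t t1 = now_ns();
    int rc = decode(ctx);
    uint64_t t2 = now_ns();

    total_ns += t2 - t1;
    frames++;
    return rc;
}

/* Convert accumulated time to MCPS (assumed definition):
 * cycles per frame at cpu_hz, multiplied by the stream frame rate, in millions. */
double codec_mcps(uint64_t ns, uint64_t nframes, double cpu_hz, double stream_fps)
{
    assert(nframes > 0 && cpu_hz > 0.0 && stream_fps > 0.0);
    double cycles_per_frame = (double)ns * 1e-9 * cpu_hz / (double)nframes;
    return cycles_per_frame * stream_fps / 1e6;
}
```

After the last frame, a call such as `codec_mcps(total_ns, frames, cpu_hz, stream_fps)` would give the value for Table 1, where `cpu_hz` is the core clock of the board and `stream_fps` is the frame rate of the test stream. The result is only meaningful if the core clock is fixed during the run.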

  • Calculate total cycles from the total time. This method applies to codec performance measurement at the unit-testing level, where decoding runs in non-real time. Once the codec is integrated with the middleware, measurements will be taken with tools such as top and powertop to see the system load, since decoding will then happen in real time. Those measurements correlate with the codec MCPS but are outside the scope of this development.
  • Interact with the Kernel and Tools teams for scheduling support. Kernel-level support may be required to schedule different threads on different processor cores.
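
The idea of splitting the codec into threads with semaphores can be illustrated with a two-stage hand-off. This is only a sketch under stated assumptions: the stage names (`parse_stage`, `recon_stage`), the single-slot buffer, and the integer "frames" are hypothetical stand-ins, and the actual split points in the VP6 decoder still have to come out of the design analysis. It uses POSIX threads and POSIX semaphores, which are available on both the x86 host and the target.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define NUM_FRAMES 8

static sem_t slot_free;  /* posted when the consumer has emptied the slot */
static sem_t slot_full;  /* posted when the producer has filled the slot */
static int   slot;       /* single-entry hand-off buffer between the stages */
static int   checksum;   /* consumer-side result, read after both joins */

/* Stage 1: stand-in for bitstream parsing; produces one item per frame. */
static void *parse_stage(void *arg)
{
    (void)arg;
    for (int i = 1; i <= NUM_FRAMES; i++) {
        sem_wait(&slot_free);   /* wait until the slot may be overwritten */
        slot = i;
        sem_post(&slot_full);   /* hand the item to the next stage */
    }
    return NULL;
}

/* Stage 2: stand-in for reconstruction; consumes one item per frame. */
static void *recon_stage(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_FRAMES; i++) {
        sem_wait(&slot_full);   /* wait for an item from stage 1 */
        checksum += slot;
        sem_post(&slot_free);   /* release the slot back to stage 1 */
    }
    return NULL;
}

/* Run both stages to completion and return the sum of all items seen. */
int run_pipeline(void)
{
    pthread_t t1, t2;
    int rc;

    checksum = 0;
    sem_init(&slot_free, 0, 1);  /* the slot starts empty */
    sem_init(&slot_full, 0, 0);

    rc = pthread_create(&t1, NULL, parse_stage, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, recon_stage, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&slot_free);
    sem_destroy(&slot_full);
    return checksum;
}
```

With a single slot the two stages overlap by at most one item, so a real split would likely need a deeper queue. On Linux, each thread could additionally be pinned to a core with `pthread_setaffinity_np` (a GNU extension); that is left out here because whether explicit pinning is needed is exactly the question to raise with the kernel team.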

Processor specific code changes

  • Profile on the target using perf or another performance measurement tool and update Table 2.
  • Check for alignment issues and align data where required.
  • Check for dual-issue opportunities in NEON and code using intrinsics where required.
  • Write NEON assembly for the functions identified as hotspots.
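
As an illustration of the alignment check and the intrinsics step, the sketch below vectorizes a typical reconstruction operation: adding a residual to a prediction and clamping to 8 bits. The choice of this operation is an assumption; the real hotspots are whatever Table 2 shows after profiling. The function names are hypothetical, and a scalar path is kept so the same file builds on the x86 host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Return non-zero if p is aligned to `align` bytes (align must be a power of two). */
int is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Scalar reference: dst[i] = clamp(pred[i] + res[i], 0, 255). */
static void add_residual_c(uint8_t *dst, const uint8_t *pred,
                           const int16_t *res, int n)
{
    for (int i = 0; i < n; i++) {
        int v = pred[i] + res[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Same operation, eight pixels per iteration when NEON is available. */
void add_residual(uint8_t *dst, const uint8_t *pred, const int16_t *res, int n)
{
    int i = 0;
#if HAVE_NEON
    for (; i + 8 <= n; i += 8) {
        /* Widen 8 prediction bytes to 16 bits so the sum cannot wrap at 8 bits. */
        int16x8_t p = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(pred + i)));
        int16x8_t r = vld1q_s16(res + i);
        /* Saturating add, then saturating narrow to unsigned 8 bits (the clamp). */
        vst1_u8(dst + i, vqmovun_s16(vqaddq_s16(p, r)));
    }
#endif
    /* Scalar tail, or the whole buffer when NEON is not available. */
    add_residual_c(dst + i, pred + i, res + i, n - i);
}
```

Comparing the intrinsics path against the scalar reference on the test vectors is a cheap conformance check before moving a function to hand-written assembly. The `is_aligned` helper can be used in debug builds to confirm that buffers passed to such routines meet the alignment the loads are expected to benefit from.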

Test/Demo Plan

Run on any Linaro-approved board from the command line, with the decoded stream redirected to the display driver for display.

Unresolved issues

Configuration Management and packaging.

WorkingGroups/Middleware/Multimedia/Specs/1105/OptimizeVp6Decoding (last modified 2010-11-21 11:00:02)