The toolchain group is focused on improving the performance of code on Cortex-A devices. As part of that we want to measure any improvements in an automated, valid, reproducible way.
Andy, a GCC developer, wants to see if his latest change improves the pybench score.
Barry, a potential Linaro GCC user, wants to see how Linaro GCC compares to FSF GCC.
Charlie wants to see how the Linaro GCC performance has improved over time.
Danny wants to see how much of a speed boost he gets at -O3 and also the increase in file size.
Earl wants to monitor Linaro GCC and make sure the speed on ARMv5 devices doesn't significantly regress.
Non user stories
The system will not answer the following questions:
Fred wants to see how a Snowball board compares to a PandaBoard.
George wants to see how much faster YouTube playback is with Linaro.
The TSC is to settle on a benchmarking policy that at least allows publishing of relative, anonymous results for one board. It would be nice if relative results could be published for a readily available board such as the PandaBoard. It would be nice if absolute results could be published.
The target profile is portable devices. Workloads shall be defined that match this profile, and benchmarks shall be defined from these workloads.
The system shall support the following suites:
- EEMBC DENBench
- SPEC CPU2000
- SPEC CPU2006
SPEC CPU2006 may only run on targets with > 1 GB of RAM.
The system should support the following suites:
- FFMPEG decoding H.264 video
- Vorbis encoding and decoding audio
- GNU Go
- Dhrystone and Whetstone
Tests shall primarily exercise the toolchain by exercising the generated code on the CPU. Tests shall be generally CPU bound and should not rely on the I/O or memory performance of the board.
The system shall be applicable to current Cortex-A9 and Atom hardware. The system should be applicable to current x86 and ARMv5 hardware. Applicable is defined as running for long enough to give valid results and fitting within the constraints of readily available boards.
The target shall run Linux and have a sufficiently recent kernel and distribution to meet the other requirements on this page. The target must have at least 256 MB of RAM and an Ethernet connection.
The system should not preclude use on bare metal targets such as the Cortex-M3 or ARM7 devices.
Results shall be valid. Individual runs shall be sufficiently long that a 0.1 % change can be detected. Sufficient runs shall be done to determine whether any particular result is statistically significant.
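As a sketch of this validity check, the number of runs needed before a 0.1 % change stands out from run-to-run noise can be estimated from the sample's coefficient of variation; the formula and the 95 % confidence level below are illustrative, not part of the spec:

```python
import math
import statistics

def runs_needed(times, rel_change=0.001, z=1.96):
    """Estimate how many runs are needed before a relative change of
    `rel_change` (0.1% by default) exceeds the confidence interval of
    the mean.  Rough rule of thumb: n >= (z * cv / rel_change) ** 2,
    where cv is the sample's coefficient of variation and z is the
    normal quantile for the chosen confidence level (1.96 for 95%)."""
    mean = statistics.mean(times)
    cv = statistics.stdev(times) / mean
    # Always do at least two runs so a deviation can be computed at all.
    return max(2, math.ceil((z * cv / rel_change) ** 2))
```

A perfectly stable benchmark needs only the minimum two runs, while a noisy one (a few percent of jitter) quickly needs hundreds, which is one reason the individual runs must be long.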
Results shall be reproducible. All inputs to the test, including benchmark source, driver scripts, and host configuration, shall be controlled and captured as part of the test results. This must provide sufficient information to easily reproduce any run.
Results should be externally reproducible. A third party should be able to reproduce any particular run and verify the results. PENDING: this requires the TSC approval above to identify the target.
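One way to capture the inputs is a small manifest recording a hash of every input file plus a host description; the file names and fields in this sketch are illustrative only:

```python
import hashlib
import json
import platform

def manifest(files, extra=None):
    """Record a sha1 of each input file plus the host description so a
    run can be reproduced later.  `files` maps name -> raw bytes;
    `extra` carries any further key/value pairs (compiler version,
    kernel, driver script revision, ...)."""
    inputs = {name: hashlib.sha1(data).hexdigest()
              for name, data in files.items()}
    return json.dumps({"inputs": inputs,
                       "host": platform.platform(),
                       **(extra or {})}, sort_keys=True)
```

Storing the manifest next to the results means the hashes can later be checked against the archived benchmark sources and scripts.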
The system shall support running different variants of a test without manual intervention. A variant shall be able to have different configuration and/or compiler flags to show the difference between features such as the optimisation level.
- -O2 vs -O3
- With and without errata
Runs shall be able to be batched. Any combination of compilers, benchmarks, and variants shall be able to be batched and run in series. The results for each combination shall be separate from the others. All results shall be able to be collected at the end of the batch. Results should be able to be inspected while a batch is running without interfering with the results.
For example:
- CoreMark results for all Linaro GCC releases
- pybench results at -O2, -O3, and -Os for one compiler
- All benchmark results for the latest release
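The combinations in a batch can be generated mechanically. The sketch below assumes a compiler is just a name and a variant is a name plus a flag list; this is an illustration of the combinatorics, not the system's actual interface:

```python
import itertools

# Illustrative names only.
compilers = ["gcc-linaro-4.5", "gcc-fsf-4.5"]
benchmarks = ["coremark", "pybench"]
variants = {"O2": ["-O2"], "O3": ["-O3"], "Os": ["-Os"]}

def batch(compilers, benchmarks, variants):
    """Yield every compiler/benchmark/variant combination so a driver
    can run them in series, keeping each result separate."""
    for cc, bench, (name, flags) in itertools.product(
            compilers, benchmarks, variants.items()):
        yield {"compiler": cc, "benchmark": bench,
               "variant": name, "cflags": flags}

jobs = list(batch(compilers, benchmarks, variants))
```

Running the jobs in series and writing each result under a key derived from the combination keeps the results separate, as required above.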
The operator should be able to tell that the batch is running and the stage the batch is up to within a two hour resolution. This reporting shall not affect the results.
To allow reproducible runs and batching, all runs shall be automated. The automation suite shall be shared as a public project. The suite will not be a Linaro product and will not be maintained or released as such.
Both the tests and the automation shall be minimal. Tests shall prefer exercising one aspect of the toolchain where possible. Automation shall be focused on producing reliable results and, if a transient error occurs, allowing manual clean up.
The system shall be suitable for manual runs, runs as part of a continuous integration system, and developer runs. A manual run uses a pre-built toolchain and gathers the results. A continuous integration run updates to the latest build, runs, and gathers the results. A developer run is made against a local, in-development compiler and may be used for feedback on new optimisations.
The system shall not preclude the results being either controlled or anonymised. Results may be private due to licensing or commercial reasons. Results may be anonymised at the target level to allow toolchain comparisons without board comparisons.
The system shall be extensible. It shall be easy to add new tests, variants, variables, and steps to the system.
The host shall be locked down during a test. Locked down is defined as running with the minimum number of services that still allow the test to be controlled and to run. It is expected that this will include:
- A serial login prompt
- SSH server
- NIS client, if required
- NFS client, if required
It is assumed that network load does not affect the test results. This should be checked as part of the implementation.
The system shall allow the following measurements for any test:
- Elapsed wall time
- Elapsed user time
- Executable file size
- Text, data, and BSS size of any executable
- Measurements reported by the test itself
- 'perf' summary
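A minimal sketch of the time measurements, plus a parser for the section sizes, which would come from running the binutils `size` tool on each executable; the Berkeley text/data/bss column layout assumed below is the tool's default, but the wrapper itself is only an illustration:

```python
import os
import time

def measure(fn, *args):
    """Run `fn` and report elapsed wall and user time alongside the
    test's own result.  A real driver would wrap the benchmark process
    instead of a Python callable."""
    t0_wall = time.perf_counter()
    t0 = os.times()
    result = fn(*args)
    t1 = os.times()
    return {"wall": time.perf_counter() - t0_wall,
            "user": t1.user - t0.user,
            "result": result}

def parse_size(output):
    """Parse Berkeley-format `size` output (text/data/bss columns)."""
    header, row = output.strip().splitlines()[:2]
    fields = row.split()
    return {"text": int(fields[0]),
            "data": int(fields[1]),
            "bss": int(fields[2])}
```

The executable file size itself is just `os.path.getsize()` on the binary, and the 'perf' summary would be captured from the tool's stderr in the same spirit.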
Results shall be machine and human readable. A parser shall be developed that parses all results. The parser shall parse sub-tests as well as the top level results.
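A possible shape for such a parser, assuming a hypothetical result format of `name: value` lines with sub-tests indented beneath their parent; the real benchmark output formats will differ per suite:

```python
import re

# Hypothetical format: top-level "name: value" lines, sub-tests
# indented beneath the result they belong to.
LINE = re.compile(r"^(\s*)([\w-]+):\s*([\d.]+)\s*$")

def parse_results(text):
    """Parse top-level results and their sub-tests into a nested dict."""
    results, current = {}, None
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # ignore chatter between result lines
        indent, name, value = m.groups()
        if indent:
            results[current]["subtests"][name] = float(value)
        else:
            current = name
            results[name] = {"value": float(value), "subtests": {}}
    return results
```

In practice each suite would get its own regular expressions, but the nested value/subtests output shape can stay common so the tabulator only has to understand one structure.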
A tabulator shall be developed that summarises results and emits a table. Each row shall contain sufficient information to decide if the results are valid. The summary may contain derived information, such as the span, if it makes the results easier for a human to read.
The tabulated results shall be suitable for experimenting with in LibreOffice Calc or similar.
Note: michaelh quite likes sorting by best then using auto filter to quickly summarise results.
The columns shall include:
- Unique compiler name
- Test name
- Variant name
- Sub test name
- Number of runs
- The span between minimum and maximum
- 'Best' value
- The arithmetic mean
- The standard deviation
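The per-row statistics can be computed directly with the standard library. In this sketch 'best' is taken as the minimum, on the assumption that lower is better; that would need inverting for throughput-style scores:

```python
import statistics

def tabulate_row(compiler, test, variant, subtest, values):
    """Build one summary row with the columns listed above.  `values`
    is the list of measurements across runs of one combination."""
    return {"compiler": compiler,
            "test": test,
            "variant": variant,
            "subtest": subtest,
            "runs": len(values),
            "span": max(values) - min(values),
            "best": min(values),                      # assumes lower is better
            "mean": statistics.mean(values),
            "stddev": statistics.stdev(values) if len(values) > 1 else 0.0}
```

The runs, span, and standard deviation columns are what let a reader judge validity at a glance: a wide span or large deviation relative to the mean flags a result that needs more runs.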
A graphing tool shall be developed that generates graphs suitable for both the web and for print. The graphing tool will take tabulated data and queries and produce the graphs and any other tabular information.
Note: michaelh expects these to be quite simple Python scripts that use matplotlib. The query language will be Python comprehensions.
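In that style, a query over tabulated rows is just a comprehension whose output goes straight to the plotting function; the rows below are made-up sample data:

```python
# Rows as produced by the tabulator (sample data, not real results).
rows = [
    {"compiler": "gcc-linaro", "test": "coremark", "variant": "O2", "mean": 10.2},
    {"compiler": "gcc-linaro", "test": "coremark", "variant": "O3", "mean": 9.8},
    {"compiler": "gcc-fsf",    "test": "coremark", "variant": "O2", "mean": 10.6},
]

# Query: all -O2 means, one (label, value) pair per compiler.
o2_means = [(r["compiler"], r["mean"]) for r in rows if r["variant"] == "O2"]
```

The resulting list of pairs is exactly the shape a matplotlib bar chart wants, so the graphing scripts stay small.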
All parsing, tabulation, and graphing shall be scriptable. By default all scripts should be recorded along with any reports to allow the reports to be reproduced.
Note: michaelh found that he made many short, single use scripts. The code is ugly but still worth recording.
PENDING: describe the reports.
WorkingGroups/ToolChain/Specs/ToolchainBenchmarks (last modified 2011-05-23 02:11:09)