1. Requirements towards 11.05

This page lists potential Toolchain WG requirements that could go into 11.05. It attempts to start at a high level and then distill those down into possible work items.

The goal of the Toolchain WG is to produce 'the toolchain for ARM'. We'll know we've achieved that when the majority of complex, shipping, ARM-based devices use our toolchain in their development. To get there we will focus on performance, provide state-of-the-art developer tools, encourage others to consume the toolchain, validate all outputs, and bring in future technologies as they come along.

2. Performance

The goal is:

Improve the performance of typical code on current and near-term ARM Cortex-A architectures

For this round, performance is limited to speed and size, code is limited to C and C++ applications compiled by the GNU toolchain, and the architectures are specific implementations of the Cortex-A8 and Cortex-A9 cores.

Specifically excluded are:

  • Power-related performance, i.e. executing a workload using the least amount of energy
  • Other precompiled languages such as Fortran and Objective-C
  • JIT-based languages such as Java and C#
  • Interpreted languages such as JavaScript
  • Architectures past A9+1
  • Extensions such as 64-bit support
  • Making use of other platform components such as the GPU, any DSP, or video codec blocks
  • Other toolchains such as LLVM or RVCT

To achieve this we shall:

  1. Improve the time-based performance of C and C++ code on the Cortex-A8 by 5 %

  2. Improve the time-based performance of C and C++ code on the Cortex-A9 by 5 %

  3. Reduce the final code size of C and C++ code on Thumb2 devices by 5 %

Implicit in these are a defined workload, platform, and baseline. The workload and platform will come out of the validation work, while the baseline will be FSF GCC 4.5.1. All work shall be targeted at a single core implementation with the assumption that any single core improvements will scale to multicore.
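All three targets reduce to the same calculation against the baseline. The following is a minimal sketch, assuming a lower-is-better metric (run time in seconds or code size in bytes). The function names and the sample figures are illustrative only and are not measurements from any workload.

```python
def improvement_percent(baseline, candidate):
    """Percentage reduction of a lower-is-better metric (run time or code
    size) relative to the baseline; positive means the candidate is better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline * 100.0


def meets_target(baseline, candidate, target_percent=5.0):
    """True when the candidate improves on the baseline by at least the target."""
    return improvement_percent(baseline, candidate) >= target_percent


# Hypothetical figures, not real results.
print(meets_target(20.0, 18.8))        # run time in seconds
print(meets_target(100_000, 96_000))   # code size in bytes
```

Expressing speed and size with one helper keeps the pass/fail rule identical across the three targets; only the metric being measured changes.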

Approaches that could achieve this are:

Those that change GCC to better match the ARM architecture:

  1. Address re-association and mode selection, to make better use of the rich ARM addressing modes
  2. Tuning predictive commoning, which keeps values computed in one loop iteration available for reuse in later iterations
  3. Improving register spill/fill, that is, the register spilling and materialisation pass (epic)
  4. Pushing the SSA infrastructure into back-end processing (epic)

Those with a vectorisation focus:

  1. Expand the NEON implementation to exploit more of the existing GCC features
  2. Extend the GCC vectoriser to allow better use of the NEON instructions
  3. Tune the vectoriser for best use of NEON
  4. Add support for multiple hardware vector sizes
  5. Expand the ARMv6 implementation to exploit the existing ARMv6 core register based SIMD operations

Those that make use of the new link time optimisation phase:

  1. Exploit library link time optimisation
  2. Exploit application link time optimisation
  3. Exploit dynamic link time optimisation

Those that involve support libraries:

  1. Optimise memcpy(), strcpy(), and the other top five string routines for Thumb2
  2. Optimise memcpy(), strcpy(), and the other top five string routines for Cortex-A8 with NEON
  3. Optimise memcpy(), strcpy(), and the other top five string routines for Cortex-A9 with NEON
  4. Publish these routines so that they are usable by all implementers
  5. Publish these routines in the common C libraries such as EGLIBC and Newlib

3. Developer Tools

The goal is:

Provide good development tools to increase productivity and reduce the chance of missing the market.

For this cycle, the focus is limited to core and time-based performance tools. We will match the state of the art on other architectures by consolidating existing tools and adding others.

Specifically excluded are:

  • Tools for deeply embedded systems, such as on-chip debug or programming tools
  • Tools for untargeted languages or platforms, such as JIT profiling tools

Approaches that could achieve this are:

State of the art related:

  1. Investigate the state of the art in developer tools on x86 platforms

  2. Investigate Intel VTune
  3. Investigate any Linux-based tools added in other projects

Core tools:

  1. Complete Thumb2 support in GDB

  2. Verify and complete support for multithreaded applications in GDB

  3. Add hardware debug and watchpoint support to the kernel and GDB

Performance related tools:

  1. Implement OProfile support in the kernel and userspace

  2. Expand OProfile support to give better accuracy and resolution
  3. Expand OProfile to make best use of any kernel or hardware based tracking features
  4. Investigate profiler-driven feedback for GCC

  5. Validate Valgrind on ARM

Deep performance related tools:

  1. Verify support in OpenOCD for the set of supported implementations

  2. Make OpenOCD compile and run natively on ARM (see note 1)

  3. Add support for software-based instrumentation trace (PENDING: to what?)
  4. Add support for hardware-based instrumentation trace (PENDING: to what? How do we view it?)

4. Validation

The goal is:

Validate the performance and correctness of any outputs

The validation topic provides test platforms to run on, the infrastructure to run them, and validation of the results of other topics.

To achieve this we shall:

  1. Define implementations, profiles, and workloads to test on

  2. Implement automated builds, testing, and benchmarks

  3. Benchmark and provide analysis

Definition:

  1. Define a set of implementations to validate on, such as the PandaBoard, Versatile EX, and one other

  2. Define a set of profiles to validate against. These are expected to be based on the existing uses of ARM such as smartphones, tablets, and infotainment systems.
  3. Define a set of workloads based on the profiles. These will be used for benchmarking.

Infrastructure:

  1. Implement a system for automated builds of all of the Toolchain WG products
  2. Add automatic benchmark support to the build system
  3. Implement readily available reports on the build status and performance progress

Benchmarking:

  1. Decide on, source, and import existing, closed benchmarks
  2. Implement a new, sharable benchmark based on existing applications and the defined workloads
  3. Automate the benchmark process
  4. Provide results that are statistically valid and can be reproduced by others
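One way to approach statistically valid results is to repeat each benchmark several times and report a mean together with a confidence interval. The sketch below assumes a normal approximation (z = 1.96 for roughly 95 %), which is an assumption of this example and not a method chosen by the working group. The function names and run times are hypothetical.

```python
import math
import statistics


def mean_with_interval(samples, z=1.96):
    """Return (mean, half_width) for a list of repeated measurements.

    Uses a normal approximation (z = 1.96 for roughly 95 %); with only a
    handful of runs a Student's t quantile would give a wider interval.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


def clearly_faster(baseline_runs, candidate_runs):
    """Conservative check: the candidate's interval lies wholly below the baseline's."""
    base_mean, base_hw = mean_with_interval(baseline_runs)
    cand_mean, cand_hw = mean_with_interval(candidate_runs)
    return cand_mean + cand_hw < base_mean - base_hw


# Hypothetical run times in seconds, not real measurements.
baseline = [10.2, 10.0, 10.1, 10.3, 9.9]
candidate = [9.5, 9.6, 9.4, 9.5, 9.7]
print(mean_with_interval(baseline))
print(clearly_faster(baseline, candidate))
```

Requiring non-overlapping intervals is deliberately strict: it may miss small real gains, but it avoids reporting run-to-run noise as an improvement. Publishing the raw samples alongside the summary lets others reproduce the analysis.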

Competitive analysis:

  1. Provide a limited comparison of the Intel Atom and Cortex-A9 using the above workloads

5. Consumption

The goal is:

Be used for or available in the majority of ARM-related projects and products

To achieve this we shall:

  1. Make any improvements available in a timely manner
  2. Be as correct as the upstream toolchain
  3. As they are found, investigate and correct bugs in a timely manner

In general we will:

  1. Maintain Linaro-specific branches of any upstream release
  2. Backport Linaro changes into these branches as soon as practical
  3. Make Linaro-specific branches available soon after any new upstream release

For GCC in this cycle we will:

  1. Maintain a GCC 4.5 based branch with Linaro backports

  2. Track the GCC 4.6 release in an experimental branch
  3. Release a GCC 4.6 based branch shortly after upstream release

  4. Make Linaro GCC 4.5 available natively in Ubuntu N

  5. Make Linaro GCC 4.5 available as a cross-compiler in Ubuntu N

  6. Make Linaro GCC 4.6 available as a preview release in Ubuntu N+1

  7. Create Debian/Ubuntu packages with each release for easy tracking

5.1. Future

The approach so far emphasises consolidation and existing technologies. Future gains will probably come from new tools and technologies and, as such, we should track new technologies now.

Topics include:

Investigate runtime specialisation:

  1. of string routines
  2. of core 2D graphic routines
  3. of video and audio decode routines

Investigate optimising for power consumption. Focus on long running workloads such as video playback, audio playback, and web browsing.

Investigate the current state of LLVM and support for ARM

Investigate use of non-core processing units such as DSPs and GPUs

Investigate using OpenCL or similar to make use of non-core processing units

6. Missing

The following items appear on other lists but are not currently included in the Toolchain WG requirements.

Non-toolchain requirements:

  • User space tool to display CPU configuration (including CP15, GIC, L1/L2 cache controller, MMU, etc.), configure the PMU, and export detailed statistics. May need some kernel work.
  • User space tool allowing dynamic configuration of the ARM PMU (Performance Monitoring Unit) and dynamic export of statistics such as cache miss rates, CPU stall cycles, etc.
  • Tool to investigate the throughput of various drivers (MMC, USB, etc.). Does one already exist? May need to write one.
  • Validation framework for performance events (supports T11). Will generate bug fixes, enhancement requests, etc. Should develop simple sanity checks and simple tests which make sure that some counters advance. Possibly modify gprof to include profile-feedback-directed optimizations.
  • Validation framework for debug and instrumentation. Will generate bug fixes, missing features, enhancements, etc. against GDB, performance events, etc.
Note 1: Not required or useful. Good for completeness.

Cycles/1105/TechnicalTopics/ToolchainCommentary (last modified 2011-03-25 18:15:54)