Notes on PGO+LTO benchmarking

These measurements were made as part of STMicroelectronics projects.

On Webkit/x86:

ref O2 (gcc-4.6.1)    LTO        PGO        LTO+PGO
layout                +0.21%     +8.48%     +10.82%
javascript            -2.75%     +2.89%     +1.62%
image                 +2.85%     +5.12%     +7.89%
rendering             -0.44%     +5.85%     +5.55%
sites                 +1.7%      +8.6%      +11.02%
total                 -0.11%     +5.88%     +6.85%
size 19054037         -24%       -27.4%     -37.5%
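To make the size row easier to read, the relative changes can be converted back to absolute sizes. The following is a small sketch, assuming that 19054037 is the size of the reference -O2 build in bytes (the unit is not stated in these notes); the function name `absolute_size` is only illustrative. Because the percentages in the table are rounded, the results are approximate.

```python
# Assumption: the figure in the "size" row is the reference build size in bytes.
REF_SIZE = 19054037  # ref O2 (gcc-4.6.1)

# Relative size changes taken from the table, in percent.
SIZE_DELTAS = {"LTO": -24.0, "PGO": -27.4, "LTO+PGO": -37.5}


def absolute_size(ref, delta_percent):
    """Apply a relative change (in percent) to a reference size."""
    return round(ref * (1 + delta_percent / 100))


for config, delta in SIZE_DELTAS.items():
    # Approximate, since the table percentages are themselves rounded.
    print(f"{config:8s} {absolute_size(REF_SIZE, delta):>10d}")
```

Under that assumption, the LTO+PGO build is roughly 7 MB smaller than the reference.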

screenshot-LTO-x86.png

On Webkit/ARM

speedups      LTO        LTO+PGO
layout        6.2%       12.4%
javascript    13.9%      18%
image         4.7%       6.7%
rendering     4.6%       7.2%
sites         9.3%       11.7%
total         7.8%       11.1%
size          +22.7%     +4.7%

In Thumb mode

Reference: gcc-4.4.3 -Os

Versus: gcc-4.6.2 -O2 + LTO/PGO

All of webcore + libv8 + libxml2

Compilation time +60%

Memory usage: x4

screenshot-LTO-ARM.png

LTO: most gains come from interprocedural inlining and function specialization/cloning

PGO: hot/cold function partitioning works well; intra-function block layout and block frequencies are improved. It also improves the inlining heuristics.

LIPO (Google) shows even better performance (14%), but with a code size increase (15%). Cross-module inlining is performed on selected groups of compilation units.
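For reference, a generic two-pass PGO build combined with LTO can be sketched as below. This is not the build setup used for the measurements above, which is not described in these notes. The flags `-flto`, `-fprofile-generate` and `-fprofile-use` are standard GCC options; the source file names, the output name `app` and the training argument `--bench` are hypothetical. The script only prints the commands and does not run the compiler.

```python
import shlex


def pgo_lto_commands(sources, output="app", opt="-O2", train_args=()):
    """Return the command lines for a two-pass PGO build with LTO.

    The same output name is used in both passes so that the profile
    data written by the training run is found again in pass 2.
    """
    srcs = list(sources)
    binary = f"./{output}"
    return [
        # Pass 1: instrumented build; running it writes .gcda profile files.
        ["gcc", opt, "-flto", "-fprofile-generate", *srcs, "-o", output],
        # Training run on a workload assumed to be representative.
        [binary, *train_args],
        # Pass 2: rebuild, letting the optimizers use the collected profile.
        ["gcc", opt, "-flto", "-fprofile-use", *srcs, "-o", output],
    ]


# Hypothetical project: two C files and a benchmark mode for training.
for cmd in pgo_lto_commands(["main.c", "util.c"], train_args=["--bench"]):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The quality of the PGO result depends on how representative the training run is, so in practice the training step would use the same kind of workload as the benchmark.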

With automatic optimization space exploration techniques (vs -O2):

           speedup    size
libjpeg    19.2%      +4.2%
zlib       3.5%       +15.4%
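These notes do not say which exploration technique was used. As a rough illustration of the idea only, the sketch below tries every subset of a small set of candidate flags on top of -O2 and keeps the cheapest configuration. The function names `explore` and `fake_measure` and the cost numbers are hypothetical; a real setup would replace `fake_measure` with a build-and-benchmark step, and would need a smarter search than exhaustive enumeration once the flag set grows.

```python
import itertools


def explore(flags, measure, base=("-O2",)):
    """Exhaustively try every subset of `flags` on top of `base`.

    `measure` maps a tuple of flags to a cost (lower is better),
    for example the run time of a benchmark built with those flags.
    """
    best_cfg = tuple(base)
    best_cost = measure(best_cfg)
    for r in range(1, len(flags) + 1):
        for subset in itertools.combinations(flags, r):
            cfg = tuple(base) + subset
            cost = measure(cfg)
            if cost < best_cost:
                best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def fake_measure(cfg):
    """Hypothetical cost model standing in for a real measurement."""
    cost = 100.0
    if "-funroll-loops" in cfg:
        cost -= 5.0
    if "-ftree-vectorize" in cfg:
        cost -= 3.0
    if "-fno-inline" in cfg:
        cost += 10.0
    return cost


candidates = ["-funroll-loops", "-ftree-vectorize", "-fno-inline"]
best, cost = explore(candidates, fake_measure)
print(best, cost)
```

The same loop could score code size instead of run time, which is how a speedup/size trade-off like the one in the table could be examined.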

Optimizing for size

Reference GCC-4.4.3 -Os

speedups      4.6.2      4.6.2 PGO    4.6.2 PGO+LTO    4.6.2 LTO
layout        -1.61%     -0.6%        -0.25%           +0.66%
javascript    -4.8%      -3.05%       -1.05%           -0.49%
image         -0.99%     -0.92%       -0.59%           -0.33%
rendering     -0.96%     -0.58%       -1.19%           -0.84%
sites         +0.2%      +0.72%       +1.73%           +1.64%
total         -1.54%     -0.75%       -0.04%           +0.38%
size          -1.85%     -1.92%       5.2%             -4.94%

ChristopheLyon/Sandbox/benchPGO-LTO (last modified 2012-08-14 16:31:51)