Summary

Improve cloud-buildd's performance beyond what can be achieved by just adding more cloud CPU power to it.

Rationale

Doing builds in the cloud is a great way to speed up Android platform builds compared to local development. Still, builds take long enough that we need quicker turnaround to allow developers to be more productive. We should explore other possibilities for speeding up builds besides just pumping more cloud power into the build system (we already use rather powerful, and expensive, EC2 instance types).

Design

Per the current performance stats, 75% of build time is spent in Android make itself, so if we want to improve build time, we should target compilation time. It will be hard to cut compilation time at the level of the Android make system, so we need to plug into the other stage: compilation itself. There are two well-known approaches to optimizing compile time:

  • Distributed compilation (e.g. distcc)
  • Cached compilation (e.g. ccache)

Distributed compilation is a good way to parallelize a build across a lot of cheap single-processor machines, but we already use 4-CPU xlarge EC2 instances and build with make -j16. It's unclear whether we could parallelize much beyond that. Even if we could, and instead of a 1-hour build on one instance we got a 0.5-hour build on two instances, we'd have to pay twice as much, since EC2's accounting unit is a whole instance-hour.

On the other hand, cached compilation, with a persistent cache shared across builds and across instances, seems like a good approach. The Android platform has a vast amount of source code, and relatively little of it changes between (similar) builds, so we would get a good cache hit ratio. Cached compilation also performs well regardless of compiler traits/options: for example, if we use aggressive, elaborate optimizations (and Linaro is all about that) which take a lot of compile time, we will still benefit greatly from the object cache (the relative benefit will be even higher).
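
As a rough sketch of what cached compilation could look like on a build slave: the Android build system already honours USE_CCACHE and CCACHE_DIR; the cache location, size limit and source path below are assumptions, not decisions.

    import os
    import subprocess

    # Sketch: enable ccache for an Android platform build on one slave.
    # The cache directory, size limit and source path are placeholders.
    env = dict(os.environ)
    env["USE_CCACHE"] = "1"            # ask the Android build to wrap the compiler with ccache
    env["CCACHE_DIR"] = "/mnt/ccache"  # persistent cache, to be shared across builds

    # Cap the cache size so the shared filesystem does not grow without bound.
    subprocess.check_call(["ccache", "-M", "20G"], env=env)

    # Run the usual parallel build; unchanged sources are served from the cache.
    subprocess.check_call(["make", "-j16"], cwd="/srv/android", env=env)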

While designing the cache structure, keep security issues in mind, in particular the possibility of cache "poisoning" (either intentional or unintentional). It may be a good idea to make sure that any cache change done by a specific build system user can affect only that user. Of course, that may lead to the need to maintain per-user caches, which may increase storage needs considerably. See also the Appendix for additional info.
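
If per-user caches do turn out to be needed, one possible (purely illustrative) layout is to derive CCACHE_DIR from the build system user, so one user's cache entries can never be served to, or overwritten by, another user:

    import os

    # Illustrative layout: one cache subdirectory per build system user, so a
    # poisoned (or simply broken) cache can only affect that user's own builds.
    CACHE_ROOT = "/mnt/ccache"   # assumed shared filesystem mount point

    def ccache_dir_for(user):
        path = os.path.join(CACHE_ROOT, user)
        os.makedirs(path, exist_ok=True)
        return path

    # Builds submitted by "alice" and "bob" never share cache entries.
    os.environ["CCACHE_DIR"] = ccache_dir_for("alice")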

User stories

User story for cache reset:

As a build engineer, I expect a complex system to start behaving erratically at any time, and to be too complicated to diagnose on the spot within the given time frame. In anticipation of such a situation, I want to have a "big red button" which resets the system, so that the system continues to function from a known good state.

Implementation

Use ccache. Keep its cache on a separate filesystem and make this filesystem available to the slaves. EBS won't do here, as it is a raw block device. It should be either NFS, or, as Loic suggests, rsync'ing a master copy to the slave before the build and rsync'ing the changes back at the end. There can be concurrent builds, but they should produce matching results on different slaves (should they?). Provide a point-by-point comparison of the two (or more) approaches and, if needed, practical tests (like benchmarking). When selecting between the approaches, use the following criteria:

  • Ease of implementation/viability.
  • Overhead (for example, NFS is a natural solution, but it means another instance is needed to run the NFS server all the time).
  • Reliability (rsync looks suspicious).
  • Ease of maintenance (for example, a read-only master rsync cache would be the easiest, but the hit ratio would deteriorate over time and the cache would need to be "re-seeded" at some point).
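
A minimal sketch of the rsync variant, assuming a hypothetical cache-master host holds the authoritative cache copy; all hostnames and paths are placeholders, and the push step is exactly where concurrent builds on different slaves could race:

    import os
    import subprocess

    MASTER = "cache-master:/srv/ccache/"   # assumed authoritative cache copy
    LOCAL = "/mnt/ccache/"                 # per-slave working copy

    def pull_cache():
        # Seed the slave's cache from the master copy before the build.
        subprocess.check_call(["rsync", "-a", "--delete", MASTER, LOCAL])

    def push_cache():
        # Merge newly cached objects back to the master afterwards; concurrent
        # pushes from different slaves are the part that needs investigation.
        subprocess.check_call(["rsync", "-a", LOCAL, MASTER])

    pull_cache()
    env = dict(os.environ, USE_CCACHE="1", CCACHE_DIR=LOCAL)
    subprocess.check_call(["make", "-j16"], cwd="/srv/android", env=env)
    push_cache()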

Provide a means to reset the cache, as a way to return the system to a known good state in case of contingency.
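
A sketch of the "big red button" from the user story above; the cache location is an assumption, and ccache's -C option clears the entire cache (with -z zeroing the statistics), which brings the system back to a known good, if cold, state:

    import os
    import subprocess

    # Reset script sketch: wipe the shared cache so subsequent builds start
    # from a known good (empty) state.  The CCACHE_DIR value is a placeholder.
    env = dict(os.environ, CCACHE_DIR="/mnt/ccache")
    subprocess.check_call(["ccache", "-C"], env=env)   # clear the entire cache
    subprocess.check_call(["ccache", "-z"], env=env)   # zero the hit/miss statistics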

Test/Demo Plan

Run a ccache build and a non-ccache build in parallel and validate that the results are the same.
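
One possible way to validate that "the results are the same" is to compare checksums of the artifacts produced by the two builds; the output directories below are assumptions:

    import hashlib
    import os
    import sys

    def checksum_tree(root):
        # Return {relative path: sha1 hex digest} for every file under root.
        sums = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    sums[os.path.relpath(path, root)] = hashlib.sha1(f.read()).hexdigest()
        return sums

    # Hypothetical output directories of the two parallel builds.
    plain = checksum_tree("out-plain/target/product")
    cached = checksum_tree("out-ccache/target/product")

    if plain == cached:
        print("PASS: ccache build matches non-ccache build")
    else:
        differing = sorted(p for p in plain if cached.get(p) != plain[p])
        print("FAIL: %d artifacts differ, e.g. %s" % (len(differing), differing[:5]))
        sys.exit(1)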

Appendices

Cache poisoning issues:

On Wed, 08 Jun 2011 16:40:06 -0400
James Westby wrote:

> As we discussed on IRC, I do share concerns of separation, it's just
> "user" is a bit ambiguous term here (it likely maps not to a build
> system user, but to a build system config). I for sure will study what
> keys ccache uses to store object files in its cache-map, and if we'll
> need additional "meta-keys" to separate cache(s) for different builds.  

Right, I'm assuming that ccache is sensible and uses content-addressed
storage, so that two builds won't collide unless they are building the
same source with the same compiler, options etc. I can't see it being a
popular tool for many use cases if it will misbuild or only allow one
concurrent build.

While those things are good to worry about (and so should be documented
as things to investigate) they are a different class of problem to
different people using the same ccache.

If we allow everyone to write to everyone else's ccache then there is an
easy cache poisoning attack.

If we are required to separate build configs or similar then we can do
that without strong separation, we can simply use different prefixes or
something, as being able to cache poison yourself is pretty boring.

We do have to have strong separation between users as we can't trust
them to follow the convention that prevents poisoning someone else's cache.

That's why I brought up the issue and think that it needs
consideration. It's the difference between having a robust system that
can handle concurrent builds of different configurations and a system
that protects against malicious users. They are both important
properties and may require different solutions.
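
To make the "content-addressed storage" assumption above concrete, here is a toy illustration (not ccache's actual key derivation) of a cache key that covers the preprocessed source, the compiler identity and its options, so that different configurations never collide:

    import hashlib

    def cache_key(preprocessed_source, compiler_id, options):
        # Toy illustration only: hash everything that affects the object file,
        # so different compilers/options land in different cache slots.
        h = hashlib.sha1()
        h.update(preprocessed_source.encode())
        h.update(compiler_id.encode())          # e.g. compiler path plus version string
        h.update(" ".join(options).encode())
        return h.hexdigest()

    # Same source, same compiler, different options: different keys, no collision.
    k1 = cache_key("int main(){return 0;}", "gcc-4.5", ["-O2"])
    k2 = cache_key("int main(){return 0;}", "gcc-4.5", ["-O3", "-mcpu=cortex-a9"])
    assert k1 != k2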


CategorySpec CategoryTemplate
