IRQ based prediction for idle duration
In the different strategies to save power, one of them is triggered when the CPU has nothing more to do and enters a quiescent point called the "idle state". It is now common to have a CPU with different levels of idle states, where these levels are from the shallowest to the deepest state. The shallowest has a small energy saving but costs nothing (in terms of performance) to enter and exit, the deepest one has a huge energy saving but with a significant cost to enter this state and a latency to wake up from sleep to run.
At this point, the system has to take a decision regarding the target residency and the exit latency (the former is explained in the Compute the target residency document).
Setting aside the latency constraints, the system must guesstimate how long the cpu will be doing nothing in order to choose the idle state with the best trade-off performance/energy.
We show in this document how a radically changing the logic and reasoning behind the current approach can improve on the current predictions.
Current approach : idle time measurement
The current implementation is based on a modular approach of the next event prediction logic called governors which are plugged into a common framework called cpuidle.
Each time a cpu goes idle, the cpuidle framework computes how long the cpu has been sleeping and re-inject back this residency time into the governor which in turn does some statistics to predict the next idle time duration based on some magic numbers coming from experimentation on specific platforms. Please note, this has been mainly tested and optimized for the Intel platform that has firmware which does auto-promotion of the idle state. In other words, the firmware can override the governor decision and choose a different idle state.
Even if this approach satisfied everyone and no better one was proposed, it has a few weaknesses in the design:
- there is no way to determine what interrupt woke up the cpu, so it is hard to tell if the prediction was correct or if we were lucky.
- without knowledge of the wake up source, we do statistics on a mix of predictable and unpredictable events (eg. timer vs IPI).
- The IPI is mainly used by the scheduler to wake up a cpu in order to give it work to do. So the idle governor is trying to predict the behavior of the scheduler that effectively controls cpuidle - a vicious circle.
The idle time is a quiescent point where nothing happen until an interrupt occurs and wakes up the CPU. Staying focused on the idle time, we don't see the history of the running system, only a small portion of the state of the system is analyzed without knowledge of what happened before and what will happen after.
So instead of tracking the idle time and making some assumptions (based on experimentation), why don't we track the wake up sources' behavior ?
New approach : IRQ tracking
What will happen if we track each irq independently and predict for each interrupt the next event? It will allow us to deduce the next CPU wakeup simply by picking the earliest event.
We can sensibly do the hypothesis an interrupt will occur into a reasonable predictable interval, with a burst of interruptions, then followed by an inactivity.
Let's make it clear: it is impossible to analyze all interrupts for all devices on earth because they depend on the type of the hardware and the interrupt controller. So we focus here on a set of commonly used devices: SSD, serial and network. For this document, all measurements were done on a Dual Xeon 6 cores HT.
The following interruptions can be discarded:
- Timer interrupts: For timer interrupts the next event is known, so there is no point in trying to predict them.
- IPI: it is up to the scheduler to take the right decision regarding the idle state of a CPU
We use ftrace and look for the event "irq_handler_entry". Please note, between the occurrence of an interrupt and the event being logged in a trace there is a latency if the CPU is in idle, because it needs to exit its sleep state and that can take a significant amount of time if the CPU is in a deep idle state. For this reason, the measurements are taken with CPUIDLE disabled.
We will first give a simple workload for each of these interrupts described above and then we use a database server solicited by an external client monkey.
First let's take a simple example with a SSD and a program reading randomly in a big file. The test is run with the command:
trace-cmd record -eirq_handler_entry iotest
It results into a 2.2Mb file, for a 1.67s run and 13002 interrupts : trace.dat
Zooming into the kernelshark display shows the intervals of the interruption.
After extraction and computation of the intervals, we have a mean interval value of 128us and a standard deviation of 103us. We convert that to a probability density, in other words is how many times an interval was hit.
We see there are four peaks, one around 10us, one around 40us, the third one around 80us and finally a big one around 150us. These four peaks shows a distribution of the probability which clearly emphasis this interrupt will be hard to predict. But wait, the min and max interval is actually not so bad, we have around 180us and the mean value is 128us which is near the bigger peak, the fourth one and we are trying to predict the next interrupt in the context of cpuidle, so we don't need an extremely precise prediction as the error margin can be leverage by the different idle states target residencies (statistically speaking, continuous variables - the intervals - are transformed to discrete variables - target residencies, implicitly).
Having the density is interesting but it does not give an information about the behavior over time.
So we see an erratic interval behavior at the beginning of the acquisition and then it stabilizes to be in a smaller hysteresis.
Let's have a look at the network interruption.
Again four peaks, with incremental density. So by reducing the resolution of the density, we can roughly assume a probability of 25% for the interval, slightly weighted for the fourth peak.
The behaviour over a timeframe is the same as the SSD, that is a an erratic behavior then followed by a stable interval.
Ha ! This one is probably the most stable and it looks like it is following a normal law, so using the mean value is perfectly fine.
And we see the intervals are perfectly stable with some peaks every 5-6 intervals.
A rapid analysis suggests, we have a repeating pattern which is an interval near the average and then followed by a peak.
With this observation, we can build a simple algorithm as a first step and see how it behaves compared to the menu governor and how it succeed to predict the next wakeup. We consider a success on the prediction when the idle state was correctly chosen. It is impossible to predict the next event at the exact time due to the multiple sources of wake up, so if the CPU stayed idle long enough compared to the target residency of an idle state and was shorter than the next idle state residency then we successfully predict the next event in a reasonable interval.
- when an interrupt occurs, compute the interval between the last event and now
when the interval is X <usec> long, reset stats
- compute the average interval time
- compute the variance
- when the next event is requested
- for each interrupt look the next event which is the last timestamp + the mean interval value
- take the minimum of the above
WorkingGroups/PowerManagement/Doc/IrqNextPrediction (last modified 2016-01-08 10:09:18)