British company ARM Holdings designs and licenses technology used in the manufacture of processors that power a vast array of mobile devices. Putting the numbers into perspective, ARM's partners have shipped over 30bn processors, with the expectation that a further 10bn will be sold next year. ARM has various Cortex A-series CPUs that are designed for a wide range of devices; some focus on providing maximum performance in a mobile or enterprise power envelope, while others tend to run with energy efficiency and cost as the leading attributes.
With battery life a key concern for mobile devices, ARM understands that a single processor architecture isn't ideally suited to powering entry-level, mid-range and high-end smartphones and tablets, as each has a different set of power and efficiency requirements. Exploring the high-end market further, characterised by smartphones such as the Samsung Galaxy S4, there is little need for a powerful CPU when running low-intensity tasks such as emailing or MP3 playback - a smaller, more efficient CPU is able to complete the same task just as proficiently while sipping considerably less power. Conversely, a powerful CPU is the best candidate for the heavy-duty processing required for gaming or multi-page web browsing, where using a lower-power CPU could lead to a substandard experience.
The right CPU for the task at hand
ARM firmly believes that a processor should do the work it is designed for, rather than following a one-size-fits-all philosophy, and has technology for this specific purpose. Called big.LITTLE processing and already available in select versions of the Samsung Galaxy S4, it works by combining architecturally-identical big, powerful processors with little, energy-efficient ones. The premise is disarmingly simple: the use of the correct processor for a given application results in lower overall energy consumption and longer battery life.
The current big.LITTLE implementation pairs the powerful Cortex-A15 CPU with an energy-efficient Cortex-A7. The CPUs are tied together via two ARM-provided components: the CoreLink CCI-400 cache coherent interconnect and the GIC-400 interrupt controller. Presently, each CPU can have up to four processing cores whose performance is further boosted by having access to on-chip L2 cache. The combination of a CPU's cores and caches is called a cluster. Pairing a four-core Cortex-A15 cluster with a four-core Cortex-A7 cluster yields a total of eight cores, and this is why the select Samsung S4 smartphones are often referred to as octa-core devices.
Drilling down, data is passed between the Cortex cores via the Cache Coherent Interface, whose job is to make sure that the L1 and L2 caches from both clusters are kept consistent with one another. This means that when a CPU accesses memory it first looks in its own caches, and if it misses in the local cluster it will check with the other CPU cluster before reading the value from main memory. This ensures the most recent copy of the data is always read, even if it is held in the other cluster's caches. The Generic Interrupt Controller's job is to migrate interrupts between any of the cores in a big.LITTLE configuration.
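The lookup order described above can be sketched in a few lines of Python. This is only an illustrative model, not how the CCI-400 is implemented in hardware: the function name `coherent_read` and the use of dictionaries to stand in for each cluster's caches and for main memory are assumptions made for the example.

```python
def coherent_read(address, local_cache, remote_cache, main_memory):
    """Return (value, source) using the order: local cluster, other cluster, memory.

    All three stores are plain dicts here, a hypothetical stand-in for
    the real L1/L2 caches and DRAM.
    """
    if address in local_cache:
        # Hit in the requesting cluster's own caches.
        return local_cache[address], "local cluster"
    if address in remote_cache:
        # Miss locally, so check the other cluster before going to memory.
        value = remote_cache[address]
        local_cache[address] = value  # keep a local copy for next time
        return value, "other cluster"
    # Missed in both clusters: fall back to main memory.
    value = main_memory[address]
    local_cache[address] = value
    return value, "main memory"
```

In this toy model, if the Cortex-A15 cluster holds a newer value for an address than main memory does, a read from the Cortex-A7 side returns the Cortex-A15's copy, which is the behaviour the interconnect is there to guarantee.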
How big.LITTLE works
Running a particular application on the most appropriate CPU/cluster requires that a device's operating system and its power-management algorithms, known as dynamic voltage and frequency scaling (DVFS), monitor and allocate workload correctly. The operating system is in charge of deciding which cluster and CPUs are active, depending upon load, and ARM has fine-tuned the DVFS profiles to ensure that, in our example, migration to the Cortex-A15 occurs once the workload requires more processing power than the Cortex-A7 can provide when operating at its highest frequency. With big.LITTLE's initial software implementation, called Cluster Migration, only one processor cluster is active at any one time; the other cluster - be it Cortex-A15 or Cortex-A7 - is powered down. The primary focus is on reducing overall power consumption when compared to a single performance-orientated CPU.
In Cluster Migration, the operating system sees cluster blocks instead of individual cores in the Cortex-A15 and Cortex-A7 big.LITTLE configuration. Cluster-level switching can be inefficient if, for example, there is high load on a single core within the Cortex-A7 cluster, yet the other cores within the cluster are running at low load. In this case, even though the Cortex-A7 has the multi-core ability to handle the task, the entire cluster may be switched off and its workload migrated to the power-hungry, faster Cortex-A15.
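A rough sketch of that all-or-nothing decision is shown below. The policy is an assumption made for illustration: load is expressed in hypothetical units where 1.0 is the most a single Cortex-A7 core can deliver at its highest frequency, and the whole cluster switches as soon as any one core exceeds it. Real DVFS governors are considerably more nuanced.

```python
def choose_cluster(core_loads, little_capacity=1.0):
    """Pick the single active cluster for Cluster Migration.

    core_loads: demand per core, in hypothetical units where 1.0 equals
    one Cortex-A7 core running at its highest frequency.
    """
    # One overloaded core is enough to move every core to the big cluster.
    if max(core_loads) > little_capacity:
        return "Cortex-A15"
    return "Cortex-A7"
```

With loads of `[1.4, 0.1, 0.1, 0.1]`, this model moves all four cores to the Cortex-A15 cluster even though three of them are nearly idle, which is exactly the inefficiency described above.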
A better form of big.LITTLE migration is called CPU Migration or In-Kernel Switcher (IKS). It is available as part of Linaro's open-source software for ARM SoCs. Here, following on from the example above, single Cortex-A7 and Cortex-A15 cores are coupled together to form a virtual CPU. Only one core is active in each virtual CPU at a time, and the operating system's kernel is oblivious to the asymmetric nature of the Cortex-A7 and Cortex-A15. Moving from one core to the other is controlled by what is known as the cpufreq driver, which is responsible for telling the OS kernel the required voltage and frequency.
It takes around 30 microseconds for the big.LITTLE software to move the workload from one of the paired cores to the other, based on load requirements, but the trade-off between lost time and improved energy efficiency is worth it. This paired approach of CPU Migration, or In-Kernel Switching, requires that the ARM Cortex processors have an equal number of cores. It also means that, with two four-core processors, a maximum of four cores can run at any one time.
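The pairing can be modelled as one decision per virtual CPU rather than one per cluster. As before, this is a simplified sketch under stated assumptions: the function name, the core labels and the load units (1.0 being a Cortex-A7 core at full frequency) are all hypothetical, and the real switch is driven by the cpufreq driver rather than a simple threshold.

```python
def iks_active_cores(virtual_cpu_loads, little_capacity=1.0):
    """Return the physical core that is active inside each virtual CPU.

    Each virtual CPU pairs one Cortex-A7 core with one Cortex-A15 core,
    and exactly one of the two is powered at any time.
    """
    active = []
    for index, load in enumerate(virtual_cpu_loads):
        # Only this pair switches; the other virtual CPUs are unaffected.
        core = "A15" if load > little_capacity else "A7"
        active.append(f"{core}-core{index}")
    return active
```

Given loads of `[1.4, 0.1, 0.1, 0.1]`, only the first virtual CPU switches to its Cortex-A15 core while the other three stay on their Cortex-A7 cores, and the number of active cores never exceeds the number of pairs.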
The optimum method of flitting between two big.LITTLE processors is called Global Task Scheduling (GTS) and is a type of Heterogeneous Multi-Processing (HMP). Here, all cores can be active at any time, and any combination of cores can be used simultaneously. The operating-system scheduler can then dynamically allocate workload to any core in either processor and, as such, device manufacturers can use non-core-matching configurations - two-core Cortex-A15s paired with four-core Cortex-A7s, for example. The operating system automatically turns off unused processors to conserve power.
Using the scheduler instead of the cpufreq driver provides a performance boost of up to 10 per cent in many benchmarks, according to ARM, while there's also the possibility of further reduced power - a hugely important factor in mobile devices - by reacting more quickly to load changes and allocating work in a more fine-grained manner.
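To make the contrast with the paired approach concrete, here is a toy model of Global Task Scheduling on a non-matching two-plus-four configuration. The greedy placement policy, the task names and the load units (1.0 being one Cortex-A7 core at full frequency) are assumptions for illustration only; the real scheduler tracks load over time and is far more sophisticated.

```python
def gts_schedule(task_loads, big_cores=2, little_cores=4, little_capacity=1.0):
    """Place each task on its own core and report which cores stay idle.

    task_loads: dict of task name -> demand (hypothetical units).
    Returns (placement, idle_cores).
    """
    big = [f"A15-{i}" for i in range(big_cores)]
    little = [f"A7-{i}" for i in range(little_cores)]
    placement = {}
    # Place the heaviest tasks first so they get first pick of the big cores.
    for task, load in sorted(task_loads.items(), key=lambda kv: kv[1], reverse=True):
        if load > little_capacity:
            preferred, fallback = big, little
        else:
            preferred, fallback = little, big
        pool = preferred if preferred else fallback
        if not pool:
            raise ValueError("more tasks than cores in this simplified model")
        placement[task] = pool.pop(0)
    # Whatever was not handed out can be powered down.
    return placement, big + little
```

Scheduling a game at 1.8, MP3 playback at 0.2 and email at 0.1 puts the game on a Cortex-A15 core and the two light tasks on Cortex-A7 cores, leaving one big core and two little cores free to be switched off.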
Though the technology is already available from ARM, device manufacturers and silicon vendors need to work together to harness the benefits of big.LITTLE by tuning the board support package with SoC-specific parameters. Samsung has already announced that its Exynos 5420 SoC, based on Cortex-A15 and Cortex-A7 big.LITTLE cores, should have HMP compatibility by the end of the year.
How much power does big.LITTLE save?
Compared to running a device with a Cortex-A15 cluster alone, a big.LITTLE Cortex-A15/Cortex-A7 configuration can deliver the same performance but reduce overall power consumption by up to 70 per cent in low-load situations and up to 50 per cent when there's moderate load, defined as web browsing with MP3s playing in the background. Performance remains excellent because, without the user knowing it, the device intelligently switches to the Cortex-A15 cluster as and when needed.
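As a quick illustration of what those percentages mean, the snippet below applies the quoted best-case savings to a baseline figure. The 1,000mW baseline is purely hypothetical, chosen to keep the arithmetic easy to follow, and is not a measurement of any real device.

```python
def cpu_power_with_big_little(baseline_mw, saving):
    """Power draw after a fractional saving (0.70 means 70 per cent)."""
    if not 0.0 <= saving <= 1.0:
        raise ValueError("saving must be between 0 and 1")
    return baseline_mw * (1.0 - saving)


BASELINE_MW = 1000.0  # hypothetical Cortex-A15-only draw, not a measured value

low_load_mw = cpu_power_with_big_little(BASELINE_MW, 0.70)       # best case, low load
moderate_load_mw = cpu_power_with_big_little(BASELINE_MW, 0.50)  # best case, moderate load
```

On that assumed baseline, the best-case figures work out to roughly 300mW at low load and 500mW at moderate load. Since ARM quotes both savings as "up to" values, real-world results would sit somewhere between these numbers and the baseline.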
Note that Global Task Scheduling offers the very best balance between performance and energy efficiency, as the seamless scheduler-led core switching enables peak performance via the Cortex-A15 and, as needed, very low power usage when using the Cortex-A7.
Some further use cases illustrate similarly high levels of power savings, all while delivering the same or greater performance.
big.LITTLE in action
Samsung is the standard-bearer for ARM's big.LITTLE technology. The South Korean giant implements it in the Exynos 5410 and Exynos 5420 SoCs powering the Galaxy S4 and Note 3 smartphones, respectively, and currently uses Cluster Migration to shift between the Cortex-A15 and Cortex-A7 cores. Future versions, one would hope, will be equipped with the preferred Global Task Scheduling method of migrating between individual cores.
But Samsung isn't the only manufacturer putting its weight behind big.LITTLE. SoC designers HiSilicon, Allwinner and Renesas will also be debuting big.LITTLE designs soon, and key ARM partner MediaTek demonstrated its implementation of big.LITTLE technology at the TechCon conference in October 2013.
ARM's big.LITTLE is based on the premise that a single processor isn't ideally suited to delivering both high performance and low power; it's best to run two processors, one designed for performance and the other for energy efficiency. Seamless switching between the two has the enviable benefit of improving battery life, and ARM and its partners are working hard to maximise the potential of big.LITTLE in future hardware.
If you find that your smartphone or tablet has considerably better battery life than your friend's, without sacrificing performance, it could well be down to the implementation of ARM's nascent big.LITTLE technology.