What will YOU do with 100 cores?

NICTA's CTO outlines why you need to learn multi-threaded programming

The computing industry is approaching an inflexion point. Consider the following statement and see whether it could apply to your business.

"Any software business or embedded computing provider that derives competitive advantage from the execution speed of a single thread of code is at risk."

Why is this true and what does it mean?

Most software is written in a sequential programming language such as C, and most secondary and tertiary computing courses teach this style of programming. That skill set is rapidly becoming out of date, and many of the trillions of lines of software developed in the past may need to be re-written or risk becoming obsolete.

Multi-core processors are about to change everything. Processor architecture has shifted dramatically over the past few years from superscalar to multi-core. A multi-core machine is essentially a parallel processing architecture known as Multiple Instruction Multiple Data (MIMD). This architectural trend started with the Intel Core Duo and has continued to Opteron, PowerXCell, UltraSparc, Cortex, SuperH and so on. All of these processor platforms are shipping multi-core chips, and the trend will continue over the next decade. At the IEEE ISSCC conference in February 2010, Intel announced a 48-core processor chip. By 2015 we will likely have over 100 cores on a “many-core” processing chip in our notebook computers.

What will you do with 100 cores? Not much is my guess. If you open the Performance tab in the Task Manager of your Core i7-based computer, you may notice that most of the 8 CPU threads are underutilised. This is because most software written for a PC executes as a single sequential thread on one of the processors. In a 100-core system, much of your software will have access to less than 1 per cent of the available computing power of the machine.

Yet as customer problem sets continue to grow, the computation needed to process them will also grow, and execution times will get longer. Even though a 100-core computer will have 25 times the processing capability (presumably relative to today's four-core chips), a single-threaded program will gain nothing from the extra cores and will in effect run slower as its workload grows, unless it is re-written.

A large amount of software will need to be re-written for multi-threaded execution to exploit multi-core systems. This is a difficult undertaking because there is a shortage of people who know how to do it. At the Intel Developers Conference in 2009, Intel estimated that fewer than 2 per cent of the world’s programmers understand multi-threaded programming, and most of those are developing computer games.
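To give a sense of what such a rewrite involves, here is a minimal sketch in C using POSIX threads. It is an illustration only, not taken from Intel or from any particular product: the names `sum_sequential`, `sum_parallel`, `struct chunk` and `MAX_THREADS` are hypothetical, and summing an array is about the easiest case there is because the slices do not depend on one another.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Sequential version: one thread walks the whole array. */
long sum_sequential(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Work description for one thread (hypothetical helper type). */
struct chunk {
    const int *data;    /* first element of this thread's slice */
    size_t     count;   /* number of elements in the slice */
    long       partial; /* result written by the worker */
};

/* Thread entry point: sum one slice into its own result slot. */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    long total = 0;
    for (size_t i = 0; i < c->count; i++)
        total += c->data[i];
    c->partial = total; /* each worker writes only its own slot, so no lock */
    return NULL;
}

#define MAX_THREADS 64 /* illustrative upper bound, not a platform limit */

/* Multi-threaded version: split the array into nthreads slices,
 * sum each slice in its own thread, then combine the partial sums. */
long sum_parallel(const int *data, size_t n, int nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    pthread_t    tids[MAX_THREADS];
    struct chunk chunks[MAX_THREADS];
    int          started[MAX_THREADS];
    size_t base = n / (size_t)nthreads;
    size_t extra = n % (size_t)nthreads;
    size_t offset = 0;

    for (int t = 0; t < nthreads; t++) {
        chunks[t].data = data + offset;
        chunks[t].count = base + ((size_t)t < extra ? 1 : 0);
        chunks[t].partial = 0;
        offset += chunks[t].count;
        started[t] =
            pthread_create(&tids[t], NULL, sum_chunk, &chunks[t]) == 0;
        if (!started[t])
            sum_chunk(&chunks[t]); /* fall back: do this slice ourselves */
    }

    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tids[t], NULL); /* wait before reading the result */
        total += chunks[t].partial;
    }
    return total;
}
```

Even in this simple case the three-line loop grows into code that partitions the data, manages thread lifetimes and merges results. On many toolchains it also needs to be built with a flag such as `-pthread`. Real applications, where threads share and modify state, are considerably harder.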

Multi-threaded programming remains a graduate-level course in many degree programs, and there are few tools available to assist with the re-architecture of legacy software. It follows that even organisations with the most adaptive of strategies may not be able to find or train the talent needed to re-architect their software in time. The CAD software industry is one example of an industry that could be redefined by the proliferation of multi-core computing.

Why we are where we are

The transition to multi-core has occurred due to the escalating power consumption of advanced processors hitting a ceiling of 130 Watts.

The chart below shows the power consumption of Intel processors since 1970. With the Itanium, Intel processors peaked at 130 Watts. The x86-64 compatible processor architectures have been optimised over many years to extract the maximum performance from a single thread of sequential code (using techniques such as multiple on-chip phase-locked loops, Reduced Instruction Set Computing (RISC) architectures, fine-grained pipelining, branch prediction and speculative execution).

When processors hit the 130W power ceiling, the only way to get more performance (and continue to deliver to Moore’s Law) is to place several cores on the chip. Furthermore, the cores have hit a peak clock rate of 4GHz (called the Clock Wall).

This is partly because large synchronous digital chips in 32nm CMOS are not much faster than those in 45nm and 65nm. One reason is that the clock requires increased overhead in deep sub-micron technology nodes.

Typically, about 10 per cent of the clock period is overhead to cover variance of device parameters and operating conditions. In technology nodes below 45nm, the increased variance of device parameters has required an increase in the amount of overhead added to the clock. So while typical device performance may improve, the overall clock period does not. Therefore, the performance available from any single core has essentially peaked – and herein is the problem for much of today’s software.
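The arithmetic behind this can be sketched in a few lines of C. This is a simplified model, assuming the overhead is a fixed fraction of the total clock period, so the period is the usable logic delay divided by one minus that fraction. The function names are hypothetical, and the figures in the note that follows are made-up illustrations, not measurements of any real process.

```c
#include <assert.h>

/* Clock period when a given fraction of the period is reserved as
 * overhead (margin for device and operating-condition variance):
 *   period * (1 - overhead_fraction) = logic_delay                    */
double clock_period_ns(double logic_delay_ns, double overhead_fraction)
{
    assert(logic_delay_ns > 0.0);
    assert(overhead_fraction >= 0.0 && overhead_fraction < 1.0);
    return logic_delay_ns / (1.0 - overhead_fraction);
}

/* Convert a period in nanoseconds to a clock rate in GHz. */
double clock_rate_ghz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1.0 / period_ns;
}
```

For example, a hypothetical 0.225 ns of logic delay with 10 per cent overhead gives a 0.25 ns period, or 4GHz. If the next node makes the logic 10 per cent faster (0.2025 ns) but the overhead has to rise to 19 per cent, the period is still 0.25 ns: the faster devices buy no increase in clock rate.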

Considerable research effort is under way to develop tools that assist with mapping programs written in “C” to parallel processors. The Wikipedia page on Automatic Parallelization explains why this remains a difficult and as-yet unsolved problem:

“The goal of automatic parallelization is to relieve programmers from the tedious and error-prone manual parallelization process. Though the quality of automatic parallelization has improved in the past several decades, fully automatic parallelization of sequential programs by compilers remains a grand challenge due to its need for complex program analysis and the unknown factors (such as input data range) during compilation.”
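Two small C loops illustrate the kind of analysis the quotation refers to. This is a sketch for illustration only, and the function names `scale_all` and `running_total` are hypothetical. In the first loop each iteration is independent, so the work could be divided among cores, provided the compiler can prove the two arrays do not overlap. In the second, every iteration needs the result of the one before it, so the loop as written cannot simply be split up.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i].
 * A compiler may parallelise this, but only if it can show that
 * `in` and `out` never overlap, which it often cannot know. */
void scale_all(const int *in, int *out, size_t n, int factor)
{
    assert(n == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* Loop-carried dependence: iteration i uses the running total
 * produced by iteration i - 1, so the iterations must run in order
 * unless the algorithm itself is restructured. */
void running_total(const int *in, int *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += in[i];
        out[i] = total;
    }
}
```

The overlap problem is not hypothetical. Calling `scale_all(buf, buf + 1, 4, 2)` on a five-element buffer makes each iteration read the value the previous iteration just wrote, turning an apparently parallel loop into a sequential one. Whether that happens depends on how the function is called, which is exactly the sort of unknown factor a compiler faces.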
