Multi-Core Programming For Software Architecture

Randy talks about the problem with multi-core software architecture and how to solve this problem through multicore programming.

Almost every major software system in use today was initially created prior to the advent of multi-core computers. Multi-processor computers have existed for a while, but not for the common computer, and very few software systems were adapted to take full advantage of them.

 More Hardware vs. Multi-Core Programming

Why can't you throw more hardware at it and expect it to run faster? The issue is almost always characterized as "shared resources." A shared resource is anything that one task, or thread must use without worrying about another task changing it while it is being used. Concepts of synchronization, locking, mutual exclusion or "critical sections" were developed to deal with the issue of safe sharing of common resources.

The traditional mechanisms for creating safe sharing among tasks are called semaphores or queues or events. Used correctly, the multicore programming primitives allow everything from the incrementing of a shared number to a complete database transaction to be done without the corruption that would occur if tasks were not properly sequenced, or "serialized."

In the single-CPU age, multi-processing or multi-threading existed to create the illusion of multiple tasks happening concurrently. When synchronization primitives prevented one task from making progress, no problem, because the single CPU would switch to another task that was not blocked. Software systems with shared resources were not degraded too much from synchronization if there was always something for the CPU to do.

Example | Multiple Cores

Why can't multiple cores just make this go 2 or 4 times faster? I'll try to use an analogy to explain it. Imagine a person at a desk with an inbox, a computer, and an outbox. The job is to take the next item from the inbox and read it, enter information from the item into the computer, get a response from the computer, write it down on a form, prepare it for delivery and put it in the outbox. This cycle takes on the average 5 minutes per item, and the boss needs an output of 30 per hour, not the current 12. To solve this problem, the boss puts another person (core) at the same desk and allows the computer to swivel. Getting started, both people reach into the inbox, grab the same item and tear it in two. Starting over, they agree to notify the other before grabbing. Now, they each read their next items and are ready for data entry. Unfortunately, only one can do this at once, so the other waits, turns the monitor around and starts data entry. In the mean time, the first person prepared the response for the outbox, grabbed the next item, and needed to wait a short time for the computer monitor.

 The Problem

The good news is that output went from 12 items per hour to 18. The bad news is that the boss still requires 30. Another person is added to the desk and production goes from 18 to 22 items per hour. The next person takes it to 24. One more person moves it to 22, as this person, being equally skilled in processing items, is actually interfering with the work that was already occurring.

So it goes, in a very real way, with software systems that have more cores thrown at them. The first few help, a little, but they get to a point where performance degrades.

The Solution

Instead, if another desk, inbox, computer and outbox were provided for the second person, output would nearly double. Nearly, because some additional time would be required to fill two inboxes and empty two outboxes rather than just one, but clearly a good trade-off. The solution is more expensive, but pays for itself quickly.

Changing a software system to split up the work is not so easy, especially software that is well established and mature. It's not just a change in algorithms, it's a change in architecture.

If you do a web search for multi-core optimization, you will find articles about shared L2 cache or matrix processing. Many OS and chip vendors say that they have done the optimization for you, but they don't say how. Some have proposed programming languages that have parallel processing constructs added to the language. All of these are fine and helpful, but don't fix an architectural problem any more than it helps the people processing a single inbox to send them to a speed-reading class. Speed-reading is great, but in the absence of a new desk architecture, it has very limited benefit.

There are very few articles about system and multicore programming architecture, where the main benefits of system-wide performance are discussed. It turns out there is only one major design principle for creating a software system that scales up with multiple cores:

Obtain all resources needed for a task

Step one is best if it is non-blocking. Next best is making it very granular, meaning that one task can have pieces of a resource while another has different pieces.

Perform the task

Step two may be time consuming, but in a correct multicore programming architecture, it is not blocking other tasks. It can execute in parallel. An important principle of step two is that the sum of the time taken for all cores can be much more than the time it would take one CPU, but the wall-clock time is much less because they are done at the same time. For example, if step two takes a single-CPU system 100ms, then twelve tasks would take 1200ms. In a re-architected multi-core system with 4 cores requiring 180ms for step two, 4 parallel cores each performing the task 3 times (twelve tasks) would be done in 540ms. It hasn't cut the time by 4, but the stage is set for true scaling when even more cores are added.

 Quickly merge, or re-integrate the results of the task

The third step is hard, especially for a database management system, where ACID principles must be followed. Saving transactions safely to disk in a consistent and durable manner requires synchronization around shared resources. The key here is to do as much work beforehand as possible, so that the re-integration is quick. This is probably why step two takes longer. Step two does more work (in parallel) so that the serialized portion is short. It's a massively worthwhile trade-off.

The trick now is to take working software and re-architect it, not just optimize it, for multi-core scaling.