SciTech

How Things Work: OpenCL

OpenCL is an emerging standard for taking advantage of the varied processors available on most computers today. It grew out of efforts, driven in particular by scientists working with large datasets, to press the common PC into service for difficult calculations.

Gamers are one reason that PCs have so much computational power today. Delivering millions of high-precision, shaded pixels to the screen tens of times per second is a challenge for any hardware, particularly the CPU, which historically operates on one piece of data at a time. Specialized pieces of hardware called video cards were developed to meet this processing need. The appetite for realism has driven the development of video cards’ capabilities and raw computing power, and they can now be viewed as processors streamlined to handle large volumes of data easily. The result is that all desktops and most laptops on the market contain two or more high-performance processors, each optimized for a different kind of work. The CPU is designed for general-purpose computation and system management; the modern video card, or GPU (graphics processing unit), is designed to perform a large number of high-precision calculations at a time. These devices have completely different interfaces: without a common standard, it is not possible to write a single piece of code that will run on both.

OpenCL harnesses all of the processors in a system, both general and specialized, in a way that abstracts away the specifics of the hardware. Code can thus be written once and executed on whatever processors are available. An OpenCL task is structured as a chunk of data paired with a piece of code, written in a low-level but portable language, that is applied to every datum in parallel, to whatever degree the hardware allows. In scheduling tasks, the platform views the host machine as a collection of computing devices, each with certain capabilities (such as high parallelism) and limitations (such as precision). For example, the CPU is not as good as the GPU at performing many floating-point operations quickly, but it can usually access a much larger amount of memory. At run time, OpenCL decides, according to capabilities and availability, how to split a task over the available computing devices, scheduling a segment of the task to each device.
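
To make this model concrete, here is a minimal sketch of such a piece of code, written in OpenCL’s C-like kernel language (the name and buffers are illustrative, not anything the standard prescribes). The host asks OpenCL to run it over a range of n work-items; each instance calls the built-in get_global_id to learn which datum it owns, so the same code is applied to every element in parallel.

    /* A minimal data-parallel kernel: square every element of an
     * array.  The runtime launches one instance per element; the
     * built-in get_global_id(0) tells each instance its index. */
    __kernel void square(__global const float *in,
                         __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * in[i];
    }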

As an example, we will consider an n-body simulation: tracking the positions of n physical bodies according to the gravitational forces they exert on one another. One implementation may, for each time step t, store in a separate data structure each body’s new position and velocity (computed as a function of the gravitational constant, the body’s current position and velocity, the positions of all the other bodies, and the amount of time elapsed), and then copy the new state over the old. This is very amenable to parallelization; in fact, within each time step, the update to one body’s position and velocity can be computed independently of all the other updates, as the sketch below makes plain.
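
A sequential version of that update, in plain C, shows the independence (the struct layout, the names, and the softening constant eps are assumptions made for illustration; the physics is simplified to pairwise Newtonian gravity). Each iteration of the outer loop reads only the old state, so the n updates could run in any order, or all at once.

    #include <math.h>

    typedef struct { float x, y, z, vx, vy, vz, mass; } Body;

    /* Advance every body by one time step dt.  Reads only cur[],
     * writes only next[], so the n outer iterations are independent. */
    void step(const Body *cur, Body *next, int n, float G, float dt)
    {
        const float eps = 1e-9f;   /* softening term: avoids divide-by-zero */
        for (int i = 0; i < n; i++) {
            float ax = 0.0f, ay = 0.0f, az = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = cur[j].x - cur[i].x;
                float dy = cur[j].y - cur[i].y;
                float dz = cur[j].z - cur[i].z;
                float r2 = dx*dx + dy*dy + dz*dz + eps;
                float inv_r = 1.0f / sqrtf(r2);
                /* acceleration from body j: G * m_j / r^2, along d */
                float s = G * cur[j].mass * inv_r * inv_r * inv_r;
                ax += s * dx;  ay += s * dy;  az += s * dz;
            }
            next[i] = cur[i];
            next[i].vx += ax * dt;
            next[i].vy += ay * dt;
            next[i].vz += az * dt;
            next[i].x  += next[i].vx * dt;
            next[i].y  += next[i].vy * dt;
            next[i].z  += next[i].vz * dt;
        }
    }

After each step, the caller swaps cur and next; that swap is the “copy over” described above.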

The equivalent in OpenCL may proceed in this manner: First, a list is created containing the position information of each body. A kernel, the piece of code that OpenCL runs on each computing device, uses this list to compute the distance between every pair of bodies. Then another list is loaded containing the mass of each body. The kernel combines the masses with the pairwise distances computed earlier to calculate the force exerted on each body by every other body. Finally, the kernel uses the position, mass, velocity, and force information to find the new position of each body at the next time step; a sketch of such a kernel appears below.
Further details of how OpenCL is used can be found in Aaftab Munshi’s presentation “OpenCL: Parallel Computing on the GPU and CPU,” given at SIGGRAPH 2008, a popular conference for computer graphics research (http://s08.idav.ucdavis.edu/munshi-opencl.pdf).
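
For illustration, here is a hedged sketch of how that update might look as an OpenCL kernel. The walkthrough above splits the work into separate passes; this version folds distance, force, and integration into a single kernel for brevity. The names, the float4 layout packing each body’s mass into the w component, and the softening constant are assumptions made for this sketch, not anything OpenCL requires.

    /* One work-item per body: instance i integrates body i forward
     * by dt.  pos[k].xyz holds body k's position and pos[k].w its
     * mass; the w component of vel[k] is assumed to be zero. */
    __kernel void nbody_step(__global const float4 *pos,
                             __global const float4 *vel,
                             __global float4 *pos_out,
                             __global float4 *vel_out,
                             const int n,
                             const float G,
                             const float dt)
    {
        int i = (int)get_global_id(0);
        float4 p = pos[i];
        float4 a = (float4)(0.0f);
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float4 d = pos[j] - p;
            d.w = 0.0f;                      /* mass difference is not a distance */
            float r2 = dot(d, d) + 1e-9f;    /* softened squared distance */
            float inv_r = rsqrt(r2);
            a += d * (G * pos[j].w * inv_r * inv_r * inv_r);
        }
        float4 v = vel[i] + a * dt;          /* a.w is 0, so v.w stays 0 */
        float4 newp = p + v * dt;
        newp.w = p.w;                        /* carry the mass along in w */
        pos_out[i] = newp;
        vel_out[i] = v;
    }

The host side compiles this source at run time, binds the buffers, and enqueues n work-items; OpenCL then maps those work-items onto whichever computing devices it has chosen.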

The upshot is that when you have a data-intensive problem that would benefit greatly from parallelism, OpenCL aims to provide a way to write one solution that will run on whatever parallel computing resources are available on the machine.