A Mouthful of Rootbeer – Java on the GPU

The Rootbeer toolkit promises to run (almost) any Java code on the GPU. Sounds good, so I took a look, and tried a simple example, namely calculating a visualization of the well-known Mandelbrot set on the GPU using Rootbeer.

In short: after some initial setup effort, it works nicely. But on my machine (a Lenovo T520 laptop with an NVIDIA NVS 4200M) it's slower than doing the same calculation on a single CPU core. I won't blame Rootbeer for that, as this is only a business laptop that doesn't even have a good rating for gaming performance.


The Mandelbrot Set rendered using Rootbeer

There's hardware out there that is orders of magnitude faster, on which you can expect better results. The other thing is that general Java code will be translated to, well, general CUDA code, not taking advantage of built-in functions.

As the Rootbeer test suite also doesn't run without errors on my machine, the measured speed is probably not very representative. Nevertheless, it was an interesting experience trying out Rootbeer (especially given that I had no previous GPU computing experience).

Here’s how you can, too.

Initial Setup

Here's the process for getting set up and running (the version numbers are the ones I used; others may work as well):

  • Install the CUDA Toolkit 4.2
  • Install the CUDA drivers 4.2. If you don't install these, you'll get link errors while loading the CUDA DLLs (cudaruntime_x64.dll in my 64-bit case)
  • Install Visual Studio Express Edition for C++ 2010
  • If you have a 64-bit machine, there's an additional step. The Express Edition only comes with 32-bit compilers, so you also need to install the Windows SDK 7.1 (http://msdn.microsoft.com/en-us/windowsserver/bb980924.aspx) and tweak the build a bit by creating the file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64\vcvars64.bat
    with the following content:
    CALL "C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd" /x64
    (see the issue in the issue tracker or also Stackoverflow)
  • Implement your Kernel, build an application to run it, and follow the instructions of the Rootbeer toolkit (take a look in the doc folder). If things already work then, fine (a minimal sanity-check kernel for verifying the setup is sketched after this list). Otherwise, the following hints may be helpful.
  • cicc.exe not found – This problem seems to be NVIDIA's fault: the build fails complaining that the command "cicc" could not be found. The solution is to copy the file C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\nvvm\cicc.exe to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin
  • copy .cubin files to the proper location – When trying to run a rootbeer'ed jar file, I got the error "not found: output-class/MandelbrotLineKernelGcObjectVisitor.cubin" (MandelbrotLineKernel is the name of my GPU-runnable Kernel class). You can work around this by extracting this file from the generated jar file and putting it at the expected path in the execution directory.
  • If you're seeing exceptions like "cannot find class: edu.syr.pcpratts.rootbeer.runtime.remap.java.lang.Math" when trying to rootbeer your jar, try running the Rootbeer test suite in the same location; it generates some additional folders that seem to be needed.
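
If you want a quick sanity check that the toolchain works before tackling a real kernel, something along the following lines should do. It's a minimal sketch; the import paths are what the Rootbeer version I used ships with, so treat them as an assumption if you're on a different release.

import java.util.ArrayList;
import java.util.List;

import edu.syr.pcpratts.rootbeer.runtime.Kernel;
import edu.syr.pcpratts.rootbeer.runtime.Rootbeer;

//minimal sanity check: each Kernel squares one array element on the GPU
public class SquareKernel implements Kernel {

    private int[] values;
    private int index;

    public SquareKernel(int[] values, int index) {
        this.values = values;
        this.index = index;
    }

    public void gpuMethod() {
        values[index] = values[index] * values[index];
    }

    public static void main(String[] args) {
        int[] values = new int[1024];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        List<Kernel> jobs = new ArrayList<Kernel>();
        for (int i = 0; i < values.length; i++) {
            jobs.add(new SquareKernel(values, i));
        }
        new Rootbeer().runAll(jobs);
        System.out.println("values[10] = " + values[10]); //prints 100 if the GPU did its job
    }
}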

That’s it, the GPU should be busy now (in my case, rendering the Mandelbrot Set, see the code below).

Source Code

Here's the core algorithm that performs the calculation (on the GPU). Nothing special: it just shovels the necessary data in via the constructor and then performs the standard Mandelbrot calculation, inspired by Wikipedia.

//Calculates one scan line of the Mandelbrot image
public class MandelbrotLineKernel implements Kernel {

    //these will be available for the calculation
    int width;        //image width
    int height;       //image height
    int[] data;       //result - one vertical scan line
    int column;       //index of the column for the scan line
    float scale;      //zoom factor into the image
    float offsetX;    //x offset for panning around
    float offsetY;    //y offset for panning around
    int maxIteration; //max iterations trying to escape the set

    public MandelbrotLineKernel(int width, int height, int[] result, int column,
            double scale, double offsetX, double offsetY, int maxIteration) {
        super();
        this.width = width;
        this.height = height;
        this.data = result;
        this.column = column;
        this.scale = (float) scale;
        this.offsetX = (float) offsetX;
        this.offsetY = (float) offsetY;
        this.maxIteration = maxIteration;
    }

    //will be translated to CUDA and run on the GPU
    public void gpuMethod() {
        //standard Mandelbrot set algorithm
        double x0 = column / scale + offsetX;
        for (int j = 0; j < height; j++) {
            double y0 = j / scale + offsetY;
            int k = 0;
            double x = 0, y = 0;
            while (x * x + y * y < 4 && k < maxIteration) {
                double xtemp = x * x - y * y + x0;
                y = 2 * x * y + y0;
                x = xtemp;
                k++;
            }
            data[j] = k;
        }
    }
}

And here's the supporting code that splits the calculation into the scan line portions, submits the jobs to Rootbeer, and assembles the scan lines into the final result.

public class MandelbrotRenderer {

    private double scale = 250;
    private double offsetX = -2.2;
    private double offsetY = -1;
    private int maxIteration;
    private Rootbeer rootbeer;

    private Rootbeer getRootbeer() {
        //lazily initialize (constructor won't work as this class is
        //constructed in the startup thread, rendering is done in the ui thread)
        if (rootbeer == null) {
            rootbeer = new Rootbeer();
        }
        return rootbeer;
    }

    public int[] calculateMandelbrotSet(final int width, final int height) {
        maxIteration = (int) (scale / 2);
        final List<Kernel> jobs = new ArrayList<Kernel>();
        final List<int[]> calculatedColumns = new ArrayList<int[]>();

        //create per vertical scan line kernels, and keep track of the future result data
        for (int columnIndex = 0; columnIndex < width; columnIndex++) {
            final int[] column = new int[height];
            jobs.add(createColumnForCalculation(width, height, column, columnIndex));
            calculatedColumns.add(column);
        }
        runOnRootbeer(jobs);
        return aggregateResultColumnsToEntireData(width, height, calculatedColumns);
    }

    private void runOnRootbeer(final List<Kernel> jobs) {
        long t = System.currentTimeMillis();
        getRootbeer().runAll(jobs);
        System.out.println("Rootbeer running of " + jobs.size() + " jobs took "
                + (System.currentTimeMillis() - t));
    }

    private MandelbrotLineKernel createColumnForCalculation(int width, int height, int[] data, int column) {
        return new MandelbrotLineKernel(width, height, data, column, scale, offsetX, offsetY, maxIteration);
    }

    private int[] aggregateResultColumnsToEntireData(final int width,
            final int height, List<int[]> calculatedColumns) {
        final int[] data = new int[width * height];
        int i = 0;
        for (int[] column : calculatedColumns) {
            for (int j = 0; j < column.length; j++) {
                data[i + width * j] = column[j];
            }
            i++;
        }
        return data;
    }
}

That's it – add the additional boilerplate code below for drawing in a Swing panel, and there's the Mandelbrot Set.

public class DisplayPanel extends JPanel {

    private MandelbrotRenderer renderer = new MandelbrotRenderer();

    @Override
    public void paint(Graphics g) {
        final Graphics2D g2 = (Graphics2D) g;
        final Rectangle clipBounds = g2.getClipBounds();
        int width = clipBounds.width;
        int height = clipBounds.height;

        long time = System.currentTimeMillis();
        final int[] data = renderer.calculateMandelbrotSet(width, height);
        System.out.println("Overall rendering time " + (System.currentTimeMillis() - time));

        drawImage(g2, clipBounds, width, height, data);
    }

    private void drawImage(final Graphics2D g2, final Rectangle clipBounds,
            int width, int height, final int[] data) {
        final BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int i = 0; i < data.length; i++) {
            int value = data[i];
            image.setRGB(i % width, i / width, (value * 4));
        }
        g2.drawImage(image, clipBounds.x, clipBounds.y, null);
    }
}

Sounds too easy to be true? No GPU knowledge, no CUDA programming, just plain old Java? Try it yourself 😉
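
For completeness: the post doesn't show a launcher, but a minimal sketch along these lines (the class name and frame title are made up here) puts the DisplayPanel into a Swing frame.

import javax.swing.JFrame;
import javax.swing.SwingUtilities;

//hypothetical launcher, not part of the original sources
public class MandelbrotMain {

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                JFrame frame = new JFrame("Mandelbrot on the GPU");
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.add(new DisplayPanel()); //the panel above does all the work in paint()
                frame.setSize(800, 600);
                frame.setVisible(true);
            }
        });
    }
}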

Results and Lessons Learned

Along the way, while playing around and trying to tweak performance, I came across the following findings.

Working with Rootbeer


(Image: two separate Eclipse projects)
  • As working with Rootbeer requires additional build steps (rootbeering the compiled jar file, and possibly other operations), it quickly pays off to create a build script (e.g. with ant) that automates this for you
  • I found it useful to keep the Kernel implementation and the application that actually runs it in separate Eclipse projects. This way, the build is faster, and most important, you can change and debug the non-CUDA portion of your application without running it through Rootbeer again (which takes about half a minute)
  • You need to create the Rootbeer object and run operations on it on the same thread (otherwise you'll get ERROR STATUS:201 : error in cuModuleLoad); see the sketch after this list
  • Kernels that run too long (e.g. 10M iterations in a loop that does some math) run into ERROR STATUS:702 : error in cuCtxSynchronize. This may be a problem specific to my machine.
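
Regarding the same-thread point above, here's a minimal sketch, assuming the same Rootbeer API as used in the code above: both the construction of the Rootbeer object and the runAll() call happen inside one dedicated thread.

//construct the Rootbeer object and run the kernels on one and the same thread,
//otherwise cuModuleLoad fails with ERROR STATUS:201
void renderOnGpuThread(final List<Kernel> jobs) {
    Thread gpuThread = new Thread(new Runnable() {
        public void run() {
            Rootbeer rootbeer = new Rootbeer(); //created on this thread...
            rootbeer.runAll(jobs);              //...and only ever used on this thread
        }
    });
    gpuThread.start();
}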

GPU computing

  • Calculations that are parallelized need to be roughly equal in calculation time. If you have many Kernels that run very quickly and some that take much longer, the long-running ones will disproportionately influence the overall execution time (because Kernels are run in batches, and the longest-running one determines the time of its batch)
  • Use appropriate batch sizes. Parallelizing the Mandelbrot image calculation into a one-Kernel-per-dot calculation slows things down, obviously because of the overhead associated with running 400,000 Kernels. A sketch addressing both of these points follows this list.
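
As a sketch of both points, here's a hypothetical variant of the kernel (it's not the code I actually benchmarked, and I haven't verified how well Rootbeer handles the result array being shared by all kernels): each Kernel calculates several columns, strided across the image, so the kernel count stays moderate and each Kernel sees cheap and expensive image regions alike.

//hypothetical batched/strided variant of MandelbrotLineKernel
public class MandelbrotStridedKernel implements Kernel {

    int width;         //image width
    int height;        //image height
    int[] data;        //width * height result array, shared by all kernels
    int firstColumn;   //first column handled by this kernel
    int stride;        //distance to the next column handled by this kernel (= number of kernels)
    float scale;       //zoom factor into the image
    float offsetX;     //x offset for panning around
    float offsetY;     //y offset for panning around
    int maxIteration;  //max iterations trying to escape the set

    public MandelbrotStridedKernel(int width, int height, int[] data, int firstColumn, int stride,
            double scale, double offsetX, double offsetY, int maxIteration) {
        this.width = width;
        this.height = height;
        this.data = data;
        this.firstColumn = firstColumn;
        this.stride = stride;
        this.scale = (float) scale;
        this.offsetX = (float) offsetX;
        this.offsetY = (float) offsetY;
        this.maxIteration = maxIteration;
    }

    public void gpuMethod() {
        //every stride-th column, so each kernel covers the whole image range
        for (int column = firstColumn; column < width; column += stride) {
            double x0 = column / scale + offsetX;
            for (int j = 0; j < height; j++) {
                double y0 = j / scale + offsetY;
                int k = 0;
                double x = 0, y = 0;
                while (x * x + y * y < 4 && k < maxIteration) {
                    double xtemp = x * x - y * y + x0;
                    y = 2 * x * y + y0;
                    x = xtemp;
                    k++;
                }
                data[column + width * j] = k;
            }
        }
    }
}

On the host side, the renderer would then create only a fixed number of these kernels (each with a different firstColumn and the stride equal to the kernel count) and could skip the aggregation step, since the kernels write directly into the final width * height array.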

Performance

As said above, these numbers are from a business laptop that isn't even good for decent game graphics. It's probably not exaggerated to expect a hundredfold improvement on better hardware.

  • Simple number crunching (integer math operations) took about twice as long with Rootbeer as on a single CPU core on my machine.
  • The Mandelbrot calculation (parallelized along the scan lines of the image) took about 10 times as long as on the CPU (on my machine). Better load distribution between Kernels (not using straight scan lines but having every Kernel calculate dots from every image region) brought a 10-20% speed improvement.

3 Responses to A Mouthful of Rootbeer – Java on the GPU

  1. Sander says:

    Nice article!
    I have created a bruteforce chess algorithm and was thinking about gpgpu computing. This seems to be the easiest way but my application probably isn’t suitable for this. The algorithm mainly has simple calculations and lots of arraylist lookups. Also making it multi-threaded (more than 1000 threads) would be quite difficult.

    • Sti says:

      Yep, parallelizing a chess algorithm sounds like a good challenge 😉
      But there are also lots of papers on it on Google.

  2. Yohan says:

    These are no longer valid. None of Rootbeer's examples work except the `ArrayMultApp`. Methods like `runAll()` are no longer available either.
