M1 GPUs for C++ science: Getting started
Relevant repo, follow-up post.
This story of working with M1 chips is an amalgation of various Apple documentations. As a scientific programmer it is a bit complicated to work on the new Apple M1 Macbooks; CUDA doesn’t work on these blazing fast chips! It would be cool if we can offload heavy physics simulations to the GPU, they’ve shown that they are quite capable. We’ll start off slow, working out the basics of array operations on the GPU, and hopefully end up at some proper fast physics!
Introduction to Apple’s Metal
Luckily for people who jumped on the new M1 chips: you can program rather easily for this chip using Apple’s Metal and it’s programming language MSL, Metal Shading Language.
Our second “luckily”, MSL is C++ based. This is cool, because my scientific code is in C++, I’m not going near Fortran. MSL promises that we can compile our computational kernels (or shaders as they’re called in MSL) to fast code, and use them on heterogeneous systems that support Metal. Here is a sample kernel written in MSL that adds two arrays together:
kernel void add_arrays(device const float* A,
device const float* B,
device float* C,
uint index [[thread_position_in_grid]])
{
C[index] = A[index] + B[index];
}
Looks a lot like CUDA right? Great! If you don’t understand it yet, tag along, and you will at the end. Let’s try to get this to work on a MacBook and compare it against OpenMP and serial implementations.
Metal has a bunch of other graphics-oriented functionality, but for scientific programming we leave those be for now.
0. Readying your system to follow along
I make it a point not to use Xcode. Nothing intrinsically against this piece of software, but to start out with, it is a lot more useful to me to see how dependencies work without using all Xcode’s handholding. I’ll compile all binaries using LLVM’s homebrewed CLang++:
brew install llvm
This has the added benefit of allowing me to use OpenMP on the M1 chip. Make sure the library files for OMP are installed:
brew install libomp
However, we will need the SDK’s that come with Xcode to be able to compile for MacBooks. I advise installing Xcode through your Mac’s app store.
1. Performing Calculations on a GPU using Metal
Our first stop on the world wide web is Apple’s own calculations on GPU, titled Performing Calculations on a GPU using Metal. Exactly what we need! Until… you see that the interface to work with Metal is available in Swift and Objective-C. Not the typical dialects used in the lands of Geophysics or other exact sciences.
The website does give a very good overview on getting kernels to work though. I recommend sitting down and reading the text, even if Objective-C is not your cup of tea. I never worked in it, but I still found the article helpful to understand MSL concepts. Among other things, we get the fun anecdote:
In Metal, code that runs on GPUs is called a shader, because historically they were first used to calculate colors in 3D graphics.
Additionally, we learn a few other things:
- Metal dynamically loads your shader library at runtime. That means that your applications and shaders are separately compiled.
- Metal orchestrates computations usings command queues and command buffers. This allows for asynchronous and heterogeneous operations, much like CUDA.
- The design of parallel loops works much like CUDA, where the kernel is a single function called with an index.
- Metal has different types of data buffers, that are differently exposed to the GPU and
CPU. For ease of this tutorial
MTLResourceStorageModeShared
seems very appropriate. We can use data in memory for both GPU and CPU computations!
2. Using Metal from C++, using metal-cpp
A bit more targeted web-surfing reveals another Apple page, geared towards running graphical Metal applications from C++, titled Getting started with Metal-cpp. Those highlights do their name justice, i.e.:
- Alternative to Objective-C;
- No measureable overhead.
This manual is not geared towards the scientific computation that we were interested in,
but it does allow us to get started with Metal in C++. One could follow the instructions
to download metal-cpp_macOS12_iOS15.zip
, however, I was (legally, courtesy of the Apache license) able to include the relevant code in this repository. The interesting bits for us are:
metal-cpp/Metal
metal-cpp/Foundation
metal-cpp/QuartzCore
These folders contain the relevant headers exposing Metal’s interface to C++.
We could probaly use these to translate the Objective-C code by hand …
3. Translating the Objective-C without understanding!
I’m not going to learn Objective-C just to translate a bit of code. The
programmer in me says that there is a more efficient way to rewrite
Performing Calculations on a GPU using Metal
into useable C++. Autocompletion in
VSCode would be a perfect shortcut, no? Additionally, both Objective-C and C++ contain a
C in their name. They must be extremely similar. … Right?
Specifically, we need to get the following files to work in a C++ implementation:
main.m
, Looks like a main function from C++.MetalAdder.h
, Header files for a class, apparently.MetalAdder.m
, Body of the class, I hope.add.metal
, the MSL code! We’ve seen this before, and it’s definitely the least intimidating.
Translating main.m
The Objective-C code main function (main.m
) starts confident, coming in hot with stuff I’ve never seen before:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import "MetalAdder.h"
int main(int argc, const char * argv[]) {
@autoreleasepool {
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
// Create the custom object used to encapsulate the Metal code.
// Initializes objects to communicate with the GPU.
MetalAdder* adder = [[MetalAdder alloc] initWithDevice:device];
// Create buffers to hold data
[adder prepareData];
// Send a command to the GPU to perform the calculation.
[adder sendComputeCommand];
NSLog(@"Execution finished");
}
return 0;
}
Seems like it creates a pointer to a computation device (such as a GPU, defined in Metal.h
), and then passes that to a constructor for MetalAdder
(that we haven’t defined yet). Next it runs a few functions associated with this object.
Using some smart autocompletion to figure out how to create the Metal device, and filling in the arbitrary gaps (arbitrary as we still have to define the MetalAdder class), we end up with some C++ code that looks like this:
#include <iostream>
#include <omp.h>
#define NS_PRIVATE_IMPLEMENTATION
#define CA_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include "Foundation/Foundation.hpp"
#include "Metal/Metal.hpp"
#include "QuartzCore/QuartzCore.hpp"
#include "MetalAdder.hpp"
int main(int argc, char *argv[])
{
MTL::Device *device = MTL::CreateSystemDefaultDevice();
MetalAdder *adder = new MetalAdder(device);
adder->sendComputeCommand();
adder->verifyResults();
}
Additionally, I’ve added IOStream and OMP to facilitate output and multicore stuff we’ll do to profile all the code.
Translating MetalAdder.m
and MetalAdder.h
The MetalAdder
class is a way to keep track of data, command, etc. that are relevant
to using the GPU. I’ll try to one-on-one translate this to C++ from Apple’s tutorial,
but in one or two places I optimized the code to C++ standards. The final result is
MetalAdder.cpp
and MetalAdder.hpp
.
Let us first have a look at the constructor. In Objective-C, it’s signature is the following:
- (instancetype) initWithDevice: (id<MTLDevice>) device
{
// ... body ...
}
meaning that it takes a pointer to an MTLDevice
. Using VSCode, we realize that in the
C++ headers, all types that lead with MTL, such as MTLDevice
are translated to
with a leading MTL namespace: MTL::Device
. Additionally, the Ojective-C constructor
tests for Metal errors by making sure none of the created objects turn out to be nil
.
The equivalent of this in C++ is to check against nullptr
s.
Functionally, what happens next in the constructor is the following:
- Loading the Metal library containing our shaders. The C++ equivalent is radily found using your editors code-completion.
Objective-C
id<MTLLibrary> defaultLibrary = [_mDevice newDefaultLibrary];
C++
MTL::Library *defaultLibrary = _mDevice->newDefaultLibrary();
- Loading a specific shader based on its name. This is the biggest translation
mismatch, as the C++ method newFunctionWithName doesn’t exist, and it’s equivalent
doesn’t accept
const char *
, only it’s own implementation of strings.
Objective-C
id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
C++
auto str = NS::String::string("add_arrays", NS::ASCIIStringEncoding);
MTL::Function *addFunction = defaultLibrary->newFunction(str);
- Creating a pipeline state object. This translates just like loading the default library.
- Creating a command queue object. This also translates without issue.
- Prepare the test data. This is simply done by a (yet-to-write) method of the class we are writing.
To understand exactly what the created objects do, refer to the Apple tutorial written for Objective-C.
Most other class methods are translated one-to-one much the same way. One of the
major differences (inconveniences, more like) between the original Objective-C and the
C++ implementation can be initially found in generateRandomFloatData
. This function
populates arbirtrary buffers with random data. To do this, it needs to set the values of
the buffer one by one, accessing these from the CPU. In the Objective-C implementation,
whenever we want to access the buffer’s data, we obtain a pointer to the start of the
buffer, and loop over it by pointer arithmetic:
- (void) generateRandomFloatData: (id<MTLBuffer>) buffer
{
float* dataPtr = buffer.contents;
for (unsigned long index = 0; index < arrayLength; index++)
{
dataPtr[index] = (float)rand()/(float)(RAND_MAX);
}
}
In C++, one is not able to implicitly cast this pointer. The return type of
buffer->contents()
is a void *
, i.e. a pointer to any type of object. For safety,
one needs to explicitly cast this pointer to a float *
.
void MetalAdder::generateRandomFloatData(MTL::Buffer *buffer)
{
// The pointer needs to be explicitly cast in C++, a difference from
// Objective-C.
float *dataPtr = (float *)buffer->contents();
for (unsigned long index = 0; index < arrayLength; index++)
{
dataPtr[index] = (float)rand() / (float)(RAND_MAX);
}
}
Pfew, getting quite close there to actually programming! This slight difference in
Objective-C w.r.t. C++ pops up again in verifyResults
, and basically whenever we try
to access the buffer manually.
To keep it tidy, I moved all declarations to the MetalAdder.h
(but mostly because I didn’t figure out how to have some declaration not in the header).
4. Compiling metal-cpp programs.
We avoid looking at documentation some more, and by messing around with the make
command in the Getting started with Metal-cpp project reveals the CLang++
includes
relevant to get Metal to work in C++:
larsgebraad@macbook:~$ pwd
/Users/larsgebraad/Downloads/LearnMetalCPP
larsgebraad@macbook:~$ make
clang++ -Wall -std=c++17 -I./metal-cpp -I./metal-cpp-extensions -fno-objc-arc -O2 -framework Metal -framework Foundation -framework Cocoa -framework CoreGraphics -framework MetalKit learn-metal/00-window/00-window.o -o build/00-window
It seems we need to include the metal-cpp
, metal-cpp-extensions
folders, as well as
hooking the frameworks into CLang. CoreGraphics
doesn’t seem like something we’d need
in an terminal based application, as does Cocoa. After some tweaking, the bare
necessities to compile a command line program with Metal seems to be:
larsgebraad@macbook:~$ clang++ -std=c++17 -I./metal-cpp -O2 \
-framework Metal -framework Foundation -framework MetalKit \
whatever.cpp
This seems like a good start for compiling our Metal program.
I skip my system-wide clang++
in favour of /opt/homebrew/opt/llvm/bin/clang++
, which
allows me to easily include OpenMP libraries, e.g.:
larsgebraad@macbook:~$ /opt/homebrew/opt/llvm/bin/clang++ \
-L/opt/homebrew/opt/libomp/lib -fopenmp some-openmp.cpp
Now, to actually compile our Metal+OpenMP application, we run:
larsgebraad@macbook:~$ /opt/homebrew/opt/llvm/bin/clang++ \
-std=c++17 -stdlib=libc++ -O2 \
-L/opt/homebrew/opt/libomp/lib -fopenmp \
-I./metal-cpp \
-fno-objc-arc \
-framework Metal -framework Foundation -framework MetalKit \
-g 01-MetalAdder/main.cpp 01-MetalAdder/MetalAdder.cpp -o 01-MetalAdder/benchmark.x
If one were to try out this executable, we’d find the following:
larsgebraad@macbook:~$ ./01-MetalAdder/benchmark.x
Failed to find the default library.
[1] 13767 segmentation fault ./benchmark.x
It seems that our GPU code itself is not compiled yet, as this is not standard when compiling the CPU code. To do this, we follow the instructions of yet another Apple documentation website, titled Building a Library with Metal’s Command-Line Tools and geared towards Objective-C.
This is where our installation of Xcode is relevant; we need to use the command line tools and SDKs from Xcode to compile our gpu code:
larsgebraad@macbook:~$ xcrun -sdk macosx metal -c add.metal -o MyLibrary.air
larsgebraad@macbook:~$ xcrun -sdk macosx metallib MyLibrary.air -o default.metallib
The final name of the .metallib
file is important, as our executable is only searching for the default
library. This behaviour can be adapted in the constructor of MetalAdder
.
5. Benchmarking against serial and OpenMP code
Now that our GPU code is compiled, we are ready to run a full benchmark. In main.cpp
, additional serial and OpenMP implementations of this array addition are defined. By running the resulting benchmark.x
, we get the following impressive results:
System specs:
- 2021 MacBook pro
- M1 Max, 10‑Core CPU, 32‑Core GPU und 16‑Core Neural Engine
- 32 GB RAM
larsgebraad@macbook:~$ ./01-MetalAdder/benchmark.x
Metal (GPU) code performance:
Average time: 803.566ms +/- 48.427ms
Serial code performance:
Average time: 2439.92ms +/- 74.4422ms
OpenMP (1 threads) code performance:
Average time: 2427.24ms +/- 15.666ms
OpenMP (2 threads) code performance:
Average time: 1315.2ms +/- 76.5438ms
OpenMP (3 threads) code performance:
Average time: 1684.19ms +/- 46.0139ms
OpenMP (4 threads) code performance:
Average time: 1339.81ms +/- 99.9749ms
OpenMP (5 threads) code performance:
...
OpenMP (10 threads) code performance:
Average time: 1756.53ms +/- 640.482ms
Weirdly enough, OpenMP is faster on even thread counts. Unsurprisingly, the serial code is the slowest. On this simple array addition problem, using Metal allows for a 3x speed-up with respect to the CPU single thread, and a 1.6x speedup to the fastest OpenMP configuration!
I suspect these numbers will be more dramatic when the computational kernels are more involved, but this we’ll see later.