Personal blog: Seismology and silly tech solutions. Basis in Earth Science and Bayesian Statistics. 100% rambling. By Lars Gebraad (https://larsgeb.github.io/feed.xml).

Stop keywording your stock photos (2023-12-20, https://larsgeb.github.io/2023/12/20/vision-keywords)

<p>I am … not sure what I fully am. I have a job as an R&D engineer. At the same
time, I have a degree in the somewhat related field of seismic imaging, where
I’ve dabbled with different algorithms. I also own a camera and have been known
to side-hustle with it.</p>
<p>I like that balance — the grind professional artists often go through seems
hard. The physics/programming day job is very fun and stable, which keeps me
fully a ‘prosumer’ in the camera world. I wouldn’t say my pictures are
revolutionary, but I enjoy them, the occasional print gets sold to an
acquaintance, and here and there I do a shoot for a customer. And, I sell
stock photos.</p>
<p>I hate doing stock photos. It’s a labour of volume and tedium. Every photo
needs a title, and for discoverability, at least a good number of keywords that
not only convey the contents of an image, but preferably some emotion or
understanding of the image, to reach the customers the image is perfect for.
This is boring work to me. Submitting more than 3 images is mind-numbing, even
with the keyword suggestions most platforms offer nowadays. I don’t want to click
through each of them!</p>
<p>Image recognition machine learning models are exciting to me, because they could
take the most boring part out of a workflow that nets me only a few dollars per
month. These models take an image, and try to recognise its content in a
specific way. That could be objects in the photo, depth understanding, or
whatever the model is trained for.</p>
<p>Earlier image recognition models were a step in this direction. You get some
hints as to what an image contains, but largely, the output generated by simple
image-to-object models is not appropriate for production work – I do not want
the keyword “rock” 32 times. I want something that understands the scene, its
intent, etc. I want a human touch.</p>
<p>But a human touch is expensive. Either it’s me, which is expensive time-wise, or
a keywording service; I am not sure those even exist, but if they do, I doubt
they’d be worth it. So we need a halfway solution that is affordable, but
better than the computer equivalent of a toddler pointing at things and naming
them.</p>
<p>The real kicker is combining image recognition with a model that has a broader
understanding of the world at large: Large Language Models (LLMs). This is horribly
phrased, as LLMs don’t actually understand jackshit. They are amazing at
parroting information, and transposing information between situations. Show it
twenty discussions on stock photography keywording, and it will parrot great
keywords for an image that contains a sunset and a cow. Repetitive work that has
been done before, now condensed into a service. Let’s not talk about who did the
original work, lest we open that can of worms.</p>
<h2 id="gpt-4-vision">GPT-4 Vision</h2>
<p>So, how does one go from an image to keywords ripe for stock photography? How do
we couple an image-to-text model with an LLM? Fear not, the hard part is already
finished. One of the latest models that OpenAI exposes through their API is the
<a href="https://platform.openai.com/docs/guides/vision"><code class="language-plaintext highlighter-rouge">gpt-4-vision-preview</code></a>, which
interfaces an image recognition input with the typical GPT interaction. The
cost? Well, money.</p>
<p><img src="/assets/IMG_9506.jpg" alt="" /></p>
<p><strong>Original keywords</strong>
Engadin, Switzerland</p>
<p><strong>Generated title</strong>
Winter Trek with Dog in Swiss Alps</p>
<p><strong>Generated keywords</strong> Winter, Alps, Hiking, Engadin, Outdoor, Mountains,
Nature, Switzerland, Snow, Adventure, Travel, Trekking, Cold, Scenic,
Landscape, Exploration, Snowy Trail, Dog Walking, Companion, Pet, Leisure
Activity, Friendship, Human and Animal, Exercise, Backpack, Cloudy Sky,
Footprints, Hiking Trail, Active Lifestyle, Remote Location, Wide Angle View</p>
<p>OpenAI exposes their models through an API. Unfortunately, this means
that we need to use a self-programmed tool to get our images into their model,
and get their response back. Their business model is that each request costs a
small amount of compute credits, which can be bought directly (or billed)
through their website.</p>
<h2 id="my-workflow">My workflow</h2>
<p>I normally go about my year, and organise my photos in Lightroom. Once in a
while I say to myself, wow I should really get these 100 images onto e.g. Adobe
Stock. However, I don’t generally use the tagging system except for really
specific tags, nor do I use captions. Hence, getting my images into a discoverable
state will take quite some work. Enter image-to-text plus LLM. To follow along you
need to be a bit savvy with computers – the minimal requirement is to at
least have Python 3 installed.</p>
<p>My ideal approach is that I tag each series at least with a location, and
possibly some very general keywords about a whole series of photos. Then, using
a combination of various Python packages (e.g. <code class="language-plaintext highlighter-rouge">PIL</code> and a very outdated module
called <code class="language-plaintext highlighter-rouge">piexif</code> which should definitely be replaced), I was able to hack
together a script that takes a bunch of input images and queries each image through
OpenAI’s model together with the following request:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want thirty keywords to describe this image for Adobe Stock, targeted
towards discoverability. These keywords are already present: {existing_kws},
please include the ones that are relevant or location specific. Please output
them comma separated. Please as the first entry, output an editorialized title,
also separated by commas. Don't output any other characters.
</code></pre></div></div>
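<p>For reference, below is a minimal sketch of how such a query could look against the <code class="language-plaintext highlighter-rouge">gpt-4-vision-preview</code> endpoint via the official <code class="language-plaintext highlighter-rouge">openai</code> Python client. It is not my exact script: the helper name, the inline base64 upload and the <code class="language-plaintext highlighter-rouge">max_tokens</code> value are illustrative choices.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import base64
from openai import OpenAI  # openai-python v1.x client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def keywords_for(image_path: str, existing_kws: str) -> str:
    # Encode the (ideally downsized) JPEG as base64 so it can be sent inline.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "I want thirty keywords to describe this image for Adobe Stock, targeted "
        f"towards discoverability. These keywords are already present: {existing_kws}, "
        "please include the ones that are relevant or location specific. Please output "
        "them comma separated. Please as the first entry, output an editorialized title, "
        "also separated by commas. Don't output any other characters."
    )

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # The model answers with "Title, keyword1, keyword2, ..." as plain text.
    return response.choices[0].message.content
</code></pre></div></div>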
<p><img src="/assets/DSC5692.jpg" alt="" /></p>
<p><strong>Original keywords</strong>
Engadin, Switzerland</p>
<p><strong>Generated title</strong>
Starry Night over Swiss Alpine Cabin</p>
<p><strong>Generated keywords</strong> Switzerland, Night, Milky Way, Alps, Stars, Outdoor,
Engadin, Hiking, Nature, Mountains, Astronomy, Astrophotography,
Constellations, Galaxy, Universe, Stargazing, Cosmic, Night Sky, Dark, Stone
Cabin, Secluded, Travel, Adventure, Wilderness, Remote, Cosy, Twilight,
Serenity, Tranquil, Landscape</p>
<p>The neatest thing? I was able to put all this information back into the file in
one go, such that the only thing that needs to be done afterwards is to upload
to (e.g.) Adobe Stock. Their platform automatically recognises the embedded
keywords and the generated title.</p>
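<p>Writing the results back into the file can be done with <code class="language-plaintext highlighter-rouge">piexif</code>, roughly as in the sketch below (not my exact code). I use the Windows-style <code class="language-plaintext highlighter-rouge">XPTitle</code>/<code class="language-plaintext highlighter-rouge">XPKeywords</code> EXIF tags here for illustration; which fields your stock platform actually reads is worth double-checking.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import piexif


def embed_metadata(image_path: str, title: str, keywords: list) -> None:
    # Load whatever EXIF is already in the file, so we only add to it.
    exif_dict = piexif.load(image_path)

    # The Windows XP* tags are stored as UTF-16LE byte strings.
    exif_dict["0th"][piexif.ImageIFD.XPTitle] = title.encode("utf-16le")
    exif_dict["0th"][piexif.ImageIFD.XPKeywords] = ";".join(keywords).encode("utf-16le")

    # Re-insert the modified EXIF block into the JPEG in place.
    piexif.insert(piexif.dump(exif_dict), image_path)
</code></pre></div></div>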
<p>In total, I was able to tag 127 images for about 75 cents, roughly 0.6 cents per
image. Now, for some people this might be worth it, while for others it might not
be. If you know Python, give my code a spin (linked below). Please make sure you know what you are doing,
and don’t run it on large folders! You might run into unexpected costs for which
I am not responsible.</p>
<p>Check out the <a href="https://github.com/larsgeb/vision-keywords">code here</a>. If you’re interested in my actual photography work, check out
<a href="https://larsgebraad.com/">larsgebraad.com</a>.</p>Lars GebraadI am … not sure what I fully am. I have a job as an R&D engineer. At the same time, I have a degree in the somewhat related field of seismic imaging, where I’ve dabbled with different algorithms. I also own a camera and have been known to side-hustle with it.Voting behaviour in Dutch Parliament2023-11-21T00:00:00+00:002023-11-21T00:00:00+00:00https://larsgeb.github.io/2023/11/21/dutch-politics<p>In the age of information, providing voters with accessible and comprehensive statistics is crucial for fostering an informed electorate. I made a visualisation website that explores the behaviour of all fractions of the Dutch Parliament – specifically, how they correlate over the years. This visualisation is available as <a href="https://github.com/larsgeb/Polertiek">Python code</a>, or simply the <a href="https://protected-sea-76942-d6d7d105e2ef.herokuapp.com">interactive website</a>! It aims to show what parties do, not what they say.</p>
<p><img src="/assets/voting_2022.png" alt="here" /></p>
<p>The app is built using:</p>
<ul>
<li>Flask</li>
<li>Numpy</li>
<li>Scikit</li>
<li>Scipy</li>
<li>Pandas</li>
<li>sqlite3</li>
<li>git-lfs</li>
<li>heroku</li>
<li>plotly.js</li>
<li>Jupyter</li>
</ul>
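<p>The gist of the analysis (not the exact code, which lives in the repository linked above) is to pivot the voting records into a party-by-vote matrix and correlate the columns. The file name and column names below are assumptions about the dataset layout, made up for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Hypothetical layout: one row per (vote, party) with the party's position.
votes = pd.read_csv("dutch_parliament_votes.csv")  # columns: vote_id, party, position

# Encode positions numerically: in favour ("voor") = +1, against ("tegen") = -1.
votes["position_num"] = votes["position"].map({"voor": 1, "tegen": -1})

# Party-by-vote matrix, then pairwise correlation between parties.
matrix = votes.pivot_table(index="vote_id", columns="party", values="position_num")
party_correlation = matrix.corr()  # missing votes are ignored pairwise

print(party_correlation.round(2))
</code></pre></div></div>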
<p>Note that of course this wouldn’t be possible without the <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UXIBNO">permissive dataset</a> (CC BY 4.0) from <a href="http://rdcu.be/o6TC">Louwerse et al.</a>!</p>

Persistent Remote Notebooks for VSCode (2023-04-25, https://larsgeb.github.io/2023/04/25/persistent-notebooks-vscode)

<p>For a lot of the data analyses we do, we work in Python notebooks. They allow us to nicely present code and data. Sometimes, however, we need the performance of a large workstation, and as such we might want to run notebooks remotely.</p>
<p>A classic technique to do this is to open an SSH connection with a port forwarded to your local machine, start a Jupyter notebook server on that port, and open it on your local machine via the browser. This is decent, but it won’t give you the benefits of an IDE like VSCode.</p>
<p>This manual is for those who prefer to work with VSCode, but would also like to have notebooks persist across sessions. Remote connections in VSCode allow you to remotely run notebooks, but their kernels will be terminated as soon as VSCode on your local machine shuts down. Using the steps below, you will create a persistent (named!) tmux session, to which VSCode automatically connects. In this tmux session, a notebook server is running, to which one can connect the notebooks opened with VSCode.</p>
<p><strong>Warning: This leaves notebooks open on a remote machine, possibly eating away lots of RAM.</strong></p>
<h2 id="assumptions">Assumptions</h2>
<ul>
<li>You have an SSH connection to your remote machine.</li>
<li>You have your Python stack ready on the remote machine.</li>
</ul>
<h2 id="instructions">Instructions</h2>
<ul>
<li>Create a remote connection in VSCode:
<ul>
<li>Click the blue arrows all the way in the bottom right.</li>
<li>Then select “Connect Current Window to Host…”.</li>
<li>Then select the appropriate connection.</li>
</ul>
</li>
</ul>
<p><img src="/assets/ss1.png" alt="here" />
<img src="/assets/ss2.png" alt="here" /></p>
<ul>
<li>
<p>Give it a few seconds goshdarnit, it’s only a machine. It needs to install the VSCode server on the remote machine.</p>
</li>
<li>
<p>Install the Python extension on the remote. This is done by opening the Extensions view (the little stack of blocks in the left bar), typing in “Python”, selecting the top extension (made by Microsoft), and installing it on the remote machine.</p>
</li>
<li>
<p>Open the folder which you’d like to use.</p>
</li>
</ul>
<p><img src="/assets/ss3.png" alt="here" /></p>
<ul>
<li>Now, we create a persistent tmux terminal, which will serve as the place where the notebooks will run. We want to open up (and this is important) the JSON settings file for the Remote:
<ul>
<li>Open the Command Palette (MacOS: Command+Shift+P, mouse: View > Command Palette).</li>
<li>Type Remote JSON and click “Preferences: Open Remote Settings (JSON) (SSH: <wherever>)”.</li>
</ul>
</li>
</ul>
<p><img src="/assets/ss4.png" alt="here" /></p>
<ul>
<li>Add the following lines and save the file. <strong>You might want to change</strong> the fourth of the arguments passed to tmux, as this will be the name of the tmux session!
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"terminal.integrated.profiles.linux"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"Persistent tmux"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/usr/bin/tmux"</span><span class="p">,</span><span class="w">
</span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"new-session"</span><span class="p">,</span><span class="w">
</span><span class="s2">"-A"</span><span class="p">,</span><span class="w">
</span><span class="s2">"-s"</span><span class="p">,</span><span class="w">
</span><span class="s2">"persistent-vscode-lars"</span><span class="p">,</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"terminal.integrated.defaultProfile.linux"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Persistent tmux"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div> </div>
</li>
</ul>
<p><img src="/assets/ss5.png" alt="here" /></p>
<p>This will now start (or reconnect to) a tmux with the name <code class="language-plaintext highlighter-rouge">persistent-vscode-lars</code> (or whatever you named it) whenever this VSCode folder is opened. Note that it will <em>never</em> shut this tmux session down by itself.</p>
<ul>
<li>Start a terminal inside VSCode (MacOS: Control + backtick, `). This should open a tmux session, as seen in the screenshot.</li>
</ul>
<p><img src="/assets/ss6.png" alt="here" /></p>
<p>Now when investigating with htop (e.g. using filter <code class="language-plaintext highlighter-rouge">persistent</code>), one will find the process:</p>
<p><img src="/assets/sstmux.png" alt="here" /></p>
<ul>
<li>
<p>To test this out, just close VSCode! When I bring it back up on my machine on this remote folder, it actually reopens the terminal with its previous state (as it is reconnecting to the tmux session).</p>
</li>
<li>
<p>When necessary, one can always open a normal terminal by using the dropdown next to the plus sign on the terminal pane:</p>
</li>
</ul>
<p><img src="/assets/ss7.png" alt="here" /></p>
<ul>
<li>Start a notebook server in the tmux terminal, and make a note of the address. For me, this needs the following:
<ul>
<li>Activate my conda environment <code class="language-plaintext highlighter-rouge">conda activate whatever</code>.</li>
<li>Start <code class="language-plaintext highlighter-rouge">jupyter notebook</code>.</li>
<li>Copy the address (and possibly the token) from the output; for me, this was <code class="language-plaintext highlighter-rouge">http://localhost:8890/?token=<...></code>.</li>
</ul>
</li>
</ul>
<p><img src="/assets/ss8.png" alt="here" /></p>
<p>At this point, it is recommended to collapse (but not terminate!) your terminal pane. Do this by hitting “Command + backtick (`)” twice, pressing “Command + J” on MacOS, or simply right-clicking the top bar of the terminal and hiding it.</p>
<ul>
<li>
<p>Create a new notebook to test our configuration (File > New File … > Jupyter Notebook).</p>
</li>
<li>
<p>Before you do anything in this notebook, hit the top right “Select Kernel”, followed by “Existing Jupyter Server…” in the pop-up. You might have to click “Select Another Kernel…” before seeing the Jupyter option.</p>
</li>
</ul>
<p><img src="/assets/ss9.png" alt="here" /></p>
<ul>
<li>Select “Enter the URL of the running Jupyter server”.</li>
</ul>
<p><img src="/assets/ss10.png" alt="here" /></p>
<ul>
<li>Paste the address that you copied from the terminal.</li>
</ul>
<p><img src="/assets/ss11.png" alt="here" /></p>
<ul>
<li>Now, a selection of conda environments pops up. Since I only have one (the base environment) I select this. But feel free to select your preferred environment/kernel.</li>
</ul>
<p><img src="/assets/ss12.png" alt="here" /></p>
<ul>
<li>Write some notebook you wish to be persistent. I preferred to generate some random numbers, store them in a variable, and check if they are persistent between VSCode sessions; a minimal cell for this is sketched below.</li>
</ul>
<p><img src="/assets/ss13.png" alt="here" /></p>
<ul>
<li>Shut VSCode down for however long! When you come back to it, your notebooks should be intact, with variables!</li>
</ul>
<p><img src="/assets/ss14.png" alt="here" /></p>Lars GebraadFor a lot of the data-analyses we do, we work in Python notebooks. They allow us to nicely present code and data. Sometimes, however, we need the performance of a large workstation, and as such we might want to run notebooks remotely.M1 GPUs for C++ science: accelerating PDEs2022-06-07T00:00:00+00:002022-06-07T00:00:00+00:00https://larsgeb.github.io/2022/06/07/m1-gpu<p><a href="https://doi.org/10.48550/arXiv.2206.01791">Relevant preprint</a>,
<a href="https://larsgeb.github.io/2022/04/22/m1-gpu.html">previous post</a>.</p>
<p>You might have read the two previous posts detailing benchmarks for array operations on
the M1 GPU. We extended this work a bit to simulating partial differential equations for
e.g. tomographic imaging. The original abstract for this work is given below, but to get
the full read, head over to arXiv! A peer-reviewed version of this manuscript is in the
works.</p>
<h2 id="original-abstract">Original abstract</h2>
<p>The M1 series of chips produced by Apple has proven a capable and power-efficient alternative to mainstream
Intel and AMD x86 processors for everyday tasks. Additionally, the unified design integrating the central
processing and graphics processing units has allowed these M1 chips to excel at many tasks with heavy
graphical requirements without the need for a discrete graphical processing unit (GPU), and in some cases
even outperforming discrete GPUs.</p>
<p>In this work, we show how the M1 chips can be leveraged using the Metal Shading Language (MSL) to accelerate
typical array operations in C++. More importantly, we show how the usage of MSL avoids the typical
complexity of CUDA or OpenACC memory management, by allowing the central processing unit (CPU) and GPU to
work in unified memory. We demonstrate how performant the M1 chips are on standard 1D and 2D array
operations such as array addition, SAXPY and finite difference stencils, with respect to serial and OpenMP
accelerated CPU code. The reduced complexity of implementing MSL also allows us to accelerate an existing
elastic wave equation solver (originally based on OpenMP accelerated C++) using MSL, with minimal effort,
while retaining all CPU and OpenMP functionality.</p>
<p>The resulting performance gain of simulating the wave equation is near an order of magnitude for specific
settings. This gain attained from using MSL is similar to other GPU-accelerated wave-propagation codes with
respect to their CPU variants, but does not come with the much increased programming complexity that prohibits the
typical scientific programmer from leveraging these accelerators. This result shows how unified processing units
can be a valuable tool to seismologists and computational scientists in general, lowering the bar to writing
performant codes that leverage modern GPUs.</p>
<h2 id="demo-results">Demo results</h2>
<p>The figure below, taken from the preprint, shows how considerably MSL can speed up array
operations once they are large enough; at the largest sizes, the speed-up nears an order of magnitude.</p>
<p><img src="/assets/combined_02.png" alt="here" /></p>Lars GebraadRelevant preprint, previous post.M1 GPUs for C++ science: SAXPY and finite differences2022-04-22T00:00:00+00:002022-04-22T00:00:00+00:00https://larsgeb.github.io/2022/04/22/m1-gpu<p><a href="https://github.com/larsgeb/m1-gpu-cpp">Relevant repo</a>,
<a href="https://larsgeb.github.io/2022/04/20/m1-gpu.html">previous post</a>.</p>
<p>In this follow-up I want to look at some more practical uses for shaders written in
the Metal Shading Language for scientific C++ code. I would like to end up at the point
where we can comfortably use MSL combined with C++ to solve PDEs faster than we would on
the CPU. To that end, let’s see if we can get SAXPY and finite differences to work
in MSL.</p>
<h2 id="sending-off-commands">Sending off commands</h2>
<p>In <a href="https://larsgeb.github.io/2022/04/20/m1-gpu.html">this previous post</a>, we saw how to
create an object to keep track of all MSL parts: it contained multiple buffers for
arrays (<code class="language-plaintext highlighter-rouge">MTL::Buffer</code>, using a mode s.t. they were accessible from CPU and GPU), the
function pointers (pointers to <code class="language-plaintext highlighter-rouge">MTL::Function</code>) and pipelines (pointers to
<code class="language-plaintext highlighter-rouge">MTL::ComputePipelineState</code>), a command queue to pass instructions to the metal device
(<code class="language-plaintext highlighter-rouge">MTL::CommandQueue</code>) and a few functions that allowed us to roll up all the ingredients
of a calculation (i.e. buffers + functions/pipelines) and send them to the command
queue. Let’s have a bit more of a dive into the internals of running shaders on Metal
devices with this approach, to see how we can improve to run many different shaders from
one library.</p>
<p>The previous class we built, <code class="language-plaintext highlighter-rouge">MetalAdder</code>, used a method called <code class="language-plaintext highlighter-rouge">sendComputeCommand()</code>
to set off our kernel computation.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalAdder</span><span class="o">::</span><span class="n">sendComputeCommand</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// Create a command buffer to hold commands.</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="o">*</span><span class="n">commandBuffer</span> <span class="o">=</span> <span class="n">_mCommandQueue</span><span class="o">-></span><span class="n">commandBuffer</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">commandBuffer</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="c1">// Create an encoder that translates our command to something the</span>
<span class="c1">// device understands</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span> <span class="o">=</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">computeCommandEncoder</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">computeEncoder</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="c1">// Wrap our computation and data up into something our Metal device</span>
<span class="c1">// understands</span>
<span class="n">encodeAddCommand</span><span class="p">(</span><span class="n">computeEncoder</span><span class="p">);</span>
<span class="c1">// Signal that we have encoded all we want.</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="c1">// Execute the command buffer.</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="c1">// Normally, you want to do other work in your app while the shader</span>
<span class="c1">// is running, but in this example, the code simply blocks until</span>
<span class="c1">// the calculation is complete.</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Internally, it does a few things:</p>
<ul>
<li>Retrieve a command buffer linked to the command queue. We need to place our GPU
instruction into this buffer.</li>
<li>Create a command encoder, an object that takes our buffers (<code class="language-plaintext highlighter-rouge">MTL::Buffer </code>, those
things that contain our actual data) and pipelines (basically computational
instructions) and parses them to something our hardware understands. It turns out we
can’t simply push <code class="language-plaintext highlighter-rouge">a+b</code> to our command buffer.</li>
<li>Encode the add instruction and its buffers, for which we use another class function
that we will get to.</li>
<li>Signal that we encoded all we want (<code class="language-plaintext highlighter-rouge">computeEncoder->endEncoding()</code>).</li>
<li>Signal that the command buffer can start sending its commands to the hardware
(<code class="language-plaintext highlighter-rouge">commandBuffer->commit()</code>).</li>
<li>Wait until the hardware is done (<code class="language-plaintext highlighter-rouge">commandBuffer->waitUntilCompleted()</code>).</li>
</ul>
<p>After all this is done, the function exits. Because we specifically wait for the buffer
to complete, we can see this as a blocking array operation. One could remove the waiting
for the buffer, and additionally return the command buffer, such that the array
operations can be performed asynchronously, and checked at a later time.</p>
<p>One obvious failing of this class and method in its original state was that the buffers
themselves were linked to the class. To make our class more useful, we would like to
pass the buffers, such that we can apply the operator to any arbitrary buffer defined
elsewhere. Our class doesn’t change much functionally, other than accepting buffers and
having a slightly clearer name.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">addArrays</span><span class="p">(</span><span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">x_array</span><span class="p">,</span>
<span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">y_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">r_array</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="o">*</span><span class="n">commandBuffer</span> <span class="o">=</span> <span class="n">_mCommandQueue</span><span class="o">-></span><span class="n">commandBuffer</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">commandBuffer</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span> <span class="o">=</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">computeCommandEncoder</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">computeEncoder</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="c1">/// Encoding</span>
<span class="c1">/// ..omitted..</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="c1">// You could put this somewhere else, if you want asynchronous</span>
<span class="c1">// computations</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="encode-that-operator">Encode that operator</h2>
<p>The compute method from our initial class (<code class="language-plaintext highlighter-rouge">MetalAdder</code>) called on another method to
encode the instructions and data, <code class="language-plaintext highlighter-rouge">encodeAddCommand()</code>. Other than taking care of the
instructions and data, it also computed in what shape to launch the kernel, something
possibly familiar to those who have seen CUDA:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalAdder</span><span class="o">::</span><span class="n">encodeAddCommand</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Encode the pipeline state object and its parameters.</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mAddFunctionPSO</span><span class="p">);</span>
<span class="c1">// Place data in encoder</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">_mBufferA</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">_mBufferB</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">_mBufferResult</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">gridSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">arrayLength</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="c1">// Calculate a threadgroup size.</span>
<span class="n">NS</span><span class="o">::</span><span class="n">UInteger</span> <span class="n">threadGroupSize</span> <span class="o">=</span>
<span class="n">_mAddFunctionPSO</span><span class="o">-></span><span class="n">maxTotalThreadsPerThreadgroup</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">threadGroupSize</span> <span class="o">></span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">threadGroupSize</span> <span class="o">=</span> <span class="n">arrayLength</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">threadgroupSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">threadGroupSize</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="c1">// Encode the compute command.</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>After encoding the instructions (the pipeline of the add function) and data, this
function does the following:</p>
<ul>
<li>Compute how many total processes need to be run (<code class="language-plaintext highlighter-rouge">gridSize</code>). This is simply the
length of our two arrays in this case.</li>
<li>Figure out how many processes the hardware <em>could</em> run at a time, by having a gander
at the pipeline: <code class="language-plaintext highlighter-rouge">_mAddFunctionPSO->maxTotalThreadsPerThreadgroup()</code>. I guess it does
realize that for some pipelines, this number varies based on shader properties. It
stores this in <code class="language-plaintext highlighter-rouge">threadGroupSize</code>.</li>
<li>If the array is shorter than the amount of threads that can be run at a time, reduce
the size of the thread group.</li>
<li>Encode the size of the grid and the thread group.</li>
</ul>
<p>You might notice that the sizes in MTL have three entries. This is (I think) because MSL
can natively handle up to 3 dimensional grids when performing computations on
accelerator hardware. Accessing arrays in a two or three dimensional fashion is sometimes
faster. This can happen e.g. when neighbouring pixels in all three dimensions are
required in the computation of the kernel, leading to memory access patterns that can be
optimized by the software, similar to how CUDA works for higher dimensional data. Let’s
leave this for another time!</p>
<p>To homogenize this functionality with our new class, we combine this and the previously
discussed method into a single method:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">addArrays</span><span class="p">(</span><span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">x_array</span><span class="p">,</span>
<span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">y_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">r_array</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="o">*</span><span class="n">commandBuffer</span> <span class="o">=</span> <span class="n">_mCommandQueue</span><span class="o">-></span><span class="n">commandBuffer</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">commandBuffer</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span> <span class="o">=</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">computeCommandEncoder</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">computeEncoder</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mAddFunctionPSO</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">x_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">gridSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">arrayLength</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">NS</span><span class="o">::</span><span class="n">UInteger</span> <span class="n">threadGroupSize</span> <span class="o">=</span>
<span class="n">_mAddFunctionPSO</span><span class="o">-></span><span class="n">maxTotalThreadsPerThreadgroup</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">threadGroupSize</span> <span class="o">></span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">threadGroupSize</span> <span class="o">=</span> <span class="n">arrayLength</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">threadgroupSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">threadGroupSize</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>All-in-all, our C++ code has now become pretty lean. We do create a new command buffer
and encoder every time we want to use the Metal hardware. Although I am not sure of the
overhead of constructing these objects, the total runtime of the kernels is still much
faster than that of the serial or OpenMP equivalents.</p>
<p>In fact, we can try to encode more operations into a single encoder and send them to the
same buffer, to get a feel for the induced overhead. Let’s try to chain an <code class="language-plaintext highlighter-rouge">add_arrays</code>
and an <code class="language-plaintext highlighter-rouge">multiply_arrays</code> shader together. I haven’t discussed the <code class="language-plaintext highlighter-rouge">multiply_arrays</code>
shader before, but its implementation can be found in the repo, and only differs from
our previous <code class="language-plaintext highlighter-rouge">add_arrays</code> shader by one symbol. Extrapolating our previous approach to
run one shader, we might end up with something like the following for two shaders:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">// encoder and buffer created, grid size computed, beforehand</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mAddFunctionPSO</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">x_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mMultiplyFunctionPSO</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
</code></pre></div></div>
<p>This compiles just fine. However, upon running we encounter a segmentation fault trying
the second <code class="language-plaintext highlighter-rouge">setComputePipelineState</code>. This seems logical, because we specifically told
the encoder we were done encoding commands. Removing the first <code class="language-plaintext highlighter-rouge">endEncoding</code> yields the
following message just before aborting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-[IOGPUMetalCommandBuffer validate]:200: failed assertion `commit command buffer with uncommitted encoder'
[1] 57876 abort ./benchmark.x
</code></pre></div></div>
<p>What we actually have to do, is encode the entire chain of pipelines and buffer
operations, and only then end the encoding, and commit the buffer:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">// encoder and buffer created, grid size computed, beforehand</span>
<span class="kt">size_t</span> <span class="n">index</span> <span class="o">=</span> <span class="mi">102</span><span class="p">;</span> <span class="c1">// some random place in the array</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Initial state"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"x: "</span> <span class="o"><<</span> <span class="n">x</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="s">" y: "</span>
<span class="o"><<</span> <span class="n">y</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="s">" r: "</span>
<span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">expected_result</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="o">*</span> <span class="n">y</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Formula: (x + y) * y"</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Expected result: "</span> <span class="o"><<</span> <span class="n">expected_result</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mAddFunctionPSO</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">x_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current result: "</span> <span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_mMultiplyFunctionPSO</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current result: "</span> <span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current result: "</span> <span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current result: "</span> <span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"Current result: "</span> <span class="o"><<</span> <span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initial state
x: 0.287212 y: 0.704251 r: 0
Formula: (x + y) * y
Expected result: 0.698239
Current result: 0
Current result: 0
Current result: 0
Current result: 0
Current result: 0.698239
</code></pre></div></div>
<p>We can see how the actual computation is only performed somewhere between <code class="language-plaintext highlighter-rouge">commit</code>
and <code class="language-plaintext highlighter-rouge">waitUntilCompleted</code>. This chaining of operators might seem arbitrary, but when
benchmarked against the serial use of the <code class="language-plaintext highlighter-rouge">add_arrays</code> and <code class="language-plaintext highlighter-rouge">multiply_arrays</code> shaders
a speedup of 5% was achieved (see <code class="language-plaintext highlighter-rouge">chaining_operators.cpp</code>, serial op:
<code class="language-plaintext highlighter-rouge">4800.38ms +/- 57.3528ms</code>, compound op: <code class="language-plaintext highlighter-rouge">4544.57ms +/- 30.7632ms</code>), suggesting that for
complex computations optimizing out the overhead of constructing buffers, placing data
in them and encoding the instructions might be worthwhile. However, compared to the speed-up
we already get over the serial and OpenMP CPU code, this extra gain is marginal.</p>
<h2 id="code-duplication-hell">Code duplication hell</h2>
<p>So far, we’ve created two shaders. For these shaders, we’ve had to retrieve the
function reference and its pipeline state object (PSO) in our class constructor, and
whenever we launch a shader, we call again on the PSO to set the instructions and
determine the threadGroupSize.</p>
<p>This way, when we grow our library in our metal code (contained for this write-up in
<code class="language-plaintext highlighter-rouge">ops.metal</code>), we will have to keep adding attributes to our class for the PSOs, and
a-priori know what functions we should load:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MetalOperations</span><span class="o">::</span><span class="n">MetalOperations</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">Device</span> <span class="o">*</span><span class="n">device</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_mDevice</span> <span class="o">=</span> <span class="n">device</span><span class="p">;</span>
<span class="n">NS</span><span class="o">::</span><span class="n">Error</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">filepath</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"./ops.metallib"</span><span class="p">,</span>
<span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Library</span> <span class="o">*</span><span class="n">opLibrary</span> <span class="o">=</span> <span class="n">_mDevice</span><span class="o">-></span><span class="n">newLibrary</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span> <span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="c1">// Loading functions</span>
<span class="k">auto</span> <span class="n">str</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"add_arrays"</span><span class="p">,</span> <span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Function</span> <span class="o">*</span><span class="n">addFunction</span> <span class="o">=</span> <span class="n">defaultLibrary</span><span class="o">-></span><span class="n">newFunction</span><span class="p">(</span><span class="n">str</span><span class="p">);</span>
<span class="n">str</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"multiply_arrays"</span><span class="p">,</span> <span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Function</span> <span class="o">*</span><span class="n">multiplyFunction</span> <span class="o">=</span> <span class="n">defaultLibrary</span><span class="o">-></span><span class="n">newFunction</span><span class="p">(</span><span class="n">str</span><span class="p">);</span>
<span class="n">str</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">,</span> <span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Function</span> <span class="o">*</span><span class="n">saxpyFunction</span> <span class="o">=</span> <span class="n">defaultLibrary</span><span class="o">-></span><span class="n">newFunction</span><span class="p">(</span><span class="n">str</span><span class="p">);</span>
<span class="c1">// Loading PSOs</span>
<span class="n">_mAddFunctionPSO</span> <span class="o">=</span>
<span class="n">_mDevice</span><span class="o">-></span><span class="n">newComputePipelineState</span><span class="p">(</span><span class="n">addFunction</span><span class="p">,</span> <span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="n">_mMultiplyFunctionPSO</span> <span class="o">=</span>
<span class="n">_mDevice</span><span class="o">-></span><span class="n">newComputePipelineState</span><span class="p">(</span><span class="n">multiplyFunction</span><span class="p">,</span> <span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="n">_mSaxpyFunctionPSO</span> <span class="o">=</span>
<span class="n">_mDevice</span><span class="o">-></span><span class="n">newComputePipelineState</span><span class="p">(</span><span class="n">saxpyFunction</span><span class="p">,</span> <span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="c1">// .. error handling</span>
<span class="c1">// .. other constructor operations</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This ungodly work surely can be automated… And behold, one could simply determine all
names of the functions in a Metal library using <code class="language-plaintext highlighter-rouge">MTL::Library::functionNames();</code>. This
leads to a much leaner way to create PSOs for all functions:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MetalOperations</span><span class="o">::</span><span class="n">MetalOperations</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">Device</span> <span class="o">*</span><span class="n">device</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_mDevice</span> <span class="o">=</span> <span class="n">device</span><span class="p">;</span>
<span class="n">NS</span><span class="o">::</span><span class="n">Error</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">filepath</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"./ops.metallib"</span><span class="p">,</span>
<span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Library</span> <span class="o">*</span><span class="n">opLibrary</span> <span class="o">=</span> <span class="n">_mDevice</span><span class="o">-></span><span class="n">newLibrary</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span> <span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="c1">// Get all function names</span>
<span class="k">auto</span> <span class="n">fnNames</span> <span class="o">=</span> <span class="n">opLibrary</span><span class="o">-></span><span class="n">functionNames</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">fnNames</span><span class="o">-></span><span class="n">count</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">name_nsstring</span> <span class="o">=</span> <span class="n">fnNames</span><span class="o">-></span><span class="n">object</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">-></span><span class="n">description</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">name_utf8</span> <span class="o">=</span> <span class="n">name_nsstring</span><span class="o">-></span><span class="n">utf8String</span><span class="p">();</span>
<span class="c1">// Load function into a map</span>
<span class="n">functionMap</span><span class="p">[</span><span class="n">name_utf8</span><span class="p">]</span> <span class="o">=</span> <span class="n">opLibrary</span><span class="o">-></span><span class="n">newFunction</span><span class="p">(</span><span class="n">name_nsstring</span><span class="p">);</span>
<span class="c1">// Create pipeline from function</span>
<span class="n">functionPipelineMap</span><span class="p">[</span><span class="n">name_utf8</span><span class="p">]</span> <span class="o">=</span>
<span class="n">_mDevice</span><span class="o">-></span><span class="n">newComputePipelineState</span><span class="p">(</span><span class="n">functionMap</span><span class="p">[</span><span class="n">name_utf8</span><span class="p">],</span>
<span class="o">&</span><span class="n">error</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This creates a map of all pipelines, which we can look up by their UTF-8 names.
This, for example, simplifies calling our <code class="language-plaintext highlighter-rouge">add_arrays</code> shader:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">_m</span><span class="p">);</span>
<span class="c1">// becomes</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">functionPipelineMap</span><span class="p">[</span><span class="s">"add_arrays"</span><span class="p">]);</span>
</code></pre></div></div>
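<p>For completeness, the two lookup maps used above are simply class members keyed by the
shader’s name (a sketch; the names follow the constructor code above):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Members of MetalOperations (sketch): shader lookup tables populated in the constructor.
std::map<std::string, MTL::Function *> functionMap;
std::map<std::string, MTL::ComputePipelineState *> functionPipelineMap;
</code></pre></div></div>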
<p>However, this does not make all of our C++ shader-launching code redundant. Although the
<code class="language-plaintext highlighter-rouge">add_arrays</code> and <code class="language-plaintext highlighter-rouge">multiply_arrays</code> launchers in C++ (<code class="language-plaintext highlighter-rouge">addArrays</code> and <code class="language-plaintext highlighter-rouge">multiplyArrays</code>) are
one-to-one duplicates apart from the PSO they use, that is only because the two kernels
share the same signature. This becomes evident when we create the SAXPY shader.</p>
<p>The SAXPY shader, single-precision a times x plus y, performs almost the same operation as
<code class="language-plaintext highlighter-rouge">add_arrays</code>. However, the added scalar constant changes the signature:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kernel</span> <span class="kt">void</span> <span class="nf">saxpy</span><span class="p">(</span><span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">a</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">0</span><span class="p">)]],</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">X</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">1</span><span class="p">)]],</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">Y</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">2</span><span class="p">)]],</span>
<span class="n">device</span> <span class="kt">float</span><span class="o">*</span> <span class="n">result</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">3</span><span class="p">)]],</span>
<span class="n">uint</span> <span class="n">index</span> <span class="p">[[</span><span class="n">thread_position_in_grid</span><span class="p">]])</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">+</span> <span class="n">Y</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We have now added another pointer to a (device) float: <code class="language-plaintext highlighter-rouge">a</code>. Additionally, you might
notice a difference from our previous shader implementations, which did not include the
<code class="language-plaintext highlighter-rouge">[[buffer(*)]]</code> attributes; these explicitly bind each argument to a specific buffer index,
matching the index at which our encoder places that buffer.</p>
<blockquote>
<p>For a better explanation of the buffer indices and other shader inputs, see
<code class="language-plaintext highlighter-rouge">5.2.1 Locating Buffer, Texture, and Sampler Arguments</code> (page 79) in the
<code class="language-plaintext highlighter-rouge">Metal-Shading-Language-Specification.pdf</code> available
<a href="https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf">here</a>.</p>
</blockquote>
<p>The changed signature does mean that we have a different number of buffers to
place in our encoder, which is why we cannot reuse <code class="language-plaintext highlighter-rouge">MetalOperations::addArrays</code> for the
<code class="language-plaintext highlighter-rouge">MetalOperations::saxpyArrays</code> method:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">saxpyArrays</span><span class="p">(</span><span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">alpha</span><span class="p">,</span>
<span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">x_array</span><span class="p">,</span>
<span class="k">const</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">y_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">r_array</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="o">*</span><span class="n">commandBuffer</span> <span class="o">=</span> <span class="n">_mCommandQueue</span><span class="o">-></span><span class="n">commandBuffer</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">commandBuffer</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span> <span class="o">=</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">computeCommandEncoder</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">computeEncoder</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="c1">// --- unique code ---</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">functionPipelineMap</span><span class="p">[</span><span class="s">"saxpy"</span><span class="p">]);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">x_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">y_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">r_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="n">NS</span><span class="o">::</span><span class="n">UInteger</span> <span class="n">threadGroupSize</span> <span class="o">=</span>
<span class="n">functionPipelineMap</span><span class="p">[</span><span class="s">"saxpy"</span><span class="p">]</span><span class="o">-></span><span class="n">maxTotalThreadsPerThreadgroup</span><span class="p">();</span>
<span class="c1">// -------------------</span>
<span class="k">if</span> <span class="p">(</span><span class="n">threadGroupSize</span> <span class="o">></span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="n">threadGroupSize</span> <span class="o">=</span> <span class="n">arrayLength</span><span class="p">;</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">threadgroupSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">threadGroupSize</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">gridSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">arrayLength</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The remaining duplication can now be factored out into a helper. Since the only things that
change between launchers are the buffers (and their count) and the function reference, we can
abstract the shader launch away in the following way:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">Blocking1D</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*></span> <span class="n">buffers</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="o">*</span><span class="n">commandBuffer</span> <span class="o">=</span> <span class="n">_mCommandQueue</span><span class="o">-></span><span class="n">commandBuffer</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">commandBuffer</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">ComputeCommandEncoder</span> <span class="o">*</span><span class="n">computeEncoder</span> <span class="o">=</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">computeCommandEncoder</span><span class="p">();</span>
<span class="n">assert</span><span class="p">(</span><span class="n">computeEncoder</span> <span class="o">!=</span> <span class="nb">nullptr</span><span class="p">);</span>
<span class="c1">// Unique code -> un-uniqued. </span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setComputePipelineState</span><span class="p">(</span><span class="n">functionPipelineMap</span><span class="p">[</span><span class="n">method</span><span class="p">]);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">buffers</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">setBuffer</span><span class="p">(</span><span class="n">buffers</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">NS</span><span class="o">::</span><span class="n">UInteger</span> <span class="n">threadGroupSize</span> <span class="o">=</span>
<span class="n">functionPipelineMap</span><span class="p">[</span><span class="n">method</span><span class="p">]</span><span class="o">-></span><span class="n">maxTotalThreadsPerThreadgroup</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">threadGroupSize</span> <span class="o">></span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="n">threadGroupSize</span> <span class="o">=</span> <span class="n">arrayLength</span><span class="p">;</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">threadgroupSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">threadGroupSize</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Size</span> <span class="n">gridSize</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">Size</span><span class="o">::</span><span class="n">Make</span><span class="p">(</span><span class="n">arrayLength</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">dispatchThreads</span><span class="p">(</span><span class="n">gridSize</span><span class="p">,</span> <span class="n">threadgroupSize</span><span class="p">);</span>
<span class="n">computeEncoder</span><span class="o">-></span><span class="n">endEncoding</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">commit</span><span class="p">();</span>
<span class="n">commandBuffer</span><span class="o">-></span><span class="n">waitUntilCompleted</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">addArrays</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">x_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">y_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">r_array</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*></span> <span class="n">buffers</span> <span class="o">=</span> <span class="p">{</span><span class="n">x_array</span><span class="p">,</span>
<span class="n">y_array</span><span class="p">,</span>
<span class="n">r_array</span><span class="p">};</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span> <span class="o">=</span> <span class="s">"add_arrays"</span><span class="p">;</span>
<span class="n">Blocking1D</span><span class="p">(</span><span class="n">buffers</span><span class="p">,</span> <span class="n">arrayLength</span><span class="p">,</span> <span class="n">method</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now, when the shader we want to call has a signature with more or fewer buffers, this is
already accommodated:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalOperations</span><span class="o">::</span><span class="n">saxpyArrays</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">x_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">y_array</span><span class="p">,</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">r_array</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*></span> <span class="n">buffers</span> <span class="o">=</span> <span class="p">{</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">x_array</span><span class="p">,</span>
<span class="n">y_array</span><span class="p">,</span>
<span class="n">r_array</span><span class="p">};</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">method</span> <span class="o">=</span> <span class="s">"saxpy"</span><span class="p">;</span>
<span class="n">Blocking1D</span><span class="p">(</span><span class="n">buffers</span><span class="p">,</span> <span class="n">arrayLength</span><span class="p">,</span> <span class="n">method</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The only requirement of this approach is that the general design of the shader does not
fundamentally change: an element-wise 1D operation that only accepts buffers as input.
Additionally, this approach required us to drop the <code class="language-plaintext highlighter-rouge">const</code> qualifier from some of our
buffers that we do not wish to modify. Heed!</p>
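<p>If you would rather keep the <code class="language-plaintext highlighter-rouge">const</code> qualifiers in the public method signatures, one
possible workaround (a sketch, not what the accompanying repository does) is to cast away
constness only at the point where the non-const buffer vector is assembled:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: keep const-correct public signatures and only drop const internally.
// Blocking1D and functionPipelineMap are the members defined above.
void MetalOperations::saxpyArrays(const MTL::Buffer *alpha,
                                  const MTL::Buffer *x_array,
                                  const MTL::Buffer *y_array,
                                  MTL::Buffer *r_array,
                                  size_t arrayLength)
{
    std::vector<MTL::Buffer *> buffers = {const_cast<MTL::Buffer *>(alpha),
                                          const_cast<MTL::Buffer *>(x_array),
                                          const_cast<MTL::Buffer *>(y_array),
                                          r_array};
    Blocking1D(buffers, arrayLength, "saxpy");
}
</code></pre></div></div>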
<h2 id="finite-differencing">Finite differencing</h2>
<p>One of the staples of scientific computing (and especially of solving PDEs) is computing
derivatives. A standard numerical approach to calculating a derivative on a
grid of values (such as an array) is finite differences, i.e. explicitly computing
differences between neighbouring points and using those to approximate
derivatives (of whatever order).</p>
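<p>Concretely, for a grid spacing <code class="language-plaintext highlighter-rouge">delta</code>, the first-order central difference approximates
the derivative at point <code class="language-plaintext highlighter-rouge">i</code> as <code class="language-plaintext highlighter-rouge">(x[i + 1] - x[i - 1]) / (2 * delta)</code>.</p>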
<p>This computation in standard serial C++ code would look something like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">central_difference</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">delta</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">r</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">index</span> <span class="o"><</span> <span class="n">arrayLength</span><span class="p">;</span> <span class="n">index</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="o">*</span><span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>By comparing our next (<code class="language-plaintext highlighter-rouge">index + 1</code>) and previous (<code class="language-plaintext highlighter-rouge">index - 1</code>) point, and dividing their
difference by their spacing (<code class="language-plaintext highlighter-rouge">2 * delta</code>), we obtain the first order (central)
derivative.</p>
<p>We do have to take care not to go out of bounds at the beginning and end of our array,
though. In those cases we use a one-sided formula instead:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">central_difference</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">delta</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">r</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">arrayLength</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">index</span> <span class="o"><</span> <span class="n">arrayLength</span><span class="p">;</span> <span class="n">index</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="o">/</span> <span class="o">*</span><span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="n">arrayLength</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="o">*</span><span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="n">r</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="o">*</span><span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>These conditional statements alone make central differencing a bit more expensive than
your average element-wise operation, but they are essential for computing the derivatives
correctly.</p>
<p>We can easily modify our SAXPY shader to work as a central differencing shader:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kernel</span> <span class="kt">void</span> <span class="nf">central_difference</span><span class="p">(</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">delta</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">0</span><span class="p">)]],</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">X</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">1</span><span class="p">)]],</span>
<span class="n">device</span> <span class="kt">float</span><span class="o">*</span> <span class="n">result</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">2</span><span class="p">)]],</span>
<span class="n">uint</span> <span class="n">index</span> <span class="p">[[</span><span class="n">thread_position_in_grid</span><span class="p">]],</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="o">/</span> <span class="o">*</span><span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="n">arrayLength</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="o">*</span><span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="o">*</span><span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
<span class="err">}</span>
</code></pre></div></div>
<p>One issue appears now, however: the shader also needs <code class="language-plaintext highlighter-rouge">arrayLength</code> to handle the
right-side boundary case, but it is not defined anywhere. We could create an extra
buffer and pass the array size as an argument, but this is rather inelegant. In MSL, we can
instead use attributes, just like <code class="language-plaintext highlighter-rouge">[[thread_position_in_grid]]</code>, to obtain characteristics
of the current dispatch:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kernel</span> <span class="kt">void</span> <span class="nf">central_difference</span><span class="p">(</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">delta</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">0</span><span class="p">)]],</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">X</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">1</span><span class="p">)]],</span>
<span class="n">device</span> <span class="kt">float</span><span class="o">*</span> <span class="n">result</span> <span class="p">[[</span><span class="n">buffer</span><span class="p">(</span><span class="mi">2</span><span class="p">)]],</span>
<span class="n">uint</span> <span class="n">index</span> <span class="p">[[</span><span class="n">thread_position_in_grid</span><span class="p">]],</span>
<span class="n">uint</span> <span class="n">arrayLength</span> <span class="p">[[</span><span class="n">threads_per_grid</span><span class="p">]])</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="o">/</span> <span class="o">*</span><span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="o">==</span> <span class="n">arrayLength</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="o">*</span><span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="n">result</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">X</span><span class="p">[</span><span class="n">index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="o">*</span><span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This <code class="language-plaintext highlighter-rouge">[[threads_per_grid]]</code> is exactly the size we passed using <code class="language-plaintext highlighter-rouge">dispatchGrid</code> in our
<code class="language-plaintext highlighter-rouge">Blocking1D</code> method, ideal for stencil operations. There is a few more descriptors that
can be accessed in this way. More information about launching shaders and how threads
behave can be found <a href="https://developer.apple.com/documentation/metal/compute_passes/calculating_threadgroup_and_grid_sizes?language=objc">here</a>
and <a href="https://developer.apple.com/documentation/metal/compute_passes/creating_threads_and_threadgroups?language=objc">here</a>.</p>
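<p>To illustrate (a sketch, not a kernel from the repository), a shader can request several of
these standard MSL attributes side by side:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel void attribute_demo(device float* out [[buffer(0)]],
                           uint index      [[thread_position_in_grid]],
                           uint gridSize   [[threads_per_grid]],
                           uint localIndex [[thread_position_in_threadgroup]],
                           uint groupIndex [[threadgroup_position_in_grid]],
                           uint groupSize  [[threads_per_threadgroup]])
{
    // Each thread writes its normalised global position; the remaining
    // attributes describe where the thread sits within its threadgroup
    // and how large the threadgroup and grid are.
    out[index] = float(index) / float(gridSize);
}
</code></pre></div></div>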
<h2 id="saxpy-and-fd-gpu-territory">SAXPY and FD: GPU territory</h2>
<p><a href="https://larsgeb.github.io/2022/04/20/m1-gpu.html">In the previous write-up</a> we saw how
MSL + C++ can outperform C++ OpenMP code for adding arrays, a relatively simple
operation. The speed-up of optimal OpenMP w.r.t. serial code was about 1.8x, while the
MSL shader performed the same operation with a 3x speed-up.</p>
<p>We run the same profiling test on the SAXPY and central differencing shaders. In the
project’s repo, one can find the verification of the results (compared to CPU serial
code) and the benchmark itself in <code class="language-plaintext highlighter-rouge">02-GeneralArrayOperations/main.cpp</code>. The general idea
behind the compilation can be read
<a href="https://larsgeb.github.io/2022/04/20/m1-gpu.html">here</a>, but if you are using VSCode,
the <code class="language-plaintext highlighter-rouge">.vscode/tasks.json</code> should contain all the details you need.</p>
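<p>For reference, building the Metal library outside of Xcode boils down to two <code class="language-plaintext highlighter-rouge">xcrun</code>
invocations (shown as a sketch; the repository’s <code class="language-plaintext highlighter-rouge">.vscode/tasks.json</code> remains the
authoritative recipe):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Compile the MSL source to an intermediate .air file, then link it into a metallib
xcrun -sdk macosx metal -c ops.metal -o ops.air
xcrun -sdk macosx metallib ops.air -o ops.metallib
</code></pre></div></div>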
<p>On my 2021 MacBook, the benchmark produces output like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Running on Apple M1 Max
Array size 67108864, tests repeated 1000 times
Available Metal functions in 'ops.metallib':
central_difference
saxpy
multiply_arrays
add_arrays
Add result is equal to CPU code
Multiply result is equal to CPU code
SAXPY result is equal to CPU code
Central difference result is equal to CPU code
Starting SAXPY benchmarking ...
Metal (GPU): 2413.85ms +/- 75.9983ms <-- 4.1x speed-up
Serial: 9887.88ms +/- 520.411ms
OpenMP (2 threads): 5628.79ms +/- 1039.63ms
OpenMP (3 threads): 6863.52ms +/- 504.79ms
OpenMP (4 threads): 5781.61ms +/- 702.583ms
OpenMP (5 threads): 6178.79ms +/- 224.011ms
OpenMP (6 threads): 5530.03ms +/- 465.312ms
OpenMP (7 threads): 5888.13ms +/- 309.846ms
OpenMP (8 threads): 5350.56ms +/- 68.6307ms
OpenMP (9 threads): 5284.93ms +/- 442.001ms
OpenMP (10 threads): 5248.47ms +/- 482.052ms <-- 1.8x speed-up
OpenMP (11 threads): 6092.41ms +/- 1162.83ms
OpenMP (12 threads): 7914.33ms +/- 1680.73ms
OpenMP (13 threads): 8611.03ms +/- 1653.91ms
OpenMP (14 threads): 6918.88ms +/- 1769ms
Starting central differencing benchmarking ...
Metal (GPU): 1707.02ms +/- 639.27ms <-- 27x speed-up!
Serial: 46894.8ms +/- 214.954ms
OpenMP (2 threads): 29274.6ms +/- 84.4059ms
OpenMP (3 threads): 22267.3ms +/- 166.856ms
OpenMP (4 threads): 18256.5ms +/- 142.291ms
OpenMP (5 threads): 14330.1ms +/- 492.599ms
OpenMP (6 threads): 11894.4ms +/- 858.88ms
OpenMP (7 threads): 11110.7ms +/- 1751.74ms
OpenMP (8 threads): 9877.14ms +/- 746.775ms <-- 4.7x speed-up
OpenMP (9 threads): 10826.8ms +/- 1344.18ms
OpenMP (10 threads): 10408.7ms +/- 1113.99ms
OpenMP (11 threads): 11120ms +/- 1408.25ms
OpenMP (12 threads): 11427.5ms +/- 1347.53ms
OpenMP (13 threads): 11707.8ms +/- 1734.43ms
OpenMP (14 threads): 11301.5ms +/- 1929.12ms
</code></pre></div></div>
<p>The great thing is that the Metal GPU outperforms OpenMP across the board. The speed-up
of the central differencing kernel is remarkable: 27x w.r.t. the serial code, with the
OpenMP implementation not getting close at all. For SAXPY the GPU is more than twice as fast
as the best OpenMP run; for central differencing it is more than 5 times faster!</p>
<p>Additionally, it is interesting to see at which thread count the OpenMP code performs
best. The M1 Max chip has 10 cores, of which 2 are ‘efficiency’ cores and the rest
‘performance’ cores. In the addition test, a relatively simple kernel, OpenMP performance
peaked at 2 threads. Now we see OpenMP perform best at a much more expected place: at
10 threads for SAXPY, and 8 threads for central differencing. I suspect that in these
computationally heavier kernels the benefit of extra compute throughput outweighs the cost of
inefficient memory access patterns and bandwidth saturation.</p>
<p><img src="/assets/m1max.jpg" alt="here" /></p>
<p>I highly suspect that the slowdown between 8 and 10 threads for OpenMP central differencing
is due to threads running on the efficiency cores: when those threads need to access the
buffers, they have to communicate with memory that sits physically closer to the performance
cores, as illustrated in the diagram above. This likely slows down the entire operation.</p>
<p>Next work will be on 2 dimensional shaders, which I have high hopes for!</p>Lars GebraadRelevant repo, previous post.M1 GPUs for C++ science: Getting started2022-04-20T00:00:00+00:002022-04-20T00:00:00+00:00https://larsgeb.github.io/2022/04/20/m1-gpu<p><a href="https://github.com/larsgeb/m1-gpu-cpp">Relevant repo</a>,
<a href="https://larsgeb.github.io/2022/04/22/m1-gpu.html">follow-up post</a>.</p>
<p>This story of working with M1 chips is an amalgamation of various pieces of Apple
documentation. As a scientific programmer it is a bit complicated to work on the new Apple
M1 MacBooks; CUDA doesn’t work on these blazing fast chips! It would be cool if we could
offload heavy physics simulations to the GPU; they’ve shown that they are quite capable.
We’ll start off slow, working out the basics of array operations on the GPU, and hopefully
end up at some proper fast physics!</p>
<h2 id="introduction-to-apples-metal">Introduction to Apple’s Metal</h2>
<p>Luckily for people who jumped on the new M1 chips: you can program rather easily for
these chips using Apple’s <a href="https://developer.apple.com/metal/"><strong>Metal</strong></a> and its
programming language MSL, the Metal Shading Language.</p>
<p>Our second “luckily”: MSL is C++-based. This is cool, because my scientific code is in
C++; I’m not going near Fortran. MSL promises that we can compile our computational
kernels (or shaders as they’re called in MSL) to fast code, and use them on heterogeneous
systems that support Metal. Here is a sample kernel written in MSL that adds two arrays
together:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kernel</span> <span class="kt">void</span> <span class="nf">add_arrays</span><span class="p">(</span><span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">A</span><span class="p">,</span>
<span class="n">device</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">B</span><span class="p">,</span>
<span class="n">device</span> <span class="kt">float</span><span class="o">*</span> <span class="n">C</span><span class="p">,</span>
<span class="n">uint</span> <span class="n">index</span> <span class="p">[[</span><span class="n">thread_position_in_grid</span><span class="p">]])</span>
<span class="p">{</span>
<span class="n">C</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Looks a lot like CUDA, right? Great! If you don’t understand it yet, tag along, and you
will at the end. Let’s try to get this to work on a MacBook and compare it against
OpenMP and serial implementations.</p>
<p>Metal has a bunch of other graphics-oriented functionality, but for scientific
programming we leave those be for now.</p>
<h2 id="0-readying-your-system-to-follow-along">0. Readying your system to follow along</h2>
<p>I make it a point not to use Xcode. Nothing intrinsically against this piece of
software, but to start out with, it is a lot more useful to me to see how dependencies
work without all of Xcode’s handholding. I’ll compile all binaries using LLVM’s
Clang++, installed through Homebrew:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>llvm
</code></pre></div></div>
<p>This has the added benefit of allowing me to use OpenMP on the M1 chip. Make sure the library files for OMP are installed:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>libomp
</code></pre></div></div>
<p>However, we will need the SDKs that come with Xcode to be able to compile for MacBooks.
I advise installing Xcode through the Mac App Store.</p>
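<p>A quick sanity check (not part of the original instructions) that the command-line tools
can actually see the Xcode SDK:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Should print the active developer directory and the macOS SDK path
xcode-select -p
xcrun --sdk macosx --show-sdk-path
</code></pre></div></div>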
<h2 id="1-performing-calculations-on-a-gpu-using-metal">1. Performing Calculations on a GPU using Metal</h2>
<p>Our first stop on the world wide web is <a href="https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc">Apple’s own calculations on GPU</a>, titled <strong>Performing Calculations on a GPU using Metal</strong>.
Exactly what we need! Until… you see that the interface to work with Metal is
available in Swift and Objective-C. Not the typical dialects used in the lands of
Geophysics or other exact sciences.</p>
<p>The website does give a very good overview of getting kernels to work, though. I
recommend sitting down and reading the text, even if Objective-C is not your cup of tea.
I have never worked in it, but I still found the article helpful for understanding MSL concepts.
Among other things, we get the fun anecdote:</p>
<blockquote>
<p>In Metal, code that runs on GPUs is called a shader, because historically they were first used to calculate colors in 3D graphics.</p>
</blockquote>
<p>Additionally, we learn a few other things:</p>
<ul>
<li>Metal dynamically loads your shader library at runtime. That means that your
applications and shaders are separately compiled.</li>
<li>Metal orchestrates computations using command queues and command buffers. This allows
for asynchronous and heterogeneous operations, much like CUDA.</li>
<li>The design of parallel loops works much like CUDA, where the kernel is a single function called with an index.</li>
<li>Metal has different types of data buffers that are exposed differently to the GPU and
CPU. For the purposes of this tutorial, <code class="language-plaintext highlighter-rouge">MTLResourceStorageModeShared</code> seems very appropriate: we
can use the same data in memory for both GPU and CPU computations! (See the sketch right after
this list.)</li>
</ul>
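<p>As a small sketch of that last point (written with the metal-cpp C++ API we set up in the
next section, not code from Apple’s article; variable names are illustrative), a buffer
created with shared storage mode can be filled by the CPU and then handed to the GPU as-is:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: a shared-storage buffer is visible to both the CPU and the GPU.
MTL::Buffer *buffer =
    device->newBuffer(arrayLength * sizeof(float),
                      MTL::ResourceStorageModeShared);
// The CPU writes through a plain pointer ...
float *data = static_cast<float *>(buffer->contents());
for (size_t i = 0; i < arrayLength; ++i)
    data[i] = 1.0f;
// ... and the very same buffer can later be bound to a compute command encoder.
</code></pre></div></div>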
<h2 id="2-using-metal-from-c-using-metal-cpp">2. Using Metal from C++, using metal-cpp</h2>
<p>A bit more targeted web-surfing reveals <a href="https://developer.apple.com/metal/cpp/">another Apple page, geared towards running graphical Metal applications from C++</a>, titled <strong>Getting started with Metal-cpp</strong>.
The page’s highlights speak for themselves:</p>
<ul>
<li>Alternative to Objective-C;</li>
<li>No measurable overhead.</li>
</ul>
<p>This manual is not geared towards the scientific computation that we were interested in,
but it does allow us to get started with Metal in C++. One could follow the instructions
to download <code class="language-plaintext highlighter-rouge">metal-cpp_macOS12_iOS15.zip</code>; however, I was (legally, courtesy of the Apache license) able to include the relevant code in <a href="https://github.com/larsgeb/m1-gpu-cpp">this repository</a>. The interesting bits for us are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">metal-cpp/Metal</code></li>
<li><code class="language-plaintext highlighter-rouge">metal-cpp/Foundation</code></li>
<li><code class="language-plaintext highlighter-rouge">metal-cpp/QuartzCore</code></li>
</ul>
<p>These folders contain the relevant headers exposing Metal’s interface to
C++.</p>
<p>We could probably use these to translate the Objective-C code by hand …</p>
<h2 id="3-translating-the-objective-c-without-understanding">3. Translating the Objective-C without understanding!</h2>
<p>I’m not going to learn Objective-C just to translate a bit of code. The
programmer in me says that there is a more efficient way to rewrite
<code class="language-plaintext highlighter-rouge">Performing Calculations on a GPU using Metal</code> into useable C++. Autocompletion in
VSCode would be a perfect shortcut, no? Additionally, both Objective-C and C++ contain a
C in their name. They must be extremely similar. … Right?</p>
<p>Specifically, we need to get the following files to work in a C++ implementation:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">main.m</code>, Looks like a main function from C++.</li>
<li><code class="language-plaintext highlighter-rouge">MetalAdder.h</code>, Header files for a class, apparently.</li>
<li><code class="language-plaintext highlighter-rouge">MetalAdder.m</code>, Body of the class, I hope.</li>
<li><code class="language-plaintext highlighter-rouge">add.metal</code>, the MSL code! We’ve seen this before, and it’s definitely the least intimidating.</li>
</ul>
<h3 id="translating-mainm">Translating <code class="language-plaintext highlighter-rouge">main.m</code></h3>
<p>The Objective-C main function (<code class="language-plaintext highlighter-rouge">main.m</code>) starts confidently, coming in hot with stuff I’ve never seen before:</p>
<div class="language-objc highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import "MetalAdder.h"
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
<span class="k">@autoreleasepool</span> <span class="p">{</span>
<span class="n">id</span><span class="o"><</span><span class="n">MTLDevice</span><span class="o">></span> <span class="n">device</span> <span class="o">=</span> <span class="n">MTLCreateSystemDefaultDevice</span><span class="p">();</span>
<span class="c1">// Create the custom object used to encapsulate the Metal code.</span>
<span class="c1">// Initializes objects to communicate with the GPU.</span>
<span class="n">MetalAdder</span><span class="o">*</span> <span class="n">adder</span> <span class="o">=</span> <span class="p">[[</span><span class="n">MetalAdder</span> <span class="nf">alloc</span><span class="p">]</span> <span class="nf">initWithDevice</span><span class="p">:</span><span class="n">device</span><span class="p">];</span>
<span class="c1">// Create buffers to hold data</span>
<span class="p">[</span><span class="n">adder</span> <span class="nf">prepareData</span><span class="p">];</span>
<span class="c1">// Send a command to the GPU to perform the calculation.</span>
<span class="p">[</span><span class="n">adder</span> <span class="nf">sendComputeCommand</span><span class="p">];</span>
<span class="n">NSLog</span><span class="p">(</span><span class="s">@"Execution finished"</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It seems to create a pointer to a computation device (such as a GPU, defined in <code class="language-plaintext highlighter-rouge">Metal.h</code>), pass that to a constructor for <code class="language-plaintext highlighter-rouge">MetalAdder</code> (which we haven’t defined yet), and then call a few methods on this object.</p>
<p>Using some smart autocompletion to figure out how to create the Metal device, and
filling in the arbitrary gaps (arbitrary as we still have to define the MetalAdder
class), we end up with some C++ code that looks like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <iostream>
#include <omp.h>
</span>
<span class="cp">#define NS_PRIVATE_IMPLEMENTATION
#define CA_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include "Foundation/Foundation.hpp"
#include "Metal/Metal.hpp"
#include "QuartzCore/QuartzCore.hpp"
</span>
<span class="cp">#include "MetalAdder.hpp"
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span>
<span class="p">{</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Device</span> <span class="o">*</span><span class="n">device</span> <span class="o">=</span> <span class="n">MTL</span><span class="o">::</span><span class="n">CreateSystemDefaultDevice</span><span class="p">();</span>
<span class="n">MetalAdder</span> <span class="o">*</span><span class="n">adder</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MetalAdder</span><span class="p">(</span><span class="n">device</span><span class="p">);</span>
<span class="n">adder</span><span class="o">-></span><span class="n">sendComputeCommand</span><span class="p">();</span>
<span class="n">adder</span><span class="o">-></span><span class="n">verifyResults</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Additionally, I’ve included iostream and OpenMP to facilitate output and the multicore work we’ll
do later to profile all the code.</p>
<h3 id="translating-metaladderm-and-metaladderh">Translating <code class="language-plaintext highlighter-rouge">MetalAdder.m</code> and <code class="language-plaintext highlighter-rouge">MetalAdder.h</code></h3>
<p>The <code class="language-plaintext highlighter-rouge">MetalAdder</code> class is a way to keep track of data, command, etc. that are relevant
to using the GPU. I’ll try to one-on-one translate this to C++ from Apple’s tutorial,
but in one or two places I optimized the code to C++ standards. The final result is
<code class="language-plaintext highlighter-rouge">MetalAdder.cpp</code> and <code class="language-plaintext highlighter-rouge">MetalAdder.hpp</code>.</p>
<p>Let us first have a look at the <strong>constructor</strong>. In Objective-C, its signature is the following:</p>
<div class="language-objc highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">-</span> <span class="p">(</span><span class="n">instancetype</span><span class="p">)</span> <span class="nf">initWithDevice</span><span class="p">:</span> <span class="p">(</span><span class="n">id</span><span class="o"><</span><span class="n">MTLDevice</span><span class="o">></span><span class="p">)</span> <span class="n">device</span>
<span class="p">{</span>
<span class="c1">// ... body ...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>meaning that it takes a pointer to an <code class="language-plaintext highlighter-rouge">MTLDevice</code>. Using VSCode, we realize that in the
C++ headers, all types with a leading MTL, such as <code class="language-plaintext highlighter-rouge">MTLDevice</code>, are translated into
types in the <code class="language-plaintext highlighter-rouge">MTL</code> namespace: <code class="language-plaintext highlighter-rouge">MTL::Device</code>. Additionally, the Objective-C constructor
tests for Metal errors by making sure none of the created objects turn out to be <code class="language-plaintext highlighter-rouge">nil</code>.
The equivalent of this in C++ is to check against <code class="language-plaintext highlighter-rouge">nullptr</code>s.</p>
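<p>As a minimal sketch of where this leads (the member name <code class="language-plaintext highlighter-rouge">_mDevice</code> is borrowed from Apple’s tutorial, so treat it as a placeholder rather than the literal code from my repo), the translated constructor opens roughly like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MetalAdder::MetalAdder(MTL::Device *device)
{
    _mDevice = device;

    // Every Metal object created below is checked against nullptr,
    // mirroring the nil checks in the Objective-C original.
    MTL::Library *defaultLibrary = _mDevice->newDefaultLibrary();
    if (defaultLibrary == nullptr)
    {
        std::cout << "Failed to find the default library." << std::endl;
        return;
    }
    // ... shader function, pipeline state and command queue follow ...
}
</code></pre></div></div>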
<p>Functionally, what happens next in the constructor is the following:</p>
<ul>
<li><strong>Loading the Metal library containing our shaders.</strong> The C++ equivalent is readily
found using your editor’s code completion.</li>
</ul>
<p>Objective-C</p>
<div class="language-objc highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">id</span><span class="o"><</span><span class="n">MTLLibrary</span><span class="o">></span> <span class="n">defaultLibrary</span> <span class="o">=</span> <span class="p">[</span><span class="n">_mDevice</span> <span class="nf">newDefaultLibrary</span><span class="p">];</span>
</code></pre></div></div>
<p>C++</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MTL</span><span class="o">::</span><span class="n">Library</span> <span class="o">*</span><span class="n">defaultLibrary</span> <span class="o">=</span> <span class="n">_mDevice</span><span class="o">-></span><span class="n">newDefaultLibrary</span><span class="p">();</span>
</code></pre></div></div>
<ul>
<li><strong>Loading a specific shader based on its name.</strong> This is the biggest translation
mismatch, as the C++ method newFunctionWithName doesn’t exist, and its equivalent
doesn’t accept <code class="language-plaintext highlighter-rouge">const char *</code>, only its own string implementation (<code class="language-plaintext highlighter-rouge">NS::String</code>).</li>
</ul>
<p>Objective-C</p>
<div class="language-objc highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">id</span><span class="o"><</span><span class="n">MTLFunction</span><span class="o">></span> <span class="n">addFunction</span> <span class="o">=</span> <span class="p">[</span><span class="n">defaultLibrary</span> <span class="nf">newFunctionWithName</span><span class="p">:</span><span class="s">@"add_arrays"</span><span class="p">];</span>
</code></pre></div></div>
<p>C++</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">str</span> <span class="o">=</span> <span class="n">NS</span><span class="o">::</span><span class="n">String</span><span class="o">::</span><span class="n">string</span><span class="p">(</span><span class="s">"add_arrays"</span><span class="p">,</span> <span class="n">NS</span><span class="o">::</span><span class="n">ASCIIStringEncoding</span><span class="p">);</span>
<span class="n">MTL</span><span class="o">::</span><span class="n">Function</span> <span class="o">*</span><span class="n">addFunction</span> <span class="o">=</span> <span class="n">defaultLibrary</span><span class="o">-></span><span class="n">newFunction</span><span class="p">(</span><span class="n">str</span><span class="p">);</span>
</code></pre></div></div>
<ul>
<li><strong>Creating a pipeline state object.</strong> This translates just like loading the default
library (a sketch of this and the next step follows right after this list).</li>
<li><strong>Creating a command queue object.</strong> This also translates without issue.</li>
<li><strong>Preparing the test data.</strong> This is simply done by a (yet-to-write) method of the class
we are writing.</li>
</ul>
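<p>To make those last bullets concrete, here is a minimal sketch of the remaining constructor steps in metal-cpp. The member names (<code class="language-plaintext highlighter-rouge">_mAddFunctionPSO</code>, <code class="language-plaintext highlighter-rouge">_mCommandQueue</code>) again follow Apple’s tutorial and are assumptions, not necessarily the exact names in my repo:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NS::Error *error = nullptr;

// Create a compute pipeline state object from the loaded shader function.
_mAddFunctionPSO = _mDevice->newComputePipelineState(addFunction, &error);
if (_mAddFunctionPSO == nullptr)
{
    std::cout << "Failed to create the pipeline state object." << std::endl;
    return;
}

// Create a command queue used to send work to the GPU.
_mCommandQueue = _mDevice->newCommandQueue();
if (_mCommandQueue == nullptr)
{
    std::cout << "Failed to create the command queue." << std::endl;
    return;
}

// Allocate buffers and fill them with test data (defined later on).
prepareData();
</code></pre></div></div>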
<p>To understand exactly what the created objects do, refer to the Apple tutorial written
for Objective-C.</p>
<p>Most <strong>other class methods</strong> are translated one-to-one much the same way. One of the
major differences (inconveniences, more like) between the original Objective-C and the
C++ implementation can be initially found in <code class="language-plaintext highlighter-rouge">generateRandomFloatData</code>. This function
populates arbitrary buffers with random data. To do this, it needs to set the values of
the buffer one by one, accessing these from the CPU. In the Objective-C implementation,
whenever we want to access the buffer’s data, we obtain a pointer to the start of the
buffer, and loop over it by pointer arithmetic:</p>
<div class="language-objc highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">-</span> <span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="nf">generateRandomFloatData</span><span class="p">:</span> <span class="p">(</span><span class="n">id</span><span class="o"><</span><span class="n">MTLBuffer</span><span class="o">></span><span class="p">)</span> <span class="n">buffer</span>
<span class="p">{</span>
<span class="kt">float</span><span class="o">*</span> <span class="n">dataPtr</span> <span class="o">=</span> <span class="n">buffer</span><span class="p">.</span><span class="n">contents</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">index</span> <span class="o"><</span> <span class="n">arrayLength</span><span class="p">;</span> <span class="n">index</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">dataPtr</span><span class="p">[</span><span class="nf">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="n">rand</span><span class="p">()</span><span class="o">/</span><span class="p">(</span><span class="kt">float</span><span class="p">)(</span><span class="n">RAND_MAX</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In C++, one is not able to implicitly cast this pointer. The return type of
<code class="language-plaintext highlighter-rouge">buffer->contents()</code> is a <code class="language-plaintext highlighter-rouge">void *</code>, i.e. a pointer to data of unspecified type, and
one needs to explicitly cast this pointer to a <code class="language-plaintext highlighter-rouge">float *</code>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MetalAdder</span><span class="o">::</span><span class="n">generateRandomFloatData</span><span class="p">(</span><span class="n">MTL</span><span class="o">::</span><span class="n">Buffer</span> <span class="o">*</span><span class="n">buffer</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// The pointer needs to be explicitly cast in C++, a difference from</span>
<span class="c1">// Objective-C.</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">dataPtr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">buffer</span><span class="o">-></span><span class="n">contents</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">index</span> <span class="o"><</span> <span class="n">arrayLength</span><span class="p">;</span> <span class="n">index</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">dataPtr</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="n">rand</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)(</span><span class="n">RAND_MAX</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Phew, getting quite close there to actually programming! This slight difference between
Objective-C and C++ pops up again in <code class="language-plaintext highlighter-rouge">verifyResults</code>, and basically whenever we try
to access the buffer manually.</p>
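<p>As an illustration of how often that cast comes back, a sketch of <code class="language-plaintext highlighter-rouge">verifyResults</code> could look like this (the buffer member names are again borrowed from Apple’s tutorial, so they may differ from the actual repo):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void MetalAdder::verifyResults()
{
    // Every buffer access needs the same explicit cast from void *.
    float *a = (float *)_mBufferA->contents();
    float *b = (float *)_mBufferB->contents();
    float *result = (float *)_mBufferResult->contents();

    for (unsigned long index = 0; index < arrayLength; index++)
    {
        if (result[index] != (a[index] + b[index]))
        {
            std::cout << "Compute ERROR at index " << index << std::endl;
            return;
        }
    }
    std::cout << "Compute results as expected" << std::endl;
}
</code></pre></div></div>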
<p>To keep it tidy, I moved all declarations into <code class="language-plaintext highlighter-rouge">MetalAdder.hpp</code> (but mostly because I didn’t figure out how to have some declarations <em>not</em> in the header).</p>
<h2 id="4-compiling-metal-cpp-programs">4. Compiling metal-cpp programs.</h2>
<p>We avoid looking at documentation some more; messing around with the make
command in the <strong>Getting started with Metal-cpp</strong> project reveals the clang++
<code class="language-plaintext highlighter-rouge">includes</code> and flags relevant to getting Metal to work in C++:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>larsgebraad@macbook:~<span class="nv">$ </span><span class="nb">pwd</span>
/Users/larsgebraad/Downloads/LearnMetalCPP
larsgebraad@macbook:~<span class="nv">$ </span>make
clang++ <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c++17 <span class="nt">-I</span>./metal-cpp <span class="nt">-I</span>./metal-cpp-extensions <span class="nt">-fno-objc-arc</span> <span class="nt">-O2</span> <span class="nt">-framework</span> Metal <span class="nt">-framework</span> Foundation <span class="nt">-framework</span> Cocoa <span class="nt">-framework</span> CoreGraphics <span class="nt">-framework</span> MetalKit learn-metal/00-window/00-window.o <span class="nt">-o</span> build/00-window
</code></pre></div></div>
<p>It seems we need to include the <code class="language-plaintext highlighter-rouge">metal-cpp</code> and <code class="language-plaintext highlighter-rouge">metal-cpp-extensions</code> folders, as well as
hook the frameworks into clang. <code class="language-plaintext highlighter-rouge">CoreGraphics</code> doesn’t seem like something we’d need
in a terminal-based application, and neither does Cocoa. After some tweaking, the bare
necessities to compile a command-line program with Metal seem to be:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>clang++ <span class="nt">-std</span><span class="o">=</span>c++17 <span class="nt">-I</span>./metal-cpp <span class="nt">-O2</span> <span class="se">\</span>
<span class="nt">-framework</span> Metal <span class="nt">-framework</span> Foundation <span class="nt">-framework</span> MetalKit <span class="se">\</span>
whatever.cpp
</code></pre></div></div>
<p>This seems like a good start for compiling our Metal program.</p>
<p>I skip my system-wide <code class="language-plaintext highlighter-rouge">clang++</code> in favour of <code class="language-plaintext highlighter-rouge">/opt/homebrew/opt/llvm/bin/clang++</code>, which
allows me to easily include OpenMP libraries, e.g.:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>/opt/homebrew/opt/llvm/bin/clang++ <span class="se">\</span>
<span class="nt">-L</span>/opt/homebrew/opt/libomp/lib <span class="nt">-fopenmp</span> some-openmp.cpp
</code></pre></div></div>
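<p>The <code class="language-plaintext highlighter-rouge">some-openmp.cpp</code> in that command is just a stand-in; a minimal sanity check (my own invention, not part of the repo) could be:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <iostream>
#include <omp.h>

int main()
{
    // If OpenMP is hooked up correctly, more than one thread id is printed.
    #pragma omp parallel
    {
        #pragma omp critical
        std::cout << "Hello from thread " << omp_get_thread_num()
                  << " of " << omp_get_num_threads() << std::endl;
    }
    return 0;
}
</code></pre></div></div>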
<p>Now, to actually compile our Metal+OpenMP application, we run:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>/opt/homebrew/opt/llvm/bin/clang++ <span class="se">\</span>
<span class="nt">-std</span><span class="o">=</span>c++17 <span class="nt">-stdlib</span><span class="o">=</span>libc++ <span class="nt">-O2</span> <span class="se">\</span>
<span class="nt">-L</span>/opt/homebrew/opt/libomp/lib <span class="nt">-fopenmp</span> <span class="se">\</span>
<span class="nt">-I</span>./metal-cpp <span class="se">\</span>
<span class="nt">-fno-objc-arc</span> <span class="se">\</span>
<span class="nt">-framework</span> Metal <span class="nt">-framework</span> Foundation <span class="nt">-framework</span> MetalKit <span class="se">\</span>
<span class="nt">-g</span> 01-MetalAdder/main.cpp 01-MetalAdder/MetalAdder.cpp <span class="nt">-o</span> 01-MetalAdder/benchmark.x
</code></pre></div></div>
<p>If we try out this executable, we find the following:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>./01-MetalAdder/benchmark.x
<span class="go">
Failed to find the default library.
[1] 13767 segmentation fault ./benchmark.x
</span></code></pre></div></div>
<p>It seems that our GPU code itself is not compiled yet; compiling the CPU code does not automatically compile the shaders. To do this, we follow the instructions <a href="https://developer.apple.com/documentation/metal/shader_libraries/building_a_library_with_metal_s_command-line_tools?language=objc">of yet another Apple documentation page</a>, titled <strong>Building a Library with Metal’s Command-Line Tools</strong> and geared towards Objective-C.</p>
<p>This is where our installation of Xcode is relevant; we need to use the command-line
tools and SDKs from Xcode to compile our GPU code:</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>xcrun <span class="nt">-sdk</span> macosx metal <span class="nt">-c</span> add.metal <span class="nt">-o</span> MyLibrary.air
<span class="go">
</span><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>xcrun <span class="nt">-sdk</span> macosx metallib MyLibrary.air <span class="nt">-o</span> default.metallib
</code></pre></div></div>
<p>The final name of the <code class="language-plaintext highlighter-rouge">.metallib</code> file is important, as our executable is only searching for the <code class="language-plaintext highlighter-rouge">default</code> library. This behaviour can be adapted in the constructor of <code class="language-plaintext highlighter-rouge">MetalAdder</code>.</p>
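<p>If you’d rather not rename the library, metal-cpp also exposes a loader that takes a file path (mirroring Objective-C’s <code class="language-plaintext highlighter-rouge">newLibraryWithFile:error:</code>). A hedged sketch of what that could look like in the constructor, with the path chosen arbitrarily:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NS::Error *error = nullptr;
auto path = NS::String::string("MyLibrary.metallib", NS::ASCIIStringEncoding);
MTL::Library *library = _mDevice->newLibrary(path, &error);
if (library == nullptr)
{
    std::cout << "Failed to load MyLibrary.metallib" << std::endl;
}
</code></pre></div></div>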
<h2 id="5-benchmarking-against-serial-and-openmp-code">5. Benchmarking against serial and OpenMP code</h2>
<p>Now that our GPU code is compiled, we are ready to run a full benchmark. In <code class="language-plaintext highlighter-rouge">main.cpp</code>, additional serial and OpenMP implementations of this array addition are defined. By running the resulting <code class="language-plaintext highlighter-rouge">benchmark.x</code>, we get the following impressive results:</p>
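<p>For reference, the CPU baselines are nothing exotic; an OpenMP array addition along these lines (a simplified sketch, not the exact benchmark code) is all that is needed:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void add_arrays_openmp(const float *a, const float *b, float *result,
                       unsigned long length, int threads)
{
    // The serial baseline is the same loop without the pragma.
    #pragma omp parallel for num_threads(threads)
    for (unsigned long index = 0; index < length; index++)
    {
        result[index] = a[index] + b[index];
    }
}
</code></pre></div></div>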
<blockquote>
<p>System specs:</p>
<ul>
<li>2021 MacBook Pro</li>
<li>M1 Max, 10‑Core CPU, 32‑Core GPU and 16‑Core Neural Engine</li>
<li>32 GB RAM</li>
</ul>
</blockquote>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">larsgebraad@macbook:~$</span><span class="w"> </span>./01-MetalAdder/benchmark.x
<span class="go">
Metal (GPU) code performance:
Average time: 803.566ms +/- 48.427ms
Serial code performance:
Average time: 2439.92ms +/- 74.4422ms
OpenMP (1 threads) code performance:
Average time: 2427.24ms +/- 15.666ms
OpenMP (2 threads) code performance:
Average time: 1315.2ms +/- 76.5438ms
OpenMP (3 threads) code performance:
Average time: 1684.19ms +/- 46.0139ms
OpenMP (4 threads) code performance:
Average time: 1339.81ms +/- 99.9749ms
OpenMP (5 threads) code performance:
</span><span class="c">...
</span><span class="go">
OpenMP (10 threads) code performance:
Average time: 1756.53ms +/- 640.482ms
</span></code></pre></div></div>
<p>Weirdly enough, OpenMP is faster on even thread counts. Unsurprisingly, the serial code is the slowest. On this simple array addition problem, using Metal gives roughly a 3x speed-up with respect to a single CPU thread, and a 1.6x speed-up over the fastest OpenMP configuration!</p>
<p>I suspect these numbers will be more dramatic when the computational kernels are more involved, but we’ll see that later.</p>Lars GebraadRelevant repo, follow-up post.Parallel sparse dot products with SciPy: an MKL wrapper for SciPy CSR GEMV2022-03-31T00:00:00+00:002022-03-31T00:00:00+00:00https://larsgeb.github.io/2022/03/31/mkl-scipy<p>A long while ago I wrote a package that is intended as a tiny wrapper around MKL’s sparse matrix-vector products. The write-up was pretty decent back then, so let’s backdate it and give it a decent place here!</p>
<p>The original code can be found in this <a href="https://github.com/larsgeb/scipy_mkl_gemv">repo</a>.</p>
<h2 id="the-problem">The problem</h2>
<p>A while ago I was working on huge sparse matrices in straight ray tomography. I was trying to Monte Carlo sample a system of equations that computed the dot product between a CSR matrix and a NumPy vector. I noticed that SciPy’s dot operator for this is not multithreaded, which was quite painful performance-wise. I needed to do millions of evaluations, and running these multiplications on a single thread while my workstation at the time had 18 cores felt a bit silly.</p>
<p>Enter MKL, a software product from Intel. It implements a bunch of linear algebra (and other stuff) accelerations for Intel processors. Within MKL, the <a href="https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/sparse-blas-level-2-and-level-3-routines/sparse-blas-level-2-and-level-3-routines-1/mkl-cspblas-csrgemv.html">operation</a> for this sparse-dense product does exist as:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mkl_cspblas_</span><span class="o">?</span><span class="n">csrgemv</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">transa</span> <span class="p">,</span> <span class="k">const</span> <span class="n">MKL_INT</span> <span class="o">*</span><span class="n">m</span> <span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span> <span class="p">,</span> <span class="k">const</span> <span class="n">MKL_INT</span> <span class="o">*</span><span class="n">ia</span> <span class="p">,</span> <span class="k">const</span> <span class="n">MKL_INT</span> <span class="o">*</span><span class="n">ja</span> <span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span> <span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span> <span class="p">);</span>
</code></pre></div></div>
<p>The function itself is deprecated as of the time of writing, with MKL preferring Inspector-Executor type calls. However, for typical large sparse float matrices, the implementation in my repo works fine, and is actually much faster than SciPy’s.</p>
<p>The implementation is based on <a href="https://stackoverflow.com/a/23294826/6848887">this StackOverflow answer</a>, with some logic for performing both <code class="language-plaintext highlighter-rouge">float32</code> and <code class="language-plaintext highlighter-rouge">float64</code> general matrix-vector products.</p>
<h2 id="getting-it-to-work">Getting it to work</h2>
<p>Now, if you are like me, you don’t typically like working with compilers, include paths and dynamic libraries. I promise though, setting this up isn’t a pain.</p>
<p>First make sure that the MKL libraries are properly installed and accessible from the shell where you will execute your Python script/notebook/interactive session. If you’ve installed MKL in <code class="language-plaintext highlighter-rouge">/opt/intel</code> (the default), this is easily done using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">INTEL_COMPILERS_AND_LIBS</span><span class="o">=</span>/opt/intel/compilers_and_libraries/linux
<span class="nb">source</span> <span class="nv">$INTEL_COMPILERS_AND_LIBS</span>/mkl/bin/mklvars.sh intel64
</code></pre></div></div>
<p>Now simply put the <code class="language-plaintext highlighter-rouge">mkl_interface.py</code> file from the repo in the directory where you will run your program such that you can import it in the following way:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">mkl_interface</span> <span class="kn">import</span> <span class="n">sparse_gemv</span>
</code></pre></div></div>
<p>and use the function <code class="language-plaintext highlighter-rouge">sparse_gemv(scipy.sparse.csr_matrix A, numpy.ndarray x)</code> from anywhere. Be mindful that the precision of <code class="language-plaintext highlighter-rouge">A</code> is used to perform the matrix-vector product, i.e. x is implicitly converted (its type will probably be altered outside the function scope too).</p>
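<p>A small usage sketch, assuming <code class="language-plaintext highlighter-rouge">sparse_gemv</code> returns the product as a NumPy array (matrix sizes picked arbitrarily):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import scipy.sparse

from mkl_interface import sparse_gemv

# Random sparse system, single precision to hit the float32 MKL routine.
A = scipy.sparse.random(100_000, 50_000, density=1e-3, format="csr",
                        dtype=np.float32)
x = np.random.randn(50_000).astype(np.float32)

y_mkl = sparse_gemv(A, x)  # MKL-backed, multithreaded
y_scipy = A @ x            # SciPy's native, single-threaded product

print(np.allclose(y_mkl, y_scipy, rtol=1e-4, atol=1e-6))
</code></pre></div></div>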
<h2 id="how-fast-is-it">How fast is it?</h2>
<p>The implementation’s performance is tested for both ‘tall’ (more rows) and ‘fat’ (more columns) matrices, but performance of both the MKL and SciPy implementations is affected negligibly by the shape. The factors that do determine performance are precision (float size), the percentage of non-zero elements (sparsity) and the total number of non-zero elements. A test on randomly generated sparse matrices can be seen below.</p>
<p><img src="/assets/2022-04-01-mkl-scipy-results.svg" alt="here" /></p>
<p>Results were generated on an 18-core, 128 GB workstation. Execution times should be seen as relative, as each test is run 50 times. Error bars are due to varying shapes of matrices (tall, fat and square, but all with the same number of elements, rows × columns).</p>
<p>Clearly, the MKL implementation should not always be favoured over SciPy’s native one. The high overhead of the MKL implementation (mainly acquiring pointers and extra memory) means that for small problems, SciPy’s implementation is more efficient. For larger problems, MKL should be favourable.</p>
<p>It should be noted that the tests performed in this repo are limited in size due to SciPy’s random sparse matrix generator, which ran out of memory on a 128 GB system for a matrix size of 10⁸ total elements.</p>Lars GebraadA long while ago I wrote a package that is intended as a tiny wrapper around MKL’s sparse matrix-vector products. The write-up was pretty decent back then, so let’s backdate it and give it a decent place here!