Introduction

Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation by a struct like the one below, instead of a generic 4×4 matrix,

struct Transform {
  var position = float3(0, 0, 0)
  var scale    = float3(1, 1, 1)
  var rotation = Quaternion()
}

However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU. The conversion for the struct above looks like this,

func toMatrix4() -> float4x4 {
  let rm = rotation.toMatrix4()
  return float4x4([
    scale.x * rm[0],
    scale.y * rm[1],
    scale.z * rm[2],
    float4(position.x, position.y, position.z, 1.0)
  ])
}
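The listing above relies on rotation.toMatrix4(), which isn’t shown. As a rough idea of what that conversion involves, here is a CPU-side sketch in C++ (the language Metal shaders are based on). The names Quat, Mat4, quatToMatrix and transformToMatrix are made up for this illustration and are not the ones used in the demo. I’m assuming a column-major matrix and a unit quaternion with w as the scalar.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical types for illustration only.
struct Quat { float x, y, z, w; };      // unit quaternion, w is the scalar
using Vec3 = std::array<float, 3>;
using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;       // column-major: m[column][row]

// Standard unit-quaternion to rotation-matrix conversion.
Mat4 quatToMatrix(const Quat& q) {
    // The formula below only yields a pure rotation for unit quaternions.
    float n = q.x*q.x + q.y*q.y + q.z*q.z + q.w*q.w;
    assert(std::fabs(n - 1.0f) < 1e-4f);
    float xx = q.x*q.x, yy = q.y*q.y, zz = q.z*q.z;
    float xy = q.x*q.y, xz = q.x*q.z, yz = q.y*q.z;
    float wx = q.w*q.x, wy = q.w*q.y, wz = q.w*q.z;
    Mat4 m = {{
        {{ 1 - 2*(yy + zz), 2*(xy + wz),     2*(xz - wy),     0 }},
        {{ 2*(xy - wz),     1 - 2*(xx + zz), 2*(yz + wx),     0 }},
        {{ 2*(xz + wy),     2*(yz - wx),     1 - 2*(xx + yy), 0 }},
        {{ 0, 0, 0, 1 }}
    }};
    return m;
}

// Same composition as toMatrix4() above: scale the three rotation columns,
// then store the position in the last column.
Mat4 transformToMatrix(const Vec3& position, const Vec3& scale, const Quat& rotation) {
    Mat4 m = quatToMatrix(rotation);
    for (int c = 0; c < 3; ++c)
        for (int r = 0; r < 4; ++r)
            m[c][r] *= scale[c];
    m[3] = { position[0], position[1], position[2], 1.0f };
    return m;
}
```

This is roughly the work that has to be repeated on the CPU for every object whose Transform changed, before the matrices can be uploaded.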

The reason for this conversion is usually twofold,

  • GPUs have native support for matrices, making them the natural choice when thinking about performance;

  • in traditional pipelines, we only worried about the final position of a vertex in world coordinates, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), thus making the transformation of vertices in the GPU really cheap.

Growing shader complexity

Of the two reasons stated earlier, the second one barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.

The Projection and View matrices are only set once per camera or viewport, and the World matrix will be set per object or instance being drawn. The vertex shader will look like this,

float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);

If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader will need to do some extra operations,

Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);

The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).

// The helper functions are defined first, so that Transform can call quatMul.
/// Quaternion Inverse
float4 quatInv(const float4 q) {
 // assume it's a unit quaternion, so just Conjugate
 return float4( -q.xyz, q.w );
}
/// Quaternion multiplication (the Hamilton product; despite the name, not a dot product)
float4 quatDot(const float4 q1, const float4 q2) {
 float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
 float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
 return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector): q * v * q^-1
float3 quatMul(const float4 q, const float3 v) {
 float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
 return r.xyz;
}
struct Transform {
 // for alignment reasons, position and scale are float4
 float4 position; // only xyz actually used
 float4 scale;    // only xyz actually used
 float4 rotation; // unit quaternion; w is the scalar
 // scale first, then rotate, then translate
 float3 operator* (const float3 v) const {
   return position.xyz + quatMul(rotation, v * scale.xyz);
 }
};
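To convince yourself that the shader code above really rotates vectors, it can help to port it to the CPU and try a known rotation. The following is a plain C++ sketch of the same math, not the demo code: Vec3, Quat and the small helpers (add, scaled, dot, cross) are stand-ins I’m assuming for Metal’s built-in vector types and functions.

```cpp
#include <cassert>
#include <cmath>

// Stand-ins for Metal's float3/float4 (assumed layout: v = xyz, w = scalar).
struct Vec3 { float x, y, z; };
struct Quat { Vec3 v; float w; };

Vec3 add(Vec3 a, Vec3 b) { return Vec3{ a.x + b.x, a.y + b.y, a.z + b.z }; }
Vec3 scaled(Vec3 a, float k) { return Vec3{ a.x * k, a.y * k, a.z * k }; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{ a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Conjugate; equals the inverse only for unit quaternions.
Quat quatInv(Quat q) { return Quat{ scaled(q.v, -1.0f), q.w }; }

// Hamilton product of two quaternions.
Quat quatDot(Quat a, Quat b) {
    float scalar = a.w * b.w - dot(a.v, b.v);
    Vec3 v = add(cross(a.v, b.v), add(scaled(b.v, a.w), scaled(a.v, b.w)));
    return Quat{ v, scalar };
}

// Rotate v by the unit quaternion q: q * v * q^-1.
Vec3 quatMul(Quat q, Vec3 v) {
    assert(std::fabs(dot(q.v, q.v) + q.w * q.w - 1.0f) < 1e-4f);
    return quatDot(q, quatDot(Quat{ v, 0.0f }, quatInv(q))).v;
}

struct Transform {
    float position[4];  // only xyz used
    float scale[4];     // only xyz used
    Quat rotation;
    // Scale first, then rotate, then translate.
    Vec3 apply(Vec3 p) const {
        Vec3 s = { p.x * scale[0], p.y * scale[1], p.z * scale[2] };
        Vec3 r = quatMul(rotation, s);
        return Vec3{ position[0] + r.x, position[1] + r.y, position[2] + r.z };
    }
};

// 12 floats per instance, versus 16 for a 4x4 matrix.
static_assert(sizeof(Transform) == 12 * sizeof(float), "unexpected padding");
```

With a rotation of 90 degrees about the Z axis, that is the quaternion (0, 0, sin 45°, cos 45°), the X axis should land on the Y axis, up to floating-point error.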

Let’s see if this is any slower than matrices with an example.

Rotating cubes demo

I’ve created this demo of rotating cubes to measure the performance of using quaternions on a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone 6.

The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!

The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending on the amount of area covered by the cubes as they rotate, but I placed them close together so it’s always very dense).

The code of the demo can be found here:

Performance comparison in the GPU

Both versions run at 60 fps on an iPhone 6. This is a frame capture of the version that uses matrices,

Metal Framecapture instanced cubes with matrices

The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.

For a proper comparison, we need to make this example vertex-bound, so I’ve prepared another example with spheres instead of cubes,

The tessellation level can be increased at compile time. In the video, there are only a few hundred vertices per sphere, so both matrices and quaternions still run at 60 fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.

The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,

Metal frame capture instanced spheres with matrices

The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. Matrices turned out to be 80% slower here.

Because GPUs are becoming more general purpose, it could be that matrices have no real advantage anymore, since the matrix version of the shader actually performs more multiplications and additions per vertex (it multiplies three 4×4 matrices before transforming the position). Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one less float4 per object), we managed to increase the cache hit rate. Every GPU behaves slightly differently, so it’s better to run an empirical test like this to check the real behaviour of your code.
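To put rough numbers on the arithmetic argument, here is a naive operation count for the two vertex shaders shown earlier, written as compile-time C++ so the sums are checked by the compiler. It is only a back-of-the-envelope sketch. It assumes the shader compiler evaluates every product per vertex exactly as written, and it ignores fused multiply-adds, the negation in quatInv, the zero w in float4(v, 0), and all memory traffic. OpCount and the constant names are made up for this example.

```cpp
// Naive multiply/add counts; hypothetical names, for illustration only.
struct OpCount { int mul; int add; };

constexpr OpCount operator+(OpCount a, OpCount b) {
    return OpCount{ a.mul + b.mul, a.add + b.add };
}
constexpr OpCount times(int n, OpCount c) {
    return OpCount{ n * c.mul, n * c.add };
}

// float4x4 * float4x4: 16 outputs, each a 4-term dot product (4 mul, 3 add).
constexpr OpCount kMat4xMat4 = { 64, 48 };
// float4x4 * float4: 4 outputs, each a 4-term dot product.
constexpr OpCount kMat4xVec4 = { 16, 12 };
// quatDot: scalar part (4 mul, 3 add/sub) plus vector part
// (cross: 6 mul, 3 sub; two scalar*vector: 6 mul; two vector adds: 6 add).
constexpr OpCount kQuatDot = { 16, 12 };
// Transform * v: scale (3 mul), two quatDot calls, translate (3 add).
constexpr OpCount kTransformApply = OpCount{ 3, 3 } + times(2, kQuatDot);

// Matrix shader: projection * view * world, then * position.
constexpr OpCount kMatrixPath = times(2, kMat4xMat4) + kMat4xVec4;
// Quaternion shader: projection * view, Transform * v, then * position.
constexpr OpCount kQuaternionPath = kMat4xMat4 + kTransformApply + kMat4xVec4;

static_assert(kMatrixPath.mul == 144 && kMatrixPath.add == 108, "matrix path");
static_assert(kQuaternionPath.mul == 115 && kQuaternionPath.add == 87, "quaternion path");
```

Under those assumptions the quaternion path needs fewer operations per vertex, but a count like this says nothing about what the hardware actually does, so the measured timings are the ones to trust.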

Performance comparison in the CPU

Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,

Metal draw cost with matrices

The updateBuffers function takes 5.4% of the CPU time, mostly taken in converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through,

Metal draw cost with quaternions

As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,

Metal draw cost with quaternions

Extra thoughts on performance

More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,

Both versions still run at 60 fps, but we are now issuing 240 draw calls, one per cube. While the CPU was around 20% usage in the instanced version of the quaternions, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉

Note that we could do some extra memory optimization with matrices, if we just send the first 3 rows, which are enough to represent an affine transformation (but not a projection). This is a common optimization, and shader languages support operations with float3x4 matrices because of this. But if we are talking about just rotations, it is still more memory-efficient to send a quaternion, which is a float4, instead of a float3x3 matrix (which, for memory alignment reasons, sometimes becomes a float3x4).
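The “first 3 rows” trick can be sketched on the CPU like this. The row-major layout, with the translation in the last column, and the names Mat4x4, Affine3x4 and dropLastRow are assumptions for the example; the actual layout depends on your shader language and conventions.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Hypothetical row-major layouts, translation stored in the last column.
struct Mat4x4 { float rows[4][4]; };

struct Affine3x4 {
    float rows[3][4];
    // The missing fourth row is implicitly (0, 0, 0, 1).
    Vec3 transformPoint(Vec3 p) const {
        return Vec3{
            rows[0][0] * p.x + rows[0][1] * p.y + rows[0][2] * p.z + rows[0][3],
            rows[1][0] * p.x + rows[1][1] * p.y + rows[1][2] * p.z + rows[1][3],
            rows[2][0] * p.x + rows[2][1] * p.y + rows[2][2] * p.z + rows[2][3]
        };
    }
};

static_assert(sizeof(Mat4x4) == 16 * sizeof(float), "16 floats");
static_assert(sizeof(Affine3x4) == 12 * sizeof(float), "12 floats");

Affine3x4 dropLastRow(const Mat4x4& m) {
    // Only valid for affine matrices; a projection would lose information.
    assert(m.rows[3][0] == 0.0f && m.rows[3][1] == 0.0f &&
           m.rows[3][2] == 0.0f && m.rows[3][3] == 1.0f);
    Affine3x4 a;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 4; ++c)
            a.rows[r][c] = m.rows[r][c];
    return a;
}
```

That brings a full affine transform down to 12 floats, the same size as the Transform struct used in this article, while a rotation alone still fits in the 4 floats of a quaternion.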

On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that does the conversion to matrices. And the only matrix we will need to keep will be the Projection matrix.

Conclusion

Our initial preconception that matrices were better for the shader world was wrong. Using quaternions in the GPU is actually faster than using matrices on a modern GPU like Apple’s A8 chip. The memory footprint will also get reduced and the chances of finding our data in the cache will increase.

Moreover, if we eliminate the quaternion-to-matrix conversions, not only will the code get simpler and tidier, but we’ll also save several precious CPU cycles.

But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!

Metal with Swift

Metal (not Metail) is a low-level API from Apple that combines the roles of OpenGL (graphics) and OpenCL (compute) in a single interface. The purpose of introducing their own API was mainly to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan, or Microsoft’s DX12, but specifically targeted at Apple hardware.

Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.

In this article I’m going to introduce Metal with a small example where all the data updates happen in the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully, it will spark your interest and you will dig into the references for extensive documentation 😉

Procedural rain example

I’ve written a small demo that should look like rain,

It draws and updates thousands of 2D lines at 60 fps on an iPhone 6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2 ms.

You can find all the code here: https://github.com/endavid/metaltest

Getting started

To get started with Metal you will need a Metal-ready device and Xcode. In Xcode, just create a new project and select

  • iOS Application: Game

  • Language: Swift

  • Game technology: Metal

This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t understand Metal. The triangle data in the example is triple-buffered, so you can update it in the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,

// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
    dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
    // updates in CPU cycles
    self.update()
    // register completion callback
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
        if let strongSelf = self {
            dispatch_semaphore_signal(strongSelf.inflightSemaphore)
        }
        return
    }
    // draw stuff
    // ...
    commandBuffer.commit()
}

Some interesting Swift notes:

  • You can omit the parentheses when the last argument of the function you are calling is a closure (Swift’s name for a lambda). You can still write ‘addCompletedHandler(myFunction)’.

  • The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the closure. Otherwise, we could have a cyclic reference and leak memory.

  • Because the reference is now weak, it basically becomes an optional (something that could be nil). The ‘if let x = optional’ construct is used to unwrap the optional when it’s not nil.
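If Swift is new to you, the same pattern may be easier to follow in a more familiar language. Below is a loose C++ analogy of the synchronization code above, with no Metal involved. Semaphore, Renderer and beginFrame are hypothetical names, and std::weak_ptr plays the role of Swift’s weak reference. It’s a sketch of the idea, not a translation of the template.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>

// Minimal counting semaphore (hypothetical helper).
class Semaphore {
public:
    explicit Semaphore(int count) : count_(count) {}
    // Blocks until a slot is free, then takes it.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
        assert(count_ >= 0);
    }
    // Frees a slot and wakes up one waiter, if any.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }
    int available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};

// The renderer must be owned by a std::shared_ptr for shared_from_this().
class Renderer : public std::enable_shared_from_this<Renderer> {
public:
    Renderer() : inflight(3) {}  // at most 3 frames in flight
    // Takes a slot and returns the handler to call when the frame completes.
    std::function<void()> beginFrame() {
        inflight.wait();
        std::weak_ptr<Renderer> weakSelf = shared_from_this();
        return [weakSelf] {
            // Like "if let strongSelf = self": only signal if still alive.
            if (auto strongSelf = weakSelf.lock()) {
                strongSelf->inflight.signal();
            }
        };
    }
    Semaphore inflight;
};
```

After three calls to beginFrame() the next one would block until a completion handler runs, which is how the CPU is kept from getting more than three frames ahead. If the renderer has been destroyed by the time a handler fires, the handler simply does nothing.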

Preparing Metal objects

These are the things you need to prepare in order to render something on screen:

  • Resources: data buffers and textures.

  • States: render pipeline state and depth-stencil state.

  • Descriptors: definitions that describe the objects above. This includes your shader code.

  • Render Command Encoder: the stuff that converts API commands into hardware commands.

  • Command Buffer: it’s where you store your commands that are eventually committed to the GPU.

  • Command Queue: where you queue an ordered list of command buffers.

I assume you are more or less familiar with how a typical graphics pipeline works, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing in the GPU.

I’ll explain the shader code later, but for now you just need to know that you can access your shader functions very easily using a shader library,

let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!

“updateRaindrops” is the name of the function in the shader code.

You can create a render state without a fragment program. Your vertex shader can be used to modify any arbitrary buffer, without the need of specifically creating a compute shader.

let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat

With that descriptor now we can create the state. Note that this is done only once,

do {
    try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
    try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
    print("Failed to create pipeline state, error \(error)")
}

Notice that in Swift, the “try” keyword is used for every expression that can throw an error. If we are happy with an optional value, we can remove the do-catch and use “try?”,

let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)

Now we need a data buffer. Metal is designed for the A7 chip unified memory system, so both the CPU and the GPU can share the same storage. We will need to care about synchronization, but in this example the raindrops will be updated and read only in the GPU.

// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
            2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"

And now that you have everything ready, we can “draw stuff” in drawInMTKView,

// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
       currentDrawable = view.currentDrawable
{
    // setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
    // Check alignment in setVertexBuffer doc
    let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
    let uniformOffset = numberOfUniforms * sizeof(Float)
    let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
    renderEncoder.label = "render encoder"
      
    // The drawing phase is a simple shader that draws lines in 2D
    // DebugGroup labels are for debugging during frame capture.
    renderEncoder.pushDebugGroup("draw rain")
    renderEncoder.setRenderPipelineState(pipelineState)
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    renderEncoder.drawPrimitives(.Line, vertexStart: 0, 
            vertexCount: vertexCount, instanceCount: 1)
    renderEncoder.popDebugGroup()

    // update particles in the GPU            
    renderEncoder.pushDebugGroup("update raindrops")
    renderEncoder.setRenderPipelineState(updateState)
    // this is where we read the particles from
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    // this is where we write the updated particles 
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
    renderEncoder.setVertexBuffer(uniformBuffer,
            offset: uniformOffset * syncBufferIndex, atIndex: 2)
    // noiseTexture contains random numbers
    renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
    // every particle is treated as a point, but we aren't rendering anything on screen
    renderEncoder.drawPrimitives(.Point, vertexStart: 0, 
            vertexCount: particleCount, instanceCount: 1)
    renderEncoder.popDebugGroup()
    renderEncoder.endEncoding()
            
    commandBuffer.presentDrawable(currentDrawable)
}
    
// syncBufferIndex matches the current semaphore-controlled frame index
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
    
commandBuffer.commit()

And that’s all! You don’t need to do anything else on the CPU 🙂
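The only slightly tricky part of the code above is the double buffer: a single MTLBuffer holds two copies of the particles, and the read and write halves swap every frame. Here is that index arithmetic in isolation, as a small C++ sketch. PingPongOffsets and pingPong are names I’ve made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Byte offsets of the half we read from and the half we write to.
struct PingPongOffsets { std::size_t read; std::size_t write; };

// halfSize: bytes taken by one copy of the particles
//           (maxNumberOfRaindrops * sizeOfLineParticle in the article).
// index:    the current doubleBufferIndex, either 0 or 1.
PingPongOffsets pingPong(std::size_t halfSize, unsigned index) {
    assert(index < 2);
    return PingPongOffsets{ halfSize * index, halfSize * ((index + 1) % 2) };
}
```

With 1000 raindrops of 32 bytes each, frame 0 reads at offset 0 and writes at offset 32000, and frame 1 does the opposite. This way the update never overwrites the particles that are being drawn in the same frame.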

Writing shader code

Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say bye to the runtime nightmares of OpenGL ES.

Let’s jump directly to the raindrop update function,

#include <metal_stdlib>
using namespace metal; // texture2d and friends live in the metal namespace
struct LineParticle
{
    float4 start;
    float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2

// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
                        constant LineParticle* particle  [[ buffer(0) ]],
                        device LineParticle* updatedParticle  [[ buffer(1) ]],
                        constant Uniforms& uniforms  [[ buffer(2) ]],
                        texture2d<float> noiseTexture [[ texture(0) ]])
{
    LineParticle outParticle;
    float4 velocity = float4(0, -0.01, 0, 0);
    outParticle.start = particle[vid].start + velocity;
    outParticle.end = particle[vid].end + velocity;
    if (outParticle.start.y < -1) {
       outParticle.end.y = 1;
       outParticle.start.y = outParticle.end.y + 0.1;
    }
    updatedParticle[vid] = outParticle;
}

I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the end of the screen. Check the full source for the full update, with some simple bouncing on the ground and obstacles, and resetting to a random position.
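If you want to step through that logic in a debugger, the simplified update is easy to mirror on the CPU. The following C++ sketch does the same thing as the shader above, with a loop in place of the per-vertex invocation. Float4 is a stand-in I’m assuming for Metal’s float4, and the two vectors play the role of the two halves of the double buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };  // stand-in for Metal's float4
struct LineParticle {
    Float4 start;
    Float4 end;
};
static_assert(sizeof(LineParticle) == sizeof(float) * 4 * 2, "matches sizeOfLineParticle");

// CPU mirror of the simplified updateRaindrops shader: one loop iteration
// per vertex id. Input and output must be different buffers.
void updateRaindrops(const std::vector<LineParticle>& particle,
                     std::vector<LineParticle>& updatedParticle) {
    assert(&particle != &updatedParticle);
    assert(particle.size() == updatedParticle.size());
    const float velocityY = -0.01f;  // constant velocity pointing downwards
    for (std::size_t vid = 0; vid < particle.size(); ++vid) {
        LineParticle out = particle[vid];
        out.start.y += velocityY;
        out.end.y += velocityY;
        // Once the whole line is below the screen, move it back to the top.
        if (out.start.y < -1.0f) {
            out.end.y = 1.0f;
            out.start.y = out.end.y + 0.1f;
        }
        updatedParticle[vid] = out;
    }
}
```

A raindrop in the middle of the screen just moves down by 0.01 per update, while one whose start has fallen below -1 is moved back to the top edge.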

The “constant” and “device” keywords are address space qualifiers. “constant” refers to read-only buffer memory objects that are allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the device memory pool that are both readable and writeable.

Handling Metal errors

In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the OpenGL error status all the time just to get error messages, cluttering your code with those error queries all over the place. Plus, the error messages were usually hard to decipher.

This is an example error in Metal,

MTLPixelFormatRG16Unorm is compatible with texture data types type(s) (
    float
).'

I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)

Because in the shader I had: texture2d<half> noiseTexture [[ texture(0) ]]

The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.

Frame captures

The frame capture in Xcode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.

Let’s inspect a frame,

Frame Capture in Xcode


On the left side, you can see all the commands. There are only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact performance. The debug group labels are shown as folders, and you can see the timings for each one. Or you can expand them and see the details. The particle update takes 183 microseconds. Surely faster than if we had linearly looped through the buffer and updated the particles on the CPU 😉

You can expand each command to see the call stack and jump to the CPU code.

You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.

You can re-write the shader code there, and click the “Update Shaders” icon to re-compile them and re-run the frame with the updated shaders.

It’s really powerful and easy to use.

Conclusion

If you are developing on iOS or macOS and into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL, and it outperforms OpenGL by removing redundant state changes and making definitions more static.

If you like graphics programming, but you never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax, really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.

Check the references below for details.

References