Performance of quaternions in the GPU

Introduction

Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation with a struct like the one below, instead of a generic 4×4 matrix:

struct Transform {
  var position = float3(0, 0, 0)
  var scale    = float3(1, 1, 1)
  var rotation = Quaternion()
}

However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU. The conversion for the struct above looks like this:

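func toMatrix4() -> float4x4 {
  // rotation as a 4×4 matrix; rm[i] is its i-th column
  let rm = rotation.toMatrix4()
  // scale each rotation column (R * S), then add the translation as the last column
  return float4x4([
    scale.x * rm[0],
    scale.y * rm[1],
    scale.z * rm[2],
    float4(position.x, position.y, position.z, 1.0)
  ])
}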
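
The `rotation.toMatrix4()` call isn't shown here. As a rough sketch of what such a conversion typically does, here is a C++ version of the whole Transform-to-matrix step. It assumes a unit quaternion stored as (x, y, z, w) and column-major storage; `Float3`, `Quat`, `Mat4` and this free-standing `toMatrix4` are illustrative names, not the demo's actual code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the post's types (not the demo's actual code).
struct Float3 { float x, y, z; };
struct Quat { float x, y, z, w; };                 // w is the scalar part
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[column][row]

// Build the column-major translation * rotation * scale matrix.
Mat4 toMatrix4(Float3 position, Float3 scale, Quat q) {
    // The formula below is only valid for unit quaternions.
    assert(std::fabs(q.x*q.x + q.y*q.y + q.z*q.z + q.w*q.w - 1.0f) < 1e-4f);
    float xx = q.x*q.x, yy = q.y*q.y, zz = q.z*q.z;
    float xy = q.x*q.y, xz = q.x*q.z, yz = q.y*q.z;
    float wx = q.w*q.x, wy = q.w*q.y, wz = q.w*q.z;
    Mat4 m{};
    // Each rotation column is scaled by the matching scale component (R * S).
    m[0] = { scale.x*(1 - 2*(yy+zz)), scale.x*(2*(xy+wz)),     scale.x*(2*(xz-wy)),     0 };
    m[1] = { scale.y*(2*(xy-wz)),     scale.y*(1 - 2*(xx+zz)), scale.y*(2*(yz+wx)),     0 };
    m[2] = { scale.z*(2*(xz+wy)),     scale.z*(2*(yz-wx)),     scale.z*(1 - 2*(xx+yy)), 0 };
    // The last column carries the translation.
    m[3] = { position.x, position.y, position.z, 1 };
    return m;
}
```

This is the work that has to run on the CPU for every object whenever its Transform changes, before the matrix can be uploaded.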

The reason for this conversion is usually twofold:

GPUs have native support for matrices, making them the natural choice when thinking about performance;
In traditional pipelines, we only worried about the final projected position of a vertex, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), making the transformation of vertices in the GPU really cheap.

Growing shader complexity

Of the two reasons stated earlier, the second one barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.

The Projection and View matrices are only set once per camera or viewport, while the World matrix is set per object or instance being drawn. The vertex shader will look like this:

float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);

If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader needs to do some extra operations:

Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);
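
To put a number on the memory saving mentioned above, here is a small C++ sketch that mirrors the two per-instance layouts on the CPU side. It assumes every field is padded to a 16-byte float4, which is how the Metal struct in this post lays it out; `Float4`, `TransformData`, `MatrixData` and `bytesSaved` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side mirror of a 16-byte aligned float4.
struct alignas(16) Float4 { float x, y, z, w; };

struct TransformData {   // position + scale + rotation
    Float4 position;     // only xyz used
    Float4 scale;        // only xyz used
    Float4 rotation;     // unit quaternion
};

struct MatrixData {      // a full 4x4 world matrix
    Float4 columns[4];
};

static_assert(sizeof(TransformData) == 48, "3 x float4");
static_assert(sizeof(MatrixData) == 64, "4 x float4");

// Bytes saved across a whole instance buffer when sending Transforms.
constexpr std::size_t bytesSaved(std::size_t instanceCount) {
    return instanceCount * (sizeof(MatrixData) - sizeof(TransformData));
}
```

With this padding the saving is exactly one float4 (16 bytes) per instance; a tighter packing of position and scale would save more.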

The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).

/// Quaternion inverse
float4 quatInv(const float4 q) {
 // assume it's a unit quaternion, so the inverse is just the conjugate
 return float4( -q.xyz, q.w );
}
/// Quaternion multiplication (Hamilton product; not a dot product, despite the name)
float4 quatDot(const float4 q1, const float4 q2) {
 float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
 float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
 return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector): q * v * q^-1
float3 quatMul(const float4 q, const float3 v) {
 float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
 return r.xyz;
}
// declared after the helpers above, because operator* calls quatMul
struct Transform {
 // for alignment reasons, position and scale are float4
 float4 position; // only xyz actually used
 float4 scale;    // only xyz actually used
 float4 rotation; // unit quaternion; w is the scalar
 // scale first, then rotate, then translate
 float3 operator* (const float3 v) const {
   return position.xyz + quatMul(rotation, v * scale.xyz);
 }
};
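
One way to gain some confidence in the shader math is to port it to the CPU and compare it against an independent formula. The sketch below does that in C++: `rotateSandwich` mirrors the q * v * q^-1 product used in the shader, while `rotateExpanded` uses the well-known expanded identity v + 2 q.xyz × (q.xyz × v + w v), which the demo does not use and which I only include as a cross-check. `Vec3`, `Quat` and both function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { Vec3 v; float w; };  // v = xyz (vector part), w = scalar part

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(float s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Hamilton product, mirroring the shader's quatDot.
Quat quatDot(Quat a, Quat b) {
    return {cross(a.v, b.v) + a.w * b.v + b.w * a.v, a.w * b.w - dot(a.v, b.v)};
}

// Sandwich product q * v * q^-1, mirroring the shader's quatMul.
Vec3 rotateSandwich(Quat q, Vec3 v) {
    Quat inv{-1.0f * q.v, q.w};  // conjugate equals inverse for unit quaternions
    return quatDot(q, quatDot(Quat{v, 0.0f}, inv)).v;
}

// Expanded form of the same rotation (assumes q is a unit quaternion).
Vec3 rotateExpanded(Quat q, Vec3 v) {
    return v + 2.0f * cross(q.v, cross(q.v, v) + q.w * v);
}
```

Both functions should agree up to floating-point error for any unit quaternion. The expanded form needs two cross products instead of two full quaternion products, so it could be worth profiling in the shader as well, although I haven't measured it here.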

Let’s see if this is any slower than matrices with an example.

Rotating cubes demo

I’ve created this demo of rotating cubes to measure the performance of using quaternions on a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone 6.

The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!

The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending on the amount of area covered by the cubes as they rotate, but I placed them close together so the screen is always densely covered).

The code of the demo can be found here:

instanced-cubes-matrices – This is the version using matrices.
instanced-cubes-quaternions – This is the version using quaternions and Transforms.

Performance comparison in the GPU

Both versions run at 60 fps on an iPhone 6. This is a frame capture of the version that uses matrices:

The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.

For a proper comparison, we need to make this example vertex-bound, so I’ve prepared another example with spheres instead of cubes:

The tessellation level can be increased at compile time. In the video, there are only a few hundred vertices per sphere, so both matrices and quaternions still run at 60 fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.

instanced-spheres-matrices – This is the version using matrices.
instanced-spheres-quaternions – This is the version using quaternions and Transforms.
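
As a sanity check on those vertex counts: 2562 happens to be the number of vertices of an icosahedron subdivided four times (10 · 4^n + 2 with shared vertices). The post doesn't say how the spheres are generated, so treat the icosphere part as my assumption; the helper names below are hypothetical too. I also assume the sphere demo keeps the same 240 instances as the cubes.

```cpp
#include <cassert>
#include <cstdint>

// Vertex count of an icosahedron subdivided `level` times, with shared
// vertices: 10 * 4^level + 2. (Assumes the spheres are icospheres.)
constexpr std::uint64_t icosphereVertexCount(unsigned level) {
    return 10ull * (1ull << (2 * level)) + 2;
}

// Total vertices processed per frame when drawing `instances` copies.
constexpr std::uint64_t totalVertices(std::uint64_t instances, unsigned level) {
    return instances * icosphereVertexCount(level);
}

static_assert(icosphereVertexCount(0) == 12, "a plain icosahedron");
static_assert(icosphereVertexCount(4) == 2562, "matches the per-sphere count");
```

With 240 instances at level 4 that gives 614,880 vertices, which is where the "around 600K" figure comes from.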

The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,

The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. That’s a ratio of 82.28 / 46.10 ≈ 1.78, so matrices turned out to be almost 80% slower here.

Because GPUs are becoming more general purpose, it could be that matrices have no real advantage anymore, since the matrix version actually performs more multiplications and additions per vertex (its shader chains two 4×4 matrix products where the quaternion one has only one). Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one less float4 per object), we make better use of the cache. Every GPU will behave slightly differently, so it’s better to run an empirical test like this to check the real behaviour of your code.

Performance comparison in the CPU

Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,

The updateBuffers function takes 5.4% of the CPU time, most of it spent converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through:

As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,

Extra thoughts on performance

More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,

cubes-demo-matrices – This is the version using matrices, with no instancing.
cubes-demo-quaternions – This is the version using quaternions, with no instancing.

Both versions still run at 60 fps, but we are now issuing 240 draw calls, one per cube. While the CPU was at around 20% usage in the instanced version of the quaternions, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉

Note that we could do some extra memory optimization with matrices if we just send the first 3 rows, which is enough to represent an affine transformation (but not a projection). This is a common optimization, and shader languages support operations with float3x4 matrices because of it. But if we are talking about just rotations, it is still more memory-efficient to send a quaternion, which is a float4, instead of a float3x3 matrix (which, for memory alignment reasons, sometimes becomes a float3x4).
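
To make the 3-row idea concrete, here is a minimal C++ sketch of an affine matrix stored as three rows of four floats, with the fourth row (0, 0, 0, 1) left implicit. `Affine3x4`, `Row4` and `transformPoint` are names I made up for the example; in a shader you would use the language's own 3×4 matrix type instead.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };
struct Row4 { float x, y, z, w; };

// Hypothetical 3x4 affine matrix: three rows; the 4th row (0, 0, 0, 1) is implicit.
struct Affine3x4 {
    Row4 rows[3];
};

static_assert(sizeof(Affine3x4) == 12 * sizeof(float), "12 floats instead of 16");

// Transform a point: each output component is dot(row.xyz, p) + row.w.
Vec3 transformPoint(const Affine3x4& m, Vec3 p) {
    auto apply = [&](const Row4& r) {
        return r.x * p.x + r.y * p.y + r.z * p.z + r.w;
    };
    return {apply(m.rows[0]), apply(m.rows[1]), apply(m.rows[2])};
}
```

Note that at 12 floats this layout is the same size as the padded Transform struct shown earlier, so against a 3×4 matrix the memory advantage of Transforms only shows up if position and scale are packed more tightly, or if only the rotation is sent.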

On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that does the conversion to matrices. And the only matrix we will need to keep will be the Projection matrix.
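
A view transformation is the inverse of the camera's own Transform, so one way to do this is to apply the Transform backwards: undo the translation, rotate by the conjugate quaternion, then undo the scale. The C++ sketch below shows the idea under a few assumptions: the rotation is a unit quaternion, no scale component is zero, and `applyInverse` plus the helper types are hypothetical names rather than anything from the demo. The `rotate` helper uses an expanded form of q * v * q^-1.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { Vec3 v; float w; };  // v = xyz (vector part), w = scalar part

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(float s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Rotate p by the unit quaternion q (expanded form of q * p * q^-1).
Vec3 rotate(Quat q, Vec3 p) {
    return p + 2.0f * cross(q.v, cross(q.v, p) + q.w * p);
}

struct Transform {
    Vec3 position;
    Vec3 scale;
    Quat rotation;

    // Local to world: scale, then rotate, then translate.
    Vec3 apply(Vec3 p) const {
        return position + rotate(rotation, {p.x * scale.x, p.y * scale.y, p.z * scale.z});
    }

    // World to local (e.g. world to view for a camera): the steps in reverse.
    Vec3 applyInverse(Vec3 p) const {
        Quat inv{-1.0f * rotation.v, rotation.w};  // conjugate of a unit quaternion
        Vec3 r = rotate(inv, p - position);
        return {r.x / scale.x, r.y / scale.y, r.z / scale.z};
    }
};
```

For a camera the scale is normally (1, 1, 1), in which case the divisions can be dropped.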

Conclusion

Our initial preconception that matrices were better for the shader world was wrong, at least in this test. In the vertex-bound example, using quaternions in the GPU was actually faster than using matrices on a modern GPU like Apple’s A8 chip. The memory footprint also gets reduced, which increases the chances of finding our data in the cache.

Moreover, if we eliminate the quaternion-to-matrix conversions, not only will the code get simpler and tidier, but we’ll also save several precious CPU cycles.

But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!

Metail Tech
