Introduction

I’ve always loved colour, and I’ve always been a bit geeky about it. Other kids would argue about football, but in my circle we’d go: “my Amiga 500 has 4096 colours, and yours has only 16”. And we counted the days for those 4096 to become 262144. But colour is not just numbers. And when now someone sends me an RGB triplet saying, “this colour: 146, 27, 98”, my brain just short-circuits. That’s not a colour, and I’ll explain why later. Colour and colour spaces are hard topics, and the more you dig into them, the more complex it gets, and the uglier the truth becomes.

Some of the topics around colour, like colour perception in humans, are still hot areas of research. They sometimes even become mainstream discussions in bars, like the white/orange dress meme a couple of years ago. But I won’t talk about colour perception today.

Instead, I will focus on a less mainstream but more technical topic, which is sadly neglected more often than I’d like: colour spaces. I’ll try to summarize the different concepts around colour spaces as briefly as I can, and then talk about a particular colour space that you may want to start using in your mobile apps: the Display P3 colour space.

Why do we see colour?

I’m not going to define what colour is. You probably know what it is even without a formal definition. Instead, I think it’s more useful to explain how colour is created.

We see colour because our eyes have photoreceptor cells, sensitive to different wavelengths of light. These photoreceptors are rods and cones in mammalian eyes, and ommatidia in arthropods. Although photoreceptors themselves do not signal colour, but only the presence of light in the visual field, the signals from the cones are used by the visual system to work out the colour. Here’s a figure of human photoreceptor absorbances for different wavelengths,

Color Sensitivity

(Source: Wikipedia)

In terms of number of photoreceptors, you can say you see more colours than your cat does (they discern at least 3 or 4 colours), but a mantis shrimp sees many more colours than you (they have 16 types of cones). But also, the colours that you see may not be exactly the same I see. And it is also worth mentioning that luminance is orthogonal to colour, so a cat can see much better in the dark than we do.

Colour Venn Diagrams

Colour Spaces and Colour Models

These two terms get messed up together more often than not. A Colour Space organizes colours so they are reproducible in physical devices (e.g. sRGB, Adobe RGB, CIE 1931 XYZ), whereas a Colour Model is an abstract mathematical model describing the way colours can be represented as tuples of numbers (e.g. RGB, CMYK).

That distinction is really important. I often hear people telling each other random RGB tuples to communicate colours, and I have to assume those tuples are in sRGB colour space, with the gamma already applied. But even the gamma may change from system to system, so those numeric values really don’t tell me anything unless you specify a colour profile as well.

Another very important thing to remember is that there’s no single RGB colour space! Although in most desktop applications we use sRGB, cameras may use Adobe RGB because they need a wider gamut.

XYZ Colour Space and Colour Conversions

XYZ Colour Space is a device-invariant representation of colour. It uses primary colours that aren’t real, i.e., that can’t be generated in any light spectrum. That means we can even represent “imaginary colours”, that is, colours with no physical reality.

In XYZ, Y is the luminance, Z is the blue stimulation, and X is a linear mix. It is also very common to normalize this space by the luminance to obtain the xyY colour space. You may have seen the typical horseshoe shape from the xy chromacity space before,

CIE 1931 xy chromaticity space

CIE 1931 xy chromaticity space

The horseshoe shape is the gamut of human vision, that is, the colours that we can see. The curved edge is the spectral locus and represents monochromatic light, in nanometres. Notice that the straight line at the bottom, or line of purples, are colours that can’t be represented by monochromatic light. Read about CIE 1931 Colour Space to find more details.

The important thing to remember is that these diagrams are useful to visualize the gamut of different devices and the colour conversions that happen between them. For instance, in the example below, the green dot from an Adobe RGB Camera needs to be flattened down to a more yellowish green in order to be displayed in a laptop display. Note that in an RGB colour model, both values may look identical, e.g. (0, 255, 0), but they aren’t the same. This conversion may be irreversible if we aren’t careful. When printing this green, we want to go back to the closest match between the original green colour, and the greens that the printer can represent, not the yellowish green from the display.

Colours Across Gamuts

(The image above is taken from Best Practices for Color Management in OS X and iOS, another recommended read.)

DCI-P3 Colour Space

The most common colour space for displays is sRGB. However, recent OLED displays have a wider gamut, so we need a better colour space to make the most out of them. DCI-P3 is a wide gamut RGB colour space. Apple calls it Display P3. Because the gamut is wider, you will need at least 16-bit per channel in order to store colours in P3. So if you are storing values as integers, instead of having maximum values of 255 per colour channel, now it will be 65535.

In order to visualize the differences between P3 and sRGB, I recommend using Apple’s ColorSync utility, which comes with macOS. This tool also has great colour calculators included that will help you understand all the different concepts from this blog post. It’s very simple to create a visualisation like the one below using that tool. This figure compares P3 and sRGB gamuts, plotted in L*a*b* colour space (close to human perception).

P3 vs sRGB

P3 vs sRGB in L*a*b* plot

Apple recommends the use of Display P3 for newer devices in its Human Interface Guidelines, so if you are developing a website or an app for iOS and/or macOS, it’s worth updating your authoring pipeline to use wide colour in every stage.

Most of MacOS and iOS SDK supports Display P3 already, with the exception of some frameworks like SpriteKit. The UIColor class has an initializer for displayP3. If you need to do the conversion yourself, I’ve written a couple of posts on how to compute (Exploring Display P3) and test (Stack Overflow). It boils down to this matrix that you can apply to your linear RGB colours (before applying the gamma) to convert from P3 to sRGB,

 1.2249  -0.2247  0
-0.0420   1.0419  0
-0.0197  -0.0786  1.0979

I’ve written a battery of colour conversion unit tests here: Colour Tests.

How much wider DCI-P3 is?

According to Wikipedia, the DCI-P3 colour gamut is 25% larger than sRGB. According to my accounts, it’s approximately 39% bigger. I’ve counted all the 24-bit samples in linear Display P3 RGB (16M) that fall out of sRGB, and that accounts for approximately 4.5M (~28%).

I’ve also tried visualising these differences in different ways. I ended up creating an iOS app, Palettist, to help me create colour palettes with P3 colours that fall outside the sRGB gamut. The result is some discrete palettes where each square is in P3, and a circle inside it with the same colour clamped to sRGB. Here’s one of such palettes,

DisplayP3-only Palette

DisplayP3-only Palette

Depending on where you are reading this, you may or may not see the circles. More details in this blog post: Display P3 vs sRGB. If you have a modern iOS device, try downloading this palette, and uploading it to, say, Instagram. You will see the circles, but the moment you click “Next”, all colours will look duller and the circles will disappear (you don’t need to actually post it; Instagram converts it to sRGB before uploading). Please try to use these palettes to test if an app supports P3 or not.

Rendering intent

If you see circles in the colour palette I posted above, but you are sure your display is sRGB, it could be that the colour management in your OS is trying its best by applying a rendering intent before displaying the image. The common modes are these two:

  • Relative Colorimetric intent: clamp values out of gamut to the nearest colour. This causes posterization (you won’t see the circles).

  • Perceptual intent: blindly squash the gamut of the image to fit the target colour space. This reduces saturation & colour vibrancy (but you’ll see the circles). We say “blindly” because even if it’s just one pixel that’s out of gamut, it will cause the whole image to shift colour… The amount of compression will be shown in the ICC profile

There are other modes, like Absolute Colorimetric intent and Saturation intent. Check this article for details: Understanding Rendering Intents.

A note about gamma

Gamma correction alone deserves a separate blog post… But it’s important to emphasize here that when people give you an RGB triplet like (181, 126, 220) (Lavender), not only do they mean it’s in sRGB D65 (there’s more than one sRGB, depending on the white point), but also they mean the gamma correction – an exponential function – has already been applied.

Why do we apply gamma? This is because equal steps in encoded luminance correspond roughly to subjectively equal steps in brightness. Read Gamma Correction.

Gamma values

If you only have 8 bits to store a luminance value, you better store it with the gamma applied, so you lose less perception-valuable information. However, if you are into 3D graphics, remember that light computations should happen in linear space!

The final decision: choosing a colour space

This is my small personal guide to choosing a colour space, depending on the occasion:

  • L*a*b* for Machine Learning, because the Euclidean distance between 2 colours in L*a*b* is closer to perceptual distances.

  • RGB colour spaces

    • Linear RGB if you are working with light (3D graphics), because light can be added linearly;

    • DCI-P3 if you target newer screens, because you can represent more colours; sRGB if you can only afford 8-bit per channel – make sure the gamma is applied to avoid banding artefacts in dark colours (the eye is more sensitive to differences in dark areas)

For the colour model,

  • RGB, if you are doing light or alpha blending computations, you better stick to RGB;
  • HSV for design, because the representation is intuitive; and if you are colour-blind, you can adjust saturation and luminance without worrying about accidentally having changed the hue.

Summary

Thanks for reading this long blog post! To be brief, I’ve tried to summarize it all with a few bullet points:

  • Cats may dream in colour

  • Every human is unique

  • Colour Space ≠ Colour Model

  • Display-P3 > sRGB

  • ColorSync Utility is your friend

  • Use provided P3 palettes as reference

  • Choose appropriate Colour Space & Gamma for the occasion (storage, ML, 3D)

MeModel GBuffer

About skin colour authoring

Part of our MeModel development process involves skin colour matching. We have to match our 3D avatars to a photographic reference. We have attempted to do this automatically in the past, but as the lighting process became more complex, the results were no longer good and it required a lot of manual tweaking. In effect, we needed to manually author the skin colour, but writing parameters by hand and trying them out one at a time is a tedious process. That’s why we decided to create an interactive tool so we could see the result immediately and iterate quickly.

The first choice we made was the platform: the browser. If we wrote this tool for the web, then we could share it immediately with remote teams. It’s a zero-install process, and therefore painless for the user.

We wrote a prototype that would use a high-resolution 2D canvas, and transform all the pixels in simple for-each loops. However, this was far from interactive. For our images, it could take a couple of seconds per transform, not very pleasant when adjusting parameters with sliders. You could try to parallelise those pixel loops using Javascript workers, for a 2 or 3-fold speed increase. But the real beast for local parallel processing is your GPU, giving us in this case more than a 100-fold speed increase.

So we decided to make the canvas a WebGL canvas. WebGL gives you access to the GPU in your machine, and you can write small programs for it to manipulate all pixels of the image in parallel.

Quick introduction to rendering

Forward rendering

The traditional programmable rendering pipeline is something that in the computer graphics jargon is referred to as forward rendering. Here’s a visual summary,

Forward rendering pipeline

Forward rendering pipeline

Before you can render anything, you need to prepare some data buffers with your vertex positions and any parameters you may need, which are referred to as uniforms. These buffers need to be in an area of memory that your GPU can access. Depending on your hardware, that area could be the same as the main memory, or a separate graphics memory. WebGL, based on OpenGL ES 2.0 API, has a series of functions to prepare this data.

Once you have the data ready, then you have to provide two programs to the GPU, a vertex shader and a fragment shader. In OpenGL/WebGL, these programs are written in HLSL, and compiled during run time. Your vertex shader will compute the final position and colour of your vertices. The GPU will rasterize the vertices for you (this part is not programmable), which is the process of computing which pixels the given geometry will cover. Then, your fragment shader program will be used to decide the final pixel colour on screen. Notice that all the processing in both the vertex and pixel/fragment shaders is done in parallel, so we write programs that know how to handle one data point. There’s no need to write loops in your program to apply the same function to all the input data.

A traditional vertex shader

There are basically two things that we compute in the vertex shader:

  • Space transforms. This is how we find the position of each pixel on screen. It’s just a series of matrix multiplications to change the coordinate system. We pass these matrices as uniforms.
  • Lighting computations. This is to figure out the colour of each vertex. Assuming that we are using a linear colour space, it is safe to assume that, given 2 vertices, the interpolation of pixel colours that happens during rasterization is correct because irradiance is additive.
A traditional vertex shader

A traditional vertex shader

Both the space transforms and lighting computations can be expensive to compute, so we prefer doing it per vertex, not per pixel, because there are usually fewer vertices than pixels. The problem is that the more lights you try to render, the more expensive it gets. Also, there’s a limit of the number of uniforms you can send to the GPU. One solution to these issues is deferred rendering.

Deferred rendering

The idea of deferred rendering is simple: let’s defer the lighting & shading computation until a later stage. It can be summarized with this diagram,

Deferred rendering pipeline

Deferred rendering pipeline

Our vertex shader will still compute the final position of each vertex, but it won’t do any lighting computation. Instead, we will output any data that will be needed for lighting later on. That’s usually just the depth (distance from the camera) of each pixel, and the normal vectors. If necessary, we can reconstruct the full 3D position of each pixel in the image, given its depth and its screen coordinates.

As I mentioned earlier, irradiance is additive. So now we can have a texture or a buffer where to store the final irradiance value, and just loop through all the lights in the scene and keep summing the pixel values in the final texture.

Skin colour authoring tool

If you followed so far, you may see where this is going. I introduced deferred rendering as the process of deferring lighting computation to a later stage. In fact, that later stage can be done in a different machine if you wanted to. And that’s precisely what we have done. Our rendering server does all the vertex processing, and produces renders of the albedo, normals, and some other things that we’ll need for lighting. Those images will be retrieved by our WebGL application, and it will do all the lighting in a pixel shader. The renders we generate look like this,

MeModel GBuffer

MeModel GBuffer

Having these images generated by our server, the client needs to worry only about lighting equations, and we only need a series of sliders that connect directly to the uniforms we send to the shader to produce a very responsive and interactive tool to help us author the skin tones. Here’s a short video of the tool in action,

 

The tool is just about 1000 lines of pure Javascript, and just 50 lines of shader code. There are some code details in the slides here:

(These slides were presented in the Cambridge Javascript meetup)

Summary

Javascript & WebGL are great for any graphic tool (not only 3D!): being in the web means zero-install, and being in WebGL should mean it gives you interactive speeds. Also, to simplify the code of your client, remember that you don’t need to do all the rendering in the client. Just defer the things that need interaction (lighting in our case).

While working in the games industry in Japan, I attended a seminar about brainstorming. The instructor, professor Hidenori Satō, has written dozens of books on the subject. Unfortunately for many of us it seems his work has not been translated from Japanese, so here’s a brief introduction to his approach. I’ve translated the method he introduced to us as “Spark Ideas” (スパーク発想法). At the beginning of his seminar Prof. Satō led in with the following quote attributed to Thomas Edison: “Genius is one percent inspiration and ninety-nine percent perspiration”. I read this in two ways: first, even if you have ideas, they mean nothing if you don’t put in sufficient effort to realize them; second, you may have a sudden bright idea once in a while, but to generate ideas continuously, you need to make an active effort – and probably use a tool like the one I describe here.

The brainstorming process

Often we try to think of ideas directly from a theme. Unless you’re in a moment of “inspiration” this is hard. For the “perspiration” moments, we need hints, like the one Newton got from an apple falling from a tree. The best way of getting these hints is by changing the Point of View. And that’s all you need to remember! (*^ω^*)

Brainstorming process: Spark Ideas

Brainstorming process

 

You can think of the Spark method as a “cheat sheet” with a series of keywords to help you get started with your brainstorming.

Points of View for Spark Ideas

Prof. Satō lists up 5 basic Points of View (PoV) to get started with the exercise:

  1. State of affairs
  2. Point of view of the other
  3. Change character
  4. Change case
  5. Free of constraints

Working through these first five perspectives is usually enough, but there’s an extra list if you want to dig deeper:

  1. Triple ease
  2. Fun
  3. Positivation
  4. Indirection
  5. 3D expansion
  6. Similar case
  7. General case

I will describe all these points in detail later, but let’s jump first to how to do the exercise.

Brainstorming with the Spark Sheet

I recommend time-boxing the exercise. From experience, the “State of affairs” is usually the most important PoV, so expect to spend at least twice the time in that one as opposed to the others. If you need tons of ideas, you may want to attempt all 12 PoVs, though it may take too long. Even if you spend only 10 minutes per PoV, it will take at least 2 hours to finish.

If you are doing the exercise with enough people you may choose to divide them into groups. You could assign a couple of PoVs per group, with one group dedicated solely to “State of Affairs”.

Once you have allocated time, and appropriately divided people into groups, you just need paper or a whiteboard. Write the PoV and the theme on the top, and draw 3 columns. The first column are hints, that should come from the PoV. Write hints with as much detail as possible. The middle column will be for direct ideas, coming straight from the hints. These too should be as detailed as possible.

The third column is for ideas from association, things related to an idea from the middle column. These can be something that follows on in order from the initial idea, is the exact opposite of the idea, or just things that go together. It helps to have a cheat sheet with keywords on one side. Your sheet of paper will look like this:

 

Brainstorming example: Spark Sheet Pollution

Spark Sheet about Pollution

 

Keywords to get started

I’ve tried to select a few keywords for each PoV so you can get started.

(1) State of affairs

  1. Status
    1. Where are we at? Contents, outlook, flow, related work, schedule, place
    2. Domain, level, quantity, season, important factors
  2. Target
    1. Characteristics, functionality, structure, processes, elements, type
    2. Materials, size, weight, color, design, definition
  3. Self
    1. Company values, our technology, our resources, strengths & weaknesses
    2. Budget, available developers, external opportunities
  4. Main point
    1. Reason for it, difficulties
    2. Essential conditions

(2) PoV of the other

  1. Target user(s), as detailed as possible
    1. Adult A, kid B, high-school girl C, athlete, married person, old lady from the neighbourhood, a person with 2 dogs, etc.
  2. Requests/needs, correct & detailed
    1. Can we ask/have we asked users?
    2. Person status, surroundings, circumstance, specialty, personality, real thinking, new thinking, needs, values, requests, dissatisfaction, worries, likes, opinions, feeling, goals, and conditions.

(3) Change character

  1. Think of another person and write down their name.
  2. How would they do things?
    1. Way of thinking, behaviour, performance, personality, strengths
    2. Ask them directly whenever possible!
  3. Examples
    1. Close person: colleague, boss, junior staff, from another team, related, from same industry, family member, friend, someone with similar/opposite interests, acquaintance, neighbour, professor, student of a higher/lower grade
    2. Famous/historical: Buddha, Jesus, Bono, Björk, John Lennon, Messi, Trump, Picasso, Tom Daley, Tom Adeyoola, Tom Jones, Tom & Jerry
    3. Role model: an expert, specialist, experienced person, aficionado, protagonist of a story/tale

(4) Change case

  1. Think of the theme and target, and find a similar topic
  2. Write down the contents (status, method, conditions) in detail
  3. Examples
    1. Direct method (visual): picture the theme in a broader sense, and give an example from intuition. E.g. “reduce stock” → “reduce ingredients”
    2. Indirect method (logical): think of the essence of the theme, and from it give another example. E.g. “reduce stock” → “reduce unnecessary stuff” (essence) → “reduce flavour additives”
    3. Close example (change from same class): e.g. “sell cameras” → “sell computers”
    4. Far example (different class): e.g. “sell cameras” → “become famous” (sell brand)

(5) Free of constraints

  1. Ideal
    1. What is the state we want to be in? (in detail)
    2. What’s the ideal, the best situation?
    3. Write down the “ideals” as Hints, and how to realize/get close to those as Ideas.
  2. Break norms
    1. Try to break the rules: odd techniques, silly things, nonsensical, fancy, dream, insane, not common sense, innovation, daring
    2. Write down as Ideas the way you’d get there.

Extra PoVs

  1. Triple ease
    1. Low-hanging fruit: do easy things first
    2. Divide-and-conquer: divide in several parts, and assign to different people/teams
    3. Reduction: reduce the quantity or targets. Make our lives easier.
  2. Fun
    1. Make it fun or interesting; add hobbies; gamify
  3. Positivation
    1. turn upside-down; take the negatives and turn them into positives;
    2. find the positives and work on them
  4. Indirection
    1. Soften/cushion the blow;
    2. Make it indirect; mediation
  5. 3D Expansion: think of these 3 dimensions
    1. Space: expand the space, the area; change the place.
    2. Time: expand time. Think of the future and the past. Think in a longer span.
    3. Human: expand the human circle. Think of others. Get help from the crowd.
  6. Similar case
    1. Compare to similar cases
    2. Compare with cases that offer contrast
  7. General case
    1. Remove the particular case. Look at the forest, not at the tree.
    2. Think of the system

Rank the ideas

Once you’re done, you will end up with dozens of ideas. You may want to quickly eyeball the ones that lack detail or that are obviously flawed in some way and discard them to save time. Or you can focus on the ideas with many arrows coming to them. For the ideas you select to explore further, use a ranking mechanism, a simple example being the combination of impact and feasibility. For instance,

 

Theme: Do something about pollution — Best 3 ideas  Impact Feasibility Expectation (I×F) Rank
Bring leaflets to schools  2 4 8 2
Gather signatures in online petition about cigarettes 3 2 6 3
Create a game where each stage is about a pollutant 3 4 12 1

 

Hopefully your Spark Sheets will be sufficiently detailed that they will also help you to start producing a plan for those selected ideas.

 

Conclusion

Generating novel ideas from nothing is a big challenge to some of us but a bit of structure in a brainstorming session can make a huge difference.  If you’re stuck, try using a tool like this and be surprised at the volume of ideas your brains can produce. And as in so many other things, the more you practice the better you’ll get at it.

Metail provides a yearly training budget for all employees consisting of both time and money, but we found that many employees were not making the most of this opportunity. We decided to look into why this is and work on increasing the uptake. One idea we had was around hackathons – pairing people together to do small hackathons together sounds more fun than just reading a book by yourself!

One-to-ones help uncover trends

From my one-to-ones I found that the main reason people were not using the training days was because they weren’t sure what to do with them. If people were going to a conference or working toward a qualification or certification it was easy to identify the time spent on that as ‘training’. But what if you are already qualified? or there isn’t a conference on this quarter? or you want to spend some time testing out new technology?

Crew Hackathons

I came up with the idea of running some small hackathons within the crew and suggested we could use training days for these. The idea is that people will pair for a couple of days to create something new. This aligns with our company values: being in this together, actively learning, trust to deliver, and making a difference. But I also wanted to push the joy/excitement axis up a bit as well (see previous post).

Because people never want an extra meeting, we decided to schedule this as a special retrospective session. We kept the happiness axis exercise and collected a few actions based on that, but we spent most of the hour running a hackathon proposal exercise outlined below:

  • Everyone tries to write down a couple of ideas for 2-day projects they would like to work on, and spends a couple of minutes to get others excited about it.

  • Vote proposals. Everyone has 2 votes to pick a project (other than their own). Only projects with 2 or more votes survive.

The projects do not need to be directly related to work, but we should learn something from them. The idea is to spend one day together working out designs, and another day creating a prototype or something usable.

I explained the exercise a week in advance, so people had time to think of projects before the meeting.

Deciding on projects

The exercise went well and everyone seemed quite excited. It turns out that a few people had similar ideas, so we grouped some projects together. We then drew a matrix so everyone could cast their votes. This is how the whiteboard looked:

Hackathon Matrix

The top row of the matrix has the people initials, with the number of available training days written below.

We (a team of seven) decided to work on 3 projects. The projects with more votes will have a couple of hackathons associated with them – this is particularly useful if we can’t get together all at the same time. We can also start thinking at this stage if we are going to need any materials, e.g. books, that we need to buy before we get started.

Scheduling the hackathons

We have the ideas, the people, and most importantly, the excitement, so now it’s just a matter of scheduling these hackathons. If a person is working for the full 10 days in a sprint, they instantly become candidates for any of the hackathons they showed interest in. If we can find someone else interested in the same project who has enough training days available, we pair them together and schedule it in the sprint.

Some of these projects have more than two people interested – in this case we have a 1-hour meeting with everyone interested in it, to come up with a plan and decide how we’ll split the work. For instance, if it was a project that involved four developers and two different platforms, one group could work on one platform one sprint, and the other group could do the other platform the following sprint.

Conclusion

Small hackathon exercises can be helpful for people that don’t know what to do with their training days. Other people can bring ideas that suddenly open the curiosity box, and we can turn the learning exercise into a shared experience. Just as it is, it’s a valuable experience. But some of the projects can even turn into something bigger that brings additional value to the company. I think it’s probably worth running this exercise every quarter, to disconnect from your main duties and refresh a bit. If you can’t find the time to run this, just pack it inside one of your retrospectives. You can always use the happiness axis for a swifter retrospective, and move straight away into finding topics for the hackathons.

Scrum retrospectives are a great opportunity to sit down with your team and make everyone’s voice heard. It’s about collective process improvement, by getting everyone involved and owning part of that process, it’s also about feelings, and about empathizing with each other.

A typical scrum retrospective

If you have a formula that works for your team, it’s good to repeat it: your team members will know what to do without having to repeat the agenda every week. However, it can be beneficial to try different things from time to time.

The most important source of ideas is probably the one-to-one meetings. Some team members may actually find the retrospectives boring or not particularly useful, and they may have ideas to improve them. Try some of them, discard things that do not work, and keep the things that people get more involved with.

We started our retrospectives with classical good / bad clustering: we draw two axis, time on the horizontal, and goodness to badness in the vertical, and people write down 2 positive things and 2 negative things, with a number from +5 to -5, and stick the post-its on the whiteboard. Every week, a different person tries to cluster the post-it notes into different categories. Sometimes, the time scale is a good indicator of a cluster, but we usually re-cluster them into more meaningful categories. Then, that person tries to explain what went well and what went badly during the sprint, asking the relevant people to explain their tickets. The important thing is trying to identify actions based on those notes, pretty much working out the start-stop-continue from that set. However, we don’t do this exhaustively. We focus on the immediately actionable items, the biggest wins and fails.

Some suggested we were wasting too much time on this, and we tried creating a thread on Slack for every sprint where people could write down thoughts as events happened during the sprint, and others would react with emoji. The thread died out after a few sprints, and we realized it was better to think retrospectively during the allocated time slot and get physically involved, i.e., standing up and writing things down.

Happiness axis

Our company wanted to measure happiness somehow. We discussed the option of having some anonymous surveys sent regularly to measure it, but many in the team were put off by having to fill in surveys online. So I decided to do something during the retrospective time, and get people directly involved.

I’ve selected 6 feelings or axes, 3 positive ones juxtaposed with 3 negative ones. Humans are complicated and full of emotions, so I tried to pick up things that I consider actionable in the work environment. This is our list:

Positive Negative
Enjoyment – did I work on something I enjoy? Boredom – most of the stuff was tedious and/or boring
Sense of accomplishment – I got that thing done! Despair – I’m getting nowhere
Powered up – learned something useful! Powered down – I feel I’m losing my skills

I think it’s important to keep it small, though. You don’t want to model the whole brain!

During the retrospective, we draw these axes on the whiteboard. Then, everyone stands up and casts up to 3 votes on any of the axes,

  • You don’t need to use all the votes (abstentions are counted as well)

  • You can vote in opposite axes (half of the sprint was really fun, but the other half was boring)

  • Preferably, add equally-spaced ticks, so we can draw a spider graph in the end.

And this is how it looks in the end,

Scrum Retrospectives Happiness

Happiness Axis

Actions based on happiness axis

Here are some of the recipes we have for actions based on the result of the happiness axis exercise,

  • … if joy is low:

    • everyone should have at least one ticket they would enjoy working on in next sprint;

  • … if boredom is high:

    • promote team work (e.g. pair-programming), from the premise that the conversation will make tedious tasks less painful;

  • … if not powering up:

    • plan for new things in next sprint;

    • schedule training time;

  • … when powering down:

    • discuss during the retrospective and/or one-on-ones which abilities are not being put to use. Try to find a place for them;

    • reduce time spent in repetitive tasks;

  • … when there’s no sense of accomplishment:

    • create smaller tickets with a well-defined goal;

    • try a “Demo-Driven Development” approach (this is a name I came up with): small features that are always “demoable”;

  • … when people feel they are going nowhere:

    • align the tickets with the company/crew objectives, so the goal is well defined;

    • identify blockers and deal with them ASAP (e.g. build issues).

Simple data visualization

In order to track the changes of the team mood over time, we also write the votes down in our Wiki. We keep 3 tables, one for each opposite axes, where each data point is just the date, the value on the positive axis, and the values on the negative one. Confluence can conveniently plot these for you,

Scrum retrospectives Happiness Data

Happiness data

From the graphs we noticed things like cycles in despair and accomplishment, that we regarded as being caused by having features that require a couple of sprints to complete, so the first sprint is full of despair, but when the feature gets finally completed in the following sprint, the sense of accomplishment spikes up.

Written down in words, it seems like a complex exercise, but it’s something that can be done really quickly, so we’ve kept this as part of our retrospectives.

Conclusion

There is no “correct” way of running scrum retrospectives, but the important thing is that they are dynamic and not too long. Also, make sure that people get involved in them. You probably know more or less what people feel from one-to-ones, but it’s important that they share some of that with everyone else in the team. At least, try to record the actionable needs. The happiness axis exercise is quick and it takes the scare out of surveys, and turns it into something a bit more fun. But if you feel stale, try doing something completely different from time to time, like brainstorming for ideas that people would like to work in with others. I’ll come back to that in a future post.

Introduction

Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation by a struct like the one below, instead of generic 4×4 matrix,

struct Transform {
  var position = float3(0, 0, 0)
  var scale    = float3(1, 1, 1)
  var rotation = Quaternion()
}

However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU. The conversion for the struct above looks like this,

func toMatrix4() -> float4x4 {
  let rm = rotation.toMatrix4()
  return float4x4([
    scale.x * rm[0],
    scale.y * rm[1],
    scale.z * rm[2],
    float4(position.x, position.y, position.z, 1.0)
  ])
}

The reason for this conversion is usually 2-fold,

  • GPUs have native support for matrices, making them the natural choice when thinking about performance;

  • in traditional pipelines, we only worried about the final position of a vertex in world coordinates, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), thus, making the transformation of vertices in the GPU really cheap.

Growing shader complexity

From the 2 reasons stated earlier, the second one barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.

The Projection and View matrices are only set once per camera or viewport, and the World matrix will be set per object or instance being drawn. The vertex shader will look like this,

float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);

If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader will need to do some extra operations,

Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);

The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).

struct Transform {
 // for alignment reasons, position and scale are float4
 float4 position; // only xyz actually used
 float4 scale;    // only xyz actually used
 float4 rotation; // unit quaternion; w is the scalar
 float3 operator* (const float3 v) const {
   return position.xyz + quatMul(rotation, v * scale.xyz);
 }
};
/// Quaternion Inverse
float4 quatInv(const float4 q) {
 // assume it's a unit quaternion, so just Conjugate
 return float4( -q.xyz, q.w );
}
/// Quaternion multiplication
float4 quatDot(const float4 q1, const float4 q2) {
 float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
 float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
 return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector)
float3 quatMul(const float4 q, const float3 v) {
 float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
 return r.xyz;
}

Let’s see if this is any slower than matrices with an example.

Rotating cubes demo

I’ve created this demo of rotating cubes to measure the performance of using quaternions in a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone6.

The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!

The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending the amount of area covered by the cubes as they rotate, but I placed them close so it’s always very dense).

The code of the demo can be found here:

Performance comparison in the GPU

Both versions run at 60fps on an iPhone6. This is a frame capture of the version that uses matrices,

Metal Framecapture instanced cubes with matrices

The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.

For a proper comparison, we need to make this example to be vertex-bound, so I’ve prepared another example with spheres instead of cubes,

The tessellation level can be increased at compile time. In the video, there’s only a few hundred vertices per sphere, so both matrices and quaternions still run at 60fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.

The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,

Metal frame capture instanced spheres with matrices

The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. Matrices turned out to be 80% slower here.

Because GPUs are becoming more general purpose, it could be that matrices have no real advantage anymore, since the number of multiplications and additions is actually greater. Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one less float4 per object), we managed to increase the cache coherence. Every GPU will behave slightly different, so it’s better to do an empiric test like this to check what’s the real behaviour of your code.

Performance comparison in the CPU

Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,

Metal draw cost with matrices

The updateBuffers function takes 5.4% of the CPU time, mostly taken in converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through,

Metal draw cost with quaternions

As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,

Metal draw cost with quaternions

Extra thoughts on performance

More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,

Both version still run at 60fps, but we are now issuing 240 draw calls, one per cube. While the CPU was around 20% usage in the instanced version of the quaternions, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉

Note that we could do some extra memory optimization in matrices, if we just send the first 3 rows, enough to represent an affine transformation (not for projections). This is a common optimization and shader languages have support for operations with float3x4 matrices because of this. But if we are talking about just rotations, it is still more memory-efficient to just send a quaternion, which it’s a float4, instead of a float3x3 matrix (for memory alignment reasons sometimes become float3x4).

On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that does the conversion to matrices. And the only matrix we will need to keep will be the Projection matrix.

Conclusion

Our initial preconception that matrices were better for the shader world was wrong. Using quaternions in the GPU is actually faster than matrices in a modern GPU like the Apple’s A8 chip. The memory footprint will also get reduced and the chances of finding our data in the cache will increase.

Moreover, if we eliminate the quaternion-to-matrix conversions, not only the code will get simpler and tidier, but we’ll save several precious CPU cycles.

But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!

Metal with Swift

Metal (not Metail) is a low-level API from Apple that combines OpenGL and OpenCL into a single interface. The purpose of introducing their own API was mainly to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan, or Microsoft’s DX12, but specifically targeted at Apple hardware.

Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.

In this article I’m going to introduce Metal with a small example where all the data updates happen in the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully, it will spark your interest and you dig into the references for extensive documentation 😉

Procedural rain example

I’ve written a small demo that should look like rain,

It draws and updates thousands of 2D lines at 60 fps on an iPhone6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2ms.

You can find all the code here: https://github.com/endavid/metaltest

Getting started

To get started with Metal you will need a Metal-ready device and XCode. In XCode, just create a new project and select

  • iOS Application: Game

  • Language: Swift

  • Game technology: Metal

This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t understand Metal. The triangle data in the example is triple-buffered, so you can update it in the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,

// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
    dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
    // updates in CPU cycles
    self.update()
    // register completion callback
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
        if let strongSelf = self {
            dispatch_semaphore_signal(strongSelf.inflightSemaphore)
        }
        return
    }
    // draw stuff
    // ...
    commandBuffer.commit()
}

Some interesting Swift notes:

  • You can omit brackets when the last argument of the function you are calling is a lambda. You can still do ‘addCompletionHandler(myFunction)’.

  • The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the lambda function. Otherwise, we could have a cyclic reference and leak memory.

  • Because the reference is now weak, it basically becomes an optional (something that could be null). The ‘if let x = optional’ is used to dereference the optional when it’s not null.

Preparing Metal objects

These are the things you need to prepare in order to render something on screen:

  • Resources: data buffers and textures.

  • States: render pipeline state and depth-stencil state.

  • Descriptors: definitions that describe the objects above. This includes your shader code.

  • Render Command Encoder: the stuff that converts API commands into hardware commands.

  • Command Buffer: it’s where you store your commands that are eventually committed to the GPU.

  • Command Queue: where you queue an ordered list of command buffers.

I assume you are more or less familiar with how a typical graphics pipeline work, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing in the GPU.

I’ll explain the shader code later, but for now you just need to know that you can access to your shader functions very easily using a shader library,

let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!

“updateRaindrops” is the name of the function in the shader code.

You can create a render state without a fragment program. Your vertex shader can be used to modify any arbitrary buffer, without the need of specifically creating a compute shader.

let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat

With that descriptor now we can create the state. Note that this is done only once,

do {
    try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
    try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
    print("Failed to create pipeline state, error \(error)")
}

Notice that in Swift, the “try” keyword is used for every expression that can throw an exception. If we are happy with an optional value, we can remove the do-catch and use “try?”,

let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)

Now we need a data buffer. Metal is designed for the A7 chip unified memory system, so both the CPU and the GPU can share the same storage. We will need to care about synchronization, but in this example the raindrops will be updated and read only in the GPU.

// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
            2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"

And now that you have everything ready, we can “draw stuff” in drawInMTKView,

// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
       currentDrawable = view.currentDrawable
{
    // setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
    // Check alignment in setVertexBuffer doc
    let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
    let uniformOffset = numberOfUniforms * sizeof(Float)
    let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
    renderEncoder.label = "render encoder"
      
    // The drawing phase is a simple shader that draws lines in 2D
    // DebugGroup labels are for debugging during frame capture.
    renderEncoder.pushDebugGroup("draw rain")
    renderEncoder.setRenderPipelineState(pipelineState)
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    renderEncoder.drawPrimitives(.Line, vertexStart: 0, 
            vertexCount: vertexCount, instanceCount: 1)
    renderEncoder.popDebugGroup()

    // update particles in the GPU            
    renderEncoder.pushDebugGroup("update raindrops")
    renderEncoder.setRenderPipelineState(updateState)
    // this is where we read the particles from
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    // this is where we write the updated particles 
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
    renderEncoder.setVertexBuffer(uniformBuffer,
            offset: uniformOffset * syncBufferIndex, atIndex: 2)
    // noiseTexture contains random numbers
    renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
    // every particle is treated as a point, but we aren't rendering anything on screen
    renderEncoder.drawPrimitives(.Point, vertexStart: 0, 
            vertexCount: particleCount, instanceCount: 1)
    renderEncoder.popDebugGroup()
    renderEncoder.endEncoding()
            
    commandBuffer.presentDrawable(currentDrawable)
}
    
// syncBufferIndex matches the current semaphore controled frame index 
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
    
commandBuffer.commit()

And that’s all! You don’t need to do anything else on the CPU 🙂

Writing shader code

Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say bye to the runtime nightmares of OpenGL ES.

Let’s jump directly to the raindrop update function,

#include <metal_stdlib>
struct LineParticle
{
    float4 start;
    float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2

// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
                        constant LineParticle* particle  [[ buffer(0) ]],
                        device LineParticle* updatedParticle  [[ buffer(1) ]],
                        constant Uniforms& uniforms  [[ buffer(2) ]],
                        texture2d<float> noiseTexture [[ texture(0) ]])
{
    LineParticle outParticle;
    float4 velocity = float4(0, -0.01, 0, 0);
    outParticle.start = particle[vid].start + velocity;
    outParticle.end = particle[vid].end + velocity;
    if (outParticle.start.y < -1) {
       outParticle.end.y = 1;
       outParticle.start.y = outParticle.end.y + 0.1;
    }
    updatedParticle[vid] = outParticle;
};

I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the end of the screen. Check the full source for the full update, with some simple bouncing on the ground and obstacles, and resetting to a random position.

The “constant” and “device” keywords are address space qualifiers. “constant” refers to read-only buffer memory objects that are allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the device memory pool that are both readable and writeable.

Handling Metal errors

In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the OpenGL error status all the time just to get error messages, cluttering your code with those error queries all over the place. Plus, the error messages were usually hard to decipher.

This is an example error in Metal,

MTLPixelFormatRG16Unorm is compatible with texture data types type(s) (
    float
).'

I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)

Because in the shader I had: texture2d noiseTexture [[ texture(0) ]]

The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.

Frame captures

The frame capture in XCode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.

Let’s inspect a frame,

Frame Capture in XCode

Frame Capture in XCode

On the left side, you can see all the commands. There’s only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact on performance. The debug group labels are shown as folders, and you can see the timings for each one. Or you can expand them and see the details. The particle update takes 183 microseconds. Surely faster than if we had linearly looped through the buffer and updated the particles on the CPU 😉

You can expand each command to see the call stack and jump to the CPU code.

You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.

You can re-write the shader code there, and click the “Update Shaders” icon Update Shaders icon, to re-compile them and re-run the frame with the updated shaders.

It’s really powerful and easy to use.

Conclusion

If you are developing on iOS or macOS and into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL, and it outperforms OpenGL by removing redundant state changes and making definitions more static.

If you like graphics programming, but you never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax, really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.

Check the references below for details.

References