Laptop and headphones: important tools for remote work

Fully remote working: my workstation for a whole week

In the first week of December I ran an experiment: our entire team worked remotely, away from the two main offices. The aim was for everyone to experience exactly what our remote employees experience every day. We hoped this would improve communication, both within the team and with the rest of the company.

Our team is probably one of the most distributed engineering teams in Metail. While most of our engineers are in the Cambridge office, a few work remotely. We’re lucky that they are in the same time zone as the headquarters. Nonetheless, we still suffer a lot of the pains that distributed teams feel, especially when the rest of the company is more used to working between the two offices, based in Cambridge and London.

Our hypothesis was that we would probably miss out on a lot of incidental “water cooler” conversations. We also guessed that communication with the rest of the organisation would be somewhat difficult.

Before kick-off

Before we rolled out the experiment, I had to lay some groundwork. Firstly, I checked with our crew director (we work in teams called ‘Crews’ at Metail) and the other engineering managers that this wouldn’t impact anything crucial. During the week before the start date, we announced widely across multiple channels that our team would be entirely remote. I also spoke to the team to hear their concerns. It certainly helped to draw up a few guidelines. In summary, this is what we came up with:

  • We use Slack by default and Skype as a backup
    • We say when we are at our keyboards and when we’re not
    • Everyone is to use a headset and have their webcam turned on.
  • In general we try to ensure that we are over-communicating
  • If there is a problem or someone can’t be reached, people are to come to me (the engineering manager) or our crew director.

There were a few practical things to take care of as well. We made sure our contact details were added to all the meeting rooms’ Skype accounts. We also checked we could all access internal resources via the VPN. Just to be sure, we ran a couple of trial calls to check that Slack and Skype would work for us (they did!).

So how did it go?

We were able to anticipate the problems we hit; there wasn’t too much of the unexpected. It was much harder to run work past people on a casual, in-person basis. Attempting to do so required both parties to mic up and jump on a Slack call.

Meetings with the wider company are where we struggled the most. We noticed people in Metail occasionally talk over one another, and because of this it was hard to participate in guilds and other group meetings. Usually it meant one person in the office would drown out another who was further away from the room mic. We also noticed that if there were multiple people in the office participating in a meeting, remote workers often ended up ignored. In some cases it was difficult to observe the body language that would normally cue a person to start talking. From time to time it was hard to hear people in the office. Sometimes this was because of problems with the audio equipment; other times it was because of background office noise.

We encountered a few minor technical issues as well. Some of these things were easy to fix, like tweaking rules on a firewall. Others were harder to diagnose, like why a developer was seeing Jenkins time out during load, preventing him from being able to see when builds were finishing. A couple of times we had issues with Slack where one person in the group couldn’t see another but these were easily fixed by leaving the call and re-entering it.

Generally speaking, the engineers found it easier to focus on the work they were attempting to do. On the other hand, it was pretty difficult for me and our crew director, being the main communications interface between the team and the rest of the company.

I also discovered that my house gets really cold during the day if I don’t put my heating on! I made a special effort to be a little more social, going out to dinner and to the pub for much needed social interaction.

Conclusions

On the Monday following the experiment we ran a retrospective where we recorded our experiences. On the whole, the world didn’t end and the company kept working. We recognise that it was a pretty short experiment, lasting only a week, but we still found it valuable. One thing we noticed was that, by announcing the experiment in advance, we certainly affected how the rest of the company interacted with us. I can now say I have a much better understanding of the pain our remote colleagues go through every day. I’m definitely going to be reminding people in the office about it in the future.

Learnings

If you engage with remote employees or are planning to in the future, here is what I’d recommend:

  • When you are having a meeting with remote people, give everyone attending their own mic if possible.
  • Let remote employees know if you are starting a meeting late.
  • Respect meeting etiquette and allow all attendees to fully express themselves. Don’t interrupt until they’re done speaking.

Scrum retrospectives are a great opportunity to sit down with your team and make everyone’s voice heard. They are about collective process improvement, with everyone getting involved and owning part of that process. They are also about feelings, and about empathizing with each other.

A typical scrum retrospective

If you have a formula that works for your team, it’s good to repeat it: your team members will know what to do without having to repeat the agenda every week. However, it can be beneficial to try different things from time to time.

The most important source of ideas is probably the one-to-one meetings. Some team members may actually find the retrospectives boring or not particularly useful, and they may have ideas to improve them. Try some of them, discard things that do not work, and keep the things that people get more involved with.

We started our retrospectives with classical good / bad clustering: we draw two axes, time on the horizontal and goodness to badness on the vertical, and people write down 2 positive things and 2 negative things, with a number from +5 to -5, and stick the post-its on the whiteboard. Every week, a different person tries to cluster the post-it notes into different categories. Sometimes, the time scale is a good indicator of a cluster, but we usually re-cluster them into more meaningful categories. Then, that person tries to explain what went well and what went badly during the sprint, asking the relevant people to explain their tickets. The important thing is trying to identify actions based on those notes, pretty much working out the start-stop-continue from that set. However, we don’t do this exhaustively. We focus on the immediately actionable items, the biggest wins and fails.

Some suggested we were wasting too much time on this, and we tried creating a thread on Slack for every sprint where people could write down thoughts as events happened during the sprint, and others would react with emoji. The thread died out after a few sprints, and we realized it was better to think retrospectively during the allocated time slot and get physically involved, i.e., standing up and writing things down.

Happiness axis

Our company wanted to measure happiness somehow. We discussed the option of having some anonymous surveys sent regularly to measure it, but many in the team were put off by having to fill in surveys online. So I decided to do something during the retrospective time, and get people directly involved.

I’ve selected 6 feelings or axes, 3 positive ones juxtaposed with 3 negative ones. Humans are complicated and full of emotions, so I tried to pick things that I consider actionable in the work environment. This is our list:

Positive | Negative
Enjoyment – did I work on something I enjoy? | Boredom – most of the stuff was tedious and/or boring
Sense of accomplishment – I got that thing done! | Despair – I’m getting nowhere
Powered up – learned something useful! | Powered down – I feel I’m losing my skills

I think it’s important to keep it small, though. You don’t want to model the whole brain!

During the retrospective, we draw these axes on the whiteboard. Then, everyone stands up and casts up to 3 votes on any of the axes,

  • You don’t need to use all the votes (abstentions are counted as well)

  • You can vote in opposite axes (half of the sprint was really fun, but the other half was boring)

  • Preferably, add equally-spaced ticks, so we can draw a spider graph in the end.

And this is how it looks in the end,

Happiness Axis

Actions based on happiness axis

Here are some of the recipes we have for actions based on the result of the happiness axis exercise,

  • … if joy is low:

    • everyone should have at least one ticket they would enjoy working on in next sprint;

  • … if boredom is high:

    • promote team work (e.g. pair-programming), from the premise that the conversation will make tedious tasks less painful;

  • … if not powering up:

    • plan for new things in next sprint;

    • schedule training time;

  • … when powering down:

    • discuss during the retrospective and/or one-on-ones which abilities are not being put to use. Try to find a place for them;

    • reduce time spent in repetitive tasks;

  • … when there’s no sense of accomplishment:

    • create smaller tickets with a well-defined goal;

    • try a “Demo-Driven Development” approach (this is a name I came up with): small features that are always “demoable”;

  • … when people feel they are going nowhere:

    • align the tickets with the company/crew objectives, so the goal is well defined;

    • identify blockers and deal with them ASAP (e.g. build issues).

Simple data visualization

In order to track the changes of the team mood over time, we also write the votes down in our Wiki. We keep 3 tables, one for each pair of opposite axes, where each data point is just the date, the value on the positive axis, and the value on the negative one. Confluence can conveniently plot these for you,

Happiness data

From the graphs we noticed things like cycles in despair and accomplishment, which we attributed to features that take a couple of sprints to complete: the first sprint is full of despair, but when the feature finally gets completed in the following sprint, the sense of accomplishment spikes up.

Written down in words, it seems like a complex exercise, but it’s something that can be done really quickly, so we’ve kept this as part of our retrospectives.

Conclusion

There is no “correct” way of running scrum retrospectives, but the important thing is that they are dynamic and not too long. Also, make sure that people get involved in them. You probably know more or less what people feel from one-to-ones, but it’s important that they share some of that with everyone else in the team. At the very least, try to record the actionable needs. The happiness axis exercise is quick; it takes the scare out of surveys and turns them into something a bit more fun. But if things feel stale, try doing something completely different from time to time, like brainstorming for ideas that people would like to work on with others. I’ll come back to that in a future post.

We welcomed back the Cambridge AWS User Group to the Cambridge office for its eighth Meetup. This one was focused on Big Data. This is something that I spend a lot of my time working on here at Metail, and I was keen to give a talk. I was nervous when, after I had been put on the agenda, 65 people signed up: the office capacity!

We had an exciting line-up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriot gave the first talk, an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm came across as well measured and not just ‘oo shiny new tool’, as he did a good job of relating it to his own experience of querying large data sets and highlighting trade-offs. The main one is that Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgreSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and expect real-time results, because the queue becomes backlogged. The general belief is that the slow query response is caused by the overhead of the leader node orchestrating the query; possibly a single-node cluster wouldn’t have the problem. Something to put on the experiment list 🙂

The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo on setting up trackers and exploring the collected data. The exploration was done using Spark, which is a managed install on EMR, and also Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo and David had some good input to some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our setup a little to be more event driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post about the use of Kinesis Firehose and Kinesis. The demo used Kinesis Firehose, which writes in batches; however, you have control over when the buffer is flushed. David chose 60s to keep things flowing. You can use Kinesis streams, as David mentioned, if you want more of a streaming solution.

I was the final speaker on the agenda and my talk was titled “Why The ‘Like’ In ‘Postgres Like’ Matters”. I went through the decisions we’ve made when using Redshift and why. There were two main ones which I focused on. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data; or to have more CPU and less storage for faster querying but having to drop old data. We decided to keep all our data available in Redshift and progressed through a cluster made up of an increasing number of compute nodes until we had to switch to a cluster made up of a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately, having never worked with columnar data stores, we went with a normalised schema layout which would have worked well on a row store such as PostgreSQL. We did use distribution and sort keys appropriate for the tables; however, the highly normalised data often had different sort orders or distribution keys per table, which made joins very slow. Since then we’ve done some more detailed research and more testing. Now that we have a much larger data set and less CPU, our tests highlight schema and query problems much more clearly, which has led to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys for the tables, joins no longer need to sort data or pull data from elsewhere in the cluster. As David said, Redshift optimisation is all about the schema design.

Overall we’ve found Redshift a very powerful tool, and like any tool there is a learning curve. As with all the AWS services I’ve used, the features are in place to allow you to change your mind and hack around. Most of this is due to the ease with which you can take snapshots and restore them to different shaped clusters.

Finally here’s me presenting:

It looks dark but it was still the hottest day of the year!

Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and to Jon and Stephen for organising the Meetup. We’re looking forward to seeing everyone at the ninth Meetup here at Metail on Tuesday 25th October.

With two 3-month internships under my belt at Metail, it’s easy to see why people keep joining. As an R&D Intern, I’ve been continually challenged and pushed to learn new skills and apply them to often independent and in-depth projects. The responsibility and expected self-sufficiency have been well balanced to allow a comfortable attachment to my work; and now that I’m leaving to go back to my final year at university, it’s clear that a lot of what I’ve done with Metail will help me focus and push myself in my studies.

My assigned and chosen work has been excitingly challenging and intriguingly broad. With several weeks spent collaborating with a great team of people to build advanced features for a Facebook chatbot, I had the pleasure of working on state-of-the-art 3D face modelling, including the challenges of adding cosmetic changes, finding ways to smoothly transform one face to another, and robustly positioning other 3D models on and around the faces. All within the context of delivering a user experience, albeit an experimental one. The surprise to me came in finding that “R&D” is not synonymous with “hidden in the back room for no one to see.” Sometimes, it turns out, it just means you don’t have to worry about perfecting a product and can focus on learning as much as possible about users and what they want.

In three months, you don’t necessarily just get to work on one project. On top of 3D face modelling, I got the opportunity to start with a blank folder with zero files in it and a seemingly simple task: recommend clothes to a user. That might only take five words to say, but it takes more than five lines of code to do. This task required me to build from an empty file tree to a framework for creating and testing ways of implementing recommendation algorithms. I found it an extremely rewarding opportunity to work independently on such a project. At the same time, the real reward of working in a place like Metail isn’t just getting to take pride in your work, but knowing that at any point in time there is a whole host of people ready and willing to help you if you ask for it. With a collection of experienced and knowledgeable colleagues, I always found it easy to get help when I needed it. The technical knowledge and experiential learning gained will no doubt prove invaluable in the future.

I should also point out that the work itself isn’t the only part that’s fun. The people are wonderful and I personally enjoyed the fact that I only wore shoes to work 8 times in the entire summer (flip-flops are so much more comfortable). If you need advice on which are the best pubs in Cambridge, look no further, because Friday pub lunches serve as an excellent method of exploration. Meanwhile it’s worth noting that interns get free membership to the Friday cocktail club, which makes for a thoroughly enjoyable social activity whether you care for the cocktails or not!

In the end, sometimes you have to do some work, so in my experience, you might as well make sure it’s work that is in itself rewarding and comes with plenty of added benefits; my time at Metail has been a core of fulfilling work with a periphery of positive side effects. There’s no doubt in my mind that I’ll soon find an excuse to come back again.

Luke Smith
R&D Intern 2015, 2016

Introduction

Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation by a struct like the one below, instead of a generic 4×4 matrix,

import simd  // provides float3

struct Transform {
  var position = float3(0, 0, 0)
  var scale    = float3(1, 1, 1)
  var rotation = Quaternion()  // the project's own unit-quaternion type
}

However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU. The conversion for the struct above looks like this,

func toMatrix4() -> float4x4 {
  let rm = rotation.toMatrix4()
  return float4x4([
    scale.x * rm[0],
    scale.y * rm[1],
    scale.z * rm[2],
    float4(position.x, position.y, position.z, 1.0)
  ])
}
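The `rotation.toMatrix4()` call is not shown in this post. As a rough sketch of what such a conversion involves, here is the standard unit-quaternion-to-rotation-matrix formula in plain Swift. The `Quat` type and the `toColumns` name are made up for this illustration (they are not the demo's API), and the result is only the 3×3 rotation part, returned column by column.

```swift
// Hypothetical stand-in for the Quaternion type; w is the scalar part.
struct Quat {
    var x: Float, y: Float, z: Float, w: Float

    /// Columns of the 3x3 rotation matrix for a unit quaternion.
    func toColumns() -> [[Float]] {
        let xx = x * x, yy = y * y, zz = z * z
        let xy = x * y, xz = x * z, yz = y * z
        let wx = w * x, wy = w * y, wz = w * z
        // Each column is where the rotation sends one of the basis axes.
        let c0: [Float] = [1 - 2 * (yy + zz), 2 * (xy + wz), 2 * (xz - wy)]
        let c1: [Float] = [2 * (xy - wz), 1 - 2 * (xx + zz), 2 * (yz + wx)]
        let c2: [Float] = [2 * (xz + wy), 2 * (yz - wx), 1 - 2 * (xx + yy)]
        return [c0, c1, c2]
    }
}

// A rotation of 90 degrees about the Z axis: (0, 0, sin 45°, cos 45°).
let h = Float(0.5).squareRoot()
let cols = Quat(x: 0, y: 0, z: h, w: h).toColumns()
print(cols[0])  // image of the X axis, approximately (0, 1, 0)
```

Scaling each column and appending the position as a fourth column, as in the function above, then gives the full world matrix.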

The reason for this conversion is usually twofold,

  • GPUs have native support for matrices, making them the natural choice when thinking about performance;

  • in traditional pipelines, we only worried about the final position of a vertex in world coordinates, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), thus making the transformation of vertices in the GPU really cheap.

Growing shader complexity

Of the 2 reasons stated earlier, the second one barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.

The Projection and View matrices are only set once per camera or viewport, and the World matrix will be set per object or instance being drawn. The vertex shader will look like this,

float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);

If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader will need to do some extra operations,

Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);

The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).

// The quaternion helpers are declared first so that Transform can call them.
/// Quaternion Inverse
float4 quatInv(const float4 q) {
 // assume it's a unit quaternion, so just Conjugate
 return float4( -q.xyz, q.w );
}
/// Quaternion multiplication (Hamilton product)
float4 quatDot(const float4 q1, const float4 q2) {
 float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
 float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
 return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector)
float3 quatMul(const float4 q, const float3 v) {
 float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
 return r.xyz;
}
struct Transform {
 // for alignment reasons, position and scale are float4
 float4 position; // only xyz actually used
 float4 scale;    // only xyz actually used
 float4 rotation; // unit quaternion; w is the scalar
 // scale first, then rotate, then translate
 float3 operator* (const float3 v) const {
   return position.xyz + quatMul(rotation, v * scale.xyz);
 }
};
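To sanity-check the shader math without a GPU, the same three functions can be mirrored on the CPU. The following is only a sketch in plain Swift: the `V3` and `Q` types are hypothetical helpers written for this illustration, not part of the demo code. It rotates the X axis by 90 degrees about Z and should land on the Y axis.

```swift
// Plain-Swift mirror of the shader functions (hypothetical helper types).
struct V3 { var x: Float, y: Float, z: Float }
struct Q { var v: V3; var w: Float }  // unit quaternion, w is the scalar

func dot(_ a: V3, _ b: V3) -> Float {
    return a.x * b.x + a.y * b.y + a.z * b.z
}

func cross(_ a: V3, _ b: V3) -> V3 {
    return V3(x: a.y * b.z - a.z * b.y,
              y: a.z * b.x - a.x * b.z,
              z: a.x * b.y - a.y * b.x)
}

/// Conjugate, which equals the inverse for unit quaternions.
func quatInv(_ q: Q) -> Q {
    return Q(v: V3(x: -q.v.x, y: -q.v.y, z: -q.v.z), w: q.w)
}

/// Hamilton product q1 * q2.
func quatDot(_ q1: Q, _ q2: Q) -> Q {
    let c = cross(q1.v, q2.v)
    let vx = c.x + q1.w * q2.v.x + q2.w * q1.v.x
    let vy = c.y + q1.w * q2.v.y + q2.w * q1.v.y
    let vz = c.z + q1.w * q2.v.z + q2.w * q1.v.z
    return Q(v: V3(x: vx, y: vy, z: vz), w: q1.w * q2.w - dot(q1.v, q2.v))
}

/// Rotate v by the unit quaternion q: q * (v, 0) * q^-1.
func quatMul(_ q: Q, _ v: V3) -> V3 {
    return quatDot(q, quatDot(Q(v: v, w: 0), quatInv(q))).v
}

// 90 degrees about Z: vector part (0, 0, sin 45°), scalar cos 45°.
let h = Float(0.5).squareRoot()
let rotated = quatMul(Q(v: V3(x: 0, y: 0, z: h), w: h), V3(x: 1, y: 0, z: 0))
print(rotated)  // approximately (0, 1, 0)
```

Note that rotating a vector this way costs two full quaternion products, so it is not obvious in advance how it will compare with a matrix multiplication.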

Let’s see if this is any slower than matrices with an example.

Rotating cubes demo

I’ve created this demo of rotating cubes to measure the performance of using quaternions in a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone 6.

The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!

The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending on the amount of area covered by the cubes as they rotate, but I placed them close together so the screen is always densely covered).

The code of the demo can be found here:

Performance comparison in the GPU

Both versions run at 60 fps on an iPhone 6. This is a frame capture of the version that uses matrices,

Metal Framecapture instanced cubes with matrices

The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.

For a proper comparison, we need to make this example vertex-bound, so I’ve prepared another example with spheres instead of cubes,

The tessellation level can be increased at compile time. In the video, there are only a few hundred vertices per sphere, so both matrices and quaternions still run at 60 fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.

The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,

Metal frame capture instanced spheres with matrices

The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. Matrices turned out to be roughly 80% slower here (82.28 / 46.10 ≈ 1.78).

Because GPUs are becoming more general purpose, it could be that matrices have no real advantage anymore, since the number of multiplications and additions is actually greater. Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one fewer float4 per object), we managed to increase the cache coherence. Every GPU will behave slightly differently, so it’s better to do an empirical test like this to check what the real behaviour of your code is.
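To put a number on the memory footprint, here is a small sketch that compares the per-instance sizes in plain Swift. The `Float4` alias and the two structs are assumptions made for this illustration; they only mimic the size of the shader-side `float4` fields and are not the buffers used in the demo.

```swift
// Four 32-bit floats, standing in for the shader-side float4 (assumed helper alias).
typealias Float4 = (Float, Float, Float, Float)

// Per-instance data when sending a Transform: 3 x float4.
struct TransformData {
    var position: Float4
    var scale: Float4
    var rotation: Float4
}

// Per-instance data when sending a world matrix: 4 x float4 columns.
struct MatrixData {
    var c0: Float4
    var c1: Float4
    var c2: Float4
    var c3: Float4
}

let instances = 240  // number of cubes in the demo
let transformBytes = MemoryLayout<TransformData>.stride * instances
let matrixBytes = MemoryLayout<MatrixData>.stride * instances
let savedPerInstance = MemoryLayout<MatrixData>.stride - MemoryLayout<TransformData>.stride

print("Transforms: \(transformBytes) bytes, matrices: \(matrixBytes) bytes")
print("Saved per instance: \(savedPerInstance) bytes")
```

Under these assumptions the Transform buffer is a quarter smaller than the matrix buffer, which is the 4 floats per instance mentioned earlier.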

Performance comparison in the CPU

Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,

Metal draw cost with matrices

The updateBuffers function takes 5.4% of the CPU time, mostly spent converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through,

Metal draw cost with quaternions

As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,

Metal draw cost with quaternions

Extra thoughts on performance

More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,

Both versions still run at 60 fps, but we are now issuing 240 draw calls, one per cube. While the CPU was around 20% usage in the instanced version of the quaternions, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉

Note that we could do some extra memory optimization with matrices if we just sent the first 3 rows, which is enough to represent an affine transformation (but not a projection). This is a common optimization and shader languages have support for operations with float3x4 matrices because of this. But if we are talking about just rotations, it is still more memory-efficient to just send a quaternion, which is a float4, instead of a float3x3 matrix (which, for memory alignment reasons, sometimes becomes a float3x4).

On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that does the conversion to matrices. And the only matrix we will need to keep will be the Projection matrix.

Conclusion

Our initial preconception that matrices were better for the shader world was wrong. Using quaternions in the GPU is actually faster than using matrices in a modern GPU like Apple’s A8 chip. The memory footprint is also reduced, which increases the chances of finding our data in the cache.

Moreover, if we eliminate the quaternion-to-matrix conversions, not only will the code get simpler and tidier, but we’ll also save several precious CPU cycles.

But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!

Metal with Swift

Metal (not Metail) is a low-level API from Apple that combines functionality similar to OpenGL and OpenCL in a single interface. The purpose of introducing their own API was mainly to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan, or Microsoft’s DX12, but specifically targeted at Apple hardware.

Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.

In this article I’m going to introduce Metal with a small example where all the data updates happen in the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully, it will spark your interest and you will dig into the references for extensive documentation 😉

Procedural rain example

I’ve written a small demo that should look like rain,

It draws and updates thousands of 2D lines at 60 fps on an iPhone 6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2 ms.

You can find all the code here: https://github.com/endavid/metaltest

Getting started

To get started with Metal you will need a Metal-ready device and Xcode. In Xcode, just create a new project and select

  • iOS Application: Game

  • Language: Swift

  • Game technology: Metal

This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t understand Metal. The triangle data in the example is triple-buffered, so you can update it in the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,

// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
    dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
    // updates in CPU cycles
    self.update()
    // register completion callback
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
        if let strongSelf = self {
            dispatch_semaphore_signal(strongSelf.inflightSemaphore)
        }
        return
    }
    // draw stuff
    // ...
    commandBuffer.commit()
}
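The snippet above uses the Swift 2 spelling of the GCD API. In later Swift versions the same in-flight limit is usually written with `DispatchSemaphore`. The sketch below leaves Metal out entirely and only shows the counting behaviour; `maxFramesInFlight` is an assumed name standing in for `NumSyncBuffers`, and a non-blocking wait is used instead of waiting forever so that the effect can be observed in a single thread.

```swift
import Dispatch

let maxFramesInFlight = 3  // assumed stand-in for NumSyncBuffers
let inflight = DispatchSemaphore(value: maxFramesInFlight)

/// Try to claim a buffer slot without blocking; false means all slots are busy.
func tryBeginFrame() -> Bool {
    return inflight.wait(timeout: .now()) == .success
}

/// Called when the GPU is done with a frame (the completed handler in the real code).
func frameCompleted() {
    inflight.signal()
}

// The CPU can get ahead of the GPU only until the slots run out.
var started = 0
while tryBeginFrame() {
    started += 1
}
print("frames in flight:", started)

// Once the GPU finishes one frame, the CPU may start encoding another.
frameCompleted()
let canContinue = tryBeginFrame()
print("can start another frame:", canContinue)
```

In the real render loop the wait would block instead of returning false, which is what throttles the CPU to at most 3 frames ahead of the GPU.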

Some interesting Swift notes about the drawInMTKView function:

  • You can omit the parentheses when the last argument of the function you are calling is a closure (a lambda). You can still do ‘addCompletedHandler(myFunction)’.

  • The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the closure. Otherwise, we could have a cyclic reference and leak memory.

  • Because the reference is now weak, it basically becomes an optional (something that could be nil). The ‘if let x = optional’ is used to unwrap the optional when it’s not nil.
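These three notes can be tried out without Metal. The following is a minimal sketch in which `Renderer`, `onCompleted` and `run(times:_:)` are hypothetical names invented for the illustration; the stored closure plays the role of the completed handler.

```swift
final class Renderer {
    var completedFrames = 0
    // Stored closure standing in for a registered completion handler (hypothetical).
    var onCompleted: (() -> Void)?

    func scheduleFrame() {
        // [weak self] stops the closure from keeping the renderer alive.
        onCompleted = { [weak self] in
            // self is an optional here; unwrap it before use.
            if let strongSelf = self {
                strongSelf.completedFrames += 1
            }
        }
    }
}

/// The closure is the last argument, so callers may use trailing-closure syntax.
func run(times: Int, _ body: () -> Void) {
    for _ in 0..<times { body() }
}

var renderer: Renderer? = Renderer()
weak var probe = renderer          // lets us observe when the renderer goes away
renderer?.scheduleFrame()
let handler = renderer!.onCompleted!

run(times: 2) { handler() }        // trailing closure, no parentheses around it
run(times: 1, handler)             // passing a function value also works
let countWhileAlive = renderer!.completedFrames

renderer = nil                     // the handler holds no strong reference
handler()                          // self is nil now, so the if-let body is skipped
print(countWhileAlive, probe == nil)
```

Had the closure captured `self` strongly, the renderer and its stored closure would keep each other alive after `renderer = nil`, which is the leak the second note warns about.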

Preparing Metal objects

These are the things you need to prepare in order to render something on screen:

  • Resources: data buffers and textures.

  • States: render pipeline state and depth-stencil state.

  • Descriptors: definitions that describe the objects above. This includes your shader code.

  • Render Command Encoder: the stuff that converts API commands into hardware commands.

  • Command Buffer: it’s where you store your commands that are eventually committed to the GPU.

  • Command Queue: where you queue an ordered list of command buffers.

I assume you are more or less familiar with how a typical graphics pipeline works, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing on the GPU.

I’ll explain the shader code later, but for now you just need to know that you can access your shader functions very easily using a shader library,

let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!

“updateRaindrops” is the name of the function in the shader code.

You can create a render state without a fragment program. Your vertex shader can be used to modify any arbitrary buffer, without the need of specifically creating a compute shader.

let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat

With that descriptor now we can create the state. Note that this is done only once,

do {
    try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
    try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
    print("Failed to create pipeline state, error \(error)")
}

Notice that in Swift, the “try” keyword is required in front of every expression that can throw an error. If we are happy with an optional value, we can remove the do-catch and use “try?”,

let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)
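
To see the difference between the two forms without a Metal device, here is a small sketch in current Swift. makePipelineState and PipelineError are hypothetical names invented for the illustration; they only mimic a call that can fail:

```swift
enum PipelineError: Error {
    case missingFunction(String)
}

// Hypothetical throwing function standing in for pipeline state creation.
func makePipelineState(vertexFunction: String?) throws -> String {
    guard let name = vertexFunction else {
        throw PipelineError.missingFunction("vertexFunction")
    }
    return "pipeline(\(name))"
}

// do-catch: the thrown error is available for reporting.
var report = ""
do {
    let state = try makePipelineState(vertexFunction: nil)
    report = "created \(state)"
} catch {
    report = "failed: \(error)"
}

// try?: the error is discarded and the result becomes an optional.
let good = try? makePipelineState(vertexFunction: "updateRaindrops")
let bad = try? makePipelineState(vertexFunction: nil)
```

The trade-off is that “try?” loses the reason for the failure, so the do-catch form is probably the better choice while you are still debugging your descriptors.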

Now we need a data buffer. Metal is designed for the unified memory system of the A7 chip, so both the CPU and the GPU can share the same storage. We will need to care about synchronization, but in this example the raindrops will be updated and read only on the GPU.

// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
            2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"

And now that you have everything ready, we can “draw stuff” in drawInMTKView,

// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
       currentDrawable = view.currentDrawable
{
    // setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
    // Check alignment in setVertexBuffer doc
    let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
    let uniformOffset = numberOfUniforms * sizeof(Float)
    let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
    renderEncoder.label = "render encoder"
      
    // The drawing phase is a simple shader that draws lines in 2D
    // DebugGroup labels are for debugging during frame capture.
    renderEncoder.pushDebugGroup("draw rain")
    renderEncoder.setRenderPipelineState(pipelineState)
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    renderEncoder.drawPrimitives(.Line, vertexStart: 0, 
            vertexCount: vertexCount, instanceCount: 1)
    renderEncoder.popDebugGroup()

    // update particles in the GPU            
    renderEncoder.pushDebugGroup("update raindrops")
    renderEncoder.setRenderPipelineState(updateState)
    // this is where we read the particles from
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    // this is where we write the updated particles 
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
    renderEncoder.setVertexBuffer(uniformBuffer,
            offset: uniformOffset * syncBufferIndex, atIndex: 2)
    // noiseTexture contains random numbers
    renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
    // every particle is treated as a point, but we aren't rendering anything on screen
    renderEncoder.drawPrimitives(.Point, vertexStart: 0, 
            vertexCount: particleCount, instanceCount: 1)
    renderEncoder.popDebugGroup()
    renderEncoder.endEncoding()
            
    commandBuffer.presentDrawable(currentDrawable)
}
    
// syncBufferIndex matches the current semaphore controlled frame index 
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
    
commandBuffer.commit()

And that’s all! You don’t need to do anything else on the CPU 🙂
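
The ping-pong between the two halves of raindropDoubleBuffer can be checked on its own. The sketch below, in current Swift, only reproduces the offset arithmetic from the draw code; the raindrop count is an assumed value, not the one from the project:

```swift
// Assumed count for illustration; the real value lives in the linked project.
let maxNumberOfRaindrops = 2048
// Two float4 values per particle, as in the shader struct.
let sizeOfLineParticle = MemoryLayout<Float>.size * 4 * 2
let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle

// Byte offsets the update pass reads from and writes to for a given index.
func offsets(forIndex doubleBufferIndex: Int) -> (read: Int, write: Int) {
    let read = bufferOffset * doubleBufferIndex
    let write = bufferOffset * ((doubleBufferIndex + 1) % 2)
    return (read, write)
}

// Simulate four frames, flipping the index at the end of each one.
var doubleBufferIndex = 0
var history: [(read: Int, write: Int)] = []
for _ in 0..<4 {
    history.append(offsets(forIndex: doubleBufferIndex))
    doubleBufferIndex = (doubleBufferIndex + 1) % 2
}
```

Each frame reads from the half that the previous frame wrote to, which is why the update never overwrites the particles that are currently being drawn.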

Writing shader code

Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say bye to the runtime nightmares of OpenGL ES.

Let’s jump directly to the raindrop update function,

#include <metal_stdlib>
using namespace metal; // needed for texture2d and other stdlib types
struct LineParticle
{
    float4 start;
    float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2

// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
                        constant LineParticle* particle  [[ buffer(0) ]],
                        device LineParticle* updatedParticle  [[ buffer(1) ]],
                        constant Uniforms& uniforms  [[ buffer(2) ]],
                        texture2d<float> noiseTexture [[ texture(0) ]])
{
    LineParticle outParticle;
    float4 velocity = float4(0, -0.01, 0, 0);
    outParticle.start = particle[vid].start + velocity;
    outParticle.end = particle[vid].end + velocity;
    if (outParticle.start.y < -1) {
       outParticle.end.y = 1;
       outParticle.start.y = outParticle.end.y + 0.1;
    }
    updatedParticle[vid] = outParticle;
}

I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the end of the screen. Check the full source for the full update, with some simple bouncing on the ground and obstacles, and resetting to a random position.
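
If you want to sanity-check the simplified update without a GPU, a CPU reference is easy to write. This is only a sketch in current Swift: it tracks just the y coordinates, since the constant velocity leaves the other components unchanged, and the struct is a cut-down stand-in for the shader’s LineParticle:

```swift
// Cut-down stand-in for the shader struct: only the y components.
struct LineParticleY: Equatable {
    var startY: Float
    var endY: Float
}

// Mirrors the simplified shader: move down by a constant velocity and
// wrap to the top once the start point drops below y = -1.
func updated(_ particle: LineParticleY, velocityY: Float = -0.01) -> LineParticleY {
    var out = particle
    out.startY += velocityY
    out.endY += velocityY
    if out.startY < -1 {
        out.endY = 1
        out.startY = out.endY + 0.1
    }
    return out
}

let falling = updated(LineParticleY(startY: 0.5, endY: 0.4))
let wrapped = updated(LineParticleY(startY: -0.995, endY: -1.095))
```

Comparing a few particles from a captured frame against a function like this is a cheap way to confirm the shader does what you expect.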

The “constant” and “device” keywords are address space qualifiers. “constant” refers to read-only buffer memory objects that are allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the device memory pool that are both readable and writeable.

Handling Metal errors

In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the OpenGL error status all the time just to get error messages, cluttering your code with those error queries all over the place. Plus, the error messages were usually hard to decipher.

This is an example error in Metal,

MTLPixelFormatRG16Unorm is compatible with texture data types type(s) (
    float
).'

I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)

Because in the shader I had: texture2d<half> noiseTexture [[ texture(0) ]]

The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.

Frame captures

The frame capture in Xcode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.

Let’s inspect a frame,

Frame Capture in Xcode

On the left side, you can see all the commands. There are only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact performance. The debug group labels are shown as folders, and you can see the timings for each one. Or you can expand them and see the details. The particle update takes 183 microseconds. Surely faster than if we had linearly looped through the buffer and updated the particles on the CPU 😉

You can expand each command to see the call stack and jump to the CPU code.

You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.

You can re-write the shader code there, and click the “Update Shaders” icon to re-compile them and re-run the frame with the updated shaders.

It’s really powerful and easy to use.

Conclusion

If you are developing on iOS or macOS and into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL, and it outperforms OpenGL by removing redundant state changes and making definitions more static.

If you like graphics programming, but you never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax, really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.

Check the references below for details.

References

Tomorrow, Tuesday 12th, we’re welcoming back the Cam AWS User Group for their 7th Meetup. This is the fourth user group meetup we’ve hosted and now we’re set to host the remaining three of the year. The meetup promises to be information-packed and is focusing on AWS Lambda, with two speakers talking about their experiences. There’s also a debrief on the recent AWS summit in London, and Danilo Poccia, a technical evangelist from AWS, is talking about data analytics.

The AWS London summit was on the 7th July and I went along with a colleague. Inevitably we bumped into some of the Cam AWS UG members and shared a DLR over to the Excel. Personally I found the Deep Dive on Amazon DynamoDB to be the most informative session, with a good bit of depth on how to write your schema, avoid hot keys and understand its internal partitioning. This is important for schema design and resolving certain bottlenecks. My most disappointing talk was the Deep Dive on Microservices and Amazon ECS: it didn’t add to my knowledge, and I’ve only ever seen talks and demos of ECS without getting my hands dirty. My colleague attended the Deep Dive on EC2 Instances and it sounded like I’d have gotten much from that talk. I’m sure that others went to interesting (and disappointing) sessions and I would like to know what they got out of them.

AWS Lambda is one of AWS’ hot technologies, released almost two years ago. We’ve started experimenting with it in Metail. I’m really keen to see how it’s being used and experimented with by others, as AWS Lambda’s use within Metail is certainly growing. I’ve had a fun little project writing a plugin for leiningen which allows you to manage AWS Lambda functions, with the aim of integrating it into our build process. Still, it’s nowhere near as functional as lambda Gordon, which I saw demonstrated at the most recent Snowplow London Meetup; it sounds like something to compare to Ben Taylor’s talk on Using Lambda and CloudFormation.

The final talk of the night is from Danilo Poccia. I’m particularly looking forward to asking questions at the end as it’s the most relevant to my day-to-day job 🙂

We’re looking forward to seeing everyone tomorrow. Doors open at 6:45pm and the talks start promptly at 7pm. We’ll be providing beer, soft drinks and snacks, so be prompt to get your favourite beverage before the talks start 🙂

AWS Loft London

Back in October 2015 Metail hosted the 3rd Cambridge AWS User Group Meetup and in addition to Ian Massingham‘s review of AWS re:Invent 2015 I was given the opportunity to talk about our use of AWS for our big data processing pipeline. After this I was pleased to be invited to give an Elastic MapReduce (EMR) specific version of this talk at an AWS EMR master class. Roll on March and the AWS loft London with me on the agenda for the EMR Master Class session 🙂

After a busy week and some concentrated talk preparations I almost didn’t make it. I caught the train from Cambridge to Liverpool Street with the intention of walking from there to Old Street. Unfortunately there were problems with the power lines on the Liverpool Street line, which led to everyone getting off at Harlow Town. After a taxi ride to Epping and a nervous ride into Liverpool Street on the Central line, I finally arrived only five minutes after the session started. This meant I missed my opportunity to introduce myself to Abhishek Sinha (the session leader) but after catching his eye during his talk I was back on the agenda 🙂

Elastic MapReduce Master Class

Abhishek gave a very interesting and well-presented guide to EMR and its best practices. As ever when I attend a talk by someone from AWS I learn plenty of new things and start re-evaluating our use of their tools. In this case, these were mainly around the use of spot instance task nodes and taking advantage of EMRFS.

The spot instance task nodes are nodes that only perform MapReduce tasks, having no HDFS storage, and come from the EC2 spot instance market. Using the spot instance market you can get the nodes at a lower price, but if you’re outbid you lose the node. Any compute tasks running when you lose the node fail, but Hadoop was built with this in mind and simply reschedules the task on another node. With no HDFS storage, no data re-replication need be done. It’s common to set a bid price of 100% of the on-demand cost: you still get the EC2 node at the lower spot price and at worst you pay the normal cost. Further, by picking nodes that are less commonly used, you are less likely to be outbid. For example, if you normally request two m3.2xlarge task nodes but on the spot market the m3.xlarge were less commonly used, then requesting four m3.xlarge task nodes would give you equivalent power but with a greater saving. This is an imaginary example; you can find real data for the spot market here.
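
The arithmetic behind that imaginary example can be sketched in a few lines of Swift. Every price below is invented purely for illustration (they are not real AWS prices), and the model is deliberately crude: you pay the spot price while it is at or below your bid, and you get no node at all once you are outbid.

```swift
struct TaskNodeRequest {
    let instanceType: String
    let count: Int
    let bidHourly: Double    // invented: bid set to 100% of the on-demand price
    let spotHourly: Double   // invented: current spot market price
}

// Hourly cost of the request, or nil if the bid is below the spot price
// and the nodes would be lost.
func hourlyCost(_ request: TaskNodeRequest) -> Double? {
    guard request.spotHourly <= request.bidHourly else { return nil }
    return Double(request.count) * request.spotHourly
}

let twoLarge = TaskNodeRequest(instanceType: "m3.2xlarge", count: 2,
                               bidHourly: 0.5, spotHourly: 0.25)
let fourSmall = TaskNodeRequest(instanceType: "m3.xlarge", count: 4,
                                bidHourly: 0.25, spotHourly: 0.0625)
let outbid = TaskNodeRequest(instanceType: "m3.2xlarge", count: 2,
                             bidHourly: 0.5, spotHourly: 0.75)

let costTwoLarge = hourlyCost(twoLarge)    // 2 nodes at the spot price
let costFourSmall = hourlyCost(fourSmall)  // 4 smaller nodes, cheaper in total here
let costOutbid = hourlyCost(outbid)        // nil: the nodes are lost
```

With these made-up numbers the four smaller nodes cost half as much per hour as the two larger ones, which is the kind of saving the example above is pointing at.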

The other feature of EMR we are not yet taking advantage of is EMRFS. AWS have decoupled the compute from storage by allowing EMR clusters to make very efficient use of S3. The main/only drawback here is that S3 has eventual consistency for overwrites and deletes of objects in the S3 file system. The EMR nodes are not aware of the delays and thus when one job takes as input the output of a previous one there is a chance of seeing an inconsistent view of the data. EMRFS uses a DynamoDB table to keep a record of the expected state of S3 and the EMRFS file system will retry if a request is made for an object that does not match the expected state. Currently we work around this limitation by having things set up in such a way that it isn’t a problem (more by luck than design ;)). Another common solution is to create two copies: one in the cluster’s HDFS file system and the other in S3. The copy in HDFS is lost when the cluster shuts down. We are currently redesigning our pipeline and it may become a greater problem in the next iteration so we’re keeping EMRFS in mind, noting that you do pay for the DynamoDB usage.
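
As I understand it, the idea is "compare what the store returns with a separate record of what should be there, and retry on a mismatch". The toy model below sketches that idea in Swift. It is not the EMRFS implementation or API: ToyStore, its fields and the retry count are all invented for illustration, and the list of views simply simulates a store that catches up over time.

```swift
// Toy model: `expected` plays the role of the separately tracked metadata,
// and `listingsOverTime` simulates successive (possibly stale) views of the
// object store. It must contain at least one view.
struct ToyStore {
    let expected: Set<String>
    let listingsOverTime: [Set<String>]
    var attempts = 0

    mutating func list() -> Set<String> {
        let index = min(attempts, listingsOverTime.count - 1)
        attempts += 1
        return listingsOverTime[index]
    }
}

// Retry until the listing matches the expected state, or give up.
func consistentListing(from store: inout ToyStore, maxRetries: Int) -> Set<String>? {
    for _ in 0...maxRetries {
        let view = store.list()
        if view == store.expected { return view }
    }
    return nil
}

let expected: Set<String> = ["part-0", "part-1"]

// The store is stale on the first read and consistent on the second.
var catchingUp = ToyStore(expected: expected, listingsOverTime: [["part-0"], expected])
let settled = consistentListing(from: &catchingUp, maxRetries: 3)

// The store never catches up, so the caller eventually gives up.
var stuck = ToyStore(expected: expected, listingsOverTime: [["part-0"]])
let gaveUp = consistentListing(from: &stuck, maxRetries: 2)
```

The point of the separate record is that a downstream job can tell "not there yet" apart from "never written", which a plain listing cannot do.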

My First Big Data Application

As for my own talk, I think it was well received. I was asked some interesting questions at the end and I’m taking that as a good sign. After my talk and some lunch I stayed for the next session “My First Big Data Application” which was introduced as a modern big data pipeline. This was a great session where a pipeline was set up to collect, process and analyse web logs. This was strikingly similar to the pipeline I’d described in my talk, however theirs is indeed more modern 🙂 I think it’s interesting to compare the two pipelines and to contrast their different strengths and weaknesses.

Starting with my talk and the beginning of our pipeline, events are recorded by making GET requests for a Cloudfront-hosted pixel and Cloudfront logs all the requests to an S3 bucket. Here AWS do the hard work of distributing our pixel around the globe to ensure fast access to the user. They also batch up the request logs, writing them to the configured bucket after some time/size. We’ve never done any measurements but I believe the latency is typically less than an hour and we get logs of the order 10MB in size although they can be KB in size. For the demonstration Toby Knight (the speaker) set up an Apache web server on an EC2 node which saved its logs locally. He then used an AWS Kinesis collector to stream the logs in real time into the Kinesis Firehose which records the data in an S3 bucket. Here you can see the more modern event collector which is a real-time streaming system compared to our batch. For the following purposes it’s not really clear why Kinesis Firehose is better than our Cloudfront solution. I’m not sure how you scale out the Apache web server (fairly easily I imagine, it’s just not my area of expertise) but that’s work you’ll have to do yourself and when the second step is a batch system I’m not sure the latency matters. However, talking of latency this is where Kinesis has potential the Cloudfront solution clearly doesn’t. In Metail we don’t have any real time monitor of our event stream (it’s never been a critical requirement) but with Kinesis you can connect to a topic and trigger some processing on each new event. This increased flexibility is clearly a win.

For the next step both we and Toby turned to EMR for a ‘model on read’ batch Extract, Transform and Load (ETL). We are using MapReduce in Clojure (Cascalog at the moment but switching over to Parkour) to read in our Cloudfront logs, validate the events and format them in a schema that can be loaded into Redshift. Here ‘model on read’ means that Cloudfront doesn’t enforce a schema on the data; it will quite happily write some quite corrupt events to file. It’s only if we try to format that event as, say, an order that we start requiring it to have certain properties. Toby’s talk used Spark to process the events, perhaps just as an opportunity to show EMR supports the latest cool MapReduce technology 😉 It does have some advantages over MapReduce and should be a lot faster than our ETL as Spark uses in-memory data structures. It’s written in Scala though (but there are Clojure bindings with Flambo or Sparkling).

For the next step Metail is keeping up with the Joneses and we do the modern thing and copy the output of the EMR batch stage into Redshift. Redshift is a petabyte scale data warehouse where you use a PostgreSQL-like language to query your data. After some initial teething troubles we think our new schema will allow us to make much better use of Redshift’s strengths. We use a product called Looker to model the data in Redshift, produce dashboards for both internal and external use, gain insights into our data through dynamic queries and quite a few other things.

For the talk they demonstrated the use of AWS QuickSight which is in a limited preview. Although it will compete with Looker (and similar tools like Tableau) it’s aiming to be less full featured and much cheaper, allowing companies to give everyone access to the data with only a few people using more expensive tools like Tableau. I suspect for us it would never replace Looker: it seemed like it wouldn’t have the client facing support we require, and our more powerful data analysis tools come largely from the open source Python and R community 🙂 Still I’m very excited about SPICE (Super-fast, Parallel, In-memory Calculation Engine) which gives each QuickSight user a local in-memory DB for very fast data modelling and exploration. This should be available to partners like Looker and Tableau next year.
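
To make the ‘model on read’ idea from the ETL step a bit more concrete, here is a minimal sketch in Swift. Our real ETL is written in Clojure and looks nothing like this; the field names and the Order type are invented for the illustration. The only point is that nothing is validated when an event is written, and the schema is applied when we try to read it as a particular kind of event.

```swift
// A raw logged event: just string fields, with no schema enforced on write.
typealias RawEvent = [String: String]

struct Order: Equatable {
    let orderId: String
    let total: Double
}

// The schema is only enforced here, when we try to read an event as an order.
func readOrder(_ event: RawEvent) -> Order? {
    guard event["type"] == "order",
          let orderId = event["orderId"], !orderId.isEmpty,
          let totalText = event["total"], let total = Double(totalText)
    else { return nil }
    return Order(orderId: orderId, total: total)
}

let rawEvents: [RawEvent] = [
    ["type": "order", "orderId": "A1", "total": "19.5"],
    ["type": "order", "total": "oops"],        // corrupt, but still logged
    ["type": "pageview", "url": "/home"],      // fine, just not an order
]

// Only events that satisfy the order schema survive the read.
let orders = rawEvents.compactMap(readOrder)
```

The corrupt event is still sitting in the logs, so a later version of the reader could choose to interpret it differently without anything having to be re-collected.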

And that’s it, after mentioning only a ‘few’ technologies I’ve raced through Metail’s big data pipeline and compared it to a more modern equivalent. For anyone looking to build their first pipeline I think it is worth looking at the streaming solution as that technology is advancing fast and windowing over the streams gives much more powerful batches. It’s something we’re planning to look into with Onyx for the next iteration.

/dev/summer 2016 is almost upon us. This is the latest in a series of bi-annual developer conferences organized by our friends at Software Acumen, and will be taking place at the Møller Centre, Churchill College, Cambridge next Saturday 25th June. It’s a low cost, high value software developer event covering DevOps, Mobile, Web, NoSQL, Cloud, Functional Programming, Startups and more.

Back in 2014 Jim Downing (Metail CTO) and I gave a hands-on session based on an extended version of TryClojure, where participants got to implement a Sudoku solver in Clojure. This year we’re back with another REPL-driven development session. We’ll be showing people how to write a chat bot in Clojure. We’ll cover Clojure as a REST client, build a simple web service using Ring and Compojure, deploy our application to Heroku, and configure a Slack command integration.

If that’s not your cup of tea, there are plenty of other sessions to choose from and the conference provides an excellent opportunity to meet and chat with experts in your field in a friendly and relaxed environment. Metail is sponsoring this year’s event – look out for our stall. Oh, and we’re hiring! If you’d like to make Clojure and ClojureScript part of your day job, or are interested in any of the other tech jobs we’re advertising in Cambridge, come along and talk to us.

Tickets for /dev/summer are on sale here.

For the last few years Metail have hosted a small number of internships in our tech team. These are often really great experiences, giving students new skills and the chance to focus on a challenging and fun project as part of one of our teams, and giving our teams scope to pursue some ideas that they might not otherwise find the head space for.

Some of the things our interns have done in the last two years are:

  • Created a prototype to allow users to look around a garment in 3D using Google Glass or the Amazon Fire Phone
  • Contributed to a cutting edge 3D face and head research project
  • Created an automated test and performance system for our apps
  • Produced a new model of behavioural analytics based on Metail user browsing behaviour

Our interns are generally undergraduate students, but that’s not universally the case, and internships have led on to permanent careers too. This year, a number of teams are looking for interns, so there is a range of opportunities: front-end app development, middle-tier Clojure development, garment simulation, computer graphics programming, data science and computer vision / machine learning.

Would you like to know more?