Metail is a UK fashion technology startup with offices in Cambridge and London. We use Clojure on both the front end and the back end, and currently have vacancies for both Clojure and ClojureScript developers in our Cambridge office. If you’re interested in functional programming and are keen to work with Clojure, we’d love to hear from you. You don’t need to be an expert: we’re a friendly company and there are plenty of people here to help you learn and grow your skills.
Metail were early adopters of Clojure, with the first code going into production back in 2010. This was a Clojure implementation of our size recommendation algorithm. Back then we were using Java’s Spring Framework for server-side applications, with the Clojure code embedded into the Spring application as a Java class. Nowadays, our web services are implemented in Clojure using Pedestal and ring-swagger, and we are considering Lacinia for one of our newest applications. On the front end, we use ClojureScript with re-frame and a Material UI library. We also use Clojure to orchestrate cloud deployments (REPL-Driven DevOps) and for large-scale data processing on Amazon’s Elastic MapReduce clusters.
William Byrd at Cambridge NonDysfunctional Programmers
Metail have long been supporters of the local tech community: I met CTO Jim Downing back in 2009, when he was running the local Clojure user group. I took over in 2013, and another Metailer, Rich Taylor, took up the reins this year. When Metail moved into a new city-centre office, we had space to host meet-ups ourselves, complete with data projector and excellent wi-fi. Now we are regular hosts of Cambridge NonDysfunctional Programmers, Data Insights Cambridge, Cambridge AWS User Group, DevOps Cambridge and Cambridge Gophers. As well as providing a free venue, Metail sponsors refreshments at many of these Meetups.
If you’d like to join this growing company and vibrant local tech community, check out our current vacancies. If you’re excited by the prospect of a Clojure career but don’t see your ideal job listed there, please drop us a line anyway – we’re always keen to hear from enthusiastic Clojure developers and there may be an opening that hasn’t made it onto the website yet.
Most of the teams I’ve worked on have been not so great at breaking down the work that lands on the development backlog. There are plenty of resources out there on what stories are, and multiple different ways of writing them. There are also articles written about different story-splitting techniques. I couldn’t find anything out there about deliberately applying the theory, however, so I thought I’d write something.
Before I dive in, let’s start with some definitions:
Epic – Also known as a “very big” story: one that is unlikely to be completed in a single sprint or planning cycle. An epic would normally be broken down into several stories before being pulled onto a backlog. Epics can also be used to define the main focus of a development team for a series of sprints.
Story – A smaller piece of work that can fit into a sprint or planning cycle, specifically aimed at providing value to the end user and/or the customer. It can be good to apply the INVEST criteria to any story that you’re writing, or at the very least include some acceptance criteria to define when a story is complete. Typically a story would be written in non-technical language to make it accessible for all interested parties to discuss. There are lots of different ways to write stories; here’s a link to some sample formats.
Task – A piece of a story that describes how the story is going to be achieved. Tasks are usually written by the people doing the work. They should generally be short-lived and completed within the sprint or planning cycle.
Splitting patterns
When examining a story (or an epic), you’re going to need to break it down. This has already been written about in much more detail over here. To keep things simple, I’ve summarized some of the common ones:
Most difficult bit first (What’s the hardest piece of the story to solve?)
Simple case first (What’s the simple solution to the story?)
Functional first (Make it work, worry about performance later.)
User flow (What’s the first thing the user does? What comes after that?)
Per use case (What does user A want to achieve? User B?)
Per operation (buy a subscription, change a subscription, cancel a subscription)
Spike (What questions do you need to answer in order to know more about the solution?)
This is great! We now have some lines of thought we can use to think about our stories. We need to practice using these deliberately, both to get used to applying them naturally and to ensure we don’t fall into the trap of using the same one or two over and over again.
Kata
Much in the same way as you’d practice different coding techniques in a coding kata, you can practice breaking down stories into tasks. There is a little preparation to do in advance of your task-breakdown kata.
Before you start, you’ll need to define some problems to split up. The problems should be large enough that they can be solved with multiple steps. Some examples might be:
Make a banana split
Go on holiday abroad
Buy something from an online store
Set up a new computer for a relative
Organize a party.
Try to work out if your problem is an epic or a story. If you’ve picked an epic, can you split it and write each story against the INVEST mnemonic? You want to end up with a few stories that can be broken down during the task-breakdown kata.
Running the session
Split up into groups of 2-3. Participants should be anyone who needs practice breaking down stories into tasks. If you can, try to make sure there is a mix of disciplines breaking down the selected story.
Choose a splitting pattern from above or from elsewhere, then take 10-15 minutes to apply the pattern to break down one of the stories. After you’ve applied the pattern, try to think around the edges and work out what was missed. What else needs to be included to make the story “complete”?
If you have lots of participants, compare and contrast results with other groups in the kata. Once you’re done you can try splitting the same problem again using a different pattern, or use the same pattern on a different problem. Some problems will lend themselves better to one type of splitting pattern than others. Just keep practicing and you’ll get better at knowing which pattern to use for which kinds of problem.
Metail provides a yearly training budget for all employees, consisting of both time and money, but we found that many employees were not making the most of this opportunity. We decided to look into why this was and work on increasing the uptake. One idea we had was around hackathons – pairing people up for small hackathons sounds more fun than just reading a book by yourself!
One-to-ones help uncover trends
From my one-to-ones I found that the main reason people were not using the training days was that they weren’t sure what to do with them. If people were going to a conference or working toward a qualification or certification, it was easy to identify the time spent on that as ‘training’. But what if you are already qualified? Or there isn’t a conference on this quarter? Or you want to spend some time testing out new technology?
Crew Hackathons
I came up with the idea of running some small hackathons within the crew and suggested we could use training days for these. The idea is that people will pair for a couple of days to create something new. This aligns with our company values: being in this together, actively learning, trust to deliver, and making a difference. But I also wanted to push the joy/excitement axis up a bit as well (see previous post).
Because people never want an extra meeting, we decided to schedule this as a special retrospective session. We kept the happiness axis exercise and collected a few actions based on that, but we spent most of the hour running a hackathon proposal exercise outlined below:
Everyone tries to write down a couple of ideas for 2-day projects they would like to work on, and spends a couple of minutes trying to get others excited about them.
Vote on the proposals. Everyone has 2 votes to pick a project (other than their own). Only projects with 2 or more votes survive.
The projects do not need to be directly related to work, but we should learn something from them. The idea is to spend one day together working out designs, and another day creating a prototype or something usable.
I explained the exercise a week in advance, so people had time to think of projects before the meeting.
Deciding on projects
The exercise went well and everyone seemed quite excited. It turns out that a few people had similar ideas, so we grouped some projects together. We then drew a matrix so everyone could cast their votes. This is how the whiteboard looked:
Hackathon Matrix
The top row of the matrix has people’s initials, with the number of available training days written below.
We (a team of seven) decided to work on 3 projects. The projects with the most votes will have a couple of hackathons associated with them – this is particularly useful if we can’t all get together at the same time. At this stage we can also start thinking about any materials, e.g. books, that we need to buy before we get started.
Scheduling the hackathons
We have the ideas, the people, and most importantly, the excitement, so now it’s just a matter of scheduling these hackathons. If a person is working for the full 10 days in a sprint, they instantly become candidates for any of the hackathons they showed interest in. If we can find someone else interested in the same project who has enough training days available, we pair them together and schedule it in the sprint.
Some of these projects have more than two people interested – in this case we have a 1-hour meeting with everyone interested in it, to come up with a plan and decide how we’ll split the work. For instance, if it was a project that involved four developers and two different platforms, one group could work on one platform one sprint, and the other group could do the other platform the following sprint.
Conclusion
Small hackathon exercises can be helpful for people that don’t know what to do with their training days. Other people can bring ideas that suddenly open the curiosity box, and we can turn the learning exercise into a shared experience. Just as it is, it’s a valuable experience. But some of the projects can even turn into something bigger that brings additional value to the company. I think it’s probably worth running this exercise every quarter, to disconnect from your main duties and refresh a bit. If you can’t find the time to run this, just pack it inside one of your retrospectives. You can always use the happiness axis for a swifter retrospective, and move straight away into finding topics for the hackathons.
Fully remote working: my workstation for a whole week
In the first week of December I ran an experiment: our entire team worked remotely, away from the two main offices. The aim of the venture was for everyone to feel exactly what our remote employees feel every day. Through this, we hoped to improve communication, both within the team and with the rest of the company.
Our team is probably one of the most distributed engineering teams in Metail. While most of our engineers are in the Cambridge office, a few work remotely. We’re lucky enough that they are in the same time zone as headquarters. Nonetheless, we still suffer a lot of the pains that distributed teams feel, especially when the rest of the company is more used to working between the two offices, based in Cambridge and London.
Our hypothesis was that we would probably miss out on a lot of incidental “water cooler” conversations. We also guessed that communication with the rest of the organisation would be somewhat difficult.
Before Kick off
Before we rolled out the experiment, I had to lay some groundwork. Firstly I checked with our crew director (we work in teams called ‘Crews’ at Metail) and the other engineering managers that this wouldn’t impact anything crucial. We communicated widely across multiple channels that our team would be entirely remote during the week before the start date. I also spoke to the team to hear their concerns. It certainly helped to draw up a few guidelines. This is in summary what we came up with:
We use Slack by default and Skype as a backup
We say when we are at our keyboards and when we’re not
Everyone is to use a headset and have their webcam turned on.
In general we try to ensure that we are over-communicating
If there is a problem or someone can’t be reached, people are to come to me (the engineering manager) or our crew director.
There were a few practical things to take care of as well. We made sure our contact details were added to all the meeting rooms’ Skype accounts. We also checked we could all access internal resources via the VPN. Just to be sure, we ran a couple of trial calls to make sure Slack and Skype would work for us (they did!).
So how did it go?
We were able to anticipate the problems we hit; there wasn’t too much of the unexpected. It was much harder to run work past people on a casual, in-person basis: attempting to do so required both parties to mic up and jump on a Slack call.
Meetings with the wider company were where we struggled the most. We noticed that people in Metail occasionally talk over one another, and because of this it was hard to participate in guilds and other group meetings. Usually it meant one person in the office would drown out another who was further away from the room mic. We also noticed that if there were multiple people in the office participating in a meeting, remote workers often ended up ignored. In some cases it was difficult to observe the body language that would normally cue a person to start talking. From time to time it was hard to hear people in the office. Sometimes this was because of problems with the audio equipment, other times it was because of background office noise.
We encountered a few minor technical issues as well. Some of these things were easy to fix, like tweaking rules on a firewall. Others were harder to diagnose, like why a developer was seeing Jenkins time out during load, preventing him from being able to see when builds were finishing. A couple of times we had issues with Slack where one person in the group couldn’t see another but these were easily fixed by leaving the call and re-entering it.
Generally speaking the engineers found it easier to focus on the work they were attempting to do. On the other hand it was pretty difficult for me and our crew director, as we are the main communications interface between the team and the rest of the company.
I also discovered that my house gets really cold during the day if I don’t put my heating on! I made a special effort to be a little more social, going out to dinner and to the pub for much needed social interaction.
Conclusions
On the Monday following the experiment we ran a retrospective where we recorded our experiences. On the whole, the world didn’t end and the company kept working. We recognise that it was a pretty short experiment, lasting only a week, but we still found it valuable. One thing we noticed was that, by announcing the experiment in advance, we certainly affected how the rest of the company interacted with us. I can now say I have a much better understanding of the pain our remote colleagues go through every day. I’m definitely going to be reminding people in the office about it in the future.
Learnings
If you engage with remote employees or are planning to in the future, here is what I’d recommend:
When you are having a meeting with remote people and it’s possible for everyone attending to have their own mic, make sure they do.
Let remote employees know if you are starting a meeting late.
Respect meeting etiquette and allow all attendees to fully express themselves. Don’t interrupt until they’re done speaking.
Scrum retrospectives are a great opportunity to sit down with your team and make everyone’s voice heard. They’re about collective process improvement, with everyone getting involved and owning part of that process; they’re also about feelings, and about empathizing with each other.
A typical scrum retrospective
If you have a formula that works for your team, it’s good to repeat it: your team members will know what to do without having to repeat the agenda every week. However, it can be beneficial to try different things from time to time.
The most important source of ideas is probably the one-to-one meetings. Some team members may actually find the retrospectives boring or not particularly useful, and they may have ideas to improve them. Try some of them, discard things that do not work, and keep the things that people get more involved with.
We started our retrospectives with classic good/bad clustering: we draw two axes, time on the horizontal and goodness-to-badness on the vertical, and people write down 2 positive things and 2 negative things, each with a number from +5 to -5, and stick the post-its on the whiteboard. Every week, a different person tries to cluster the post-it notes into different categories. Sometimes the time scale is a good indicator of a cluster, but we usually re-cluster them into more meaningful categories. Then, that person tries to explain what went well and what went badly during the sprint, asking the relevant people to explain their tickets. The important thing is trying to identify actions based on those notes, pretty much working out the start-stop-continue from that set. However, we don’t do this exhaustively. We focus on the immediately actionable items, the biggest wins and fails.
Some suggested we were wasting too much time on this, and we tried creating a thread on Slack for every sprint where people could write down thoughts as events happened during the sprint, and others would react with emoji. The thread died out after a few sprints, and we realized it was better to think retrospectively during the allocated time slot and get physically involved, i.e., standing up and writing things down.
Happiness axis
Our company wanted to measure happiness somehow. We discussed the option of having some anonymous surveys sent regularly to measure it, but many in the team were put off by having to fill in surveys online. So I decided to do something during the retrospective time, and get people directly involved.
I’ve selected 6 feelings, or axes: 3 positive ones juxtaposed with 3 negative ones. Humans are complicated and full of emotions, so I tried to pick things that I consider actionable in the work environment. This is our list:
Positive vs. Negative
Enjoyment – did I work on something I enjoy? vs. Boredom – most of the stuff was tedious and/or boring
Sense of accomplishment – I got that thing done! vs. Despair – I’m getting nowhere
Powered up – learned something useful! vs. Powered down – I feel I’m losing my skills
I think it’s important to keep it small, though. You don’t want to model the whole brain!
During the retrospective, we draw these axes on the whiteboard. Then, everyone stands up and casts up to 3 votes on any of the axes,
You don’t need to use all the votes (abstentions are counted as well)
You can vote in opposite axes (half of the sprint was really fun, but the other half was boring)
Preferably, add equally-spaced ticks, so we can draw a spider graph in the end.
And this is how it looks in the end,
Happiness Axis
Actions based on happiness axis
Here are some of the recipes we have for actions based on the result of the happiness axis exercise,
… if joy is low:
everyone should have at least one ticket they would enjoy working on in next sprint;
… if boredom is high:
promote team work (e.g. pair-programming), from the premise that the conversation will make tedious tasks less painful;
… if not powering up:
plan for new things in next sprint;
schedule training time;
… when powering down:
discuss during the retrospective and/or one-on-ones which abilities are not being put to use. Try to find a place for them;
reduce time spent in repetitive tasks;
… when there’s no sense of accomplishment:
create smaller tickets with a well-defined goal;
try a “Demo-Driven Development” approach (this is a name I came up with): small features that are always “demoable”;
… when people feel they are going nowhere:
align the tickets with the company/crew objectives, so the goal is well defined;
identify blockers and deal with them ASAP (e.g. build issues).
Simple data visualization
In order to track changes in the team mood over time, we also write the votes down in our Wiki. We keep 3 tables, one for each pair of opposite axes, where each data point is just the date, the value on the positive axis, and the value on the negative one. Confluence can conveniently plot these for you,
Happiness data
From the graphs we noticed things like cycles in despair and accomplishment, which we put down to features that require a couple of sprints to complete: the first sprint is full of despair, but when the feature finally gets completed in the following sprint, the sense of accomplishment spikes up.
Written down in words, it seems like a complex exercise, but it’s something that can be done really quickly, so we’ve kept this as part of our retrospectives.
Conclusion
There is no “correct” way of running scrum retrospectives, but the important thing is that they are dynamic and not too long. Also, make sure that people get involved in them. You probably know more or less what people feel from one-to-ones, but it’s important that they share some of that with everyone else in the team. At the very least, try to record the actionable needs. The happiness axis exercise is quick, takes the scare out of surveys, and turns them into something a bit more fun. But if you feel stale, try doing something completely different from time to time, like brainstorming for ideas that people would like to work on with others. I’ll come back to that in a future post.
We welcomed the Cambridge AWS User Group back to the Cambridge office for its eighth Meetup. This one was focused on Big Data, something I spend a lot of my time working on here at Metail, and I was keen to give a talk. Having been put on the agenda, I was nervous when 65 people signed up – the office capacity!
Just kicking off our Big Data meeting. Train chaos means lots of late arrivals still rocking up! pic.twitter.com/OwRqiv6VK5
We had an exciting line-up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriott gave the first talk, an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm came across as well measured and not just ‘oo shiny new tool’, as he did a good job of relating it to his own experience of querying large data sets and highlighting the trade-offs. The main one is that Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgreSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and expect real-time results, because the queue becomes backlogged. The general belief is that the slow query response is because of the overhead of the leader node orchestrating the query; possibly a single-node cluster wouldn’t have the problem. Something to put on the experiment list 🙂
Peter Marriott, Redshift, live demo. What could go wrong? Turns out, not much… pic.twitter.com/ppPHwuD1h1
The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo on setting up a tracker and exploring the collected data. The exploration was done using Spark, which is a managed install on EMR, as well as Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo, and David had some good input on some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our setup a little to be more event-driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post about the use of Kinesis Firehose and Kinesis. The demo used Kinesis Firehose, which writes in batches; however, you have control over when the buffer is flushed, and David chose 60s to keep things flowing. You can use Kinesis streams, as David mentioned, if you want more of a streaming solution.
I was the final speaker on the agenda and my talk was titled “Why The ‘Like’ In ‘Postgres Like’ Matters”. I went through the decisions we’ve made when using Redshift and why, focusing on two main ones. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data; or to have more CPU and less storage for faster querying, but having to drop old data. We decided to keep all our data available in Redshift and progressed through clusters made up of an increasing number of compute nodes until we had to switch to a cluster made up of a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately, having never worked with columnar data stores, we went with a normalised schema layout which would have worked well on a row store such as PostgreSQL. We did use distribution and sort keys appropriate for the tables; however, the highly normalised data often had different sort orders or distribution keys per table, which made joins very slow. Since then we’ve done some more detailed research and more testing. Now that we have a much larger data set and less CPU, our tests highlight schema and query problems much more clearly, which has led to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys across tables, joins no longer need to sort data or pull it from elsewhere in the cluster. As David said, Redshift optimisation is all about the schema design.
Overall we’ve found Redshift a very powerful tool, and like any tool there is a learning curve. As with all the AWS services I’ve used, there are features in place to allow you to change your mind and hack around – most of this is due to the ease with which you can take snapshots and restore them to differently shaped clusters.
Finally here’s me presenting:
Gareth Rogers of our lovely hosts @metail, showing us their Big Data architecture, and how they're using it. pic.twitter.com/X9AlStfKWz
It looks dark but it was still the hottest day of the year!
Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and to Jon and Stephen for organising the Meetup. We’re looking forward to seeing everyone at the ninth Meetup here at Metail on Tuesday 25th October.
With two 3-month internships under my belt at Metail, it’s easy to see why people keep joining. As an R&D Intern, I’ve been continually challenged and pushed to learn new skills and apply them to often independent and in-depth projects. The responsibility and expected self-sufficiency have been well balanced to allow a comfortable attachment to my work; and now that I’m leaving to go back to my final year at university, it’s clear that a lot of what I’ve done with Metail will help me focus and push myself in my studies.
My assigned and chosen work has been excitingly challenging and intriguingly broad. With several weeks spent collaborating with a great team of people to build advanced features for a Facebook chatbot, I had the pleasure of working on state of the art 3D face modelling, including the challenges of adding cosmetic changes, finding ways to smoothly transform one face to another, and robustly positioning other 3D models on and around the faces. All within the context of delivering a user experience, albeit an experimental one. The surprise to me came in finding that “R&D” is not synonymous with “hidden in the back room for no one to see.” Sometimes, it turns out, it just means you don’t have to worry about perfecting a product and can focus on learning as much as possible about users and what they want.
In three months, you don’t necessarily just get to work on one project. On top of 3D face modelling, I got the opportunity to start with a blank folder with zero files in it and a seemingly simple task: recommend clothes to a user. That might only take five words to say, but it takes more than five lines of code to do. This task required me to build from an empty file tree to a framework for creating and testing ways of implementing recommendation algorithms. I found it an extremely rewarding opportunity to work independently on such a project. At the same time, the real reward of working in a place like Metail isn’t just getting to take pride in your work, but knowing that at any point in time there is a whole host of people ready and willing to help you if you ask for it. With a collection of experienced and knowledgeable colleagues, I always found it easy to get help when I needed it. The technical knowledge and experiential learning gained will no doubt prove invaluable in the future.
I should also point out that the work itself isn’t the only part that’s fun. The people are wonderful and I personally enjoyed the fact that I only wore shoes to work 8 times in the entire summer (flip-flops are so much more comfortable). If you need advice on which are the best pubs in Cambridge, look no further, because Friday pub lunches serve as an excellent method of exploration. Meanwhile it’s worth noting that interns get free membership to the Friday cocktail club, which makes for a thoroughly enjoyable social activity whether you care for the cocktails or not!
In the end, sometimes you have to do some work, so in my experience, you might as well make sure it’s work that is in itself rewarding and comes with plenty of added benefits; my time at Metail has been a core of fulfilling work with a periphery of positive side effects. There’s no doubt in my mind that I’ll soon find an excuse to come back again.
Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation by a struct like the one below, instead of a generic 4×4 matrix,
struct Transform {
var position = float3(0, 0, 0)
var scale = float3(1, 1, 1)
var rotation = Quaternion()
}
However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU (a sketch of such a conversion follows the list below). There are two main reasons for this:
GPUs have native support for matrices, making them the natural choice when thinking about performance;
in traditional pipelines, we only worried about the final projected position of each vertex, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), making the transformation of vertices in the GPU really cheap.
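Here is a minimal sketch of what that conversion might look like in Swift. This is illustrative rather than the original implementation: it assumes simd types, a column-major float4x4 as used by Metal, and a quaternion stored as (x, y, z, w) with w the scalar part; the function name is made up,
import simd
// Build the column-major model matrix M = T * R * S from a Transform.
func modelMatrix(position: float3, scale: float3, rotation q: float4) -> float4x4 {
    let (x, y, z, w) = (q.x, q.y, q.z, q.w)
    // Columns of the rotation matrix derived from the unit quaternion,
    // with the per-axis scale folded into each column.
    let c0 = float4(scale.x * (1 - 2*(y*y + z*z)), scale.x * 2*(x*y + z*w), scale.x * 2*(x*z - y*w), 0)
    let c1 = float4(scale.y * 2*(x*y - z*w), scale.y * (1 - 2*(x*x + z*z)), scale.y * 2*(y*z + x*w), 0)
    let c2 = float4(scale.z * 2*(x*z + y*w), scale.z * 2*(y*z - x*w), scale.z * (1 - 2*(x*x + y*y)), 0)
    let c3 = float4(position.x, position.y, position.z, 1)
    return float4x4(columns: (c0, c1, c2, c3)) // simd column initializer
}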
Growing shader complexity
Of the two reasons stated earlier, the second barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.
The Projection and View matrices are only set once per camera or viewport, and the World matrix will be set per object or instance being drawn. The vertex shader will look like this,
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);
If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader will need to do some extra operations,
Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);
The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).
struct Transform {
// for alignment reasons, position and scale are float4
float4 position; // only xyz actually used
float4 scale; // only xyz actually used
float4 rotation; // unit quaternion; w is the scalar
float3 operator* (const float3 v) const {
return position.xyz + quatMul(rotation, v * scale.xyz);
}
};
/// Quaternion Inverse
float4 quatInv(const float4 q) {
// assume it's a unit quaternion, so just Conjugate
return float4( -q.xyz, q.w );
}
/// Quaternion multiplication
float4 quatDot(const float4 q1, const float4 q2) {
float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector)
float3 quatMul(const float4 q, const float3 v) {
float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
return r.xyz;
}
Let’s see if this is any slower than matrices with an example.
Rotating cubes demo
I’ve created this demo of rotating cubes to measure the performance of using quaternions in a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone6.
The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!
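For context, here is a hedged sketch of what that instanced draw call might look like in Swift – the buffer and count names (perInstanceUniforms, cubeVertexCount) are illustrative, and the vertex shader picks its own Transform through the instance id, as in the shader snippet above,
// Bind the per-instance Transforms once, then draw all 240 cubes in a single call.
renderEncoder.setVertexBuffer(perInstanceUniforms, offset: 0, atIndex: 1)
renderEncoder.drawPrimitives(.Triangle, vertexStart: 0,
    vertexCount: cubeVertexCount, instanceCount: 240)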
The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending on the amount of area covered by the cubes as they rotate, but I placed them close together so it’s always very dense).
Both versions run at 60fps on an iPhone6. This is a frame capture of the version that uses matrices,
The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.
For a proper comparison, we need to make this example vertex-bound, so I’ve prepared another example with spheres instead of cubes,
The tessellation level can be increased at compile time. In the video, there are only a few hundred vertices per sphere, so both matrices and quaternions still run at 60fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.
The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,
The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. Matrices turned out to be 80% slower here.
Because GPUs are becoming more general-purpose, it could be that matrices have no real advantage anymore, since the number of multiplications and additions is actually greater. Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one less float4 per object), we managed to increase the cache coherence. Every GPU will behave slightly differently, so it’s better to run an empirical test like this to check the real behaviour of your code.
Performance comparison in the CPU
Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,
The updateBuffers function takes 5.4% of the CPU time, mostly taken in converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through,
As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,
Extra thoughts on performance
More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,
cubes-demo-matrices – This is the version using matrices, with no instancing.
Both versions still run at 60fps, but we are now issuing 240 draw calls, one per cube. While CPU usage was around 20% in the instanced quaternion version, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉
Note that we could do some extra memory optimization with matrices by sending just the first 3 rows, which is enough to represent an affine transformation (though not a projection). This is a common optimization, and shader languages support operations with float3x4 matrices because of it. But if we are talking about just rotations, it is still more memory-efficient to send a quaternion, which is a float4, instead of a float3x3 matrix (which for memory alignment reasons often becomes a float3x4).
On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that converts Transforms to matrices, and the only matrix we will need to keep is the Projection matrix.
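As an illustration of that idea (a sketch, not code from the demo), two Transforms can be composed directly in the shader, provided the outer transform carries no scale, which is normally true for a view transform,
// Compose two Transforms: a applied after b (e.g. a = view, b = model).
// Assumes a.scale is (1, 1, 1), as is usual for a view transform.
Transform concat(const Transform a, const Transform b) {
    Transform t;
    t.rotation = quatDot(a.rotation, b.rotation); // combined rotation (quaternion product)
    t.scale = b.scale;                            // scale is still applied first, in b's local space
    t.position = float4(a.position.xyz + quatMul(a.rotation, b.position.xyz), 1);
    return t;
}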
Conclusion
Our initial preconception that matrices were better for the shader world was wrong. Using quaternions in the GPU is actually faster than using matrices on a modern GPU like Apple’s A8 chip. The memory footprint is also reduced, and the chances of finding our data in the cache increase.
Moreover, if we eliminate the quaternion-to-matrix conversions, not only will the code get simpler and tidier, but we’ll also save several precious CPU cycles.
But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!
Metal (not Metail) is a low-level API from Apple that combines OpenGL and OpenCL into a single interface. The purpose of introducing their own API was mainly to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan, or Microsoft’s DX12, but specifically targeted at Apple hardware.
Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.
In this article I’m going to introduce Metal with a small example where all the data updates happen in the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully, it will spark your interest and you’ll dig into the references for more extensive documentation 😉
Procedural rain example
I’ve written a small demo that should look like rain,
It draws and updates thousands of 2D lines at 60 fps on an iPhone6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2ms.
To get started with Metal you will need a Metal-ready device and XCode. In XCode, just create a new project and select
iOS Application: Game
Language: Swift
Game technology: Metal
This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t understand Metal. The triangle data in the example is triple-buffered, so you can update it in the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,
// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
// updates in CPU cycles
self.update()
// register completion callback
let commandBuffer = commandQueue.commandBuffer()
commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
if let strongSelf = self {
dispatch_semaphore_signal(strongSelf.inflightSemaphore)
}
return
}
// draw stuff
// ...
commandBuffer.commit()
}
Some interesting Swift notes:
You can omit the brackets when the last argument of the function you are calling is a lambda. You can still write ‘addCompletedHandler(myFunction)’.
The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the lambda function. Otherwise, we could have a cyclic reference and leak memory.
Because the reference is now weak, it basically becomes an optional (something that could be null). The ‘if let x = optional’ is used to dereference the optional when it’s not null.
Preparing Metal objects
These are the things you need to prepare in order to render something on screen:
Resources: data buffers and textures.
States: render pipeline state and depth-stencil state.
Descriptors: definitions that describe the objects above. This includes your shader code.
Render Command Encoder: the stuff that converts API commands into hardware commands.
Command Buffer: it’s where you store your commands that are eventually committed to the GPU.
Command Queue: where you queue an ordered list of command buffers.
I assume you are more or less familiar with how a typical graphics pipeline works, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing in the GPU.
I’ll explain the shader code later, but for now you just need to know that you can access your shader functions very easily using a shader library,
let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!
“updateRaindrops” is the name of the function in the shader code.
You can create a render state without a fragment program. Your vertex shader can be used to modify any arbitrary buffer, without the need of specifically creating a compute shader.
let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat
With that descriptor we can now create the state. Note that this is done only once,
do {
try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
print("Failed to create pipeline state, error \(error)")
}
Notice that in Swift, the “try” keyword is used for every expression that can throw an exception. If we are happy with an optional value, we can remove the do-catch and use “try?”,
let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)
Now we need a data buffer. Metal is designed for the A7 chip’s unified memory system, so both the CPU and the GPU can share the same storage. We will need to take care of synchronization, but in this example the raindrops will be updated and read only in the GPU.
// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"
And now that you have everything ready, we can “draw stuff” in drawInMTKView,
// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
currentDrawable = view.currentDrawable
{
// setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
// Check alignment in setVertexBuffer doc
let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
let uniformOffset = numberOfUniforms * sizeof(Float)
let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
renderEncoder.label = "render encoder"
// The drawing phase is a simple shader that draws lines in 2D
// DebugGroup labels are for debugging during frame capture.
renderEncoder.pushDebugGroup("draw rain")
renderEncoder.setRenderPipelineState(pipelineState)
renderEncoder.setVertexBuffer(raindropDoubleBuffer,
offset: bufferOffset*doubleBufferIndex, atIndex: 0)
renderEncoder.drawPrimitives(.Line, vertexStart: 0,
vertexCount: vertexCount, instanceCount: 1)
renderEncoder.popDebugGroup()
// update particles in the GPU
renderEncoder.pushDebugGroup("update raindrops")
renderEncoder.setRenderPipelineState(updateState)
// this is where we read the particles from
renderEncoder.setVertexBuffer(raindropDoubleBuffer,
offset: bufferOffset*doubleBufferIndex, atIndex: 0)
// this is where we write the updated particles
renderEncoder.setVertexBuffer(raindropDoubleBuffer,
offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
renderEncoder.setVertexBuffer(uniformBuffer,
offset: uniformOffset * syncBufferIndex, atIndex: 2)
// noiseTexture contains random numbers
renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
// every particle is treated as a point, but we aren't rendering anything on screen
renderEncoder.drawPrimitives(.Point, vertexStart: 0,
vertexCount: particleCount, instanceCount: 1)
renderEncoder.popDebugGroup()
renderEncoder.endEncoding()
commandBuffer.presentDrawable(currentDrawable)
}
// syncBufferIndex matches the current semaphore-controlled frame index
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
commandBuffer.commit()
And that’s all! You don’t need to do anything else on the CPU 🙂
Writing shader code
Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say bye to the runtime nightmares of OpenGL ES.
Let’s jump directly to the raindrop update function,
#include <metal_stdlib>
using namespace metal;
struct LineParticle
{
float4 start;
float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2
// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
constant LineParticle* particle [[ buffer(0) ]],
device LineParticle* updatedParticle [[ buffer(1) ]],
constant Uniforms& uniforms [[ buffer(2) ]],
texture2d<float> noiseTexture [[ texture(0) ]])
{
LineParticle outParticle;
float4 velocity = float4(0, -0.01, 0, 0);
outParticle.start = particle[vid].start + velocity;
outParticle.end = particle[vid].end + velocity;
if (outParticle.start.y < -1) {
outParticle.end.y = 1;
outParticle.start.y = outParticle.end.y + 0.1;
}
updatedParticle[vid] = outParticle;
};
I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the end of the screen. Check the full source for the full update, with some simple bouncing on the ground and obstacles, and resetting to a random position.
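As an example of what that fuller update might look like (a hedged sketch, not the code from the repository; it assumes a 64×64 noise texture with random values in its first channel), the reset branch could pick a random horizontal position like this,
if (outParticle.start.y < -1) {
    // read a pseudo-random value for this particle from the noise texture
    float r = noiseTexture.read(uint2(vid % 64, (vid / 64) % 64)).x;
    outParticle.end = float4(2.0 * r - 1.0, 1.0, 0.0, 1.0);           // random x in [-1, 1], back at the top
    outParticle.start = outParticle.end + float4(0.0, 0.1, 0.0, 0.0);
}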
The “constant” and “device” keywords are address space qualifiers. “constant” refers to read-only buffer memory objects that are allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the device memory pool that are both readable and writeable.
Handling Metal errors
In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the OpenGL error status all the time just to get error messages, cluttering your code with those error queries all over the place. Plus, the error messages were usually hard to decipher.
This is an example error in Metal,
'MTLPixelFormatRG16Unorm is compatible with texture data type(s) ( float ).'
I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
Because in the shader I had: texture2d<half> noiseTexture [[ texture(0) ]]
The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.
Frame captures
The frame capture in XCode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.
Let’s inspect a frame,
Frame Capture in XCode
On the left side, you can see all the commands. There are only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact performance. The debug group labels are shown as folders, and you can see the timings for each one. Or you can expand them and see the details. The particle update takes 183 microseconds. Surely faster than if we had linearly looped through the buffer and updated the particles on the CPU 😉
You can expand each command to see the call stack and jump to the CPU code.
You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.
You can rewrite the shader code there and click the “Update Shaders” icon to re-compile them and re-run the frame with the updated shaders.
It’s really powerful and easy to use.
Conclusion
If you are developing on iOS or macOS and are into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL’s, and it outperforms OpenGL by removing redundant state changes and making definitions more static.
If you like graphics programming but have never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax that’s really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.
Tomorrow, Tuesday 12th, we’re welcoming back the Cam AWS User Group for their 7th Meetup. This is the fourth user group meetup we’ve hosted, and we’re now set to host the remaining three of the year. The meetup promises to be information-packed and is focused on AWS Lambda, with two speakers talking about their experiences. There’s also a debrief on the recent AWS Summit in London, and Danilo Poccia, a technical evangelist from AWS, is talking about data analytics.
The AWS London Summit was on the 7th July and I went along with a colleague. Inevitably we bumped into some of the Cam AWS UG members and shared a DLR ride over to the ExCeL. Personally I found the Deep Dive on Amazon DynamoDB to be the most informative session, with a good bit of depth on how to design your schema, avoid hot keys and understand its internal partitioning. This is important for schema design and resolving certain bottlenecks. My most disappointing talk was the Deep Dive on Microservices and Amazon ECS, as it didn’t add to my knowledge, and I’ve only ever seen talks and demos of ECS, never getting my hands dirty. My colleague attended the Deep Dive on EC2 Instances and it sounded like I’d have gotten a lot from that talk. I’m sure that others went to interesting (and disappointing) sessions and I would like to know what they got out of them.
AWS Lambda is one of AWS’s hot technologies, released almost two years ago, and we’ve started experimenting with it at Metail. I’m really keen to see how it’s being used and experimented with by others, as AWS Lambda’s use within Metail is certainly growing. I’ve had a fun little project writing a plugin for Leiningen which allows you to manage AWS Lambda functions, with the aim of integrating it into our build process. Still, it’s nowhere near as functional as Gordon, which I saw demonstrated at the most recent Snowplow London Meetup; it sounds like something to compare with Ben Taylor’s talk on Using Lambda and CloudFormation.
The final talk of the night is from Danilo Poccia; I’m particularly looking forward to asking questions at the end, as it’s the most relevant to my day-to-day job 🙂
We’re looking forward to seeing everyone tomorrow: doors open at 6:45pm and the talks start promptly at 7pm. We’ll be providing beer, soft drinks and snacks – be prompt to get your favourite beverage before the talks start 🙂