How we can build for our colleagues

It’s sometimes necessary for an organisation to develop software to support its internal operations. Doing this well is less straightforward than one might think. In this post, I examine some of the challenges faced by product teams building internal tools, and share some lessons learned from working on consumer products that are applicable in overcoming them.

 

The value that comes from using a tool is in how it improves a process. When an actor from a user story hacks around the current process to get their job done, it’s a good indicator that a new tool might be needed. Another step may be required in the workflow, for instance, if users frequently open another browser window to perform a particular task. There are also situations when a new feature is required for reasons other than improving the user experience. We may wish to gather data to train a machine learning algorithm which will ultimately allow us to automate a manual process.

 

Another reason to build our own tools is to avoid vendor lock-in – the situation where we become unable to switch our process from one product or service to another without substantial costs. However, it’s important to remember that the decision to adopt any technology, be it proprietary or open source, is a long-term commitment. While there are compelling reasons to choose an open source solution, we may incur large costs in adapting it to fit our process, or in simply learning how to use it well, if the base technology and expertise don’t already exist in our stack.

 

How do we avoid reinventing an existing tool which already fits our purpose? Cast a wide net to find out whether or not a cost-effective solution is already available on the market. Don’t hesitate to open this investigation to the operations and engineering teams. Their involvement is important; although they may have a good understanding of the problem domain, they often lack the marketplace visibility and exposure to product demos or sales-driven trials that product managers or the business team have. How have stakeholders solved similar problems at previous organisations? Getting input from every player at this stage can eliminate a lot of uncertainty around the necessity of the work involved.

 

When there’s a genuine need for a bespoke solution because the marketplace doesn’t offer an essential feature, expectations may still be high because users will be familiar with similar well-established, high-quality software. We can manage these expectations by including metrics and benchmarking on the product roadmap and by building them into the product as soon as the size of the user base justifies the effort. This also gives us the confidence to abandon our developing solution for something better if it isn’t performing as we’d hoped. Involving users in the development cycle early can also help – users are more forgiving of work in progress when they are part of its inception and growth.

 

We can develop the best understanding of our customers’ pains by beginning the development cycle with an exploratory research phase. This allows us to get to the root of the problem and discourages us from rushing to a suboptimal solution. IDEO’s human centered design framework provides some useful techniques for doing this, such as by having customers map their journey through the process or by observing the journey directly, taking note of any unnecessary cognitive overhead and the behaviours of our “power users”.

 

The research phase may also take the form of a design sprint, where inexpensive prototype solutions are validated by observing how customers interact with them. Be sure to meet with every possible user at this stage. Not only will users at different levels in the workstream be concerned with different tasks; they may also have different working styles which the UX will need to accommodate. This can seem like a large upfront time investment, but it’s far less costly than waiting until after UAT to learn that the chosen solution doesn’t meet the customers’ needs.

 

What do we do when we don’t have the luxury of conducting a lengthy exploratory research phase? When pivoting, a startup or a product team needs to adapt its operations at short notice, sometimes resulting in the prioritisation of a completely new set of features. As an internal product team, our colleagues are our customers; we should therefore be well positioned to meet with them early and often. When we don’t, we develop false assumptions about where the process bottlenecks are. When gathering requirements, don’t be afraid of asking “why” too often. On first asking, our customers might tell us what they think we want to hear, suggesting “quick wins” or solutions they believe are easy to pull off, rather than revealing their greatest pains. Persistence in our questioning will pay dividends.

 

Feature requests are, in theory, better supported by an internal development team than an outsourced one, and straightforward for us to act on because we can easily seek clarification. In practice, we need to consider the long-term costs of maintaining these features. Even simple estimation exercises like Josh Pigford’s build vs. buy calculator can be of help. More often than we’d like, resource constraints may mean that we’re not able to balance the local needs of our internal customer with the overall needs of the business. When that’s the case, it’s important for the health of the relationship to communicate why the work can’t be done at this time. Shared understanding and goals reduce the tension between the teams and encourage us to review and update these priorities continuously.

 

If our tool doesn’t require expertise to operate, then we’re able to easily dog food our product across the organisation. This lets us find and form relationships with product-minded users who can identify problems which we may have become blind to when designing and building. Take advantage of this, remembering that the managers of most consumer products don’t have this luxury! Developing these relationships by holding “open office hours” increases the quality and quantity of feedback we receive.

 

Once the tool has been built, how do we ensure that product development continues smoothly? Having the development team focus early on the infrastructure necessary to support continuous delivery allows us to launch and begin gathering feedback as early as possible and keep a tight, iterative development cycle. When done well, we can reap the same benefits from practicing agile with our internal tool development as with our consumer products. MVPs are a great way to accelerate learning, but we shouldn’t be duped into thinking that it’s acceptable to produce sub-standard features, believing that they can be “improved incrementally” because we have only our colleagues’ expectations to manage. The launched product should consist of the minimum set of features required to deliver value, but each of those features needs to meet some previously agreed standards.

 

When planning, it’s important to be mindful of how our users will onboard. We’re familiar with the notion that “good design needs no instructions”, but even refined technical operational processes require some training. To save time and effort, training for our tools could take the form of a webinar which can be made available online for later access. Announcing the initial launch internally and continuing to meet frequently with customers can both help drive adoption, and announcing subsequent feature releases can help users imprint on workflows. Make all of the feedback received easily accessible to engineers, for example, through a dedicated Slack channel or integration. Above all, celebrate as a team when users are delighted.

 

In summary, it’s easy for us to become complacent or misguided when we’re designing for our colleagues. We know their organisation, its mission and its roadmap. We know their titles, respective roles and working environment. We may therefore assume that we know what’s best for them, and worse, we won’t make the time to validate those assumptions. Instead, if we do our internal customers the same courtesies as we would our flagship product users, but acknowledge when to treat them differently, we stand a much better chance of delivering the best possible outcome.

Metail is a UK fashion technology startup with offices in Cambridge and London, UK. We use Clojure on the front-end and back and currently have vacancies for both Clojure and ClojureScript developers in our Cambridge office. If you’re interested in functional programming and are keen to work with Clojure, we’d love to hear from you. You don’t need to be an expert; we’re a friendly company and there are plenty of people here to help you learn and grow your skills.

Metail were early adopters of Clojure with the first code going into production back in 2010. This was a Clojure implementation of our size recommendation algorithm. Back then we were using Java’s Spring Framework for server-side applications, with the Clojure code embedded into the Spring application as a Java class. Nowadays, our web services are implemented in Clojure using Pedestal and ring-swagger and we are considering Lacinia for one of our newest applications. On the front-end, we use ClojureScript with re-frame and a Material UI library. We also use Clojure to orchestrate cloud deployments (REPL-Driven DevOps) and for large-scale data processing on Amazon’s Elastic Map Reduce clusters.

NonDysfunctional Programmers Meetup

William Byrd at Cambridge NonDysfunctional Programmers

Metail have long been supporters of the local tech community: I met CTO Jim Downing back in 2009, when he was running the local Clojure user group. I took over in 2013, and another Metailer, Rich Taylor, took up the reins this year. When Metail moved into a new city-centre office, we had space to host meet-ups ourselves, complete with data projector and excellent wi-fi. Now we are regular hosts of Cambridge NonDysfunctional Programmers, Data Insights Cambridge, Cambridge AWS User Group, DevOps Cambridge and Cambridge Gophers. As well as providing a free venue, Metail sponsors refreshments at many of these Meetups.

If you’d like to join this growing company and vibrant local tech community, check out our current vacancies. If you’re excited by the prospect of a Clojure career but don’t see your ideal job listed there, please drop us a line anyway – we’re always keen to hear from enthusiastic Clojure developers and there may be an opening that hasn’t made it up to the website yet.

 

We welcomed the Cambridge AWS User Group back to the Cambridge office for its eighth Meetup. This one was focused on Big Data, which is something that I spend a lot of my time working on here at Metail, so I was keen to give a talk. Having been put on the agenda, I was nervous when 65 people signed up, which is the office capacity!

We had an exciting line-up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriot gave the first talk, an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm came across as well measured and not just ‘oo shiny new tool’, as he did a good job of relating this to his own experience of querying large data sets and highlighting the trade-offs. The main one is that Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgreSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and expect real-time results, because the queue becomes backlogged. The general belief is that the slow query response is due to the overhead of the leader node orchestrating the query; possibly a single-node cluster wouldn’t have the problem. Something to put on the experiment list 🙂

The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo on setting up trackers and exploring the collected data. The exploration was done using Spark, which is a managed install on EMR, and also Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo and David had some good input to some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our setup a little to be more event driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post about the use of Kinesis Firehose and Kinesis. The demo used Kinesis Firehose, which writes in batches; however, you have control over when the buffer is flushed. David chose 60s to keep things flowing. You can use Kinesis streams, as David mentioned, if you want more of a streaming solution.

I was the final speaker on the agenda and my talk was titled “Why The ‘Like’ In ‘Postgres Like’ Matters”. I went through the decisions we’ve made when using Redshift and why. There were two main ones which I focused on. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data; or to have more CPU and less storage for faster querying but having to drop old data. We decided to keep all our data available in Redshift and progressed through a cluster made up of an increasing number of compute nodes until we had to switch to a cluster made up of a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately, having never worked with columnar data stores, we went with a normalised schema layout which would have worked well on a row store such as PostgreSQL. We did use distribution and sort keys appropriate for the tables; however, the highly normalised data often had different sort orders or distribution keys per table, which made joins very slow. Since then we’ve done some more detailed research and more testing. Now that we have a much larger data set and less CPU, our tests highlight schema and query problems much more clearly, which has led to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys for the tables, joins no longer need to sort data or pull data from elsewhere in the cluster. As David said, Redshift optimisation is all about the schema design.

Overall we’ve found Redshift a very powerful tool, and like any tool there is a learning curve. As with all the AWS services I’ve used, there are features in place to allow you to change your mind and hack around. Most of this is due to the ease with which you can take snapshots and restore them to different shaped clusters.

Finally here’s me presenting:

It looks dark but it was still the hottest day of the year!

Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and Jon and Stephen for organising the Meetup. We’re looking forward to seeing everyone at the ninth Meetup here at Metail on Tuesday 25th October.

With two 3-month internships under my belt at Metail, it’s easy to see why people keep joining. As an R&D Intern, I’ve been continually challenged and pushed to learn new skills and apply them to often independent and in-depth projects. The responsibility and expected self-sufficiency have been well balanced to allow a comfortable attachment to my work; and now that I’m leaving to go back to my final year at university, it’s clear that a lot of what I’ve done with Metail will help me focus and push myself in my studies.

My assigned and chosen work has been excitingly challenging and intriguingly broad. With several weeks spent collaborating with a great team of people to build advanced features for a Facebook chatbot, I had the pleasure of working on state-of-the-art 3D face modelling, including the challenges of adding cosmetic changes, finding ways to smoothly transform one face to another, and robustly positioning other 3D models on and around the faces. All within the context of delivering a user experience, albeit an experimental one. The surprise to me came in finding that “R&D” is not synonymous with “hidden in the back room for no one to see.” Sometimes, it turns out, it just means you don’t have to worry about perfecting a product and can focus on learning as much as possible about users and what they want.

In three months, you don’t necessarily just get to work on one project. On top of 3D face modelling, I got the opportunity to start with a blank folder with zero files in it and the seemingly simple task: recommend clothes to a user. That might only take five words to say, but it takes more than five lines of code to do. This task required me to build from an empty file tree to a framework for creating and testing ways of implementing recommendation algorithms. I found it an extremely rewarding opportunity to work independently on such a project. At the same time, the real reward of working in a place like Metail isn’t just getting to take pride in your work, but knowing that at any point in time there is a whole host of people ready and willing to help you if you ask for it. With a collection of experienced and knowledgeable colleagues, I always found it easy to get help when I needed it. The technical knowledge and experiential learning gained will no doubt prove invaluable in the future.

I should also point out that the work itself isn’t the only part that’s fun. The people are wonderful and I personally enjoyed the fact that I only wore shoes to work 8 times in the entire summer (flip-flops are so much more comfortable). If you need advice on which are the best pubs in Cambridge, look no further, because Friday pub lunches serve as an excellent method of exploration. Meanwhile it’s worth noting that interns get free membership to the Friday cocktail club, which makes for a thoroughly enjoyable social activity whether you care for the cocktails or not!

In the end, sometimes you have to do some work, so in my experience, you might as well make sure it’s work that is in itself rewarding and comes with plenty of added benefits; my time at Metail has been a core of fulfilling work with a periphery of positive side effects. There’s no doubt in my mind that I’ll soon find an excuse to come back again.

Luke Smith
R&D Intern 2015, 2016

Introduction

Unit quaternions, or versors, offer a more compact and efficient representation of rotations than matrices do. They also free us from issues such as the gimbal lock we often encounter when using Euler angles. That’s why in Computer Graphics you often represent a transformation by a struct like the one below, instead of a generic 4×4 matrix,

struct Transform {
  var position = float3(0, 0, 0)
  var scale    = float3(1, 1, 1)
  var rotation = Quaternion()
}

However, more often than not, quaternions remain in the CPU domain and Transforms are converted into matrices before they are sent to the GPU. The conversion for the struct above looks like this,

func toMatrix4() -> float4x4 {
  let rm = rotation.toMatrix4()
  return float4x4([
    scale.x * rm[0],
    scale.y * rm[1],
    scale.z * rm[2],
    float4(position.x, position.y, position.z, 1.0)
  ])
}

The reason for this conversion is usually twofold,

  • GPUs have native support for matrices, making them the natural choice when thinking about performance;

  • in traditional pipelines, we only worried about the final position of a vertex in world coordinates, so we could premultiply the Projection, the View, and the World or Model matrix into a single matrix (the PVW matrix), thus making the transformation of vertices in the GPU really cheap.
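
The `rotation.toMatrix4()` call used in the conversion above is not shown in this post. As a rough sketch of what such a function typically computes, here is the standard formula that turns a unit quaternion into a rotation matrix, in plain Swift. The `Quat` type and the `rotationColumns` function are hypothetical names chosen for illustration, not the ones from the demo code, and the quaternion is assumed to be normalised with `w` as the scalar part.

```swift
// Hypothetical stand-in for the demo's Quaternion type; w is the scalar part.
struct Quat {
    var x, y, z, w: Double
}

/// Columns of the 3x3 rotation matrix for a unit quaternion (assumed normalised).
func rotationColumns(_ q: Quat) -> [[Double]] {
    let xx = q.x * q.x, yy = q.y * q.y, zz = q.z * q.z
    let xy = q.x * q.y, xz = q.x * q.z, yz = q.y * q.z
    let wx = q.w * q.x, wy = q.w * q.y, wz = q.w * q.z
    let c0: [Double] = [1.0 - 2.0 * (yy + zz), 2.0 * (xy + wz), 2.0 * (xz - wy)]
    let c1: [Double] = [2.0 * (xy - wz), 1.0 - 2.0 * (xx + zz), 2.0 * (yz + wx)]
    let c2: [Double] = [2.0 * (xz + wy), 2.0 * (yz - wx), 1.0 - 2.0 * (xx + yy)]
    return [c0, c1, c2]
}

// Quarter turn about the z axis: q = (0, 0, sin 45°, cos 45°)
let h = (0.5 as Double).squareRoot()
let columns = rotationColumns(Quat(x: 0, y: 0, z: h, w: h))
print(columns[0]) // approximately [0, 1, 0]: the x axis is rotated onto the y axis
```

Each column is the image of one coordinate axis, which is why the conversion in `toMatrix4()` can apply the scale by multiplying each column by the matching scale component.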

Growing shader complexity

Of the two reasons stated earlier, the second barely holds true anymore. Because of more complex shading and effects pipelines, we often want to split the Projection matrix from the View matrix, so we can compute the view normals, and the Projection-View matrix from the World matrix, so we can obtain the coordinates of the vertices in World space.

The Projection and View matrices are only set once per camera or viewport, and the World matrix will be set per object or instance being drawn. The vertex shader will look like this,

float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix * instance.worldMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(v.position, 1.0);

If we were to send Transforms instead of 4×4 matrices, we could save at least 4 floats per instance. Memory is usually more precious these days than ALU time, but how much slower would it be if we used Transforms in the GPU? The vertex shader will need to do some extra operations,

Transform t = perInstanceUniforms[iid];
float4x4 m = uniforms.projectionMatrix * uniforms.viewMatrix;
TexturedVertex v = vertexData[vid];
outVertex.position = m * float4(t * v.position, 1.0);

The following code is the implementation of the Transform struct using Metal (for an introduction to Metal, check this previous blog post).

/// Quaternion Inverse
float4 quatInv(const float4 q) {
 // assume it's a unit quaternion, so just Conjugate
 return float4( -q.xyz, q.w );
}
/// Quaternion multiplication (Hamilton product)
float4 quatDot(const float4 q1, const float4 q2) {
 float scalar = q1.w * q2.w - dot(q1.xyz, q2.xyz);
 float3 v = cross(q1.xyz, q2.xyz) + q1.w * q2.xyz + q2.w * q1.xyz;
 return float4(v, scalar);
}
/// Apply unit quaternion to vector (rotate vector): q * v * q^-1
float3 quatMul(const float4 q, const float3 v) {
 float4 r = quatDot(q, quatDot(float4(v, 0), quatInv(q)));
 return r.xyz;
}
// The helpers above are declared first so that Transform can call quatMul
struct Transform {
 // for alignment reasons, position and scale are float4
 float4 position; // only xyz actually used
 float4 scale;    // only xyz actually used
 float4 rotation; // unit quaternion; w is the scalar
 // scale first, then rotate, then translate
 float3 operator* (const float3 v) const {
   return position.xyz + quatMul(rotation, v * scale.xyz);
 }
};
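
To convince ourselves that the sandwich product used by quatMul really rotates a vector, the same three helpers can be mirrored on the CPU. The following is a minimal sketch in plain Swift, not code from the demo: the `Vec3` and `Quat` types and the small vector helpers are assumptions made so that the example runs without the simd module.

```swift
// Plain-Swift mirror of the shader helpers; w is the scalar part of the quaternion.
struct Vec3 { var x, y, z: Double }
struct Quat { var v: Vec3; var w: Double }

func dot(_ a: Vec3, _ b: Vec3) -> Double { a.x * b.x + a.y * b.y + a.z * b.z }
func cross(_ a: Vec3, _ b: Vec3) -> Vec3 {
    Vec3(x: a.y * b.z - a.z * b.y, y: a.z * b.x - a.x * b.z, z: a.x * b.y - a.y * b.x)
}
func add(_ a: Vec3, _ b: Vec3) -> Vec3 { Vec3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
func scale(_ a: Vec3, _ s: Double) -> Vec3 { Vec3(x: a.x * s, y: a.y * s, z: a.z * s) }

/// Conjugate, which equals the inverse for a unit quaternion (quatInv in the shader).
func quatInv(_ q: Quat) -> Quat { Quat(v: scale(q.v, -1), w: q.w) }

/// Hamilton product (called quatDot in the shader).
func quatDot(_ a: Quat, _ b: Quat) -> Quat {
    let w = a.w * b.w - dot(a.v, b.v)
    let v = add(cross(a.v, b.v), add(scale(b.v, a.w), scale(a.v, b.w)))
    return Quat(v: v, w: w)
}

/// Rotates v by the unit quaternion q using q * (v, 0) * q^-1 (quatMul in the shader).
func quatMul(_ q: Quat, _ v: Vec3) -> Vec3 {
    quatDot(q, quatDot(Quat(v: v, w: 0), quatInv(q))).v
}

// Quarter turn about the z axis applied to the x axis
let h = (0.5 as Double).squareRoot()
let quarterTurnZ = Quat(v: Vec3(x: 0, y: 0, z: h), w: h)
let rotated = quatMul(quarterTurnZ, Vec3(x: 1, y: 0, z: 0))
print(rotated) // approximately (0, 1, 0)
```

The two nested products are the extra ALU work the vertex shader takes on in exchange for the smaller per-instance data.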

Let’s see if this is any slower than matrices with an example.

Rotating cubes demo

I’ve created this demo of rotating cubes to measure the performance of using quaternions in a modern, but not high-end, GPU. I’ll be testing Apple’s A8 chip on an iPhone6.

The application spawns 240 cubes and draws them with a single draw call using instancing. Instancing allows us to reuse the same vertex buffer, and just use a different Transform for each instance. This way, the performance comparison will be simpler because we only need to analyze one draw call, instead of 240!

The CPU updates the rotation of each cube at random times, so the performance in the CPU won’t be constant per frame, but it should be almost constant in the GPU (there will be some slight differences in fill rate, depending on the amount of area covered by the cubes as they rotate, but I placed them close so it’s always very dense).

The code of the demo can be found here:

Performance comparison in the GPU

Both versions run at 60fps on an iPhone6. This is a frame capture of the version that uses matrices,

Metal Framecapture instanced cubes with matrices

The draw call in both cases takes 2.32 ms, of which 2 ms is taken by the fragment shader. As suspected, the fill rate is the bottleneck and it looks like the quaternions haven’t introduced any extra load to the ALU in this example.

For a proper comparison, we need to make this example vertex-bound, so I’ve prepared another example with spheres instead of cubes,

The tessellation level can be increased at compile time. In the video, there are only a few hundred vertices per sphere, so both matrices and quaternions still run at 60fps. But in the commits below, each sphere has 2562 vertices. That’s a total of around 600K vertices on screen, while for the cubes we only had 6K vertices.

The frame rate drops to 20 fps when using quaternions, and to 12 fps when using matrices. Surprise! Here’s a frame capture of the version that uses matrices,

Metal frame capture instanced spheres with matrices

The vertex shader takes 46.10 ms with quaternions, and 82.28 ms when using matrices. Matrices turned out to be 80% slower here.
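
As a quick sanity check on these figures, the relative slowdown and the frame rate implied by the vertex shader time alone can be computed directly. This is plain arithmetic on the two timings quoted above; the variable names are illustrative only.

```swift
let quaternionMs = 46.10
let matrixMs = 82.28

// Relative cost of the matrix version compared with the quaternion version
let slowdownPercent = (matrixMs / quaternionMs - 1) * 100
print(slowdownPercent.rounded())            // 78.0, i.e. roughly 80% slower

// Upper bound on the frame rate if the vertex stage were the only cost
let quaternionFps = (1000 / quaternionMs).rounded()
let matrixFps = (1000 / matrixMs).rounded()
print(quaternionFps, matrixFps)             // 22.0 12.0
```

These bounds line up with the measured 20 fps and 12 fps, which suggests the vertex stage dominates the frame in the sphere example.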

Because GPUs are becoming more general purpose, it could be that matrices have no real advantage anymore, since the number of multiplications and additions is actually greater. Another possible reason for such a big difference could be that by reducing the memory footprint (we are sending one less float4 per object), we managed to increase the cache coherence. Every GPU will behave slightly differently, so it’s better to do an empirical test like this to check the real behaviour of your code.

Performance comparison in the CPU

Let’s go back to the cubes and check now what’s going on in the CPU. I took a performance capture of both versions using Instruments. Here’s a capture of the most expensive functions in the version that needs to convert the quaternions back into matrices,

Metal draw cost with matrices

The updateBuffers function takes 5.4% of the CPU time, mostly taken in converting the Transforms into matrices. It’s not a lot, but we only have 240 objects. Here’s the cost using quaternions all the way through,

Metal draw cost with quaternions

As expected, the cost almost disappeared, and the updateBuffers function now only takes 0.3% of the CPU time. The drawing cost is just the cost of the API issuing the commands,

Metal draw cost with quaternions

Extra thoughts on performance

More often than not we worry about small details in performance such as this difference between matrices and quaternions, while the big bottlenecks tend to be somewhere else. For this experiment, for instance, I’ve used instancing to create a single draw call to draw all the cubes. But the first version of the examples had no instancing. You can find the code of the first version here,

Both versions still run at 60fps, but we are now issuing 240 draw calls, one per cube. While the CPU was around 20% usage in the instanced version of the quaternions, the non-instanced version runs at 90% CPU usage! The extra cost is basically the cost of issuing the drawing commands. So instancing was actually the biggest win in this experiment 😉

Note that we could do some extra memory optimization in matrices, if we just send the first 3 rows, enough to represent an affine transformation (not for projections). This is a common optimization and shader languages have support for operations with float3x4 matrices because of this. But if we are talking about just rotations, it is still more memory-efficient to just send a quaternion, which is a float4, instead of a float3x3 matrix (which, for memory alignment reasons, sometimes becomes a float3x4).
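
To put the memory argument in numbers, here is a back-of-the-envelope comparison of the per-instance footprints discussed in this post. It is only a sketch: it assumes 4-byte floats and that every vector is padded to a float4, as in the Metal Transform struct shown earlier, and the constant names are made up for the example.

```swift
let floatBytes = 4

// Per-instance footprints, assuming 4-byte floats and float4 (16-byte) padding
let float4x4Bytes = 16 * floatBytes          // full matrix
let float3x4Bytes = 12 * floatBytes          // affine transform stored as 3 rows of 4
let transformBytes = 3 * 4 * floatBytes      // position, scale and rotation as float4
let quaternionBytes = 4 * floatBytes         // rotation only
let paddedFloat3x3Bytes = 12 * floatBytes    // rotation matrix padded to float3x4

print(float4x4Bytes, float3x4Bytes, transformBytes)  // 64 48 48
print(quaternionBytes, paddedFloat3x3Bytes)          // 16 48
```

Under these assumptions the padded Transform matches the float3x4 in size, and the clear saving comes when only a rotation is needed.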

On a smaller note, the view matrix can also be expressed as a Transform. By doing this we can completely get rid of the code that does the conversion to matrices. And the only matrix we will need to keep will be the Projection matrix.

Conclusion

Our initial preconception that matrices were better for the shader world was wrong. Using quaternions in the GPU is actually faster than matrices in a modern GPU like Apple’s A8 chip. The memory footprint will also get reduced and the chances of finding our data in the cache will increase.

Moreover, if we eliminate the quaternion-to-matrix conversions, not only will the code get simpler and tidier, but we’ll save several precious CPU cycles.

But to be absolutely sure that you are making the right choice, always test your hardware with examples like this, because hardware is constantly evolving!

Metal with Swift

Metal (not Metail) is a low-level API from Apple that covers both graphics and compute, the roles played by OpenGL and OpenCL, in a single interface. Apple’s main purpose in introducing its own API was to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan, or Microsoft’s DX12, but specifically targeted at Apple hardware.

Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.

In this article I’m going to introduce Metal with a small example where all the data updates happen in the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully, it will spark your interest and you will dig into the references for extensive documentation 😉

Procedural rain example

I’ve written a small demo that should look like rain,

It draws and updates thousands of 2D lines at 60 fps on an iPhone6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2ms.

You can find all the code here: https://github.com/endavid/metaltest

Getting started

To get started with Metal you will need a Metal-ready device and Xcode. In Xcode, just create a new project and select

  • iOS Application: Game

  • Language: Swift

  • Game technology: Metal

This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t understand Metal. The triangle data in the example is triple-buffered, so you can update it in the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,

// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
    dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
    // updates in CPU cycles
    self.update()
    // register completion callback
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
        if let strongSelf = self {
            dispatch_semaphore_signal(strongSelf.inflightSemaphore)
        }
        return
    }
    // draw stuff
    // ...
    commandBuffer.commit()
}
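
The listing above uses the GCD function names from the Swift version available when this was written. As a sketch of the same throttling idea in current Swift, the example below uses `DispatchSemaphore` and replaces the GPU with a background queue. The `fakeGPU` queue, the `drawFrame` function and the frame count are hypothetical stand-ins so that the example runs without Metal.

```swift
import Dispatch

let maxFramesInFlight = 3 // plays the role of NumSyncBuffers
let inflightSemaphore = DispatchSemaphore(value: maxFramesInFlight)
let fakeGPU = DispatchQueue(label: "fake-gpu") // hypothetical stand-in for the GPU
let allFramesDone = DispatchGroup()

func drawFrame(_ index: Int) {
    // Blocks the CPU once three frames are already in flight
    inflightSemaphore.wait()
    allFramesDone.enter()
    fakeGPU.async {
        // ... the GPU would consume the buffer for this frame here ...
        inflightSemaphore.signal() // what the completion handler does
        allFramesDone.leave()
    }
}

for frame in 0..<10 { drawFrame(frame) }
allFramesDone.wait()
print("all frames completed")
```

The semaphore starts at three, so the CPU can run at most three frames ahead before `wait()` blocks until a completion callback signals.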

Some interesting Swift notes:

  • You can omit the parentheses when the last argument of the function you are calling is a closure (this is called a trailing closure). You can still pass a named function instead, as in ‘addCompletedHandler(myFunction)’.

  • The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the closure. Otherwise, we could have a cyclic reference and leak memory.

  • Because the reference is now weak, it basically becomes an optional (something that could be nil). The ‘if let x = optional’ is used to unwrap the optional when it’s not nil.
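
The three notes above can be tried out without Metal. The sketch below uses a made-up `FakeCommandBuffer` class in place of a real command buffer, and a `Renderer` class that registers a completion handler with a weak reference to itself. Both types are hypothetical and exist only for this illustration.

```swift
// Hypothetical stand-in for a command buffer that stores completion callbacks.
final class FakeCommandBuffer {
    private var handlers: [() -> Void] = []
    func addCompletedHandler(_ handler: @escaping () -> Void) { handlers.append(handler) }
    func complete() { handlers.forEach { $0() } }
}

final class Renderer {
    var framesCompleted = 0
    func encode(into buffer: FakeCommandBuffer) {
        // Trailing closure: no parentheses needed around the last argument
        buffer.addCompletedHandler { [weak self] in
            // self is an Optional here; unwrap it before use
            if let strongSelf = self {
                strongSelf.framesCompleted += 1
            }
        }
    }
}

let buffer = FakeCommandBuffer()
var renderer: Renderer? = Renderer()
weak var observer = renderer
renderer?.encode(into: buffer)
buffer.complete()
print(observer?.framesCompleted ?? -1) // 1

renderer = nil          // the closure holds no strong reference, so the renderer is freed
buffer.complete()       // the handler runs, finds self is nil, and does nothing
print(observer == nil)  // true
```

Had the closure captured `self` strongly, the buffer would have kept the renderer alive for as long as it stored the handler.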

Preparing Metal objects

These are the things you need to prepare in order to render something on screen:

  • Resources: data buffers and textures.

  • States: render pipeline state and depth-stencil state.

  • Descriptors: definitions that describe the objects above. This includes your shader code.

  • Render Command Encoder: the stuff that converts API commands into hardware commands.

  • Command Buffer: it’s where you store your commands that are eventually committed to the GPU.

  • Command Queue: where you queue an ordered list of command buffers.

I assume you are more or less familiar with how a typical graphics pipeline works, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing in the GPU.

I’ll explain the shader code later, but for now you just need to know that you can access your shader functions very easily using a shader library,

let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!

“updateRaindrops” is the name of the function in the shader code.

You can create a render state without a fragment program. Your vertex shader can be used to modify any arbitrary buffer, without the need of specifically creating a compute shader.

let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat

With that descriptor now we can create the state. Note that this is done only once,

do {
    try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
    try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
    print("Failed to create pipeline state, error \(error)")
}

Notice that in Swift, the “try” keyword is used for every expression that can throw an error. If we are happy with an optional value, we can remove the do-catch and use “try?”,

let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)
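The difference between the two forms can be seen without a GPU. The following is a minimal sketch in current Swift syntax; `PipelineError`, `FakePipelineState` and `makePipelineState` are hypothetical names standing in for the Metal call, not part of any Apple framework.

```swift
// Hypothetical error and factory, standing in for pipeline state creation.
enum PipelineError: Error {
    case missingVertexFunction
}

struct FakePipelineState {
    let label: String
}

func makePipelineState(label: String, hasVertexFunction: Bool) throws -> FakePipelineState {
    guard hasVertexFunction else { throw PipelineError.missingVertexFunction }
    return FakePipelineState(label: label)
}

// do-catch: the thrown error is available, so it can be logged or inspected.
var caughtError: Error? = nil
do {
    _ = try makePipelineState(label: "update", hasVertexFunction: false)
} catch {
    caughtError = error
    print("Failed to create pipeline state, error \(error)")
}

// try?: the error is discarded and the result becomes an optional.
let good = try? makePipelineState(label: "draw", hasVertexFunction: true)
let bad = try? makePipelineState(label: "update", hasVertexFunction: false)
```

With “try?” you lose the reason for the failure, so it suits cases where a missing value is all you need to know.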

Now we need a data buffer. Metal is designed for the unified memory system of the A7 chip, so both the CPU and the GPU can share the same storage. We will need to care about synchronization, but in this example the raindrops will be updated and read only on the GPU.

// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
            2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"

And now that you have everything ready, we can “draw stuff” in drawInMTKView,

// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
       currentDrawable = view.currentDrawable
{
    // setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
    // Check alignment in setVertexBuffer doc
    let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
    let uniformOffset = numberOfUniforms * sizeof(Float)
    let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
    renderEncoder.label = "render encoder"
      
    // The drawing phase is a simple shader that draws lines in 2D
    // DebugGroup labels are for debugging during frame capture.
    renderEncoder.pushDebugGroup("draw rain")
    renderEncoder.setRenderPipelineState(pipelineState)
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    renderEncoder.drawPrimitives(.Line, vertexStart: 0, 
            vertexCount: vertexCount, instanceCount: 1)
    renderEncoder.popDebugGroup()

    // update particles in the GPU            
    renderEncoder.pushDebugGroup("update raindrops")
    renderEncoder.setRenderPipelineState(updateState)
    // this is where we read the particles from
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    // this is where we write the updated particles 
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
    renderEncoder.setVertexBuffer(uniformBuffer,
            offset: uniformOffset * syncBufferIndex, atIndex: 2)
    // noiseTexture contains random numbers
    renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
    // every particle is treated as a point, but we aren't rendering anything on screen
    renderEncoder.drawPrimitives(.Point, vertexStart: 0, 
            vertexCount: particleCount, instanceCount: 1)
    renderEncoder.popDebugGroup()
    renderEncoder.endEncoding()
            
    commandBuffer.presentDrawable(currentDrawable)
}
    
// syncBufferIndex matches the current semaphore controlled frame index
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
    
commandBuffer.commit()

And that’s all! You don’t need to do anything else on the CPU 🙂
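The double-buffer arithmetic in the draw code can be checked on its own. This sketch only reproduces the offset calculations, assuming a `LineParticle` of two float4 values as in the shader; the raindrop count of 1024 and the helper `offsets(forIndex:)` are made up for illustration.

```swift
// Sizes mirror the post: a LineParticle is two float4 values.
let sizeOfLineParticle = MemoryLayout<Float>.size * 4 * 2
let maxNumberOfRaindrops = 1024  // hypothetical count, for illustration only

// One allocation holds both halves of the double buffer.
let bufferLength = 2 * maxNumberOfRaindrops * sizeOfLineParticle
let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle

// Byte offsets of the half the GPU reads and the half it writes.
func offsets(forIndex doubleBufferIndex: Int) -> (read: Int, write: Int) {
    let read = bufferOffset * doubleBufferIndex
    let write = bufferOffset * ((doubleBufferIndex + 1) % 2)
    return (read, write)
}

var doubleBufferIndex = 0
for frame in 0..<4 {
    let (read, write) = offsets(forIndex: doubleBufferIndex)
    print("frame \(frame): read at \(read), write at \(write)")
    // Flip halves so the next frame reads what this frame wrote.
    doubleBufferIndex = (doubleBufferIndex + 1) % 2
}
```

Each frame reads from the half the previous frame wrote to, so the update never overwrites particles it is still reading.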

Writing shader code

Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say bye to the runtime nightmares of OpenGL ES.

Let’s jump directly to the raindrop update function,

#include <metal_stdlib>
using namespace metal;
struct LineParticle
{
    float4 start;
    float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2

// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
                        constant LineParticle* particle  [[ buffer(0) ]],
                        device LineParticle* updatedParticle  [[ buffer(1) ]],
                        constant Uniforms& uniforms  [[ buffer(2) ]],
                        texture2d<float> noiseTexture [[ texture(0) ]])
{
    LineParticle outParticle;
    float4 velocity = float4(0, -0.01, 0, 0);
    outParticle.start = particle[vid].start + velocity;
    outParticle.end = particle[vid].end + velocity;
    if (outParticle.start.y < -1) {
       outParticle.end.y = 1;
       outParticle.start.y = outParticle.end.y + 0.1;
    }
    updatedParticle[vid] = outParticle;
};

I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the end of the screen. Check the full source for the full update, with some simple bouncing on the ground and obstacles, and resetting to a random position.
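For comparison, the same simplified update can be written as a plain loop on the CPU. This is a sketch and not code from the full source: it assumes Swift 5 or later for the standard library `SIMD4` type, and it mirrors the `LineParticle` struct and the constant velocity of the shader above.

```swift
// CPU mirror of the simplified shader, using the standard library SIMD4 type.
struct LineParticle {
    var start: SIMD4<Float>
    var end: SIMD4<Float>
}

func updateRaindrops(_ particles: [LineParticle]) -> [LineParticle] {
    let velocity = SIMD4<Float>(0, -0.01, 0, 0)
    return particles.map { particle in
        var out = LineParticle(start: particle.start + velocity,
                               end: particle.end + velocity)
        // Once the start point drops below -1 (the bottom of the screen),
        // move the particle back to the top, as the shader does.
        if out.start.y < -1 {
            out.end.y = 1
            out.start.y = out.end.y + 0.1
        }
        return out
    }
}

let falling = LineParticle(start: SIMD4<Float>(0, 0.5, 0, 1),
                           end: SIMD4<Float>(0, 0.4, 0, 1))
let leaving = LineParticle(start: SIMD4<Float>(0, -0.995, 0, 1),
                           end: SIMD4<Float>(0, -1.095, 0, 1))
let updated = updateRaindrops([falling, leaving])
print(updated[0].start.y, updated[1].start.y)
```

The shader does the same work with one vertex invocation per particle, reading from buffer 0 and writing to buffer 1.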

The “constant” and “device” keywords are address space qualifiers. “constant” refers to read-only buffer memory objects that are allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the device memory pool that are both readable and writeable.

Handling Metal errors

In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the OpenGL error status all the time just to get error messages, cluttering your code with those error queries all over the place. Plus, the error messages were usually hard to decipher.

This is an example error in Metal,

MTLPixelFormatRG16Unorm is compatible with texture data types type(s) (
    float
).'

I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)

Because in the shader I had: texture2d<half> noiseTexture [[ texture(0) ]]

The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.

Frame captures

The frame capture in Xcode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.

Let’s inspect a frame,

Frame Capture in Xcode

On the left side, you can see all the commands. There are only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact performance. The debug group labels are shown as folders, and you can see the timings for each one, or expand them to see the details. The particle update takes 183 microseconds. Surely faster than if we had linearly looped through the buffer and updated the particles on the CPU 😉

You can expand each command to see the call stack and jump to the CPU code.

You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.

You can re-write the shader code there, and click the “Update Shaders” icon to re-compile them and re-run the frame with the updated shaders.

It’s really powerful and easy to use.

Conclusion

If you are developing on iOS or macOS and into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL, and it outperforms OpenGL by removing redundant state changes and making definitions more static.

If you like graphics programming, but you never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax, really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.

Tomorrow, Tuesday 12th, we’re welcoming back the Cam AWS User Group for their 7th Meetup. This is the fourth user group meetup we’ve hosted and now we’re set to host the remaining three of the year. The meetup promises to be information-packed and is focusing on AWS Lambda, with two speakers talking about their experiences. There’s also a debrief on the recent AWS summit in London, and Danilo Poccia, a technical evangelist from AWS, is talking about data analytics.

The AWS London summit was on the 7th July and I went along with a colleague. Inevitably we bumped into some of the Cam AWS UG members and shared a DLR over to the ExCeL. Personally I found the Deep Dive on Amazon DynamoDB to be the most informative session, with a good bit of depth on how to write your schema, avoid hot keys and understand its internal partitioning. This is important for schema design and resolving certain bottlenecks. My most disappointing talk was the Deep Dive on Microservices and Amazon ECS, as this talk didn’t add to my knowledge, and I’ve only ever seen talks and demos of ECS without getting my hands dirty. My colleague attended the Deep Dive on EC2 Instances and it sounded like I’d have gotten much from that talk. I’m sure that others went to interesting (and disappointing) sessions and I would like to know what they got out of them.

AWS Lambda is one of AWS’ hot technologies and was released almost two years ago. We’ve started experimenting with it in Metail. I’m really keen to see how it’s being used and experimented with by others, as AWS Lambda’s use within Metail is certainly growing. I’ve had a fun little project writing a plugin for leiningen which allows you to manage AWS Lambda functions, with the aim of integrating it into our build process. Still, it’s nowhere near as functional as lambda Gordon, which I saw demonstrated at the most recent Snowplow London Meetup; it sounds like something to compare to Ben Taylor’s talk on Using Lambda and CloudFormation.

The final talk of the night is from Danilo Poccia. I’m particularly looking forward to asking questions at the end, as it’s the most relevant to my day-to-day job 🙂

We’re looking forward to seeing everyone tomorrow, doors open at 6:45pm and the talks are starting promptly at 7pm. We’ll be providing beer, soft drinks and snacks, be prompt to get your favourite beverage before the talks start 🙂

AWS Loft London

Back in October 2015 Metail hosted the 3rd Cambridge AWS User Group Meetup and in addition to Ian Massingham‘s review of AWS re:Invent 2015 I was given the opportunity to talk about our use of AWS for our big data processing pipeline. After this I was pleased to be invited to give an Elastic MapReduce (EMR) specific version of this talk at an AWS EMR master class. Roll on March and the AWS loft London with me on the agenda for the EMR Master Class session 🙂

After a busy week and some concentrated talk preparations I almost didn’t make it. I caught the train from Cambridge to Liverpool Street with the intention of walking from there to Old Street. Unfortunately there were problems with the power lines on the Liverpool Street line which led to everyone getting off at Harlow Town. After a taxi ride to Epping and a nervous ride into Liverpool Street on the Central line, I finally arrived only five minutes after the session started. This meant I missed my opportunity to introduce myself to Abhishek Sinha (the session leader) but after catching his eye during his talk I was back on the agenda 🙂

Elastic MapReduce Master Class

Abhishek gave a very interesting and well-presented guide to EMR and its best practices. As ever when I attend a talk by someone from AWS I learn plenty of new things and start re-evaluating our use of their tools. In this case, these were mainly around the use of spot instance task nodes and taking advantage of EMRFS.

The spot instance task nodes are nodes that only perform MapReduce tasks, having no HDFS storage, and come from the EC2 spot instance market. Using the spot instance market you can get the nodes at a lower price, but if you’re outbid you lose the node. Any compute tasks running when you lose the node fail, but Hadoop was built with this in mind and simply reschedules the task on another node. With no HDFS storage, no data re-replication need be done. It’s common to set a bid price of 100% of the on-demand cost: you still get the EC2 node at a lower bid price and at worst you pay the normal cost. Further, by picking nodes that are less commonly used, you are less likely to be outbid. For example, if you normally request two m3.2xlarge task nodes but on the spot market the m3.xlarge were less commonly used, then requesting four m3.xlarge task nodes would give you equivalent power but with a greater saving. This is an imaginary example, you can find out real data for spot market here.

The other feature of EMR we are not yet taking advantage of is EMRFS. AWS have decoupled the compute from storage by allowing EMR clusters to make very efficient use of S3. The main/only drawback here is that S3 has eventual consistency for overwrites and deletes of objects in the S3 file system. The EMR nodes are not aware of the delays and thus when one job takes as input the output of a previous one there is a chance of seeing an inconsistent view of the data. EMRFS uses a DynamoDB table to keep a record of the expected state of S3 and the EMRFS file system will retry if a request is made for an object that does not match the expected state. Currently we work around this limitation by having things set up in such a way that it isn’t a problem (more by luck than design ;)). Another common solution is to create two copies: one in the cluster’s HDFS file system and the other in S3. The copy in HDFS is lost when the cluster shuts down. We are currently redesigning our pipeline and it may become a greater problem in the next iteration so we’re keeping EMRFS in mind, noting that you do pay for the DynamoDB usage.

My First Big Data Application

As for my own talk, I think it was well received. I was asked some interesting questions at the end and I’m taking that as a good sign. After my talk and some lunch I stayed for the next session “My First Big Data Application” which was introduced as a modern big data pipeline. This was a great session where a pipeline was set up to collect, process and analyse web logs. This was strikingly similar to the pipeline I’d described in my talk, however theirs is indeed more modern 🙂 I think it’s interesting to compare the two pipelines and to contrast their different strengths and weaknesses.

Starting with my talk and the beginning of our pipeline, events are recorded by making GET requests for a Cloudfront-hosted pixel and Cloudfront logs all the requests to an S3 bucket. Here AWS do the hard work of distributing our pixel around the globe to ensure fast access to the user. They also batch up the request logs, writing them to the configured bucket after some time/size. We’ve never done any measurements but I believe the latency is typically less than an hour and we get logs of the order of 10MB in size, although they can be KB in size. For the demonstration Toby Knight (the speaker) set up an Apache web server on an EC2 node which saved its logs locally. He then used an AWS Kinesis collector to stream the logs in real time into the Kinesis Firehose which records the data in an S3 bucket. Here you can see the more modern event collector which is a real-time streaming system compared to our batch. For the following purposes it’s not really clear why Kinesis Firehose is better than our Cloudfront solution. I’m not sure how you scale out the Apache web server (fairly easily I imagine, it’s just not my area of expertise) but that’s work you’ll have to do yourself and when the second step is a batch system I’m not sure the latency matters. However, talking of latency, this is where Kinesis has potential that the Cloudfront solution clearly doesn’t. In Metail we don’t have any real-time monitor of our event stream (it’s never been a critical requirement) but with Kinesis you can connect to a topic and trigger some processing on each new event. This increased flexibility is clearly a win.

For the next step both we and Toby turned to EMR for a ‘model on read’ batch Extract, Transform and Load (ETL). We are using MapReduce in Clojure (Cascalog at the moment but switching over to Parkour) to read in our Cloudfront logs, validate the events and format them in a schema that can be loaded into Redshift. Here ‘model on read’ means that Cloudfront doesn’t enforce a schema on the data, it will quite happily write some quite corrupt events to file. It’s only if we try to format that event as, say, an order that we start requiring it to have certain properties. Toby’s talk used Spark to process the events, perhaps just as an opportunity to show EMR supports the latest cool MapReduce technology 😉 It does have some advantages over MapReduce and should be a lot faster than our ETL as Spark uses in-memory data structures. It’s written in Scala though (but there are Clojure bindings with Flambo or Sparkling). For the next step Metail is keeping up with the Joneses and we do the modern thing and copy the output of the EMR batch stage into Redshift. Redshift is a petabyte scale data warehouse where you use a PostgreSQL-like language to query your data. After some initial teething troubles we think our new schema will allow us to make much better use of Redshift’s strengths. We use a product called Looker to model the data in Redshift, produce dashboards for both internal and external use, gain insights into our data through dynamic queries and quite a few other things. For the talk they demonstrated the use of AWS QuickSight which is in a limited preview. Although it will compete with Looker (and similar tools like Tableau) it’s aiming to be less full featured and much cheaper, allowing companies to give everyone access to the data with only a few people using more expensive tools like Tableau. I suspect for us it would never replace Looker, it seemed like it wouldn’t have the client facing support we require, and our more powerful data analysis tools come largely from the open source Python and R community 🙂 Still I’m very excited about SPICE (Super-fast, Parallel, In-memory Calculation Engine) which gives each QuickSight user a local in-memory DB for very fast data modelling and exploration. This should be available to partners like Looker and Tableau next year.

And that’s it, after mentioning only a ‘few’ technologies I’ve raced through Metail’s big data pipeline and compared it to a more modern equivalent. For anyone looking to build their first pipeline I think it is worth looking at the streaming solution as that technology is advancing fast and windowing over the streams gives much more powerful batches. It’s something we’re planning to look into with Onyx for the next iteration.

/dev/summer 2016 is almost upon us. This is the latest in a series of bi-annual developer conferences organized by our friends at Software Acumen, and will be taking place at the Møller Centre, Churchill College, Cambridge next Saturday 25th June. It’s a low cost, high value software developer event covering DevOps, Mobile, Web, NoSQL, Cloud, Functional Programming, Startups and more.

Back in 2014 Jim Downing (Metail CTO) and I gave a hands-on session based on an extended version of TryClojure, where participants got to implement a Sudoku solver in Clojure. This year we’re back with another REPL-driven development session. We’ll be showing people how to write a chat bot in Clojure. We’ll cover Clojure as a REST client, build a simple web service using Ring and Compojure, deploy our application to Heroku, and configure a Slack command integration.

If that’s not your cup of tea, there are plenty of other sessions to choose from and the conference provides an excellent opportunity to meet and chat with experts in your field in a friendly and relaxed environment. Metail is sponsoring this year’s event – look out for our stall. Oh, and we’re hiring! If you’d like to make Clojure and ClojureScript part of your day job, or are interested in any of the other tech jobs we’re advertising in Cambridge, come along and talk to us.

Tickets for /dev/summer are on sale here.

For the last few years Metail have hosted a small number of internships in our tech team. These are often really great experiences, giving students new skills and the chance to focus on a challenging and fun project as part of one of our teams, and giving our teams scope to pursue some ideas that they might not otherwise find head space for.

Some of the things our interns have done in the last two years are:

  • Created a prototype to allow users to look around a garment in 3D using Google Glass or the Amazon Fire Phone
  • Contributed to a cutting edge 3D face and head research project
  • Created an automated test and performance system for our apps
  • Produced a new model of behavioural analytics based on Metail user browsing behaviour

Our interns are generally undergraduate students, but that’s not universally the case, and internships have led on to permanent careers too. This year, a number of teams are looking for interns, so there is a range of opportunities: front-end app development, middle-tier Clojure development, garment simulation, computer graphics programming, data science and computer vision / machine learning.

Would you like to know more?