Metal with Swift

Metal (not Metail) is a low-level API from Apple that covers both graphics and compute, the territory of OpenGL and OpenCL, in a single interface. The main purpose of introducing their own API was to reduce overhead and increase performance. Metal is similar to Khronos Group’s Vulkan or Microsoft’s DX12, but specifically targeted at Apple hardware.

Metal has been around since 2014, but now that Swift is more mature, I think it’s really easy to get started with Metal: you don’t need to be scared of pointers or of the overly verbose Objective-C syntax.

In this article I’m going to introduce Metal with a small example where all the data updates happen on the GPU. Instead of explaining Metal and Swift in detail, I’ll just write down a few notes following the example code. Hopefully it will spark your interest and you’ll dig into the references for more extensive documentation 😉

Procedural rain example

I’ve written a small demo that should look like rain,

It draws and updates thousands of 2D lines at 60 fps on an iPhone 6. In fact, drawing the lines takes only 2.4 ms, and the update takes less than 0.2 ms.

You can find all the code here: https://github.com/endavid/metaltest

Getting started

To get started with Metal you will need a Metal-ready device and Xcode. In Xcode, just create a new project and select

  • iOS Application: Game

  • Language: Swift

  • Game technology: Metal

This will create a simple template that draws a moving rectangle on screen. You will need to run this directly on your device, since the simulator doesn’t support Metal. The vertex data in the template is triple-buffered, so you can update it on the CPU while the GPU renders up to 3 frames before requiring a sync. Synchronization between the CPU and GPU is done like this,

// create semaphore
let inflightSemaphore = dispatch_semaphore_create(NumSyncBuffers)
// this is run per frame
func drawInMTKView(view: MTKView) {
    dispatch_semaphore_wait(inflightSemaphore, DISPATCH_TIME_FOREVER)
    // CPU-side updates happen here
    self.update()
    // register completion callback
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler{ [weak self] commandBuffer in
        if let strongSelf = self {
            dispatch_semaphore_signal(strongSelf.inflightSemaphore)
        }
        return
    }
    // draw stuff
    // ...
    commandBuffer.commit()
}

Some interesting Swift notes:

  • You can omit the parentheses when the last argument of the function you are calling is a closure (trailing closure syntax). You can still write ‘addCompletedHandler(myFunction)’ if you prefer.

  • The ‘weak’ keyword is used to avoid keeping a strong reference to ‘self’ inside the closure. Otherwise, we could create a retain cycle and leak memory.

  • Because the reference is now weak, ‘self’ becomes an optional inside the closure (something that could be nil). The ‘if let x = optional’ syntax unwraps the optional when it’s not nil (see the short sketch below).
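
Putting those three notes together, here is a tiny, self-contained sketch in plain Swift (nothing Metal-specific; the class and names are made up for illustration):

class RainController {
    let label = "rain demo"

    func runAsync(completion: () -> Void) {
        // stand-in for some asynchronous work
        completion()
    }

    func start() {
        // trailing closure: no parentheses around the last argument
        runAsync { [weak self] in
            // 'self' is now an optional; unwrap it only if it still exists
            if let strongSelf = self {
                print("finished \(strongSelf.label)")
            }
        }
    }
}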

Preparing Metal objects

These are the things you need to prepare in order to render something on screen:

  • Resources: data buffers and textures.

  • States: render pipeline state and depth-stencil state.

  • Descriptors: definitions that describe the objects above. This includes your shader code.

  • Render Command Encoder: the object that converts API commands into hardware commands.

  • Command Buffer: it’s where you store your commands that are eventually committed to the GPU.

  • Command Queue: where you queue an ordered list of command buffers.

I assume you are more or less familiar with how a typical graphics pipeline works, so in the example I’m going to focus on the physics update of the raindrops, which I’m performing on the GPU.

I’ll explain the shader code later, but for now you just need to know that you can access your shader functions very easily through a shader library,

let defaultLibrary = device.newDefaultLibrary()!
let updateRaindropProgram = defaultLibrary.newFunctionWithName("updateRaindrops")!

“updateRaindrops” is the name of the function in the shader code.

You can create a render state without a fragment program. Your vertex shader can then be used to modify any arbitrary buffer, without needing to create a compute shader.

let updateStateDescriptor = MTLRenderPipelineDescriptor()
updateStateDescriptor.vertexFunction = updateRaindropProgram
// vertex output is void
updateStateDescriptor.rasterizationEnabled = false
// pixel format needs to be set
updateStateDescriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat

With that descriptor we can now create the state. Note that this is done only once,

do {
    try pipelineState = device.newRenderPipelineStateWithDescriptor(pipelineStateDescriptor)
    try updateState = device.newRenderPipelineStateWithDescriptor(updateStateDescriptor)
} catch let error {
    print("Failed to create pipeline state, error \(error)")
}

Notice that in Swift the “try” keyword is required for every expression that can throw an error. If we are happy with an optional value, we can remove the do-catch and use “try?”,

let state = try? device.newRenderPipelineStateWithDescriptor(descriptor)
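
If we would rather fail fast when that optional comes back nil, a guard statement combines nicely with “try?”. This is just a sketch, reusing the same device and descriptor names as above:

guard let state = try? device.newRenderPipelineStateWithDescriptor(descriptor) else {
    fatalError("Failed to create pipeline state")
}
// 'state' is non-optional from here on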

Now we need a data buffer. Metal is designed for the A7 chip’s unified memory system, so both the CPU and the GPU can share the same storage. We still need to take care of synchronization, but in this example the raindrops are updated and read only on the GPU.

// member variable
var raindropDoubleBuffer: MTLBuffer! = nil
// ... on initialization:
raindropDoubleBuffer = device.newBufferWithLength(
            2 * maxNumberOfRaindrops * sizeOfLineParticle, options: [])
raindropDoubleBuffer.label = "raindrop buffer"

And now that you have everything ready, we can “draw stuff” in drawInMTKView,

// draw stuff
if let renderPassDescriptor = view.currentRenderPassDescriptor,
       currentDrawable = view.currentDrawable
{
    // setVertexBuffer offset: How far the data is from the start of the buffer, in bytes
    // Check alignment in setVertexBuffer doc
    let bufferOffset = maxNumberOfRaindrops * sizeOfLineParticle
    let uniformOffset = numberOfUniforms * sizeof(Float)
    let renderEncoder = commandBuffer.renderCommandEncoderWithDescriptor(renderPassDescriptor)
    renderEncoder.label = "render encoder"
      
    // The drawing phase is a simple shader that draws lines in 2D
    // DebugGroup labels are for debugging during frame capture.
    renderEncoder.pushDebugGroup("draw rain")
    renderEncoder.setRenderPipelineState(pipelineState)
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    renderEncoder.drawPrimitives(.Line, vertexStart: 0, 
            vertexCount: vertexCount, instanceCount: 1)
    renderEncoder.popDebugGroup()

    // update particles in the GPU            
    renderEncoder.pushDebugGroup("update raindrops")
    renderEncoder.setRenderPipelineState(updateState)
    // this is where we read the particles from
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*doubleBufferIndex, atIndex: 0)
    // this is where we write the updated particles 
    renderEncoder.setVertexBuffer(raindropDoubleBuffer, 
            offset: bufferOffset*((doubleBufferIndex+1)%2), atIndex: 1)
    renderEncoder.setVertexBuffer(uniformBuffer,
            offset: uniformOffset * syncBufferIndex, atIndex: 2)
    // noiseTexture contains random numbers
    renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)
    // every particle is treated as a point, but we aren't rendering anything on screen
    renderEncoder.drawPrimitives(.Point, vertexStart: 0, 
            vertexCount: particleCount, instanceCount: 1)
    renderEncoder.popDebugGroup()
    renderEncoder.endEncoding()
            
    commandBuffer.presentDrawable(currentDrawable)
}
    
// syncBufferIndex matches the current semaphore-controlled frame index 
// to ensure writing occurs at the correct region in the vertex buffer
syncBufferIndex = (syncBufferIndex + 1) % NumSyncBuffers
doubleBufferIndex = (doubleBufferIndex + 1) % 2
    
commandBuffer.commit()

And that’s all! You don’t need to do anything else on the CPU 🙂

Writing shader code

Metal shaders are written in a subset of C++11 with some special keywords to define attributes and hardware features. You can have multiple shaders in a single file, and that file gets compiled before you run your application, so say goodbye to the runtime shader-compilation nightmares of OpenGL ES.

Let’s jump directly to the raindrop update function,

#include <metal_stdlib>
using namespace metal;

struct LineParticle
{
    float4 start;
    float4 end;
}; // => sizeOfLineParticle = sizeof(Float) * 4 * 2

// can only write to a buffer if the output is set to void
vertex void updateRaindrops(uint vid [[ vertex_id ]],
                        constant LineParticle* particle  [[ buffer(0) ]],
                        device LineParticle* updatedParticle  [[ buffer(1) ]],
                        constant Uniforms& uniforms  [[ buffer(2) ]],
                        texture2d<float> noiseTexture [[ texture(0) ]])
{
    LineParticle outParticle;
    float4 velocity = float4(0, -0.01, 0, 0);
    outParticle.start = particle[vid].start + velocity;
    outParticle.end = particle[vid].end + velocity;
    if (outParticle.start.y < -1) {
       outParticle.end.y = 1;
       outParticle.start.y = outParticle.end.y + 0.1;
    }
    updatedParticle[vid] = outParticle;
};

I’ve simplified the example above, so I’m not using the uniform buffer or the noise texture. Instead, the particles are just updated with a constant velocity that points downwards, and their position is reset once they reach the bottom of the screen. Check the full source for the complete update, which adds some simple bouncing off the ground and obstacles, and resetting to a random position.

The “constant” and “device” keywords are address space qualifiers: “constant” refers to read-only buffer memory objects allocated from the device memory pool, while “device” refers to buffer memory objects allocated from the same pool that are both readable and writable.

Handling Metal errors

In Metal you’ll find that clear error messages are output to the console. In OpenGL ES you had to query the error status all the time just to get an error code, cluttering your code with error queries all over the place. Plus, the error messages were usually hard to decipher.

This is an example error in Metal,

MTLPixelFormatRG16Unorm is compatible with texture data types type(s) (
    float
).'

I got this after calling: renderEncoder.setVertexTexture(noiseTexture, atIndex: 0)

Because in the shader I had: texture2d<half> noiseTexture [[ texture(0) ]]

The noise texture pixel format is set to RG16Unorm and the error is telling me it doesn’t like “halfs”. So I just needed to change half to float to fix the issue.
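
For context, the noise texture itself is created on the Swift side with that RG16Unorm pixel format. Here is a rough sketch of how such a texture could be created and filled with random values, assuming an MTLDevice called device; the size and variable names are made up, and the actual demo may build its noise texture differently:

let noiseSize = 128
let descriptor = MTLTextureDescriptor.texture2DDescriptorWithPixelFormat(
        .RG16Unorm, width: noiseSize, height: noiseSize, mipmapped: false)
let noiseTexture = device.newTextureWithDescriptor(descriptor)
// RG16Unorm has 2 channels of 16 bits per texel; fill them with random values
var data = [UInt16](count: noiseSize * noiseSize * 2, repeatedValue: 0)
for i in 0..<data.count {
    data[i] = UInt16(arc4random_uniform(UInt32(UInt16.max)))
}
noiseTexture.replaceRegion(MTLRegionMake2D(0, 0, noiseSize, noiseSize),
        mipmapLevel: 0,
        withBytes: data,
        bytesPerRow: noiseSize * 2 * sizeof(UInt16))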

Frame captures

The frame capture tool in Xcode works as well with Metal as it does with OpenGL ES. You can see the performance of your shaders, see all the resources, change the shader code on the fly, jump to the Swift source code that originated a draw call, and much more. It’s one of the best tools of its kind that I’ve seen.

Let’s inspect a frame,

Frame Capture in Xcode

On the left side you can see all the commands. There are only a few! OpenGL ES programs tend to end up with lots of redundant state changes that negatively impact performance. The debug group labels are shown as folders, and you can see the timings for each one, or you can expand them and see the details. The particle update takes 183 microseconds, surely faster than if we had looped linearly through the buffer and updated the particles on the CPU 😉

You can expand each command to see the call stack and jump to the CPU code.

You can also inspect all the buffers, render state, and shaders. You can see the cost of each shader block as a percentage of the total. As expected, most of the cost is in the fragment shader. It’s just fill-rate.

You can re-write the shader code there and click the “Update Shaders” icon to re-compile the shaders and re-run the frame with the updated code.

It’s really powerful and easy to use.

Conclusion

If you are developing on iOS or macOS and are into graphics, I recommend you try Metal if you haven’t yet. The setup is more straightforward than OpenGL, and it outperforms OpenGL by removing redundant state changes and making definitions more static.

If you like graphics programming but have never tried native development on iOS or macOS, perhaps because you were scared of Objective-C, give Swift a try. It has a simple but powerful syntax that is really easy to learn. It’s also a compiled language, so if you were thinking of mixing C++ into Objective-C just to increase performance, forget about it and write everything in Swift.


Tomorrow, Tuesday 12th, we’re welcoming back the Cam AWS User Group for their 7th meetup. This is the fourth user group meetup we’ve hosted, and we’re now set to host the remaining three of the year. The meetup promises to be information-packed and focuses on AWS Lambda, with two speakers talking about their experiences. There’s also a debrief on the recent AWS Summit in London, and Danilo Poccia, a technical evangelist from AWS, is talking about data analytics.

The AWS London Summit was on the 7th July and I went along with a colleague. Inevitably we bumped into some of the Cam AWS UG members and shared a DLR ride over to the ExCeL. Personally I found the Deep Dive on Amazon DynamoDB to be the most informative session, with a good bit of depth on how to design your schema, avoid hot keys and understand its internal partitioning. This is important for schema design and resolving certain bottlenecks. My most disappointing talk was the Deep Dive on Microservices and Amazon ECS, as it didn’t add to my knowledge and I’ve only ever seen talks and demos of ECS, never getting my hands dirty. My colleague attended the Deep Dive on EC2 Instances and it sounded like I’d have gotten a lot from that talk. I’m sure that others went to interesting (and disappointing) sessions and I would like to know what they got out of them.

AWS Lambda is one of AWS’ hot technologies, released almost two years ago, and we’ve started experimenting with it at Metail. I’m really keen to see how it’s being used and experimented with by others, as AWS Lambda’s use within Metail is certainly growing. I’ve had a fun little project writing a plugin for Leiningen which allows you to manage AWS Lambda functions, with the aim of integrating it into our build process. Still, it’s nowhere near as functional as Gordon, which I saw demonstrated at the most recent Snowplow London Meetup; it sounds like something to compare with Ben Taylor’s talk on Using Lambda and CloudFormation.

The final talk of the night is from Danilo Poccia. I’m particularly looking forward to asking questions at the end, as it’s the most relevant to my day-to-day job 🙂

We’re looking forward to seeing everyone tomorrow. Doors open at 6:45pm and the talks start promptly at 7pm. We’ll be providing beer, soft drinks and snacks, so be prompt to get your favourite beverage before the talks start 🙂

AWS Loft London

Back in October 2015 Metail hosted the 3rd Cambridge AWS User Group Meetup, and in addition to Ian Massingham‘s review of AWS re:Invent 2015 I was given the opportunity to talk about our use of AWS for our big data processing pipeline. After this I was pleased to be invited to give an Elastic MapReduce (EMR) specific version of the talk at an AWS EMR master class. Roll on March and the AWS Loft London, with me on the agenda for the EMR Master Class session 🙂

After a busy week and some concentrated talk preparation I almost didn’t make it. I caught the train from Cambridge to Liverpool Street with the intention of walking from there to Old Street. Unfortunately there were problems with the power lines on the Liverpool Street line, which led to everyone getting off at Harlow Town. After a taxi ride to Epping and a nervous ride into Liverpool Street on the Central line, I finally arrived only five minutes after the session started. This meant I missed my opportunity to introduce myself to Abhishek Sinha (the session leader), but after catching his eye during his talk I was back on the agenda 🙂


Elastic MapReduce Master Class

Abhishek gave a very interesting and well-presented guide to EMR and its best practices. As ever when I attend a talk by someone from AWS I learn plenty of new things and start re-evaluating our use of their tools. In this case, these were mainly around the use of spot instance task nodes and taking advantage of EMRFS.

The spot instance task nodes are nodes that only perform MapReduce tasks, having no HDFS storage, and come from the EC2 spot instance market. Using the spot market you can get the nodes at a lower price, but if you’re outbid you lose the node. Any compute tasks running when you lose the node fail, but Hadoop was built with this in mind and simply reschedules the task on another node; with no HDFS storage, no data re-replication needs to be done. It’s common to set a bid price of 100% of the on-demand cost: you still get the EC2 node at the lower spot price, and at worst you pay the normal on-demand cost. Further, by picking instance types that are less commonly used, you are less likely to be outbid. For example, if you normally request two m3.2xlarge task nodes but m3.xlarge instances were less in demand on the spot market, then requesting four of the smaller task nodes would give you equivalent power with a greater saving. This is an imaginary example; you can find real data for the spot market here.

The other feature of EMR we are not yet taking advantage of is EMRFS. AWS have decoupled compute from storage by allowing EMR clusters to make very efficient use of S3. The main (and only) drawback here is that S3 has eventual consistency for overwrites and deletes of objects in the S3 file system. The EMR nodes are not aware of the delays, so when one job takes as input the output of a previous one there is a chance of seeing an inconsistent view of the data. EMRFS uses a DynamoDB table to keep a record of the expected state of S3, and the EMRFS file system will retry if a request is made for an object that does not match the expected state. Currently we work around this limitation by having things set up in such a way that it isn’t a problem (more by luck than design ;)). Another common solution is to keep two copies: one in the cluster’s HDFS file system and the other in S3; the copy in HDFS is lost when the cluster shuts down. We are currently redesigning our pipeline and consistency may become a bigger problem in the next iteration, so we’re keeping EMRFS in mind, noting that you do pay for the DynamoDB usage.

My First Big Data Application

As for my own talk, I think it was well received. I was asked some interesting questions at the end and I’m taking that as a good sign. After my talk and some lunch I stayed for the next session, “My First Big Data Application”, which was introduced as a modern big data pipeline. This was a great session where a pipeline was set up to collect, process and analyse web logs. It was strikingly similar to the pipeline I’d described in my talk; however, theirs is indeed more modern 🙂 I think it’s interesting to compare the two pipelines and to contrast their different strengths and weaknesses.

Starting with my talk and the beginning of our pipeline: events are recorded by making GET requests for a Cloudfront-hosted pixel, and Cloudfront logs all the requests to an S3 bucket. Here AWS do the hard work of distributing our pixel around the globe to ensure fast access for the user. They also batch up the request logs, writing them to the configured bucket after some time/size threshold. We’ve never done any measurements, but I believe the latency is typically less than an hour and we get logs on the order of 10MB in size, although they can be as small as a few KB. For the demonstration Toby Knight (the speaker) set up an Apache web server on an EC2 node which saved its logs locally. He then used an AWS Kinesis collector to stream the logs in real time into Kinesis Firehose, which records the data in an S3 bucket. Here you can see the more modern event collector, a real-time streaming system compared to our batch one. For the following stages it’s not really clear why Kinesis Firehose is better than our Cloudfront solution. I’m not sure how you scale out the Apache web server (fairly easily, I imagine; it’s just not my area of expertise), but that’s work you’ll have to do yourself, and when the second step is a batch system I’m not sure the latency matters. However, talking of latency, this is where Kinesis has potential the Cloudfront solution clearly doesn’t. At Metail we don’t have any real-time monitoring of our event stream (it’s never been a critical requirement), but with Kinesis you can connect to a topic and trigger some processing on each new event. This increased flexibility is clearly a win.

For the next step both we and Toby turned to EMR for a ‘model on read’ batch Event Transform and Load (ETL). We are using MapReduce in Clojure (Cascalog at the moment, but switching over to Parkour) to read in our Cloudfront logs, validate the events and format them in a schema that can be loaded into Redshift. Here ‘model on read’ means that Cloudfront doesn’t enforce a schema on the data; it will quite happily write some quite corrupt events to file. It’s only when we try to format an event as, say, an order that we start requiring it to have certain properties. Toby’s talk used Spark to process the events, perhaps just as an opportunity to show EMR supports the latest cool MapReduce technology 😉 It does have some advantages over MapReduce and should be a lot faster than our ETL, as Spark uses in-memory data structures; it’s written in Scala though (but there are Clojure bindings with Flambo or Sparkling). For the next step Metail is keeping up with the Joneses: we do the modern thing and copy the output of the EMR batch stage into Redshift. Redshift is a petabyte-scale data warehouse where you use a PostgreSQL-like language to query your data. After some initial teething troubles we think our new schema will allow us to make much better use of Redshift’s strengths. We use a product called Looker to model the data in Redshift, produce dashboards for both internal and external use, gain insights into our data through dynamic queries, and quite a few other things. For the talk they demonstrated the use of AWS QuickSight, which is in a limited preview. Although it will compete with Looker (and similar tools like Tableau), it’s aiming to be less full-featured and much cheaper, allowing companies to give everyone access to the data with only a few people using more expensive tools like Tableau. I suspect for us it would never replace Looker: it seemed like it wouldn’t have the client-facing support we require, and our more powerful data analysis tools come largely from the open source Python and R communities 🙂 Still, I’m very excited about SPICE (Super-fast, Parallel, In-memory Calculation Engine), which gives each QuickSight user a local in-memory DB for very fast data modelling and exploration. This should be available to partners like Looker and Tableau next year.

And that’s it: after mentioning only a ‘few’ technologies I’ve raced through Metail’s big data pipeline and compared it to a more modern equivalent. For anyone looking to build their first pipeline, I think it is worth looking at the streaming solution, as that technology is advancing fast and windowing over the streams gives much more powerful batches. It’s something we’re planning to look into with Onyx for the next iteration.

/dev/summer 2016 is almost upon us. This is the latest in a series of bi-annual developer conferences organized by our friends at Software Acumen, and will be taking place at the Møller Centre, Churchill College, Cambridge next Saturday 25th June. It’s a low cost, high value software developer event covering DevOps, Mobile, Web, NoSQL, Cloud, Functional Programming, Startups and more.

Back in 2014 Jim Downing (Metail CTO) and I gave a hands-on session based on an extended version of TryClojure, where participants got to implement a Sudoku solver in Clojure. This year we’re back with another REPL-driven development session. We’ll be showing people how to write a chat bot in Clojure. We’ll cover Clojure as a REST client, build a simple web service using Ring and Compojure, deploy our application to Heroku, and configure a Slack command integration.

If that’s not your cup of tea, there are plenty of other sessions to choose from and the conference provides an excellent opportunity to meet and chat with experts in your field in a friendly and relaxed environment. Metail is sponsoring this year’s event – look out for our stall. Oh, and we’re hiring! If you’d like to make Clojure and ClojureScript part of your day job, or are interested in any of the other tech jobs we’re advertising in Cambridge, come along and talk to us.

Tickets for /dev/summer are on sale here.

For the last few years Metail have hosted a small number of internships in our tech team. These are often really great experiences, giving students new skills and the chance to focus on a challenging and fun project as part of one of our teams, and giving our teams scope to pursue some ideas that they might not otherwise find the head space for.

Some of the things our interns have done in the last two years are:

  • Created a prototype to allow users to look around a garment in 3D using Google Glass or the Amazon Fire Phone
  • Contributed to a cutting edge 3D face and head research project
  • Created an automated test and performance system for our apps
  • Produced a new model of behavioural analytics based on Metail user browsing behaviour

Our interns are generally undergraduate students, but that’s not universally the case, and they have led on to permanent careers too. This year, a number of teams are looking for interns, so there are a range of opportunities: front-end app development, middle-tier Clojure development, garment simulation, computer graphics programming, data science and computer vision / machine learning.

Would you like to know more? 

We enjoyed hosting the first Cambridge AWS User Group meetup of the year in February (but were so busy we forgot to blog about it), so we’re excited to welcome everyone back to our town hall for the 3rd meetup of the year. The details of the next meetup are still being decided, but there are always great opportunities to learn about AWS and contribute your own knowledge back to the group. Come along: we’ll have snacks, soft and hard drinks, and most importantly questions asked and knowledge shared! While the details are getting settled, here’s a little write-up, finally, of February’s meetup.

Main Meeting #6: details TBA

Thursday, Jun 9, 2016, 7:00 PM

Metail Cambridge Offices, 50 St. Andrew’s Street, CB2 3AH Cambridge

Breaking the February Silence

Last year, at the first of the main meetups, Tom Clark from Ocean Array Systems presented their idea for a data processing solution (the ‘Zero to Hero’ section of the meetup). They were looking for advice on how best to leverage AWS’ services, given they had ‘zero’ current knowledge. Now, seven months later, he was back to present on their progress. The solution was very similar to what was sketched out back in July. He didn’t have any slides, and rather than moving to a room with a whiteboard he just sketched things out on the glass and a big sheet of paper 🙂


Tom had five points he’d taken away from the last seven months of development although I’ve merged two and added one from the list I originally captured when listening to him:

  1. It’s worth thinking about costs, particularly for the prototyping to beta/first-few-clients stage. The different systems scale differently in price. For example, they ended up making heavy use of Lambda, whereas with hindsight t2.nano and t2.micro instances are very cheap, can handle reasonable scale for tasks suitable for Lambda functions, and have a large number of free hours associated with them. In the end Lambda may scale to infinity more cheaply once you bring in the cost of engineers to maintain large numbers of EC2 nodes, but in the early days it might be cheaper (your time included) to manage your own EC2 instance;
  2. Your choice of database is a sensitive aspect of your commercial model. They went with the room’s advice and used DynamoDB, but a knowledgeable source highlighted that it quickly gets expensive. They’d have saved some money scaling with a cheaper solution;
  3. What is the commercial value of your data? For them it was important to get a sane API in front of the data as soon as possible. One of their datasets is collected from 3rd parties and they realised 1) it’s tedious, and 2) no one else has done this. By providing a decent API to this data (which they need internally anyway) they have the opportunity to resell access (sounds like how AWS got started);
  4. Figure out how to create isolated software projects! This is a lesson from immutable infrastructure and fairly common advice these days, which means there’s no harm in repeating it 😉 Clojure does this well with Leiningen, and I’ve had limited but positive experience with Python’s virtualenv and limited, painful experience with RVM;
  5. You need a dedicated person to pull together the AWS services. I believe their system has been developed by three people: two experts in Matlab and machine learning, and one who could focus on AWS. This could be a good role for an intern or junior, as generally the learning curve for AWS services isn’t that steep.

The second talk was given by Jon Green, who went through the AWS re:Invent 2015 talk on the IoT service. I’m barely a beginner in the IoT field, but the solution does seem complete, and it looks like AWS have solved a lot of problems; how many of these were already solved I couldn’t say. Given the breadth of the solution it’s not easy to summarise, but there are two interesting points that have stuck in my mind:

Firstly, it’s currently hard for enthusiasts to have a play, even more so in the UK than in the US. Given the power of AWS’ free tier to have developers create toy projects for little cost and then take those solutions over to their companies, where they start bringing in revenue, this seems a missed opportunity. Having said that, I’ve had a quick browse at https://aws.amazon.com/iot/getting-started/ and at least one of the partner hardware kits is available if you switch the link from amazon.com to amazon.co.uk. Stock availability was something Jon highlighted, so get it while stocks last 😉

The second point is a bit more abstract. IoT provides tools for secure identification, registration and state management of your devices. This is the most hardware-centric part of the system; the rest is about event processing, and a lot of that looks like a streaming service. Jon made a very thought-provoking point that even the most hardware-related parts need not necessarily refer to a physical device, so your whole IoT solution could be virtual. The IoT system has a lot of components just waiting for someone with some knowledge and imagination 🙂

Main Meeting #4: Happy New Year!

Tuesday, Feb 2, 2016, 7:00 PM

Metail Cambridge Offices, 50 St. Andrew’s Street, CB2 3AH Cambridge

Hope to see you all on Thursday 9th June! The talk line-up is still being finalised, but we do know that Matt McDonnell from the Metail Data Science team will be starting things off with a presentation on ‘Deploying Data Science with Docker and AWS’.

Data Actually

David Robinson posted a great article Analyzing networks of characters in ‘Love Actually’ on 25th December 2015, which uses R to analyse the connections between the characters in the film Love Actually.

This Jupyter notebook and the associated python code attempts to reproduce his analysis using tools from the Python ecosystem.

I tweeted a link to an initial version of this notebook over the Christmas holidays and republished it here in response to interest from my colleagues in Metail Tech.

Package setup

To start we need to import some useful packages, shown below. Some of these needed to be installed using the Anaconda distribution or pip if conda failed:

  • pip install ggplot
  • conda install graphviz
  • conda install networkx
  • conda install pyplot
In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage
import ggplot as gg
import networkx as nx
In [2]:
%matplotlib inline

Data import

First we need to define a couple of functions to read the ‘Love Actually’ script into a list of lines and the cast into a Pandas DataFrame. The data_dir variable gets the location of the directory containing the input files from an environment variable so set this environment variable to the location appropriate for you if you’re following along.

In [3]:
data_dir = os.path.join(os.getenv('MDA_DATA_DIR', '/home/mattmcd/Work/Data'), 'LoveActually') 
# Alternative: data_dir = os.getcwd() # And copy files to same directory as script

def read_script():
    """Read the Love Actually script from text file into list of lines
    The script is first Google hit for 'Love Actually script' as a doc
    file.  Use catdoc or Libre Office to save to text format.
    """
    with open(os.path.join(data_dir, 'love_actually.txt'), 'r') as f:
        lines = [line.strip() for line in f]
    return lines

def read_actors():
    """Read the mapping from character to actor using the varianceexplained data file
    Used curl -O http://varianceexplained.org/files/love_actually_cast.csv to get a local copy
    """
    return pd.read_csv(os.path.join(data_dir, 'love_actually_cast.csv'))

The cell below reproduces the logic in the first cell of the original article. It doesn’t feel quite as nice to me as the dplyr syntax but is not too bad.

In [4]:
def parse_script(raw):
    df = pd.DataFrame(raw, columns=['raw'])

    df = df.query('raw != ""')
    # exclude lines marked as songs (escape the parens so they match literally)
    df = df[~df.raw.str.contains(r"\(song\)")]
    lines = (df.
             assign(is_scene=lambda d: d.raw.str.contains(" Scene ")).
             assign(scene=lambda d: d.is_scene.cumsum()).
             query('not is_scene'))
    speakers = lines.raw.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
    lines = (pd.concat([lines, speakers], axis=1).
             dropna().
             assign(line=lambda d: np.cumsum(~d.speaker.isnull())))

    lines.drop(['raw', 'is_scene'], axis=1, inplace=True)

    return lines
In [5]:
def read_all():
    lines = parse_script(read_script())
    cast = read_actors()
    combined = lines.merge(cast).sort('line').assign(
        character=lambda d: d.speaker + ' (' + d.actor + ')').reindex()
    # Decode bytes to unicode
    combined['character'] = map(lambda s: s.decode('utf-8'), combined['character'])
    return combined
In [6]:
# Read in script and cast into a dataframe
lines = read_all()
# Print the first few rows
lines.head(5)
Out[6]:
scene speaker dialogue line actor character
0 2 Billy ♪ I feel it in my fingers ♪ I feel it in my t… 2 Bill Nighy Billy (Bill Nighy)
34 2 Joe I’m afraid you did it again, Bill. 3 Gregor Fisher Joe (Gregor Fisher)
1 2 Billy It’s just I know the old version so well, you… 4 Bill Nighy Billy (Bill Nighy)
35 2 Joe Well, we all do. That’s why we’re making the … 5 Gregor Fisher Joe (Gregor Fisher)
2 2 Billy Right, OK, let’s go. ♪ I feel it in my finger… 6 Bill Nighy Billy (Bill Nighy)

Constructing the n_character x n_scene matrix showing how many lines each character has in each scene is quite easy using pandas groupby method to create a hierarchical index, followed by the unstack method to convert the second level of the index into columns.

In [7]:
def get_scene_speaker_matrix(lines):
    by_speaker_scene = lines.groupby(['character', 'scene'])['line'].count()
    speaker_scene_matrix = by_speaker_scene.unstack().fillna(0)
    return by_speaker_scene, speaker_scene_matrix
In [8]:
# Group by speaker and scene and construct the speaker-scene matrix
by_speaker_scene, speaker_scene_matrix = get_scene_speaker_matrix(lines)

Analysis

Now we get to the analysis itself. First we perform a hierarchical clustering of the data using the same data normalization and clustering method as the original article. The leaf order in the dendrogram is returned for use in later steps of the analysis, as it has similar characters close to each other.

In [9]:
def plot_dendrogram(mat, normalize=True):
    # Cluster and plot dendrogram.  Return order after clustering.
    if normalize:
        # Normalize by number of lines
        mat = mat.div(mat.sum(axis=1), axis=0)
    Z = linkage(mat, method='complete', metric='cityblock')
    labels = mat.index
    f = plt.figure()
    ax = f.add_subplot(111)
    R = dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
               labels=labels, ax=ax, color_threshold=-1)
    f.tight_layout()
    ordering = R['ivl']
    return ordering
In [10]:
# Hierarchical cluster and return order of leaves
ordering = plot_dendrogram(speaker_scene_matrix)
da_dendrogram
In [11]:
print(ordering)
[u'Peter (Chiwetel Ejiofor)', u'Juliet (Keira Knightley)', u'Mark (Andrew Lincoln)', u'Harry (Alan Rickman)', u'Mia (Heike Makatsch)', u'Karl (Rodrigo Santoro)', u'Sarah (Laura Linney)', u'Karen (Emma Thompson)', u'Natalie (Martine McCutcheon)', u'PM (Hugh Grant)', u'Daniel (Liam Neeson)', u'Sam (Thomas Sangster)', u'Colin (Kris Marshall)', u'Tony (Abdul Salis)', u'Billy (Bill Nighy)', u'Joe (Gregor Fisher)', u'Jack (Martin Freeman)', u'Judy (Joanna Page)', u'Aurelia (L\xfacia Moniz)', u'Jamie (Colin Firth)']

Timeline

Plotting the timeline of character versus scene is very similar to the R version since we make use of the Python ggplot port from yhat. It seems like there are a few differences between this package and the R ggplot2 library.

In particular:

  • ggplot does not seem to handle categorical variables so it was necessary to introduce an extra character_code dimension
  • it didn’t seem possible to change the y-axis tick labels to character names so here the axis directions are swapped
  • the aes (aesthetic) does not seem to support ‘group’ so the geom_path joining characters in the same scene has been left out

(note that these points may just be limitations of my understanding of ggplot in Python)

In [12]:
def get_scenes_with_multiple_characters(by_speaker_scene):
    # Filter speaker scene dataframe to remove scenes with only one speaker

    # n_scene x 1 Series with index 'scene'
    filt = by_speaker_scene.count('scene') > 1
    # n_scene x n_character Index
    scene_index = by_speaker_scene.index.get_level_values('scene')
    # n_scene x n_character boolean vector
    ind = filt[scene_index].values
    return by_speaker_scene[ind]
In [13]:
def order_scenes(scenes, ordering=None):
    # Order scenes by e.g. leaf order after hierarchical clustering
    scenes = scenes.reset_index()
    scenes['scene'] = scenes['scene'].astype('category')
    scenes['character'] = scenes['character'].astype('category', categories=ordering)
    scenes['character_code'] = scenes['character'].cat.codes
    return scenes
In [14]:
# Order the scenes by cluster leaves order
scenes = order_scenes(get_scenes_with_multiple_characters(by_speaker_scene), ordering)
In [15]:
def plot_timeline(scenes):
    # Plot character vs scene timelime
    # NB: due to limitations in Python ggplot we need to plot with scene on y-axis
    # in order to label x-ticks by character.
    # scale_x_continuous and scale_y_continuous behave slightly differently.

    print (gg.ggplot(gg.aes(y='scene', x='character_code'), data=scenes) +
            gg.geom_point() + gg.labs(x='Character', y='Scene') +
           gg.scale_x_continuous(
               labels=scenes['character'].cat.categories.values.tolist(),
           breaks=range(len(scenes['character'].cat.categories))) +
           gg.theme(axis_text_x=gg.element_text(angle=30, hjust=1, size=10)))
In [16]:
# Plot a timeline of characters vs scene
plot_timeline(scenes);
da_characters

Co-occurrence matrix

Next we construct the co-occurrence matrix showing how often characters share scenes, and visualize using a heatmap and network graph.

In [17]:
def get_cooccurrence_matrix(speaker_scene_matrix, ordering=None):
    # Co-occurrence matrix for the characters, ignoring last scene where all are present
    scene_ind = speaker_scene_matrix.astype(bool).sum() < 10
    if ordering:
        mat = speaker_scene_matrix.loc[ordering, scene_ind]
    else:
        mat = speaker_scene_matrix.loc[:, scene_ind]
    return mat.dot(mat.T)
In [18]:
cooccur_mat = get_cooccurrence_matrix(speaker_scene_matrix, ordering)

The heatmap below is not as nice as the default R heatmap as it is missing the dendrograms on each axis and also the character names, so could be extended e.g. following Hierarchical Clustering Heatmaps in Python.

It otherwise shows a similar result to the original article in that ordering using the dendrogram leaf order has resulted in a co-occurrence matrix predominantly of block diagonal form.

In [19]:
def plot_heatmap(cooccur_mat):
    # Plot co-occurrence matrix as heatmap
    plt.figure()
    plt.pcolor(cooccur_mat)
In [20]:
# Plot heatmap of co-occurrence matrix
plot_heatmap(cooccur_mat)
da_connectivity

The network plot gives similar results to the original article. This could be extended, for example by adding weights to the graph edges.

In [21]:
def plot_network(cooccur_mat):
    # Plot co-occurrence matrix as network diagram
    G = nx.Graph(cooccur_mat.values)
    pos = nx.graphviz_layout(G)  # NB: needs pydot installed
    plt.figure(num=None, figsize=(15, 15), dpi=80)
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
    nx.draw_networkx_edges(G, pos)
    nx.draw_networkx_labels(
        G, pos,
        labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
        font_size=10)
    plt.axis('off')
    plt.show()
In [22]:
# Plot network graph of co-occurrence matrix
plot_network(cooccur_mat)
da_network

Conclusion

This notebook attempted to reproduce in Python the R analysis and visualization of the character network in ‘Love Actually’. Overall this was a useful exercise in learning the similarities and differences between the tools, as well as becoming more familiar with the Pandas syntax. The output currently doesn’t look quite as nice as the R original, but it is a good starting point for future tinkering.

In the Visualization Team, we’ve recently started using ReSharper C++ 10.0.2 in Microsoft Visual Studio to assist with writing C++ code. We also have some home-grown (and predictably ugly) C++ preprocessor macros, specific to the Microsoft compiler, that help with various code quality issues such as automatic coverage reports, execution profiling, unit test generators and so on. The problem is that when ReSharper parses the C++ source code, it does so with a slightly different “interpretation” of the C++ preprocessor specification than Microsoft’s compiler and the IntelliSense parser. I’m not going to get involved in any argument as to which is “most correct”; let’s just say that they are different. Different enough that ReSharper complains about our macros, even though the compiler thinks they’re fine. If we could detect when the ReSharper parser is looking at our code, we could simplify the macro definitions and stop ReSharper complaining.

Microsoft Visual Studio helpfully provides a preprocessor macro named ‘__INTELLISENSE__‘ that allows you to detect when it is IntelliSense that is parsing your source code, but I couldn’t find the equivalent for ReSharper. That’s not to say that one doesn’t exist, but I couldn’t find any on-line documentation for one.

However, there obviously is a difference between the Microsoft and JetBrains parsers (otherwise we wouldn’t need to distinguish between them!) so can we use that variation to detect who is parsing our source code? The difference that is causing our macros problems is the way that macro argument tokens are pasted together and joined. Here’s an example:

#define LITERAL(a) a
#define JOIN(a,b) LITERAL(a)LITERAL(b)
#define BEFOREAFTER 1

The first question is: why aren’t we using the token pasting operator? Ironically, that operator, ‘##‘, solves all our problems (in this case). Therefore it’s not a candidate for distinguishing the two parsers. So, given the macros above, what does the following expand to?

JOIN(BEFORE,AFTER)

Well, the Microsoft products (as of MSVC 2013) expand it to ‘1‘ whereas ReSharper expands it to two tokens: ‘BEFORE‘ immediately followed by ‘AFTER‘. Fascinating, but not particularly useful, surely? Ah, but consider this:

#if JOIN(BEFORE,AFTER)
   // We're being parsed by Microsoft products
#else
   // We're being parsed by ReSharper
#endif

Internally, the ReSharper parser is no doubt bitterly fuming about the malformed ‘#if‘ condition; but it does so silently, and the test condition ultimately fails.

Putting it all together gives us:

#define RESHARPER_LITERAL(a) a
#define RESHARPER_JOIN(a,b) RESHARPER_LITERAL(a)RESHARPER_LITERAL(b)
#define RESHARPER_DISABLED 1
#if RESHARPER_JOIN(RESHARPER,_DISABLED)
#define __RESHARPER__ 0
#else
#define __RESHARPER__ 1
#endif
#undef RESHARPER_DISABLED
#undef RESHARPER_JOIN
#undef RESHARPER_LITERAL

Of course, this code snippet is preceded by a huge comment explaining why we’re abusing the preprocessor quite so badly, and suggesting that the reader pretends she never saw it.

This is the fourth instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. This week we made a start on chapter 2 of the book, which introduces us to statistical distributions by way of histograms. This was our first encounter with the incanter.charts namespace, which we use to plot histograms of some values from the National Survey for Family Growth dataset we have worked with in previous sessions.

You can find previous instalments from the study group on our blog:

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

Change into the project directory and fire up Gorilla REPL:

cd thinkstats
lein gorilla

Getting Started

As usual, we start out with a namespace declaration that loads the namespaces we’ll need:

(ns radioactive-darkness
  (:require [incanter.core :as i
               :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [incanter.charts :as c]
            [incanter-gorilla.render :refer [chart-view]]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))

There are two additions since last time: incanter.charts mentioned above, and incanter-gorilla.render that provides a function to display Incanter charts in Gorilla REPL.

We start by generating a vector of random integers to play with:

(def xs (repeatedly 100 #(rand-int 5)))

We can generate a histogram from these data:

(def h (c/histogram xs))

This returns a JFreeChart object that we can display in Gorilla REPL with chart-view:

(chart-view h)

histogram-1

If you’re running from a standard Clojure REPL, you should use the view function from incanter.core instead:

(i/view h)

The first thing we notice about this is that the default number of bins is not optimal for our data; let’s look at the documentation for histogram to see how we might change this.

(require '[clojure.repl :refer [doc]])
(doc c/histogram)

We see that the :nbins option controls the number of bins. We can also set the title and labels for the axes by specifying :title, :x-label and :y-label respectively.

(chart-view (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency"))

histogram-2

We can save the histogram as a PNG file:

(i/save (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency")
            "histogram-1.png")

Birth Weight

Now that we know how to plot histograms, we can start to visualize values from the NSFG data set. We start by loading the data:

(def ds (f/fem-preg-ds))

Plot the pounds part of birth weight (note the use of $! to exclude nil values):

(chart-view (c/histogram ($! :birthwgt-lb ds) :x-label "Birth weight (lb)"))

histogram-3

…and the ounces part of birth weight:

(chart-view (c/histogram ($! :birthwgt-oz ds) :x-label "Birth weight (oz)"))

histogram-4

We can see immediately that these charts are very different, reflecting the different “shapes” of the data. What we see fits well with our intuition: we expect the ounces component of the weight to be distributed fairly evenly, while most newborns are around 7lb or 8lb and babies bigger than 10lb at birth are rarely seen.

Recall that we also computed the total weight in pounds and added :totalwgt-lb to the dataset:

(chart-view (c/histogram ($! :totalwgt-lb ds) :x-label "Total weight (lb)"))

histogram-5

This does not look much different from the :birthwgt-lb histogram, as the pounds value dominates the ounces in the computation.

A Few More Histograms

The shape of a histogram tells us how the data are distributed: it may be approximately flat like the :birthwgt-oz histogram, bell-shaped like :birthwgt-lb, or an asymmetrical bell (with a longer tail to the left or to the right) like the following two.

(chart-view (c/histogram ($! :ageatend ds)
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-6

Let’s try that again, excluding the outliers with an age over 60:

(chart-view (c/histogram (filter #(< % 60) ($! :ageatend ds))
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-7

Finally, let’s look at pregnancy length for live births:

(chart-view (c/histogram ($! :prglngth ($where {:outcome 1} ds))
                         :x-label "Weeks"
                         :title "Pregnancy length (live births)"))

histogram-8

We have now reached the end of section 2.4 of the book, and will pick up next time with section 2.5.

One of the user stories I had to tackle in a recent sprint was to import data maintained by a non-technical colleague in a Google Spreadsheet into our analytics database. I quickly found a Java API for Google Spreadsheets that looked promising but turned out to be more tricky to get up and running than expected at first glance. In this article, I show you how to use this library from Clojure and avoid some of the pitfalls I fell into.

Google Spreadsheets API

The GData Java client referenced in the Google Spreadsheets API documentation uses an old XML-based protocol, which is mostly deprecated. We are recommended to use the newer, JSON-based client. After chasing my tail on this, I discovered that Google Spreadsheets does not yet support this new API and we do need the GData client after all.

The first hurdle: dependencies

The GData Java client is not available from Maven, so we have to download a zip archive. The easiest way to use these from a Leiningen project is to use mvn to install the required jar files in our local repository and specify the dependencies in the usual way. This handy script automates the process, only downloading the archive if necessary. (For this project, we only need the gdata-core and gdata-spreadsheet jars, but the script is easily extended if you need other components.)

#!/bin/bash

set -e

function log () {
    echo "$1" >&2
}

function install_artifact () {
    log "Installing artifact $2"
    mvn install:install-file -DgroupId="$1" -DartifactId="$2" -Dversion="$3" -Dfile="$4" \
        -Dpackaging=jar -DgeneratePom=true
}

R="${HOME}/.m2/repository"
V="1.47.1"
U="http://storage.googleapis.com/gdata-java-client-binaries/gdata-src.java-${V}.zip"

if test -r "${R}/com/google/gdata/gdata-core/1.0/gdata-core-1.0.jar" \
        -a -r "${R}/com/google/gdata/gdata-spreadsheet/3.0/gdata-spreadsheet-3.0.jar";
then
    log "Artifacts up-to-date"
    exit 0
fi

log "Downloading $U"
cd $(mktemp -d)
wget "${U}"
unzip "gdata-src.java-${V}.zip"

install_artifact com.google.gdata gdata-core 1.0 gdata/java/lib/gdata-core-1.0.jar

install_artifact com.google.gdata gdata-spreadsheet 3.0 gdata/java/lib/gdata-spreadsheet-3.0.jar

Once we’ve installed these jars, we can configure dependencies as follows:

(defproject gsheets-demo "0.1.0-SNAPSHOT"
  :description "Google Sheets Demo"
  :url "https://github.com/ray1729/gsheets-demo"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [com.google.gdata/gdata-core "1.0"]
                 [com.google.gdata/gdata-spreadsheet "3.0"]])

The second hurdle: authentication

This is a pain, as the documentation for the GData Java client is incomplete and at times confusing, and the examples it ships with no longer work as they use a deprecated OAuth version. The example Java code in the documentation tells us:

// TODO: Authorize the service object for a specific user (see other sections)

The other sections were no more enlightening, but after more digging and reading of source code, I realized we can use the google-api-client to manage our OAuth credentials and simply pass that credentials object to the GData client. This library is already available from a central Maven repository, so we can simply update our project’s dependencies to pull it in:

:dependencies [[org.clojure/clojure "1.8.0"]
               [com.google.api-client/google-api-client "1.21.0"]
               [com.google.gdata/gdata-core "1.0"]
               [com.google.gdata/gdata-spreadsheet "3.0"]]

OAuth credentials

Before we can start using OAuth, we have to register our client with Google. This is done via the Google Developers Console. See Using OAuth 2.0 to Access Google APIs for full details, but here’s a quick-start guide to creating credentials for a service account.

Navigate to the Developers Console. Click on Enable and manage APIs and select Create a new project. Enter the project name and click Create.

Once the project is created, click on Credentials in the sidebar, then the Create Credentials drop-down. As our client is going to run from cron, we want to enable server-to-server authentication, so select Service account key. On the next screen, select New service account and enter a name. Make sure the JSON radio button is selected, then click Create.

Copy the downloaded JSON file into your project’s resources directory. It should look something like:

{
  "type": "service_account",
  "project_id": "gsheetdemo",
  "private_key_id": "041db3d758a1a7ef94c9c59fb3bccd2fcca41eb8",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "gsheets-demo@gsheetdemo.iam.gserviceaccount.com",
  "client_id": "106215031907469115769",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/gsheets-demo%40gsheetdemo.iam.gserviceaccount.com"
}

We’ll use this in a moment to create a GoogleCredential object, but before that navigate to Google Sheets and create a test spreadsheet. Grant read access to the spreadsheet to the email address found in client_email in your downloaded credentials.

A simple Google Spreadsheets client

We’re going to be using a Java client, so it should come as no surprise that our namespace imports a lot of Java classes:

(ns gsheets-demo.core
  (:require [clojure.java.io :as io])
  (:import com.google.gdata.client.spreadsheet.SpreadsheetService
           com.google.gdata.data.spreadsheet.SpreadsheetFeed
           com.google.gdata.data.spreadsheet.WorksheetFeed
           com.google.gdata.data.spreadsheet.CellFeed
           com.google.api.client.googleapis.auth.oauth2.GoogleCredential
           com.google.api.client.json.jackson2.JacksonFactory
           com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
           java.net.URL
           java.util.Collections))

We start by defining some constants for our application. The credentials resource is the JSON file we downloaded from the developer console:

(def application-name "gsheetdemo-v0.0.1")

(def credentials-resource (io/resource "GSheetDemo-041db3d758a1.json"))

(def oauth-scope "https://spreadsheets.google.com/feeds")

(def spreadsheet-feed-url (URL. "https://spreadsheets.google.com/feeds/spreadsheets/private/full"))

With this in hand, we can create a GoogleCredential object and initialize the Google Sheets service:

(defn get-credential
  []
  (with-open [in (io/input-stream credentials-resource)]
    (let [credential (GoogleCredential/fromStream in)]
      (.createScoped credential (Collections/singleton oauth-scope)))))

(defn init-service
  []
  (let [credential (get-credential)
        service (SpreadsheetService. application-name)]
    (.setOAuth2Credentials service credential)
    service))

Let’s try it at a REPL:

lein repl

user=> (require '[gsheets-demo.core :as gsheets])
nil
user=> (def service (gsheets/init-service))
#'user/service
user=> (.getEntries (.getFeed service
                              gsheets/spreadsheet-feed-url
                              com.google.gdata.data.spreadsheet.SpreadsheetFeed))
(#object[com.google.gdata.data.spreadsheet.SpreadsheetEntry 0x43ab2a3e "com.google.gdata.data.spreadsheet.SpreadsheetEntry@43ab2a3e"])

Great! We can see the one spreadsheet we granted our service account read access. Let’s wrap this up in a function and implement a helper to find a spreadsheet by name:

(defn list-spreadsheets
  [service]
  (.getEntries (.getFeed service spreadsheet-feed-url SpreadsheetFeed)))

(defn find-spreadsheet-by-title
  [service title]
  (let [spreadsheets (filter (fn [sheet] (= (.getPlainText (.getTitle sheet)) title))
                             (list-spreadsheets service))]
    (if (= (count spreadsheets) 1)
      (first spreadsheets)
      (throw (Exception. (format "Found %d spreadsheets with name %s"
                                 (count spreadsheets)
                                 title))))))

Back at the REPL:

user=> (def spreadsheet (gsheets/find-spreadsheet-by-title service "Colour Counts"))
user=>  (.getPlainText (.getTitle spreadsheet))
"Colour Counts"

A spreadsheet contains one or more worksheets, so the next functions we implement take a SpreadsheetEntry object and list or search worksheets:

(defn list-worksheets
  [service spreadsheet]
  (.getEntries (.getFeed service (.getWorksheetFeedUrl spreadsheet) WorksheetFeed)))

(defn find-worksheet-by-title
  [service spreadsheet title]
  (let [worksheets (filter (fn [ws] (= (.getPlainText (.getTitle ws)) title))
                           (list-worksheets service spreadsheet))]
    (if (= (count worksheets) 1)
      (first worksheets)
      (throw (Exception. (format "Found %d worksheets in %s with name %s"
                                 (count worksheets)
                                 spreadsheet
                                 title))))))

…and at the REPL:

user=> (def worksheets (gsheets/list-worksheets service spreadsheet))
user=> (map (fn [ws] (.getPlainText (.getTitle ws))) worksheets)
("Sheet1")

Our next function returns the cells belonging to a worksheet:

(defn get-cells
  [service worksheet]
  (map (memfn getCell) (.getEntries (.getFeed service (.getCellFeedUrl worksheet) CellFeed))))

This gives us a flat list of Cell objects. It will be much more convenient to work in Clojure with a nested vector of the cell values:

(defn to-nested-vec
  [cells]
  (mapv (partial mapv (memfn getValue)) (partition-by (memfn getRow) cells)))

We now have all the building blocks for the function that will be the main entry point to our minimal Clojure API:

(defn fetch-worksheet
  [service {spreadsheet-title :spreadsheet worksheet-title :worksheet}]
  (if-let [spreadsheet (find-spreadsheet-by-title service spreadsheet-title)]
    (if-let [worksheet (find-worksheet-by-title service spreadsheet worksheet-title)]
      (to-nested-vec (get-cells service worksheet))
      (throw (Exception. (format "Spreadsheet '%s' has no worksheet '%s'"
                                 spreadsheet-title worksheet-title))))
    (throw (Exception. (format "Spreadsheet '%s' not found" spreadsheet-title)))))

With this in hand:

user=> (def sheet (gsheets/fetch-worksheet service {:spreadsheet "Colour Counts" :worksheet "Sheet1"}))
#'user/sheet
user=> (clojure.pprint/pprint sheet)
[["Colour" "Count"]
 ["red" "123"]
 ["orange" "456"]
 ["yellow" "789"]
 ["green" "101112"]
 ["blue" "131415"]
 ["indigo" "161718"]
 ["violet" "192021"]]
nil

Our to-nested-vec function returns the cell values as strings. I could have used the getNumericValue method instead of getValue, but then to-nested-vec would have to know what data type to expect in each cell. Instead, I used Plumatic Schema to define a schema for each row, and used its data coercion features to coerce each column to the desired data type – but that’s a blog post for another day.

Code for the examples above is available on Github https://github.com/ray1729/gsheets-demo. We have barely scratched the surface of the Google Spreadsheets API; check out the API Documentation if you need to extend this code, for example to create or update spreadsheets.