Getting Started

One of our new starters here at Metail was keen to brush up their statistics, and it’s more than 20 years since I completed an introductory course at university, so I knew I would benefit from some revision. We also have a bunch of statisticians in the office who would like to brush up their Clojure, so I thought it might be fun to organise a lunchtime study group to work through Allen Downey’s Think Stats and attempt the exercises in Clojure. We’re using the second edition of the book, which is available online in HTML format, and meeting on Wednesday lunchtimes to work through it together.

We’ll use Clojure’s Incanter library, which provides utilities for statistical analysis and generating charts. Create a Leiningen project for our work:

lein new thinkstats

Make sure the project.clj depends on Clojure 1.7.0 and add a dependency on Incanter 1.5.6:

:dependencies [[org.clojure/clojure "1.7.0"]
               [incanter "1.5.6"]]

Parsing the data

In the first chapter of the book, we are introduced to a data set from the US Centers for Disease Control and Prevention, the National Survey of Family Growth. The data are in a gzipped file with fixed-width columns. An accompanying Stata dictionary describes the variable names, types, and column indices for each record. Our first job will be to parse the dictionary file and use that information to build a parser for the data.

We cloned the GitHub repository that accompanies Allen’s book:

git clone https://github.com/AllenDowney/ThinkStats2

Then created symlinks to the data files from our project:

cd thinkstats
mkdir data
cd data
for f in ../../ThinkStats2/code/{*.dat.gz,*.dct}; do ln -s $f; done

We can now read the Stata dictionary for the family growth study from data/2002FemPreg.dct. The dictionary looks like:

infile dictionary {
    _column(1)  str12    caseid     %12s  "RESPONDENT ID NUMBER"
    _column(13) byte     pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
}

If we skip the first and last lines of the dictionary, we can use a regular expression to parse each column definition:

(def dict-line-rx #"^\s+_column\((\d+)\)\s+(\S+)\s+(\S+)\s+%(\d+)(\S)\s+\"([^\"]+)\"")

We’re capturing the column position, column type, column name, format length and specifier, and description. Let’s test this at the REPL. First we have to read a line from the dictionary:

(require '[clojure.java.io :as io])
(def line (with-open [r (io/reader "data/2002FemPreg.dct")]
            (first (rest (line-seq r)))))

We use rest to skip the first line of the file then grab the first column definition. Now we can try matching this with our regular expression:

(re-find dict-line-rx line)

This returns the string that matched and the capture groups we defined in our regular expression:

["    _column(1)      str12        caseid  %12s  \"RESPONDENT ID NUMBER\""
 "1"
 "str12"
 "caseid"
 "12"
 "s"
 "RESPONDENT ID NUMBER"]

We need to do some post-processing of this result to parse the column index and length to integers; we’ll also replace underscores in the column name with hyphens, which makes for a more idiomatic Clojure variable name. Let’s wrap that up in a function:

(require '[clojure.string :as str])

(defn parse-dict-line
  [line]
  (let [[_ col type name f-len f-spec descr] (re-find dict-line-rx line)]
    {:col    (dec (Integer/parseInt col))
     :type   type
     :name   (str/replace name "_" "-")
     :f-len  (Integer/parseInt f-len)
     :f-spec f-spec
     :descr  descr}))

Note that we’re also decrementing the column index – we need zero-indexed column indices for Clojure’s subs function. Now when we parse our sample line we get:

{:col 0,
 :type "str12",
 :name "caseid",
 :f-len 12,
 :f-spec "s",
 :descr "RESPONDENT ID NUMBER"}

With this function in hand, we can write a parser for the dictionary file:

(defn read-dict-defn
  "Read a Stata dictionary file, return a vector of column definitions."
  [path]
  (with-open [r (io/reader path)]
    (mapv parse-dict-line (butlast (rest (line-seq r))))))

We use rest and butlast to skip the first and last lines of the file, and mapv to force eager evaluation and ensure we process all of the input before the reader is closed when we exit with-open.

(def dict (read-dict-defn "data/2002FemPreg.dct"))
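
Inspecting the first entry gives us back the map we saw above:

(first dict)
;=> {:col 0, :type "str12", :name "caseid", :f-len 12, :f-spec "s", :descr "RESPONDENT ID NUMBER"}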

The dictionary tells us the starting position (:col) and length (:f-len) of each field, so we can use subs to extract the raw value of each column from the data. This will give us a string, and the :type key we’ve extracted from the dictionary tells us how to interpret this. We’ve seen the types str12 and byte above, but what other types appear in the dictionary?

(distinct (map :type dict))
;=> ("str12" "byte" "int" "float" "double")

We’ll leave str12 unchanged, coerce byte and int to Long, and float and double to Double:

(defn parse-value
  [type raw-value]
  (when (not (empty? raw-value))
    (case type
      ("str12")          raw-value
      ("byte" "int")     (Long/parseLong raw-value)
      ("float" "double") (Double/parseDouble raw-value))))

We can now build a record parser from the dictionary definition:

(defn make-row-parser
  "Parse a row from a Stata data file according to the specification in `dict`.
   Return a vector of columns."
  [dict]
  (fn [row]
    (reduce (fn [accum {:keys [col type name f-len]}]
              (let [raw-value (str/trim (subs row col (+ col f-len)))]
                (conj accum (parse-value type raw-value))))
            []
            dict)))
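
As a quick sketch of how the parser is used (here data-line stands for a raw line read from the data file):

(def parse-row (make-row-parser dict))
(parse-row data-line) ;=> a vector with one parsed value per column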

To read gzipped data, we need to open an input stream, coerce this to a GZIPInputStream, and construct a buffered reader from that. For convenience, we’ll define a function to do this automatically if passed a path ending in .gz.

(import 'java.util.zip.GZIPInputStream)

(defn reader
  "Open path with io/reader; coerce to a GZIPInputStream if suffix is .gz"
  [path]
  (if (.endsWith path ".gz")
    (io/reader (GZIPInputStream. (io/input-stream path)))
    (io/reader path)))
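
For example, to peek at the first line of the gzipped data file:

(with-open [r (reader "data/2002FemPreg.dat.gz")]
  (first (line-seq r)))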

Given a dictionary and reader, we can parse the records from a data file:

(defn read-dct-data
  "Parse lines from `rdr` according to the specification in `dict`.
   Return a lazy sequence of parsed rows."
  [dict rdr]
  (let [parse-fn (make-row-parser dict)]
    (map parse-fn (line-seq rdr))))

Finally, we bring this all together with a function to parse the dictionary and data and return an Incanter dataset:

(require '[incanter.core :refer [dataset]])

(defn as-dataset
  "Read Stata data set, return an Incanter dataset."
  [dict-path data-path]
  (let [dict   (read-dict-defn dict-path)
        header (map (comp keyword :name) dict)]
    (with-open [r (reader data-path)]
      (dataset header (doall (read-dct-data dict r))))))
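
Putting it all together at the REPL (incanter.core/dim returns the number of rows and columns):

(def preg (as-dataset "data/2002FemPreg.dct" "data/2002FemPreg.dat.gz"))

(require '[incanter.core :as i])
(i/dim preg)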

Getting the code

The code for all this is available on GitHub; if you’d like to follow along, you can fork my thinkstats repository.

The functions we’ve developed above are in the namespace thinkstats.dct-parser. In the next article in this series, we use our parser to explore and clean the data using Incanter.

For the latest Data Insights Cambridge meetup, which we host, we are delighted to welcome Metail’s very own Shrividya Ravi to speak about the ‘A – Z of A/B testing’.

What is A – Z of A/B testing?

Randomised controlled trials have been a key part of medical science since the 18th century. More recently they have gained rapid traction in the e-commerce world, where the term ‘A/B testing’ has become synonymous with businesses that are innovative and data-driven.

A/B testing has become the ‘status quo’ for retail website development – enabling product managers and marketing professionals to positively affect the customer journey; the sales funnel in particular. Combining event stream data with sound questions and good experiment design, these controlled trials become powerful tools for insight into user behaviour.

This talk will present a comprehensive overview of A/B testing discussing both the advantages and the caveats. A series of case studies and toy examples will detail the myriad of analyses possible from rich web events data. Topics covered will include inference with hypothesis testing, regression, bootstrapping, Bayesian models and parametric simulations.

The Speaker

Dr Ravi transitioned to Data Science following her PhD in experimental materials physics. Working as a Data Scientist at Metail (online try-on technology), Shriv continues experimenting and teasing insights from data.

Head to the Data Insights Cambridge Meetup page to register.

A bunch of Metail Clojurians are off to Clojure eXchange 2015 this week.

Members from the Data Science, Data Engineering and Web teams will be catching up on what’s new and seeing how others are using Clojure to solve their problems. Metail make extensive use of Clojure and ClojureScript for a lot of our internal tools. We are also currently investigating the feasibility of using ClojureScript to implement the next version of our MeModel visualisation product instead of CoffeeScript and Backbone.

Some of the Clojure tech that we currently use: Cascalog, Om (Now and Next), Immutant, Prismatic/Schema and Figwheel. Grab one of us if you want to talk about our experiences with any of these, or anything Clojure- or data-related.

All the content looks really interesting; some highlights of particular interest:

  • Bozhidar Batsov – CIDER: The journey so far and the road ahead
  • Kris Jenkins – ClojureScript: Architecting for Scale
  • Nicola Mometto – Immutable code analysis with tools.analyzer
  • Hans Hubner – Datomic in Practice

Looking forward to all the talks, catching up with old friends and making new ones. See you there.

I have recently needed to run Varnish (a very fast web cache for busy sites) in a situation that also required use of HTTPS on the box. Unfortunately, Varnish does not handle crypto, which is probably a good thing given how easy it is for programmers to make mistakes in their code, rendering the security useless!

Whilst recipes for Stunnel and Varnish together exist, information on running them on the same box whilst still presenting the original source IP to Varnish for logging/load-balancing purposes was scarce – the configuration below “worked for me”, at least on Debian 7.0 (Wheezy). You will need the xt_mark module, which should be part of most distributions, but I found it was missing from some hosted boxes and VMs with custom kernels. The specific versions of software we are running are based on:

  • Linux 3.13.5
  • Varnish 3.0.6
  • STunnel 5.24

If you are running Varnish 4, the VCL here will require rewriting as it has been updated from version 3.

IPTables – mark traffic from source port 8088 for routing
iptables -t mangle -A OUTPUT -p tcp -m multiport --sports 8088 -j MARK --set-xmark 0x1/0xffffffff

Routing configuration – anything marked by IPTables is sent back to the local box. These two commands can be added under iface lo as “post-up” commands if you’re on a Debian box, as shown in the sketch below.
ip rule add fwmark 1 lookup 100
ip route add local 0.0.0.0/0 dev lo table 100
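
On Debian, that might look like the following stanza in /etc/network/interfaces (a sketch; adapt to your own interface configuration):

auto lo
iface lo inet loopback
    post-up ip rule add fwmark 1 lookup 100
    post-up ip route add local 0.0.0.0/0 dev lo table 100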

STunnel configuration. The connect IP MUST be an IP on the box other than loopback, i.e. it will not work if you specify 127.0.0.1.

[https]
accept = 443
connect = 10.1.1.1:8088
transparent = source

From default.vcl:
import std;

sub vcl_recv {
    // Set header variables in a sensible way.
    remove req.http.X-Forwarded-Proto;

    if (server.port == 8088) {
        set req.http.X-Forwarded-Proto = "https";
    } else {
        set req.http.X-Forwarded-Proto = "http";
    }

    set req.http.X-Forwarded-For = client.ip;
    std.collect(req.http.X-Forwarded-For);
}

sub vcl_hash {
    // SSL data returned may be different from non-SSL.
    // (E.g. including https:// in URLs)
    hash_data(server.port);
}

Clojure Meetup as part of the Cambridge Non-dysfunctional Programmers group, in Metail’s office at 50 St Andrews St, Cambridge

This month’s meetup of the Cambridge NonDysFunctional Programmers will be hosted here at Metail’s Cambridge office next Thursday (26th November) from 6.30pm. I (Ray Miller) will be giving a hands-on introduction to web development with Clojure, where attendees get to implement their first Clojure web application from the ground up.

Along the way, we’ll learn about the Compojure routing library, Ring requests and responses, middleware, and generating HTML with hiccup. Time permitting, we’ll also cover interacting with a relational database and using buddy to add session-based authentication and authorization to our application.

The theme running through the tutorial is the implementation of an ad server (an example I shamelessly stole from Dan Benjamin’s Meet Sinatra screencast). This demo application delivers a JavaScript snippet to embed random ads in a web page and tracks user click-throughs. It also provides an administrative interface for reporting and managing ads. If you’ve already seen Dan’s screencast, you’ll see how Clojure compares with Sinatra to implement the same application.

See the Meetup page for full details and to sign up.


Shortly after I joined Metail in late summer 2011 there was a typically English bit of weather: an apocalyptic downpour of rain just as I was leaving work on a Wednesday. I was immediately transported back to a Kenyan balcony on which I had spent many happy hours as a proper colonial – sipping a G&T and watching the sun go down. The temperature was right, the sun was low in the sky, the tree was in full bloom (well, it had leaves on it at any rate) and a tropical storm was in the air. Thinking that it was only fair to share this experience with those who had not been fortunate enough to have exposure to the original, I packed my bag on Friday with a selection of gins, tonic and appropriate accoutrements.


Unpacking the essentials after moving office

This went down well with colleagues, and every Friday since we have endeavoured to gather and raise a glass to the end of the week. It has provided a great opportunity to relax and to meet teammates and their partners and children. It has also allowed teams to bond and cross-team conversations to happen. We also got the chance to hear the results of Nick’s nimble fingers (this harks back to the halcyon days when there was room to swing a cat and strum a guitar upstairs in 16), and to share in the occasional sing-song.

As the team has expanded I have not been able to support the whole cost myself, so I have asked for contributions of £10 a month and welcome any suggestions or requests for particular drinks, as well as widening the group who make the drinks each week. We have progressed from simple gins and tonics to brambles, why nots, and even non gin-based drinks – manhattans, daiquiris, sidecars and orange brulées spring immediately to mind. Most of our shopping is done at Cambridge Wine Merchants, who are always happy to help us out with ideas or substitutions for ingredients.

Oh, and finally, like any good dealer: your first few hits are free.



Ace of Clubs Daiquiris made by Ian Taylor and photographed by Andrew Dunn

For anyone thinking of setting up something similar, the following price structure was designed to be as inclusive as possible and has been in place as the office head count has more than quadrupled. It has kept the club in the black and allowed it to provide something alcoholic and non-alcoholic for everyone at Christmas. Membership of the club is purely optional.

Members: £10 a month (first month free)

Interns: drink free

Partners: drink free

Guests: drink free

Non-Members from the office: drink free

Most of Metail’s try-it-on tech is delivered as a single-page JavaScript application that embeds inside retailer sites, calling back to Metail-hosted services in Europe over HTTPS/JSON. In this post, Nick Day describes some of the more difficult trade-offs we made to get the best out of Content Distribution Networks and improve the speed of our app.

tl;dr When you’re using a CDN to accelerate the delivery of a 3rd party HTTPS / AJAX app to users far away, you end up having to choose your devils. We chose the one called “pre-flight OPTIONS request”.

We’re currently working with Dafiti, a Brazilian company whose user base is almost exclusively in-country. Our web-application and web-cache servers are based in the UK, so to aid user-experience we’ve moved as much as possible of our static and dynamic content so it’s served through a CDN (CloudFront, which handily has edge nodes in São Paulo).  

The last piece of this “CDN-ifying” has been to serve the initial HTML page from CloudFront.  Naturally, this has a large impact on page load speeds as the browser can do nothing else until this resource has been obtained; so fulfilling the request from a nearby CloudFront node is hugely preferable to having it trot under the Atlantic to hit our servers and then back.  However, as this page is now being served from the CDN’s domain, browser security constraints mean that any AJAX requests being made from it must either:

  1. also go through the CDN domain, or
  2. be served directly from your service domain using cross-origin resource sharing (CORS)

For cacheable content obtained through AJAX (in our case, things like garment information) the preferred option is clear: serve it through the CDN. The decision is more involved for content that you don’t want to or can’t cache, like requests that need to be up-to-date (e.g. account details) or update requests. In this case you must choose the lesser of two evils.

Serving through the CDN:

  • you’re making every one of these requests take an indirect route, as the CDN will be passing them all on for your servers to handle
  • on top of that there’s an extra hit; SSL de/encryption must happen in the CDN for these requests to be forwarded
  • you have to manage potentially complex CDN config to do the right thing for each request.  You wouldn’t have to do this with your typical CDN-cached content, but in practice you may need to add cookies and headers to a whitelist for the CDN to forward.  CloudFront makes this problem extra-fun by lacking a programmatic way to update/version your distribution configuration, making it easy to take down your service accidentally, and hard to roll back.
  • you’re paying (real money) each time you route one of these requests (somewhat unnecessarily) through a CDN

Alternatively, if you send these requests directly to your service domain:

  • you’ll need to implement and maintain CORS handling for those endpoints being requested from other domains
  • for requests deemed “non-simple” by the CORS spec (methods other than GET/HEAD or POST; using Content-Type other than a very strict set, which surprisingly does not include JSON; or using custom headers) the browser will automatically make a pre-flight OPTIONS request to check that it is able to make the request before actually making the one you desire.  This is a big deal if your longer-than-you’d-like cross-Atlantic request suddenly becomes two!  It was this point that almost pushed me toward taking the CDN route for these requests.

Fortunately for us, there are two saving graces with this latter choice of sending the requests directly to our service domain:

  1. All of the requests in our app that require a pre-flight request are non-blocking; that is, they don’t require a user to wait before viewing part of, or accessing functionality in, the app. They’re all “update in the background” kinds of requests. Granted, if these take too long you could imagine timing issues creeping in, but we should be coding defensively to guard against those kinds of asynchronous problems anyway.
  2. The CORS spec allows you to specify a caching max-age for the OPTIONS requests. Of course, since you’re not going through a CDN, this caching will be on a per-user basis against the web-cache. At the moment, unfortunately for us, even a cache hit must come to Europe. The answer here is that it’s probably going to make sense for us to move web-caches closer to where the users are, even if the apps stay rooted in the UK.  Another slightly niggling point is that some browsers currently have an enforced maximum cache period that’s quite short for pre-flight requests (e.g. Chrome is 5 minutes).  As this is shorter than our average user session, it means that it’s likely a user will have to make those same requests more than once per session.  From half a world away, that’s not ideal.

Taking all of these points into consideration, we favoured sending requests directly to our service domain, preferring to take the hit with the “background” requests that require a pre-flight in order to improve the performance of most of our non-caching requests.
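
To make the CORS handling concrete, here is a minimal sketch of what the middleware might look like in Clojure (Ring-style, since that’s our usual stack; the origin, methods and max-age values are illustrative assumptions rather than our production configuration):

(def cors-headers
  ;; Illustrative values: the allowed origin would be the CDN domain serving
  ;; the page, and max-age is capped by some browsers (e.g. Chrome: 5 minutes).
  {"Access-Control-Allow-Origin"  "https://cdn.example.com"
   "Access-Control-Allow-Methods" "GET, POST, PUT, DELETE"
   "Access-Control-Allow-Headers" "Content-Type"
   "Access-Control-Max-Age"       "300"})

(defn wrap-cors
  "Answer pre-flight OPTIONS requests directly and attach the CORS headers
  to every other response."
  [handler]
  (fn [request]
    (if (= :options (:request-method request))
      {:status 200 :headers cors-headers :body ""}
      (update (handler request) :headers merge cors-headers))))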

As hosts of the excellent Data Insights Cambridge meetup, we’re excited for the upcoming talk by Dr Sacha Krstulovic from Audio Analytic Ltd. (who also happen to be our downstairs neighbours!). Dr Krstulovic will be speaking about ‘Filling the sensory gap in Big Data’.

What is Filling the sensory gap in Big Data?

Data analysis creates value by uncovering relationships between various types of data. Whilst lots of thought is put into developing new data analysis techniques, this talk focuses on the nature of the data itself. As a matter of fact, the evolution of technology across time has unlocked access to data layers of an increasingly rich and complex nature. As a consequence, new value propositions have arisen not only from new analysis techniques, but also from treating new types of data. Nowadays, this movement has reached as deep as allowing us to think of humans as having an equivalent data self in various computer systems. However, at the cutting-edge frontier of this movement, there is still an untapped opportunity: creating value from exploring one of the richest and hardest-to-reach data sets, human sensory data. As a case study, Audio Analytic is exploring this opportunity in the domain of acoustic data.

The Speaker

Dr Sacha Krstulovic

Dr Sacha Krstulovic is the VP of Technology at Audio Analytic, a company building a leadership position in automatic sound recognition. Before joining AA, Sacha was a Senior Research Engineer at Nuance’s Advanced Speech Group (Nuance ASG), where he worked on pushing the limits of large-scale speech recognition services such as Voicemail-to-Text and voice-based mobile assistants (Apple Siri-type services). Prior to that, he was a Research Engineer at Toshiba Research Europe Ltd., developing novel text-to-speech synthesis approaches able to learn from data. He is the author and co-author of two book chapters, two international patents and several articles in international journals and conferences.

The meetup is scheduled for Thursday, November 12, 2015 at 7:00 pm at 50 St Andrew’s St, CB2 3AH. We hope to see you there, just sign up for it on the Data Insights Cambridge meetup page.

As part of the Visualization Team at Metail, I spend a large proportion of my time staring at renders of models, hoping that too much virtual flesh isn’t being exposed. It’s an onerous task, and any automation to make my life easier is always appreciated. Can we get computers to make sure “naughty bits” aren’t being accidentally shown to our customers?

Detecting Flesh Colours

The diversity of flesh colours is quite large; the following is a small selection of MeModel renders:

Feet Skin Colours

The range of colours considered “fleshy” is further compounded by the synthesised lighting environment: shadows tend to push the colours towards black, highlights towards white.

The first question to consider is which colour space to use for detecting flesh colours. There’s quite a bit of informal debate about this. Some propose perceptual spaces such as HSL or HSV, but I’m going to stick my neck out and say that there’s no reason not to use plain, old RGB.¹ Or, more precisely, to overcome the effects of fluctuating lighting levels, some form of normalized RGB. For example:

M = max(R, G, B), or 1 if R ≡ G ≡ B ≡ 0
R* = R ÷ M
G* = G ÷ M
B* = B ÷ M
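
As a quick sketch of this normalisation in Clojure (a hypothetical helper matching the formulas above, not our production code):

(defn normalize-rgb
  "Scale an [r g b] triple by its maximum channel, treating pure black
  specially to avoid dividing by zero."
  [[r g b]]
  (let [m (max r g b)
        m (if (zero? m) 1 m)]
    (mapv #(/ % (double m)) [r g b])))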

Two observations can be made at this point:

  1. If R ≡ G ≡ B ≡ 0, we’re in such a dark place that we cannot tell whether we’re looking at flesh or not; and
  2. The dominant colour channel in all (healthy) human skin is red (green-skinned lizards and blue Venusians are unsupported), so M ≡ R.

Typically, we further assume that the least dominant colour channel is blue. This leads to a pleasing heuristic:

(R, G, B) is flesh only if R > G > B

Marrying this up with a hue/brightness wheel gives us a generous segment of potentially “fleshy” colours:

Hue/Brightness Wheel

There are various refinements to this linear programming technique to further partition the colour space, including:

  • J. Kovac, P. Peer, and F. Solina, “Human skin colour clustering for face detection” in Proceedings of EUROCON 2003. Computer as a Tool. The IEEE Region 8, 2003
  • A. A.-S. Saleh, “A simple and novel method for skin detection and face locating and tracking” in Asia-Pacific Conference on Computer-Human Interaction 2004 (APCHI 2004), LNCS 3101, 2004
  • D. B. Swift, “Evaluating graphic image files for objectionable content” US Patent US 6895111 B1, 2006
  • G. Osman, M. S. Hitam and M. N. Ismail, “Enhanced skin colour classifier using RGB Ratio model” in International Journal on Soft Computing (IJSC) Vol.3, No.4, November 2012

For example, one in-house heuristic we tried could be coded in JavaScript as:

function IsSkin(rgba) {
  // Only consider sufficiently opaque pixels whose channels are ordered R > G > B.
  if ((rgba.a > 0.9) && (rgba.r > rgba.g) && (rgba.g > rgba.b)) {
    // Normalise green against the dominant red channel.
    var g = rgba.g / rgba.r;
    if ((g > 0.6) && (g < 0.9)) {
      // The blue ratio should lie close to a line fitted against the green ratio.
      var b = rgba.b / rgba.r;
      var expected_b = g * 1.28 - 0.35;
      return Math.abs(b - expected_b) < 0.05;
    }
  }
  return false;
}

However, there are some fundamental flaws with merely classifying each pixel as either “fleshy” or “non-fleshy”:

  1. The portion of the colour space that is taken up by human hair, although larger, overlaps the space taken up by flesh tones. This makes distinguishing between long hair lying on top of clothes and actual flesh very difficult.
  2. Many clothes or parts of clothes are deliberately designed to be flesh-coloured or “nude” looking.
  3. You do not know if the “fleshy” pixels are at naughty locations of the body or not.
  4. As the body shape parameters of the MeModel change, the locations of the “naughty bits” change.

We’ll address these issues in the next part. However, even with fairly simplistic heuristics, the techniques discussed thus far reduce the number of images that have to be pored over by humans, as opposed to automated systems, by up to 90%.

Footnote

The maximal range of hues generally considered as skin tones by most flesh pixel detectors is 0° ≤ H ≤ 60° (red to yellow).

This range equates to R ≥ G ≥ B, as can be seen in the chart on the right.

Furthermore, the saturation and value (or lightness) components of HSV (or HSL) detectors are usually discarded. Therefore, the “flesh predicate” can be constructed purely from relative comparisons between R, G and B, or pairwise differences thereof. This is why I believe the RGB colour space to be no less accurate than other colour spaces.

AWS Meetup

On the 22nd October Metail sponsored and hosted the third Cambridge AWS User Group (who have a few photos of the event up). It was great to welcome the meetup organiser Jon Green, AWS tech evangelist Ian Massingham (the main speaker for the evening), and around 25 others to our office “town hall” for drinks, snacks (including honey roasted cashews), networking and of course the talks.

Ian had two spots on the schedule. The first was a round-up of the announcements made last week at AWS re:Invent. I’m not going to produce my own recap of Ian’s summary: A Cloud Guru gave a nice summary of the keynotes at re:Invent in their Medium blog post, and there’s a nice summary with follow-on links over at Trend Micro. There were a few questions from the audience last night, all of which Ian answered to their satisfaction. Seems AWS is doing something right 🙂 My favourite bit of the talk was during the overview of AWS’s now-in-beta IoT platform (apparently the first thing AWS themselves have referred to as a platform) and the customer stories from re:Invent. Here the focus was largely around an application in farming, mentioning the possibility of fitting monitoring devices to animals and being able to buzz them to wake them up. As someone who grew up in the valleys of Wales surrounded by sheep, I found the thought of experimentally setting the state of your flock to be awake quite amusing. The scale of farming in the US is slightly larger than what I’m used to, so that’s a lot more sheep to wake up through patchy signal coverage ;).

Ian’s second talk was a demo of the new EC2 Container Service. I’ve played with Docker before, but it’s not my area of expertise; however, the audience seemed suitably impressed and there was a good bit of discussion at the end.

Having bribed my way onto the agenda by getting Metail to host (and doing the literal legwork to get the drinks and snacks to the office), it was good to give a talk to an engaged audience. I opted to go into some depth on the tracking and batch-processing steps of our data architecture, and skimmed over the insights and how we actually drive the pipeline. I’m hoping to get invited back to go into a bit more depth. Elastic MapReduce is a slightly more niche service in AWS, and is perhaps being overtaken by the use of real-time systems such as AWS’s Kinesis family. The talk itself is up on SlideShare and you can see me in action on the @CambridgeAWS Twitter account 🙂

We are often complimented on the selection of beers which we put on at the meetups at Metail. This is in large part owing to the excellent choice found at the King’s Parade branch of Cambridge Wine Merchants; it makes a nice lunchtime outing to go over and choose the selection for the evening’s meetup. Having said that, a little data insight from our event hosting is that Diet Coke is the most popular drink.

Photo cc Ian M.