Yule log

The winning entry in this year’s Yule Log bake-off

The nights are growing shorter again. The office is looking bare. The coffee, lunch, tea, and biscuit breaks are lengthening so that they almost merge. It all points to the impending festival being nearly upon us. We are off to take well-earned breaks and get ready to start afresh next year, when you will be able to read more about flesh detection and A/B testing, see how the next chapter of Think Stats unfolds, and hear about the things we discovered during our Christmas holidays.

We will of course also keep you posted on the meetups we are hosting and would invite anyone interested to come and join us.

And so, until next year, we wish you a Merry Christmas and a happy new year.


I’m excited to be speaking at the December installment of the Cambridge functional programming meetup, hosted at Metail, this Thursday. I’ll be talking about making synthesized electronic music using Overtone.


Overtone is an ‘Open Source toolkit for designing synthesizers and collaborating with music’. The library leverages Clojure’s power and generality for making flexible definitions of instruments, melodies and rhythms. Using code-reloading and the Clojure REPL, it provides an awesome environment for ‘live-coding’, a particularly modern form of improvisational music making!

The Clojure JVM process doesn’t actually synthesize audio in real-time itself; instead it connects to the separate program ‘scsynth’ (the Supercollider synthesis server), which is a high performance application written in C++.

Last summer at the Metail office-warming party we had Sam Aaron (one of the original Overtone developers, and another Cambridge resident) performing live. Check out http://meta-ex.com for some videos of other gigs he has done!

The Talk

In this talk, I’ll give a basic introduction to some of the features of the Overtone library that I’ve had fun playing with. I’ll talk about several of the most important synthesis techniques, and demonstrate how they work (and what the results sound like!) in Overtone.

In the second part of the talk, I’ll describe how to put the noises created in the first part together to make some music. I’ll explain a bit of music theory and show how it can be put into practice, ending with a suitably festive demo!

The Speaker

I’m a Clojure programmer working at Metail in the data engineering team. I’ve always been fascinated by music: I have significantly more years’ experience as a choral singer than as a programmer! Currently I sing tenor with Selwyn College Chapel Choir.

Getting Started

One of our new starters here at Metail was keen to brush up their statistics, and it’s more than 20 years since I completed an introductory course at university so I knew I would benefit from some revision. We also have a bunch of statisticians in the office who would like to brush up their Clojure, so I thought it might be fun to organise a lunchtime study group to work through Allen Downey’s Think Stats and attempt the exercises in Clojure. We’re using the second edition of the book which is available online in HTML format, and meeting on Wednesday lunchtimes to work through it together.

We’ll use Clojure’s Incanter library which provides utilities for statistical analysis and generating charts. Create a Leiningen project for our work:

lein new thinkstats

Make sure the project.clj depends on Clojure 1.7.0 and add a dependency on Incanter 1.5.6:

:dependencies [[org.clojure/clojure "1.7.0"]
               [incanter "1.5.6"]]

Parsing the data

In the first chapter of the book, we are introduced to a data set from the US Centers for Disease Control and Prevention, the National Survey of Family Growth. The data are in a gzipped file with fixed-width columns. An accompanying Stata dictionary describes the variable names, types, and column indices for each record. Our first job will be to parse the dictionary file and use that information to build a parser for the data.

We cloned the GitHub repository that accompanies Allen’s book:

git clone https://github.com/AllenDowney/ThinkStats2

Then created symlinks to the data files from our project:

cd thinkstats
mkdir data
cd data
for f in ../../ThinkStats2/code/{*.dat.gz,*.dct}; do ln -s $f; done

We can now read the Stata dictionary for the family growth study from data/2002FemPreg.dct. The dictionary looks like:

infile dictionary {
    _column(1)  str12    caseid     %12s  "RESPONDENT ID NUMBER"
    _column(13) byte     pregordr   %2f  "PREGNANCY ORDER (NUMBER)"

If we skip the first and last lines of the dictionary, we can use a regular expression to parse each column definition:

(def dict-line-rx #"^\s+_column\((\d+)\)\s+(\S+)\s+(\S+)\s+%(\d+)(\S)\s+\"([^\"]+)\"")

We’re capturing the column position, column type, column name, format length and format specifier, and description. Let’s test this at the REPL. First we have to read a line from the dictionary:

(require '[clojure.java.io :as io])
(def line (with-open [r (io/reader "data/2002FemPreg.dct")]
            (first (rest (line-seq r)))))

We use rest to skip the first line of the file then grab the first column definition. Now we can try matching this with our regular expression:

(re-find dict-line-rx line)

This returns the string that matched and the capture groups we defined in our regular expression:

["    _column(1)      str12        caseid  %12s  \"RESPONDENT ID NUMBER\""
 "1"
 "str12"
 "caseid"
 "12"
 "s"
 "RESPONDENT ID NUMBER"]

We need to do some post-processing of this result to parse the column index and length to integers; we’ll also replace underscores in the column name with hyphens, which makes for a more idiomatic Clojure variable name. Let’s wrap that up in a function:

(require '[clojure.string :as str])

(defn parse-dict-line
  "Parse one column definition from a Stata dictionary into a map."
  [line]
  (let [[_ col type name f-len f-spec descr] (re-find dict-line-rx line)]
    {:col    (dec (Integer/parseInt col))
     :type   type
     :name   (str/replace name "_" "-")
     :f-len  (Integer/parseInt f-len)
     :f-spec f-spec
     :descr  descr}))

Note that we’re also decrementing the column index – we need zero-indexed column indices for Clojure’s substring function. Now when we parse our sample line we get:

{:col 0,
 :type "str12",
 :name "caseid",
 :f-len 12,
 :f-spec "s",
 :descr "RESPONDENT ID NUMBER"}
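To make the zero-indexing point concrete, here is a minimal sketch of how a 1-based Stata column and a field length map onto subs, which takes a 0-based start and an exclusive end. The sample row is made up for illustration and is not taken from the survey data.

```clojure
;; Hypothetical fixed-width row: a 12-character id followed by a 2-character field.
(def sample-row "ABCDEFGHIJKL 7")

;; The dictionary says _column(13); subs counts from zero, so we decrement.
(def col   (dec 13))
(def f-len 2)

;; First field: columns 1-12 in Stata terms, 0 (inclusive) to 12 (exclusive) for subs.
(def id-field (subs sample-row 0 12))                  ;=> "ABCDEFGHIJKL"

;; Second field: starts at 0-based index 12 and runs for f-len characters.
(def order-field (subs sample-row col (+ col f-len)))  ;=> " 7"
```

Without the decrement, the second field would start one character too far to the right and run past the end of this row.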

With this function in hand, we can write a parser for the dictionary file:

(defn read-dict-defn
  "Read a Stata dictionary file, return a vector of column definitions."
  [path]
  (with-open [r (io/reader path)]
    (mapv parse-dict-line (butlast (rest (line-seq r))))))

We use rest and butlast to skip the first and last lines of the file, and mapv to force eager evaluation and ensure we process all of the input before the reader is closed when we exit with-open.

(def dict (read-dict-defn "data/2002FemPreg.dct"))
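The reason for preferring mapv over map inside with-open can be seen in a small sketch. It uses a StringReader in place of a real file (an assumption made so the example needs no data on disk), and sample-reader is a hypothetical helper name.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; A StringReader stands in for a file; io/reader wraps it in a BufferedReader.
(defn sample-reader []
  (io/reader (java.io.StringReader. "first\nsecond\nthird")))

;; Eager: mapv realises every line before with-open closes the reader.
(def eager
  (with-open [r (sample-reader)]
    (mapv str/upper-case (line-seq r))))

;; Lazy: map returns an unrealised sequence, so most of the reading
;; would only happen after with-open has already closed the reader.
(def lazy
  (with-open [r (sample-reader)]
    (map str/upper-case (line-seq r))))

;; Forcing the lazy sequence now tries to read from the closed reader.
(def lazy-failed?
  (try (doall lazy)
       false
       (catch Exception _ true)))
```

Here eager holds all three upper-cased lines, while forcing lazy should fail with a "stream closed" style exception, which is the problem mapv avoids.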

The dictionary tells us the starting position (:col) and length (:f-len) of each field, so we can use subs to extract the raw value of each column from the data. This will give us a string, and the :type key we’ve extracted from the dictionary tells us how to interpret this. We’ve seen the types str12 and byte above, but what other types appear in the dictionary?

(distinct (map :type dict))
;=> ("str12" "byte" "int" "float" "double")

We’ll leave str12 unchanged, coerce byte and int to Long, and float and double to Double:

(defn parse-value
  [type raw-value]
  (when (not (empty? raw-value))
    (case type
      ("str12")          raw-value
      ("byte" "int")     (Long/parseLong raw-value)
      ("float" "double") (Double/parseDouble raw-value))))

We can now build a record parser from the dictionary definition:

(defn make-row-parser
  "Parse a row from a Stata data file according to the specification in `dict`.
   Return a vector of columns."
  [dict]
  (fn [row]
    (reduce (fn [accum {:keys [col type name f-len]}]
              (let [raw-value (str/trim (subs row col (+ col f-len)))]
                (conj accum (parse-value type raw-value))))
            []
            dict)))

To read gzipped data, we need to open an input stream, coerce this to a GZIPInputStream, and construct a buffered reader from that. For convenience, we’ll define a function to do this automatically if passed a path ending in .gz.

(import 'java.util.zip.GZIPInputStream)

(defn reader
  "Open path with io/reader; coerce to a GZIPInputStream if suffix is .gz"
  [path]
  (if (.endsWith path ".gz")
    (io/reader (GZIPInputStream. (io/input-stream path)))
    (io/reader path)))
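As a quick check of the gzip handling, the following sketch writes a small gzipped temporary file and reads it back. The function is repeated under the hypothetical name gz-reader so the example runs on its own, and the file contents are made up.

```clojure
(require '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream GZIPOutputStream)
        '(java.io File))

;; Same shape as the reader function above, renamed so this sketch is standalone.
(defn gz-reader [path]
  (if (.endsWith path ".gz")
    (io/reader (GZIPInputStream. (io/input-stream path)))
    (io/reader path)))

;; Write two lines to a temporary gzipped file (illustrative data only).
(def tmp (File/createTempFile "sample" ".dat.gz"))

(with-open [w (io/writer (GZIPOutputStream. (io/output-stream tmp)))]
  (.write w "line one\nline two\n"))

;; Read the lines back; doall forces them before the reader is closed.
(def lines
  (with-open [r (gz-reader (.getPath tmp))]
    (doall (line-seq r))))

(.delete tmp)
```

The lines come back as plain strings, so the rest of the parsing code does not need to know whether the source file was compressed.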

Given a dictionary and reader, we can parse the records from a data file:

(defn read-dct-data
  "Parse lines from `rdr` according to the specification in `dict`.
   Return a lazy sequence of parsed rows."
  [dict rdr]
  (let [parse-fn (make-row-parser dict)]
    (map parse-fn (line-seq rdr))))
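To see how a dictionary drives the row parsing, here is a small self-contained sketch. The two-column dictionary, the toy- function names and the sample rows are all hypothetical; they mirror the shape of the definitions above without needing the survey files.

```clojure
(require '[clojure.string :as str])

;; Hypothetical dictionary in the shape produced by parsing the .dct file.
(def toy-dict
  [{:col 0  :type "str12" :name "caseid"   :f-len 12}
   {:col 12 :type "byte"  :name "pregordr" :f-len 2}])

;; Cut-down value parser: strings pass through, bytes become Longs, blanks become nil.
(defn toy-parse-value [type raw]
  (when (seq raw)
    (case type
      "str12" raw
      "byte"  (Long/parseLong raw))))

;; Slice each field out of the row with subs, trim the padding, then coerce it.
(defn toy-parse-row [dict row]
  (mapv (fn [{:keys [col type f-len]}]
          (toy-parse-value type (str/trim (subs row col (+ col f-len)))))
        dict))

;; Build right-justified fixed-width rows: a 12-wide id and a 2-wide number.
(def full-row  (format "%12s%2s" "1" "1"))
(def blank-row (format "%12s%2s" "2" ""))

(def parsed-full  (toy-parse-row toy-dict full-row))   ;=> ["1" 1]
(def parsed-blank (toy-parse-row toy-dict blank-row))  ;=> ["2" nil]
```

A blank field trims down to an empty string, so it comes back as nil rather than causing a number-parsing error.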

Finally, we bring this all together with a function to parse the dictionary and data and return an Incanter dataset:

(require '[incanter.core :refer [dataset]])

(defn as-dataset
  "Read Stata data set, return an Incanter dataset."
  [dict-path data-path]
  (let [dict   (read-dict-defn dict-path)
        header (map (comp keyword :name) dict)]
    (with-open [r (reader data-path)]
      (dataset header (doall (read-dct-data dict r))))))

Getting the code

The code for all this is available on GitHub; if you’d like to follow along, you can fork my thinkstats repository.

The functions we’ve developed above are in the namespace thinkstats.dct-parser. In the next article in this series, we use our parser to explore and clean the data using Incanter.

For the latest in the Data Insights Cambridge meetup which we host, we are delighted to be welcoming Metail’s very own Shrividya Ravi to speak about the ‘A – Z of A/B testing’.

What is A – Z of A/B testing?

Randomised controlled trials have been a key part of medical science since the 18th century. More recently they have gained rapid traction in the e-commerce world where the term ‘A/B testing’ has become synonymous with businesses that are innovative and data-driven.

A/B testing has become the ‘status quo’ for retail website development – enabling product managers and marketing professionals to positively affect the customer journey; the sales funnel in particular. Combining event stream data with sound questions and good experiment design, these controlled trials become powerful tools for insight into user behaviour.

This talk will present a comprehensive overview of A/B testing, discussing both the advantages and the caveats. A series of case studies and toy examples will detail the myriad analyses possible from rich web events data. Topics covered will include inference with hypothesis testing, regression, bootstrapping, Bayesian models and parametric simulations.

The Speaker

Dr Ravi transitioned to Data Science following her PhD in experimental materials physics. Working as a Data Scientist at Metail (online try-on technology), Shriv continues experimenting and teasing insights from data.

Head to the Data Insights Cambridge Meetup page to register.

A bunch of Metail Clojurians are off to Clojure eXchange 2015 this week.

Members from the Data Science, Data Engineering and Web teams will be catching up on what’s new and seeing how others are using Clojure to solve their problems. Metail make extensive use of Clojure and ClojureScript for a lot of our internal tools. We are also currently investigating the feasibility of using ClojureScript to implement the next version of our MeModel visualisation product instead of CoffeeScript and Backbone.

Some of the Clojure tech that we currently use is: Cascalog, Om (Now and Next), Immutant, Prismatic/Schema, Figwheel. Grab one of us if you want to talk about our experiences with any of these or anything Clojure or Data related.

All the content looks really interesting. Some highlights of particular interest:

  • Bozhidar Batsov – CIDER: The journey so far and the road ahead
  • Kris Jenkins – ClojureScript: Architecting for Scale
  • Nicola Mometto – Immutable code analysis with tools.analyzer
  • Hans Hubner – Datomic in Practice

Looking forward to all the talks, catching up with old friends and making new ones. See you there.