Data Actually

David Robinson posted a great article Analyzing networks of characters in ‘Love Actually’ on 25th December 2015, which uses R to analyse the connections between the characters in the film Love Actually.

This Jupyter notebook and the associated python code attempts to reproduce his analysis using tools from the Python ecosystem.

I tweeted a link to an initial version of this notebook over the Christmas holidays and republished it here in response to interest from my colleagues in Metail Tech.

Package setup

To start we need to import some useful packages, shown below. Some of these needed to be installed using the Anaconda distribution or pip if conda failed:

  • pip install ggplot
  • conda install graphviz
  • conda install networkx
  • conda install pyplot
In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage
import ggplot as gg
import networkx as nx
In [2]:
%matplotlib inline

Data import

First we need to define a couple of functions to read the ‘Love Actually’ script into a list of lines and the cast into a Pandas DataFrame. The data_dir variable gets the location of the directory containing the input files from an environment variable so set this environment variable to the location appropriate for you if you’re following along.

In [3]:
data_dir = os.path.join(os.getenv('MDA_DATA_DIR', '/home/mattmcd/Work/Data'), 'LoveActually') 
# Alternative: data_dir = os.getcwd() # And copy files to same directory as script

def read_script():
    """Read the Love Actually script from text file into list of lines
    The script is first Google hit for 'Love Actually script' as a doc
    file.  Use catdoc or Libre Office to save to text format.
    """
    with open(os.path.join(data_dir, 'love_actually.txt'), 'r') as f:
        lines = [line.strip() for line in f]
    return lines

def read_actors():
    """Read the mapping from character to actor using the varianceexplained data file
    Used curl -O http://varianceexplained.org/files/love_actually_cast.csv to get a local copy
    """
    return pd.read_csv(os.path.join(data_dir, 'love_actually_cast.csv'))

The cell below reproduces the logic in the first cell of the original article. It doesn’t feel quite as nice to me as the dplyr syntax but is not too bad.

In [4]:
def parse_script(raw):
    df = pd.DataFrame(raw, columns=['raw'])

    df = df.query('raw != ""')
    df = df[~df.raw.str.contains("(song)")]
    lines = (df.
             assign(is_scene=lambda d: d.raw.str.contains(" Scene ")).
             assign(scene=lambda d: d.is_scene.cumsum()).
             query('not is_scene'))
    speakers = lines.raw.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
    lines = (pd.concat([lines, speakers], axis=1).
             dropna().
             assign(line=lambda d: np.cumsum(~d.speaker.isnull())))

    lines.drop(['raw', 'is_scene'], axis=1, inplace=True)

    return lines
In [5]:
def read_all():
    lines = parse_script(read_script())
    cast = read_actors()
    combined = lines.merge(cast).sort('line').assign(
        character=lambda d: d.speaker + ' (' + d.actor + ')').reindex()
    # Decode bytes to unicode
    combined['character'] = map(lambda s: s.decode('utf-8'), combined['character'])
    return combined
In [6]:
# Read in script and cast into a dataframe
lines = read_all()
# Print the first few rows
lines.head(5)
Out[6]:
scene speaker dialogue line actor character
0 2 Billy ♪ I feel it in my fingers ♪ I feel it in my t… 2 Bill Nighy Billy (Bill Nighy)
34 2 Joe I’m afraid you did it again, Bill. 3 Gregor Fisher Joe (Gregor Fisher)
1 2 Billy It’s just I know the old version so well, you… 4 Bill Nighy Billy (Bill Nighy)
35 2 Joe Well, we all do. That’s why we’re making the … 5 Gregor Fisher Joe (Gregor Fisher)
2 2 Billy Right, OK, let’s go. ♪ I feel it in my finger… 6 Bill Nighy Billy (Bill Nighy)

Constructing the n_character x n_scene matrix showing how many lines each character has in each scene is quite easy using pandas groupby method to create a hierarchical index, followed by the unstack method to convert the second level of the index into columns.

In [7]:
def get_scene_speaker_matrix(lines):
    by_speaker_scene = lines.groupby(['character', 'scene'])['line'].count()
    speaker_scene_matrix = by_speaker_scene.unstack().fillna(0)
    return by_speaker_scene, speaker_scene_matrix
In [8]:
# Group by speaker and scene and construct the speaker-scene matrix
by_speaker_scene, speaker_scene_matrix = get_scene_speaker_matrix(lines)

Analysis

Now we get to the analysis itself. First we perform a hierarchical clustering of the data using the same data normalization and clustering method as the original article. The leaf order in the dendrogram is returned for use in later steps of the analysis, as it has similar characters close to each other.

In [9]:
def plot_dendrogram(mat, normalize=True):
    # Cluster and plot dendrogram.  Return order after clustering.
    if normalize:
        # Normalize by number of lines
        mat = mat.div(mat.sum(axis=1), axis=0)
    Z = linkage(mat, method='complete', metric='cityblock')
    labels = mat.index
    f = plt.figure()
    ax = f.add_subplot(111)
    R = dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
               labels=labels, ax=ax, color_threshold=-1)
    f.tight_layout()
    ordering = R['ivl']
    return ordering
In [10]:
# Hierarchical cluster and return order of leaves
ordering = plot_dendrogram(speaker_scene_matrix)
da_dendrogram
In [11]:
print(ordering)
[u'Peter (Chiwetel Ejiofor)', u'Juliet (Keira Knightley)', u'Mark (Andrew Lincoln)', u'Harry (Alan Rickman)', u'Mia (Heike Makatsch)', u'Karl (Rodrigo Santoro)', u'Sarah (Laura Linney)', u'Karen (Emma Thompson)', u'Natalie (Martine McCutcheon)', u'PM (Hugh Grant)', u'Daniel (Liam Neeson)', u'Sam (Thomas Sangster)', u'Colin (Kris Marshall)', u'Tony (Abdul Salis)', u'Billy (Bill Nighy)', u'Joe (Gregor Fisher)', u'Jack (Martin Freeman)', u'Judy (Joanna Page)', u'Aurelia (L\xfacia Moniz)', u'Jamie (Colin Firth)']

Timeline

Plotting the timeline of character versus scene is very similar to the R version since we make use of the Python ggplot port from yhat. It seems like there are a few differences between this package and the R ggplot2 library.

In particular:

  • ggplot does not seem to handle categorical variables so it was necessary to introduce an extra character_code dimension
  • it didn’t seem possible to change the y-axis tick labels to character names so here the axis directions are swapped
  • the aes (aesthetic) does not seem to support ‘group’ so the geom_path joining characters in the same scene has been left out

(note that these points may just be limitations of my understanding of ggplot in Python)

In [12]:
def get_scenes_with_multiple_characters(by_speaker_scene):
    # Filter speaker scene dataframe to remove scenes with only one speaker

    # n_scene x 1 Series with index 'scene'
    filt = by_speaker_scene.count('scene') > 1
    # n_scene x n_character Index
    scene_index = by_speaker_scene.index.get_level_values('scene')
    # n_scene x n_character boolean vector
    ind = filt[scene_index].values
    return by_speaker_scene[ind]
In [13]:
def order_scenes(scenes, ordering=None):
    # Order scenes by e.g. leaf order after hierarchical clustering
    scenes = scenes.reset_index()
    scenes['scene'] = scenes['scene'].astype('category')
    scenes['character'] = scenes['character'].astype('category', categories=ordering)
    scenes['character_code'] = scenes['character'].cat.codes
    return scenes
In [14]:
# Order the scenes by cluster leaves order
scenes = order_scenes(get_scenes_with_multiple_characters(by_speaker_scene), ordering)
In [15]:
def plot_timeline(scenes):
    # Plot character vs scene timelime
    # NB: due to limitations in Python ggplot we need to plot with scene on y-axis
    # in order to label x-ticks by character.
    # scale_x_continuous and scale_y_continuous behave slightly differently.

    print (gg.ggplot(gg.aes(y='scene', x='character_code'), data=scenes) +
            gg.geom_point() + gg.labs(x='Character', y='Scene') +
           gg.scale_x_continuous(
               labels=scenes['character'].cat.categories.values.tolist(),
           breaks=range(len(scenes['character'].cat.categories))) +
           gg.theme(axis_text_x=gg.element_text(angle=30, hjust=1, size=10)))
In [16]:
# Plot a timeline of characters vs scene
plot_timeline(scenes);
da_characters

Co-occurrence matrix

Next we construct the co-occurrence matrix showing how often characters share scenes, and visualize using a heatmap and network graph.

In [17]:
def get_cooccurrence_matrix(speaker_scene_matrix, ordering=None):
    # Co-occurrence matrix for the characters, ignoring last scene where all are present
    scene_ind = speaker_scene_matrix.astype(bool).sum() < 10
    if ordering:
        mat = speaker_scene_matrix.loc[ordering, scene_ind]
    else:
        mat = speaker_scene_matrix.loc[:, scene_ind]
    return mat.dot(mat.T)
In [18]:
cooccur_mat = get_cooccurrence_matrix(speaker_scene_matrix, ordering)

The heatmap below is not as nice as the default R heatmap as it is missing the dendrograms on each axis and also the character names, so could be extended e.g. following Hierarchical Clustering Heatmaps in Python.

It otherwise shows a similar result to the original article in that ordering using the dendrogram leaf order has resulted in a co-occurrence matrix predominantly of block diagonal form.

In [19]:
def plot_heatmap(cooccur_mat):
    # Plot co-ccurrence matrix as heatmap
    plt.figure()
    plt.pcolor(cooccur_mat)
In [20]:
# Plot heatmap of co-occurrence matrix
plot_heatmap(cooccur_mat)
da_connectivity

The network plot gives similar results to the original article. This could be extended, for example by adding weights to the graph edges.

In [21]:
def plot_network(cooccur_mat):
    # Plot co-occurence matrix as network diagram
    G = nx.Graph(cooccur_mat.values)
    pos = nx.graphviz_layout(G)  # NB: needs pydot installed
    plt.figure(num=None, figsize=(15, 15), dpi=80)
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
    nx.draw_networkx_edges(G, pos)
    nx.draw_networkx_labels(
        G, pos,
        labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
        font_size=10)
    plt.axis('off')
    plt.show()
In [22]:
# Plot network graph of co-occurrence matrix
plot_network(cooccur_mat)
da_network

Conclusion

This notebook attempted to reproduce using Python the R analysis and visualization of the character network in ‘Love Actually’. Overall this was a useful exercise in learning similarities and differences between the tools, as well as becoming more familiar with the Pandas syntax. The output currently doesn’t look quite as nice as the R original but is a good starting point for future tinkering.

ReSharper C++In the Visualization Team, we’ve recently started using ReSharper C++ 10.0.2 in Microsoft Visual Studio to assist with writing C++ code. We also have some home-grown (and predictably ugly) C++ preprocessor macros specifically for the Microsoft compiler to help with various code quality issues such as automatic coverage reports, execution profiling, unit test generators and so on. The problem is that when ReSharper parses the C++ source code, it does so with a slightly different “interpretation” of the the C++ preprocessor specification than Microsoft’s compiler and the IntelliSense parser. I’m not going to get involved with any argument as to which is “most correct”; let’s just say that they are different. Different enough that ReSharper complains about our macros, even though the compiler thinks they’re fine. If we could detect when the ReSharper parser is looking at our code, we could simplify the macro definitions and stop ReSharper complaining.

Microsoft Visual Studio helpfully provides a preprocessor macro named ‘__INTELLISENSE__‘ that allows you to detect when it is IntelliSense that is parsing your source code, but I couldn’t find the equivalent for ReSharper. That’s not to say that one doesn’t exist, but I couldn’t find any on-line documentation for one.

However, there obviously is a difference between the Microsoft and JetBrains parsers (otherwise we wouldn’t need to distinguish between them!) so can we use that variation to detect who is parsing our source code? The difference that is causing our macros problems is the way that macro argument tokens are pasted together and joined. Here’s an example:

#define LITERAL(a) a
#define JOIN(a,b) LITERAL(a)LITERAL(b)
#define BEFOREAFTER 1

The first question is: why aren’t we using the token pasting operator? Ironically, that operator, ‘##‘, solves all our problems (in this case). Therefore it’s not a candidate for distinguishing the two parsers. So, given the macros above, what does the following expand to?

JOIN(BEFORE,AFTER)

Well, the Microsoft products (as of MSVC 2013) expand it to ‘1‘ whereas ReSharper expands it to two tokens: ‘BEFORE‘ immediately followed by ‘AFTER‘. Fascinating, but not particularly useful, surely? Ah, but consider this:

#if JOIN(BEFORE,AFTER)
   // We're being parsed by Microsoft products
#else
   // We're being parsed by ReSharper
#endif

Inside, the ReSharper parser is no doubt bitterly fuming about the malformed #if‘ condition; but it does so silently and the test condition ultimately fails.

Putting it all together gives us:

#define RESHARPER_LITERAL(a) a
#define RESHARPER_JOIN(a,b) RESHARPER_LITERAL(a)RESHARPER_LITERAL(b)
#define RESHARPER_DISABLED 1
#if RESHARPER_JOIN(RESHARPER,_DISABLED)
#define __RESHARPER__ 0
#else
#define __RESHARPER__ 1
#endif
#undef RESHARPER_DISABLED
#undef RESHARPER_JOIN
#undef RESHARPER_LITERAL

Of course, this code snippet is preceded by a huge comment explaining why we’re abusing the preprocessor quite so badly, and suggesting that the reader pretends she never saw it.

This is the fourth instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. This week we made a start on chapter 2 of the book, which introduces us to statistical distributions by way of histograms. This was our first encounter with the incanter.charts namespace, which we use to plot histograms of some values from the National Survey for Family Growth dataset we have worked with in previous sessions.

You can find previous instalments from the study group on our blog:

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

Change into the project directory and fire up Gorilla REPL:

cd thinkstats
lein gorilla

Getting Started

As usual, we start out with a namespace declaration that loads the namespaces we’ll need:

(ns radioactive-darkness
  (:require [incanter.core :as i
               :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [incanter.charts :as c]
            [incanter-gorilla.render :refer [chart-view]]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))

There are two additions since last time: incanter.charts mentioned above, and incanter-gorilla.render that provides a function to display Incanter charts in Gorilla REPL.

We start by generating a vector of random integers to play with:

(def xs (repeatedly 100 #(rand-int 5)))

We can generate a histogram from these data:

(def h (c/histogram xs))

This returns a JFreeChart object that we can display in Gorilla REPL with chart-view:

(chart-view h)

histogram-1

If you’re running from a standard Clojure REPL, you should use the view function from incanter.core instead:

(i/view h)

The first thing we notice about this is that the default number of bins is not optimal for our data; let’s look at the documentation for histogram to see how we might change this.

(require '[clojure.repl :refer [doc]])
(doc c/histogram)

We see that the :nbins option controls the number of bins. We can also set the title and labels for the axes by specifiyng :title, :x-label and :y-label respectively.

(chart-view (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency"))

histogram-2

We can save the histogram as a PNG file:

(i/save (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency")
            "histogram-1.png")

Birth Weight

Now that we know how to plot histograms, we can start to visualize values from the NSFG data set. We start by loading the data:

(def ds (f/fem-preg-ds))

Plot the pounds part of birth weight (note the use of $! to exclude nil values):

(chart-view (c/histogram ($! :birthwgt-lb ds) :x-label "Birth weight (lb)"))

histogram-3

…and the ounces part of birth weight:

(chart-view (c/histogram ($! :birthwgt-oz ds) :x-label "Birth weight (oz)"))

histogram-4

We can see immediately that these charts are very different, reflecting the different “shapes” of the data. What we see fits well with our intuition: we expect the ounces component of the weight to be distributed fairly evenly, while most newborns are around 7lb or 8lb and babies bigger than 10lb at birth are rarely seen.

Recall that we also computed the total weight in pounds and added :totalwgt-lb to the dataset:

(chart-view (c/histogram ($! :totalwgt-lb ds) :x-label "Total weight (lb)"))

histogram-5

This does not look much different from the :birthwgt-lb histogram, as this value dominates ounces in the computaiton

A Few More Histograms

The shape of a histogram tells us how the data are distributed: it may be approximately flat like the :birthwgt-oz histogram, or bell-shaped like :birthwgt-lb, or an asymetrical bell (with longer tail to the left or to the right) like the following two.

(chart-view (c/histogram ($! :ageatend ds)
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-6

Let’s try that again, excluding the outliers with an age over 60:

(chart-view (c/histogram (filter #(< % 60) ($! :ageatend ds))
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-7

Finally, let’s look at pregnancy length for live births:

(chart-view (c/histogram ($! :prglngth ($where {:outcome 1} ds))
                         :x-label "Weeks"
                         :title "Pregnancy length (live births)"))

histogram-8

We have now reached the end of section 2.4 of the book, and will pick up next time with section 2.5.

One of the user stories I had to tackle in a recent sprint was to import data maintained by a non-technical colleague in a Google Spreadsheet into our analytics database. I quickly found a Java API for Google Spreadsheets that looked promising but turned out to be more tricky to get up and running than expected at first glance. In this article, I show you how to use this library from Clojure and avoid some of the pitfalls I fell into.

Google Spreadsheets API

The GData Java client referenced in the Google Spreadsheets API documentation uses an old XML-based protocol, which is mostly deprecated. We are recommended to use the newer, JSON-based client. After chasing my tail on this, I discovered that Google Spreadsheets does not yet support this new API and we do need the GData client after all.

The first hurdle: dependencies

The GData Java client is not available from Maven, so we have to download a zip archive. The easiest way to use these from a Leiningen project is to use mvn to install the required jar files in our local repository and specify the dependencies in the usual way. This handy script automates the process, only downloading the archive if necessary. (For this project, we only need the gdata-core and gdata-spreadsheet jars, but the script is easily extended if you need other components.)

#!/bin/bash

set -e

function log () {
    echo "$1" >&2
}

function install_artifact () {
    log "Installing artifact $2"
    mvn install:install-file -DgroupId="$1" -DartifactId="$2" -Dversion="$3" -Dfile="$4" \
        -Dpackaging=jar -DgeneratePom=true
}

R="${HOME}/.m2/repository"
V="1.47.1"
U="http://storage.googleapis.com/gdata-java-client-binaries/gdata-src.java-${V}.zip"

if test -r "${R}/com/google/gdata/gdata-core/1.0/gdata-core-1.0.jar" \
        -a -r "${R}/com/google/gdata/gdata-spreadsheet/3.0/gdata-spreadsheet-3.0.jar";
then
    log "Artifacts up-to-date"
    exit 0
fi

log "Downloading $U"
cd $(mktemp -d)
wget "${U}"
unzip "gdata-src.java-${V}.zip"

install_artifact com.google.gdata gdata-core 1.0 gdata/java/lib/gdata-core-1.0.jar

install_artifact com.google.gdata gdata-spreadsheet 3.0 gdata/java/lib/gdata-spreadsheet-3.0.jar

Once we’ve installed these jars, we can configure dependencies as follows:

(defproject gsheets-demo "0.1.0-SNAPSHOT"
  :description "Google Sheets Demo"
  :url "https://github.com/ray1729/gsheets-demo"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [com.google.gdata/gdata-core "1.0"]
                 [com.google.gdata/gdata-spreadsheet "3.0"]])

The second hurdle: authentication

This is a pain, as the documentation for the GData Java client is incomplete and at times confusing, and the examples it ships with no longer work as they use a deprecated OAuth version. The example Java code in the documentation tells us:

// TODO: Authorize the service object for a specific user (see other sections)

The other sections were no more enlightening, but after more digging and reading of source code, I realized we can use the google-api-client to manage our OAuth credentials and simply pass that credentials object to the GData client. This library is already available from a central Maven repository, so we can simply update our project’s dependencies to pull it in:

:dependencies [[org.clojure/clojure "1.8.0"]
               [com.google.api-client/google-api-client "1.21.0"]
               [com.google.gdata/gdata-core "1.0"]
               [com.google.gdata/gdata-spreadsheet "3.0"]]

OAuth credentials

Before we can start using OAuth, we have to register our client with Google. This is done via the Google Developers Console. See Using OAuth 2.0 to Access Google APIs for full details, but here’s a quick-start guide to creating credentials for a service account.

Navigate to the Developers Console. Click on Enable and manage APIs and select Create a new project. Enter the project name and click Create.

Once project is created, click on Credentials in the sidebar, then the Create Credentials drop-down. As our client is going to run from cron, we want to enable server-to-server authentication, so select Service account key. On the next screen, select New service account and enter a name. Make sure the JSON radio button is selected, then click on Create.

Copy the downloaded JSON file into your project’s resources directory. It should look something like:

{
  "type": "service_account",
  "project_id": "gsheetdemo",
  "private_key_id": "041db3d758a1a7ef94c9c59fb3bccd2fcca41eb8",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "gsheets-demo@gsheetdemo.iam.gserviceaccount.com",
  "client_id": "106215031907469115769",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/gsheets-demo%40gsheetdemo.iam.gserviceaccount.com"
}

We’ll use this in a moment to create a GoogleCredential object, but before that navigate to Google Sheets and create a test spreadsheet. Grant read access to the spreadsheet to the email address found in client_email in your downloaded credentials.

A simple Google Spreadsheets client

We’re going to be using a Java client, so it should come as no surprise that our namespace imports a lot of Java classes:

(ns gsheets-demo.core
  (:require [clojure.java.io :as io])
  (:import com.google.gdata.client.spreadsheet.SpreadsheetService
           com.google.gdata.data.spreadsheet.SpreadsheetFeed
           com.google.gdata.data.spreadsheet.WorksheetFeed
           com.google.gdata.data.spreadsheet.CellFeed
           com.google.api.client.googleapis.auth.oauth2.GoogleCredential
           com.google.api.client.json.jackson2.JacksonFactory
           com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
           java.net.URL
           java.util.Collections))

We start by defining some constants for our application. The credentials resource is the JSON file we downloaded from the developer console:

(def application-name "gsheetdemo-v0.0.1")

(def credentials-resource (io/resource "GSheetDemo-041db3d758a1.json"))

(def oauth-scope "https://spreadsheets.google.com/feeds")

(def spreadsheet-feed-url (URL. "https://spreadsheets.google.com/feeds/spreadsheets/private/full"))

With this in hand, we can create a GoogleCredential object and initialize the Google Sheets service:

(defn get-credential
  []
  (with-open [in (io/input-stream credentials-resource)]
    (let [credential (GoogleCredential/fromStream in)]
      (.createScoped credential (Collections/singleton oauth-scope)))))

(defn init-service
  []
  (let [credential (get-credential)
        service (SpreadsheetService. application-name)]
    (.setOAuth2Credentials service credential)
    service))

Let’s try it at a REPL:

lein repl

user=> (require '[gsheets-demo.core :as gsheets])
nil
user=> (def service (gsheets/init-service))
#'user/service
user=> (.getEntries (.getFeed service
                              gsheets/spreadsheet-feed-url
                              com.google.gdata.data.spreadsheet.SpreadsheetFeed))
(#object[com.google.gdata.data.spreadsheet.SpreadsheetEntry 0x43ab2a3e "com.google.gdata.data.spreadsheet.SpreadsheetEntry@43ab2a3e"])

Great! We can see the one spreadsheet we granted our service account read access. Let’s wrap this up in a function and implement a helper to find a spreadsheet by name:

(defn list-spreadsheets
  [service]
  (.getEntries (.getFeed service spreadsheet-feed-url SpreadsheetFeed)))

(defn find-spreadsheet-by-title
  [service title]
  (let [spreadsheets (filter (fn [sheet] (= (.getPlainText (.getTitle sheet)) title))
                             (list-spreadsheets service))]
    (if (= (count spreadsheets) 1)
      (first spreadsheets)
      (throw (Exception. (format "Found %d spreadsheets with name %s"
                                 (count spreadsheets)
                                 title))))))

Back at the REPL:

user=> (def spreadsheet (gsheets/find-spreadsheet-by-title service "Colour Counts"))
user=>  (.getPlainText (.getTitle spreadsheet))
"Colour Counts"

A spreadsheet contains one or more worksheets, so the next functions we implement take a SpreadsheetEntry object and list or search worksheets:

(defn list-worksheets
  [service spreadsheet]
  (.getEntries (.getFeed service (.getWorksheetFeedUrl spreadsheet) WorksheetFeed)))

(defn find-worksheet-by-title
  [service spreadsheet title]
  (let [worksheets (filter (fn [ws] (= (.getPlainText (.getTitle ws)) title))
                           (list-worksheets service spreadsheet))]
    (if (= (count worksheets) 1)
      (first worksheets)
      (throw (Exception. (format "Found %d worksheets in %s with name %s"
                                 (count worksheets)
                                 spreadsheet
                                 title))))))

…and at the REPL:

user=> (def worksheets (gsheets/list-worksheets service spreadsheet))
user=> (map (fn [ws] (.getPlainText (.getTitle ws))) worksheets)
("Sheet1")

Our next function returns the cells belonging to a worksheet:

(defn get-cells
  [service worksheet]
  (map (memfn getCell) (.getEntries (.getFeed service (.getCellFeedUrl worksheet) CellFeed))))

This gives us a flat list of Cell objects. It will be much more convenient to work in Clojure with a nested vector of the cell values:

(defn to-nested-vec
  [cells]
  (mapv (partial mapv (memfn getValue)) (partition-by (memfn getRow) cells)))

We now have all the building blocks for the function that will be the main entry point to our minimal Clojure API:

(defn fetch-worksheet
  [service {spreadsheet-title :spreadsheet worksheet-title :worksheet}]
  (if-let [spreadsheet (find-spreadsheet-by-title service spreadsheet-title)]
    (if-let [worksheet (find-worksheet-by-title service spreadsheet worksheet-title)]
      (to-nested-vec (get-cells service worksheet))
      (throw (Exception. (format "Spreadsheet '%s' has no worksheet '%s'"
                                 spreadsheet-title worksheet-title))))
    (throw (Exception. (format "Spreadsheet '%s' not found" spreadsheet-title)))))

With this in hand:

user=> (def sheet (gsheets/fetch-worksheet service {:spreadsheet "Colour Counts" :worksheet "Sheet1"}))
#'user/sheet
user=> (clojure.pprint/pprint sheet)
[["Colour" "Count"]
 ["red" "123"]
 ["orange" "456"]
 ["yellow" "789"]
 ["green" "101112"]
 ["blue" "131415"]
 ["indigo" "161718"]
 ["violet" "192021"]]
nil

Our to-nested-vec function returns the cell values as strings. I could have used the getNumericValue method instead of getValue, but then to-nested-vec would have to know what data type to expect in each cell. Instead, I used Plumatic Schema to define a schema for each row, and used its data coercion features to coerce each column to the desired data type – but that’s a blog post for another day.

Code for the examples above is available on Github https://github.com/ray1729/gsheets-demo. We have barely scratched the surface of the Google Spreadsheets API; check out the API Documentation if you need to extend this code, for example to create or update spreadsheets.

OnyxMetail is hosting the next meetup of Cambridge NonDysfunctional Programmers next Thursday, 17th March. This month we’ll be taking a look at Onyx, a distributed cloud computing platform implemented in Clojure. We’re currently using Cascalog to process data on a Hadoop cluster, and are considering Onyx as a possible alternative. It will be interesting for us to hear what our local Clojure community makes of this new kid on the block.

One of my favourite talks at the recent Clojure Remote conference was Michael Drogalis’s keynote, where he discussed some of the principles behind Onyx’s data-driven API. At the Meetup, we’ll watch Michael’s Onyx talk from last year’s Clojure/conj. After the video, we’ll work through the getting started guide and tutorial together. Please see the Meetup page for full details.

I have been toying around with Clojure for five or six years now and whilst I really enjoy the way it allows me to think and to solve problems I find it difficult to come up with small but fun projects to work on to help me learn the language. I have spent many happy hours on sites such as the venerable project euler and more recently 4clojure and learned plenty but I don’t think either is as good a resource for problems which help you become a better coder as advent of code does because neither encourages as much refactoring.

OK, ok I've only done five so far..

Advent of code Christmas tree

The site has 25 days of questions and each one follows a similar pattern: (disclaimer: I haven’t completed all of them yet or even looked at them all so for all I know some of the later ones break the mould but I hope not).

  1. Explain some rules which typically will require you to parse text to numbers
  2. Provide some short examples which could be used as unit tests
  3. Provide access to a unique set of input to your problem

I will now show you as an example how the first problem helped me to write better code. (I have slightly exaggerated how bad some of my initial solutions were to highlight the learning and have given you full access to my inner monologue.)

Day 1

I’ve read the rubric and it is clear that I am going to need a function which converts a brace to a plus or minus one. In none of the examples do I see anything other than a ‘(‘ or a ‘)’ so I assume I can get away with the following:

(defn brace->movement
  [brace]
  (if (= "(" brace)
    1
    -1))

The first functional programming I did was some ML 16 or 17 years ago so I leap to recursion to solve the rest of the problem.

(defn calculate-floor
  [braces floor]
  (if (empty? braces)
    floor
    (let [h (first braces)
      m (brace->movement h)
      f (+ floor m)]
      (calculate-floor (rest braces) f))))

I start testing …

user> (calculate-floor "(())" 0)
-4
user> (calculate-floor "()()" 0)
-4

well this is not going well, both of these should have resulted in ‘0’. It looks like something must be going wrong with my brace->movement parsing function so I will just test that at the REPL. (I guess I should have done that first huh?)

user> (brace->movement "(")
1
user> (brace->movement ")")
-1

Hmmm.. nope, not that. Everything working perfectly. OK something other than ‘(‘ or ‘)’ must be being passed to that function. It must be first that is the problem:

user> (first "(())")
\(

Ah ha! Of course a string is a sequence of characters and if I call first on a sequence of characters it must give me a character. So now do I go for a function that parses a character such as:

(defn brace->movement'
  [brace]
  (if (= \( brace)
    1
    -1))

or one leave it untouched and convert the calling function to pass a string:

(defn calculate-floor'
  [braces floor]
  (if (empty? braces)
    floor
    (let [h (str (first braces))
          m (brace->movement h)
          f (+ floor m)]
      (calculate-floor' (rest braces) f))))

I prefer the second alternative as I have already proved to myself that the parse function works correctly when passed a string.

back to testing…

user> (calculate-floor' "(())" 0)
0
user> (calculate-floor' "()()" 0)
0

Woo hoo! And not only that, it passes all the test cases. The actual problem is a fairly long string which I don’t really want to copy and paste in to the REPL so I create a helper function to load a string from a file by adapting some code I saw in a different project written by a friend of mine (this code “just worked” and allowed me to focus on the problem that I actually wanted to solve so I didn’t question it then and I am not going to go into it now):

(ns advent.core
  (:require [clojure.java.io :as io]))

(defn read-input-string
  "returns the single string read from the file"
  [file]
  (with-open [rdr (io/reader file)]
    (first (line-seq rdr))))

I test it

user> (require '[advent.core :refer [read-input-string]])
nil
user> (def data (read-input-string "data/day-1"))
#'user/data
user> (take 10 data)
(\( \) \( \) \( \( \) \( \) \()

All looks good so I can wrap up my functions to answer day 1.

(ns advent.day-1
  (:require [advent.core :refer [read-input-string]]))

(defn day-1
  []
  (let [braces (read-input-string "data/day-1")]
    (calculate-floor' braces 0)))

user> (day-1)
java.lang.StackOverflowError: null...

Oh dear oh dear. I tried only using half of the input and I do get an answer. At first I think maybe I should divide and conquer – split the input string into a list of lists each of which is short enough to not cause a stack overflow error then combine the answers (I have heard of map reduce you see). But come on. This really isn’t a lot of input; 7000 characters. Map reduce is for big data sets and yes it might work for me here but seems like overkill and I really only want exactly the right amount of kill.

OK, maybe recursion isn’t the best way to go (NOTE: there is no problem with recursion to solve this problem and one way was suggested to me later with I go through at the end of this piece). So what other choices do I have? Well actually I start off with a list of braces I want to convert them to a list of plus or minus ones then sum them. I know that I can sum a sequence of numbers using (reduce + numbers). And actually I can create the list of numbers by using map.

(defn calculate-floor''
  [braces]
  (let [numbers (map brace->movement braces)]
    (reduce + numbers)))

Run the test cases again…

user> (calculate-floor'' "(())")
-4

Drat! I forgot the character to string conversion. But, if I supply a sequence of strings to the function it should be fine, which I can do by calling map str on a string:

user> (map str "(())")
("(" "(" ")" ")")
user> (calculate-floor'' (map str "(())"))
0

And it passes all the rest of the tests too. Good news so I am happy with this function, now to modify the calling function to pass the right thing.

(defn day-1'
  []
  (let [braces (read-input-string "data/day-1")
        list   (map str braces)]
    (calculate-floor'' list)))

That gives me an answer (280 if you are interested) which I submit and find I am correct! Woohoo! Now I am given access to the extension problem. Before I go on to that let me just give the a tidied up and complete version of all the code so far.

(ns advent.core
  (:require [clojure.java.io :as io]))

(defn read-input-string
  "returns the single string read from the file"
  [file]
  (with-open [rdr (io/reader file)]
    (first (line-seq rdr))))
(ns advent.day-1
  (:require [advent.core :refer [read-input-string]]))

(defn brace->movement
  [brace]
  (if (= "(" brace)
    1
    -1))

(defn calculate-floor
  [braces]
  (let [numbers (map brace->movement braces)]
    (reduce + numbers)))

(defn day-1
  []
  (let [braces (read-input-string "data/day-1")
        list (map str braces)]
    (calculate-floor list)))

That looks pretty short, but I don’t think that some of the intermediate assignments make much sense. The body of calculate-floor could actually be a one line without any loss of clarity:

(reduce + (map brace->movement braces))

and why introduce list in day-1? I know I need a list of strings as input so lets do the map first thing. The body would now be:

(let [braces (map str (read-input-string "data/day-1"))]
  (calculate-floor braces))

So now the cleaned up version is:

(ns advent.day-1
  (:require [advent.core :refer [read-input-string]]))

(defn brace->movement
  [brace]
  (if (= "(" brace)
    1
    -1))

(defn calculate-floor
  [braces]
  (reduce + (map brace->movement braces)))

(defn day-1
  []
  (let [braces (map str (read-input-string "data/day-1"))]
    (calculate-floor braces)))

calculate-floor looks a bit ridiculous now – is it really necessay to define a function for that one liner? I don’t think so as the only function which is not part of the language itself is brace->movement and that has already been tested so I refactor the code again to end up with:

(ns advent.day-1
  (:require [advent.core :refer [read-input-string]]))
(defn brace->movement
  [brace]
  (if (= "(" brace)
    1
    -1))

(defn day-1
  []
  (let [braces (map str (read-input-string "data/day-1"))]
    (reduce + (map brace->movement braces))))

I’m pretty happy with that, it is succinct and clear (in my estimation) and still gives the correct answer. If I had a gripe it would be that the brace->movement function is not as tightly defined as it could be; it does pass the unit tests and does its job well enough but maybe the extension introduces some new kind of brace. I decide it is worth another refactor:

(defn brace->movement
  [brace]
  (cond
    (= "(" brace) 1
    (= ")" brace) -1))

Much more satisfactory – the function does exactly what it needs to and doesn’t understand anthing else which means I should be able to pick up if there is input other than ‘(‘ or ‘)’ more quickly.

The extension

And so on to the extension. Again, these tend to follow a fairly standard pattern:

  1. Add some further constraints or stopping conditions
  2. Provide some examples/test cases for the extension
  3. Calcuate the answer using the same input set as the original problem.

For the day 1 extension I need to know the position of the brace which first makes the sum negative. First of all I notice that my refactored answer is far less testable than it used to be – I have only a single function which does everything and it accepts no input. Being able to test and refactor each method at the REPL individually whilst I was building up the solution was invaluable so I set up a stub function which will do the work for the extension and hook it into the existing functionality in a way that will allow me to test the extension:

(defn extension
  "given a sequence of plus and minus ones will return the first position for which the sum becomes negative"
  [movements])
  
(defn day-1
  []
  (let [braces (map str (read-input-string "data/day-1"))
        movements (map brace->movement braces)
        ans (reduce + movements)
        ext (extension movements)]
    (println "answer: " ans " extension: " ext)))

First of all lets create a function to convert the sequence of movements to a sequence of maps where each map contains the original movement and the index; I’ve come across map-indexed and think that is pretty much exactly what I need:

(defn add-indices
  [movements]
  (map-indexed (fn [idx mvmt] {:idx (inc idx) :mvmt mvmt}) movements))

I have now learnt my lesson and know I should test this whilst it is fresh in my mind, so here goes (though I called the argument “movements” it can be anything which makes testing nice and easy):

user> (add-indices [:a :b :c])
({:idx 1, :mvmt :a} {:idx 2, :mvmt :b} {:idx 3, :mvmt :c})

Recursion didn’t work so well for the original problem but that was because the sequence was too long and I needed to process every value in it. In this case I expressly don’t want to process every value but stop when a certain criteria has been hit so I don’t think reduce is going to help me much and I go back to recursion. A helper function extension to do the work:

(defn extension
  [movements sum]
  (let [h (first movements)
        s (+ sum (:mvmt h))]
    (if (< s 0)
      (:idx h)
      (extension (rest movements) s))))

And I can now test it:

user> (extension (add-indices [-1]) 0)
1
user> (extension (add-indices [1 -1 1 -1 -1]) 0)
5

Seems to be working on the test cases so I jump in with both feet:

user> (def braces (map str (read-input-string "data/day-1")))
#'user/braces
user> (def movements (map brace->movement braces))
#'user/movements
user> (def indexed (add-indices movements))
#'user/indexed
user> (extension indexed 0)
1797

I submit this answer and presto! I am right. But I am not satisfied; I have a solution but I don’t have a good one. What if the answer was actually 7000? I would not have been able to find that instead I would have a stack overflow error again. In fact for all I know if it had been the 3501st value that pushed Santa over the edge into the basement that would have caused a stack overflow. To prove that I have a poor solution I devise a stress test:

user> (def stress-input (flatten [(repeat 3500 "(") (repeat 3500 ")") braces]))
#'user/stress-input
user> (def movements' (map brace->movement stress-input))
#'user/movements'
user> (def indexed (add-indices movements))
#'user/indexed'
user> (extension indexed' 0)
stack overflow error

expecting 8797

Recursion seems to be the right answer, but how to do that in a scalable way? Well, I have heard of loop .. recur as a Clojurey thing so off I go to investigate that:

(defn extension
  [movements]
  (loop [indexed (add-indices movements)
         floor   0]
    (let [n          (first indexed)
          next-floor (+ floor (:mvmt n))]
      (if (< next-floor 0)
        (:idx n)
        (recur (rest indexed) next-floor)))))

user> (extension movements)
1797
user> (extension movements')
8797

And the complete solution to day 1 is now:

(ns advent.day-1
  (:require [advent.core :refer [read-input-string]]))

(defn brace->movement
  [brace]
  (cond
    (= "(" brace) 1
    (= ")" brace) -1))

(defn add-indices
  [movements]
  (map-indexed (fn [idx mvmt] {:idx (inc idx) :mvmt mvmt}) movements))

(defn extension
  [movements]
  (loop [indexed (add-indices movements)
         floor   0]
    (let [n          (first indexed)
          next-floor (+ floor (:mvmt n))]
      (if (< next-floor 0)
        (:idx n)
        (recur (rest indexed) next-floor)))))

(defn answer
  []
  (let [braces    (map str (read-input-string "data/day-1"))
        movements (map brace->movement braces)
        ans       (reduce + movements)
        ext       (extension movements)]
    (println "answer: " ans " extension: " ext)))

And that is a solution I am happy enough to leave; it has passed all the required tests and the extra stress tests I imposed. I could perhaps spend a little while coming up with better names for functions or variables but they are clear enough for me and even though I am writing this post several months after implementing the code I found the code easy enough to understand which is usually a good sign.

Conclusions

So why is it that I think advent of code is particularly good as a source of interesting problems to help people learn to code or improve how they code? (And though I have done this one in Clojure I think that many of the points are equally applicable to any language.)

In actual fact when I first solved this problem I didn’t do nearly as much refactoring as I have shown here, but when I found that each of the next few problems followed a similar pattern it encouraged me to approach them as I have shown above. I found myself writing functions which had a single responsibility and were designed to be testable and reusable. By problem three I found myself automatically assessing a solution and a function for its flexibility; for instance what if Santa had not started on the ground floor? How much code would I have to change to make that extension? How much code could I reuse from the initial solution? I also love the fact that the naïve solution (in this case recursion) will only get you so far which forced me to find out more about the language. What else have I learnt? Perhaps that you don’t always need to define a function; once I was more familiar with Clojure it seemed silly that I ever felt the need to define the original calculate-floors function and once it had been refactored to a reduce it was even more obvious to me that it didn’t need to be a named function. So why would I define a named function now? If it involved more than one line I think I would. If I wanted to test how it behaved I would. And when wouldn’t I? If it was something trivially testable at the REPL – for example (reduce + ...) or (map str ...) and I expect that the more familiar I become with the language the more likely I am to consider something trivially testable at the REPL.

Review

At Metail we like code review – so much in fact that we have double pass reviews across the board. That even goes for blog posts. We know it doesn’t guarantee perfection, but we know it does catch a lot of mistakes. Small things like typos are simply fixed by the editor/reviewer. The reviewer for this piece went beyond that by checking the implementation of the code too and has made a suggestion for an improvement. Part of the process of being open to refactoring (and code improvements) is accepting that you aren’t perfect and you didn’t get it right first time. So in this spirit rather than change the code throughout to the improved version I think I should give it as a final improvement at the end.

It is more idiomatic to use destructuring than first and rest so the extension method *could* be rewritten as:

(defn extension
  [movements]
  (loop [[head & tail] (add-indices movements)
         floor   0]
    (let [next-floor (+ floor (:mvmt head))]
      (if (< next-floor 0)
        (:idx head)
        (recur tail next-floor)))))

It was also mentioned that list was a really poor name for a variable in Clojure as it is the name of a function – but as I fixed that one myself by refactoring I wanted to leave it it.

Review 2

How fitting that we should come back to recursion. I honestly felt quite proud that after all this time I remembered my first computer science class when an enthusiastic lecturer demonstrated that tail recursion rapidly caused stack overflow issues (or whatever they are in ML) and then introduced an accumulator as an argument to the function and Lo! no more problems. So my first solution harked back to that – obviously I must have missed something because my solution had an argument I thought was an accumulator but still blew up.

Handily Clojure thinks of people like me and saves them from themselves by providing a way to do recursion without having to remember any computer science (or using loop). So I could have rewritten my original calculation as:

(defn calculate-floor
  [braces floor]
  (if (empty? braces)
    floor
    (let [f (first braces)
          r (rest braces)
          m (brace->movement f)]
      (recur r (+ floor m)))))

However the review and I both agreed that the version using map which I found myself was better than this implementation. They also suggested further changes to the brace->movement function – using case rather than cond and went on to say that if this were a true code review reduce and reduced would be brought up as interesting corners of the language to explore. It was then that I backed away.

Thanks to the reviewers for the opportunity to learn 🙂

Now this post is written I can get back to the next problem and I very much hope that there will be another set come advent 2016.

This is the third instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the previous part we showed how to use functions from the Incanter library to explore and transform a dataset. Now we build on that knowledge to explore the National Survey for Family Growth (NSFG) data and answer the question do first babies arrive late? This takes us to the end of chapter 1 of the book.

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

Change into the project directory and fire up Gorilla REPL:

cd thinkstats
lein gorilla

Getting Started

Our project includes the namespace thinkstats.incanter that brings together our general Incanter utility functions, and thinkstats.family-growth for the functions we developed last time for cleaning and augmenting the female pregnancy data.

Let’s start by importing these and the Incanter namspaces we’re going to need this time:

(ns mysterious-aurora
  (:require [incanter.core :as i
              :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))

(We’ve also included thinkstats.gorilla, which just includes some functionality to render Incanter datasets more nicely in Gorilla REPL.)

The function thinkstats.family-growth/fem-preg-ds combines reading the data set with clean-and-augment-fem-preg:

(def ds (f/fem-preg-ds))

This function is parsing and transforming the dataset; depending on the speed of your computer, it could take one or two minutes to run.

Validating Data

There are a couple of things covered in chapter 1 of the book that we haven’t done yet: looking at frequencies of values in particular columns of the NSFG data and validating against the code book, and building a function to index rows by :caseid.

We can use the core Clojure frequencies function in conjunction with Incanter’s $ to select values of a column and return a map of value to frequency:

(frequencies ($ :outcome ds))
;=> {1 9148, 2 1862, 4 1921, 5 190, 3 120, 6 352}

Incanter’s $rollup function can be used to compute a summary function over a column or set of columns, and has built-in support for :min, :max, :mean, :sum, and :count. Rolling up :outcome by :count will compute the freqency for each outcome and return a new dataset:

($rollup :count :total :outcome ds)
:outcome :total
1 9148
2 1862
4 1921
5 190
3 120
6 352

Compare this with the table in the code book (you’ll find the table on page 103).

Exploring and Interpreting Data

We saw previously that we can use $where to select rows matching a predicate. For example, to select rows for a given :caseid:

($where {:caseid "10229"} ds)

This could be quite slow for a large dataset as it has to examine every row. An alternative strategy is to build an index in advance then use that to select the desired rows. Here’s how we might do this:

(defn build-column-ix
  [col-name ds]
  (reduce (fn [accum [row-ix v]]
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector ($ col-name ds))))

(def caseid-ix (build-column-ix :caseid ds))

Now we can quickly select rows for a given :caseid using this index:

(i/sel ds :rows (caseid-ix "10229"))

Recall that we can also select a subset of columns at the same time:

(i/sel ds :rows (caseid-ix "10229") :cols [:pregordr :agepreg :outcome])
:pregordr :agepreg :outcome
1 19.58 4
2 21.75 4
3 23.83 4
4 25.5 4
5 29.08 4
6 32.16 4
7 33.16 1

Recall also the meaning of :outcome; a value of 4 indicates a miscarriage and 1 a live birth. So this respondent suffered 6 miscarriages between the ages of 19 and 32, finally seeing a live birth at age 33.

We can use functions from the incanter.stats namespace to compute basic statistics on our data:

(s/mean ($! :totalwgt-lb ds))
;=> 7.2623018494055485
(s/median ($! :totalwgt-lb ds))
;=> 7.375

(Note the use of $! to exclude nil values, which would otherwise trigger a null pointer exception.)

To compute several statistics at once:

(s/summary ($! [:totalwgt-lb] ds))
;=> ({:col :totalwgt-lb, :min 0.0, :max 15.4375, :mean 7.2623018494055485, :median 7.375, :is-numeric true})

Note that, while mean and median take a sequence of values (argument to $! is just a keyword), the summary function expects a dataset (argument to $! is a vector).

Do First Babies Arrive Late?

We now know enough to have a first attempt at answering this question. The columns we’ll use are:

:outcome Pregnancy outcome (1 == live birth)
:birthord Birth order
:prglngth Duration of completed pregnancy in weeks

Compute the mean pregnancy length for the first birth:

(s/mean ($! :prglngth ($where {:outcome 1 :birthord 1} ds)))
;=> 38.60095173351461

…and for subsequent births:

(s/mean ($! :prglngth ($where {:outcome 1 :birthord {:$ne 1}} ds)))
;=> 38.52291446673706

The diffenence between these two values in just 0.08 weeks, so I’d say that these data do not indicate that first babies arrive late.

Here we’ve computed mean pregnancy length for first baby and others; if we want a table of mean pregnancy length by birth order, we can use $rollup again:

($rollup :mean :prglngth :birthord ($where {:outcome 1 :prglngth $not-nil} ds))
:birthord :prglngth
3 47501/1234
4 16187/421
5 2419/63
10 36
9 75/2
7 763/20
1 56782/1471
8 263/7
6 1903/50
2 55420/1437

The mean has been returned as a rational, but we can use transform-col to convert it to a floating-point number:

(as-> ds x
      ($where {:outcome 1 :prglngth $not-nil} x)
      ($rollup :mean :prglngth :birthord x)
      (i/transform-col x :prglngth float))
:birthord :prglngth
3 38.49352
4 38.448933
5 38.396824
10 36.0
9 37.5
7 38.15
1 38.600952
8 37.57143
6 38.06
2 38.56646

Finally, we can use $order to sort this dataset on birth order:

(as-> ds x
      ($where {:outcome 1 :prglngth $not-nil} x)
      ($rollup :mean :prglngth :birthord x)
      (i/transform-col x :prglngth float)
      ($order :birthord :asc x))
:birthord :prglngth
1 38.600952
2 38.56646
3 38.49352
4 38.448933
5 38.396824
6 38.06
7 38.15
8 37.57143
9 37.5
10 36.0

The Incanter functions $where, $rollup, $order, etc. all take a dataset to act on as their last argument. If this argument is omitted, they use the dynamic $data variable that is usually bound using with-data. So the following two expressions are equivalent:

($where {:outcome 1 :prglngth $not-nil} ds)

(with-data ds
  ($where {:outcome 1 :prglngth $not-nil}))

It’s a bit annoying that we have to use as-> when we add transform-col to the mix, as this function takes the dataset as its first argument. Let’s add the following to our thinkstats.incanter namespace:

(defn $transform
  "Like Incanter's `transform-col`, but takes the dataset as an optional
   last argument and, when not specified, uses the dynamically-bound
   `$data`."
  [col f & args]
  (let [[ds args] (if (or (i/matrix? (last args)) (i/dataset? (last args)))
                    [(last args) (butlast args)]
                    [i/$data args])]
    (apply i/transform-col ds col f args)))

Now we can use the ->> threading macro:

(->> ($where {:outcome 1 :prglngth $not-nil} ds)
     ($rollup :mean :prglngth :birthord)
     ($transform :prglngth float)
     ($order :birthord :asc))

We have now met most of the core Incanter functions for manipulating datasets, and a few of the statistics functions. I hope that, as we get further into the book, we’ll learn how to calculate error bounds for computed values, and how to decide when we have a statistically significant result. In the next installment we start to look at statistical distributions and plot our first histograms.

Dr Shrividya Ravi spoke about the statistics of A/B testing at the Data Insights Cambridge meetup. It’s now live on the Metail YouTube channel, watch below or click here.

A – Z of A/B testing

Randomised control trials have been a key part of medical science since the 18th century. More recently they have gained rapid traction in the e-commerce world where the term ‘A/B testing’ has become synonymous with businesses that are innovative and data-driven.

A/B testing has become the ‘status quo’ for retail website development – enabling product managers and marketing professionals to positively affect the customer journey; the sales funnel in particular. Combining event stream data with sound questions and good experiment design, these controlled trials become powerful tools for insight into user behaviour.

This talk will present a comprehensive overview of A/B testing discussing both the advantages and the caveats. A series of case studies and toy examples will detail the myriad of analyses possible from rich web events data. Topics covered will include inference with hypothesis testing, regression, bootstrapping, Bayesian models and parametric simulations.

You can check out the slides below or alternatively download them here:

 

 

The first Data Insights Cambridge meetup of 2016 is nearly upon us. Metail looks forward to welcoming Sean McGuire, from the University of Cambridge Research Institutional Services, who will present on ‘Supercomputing for your data’.

What does Supercomputing for your Data mean?

Data proliferation and collection means that even small companies are capable of collecting vast amounts of data very quickly these days. But how do companies make the move from desktop or small compute clusters to larger clusters as their data grows? Knowledge of the tools and equipment needed to scale is not necessarily part of the existing knowledge base. This talk will describe how the Research Institutional Services (University of Cambridge) is helping companies today from a wide range of areas, from Life Science to Oil and Gas to the Manufacturing industry. We’ll cover everything from data security to how to go about designing components for a large compute and store cluster.

The Speaker:

Sean has spent the last 20 years working for two well-known vendors in the Super Computing space:

  • Intel Corporation, Director of HPC EMEA
  • Seagate Storage Systems, VP EMEA

Sean has worked in sales, operations and people management before moving into senior EMEA based roles with responsibility for business unit P&L’s.

The meetup is scheduled for Thursday, February 4, 2016 at 7:00 pm at 50 St Andrew’s St, CB2 3AH. We hope to see you there, just sign up for it on the Data Insights Cambridge meetup page.

In the first part of this series of posts, we introduced the idea of trying to detect flesh in images by looking at the colour values of individual pixels in the image. This produces reasonable results, but far too many “false positives” due to the fact that other items in the scene, such as hair and clothes, may be flesh-coloured too.

Boolean Pixel Function

In the example below, pixels in the left-hand image that are fleshy (R > G > B) are rendered red in the right-hand image, whereas non-fleshy pixels are rendered green:

Simple flesh detection

Simple flesh detection

Fuzzy Pixel Function

We can improve things slightly by using fuzzy logic. Our original fleshy function (R > G > B) is actually made up of two conditions: a pixel is “fleshy” if the red component is greater than the green component and the green component is greater than the blue component. These two conditions are binary, but they could be made fuzzy. Consider the following JavaScript function:

function fuzzy(x, false_limit, true_limit) {
  var y = (x - false_limit) / (true_limit - false_limit);
  return Math.min(Math.max(y, 0), 1);
}

This produces a fuzzy truth value between zero, meaning definitely false, and one, meaning definitely true:

fuzzy_graph

We can then compose a fuzzy logic expression for fleshiness (notice that the fuzzy AND operator is simply multiplication):

var rg = fuzzy(r - g, 0, 0.10);
var gb = fuzzy(g - b, 0, 0.02);
var fleshiness = rg * gb;

The values 0.10 and 0.02 were derived empirically. Effectively, we’re saying that we expect the red value to be quite a bit greater than the green channel; the difference between the green and blue values is less important.

The fuzzy approach gives us marginally better results. Parts of the hair are deemed to be less likely to be fleshy, as are some portions of the dress pattern.

flesh2_dress_fuzzy

But, as mentioned at the end of Part One, we need a radically different approach to consistently find accidentally-rendered “naughty bits” in an image.

Chameleon Detector

Fortunately, we have control over the rendering pipeline of these images, so there’s nothing stopping us from rendering the scene twice with slightly different parameters. Let us pretend that belly buttons are considered “naughty” and that we want to detect renders that show some or all of this body part. When we render body parts, we use texture mapping on to a 3D mesh. If we “paint” over the naughty bits in the source texture maps with a known colour (say, green) and render the scene, we may get the following for two different outfits:

fleshy2_star_green

For the purposes of clarity, we’ve painted a large star over the belly button. In reality, the painted region would be smaller and more accurately shaped. If we render the scene again with the naughty bits over-painted with the same shape but a different colour, say, red, we get:

fleshy2_star_red

Obviously, the image on the right is unchanged by this modification to the skin texture, but the image on the left is. All we need to do is run the two sets of images through a very simple (fuzzy) comparator to find visible naughty bits:

flesh2_star_fuzzy

As can be seen, this “chameleon” technique produces a strong signal. And even though it requires two renders per image, there are other advantages too:

  1. The regions considered “naughty” are hand-painted into the source skin textures. This is both intuitive and flexible.
  2. Different “naughtiness maps” can easily be used for different regions and cultures.
  3. One of the outputs of the technique is an image illustrating which naughty bit is visible and where.
  4. It is body shape agnostic.
  5. It is viewpoint agnostic.
  6. It handles translucent garments gracefully, particularly if a fuzzy comparator is used.
  7. It does not matter how complex the scene is.
  8. The code used to run the test is identical to the final rendering code: only input texture data is modified.