/dev/summer 2016 is almost upon us. This is the latest in a series of bi-annual developer conferences organized by our friends at Software Acumen, and will be taking place at the Møller Centre, Churchill College, Cambridge next Saturday 25th June. It’s a low cost, high value software developer event covering DevOps, Mobile, Web, NoSQL, Cloud, Functional Programming, Startups and more.

Back in 2014 Jim Downing (Metail CTO) and I gave a hands-on session based on an extended version of TryClojure, where participants got to implement a Sudoku solver in Clojure. This year we’re back with another REPL-driven development session. We’ll be showing people how to write a chat bot in Clojure. We’ll cover Clojure as a REST client, build a simple web service using Ring and Compojure, deploy our application to Heroku, and configure a Slack command integration.

If that’s not your cup of tea, there are plenty of other sessions to choose from and the conference provides an excellent opportunity to meet and chat with experts in your field in a friendly and relaxed environment. Metail is sponsoring this year’s event – look out for our stall. Oh, and we’re hiring! If you’d like to make Clojure and ClojureScript part of your day job, or are interested in any of the other tech jobs we’re advertising in Cambridge, come along and talk to us.

Tickets for /dev/summer are on sale here.

This is the fourth instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. This week we made a start on chapter 2 of the book, which introduces us to statistical distributions by way of histograms. This was our first encounter with the incanter.charts namespace, which we use to plot histograms of some values from the National Survey for Family Growth dataset we have worked with in previous sessions.

You can find previous instalments from the study group on our blog:

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

Change into the project directory and fire up Gorilla REPL:

cd thinkstats
lein gorilla

Getting Started

As usual, we start out with a namespace declaration that loads the namespaces we’ll need:

(ns radioactive-darkness
  (:require [incanter.core :as i
               :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [incanter.charts :as c]
            [incanter-gorilla.render :refer [chart-view]]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))

There are two additions since last time: incanter.charts mentioned above, and incanter-gorilla.render that provides a function to display Incanter charts in Gorilla REPL.

We start by generating a vector of random integers to play with:

(def xs (repeatedly 100 #(rand-int 5)))

We can generate a histogram from these data:

(def h (c/histogram xs))

This returns a JFreeChart object that we can display in Gorilla REPL with chart-view:

(chart-view h)

histogram-1

If you’re running from a standard Clojure REPL, you should use the view function from incanter.core instead:

(i/view h)

The first thing we notice about this is that the default number of bins is not optimal for our data; let’s look at the documentation for histogram to see how we might change this.

(require '[clojure.repl :refer [doc]])
(doc c/histogram)

We see that the :nbins option controls the number of bins. We can also set the title and labels for the axes by specifiyng :title, :x-label and :y-label respectively.

(chart-view (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency"))

histogram-2

We can save the histogram as a PNG file:

(i/save (c/histogram xs :nbins 5
                            :title "Our first histogram"
                            :x-label "Value"
                            :y-label "Frequency")
            "histogram-1.png")

Birth Weight

Now that we know how to plot histograms, we can start to visualize values from the NSFG data set. We start by loading the data:

(def ds (f/fem-preg-ds))

Plot the pounds part of birth weight (note the use of $! to exclude nil values):

(chart-view (c/histogram ($! :birthwgt-lb ds) :x-label "Birth weight (lb)"))

histogram-3

…and the ounces part of birth weight:

(chart-view (c/histogram ($! :birthwgt-oz ds) :x-label "Birth weight (oz)"))

histogram-4

We can see immediately that these charts are very different, reflecting the different “shapes” of the data. What we see fits well with our intuition: we expect the ounces component of the weight to be distributed fairly evenly, while most newborns are around 7lb or 8lb and babies bigger than 10lb at birth are rarely seen.

Recall that we also computed the total weight in pounds and added :totalwgt-lb to the dataset:

(chart-view (c/histogram ($! :totalwgt-lb ds) :x-label "Total weight (lb)"))

histogram-5

This does not look much different from the :birthwgt-lb histogram, as this value dominates ounces in the computaiton

A Few More Histograms

The shape of a histogram tells us how the data are distributed: it may be approximately flat like the :birthwgt-oz histogram, or bell-shaped like :birthwgt-lb, or an asymetrical bell (with longer tail to the left or to the right) like the following two.

(chart-view (c/histogram ($! :ageatend ds)
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-6

Let’s try that again, excluding the outliers with an age over 60:

(chart-view (c/histogram (filter #(< % 60) ($! :ageatend ds))
                         :x-label "Age"
                         :title "Mother's age at end of pregnancy"))

histogram-7

Finally, let’s look at pregnancy length for live births:

(chart-view (c/histogram ($! :prglngth ($where {:outcome 1} ds))
                         :x-label "Weeks"
                         :title "Pregnancy length (live births)"))

histogram-8

We have now reached the end of section 2.4 of the book, and will pick up next time with section 2.5.

One of the user stories I had to tackle in a recent sprint was to import data maintained by a non-technical colleague in a Google Spreadsheet into our analytics database. I quickly found a Java API for Google Spreadsheets that looked promising but turned out to be more tricky to get up and running than expected at first glance. In this article, I show you how to use this library from Clojure and avoid some of the pitfalls I fell into.

Google Spreadsheets API

The GData Java client referenced in the Google Spreadsheets API documentation uses an old XML-based protocol, which is mostly deprecated. We are recommended to use the newer, JSON-based client. After chasing my tail on this, I discovered that Google Spreadsheets does not yet support this new API and we do need the GData client after all.

The first hurdle: dependencies

The GData Java client is not available from Maven, so we have to download a zip archive. The easiest way to use these from a Leiningen project is to use mvn to install the required jar files in our local repository and specify the dependencies in the usual way. This handy script automates the process, only downloading the archive if necessary. (For this project, we only need the gdata-core and gdata-spreadsheet jars, but the script is easily extended if you need other components.)

#!/bin/bash

set -e

function log () {
    echo "$1" >&2
}

function install_artifact () {
    log "Installing artifact $2"
    mvn install:install-file -DgroupId="$1" -DartifactId="$2" -Dversion="$3" -Dfile="$4" \
        -Dpackaging=jar -DgeneratePom=true
}

R="${HOME}/.m2/repository"
V="1.47.1"
U="http://storage.googleapis.com/gdata-java-client-binaries/gdata-src.java-${V}.zip"

if test -r "${R}/com/google/gdata/gdata-core/1.0/gdata-core-1.0.jar" \
        -a -r "${R}/com/google/gdata/gdata-spreadsheet/3.0/gdata-spreadsheet-3.0.jar";
then
    log "Artifacts up-to-date"
    exit 0
fi

log "Downloading $U"
cd $(mktemp -d)
wget "${U}"
unzip "gdata-src.java-${V}.zip"

install_artifact com.google.gdata gdata-core 1.0 gdata/java/lib/gdata-core-1.0.jar

install_artifact com.google.gdata gdata-spreadsheet 3.0 gdata/java/lib/gdata-spreadsheet-3.0.jar

Once we’ve installed these jars, we can configure dependencies as follows:

(defproject gsheets-demo "0.1.0-SNAPSHOT"
  :description "Google Sheets Demo"
  :url "https://github.com/ray1729/gsheets-demo"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [com.google.gdata/gdata-core "1.0"]
                 [com.google.gdata/gdata-spreadsheet "3.0"]])

The second hurdle: authentication

This is a pain, as the documentation for the GData Java client is incomplete and at times confusing, and the examples it ships with no longer work as they use a deprecated OAuth version. The example Java code in the documentation tells us:

// TODO: Authorize the service object for a specific user (see other sections)

The other sections were no more enlightening, but after more digging and reading of source code, I realized we can use the google-api-client to manage our OAuth credentials and simply pass that credentials object to the GData client. This library is already available from a central Maven repository, so we can simply update our project’s dependencies to pull it in:

:dependencies [[org.clojure/clojure "1.8.0"]
               [com.google.api-client/google-api-client "1.21.0"]
               [com.google.gdata/gdata-core "1.0"]
               [com.google.gdata/gdata-spreadsheet "3.0"]]

OAuth credentials

Before we can start using OAuth, we have to register our client with Google. This is done via the Google Developers Console. See Using OAuth 2.0 to Access Google APIs for full details, but here’s a quick-start guide to creating credentials for a service account.

Navigate to the Developers Console. Click on Enable and manage APIs and select Create a new project. Enter the project name and click Create.

Once project is created, click on Credentials in the sidebar, then the Create Credentials drop-down. As our client is going to run from cron, we want to enable server-to-server authentication, so select Service account key. On the next screen, select New service account and enter a name. Make sure the JSON radio button is selected, then click on Create.

Copy the downloaded JSON file into your project’s resources directory. It should look something like:

{
  "type": "service_account",
  "project_id": "gsheetdemo",
  "private_key_id": "041db3d758a1a7ef94c9c59fb3bccd2fcca41eb8",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "gsheets-demo@gsheetdemo.iam.gserviceaccount.com",
  "client_id": "106215031907469115769",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/gsheets-demo%40gsheetdemo.iam.gserviceaccount.com"
}

We’ll use this in a moment to create a GoogleCredential object, but before that navigate to Google Sheets and create a test spreadsheet. Grant read access to the spreadsheet to the email address found in client_email in your downloaded credentials.

A simple Google Spreadsheets client

We’re going to be using a Java client, so it should come as no surprise that our namespace imports a lot of Java classes:

(ns gsheets-demo.core
  (:require [clojure.java.io :as io])
  (:import com.google.gdata.client.spreadsheet.SpreadsheetService
           com.google.gdata.data.spreadsheet.SpreadsheetFeed
           com.google.gdata.data.spreadsheet.WorksheetFeed
           com.google.gdata.data.spreadsheet.CellFeed
           com.google.api.client.googleapis.auth.oauth2.GoogleCredential
           com.google.api.client.json.jackson2.JacksonFactory
           com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
           java.net.URL
           java.util.Collections))

We start by defining some constants for our application. The credentials resource is the JSON file we downloaded from the developer console:

(def application-name "gsheetdemo-v0.0.1")

(def credentials-resource (io/resource "GSheetDemo-041db3d758a1.json"))

(def oauth-scope "https://spreadsheets.google.com/feeds")

(def spreadsheet-feed-url (URL. "https://spreadsheets.google.com/feeds/spreadsheets/private/full"))

With this in hand, we can create a GoogleCredential object and initialize the Google Sheets service:

(defn get-credential
  []
  (with-open [in (io/input-stream credentials-resource)]
    (let [credential (GoogleCredential/fromStream in)]
      (.createScoped credential (Collections/singleton oauth-scope)))))

(defn init-service
  []
  (let [credential (get-credential)
        service (SpreadsheetService. application-name)]
    (.setOAuth2Credentials service credential)
    service))

Let’s try it at a REPL:

lein repl

user=> (require '[gsheets-demo.core :as gsheets])
nil
user=> (def service (gsheets/init-service))
#'user/service
user=> (.getEntries (.getFeed service
                              gsheets/spreadsheet-feed-url
                              com.google.gdata.data.spreadsheet.SpreadsheetFeed))
(#object[com.google.gdata.data.spreadsheet.SpreadsheetEntry 0x43ab2a3e "com.google.gdata.data.spreadsheet.SpreadsheetEntry@43ab2a3e"])

Great! We can see the one spreadsheet we granted our service account read access. Let’s wrap this up in a function and implement a helper to find a spreadsheet by name:

(defn list-spreadsheets
  [service]
  (.getEntries (.getFeed service spreadsheet-feed-url SpreadsheetFeed)))

(defn find-spreadsheet-by-title
  [service title]
  (let [spreadsheets (filter (fn [sheet] (= (.getPlainText (.getTitle sheet)) title))
                             (list-spreadsheets service))]
    (if (= (count spreadsheets) 1)
      (first spreadsheets)
      (throw (Exception. (format "Found %d spreadsheets with name %s"
                                 (count spreadsheets)
                                 title))))))

Back at the REPL:

user=> (def spreadsheet (gsheets/find-spreadsheet-by-title service "Colour Counts"))
user=>  (.getPlainText (.getTitle spreadsheet))
"Colour Counts"

A spreadsheet contains one or more worksheets, so the next functions we implement take a SpreadsheetEntry object and list or search worksheets:

(defn list-worksheets
  [service spreadsheet]
  (.getEntries (.getFeed service (.getWorksheetFeedUrl spreadsheet) WorksheetFeed)))

(defn find-worksheet-by-title
  [service spreadsheet title]
  (let [worksheets (filter (fn [ws] (= (.getPlainText (.getTitle ws)) title))
                           (list-worksheets service spreadsheet))]
    (if (= (count worksheets) 1)
      (first worksheets)
      (throw (Exception. (format "Found %d worksheets in %s with name %s"
                                 (count worksheets)
                                 spreadsheet
                                 title))))))

…and at the REPL:

user=> (def worksheets (gsheets/list-worksheets service spreadsheet))
user=> (map (fn [ws] (.getPlainText (.getTitle ws))) worksheets)
("Sheet1")

Our next function returns the cells belonging to a worksheet:

(defn get-cells
  [service worksheet]
  (map (memfn getCell) (.getEntries (.getFeed service (.getCellFeedUrl worksheet) CellFeed))))

This gives us a flat list of Cell objects. It will be much more convenient to work in Clojure with a nested vector of the cell values:

(defn to-nested-vec
  [cells]
  (mapv (partial mapv (memfn getValue)) (partition-by (memfn getRow) cells)))

We now have all the building blocks for the function that will be the main entry point to our minimal Clojure API:

(defn fetch-worksheet
  [service {spreadsheet-title :spreadsheet worksheet-title :worksheet}]
  (if-let [spreadsheet (find-spreadsheet-by-title service spreadsheet-title)]
    (if-let [worksheet (find-worksheet-by-title service spreadsheet worksheet-title)]
      (to-nested-vec (get-cells service worksheet))
      (throw (Exception. (format "Spreadsheet '%s' has no worksheet '%s'"
                                 spreadsheet-title worksheet-title))))
    (throw (Exception. (format "Spreadsheet '%s' not found" spreadsheet-title)))))

With this in hand:

user=> (def sheet (gsheets/fetch-worksheet service {:spreadsheet "Colour Counts" :worksheet "Sheet1"}))
#'user/sheet
user=> (clojure.pprint/pprint sheet)
[["Colour" "Count"]
 ["red" "123"]
 ["orange" "456"]
 ["yellow" "789"]
 ["green" "101112"]
 ["blue" "131415"]
 ["indigo" "161718"]
 ["violet" "192021"]]
nil

Our to-nested-vec function returns the cell values as strings. I could have used the getNumericValue method instead of getValue, but then to-nested-vec would have to know what data type to expect in each cell. Instead, I used Plumatic Schema to define a schema for each row, and used its data coercion features to coerce each column to the desired data type – but that’s a blog post for another day.

Code for the examples above is available on Github https://github.com/ray1729/gsheets-demo. We have barely scratched the surface of the Google Spreadsheets API; check out the API Documentation if you need to extend this code, for example to create or update spreadsheets.

OnyxMetail is hosting the next meetup of Cambridge NonDysfunctional Programmers next Thursday, 17th March. This month we’ll be taking a look at Onyx, a distributed cloud computing platform implemented in Clojure. We’re currently using Cascalog to process data on a Hadoop cluster, and are considering Onyx as a possible alternative. It will be interesting for us to hear what our local Clojure community makes of this new kid on the block.

One of my favourite talks at the recent Clojure Remote conference was Michael Drogalis’s keynote, where he discussed some of the principles behind Onyx’s data-driven API. At the Meetup, we’ll watch Michael’s Onyx talk from last year’s Clojure/conj. After the video, we’ll work through the getting started guide and tutorial together. Please see the Meetup page for full details.

This is the third instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the previous part we showed how to use functions from the Incanter library to explore and transform a dataset. Now we build on that knowledge to explore the National Survey for Family Growth (NSFG) data and answer the question do first babies arrive late? This takes us to the end of chapter 1 of the book.

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

Change into the project directory and fire up Gorilla REPL:

cd thinkstats
lein gorilla

Getting Started

Our project includes the namespace thinkstats.incanter that brings together our general Incanter utility functions, and thinkstats.family-growth for the functions we developed last time for cleaning and augmenting the female pregnancy data.

Let’s start by importing these and the Incanter namspaces we’re going to need this time:

(ns mysterious-aurora
  (:require [incanter.core :as i
              :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))

(We’ve also included thinkstats.gorilla, which just includes some functionality to render Incanter datasets more nicely in Gorilla REPL.)

The function thinkstats.family-growth/fem-preg-ds combines reading the data set with clean-and-augment-fem-preg:

(def ds (f/fem-preg-ds))

This function is parsing and transforming the dataset; depending on the speed of your computer, it could take one or two minutes to run.

Validating Data

There are a couple of things covered in chapter 1 of the book that we haven’t done yet: looking at frequencies of values in particular columns of the NSFG data and validating against the code book, and building a function to index rows by :caseid.

We can use the core Clojure frequencies function in conjunction with Incanter’s $ to select values of a column and return a map of value to frequency:

(frequencies ($ :outcome ds))
;=> {1 9148, 2 1862, 4 1921, 5 190, 3 120, 6 352}

Incanter’s $rollup function can be used to compute a summary function over a column or set of columns, and has built-in support for :min, :max, :mean, :sum, and :count. Rolling up :outcome by :count will compute the freqency for each outcome and return a new dataset:

($rollup :count :total :outcome ds)
:outcome :total
1 9148
2 1862
4 1921
5 190
3 120
6 352

Compare this with the table in the code book (you’ll find the table on page 103).

Exploring and Interpreting Data

We saw previously that we can use $where to select rows matching a predicate. For example, to select rows for a given :caseid:

($where {:caseid "10229"} ds)

This could be quite slow for a large dataset as it has to examine every row. An alternative strategy is to build an index in advance then use that to select the desired rows. Here’s how we might do this:

(defn build-column-ix
  [col-name ds]
  (reduce (fn [accum [row-ix v]]
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector ($ col-name ds))))

(def caseid-ix (build-column-ix :caseid ds))

Now we can quickly select rows for a given :caseid using this index:

(i/sel ds :rows (caseid-ix "10229"))

Recall that we can also select a subset of columns at the same time:

(i/sel ds :rows (caseid-ix "10229") :cols [:pregordr :agepreg :outcome])
:pregordr :agepreg :outcome
1 19.58 4
2 21.75 4
3 23.83 4
4 25.5 4
5 29.08 4
6 32.16 4
7 33.16 1

Recall also the meaning of :outcome; a value of 4 indicates a miscarriage and 1 a live birth. So this respondent suffered 6 miscarriages between the ages of 19 and 32, finally seeing a live birth at age 33.

We can use functions from the incanter.stats namespace to compute basic statistics on our data:

(s/mean ($! :totalwgt-lb ds))
;=> 7.2623018494055485
(s/median ($! :totalwgt-lb ds))
;=> 7.375

(Note the use of $! to exclude nil values, which would otherwise trigger a null pointer exception.)

To compute several statistics at once:

(s/summary ($! [:totalwgt-lb] ds))
;=> ({:col :totalwgt-lb, :min 0.0, :max 15.4375, :mean 7.2623018494055485, :median 7.375, :is-numeric true})

Note that, while mean and median take a sequence of values (argument to $! is just a keyword), the summary function expects a dataset (argument to $! is a vector).

Do First Babies Arrive Late?

We now know enough to have a first attempt at answering this question. The columns we’ll use are:

:outcome Pregnancy outcome (1 == live birth)
:birthord Birth order
:prglngth Duration of completed pregnancy in weeks

Compute the mean pregnancy length for the first birth:

(s/mean ($! :prglngth ($where {:outcome 1 :birthord 1} ds)))
;=> 38.60095173351461

…and for subsequent births:

(s/mean ($! :prglngth ($where {:outcome 1 :birthord {:$ne 1}} ds)))
;=> 38.52291446673706

The diffenence between these two values in just 0.08 weeks, so I’d say that these data do not indicate that first babies arrive late.

Here we’ve computed mean pregnancy length for first baby and others; if we want a table of mean pregnancy length by birth order, we can use $rollup again:

($rollup :mean :prglngth :birthord ($where {:outcome 1 :prglngth $not-nil} ds))
:birthord :prglngth
3 47501/1234
4 16187/421
5 2419/63
10 36
9 75/2
7 763/20
1 56782/1471
8 263/7
6 1903/50
2 55420/1437

The mean has been returned as a rational, but we can use transform-col to convert it to a floating-point number:

(as-> ds x
      ($where {:outcome 1 :prglngth $not-nil} x)
      ($rollup :mean :prglngth :birthord x)
      (i/transform-col x :prglngth float))
:birthord :prglngth
3 38.49352
4 38.448933
5 38.396824
10 36.0
9 37.5
7 38.15
1 38.600952
8 37.57143
6 38.06
2 38.56646

Finally, we can use $order to sort this dataset on birth order:

(as-> ds x
      ($where {:outcome 1 :prglngth $not-nil} x)
      ($rollup :mean :prglngth :birthord x)
      (i/transform-col x :prglngth float)
      ($order :birthord :asc x))
:birthord :prglngth
1 38.600952
2 38.56646
3 38.49352
4 38.448933
5 38.396824
6 38.06
7 38.15
8 37.57143
9 37.5
10 36.0

The Incanter functions $where, $rollup, $order, etc. all take a dataset to act on as their last argument. If this argument is omitted, they use the dynamic $data variable that is usually bound using with-data. So the following two expressions are equivalent:

($where {:outcome 1 :prglngth $not-nil} ds)

(with-data ds
  ($where {:outcome 1 :prglngth $not-nil}))

It’s a bit annoying that we have to use as-> when we add transform-col to the mix, as this function takes the dataset as its first argument. Let’s add the following to our thinkstats.incanter namespace:

(defn $transform
  "Like Incanter's `transform-col`, but takes the dataset as an optional
   last argument and, when not specified, uses the dynamically-bound
   `$data`."
  [col f & args]
  (let [[ds args] (if (or (i/matrix? (last args)) (i/dataset? (last args)))
                    [(last args) (butlast args)]
                    [i/$data args])]
    (apply i/transform-col ds col f args)))

Now we can use the ->> threading macro:

(->> ($where {:outcome 1 :prglngth $not-nil} ds)
     ($rollup :mean :prglngth :birthord)
     ($transform :prglngth float)
     ($order :birthord :asc))

We have now met most of the core Incanter functions for manipulating datasets, and a few of the statistics functions. I hope that, as we get further into the book, we’ll learn how to calculate error bounds for computed values, and how to decide when we have a statistically significant result. In the next installment we start to look at statistical distributions and plot our first histograms.

This is the second instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the first part we implemented a parser for Stata dictionary and data files. Now we are going to use that to start exploring the National Survey of Family Growth data with Incanter, a Clojure library for statistical computing and graphics. We are still working through Chapter 1 of the book, and in this instalment we cover sections 1.4 DataFrames through to 1.7 Validation.

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

I’ve made two changes since writing the first post in this series. I realised that I could include Allen’s repository as a git submodule, hence the --recursive option above. This means the data files will be in a predictable place in our project folder so we can refer to them in the examples. I’ve also included Gorilla REPL in the project, so if you want to try out the examples but aren’t familiar with the Clojure tool chain, you can simply run:

lein gorilla

This will print out a URL for you to open in your browser. You can then start running the examples and seeing the output in your browser. Read more about Gorilla REPL here: http://gorilla-repl.org/.

To Business…

Gorilla has created the namespace harmonious-willow for our worksheet. We’ll start by importing the Incanter and thinkstats namespaces we require:

(ns harmonious-willow
  (:require [incanter.core :as i
              :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.dct-parser :as dct]))

Incanter defines a number of handy functions whose names begin with $; we’re likely to use these a lot, so we’ve imported them into our namespace. We’ll refer to the other Incanter functions we need by qualifying them with the i/ or s/ prefix.

Load the NFSG Pregnancy data into an Incanter dataset:

(def ds (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"
                        "ThinkStats2/code/2002FemPreg.dat.gz"))

Incanter’s dim function tells us the number of rows and columns in the dataset:

(i/dim ds)
;=> [13593 243]

and col-names lists the column names:

(i/col-names ds)
;=> [:caseid :pregordr :howpreg-n :howpreg-p ...]

We can select a subset of rows or columns from the dataset using sel:

(i/sel ds :cols [:caseid :pregordr] :rows (range 10))

Either of :rows or :cols may be omitted, but you’ll get a lot of data back if you ask for all rows. Selecting subsets of the dataset is such a common thing to do that Incanter provides the function $ as a short-cut (but note the different argument order):

($ (range 10) [:caseid :pregordr] ds)

If the first argument is omitted, it will return all rows. This returns a new Incanter dataset, but  if you ask for just a single column and don’t wrap the argument in a vector, you get back a sequence of values for that column:

(take 10 ($ :caseid ds))
;=> ("1" "1" "2" "2" "2" "6" "6" "6" "7" "7")

We can also select a subset of rows using Incanter’s $where function, which provides a succinct syntax for selecting rows that match a predicate. For example, to select rows where the :caseid is 6, we can do:

($ [:caseid :pregordr :outcome] ($where {:caseid "6"} ds))

(Note that we’re still using $ to limit the columns returned.)  There are lots of other options to $where; for example, to find all the case ids where 3000 <= :agepreg < 3100:

($ :caseid ($where {:agepreg {:$gte 3000 :$lt 3100}} ds))
;=> ("6" "15" "21" "36" "92" "142" "176" "210" ...)

The $where function is a convenience wrapper for query-dataset, so we need to look at the documentation for the latter to find out the other supported options:

(clojure.repl/doc i/query-dataset)

Cleaning data

Before we start to analyze the data, we may want to remove outliers or other special values. For example, the :birthwgt-lb column gives the birth weight in pounds of the first baby in the pregnancy. Let’s look at the top 10 values:

(take 10 (sort > (distinct ($ :birthwgt-lb ds))))
;=> Exception thrown: java.lang.NullPointerException

Oops! That’s not what we wanted, we’ll have to remove nil values before sorting. We can use Incanter’s $where to do this. Although $where has a number of built-in predicates, there isn’t one to check for nil values, so we have to write our own:

(def $not-nil {:$fn (complement nil?)})

(take 10 ($ :birthwgt-lb ($where {:birthwgt-lb $not-nil} ds)))
;=> (8 7 9 7 6 8 9 8 7 6)

(take 10 (sort > (distinct ($ :birthwgt-lb
                             ($where {:birthwgt-lb $not-nil} ds)))))
;=> (99 98 97 51 15 14 13 12 11 10)

This is still a bit cumbersome, so let’s write a variant of sel that returns only the rows where none of the specified columns are nil:

(defn ensure-collection
  [x]
  (if (coll? x) x (vector x)))

(defn sel-defined
  [ds & {:keys [rows cols]}]
  (let [rows (or rows :all)
        cols (or cols (i/col-names ds))]
    (i/sel ($where (zipmap (ensure-collection cols) (repeat $not-nil))
                   ds)
           :rows rows :cols cols)))

(take 10 (sort > (distinct (sel-defined ds :cols :birthwgt-lb))))
;=> (99 98 97 51 15 14 13 12 11 10)

Looking up the definition of :birthwgt-lb in the code book, we see that values greater than 95 encode special meaning:

Value Meaning
97 Not ascertained
98 Refused
99 Don’t know

We’d like to remove these values (and the obvious outlier 51) from the dataset before processing it. Incanter provides the function transform-col that applies a function to each value in the specified column of a dataset and returns a new dataset. Using this, we can write a helper function for setting illegal values to nil:

(defn set-invalid-nil
  [ds col valid?]
  (i/transform-col ds col (fn [v] (when (and (not (nil? v)) (valid? v)) v))))

(def ds' (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99})))

(take 10 (sort > (distinct (sel-defined ds' :cols :birthwgt-lb))))
;=> (15 14 13 12 11 10 9 8 7 6)

We should also update the :birthwgt-oz column to remove any values greater than 15:

(def ds'
    (-> ds
        (set-invalid-nil :birthwgt-lb (complement #{51 97 98 99}))
        (set-invalid-nil :birthwgt-oz (fn [v] (<= 0 v 15)))))

Transforming data

We used the transform-col function in the implementation of set-invalid-nil above. We can also use this to perform an arbitrary calculation on a value. For example, the :agepreg column contains the age of the participant in centiyears (hundredths of a year):

(i/head (sel-defined ds' :cols :agepreg))
;=> (3316 3925 1433 1783 1833 2700 2883 3016 2808 3233)

Let’s transform this to years (perhaps fractional):

(defn centiyears->years
  [v]
  (when v (/ v 100.0)))

(def ds' (i/transform-col ds' :agepreg centiyears->years))
(i/head (sel-defined ds' :cols :agepreg))
;=> (33.16 39.25 14.33 17.83 18.33 27.0 28.83 30.16 28.08 32.33)

Augmenting data

The final function we’ll show you this time is add-derived-column; this function adds a column to a dataset, where the added column is a function of other columns. For example:

(defn compute-totalwgt-lb
  [lb oz]
  (when lb (+ lb (/ (or oz 0) 16.0))))

(def ds' (i/add-derived-column :totalwgt-lb
                               [:birthwgt-lb :birthwgt-oz]
                               compute-totalwgt-lb
                               ds'))

(i/head (sel-defined ds' :cols :totalwgt-lb))
;=> (8.8125 7.875 9.125 7.0 6.1875 8.5625 9.5625 8.375 7.5625 6.625)

Putting it all together

We’ve built up a new dataset above with a number of transformations. Let’s bring these all together into a single function that will thread the dataset through all these transformations. We can’t use the usual -> or ->> macros because of an inconsistency in the argument order of the transformations, but Clojure’s as-> comes to the rescue here.

(defn clean-and-augment-fem-preg
  [ds]
  (as-> ds ds
    (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99}))
    (set-invalid-nil ds :birthwgt-oz (fn [v] (<= 0 v 15)))
    (i/transform-col ds :agepreg centiyears->years)
    (i/add-derived-column :totalwgt-lb
                          [:birthwgt-lb :birthwgt-oz]
                          compute-totalwgt-lb
                          ds)))

Now we can do:

(def ds (clean-and-augment-fem-preg
          (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"
                          "ThinkStats2/code/2002FemPreg.dat.gz")))

The Incanter helper functions we’ve implemented can be found in the thinkstats.incanter namespace, along with a $! short-cut for sel-defined that was a bit too complex to show in this post.

In the next part in this series, we start to explore the cleaned dataset.

Getting Started

Think Stats One of our new starters here at Metail was keen to brush up their statistics, and it’s more than 20 years since I completed an introductory course at university so I knew I would benefit from some revision. We also have a bunch of statisticians in the office who would like to brush up their Clojure, so I thought it might be fun to organise a lunchtime study group to work through Allen Downey’s Think Stats and attempt the exercises in Clojure. We’re using the second edition of the book which is available online in HTML format, and meeting on Wednesday lunchtimes to work through it together.

We’ll use Clojure’s Incanter library which provides utilities for statistical analysis and generating charts. Create a Leiningen project for our work:

lein new thinkstats

Make sure the project.clj depends on Clojure 1.7.0 and add a dependency on Incanter 1.5.6:

:dependencies [[org.clojure/clojure "1.7.0"]
               [incanter "1.5.6"]]

Parsing the data

In the first chapter of the book, we are introduced to a data set from the US Centers for Disease Control and Prevention, the National Survey of Family Growth. The data are in a gzipped file with fixed-width columns. An accompanying Stata dictionary describes the variable names, types, and column indices for each record. Our first job will be to parse the dictionary file and use that information to build a parser for the data.

We cloned the Github repository that accompanies Allen’s book:

git clone https://github.com/AllenDowney/ThinkStats2

Then created symlinks to the data files from our project:

cd thinkstats
mkdir data
cd data
for f in ../../ThinkStats2/code/{*.dat.gz,*.dct}; do ln -s $f; done

We can now read the Stata dictionary for the family growth study fromdata/2002FemPreg.dct. The dictionary looks like:

infile dictionary {
    _column(1)  str12    caseid     %12s  "RESPONDENT ID NUMBER"
    _column(13) byte     pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
}

If we skip the first and last lines of the dictionary, we can use a regular expression to parse each column definition:

(def dict-line-rx #"^\s+_column\((\d+)\)\s+(\S+)\s+(\S+)\s+%(\d+)(\S)\s+\"([^\"]+)\"")

We’re capturing the column position, colum type, column name, format and length, and description. Let’s test this at the REPL. First we have to read a line from the dictionary:

(require '[clojure.java.io :as io])
(def line (with-open [r (io/reader "data/2002FemPreg.dct")]
            (first (rest (line-seq r)))))

We use rest to skip the first line of the file then grab the first column definition. Now we can try matching this with our regular expression:

(re-find dict-line-rx line)

This returns the string that matched and the capture groups we defined in our regular expression:

["    _column(1)      str12        caseid  %12s  \"RESPONDENT ID NUMBER\""
 "1"
 "str12"
 "caseid"
 "12"
 "s"
 "RESPONDENT ID NUMBER"]

We need to do some post-processing of this result to parse the column index and length to integers; we’ll also replace underscores in the column name with hyphens, which makes for a more idiomatic Clojure variable name. Let’s wrap that up in a function:

(require '[clojure.string :as str])

(defn parse-dict-line
  [line]
  (let [[_ col type name f-len f-spec descr] (re-find dict-line-rx line)]
    {:col    (dec (Integer/parseInt col))
     :type   type
     :name   (str/replace name "_" "-")
     :f-len  (Integer/parseInt f-len)
     :f-spec f-spec
     :descr  descr}))

Note that we’re also decrementing the column index – we need zero-indexed column indices for Clojure’s substring function. Now when we parse our sample line we get:

{:col 0,
 :type "str12",
 :name "caseid",
 :f-len 12,
 :f-spec "s",
 :descr "RESPONDENT ID NUMBER"}

With this function in hand, we can write a parser for the dictionary file:

(defn read-dict-defn
  "Read a Stata dictionary file, return a vector of column definitions."
  [path]
  (with-open [r (io/reader path)]
    (mapv parse-dict-line (butlast (rest (line-seq r))))))

We use rest and butlast to skip the first and last lines of the file, and mapv to force eager evaluation and ensure we process all of the input before the reader is closed when we exit with-open.

(def dict (parse-dict-defn "data/2002FemPreg.dat"))

The dictionary tells us the starting position (:col) and length (:f-len) of each field, so we can use subs to extract the raw value of each column from the data. This will give us a string, and the :type key we’ve extracted from the dictionary tells us how to interpret this. We’ve seen the types str12 and byte above, but what other types appear in the dictionary?

(distinct (map :type dict))
;=> ("str12" "byte" "int" "float" "double")

We’ll leave str12 unchanged, coerce byte and int to Long, andfloat and double to Double:

(defn parse-value
  [type raw-value]
  (when (not (empty? raw-value))
    (case type
      ("str12")          raw-value
      ("byte" "int")     (Long/parseLong raw-value)
      ("float" "double") (Double/parseDouble raw-value))))

We can now build a record parser from the dictionary definition:

(defn make-row-parser
  "Parse a row from a Stata data file according to the specification in `dict`.
   Return a vector of columns."
  [dict]
  (fn [row]
    (reduce (fn [accum {:keys [col type name f-len]}]
              (let [raw-value (str/trim (subs row col (+ col f-len)))]
                (conj accum (parse-value type raw-value))))
            []
            dict)))

To read gzipped data, we need to open an input stream, coerce this to a GZIPInputStream, and construct a buffered reader from that. For convenience, we’ll define a function to do this automatically if passed a path ending in .gz.

(import 'java.util.zip.GZIPInputStream)

(defn reader
  "Open path with io/reader; coerce to a GZIPInputStream if suffix is .gz"
  [path]
  (if (.endsWith path ".gz")
    (io/reader (GZIPInputStream. (io/input-stream path)))
    (io/reader path)))

Given a dictionary and reader, we can parse the records from a data file:

(defn read-dct-data
  "Parse lines from `rdr` according to the specification in `dict`.
   Return a lazy sequence of parsed rows."
  [dict rdr]
  (let [parse-fn (make-row-parser dict)]
    (map parse-fn (line-seq rdr))))

Finally, we bring this all together with a function to parse the dictionary and data and return an Incanter dataset:

(require '[incanter.core :refer [dataset]])

(defn as-dataset
  "Read Stata data set, return an Incanter dataset."
  [dict-path data-path]
  (let [dict   (read-dict-defn dict-path)
        header (map (comp keyword :name) dict)]
    (with-open [r (reader data-path)]
      (dataset header (doall (read-dct-data dict r))))))

Getting the code

The code for all this is available on Github; if you’d like to follow along, you can fork my thinkstats repository.

The functions we’ve developed above are in the namespacethinkstats.dct-parser In the next article in this series, we use our parser to explore and clean the data using Incanter.

Clojure Meetup as part of the Cambridge Non-dysfunctional Programmers group, in Metails office at 50 St Andrews St, Cambridge

Clojure Meetup as part of the Cambridge Non-dysfunctional Programmers group, in Metails office at 50 St Andrews St, Cambridge

This month’s meetup of the Cambridge NonDysFunctional Programmers will be hosted here at Metail’s Cambridge office next Thursday (26th November) from 6.30pm. I (Ray Miller) will be giving a hands-on introduction to web development with Clojure, where attendees get to implement their first Clojure web application from the ground up.

Along the way, we’ll learn about the Compojure routing library, Ring requests and responses, middleware, and generating HTML with hiccup. Time permitting, we’ll also cover interacting with a relational database and using buddy to add session-based authentication and authorization to our application.

The theme running through the tutorial is implementation of an ad server (an example I shamelessly stole from Dan Benjamin’s Meet Sinatra screencast). This demo application delivers a Javascript snippet to embed random ads in a web page and tracks user click-throughs. It also provides an administrative interface for reporting and managing ads. If you’ve already seen Dan’s screencast, you’ll see how Clojure compares with Sinatra to implement the same application.

See the Meetup page for full details and to sign up.