## Think Stats in Clojure Part III: Exploring Data

This is the third instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the previous part we showed how to use functions from the Incanter library to explore and transform a dataset. Now we build on that knowledge to explore the National Survey of Family Growth (NSFG) data and answer the question: do first babies arrive late? This takes us to the end of chapter 1 of the book.

If you’d like to follow along, start by cloning our thinkstats repository from GitHub:

```shell
git clone https://github.com/ray1729/thinkstats.git --recursive
```

Change into the project directory and fire up Gorilla REPL:

```shell
cd thinkstats
lein gorilla
```

## Getting Started

Our project includes the namespace `thinkstats.incanter` that brings together our general Incanter utility functions, and `thinkstats.family-growth` for the functions we developed last time for cleaning and augmenting the female pregnancy data.

Let’s start by importing these and the Incanter namespaces we’re going to need this time:

```clojure
(ns mysterious-aurora
  (:require [incanter.core :as i
             :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))
```

(We’ve also included `thinkstats.gorilla`, which just includes some functionality to render Incanter datasets more nicely in Gorilla REPL.)

The function `thinkstats.family-growth/fem-preg-ds` combines reading the data set with `clean-and-augment-fem-preg`:

```clojure
(def ds (f/fem-preg-ds))
```

This function parses and transforms the dataset; depending on the speed of your computer, it could take a minute or two to run.

## Validating Data

There are a couple of things covered in chapter 1 of the book that we haven’t done yet: looking at frequencies of values in particular columns of the NSFG data and validating against the code book, and building a function to index rows by `:caseid`.

We can use the core Clojure `frequencies` function in conjunction with Incanter’s `$` to select the values of a column and return a map of value to frequency:

```clojure
(frequencies ($ :outcome ds))
;=> {1 9148, 2 1862, 4 1921, 5 190, 3 120, 6 352}
```
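Since `frequencies` returns an unordered map, it can help to sort the result by outcome code before comparing it with the code book. Here’s the idea sketched on a plain vector of made-up outcome codes:

```clojure
;; Hypothetical sample of outcome codes; sort the frequency map by key
(def sample-outcomes [1 1 4 1 2 4 1])

(sort-by key (frequencies sample-outcomes))
;=> ([1 4] [2 1] [4 2])
```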

Incanter’s `$rollup` function can be used to compute a summary function over a column or set of columns, and has built-in support for `:min`, `:max`, `:mean`, `:sum`, and `:count`. Rolling up `:outcome` by `:count` will compute the frequency of each outcome and return a new dataset:

```clojure
($rollup :count :total :outcome ds)
```

| :outcome | :total |
|----------|--------|
| 1        | 9148   |
| 2        | 1862   |
| 4        | 1921   |
| 5        | 190    |
| 3        | 120    |
| 6        | 352    |

Compare this with the table in the code book (you’ll find the table on page 103).

## Exploring and Interpreting Data

We saw previously that we can use `$where` to select rows matching a predicate. For example, to select rows for a given `:caseid`:

```clojure
($where {:caseid "10229"} ds)
```

This could be quite slow for a large dataset, as it has to examine every row. An alternative strategy is to build an index in advance, then use that to select the desired rows. Here’s how we might do this:

```clojure
(defn build-column-ix
  [col-name ds]
  (reduce (fn [accum [row-ix v]]
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector ($ col-name ds))))

(def caseid-ix (build-column-ix :caseid ds))
```
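To see the shape this builds, here is the same indexing idea applied to a plain vector standing in for a dataset column: each distinct value maps to the vector of row indices where it occurs.

```clojure
;; Map each distinct value to the row indices at which it appears
(defn index-values [xs]
  (reduce (fn [accum [row-ix v]]
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector xs)))

(index-values ["a" "b" "a" "c" "a"])
;=> {"a" [0 2 4], "b" [1], "c" [3]}
```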

Now we can quickly select rows for a given `:caseid` using this index:

```clojure
(i/sel ds :rows (caseid-ix "10229"))
```

Recall that we can also select a subset of columns at the same time:

```clojure
(i/sel ds :rows (caseid-ix "10229") :cols [:pregordr :agepreg :outcome])
```

| :pregordr | :agepreg | :outcome |
|-----------|----------|----------|
| 1         | 19.58    | 4        |
| 2         | 21.75    | 4        |
| 3         | 23.83    | 4        |
| 4         | 25.5     | 4        |
| 5         | 29.08    | 4        |
| 6         | 32.16    | 4        |
| 7         | 33.16    | 1        |

Recall also the meaning of `:outcome`; a value of `4` indicates a miscarriage and `1` a live birth. So this respondent suffered 6 miscarriages between the ages of 19 and 32, finally seeing a live birth at age 33.
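If you find yourself decoding `:outcome` values often, a small lookup map saves a trip to the code book. The labels below follow the code book’s table of outcome codes:

```clojure
;; Outcome codes and their meanings, per the NSFG code book
(def outcome-label
  {1 "live birth"
   2 "induced abortion"
   3 "stillbirth"
   4 "miscarriage"
   5 "ectopic pregnancy"
   6 "current pregnancy"})

(map outcome-label [4 4 1])
;=> ("miscarriage" "miscarriage" "live birth")
```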

We can use functions from the `incanter.stats` namespace to compute basic statistics on our data:

```clojure
(s/mean ($! :totalwgt-lb ds))
;=> 7.2623018494055485
(s/median ($! :totalwgt-lb ds))
;=> 7.375
```

(Note the use of `$!` to exclude nil values, which would otherwise trigger a null pointer exception.)
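If you’re curious what these two functions actually compute, here they are in plain Clojure (a sketch, not Incanter’s implementation):

```clojure
;; Arithmetic mean: sum divided by count
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Median: middle value of the sorted sequence, or the average
;; of the two middle values when the count is even
(defn median [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2.0))))

(mean [1 2 3 4])   ;=> 2.5
(median [1 2 3 4]) ;=> 2.5
```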

To compute several statistics at once:

```clojure
(s/summary ($! [:totalwgt-lb] ds))
;=> ({:col :totalwgt-lb, :min 0.0, :max 15.4375, :mean 7.2623018494055485, :median 7.375, :is-numeric true})
```

Note that, while `mean` and `median` take a sequence of values (the argument to `$!` is just a keyword), the `summary` function expects a dataset (the argument to `$!` is a vector).

## Do First Babies Arrive Late?

We now know enough to have a first attempt at answering this question. The columns we’ll use are:

- `:outcome`: pregnancy outcome (1 == live birth)
- `:birthord`: birth order
- `:prglngth`: duration of completed pregnancy in weeks

Compute the mean pregnancy length for the first birth:

```clojure
(s/mean ($! :prglngth ($where {:outcome 1 :birthord 1} ds)))
;=> 38.60095173351461
```

…and for subsequent births:

```clojure
(s/mean ($! :prglngth ($where {:outcome 1 :birthord {:$ne 1}} ds)))
;=> 38.52291446673706
```

The difference between these two values is just 0.08 weeks, so I’d say that these data do not indicate that first babies arrive late.
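To put that difference in more intuitive units, we can convert it from weeks to hours:

```clojure
;; Difference between the two means, converted from weeks to hours
(let [mean-first 38.60095173351461
      mean-other 38.52291446673706
      diff-weeks (- mean-first mean-other)]
  (* diff-weeks 7 24))
;=> about 13 hours
```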

Here we’ve computed mean pregnancy length for first babies and others; if we want a table of mean pregnancy length by birth order, we can use `$rollup` again:

```clojure
($rollup :mean :prglngth :birthord ($where {:outcome 1 :prglngth $not-nil} ds))
```

| :birthord | :prglngth   |
|-----------|-------------|
| 3         | 47501/1234  |
| 4         | 16187/421   |
| 5         | 2419/63     |
| 10        | 36          |
| 9         | 75/2        |
| 7         | 763/20      |
| 1         | 56782/1471  |
| 8         | 263/7       |
| 6         | 1903/50     |
| 2         | 55420/1437  |

The mean has been returned as a rational, but we can use `transform-col` to convert it to a floating-point number:

```clojure
(as-> ds x
  ($where {:outcome 1 :prglngth $not-nil} x)
  ($rollup :mean :prglngth :birthord x)
  (i/transform-col x :prglngth float))
```

| :birthord | :prglngth |
|-----------|-----------|
| 3         | 38.49352  |
| 4         | 38.448933 |
| 5         | 38.396824 |
| 10        | 36.0      |
| 9         | 37.5      |
| 7         | 38.15     |
| 1         | 38.600952 |
| 8         | 37.57143  |
| 6         | 38.06     |
| 2         | 38.56646  |

Finally, we can use `$order` to sort this dataset on birth order:

```clojure
(as-> ds x
  ($where {:outcome 1 :prglngth $not-nil} x)
  ($rollup :mean :prglngth :birthord x)
  (i/transform-col x :prglngth float)
  ($order :birthord :asc x))
```

| :birthord | :prglngth |
|-----------|-----------|
| 1         | 38.600952 |
| 2         | 38.56646  |
| 3         | 38.49352  |
| 4         | 38.448933 |
| 5         | 38.396824 |
| 6         | 38.06     |
| 7         | 38.15     |
| 8         | 37.57143  |
| 9         | 37.5      |
| 10        | 36.0      |

The Incanter functions `$where`, `$rollup`, `$order`, etc. all take a dataset to act on as their last argument. If this argument is omitted, they use the dynamic `$data` variable, which is usually bound using `with-data`. So the following two expressions are equivalent:

```clojure
($where {:outcome 1 :prglngth $not-nil} ds)

(with-data ds
  ($where {:outcome 1 :prglngth $not-nil}))
```

It’s a bit annoying that we have to use `as->` when we add `transform-col` to the mix, as this function takes the dataset as its first argument. Let’s add the following to our `thinkstats.incanter` namespace:

```clojure
(defn $transform
  "Like Incanter's `transform-col`, but takes the dataset as an optional
  last argument and, when not specified, uses the dynamically-bound
  `$data`."
  [col f & args]
  (let [[ds args] (if (or (i/matrix? (last args)) (i/dataset? (last args)))
                    [(last args) (butlast args)]
                    [i/$data args])]
    (apply i/transform-col ds col f args)))
```
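The dispatch trick here, checking whether the last argument looks like a dataset and falling back to the dynamic var when it doesn’t, can be seen in isolation in plain Clojure, using vectors as a stand-in for datasets:

```clojure
;; Stand-in for Incanter's dynamic $data variable
(def ^:dynamic *data* nil)

;; Treat a trailing vector argument as the data; otherwise use *data*
(defn transform*
  [f & args]
  (let [data (if (vector? (last args)) (last args) *data*)]
    (mapv f data)))

(transform* inc [1 2 3])  ;=> [2 3 4]

(binding [*data* [1 2 3]]
  (transform* inc))       ;=> [2 3 4]
```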

Now we can use the `->>` threading macro:

```clojure
(->> ($where {:outcome 1 :prglngth $not-nil} ds)
     ($rollup :mean :prglngth :birthord)
     ($transform :prglngth float)
     ($order :birthord :asc))
```

We have now met most of the core Incanter functions for manipulating datasets, and a few of the statistics functions. I hope that, as we get further into the book, we’ll learn how to calculate error bounds for computed values, and how to decide when we have a statistically significant result. In the next instalment we start to look at statistical distributions and plot our first histograms.

## The A-Z of A/B testing

Dr Shrividya Ravi spoke about the statistics of A/B testing at the Data Insights Cambridge meetup. It’s now live on the Metail YouTube channel; watch below or click here.

### A – Z of A/B testing

Randomised control trials have been a key part of medical science since the 18th century. More recently they have gained rapid traction in the e-commerce world where the term ‘A/B testing’ has become synonymous with businesses that are innovative and data-driven.

A/B testing has become the ‘status quo’ for retail website development – enabling product managers and marketing professionals to positively affect the customer journey; the sales funnel in particular. Combining event stream data with sound questions and good experiment design, these controlled trials become powerful tools for insight into user behaviour.

This talk will present a comprehensive overview of A/B testing discussing both the advantages and the caveats. A series of case studies and toy examples will detail the myriad of analyses possible from rich web events data. Topics covered will include inference with hypothesis testing, regression, bootstrapping, Bayesian models and parametric simulations.

You can check out the slides below or alternatively download them here:

## Preview of Data Insights Cambridge, 4 Feb 2016

The first Data Insights Cambridge meetup of 2016 is nearly upon us. Metail looks forward to welcoming Sean McGuire, from the University of Cambridge Research Institutional Services, who will present on ‘Supercomputing for your data’.

What does Supercomputing for your Data mean?

Data proliferation and collection mean that even small companies are capable of gathering vast amounts of data very quickly these days. But how do companies make the move from desktop or small compute clusters to larger clusters as their data grows? Knowledge of the tools and equipment needed to scale is not necessarily part of the existing knowledge base. This talk will describe how Research Institutional Services (University of Cambridge) is helping companies from a wide range of sectors, from Life Science to Oil and Gas to Manufacturing. We’ll cover everything from data security to how to go about designing components for a large compute and storage cluster.

The Speaker:

Sean has spent the last 20 years working for two well-known vendors in the Super Computing space:

• Intel Corporation, Director of HPC EMEA
• Seagate Storage Systems, VP EMEA

Sean has worked in sales, operations and people management before moving into senior EMEA-based roles with responsibility for business unit P&Ls.

The meetup is scheduled for Thursday, February 4, 2016 at 7:00 pm at 50 St Andrew’s St, CB2 3AH. We hope to see you there, just sign up for it on the Data Insights Cambridge meetup page.