This is the second instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the first part we implemented a parser for Stata dictionary and data files. Now we are going to use that to start exploring the National Survey of Family Growth data with Incanter, a Clojure library for statistical computing and graphics. We are still working through Chapter 1 of the book, and in this instalment we cover sections 1.4 DataFrames through to 1.7 Validation.

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone https://github.com/ray1729/thinkstats.git --recursive

I’ve made two changes since writing the first post in this series. I realised that I could include Allen’s repository as a git submodule, hence the --recursive option above. This means the data files will be in a predictable place in our project folder so we can refer to them in the examples. I’ve also included Gorilla REPL in the project, so if you want to try out the examples but aren’t familiar with the Clojure tool chain, you can simply run:

lein gorilla

This will print out a URL for you to open in your browser. You can then start running the examples and seeing the output in your browser. Read more about Gorilla REPL here: http://gorilla-repl.org/.

To Business…

Gorilla has created the namespace harmonious-willow for our worksheet. We’ll start by importing the Incanter and thinkstats namespaces we require:

(ns harmonious-willow
  (:require [incanter.core :as i
              :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.dct-parser :as dct]))

Incanter defines a number of handy functions whose names begin with $; we’re likely to use these a lot, so we’ve imported them into our namespace. We’ll refer to the other Incanter functions we need by qualifying them with the i/ or s/ prefix.

Load the NFSG Pregnancy data into an Incanter dataset:

(def ds (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"
                        "ThinkStats2/code/2002FemPreg.dat.gz"))

Incanter’s dim function tells us the number of rows and columns in the dataset:

(i/dim ds)
;=> [13593 243]

and col-names lists the column names:

(i/col-names ds)
;=> [:caseid :pregordr :howpreg-n :howpreg-p ...]

We can select a subset of rows or columns from the dataset using sel:

(i/sel ds :cols [:caseid :pregordr] :rows (range 10))

Either of :rows or :cols may be omitted, but you’ll get a lot of data back if you ask for all rows. Selecting subsets of the dataset is such a common thing to do that Incanter provides the function $ as a short-cut (but note the different argument order):

($ (range 10) [:caseid :pregordr] ds)

If the first argument is omitted, it will return all rows. This returns a new Incanter dataset, but  if you ask for just a single column and don’t wrap the argument in a vector, you get back a sequence of values for that column:

(take 10 ($ :caseid ds))
;=> ("1" "1" "2" "2" "2" "6" "6" "6" "7" "7")

We can also select a subset of rows using Incanter’s $where function, which provides a succinct syntax for selecting rows that match a predicate. For example, to select rows where the :caseid is 6, we can do:

($ [:caseid :pregordr :outcome] ($where {:caseid "6"} ds))

(Note that we’re still using $ to limit the columns returned.)  There are lots of other options to $where; for example, to find all the case ids where 3000 <= :agepreg < 3100:

($ :caseid ($where {:agepreg {:$gte 3000 :$lt 3100}} ds))
;=> ("6" "15" "21" "36" "92" "142" "176" "210" ...)

The $where function is a convenience wrapper for query-dataset, so we need to look at the documentation for the latter to find out the other supported options:

(clojure.repl/doc i/query-dataset)

Cleaning data

Before we start to analyze the data, we may want to remove outliers or other special values. For example, the :birthwgt-lb column gives the birth weight in pounds of the first baby in the pregnancy. Let’s look at the top 10 values:

(take 10 (sort > (distinct ($ :birthwgt-lb ds))))
;=> Exception thrown: java.lang.NullPointerException

Oops! That’s not what we wanted, we’ll have to remove nil values before sorting. We can use Incanter’s $where to do this. Although $where has a number of built-in predicates, there isn’t one to check for nil values, so we have to write our own:

(def $not-nil {:$fn (complement nil?)})

(take 10 ($ :birthwgt-lb ($where {:birthwgt-lb $not-nil} ds)))
;=> (8 7 9 7 6 8 9 8 7 6)

(take 10 (sort > (distinct ($ :birthwgt-lb
                             ($where {:birthwgt-lb $not-nil} ds)))))
;=> (99 98 97 51 15 14 13 12 11 10)

This is still a bit cumbersome, so let’s write a variant of sel that returns only the rows where none of the specified columns are nil:

(defn ensure-collection
  [x]
  (if (coll? x) x (vector x)))

(defn sel-defined
  [ds & {:keys [rows cols]}]
  (let [rows (or rows :all)
        cols (or cols (i/col-names ds))]
    (i/sel ($where (zipmap (ensure-collection cols) (repeat $not-nil))
                   ds)
           :rows rows :cols cols)))

(take 10 (sort > (distinct (sel-defined ds :cols :birthwgt-lb))))
;=> (99 98 97 51 15 14 13 12 11 10)

Looking up the definition of :birthwgt-lb in the code book, we see that values greater than 95 encode special meaning:

Value Meaning
97 Not ascertained
98 Refused
99 Don’t know

We’d like to remove these values (and the obvious outlier 51) from the dataset before processing it. Incanter provides the function transform-col that applies a function to each value in the specified column of a dataset and returns a new dataset. Using this, we can write a helper function for setting illegal values to nil:

(defn set-invalid-nil
  [ds col valid?]
  (i/transform-col ds col (fn [v] (when (and (not (nil? v)) (valid? v)) v))))

(def ds' (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99})))

(take 10 (sort > (distinct (sel-defined ds' :cols :birthwgt-lb))))
;=> (15 14 13 12 11 10 9 8 7 6)

We should also update the :birthwgt-oz column to remove any values greater than 15:

(def ds'
    (-> ds
        (set-invalid-nil :birthwgt-lb (complement #{51 97 98 99}))
        (set-invalid-nil :birthwgt-oz (fn [v] (<= 0 v 15)))))

Transforming data

We used the transform-col function in the implementation of set-invalid-nil above. We can also use this to perform an arbitrary calculation on a value. For example, the :agepreg column contains the age of the participant in centiyears (hundredths of a year):

(i/head (sel-defined ds' :cols :agepreg))
;=> (3316 3925 1433 1783 1833 2700 2883 3016 2808 3233)

Let’s transform this to years (perhaps fractional):

(defn centiyears->years
  [v]
  (when v (/ v 100.0)))

(def ds' (i/transform-col ds' :agepreg centiyears->years))
(i/head (sel-defined ds' :cols :agepreg))
;=> (33.16 39.25 14.33 17.83 18.33 27.0 28.83 30.16 28.08 32.33)

Augmenting data

The final function we’ll show you this time is add-derived-column; this function adds a column to a dataset, where the added column is a function of other columns. For example:

(defn compute-totalwgt-lb
  [lb oz]
  (when lb (+ lb (/ (or oz 0) 16.0))))

(def ds' (i/add-derived-column :totalwgt-lb
                               [:birthwgt-lb :birthwgt-oz]
                               compute-totalwgt-lb
                               ds'))

(i/head (sel-defined ds' :cols :totalwgt-lb))
;=> (8.8125 7.875 9.125 7.0 6.1875 8.5625 9.5625 8.375 7.5625 6.625)

Putting it all together

We’ve built up a new dataset above with a number of transformations. Let’s bring these all together into a single function that will thread the dataset through all these transformations. We can’t use the usual -> or ->> macros because of an inconsistency in the argument order of the transformations, but Clojure’s as-> comes to the rescue here.

(defn clean-and-augment-fem-preg
  [ds]
  (as-> ds ds
    (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99}))
    (set-invalid-nil ds :birthwgt-oz (fn [v] (<= 0 v 15)))
    (i/transform-col ds :agepreg centiyears->years)
    (i/add-derived-column :totalwgt-lb
                          [:birthwgt-lb :birthwgt-oz]
                          compute-totalwgt-lb
                          ds)))

Now we can do:

(def ds (clean-and-augment-fem-preg
          (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"
                          "ThinkStats2/code/2002FemPreg.dat.gz")))

The Incanter helper functions we’ve implemented can be found in the thinkstats.incanter namespace, along with a $! short-cut for sel-defined that was a bit too complex to show in this post.

In the next part in this series, we start to explore the cleaned dataset.

Yule log

The winning entry in this years Yule Log bake off

The nights are growing shorter again. The office is looking bare. The coffee, lunch, tea, and biscuit breaks are lengthening so that they almost merge. It all points to the impending festival being nearly upon us. We are off to take well earned breaks and get ready to start afresh next year when you will be able to read more about flesh detection, about A/B testing, to see how the next chapter of Think Stats unfolds, and things we discovered during our Christmas holidays.

We will of course also keep you posted on the meetups we are hosting and would invite anyone interested to come and join us.

And so, until next year, we wish you a Merry Christmas and a happy new year.

 

I’m excited to be speaking at the December installment of the Cambridge functional programming meetup, hosted at Metail, this Thursday. I’ll be talking about making synthesized electronic music using Overtone.

Overtone

Overtone is an ‘Open Source toolkit for designing synthesizers and collaborating with music’. The library leverages Clojure’s power and generality for making flexible definitions of instruments, melodies and rhythms. Using code-reloading and the Clojure REPL, it provides an awesome environment for ‘live-coding’, a particularly modern form of improvisational music making!

The Clojure JVM process doesn’t actually synthesize audio in real-time itself; instead it connects to the separate program ‘scsynth’ (the Supercollider synthesis server), which is a high performance application written in C++.

Last summer at the Metail office-warming party we had Sam Aaron (one of the original Overtone developers, and another Cambridge resident) performing live. Check out http://meta-ex.com for some videos of other gigs he has done!

The Talk

In this talk, I’ll give a basic introduction to some of the features of Overtone library that I’ve had fun playing with. I’ll talk about several of the most important synthesis techniques, and demonstrate how they work (and what the results sound like!) in Overtone.

In the second part of the talk, I’ll describe how to put the noises created in the first part together to make some music. I’ll explain a bit of music theory and show how it can be put into practice, ending with a suitably festive demo!

The Speaker

I’m a Clojure programmer working at Metail in the data engineering team. I’ve always been fascinated by music: I have significantly more years experience as a choral singer than as a programmer! Currently I sing tenor with Selwyn College Chapel Choir.

Getting Started

Think Stats One of our new starters here at Metail was keen to brush up their statistics, and it’s more than 20 years since I completed an introductory course at university so I knew I would benefit from some revision. We also have a bunch of statisticians in the office who would like to brush up their Clojure, so I thought it might be fun to organise a lunchtime study group to work through Allen Downey’s Think Stats and attempt the exercises in Clojure. We’re using the second edition of the book which is available online in HTML format, and meeting on Wednesday lunchtimes to work through it together.

We’ll use Clojure’s Incanter library which provides utilities for statistical analysis and generating charts. Create a Leiningen project for our work:

lein new thinkstats

Make sure the project.clj depends on Clojure 1.7.0 and add a dependency on Incanter 1.5.6:

:dependencies [[org.clojure/clojure "1.7.0"]
               [incanter "1.5.6"]]

Parsing the data

In the first chapter of the book, we are introduced to a data set from the US Centers for Disease Control and Prevention, the National Survey of Family Growth. The data are in a gzipped file with fixed-width columns. An accompanying Stata dictionary describes the variable names, types, and column indices for each record. Our first job will be to parse the dictionary file and use that information to build a parser for the data.

We cloned the Github repository that accompanies Allen’s book:

git clone https://github.com/AllenDowney/ThinkStats2

Then created symlinks to the data files from our project:

cd thinkstats
mkdir data
cd data
for f in ../../ThinkStats2/code/{*.dat.gz,*.dct}; do ln -s $f; done

We can now read the Stata dictionary for the family growth study fromdata/2002FemPreg.dct. The dictionary looks like:

infile dictionary {
    _column(1)  str12    caseid     %12s  "RESPONDENT ID NUMBER"
    _column(13) byte     pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
}

If we skip the first and last lines of the dictionary, we can use a regular expression to parse each column definition:

(def dict-line-rx #"^\s+_column\((\d+)\)\s+(\S+)\s+(\S+)\s+%(\d+)(\S)\s+\"([^\"]+)\"")

We’re capturing the column position, colum type, column name, format and length, and description. Let’s test this at the REPL. First we have to read a line from the dictionary:

(require '[clojure.java.io :as io])
(def line (with-open [r (io/reader "data/2002FemPreg.dct")]
            (first (rest (line-seq r)))))

We use rest to skip the first line of the file then grab the first column definition. Now we can try matching this with our regular expression:

(re-find dict-line-rx line)

This returns the string that matched and the capture groups we defined in our regular expression:

["    _column(1)      str12        caseid  %12s  \"RESPONDENT ID NUMBER\""
 "1"
 "str12"
 "caseid"
 "12"
 "s"
 "RESPONDENT ID NUMBER"]

We need to do some post-processing of this result to parse the column index and length to integers; we’ll also replace underscores in the column name with hyphens, which makes for a more idiomatic Clojure variable name. Let’s wrap that up in a function:

(require '[clojure.string :as str])

(defn parse-dict-line
  [line]
  (let [[_ col type name f-len f-spec descr] (re-find dict-line-rx line)]
    {:col    (dec (Integer/parseInt col))
     :type   type
     :name   (str/replace name "_" "-")
     :f-len  (Integer/parseInt f-len)
     :f-spec f-spec
     :descr  descr}))

Note that we’re also decrementing the column index – we need zero-indexed column indices for Clojure’s substring function. Now when we parse our sample line we get:

{:col 0,
 :type "str12",
 :name "caseid",
 :f-len 12,
 :f-spec "s",
 :descr "RESPONDENT ID NUMBER"}

With this function in hand, we can write a parser for the dictionary file:

(defn read-dict-defn
  "Read a Stata dictionary file, return a vector of column definitions."
  [path]
  (with-open [r (io/reader path)]
    (mapv parse-dict-line (butlast (rest (line-seq r))))))

We use rest and butlast to skip the first and last lines of the file, and mapv to force eager evaluation and ensure we process all of the input before the reader is closed when we exit with-open.

(def dict (parse-dict-defn "data/2002FemPreg.dat"))

The dictionary tells us the starting position (:col) and length (:f-len) of each field, so we can use subs to extract the raw value of each column from the data. This will give us a string, and the :type key we’ve extracted from the dictionary tells us how to interpret this. We’ve seen the types str12 and byte above, but what other types appear in the dictionary?

(distinct (map :type dict))
;=> ("str12" "byte" "int" "float" "double")

We’ll leave str12 unchanged, coerce byte and int to Long, andfloat and double to Double:

(defn parse-value
  [type raw-value]
  (when (not (empty? raw-value))
    (case type
      ("str12")          raw-value
      ("byte" "int")     (Long/parseLong raw-value)
      ("float" "double") (Double/parseDouble raw-value))))

We can now build a record parser from the dictionary definition:

(defn make-row-parser
  "Parse a row from a Stata data file according to the specification in `dict`.
   Return a vector of columns."
  [dict]
  (fn [row]
    (reduce (fn [accum {:keys [col type name f-len]}]
              (let [raw-value (str/trim (subs row col (+ col f-len)))]
                (conj accum (parse-value type raw-value))))
            []
            dict)))

To read gzipped data, we need to open an input stream, coerce this to a GZIPInputStream, and construct a buffered reader from that. For convenience, we’ll define a function to do this automatically if passed a path ending in .gz.

(import 'java.util.zip.GZIPInputStream)

(defn reader
  "Open path with io/reader; coerce to a GZIPInputStream if suffix is .gz"
  [path]
  (if (.endsWith path ".gz")
    (io/reader (GZIPInputStream. (io/input-stream path)))
    (io/reader path)))

Given a dictionary and reader, we can parse the records from a data file:

(defn read-dct-data
  "Parse lines from `rdr` according to the specification in `dict`.
   Return a lazy sequence of parsed rows."
  [dict rdr]
  (let [parse-fn (make-row-parser dict)]
    (map parse-fn (line-seq rdr))))

Finally, we bring this all together with a function to parse the dictionary and data and return an Incanter dataset:

(require '[incanter.core :refer [dataset]])

(defn as-dataset
  "Read Stata data set, return an Incanter dataset."
  [dict-path data-path]
  (let [dict   (read-dict-defn dict-path)
        header (map (comp keyword :name) dict)]
    (with-open [r (reader data-path)]
      (dataset header (doall (read-dct-data dict r))))))

Getting the code

The code for all this is available on Github; if you’d like to follow along, you can fork my thinkstats repository.

The functions we’ve developed above are in the namespacethinkstats.dct-parser In the next article in this series, we use our parser to explore and clean the data using Incanter.

For the latest in the Data Insights Cambridge meetup which we host, we are delighted to be welcoming Metail’s very own Shrividya Ravi to speak about the ‘A – Z of A/B testing’.

What is A – Z of A/B testing?

Randomised control trials have been a key part of medical science since the 18th century. More recently they have gained rapid traction in the e-commerce world where the term ‘A/B testing’ has become synonymous with businesses that are innovative and data-driven.

A/B testing has become the ‘status quo’ for retail website development – enabling product managers and marketing professionals to positively affect the customer journey; the sales funnel in particular. Combining event stream data with sound questions and good experiment design, these controlled trials become powerful tools for insight into user behaviour.

This talk will present a comprehensive overview of A/B testing discussing both the advantages and the caveats. A series of case studies and toy examples will detail the myriad of analyses possible from rich web events data. Topics covered will include inference with hypothesis testing, regression, bootstrapping, Bayesian models and parametric simulations.

The Speaker

Dr Ravi transitioned to Data Science following her PhD in experimental materials physics. Working as a Data Scientist at Metail (online try-on technology), Shriv continues experimenting and teasing insights from data.

Head to the Data Insights Cambridge Meetup page to register.

A bunch of Metail Clojurians are off to Clojure eXchange 2015 this week.

Members from the Data Science, Data Engineering and Web teams will be catching up on what’s new and seeing how others are using Clojure to solve their problems. Metail make extensive use of Clojure and ClojureScript for a lot of our internal tools. We are also currently investigating the feasibility of using ClojureScript to implement the next version of our MeModel visualisation product instead of CoffeeScript and Backbone.

Some of the Clojure tech that we currently use is: Cascalog, Om (Now and Next), Immutant, Prismatic/Schema, Figwheel. Grab one of us if you want to talk about our experiences with any of these or anything Clojure or Data related.

All the content looks really interesting, some highlights of particular interest:

  • Bozhidar Batsov – CIDER: The journey so far and the road ahead
  • Kris Jenkins – ClojureScript: Architecting for Scale
  • Nicola Mometto – Immutable code analysis with tools.analyzer
  • Hans Hubner – Datomic in Practice

Looking forward to all the talks, catching up with old friends and making new ones. See you there.

I have recently needed to run varnish (A very fast web cache for busy sites) in a situation that also required use of HTTPS on the box. Unfortunately, Varnish does not not handle crypto, which is probably a good thing given how easy it is for programmers to make mistakes in their code, rendering the security useless!

Whilst recipes for Stunnel and Varnish together exist, information on running them on the same box whilst still presenting the original source IP to varnish for logging/load balancing purposes was scarce – the below configuration “worked for me”, at least on Debian 7.0. (Wheezy) You will need the xt_mark module which should be part of most distributions, but I found was missing from some hosted boxes and VMs with custom kernels. The specific versions of software we are running are based on:

  • Linux 3.13.5
  • Varnish 3.0.6
  • STunnel 5.24

If you are running Varnish 4, the VCL here will require rewriting as it has been updated from version 3.

IPTables – mark traffic from source port 8088 for routing
iptables -t mangle -A OUTPUT -p tcp -m multiport --sports 8088 -j MARK --set-xmark 0x1/0xffffffff

Routing configuration – anything marked by IPTables, send back to the local box. These two can be added under iface lo as “post-up” commands if you’re on a Debian box.
ip rule add fwmark 1 lookup 100
ip route add local 0.0.0.0/0 dev lo table 100

STunnel configuration. The connect IP MUST be an IP on the box other than loopback, i.e. it will not work if you specify 127.0.0.1.

[https]
accept = 443
connect = 10.1.1.1:8088
transparent = source

From default.vcl:
import std;

sub vcl_recv {
// Set header variables in a sensible way.
remove req.http.X-Forwarded-Proto;

if (server.port == 8088) {
set req.http.X-Forwarded-Proto = “https”;
} else {
set req.http.X-Forwarded-Proto = “http”;
}

set req.http.X-Forwarded-For = client.ip;
std.collect(req.http.X-Forwarded-For);
}

sub vcl_hash {
// SSL data returned may be different from non-SSL.
// (E.g. including https:// in URLs)
hash_data(server.port);
}

Clojure Meetup as part of the Cambridge Non-dysfunctional Programmers group, in Metails office at 50 St Andrews St, Cambridge

Clojure Meetup as part of the Cambridge Non-dysfunctional Programmers group, in Metails office at 50 St Andrews St, Cambridge

This month’s meetup of the Cambridge NonDysFunctional Programmers will be hosted here at Metail’s Cambridge office next Thursday (26th November) from 6.30pm. I (Ray Miller) will be giving a hands-on introduction to web development with Clojure, where attendees get to implement their first Clojure web application from the ground up.

Along the way, we’ll learn about the Compojure routing library, Ring requests and responses, middleware, and generating HTML with hiccup. Time permitting, we’ll also cover interacting with a relational database and using buddy to add session-based authentication and authorization to our application.

The theme running through the tutorial is implementation of an ad server (an example I shamelessly stole from Dan Benjamin’s Meet Sinatra screencast). This demo application delivers a Javascript snippet to embed random ads in a web page and tracks user click-throughs. It also provides an administrative interface for reporting and managing ads. If you’ve already seen Dan’s screencast, you’ll see how Clojure compares with Sinatra to implement the same application.

See the Meetup page for full details and to sign up.

 

Shortly after I joined Metail in late summer 2011 there was a typical English bit of weather; namely an apocalyptic downfall of rain just as I was leaving work on a Wednesday. I was immediately transported back to a Kenyan balcony on which I had spent many happy hours as a proper colonial – sipping a G&T and watching the sun go down. The temperature was right, the sun was low in the sky, the tree was in full bloom (well it had leaves on it at any rate) and a tropical storm was in the air. Thinking that it was only fair to share this experience with those who had not been fortunate enough to have exposure to the original I packed my bag on Friday with a selection of gins, tonic and appropriate accoutrements.

This isn't www.xkcd.com

Unpacking the essentials after moving office

This went down well with colleagues and every Friday since we have endeavoured to gather and raise a glass to the end of the week. It has provided a great opportunity to relax and to meet other team mates and their partners and children. It also allowed teams to bond and cross-team conversations to happen. We also got the chance to hear the result of Nick’s nimble fingers (this harks back to the halcyon days when there was room to swing a cat and strum a guitar upstairs in 16), and share in the occasional sing song.

As the team has expanded I have not been able to support the whole cost so have asked for contributions of £10 a month towards the cost and welcome any suggestions or requests for particular drinks. As well as widening the group who make the drinks each week. We have progressed from simple gins and tonics to brambles, why nots, and even non gin-based drinks – manhattans, daiquiris, sidecars and orange brulées spring immediately to mind. Most of our shopping is done at Cambridge Wine Merchants who are always happy to help us out with ideas or substitutions for ingredients.

Oh and finally like any good dealer – your first few hits are free.

 

This is not http://www.smbc-comics.com/

Ace of Clubs Daiquiris made by Ian Taylor and photographed by Andrew Dunn

For anyone thinking of setting up something similar, the following price structure was designed to be as inclusive as possible and has been in place as the office head count has more than quadrupled. It has kept the club in the black and allowed it to provide something alcoholic and non-alcoholic for everyone at Christmas. Membership of the club is purely optional.

Members: £10 a month (first month free)

Interns: drink free

Partners: drink free

Guests: drink free

Non-Members from the office: drink free

Most of Metail’s try-it-on tech is delivered as a single page JavaScript application that embeds inside retailer sites, calling back to Metail-hosted services in Europe over HTTPS/JSON. In this post, Nick Day describes some of the more difficult trade-offs that we needed to make to make the best use of Content Distribution Networks to improve the speed of our app.

tl;dr When you’re using a CDN to accelerate the delivery of a 3rd party HTTPS / AJAX app to users far away, you end up having to choose your devils. We chose the one called “pre-flight OPTIONS request”.

We’re currently working with Dafiti, a Brazilian company whose user base is almost exclusively in-country. Our web-application and web-cache servers are based in the UK, so to aid user-experience we’ve moved as much as possible of our static and dynamic content so it’s served through a CDN (CloudFront, which handily has edge nodes in São Paulo).  

The last piece of this “CDN-ifying” has been to serve the initial HTML page from CloudFront.  Naturally, this has a large impact on page load speeds as the browser can do nothing else until this resource has been obtained; so fulfilling the request from a nearby CloudFront node is hugely preferable to having it trot under the Atlantic to hit our servers and then back.  However, as this page is now being served from the CDN’s domain, browser security constraints mean that any AJAX requests being made from it must either:

  1. also go through the CDN domain, or
  2. be served directly from your service domain using cross-origin resource sharing (CORS)

For cacheable content obtained through AJAX (in our case things like garment information) the preferred option is clear; serve it through the CDN.  The decision is more involved for content that you don’t want to or can’t cache, like requests that you need to be up-to-date (e.g. account details) or update requests. In this case your choice is for the lesser of two evils.

Serving through the CDN:

  • you’re making every one of these requests take an indirect route, as the CDN will be passing them all on for your servers to handle
  • on top of that there’s an extra hit; SSL de/encryption must happen in the CDN for these requests to be forwarded
  • you have to manage potentially complex CDN config to do the right thing for each request.  You wouldn’t have to do this with your typical CDN-cached content, but in practise you may need to add cookies and headers to a whitelist for the CDN to forward.  CloudFront makes this problem extra-fun by lacking a programmatic way to update/version your distribution configuration, making it easy to take down your service accidentally, and hard to roll back.
  • you’re paying (real money) each time you route one of these requests (somewhat unnecessarily) through a CDN

Alternatively, if you send these requests directly to your service domain:

  • you’ll need to implement and maintain CORS handling for those endpoints being requested from other domains
  • for requests deemed “non-simple” by the CORS spec (methods other than GET/HEAD or POST; using Content-Type other than a very strict set, which surprisingly does not include JSON; or using custom headers) the browser will automatically make a pre-flight OPTIONS request to check that it is able to make the request before actually making the one you desire.  This is a big deal if your longer-than-you’d-like cross-Atlantic request suddenly becomes two!  It was this point that almost pushed me toward taking the CDN route for these requests.

Fortunately for us, there are two saving graces with this latter choice of sending the requests directly to our service domain:

  1. All of the requests in our app that require a pre-flight request are non-blocking, that is that they don’t require a user to wait before viewing part of, or accessing functionality in the app. They’re all “update in the background” kind of requests. Granted, if these take too long you could imagine timing issues creeping in, but we should be coding defensively to guard against those kind of asynchronous problems anyway.
  2. The CORS spec allows you to specify a caching max-age for the OPTIONS requests. Of course, since you’re not going through a CDN, this caching will be on a per-user basis against the web-cache. At the moment, unfortunately for us, even a cache hit must come to Europe. The answer here is that it’s probably going to make sense for us to move web-caches closer to where the users are, even if the apps stay rooted in the UK.  Another slightly niggling point is that some browsers currently have an enforced maximum cache period that’s quite short for pre-flight requests (e.g. Chrome is 5 minutes).  As this is shorter than our average user session, it means that it’s likely a user will have to make those same requests more than once per session.  From half a world away, that’s not ideal.

Taking all of these points into consideration, we favoured sending requests directly to our service domain; preferring to take the hit with the “background” requests that require a pre-flight in order to improve performance of most of our non-caching requests.