In the first part of this series of posts, we introduced the idea of trying to detect flesh in images by looking at the colour values of individual pixels in the image. This produces reasonable results, but far too many “false positives” due to the fact that other items in the scene, such as hair and clothes, may be flesh-coloured too.

Boolean Pixel Function

In the example below, pixels in the left-hand image that are fleshy (R > G > B) are rendered red in the right-hand image, whereas non-fleshy pixels are rendered green:

Simple flesh detection

Simple flesh detection

Fuzzy Pixel Function

We can improve things slightly by using fuzzy logic. Our original fleshy function (R > G > B) is actually made up of two conditions: a pixel is “fleshy” if the red component is greater than the green component and the green component is greater than the blue component. These two conditions are binary, but they could be made fuzzy. Consider the following JavaScript function:

function fuzzy(x, false_limit, true_limit) {
  var y = (x - false_limit) / (true_limit - false_limit);
  return Math.min(Math.max(y, 0), 1);

This produces a fuzzy truth value between zero, meaning definitely false, and one, meaning definitely true:


We can then compose a fuzzy logic expression for fleshiness (notice that the fuzzy AND operator is simply multiplication):

var rg = fuzzy(r - g, 0, 0.10);
var gb = fuzzy(g - b, 0, 0.02);
var fleshiness = rg * gb;

The values 0.10 and 0.02 were derived empirically. Effectively, we’re saying that we expect the red value to be quite a bit greater than the green channel; the difference between the green and blue values is less important.

The fuzzy approach gives us marginally better results. Parts of the hair are deemed to be less likely to be fleshy, as are some portions of the dress pattern.


But, as mentioned at the end of Part One, we need a radically different approach to consistently find accidentally-rendered “naughty bits” in an image.

Chameleon Detector

Fortunately, we have control over the rendering pipeline of these images, so there’s nothing stopping us from rendering the scene twice with slightly different parameters. Let us pretend that belly buttons are considered “naughty” and that we want to detect renders that show some or all of this body part. When we render body parts, we use texture mapping on to a 3D mesh. If we “paint” over the naughty bits in the source texture maps with a known colour (say, green) and render the scene, we may get the following for two different outfits:


For the purposes of clarity, we’ve painted a large star over the belly button. In reality, the painted region would be smaller and more accurately shaped. If we render the scene again with the naughty bits over-painted with the same shape but a different colour, say, red, we get:


Obviously, the image on the right is unchanged by this modification to the skin texture, but the image on the left is. All we need to do is run the two sets of images through a very simple (fuzzy) comparator to find visible naughty bits:


As can be seen, this “chameleon” technique produces a strong signal. And even though it requires two renders per image, there are other advantages too:

  1. The regions considered “naughty” are hand-painted into the source skin textures. This is both intuitive and flexible.
  2. Different “naughtiness maps” can easily be used for different regions and cultures.
  3. One of the outputs of the technique is an image illustrating which naughty bit is visible and where.
  4. It is body shape agnostic.
  5. It is viewpoint agnostic.
  6. It handles translucent garments gracefully, particularly if a fuzzy comparator is used.
  7. It does not matter how complex the scene is.
  8. The code used to run the test is identical to the final rendering code: only input texture data is modified.

This is the second instalment of our Think Stats study group; we are working through Allen Downey’s Think Stats, implementing everything in Clojure. In the first part we implemented a parser for Stata dictionary and data files. Now we are going to use that to start exploring the National Survey of Family Growth data with Incanter, a Clojure library for statistical computing and graphics. We are still working through Chapter 1 of the book, and in this instalment we cover sections 1.4 DataFrames through to 1.7 Validation.

If you’d like to follow along, start by cloning our thinkstats repository from Github:

git clone --recursive

I’ve made two changes since writing the first post in this series. I realised that I could include Allen’s repository as a git submodule, hence the --recursive option above. This means the data files will be in a predictable place in our project folder so we can refer to them in the examples. I’ve also included Gorilla REPL in the project, so if you want to try out the examples but aren’t familiar with the Clojure tool chain, you can simply run:

lein gorilla

This will print out a URL for you to open in your browser. You can then start running the examples and seeing the output in your browser. Read more about Gorilla REPL here:

To Business…

Gorilla has created the namespace harmonious-willow for our worksheet. We’ll start by importing the Incanter and thinkstats namespaces we require:

(ns harmonious-willow
  (:require [incanter.core :as i
              :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.dct-parser :as dct]))

Incanter defines a number of handy functions whose names begin with $; we’re likely to use these a lot, so we’ve imported them into our namespace. We’ll refer to the other Incanter functions we need by qualifying them with the i/ or s/ prefix.

Load the NFSG Pregnancy data into an Incanter dataset:

(def ds (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"

Incanter’s dim function tells us the number of rows and columns in the dataset:

(i/dim ds)
;=> [13593 243]

and col-names lists the column names:

(i/col-names ds)
;=> [:caseid :pregordr :howpreg-n :howpreg-p ...]

We can select a subset of rows or columns from the dataset using sel:

(i/sel ds :cols [:caseid :pregordr] :rows (range 10))

Either of :rows or :cols may be omitted, but you’ll get a lot of data back if you ask for all rows. Selecting subsets of the dataset is such a common thing to do that Incanter provides the function $ as a short-cut (but note the different argument order):

($ (range 10) [:caseid :pregordr] ds)

If the first argument is omitted, it will return all rows. This returns a new Incanter dataset, but  if you ask for just a single column and don’t wrap the argument in a vector, you get back a sequence of values for that column:

(take 10 ($ :caseid ds))
;=> ("1" "1" "2" "2" "2" "6" "6" "6" "7" "7")

We can also select a subset of rows using Incanter’s $where function, which provides a succinct syntax for selecting rows that match a predicate. For example, to select rows where the :caseid is 6, we can do:

($ [:caseid :pregordr :outcome] ($where {:caseid "6"} ds))

(Note that we’re still using $ to limit the columns returned.)  There are lots of other options to $where; for example, to find all the case ids where 3000 <= :agepreg < 3100:

($ :caseid ($where {:agepreg {:$gte 3000 :$lt 3100}} ds))
;=> ("6" "15" "21" "36" "92" "142" "176" "210" ...)

The $where function is a convenience wrapper for query-dataset, so we need to look at the documentation for the latter to find out the other supported options:

(clojure.repl/doc i/query-dataset)

Cleaning data

Before we start to analyze the data, we may want to remove outliers or other special values. For example, the :birthwgt-lb column gives the birth weight in pounds of the first baby in the pregnancy. Let’s look at the top 10 values:

(take 10 (sort > (distinct ($ :birthwgt-lb ds))))
;=> Exception thrown: java.lang.NullPointerException

Oops! That’s not what we wanted, we’ll have to remove nil values before sorting. We can use Incanter’s $where to do this. Although $where has a number of built-in predicates, there isn’t one to check for nil values, so we have to write our own:

(def $not-nil {:$fn (complement nil?)})

(take 10 ($ :birthwgt-lb ($where {:birthwgt-lb $not-nil} ds)))
;=> (8 7 9 7 6 8 9 8 7 6)

(take 10 (sort > (distinct ($ :birthwgt-lb
                             ($where {:birthwgt-lb $not-nil} ds)))))
;=> (99 98 97 51 15 14 13 12 11 10)

This is still a bit cumbersome, so let’s write a variant of sel that returns only the rows where none of the specified columns are nil:

(defn ensure-collection
  (if (coll? x) x (vector x)))

(defn sel-defined
  [ds & {:keys [rows cols]}]
  (let [rows (or rows :all)
        cols (or cols (i/col-names ds))]
    (i/sel ($where (zipmap (ensure-collection cols) (repeat $not-nil))
           :rows rows :cols cols)))

(take 10 (sort > (distinct (sel-defined ds :cols :birthwgt-lb))))
;=> (99 98 97 51 15 14 13 12 11 10)

Looking up the definition of :birthwgt-lb in the code book, we see that values greater than 95 encode special meaning:

Value Meaning
97 Not ascertained
98 Refused
99 Don’t know

We’d like to remove these values (and the obvious outlier 51) from the dataset before processing it. Incanter provides the function transform-col that applies a function to each value in the specified column of a dataset and returns a new dataset. Using this, we can write a helper function for setting illegal values to nil:

(defn set-invalid-nil
  [ds col valid?]
  (i/transform-col ds col (fn [v] (when (and (not (nil? v)) (valid? v)) v))))

(def ds' (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99})))

(take 10 (sort > (distinct (sel-defined ds' :cols :birthwgt-lb))))
;=> (15 14 13 12 11 10 9 8 7 6)

We should also update the :birthwgt-oz column to remove any values greater than 15:

(def ds'
    (-> ds
        (set-invalid-nil :birthwgt-lb (complement #{51 97 98 99}))
        (set-invalid-nil :birthwgt-oz (fn [v] (<= 0 v 15)))))

Transforming data

We used the transform-col function in the implementation of set-invalid-nil above. We can also use this to perform an arbitrary calculation on a value. For example, the :agepreg column contains the age of the participant in centiyears (hundredths of a year):

(i/head (sel-defined ds' :cols :agepreg))
;=> (3316 3925 1433 1783 1833 2700 2883 3016 2808 3233)

Let’s transform this to years (perhaps fractional):

(defn centiyears->years
  (when v (/ v 100.0)))

(def ds' (i/transform-col ds' :agepreg centiyears->years))
(i/head (sel-defined ds' :cols :agepreg))
;=> (33.16 39.25 14.33 17.83 18.33 27.0 28.83 30.16 28.08 32.33)

Augmenting data

The final function we’ll show you this time is add-derived-column; this function adds a column to a dataset, where the added column is a function of other columns. For example:

(defn compute-totalwgt-lb
  [lb oz]
  (when lb (+ lb (/ (or oz 0) 16.0))))

(def ds' (i/add-derived-column :totalwgt-lb
                               [:birthwgt-lb :birthwgt-oz]

(i/head (sel-defined ds' :cols :totalwgt-lb))
;=> (8.8125 7.875 9.125 7.0 6.1875 8.5625 9.5625 8.375 7.5625 6.625)

Putting it all together

We’ve built up a new dataset above with a number of transformations. Let’s bring these all together into a single function that will thread the dataset through all these transformations. We can’t use the usual -> or ->> macros because of an inconsistency in the argument order of the transformations, but Clojure’s as-> comes to the rescue here.

(defn clean-and-augment-fem-preg
  (as-> ds ds
    (set-invalid-nil ds :birthwgt-lb (complement #{51 97 98 99}))
    (set-invalid-nil ds :birthwgt-oz (fn [v] (<= 0 v 15)))
    (i/transform-col ds :agepreg centiyears->years)
    (i/add-derived-column :totalwgt-lb
                          [:birthwgt-lb :birthwgt-oz]

Now we can do:

(def ds (clean-and-augment-fem-preg
          (dct/as-dataset "ThinkStats2/code/2002FemPreg.dct"

The Incanter helper functions we’ve implemented can be found in the thinkstats.incanter namespace, along with a $! short-cut for sel-defined that was a bit too complex to show in this post.

In the next part in this series, we start to explore the cleaned dataset.