Data Actually

David Robinson posted a great article Analyzing networks of characters in ‘Love Actually’ on 25th December 2015, which uses R to analyse the connections between the characters in the film Love Actually.

This Jupyter notebook and the associated python code attempts to reproduce his analysis using tools from the Python ecosystem.

I tweeted a link to an initial version of this notebook over the Christmas holidays and republished it here in response to interest from my colleagues in Metail Tech.

Package setup

To start we need to import some useful packages, shown below. Some of these needed to be installed using the Anaconda distribution or pip if conda failed:

pip install ggplot
conda install graphviz
conda install networkx
conda install pyplot

In [1]:

from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage
import ggplot as gg
import networkx as nx

In [2]:

%matplotlib inline

Data import

First we need to define a couple of functions to read the ‘Love Actually’ script into a list of lines and the cast into a Pandas DataFrame. The data_dir variable gets the location of the directory containing the input files from an environment variable so set this environment variable to the location appropriate for you if you’re following along.

In [3]:

data_dir = os.path.join(os.getenv('MDA_DATA_DIR', '/home/mattmcd/Work/Data'), 'LoveActually') 
# Alternative: data_dir = os.getcwd() # And copy files to same directory as script

def read_script():
    """Read the Love Actually script from text file into list of lines
    The script is first Google hit for 'Love Actually script' as a doc
    file.  Use catdoc or Libre Office to save to text format.
    """
    with open(os.path.join(data_dir, 'love_actually.txt'), 'r') as f:
        lines = [line.strip() for line in f]
    return lines

def read_actors():
    """Read the mapping from character to actor using the varianceexplained data file
    Used curl -O http://varianceexplained.org/files/love_actually_cast.csv to get a local copy
    """
    return pd.read_csv(os.path.join(data_dir, 'love_actually_cast.csv'))

The cell below reproduces the logic in the first cell of the original article. It doesn’t feel quite as nice to me as the dplyr syntax but is not too bad.

In [4]:

def parse_script(raw):
    df = pd.DataFrame(raw, columns=['raw'])

    df = df.query('raw != ""')
    df = df[~df.raw.str.contains("(song)")]
    lines = (df.
             assign(is_scene=lambda d: d.raw.str.contains(" Scene ")).
             assign(scene=lambda d: d.is_scene.cumsum()).
             query('not is_scene'))
    speakers = lines.raw.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
    lines = (pd.concat([lines, speakers], axis=1).
             dropna().
             assign(line=lambda d: np.cumsum(~d.speaker.isnull())))

    lines.drop(['raw', 'is_scene'], axis=1, inplace=True)

    return lines

In [5]:

def read_all():
    lines = parse_script(read_script())
    cast = read_actors()
    combined = lines.merge(cast).sort('line').assign(
        character=lambda d: d.speaker + ' (' + d.actor + ')').reindex()
    # Decode bytes to unicode
    combined['character'] = map(lambda s: s.decode('utf-8'), combined['character'])
    return combined

In [6]:

# Read in script and cast into a dataframe
lines = read_all()
# Print the first few rows
lines.head(5)

Out[6]:

	scene	speaker	dialogue	line	actor	character
0	2	Billy	♪ I feel it in my fingers ♪ I feel it in my t…	2	Bill Nighy	Billy (Bill Nighy)
34	2	Joe	I’m afraid you did it again, Bill.	3	Gregor Fisher	Joe (Gregor Fisher)
1	2	Billy	It’s just I know the old version so well, you…	4	Bill Nighy	Billy (Bill Nighy)
35	2	Joe	Well, we all do. That’s why we’re making the …	5	Gregor Fisher	Joe (Gregor Fisher)
2	2	Billy	Right, OK, let’s go. ♪ I feel it in my finger…	6	Bill Nighy	Billy (Bill Nighy)

Constructing the n_character x n_scene matrix showing how many lines each character has in each scene is quite easy using pandas groupby method to create a hierarchical index, followed by the unstack method to convert the second level of the index into columns.

In [7]:

def get_scene_speaker_matrix(lines):
    by_speaker_scene = lines.groupby(['character', 'scene'])['line'].count()
    speaker_scene_matrix = by_speaker_scene.unstack().fillna(0)
    return by_speaker_scene, speaker_scene_matrix

In [8]:

# Group by speaker and scene and construct the speaker-scene matrix
by_speaker_scene, speaker_scene_matrix = get_scene_speaker_matrix(lines)

Analysis

Now we get to the analysis itself. First we perform a hierarchical clustering of the data using the same data normalization and clustering method as the original article. The leaf order in the dendrogram is returned for use in later steps of the analysis, as it has similar characters close to each other.

In [9]:

def plot_dendrogram(mat, normalize=True):
    # Cluster and plot dendrogram.  Return order after clustering.
    if normalize:
        # Normalize by number of lines
        mat = mat.div(mat.sum(axis=1), axis=0)
    Z = linkage(mat, method='complete', metric='cityblock')
    labels = mat.index
    f = plt.figure()
    ax = f.add_subplot(111)
    R = dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
               labels=labels, ax=ax, color_threshold=-1)
    f.tight_layout()
    ordering = R['ivl']
    return ordering

In [10]:

# Hierarchical cluster and return order of leaves
ordering = plot_dendrogram(speaker_scene_matrix)

In [11]:

print(ordering)

[u'Peter (Chiwetel Ejiofor)', u'Juliet (Keira Knightley)', u'Mark (Andrew Lincoln)', u'Harry (Alan Rickman)', u'Mia (Heike Makatsch)', u'Karl (Rodrigo Santoro)', u'Sarah (Laura Linney)', u'Karen (Emma Thompson)', u'Natalie (Martine McCutcheon)', u'PM (Hugh Grant)', u'Daniel (Liam Neeson)', u'Sam (Thomas Sangster)', u'Colin (Kris Marshall)', u'Tony (Abdul Salis)', u'Billy (Bill Nighy)', u'Joe (Gregor Fisher)', u'Jack (Martin Freeman)', u'Judy (Joanna Page)', u'Aurelia (L\xfacia Moniz)', u'Jamie (Colin Firth)']

Timeline

Plotting the timeline of character versus scene is very similar to the R version since we make use of the Python ggplot port from yhat. It seems like there are a few differences between this package and the R ggplot2 library.

In particular:

ggplot does not seem to handle categorical variables so it was necessary to introduce an extra character_code dimension
it didn’t seem possible to change the y-axis tick labels to character names so here the axis directions are swapped
the aes (aesthetic) does not seem to support ‘group’ so the geom_path joining characters in the same scene has been left out

(note that these points may just be limitations of my understanding of ggplot in Python)

In [12]:

def get_scenes_with_multiple_characters(by_speaker_scene):
    # Filter speaker scene dataframe to remove scenes with only one speaker

    # n_scene x 1 Series with index 'scene'
    filt = by_speaker_scene.count('scene') > 1
    # n_scene x n_character Index
    scene_index = by_speaker_scene.index.get_level_values('scene')
    # n_scene x n_character boolean vector
    ind = filt[scene_index].values
    return by_speaker_scene[ind]

In [13]:

def order_scenes(scenes, ordering=None):
    # Order scenes by e.g. leaf order after hierarchical clustering
    scenes = scenes.reset_index()
    scenes['scene'] = scenes['scene'].astype('category')
    scenes['character'] = scenes['character'].astype('category', categories=ordering)
    scenes['character_code'] = scenes['character'].cat.codes
    return scenes

In [14]:

# Order the scenes by cluster leaves order
scenes = order_scenes(get_scenes_with_multiple_characters(by_speaker_scene), ordering)

In [15]:

def plot_timeline(scenes):
    # Plot character vs scene timelime
    # NB: due to limitations in Python ggplot we need to plot with scene on y-axis
    # in order to label x-ticks by character.
    # scale_x_continuous and scale_y_continuous behave slightly differently.

    print (gg.ggplot(gg.aes(y='scene', x='character_code'), data=scenes) +
            gg.geom_point() + gg.labs(x='Character', y='Scene') +
           gg.scale_x_continuous(
               labels=scenes['character'].cat.categories.values.tolist(),
           breaks=range(len(scenes['character'].cat.categories))) +
           gg.theme(axis_text_x=gg.element_text(angle=30, hjust=1, size=10)))

In [16]:

# Plot a timeline of characters vs scene
plot_timeline(scenes);

Co-occurrence matrix

Next we construct the co-occurrence matrix showing how often characters share scenes, and visualize using a heatmap and network graph.

In [17]:

def get_cooccurrence_matrix(speaker_scene_matrix, ordering=None):
    # Co-occurrence matrix for the characters, ignoring last scene where all are present
    scene_ind = speaker_scene_matrix.astype(bool).sum() < 10
    if ordering:
        mat = speaker_scene_matrix.loc[ordering, scene_ind]
    else:
        mat = speaker_scene_matrix.loc[:, scene_ind]
    return mat.dot(mat.T)

In [18]:

cooccur_mat = get_cooccurrence_matrix(speaker_scene_matrix, ordering)

The heatmap below is not as nice as the default R heatmap as it is missing the dendrograms on each axis and also the character names, so could be extended e.g. following Hierarchical Clustering Heatmaps in Python.

It otherwise shows a similar result to the original article in that ordering using the dendrogram leaf order has resulted in a co-occurrence matrix predominantly of block diagonal form.

In [19]:

def plot_heatmap(cooccur_mat):
    # Plot co-ccurrence matrix as heatmap
    plt.figure()
    plt.pcolor(cooccur_mat)

In [20]:

# Plot heatmap of co-occurrence matrix
plot_heatmap(cooccur_mat)

The network plot gives similar results to the original article. This could be extended, for example by adding weights to the graph edges.

In [21]:

def plot_network(cooccur_mat):
    # Plot co-occurence matrix as network diagram
    G = nx.Graph(cooccur_mat.values)
    pos = nx.graphviz_layout(G)  # NB: needs pydot installed
    plt.figure(num=None, figsize=(15, 15), dpi=80)
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
    nx.draw_networkx_edges(G, pos)
    nx.draw_networkx_labels(
        G, pos,
        labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
        font_size=10)
    plt.axis('off')
    plt.show()

In [22]:

# Plot network graph of co-occurrence matrix
plot_network(cooccur_mat)

Conclusion

This notebook attempted to reproduce using Python the R analysis and visualization of the character network in ‘Love Actually’. Overall this was a useful exercise in learning similarities and differences between the tools, as well as becoming more familiar with the Pandas syntax. The output currently doesn’t look quite as nice as the R original but is a good starting point for future tinkering.

Metail Tech

Web, DevOps, 3D graphics, data engineering, systems, Clojure... y'know, that kind of thing

Author: Matt McDonnell

Data Actually

Data Actually

Package setup

Data import

Analysis

Timeline

Co-occurrence matrix

Conclusion