Data Actually
David Robinson posted a great article Analyzing networks of characters in ‘Love Actually’ on 25th December 2015, which uses R to analyse the connections between the characters in the film Love Actually.
This Jupyter notebook and the associated Python code attempt to reproduce his analysis using tools from the Python ecosystem.
I tweeted a link to an initial version of this notebook over the Christmas holidays and republished it here in response to interest from my colleagues in Metail Tech.
Package setup
To start we need to import some useful packages, shown below. Some of these needed to be installed first, using conda from the Anaconda distribution, or pip where conda failed:
- pip install ggplot
- conda install graphviz
- conda install networkx
- conda install matplotlib (pyplot is a module of matplotlib, not a separate package)
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage
import ggplot as gg
import networkx as nx
%matplotlib inline
Data import
First we need to define a couple of functions to read the ‘Love Actually’ script into a list of lines and the cast into a Pandas DataFrame. The data_dir variable gets the location of the directory containing the input files from an environment variable, so if you’re following along, set this environment variable to a location appropriate for you.
data_dir = os.path.join(os.getenv('MDA_DATA_DIR', '/home/mattmcd/Work/Data'), 'LoveActually')
# Alternative: data_dir = os.getcwd() # And copy files to same directory as script
def read_script():
    """Read the Love Actually script from text file into list of lines

    The script is the first Google hit for 'Love Actually script' as a doc
    file. Use catdoc or Libre Office to save to text format.
    """
    with open(os.path.join(data_dir, 'love_actually.txt'), 'r') as f:
        lines = [line.strip() for line in f]
    return lines
def read_actors():
    """Read the mapping from character to actor using the varianceexplained data file

    Used curl -O http://varianceexplained.org/files/love_actually_cast.csv to get a local copy
    """
    return pd.read_csv(os.path.join(data_dir, 'love_actually_cast.csv'))
The cell below reproduces the logic in the first cell of the original article. It doesn’t feel quite as nice to me as the dplyr syntax but is not too bad.
def parse_script(raw):
    df = pd.DataFrame(raw, columns=['raw'])
    df = df.query('raw != ""')
    # Escape the parentheses so we match the literal text "(song)";
    # unescaped they form a regex capture group matching bare "song"
    df = df[~df.raw.str.contains(r"\(song\)")]
    lines = (df.
             assign(is_scene=lambda d: d.raw.str.contains(" Scene ")).
             assign(scene=lambda d: d.is_scene.cumsum()).
             query('not is_scene'))
    speakers = lines.raw.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
    lines = (pd.concat([lines, speakers], axis=1).
             dropna().
             assign(line=lambda d: np.cumsum(~d.speaker.isnull())))
    lines.drop(['raw', 'is_scene'], axis=1, inplace=True)
    return lines
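The str.extract call with named groups is worth a closer look. On some hypothetical lines in the script's "Speaker: dialogue" format (the dialogue below is invented for illustration), it splits the speaker from the dialogue and leaves NaN where there is no colon, which the subsequent dropna then removes:

```python
import pandas as pd

# Hypothetical lines: one in "Speaker: dialogue" form, one without a colon
s = pd.Series(["Billy: I'm here with a message",
               "A stage direction without a colon"])
parts = s.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
# parts has columns 'speaker' and 'dialogue'; the second row is all NaN
```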
def read_all():
    lines = parse_script(read_script())
    cast = read_actors()
    # NB: DataFrame.sort was deprecated in pandas 0.17 in favour of sort_values
    combined = lines.merge(cast).sort_values('line').assign(
        character=lambda d: d.speaker + ' (' + d.actor + ')').reset_index(drop=True)
    # Decode any byte strings to unicode (a no-op on Python 3)
    combined['character'] = [
        c.decode('utf-8') if isinstance(c, bytes) else c
        for c in combined['character']]
    return combined
# Read in script and cast into a dataframe
lines = read_all()
# Print the first few rows
lines.head(5)
Constructing the n_character x n_scene matrix showing how many lines each character has in each scene is straightforward: the Pandas groupby method creates a hierarchical index, and the unstack method then converts the second level of that index into columns.
def get_scene_speaker_matrix(lines):
    by_speaker_scene = lines.groupby(['character', 'scene'])['line'].count()
    speaker_scene_matrix = by_speaker_scene.unstack().fillna(0)
    return by_speaker_scene, speaker_scene_matrix
# Group by speaker and scene and construct the speaker-scene matrix
by_speaker_scene, speaker_scene_matrix = get_scene_speaker_matrix(lines)
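The groupby/unstack pattern is easiest to see on toy data (the characters and counts below are made up, standing in for the parsed script):

```python
import pandas as pd

# Toy stand-in for the parsed script: one row per line of dialogue
toy = pd.DataFrame({'character': ['A', 'A', 'B', 'B', 'B'],
                    'scene':     [1,   2,   1,   1,   2],
                    'line':      [1,   2,   3,   4,   5]})
# Hierarchical (character, scene) index with line counts as values
by_speaker_scene = toy.groupby(['character', 'scene'])['line'].count()
# Pivot the scene level into columns: a 2 x 2 character-by-scene matrix
matrix = by_speaker_scene.unstack().fillna(0)
```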
Analysis
Now we get to the analysis itself. First we perform a hierarchical clustering of the data using the same data normalization and clustering method as the original article. The leaf order in the dendrogram is returned for use in later steps of the analysis, as it has similar characters close to each other.
def plot_dendrogram(mat, normalize=True):
    # Cluster and plot dendrogram. Return order after clustering.
    if normalize:
        # Normalize by number of lines
        mat = mat.div(mat.sum(axis=1), axis=0)
    Z = linkage(mat, method='complete', metric='cityblock')
    labels = mat.index
    f = plt.figure()
    ax = f.add_subplot(111)
    R = dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
                   labels=labels, ax=ax, color_threshold=-1)
    f.tight_layout()
    ordering = R['ivl']
    return ordering
# Hierarchical cluster and return order of leaves
ordering = plot_dendrogram(speaker_scene_matrix)
print(ordering)
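The normalization and leaf-order steps can be checked on toy data (the matrix below is invented): each row is divided by its sum so that characters are compared by the distribution of their lines across scenes, and characters merged early by the clustering come out adjacent in the leaf order.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy line-count matrix: A and B talk in similar scenes, C elsewhere
mat = pd.DataFrame([[3., 1.], [1., 1.], [0., 4.]], index=['A', 'B', 'C'])
norm = mat.div(mat.sum(axis=1), axis=0)   # each row now sums to 1
Z = linkage(norm, method='complete', metric='cityblock')
R = dendrogram(Z, labels=norm.index.tolist(), no_plot=True)
order = R['ivl']  # A and B merge first, so they are adjacent leaves
```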
Timeline
Plotting the timeline of character versus scene is very similar to the R version, since we make use of yhat's Python port of ggplot. There are, however, a few differences between this package and the R ggplot2 library.
In particular:
- ggplot does not seem to handle categorical variables so it was necessary to introduce an extra character_code dimension
- it didn’t seem possible to change the y-axis tick labels to character names so here the axis directions are swapped
- the aes (aesthetic) does not seem to support ‘group’ so the geom_path joining characters in the same scene has been left out
(note that these points may just be limitations of my understanding of ggplot in Python)
def get_scenes_with_multiple_characters(by_speaker_scene):
    # Filter the speaker-scene Series to remove scenes with only one speaker
    # Boolean Series indexed by scene: True where a scene has more than one speaker
    filt = by_speaker_scene.count(level='scene') > 1
    # The scene label for each (character, scene) row of the MultiIndex
    scene_index = by_speaker_scene.index.get_level_values('scene')
    # Boolean vector aligned with the rows of by_speaker_scene
    ind = filt[scene_index].values
    return by_speaker_scene[ind]
def order_scenes(scenes, ordering=None):
    # Order scenes by e.g. leaf order after hierarchical clustering
    scenes = scenes.reset_index()
    scenes['scene'] = scenes['scene'].astype('category')
    scenes['character'] = scenes['character'].astype('category', categories=ordering)
    scenes['character_code'] = scenes['character'].cat.codes
    return scenes
# Order the scenes by cluster leaves order
scenes = order_scenes(get_scenes_with_multiple_characters(by_speaker_scene), ordering)
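The character_code trick is easiest to see on toy data. Note that in newer pandas the `.astype('category', categories=...)` form used above has been removed in favour of pd.Categorical (or CategoricalDtype); the ordering below is a hypothetical leaf order standing in for the dendrogram output.

```python
import pandas as pd

# Hypothetical leaf order from the clustering step
ordering = ['C', 'A', 'B']
chars = pd.Series(['A', 'B', 'C', 'A'])
cat = pd.Categorical(chars, categories=ordering)
codes = cat.codes  # each character's position within `ordering`
```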
def plot_timeline(scenes):
    # Plot character vs scene timeline
    # NB: due to limitations in Python ggplot we need to plot with scene on the y-axis
    # in order to label x-ticks by character.
    # scale_x_continuous and scale_y_continuous behave slightly differently.
    print (gg.ggplot(gg.aes(y='scene', x='character_code'), data=scenes) +
           gg.geom_point() + gg.labs(x='Character', y='Scene') +
           gg.scale_x_continuous(
               labels=scenes['character'].cat.categories.values.tolist(),
               breaks=range(len(scenes['character'].cat.categories))) +
           gg.theme(axis_text_x=gg.element_text(angle=30, hjust=1, size=10)))
# Plot a timeline of characters vs scene
plot_timeline(scenes);
Co-occurrence matrix
Next we construct the co-occurrence matrix showing how often characters share scenes, and visualize using a heatmap and network graph.
def get_cooccurrence_matrix(speaker_scene_matrix, ordering=None):
    # Co-occurrence matrix for the characters, ignoring scenes with ten or
    # more speakers (i.e. the final scene, where nearly everyone is present)
    scene_ind = speaker_scene_matrix.astype(bool).sum() < 10
    if ordering:
        mat = speaker_scene_matrix.loc[ordering, scene_ind]
    else:
        mat = speaker_scene_matrix.loc[:, scene_ind]
    return mat.dot(mat.T)
cooccur_mat = get_cooccurrence_matrix(speaker_scene_matrix, ordering)
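The mat.dot(mat.T) step deserves a word of explanation: entry (i, j) of the product sums, over all scenes, the product of the two characters' line counts, so it is nonzero exactly when characters i and j share a scene, and larger when they both speak a lot in shared scenes. A toy example (made-up counts):

```python
import pandas as pd

# Toy speaker-scene matrix: 3 characters x 3 scenes of line counts
m = pd.DataFrame([[1, 0, 2],
                  [3, 0, 0],
                  [0, 1, 1]],
                 index=['A', 'B', 'C'], columns=[1, 2, 3])
# Symmetric character-by-character co-occurrence matrix
cooccur = m.dot(m.T)
# A and B share scene 1 (1*3 = 3 there); B and C share no scene
```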
The heatmap below is not as polished as the default R heatmap: it is missing the dendrograms on each axis and also the character names. It could be extended, e.g. following Hierarchical Clustering Heatmaps in Python.
It otherwise shows a similar result to the original article in that ordering using the dendrogram leaf order has resulted in a co-occurrence matrix predominantly of block diagonal form.
def plot_heatmap(cooccur_mat):
    # Plot co-occurrence matrix as heatmap
    plt.figure()
    plt.pcolor(cooccur_mat)
# Plot heatmap of co-occurrence matrix
plot_heatmap(cooccur_mat)
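As a first step towards that extension, here is a sketch of adding the character names as tick labels, assuming cooccur_mat is a square DataFrame indexed by character as produced above (the function name and tick placement are my own choices, not from the original analysis):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt

def plot_heatmap_labelled(cooccur_mat):
    # Heatmap with character names on both axes
    f, ax = plt.subplots()
    ax.pcolor(cooccur_mat.values)
    # pcolor cells span [i, i+1], so place each tick at the cell centre
    ticks = [i + 0.5 for i in range(len(cooccur_mat.index))]
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(cooccur_mat.columns, rotation=90, fontsize=8)
    ax.set_yticklabels(cooccur_mat.index, fontsize=8)
    f.tight_layout()
    return ax
```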
The network plot gives similar results to the original article. This could be extended, for example by adding weights to the graph edges.
def plot_network(cooccur_mat):
    # Plot co-occurrence matrix as network diagram
    G = nx.Graph(cooccur_mat.values)
    pos = nx.graphviz_layout(G)  # NB: needs pydot installed
    plt.figure(num=None, figsize=(15, 15), dpi=80)
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
    nx.draw_networkx_edges(G, pos)
    nx.draw_networkx_labels(
        G, pos,
        labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
        font_size=10)
    plt.axis('off')
    plt.show()
# Plot network graph of co-occurrence matrix
plot_network(cooccur_mat)
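One way to add the edge weights mentioned above: nx.Graph built from a numeric matrix stores each nonzero entry as a 'weight' edge attribute, which can be fed to draw_networkx_edges as line widths. This sketch is my own, not part of the original analysis; it drops the self-loops coming from the matrix diagonal, uses spring_layout to avoid the graphviz dependency, and assumes networkx 2.x or later (where graphviz_layout moved to nx.nx_pydot / nx.nx_agraph).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import networkx as nx

def plot_weighted_network(cooccur_mat):
    # Edge widths proportional to the co-occurrence counts
    G = nx.Graph(cooccur_mat.values)  # nonzero entries become 'weight' attributes
    G.remove_edges_from(list(nx.selfloop_edges(G)))  # drop diagonal self-loops
    pos = nx.spring_layout(G, seed=42)
    widths = [0.3 * G[u][v]['weight'] for u, v in G.edges()]
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
    nx.draw_networkx_edges(G, pos, width=widths)
    nx.draw_networkx_labels(
        G, pos,
        labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
        font_size=10)
    plt.axis('off')
    return G
```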
Conclusion
This notebook attempted to reproduce using Python the R analysis and visualization of the character network in ‘Love Actually’. Overall this was a useful exercise in learning similarities and differences between the tools, as well as becoming more familiar with the Pandas syntax. The output currently doesn’t look quite as nice as the R original but is a good starting point for future tinkering.