We welcomed back the Cambridge AWS User Group to the Cambridge office for its eighth Meetup. This one was focused on Big Data, something that I spend a lot of my time working on here at Metail, and I was keen to give a talk. Having been put on the agenda, I was nervous when 65 people signed up: the office capacity!

We had an exciting line-up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriot gave the first talk, an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm came across as well measured and not just ‘ooh, shiny new tool’, as he did a good job of relating it to his own experience of querying large data sets and highlighting the trade-offs. The main one is that Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgreSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and expect real-time results, because the query queue becomes backlogged. The general belief is that the slow query response is down to the overhead of the leader node orchestrating the query; possibly a single-node cluster wouldn’t have the problem. Something to put on the experiment list 🙂
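For readers who didn’t see the live demo, the core of it boils down to a table definition, a bulk load from S3 and a query. Here is a minimal sketch of that flow; the table, bucket and IAM role names are invented for illustration, and actually running the statements would need a live cluster and a client such as psycopg2.

```python
# Hypothetical sketch of the demo flow: create a table, bulk-load it from S3
# with COPY, then query it. All names and ARNs below are made up.

CREATE_EVENTS = """
CREATE TABLE events (
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type VARCHAR(32)
);
"""

def copy_from_s3(table, s3_path, iam_role):
    """Build a Redshift COPY statement for bulk-loading gzipped CSV from S3."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV GZIP;"
    )

QUERY = "SELECT event_type, COUNT(*) FROM events GROUP BY event_type;"

print(copy_from_s3(
    "events",
    "s3://my-bucket/events/",
    "arn:aws:iam::123456789012:role/redshift-load",
))
```

COPY is the idiomatic way to load Redshift: it ingests from S3 in parallel across the cluster, which is a large part of why the demo loaded data so quickly.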

The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo of setting up trackers and exploring the collected data. The exploration was done using Spark, which is available as a managed install on EMR, as well as Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo, and David had some good input to some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our setup a little to be more event driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post about the use of Kinesis Firehose and Kinesis. The demo used Kinesis Firehose, which writes in batches, though you have control over when the buffer is flushed; David chose 60s to keep things flowing. You can use Kinesis Streams, as David mentioned, if you want more of a streaming solution.
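The buffer-flushing behaviour is worth spelling out: Firehose delivers a batch when either a size threshold or a time interval is hit, whichever comes first, so a 60s interval keeps data flowing even at low volume. A rough sketch of the relevant configuration, with invented stream, bucket and role names (creating the stream for real would use boto3 with proper AWS credentials):

```python
# Sketch of the Firehose buffering trade-off David described. Firehose flushes
# its buffer when EITHER threshold below is reached, whichever comes first.
# All names and ARNs are hypothetical.

buffering_hints = {
    "SizeInMBs": 5,           # flush once 5 MB has accumulated...
    "IntervalInSeconds": 60,  # ...or after 60 seconds, even if nearly empty
}

s3_destination = {
    "BucketARN": "arn:aws:s3:::example-analytics-bucket",
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
    "BufferingHints": buffering_hints,
}

# With boto3 this configuration would be passed along the lines of:
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="event-stream",
#     ExtendedS3DestinationConfiguration=s3_destination,
# )
print(s3_destination["BufferingHints"])
```

A shorter interval means fresher data in S3 but more, smaller files; Kinesis Streams is the option when per-record, sub-minute latency actually matters.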

I was the final speaker on the agenda and my talk was titled “Why The ‘Like’ In ‘Postgres-like’ Matters”. I went through the decisions we’ve made when using Redshift and why. There were two main ones which I focused on. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data; or to have more CPU and less storage for faster querying, at the cost of having to drop old data. We decided to keep all our data available in Redshift and progressed through clusters made up of an increasing number of compute nodes until we had to switch to a cluster made up of a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately, never having worked with columnar data stores, we went with a normalised schema layout which would have worked well on a row store such as PostgreSQL. We did use distribution and sort keys appropriate for the individual tables, however the highly normalised data often had different sort orders or distribution keys per table, which made joins very slow. Since then we’ve done some more detailed research and more testing. Now that we have a much larger data set and less CPU, our tests highlight schema and query problems much more clearly, which has led to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys across the tables, joins no longer need to sort data nor pull it from elsewhere in the cluster. As David said, Redshift optimisation is all about the schema design.
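To make the schema lesson concrete, here is a hypothetical pair of table definitions (not Metail’s actual schema; the table and column names are invented). When two tables that are frequently joined share the same distribution key, the matching rows live on the same node, and a shared sort key on the join column lets Redshift use a merge join without re-sorting:

```python
# Hypothetical illustration of the schema lesson: a common DISTKEY and SORTKEY
# across joined tables keeps the join node-local and pre-sorted.

SESSIONS = """
CREATE TABLE sessions (
    user_id    BIGINT,
    session_id BIGINT,
    started_at TIMESTAMP
)
DISTKEY (user_id)   -- all rows for a user live on one node
SORTKEY (user_id);  -- pre-sorted on the join column
"""

PAGE_VIEWS = """
CREATE TABLE page_views (
    user_id    BIGINT,  -- denormalised copy of the join key
    session_id BIGINT,
    url        VARCHAR(2048),
    viewed_at  TIMESTAMP
)
DISTKEY (user_id)   -- same DISTKEY as sessions: no data redistribution
SORTKEY (user_id);
"""
```

With mismatched keys, as in our original normalised layout, every join forces the cluster to redistribute and sort one or both tables first, which is exactly where our query time was going.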

Overall we’ve found Redshift a very powerful tool, and like any tool there is a learning curve. As with all the AWS services I’ve used, the features are in place to allow you to change your mind and hack around. Most of this is due to the ease with which you can take snapshots and restore them to different shaped clusters.

Finally here’s me presenting:

It looks dark but it was still the hottest day of the year!

Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and to Jon and Stephen for organising the Meetup. We’re looking forward to seeing everyone at the ninth Meetup here at Metail on Tuesday 25th October.

With two 3-month internships under my belt at Metail, it’s easy to see why people keep joining. As an R&D Intern, I’ve been continually challenged and pushed to learn new skills and apply them to often independent and in-depth projects. The responsibility and expected self-sufficiency have been well balanced to allow a comfortable attachment to my work; and now that I’m leaving to go back to my final year at university, it’s clear that a lot of what I’ve done with Metail will help me focus and push myself in my studies.

My assigned and chosen work has been excitingly challenging and intriguingly broad. With several weeks spent collaborating with a great team of people to build advanced features for a Facebook chatbot, I had the pleasure of working on state-of-the-art 3D face modelling, including the challenges of adding cosmetic changes, finding ways to smoothly transform one face to another, and robustly positioning other 3D models on and around the faces, all within the context of delivering a user experience, albeit an experimental one. The surprise to me came in finding that “R&D” is not synonymous with “hidden in the back room for no one to see.” Sometimes, it turns out, it just means you don’t have to worry about perfecting a product and can focus on learning as much as possible about users and what they want.

In three months, you don’t necessarily just get to work on one project. On top of 3D face modelling, I got the opportunity to start with a blank folder with zero files in it and a seemingly simple task: recommend clothes to a user. That might only take five words to say, but it takes more than five lines of code to do. This task required me to build from an empty file tree to a framework for creating and testing ways of implementing recommendation algorithms. I found it an extremely rewarding opportunity to work independently on such a project. At the same time, the real reward of working in a place like Metail isn’t just getting to take pride in your work, but knowing that at any point in time there is a whole host of people ready and willing to help you if you ask for it. With a collection of experienced and knowledgeable colleagues, I always found it easy to get help when I needed it. The technical knowledge and experiential learning gained will no doubt prove invaluable in the future.
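The value of such a framework is the common interface: once every algorithm implements the same `fit`/`recommend` contract, new ideas can be dropped in and compared against a baseline. A toy sketch of the idea (the class and method names here are invented, not the actual framework):

```python
# Toy sketch of a pluggable recommender framework: a shared interface plus the
# simplest possible baseline. Names are invented for illustration.
from collections import Counter

class Recommender:
    """Common contract every recommendation algorithm implements."""
    def fit(self, purchases):
        # purchases: iterable of (user_id, item_id) pairs
        raise NotImplementedError
    def recommend(self, user_id, n=3):
        raise NotImplementedError

class PopularityRecommender(Recommender):
    """Baseline: recommend the most-purchased items overall."""
    def fit(self, purchases):
        counts = Counter(item for _, item in purchases)
        self.ranking = [item for item, _ in counts.most_common()]
        return self
    def recommend(self, user_id, n=3):
        # Ignores the user entirely; a real algorithm would personalise here.
        return self.ranking[:n]

purchases = [(1, "dress"), (2, "dress"), (2, "jacket"),
             (3, "dress"), (3, "scarf")]
model = PopularityRecommender().fit(purchases)
print(model.recommend(user_id=1, n=2))
```

A personalised algorithm (collaborative filtering, say) would subclass the same interface, so the testing harness can score every implementation on identical data.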

I should also point out that the work itself isn’t the only part that’s fun. The people are wonderful and I personally enjoyed the fact that I only wore shoes to work 8 times in the entire summer (flip-flops are so much more comfortable). If you need advice on which are the best pubs in Cambridge, look no further, because Friday pub lunches serve as an excellent method of exploration. Meanwhile it’s worth noting that interns get free membership to the Friday cocktail club, which makes for a thoroughly enjoyable social activity whether you care for the cocktails or not!

In the end, sometimes you have to do some work, so in my experience, you might as well make sure it’s work that is in itself rewarding and comes with plenty of added benefits; my time at Metail has been a core of fulfilling work with a periphery of positive side effects. There’s no doubt in my mind that I’ll soon find an excuse to come back again.

Luke Smith
R&D Intern 2015, 2016