Cam AWS UG #8: Big Data

We welcomed the Cambridge AWS User Group back to the Cambridge office for its eighth Meetup. This one focused on Big Data, something I spend a lot of my time working on here at Metail, so I was keen to give a talk. I got nervous when, after I had been put on the agenda, 65 people signed up: the office capacity!

We had an exciting line-up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriot gave the first talk, an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm came across as well measured and not just ‘oo shiny new tool’, as he did a good job of relating it to his own experience of querying large data sets and highlighting the trade-offs. The main one is that Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgreSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and expect real-time results, because the queue becomes backlogged. The general belief is that the slow query response is due to the overhead of the leader node orchestrating the query, so possibly a single-node cluster wouldn’t have the problem. Something to put on the experiment list 🙂

The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo on setting up a tracker and exploring the collected data. The exploration was done using Spark, which is available as a managed install on EMR, along with Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo, and David had some good input on some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our setup a little to be more event driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post between Kinesis Firehose and Kinesis Streams. The demo used Kinesis Firehose, which writes in batches; however, you have control over when the buffer is flushed. David chose 60s to keep things flowing. You can use Kinesis Streams, as David mentioned, if you want more of a streaming solution.
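
To make the Firehose buffering point a little more concrete, here is a rough sketch in Python of the kind of destination settings involved. It only builds the configuration dictionary; it is not David’s actual demo setup. The function name, the ARNs and the 5 MB size are placeholders I’ve made up for illustration, and I’m assuming an S3 destination. Only the 60 second interval comes from the talk.

```python
def firehose_s3_config(role_arn, bucket_arn, interval_seconds=60, size_mb=5):
    """Build an S3 destination config for a Firehose delivery stream.

    Firehose flushes its buffer when either hint is reached, whichever
    comes first. The 60s default mirrors the value chosen in the demo;
    the 5 MB size is an assumed placeholder.
    """
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        "BufferingHints": {
            "IntervalInSeconds": interval_seconds,
            "SizeInMBs": size_mb,
        },
    }


# Hypothetical ARNs, for illustration only.
config = firehose_s3_config(
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    bucket_arn="arn:aws:s3:::example-tracker-bucket",
)
print(config["BufferingHints"])
```

With boto3 installed and credentials configured, a dictionary like this would be passed as `ExtendedS3DestinationConfiguration` to the Firehose client’s `create_delivery_stream` call, along with a `DeliveryStreamName` of your choosing.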

I was the final speaker on the agenda and my talk was titled “Why The ‘Like’ In ‘Postgres Like’ Matters”. I went through the decisions we’ve made when using Redshift and why. There were two main ones which I focused on. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data, or to have more CPU and less storage for faster querying but having to drop old data. We decided to keep all our data available in Redshift and progressed through clusters made up of an increasing number of compute nodes until we had to switch to a cluster made up of a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately, having never worked with columnar data stores, we went with a normalised schema layout which would have worked well on a row store such as PostgreSQL. We did use distribution and sort keys appropriate for each table; however, the highly normalised data often had different sort orders or distribution keys per table, which made joins very slow. Since then we’ve done some more detailed research and more testing. Now that we have a much larger data set and less CPU, our tests highlight schema and query problems much more clearly, which has led to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys across the tables, joins no longer need to sort data or pull data from elsewhere in the cluster. As David said, Redshift optimisation is all about the schema design.
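
As a rough illustration of what common distribution and sort keys look like, here is a small Python sketch that generates Redshift `CREATE TABLE` statements for two tables sharing the same keys. The table and column names (`sessions`, `events`, `user_id`, `event_time` and so on) are made up for the example and are not our actual schema; `DISTSTYLE`, `DISTKEY` and `SORTKEY` are the standard Redshift table attributes.

```python
def create_table_sql(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement with explicit keys.

    columns is a list of (name, sql_type) pairs. All names used in this
    example are hypothetical.
    """
    names = [name for name, _ in columns]
    for key in [dist_key, *sort_keys]:
        if key not in names:
            raise ValueError(f"{key!r} is not a column of {table!r}")
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        f"DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Both tables share the same distribution key and sort order, so rows
# that join on user_id are stored on the same slice, already sorted.
SHARED_DIST_KEY = "user_id"
SHARED_SORT_KEYS = ["user_id", "event_time"]

sessions_sql = create_table_sql(
    "sessions",
    [("user_id", "BIGINT"), ("event_time", "TIMESTAMP"), ("device", "VARCHAR(64)")],
    SHARED_DIST_KEY,
    SHARED_SORT_KEYS,
)
events_sql = create_table_sql(
    "events",
    [("user_id", "BIGINT"), ("event_time", "TIMESTAMP"), ("event_type", "VARCHAR(32)")],
    SHARED_DIST_KEY,
    SHARED_SORT_KEYS,
)

print(sessions_sql)
print(events_sql)
```

The idea is that a join between `sessions` and `events` on `user_id` should not need to redistribute or re-sort either table, which is the behaviour we were after. Whether that holds for a given query is worth checking with `EXPLAIN` rather than taking on trust.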

Overall we’ve found Redshift a very powerful tool, and like any tool it has a learning curve. As with all the AWS services I’ve used, the features are in place to allow you to change your mind and hack around. Most of this is due to the ease with which you can take snapshots and restore them to different shaped clusters.

Finally here’s me presenting:

It looks dark but it was still the hottest day of the year!

Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and to Jon and Stephen for organising the Meetup. We’re looking forward to seeing everyone at the ninth Meetup here at Metail on Tuesday 25th October.
