We welcomed back the Cambridge AWS User Group to the Cambridge office for it’s eighth Meetup. This one was focused on Big Data. This is something that I spend a lot of my time working on here at Metail, and I was keen to give a talk. I was nervous when having been put on the agenda we had 65 people sign up, the office capacity!

We had an exciting line up of speakers, if I do say so myself, with two talks about Redshift and one about building a big data solution on AWS. Peter Marriot gave the first talk which was an introduction to Redshift demonstrating how to create a cluster, log into it, load some data and then run queries. Most of this was a live demo and it went very smoothly. He was very enthusiastic about Redshift and demonstrated its speed at querying large data sets. I think his enthusiasm for Redshift came across as well measured and not just ‘oo shiny new tool’ as he did a good job of relating this to his own experience of querying large data sets; highlighting trade offs. The main one being Redshift seems to have a constant minimum overhead of a second or two on queries, where MySQL/PostgresSQL would be sub-second. This makes it difficult to support scenarios where multiple users make lots of small queries and receiving real-time results because the queue becomes backlogged. The general belief is that slow query response is because of the overhead of the leader node orchestrating the query, possibly a single node cluster wouldn’t have the problem. Something to put on the experiment list 🙂

The train chaos mentioned in the first Tweet meant our speaker from AWS, David Elliot, arrived late but still in plenty of time for his talk. It reminded me of my own experiences trying to get to my AWS London Loft talk back in April! His talk was an excellent live demo on setting up a trackers, and exploring the collected data. The exploration was done using Spark which is a managed install on EMR, and also Redshift and QuickSight. This was pretty similar to the demo I went to at the AWS Loft. It is impressive how quickly all this can be set up and how much power is available through these tools. I liked the demo and David had some good input to some of the questions asked of both me and Peter. We’ve blogged about this kind of setup and how it compares to our own here. We’ve changed our set up a little to be more event driven, using S3 notifications and SQS queues, but it’s still a good comparison. I see I blurred the lines a bit in my post about the use of Kinesis Firehose and Kinesis. The demo used Kinesis Firehose which is writing in batches, however you have control over when the buffer is flushed. David chose 60s to keep things flowing. You can use Kinesis streams, as David mentioned, if you want more of a streaming solution.

I was the final speaker on the agenda and my talk was titled Why The ‘Like’ In ‘Progres Like’ Matters”. I went through the decisions we’ve made when using Redshift and why. There were two main ones which I focused on. The first was whether to choose a cluster with a large amount of storage but limited compute, with the aim of storing all the data; or to have more CPU and less storage for faster querying but having to drop old data. We decided to keep all our data available in Redshift and progressed through a cluster made up of an increasing number of compute nodes until we had to switch to a cluster made up a few dense storage nodes to keep costs under control. The second major decision was the schema design. Unfortunately having never worked with columnar data stores we went with normalised schema layout which would have worked well on a row store such as PosgreSQL. We did use distribution and sort keys appropriate for the tables however the highly denormalised data often had different sort orders or distribution keys per table which made joins very slow. Since then we’ve done some more detailed research and more testing. Now we have a much larger data set and less CPU our tests highlight schema and query problems much more clearly which has lead to a much more efficient schema design. We have denormalised a lot of our data, and with common distribution and sort keys for the tables joins no longer need to sort data nor pull data from elsewhere in the cluster for table joins. As David said, Redshift optimisation is all about the schema design.

Overall we’ve found Redshift a very powerful tool, and like any tool there is a learning curve. As with all AWS services I’ve used there are the features in place to allow you to change your mind and hack around. Most of this due to the ease at which you can take snapshots and restore them to different shaped clusters.

Finally here’s me presenting:

It looks dark but it was still the hottest day of the year!

Thanks to @CambridgeAWS for the photos, to Peter and David for their talks, and Jon and Stephen for organising the Meetup. We’re looking forward to see everyone at the ninth Meetup here at Metail on Tuesday 25th October.

Tomorrow, Tuesday 12th, we’re welcoming back the Cam AWS User Group for their 7th Meetup. This is the fourth user group meetup we’ve hosted and now we’re set to host the remaining three of the year. The meet up promises to be information packed and is focusing on AWS Lambda with two speakers talking about their experiences. There’s also a debrief on the recent AWS summit in London and Danilo Poccia, a technical evangelist from AWS, is talking about data analytics.

The AWS London summit was on the 7th July and I went along with a colleague. Inevitably we bumped into some of the Cam AWS UG members and shared a DLR over to the Excel. Personally I found the Deep Dive on Amazon DynamoDB to be the most informative session with a good bit of depth on how to write your schema, avoiding hot keys and understand its internal partitioning. This is important for schema design and resolving certain bottlenecks. My most disappointing talk was the Deep Dive on Microservices and Amazon ECS as this talk didn’t add to my knowledge and I’ve only ever seen talks and demos of ECS never getting my hands dirty. My colleague attended the Deep Dive on EC2 Instances and it sounded like I’d have gotten much from that talk. I’m sure that others went to interesting (and disappointing) sessions and I would like to know what they got out of them.

AWS Lambda is one of AWS’ hot technologies which was released almost two years ago. We’ve started experimenting with it in Metail. I’m really keen to see how it’s being used and experimented with by others as AWS Lambda’s use within Metail is certainly growing. I’ve had a fun little project writing a plugin for leiningen which allows you to manage AWS Lambda functions with the aim of integrating it into our build process. Still it’s nowhere near as a functional as lambda Gordon which I saw demonstrated at the most recent Snowplow London Meetup; it sounds like something to compare to Ben Taylor’s talk on Using Lambda and CloudFormation.

The final talk of the night is from Danilo Poccia, I’m particularly looking forward to asking questions at the end as it’s the most relevant to my day to job 🙂

We’re looking forward to seeing everyone tomorrow, doors open at 6:45pm and the talks are starting promptly at 7pm. We’ll be providing beer, soft drinks and snacks, be prompt to get your favourite beverage before the talks start 🙂

AWS Loft London

Back in October 2015 Metail hosted the 3rd Cambridge AWS User Group Meetup and in addition to Ian Massingham‘s review of AWS re:Invent 2015 I was given the opportunity to talk about our use of AWS for our big data processing pipeline. After this I was pleased to be invited to give an Elastic MapReduce (EMR) specific version of this talk at an AWS EMR master class. Roll on March and the AWS loft London with me on the agenda for the EMR Master Class session 🙂

After a busy week and some concentrated talk preparations I almost didn’t make it. I caught the train from Cambridge to Liverpool street with the intention of walking from there to Old Street. Unfortunately there were problems with the power lines on the Liverpool Street line which lead to everyone getting off at Harlow Town. After a taxi ride to Epping and a nervous ride into Liverpool Street on the central line, I finally arrived only five minutes after the session started. This meant I missed my opportunity to introduce myself to Abhishek Sinha (the session leader) but after catching his eye during his talk I was back on the agenda 🙂

late-tweet                                                         made-it-tweet

Elastic MapReduce Master Class

Abhishek gave a very interesting and well-presented guide to EMR and its best practices. As ever when I attend a talk by someone from AWS I learn plenty of new things and start re-evaluating our use of their tools. In this case, these were mainly around the use of spot instance task nodes and taking advantage of EMRFS.

The spot instance task nodes are nodes that only perform MapReduce tasks, having no HDFS storage, and come from the EC2 spot instance market. Using the spot instance market you can get the nodes at a lower price but if you’re outbid you lose the node. Any compute tasks running when you lose the node fail, but Hadoop was built with this in mind and simply reschedules the task on another node. With no HDFS storage, no data re-replication need be done. It’s common to set a bid price of 100% of the on-demand cost, you still get the EC2 node at a lower bid price and at worst you pay the normal cost. Further, by picking nodes that are less commonly used, you are less likely to be outbid. For example, if you normally request two m3.2xlarge task nodes but on the on the spot market the m3.xlarge were less commonly used, then requesting four task nodes would give you equivalent power but with a greater saving. This is an imaginary example, you can find out real data for spot market here.

The other feature of EMR we are not yet taking advantage of is EMRFS. AWS have decoupled the compute from storage by allowing EMR clusters to make very efficient use of S3. The main/only drawback here is that S3 has eventual consistency for overwrites and deletes of objects in the S3 file system. The EMR nodes are not aware of the delays and thus when one job takes as input the output of a previous one there is a chance of seeing an inconsistent view of the data. EMRFS uses a DynamoDB table to keep a record of the expected state of S3 and the EMRFS file system will retry if a request is made for an object that does not match the expected state. Currently we work around this limitation by having things set up in such a way that it isn’t a problem (more by luck than design ;)). Another common solution is to create two copies: one in the cluster’s HDFS file system and the other in S3. The copy in HDFS is lost when the cluster shuts down. We are currently redesigning our pipeline and it may become a greater problem in the next iteration so we’re keeping EMRFS in mind, noting that you do pay for the DynamoDB usage.

My First Big Data Application

As for my own talk, I think it was well received. I was asked some interesting questions at the end and I’m taking that as a good sign. After my talk and some lunch I stayed for the next session “My First Big Data Application” which was introduced as a modern big data pipeline. This was a great session where a pipeline was setup to collect, process and analyse web logs. This was strikingly similar to the pipeline I’d described in my talk, however theirs is indeed more modern 🙂 I think it’s interesting to compare the two pipelines and to contrast their different strengths and weakness.

Starting with my talk and the beginning of our pipeline, events are recorded by making GET requests for a Cloudfront-hosted pixel and Cloudfront logs all the requests to an S3 bucket. Here AWS do the hard work of distributing our pixel around the globe to ensure fast access to the user. They also batch up the request logs, writing them to the configured bucket after some time/size. We’ve never done any measurements but I believe the latency is typically less than an hour and we get logs of the order 10MB in size although they can be KB in size. For the demonstration Toby Knight (the speaker) set up an Apache web server on an EC2 node which saved its logs locally. He then used an AWS Kinesis collector to stream the logs in real time into the Kinesis Firehose which records the data in an S3 bucket. Here you can see the more modern event collector which is a real-time streaming system compared to our batch. For the following purposes it’s not really clear why Kinesis Firehose is better than our Cloudfront solution. I’m not sure how you scale out the Apache web server (fairly easily I imagine, it’s just not my area of expertise) but that’s work you’ll have to do yourself and when the second step is a batch system I’m not sure the latency matters. However, talking of latency this is where Kinesis has potential the Cloudfront solution clearly doesn’t. In Metail we don’t have any real time monitor of our event stream (it’s never been a critical requirement) but with Kinesis you can connect to a topic and trigger some processing on each new event. This increased flexibility is clearly a win.

For the next step both we and Toby turned to EMR for a ‘model on read’ batch Event Transform and Load (ETL). We are using MapReduce in Clojure (Cascalog at the moment but switching over to Parkour) to read in our Cloudfront logs, validate the events and format them in a schema that can be loaded into Redshift. Here ‘model on read’ means that Cloudfront doesn’t enforce a schema on the data, it will quite happily write some quite corrupt events to file. It’s only if we try to format that event as, say, an order that we start requiring it to have certain properties. Toby’s talk used Spark to process the events, perhaps just as an opportunity to show EMR supports the latest cool MapReduce technology 😉 It does have some advantages over MapReduce and should be a lot faster than our ETL as Spark uses in-memory data structures, it’s written in Scala though (but there are Clojure bindings with Flambo or Sparkling). For the next step Metail is keeping up with the Joneses and we do the modern thing and copy the output of the EMR batch stage into Redshift. Redshift is a petabyte scale data warehouse where you use a PostgreSQL-like language to query your data. After some initial teething troubles we think our new schema will allow us to make much better use of Redshift’s strengths. We use a product called Looker to model the data in Redshift, produce dashboards for both internal and external use, gain insights into our data through dynamic queries and quite a few other things. For the talk they demonstrated the use of AWS QuickSight which is in a limited preview. Although it will compete with Looker (and similar tools like Tableau) it’s aiming to be less full featured and much cheaper, allowing companies to give everyone access to the data with only a few people using more expensive tools like Tableau. I suspect for us it would never replace Looker, it seemed like it wouldn’t have the client facing support we require, and our more powerful data analysis tools come largely from the open source Python and R community 🙂 Still I’m very excited about SPICE (Super-fast, Parallel, In-memory Calculation Engine) which gives each QuickSight user a local in-memory DB for very fast data modelling and exploration. This should be available to partners like Looker and Tableau next year.

And that’s it, after mentioning only a ‘few’ technologies I’ve raced through Metail’s big data pipeline and compared it to a more modern equivalent. For anyone looking to build their first pipeline I think it is worth looking at the streaming solution as that technology is advancing fast and windowing over the streams give much more powerful batches. It’s something we’re planning to look into with Onyx for the next iteration.

We enjoyed hosting the first Cambridge AWS User Group Meetup of the year in February (but were so busy we forgot to blog about it), so we’re excited to welcome everyone back to our town hall for the 3rd meet up of the year. The details of the next meetup are still being decided but there are always great opportunities to learn about AWS and contribute your own knowledge back to the group. Come along we’ll have snacks, soft and hard drinks, and most importantly questions asked and knowledge shared! While the details are getting settled here’s a little write up, finally, of February’s meet up.

Main Meeting #6: details TBA

Thursday, Jun 9, 2016, 7:00 PM

Metail Cambridge Offices
50 St. Andrew’s Street, CB2 3AH Cambridge, GB

12 Clouds Attending

RESCHEDULED – new date is very preliminary, and may change at any time!

Check out this Meetup →

Breaking the February Silence

Last year at the first of the main meetups, Tom Clark from Ocean Array Systems presented their idea for a data processing solution (the ‘Zero to Hero’ section of the meetup). They were looking to get some advice on how best to leverage AWS’ services given they had ‘zero’ current knowledge. Now seven months later he was back to present on their progress. The solution was very similar to what was sketched out back in July. He didn’t have any slides and rather than moving rooms to one with a white board he just sketched things out on the glass and a big sheet of paper 🙂

20160202_213436

Tom had five points he’d taken away from the last seven months of development although I’ve merged two and added one from the list I originally captured when listening to him:

  1. It’s worth thinking about costs, particularly considering the prototyping to beta/first few clients stage. The different systems scale differently in price. For example, they ended up making heavy use of lambda whereas with hindsight t2.nano and t2.micro instances are very cheap, can handle reasonable scale for tasks suitable for lambda functions and have a large number of free hours associated with them. In the end lambda may scale to infinity more cheaply when you bring in the cost of engineers to maintain large numbers of EC2 nodes but in the early days it might be cheaper (your time included) to manage your own EC2 instance;
  2. Your choice of database is a sensitive aspect for you commercial model. They went with the room’s advice and used DynamoDB but a knowledgeable source highlighted that quickly gets expensive. They’d have saved some money scaling with a cheaper solution;
  3. What is the commercial value of your data? For them it was important to get a sane API in front of the data as soon as possible. One of their datasets is collected from 3rd parties and they realised 1) it’s tedious, and 2) no one else has done this. By providing a decent API to this data (which they need internally anyway) they have the opportunity to resell access (sounds like how AWS got started);
  4. Figure out how to create isolated software projects! This is learning from immutable infrastructure and fairly common advice these days which means there’s no harm in repeating it 😉 Clojure does this well with leiningen, and I’ve had limited but positive experience with Python’s virtualenv and limited painful experience the RVM;
  5. You need a dedicated person to pull together the AWS services. I believe their system has been developed by three people, two experts in Matlab and machine learning, and the other who could focus on AWS. This could be a good intern or junior as generally the learning curve for AWS services isn’t that steep.

The second talk was given by Jon Green who went through the AWS Re:Invert 2015 talk on the IOT service. I’m barely a beginner in the IOT field, but the solution does seem complete, and it looks like AWS have solved a lot of problems. How many of these were already solved I couldn’t say. Given the breadth of the solution it’s not easy to summarise but there are two interesting points that have stuck in my mind: –

Firstly; it’s currently hard for enthusiasts to have a play, even more so in the UK than in the US. Given the power of AWS’ free tier to have developers create toy projects for little cost and then take those solutions over to their companies where they start bringing in revenue this seems a missed opportunity. Having said that I’ve had a quick browse at https://aws.amazon.com/iot/getting-started/ and at least one of the partner hardware kits is available if you switch the link from amazon.com to amazon.co.uk. Stock availability was something Jon highlighted, so get it while stocks last 😉

The second point is a bit more abstract. IOT provides tools to provide secure identification, registration and state management of your devices. This is the most hardware centric part of the system, the rest is about event processing and a lot of that looks like a streaming service. Jon made a very thought provoking point that even the most hardware related parts need not necessarily refer to a physical device and your whole IOT solution could be virtual. The IOT system has a lot of components just waiting for someone with some knowledge and imagination 🙂

Main Meeting #4: Happy New Year!

Tuesday, Feb 2, 2016, 7:00 PM

Metail Cambridge Offices
50 St. Andrew’s Street, CB2 3AH Cambridge, GB

13 Clouds Went

Thanks to the kind folks at Metail for stepping up to the plate at short notice – and with a slight date change from the 4th to the 2nd – we’re meeting at their offices on St Andrew’s Street again for lots of Cloudy goodness!More details to be announced very soon, but here’s the starting point…The presentations:• Notices and news – Jon Green (…

Check out this Meetup →

Hope to see you all on Thursday 9th June!  The talk line up is still being finalised but we do know that Matt McDonnell from the Metail Data Science team will be starting things off with a presentation on ‘Deploying Data Science with Docker and AWS’.

AWS Meetup

On the 22nd October Metail sponsored and hosted the third Cambridge AWS User Group (who have a few photos of the event up). It was great to welcome the meetup organiser Jon Green, AWS tech evangelist Ian Massingham (the main speaker for the evening), and around 25 others to our office “town hall” for drinks, snacks (including honey roasted cashews), networking and of course the talks.

Ian had two spots on the schedule. The first was a round up of the announcements made last week in at AWS re:Invent. I’m not going to produce my own recap of Ian’s summary, A Cloud Guru gave a nice summary of the keynotes at re:Invent in their medium blog post and there’s a nice summary with follow on links over at trendmicro. There were a few questions from the audience last night all of which Ian answered to their satisfaction. Seems AWS is doing something right 🙂 My favourite bit of the talk was during the overview of AWS’ now beta IoT platform (apparently the first thing AWS themselves have referred to as a platform) and the consumer stories from re:Invent. Here the focus was largely around an application in farming, mentioning the possibility of fitting monitoring devices to animals and being able to buzz them to wake them up. As someone who grew up in the valleys of Wales surrounded by sheep I found the thought of experimentally setting the state of your flock to be awake quite amusing. The scale of farming in the US is slightly larger than what I’m used so that’s a lot more sheep to wake up through patchy signal coverage ;).

Ian’s second talk was a demo of the new EC2 container service. I’ve played with docker before but it’s not my area of expertise however the audience seemed suitably impressed and there was a good bit of discussion at the end.

Having bribed my way onto the agenda by getting Metail to host (and doing the literal leg work to get the drinks and snacks to the office) it was good to give a talk to an engaged audience. I opted to go into some depth on the tracking and batch processing steps of our data architecture and skimmed over the insights and how we actually drive the pipeline. I’m hoping to get invited back to go in to a bit more depth. Elastic MapReduce is a slightly more niche service in AWS, and perhaps being overtaken by the use of real time systems such as AWS’ kinesis family. The talk itself is up on slideshare and you can see me in action on the @CambridgeAWS twitter account 🙂

We are often complimented on the selection of beers which we put on at the meetups at Metail. This is in large part owing to the excellent choice found at the King’s Parade branch of Cambridge Wine Merchants; it makes a nice lunch time outing to go over and choose the selection for the evening’s meetup. Having said that, a little data insight from our event hosting is that diet coke is the most popular drink.

Photo cc Ian M.

We’re delighted to be hosting and sponsoring the 3rd Cambridge AWS User Group meetup tonight. Where we’re going to be getting a run down of this year’s AWS re:Invent and I’ll give an introduction to Metail’s data pipeline with a focus on how we’re using and configuring some of AWS’ services to power Metail’s data insights. As sponsors we’ll be providing beer, soft drinks and wine as well as some snacks (I’m a big fan of honey roasted cashews and in charge of purchasing, so feel free to comment on this post with any special requests).

It’s the first of these events we’re hosting here at Metail and I’ve used it as an excuse to write a presentation for the group. Although the meetup is moving around various hosts and sponsors in Cambridge you can expect to see it back at Metail in future, and hopefully a follow up/continuation of my talk in a future meetup.

Here’s the summary of the event from the user group page:

Docker, containers and the like. Plus, those of us who made it back from Vegas alive report on our findings.

The BIG news…if you’re in the Internet of Things business, a database or data analytics specialist, are into Docker containers, or use Kinesis or Lambda, there were some major announcements – and you will definitely want to be here to hear them!

This meeting is hosted, and snacks and drinks sponsored, by Metail.

We’ve got a lot to get through, so the meeting will start at 7pm sharp, with doors open from 6:30 – please be prompt!

 

The presentations:

• Notices and news – Jon Green (Adeptium Consulting) and Brief Intro to Metail – Gareth Rogers (Metail)

• re:Invent 2015 Debrief – Ian Massingham (AWS) (and/or Jon Green)  – all the gory details and new announcements!

•  Metail’s Data Pipeline and AWS – Gareth Rogers

• Docker and Amazon Container Service  – Ian Massingham

• [Possibly] From Zero to Hero in AWS – follow-up – Tom Clark

Hope to see you later! We also host the Cambridge R Users Group, Data Insights Cambridge, DevOps Cambridge and Cambridge NonDysFunctional Programmers so if you can’t make it, there are plenty of other opportunities to come along to Metail for some drinks, snacks and tech chat.