Engineering Socially: Traffic Spikes and the old new Old Spice Guy

A while back, we did a little promotional project to tie into Old Spice’s online marketing campaign running on YouTube.  It was very spur of the moment, since we launched the project after the Old Spice campaign had already started.  Because of how fast we needed to get something working, and the size of the potential exposure, some of the engineering issues were more prominent for us than they had been in the past.

Engineering Challenges

We knew right away that if we got any pickup at all, we’d be looking at a significant traffic spike.  So while we were hoping for the best in terms of traffic, we also had to prepare for the worst.  The big engineering challenges were:

  1. The promotional page must not impact regular operation of the other websites we manage for our clients.
  2. It must not negatively impact bandwidth allocation from our hosting partner Slicehost.  And by negatively impact, I mean cost us money.
  3. The hosting must be able to scale easily, so we wouldn’t be looking at a lot of server errors, or be knocked offline completely.
  4. Ideally, the hosting for this should cost as little as possible, since it was pretty much a one-shot deal.
  5. We knew the campaign was already going on, so we needed to get it up and running fast.
  6. The application had to handle several large data sets, namely Twitter feeds and comments from Facebook, Digg, YouTube, and Reddit.

Clearly, we weren’t going to be hosting it on our own servers.  There was too much risk of a slowdown causing denial of service for our customers.  The sites we host generally don’t get the level of traffic that warrants the engineering investment in load balancers, content delivery networks, redundant servers, and so on.  Setting all that up for a spur-of-the-moment deal like this just wasn’t worth the investment of time.  Nor did we want to absorb the cost of setting up at least one, and possibly several, new virtual servers, since we would be paying the full monthly cost.

While there are many virtual application platforms out there, such as PHP Fog, Heroku, and Google App Engine, to name just a few, the short timeline and my previous experience with Heroku made it an obvious choice.  Since we had no idea when Old Spice would declare a winner and end the campaign, we set ourselves the goal of having something up and running the same night.

Because of its close integration with rake and git, Heroku seemed like the best choice to host the application.  Heroku makes it easy to create, deploy, and scale Rails apps, and has lots of seamless automation to make maintaining them easy.  A bonus for us was that they only charge you for the time you actually use, so we could scale up our processes (to serve the app) during the initial rush, and then scale back down when it was over.
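
For a sense of what that scaling looks like in practice, here is a rough sketch using the Heroku CLI.  The command syntax has changed over the years, and the dyno counts here are made up, but the idea is a one-line scale up before the rush and a one-line scale down afterwards:

    # Scale up the web processes ahead of the traffic spike (counts are illustrative).
    heroku ps:scale web=4

    # ...and back down to the minimum once the rush is over.
    heroku ps:scale web=1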

Heroku is also a Rails 3 hosting service, and Ruby made the app a breeze to build (satisfying the time constraint); I went from idea to working site in an evening.  I built the app as a single page that refreshes its data on a fixed interval.  While I could have used Heroku worker processes to move the refresh work off the page-display code path entirely, that would have added to the final bill, so I stuck with refreshing the data during page load and accepting an occasional slow request.
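
To make that trade-off concrete, here is a minimal sketch of the refresh-during-page-load approach.  The class and method names are hypothetical rather than the actual app’s code; the point is that most requests serve cached data, and only the request that finds the cache stale pays for the refresh:

    # Keeps the latest tallies in memory and only re-pulls them from the
    # remote sources when they are older than a fixed interval, so most
    # page loads are fast and the occasional one absorbs the refresh cost.
    class VoteCache
      REFRESH_INTERVAL = 60 # seconds between refreshes

      def initialize(&fetcher)
        @fetcher      = fetcher     # block that pulls fresh data from the remote APIs
        @data         = nil
        @refreshed_at = Time.at(0)  # force a refresh on the first request
      end

      # Called from the page's controller action: returns cached data,
      # refreshing it first if it has gone stale.
      def current
        if Time.now - @refreshed_at > REFRESH_INTERVAL
          @data         = @fetcher.call   # the slow part: hits Twitter, Fusion Tables, etc.
          @refreshed_at = Time.now
        end
        @data
      end
    end

    # Hypothetical usage: CACHE = VoteCache.new { TallyFetcher.pull_all }
    # and the controller action renders CACHE.current.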

Implementation Challenges

While it would be nice to say everything went smoothly despite all this planning, there were occasional problems, and surprisingly all of them came from our outside data sources.  We used Google Fusion Tables as mass storage for the collected tweets, comments, and feedback that we were mining for “votes.”  I discovered the hard way that, very occasionally, the comma-separated values (CSV) output from the Fusion Tables API was not quite as standard as Ruby would have liked it to be.  Comments from Twitter with newlines in them were occasionally showing up without being enclosed in double quotes.  In fairness, this might have been a garbage-in, garbage-out issue from the software that was scanning Twitter, but by the time we hit the problem it was much too late to fix the Twitter side of things, as the data was already somewhere amongst the tens of thousands of records in the raw Twitter feed table.
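
Here is a contrived reproduction of the kind of breakage we hit (the data is made up, and the exact error message varies by Ruby version).  A properly quoted field can contain a line break; the same content emitted without the quotes derails the parser:

    require 'csv'

    good = "id,comment\n1,\"great ad\nlove it\"\n"
    bad  = "id,comment\n1,great ad\rlove it\n"   # embedded line break, no quotes

    # The quoted version parses fine, line break and all.
    CSV.parse(good, headers: true).each { |row| puts row['comment'].inspect }

    # The unquoted version makes the parser give up on the whole document.
    begin
      CSV.parse(bad, headers: true)
    rescue CSV::MalformedCSVError => e
      puts "parse failed: #{e.message}"
    end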

Fortunately, we could exclude records from our API calls to Google, but we needed a unique record id to do it.  Unfortunately, Ruby’s CSV parser was somewhat unhelpful about exactly which record was causing the problem.  And the web front end to Google Fusion Tables doesn’t have an easy way to jump to a specific record in any case; paging through a table with tens of thousands of records, a hundred at a time, is no way to do things.  And of course, rather than returning the data up to the point of the error, or trying to recover, the Ruby CSV parser just throws up its metaphorical hands and raises an exception when the CSV data isn’t up to its standards.  It was a perfect storm of mediocrity.  While the argument can be made that raising an exception is exactly what the parser should do, having it bail out and give us nothing meant that our numbers were suddenly all over the place.  I did eventually track down the offending data and hide it from the Ruby CSV parser, but it would have been much more helpful to have a parser that could at least try to continue in the face of corrupt data.  I think the robustness principle applies here.
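
For what it’s worth, a more forgiving fallback is only a few lines of Ruby.  This is a sketch of the sort of thing I mean rather than what we actually ran, and it assumes one record per physical line (which is exactly what you get when a newline sneaks into an unquoted field): parse line by line, keep what parses, and report what doesn’t so the offending records can be tracked down and excluded:

    require 'csv'

    # Parses CSV text one physical line at a time.  Lines the CSV library
    # refuses to parse are collected along with their line number and a
    # snippet, instead of aborting the whole document.
    def parse_leniently(raw)
      good    = []
      rejects = []
      raw.each_line.with_index(1) do |line, lineno|
        begin
          row = CSV.parse_line(line)
          good << row if row
        rescue CSV::MalformedCSVError => e
          rejects << { line: lineno, error: e.message, snippet: line[0, 60] }
        end
      end
      [good, rejects]
    end

    # Hypothetical usage:
    #   rows, rejects = parse_leniently(fusion_tables_csv)
    #   rejects.each { |r| warn "skipped line #{r[:line]}: #{r[:error]}" }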

Lessons Learned

We learned several lessons from this exercise:

  1. Plan to be inundated with traffic.  We got more than 25,000 unique requests on the first day.
  2. Make sure you have a backup plan when relying on remote data.  You never know when or why problems might crop up.
  3. Secure a domain name or other stable URL sooner rather than later.  After all that work, we almost blew it by announcing too soon, before we could be sure our domain name was working.
  4. If you think you might need to scale, plan ahead.  It’s much easier to take advantage of someone else’s infrastructure that was designed for scalability than to roll your own or try to shoehorn it in after the fact.

But all in all, we learned a lot from this about how to handle high-profile events.  We also had a lot of fun doing it.  And hey, a shout-out from the Old Spice Guy is pretty cool too.