“Integerocalypse” is a word that has been spoken in hushed tones around the office for a long time. While I’ll let Ben explain it in better technical detail, here’s the gist: all analytics data for all our customers is held in a database. When that database passed 2.1 billion records, all hell was set to break loose.
While we took steps to avoid the issue, we couldn’t quite escape its clutches. When we passed that mythical number, stats had to be taken offline, and they ended up running behind for nearly a week. This was terrible news for our customers, many of whom depend on our analytics to make daily business decisions.
This issue forced us to think a lot more critically about how we communicate issues to our customers, and how we can improve that communication in the future.
Here are some of the lessons we learned:
Have a process in place... before an issue occurs.
We get scared of "putting process in place" – it sounds like something for big corporations, not us. But a lack of process can cause confusion and lead to critical pieces getting dropped on the floor. You don’t need to be completely prescriptive, but you do need to get the basics down.
And it’s not just what needs to be done, it’s also by whom. It’s a great first step to have a skeleton process in place, but with that alone, you are still not completely out of the woods. Just as important is to assign clear roles so that folks know exactly what action they are expected to take. This further reduces confusion and enables quick action.
Categorize, then Attack.
Not all outages are created equal. We sort ours into two key buckets: internal issues (those that only affect our direct customers) and external issues (those that also affect our customers’ customers).
Once you determine who is being affected, it is much easier to determine the type of communication that is appropriate.
In the example of our stats running behind, our direct customers were affected, and the issue would manifest itself on the stats pages in their account. This meant communicating about the outage should happen in-app (where our customers could see it) and on the stats pages specifically (where it was most relevant).
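To make the categorize-then-communicate idea concrete, here is a minimal Python sketch. It is only an illustration, not how our systems actually work: the `Outage` class, the `channels_for` function, and the channel names are all made up for this example, and sending external issues to a public status page is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Audience(Enum):
    INTERNAL = "internal"  # only our direct customers are affected
    EXTERNAL = "external"  # our customers' customers are affected too


@dataclass
class Outage:
    name: str
    audience: Audience
    affected_pages: List[str]  # where in the app the problem shows up


def channels_for(outage: Outage) -> List[str]:
    """Pick where to announce an outage, based on who is affected."""
    # Direct customers see the problem in their account, so tell them
    # in-app, on the specific pages where it is most relevant.
    channels = [f"in-app:{page}" for page in outage.affected_pages]
    if outage.audience is Audience.EXTERNAL:
        # Assumption for this sketch: issues that reach beyond our direct
        # customers also get a public announcement.
        channels.append("public-status-page")
    return channels


stats_delay = Outage("stats running behind", Audience.INTERNAL, ["stats"])
print(channels_for(stats_delay))
```

The point is the order of operations: decide who is affected first, and the right channels tend to follow from that.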
Communicate thoroughly, and keep those affected up to date.
In-app messages are great, because they can be pointed and timely, but you may not have room to communicate everything you need. Twitter is the same way – 140 characters is just enough room to incite panic, not to actually communicate the issue.
We needed a good place to point interested or affected customers, where they could read up on the issue and get a good sense of where we were in the process of fixing it.
I’ll admit, we’re pretty behind the times, but it was time to build a status page. Codenamed Bugle, after the simplest brass instrument, it is designed to communicate quickly and clearly what issues our infrastructure might be experiencing, and keep customers in the loop.
It provides us a place to point to when we quickly communicate issues in-app or through Twitter. This will make us much more confident alerting folks to issues that may be affecting them, since we can provide sufficient context for those interested.
From a support perspective, it will also help our customers identify whether an issue they are experiencing is also affecting their users. We receive lots of inquiries from customers checking to see if a local issue (something like the Flash plugin failing) is actually an infrastructure issue. Bugle gives concerned customers a reliable place to check on this, and I hope it will let them breathe easier.
Fix it Twice
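Joel Spolsky wrote a great post, "Seven Steps to Remarkable Customer Service", that gave me a roadmap for tackling my position at Wistia and that I often turn to for inspiration. Its first step is to fix everything two ways: solve the customer’s immediate problem, then fix the root cause so it doesn’t happen again. Go give it a read, if you haven’t already.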
I’d argue we’ve been following this advice for a long time, but on an individual basis. We’ve been fixing the root issue, and following up with customers who reached out via email. It’s time we made this communication more transparent.
The first post on Wistia Status is a postmortem of Integerocalypse. I hope you’ll give it a read. It’s a great write-up of what went down and what we’re working on to prevent it from happening again.