A lesson to everyone on capacity planning – comical failings of foursquare and MongoDB
Firstly, I’ll say that this is not MongoDB bashing; in reality this is really about Foursquare. The technology is irrelevant, it’s about how to handle growth. What is sad to say is that I’ve seen this far too often.
Foursquare had an outage this week and a very honest write-up has been posted here.
What I find amazing is that they identified the need to expand a couple of months ago, expanded to a two-node shard, and then hit problems two months later.
Judging by the sizes of the shards they generated a few months ago, it would suggest that they did the split in reaction to problems of size. If so, why didn’t they learn from the mistake?
There are two capacity planning methodologies that exist.
1. Build it big enough to last.
For most systems this is the chosen option, but understanding what your growth will be is very difficult. Ask your CEO what he expects the business growth to be and he’ll give you a number; ask him to translate that into the number of transactions on your database system, the size of your databases and so on, and you will get a blank look. What’s more, his growth figures are to some extent independent of your system. Even if the business doesn’t grow, you will likely still see your databases grow and the use of the system grow.
It’s therefore very difficult to come up with a model that can predict your database growth based on the CEO’s growth expectations, and even more difficult to translate that into resource usage: CPU, memory, I/O. Scale testing is time consuming, complex, expensive and often won’t match how your business scales anyway.
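To make that concrete, here is a toy sketch of what such a model ends up looking like. Every constant in it is an invented assumption (current size, baseline accrual, how a point of business growth maps to storage), which is exactly the problem: each line is a number nobody can give you with confidence.

```python
# Toy capacity model: translating a CEO's headline growth number into
# database size. All the constants are invented assumptions for illustration.

def projected_db_size_gb(months, current_size_gb=100.0,
                         monthly_business_growth=0.10,  # the CEO's "10% a month"
                         baseline_accrual_gb=5.0,       # data accrued even with zero growth
                         gb_per_growth_point=2.0):      # storage per 1% of business growth
    """Project database size some months out under made-up assumptions."""
    size = current_size_gb
    for _ in range(months):
        size += baseline_accrual_gb                     # logs, history, audit trails...
        size += monthly_business_growth * 100 * gb_per_growth_point
    return size

print(projected_db_size_gb(12))  # roughly 400 GB under these assumptions
```

Even this trivial model needs three inputs that the business can’t supply, and it says nothing about CPU, memory or I/O, which are harder still.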
For this reason you often just need to build big: buy a big box with lots of room for growth in every direction.
2. The other option is to be agile
This option is a very good one when generating a meaningful growth model is fruitless. Startups generally fall into this category.
The model is that, rather than invest in capacity planning and buying large infrastructure, you invest in your processes and systems so that if you need to add more capacity, you can.
Historically this meant one of a few things: you were able to procure and move to new hardware, you could just add more hardware, or your app was distributed so you could add more nodes. The last was historically the complicated one.
The crux of this model is that you see the mountain before you hit it, and your processes are good enough to increase your system’s capacity so that you don’t hit it.
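Seeing the mountain is mostly simple extrapolation over your monitoring data. A minimal sketch, assuming you have one usage reading per day and a known capacity (both numbers below are hypothetical), of estimating how long you have before the disk fills:

```python
# Fit a simple linear trend to periodic usage samples from monitoring
# and estimate how many days remain before capacity is exhausted.

def days_until_full(samples_gb, capacity_gb):
    """samples_gb: one usage reading per day, oldest first."""
    n = len(samples_gb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_gb) / n
    # least-squares slope (GB per day), no external libraries needed
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_gb)) \
            / sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # not growing: no mountain in sight
    return (capacity_gb - samples_gb[-1]) / slope

# Hypothetical readings: growing ~10 GB/day towards a 500 GB disk
print(days_until_full([400, 410, 420, 430, 440], 500))  # → 6.0
```

Real growth is rarely this linear, but even a crude trend line turned into an alert gives you the lead time this model depends on.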
The historical challenge with this model was that buying new hardware was expensive; adding more hardware is OK up to a point, but once you’ve filled those memory slots and CPU sockets you’re stuck; and developing a distributed app wasn’t easy.
The latter is what the major web systems have to do: have some way of distributing the load, i.e. sharding. As can be seen from the post, spreading the load over multiple servers seems to have been fairly easy using MongoDB.
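For reference, turning on sharding in MongoDB is only a couple of commands in the mongo shell. The database, collection and shard key below are illustrative only, not Foursquare’s actual setup:

```javascript
// Run against a mongos router; names and shard key are hypothetical.
sh.enableSharding("checkinsdb");
sh.shardCollection("checkinsdb.checkins", { userId: 1 });

// Inspect how chunks are distributed across the shards
sh.status();
```

The commands are easy; choosing a shard key that spreads load evenly is the hard part, and that choice is central to what went wrong in this outage.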
What’s common about these approaches?
Irrespective of which model you choose, the key thing you have to do is monitor your system. If you have no monitoring in place, you have no way of knowing when the next mountain is coming, and if you don’t do anything you will hit it.
I’ve seen it all too often: you get over one hurdle and people think “phew, we’ve made it”, but then don’t learn from their lessons, not realising that they’ve got an even bigger mountain looming on the horizon.
That’s what I find really surprising about this Foursquare outage. They knew this was a problem, they were using MongoDB which provides sharding, they had got over one mountain, and yet they hit another one only two months later.
As the post mentions, once you hit a limit your system performance will often take a nose dive. It won’t gradually get slower; what often happens is that because you don’t have enough of a resource you get contention for that resource, which causes a backlog, which uses more resource, which causes more contention, and you end up in a downward spiral.
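That nose dive is easy to demonstrate with a toy simulation (all the rates are invented): each queued request eats a little serving capacity, standing in for lock contention and paging, so once arrivals exceed capacity the backlog compounds rather than growing linearly.

```python
# Toy model of the downward spiral: when requests arrive faster than they
# can be served, the backlog itself consumes capacity (contention, paging),
# so effective throughput falls as the queue grows.

def simulate_backlog(arrival_rate, base_capacity, overhead_per_queued, ticks):
    backlog = 0.0
    history = []
    for _ in range(ticks):
        # every queued request eats into the capacity available for serving
        effective_capacity = max(0.0, base_capacity - overhead_per_queued * backlog)
        backlog = max(0.0, backlog + arrival_rate - effective_capacity)
        history.append(backlog)
    return history

healthy = simulate_backlog(90, 100, 0.05, 10)   # capacity to spare: backlog stays at zero
spiral  = simulate_backlog(110, 100, 0.05, 10)  # 10% over capacity: compounds each tick
print(healthy[-1], spiral[-1])
```

With headroom the queue never forms; just 10% over capacity, the backlog after ten ticks is already worse than the simple 10-per-tick deficit would suggest, which is the spiral the post describes.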
So the lesson to be learnt from this is that you have to have monitoring in place, and react to it in enough time. If you are using the agile model you don’t need as much lead time, but that doesn’t mean you shouldn’t be monitoring.
If you are in a growing business, you never know when the next spike is going to hit, so make sure you are monitoring your systems and understand how and when you need to react to growth.