At AuthorityLabs we have a saying: “We’ll Do It Live”. It refers to the fact that anyone in the company can deploy any part of our system to the production environment. There are no keys to the kingdom, as it were, and it is a rather large kingdom.
Many companies nowadays have this policy and rely on their test suites and continuous integration systems to catch bugs and failing tests, but that to me is the easy part. What we have found over the last year is that it is rarely a bad test or, worse, a missed test case that causes us problems. It is usually underperforming code or an infrastructure issue that pops up at scale. While thinking about our issues, we started looking at how we could handle this without taking away the ability for anyone to deploy updates as quickly as possible.
What we found was missing was metric-driven deployment. The concept isn’t new, but it is overshadowed by all the talk of CI and continuous deployment. Now when we deploy something, we have a dashboard that shows us all of our critical metrics (work throughput, server loads, queue lengths, etc.), and we watch these for variations that are out of line with the norm and with what we expected. It is amazing what you will see once you hit certain scale thresholds. This system lets us do a few things.
We can roll back the deployment and reevaluate the code, spin up additional resources to make up for the drop, or make changes to our infrastructure to deal with things like connection limits. This has had an interesting and welcome side effect. We now monitor and check many more metrics than before and have a better idea of the health of our system. It has also caused us to think about our code in a different way. We now put more thought into how it is going to affect the system as a whole, and we immediately add support for any new metric we think will need to be tracked for new features.
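To make the idea of "out of line with the norm" concrete, here is a minimal sketch of a post-deploy check. It is not our actual dashboard code: the function names, the sample numbers, and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def out_of_line(baseline, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations away from the mean of the `baseline` samples.

    The 3.0 default is an illustrative assumption, not a recommendation.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


def check_deploy(baselines, post_deploy, threshold=3.0):
    """Return the sorted names of metrics whose post-deploy reading
    is out of line with their pre-deploy baseline."""
    return sorted(
        name
        for name, value in post_deploy.items()
        if name in baselines and out_of_line(baselines[name], value, threshold)
    )


# Hypothetical pre-deploy samples for the metrics mentioned above.
baselines = {
    "work_throughput": [980, 1010, 995, 1005, 1000],
    "server_load": [1.1, 0.9, 1.0, 1.2, 1.0],
    "queue_length": [40, 42, 38, 41, 39],
}

# Hypothetical readings taken shortly after a deployment.
post_deploy = {"work_throughput": 990, "server_load": 1.05, "queue_length": 400}

flagged = check_deploy(baselines, post_deploy)
print(flagged)
```

With these made-up numbers only the queue length is flagged, which is the kind of signal that would prompt one of the responses above: roll back, add capacity, or adjust the infrastructure.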
Does this mean we have fewer issues? No (we are doing it live, remember), but we are able to deal with them faster and in a better way. We aren’t just rolling back after someone complains and then investigating. We are seeing the actual problems in real time and correcting them where they need to be corrected.