01 Dec 2011
AUTHOR: Mike Benner
COMMENTS: 1 Comment

Infrastructure Monitoring – Scout

Scout Server Monitoring

In my last post on Application Deployment I talked about deploying your application straight to production and using metrics to measure it. One of the tools we use is Scout for server monitoring. Scout is a great service that not only monitors your typical system metrics but also has a great collection of community-built plugins and the ability to write your own plugins (using Ruby).
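To give a sense of what a custom plugin looks like, here is a minimal sketch of one that reports a Redis queue length; the class name, option keys, and queue key are placeholders for illustration, not our actual code.

```ruby
# Minimal sketch of a custom Scout plugin (hypothetical names and options).
# Scout runs build_report on each check interval and graphs whatever the
# plugin passes to report().
class RedisQueueLength < Scout::Plugin
  needs 'redis' # Scout requires this gem before build_report runs

  def build_report
    redis = Redis.new(:host => option(:host) || 'localhost')

    # Report the length of a work queue so it can be charted alongside
    # metrics coming from other servers and services.
    report(:queue_length => redis.llen(option(:queue) || 'jobs:pending'))
  end
end
```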

It has all of the standard features: graphing, triggers, email and SMS alerts, server groups, support for cloud instances, etc. But what we really like is the ability to build your own dashboard as well as custom graphs.

Scout ~ 3 Month Graph

We have charts that we built and use on our main monitoring board. These charts combine our Redis queues, processing throughputs and database commands. The great part is that the data points come from different servers and services that on their own don’t really give enough insight into whether there is an actual issue or where that issue may be. But when they are all combined onto one chart, the context for spikes can be seen and appropriate action (if necessary) can be taken.

Mongo Queries and Inserts with Redis Queues

Scout was the first monitoring service AuthorityLabs began using, back when we were a one-server web app with a handful of clients. It has served us well since then, growing with us, and they continue to add features and make the app more robust. We have since supplemented our monitoring with other services and internal tools, but Scout is still a go-to for us when it comes to building a custom plugin and getting those data points on a graph with our existing metrics.

Next time I will cover how we use New Relic for performance monitoring and tracking down application bottlenecks.

22 Nov 2011
AUTHOR: Mike Benner
COMMENTS: No Comments

Deployment, What Really Matters

Do It Live

At AuthorityLabs we have a saying, “We’ll Do It Live“. This refers to the fact that anyone in the company has the ability to deploy any part of our system to the production environment. There are no keys to the kingdom, as it were, and it is a rather large kingdom.

Many companies nowadays have this policy and rely on their test suites and continuous integration systems to catch bugs and failing tests, but to me that is the easy part. What we have found over the last year is that it is rarely a bad test, or worse a missed test case, that causes us problems. It is usually underperformant code or an infrastructure issue that pops up at scale. Thinking about our issues, we started looking at how we could handle this without taking away the ability for anyone to deploy updates as quickly as possible.

What we found was missing was metric-driven deployment. The concept isn’t new, but it is overshadowed by all the talk of CI and continuous deployment. Now when we deploy something, we have a dashboard that shows us all of our critical metrics (work throughput, server loads, queue lengths, etc.) and we watch these for variations that are out of line with the norm and with what we expected. It is amazing what you will see once you hit certain scale thresholds. This system lets us do a couple of things.

We can roll back the deployment and reevaluate the code, spin up additional resources to make up for a drop, or make changes to our infrastructure to deal with things like connection limits. This has had an interesting and welcome side effect: we now monitor and check many more metrics than before and have a better idea of the health of our system. It has also caused us to think about our code in a different way. We now put more thought into how it is going to affect the system as a whole and immediately add support for any new metric we think will need to be tracked for new features.
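As a rough illustration (not our actual tooling), the post-deploy check boils down to comparing each critical metric’s current reading against its recent norm and flagging anything that drifts too far. The metric names, baseline values, and threshold below are made up.

```ruby
# Hypothetical post-deploy sanity check: flag any metric that drifts more
# than 25% from its recent baseline.
ALLOWED_DRIFT = 0.25

def out_of_line?(current, baseline)
  return false if baseline.zero?
  ((current - baseline).abs / baseline.to_f) > ALLOWED_DRIFT
end

# In practice these readings come from Scout and our internal collectors;
# the numbers here are made up for illustration.
metrics = {
  :queue_length    => { :current => 3_400, :baseline => 1_200 },
  :work_throughput => { :current => 430,   :baseline => 450   },
  :server_load     => { :current => 2.1,   :baseline => 1.8   },
}

metrics.each do |name, m|
  if out_of_line?(m[:current], m[:baseline])
    puts "#{name} is out of line after this deploy (#{m[:current]} vs ~#{m[:baseline]}); investigate or roll back"
  end
end
```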

Does this mean we have fewer issues? No (we are doing it live, remember), but we are able to deal with them faster and in a better way. We aren’t just rolling back after someone complains and then investigating. We are seeing the actual problems in real time and correcting them where they need to be corrected.

I will follow this up with a post on our tools (Scout, New Relic, etc.) and some of the internal things we have done to tie this together. In the meantime, sit back and enjoy Mr. O’Reilly doing it live: