As an IT professional, whether you are a database administrator, a systems engineer, a storage administrator, a network administrator, or a developer, the one thing we need to get a good grasp on in our day-to-day tasks is MONITORING. Whether we like it or not, monitoring is part of our daily life. I remember the very first thing I needed to learn when we moved to Canada a couple of years ago was monitoring daily weather changes. Growing up in a tropical country, I didn’t know what it meant to monitor how hot or how cold any given day could be. All I knew was that a day could be either sunny or rainy. That all changed when I started feeling what freezing temperatures felt like in my toes, or rather, when I could no longer feel them at all.
Here’s a simple definition that I commonly use for the term: monitoring is an intermittent series of observations in time, carried out to show the extent of compliance with a formulated standard or the degree of deviation from an expected norm. A few things to note about monitoring from this definition (I’ll tie them together with a quick sketch after the list).
- Observations in time. For monitoring to be meaningful, it has to be an ongoing series of observations along a timeline: not a single observation, but multiple ones that span a time frame. You can’t say it’s already winter because one cold day in August hit 2 Celsius/35 Fahrenheit while the rest stayed at 25 Celsius/77 Fahrenheit. It could just be an abnormally cold summer day.
- Extent of compliance. Someone who came before us (I’m still trying to figure out who) decided what the “normal” temperature for each season would be. Every temperature measurement thereafter was compared against that initial standard.
- Degree of deviation. Anything outside of the defined “normal” temperature is deemed to be weird or abnormal, just like that cold summer day in August.
- Expected norm. Each city or region has a different average temperature during the winter. Anything within that average range is considered normal.
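To tie those four ideas together, here is a minimal sketch in Python. The daily readings and the “normal” range are made-up values for illustration: it takes a series of observations over time, checks each against an expected norm, and reports the extent of compliance and the degree of deviation.

```python
# A minimal sketch of the monitoring definition above: a series of
# observations in time, checked against an expected norm.
# The readings and the "normal" range are made-up values for illustration.

EXPECTED_LOW, EXPECTED_HIGH = 20.0, 30.0  # expected norm for a summer day, in Celsius

daily_highs = [25.0, 26.5, 24.0, 2.0, 25.5, 27.0, 26.0]  # one cold outlier, like that August day

def degree_of_deviation(value, low, high):
    """How far a single observation falls outside the expected norm (0 if compliant)."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0

deviations = [degree_of_deviation(t, EXPECTED_LOW, EXPECTED_HIGH) for t in daily_highs]
compliant = sum(1 for d in deviations if d == 0.0)

# Extent of compliance: what fraction of the observations met the norm.
print(f"Extent of compliance: {compliant}/{len(daily_highs)} observations within the norm")
for day, (temp, dev) in enumerate(zip(daily_highs, deviations), start=1):
    if dev > 0:
        print(f"Day {day}: {temp} C deviates from the norm by {dev} C")
```

One abnormal reading out of seven doesn’t make it winter; it’s the whole series over time that tells the story.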
A proper monitoring methodology should include these key characteristics. Oftentimes, when someone asks me to recommend a monitoring solution, I ask why they want to set up monitoring for their infrastructure before I even recommend a tool. Most IT professionals simply want to monitor for performance, that is, how fast and efficient something is. I would propose a monitoring methodology that includes R.A.P., with a quick sketch after the list showing how all three could be measured together.
- Reliability. This is the extent to which an activity, test, or measuring procedure yields the same results when repeated. If you execute a stored procedure that takes 25 seconds to complete, would it still take 25 seconds the next time you run it, assuming that nothing has changed in the system? If you fail over a SQL Server Availability Group from one machine to another in less than 5 seconds, would you expect the same result the next time, again assuming that nothing has changed in the system? While most IT professionals look at this as a performance metric, I simply look at it as a reliability metric.
- Availability. As I mentioned in a previous blog post, this is not just availability from your point of view but from that of the end users. It is the state of simply being available.
- Performance. I don’t think I have to explain this further since most of us start monitoring systems and infrastructures on the basis of their performance.
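As a rough illustration of how all three could be measured from the same set of observations, here is a sketch built around a hypothetical probe() function, a stand-in for whatever operation matters to you, such as executing that stored procedure or attempting a connection. Run it repeatedly, then derive a reliability, availability, and performance reading from the results.

```python
import statistics
import time

def probe():
    """Hypothetical stand-in for the operation you care about, such as
    executing a stored procedure or opening an application connection.
    Returns True on success."""
    time.sleep(0.05)  # simulate the work
    return True

ATTEMPTS = 20
durations, successes = [], 0

for _ in range(ATTEMPTS):
    start = time.perf_counter()
    try:
        ok = probe()
    except Exception:
        ok = False
    durations.append(time.perf_counter() - start)
    successes += 1 if ok else 0

# Reliability: does the same operation yield the same result when repeated?
# A low spread across runs (relative to the mean) suggests a repeatable system.
spread = statistics.pstdev(durations) / statistics.mean(durations)

# Availability: what fraction of the attempts succeeded at all?
availability = successes / ATTEMPTS

# Performance: how fast is a typical attempt?
typical = statistics.median(durations)

print(f"Reliability (relative spread of durations): {spread:.2%}")
print(f"Availability (successful attempts): {availability:.2%}")
print(f"Performance (median duration): {typical * 1000:.1f} ms")
```

The point of the sketch is that the same repeated measurement answers three different questions, depending on which aspect the business cares about.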
Each of these three needs specifics in terms of what you are trying to measure. And it can be overwhelming at times: you don’t know which metrics to include in your monitoring tool, or you try to include everything so you don’t miss anything. My rule of thumb is this: start with what matters to the business.
If a reliability metric for the business simply means that users can connect to the application within a 5-second window, you don’t measure database connectivity directly from the monitoring tool but rather from the application that connects to the database. If an availability metric for the business means that users can connect to the application 24 x 7, 365 days a year, then you have to measure uptime from the application, from the database, and from both together. If a performance metric for the business simply means that users can expect to submit a filled-out form within the application within a 10-second window, then you can replicate the monitoring that you used for reliability. Note that I kept highlighting the phrase “for the business” because whatever we do as IT professionals should meet the overall goals of the business.
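As a sketch of what “measure from the application, not from the monitoring tool” might look like, here is a probe against a hypothetical user-facing endpoint; the URL and the 5-second window are stand-ins for whatever the business actually defines.

```python
import time
import urllib.request

APP_URL = "https://app.example.com/login"  # hypothetical user-facing endpoint
WINDOW_SECONDS = 5.0                       # the business-defined window

def user_facing_check(url, window):
    """Probe the path the end user actually takes, and judge it against
    the business-defined window rather than a raw database ping."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=window) as response:
            ok = 200 <= response.status < 300
    except Exception:
        ok = False
    elapsed = time.perf_counter() - start
    return ok and elapsed <= window, elapsed

met_target, elapsed = user_facing_check(APP_URL, WINDOW_SECONDS)
print(f"Met the {WINDOW_SECONDS:.0f}-second window: {met_target} ({elapsed:.2f}s)")
```

Run on a schedule, results from a probe like this feed the reliability and availability measurements above from the end user’s point of view.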
But monitoring is more than just the collection of metrics over time. We need to know what the “expected norm” is, the “extent of compliance” of those collected values, and to what extent they “deviate” from the norm. That way, it will be easier to provide a qualitative answer to someone who asks why the system is slow today. In the next blog post, we’ll look at what a baseline is and how it should influence your monitoring implementation.