Do you know that expression we say a lot in IT? “First get the basics right.”
Actually, it is very true. We, as IT engineers, love to use cutting-edge technologies, configure complicated stuff, and spend hours trying to figure out how to use that advanced feature that guy at IP EXPO told us was possible. All this can come at the expense of the basics: those simple things that are easy to get done, that should be done, that everyone thinks are being done, but are not.
Through these seven points, we will review the basics of monitoring that no one has told you about before. These concepts should be applied to your monitoring tools before you start doing more complicated stuff.
Is My Device Being Monitored?
Guys, this is a common one. Think about the following situation:
Engineer 1: Your monitoring tool is not working, I didn’t receive any alert this morning regarding my [insert any of your critical devices here].
Engineer 2: Did you add that device to the monitoring tool?
Engineer 1: Ehh, no, that’s your thing, isn’t it?
Engineer 2: Did you tell me to add that device to our monitoring tool?
Engineer 1: Uhmmmmm
Be honest, does this sound familiar? It does to me, but don’t feel too bad; this is more common than you think.
People forget to monitor their devices, even when there is a procedure to do so! Don’t worry, most monitoring tools can discover what is new on the network and monitor it. As an example, with SolarWinds we can interrogate the Domain Controllers to get a list of servers, and if it finds something that is not monitored, BANG… it adds it by itself.
- Utilise Network Discovery scans to find devices currently not being monitored.
- Have a written and distributed procedure for managing new devices added to your infrastructure.
- Compare Orion managed nodes to your devices in your CMDB/ITSM platform.
- The API works well for this.
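The CMDB comparison mentioned above can be sketched in a few lines of Python. This is only an illustration: the device names are made up, and in practice both lists would come from an export or an API query against your CMDB and your monitoring tool. The function name `find_unmonitored` is hypothetical.

```python
def find_unmonitored(cmdb_devices, monitored_nodes):
    """Return CMDB devices with no matching monitored node (case-insensitive)."""
    monitored = {name.strip().lower() for name in monitored_nodes}
    return sorted(
        device for device in cmdb_devices
        if device.strip().lower() not in monitored
    )


# Hypothetical sample data standing in for a CMDB export and a node list.
cmdb = ["FW-EDGE-01", "sql-prod-02", "exch-01", "ups-dc1"]
monitored = ["fw-edge-01", "EXCH-01"]

for device in find_unmonitored(cmdb, monitored):
    print(f"Not monitored: {device}")
```

Matching on a normalised hostname is the simplest option; if your CMDB and your NMS name devices differently, matching on IP address or serial number may work better.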
It Is More Than Up Or Down
When you ask people if they monitor their network they reply ‘Yeah sure’, but, more often than it should be, what they really mean is ‘I have a script that pings my servers’.
There is nothing bad in pinging devices, but monitoring your network is more than pinging; actually, it is MUCH more than pinging. Don’t get me wrong, checking the status of your devices is a big part of monitoring, but not the only one. We need to monitor performance, and CPU, and traffic, and connections, and disk, and services, and logs, and… and… All of them are important.
What happens when the C:\ drive of a server is full? What happens when you reach the limit of connections on the firewall? Or what about when the battery of the UPS is going to die? All these events are service-disruptive and could be avoided (think about that NOC call at 3 am) by just monitoring those metrics.
- Understand what the device/server does and the service it delivers.
- Over-monitoring is better than under-monitoring.
- Keep large amounts of information under control by prioritising what appears on page views and moving the rest to subpages.
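As a small illustration of going beyond up/down, here is a minimal Python sketch that checks how full a disk is instead of only checking whether the server answers a ping. The 80% and 90% limits and the function names are assumptions made for the example, not values from any particular tool.

```python
import shutil


def classify_usage(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a usage percentage to a status string."""
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def disk_status(path, warn_pct=80.0, crit_pct=90.0):
    """Return (status, used percentage) for the volume that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = usage.used / usage.total * 100 if usage.total else 0.0
    return classify_usage(used_pct, warn_pct, crit_pct), used_pct


# Check the volume that holds the current directory.
status, used = disk_status(".")
print(f"Disk usage {used:.1f}% -> {status}")
```

The same pattern (collect a metric, compare it with a warning and a critical limit) applies to connection counts, UPS battery levels and most of the other metrics mentioned above.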
Alerts: Get The Correct Amount Using Custom Thresholds
I always say that alerting is like cooking pasta for your family: either you cook pasta for the whole block, or not even enough for yourself. Alerting is the same: either you are flooded with emails, or you don’t get enough (normally the former).
This is a serious subject. Why do you want a monitoring tool if you ignore the alerts you receive? I’ve seen mailbox folders with thousands of unread emails from the monitoring tool.
I do sympathise with you. If I received one alert every two minutes, I would also lose confidence in the alerts. But if we have a tool as important as our NMS (Network Monitoring Solution), then we have to try to get the most out of it, and alerting is one of the big features of any NMS.
Let’s focus on the root of the issue: the main problem here is that you haven’t tailored alerting to your environment. We all have similar equipment and services running on our networks, but we use them in different ways; therefore we need to configure our NMS to handle these differences.
One of the main features of monitoring solutions is the use of custom thresholds (for example in SolarWinds). Is it normal that the CPU load of your backup server is near 100% during the night? Probably yes. Is it normal that the CPU of your router is at 70% during the night? Probably not. Therefore we need to apply different thresholds in order to receive an alert when the CPU of the router goes crazy, but not when the same situation happens on the backup server.
Metrics like CPU, memory usage, traffic, disk usage, response time and packet loss are good candidates for custom thresholds.
- Fine-tune thresholds on your monitoring tool. If your NMS does not support this feature, use SolarWinds :)
- Identify what you actually NEED to be alerted on. At Prosperon the expression is to alert only on ‘Actionable conditions’.
- Get the notification method right for each type of alert; some may require an SMS, while for others a Slack message is sufficient.
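The backup server versus router example can be sketched as a per-node threshold lookup with a global default. This is a simplified illustration, not how any specific NMS stores its thresholds; the node names and percentages are invented for the example.

```python
# Global defaults applied to any node without an override (assumed values).
DEFAULT_CPU_THRESHOLDS = {"warning": 80, "critical": 90}

# Per-node overrides; node names and limits are hypothetical.
CUSTOM_CPU_THRESHOLDS = {
    "backup-01": {"warning": 99, "critical": 100},  # busy at night by design
    "router-01": {"warning": 50, "critical": 70},   # should stay mostly idle
}


def cpu_alert_level(node, cpu_pct):
    """Return 'critical', 'warning' or None for a CPU reading on a node."""
    thresholds = CUSTOM_CPU_THRESHOLDS.get(node, DEFAULT_CPU_THRESHOLDS)
    if cpu_pct >= thresholds["critical"]:
        return "critical"
    if cpu_pct >= thresholds["warning"]:
        return "warning"
    return None


for node, reading in [("backup-01", 97), ("router-01", 70), ("web-01", 85)]:
    print(node, reading, cpu_alert_level(node, reading))
```

With these values the backup server at 97% stays silent, the router at 70% raises a critical alert, and a node without an override falls back to the defaults.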
The Holistic Approach: Use Different Technologies To Monitor Your Network
When we want to monitor Exchange (as an example), we could use WMI to monitor the OS metrics and some services and performance counters. But it would also be nice to check whether TCP port 25 is open and responding in a timely manner, to check that the OWA interface is up and running, and not to forget to check that emails are being sent and received.
As you can imagine, using just one method or protocol to monitor your devices is sometimes not enough. We have to use all the options available to get all the information available; otherwise, we won’t get the data we need to properly troubleshoot our devices.
When we monitor network devices we think about SNMP, and when we talk about Windows servers we think about WMI. But what about syslog or traps? What about log files for Exchange or SQL? What about application response times based on traffic timestamps? What about real user experience with Web Performance Monitor? What about the cloud and the path to reach it? There are so many ways to monitor the performance of your network and, most of the time, only one of these monitoring methods will give you the insight you need to find the root cause of the issue.
Some of the methods that SolarWinds allows us to use in order to monitor our network are listed here:
- ICMP, SNMP, WMI
- Syslog and SNMP Traps
- Event logs and log files
- Application response time
- File and directory monitoring (size, age, existence…)
- Scripting (Perl, PowerShell, Python, SQL…)
- Route analysis
- Configuration analysis
- User experience monitors
- Traffic analysis
- Explore other methods or protocols to monitor your devices.
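To make the "is TCP port 25 open and responding in a timely manner" idea concrete, here is a minimal Python sketch of a port check that also measures the connection time. It only tests that a TCP connection can be opened; it does not speak SMTP or verify mail flow. The function name and the 3-second timeout are assumptions for the example.

```python
import socket
import time


def check_tcp_port(host, port, timeout=3.0):
    """Try to open a TCP connection.

    Returns (True, connect time in ms) on success, (False, None) otherwise.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.perf_counter() - start) * 1000
            return True, elapsed_ms
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, None
```

A call such as `check_tcp_port("mail.example.com", 25)` (a placeholder hostname) would then report whether the SMTP port accepts connections and how long the handshake took, which complements whatever WMI or SNMP tells you about the server itself.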
Now You Got The Data, Play With It
OK, now you have all the data that you need, you are monitoring every single aspect of your network, and you think that this is all you need. Now what?
Now it’s time to play with the data, to create alerts, to create reports, to create dashboards, for capacity planning, for management, for customers…
Bear this in mind: if you don’t play with the data, if you don’t display it or create alerts from it, it’s as if you were not monitoring that data at all. Displaying the information is therefore as important as monitoring the right information.
You would be surprised at the number of times I check a SolarWinds Orion installation and find that the node details view is missing more than 50% of the metrics being monitored on the devices. It’s a shame, because when users try to find out why a device is broken, they will not find the cause: it is not being displayed (although it is being monitored).
- Create as many dashboards as you need in order to display all data you are monitoring.
- Customise your details views to make sure important data is shown at the top of the first page.
- Use subpages to structure larger quantities of monitoring data.
- Target your views at your audience.
Keep The Platform Up To Date
I don’t have to tell you that your IT infrastructure is an organic entity. From time to time, you will replace devices, add new services, reconfigure your network… This means that once you have configured your monitoring platform, the job is not done: you need to maintain it and take care of it.
And we are not just talking about adding new stuff; we are talking about what is already there in your NMS: tuning thresholds, editing alerts, modifying reports… These are the common tasks that the owner of the monitoring tool has to perform almost daily.
I guess you probably knew that your platform has to be maintained. So why don’t you do it? Let me answer this for you: because you forget. What’s the solution? Create reminders.
In SolarWinds, you can create scheduled tasks that email you all the reports you need in order to maintain your network: reports like devices with missing properties, devices with wrong credentials, or devices generating too many alerts. These are some examples of reports that should be sent to you every week or fortnight.
- Create maintenance tasks to remind you what has to be reviewed.
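The kind of maintenance report described above boils down to a simple filter over your node inventory. Here is a rough Python sketch; the property names, the limit of 50 alerts per week and the sample nodes are all assumptions made for illustration, and a real report would read the data from your NMS rather than from a hard-coded list.

```python
# Properties every node is expected to have filled in (assumed names).
REQUIRED_PROPERTIES = ("location", "owner")


def nodes_needing_review(nodes, max_alerts_per_week=50):
    """Return {node name: [reasons]} for nodes that need attention."""
    findings = {}
    for node in nodes:
        reasons = [
            f"missing {prop}" for prop in REQUIRED_PROPERTIES
            if not node.get(prop)
        ]
        if node.get("alerts_last_week", 0) > max_alerts_per_week:
            reasons.append("too many alerts")
        if reasons:
            findings[node["name"]] = reasons
    return findings


# Hypothetical inventory standing in for data pulled from the NMS.
inventory = [
    {"name": "sw-core-01", "location": "DC1", "owner": "network", "alerts_last_week": 3},
    {"name": "sql-prod-02", "location": "", "owner": "dba", "alerts_last_week": 120},
    {"name": "ups-dc1", "location": "DC1", "owner": None, "alerts_last_week": 0},
]

for name, reasons in nodes_needing_review(inventory).items():
    print(f"{name}: {', '.join(reasons)}")
```

Running something like this on a schedule and emailing the output gives you the reminder; the review itself is still a human task.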
Talk To The Experts
It would be nice to know about everything. I’m not talking about just a little bit of everything; I mean a lot about everything. I would love to know a lot about BGP (I’m still working on my CCIE), but I would also love to know a lot more about SQL, and Citrix, and HTML. But that’s not possible, I’m afraid (at least for common mortals).
We, as human beings, have limited capacity, and we need to choose what to learn and what to rely on other people for.
However, when we talk about monitoring, we need to know about all these technologies; at the end of the day, we have a monitoring system specifically to monitor all these things. That’s why we have to engage the SMEs to design and fine-tune our NMS to monitor what we really need to monitor.
Maybe it is pointless to monitor the submission queue length of our Exchange server, or maybe it is not. I don’t know, and that’s why we have to ask the experts. (By the way, I checked: it is worth monitoring.)
- Have regular chats with the SMEs to ask for advice.
SolarWinds Network Performance Monitor
SolarWinds Network Performance Monitor (NPM) is a powerful and affordable network monitoring software that enables you to quickly detect, diagnose, and resolve network performance problems and outages.