By Dax Attwood
We’ve all been there: SolarWinds isn’t displaying up-to-date metrics for a node, but nothing seems to be wrong. The device is up, you can log in and run commands, and everything else seems functional, but try as you might, you can’t gather current statistics.
Finally, you discover that the node isn’t responding to SNMP/WMI for some reason. You reboot the device or restart the relevant services and resolve the issue, but it leaves you annoyed: it has taken time out of a day that was already full, and the problem could reappear on another device or server, or even the same one.
Luckily, you can get SolarWinds to do the work for you and detect when a device stops replying to the protocol used to monitor it.
The first and most important method uses the alerting engine to notify us when Orion detects that a node’s data has not been updated for three consecutive polls. We can accomplish this with a custom SQL trigger, which lets us build exactly the logic condition we are looking for.
First, let’s go over how to create the alert and refine it to suit our needs. I’m going to structure the steps to accommodate new or basic users of SolarWinds so that everybody can follow along and take away some extra knowledge of this function:
We’re going to start by creating a new alert: Settings > Manage Alerts > Add New Alert
In the screenshot I have indicated the alert properties that need to be completed: Name, Description and Severity. By default, the severity is set to Critical. You can adjust this to any other level according to preference; I have set it to Serious. Hit next to move on to the meat of the alert.
On the Trigger Condition screen, the first thing we need to change is the ‘I want to alert on’ field, which should be set to ‘Custom SQL Alert’. The Custom SQL object type gives us more power than the pre-defined selections because it lets us reference any part of the database and shape the data in ways the GUI does not support. Expanding the selector will show you many options to choose from, but the one we want is second from the bottom.
We’ll be defining our trigger condition to trigger when the following conditions are met:
By default, the SQL condition is set to Node, so just confirm that Node is selected and we can move on to the query. Type in (or copy and paste) the following:
In the above query, we can see that the first line of the WHERE clause corresponds to point 1 of our conditions, likewise for line 2 and line 3. Hit next to move on to the Reset Condition.
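The query itself is SQL, but the decision it encodes can be sketched in a few lines of Python. This is a minimal illustration only, not the alert's actual query: the field names (`polling_method`, `last_poll_utc`, `poll_interval_s`, `status`) are hypothetical stand-ins, and the excluded status codes (2 down, 9 external, 11 unmanaged) are borrowed from the report query described later in this post, so the alert’s own three conditions may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumed status codes, taken from the report section of this post.
EXCLUDED_STATUSES = {2, 9, 11}  # down, external, unmanaged


def is_stale(node, now, missed_polls=3):
    """Return True if a node looks alive but has missed `missed_polls` polls."""
    if node["polling_method"] == "ICMP":
        return False  # ICMP-only nodes have no SNMP/WMI data to go stale
    if node["status"] in EXCLUDED_STATUSES:
        return False  # already down, external or unmanaged
    allowed = timedelta(seconds=missed_polls * node["poll_interval_s"])
    return now - node["last_poll_utc"] > allowed


# Sample node: polled via SNMP every 120 s, last updated 10 minutes ago.
now = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
node = {
    "polling_method": "SNMP",
    "status": 1,
    "poll_interval_s": 120,
    "last_poll_utc": now - timedelta(minutes=10),
}
print(is_stale(node, now))  # 10 minutes is longer than 3 x 120 s
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test against known timestamps.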
The Reset Condition screen is where we define the criteria for when the alert should reset. For this alert we can leave it to reset when the trigger condition is no longer true. Hit next to move on to Time of Day.
Time of Day
We are just going to hit next here without changing anything as we want this alert to be active 24/7.
The trigger action is where we tell Orion what to do when the alert is triggered. We want to add two actions for this alert: one to write to the internal log as a reference, and a second to send the notification externally, in this example via email. Hit ‘Add Action’ and select ‘Log the Alert to the NetPerfMon Event Log’, then hit ‘Configure Action’.
In the configure action window we need to give the action a recognisable name. This can be anything you like but try and make it something with some searchable keywords in case you need to find it in the Message Centre. ‘NPM EvtLog: Device not replying to SNMP/WMI’ should suffice.
In the message that we want to send to the event log, we should include (as a minimum) some useful information to help us identify the node that triggered the alert and the last time the node was updated.
We need to use alert variables that the alerting engine will populate when the alert is triggered. Hit ‘Insert Variable’.
There are a few ways to find the variables we need, but the easiest, if you know what you’re looking for, is the search bar. For our alert, we’re going to add the Caption and LastSystemUpTimePollUtc variables. Search for caption and select the top result. Don’t worry if you see more results than I do, as they vary depending on how many modules you have installed. Confirm that Caption is selected, then search for the remaining variable and select that too. Then hit ‘Insert Variable’.
The two variables should now be in your message block. Type out the rest of the message (feel free to copy mine from the screenshot), then hit ‘Save Changes’.
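To make the idea of alert variables concrete, here is a small Python sketch of placeholder substitution. It is only an analogy: Orion’s own variable syntax and engine are not reproduced here, and the `${Caption}` style placeholders, the message wording and the sample values are all assumptions made for illustration.

```python
from string import Template

# Hypothetical message text; the placeholders stand in for the two alert variables.
message = Template(
    "${Caption} has not responded to SNMP/WMI. "
    "Last successful poll: ${LastSystemUpTimePollUtc} UTC"
)

# Sample values of the kind an alerting engine would supply at trigger time.
values = {
    "Caption": "core-switch-01",
    "LastSystemUpTimePollUtc": "2018-01-01 11:50:00",
}

rendered = message.substitute(values)
print(rendered)
```

The point is simply that the message is written once with placeholders, and each triggering node fills them in with its own name and last poll time.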
Add a second action; this one will be ‘Send an Email/page’.
Configuring a Send Email action is very similar to configuring the NetPerfMon action, with a few small differences. Begin by inserting the variables you’d like to see in the email. Useful examples are Caption, LastSystemUpTimePollUtc and any custom properties that could be used to identify the node. In our example above we are using the Site custom property and the Room custom property.
Once you’ve structured an email message that looks good, you can move down and confirm that the SMTP server is correct, or just leave it at the default server which can be configured under Settings > Configure Default Email Send Action. Hit Save Changes to add the action.
Now that we’ve got the two trigger actions for our node we can hit next to move on to the Reset Actions.
The reset actions page is where we tell Orion what to do when the alert resets. It’s not necessary for Orion to send us an email, but it is still best practice to have an Event Log reset action so that the reset is trackable.
We could add a new action here and configure it as we did on the trigger action page, but here’s a trick: just hit ‘Copy Actions from Trigger Actions Tab’. This duplicates the trigger actions into your reset actions. Delete the duplicated email action, then edit the remaining reset action and add ‘- reset’ to the end of its name. Hit next to confirm and move to the Summary page.
If you made it this far then congrats! All we need to do here is confirm all of the alert’s settings and then, most importantly, check in the bottom right corner how many nodes the alert will immediately trigger on based on the current trigger condition.
This is extremely important because if the trigger condition is wrong, you may be about to inadvertently send 1,000+ emails to your infrastructure team and/or create 1,000+ ServiceNow incidents (if your actions go beyond simple email). The opposite may also be true: you expect the alert to trigger on 4 devices but Orion says it will trigger on none. Either way, this gives you enough warning to go back and edit your trigger condition before committing and saving the alert, without affecting your environment.
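The same sanity check can be written down as a tiny guard. The sketch below is hypothetical: `check_trigger_count`, its arguments and the tolerance are invented for illustration and are not part of Orion. It simply compares the number of nodes a condition would match with the number you expect.

```python
def check_trigger_count(matched, expected, tolerance=0):
    """Compare how many nodes would trigger with how many you expect."""
    if abs(matched - expected) <= tolerance:
        return "ok"
    if matched > expected:
        return "too many: condition is probably too broad"
    return "too few: condition is probably too narrow"


print(check_trigger_count(matched=4, expected=4))
print(check_trigger_count(matched=1000, expected=4))
print(check_trigger_count(matched=0, expected=4))
```

In practice the "matched" figure is the count shown on the Summary page, and the "expected" figure is whatever you already know about the state of your environment.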
That’s it! You’ve created an alert which will detect if a device stops responding to SNMP/WMI or the Agent.
An alert isn’t what you want, you say? Well, as well as (or instead of) an alert, we can create a report that lists the devices in this condition.
While I would advocate using an alert as the primary way to identify devices that are not responding to management protocols, a report can be a useful supporting resource.
To create our report let’s head over to the reports page: Settings > Manage Reports then hit ‘Create New Report’
You will be presented with a window similar to the screenshot above. Let’s change the report width to something reasonable; I find 960px works best for this report.
Next, we’ll name our report and then hit ‘Add Content’.
The add content window has many things we can choose from but two should already be listed, Custom Table and Custom Chart. Go ahead and select Custom Table and add it.
You should now see the Add Datasource window as in the screenshot above. The first thing we want to do is change the Selection Method to ‘Advanced Database Query’ and then choose SQL for the query type.
The query we’re going to use is slightly more complicated than the one in the alert, because it also has to present the data in a table. Copy and paste the following query into the text box:
Our query uses a SELECT statement that lets the table embed the URL of the node details page for any device that appears on the report. The WHERE clause defines the criteria a node must meet to be included: its polling method is not ICMP (i.e. it uses one of the management protocols), its LastSystemUpTimePollUtc value is more than three polling intervals old, and its status is not 2 (down), 9 (external) or 11 (unmanaged). We’re using ORDER BY to sort the table by the LastSystemUpTimePollUtc column.
Lastly, give the Data Source a name and hit ‘Add to Layout’. You should now be back at the Layout Builder screen with our new table added. Hit ‘Edit Table’ so that we can set up the data columns.
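For readers who want to experiment with this logic away from a production database, the sketch below runs a query of the same shape against an in-memory SQLite table from Python. It is an approximation under stated assumptions, not the report query itself: the column names `ObjectSubType` and `PollInterval`, the node details URL path and the sample rows are assumptions, and SQLite’s date functions stand in for their SQL Server equivalents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed subset of node columns; names other than those quoted in the post are guesses.
conn.execute(
    """CREATE TABLE Nodes (
        NodeID INTEGER, Caption TEXT, IP_Address TEXT, Vendor TEXT,
        ObjectSubType TEXT, Status INTEGER, PollInterval INTEGER,
        LastSystemUpTimePollUtc TEXT)"""
)
conn.executemany(
    "INSERT INTO Nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "sw-01", "10.0.0.1", "Cisco", "SNMP", 1, 120, "2018-01-01 11:50:00"),
        (2, "srv-01", "10.0.0.2", "Windows", "WMI", 1, 120, "2018-01-01 11:59:00"),
        (3, "ping-01", "10.0.0.3", "Unknown", "ICMP", 1, 120, "2018-01-01 09:00:00"),
        (4, "sw-02", "10.0.0.4", "Cisco", "SNMP", 2, 120, "2018-01-01 09:00:00"),
    ],
)

# Same three criteria as described above: not ICMP, more than three
# poll intervals since the last update, and not down/external/unmanaged.
query = """
SELECT Caption,
       '/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:' || NodeID AS DetailsUrl,
       IP_Address, Vendor, ObjectSubType AS PollingMethod,
       LastSystemUpTimePollUtc
FROM Nodes
WHERE ObjectSubType <> 'ICMP'
  AND strftime('%s', :now) - strftime('%s', LastSystemUpTimePollUtc) > 3 * PollInterval
  AND Status NOT IN (2, 9, 11)
ORDER BY LastSystemUpTimePollUtc
"""
rows = conn.execute(query, {"now": "2018-01-01 12:00:00"}).fetchall()
for row in rows:
    print(row)
```

With these sample rows only `sw-01` qualifies: `srv-01` was polled recently, `ping-01` is ICMP-only and `sw-02` is already down. The current time is passed in as a parameter so the result does not depend on when the sketch is run.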
Give the table a name and then hit the + button to add the columns from the SQL query. The columns to add are: Caption, IP_Address, Vendor, Polling Method and LastSystemUpTimePollUtc. Now your edit table window should look similar to the screenshot above.
There are a few things to do to make this an elegant report.
Once complete, hit Submit on the table to return to the Layout Builder window. Once there, hit next to move on to the Report Preview.
Your report preview should look similar to this. Don’t worry if there are no nodes listed, that just means all your devices are working as they should! If you’re happy with the layout, hit next to move on to the Properties of the report.
The only option we’re particularly interested in here is the Report Category. Choose one of the pre-defined categories or go to the bottom of the selection to add a new one. Hit Next to set up a report schedule.
On the schedule report screen, we can create a new schedule for our report or choose an existing one. If you’ve chosen to create a new schedule, give it a name and hit ‘Add Frequency’. Play around with the frequency settings until the report is generated as often, and at the time of day, that you want. Submit, then hit ‘Add Action’ and choose email.
Give the email action a name, specify an email address and put something in the body of the message, then submit. Hit Next to move to the Summary page. Confirm that everything looks good and Submit to create your report.
I have included an export of the alert and report definitions which are available to download via the links below. These can be imported into your own Orion installation should you wish to implement the content of this post quickly. Don’t forget to review the settings and adjust anything to your particular needs.
As a reseller of SolarWinds, Prosperon Networks are always looking for new ways to push the limits and improve user experience by increasing automation and enhancing the base functionality of the product. We hope this blog helps you and your use of Orion.
Prosperon Networks are the UK's leading authority on SolarWinds IT Management Solutions. We run training courses to suit a number of roles in your organisation, catering for engineers, helpdesk operators and management personnel, who all use monitoring platforms differently. The SolarWinds products retain their simplicity and ease of use; however, product training in some form is recommended to get the most out of the tools we use every day.
Copyright © 2018 Prosperon Networks. All rights reserved. Registered Co. No. 5884643. VAT Number 889545649, Argyll House, 15 Liverpool Gardens, Worthing, W. Sussex, BN11 1RY (UK).