In UNIX, log monitoring is a big deal, and there are usually several different ways that a log file can be set up, which makes monitoring it for specific errors a customized task.
Now, if you’re the person at your job charged with the task of setting up effective UNIX monitoring for various departments within the company, you probably already know the frequency with which requests come in to monitor log files for specific strings/error codes, and how tiring it can be to set them up.
Not only do you have to write a script that will monitor the log file and extract the provided strings or codes from it, you also need to spend an ample amount of time studying the log file itself. This is a step you can’t do without. Only after manually observing a log file and learning to predict its behavior can a good programmer write the proper monitoring check for it.
When planning to monitor log files effectively, it is imperative you suspend the notion of using the UNIX tail command as your primary method of monitoring.
Why? Say, for instance, you write a script that tails the last 5000 lines of a log every 5 minutes. How do you know the error you’re looking for didn’t occur just before those 5000 lines? During the 5 minute interval that your script is waiting to run again, how do you know that more than 5000 lines weren’t written to the log file? You don’t.
In other words, the UNIX tail command will do exactly what you tell it to do… no more, no less. That leaves room for missing critical errors.
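The blind spot is easy to reproduce. The following is a minimal sketch using a made-up 6000-line sample log with a single Err1300 line placed at line 500; the file contents are invented purely for illustration.

```shell
#!/bin/sh
# Build a 6000-line sample log with a single error on line 500.
log=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 6000; i++) print (i == 500 ? "Err1300 disconnected" : "ok line " i) }' > "$log"

# The last 5000 lines start at line 1001, so the error on line 500 is never seen.
# grep exits non-zero when nothing matches, hence the "|| true".
tail_hits=$(tail -n 5000 "$log" | grep -c 'Err1300' || true)
full_hits=$(grep -c 'Err1300' "$log" || true)

echo "tail window: $tail_hits, whole file: $full_hits"
rm -f "$log"
```

The tail-based check reports zero hits even though the error is in the file, which is exactly the kind of miss described above.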
But if you don’t use the UNIX tail command to monitor a log, what then are you to do?
As long as each line of the log you want to monitor has a date and time on it, there is a much better way to efficiently and accurately monitor it.
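As a rough sketch of the idea, the function below keeps only the lines written at or after a cutoff time. It assumes every line begins with a timestamp in the form YYYY-MM-DD HH:MM:SS, which is an assumption for illustration; a real log may use a different layout and would need a different comparison. The function name lines_since and the sample log lines are hypothetical.

```shell
#!/bin/sh
# lines_since CUTOFF FILE
# Print every line whose leading "YYYY-MM-DD HH:MM:SS" timestamp is at or after CUTOFF.
# Timestamps in this layout sort as plain strings, so a string comparison is enough.
lines_since() {
    awk -v cutoff="$1" 'substr($0, 1, 19) >= cutoff' "$2"
}

# Sample log, invented for this example.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01 09:58:12 relay started
2024-05-01 10:03:40 Err1300 disconnected
2024-05-01 10:07:05 Err1300 disconnected
EOF

# Only the two lines from 10:00:00 onward are printed.
recent=$(lines_since '2024-05-01 10:00:00' "$log")
echo "$recent"
rm -f "$log"
```

On systems with GNU date, a cutoff for the last 60 minutes could be computed with date -d '60 minutes ago' '+%Y-%m-%d %H:%M:%S'; the -d option is a GNU extension and is not available in every UNIX date command.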
You can make your job as a UNIX monitoring specialist or a UNIX administrator a heck of a lot easier by writing a robotic log scanner script. And when I say “robotic”, I mean designing an automated program that thinks like a human and is usefully versatile.
What do I mean?
Rather than scripting your log monitoring command around a line similar to the following:
tail -5000 /var/prod/sales.log | grep -i disconnected
why not write a program that monitors the log based on a time frame?
Instead of using the aforementioned primitive method of tailing logs, a robotic program like the one in the examples below can actually cut your amount of tedious work from 100% down to about 0.5%.
The simplicity of the code below speaks for itself. Take a good look at the examples for illustration:
Say, for instance, you want to monitor a particular log file and alert if a given number of certain errors is found within the present hour. This script does it for you:
/sbin/MasterLogScanner.sh (logfile-absolute-path) '(string1)' '(string2)' (warning:critical) (-hourly)
/sbin/MasterLogScanner.sh /prod/media/log/relays.log 'Err1300' 'Err1300' 5:10 -hourly
All you have to pass to the script is the absolute path of the log file, the strings you want to examine in the log and the thresholds.
With regard to the strings, keep in mind that both string1 and string2 must be present on each log line that you want extracted. In the syntax examples shown above, Err1300 was used twice because there’s no other unique string that can be searched for on the lines where Err1300 is expected to show up.
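The internals of MasterLogScanner.sh are not shown here, so the following is only a sketch of how an hourly check with a warning:critical pair might work. The helper names count_both and classify are hypothetical, the timestamp layout (YYYY-MM-DD HH:MM:SS at the start of each line) is assumed, and treating a count equal to a threshold as a breach is also an assumption.

```shell
#!/bin/sh
# count_both FILE HOUR_PREFIX STRING1 STRING2
# Count lines that start with HOUR_PREFIX (e.g. "2024-05-01 10") and contain both strings.
count_both() {
    awk -v p="$2" -v a="$3" -v b="$4" \
        'index($0, p) == 1 && index($0, a) && index($0, b) { n++ } END { print n + 0 }' "$1"
}

# classify COUNT WARNING:CRITICAL
# Print OK, WARNING or CRITICAL for a threshold pair such as 5:10.
classify() {
    warn=${2%%:*}
    crit=${2##*:}
    if [ "$1" -ge "$crit" ]; then echo CRITICAL
    elif [ "$1" -ge "$warn" ]; then echo WARNING
    else echo OK
    fi
}

# Sample log, invented for this example: one match at 09:xx and six at 10:xx.
log=$(mktemp)
{
    echo '2024-05-01 09:59:59 Err1300 relay timeout'
    for m in 01 02 03 04 05 06; do
        echo "2024-05-01 10:$m:00 Err1300 relay timeout"
    done
} > "$log"

count=$(count_both "$log" '2024-05-01 10' 'Err1300' 'Err1300')
status=$(classify "$count" 5:10)
echo "$count matches: $status"
rm -f "$log"
```

In a live check, the hour prefix would come from the clock rather than a fixed string, for example $(date '+%Y-%m-%d %H').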
If you want to monitor the last given number of minutes, or even hours, of a log file for a certain string and alert if the string is found, then the following syntax will do that for you:
/sbin/MasterLogScanner.sh (logfile-absolute-path) (time-in-minutes) '(string1)' '(string2)' (-found)
/sbin/MasterLogScanner.sh /prod/media/log/relays.log 60 'luance' 'Err1310' -found
So in this example:
/prod/media/log/relays.log is the log file.
60 is the number of previous minutes of the log that you want to search.
“luance” is one of the strings on the log lines that you’re interested in.
Err1310 is another string on the same line where you expect to find the “luance” string. Specifying these two strings (luance and Err1310) isolates and processes the lines you want a lot quicker, particularly if you’re dealing with a very large log file.
-found specifies what type of response you’ll get. By specifying -found, you’re saying that if anything matching the preceding strings is found, it should be regarded as a problem and reported.
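To alert when an expected string is missing from the last given number of minutes of a log file, use the following syntax:
/sbin/MasterLogScanner.sh (logfile-absolute-path) (time-in-minutes) '(string1)' '(string2)' (-notfound)
/sbin/MasterLogScanner.sh /prod/apps/mediarelay/log/relay.log 60 'luance' 'Err1310' -notfound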
The preceding example follows exactly the same logic as the -found example, except that -found is replaced with -notfound. This means that if Err1310 isn’t found for luance within the given period, then that is a problem.
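The difference between the two modes comes down to how the number of matching lines is interpreted. A minimal sketch of that decision is below; the function name evaluate and its PROBLEM/OK output are hypothetical and are not taken from MasterLogScanner.sh.

```shell
#!/bin/sh
# evaluate MODE COUNT
# MODE is -found or -notfound; COUNT is the number of matching lines in the time window.
# Prints PROBLEM or OK; returns 2 if the mode is not recognised.
evaluate() {
    case "$1" in
        -found)    if [ "$2" -gt 0 ]; then echo PROBLEM; else echo OK; fi ;;
        -notfound) if [ "$2" -eq 0 ]; then echo PROBLEM; else echo OK; fi ;;
        *)         echo "unknown mode: $1" >&2; return 2 ;;
    esac
}

# With three matching lines, -found is a problem and -notfound is not.
found_result=$(evaluate -found 3)
notfound_result=$(evaluate -notfound 3)
echo "-found: $found_result, -notfound: $notfound_result"
```

With zero matching lines the results flip: -found reports OK and -notfound reports PROBLEM.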