From Mandriva Community Wiki

Watch Dog

A UPS (Uninterruptible Power Supply) is a very useful piece of hardware that keeps a computer running during a power failure. It also allows for a graceful shutdown in the event of an extended power outage, and it can have the computer restart automatically once power is restored.

A WatchDog device usually sits within a computer and monitors certain processes on that computer. When it detects that those processes have been interrupted, it reacts by forcing a restart of the computer. This is very useful when a computer has locked up, because the reboot will hopefully clear the locked condition. WatchDog devices are usually used in computers that must provide 24/7 service, so that they can be restarted automatically instead of needing immediate attention from a person.

This topic is dedicated to the intersection of these two hardware devices. Because of the nature of both, it is possible to use a UPS as a watchdog timer device to force a reboot of a server once it has locked up. This is done by essentially telling the software that controls the UPS that there is a power failure, getting it to shut down power to the machines connected to it, and then having it restart those machines. The technique is very easy once you understand how the UPS and the software that controls it work.

I will discuss this technique in general terms first and then get into specifics on how to do it for the hardware that I am familiar with. Hopefully, if you have different hardware, you can apply these techniques to your particular situation. First, I will discuss what we want to accomplish.

  • Create a monitor that watches a specific process that should always respond.
  • Create a process to automatically initiate a hard power shutdown if that process fails.
  • Ensure that the automatic shutdown has some limitation so that hardware failures don't result in a continually looping shutdown state.
  • Create a notification so that the admin can be notified when a shutdown has taken place.

We want to be very careful to build limitations into the hard booting process. For instance, if we were monitoring a web server and it suffered a catastrophic failure such as a hard drive crash, we would not want to reboot the machine indefinitely, because the hardware failure will keep the web server from functioning no matter how often it restarts. So, we build in a limitation at the admin's discretion and provide for an admin notification. If the admin is notified too many times in succession, he can take action to prevent further attempts until he has corrected whatever failure is preventing the process from starting. We also want to be sure that we drop a machine only as a last resort, when it is locked or not responding. So, we can build some general safeties into the monitor so that intermittent failures don't automatically result in rebooting a machine.
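One such safety can be sketched as a small retry wrapper: a check has to fail several times in a row before it counts as a failure at all. This is only an illustration of the idea, not part of MON or APCUPSD; the function name probe_with_retries and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run a check several times before declaring failure.
# Usage: probe_with_retries TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
probe_with_retries() {
    local tries="$1" delay="$2" attempt=1
    shift 2
    while [ "$attempt" -le "$tries" ]; do
        # Any single success means the service is still alive.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$(( attempt + 1 ))
    done
    return 1
}
```

For example, `probe_with_retries 3 30 curl -fsS --max-time 10 -o /dev/null http://www.example.com/` would report a failure only after three unsuccessful requests spaced 30 seconds apart (the URL is a placeholder).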

To build this watchdog, you will need a few different things. First, you need a UPS that is capable of shutting down and automatically restarting a machine on power failure. Most of the APC Back-UPS series will do this, as will the Smart-UPS series. If a UPS normally causes the machines attached to it to power down during a power failure and then automatically restart when power is restored, then it should work for this purpose. You will also need at least two machines attached to the UPS. You cannot do this with a single machine, because the monitor and UPS controller must be a different physical machine from the server that will be restarted: the machine that you want to restart will probably be in a locked state and unable to instruct the UPS to power down. The monitor machine must be the UPS controller, so it must be capable of being connected directly to the UPS and controlling it. Finally, you will need UPS controller software, either NUT or APCUPSD. I use APCUPSD on my machine because I run an APC UPS.

Generally, we will do the following things. First, plug the two machines into the UPS and connect your monitor machine to it as the controller. Next, install the UPS controller software. The watchdog machine will be set up as the master net controller machine, and the watched machine will be set up as the slave net device machine. You should follow the software instructions for doing this and completely test the failover and recovery behavior of the machines before proceeding. If your failover/recovery procedures don't work, then neither will the watchdog functions. Once you have the UPS, UPS software and machines installed, working and tested, you can proceed to the next steps.
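As a quick sanity check before the full failover test, it can help to confirm that the controller machine is actually talking to the UPS. The sketch below assumes APCUPSD and its apcaccess status tool, and assumes the usual "NAME : value" layout of its report; the helper name ups_status is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "apcaccess status" output on stdin and print
# the value of the STATUS field (for example ONLINE or ONBATT).
# Assumes the report uses "NAME : value" lines.
ups_status() {
    awk -F':' '$1 ~ /^STATUS[ \t]*$/ {
        sub(/^[ \t]+/, "", $2)   # trim leading whitespace
        sub(/[ \t]+$/, "", $2)   # trim trailing whitespace
        print $2
        exit
    }'
}
```

On the controller this would be used as `apcaccess status | ups_status`. If nothing is printed, the daemon is probably not communicating with the UPS and the watchdog will not work either.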

You will need to install some type of monitoring software. I use the MON daemon for my monitoring because it is easy to install and it has ready-made monitor scripts as well as customizable alerts, which is what we need. You can use whatever monitoring software you want, but it needs to do three things: it needs to be able to check the processes that you want to monitor, it should be customizable so that you can tailor the thresholds to your needs, and it needs to support a customized alert process. The MON software will run as a daemon and check the target machine's process at a specified interval. At the point that it alerts, it will run a script that you write, which will kick off the reboot process. For instance, I use MON to check a web server process every 10 minutes. If I have 3 or more failures of that process within an hour, I assume that the web server is locked up and I issue an alert. The alert is customized to notify me as well as start the reboot process.
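If your monitoring software has no built-in threshold, the "3 or more failures within an hour" rule can be approximated by hand. The following is a minimal sketch of that rule only, not how MON implements it; STATE_FILE, WINDOW, THRESHOLD and both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of a failure threshold: alert only when THRESHOLD
# or more failures fall within the last WINDOW seconds.
STATE_FILE="${STATE_FILE:-/tmp/watchdog-failures.log}"
WINDOW="${WINDOW:-3600}"      # look back one hour
THRESHOLD="${THRESHOLD:-3}"   # failures needed before alerting

# record_failure NOW: append a failure timestamp (epoch seconds).
record_failure() {
    echo "$1" >> "$STATE_FILE"
}

# should_alert NOW: return 0 if enough recent failures are on record.
should_alert() {
    local now="$1" cutoff count
    cutoff=$(( now - WINDOW ))
    [ -f "$STATE_FILE" ] || return 1
    # Count the timestamps that are no older than the cutoff.
    count=$(awk -v c="$cutoff" '$1 >= c { n++ } END { print n + 0 }' "$STATE_FILE")
    [ "$count" -ge "$THRESHOLD" ]
}
```

A monitor script would call `record_failure "$(date +%s)"` after each failed check and run the alert only when `should_alert "$(date +%s)"` succeeds. Old entries are simply ignored here; a real script would also want to prune the file.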

The APCUPSD software uses a special file to trigger a hard boot, and we will use that file to trigger the watchdog function. In the /etc/apcupsd/ directory, we create a file called powerfail simply by issuing the touch /etc/apcupsd/powerfail command. This is the same file that the apcupsd daemon creates when it detects a power failure along with an impending low battery condition. It triggers the daemon to issue the shutdown command to all attached computers and then cut power at the UPS. This forces all active machines attached to the UPS to shut down gracefully, after which power is cut to all of them. The UPS will stay dormant for a preset amount of time (300 seconds on mine), and then it will detect that mains power is still active and interpret that as power coming back online. It will then turn power back on to the machines. This should force a recovery condition, and each machine should boot back up and become active. I will include a copy of my script that does this below because it has the necessary elements: limiting the number of reboots, requiring admin intervention, notification, etc.

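#!/bin/bash
# /usr/local/sbin/upsrestart
# Force a hard boot through the UPS when a server is locked up.

# First check whether the server has already been hard booted.
if [ -e /var/hardboot ]; then
    echo "Machine has already been restarted"
else
    # Leave a flag so that repeated alerts cannot trigger repeated hard boots.
    touch /var/hardboot
    # Stop the daemon so that it cannot cancel the forced shutdown.
    service apcupsd stop
    # Request the UPS power cut that follows the shutdown.
    touch /etc/apcupsd/powerfail
    # Run the same shutdown action the daemon runs on a real power failure.
    /etc/apcupsd/apccontrol doshutdown
fi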

This script first checks whether the /var/hardboot file exists. If it does, the script echoes a line and then exits. We do not want to create a situation where a hard boot can occur repeatedly, and this check prevents that from happening. The admin must manually remove the /var/hardboot file in order to allow another hard boot to occur. He can also set up a cron job to remove that file if he wants to allow reboots at specific intervals without manual intervention. My machine runs a cron job to remove /var/hardboot once per day, which allows my watchdog timer to hard boot my server only once per day without manual intervention from me. If the /var/hardboot file does not exist, the script first creates it by touching /var/hardboot, then it issues the command to stop the apcupsd daemon (because we cannot actively change anything while the daemon is running). Next, we touch the /etc/apcupsd/powerfail file, which will trigger the UPS power cut after shutdown. Last, we issue the apccontrol doshutdown command, which is the command issued when the daemon detects a power failure and tells connected machines to shut down. We issue this manually because the daemon is not running, and we don't want to restart the daemon because it would detect that the power is not out and cancel the shutdown. We do not want that to happen; we want to force a shutdown.
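The daily clean-up job mentioned above could look something like the following. This is a sketch rather than the author's actual cron job; the function name rearm_watchdog and the idea of logging a line on each run are assumptions.

```shell
#!/bin/bash
# Hypothetical re-arm helper: remove the flag file so that the watchdog
# is allowed to hard boot the server again.
# Usage: rearm_watchdog [FLAG_FILE]   (defaults to /var/hardboot)
rearm_watchdog() {
    local flag="${1:-/var/hardboot}"
    if [ -e "$flag" ]; then
        rm -f "$flag" && echo "Removed $flag; watchdog re-armed"
    else
        echo "No $flag present; nothing to do"
    fi
}
```

Saved as an executable script that ends with a call to `rearm_watchdog` and run once per day from cron, this limits the watchdog to one hard boot per day. Running it less often, or not at all, keeps the admin in the loop for every reboot.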

We also need to create a custom alert for the MON daemon so that it triggers a restart when we want one. My alert is named upsrestart.alert and it sits in the /usr/lib/mon/alert.d directory.

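#!/bin/bash
# /usr/lib/mon/alert.d/upsrestart.alert
# This alert notifies the admin with an SMS message and then calls the UPS hard boot script.

# provider, smsnumber and servernumber are placeholders; substitute your own values.
/usr/bin/smssend provider smsnumber servernumber upshardbootcalled
exec /usr/local/sbin/upsrestart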

Notification is up to you. You can use a pager, a program that sends an SMS message to a phone, an email message, or whatever works best for you. It is important that you include a notification because, in the event of a hardware failure, the repeated notifications are what will alert you to the failure.

If you use MON as the monitor/alert application, you can use Webmin as an interface to control the thresholds and tailor them to your needs. If I have left out any information that you need in order to do this, please send an email to bphinney_mail(at)kislinux.org and I will try to update this entry and help you get this working.
