Linux servers are usually rock-solid, but sometimes even they crash so badly that only pressing the reset button will bring them back to life. If you use a hosting provider with 24/7 technical support, you are lucky: normally a call to the helpdesk gets your server up and running again. But if you host your server in the basement and disaster strikes when nobody is at home, you are in trouble.

The following little device will help you in this situation. It is a watchdog that receives impulses from the server and resets the server when the impulses stop. It sounds simple, but two precautions are needed: wait for the boot sequence to complete before watching for impulses, and stop the watchdog after five reboots if no impulse has been received. The first rule prevents the server from being reset while it is still booting and the service that sends the impulses has not started yet. The second rule prevents an infinite reset loop when the server cannot boot correctly and human intervention is needed.
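The two rules above can be sketched as a small state machine. This is only an illustration written for a PC, not the actual PIC firmware: the names `wd_init` and `wd_tick`, the one-second tick and the choice that an impulse during the boot wait ends the wait early are all assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the watchdog logic, not the real firmware. */
enum wd_state { WD_BOOT_WAIT, WD_MONITOR, WD_HALTED };

struct watchdog {
    enum wd_state state;
    uint32_t elapsed;     /* seconds spent in the current state */
    uint32_t boot_delay;  /* seconds to wait for the boot sequence */
    uint32_t timeout;     /* seconds without impulse before a reset */
    unsigned resets;      /* resets made since the last impulse */
    unsigned max_resets;  /* give up after this many resets */
};

void wd_init(struct watchdog *wd, uint32_t boot_delay,
             uint32_t timeout, unsigned max_resets)
{
    wd->state = WD_BOOT_WAIT;
    wd->elapsed = 0;
    wd->boot_delay = boot_delay;
    wd->timeout = timeout;
    wd->resets = 0;
    wd->max_resets = max_resets;
}

/* Advance the model by one second. 'impulse' tells whether an impulse
 * arrived during that second. Returns true when the relay should fire. */
bool wd_tick(struct watchdog *wd, bool impulse)
{
    if (wd->state == WD_HALTED)
        return false;              /* waiting for human intervention */

    if (impulse) {
        /* Assumption: any impulse means the server is alive again. */
        wd->state = WD_MONITOR;
        wd->elapsed = 0;
        wd->resets = 0;
        return false;
    }

    wd->elapsed++;

    if (wd->state == WD_BOOT_WAIT) {
        /* Rule 1: never reset while the server may still be booting. */
        if (wd->elapsed >= wd->boot_delay) {
            wd->state = WD_MONITOR;
            wd->elapsed = 0;
        }
        return false;
    }

    if (wd->elapsed < wd->timeout)
        return false;              /* still within the impulse timeout */

    /* Timeout expired: reset the server and count the attempt. */
    wd->resets++;
    wd->elapsed = 0;
    /* Rule 2: stop after max_resets resets without any impulse. */
    wd->state = (wd->resets >= wd->max_resets) ? WD_HALTED : WD_BOOT_WAIT;
    return true;
}
```

Calling `wd_tick` once per simulated second with `impulse` set to false should produce a reset only after the boot delay plus the impulse timeout, and no more than `max_resets` resets in total.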

[Image: Watchdog schematic]

The main component of the watchdog is a PIC12F509 driven by its internal clock generator. A relay performs the reset: the server case reset switch is disconnected and the relay is connected to the main board in its place. This is a much better way of resetting than switching the mains off and on. The impulses are sent through an RS232 (COM) port by a little C program.

Three LEDs display the watchdog state. At first boot, after the device is powered, they blink alternately yellow, red and green. After that, the green LED remains lit and the yellow one blinks when an impulse is received. When a reset is made, all LEDs are off except the red one. After that, the boot waiting stage is signaled by the yellow and red LEDs blinking alternately.

[Image: Watchdog case]

I used an old 3.5 inch floppy disk drive enclosure, so the PCB is made to fit it. Although there was plenty of room, I didn’t want to drill many holes, so I used SMD components on the bottom layer. All connections are made through pin headers. A 4-pin floppy power connector was taken from the original FDD board so that the watchdog can be connected to the server’s power supply with no modification.

If you want to recompile the firmware, you need HI-TECH PICC. Right now the timings are ~184 seconds for the boot delay and ~216 seconds for the impulse timeout.

The server software is about 30 lines of code. Basically, every 60 seconds the serial port is briefly opened and then closed. The program is called from crontab and writes its status to the system log.
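For readers who want a feel for what such a program does before downloading the sources, here is a minimal sketch of the idea. It is not the lwd source: the function name `lwd_pulse`, the open flags and the log messages are my assumptions. Opening a serial port on Linux normally raises its modem control lines, which is presumably what the watchdog sees as an impulse.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Briefly open and close the serial device so the watchdog sees an
 * impulse. Returns 0 on success, -1 if the device cannot be opened.
 * Hypothetical helper, not taken from the real lwd program. */
int lwd_pulse(const char *device)
{
    /* O_NOCTTY: do not make the port our controlling terminal.
     * O_NONBLOCK: do not block waiting for a carrier signal. */
    int fd = open(device, O_RDWR | O_NOCTTY | O_NONBLOCK);

    if (fd < 0) {
        syslog(LOG_ERR, "cannot open %s: %s", device, strerror(errno));
        return -1;
    }

    close(fd);
    syslog(LOG_INFO, "impulse sent on %s", device);
    return 0;
}
```

A `main` function would call something like `lwd_pulse("/dev/ttyS0")` (the device name depends on your system) and return a non-zero exit code on failure. Since cron starts the program once per minute, no loop or sleep is needed inside it.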

After you build the watchdog, connect power to check that it works correctly. The LEDs should indicate the first boot delay, and after a total of approximately 6 minutes and 30 seconds (boot delay plus impulse timeout) the device should send a reset signal.

If you think everything is OK, you can mount the watchdog in the server. **WARNING** I don’t take any responsibility if you burn or damage something :).

  • Become root.
  • Compile the lwd program using the make command.
  • Copy the lwd binary to /usr/sbin.
  • Edit the root crontab file and insert * * * * * /usr/sbin/lwd.
  • Shut down your server and connect the watchdog to the server’s power supply, to the reset pins located on the main board and to the serial port.
  • Power up your server.

After the initial boot delay of about 3 minutes, you should see only the green LED lit, with the yellow one blinking briefly once per minute. If you want to test a failure, unplug the watchdog from the serial port or temporarily stop the cron daemon.

All the source code, the compiled HEX file, the schematic and the PCB layout can be downloaded in one file.
