Watchdog timers have been coming up in discussions lately, so I thought it might be good to start a discussion about the use and misuse of watchdog timers in a preemptive multitasking operating system, like Windows CE. I am going to share my thoughts, but look forward to you, my reader, sharing your thoughts on the subject. I am going to focus this discussion on hardware watchdogs, totally ignoring the software watchdog that is included in Windows CE and discussed by Luca Calligaris in an article titled Beware of the watchdog!
What is a watchdog timer? A watchdog timer is a hardware timer that is set to time out after a fixed amount of time. When the watchdog times out, it causes a hardware reset. It is software’s job to “pet” the dog before it times out. Petting the dog is the act of resetting the timer so that the watchdog doesn’t time out. The goal is to have the hardware watch for code that takes too long, because code that takes too long has failed in some way. Watchdog timers should primarily be used to detect catastrophic software failures. Note that these catastrophic software failures could be the result of catastrophic hardware failures that prevent software from working correctly, or working at all.
The code to manage the watchdog should be broken into two parts: the code that touches the hardware should be in a driver or the kernel, and the code that starts the watchdog and pets it should be in separate thread(s). Note that earlier versions of Windows CE didn’t enforce that separation.
So there are two important variables in a watchdog: the timeout period and the interval between pets. Wait a minute, maybe there are more variables involved, variables that aren’t quite so obvious. The other important variables are thread quantum and thread priority. Remember that Windows CE is a preemptive multitasking operating system, and you didn’t write all of the code that is running on the device. This means that code that you don’t control is also running, and it may run at a higher priority and/or run until its quantum expires.
Let’s think about how the watchdog works by drawing some pictures. To start with, here is what happens if we start the watchdog with a 50 millisecond timeout and never pet the dog:
Nothing pets the watchdog, so the board resets. Now if we pet the dog at about 25 milliseconds:
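The two cases so far can also be sketched as a small simulation. This is only a model, not Windows CE code: the function name first_reset_ms and its parameters are made up for illustration, and it assumes the watchdog is started at time 0 and that a pet arriving exactly at the deadline is already too late.

```python
from typing import Iterable, Optional


def first_reset_ms(timeout_ms: int,
                   pet_times_ms: Iterable[int],
                   run_until_ms: int) -> Optional[int]:
    """Return the time of the first watchdog reset, or None if none occurs.

    Hypothetical model: the watchdog starts at t = 0 and every pet
    restarts the full timeout period.
    """
    deadline = timeout_ms  # first deadline after starting the watchdog
    for t in sorted(pet_times_ms):
        if t >= run_until_ms:
            break
        if t >= deadline:
            return deadline  # the pet came too late, the board already reset
        deadline = t + timeout_ms  # petting the dog pushes the deadline out
    return deadline if deadline <= run_until_ms else None


# 50 ms timeout, never petted: the board resets at 50 ms.
print(first_reset_ms(50, [], 200))                   # 50
# 50 ms timeout, petted every 25 ms: no reset within 200 ms.
print(first_reset_ms(50, range(25, 200, 25), 200))   # None
```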
Now we are on to something: we are petting the dog, so the board doesn’t reset. The default thread quantum for a Windows CE system is 100 milliseconds (which can be changed by the OEM or on a per-thread basis), so look what happens if a thread that we don’t control jumps in and runs for 30 milliseconds just before we are ready to pet the dog:
Of course this could be even worse, couldn’t it? There could be several threads that need to run when we need to pet the dog, and one or more of them could need to run to quantum. I am assuming that all of these threads are of equal or higher priority than the thread(s) that pet the dog. That is, any thread of equal or higher priority can and will prevent the thread that pets the dog from running.
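The effect of other threads holding the CPU can be sketched the same way. The model below is an assumption-laden simplification, not how the Windows CE scheduler is implemented: the name simulate_with_busy_windows is hypothetical, the petting thread is assumed to sleep for a fixed interval after each pet, and each "busy window" is a (start, duration) pair during which threads of equal or higher priority keep the petting thread from running.

```python
from typing import List, Optional, Tuple


def simulate_with_busy_windows(timeout_ms: int,
                               pet_interval_ms: int,
                               busy_windows: List[Tuple[int, int]],
                               run_until_ms: int) -> Optional[int]:
    """Return the time of the first watchdog reset, or None if none occurs."""
    deadline = timeout_ms          # watchdog started at t = 0
    t = pet_interval_ms            # first time the petting thread wants to run
    windows = sorted(busy_windows)
    while t < run_until_ms:
        # If the petting thread wakes up inside a busy window, it has to
        # wait until that window ends (possibly landing in the next one).
        for start, duration in windows:
            if start <= t < start + duration:
                t = start + duration
        if t >= deadline:
            return deadline        # the delayed pet arrived too late
        deadline = t + timeout_ms
        t += pet_interval_ms       # sleep until the next pet
    return deadline if deadline <= run_until_ms else None


# 50 ms timeout, pet every 25 ms, another thread runs for 30 ms starting
# at 49 ms: the second pet slips to 79 ms, but the board resets at 75 ms.
print(simulate_with_busy_windows(50, 25, [(49, 30)], 200))   # 75
```

In this model a multi-second timeout rides out the same kind of interference: with a 2000 ms timeout and a 1000 ms pet interval, three back-to-back 100 ms quanta starting at 990 ms delay one pet to 1290 ms without causing a reset.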
And of course there is the case where the watchdog detects a catastrophic failure:
Most of the discussions that I have had about watchdog timers involve a decision on the timeout, and the timeout value was usually selected with reckless disregard for the reality of the OS. I strongly believe that any timeout value that is measured in tens or even hundreds of milliseconds is too short to catch catastrophic failures without also resetting a system that is running normally. There may be some exceptions to this, like a minimal OS which only includes the kernel and drivers that are completely controlled and understood by the OEM.
What should we do? We could raise the thread priority of the code that pets the dog, or increase the timeout period. Raising the priority of the code that pets the dog has a nasty side effect, in that we could easily pet the dog through a catastrophic software failure. I am in favor of setting the timeout period in seconds rather than milliseconds for a multithreaded OS, but how many seconds is enough? That is the tricky question, and the answer will depend on what your system does. If you are going to use a hardware watchdog, you had better be prepared to figure it out though.
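As a starting point for that figuring, here is a rough back-of-the-envelope check. It is a sketch under stated assumptions, not a guarantee: the function names are hypothetical, and the model only counts threads at the same priority as the petting thread, each running one full quantum before the petting thread gets its turn. Threads at a higher priority can hold the petting thread off for longer than this, so the result is a lower bound to be confirmed by measuring the real system.

```python
def worst_case_pet_gap_ms(pet_interval_ms: int,
                          competing_threads: int,
                          quantum_ms: int = 100) -> int:
    """Longest gap between pets in this simplified model.

    Assumes each competing equal-priority thread runs exactly one full
    quantum (100 ms by default) before the petting thread is scheduled.
    """
    return pet_interval_ms + competing_threads * quantum_ms


def timeout_is_safe(timeout_ms: int,
                    pet_interval_ms: int,
                    competing_threads: int,
                    quantum_ms: int = 100) -> bool:
    """True if the timeout exceeds the modeled worst-case gap between pets."""
    gap = worst_case_pet_gap_ms(pet_interval_ms, competing_threads, quantum_ms)
    return timeout_ms > gap


# A single competing thread running to quantum already breaks a 50 ms timeout.
print(worst_case_pet_gap_ms(25, 1))      # 125
print(timeout_is_safe(50, 25, 1))        # False
# A 5 second timeout with a 1 second pet interval tolerates ten such threads.
print(timeout_is_safe(5000, 1000, 10))   # True
```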
I have seen a lot of creative algorithms based on working around setting a longer timeout period, most of which could be handled completely in software. They are usually centered around a monitor that watches threads that register to be watched. The monitor then pets the dog only if all of the registered threads have checked in within a set time period. Really, these could be handled by switching the logic just a little bit to cause a reset if one or more threads haven’t checked in. These algorithms have a lot of merit in a preemptive multitasking OS because they only monitor specific threads. But they still have the problem of setting a suitable timeout period.
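A minimal sketch of that check-in idea follows. The class and method names (CheckInMonitor, register, check_in, should_pet) are invented for this example and are not from any Windows CE API, and time is passed in explicitly as milliseconds so the logic is easy to follow and test. A real implementation would also need locking around the shared table, which is left out here.

```python
from typing import Dict, List


class CheckInMonitor:
    """Tracks when each registered thread last checked in (hypothetical)."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.last_seen: Dict[str, int] = {}

    def register(self, name: str, now_ms: int) -> None:
        """Start watching a thread; registration counts as a check-in."""
        self.last_seen[name] = now_ms

    def check_in(self, name: str, now_ms: int) -> None:
        """Called by a watched thread to report that it is still alive."""
        if name not in self.last_seen:
            raise KeyError(f"thread {name!r} is not registered")
        self.last_seen[name] = now_ms

    def stale_threads(self, now_ms: int) -> List[str]:
        """Names of threads that have not checked in within the window."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now_ms - seen > self.window_ms)

    def should_pet(self, now_ms: int) -> bool:
        """Pet the dog only if every registered thread has checked in."""
        return not self.stale_threads(now_ms)


monitor = CheckInMonitor(window_ms=1000)
monitor.register("ui", now_ms=0)
monitor.register("net", now_ms=0)
monitor.check_in("ui", now_ms=900)
monitor.check_in("net", now_ms=950)
print(monitor.should_pet(now_ms=1000))      # True
monitor.check_in("ui", now_ms=1900)         # "net" stops checking in
print(monitor.should_pet(now_ms=2000))      # False
print(monitor.stale_threads(now_ms=2000))   # ['net']
```

The "switched" software-only variant would call stale_threads and request a reset when the list is not empty, instead of withholding the pet. Either way, window_ms is still a timeout that has to be chosen with the scheduler in mind.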
It kind of makes me want to reminisce about the good old days of single-threaded embedded systems, when watchdogs were easy: set a timeout and instrument the code to pet the dog as needed. But then I quickly get over it when I think about how much more we can do with the systems of today.
Again, I look forward to your comments on watchdog timers.
Copyright © 2009 – Bruce Eitman
All Rights Reserved