I've mostly been avoiding the computer this evening, but I did spend the last hour working on attempt #2 at distributed monitoring.
The more I plot, plan & ponder, the more appealing the notion becomes.
Too many ideas to discuss, but in brief:
My previous idea of running tests every few minutes on each node scaled badly when the number of host+service pairs to be tested grew.
This led to the realisation that as long as some node tests each host+service pair you're OK. Every node need not check each host on every run. This was something I knew, and had discussed, but I had assumed it would be a nice optimisation for later rather than something which is almost mandatory.
My previous idea of testing for failures on other nodes after seeing a local failure was similarly flawed. It introduces too many delays:
- Node 1 starts all tests - notices a failure. Records it.
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
- Node 2 starts all tests - notices a failure. Records it.
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
In short, you have a synchronisation problem which, coupled with the delay of running a large number of tests, soon grows. Given a testing period of five minutes, ten testing nodes, and 150 monitored host+service pairs, you're looking at delays of 8-15 minutes. On average. (It largely depends on where in the cycle the failing host is, and how many nodes must see a failure prior to alerting.)
So round two has each node picking tests at "random" (making sure no host+service pair goes untested for more than five minutes). At the point a failure is detected, the neighbour nodes are immediately instructed to run the same test and report their results (via XML::RPC).
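A rough sketch of that loop, in Python rather than whatever the real code uses. Everything here is an assumption made for illustration: the names `pick_test` and `confirm_failure`, the `last_tested` table of timestamps, the `quorum` parameter, and the idea that stale pairs are picked ahead of random ones. Neighbours are modelled as plain callables returning True when the test passes; in practice each would wrap an RPC call to a remote node.

```python
import random

MAX_AGE = 5 * 60  # seconds a host+service pair may go untested (assumed)


def pick_test(last_tested, now, max_age=MAX_AGE, rng=random):
    """Pick the next pair to test.

    last_tested maps (host, service) -> timestamp of the last test.
    Pairs that have gone stale are chosen first; otherwise any pair
    is picked at random.
    """
    overdue = [pair for pair, when in last_tested.items()
               if now - when >= max_age]
    candidates = overdue or list(last_tested)
    return rng.choice(candidates)


def confirm_failure(pair, neighbours, quorum):
    """Ask every neighbour to test the pair right now.

    Returns True when at least `quorum` nodes, counting the local
    one that spotted the problem, see a failure.
    """
    failures = 1  # the local node has already seen it fail
    for check in neighbours:
        if not check(pair):
            failures += 1
    return failures >= quorum
```

Because the neighbours test on demand instead of reporting their last cached result, the "remote server only just crashed" window from the first design goes away: the confirmation is as fresh as the original failure.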
The new code is simpler, more reliable, and scales better. Plus it doesn't need Apache/CGI.
Anyway bored now. Hellish day. MySQL blows goats.
ObFilm: Hellboy