Re: status of signal
- From: Olav Vitters <olav vitters nl>
- To: Christer Edwards <christer edwards gmail com>
- Cc: gnome-infrastructure gnome org
- Subject: Re: status of signal
- Date: Mon, 16 Aug 2010 09:52:51 +0200
On Sun, Aug 15, 2010 at 09:54:44PM -0600, Christer Edwards wrote:
> Based on our downtime this past evening I took an interest in our
> current monitoring solution (if you could call it that). The details I
> found are listed below, and I think clear up some misconceptions I've
> (we've) had about this box.
>
> signal.gnome.org is, as we know, hosted at OSUDL. It is a 2cpu VM
> (QEMU Virtual CPU version 0.11.1), with 256M RAM and about 7.5G
> storage. Currently it is running nagios3 on apache 1.3 and mysql
> server (a requirement of nagios3?).
>
> The current monitoring configuration is poor and looks like it has
> been for some time. It is only monitoring a handful of services, and
> the key services are not even configured properly. As an example,
> window.gnome.org HTTP service: down 246d 16h 33m 12s. Most configured
> services are like this. It's mostly red across the board, and I'm sure
> it's simply misconfiguration.
At one time it had both an outdated Nagios and a Nagios3 installed, and
both were running. It kept sending out mails even though we had
acknowledged that a machine was permanently down (box.gnome.org).
So repeat emails have been disabled, etc. I guess it still has old IP
addresses in the configuration and we didn't notice due to the lack of
repeat emails.
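A quick way to check the old-IP guess would be to compare each configured
address against what DNS returns today. This is only a rough sketch: the
function name, the host names and the addresses are made up for
illustration, and it assumes the host/address pairs have already been
pulled out of the Nagios config by hand or by some other script.

```python
import socket
from typing import Callable, Dict, List, Tuple


def find_stale_addresses(
    configured: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[Tuple[str, str, str]]:
    """Return (host, configured_ip, current_ip) for every mismatch.

    `configured` maps a host name to the address stored in the monitoring
    config (hypothetical input, extracted separately). `resolve` can be
    swapped out so the check works without touching the network.
    """
    stale = []
    for host, configured_ip in sorted(configured.items()):
        try:
            current_ip = resolve(host)
        except OSError:
            # The name no longer resolves at all; report an empty address.
            current_ip = ""
        if current_ip != configured_ip:
            stale.append((host, configured_ip, current_ip))
    return stale
```

Anything it returns is a candidate for a stale entry; a host with several A
records would need a smarter comparison than this.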
> It'll take a little bit of work but it can be cleaned up to provide
> rudimentary monitoring without a lot of work. This is what I'd like to
> do:
>
> 1) update to apache2 (why is it even on apache 1.3??)
Old VM.
> 2) define as a group the critical services we want monitored (I'd
> suggest http for bugzilla and the wiki for starters)
> 3) configure SSL for the signal webserver. Auth is done by htpasswd.
> We all know plain text is bad.
> 4) configure the nagios3 path as the default DocumentRoot. Currently /
> shows some generic message, the wiki points to /nagios/, but the
> actual monitoring is at /nagios3/
> 5) as an extra, perhaps add a DNS cname/alias for 'nagios.gnome.org'
> which points to signal.
> 6) /etc/aliases only defines specific admins as email recipients. I
> think these should be sent team-wide.
That should be kept up to date, yes.
It should NOT send stuff out to gnome-sysadmin, as then we miss out on a
lot of downed stuff. Only directly to admins, as otherwise people might
rely on the gnome-sysadmin nagios bits.
Not sure how to keep that list up to date. It is not connected to LDAP,
partly for historical reasons and partly because that might interfere
with monitoring.
Would be nice to have an announcement bot in irc.gnome.org, #sysadmin
(+configure it to repeat the downed machines)
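The repeat part could look roughly like the sketch below. It only builds
the IRC lines; connecting to the server and writing them to the socket is
left out. The function name, the one-hour interval and the message wording
are assumptions, nothing like this exists on signal today.

```python
from typing import Dict, List


def due_announcements(
    down_since: Dict[str, float],
    last_announced: Dict[str, float],
    now: float,
    repeat_every: float = 3600.0,
) -> List[str]:
    """Return PRIVMSG lines for down hosts that are due an announcement.

    `down_since` maps a host to the time (in seconds) it went down.
    `last_announced` is updated in place so that the next call only
    repeats a host once `repeat_every` seconds have passed.
    """
    lines = []
    for host in sorted(down_since):
        last = last_announced.get(host)
        if last is None or now - last >= repeat_every:
            minutes = int((now - down_since[host]) // 60)
            lines.append(f"PRIVMSG #sysadmin :{host} is DOWN ({minutes} min)")
            last_announced[host] = now
    return lines
```

Calling it once a minute from whatever loop keeps the IRC connection alive
would announce a new outage right away and then repeat it hourly until the
host is removed from `down_since`.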
> All of this would take me maybe a couple hours tomorrow. I'm
> interested in any other feedback re: services monitored, notification
> methods (emails to specific sysadmins per-host? emails to -sysadmin?
> emails to -infrastructure?)
Nothing to gnome-sysadmin, nor gnome-infrastructure. It should only send
stuff to addresses outside @gnome.org. Announcements of downed stuff
should not rely on any infrastructure which might itself have issues.
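To keep the recipient list honest, a small check along these lines could
flag any address that would still go through our own mail setup. It is
just a sketch; the function name and the sample addresses are invented,
and the list would have to be read from /etc/aliases or the Nagios contacts
by something else.

```python
from typing import Iterable, List


def internal_recipients(
    recipients: Iterable[str], internal_domain: str = "gnome.org"
) -> List[str]:
    """Return the addresses at `internal_domain` or one of its subdomains."""
    flagged = []
    for address in recipients:
        # Everything after the last "@" is treated as the domain part.
        domain = address.rpartition("@")[2].lower()
        if domain == internal_domain or domain.endswith("." + internal_domain):
            flagged.append(address)
    return flagged
```

An empty result means every notification leaves for an external mailbox.
Note that bare local names without an "@" are not flagged here, so those
would need a separate look.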
Would be nice if it sent test emails to itself
(signal->menubar->signal), so we know when there is an email problem
(sometimes clamav / amavisd has issues). In those cases postfix works
ok, but it just doesn't send stuff anymore.
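A rough sketch of such a loop test: put a unique token in the subject,
send the mail out via menubar, and later look for the token in a local
mailbox on signal. The function names, the subject format and the use of
an mbox file are all assumptions, as is menubar forwarding the probe back.

```python
import mailbox
import uuid
from email.message import EmailMessage
from typing import Tuple


def build_probe(sender: str, recipient: str) -> Tuple[str, EmailMessage]:
    """Create a probe mail carrying a random token in its subject."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"mail-loop-probe {token}"
    msg.set_content("Automated mail loop probe.")
    return token, msg


def probe_arrived(mbox_path: str, token: str) -> bool:
    """Return True if a mail with `token` in its subject is in the mbox."""
    box = mailbox.mbox(mbox_path)
    try:
        return any(token in (m["Subject"] or "") for m in box)
    finally:
        box.close()
```

Sending could be done with smtplib.SMTP("localhost").send_message(msg).
If probe_arrived() is still False some minutes later, the mail path is
broken somewhere even though postfix itself looks fine.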
> In the meantime I'll get started on some basic maintenance, such as
> fixing the monitoring that is there.
Cool!
--
Regards,
Olav