Re: status of signal

From: Olav Vitters <olav vitters nl>
To: Christer Edwards <christer edwards gmail com>
Cc: gnome-infrastructure gnome org
Subject: Re: status of signal
Date: Mon, 16 Aug 2010 09:52:51 +0200

On Sun, Aug 15, 2010 at 09:54:44PM -0600, Christer Edwards wrote:
> Based on our downtime this past evening I took an interest in our
> current monitoring solution (if you could call it that). The details I
> found are listed below, and I think clear up some misconceptions I've
> (we've) had about this box.
> 
> signal.gnome.org is, as we know, hosted at OSUDL. It is a 2cpu VM
> (QEMU Virtual CPU version 0.11.1), with 256M RAM and about 7.5G
> storage. Currently it is running nagios3 on apache 1.3 and mysql
> server (a requirement of nagios3?).
> 
> The current monitoring configuration is poor and looks like it has
> been for some time. It is only monitoring a handful of services, the
> key services not even configured properly. As an example,
> window.gnome.org HTTP service: down 246d 16h 33m 12s. Most configured
> services are like this. It's mostly red across the board, and I'm sure
> it's simply misconfiguration.

At one time it had both an outdated Nagios and a Nagios3 installed, and
both were running. It kept sending out mails even though we had
acknowledged that a machine was permanently down (box.gnome.org).

So repeat emails have been disabled, etc. I guess it still has old IP
addresses in the configuration and we didn't notice due to lack of
repeat emails.
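
A minimal sketch of how stale addresses could be spotted, assuming the
monitored hosts and their configured IPs have already been extracted from
the Nagios configuration into a dict. The function name, the dict layout
and the injectable `resolve` parameter are illustrative assumptions, not
part of the current setup on signal.

```python
import socket


def find_stale_addresses(configured, resolve=socket.gethostbyname):
    """Compare configured IPs against what DNS currently returns.

    `configured` maps hostname -> IP address as written in the monitoring
    config (hypothetical layout). Returns a dict mapping each mismatching
    hostname to (configured_ip, current_ip); current_ip is None when the
    name no longer resolves.
    """
    stale = {}
    for host, configured_ip in configured.items():
        try:
            current_ip = resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError: name did not resolve
            current_ip = None
        if current_ip != configured_ip:
            stale[host] = (configured_ip, current_ip)
    return stale
```

Anything this returns would still need a manual look, since a host with
several A records can legitimately resolve to a different address.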

> It'll take a little bit of work but it can be cleaned up to provide
> rudimentary monitoring without a lot of work. This is what I'd like to
> do:
> 
> 1) update to apache2 (why is it even on apache 1.3??)

Old VM.

> 2) define as a group the critical services we want monitored (I'd
> suggest http for bugzilla and the wiki for starters)
> 3) configure SSL for the signal webserver. Auth is done by htpasswd.
> We all know plain text is bad.
> 4) configure the nagios3 path as the default DocumentRoot. Currently /
> shows some generic message, the wiki points to /nagios/, but the
> actual monitoring is at /nagios3/
> 5) as an extra, perhaps add a DNS cname/alias for 'nagios.gnome.org'
> which points to signal.
> 6) /etc/aliases only defines specific admins as email recipients. I
> think these should be sent team-wide.

That should be kept up to date, yes.

It should NOT send anything to gnome-sysadmin, as then we would miss a
lot of outages. It should go only directly to admins, as otherwise
people might rely on the gnome-sysadmin nagios bits.

Not sure how to keep that list up to date. It is not connected to LDAP
for historical reasons, and because LDAP might interfere with monitoring.

Would be nice to have an announcement bot in irc.gnome.org, #sysadmin
(+configure it to repeat the downed machines)

> All of this would take me maybe a couple hours tomorrow. I'm
> interested in any other feedback re: services monitored, notification
> methods (emails to specific sysadmins per-host? emails to -sysadmin?
> emails to -infrastructure?)

Nothing to gnome-sysadmin, nor gnome-infrastructure. It should only send
to addresses outside @gnome.org, because announcements of downed
machines should not rely on any infrastructure which might itself have
issues.
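
The "only outside @gnome.org" rule could be enforced with a small filter
over the recipient list. This is only a sketch: the function name is made
up, and treating subdomains of gnome.org as internal is an assumption.

```python
def external_recipients(addresses, internal_domain="gnome.org"):
    """Return only the addresses whose domain is outside internal_domain.

    Subdomains (e.g. lists.gnome.org) are treated as internal too, which
    is an assumption about what "outside @gnome.org" should mean.
    """
    keep = []
    for addr in addresses:
        # Everything after the last "@" is the domain part
        domain = addr.rsplit("@", 1)[-1].lower()
        if domain == internal_domain or domain.endswith("." + internal_domain):
            continue
        keep.append(addr)
    return keep
```

Running the admin list through this before writing it to /etc/aliases
would flag any address that depends on GNOME mail infrastructure.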

Would be nice if it sent test emails to itself
(signal->menubar->signal), so we know when there is an email problem
(sometimes clamav / amavisd have issues). In those cases postfix itself
works ok, it just doesn't send mail anymore.
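
A rough sketch of such a loop test: send a mail carrying a unique token,
then poll the local mailbox until it shows up or a timeout passes. How the
mail is actually sent and read is left to two callables (`send` and
`fetch_subjects`), which are hypothetical hooks. In practice they might
wrap smtplib and a local mbox reader, but that is an assumption.

```python
import time
import uuid


def check_mail_loop(send, fetch_subjects, timeout=300.0, poll_interval=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Return True if a uniquely tagged test mail comes back in time.

    send(token) must send a mail whose subject contains `token`;
    fetch_subjects() must return the subjects currently in the mailbox.
    Both are supplied by the caller (hypothetical hooks).
    """
    token = "mail-loop-test-" + uuid.uuid4().hex
    send(token)
    deadline = clock() + timeout
    while True:
        if any(token in subject for subject in fetch_subjects()):
            return True
        if clock() >= deadline:
            # Mail never arrived: something between signal and menubar is stuck
            return False
        sleep(poll_interval)
```

A False result would be a reason to raise an alert over a channel that
does not itself depend on mail.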

> In the meantime I'll get started on some basic maintenance, such as
> fixing the monitoring that is there.

Cool!

-- 
Regards,
Olav

