Unexpected outage: 20:00 CET 26-05-2014 to 03:00 CET 27-05-2014



Hello everyone,

As you might have noticed, we had a major issue in the GNOME infrastructure last night, which made almost 
every service we provide unavailable.
The cause was that our main file server stopped serving the file systems required for home directories and 
mailing lists.

The cause of the outage is currently unclear, as the logs do not show anything relevant.
We've sent them to the Gluster engineers and asked for help analyzing them.

When we rebooted the server, something went wrong, and the affected machine needed a power cycle.
When we tried this, we hit a bug in the management cards that prevented us from using them to reboot the 
server.

Because of this, we requested hands-on service to get the server power cycled, which left us waiting for 
some time.
Within minutes of the server being rebooted, the file systems came back online, and with them all of the GNOME 
services.

To prevent all services from going down when the primary file server goes down, we had previously set up a 
synchronized secondary file server.
We were unable to make all servers fall back to this one because we could not log in to the affected servers 
to update the target IP.

To prevent this problem from pulling down the entire GNOME infrastructure in the future, we have taken some 
steps:
    - We have added a way for us to log in to any server even if the home directories are down.
    - We'll be introducing automatic failover to the other available file server.
    - We'll be spreading our documentation off-site, to prevent the relevant documentation from disappearing 
      when the machine hosting it is experiencing problems.
    - We will be making sure we have access to the power management of our servers, so we can reboot them 
      even if the management cards are not functioning.
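
As a rough illustration of the automatic failover idea in the list above (a sketch only, not the actual 
GNOME setup): a client checks the file servers in order of preference and uses the first one that answers. 
The function names are hypothetical, and port 24007 (the default glusterd management port) is an 
assumption; substitute whatever your file server listens on.

```python
import socket
from typing import Callable, Optional, Sequence


def is_reachable(host: str, port: int = 24007, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default port is an assumption for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_file_server(
    servers: Sequence[str],
    check: Callable[[str], bool] = is_reachable,
) -> Optional[str]:
    """Return the first server in preference order that passes the check.

    Returns None when no server is reachable, so the caller can alert
    instead of silently mounting nothing.
    """
    for host in servers:
        if check(host):
            return host
    return None
```

Listing the primary first and the synchronized secondary second means clients only move to the secondary 
when the primary stops answering, without anyone having to log in and change the target IP by hand.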

We really hope that this will prevent such drastic failures in the future, and make it easier to recover if 
problems do occur.

If you have any additional questions, don't hesitate to contact either of us on IRC (#sysadmin) or by sending 
us an email.

With kind regards,
Patrick Uiterwijk and Andrea Veri
System Administrators, GNOME

