2021-12-28: Incident
Timeline
- 18:03 CET: last mail received from the ssacli monitoring, reporting that the main RAID array was rebuilding, at 92%;
- 18:06 CET: according to Gatus, all services hosted on pve-main became unreachable;
- 18:12 CET: I realize the problem while trying to load a ZIP file from https://static.phowork.fr. After some quick debugging, it appears that pve-main doesn't answer ping requests; its iLO is still reachable, but I don't have its credentials as they are stored in Vault, hosted on... pve-main...
- 20:08 CET: arrival on-site. pve-main was waiting for a boot media, probably because the main RAID array had been disabled by the RAID controller. Reboot triggered with Ctrl+Alt+Del;
- 20:11 CET: the Smart Array controller warns me that both drives 5 and 9 are faulty:
    Slot 1 Drive Array - Replacement Drive(s) detected OR previously failed drive(s) now appear to be operational:
    Port 2I: Box 1: Bays 5,9
    Logical drive(s) disabled due to possible data loss.
    Select "F1" to continue with logical drive(s) disabled.
    Select "F2" to accept data loss and to re-enable logical drive(s)
I choose to accept the data loss (F2) and re-enable the logical drive(s);
- 20:11 CET: kernel is loaded
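The 18:03 mail came from a periodic ssacli check. A minimal sketch of that kind of check, as a shell function that extracts the logical-drive status from `ssacli ctrl slot=1 ld all show status` output; the sample line below is an assumption based on ssacli's usual output format, not a capture from pve-main:

```shell
# raid_state: extract the status of each logical drive from
# `ssacli ctrl slot=1 ld all show status` output piped on stdin.
# Anything other than "OK" (e.g. "Recovering, 92% complete") should alert.
raid_state() {
  # keep only the part after the "logicaldrive N (...):" prefix
  grep 'logicaldrive' | sed 's/.*: //'
}

# assumed sample output (hypothetical sizes/levels):
sample='   logicaldrive 1 (3.3 TB, RAID 5): Recovering, 92% complete'
printf '%s\n' "$sample" | raid_state
```

A cron job piping the real ssacli output through such a filter and mailing when the result differs from "OK" would reproduce the monitoring described above.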
Points to improve
All "points to improve" of both #6 and #9 still need work, in particular the need of a procedure to reinstall pve-main's system from scratch and high availability of some services (netbox & vault).
- REALLY do something to improve storage reliability: nearly one rebuild operation has occurred per day since the last similar issue (2021-12-15, two weeks ago);
- High availability: netbox and vault (vault in particular) would have been very useful yesterday. Having a redundant VPN would also be a good thing and isn't technically hard;
- gitlab runners needed a manual verify;
- irc bouncer doesn't start automatically;
- storj-nodes don't start automatically;
- KVM solution for pve-main: #8.
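The two autostart failures (irc bouncer, storj-nodes) usually come down to the unit not being enabled, or starting before the network is up. A sketch of a systemd unit covering both, assuming the services run under systemd; the unit name, binary path, and user are hypothetical placeholders, not the actual setup:

```ini
# /etc/systemd/system/irc-bouncer.service -- hypothetical name and paths
[Unit]
Description=IRC bouncer
# wait for actual network connectivity, not just network.target
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/bouncer
Restart=on-failure
User=bouncer

[Install]
# `systemctl enable irc-bouncer` creates the symlink that starts it at boot
WantedBy=multi-user.target
```

The same pattern (or a templated `storj-node@.service`) would cover the storj-nodes; the key steps are `systemctl enable` for each unit and the `network-online.target` ordering.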