2021-12-28: Incident
Timeline
- 18:03 CET: last mail received from the ssacli monitoring, reporting that the main RAID array was rebuilding, at 92%;
- 18:06 CET: according to Gatus, all services hosted on pve-main became unreachable;
- 18:12 CET: I realize the problem while trying to load a ZIP file from https://static.phowork.fr. After some quick debugging, it appears that pve-main doesn't answer ping requests; its iLO is still reachable, but I don't have its credentials as they are stored in Vault, hosted on... pve-main...
- 20:08 CET: arrival on-site. pve-main was waiting for a boot media, probably because the main RAID array had been disabled by the RAID controller. Reboot triggered with Ctrl+Alt+Del;
- 20:11 CET: the Smart Array controller warns me that both drives 5 and 9 are faulty:
    Slot 1 Drive Array - Replacement Drive(s) detected OR previously failed drive(s) now appear to be operational:
    Port 2I: Box 1: Bays 5,9
    Logical drive(s) disabled due to possible data loss.
    Select "F1" to continue with logical drive(s) disabled.
    Select "F2" to accept data loss and to re-enable logical drive(s)
I choose to accept the data loss (F2) and re-enable the logical drive(s);
- 20:11 CET: kernel is loaded
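The 18:03 mail came from a periodic ssacli check. A minimal sketch of that kind of check, as a shell function that extracts the logical-drive status from `ssacli ctrl slot=1 ld all show status` output; the sample line below is an assumption based on ssacli's usual output format, not a capture from pve-main:

```shell
# raid_state: extract the status of each logical drive from
# `ssacli ctrl slot=1 ld all show status` output piped on stdin.
# Anything other than "OK" (e.g. "Recovering, 92% complete") should alert.
raid_state() {
  # keep only the part after the "logicaldrive N (...):" prefix
  grep 'logicaldrive' | sed 's/.*: //'
}

# assumed sample output (hypothetical sizes/levels):
sample='   logicaldrive 1 (3.3 TB, RAID 5): Recovering, 92% complete'
printf '%s\n' "$sample" | raid_state
```

A cron job piping the real ssacli output through such a filter and mailing when the result differs from "OK" would reproduce the monitoring described above.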
Points to improve
All "points to improve" of both #6 and #9 still need work, in particular the need of a procedure to reinstall pve-main's system from scratch and high availability of some services (netbox & vault).
- REALLY do something to improve storage reliability: nearly one rebuild operation has occurred per day since the last similar issue (2021-12-15, two weeks ago);
- High availability: netbox and vault (vault in particular) would have been very useful yesterday. Having a redundant VPN would also be a good thing and isn't technically hard;
- gitlab runners needed a manual verify;
- irc bouncer doesn't start automatically;
- storj-nodes don't start automatically;
- KVM solution for pve-main: #8.
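The two autostart failures (irc bouncer, storj-nodes) usually come down to the unit not being enabled, or starting before the network is up. A sketch of a systemd unit covering both, assuming the services run under systemd; the unit name, binary path, and user are hypothetical placeholders, not the actual setup:

```ini
# /etc/systemd/system/irc-bouncer.service -- hypothetical name and paths
[Unit]
Description=IRC bouncer
# wait for actual network connectivity, not just network.target
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/bouncer
Restart=on-failure
User=bouncer

[Install]
# `systemctl enable irc-bouncer` creates the symlink that starts it at boot
WantedBy=multi-user.target
```

The same pattern (or a templated `storj-node@.service`) would cover the storj-nodes; the key steps are `systemctl enable` for each unit and the `network-online.target` ordering.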