2021-12-15: Incident
Once again, I discovered it by chance because my DNS resolver wasn't answering anymore.
Symptoms
- none of the services hosted on pve-main.albaron.phowork.fr is available;
- local network and outside connectivity are both working;
- external monitoring (https://oup.si) is all red (quite logical as the services it monitors are hosted on pve-main);
- storage-2.albaron.phowork.fr is up and its services are running well;
- pve-main's physical console is frozen with a kernel panic.
Based on the previous experience from #6 and the fact that the main physical RAID was constantly rebuilding, I immediately thought it was, again, a storage issue.
Timeline
- 23:25 CET: services hosted on pve-main became inaccessible;
- 23:45 CET: pve-main powered off using the power button (long press);
- <no time>: 1st boot; P410 Smart Array util found errors on drives 2I:1:5 and 2I:1:9, which are both part of the main RAID1+0 array used by the system. I chose to re-enable the array; the server still refused to boot, telling me to disconnect a device;
- <no time>: dust blown away from drives 5 and 9;
- <no time>: 2nd boot; P410 Smart Array util found errors on drive 2I:1:8, which is part of the main RAID1+0 array used by the system. I turned the server off with a quick power button press;
- <no time>: dust blown away from drive 8;
- <no time>: 3rd boot; there is still a message telling me to disconnect a device and retry;
- <no time>: 4th boot. Boot array manually set in the P410 Smart Array utility, and the system booted directly from this utility and the selected array. Still the same message telling me to disconnect a device and retry; I disconnected the Dymo USB printer attached to the front port, pressed Enter and GRUB appeared!
- 23:57 CET: pve-main's kernel is loaded!
- 00:05 CET: all services are back online!
Side notes
I didn't pay enough attention to the message displayed (the one that appeared 3 times, according to the timeline above). The system might have booted on the 3rd attempt if I had disconnected the printer at that point.
As of 2021-12-16, the layout of the affected RAID array is:
Mirror Group 1:
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 240 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 240 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
Mirror Group 2:
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 240 GB, OK)
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 240 GB, OK)
physicaldrive 2I:1:9 (port 2I:box 1:bay 9, SATA SSD, 240 GB, OK)
Points to improve
All 4 "points to improve" of #6 still need work, in particular the need for a procedure to reinstall pve-main's system from scratch.
- Notification: I didn't receive any SMS alert even though I had them configured on a remote device;
- Backups: there hasn't been any problem with backups here, but they still need more replication;
- Notification: alert when a logical array isn't "OK";
- Storage reliability: #10;
- High availability: some services (netbox, gitlab, vault) should remain available if one of the two servers fails.
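
As a possible starting point for the array alert listed above, here is a minimal Python sketch that scans a drive listing in the same format as the layout shown in the side notes and reports every drive whose status is not "OK". Several things here are assumptions: the function name `failed_drives` is hypothetical, the status is assumed to be the last comma-separated field inside the parentheses, and the way the listing is obtained from the controller is not covered, since this report does not say which command produced it. It only looks at physical drives; checking the logical array status itself would need a similar parse of whatever the controller tool prints for the logical drive.

```python
import re

# Matches lines such as:
#   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)")


def failed_drives(listing):
    """Return a list of (drive_id, status) for every drive that is not OK."""
    failed = []
    for line in listing.splitlines():
        match = DRIVE_RE.search(line)
        if not match:
            # Headings such as "Mirror Group 1:" carry no drive status.
            continue
        drive_id, details = match.groups()
        # Assumption: the status is the last comma-separated field.
        status = details.rsplit(",", 1)[-1].strip()
        if status != "OK":
            failed.append((drive_id, status))
    return failed
```

A cron job or monitoring check could feed the controller's listing to `failed_drives` and send a notification whenever the returned list is not empty.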