2021-12-15: Incident

Once again, I discovered it by chance because my DNS resolver wasn't answering anymore.

Symptoms

none of the services hosted on pve-main.albaron.phowork.fr is available;
local network is working as well as outside connectivity;
external monitoring (https://oup.si) is all red (quite logical as the services it monitors are hosted on pve-main);
storage-2.albaron.phowork.fr is up and its services are running well;
pve-main's physical console is frozen with a kernel panic.

Based on the previous experience with #6 and the fact that the main physical RAID array was constantly rebuilding, I immediately thought it was a storage issue, again.

Timeline

23:25 CET: services hosted on pve-main became inaccessible;
23:45 CET: pve-main powered-off using power button (long press);
<no time>: 1st boot; the P410 Smart Array utility found errors on drives 2I:1:5 and 2I:1:9, which are both part of the main RAID1+0 array used by the system. I chose to re-enable the array; the server still refused to boot, telling me to disconnect a device.
<no time>: dust blown away from drives 5 and 9;
<no time>: 2nd boot; the P410 Smart Array utility found errors on drive 2I:1:8, which is part of the main RAID1+0 array used by the system. I turned the server off with a quick power button press.
<no time>: dust blown away from drive 8;
<no time>: 3rd boot; there was still a message telling me to disconnect a device and retry;
<no time>: 4th boot. I manually set the boot array in the P410 Smart Array utility, and the system booted directly from this utility and the selected array. Still the same message telling me to disconnect a device and retry; I disconnected the Dymo USB printer attached to the front port, pressed Enter, and GRUB appeared!
23:57 CET: pve-main's kernel is loaded!
00:05 CET: all services are back online!

Side notes

I didn't pay enough attention to the message displayed (the one that appeared 3 times, according to the timeline above). The system could (maybe) have booted on the 3rd boot if I had disconnected the printer at that point.

As of 2021-12-16, the layout of the concerned raid array is:

Mirror Group 1:
   physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 240 GB, OK)
   physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 240 GB, OK)
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
Mirror Group 2:
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 240 GB, OK)
   physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 240 GB, OK)
   physicaldrive 2I:1:9 (port 2I:box 1:bay 9, SATA SSD, 240 GB, OK)

Points to improve

All 4 "points to improve" of #6 still need work, in particular the need for a procedure to reinstall pve-main's system from scratch.

Notification: I didn't receive any SMS alert even though I had them configured on a remote device;
Backups: there wasn't any problem with backups this time, but they still need to be replicated more widely;
Notification: alert when a logical array isn't "OK";
Storage reliability: #10
High availability: some services (netbox, gitlab, vault) should be available if one of the two servers fails.
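
As a starting point for the "alert when a logical array isn't OK" item, here is a minimal Python sketch that scans a drive listing in the same text layout as the one shown in the side notes and reports every drive whose status is not "OK". It only looks at the per-drive status, as a proxy for the array state. It assumes the status is always the last comma-separated field inside the parentheses; the function name `find_unhealthy_drives` and the "Failed" status in the sample are hypothetical, and how the listing is obtained from the controller and how the alert is sent are left out.

```python
import re

# Matches lines such as:
#   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)")


def find_unhealthy_drives(report: str) -> list[tuple[str, str]]:
    """Return (drive_id, status) for every drive whose status is not "OK"."""
    unhealthy = []
    for line in report.splitlines():
        match = DRIVE_RE.search(line)
        if not match:
            continue  # header lines such as "Mirror Group 1:" are skipped
        drive_id, details = match.groups()
        # Assumption: the status is the last comma-separated field.
        status = details.split(",")[-1].strip()
        if status != "OK":
            unhealthy.append((drive_id, status))
    return unhealthy


if __name__ == "__main__":
    # Hypothetical sample: "Failed" is an illustrative status value.
    sample = """\
Mirror Group 1:
   physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 240 GB, OK)
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, Failed)
"""
    for drive_id, status in find_unhealthy_drives(sample):
        print(f"ALERT: drive {drive_id} is {status}")
```

Run periodically (for instance from cron) against the controller's report, a non-empty result could trigger the same notification channel as the other alerts.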

Edited Jan 03, 2022 by Charles Decoux