2021-11-04: Incident
During a planned maintenance whose goal was basically to set up storage-2.albaron.phowork.fr (power + network) in order to quantify its power draw, I noticed that the network's DNS server (192.168.114.254, aka rb4011.core.albaron.phowork.fr) wasn't responding to DNS queries. As it was still reachable over other protocols, I suspected an issue with its upstream DNS server, which is hosted on pve-main.albaron.phowork.fr. I checked Phowork's services status on https://oup.si/ and noticed that many of them had been down for one to two minutes. This was at 22:50 CET on 2021-11-04.
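To tell "resolver down" apart from "host down" quickly, it helps to fire a single DNS query at the suspect server while checking it over another protocol. Below is a minimal sketch in Python of the DNS half; the resolver IP and queried name are simply the ones from this incident, and the rest is illustrative, not how I actually tested it:

```python
import socket
import struct

def answers_dns(server, name="phowork.fr", timeout=2.0):
    """Send one minimal DNS A query over UDP and report whether any answer comes back."""
    # Bare-bones DNS header: fixed ID, RD flag set, exactly one question.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(header + question, (server, 53))
        sock.recvfrom(512)  # any reply at all counts as "the resolver is alive"
        return True
    except socket.timeout:
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    # 192.168.114.254 is the router's resolver mentioned above.
    print("DNS answering:", answers_dns("192.168.114.254"))
```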
As I was already in the server room (aka my basement) and using my KVM monitor, I switched input to this server and saw that it wasn't responding to any keyboard input, nor to ICMP ping requests. I decided to hard-reboot it and got the following message at 23:54 CET:
These two drives are in the same RAID 1+0 array and I started panicking, as I first thought the whole array was lost. This would not have been a huge loss, as every VM has backups on another host, but in that case I would have had to reinstall and reconfigure the hypervisor before being able to fetch the backups and try to restore them.
Fortunately, its configuration has some redundancy to spare: the six SSDs form a RAID 1+0 split into two three-drive mirror groups, which can survive two drive failures as long as the failed drives are not a mirrored pair. Here it is:
Array C
   Logical Drive: 3
      Size: 670.62 GB
      Fault Tolerance: 1+0
      Heads: 255
      Sectors Per Track: 32
      Cylinders: 65535
      Strip Size: 256 KB
      Full Stripe Size: 768 KB
      Status: OK
      Unrecoverable Media Errors: None
      Caching: Enabled
      Unique Identifier: <redacted>
      Disk Name: /dev/sda
      Mount Points: None
      Boot Volume: Primary
      Logical Drive Label: <redacted>
      Mirror Group 1:
         physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
      Mirror Group 2:
         physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:9 (port 2I:box 1:bay 9, SATA SSD, 240 GB, OK)
      Drive Type: Data
      LD Acceleration Method: Controller Cache
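Since the controller only complained at boot time, a cheap preventive measure would be to poll the physical drive statuses periodically and alert on anything that is not OK. A rough sketch, assuming HPE's ssacli is installed on the host and the controller sits in slot 0 (both assumptions; the exact output format may also vary with firmware):

```python
import re
import subprocess

def failing_drives(slot=0):
    """Return (drive, status) pairs for every physical drive not reporting OK."""
    out = subprocess.run(
        ["ssacli", "ctrl", f"slot={slot}", "pd", "all", "show", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Expected line shape: "physicaldrive 2I:1:2 (port 2I:box 1:bay 2, 240 GB): OK"
    bad = []
    for line in out.splitlines():
        m = re.search(r"physicaldrive\s+(\S+).*:\s*(\S+)\s*$", line)
        if m and m.group(2) != "OK":
            bad.append((m.group(1), m.group(2)))
    return bad

if __name__ == "__main__":
    for drive, status in failing_drives():
        print(f"ALERT: physicaldrive {drive} is {status}")
```

Run from cron every few minutes, something like this might surface a flapping drive before a reboot turns it into a refusal to boot.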
After another reboot, I entered the beautiful HP GUI "Option ROM Configuration for Arrays", which let me see that... none of the 6 drives of this array seemed to have problems; at least, none was reported at that time. However, the system was still refusing to boot on this array.
I tried unplugging storage-2's power supply, then booting on an Arch installation USB key (which didn't work, I don't know why yet: a black screen appeared and stayed there after choosing the "installation" option in the GRUB menu), and finally decided to unplug the falsely-failing SSDs, blow into their [SATA] connectors and plug them back in. I did this without particular hope but... yeah, it worked! Or at least, the system booted "as usual" after this -and only this- operation. Problem solved (at least for now).
At 00:29 CET on 2021-11-05, most of my services were back up!
Conclusion
I was very lucky to spot this problem barely 2 minutes before the end of the storage-2 setup work. Without that, the outage would have lasted all night.
Side notes
Observium stopped storing data in its database at 11:16 CET on 2021-11-04. The issue seems to come from the MariaDB server and doesn't appear related at all to the problem described here, but it will need deeper digging.
Points to improve
- Procedure to reinstall / bootstrap pve-main in case of whole array failure?
- External notifications from gatus (an illustrative stand-in is sketched below)
- Move as many VM vdrives as possible to the NVMe array
- [minor] Secondary DNS resolver for wifi (to be discussed)
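On the gatus point: gatus ships its own alerting providers, so the real fix belongs in its configuration. As an illustrative stand-in only, here is the general shape of an external "down" notification in Python; the checked URL and the ntfy topic below are placeholders, not my actual setup:

```python
import urllib.request

CHECK_URL = "https://oup.si/"                     # status page from this post
NTFY_TOPIC = "https://ntfy.sh/phowork-demo-topic"  # hypothetical placeholder topic

def is_up(url, timeout=5):
    """Up if a GET succeeds (2xx/3xx); HTTP errors and network failures count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        return False

def notify(message):
    """Push a plain-text notification (ntfy takes the message as the POST body)."""
    req = urllib.request.Request(NTFY_TOPIC, data=message.encode(), method="POST")
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    if not is_up(CHECK_URL):
        notify(f"{CHECK_URL} looks down")
```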