2021-11-04: Incident
During a planned maintenance whose goal was basically to set up storage-2.albaron.phowork.fr (power + network) in order to quantify its power draw, I noticed that the network's DNS server (192.168.114.254, aka rb4011.core.albaron.phowork.fr) wasn't responding to DNS queries. As it was still reachable over other protocols, I suspected an issue with its upstream DNS server, which is hosted on pve-main.albaron.phowork.fr. I checked Phowork's services status on https://oup.si/ and noticed that many of them had been down for one to two minutes. This was at 22:50 CET on 2021-11-04.
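To tell "resolver down" apart from "host down" quickly, it helps to fire a single DNS query at the suspect server while checking it over another protocol. Below is a minimal sketch in Python of the DNS half; the resolver IP and queried name are simply the ones from this incident, and the rest is illustrative, not how I actually tested it:

```python
import socket
import struct

def answers_dns(server, name="phowork.fr", timeout=2.0):
    """Send one minimal DNS A query over UDP and report whether any answer comes back."""
    # Bare-bones DNS header: fixed ID, RD flag set, exactly one question.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(header + question, (server, 53))
        sock.recvfrom(512)  # any reply at all counts as "the resolver is alive"
        return True
    except socket.timeout:
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    # 192.168.114.254 is the router's resolver mentioned above.
    print("DNS answering:", answers_dns("192.168.114.254"))
```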
As I was already in the server room (aka my basement) and using my KVM monitor, I switched input to this server and saw that it wasn't responding to any keyboard input, nor to ICMP ping requests. I decided to hard-reboot it and got the following message at 23:54 CET:
These two drives are in the same RAID 1+0 array and I started panicking, as I first thought the whole array was lost. This would not have been a huge loss, as every VM has backups on another host, but in that case I would have had to reinstall and reconfigure the hypervisor before being able to fetch the backups and try to restore them.
Fortunately, its configuration has some redundancy to spare: the six SSDs form a RAID 1+0 split into two three-drive mirror groups, which can survive two drive failures as long as the failed drives are not a mirrored pair. Here it is:
Array C
   Logical Drive: 3
      Size: 670.62 GB
      Fault Tolerance: 1+0
      Heads: 255
      Sectors Per Track: 32
      Cylinders: 65535
      Strip Size: 256 KB
      Full Stripe Size: 768 KB
      Status: OK
      Unrecoverable Media Errors: None
      Caching: Enabled
      Unique Identifier: <redacted>
      Disk Name: /dev/sda
      Mount Points: None
      Boot Volume: Primary
      Logical Drive Label: <redacted>
      Mirror Group 1:
         physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 240 GB, OK)
      Mirror Group 2:
         physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 240 GB, OK)
         physicaldrive 2I:1:9 (port 2I:box 1:bay 9, SATA SSD, 240 GB, OK)
      Drive Type: Data
      LD Acceleration Method: Controller Cache
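Since the controller only complained at boot time, a cheap preventive measure would be to poll the physical drive statuses periodically and alert on anything that is not OK. A rough sketch, assuming HPE's ssacli is installed on the host and the controller sits in slot 0 (both assumptions; the exact output format may also vary with firmware):

```python
import re
import subprocess

def failing_drives(slot=0):
    """Return (drive, status) pairs for every physical drive not reporting OK."""
    out = subprocess.run(
        ["ssacli", "ctrl", f"slot={slot}", "pd", "all", "show", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Expected line shape: "physicaldrive 2I:1:2 (port 2I:box 1:bay 2, 240 GB): OK"
    bad = []
    for line in out.splitlines():
        m = re.search(r"physicaldrive\s+(\S+).*:\s*(\S+)\s*$", line)
        if m and m.group(2) != "OK":
            bad.append((m.group(1), m.group(2)))
    return bad

if __name__ == "__main__":
    for drive, status in failing_drives():
        print(f"ALERT: physicaldrive {drive} is {status}")
```

Run from cron every few minutes, something like this might surface a flapping drive before a reboot turns it into a refusal to boot.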
After another reboot, I entered the beautiful HP GUI "Option ROM Configuration for Arrays", which let me see that... none of the 6 drives of this array seemed to have problems; at least, none was reported at that time. However, the system was still refusing to boot on this array.
I tried unplugging storage-2's power supply, then booting on an Arch installation USB key (which didn't work, I don't know why yet: a black screen appeared and stayed there after choosing the "installation" option in the GRUB menu), and finally decided to unplug the falsely-failing SSDs, blow into their [SATA] connectors and plug them back in. I did this without particular hope but... yeah, it worked! Or at least, the system booted "as usual" after this -and only this- operation. Problem solved (at least for now).
At 00:29 CET on 2021-11-05, most of my services were back up!
Conclusion
I was very lucky to spot this problem barely 2 minutes before the end of the storage-2 setup work. Without that, the outage would have lasted all night.
Side notes
Observium stopped storing data in its database at 11:16 CET on 2021-11-04. The issue seems to come from the MariaDB server and doesn't appear related at all to the problem described here, but it will need deeper digging.
Points to improve
- Procedure to reinstall / bootstrap pve-main in case of whole array failure?
- External notifications from gatus (an illustrative stand-in is sketched below)
- Move as many VM vdrives as possible to the NVMe array
- [minor] Secondary DNS resolver for wifi (to be discussed)
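On the gatus point: gatus ships its own alerting providers, so the real fix belongs in its configuration. As an illustrative stand-in only, here is the general shape of an external "down" notification in Python; the checked URL and the ntfy topic below are placeholders, not my actual setup:

```python
import urllib.request

CHECK_URL = "https://oup.si/"                     # status page from this post
NTFY_TOPIC = "https://ntfy.sh/phowork-demo-topic"  # hypothetical placeholder topic

def is_up(url, timeout=5):
    """Up if a GET succeeds (2xx/3xx); HTTP errors and network failures count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        return False

def notify(message):
    """Push a plain-text notification (ntfy takes the message as the POST body)."""
    req = urllib.request.Request(NTFY_TOPIC, data=message.encode(), method="POST")
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    if not is_up(CHECK_URL):
        notify(f"{CHECK_URL} looks down")
```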