Gentlent's practical approach to dealing with a disk failure in our server setup. This post covers the steps we took to identify, resolve, and upgrade our system to prevent future issues, all without affecting our customers.
During a routine server check a few days ago, we noticed an unsettling pattern: one of our disks had been ejected from the RAID array for the second time in a month. It became clear that the disk was failing, leaving the server's small RAID array in a degraded state.
The potential for data loss or downtime in such situations is a concern for any IT team. However, we've always prioritized data integrity and system reliability. Thanks to our regular, secure backup protocols and real-time replication for core databases, we were prepared. This approach ensured that even with the server at risk, our operations could continue without interruption, and more importantly, without jeopardizing any customer data.
Upon recognizing the issue, we didn't waste any time. We quickly procured additional SSDs and set about upgrading the RAID arrays on our machines. The upgrade process was smooth for the second server, which we upgraded just in case, but we hit a snag with the first one: its boot partition was on the failing disk.
Addressing this issue required a hands-on approach. We went on site, replaced the problematic disk, and reconfigured the RAID array. This process took a few hours, but by the end of it, the server was back up and running as if nothing had happened.
When we identified the failing disk, our immediate focus was to ensure the integrity of our RAID array and to restore full functionality. Here's a brief overview of the technical steps we took:
First, we used mdadm to examine the status of our RAID arrays:
sudo mdadm --detail /dev/md0
This command helped us confirm which disk was failing. When we tried to re-add it to the software RAID array, we saw write speeds drop significantly in real time.
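A degraded array also shows up directly in /proc/mdstat as a gap in the member-status brackets. As a rough sketch (assuming a two-member array; the device names and sample output below are illustrative, not taken from our logs), a quick check might look like:

```shell
# Sketch: detect a degraded md array from the member-status brackets in
# /proc/mdstat. For a two-member array, "[UU]" is healthy; "[U_]" or "[_U]"
# means one member is missing.
is_degraded() {
  grep -E -q '\[(U_|_U)\]'
}

# Sample mdstat excerpt for illustration; on a live system:
#   is_degraded < /proc/mdstat
sample='md0 : active raid1 sda2[0]
      976630464 blocks super 1.2 [2/1] [U_]'
if printf '%s\n' "$sample" | is_degraded; then
  echo "md0 is degraded"
fi
```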
Our first hurdle was gaining access to the server's file system without booting from the compromised disk. We achieved this by booting from a Live Ubuntu Server ISO, which is pretty straightforward.
We then used mount and chroot to access the server's filesystem, which allowed us to make changes to the server's configuration and RAID array. With the server's root filesystem mounted at /mnt, those commands might look like this:
for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
sudo chroot /mnt
With access to a shell, we proceeded to prepare the new disk for integration into the RAID array. On the new disk (/dev/sdX), we created a new partition table and partitions mirroring those of the existing RAID disk(s):
sudo fdisk /dev/sdX
With the disk partitioned, the next step was to integrate it into the RAID array. We used mdadm to add the new partition to the existing array, then monitored the rebuild:
sudo mdadm --manage /dev/md0 --add /dev/sdX1
cat /proc/mdstat
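While the array rebuilds, /proc/mdstat reports a recovery line with progress, an ETA, and the rebuild speed. As a small sketch for pulling out the progress figure (the sample line below is illustrative, not our actual output):

```shell
# Sketch: extract the rebuild progress percentage from an mdstat recovery line.
recovery_pct() {
  sed -n 's/.*recovery = *\([0-9.]*%\).*/\1/p'
}

# Sample line for illustration; on a live system:
#   recovery_pct < /proc/mdstat
sample='[=>...................]  recovery =  7.5% (73280512/976630464) finish=81.1min speed=185511K/sec'
printf '%s\n' "$sample" | recovery_pct
```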
The absence of the boot partition on the surviving disk was a critical issue we needed to resolve. We used fdisk to create a new EFI system partition on the surviving disk, then formatted it as FAT32:
sudo fdisk /dev/sdY
sudo mkfs.vfat -F 32 /dev/sdY1
We then mounted the new EFI partition at /mnt/efi and installed GRUB to it:
sudo mount /dev/sdY1 /mnt/efi
sudo grub-install --target=x86_64-efi --efi-directory=/mnt/efi --bootloader-id=Ubuntu
The final step was to ensure the system could automatically mount the new EFI partition at boot via fstab. We used blkid to get the UUID of the new partition:
blkid /dev/sdY1
We then added a new line to /etc/fstab for the EFI partition, using the UUID obtained from blkid:
UUID=<new-efi-partition-uuid> /boot/efi vfat umask=0077 0 1
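The fstab line can be assembled from the UUID that blkid reports. A minimal sketch (the UUID shown is a placeholder, not a real value; on a live system it would come from `sudo blkid -s UUID -o value /dev/sdY1`):

```shell
# Sketch: build the /etc/fstab entry for the new EFI partition from its UUID.
fstab_entry() {
  printf 'UUID=%s /boot/efi vfat umask=0077 0 1\n' "$1"
}

# Placeholder UUID for illustration only.
fstab_entry "ABCD-1234"
```

On a real system the output would then be appended to /etc/fstab (e.g. via sudo tee -a) rather than just printed.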
After completing these steps, we rebooted the server to verify that the recovery was successful. The system booted normally, and all RAID arrays were functioning as expected.
Throughout this ordeal, our main concern was to maintain service continuity for our customers. Thanks to our preemptive measures and quick response, we managed to do just that. No customer data was put at risk, and our services remained online and fully operational.
In facing this challenge, we were reminded of the importance of regular system checks, reliable backup strategies, and the ability to respond swiftly to unforeseen issues. It's these practices that help us keep our promise of reliable service to our customers.
Tom Klein
Founder & CEO
Gentlent UG (haftungsbeschränkt)