Dell PERC RAID Multiple Drive Failure
This article helps you get a Dell PowerEdge server equipped with a PERC RAID controller back online after a multiple drive failure. A multiple drive failure is indicated by flashing amber lights on more than one physical hard drive in your RAID array.
Background: Some Dell servers equipped with PERC RAID controllers can experience multiple drive failures. These RAID arrays can usually tolerate a single drive failure without any impact on the server's availability in your network environment. However, if more than one drive fails, the server will not be available on the network. In this case you will usually see amber flashing status LEDs on the failed hard drives.
In most cases multiple hard drives do not fail at the same time, so it is very likely that one of the drives is still good but went offline when another drive failed. In some cases this can be due to timing issues between the PERC RAID controller and the drives' firmware. As of this writing, Dell has released a firmware update to bring hard drives in many PowerEdge systems up to level JT00. If you have drives in your server with a lower firmware revision level, you should consider this an urgent update.
Getting Back Online
These steps require that you enter the PERC RAID controller's BIOS and make some changes. We recommend that you leave this to experienced server administrators, as a mistake can cause permanent drive damage or data loss. As always, you should be backing up your server regularly; by the time you have to perform these steps, it is too late to start.
Restart The Server
Power down the server, or perform a soft restart if possible.
Enter The PERC RAID Controller Setup
During startup you will see the <CTL-M> message on screen. Pressing this key combination gets you into the PERC controller's setup mode.
Force Drive Online
Choose a failed drive to force online. This is a coin toss, as you cannot be sure which drive (if not both) has actually failed. Select objects and then physical drives from the menus.
IMPORTANT: Make a note of which physical drives are marked bad before making any changes. Drives are numbered starting with 0, so your first drive is drive 0, not drive 1. Example: If you have a 5-drive array, the drives will be numbered 0-4.
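To make the zero-based numbering concrete, here is a minimal Python sketch. The function name `drive_ids` is invented for this illustration and is not part of any Dell or PERC tool; it only shows which drive numbers to expect for an array of a given size.

```python
def drive_ids(drive_count):
    """Return the zero-based drive numbers for an array of drive_count drives."""
    if drive_count < 1:
        raise ValueError("an array needs at least one drive")
    return list(range(drive_count))


# A 5-drive array is numbered 0 through 4, never 1 through 5.
print(drive_ids(5))  # prints [0, 1, 2, 3, 4]
```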
Select one of the failed drives and select force online.
ESC out of all menus and quit the RAID controller setup by selecting yes. Soft reboot the server with CTL-ALT-DEL.
Allow the server to boot completely. If you see the operating system's startup logo, you are probably OK. If you see a message indicating corrupt data that prevents the server from starting, go back into the RAID controller's BIOS setup and force the drive you just brought online back offline. Then repeat the above procedure, forcing the other failed drive online and soft restarting your server.
Normally, by following this procedure you will be able to get your server back online with only one drive indicating a failure. You can then replace the failed drive with a drive of equal or higher capacity, but not lower capacity. Dell, or your server manufacturer, can ship you the replacement drive via overnight delivery. You do not want to operate your server with a failed drive for a prolonged period, as it will not tolerate a second drive failure.
There have also been cases where these steps did not solve the problem because of a defective backplane or controller card. Replacing these components usually resolved the issue without data loss; however, if the rebuild process does not complete successfully, you can experience file corruption. In these instances you will have to perform a complete system restore or, worse, a complete re-installation of your server.
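The trial-and-error procedure above can be summarized as a small Python sketch. This is only a model of the decision logic, not something you can run against a controller: `force_online`, `force_offline`, and `boots_cleanly` are hypothetical stand-ins for the manual steps you perform in the controller setup and at the console, and the simulated run below uses made-up drive numbers.

```python
def find_recoverable_drive(failed_drives, force_online, force_offline, boots_cleanly):
    """Try each drive marked failed; return the one that lets the server boot, else None."""
    for drive in failed_drives:
        force_online(drive)      # manual step: force the drive online in controller setup
        if boots_cleanly():      # manual step: soft reboot and watch for the OS startup logo
            return drive
        force_offline(drive)     # corrupt data reported: force this drive back offline
    return None


# Simulated run (assumed scenario): drives 1 and 3 are marked failed,
# but only drive 3 is actually healthy.
online = set()
actually_good = {3}
result = find_recoverable_drive(
    failed_drives=[1, 3],
    force_online=online.add,
    force_offline=online.discard,
    boots_cleanly=lambda: bool(online) and online <= actually_good,
)
print(result)  # prints 3
```

A result of `None` corresponds to the case where neither drive brings the server up, which points toward other hardware or a restore from backup.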
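The capacity rule for the replacement drive can be written as a one-line check. This is a minimal Python sketch; the function name and the capacities in the example are made up for illustration.

```python
def is_valid_replacement(failed_capacity_gb, replacement_capacity_gb):
    """A replacement must be equal to or larger than the failed drive, never smaller."""
    return replacement_capacity_gb >= failed_capacity_gb


# Example capacities (hypothetical): replacing a failed 146 GB drive.
print(is_valid_replacement(146, 146))  # prints True
print(is_valid_replacement(146, 300))  # prints True
print(is_valid_replacement(146, 73))   # prints False
```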
When you purchase a new server from any manufacturer, you should opt for at least a three-year full warranty. Dell, for example, has been able to deliver replacement components to us the same day because the client had purchased the full warranty when the server was ordered. In fact, on one occasion we had the parts delivered in two hours because Dell works with UPS logistics and they have caches of replacement parts. In this instance the customer happened to be in close proximity to the UPS resource location.
We see most servers as having a three-year life expectancy because of advances in technology and the fact that many small and mid-sized businesses do not operate their servers in clean, climate-controlled environments. Additionally, most companies seem to be able to amortize their server investments sooner than they could 10 years ago. If you intend to amortize your server assets over five years or more, you should definitely opt for the maximum warranty when you purchase.
I hope this article helped you get your server back online and get the angry users off your back! If you have any questions about this article, contact me.