Troubleshooting DELL SAS 5/iR RAID controller

Friday, November 28th, 2008

My first server was a Dell SC1435. Well, experience is always built on mistakes 🙂 I realised that when I bought HP 360 G4p servers from my hosting company. Now I know why HP costs more and why the price is justified.

Well, I still have to live with a couple of SC1435s, as they are still doing their job, though not so well. Each Dell server had RAID problems in the 13th month after the shipment date (maybe it is hard-coded in their chips?). I was a cunning fox the second time, so I bought an additional 3-year warranty for the next server. It has already paid off, as one of the WD hard drives collapsed last Friday (for some reason these types of things only happen on Friday arvos).

Well, the first sign of the problem was the mpt-status utility reporting the RAID status as DEGRADED and the HDD1 status as FAILED, OUT_OF_SYNC. After a couple of hours HDD1 changed its status to MISSING.
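For anyone who wants to pick such a report apart in a script, here is a minimal Python sketch. The sample text is an assumption about what mpt-status prints (one line per volume and per physical disk, each ending in `state …, flags …` fields); the disk model string is made up and the exact layout may differ between versions, so treat the regular expression as a starting point only.

```python
import re

# Assumed mpt-status output; not copied from the affected server,
# and the disk model/firmware strings are placeholders.
SAMPLE = """\
ioc0 vol_id 0 type IM, 2 phy, 231 GB, state DEGRADED, flags ENABLED
ioc0 phy 0 scsi_id 32 ATA EXAMPLE-DISK 0001, 232 GB, state ONLINE, flags NONE
ioc0 phy 1 scsi_id 1 ATA EXAMPLE-DISK 0001, 232 GB, state FAILED, flags OUT_OF_SYNC
"""

# One line per volume ("vol_id") or physical disk ("phy"); this layout is an assumption.
LINE_RE = re.compile(
    r"^(?P<ioc>ioc\d+)\s+(?P<kind>vol_id|phy)\s+(?P<index>\d+)\b.*?"
    r"state\s+(?P<state>\w+),\s+flags\s+(?P<flags>.+)$"
)


def parse_status(text):
    """Return one dict per volume/disk line that matches the assumed layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            record = match.groupdict()
            record["index"] = int(record["index"])
            record["flags"] = record["flags"].split()
            records.append(record)
    return records


def unhealthy(records):
    """Keep only the entries whose state is not the healthy one for their kind."""
    healthy_state = {"vol_id": "OPTIMAL", "phy": "ONLINE"}
    return [r for r in records if r["state"] != healthy_state[r["kind"]]]


for item in unhealthy(parse_status(SAMPLE)):
    print(item["kind"], item["index"], item["state"], item["flags"])
```

With the sample above, the loop lists the degraded volume and the failed disk 1; lines that do not match the assumed layout are silently skipped.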

I thought this problem was similar to one I had before, when the RAID array just became unsynchronised for some reason, so I was looking for a utility that would let me start the RAID re-synchronisation (rebuild) remotely. I was surprised to find that Dell provides only 2 types of utilities: for Windows and for RedHat Linux. Well, Linux is Linux; even though I don’t run RedHat and I hate it, there was a good chance that I would be able to run the executables under Gentoo. The first ominous sign was the size of the download: 44 MB, which is too much for a utility. When I unpacked it with rpm2targz I found that most of the contents was Java crap which uses a GUI to handle the RAID.

I know that I am weird, but none of my servers run Windows (M$ or X, does not matter). My concept is that a server is something which can be managed via a smartphone & ssh. A GUI is not an option for servers. They are sitting in the hot hell of a data centre, and every single bit of processor & memory resources should be doing the primary task (running Apache/database/you name it), not Java+X+KDE for occasional maintenance.

Well, my attempt to use Dell’s utility failed. I looked up the controller’s chip (LSI Logic 1068) and went to their site for help. The idiocy has spread there as well. They had a CLI tool for RAID management, but it was for the superseded MegaRAID, not for the SAS 5/iR. They have this RedHat/Java rpm to manage controllers. Damn!

I googled for a while and finally found sane people at IBM who use a command-line tool called cfggen to manage SAS 5/iR RAIDs on their servers.
NB: HP servers also have it, though not online.

Well, by that point I had already faced the fact that the hard drive had gone hanky-panky and never came back, so cfggen could only display the RAID and HDD0 status, and all attempts to start a rebuild failed, as there was nothing to rebuild.

There was another option: to use the BMC facility of the server. Because Dell servers come without documentation (well, there were unpacking instructions packed inside the box), I had no idea that the BMC is accessible via a shared network interface, and that there is no need for a DRAC (Dell Remote Access Card), which is actually unavailable for the SC1435 due to space limitations. So I didn’t even bother to set up the BMC in the BIOS.
(On the contrary, HP servers have iLO, which is implemented more thoughtfully: a separate interface which can use a separate or shared network. They are also shipped with tons of documentation in PDFs.)

So, on the next business day, i.e. Tuesday (remember that it happened on Friday arvo? 🙂), a technician from Dell changed the hard drive and the RAID was back in business in a couple of hours.

So, what are the lessons I’ve learnt?

  • Use the Nagios plugin check_mpt
  • cfggen from the IBM website / HP server utilities disk is a must for lucky Dell customers
  • Out-of-band access is something which should not be ignored.
  • Think 3 times before buying Dell servers next time 🙂
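To illustrate the first lesson, here is a rough Python sketch of what a Nagios-style check around mpt-status could look like. This is not the actual check_mpt plugin. It assumes that mpt-status is on the PATH and that its output contains `state <WORD>` fields, with OPTIMAL and ONLINE as the healthy values. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```python
import re
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: these are the only states that mean "nothing to worry about".
HEALTHY = {"OPTIMAL", "ONLINE"}


def evaluate(output):
    """Map raw mpt-status text to (exit_code, message)."""
    states = re.findall(r"\bstate\s+(\w+)", output)
    if not states:
        return UNKNOWN, "UNKNOWN: no state fields found in output"
    bad = sorted(set(states) - HEALTHY)
    if bad:
        return CRITICAL, "CRITICAL: RAID reports " + ", ".join(bad)
    return OK, "OK: all %d reported states healthy" % len(states)


def main():
    """Run mpt-status, print one status line, return the exit code."""
    try:
        result = subprocess.run(
            ["mpt-status"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("UNKNOWN: could not run mpt-status: %s" % exc)
        return UNKNOWN
    code, message = evaluate(result.stdout)
    print(message)
    return code
```

To use it as a plugin, a script would finish with `sys.exit(main())`, so that Nagios sees the exit code. Treating every non-healthy state as CRITICAL is a deliberately blunt choice; a real check might want WARNING while a rebuild is running.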