I was in a hurry, ticked off and tired when I left for the holidays, and I forgot to upload the right stuff before I went.
It was a combination of things, from one of the SCSI hard drives failing and taking down the entire server to mail-order Christmas presents arriving late.
The game server used a three-drive SCSI software RAID 0 array. This gave us speed but no way to recover from a drive failure. Basically, one of the partitions on the array failed and it is preventing the server from booting. It is not the database or web site partitions that died, so we should be able to recover those.
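For anyone wondering why RAID 0 is fast but fragile, here is a rough sketch in Python. It is only an illustration: the function names, the 4-byte chunk size and the sample data are made up, and the real striping is done by the RAID driver on much larger blocks.

```python
def stripe(data, drive_count=3, chunk_size=4):
    """Split data into chunks and deal them out round-robin, RAID 0 style."""
    drives = [[] for _ in range(drive_count)]
    for n, start in enumerate(range(0, len(data), chunk_size)):
        drives[n % drive_count].append(data[start:start + chunk_size])
    return drives


def read_back(drives, missing=None):
    """Reassemble the data; chunks on the missing drive come back as '?'."""
    out = []
    longest = max(len(d) for d in drives)
    for row in range(longest):
        for i, d in enumerate(drives):
            if row < len(d):
                out.append(b"?" * len(d[row]) if i == missing else d[row])
    return b"".join(out)


data = b"players and scores table"  # hypothetical sample data
drives = stripe(data)
intact = read_back(drives)              # all three drives healthy
damaged = read_back(drives, missing=1)  # middle drive has died
print(intact)
print(damaged)
```

Every drive holds a slice of the data and there is no spare copy anywhere, so losing one drive punches holes in whatever was striped across it.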
The SCSI controller we are using doesn't have hardware RAID (where the SCSI controller itself manages the RAID arrays through its own hardware and internal software) but relies on the software RAID drivers included with the Linux install we are using. These are slow and didn't include support for RAID 5. With RAID 5, parity data is distributed across all of the drives. If one of the drives fails we can remove it and put in a replacement, and the drive's contents are rebuilt from the data and parity on the other drives in the array.
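The parity rebuild described above comes down to XOR. Here is a minimal sketch in Python, assuming a single stripe with two data blocks and one parity block; the helper name and the sample bytes are invented for illustration, and a real controller does this in hardware on whole disk blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks plus the parity computed from them.
data_a = b"GAME"
data_b = b"DATA"
parity = xor_blocks([data_a, data_b])

# The drive holding data_b dies. XORing what is left gives it back.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)
```

The same trick works no matter which of the three blocks is lost, which is why any single drive in the array can be replaced and rebuilt.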
We bought another SCSI controller that has hardware RAID built in, and it should be here sometime next week. We will be rebuilding the server and moving everything onto it when the new controller arrives. We are going to set up an overly redundant array this time using RAID 5. We will be using three drives in the RAID 5 array with a fourth drive designated as a hot spare. The hot spare will not have anything stored on it, but if one of the other three drives fails the SCSI controller will automatically activate the hot spare and rebuild the failed drive's contents onto it. The drive that failed will be shut down and the controller will notify us about it. This way the server will not go down because of a single drive failure.
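To show how the hot spare takes over, here is a simplified Python model of a three-drive RAID 5 array with rotating parity and an empty fourth drive. Everything in it (function names, block contents, block size) is hypothetical; the actual controller does this in firmware and we have not seen how it is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def build_array(stripes):
    """Lay out (block_a, block_b) pairs on three drives, rotating the parity."""
    drives = [[], [], []]
    for i, (a, b) in enumerate(stripes):
        blocks = [a, b]
        parity_drive = i % 3  # parity moves to a different drive each stripe
        for d in range(3):
            if d == parity_drive:
                drives[d].append(xor_blocks([a, b]))
            else:
                drives[d].append(blocks.pop(0))
    return drives


def fail_over(drives, failed, spare):
    """Rebuild the failed drive's blocks onto the spare and swap it in."""
    survivors = [d for i, d in enumerate(drives) if i != failed]
    for stripe in zip(*survivors):
        spare.append(xor_blocks(stripe))
    drives[failed] = spare
    return drives


stripes = [(b"AAAA", b"BBBB"), (b"CCCC", b"DDDD"), (b"EEEE", b"FFFF")]
drives = build_array(stripes)
original = list(drives[1])  # remember what drive 1 held before it fails
hot_spare = []              # the fourth drive starts out empty
drives = fail_over(drives, failed=1, spare=hot_spare)
print(drives[1] == original)
```

After the swap the array is back to three healthy drives, and the dead one can be pulled and replaced to become the next spare.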
If a drive fails it will be totally transparent to everyone using the server. You will never notice that a drive failed.
The other great advantage of a SCSI controller with hardware RAID built in is the lower load on the CPU, since it will no longer have to run the RAID software. This is a huge advantage when the server gets heavy usage. Plus, the controller has an onboard cache that stores the most frequently used data from the drives, which means less drive activity.
Unfortunately the board only has a 16 MB cache right now, and we would like to upgrade that to 128 MB.
If someone happens to have a 128 MB, 72-pin, double-sided, 32Mx36 EDO or FPM SIMM to donate, we would sure appreciate it.
And as you have probably guessed, the game will be reset with the REAL new code when we rebuild the server. Sorry for the problems, but this hasn't been much fun for us either.