Ongoing Problems with WD2002FYPS


#1

A year ago I bought 6 of these drives for a new system I was building. When I installed them I found one was Dead On Arrival. Fortunately the store I purchased the drives from replaced the dead one right away.

I installed the drives in a RAID 5 configuration with 5 active drives and one cold standby. Everything seemed to run fine initially.

A few weeks ago my system performance was really slow and there seemed to be a lot of disk activity, so I installed the Intel RAID management software, which confirmed that one of the drives (SATA 5) had failed and was being rebuilt. After many hours I noticed that the rebuild was not really progressing; it just could not get past 5% complete. I also noticed that the drive had 254 unrecoverable medium errors.

I powered down my system and replaced the failed disk with the cold standby and powered up my system. The RAID controller automatically started rebuilding the new drive. Initially the drive started off with no media errors, but eventually it started reporting unrecoverable medium errors too. It took over a week for the rebuild to complete, but shortly after it was over 90% complete the entire rebuild failed with “bad-block table full” and there were 254 media errors like the previous drive.

So I went and bought a brand new WD2002FYPS, installed it, and waited for the rebuild to complete. A week later this new drive failed in pretty much the same manner as the previous one with “bad-block table full” and 254 recorded unrecoverable medium errors.

I downloaded and installed the WD Life Guard Diagnostics for Windows and ran the Quick test on the first failed drive. The results were:

Quick Test on Drive 2 did not complete! Status code = 07 (failed read test element), failure checkpoint = 97 (unknown test). SMART self-test did not complete on drive 2!

I’ve started running the Extended Test on the drive but there is an estimated 5 hours to go before completion.

What are the chances I could have 3 drives fail like this in almost exactly the same way? All three were connected to the same SATA port on my motherboard. Could it be a faulty SATA cable or connection? It seems strange that the rebuild gets to 90% completed and then fails completely.

The remaining 4 disks in my system seem to be running fine without any medium errors. Unfortunately it’s rather scary to be running a degraded RAID 5 system like this for so long, as one more disk failure will take the entire array with it.
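To illustrate why a degraded RAID 5 array has no margin left, here is a minimal Python sketch of XOR parity on a single stripe. It is a toy model and not how the Intel controller is actually implemented; the block contents and the helper names (`xor_blocks`, `rebuild_missing`) are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks):
    """With exactly one member missing, XOR of the survivors recovers it."""
    return xor_blocks(surviving_blocks)

# Hypothetical stripe: 4 data blocks plus 1 parity block (5 active drives).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose one drive (index 2): the other four blocks are enough to rebuild it.
survivors = stripe[:2] + stripe[3:]
recovered = rebuild_missing(survivors)
print(recovered == data[2])

# Lose a second drive as well: the XOR of the remaining three blocks is only
# the XOR of the two missing blocks, so neither can be recovered on its own.
two_lost = stripe[:2] + stripe[4:]
print(xor_blocks(two_lost) == xor_blocks([data[2], data[3]]))
```

In this model a rebuild has to read every surviving member successfully, which is one reason an unreadable block on any other drive can stall or abort a rebuild.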

Has anyone else run into problems with this particular model/line of disk drives?

Cheers, Eric


#2

This may sound like a stupid question (and probably is), but did you buy them all from the same store?  If so, when you return the one, you might consider buying it from another retailer.


#3

Yes, I did buy them all from the same store (ncix.com). If I buy another one of these disks I’ll try a different store.

At the moment I am running the WD Diagnostics on the failed disks. Unfortunately it takes about 6 hours to run the extended diagnostics. In the case of the first failed disk the extended diagnostics would not complete. It got to sector 3877396735 of 3907029167 and then froze. I could not cancel it or otherwise stop the program. I had to use task manager to terminate the process.
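A quick back-of-the-envelope check on where the extended test froze. This is only a sketch: the two sector numbers are the ones reported above, while the 512-byte logical sector size is an assumption.

```python
SECTOR_BYTES = 512            # assumed logical sector size
stalled_at = 3_877_396_735    # sector where the extended test froze
last_sector = 3_907_029_167   # last sector reported by the diagnostic

fraction_done = stalled_at / last_sector
remaining_sectors = last_sector - stalled_at
remaining_gb = remaining_sectors * SECTOR_BYTES / 1e9

print(f"scanned {fraction_done:.2%} of the drive")
print(f"{remaining_sectors} sectors (~{remaining_gb:.1f} GB) were never tested")
```

Under that assumption the test got through roughly 99% of the surface before hanging, so the trouble spot sits near the end of the drive.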

I’m now testing the second failed disk. It passed the WD Quick Test and is now running the extended test.

I plan to RMA the disks to WD as soon as I collect more information from the tests.

Cheers, Eric


#4

OK, my second failed drive passed both the Quick Test and the Extended Test. So according to the WD Diagnostics this disk is fine. I’m at a loss to explain why my Intel RAID Controller determined that this disk had failed during rebuild. I have one more disk to test.


#5

Just saw your post.

I did some “tests” of my WD2002FYPS as well, with the idea of later buying 5 more of them for use in a RAID.

It seems to run OK when directly connected to an Intel ICH10 or ICH9, but when connected to either of my backplanes I noticed that it was hanging a little during POST.

In my eyes that indicates some problem (possibly in the SATA interface circuit/IC), since my two WD1002FBYS and three WD5002ABYS drives don’t show this problem at POST and run OK in the same SAS/SATA backplanes.

So I won’t buy any WD2002FYPS for RAID use.

But your problem could be somewhere else, such as cooling of the ICH (if a heat pipe loses pressure it gives less cooling to the ICH/southbridge) or a defective SATA cable.

So did you make any changes to your system shortly before this error surfaced?

And you are of course running the 04.05G05 firmware on all your WD2002FYPS drives?

http://www.newegg.com/Product/ProductReview.aspx?Item=N82E16822136365

Hope this helps, and please post your results here.


#6

How can you tell if there is a problem during POST?

I have an Intel S5520SC motherboard. Although the ICH10 is capable of handling all the RAID functions itself, for some reason Intel does not use it for the RAID. Instead the BMC (Baseboard Management Controller) implements the actual RAID functions. However, the disks are directly connected to the ICH10.

In my system I have a 120 mm fan directly above the 5520 chipset, and the ICH10 is only a few cm from that. Presumably my ICH10 gets decent cooling. I don’t know if this is a possibility, but could it be that during a disk rebuild the ICH10 is busier and consequently drawing more power, making it more likely to overheat?

There is no heat-pipe for the ICH10, it’s just air cooled with a heat sink.

Actually my disk firmware is 04.05G04. I’ll have to see if there is a way to upgrade it.

Frankly I do not know exactly when the problem started, but I noticed it after I returned from vacation last month. I had powered down my system for the week. When I powered it back up I noticed the file system was sluggish.

A month or two before then I had shipped my system back to CoolIT to have the water cooling system repaired. They disconnected all the SATA cables from the MB to do the repair, so they may not have seated one of the cables properly. Anyway, I plan to replace that cable shortly.

Thanks for the advice.


#7

My most recently purchased disk is running the 04.05G05 firmware. This disk failed to rebuild on the RAID 5 controller like the other two. So I doubt this is a disk firmware issue as all the rest of the disks are running 04.05G04 and seem to run fine on SATA ports 1, 2, 3, and 4. Only SATA port 5 seems to be having the problems.

I replaced the new disk on SATA port 5 with one of the older disks that passed the WD Diagnostic Tests. I also replaced the SATA cable with a new one. The drive is rebuilding, and it will take over a week to finish, if it gets that far. Unfortunately rebuilds on 3 different disks have failed.
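For a sense of scale, here is a rough Python estimate of the average throughput implied by a week-long rebuild of one 2 TB member. It is a sketch under assumptions: 512-byte sectors, and the sector count taken from the diagnostic output quoted earlier in the thread (last sector 3907029167).

```python
SECTOR_BYTES = 512                 # assumed logical sector size
total_sectors = 3_907_029_168      # last sector 3907029167, plus sector 0
rebuild_days = 7                   # "over a week", so the real rate is even lower

drive_bytes = total_sectors * SECTOR_BYTES
rebuild_seconds = rebuild_days * 24 * 60 * 60
avg_mb_per_s = drive_bytes / rebuild_seconds / 1e6

print(f"average rebuild rate: at most about {avg_mb_per_s:.1f} MB/s")
```

That works out to only a few MB/s. A healthy drive of this class can typically sustain tens of MB/s sequentially, so a rebuild this slow suggests heavy retries, a throttled rebuild rate, or a bad link rather than normal operation.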

The RAID software still reports there are 254 media errors on the drive. I have no idea what this number represents, but I doubt it is derived from the drive itself as the drive diagnostics looked good. I suspect it is some magic number held in the RAID controller or associated software. Unfortunately there does not seem to be any way to reset or clear this value.

I checked my ICH10 chip, and there is no heat-sink on it, but it should be getting good ventilation.

I do not think the disk drives are overheating as they are in cages of 3 drives each, with a 120 mm fan per cage.

Anyway, at this point I am beginning to doubt that the drives are the problem.


#8

Eric-Kolotyluk wrote:

How can you tell if there is a problem during POST?

"It hangs for about 2 seconds; my WD1002FBYS is detected much faster during POST."

 

I have an Intel S5520SC motherboard. Although the ICH10 is capable of handling all the RAID functions itself, for some reason Intel does not use it for the RAID. Instead the BMC (Baseboard Management Controller) implements the actual RAID functions. However, the disks are directly connected to the ICH10.

“Nice-looking MB, but I have no experience with it or with the way its RAID is implemented. Anyway, the X58 is a close brother to the 5520, so they have something in common, but certainly not the onboard SAS/SATA RAID controller implementation.”

 

 

In my system I have a 120 mm fan directly above the 5520 chipset, and the ICH10 is only a few cm from that. Presumably my ICH10 gets decent cooling. I don’t know if this is a possibility, but could it be that during a disk rebuild the ICH10 is busier and consequently drawing more power, making it more likely to overheat?

“Yes, you are right. I can see the temperature going up during a rebuild on my X38/ICH9-based system.”

 

There is no heat-pipe for the ICH10, it’s just air cooled with a heat sink.

 

Actually my disk firmware is 04.05G04. I’ll have to see if there is a way to upgrade it.

Look at the Newegg link from my previous post and you will find the place at 3ware where you can download the firmware (by the way, funny that you have to download WD firmware from a non-WD site, isn’t it?)  :)))

 

Frankly I do not know exactly when the problem started, but I noticed it after I returned from vacation last month. I had powered down my system for the week. When I powered it back up I noticed the file system was sluggish.

“I had the same experience with my X58 system and have absolutely no idea what caused it. Maybe it was my failing onboard JMB controller, but it is very strange indeed.”

 

A month or two before then I had shipped my system back to CoolIT to have the water cooling system repaired. They disconnected all the SATA cables from the MB to do the repair, so they may not have seated one of the cables properly. Anyway, I plan to replace that cable shortly.

"Only the CPUs are water cooled, right?

It makes me wonder if CoolIT did something nasty to your system (like getting water on some component), or if the shipment damaged your HDUs (if the HDUs were not removed)."

 

Thanks for the advice.


#9

Eric-Kolotyluk wrote:

My most recently purchased disk is running the 04.05G05 firmware. This disk failed to rebuild on the RAID 5 controller like the other two. So I doubt this is a disk firmware issue as all the rest of the disks are running 04.05G04 and seem to run fine on SATA ports 1, 2, 3, and 4. Only SATA port 5 seems to be having the problems.

" suggest to contact Intel support and hear what they say"

 

I replaced the new disk on SATA port 5 with one of the older disks that passed the WD Diagnostic Tests. I also replaced the SATA cable with a new one. The drive is rebuilding, and it will take over a week to finish, if it gets that far. Unfortunately rebuilds on 3 different disks have failed.

"What a bi**. You have probably lost your files. I’m sure you know that one week for a rebuild indicates some error."

 

The RAID software still reports there are 254 media errors on the drive. I have no idea what this number represents, but I doubt it is derived from the drive itself as the drive diagnostics looked good. I suspect it is some magic number held in the RAID controller or associated software. Unfortunately there does not seem to be any way to reset or clear this value.

“Normally this info comes from SMART on the drive, but with your MB I am not sure. It is also strange that the errors still show up if you have written zeros to the drive with the WD diag (the Data Lifeguard DOS diag/util).”

 

I checked my ICH10 chip, and there is no heat-sink on it, but it should be getting good ventilation.

"Intel must be the first to know what is necessary, so… "

 

I do not think the disk drives are overheating as they are in cages of 3 drives each, with a 120 mm fan per cage.

 “Agreed. In one case some (read: many :)) years ago I found that the magnetic field from a blower (cheap Chinese sh**) degraded a disk over time, but I seriously doubt that this is the case here.”

Anyway, at this point I am beginning to doubt that the drives are the problem.

"Agreed. I suggest that, if possible, you get someone who knows your MB to help you, and at the same time get WD and Intel support involved. As a rule of thumb it’s best to run the same drives with the same firmware, but I doubt that updating to 04.05G05 will solve this problem, and if you are really unlucky and have problems with the SATA controllers you may brick the drive when updating.

It’s best to check/update your HDUs on another, functioning system with its SATA controllers in non-RAID mode, of course."

 

Strongly suggest you get some expert help here.

By expert I mean someone who really knows your MB.


#10

The POST on my MB takes a really, really long time, so it’s hard for me to tell if anything abnormal is happening. Usually if something is wrong the beep sequence is different.

I wonder if changing the rebuild rate on my RAID controller would help keep the ICH10 cooler. Right now it’s set for 30% which is the default. I noticed that my ICH10 has no heat sink, but it should be getting good ventilation.

According to what I’ve read the program to update the disk firmware requires DOS. Does that mean it will not run under Windows? What about Windows safe mode? Basically I’m not sure where I would get a copy of DOS.

Also, the firmware upgrade recommends backing up the disk first. If I were to be cautious this would mean updating each disk in the array, one at a time, and reconnecting it to the RAID to confirm it’s ok. It would probably also be good to do some patrolled reads to verify the integrity of the array after each disk firmware update. This boils down to a lot of effort to update all the disks in the array. Something I have to think hard about.

It’s just my two CPUs that are water cooled. I’m assuming that because CoolIT are experts at liquid cooling systems they would know how to take precautions against water leaks. On the other hand they disconnected all my disk drives without writing down the connections. When they reconnected everything the system booted up fine. Best I can tell is my RAID controller does not care if you reconnect the drives in a different order, as long as they are all part of the original array.


#11

Eric-Kolotyluk wrote:

The POST on my MB takes a really, really long time, so it’s hard for me to tell if anything abnormal is happening. Usually if something is wrong the beep sequence is different.

"You can only verify this (if the BIOS does not show each individual disk being detected) by connecting the same number of disks that are known to be OK (that the controller “likes”) and measuring the POST time."

 

I wonder if changing the rebuild rate on my RAID controller would help keep the ICH10 cooler. Right now it’s set for 30% which is the default. I noticed that my ICH10 has no heat sink, but it should be getting good ventilation.

 “Don’t change that; I don’t think there is a problem here.”

 

According to what I’ve read the program to update the disk firmware requires DOS. Does that mean it will not run under Windows? What about Windows safe mode? Basically I’m not sure where I would get a copy of DOS.

“If you have doubts about how to set up and use DOS (no offense intended), seek expert advice.”

 

Also, the firmware upgrade recommends backing up the disk first. If I were to be cautious this would mean updating each disk in the array, one at a time, and reconnecting it to the RAID to confirm it’s ok. It would probably also be good to do some patrolled reads to verify the integrity of the array after each disk firmware update. This boils down to a lot of effort to update all the disks in the array. Something I have to think hard about.

“Don’t do it until you are sure that the other issues have been fixed (that’s why I recommend doing the tests and the firmware upgrade on another known-good system).”

 

It’s just my two CPUs that are water cooled. I’m assuming that because CoolIT are experts at liquid cooling systems they would know how to take precautions against water leaks. On the other hand they disconnected all my disk drives without writing down the connections. When they reconnected everything the system booted up fine. Best I can tell is my RAID controller does not care if you reconnect the drives in a different order, as long as they are all part of the original array.

"That is correct, but when the drives are not in their original SATA ports (as when the array was created), reconstruction with some of the “can opener” programs (like diskpart raid reconstruct and all their “friends”) becomes impossible.

I mark all my drives with their SATA port number to counter that problem."

 

In my opinion, if you have a failing member in a RAID system it’s best to back up all data while that is still possible and only then do the reconstruction/failure correction (a bi** if you have something like 5 to 10 terabytes of files, but in my experience it is the only failsafe way to be sure you save the files).

 

But I still suggest you get an expert to help you (which does not mean I won’t try to answer your posts).


#12

OK, an update on my disk problems. Two of my three disks passed the WD Disk Diagnostics. I installed one of the original disks back in the array on SATA 5 and tried rebuilding it. It failed eventually, so I started rebuilding it again. Again it failed after getting about 12% complete.

Using the Intel RAID Web Console 2 software I set disk 5 offline. Then I set it online again. However, this time I got a message warning me to back up my files first and to confirm the operation. So I clicked “yes” and the drive went online. This time the drive state was “OK” - I don’t know why. I let the system run this way for a while and it seemed fine.

Next I restarted the system, as I wanted to see if the drive would stay online. The system restarted OK, so I went back into the RAID console and everything looked OK, except that there was a background initialization of Virtual Disk 0. I was a bit worried as this was my boot disk, but the system seemed to be running OK. After a few hours it finished that operation and my system seemed none the worse. Currently it is doing a background initialization of Virtual Disk 1 (which is a much larger disk - 7 TB).

Aside from one disk which failed the WD diagnostics, it looks like most of my problems are with the RAID system itself. I called Intel support and they suggested I had a bad SATA port and suggested I replace the system board. I’ll avoid doing that if I can as it is extremely disruptive changing a system board.


#13

Nice to see that you had some progress in your fault finding.

As to Intel’s suggestion of changing the system board, the safest way to do that is to build yourself a similar PC (without the hard disks), move your arrays to that, and then replace the original system board. (You could use something like  http://www.corsair.com/products/h50/default.aspx to cool your CPU(s) on the backup machine.)

Then you are able to replace your system board yourself without having to rely on the water-cooling company, and you avoid shipping damage and so on.

That will of course cost you, so the decision to build a second machine depends on how much it costs you to be without the system while it is being repaired or waiting for parts, the water-cooling company, etc.

It’s a difficult decision, because workstation/server boards, CPU(s), and RAM are not cheap, I know. (You could maybe make do with less RAM and only one CPU in the backup machine.)

The same goes for the disks: I normally always keep one spare disk (“cold”, i.e. not mounted in the backplane) for each array (if the disks are different models), so the time it takes to get a disk RMAed and back is no problem.

I also always use a separate RAID controller board so I can move my system to my “older” machine if the “new” fails.

If you have a disk that will not initialize or rebuild, you can write zeros to it using the DOS version of DLG (that will erase all data on the disk), run the extended test, and many times get it online again. It can indicate some problem with the disk, though, so I always order a new disk and replace the once-failing drive when the new one arrives.

Are you aware that the DOS version of DLG is able to fix more failures on disk drives than the Windows version? The disk you test must be on a standard SATA controller (sometimes it runs best with enhanced IDE activated, and sometimes the enhanced SATA function must be disabled for the best result; the only way to find out is to try both BIOS settings).


#14

I’m actually able to replace the system board myself, and have done so a few times already. The case where I had to send it back to the cooling company was when the cooling system itself was failing and needed replacement.

That said, replacing the system board is a **bleep**. You have to take the coolers off of the CPUs; take out the graphics card; disconnect all the cables and other connections; etc. In particular the Intel S5520SC comes with this optional RAID 5 key that is not really designed to be removed from the board once installed. It requires a set of pliers and a fair bit of force to remove the key. Fortunately the cooling system is all self-contained, so you don’t have to drain it or refill it. But it is tricky installing the coolers because the attached hoses make it difficult to attach the coolers accurately to the CPUs.

My RAID array has 5 disks, and I keep a 6th cold spare in the case just for this kind of situation. Unfortunately I had to go through 3 disks to deal with this problem.

I was actually getting ready to reformat or write zeros to one of my disks as the next step in resolving the problem. Fortunately the WD utilities I was using had this feature. I’ve had situations in the past with other disks that had failed where reformatting the disk corrected the problem.

Yes, I do find that the standalone versions of disk utilities often have more features than the windows ones. I don’t know why it has to be that way.

Continuing on with my saga… According to Intel the background initialization on my virtual disks reads all the disk blocks and repairs any parity problems it finds. At last count the system has found and corrected over 50 media errors. In this case these are logical media errors and not physical media errors (the software’s terminology is very misleading). I still have 250 media errors left that I hope will be corrected.

At any rate, I feel a lot better about my WD2002FYPS disks now, but I think far less of the Intel RAID software than I used to. Maybe some day I will work up the nerve to update the firmware on my HDDs.
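The kind of pass Intel describes (read every block, repair parity where it disagrees) can be sketched in a few lines of Python. This is a toy model and certainly not Intel’s actual implementation; the stripe layout, the `scrub` function, and the sample data are all hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(stripes):
    """Recompute parity for every stripe and rewrite it where it disagrees.

    Each stripe is a dict with 'data' (list of blocks) and 'parity' (one block).
    Returns the number of stripes that had to be corrected.
    """
    corrected = 0
    for stripe in stripes:
        expected = xor_blocks(stripe["data"])
        if stripe["parity"] != expected:
            stripe["parity"] = expected   # a "logical" error: data is fine, parity is stale
            corrected += 1
    return corrected

# Hypothetical array contents: the second stripe carries stale parity.
stripes = [
    {"data": [b"\x01\x02", b"\x10\x20"], "parity": b"\x11\x22"},
    {"data": [b"\x0f\x0f", b"\xf0\xf0"], "parity": b"\x00\x00"},
]
print(scrub(stripes))   # first pass fixes the stale stripe
print(scrub(stripes))   # second pass finds nothing left to fix
```

In this model a mismatch says nothing about the physical surface of any drive, which fits the distinction between logical and physical media errors.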

Thanks so much for your help - it’s been a valuable learning experience.

Cheers, Eric


#15

You are welcome :))