Get S.M.A.R.T. (Hard Drive Health)

Cerberus · November 20, 2023, 6:57pm

The My Cloud OS5 dashboard is lying to you.

Ever check the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) status of a hard drive via the My Cloud OS5 dashboard and see the following? It looks a little spread out, almost as if something has been removed, doesn’t it?

Well, something was intentionally removed by Western Digital, to prevent you from seeing it. Here’s what it should look like. Notice the difference?

It’s just another form of deception: the RAW_VALUE column was intentionally removed from the S.M.A.R.T. attribute table to prevent users from knowing the true S.M.A.R.T. status of their hard drives. Knowledge is POWER, and WD must think we’re all stupid.

RAW Value Commented Out (disk_mgmt.js):

$("#DiskMgmt_SMART_Data").flexigrid({						
	url: '/cgi-bin/smart.cgi',		
	dataType: 'xml',
	cmd: 'cgi_Status_SMART_HD_Info',	
	colModel : [
		{display: "ID", name : 'my_id', width : my_flexigrid_id_width, align: 'left'},			
		{display: "Name", name : 'my_name', width : my_flexigrid_item_width, align: 'left'},			
		{display: "Value", name : 'my_value', width : 80, align: 'left'},		
		{display: "Worst", name : 'my_worst', width : 100, align: 'left'},
		{display: "Thresh", name : 'my_thresh', width : 100, align: 'left'}	
		// {display: "Raw Value", name : 'my_raw_value', width : '80', align: 'center'}
		],
	usepager: false,
	useRp: true,
	rp: 300,
	showTableToggleBtn: true,
	f_field:my_dev,
	width:  650,
	height: 'auto',
	errormsg: _T('_common','connection_error'),
	nomsg: _T('_common','no_items'),
	noSelect:true,
	striped:true,
	resizable: false,
	onSuccess:function(){
		_jScrollPane = $("#DiskMgmt_smartdata_content").jScrollPane();
	},
	preProcess: function(r) {
		return r;
	}
});

An app is being created to deal with this issue, but it’s not quite finished yet and will need to be tested prior to release.

For the time being, the only reliable way to get the true S.M.A.R.T. status of installed hard drives is to enable SSH, then run the following commands, one at a time. Some commands may not be needed, depending on the number of installed hard drives.

smartctl -a /dev/sda;
smartctl -a /dev/sdb;
smartctl -a /dev/sdc;
smartctl -a /dev/sdd;
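The four commands above can be collapsed into a short loop. This is only a sketch: it assumes Python 3.7 or later is available on the machine where it runs (the thread does not say whether My Cloud OS5 ships it) and that the drives appear as /dev/sda through /dev/sdd. The names `build_commands` and `run_all` are made up for illustration.

```python
import os
import shutil
import subprocess

# Assumed device names, matching the four commands in the post.
CANDIDATE_DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]


def build_commands(devices, exists=os.path.exists):
    """Return one `smartctl -a` argument list per device that is present."""
    return [["smartctl", "-a", dev] for dev in devices if exists(dev)]


def run_all(devices=CANDIDATE_DEVICES):
    """Run smartctl for each present drive and return {device: output text}."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl was not found on PATH")
    reports = {}
    for cmd in build_commands(devices):
        # smartctl uses a bit-mask exit status, so a non-zero code does not
        # necessarily mean the command failed to run; keep the output anyway.
        result = subprocess.run(cmd, capture_output=True, text=True)
        reports[cmd[-1]] = result.stdout
    return reports
```

Calling `run_all()` over SSH would return the same text as the individual commands, keyed by device, with empty bays skipped.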

How to Access WD My Cloud Using SSH (Secure Shell)

fzabkar · November 20, 2023, 7:31pm

The raw value of 346 for Power_On_Hours is not consistent with the normalised value of 53. It’s implying that the drive only has another 350 hours to go before it reaches end-of-life.

Cerberus · November 20, 2023, 7:44pm

Good catch. It’s a bug in the app PHP code that truncates the RAW_VALUE text string. As mentioned in my post, the app is a work in progress, and it’s far from being finished.

Incorrect (truncated):

347 / (100 - 53) x 100 = 738.29 hours (0.084 years)

Correct:

34732 / (100 - 53) x 100 = 73897.87 hours (8.435 years)

The math above uses current S.M.A.R.T. values, shown below.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    0
  3 Spin_Up_Time            0x0027   167   146   021    10625
  4 Start_Stop_Count        0x0032   089   089   000    11121
  5 Reallocated_Sector_Ct   0x0033   200   200   140    0
  7 Seek_Error_Rate         0x002e   200   200   000    0
  9 Power_On_Hours          0x0032   053   053   000    34732
 10 Spin_Retry_Count        0x0032   100   100   000    0
 11 Calibration_Retry_Count 0x0032   100   100   000    0
 12 Power_Cycle_Count       0x0032   099   099   000    1474
192 Power-Off_Retract_Count 0x0032   200   200   000    509
193 Load_Cycle_Count        0x0032   197   197   000    10611
194 Temperature_Celsius     0x0022   106   094   000    46
196 Reallocated_Event_Count 0x0032   200   200   000    0
197 Current_Pending_Sector  0x0032   200   200   000    0
198 Offline_Uncorrectable   0x0030   200   200   000    0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    0
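The calculation above can be reproduced from the attribute table itself. The sketch below assumes the trimmed seven-column layout shown in this thread (full smartctl output carries extra columns between THRESH and RAW_VALUE, which this pattern does not handle), and the function names are illustrative only.

```python
import re

# One attribute row in the layout used in this thread:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> dict with name, value, worst, thresh, raw."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, _flag, value, worst, thresh, raw = m.groups()
            attrs[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs


def projected_hours(raw_hours, value):
    """raw / (100 - value) x 100, the formula used in the post."""
    if value >= 100:
        raise ValueError("normalised value has not dropped below 100 yet")
    return raw_hours / (100 - value) * 100


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    0
  9 Power_On_Hours          0x0032   053   053   000    34732
"""

poh = parse_attributes(SAMPLE)[9]
print(round(projected_hours(poh["raw"], poh["value"]), 2))  # 73897.87
```

Reading RAW_VALUE as a whole integer field, rather than a fixed-width slice of text, is what avoids the kind of truncation that turned 34732 into 347.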

fzabkar · November 20, 2023, 10:00pm

It appears that the drive’s rated life is set to 100 months, or at least that’s where the POH attribute drops to 0.

Cerberus · November 20, 2023, 11:45pm

Both hard drives in my PR4100 development box are Western Digital Black (WD2003FZEX-00Z4SA0) and I believe their MTBF (Mean Time Between Failures) is 300000 hours. They don’t make them like this anymore, unfortunately.

/dev/sda

34732 / (100 - 53) x 100 = 73897.87 hours (8.435 years)

/dev/sdb

29738 / (100 - 60) x 100 = 74345.00 hours (8.487 years)

The S.M.A.R.T. values from the second drive are shown below.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    0
  3 Spin_Up_Time            0x0027   156   139   021    11183
  4 Start_Stop_Count        0x0032   094   094   000    6443
  5 Reallocated_Sector_Ct   0x0033   200   200   140    0
  7 Seek_Error_Rate         0x002e   200   200   000    0
  9 Power_On_Hours          0x0032   060   060   000    29738
 10 Spin_Retry_Count        0x0032   100   100   000    0
 11 Calibration_Retry_Count 0x0032   100   100   000    0
 12 Power_Cycle_Count       0x0032   098   098   000    2094
192 Power-Off_Retract_Count 0x0032   200   200   000    635
193 Load_Cycle_Count        0x0032   199   199   000    5804
194 Temperature_Celsius     0x0022   108   097   000    44
196 Reallocated_Event_Count 0x0032   200   200   000    0
197 Current_Pending_Sector  0x0032   200   200   000    0
198 Offline_Uncorrectable   0x0030   200   200   000    0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    0

For comparison, the hard drives in my PR4100 media server are an assortment of Toshiba X300 series performance hard drives. For RAID arrays, it’s best to use Toshiba N300 series hard drives, but I only use JBOD.

HDWE160: 47926 / (100 - 01) x 100 = 48410.101 hours (5.526 years)
HDWR21C: 16937 / (100 - 58) x 100 = 40326.190 hours (4.603 years)
HDWF180: 19455 / (100 - 52) x 100 = 40531.250 hours (4.627 years)
HDWF180: 38680 / (100 - 04) x 100 = 40291.667 hours (4.599 years)

As you can see, a couple of them have been running for a long time, with zero problems to date, and I hammer the snot out of them. The Power_On_Hours S.M.A.R.T. attributes for each of them are shown below.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH RAW_VALUE
  9 Power_On_Hours          0x0032   001   001   000    47926
  9 Power_On_Hours          0x0032   058   058   000    16937
  9 Power_On_Hours          0x0032   052   052   000    19455
  9 Power_On_Hours          0x0032   004   004   000    38680

fzabkar · November 21, 2023, 12:18am

It looks like Toshiba rates their drives for 40,000 hours.

Cerberus · November 21, 2023, 12:23am

I was thinking 50,000 hours. Regardless, Toshiba does seem to be a bit conservative with their S.M.A.R.T. ratings. Per the Toshiba X300 series datasheet, the MTTF (Mean Time To Failure) for all models is 600,000 hours.

fzabkar · November 21, 2023, 1:47am

BTW, WD’s drives hide certain SMART attributes from the user. You can see them if you dump the appropriate firmware module.

Cerberus · November 21, 2023, 4:07am

Western Digital tries to hide many things, but they’re terrible at the art of deception because they keep getting caught. Being no stranger to advanced hardware hacking, I’m aware of hidden S.M.A.R.T. attributes, although they’re not very important in the grand scheme of things.

In this case, the app I’m working on will take a simple approach to alerting the user that there may be a problem: it will add up the raw values of several primary S.M.A.R.T. attributes, and a total greater than zero will trigger a warning.

ATTRIBUTE    DESCRIPTION
SMART 1      Raw Read Error Rate
SMART 5      Reallocated Sectors Count
SMART 7      Seek Error Rate
SMART 10     Spin Retry Count
SMART 187    Reported Uncorrectable Errors
SMART 188    Command Timeout
SMART 196    Reallocated Event Count
SMART 197    Current Pending Sector Count
SMART 198    Uncorrectable Sector Count
SMART 199    UDMA CRC Error Count
SMART 200    Multi Zone Error Rate

The “187 Reported Uncorrectable Errors” and “188 Command Timeout” attributes may not be used, because they’re often hidden from the user. Beyond that, no attempt will be made to decode or interpret S.M.A.R.T. attributes, except for “194 Temperature Celsius”, because doing so is just too complicated for an app like this.
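The warning rule described here could look something like the following. It is a sketch of the idea only, not the app’s actual code (which the thread says is PHP and unfinished); it assumes the total is taken over raw values, and `warning_total` and `needs_warning` are hypothetical names.

```python
# Attribute IDs from the list above. 187 and 188 are simply skipped
# when the drive does not report them.
WATCHED_IDS = (1, 5, 7, 10, 187, 188, 196, 197, 198, 199, 200)


def warning_total(raw_values, watched=WATCHED_IDS):
    """Sum the raw values of the watched attributes that are present."""
    return sum(raw_values[i] for i in watched if i in raw_values)


def needs_warning(raw_values):
    """True when any watched attribute has a non-zero raw value."""
    return warning_total(raw_values) > 0


# Raw values taken from the first drive's table earlier in the thread.
drive_sda = {1: 0, 3: 10625, 4: 11121, 5: 0, 7: 0, 9: 34732, 10: 0,
             196: 0, 197: 0, 198: 0, 199: 0, 200: 0}
print(needs_warning(drive_sda))  # False
```

One caveat: some vendors, Seagate for example, report large non-zero raw values for attributes 1 and 7 on healthy drives, so a rule like this would probably need per-vendor handling to avoid false alarms.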

A warning may also be triggered when a hard drive with SMR (Shingled Magnetic Recording) technology is detected, because it’s a disaster waiting to happen on NAS devices, where RAID is often used.

fzabkar · November 21, 2023, 3:45pm

Cerberus:

  9 Power_On_Hours          0x0032   058   058   000    16937
  9 Power_On_Hours          0x0032   052   052   000    19455
  9 Power_On_Hours          0x0032   004   004   000    38680

16937 / (100 - 58) x 100 = 40326 hours
19455 / (100 - 52) x 100 = 40531 hours
38680 / (100 - 4) x 100 = 40291 hours

fzabkar · November 21, 2023, 3:55pm

 ID  Flg   Cur  Wor  Thr  Raw             Description
-----------------------------------------------------------------------------
  1   2F   100  253  100  00000000000000  Raw Read Error Rate
  2  *A4   100  253    0  00000000000000  Throughput Performance
  3   27   100  253    0  00000000000000  Spin Up Time
  4   32   100  100    0  00000000000000  Start/Stop Count
  5   33   200  200    0  00000000000000  Reallocated Sector Count
  7   2E   100  253  100  00000000000000  Seek Error Rate
  8  *A4   100  253    0  00000000000000  Seek Time Performance
  9   32   100  100    0  00000000000000  Power-On Hours Count
 10   32   100  253    0  00000000000000  Spin Retry Count
 11   32   100  253    0  00000000000000  Drive Calibration Retry Count
 12   32   100  100    0  00000000000000  Drive Power Cycle Count
180  *AE   200  200    0  00000000000000  Unknown Attribute
183  *B2   100  100    0  00000000000000  SATA Downshift Error Count
184  *B2   100  100    0  00000000000000  End to End Error Det/Corr Count
187  *B2   100  100    0  00000000000000  Reported Uncorrectable Errors
188  *B2   100  100    0  00000000000000  Command Time Out
190  *A2    61   61    0  00000000000027  Airflow Temperature
191  *B2   100  100    0  00000000000000  Shock Sense
192   32   200  200    0  00000000000000  Emergency Retract Cycle Count
193   32   200  200    0  00000000000000  Load/Unload Cycle Count
194   22   113  113    0  00000000000027  HDA Temperature
195  *B6   100  253    0  00000000000000  ECC on the Fly Count
196   32   200  200    0  00000000000000  Reallocated Sector Event
197   32   200  200    0  00000000000000  Current Pending Sector Count
198   30   100  253    0  00000000000000  Offline Uncorrectable Sector Count
199   32   200  253    0  00000000000000  UltraDMA CRC Error Rate
200   08   100  253    0  00000000000000  Multi Zone Error Rate
240  *B2   100  100    0  00000000000000  Head Flying Hours
241  *B2   200  200    0  00000000000001  Total LBAs written
242  *B2   200  200    0  00000000000002  Total LBAs read

     * = hidden attribute

https://files.hddguru.com/download/PC-3000-UDMA%20Support/WDC%20Marvell%20family%20utility/VIVALDI/WDC%20WD2003FZEX-00Z4SA0-01-01A01-0001003V-WD-WMC1F0320752.rar

These are the attributes I have extracted from the firmware dump using my own tool.
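The Raw column in this dump appears to be hexadecimal rather than decimal. That is an inference, not something stated in the post: reading 27 as hex gives 39, and attribute 190’s current value of 61 equals 100 - 39, which would be consistent with a temperature of 39 °C. A minimal sketch, with an illustrative function name:

```python
def decode_raw_hex(raw_field):
    """Interpret the 14-digit Raw column as a hexadecimal integer."""
    return int(raw_field, 16)


print(decode_raw_hex("00000000000027"))  # 39
```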

Cerberus · November 21, 2023, 4:24pm

Hmmm, I already know that, and they’re exactly as I posted previously, except with a little less precision.

fzabkar:

 ID  Flg   Cur  Wor  Thr  Raw             Description
-----------------------------------------------------------------------------
  2  *A4   100  253    0  00000000000000  Throughput Performance
  8  *A4   100  253    0  00000000000000  Seek Time Performance
180  *AE   200  200    0  00000000000000  Unknown Attribute
183  *B2   100  100    0  00000000000000  SATA Downshift Error Count
184  *B2   100  100    0  00000000000000  End to End Error Det/Corr Count
187  *B2   100  100    0  00000000000000  Reported Uncorrectable Errors
188  *B2   100  100    0  00000000000000  Command Time Out
190  *A2    61   61    0  00000000000027  Airflow Temperature
191  *B2   100  100    0  00000000000000  Shock Sense
195  *B6   100  253    0  00000000000000  ECC on the Fly Count
240  *B2   100  100    0  00000000000000  Head Flying Hours
241  *B2   200  200    0  00000000000001  Total LBAs written
242  *B2   200  200    0  00000000000002  Total LBAs read

     * = hidden attribute

So like I said, the hidden attributes are nothing of any real significance, and hardly worth kicking up a fuss about. The key here is simplicity, to work with what’s available.

fzabkar · November 21, 2023, 4:44pm

If you already know that the SMART threshold is set for 40,000 hours, then why pick an arbitrary figure of 50,000 out of your head?

As for the importance of SMART attributes, that is a matter of opinion. I, personally, would like to see all of them and then come to my own conclusions.

Cerberus · November 21, 2023, 4:59pm

For Pete’s sake, are you blind?

My 50,000 number was NOT arbitrary, and you deliberately excluded it in your quote to support your 40,000 number. I’m not stupid and don’t have time for games.

Then we disagree. Shocker!

fzabkar · November 21, 2023, 7:28pm

I deliberately excluded this result because the attribute has clearly reached its threshold and therefore any calculation based on it is meaningless.

BTW, the normalised value is sitting at 1 rather than 0 because that appears to be the bottom limit set by Toshiba (to avoid triggering an unnecessary SMART failure).

Cerberus · November 21, 2023, 8:15pm

Honestly, I really don’t care, because you’ve completely missed the point by going off on needless tangents. The value was obviously somewhere between 40,000 and 50,000, so going round and round about it is pointless.

From my perspective, you very much look like you’re trying to get your ego stroked, and that won’t fly with me because I also have a great deal of experience, but I tend to keep the details to myself.

Here we go again. You started this whole sideshow by nitpicking a screenshot that was simply used for illustration purposes, despite the fact that I clearly stated that the app was not finished, and even followed up by saying it was a bug in the PHP code that truncated a text string. Then it went on, and on, and on.

To be crystal clear, the app I’m creating will show whatever S.M.A.R.T. attributes are available, and maybe warn the user if any obvious faults are detected, and that’s it. It won’t be hacking the hard drive firmware to dredge up hidden S.M.A.R.T. attributes, or doing any other crazy things, so discussing that is a moot point.

In fact, displaying S.M.A.R.T. attributes is a secondary function of my app, because it will do much more, but ONLY after it’s finished.

So do us both a favor and give it a rest.

fzabkar · November 21, 2023, 8:32pm

Did I damage your fragile ego? Is that the reason that you’ve got your back up?

As for the numbers, it’s not “obviously somewhere between 40,000 and 50,000”.

Try to get your mind around this simple arithmetic:

16937 / (100 - 58) x 100 = 40326 hours
16937 / (100 - 57) x 100 = 39388 hours

19455 / (100 - 52) x 100 = 40531 hours
19455 / (100 - 51) x 100 = 39704 hours

38680 / (100 - 4) x 100 = 40291 hours
38680 / (100 - 3) x 100 = 39876 hours

Do you get it now?
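The paired calculations above amount to bracketing the rated life. Because the normalised value is a whole number, each drive only pins the figure down to a range. The sketch below assumes the value is computed as 100 - floor(100 x raw / rated life), which is an inference from the numbers in this thread and not a documented Toshiba formula; it also assumes all three drives share one rated life. Function names are illustrative.

```python
def rated_life_bounds(raw_hours, value):
    """Return (lower, upper) bounds on the rated life in hours.

    Assumes value = 100 - floor(100 * raw_hours / rated_life), so the
    rated life lies in the interval (lower, upper].
    """
    if not 1 < value < 100:
        # A value of 1 looks clamped and 100 means no wear recorded yet,
        # so neither gives a usable estimate.
        raise ValueError("value must be between 2 and 99 for this estimate")
    used = 100 - value  # whole percent of rated life consumed
    return raw_hours * 100 / (used + 1), raw_hours * 100 / used


def combined_bounds(samples):
    """Intersect the per-drive ranges for a list of (raw_hours, value)."""
    bounds = [rated_life_bounds(raw, value) for raw, value in samples]
    return max(lo for lo, _ in bounds), min(hi for _, hi in bounds)


toshiba = [(16937, 58), (19455, 52), (38680, 4)]
low, high = combined_bounds(toshiba)
print(int(low), int(high))  # 39876 40291
```

Under those assumptions the three drives together narrow the rated life to roughly 39,876 to 40,291 hours, which brackets 40,000.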

Cerberus · November 21, 2023, 8:47pm

You have that backwards. You just want to be “right” and that’s all you care about.

You’re the one who doesn’t get it, yet you persist. I’ve already plotted a spreadsheet, so I know how it works, but you’re intentionally focusing on values that were cherry picked from my examples to fit your 40,000 narrative.

For the last time, I don’t give a fuc*k about that number, because it’s irrelevant here.