HELP with MBL: Stopped working all of a sudden & super high CPU usage

bigal666 · January 24, 2013, 1:01pm

Hello friends.

My MBL 3TB suddenly stopped working properly yesterday. Overnight it became unreachable over the network, although it was still answering ping.

I managed to access it via ssh, but only right after rebooting it, because once it has been up for a while it becomes impossible to access again. I realized the CPU consumption was extremely high… I even saw it reach 13.0+. So I decided to kill all the daemons I saw running in top, and I renamed mediacrawler, tally and monitoreo.sh to stop the system from restarting these services.

After all this, and after running e2fsck -y /dev/sda4 (no bad sectors found), the device is reachable again via the network. But now performance is really, really slow. I can't copy several big files to the MBL at once (like 4 or 5 files of 1GB each) because the speed keeps dropping, down to rates like 2.5MB/s, until it gives an error like “MBL not available”. Normally I was getting about 40MB/s on average. If I copy the files one by one, they do get copied, but at a very slow transfer rate.

I'm really lost about what to do next. I have a feeling there is something consuming all the CPU and that this is why performance is so low. Maybe updating the firmware would help, but unfortunately I installed the latest version about 2 weeks ago, so I can't replace it with a newer one (or I don't know how to).

I'm about to buy a new NAS (a different brand, of course), remove the disk from the case manually, connect it to a computer and back everything up to the new NAS. I know this is a really extreme decision, but I really don't know what else I can do.

If anybody has been in this situation before and managed to solve it, I'd really appreciate any help or suggestions so I can keep trying to get things working as well as they did until yesterday morning.

Thanks a lot, and I apologize for the long post.

TonyPh12345 · January 24, 2013, 2:09pm

Well, when you’re doing a file transfer, look at the “top” report and see what’s using the CPU.

It’s normal for the process “smbd” to be high CPU during a copy, because that’s the Samba daemon.

Also, look for signs of network errors in both the PC and the MBL.

bigal666 · January 24, 2013, 7:26pm

Well… this is exactly one of the problems… I can't see any process consuming the CPU :(((

Have a look at this screenshot with the output of top… The only activity going on is opening the web UI, which by the way failed: Error 3001: Max timeout reached.

TonyPh12345 · January 24, 2013, 7:59pm

So it’s NOT CPU bound.

It’s likely IO bound as 99.3%wa indicates.

Sort the TOP by “waiting” processes to see what’s thrashing the disk.
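
The "wa" figure in top is the share of CPU time spent waiting on I/O. As a rough sketch of where that number comes from, the following computes it from two readings of /proc/stat. It assumes Python is available on the machine, and the function names are hypothetical helpers, not part of top or of the MBL firmware.

```python
import time

def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Order: user nice system idle iowait irq softirq steal.
    # Later fields (guest, guest_nice) are already counted in user/nice.
    return [int(x) for x in fields[1:9]]

def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter

def read_cpu_sample(path="/proc/stat"):
    """Read the aggregate CPU counters (first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())

def measure_iowait(interval=5.0, path="/proc/stat"):
    """Sample twice, `interval` seconds apart, and return the iowait share."""
    first = read_cpu_sample(path)
    time.sleep(interval)
    second = read_cpu_sample(path)
    return iowait_percent(first, second)
```

Calling `measure_iowait()` repeatedly during a file copy should track the same trend as the wa value in top, without needing a screenshot.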

bigal666 · January 24, 2013, 9:26pm

Sorry if it's a very basic question, but I don't know how to sort it by waiting processes :((

I tried ‘man top’, but I couldn't figure out how to do it. Do you remember how it can be sorted?

TonyPh12345 · January 24, 2013, 9:54pm

While TOP is running, press

Shift-F

w

(enter)

Shift-R

(That’s Capital F followed by Lowercase w, ENTER, Capital R.)

That will sort by STATUS. Look for things with Status “D” which should be at the top of the list.
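
If the keystrokes are awkward over ssh, the same check can be sketched as a script that reads the state letter from each /proc/<pid>/stat file. This is only an illustration, assuming Python is available on the device; the function names are hypothetical.

```python
import os

def parse_stat(text):
    """Extract (pid, name, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    name = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, name, state

def uninterruptible(stat_texts):
    """Keep only processes in state D (uninterruptible sleep, usually I/O)."""
    parsed = [parse_stat(t) for t in stat_texts]
    return [(pid, name) for pid, name, state in parsed if state == "D"]

def read_all_stats(proc="/proc"):
    """Read every /proc/<pid>/stat, skipping processes that exit meanwhile."""
    texts = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                texts.append(f.read())
        except OSError:
            pass  # the process ended between listdir() and open()
    return texts
```

Running `print(uninterruptible(read_all_stats()))` a few times in a row gives a snapshot of which processes are stuck in state D at each moment.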

bigal666 · January 24, 2013, 10:18pm

Thank you so so much Tony.

There are only 2 daemons running with status ‘D’.

308 root -2 0 0 0 0 D 0.0 0.0 0:00.00 btn_t
316 root -2 0 0 0 0 D 0.0 0.0 0:00.34 a3gblink_t

I really have no clue what they are for… but they seem to be frozen…

After a few seconds this one also appears:

153 root 20 0 0 0 0 D 0.3 0.0 0:01.52 kswapd0

but it disappears again after 1 or 2 seconds.

bigal666 · January 25, 2013, 9:39am

Hi again.

2415 root 20 0 0 0 0 D 0.0 0.0 0:02.99 jbd2/sda4-8

I found this other process that appears intermittently, and when it does, wa reaches 99%. When it disappears, wa drops a little to 75%-85%, but since jbd2/sda4-8 runs every few seconds, wa is at 99% most of the time.

This process is related to the ext4 journal, right?

Another weird behavior: right after I restart the unit, it seems to work fine with wa=0%. But after a couple of minutes of copying files from the MBL, wa starts rising again to 99% and never drops back to 0%, even if I cancel the transfer.

I ran dmesg after a restart, and the only messages I get related to the HD are these:

sd 1:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
sd 1:0:0:0: [sda] 4096-byte physical blocks
sd 1:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn’t support DPO or FUA
sda: sda1 sda2 sda3 sda4
sd 1:0:0:0: [sda] Attached SCSI disk
md: Waiting for all devices to be available before autodetect
md: If you don’t use raid, use raid=noautodetect
md: Autodetecting RAID arrays.
md: Scanned 2 and added 2 devices.
md: autorun …
md: considering sda2 …
md: adding sda2 …
md: adding sda1 …
md: created md1
md: bind
md: bind
md: running:
md1: WARNING: sda2 appears to be on the same physical disk as sda1.
True protection against single-disk failure might be compromised.
raid1: raid set md1 active with 2 out of 2 mirrors
md1: detected capacity change from 0 to 2047803392
md: … autorun DONE.
md1: unknown partition table
kjournald starting. Commit interval 5 seconds
EXT3 FS on md1, internal journal
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) on device 9:1.
Freeing unused kernel memory: 448k init
Enable EMAC EMI Fix
eth0: link is up, 1000 FDX, pause enabled
eth0: no IPv6 routers present
EXT4-fs (sda4): recovery complete
EXT4-fs (sda4): mounted filesystem with ordered data mode
Adding 500608k swap on /dev/sda3. Priority:-1 extents:1 across:500608k
svc: failed to register lockdv1 RPC service (errno 97).
Calling led_set_blink with value x

They don't tell me much. Any clue?
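
For a long dmesg dump like this one, a small filter can help pick out the lines worth a second look. The sketch below is only a rough aid: the keyword list is an assumption chosen for illustration, not an authoritative list of kernel error messages, and a clean result does not prove the disk is healthy.

```python
import re

# Hypothetical keyword list; adjust it to whatever you are hunting for.
SUSPICIOUS = re.compile(r"warning|error|fail|timeout|reset", re.IGNORECASE)

def suspicious_lines(dmesg_text):
    """Return the lines of a dmesg dump that match any suspicious keyword."""
    return [line for line in dmesg_text.splitlines() if SUSPICIOUS.search(line)]
```

Applied to the output above, it would flag the md1 WARNING line and the "failed to register lockdv1" line. Both can be fed in by saving the dmesg output to a file and passing its contents to `suspicious_lines`.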