HELP with MBL: Stopped working properly all of a sudden & super high CPU usage

Hello friends.

My MBL 3TB stopped working properly yesterday, all of a sudden. Overnight it became unreachable through the network environment, although it was still answering ping.

I managed to access it via ssh, but only after rebooting it, because after being up for a while it becomes impossible to reach again. I realized the CPU consumption was extremely high… I even saw it reach 13.0+. So I decided to kill all the daemons I saw running in top, and I renamed mediacrawler, tally and monitoreo.sh to stop the system from restarting these services.

After all this, and after running e2fsck -y /dev/sda4 (no bad sectors found), the device is reachable over the network again. But now performance is really, really slow. I can't copy several big files to the MBL at once (say 4 or 5 files of 1GB each), because the speed keeps dropping to rates like 2.5MB/s until it gives an error like “MBL not available”, when I normally enjoyed about 40MB/s on average. If I copy the files one by one, they do get copied, but at a very slow transfer rate.

I'm really lost about what to do next. I have a feeling there is something consuming all the CPU and that this is why performance is so low. Maybe updating the firmware would help, but unfortunately I installed the latest one about 2 weeks ago, so I can't replace it with a newer one (or I don't know how to).

I'm about to buy a new NAS (a different brand, of course), pull the disk out of the case manually, connect it to a computer and back everything up to the new NAS. I know this is a really extreme decision, but I really don't know what else I can do.

If anybody has been in my situation before and managed to solve it, I'd really appreciate any help or suggestions to get things working as well as they were until yesterday morning.

Thanks a lot, and I apologize for the long post.

Well, when you’re doing a file transfer, look at the “top” report and see what’s using the CPU.

It’s normal for the process “smbd” to be high CPU during a copy, because that’s the Samba daemon.

Also, look for signs of network errors in both the PC and the MBL.
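
On the MBL side, the kernel keeps per-interface error and drop counters in /proc/net/dev. Below is a minimal Python sketch for reading them. The function name parse_net_dev is made up for illustration, and it assumes a Python interpreter is available on the box (otherwise the output of `cat /proc/net/dev` can be pasted into it on a PC).

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into per-interface error counters.

    Each interface line carries 8 receive counters followed by 8 transmit
    counters: bytes, packets, errs, drop, ... for each direction.
    """
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        values = [int(x) for x in fields]
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters
```

Reading the file before and after a transfer, for example with `parse_net_dev(open("/proc/net/dev").read())["eth0"]`, and comparing the numbers shows whether errors are piling up. Counters that stay flat suggest the network is probably not the culprit.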

Well… this is exactly one of the problems… I can't see any process consuming the CPU :(((

Have a look at this screenshot with the output of top… The only thing it is doing is opening the web UI, which by the way failed: Error 3001: Max timeout reached.

So it’s NOT CPU bound.

It’s likely IO bound as 99.3%wa indicates.

Sort the TOP by “waiting” processes to see what’s thrashing the disk.
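
For reference, the %wa figure in top is derived from the iowait counter on the first line of /proc/stat. To log it over time instead of watching top, something like the following rough Python sketch could be used. The function names are my own, and it assumes Python is available on the device.

```python
import time


def parse_cpu_line(line):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples.

    Counter order is user, nice, system, idle, iowait, irq, softirq, steal.
    Only these first eight are summed; the guest counters that may follow
    are already included in 'user'.
    """
    a = parse_cpu_line(before)[:8]
    b = parse_cpu_line(after)[:8]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the iowait share."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling `sample_iowait(5)` in a loop during a file copy gives roughly the same number top shows as %wa, averaged over each 5-second window.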

Sorry if it's a very basic question, but I don't know how to sort it by waiting processes :((

I tried ‘man top’, but I couldn't figure out how. Do you remember how it can be sorted?

While TOP is running, press

Shift-F

w

(enter)

Shift-R

(That’s Capital F followed by Lowercase w, ENTER, Capital R.)

That will sort by STATUS.   Look for things with Status “D” which should be at the top of the list.
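
If the interactive sort is awkward over ssh, the same information can be read straight from /proc, since the third field of /proc/<pid>/stat is the process state. This is only a sketch with made-up function names, assuming Python is available on the device:

```python
import glob


def parse_stat(text):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so the split is done around the last ')'.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked(stat_texts):
    """Return (pid, command) for every entry in state 'D' (uninterruptible sleep)."""
    result = []
    for text in stat_texts:
        pid, comm, state = parse_stat(text)
        if state == "D":
            result.append((pid, comm))
    return result


def read_all_stats():
    """Read every /proc/<pid>/stat, skipping processes that exit meanwhile."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                texts.append(f.read())
        except (IOError, OSError):
            continue
    return texts
```

Running `blocked(read_all_stats())` every second or so during a transfer lists the processes that are stuck waiting, usually on disk I/O, at that instant.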

Thank you so, so much Tony.

There are only 2 daemons running with ‘D’.

308 root      -2   0     0    0    0 D  0.0  0.0   0:00.00 btn_t
316 root      -2   0     0    0    0 D  0.0  0.0   0:00.34 a3gblink_t

I really have no clue what they are for… but they seem to be frozen…

After a few seconds this one also appears:

153 root      20   0     0    0    0 D  0.3  0.0   0:01.52 kswapd0

but it disappears again in 1 or 2 seconds.

Hi again.

2415 root      20   0     0    0    0 D  0.0  0.0   0:02.99 jbd2/sda4-8

I found this other process that appears intermittently, and when it does, wa reaches 99%. When it disappears, wa decreases a little to 75%-85%, but since jbd2/sda4-8 runs every few seconds, wa is at 99% most of the time.

This process is related to the ext4 journal, right?

Another weird behavior: right after I restart the unit, it seems to work fine with wa=0%. But after a couple of minutes copying files from the MBL, wa starts rising again to 99% and never drops back to 0%, even if I cancel the transfer.
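
One way to check whether the disk itself stays busy after the transfer is cancelled is to compare two snapshots of /proc/diskstats. The sketch below is only an illustration: the function names are invented, it assumes Python is available, and it assumes the data disk shows up as sda, as in the output above.

```python
def parse_diskstats(text, device):
    """Pick the counters for one device out of the contents of /proc/diskstats."""
    for line in text.splitlines():
        fields = line.split()
        # major, minor, name, then at least 11 counters
        if len(fields) >= 14 and fields[2] == device:
            return {
                "sectors_read": int(fields[5]),
                "sectors_written": int(fields[9]),
                "in_flight": int(fields[11]),
                "io_ms": int(fields[12]),  # milliseconds spent doing I/O
            }
    raise KeyError(device)


def disk_activity(before, after, device, interval_s):
    """Return (busy %, read KiB/s, written KiB/s) between two snapshots.

    Sector counts in /proc/diskstats are always in 512-byte units,
    so two sectors make one KiB.
    """
    a = parse_diskstats(before, device)
    b = parse_diskstats(after, device)
    busy = 100.0 * (b["io_ms"] - a["io_ms"]) / (interval_s * 1000.0)
    read_kib = (b["sectors_read"] - a["sectors_read"]) / 2.0 / interval_s
    written_kib = (b["sectors_written"] - a["sectors_written"]) / 2.0 / interval_s
    return busy, read_kib, written_kib
```

Read the file twice, a few seconds apart, and pass both strings in with the device name "sda". A busy figure near 100% combined with only a few KiB/s moved would suggest the disk is slow to answer requests rather than overloaded with work, though that is an interpretation and not a diagnosis.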

I ran dmesg after a restart, and the only messages I get related to the HD are these:

sd 1:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
sd 1:0:0:0: [sda] 4096-byte physical blocks
sd 1:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn’t support DPO or FUA
 sda: sda1 sda2 sda3 sda4
sd 1:0:0:0: [sda] Attached SCSI disk
md: Waiting for all devices to be available before autodetect
md: If you don’t use raid, use raid=noautodetect
md: Autodetecting RAID arrays.
md: Scanned 2 and added 2 devices.
md: autorun …
md: considering sda2 …
md:  adding sda2 …
md:  adding sda1 …
md: created md1
md: bind
md: bind
md: running:
md1: WARNING: sda2 appears to be on the same physical disk as sda1.
True protection against single-disk failure might be compromised.
raid1: raid set md1 active with 2 out of 2 mirrors
md1: detected capacity change from 0 to 2047803392
md: … autorun DONE.
 md1: unknown partition table
kjournald starting.  Commit interval 5 seconds
EXT3 FS on md1, internal journal
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) on device 9:1.
Freeing unused kernel memory: 448k init
Enable EMAC EMI Fix
eth0: link is up, 1000 FDX, pause enabled
eth0: no IPv6 routers present
EXT4-fs (sda4): recovery complete
EXT4-fs (sda4): mounted filesystem with ordered data mode
Adding 500608k swap on /dev/sda3.  Priority:-1 extents:1 across:500608k
svc: failed to register lockdv1 RPC service (errno 97).
Calling led_set_blink with value x

They don't tell me much. Any clue? :confounded:
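
For longer logs, a small filter can help narrow things down. This Python sketch keeps only the lines containing words that often accompany disk trouble. The keyword list is a guess for illustration, not an official list, and the function name is made up.

```python
# Assumed keywords; adjust to taste. Matching is case-insensitive.
DEFAULT_KEYWORDS = ("i/o error", "fail", "recovery", "timeout", "reset")


def suspicious_lines(log_text, keywords=DEFAULT_KEYWORDS):
    """Return the log lines that contain any of the given keywords."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line)
    return hits
```

Feed it the output of dmesg as a string. An empty result only means none of the guessed keywords appeared, not that the disk is healthy.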