EX4100 becomes unresponsive, loses network

Hi all,

I added the following two kernel settings to my cron script so they are always applied after the RAID device is mounted:

#Set Reboot on Kernel Panic
echo 30 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/panic_on_oops
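
For anyone copying this into their own script, here is a minimal sketch of the same two writes wrapped in a function that also reads the values back. `kernel/panic` is the number of seconds the kernel waits after a panic before rebooting (0 means never reboot), and `panic_on_oops` set to 1 turns an oops into a full panic so the timer applies to it too. The function name and the `SYSCTL_DIR` override are my own additions for illustration, not part of the script I run.

```shell
#!/bin/sh
# Sketch only: set_panic_reboot and SYSCTL_DIR are hypothetical names.
# SYSCTL_DIR defaults to the real location; point it at a scratch
# directory to try the function without touching the kernel.

set_panic_reboot() {
    # $1 = seconds to wait after a panic before rebooting (default 30)
    panic_timeout="${1:-30}"
    dir="${SYSCTL_DIR:-/proc/sys/kernel}"

    echo "$panic_timeout" > "$dir/panic" || return 1
    # Treat an oops as a panic so the reboot timer covers it as well
    echo 1 > "$dir/panic_on_oops" || return 1

    # Read the values back so the cron log shows what is actually set
    echo "panic=$(cat "$dir/panic") panic_on_oops=$(cat "$dir/panic_on_oops")"
}
```

Calling `set_panic_reboot 30` as root from the cron script should print `panic=30 panic_on_oops=1` if both writes were accepted.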

I then triggered a kernel panic as described in “How to use kdump to debug kernel crashes” on the Fedora Project Wiki (Step 2), waited, and the NAS did end up rebooting by itself, so this might be a “workaround” for a hung NAS. It is by no means perfect, as it still leaves the disks in an inconsistent state.
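
A rough sketch of that test, assuming the wiki step is the usual magic SysRq crash trigger (write `1` to `kernel/sysrq`, then `c` to `sysrq-trigger`). The function name, the confirmation flag and the `PROC_ROOT` override are hypothetical and only there so nobody crashes their NAS by accident. Run it as root, and only when nothing is writing to the disks.

```shell
#!/bin/sh
# Sketch only: crash_test, --yes-crash-now and PROC_ROOT are made-up names.
# With the default PROC_ROOT this WILL panic the kernel immediately.

crash_test() {
    if [ "$1" != "--yes-crash-now" ]; then
        echo "refusing: pass --yes-crash-now to really panic the kernel" >&2
        return 2
    fi
    root="${PROC_ROOT:-/proc}"

    sync                                 # flush pending writes; a panic will not
    echo 1 > "$root/sys/kernel/sysrq"    # enable all SysRq functions
    echo c > "$root/sysrq-trigger"       # 'c' forces an immediate crash
}
```

If the reboot-on-panic settings are active, the unit should come back by itself after the configured delay.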

One thing which I am not sure WD have considered is the kernel level. Currently it is 3.10.39, but I don’t remember this issue happening when I originally got the NAS about 2+ years ago. If the kernel has been updated in the meantime, perhaps that introduced a bug which now causes kernel panics. Compiling a newer kernel for the device could possibly help?

Perhaps someone from WD Staff or a moderator could comment. The only way to get further is to get a serial console working on the EX4100, or to see if something like netconsole could be compiled and installed on it.

Cheers,

JediNite

Hi all,

I downloaded the source code and it looks like if netconsole is to be used, then a whole kernel needs to be compiled and a new firmware image installed. This is because the existing kernel does not contain “netpoll” support, which must be built into the kernel. Once that is done, “netconsole” can work as a loadable module.

I’ve not had any more kernel panics on the NAS since putting in the cron jobs, so I can’t comment further on it. I am looking at ways to get access to the serial console UARTs, and there are some guides from others out there on how to do that. If WD are planning on releasing new firmware for this unit in the near future, it would be nice if they could build “netpoll” and “netconsole” into it. :slight_smile:

Cheers,

JediNite

Hi all,

Just had this happen to the EX4100 NAS a few hours ago and it looks like it rebooted all by itself. So we have definitely proved that a kernel panic is the cause of the lockups. We really need WD to get involved if we are going to get any closer to solving this issue, or else look at getting a working serial port connection.

Cheers,

JediNite

Hi all,

Just put in a Feature Request on the forums at this link:
https://community.wd.com/t/add-netpoll-and-netconsole-to-kernel-on-ex4100/212390/1

If you want to see if this can get resolved, please upvote this request.

Cheers,

JediNite

I’ve had better luck with mine turning Drive Sleep OFF. It would make sense actually if that feature was related to the freeze. I would almost bet if the device was constantly being accessed and never fell asleep that it wouldn’t lock up normally. I hate to put wear-and-tear on the unit though so I’m keeping it powered down until I want to make backups. That defeats the entire purpose of a NAS unfortunately, and until Western Digital actually recognizes the problem and addresses it, I can’t recommend the product to anyone. I think they’re waiting on us to debug it honestly. I hate to leave a negative review on Amazon saying “don’t buy this” but that looks like the only way to get attention.

I’ve debugged this as much as I can now without either rebuilding a kernel or soldering up the internal serial ports. I have raised a case with WD detailing my findings and am hoping they can offer some next steps.

But yeah I think you are right in that they are choosing not to debug it, and are telling everyone who logs the problem that they can’t reproduce it themselves to save them the need to try and resolve it.

Cheers,

JediNite

Hi everyone,

I was about to buy 2 EX4100s until I read this thread as part of pre-purchase research.

I wonder if this issue is related to a certain firmware upgrade and if it is resolved now?

Thanks

@Jak0ps

I suspect the issue is related to specific firmware. WD Support asked me to send them some logs, but I am unsure what value they will get from them, as the logs are all cleared and written fresh on reboot of the unit, so they don’t capture anything relating to why it failed. That said, the unit has been up for 8 days so far and counting, but will probably lock up again in the next day or so.

Cheers,

JediNite

Up to 18 days now. This is almost like a record for it… Now up to 22 days…

Got to 33 days and looks like it crashed and auto rebooted last night…

Switching to jumbo frames (MTU 9000 instead of the default 1500) helps a bit: it now locks up after 4 days instead of the usual 1 day.

I had a stretch of about 3 months where this happened weekly, for no reason. It was just dead and I had to pull the plug to get it running again.

I took down all 3rd party software and only reinstalled what I actually use, and I haven’t seen the issue much as of late, but I’m sure it’s just waiting to hang at the moment I really need it working.

It’s a chronic issue though and needs to be addressed.

Here is a list of the reboots of my NAS for the last few months since I put in place the cron job every 5 mins.

root@WDMyCloudEX4100 Public # grep "up 1[12345] min, " five_min_cron.out.old
2017/07/11 07:55:03 07:55:03 up 13 min, load average: 0.64, 0.90, 0.71 0
2017/07/11 18:20:03 18:20:03 up 13 min, load average: 0.29, 0.59, 0.51 0
2017/07/11 21:35:03 21:35:03 up 11 min, load average: 1.21, 1.21, 0.72 0
2017/07/15 13:55:03 13:55:03 up 13 min, load average: 0.18, 0.59, 0.55 0
2017/07/16 11:40:03 11:40:03 up 11 min, load average: 0.26, 0.36, 0.30 0
2017/08/18 21:55:03 21:55:03 up 13 min, load average: 1.01, 1.70, 1.15 0
2017/08/29 15:50:03 15:50:03 up 12 min, load average: 1.15, 1.52, 1.03 0
2017/09/04 16:05:13 16:05:13 up 14 min, load average: 0.30, 0.38, 0.33 1
2017/09/04 18:35:13 18:35:13 up 11 min, load average: 0.21, 0.40, 0.35 1
2017/09/10 11:40:03 11:40:03 up 11 min, load average: 0.71, 1.79, 1.12 0
2017/09/18 04:50:03 04:50:03 up 14 min, load average: 1.53, 1.98, 1.47 0
2017/09/30 14:40:03 14:40:03 up 12 min, load average: 1.14, 2.39, 1.49 0
2017/10/07 10:50:03 10:50:03 up 13 min, load average: 0.44, 1.35, 1.01 0
2017/10/13 17:05:04 17:05:04 up 15 min, load average: 0.70, 1.42, 1.21 0
2017/10/14 16:15:30 16:15:30 up 11 min, load average: 1.24, 2.02, 1.19 0
2017/10/14 16:20:03 16:20:03 up 15 min, load average: 0.61, 1.19, 1.05 0
2017/10/14 20:05:03 20:05:03 up 12 min, load average: 0.84, 2.31, 1.47 0
2017/10/17 12:10:03 12:10:03 up 15 min, load average: 1.14, 1.79, 1.41 0
2017/10/17 14:20:03 14:20:03 up 14 min, load average: 0.44, 1.32, 1.07 0

There are a few days there where the NAS rebooted a few times (some of the “15 mins” are duplicates). There is no rhyme or reason behind the reboots / lockups. There was no different activity on the NAS, and some happened even when no one was home to cause any load on it.
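
To tally the list without counting the duplicates by hand, something like the following sketch could be used. It assumes the log format shown above (date, time, then the `uptime` output), works out the approximate boot minute as time minus uptime, and skips a line when it points at the same boot as the previous one. The function name is made up, and a reboot that straddles midnight would be counted twice.

```shell
#!/bin/sh
# Sketch only: count_reboots is a hypothetical helper.
# $1 = cron log in the format shown above. Prints "YYYY/MM count" per month.

count_reboots() {
    awk '
    / up 1[1-5] min, / {
        split($2, t, ":")              # $2 is HH:MM:SS
        boot = t[1] * 60 + t[2] - $5   # minute of day minus uptime in minutes

        # Same day and boot minute within 2 of the last one: same reboot
        d = boot - last_boot
        if ($1 == last_date && d >= -2 && d <= 2) next

        last_date = $1
        last_boot = boot
        month = substr($1, 1, 7)
        if (!(month in n)) order[++k] = month
        n[month]++
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
    ' "$1"
}
```

For example, `count_reboots five_min_cron.out.old` run from the Public share prints one line per month with the number of distinct reboots.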

The sounds of crickets from WD on this is very disappointing!

I am having the same issues with the EX4100 becoming unresponsive, and I would like to try your solution, JediNite, but I am not an expert.

How can I create this script and add the cron job?
Do I create a text file in /etc/cron.d with your 3-line code to create the script?
Then how do I add the cron job after the RAID device is mounted?

Thank you in advance

@panandreas

There is a thread in the forums, “How to Make Persistent System Changes (crontab, etc)”, which details how to make the changes.

I have a script on my system in a directory called “/mnt/HD/HD_a2/scripts”. I created this directory via an SSH session and not via the web GUI. In this folder I have a script called “five_min_cron.sh”. The script is as per attached.

five_min_cron.sh.txt (1.1 KB)

I am also running entware-ng on my NAS so this script also sets up that environment and starts up the applications I run within that environment.
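
For anyone who cannot open the attachment, here is a rough sketch of the logging part of a script of this kind. It is not the attached file. It only assumes what the log lines in this thread suggest: a timestamp, the `uptime` output, and a trailing status number, which I am treating here as the exit code of a single ping. The function name, the log path and the ping target are placeholders.

```shell
#!/bin/sh
# Sketch only, not the attached five_min_cron.sh.
# log_status, the log path and the ping target are hypothetical.

log_status() {
    # $1 = log file to append to, $2 = host to ping
    ping -c 1 -W 2 "$2" > /dev/null 2>&1
    ping_rc=$?    # 0 if the host answered, non-zero otherwise

    # Squeeze the padding out of uptime so each entry is one tidy line
    up="$(uptime 2> /dev/null | sed 's/^ *//' | tr -s ' ')"
    echo "$(date '+%Y/%m/%d %H:%M:%S') $up $ping_rc" >> "$1"
}
```

A cron script could then call, for example, `log_status /mnt/HD/HD_a2/Public/five_min_cron.out 192.168.1.1`, with the path and address adjusted to your own setup.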

The entry in my /usr/local/config/config.xml file is as follows:

.
.
           <crond>
                    <list>
                            <count>7</count>
                            <name id="1">stime</name>
                            <name id="2">wd_crontab</name>
                            <name id="3">app_get_info</name>
                            <name id="4">recycle_bin_clear</name>
                            <name id="5">chk_wfs_download</name>
                            <name id="6">ga_cron_q</name>
                            <name id="7">ga_cron_d</name>
                            <name id="8">five_min_cron</name>
                            <name id="9">random_check</name>
                            <name id="10">user_expire_chk</name>
                            <name id="11">fw_available</name>
                    </list>
.
.
                    <five_min_cron>
                            <count>1</count>
                            <item id="1">
                                    <method>3</method>
                                    <1>*/5</1>
                                    <2>*</2>
                                    <3>*</3>
                                    <4>*</4>
                                    <5>*</5>
                                    <run>/mnt/HD/HD_a2/scripts/five_min_cron.sh</run>
                            </item>
                    </five_min_cron>
.
.
           </crond>
.
.

Sorry about the formatting, but if you need a copy of the file I can try and upload it once the details on my own NAS configuration are removed.

Once all in place, reboot it and check.
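
One way to do that check from an SSH session is sketched below. It assumes the firmware turns the config.xml entry into an ordinary crontab line for root that shows up in `crontab -l`, which I have not confirmed for every firmware version. The helper name is made up.

```shell
#!/bin/sh
# Sketch only: find_cron_entry is a hypothetical helper.
# Reads a crontab listing on stdin and prints the lines that mention
# the given script path; exits non-zero if there are none.

find_cron_entry() {
    grep -F -- "$1"
}
```

Usage would be `crontab -l | find_cron_entry /mnt/HD/HD_a2/scripts/five_min_cron.sh`. If nothing is printed, the entry did not survive the reboot.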

Cheers,

JediNite

Thank you very much JediNite!!!

I have created your script (starting from the ping command and all the way down) and changed /usr/local/config/config.xml, and after a reboot I have seen the script output inside the Public folder.

Do I need entware-ng so this script is executed inside that environment, or do the “Reboot on Kernel Panic” commands work anyway?

@panandreas,

No need for the entware-ng stuff. I am running stuff like sickbeard which uses it, which is why I have it installed.

Cheers,

JediNite

Hi guys,
I purchased a MyCloud EX4100 last week, and I have the same issue, and it’s happening EVERY DAY!
It’s so annoying and frustrating!!!
Is there a workable solution, or is the only option an exchange (and praying that the new one is OK)?

I would like to know something.

After you pull the plug to get the system back up and running, wait an hour and then log into the web console. Go to Settings, then Utilities, then press View Logs.
Are there any errors listed in the logs?

@Jeff_Davis

It comes up with “Power loss detected on port 1” type messages.

Cheers,

JediNite