No network connection after a reboot sometimes

alirz1 · June 25, 2015, 6:21pm

I have always had this strange issue with both of my WD drives.

Everything in my house has a static IP lease from the router based on the MAC address.

Quite often, though not every single time, when I reboot either of my WD My Cloud drives it ends up with a blinking yellow light, which usually means no network connection. At that point I have to unplug the power and plug it back in, after which it comes back on fine.

To work around this, say when I reboot the drive remotely or when it goes into this state after a power failure, I’ve put scripts in place that kick in after a reboot or restart and also run via cron every 2 hours.

The script checks whether the drive can ping the router and whether it can ping its own IP address. If the test fails, it triggers a “shutdown -r now”. This works fine if I, say, unplug the network cable or bring down the eth0 interface: the script reboots the drive.
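A minimal sketch of the kind of check described above. The actual script is not shown in this thread, so everything here is an assumption: the router address, the variable names, and the default action, which only prints the reboot command instead of running it.

```shell
#!/bin/sh
# Connectivity watchdog sketch. All names and addresses are placeholders.
ROUTER_IP="${ROUTER_IP:-192.168.1.1}"   # assumption: the router's address
SELF_IP="${SELF_IP:-192.168.1.4}"       # assumption: the drive's own address
PING_CMD="${PING_CMD:-ping -c 1 -W 2}"  # one probe, 2 second timeout
# Default action only prints; set ON_FAIL="shutdown -r now" to really reboot.
ON_FAIL="${ON_FAIL:-echo would run: shutdown -r now}"

# Succeed only if both the router and the drive's own address answer.
network_ok() {
    $PING_CMD "$ROUTER_IP" >/dev/null 2>&1 &&
    $PING_CMD "$SELF_IP" >/dev/null 2>&1
}

# Run the failure action when either probe fails.
watchdog() {
    if network_ok; then
        echo "network ok"
    else
        $ON_FAIL
    fi
}
```

Calling `watchdog` from cron or a boot script would then reproduce the behaviour described, provided cron and the filesystem holding the script are actually up at that point.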

However, as we speak, I am at work and rebooted the drive at home, and it seems to have gone into this messed-up state again and hasn’t come back online. It looks like the script did not trigger, or couldn’t trigger for whatever reason… I’m wondering if the file system is not accessible at this point, so the script cannot be read from the drive?

Has anyone seen this behavior of no network after a reboot? I know there are issues out there where the drive disappears from the network etc…

P.S. My router cannot be the problem. First, this is the third router I’ve been on since owning these drives, and they have always had the problem. I do not have overlapping IPs, and no other device at my house has this issue.

Ralphael · June 25, 2015, 7:26pm

The drives should never go offline by themselves unless it is a firmware problem in 4.x. My uptime would have been a year by now if I hadn’t rebooted my drive due to some “remote access app bug” that caused my drive to lock up half a year ago.

Great script solution, but you are patching a symptom rather than the problem.

So, a checklist:

1. router (I know you said it couldn’t be, but you could try a new switch or a new router temporarily)
2. cable (Cat 6)
3. the two scan programs that cause hectic activity, which may cause slowdowns or disconnects
4. attached USB device on bootup (remove all USB devices when booting)
5. could be your script… an occasional unroutable ping may not mean that your device is offline; it could be just a glitch, and the next thing that happens is a reboot. Comment the reboot step out and just log the instances when your device is unreachable, without a reboot.
6. if the device is offline, try wiggling the cable.

To check if it is your WD device, you can connect it directly to your PC using only an Ethernet cable to see if it drops off. Try various Ethernet cables.

If it is intermittent, I would guess the cable or the port on either the WD or the router. Try a different cable and a different port, and make sure the plug into the WD is solid; you can wiggle it a bit to see if it is loose.

Make sure you only have one copy of that script running :-P

Those are all my guesses…

alirz1 · June 25, 2015, 7:39pm

Thanks Ralphael.

1- I’ve already tried a new router and a new switch.

2- Cables cannot be the issue: two different drives, using different cables. In any case I have changed those at some point in the past.

3- I’m a power user. All scan tasks are permanently disabled on the drives! No iTunes, no Twonky, no wdmcserver, nothing!

4- No USB drive attached, EVER.

5- Already tested without a reboot. To add further on the script: it does several pings before it determines that there is no network connectivity. If the problem is detected, the script writes a log entry in a custom log file. I don’t see that file, so it means that the reboot is not triggering in this case… This makes me think that even though it’s a yellow blinking light, perhaps there is something else going on and it’s not really a network problem… The system logs don’t say anything useful either.
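The "several pings, then a log entry" behaviour could look roughly like the sketch below. The log path, attempt count, and delay are made-up defaults, and the function name is hypothetical; the real script is not shown in the thread.

```shell
#!/bin/sh
# Retry a probe several times before declaring the network down.
LOG_FILE="${LOG_FILE:-/tmp/netwatch.log}"  # hypothetical log path
ATTEMPTS="${ATTEMPTS:-5}"                  # assumed number of tries
DELAY="${DELAY:-10}"                       # assumed seconds between tries

# probe_with_retries CMD [ARGS...]
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
# Each failed attempt is appended to LOG_FILE with a timestamp.
probe_with_retries() {
    i=1
    while [ "$i" -le "$ATTEMPTS" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        echo "$(date '+%b %d %H:%M:%S') attempt $i/$ATTEMPTS failed: $*" >>"$LOG_FILE"
        i=$((i + 1))
        if [ "$i" -le "$ATTEMPTS" ]; then
            sleep "$DELAY"
        fi
    done
    return 1
}
```

For example, `probe_with_retries ping -c 1 -W 2 192.168.1.1 || echo "network down"` (the address is a placeholder). Because every failed attempt leaves a line in the log, a missing log file after a hang suggests the script never ran at all, which matches the reasoning in the post above.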

I even have a script running on my router that pings both WD drives every hour. If it gets no response, it sends me an email that a drive is unreachable…

I know I’m patching stuff, but that’s because I don’t know what’s causing the problem.

6- Happens on both drives. A power reset brings it back online, so physical connections should not be the cause.

I never had this problem with the 3.x firmware, only the 4.x versions. I’m on the latest at the moment.

Ralphael · June 25, 2015, 7:58pm

Was making cheese sandwiches, and here are my only other last guesses…

1. IPs outside of the DHCP range
2. time/clock syncs (just in case the statics get renewed… just kidding)
3. take one WD machine offline and just leave the other running to see if they might be clashing.

So that’s it… You might try downgrading one WD to version 3.x, which is what I’m on at the moment (no problems at all).

Do try a straight connect to a PC (eliminating routers, cables, switches etc).

If a straight connect drops off too, this would definitely sound like a firmware 4.x problem… in which case you really have to check what is going on in that Linux box… and you might be the only one that can come up with a workaround.

alirz1 · June 25, 2015, 8:10pm

I get it. Thanks for all your replies… But I think I already have that covered. I’ve had this issue for several months and I’ve spent a good amount of time troubleshooting it but never found anything.

1- Static lease. no dhcp

2- Times are in sync. NTP used. Correct time zone.

3- I see no way they can clash; their IPs are well apart.

I only upgraded to 4.x because Samba speeds are A LOT faster than on 3.x, and I’m all about optimizing things to the max, so I don’t see myself living with 3.x.

I could try a straight connect. Perhaps in the near future.

Thanks again

Ralphael · June 25, 2015, 8:37pm

I know… troubleshooting is absolutely terrible, but you should try it from my end when the user says I’ve done all that but give me an answer…

So I know that the following will probably offer no solution, but you might discover something in the process.

1. Static lease outside of the DHCP range: I meant something like a static IP of 192.168.0.199 with DHCP turned on to assign IPs in the range of 10-150. The reason to have DHCP assign in a different range is so that it doesn’t clash. I know that you said there aren’t any other devices on it, but this is just one of those things.
2. Clashing of the two WDs, but not the IPs. Rac8006 had two WDs, and SMB was causing the devices not to sleep. It might be possible that two WDs are causing some internal thrashing, so turn one off completely and see if the other one stays on.
3. Lastly, turn off remote access. I know that you want access, but this is temporary, to check whether WD is sending reboot signals. The reason I’m saying this is that when I accessed my remote program once last year, they had this new feature of remote update. I know I did not press the OK button on that, but for some reason my WD rebooted. So, just in case, give it a try… turn it off to see if that is a possibility.

Good luck… no more guesses…

Ralphael · June 25, 2015, 9:06pm

I lied… I have a couple more suggestions…

You could try adding some network commands to your script…

ifdown eth0

ifup eth0

or even stop and start your networking services

/etc/init.d/networking restart

log your scripts to see if they do indeed get “runned” and maybe even add

/sbin/ifconfig -a >mylogfile
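Putting those suggestions together, a hedged sketch might look like this. The log path and the `DRY_RUN` switch are inventions for illustration, and whether `ifdown`/`ifup` behave as expected on the My Cloud firmware is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Bounce the interface and log its state before and after.
IFACE="${IFACE:-eth0}"
NETLOG="${NETLOG:-/tmp/netbounce.log}"  # hypothetical log path
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print the commands

# Record every command in the log; execute it only when DRY_RUN is not 1.
run() {
    echo "$(date '+%F %T') + $*" >>"$NETLOG"
    if [ "$DRY_RUN" = "1" ]; then
        echo "dry-run: $*"
    else
        "$@" >>"$NETLOG" 2>&1
    fi
}

bounce_interface() {
    run /sbin/ifconfig -a    # interface state before
    run ifdown "$IFACE"
    run ifup "$IFACE"
    run /sbin/ifconfig -a    # interface state after
}
```

Note that the log is appended to with `>>`; a plain `>` as in the one-liner above would overwrite the previous entry on every run, which loses the history you want when chasing an intermittent fault.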

alirz1 · June 25, 2015, 10:58pm

Hi again…

Well, the script does get run via cron. I have cron logging enabled and I can see the script being called.
I had initially started off with a network restart instead of a reboot… didn’t try with ifdown and ifup though…

I still think that the reason the drive isn’t rebooting as per my script when I run into this problem might be that the script is not readable at that point… perhaps the data partition is not mounted?
I guess I will try moving the script to / or something…
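One way to test the "data partition is not mounted" theory is to have something that lives on the root filesystem check the mount table. A small sketch, assuming a Linux-style `/proc/mounts`; the function name is hypothetical, and the device to check (for instance `/dev/sda4`, the name that shows up in the boot log quoted in this thread) would need confirming on the actual drive.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears in the first column of the mounts table.
# MOUNTS_FILE defaults to /proc/mounts; the second argument exists so the
# function can be exercised against a sample file.
is_mounted() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}
```

Something like `is_mounted /dev/sda4 || echo "data partition not mounted"` could then be logged early in boot, though that only helps if the boot gets far enough to run it.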

For what it’s worth, this is the messages log after the drive recovered when I pulled the power and powered it back on…

The issue occurred at 14:03; I got home and power cycled it at around 18:05… The last line for the 14th hour shows the network link coming up, but then it stopped there… Now that I think more about it, if the logging stopped right there, the drive never proceeded to mount the partitions. So it seems that the OS never loads completely in this situation, and thus my script doesn’t even come into play…

Jun 25 14:04:43 nas1 kernel: [43.849187] PFE binary version: pfe_nas_2_00_3
Jun 25 14:04:43 nas1 kernel: [43.853696] pfe_firmware_init: class firmware loaded 0xbc0 0xc3010000
Jun 25 14:04:43 nas1 kernel: [43.860172] pfe_load_elf
Jun 25 14:04:43 nas1 kernel: [43.864893] pfe_firmware_init: tmu firmware loaded 0x1a0
Jun 25 14:04:43 nas1 kernel: [43.870233] pfe_load_elf
Jun 25 14:04:43 nas1 kernel: [43.875355] pfe_firmware_init: util firmware loaded 0x1220
Jun 25 14:04:43 nas1 kernel: [43.880914] pfe_ctrl_init
Jun 25 14:04:43 nas1 kernel: [43.913152] timer_add c30ae150
Jun 25 14:04:43 nas1 kernel: [43.916228] timer_add c30ae0f0
Jun 25 14:04:43 nas1 kernel: [43.919299] timer_add c30ae120
Jun 25 14:04:43 nas1 kernel: [43.922376] timer_add c30ae090
Jun 25 14:04:43 nas1 kernel: [43.925460] timer_add c30ae078
Jun 25 14:04:43 nas1 kernel: [43.928543] timer_add c30ae0a8
Jun 25 14:04:43 nas1 kernel: [43.932586] timer_add c30ae0c0
Jun 25 14:04:43 nas1 kernel: [43.935689] ipsec_common_hard_init ipsec_baseaddr:d0c00000 - espah_base:d0c00000
Jun 25 14:04:43 nas1 kernel: [43.944480] timer_add c30ae060
Jun 25 14:04:43 nas1 kernel: [43.947560] pfe_ctrl_init finished
Jun 25 14:04:43 nas1 kernel: [43.950980] pfe_eth_init
Jun 25 14:04:43 nas1 kernel: [43.952744] pfe_ctrl_timer
Jun 25 14:04:43 nas1 kernel: [43.967691] Comcerto MDIO Bus: probed
Jun 25 14:04:43 nas1 kernel: [43.975228] pfe_eth_init_one: preallocating rx & tx queue buffers
Jun 25 14:04:43 nas1 kernel: [43.981661] eth0: pfe_eth_init_one: created interface, baseaddr: d3200000
Jun 25 14:04:43 nas1 kernel: [43.988583] pfe_pcap_init
Jun 25 14:04:43 nas1 kernel: [56.998988] eth0: pfe_eth_open
Jun 25 14:04:43 nas1 kernel: [57.002204] hif_process_client_req: register client_id 0
Jun 25 14:04:43 nas1 kernel: [57.007595] pfe_hif_client_register
Jun 25 14:04:43 nas1 kernel: [57.011114] eth0: pfe_gemac_init
Jun 25 14:04:43 nas1 kernel: [57.014643] bcm54610_config_init: before 0x2c09, after 0x2c8c
Jun 25 14:04:43 nas1 kernel: [57.020673] bcm54610_config_init: before 0x01e1, 0x0300; after 0x0141, 0x0200
Jun 25 14:04:43 nas1 kernel: [62.023075] PHY: comcerto-0:00 - Link is Up - 1000/Full
Jun 25 18:05:35 nas1 kernel: [75.602532] jnl: driver (lke_9.0.0 lke_9.0.0_r233487_b2, LBD=ON) loaded at c30f0000
Jun 25 18:05:35 nas1 kernel: [75.687911] ufsd: module license ‘Commercial product’ taints kernel.
Jun 25 18:05:35 nas1 kernel: [75.694408] Disabling lock debugging due to kernel taint
Jun 25 18:05:36 nas1 kernel: [75.742571] ufsd:: trace mask set to 0000000f
Jun 25 18:05:36 nas1 kernel: [75.747029] ufsd: driver (lke_9.0.0 lke_9.0.0_r233487_b2, LBD=ON, acl, ioctl, bdi, sd(0), fua, bz, tr) loaded at c3130000
Jun 25 18:05:36 nas1 kernel: [75.747042] NTFS support included
Jun 25 18:05:36 nas1 kernel: [75.747046] Hfs+/HfsJ support included
Jun 25 18:05:36 nas1 kernel: [75.747049] optimized: speed
Jun 25 18:05:36 nas1 kernel: [75.747053] Build_for__WD_Sequoia_k3.2.26_2014-08-07_lke_9.0.0_r233487_b2

Below is a normal boot-up log. You can see that after the network link is up, the data partition gets mounted a few seconds later. Guess I’m answering my own questions now.

Can I use something like chkconfig to prioritize some services, e.g. leave the network to come up later in the boot process? It would probably have other implications!

Jun 9 23:15:17 nas1 kernel: [61.432868] PHY: comcerto-0:00 - Link is Up - 1000/Full
Jun 9 23:15:23 nas1 kernel: [76.633930] jnl: driver (lke_9.0.0 lke_9.0.0_r233487_b2, LBD=ON) loaded at c30f0000
Jun 9 23:15:23 nas1 kernel: [76.703485] ufsd: module license ‘Commercial product’ taints kernel.
Jun 9 23:15:23 nas1 kernel: [76.709910] Disabling lock debugging due to kernel taint
Jun 9 23:15:24 nas1 kernel: [76.759281] ufsd:: trace mask set to 0000000f
Jun 9 23:15:24 nas1 kernel: [76.763730] ufsd: driver (lke_9.0.0 lke_9.0.0_r233487_b2, LBD=ON, acl, ioctl, bdi, sd(0), fua, bz, tr) loaded at c3130000
Jun 9 23:15:24 nas1 kernel: [76.763741] NTFS support included
Jun 9 23:15:24 nas1 kernel: [76.763745] Hfs+/HfsJ support included
Jun 9 23:15:24 nas1 kernel: [76.763749] optimized: speed
Jun 9 23:15:24 nas1 kernel: [76.763752] Build_for__WD_Sequoia_k3.2.26_2014-08-07_lke_9.0.0_r233487_b2
Jun 9 23:15:24 nas1 kernel: [76.763758]
Jun 9 23:15:29 nas1 kernel: [82.342621] EXT4-fs (sda4): barriers disabled
Jun 9 23:15:30 nas1 kernel: [83.615823] EXT4-fs (sda4): mounted filesystem with writeback data mode. Opts: acl,user_xattr,data=writeback,barrier=0,init_itable=10
Jun 9 23:15:30 nas1 kernel: [83.706213] EXT4-fs (sda4): re-mounted. Opts: user_xattr,barrier=0,data=writeback
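The difference between the two excerpts can be spotted mechanically: in the stalled boot, the line after "Link is Up" is hours later, while in the normal boot it follows within seconds. A rough sketch using awk on the syslog wall-clock field; the function name and the 60 second threshold are arbitrary choices, and it only compares times of day, so gaps longer than 24 hours would be misreported.

```shell
#!/bin/sh
# stall_after_link LOGFILE [MAX_GAP_SECONDS]
# Reports any "Link is Up" line whose next log line arrives more than
# MAX_GAP_SECONDS later (default 60), or that is the last line in the file.
stall_after_link() {
    awk -v max="${2:-60}" '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        prev_link {
            gap = secs($3) - link_time
            if (gap < 0) gap += 86400   # crossed midnight
            if (gap > max)
                printf "stall: %d s of silence after link up at %s %s %s\n", gap, link_mon, link_day, link_clock
            prev_link = 0
        }
        /Link is Up/ {
            prev_link = 1
            link_time = secs($3)
            link_mon = $1; link_day = $2; link_clock = $3
        }
        END {
            if (prev_link)
                print "log ends right after link up at " link_mon " " link_day " " link_clock
        }
    ' "$1"
}
```

Run against the stalled excerpt above, the gap between 14:04:43 and 18:05:35 comes out as 14452 seconds; the normal boot (23:15:17 to 23:15:23) produces no output.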

larryg0 · June 26, 2015, 5:58pm

I would try setting a real static IP outside of the router’s DHCP range. With a lease it is still using DHCP; the router is just supposed to always give the same address.

when you get the yellow front light what are the rear lights doing?

are you sure it is yellow and not off-white? have you ever tried leaving it in this state for a day with any reboots and network resets disabled?

alirz1 · June 26, 2015, 6:06pm

Will give setting an IP outside the DHCP range a try.

As for the light in the front, it’s definitely the same light I get when I pull the network cable.

The lights in the back: one is solid green, the other blinks every 4-5 seconds. I think that’s a bit of a slow rate? I never really look in the back, but I think with a normal connection the blink rate would be a bit faster.

I know that if it’s a white blinking light, the drive is doing a disk check or something, but I don’t think that’s the case here.

alirz1 · June 26, 2015, 6:25pm

One more thing to add: once the drive is in this “no network” state, even rebooting the router doesn’t fix it.

Ralphael · June 26, 2015, 7:05pm

Hmm I’m trying to post from my iPad but there is no quote…

Anyways, yes, even with a router reboot it still doesn’t know that your device exists, because it has forgotten about it and returned the IP to the available pool.

That is why I kept typing up the solution of assigning a static outside the dhcp range.

Let us know if that was the solution because we keep blaming firmware 4.x

Edit: let’s say this is the problem, would a release and renew on a static ip work? Put that in the script?
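If the drive is obtaining its address through DHCP (a reservation rather than a truly static configuration), a release and renew from the client side could be sketched as below. This assumes the ISC `dhclient` program is present, which is not confirmed for the My Cloud firmware; the `DRY_RUN` switch and function name are illustrative only. On a drive configured with a real static IP there is no lease to renew, so this would not apply.

```shell
#!/bin/sh
# Release and re-request a DHCP lease on one interface.
IFACE="${IFACE:-eth0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

renew_lease() {
    # "dhclient -r" releases the current lease; plain "dhclient" requests one.
    for cmd in "dhclient -r $IFACE" "dhclient $IFACE"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "dry-run: $cmd"
        else
            $cmd || return 1
        fi
    done
}
```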

alirz1 · June 26, 2015, 7:27pm

Router’s DHCP pool is 192.168.1.2-50, i.e. 48 devices. Lease time is 2 days.

I have configured 192.168.1.2-15 with static leases set for particular devices based on MAC addresses.

192.168.1.4 is reserved for WD1 (currently stuck in the no network state as i rebooted it like an hour ago)

192.168.1.14 is reserved for WD2 (this one came back online when i rebooted it earlier today, however the problem occurs on this also sometimes)

The router will NEVER assign the IP address reserved for the WD1 to any other device.
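Whether a given address sits inside or outside the DHCP pool, which is the point being argued in this thread, is easy to check numerically. A small sketch with hypothetical helper names, reading the pool as 192.168.1.2 through 192.168.1.50 (an interpretation of the range given above); it assumes dotted-quad input with no leading zeros.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the IPv4 address as a single integer.
ip_to_int() {
    old_ifs="$IFS"
    IFS=.
    set -- $1            # split the address into its four octets
    IFS="$old_ifs"
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# in_pool IP START END: succeed if START <= IP <= END.
in_pool() {
    ip="$(ip_to_int "$1")"
    [ "$ip" -ge "$(ip_to_int "$2")" ] && [ "$ip" -le "$(ip_to_int "$3")" ]
}
```

With these, `in_pool 192.168.1.4 192.168.1.2 192.168.1.50` succeeds, meaning the reserved addresses for both drives lie inside the pool, while an address such as 192.168.1.100 falls outside it.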

So if the router were to blame, I would expect it to re-issue the IP once I had rebooted the router. But as that doesn’t happen, I still blame the WD drive for not requesting the IP. Or if it fails to load the OS or whatever on certain reboots, I don’t know what to say about that.

EDIT: That’s why I had initially tried a network restart, but no luck… I think the script is not triggered when the drive encounters this problem. Even though I moved the script to the / partition, it’s possible cron is not even initialized at that point? That is likely!

Ralphael · June 26, 2015, 7:40pm

Alirz1,

We are not trying to assign blame but looking for a solution.

So let’s do this, assign wd1 and wd2 to 192.168.1.100 and 192.168.1.101 respectively

Optionally leave one at the lower IP and try release and renew.

Remember that I didn’t say it would reassign it to another device, but it might conveniently forget that it is being used even if it’s reserved. Also, a reservation might mean that you still use DHCP on the device rather than static. A reservation means you get the same IP, but the lease still expires and needs to be renewed.

Edit: optionally turn on DHCP on both WDs and be surprised that they stay connected.

Ralphael · June 29, 2015, 4:25pm

Hey alirz1,

How did the solution work out for you? are you now always connected… let us know…

alirz1 · June 29, 2015, 5:15pm

Well, I didn’t try assigning them IPs from outside the DHCP pool yet. I don’t really want to change the WDs’ IPs because I have several scripts, rsync, WebDAV, and Transmission running on the drive that use the specific IP I have right now.

Changing the IP of the drive will cause me quite a bit of headache: I would have to go around the house updating computers that map drives to this drive, firewall rules, etc. Just too many things…

However, I’m trying something again that I had tried in the past. I’ve disabled the static lease for the drive on the router, but I’ve configured the drive to use the same static IP address using the WD dashboard. So far I’ve rebooted 3 times and the drive has come back online with no problem.

That being said, with this config I used to have another issue with the drive that others on this forum have also talked about often: the drive “disappears from the network” after some random time. At that point everything looks fine on the drive visually, i.e. solid blue LED, good networking LEDs in the back, etc. However, the drive is not accessible, pingable, SSH-able, etc.

Will see how it goes this time. Will keep you updated…

Ralphael · June 29, 2015, 5:29pm

good to hear from you…

Yes, I can understand that you don’t want to change the ips

1. However, you could change the DHCP pool to the range of 50-100, thus ensuring that your static IPs remain safe.
2. Alternatively, you could re-enable the static lease and change the WD to DHCP to see if it gets the same IP that was reserved for it.

I think #1 is the easiest and safest and will change nothing in your current configuration.

Thanks again for responding…

larryg0 · June 29, 2015, 8:46pm

You don’t want to disable the lease and set the My Cloud to use that address. Unless I missed it, those IPs are still in the DHCP pool, and without the lease the router is free to assign those IPs. It may not happen often, but it will at some point.

Either leave the lease in place or pull those IPs out of the DHCP pool. Having a lease set to the same IP as the My Cloud’s static IP won’t hurt anything, as the router would only assign that IP when that My Cloud requests it, which will never happen since it is static.

Edit:

A comment about rebooting the router: this would not help get the My Cloud back online unless it is a bad router, as DHCP requests are initiated from the client (the My Cloud), which would only happen at lease expiration or a My Cloud reboot. The router could never push this to the client.

alirz1 · June 30, 2015, 12:53am

That’s a good point. I didn’t keep that in mind. I’ll put the static lease back on the router as well. Though that would almost never occur, as 95% of the devices at my place are all set for static IPs by the router.

As for your comments about DHCP: my thought behind the router reboot was that if, for whatever reason, the router was not responding to the lease request from the WD (after the WD is rebooted), or if the router’s previous static lease for the WD drive was “stuck” or not being renewed after the WD reboot, rebooting the router would delete ALL static leases and they would be re-issued.

larryg0 · June 30, 2015, 1:35pm

“Rebooting the router would delete ALL static leases”: they would be released.

“and they would be re-issued.” But not re-issued until the My Cloud requested it, which is basically a reboot or lease expiration.