Raspberry Shake response to loss of network

K2PI · June 17, 2024, 10:33am

I have in the past asked for a user-selectable notification when an RS fails to report in. I am hoping this is still being worked on.

I just had this happen again, and, of course, as usual, lost data before I noticed. This seems to happen when the Wifi link drops (for reasons I am troubleshooting, probably a new wifi bridge needed). When this happens, the data producer goes off and remains off, even after the network is reconnected.

I’d like to know the normal programmed behavior of the RS when the network is interrupted. How long does it try, are there any fallbacks, what does it do when the network is restored, and how does it establish the heartbeat/keep-alive that tells it the network is there?

Thank you.

Stormchaser · June 18, 2024, 8:56pm

Hello K2PI,

Regarding your query at the end, I’ve asked for more details on the reconnection protocols, and I’ll report back as soon as possible. It’ll probably be next week as our manager is taking some well-deserved days off.

I want to try and replicate the issue on my side, using one of the older Shakes I have. It would be interesting to understand if this is only WiFi-related or if it is happening with Ethernet connections, too.

If I manage to do that, I could collect more data to provide our software team; it’s worth a try. In any case, you can rest assured that all your previous feedback, which I remember well, has always been logged and taken into account for improvements in the next Shake OS releases.

Stormchaser · June 24, 2024, 2:21pm

Hello again K2PI, some questions from our team:

Do you perchance have the logs from the Shake when (or immediately after) the last loss of network happened?
When this connection loss happens, have you checked (via the local helicorder, accessible from rs.local/heli or yourShakeIPaddress/heli) if the data is still locally saved on the Shake?

This is because, if we are facing “only” a loss of internet connection, data is not lost and is still accessible on the Shake itself. However, if data is not present on the Shake, then the situation is more worrying, as there could be software/hardware issues that need to be solved.

Regarding your behavior query, when the Shake doesn’t find an active internet connection at start, it will try for some time before stopping and continuing the booting process (this is to not make the Shake “hang” and to start recording some data, even if offline). When the Shake is instead already on and running, a situation like yours can happen where connection can only be re-established by restarting the Shake.

This depends on many factors related to connection protocols and any router/bridge/link equipment that is used. As there are hundreds if not thousands of different models, it can be difficult to provide more precise support, and we thus limit ourselves to general indications.

I’ve tried to replicate the issue on my side but was unable to. Could you describe one such disconnection episode in more detail? For example: internet/power goes away and is then restored, how the WiFi bridge behaves, and anything else that comes to mind, so that we have more to work on.

Thank you.

K2PI · June 27, 2024, 12:20am

Hi Stormchaser,
I can provide the logs, but frankly they will tell us nothing useful I believe. Something caused the shake to lose the wifi connectivity from the bridge. It could be the bridge or my network. I am not blaming the shake for that.

What I am pointing out is that the connection loss is noticed because the Shake is no longer visible at RS.Local, and the shake needs to be rebooted to make it available. So far, this is my problem to deal with. But, after that, the data the shake collected during that network outage is only available locally.

In other words, it doesn’t appear that the device does anything to upload the old logs once connectivity is restored, so other than me using local tools to look at the data, it is unavailable in the app or online at the raspberryshake.org site.

I don’t know why the decision was made to not send data except for real-time data that is collected while the system is in a connected state. I can deal with connectivity issues every now and then, and I’m sure all of us deal with this from time to time. But, the shakes once reconnected do not appear to then catch up and send that backlog of data to the servers. Is mine the only one that doesn’t do that, or is that by design? Is there any way to force the Rshake to upload that backlog to the system, filling in the missing data for our stations?

Thank you.

ivor · June 27, 2024, 5:12pm

hello K2PI,

first, please forward the logs. while it is possible that there’s nothing useful there, it is also possible that something useful is to be found there.

for example: you say that the data-producer program stops when the connection to the network is lost, but this is not normal behavior; the data-producer program has zero dependency on the existence of a network connection or not. even when the unit starts with no network, the system will start using the clock time as it is, without the NTP client being connected to a server. in this case, the data packet timestamps will be wrong, but when the data is stored locally only, this is not an issue.

regarding data recovery to the server after a network outage, the following sequence of events takes place:

1. connection to server is lost
2. data acquisition off the serial port continues uninterrupted
3. data is queued and stored in memory for a period of time depending on the unit type, i.e., the total number of data channels, versus the amount of RAM available to store the queued data. (this can be between approximately 20 minutes and 3 hours, depending on several factors.)
4. if the RAM limit is reached, data packets are dropped so as to avoid crashing the program when it would run out of RAM
5. once the unit reconnects to the server, all data on the internal queue is forwarded to the server. depending on the length of the outage
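
to illustrate the sequence above, here is a minimal Python sketch of a bounded in-memory queue that keeps accepting packets during an outage, drops packets once a fixed limit is reached, and forwards whatever is left when the link comes back. this is not the Shake's actual code: the class name `OutageBuffer`, the `max_packets` limit, and the choice to drop the oldest packets first are all assumptions made for the example (the thread does not say which packets are dropped).

```python
from collections import deque


class OutageBuffer:
    """Bounded in-memory queue for packets awaiting upload (illustrative only)."""

    def __init__(self, max_packets):
        # deque(maxlen=...) discards the OLDEST item when a new one is appended
        # to a full queue. Dropping oldest-first is an assumption of this sketch.
        self._queue = deque(maxlen=max_packets)
        self.dropped = 0

    def push(self, packet):
        """Queue a packet; count a drop if the queue was already full."""
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(packet)

    def flush(self, send):
        """Forward queued packets in order via send(packet); return the count sent."""
        sent = 0
        while self._queue:
            send(self._queue[0])   # if send raises, the packet stays queued
            self._queue.popleft()
            sent += 1
        return sent


buf = OutageBuffer(max_packets=3)
for seq in range(5):               # five packets arrive while the link is down
    buf.push(seq)

received = []
buf.flush(received.append)         # link is back: drain the queue in order
print(received, buf.dropped)       # → [2, 3, 4] 2
```

the point of the example is only that an outage longer than the queue can hold leaves a permanent gap on the server side, even though the local copy on the unit is unaffected.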

if you want to know sooner rather than later that the shake has lost connectivity to the network: since this is a local phenomenon, i would suggest some type of local solution which would alert you to this network exception. relying on server-side solutions to identify an issue that exists on the client side of the local network will always be unreliable.
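
as one possible shape for such a local solution, the sketch below could run on another machine on the same LAN: it probes the shake with a plain TCP connection and raises an alert after several consecutive failures. everything here is an assumption for illustration: the names `is_reachable` and `LinkWatchdog`, port 80 (guessed from the rs.local/heli web page mentioned earlier in the thread), the threshold of 3, and the alert callback, which you would replace with email, a push message, or whatever suits you.

```python
import socket


def is_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:                # covers DNS failure, refusal, and timeout
        return False


class LinkWatchdog:
    """Call alert() once when `threshold` consecutive probes have failed."""

    def __init__(self, probe, alert, threshold=3):
        self.probe = probe         # callable returning True when the unit answers
        self.alert = alert         # callable invoked when the threshold is hit
        self.threshold = threshold
        self.failures = 0

    def check(self):
        """Run one probe; return its result and update the failure count."""
        if self.probe():
            self.failures = 0
            return True
        self.failures += 1
        if self.failures == self.threshold:
            self.alert()           # fires once per outage, not on every probe
        return False
```

a typical use would be to build `LinkWatchdog(lambda: is_reachable("rs.local"), my_alert)` and call `check()` once a minute from a loop or a cron job. requiring several failures in a row avoids alerting on a single dropped probe.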

hope this clarifies things somewhat for you. please send the log files and any other questions you may have.

cheers,
richard