Shake Runs Connected to the server, then disconnects

SchoolShake · January 28, 2025, 6:01am

Hi!

I’m having an issue with one of our RS instruments. It is in a private home, not behind any specifically restrictive firewalls. It runs for a while (this amount of time can vary, but usually 10-24 hrs). During this period of time, it is successfully connected to the RS Data Server. You can view the data there. However, at a certain point, it loses the connection to the RS data server.

The webpage and myshake.log output show no server connection. During this time frame I can see that the instrument is still archiving its data locally. I also collect data from the instrument remotely by connecting to its seedlink server, and I am successfully acquiring data even while it reports no data connection to RS.

It seems that an rsh-restart does not bring back the connection to the RS Data Server (although seedlink works again); rebooting the Shake, however, does correct this. I’m curious which logs I should look at.

The instrument in question is AM.R4BAD. This is not one of our SchoolShake instruments behind a school firewall, which is a previous problem we’ve had and are still trying to sort out (it seems the firewall rules are being reset). I have a very old ticket about that problem that I didn’t answer. I may start a new ticket for that, separately from this one…

Thanks,
Andrew

Stormchaser · January 28, 2025, 9:22am

Hello Andrew,

This sounds like an interesting case! Could you post the entire log set here when you have a minute?

Thank you!

SchoolShake · January 28, 2025, 5:02pm

Hi @Stormchaser. Thank you! Here is a set of logs downloaded a couple of days ago just before it was rebooted (and the problem fixed).

Cheers,
Andrew
dmesg.out (20.1 KB)
RSH.R4BAD.2025-01-27T04_42_48.logs.tar (4.3 MB)

Stormchaser · January 30, 2025, 11:07am

Hello again,

Thank you for the file and the logs. From what I can see, there are quite a few of these during the past year:

2024 240 00:01:05>>    NTP timing is not available, so an accurate data timestamp is also not possible.

2024 253 17:35:22>>    Time adjustment M0: HARD RESET.  This will result in a one-time time-tear.

which indicate that the Shake was not able to reach any NTP time synchronization server and then had to hard reset its timing when the connection was established again.

There are also some error lines such as these:

2024 261 18:34:56>>    JSON Packet error: ']' expected near 'MA'
2024 261 18:34:59>>        Cannot process record, throwing it away.
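To see how often these messages occur, and on which days, the log lines can be tallied with a short script. This is only a minimal sketch: it assumes the prefix is `YEAR DAY-OF-YEAR HH:MM:SS>>` (my reading of the lines quoted above, not a documented format), and the category labels in `PATTERNS` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed line layout: "2024 240 00:01:05>>    message text"
# (year, day of year, time of day). This is inferred from the quoted lines.
LINE_RE = re.compile(r"^(\d{4}) (\d{3}) (\d{2}):(\d{2}):(\d{2})>>\s*(.*)$")

# Hypothetical labels mapped to substrings taken from the quoted log lines.
PATTERNS = {
    "ntp_unavailable": "NTP timing is not available",
    "hard_reset": "HARD RESET",
    "json_error": "JSON Packet error",
}


def parse_line(line):
    """Return (timestamp, message) for a matching log line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    year, doy, hh, mm, ss = (int(g) for g in match.groups()[:5])
    # Day-of-year 1 is January 1, so add (doy - 1) days to the start of the year.
    stamp = datetime(year, 1, 1) + timedelta(
        days=doy - 1, hours=hh, minutes=mm, seconds=ss
    )
    return stamp, match.group(6)


def tally(lines):
    """Count occurrences of each pattern per calendar day."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not follow the assumed layout
        stamp, message = parsed
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[(stamp.date().isoformat(), label)] += 1
    return counts
```

Passing an open log file (or any list of lines) to `tally` gives a per-day count for each message type, which makes it easier to see whether the errors cluster around the times the server connection drops.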

The fact that, as you write, the Shake continues to record data correctly locally (and so only loses the server transmission) makes me think that there could be an issue either with the microSD card or with the router closing the Shake communication ports (55555 and 55556) after some threshold is reached. When the Shake is then rebooted, this “count” is reset and the Shake communicates again.

The easiest/fastest thing you can try is to re-burn the microSD card (if needed, instructions are here: microSD card topics) and see if the connection issue happens again.

If it doesn’t, all this was probably caused by some file corruption. If, instead, it does, then there may be other local network factors involved.
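If the problem does come back, one quick way to test the network side is to check, from a machine on the same network, whether outbound connections on the two ports mentioned above still open while the Shake shows no server connection. The sketch below is only illustrative: it assumes the ports carry TCP traffic, and the `host` argument is a placeholder for whichever server address applies in your setup.

```python
import socket

# Ports named in the post above; the assumption here is that they are TCP.
SHAKE_PORTS = (55555, 55556)


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports=SHAKE_PORTS, timeout=5.0):
    """Map each port to whether it could be reached on the given host."""
    return {port: port_reachable(host, port, timeout) for port in ports}
```

Running `check_ports(host)` once while the connection is healthy and once after it has dropped would show whether the router or firewall has started blocking those ports, as opposed to something failing on the Shake itself.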

SchoolShake · February 1, 2025, 4:50pm

Thanks for the analysis @Stormchaser. One thing I did since sending this message was change the crontab. I had it running an “rsh-restart” at midnight UTC daily. I’ve disabled that and this issue hasn’t happened since.

Do you see any obvious reason the rsh-restart would cause this?

Regardless, I will reburn the rs image onto a new card and try that out.

Thanks!

Stormchaser · February 2, 2025, 6:23pm

No trouble at all!

I am not sure about rsh-restart, but I have asked our software team, and I’ll get back to you as soon as I have a reply.

Stormchaser · February 3, 2025, 12:34pm

Hello again,

As you know, the rsh-restart command restarts the data-producer and data-consumer docker containers. It is usually called only when metadata edits come in from rs.local/, so that the new user input takes effect.

What was the reason behind adding it as a cronjob to execute it every midnight? Was it related to connectivity issues, or something else?

Thank you.

SchoolShake · February 5, 2025, 11:19pm

Thanks @Stormchaser. I had found that for some of the instruments behind school firewalls, running an rsh-restart instead of a full reboot was sometimes enough to re-establish the connections. However, this no longer seems to be the case. I’ll make a separate ticket, but we have the same intermittent connection problem (i.e., most data is missing from RS) for the following stations:
RF5C1
R664C
R1B48
R4081
R54F7

These are all in the same school district with the same firewall issues. I would suggest that the issue is not on your side, but so far I have not had any luck getting it working through their firewall. All of these do a once-daily rsh-restart. I’ll add that we know they are successfully collecting data throughout this period, as we have all of their data collected via a seedlink-specific tunnel to our server.

Early on, I had found that an rsh-restart could help revive the data flow to the RS servers, but I can see now that this is no longer the case. I will shortly make a new post about these instruments, including the logs from two of them as well as a screenshot of our data availability.

SchoolShake · February 5, 2025, 11:46pm

FYI, here is the link to the new ticket for the persistent data drop in the one particular school district.

Stormchaser · February 6, 2025, 11:02am

Thank you for all the details, SchoolShake! I’ll go to the other thread and look at the details there.