2 RS1D Connected to the Net but not to your system

mario · February 18, 2020, 9:48am

Dear Ian and Richard,

We have 8 Raspberry Shake 1D units in 8 different institutes here in Barcelona. Until recently they were working fine, sending data to your caps server and visible through StationView and swarm, but a few days or weeks ago two of them “disappeared” from your system. The strange thing is that on all 8 RS we installed a script, run by a cron job, that every night mounts an external disk through the davfs2 protocol (sudo mount -t davfs -o noexec https://ictjabox… /mnt/dav) and copies the last 2 days of data there. This way we have continuous data without gaps. This external disk is an owncloud account here in our institute, and we have data from them! So they are alive, they are working and “they can reach the network”, but they do not send data to your system! These stations are R4D07 and R888C. Do you have any idea what is going on with these RS? Could it be related to the new firmware? We did not set a fixed IP on these devices, we just plugged them into the Ethernet, so as far as I know they didn’t have any special network config to lose… ¿?

Thank you very much in advance!

Best Regards,

Mario

ivor · February 18, 2020, 1:07pm

hi mario,

can you send the log files for these, please?

thanks in advance,

richard

mario · February 18, 2020, 1:20pm

That will be difficult, as I have no remote access to the stations. I have to ask the contacts at the institutes to do it, or go there in person. I’ll send you more info as soon as possible! It may take a while, sorry.

Regards

mario

ivor · February 18, 2020, 4:08pm

in that case, can you ask them to restart / reboot / power-cycle the units first? i hate to sound like microsoft support, but restarting does sometimes clear up problems…

richard

mario · February 18, 2020, 4:33pm

hahaha

I wrote them and tomorrow morning (CET) I hope to have something, at least from R4D07…

But R888C was rebooted/restarted this morning without any success. Ergo… Linux is more complex than microsoft

mario · February 19, 2020, 8:50am

I have had no feedback from the contacts in the schools, but R888C resumed transmission yesterday afternoon and is visible on swarm again. Perhaps they rebooted the instrument again. As soon as I have more info I will upload it.

mario · February 19, 2020, 11:27am

Both are alive, with no feedback from the institutes… Sorry, but they are “Free Spirits” …
If I get logs or more info I will upload them.

So sorry for the inconvenience

…

mario · February 19, 2020, 11:55am

Good news! Some info from R4D07: it is running firmware v0.18. Please find the log files attached!

The original data copied each night to our owncloud via davfs2 (that is, the data on the microSD) sometimes has gaps during the night. Is the station stopping ¿? Is the microSD corrupted ¿?

RSH.R4D07.2020-02-19T11_35_28.logs.tar (2.9 MB)

mario · February 19, 2020, 1:09pm

And now the logs from R888C, I could finally get them! It is also running firmware v0.18.

RSH.R888C.2020-02-19T12_52_33.logs.tar (2.0 MB)

mario · February 19, 2020, 1:23pm

R4D07 is not appearing in StationView, 3 h after recovering its connection to the server. Is that normal?

ivor · February 19, 2020, 2:34pm

hi mario,

it’s there now. there was a sync problem between the data server and the station-view server that has been resolved.

sorry about that,

richard

mario · February 19, 2020, 2:55pm

Thank you!
Let’s see if the logs give some information about what could have happened to these two stations…
Regards
Mario

mario · February 24, 2020, 4:18pm

Hi,

Did you find anything in the R888C and R4D07 log files that could explain what happened last week?

We are intrigued, now everything seems to work correctly …

Best regards
Mario

ivor · February 24, 2020, 5:07pm

hi mario,

at some point in january, both of these units were having problems with their network / internet connections. one way this manifested itself was that each would make a connection request to the shake data server, which was granted, but would then immediately be followed by a “socket read error” on the server-side, prompting the connection to be closed. the unit then would reconnect and the cycle would continue.

since this behavior is detrimental to the overall health of the server (to which the entire AM network is connected), the IP addresses for these two machines were banned from being able to communicate with the server at all, thus preventing the connection request from even getting in.

your query prompted these IP addresses to be “un-banned” to test if they were the same units you had reported, and to see if the network problem had resolved; it was true in both cases.

a better solution to this problem (client network has issues) is to put more intelligence into the client to recognize that this ping-pong / up-down connecting to the server is occurring and to change the re-connection request interval to something more reasonable, say, once per hour.

cheers,

richard
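
A minimal sketch of the client-side back-off richard describes above, assuming a connection that drops shortly after being accepted counts as a "flap". The class name, the 5 s starting delay and the 60 s "healthy" threshold are illustrative assumptions, not part of the Shake firmware; only the one-hour ceiling comes from the post.

```python
class ReconnectBackoff:
    """Choose how long to wait before the next connection attempt.

    Hypothetical helper: all names and defaults except the one-hour cap
    are assumptions made for illustration.
    """

    def __init__(self, base_delay_s=5.0, max_delay_s=3600.0, healthy_after_s=60.0):
        self.base_delay_s = base_delay_s        # delay used after a healthy connection
        self.max_delay_s = max_delay_s          # ceiling: once per hour
        self.healthy_after_s = healthy_after_s  # shorter-lived connections count as flaps
        self.delay_s = base_delay_s

    def next_delay(self, connection_lifetime_s):
        """Return the wait in seconds, given how long the last connection survived."""
        if connection_lifetime_s >= self.healthy_after_s:
            # The link held for a while: go back to quick reconnects.
            self.delay_s = self.base_delay_s
        else:
            # Connect / read error / disconnect cycle: double the wait, up to the cap.
            self.delay_s = min(self.delay_s * 2, self.max_delay_s)
        return self.delay_s
```

With these defaults, a unit whose connection keeps dropping immediately waits 10 s, 20 s, 40 s and so on until it settles at one attempt per hour, and returns to the short delay once a connection stays up.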

mario · February 24, 2020, 5:47pm

Thank you very much for this info Richard.

But what about the data gaps in the mseed files recorded on the microSD? Is it all related? Can a client/server problem produce these gaps in the acquired data files?

Regards

Mario

ivor · February 24, 2020, 7:08pm

hi,

the gaps you see here locally are not related to the story with the server connection. these two data “destinations” are fully uncoupled from each other, in that, if delivery of data to one fails, this will not affect the delivery to the other in any way.

what does seem to be happening is that the data transfer between the shake board and the Pi is sometimes compromised (see log file system.log), where the data reader program is unable to read the data coming off the port quickly enough, some data is dropped, and the packet is rejected. this is the first short gap you see on day 049 at 01:30.

the gap that starts a short time later and continues until the morning is unexplained; i cannot find anything in the logs pointing to a definitive cause. you said that you download the data in the middle of the night, at what time? is it possible that the download is somehow having an adverse effect on data collection? do you see this on any other units you fetch data for in the middle of the night?

in any case, i would confirm the cable connections between the Shake board and the Pi unit, this might be a cause of the intermittent dropped data packets.

sorry i can’t say more about the longer gap. your suspicion that the SD card may be corrupt is also a possibility. when it’s easy, replace the SD card and see if this results in a (positive) change.

cheers,

richard

mario · February 25, 2020, 10:47am

Dear Richard,

Thank you very much for your explanations, they are very instructive.

Only some remarks:

Cron launches the script that copies the last two mseed files on the microSD (julday-1 & julday-2) to the remote disk at 01:30 UTC, to make sure the file of the previous day is complete. As seen in the attached screenshots (R4D07_01.png, R4D07_02.png, R888C_1.png and R888C_2.png), we generally don’t find gaps in these files. From Julday 051 till now, when the incident was resolved for both stations, we have had no more gaps. We started this procedure to avoid transmission gaps in data coming from the caps server. We have not seen any adverse effect of this procedure on any other unit. The script is very simple: it mounts the external disk, copies 2 files and unmounts the disk. The system is “busy” with this for 3-5 min each night, not more. But maybe something escapes us…

There is something that intrigues me. As seen in post #7, the swarm screenshot I sent on February 20 shows that the problem in R888C was resolved at approximately 20h UTC on February 18 (Julday 049), and the problem in R4D07 around 10h UTC on February 19 (Julday 050). And as seen in R888C_Jul049.png and R4D07_Jul050.png, both stations show a small gap in the original data around those hours. Coincidence or relationship ¿? At the very least, it is suspicious.

On the other hand, I don’t understand what you mean by: “I would confirm the cable connections between the Shake board and the Pi unit, this might be a cause of the intermittent dropped data packets”. If I’m not mistaken, the only cable connected to the Shake board is the geophone cable; the board itself is connected to the GPIO pins. Do you mean pressing the board down ¿?

Thank you very much for your time and help.

Best Regards,

Mario
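
The selection of julday-1 and julday-2 in the nightly job described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and it returns (year, day-of-year) pairs rather than file names, because the naming of the mseed files on the microSD is not given in this thread.

```python
from datetime import datetime, timedelta, timezone


def days_to_copy(now_utc, n_days=2):
    """Return (year, julian day) for the n_days complete UTC days before now_utc.

    Hypothetical helper: run shortly after midnight UTC (01:30 in this thread),
    so that the most recent day returned is already complete.
    """
    days = []
    for back in range(1, n_days + 1):
        day = now_utc - timedelta(days=back)
        days.append((day.year, day.timetuple().tm_yday))
    return days


# A run at 01:30 UTC on 19 February 2020 selects Julday 049 and 048.
print(days_to_copy(datetime(2020, 2, 19, 1, 30, tzinfo=timezone.utc)))
```

Working from the date rather than subtracting from the day number keeps the result correct across a year boundary, where julday-1 on 1 January is day 365 or 366 of the previous year.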

mario · February 25, 2020, 11:46am

Maybe this image helps clarify the point that intrigues me: there are gaps in the data at the moment the server issue is solved. Why? Maybe the two data channels are not as uncoupled/independent as expected.

What I mean is, could the large gap of day 049, without explanation in the system registry (system.log), be related to the problem of not being able to send data to the server?

ivor · February 26, 2020, 10:05pm

hi mario,

your analysis is, of course, compelling! (and i have to say, you very much back me into a corner as well when you use my own program to visually state your case… )

what i now think is going on is that instead of the problem having anything to do with data being sent to the server, the problem instead lies with the data not being properly read off the serial port. it cannot be a coincidence that when the problem sending data to the server is solved happens to be the same moment when the data-producer program is restarted. so the problem with data flow stopping must be further upstream.

as well, i would agree that this is not related to your daily download since you are doing this on several units and see problems with only one. (is that correct?)

at this point, i can only suggest to reseat the Shake board, perhaps this could help. if this continues to be a problem after that, then if you have a spare Pi you could swap it out with, that would answer whether or not the problem sits with the computer. when the problem persists across two Pi’s, then that could point to the problem being with the board itself.

sorry i can’t be more definitive, the log files have told me only so much and don’t really point me to where this problem could be originating from. in these cases, the only real course of action is to make a guess where it could be, make a modification, and check for any change in behaviour. a process of elimination, as it were…

let me know any more of what you find. speaking of which, it would be nice if you could put numbers on this:

how often does it happen, and
how long is each outage?

thanks,
richard

mario · February 27, 2020, 3:38pm

Dear Richard,

Sorry, this query is getting longer and the fronts are diverging, but there are still some points that I don’t understand. These 2 stations are working fine now. They acted in a strange way for 3 days, but they recovered the expected behaviour, so I’m not sure about the electronic explanation…

Quoting your last answer :

“what i now think is going on is that instead of the problem having anything to do with data being sent to the server, the problem instead lies with the data not being properly read off the serial port. it cannot be a coincidence that when the problem sending data to the server is solved happens to be the same moment when the data-producer program is restarted. so the problem with data flow stopping must be further upstream.”

I don’t understand this new explanation. In a previous answer (#14, Feb 25) I was told: the IP addresses for these two machines were banned from being able to communicate with the server at all, thus preventing the connection request from even getting in. Your query prompted these IP addresses to be “un-banned” to test if they were the same units you had reported, and to see if the network problem had resolved; it was true in both cases.

So, if I understood well, you had to act on the server, and it is clear that at the point when communication was re-established, we had a gap in both stations. If the problem was the serial port, why do we have data on the microSD before and after this server un-banning ¿?

“as well, i would agree that this is not related to your daily download since you are doing this on several units and see problems with only one. (is that correct?)”

Yes and no. Yes, we are running this script on 9 RS and never found anything strange… only on these two stations (R4D07 and R888C), around the same day… So, no, we had problems in two stations.

“at this point, i can only suggest to reseat the Shake board, perhaps this could help. if this continues to be a problem after that, then if you have a spare Pi you could swap it out with, that would answer whether or not the problem sits with the computer. when the problem persists across two Pi’s, then that could point to the problem being with the board itself.”

They are in 8 high schools (plus one at home), nine in total. It will be difficult to act on the 8 RS until we recover the instruments. Now they are working fine, so… The one at home is an RS4D and had “similar communication/hardware problems”, but I changed the Pi, re-burned the microSD 3 times and plugged it into the ethernet connector of a TP-Link wifi repeater, and it works better now (RS 4D Continously rebooting). I’m really concerned about the problems at home when working over wifi: I had a lot of gaps and strange resets, but now it is difficult to know, as I changed the whole configuration. So we can leave this one aside…

“let me know any more of what you find. speaking of which, it would be nice if you could put numbers on this: how often does it happen, and how long is each outage?”

The only way to know and quantify this is to run msi over the mseed files copied from the microSD cards; please find the resulting file attached. I only ran it over the 2020 data, and it is long… The problem is that we won’t have the logs. I can get them for 4, at most 5, RS, so it will be difficult to know the origin of the gaps, but what is certain is that they are not transmission gaps to the caps server…

Best regards

Mario
rasp_2020_gaps.txt (147.6 KB)
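
To put numbers on "how often" and "how long", a gap list can be reduced to a few counts. The sketch below is only an illustration: it assumes the continuous data segments are already available as (start, end) datetime pairs, and it does not parse msi output or the attached file. Both function names and the 1 s tolerance are assumptions.

```python
from datetime import datetime, timedelta


def find_gaps(segments, tolerance=timedelta(seconds=1)):
    """Return (gap_start, gap_end) pairs between continuous data segments.

    segments: iterable of (start, end) datetimes, in any order.
    tolerance: breaks no longer than this are not counted as gaps (assumed value).
    """
    ordered = sorted(segments)
    gaps = []
    covered_until = ordered[0][1] if ordered else None
    for start, end in ordered[1:]:
        if start - covered_until > tolerance:
            gaps.append((covered_until, start))
        # Overlapping segments must not move the covered time backwards.
        covered_until = max(covered_until, end)
    return gaps


def summarize(gaps):
    """Count the gaps and report their total and longest duration."""
    durations = [end - start for start, end in gaps]
    return {
        "count": len(durations),
        "total": sum(durations, timedelta()),
        "longest": max(durations, default=timedelta()),
    }
```

Running `summarize(find_gaps(segments))` per station and per day would give the outage frequency and duration asked for earlier in the thread, which could then be compared against the times of the server reconnections.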