2 RS1D Connected to the Net but not to your system

mario · February 19, 2020, 11:27am

Both alive with no feedback from the institutes… Sorry but they are “Free Spirits” …
If I have logs or more info I will upload it.

So sorry for the inconveniences

…

mario · February 19, 2020, 11:55am

Good news! Some info from R4D07: it is running firmware v18. Please find the log files attached!

The original data are copied each night to our ownCloud via davfs2. The data on the microSD sometimes have gaps during the night. Is the station stopping? Is the microSD corrupted?

RSH.R4D07.2020-02-19T11_35_28.logs.tar (2.9 MB)

mario · February 19, 2020, 1:09pm

And now logs from R888C, finally I could get them! It is also running Firmware V0.18

RSH.R888C.2020-02-19T12_52_33.logs.tar (2.0 MB)

mario · February 19, 2020, 1:23pm

R4D07 is still not appearing in stationview, 3 h after recovering its connection to the server. Is that normal?

ivor · February 19, 2020, 2:34pm

hi mario,

it’s there now. there was a sync problem between the data server and the station-view server that has been resolved.

sorry about that,

richard

mario · February 19, 2020, 2:55pm

Thank you!
Let’s see if the logs give some information about what could have happened to these two stations…
Regards
Mario

mario · February 24, 2020, 4:18pm

Hi,

Did you find anything in the R888C and R4D07 log files that could explain what happened last week?

We are intrigued, now everything seems to work correctly …

Best regards
Mario

ivor · February 24, 2020, 5:07pm

hi mario,

at some point in january, both of these units were having problems with their network / internet connections. one way this manifested itself was that each would make a connection request to the shake data server, which was granted, but would then immediately be followed by a “socket read error” on the server-side, prompting the connection to be closed. the unit then would reconnect and the cycle would continue.

since this behavior is detrimental to the overall health of the server (to which the entire AM network is connected), the IP addresses for these two machines were banned from being able to communicate with the server at all, thus preventing the connection request from even getting in.
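
For readers wondering what such a ban rule might look like, here is a minimal Python sketch. It only illustrates the idea described above and is not the code running on the shake data server: the class name `FlapDetector`, the thresholds (5 failures within 60 seconds) and the in-memory ban list are all assumptions.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Ban an IP after too many failed connections inside a sliding window.

    Hypothetical sketch: names and thresholds are assumptions, not the
    real server's configuration.
    """

    def __init__(self, max_failures=5, window_s=60.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures = defaultdict(deque)  # ip -> times of recent failures
        self.banned = set()

    def record_failure(self, ip, now):
        """Register one 'socket read error' for ip at time now (seconds)."""
        recent = self.failures[ip]
        recent.append(now)
        # Forget failures that have fallen out of the time window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self.banned.add(ip)

    def allows(self, ip):
        """Return True if a connection request from ip should be accepted."""
        return ip not in self.banned

    def unban(self, ip):
        """Lift the ban and clear the failure history for ip."""
        self.banned.discard(ip)
        self.failures.pop(ip, None)
```

With a rule like this, a unit that fails only occasionally is never banned, while one that reconnects and fails in a tight loop is blocked until someone calls `unban`.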

your query prompted these IP addresses to be “un-banned” to test if they were the same units you had reported, and to see if the network problem had resolved; it was true in both cases.

a better solution to this problem (client network has issues) is to put more intelligence into the client to recognize that this ping-pong / up-down connecting to the server is occurring and to change the re-connection request interval to something more reasonable, say, once per hour.
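
The client-side idea above can be sketched as a small backoff rule. This is a hypothetical illustration in Python, not the Shake client's actual logic: the function name `next_retry_delay`, the 5-second base delay and the 60-second "stable connection" threshold are assumptions; only the one-hour ceiling comes from the suggestion above.

```python
def next_retry_delay(previous_delay_s, connection_lasted_s,
                     base_s=5.0, cap_s=3600.0, stable_after_s=60.0):
    """Return how many seconds to wait before the next reconnection attempt.

    All default values are assumptions made for this sketch, except cap_s,
    which mirrors the "once per hour" suggestion.
    """
    # A connection that survived long enough is treated as healthy,
    # so the delay is reset to the base value.
    if connection_lasted_s >= stable_after_s:
        return base_s
    # Otherwise the connection dropped almost immediately (ping-pong):
    # double the delay, but never exceed the cap.
    return min(cap_s, max(base_s, previous_delay_s * 2))
```

Doubling rather than jumping straight to one hour keeps a short network hiccup cheap, while a persistent fault quickly settles at one attempt per hour.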

cheers,

richard

mario · February 24, 2020, 5:47pm

Thank you very much for this info Richard.

But what about the data gaps in the mseed files recorded on the microSD? Is it all related? Can a client/server problem produce these gaps in the acquired data files?

Regards

Mario

ivor · February 24, 2020, 7:08pm

hi,

the gaps you see here locally are not related to the story with the server connection. these two data “destinations” are fully uncoupled from each other, in that, if delivery of data to one fails, this will not affect the delivery to the other in any way.
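
As a rough picture of what "fully uncoupled" means here, the following Python sketch sends each packet to every destination in isolation. It is an assumption-laden illustration, not the Shake software: the function `deliver` and the idea of destinations as plain callables are invented for the example.

```python
def deliver(packet, sinks):
    """Send packet to every sink; one failing sink does not stop the others.

    sinks maps a destination name to a callable (hypothetical interface).
    Returns the names of the sinks whose delivery failed.
    """
    failed = []
    for name, send in sinks.items():
        try:
            send(packet)
        except Exception:
            # Record the failure and carry on with the remaining sinks.
            failed.append(name)
    return failed
```

In this picture, a server sink raising a connection error would leave the local disk sink untouched, which is why a gap on the microSD points to a cause other than the server connection.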

what does seem to be happening is that the data transfer between the shake board and the Pi is sometimes compromised (see log file system.log), where the data reader program is unable to read the data coming off the port quickly enough, some data is dropped, and the packet is rejected. this is the first short gap you see on day 049 at 01:30.

as for the gap that starts a short time later and continues until the morning, it is unexplained; i cannot find anything in the logs pointing to a definitive cause. you said that you download the data in the middle of the night, at what time? is it possible that the download is somehow having an adverse effect on data collection? do you see this on any other units you fetch data for in the middle of the night?

in any case, i would confirm the cable connections between the Shake board and the Pi unit, this might be a cause of the intermittent dropped data packets.

sorry i can’t say more about the longer gap. your suspicion that the SD card may be corrupt is also a possibility. when it’s easy, replace the SD card and see if this results in a (positive) change.

cheers,

richard

mario · February 25, 2020, 10:47am

Dear Richard,

Thank you very much for your explanations, they are very instructive.

Only some remarks:

Cron launches the script that copies the last two mseed files on the microSD (julday-1 & julday-2) to the remote disk at 01:30 UTC, to make sure the file of the previous day is complete. As seen in the attached screenshots (R4D07_01.png, R4D07_02.png, R888C_1.png and R888C_2.png), we generally don’t find gaps in these files. From Julday 051 until now, since the incident was resolved for both stations, we have had no more gaps. We started this procedure to avoid transmission gaps in the data coming from the caps server. We have not seen any adverse effect of this procedure on any other unit. The script is very simple: it mounts the external disk, copies 2 files and unmounts the disk. The system is “busy” with this for 3-5 min each night, not more. But maybe something escapes us…
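
The file selection step of such a nightly job can be sketched in Python as follows. This is not the actual script: the function names and the file name pattern `STATION.{year}.{doy}.mseed` are placeholders, since the real naming on the microSD is not given here, and the mount, copy and unmount steps are left out.

```python
from datetime import date, timedelta


def days_to_copy(today, n_days=2):
    """Return (year, day-of-year) strings for the n_days days before today."""
    result = []
    for back in range(1, n_days + 1):
        d = today - timedelta(days=back)
        # %j is the zero-padded day of the year ("Julday"), e.g. "049".
        result.append((d.strftime("%Y"), d.strftime("%j")))
    return result


def file_names(today, pattern="STATION.{year}.{doy}.mseed"):
    """Build the names of the files to copy.

    The default pattern is a hypothetical placeholder, not the real layout.
    """
    return [pattern.format(year=y, doy=j) for y, j in days_to_copy(today)]
```

Computing the day from a real date, rather than subtracting from a day number, keeps the selection correct across the year boundary (1 January yields days 365 and 364 of the previous year, or 366 and 365 after a leap year).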

There is something that intrigues me. As seen in post #7, the swarm screenshot I sent on February 20 shows that the problem in R888C was resolved at approximately 20h UTC on February 18 (Julday 049), and the problem in R4D07 at around 10h UTC on February 19 (Julday 050). And as seen in R888C_Jul049.png and R4D07_Jul050.png, both stations present a small gap in the original data around this hour. Coincidence or relationship? At the least, it is suspicious.

On the other hand, I don’t understand what you mean by: “I would confirm the cable connections between the Shake board and the Pi unit, this might be a cause of the intermittent dropped data packets”. If I’m not mistaken, the only cable connected to the Shake board is the geophone cable; the board itself is connected to the GPIO pins. Do you mean I should press the board down onto the pins?

Thank you very much for your time and help.

Best Regards,

Mario

mario · February 25, 2020, 11:46am

Maybe this image clarifies the point that intrigues me: there are gaps in the data at the moment the server issue is solved. Why? Maybe the two data channels are not as uncoupled/independent as expected.

What I mean is, could the large gap of day 049, without explanation in the system registry (system.log), be related to the problem of not being able to send data to the server?

ivor · February 26, 2020, 10:05pm

hi mario,

your analysis is, of course, compelling! (and i have to say, you very much back me into a corner as well when you use my own program to visually state your case… )

what i now think is going on is that instead of the problem having anything to do with data being sent to the server, the problem instead lies with the data not being properly read off the serial port. it cannot be a coincidence that when the problem sending data to the server is solved happens to be the same moment when the data-producer program is restarted. so the problem with data flow stopping must be further upstream.

as well, i would agree that this is not related to your daily download since you are doing this on several units and see problems with only one. (is that correct?)

at this point, i can only suggest to reseat the Shake board, perhaps this could help. if this continues to be a problem after that, then if you have a spare Pi you could swap it out with, that would answer whether or not the problem sits with the computer. when the problem persists across two Pi’s, then that could point to the problem being with the board itself.

sorry i can’t be more definitive, the log files have told me only so much and don’t really point me to where this problem could be originating from. in these cases, the only real course of action is to make a guess where it could be, make a modification, and check for any change in behaviour. a process of elimination, as it were…

let me know any more of what you find. speaking of which, it would be nice if you could put numbers on this:

how often does it happen, and
how long is each outage?

thanks,
richard

mario · February 27, 2020, 3:38pm

Dear Richard,

Sorry, this query is getting longer and the fronts are diverging, but there are still some points that I don’t understand. These 2 stations are working fine now. They acted in a strange way for 3 days, but they recovered the expected behaviour, so I’m not sure about the electronic explanation…

Quoting your last answer :

“what i now think is going on is that instead of the problem having anything to do with data being sent to the server, the problem instead lies with the data not being properly read off the serial port. it cannot be a coincidence that when the problem sending data to the server is solved happens to be the same moment when the data-producer program is restarted. so the problem with data flow stopping must be further upstream.”

I don’t understand this new explanation. In a previous answer (#14, Feb 24) I was told: “the IP addresses for these two machines were banned from being able to communicate with the server at all, thus preventing the connection request from even getting in. your query prompted these IP addresses to be “un-banned” to test if they were the same units you had reported, and to see if the network problem had resolved; it was true in both cases.”

So, if I understood well, you had to act on the server, and it is clear that at the point the communication was re-established, we had a gap in both stations. If the problem was the serial port, why do we have data on the microSD both before and after this server un-banning?

“as well, i would agree that this is not related to your daily download since you are doing this on several units and see problems with only one. (is that correct?)”

Yes and no. Yes, we are running this script on 9 RS and never found anything strange, except on these two stations (R4D07 and R888C) around the same day. So no, we had problems on two stations.

“at this point, i can only suggest to reseat the Shake board, perhaps this could help. if this continues to be a problem after that, then if you have a spare Pi you could swap it out with, that would answer whether or not the problem sits with the computer. when the problem persists across two Pi’s, then that could point to the problem being with the board itself.”

They are in 8 high schools (plus one at home), nine in total. It will be difficult to act on the 8 RS until we recover the instruments. Now they are working fine, so… The one at home is an RS4D and had “similar communication/hardware problems”, but I changed the Pi, re-burned the microSD 3 times and plugged it into the ethernet connector of a TP-Link wifi repeater, and it works better now (RS 4D Continously rebooting). I’m really concerned about the problems at home when working over wifi: I had a lot of gaps and strange resets, but now it is difficult to know, as I changed the whole configuration. So we can leave this one aside…

“let me know any more of what you find. speaking of which, it would be nice if you could put numbers on this: how often does it happen, and how long is each outage?”

The only way to know and quantify this is to run msi over the mseed files copied from the microSD cards; please find the resulting file attached. I only ran it over the 2020 data, and it is long… The problem is that we won’t have the logs (I can get them for 4, at most 5, RS), so it will be difficult to know the origin of the gaps. What is certain is that they are not transmission gaps to the caps server…
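
To turn a list of recorded data segments into the two numbers asked for (how often, and how long), something like the following Python sketch could be used. It is only an illustration under assumptions: it does not parse msi output, the function names `find_gaps` and `summarize` are made up, and the segment start and end times are assumed to have been extracted already as `datetime` pairs.

```python
from datetime import datetime, timedelta


def find_gaps(segments, tolerance=timedelta(seconds=1)):
    """Return (gap_start, gap_end) pairs between consecutive data segments.

    segments is an iterable of (start, end) datetimes. Gaps no longer than
    tolerance (an assumed 1 second by default) are ignored.
    """
    gaps = []
    ordered = sorted(segments)
    if not ordered:
        return gaps
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start - covered_until > tolerance:
            gaps.append((covered_until, start))
        # Overlapping segments must not move the coverage backwards.
        covered_until = max(covered_until, end)
    return gaps


def summarize(gaps):
    """Return (count, total duration, longest duration) for a list of gaps."""
    durations = [end - start for start, end in gaps]
    if not durations:
        return 0, timedelta(0), timedelta(0)
    return len(durations), sum(durations, timedelta(0)), max(durations)
```

Running this per station and per day would show whether the gaps cluster around particular hours, such as the time of the nightly copy or the moment the server connection came back.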

Best regards

Mario
rasp_2020_gaps.txt (147.6 KB)

ivor · February 27, 2020, 5:22pm

hi mario,

sorry, but it’s entirely possible i’ve lost the plot. be assured, it is not my intent to try to deflect or confuse, but the restrictions on my time are real, and sometimes the best i can do is provide an explanation based on my best deductive reasoning; and there will be no guarantee any of my conclusions are actually correct.

if i understand correctly now, there are currently no problems with the data capture, either to the local disk or to the server?

if so, then it may have to remain a mystery as to the complete dynamics of what caused the problem in the first place. i think we can conclude that the banning of the IP indeed had an overall detrimental effect. if un-banning the IP’s caused the problems to go away, then there is no longer a problem to solve. again, without setting up a thorough test of what happens in all combinations of settings, both client- and server-side, i’m unable to do more than guess based on the information that’s available (log files plus your observations).

if this is something that must be understood, in absolute terms, please have a look at purchasing technical support so that resources can be assigned to looking into this at a deeper level. as well, i would also encourage you to investigate why the network connection was faulty in the first place, generating the socket read errors on the server, which then caused the IP’s to be banned.

cheers,

richard

ivor · February 27, 2020, 5:59pm

hi again,

looking at the server logs just now, the station R4D07 is again exhibiting the ‘socket read’ problem; its IP is banned from communicating with the server until the problem resolves.

not sure how long this problem will persist, but while it does, it would be a good time to get a better understanding of what is happening with the network client-side, to see if anything can be done to prevent it.

cheers,

richard

mario · February 28, 2020, 2:26pm

Thank you very much, Richard, for your time and help. I wrote to the person in charge of the network at this school to try to find out what happened but, like you, I think it will remain a mystery. Could you please unban it again on your server?

If I get the Logs I will send them again.

Have a nice weekend!
Regards
Mario

ivor · February 28, 2020, 2:53pm

hi mario,

the IP has been un-banned, unit has connected and seems to be functioning normally again. and since this seems to be a recurring problem, we really need to figure out why this network issue is occurring and solve it since this really is detrimental to the server’s operations.

thanks, buen fin de semana…

richard

mario · February 28, 2020, 3:15pm

Thank you very much, I will try to obtain as much information as possible!
Have a good weekend!
Mario

mario · February 28, 2020, 3:51pm

Only for info…

Original versus Server Data. You were right, there is a problem with the net! Let’s see if I can get some feedback from the high school…