Raspberry Shake occasionally stops working

Hi,

I have a Raspberry Shake that occasionally just stops working by itself. It’s on a UPS (uninterruptible power supply). My router is on a UPS. The switch between the router and Raspberry Shake is also on a UPS.

I know that one of the most recent failures happened around the time the power went out in the neighborhood. I could still access the Raspberry Shake and use the Internet, but the software says that it went offline. It never came back automatically to report data.

Rebooting it got it working again. It does make me wonder why there isn’t a cron job or watchdog process to check the health of the Raspberry Shake. If it’s not working correctly for a period of time, it could try to start the services back up. If the recording services refuse to come back, it could then reboot.
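To make the watchdog idea concrete, here is a minimal sketch of the decision logic in Python. It is not part of the Shake OS. The names `WatchdogState` and `watchdog_step`, the thresholds, and the three callables are all hypothetical; what counts as "healthy" and how services are restarted would have to be filled in for a real Shake.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WatchdogState:
    """Tracks how many health checks in a row have failed."""
    consecutive_failures: int = 0


def watchdog_step(
    state: WatchdogState,
    is_healthy: Callable[[], bool],
    restart_services: Callable[[], None],
    reboot: Callable[[], None],
    restart_after: int = 3,   # assumed threshold, tune to taste
    reboot_after: int = 6,    # assumed threshold, must exceed restart_after
) -> str:
    """Run one health check and escalate: wait, restart services, then reboot."""
    if is_healthy():
        state.consecutive_failures = 0
        return "ok"

    state.consecutive_failures += 1
    if state.consecutive_failures >= reboot_after:
        # Restarting the services did not help, so fall back to a reboot.
        reboot()
        return "reboot"
    if state.consecutive_failures >= restart_after:
        # Unhealthy for a while: try bringing the services back first.
        restart_services()
        return "restart"
    # Tolerate short glitches without taking any action.
    return "wait"
```

A cron job or a small loop could call `watchdog_step` once a minute, passing in whatever health check and restart command fit the installation. Keeping the escalation logic separate from those commands makes it easy to test without touching the device.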

If there is a large local earthquake, the power is likely going to flicker. How can I ensure that it’s recording reliably throughout a shaking event? I thought a UPS was sufficient, but that doesn’t seem to be the case. Everything else on the UPS seems to work fine; only the software on the Raspberry Shake fails.

Hello BlackDiamond,

Thank you for your report. The behaviour is quite unusual, so I would like to investigate more. For example, usually after a power cut or network reset, my Shakes continue to try connecting to the servers until they find them again.

Could you please post the logs from the Shake after the event that you have described has happened? Without them, it is impossible to understand what is going on. Thank you.

Instructions on how to do it are here, if they are needed: Please read before posting!

Here are the log files.

RSH.RAAE6.2021-07-17T07_37_49.logs.tar.gz (258.6 KB)

I took a quick look at the logs. This excerpt from rsh-data-consumer.log looked interesting. “Cannot allocate memory” seems bad. Did something trigger a memory leak?

2021-07-13 13:12:22 S.T. [INFO] Server: Connection from 127.0.0.1:52020 (0x2207b58)
20210713_UTC_13:14:46 heli_ewII (Main):  Updating Helicorders.
2021-07-13 13:14:46 S.T. [INFO] Server: Connection from 127.0.0.1:52022 (0x22367a8)
2021-07-13 13:14:49 S.T. [ERROR] libmseedmm:    Source                Start sample             End sample        Gap  Hz  Samples
2021-07-13 13:14:53 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: Timeout to wave server. Try again.
2021-07-13 13:15:21 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: Timeout to wave server. Try again.
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 ENZ AM 00: No connection to wave server.
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 ENN AM 00: No connection to wave server.
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 ENE AM 00: No connection to wave server.
20210713_UTC_13:21:09 heli_ewII (Main):  Updating Helicorders.
2021-07-13 13:21:09 S.T. [INFO] Server: Connection from 127.0.0.1:52026 (0x21f9ea0)
2021-07-13 13:21:11 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
2021-07-13 13:21:16 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
2021-07-13 13:21:29 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: Timeout to wave server. Try again.
2021-07-13 13:22:03 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: Timeout to wave server. Try again.
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: No connection to wave server.
  heli_ewII: RequestWave:  server: 127.0.0.1  16032 Trace RAAE6 EHZ AM 00: No connection to wave server.
2021-07-13 13:24:51 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
/opt/bin/rsh-data-consumer.start.sh: fork: Cannot allocate memory
2021-07-13 13:34:24 S.T. [INFO] libslinkmm: network timeout (600s), reconnecting in 30s
2021-07-13 13:38:12 S.T. [ERROR] libmseedmm: mst_addmsr(): Cannot allocate memory
2021-07-13 13:38:38 S.T. [ERROR] libslinkmm: [172.17.0.2:18000] timeout waiting for response to 'HELLO'
2021-07-13 13:40:50 S.T. [ERROR] libslinkmm: [172.17.0.2:18000] timeout waiting for response to 'HELLO'
2021 197 13:56:23: Docker Container rsh-data-consumer is starting...
2021 197 13:56:23: Waiting until seedlink is serving data...
2021 197 13:56:54: Seedlink ready, here we go...
2021 197 13:56:54: Starting OWS...
2021 197 13:56:54: Pausing to let OWS get started...
2021-07-16 13:56:54 S.T. [INFO] OSOP Wave Server: version b3755b45258d2539f256b8a2c1fc8070a85cc24b
2021-07-16 13:56:56 S.T. [INFO] Server: Started
2021 197 13:58:24: Starting HELI...
20210716_UTC_13:58:24 heli_ewII (Main):  Version: 1.0.5 2015-03-17
20210716_UTC_13:58:24 heli_ewII (Main):  Read command file </opt/settings/dataC/heli_ewII.d>
2021-07-16 13:58:24 S.T. [INFO] Server: Connection from 127.0.0.1:47648 (0x12f8e60)
20210716_UTC_13:58:24 heli_ewII (Main):  Starting plots
 heli_ewII: RequestWave:  server: 127.0.0.1  16032 Failed.  io = -7

and so on…
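For anyone who wants to check how often this shows up in their own logs, here is a small Python sketch that counts the "Cannot allocate memory" lines and collects their timestamps. The function name `memory_errors` is made up for illustration, and the timestamp pattern is only an assumption based on the lines quoted above; lines without a leading timestamp (like the `fork:` one) are counted but contribute no timestamp.

```python
import re
from typing import Iterable, List, Tuple

# Matches the leading "YYYY-MM-DD HH:MM:SS" seen on the S.T. log lines above.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def memory_errors(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (number of out-of-memory lines, timestamps found on them)."""
    count = 0
    stamps: List[str] = []
    for line in lines:
        if "Cannot allocate memory" not in line:
            continue
        count += 1
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(match.group(1))
    return count, stamps
```

Calling it with an open log file, for example `memory_errors(open("rsh-data-consumer.log"))`, gives a quick view of when the errors started and whether they line up with the power outage.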

Hello BlackDiamond,

Thank you for the logs. I can confirm what you have highlighted, and I also see some hard resets that could be connected to loss-of-network or loss-of-power events. Regarding the memory errors, do you have other processes running on the same Pi board alongside our Shake OS?
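If it helps to answer the memory question, one rough way to see how much memory is free on the Pi is to read `/proc/meminfo`, which is available on Linux. The sketch below is only an illustration: the function names `parse_meminfo` and `available_fraction` are hypothetical, and it is not a tool shipped with the Shake OS.

```python
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse the contents of /proc/meminfo into {field name: numeric value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(values: Dict[str, int]) -> Optional[float]:
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = values.get("MemTotal")
    available = values.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total
```

On the Shake itself this could be used as `available_fraction(parse_meminfo(open("/proc/meminfo").read()))`. A value that keeps shrinking over several days would point towards a process slowly consuming memory.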

Have you tried connecting the Shake to a different UPS or, apart from the initial failure you described, leaving it connected to a normal power supply to see whether events like these happen regularly?

If you have not, please run these tests for a few days and then repost the logs, so that we can rule out the UPS as a possible source of these issues and concentrate on a different route. Thank you.