2-hour event sync outage

We had a roughly 2-hour outage in processing new events; see the “updatedAt” and “createdAt” fields, which jump from 14:41:19 UTC to 16:27:44 UTC.
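
For reference, the size of that jump can be checked with a few lines of Python. This is only a sketch: `find_gaps` and `utc` are hypothetical helpers, the date is assumed from the log timestamps further down, and the first and last entries in the list are made-up neighbours added for illustration. Only 14:41:19 and 16:27:44 come from our data.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, threshold):
    """Return (start, end, duration) for every pair of consecutive
    timestamps that are further apart than `threshold`."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > threshold:
            gaps.append((prev, curr, curr - prev))
    return gaps


def utc(hour, minute, second):
    # Date assumed to be 2021-11-17, matching the log lines in this post.
    return datetime(2021, 11, 17, hour, minute, second, tzinfo=timezone.utc)


created_at = [
    utc(14, 41, 5),    # illustrative neighbour, not real data
    utc(14, 41, 19),   # last event before the jump
    utc(16, 27, 44),   # first event after the jump
    utc(16, 27, 50),   # illustrative neighbour, not real data
]

gaps = find_gaps(created_at, threshold=timedelta(minutes=10))
for start, end, duration in gaps:
    print(f"gap from {start:%H:%M:%S} to {end:%H:%M:%S} UTC lasted {duration}")
```

On the two real timestamps this reports a gap of 1:46:25, so the outage was closer to 1 hour 46 minutes than a full 2 hours.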

The gap shows up on all synced events for the exact same timeframe: they just weren’t processed, and then they show up later. The backlog isn’t even caught up by the time new events start being processed.

There are no useful logs that I can view, other than ones showing that the server was responding to client read requests just fine:

  1. 2021-11-17T17:06:23.163Z - Can not find client undefined on disconnect
  2. 2021-11-17T15:43:06.899Z - Can not find client undefined on disconnect
  3. 2021-11-17T14:41:33.798Z - Can not find client undefined on disconnect
  4. 2021-11-17T13:45:19.949Z - Can not find client undefined on disconnect

We also have a second server (with a mostly identical setup, but some different calculations) processing the same events. It had the same outage.

My guess is some kind of node issue with syncing events? I’d like to understand the mitigations you’ve got in place here so we can prevent this in the future.
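
On our side, one stopgap would be a staleness check that alerts when the newest synced event is older than some threshold. A minimal sketch, assuming we fetch the latest “createdAt” value ourselves; `is_stale` and the 15-minute threshold are hypothetical choices of mine, not anything Moralis provides:

```python
from datetime import datetime, timedelta, timezone


def is_stale(latest_created_at, now, max_age):
    """Return True when the newest synced event is older than `max_age`."""
    return now - latest_created_at > max_age


# Last event seen before the outage (date assumed from the log lines above).
latest = datetime(2021, 11, 17, 14, 41, 19, tzinfo=timezone.utc)
max_age = timedelta(minutes=15)  # hypothetical threshold, tune per contract

# A few minutes after the last event: still within the threshold.
shortly_after = datetime(2021, 11, 17, 14, 45, 0, tzinfo=timezone.utc)
# An hour into the outage: the check should fire.
mid_outage = datetime(2021, 11, 17, 15, 41, 19, tzinfo=timezone.utc)

print(is_stale(latest, shortly_after, max_age))
print(is_stale(latest, mid_outage, max_age))
```

This would only tell us that sync has stalled, not why, and it would raise false alarms on contracts that legitimately emit events less often than the threshold.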

Is this on the BSC network?

You say there was a 2-hour downtime during which no events were processed. After that, were the missing events properly inserted, or were they skipped completely?

On Polygon. All skipped events were processed after the delay. Processing them took ~20 minutes, which leads me to believe it’s a node problem and the node was catching up (events usually sync far quicker than that). If the sync is a “pull” by the server, though, I could see it being a problem with some long job running on the server. I can’t imagine a job that would block for 2 hours, but you can probably tell from the logs.

OK, I remember now: there was a downtime of ~1 hour on Polygon yesterday for web3api. This should be related to that.

We’re loving the functionality of Moralis, but its stability is leaving a little to be desired. Is there a roadmap for fixing this stuff? I worry that your lunch is going to get eaten by someone who nails the stability/reliability side of things.

That problem from yesterday with the Polygon network was fixed, and it was a one-time problem.