In the early days of Clarisights (mid-2018), we had a problem where our Sidekiq VM would restart on some weekends.
We used to use Sidekiq to process and ingest user-uploaded CSV files into our data warehouse.
The user flow went like this:
- Users upload big CSV files, or we pull these daily from their servers (Metabase)
- We start a Sidekiq job to process these in the background
- Sidekiq is configured to retry in case of transient failures
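A minimal sketch of that flow, assuming Sidekiq's standard `Sidekiq::Job` API (the class name, queue name, and retry count here are made up, not our actual code):

```ruby
require "csv"
begin
  require "sidekiq" # the real app loads this; optional for the sketch
rescue LoadError
  # Minimal stand-in so the sketch also runs without the gem installed.
  module Sidekiq
    module Job
      def self.included(base)
        def base.sidekiq_options(*); end
      end
    end
  end
end

class CsvIngestJob
  include Sidekiq::Job
  sidekiq_options queue: "ingestion", retry: 5 # retry transient failures

  def perform(csv_path)
    ingested = 0
    CSV.foreach(csv_path, headers: true) do |row|
      # transform `row` and write it into the warehouse (elided)
      ingested += 1
    end
    ingested
  end
end

# Enqueued after an upload or a daily pull, e.g.:
# CsvIngestJob.perform_async("/uploads/daily-export.csv")
```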
It looks simple, right?
One fine Monday, a customer reported that their data for Sunday had not been processed. After some digging, we found that our Sidekiq VM had restarted on Sunday and come back online automatically.
We saw it had restarted because it ran out of memory. We attributed that to a bunch of retries (due to some bad CSVs) we saw around the same time, and moved on.
The next weekend it was fine, and then after a few more weeks it happened again. We tried debugging, but no luck; we added more instrumentation and moved on.
Fast forward to the next week: while optimizing the CSV processing code, I found a memory leak. It was a slow leak, but a leak nonetheless. All of a sudden it added up: that was why it only happened on some weekends.
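The actual bug was in our CSV pipeline; as a purely hypothetical illustration of the shape of such a leak, imagine a process-level cache keyed by file path that is never evicted, so every processed file leaves its parsed rows behind for the life of the Sidekiq process:

```ruby
require "csv"
require "tempfile"

# Hypothetical sketch of a slow leak, not the actual Clarisights code:
# a process-level cache that is never cleared, so each distinct file
# processed leaves its parsed rows pinned in memory forever.
class CsvProcessor
  PARSED_CACHE = {} # lives for the life of the worker process

  def process(path)
    rows = PARSED_CACHE[path] ||= CSV.read(path, headers: true)
    rows.size # stand-in for the real ingestion work
  end
end

processor = CsvProcessor.new
3.times do |i|
  file = Tempfile.new(["upload-#{i}", ".csv"])
  file.write("id,value\n1,a\n2,b\n")
  file.flush
  processor.process(file.path)
end

# Each distinct file adds an entry that is never released:
puts CsvProcessor::PARSED_CACHE.size # => 3
```

Each individual job looks fine; only the long-running process accumulates the damage.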
We used to deploy multiple times a day, and during each deploy we restart Sidekiq to load the new code.
Because of these frequent deploys, the leak was hidden and never showed up in monitoring. When we lined up restarts, memory growth, and deploys, it was clear: the mysterious restarts happened on weekends with zero deploys.
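With made-up numbers (not our real leak rate or VM size), the masking effect is easy to see: a leak that adds a few dozen megabytes per hour never matters if deploys restart Sidekiq every few hours, but over a deploy-free weekend it compounds past the VM's memory:

```ruby
# Illustrative numbers only -- not the actual leak rate or VM size.
leak_mb_per_hour = 80
vm_memory_mb     = 4096

weekday_gap_hours = 6   # restarts come for free with frequent deploys
weekend_gap_hours = 60  # Friday evening to Monday morning, zero deploys

puts leak_mb_per_hour * weekday_gap_hours  # 480 MB  -- never noticed
puts leak_mb_per_hour * weekend_gap_hours  # 4800 MB -- past the VM's memory, OOM
```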
We ran an isolated test to verify it, and sure enough, that was the cause. So that was my mysterious weekend restarts story: a slow memory leak hiding behind deploys 🙂
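A minimal version of such an isolated test (a sketch, not our original test, and assuming the leak retains Ruby objects; real checks would also watch the process RSS): run the job in a loop, force a full GC after each run, and see whether retained memory keeps climbing instead of returning to a baseline:

```ruby
require "objspace"

# Hypothetical verification sketch: repeat the leaky work, force a full
# GC after each run, and measure how many bytes live strings still hold.
leaky_buffer = [] # stand-in for whatever the real job accidentally retained

run_job = lambda do
  leaky_buffer << ("x" * 1_000_000) # ~1 MB retained per run, never released
end

sizes = 5.times.map do
  run_job.call
  GC.start(full_mark: true, immediate_sweep: true)
  ObjectSpace.memsize_of_all(String) # bytes held by live strings after GC
end

puts sizes # grows by roughly 1 MB per run -- the leak signature
```

A leak-free job would plateau after GC; a steady upward trend across runs is the smoking gun.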
For me, this was a perfect reminder that correlation can point you toward the cause, but correlation is not causation.
That’s all folks, Stay In, Stay Safe 👋