
A Series of Unfortunate Events: April 14 Foodnoms Outage Postmortem

Foodnoms recently experienced its worst outage in years. Here’s what went wrong, what I learned, and what I’m doing to prevent it from happening again.

A week ago, the Foodnoms Database suffered its worst outage in over five years: 5–6 hours of degraded service.

This outage did not make the Foodnoms app completely unusable, thanks to its local-first design. Users could still log items from their library, as well as previously logged items. However, barcode scans and database searches either failed or returned incomplete results.

No one is asking for a postmortem, but I wanted to write one up anyway. This isn't a formal writeup, but it should give you a detailed look at what went wrong and what I learned from the experience.

What Happened

Typesense Cluster Failure

Around 3pm PDT on April 14, I started to receive reports that Foodnoms search results were failing.

For context, search results in the app are powered by Typesense Cloud, a hosted "search-as-a-service" solution. I went to check the Typesense cluster to see what the metrics were indicating.
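For a rough sense of what that involves (the collection and field names below are illustrative, not the actual schema), a search against the cluster via the typesense JavaScript client looks something like this:

```typescript
import Typesense from "typesense";

// Illustrative sketch only: the real collection schema and keys differ.
const client = new Typesense.Client({
  nodes: [{ host: "xyz.a1.typesense.net", port: 443, protocol: "https" }],
  apiKey: process.env.TYPESENSE_SEARCH_KEY ?? "",
  connectionTimeoutSeconds: 2,
});

async function searchFoods(query: string) {
  // Each in-app search turns into a request like this against the cluster.
  return client
    .collections("foods")
    .documents()
    .search({ q: query, query_by: "name,brand" });
}
```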

By the time I checked, CPU usage was pegged at 100% on all nodes, and had been for several minutes. I gave it 5 more minutes, expecting it to eventually die down, but that didn't happen.

Typesense is a great service, but it is fairly limited in its monitoring and operations capabilities. It doesn't offer a reboot command; if it did, I would have triggered the nodes to stop and restart.

The only mechanism available to me was a cluster configuration change. I requested a change to add more resources (RAM) to the cluster. I expected this to cause the cluster to reboot and come back online.

Rolling Back

While I was waiting for the cluster configuration change, I suspected that the root cause of the maxed-out CPU load was a code change I had deployed earlier that day.

The code change upgraded the OpenAI model for Foodnoms AI to use gpt-4.1. Foodnoms AI uses Typesense as part of its processing pipeline, and I figured that something went haywire there and introduced too much load on the Typesense cluster.

I immediately rolled back the deployment as a precaution.

Wiping Production

After I rolled back the deployment and while I was waiting for the Typesense cluster configuration change to take effect (which it never did), I started to look closer at the code change and theorize what might have induced the higher load.

I suspected that this change caused more queries to be issued concurrently to the cluster. I decided to roll forward with a fix for that instead of reverting the code change from the Git repo.
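In simplified form, that kind of fix looks something like the sketch below (the helper and the limit value are illustrative, not the code I actually shipped): instead of firing every lookup at once, the pipeline caps how many Typesense queries are in flight at a time.

```typescript
import pLimit from "p-limit";

// Illustrative sketch: cap concurrent Typesense queries from the
// Foodnoms AI pipeline rather than issuing them all at once.
const limit = pLimit(4); // at most 4 in-flight searches

// Placeholder for the real Typesense query helper.
async function searchFood(term: string): Promise<unknown> {
  return []; // ...issue the actual Typesense search here...
}

async function lookUpCandidates(terms: string[]) {
  // Without the limiter, terms.map(searchFood) would start every query
  // simultaneously and pile load onto the cluster.
  return Promise.all(terms.map((term) => limit(() => searchFood(term))));
}
```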

This is where things went really wrong. As part of developing and testing this change, I was using a local instance of VS Code with a Dockerized version of the app. And at some point, I recreated the Docker container.

When the Docker container is created, it runs through a series of setup scripts. One of those scripts runs a db seed command, which prepares the local database for testing.

The issue: the postgres instance that the Docker container was connected to wasn't a local test database. It was production. I had run db seed against production.

How could this possibly happen? Well, I had recently been leveraging a capability with AWS to securely proxy an RDS database, normally inaccessible from the public internet, through an EC2 instance to a port on my local machine.

I have only been using this proxy capability for a couple months. Before then, I was in the habit of SSHing into an EC2 machine whenever I needed production access.

The new proxy capability has been quite handy for various testing I've needed to do, but I put way too little thought into the risk this setup posed.

Confirming Backups Work

Two hours into the outage, the Typesense cluster was still unhealthy and the configuration change still hadn't taken effect. The production DB had also been wiped, but I hadn't realized it yet.

I forget exactly what prompted me, but at some point I checked what searching in the Foodnoms app actually looked like for users. It was still failing, but not consistently. Something else was odd, though: when search results did load, no matter what I searched for, I got a single result in the generic section: "Water".

I went to the Typesense management UI and tried the same search there. I noticed that there were only 300-ish records in the index.

Soon I started to piece together what happened. I confirmed with a query on the production DB: select count(*) from food. Shit.

I immediately spun up a new RDS instance restoring from the latest backup, which had finished earlier that day. (I can't remember the last time I tested backups. I think I've only ever done it once, just to confirm that they actually work.)

At around 5pm PDT, I had a new RDS instance deployed with hundreds of thousands of rows. Whew.

I then triggered deployments to several services that depend on the database.

Restoring Service

After confirming there was still no movement on the Typesense cluster configuration change, I filed a support request and spun up a new Typesense cluster.

Thankfully, I had the foresight to make the Typesense cluster information live-configurable. My plan: once the new cluster was online, I would trigger a full reindex of the data, then update the live configuration to point the Foodnoms app at the new cluster.
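In simplified form, the idea is that the services build their Typesense connection from a small config document fetched at runtime (the shape below is illustrative, not the real config), so switching clusters is a config update rather than a redeploy:

```typescript
import Typesense from "typesense";

// Illustrative shape of the live configuration; real field names differ.
interface SearchClusterConfig {
  host: string;   // e.g. "new-cluster.a1.typesense.net"
  apiKey: string; // search-only API key for that cluster
}

async function makeSearchClient(
  fetchConfig: () => Promise<SearchClusterConfig>
) {
  // Cluster details are fetched at runtime instead of baked into a build,
  // so pointing the app at a new cluster doesn't require a release.
  const config = await fetchConfig();
  return new Typesense.Client({
    nodes: [{ host: config.host, port: 443, protocol: "https" }],
    apiKey: config.apiKey,
    connectionTimeoutSeconds: 2,
  });
}
```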

Unfortunately, building the search index from scratch takes 3–4 hours. I decided to update the live configuration before indexing finished, because results would start becoming visible to users even while the index was incomplete.

By around 8 or 9pm, the search index on the new Typesense cluster was fully populated. At that point, the worst of the outage was over.

Cleaning Up

I got a reasonably prompt response from the cofounder of Typesense, who was finally able to get the original cluster back to a healthy state.

The next day, I began downgrading resources back to normal operating levels. After waiting a couple of days, I finished terminating the unused resources.

I also reached out to Typesense support again to ask for help understanding the root cause of the high CPU load. They suspect it was due to a known bug in the version of Typesense my cluster was running.

Summary of Mistakes

  1. I failed to keep my Typesense cluster up-to-date.
  2. I was careless with the production DB, proxying it directly to my local machine.
  3. I was too optimistic and waited too long for the Typesense cluster configuration change to take effect.
  4. I did not have any automatic monitoring for the Typesense cluster, so I didn't know about the outage until users started to report issues.

Remediation Steps

  1. New rule: no more proxying the production DB to my local machine unless I really need it!
  2. I added an assertion to the seed command to ensure it only ever runs against a local database (see the sketch after this list).
  3. I added a new Datadog monitor to frequently test that the Typesense cluster is healthy.
  4. I set a reminder to test production DB backups once a quarter.
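Here's roughly what the assertion from item 2 looks like (a simplified sketch; the env var name and hosts are made up). The tricky part is that a proxied production database and a genuinely local one can both show up as localhost, so a hostname check alone isn't enough; the guard also requires an explicit opt-in that only the local dev container sets:

```typescript
import assert from "node:assert";

// Simplified sketch of the guard; the real check may differ.
function assertLocalDatabase(databaseUrl: string) {
  const { hostname } = new URL(databaseUrl);

  // A proxied production DB can also appear as localhost, so this
  // first check is necessary but not sufficient on its own.
  assert(
    ["localhost", "127.0.0.1", "db"].includes(hostname),
    `Refusing to seed non-local host: ${hostname}`
  );

  // Only the local dev container sets this flag, so a shell with
  // production proxied to localhost still gets rejected.
  assert(
    process.env.ALLOW_DB_SEED === "1",
    "Refusing to seed: ALLOW_DB_SEED is not set"
  );
}

// Called at the top of the seed entrypoint, before any writes happen.
assertLocalDatabase(process.env.DATABASE_URL ?? "");
```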

Up until last week, Foodnoms had an excellent uptime track record. I want to say, “it was bound to happen eventually.”

This outage cost me in three ways:

  1. Additional, costly compute resources.
  2. My time and full attention last Monday.
  3. A measurable dip in revenue.

I will have outages in the future, but I hope that the next one won't be as bad.

If you were using Foodnoms during the outage: I’m sorry for the disruption, and I sincerely appreciate your patience and support.