Image API requests are failing

Incident Report for Imagizer Cloud

Postmortem

Imagizer Cloud services suffered an outage today, June 18, from 12:01 PM to 12:28 PM PDT. The majority of image requests failed with an HTTP 500 error response. Some images continued to be served from the CDN caching layer.

The cause of the outage was a corrupted master DB server. The DB instance was taken down accidentally through operator error: development tables were mistakenly migrated onto the production DB instance, which effectively overwrote all the production records.

Internal monitoring systems quickly reported the problem, and recovery began immediately. Imagizer engineering restored the database schemas and table contents from a snapshot taken earlier this morning.

The following modifications have been or will be put into place to prevent a future outage.

The production environment will be shielded from direct access from our development environment.
DB recovery will be further automated to shorten recovery time.
Caching of Imagizer source configurations will be extended from the current 3-minute TTL to a 1-hour TTL, which will allow images to be served longer without a working DB (for example, should it need to be down for maintenance).
Best practices for deployment to production will be reviewed and improved.
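
To illustrate why a longer TTL on cached source configurations helps during a DB outage, here is a minimal sketch of a TTL cache. It is not Imagizer's actual implementation: the class name `SourceConfigCache`, the `loader` callback, and the injectable `clock` are all hypothetical names chosen for this example. The idea is simply that a fresh cache entry is served without touching the DB, so the TTL bounds how long a cached source can survive a DB failure.

```python
import time
from typing import Callable, Dict, Tuple


class SourceConfigCache:
    """TTL cache for source configurations (hypothetical, for illustration only)."""

    def __init__(
        self,
        loader: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader      # assumed to read the config from the DB
        self._ttl = ttl_seconds    # e.g. 180 (3 minutes) or 3600 (1 hour)
        self._clock = clock        # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, source_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(source_id)
        if entry is not None and now - entry[0] < self._ttl:
            # Entry is still fresh: serve it without touching the DB.
            return entry[1]
        # Entry is missing or expired: the DB must be reachable here,
        # otherwise the loader's exception propagates to the caller.
        config = self._loader(source_id)
        self._entries[source_id] = (now, config)
        return config
```

Under these assumptions, an entry cached shortly before a 27-minute DB outage (the length of this incident) would still be served with `ttl_seconds=3600`, whereas with `ttl_seconds=180` it would expire after 3 minutes and the next lookup would fail. The trade-off is that configuration changes can take up to one TTL to become visible.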

Posted Jun 18, 2019 - 16:10 PDT

Resolved

The issue has been resolved.

Posted Jun 18, 2019 - 12:38 PDT

This incident affected: Website and Image API (us-east-1).