Image API requests are failing on us-east-1
Incident Report for Imagizer Cloud
Postmortem

Early this morning, the Imagizer Cloud east coast zone experienced a large-scale outage. The majority of image requests to our east coast cluster failed with a timeout.

The outage began at 5:50 am PDT and ended at 8:10 am PDT, affecting all east coast regional customers.

What happened and why:

Although many factors contributed to the outage, the trigger was a large increase in resource-intensive requests. The query overload caused a rapid increase in server creation via autoscaling.

Unfortunately, due to an artificial restriction, the cluster was unable to scale quickly enough to absorb all of the requests coming from HomeSnap.
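
To illustrate how a restriction on scaling can leave part of a traffic spike unserved, here is a minimal sketch. It assumes the restriction behaves like a hard cap on server count. The function names, the cap of 16 servers, and the traffic figures are all hypothetical and do not reflect our actual configuration.

```python
def desired_servers(load_rps, rps_per_server, max_servers=None):
    """Servers an autoscaler would request for a load, optionally capped."""
    needed = -(-load_rps // rps_per_server)  # ceiling division
    if max_servers is not None:
        needed = min(needed, max_servers)    # the cap wins over demand
    return max(needed, 1)


def unserved_rps(load_rps, rps_per_server, max_servers=None):
    """Requests per second that exceed the capacity the cluster can reach."""
    servers = desired_servers(load_rps, rps_per_server, max_servers)
    return max(load_rps - servers * rps_per_server, 0)


# Hypothetical spike: 12,000 req/s, 500 req/s per server, cap of 16 servers.
spike, per_server, cap = 12_000, 500, 16

print(desired_servers(spike, per_server))       # 24 servers needed
print(desired_servers(spike, per_server, cap))  # 16 servers allowed
print(unserved_rps(spike, per_server, cap))     # 4000 req/s left unserved
print(unserved_rps(spike, per_server))          # 0 once the cap is lifted
```

In this toy model, requests beyond the capped capacity have nowhere to go, which is consistent with the timeouts customers saw.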

We also failed to react in time to mitigate the outage.

Immediate actions taken:

  • On-call/Nagios alert sensitivity increased.
  • On-call procedures reviewed with the on-call team.
  • Escalation actions reviewed.
  • Overload mitigation emergency plan reviewed and enhanced.
  • Cluster size restrictions lifted.
  • EC2 server size doubled.

We take all outages seriously and always work hard to improve our methods so that we can provide the safest, most stable environment for our clients. With the actions taken above, we expect a significant improvement in stability across all of our services, with minimal, if any, disruption in the future.

Posted Jun 18, 2019 - 16:41 PDT

Resolved
The majority of requests to our image API are failing with a timeout
Posted Jun 12, 2019 - 05:50 PDT