Elevated response times and error rates for the Authentication API

Current Status: Resolved | Last updated at December 5, 2018, 1:12 UTC

Affected environments: US-1 Production

We have now completed our full RCA. Please see here for the full details.


History for this incident

November 29, 2018 0:11 UTC

Resolved

Auth0 service has stabilized for all customers. We have started working on our public post-mortem.

November 28, 2018 23:13 UTC

Monitoring

We are continuing to monitor the incident and have started the RCA. We will keep this status in monitoring and continue to provide updates.

November 28, 2018 20:52 UTC

Monitoring

At 14:44 UTC our incident response team began investigating an alarm showing increased response times for the Authentication API, affecting all customers in the US (PROD) environment. At 16:14 UTC we discovered issues in our MongoDB cluster that soon began affecting all customers in the US region. The issue then worsened, not only increasing response times but also generating errors for customers across different APIs. From 16:30 UTC until 18:35 UTC (the worst period of the incident), 18.88% of all requests in the US environment failed. With all hands on deck, we found issues related to a couple of database queries that significantly impacted the cluster. We immediately began applying changes and redirected traffic to a different application cluster, after which the service began to normalize. We are still working diligently to improve the situation and will provide an in-depth Root Cause Analysis as soon as possible. We are deeply sorry for the inconvenience this issue has caused you, your users, and your customers.

November 28, 2018 19:11 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

November 28, 2018 19:09 UTC

Identified

We're seeing a decrease in the number of errors from our services. We continue to monitor the situation and will post updates as we make progress.

November 28, 2018 18:13 UTC

Investigating

We continue to investigate this issue. We're working to stabilize our services in prod-us.

November 28, 2018 16:56 UTC

Identified

We're continuing to work on a fix. Response times are slowly normalizing and errors are going down; we'll post an update as soon as we're back to normal.

November 28, 2018 16:35 UTC

Identified

The issue has been identified and a fix is being implemented.

November 28, 2018 15:13 UTC

Investigating

We are currently investigating this issue.