We have now completed our full RCA. Please see here for the full details.
The Auth0 service has stabilized for all customers. We have started working on our public post-mortem.
We are continuing to monitor the incident and have started the RCA. We will keep this incident in the monitoring state and continue to provide updates.
At 14:44 UTC our incident response team began investigating an alarm showing increased response times for the Authentication API affecting all customers in the US (PROD) environment. At 16:14 UTC we traced the problem to our MongoDB cluster, which soon began affecting all customers in the US region. The impact then worsened: beyond elevated response times, customers started seeing errors across several APIs. From 16:30 UTC until 18:35 UTC, the worst period of the incident, 18.88% of all requests in the US environment failed. With all hands on deck, we identified a small number of database queries that were significantly degrading the cluster. We immediately applied changes and redirected traffic to a different application cluster, after which conditions began to normalize. We are still working diligently to improve the situation and will provide an in-depth Root Cause Analysis as soon as possible. We are deeply sorry for the inconvenience this issue has caused you, your users, and your customers.
A fix has been implemented and we are monitoring the results.
We're seeing a decrease in the number of errors across our services. We continue to monitor the situation and will post updates as we make progress.
We continue to investigate this issue and are working to stabilize our services in the prod-us environment.
We're continuing to work on a fix. Response times are slowly normalizing and errors are going down; we'll post an update as soon as we're back to normal.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.