On April 20th, Auth0 experienced intermittent errors and high response times in its users and logs search features in the US environment, followed by a delay in indexing. The errors and high response times started at 5:33 AM UTC and were fixed at 10:17 AM UTC; the indexing delays were fixed at 10:47 AM UTC for users and 12:02 PM UTC for logs. From 5:33 AM UTC until 10:17 AM UTC, approximately 9.6% of search requests failed.
I would like to apologize for the impact this issue had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
Due to an erroneous configuration management change, our ElasticSearch data nodes lost access to the UDP port used by the NTP service. We do have monitoring for NTP failure scenarios, but the monitor for clock drift was not running frequently enough to detect the problem before it caused issues.
With NTP effectively unavailable, the clock on each server started to drift, with small but increasing differences between servers. During the incident, the ElasticSearch cluster discovery mechanism failed because of these clock differences, so the nodes that we use to support search and indexing had trouble connecting with each other and making requests to the data cluster. Because our monitoring did not detect that NTP was down, it took us considerable time to find the root cause.
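As an illustration of the kind of clock-drift check described above, here is a minimal sketch in Python. It is not Auth0's actual monitor: the host names, the 0.5-second threshold, and the helper names (NtpSample, clock_offset, drift_alert) are all assumptions made for this example. It applies the standard NTP offset and round-trip delay formulas to the four timestamps of a single request/response exchange and flags hosts whose offset exceeds the threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NtpSample:
    """Four timestamps (seconds) from one NTP request/response exchange."""
    t0: float  # request sent, read from the local clock
    t1: float  # request received, read from the server clock
    t2: float  # response sent, read from the server clock
    t3: float  # response received, read from the local clock


def clock_offset(s: NtpSample) -> float:
    """Estimated offset of the local clock relative to the server.

    Positive means the local clock is behind the server.
    """
    return ((s.t1 - s.t0) + (s.t2 - s.t3)) / 2.0


def round_trip_delay(s: NtpSample) -> float:
    """Network round-trip time, excluding server processing time."""
    return (s.t3 - s.t0) - (s.t2 - s.t1)


def drift_alert(samples: Dict[str, NtpSample],
                max_offset_seconds: float = 0.5) -> List[str]:
    """Return the hosts whose absolute offset exceeds the threshold.

    The 0.5 s default is an arbitrary value chosen for illustration.
    """
    return sorted(
        host for host, s in samples.items()
        if abs(clock_offset(s)) > max_offset_seconds
    )


# Hypothetical samples: es-data-1 is in sync, es-data-2 is 2 seconds behind.
samples = {
    "es-data-1": NtpSample(t0=10.0, t1=10.25, t2=10.25, t3=10.5),
    "es-data-2": NtpSample(t0=100.0, t1=102.0, t2=102.0, t3=100.0),
}
print(drift_alert(samples))
```

A check like this only helps if it runs often enough and if a failure to obtain any sample at all (for example, because the NTP port is blocked) is itself treated as an alert.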
Once we found and fixed the NTP issue, a new master node was elected, the indexing backlog started being processed without issues, and all search requests started succeeding again.
The intermittent errors also caused some log searches to return data from a wrong starting point; e.g. 30 days instead of 1 day. This was isolated and didn't affect all requests; we've identified the issue and are currently working on a fix for it.
To improve the automatic detection of future issues and the time it takes to recover from delays, these are the actions we're taking:
We realize how important it is that all of our services, and search in particular, work flawlessly; we take our commitment to reliability and transparency very seriously and regret letting you down. We have learned lessons that we will incorporate in the future to help us react faster to similar situations.
Thank you for your understanding and your continued support of Auth0.
Dirceu Pereira Tiegs
Production Engineer
The indexing backlog has been processed and all systems are operational.
User indexing is back to normal. Log indexing is being processed.
We are going through the queue backlog to reindex changes. This might take 45 to 90 minutes.
We have identified the root cause of the issue and applied a fix to the search cluster. The search API is back to normal response times and error rates.
Response times are still fluctuating, and error rates have improved to 50%, down from ~85%. The response team continues working on the issue.
Response times for the Users search API are higher than usual, and requests sometimes time out. The response team continues to work on a fix.
We are investigating intermittent errors with our Users and Logs search endpoints on the management API. The underlying cause is our search engine, and the response team is looking into it. The runtime is not affected unless you are using the user search API in a rule (e.g. to link accounts).
We are looking into intermittent failures of the User Search API in US deployments.