
Error rate increase in Search API

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

Affected environments: US-1 Preview, US-1 Production

On March 20th, Auth0 experienced intermittent errors and high response times in its users and logs search features in the US environment, followed by a delay in indexing. The errors and high response times started at 5:33 AM UTC and were resolved at 10:17 AM UTC; the indexing delays were resolved at 10:47 AM UTC for users and at 12:02 PM UTC for logs. From 5:33 AM UTC until 10:17 AM UTC, approximately 9.6% of search requests failed.

I would like to apologize for the impact this issue had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

Due to an erroneous configuration management change, our ElasticSearch data nodes lost access to the UDP port used by the NTP service (port 123). We do monitor for NTP failure scenarios, but the check that covers a drifting clock was not running frequently enough to detect the problem before it caused issues.
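
As an illustration of the kind of check we are adding, here is a minimal sketch in Python (using the ntplib package; the server name and threshold are assumptions, not our actual configuration) that queries an NTP server over UDP port 123 and alerts both when the server is unreachable and when the local clock offset drifts past a threshold:

    import sys
    import ntplib

    MAX_OFFSET_SECONDS = 0.5      # illustrative drift threshold
    NTP_SERVER = "pool.ntp.org"   # assumed server, not our production pool

    def check_clock_drift():
        client = ntplib.NTPClient()
        try:
            response = client.request(NTP_SERVER, version=3, timeout=5)
        except Exception as exc:
            # If UDP port 123 is blocked, as it was in this incident,
            # the request itself fails and should alert immediately.
            print("CRITICAL: cannot reach NTP server: %s" % exc)
            return 2
        if abs(response.offset) > MAX_OFFSET_SECONDS:
            print("WARNING: local clock offset is %.3f seconds" % response.offset)
            return 1
        print("OK: local clock offset is %.3f seconds" % response.offset)
        return 0

    if __name__ == "__main__":
        sys.exit(check_clock_drift())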

With NTP effectively unavailable, the clock on each server started to drift, with small but growing differences between nodes. The ElasticSearch cluster discovery mechanism began failing because of those clock differences, so the nodes that support search and indexing had trouble connecting to each other and making requests to the data cluster. Because our monitoring did not detect that NTP was down, it took us considerable time to find the root cause.

Once we found and fixed the NTP problem, a new master node was elected, search requests started succeeding again, and the indexing backlog was processed without further errors.

The intermittent errors also caused some log searches to return data from the wrong starting point, e.g. 30 days back instead of 1 day. This was isolated and did not affect all requests; we have identified the cause and are working on a fix.

Timeline

  • 5:33 AM: The first connection failure to ElasticSearch was detected by our exception tracking system.
  • 8:08 AM: Customer support reported that our customers were having issues and we started investigating.
  • 10:07 AM: We fixed the issue in the ElasticSearch cluster.
  • 10:17 AM: Cluster went back to normal.
  • 10:47 AM: User indexing caught up.
  • 12:02 PM: Log indexing caught up.

What we’re doing about it

To improve automatic detection of future issues and reduce the time it takes to recover from indexing delays, we are taking the following actions:

  • Improve NTP monitoring to run more frequently and to account for more failure scenarios: in progress.
  • Improve the ElasticSearch monitoring to raise the alert level in case of repeated failures (see the sketch after this list).
  • Improve the log indexing service to make recovery faster; in particular, make it easier to scale up when a large backlog of unindexed logs needs to be processed quickly.
  • Refactor the user indexing service to make recovery faster.
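
To make the second item above more concrete, here is a minimal sketch, assuming a reachable ElasticSearch _cluster/health endpoint; the URL, thresholds, and alert actions are illustrative rather than our production setup. It escalates from a warning to a page only after several consecutive failed health checks:

    import time
    import requests

    HEALTH_URL = "http://localhost:9200/_cluster/health"  # assumed address
    WARN_AFTER = 2   # consecutive failures before a warning
    PAGE_AFTER = 5   # consecutive failures before paging on-call

    def poll_cluster_health(interval_seconds=30):
        consecutive_failures = 0
        while True:
            try:
                status = requests.get(HEALTH_URL, timeout=5).json().get("status")
                healthy = status in ("green", "yellow")
            except requests.RequestException:
                healthy = False
            if healthy:
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= PAGE_AFTER:
                    print("PAGE: cluster unhealthy for %d checks" % consecutive_failures)
                elif consecutive_failures >= WARN_AFTER:
                    print("WARN: cluster unhealthy for %d checks" % consecutive_failures)
            time.sleep(interval_seconds)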

Summary

We realize how important it is that all of our services, and search in particular, work flawlessly. We take our commitment to reliability and transparency very seriously, and we regret letting you down. We have learned lessons from this incident that will help us react faster to similar situations in the future.

Thank you for your understanding and your continued support of Auth0.

Dirceu Pereira Tiegs

Production Engineer


History for this incident

March 20, 2017 12:01 UTC

Resolved

The indexing backlog has been processed and all systems are operational.

March 20, 2017 10:47 UTC

Monitoring

User indexing is back to normal. Log indexing is being processed.

March 20, 2017 10:25 UTC

Monitoring

We are going through the queues backlog to reindex changes. This might take 45 to 90 minutes.

March 20, 2017 10:16 UTC

Monitoring

We have identified the root cause of the issue and applied a fix to the search cluster. The search API is back to normal response times and error rates.

March 20, 2017 10:02 UTC

Identified

Response times are still fluctuating, and error rates have improved to ~50%, down from ~85%. The response team continues working on the issue.

March 20, 2017 9:21 UTC

Identified

Response times for the Users search API are higher than usual and some requests time out. The response team continues working on a fix.

March 20, 2017 9:04 UTC

Investigating

We are investigating intermittent errors with our Users and Logs search endpoints on the Management API. The underlying cause is in our search engine and the response team is looking into it. The runtime is not affected unless you are using the user search API in a rule (e.g. to link accounts).

March 20, 2017 8:20 UTC

Investigating

We are looking into intermittent failures in US deployments on the User Search API.