
High Error Percentage in users and log search (AU)

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

Affected environments: AU Production, AU Preview

On August 18th, 2017, Auth0 customers in the AU PROD and PREVIEW environments experienced errors and high latency when using the User and Log search API endpoints. This happened between 00:08 UTC and 00:58 UTC, and between 02:12 UTC and 02:50 UTC.

During the first period, 52% of tenants received errors; during the second, 46% did.

We would like to apologize for the impact this had on you and your customers, and to explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

Our User and Logs search features use an ElasticSearch cluster as their storage engine. User updates and activity logs are written to ElasticSearch asynchronously. This is done for performance reasons and to ensure that the database component we use to process authentication transactions is isolated from these operations. User and Logs search requests read from that ElasticSearch cluster.
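
To illustrate this pattern, here is a minimal sketch of an asynchronous indexing path. It is an illustration only, not our actual pipeline: it assumes the @elastic/elasticsearch v8 client, and the in-memory queue, index name, and event shape are hypothetical.

```typescript
import { Client } from "@elastic/elasticsearch";

// Hypothetical shape of an activity log event emitted by the authentication pipeline.
interface LogEvent {
  tenantId: string;
  type: string;
  occurredAt: string;
  details: Record<string, unknown>;
}

const es = new Client({ node: "http://localhost:9200" });
const pending: LogEvent[] = [];

// The authentication path only enqueues the event; it never waits on ElasticSearch.
export function recordLogEvent(event: LogEvent): void {
  pending.push(event);
}

// A background loop drains the queue and bulk-indexes events,
// keeping the transactional database isolated from search writes.
async function flush(): Promise<void> {
  if (pending.length === 0) return;
  const batch = pending.splice(0, pending.length);
  try {
    await es.bulk({
      operations: batch.flatMap((e) => [{ index: { _index: "tenant-logs" } }, e]),
    });
  } catch (err) {
    // Put the batch back so the next tick retries it.
    pending.unshift(...batch);
    console.error("indexing failed, will retry on next flush", err);
  }
}

setInterval(() => void flush(), 1000);
```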

An unusual load increase related to user creation and deletion operations resulted in a large number of changes being indexed in a short period of time in our ElasticSearch cluster in AU. This caused a memory shortage, leaving the cluster unresponsive to both write and read operations.

During the first incident period (00:08-00:58 UTC), our team found the cluster in an error state and executed the cluster recovery procedure, as restoring service was the highest priority. This brought both features back to a working state, and the team put together a list of items to continue investigating during work hours.

During the second incident period (02:12-02:50 UTC), while repairing the cluster, we found that a large number of the requests being performed came from load/stress testing (not valid usage of our API). At this point, we activated a load shedding mechanism to drop those requests. Once the cluster was repaired and the extra load was shed, service returned to normal.
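
As a rough illustration (not our actual mechanism), load shedding can be as simple as a middleware that rejects flagged traffic before it reaches the search backend. This sketch assumes an Express-style service; the x-client-id header and the shed list are hypothetical.

```typescript
import { Request, Response, NextFunction } from "express";

// Hypothetical set of client identifiers flagged as invalid (e.g. load-test) traffic.
const shedList = new Set<string>(["load-test-client"]);

// Load-shedding middleware: drop flagged requests before they hit the search cluster.
export function shedInvalidTraffic(req: Request, res: Response, next: NextFunction): void {
  const clientId = req.header("x-client-id") ?? "unknown";
  if (shedList.has(clientId)) {
    res.status(429).json({ error: "request rejected: traffic is being shed" });
    return;
  }
  next();
}
```

In practice, how traffic is flagged and which status code is returned depend on the situation; the point is simply that flagged requests are turned away cheaply instead of reaching an already overloaded cluster.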

Auth0 has a variety of rate limiting mechanisms in place to prevent traffic spikes from affecting our services. This particular combination of operations was within those limits; however, the limits were higher than what the provisioned cluster could handle.
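
To give a sense of how such limits work in general, a per-tenant write limit can be modeled as a token bucket. The sketch below is generic; the capacity and refill rate are made-up values, not our actual limits.

```typescript
// Generic token-bucket limiter; the capacity and refill rate are illustrative only.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Refill based on elapsed time, then try to consume one token.
  tryTake(): boolean {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per tenant: e.g. a burst of 50 write operations, refilling at 10 per second.
const buckets = new Map<string, TokenBucket>();

export function allowWrite(tenantId: string): boolean {
  let bucket = buckets.get(tenantId);
  if (!bucket) {
    bucket = new TokenBucket(50, 10);
    buckets.set(tenantId, bucket);
  }
  return bucket.tryTake();
}
```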

Timeline

  • 00:08 UTC: Our monitoring system detected the first spike of error responses
  • 00:56 UTC: The cluster healing procedure completed; no more errors occurred
  • 02:12 UTC: Second spike of error responses began
  • 02:30 UTC: Found initial evidence that automated requests (invalid usage) were being performed against the service
  • 02:44 UTC: Activated load shedding mechanisms to drop invalid traffic
  • 02:49 UTC: The cluster healing procedure completed; no more errors occurred

What we’re doing about it

We are going to work on the following things in the short term:

  • [done] Increase the ES cluster size and number of nodes to better handle load spikes
  • [in progress] Put stricter limits on user write operations

In addition to this, we have been working for some time on a large set of improvements to all of our Users and Logs search infrastructure to continuously improve how we handle increasing customer demand. This is a long-running project that we are close to completing, and it should improve the general performance and reliability of these features.

Summary

We realize how important it is that all of our services work flawlessly. We take our commitment to reliability and transparency very seriously and regret letting you down. We have learned lessons from this incident that will help us prevent similar situations and react faster if they were to occur.

Thank you for your understanding and your continued support of Auth0.


History for this incident

August 18, 2017 3:41 UTC

Resolved

All back to normal. No errors have occurred in 45 minutes

August 18, 2017 2:57 UTC

Monitoring

Cluster is back to a healthy state and no more errors are occurring

August 18, 2017 2:32 UTC

Investigating

We are seeing a high error percentage on user and log search again

August 18, 2017 1:01 UTC

Monitoring

We have restarted the search cluster and performance has normalized. We will be monitoring the performance closely.

August 18, 2017 0:35 UTC

Investigating

We are currently investigating this issue.