Increased Errors for Users Search in US

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

On July 10th, between 17:17 UTC and 18:30 UTC, we had intermittent errors in the user endpoints of the Management API in the US environment. During this period, 3.9% of Management API requests failed.

What happened

A batch of Management API calls resulted in the unnecessary creation of multiple indexes in our ElasticSearch cluster. This caused a rapid increase in CPU usage on the ElasticSearch nodes until they became unresponsive and could no longer process requests.

New indexes are created by our indexing process, which takes the metadata of all new and updated users and stores it in ElasticSearch. In this case, a batch of users with complex, nested metadata was stored, which sent index creation into an (almost) infinite loop.
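This report does not say which metadata shapes triggered the loop or what limits apply. As a rough illustration only, a caller could measure how deeply nested a metadata document is before sending it. The function name metadata_depth is hypothetical, and any threshold you compare against is your own assumption.

```python
def metadata_depth(value):
    """Return the nesting depth of a JSON-like value (scalars have depth 0)."""
    if isinstance(value, dict):
        # A dict adds one level on top of its deepest value.
        return 1 + max((metadata_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        # A list adds one level on top of its deepest element.
        return 1 + max((metadata_depth(v) for v in value), default=0)
    return 0


# Hypothetical user metadata: dict -> dict -> list -> dict is four levels.
example = {"preferences": {"teams": ["core", {"role": "admin"}]}}
depth = metadata_depth(example)
```

A check like this only flags unusually deep documents on the client side. It says nothing about how the indexing process itself handles them.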

Once we detected the issue we applied a hotfix that stopped the creation of additional indexes and restarted the ElasticSearch nodes to normalize their resource usage. A permanent fix was rolled out to all environments a few hours later.

Timeline

17:17: Our monitoring system detected that errors started happening in the US environment
17:20: Our incident response team started investigating
18:02: Stopped indexing processes
18:04: Performed a full cluster restart
18:16: Started indexing processes
18:21: Issue happened again, stopped indexing processes
18:30: Last errors were detected by our monitoring system
18:38: Found the root cause
18:40: Applied a hotfix

What are we doing about it?

Add a safeguard to prevent too many indexes from being created in a short period of time: done
Update documentation to be more clear about metadata limits: in process
Add audit logs to help our customers detect indexing issues: in process
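The first action item, a safeguard against too many indexes being created in a short period of time, could be sketched as a sliding-window limit. This is a minimal illustration under assumptions, not the fix that was deployed: the class name IndexCreationGuard, its parameters, and the example limits are all hypothetical.

```python
import time
from collections import deque


class IndexCreationGuard:
    """Refuses index creation once too many happened within a time window."""

    def __init__(self, max_creations, window_seconds, clock=time.monotonic):
        self.max_creations = max_creations
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the guard can be tested
        self._timestamps = deque()   # times of recent allowed creations

    def allow(self):
        """Return True and record the creation if it fits in the window."""
        now = self._clock()
        # Forget creations that are older than the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_creations:
            return False
        self._timestamps.append(now)
        return True


# Hypothetical limit: at most 2 index creations per 60 seconds.
fake_now = [0.0]
guard = IndexCreationGuard(2, 60, clock=lambda: fake_now[0])
results = [guard.allow(), guard.allow(), guard.allow()]
```

An indexing process would call allow() before creating an index and skip or alert when it returns False, so a runaway loop is capped instead of exhausting the cluster.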

Summary

We are sorry about this issue. We have learned lessons that we will incorporate, and we have identified action items that we will work on to help prevent similar situations.

Thank you for your understanding and your continued support of Auth0.

Dirceu Pereira Tiegs, Production Engineer

History for this incident

July 10, 2017 17:30 UTC

Resolved

This incident has been resolved.

July 10, 2017 17:14 UTC

Identified

We are looking into errors coming from the users search endpoint.
