On May 16th, Auth0 customers experienced high response times when using the user and log search features in the US environment. Additionally, results were outdated due to a delay in indexing. High response times and indexing delays started at 13:06 UTC; response times normalized at 16:14 UTC. User search results were up to date at 17:50 UTC and log search results at 19:36 UTC.
I would like to apologize for the impact this issue had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
In the past few weeks our monitoring system detected heap size issues in the ElasticSearch cluster that powers the user and log search features in the US environment. Our core infrastructure team made several improvements to our ElasticSearch automation, and we decided to roll out those changes on May 16th. This meant temporarily doubling the cluster size: we would add the new nodes alongside the old ones, then remove the old nodes until only the new ones remained.
The cluster had 7 data nodes at the time, and we decided to replace it with a 9-node cluster. We created the new nodes, adding first 2 and then 7 more, and then we waited. From previous experience, we expected the cluster to start relocating shards (both primaries and replicas) from the old nodes to the new ones and to take some time to stabilize. Once the cluster had normalized, we would remove one old node, wait for it to normalize again, and repeat until only new nodes remained. We didn't expect any negative impact on indexing or search speed.
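For context, watching a cluster stabilize during this kind of rolling replacement comes down to polling the standard Elasticsearch cluster health API until shards stop moving. The sketch below is illustrative only, not our internal tooling; the endpoint URL and polling interval are placeholders.

```python
# Minimal sketch (not Auth0's actual tooling): poll cluster health while new
# data nodes join and shards relocate. Assumes the Elasticsearch REST API is
# reachable at ES_URL and the `requests` library is installed.
import time
import requests

ES_URL = "http://localhost:9200"  # hypothetical cluster endpoint


def wait_for_stable_cluster(poll_seconds=30):
    """Block until no shards are relocating/initializing and status is green."""
    while True:
        health = requests.get(f"{ES_URL}/_cluster/health").json()
        print(
            f"status={health['status']} "
            f"relocating={health['relocating_shards']} "
            f"initializing={health['initializing_shards']} "
            f"unassigned={health['unassigned_shards']}"
        )
        if (
            health["status"] == "green"
            and health["relocating_shards"] == 0
            and health["initializing_shards"] == 0
        ):
            return health
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_stable_cluster()
```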
At 13:06 UTC (exactly 30 minutes after we started this process) we got the first alert from our monitoring system indicating that user indexing was delayed. We saw an increase in search response times and a large increase in indexing response times. At 13:09 UTC we got a similar alert for the log indexing process.
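To make the alert concrete: a delay check of this kind can be as simple as comparing the newest indexed timestamp against the clock. The sketch below is purely hypothetical; the index name, timestamp field, and threshold are assumptions and not our actual monitoring configuration.

```python
# Hypothetical illustration of an indexing-delay check: compare the newest
# document timestamp in an index with the current time and alert on lag.
# The index name, field name, and threshold below are assumptions.
from datetime import datetime, timezone
import requests

ES_URL = "http://localhost:9200"   # hypothetical endpoint
INDEX = "logs"                     # hypothetical index name
TIMESTAMP_FIELD = "date"           # hypothetical timestamp field
MAX_LAG_SECONDS = 300              # hypothetical alert threshold


def indexing_lag_seconds():
    """Return seconds between now and the most recently indexed document."""
    query = {
        "size": 1,
        "sort": [{TIMESTAMP_FIELD: {"order": "desc"}}],
        "_source": [TIMESTAMP_FIELD],
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query).json()
    latest = resp["hits"]["hits"][0]["_source"][TIMESTAMP_FIELD]
    latest_ts = datetime.fromisoformat(latest.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - latest_ts).total_seconds()


if __name__ == "__main__":
    lag = indexing_lag_seconds()
    if lag > MAX_LAG_SECONDS:
        print(f"ALERT: indexing is {lag:.0f}s behind")
```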
We started monitoring more closely and looking for better metrics and settings we could tweak to speed up the process; since shards had already started moving, we didn't want to make the situation worse by removing the new nodes or changing the topology in any other way.
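For illustration, these are the kinds of transient cluster settings that influence how quickly Elasticsearch recovers and rebalances shards. The values below are examples only; this report does not cover which, if any, of these settings we actually changed.

```python
# Illustrative sketch only: transient cluster settings that affect shard
# recovery and rebalance speed. Values are examples, not the commands we ran.
import requests

ES_URL = "http://localhost:9200"  # hypothetical endpoint

transient_settings = {
    "transient": {
        # Raise the per-node recovery throughput ceiling.
        "indices.recovery.max_bytes_per_sec": "200mb",
        # Allow more shards to rebalance across the cluster at once.
        "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
        # Allow more concurrent recoveries per node.
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
    }
}

resp = requests.put(f"{ES_URL}/_cluster/settings", json=transient_settings)
resp.raise_for_status()
print(resp.json())
```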
At 13:55 UTC we stopped all the indexing processes to reduce the load on the cluster in the hope that it would recover faster; this improved the situation, but the cluster was still not recovering fast enough.
Upon closer review of the changes we made with the new automation, we found that we had increased the number of replicas per index from 2 to 5. We have several indices, with shards ranging from a few hundred MB to 130 GB, so the cluster was busy creating new replicas of pre-existing shards and relocating them according to disk space and availability heuristics, which increased disk, CPU, and network usage. As soon as we saw this, we reduced the number of replicas back to 2, which dramatically improved the speed at which the cluster was recovering.
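A minimal sketch of that corrective step, using the standard index settings API, looks like the following; the endpoint URL is a placeholder, and `_all` simply targets every index on the cluster.

```python
# Minimal sketch: set number_of_replicas back to 2 across all indices.
# The endpoint URL is a placeholder; `_all` targets every index.
import requests

ES_URL = "http://localhost:9200"  # hypothetical endpoint

resp = requests.put(
    f"{ES_URL}/_all/_settings",
    json={"index": {"number_of_replicas": 2}},
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} when the change is accepted
```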
After the load returned to usual levels and shards stopped relocating, we restarted the user indexing process. We decided it would be better to drain that queue and fix the delay for one feature, then move on to the other, instead of doing both at the same time and risking a) too much indexing load and b) taking longer to fix both features.
With both features back to normal, we started removing the old nodes one by one, which had no impact on customers. While we did this, we also began collecting what we had learned, both to avoid this problem in the future and to help us index faster if it does happen again.
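For reference, the sketch below shows one standard way to drain a data node before removing it, using allocation filtering to move its shards off first. The IP addresses and timings are placeholders, not our actual values or tooling.

```python
# Sketch: drain an old data node before removing it. Exclude the node's IP
# from allocation, wait for its shards to move off, then shut it down out of
# band. IPs and timings below are placeholders.
import time
import requests

ES_URL = "http://localhost:9200"   # hypothetical endpoint
OLD_NODE_IPS = ["10.0.0.11"]       # placeholder IP of the node to retire


def drain_node(ip, poll_seconds=60):
    # Tell the allocator to move every shard off this node.
    requests.put(
        f"{ES_URL}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._ip": ip}},
    ).raise_for_status()
    # Wait until no shards remain on the excluded node.
    while True:
        shards = requests.get(f"{ES_URL}/_cat/shards?format=json").json()
        remaining = [s for s in shards if s.get("ip") == ip]
        if not remaining:
            return
        print(f"{len(remaining)} shards still on {ip}, waiting...")
        time.sleep(poll_seconds)


for node_ip in OLD_NODE_IPS:
    drain_node(node_ip)
    # The node can now be stopped and removed from the cluster.
```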
These are the actions we've taken:
We realize how important it is that all of our services and search, in particular, works flawlessly. We take our commitment to reliability and transparency very seriously and regret letting you down. We have learned lessons that we will incorporate in the future to help us prevent similar situations and react faster if they were to occur.
Thank you for your understanding and your continued support of Auth0.
Dirceu Pereira Tiegs, Production Engineer
This incident has been resolved.
User indexing is back to normal; log indexing is being processed.
The user indexing backlog is being processed, and should be complete in less than 2 hours.
We're still working on increasing capacity. Once our cluster is stable and fully replicated we will resume processing.
We identified an indexing delay for users and logs in the US environment and are provisioning more capacity.