
We are researching API issues in the US region

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

Affected environments: US-1 Preview, US-1 Production

On May 22, Auth0 customers in our US environment experienced errors calling our APIs. Errors started at 8:50 UTC in both the PROD and PREVIEW environments and stopped at 9:04 UTC and 9:20 UTC, respectively.

I would like to apologize for the impact this outage had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

As part of our high availability strategy, our US environment has database nodes in multiple regions and cloud providers, securely connected by VPN tunnels. On May 22nd at 8:50 UTC we received an IKE_SA initiation request from the VPN tunnel service in our Microsoft Azure cluster. The tunnel was already established on the AWS side and had not received an IKE_SA delete request, so it failed to install the new IPsec policies, as they already existed.

This left the VPN tunnel in an inconsistent state: the IPsec service on the AWS side kept the tunnel up, while the VPN service on the Azure side kept trying to establish a new tunnel every few seconds.
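
For illustration, here is a minimal sketch of the kind of check-and-recover step that clears such a half-open tunnel. It assumes a strongSwan-style `ipsec` CLI; the connection name "aws-azure" is hypothetical and not taken from our actual configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch: detect a half-open IKE/IPsec tunnel and force a clean
re-establish. Assumes a strongSwan-style `ipsec` CLI; the connection name
"aws-azure" is hypothetical."""
import subprocess

CONN = "aws-azure"  # hypothetical connection name


def tunnel_established(conn: str) -> bool:
    # `ipsec status <conn>` reports ESTABLISHED IKE_SAs and INSTALLED
    # CHILD_SAs when the connection is healthy.
    out = subprocess.run(
        ["ipsec", "status", conn], capture_output=True, text=True, check=False
    ).stdout
    return "ESTABLISHED" in out and "INSTALLED" in out


def bounce_tunnel(conn: str) -> None:
    # Tear the connection down on this side so stale IPsec policies are
    # removed, then bring it back up so both peers agree on a fresh IKE_SA.
    subprocess.run(["ipsec", "down", conn], check=False)
    subprocess.run(["ipsec", "up", conn], check=True)


if __name__ == "__main__":
    if not tunnel_established(CONN):
        bounce_tunnel(CONN)
```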

The inconsistent VPN state caused MongoDB elections to fail: whenever a node's view of the other members changed, a new election began, and there were 33 primary elections in 9 minutes. Whenever the primary changes, our applications need to close their database connections and open new ones to the new primary. Because the VPN was unhealthy, services could not reach the primary node when it was elected on the other side of the tunnel.
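
As an illustration of what this means for application code, the sketch below retries a read after the driver reconnects to a newly elected primary. PyMongo is assumed as the driver; the hosts, database, and collection names are hypothetical.

```python
"""Minimal sketch of client-side handling for a MongoDB primary election:
retry the read after the driver reconnects to the new primary. PyMongo is
assumed; the hosts, database, and collection names are hypothetical."""
import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient("mongodb://node-a,node-b,node-c/?replicaSet=rs0")


def find_user_with_retry(user_id, retries=5, backoff=0.5):
    # While an election is in progress there is briefly no primary; the
    # driver raises AutoReconnect (or a subclass) until it finds the new one.
    for attempt in range(retries):
        try:
            return client.auth_db.users.find_one({"_id": user_id})
        except AutoReconnect:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("no primary became reachable in time")
```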

Timeline

  • 08:50 UTC VPN service enters inconsistent state. Our MongoDB nodes in different clusters can't see each other

  • 08:50 UTC Our monitoring systems generated alerts due to authentication failures in the US environment

  • 09:01 UTC We reestablished the VPN tunnel. This fixed the MongoDB issue

  • 09:02 UTC We found that our core authentication service was not restarting by itself, and ran a script to restart it on all nodes

  • 09:04 UTC Service was completely restored in the PRODUCTION environment

  • 09:12 UTC We continued to receive alerts from the PREVIEW environment

  • 09:15 UTC We found that a service used for authentication with Auth0 as the provider (e.g. database connections, passwordless) was failing to start in the PREVIEW environment after the database failure

  • 09:22 UTC We restarted all PREVIEW nodes

  • 09:20 UTC Service is restored in all environments

What we’re doing about it

Auth0's infrastructure is designed to withstand single VPN tunnel failures: our design allowed one tunnel to go down without requiring new elections to take place. It did not, however, handle this inconsistent state.

On June 6th, we completed a long-running project that changed our MongoDB cluster setup so that primary elections no longer depend on VPN connectivity. The VPN will be used to maintain an up-to-date set of replicas for Disaster Recovery purposes, and VPN issues will not interfere with MongoDB elections.
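
As a rough illustration of this kind of change (not our exact configuration), the sketch below uses PyMongo to make the replicas reached over the VPN hidden, non-voting, priority-0 members, so they stay in sync for Disaster Recovery but never take part in elections. The host names and naming convention are hypothetical.

```python
"""Minimal sketch of the replica-set change described above: members reached
over the VPN become hidden, non-voting, priority-0 replicas, so they stay in
sync for Disaster Recovery but never take part in elections. PyMongo is
assumed; host names are hypothetical."""
from pymongo import MongoClient

# Connect directly to the current primary (hypothetical host name).
client = MongoClient("mongodb://primary-node:27017/?directConnection=true")

config = client.admin.command("replSetGetConfig")["config"]
for member in config["members"]:
    if member["host"].startswith("azure-"):  # hypothetical naming for cross-VPN DR nodes
        member["priority"] = 0   # can never be elected primary
        member["votes"] = 0      # does not count toward election majorities
        member["hidden"] = True  # hidden from clients; used for DR only

config["version"] += 1  # a reconfig must bump the config version
client.admin.command("replSetReconfig", config)
```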

This new setup means that single-AZ failures in AWS will not cause any cluster issues. Disaster Recovery scenarios will involve the execution of playbooks to restore full functionality, and we are working to improve the time it takes to execute these playbooks. After years of running with the previous setup, we believe the new one will provide better overall availability.

Summary

We realize how important it is that all of our services, and our authentication APIs in particular, work flawlessly. We take our commitment to reliability and transparency very seriously and regret letting you down. We have learned lessons that we will incorporate in the future to help us prevent similar situations and react faster if they were to occur.

Thank you for your understanding and your continued support of Auth0.

Sebastian Rodriguez, Production Engineer


History for this incident

May 22, 2017 9:40 UTC

Resolved

All systems have been stable for 10 minutes. We are closing this incident

May 22, 2017 9:30 UTC

Monitoring

We have fixed some remaining issues in our US environment

May 22, 2017 9:23 UTC

Investigating

We are still looking into some intermittent errors

May 22, 2017 9:08 UTC

Monitoring

We have implemented a fix and all is back to normal

May 22, 2017 9:04 UTC

Identified

An issue with our database caused API calls to fail. We are working on a fix

May 22, 2017 8:51 UTC

Investigating

We are currently investigating this issue.