Authentication API and Search API Issues due to AWS connectivity

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

Affected environments: US-1 Preview, US-1 Production

On October 18th, between 20:36 UTC and 21:52 UTC, all Auth0 services (Authentication API, Management API, Dashboard) in our two US environments (PROD and PREVIEW) failed intermittently due to AWS issues, resulting in a partial outage that affected all end users in these environments. We regret that we cannot quantify the number of failed requests, as we normally would: because the root cause was network connectivity issues in our upstream provider, AWS, many requests never reached our servers and were therefore never counted. We are investigating alternative ways to measure impact during similar incidents in the future, as explained in the What We're Doing About It section below.

We apologize for this service disruption that may have affected you and your customers. We will now explain what happened, how we responded to the incident, and what we are doing to continually improve our response to similar incidents in the future.

What Happened

Both of Auth0’s US environments (PROD and PREVIEW) are primarily hosted in the AWS us-west-2 region. This region experienced network connectivity issues from 20:36 UTC to 21:15 UTC, as reported on the AWS Personal Health Dashboard.

These AWS network connectivity issues caused all of Auth0’s external services (Authentication API, Management API, Dashboard) to fail intermittently, as our internal services in the region weren’t able to reliably connect to one another. We identified network connectivity issues as the cause of the disruption at 20:44 UTC, and confirmed that the issues were with AWS at 20:59 UTC. We unfortunately don't know how many requests failed during this period (23 minutes), because the failed requests never reached our servers. We are investigating adding additional logging to the DNS service so that we can track these requests, even when they don't reach our servers due to network connectivity issues.

Once we confirmed the AWS issue in us-west-2 as the cause, we started our failover procedure to use services in our secondary us-west-1 AWS region, where we keep full PROD and PREVIEW environments on standby for incidents such as this one. The procedure took 6 minutes, after which the Authentication API service was fully restored. However, the Management API and Dashboard services were only partially restored: the search API endpoints (and the Dashboard actions that rely on them) continued to fail due to a misconfiguration in our secondary environment. We monitored the status of the upstream AWS issue while also working to fix the search API misconfiguration.
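
To illustrate the DNS-based failover step, here is a minimal sketch of repointing a Route 53 record from the primary region's endpoint to the standby region's using boto3. The hosted zone ID, record name, and target hostname are hypothetical placeholders, not our actual configuration, and the real procedure involves more than a single record change.

  # Hypothetical sketch: repoint a Route 53 CNAME from the primary region's
  # endpoint to the standby region's. All identifiers below are placeholders.
  import boto3

  route53 = boto3.client("route53")

  HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"              # placeholder zone ID
  RECORD_NAME = "auth.example-tenant.com."              # placeholder record
  STANDBY_TARGET = "standby-lb.us-west-1.example.com"   # placeholder target

  def fail_over_to_standby():
      """Upsert the CNAME so clients resolve to the standby (us-west-1) stack."""
      route53.change_resource_record_sets(
          HostedZoneId=HOSTED_ZONE_ID,
          ChangeBatch={
              "Comment": "Fail over to standby region",
              "Changes": [{
                  "Action": "UPSERT",
                  "ResourceRecordSet": {
                      "Name": RECORD_NAME,
                      "Type": "CNAME",
                      "TTL": 60,  # short TTL so the change takes effect quickly
                      "ResourceRecords": [{"Value": STANDBY_TARGET}],
                  },
              }],
          },
      )

  if __name__ == "__main__":
      fail_over_to_standby()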

AWS indicated their issue was resolved at 21:15 UTC. We then started testing and monitoring the us-west-2 network so we could be confident it was reliable enough to fail back to. At 21:48 UTC we started our recovery procedure to return to using services in our primary us-west-2 region, before we had fixed the search API misconfiguration in our secondary environment. The recovery procedure took 4 minutes, and all services were fully restored at 21:52 UTC.

Timeline

  • 20:36 UTC: AWS us-west-2 network connectivity issues began.
  • 20:39 UTC: Incident response team received the first alert.
  • 20:44 UTC: Incident response team identified intermittent connectivity in the internal network.
  • 20:59 UTC: Incident response team received confirmation that AWS us-west-2 was experiencing networking issues.
  • 21:00 UTC: Incident response team decided to run failover procedure to move to our secondary us-west-1 region.
  • 21:05 UTC: Secondary region started receiving traffic, and authentication errors started to subside.
  • 21:15 UTC: AWS reported us-west-2 network connectivity had stabilized.
  • 21:48 UTC: Incident response team started recovery procedure to move back to the primary us-west-2 region.
  • 21:52 UTC: Primary region started receiving traffic and all services started to stabilize.

What We're Doing About It

While we cannot prevent disruptions happening in the services of our upstream providers like AWS, we can better prepare for and respond to them. Here’s what we are doing to be better prepared and to respond more effectively in the future:

  • [in-progress] Update our failover procedures to include direct links to resources, reducing execution time.
  • [in-progress] Investigate ways of further automating the failover procedure, for even faster execution time.
  • [in-progress] Update configuration and automation in our secondary environments so we don’t have search API failures in the event of a future failover.
  • [backlog] Add end-to-end tests for the search API endpoints in the secondary environment so that we know it is ready in the event of a future failover (a sketch of such a probe follows this list).
  • [backlog] Configure query logging for Route 53 DNS entries so we have additional visibility into network errors, and can quantify failed requests in the event of a future similar incident (see the second sketch after this list).
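
As a rough illustration of the end-to-end search checks mentioned above, the probe below queries the Management API user-search endpoint in the standby environment and reports whether it answers successfully. The standby domain and Management API token are hypothetical placeholders read from environment variables; this is a sketch under those assumptions, not our actual test suite.

  # Hypothetical probe for the user-search endpoint in the standby environment.
  # STANDBY_TENANT_DOMAIN and MGMT_API_TOKEN are placeholders, not real values.
  import os
  import requests

  STANDBY_DOMAIN = os.environ["STANDBY_TENANT_DOMAIN"]  # e.g. a standby tenant hostname
  MGMT_API_TOKEN = os.environ["MGMT_API_TOKEN"]         # token with read:users scope

  def search_is_healthy() -> bool:
      """Return True if a simple user-search query succeeds against the standby stack."""
      resp = requests.get(
          f"https://{STANDBY_DOMAIN}/api/v2/users",
          params={"q": 'email:"healthcheck@example.com"'},
          headers={"Authorization": f"Bearer {MGMT_API_TOKEN}"},
          timeout=10,
      )
      return resp.status_code == 200

  if __name__ == "__main__":
      print("search OK" if search_is_healthy() else "search FAILED")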
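
For the Route 53 query logging item, AWS can deliver DNS query logs for a public hosted zone to CloudWatch Logs, which would give us a record of resolution attempts even when requests never reach our servers. A minimal sketch of enabling it with boto3 follows; the hosted zone ID and log group ARN are placeholders (Route 53 requires the log group to live in us-east-1).

  # Sketch: enable Route 53 query logging for a public hosted zone, sending
  # DNS query logs to CloudWatch Logs. Both identifiers are placeholders.
  import boto3

  route53 = boto3.client("route53")

  HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"  # placeholder zone ID
  LOG_GROUP_ARN = (
      "arn:aws:logs:us-east-1:123456789012:"  # Route 53 requires us-east-1
      "log-group:/aws/route53/example-zone"   # placeholder log group
  )

  config = route53.create_query_logging_config(
      HostedZoneId=HOSTED_ZONE_ID,
      CloudWatchLogsLogGroupArn=LOG_GROUP_ARN,
  )
  print("Query logging config id:", config["QueryLoggingConfig"]["Id"])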

Summary

We realize that Auth0 is a critical part of your architecture, and a core technology on which you depend daily. We apologize for any issues this outage may have caused for your business, and will continue to work to provide you with the best authentication service possible.

Thank you for your continued partnership with Auth0.


History for this incident

October 18, 2017 23:24 UTC

Resolved

This incident has been resolved.

October 18, 2017 21:54 UTC

Monitoring

Response team pointed DNS back to primary datacenter and will be monitoring over the next few hours in case the issue comes up again. All components are restored.

October 18, 2017 21:45 UTC

Identified

Connectivity issues in AWS seem to be resolved. Response team is switching back to primary DC to fully recover the services and will continue to monitor.

October 18, 2017 21:20 UTC

Identified

Failover was successful. Logins and dashboard are working. Search API is not recovered yet.

October 18, 2017 21:15 UTC

Investigating

We activated the failover region. The issue seems to be related to AWS connectivity issues: https://status.aws.amazon.com/

October 18, 2017 20:47 UTC

Investigating

We are seeing timeouts when processing authentication requests.