Elevated error rates on authentication using rules in US

Current Status: Resolved | Last updated at August 3, 2018, 17:31 UTC

Affected environments: US-1 Preview, US-1 Production

Customers using Auth0 Rules between October 4th and October 25th, 2017 experienced a percentage of login transaction failures due to uncaught “Blocked Event Loop” exceptions. The issue manifested itself mostly in failures of login transactions using the Auth0 Rule created by the Authorization Extension.

During the aforementioned time period, the following percentages of login transactions using rules failed due to this error:

  • US region: 0.01%
  • EU region: 0.01%
  • AU region: 0.06%

We apologize for the inconvenience this has caused to you and your customers. This post mortem explains what happened and what we did about it.

Investigation

Auth0 Rules allow customization of login transactions by executing Node.js code to perform customer-specific business logic during each transaction. Executing custom code on a multi-tenant platform like Auth0 requires strong sandboxing and isolation guarantees. The Auth0 identity platform uses the Auth0 Extend product, based on Auth0 Webtask technology (https://auth0.com/extend), to support these guarantees. The Webtask system executes the Auth0 Rule code of each Auth0 tenant in an isolated environment called a “webtask container”.

When code executing within a webtask container blocks the Node.js event loop for longer than a preconfigured time, webtask's infrastructure throws a “Blocked Event Loop” exception. It is one of the measures we have in place to prevent erroneous, malicious, or run-away code from consuming inordinate amounts of CPU before it fails. When an uncaught exception is thrown in a webtask container, all custom scripts running within that container at the time are terminated.
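
The detection idea described above can be illustrated with a minimal sketch. This is not the webtask implementation: the function names, the 100 ms sampling interval, and the 500 ms threshold are all assumptions chosen for illustration. The sketch relies on the fact that a timer scheduled at a fixed interval fires late when the Node.js event loop is blocked, so the delay approximates how long the loop was unavailable.

```javascript
// Hypothetical sketch of event loop lag detection; not the webtask code.
// INTERVAL_MS and THRESHOLD_MS are assumed values for illustration only.
const INTERVAL_MS = 100;
const THRESHOLD_MS = 500;

// Pure helper: how late did a tick arrive compared with its schedule?
function eventLoopLag(previousTickMs, currentTickMs, intervalMs) {
  return Math.max(0, currentTickMs - previousTickMs - intervalMs);
}

// Starts a timer that calls onBlocked(lagMs) whenever a tick arrives
// more than thresholdMs late. Returns a function that stops the monitor.
function startLagMonitor(onBlocked, intervalMs = INTERVAL_MS, thresholdMs = THRESHOLD_MS) {
  let previous = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = eventLoopLag(previous, now, intervalMs);
    previous = now;
    if (lag > thresholdMs) onBlocked(lag);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}
```

An in-process timer like this one can only report a block after the loop becomes free again. A mechanism that must interrupt runaway code while it is still running would need to observe the loop from outside the blocked thread, which this sketch does not attempt.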

Since the “Blocked Event Loop” problem manifested itself primarily in login transactions executing the Auth0 Rule created by the Authorization Extension, our initial investigation focused on reviewing that extension's code for potential causes of blocking the CPU for prolonged periods. This investigation did not yield conclusive results. However, in the course of it, while analyzing logs generated by the webtask infrastructure, we uncovered that the component responsible for provisioning NPM modules within a webtask container was intermittently blocking the Node.js event loop for a period longer than the preconfigured threshold.

Root Cause

We deduced that the module provisioning logic was the root cause of the issue. The logic was modified shortly before October 4th to perform the Node.js module unpacking operation within the Node.js process itself, rather than spawning an external process to complete the job. This change was made as part of an ongoing effort to improve the performance of the system, in this case to reduce the latency of module unpacking. The unforeseen side effect of this change manifested itself only on a system under load. Because of the overall CPU load of the system, the time for which module unpacking blocked the event loop intermittently exceeded the preconfigured threshold, which resulted in the “Blocked Event Loop” exception.

How we fixed it

To fix the problem, we reverted to the previous model of unpacking NPM modules, in which an external process is spawned to perform the job. This design does not allow the unpacking logic to block the Node.js event loop. We rolled out the fix on October 25th and closely monitored the performance of the systems for 24 hours to validate that login transaction success rates in the affected regions had returned to the levels seen before the change that introduced the regression.
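
The external-process model can be sketched as follows. This is an illustration under stated assumptions, not Auth0's provisioning code: the helper names `runExternal` and `unpackModule` are hypothetical, and the use of the system `tar` binary is an assumption about how a module archive might be unpacked. The point it shows is that CPU-heavy work done in a child process leaves the parent's event loop free to keep serving other callbacks.

```javascript
// Hypothetical sketch; the helper names and the tar command are assumptions.
const { spawn } = require("node:child_process");

// Runs a command in a separate process and resolves with its exit code.
// The parent's event loop stays free while the child does the heavy work.
function runExternal(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "ignore" });
    child.on("error", reject); // e.g. the command could not be started
    child.on("close", (code) => resolve(code));
  });
}

// Example: unpack a module tarball with the system tar binary (assumed present).
function unpackModule(tarballPath, targetDir) {
  return runExternal("tar", ["-xzf", tarballPath, "-C", targetDir]);
}
```

The trade-off matches the one described in this post mortem: spawning a process adds startup latency to each unpacking operation, but the work no longer competes with rule execution for time on the event loop.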

What we will do to avoid similar situations in the future

  • We will expand our monitoring capabilities to cover leading indicators of various failure modes, including regressions in failed transaction levels.
  • We are expanding the coverage of our stress testing efforts to capture regressions related to Node.js module provisioning scenarios.
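
A check for regressions in failed transaction levels might look like the sketch below. The function names, the baseline value, and the factor of 2 are hypothetical and chosen only to illustrate the idea of comparing an observed failure rate with a known baseline.

```javascript
// Hypothetical sketch; names, baseline, and factor are assumptions.

// Failure rate as a percentage of all transactions in a time window.
function failureRatePercent(failed, total) {
  return total === 0 ? 0 : (failed / total) * 100;
}

// True when the observed rate exceeds the baseline by more than the given factor.
function isRegression(observedPercent, baselinePercent, factor = 2) {
  return observedPercent > baselinePercent * factor;
}
```

With an assumed baseline of 0.01% and the default factor, an observed rate of 0.06% would be flagged, while a rate equal to the baseline would not.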

Summary

We realize that customization through code and extensibility is a critical part of the Auth0 identity platform you depend on. We apologize for the disruption this event has caused and will continue to work on improving the performance and reliability of the system.


History for this incident

October 25, 2017 11:10 UTC

Resolved

This incident has been resolved.

October 25, 2017 10:58 UTC

Identified

We have detected increased error rates for some tenants using rules or custom DB in the US region, and we are working on a fix.