It became apparent last week that the internet is highly susceptible to even tiny failures in software.
A week earlier, Cloudflare customers faced a notable outage when Verizon unintentionally rerouted IP packets after it incorrectly accepted a misconfiguration from an internet service provider in Pennsylvania, USA. A couple of days later, the Cloudflare failure resulted from a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), which caused a spike in CPU usage on Cloudflare's network that then spread across its locations worldwide.
The incident lasted 30 minutes. Due to the worldwide outage of Cloudflare's network, visitors to Cloudflare-proxied domains received 502 errors. It affected thousands of prominent websites, including some major tech brands.
The rules were introduced in a simulate mode, in which issues are identified and logged according to the new rules but no client traffic is blocked. This is done so that Cloudflare can assess false-positive rates and ensure that the new rules do not cause problems when deployed to full production. But because the new rules also contained the regular expression that caused all the havoc, things didn't go according to plan.
The CPU exhaustion incident was unprecedented, as the company had never experienced global CPU exhaustion before, according to Cloudflare. After finding the true cause of the problem, Cloudflare pulled the plug on the new WAF Managed Rules, which immediately brought CPU usage back to normal and restored ordinary web traffic.
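
A rule running in simulate mode is still evaluated against every request, so its matching cost has to be checked separately from its false-positive rate. The following added example (not Cloudflare's code or its actual rule) sketches a pre-deployment gate that rejects a regex whose match time explodes on a slightly longer non-matching input; the rule names, patterns, probe string and thresholds are constructed assumptions.

```python
import re
import time

# Constructed sample rules (hypothetical names; not Cloudflare's actual WAF rules).
CANDIDATE_RULES = {
    "rule_nested_quantifier": r"(a+)+$",  # nested quantifier: backtracks exponentially
    "rule_flat_quantifier": r"a+$",       # accepts the same strings, matches in linear time
}

PROBE_SIZES = (14, 18, 22)  # lengths of the near-miss probe input
FLOOR_SECONDS = 0.02        # timings below this are treated as noise
GROWTH_FACTOR = 8           # largest/smallest timing ratio treated as runaway


def match_seconds(pattern, text, repeats=2):
    """Best-of-N wall time for one anchored match attempt."""
    compiled = re.compile(pattern)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        compiled.match(text)
        best = min(best, time.perf_counter() - start)
    return best


def has_runaway_backtracking(pattern):
    """True if match time explodes as a non-matching probe grows by a few bytes."""
    # Probe tailored to the sample rules: a long run the rule almost matches,
    # followed by one byte that forces the overall match to fail.
    timings = [match_seconds(pattern, "a" * n + "!") for n in PROBE_SIZES]
    return timings[-1] > FLOOR_SECONDS and timings[-1] > GROWTH_FACTOR * timings[0]


def gate_rules(rules):
    """Split rules into (accepted, rejected) before they reach a log-only rollout."""
    accepted, rejected = [], []
    for name, pattern in sorted(rules.items()):
        (rejected if has_runaway_backtracking(pattern) else accepted).append(name)
    return accepted, rejected


if __name__ == "__main__":
    ok, bad = gate_rules(CANDIDATE_RULES)
    print("accepted:", ok)
    print("rejected:", bad)
```

A production gate would generate probes per rule rather than use a fixed string, or run rules on a regex engine that guarantees linear-time matching.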