
Google Cloud experienced a major, widespread outage last week, and days later released a detailed report explaining what went wrong.
Multiple Google Cloud and Google Workspace products experienced increased 503 errors in external API
requests, impacting customers.
In a status report, the company blamed the outage on an “invalid automated quota update to our API management system which was distributed globally,
causing external API requests to be rejected.”
To recover, Google bypassed the “quota check,” which restored service in most regions within two hours, but the quota policy database in us-central1 became overloaded, resulting in a much longer recovery in that region.
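To make that recovery step concrete, here is a minimal sketch, not Google's actual code, of how a quota check can be wrapped with an operator-controlled bypass so requests are admitted ("fail open") instead of rejected with 503s when the policy store is unhealthy. All names here (quotaBypass, lookupQuotaPolicy, checkQuota) are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaBypass is a hypothetical operator-controlled switch used during recovery.
var quotaBypass = true

var errPolicyUnavailable = errors.New("quota policy store unavailable")

// lookupQuotaPolicy stands in for a read against the quota policy database.
func lookupQuotaPolicy(project string) (bool, error) {
	return false, errPolicyUnavailable // simulate the failing policy lookup
}

// checkQuota decides whether to admit a request.
func checkQuota(project string) bool {
	if quotaBypass {
		return true // operators bypassed the check to restore service
	}
	allowed, err := lookupQuotaPolicy(project)
	if err != nil {
		// Failing open keeps traffic flowing when the policy store is unhealthy,
		// rather than returning 503s for every external API request.
		return true
	}
	return allowed
}

func main() {
	fmt.Println("request admitted:", checkQuota("example-project"))
}
```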
Its customers had intermittent API and user-interface access, but Google said existing streaming and Infrastructure-as-a-Service resources -- the cloud computing model that provides on-demand access to servers, storage, and networking over the internet -- were not impacted.
Cloud computing has become as important as, or more important than, search and ad services because it has become the backbone of Google’s business. Since May 1, 2025, Google Cloud has signed several significant deals related to AI and expanding cloud infrastructure.
OpenAI signed a deal with Google Cloud for
additional computing capacity for AI model training and operation. The partnership seeks to reduce OpenAI's reliance on one cloud vendor and increase the availability of computing resources.
Google Cloud has also announced a partnership with Databricks, integrating Gemini models into the Databricks Data Intelligence Platform.
Amadeus and Google Cloud partnered to advance
cloud-based operations and AI innovation in the travel industry.
When Google Cloud failed, several products had moderate residual impact for up to one hour after the primary issue was mitigated, and a small number took longer to fully recover.
Google completed an Incident Report that described the root cause. “We deeply apologize for the impact this outage has had.
Google Cloud customers and their users trust their businesses to Google, and we will do better,” the company wrote in
a blog post. “We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements
to help avoid outages like this moving forward.”
Despite all its technology, experience and knowledge, Google will need to do better, because unplugging the internet is no longer an option. Too many companies rely on it to run their businesses, and Google has too many competitors to afford these types of mistakes.
Google had added a new feature to Service Control on May 29 to enable additional quota policy checks.
“This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code,” Google’s report explains. A red button to turn off that particular policy serving path had been built in, though it had not yet been exercised.
The change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the
binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected,
the issue would have been caught in staging," according to the report.
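A rough sketch of what that distinction looks like in code, using invented types and names rather than the actual Service Control implementation: dereferencing a nil field without a check crashes the process, while a feature-flag guard plus explicit nil handling degrades gracefully.

```go
package main

import "fmt"

// Limits and QuotaPolicy stand in for the policy record read from the policy store.
type Limits struct {
	RequestsPerMinute int
}

type QuotaPolicy struct {
	Limits *Limits // nil when the policy row arrives with blank fields
}

// newQuotaChecksEnabled is a hypothetical feature flag, meant to be enabled
// project by project and region by region so problems surface early.
var newQuotaChecksEnabled = true

// unsafeEvaluate mirrors the unprotected code path: a nil Limits field causes
// a nil pointer dereference and the whole binary panics.
func unsafeEvaluate(p *QuotaPolicy) int {
	return p.Limits.RequestsPerMinute
}

// safeEvaluate guards the new code path behind the flag and handles the nil
// case instead of crashing.
func safeEvaluate(p *QuotaPolicy) (int, error) {
	if !newQuotaChecksEnabled {
		return 0, nil // new checks not yet rolled out here
	}
	if p == nil || p.Limits == nil {
		return 0, fmt.Errorf("policy has no limits set; skipping new check")
	}
	return p.Limits.RequestsPerMinute, nil
}

func main() {
	policy := &QuotaPolicy{Limits: nil}
	if limit, err := safeEvaluate(policy); err != nil {
		fmt.Println("handled gracefully:", err)
	} else {
		fmt.Println("limit:", limit)
	}
	// unsafeEvaluate(policy) would panic with a nil pointer dereference,
	// which is the kind of crash the report describes.
}
```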
Last Thursday, Google made a policy change, inserting it into what the company calls "the regional Spanner tables that Service Control uses for policies." That policy change was what finally exercised the untested code path, and the report walks through the specific details.
Within two minutes, Google's Site Reliability Engineering team had triggered its incident response. Within 10 minutes, the root cause was identified and the red button (to disable the serving path) was being put in place. The red button was ready to roll out about 25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed and recovery began across regions, starting with the smaller ones first.
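The "red button" itself can be pictured as a kill switch read on every request, so flipping one value disables the problematic serving path without shipping a new binary. The following is a minimal sketch under that assumption, with invented names; it is not the mechanism Google actually uses.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// policyServingDisabled is the hypothetical red button. In a real system it
// would live in a config store that every region watches.
var policyServingDisabled atomic.Bool

func handleRequest(project string) string {
	if policyServingDisabled.Load() {
		// Red button pressed: skip the policy serving path entirely and fall
		// back to default behavior instead of crashing or rejecting requests.
		return "served with default policy"
	}
	return "served with full policy checks"
}

func main() {
	fmt.Println(handleRequest("example-project"))
	policyServingDisabled.Store(true) // operators press the red button
	fmt.Println(handleRequest("example-project"))
}
```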
Google has also put together a detailed process for responding in the event that this happens again, beyond simply freezing the system.