
Google Cloud experienced a major, widespread outage last week, and days later released a detailed report explaining what went wrong.
Multiple Google Cloud and Google Workspace products experienced increased 503 errors in external API
requests, impacting customers.
In a status report, the company blamed the outage on an “invalid automated quota update to our API management system which was distributed globally,
causing external API requests to be rejected.”
To recover, Google bypassed the “quota check,” which restored service in most regions within two hours, but the quota policy database in us-central1 became overloaded, resulting in a much longer recovery in that region.
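To make that recovery step concrete, here is a minimal sketch, not Google's actual code, of how a quota check can be wrapped with an operator-controlled bypass so requests are admitted ("fail open") instead of rejected with 503s when the policy store is unhealthy. All names here (quotaBypass, lookupQuotaPolicy, checkQuota) are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaBypass is a hypothetical operator-controlled switch used during recovery.
var quotaBypass = true

var errPolicyUnavailable = errors.New("quota policy store unavailable")

// lookupQuotaPolicy stands in for a read against the quota policy database.
func lookupQuotaPolicy(project string) (bool, error) {
	return false, errPolicyUnavailable // simulate the failing policy lookup
}

// checkQuota decides whether to admit a request.
func checkQuota(project string) bool {
	if quotaBypass {
		return true // operators bypassed the check to restore service
	}
	allowed, err := lookupQuotaPolicy(project)
	if err != nil {
		// Failing open keeps traffic flowing when the policy store is unhealthy,
		// rather than returning 503s for every external API request.
		return true
	}
	return allowed
}

func main() {
	fmt.Println("request admitted:", checkQuota("example-project"))
}
```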
Its customers had intermittent API and user-interface access, but Google said existing streaming and Infrastructure-as-a-Service resources -- the cloud computing model that provides on-demand access to servers, storage, and networking over the internet -- were not impacted.
Cloud computing has become as important as, or more important than, search and ad services because it has become the backbone of Google’s business. Since May 1, 2025, Google Cloud has signed several significant deals related to AI and expanding cloud infrastructure.
OpenAI signed a deal with Google Cloud for
additional computing capacity for AI model training and operation. The partnership seeks to reduce OpenAI's reliance on one cloud vendor and increase the availability of computing resources.
Google Cloud has also announced a partnership with Databricks, integrating Gemini models into the Databricks Data Intelligence Platform.
Amadeus and Google Cloud partnered to advance
cloud-based operations and AI innovation in the travel industry.
When Google Cloud failed, several products had moderate residual impact for up to one hour after the primary issue was mitigated, and a small number took longer to fully recover.
Google completed an Incident Report that described the root cause. “We deeply apologize for the impact this outage has had.
Google Cloud customers and their users trust their businesses to Google, and we will do better,” the company wrote in
a blog post. “We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements
to help avoid outages like this moving forward.”
Despite all its technology, experience and knowledge, Google will need to do better, because unplugging the internet is no longer an option. Too many companies rely on it to run their businesses, and Google has too many competitors to afford these types of mistakes.
Google had added a new feature to Service Control on May 29 to enable additional quota policy checks.
“This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code,” Google’s report explains. A red button to turn off that particular policy serving path had been built in, though it had not yet been exercised.
The change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the
binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected,
the issue would have been caught in staging," according to the report.
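A rough sketch of what that distinction looks like in code, using invented types and names rather than the actual Service Control implementation: dereferencing a nil field without a check crashes the process, while a feature-flag guard plus explicit nil handling degrades gracefully.

```go
package main

import "fmt"

// Limits and QuotaPolicy stand in for the policy record read from the policy store.
type Limits struct {
	RequestsPerMinute int
}

type QuotaPolicy struct {
	Limits *Limits // nil when the policy row arrives with blank fields
}

// newQuotaChecksEnabled is a hypothetical feature flag, meant to be enabled
// project by project and region by region so problems surface early.
var newQuotaChecksEnabled = true

// unsafeEvaluate mirrors the unprotected code path: a nil Limits field causes
// a nil pointer dereference and the whole binary panics.
func unsafeEvaluate(p *QuotaPolicy) int {
	return p.Limits.RequestsPerMinute
}

// safeEvaluate guards the new code path behind the flag and handles the nil
// case instead of crashing.
func safeEvaluate(p *QuotaPolicy) (int, error) {
	if !newQuotaChecksEnabled {
		return 0, nil // new checks not yet rolled out here
	}
	if p == nil || p.Limits == nil {
		return 0, fmt.Errorf("policy has no limits set; skipping new check")
	}
	return p.Limits.RequestsPerMinute, nil
}

func main() {
	policy := &QuotaPolicy{Limits: nil}
	if limit, err := safeEvaluate(policy); err != nil {
		fmt.Println("handled gracefully:", err)
	} else {
		fmt.Println("limit:", limit)
	}
	// unsafeEvaluate(policy) would panic with a nil pointer dereference,
	// which is the kind of crash the report describes.
}
```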
Last Thursday, Google made a policy change, inserting it into what the company calls "the regional Spanner tables that Service Control uses for policies." That policy change was what finally exercised the untested code path, and the report walks through the specific details.
Within two minutes, Google's Site Reliability Engineering team had triggered its incident response. Within 10 minutes, the root cause was identified and the red button (to disable the serving path) was being put in place. The red button was ready to roll out about 25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed and recovery began across regions, starting with the smaller ones first.
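The "red button" itself can be pictured as a kill switch read on every request, so flipping one value disables the problematic serving path without shipping a new binary. The following is a minimal sketch under that assumption, with invented names; it is not the mechanism Google actually uses.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// policyServingDisabled is the hypothetical red button. In a real system it
// would live in a config store that every region watches.
var policyServingDisabled atomic.Bool

func handleRequest(project string) string {
	if policyServingDisabled.Load() {
		// Red button pressed: skip the policy serving path entirely and fall
		// back to default behavior instead of crashing or rejecting requests.
		return "served with default policy"
	}
	return "served with full policy checks"
}

func main() {
	fmt.Println(handleRequest("example-project"))
	policyServingDisabled.Store(true) // operators press the red button
	fmt.Println(handleRequest("example-project"))
}
```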
Google has also put together a detailed process for responding in the event that this happens again, beyond simply freezing the system.