Google details 'catastrophic' cloud outage events: Promises to do better next time
Google has now offered customers a full technical breakdown of what it says was a “catastrophic failure” on Sunday, June 2, disrupting services for up to four and a half hours. The networking issues affected YouTube, Gmail, and Google Cloud users like Snapchat and Vimeo.
Earlier this week, Google’s VP of engineering Benjamin Treynor Sloss apologized to customers, admitting it had taken “far longer” than the company expected to recover from a situation triggered by a configuration mishap, which caused a 10 percent drop in YouTube traffic and a 30 percent fall in Google Cloud Storage traffic. The incident also impacted one percent of more than one billion Gmail users.
The company has now given a technical breakdown of what failed, who was impacted, and why a configuration error that Google engineers detected within minutes turned into a multi-hour outage that mostly affected users in North America.
“Customers may have experienced increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1. Google Cloud instances in us-west1, and all European regions and Asian regions, did not experience regional network congestion,” Google said in its technical report.
Google Cloud Platform services affected during the incident in these regions included Google Compute Engine, App Engine, Cloud Endpoints, Cloud Interconnect, Cloud VPN, Cloud Console, Stackdriver Metrics, Cloud Pub/Sub, BigQuery, regional Cloud Spanner instances, and Cloud Storage regional buckets. G Suite services in these regions were also affected.
Google again apologized to customers for the failure and said it is taking “immediate steps” to boost performance and availability.
Big name customers that were affected include Snapchat, Vimeo, Shopify, Discord, and Pokemon GO.
The simple explanation was that a configuration change intended for a small group of servers in one region was wrongly applied to a larger number of servers across several neighboring regions. It resulted in the affected regions using less than half of their available capacity.
Google now says a software bug in its automation software was also at play:
“Two normally benign misconfigurations, and a specific software bug, combined to initiate the outage: firstly, network control plane jobs and their supporting infrastructure in the impacted regions were configured to be stopped in the face of a maintenance event.
“Secondly, the multiple instances of cluster management software running the network control plane were marked as eligible for inclusion in a particular, relatively rare maintenance event type.
“Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.”
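To make the interaction of those three conditions concrete, here is a minimal toy model in Python. It is only an illustration under stated assumptions: the `Cluster` fields, the `buggy_deschedule` function, and the site names are hypothetical and do not reflect Google's actual software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    location: str                # physical location (hypothetical labels)
    stops_on_maintenance: bool   # condition 1: control plane jobs stop on maintenance
    eligible_for_event: bool     # condition 2: marked eligible for the rare event type


def buggy_deschedule(clusters):
    """Model of condition 3: select every eligible cluster at once,
    without checking whether they sit in different physical locations."""
    return [c for c in clusters
            if c.eligible_for_event and c.stops_on_maintenance]


def locations_hit(descheduled):
    """Distinct physical locations that lose their control plane."""
    return {c.location for c in descheduled}


clusters = [
    Cluster("ctrl-a", "site-1", True, True),
    Cluster("ctrl-b", "site-2", True, True),
    Cluster("ctrl-c", "site-3", True, False),  # not eligible, so it survives
]

hit = buggy_deschedule(clusters)
print(sorted(locations_hit(hit)))  # ['site-1', 'site-2']
```

In this sketch, each flag on its own is harmless; the multi-location impact only appears when both flags are set and the selection step ignores location.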
As for the reduced network capacity, Google said its methods for protecting network availability worked against it on this occasion, “resulting in the significant reduction in network capacity observed by our services and users, and the inaccessibility of some Google Cloud regions”.
As first revealed in Sloss’s account, Google engineers detected the failure “two minutes after it began” and initiated a response. However, the new report says debugging was “significantly hampered by failure of tools competing over use of the now-congested network”.
That happened despite Google’s vast resources and backup plans, which include “engineers traveling to secure facilities designed to withstand the most catastrophic failures”.
Additionally, damage to Google’s communication tools frustrated engineers’ ability to identify the impact on customers, in turn hampering their ability to communicate accurately with customers.
Google has now halted its data-center automation software responsible for rescheduling jobs during maintenance work. It will re-enable this software after ensuring it doesn’t deschedule jobs in multiple physical locations concurrently.
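The kind of safeguard described here can be sketched as a pre-flight check that refuses any maintenance plan touching more than one physical location. The Python below is a hedged illustration only: `plan_maintenance`, `MultiLocationError`, and the `max_locations` parameter are hypothetical names, not part of Google's tooling.

```python
from collections import defaultdict


class MultiLocationError(Exception):
    """Raised when a plan would deschedule jobs in too many locations."""


def plan_maintenance(jobs, max_locations=1):
    """Group (job_name, location) pairs by location and reject the plan
    if it spans more than `max_locations` physical locations."""
    by_location = defaultdict(list)
    for name, location in jobs:
        by_location[location].append(name)
    if len(by_location) > max_locations:
        raise MultiLocationError(
            f"plan spans {len(by_location)} locations; limit is {max_locations}"
        )
    return dict(by_location)


# A single-location plan is accepted.
print(plan_maintenance([("job-1", "site-1"), ("job-2", "site-1")]))

# A plan spanning two locations is refused before anything is descheduled.
try:
    plan_maintenance([("job-1", "site-1"), ("job-3", "site-2")])
except MultiLocationError as err:
    print("rejected:", err)
```

Failing the whole plan up front, rather than descheduling location by location, keeps the check simple and ensures a bad plan has no partial effect.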
Google also plans to review its emergency response tools and procedures to ensure they’re up to the task of a similar network failure and still capable of accurately communicating with customers. It notes that the post-mortem is still at a “relatively early stage” and that further actions may be identified in future.
“Google’s emergency response tooling and procedures will be reviewed, updated and tested to ensure that they are robust to network failures of this kind, including our tooling for communicating with the customer base. Furthermore, we will extend our continuous disaster-recovery testing regime to include this and other similarly catastrophic failures,” Google said.
The worst-hit service was Google Cloud Storage in the US West region, where the error rate for buckets was 96.2 percent, followed by South America East, where the error rate was 79.3 percent.
Google Cloud Interconnect was severely impacted with reported packet loss ranging from 10 percent to 100 percent in affected regions.