It has never been this hot in England: 40.3 degrees Celsius was measured in the shade in Coningsby on Tuesday. Luton Airport had to suspend operations because part of the runway surface melted; there were also heat-related disruptions to trains, highways and power supplies. Data center designers evidently did not anticipate such high temperatures: because some large cooling systems were overwhelmed, both Google Cloud and Oracle suffered outages.
At Oracle, network connections and the Block Volumes, Compute and Object Storage services were affected; at Google it was autoscaling, Persistent Disk and Google Compute Engine (GCE), including the virtual machines (VMs) running on it. As a result, Kubernetes instances, SQL databases, BigQuery warehouses and, of course, numerous websites were affected.
Oracle blames “unseasonal temperatures”
“As a result of unseasonal temperatures in the region, part of the refrigeration equipment at the UK South (London) data center has experienced an issue,” Oracle admitted on Tuesday, “resulting in the need to shut down part of our service infrastructure to prevent uncontrolled hardware failures. This move was taken with the intention of limiting the potential for long-term impact on our customers.”
To put it in plain English: our air conditioning, including the redundant cooling, can’t cope with this scorching heat. We suddenly had to switch off routers and servers, otherwise they would have burned up and customer data would have been lost. Oracle did not say which other season the temperatures in England would have been typical for, or whether the air conditioning would have held out then.
The emergency shutdowns of Oracle’s London servers began on Tuesday at 1:10 p.m. In a second step, Oracle reported, further machines were manually shut down “as a preventive measure” “to avoid further hardware failure”. In addition, “relevant service teams were activated to bring the affected infrastructure into a healthy state”. In other words: the air conditioning is broken. For some of the hardware, the emergency shutdowns came too late; by deliberately switching off the rest, we saved it.
Only seven hours after the emergency shutdown did the data center return to “usable temperatures”; after nine hours some of the broken cooling systems had been repaired, after eleven hours all of them. After 20 hours of hard work, Oracle was able to report that “all services and their resources have now been restored”.
Google’s air conditioning fails too
At Google, the outages began about two hours after Oracle’s, at 3:10 p.m. “There is a cooling-related outage in one of our buildings that hosts the europe-west2-a zone of the europe-west2 region. This has caused a partial outage in that zone, resulting in virtual machine shutdowns and a loss of computing power for a small set of our customers,” the company reported. Other customers lost the redundancy of their persistent disks. Google’s cooling system had also failed, and the company likewise had to shut down additional machines as a precaution.
After nine hours, Google was able to give the all-clear for its cloud services in London; only a small proportion of persistent disks were still suffering from I/O errors. The lesson from this misery: resilience in times of climate change requires not only protection against floods, hurricanes and fires, and help for climate refugees. Cooling systems must also be dimensioned more generously, and hardware must be designed to be more heat-resistant. That increases the energy consumption of data centers, unless we become significantly more data-efficient.