Oops - What Just Changed? Production Downtime Ensues.
By Eric Gross | November 20, 2007
There is a story circulating that describes how a simple issue at the VA cascaded into a full day of production downtime. We can learn from their failures! The initial problem turned out to be an improperly executed change in port numbers, but they wouldn’t determine that until much later. There was no visibility into the changes made to the system. That visibility would have been possible if they had been storing their configuration data in a database of some sort that could be used for remediating issues.
Instantly, technicians present began to troubleshoot the problem. “There was a lot of attention on the signs and symptoms of the problem and very little attention on what is very often the first step you have in triaging an IT incident, which is, ‘What was the last thing that got changed in this environment?’” Raffin said.
Planes don’t crash because of one failure. It takes a multitude of problems, and here is another one that happened that fateful day at the VA:
Volpp assumed that the data center in Sacramento would move into the first level of backup — switching over to the Denver data center. It didn’t happen.
This part of the story matters because a failover was possible, but that option was not taken because they didn’t know what had caused the problem. With no clear indication of the root cause, they could not risk a failover that might have spread the failure even further. Using GridApp Clarity, it would have been as simple as clicking on the failed resources and asking the system for the list of recent changes. One of them would likely have stood out as the culprit, perhaps the change that lacked the proper change control elements.
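To make the "what was the last thing that got changed?" question concrete, here is a minimal sketch in Python. It is not GridApp Clarity's API and not anything the VA ran: the `ChangeRecord` type, the `recent_changes` and `suspects` helpers, the resource names, ticket IDs, and dates are all made up for illustration. The assumption is only that each change is recorded with the resource it touched, a timestamp, and a change-control ticket if one existed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ChangeRecord:
    """One entry in a hypothetical change log."""
    resource: str
    changed_at: datetime
    description: str
    ticket: Optional[str]  # change-control ticket ID; None if the change was untracked


def recent_changes(log: List[ChangeRecord], resource: str, failed_at: datetime,
                   window: timedelta = timedelta(hours=24)) -> List[ChangeRecord]:
    """Return changes to `resource` within `window` before `failed_at`, newest first."""
    hits = [c for c in log
            if c.resource == resource
            and failed_at - window <= c.changed_at <= failed_at]
    return sorted(hits, key=lambda c: c.changed_at, reverse=True)


def suspects(changes: List[ChangeRecord]) -> List[ChangeRecord]:
    """Put changes with no change-control ticket first; sorted() is stable,
    so the newest-first order is kept within each group."""
    return sorted(changes, key=lambda c: c.ticket is not None)


# Illustrative data only: names, tickets, and dates are invented.
log = [
    ChangeRecord("app-server", datetime(2000, 1, 1, 8, 0), "routine patch", "CHG-100"),
    ChangeRecord("db-listener", datetime(2000, 1, 1, 22, 0), "restart service", "CHG-101"),
    ChangeRecord("db-listener", datetime(2000, 1, 2, 7, 30), "port number changed", None),
    ChangeRecord("db-listener", datetime(1999, 12, 1, 7, 30), "old change", "CHG-050"),
]
failed_at = datetime(2000, 1, 2, 9, 0)

ranked = suspects(recent_changes(log, "db-listener", failed_at))
for change in ranked:
    print(change.changed_at, change.description, change.ticket or "NO TICKET")
```

Filtering by the failed resource and a time window keeps the list short, and sorting untracked changes to the top reflects the point above: a change that skipped change control is the first one worth a look.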
Topics: Changing State, People
