April 13, 2004

How Did the Blackout Happen?

So how did the blackout happen? First you need to understand that the power grid was under stress. It was a warm August day with lots of air conditioners drawing power. Deregulation has also encouraged the development of power generation far from where the power is used. This power has to be moved, which itself consumes reactive power, which no one pays for. So all the utilities try to minimize the amount of reactive power they put into the grid, which pushes it toward the edge. That is the general context of what was happening to the power grid that month. It was in a precarious position.

The final report cited FirstEnergy for not planning how to handle this situation effectively. This citation wasn’t in the initial report, and it seems to me that the entire industry was guilty of a lack of preparedness for the strains that deregulation caused. The system was in a fragile condition.

The first critical event occurred when FirstEnergy, a utility in Ohio, had a silent failure in its alarm system. The grid operators at FirstEnergy never knew about it until it was too late. The IT staff at FirstEnergy eventually became aware of it and thought they had corrected it, but they hadn't.

Transmission lines start going down without the knowledge of FirstEnergy’s grid operators. Three go down in total. They get phone calls from other operators whose monitoring systems are functional and can see that the lines are down. FirstEnergy’s operators, presumably confused, don’t act.

The report also cited the Midwest Independent System Operator for not managing the situation. It is their job to ensure that the utilities are working correctly and to coordinate responses. Presumably they could also have limited the area affected by the blackout if they had taken prompt action. It seems that part of their failure stems from inadequacies in their computer models of the grid. This seems potentially worthy of analysis by software experts, but it is an aspect I haven't studied.

So what actually happened in that control room? Imagine you are in a control room with lots of monitors with readouts and diagrams and no windows. You are in the FirstEnergy grid control room. It's your job to manage the local electric grid. Some people have said that the North American electric grid is the largest machine in the world. The entire grid is synchronized to beat at sixty cycles a second. Everything must be kept in phase, and all the power that is generated must be sent somewhere. If this doesn't happen, lots of expensive equipment can short out, destroying it and causing fires. To guard against this, equipment is protected by breakers. As a grid operator, it is your job to keep the breakers from tripping.

Normally, you can keep things working by turning transmission lines and generators on and off, but in the worst case you may have to shed load. Shedding load means cutting a region off from the grid. It is intentionally causing a blackout in order to prevent the possibility of a worse blackout. Every major blackout has occurred because someone somewhere faced the decision of whether to shed load and hesitated.

At 2:14 on August 14, 2003, you stop getting alarms. Alarms are not unusual. An alarm is typically raised when anything unusual happens, whether or not it requires human intervention. But you have no reason to believe that anything unusual has happened. Not yet. After a while, you might notice the lack of any alarms and become suspicious. But not right away. And without alarms, you are at a severe disadvantage. The task force said:

Power system operators rely heavily on audible and on-screen alarms, plus alarm logs, to reveal any significant changes in their system’s conditions…. Alarms are a critical function of an EMS [energy management system], and EMS-generated alarms are the fundamental means by which system operators identify events on the power system that need their attention.

At 2:41, the backlog of unprocessed alarms causes the server hosting the alarm software to shut down. The IT staff are alerted, and alarm processing automatically switches over to a backup system. At 2:54 the backup system fails. The data that had been corrupted by a software race condition at 2:14 had been sent over to the backup, causing it to fail as well. The IT staff are alerted again. At 3:08, they restart the initial alarm server, thinking they now have a functional system. But the alarm system is still non-functional. During this time the IT staff don't notify you or the other grid operators of these problems. You don't realize that the first transmission line already tripped at 3:05.

Back at 2:32, you received a phone call from another grid operator about a transmission line that briefly tripped and reclosed. Your monitoring system should have raised an alarm, but didn't. This was your first solid clue that it wasn't working. The operators in the room express concern, but no action is taken.

At 3:19, you get a second phone call from the other operator. They think a line is down, but you convince them that what they see is a fluke and that their data is wrong. Actually, it is your data that is wrong.

A second line trips at 3:32 and you get a phone call at 3:35. The operators in your control room are beginning to realize that there is a problem with the grid, but you aren't sure whether it is with your grid or with an adjacent region. Phone calls become more and more frequent, and some of the information you are getting isn't quite accurate, which only adds to the confusion. A third line trips at 3:41 and more calls ensue. You know there is a problem now, and that your alarm system is unreliable. But you still don't take corrective action. You've never been in this situation before. What's happening?

Between 3:39 and 3:59, seven lower voltage lines will sag due to the excess load they are carrying and then trip when they sag low enough to touch tree branches. Then a breaker failure immediately causes five more low voltage lines to trip at 3:59.

Investigators later found that up until 4:05, the cascading blackout could still have been prevented if you and the other system operator at FirstEnergy had shed most of the load from the Cleveland-Akron area.

Between 4:06 and 4:09, five more lines in your area will fail due to the high current. And now it is out of your control. There is nothing that anyone can do now. By 4:13, line failures have cascaded to Michigan, Ontario and all the way to New York City.

Posted by bret at 05:44 PM | Comments (1)

Don’t Dismiss the Blackout Bug

Everyone seems to be getting this story wrong. The Slashdotters go off on a long analysis premised on the idea that the problem was that the data in the FirstEnergy system wasn't live. It's a fair analysis, except for one thing. That wasn't the problem. The data was still coming through. The problem was that the alarm system was no longer checking the data stream. It wasn't doing anything. It locked up and then it crashed. No alarms. It apparently did cause the system to slow down while it was locked up, which no doubt made the grid operators more frazzled. But if they had pulled up the screens for the downed transmission lines, they would have seen that they weren't transmitting power.

Even the New York Times blames FirstEnergy for the problems with the monitoring system:

There are rules in place addressing many of the problems that the task force identified, including shoddy maintenance, faulty monitoring equipment (the company’s computers were outdated) and a breakdown in communications between FirstEnergy and neighboring systems.

The bug in the XA/21 software that triggered the alarm failure wasn't fixed until November. The software wasn't outdated. It was just buggy. "Outdated" suggests that standards have changed since it was written, or that it has somehow worn out.

Even the February report is at odds with the formal report that came out last week. AP reporter Anick Jesdanun interviewed Joseph Bucciero, an energy consultant from KEMA who worked with GE Harris to diagnose the problem:

Bucciero said the software bug surfaced because of the number of unusual events occurring simultaneously—by that time, three FirstEnergy power lines had already short-circuited.

But the final report says that the bug occurred at 2:14, while the three transmission lines tripped between 3:05 and 3:41. You can't blame the alarm failure on the line trips. The alarm failed first.

Many software developers seem uncomfortable with the attention put on the software failure and try to minimize its importance. On both ADT and Slashdot, software developers are heaping the blame elsewhere, citing other non-computer-related infractions that FirstEnergy is accused of in the report, including not trimming trees properly.

There certainly were other causes, but the software failure was critical. If the software hadn't failed, it's reasonable to presume that the blackout would have been restricted to part of Cleveland. These developers have pointed out that FirstEnergy was cited for not keeping the trees trimmed back. However, the report also says that they followed standard practices for tree trimming. So maybe the standards need to change. I'm not really sure, and I don't really care. Why? Because I'm a software expert, not a vegetation management expert. The software was supposed to give an alarm when the line went down, and it didn't. End of story. From a software engineering perspective, it doesn't matter why the line went down, whether it was a kid with a kite, a major storm, or a sag into a tree. Responsible software engineers will focus on the software angle and let others focus on the other angles. In my view, the software angle was not sufficiently investigated, either by the task force or by the press, with the exception of Kevin Poulsen.

Part of the problem is that the task force can't criticize GE Harris, because it is outside their jurisdiction. So they cite FirstEnergy because its "operational monitoring equipment was not adequate to alert FE's operators regarding important deviations in operating conditions and the need for corrective action." Why wasn't it adequate? Because it had a critical bug that no one knew about. The task force also criticizes the regulatory agency for not having "detailed requirements for: (a) monitoring and functional testing of critical EMS and supervisory control and data acquisition (SCADA) system, and (b) contingency analysis."

Posted by bret at 05:42 PM | Comments (0)

April 11, 2004

Blackout Bug Proves Limits of Software Testing

The US and Canadian departments of energy released their final report on last summer's blackout in the Northeast. The blackout was triggered when summer storms caused tree branches to hit three separate transmission lines in Ohio. Ordinarily the electric grid system operators are able to minimize the impact of failures like this. But FirstEnergy's grid operators didn't know about these line failures. They were relying on the GE Harris XA/21 monitoring software to alert them to such problems, but it had stopped working an hour beforehand due to a critical software bug. They were unaware that their alarm software had stopped working, and so were unaware of the need to take corrective action.

The XA/21 monitoring software runs on Unix and is made up of several subsystems. According to hacker journalist Kevin Poulsen, the bug was a race condition in the one million lines of C++ code that make up the event processing subsystem.

According to Mike Unum, manager at GE Energy in Melbourne, Florida: “There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time. And that corruption led to the alarm event application getting into an infinite loop and spinning.”

Race conditions are timing bugs that occur when two separate execution threads try to use a resource—typically a memory location—at the same time. Code that has been checked to be safe from this kind of problem is called “thread safe.” Anyone writing or testing multi-threaded code needs to understand race conditions. According to expert developer Sean Beatty, “Identifying potential race conditions in complex code can be a tedious, time-consuming process. Tools to assist in this range from simple scripts used in identifying accesses to global data, to sophisticated dynamic analysis programs…. Despite its difficulty, a detailed analysis of the code is the only way to identify these types of errors. Testing is unlikely to generate the exact timing sequences required to trigger a race condition repeatably.”
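To make that concrete, here is a minimal sketch (in modern C++, for brevity) of the kind of bug Unum describes: two threads both get write access to a shared structure because nothing serializes them. The names and structure are hypothetical, not the XA/21's actual code; the thread-safe variant differs only in taking a lock first.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical shared structure standing in for the contended alarm data.
struct AlarmQueue {
    std::vector<std::string> events;
    std::mutex lock;  // protects events
};

// Buggy version: two threads can modify the vector at the same time,
// corrupting its internal state -- a race condition.
void log_alarm_unsafe(AlarmQueue& q, const std::string& event) {
    q.events.push_back(event);  // unsynchronized write
}

// Thread-safe version: the mutex serializes access to the shared data.
void log_alarm_safe(AlarmQueue& q, const std::string& event) {
    std::lock_guard<std::mutex> guard(q.lock);
    q.events.push_back(event);
}

int main() {
    AlarmQueue q;
    std::thread t1(log_alarm_safe, std::ref(q), std::string("line 1 tripped"));
    std::thread t2(log_alarm_safe, std::ref(q), std::string("line 2 tripped"));
    t1.join();
    t2.join();
    return 0;
}
```

The unsafe version only corrupts the structure under a particular interleaving of the two writers, which is exactly why a bug like this can hide for millions of operating hours.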

Perhaps the most notorious race condition was the bug in the Therac-25 medical radiation equipment that led to several deaths. That race condition was only triggered when diagnostic parameters were initially entered incorrectly and then quickly corrected. If they were corrected too quickly, the display didn't match the internal settings and the patient was at risk of a radiation overdose.

According to the report, the race condition in the XA/21 led to the complete failure of the alarm subsystem: "the alarm process essentially 'stalled' while processing an alarm event, such that the process began to run in a manner that failed to complete the processing of that alarm or produce any other valid output (alarms). In the mean time, new inputs … built up and then overflowed the process' input buffers." The subsystem was locked up by the corrupt data that had been written during the race condition. A half-hour after the initial failure, the backlog of queued events apparently led to the failure of the entire server. When this happened, the data (including the corruption) was transferred to a backup server, which failed 13 minutes later, also confounded by the corrupt data. The alarm subsystem remained down until the entire XA/21 system was cold-booted the following morning.
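As a rough model of that failure mode, with my own names and a deliberately tiny buffer rather than anything from the XA/21: once the processor's state is corrupted it stops producing alarms, while incoming events keep arriving and eventually overflow the input buffer.

```cpp
#include <cstdio>
#include <deque>
#include <string>

// Hypothetical, simplified model of the reported failure mode: a stalled
// alarm processor stops draining its input queue, and events overflow.
struct AlarmProcessor {
    std::deque<std::string> input;       // incoming telemetry events
    static const size_t kMaxQueued = 4;  // tiny buffer for illustration
    bool state_corrupted = false;        // set by the race condition

    // Called for every new event arriving from the field.
    bool enqueue(const std::string& event) {
        if (input.size() >= kMaxQueued) {
            return false;                // buffer overflow: event is lost
        }
        input.push_back(event);
        return true;
    }

    // The processing loop. A healthy processor drains the queue and raises
    // alarms; a corrupted one "stalls" and produces no output at all.
    void process_one() {
        if (state_corrupted) {
            return;                      // models the spinning process: no alarms out
        }
        if (!input.empty()) {
            std::printf("ALARM: %s\n", input.front().c_str());
            input.pop_front();
        }
    }
};

int main() {
    AlarmProcessor p;
    p.state_corrupted = true;            // the 2:14 race has already happened
    for (int i = 0; i < 8; ++i) {
        if (!p.enqueue("line event " + std::to_string(i))) {
            std::printf("dropped event %d (buffer full)\n", i);
        }
        p.process_one();                 // never produces an alarm
    }
    return 0;
}
```

In the real system the overflow eventually took down the host server itself; the point of the sketch is just that a stalled consumer turns a transient corruption into total silence.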

Mike Unum told Kevin Poulsen, "It took us a considerable amount of time to go in and reconstruct the events." They had to inject delays in the code while feeding it alarm inputs. Unum thinks they did a pretty respectable job: "We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug. I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software… you may never find the problem. I don't think that's unique to control systems or any particular vendor software."
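The report doesn't spell out how GE instrumented the code, but injecting delays to widen a race window generally looks something like this (the shared counter and the delay values are assumptions for illustration, not the actual XA/21 data):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Deliberately unsafe shared state, standing in for the contended structure.
static int shared_counter = 0;

// The injected sleep between the read and the write widens the race window
// so the lost update fires nearly every run instead of once in a blue moon.
void unsafe_increment(int delay_ms) {
    int local = shared_counter;                       // read
    std::this_thread::sleep_for(
        std::chrono::milliseconds(delay_ms));         // injected delay
    shared_counter = local + 1;                       // write (may clobber)
}

int main() {
    std::thread t1(unsafe_increment, 50);
    std::thread t2(unsafe_increment, 50);
    t1.join();
    t2.join();
    // With the delay in place this almost always prints 1, not 2 --
    // the same defect that normal-speed testing would almost never catch.
    std::printf("counter = %d (expected 2)\n", shared_counter);
    return 0;
}
```

Under normal scheduling the window between the read and the write is microseconds wide; the injected delay makes the interleaving that matters almost certain.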

That's 3,000,000 hours of operation in test without uncovering that bug. Big deal. I'd be willing to bet that their simulated environment wouldn't find this bug if it ran for another three million hours, or even a trillion hours. I wouldn't be surprised if the timing profile that triggered the race condition wasn't inside the parameters of their simulation. That's the thing about testing. You can run the same test a thousand times, but it's not going to find bugs outside its parameters.

I'm also willing to bet that they found a lot of other bugs in those 3,000,000 hours of testing. The grid operators at FirstEnergy sure saw a lot: "It is not unusual for alarms to fail. Often times, they may be slow to update or they may die completely… the fact that the alarms failed did not surprise him." In fact, they had already decided to replace the XA/21 well before the events of August 14. I'm willing to bet that in hindsight they wished they'd replaced it sooner.

Almost 20 years ago, the Therac-25 bug led the FDA to start regulating medical software. (Remarkably, a software bug in a new radiation machine from Therac’s successor is now being blamed for several more deaths due to radiation overdosing.)

One of the lessons that software safety expert Nancy Leveson drew from that disaster was that you couldn't just focus on reliability, on running the software for long periods of time or checking to make sure that all the requirements had been met. Rather, safety requires that we presume that failures can happen. It requires that the system be designed to minimize the impact of failures.

If that race condition in the XA/21 alarm system had only caused one or two alarm events to be ignored, it would have been a survivable error. But instead it took down the alarm system, and the corrupt data was even propagated so that it took down the backup system too. That shouldn't happen in a system designed for safety.
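Here is one hedged sketch of what designing to contain failures could mean in this context, using hypothetical names rather than anything from the XA/21: treat replicated state as untrusted, sanity-check it before the backup accepts it, and start clean rather than inherit the primary's corruption.

```cpp
#include <cstdio>
#include <deque>
#include <string>

// Hypothetical snapshot of alarm-processor state that a primary server
// would ship to its backup on failover.
struct StateSnapshot {
    std::deque<std::string> pending_events;
    size_t processed_count = 0;
};

// A simple sanity check before the backup accepts replicated state.
// If the snapshot looks corrupt, the backup starts from a clean state
// instead of inheriting the primary's failure.
bool looks_sane(const StateSnapshot& s, size_t max_pending) {
    return s.pending_events.size() <= max_pending;
}

StateSnapshot take_over(const StateSnapshot& replicated) {
    const size_t kMaxPending = 10000;  // assumed capacity limit
    if (looks_sane(replicated, kMaxPending)) {
        return replicated;             // normal failover: resume where primary left off
    }
    std::fprintf(stderr, "replicated state rejected; starting clean\n");
    return StateSnapshot{};            // degrade safely rather than fail the same way
}

int main() {
    StateSnapshot corrupt;
    corrupt.pending_events.assign(200000, "event");  // runaway backlog
    StateSnapshot backup_state = take_over(corrupt);
    std::printf("backup starts with %zu pending events\n",
                backup_state.pending_events.size());
    return 0;
}
```

A backup that starts clean loses some history, but it keeps raising alarms, which is the property that actually mattered on August 14.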

And so it now looks like the GE Harris XA/21 bug will lead to regulations that “require annual independent testing and certification of industry energy management software and system control and data acquisition systems to ensure that they meet the minimum requirements,” according to the report.

And it is truly a matter of safety. At least one death has been blamed on the blackout: a Long Island resident whose respirator failed.

Posted by bret at 01:58 AM | Comments (5)

April 05, 2004

Questioning Certifiers

In this week's StickyMinds column, I take issue with some questions about black box, white box and grey box testing featured on a testing certification exam.
Posted by bret at 12:02 PM | Comments (0)