Blog Post 7: Therac-25

There were two main causes of the Therac-25 accidents. One was the software controlling the machine, which contained bugs that could deliver massive radiation overdoses. The other was the absence of the hardware safeguards that virtually all safety-critical systems rely on: there were no interlocks or other last-ditch hardware mechanisms to stop a catastrophic failure once the software went wrong. Together these two causes expose a side of software engineering that many people don't realize exists: software engineers are often responsible for the lives of others.

The Therac-6 and Therac-20, the previous versions of the machine, relied on a combination of software and hardware. Hardware interlocks prevented the operator from doing anything catastrophic, and software was layered on top to make setup faster, with those hardware safety measures still in place. In the Therac-25, however, the decision was made to rely on software alone. The hardware safety measures were removed entirely, leaving the system with no backstop when the software failed.
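To make the difference concrete, here is a minimal, hypothetical sketch in Python (none of these names come from AECL's actual design) of what a hardware interlock buys you: the beam can only fire if an independent physical measurement agrees with what the software believes, so a software bug alone cannot deliver the wrong dose. The Therac-25 removed exactly this kind of backstop.

```python
class HardwareInterlock:
    """Stands in for a physical sensor that reads the real turntable position."""

    def __init__(self):
        self.measured_position = "ELECTRON"  # what the sensor actually sees

    def permits(self, requested_mode):
        # The physical measurement, not the software's belief, decides
        # whether the beam is allowed to turn on.
        return self.measured_position == requested_mode


def fire_beam(requested_mode, software_state, interlock):
    # Therac-6/20 style: the software state AND an independent hardware
    # check must both agree before the beam fires.
    if software_state != requested_mode:
        raise RuntimeError("software state inconsistent; aborting")
    if not interlock.permits(requested_mode):
        raise RuntimeError("hardware interlock open; aborting")
    print(f"beam on in {requested_mode} mode")


if __name__ == "__main__":
    interlock = HardwareInterlock()
    # The software believes X-ray mode is set up, but the turntable is
    # physically still in the electron position: the interlock blocks it.
    try:
        fire_beam("XRAY", software_state="XRAY", interlock=interlock)
    except RuntimeError as err:
        print("shot blocked:", err)
```

With the interlock gone, the last `if` effectively disappears, and the software's belief about the machine is the only thing standing between the patient and the beam.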

It took a while to determine the cause of the software problem; AECL, the machine's maker, could not reproduce it. Eventually an actual user managed to. Selecting "X-Ray mode" kicked off roughly eight seconds of machine setup, and if the operator switched to "Electron mode" while that setup was still running, the change was never picked up and the turntable was left in an inconsistent state. This seems like a simple error that rigorous testing would surface, but it turns out the testing wasn't that rigorous. No timing analysis was performed, which could have caught a race condition like this. Weak testing on a safety-critical system is a huge problem on its own. The bug was fixed in an update, but another patient was then overdosed by a completely different error. The company behind the product seemed simply unequipped to build safety-critical systems.
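Below is a toy reconstruction of that style of race condition, again a Python sketch of my own rather than the actual PDP-11 code: the setup routine latches the operator's mode selection once, the operator edits the selection during the (here shortened) eight-second setup window, and nothing re-checks for the edit before the beam is delivered, so the turntable ends up configured for the stale mode.

```python
import threading
import time

mode_requested = "XRAY"       # what the operator has typed on the console
turntable_position = "NONE"   # what the setup routine actually configured


def setup_and_fire():
    global turntable_position
    latched_mode = mode_requested        # read once, never re-checked
    time.sleep(0.2)                      # stands in for the ~8-second magnet setup
    turntable_position = latched_mode    # turntable follows the *stale* value
    # Missing step: compare mode_requested with turntable_position before firing.
    print(f"firing: operator asked for {mode_requested}, "
          f"turntable set for {turntable_position}")


def operator_edit():
    global mode_requested
    time.sleep(0.1)                      # the edit arrives mid-setup
    mode_requested = "ELECTRON"


if __name__ == "__main__":
    setup = threading.Thread(target=setup_and_fire)
    edit = threading.Thread(target=operator_edit)
    setup.start(); edit.start()
    setup.join(); edit.join()
    # Prints a mismatch: the operator asked for ELECTRON, the machine
    # was configured for XRAY.
```

In this toy version, even a single consistency check comparing the operator's current selection against the configured turntable position just before firing would have turned the mismatch into an aborted treatment rather than an overdose.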

As this case shows, software developers face unique issues when working on safety-critical systems. Most people don't think of developers as being responsible for people's lives; they picture people who build games or websites, where a bug at worst causes monetary losses. But software today is in basically everything, very much including safety-critical systems like airplanes and pacemakers, and the developers behind those projects are responsible for people's lives. I think a main challenge for these developers is not becoming too detached. It is very easy when coding to forget that a mistake you make could lead to loss of life! That is probably not the first thing on your mind when typing code into a machine, because you are so far removed from the actual user, but you have to keep it in mind.

However, I believe most of the responsibility is borne by the project managers. It was the project leaders at AECL who were ultimately responsible for people's lives, and they could not afford to cut corners on developer skill or on testing. One of the first things a good project manager does at the start of a project is assess its risk, precisely so they know how many corners can safely be cut; on a safety-critical system, the answer is essentially none. So it's hard to blame the inexperienced and unqualified programmer in the case of the Therac-25. The blame lies with the managers who chose to rely on such a programmer and decided not to test rigorously.
