1.0 Introduction
In this report I will concentrate on the failure of software systems. To understand why software systems fail, we first need to understand what software systems are. A software system is a type of information system, because it is essentially a means for hardware to process information.
Flynn's definition of an information system is: “An information system provides procedures to record and make available information, concerning part of an organization, to assist organization-related activities.”
Humans have been processing information manually for thousands of years, but the vast increase in demand for knowledge this century has meant that a new method of information processing was needed. Software systems have provided a means that is much faster and more efficient. As a result, a huge number of organisations have become dependent on software. Some of these systems are used to safeguard the lives of many people, so their failure could have devastating consequences. Fields where software systems are used heavily, and where failure could be very dangerous, include aviation, hospitals, space exploration, nuclear power stations and communications.
I will look at some examples of actual software failures in these fields to explain why systems fail.
2.0 Reasons for Systems Failure
If software systems failure can be so dangerous, why can it not be completely eliminated? According to Parnas, “The main reason is that software can never be guaranteed to be 100% reliable. Software systems are discrete-state systems that do not have repetitive structures. The mathematical functions that describe the behaviour of software systems are not continuous, and traditional engineering mathematics do not help in their verification.” In other words, some software is so large that thorough testing is almost impossible, and so bugs can go unnoticed. One example is the Atlas-Agena rocket that veered off course when it was ninety miles up. Ground control had to destroy the $18.5 million rocket. The cause was a missing hyphen.
However, there are many more reasons for software systems failure, and most of them come down to human negligence. Failures fall into two types: those that arise in the design stage of the software and those that arise in its implementation. These are the main reasons for systems failure:
Poor software design – Fundamental flaws in the design of the software.
Incorrect requirements specifications – The brief is inconsistent or missing vital information.
Political / Commercial pressures – This can lead to developers skipping parts of the system to save time or money. There are also cases of rivalry between sub-contractors, which damages the design of the system.
Incorrect analysis and assumptions – Predictions based on incorrect assumptions about the real world or its behaviour.
Not properly tested software implemented in a high risk environment – This is almost guaranteed to lead to systems failure.
Poor user-interface – Makes it difficult or even impossible for the user to operate the software system.
Incorrect fit between software and hardware – Incorrect specification of the hardware type in the brief, or upgrading the hardware without upgrading the software (or vice versa).
Inadequate training given to the operators – The people who have to use the software are not taught properly how to use it, or they are expected to learn on their own.
Over reliance on the software system – The operators expect their software system to work in all conditions and to perform miracles for them.
I will look at these types of systems failure with examples.
2.1 Poor software design – the Denver airport automated luggage handling system
An example of poor software design is the Denver International Airport luggage controller. In this case Jones says that the senior executives did not have a sufficient background in software systems and as a result accepted “nonsensical software claims at face value”.
The airport boasted that its new “automated baggage handling system, with a contract price of $193 million, will be one of the largest and most sophisticated systems of its type in the world.
It was designed to provide the high-speed transfer of baggage to and from aircraft, thereby facilitating quick turnaround times for aircraft and improved services to passengers.” The baggage system, which came into operation in October 1995, included “over 17 miles of track; 5.5 miles of conveyors; 4,000 telecarts; 5,000 electric motors; 2,700 photocells; 59 laser bar code reader arrays; 311 radio frequency readers; and over 150 computers, workstations, and communication servers. The automated luggage handling system (ALHS) was originally designed to carry up to 70 bags per minute to and from the baggage check-in.”
However, fundamental flaws were identified but not addressed during the development and testing stage. ABC News later reported that “In tests, bags were being misloaded, misrouted or fell out of telecarts, causing the system to jam.” Dr. Dobb's Journal (January 1997) also carried an article whose author claims that his software simulation of the Denver automatic baggage handling system mimicked the real-life situation. He reported that the consultants had performed a similar simulation and, as a result, had recommended against installing the system. However, the city overruled the consultants' report and gave the go-ahead (the contractors who were building the system never saw the report).
The report into the failure of the Denver ALHS says that the Federal Aviation Administration had required the designers (BAE Automated Systems Incorporated) to test the system properly before the opening date of 28th February 1995.
Problems with the ALHS had already caused the airport's opening date to be postponed, and the city could tolerate no further delays. The report speculates that delays had already cost the airport $360 million by February 1995.
The lack of testing inevitably led to problems with the ALHS. One problem occurred when the photo eye at a particular location could not detect the pile of bags on the belt and hence could not signal the system to stop. The baggage system loaded bags into telecarts that were already full, so some bags fell onto the tracks and again caused the telecarts to jam.
This overloading occurred because the system had lost track of which telecarts were loaded or unloaded during a previous jam: when the system came back on-line, it failed to show that the telecarts were loaded. In addition, the timing between the conveyor belts and the moving telecarts was not properly synchronized, causing bags to fall between the conveyor belt and the telecarts. The bags then became wedged under the telecarts.
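The two software faults described here, loading a cart that is already full and forgetting which carts were loaded after a stoppage, can be illustrated with a minimal Python sketch. It is not the Denver system's code. The names `Telecart`, `load_bag`, `snapshot` and `restore`, and the one-bag capacity, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Telecart:
    """Hypothetical model of one telecart; the capacity is an assumed value."""
    cart_id: int
    capacity: int = 1
    bags: list = field(default_factory=list)

    @property
    def is_full(self) -> bool:
        return len(self.bags) >= self.capacity


def load_bag(cart: Telecart, bag: str) -> bool:
    """Load a bag only if the cart has room; report whether it was loaded."""
    if cart.is_full:
        return False  # the caller must hold the bag on the belt instead
    cart.bags.append(bag)
    return True


def snapshot(carts):
    """Record the contents of every cart so the state survives a stoppage."""
    return {cart.cart_id: list(cart.bags) for cart in carts}


def restore(saved, capacity=1):
    """Rebuild the cart objects from a snapshot when the system restarts."""
    return [Telecart(cart_id, capacity, list(bags)) for cart_id, bags in saved.items()]
```

In this sketch a restart that rebuilds its carts from `snapshot` still knows which carts are loaded, so `load_bag` refuses to overfill them. A restart that assumed every cart was empty would reproduce the fault described above.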
These faults eventually caused so many problems that the system needed a major overhaul.
The government report concluded that the ALHS at the new airport was afflicted by “serious mechanical and software problems”. However, one cannot help wondering how much of the blame belongs to the city for failing to demand proper testing. Denver International Airport had to install a $51 million alternative system to get around the problem. However, United Airlines still continues to use the ALHS.
A copy of the report can be found at http://www.bts.gov/smart/cat/rc9535br.html.
2.2 Political / Commercial pressures – the Challenger Disaster
There are many examples of failures caused by political or commercial pressures. One of the most famous is the Challenger disaster. On 28th January 1986 the Challenger space shuttle exploded shortly after launch, killing all seven astronauts on board. This was initially blamed on the design of the booster rockets and on the decision to allow the launch to proceed in cold weather. However, it was later revealed that “there was a decision along the way to economize on the sensors and on their computer interpretation by removing the sensors on the booster rockets. There is speculation that those sensors might have permitted earlier detection of the booster-rocket failure, and possible early separation of the shuttle in an effort to save the astronauts.
Other shortcuts were also taken so that the team could adhere to an accelerated launch sequence.” (Neumann). This was not the first time there had been problems with space shuttle missions. A presidential commission was set up, and the Chicago Tribune reported some astronauts as saying “that poor organization of shuttle operations led to such chronic problems as crucial mission software arrived just before shuttle launches and the constant cannibalization of orbiters for spare parts.”
Obviously the pressure to get a space shuttle launch and mission to run smoothly and on time is huge. However, there has to be a limit on how many shortcuts can be taken.
Another example of commercial pressure is the case of a Fortune 500 company. (A Fortune 500 company is one that appears in Fortune magazine's listing of the top 500 U.S. companies ranked by revenue.) According to Jones, “the client executive and the senior software manager disliked each other so intensely that they could never reach agreement on the features, schedules, and effort for the project (a sales support system of about 3000 function points)”. They both appealed to their higher executives to dismiss the other person.
The project was eventually abandoned after incurring expenses of up to $500,000. Jones reported a similar case in a different Fortune 500 company: “two second-line managers on an expert system (a project of about 2500 function points) were political opponents. They both devoted the bulk of their energies to challenging and criticizing the work products of the opposite teams.” Not surprisingly, the project was abandoned after costing the company $1.5 million.
2.3 Incorrect analysis and assumptions – the Three Mile Island accident
Incorrect assumptions can seem very obvious once they are pointed out, but that does not stop them from creeping in. According to Neumann, the Gemini V spacecraft landed a hundred miles off course because of an error in the software.
The programmer used the Earth's reference point relative to the Sun, expressed as elapsed time since launch, as a fixed constant. However, the programmer did not realise that the Earth's position relative to the Sun does not return to the same point 24 hours later. As a result, the error accumulated while the spacecraft was in space. The Three Mile Island II nuclear accident, on 28th March 1979, was also blamed on assuming too much.
The accident started in the cooling system when one of the pipes became blocked, causing the temperature of the fuel rods to rise from 600 degrees to over 4000 degrees. Instruments to measure the temperature of the reactor core were not standard equipment at the time; however, thermocouples had been installed and could measure high temperatures. The thermocouples had, though, been programmed to produce a string of question marks instead of displaying the temperature once it rose above 700 degrees. After the reactor started to overheat, the turbines shut down automatically. This did not stop the rods from overheating, because someone had left the valves for the secondary cooling system closed. There was no way of knowing this at the time because there was no reading of the temperature of the reactor core.
Operators testified to the commission that there were so many valves that sometimes they would get left in the wrong position, even though their positions are supposed to be recorded and the valves even padlocked.
This is also a case of the designers blaming the operators and vice versa. In the end the operators had to concede reluctantly that large valves do not close themselves.
Petroski says, “Contemporaneous explanations of what was going on during the accident at Three Mile Island were as changeable as the weather forecasts, and even as the accident was in progress, computer models of the plant were being examined to try to figure it out.” Many assumptions had been made about how high the temperature of the reactor core could go and about the state of the valves in the secondary cooling system. This shows that, even in an environment where safety is supposed to be the number one issue, people are still too busy to think about every small detail all the time, and high-pressure situations develop that compromise the safety of hundreds of thousands of people. It took until August 1993 for the site to be declared safe. Facts are taken from Neumann and Perrow.
2.4 Not properly tested software implemented in a high risk environment – the London Ambulance Service
The failure of the London Ambulance Service (LAS) on Monday and Tuesday 26 and 27 October 1992 was, like all major failures, blamed on a number of factors.
These include inadequate training given to the operators, commercial pressures, the lack of a backup procedure, no consideration of system overload, a poor user interface, a poor fit between software and hardware, and insufficient system testing beforehand. Claims were later made in the press that as many as 20 to 30 people might have died as a result of ambulances arriving too late on the scene. According to Flowers, “The major objective of the London Ambulance Service Computer Aided Despatch (LASCAD) project was to automate many of the human-intensive processes of manual despatch systems associated with ambulance services in the UK. Such a manual system would typically consist of, among others, the following functions: Call taking. Emergency calls are received by ambulance control.
Control assistants write down details of incidents on pre-printed forms.”
The LAS offered a contract for this system and wanted it to be up and running by 8th January 1992. All the contractors raised concerns about the short amount of time available, but the LAS said that this was non-negotiable. A consortium consisting of Apricot, Systems Options and Datatrak won the contract. Questions were later asked about why their bid was significantly cheaper than those of their competitors.
(They asked for 1.1 million to carry out the project, while their competitors asked for somewhere in the region of 8 million.)
The system was lightly loaded at start-up on 26 October 1992. Staff could manually correct any problems, particularly those caused by the communications systems, such as ambulance crews pressing the wrong buttons. However, as the number of calls increased, a backlog of emergencies built up. This had a knock-on effect in that the system made incorrect allocations on the basis of the information it had. More than one ambulance was sent to the same incident, or the closest vehicle was not chosen for the emergency. As a consequence, the system had fewer ambulance resources to use.
With so many problems, the LASCAD generated exception messages for those incidents for which it had received incorrect status information. The number of exception messages appears to have increased to such an extent that staff were not able to clear the queues. Operators later said this was because the messages scrolled off the screen and there was no way to scroll back through the list of calls to ensure that a vehicle had been dispatched. This all resulted in a vicious circle, with the waiting times for ambulances increasing.
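The display problem the operators described can be shown with a short Python sketch. It contrasts a fixed-height screen that silently discards old messages with a queue that keeps every message until it is acknowledged. This is an illustration under assumed names (`ScrollingScreen`, `ExceptionQueue`), not a reconstruction of the LASCAD software.

```python
from collections import deque


class ScrollingScreen:
    """Fixed-height display: the oldest line is discarded when a new one arrives."""

    def __init__(self, rows: int):
        self.lines = deque(maxlen=rows)

    def show(self, message: str) -> None:
        self.lines.append(message)

    def visible(self) -> list:
        return list(self.lines)


class ExceptionQueue:
    """Keeps every exception message until an operator acknowledges it."""

    def __init__(self):
        self.pending = {}  # message id -> text, in arrival order
        self.next_id = 1

    def raise_exception(self, message: str) -> int:
        msg_id = self.next_id
        self.pending[msg_id] = message
        self.next_id += 1
        return msg_id

    def acknowledge(self, msg_id: int) -> None:
        self.pending.pop(msg_id, None)

    def page(self, start: int, rows: int) -> list:
        """Return one screenful of pending messages, oldest first."""
        items = list(self.pending.values())
        return items[start:start + rows]
```

With the first class, an incident that scrolls past the last row is gone. With the second, an operator can page back through everything still outstanding, which is the check the LAS operators said they could not perform.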
The operators also became bogged down in calls from frustrated patients, who started to fill the lines. This led to the operators becoming frustrated, which in turn led to an increased number of instances where crews failed to press the right buttons, or took a different vehicle to an incident from the one suggested by the system. Crew frustration also seems to have contributed to a greater volume of voice radio traffic. This in turn contributed to the rising radio communications bottleneck, which caused a general slowing down in radio communications which, in turn, fed back into increasing crew frustration. The system therefore appears to have been in a vicious circle of cause and effect. One distraught ambulance driver who was interviewed recounted that the police were saying “Nice of you to turn up” and other things.
At 23:00 on October 28 the LAS eventually instigated a backup procedure, after the reported deaths of at least 20 patients.
An inquiry was carried out into this disaster at the LAS and a report was released in February 1993. The main summary of the report said: “What is clear from the Inquiry Team's investigations is that neither the Computer Aided Despatch (CAD) system itself, nor its users, were ready for full implementation on 26 October 1992. The CAD software was not complete, not properly tuned, and not fully tested. The resilience of the hardware under a full load had not been tested.
The fall back option to the second file server had certainly not been tested. There were outstanding problems with data transmission to and from the mobile data terminals. Staff, both within Central Ambulance Control (CAC) and ambulance crews, had no confidence in the system and were not all fully trained and there was no paper backup. There had been no attempt to foresee fully the effect of inaccurate or incomplete data available to the system (late status reporting/vehicle locations etc.). These imperfections led to an increase in the number of exception messages that would have to be dealt with and which in turn would lead to more call-backs and enquiries.
In particular the decision on that day to use only the computer generated resource allocations (which were proven to be less than 100% reliable) was a high-risk move.”
In a report, Simpson (1994) claimed that the software for the system was written in Visual Basic and ran on a Windows operating system. This decision was itself a fundamental flaw in the design: “The result was an interface that was so slow in operation that users attempted to speed up the system by opening every application they would need at the start of their shift, and then using the Windows multi-tasking environment to move between them as required. This highly memory-intensive method of working would have had the effect of reducing system performance still further.”
The system was never tested properly, nor was any feedback gathered from the operators beforehand. The report refers to the software as being incomplete and unstable, with the backup system being totally untested. The report does say that there was “functional and maximum load testing” throughout the project.
However, it raised doubts over the “completeness and quality of the systems testing”. It also questions the suitability of the operating system chosen.
This, along with the poor staff training, was identified as the main root of the problem. The management was highly criticised in the report for its part in the organisation of staff training. Among other things, the ambulance crews and the central control staff were trained in separate rooms, which did not lead to a proper working relationship between the two groups. Here is what the report said about staff training: “Much of the training was carried out well in advance of the originally planned implementation date and hence there was a significant ‘skills decay’ between then and when staff were eventually required to use the system. There were also doubts over the quality of training provided, whether by Systems Options or by LAS's own Work Based Trainers (WBTs).
This training was not always comprehensive and was often inconsistent. The problems were exacerbated by the constant changes being made to the system.”
Facts are taken from http://catless.ncl.ac.uk/Risks, http://www.scit.wlv.ac.uk and the report of the Inquiry into the London Ambulance Service, February 1993.
2.5 Poor user-interface
The last case was a good example of how a poor user-interface can lead to mayhem. A similar case was reported in a Providence newspaper. The police chief of Providence, Rhode Island, Walter Clark, was grilled over why his officers were taking so long to respond to calls. In one case it took two hours to respond to a burglary in progress. He explained that all the calls are entered into a computer and shown on a monitor.
However, the monitor can only show twenty reports at a time, as the programmer did not design a scroll function for the screen. The programmer had some serious misconceptions about the crime rate in Providence. Facts taken from: http://catless.ncl.ac.uk/Risks.
2.6 Over reliance on the software system
The Exxon Valdez oil disaster was simultaneously blamed on the drunken captain, the severely fatigued third mate, the helmsman and the “system”. The system refers to the auto-pilot of the ship and the crew's lack of care over its operation.
According to Neumann, the crew were so tired that they did not realise the auto-pilot had been left on, and so the ship was ignoring their rudder adjustments. This example shows that even though everything was working properly, all the safety measures had minimal effect while the crew were unknowingly trying to override the auto-pilot. This was a very small mistake and could easily have been prevented.
The Therac-25, a system designed to deliver the right dose of radiation to patients undergoing radiotherapy, was also a case of assumed “foolproofness”. The operators did not imagine that the “software permitted the therapeutic radiation device to be configured unsafely in X-Ray mode, without its protective filter in place” (Neumann). Such blind faith in the system resulted in several patients being given overdoses that killed them.
3.0 Conclusion
It is clear from these examples that failures are very rarely due to one cause alone.
In major system failures it is usually a series of mistakes, sometimes over a dozen, that results in the failure of the system. The mistakes also have a domino effect or lead to a vicious circle, with the system becoming worse and worse during both the design and implementation stages. In almost all large system failures, commercial pressures have been put above safety. The Paddington rail crash (5th October 1999) could have been prevented if the train had been fitted with the Train Protection Warning System. This system would physically stop a train if it went through a red signal, and it was recommended in the report following the train crash at Southall.
However, it would have cost Railtrack something like 150-200 million. The system will nevertheless now be introduced on all trains by 2004. The facts were taken from BBC online.
It is obvious that the main reason for the commercial pressures is cost. The Challenger disaster might have been prevented if sensors had not been removed from the booster rockets.
But set against the already astronomical cost of space exploration, the saving from leaving out a few sensors seems nonsensical. The cost of a space shuttle is well over $1 billion, never mind the damage done to NASA's reputation. However, it is not always cost saving that leads to system failures. In both the Denver ALHS and the London Ambulance Service CAD it is more a case of wasting money. Once the initial investment has been made, a company finds it very hard to terminate the project.
They would rather get the system working than admit defeat, whatever the cost. Sometimes the cost can be in terms of human lives. This would be why United Airlines still insists on using the Denver ALHS and why twenty people reportedly died before the LAS switched its dispatching system. Proper communication and feedback between the designers and the operators will prevent many problems, such as a poor user-interface or an incorrect fit between the hardware and software. It all starts with a proper brief being given to the designers. But this can only happen if the management knows what it wants.
So the only way to have a successful system is to have good communication and understanding between the designers and operators, with the senior managers being kept informed at all times. However, the most important job is for someone to take responsibility for the design and operation of the system. If someone competent is put in charge and takes responsibility, then the system is likely to be working properly before its implementation and the operators will have adequate training for using it. With the London Ambulance Service this was doubly important, because patients' lives were at risk. In situations like these “Ethics” is the key word, and someone has to be held responsible for the actions of the organisation.
4.0 Bibliography
Flynn, Donal J.; “Information Systems Requirements: Determination and Analysis”; McGraw-Hill Book Company; 1992
Parnas; 1985; taken from: Sherer, Susan A.; “Software Failure Risk Measurement and Management”; Plenum Press; 1992
Jones, Capers; “Patterns of Software Systems Failure and Success”; Thomson Computer Press; 1996
Neumann, Peter G.; “Computer Related Risks”; Addison-Wesley Publishing Company; 1995
Petroski, Henry; “To Engineer is Human”; MacMillan Publishing; 1985
Flowers, Stephen; “Software failure: management failure”; Chichester: John Wiley and Sons; 1996
Report of the Inquiry into the London Ambulance Service; February 1993
Simpson, Moira (1994); “999!: My computer's stopped breathing!”; The Computer Law and Security Report, 10; March April; pp 76-81
Dr. Dobb's Journal; January 1997 edition
http://catless.ncl.ac.uk/Risks
http://www.scit.wlv.ac.uk
http://www.bbc.co.uk/news
http://abcnews.go.com/sections/travel