Embrace the error – how the lessons from the United Airlines 173 disaster improve our software implementations

Shortly after 5pm on the evening of 28 December 1978, United Airlines Flight 173 began its descent to Portland International Airport.

 

What followed was tragic and entirely avoidable, and led to a revolution in error-handling in the aviation industry.

 

As the aircraft descended and the landing gear was lowered, the crew felt a strange vibration and yaw.  The absence of an indicator light led the crew to conclude that the landing gear had not deployed properly.  They spent the next hour in a holding pattern trying to diagnose the issue – was it an instrumentation failure or a gear failure?  Visual checks by the control tower and the flight crew confirmed that the gear was deployed.

 

The captain was unconvinced.

 

In fact, they spent so long focused on this one issue that they created another – a shortage of fuel.  They ran out completely just over an hour after their first approach and crash-landed in a wooded suburban area six miles short of the airport, with the loss of 10 lives.  It was later confirmed that the landing gear had been down and locked correctly from the beginning.

 

It’s a difficult read, as it seems obvious to the casual observer what could and should have been done differently.

 

The lessons from this event are a reminder that humans are fallible and therefore mistakes are inevitable.  When mistakes do happen, our instinctive human response is not always born of common sense.  Work or home pressure, emotion, arcane work practices, deference to rank or status – these things can obscure or inhibit good decision-making.

 

The lessons from this tragedy are universal.  There is a fantastic essay by Ian Leslie which tells the story in more detail, and how this incident subsequently changed operating theatre procedure in the UK National Health Service.  It’s a profound lesson in learning from our mistakes and a most worthy 10-minute read.

 

You can find it here: https://www.newstatesman.com/uncategorized/2014/06/how-mistakes-can-save-lives

 

Here at Rapid4Cloud we have procedures in place to continuously improve our operations.  Reflecting on the lessons from Flight 173, we have processes and controls to address quality and how we respond to issues, and we have a flat structure in which everyone feels empowered and safe to question and challenge.

 

We know we can’t eradicate all errors, but we can try, and our team spends much of its time trying to prevent errors from happening at all.  These errors come from various sources:

  • errors in our code. This is down to quality control.
  • errors inadvertently made by users. These come from bad product design, insufficient or inadequate training or procedures, or simple human error with any number of causes – tiredness, stress, etc.
  • errors in processing because of poor data.  Nearly all the errors we see fall into this category.

It’s one of those never-ending journeys – software is never bug-free, users will always find holes in your design, and bad data… well, I’m afraid that’s up there with death and taxes as one of life’s certainties.

 

Our automation software exactly replicates what a user should do in front of an Oracle screen.  If our software encounters a problem entering data it captures a screenshot from Oracle, with the exact error message a consultant would see if they entered the data manually, and puts it into an error log.  This becomes the checklist for the consultant.  It guides them to each configuration and data error.
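
To make the idea concrete, here is a minimal sketch in Python of an error log that doubles as a checklist.  It is an illustration only, not our product code: the names run_steps, ErrorEntry and DataEntryError, the demo steps, the error message and the screenshot paths are all invented for the example.  The point it shows is that every step is attempted, and each failure is recorded with the exact message and a screenshot reference instead of halting the run.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


class DataEntryError(Exception):
    """Hypothetical error raised when the target screen rejects an entry."""


@dataclass
class ErrorEntry:
    step: str        # the configuration or data step that failed
    message: str     # the error text exactly as the screen reported it
    screenshot: str  # where the captured screenshot was saved


def run_steps(
    steps: Iterable[str],
    enter: Callable[[str], None],
    capture: Callable[[str], str],
) -> List[ErrorEntry]:
    """Attempt every step and record failures instead of stopping at the first."""
    log: List[ErrorEntry] = []
    for step in steps:
        try:
            enter(step)
        except DataEntryError as exc:
            log.append(ErrorEntry(step, str(exc), capture(step)))
    return log


def demo_enter(step: str) -> None:
    # Stand-in for real data entry: rejects one step to simulate bad data.
    if step == "supplier site":
        raise DataEntryError("Payment terms code NET45 is not defined")


def demo_capture(step: str) -> str:
    # Stand-in for a screenshot call: returns the path it would have written.
    return "screenshots/" + step.replace(" ", "_") + ".png"


error_log = run_steps(
    ["supplier", "supplier site", "bank account"], demo_enter, demo_capture
)

# Print the log as a numbered checklist for the consultant to work through.
for number, entry in enumerate(error_log, start=1):
    print(f"{number}. {entry.step}: {entry.message} (see {entry.screenshot})")
```

Because the run continues past a failure, the consultant gets the full list of problems in one pass rather than discovering them one at a time.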

 

It was on this theme that I was speaking last week with a consultant from one of our business partners.  He said something which surprised me.  He said he loved getting error logs.

 

This caught me out.

 

It was not a view that had been expressed to me before.  We typically see error logs as a sign of failure – “What went wrong and why?”  For this consultant they were a sign of achievement.  He had just successfully (and quickly) found a problem and knew exactly what and where it was.  To him, the error logs were like a satnav that took him straight to configuration and data issues and enabled him to fix them efficiently and effectively.  Now there’s an attitude I can embrace!

 

We can mitigate human error with processes and checklists (this was the primary lesson from Flight 173), and we can eliminate human error with systems and automation.  Ideally, you have both.  Not that systems are infallible (again, software is never bug-free) but they are measurably and significantly more accurate than humans.

 

So, given that errors are unavoidable, it is better to find and diagnose them as soon as possible and fix the cause.

 

One of the many benefits of Robotic Process Automation, the engine that powers Rapid4Cloud software, is that it is dispassionate.  It is free from the bias that distorts human decision-making, and it simultaneously eliminates and exposes human error in processes.  This makes problem resolution very quick and easy – and it’s what our consultant friend finds so helpful in our error logs.

 

Not long ago, one of our customers lost $1.5m in orders over a weekend due to a configuration change.  Understandably, emotions were high, and they spent the weekend frantically looking for what had changed or gone wrong, problem solving in the time-honored fashion of trial and elimination.  Ultimately, they called us for advice.  We ran a standard configuration report and found the issue in 10 minutes.  They could have done this themselves, and knew how to, but it wasn’t embedded as part of their process.  It sure is now.  Mistakes in our business are very rarely life-threatening, as they are in aviation or healthcare, but we can absolutely learn and apply lessons from these industries.
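
For readers who like to see the principle in code, the sketch below shows the general idea of comparing a known-good configuration snapshot with the current one.  It is a simplified Python illustration under stated assumptions, not the report we actually ran: the function diff_config, the setting names and the values are all hypothetical.

```python
from typing import Dict, List, Tuple


def diff_config(
    baseline: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (setting, old value, new value) for every setting that differs."""
    changes = []
    for key in sorted(set(baseline) | set(current)):
        old = baseline.get(key, "<missing>")
        new = current.get(key, "<missing>")
        if old != new:
            changes.append((key, old, new))
    return changes


# Hypothetical snapshots taken before and after the weekend.
friday = {"order_import.enabled": "Y", "default_warehouse": "M1", "tax_region": "US"}
monday = {"order_import.enabled": "N", "default_warehouse": "M1", "tax_region": "US"}

for setting, old, new in diff_config(friday, monday):
    print(f"{setting}: {old} -> {new}")
```

A comparison like this only helps if a baseline snapshot exists, which is why the check needs to be part of the routine process and not something reached for in a crisis.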

 

By Philip Martin

Founder & CEO