Monday, October 14, 2013

Dealing with poor reliability issues

While a project faces several issues during its lifecycle, poor reliability issues are critical as these can lead to failed projects. These are also difficult to resolve. If a function consistently fails to work, it is possible to find the root cause and implement corrective action. If the problem is intermittent, even diagnosing it is a big challenge.
"Blue Screen of Death" (Credit: Masem, via Commons)
I would like to highlight two instances of poor reliability and the corrective actions that helped.

In the first case, a Personal Computer (PC) running Microsoft Windows 95 was used along with a custom-built add-on card to provide interactive audio-video services over a cable television system. The services were sometimes disrupted because the PC crashed, and service could be restored only by rebooting the computer. There were several software components, and since a careful check of the application software did not reveal a problem, the fault was assumed to lie with the operating system software. The short-term fix was to detect the PC crash and provide a hardware trigger to reset the PC. The long-term fix was to move to embedded hardware with a reliable real-time operating system.
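The short-term fix described above is essentially a watchdog: something outside the application notices that it has stopped responding and forces a reset. The following is only a minimal sketch of that idea in Python, not the mechanism used in the project. The class name `Watchdog`, the `reset_action` callback (standing in for the hardware reset trigger), and the injectable `clock` are all assumptions made for illustration.

```python
import time


class Watchdog:
    """Tracks heartbeats and fires a reset action when they stop arriving."""

    def __init__(self, timeout_s, reset_action, clock=time.monotonic):
        self.timeout_s = timeout_s
        # Hypothetical callback; on real hardware this would pulse the reset line.
        self.reset_action = reset_action
        self.clock = clock
        self.last_kick = clock()
        self.resets = 0

    def kick(self):
        """Called periodically by the monitored application while it is alive."""
        self.last_kick = self.clock()

    def poll(self):
        """Called by the supervisor; triggers a reset if the heartbeat is overdue."""
        if self.clock() - self.last_kick > self.timeout_s:
            self.reset_action()
            self.resets += 1
            # Restart the timer so the rebooting system gets a full timeout window.
            self.last_kick = self.clock()
            return True
        return False
```

In this sketch the application calls `kick()` from its main loop and an independent supervisor calls `poll()` on a timer. A watchdog of this kind only masks the fault by restoring service quickly; it does not remove the root cause, which is why a long-term fix was still needed.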

In the second instance, a PCMCIA modem that worked with laptops for wireless Internet connectivity was used in an embedded environment for transferring equipment health data. During the tests, it was found that the modem operation was intermittent. We tried to reproduce the error in the laptop environment and also contacted the vendor for advice. The vendor suggested using a new version of the modem cards. After extensive debugging with alternate wire-line modems, which had high reliability, we traced the problem to bugs in the TCP/IP stack supplied by the real-time OS vendor. As these problems surfaced during the later part of the project, they led to a crisis situation requiring fire-fighting actions, which are costly and detrimental.

In both the above cases, the issues resulted from trying to use Commercial Off-The-Shelf (COTS) hardware and software to meet aggressive time-to-market and low-cost product needs while ignoring reliability. By focusing on reliability during the requirements phase, making appropriate design choices, and prototyping early to uncover any reliability issues, projects can handle such issues effectively.
