Are you comfortable with Software FMEA?
Software developed for the space industry under ECSS standards must be subjected to a software FMEA. In the FMEA acronym, the “F” stands for “Failure”, although the word is not entirely appropriate for software.
Software failure, what does it mean?
Indeed, software cannot fail in the way a hardware component fails. It may, however, contain a design defect which, when activated, causes a failure of the system. It therefore makes sense, especially for critical applications, to perform an analysis comparable to a traditional FMEA, with two aims:
- Identify the potential impacts of software defects at system level, so that the system designer can compare them with the system-level FMEA, complete it where needed, and handle whatever impacts are deemed necessary at system level (modification of the system design, interfaces, operational documentation, procedures…)
- Determine what to do, at software level, when a “Failure” occurs, leading to a substantiated specification of the fault handler
A software “Failure” should be understood as the consequence of the activation of a residual software bug that was not detected by the verification and validation layers applied during development and that could be triggered during the operational life of the product.
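The second aim above, a substantiated specification of the fault handler, can be pictured as a lookup table derived directly from the FMEA worksheet. The following is only a minimal sketch under stated assumptions: the function names, failure modes, effects, severity scale and handler actions are all hypothetical and invented for illustration, and the fallback to safe mode for an unanalysed failure is a design assumption of this sketch, not a requirement taken from any standard.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log and continue"
    RESTART_TASK = "restart the faulty task"
    SAFE_MODE = "enter safe mode"


@dataclass(frozen=True)
class FmeaRow:
    function: str        # software function analysed (hypothetical names)
    failure_mode: str    # how the function is assumed to misbehave
    system_effect: str   # consequence at system level
    severity: int        # assumed scale: 1 = catastrophic ... 4 = minor
    action: Action       # fault handler response justified by the row


# Hypothetical worksheet rows, invented for illustration only.
WORKSHEET = [
    FmeaRow("telemetry_formatter", "output frame corrupted",
            "ground receives one invalid frame", 4, Action.LOG_ONLY),
    FmeaRow("attitude_control", "task overruns its deadline",
            "pointing accuracy degraded", 2, Action.RESTART_TASK),
    FmeaRow("thruster_command", "command issued with wrong duration",
            "uncontrolled change of orbit", 1, Action.SAFE_MODE),
]


def build_handler_table(rows):
    """Index the worksheet by (function, failure mode) for lookup at run time."""
    return {(row.function, row.failure_mode): row for row in rows}


def select_action(table, function, failure_mode):
    """Return the analysed response, or the most conservative one if the
    failure was never analysed (an assumption of this sketch)."""
    row = table.get((function, failure_mode))
    return row.action if row is not None else Action.SAFE_MODE


table = build_handler_table(WORKSHEET)
print(select_action(table, "attitude_control", "task overruns its deadline"))
print(select_action(table, "payload_manager", "unknown"))
```

Keeping the handler response in the same record as the failure mode and its system effect means that each handler action can be traced back to the analysis line that justifies it.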
A useful software FMEA should lead to recommendations for the software design that improve robustness without degrading performance, at a cost that is realistic given the acceptable level of risk and the price of the product. It should also improve the system FMEA by identifying additional failure modes for the module hosting the software under analysis.
Organizational difficulties in performing a software FMEA
First, it is necessary to understand the mechanics of FMEA and also to know enough about how software works, even in complex architectures. This dual skill set is not easy to find.
In most cases there are two people or teams: RAMS specialists, who know the FMEA principles well but have limited software design skills, and designers, who have no FMEA expertise and are kept very busy with design activities by schedule pressure. As a result, the analysis is not optimal and may consume a huge effort for a limited benefit.
It is recommended to involve people with both skills. They are generally people who gained initial experience in system and software development and then oriented their careers towards RAMS activities.
Where to start a software FMEA?
A second difficulty is identifying which “failure modes” need to be considered and at what level of detail to start the FMEA.
- Regarding failure modes: for a basic hardware component such as a resistor, they are easy to identify from various standards (open circuit, short circuit, drift with temperature), so the effects of the failures and their propagation are also easy to determine. For software this is not so easy, especially in real-time, multitasking, multiprocessor environments, where an error in one function may impact several other functions that have no intended relationship with it (e.g. pointer or index corruption, stack handling, communication buffer overflow…). This must be determined case by case. Another case is a software error whose root cause is an error in a complex hardware component that is not identified in the hardware FMEA.
- Regarding the level of detail at which to start the software FMEA: here again, it must be decided case by case. Starting from individual lines of code, or even from the lowest-level functions, may not be relevant; it depends on factors such as criticality, redundancy, or any other safety barriers defined by a higher-level approach.
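The propagation between unrelated functions mentioned in the first point can be illustrated with a toy model. The sketch below assumes a hypothetical memory layout in which a communication buffer sits immediately before a setpoint owned by an unrelated thermal control task; all names, addresses and values are invented for illustration, and a Python bytearray stands in for statically allocated memory.

```python
BUFFER_START, BUFFER_SIZE = 0, 8   # communication buffer: bytes 0..7
SETPOINT_ADDR = 8                  # next byte belongs to a thermal control task


def new_memory() -> bytearray:
    """Return a fresh 16-byte memory image with a thermal setpoint of 20."""
    memory = bytearray(16)
    memory[SETPOINT_ADDR] = 20
    return memory


def store_message_unchecked(memory: bytearray, payload: bytes) -> None:
    """Copy a message into the buffer without checking its length (the defect)."""
    for offset, value in enumerate(payload):
        memory[BUFFER_START + offset] = value


def store_message_checked(memory: bytearray, payload: bytes) -> bool:
    """Reject oversized messages, acting as a barrier against propagation."""
    if len(payload) > BUFFER_SIZE:
        return False
    store_message_unchecked(memory, payload)
    return True


oversized = bytes([0xFF] * 9)      # one byte longer than the buffer

memory = new_memory()
store_message_unchecked(memory, oversized)
print("setpoint after unchecked write:", memory[SETPOINT_ADDR])

memory = new_memory()
accepted = store_message_checked(memory, oversized)
print("accepted:", accepted, "setpoint:", memory[SETPOINT_ADDR])
```

In this toy layout the unchecked write silently changes the thermal setpoint, although the communication function and the thermal task have no intended relationship. The checked version rejects the message and leaves the setpoint intact, which is the kind of barrier a software FMEA can recommend.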
Author: Luc Pelle - System Design Assurance & Certification Manager, PMV Consulting & Services