P2: The Safe Failure Fraction and the Architectural Constraints

Recall that there are four types of failures:

Safe failures;
Dangerous failures;
No Effect failures; and
No Part failures

A Safe failure is the failure of an element, inside a component that plays a part in implementing a safety function, that results in a spurious operation of the safety function. In other words, it places the machine into a safe state (for example, it triggers an emergency stop of the machine). An example of a safe failure for a power contactor is when the coil itself fails while it is energised, and the power contacts open.

A Dangerous failure is the failure of an element, inside a component that plays a part in implementing a safety function, that prevents the safety function from operating when required, such that the machine is put into a hazardous or potentially hazardous state. An example of a dangerous failure for a power contactor is when the coil has been de-energised but the power contacts do not open, so the dangerous movement continues.

A No Effect failure is the failure of an element, inside a component that plays a part in implementing a safety function, that has no direct effect on the safety function itself. An example of a No Effect failure for a power contactor is when it does not close once the safety function is reset. For example, the gate of a robot cell is closed and the safety system is reset, but the robot does not start. This failure is not relevant to the safety function: it affects only the availability of the robot, not its safety.

The Safe Failure Fraction (SFF)

The Safe Failure Fraction (SFF) was introduced in the first edition of IEC 61508 as a measure used to determine the minimum level of redundancy or, more precisely, the minimum Hardware Fault Tolerance (HFT) of a safety subsystem.

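The SFF is a property of a safety component, such as a pressure transmitter, defined as the ratio of the average failure rate of safe plus dangerous detected failures to the average failure rate of safe plus dangerous failures. This ratio is represented by the following equation:

SFF = (λs + λdd) / (λs + λd)

where λs is the safe failure rate, λdd is the dangerous detected failure rate, and λd = λdd + λdu is the total dangerous failure rate (detected plus undetected).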

The SFF is the proportion of “safe” failures among all failures. Note that neither the No Effect nor the No Part failures are considered. A “safe” failure is either a failure that is safe by design, or a dangerous failure that is immediately detected and corrected. IEC standards define a safe failure as a failure that does not have the potential to put the SIS in a hazardous or fail-to-function state. A dangerous detected failure is a failure that can prevent the SIS from performing a specific SIF; however, when it is detected soon after its occurrence, for example by online diagnostics, it is considered “safe”, since the diagnostics can bring the system to a safe state. In some cases, the SIS can automatically respond to a dangerous detected failure as if it were a true demand, for example by causing the shutdown of the process [68].

Many electronic safety devices have built-in diagnostics, so that most of their dangerous failures become Dangerous Detected failures; they therefore have a high SFF, often greater than 90%. Mechanical safety devices, for which internal diagnostics are not feasible, will in general have a low SFF.

Example of SFF for a Pressure Transmitter Used in Safety Applications

The following is an example of the failure rates of a pressure transmitter that can be used in a safety instrumented system.

The SFF is the following:
SFF = (λs + λdd) / (λs + λd) = (184 + 280) / (184 + 280 + 36) = 464 / 500 = 92.8%
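The arithmetic above can be checked with a short Python sketch. The function name `safe_failure_fraction` is hypothetical, and the 36 in the denominator is read here as the dangerous undetected rate λdu, on the assumption that λd = λdd + λdu. The values only need to share one consistent unit.

```python
def safe_failure_fraction(lambda_s, lambda_dd, lambda_du):
    """Return SFF = (λs + λdd) / (λs + λdd + λdu).

    All three rates must use the same unit; the ratio is dimensionless.
    """
    total = lambda_s + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("the total failure rate must be positive")
    return (lambda_s + lambda_dd) / total


# Values taken from the pressure transmitter example in the text.
sff = safe_failure_fraction(lambda_s=184, lambda_dd=280, lambda_du=36)
print(f"SFF = {sff:.1%}")  # SFF = 92.8%
```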
The pressure transmitter certificate (ABB 2600T, model 261) states that it is a Type B component and it has a Systematic Capability SC 2.
To understand the full meaning of the statement above, we first need to introduce the difference between Random and Systematic Failures; then we need to clarify the difference between a Type A and a Type B component and what is meant by Route 1_H.

Random Failures, Systematic Failures and the Systematic Capability

In Functional safety, Failures are classified as either random (in hardware) or systematic (in hardware or software).

Random Failures are normally attributed to hardware. They are failures occurring at a random time, which result in a degradation of the component's capability to perform its intended function. Based upon historical data, Random Failures can be characterized by a parameter called the Failure Rate λ, which we have discussed so far. In other words, a random hardware failure involves only the equipment; random failures can occur suddenly without warning or be the outcome of slow deterioration over time. These failures can be characterized by a single reliability parameter, the device failure rate, which can be controlled and managed using an asset integrity program.

Systematic failures are in essence due to mistakes. They can only be eliminated by a modification of the design or of the manufacturing process, operational procedures, or other relevant factors. Examples of causes of systematic failures include human error in the design, manufacture, installation and operation of the hardware. If, for example, a product is used in the wrong environment, the risk of systematic failures exists. A systematic failure involves both the equipment and a human error; systematic failures exist from the time that human errors were made and continue to exist until they are corrected. A systematic failure can be eliminated after being detected, while random hardware failures cannot.

Failures are therefore either random or systematic; the latter can be hidden in the hardware or in the software program. A major difference between the two is that system failure rates (or other appropriate measures) arising from random hardware failures can be quantified with reasonable accuracy, while those arising from systematic failures cannot be statistically quantified, because the events leading to them cannot easily be predicted. In other words, the reliability parameters of random hardware failures can be estimated from field feedback, while it is very difficult to do the same for systematic failures: a qualitative approach is preferred for systematic failures.

Random failures are taken into account through the calculation of the different types of failure rates. For systematic failures, the concept of the Systematic Capability of a component is used.

A component's Systematic Capability is a measure (expressed on a scale of SC 1 to SC 4) of the confidence that the systematic safety integrity of the component meets the requirements of the specified SIL. In other words, a component with Systematic Capability SC 2 can only be used in safety instrumented systems with reliability up to SIL 2, regardless of how much redundancy is used for that component. That means that, if a component has SC 1 (SIL 1), even if two of them are used in parallel, the maximum reliability level that the subsystem can reach is SIL 1.

To assess a component's systematic capability, the laboratory that estimates it examines, for example, the quality of its development process as well as the quality of its production process.

The concept of systematic capability was introduced in the second edition of IEC 61508. Since the term was not present in the first edition, terminology like “SIL n capable” component or “SIL n compliant” component appeared in datasheets and documents, which generated confusion. The confusion comes from the fact that, thanks to the concept of Systematic Safety Integrity, it is possible to assign a SIL to a component. That is unusual, since the SIL concept belongs to a safety function and not to a component.

With the new edition of IEC 61508 series, that was clarified: a device, with a systematic capability of SIL 2 (SC 2), for example, meets the systematic safety integrity of SIL 2 when applied in accordance with the instructions stated by its manufacturer. That means, even if the failure rates and SFF allow the component to reach SIL 3, it can only be used in a SIL 2 safety subsystem. That also means, even if it is used in redundancy with another component and, together, their subsystem can reach SIL 3, the safety system can only reach SIL 2.
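The capping rule described above can be illustrated with a minimal Python sketch. It follows the rule as stated in this article and is a simplification: the function name `claimable_sil` and its parameters are hypothetical, and the sketch ignores every other constraint on the achievable SIL.

```python
def claimable_sil(architecture_sil: int, systematic_capability: int) -> int:
    """Cap the SIL allowed by failure rates and SFF with the component's SC.

    architecture_sil: SIL (1..4) that failure rates, SFF and redundancy allow.
    systematic_capability: SC (1..4) stated on the component certificate.
    """
    for name, value in (("architecture_sil", architecture_sil),
                        ("systematic_capability", systematic_capability)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be an integer from 1 to 4")
    return min(architecture_sil, systematic_capability)


# A component whose failure rates and SFF would allow SIL 3, but with SC 2.
print(claimable_sil(architecture_sil=3, systematic_capability=2))  # 2
```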

Once again: safety integrity means both Hardware Safety Integrity and Systematic Safety Integrity. Systematic failures (hardware or software), and consequently Systematic safety integrity, cannot be quantified. On the other hand, Random Hardware Failures usually can be.

A Safety Control System needs a certain level of Safety Integrity in order to be “reliable” or, in other terms, to be “immune” from both Systematic and Random failures.

Random hardware failures are quantifiable and are taken into account through values given by the component manufacturer, like Failure Rates, MTTF_D and PFH_D. The issue is how to tackle Systematic Failures. That is done by guaranteeing a certain level of Systematic Capability, which is the terminology used in IEC 61508. The Systematic Capability applies to a safety component and expresses the confidence that its Systematic Safety Integrity meets the requirements of the specified Safety Integrity Level (SIL).

Type A and Type B Components

Components used in a Safety Function can be classified as Type A or Type B.

According to IEC 61508-2, a component can be regarded as type A if

the failure modes of all constituent components are well defined; and
the behaviour of the element under fault conditions can be completely determined; and
there is sufficient dependable failure data to show that the claimed rates of failure for detected and undetected dangerous failures are met.

Electromechanical interlocking devices are examples of Type A components.

A component can be regarded as type B if

the failure mode of at least one constituent component is not well defined; or
the behaviour of the element under fault conditions cannot be completely determined; or
there is insufficient dependable failure data to support claims for rates of failure for detected and undetected dangerous failures.

A 4-20 mA transmitter is normally a Type B component.

Route 1_H

Historically, this was the only way to determine the maximum SIL that could be claimed for a Safety Function. Here are the steps to be followed according to IEC 61508-2, § 7.4.4.2:

Divide the Safety-related system in subsystems;
For each subsystem, calculate the Safe Failure Fraction for all elements in the subsystem separately. In case of redundant element configurations, the SFF may be calculated by taking into consideration the additional diagnostics that may be available (e.g. by comparison of redundant elements);
For each element, use the achieved Safe Failure Fraction and Hardware Fault Tolerance of 0 to determine the maximum safety integrity level that can be claimed from column 2 of Table 2 of IEC 61508-2 for Type A elements; in case of Type B elements, Table 3 of IEC 61508-2 must be used;
The maximum safety integrity level that can be claimed for an E/E/PE safety-related system shall be determined by the subsystem that has achieved the lowest safety integrity level.

For Route 1_H, all the failure rates of each safety component must come from an FMEDA.

Safe Failure Fraction of an element	HFT = 0	HFT = 1	HFT = 2
SFF < 60 %	SIL 1	SIL 2	SIL 3
60 % ≤ SFF < 90 %	SIL 2	SIL 3	SIL 4
90 % ≤ SFF < 99 %	SIL 3	SIL 4	SIL 4
SFF ≥ 99 %	SIL 3	SIL 4	SIL 4

Table 2 of IEC 61508-2: Maximum allowable safety integrity level for a safety function carried out by a Type A safety-related element or subsystem

Safe Failure Fraction of an element	HFT = 0	HFT = 1	HFT = 2
SFF < 60 %	Not Allowed	SIL 1	SIL 2
60 % ≤ SFF < 90 %	SIL 1	SIL 2	SIL 3
90 % ≤ SFF < 99 %	SIL 2	SIL 3	SIL 4
SFF ≥ 99 %	SIL 3	SIL 4	SIL 4

Table 3 of IEC 61508-2: Maximum allowable safety integrity level for a safety function carried out by a Type B safety-related element or subsystem.
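The two tables and the last Route 1_H step can be sketched in Python as follows. This is a minimal sketch and not a normative implementation: the names `max_sil_route_1h` and `system_max_sil` and the subsystem values in the example are hypothetical, while the table contents are copied from the two tables above. `None` stands for "Not Allowed".

```python
import math

# Each row: (exclusive upper bound of the SFF band, SIL for HFT 0, 1, 2).
TABLE_TYPE_A = [
    (0.60, (1, 2, 3)),
    (0.90, (2, 3, 4)),
    (0.99, (3, 4, 4)),
    (math.inf, (3, 4, 4)),
]
TABLE_TYPE_B = [
    (0.60, (None, 1, 2)),  # None means "Not Allowed"
    (0.90, (1, 2, 3)),
    (0.99, (2, 3, 4)),
    (math.inf, (3, 4, 4)),
]
TABLES = {"A": TABLE_TYPE_A, "B": TABLE_TYPE_B}


def max_sil_route_1h(element_type, sff, hft):
    """Look up the maximum claimable SIL for one element or subsystem."""
    if element_type not in TABLES:
        raise ValueError("element_type must be 'A' or 'B'")
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be a fraction between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("hft must be 0, 1 or 2")
    for upper_bound, sils in TABLES[element_type]:
        if sff < upper_bound:
            return sils[hft]


def system_max_sil(subsystem_sils):
    """The system is limited by its weakest subsystem."""
    if any(sil is None for sil in subsystem_sils):
        return None
    return min(subsystem_sils)


# The Type B pressure transmitter of this article, SFF = 92.8 %.
print(max_sil_route_1h("B", 0.928, hft=0))  # 2
print(max_sil_route_1h("B", 0.928, hft=1))  # 3

# Hypothetical sensor, logic and final element subsystems.
print(system_max_sil([2, 3, 2]))  # 2
```

The lookup reproduces the point made in the text: a single transmitter with an SFF of 92.8 % is limited to SIL 2 by Table 3, and one level more is available only by adding hardware fault tolerance.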

The concept of Hardware Fault Tolerance (HFT) is used in the IEC 61508 series to indicate the ability of a hardware subsystem to continue performing a required function in the presence of faults or errors. The HFT is given as a digit. HFT = 0 means that, in case of one fault, the function (e.g., a pressure measurement) is lost. HFT = 1 means that, if a channel fails, there is another one that is able to perform the same function: in other terms, the subsystem can tolerate one failure and still be able to function. A subsystem of three channels that are voted 2oo3 is functioning as long as two of its three channels are functioning. This means the subsystem can tolerate one channel failure and still function normally. The Hardware Fault Tolerance of the 2oo3 voted group is, therefore, HFT = 1. In figure 1 an input subsystem with HFT = 1 is shown: I1 and I2 could be two identical pressure transmitters.

Figure 1: HFT=1 input safety sub-function
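The voting examples above follow a simple pattern: an MooN group, which needs M of its N channels to work, tolerates N - M channel failures. A minimal Python sketch of that relation is given below; the function name `hardware_fault_tolerance` is hypothetical, and the sketch assumes identical, independent channels.

```python
def hardware_fault_tolerance(m: int, n: int) -> int:
    """Return the HFT of an MooN voted group (M out of N channels needed)."""
    if not 1 <= m <= n:
        raise ValueError("MooN voting requires 1 <= M <= N")
    return n - m


for m, n in [(1, 1), (1, 2), (2, 3)]:
    print(f"{m}oo{n}: HFT = {hardware_fault_tolerance(m, n)}")
# 1oo1: HFT = 0
# 1oo2: HFT = 1
# 2oo3: HFT = 1
```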

Conclusions

In this article we explained the methodology used for components in Low Demand mode Safety Instrumented Systems.

A component can have a very low failure rate, but that does not mean it can reach a high level of reliability when installed in a Safety System.

The value of the random failure rates is just one aspect to be considered. The other one is the risk that the component (a pressure transmitter in this article) is subject to systematic failures because it was not properly designed, engineered, and produced, or because it is not properly maintained. The Systematic Capability of a component is formalised in levels from SC 1 to SC 4.

We discussed what a Type A or Type B component is and how that limits the maximum SIL a Subsystem containing that component can reach, looking at its Random Failures and SFF level.

In any case, if a component has a Systematic Capability of SC 2, the maximum SIL level that can be reached by its subsystem is SIL 2, regardless of how many of them we connect in parallel or of the PFD_avg value the subsystem reaches.