PT 10: Functional safety in high demand: The Markov model of Category 2 according to ISO 13849-1

Last edit: 05/08/2025

Summary

This article is part of a series of articles on the functional safety of machinery. Last year we presented Categories B, 1 and 2. Category 2 is the most difficult to understand: in theory it is a single-channel architecture, yet it allows PL d to be reached.

To better understand the reasons for some of the limitations of this category, this article presents the principles of the Markov modelling behind Category 2.

Introduction

Back in the early 2000s, when the IEC 61508 series introduced its probabilistic approach to functional safety, a team from IFA responded by building a Markov model of each category of EN 954-1.
Category 2 can be represented by the safety-related block diagram shown in Figure 1. F represents the functional channel (I-L-O: input, logic, output) and M represents the test equipment (TE + OTE: test equipment and output of the test equipment). The model uses the following parameters:

  • λFD is the dangerous failure rate of the functional channel F;
  • λMD is the dangerous failure rate of the test channel M;
  • β is the common cause factor: the fraction of failures that affect the functional channel F and the test channel M at the same time;
  • rt is the test rate of the functional channel, in other words how often the functional channel is tested by the test channel;
  • rd is the demand rate of the safety function: how often the safety function is required.

Figure 1: Category 2 Architecture
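The parameters above can be grouped into a small data structure. The sketch below is a minimal illustration and is not part of the IFA model. The class name Cat2Params, the helper rate_from_mttfd and all numeric values are hypothetical. The conversion assumes a constant failure rate, λ = 1/MTTFD, with MTTFD expressed in hours (8760 h per year).

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760.0  # 365 days * 24 h


@dataclass(frozen=True)
class Cat2Params:
    """Input parameters of the Category 2 (1oo1D) model; rates in 1/h."""
    lam_fd: float  # dangerous failure rate of the functional channel F
    lam_md: float  # dangerous failure rate of the test channel M
    beta: float    # common cause factor between F and M (0..1)
    r_t: float     # test rate of F by M
    r_d: float     # demand rate of the safety function

    def __post_init__(self):
        for name in ("lam_fd", "lam_md", "r_t", "r_d"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        if not 0.0 <= self.beta <= 1.0:
            raise ValueError("beta must lie in [0, 1]")

    @property
    def test_to_demand_ratio(self) -> float:
        """Average number of tests of F per demand on the safety function."""
        return self.r_t / self.r_d


def rate_from_mttfd(mttfd_years: float) -> float:
    """Constant dangerous failure rate (1/h) for an MTTFD given in years."""
    return 1.0 / (mttfd_years * HOURS_PER_YEAR)


# Hypothetical example values, for illustration only.
params = Cat2Params(
    lam_fd=rate_from_mttfd(100.0),
    lam_md=rate_from_mttfd(200.0),
    beta=0.02,
    r_t=100.0,
    r_d=1.0,
)
print(params.test_to_demand_ratio)
```

The frozen dataclass prevents the parameters from being changed halfway through a calculation. The validation in `__post_init__` catches negative rates and a β outside [0, 1].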

The OK State

Figure 2 shows the full Markov model of a Category 2 architecture, which is simplified later in this article.
As you can see, it is a repairable system: from both the Operation Inhibition state and the Hazardous Event state, the system is brought back to the OK state. As already explained in chapter 1, this is a different approach from IEC 62061. However, for the same safety-related control function, the difference in PFHD between the two cases is small, because in both cases the system is considered the ultimate safety barrier. For failures that cause the loss of the overall safety function, repair of the safety-related system therefore has a negligible influence on the PFHD value. In both cases the safety system has already failed, and only in one of them is it repaired afterwards.

Figure 2: State Transition Model for a Category 2 Architecture (1oo1D)

One final word on the subject. During its mission time of typically 20 years, a safety-related control system in high demand mode may statistically reach the hazardous event several times, especially when its PFHD is high. Consider, for example, a low-risk PL a application with PFHD = 6·10⁻⁵ 1/h. Statistically, the system faces a hazardous event 20 [years] · 365 [days/year] · 24 [h/day] · 6·10⁻⁵ [1/h] ≈ 10.5 times during its 20-year mission time, and each time the system is normally repaired.
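The arithmetic of this example can be reproduced in a few lines. The sketch treats PFHD as a constant average frequency, as the example does. The helper name expected_hazardous_events is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 365 days * 24 h, as in the example


def expected_hazardous_events(pfh_d: float, mission_years: float) -> float:
    """Expected number of hazardous events over the mission time.

    pfh_d is treated as a constant average frequency in 1/h.
    """
    return pfh_d * mission_years * HOURS_PER_YEAR


n = expected_hazardous_events(6e-5, 20)
print(f"{n:.1f}")
```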
Let us now analyse and simplify the Markov graph for a Category 2 architecture (or a 1oo1D subsystem). The machine is in the OK state when it is working normally and all safety-related control systems are active and free of faults.

From the OK state to the failure states

The SRP/CS we are analysing has a 1oo1D architecture. A failure can therefore occur either in the functional channel F or in the monitoring channel M. Following the reasoning of the IEC 61508 series, a dangerous failure in the functional channel can be either detected or undetected.
Therefore, from the OK state the system can move to 3 possible states (Figure 3):

  • F DD: the functional channel has a dangerous failure that is detected;
  • F DU: the functional channel has a dangerous failure that is undetected;
  • M D: the monitoring channel has a dangerous failure.

Figure 3: Transition from the OK to the failure states

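Since λFD is the dangerous failure rate of the functional channel, its detected part, with DC the diagnostic coverage, is:

$$\lambda_{FDD} = DC \cdot \lambda_{FD}$$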

However, we also need to take into account the common cause failures between the functional channel F and the monitoring channel M. Here β is the common cause factor and λCC the common cause failure rate. The probability that the SRP/CS moves from the OK state to F DD is therefore governed by the following failure rate:

If the monitoring channel cannot detect the failure, the system moves to the so-called dangerous undetected state, indicated as F DU in the drawing. The probability that the safety system moves from the OK state to F DU is governed by the following failure rate:

The probability of a failure of the monitoring channel is governed by the following failure rate:

The resulting state is indicated as M D.
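Leaving the common cause correction aside, the three transitions out of the OK state can be sketched as follows. This minimal sketch assumes β = 0 (no CCF), so it does not reproduce the CCF-adjusted rates of the model. The function name rates_out_of_ok and the numeric values are hypothetical.

```python
def rates_out_of_ok(lam_fd: float, lam_md: float, dc: float) -> dict[str, float]:
    """Transition rates (1/h) leaving the OK state, assuming beta = 0 (no CCF)."""
    if not 0.0 <= dc <= 1.0:
        raise ValueError("DC must lie in [0, 1]")
    return {
        "F DD": dc * lam_fd,          # dangerous failure of F, detectable by M
        "F DU": (1.0 - dc) * lam_fd,  # dangerous failure of F, not detectable by M
        "M D": lam_md,                # dangerous failure of the test channel M
    }


rates = rates_out_of_ok(lam_fd=1e-6, lam_md=5e-7, dc=0.9)
total = sum(rates.values())  # total rate of leaving OK: lam_fd + lam_md
```

The CCF correction described above is not reproduced here. In this sketch the rates simply add up to λFD + λMD.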

From the failure state to the hazardous event

In the Markov model of the Category 2 architecture, a “temporary state” called F DU + M D is also defined, but only for the completeness of the system. The system enters this state from F DD or from F DU if a failure of the monitoring channel occurs (λMD), or from M D if a failure of the functional channel occurs (λFD).

Figure 4: Transition from Failed States to the Hazardous Event Functional Channel

Note that all the transitions described so far are exponentially distributed: they occur randomly in time, at a constant rate.

What matters now is the transition to the Hazardous State. The SRP/CS enters this state when:

  • it has a failure (just one!), and
  • there is a demand upon the safety function (rd).

For example, the access gate to a robot cell is opened, but the interlocking device has failed to open its electrical contact. As a consequence, a person is inside the area while the robot is still running.

In the end, what matters is the probability that the system moves from the OK state to the Hazardous State. This probability is given by the sum of the probabilities of three transitions, shown with a big arrow in Figure 4:

  • Transition from the F DD state to the Hazardous State or Event
  • Transition from the F DU state to the Hazardous State
  • Transition from the F DU + MD state to the Hazardous State

Note that all these transitions are uniformly distributed and their frequency is rd, the frequency of the demand upon the safety function.
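Under the Markov assumption that demands arrive at a constant rate rd, the three transitions above can be summarised as a single hazard frequency h(t) at time t. The notation P_X(t), the probability of being in state X at time t, is introduced here for illustration only:

```latex
h(t) = r_d \left[ P_{F\,DD}(t) + P_{F\,DU}(t) + P_{F\,DU+M\,D}(t) \right]
```

Averaging h(t) over the mission time is the usual way to obtain an average PFHD from such a model.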

Other states in the transition model

Another state shown in the Markov model is Operation Inhibition. From the OK state, the system reaches it in two steps:

  1. A detected failure occurs in the SRP/CS: transition from the OK state to F DD.
  2. The monitoring channel tests the functional channel before a demand upon the safety function and brings the system to a safe state. In practice, the safety system stops the machine or the dangerous movement. In our robot cell example, the safety system detects the failure when the person enters the cell while the robot is still running, and it safely stops the robot before the person reaches it.
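Once the system is in F DD, it ends up in either Operation Inhibition or the Hazardous State. Which one depends on a race between the next test and the next demand. The minimal Monte Carlo sketch below makes three simplifying assumptions:

  • both waiting times are exponential, with constant rates rt and rd;
  • the much slower failure of M while in F DD is ignored;
  • the function names are hypothetical.

It compares the simulated fraction with the closed form rt/(rt + rd) for two competing exponential events.

```python
import random


def fraction_test_first(r_t: float, r_d: float, trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(test occurs before the next demand).

    Both waiting times are drawn as exponential (constant rates r_t, r_d);
    a failure of M while in F DD is ignored.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t_test = rng.expovariate(r_t)    # waiting time to the next test
        t_demand = rng.expovariate(r_d)  # waiting time to the next demand
        if t_test < t_demand:
            hits += 1                    # F DD -> Operation Inhibition
    return hits / trials


def fraction_test_first_exact(r_t: float, r_d: float) -> float:
    """Closed form for two competing exponential events."""
    return r_t / (r_t + r_d)


sim = fraction_test_first(r_t=100.0, r_d=1.0)
exact = fraction_test_first_exact(r_t=100.0, r_d=1.0)
```

With rt = 100·rd, about 1 detected failure in 101 still meets a demand before it is tested. With rt = 25·rd, the closed form gives 25/26 ≈ 0.96.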

Finally, the last transitions are:

  • from the Hazardous Event to the OK state, or
  • from the Operation Inhibition state to the OK state.

Both happen when the system is repaired.
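The states and transitions described so far can be written down as a small directed graph. The sketch below lists only the arrows named in the text. CCF arrows, which the text does not spell out, are left out. The rate labels are symbolic, and "repair" is a placeholder for the unspecified repair rate.

The sketch checks two properties stated above:

  • exactly three transitions lead into the Hazardous Event;
  • the system is repairable, i.e. the OK state can be reached again from every state.

```python
from collections import deque

# Transitions described in the text: (source, target, symbolic rate label).
TRANSITIONS = [
    ("OK", "F DD", "lambda_FDD"),
    ("OK", "F DU", "lambda_FDU"),
    ("OK", "M D", "lambda_MD"),
    ("F DD", "F DU + M D", "lambda_MD"),
    ("F DU", "F DU + M D", "lambda_MD"),
    ("M D", "F DU + M D", "lambda_FD"),
    ("F DD", "Operation Inhibition", "r_t"),
    ("F DD", "Hazardous Event", "r_d"),
    ("F DU", "Hazardous Event", "r_d"),
    ("F DU + M D", "Hazardous Event", "r_d"),
    ("Hazardous Event", "OK", "repair"),
    ("Operation Inhibition", "OK", "repair"),
]


def successors(state: str) -> list[str]:
    """States reachable from state in one transition."""
    return [dst for src, dst, _ in TRANSITIONS if src == state]


def reachable(start: str) -> set[str]:
    """All states reachable from start (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


states = {s for edge in TRANSITIONS for s in edge[:2]}
# Entries into the Hazardous Event: the three "big arrow" transitions.
into_hazard = {src for src, dst, _ in TRANSITIONS if dst == "Hazardous Event"}
# Repairable system: OK can be reached again from every state.
repairable = all("OK" in reachable(s) for s in states)
```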

The simplified graph of the Markov model

Using conservative assumptions, the graph can be simplified. This shows that the PFHD values of Category 2 architectures listed in Table K.1 of ISO 13849-1 can be calculated with formulas, and that these formulas are comparable to those used in IEC 62061.

This exercise, prepared in 2018 by mathematicians from IFA, had exactly that purpose. It was meant to demonstrate that the reliability level of the same architecture is essentially the same whether it is analysed with IEC 62061 or with ISO 13849-1. This is yet another indication that, especially with the new edition of IEC 62061, the two standards are aligned in many respects, as they always have been.

The simplified model is shown in Figure 5. The steps from the full to the simplified model are not detailed in the book.

Figure 5: Simplified Markov model for calculation of PFHD in 1oo1D architecture

From the graph, the instantaneous PFHD value for the 1oo1D Architecture can be calculated as follows:

When properly integrated,

Where TM is the Mission Time.
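As a numerical counterpart to the integration over TM, here is a minimal sketch. It solves a small continuous-time Markov chain built from the transitions described earlier and averages the hazard frequency over the mission time. It is not the IFA simplified model of Figure 5 and does not reproduce the Table K.1 values.

The sketch rests on these assumptions:

  • β = 0 (no CCF);
  • no repair, so Operation Inhibition and the Hazardous Event are absorbing;
  • demands are modelled at the constant rate rd;
  • the numeric values are hypothetical.

The matrix exponential uses scaling and squaring with a Taylor series, in plain Python. All function names are hypothetical.

```python
STATES = ["OK", "F DD", "F DU", "M D", "F DU + M D", "Operation Inhibition", "Hazardous Event"]
IDX = {s: i for i, s in enumerate(STATES)}


def generator(lam_fd, lam_md, dc, r_t, r_d):
    """Generator matrix Q (each row sums to zero) of the assumed chain."""
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]

    def add(src, dst, rate):
        q[IDX[src]][IDX[dst]] += rate
        q[IDX[src]][IDX[src]] -= rate

    add("OK", "F DD", dc * lam_fd)
    add("OK", "F DU", (1.0 - dc) * lam_fd)
    add("OK", "M D", lam_md)
    add("F DD", "Operation Inhibition", r_t)  # test finds the failure first
    add("F DD", "F DU + M D", lam_md)
    add("F DU", "F DU + M D", lam_md)
    add("M D", "F DU + M D", lam_fd)
    for s in ("F DD", "F DU", "F DU + M D"):
        add(s, "Hazardous Event", r_d)        # a demand meets a failed channel
    return q  # no repair: the last two states are absorbing


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def expm(a, terms=20):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    s = 0
    while norm > 0.5:
        norm /= 2.0
        s += 1
    scaled = [[x / 2.0 ** s for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, scaled)]  # A^k / k!
        result = [[r + t for r, t in zip(rr, tt)] for rr, tt in zip(result, term)]
    for _ in range(s):
        result = matmul(result, result)
    return result


def average_pfh(lam_fd, lam_md, dc, r_t, r_d, t_m):
    """Average frequency of reaching the Hazardous Event over [0, t_m].

    Because that state is absorbing here, the integral of
    r_d * (P_FDD + P_FDU + P_FDU+MD) over [0, t_m] equals P_hazard(t_m).
    """
    q = generator(lam_fd, lam_md, dc, r_t, r_d)
    p = expm([[x * t_m for x in row] for row in q])
    return p[IDX["OK"]][IDX["Hazardous Event"]] / t_m


# Hypothetical values: 20-year mission, DC = 90 %, r_t = 100 * r_d.
pfh = average_pfh(lam_fd=1e-6, lam_md=5e-7, dc=0.9, r_t=100.0, r_d=1.0, t_m=175_200.0)
print(f"{pfh:.2e} 1/h")
```

With these values, the result should be of the order of 10⁻⁷ 1/h. It is dominated by the F DU path, whose rate (1 − DC)·λFD is 10⁻⁷ 1/h. Lowering r_t to 25 increases the result, because more detected failures meet a demand before being tested.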

The PFHD of a 1oo1D Architecture is:

The importance of time-optimal testing

The critical term in the above formula is TRTE, the time-related test efficiency.

That concept is also stated in IEC 61508-2, §7.4.4.1.4.

The new edition of ISO 13849-1 states that, if the test rate rt is only 25 times the demand rate rd, the monitoring function is still considered effective. In that case, the PFHD simply has to be increased by 10%.
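The testing rules can be expressed as a small decision function. This is a hedged sketch:

  • the 100× threshold is the classic time-optimal testing condition of ISO 13849-1;
  • the 25× / +10 % rule is the one quoted above;
  • returning None below 25× is a simplification of this sketch;
  • testing together with the demand is not modelled;
  • the function name and the PFHD value are hypothetical.

```python
from typing import Optional


def category2_pfh_factor(r_t: float, r_d: float) -> Optional[float]:
    """Multiplier applied to the PFHD for a given test-to-demand ratio.

    Returns 1.0 for time-optimal testing (r_t / r_d >= 100), 1.1 for
    25 <= r_t / r_d < 100 (the +10 % rule), and None below 25, where
    this sketch does not treat the test as effective.
    """
    ratio = r_t / r_d
    if ratio >= 100.0:
        return 1.0
    if ratio >= 25.0:
        return 1.1
    return None


pfh_table = 2.0e-7  # hypothetical PFHD value, 1/h
factor = category2_pfh_factor(r_t=30.0, r_d=1.0)
pfh_used = pfh_table * factor if factor is not None else None
```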

An alternative to time-optimal testing is to test the safety function together with its demand. In that case, if a failure is detected, the overall time needed to bring the machine to a safe state must be shorter than the time needed to reach the hazard.

1oo1D in case of time-optimal testing

With time-optimal testing, the formula becomes:

If |X| ≪ 1, the exponential function can be replaced by its quadratic approximation without notable loss of precision:

$$e^{X} \approx 1 + X + \frac{X^{2}}{2}$$
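The quality of the quadratic approximation e^X ≈ 1 + X + X²/2 can be checked numerically. The error is of third order, roughly |X|³/6.

The second part uses a generic averaged exponential, chosen as an assumption of this sketch, to show why a linear approximation would not be enough. Averaging 1 − e^(−λt) over [0, T] gives 1 − (1 − e^(−x))/x, with x = λT. That expression is exactly 0 under the linear approximation and x/2 under the quadratic one. Whether the PFHD equation contains exactly this term is not shown here. Function names are hypothetical.

```python
import math


def quadratic_exp(x: float) -> float:
    """Second-order Taylor approximation of exp(x)."""
    return 1.0 + x + x * x / 2.0


for x in (-1e-1, -1e-2, -1e-3):
    rel_err = abs(math.exp(x) - quadratic_exp(x)) / math.exp(x)
    print(f"x = {x:g}: relative error = {rel_err:.1e}")  # roughly |x|**3 / 6


def averaged_growth(x: float) -> float:
    """Mean of 1 - exp(-lam*t) over [0, T], with x = lam*T (exact)."""
    return 1.0 - (1.0 - math.exp(-x)) / x


def averaged_growth_quadratic(x: float) -> float:
    """Same mean with exp(-x) replaced by 1 - x + x**2/2; simplifies to x/2."""
    return 1.0 - (1.0 - quadratic_exp(-x)) / x
```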

Substituting this approximation for the exponential function in the PFHD equation yields:

λCC accounts for common cause failures affecting both the functional channel (λFD) and the test channel (λMD). It can be estimated with the following equation:

This exercise was done not only to better understand the Markov modelling behind Category 2 subsystems, but also to show that the difference in PFHD between ISO 13849-1 and IEC 62061 is negligible, although the two standards use different models. The same exercise was carried out for Categories 3 and 4, which were compared with the 1oo2D architecture, with similar results.

In ISO 13849-1, the PFHD values are listed in Table K.1. However, the models behind those numbers can be simplified conservatively and expressed as formulas, as shown in this section for a 1oo1D architecture. ISO 13849-1 uses Markov chains and treats the systems as repairable. IEC 62061 uses reliability block diagrams and treats the systems as non-repairable. Despite this difference, the simplified equations of ISO 13849-1 are very similar to those of IEC 62061.
