What measures the average amount of time between failures for a particular system?

Domain 7: Security Operations (e.g., Foundational Concepts, Investigations, Incident Management, Disaster Recovery)

Eric Conrad, ... Joshua Feldman, in CISSP Study Guide (Third Edition), 2016

Mean Time Between Failures

Mean Time Between Failures (MTBF) quantifies how long a new or repaired system will run before failing. It is typically generated by a component vendor and is largely applicable to hardware as opposed to applications and software. A vendor selling LCD computer monitors may run 100 monitors 24 hours a day for 2 weeks and observe just one monitor failure. The vendor then extrapolates the following:

100 LCD computer monitors × 14 days × 24 hours/day = 1 failure per 33,600 hours

This does not mean that one LCD computer monitor will be able to run for 3.8 years (33,600 hours) without failing. [23] Each monitor may fail at a rate significantly different from this calculated mean (or average, in this case). However, for planning purposes, we can assume that if we were running an office with 20 monitors, one of them will fail about every 70 days. Once the vendor releases the MTBF, it is incumbent upon the BCP/DRP team to determine the number of failures to expect within the IT system over a given period of time. The MTBF becomes a less reliable planning figure as an organization uses fewer and fewer hardware assets. The example below shows how to apply the MTBF to 20 LCD computer monitors.

1 failure/33,600 hours = 20 LCD computer monitors × X days × 24 hours/day

Solve for X by dividing both sides of the equation by 20 × 24:

X days = 33,600 / (20 × 24) = 70
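
The arithmetic above is simple enough to check with a few lines of Python; the numbers are the ones from the text, and the snippet is only a calculator-style sketch rather than anything prescribed by the authors.

```python
# Vendor extrapolation: 100 monitors run 24 hours/day for 14 days, 1 failure observed.
monitors_tested, days_tested, failures = 100, 14, 1
mtbf_hours = monitors_tested * days_tested * 24 / failures      # 33,600 hours

# Office planning: with 20 monitors sharing that MTBF, how often is a failure expected?
office_monitors = 20
hours_per_failure = mtbf_hours / office_monitors                 # 1,680 hours
days_per_failure = hours_per_failure / 24                        # 70 days

print(f"MTBF = {mtbf_hours:,.0f} hours")
print(f"Expect roughly one failure every {days_per_failure:.0f} days across {office_monitors} monitors")
```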


URL: https://www.sciencedirect.com/science/article/pii/B9780128024379000084


Video Management Systems

Vlado Damjanovski, in CCTV (Third Edition), 2014

Hard disk failures

Since a hard disk drive is an electromechanical device, wear and tear will cause it to eventually fail.

It is not a question of whether a drive will fail; the question is when it will fail.

When analyzing hard disk life expectancy, there are three common parameters that manufacturers give.

The Annualized Failure Rate (AFR), which is the percentage of a given population of hard disks expected to fail, extrapolated to one year based on expectation values.

The Mean Time Between Failures (MTBF) specifies the expected operating time between two consecutive failures of a device type in hours (definition according to IEC 60050 (191)). The MTBF considers the life cycle of a device that fails repeatedly, then is repaired and returned to service again. Because repair of hard drives rarely happens, we don’t really have mean time between failures. We only have mean time to a failure, after which the drive is discarded. Therefore, Mean Time To Failure (MTTF) is used, which specifies the expected operating time until the failure of a device type in hours (definition according to IEC 60050 (191)).

The acronyms MTBF and MTTF are often used synonymously for drives. Some manufacturers, for example, estimate the MTBF as the number of operating hours per year divided by the projected failure rate. This view is based on failure without repair. As such, MTTF would be the more practical parameter, but it still shows an unrealistically high number of hours of life expectancy, which is not the case with a standard electronic definition of MTBF.


An extreme case of a broken hard drive

For example, typical hard disk MTBF figures range between 300,000 and 1,000,000 hours. This is quite a high number, equivalent to 34 to 114 years. These numbers are far too high, and practical experience shows that the useful lifetime of a drive is more likely one tenth of that, typically 3 to 5 years. In addition, technological progress, new standards, and increased capacity do not allow hard drives to be used for more than a couple of years. Moore's law can easily be applied to hard disk capacity too: it doubles almost every year or two.

The stated high MTBF/MTTF numbers are a result of a specific definition used by hard disk manufacturers, which refers to testing a number of drives over a selected test period (in hours) and the percentage of drives that failed in that period. This can be written as the following formula:

(47) MTTF = (test period × number of drives) / number of failed drives

For example, if testing has been conducted over one month (720 hrs), and out of 1,000 drives, two have failed, the MTTF will be:

MTTF = 720 × 1,000 / 2 = 360,000 hrs.

In reality, this does not mean that any given drive will fail at around 360,000 hrs of operation. A better interpretation of our example is that out of 1,000 drives tested for one month, two have failed, which is equivalent to one failed drive out of 500 each month (1,000 / 2). This is 0.2% per month. Over a period of one year, as the annual failure rate (AFR) defines it, this is equivalent to a 2.4% probability of a drive failing. So, following our example, in a system with 1,000 drives with an MTTF of 360,000 hrs, statistically there will be 24 failed hard disk drives per year. If the MTTF were 1,000,000 hrs, which is typically quoted for enterprise drives, the same statistics would mean roughly nine failed drives among 1,000 over a period of 1 year.
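
The same bookkeeping can be sketched in a few lines of Python; the function below simply applies failures = drives × period / MTTF, which assumes a constant failure rate, and the drive counts and MTTF values are the ones used in the text.

```python
# Turn an MTTF figure into an annual failure rate (AFR) and an expected
# number of failed drives per year, assuming a constant failure rate.
def expected_failures(mttf_hours: float, drives: int, period_hours: float) -> float:
    return drives * period_hours / mttf_hours

HOURS_PER_YEAR = 8_760
for mttf in (360_000, 1_000_000):
    afr = HOURS_PER_YEAR / mttf                               # per-drive annual failure rate
    failed = expected_failures(mttf, 1_000, HOURS_PER_YEAR)
    print(f"MTTF {mttf:>9,} h -> AFR {afr:.1%}, ~{failed:.0f} failures per 1,000 drives per year")
# MTTF   360,000 h -> AFR 2.4%, ~24 failures per 1,000 drives per year
# MTTF 1,000,000 h -> AFR 0.9%, ~9 failures per 1,000 drives per year
```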

This same calculation can be generalized using the common exponential model of distribution:

(48) Failure Probability = R(t) = 1 − F(t) = 1 − e^(−t/M) = 1 − e^(−λt)

where e is the base of the natural logarithm (e ≈ 2.718), t is the time for which this probability is calculated, M is the MTBF, and λ = 1/M is the corresponding failure rate.

So, if we do the same calculation for the previous example, for a drive with a 360,000-hour MTBF (M = 360,000), we can calculate the failure probability for 1 year (t = 8,760 hrs) to be:

R(t) = 1 − e^(−8,760/360,000) = 0.024 = 2.4%.
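
As a quick cross-check of Equation 48, the exponential model gives essentially the same 2.4% annual figure as the simple extrapolation above (a minimal sketch; M and t are the values from the example):

```python
import math

M = 360_000                          # MTBF/MTTF in hours
t = 8_760                            # one year in hours
failure_probability = 1 - math.exp(-t / M)
print(f"{failure_probability:.3f}")  # 0.024, i.e. about 2.4% per drive per year
```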

Clearly, the above numbers are statistical only and depend very much on the environment in which the tests were conducted; most notable are temperature, humidity, mechanical shocks, and static electricity during handling and installation. It is quite understandable that manufacturers try to conduct such tests in conditions as close to ideal as possible. This means that, in practice, we can expect failure rates higher than those drawn as a statistical calculation from manufacturers' tests.

An interesting study was made by Carnegie Mellon University, which confirmed the empirical knowledge about mass-produced consumer electronics: if a new product does not fail soon after its initial usage, it will serve its purpose until approximately its MTBF time.

This study evaluated the data of about 100,000 hard drives that were used in several large-scale systems and found a large deviation from the manufacturers' information. The average failure rate of all hard drives was six times higher for systems with an operating time of less than three years and even 30 times higher for systems with an operating time of 5 to 8 years. This led to the conclusion that in the early "infancy" period there is a much higher rate of failure; it then settles for a longer period, which is the expected useful working life. After that, drives start failing due to age, wear, and tear, which coincides with the practical experience of around five years (60 months) before an increased failure rate is seen.


The Carnegie Mellon bath-tub curve


URL: https://www.sciencedirect.com/science/article/pii/B9780124045576500094

Sequential Logic Design

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

3.5.5 Synchronizers

Asynchronous inputs to digital systems from the real world are inevitable. Human input is asynchronous, for example. If handled carelessly, these asynchronous inputs can lead to metastable voltages within the system, causing erratic system failures that are extremely difficult to track down and correct. The goal of a digital system designer should be to ensure that, given asynchronous inputs, the probability of encountering a metastable voltage is sufficiently small. “Sufficiently” depends on the context. For a cell phone, perhaps one failure in 10 years is acceptable, because the user can always turn the phone off and back on if it locks up. For a medical device, one failure in the expected life of the universe (10^10 years) is a better target. To guarantee good logic levels, all asynchronous inputs should be passed through synchronizers.

A synchronizer, shown in Figure 3.52, is a device that receives an asynchronous input D and a clock CLK. It produces an output Q within a bounded amount of time; the output has a valid logic level with extremely high probability. If D is stable during the aperture, Q should take on the same value as D. If D changes during the aperture, Q may take on either a HIGH or LOW value, but must not be metastable.


Figure 3.52. Synchronizer symbol

Figure 3.53 shows a simple way to build a synchronizer out of two flip-flops. F1 samples D on the rising edge of CLK. If D is changing at that time, the output D2 may be momentarily metastable. If the clock period is long enough, D2 will, with high probability, resolve to a valid logic level before the end of the period. F2 then samples D2, which is now stable, producing a good output Q.


Figure 3.53. Simple synchronizer

We say that a synchronizer fails if Q, the output of the synchronizer, becomes metastable. This may happen if D2 has not resolved to a valid level by the time it must set up at F2—that is, if tres > Tc − tsetup. According to Equation 3.24, the probability of failure for a single input change at a random time is

(3.25) P(failure) = (T0 / Tc) e^(−(Tc − tsetup)/τ)

The probability of failure, P(failure), is the probability that the output Q will be metastable upon a single change in D. If D changes once per second, the probability of failure per second is just P(failure). However, if D changes N times per second, the probability of failure per second is N times as great:

(3.26) P(failure)/sec = (N T0 / Tc) e^(−(Tc − tsetup)/τ)

System reliability is usually measured in mean time between failures (MTBF). As the name suggests, MTBF is the average amount of time between failures of the system. It is the reciprocal of the probability that the system will fail in any given second:

(3.27) MTBF = 1 / [P(failure)/sec] = (Tc e^((Tc − tsetup)/τ)) / (N T0)

Equation 3.27 shows that the MTBF improves exponentially as the synchronizer waits for a longer time, Tc. For most systems, a synchronizer that waits for one clock cycle provides a safe MTBF. In exceptionally high-speed systems, waiting for more cycles may be necessary.

Example 3.14

Synchronizer for FSM Input

The traffic light controller FSM from Section 3.4.1 receives asynchronous inputs from the traffic sensors. Suppose that a synchronizer is used to guarantee stable inputs to the controller. Traffic arrives on average 0.2 times per second. The flip-flops in the synchronizer have the following characteristics: τ = 200 ps, T0 = 150 ps, and tsetup = 500 ps. How long must the synchronizer clock period be for the MTBF to exceed 1 year?

Solution

1 year ≈ π × 10^7 seconds. Solve Equation 3.27.

(3.28) π × 10^7 = (Tc e^((Tc − 500×10^−12)/(200×10^−12))) / ((0.2)(150×10^−12))

This equation has no closed-form solution. However, it is easy enough to solve by guess and check. In a spreadsheet, try a few values of Tc and calculate the MTBF until discovering the value of Tc that gives an MTBF of 1 year: Tc = 3.036 ns.
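
The same guess-and-check can be automated; the sketch below bisects on Tc using Equation 3.27 with the parameter values from the example (the bracketing guesses of 1 ns and 10 ns are arbitrary assumptions):

```python
import math

tau, T0, t_setup = 200e-12, 150e-12, 500e-12   # flip-flop parameters from Example 3.14
N = 0.2                                        # asynchronous input changes per second
target = math.pi * 1e7                         # one year, in seconds

def mtbf(Tc: float) -> float:
    """Equation 3.27: MTBF = Tc * e^((Tc - t_setup)/tau) / (N * T0)."""
    return Tc * math.exp((Tc - t_setup) / tau) / (N * T0)

lo, hi = 1e-9, 10e-9                           # bracketing guesses for Tc, in seconds
for _ in range(60):                            # MTBF grows monotonically with Tc
    mid = 0.5 * (lo + hi)
    if mtbf(mid) < target:
        lo = mid
    else:
        hi = mid

print(f"Tc ≈ {hi * 1e9:.3f} ns")               # about 3.036 ns, matching the text
```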


URL: https://www.sciencedirect.com/science/article/pii/B9780128200643000039

Dependability Architecture

Bruce Powel Douglass Ph.D., in Real-Time UML Workshop for Embedded Systems (Second Edition), 2014

6.1 Overview

Smart systems are automating many processes that used to be reserved for the monitoring and intervention of highly trained personnel. This adds tremendous benefits in terms of cost and capability, but can we depend upon these systems? Dependability refers to the confidence with which we can entrust our lives to automated systems. Dependability has three primary aspects: safety, reliability, and security.

Reliability is a measure of the “uptime” or “availability” of a system – specifically, it is the probability that a computation will successfully complete before the system fails. It is normally estimated with mean time between failure (MTBF) or a related measure known as availability. MTBF is a statistical estimate of the probability of failure, and applies to stochastic failure modes.

Reducing the system downtime increases reliability by improving the MTBF. Redundancy is one design approach that increases availability because if one component fails, another takes its place. Of course, redundancy only improves reliability when the failures of the redundant components are independent.1 The situation in which a single failure can bring down multiple components is called a common-mode failure. One example of a common-mode failure is running software for both the primary and secondary processing on the same CPU – should the processor fail, then both components will fail. In reliability analysis, great care must be taken to avoid common-mode failures or to provide additional redundancy in the event that an element common to all redundant components fails.

The reliability of a component does not depend upon what happens after the system fails. That is, regardless of the impact of the failure, the reliability of the system remains the same. Clearly the primary concern relative to the reliability of a system is the availability of its functions to the user.

Safety is very different from reliability, but a great deal of analysis affects both safety and reliability. A safe system is one that does not incur too much risk of loss, either to persons or equipment. A hazard is an undesirable event or condition that can occur during system operation. Risk is a quantitative measure of how dangerous a system is and is usually specified as

Risk = Hazard severity × Hazard likelihood

The failure of a jet engine is unlikely, but the consequences can be very severe. Overall, the risk of flying in a plane is tolerable: even though it is unlikely that you would survive a crash from 30,000 feet, such an incident is extremely rare. At the other end of the spectrum, there are events that are common but of lesser concern. A battery-operated radio has a hazard of electric shock, but the risk is acceptable because even though the likelihood of the hazard manifesting is relatively high, its severity is low.2

Faults come in two flavors. Errors are systematic faults introduced in analysis, design, implementation, or deployment. By “systematic,” we mean that the error is always present, even though it may not always be manifest. In contrast, failures are random faults that occur when something breaks. Hardware exhibits both errors and failures, but software exhibits only errors. The distinction between error and failure is important because different design patterns optimize the system against these concerns differently.

The key to managing both safety and reliability is redundancy. Redundancy improves reliability because it allows the system to continue to work in the presence of faults. Simply, the redundant system elements can take over the functionality of faulty ones and continue to provide system functionality. For improving safety, additional elements are needed to monitor the system to ensure that it is operating properly and possibly other elements are needed to either shut down the system in a safe way or take over the required functionality. The goal of redundancy used for safety is different – the concern is not about continuing to provide functionality, but instead to ensure that there is no loss (to either persons or equipment).

The example I like to use to demonstrate the difference is the handgun versus my ancient Plymouth station wagon. The handgun is a highly reliable piece of equipment – most of them fire when dirty or even under water. It is, however, patently not very safe, since even in the absence of a fault you can (and people do) shoot yourself in the foot. On the other hand, my enormous 1972-vintage station wagon (affectionately referred to as “The Hulk”) is the safest automobile on the planet. It has a fail-safe state3 (“OFF”) and it spends all of its time in that state. So while the vehicle is very safe, it is not at all reliable.

As with the other architectural dimensions, safety and reliability are achieved through the application of architectural design patterns.4 All design patterns have costs and benefits, and selecting good safety patterns requires balancing the design concerns, such as

Development cost

Recurring (manufacturing) cost

Level of safety needed

Level of reliability needed

Coverage of systematic faults (errors)

Coverage of random faults (failures)

Complexity

Resource demand

Ease of certification against relevant standards

In general, safety and reliability patterns can be categorized into either homogeneous or heterogeneous patterns. The former creates exact replicas of the architectural elements to provide redundant processing, and adds glue logic to determine when and under what circumstances the replicas run. The latter patterns use different implementations, designs, or approaches to provide redundant processing. These systems can be further subdivided into lightweight or heavyweight patterns. Lightweight patterns use fewer resources but may not be able to provide the full functionality or fidelity of the primary system elements. Heavyweight redundancy replicates the full functionality but at a greater cost.

Security is a bit different from reliability or safety but intersects with both. Security of information is called information assurance, but security is a broader issue with embedded devices. Certainly, security in the IT sense – managing the wired and wireless connections to prevent intrusions – is important, but it is not the only concern. In cyberphysical systems, we must concern ourselves with more mundane threats (such as someone walking off with the device) and with the severity of the potential outcomes, as a security breach at a nuclear power plant may compromise its safety. The solutions are likely to include a mixture of standard IT approaches and physical system security measures.

We will discuss details of these different aspects in more detail in the upcoming problems.


URL: https://www.sciencedirect.com/science/article/pii/B9780124077812000064

Smart Pixels

John A. Neff, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

IV.D Redundancy Based on Reconfiguration

Failure rates for optoelectronic devices are currently very reasonable (mean time between failures in excess of 50,000 hr) and should improve in the future, but large-dimension arrays will have some failures due to the large numbers of devices involved; therefore, allowances will have to be made to accommodate these failures. Rather than replace an entire SPA due to the failure of a single pixel, SPAs of the future will likely have built-in reconfiguration capable of activating redundant pixels. A 256 × 256 array might be configured as a 250 × 250 array with each row and column having 6 redundant pixels. A failed pixel could have its I/Os switched to a redundant pixel. In the case of independent pixels (e.g., optical interconnect), this switching could be accomplished via reconfiguration built into the silicon circuits. For applications involving dependent pixels (e.g., display), MEMS could be used to steer the beams so as to maintain the required spatial patterns among the pixels.


URL: https://www.sciencedirect.com/science/article/pii/B0122274105006955

Requirements Analysis

James D. McCabe, in Network Analysis, Architecture, and Design (Third Edition), 2007

3.3 Developing Service Metrics

After gathering requirements for our network, the next step is to analyze these requirements in order to distinguish between various performance levels in the network. We will develop and use performance thresholds and limits to distinguish between low and high performance, and also use performance characteristics to identify predictable and guaranteed performance levels. Performance thresholds and limits and performance characteristics are measured in the system with service metrics.

Service metrics are either actual measurable quantities in the network or are derived from measured quantities. These service metrics are important, as they are where “the rubber meets the road”—where requirements from all layers in the system are distilled into configurable and measurable quantities.

Recall from Chapter 2 that in order for a performance characteristic to be useful, it must be configurable, measurable, and verifiable within the network. This is particularly true when parts of the network are outside the control of the network administrator—for example, when a service provider is used to supply a service such as Frame Relay in the network, or when parts of the network (or the entire network) are outsourced. In cases such as these, service metrics can be used to ensure that you are getting the service you are requesting (and paying for) from the service provider or outsourcing agent.

The types of service metrics you use will depend on your design and the types of equipment (network devices) you implement in your network, but at this point in the analysis process, you can influence or require what will be measured in the network and (to some extent) how it will be measured.

Service metrics for RMA include:

Reliability, in terms of mean time between failures (MTBF) and mean time between mission-critical failures (MTBCF)

Maintainability, in terms of mean time to repair (MTTR)

Availability, in terms of MTBF, MTBCF, and MTTR (a short calculation sketch follows this list)

Optionally, uptime and downtime (as a percentage of total time), and error and loss rates at various levels, such as packet error rate, bit error rate (BER), cell loss ratio (CLR), cell misinsertion ratio (CMR), and frame and packet loss rates
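
The text lists these metrics without giving a formula, so as an illustration only, the sketch below uses the conventional steady-state relation A = MTBF / (MTBF + MTTR) to show how availability can be derived from the reliability and maintainability metrics; the example numbers are invented.

```python
# Conventional steady-state availability from MTBF and MTTR (the standard
# textbook relation, not a formula stated in this chapter).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical device: MTBF of 8,000 hours, MTTR of 4 hours.
print(f"{availability(8_000, 4):.5f}")   # 0.99950 -> about 99.95% availability
```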

Service metrics for capacity include:

Data rates, in terms of peak data rate (PDR), sustained data rate (SDR), and minimum data rate (MDR)

Data sizes, including burst sizes and durations

Service metrics for delay include:

End-to-end or round-trip delay

Latency

Delay variation

As configurable and measurable quantities in the network, service metrics can be described in terms of variables in network devices. There are also mechanisms to configure and measure these variables. As we see in Chapter 7 on network management, current mechanisms to configure and measure service metrics are found within network management platforms that use the simple network management protocol (SNMP) and the common management information protocol (CMIP), both of which access variables described in management information bases, or MIBs. MIBs describe generic and enterprise-specific management variables.

Examples of variables used as service metrics include:

Bytes in/out (per interface)

IP packets in/out (per interface)

Dropped Internet control message protocol (ICMP) messages/unit time (per interface)

Service-level agreement (SLA) metrics (per user)

Capacity limit

Burst tolerance

Delay

Downtime

3.3.1 Measurement Tools

In addition to the management protocols and MIBs, we can use commonly available tools to help measure service metrics. One such tool is the utility ping (available in TCP/IP releases), which roughly measures round-trip delays between selected sources and destinations in the network. Another tool is pathchar (available from ee.lbl.gov), which combines round-trip delay and per-link capacity measurements with path traces, as does another popular utility traceroute. Another popular tool to analyze TCP traffic is TCPdump. There are also proprietary, enterprise, and technology-specific tools that may be used in addition to those described here.

For example, one method to monitor availability in the network is to use ping to estimate delay and packet loss (see Figure 3.6). Ping tells us the approximate round-trip delay, as well as when ICMP echo packets (ping packets) are lost in the network or at the destination. While not an exact method, it is fairly simple to set up and use and provides an early warning mechanism for RMA problems.
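
As a rough sketch of that early-warning idea, the snippet below sends a handful of ICMP echoes and reports the loss percentage. The host name and the Linux-style "ping -c 1 -W 1" invocation are assumptions, not something prescribed by the text, and the flags differ on other operating systems.

```python
import subprocess

def packet_loss(host: str, count: int = 10) -> float:
    """Send `count` single pings and return the percentage that got no reply."""
    lost = 0
    for _ in range(count):
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],      # Linux-style flags (assumed)
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:                     # non-zero exit: no echo reply
            lost += 1
    return 100.0 * lost / count

if __name__ == "__main__":
    print(f"Packet loss to example.net: {packet_loss('example.net'):.0f}%")
```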


Figure 3.6. Using Ping and IP Packet Loss as Service Metrics for RMA

When developing service metrics, we also want to try to determine where in the system we want to measure each metric, as well as potential mechanisms for measurement, as in Figure 3.7.


Figure 3.7. Example Service Metrics

3.3.2 Where to Apply Service Metrics

Where service metrics are applied is determined in part by what you plan to achieve from them (e.g., separating responsibilities). They are useful when trying to isolate and track problems in the network, especially when there are multiple groups responsible for the network. For example, in Figure 3.6, the service metrics that are applied can also be used to separate responsibilities between an end-to-end provider, a WAN service provider, and other intermediate providers.


URL: https://www.sciencedirect.com/science/article/pii/B9780123704801500049

Replication

Philip A. Bernstein, Eric Newcomer, in Principles of Transaction Processing (Second Edition), 2009

The Primary-Backup Model

To maximize a server’s availability, we should try to maximize its mean time between failures (MTBF) and minimize its mean time to repair (MTTR). After doing the best we can at this, we can still expect periods of unavailability. To improve availability further requires that we introduce some redundant processing capability by configuring each server as two server processes: a primary server that is doing the real work, and a backup server that is standing by, ready to take over immediately after the primary fails (see Figure 9.1). The goal is to reduce MTTR: If the primary server fails, then we do not need to wait for a new server to be created. As soon as the failure is detected, the backup server can immediately become the primary and start recovering to the state the former primary had after executing its last non-redoable operation, such as sending a message to an ATM to dispense money. If it recovered to an earlier state, it would end up redoing the operation, which would be incorrect. Since we are interested primarily in transactional servers, this means recovering to a state that includes the effects of all transactions that committed at the former primary and no other transactions. For higher availability, more backup servers can be used to guard against the possibility that both the primary and the backup fail.


Figure 9.1. Primary-Backup Model. The primary server does the real work. The backup server is standing by, ready to take over after the primary fails.

This technique is applicable to resource managers and to servers that run ordinary applications, such as request controllers and transaction servers. When a server of either type fails, it needs to be recreated. Having a backup server avoids having to create the backup server at recovery time.

If there are many clients and some are connected by slow communication lines, then it can take a long time to recreate sessions with the backup server. To avoid doing this at recovery time, each client connected to the primary server should also have a backup communication session with the backup server. This further decreases (i.e., improves) MTTR.

In general, the degree of readiness of the backup server is a critical factor in determining MTTR. If a backup server is kept up to date so that it is always ready to take over when the primary fails with practically no delay, then it is called a hot backup. If it has done some preparation to reduce MTTR but still has a significant amount of work to do before it is ready to take over from the primary, then it is called a warm backup. If it has done no preparation, then it is called a cold backup.

As in the case of a server that has no backup, when the primary server fails, some external agent, such as a monitoring process, has to detect the failure and then cause the backup server to become the primary. The delay in detecting failures contributes to MTTR, so fast failure detection is important for high availability.

Once the backup server has taken over for the failed primary, it may be worthwhile to create a backup for the new primary. An alternative is to wait until the former primary recovers, at which time it can become the backup. Then, if desired, the former backup (which is the new primary) could be told to fail, so that the original primary becomes primary again and the backup is restarted as the backup again. This restores the system to its original configuration, which was tuned to work well. The cost is a brief period of downtime while the secondary and primary switch roles.

When telling a backup to become the primary, some care is needed to avoid ending up with two servers believing they’re the primary. For example, if the monitor process gets no response from the primary, it may conclude that the primary is dead. But the primary may actually be operating. It may just be slow because its system is overloaded (e.g., a network storm is swamping its operating system), and it therefore hasn’t sent an “I’m alive” message in a long time, which the monitor interprets as a failure of the primary. If the monitor then tells the backup to become the primary, then two processes will be operating as primary. If both primaries perform operations against the same resource, they may conflict with each other and corrupt that resource. For example, if the resource is a disk they might overwrite each other, or if the resource is a communications line they may send conflicting messages.

One way to avoid ending up with two primaries is to require the primary to obtain a lock that only one process can hold. This lock could be implemented in hardware as part of the resource. For example, some networking techniques, such as reflective memory, and most disk systems, such as SCSI and Fiber Channel (as it runs over SCSI), allow a lock on a resource over their shared bus. Or it could be implemented in software using a global lock manager, which is supported by some operating systems that are designed for multiserver clusters and as independent components in some distributed systems. Another solution is to use a third “watchdog” process, which is described in Section 9.4, Primary Recovery with One Secondary.
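
To make the single-lock rule concrete, here is a minimal, hypothetical lease sketch (it is not the book's design, and the class name and timing values are invented): the primary must renew a time-bounded lease, and the monitor can promote the backup only after that lease has expired, so two processes cannot hold primaryship at once.

```python
class Lease:
    """A single time-bounded lock that only one server can hold at a time."""

    def __init__(self, duration_s: float):
        self.duration = duration_s
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, server_id: str, now: float) -> bool:
        # The lease changes hands only if it is free or has expired;
        # the current holder may always renew before expiry.
        if self.holder in (None, server_id) or now >= self.expires_at:
            self.holder = server_id
            self.expires_at = now + self.duration
            return True
        return False

lease = Lease(duration_s=5.0)
assert lease.acquire_or_renew("primary", now=0.0)      # primary takes the lease
assert not lease.acquire_or_renew("backup", now=3.0)   # too early: primary still holds it
assert lease.acquire_or_renew("backup", now=6.0)       # lease expired: backup may take over

# A real deployment also needs the lapsed primary to stop acting as primary
# (fencing); this sketch shows only the mutual-exclusion rule itself.
```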


URL: https://www.sciencedirect.com/science/article/pii/B9781558606234000093

RFID+ Monitoring and Troubleshooting RFID Systems

Dr. Paul Sanghera, in RFID+ Study Guide and Practice Exams, 2007

Mean Time Between Failures (MTBF)

Mean time between failures (MTBF) is the average time between two consecutive failures of a device or a system. Usually there is an underlying assumption in calculating MTBF:

After each failure, the system is fixed and returned to service immediately. This is a measure of reliability, robustness, and stability of the system. It can be applied to system components or to the system itself. In our case, the system is the RFID system and the components are the antennas, readers, host computer, and other network elements.

What do we mean by a failure? We need to define that. For example, on one extreme, you can consider that a reader has failed if it's not operational at all; on the other end of the spectrum, you could consider it failed when it misses a tag read or creates a read error. In general, you can determine a threshold of read errors, and when the number of read errors exceeds the threshold, the reader may be considered failed.

MTBF indicates the robustness of the system measured in the past and, based on that measurement, predicts the rate of failure in the near future. MTBF can be calculated using the following simple equation:

MTBF = TL/NF

where:

TL is the total lifetime (or operation time) of the device or the system over which the MTBF is being measured.

NF is the total number of failures recorded.

For example, if a reader fails twice during 200 hours of operation, the MTBF can be calculated as follows: MTBF = TL/NF = 200/2 = 100 hours

A more involved example is presented in Exercise 10.1.

EXERCISE 10.1

You have an RFID system with 100 readers. You have collected reader failure data for 200 hours. During this time, 10 readers failed once and another set of five readers failed twice (each of the five readers failed twice). What is the MTBF of this system?

Solution:

TL = 100 readers × 200 hours = 20,000 reader hours

NF= 10 + 5 × 2 = 20 reader failures

MTBF = TL/NF = 20,000 reader hours / 20 reader failures = 1000 hours/failure
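
The fleet-level arithmetic of Exercise 10.1 is easy to reproduce in Python; this is just a calculator-style sketch using the numbers from the exercise.

```python
# Exercise 10.1: MTBF = TL / NF for a fleet of readers.
readers, hours = 100, 200
failures = 10 * 1 + 5 * 2            # 10 readers failed once, 5 readers failed twice

total_lifetime = readers * hours     # TL = 20,000 reader-hours
mtbf = total_lifetime / failures     # 20,000 / 20 = 1,000 hours per failure
print(f"MTBF = {mtbf:,.0f} hours/failure")
```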

A reader reads tags. So, tag traffic rate is obviously another metric of interest.

Average Tag Traffic Volume

Average tag traffic volume (ATTV) is the average number of tags passing through an interrogation zone during an interval of time. The interval can be a minute, 10 minutes, an hour, or whatever you determine it to be. This metric is important for the following two reasons:

It indicates how much load the reader has to deal with on a portal.

When a reader reads a tag, it typically sends the collected data to a host computer. So the tag traffic increases the data traffic in the network.

To measure ATTV, the monitoring system collects the following data from the readers:

Tag counts

The time at which the tag was counted

With this data, the ATTV can be calculated using the following equation:

ATTV = (1/N) Σ_{i=1}^{N} t_i

where:

N is the number of intervals for which the measurement is being taken.

ti is the number of tags counted during an interval denoted by i.

Let's work through a simple example. Assume that you determine the interval to be 15 minutes, and you collect the data over an hour. In the four quarters of the hour, 50, 35, 30, and 45 tags are detected, respectively. So, you can calculate the ATTV as shown in the following:

ATTV = (1/N) Σ_{i=1}^{N} t_i = (1/4)(50 + 35 + 30 + 45) = 160/4 = 40 tags per 15 minutes

How do you determine the time interval? That depends on your application and the system requirements. But keep in mind that if your interval is too long, say, a few hours, you cannot see the pattern of traffic (how the traffic changes during this interval), and if your interval is too short, you'll have unnecessarily many data points to deal with.

So, the ATTV indicates the following:

How much tag traffic is flowing through the interrogation zone

The pattern of traffic, that is, how the traffic changes with time; for example, you can see the pattern by taking 24 measurements of ATTV each day

From the ATTV measurements, you can predict how much traffic is expected to pass through an interrogation zone during a certain period. The accuracy of this prediction partly depends on the amount of data that was collected to make this prediction, that is, the statistical uncertainty.

Actual Versus Predicted Traffic Rate

It's always of interest to measure the variance between the predicted value of a metric and its actual value. Actual versus predicted traffic rate (APTR) is the variance of the actual tag traffic from the predicted tag traffic through an interrogation zone over a time period. The predicted tag traffic rate can come from the ATTV measurements in the past. A significant variance of the actual traffic rate from the predicted traffic rate could indicate a problem with the system.

The APTR can be calculated using the following equation:

APTR = (1/N) Σ_{i=1}^{N} (t_a − t_p)

where:

N is the number of intervals for which the actual measurement is taken.

ta is the actual current ATTV.

tp is the predicted ATTV from the past measurements.

The larger the magnitude of APTR, the larger the variance of the actual value from the predicted value, and therefore the louder is the alarm that there could be something wrong with the system. Make sure you are comparing the actual value to the predicted value for the same time interval, because the predicted (and also the actual) values for different time intervals could be different. For example, in a given day, there might be more tag traffic from 2:00 to 3:00 P.M. than from 7:00 to 8:00 A.M.

Read Errors to Total Reads Rate

Read errors to total reads rate (RETR) is the total number of read errors divided by the total number of read attempts. A read error is a reader's failed attempt to read a tag. The measure of RETR can indicate the problems that caused the read errors, including:

A faulty antenna

Faulty tags or improperly tagged items

Improper placement of antennas

Low signal strength

Signal interference, signal absorption, or any other adverse environmental effect

To measure RETR, the monitoring system collects the data about number of read errors, number of successful reads, and the time interval during which these read attempts were made. With this data collected for a few intervals, you can calculate RETR using the following equation:

RETR = (Σ_{i=1}^{N} E_i) / (Σ_{i=1}^{N} E_i + Σ_{i=1}^{N} S_i) = 1 / (1 + Σ_{i=1}^{N} S_i / Σ_{i=1}^{N} E_i)

where:

N is the number of intervals for which the measurements are taken.

Ei is the number of read errors during the interval i.

Si is the number of successful reads during the interval i.

A high value for RETR should be taken as an alarm for a problem with the RFID system (or portal): either an internal problem or due to adverse environmental effects such as absorption and interference.

The value of RETR can change over time.

Read Error Change Rate

Read error change rate (RECR) is the variance of RETR over time. It indicates the instability or unreliability of the RFID system. For example, a continuous increase or fluctuation in the value of RETR indicates an underlying problem with the system. Upon troubleshooting, you might find a fault with the design of the system or with the hardware components.

EXERCISE 10.2

Table 10.1 shows the number of tags counted by a reader in its interrogation zone.

Table 10.1. Reader Tag Count for Exercise 10.2

Time Interval | Tag Count in the Interrogation Zone
9:00–9:10 | 45
9:10–9:20 | 50
9:20–9:30 | 35
9:30–9:40 | 60
9:40–9:50 | 55
9:50–10:00 | 31

During this time:

Total number of read errors = 50

Total number of successful reads = 600

Calculate ATTV and RETR.

Solution:

ATTV = (1/N) Σ_{i=1}^{N} t_i = (1/6)(45 + 50 + 35 + 60 + 55 + 31) = 276/6 = 46 tags per 10 minutes

Σ_{i=1}^{N} S_i / Σ_{i=1}^{N} E_i = 600/50 = 12

RETR = 1 / (1 + 12) = 1/13

Or simply:

RETR = 50/(50 + 600) = 50/650 = 1/13
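
For completeness, here is a short Python sketch that reproduces Exercise 10.2 from the raw counts in Table 10.1 (the list and totals are exactly the values given above):

```python
# Exercise 10.2: ATTV from per-interval tag counts, RETR from error/success totals.
tag_counts = [45, 50, 35, 60, 55, 31]      # tags per 10-minute interval (Table 10.1)
read_errors, successful_reads = 50, 600

attv = sum(tag_counts) / len(tag_counts)               # 276 / 6 = 46 tags per 10 minutes
retr = read_errors / (read_errors + successful_reads)  # 50 / 650 = 1/13

print(f"ATTV = {attv:.0f} tags per 10 minutes")
print(f"RETR = {retr:.4f} (= 1/13)")
```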

A reader reads tags in its interrogation zone. Readability problems can also occur due to faulty tags.


URL: https://www.sciencedirect.com/science/article/pii/B9781597491341500141

Integrated Dependability and Security Evaluation Using Game Theory and Markov Models

Bjarne E. Helvik, ... Svein J. Knapskog, in Information Assurance, 2008

8.2.4 Obtaining System Measures

Denote the state of a system at time t by s(t) ∈ S. Denote the probability of being in state i at time t by Xi(t), that is, Xi(t) = P(s(t) = Si). Let

(8.4) X(t) = {X1(t), …, XN(t)}

be the state probability vector. It is then well known [23, 24] that this vector may be obtained as the solution to the set of linear differential equations:

(8.5) dX(t)/dt = X(t) Q

with the general (but not very useful) solution X(t) = X(0)exp(Qt), where X(0) is the initial condition. However, a numerical solution to Eq. 8.5 is typically quite demanding due to the size of N and the stiffness of Q caused by the orders of magnitude in the ratios between the various rates in the system. Therefore, we concentrate on measures that may be obtained from the asymptotic (i.e., steady-state) behavior of the system. Let

(8.6) X = lim(t→∞) X(t)

The asymptotic state probabilities of the system may now be obtained [23] as the solution to the set of linear equations:

(8.7) X Q = 0N

with one of the equations in Eq. 8.7 replaced by:

(8.8) X 1N = 1

In the above equations, 0N and 1N represent vectors of length N constituted of elements all being 0 and 1, respectively, the latter being a column vector.

Having obtained the asymptotic state probabilities X in Eq. 8.6, we may obtain operational measures of the system like the availability (A), the mean time between failures (MTBF), the mean time spent in the good states (MUT), and so on. The system measures of special interest in our context are:

The mean time to first failure (MTFF) for the system (i.e., the expected time from when the system is as new until the first failure).

The mean time to failure (MTTF) (i.e., the expected duration until a failure occurs when we start to observe the system when it is asymptotically in a good state).

See Figure 8.4 for an illustration of these two system measures.

What measures the average amount of time between failures for a particular system?

FIGURE 8.4. Sample behavior of a system alternating between good and failed states, where MTFF = E(TFF) and MTTF = E(TF)

To efficiently compute these measures, we adopt the approach of Buzacott [25]. The state space is partitioned into two disjoint sets S = {SG,SF}, where SG = {S1,…,SK} and SF = {SK+1,…,SN}, so that the states 1,…,K are good states and the states K + 1, …, N are failed states. Since the state set S is ordered, the Q matrix can be written in partitioned form as:

(8.9) Q = [Q1 Q2; Q3 Q4]

where the size of Q1 is K × K, the size of Q2 is K × (N−K), and so forth. To compute MTFF one assumes that the system is as new at t = 0. Let this be state S1 ∈ SG. Define T = {T1, …, TK}. By solving

(8.10) −T Q1 = {1, 0, …, 0}

the MTFF for the system can be computed as:

(8.11) MTFF = T 1K

To obtain MTTF, the asymptotic state probabilities (Eq. 8.6) must be known. Since S is partitioned, X also can be partitioned as X = {XG, XF}, where XG = {X1,…,XK} and XF = {XK+1,…,XN}. The asymptotic probability of being in one of the good states is from the Markov property:

(8.12) XG* = lim(t→∞) P({s(t) ∈ Si}, i = 1,…,K | s(t) ∈ SG) = XG / (XG 1K)

Hence,

(8.13) MTTF = XG* (−Q1)^(−1) 1K = XG (−Q1)^(−1) 1K / (XG 1K)

Having obtained the asymptotic state probabilities, the availability of the system is straightforwardly obtained as:

(8.14) A = lim(t→∞) P(s(t) ∈ SG) = XG 1K
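
For readers who want to experiment, the computations in Eqs. 8.7 through 8.14 can be reproduced numerically. The sketch below uses NumPy and an assumed two-state repairable system (one good state with failure rate lam and one failed state with repair rate mu); the model and its rates are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hypothetical two-state system: state 0 = good, state 1 = failed.
lam, mu = 1e-4, 1e-1                     # assumed failure and repair rates (per hour)
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])             # generator matrix; each row sums to zero
K = 1                                    # states 0..K-1 are the good states

# Asymptotic state probabilities: solve X Q = 0 with X 1 = 1 (Eqs. 8.7-8.8).
A_mat = np.vstack([Q.T, np.ones(Q.shape[0])])
b = np.zeros(Q.shape[0] + 1)
b[-1] = 1.0
X, *_ = np.linalg.lstsq(A_mat, b, rcond=None)

Q1 = Q[:K, :K]                           # good-to-good block of the partitioned Q (Eq. 8.9)
XG = X[:K]

# MTFF: start "as new" in state 0, solve -T Q1 = {1, 0, ..., 0}, then MTFF = T 1K (Eqs. 8.10-8.11).
e1 = np.zeros(K)
e1[0] = 1.0
T = np.linalg.solve(-Q1.T, e1)
MTFF = T.sum()

# MTTF and availability from the asymptotic probabilities (Eqs. 8.12-8.14).
XG_star = XG / XG.sum()
MTTF = XG_star @ np.linalg.inv(-Q1) @ np.ones(K)
A = XG.sum()

print(f"MTFF = {MTFF:,.0f} h, MTTF = {MTTF:,.0f} h, A = {A:.6f}")
# For this two-state model, MTFF = MTTF = 1/lam = 10,000 h and A = mu/(lam + mu).
```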


URL: https://www.sciencedirect.com/science/article/pii/B9780123735669500100

What measures the average amount of time between failures for a particular system quizlet?

The MTBF (mean time between failures) is the total operating time divided by the number of failures during that time, that is, the reciprocal of the failure rate. If the mean time between failures has been calculated to be 2,000 hours, then on average there is one unit failure per 2,000 hours of operation.

How can we determine the mean time between system failures?

MTBF = # of operational hours ÷ # of failures. For example, an asset may have been operational for 1,000 hours in a year. Over the course of that year, that asset broke down eight times. Therefore, the MTBF for that piece of equipment is 125 hours.

What is MTBF and MTTR?

MTBF vs. MTTR – mean time between failures vs. mean time to repair. MTBF and MTTR are related as different steps in a larger process: MTBF measures the time between failures for devices that need to be repaired, while MTTR is simply the time that it takes to repair those failed devices.

What is MTTF measured in?

You calculate MTTF by taking the total number of hours of operation (aka uptime) and dividing it by the number of items you're tracking.