Evaluating Sparing, Service Level Agreements and Warranties – The Analysis Behind Operational Decisions
By Kenneth Shere
1. Introduction
Large systems, such as air traffic control systems or satellite ground stations, frequently have high availability requirements and 24×7 operations. The operations centers for these systems may be distributed geographically and may include a huge number of hardware devices. Any given location may have thousands of servers and storage devices, multiple networks and many other devices. Experience has shown that these operational centers may have many spares that either figuratively or literally sit on the shelf for years, have expensive service level agreements and have made inadequate use of product warranties (partly because warranties may have expired before the device enters service).
In this article, I present a statistics-based approach to address the following three questions:
- How many spares do we need to buy?
- How do we negotiate a service level agreement (SLA) from a technical perspective?
- How do we evaluate the vendor’s warranty price?
It is assumed that the reader studied statistics at some distant time and doesn't remember much of it. Consequently, an effort is made to explain each step of this approach. Readers who have neither the time nor the patience to go through the statistical analysis can skip it and read the conclusions.
2. Data Needs
This statistical approach depends upon the following data:
- The required system availability
- The probability distribution function of the failures
- The mean time to restore (MTTR)
Availability depends on down time and total time – the latter being a measure of how much time the system is supposed to be up, or operational. The formula is

Availability = (total time – down time) / total time
Depending upon how we define total time and down time, various definitions of availability exist. These definitions depend on whether down time includes scheduled maintenance, whether total time includes the time when a system cannot be used due to lack of data from a dependent system (perhaps resulting from a down communications link), etc. Examples of various definitions are provided in [1].
If we know the mean time between failures (MTBF), then the formula for availability becomes

Availability = MTBF / (MTBF + MTTR)
Both the MTBF and MTTR are statistical parameters. If the probability distribution function (PDF) of the time to failure is known, then the MTBF is the expected value, or mean, of that distribution. If the failure rate is constant, then the probability distribution must be the exponential distribution.
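As a quick numerical check, the availability formula can be evaluated directly. The sketch below (in Python) uses the article's example MTBF of 51,000 hours and assumes, for illustration only, an MTTR of 36 hours.

```python
# Availability from MTBF and MTTR. The MTTR value here is an illustrative
# assumption, not a measured figure.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: MTBF = 51,000 hours, MTTR = 36 hours.
print(round(availability(51_000, 36), 5))  # ≈ 0.99929
```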
A primary issue associated with these calculations is that the PDF is generally unknown. A process is specified for estimating the PDF from operational data in Section 13.
The analyses of this article are applied to two PDFs – the exponential distribution and the normal distribution. These two distributions have markedly different behavior. The exponential distribution, which is used by some vendors, has the property that the probability density of failure is highest early in the life of the device and decreases with time. The normal distribution has the property that most of the failures occur near the mean. The methodology presented here is applicable to any PDF.
Other data affecting sparing, SLAs and warranty analyses include the acceptable level of risk associated with down time and the tolerance of system down time. These items are beyond the scope of this article.
3. Assumptions and their Reasonableness
The analyses of this article are limited to hardware devices. Thus, issues such as software maintenance cost, the cost of monitoring operations, and so on are ignored.
For convenience of explanation, it is assumed that the devices under consideration are servers and that their MTBF is 51,000 hours. This value compares reasonably with the data obtained for other devices listed in Table 1. It is also assumed that we are considering a system of 120 servers. This number is a middle ground, corresponding to about 8 racks of servers. Large operations like an air traffic control system might require far more servers, and many smaller systems require fewer than 64 servers (four racks of 1U servers).
A problem occurs when we inquire into the meaning of the data provided. The vendor representative might not know the difference between the median and the mean. The median is the "50% mark": half the failures occur before this value and half after it. The mean is the expected value of the probability distribution function.
For the normal distribution, the mean and the median are equal. For the exponential distribution, the median equals the mean times ln 2, i.e., median = MTBF × ln 2 ≈ 0.6931 × MTBF.
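The median/mean relationship for the exponential distribution is easy to check in a few lines of Python. Treating the vendor's 51,000-hour figure first as the median and then as the mean shows how different the implied values are.

```python
import math

# For an exponential distribution, median = MTBF * ln 2, so a vendor's
# "51,000 hour" figure means very different things depending on which
# statistic it actually is.
median = 51_000.0
mtbf_if_median = median / math.log(2)   # mean implied if 51,000 is the median
print(round(mtbf_if_median))            # ≈ 73,577 hours

median_if_mean = 51_000.0 * math.log(2) # median implied if 51,000 is the mean
print(round(median_if_mean))            # ≈ 35,351 hours
```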
Table 1. MTBF for a variety of devices
Type of Device | MTBF (hours) | Type of Device | MTBF (hours) |
---|---|---|---|
Server – load balancer | 51,000 – 60,000 | Blade server | 250,000 |
Power distribution unit | 450,000 | 48 port 1 GigEth I/O Mod. | 75,000 |
Switch management module | 1,010,000 | Network switch (has internal redundancy) | 180,000 |
Server A | 45,000 | I/O board | 50,100 |
High speed 8-core rack mounted server | 56,000 | Quad core server | 49,000 |
60-drive storage tray (8 for 7 redundancy yields 99.9995% availability) | 10,800 | High capacity rack server | 148,000 |
Firewall | 51,000 | Switch | 97,000 |
If it were assumed that each device in Table 1 follows an exponential distribution with the listed value as its mean, some of the MTBFs seem too low. For example, an MTBF of 45,000 hours for Server A implies a median of 31,190 hours, or 3.5 years. Fifty percent of the servers failing in that period seems much too high. In this case, it is more likely that 45,000 hours is the median and 64,925 hours (= 45,000/0.6931) is the MTBF. On the other hand, the MTBF of the power distribution unit seems acceptable.
In Table 2, industry data on the percentage of server failures over time is compared to the probability of failure of a server using an exponential failure distribution function and using a normal distribution. Considering comments above regarding whether 51,000 hours is the mean or the median, calculations for the exponential distribution are provided for both options. Assuming 51,000 hours is the MTBF or mean, the exponential function predicts much higher failure rates for each time period. Assuming that 51,000 hours is the median, the five-year failure probability is very close to industry data provided in Table 2, but the 1 and 3 year probabilities are roughly double the industry data.
Table 2. Comparison of server failure rates observed in industry to failure probability of a server using the exponential distribution and normal distribution
Time period | Industry data: % of servers failed – cumulative | Exponential distribution with mean = 51,000 hours | Exponential distribution with median = 51,000 hours | Normal distribution N(51000,25676) |
---|---|---|---|---|
1 year | 5% | 16% | 11% | 5% |
3 years | 18% | 40% | 30% | 17% |
5 years | 42% | 58% | 45% | 39% |
A normal distribution requires knowledge of the standard deviation in addition to the mean. Since the standard deviation is unknown, it is assumed that the industry data for the first year are correct (i.e., 5% of the devices fail in one year, or 8,766 hours). This assumption provides one point on the cumulative distribution curve, so the standard deviation can be calculated; it is 25,676 hours. (Steps for performing this calculation are provided in Section 16.1.) In this case, the normal distribution, N(mean, standard deviation), is written N(51000, 25676).
The first-year match between the normal distribution and industry data is exact because it is assumed. Interestingly, the three-year estimate using the normal distribution differs from the industry data by one percentage point and, on a ratio basis, is within 5% of the industry data. This difference is very small. The difference between predicted failures and industry data for the five-year estimate is the same for the exponential distribution with median 51,000 hours as for the normal distribution N(51000,25676). On a ratio basis, both are within 7% of industry data.
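The standard-deviation calculation and the Table 2 entries can be reproduced with Python's standard library. This is a sketch of the computation described above, not the spreadsheet actually used for the article.

```python
import math
from statistics import NormalDist

HOURS_PER_YEAR = 8766

# Solve for the standard deviation of N(51000, sigma), given the assumption
# that 5% of servers fail within one year.
mean = 51_000
sigma = (HOURS_PER_YEAR - mean) / NormalDist().inv_cdf(0.05)
print(round(sigma))  # ≈ 25,676 hours

# Reproduce the Table 2 columns for 1, 3 and 5 years.
for years in (1, 3, 5):
    t = years * HOURS_PER_YEAR
    p_exp_mean = 1 - math.exp(-t / 51_000)                  # 51,000 h = mean
    p_exp_median = 1 - math.exp(-t * math.log(2) / 51_000)  # 51,000 h = median
    p_norm = NormalDist(mean, sigma).cdf(t)
    print(years, round(p_exp_mean, 2), round(p_exp_median, 2), round(p_norm, 2))
```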
OK, so which probability distribution should be used? There is no clean answer to this question. In fact, the exponential and normal distributions are not the only distributions used for failure analysis; others include the Rayleigh distribution and the Weibull distribution. The latter is a very powerful distribution defined in terms of two parameters (scale and shape), for which the Rayleigh and exponential distributions are special cases. Although the normal distribution is not a special case of the Weibull distribution, for certain parameter values the Weibull can look very much like a normal distribution (with a mean greater than zero). The Weibull distribution is discussed further in Section 13, where a method for estimating the PDF is suggested.
What is the point of this confusion? It is necessary for operations centers to collect failure rates on all devices in their facility. These data are needed to estimate the appropriate failure probability distribution function. Simply relying on vendor information is inadequate because:
- There may be uncertainty on what is meant by the data provided, and on the distribution upon which it is based.
- Even if the data were totally trustworthy and the vendor supplied the distribution used to determine failure rates, these data might not apply to your operations. Failure rates are a function of
- The environment of the operations center (air conditioning, power supply, etc.)
- The load rate of the devices (e.g., servers run at 80% of capacity on a 24×7 basis versus servers run at 10% of capacity on an 8×5 basis).
Another assumption used in this article is that every server failure is fatal; that is, the server must be replaced. This is a very conservative assumption. Failures may be caused by a power supply, a hard disk, a network interface card or some other component of the server that is easily replaced. Large operational centers may keep spare components in inventory, and onsite maintenance personnel may make these repairs, thereby reducing the time to replace the server in inventory to an hour or less. These spare components may be good parts taken from other servers that were removed from service because of failures.
This factor may also influence the cost of a warranty. When a failed server is shipped to the vendor, the vendor may replace it with a new server or replace a failed part and send it back. When warranties are discussed, this factor needs to be considered.
This analysis also does not account for replacement of redundant parts with zero downtime. Fans, power supplies and interface cards are examples of redundant components on servers that can be replaced while the system is hot.
4. How Many Spares Do I Need to Keep in Inventory?
How many spare servers are needed? The answer to this question depends on the number of operational servers in your facility or system, how quickly a spare server can be replenished after it is taken out of inventory to replace a failed server and the criticality of the system under consideration.
For ease of explanation, we assume that the system has 120 servers and that it takes 36 hours to replace a server taken out of inventory. It is also assumed that the spare server is either active or can be installed to replace a failed server within an acceptable down time.
This problem reduces to asking for the probability that n servers fail during the replacement period t_f ≤ t ≤ t_f + 36, where t_f is the time of a failure. A sample calculation is provided in Table 3 for failures from 1 to 5 years of operation. The ending period of 5 years was chosen because it corresponds to a typical replenishment cycle.
5. Likelihood of a Server Failing Before a Spare can be Replenished Assuming an Exponential Distribution
Table 3. Failure probability during various time intervals for an exponential distribution with median = 51,000 hours

Time of Failure, t (years) | Time of Failure, t (hours) | Failure probability in [t, t + 36] |
---|---|---|
1 | 8766 | 0.0434% |
2 | 17532 | 0.0386% |
3 | 26298 | 0.0342% |
4 | 35064 | 0.0304% |
5 | 43830 | 0.0270% |
The probabilities in Table 3 are easily calculated using a Microsoft Excel spreadsheet. The formula is "EXPON.DIST(t, LAMBDA1, TRUE)", where t is expressed in hours and LAMBDA1 = 1/MTBF = 1/73,577, based on the assumption that 51,000 hours is the median (so that MTBF = 51,000/ln 2 ≈ 73,577 hours). TRUE denotes a return of the cumulative probability, as opposed to a point on the probability density curve. The entry for each time t is the difference between the cumulative probability at t + 36 and at t.
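The same calculation can be sketched in Python; each table entry is the difference of the exponential cumulative distribution at t + 36 and at t.

```python
import math

# Probability that a given server fails in the 36-hour window [t, t + 36],
# using the exponential distribution with median 51,000 hours
# (MTBF = 51,000 / ln 2 ≈ 73,577 hours) — the EXPON.DIST calculation.
mtbf = 51_000 / math.log(2)

def expon_cdf(t: float) -> float:
    return 1 - math.exp(-t / mtbf)

for years in range(1, 6):
    t = years * 8766
    q = expon_cdf(t + 36) - expon_cdf(t)
    print(years, f"{q:.4%}")
```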
The next question becomes, how many servers are likely to fail during the interval of [t, t+36] for a system of 120 servers? The answer is determined by using a binomial distribution (the number of n out of m failures).* The result of this calculation is provided in Table 4. The formula for calculating these entries using Excel terminology is:
Equation 1: 1 − BINOMDIST(G19, 120, q, TRUE)
where “G19” is the number of failures (0,1 or 2) during the replacement time (36 hours), 120 corresponds to the number of servers and q is the failure probability shown in Table 3.
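The same "more than n failures" probability can be computed in Python with an explicit binomial sum; the window probability q used below is the 1-year value implied by the exponential distribution with median 51,000 hours.

```python
import math

# P(more than n of the m servers fail in the window), i.e., the Excel
# expression 1 - BINOMDIST(n, m, q, TRUE).
def prob_more_than(n: int, m: int, q: float) -> float:
    cdf = sum(math.comb(m, i) * q**i * (1 - q)**(m - i) for i in range(n + 1))
    return 1 - cdf

q_year1 = 0.00043422  # failure probability in [t, t + 36] at t = 1 year
print(f"{prob_more_than(0, 120, q_year1):.4%}")  # ≈ 5.078%, matching Table 4
```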
As shown in Table 4, there is a 5.1% chance of at least one server failure (or a 94.9% likelihood of no failure) during the 36 hours it would take to replenish a spare after one year of service, and an extremely small chance (<0.0023%) that more than two failures would occur during this period. As time progresses, these probabilities continue to decrease, to the point where the likelihood of more than two failures during the replenishment period is 0.0005% when one is within two months of a normal replenishment period. It seems odd that the likelihood decreases with time, but that is the nature of an exponential distribution.
Table 4. Probability that more than 0, 1 or 2 servers will fail during the 36-hour replacement period, exponential distribution
Probability that the number of failures in [t, t + 36] exceeds n

Time of Failure, t (years) | Time of Failure, t (hours) | n = 0 | n = 1 | n = 2 |
---|---|---|---|---|
1 | 8766 | 5.07829% | 0.13011% | 0.002213% |
2 | 17532 | 4.52091% | 0.10292% | 0.001555% |
3 | 26298 | 4.02342% | 0.08137% | 0.001092% |
4 | 35064 | 3.57965% | 0.06431% | 0.000766% |
5 | 43830 | 3.18401% | 0.05081% | 0.000538% |
The answer to how many spares are necessary depends on risk tolerance. For a mission critical system in which down time is very detrimental, two spares should be maintained for a system of 120 servers. In the event of a second failure during the replenishment window, use the remaining spare server. The likelihood of a third failure within 36 hours, which would require a third spare, is less than 0.0023% – a highly unlikely event. For an extremely risk-averse situation, maintain a third spare.
If a server in inventory can be replaced in fewer than 36 hours, then the probability of failure during the replacement period drops substantially. Frequently, a server can be replaced within 24 hours and if the operational facility is located near a vendor facility, it can sometimes be replaced within 4 hours. For example, if a spare server is replaced within 24 hours, the probability of needing the second spare is less than 0.06% after 1 year, and reduces to 0.02% after 5 years.
* When the total number of items times the failure probability, mq, does not exceed 10, the binomial distribution can be approximated by the Poisson distribution. This statistical detail was historically very useful, but today it matters only for hand calculations.
6. Likelihood of a Server Failing Before a Spare can be Replenished Assuming a Normal Distribution
In this case, we assume that the failure probability follows a normal distribution with mean of 51,000 hours. As above, we assume that 5% of the servers fail in one year, so that the standard deviation is 25,676 hours.
The failure probability during the spare replacement period is provided in Table 5. Since 5 years is close to the MTBF, when most of the failures occur, the need for spares differs from what is dictated by the exponential distribution. For 120 servers, the likelihood of a failure occurring within the spare replacement period at one year is 1.7% (a 98.3% likelihood of no failure), and the likelihood of more than two failures during this period is 0.00008% (8 × 10^-7). After five years, the time is close to the mean, so the likelihood of a failure during this period increases to 6.3%, but the likelihood of more than two failures is only 0.0042%, which is still highly unlikely (there is a 99.9958% likelihood of two or fewer failures). Thus, for a system of 120 servers, maintaining two spare servers should be adequate. Note that if a spare server could be replenished in 24 hours, then after five years the likelihood of more than two failures occurring during this period reduces to 0.001%. There is still a 4% chance that at least one failure would occur, so a strategy of maintaining two spares for a 120-server system remains reasonable.
Table 5. Probability that more than 0, 1 or 2 servers will fail during the 36-hour replacement period, normal distribution N(51000,25676)
Probability that the number of failures in [t, t + 36] exceeds n

Time of Failure, t (years) | Time of Failure, t (hours) | n = 0 | n = 1 | n = 2 |
---|---|---|---|---|
1 | 8766 | 1.72230% | 0.01479% | 0.00008% |
2 | 17532 | 2.83236% | 0.04016% | 0.00038% |
3 | 26298 | 4.14095% | 0.08623% | 0.00119% |
4 | 35064 | 5.38930% | 0.14669% | 0.00265% |
5 | 43830 | 6.25440% | 0.19817% | 0.00417% |
7. How Many Failures are Likely to Occur?
The answer to this question is needed to analyze maintenance strategies, including service level agreements and warranties. By understanding the probability that a specified number of devices will fail during a given period of time, and the replacement cost of those devices, an organization is in a position of strength when evaluating and negotiating service level agreements (SLAs) and warranties. Note that the replacement cost includes the cost of a new device; the labor cost of removing the failed device and installing a new one; and disposal, shipping and other costs as appropriate. If these costs are not understood, then the organization is not equipped to quantitatively evaluate these agreements or warranties.
In this section we calculate the probability that “n or fewer” servers will fail within a given time period. The expected number of servers to fail in a given period of time is simply the total number of servers times the probability of failure in that time.
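For example, under the exponential assumption with median 51,000 hours, the expected first-year failure count for 120 servers is easily computed:

```python
import math

# Expected failures = (number of servers) x (cumulative failure probability).
# Exponential distribution with median 51,000 hours; system of 120 servers.
mtbf = 51_000 / math.log(2)
p_first_year = 1 - math.exp(-8766 / mtbf)
expected = 120 * p_first_year
print(round(expected, 1))  # ≈ 13.5 servers in the first year
```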
By knowing the likely number of failures and the replacement cost, we can pose questions such as “Is it cheaper to just buy a bunch of spares and forego the SLA?” If it is cheaper to have the SLA, why? The latter question is something to be explored during the evaluation of the proposed SLA. If the SLA cost is close to the cost of simply having enough spares to forego the SLA, the organization is in a strong negotiation position.
8. Number of Failures Using the Exponential Distribution
To determine these probabilities, use the binomial distribution of Equation 1, with q the cumulative probability that a server has failed by time t (for years 1 to 5) and "G19" the number of failed servers. Omitting the "1 −" – that is, using BINOMDIST(G19, 120, q, TRUE) directly – gives the probability of "≤ G19" failures. The results for the device under consideration using the exponential distribution are provided in Table 6.
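A sketch of the same computation in Python, using a hand-rolled binomial CDF; here q is the cumulative failure probability at one year.

```python
import math

# "n or fewer of 120 servers fail by time t": BINOMDIST(n, 120, q, TRUE)
# in Excel terms, with q the single-server cumulative failure probability.
def binom_cdf(n: int, m: int, q: float) -> float:
    return sum(math.comb(m, i) * q**i * (1 - q)**(m - i) for i in range(n + 1))

mtbf = 51_000 / math.log(2)                # exponential, median 51,000 hours
q_1yr = 1 - math.exp(-8766 / mtbf)
print(f"{binom_cdf(16, 120, q_1yr):.3%}")  # ≈ 81%, matching Table 6
```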
Table 6. Probability that "n or fewer" servers will fail within the indicated time period {system of 120 servers with exponential distribution and median failure probability (50%) at 51,000 hours}

# failed servers (≤ n) | 1 year (8766 hours) | 2 years (17,532 hours) | 3 years (26,298 hours) | 4 years (35,064 hours) | 5 years (43,830 hours) |
---|---|---|---|---|---|
12 | 40.228% | 0.095% | 0.000% | 0.000% | 0.000% |
16 | 81.155% | 1.892% | 0.002% | 0.000% | 0.000% |
20 | 97.366% | 13.357% | 0.057% | 0.000% | 0.000% |
24 | 99.833% | 42.483% | 0.881% | 0.002% | 0.000% |
28 | 99.995% | 75.625% | 6.359% | 0.049% | 0.000% |
32 | 100.000% | 93.933% | 24.120% | 0.632% | 0.003% |
36 | 100.000% | 99.136% | 53.992% | 4.375% | 0.060% |
40 | 100.000% | 99.930% | 81.232% | 17.413% | 0.662% |
44 | 100.000% | 99.997% | 95.148% | 42.905% | 4.214% |
48 | 100.000% | 100.000% | 99.227% | 71.609% | 16.270% |
52 | 100.000% | 100.000% | 99.925% | 90.562% | 40.274% |
56 | 100.000% | 100.000% | 99.996% | 97.992% | 68.684% |
60 | 100.000% | 100.000% | 100.000% | 99.734% | 88.836% |
64 | 100.000% | 100.000% | 100.000% | 99.978% | 97.430% |
68 | 100.000% | 100.000% | 100.000% | 99.999% | 99.631% |
In Sections 4 through 6, we showed that it is unnecessary to keep more than two spare servers in inventory, provided that a server taken out of inventory and put into service can be replenished within 36 hours. Table 6 shows, for example, that there is an 81% chance of replacing 16 or fewer servers during the first year of service, and a 90% chance of replacing 52 or fewer servers within 4 years of service.
If the cost of replacing a server is C, then there is roughly a 50% chance that the maintenance cost (i.e., the cost of server replacement) is at most 14 × C during the first year, and an 81% chance that the cost will be less than 16 × C. Similarly, there is a 91% chance that the four-year cost will be less than 52 × C.
Suppose that a server costs $10,000 and that replacing a failed server takes 3 hours of labor at $200/hour. For this example, C = $10,600, and the four-year cost of maintaining 120 servers is less than $551,200 with 91% probability. If a vendor bids more than this amount in a service level agreement to maintain a system of 120 servers, the bid is excessive.
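The arithmetic of this example, using the article's assumed prices:

```python
# Four-year maintenance-cost bound from Table 6 (exponential case):
# 91% chance that 52 or fewer of the 120 servers need replacement.
server_cost = 10_000
labor = 3 * 200          # 3 hours at $200/hour
C = server_cost + labor  # $10,600 per replacement
bound = 52 * C
print(f"${bound:,}")     # $551,200, with ~91% probability
```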
To determine the cost that is reasonable for your operations, it is necessary to collect operational data. One question to ask is whether the exponential distribution really applies to your environment.
9. Number of Failures Assuming a Normal Distribution
As indicated above, we assume that the MTBF is 51,000 hours and that 5% of the servers fail in the first year. These data permit us to calculate the standard deviation, thereby determining the normal distribution explicitly. With these assumptions, it is shown in Section 16.1 that the distribution is N(51000,25676).
To parallel the discussion of the preceding section, we next present a table showing the number of servers likely to fail over a period of 43,830 hours.
Table 7. Probability that "n or fewer" servers will fail within the indicated time period {system of 120 servers with normal distribution N(51000,25676)}

# failed servers (≤ n) | 1 year (8766 hours) | 2 years (17,532 hours) | 3 years (26,298 hours) | 4 years (35,064 hours) | 5 years (43,830 hours) |
---|---|---|---|---|---|
1 | 1.553% | 0.007% | 0.000% | 0.000% | 0.000% |
5 | 44.155% | 2.189% | 0.002% | 0.000% | 0.000% |
9 | 92.137% | 27.185% | 0.242% | 0.000% | 0.000% |
12 | 99.281% | 63.065% | 2.526% | 0.001% | 0.000% |
16 | 99.990% | 93.182% | 18.704% | 0.031% | 0.000% |
20 | 100.000% | 99.484% | 54.367% | 0.647% | 0.000% |
24 | 100.000% | 99.983% | 85.487% | 5.562% | 0.001% |
28 | 100.000% | 100.000% | 97.536% | 23.182% | 0.020% |
32 | 100.000% | 100.000% | 99.777% | 54.004% | 0.312% |
36 | 100.000% | 100.000% | 99.989% | 81.927% | 2.548% |
40 | 100.000% | 100.000% | 100.000% | 95.611% | 11.844% |
44 | 100.000% | 100.000% | 100.000% | 99.357% | 33.535% |
48 | 100.000% | 100.000% | 100.000% | 99.944% | 62.686% |
52 | 100.000% | 100.000% | 100.000% | 99.997% | 85.666% |
56 | 100.000% | 100.000% | 100.000% | 100.000% | 96.426% |
Paralleling the preceding section, we analyze Table 7. It indicates that there is a 92% chance that 9 or fewer servers will fail in the first year. Considering the assumption that 5% of the servers (6 out of 120) are expected to fail in that year, this is a reasonable conclusion. As we progress to year four, there is a 99% chance that 44 or fewer servers will fail within that time. The number of servers that are likely to fail is driven by the standard deviation. As we approach year 5, there is only a 63% probability that 48 or fewer servers will fail. This sharply increasing number of expected server failures is due to the fact that we are approaching the mean (51,000 hours, or 5.8 years), when half the servers (recall that the mean equals the median for normal distributions) are expected to have failed.
Note that the 91% probability for year four is that 38 or fewer servers will fail using the normal distribution and 52 or fewer servers will fail using the exponential distribution.
10. Comparing the Cost of Failures using Two Distributions
The normal distribution predicts 14 fewer failures during the first four years of use for a system of 120 servers. Assuming $10,000 per server plus $600 labor to replace a server with a spare, the normal distribution provides a four-year estimate that is $148,400 lower than the estimate provided by the exponential distribution. This difference again points to the need for large operations to collect data and determine the appropriate probability distribution so that they can more accurately assess maintenance cost.
11. Estimating Maintenance Cost
The preceding sections focused on how many servers are likely to fail and comparing the results using two different failure probability distributions. Estimating the cost of maintenance also depends on:
1. Warranties provided by the vendor at no charge.
2. Extended warranties provided by the vendor for a fee.
3. Operational data related to labor cost and the time it takes to perform maintenance tasks.
4. The ability of the maintenance staff to assess server failures and determine whether the server needs to be replaced or whether the maintenance staff can replace a component.
5. The time it takes to replenish in inventory a server that has been placed in service.
Items 1 and 2 depend on negotiations with vendors. Item 5 is a function of the service level agreement and the tolerance for down time. For example, a mission-critical system might require active spares and a four-hour replacement window for a failed server. Even if the likelihood of a failure occurring during an eight-hour window is extremely small, it may be necessary to require a four-hour window and an extra spare to ensure that the mission-critical system experiences no down time.
This example is intended to show that mission criticality can drive maintenance cost higher than what might be specified by a purely analytical cost estimate.
Items 3 and 4 are major drivers for cost estimation. Simply stated, future costs cannot be estimated if the current cost is not known. Likewise, the value of a warranty and decisions on whether or not to purchase an extended warranty depend on knowing the cost of current operations on a task by task basis. If an organization wants to evaluate a service level agreement, it is necessary to know all of the data specified above and failure probability distributions for all devices included in the agreement.
12. Evaluating Warranties
Many vendors provide warranties with their devices. These warranties vary considerably. For example, a vendor might provide a free one-year warranty and offer a 3-year warranty at a given price. The question here is whether it is worth the price to buy the 3-year warranty. It is assumed here that the first year is free, so the 3-year warranty really covers years two and three.
The answer depends on a tradeoff of the warranty cost versus the maintenance cost of replacing the expected number of devices (servers in our example) that are likely to fail. Given a system of 120 servers with the assumptions specified above, we see from Table 6 and Table 7 that:
- For an exponential distribution, there is roughly a 98% chance (99.227% – 0.881%) that between 25 and 48 servers will fail during years 2 and 3. (25 has been picked because there is a 99.9% likelihood that 25 or fewer servers will fail during the first year.)
- For a normal distribution, there is a roughly a 97% chance that between 12 and 36 servers will fail during years 2 and 3.
The actual calculation for determining the probabilities associated with the number of failures expected during years 2 and 3 involves complicated conditional probabilities. The simplest way to get a more accurate answer is to simulate the problem: for example, perform a thousand random trials for each distribution and count the number of failures during years 2 and 3. The answer should be fairly close to the rough answer provided by the simple subtraction above.
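Such a simulation is only a few lines of Python. The sketch below draws failure times for 120 servers from each distribution and averages, over 1,000 trials, the count of failures landing in years 2 and 3. Note that these averages estimate the expected number of failures in that window, which need not coincide with the percentile-based bounds quoted above.

```python
import math
import random

random.seed(1)

# Monte Carlo estimate of how many of 120 servers fail during years 2-3
# (failure time in (8766, 26298] hours), under the article's assumptions.
YEAR = 8766
MTBF_EXP = 51_000 / math.log(2)   # exponential with median 51,000 hours
MU, SIGMA = 51_000, 25_676        # normal N(51000, 25676)
TRIALS, SERVERS = 1_000, 120

def avg_failures(draw) -> float:
    total = 0
    for _ in range(TRIALS):
        total += sum(1 for _ in range(SERVERS) if YEAR < draw() <= 3 * YEAR)
    return total / TRIALS

avg_exp = avg_failures(lambda: random.expovariate(1 / MTBF_EXP))
avg_norm = avg_failures(lambda: random.gauss(MU, SIGMA))
print(round(avg_exp, 1), round(avg_norm, 1))
```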
Interestingly enough, either distribution indicates that roughly 24 servers are likely to fail during years two and three {48 − 25 = 23 for the exponential distribution and 36 − 12 = 24 for the normal distribution}. Thus, if the warranty for years two and three costs more than replacing 24 servers, don't buy the warranty.
If the warranty costs significantly less than the cost of 24 servers, look for the reason why. Several possibilities exist. One possibility is that the vendor thinks that the actual reliability for the device is much higher than either (i) what is advertised or (ii) what has been used in your analysis. Another possibility is that the vendor thinks that the cost of repairing a server is quite low, so few actual replacements will be provided.
Another consideration is operational data. If operational data indicates a significant discrepancy between warranty cost and estimated maintenance cost for the servers, it is necessary to determine why this difference exists. The devices being purchased might be much more reliable (much less reliable) than previous models.
13. Estimating the Failure PDF from Operational Data
As indicated above, this article compares distributions with significantly different properties. In the case of the normal distribution, failures are clustered about the mean. In the case of the exponential distribution, failures are more likely to occur early in the life cycle (i.e., before the mean). So which should you use?
The most likely answer to this question is neither. Use your operational data to estimate the probability distribution for the device under consideration. The approach we recommend is to use the Weibull distribution.
The probability density function for the Weibull distribution is

Equation 2: f(x; λ, k) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k) for x ≥ 0, and f(x; λ, k) = 0 for x < 0
The cumulative probability, or probability distribution function, is obtained by integrating Equation 2 from 0 to x. The parameter k is called the shape parameter and λ the scale parameter. This is a very useful general distribution because it corresponds to the exponential distribution when k = 1 and the Rayleigh distribution when k = 2. The shape of the Weibull distribution for various parameters is provided in Figure 1. Note the similarity of the green curve (λ = 1 and k = 5) to a normal distribution with mean 1. Decisions regarding the shape parameter are generally determined from experimental and operational data.
For the exponential distribution (k = 1), λ corresponds to the mean, or MTBF, and the PDF is completely defined.
The approach recommended is to "fit the Weibull distribution to your data." This can be done by trying a large number of values for k and λ and performing, for example, a least-squares calculation, then selecting the particular values of k and λ that provide the best fit. It can reasonably be assumed that the shape parameter is between 1 and 5. Use values of k between 1 and 5.3 in increments of 0.1; that is, run the calculation 44 times for each value of λ. For estimating λ, start with the mean of your sample and pick a reasonable range of values around it. If 100 values of λ were selected, a total of 4,400 calculations would be needed. These calculations should not require significant compute time.
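A minimal sketch of this grid search in Python. The data here are synthetic, generated from a known Weibull (k = 2, λ = 50,000) so the fit can be checked; a real analysis would use empirical cumulative failure fractions from operational data.

```python
import math

# Grid-search least-squares fit of the Weibull CDF, F(x) = 1 - exp(-(x/lam)^k),
# to (time, cumulative-failure-fraction) points. Synthetic data from
# k = 2, lam = 50,000 stand in for operational measurements.
data = [(t, 1 - math.exp(-(t / 50_000) ** 2)) for t in range(5_000, 60_000, 5_000)]

def sse(k: float, lam: float) -> float:
    return sum((1 - math.exp(-(t / lam) ** k) - f) ** 2 for t, f in data)

best = min(
    ((k / 10, lam) for k in range(10, 54) for lam in range(40_000, 60_001, 500)),
    key=lambda p: sse(*p),
)
print(best)  # recovers (2.0, 50000)
```

With noisy field data the recovered parameters will only approximate the truth, and a finer λ grid can be used around the first pass's winner.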
14. Summary
This article uses a system comprised of 120 servers to discuss, from a probabilistic standpoint, how many spares need to be kept in inventory. Whether the spare is located on a shelf in a storage room or in a rack (possibly in an active backup mode) is not relevant to the discussion. The focus is on determining the likelihood of a failure occurring during the interval from taking the spare out of inventory to its replacement with a new spare in inventory. Assuming this replenishment period to be 36 hours, the likelihood of 1, 2 or 3 failures during this period is calculated for both exponential and normal distributions. Under the assumptions specified above, it was concluded that two spares are adequate.
Attention next turns to how many total failures are expected during a five-year period. The total number of failures depends on the distribution. It is shown that, with 96% probability, the number of failures for an exponential distribution with median 51,000 hours is nine more than the number for a normal distribution N(51000, 25676). It was also shown that the number of server failures in the second and third years is likely to be 24 (about 97% likelihood) for either distribution. This information, and the associated approaches for determining it, provides useful input when considering service level agreements and warranties.
An underlying theme is that operations groups need to collect data on device failures. These data are needed to develop estimates of the failure probability distribution for each type of device. It is these data, and the associated failure PDF, that are needed to address questions related to sparing, service level agreements and warranties.
15. References and notes
[1] Availability – Definitions within Systems Engineering, http://en.wikipedia.org/wiki/Availability [last modified on 23 April 2014 at 22:31]
[2] To protect proprietary information, vendor names and specific models are omitted, and the actual MTBF numbers have been modified. However, all numbers provided in Table 1 are in the “general ballpark” of the actual data.
[3] Randy Perry, et al., “The cost of retaining aging IT infrastructure,” IDC White Paper, February 12, 2012.
[4] Wikipedia, The Free Encyclopedia Weibull Distribution, http://en.wikipedia.org/wiki/Weibull_distribution
[5] Two internal hardware experts indicated that assuming 5% of the servers would fail in five years seems reasonable. Using a conservative assumption that 25% of the servers would fail in five years results in a standard deviation of 10,600 hours – less than half of the standard deviation used in this paper. With that assumption, it would be unnecessary to have more than one spare.
[6] Wikipedia, op. cit.; Dell Power Solutions, http://www.dell.com/content/topics/global.aspx/power/en/ps3q02_shetty?c=us&1=en [accessed 7/9/13]
[7] W. Beyer (ed.), Handbook of Tables for Probability and Statistics, The Chemical Rubber Company, 2nd edition, 1968.
16. Appendix – Illustrations of Probability Distribution Functions
The PDF, F(x), is the cumulative probability of an occurrence; that is, the probability of occurrence for y ≤ x. The probability density function f(x) is the rate of change of F(x). For a continuous random variable, F(x) is the integral of f(y) from −∞ to x.
16.1 The normal distribution
In the case of a normal distribution, the density function is
f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
Equation 3
In Equation 3, σ is called the standard deviation and µ is the mean. The PDF is denoted N(µ, σ).
Figure 2 shows the probability density function for the normal distribution N(µ, σ) with µ = 0 and σ = 1. The region under the curve to the left of the vertical line on the graph corresponds to 10% of the area under the curve; this area corresponds to a failure probability of 10%.
To use the normal distribution statistical tables, it is necessary to translate from N(µ, σ) to N(0, 1). This translation is accomplished using Equation 4.
z = (x − µ)/σ or, equivalently, x = µ + σz
Equation 4
Here, x is the variable for the distribution N(µ, σ) and z is the corresponding variable for the standard distribution N(0, 1). In Equation 4, z is the value of the abscissa (the horizontal axis).
For the example used in this article, σ is determined from Equation 4 and the fact that 5% of the devices fail in one year (8,766 hours). From probability tables, Pr(z) = 5% for z = −1.645 = (8766 − 51000)/σ. Solving for σ, we get σ = 25,676 hours (almost 3 years). The normal distribution is therefore N(51000, 25676).
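The same calculation can be done without tables using Python's standard library; the inputs below are those of the article's example.

```python
from statistics import NormalDist

MEAN_HOURS = 51_000      # mean of the normal distribution (MTBF)
YEAR_HOURS = 8_766       # hours in one year
FAIL_FRACTION = 0.05     # 5% of devices fail in the first year

# z such that Phi(z) = 0.05 for the standard normal N(0, 1).
z = NormalDist().inv_cdf(FAIL_FRACTION)      # about -1.645

# Solve z = (YEAR_HOURS - MEAN_HOURS) / sigma for sigma.
sigma = (YEAR_HOURS - MEAN_HOURS) / z
print(f"z = {z:.3f}, sigma = {sigma:,.0f} hours")
```

`NormalDist().inv_cdf` plays the role of the probability table lookup; any spreadsheet's NORM.S.INV function does the same job.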
16.2 The Exponential Distribution
As indicated in Section 13, the Weibull distribution with k = 1 becomes the exponential distribution.
F(x) = 1 − e^(−x/λ)
Equation 5
An illustration of this PDF with λ = 1 is provided in Figure 3.
Dell Power Solutions uses the exponential distribution to determine the reliability of subsystems of their direct attached storage.
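As a sketch of that kind of analysis (the subsystem names and MTBF values below are hypothetical, not Dell's figures), under an exponential model each subsystem's reliability is R(t) = e^(−t/MTBF), and independent subsystems in series multiply:

```python
import math

# Hypothetical subsystem MTBFs in hours (illustrative values only).
mtbf_hours = {"controller": 150_000, "disk": 500_000, "fan": 300_000}
t = 43_830   # five years of continuous operation, in hours

# Exponential reliability R(t) = exp(-t / MTBF) for each subsystem;
# the series-system reliability is the product of the parts.
reliabilities = {name: math.exp(-t / m) for name, m in mtbf_hours.items()}
system_r = math.prod(reliabilities.values())

for name, r in reliabilities.items():
    print(f"{name}: R(5 yr) = {r:.3f}")
print(f"system: R(5 yr) = {system_r:.3f}")
```

The multiplication step is why a system's reliability is always dominated by its weakest subsystem.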
Figure 2 and Figure 3 are nice to look at but impossible to read unless one has the eyesight of an owl or eagle. Computations can readily be done using a spreadsheet, or the probabilities can be determined the old-fashioned way – by looking them up in a table, e.g., [7].
About the Author
Kenneth Shere is the Senior Engineering Specialist for The Aerospace Company.