M O B J E C T I V I S T: Predictably Unreliable

I wrote about the unpredictably predictable nature of wind power in a few recent posts.

And of course we have watched the unexpected and unpredicted blow-out of the Deepwater Horizon oil well (the ultra-rare 1 out of 30,000 failure according to conventional wisdom) while hoping for the successful deployment of relief wells.

In the wind situation we know that it will work at least part of the time (given sufficient wind power, that is) without knowing precisely when, while in the second case we can only guess when a catastrophe with such safety-critical implications will occur.

We also have the unnerving situation of knowing that something will eventually blow out, but with uncertain knowledge of exactly when. Take the unpredictability of popcorn popping as a trivial example. We can never predict the popping time of any particular kernel, but we know the vast majority will pop.

In a recent episode that I went through, the specific failure also did not come as a surprise. I had an inkling that an Internet radio that I frequently use would eventually stop working. From everything I had read online, my Soundbridge model had a power-supply flaw that would eventually reveal itself as a dead radio. Previous customers had reported the unit would go bad anywhere from immediately after purchase to a few years later. After about 3 years it finally happened to my radio and the failure mode turned out exactly the same as everyone else's -- a blown electrolytic capacitor and possibly a burned-out diode.

The part obviously blew out because of some heat stress and power dissipation problem, yet like the popcorn popping, my interest lies in the wide range in failure times. The Soundbridge failure in fact looks like the classic Markov process of a constant failure rate per unit time. In a Markov failure process, the expected number of defects reported per day is proportional to the number of units that remain operational. This turns into a flat line when graphed as failure rate versus time. Customers that have purchased Soundbridges will continue to routinely report the failures for the next few years, with fewer and fewer reports as that model becomes obsolete.
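As a minimal sketch of that constant-rate picture, the following Python tracks the expected number of reports per day for a population of units. The function name, the population size, and the daily rate are all hypothetical values chosen for illustration, not figures for the Soundbridge.

```python
def expected_daily_failures(n_units, daily_rate, n_days):
    """Expected failures per day when every surviving unit fails at a constant rate."""
    survivors = float(n_units)
    history = []
    for day in range(n_days):
        failures = daily_rate * survivors  # reports are proportional to survivors
        history.append((day, survivors, failures))
        survivors -= failures              # failed units drop out of the population
    return history


# Hypothetical numbers: 10,000 units sold, 0.1% of survivors fail each day.
history = expected_daily_failures(n_units=10_000, daily_rate=0.001, n_days=1095)
for day, survivors, failures in history[::365]:
    # The absolute number of reports shrinks, but failures / survivors stays flat.
    print(day, round(survivors), round(failures, 2), failures / survivors)
```

The raw count of reports decays as the population thins out, while the ratio of failures to survivors stays constant, which is the flat line described above.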

Because of the randomness of the failure time, we know that any failures should follow some stochastic principle and that entropic effects likely play into the behavior as well. When the component goes bad, the unit's particular physical state and the state of the environment govern the actual process; engineers call this the physics of failure. Yet, however specific the failure circumstance, the variability in the component's parameter space ultimately sets the variability in the failure time.

So I see another way to look at failure modes. We can either interpret the randomness from the perspective of the component or from the perspective of the user. If the latter, we might expect that someone would abuse the machine more than another customer, and therefore effectively speed up its failure rate. Except for some occasional power-cycling this likely didn't happen with my radio as the clock stays powered in standby most of the time. Further, many people will treat their machine gingerly. So we have a spread in both dimensions of component and environment.

If we look at the randomness from a component quality-control perspective, certainly manufacturing variations and manual assembly play a role. Upon internal inspection, I noticed the Soundbridge needed lots of manual labor to construct. Someone posting to the online Roku radio forum noticed a manually extended lead connected to a diode on their unit -- not good from a reliability perspective.

So I have a different way of thinking about failures which doesn't always match the conventional wisdom in reliability circles. In certain cases the result derives as expected, but in other cases the result diverges from the textbook solution.

Fixed wear rate, variable critical point: To model this to first-order, we assume a critical-point (cp) in the component that fails and then assume a distribution of the cp value about a mean. Maximum entropy would say that this distribution would approximate an exponential:

p(x) = 1/cp * exp(-x/cp)

The rate at which we approach the variable cp remains constant at R (everyone uses/abuses it at the same rate). Then the cumulative probability of failure is

P(t) = integral of p(x) from x=0 to x=R*t

This invokes the monotonic nature of failures by capturing all the points on the shortest critical path, and anything "longer" than the R*t threshold won't get counted until it fails later on. The solution to this integral becomes the expected rising damped exponential.

P(t) = 1 - exp(-R*t/cp)

Most people will substitute a value of τ for cp/R to make it look like a lifetime. This is the generally accepted form for the expected lifetime of a component to first-order.

P(t) = 1 - exp(-t / τ)

So even though it looks as if we have a distribution of lifetimes, in this situation we actually have as a foundation a distribution in critical points. In other words, I get the correct result but I approach it from a non-conventional angle.
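One way to check this numerically is a small Monte Carlo run: draw a critical point for each unit from the exponential distribution, wear every unit at the same rate R, and compare the fraction failed by time t against 1 - exp(-t/τ). This is only a sketch; the function names, the seed, and the values cp = 3 and R = 1 are assumptions made for illustration.

```python
import math
import random


def simulate_variable_critical_point(n_units, cp_mean, rate, seed=0):
    """Failure times when the critical point varies and the wear rate is fixed."""
    rng = random.Random(seed)
    # expovariate takes lambda = 1 / mean; a unit fails once rate * t reaches its x.
    return [rng.expovariate(1.0 / cp_mean) / rate for _ in range(n_units)]


def empirical_cdf(times, t):
    """Fraction of units that have failed by time t."""
    return sum(1 for x in times if x <= t) / len(times)


cp_mean, rate = 3.0, 1.0      # hypothetical values
tau = cp_mean / rate          # the lifetime-like constant from the text
times = simulate_variable_critical_point(100_000, cp_mean, rate)
for t in (1.0, 3.0, 9.0):
    print(t, empirical_cdf(times, t), 1.0 - math.exp(-t / tau))
```

The two printed columns should agree to within sampling noise, even though the randomness was put into the critical point rather than into a lifetime.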

Fixed critical point, variable rate: Now turn this case on its head and say that we have a fixed critical point and we have a maximum entropy variation in rate assuming some mean value, R.

p(r) = 1/R * exp(-r/R)

Then the cumulative integral looks like:

P(t) = integral of p(r) from r=cp/t to r=∞

Note carefully that the critical path in this case captures only the fastest rates and anything slower than the cp/t threshold won't get counted until later.

The result derives to

P(t) = exp(-cp/(R*t))

This has the characteristics of a fat-tail distribution because time goes into the denominator of the exponent, instead of the numerator. Physically, this means that we have very few instantaneously fast rates and many rates proceed slower than the mean.
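The same kind of numerical check works for this case. The sketch below draws a wear rate for each unit from an exponential distribution with mean R, holds cp fixed, and counts a unit as failed by time t once r*t reaches cp. The names, the seed, and the choice R = cp = 1 are assumptions for illustration only.

```python
import math
import random


def fraction_failed(rates, cp, t):
    """Fraction of units whose accumulated wear r * t has reached the fixed cp."""
    return sum(1 for r in rates if r * t >= cp) / len(rates)


rng = random.Random(1)
mean_rate, cp = 1.0, 1.0   # hypothetical values
# expovariate takes lambda = 1 / mean, so this samples rates with mean R.
rates = [rng.expovariate(1.0 / mean_rate) for _ in range(100_000)]
for t in (0.5, 1.0, 10.0, 100.0):
    print(t, fraction_failed(rates, cp, t), math.exp(-cp / (mean_rate * t)))
```

Even at t = 100, a hundred times cp/R, roughly 1% of the units have still not failed, which is the fat tail showing up in the counts.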

Variable wear rate, variable critical point: In a sense, the two preceding behaviors act complementary to each other. So we can also derive P(t) for the situation whereby both the rate and critical point vary.

P(t) = integral of P(t | r)*p(r) over all r

This results in the exponential-free cumulative, which has the form of an entroplet.

P(t) = (R*t/cp) / (1 + R*t/cp) = (t/τ) / (1 + t/τ)
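For readers who want the intermediate steps, here is one way to work the integral. It assumes that P(t | r) is the first-case result with the fixed rate R replaced by a particular rate r, and that p(r) is the exponential rate distribution from the second case.

```latex
\begin{aligned}
P(t) &= \int_0^\infty \left(1 - e^{-r t/\mathit{cp}}\right) \frac{1}{R}\, e^{-r/R}\, dr \\
     &= 1 - \frac{1}{R} \int_0^\infty e^{-r \left(t/\mathit{cp} + 1/R\right)}\, dr \\
     &= 1 - \frac{1/R}{t/\mathit{cp} + 1/R} \\
     &= \frac{R\,t/\mathit{cp}}{1 + R\,t/\mathit{cp}}
\end{aligned}
```

The exponentials cancel out in the last two steps, which is why the result contains no exponential at all.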

Plotting the three variations side-by-side and assuming that τ=1, we get the following set of cumulative failure distributions. The full variant nestles in between the two other exponential variants, so it retains the character of more early failures (à la the bathtub curve) yet it also shows a fat-tail so that failure-free operation can extend for longer periods of time.
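The three cumulative curves are easy to tabulate directly. The following is a small sketch with τ = 1; the function names are made up for this example, and the sample times are arbitrary.

```python
import math


def cdf_variable_critical_point(t, tau=1.0):
    """Fixed wear rate, variable critical point: 1 - exp(-t/tau)."""
    return 1.0 - math.exp(-t / tau)


def cdf_variable_rate(t, tau=1.0):
    """Fixed critical point, variable rate: exp(-tau/t), taken as 0 at t = 0."""
    return math.exp(-tau / t) if t > 0 else 0.0


def cdf_both_variable(t, tau=1.0):
    """Variable rate and variable critical point: (t/tau) / (1 + t/tau)."""
    x = t / tau
    return x / (1.0 + x)


print("t      var-cp   var-rate  both")
for t in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"{t:<6} {cdf_variable_critical_point(t):.4f}   "
          f"{cdf_variable_rate(t):.4f}    {cdf_both_variable(t):.4f}")
```

At every sample time the last column falls between the other two, matching the description of the full variant sitting between the two exponential variants.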

To understand what happens at a more intuitive level we define the fractional failure rate as

F(t) = (dP/dt) / (1 - P(t))

Analysts use this form since it makes it more amenable to predicting failures on populations of parts. The rate then applies only to how many remain in the population, and the ones that have failed drop out of the count.

Only the first case above gives a failure rate that matches the Markov ideal of constant rate over time. The other two dip below the constant Markov rate simply because the fat-tail cumulative requires finite integrability over the time scale, and so the rates necessarily stay lower.
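Applying the definition of F(t) to each of the three cumulative forms gives closed-form rates that can be compared directly. The sketch below uses τ = 1; the function names are hypothetical, and the variable-rate expression is the algebraic simplification of (τ/t^2) * exp(-τ/t) / (1 - exp(-τ/t)).

```python
import math


def hazard_variable_critical_point(t, tau=1.0):
    """F(t) for P = 1 - exp(-t/tau): the constant Markov rate 1/tau."""
    return 1.0 / tau


def hazard_variable_rate(t, tau=1.0):
    """F(t) for P = exp(-tau/t), simplified to tau / (t^2 * (exp(tau/t) - 1))."""
    # Valid for t > 0; very small t would overflow exp, so keep tau/t moderate.
    return tau / (t * t * math.expm1(tau / t))


def hazard_both_variable(t, tau=1.0):
    """F(t) for P = (t/tau) / (1 + t/tau): (1/tau) / (1 + t/tau)."""
    return (1.0 / tau) / (1.0 + t / tau)


for t in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(t,
          hazard_variable_critical_point(t),
          round(hazard_variable_rate(t), 4),
          round(hazard_both_variable(t), 4))
```

With τ = 1 the first column stays at 1 while the other two stay below it at every sample time, and both fall off roughly as 1/t at long times.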

Another post gives a full account of what happens when we generalize the first-order linear growth of the rate term, replacing the linear R*t/cp with a general function g(t). The full variant ultimately gives a failure rate of (dg/dt) / (1 + g(t)), so that if g(t) starts rising faster at later times we get the complete bathtub curve.
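To see how a bathtub shape can come out of dg/dt / (1+g(t)), here is a sketch with one possible growth function. The form of g(t) below (linear wear plus a late-starting exponential wear-out term) and the constants a and b are purely hypothetical choices for illustration; they are not taken from the other post.

```python
import math


def g(t, a=0.01, b=1.0):
    """Hypothetical growth term: linear wear plus an accelerating wear-out term."""
    return t + a * math.expm1(b * t)


def dg_dt(t, a=0.01, b=1.0):
    """Derivative of the hypothetical g(t) above."""
    return 1.0 + a * b * math.exp(b * t)


def hazard(t):
    """Failure rate of the full variant: (dg/dt) / (1 + g(t))."""
    return dg_dt(t) / (1.0 + g(t))


for t in (0.0, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0):
    print(t, round(hazard(t), 4))
```

With these assumed constants the rate starts high, sags through the middle of the range, and climbs again once the accelerating term dominates, which gives the two walls and the floor of the bathtub.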

If we don't invoke other time dependencies on the rate function g(t), we see how certain systems never show failures after an initial period. Think about it for a moment -- the fat-tails of the variable rate cases push the effective threshold for failure further and further into the future.

In effect, normalizing the failures in this way explains why some components have predictable unreliability, while other components can settle down and seemingly last forever after the initial transient.

I discovered that this paper by Pandey jibes with the way I think about the general problem.

Enjoy your popcorn, it should have popped by now.