Root Cause, Interactions, Robustness and Design of Experiments

Eric Budd asked on The W. Edwards Deming Institute group (LinkedIn broke the link with a register wall so I removed the link):

If observed performance/behavior in a system is a result of the interactions between components–and variation exists in those components–the best root cause explanation we might hope for is a description of the interactions and variation at a moment in time. How can we make such an explanation useful?

A single root cause is rare. Normally you can look at the question a bit differently, or see the scope a bit differently, and get a different “root cause.” In my opinion, “root cause” is more a decision about what is an effective way to improve the system right now than a scientifically valid finding.

Sometimes an obvious combination of conditions is the issue and must be prevented. In such a case I don’t think finding the interaction root cause is hard: just list the conditions and then design something to prevent them from occurring together in the future.

Often I think you may find that the results are not very robust, and this time we caught the failure because u = 11, x = 3, y = 4 and z = 1. But those knowledgeable about the process can tell you the results are not reliable unless x = 5 or 6. And if z is under 3, things are likely to go wrong. And if u is above 8 and x is below 5 and y is below 5, things are in trouble…
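As a minimal sketch, the rules of thumb above can be written down as explicit checks on an operating point. The function name and the wording of the warnings are my own for illustration; only the thresholds come from the example above.

```python
def risk_flags(u, x, y, z):
    """Return the rule-of-thumb warnings triggered by one operating point."""
    flags = []
    # Results are not reliable unless x = 5 or 6.
    if x not in (5, 6):
        flags.append("x outside 5-6: results not reliable")
    # If z is under 3, things are likely to go wrong.
    if z < 3:
        flags.append("z under 3: likely to go wrong")
    # The combined condition: u high while x and y are both low.
    if u > 8 and x < 5 and y < 5:
        flags.append("u above 8 with x and y below 5: trouble")
    return flags


# The failure that was caught: u = 11, x = 3, y = 4, z = 1.
observed = risk_flags(u=11, x=3, y=4, z=1)
for warning in observed:
    print(warning)
```

The observed failure trips every one of these checks at once, which is why picking a single “root cause” for it is more a decision than a discovery.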

To me this often amounts to designing systems to be robust and able to perform with the variation that is likely to happen. And for those areas where the system can’t be made robust to some variation, it means designing things so that variation doesn’t reach the system (mistake proofing processes, for example).

In order to deal with interactions, learn about interactions, and optimize the results that interactions make possible, I believe the best method is to use design of experiments (DoE), specifically factorial experiments.
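To make the idea concrete, here is a small sketch of a two-level full factorial experiment on three factors. The `response` function is a made-up stand-in for a real process (in practice you would run the process and record the result), and the factor names A, B, C are hypothetical. The point is only to show how a factorial layout lets you estimate an interaction effect that changing one factor at a time would miss.

```python
from itertools import product


def response(a, b, c):
    # Hypothetical process, invented for illustration:
    # A and B interact strongly; C has no effect at all.
    return 10 + 2 * a + b + 3 * a * b


# Full factorial: every combination of low (-1) and high (+1) for 3 factors.
runs = [(a, b, c, response(a, b, c)) for a, b, c in product((-1, 1), repeat=3)]


def effect(column):
    """Mean response where the contrast column is +1 minus mean where it is -1."""
    high = [y for *levels, y in runs if column(*levels) == 1]
    low = [y for *levels, y in runs if column(*levels) == -1]
    return sum(high) / len(high) - sum(low) / len(low)


effect_a = effect(lambda a, b, c: a)        # main effect of A
effect_ab = effect(lambda a, b, c: a * b)   # A x B interaction effect
effect_c = effect(lambda a, b, c: c)        # main effect of C

print("A:", effect_a, "AxB:", effect_ab, "C:", effect_c)
```

In this made-up process the A x B interaction effect is larger than the main effect of A, so studying A alone would give a misleading picture of how the system behaves.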


Evolutionary operation (EVOP) is also useful. Normally people using EVOP and DoE are seeking to optimize performance based on interactions rather than find systemic weaknesses that may be causes of future problems. But the ideas work fine for finding systemic risks (interaction ranges that are risky).

It seems to me that if we know a system is risky when several variables interact in certain ways, that tells us to design the system so it avoids those conditions, or even to change things so that the process eliminates the possibility.

So if we see that the risks of certain interactions are there, and we can’t see how to keep those conditions from happening with the current process, then we need a more radical change: perhaps eliminating one or more of the offending interactions altogether (with a new process that doesn’t include that factor).

So the descriptions that capture the problems found after certain interactions can lead us to explore the system for weaknesses in “nearby” interaction spaces. Sometimes it is truly just an odd case of very specific interactions that must all be at exactly those levels to cause a problem. Then the system is generally robust; it just has this one somewhat odd failure that happens when 4 factors are at exactly the conditions that interact poorly. In that case we just need to design a countermeasure that prevents that from happening.

More often we have likely discovered an area where the system is weak. Sometimes it is pretty easy to see the issue and we can do some simple PDSA improvements. Sometimes it is a bit more difficult to see how interactions will have an impact and we can then use DoE to find good solutions.

Related: Poor Results Should be Addressed by Improving the System Not Blaming Individuals - Find the Root Cause Instead of the Person to Blame - European Blackout: Blame a Person or Find a Cause in the System? - Jeff Bezos and Root Cause Analysis

I work for Hexawise, which deals with this interaction issue in a somewhat simplified arena (software testing). One nice thing is that this can make it a bit easier to understand, as interaction bugs in software much more often either cause a failure or don’t (they don’t have a partial impact that then has to be separated from other partial impacts to see what is really going on).

If you look at pairwise bugs (where the interaction between two factors causes a bug) and find something that is sometimes a problem and sometimes isn’t, then what is really going on is that there is a three-way (or higher) interaction. So the bug happens, for example, using

Windows and Firefox – but it worked sometimes and not others

So for example it may be that it works

Windows and Firefox and Flash

but fails with

Windows and Firefox without Flash

In such cases it is possible that it works with Windows and Flash and Chrome, and works with Mac and Firefox and Chrome, but fails with Windows and Firefox without Flash.
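The example above can be sketched in a few lines. The `page_loads` function is a hypothetical stand-in for the system under test, written so that it fails only for Windows and Firefox without Flash. Grouping the results by the (OS, browser) pair shows why the bug looks intermittent when only two factors are considered.

```python
from itertools import product


def page_loads(os_name, browser, flash):
    # Hypothetical system under test: fails only for
    # Windows + Firefox when Flash is not installed.
    return not (os_name == "Windows" and browser == "Firefox" and not flash)


# Run every combination of the three factors.
results = {
    combo: page_loads(*combo)
    for combo in product(["Windows", "Mac"], ["Firefox", "Chrome"], [True, False])
}

# Look only at the (OS, browser) pair and collect the outcomes seen for it.
pair_outcomes = {}
for (os_name, browser, flash), ok in results.items():
    pair_outcomes.setdefault((os_name, browser), set()).add(ok)

# A pair that both passes and fails is hiding a third factor.
intermittent_pairs = [p for p, seen in pair_outcomes.items() if seen == {True, False}]
failing = [combo for combo, ok in results.items() if not ok]

print("Looks intermittent as a pair:", intermittent_pairs)
print("Actual failing combination:", failing)
```

Viewed as a pair, Windows and Firefox passes sometimes and fails sometimes; once the third factor is included, the result is a clean pass or fail.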

With software it is often easier to track down “root causes” in the sense that things nearly always work or fail, while with many other processes things will degrade but may not have a binary pass/fail result.

Now software does partially make things more complicated through the sheer number of specific interactions that can cause problems. So it might be that it isn’t enough to test with parameter values like Windows, OS X or Linux; you may need to test specific instances of those parameter values (so you might have users with Windows 7, Windows 8, Windows 9, some tablet version of Windows, etc.).

Software that forces people to use the latest version as much as possible (such as the Chrome browser) helps narrow the interaction possibilities that are likely worth testing. But even so it is an enormous effort. That is one of the things that makes me very proud of what we offer with Hexawise, creating software testing plans to quickly discover bugs caused by interactions.
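To give a feel for the size of the problem, here is a rough sketch that compares testing every combination with a plan that only guarantees every pair of values appears together at least once. The parameter values are hypothetical, and the simple greedy selection is my own choice for illustration; it is not a description of how Hexawise or any other tool builds its plans.

```python
from itertools import combinations, product

# Hypothetical parameters and values, for illustration only.
parameters = {
    "os": ["Windows 7", "Windows 8", "OS X", "Linux"],
    "browser": ["Firefox", "Chrome", "Safari"],
    "flash": ["with Flash", "without Flash"],
}
names = list(parameters)


def pairs_in(test):
    """All pairs of (parameter, value) settings exercised together by one test."""
    return set(combinations(zip(names, test), 2))


# Exhaustive testing: every combination of every value.
all_tests = list(product(*parameters.values()))

# Every pair of values that needs to appear together in at least one test.
all_pairs = set().union(*(pairs_in(t) for t in all_tests))

# Greedy plan: keep picking the test that covers the most uncovered pairs.
uncovered = set(all_pairs)
plan = []
while uncovered:
    best = max(all_tests, key=lambda t: len(pairs_in(t) & uncovered))
    plan.append(best)
    uncovered -= pairs_in(best)

print("Exhaustive tests:", len(all_tests))
print("Tests in pairwise plan:", len(plan))
```

Even in this tiny example the pairwise plan needs fewer tests than the exhaustive list, and the gap grows quickly as parameters and values are added. As noted earlier, a pairwise plan will not reliably catch a bug that needs three specific values to appear together.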
