Editor's note: William M. Briggs is an adjunct professor of statistics at Cornell University, Ithaca, N.Y., and a consultant in marketing statistics.
Statistics isn’t as easy as it looks. Mastering the subject isn’t equivalent to “submitting the data to software.” From my perspective as a statistician, these are the top five mistakes I have seen marketers and researchers make. Do any of them seem familiar to you?
1. Asking too many questions
Data drives statistics: If there isn’t any, few questions can be answered. Yet too much data causes problems just as too little does. I don’t mean big data, that is, data so rich and plentiful it is difficult to handle in the usual manner. Too much bad data is what hurts.
Who’s been in a survey-design meeting where a client wants to know what makes his product popular, and everybody contributes a handful of questions they want asked? And those questions lead to more questions, which bring up still others.
The discussion ranges broadly: Everybody has an idea what might be important. A v.p. will say, “I feel we should ask, ‘Do you like the color blue?’,” while a rival v.p. will insist on, “About blue, do you not like it?” Gentle hints that one of these questions could and should be dropped might be taken as impolitic. The marketing analysis company, wanting to keep its contract, acquiesces.
Statisticians are rarely invited to these soirées, but if one were present he would insist that duplicate or near-duplicate data cannot provide additional insight but can cause the analysis to break or give absurd answers.
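To see why, consider a minimal sketch in Python (the survey data here is simulated and the question names are invented for illustration): two near-identical questions make the individual coefficient estimates wildly unstable, even though together they carry no more information than one question alone.

```python
# A hedged sketch of near-duplicate survey questions, on simulated data.
import numpy as np

rng = np.random.default_rng(42)
n = 200

# q1 drives the outcome; q2 is a near-duplicate (a rival v.p.'s
# rewording of the same question, answered almost identically).
q1 = rng.normal(size=n)
q2 = q1 + rng.normal(scale=0.01, size=n)      # nearly identical answers
outcome = 2.0 * q1 + rng.normal(size=n)

# Regression with both questions: the two coefficients are individually
# unstable (they can even take opposite signs), though they sum to
# roughly the true value of 2.
X = np.column_stack([np.ones(n), q1, q2])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(coefs)

# Drop the duplicate and the estimate is sensible and stable.
X1 = np.column_stack([np.ones(n), q1])
coefs1, *_ = np.linalg.lstsq(X1, outcome, rcond=None)
print(coefs1)   # intercept near 0, slope near 2
```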
If there is genuine uncertainty about a battery of questions, then a test survey should be run first. This trial analysis works out bugs and sets expectations. The process can be iterated until the suite of questions is manageable and there is a high likelihood that each piece of data will be useful. This also prevents situations where an analytical method has been promised but the survey design did not include the necessary questions (this often happens; see Mistake 5).
This simple yet rare procedure, if used routinely, would eliminate most of the mistakes listed below and save money in the long haul.
2. Failing to appreciate limitations
Not everything you want to know can be answered. The best brain, programming the fastest computer running the most sophisticated algorithm, can’t discover what isn’t there. Even if you ask Ph.D.s from the best universities or if you write large checks to a company with a reputation for doing the impossible.
Probability and statistical algorithms are not magic. Software spits out answers, but an answer is no guarantee the result is what you hope or believe it is.
Example: driver models, in which candidate drivers of some outcome are fed into an algorithm that orders them by importance and gives the strength of each. Now, clients often insist that each driver be positively associated with the outcome and that negative associations are either impossible or unacceptable. Pleas for positive “correlations” become so earnest that some analysts, concerned about their paychecks, give the client what he wishes.
But sometimes negative results which don’t make sense are still found. This usually means the wrong method of analysis has been used or, as in Mistake 1, too much bad data has been used.
Or it means that a driver has nothing to say about the outcome after all the other drivers have been taken into account. These superfluous drivers should be expunged from the model. But then comes politics: Whichever driver is tossed will be somebody’s favorite. What’s worrying is when, under pressure, statisticians “discover” ways to keep problematic drivers.
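A small simulated example (driver names and numbers invented) shows how this happens: a driver that is positively correlated with the outcome on its own can get a near-zero, even negative, coefficient once a stronger, correlated driver is accounted for.

```python
# A hedged sketch, on simulated data, of a "driver" losing its sign once
# a correlated driver enters the model.
import numpy as np

rng = np.random.default_rng(1)
n = 500
service = rng.normal(size=n)                               # service-quality ratings
price_sat = 0.8 * service + rng.normal(scale=0.6, size=n)  # correlated driver
intent = 1.5 * service + rng.normal(size=n)                # only service truly drives intent

# Alone, price_sat is positively correlated with purchase intent...
print(np.corrcoef(price_sat, intent)[0, 1])   # clearly positive

# ...but jointly, after service is accounted for, its coefficient is
# near zero and can come out slightly negative.
X = np.column_stack([np.ones(n), service, price_sat])
coefs, *_ = np.linalg.lstsq(X, intent, rcond=None)
print(coefs)   # [intercept, b_service close to 1.5, b_price_sat close to 0]
```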
Other common instances where a statistician is asked to “make it work” are when an old analysis doesn’t match a current one, when a decline in some measure “should be” an increase or when somebody doesn’t want to deliver bad news.
3. Not understanding regression
Regression or regression-like techniques are the backbone of marketing statistics. Yet most folks don’t have a good handle on their interpretation and limitations.
Here’s the setup: We have something we want explained, like customer purchase intent or money spent; any number. Call that number Y. It’s also called the outcome, or, in older terminology, the dependent variable.
We also have other data which we hope are probative of Y. These are called drivers or correlates, or, in the same old words, independent variables. Call this potential explanatory data X. Since we might have more than one piece of explanatory data, we call them X1, X2, and so forth.
You see equations written like this:
Y = b0 + b1 X1 + b2 X2 + …
where the ellipsis indicates we could go on adding terms and go on and go on some more – you get the idea. People who use regression certainly grasp this trick: They add terms like there’s no tomorrow, figuring, “Why not?” Because the equation is wrong, that’s why. Here’s the real math:
Y ~ N(b0 + b1 X1 + b2 X2 + …, s)
where the tilde indicates that it is our uncertainty in Y (and not Y itself) which is characterized by a normal distribution whose central parameter (which sets where the peak of the bell-shaped curve sits) depends on the values of the Xs. The “s” describes the width of the bell-shaped curve.
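A tiny sketch (the parameter values below are made up) makes the point concrete: given an X, the model describes a spread of possible Ys, not a single number.

```python
# What the model actually says: given X, Y is a distribution of
# plausible values, not one value. Hypothetical parameters throughout.
import numpy as np

b0, b1, s = 10.0, 2.0, 3.0       # made-up values for b0, b1 and the width s
x = 5.0                          # a particular value of the driver X1

rng = np.random.default_rng(0)
y_draws = rng.normal(loc=b0 + b1 * x, scale=s, size=10_000)

# The center of our uncertainty is b0 + b1*x = 20, but plausible Ys
# spread roughly from 14 to 26 (within about two widths s).
print(y_draws.mean(), y_draws.std())
```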
The b’s (b0, b1 and so on) are called parameters, coefficients or sometimes betas (they are occasionally written using the Greek alphabet). Inordinate interest is given to these creatures, as if they were the reason for regression. They are not.
It turns out that in classical statistics you can make guesses for the parameters (Bayesians do this less often). These guesses fascinate marketers in several ways, though they shouldn’t. Remember the intent of the model was that once we knew what value X took, then we would know the likely values – plural – Y might take. Who cares about a parameter? They can’t be seen, tasted or touched.
It’s rare to see uncertainty accompany parameter guesses but it should. Or ideally, as said, we should eschew the parameters altogether and speak of the relationship between the Xs and the subsequent uncertainty in Y. But the methods to do this (Bayesian predictive analytics) are not well-known.
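As a hedged illustration of that predictive way of speaking, here is a sketch on simulated data; the plug-in interval below ignores uncertainty in the parameter guesses themselves, which a fully Bayesian treatment would carry through. The point is what gets reported: a spread of new Ys at a given X, not a parameter.

```python
# Speak of new Ys, not parameters: fit a simple regression, then report
# a rough predictive interval for Y at a new X. Simulated data; the
# plug-in interval is an illustrative simplification.
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 10, size=n)
y = 10 + 2 * x + rng.normal(scale=3, size=n)

X = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
s_hat = resid.std(ddof=2)        # estimate of the width s

x_new = 5.0
center = coefs[0] + coefs[1] * x_new
print(f"new Y at x={x_new}: about {center:.1f} +/- {2 * s_hat:.1f}")
```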
Now you can appreciate Mistake 2 in more detail. Sticking dozens of Xs (drivers) into a regression equation, which is designed to say things about Xs and Ys and not about parameters, practically guarantees some parameter guesses will come out negative. Such is life.
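A quick simulation (everything here is invented) confirms it: hand a regression 30 drivers that are pure noise and, by chance alone, roughly half the fitted coefficients come out negative.

```python
# Thirty noise "drivers" and an unrelated outcome: chance alone makes
# about half the coefficients negative. Simulated data throughout.
import numpy as np

rng = np.random.default_rng(11)
n, k = 200, 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)           # outcome unrelated to any driver

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.sum(coefs[1:] < 0), "of", k, "drivers have negative coefficients")
```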
The discrepancies between understanding and usage occur everywhere, incidentally, not just regression.
4. Falling for the latest gee-whiz approach
Every year some new algorithm is touted which will solve all conceivable statistical problems. Remember neural nets? Genetic algorithms? How about partial least squares, permutation tests, support vector machines, trees, smoothing, machine learning, Bayesian nets, Markov chain Monte Carlo? Now it’s big data (which isn’t even a technique). Add your favorite to the list.
Once the new algorithm is released from academia into the wild, somebody invariably writes a hagiographical article which catches the imagination of marketers, who then beg statisticians to have the slick new wonder applied to their data. It doesn’t make any difference if the method is inappropriate, or that it is like applying a sledgehammer to a tack; the algorithm is hot, it’s sexy and it must be used.
Believe it or not, sometimes the best and fairest analysis is no analysis at all. Simple summaries and descriptions of data are often superior to the fanciest model. This is because statistical models are not meant to tell you about what you’ve already seen but what you will see in the future, given conditions are this or that.
We don’t need to model old data, we need to predict new data. We don’t need to guess (using p-values or hypothesis tests) whether this X is associated with that Y; we can just look. If the relationship is real, then the simplest model the situation allows will, given X, tell us the uncertainty in new Ys. In this way models can actually be validated with new data.
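Here is a minimal sketch of that validation idea (simulated data; the fit/holdout split and the two-widths check are illustrative choices, not a prescription):

```python
# Validate a simple model against data it has never seen. Simulated
# example: fit on 200 points, check predictions on 100 held-out points.
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 10, size=n)
y = 5 + 1.2 * x + rng.normal(scale=2, size=n)

X = np.column_stack([np.ones(n), x])
fit_X, fit_y = X[:200], y[:200]
new_X, new_y = X[200:], y[200:]

coefs, *_ = np.linalg.lstsq(fit_X, fit_y, rcond=None)
s_hat = (fit_y - fit_X @ coefs).std(ddof=2)

# The model is doing its job if about 95% of new Ys fall within two
# widths of its predictions.
pred = new_X @ coefs
coverage = np.mean(np.abs(new_y - pred) <= 2 * s_hat)
print(f"share of new Ys within +/- 2s of prediction: {coverage:.0%}")
```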
5. Not coming to a statistician (soon enough)
Too often statisticians are called at the same time coroners are called to murder scenes. What they can do at that point is the same, too: identify the cause of death.
I don’t want to hurt anybody’s feelings but the next topic is rather sensitive. Let me put it to you in the form of a question: Would you board a jumbo jet piloted by a man whose only experience comes from operating remote-control models? What if he learned his techniques from older experienced hobbyists? What if he possessed a certificate showing he knows all about model planes? What if he had a Ph.D. (proving his intelligence) in a subject not related to piloting? Still no?
I am anxious to agree that those who have had a statistics class or two from a psychologist while studying for their Ph.D. in the same subject may fully understand the complexities and nuances of probability, and may be just as facile with computation as any statistician, sometimes even more so.
But – and don’t get mad – it doesn’t happen that often. And just think: How many statisticians try to practice psychology, politics, sociology, etc., or all those other fields which contribute much to marketing science?