Editor's note: Steven Millman is senior vice president, research and operations, with research firm Dynata.
One of the most ubiquitous tools in the researcher’s toolkit is the test of statistical significance, more commonly known as the stat test. Stat tests are used in market research and in virtually every scientific discipline, showing up wherever a researcher needs to interpret results. It’s a powerful and flexible methodology when properly used but it’s also one of the most abused and poorly understood metrics in our industry. In this article, I’ll be discussing what tests of statistical significance are, what they really mean, how they are commonly misused and how to use them correctly for maximum value in your research. I’ve intentionally kept the math to a minimum to make the article accessible to a broad audience of practitioners.
How confident they should be
At the most basic level, a stat test is a form of hypothesis testing that helps a researcher determine how confident they should be that two numbers are different. Those numbers could be the difference in brand awareness between advertising-exposed and -unexposed individuals, the market share of different brands or whether sales have changed over time. In regression methodology, where we are looking for signs of a causal relationship, the difference tested is between zero (no effect) and the measured rate of increase or decrease (the slope) of the effect. Where we are confident that differences exist, we feel more comfortable making inferences about those relationships, which in turn supports data-driven decision making.
Because true census data are rare in market research, we rely on the examination of subsets of data in order to generalize to a population we cannot directly observe. It is not possible, of course, to ask every consumer in the United States for their opinion on a particular brand. Practical constraints such as time, cost, incomplete contact information and the like limit the number of persons who can be interviewed. Proper sampling techniques allow for the effective use of relatively small groups of representative individuals to help us understand the population of interest to us within a certain margin of error. Even in the hypothetical where we could ask everyone, there would still be measurement error. There will be individuals who refuse to answer, who are not yet sure, who will change their mind prior to a purchase decision or who will simply lie to the investigator. Investigators themselves are human and thus prone to error.
Stat testing helps us wade through these uncertain and messy collections of data to make sense of what might really be going on in the world. Along with a wide variety of other techniques (such as random sampling, weighting, non-response bias testing, etc.), these tests guide us on the extent to which we ought to draw inferences from our data. If the difference or effect is statistically significant, we feel more comfortable saying that the result is a function of some systematic underlying phenomenon – such as the advertising being effective at increasing awareness – rather than of a random process such as sample selection. If the stat test does not show a significant result, we view the relationship with skepticism.
Five categories of error
In practice, despite how common stat testing is in our industry, many people with the best of intentions set up or interpret these tests incorrectly. Over the last 20 years or so, I’ve found that most of these errors fall into one (or more) of five categories: misinterpretation of confidence levels; sample size; robustness; using the correct stat test; and overreliance on simplified tools.
Let’s consider them one by one.
Misinterpretation of confidence levels. One of the most important things to remember in stat testing is that the level of confidence used is arbitrary and selected by the researcher based on their needs. To properly interpret confidence levels, one first needs to understand the p-value, one of the key metrics stat testing produces. The p-value is a number between zero and one that indicates how likely it would be to see a result at least as extreme as the one actually found if there were truly no difference at all. That’s a mouthful. In simpler terms, the smaller the p-value, the less likely it is that your results are the product of randomness rather than a real effect. A p-value of 0.05, for example, means that if there were no real difference, a result this extreme or more so would turn up only about 5% of the time by chance alone. Another way to think about this is that at the 95% confidence level you are willing to accept about a 5% risk of being wrong about your inference. That’s not precisely how the math works but it’s a reasonable rule of thumb for thinking about it. The higher the confidence level, the more conservative the test is. In market research a 90% confidence level is generally typical for small-sample studies like most survey research. For larger data sets such as point-of-sale data, a 95% confidence level is more common.
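To make the mechanics concrete, here is a minimal sketch (in Python, using only the standard library) of the kind of two-proportion test that might sit behind an exposed-versus-unexposed awareness comparison. The respondent counts and awareness figures are hypothetical, chosen only for illustration.

```python
# A minimal sketch of a two-proportion z-test, the kind of stat test often used
# to compare brand awareness between ad-exposed and unexposed groups.
# The sample sizes and awareness counts below are hypothetical.
from math import sqrt, erf

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_proportion_test(aware_a, n_a, aware_b, n_b):
    """Return the z statistic and two-tailed p-value for a difference in proportions."""
    p_a, p_b = aware_a / n_a, aware_b / n_b
    pooled = (aware_a + aware_b) / (n_a + n_b)               # pooled proportion under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_a - p_b) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))                   # two-tailed p-value
    return z, p_value

# Hypothetical example: 180 of 400 exposed respondents aware vs. 150 of 400 unexposed.
z, p = two_proportion_test(180, 400, 150, 400)
print(f"z = {z:.2f}, p-value = {p:.3f}")
print("Significant at 90% confidence" if p < 0.10 else "Not significant at 90% confidence")
```

The smaller the p-value printed here, the less plausible it is that the observed awareness gap is the product of sampling randomness alone.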
A common mistake people make with confidence levels is that they accept levels of confidence too low for reasonable decision-making. Most statisticians and social scientists use the 95% confidence level and will begrudgingly accept a 90% confidence level in cases of small sample size. In market research, where 90% is generally the industry standard for survey work, it isn’t uncommon to see a result which is only significant at the 80% confidence level treated as though it constitutes some kind of trustworthy result. Consider for a moment what a result that only clears the 80% bar (a p-value of about 0.20) really means. What that stat test result is telling the researcher is that even if no real relationship existed, a result at least as strong as the one actually found would still turn up about 20% of the time by chance alone. At the 90% confidence level you’re willing to accept about a one-in-10 chance of being wrong about the nature of the relationship (again, not exactly how the math works but close). Should a researcher recommend making a large media investment or change in strategy based on a one-in-five chance of being wrong about the legitimacy of the result? It’s probably best to wait and get more data so that decisions can be made with clearer vision. I’ve even been asked for 70% confidence levels – at that point one may as well throw darts.
Because confidence levels are arbitrary by nature, it’s important to also remember that in a test at the 90% confidence level there’s no real difference between a result with a p-value of 0.099 and another with a p-value of 0.101. Looking at stat testing as a simple binary choice has become a bit of a crutch for some researchers who may never look at the underlying data. Researchers should always look at the p-values for this additional context and consumers of research should never feel shy about asking their vendors for them.
Finally, because stat tests are about confidence in relationships between variables we cannot directly observe, sometimes – out of sheer randomness – a stat test gives the wrong result altogether. These errors become rarer as the confidence level is set higher and sample size gets larger but can happen in any research regardless of sample size or how conservative the test. These spurious results need not be the result of any kind of mistake in the research design or collection; it’s simply the case that given enough random trials, every possible event, no matter how unlikely, will eventually occur. In simpler terms, weird stuff happens all the time. For instance, in a real-world example from my own life, about two-and-a-half years ago my son and I both broke the exact same bone in our feet about two hours apart on the same day in completely unrelated accidents. If that can happen, anything can happen. In statistics, you’ll hear these kinds of errors described as Type I and Type II errors. A Type I error is a false positive, in which the test leads you to conclude there is a real difference between groups when there is none; a Type II error is a false negative, in which the test leads you to believe there is no difference between groups when one actually exists. Type I and Type II errors are often the cause when bizarre results occur, such as ad exposure causing a statistically significant decline in awareness. Be particularly cautious about interpretation at small sample sizes, where these kinds of errors become more common.
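For readers who like to see the randomness at work, here is a small, entirely hypothetical simulation of Type I errors. Both groups are drawn from the same population, so any “significant” difference is a false positive; at a 90% confidence level roughly one trial in 10 produces one.

```python
# A hypothetical simulation showing why Type I errors are unavoidable: we run many
# "A/A" comparisons where both groups share the same true rate, so every
# significant result is a false positive.
import random
from math import sqrt, erf

random.seed(7)

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_proportion_p(x_a, n_a, x_b, n_b):
    """Two-tailed p-value for a difference in proportions."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (x_a / n_a - x_b / n_b) / se
    return 2 * (1 - normal_cdf(abs(z)))

trials, n, true_rate, alpha = 2000, 300, 0.40, 0.10
false_positives = 0
for _ in range(trials):
    # Both "test" and "control" share the same true awareness rate of 40%.
    x_a = sum(random.random() < true_rate for _ in range(n))
    x_b = sum(random.random() < true_rate for _ in range(n))
    if two_proportion_p(x_a, n, x_b, n) < alpha:
        false_positives += 1

# Expect roughly alpha (about 10%) of trials to come out "significant" by chance alone.
print(f"False positive rate: {false_positives / trials:.1%}")
```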
Sample size. There is a direct relationship between sample size and the sensitivity of a stat test. The larger the sample size, the more likely you are to see a positive test of statistical significance where a relationship actually exists. Conversely, the smaller the sample size, the less likely you are to find evidence of a statistical relationship. It is important to note that this relationship between sample size and test sensitivity is not linear. In low-sample studies, even small increases in sample will yield substantial improvements in sensitivity, whereas for very large-sample studies marginal gains in sensitivity take a lot more effort. As an example, a presidential approval poll with 100 respondents will have a margin of error of about +/- 10 points at the 95% confidence level. Doubling that to 200 respondents reduces the margin of error to about +/- 7 points. Compare that to a sample of 1,000 respondents at about +/- 3.1 points, which improves by less than a point (to +/- 2.2 points) when doubled to 2,000 respondents.
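Those polling figures follow directly from the standard margin-of-error approximation. A short sketch of that arithmetic is below, assuming the conservative 50% response split pollsters typically use.

```python
# A minimal sketch of the margin-of-error arithmetic behind the polling example above:
# the 95% margin of error for a proportion is roughly 1.96 * sqrt(p*(1-p)/n),
# which is widest at p = 0.5, the conservative assumption used here.
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error (in percentage points) for a sample of size n."""
    return 100 * z * sqrt(p * (1 - p) / n)

for n in (100, 200, 1000, 2000):
    print(f"n = {n:>4}: +/- {margin_of_error(n):.1f} points")

# Doubling a small sample (100 -> 200) shaves roughly 3 points off the margin,
# while doubling a large one (1,000 -> 2,000) gains less than a point.
```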
I mentioned earlier that it’s not a great idea to make too much of a stat test at the 80% confidence level but there’s one exception to this advice. When the sample is very small and the effect is large but the p-value falls just short of the 90% threshold, one could infer that, given more sample, a statistically valid relationship might emerge. That doesn’t mean it will, of course, but taken together these facts suggest that the negative stat test should not necessarily be taken at face value.
Robustness. Another important consideration in stat testing that is often overlooked is the robustness of the results. Suppose you see a statistically significant lift in purchase intent for the entire population you’re studying, for both men and women individually and for every age group except 35-55. A good researcher will look at the p-value for that one age group for more context. If it is close to but does not quite reach statistical significance, there’s a strong chance that this age cut reflects a systematic, non-random relationship as well, despite failing the stat test. Similarly, if the overall ad campaign shows no lift, and no lift among any subgroup but one, a smart researcher should be cautious about reading anything into that lone result. A robust result repeats itself over different cuts of the data and across time periods, and thinking properly about robustness allows you to better identify Type I and Type II errors in your data. Always try to look at your results as part of the larger picture to better understand each of the individual elements and keep sample size in mind as described earlier: narrower cuts of data naturally rely on considerably smaller sample sizes.
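As a hypothetical illustration of reading results across cuts, the sketch below applies the same lift test to the total sample and to progressively smaller subgroups. All counts are invented; the point is only that a similar lift can fail the stat test in a narrow cut simply because the base is smaller.

```python
# A hypothetical robustness check: apply the same lift test to the total sample
# and to narrower demographic cuts. The counts are made up; note how a comparable
# lift can fail the stat test in a small cut.
from math import sqrt, erf

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def lift_p_value(x_test, n_test, x_ctrl, n_ctrl):
    """Two-tailed p-value for the difference in purchase-intent proportions."""
    pooled = (x_test + x_ctrl) / (n_test + n_ctrl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_ctrl))
    z = (x_test / n_test - x_ctrl / n_ctrl) / se
    return 2 * (1 - normal_cdf(abs(z)))

# (cut, exposed intenders, exposed n, control intenders, control n) -- all hypothetical
cuts = [
    ("Total",     460, 1000, 400, 1000),
    ("Men",       235,  500, 205,  500),
    ("Women",     225,  500, 195,  500),
    ("Age 35-55",  70,  150,  60,  150),  # a similar lift, but a much smaller base
]
for name, x_t, n_t, x_c, n_c in cuts:
    p = lift_p_value(x_t, n_t, x_c, n_c)
    flag = "significant at 90%" if p < 0.10 else "not significant"
    print(f"{name:<10} lift = {100 * (x_t / n_t - x_c / n_c):+.1f} pts, p = {p:.3f} ({flag})")
```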
Using the correct stat test. Stat tests come in a variety of forms depending on the kind of data being used and the kind of relationship being evaluated. It’s generally best to consult with a statistician to ensure you’re using the right one, but one particularly common mistake is choosing a one-tailed test when a two-tailed test is called for, or vice versa. A two-tailed test should be chosen when the relationship being measured could be positive or negative. Most independent-means tests, like A/B testing or advertising effectiveness, can come out either significantly higher or lower than their point of comparison. Other kinds of relationships may only be relevant in one direction, for example the effectiveness of a painkiller in stopping a headache. In market research, advertising might have a positive or negative impact on brand favorability depending on the quality of the creative but can only have a non-negative impact on awareness; it’s difficult to imagine an ad so confusing or hypnotic that it would make a consumer forget a brand they used to know. Using the wrong test changes the threshold a result must clear and can lead you to overstate or understate its significance.
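The practical consequence is easy to see in a short sketch: for the same (hypothetical) test statistic, the two-tailed p-value is double the one-tailed p-value, so a borderline result can clear one threshold and miss the other.

```python
# A minimal sketch of how the one-tailed vs. two-tailed choice changes the p-value
# for the same data. For a symmetric test statistic, the two-tailed p-value is
# twice the one-tailed p-value. The z statistic below is hypothetical.
from math import sqrt, erf

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = 1.50  # hypothetical test statistic from, say, an awareness lift test
one_tailed = 1 - normal_cdf(z)              # only a positive lift counts as evidence
two_tailed = 2 * (1 - normal_cdf(abs(z)))   # a lift or a decline both count

print(f"one-tailed p = {one_tailed:.3f}")   # ~0.067: significant at 90%
print(f"two-tailed p = {two_tailed:.3f}")   # ~0.134: not significant at 90%
```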
Overreliance on simplified tools. There are many free online stat-testing calculators out there and they can be handy for understanding how sample size and confidence levels may affect hypothesis testing, but they can’t replace a proper statistical test that accounts for all of the various complications. Online calculators generally make a series of simplifying assumptions that will often be incompatible with your data. They do not take into account some vital influences on stat testing, including the effect of sample weighting, which changes the effective base size, or of extreme values. Stat testing behaves quite differently when the probabilities are either very low or very high, such as awareness for a new brand (near zero) versus a well-known brand like Pepsi (>95%). The underlying distribution of the data can also have a substantial impact on the accuracy of results. You get what you pay for, as they say, and you probably shouldn’t commit to a substantial change in a media buy based on the results of a free app.
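One of those complications, weighting, can be illustrated with Kish’s effective-sample-size approximation, which most free calculators ignore. The weights below are deliberately exaggerated hypotheticals to show how uneven weighting shrinks the effective base and widens the real margin of error.

```python
# A minimal sketch of why weighting matters: Kish's effective sample size,
# n_eff = (sum of weights)^2 / (sum of squared weights), shrinks as weights become
# more uneven, widening the true margin of error beyond what a calculator using
# the raw base size would report. The weights below are hypothetical.
from math import sqrt

def effective_sample_size(weights):
    """Kish approximation of the effective base size under weighting."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def margin_of_error(n, p=0.5, z=1.96):
    return 100 * z * sqrt(p * (1 - p) / n)

# 1,000 respondents, but half are weighted up and half weighted down.
weights = [1.8] * 500 + [0.2] * 500
n_raw = len(weights)
n_eff = effective_sample_size(weights)

print(f"raw n = {n_raw}, effective n = {n_eff:.0f}")
print(f"margin of error at raw n:       +/- {margin_of_error(n_raw):.1f} pts")
print(f"margin of error at effective n: +/- {margin_of_error(n_eff):.1f} pts")
```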
Become more fluent
Stat testing is a powerful tool in market research and as the availability of data continues its exponential growth, it’s more important than ever for professionals in our industry to become more fluent in its use. As an advisor to university market research programs, I have for years encouraged the development of more robust instruction on the interpretation of statistics for everyone in a marketing program, not just the market researchers, for exactly this reason. While most will never need to conduct such a test, everyone in the advertising and marketing ecosystem will need to be able to interpret these kinds of results at some point in their careers. Understanding what stat testing means (and doesn’t mean) is especially critical for senior leaders and other decision makers because of the impact on ROI and the success of their endeavors. A famous quote, attributed to Scottish poet and anthropologist Andrew Lang, says that some people “use statistics like a drunk man uses a lamppost – more for support than illumination.” We should all be striving to illuminate and a clearer understanding of stat testing is an important step towards that goal.