Editor's note: Jerry Thomas is president and CEO of Decision Analyst, an Arlington, Texas, research firm.
The founders of marketing research invented a number of extremely powerful and valuable tools, methods, questions and concepts that we all use and benefit from every single day. We are indebted to the originality, inventiveness and pioneering genius that founded and shaped our industry and its culture. Much of this founding work took place during the 1920s through the 1960s, and some of the research inventions occurred during the 1970s through the 1990s. But no one is perfect, and our industry's fathers and mothers committed sins that blight our industry to this day.
The first great sin is the top two-box percentage. Somewhere along the way, a founder developed the top two-box concept for questions with multiple positive responses. A good example is the five-point purchase intent scale: definitely buy, probably buy, might or might not buy, probably not buy, definitely not buy. If only the “definitely buy” answers are counted, the founders reasoned, information is lost.
What about the “probably buy” answers – shouldn’t they be counted, too? Hence, the top two-box solution came into being and the custom is to present the “definitely buy” percentage, followed by the top two-box percentage (“definitely buy” plus “probably buy”). Sounds perfectly reasonable, so where is the sin and shame?
The top two-box percentage counts a “definitely buy” the same (i.e., gives it the same weight) as a “probably buy,” when it’s blatantly obvious to everyone that a “probably buy” is not nearly as good as a “definitely buy” answer. For the five-point purchase scale above, the sin of counting a “definitely” and a “probably” as equals is, no doubt, a cardinal sin. If we were working with a nine-point, 10-point or 11-point scale, the top two-box percentage might only be a minor transgression. That is, on a longer scale, the difference in meaning between the top box and the second box is relatively small, so there is no great harm in adding the two together. On shorter scales, however, the distortion (and the sin) is usually much greater.
Back to the five-point purchase intent scale. A better solution is to count all of the “definitely buys” and then discount the “probably buys” by 40 percent, or 50 percent, or 60 percent, and add the “definitely buys” to the discounted (or down-weighted) “probably buys,” creating a weighted average that provides a more accurate measure of the results.
For example, if the “definitely buy” answers equaled 32 percent of respondents and the “probably buy” answers equaled 20 percent of respondents, a best practice is to count all of the top box (the 32 percent who said “definitely buy”) and, say, 50 percent of the second box (the 20 percent who said “probably buy”). That yields a purchase intent score of 42 (32 percent plus half of 20 percent). The result is called a score (not a percent) since we have created a hybrid number.
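The weighted-score arithmetic above can be sketched in a few lines of Python. The 50 percent discount on the second box is just one choice from the 40-to-60 percent range mentioned earlier; the function name and default weight here are illustrative, not an industry standard:

```python
def purchase_intent_score(definitely_pct, probably_pct, probably_weight=0.5):
    """Count all of the top box, discount the second box.

    definitely_pct  -- percent answering "definitely buy"
    probably_pct    -- percent answering "probably buy"
    probably_weight -- discount applied to the second box (assumed 0.5 here)
    """
    return definitely_pct + probably_weight * probably_pct

# The example from the text: 32 percent "definitely," 20 percent "probably."
score = purchase_intent_score(32, 20)
print(score)  # 42.0
```

Changing `probably_weight` to 0.4 or 0.6 shifts the score accordingly, which makes the analyst's discounting assumption explicit instead of buried in a top two-box total.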
The second great sin of our industry founders involves significance testing. There is little doubt that significance testing of critical decision statistics is valuable. For example, determining whether product blue is better than product red is a good application of significance testing. In the beginning, significance tests had to be calculated by hand, so only the most important results were subjected to significance testing. But the growing power of computers and the expanding availability of statistical software led to the automation of significance testing in crosstabulation tables. Thus, with a few programming scripts, thousands of significance tests could be automatically run on a set of crosstabs.
You could easily test rows of percentages against the adjoining rows, or test Column A against Column B. You could even determine if the differences between statistics in rows and/or columns were significant at the 90, 95 or 99 percent level – with some type of code letters, symbols or colors. The resulting significance assertions could then be incorporated easily into charts, graphs and written reports.
Some might hail the exhaustive use of significance testing as a great advance in our craft. However, I would argue that willy-nilly significance testing is a great waste of time and effort. Overuse of significance testing adds costs to the preparation of written reports, adds extra time in quality-assurance verification and actually increases the risks of errors in interpreting the survey results. If every number in a set of tables or a report is significance-tested, the analyst might avoid looking at the non-statistically significant results and thus overlook important findings and patterns in the data. If the analyst is overly focused on statistical significance, he or she often overlooks other types of significance or other signals in the data.
Mass use of significance testing adds a hodgepodge of confusing symbols and potential bias into survey results. Also, many of the “significantly different” indications will be false, based purely on chance variation. I have personally watched analysts overlook almost everything of importance in survey results because they were so focused on statistical significance that they were blind to everything else. A best practice is to use significance testing only on the one or two most important questions in the survey data.
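For the one or two questions that do merit testing, the workhorse behind most crosstab significance letters is a two-proportion z-test. A minimal sketch using only the Python standard library follows; the sample counts are hypothetical, chosen to echo the product blue versus product red example:

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1, x2 -- number of respondents choosing each product
    n1, n2 -- sample sizes for each group
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical data: 160 of 400 prefer product blue, 120 of 400 prefer red.
z, p = two_proportion_ztest(160, 400, 120, 400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Run once, on the decision statistic that matters, this is far more informative than a thousand automated letters scattered across a tab deck.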
The third great sin comes from type I and type II error in hypothesis testing. You can easily argue that the founders of the research industry stole type I error and type II error from the statistics world or academia (and should, therefore, be blameless), but why on earth would our industry founders steal something as confusing as type I error and type II error? Couldn’t they have stolen something more useful?
Can anyone remember which is which (false positive versus false negative) and exactly what the heck type I and type II mean? Maybe I’m just old and over the hill, but I have to do a Google search and study type I and type II error before ever attempting to actually use these concepts. And why are we only focusing on errors and not on truths?
If there are two types of error (false positive and false negative), then there must be two types of truth (true positive and true negative). Or are there more than two types of error and more than two types of truth? My head hurts. Let’s move on.
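For readers who, like me, have to look it up every time: the four outcomes line up on two axes – whether an effect really exists, and whether the test says it does. A small sketch (the function and its labels are mine, purely as a mnemonic):

```python
def classify(effect_exists, test_says_effect):
    """Map the two yes/no axes to the four hypothesis-test outcomes.

    effect_exists    -- the truth in the real world
    test_says_effect -- what the significance test concludes
    """
    if effect_exists and test_says_effect:
        return "true positive"
    if effect_exists and not test_says_effect:
        return "false negative (type II error)"
    if not effect_exists and test_says_effect:
        return "false positive (type I error)"
    return "true negative"

# The test cries "significant!" when nothing is really there:
print(classify(effect_exists=False, test_says_effect=True))
# -> false positive (type I error)
```

Two types of error, two types of truth – exactly the four cells, no more and no fewer.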
The fourth great sin is the so-called semantic differential scale. It was no doubt stolen from the psychology or sociology world, but again, why did our founders not have better judgment? Now, I’m not against stealing if you can do it in the dark of night and if it’s profitable, but I am firmly opposed to dumb stealing. Semantic differentials are usually some type of numeric scale (five, seven, nine, 10 or 11 points) with the endpoints anchored by two words with opposite meanings, such as love/hate, fast/slow, modern/old-fashioned and so on. The two words with opposite meanings are okay. It’s the long number scale in between that bothers me. What the heck does a 7.3 mean on a 10-point scale, or what does a 3.8 mean?
A better practice is short scales – true/false, yes/no, excellent/good/fair/poor and so forth – where each answer on the scale means something. It is much easier to explain true/false or yes/no answers to high-level executives than to explain a 7.2 on an 11-point scale. In general, the higher the executive’s level, the shorter and simpler the research results must be; and that is where the simple, short answer scales are at their very best. The older I get, the shorter my answer choices become.
Our research founders did not stop with the aforementioned sins but I do not wish to punish their collective reputations any further – since I’m one of them. They were a well-intentioned, studious lot and the useful tools they handed down to the current generation surely counterbalance some of their sins.