Editor's note: Paul Gurwitz is managing director of Renaissance Research Consulting, New York.
In terms of analysis, market research has come a long way in a short time. Many of the statistical tools and techniques in common use today had not even been invented twenty years ago, and even as recently as five years ago, much of the new research technology was scorned by a large number of practitioners as an impractical frill, rather than an integral part of most studies.
However, the progress made in research analysis has been a mixed blessing. In some cases, we have gone in a few years from one extreme ("Statistical analysis is a lot of hokum; I can analyze a study just as well using only cross-tabs!") to the other ("Statistical analysis is so easy, anybody can do it. You just shove the data disk into the computer and press a button!").
In the long run, the latter attitude is more dangerous than the former. Those who disdain advanced analytic techniques will often produce a limited analysis that does not make full use of the data they have, but at least they know what they have. Those who use multivariate techniques without being aware of the assumptions behind them may make fundamental misinterpretations of the results.
This problem is aggravated by the growing practice among some suppliers of "throwing in" multivariate analysis as a free bonus for conducting a study, sort of like the drinking glasses that gas stations used to give away with a fill-up. The availability of menu-driven statistical programs for microcomputers makes this sort of thing possible, but no amount of twisting and turning can make it good research.
The following is a list of questions that you, as the client, should ask the next time your research supplier generously offers to "run you a few multiple regressions for free." Satisfactory answers to these questions should set your mind at ease regarding the value of the free bonus you're getting. Less than satisfactory answers should start you worrying about it.
Question 1: Are the data being cleaned properly for this analysis?
Practically no data set ever comes from the tab house ready for multivariate analysis. Even if the data have been cleaned in the data collection process, there are still extra steps to perform before they are ready to use in, for example, a factor analysis.
One of the most common problems in this area involves numeric coding. Often the way a variable is coded by a tab house is perfectly all right for cross-tabs, but will cause trouble for a multivariate analysis. For example, coding a five-point likelihood scale with "1" as the highest point and "5" as the lowest is a common practice. Yet a technique like multiple regression assumes that a "5" is higher than a "1". If the variable is not recoded, and not everyone who reads the results is thoroughly familiar with the coding scheme, you might get what appear to be very strange results: a positive relationship, for instance, will show up with a negative sign.
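The recoding step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular package's recode command; the function name `reverse_code` and the sample answers are assumptions made up for the example.

```python
def reverse_code(value, low=1, high=5):
    """Flip a rating so that the top of the scale gets the highest number.

    On a 1-to-5 scale, 1 becomes 5, 2 becomes 4, and 3 stays 3.
    """
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return low + high - value


# Hypothetical purchase-likelihood answers, coded by the tab house
# with 1 = "definitely will buy" and 5 = "definitely will not buy".
as_coded = [1, 2, 5, 3, 1]

# After recoding, a higher number means a higher likelihood.
recoded = [reverse_code(v) for v in as_coded]
```

Checking for out-of-range values at the same time is a cheap way to catch "don't know" codes such as 9 before they are treated as real ratings.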
Question 2: How are missing data handled?
One of the most important differences between cross-tabular and multivariate analysis is in the handling of missing data ("don't know," "no answer," etc.). With cross-tabs, the problem is relatively simple: different types of missing data are given their own categories, and tabulated as separate stubs.
However, this cannot be done in multivariate analysis, because techniques such as regression and discriminant analysis need a valid value for every variable in the analysis. Instead, there are a number of useful strategies for dealing with missing data: for instance, substituting the variable mean, ascribing a value based on valid values of other variables, or using the existence of a missing value as a "check" variable to determine whether missing data might bias the analysis.
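Two of these strategies, mean substitution and the "check" variable, can be sketched together as follows. This is only an illustration under simple assumptions: missing answers are represented as `None`, and the function name `impute_with_flag` and the ratings are hypothetical.

```python
from statistics import mean


def impute_with_flag(values):
    """Replace missing answers (None) with the mean of the valid answers.

    Also returns a 0/1 "check" variable marking which answers were missing,
    so the analyst can test whether missingness itself relates to the outcome.
    """
    valid = [v for v in values if v is not None]
    if not valid:
        raise ValueError("no valid answers to compute a mean from")
    fill = mean(valid)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical ratings; None stands for "don't know" or "no answer".
ratings = [4.0, None, 2.0, 5.0, None, 1.0]
imputed, was_missing = impute_with_flag(ratings)
```

Mean substitution keeps every respondent in the analysis, but it also shrinks the variable's variance, which is one reason to carry the check variable along rather than impute silently.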
There are also other "strategies" for dealing with missing values that are, unfortunately, in common use. These approaches are most often the "default" choice of statistical packages; that is, the method chosen by the program if the operator simply "presses the button." Using either of the two described below will most likely do violence to your analysis.
The first of these is called listwise deletion: if a respondent has a missing value in any variable in the analysis, that respondent is simply dropped. There is an evident problem with this approach: since many multivariate analyses involve large numbers of variables, the probability of a respondent having missing information in at least one variable, and therefore being dropped, is quite high. This is the source of one of the most common problems in statistical analysis: analyses performed on a small (and usually biased) fragment of the sample, because the rest of the sample has been excluded for missing values on as few as one variable.
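A rough calculation shows how quickly listwise deletion eats into a sample. The sketch below assumes, purely for illustration, that every variable has the same missing rate and that missing answers occur independently of one another; real surveys rarely behave that way, so treat the figure as an order of magnitude rather than a prediction. The function name is hypothetical.

```python
def expected_retained_share(missing_rate, n_variables):
    """Share of respondents with a valid answer on every variable.

    Assumes each variable is missing independently with the same
    probability, which is a simplification made for illustration.
    """
    return (1.0 - missing_rate) ** n_variables


# Hypothetical study: 30 variables, each missing for 5% of respondents.
share_kept = expected_retained_share(0.05, 30)
sample_size = 1000
respondents_kept = round(sample_size * share_kept)
```

Under these assumptions only about one respondent in five survives, even though no single question has more than 5% missing answers.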
The second most common approach is called pairwise deletion. While the problems caused by this approach are not as obvious as those caused by listwise deletion, they can severely distort the interpretation of the results. This method relies on the fact that most multivariate analyses are based on a correlation matrix. When pairwise deletion is used, the program calculates each correlation in the matrix based on all respondents who have valid responses to the two variables being correlated. As a result, the analysis uses different sets of respondents when examining different variables. This can lead to extremely biased results, particularly when, as is often the case, missing data are not randomly distributed through the sample.
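The shifting base under pairwise deletion is easy to see by listing which respondents contribute to each correlation. The sketch below uses a tiny, made-up data set; the variable names and the helper `pairwise_bases` are assumptions for illustration, and `None` stands for a missing answer.

```python
from itertools import combinations

# Hypothetical ratings from five respondents (rows 0 through 4).
respondents = [
    {"price": 4, "quality": 5, "service": None},
    {"price": 3, "quality": None, "service": 2},
    {"price": None, "quality": 4, "service": 4},
    {"price": 5, "quality": 5, "service": 5},
    {"price": 2, "quality": 3, "service": None},
]


def pairwise_bases(rows):
    """For each pair of variables, list the rows valid on both variables."""
    variables = list(rows[0])
    bases = {}
    for a, b in combinations(variables, 2):
        bases[(a, b)] = [
            i for i, row in enumerate(rows)
            if row[a] is not None and row[b] is not None
        ]
    return bases


bases = pairwise_bases(respondents)

# Rows that listwise deletion would keep: valid on every variable.
complete_rows = [
    i for i, row in enumerate(respondents)
    if all(v is not None for v in row.values())
]
```

Here each of the three correlations rests on a different group of respondents, and only one respondent is common to all of them, so the resulting matrix does not describe any single group of people.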
Question 3: What program is the supplier using? Will it do what you think it will?
For example, when suppliers tell you, "I'll run you a cluster analysis," they really aren't telling you very much. There are four or five major statistical packages that perform cluster analysis, and countless smaller stand-alone programs as well. Each of them does something different when you use it to do "cluster analysis," and some of them do things that you probably wouldn't like if you knew about them.
A clustering routine in one popular statistical package works by passing through the data once, sorting respondents into homogeneous clusters based on the cluster means on each variable. The problem with this approach is that, when respondents are moved from cluster to cluster, the cluster means change, and the sort has to be repeated. In fact, the usefulness of most cluster solutions continues to improve significantly over ten or more passes through the data, so that a program that looks at the data once, and quits, is not likely to give you a solution you can really use. The approach that the supplier plans to use will have a great bearing on how valuable the results will be. Make sure that you agree with it.
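The difference between one pass and repeated passes can be illustrated with a small sketch. It is not the routine from any particular package; it is a generic assign-then-recompute loop (in the style of k-means) on a single made-up variable, and the function names and starting means are assumptions chosen for the example.

```python
def nearest(point, centers):
    """Index of the cluster mean closest to this respondent's score."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def cluster(points, start_centers, max_passes):
    """Sort points into clusters, recomputing means after each pass.

    Stops early once a pass leaves the cluster means unchanged.
    Assumes max_passes is at least 1.
    """
    centers = list(start_centers)
    labels = []
    for passes in range(1, max_passes + 1):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old mean if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers, passes


# Hypothetical scores with two obvious groups: low (1-3) and high (10-12).
scores = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
poor_start = [1.0, 2.0]

one_pass = cluster(scores, poor_start, max_passes=1)
many_passes = cluster(scores, poor_start, max_passes=20)
```

With this starting point, the single pass lumps five of the six respondents into one cluster, while the repeated version settles on the two natural groups by its third pass.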
Question 4: What decisions will be made about how your analysis will be done? Would you agree with them?
In contrast to a cross-tab, which has a fairly standard procedure, most multivariate analyses involve a series of choices of method. For instance, there are at least five different ways to perform a factor analysis, and seven methods for rotating the results afterward.
Every statistical package will make these decisions itself if the operator does not explicitly tell it otherwise. Generally, these choices, known as defaults, were made by those who wrote the program to provide an analysis that is suited to the "average" problem. This is something like buying a "one-size-fits-all" dress: it may generally cover the area, but it won't be "you." In order for an analysis to fit your particular problem, care has to be taken in designing it, and that takes time.
Question 5: What will the final product look like? Will you understand it? Be able to use it?
The raw output of many statistical packages is designed for statisticians, not market researchers. It often consists of poorly annotated lists of numbers, unlabeled graphs, and assorted hieroglyphics. All of which is fine, if you're accustomed to this style of presentation, and can read and interpret it.
If not (and this is usually the case), be sure that your supplier either "throws in" a detailed discussion of the findings, written in English, or agrees to be available to interpret the output for you. Otherwise, you may end up with a two-inch thick sheaf of computer paper that your client will find quite unimpressive.
Question 6: Is your supplier willing to repeat the analysis, as necessary?
One of the real advantages of multivariate analysis is that its speed allows the researcher to "ask questions of the data." This usually involves repeated runs, modifying each successive analysis to answer the questions posed by the previous one. This interactive approach permits much more intensive and creative use of your data set than was ever possible before.
However, many suppliers, in offering to "run a regression," mean exactly that: one regression. This could conceivably be all you need in a certain situation, but more likely it will simply whet your appetite for further analysis. In that case, will your vendor still be willing to offer "free bonuses"? Make sure that your supplier is willing to be a partner in a dialogue with the data, not simply a space-age tab house.
In nine cases out of ten, the answers you get to these questions will confirm the old adage, "You get what you pay for." Statistical analysis is a real advance in marketing research, but only when all concerned take the time and trouble to make sure that it is used to best advantage. Analyses performed as a "free bonus" usually reflect the small investment of time and care taken to do them, and produce results that are at best meaningless, and at worst misleading.
By contrast, multivariate analyses conducted systematically, by people who know the issues and take the time to consider the options, will often yield new insights into your problem. Because of the training required and the time involved, this approach is unlikely to be available for free, but it will more than pay off in results.