College at Georgia Institute of Technology. His previous marketing research experience includes several years with Sophisticated Data Research and Burke Marketing Research. He has also taught at the University of Michigan and the University of Cincinnati. A reformed theoretical statistician, he is the author of several articles on statistics as applied to marketing research and is an active presenter at meetings of various professional societies.
In the June/July 1988 Data Use column, Mike Baumgardner and Ron Tatham discussed how to handle the data from respondents who had no preference in simple paired comparison tests of preference. They wisely stated that they felt "uncomfortable giving a consumer a response that we (made) up." However, not all of their peers in marketing research and data processing agree—a point I'll return to later.
How, then, is one to handle a data set in which a number of questionnaires have item non-response? The problem is somewhat different than that faced above; there, the respondents were explicitly allowed to state that they didn't really prefer one product to another. Here I'm talking about the instances where particular questions are left totally blank or the appropriate "no answer" code is marked. Now what?
Is it a problem? Most assuredly. Anyone who has worked in marketing research for more than five minutes realizes that not all respondents answer all questions. There are several reasons for this. Some of the information is potentially threatening to the respondent, such as income, alcohol usage, age, and the like. Sometimes they have never heard of a particular brand and, thus, can't really rate it on several attributes, even though we ask them nicely to do so. Sometimes they may not know the meaning of one or two of your laundry list of attributes, but will gladly do the ratings for the rest. Motivation aside, it's a small proportion of respondents who answer every question on any given study.
So how are you, the analyst, to do a reasonable data tabulation and statistical analysis of your study, given this missing data (an oxymoron if ever there was one) problem? Unfortunately, there is no universal answer. To say "it depends" sounds like a cop-out, but it truly does depend on your computer hardware, your computer software, how you've coded the "no answer" from the questionnaire and, as always, the objectives of the study. Some of these will be discussed below. However, until some mysterious force compels every respondent to answer every question (honestly, we should add), there seems to be no panacea.
Tabulation and summary statistics
Here's where evidence of the problem may first be seen. Things don't always add up. You know that you have 500 total respondents, but when you add back those who use Brand A most often to those who use Brand B most often, to those who use any other brand most often, you only come up with 493. The rest, of course, didn't answer the question. Run the same data tabulation on someone else's data processing system, however, and the missing respondents may, in fact, magically reappear. What's going on?
Not all computer programs handle missing values the same way. You need to explicitly tell some of them what's going on, others have some built-in assumptions. For instance, one practice is to leave a blank in the card column where there was a "no answer." However, some hardware/software combinations treat that blank as a numeric zero and also count the respondent.
Thus, if you are using this type of system, all of your missing values are automatically assumed to be the number zero and you'll have no missing responses at all. But think for a minute what effect this will have on your sample mean ounces consumed per week, or sample median household income category, or even, the proportion of respondents who prefer Brand A, since the new base is total respondents, not just those who made a choice.
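The arithmetic here is easy to see in a toy sketch. The numbers below are invented for illustration, but the two "systems" mirror the behavior just described: one reads a blank as a numeric zero and keeps the respondent, the other drops the blank and shrinks the base.

```python
# Hypothetical ratings from 10 respondents; None marks a "no answer."
ratings = [6, 7, None, 5, 8, None, 7, 6, None, 9]

# System A: blanks silently become zeros and every respondent counts.
as_zeros = [r if r is not None else 0 for r in ratings]
mean_with_zeros = sum(as_zeros) / len(as_zeros)      # 48 / 10 = 4.8

# System B: blanks are dropped and the base shrinks to answerers only.
answered = [r for r in ratings if r is not None]
mean_excluding = sum(answered) / len(answered)       # 48 / 7 = 6.857...

print(mean_with_zeros, mean_excluding)
```

Same 500-style data set, two defensible-looking means, and the one built on zeros is badly deflated.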
Other programs automatically drop a blank from further consideration and, at the same time, make certain that the respondent base size is handled accordingly. Thus, unless you know which you are using, or is being used for you, it's easy to see how two analysts looking at the same data set can come up with completely different sets of crosstabulations and summary statistics.
The other thing you need to be aware of is that in some studies a "no answer" is coded as a "99" or "999," or some other value that is far out of range of the usual answer (how’d you like to try to put 99 kids through college?). Whenever you see a sample mean that is far out of line from what you expected, you should suspect that this type of coding might possibly be the cause.
Again, unless you explicitly tell the software package that such a code is to be disregarded as missing, it will be counted and added and averaged right along with the rest. Computers, alas, do exactly what we tell them and not what we necessarily want them to do. Clearly, communication with the data tabulation system is a must.
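A sketch of the 99-kids-through-college problem, with made-up data: until you declare the code as missing, the out-of-range value is averaged right in, and the wildly inflated mean is your warning sign.

```python
# "Number of children," with 99 used as the no-answer code (invented data).
kids = [2, 1, 99, 0, 3, 99, 2]

naive_mean = sum(kids) / len(kids)    # 206 / 7 = 29.43 -- the giveaway

# After telling the software that 99 means "missing":
valid = [k for k in kids if k != 99]
clean_mean = sum(valid) / len(valid)  # 8 / 5 = 1.6

print(naive_mean, clean_mean)
```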
Other statistical problems
Here, too, you need to know your system. Depending on the hardware, there is at least one major heavy-duty statistical software system that treats blanks as if they were all numeric zeroes—you lose no responses and, of course, no respondents, if you merely leave a missing value as blank during the data entry stage. At least one other software package treats a blank as a missing value for that particular case only. Another package will discard ALL information from a respondent if they show a blank for only ONE question for some procedures, and won't for others. Does it make a difference? You bet.
A simple example
Assume that you've left all of the "no answers" blank (you know what will happen if a 99 or other numeric code is punched in; if you did code it that way, most packages will let you "blank" it by explicitly declaring 99 as a missing value). Let's look at the case where respondents are comparing two products on several rating scales and you want to find out for which of these scales the products differ significantly. The first package, above, will put in zeroes wherever it sees a blank and make the comparisons as if every respondent had answered all questions for both products. It's easy to see how this would both deflate the sample means and invalidate the statistical comparisons which you are doing.
The other two packages would both reduce the data base to only those respondents who rated both products on a given scale. Thus, for the first rating scale you might see the 501 respondents who evaluated both products, for the second, the 505 who evaluated both, and so on. For each scale it may be a substantially different set of respondents that are used to make the statistical comparison—this may or may not be worrisome to you. One thing is for sure: you'll end up with fewer pairs of evaluations for each scale than you have total respondents in the study. It shouldn't take much of a stretch of the imagination to see what would happen if the comparisons involved three or four or more products or brands. (For only two such comparisons, we could open the Pandora's box of overlapping samples; but since it seems to be on none of the standard computer software packages, let's not.)
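The shifting base is easy to demonstrate on a tiny invented data set: three respondents, two products, two scales. The usable base for each comparison is just the set of people who rated both products on that scale, and it changes from scale to scale.

```python
# Three respondents rating two products on two scales; None = no answer.
product_a = {"taste": [7, None, 6], "value": [5, 6, None]}
product_b = {"taste": [6, 5, 8],    "value": [4, None, 7]}

usable_pairs = {}
for scale in ("taste", "value"):
    pairs = [(a, b) for a, b in zip(product_a[scale], product_b[scale])
             if a is not None and b is not None]
    usable_pairs[scale] = len(pairs)

print(usable_pairs)   # {'taste': 2, 'value': 1}
```

Respondents 1 and 3 back the "taste" comparison; only respondent 1 backs "value." A different subset of people stands behind each test, and every base is smaller than the total sample.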
It gets even worse for some of the more complex multivariate procedures. Again, there are packages that will read in the blanks as zeroes and you lose no one at all. Of course, the results are not worth the paper they're printed on. (No problem, either, if you use something like a 99 for no answers and forget to tell the computer. Your results are worth exactly as much as the previous.)
For procedures such as regression analysis and discriminant analysis, the theoretical assumption is that each respondent answered every question currently being analyzed. Many packages drop ALL responses from any respondent who fails to answer even a single question; the respondent becomes a nonentity. It's very disconcerting, but not uncommon, to request a regression analysis using 60 or 70 variables and have the computer print a message that you don't have any respondents. What it means is that none of your respondents have answered all of the questions in the analysis. So, if the blanks are treated as zeroes, you get worthless results; if they cause the respondent to be dropped, you get no results at all.
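Why does asking for 60 or 70 variables leave you with nobody? A little simulation makes the point. Suppose each of 500 respondents answers each of 60 questions independently with probability 0.95—generous by survey standards. The chance that any one respondent answers all 60 is 0.95 to the 60th power, about 4.6 percent, so only a couple dozen complete cases survive listwise deletion; the exact count below varies with the random draw, but it is never many.

```python
import random
random.seed(1)

# 500 simulated respondents, 60 questions, each answered with prob. 0.95.
n_resp, n_vars, p_answer = 500, 60, 0.95
data = [[random.random() < p_answer for _ in range(n_vars)]
        for _ in range(n_resp)]

complete = sum(all(row) for row in data)
print("complete cases out of", n_resp, ":", complete)
# Expected share: 0.95 ** 60, roughly 4.6% -- a few more variables or a
# little more non-response and the count drops to zero.
```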
Discarding all respondent information as in the cases above is called either casewise or listwise deletion. So if you take the same data set and run it through two programs, or if two analysts run the data through any program that causes blanks to trigger listwise deletion, you'll get comparable answers for any of these multivariate procedures, right? Wrong!
Or right, to an extent. They'll give you the same results for regression and discriminant analysis. However, let's assume that two analysts are going to do a factor analysis and one of them uses pairwise deletion of missing data and the other uses casewise.
Pairwise means that if a respondent answers, say, 9 out of 10 questions that you're going to factor analyze, her answers can be used to impact 36 out of the 45 correlation coefficients that go into the factoring procedure. Listwise deletion means that all of her answers are discarded and she's essentially treated as if she didn't participate in the survey at all. Now we will probably get factor results which are not at all comparable. Again, you need to know how your data are handled.
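The combinatorics check out in a few lines. Ten questions yield 45 distinct pairs of variables; skip one question and the pairs among the nine answered questions number 36, which is exactly what pairwise deletion lets her contribute to. (The question names are placeholders.)

```python
from itertools import combinations

# 10 attribute questions; one respondent skipped "q10" (illustrative).
questions = [f"q{i}" for i in range(1, 11)]
answered = set(questions) - {"q10"}

all_pairs = list(combinations(questions, 2))
usable = [p for p in all_pairs if set(p) <= answered]

print(len(all_pairs), len(usable))   # 45 36
```

Pairwise deletion keeps her in those 36 correlations; listwise deletion throws away her contribution to all 45, along with the rest of her data.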
What can be done? First, you need to know how your analyses will be affected by blanks, 99, or whatever codes you use for missing data. Ask questions and assume nothing. It may mean that you'll have to delve into a computer manual to find the answer or even have to talk to those guys in the basement who dress funny but sure do a great job of keeping the computers going. If you're using an outside source for data processing, be sure to tell them explicitly how you have coded the no answers and how you want them handled.
Second, although I agree with Baumgardner and Tatham about putting words into respondents' mouths, you should recognize that there are a few procedures around for imputation of a numerical value for any missing data. Some software packages recommend that you do so. In fact, one manual goes so far as to state that, "If you have missing values, you will do better taking the time to impute (replace) them in the raw data by examining similar cases or variables with non-missing values." My own thinking on this issue is that if the respondent had wanted to or could have given you an answer, she would have. Imputation is not without risk.
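For the record, the simplest such scheme is mean imputation: fill each blank with the average of the observed answers. The sketch below uses invented income figures and is only one of several approaches (examining similar cases, as the manual quoted above suggests, is another); it illustrates the mechanics, not an endorsement.

```python
# Illustrative incomes in $000s; None = no answer.
incomes = [40, 55, None, 62, None, 48]

observed = [x for x in incomes if x is not None]
fill = sum(observed) / len(observed)         # 205 / 4 = 51.25

# Mean imputation: every blank gets the observed mean.
imputed = [x if x is not None else fill for x in incomes]
print(imputed)
```

Note what this buys and what it costs: you keep the full base, but every imputed respondent sits exactly at the mean, understating the true variability—one concrete form of the risk mentioned above.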
Third, use the amount of item non-response as a positive indicator. If there is a particular statement or set of statements that are causing 90 percent of the item non-response, then perhaps deleting only those few can straighten out your regression analysis. Of course, always report the number of "no answers" for each question tabulated.
Finally, recognize that for some statistical procedures, "no answer" is an answer. One is correspondence analysis, in which the "no answer" categories can be used in both perceptual mapping and cluster analysis.
As with so much in this life, the key is effective communication between all of the concerned parties—the respondent, the analyst, the project director, the computer, and you.