Editor's note: Gary Mullet, Ph.D., is president of Gary Mullet Associates, Lawrenceville, Ga.
The topics covered in this column over the past several months have ranged from elementary to esoteric. The articles have been almost universally well-written and generally immediately useful. Several have touched on the basic concepts of statistical analysis, one or two invoked simulation and more than one addressed the sometimes arcane vocabulary that many statisticians employ. In what follows, the results of several computer-generated simulations will be used to tie the package of basic statistical significance testing ideas with a tighter knot, using intuition as the string which will, we hope, hold things together.
For our purposes, random numbers were generated to represent the integer responses that we typically gather in marketing research surveys. For each scenario, two independent samples of 500 responses were compared. The comparisons were repeated 2,000 times for each of the sampling situations reported. Note that we could equally well have used dependent comparisons, paired-preference or any of the other measures which are typically gathered in survey research. Also note that the two samples would not have to be of equal size; equal samples were used more for convenience than anything.
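The article doesn't spell out the distribution behind those random numbers, but a minimal Python/numpy sketch of the same kind of setup - two samples of 500 integer ratings on a hypothetical 5-point scale, built from a binomial draw - might look like this (the helper draw_ratings and the seed are inventions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded so reruns match

N = 500        # respondents per group
REPS = 2000    # replications per scenario

def draw_ratings(mean_shift=0.0, n=N):
    """Integer ratings on a 1-5 scale, built as 1 + Binomial(4, p).
    With p = 0.5 the population mean is 3; raising p by mean_shift/4
    raises the population mean by mean_shift."""
    return 1 + rng.binomial(4, 0.5 + mean_shift / 4, size=n)

males = draw_ratings()
females = draw_ratings()
print(males.mean(), females.mean())   # close, but almost never exactly equal
```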
Let's start with the simplest case, where the two population means are equal. Obviously, in a survey situation we wouldn't know whether the two populations being compared have equal means or not - if we did know, then we wouldn't have to do any statistical significance testing. The situation could be like this: everyone in a survey is asked to rate their overall liking of a concept on an integer-valued 5- or 7- or somesuch-point scale. We then want to examine the mean responses of the males and the mean responses of the females in the survey, to see whether or not they liked the concept equally. It can't be overemphasized that if we really knew that the populations had equal means (or unequal, for that matter) we would not do the significance testing.
Most computer packages doing the significance testing do exactly as you were taught in the basic statistics class that you wrestled with. They assume that the population means are equal, churn the numbers, and print a statistic that lets you decide whether the "theory" of equal population means is tenable or not, given the sample results. The statistic, in this case, is an independent groups/samples t-statistic, which, because we have "large" sample sizes, may show up as a Z-statistic. SPSS and other programs print results for both "equal population variances" and "unequal population variances" (although they are not necessarily called this explicitly). Generally, the decision relative to the sample means will be the same, irrespective of your assumption about the population variances. Everything reported below will use the "equal population variance" testing of the means, since, by the way the samples were generated, the population variances really were equal.
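For readers who want to try this outside SPSS, scipy's independent-samples t-test reports either version; continuing the hypothetical sketch above, equal_var=True gives the classic pooled test and equal_var=False the Welch test:

```python
from scipy import stats

# Classic pooled test, assuming equal population variances...
t_eq, p_eq = stats.ttest_ind(females, males, equal_var=True)
# ...and the Welch version, which does not assume equal variances.
t_w, p_w = stats.ttest_ind(females, males, equal_var=False)

print(f"equal variances assumed: t = {t_eq:.3f}, two-sided p = {p_eq:.4f}")
print(f"variances not assumed:   t = {t_w:.3f}, two-sided p = {p_w:.4f}")
```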
Now, a digression into the jargon area. Note that we are comparing answers to ratings from integer (and finite) scales. The idea of normality, which many remember as underlying the t-or-Z-statistic, does not refer to the scale values themselves but to the items being compared, the sample means (or, even better, the sampling distribution of the differences in sample means). Why? Because the Central Limit Theorem assures us that such normality holds, at least approximately, with samples as large as ours.
We have two more statistical jargon bears to wrestle. First, significance level. As has been noted by others in this column and elsewhere, the statistical significance level, α, is the probability of declaring the population means to be unequal, given the sample evidence, when in fact they are equal. Now, you can appreciate two things. First, redundantly, in a real research situation, you really won't know whether or not the population means are equal. Thus, you probably set α, your risk of falsely declaring a difference when really there is none, at a relatively low level, something like .05 or .10. Next, in our first sampling situation with 2,000 pairs of samples from populations with known equal means, we can see how well this common t-test works, or doesn't work.
We'll use both .05 and .10 for illustrative purposes, below.
Bear number two is deciding whether we want to use a two-sided or one-sided significance test. The former is for when we really don't have a feel for which, if either, population mean should be larger if they are not equal. The latter is used when you want to be able to say something like "tests prove that females like this stuff better than males" (you sure wouldn't want to hire me as your copywriter). That is, before the test is run, you have a feeling that one of the user groups should have a higher mean rating than the other. Let's arbitrarily settle on a one-sided alternative - females should rate the concept higher in all of our simulations, if there is any difference at all.
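Continuing the same sketch, recent versions of scipy let you ask for that one-sided alternative directly (with older versions, the usual workaround is to halve the two-sided p-value when the t-statistic has the expected sign):

```python
# One-sided alternative: females rate the concept higher than males.
t_stat, p_one = stats.ttest_ind(females, males, equal_var=True,
                                alternative='greater')
print(f"t = {t_stat:.3f}, one-sided p = {p_one:.4f}")
```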
Now some intuition. I doubt if anyone would think that for all 2,000 pairs of samples the means will be exactly equal each and every time, even though they were generated to be equal. They should be close, sure, but every once in a while, due to that undefined term "sampling error," they will be far enough apart that we'd say, "Whoa! It looks like our sample of females came from a population with a mean greater than the mean of the population whence came the sample of males" - and, of course, we'd be wrong. The question is: how often is "every once in a while?" The answer, which you knew all along, is either .05 or .10, depending on which value we selected as our significance level.
So how'd we do? Not too badly, as a matter of fact. For the case where the two population means are really equal, at α = .10 the t-statistic indicates that the mean for females is greater than the mean for males (shorthand jargon for a more statistically correct statement about the population means) 221 times, or 221/2,000 = 11.05 percent. If we tighten the screws on the required evidence to α = .05, our 2,000 samples yield a wrong conclusion 116/2,000 = 5.8 percent of the time. Not too shabby. Our observed error rate is slightly higher than the nominal rate, but we only ran this process 2,000 times, not the infinite number of times that statistical theorists refer to (which brings to mind the old joke about sentences ending in prepositions, which can't be repeated on these pages). So, depending on the significance level, we see that the number of false positives is about where it should be. One way, then, of looking at a significance level is as the long-run percentage of times when you are willing to say that the females like the concept better than the males do, when in reality they are at parity in the ratings. (Aha! Maybe rather than selecting the usual textbook significance level of .05 or the "way we've always done things here," we should consider the consequences of the false positives. Clearly, the monetary consequences should be factored in before we decide our tolerance for these false positives.)
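The original simulation code isn't shown, but a rough reconstruction under the hypothetical setup above is just a loop that draws, tests and counts rejections; the observed rates will wobble around the nominal α from run to run:

```python
def false_positive_rate(alpha, reps=REPS):
    """Share of replications that wrongly declare 'females > males'
    when both groups come from the same population (true difference = 0)."""
    hits = 0
    for _ in range(reps):
        m = draw_ratings()
        f = draw_ratings()
        _, p = stats.ttest_ind(f, m, equal_var=True, alternative='greater')
        hits += p < alpha
    return hits / reps

for alpha in (0.10, 0.05):
    print(f"alpha = {alpha:.2f}: observed rate = {false_positive_rate(alpha):.3f}")
```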
Summing up, what we've seen so far (and is reiterated in Table 1 below) is that the simple statistical t-test for comparing two independent means works about as it should. We find pretty close to the expected number of false positives. You may rest assured that simulations of other common marketing research situations, such as picking one preferred product from three or whatever, would present us with similar "expected" results. Now let's turn the coin over and do some more simulating.
Let's assume that, unknown to us of course, the mean for all females in the population (not the sample of females) is really .1 higher than the mean for all males. Intuitively, and correctly, it would usually be pretty tough for our two samples of 500 who are evaluating this concept to give us the sample information with which we would conclude that, "Hey, guys, the statistical test indicates that females like this stuff better than do males." Also, letting our intuition loose, most of us would feel pretty comfortable saying that if the mean liking for females was a full point higher than the mean liking for males in the respective populations, the 500 of each gender who test this product will (almost) always give us data that reflects the stronger liking by females.
Rather than false positives, we are now concerned with false negatives - our data not detecting a difference that really exists. In the preceding paragraph we argued that we will have fewer false negatives the further apart the two (unknown) population means really are. Makes sense.
In Table 1, the results of several more simulations of 2,000 sets of samples are shown. In each case, 500 observations representing our sample of males and another 500 representing the females in the sample were created. The column headings are the differences in the simulated population means. The first column (0) is a repetition of the discussion above, for when the population means were equal (zero difference). The entries in that column show the number and percentage of false positives for three different significance levels: .10, .05 and .01 (not discussed above). Remember that false positives arise when we find that our sample evidence supports the hypothesis that females have a mean rating greater than that of males, by an unspecified amount.
Columns 2 through 8 are headed by the simulated differences in generated population means. These differences run from .1 to 1. The entries in this portion of the table are the number and percentage of times our samples would lead us to conclude that the female mean is not higher than that for males. (More notation/jargon: this is β, the probability that we accept the hypothesis of equal means when really they are not equal, or one is larger than the other, in the populations of interest, depending on the form of the hypotheses studied.) Thus, in all cases the table shows the number of times the data would lead us to make an incorrect decision regarding the populations of interest.
Without getting into some nasty statistical squiggly notation, we cannot easily evaluate what should have happened in columns 2 through 8. It was fairly easy to do so in the first column, by the way we ordinarily do practical statistical significance testing and set our significance level, α. However, the numbers do bear out our intuition that the more one population mean beats the other, the more likely the samples will lead us to the appropriate conclusion (or the less likely that the samples will lead us to the incorrect conclusion). We do, however, see such things as this: if the true population means are such that one is .1 larger than the other, and we run our significance test at α = .01, then around 96 percent of the time our samples of 500 each will fail to give us the right answer.
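Something in the spirit of Table 1 can be approximated under the same hypothetical setup by shifting the female population mean and tallying misses at each significance level; the exact numbers won't match the article's table, since the true generating distributions surely differ:

```python
def false_negative_rates(diff, alphas, reps=REPS):
    """For a true mean difference `diff`, estimate beta (the false-negative
    rate) at each significance level in `alphas`."""
    pvals = np.empty(reps)
    for i in range(reps):
        m = draw_ratings()
        f = draw_ratings(mean_shift=diff)
        _, pvals[i] = stats.ttest_ind(f, m, equal_var=True, alternative='greater')
    return {a: float(np.mean(pvals >= a)) for a in alphas}

for diff in (0.1, 0.2, 0.3, 0.5, 1.0):
    print(diff, false_negative_rates(diff, alphas=(0.10, 0.05, 0.01)))
```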
One major issue not addressed here is that of statistical significance versus practical significance. That is, just because our samples indicate that the female mean is higher than that for males (as it will around one time in four when we let α = .10 and the true mean difference is only .1), is that enough for us to really invest a chunk of money in marketing, advertising, plant and equipment or whatever? Might we not be better off to do these types of simulations before we draw samples, using meaningful population differences, to decide what sample sizes are appropriate? Sometimes, yes; sometimes, no.
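If you'd rather not simulate, a textbook power calculation does the same pre-study arithmetic; the sketch below leans on statsmodels (not mentioned in the article) and treats both the standard deviation of 1 and the .2 "meaningful difference" as placeholder assumptions:

```python
from statsmodels.stats.power import TTestIndPower

sd = 1.0                 # std dev of the 1-5 ratings in the binomial sketch above
meaningful_diff = 0.2    # smallest difference worth acting on - an assumption
effect_size = meaningful_diff / sd

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=0.05, power=0.80,
                                          alternative='larger')
print(f"about {n_per_group:.0f} respondents per group")
```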
So, what are our conclusions? Among other things:
1. Standard significance testing works pretty well: it rejects a true hypothesis of equal means about the proportion of the time we specify when we select an α-level.
2. The more different the true population means are, the more likely we are to detect a difference, for fixed sample sizes.
3. For a given sample size and a given difference in population means, as α decreases, β increases (and, duh, vice versa).
4. Adding α and β will not give us a constant, though many, many people think otherwise.
5. When testing a large number of scales, say 100, with the type of significance test we looked at above, recognize that for five or six or so of them you'll get false positives (see the sketch after this list).
Thus, if all you see are five or six cases in which the mean for females is higher than that for males, don't turn this into a federal case.
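A quick back-of-the-envelope check on point 5, assuming (generously) that the 100 tests are independent:

```python
from scipy import stats

n_scales, alpha = 100, 0.05
print("expected false positives:", n_scales * alpha)                # about 5
print("P(at least one):", 1 - stats.binom.cdf(0, n_scales, alpha))  # ~0.99
print("P(five or more):", 1 - stats.binom.cdf(4, n_scales, alpha))  # ~0.56
```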
Reiterating, we've demonstrated some things which everyone knew all along to be true, some things that intuitively were going on in statistical significance testing and maybe one or two things that might cause some careful thought. As said weekly on the old TV series Hill Street Blues, "Be careful out there!"