Editor’s note: Susie Sangren, president of Clearview Data Strategy, Inc., Ithaca, N.Y., is a consulting statistician.
You wouldn’t believe how many times I have been asked, "How big should my sample size be to give a reasonable estimate of the target population?" (My answer is, "It all depends. . .") The questioners are usually research analysts not trained in probability sampling and statistical theory.
The quality of a market analysis is judged by its validity: how confident are you, as a researcher, that your findings will be replicated in the real marketplace? Data collected from non-probability, informal sample surveys will not allow you to draw conclusions about the population with measurable confidence. Remember that the intent of a survey is never just to describe the particular individuals who happen to be selected into the sample, but to obtain a composite profile of the population.
What I am about to show you is an easy (yet robust) method of calculating the sample size you need for a specific market survey or experiment. The research design is simple random sampling, and the sample size calculated is the number of completed surveys required to achieve a certain level of confidence and error rate. The number of "completes" may be a lot lower than the number of surveys you actually send out, depending on the response rate you expect.
The beauty of simple random sampling is that it is probability-based (and therefore representative of the population, because everyone in the population has an equal chance of being selected), and it is simple. You can use a random-number generator to pick the sampling units out of the entire population. Simple random sampling is robust because it can meet the needs of most managers. With probability sampling, you can report the following two quantities to relate the accuracy of your sample estimate to the population parameter:
- Sampling errors: How close is your sample estimate to the true population number? A typical answer may be, "The population number is within ±3 percent of the sample estimate." Naturally, the smaller the sampling error you want, the larger the sample size you will need.
- Level of confidence: How confident are you that the estimate from your one sample would hold up across repeated samples? An answer may be, "I am 95 percent confident that the population number is between A and B." The larger the confidence level you want, the larger the sample size you will need.
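The random-number draw mentioned above can be sketched in a few lines of Python. This is only an illustration: the population of 20,000 ID numbers, the sample of 500 and the seed are hypothetical values chosen for the example, not figures from this article.

```python
import random

# Hypothetical sampling frame: one ID number per member of the population.
population_ids = list(range(1, 20_001))

# A fixed seed makes the draw reproducible; any seed (or none) would do.
rng = random.Random(2024)

# Simple random sampling without replacement: every ID has the same
# chance of being selected, and no ID can be picked twice.
sample_ids = rng.sample(population_ids, k=500)

print(len(sample_ids), "units selected; first five:", sorted(sample_ids)[:5])
```

In practice the list of IDs would come from your actual sampling frame, such as a customer file.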
The sample size should be determined before other survey considerations such as: what questions you should ask; what response rate you can expect; how the data should be collected, and by whom. There are two ways to approach the sample-size problem:
1) You have already decided on the confidence level and the sampling error requirements; now you want to know the sample size.
2) You have decided on the sample size and the confidence level required; now you want to know the error rate of your sample estimate.
To solve Problem One for the sample size, I begin by assuming the following, rather limited, conditions:
- All my survey questions have the yes/no type of dichotomous answers.
- My absolute error-rate (E) requirement is 3 percent. (The true population number is within the range of ±3 percent of my sample estimate.)
- My confidence level (C) requirement is 95 percent. (If I repeated the survey with 100 different samples, I would expect the true population number to fall within my error range in about 95 of them.)
- My first guess at the percentage estimate for the "yes" answer in my sample for a particular question (P) is 35 percent.
The sample size (N) calculation formula is simply:
N = square of { square root of [P x (1-P)] / [E / std(C)] },
where "std(C)" is the equivalent of the confidence level, expressed as a number of standard deviations. I list below three widely accepted levels of confidence, and their standard-deviation counterparts:
1. 68 percent confidence level - The population number is within plus or minus one standard deviation of my sample estimate.
2. 95 percent confidence level - plus or minus two standard deviations (more precisely, 1.96). It is the most popular level.
3. 99.7 percent (almost 100 percent) confidence level - three standard deviations.
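The one, two and three standard deviations listed above are rounded values. As a rough check, the exact figures can be looked up from the normal curve with Python's standard library. The function name `std_for_confidence` is made up for this sketch, and it assumes the usual two-sided, normal-approximation reading of a confidence level.

```python
from statistics import NormalDist

def std_for_confidence(confidence):
    """Standard deviations either side of the mean that enclose
    `confidence` (e.g. 0.95) of a normal curve, two-sided."""
    # A two-sided level C leaves (1 - C) / 2 in each tail, so the upper
    # cut-off sits at the cumulative probability 0.5 + C / 2.
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

for c in (0.68, 0.95, 0.997):
    print(f"{c:.1%} confidence -> {std_for_confidence(c):.2f} standard deviations")
```

Using 2 rather than 1.96 for the 95 percent level gives a slightly larger, and therefore slightly more cautious, sample size.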
Now, let’s substitute all the known quantities into the size calculation formula to solve for N:
0.4770 = sq. rt. of [0.35 x (1-0.35)]
0.015 = 0.03/2
N = (0.4770/0.015) ** 2 = 1,011
Therefore, the required survey sample size is 1,011, for a 95 percent confidence level and a tight error bound of ±3 percent. Exhibit 1 shows the calculated sample sizes under various levels of sampling error rates and estimated "yes" percentages, all at 95 percent confidence level by simple random sampling.
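The calculation above can be written as a small Python function. This is a minimal sketch of the formula as stated in this article; the function and parameter names are invented for illustration. The short loop at the end prints sample sizes for a few error rates and "yes" percentages picked for the example; it is not a reproduction of Exhibit 1.

```python
import math

def sample_size(p, error, std_c=2.0):
    """Completed surveys needed under simple random sampling.

    p     -- first guess at the "yes" proportion (0.35 means 35 percent)
    error -- absolute sampling error E (0.03 means +/-3 percent)
    std_c -- confidence level in standard deviations (2 for about 95 percent)
    """
    return (math.sqrt(p * (1.0 - p)) / (error / std_c)) ** 2

n = sample_size(0.35, 0.03)
print(round(n))      # 1011, matching the worked example
print(math.ceil(n))  # 1012 if you prefer always to round up

# Illustrative grid at the 95 percent level (values chosen for this sketch).
for e in (0.01, 0.03, 0.05):
    sizes = [round(sample_size(p, e)) for p in (0.10, 0.35, 0.50)]
    print(f"E = +/-{e:.0%}: {sizes}")
```

The unrounded result is about 1,011.1, so rounding to the nearest whole number gives the 1,011 quoted above, while rounding up gives 1,012.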
To solve Problem Two for the error rate, I have already been given a sample size, say, 1,011 (N), and the confidence level, say, 95 percent (C). Using the same formula rearranged for E, that is, E = std(C) x sq. rt. of [P x (1-P)] / sq. rt. of N, converting the confidence level (C) into an appropriate standard deviation, std(C), and assuming that my sample percentage of the "yes" answer (P) is 35 percent, my sampling error rate will again be calculated as ±3 percent. Remember that increased sample size generally means increased survey reliability, which must be traded off against increased cost and time.
Exhibit 2 shows the calculated sampling errors under various sample sizes and estimated "yes" percentages, all at 95 percent confidence level by simple random sampling.
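Problem Two can be sketched the same way by solving the formula for E. The function name `sampling_error` is an assumption made for this example, and the sample sizes and percentages in the loop are illustrative choices, not the contents of Exhibit 2.

```python
import math

def sampling_error(p, n, std_c=2.0):
    """Absolute sampling error E for n completed surveys.

    p     -- sample proportion answering "yes"
    n     -- number of completed surveys
    std_c -- confidence level in standard deviations (2 for about 95 percent)
    """
    return std_c * math.sqrt(p * (1.0 - p) / n)

# The worked example: N = 1,011 and P = 35 percent give roughly +/-3 percent.
print(f"{sampling_error(0.35, 1011):.2%}")

# Illustrative grid at the 95 percent level (values chosen for this sketch).
for n in (100, 500, 1011, 2000):
    row = "  ".join(f"{sampling_error(p, n):6.2%}" for p in (0.10, 0.35, 0.50))
    print(f"N = {n:>5}: {row}")
```

Reading across a row shows the error growing as P approaches 50 percent; reading down a column shows it shrinking as the sample gets larger.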
Notice also that when P=0.5, or 50 percent, the value of [P x (1-P)] is at its maximum of 0.25. This implies that the more unsure I am about the survey outcome (i.e., the closer the percentage estimate for the "yes" answer, P, is to 50 percent, where I am only certain half the time), the larger the sampling error will be.
Going back to the Problem Two scenario, and changing my sample estimate for the "yes" percentage from the earlier 35 percent to 50 percent, now I would calculate a slightly larger sampling error (3.145 percent versus the earlier 3 percent):
0.50 = sq. rt. of [0.50 x (1-0.50)]
31.7962 = sq. rt. of 1,011
E = (0.50 / 31.7962) x 2 = 0.03145 (or 3.145%)
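A quick way to see that 50 percent is the worst case is to sweep P across its range with the sample size held at 1,011. This sketch assumes the same formula as above; the helper name and the 5-point grid of guesses are choices made for the example.

```python
import math

N = 1011      # completed surveys, from the worked example
STD_C = 2.0   # about 95 percent confidence

def sampling_error(p, n=N, std_c=STD_C):
    """Absolute sampling error for proportion p with n completes."""
    return std_c * math.sqrt(p * (1.0 - p) / n)

# Candidate "yes" percentages: 5 percent, 10 percent, ..., 95 percent.
guesses = [i / 20 for i in range(1, 20)]

# The guess that produces the largest sampling error.
worst = max(guesses, key=sampling_error)
print(f"largest error at P = {worst:.0%}: {sampling_error(worst):.3%}")
```

Because no other value of P gives a larger error, sizing a survey with P set to 50 percent is a conservative choice when you have no prior guess.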
Finally, I may want to enlarge the calculated sample size (done somewhat subjectively) because:
1. My survey contains questions with multinomial answers. In such a case, I will pick the question with the highest number of answer categories to estimate my sample size. The resulting size should be good for the entire survey.
2. I have to take into consideration the non-response rate.
3. I want to ensure that when I crosstabulate one variable with another, I would have enough data in each cell.
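The non-response adjustment in the second point can be sketched as follows. The function name and the 25 percent response rate are hypothetical, used only to show the arithmetic; your own expected response rate should be substituted.

```python
import math

def surveys_to_send(completes_needed, response_rate):
    """Surveys to send out so that the expected number of completes
    reaches `completes_needed`, given an expected response rate."""
    if not 0.0 < response_rate <= 1.0:
        raise ValueError("response_rate must be greater than 0 and at most 1")
    # Round up: a fraction of a survey cannot be sent.
    return math.ceil(completes_needed / response_rate)

# 1,011 completes with a hypothetical 25 percent response rate.
print(surveys_to_send(1011, 0.25))  # 4044
```

This covers only the expected response rate; the other two adjustments remain a matter of judgment.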