Sampling 101
From QualtricsWiki
Estimating Sample Size
Sampling is the process of selecting a subset of units from the population. After choosing a sampling method (these methods are discussed in the Sampling page), the next step is to decide how many units to sample. We use sampling formulas to determine this number. Selecting the right number of people is important because we make inferences about the population from the characteristics of this sample.
The most common method of sample size determination is based on proportions. For example, suppose we are preparing for the winter Olympics and are interested in estimating "the proportion of out-of-state skiers that took at least one overnight trip." We might use this proportion to estimate the number of people that would consider traveling to the Olympics.
In this case, the sample size is estimated using proportions. The standard error of a proportion is $s_p = \sqrt{p(1-p)/n}$, where $p$ is the proportion of "out-of-state skiers that took at least one overnight trip" and $n$ is the sample size. The most conservative value for this proportion is .50. If the desired accuracy is .05, the sample size formula, including the adjustment for population size, is:
$$n = \frac{Z^2 \, p(1-p) / e^2}{1 + \left( Z^2 \, p(1-p) / e^2 - 1 \right) / N}$$
where $Z$ is the number of standard errors, $p$ is the proportion, $e$ is the accuracy, and $N$ is the population size.
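As a rough illustration, the sample size formula can be sketched in Python. The function name `required_sample_size`, its parameter names, and the population of 10,000 are assumptions made for this example; the accuracy term is squared, as in the standard form of the formula.

```python
import math

def required_sample_size(z, p, accuracy, population=None):
    """Sample size needed to estimate a proportion.

    z          -- number of standard errors (about 1.96 for 95% confidence)
    p          -- expected proportion (0.50 is the most conservative choice)
    accuracy   -- desired accuracy as a proportion (0.05 means +/- 5 points)
    population -- population size; None skips the population-size adjustment
    """
    # Sample size before adjusting for the size of the population.
    n0 = z ** 2 * p * (1 - p) / accuracy ** 2
    if population is None:
        return math.ceil(n0)
    # Adjustment for a finite population of the given size.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Most conservative proportion (.50) and desired accuracy of .05.
print(required_sample_size(1.96, 0.50, 0.05))
# Same inputs with a hypothetical population of 10,000.
print(required_sample_size(1.96, 0.50, 0.05, population=10_000))
```

Rounding up with `math.ceil` is a design choice here: a fractional respondent cannot be interviewed, and rounding down would fall slightly short of the requested accuracy.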
This formula is easily entered into a spreadsheet. Applying it across a range of sample sizes and proportions produces the following sample size determination table.
Table data show the range of error (+/- %) for different sample sizes. Rows show the sample sizes at the 95% confidence level; columns show the expected response (variable proportions).

| Size | 50/50% | 40/60% | 30/70% | 20/80% | 90/10% | 95/5% |
|---|---|---|---|---|---|---|
| 25 | 20 | 19.6 | 18.3 | 16 | 12 | 8.7 |
| 50 | 14.2 | 13.9 | 13 | 11.4 | 8.5 | 6.2 |
| 75 | 11.5 | 11.3 | 10.5 | 9.2 | 6.9 | 5 |
| 100 | 10 | 9.8 | 9.2 | 8 | 6 | 4.4 |
| 150 | 8.2 | 8 | 7.5 | 6.6 | 4.9 | 3.6 |
| 200 | 7.1 | 7 | 6.5 | 5.7 | 4.3 | 3.1 |
| 250 | 6.3 | 6.2 | 5.8 | 5 | 3.8 | 2.7 |
| 300 | 5.8 | 5.7 | 5.3 | 4.6 | 3.5 | 2.5 |
| 400 | 5 | 4.9 | 4.6 | 4 | 3 | 2.2 |
| 500 | 4.5 | 4.4 | 4.1 | 3.6 | 2.7 | 2 |
| 600 | 4.1 | 4 | 3.8 | 3.3 | 2.5 | 1.8 |
| 800 | 3.5 | 3.4 | 3.2 | 2.8 | 2.1 | 1.5 |
|  | 3.2 | 3.1 | 2.9 | 2.5 | 1.9 | 1.4 |
|  | 2.9 | 2.8 | 2.7 | 2.3 | 1.7 | 1.3 |
|  | 2.6 | 2.5 | 2.4 | 2.1 | 1.6 | 1.1 |
|  | 2.2 | 2.2 | 2 | 1.8 | 1.3 | 0.96 |
|  | 2 | 2 | 1.8 | 1.6 | 1.2 | 0.87 |
|  | 1.8 | 1.8 | 1.7 | 1.4 | 1.1 | 0.79 |
|  | 1.6 | 1.5 | 1.4 | 1.3 | 0.95 | 0.69 |
|  | 1.4 | 1.4 | 1.3 | 1.1 | 0.85 | 0.62 |
|  | 1.2 | 1.1 | 1.1 | 0.92 | 0.69 | 0.5 |
|  | 1 | 0.98 | 0.92 | 0.8 | 0.6 | 0.44 |
|  | 0.82 | 0.8 | 0.75 | 0.66 | 0.49 | 0.36 |
|  | 0.63 | 0.62 | 0.58 | 0.5 | 0.38 | 0.27 |
|  | 0.4 | 0.39 | 0.31 | 0.32 | 0.24 | 0.17 |

In a product usage study where the expected product usage incidence rate is 30%, a sample of 500 will yield a precision of about +/- 4 percentage points at the 95% confidence level.
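The table entries can be approximated with a few lines of Python. This is a sketch, not the method used to build the table: the function name `error_range` is made up for the example, and the default of two standard errors (a common rounding of 1.96 for the 95% level) is an inference from the published values rather than something the table states.

```python
import math

def error_range(n, p, z=2.0):
    """Range of error (+/- percentage points) for sample size n and proportion p.

    z=2.0 is an assumption: it appears to match the table values.
    """
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Column proportions in table order: 50/50, 40/60, 30/70, 20/80, 90/10, 95/5.
proportions = [0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

for n in (25, 100, 500):
    row = [round(error_range(n, p), 1) for p in proportions]
    print(n, row)
```

For the product usage example, `error_range(500, 0.30)` gives roughly 4.1, in line with the 30/70% column of the row for a sample of 500.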
Calculating Sample Size
How many people do you need to interview to get results representative of the target population with the level of confidence that you are willing to accept? Given an existing sample, what is the level of precision you now have?
Computing sample size is based on the confidence interval and the confidence level. If you are not familiar with these terms, see the Sample Size Terminology section below. To learn more about how sample size, population diversity, and population size influence the size of confidence intervals, see the section What Influences the Size of the Confidence Intervals below.
For a paper on sample size determination, refer to: Albaum, Green, and Smith Chapters 12-13
Sample Size Terminology
The following is an edited version of "Inside the paper's election polls", an article by Elsa McDowell that appeared in The Charleston Post and Courier, November 8, 2002.
"The beauty of... election polls is that they are straightforward. They use statistical formulae to estimate how many people will vote one way and how many will vote another. No spin. No qualifying clauses to muddy the picture.
The difficulty of... election polls is that they are not always straightforward. How else could you explain that a poll done by one candidate shows him in the lead and that a poll done by his opponent shows her in the lead?
Statisticians say there are ways to twist questions or interpret answers to give one candidate an advantage over another.
One reader took issue with recent poll results run in The Post and Courier. He questioned whether the methodology was described in enough detail and whether the sample size was adequate.
He was right about one point. The story did not make clear who was polled. It said "voters" and failed to elaborate. It should have indicated that the people polled were registered and likely to vote in the November elections.
His next point is debatable. He said the sample size of 625 likely voters was insufficient for a state with nearly 4 million residents and suggested at least 800 should have been polled.
Brad Coker, the researcher responsible for the study, responded that "the standard sample size used by polling groups nationally is 625. It produces, as the story stated, a margin of error of plus-or-minus 4 percent. Increasing the sample size to 800 would have produced a margin of error of plus-or-minus 3.5 - more accurate, but not so much more accurate to justify the additional cost."
"Many people do not understand how sample sizes work. They believe that, the larger the pool, the larger the sample size needs to be."
"It's not like that. You can take a drop of blood from a 400-pound person and it will contain the same data you would get if you took it from a 100-pound person," he said.
The reader's next concern was that the margin of error of plus-or-minus 4 applies only to the group viewed in its entirety. "If 'minorities' constituted 27 percent of the total sample, then only 169 were sampled. The margin of error then skyrockets into double digits."
Coker said the reader is right and wrong. The margin of error does jump for subgroups, but does not reach double digits. In this case, it moves from plus-or-minus 4 to plus-or-minus 6 to 8.
Two days before The Post and Courier ran the... poll, there was a short story... about a... poll commissioned by MSNBC. The... poll indicated incumbent Gov. Jim Hodges with a 45-43 percent lead. (Our) poll indicated challenger Mark Sanford was ahead 45 to 41 percent. When the margin of error is considered, both polls show the race is still a toss-up.
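The margins of error quoted in the article can be checked with a short calculation. The sketch below assumes the most conservative proportion (50%) and 1.96 standard errors for the 95% level; neither assumption is stated in the article, and the function name `margin_of_error` is chosen for the example.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error in percentage points for a sample of size n."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

# Sample sizes mentioned in the article; 169 is about 27 percent of 625.
samples = [("full sample", 625), ("larger sample", 800), ("subgroup", 169)]

for label, n in samples:
    print(f"{label:>13}: n={n:<4} +/- {margin_of_error(n):.1f}")
```

Under these assumptions the full sample comes out near plus-or-minus 4, the sample of 800 near 3.5, and the subgroup of 169 between 6 and 8, which matches Coker's figures and stays below double digits.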
Terminology
The confidence interval is the plus-or-minus figure that represents the accuracy of the reported result. Consider the following example:
A Canadian national sample showed "Who Canadians spend their money on for Mother's Day." Eighty-two percent of Canadians expect to buy gifts for their mom, compared to 20 percent for their wife and 15 percent for their mother-in-law. In terms of spending, Canadians expect to spend $93 on their wife this Mother's Day versus $58 on their mother. The national findings are accurate, plus or minus 2.75 percent, 19 times out of 20.
For example, if you use a confidence interval of 2.75 and 82% of your sample indicates they will "buy a gift for mom", you can be "confident (95% or 99%)" that if you had asked the question of ALL CANADIANS, somewhere between 79.25% (82% - 2.75%) and 84.75% (82% + 2.75%) would have picked that answer.
The confidence level tells you how confident you are of this result. It is expressed as the percentage of times that different samples (if repeated samples were drawn) would produce this result. The 95% confidence level means that 19 times out of 20 the results would fall within this plus-or-minus confidence interval. The 95% confidence level is the most commonly used.
When you put the confidence level and the confidence interval together, you can say that you are 95% (19 out of 20) sure that the true percentage of the population that will "buy a gift for mom" is between 79.25% and 84.75%.
Wider confidence intervals increase the certainty that the true answer is within the range specified. These wider confidence intervals come from smaller sample sizes. When the cost of an error is extremely high (a multi-million dollar decision is at stake), the confidence interval should be kept small. This can be done by increasing the sample size."
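The "19 times out of 20" reading of the 95% confidence level can be illustrated with a small simulation. This is a sketch under stated assumptions: the true proportion of 0.82 is borrowed from the Mother's Day example, while the sample size of 1000, the 2000 repetitions, the seed, and the function name `coverage` are hypothetical choices. Each interval is built from the proportion observed in that sample.

```python
import math
import random

def coverage(true_p, n, reps, z=1.96, seed=0):
    """Share of repeated samples whose interval contains true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        # Draw one sample of n yes/no answers from the population.
        yes = sum(rng.random() < true_p for _ in range(n))
        p_hat = yes / n
        # Plus-or-minus figure based on the observed proportion.
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / reps

share = coverage(true_p=0.82, n=1000, reps=2000)
print(f"Intervals containing the true value: {share:.3f}")
```

The printed share should land close to 0.95, that is, roughly 19 samples out of every 20 produce an interval that contains the true population value.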
What Influences the Size of the Confidence Intervals
Sampling theory teaches us that the accuracy of our estimates depends on such factors as the dispersion and skewness of the population's responses (divergent opinions or characteristics versus similar opinions or characteristics), the sample size, and the size of the population. Controlling these variables affects the incidence (and elimination) of sampling error. Note that other "non-sampling" errors, such as bad question design and selection of a "bad" sample list, are not controlled by sample size.
Sample Size
Larger sample sizes generally produce a more accurate picture of the true characteristics of the population. Larger samples tighten the confidence interval, making your estimate more accurate. This relationship is not linear: doubling the sample size from 500 to 1000 reduces the confidence interval only from +/- 4.38 to +/- 3.1 percentage points.
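The non-linear relationship can be seen by computing the interval for a few sample sizes. The sketch assumes a 50% proportion and 1.96 standard errors, which reproduces the figures above; the function name and the extra sample sizes (2000 and 4000) are added for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Confidence interval half-width in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (500, 1000, 2000, 4000):
    print(f"n={n:<5} +/- {margin_of_error(n):.2f}")

# The interval shrinks with the square root of the sample size,
# so halving it takes four times as many respondents.
ratio = margin_of_error(500) / margin_of_error(2000)
print(f"ratio of intervals, n=500 vs n=2000: {ratio:.2f}")
```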
Proportion
The accuracy of an estimate also depends on the dispersion and skewness of the population on the question being asked. A sample of Republicans would give a less dispersed evaluation of a Republican president than would a sample of Democrats. Likewise, a sample of Catholic priests would have less variability on the issue of abortion than would a survey of the general population. Accuracy of the sample estimate increases as the dispersion in the population decreases.
To cover all eventualities, you should use the worst-case scenario (50%) as the percentage when estimating confidence intervals and required sample size. Once your data has been collected, you can use the observed proportion in a specific question to obtain a more accurate estimate of the confidence interval.
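A brief sketch shows why 50% is the worst case and how much the interval tightens once an observed proportion is available. The sample size of 1000 is hypothetical, the observed proportion of 82% is borrowed from the Mother's Day example, and 1.96 standard errors is assumed for the 95% level.

```python
import math

def margin_of_error(n, p, z=1.96):
    """Confidence interval half-width in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

n = 1000  # hypothetical sample size

# Planning stage: no data yet, so use the worst-case proportion.
planned = margin_of_error(n, 0.50)
# After data collection: use the proportion observed for the question.
observed = margin_of_error(n, 0.82)

print(f"planned  (p=0.50): +/- {planned:.2f}")
print(f"observed (p=0.82): +/- {observed:.2f}")

# Scan proportions from 1% to 99% to find where the interval is widest.
widest = max(range(1, 100), key=lambda k: margin_of_error(n, k / 100))
print(f"widest interval at p = {widest}%")
```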
Population Size
The size of the population also influences the size of the confidence interval, but not as much as you might expect. If your survey of 1000 respondents were drawn from a population of 100,000, the confidence interval is +/- 3.08%, while the confidence interval widens to only +/- 3.1% when sampled from a population of 1,000,000. The size of the population has little effect on the confidence interval relative to the effect of the actual sample size.
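These two figures can be reproduced with the usual finite population correction. The sketch assumes a 50% proportion, 1.96 standard errors, and the correction factor sqrt((N - n) / (N - 1)); the page does not say which correction was used, so treat that choice, and the function name, as assumptions.

```python
import math

def margin_of_error_fpc(n, population, p=0.5, z=1.96):
    """Confidence interval half-width (percentage points) for a finite population."""
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: shrinks the error when the sample
    # is a noticeable fraction of the population.
    fpc = math.sqrt((population - n) / (population - 1))
    return 100 * z * standard_error * fpc

for population in (100_000, 1_000_000):
    moe = margin_of_error_fpc(1000, population)
    print(f"N={population:>9,}: +/- {moe:.2f}")
```

A tenfold increase in population size moves the interval by about two hundredths of a percentage point, which is why the sample size, not the population size, drives precision.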
Non-Sampling Errors
Non-sampling errors cannot be compensated for by increased sample size. Often, larger samples accentuate non-sampling errors rather than reduce them. Non-sampling errors come from samples that are not truly random, bad scales, misleading questions, incomplete surveys, and so on.

