A sample is a sub-set of the population that is under investigation.
There are three considerations about samples that will shape the decision
about which statistical test to use:
- Is the sample representative of the population
from which it was drawn?
- What is the sample size?
- Are the samples independent (unpaired) or
dependent (paired or related)?
Representativeness
If a researcher has data about the entire
population, then descriptive statistics may be used to describe the
population, and no generalization is involved. If, however, a researcher
has data about only a sub-set of the population, the researcher may want to
use statistical inference to generalize to the whole population.
Statistical inference requires that the sample be
representative of the population from which it was drawn. The
only way to know whether a sample is representative is to study the whole
population as a comparison. If a researcher had data about the whole
population, he or she would not bother with a sample at all. Thus
researchers hope that samples represent the population from which they are
drawn, but there is no guarantee. For this reason, research
emphasizes the importance of replication. Researcher A may derive
important findings, but if Researchers B, and C, and D (and so forth)
can't replicate the findings, then the scholarly community suspects that
Researcher A's sample may have been unrepresentative--a fluke.
The best way to maximize representativeness is to
draw a probability sample--a sample in which every element in the
population has a known probability of being selected into the
sample. See the Quantitative Research pages for more information
about probability samples.
Probability samples are the best, but not always
possible or practical. A probability sample requires a list of all
the elements in the population under consideration*; often such lists do
not exist (e.g., people with low self-esteem). Even when they do
exist, contacting every selected element may be a daunting task; a low
response rate ruins the best probability sample. The response rate
is given by the equation: response rate = [(drawn sample size - number
who refused or could not be reached) / drawn sample size] x 100.
Anything in the 75% range is
considered good; mailed surveys, even with two or three follow-ups, often
achieve less than 50% response rates.
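The response-rate equation above can be sketched in Python. The function name and the figures in the usage line are hypothetical, chosen only for illustration.

```python
def response_rate(drawn_sample_size: int, nonrespondents: int) -> float:
    """Return the response rate as a percentage.

    nonrespondents is the number who refused or could not be reached.
    """
    if drawn_sample_size <= 0:
        raise ValueError("drawn_sample_size must be positive")
    if not 0 <= nonrespondents <= drawn_sample_size:
        raise ValueError("nonrespondents must be between 0 and drawn_sample_size")
    # Multiply before dividing so that whole-number results stay exact.
    return (drawn_sample_size - nonrespondents) * 100 / drawn_sample_size


# Hypothetical mailed survey: 400 questionnaires sent, 220 never returned.
print(response_rate(400, 220))  # 45.0
```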
Researchers make an argument for the
representativeness of non-probability samples in two ways: 1) by
comparing the sample to whatever characteristics are known about the
population based on the assumption that representativeness on some
characteristics may translate to representativeness on all
characteristics; 2) by comparing early responders and late responders
based on the assumption that nonresponders are more like late responders
than like early responders.
Sample Size
Sample size is most often determined by
practical considerations--how much time and how many resources does the
researcher have? When time and resources are not the limiting
constraints, sample size is determined by three factors:
- The hypothesized distribution of the dependent
variable, expressed as a dichotomy. In the 2000 US presidential
election, for example, approximately 50% of the voters supported
George Bush and approximately 50% supported other candidates. In
satisfaction surveys of college students, approximately 90% of
graduating seniors say they would probably or definitely choose the
same school again and 10% say they would probably or definitely not
choose the same school again. If a researcher has no idea what
the distribution of the dependent variable will be, the best choice is
50%/50% as this distribution results in the largest sample size.
- The margin of error the researcher is willing to
tolerate. In pre-election, candidate-preference polling, results
are often reported as 45% support candidate X with a margin of error
of +/-3%. In other words, the poll predicts that between 42% and
48% of the population favors candidate X. A given researcher may
be willing to accept a 10% margin of error (35%-55%) or may want the
margin of error to be as small as 1% (44%-46%). The smaller the
margin of error, the larger the required sample size.
- Degree of confidence that the sample results
represent the population. Commonly researchers use 95% or 99%
confidence. The greater the confidence desired the larger the
required sample size.
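How these three factors interact can be sketched with the textbook normal-approximation formula for a proportion, margin = z x sqrt(p(1-p)/n). This is one common approximation, not necessarily the formula any particular polling organization uses; the function name, the rounded z-values (1.96 for 95%, 2.58 for 99%), and the sample size of 1000 are assumptions made for illustration.

```python
import math

# Rounded z-values for the two confidence levels mentioned above (assumed).
Z_VALUES = {95: 1.96, 99: 2.58}


def margin_of_error(p: float, n: int, confidence: int = 95) -> float:
    """Approximate margin of error, in percentage points, for a proportion p
    estimated from a simple random sample of size n."""
    z = Z_VALUES[confidence]
    return z * math.sqrt(p * (1 - p) / n) * 100


# Compare a 50/50 split with a 90/10 split at both confidence levels.
for p in (0.5, 0.9):
    for confidence in (95, 99):
        moe = margin_of_error(p, 1000, confidence)
        print(f"p={p:.1f}, n=1000, {confidence}% confidence: +/-{moe:.1f} points")
```

At a fixed sample size the 50/50 split yields the widest margin, which is why it demands the largest sample when the margin is held fixed.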
There are formulas to
determine required sample size based on these three factors. Using
one of the on-line sample size calculators is easier, however.
http://www.surveysystem.com/sscalc.htm
If a researcher wanted to be 95% confident about a
50/50 percentage with a margin of error of 5% from a population of
2000, the researcher would need a sample size of 322. Being 99% confident
in the same situation would require a sample of 500.
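The two figures above can be reproduced with a commonly used sample-size formula for a proportion that includes a finite population correction. Whether the linked calculator uses exactly this formula is an assumption, as are the function name, the rounded z-values (1.96 and 2.58), and the choice to round to the nearest whole number.

```python
def required_sample_size(population: int, margin: float, z: float,
                         p: float = 0.5) -> int:
    """Sample size for estimating a proportion p within +/- margin.

    margin and p are fractions (0.05 means 5%); z is the z-value for the
    desired confidence level.
    """
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction shrinks n0 for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


print(required_sample_size(2000, 0.05, z=1.96))  # 322 (95% confidence)
print(required_sample_size(2000, 0.05, z=2.58))  # 500 (99% confidence)
```

Rounding up instead of to the nearest whole number would be the more cautious choice in practice; nearest-integer rounding is used here only because it matches the figures quoted in the text.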
Number and Nature of the Samples
Much social science research involves the
comparison of two or more groups, for example:
- Are the political attitudes of incoming first
years and graduating seniors different?
- Do competitive athletes have better mental health
than recreational athletes? Does either of these groups have better
mental health than couch potatoes? Are these differences the same for
men as for women?
- Do states with an above-average income tax rate
have higher high school graduation rates than states with a
below-average income tax rate?
Some statistical tests can
be used with any number of samples; other statistical tests are
appropriate for a two-sample comparison only.
Whether a two-sample or a multi-sample
comparison, there are different tests for independent and dependent
samples.
Dependent (or paired or related)
samples occur whenever there is reason to suspect that the responses of
one member of one sample are dependent upon the responses of a specific
member (or members) of the other sample(s). For example, the political
attitudes of a particular graduating senior might be dependent upon or
related to that person's political attitudes as an incoming first-year
student. A husband's marital satisfaction might well depend upon his
wife's marital satisfaction. Dependent samples occur when there is a
good reason to pair members of the samples.
Independent (or unpaired) samples occur
when there is no reason to pair a respondent in one sample with a
particular respondent in the other sample(s). An independent sample test
might well have a sample of husbands and a sample of wives, but these
folks would not be married to each other. Whenever the two or more
samples differ in size, their members cannot be paired one-to-one, so an
independent sample test is used.
With dependent samples, the paired scores are
compared and the differences are summarized. With independent
samples, each sample is summarized and the group summaries are compared.
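One common instance of this contrast is the paired t statistic versus the independent-samples t statistic. The text above does not name a specific test, so the choice of t statistics (Welch's form for the independent case), the function names, and the scores below are all assumptions made for illustration.

```python
import math
import statistics


def paired_t(first, second):
    """Dependent samples: summarize the pairwise differences."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same size")
    diffs = [b - a for a, b in zip(first, second)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


def independent_t(group_a, group_b):
    """Independent samples: summarize each group, then compare summaries."""
    std_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_b) - statistics.mean(group_a)) / std_error


# Hypothetical attitude scores for the same five students at two times.
first_year = [3, 5, 4, 6, 2]
senior_year = [4, 7, 5, 8, 3]

print(f"paired t:      {paired_t(first_year, senior_year):.2f}")
print(f"independent t: {independent_t(first_year, senior_year):.2f}")
```

With these made-up scores every student changes by a similar amount, so the paired statistic is much larger than the independent one; treating paired data as independent would discard that information.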
____________________
*National polling organizations draw samples representative
of the adult, non-institutionalized, English-speaking, mainland US
population without such a comprehensive list of people. Such
organizations start with a list of census tracts and take a probability
sample of these. Then they obtain the census tract maps for the
selected tracts and take a sample of blocks within each census
tract. At the block level, they hire an individual who walks the
block and makes a list of the number of households. The organization
then draws a sample of households in each of the selected blocks.
Interviewers approach each household and make a list of the eligible
respondents in the household. From this list the interviewer selects
the person to be interviewed using a pre-established probability
formula. Such a sampling plan is called a multi-stage cluster
sample.
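The stages described in this note can be sketched as nested random draws. This is a simplified illustration, not the procedure of any actual polling organization: simple random selection at every stage, the function name, and the tiny made-up sampling frame are all assumptions.

```python
import random


def multi_stage_cluster_sample(tracts, n_tracts, n_blocks, n_households,
                               seed=None):
    """Draw tracts, then blocks, then households, then one respondent each.

    tracts maps tract name -> {block name -> list of households}, where
    each household is a list of eligible respondents.
    """
    rng = random.Random(seed)
    selected = []
    for tract in rng.sample(sorted(tracts), n_tracts):
        blocks = tracts[tract]
        for block in rng.sample(sorted(blocks), min(n_blocks, len(blocks))):
            households = blocks[block]
            k = min(n_households, len(households))
            for household in rng.sample(households, k):
                # Final stage: pick one eligible respondent per household.
                selected.append((tract, block, rng.choice(household)))
    return selected


# Hypothetical frame: 10 tracts x 5 blocks x 6 households x 2 eligible adults.
frame = {
    f"tract{t}": {
        f"block{b}": [[f"t{t}-b{b}-h{h}-p{p}" for p in range(2)]
                      for h in range(6)]
        for b in range(5)
    }
    for t in range(10)
}

sample = multi_stage_cluster_sample(frame, n_tracts=3, n_blocks=2,
                                    n_households=4, seed=1)
print(len(sample))  # 24 respondents: 3 tracts x 2 blocks x 4 households
```

Only the selected blocks ever need a household list, which is what lets the design work without a list of every person in the population.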
for questions or comments contact me at mduncombe@coloradocollege.edu
last updated on August 19, 2003