OVERVIEW OF STATISTICAL THINKING
Samples

     A sample is a sub-set of the population that is under investigation.  There are three considerations about samples that will shape the decision about which statistical test to use:
  • Is each sample representative of the population from which it was drawn?
  • How large is each sample?
  • Are the samples independent (unpaired) or dependent (paired or related)?

Representativeness
     If a researcher has data about the entire population, then descriptive statistics may be used to describe the population, and no generalization is involved.  If, however, a researcher has data about only a sub-set of the population, the researcher may want to use statistical inference to generalize to the whole population.
     Statistical inference requires that the sample be representative of the population from which it was drawn.  The only way to know whether a sample is representative is to study the whole population as a comparison, but a researcher who had data about the whole population would not bother with a sample at all.  Thus researchers hope that samples represent the population from which they are drawn, but there is no guarantee.  For this reason, research emphasizes the importance of replication.  Researcher A may derive important findings, but if Researchers B, C, and D (and so forth) can't replicate the findings, then the scholarly community suspects that Researcher A's sample may have been unrepresentative--a fluke.
     The best way to maximize representativeness is to draw a probability sample--a sample in which every element in the population has a known probability of being selected into the sample.  See the Quantitative Research pages for more information about probability samples.
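     As a minimal illustration, the Python sketch below draws a simple random sample, the most basic kind of probability sample, in which every element has the same known chance (n/N) of being selected.  The sampling frame of 2,000 numbered student IDs and the function name draw_simple_random_sample are hypothetical, chosen only for this example.

```python
import random

def draw_simple_random_sample(frame, n, seed=None):
    """Draw n elements without replacement; each has probability n / len(frame)."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers for a population of 2,000 students.
frame = list(range(1, 2001))
sample = draw_simple_random_sample(frame, 322, seed=42)

# The known probability of selection for every element in the frame.
selection_probability = len(sample) / len(frame)
print(len(sample), selection_probability)
```

     Note that the sketch presupposes a complete list of the population, which is exactly the requirement that often cannot be met in practice.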
     Probability samples are best, but they are not always possible or practical.  A probability sample requires a list of all the elements in the population under consideration*; often such lists do not exist (e.g., people with low self-esteem).  Even when they do exist, contacting every selected element may be a daunting task; a low response rate ruins the best probability sample.  The response rate is given by the equation:

     response rate = [(drawn sample size - number who refused or could not be reached) / drawn sample size] x 100

     A response rate of about 75% is considered good; mailed surveys, even with two or three follow-ups, often achieve response rates below 50%.
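     The response rate arithmetic can be sketched in a few lines of Python.  The function name response_rate and the survey figures are hypothetical.

```python
def response_rate(drawn_sample_size, nonrespondents):
    """Percentage of the drawn sample that actually responded.

    nonrespondents = number who refused or could not be reached.
    """
    if drawn_sample_size <= 0:
        raise ValueError("drawn_sample_size must be positive")
    if not 0 <= nonrespondents <= drawn_sample_size:
        raise ValueError("nonrespondents must be between 0 and drawn_sample_size")
    return (drawn_sample_size - nonrespondents) / drawn_sample_size * 100

# Hypothetical mailed survey: 400 questionnaires sent, 220 never returned.
mailed = response_rate(400, 220)
# Hypothetical interview survey: 400 people selected, 100 refused or unreachable.
interview = response_rate(400, 100)
print(f"mailed: {mailed:.1f}%, interview: {interview:.1f}%")
```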
     Researchers make an argument for the representativeness of non-probability samples in two ways:  1) by comparing the sample to whatever characteristics are known about the population based on the assumption that representativeness on some characteristics may translate to representativeness on all characteristics; 2) by comparing early responders and late responders based on the assumption that nonresponders are more like late responders than like early responders.

Sample Size
     Sample size is most often determined by practical considerations--how much time and how many resources does the researcher have?  When time and resources are not a constraint, sample size is determined by three factors:

  • The hypothesized distribution of the dependent variable, expressed as a dichotomy.  In the 2000 US presidential election, for example, approximately 50% of the voters supported George W. Bush and approximately 50% supported other candidates.  In satisfaction surveys of college students, approximately 90% of graduating seniors say they would probably or definitely choose the same school again and 10% say they would probably or definitely not choose the same school again.  If a researcher has no idea what the distribution of the dependent variable will be, the safest choice is 50%/50%, as this distribution requires the largest sample size.
  • The margin of error the researcher is willing to tolerate.  In pre-election, candidate-preference polling, results are often reported as 45% support for candidate X with a margin of error of +/-3%.  In other words, the poll predicts that between 42% and 48% of the population favors candidate X.  A given researcher may be willing to accept a 10% margin of error (35%-55%) or may want the margin of error to be as small as 1% (44%-46%).  The smaller the margin of error, the larger the required sample size.
  • Degree of confidence that the sample results represent the population.  Researchers commonly use 95% or 99% confidence.  The greater the confidence desired, the larger the required sample size.

     There are formulas to determine required sample size based on these three factors.  Using one of the on-line sample size calculators is easier, however.

     http://www.surveysystem.com/sscalc.htm

     If a researcher wanted to be 95% confident about a 50/50 percentage with a margin of error of 5% in a population of 2,000, the researcher would need a sample size of 322.  At 99% confidence, the same situation would require a sample of about 500.
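     For readers who want to see the arithmetic, the Python sketch below uses one common textbook formula: n0 = z^2 x p x (1 - p) / e^2, followed by a finite population correction, n = n0 / (1 + (n0 - 1) / N).  This is an assumption about the method; the on-line calculator may be implemented differently.  The function name required_sample_size and its parameters are hypothetical.

```python
from statistics import NormalDist

def required_sample_size(population, margin, confidence=0.95, p=0.5, z=None):
    """Estimate the sample size needed to estimate a proportion.

    population: population size N
    margin:     tolerated margin of error as a proportion (0.05 = +/-5%)
    confidence: desired confidence level (0.95 = 95%)
    p:          hypothesized proportion (0.5 gives the largest sample)
    z:          optional z value; computed from `confidence` if omitted
    """
    if z is None:
        # Two-sided critical value of the standard normal distribution.
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z**2 * p * (1 - p) / margin**2          # size for a very large population
    return n0 / (1 + (n0 - 1) / population)      # finite population correction

n_95 = round(required_sample_size(2000, 0.05, confidence=0.95))
n_99 = round(required_sample_size(2000, 0.05, confidence=0.99))
n_99_rounded_z = round(required_sample_size(2000, 0.05, z=2.58))
print(n_95, n_99, n_99_rounded_z)
```

     Under these assumptions the 95% case reproduces the 322 in the worked example.  The 99% case comes out at 498 with an exact z value and at 500 when z is rounded to 2.58, so small differences between calculators are to be expected.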

Number and Nature of the Samples
     Much social science research involves the comparison of two or more groups, for example

  • Are the political attitudes of incoming first years and graduating seniors different?
  • Do competitive athletes have better mental health than recreational athletes? Does either of these groups have better mental health than couch potatoes? Are these differences the same for men as for women?
  • Do states with an above average income tax rate have higher high school graduation rates than states with a below average income tax rate?

     Some statistical tests can be used with any number of samples; other statistical tests are appropriate for a two-sample comparison only.
     Whether a two-sample or a multi-sample comparison, there are different tests for independent and dependent samples. 
     Dependent (or paired or related) samples occur whenever there is reason to suspect that the responses of a member of one sample depend upon the responses of a specific member (or members) of the other sample(s).  For example, the political attitudes of a particular graduating senior might be dependent upon or related to that person's political attitudes as an incoming first-year student.  A husband's marital satisfaction might well depend upon his wife's marital satisfaction.  Dependent samples occur when there is a good reason to pair members of the samples.
     Independent (or unpaired) samples occur when there is no reason to pair a respondent in one sample with a particular respondent in the other sample(s).  An independent sample test might well have a sample of husbands and a sample of wives, but these people would not be married to each other.  Whenever the samples differ in size, their members cannot all be paired, so an independent sample test is called for.
     With dependent samples, the paired scores are compared and the differences are summarized.  With independent samples, each sample is summarized and the group summaries are compared.
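     The contrast between the two approaches can be sketched in Python.  All scores below are hypothetical 1-10 marital satisfaction ratings invented for illustration, and the sketch shows only how the data are summarized, not a full significance test.

```python
from statistics import mean

# Dependent samples: husbands[i] and wives[i] are married to each other.
husbands = [7, 5, 8, 6, 9]
wives = [6, 5, 9, 4, 8]

# Paired approach: compare the scores within each couple,
# then summarize the differences.
differences = [h - w for h, w in zip(husbands, wives)]
mean_difference = mean(differences)

# Independent samples: unrelated men and women, so there is nothing to pair
# and the samples may differ in size.
men = [7, 5, 8, 6, 9, 4]
women = [6, 5, 9, 4]

# Independent approach: summarize each sample, then compare the summaries.
difference_of_means = mean(men) - mean(women)

print(differences, mean_difference, difference_of_means)
```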
____________________
*National polling organizations draw samples representative of the adult, non-institutionalized, English-speaking, mainland US population without such a comprehensive list of people.  Such organizations start with a list of census tracts and take a probability sample of these.  Then they obtain the census tract maps for the selected tracts and take a sample of blocks within each census tract.  At the block level, they hire an individual who walks the block and lists the households on it.  The organization then draws a sample of households in each of the selected blocks.  Interviewers approach each household and make a list of the eligible respondents in the household.  From this list the interviewer selects the person to be interviewed using a pre-established probability formula.  Such a sampling plan is called a multi-stage cluster sample.
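     A multi-stage cluster sample can be sketched in Python as nested random draws.  The sketch is illustrative only: the tract, block, and household labels are hypothetical, a fixed number of units is drawn at each stage for simplicity, and the final step of selecting one respondent within each household is omitted.

```python
import random

def multistage_cluster_sample(tracts, n_tracts, n_blocks, n_households, seed=None):
    """Sample tracts, then blocks within each tract, then households per block.

    `tracts` maps tract name -> block name -> list of household labels
    (a hypothetical structure used only for this sketch).
    """
    rng = random.Random(seed)
    selected = []
    for tract in rng.sample(sorted(tracts), n_tracts):            # stage 1
        blocks = tracts[tract]
        for block in rng.sample(sorted(blocks), n_blocks):        # stage 2
            households = blocks[block]
            for household in rng.sample(households, n_households):  # stage 3
                selected.append((tract, block, household))
    return selected

# Hypothetical frame: 10 tracts, each with 5 blocks of 20 households.
tracts = {
    f"tract{t}": {
        f"block{b}": [f"t{t}-b{b}-h{h}" for h in range(20)] for b in range(5)
    }
    for t in range(10)
}

households = multistage_cluster_sample(
    tracts, n_tracts=3, n_blocks=2, n_households=4, seed=1
)
print(len(households))  # 3 tracts x 2 blocks x 4 households = 24
```

     Only the selected blocks need a household list, which is why this design works without a comprehensive list of the whole population.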