Missing Data

  • Not all questions will be answered by all people, so missing data must be dealt with.
  • There are various types of missing data:
    • Questions which the respondent was not supposed to answer (contingency questions)
    • Questions which were not asked (in interviews), were missed, or received an unclear response
    • Questions which the respondent refused to answer (sometimes hard to tell from missed questions in questionnaires that are not personally administered)
    • The "don't know" and "no opinion" responses can be treated as missing data, but in some cases it is better to leave them in the data as a separate category.
  • You can give a different code to each type of missing data if necessary, but each code should be clearly different from the valid responses and valid codes for that question.
  • It is easier to use the same missing-data code for every question, so the code(s) chosen should not be a valid response to any question.
  • Usually the number 0, 9, or 99 is used, because these are unlikely to be confused with valid data.
  • Biases in the data can occur because of missing data. It is important to check whether a certain type of person did not answer a particular question, which would skew the data. To check for biases, cross-tabulate whether people answered the question against their responses to other questions and see if patterns emerge.
  • There are various ways to minimize the effect of missing data from the sample:
    • Delete all cases that have any missing data. This is only useful if there are only a small number of cases which have missing data, as deleting too many cases can lead to a greatly reduced sample size.
    • Delete the variable which is causing the non-response. If there is one variable which has a lot of missing data, then it can simply be discarded. This method only works if there is one question which people refuse to answer and if that question is not important to the study.
    • Pairwise deletion applies to multivariate analyses that are built on a zero-order correlation matrix. Each correlation is calculated from every case that has valid values on both variables in that pair, so a case is left out only of the pairs where its data is missing. The problem with this method is that different correlations end up based on different subsets of cases, which can distort the results.
    • The mean approach simply substitutes the sample mean of the variable for the missing data. A slightly more complex method is to divide the sample into groups on certain background characteristics and substitute the mean for the respondent's group. The groups must be selected on the basis that they correlate strongly with the variable that has missing data. The group mean method preserves more variability than the sample mean approach, but it increases the correlation between the group characteristic and the variable being filled in.
    • Group-based random assignment can be used to maintain variability. This technique takes the value of the previous case (within the same group) for that variable and enters it for the missing data. This avoids exaggerating the relationship between the group characteristic and the variable, and it avoids losing data altogether.
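
The approaches above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six cases, the item names ("satisfaction", "trust"), the grouping variable, and the choice of 9 as the missing-data code are all made up for the example, and the helper functions are hypothetical names, not part of any library. The sketch recodes the missing-data code, cross-tabulates non-response by group as a bias check, and then applies case deletion, the sample mean, the group mean, and the previous-case method.

```python
from collections import Counter
from statistics import mean

MISSING = 9                        # assumed missing-data code; valid answers run 1-5
ITEMS = ["satisfaction", "trust"]  # hypothetical questionnaire items

# Hypothetical cases, listed in the order they were collected.
raw = [
    {"group": "A", "satisfaction": 2, "trust": 1},
    {"group": "A", "satisfaction": 9, "trust": 2},
    {"group": "A", "satisfaction": 4, "trust": 9},
    {"group": "B", "satisfaction": 5, "trust": 4},
    {"group": "B", "satisfaction": 9, "trust": 5},
    {"group": "B", "satisfaction": 3, "trust": 5},
]

# Recode the missing-data code to None so it cannot be mistaken for an answer.
clean = [
    {k: (None if k in ITEMS and v == MISSING else v) for k, v in case.items()}
    for case in raw
]

# Bias check: cross-tabulate group by whether "satisfaction" was skipped.
skipped_by_group = Counter((c["group"], c["satisfaction"] is None) for c in clean)

# Delete every case that has any missing data.
listwise = [c for c in clean if all(c[i] is not None for i in ITEMS)]


def observed(cases, item):
    """Return the valid (non-missing) values of one item."""
    return [c[item] for c in cases if c[item] is not None]


def fill_sample_mean(cases, item):
    """Replace missing values with the mean of the whole sample."""
    m = mean(observed(cases, item))
    return [m if c[item] is None else c[item] for c in cases]


def fill_group_mean(cases, item):
    """Replace missing values with the mean of the case's own group.

    Assumes every group has at least one valid answer for the item.
    """
    groups = {c["group"] for c in cases}
    means = {
        g: mean(observed([c for c in cases if c["group"] == g], item))
        for g in groups
    }
    return [means[c["group"]] if c[item] is None else c[item] for c in cases]


def fill_previous_case(cases, item):
    """Replace missing values with the previous case's value in the same group.

    The first case in a group has no previous case, so it stays None.
    """
    last_seen = {}
    filled = []
    for c in cases:
        value = c[item] if c[item] is not None else last_seen.get(c["group"])
        if value is not None:
            last_seen[c["group"]] = value
        filled.append(value)
    return filled


print("skipped by group:", dict(skipped_by_group))
print("cases kept after deletion:", len(listwise))
print("sample mean:  ", fill_sample_mean(clean, "satisfaction"))
print("group mean:   ", fill_group_mean(clean, "satisfaction"))
print("previous case:", fill_previous_case(clean, "satisfaction"))
```

Comparing the three filled-in lists shows the trade-off described above: the sample mean gives every missing case the same value, the group mean gives a different value per group, and the previous-case method reuses actual responses, so the filled-in values vary the way real answers do.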