Processing Quantitative Data Files
A quantitative dataset is typically a data matrix consisting of rows and columns where one row corresponds to one observation and one column corresponds to one variable. To analyse quantitative data, software for statistical analysis is needed, as well as at least basic knowledge of statistics and quantitative methods.
In social sciences, empirical quantitative data are usually collected by surveys, for example, by a postal, telephone, face-to-face or Internet survey. When such collection methods are used, the unit of observation is most often the individual, and the variables in the data matrix represent the survey responses of these individuals. Data matrices are sometimes also called microdata (referring to individual response data) or numerical data.
The data collection method and instrument affect how the data are stored in digital form. In online surveys, the responses are saved the moment they are submitted. In computer-assisted face-to-face or phone interviews, the interviewer records the responses as the interview proceeds. Paper questionnaires can be read optically, or the data can be entered manually. All recording methods can introduce mistakes into the data. The quality of a dataset can be improved with the following measures:
- check for and correct values out of range
- check the entered data against a few randomly selected questionnaires
- check the lengths of rows and the number of variables
- do not recode variables (e.g. collapse categories) while entering the data
- when the data have been entered, immediately create both a back-up file and a separate working copy
- when recoding variables, use statistical software and, if possible, recode the variables by using syntax
- be consistent in determining values for missing data and 'can't say' type of responses
- check the accuracy of frequencies
- create documentation of all changes made to the dataset
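Some of the checks listed above can be automated. The following Python sketch is only an illustration under stated assumptions: the variable names, the acceptable codes in `ALLOWED`, the use of 9 as a missing-data code and the sample data are all hypothetical. It checks that each row has as many fields as there are variables and flags values outside the defined range.

```python
import csv
import io

# Hypothetical codebook: variable name -> set of acceptable codes
# (9 is assumed here to be the missing-data code).
ALLOWED = {"q1": {1, 2, 3, 4, 5, 9}, "q2_1": {1, 2, 9}}


def check_rows(rows, header):
    """Return (row_number, problem) tuples for malformed rows and bad values."""
    problems = []
    for number, row in enumerate(rows, start=1):
        # Row length must match the number of variables.
        if len(row) != len(header):
            problems.append(
                (number, f"expected {len(header)} fields, found {len(row)}")
            )
            continue
        for name, raw in zip(header, row):
            allowed = ALLOWED.get(name)
            if allowed is None:
                continue  # no range defined for this variable
            try:
                value = int(raw)
            except ValueError:
                value = None  # non-numeric entries count as out of range
            if value not in allowed:
                problems.append((number, f"{name}: value {raw!r} out of range"))
    return problems


# Made-up sample data: row 2 has an out-of-range value, row 3 is too short.
SAMPLE = "id,q1,q2_1\n1,3,1\n2,7,2\n3,4\n"
reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
problems = check_rows(reader, header)
for number, problem in problems:
    print(f"row {number}: {problem}")
```

A script like this only reports problems. Corrections should still be made against the original questionnaires and documented.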
Be consistent when naming variables. Favour short names that correspond to the numbering of the instrument used in data collection. Examples:
- Variables relating to actual survey questions. A good name for the variable that contains the responses to the first question in the questionnaire is q1. If a question has several subquestions (e.g. the so-called grid questions), the following form can be used: q2_1, q2_2, q2_3, ...
- Variables relating to questions about background information. In addition to actual survey questions, the questionnaire often contains questions that chart the respondent's background. Generally, these questions do not have question numbers in the questionnaire. The background variables should be named in a consistent manner, for instance bv1, bv2, bv3, ...
- Other variables. The dataset can contain information that is not directly related to the research instrument (e.g. observation id, date of response and time spent in responding). Data collected through online questionnaires often contain technical information, for instance, browser information, time of response and the respondent's IP address. The variables related to this sort of information should also be named in a logical manner, for example t1, t2, .... If there are only a few variables of this kind in the dataset, descriptive names, such as 'ID', 'date', 'time' and 'IP' can also be used.
If a dataset consists of several different sources or of datasets that are combined, it is sensible to name the variables in a manner that makes it possible to see which subset each variable originates from. Different subsets can be named, for instance, a, b, c etc. and the variables in these subsets could be named a1, a2_1, a2_2, b1_1 or, alternatively, a_q1, a_q2_1, b_q1_1 etc.
Even though some statistical packages allow long variable names, it is best to avoid them because they may cause problems in file conversion. It is also best not to name variables according to their content, because this tends to produce abbreviations. Abbreviations can be ambiguous, which may make it impossible to connect a variable to the corresponding question in the questionnaire. Symbols and accented letters should also be avoided in variable names.
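The naming recommendations above can be checked mechanically before a file is converted or archived. The sketch below is one possible approach, not a rule from any particular statistical package: the maximum length of 8 characters and the allowed character set (unaccented letters, digits and underscores, starting with a letter) are assumptions chosen for illustration.

```python
import re

# Assumed convention: starts with an unaccented letter, followed by
# unaccented letters, digits or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MAX_LENGTH = 8  # assumed limit; adjust to the target software and format


def name_problems(name):
    """Return a list of reasons why a variable name breaks the convention."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("invalid characters")
    return problems


# Names in the style recommended above pass; long or accented names do not.
for candidate in ["q1", "q2_1", "bv3", "a_q2_1", "satisfaction_with_life", "ikä"]:
    print(candidate, name_problems(candidate))
```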
A variable label refers to the description of the contents of a variable. Different statistical packages and file formats limit the length of variable labels (for example, SPSS Portable has a limit of 255 characters). Even if a label has to be shortened, it should provide relevant information on the contents of the variable. The label should follow the original question as accurately as possible.
It is advisable to code the values of a variable to correspond to the numbering used in the research instrument, for instance:
| Response alternative | Value |
| Disagree to some extent | 2 |
| Neither agree nor disagree | 3 |
| Agree to some extent | 4 |
For missing data and 'can't say' types of classes, negative values and zero can be used. However, make sure that these values are clearly distinguishable from each other and from the valid values. The pre-defined values for missing data provided in statistical packages can also be used.
When creating value labels, it is advisable to follow the wording of the response alternatives used in the instrument. The maximum length of a value label depends on the software and file format. Often the maximum length of the label is very short (for example, 120 characters in SPSS 20.0.0 Portable). If a value label must be shortened, this should be done by using the wording and terminology in the research instrument as closely as possible to retain the original content.
When analysing the data, variables sometimes need to be recoded, or new variables based on them need to be formed. For example, questionnaires often ask for the respondent's year of birth, but the results are reported as age groups. All such changes made to variables must be well documented.
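The year-of-birth example can be sketched in Python as follows. This is an illustration only: the age group boundaries, the numeric codes, the variable names `bv1` and `agegroup` and the survey year are hypothetical. The original variable is kept and the recoded values go into a new variable, so the change stays traceable.

```python
import bisect

# Hypothetical lower bounds of the age groups 18-29, 30-44, 45-64 and 65+.
LOWER_BOUNDS = [18, 30, 45, 65]
LABELS = {0: "under 18", 1: "18-29", 2: "30-44", 3: "45-64", 4: "65 or over"}


def age_group(birth_year, survey_year):
    """Return the age group code, or None if the year of birth is missing.

    Age is approximated as the age reached during the survey year.
    """
    if birth_year is None:
        return None
    age = survey_year - birth_year
    # Number of lower bounds that are <= age gives the group code.
    return bisect.bisect_right(LOWER_BOUNDS, age)


# Made-up records; bv1 is assumed to hold the year of birth.
records = [{"id": 1, "bv1": 1990}, {"id": 2, "bv1": 1955}, {"id": 3, "bv1": None}]
for record in records:
    # Store the result in a new variable instead of overwriting bv1.
    record["agegroup"] = age_group(record["bv1"], survey_year=2020)
```

The group boundaries and the rule used to compute age belong in the dataset documentation alongside the new variable.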
In almost all datasets, there are variables with missing data for some cases. For instance, a respondent may have decided not to answer a question in the questionnaire, or there may have been a failure in collecting the response. If cases with missing data are removed from the analysis, the total number of cases decreases and the accuracy of results may suffer. If missing data are not evenly distributed among the cases, the results of the analysis may become considerably skewed. To ensure the accuracy of the results, it is important to process the missing data before analysing the dataset.
The missing data should be coded so that they can be clearly distinguished from the "actual" values of a variable. Often, values such as 9, 99 or 999 are used to signify missing data. Zero is also frequently used, but only when 0 is not a valid value of the variable.
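The sketch below illustrates why missing-data codes must be declared and excluded before analysis. The variable names, the codes in `MISSING_CODES` and the records are hypothetical. The functions compute the share of missing values per variable and return only the valid values.

```python
from statistics import mean

# Hypothetical missing-data codes per variable.
MISSING_CODES = {"q1": {9}, "bv1": {99}}


def is_missing(variable, value):
    """True if the value is empty or one of the variable's missing codes."""
    return value is None or value in MISSING_CODES.get(variable, set())


def missing_share(records, variable):
    """Share of cases with missing data for one variable."""
    missing = sum(is_missing(variable, record[variable]) for record in records)
    return missing / len(records)


def valid_values(records, variable):
    """Values of one variable with missing data left out."""
    return [
        record[variable]
        for record in records
        if not is_missing(variable, record[variable])
    ]


# Made-up records; bv1 is assumed to be age in years.
records = [
    {"q1": 3, "bv1": 34},
    {"q1": 9, "bv1": 99},
    {"q1": 4, "bv1": 27},
    {"q1": None, "bv1": 51},
]

naive_mean = mean(record["bv1"] for record in records)  # treats 99 as an age
valid_mean = mean(valid_values(records, "bv1"))         # excludes the code 99
print(missing_share(records, "q1"), naive_mean, valid_mean)
```

In this made-up data, leaving the code 99 in place inflates the mean age, which is the kind of distortion that undeclared missing-data codes cause.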
If there are systematic errors in the dataset, it may be prudent to weight the observations. With weight variables, potential bias in age, gender and region distributions resulting from the sampling can be corrected. Clear documentation on the weighting methods and calculations used should be provided to ensure that people reusing the data also have an understanding of the variables created during the research process.
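One common way to build such weights is post-stratification, where each case is weighted by the ratio of its group's population share to its sample share. The sketch below assumes this technique for illustration. The source does not prescribe a particular method, and the groups and population shares are made up.

```python
from collections import Counter


def poststratification_weights(groups, population_shares):
    """Return one weight per case: population share / sample share of its group."""
    counts = Counter(groups)
    n = len(groups)
    return [population_shares[group] / (counts[group] / n) for group in groups]


# Made-up sample in which women are over-represented (3 of 4 cases).
sample_groups = ["f", "f", "f", "m"]
# Assumed population distribution.
population = {"f": 0.5, "m": 0.5}

weights = poststratification_weights(sample_groups, population)
print(weights)
```

With weights built this way, the weighted group distribution matches the assumed population distribution and the weights sum to the number of cases. The population figures used and their source should be recorded in the documentation.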
Most statistical packages allow users to process and analyse data with the help of a programming language, or syntax. Often the most effective features of statistical packages are available only through syntax commands, even though basic analyses can be performed through menus. The commands given in syntax can be saved to a separate file (a syntax file).
It is advisable to always use syntax rather than menus when editing data. Syntax allows users to see what changes have been made to the data and how. This makes it easy to perform quality control, search for potential mistakes, and make corrections and adjustments. Moreover, executing often-used commands is faster with syntax. Many statistical packages are compatible with several programming languages, which enables the user to create custom features or analyses.
It is recommended to write comments in the syntax file beside the commands to explain why a certain command is executed (for instance, why a variable is recoded). The name of the dataset, the version number, the date of creation and the name of the author are usually added as comments at the beginning of a syntax file.
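The same practice applies when data are processed with a general-purpose language instead of a statistical package's own syntax. The following Python sketch shows a commented header and a recode that records its reason in a change log. The dataset name, version, date, author, variable names and category mapping are all placeholders, and the `recode` helper is a hypothetical function written for this example.

```python
# Dataset:  example_survey (placeholder name)
# Version:  1.1 (placeholder)
# Created:  2024-01-15 (placeholder)
# Author:   A. Researcher (placeholder)

change_log = []


def recode(records, source, target, mapping, reason):
    """Add `target` as a recoded copy of `source` and log why it was done.

    Values not listed in `mapping` (e.g. missing-data codes) are kept as-is.
    """
    for record in records:
        record[target] = mapping.get(record[source], record[source])
    change_log.append(f"{target}: recoded from {source} using {mapping} ({reason})")


# Made-up records; 9 is assumed to be the missing-data code for q1.
records = [{"q1": 2}, {"q1": 5}, {"q1": 9}]

# Collapse five response categories into three because the extreme
# categories have too few cases for the planned cross-tabulation.
recode(
    records,
    source="q1",
    target="q1_r",
    mapping={1: 1, 2: 1, 3: 2, 4: 3, 5: 3},
    reason="collapsed to three categories for cross-tabulation",
)

for entry in change_log:
    print(entry)
```

Keeping the original variable and writing the reason into both a comment and the log means the change can be reviewed or reversed later.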