Processing Quantitative Data Files
A quantitative dataset is typically a data matrix consisting of rows and columns where one row corresponds to one unit of observation and one column corresponds to one variable. To analyse quantitative data, software for statistical analysis is needed, as well as at least basic knowledge of statistics and quantitative methods.
In social sciences, empirical quantitative data are usually collected by surveys, for example, by a postal, telephone, face-to-face or Internet survey. When such collection methods are used, the unit of observation is most often the individual, and the variables in the data matrix represent the survey responses of these individuals. Data matrices are sometimes also called microdata (referring to individual response data) or numerical data.
Data collection method and instrument affect how the data are stored in digital form. In online surveys, the responses are saved the moment they are submitted. In computer-assisted face-to-face or phone interviews, the interviewer records the responses as the interview proceeds. Paper questionnaires can be optically read or the data can be entered manually. All recording methods have the potential to cause mistakes in the data. The quality of a dataset can be improved with the following measures:
- check for and correct values out of range
- check the entered data against a few randomly selected questionnaires
- check the lengths of rows and the number of variables
- do not recode variables (e.g. collapse categories) while entering the data to ensure that the original data are not lost
- when the data have been entered, immediately create both a back-up file and a separate working copy
- when recoding variables, use statistical software and, if possible, recode the variables by using syntax
- be consistent in determining values for missing data and 'can't say' type of responses
- check the accuracy of frequencies
- create documentation (e.g. within the syntax) of all changes made to the dataset, such as anonymisations, categorisations, new variables, and removal of duplicates
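Some of the checks above can be automated. The following is a minimal sketch in Python with pandas (the document itself is software-agnostic, and the variable names and valid ranges here are hypothetical): it flags values that fall outside the range allowed by the questionnaire.

```python
import pandas as pd

# Example data matrix: rows = observations, columns = variables.
# For q1, values 1-5 are valid responses and 9 is the agreed missing-data code;
# for bv1, only 1 and 2 are valid. The out-of-range entries are typing errors.
df = pd.DataFrame({
    "q1": [1, 2, 5, 9, 7],
    "bv1": [1, 2, 2, 1, 7],
})

# Check for and list values out of range
# (9 is treated as a legitimate missing-data code for q1, not an error).
q1_out_of_range = df.loc[~df["q1"].isin([1, 2, 3, 4, 5, 9]), "q1"]
bv1_out_of_range = df.loc[~df["bv1"].isin([1, 2]), "bv1"]

print(len(q1_out_of_range))   # 1 (the value 7)
print(len(bv1_out_of_range))  # 1 (the value 7)
```

Running such checks from a saved script rather than interactively also produces the documentation of data-quality work that the list above calls for.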
Be consistent when naming variables. Favour short names that correspond to the numbering of the instrument used in data collection. Examples:
- Variables relating to actual survey questions: A good name for the variable that contains the responses to the first question in the questionnaire is q1. If a question has several subquestions (e.g. the so-called grid questions), the following form can be used: q2_1, q2_2, q2_3, ...
- Variables relating to questions about background information: In addition to actual survey questions, the questionnaire often contains questions that chart the respondent's background. Generally, these questions do not have question numbers in the questionnaire. The background variables should be named in a consistent manner, for instance bv1, bv2, bv3, ...
- Other variables: The dataset can contain information that is not directly related to the research instrument (e.g. observation id, date of response and time spent in responding). Data collected through online questionnaires often contain technical information, for instance, browser information, time of response and the respondent's IP address. The variables related to this sort of information should also be named in a logical manner, for example t1, t2, .... If there are only a few variables of this kind in the dataset, descriptive names, such as 'ID', 'date', 'time' and 'IP' can also be used.
If a dataset consists of several different sources or of datasets that are combined, it is sensible to name the variables in a manner that makes it possible to see which subset each variable originates from. Different subsets can be named, for instance, a, b, c etc. and the variables in these subsets could be named a1, a2_1, a2_2, b1_1 or, alternatively, a_q1, a_q2_1, b_q1_1 etc.
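When subsets are combined with software, the subset prefix can be added programmatically rather than by hand. A small sketch, assuming a pandas workflow (the subset and variable names are hypothetical):

```python
import pandas as pd

# Two subsets, a and b, to be combined into one dataset.
subset_a = pd.DataFrame({"q1": [1, 2], "q2_1": [3, 4]})
subset_b = pd.DataFrame({"q1_1": [5, 6]})

# Prefix every variable with its subset identifier so the origin stays visible.
subset_a = subset_a.add_prefix("a_")   # q1 -> a_q1, q2_1 -> a_q2_1
subset_b = subset_b.add_prefix("b_")   # q1_1 -> b_q1_1

combined = pd.concat([subset_a, subset_b], axis=1)
print(list(combined.columns))  # ['a_q1', 'a_q2_1', 'b_q1_1']
```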
Even though some statistical packages allow long variable names, the recommendation is to avoid them because they may cause problems in file conversion. It is also best not to name variables according to their content, because content-based names usually end up abbreviated, and the meanings of abbreviations can be ambiguous, which may make it impossible to connect a variable to the corresponding question in the questionnaire. It is also recommended to avoid symbols and accented letters in variable names.
A variable label is a description of the contents of a variable. If the label length limit allows, it is recommended to include the entire question text in the label. Different statistical packages and file formats limit the length of variable labels. Even if a label needs to be shortened, it should still provide the relevant information on the contents of the variable. To achieve this, it is advisable to shorten the pre-texts that introduce the question, or to leave out parts that are less significant for understanding its main content. The shortened label should still accurately represent the original question and use its terminology. The question text and response options should match each other, meaning that value labels should directly "respond" to the question presented in the variable label. If a table is created based on the data, it should form a coherent whole whose contents are comprehensible directly from the table. For continuous variables, the label should indicate the unit of measurement (e.g. hours, euros, metres, times per day) in which the value is given.
It is advisable to code the values of a variable to correspond to the numbering and order used in the research instrument (e.g. a questionnaire), for instance:
- Disagree to some extent: 2
- Neither agree nor disagree: 3
- Agree to some extent: 4
When creating value labels, it is advisable to follow the wording of the response alternatives used in the instrument. The maximum length of a value label depends on the software and file format. Often the maximum length of the label is very short. If a value label must be shortened, this should be done by using the wording and terminology in the research instrument as closely as possible to retain the original content.
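In software that does not store value labels in the file itself, the coding scheme can be kept as an explicit mapping next to the data. A sketch, assuming a pandas workflow and using the example scale above (the variable name is hypothetical):

```python
import pandas as pd

# Value labels for variable q1, following the questionnaire's coding and order.
q1_labels = {
    2: "Disagree to some extent",
    3: "Neither agree nor disagree",
    4: "Agree to some extent",
}

# Coded responses as stored in the data matrix.
responses = pd.Series([2, 4, 3, 2])

# Attach the labels when results need to be presented in readable form.
labelled = responses.map(q1_labels)
print(labelled.tolist())
```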
When analysing the data, variables sometimes need to be recoded, or new variables need to be formed based on them. For example, respondents' years of birth are often asked in questionnaires, but the results are reported as age groups. All such changes made to variables must be well documented.
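The birth-year example can be sketched as follows, again assuming a pandas workflow; the survey year and the age-group boundaries are assumptions chosen for illustration:

```python
import pandas as pd

survey_year = 2017
birth_year = pd.Series([1955, 1972, 1990, 2001])

# New variable derived from year of birth: age at the time of the survey.
age = survey_year - birth_year

# Recode age into groups for reporting; the original variable is kept intact.
age_group = pd.cut(
    age,
    bins=[15, 29, 44, 59, 74],
    labels=["16-29", "30-44", "45-59", "60-74"],
)
print(age_group.tolist())  # ['60-74', '45-59', '16-29', '16-29']
```

Keeping such recodes in a saved script serves as the documentation of changes that the paragraph above requires.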
In almost all datasets, some variables have missing data for some cases. For instance, a respondent may have decided not to answer a question in the questionnaire, or there may have been a failure in collecting the response. If cases with missing data are removed from the analysis, the total number of cases decreases and the accuracy of results may suffer. The results may also become significantly skewed if missing data are not evenly distributed among the cases. To ensure accurate results, it is important to process missing data before analysing the dataset.
The missing data should be coded so that they can be clearly distinguished from the "actual" values of a variable. Often, values such as 9, 99 or 999 are used to signify missing data. Zero is also frequently used, but in those cases 0 should not be an acceptable value within the range. The pre-defined values for missing data provided in statistical packages can also be used. In the case of survey data, response options 'can't say', 'don't want to say', and 'don't know' are generally not regarded as missing data, but rather as interesting information that can be utilised in the research.
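A minimal sketch of this distinction, assuming a pandas workflow (the variable name and codes are hypothetical): 99 is the agreed missing-data code, while 8 marks a 'can't say' response that is kept as substantive information.

```python
import pandas as pd

# Coded responses: 99 = missing data, 8 = 'can't say'.
q5 = pd.Series([1, 3, 99, 8, 2])

# Recode the missing-data code to an actual missing value so that
# analyses exclude it automatically; 'can't say' responses are retained.
q5_clean = q5.replace(99, float("nan"))

print(q5_clean.isna().sum())   # 1 missing case
print((q5_clean == 8).sum())   # 1 'can't say' response kept in the data
```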
If there are systematic errors in the dataset, it may be prudent to weight the observations. With weight variables, potential bias in age, gender and region distributions resulting from the sampling can be corrected. Clear documentation on the weighting methods and calculations used should be provided to ensure that people reusing the data also have an understanding of the variables created during the research process.
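As a simplified illustration of such a correction, the following sketch computes post-stratification weights by gender; the population shares and the sample are hypothetical, and real weighting schemes typically combine several background variables.

```python
import pandas as pd

# Sample in which women are over-represented (75 % of respondents).
df = pd.DataFrame({"gender": ["f", "f", "f", "m"]})

# Known population shares (hypothetical figures for illustration).
population_share = {"f": 0.5, "m": 0.5}

# Shares observed in the sample.
sample_share = df["gender"].value_counts(normalize=True)

# Weight = population share / sample share, so that weighted
# distributions match the population.
df["weight"] = df["gender"].map(lambda g: population_share[g] / sample_share[g])
print(df["weight"].tolist())
```

The weights sum to the number of cases, and the weighted gender distribution now matches the population. The calculation itself, saved as a script, is the kind of weighting documentation the paragraph above calls for.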
Most statistical packages allow users to process and analyse data with the help of a programming language, or syntax. Often the most effective features of statistical packages are available only through syntax commands, even though basic analyses can be performed through menus. The commands given in syntax can be saved to a separate file (a syntax file).
It is advisable to always use syntax rather than menus when editing data. Syntax allows users to see what changes have been made to the data and how. This makes it easy to perform quality control, search for potential mistakes, and make corrections and adjustments. Moreover, executing often-used commands is faster with syntax. Many statistical packages are compatible with several programming languages, which enables the user to create custom features or analyses.
It is recommended to write comments in the syntax file beside the commands to explain why a certain command is executed (for instance, why a variable is recoded). The name of the dataset, version number, date created and the name of author are usually added as comments at the beginning of a syntax file.
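The same practice applies whether the syntax is SPSS, Stata or a general-purpose language. A hypothetical example of a documented processing script in Python (the file name, dataset, version, date and author are invented for illustration):

```python
# evs2017_recode.py  --  processing script for the example dataset
# Dataset: European Values Survey 2017 (evs2017_data_original)
# Version: 1.1   Created: 2017-11-02   Author: N.N.

import pandas as pd

# Small stand-in for the real data matrix.
df = pd.DataFrame({"q1": [1, 2, 99]})

# Recode 99 to missing, because 99 marks 'no response' in the instrument.
df["q1"] = df["q1"].replace(99, float("nan"))

print(df["q1"].isna().sum())  # 1
```

Each command carries a comment explaining why it is executed, and the header identifies the dataset, version, date and author, as recommended above.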
Folders and files should be named in an uncomplicated and logical manner. It is advisable to save basic information about the files in the same location as the metadata. Modern software allows fairly long file names, and the name should include at least an abbreviation of the project name, the year, the file contents, and the file version. For instance, the original SPSS file of the European Values Survey 2017 survey data could be named evs2017_data_original.por, and the questionnaire used in data collection evs2017_questionnaire_eng.odt. If the dataset has been given a unique identifier, it is advisable to include it in the names of all files related to the data.
Example: Files belonging to dataset FSD2248 ISSP 2006: Role of Government IV: Finnish Data archived at FSD:
Directory of X:\Data\FSD2248
|   cbF2248.pdf
|   meF2248.xml
|   mef2248e.xml
|   quF2248_fin.pdf
|   quF2248_sve.pdf
|   vaf2248.xml
|
+---Data
|       daF2248.csv
|       daF2248.por
|       labF2248.html
|       syF2248.SPS
|
\---Original
        ISSP06_FSDdata.sas7bdat
        ISSP06_FSDdata.sav
        ISSP06_jakaumat.xls
        ISSP06_labfor.sas
        ISSP06_muuttujalistaus.lst
        ISSP06_questionnaire_fin.pdf
        ISSP06_questionnaire_swe.pdf
        ISSP06_study_description.doc
        ISSP_vastaus%_2002-06.xls
In the example, a folder named FSD2248 has been created for the dataset, based on the identifier assigned by the archive. The first characters of a file name indicate what the file contains:
- cb = codebook
- da = data file
- sy = syntax file
- lab = label file
- me = data description/metadata
- qu = questionnaire
- va = variable description
Fxxxx is the identifier of the dataset, and the file language is indicated at the end of the file name. The Data folder contains the same data in two different file formats (.csv and .por) to facilitate the use of different statistical packages. The HTML file contains the variable and value labels for the csv file. The Original folder contains the original data files that the research team deposited at FSD. When FSD has completed processing the data, the Original folder and all its contents are deleted.