Data Description and Metadata
Carefully describing and documenting the content, data collection procedures and variables of research data is essential to ensure the usability of the data. Without this descriptive information, that is, metadata, research data are simply a meaningless collection of files, values and characters. An extensive description also facilitates data discovery. Comprehensive metadata containing all key information on a research dataset is also an important part of assuring the reliability of reported results.
Descriptions of data archived at FSD are published in the data catalogue available in the Aila Data Service and in the common data catalogue of the European data archives. When a dataset is published in a catalogue, citing it is easy.
When creating metadata, it is important to focus on describing the dataset itself instead of the results, conclusions and publications based on the data.
For each individual research dataset, it is advisable to create a separate directory where both the data and the metadata are stored. Some metadata are usually also included in the actual data file (e.g. variable labels in quantitative data or information in a data unit in qualitative data). Data description can be preserved, for example, as a text file by including the basic information on the data listed below. Another alternative is to select a suitable metadata standard.
Describing data following a metadata standard and storing the metadata in a database is recommended especially when there are many datasets or a large amount of metadata. Using a database enables faster searches and facilitates creating various kinds of reports based on the metadata. For long-term preservation, using structured XML is advised.
When applicable, the following information should be stored:
- description of how the study was conducted
- information on the data collection instrument
- description of data files
- description of variables
- information on data availability
- contextual information and paradata
In addition, instructions and other documents given to individuals collecting and processing the data (e.g. interviewers, data entry clerks, coders or transcribers) should be included in the metadata. These can be saved as text or PDF files, for instance.
Crucial information includes the original purpose of the data, creators/principal investigators, producers/funders, selection criteria of study population and units of observation, and information on how the data were collected. The following information should be documented for both quantitative and qualitative data:
Original purpose : Information on the study for which the data were collected, the theoretical framework and the operationalisations of concepts under study.
Publications : A list of publications in which the data were used and/or described.
Creators / principal investigators : Creators of the data are the individuals who were responsible for the content of the data, often the directors/coordinators of the research project. Additionally, the data collector (can be an external organisation) and data entry and processing personnel (e.g. quantitative data coders and entry personnel or qualitative data transcribers) should be documented as well as the organisations of these individuals.
Producers/funders : The organisation(s) or individual(s) who funded or ordered the collection of data or the project for which the data were collected.
Population/universe : The population covered by the data, i.e. the group of people/things/phenomena which were examined or which the results are based on. (Example: People aged 18 – 79 permanently residing in Finland)
Unit of observation / unit of analysis : The units on which the empirical observations are based. The unit of observation can be, for example, an individual, administrative area, phenomenon, or a text unit such as a newspaper article. Even if individuals are interviewed, the unit of analysis can be something else, such as an organisation which the individual represents.
Data unit : For qualitative data, each collected data unit should be listed. These can be individual interviews, recordings of interactions, diary entries, field notes, newspaper clippings etc. Information on each data unit should be carefully documented. For newspaper articles, for example, this information includes the name of the paper, date, pages, author and title/topic. For interviews, it includes background information on the interviewee and any other relevant background information. Basic information on each data unit should be included in the unit itself as well as in a separate list. For instance, key background information can be included at the beginning of each interview transcript.
Selection criteria of observation units or data units : Description of the sampling procedure and/or other means used to select the units of observation or data units. For qualitative data, the selection criteria for data units should be described by explaining, for example, how the interviewees or newspaper articles were selected. For quantitative data, the type of the sample and the method used to select a sample that is representative of the population (the group that the researcher aims to examine) should be described. The size of the target population and sample can also be included.
Data collection : Beginning and end dates of data collection should be documented as well as the mode of collection (e.g. telephone interview, computer-assisted personal interview, web-based questionnaire, audio recording, audiovisual recording or invitation to write about personal experiences). For quantitative data, information on possible reminders and new collection rounds as well as analysis of non-response bias should be included. Ways used to contact the research subjects and collect data should be documented for qualitative data. Information on the interviewer (e.g. age, gender, education, occupation), interview location and other contextual information may also be significant. Information on the interviewer and interview situation should be saved in the data matrix in quantitative data or included in the basic information on data units in qualitative data.
Source data : If the data are not collected through surveys or interviews but are based on an existing data source, information on the source data should be documented: for example, books, articles, register data and online public communication sources such as blogs or Twitter.
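The fields described above can be kept in an ordinary text file. The following is a minimal sketch, assuming one "Field : value" line per item; the PLACEHOLDER marker, the function name and the sample values are illustrative assumptions, not part of any standard. The "Data unit" field is left out because qualitative data units are better kept in a separate list.

```python
# Field names follow the list above; the values used below are hypothetical.
FIELDS = [
    "Original purpose",
    "Publications",
    "Creators / principal investigators",
    "Producers/funders",
    "Population/universe",
    "Unit of observation / unit of analysis",
    "Selection criteria of observation units or data units",
    "Data collection",
    "Source data",
]

# Hypothetical marker for fields that have not been filled in yet.
PLACEHOLDER = "NOT DOCUMENTED"


def render_description(values):
    """Return (text, missing) for a plain-text data description."""
    lines = []
    missing = []
    for field in FIELDS:
        value = values.get(field, "").strip()
        if not value:
            missing.append(field)
        lines.append(f"{field} : {value or PLACEHOLDER}")
    return "\n".join(lines) + "\n", missing


text, missing = render_description({
    "Population/universe": "People aged 18 - 79 permanently residing in Finland",
    "Data collection": "Hypothetical web-based questionnaire",
})
print(text)
print("Still to document:", ", ".join(missing))
```

Listing the undocumented fields explicitly makes gaps visible early, while the information is still easy to obtain from the research team.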
The writing invitation provided to research participants (e.g. published in a newspaper/magazine or online), the interview questions, the questionnaire or interview frame, the cover letter and any interviewer instructions should be saved in the same directory as the data and metadata.
One blank copy of all language versions of the data collection instrument should be saved. In addition to digital copies, it is advisable to save one blank paper questionnaire if it exists.
For computer-assisted surveys without an actual questionnaire, questions and response alternatives as well as the order they were presented in can be saved as a text file.
A research dataset may consist of one or more files. One quantitative data file typically contains dozens or hundreds of variables. One qualitative data file, however, often contains only one data unit, such as one individual interview.
All properties of a single file should be described. It is recommended that the following aspects are documented for each file:
- name of the file
- file location (file path)
- file size
- file format
- software used to create the file
- date of creation
- file creator
- file version
- access rights set for the file
Much of this information is easily listed by using the dir command in the Windows command line (Command Prompt), which displays a list of files and subdirectories in a directory. For example, the command
C:\> dir Data /S >filelist.txt
creates a new file called filelist.txt which contains a list of all files and subdirectories in the directory Data, including the contents of nested subdirectories (this is what the /S switch does).
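On other operating systems, or when the listing should be machine-readable, a short script can collect the same information. The following is a minimal sketch using only the Python standard library; the function names and CSV column names are assumptions made for this example. It records the last modification time, because a reliable creation date is not available on every file system, and it cannot determine the file creator, version or the software used, which still have to be documented by hand.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Column names are chosen for this sketch only.
COLUMNS = ["name", "path", "size_bytes", "format", "modified_utc"]


def list_files(root):
    """Return one dict per file under root with basic file properties."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        records.append({
            "name": path.name,
            "path": str(path.relative_to(root)),
            "size_bytes": info.st_size,
            # File extension as a rough indication of the file format.
            "format": path.suffix.lstrip(".").lower(),
            "modified_utc": modified.isoformat(timespec="seconds"),
        })
    return records


def write_inventory(records, out_path):
    """Write the file records to a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records)
```

Calling write_inventory(list_files("Data"), "filelist.csv") would produce a listing of the directory Data comparable to the dir example, in a form that can be opened in a spreadsheet or loaded into a database.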
The following information should be documented on variables in a quantitative dataset:
- number of variables and units of observation
- list of variables with the name and label of each variable as well as its location in the file and its values and value labels
- frequency distribution of each variable
- information on the classifications used, e.g. "main categories of the ISCO-88 were used in the occupational classification" or "country codes: 3-digit ISO 3166"
- meanings of abbreviations used
- codings for missing data
- information on constructed variables (e.g. how the weight variables and sum variables were calculated)
- recoding and standardising of variables
- data protection measures taken
If the variables or the values of the variables are dissimilar to the questions or response alternatives in the questionnaire, these dissimilarities should be explained.
In addition, any changes and edits made to the data during processing should be documented (e.g. removal of duplicates, removal of exceptional values). Some of the descriptive information can be documented in the data file itself.
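Several of the variable-level items above (position in the file, frequency distributions, codings for missing data) can be derived from the data file itself. The following is a minimal sketch, assuming the data are available as CSV with variable names in the header row; the missing-data codes, variable names, value labels and sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical codes marking missing data; replace with the dataset's own.
MISSING_CODES = {"", "-9", "99"}


def describe_variables(rows, value_labels=None, missing_codes=MISSING_CODES):
    """Summarise each column: position, valid/missing counts and frequencies."""
    value_labels = value_labels or {}
    if not rows:
        return []
    summary = []
    for position, name in enumerate(rows[0].keys(), start=1):
        values = [row[name] for row in rows]
        valid = [v for v in values if v not in missing_codes]
        summary.append({
            "name": name,
            "position": position,
            "n_valid": len(valid),
            "n_missing": len(values) - len(valid),
            "frequencies": dict(Counter(valid)),
            "value_labels": value_labels.get(name, {}),
        })
    return summary


# Hypothetical three-row data file with two variables.
SAMPLE = "q1,q2\n1,2\n1,-9\n2,2\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
codebook = describe_variables(rows, {"q1": {"1": "Yes", "2": "No"}})
for entry in codebook:
    print(entry["position"], entry["name"], entry["frequencies"],
          "missing:", entry["n_missing"])
```

Items such as the classifications used, the construction of weight or sum variables and data protection measures cannot be derived this way and need to be written down separately.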
The description of the data should include information on the availability of the data. The description provides information on where the data are stored, how they can be used, whether there are any special conditions on the use of the data and who can provide additional information.
Contextual information refers to the external circumstances and events that may have affected the units of observation at the time of data collection.
For example, the economic situation, political events, public opinion and various changes in society at the time of data collection, as well as sudden natural disasters and accidents, may affect the attitudes, responses and thoughts of research participants when the study is being conducted.
Statistics offer general macro-level information on society at the time of data collection. Individual events may be logged in a diary during data collection, and main news topics and news items related to the topic under study could be documented.
Paradata refers to empirical information on the data collection process. Paradata include the beginning and ending time of an interview, duration of the interview or parts of it, time taken to respond to each question, visual observations made by the interviewer and opinions on the interview situation. Computer-assisted surveys and internet surveys generate especially large amounts of paradata. In quantitative data, paradata variables can be stored in the same file with the actual study variables or they can be stored in a separate file. In qualitative data, paradata may be included at the beginning of each data unit or in a separate file (e.g. who was present during data collection or which third persons joined the interview later on).
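As an illustration of how such paradata might be recorded, the following minimal sketch builds one record per interview and derives the duration from the start and end times. The field names, the unit identifier and the timestamps are hypothetical; timestamps are assumed to be in ISO 8601 format.

```python
from datetime import datetime


def interview_paradata(unit_id, start, end, present=()):
    """Build a paradata record with the interview duration in minutes."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        raise ValueError("end time precedes start time")
    return {
        "unit_id": unit_id,
        "start": started.isoformat(),
        "end": ended.isoformat(),
        "duration_min": (ended - started).total_seconds() / 60,
        "present": list(present),  # who was present during data collection
    }


# Hypothetical interview lasting 42 minutes and 30 seconds.
record = interview_paradata(
    "interview_01",
    "2016-05-02T10:00:00",
    "2016-05-02T10:42:30",
    present=["interviewer", "interviewee"],
)
print(record["duration_min"])
```

Records like this can be written as extra variables in a quantitative data file or kept in a separate paradata file, as described above.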
Metadata can be saved in an ordinary text file. However, various metadata standards can also be used to facilitate documentation.
The Data Documentation Initiative (DDI) is an international metadata standard designed specifically for describing research data. At the Finnish Social Science Data Archive, metadata are stored in XML files using the DDI Codebook 2.1 specification. Structured XML is suited for long-term preservation, and various documents can be created based on it. DDI Lifecycle (DDI 3), in turn, allows documenting and managing data across the entire life cycle.
The DDI standard also includes controlled vocabularies which provide terms to describe various aspects of data, such as analysis unit, mode of collection, data format, sampling procedure, time dimension, type of instrument, and data source types. The vocabularies are useful in facilitating and standardising data description.
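To give a rough idea of what structured XML metadata looks like, the following sketch builds a small, simplified document with Python's standard library. The element names (codeBook, stdyDscr, titl, IDNo, universe, dataDscr, var, labl, catgry, catValu) are taken from DDI Codebook, but the namespace and many required elements and attributes are omitted, so the output is an assumption-laden illustration and would not validate against the official schema. The study identifier, title and variable are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_codebook(study_id, title, universe, variables):
    """Build a simplified, DDI Codebook-style XML tree (not schema-validated)."""
    root = ET.Element("codeBook")

    # Study-level description: title, identifier and population.
    study = ET.SubElement(root, "stdyDscr")
    title_stmt = ET.SubElement(ET.SubElement(study, "citation"), "titlStmt")
    ET.SubElement(title_stmt, "titl").text = title
    ET.SubElement(title_stmt, "IDNo").text = study_id
    summary = ET.SubElement(ET.SubElement(study, "stdyInfo"), "sumDscr")
    ET.SubElement(summary, "universe").text = universe

    # Variable-level description: name, label, values and value labels.
    data = ET.SubElement(root, "dataDscr")
    for name, label, categories in variables:
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label
        for value, value_label in categories.items():
            category = ET.SubElement(var, "catgry")
            ET.SubElement(category, "catValu").text = value
            ET.SubElement(category, "labl").text = value_label
    return root


# All values below are hypothetical.
root = build_codebook(
    "EXAMPLE-0001",
    "Example Survey",
    "People aged 18 - 79 permanently residing in Finland",
    [("q1", "Hypothetical yes/no question", {"1": "Yes", "2": "No"})],
)
ET.indent(root)  # pretty-printing; requires Python 3.9 or later
print(ET.tostring(root, encoding="unicode"))
```

For real documentation, a dedicated DDI editor or validation against the published DDI schema is the safer route; the point here is only that the same information kept in a text file gains a fixed, machine-readable structure.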
Quantitative dataset FSD3133 Development Cooperation Survey 2016
Qualitative dataset FSD2999 My Public Living Room Interviews 2014
All data descriptions produced by FSD are available in DDI XML format. A link to the DDI description is available at the end of the description page of each dataset. The descriptions are also available as a single ZIP file.
Metadata standards for research data
- Data Documentation Initiative (DDI) Lifecycle and Codebook
- CESSDA Metadata Model (CMM)
- Text Encoding Initiative (TEI) is used especially to code text documents.
- Statistical Data and Metadata Exchange (SDMX) is an exchange format for statistical data and metadata.
- Ecological Metadata Language (EML) is a specification developed for documenting ecological data.
Other metadata standards
- Dublin Core Metadata Initiative (DCMI). Dublin Core is a metadata format designed especially for describing digital publications. The Finnish version is maintained by the National Library of Finland.
- Metadata Encoding and Transmission Standard (METS) is a standard developed for encoding metadata regarding objects in digital libraries.
- PREMIS (Preservation Metadata: Implementation Strategies) is a metadata standard to support the preservation of digital objects and ensure their long-term usability.
- EAD (Encoded Archival Description) is a standard for the encoding of archival finding aids for use in networked environments.