FSD Bulletin

Issue 22 (2/2007)

ISSN 1795-5262


FSD Bulletin is the electronic newsletter of the Finnish Social Science Data Archive. The Bulletin provides information and news related to the data archive and social science research.


Finnish Social Science Data Archive
E-mail: fsd@tuni.fi


Processing Qualitative Data for Archiving

Arja Kuula

Traditional archives seek to preserve archived documents in as original a form as possible. A data archive does it differently: datasets are processed and new information is added.

There are three processing activities that the FSD generally uses for preparing qualitative data for archiving: conversion, anonymisation and metadata generation. First, digital data must sometimes be converted to a new format in order to safeguard its usability in present-day and future software. Second, confidentiality and ethical issues may require anonymisation, that is, making changes to the original data. Third, new metadata must be generated to give future users enough background information to analyse the data.


Most qualitative datasets archived at the FSD are texts in digital format: interviews, group interviews, diaries or other written texts, or semi-structured internet questionnaires. The data are generally converted to Rich Text Format (RTF) or to plain unformatted text (txt) so that they can be accessed regardless of the software reusers have. Some datasets have been XML-coded. The main advantage of XML coding is the ease with which the data can be converted to other formats, such as RTF, HTML or PDF. The XML structure also provides added tools for analysis. Reusers with some programming skills can select from the data only those questions and responses that are relevant to their research.
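As an illustration of selecting only the relevant questions and responses from XML-coded data, here is a minimal Python sketch. The element and attribute names (`dataset`, `response`, `answer`, `respondent`, `question`) and the sample content are hypothetical and do not represent the FSD's actual XML structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-coded questionnaire data; names and content are invented.
SAMPLE = """<dataset>
  <response respondent="R001">
    <answer question="q1">I moved to the city for work.</answer>
    <answer question="q2">Mostly by bus.</answer>
  </response>
  <response respondent="R002">
    <answer question="q1">I was born here.</answer>
    <answer question="q2">I cycle.</answer>
  </response>
</dataset>"""


def select_answers(xml_text, question_id):
    """Return (respondent, answer text) pairs for a single question."""
    root = ET.fromstring(xml_text)
    pairs = []
    for response in root.findall("response"):
        for answer in response.findall("answer"):
            if answer.get("question") == question_id:
                pairs.append((response.get("respondent"), answer.text))
    return pairs


# Keep only the answers to question q2.
q2_answers = select_answers(SAMPLE, "q2")
```

The same parsed structure could be written out in another format, which is the conversion advantage mentioned above.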

Occasionally, when a qualitative dataset arrives at the FSD, the medium on which it is stored has already become obsolete. Unlike data stored on paper, digital data become obsolete very quickly. With a dataset that is over 15 years old, recovery is most likely to succeed if the data have been stored as plain unformatted text (txt or ASCII). However, the FSD takes old formats as a challenge. The staff have been able to open most qualitative datasets that have been deposited for archiving. In one case, data collected in the 1980s were successfully converted to a new format even though the researcher could not remember the software that had been used to generate the files, or the password he had used to prevent access to the material.


If research participants had not been informed before the data collection that the data collected from them would be archived, the data must always be anonymised.

The FSD decides the extent and nature of anonymisation on a case-by-case basis. The information given to research participants on the use and preservation of the data, the number and nature of identifiers in the data, and the sensitivity of the data are all taken into account. The FSD encourages researchers to anonymise the data before deposit. In cases where the researcher has informed research participants that the data will be archived for scientific purposes after the names of the participants have been removed, all that needs to be done is to remove the names and addresses of the participants from the data and replace other names with pseudonyms.
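Replacing names with pseudonyms can be partly automated. The following Python sketch assumes that the names to be replaced are already known and listed by hand; the function name `pseudonymise` and the sample names are hypothetical, and this is not a description of the FSD's own tools. Inflected forms and misspellings of a name would not be caught, so the result still needs to be checked manually.

```python
import re


def pseudonymise(text, replacements):
    """Replace each listed name with its pseudonym, matching whole words only."""
    if not replacements:
        return text
    # Longest names first, so that a longer name wins over a shorter one it contains.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    # A single pass avoids replacing a pseudonym that happens to equal another name.
    return pattern.sub(lambda match: replacements[match.group(0)], text)


# Hypothetical transcript excerpt and name list.
excerpt = "Interviewer: How did Matti react? Liisa: Matti said nothing."
anonymised = pseudonymise(excerpt, {"Matti": "Pekka", "Liisa": "Anna"})
```

Keeping the mapping from names to pseudonyms in one place helps ensure that the same person is given the same pseudonym throughout a dataset.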

The Finnish Personal Data Act allows data to be archived even without removing identifiers if participants have been informed of this in advance. This is worth remembering when making audio recordings or collecting data for a research project where a follow-up study is planned.

Generating metadata

The usability of a dataset depends largely on how thorough the data description, i.e. the metadata, is. Archived data are described using the international DDI format. The description includes information on the content and extent of the data, data collection methods and temporal coverage. In addition to study-level information, the data description also includes information on each case. For example, the archive adds information on the interviewer, interviewee, interview location and time at the beginning of each interview. How exact the metadata are depends on the amount and exactness of the contextual information available.
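Adding case-level information at the beginning of each interview could look like the following Python sketch. The field names, the header layout and the sample values are hypothetical; they are not the FSD's actual template or the DDI format.

```python
def add_case_header(transcript, metadata):
    """Prepend 'Field: value' lines and a blank line to a transcript."""
    header = "\n".join(f"{field}: {value}" for field, value in metadata.items())
    return header + "\n\n" + transcript


# Hypothetical interview and background information.
interview = add_case_header(
    "I: Could you tell me about your work?\nR: I have been a nurse for ten years.",
    {
        "Interviewer": "I01",
        "Interviewee": "R07, female, nurse",
        "Location": "interviewee's home",
        "Date": "2006-11-14",
    },
)
```

Using the same fields in the same order for every interview makes it easier to compare cases later.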

Without the metadata, analysing the data would be difficult or impossible. A transcription of an interview, for example, is much less valuable if no one knows whether the interviewee was male or female, what his or her profession was, and whether other people were present during the interview. At the time of data collection, researchers may feel certain that they will remember all that information when reading the transcripts later, but this is a misconception. People forget things astonishingly quickly. It is advisable to plan before data collection how to create metadata about the whole study and particularly how to create background information for each interview or observation. The last moment to create such metadata is when the data are processed, i.e. when speech is transcribed into a text document.