Anonymisation of Qualitative Data

It is often claimed that qualitative data cannot be anonymised. It is indeed particularly challenging to anonymise video or audio data, and the Finnish Social Science Data Archive (FSD) does not provide guidelines suitable for those. However, FSD does provide guidance for anonymising textual data, as such data can be anonymised. Texts may include transcriptions of interviews or other interactions, or writings produced by individuals for research purposes.

Tekstiä, josta on punaisella värillä ja haksulkeilla korostettu anonymisoidut kohdat.
A fictional example excerpt from an anonymised interview.

Anonymisation must always be irreversible. Therefore, at the end of the anonymisation process, the researcher must destroy all original files containing personal data. If, however, the research data is derived from official documents or online content, the researcher cannot destroy the original information. In such cases, anonymisation is challenging and often impossible. For example, datasets collected from the internet that contain personal data are difficult to process in a way that ensures the principle of irreversibility in anonymisation. Original content is often permanently available and can be found relatively easily through web searches.

This article focuses solely on the anonymisation of qualitative textual data produced for research purposes. Based on our experience, we argue that textual research data can be anonymised, provided that the original data containing personal information can be destroyed after anonymisation. Identifiers and information obtainable from other sources are the two key aspects that must be carefully considered during anonymisation.

Identifiers

An identifier can be any piece of information by which a person can be identified directly, indirectly, or by combining it with other data. Identifiers can be direct, strong indirect, or indirect. The concept of a strong indirect identifier was developed by the FSD to support the planning of identifier removal in anonymisation.

A direct identifier is information that uniquely identifies a person as such. Examples include a personal identity number, a person’s voice, and a recognizable photograph or video of the person. A direct identifier can also be a rare full name or an email address that includes the person’s name. Direct identifiers are always removed during anonymisation.

Strong indirect identifiers do not allow direct identification of a person, but identification can be made with relative ease. Examples of strong indirect identifiers include a postal address, phone number, an email address not based on the person’s name, the title of a person’s art or written work, the URL of a website containing identifying information about the person, a rare job title, or a position held by only one person at a time—such as the chairperson of an association or a named municipal committee. Sometimes, a rare event can also be a strong indirect identifier. In anonymisation, strong indirect identifiers are either removed or generalized to make them non-identifiable.

Indirect identifiers are almost always present in data from the human sciences. A person cannot be identified based on a single indirect identifier, but combining several may make identification possible. Indirect identifiers include, for example, gender, age, education, occupation, family composition, marital status, language, nationality, workplace or school, and information about place of residence. Indirect identifiers may be included in anonymised research data, but individuals must not be identifiable from the data based on these or other available information.

In qualitative data, strong indirect identifiers can appear anywhere within the material. They may relate directly to the subject of study, but they can also be mentioned incidentally. Examples include references to a significant role in organizing an event (“at that time I was the promoter of Qstock”) or a position that can only be held by one person at a time (“I am currently the communications officer of the Interaktio student organization”). References to one’s own publications are also strong indirect identifiers that can make a person identifiable. Sometimes, the identifier to be removed is a clue embedded within the data.

Example of an Identifier to Be Removed

In a fictional study, the workload of postal workers is examined by interviewing 20 employees from six selected Finnish municipalities. The municipality information is retained in the data to allow for comparison, as the work is organized differently in each municipality. The principle of data minimization is applied so that no detailed background information is collected. For each interviewee, the anonymised dataset retains the following information: municipality of employment, gender, and age group.

The interview questions focus on the content of the work and perceived workload. No questions are asked about private life. However, in open interviews, people may speak freely. One interviewee, in response to a question about the work atmosphere, says: “…I’m the only one here with dreadlocks and piercings…”. The details mentioned in this response are identifiers. Based on them, someone familiar with the local context could deduce whose interview it is. Since the municipality information is retained in the data, the reference to the interviewee’s physical appearance must be removed.

Among identifiers and, more broadly, information about the interviewee, the most significant are those that can be assumed to remain relatively unchanged. A person’s date or year of birth does not change, nor does the date when a close or important person died. Family composition may change, but the number of children in a family at a given time is a fixed fact for that moment. The interviewee’s place of residence at the time of the study is a factual detail describing the moment of data collection, even if the person later moves elsewhere. Information about education and employment history, including workplaces, is also unchanging. These may only be added to over time.

Research data also includes information that is not permanent and cannot be found elsewhere in exactly the same form. People’s thoughts and attitudes change over time. A person may not remember the exact answer they gave in an interview the next day. Descriptions of experiences and memories, and the interpretation of their meaning, also change over time. According to Saarenheimo (2012), remembering always involves a tension between truth and narrative. On the one hand, people strive for truthfulness when recalling events, but at the same time, they aim to present things in a way that is favorable or at least understandable to themselves (ibid., p. 33). Narratives and descriptions of past events and their meanings change over time and also depending on the audience.

Example of a Unique Life Experience

In a fictional study on long-term marriages, one interview includes a lengthy description of the feeling of loneliness experienced within the marriage. If the data contains enough identifiers that a person can be deduced from the material or from other available information (for example, a rare job title, municipality of residence, gender, and year of birth), the description of the experience of loneliness constitutes personal data.

When anonymised, the description of the experience of loneliness may be accompanied only by information such as: the woman belongs to the 50–54 age group, works in the public sector, and lives in a rural municipality in Southern Finland. These background details could match dozens or even hundreds of people. Although the description of loneliness experienced in marriage remains individual, it is no longer linkable to the research subject – assuming that all other possible identifiers have also been removed from the data.

Information Obtainable from Other Sources

Information obtainable from other sources is constantly increasing and poses challenges for anonymisation. Such sources include, for example, social media, websites and statistics of organisations, associations, and public authorities. A limited number of individuals may also have access to various service databases that reveal the names, email addresses, and sometimes personal identity numbers of registered users. Everyone has easy access to information about the authors of publications and other works. Additionally, more and more research data is becoming available for use each year.

Name, Date of Birth, and Address

Removing a person’s name is straightforward. It can be replaced with a code (e.g., respondent01), a fictitious first name, or simply the gender implied by the name. An exact date of birth, on the other hand, is a strong indirect identifier, even though it alone may not be sufficient to identify a person. In relation to other identifiers in the dataset, the date of birth can be generalized in a way that best suits the research.

Person’s Age

An exact date of birth is too detailed a piece of information about an individual and should be converted to a more general level. For example, if the respondent’s date of birth is 24 June 1952 and the data was collected in February 2021, a less specific piece of information would be the birth year 1952. Even less precise than the birth year is the age in years (68 years), in which case the researcher can no longer be certain whether the person was born in 1952 or 1951. At the most general level, age can be reported as a category. The person could be classified in the 65–69 age group.

Geographic information has been identified as one of the most significant factors in planning the removal of identifiers (Elliot et al. 2016, 347; Elliot et al. 2020, 47). Place of residence or other geographic data is an indirect identifier. The more precise the geographic information related to individuals in the dataset is, the easier it becomes to attempt to deduce and identify individuals included in the data. Geographic data includes, for example, postal code, city district, municipality, region, sub-region, or major area.

Generalising Geographic Information

In a fictional example, the research data includes background information such as occupation, municipality of residence, and whether the respondent works in the public or private sector. In some cases, these identifiers may already be sufficient to identify an individual. One such combination would be municipality information combined with the occupation of a veterinarian employed in the public sector. Some municipalities publish on their websites the names of veterinarians employed by the municipality, and in the smallest municipalities, there may be only one.

There are several ways to reduce the identifiability of geographic information. Suppose the respondent lives in Nauvo, which is part of the city of Parainen and is a rather small residential area. The geographic information could be expressed using the postal code 21600, but even that is too specific. It can be generalised by removing digits from the end of the postal code (e.g., 216XX; 21XXX). Less specific than the postal code is simply stating the municipality, Parainen, but in this example, including the municipality in the data proved to be an unsuccessful solution for small municipalities. Instead of the municipality, one could report only the region or create a combination of geographic descriptors. In anonymisation, Nauvo could be transformed into geographic information that effectively refers to five different municipalities in the area:

  • A city with 10,000–30,000 inhabitants
  • Southwest Finland (Varsinais-Suomi)

Sometimes, precise geographic information is an essential part of the research itself. In such cases, instead of removing the geographic data, all other possible identifiers must be removed or generalised to a sufficiently coarse level to ensure the dataset remains anonymous.

Education and Work History

Information about people’s schools, degrees, and workplaces is easily accessible (for example, through social media and job placement service applicant data). Therefore, when anonymising data, information describing education and work history should be categorised, and precise details should be removed.

At the beginning of qualitative interviews, it is typical to ask background questions. The purpose of these opening questions is to gather information about the interviewee and to demonstrate that the researcher is interested in them. The beginnings of interviews often contain a lot of personal data. In the fictional example, work history is not the focus of the study, but the researcher still provides an opportunity to share it through the opening question:

Example from the Beginning of a Qualitative Interview: Education and Work History

""Researcher: Could you maybe start by telling me a bit about yourself? I mean, I know you're 28, but like what schools you've attended and whether you're working or…?

Participant: Well, I was born and lived in Heinävaara until I finished high school, so I went to high school in Ilomantsi. I tried to get into the University of Eastern Finland to study business, but I didn’t get in. That year I was unemployed and did some telemarketing from time to time. But the next spring I applied and got into Savonia to study business administration, and that also meant I could move out from home in Ilomantsi. I graduated in 2015, and since then I’ve had no trouble finding sales jobs, since I got to know places already during my internship. First, I worked at Dressmann in the Metropoli shopping center for a year, then I moved to KappAhl in the same mall for just under a year, and after that I’ve been at K-Kenkä. I enjoy it there and I think I’ll get to stay longer too."

An interview excerpt can be anonymised as-is by removing and generalising place names and proper nouns, for example: “…first I worked for a year at [a clothing store in a shopping centre], then I moved to [another clothing store in the same shopping centre] for just under a year…”.

Since the focus of the study is not on education and work history, the content of the first response can be minimised immediately after transcription, thereby initiating the preliminary anonymisation of the data. The responses to the interview’s main content questions are not minimised, but based on the minimised opening response and the information already known to the researcher, categorised background information sufficient for conducting the study is compiled about the participant. After that, the original transcribed response is removed from the dataset.

Minimised version of the example response:

  • Gender: Female
  • Age: 25–29 years
  • Place of residence: Urban municipality, North Karelia
  • Education: University of Applied Sciences degree, Business Administration
  • Occupation: Salesperson, clothing store

When the education and work history described by the interviewee is generalised in the manner described above—and similar generalisation is applied to the rest of the interview content—it becomes difficult to identify the person by searching for similar information, for example, on social media or in CVs from job-seeking platforms. Generalising the interviewee’s place of residence means that she could live in Ilomantsi, but also in Kitee, Lieksa, Nurmes, or Outokumpu.

Rare Information and Events

Sometimes datasets include rare pieces of information or references to uncommon events. If the information itself is identifiable, it must be removed (for example, “I am the oldest prison director in Finland”). When rare information or events included in the data can also be found elsewhere, the information must be modified to be anonymous.

Rare information almost always includes newsworthy joys, sorrows, or accidents experienced by the participant or their close circle. It is highly likely that additional details about such events can be easily found elsewhere, and they may lead to the identification of individuals in the dataset. For example, in an interview, someone might say—without naming names—that their mother received a specific literary award in 2009. The information about the literary award is an identifier related to the mother, and in some cases, it may also lead to identifying the interviewee. In any case, identifying the mother reveals all personal data included in the dataset about her and her child (the interviewee), even if the interviewee themselves cannot be directly identified from that information.

Sometimes rare and at least identifiable information may be a reference to a blog maintained by the participant or to a popular social media account belonging to the participant or someone close to them. Such information must either be removed or generalised (example of removal: “some of my free time goes into producing content for my Kakkukeisari blog” → “some of my free time goes into producing content for my [-] blog”). Sometimes even references to social media content created by the participant can enable identification. For example, the timing of a post may provide enough of a clue for identification, even if the content itself is not quoted verbatim in the dataset.

Publications, Other Works, and Research Data

Datasets may include references to publications, books, theses, compositions, or other works created by the participant or their close circle. Even if the author’s name is not mentioned, the title of the work alone is often enough to identify the author. Author information can be found not only in databases but also through a simple internet search. When references to works could reveal the identities of individuals in the dataset, the information must be removed or generalised to make it non-identifiable (example of removal: “the Tulilla sculpture made by my father in the town centre” → “the [-] sculpture made by my father in the town centre”)..

Anonymisation must also ensure that identifiers in the dataset cannot be inferred from research publications. If location data is removed or categorised, it must be ensured that the municipalities cannot be deduced from the publications based on the data. If the municipalities can be inferred from the research publications, removing them from the dataset is not an effective anonymisation solution. If the anonymisation of qualitative data involves categorising participants’ age and occupation, similar information must not appear alongside data excerpts in the research publications.

When multiple datasets are collected about the same participants, anonymisation must be planned with particular care. Each decision must be evaluated not only in relation to information available from other sources but also in relation to other research datasets containing information about the same participants. Sometimes the intention is to allow information about a specific individual to be linked across datasets. In such cases, the removal of identifiers must be done in a deliberately consistent manner to ensure that linking datasets does not enable identification.

Ensuring Anonymity

Once the anonymisation of research data has been completed, the effectiveness of the process can be tested using three questions. If the answer to each is “Not by reasonably likely means,” the anonymisation can be considered successful:

  1. Can the person still be identified from the data?
  2. Can the data be linked to another dataset or external information in a way that would allow identification of the person?
  3. Can it be inferred that the data concerns a specific individual? Are the modified or removed details still deducible? (WP29 WP 216, pp. 18–19)

To ensure that anonymisation is truly irreversible, all materials containing personal data must be destroyed at the end of the process. These include audio and video recordings, original transcripts, consent forms, contact information, and any communication with the participants.

There is also a boundary when it comes to anonymising identifiers. When participants share their views and opinions about public figures such as politicians or entertainers, those individuals are not anonymised. Only if participants disclose private, non-public matters about public figures should such references be anonymised.

The Data Management Guidelines can be used as a resource when planning the anonymisation of research data. It provides comprehensive basic guidelines on anonymisation, starting with data minimisation.

Text: Arja Kuula-Luumi