Text: Katja Fält, Photo: Kaisa Järvelä

Archiving and Reuse of Social Media Data Often Flounders at Terms of Service

Use of social media as research data is on the rise as the number of social media users increases. However, the scientific community has only recently begun to consider the methods and procedures of analysing and archiving social media data. The greatest barriers to archiving data include the commercial nature of social media services and their terms of service that, at least for now, usually prevent archiving.

Social media data is proving to be part of valuable digital cultural heritage. Examining such data can reveal much about communication culture in the 2000s, social networks, information sharing and many other phenomena related to society and human behavior. Data produced by social media users are also very easy and quick to process in a machine-readable format.


Terms and conditions of social media services are at odds with the requirements of open data.

Using social media data in research is not without its problems, and many researchers struggle with various ethical and practical questions. Another challenge is posed by the requirements of several research funders, like the Academy of Finland, that urge researchers to make their data available for reuse.

A researcher who uses social media data in his or her research is allowed to use the data, but cannot deposit them for archiving or share them. In this sense, the terms and conditions of social media services are significantly at odds with the requirements of research funders. Academic institutions and archives are increasingly developing infrastructures and ways of enabling the preservation of social media data.

Strategies for storing social media content

Social media content is most effectively collected with the help of APIs (application programming interfaces). An API is an interface that allows programs to send requests to one another and exchange information.

In the context of social media, an API functions as the interface between the social media platform and the person using social media data. An API enables controlled access to data on social media platforms or, for instance, to data produced by certain users.
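As a rough illustration of what collecting content through an API can look like, the sketch below pages through results using a cursor. It is not the API of any particular platform: the page layout (`items`, `next_cursor`) and the `fetch_page` function are assumptions made for this example, and a stand-in fetcher replaces real network requests so that the code runs offline.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page layout: {"items": [...], "next_cursor": "abc" or None}
Page = dict
FetchPage = Callable[[Optional[str]], Page]


def collect_posts(fetch_page: FetchPage, max_pages: int = 100) -> Iterator[dict]:
    """Yield posts page by page until the API reports no further cursor."""
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if cursor is None:
            break


def fake_fetch_page(cursor: Optional[str]) -> Page:
    """Stand-in for a real HTTP request (assumption for this sketch)."""
    pages = {
        None: {"items": [{"id": "1"}, {"id": "2"}], "next_cursor": "p2"},
        "p2": {"items": [{"id": "3"}], "next_cursor": None},
    }
    return pages[cursor]


posts = list(collect_posts(fake_fetch_page))
print([post["id"] for post in posts])
```

In practice, `fake_fetch_page` would be replaced by a function that calls the platform's documented endpoint, subject to that platform's terms of service.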

Another way to acquire social media data is to purchase data from data resellers. Typically, resellers are companies that offer services and products based on the data collected through APIs.

Some of the resellers, like Gnip, are authorised resellers for certain platforms and provide exclusive access to social media data that cannot be collected directly through social media platforms. These kinds of data are usually “historical”; in other words, they are not real-time data.

The data provided by resellers are filtered through the APIs of social media platforms and are thus subject to regulation by the platforms. This limits publishing and sharing the data. This is why companies, rather than individual researchers or research organisations, often purchase data from resellers.

A person requiring access to social media data can also make an agreement directly with the social media platform. Several organisations offer social media archiving as part of their online archiving services or as their main service. Commercial services are also provided by, for example, ArchiveSocial, MirrorWeb, Erado, and Gwava, which cooperate with cultural heritage organisations as well. Additionally, the Internet Memory Foundation and the International Internet Preservation Consortium (IIPC) offer support for managing social media data.

The most inconvenient aspect of third-party services is their high price, caused by, for instance, the wages of skilled staff. For this reason, these services are rarely viable for an individual researcher.

Finally, one possible way of harvesting social media content is to make use of the self-archiving services of social media platforms. Some of the platforms, such as Facebook, Google, and Twitter, use self-archiving tools for backing up data, which enable users to download their data from their accounts in machine-readable format.

However, the data archived by these platforms are quite limited. Facebook, for instance, only archives data that users have personally uploaded to their own accounts. Nonetheless, the self-archiving service can very well be useful for organisations that want to preserve their own social media publications.

Special features of social media data should be considered

Social media content is largely produced by users in real-time social interaction. Instead of individual publications, dialogue is often central to the content. Discussions in social media are typically fairly amorphous, which poses challenges to data selection and collection criteria.

On Facebook and Twitter, it is often difficult to see where one discussion ends and another begins. It may be complicated for a researcher to determine how to delimit the data so that everything essential is included.

Another challenge is that organisations and people collecting social media data often have experience of processing analogue data, and apply this experience when storing, processing, and preserving digital social media data. Researchers as well as other people and organisations collecting social media data lack standard procedures and instructions for preserving online content and big data.

Preserving social media content requires different solutions than traditional archiving methods in order to keep the data understandable and accessible. For example, it is important to ensure that linked or embedded information can also be used later on, as a missing or broken link can make a whole discussion meaningless. This is why any external content (such as URLs) linked to social media should be stored with the actual content.
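One small step towards this is to record, at collection time, which external addresses a post points to, so that the linked material can be captured alongside it. The following is a minimal sketch under stated assumptions: the post is a plain dictionary with a `text` field, and the `linked_urls` field name and the simple regular expression are choices made for this example, not part of any platform's format.

```python
import re

# Deliberately simple pattern; real-world URL detection is more involved.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text: str) -> list:
    """Return the distinct URLs found in a post, in order of appearance."""
    links = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in links:
            links.append(url)
    return links


def with_link_inventory(post: dict) -> dict:
    """Copy a post and attach the list of URLs it references."""
    return {**post, "linked_urls": extract_links(post.get("text", ""))}


example = {
    "id": "42",
    "text": "See https://example.org/report.pdf, and https://example.org/a?b=1.",
}
archived = with_link_inventory(example)
print(archived["linked_urls"])
```

The inventory only lists the addresses; fetching and storing the linked content itself would be a separate step.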

Secondary information, or metadata, relating to the collected data should also be preserved. Metadata include all secondary information relating to the content, such as user information (e.g. age, nationality, occupation, place of residence), user specific IDs and, for some platforms, IDs of individual publications. Metadata may also include information on processes relating to capturing, sorting and analysing data.

Understanding data requires information on how they were created, scrubbed, edited and analysed. In cases where original source data cannot be shared, documentation of processes supports their short-term preservation. In addition, metadata can offer significant additional information for understanding non-textual archived content, such as photographs or video clips.
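To make this concrete, the sketch below stores a post together with its identifiers and a running log of what was done to it. The field names (`post_id`, `user_id`, `processing_history` and so on) and the tool names are hypothetical and chosen only for illustration; an actual archive would follow its own metadata schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessingStep:
    action: str        # e.g. "captured", "scrubbed", "analysed"
    tool: str          # software or procedure used
    performed_at: str  # ISO 8601 timestamp


@dataclass
class ArchivedPost:
    post_id: str
    user_id: str
    platform: str
    text: str
    processing_history: list = field(default_factory=list)

    def record_step(self, action: str, tool: str, performed_at: str) -> None:
        """Append one documented processing step to the record."""
        self.processing_history.append(ProcessingStep(action, tool, performed_at))


post = ArchivedPost(post_id="42", user_id="u-7", platform="example-platform", text="Hello")
post.record_step("captured", "hypothetical-collector 0.1", "2016-02-01T12:00:00Z")
post.record_step("scrubbed", "manual review", "2016-02-02T09:30:00Z")

# Serialise content and metadata together so they are preserved as one unit.
record = json.dumps(asdict(post), indent=2, sort_keys=True)
print(record)
```

Keeping the processing log in the same record as the content means the documentation travels with the data, even if the original source data cannot be shared.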

Terms of service limit data use

A significant problem in archiving social media data is the business model of many social media platforms, which focuses on maximising the profits gained from data use.

Principles and agreements relating to APIs strongly limit data sharing and reuse, because the basic principle of developers is that, as a rule, data acquired through APIs cannot be shared.

Researchers often need large datasets to be able to observe significant trends. Many social media platforms restrict the amount of data that can be requested and some even track the number of requests to prevent transfers of large amounts of data. A few platforms allow access to data for research purposes, but forbid open sharing of data through digital archives or other organisations.
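For a collector, such limits mean that requests have to be paced. The sketch below shows one way to keep track of a request quota on the collector's side, using a sliding time window. The limit of two requests per 60 seconds is invented for the example and does not describe any real platform; the clock is injected so that the behaviour can be demonstrated without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class RequestBudget:
    """Track requests sent within a sliding window (limits are hypothetical)."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent: Deque[float] = deque()

    def seconds_until_allowed(self) -> float:
        """Return how long to wait before the next request fits the quota."""
        now = self.clock()
        # Forget requests that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record_request(self) -> None:
        """Note that a request was sent at the current time."""
        self._sent.append(self.clock())


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
budget = RequestBudget(max_requests=2, window_seconds=60.0, clock=lambda: fake_now[0])
budget.record_request()
budget.record_request()
wait_when_full = budget.seconds_until_allowed()
fake_now[0] = 61.0
wait_after_window = budget.seconds_until_allowed()
print(wait_when_full, wait_after_window)
```

A collector would check `seconds_until_allowed()` before each request and sleep for the returned time; the actual limits have to be taken from each platform's own documentation.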

According to a strict interpretation, transferring Twitter data into a cloud service is forbidden. What further complicates things is that the terms and conditions of the platforms may change annually. This makes developing long-term practices of social media archiving difficult. Navigating the restrictive terms of service is particularly challenging for researchers and other data collectors who collect data from different platforms and have to consider the terms and conditions of each platform separately.

Are my data protected in social media?

Long-term preservation of social media data is even further complicated by privacy and data security issues. Social media data contain a great deal of user data that may reveal personal information, particularly if combined with other data.

Social media platforms usually reserve the right to buy and sell content produced by their users without prior notice. Facebook, Google, and LinkedIn, for example, reserve the right to user data in accordance with their terms of service.

Users of social media have alarmingly little say in what happens to the content they have created after it is posted. Users often do not know that their data may end up being used for several different purposes: for research, for commercial purposes, in cultural heritage collections, for journalism, and for other non-commercial purposes. Long-term preservation of data produced by users is in conflict with EU legislation, which grants, in some cases, individuals the right to have their personal data removed from Internet search engines.

It is likely that ethical and privacy issues will increase in the future along with the continuously growing mass of user data, if users are not given more say over how their data and content are used.

Am I breaking copyright law?

Terms and conditions of social media platforms mostly prevent the grossest copyright violations, because they restrict data sharing and copying. Collecting and processing social media data is not, in itself, generally a violation of copyright. However, some of the content published in social media, such as photographs, may contain copyright-protected material.

Publishing pictures is not usually possible in, say, academic research journals. Copyright issues usually only become significant in qualitative research when a researcher wants to publish or share, for instance, individual Twitter posts or copyrighted material such as pictures or audio.

However, archiving and preserving social media user data does not, in itself, violate copyright. In the future, it is important to chart the needs and wishes of social media users when decisions are being made on what kind of data can be stored, how much, and in which format.

Future prospects of social media data archiving

When developing practices and conventions for social media archiving and reuse, it is important that research organisations and researchers cooperate actively from the very beginning. Various organisations, like universities and cultural heritage organisations, should also collaborate with each other and share the costs of developing best practices and technical solutions. At its best, cooperation may broaden access to valuable datasets or enable the development of technical infrastructure.

Cooperation could also be used to create a centralised infrastructure that would negotiate with social media platforms and bring together research standards and requirements, preservation standards and social media terms of service. Such an infrastructure might also be helpful in harmonising social media collection policies and standards.

Cooperation with social media platforms is also needed. So far, only Twitter has actively stored user data with a reliable archiving institution and negotiated agreements with individual research institutes to support academic research. Twitter has provided the Library of Congress with all its archived and real-time data for long-term preservation. Besides this, there are very few precedents of relationships between social media platforms and archives.

The Twitter archive of the Library of Congress still has a few knots to untangle, because processing and sorting the data have proven to be time-consuming, and researchers cannot yet use the data. The delay is no wonder, as the number of Twitter posts and their metadata is close to half a trillion. Regardless of the problems, Twitter’s contribution is an admirable effort to cooperate with the cultural heritage sector and support non-commercial research.

A big problem with long-term preservation of social media data is the commercial nature of the platforms and particularly the fact that the platforms can sell user information to third parties. Consequently, the platforms are primarily interested in profiting from the data and increasing their sales.

Researchers and data collectors as well as decision-makers could develop an alternative business model which would facilitate researcher access to data without the need to be involved in the sales and profits of a platform. Non-commercial data use does not have to prevent commercial use of the data. Access to data for non-commercial purposes could even be funded, either by rewarding companies that share their data with non-commercial actors or by offering special advantages to those that grant access to data for research purposes.

One possible alternative could also be to make use of research infrastructure funding. This would allow making a long-term agreement with willing platforms and transferring a certain amount of data for research purposes or to cultural heritage collections.

In any case, transparency is also needed when social media data are archived. Development and publication of practices and standards relating to collecting, sorting and analysing the data would help social media users understand how their content is being used.

Transparency in capturing, processing, and analysing data provides important information on the origin of data to archives that preserve them. Openness also supports rights management and ensures that researchers have rights they need for their research.

Primary source

  • Thomson, Sara Day (2016). Preserving Social Media. DPC Technology Watch Report 16-01, February 2016. Digital Preservation Coalition.

Creative Commons -license