File Formats and Software
File formats and software are constantly updated, and older formats may be replaced by newer ones. No single format or software can be said to surpass the others in usability and persistence. At least one copy of the files should be stored either in a format that is commonly used and supported by several different software packages or in a fully software-independent format. This increases the probability that the file can still be read in the future.
Up-to-date recommendations for file formats suitable for transfer and long-term preservation of data can be found in the specifications published by the National Digital Preservation Services (digitalpreservation.fi). The specifications do not include all file formats of statistical software, but following the specifications is advisable whenever they are applicable. It is recommended that data to be digitised, in particular, are stored in one of the formats supported to avoid later file conversions. Basic metadata on devices and software used in collecting and processing the data should also be documented.
When files are converted from a file format used by one software into another, data are easily lost. Data can also be lost when files are converted between the versions of the same software. To minimise data loss, limitations of different file formats and software should be examined before conversion.
Many software packages offer options to export or save data into a different file format (Save as). However, using these options does not always mean that all data in the original file are converted. For example, in statistical packages, definitions for missing data may be lost even if the target format supports the feature. When converting files from one text processing tool to another, formatting is often lost or distorted.
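As a minimal sketch of why a lost missing-data definition matters, the following Python snippet computes a mean with and without the definition. The variable name `age`, the code 99 for "missing" and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical survey variable: 99 was defined as "missing" in the source package.
age = [34, 51, 99, 28, 99, 45]
missing_codes = {99}  # assumption: this definition was stored in the original file

# With the definition intact, missing codes are excluded from the analysis.
valid = [value for value in age if value not in missing_codes]
mean_with_definition = mean(valid)

# If the definition is lost in conversion, 99 is treated as a real age.
mean_without_definition = mean(age)

print(mean_with_definition, mean_without_definition)
```

Comparing simple statistics such as means or frequencies before and after a conversion is one way to notice this kind of loss.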
Some software is designed specifically to convert files from one format to another and takes the features of different formats into consideration. FSD uses the StatTransfer software to convert quantitative data from one format to another.
Textual or visual data in paper format are nowadays easy to digitise with a scanner or with the scanning tool and software included in most printers.
When the aim is to store only textual data, scanned text is converted into a text file by using optical character recognition (OCR), which is included in practically all scanners. If paper documents, such as hand-written notes, need to be stored with their original formatting, the scanned file is saved and stored as a picture file, just like files containing images, illustrations or other visualisations.
PDF (Portable Document Format) is the de facto standard for document publishing. It allows printing documents while retaining the original formatting of images. Using the ISO standardised PDF/A is recommended for documents that need to be kept unchanged. PDF/A is meant for use in archiving and long-term preservation. It allows only some of the features of PDF and ensures that all the information necessary for displaying the document is embedded in the file. Most office software in use today can be used to create PDF/A files.
Audiovisual recordings on VHS tapes can be converted into a digital format by using a DVD-VCR combination recorder, but a more exact digital copy of the original is achieved by transferring the recording to a computer using a separate device. There are many businesses that offer digitisation services and know what to take into consideration when converting recordings.
Tips and instructions on how to digitise audio recordings on obsolete media (audio tapes and analogue discs) can be found, for example, in the publications of TAPE (Training for Audiovisual Preservation in Europe) and in the Report of a Roundtable Discussion of Best Practices for Transferring Analog Discs and Tapes by the National Recording Preservation Board of the Library of Congress.
A quantitative dataset is typically a data matrix consisting of rows and columns, where one row represents one unit of observation and one column corresponds to one variable. Cells in a data matrix contain numeric or textual information. A data matrix is processed and analysed by using a statistical package.
In addition to a data matrix, most statistical software stores metadata in a dataset. The metadata describe the contents of the cells in the data matrix. Typically, this information includes variable names and labels, value labels and definitions for missing data.
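The relationship between a data matrix and its metadata can be sketched in Python as follows. This is only an illustration and not how any particular statistical package stores its files; all variable names, labels and codes are hypothetical.

```python
# Data matrix: one row per unit of observation, one column per variable.
variable_names = ["id", "gender", "satisfaction"]
rows = [
    [1, 1, 4],
    [2, 2, 9],
    [3, 2, 5],
]

# Metadata describing the cells (all names and codes here are hypothetical).
variable_labels = {
    "id": "Respondent number",
    "gender": "Gender of respondent",
    "satisfaction": "Satisfaction with life (1-5)",
}
value_labels = {"gender": {1: "Male", 2: "Female"}}
missing_values = {"satisfaction": {9}}  # 9 = "no answer"


def decode(row):
    """Return one row as a dict with value labels applied and missing codes set to None."""
    decoded = {}
    for name, value in zip(variable_names, row):
        if value in missing_values.get(name, set()):
            decoded[name] = None
        else:
            decoded[name] = value_labels.get(name, {}).get(value, value)
    return decoded


for row in rows:
    print(decode(row))
```

Without the three metadata dictionaries, the matrix would still hold the same numbers, but nothing would tell a later user that 2 means "Female" or that 9 is not a real answer.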
Various statistical packages for analysing quantitative data are available. Different packages offer different possibilities to analyse data. Packages also differ in how they handle variable and value labels, missing data and variable formats. There may be significant differences between different versions of the same package.
In social sciences, the statistical packages used the most are SPSS, Stata and R, but there is a wide variety of other statistical analysis software available, including, for instance, Survo, Matlab, Glim, Statistica, NSD-Stat and BUGS. Spreadsheets (e.g. Excel) are sometimes used to edit and analyse research data. Research data can also be stored and analysed in relational databases (e.g. Oracle, MS SQL Server, DB2, MySQL, PostgreSQL).
You can deliver data to FSD in SPSS, SAS or Excel file formats or as ASCII-encoded text files, among others. Careful documentation makes the data usable regardless of file format, software or version.
Statistical packages and their file formats
SPSS (IBM SPSS Statistics)
The first version of SPSS was released as early as 1968. The SPSS Portable format is often used in long-term preservation of research data. There are SPSS versions available for Windows, Linux/UNIX and Mac operating systems. The software is used by selecting commands in drop-down menus or by entering syntax commands. SPSS supports a number of file formats used by other software.
Filename extensions: *.sav, *.por
Stata
First released in 1985. Available for Windows, Linux/UNIX and Mac OS X. Lower-priced than SAS or SPSS.
Filename extension: *.dta
SAS
The first version was released in the 1960s. Available for Windows, IBM mainframe, Linux/UNIX and OpenVMS Alpha. In addition to statistical packages, SAS offers a wide array of products, including packages for graphs, optimisation and matrix calculations. Primarily used by entering syntax commands, but also has drop-down menus.
Filename extensions: *.sd2, *.sd7, *.sas7bdat (SAS for Windows), *.ssd01, *.sas7bdat (SAS for UNIX)
R (GNU S)
First released in the 1980s. Available for Windows and Linux/UNIX. Open source R was published at the end of the 1990s. R is a software environment for statistical computing rather than a statistical package.
Other file formats
Comma Separated Values, CSV
A text file where values are separated by commas. Filename extension: *.csv
Tab-separated values (tab delimited)
A text file where values are separated by tabs. Filename extensions: *.dat, *.tab, *.txt
Fixed width
A text file where values have a set length, i.e. a fixed width. If a value has fewer characters than the fixed width, the remaining positions are filled, for example, with blank spaces. Filename extensions: *.dat, *.txt
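The three text formats above can be illustrated with a short Python sketch that writes the same records in each of them. The records and the column widths are hypothetical, and the standard library `csv` module is used here only as one convenient way to produce delimited text.

```python
import csv
import io

# Hypothetical records: respondent id, municipality, age.
records = [["1", "Tampere", "34"], ["2", "Oulu", "51"]]

# Comma-separated values.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerows(records)
csv_text = csv_buffer.getvalue()

# Tab-separated values: the same writer with a different delimiter.
tab_buffer = io.StringIO()
csv.writer(tab_buffer, delimiter="\t", lineterminator="\n").writerows(records)
tab_text = tab_buffer.getvalue()

# Fixed width: each value is padded with blanks to an assumed column width.
widths = [3, 10, 3]
fixed_lines = [
    "".join(value.ljust(width) for value, width in zip(record, widths))
    for record in records
]


def parse_fixed(line, widths):
    """Split a fixed-width line into values using the column widths."""
    values, start = [], 0
    for width in widths:
        values.append(line[start:start + width].strip())
        start += width
    return values


print(csv_text, tab_text, fixed_lines, sep="\n")
```

A fixed-width file cannot be read back correctly unless the column widths are known, so they need to be recorded in the documentation that accompanies the data.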
Most of the qualitative data archived at FSD are textual data. These include diary entries or transcriptions of audio or video recordings, among others. Often the data also contain transcription instructions or writing prompts and instructions for research participants. Textual data can be analysed with text processing tools or software specifically designed for qualitative data analysis (e.g. Atlas.ti, NVivo).
Metadata describing the research data can also be stored in a document file format.
The most common document file formats are:
- TXT: Files saved as plain text (non-formatted) typically have the filename extension *.txt. Plain text files are also sometimes referred to as ASCII (American Standard Code for Information Interchange) text files after the character encoding standard used. Plain text is a good solution for long-term preservation, as plain text files can be opened by all text processing tools and text editors.
- ODT: OpenDocument Text (*.odt, *.fodt) is an ISO standardised, XML-based open file format. It is based on the ODF file format of the open-source office suite OpenOffice. Like Word documents, ODT files may contain rather complex formatting, tables, graphs and images. ODT file format is considered suitable for long-term preservation of data owing to its openness and interoperability.
- DOC/DOCX: DOC files (*.doc, *.docx) may contain complex formatting (styles, columns, emphases, colours) as well as tables, graphs and images alongside text. Microsoft Word is the surest option to open and display DOC/DOCX files correctly, but in recent years other text processing tools have increasingly begun supporting the XML-based DOCX. Because it is dependent on software, the format is not recommended for long-term preservation.
- RTF: Rich Text Format (*.rtf) is an interoperable alternative for storing documents. RTF files are usually ASCII plain text and most word processors are able to read and write them. RTF files can also be used across operating systems. For example, transferring an RTF file from Windows to Unix does not usually change the contents or formatting of the file. RTF typically supports basic formatting (italic type, boldface, and underlining), text alignment, font specification and document margins. However, RTF is not widely accepted as a format for long-term preservation, and RTF files containing images tend to be very large.
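The TXT item in the list above refers to ASCII. ASCII defines only 128 characters, so text containing letters such as ä or ö has to be saved in another encoding. The sketch below checks whether a string fits in ASCII and, as an assumption not made in the text above, uses UTF-8 as the alternative encoding. The helper name `is_ascii` and the sample strings are hypothetical.

```python
def is_ascii(text):
    """Return True if every character in the text belongs to the ASCII set."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


english = "Interview transcript"
finnish = "Jyväskylä"

print(is_ascii(english), is_ascii(finnish))

# In UTF-8 (assumed here), ASCII characters take one byte each and "ä" takes two.
utf8_bytes = finnish.encode("utf-8")
print(len(finnish), len(utf8_bytes))
```

Recording the character encoding used alongside a plain text file helps later users open it without garbled characters.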
Research data may consist fully or partly of visual image files. For instance, research subjects may be shown pictures to stimulate conversation or researchers may study pictures on magazine covers or on front pages of newspapers. The most common image formats are:
- JPEG (Joint Photographic Experts Group) is suited for storing images and photographs that are published online, because they retain their colour information. The size of images can be adjusted, but details are lost when images are resized and edited. JPEG is a solid format for photograph data, as JPEG files do not take up a lot of space and transferring them is easy. Images in textual data at FSD are delivered to users in JPEG format. JPEG 2000 also offers the option of lossless compression.
- TIFF (Tagged Image File Format) retains all information on the image and its colours and is not dependent on operating system. Both of these features make TIFF a good choice for long-term preservation when it is important that the digitised images correspond to the originals as closely as possible. TIFF images take up a great deal of space, but there are various ways to compress them.
- PNG (Portable Network Graphics) was developed to replace the GIF format. It is well suited for web images and particularly for graphs and figures.
- GIF (Graphics Interchange Format) is suitable for images published on websites, as all browsers support it. The format compresses files and only allows up to 256 colours. The format is not designed for long-term preservation.
- BMP (Bitmap) is a format similar to TIFF and developed for Windows environments. Because it is not compatible with all operating systems, it is not recommended for long-term preservation.
Research data increasingly include or consist of recorded interviews. File formats for audio and video are dependent on operating systems and constantly changing. When doing research, the file formats used by recording devices are often sufficient, but for long-term preservation the files are usually converted. The most common formats are:
- WAV (Waveform Audio File Format) is an uncompressed audio file format that retains good audio quality when high bitrates are used. WAV is a recommended format for long-term preservation when a very high quality of audio is important. However, using WAV requires a large amount of storage space and transferring the data may be slow.
- MPEG-1/2 Audio Layer III (MP3) compresses audio heavily, so files are significantly smaller than WAV files. The compression mostly reduces or discards information in frequencies that the human ear cannot perceive. MP3 is suitable for preserving research data containing audio, while MPEG-2 is suited for storing both audio and high definition video.
- MPEG-4 (H.264, mp4) is a standard for a group of audio and video coding formats that can be used for digital video and interactive multimedia, among others. MPEG-4 is often used in video cameras today.
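The difference in storage needs between uncompressed and compressed audio in the list above can be estimated with simple arithmetic. The sketch below assumes a one-hour stereo interview recorded at 44,100 Hz with 16 bits per sample and an MP3 bitrate of 128 kbit/s. These parameters and the function names are illustrative assumptions, not recommendations from the text above.

```python
def uncompressed_audio_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size of raw uncompressed audio data, excluding the small file header."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds


def compressed_audio_bytes(bitrate_bits_per_second, seconds):
    """Approximate size of audio encoded at a constant bitrate."""
    return bitrate_bits_per_second // 8 * seconds


one_hour = 60 * 60  # duration in seconds

# Assumed recording settings: 44,100 Hz, 16-bit, two channels.
wav_size = uncompressed_audio_bytes(44_100, 16, 2, one_hour)

# Assumed MP3 bitrate: 128 kbit/s.
mp3_size = compressed_audio_bytes(128_000, one_hour)

print(wav_size / 1_000_000, "MB uncompressed")
print(mp3_size / 1_000_000, "MB compressed")
```

Under these assumptions the uncompressed recording is roughly eleven times larger than the compressed one, which is the trade-off between audio quality and storage space described for WAV and MP3.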