Issue 25 (3/2008)
1.12.2008

ISSN 1795-5262

FSD Bulletin is the electronic newsletter of the Finnish Social Science Data Archive. The Bulletin provides information and news related to the data archive and social science research.

Finnish Social Science Data Archive
E-mail: fsd@tuni.fi

Corpus Offers Data on Language Skills

Päivi Vännilä

The corpus compiled by Mirja Tarnanen is a collection of Finns' and Finnish-speaking immigrants' background information, language proficiency assessments, and performance samples. It is interesting not only in terms of language policy but also in terms of education.

- The language proficiency levels of different languages are comparable, because they are based on the same language proficiency scale. This is unique in Finland and also interesting on an international scale, according to Mirja Tarnanen, a specialist researcher at the Centre for Applied Language Studies at the University of Jyväskylä.

The corpus consists of data gathered from the Finnish National Foreign Language Certificate examinations. The test takers are adults who need a language certificate, mostly to apply for work or citizenship. The corpus incorporates both qualitative and quantitative data. The quantitative part includes the language proficiency assessments and background information of the test takers, and the qualitative part contains the responses to the speaking and writing subtests. The data cover nine languages: English, Spanish, Italian, French, Swedish, Sami, German, Finnish and Russian.

- One can find answers to many interesting questions in the data. For example, one could combine the background information with the language proficiency levels, or both of them with the writing subtests, to ask questions such as "What seems to be the level of English proficiency among Finnish women aged over 40?" or "How good are native Russian speakers at writing email messages?", Mirja Tarnanen says.

Corpus to be developed further

The corpus is a growing entity, and more data are being added to it as new people complete the test. The corpus interface enables the users to send feedback while browsing the data.

- Since the corpus contains plenty of diverse data, both quantitative and qualitative, the interface has numerous search methods. This makes it challenging to use, and anyone who uses the corpus must read the search instructions carefully. In addition, searching for data becomes easier if one has outlined the research questions before running the searches, Tarnanen says.

Teachers should familiarise themselves with the description of the contents of the corpus as well as with the interface in order to find out how their students could benefit from it. Thesis writers in particular should formulate their research questions beforehand in order to avoid getting lost in the maze of search options.

Other corpora exist

The Cambridge Learner Corpus (CLC) has been compiled in the UK from data similar to those in the Finnish corpus. It contains considerably more data than its Finnish counterpart, and its performance data are collected from different tests, but all in the same language.

The CLC is a tagged corpus, whereas the Finnish corpus is not. In other words, the texts in the Finnish corpus are in raw form and therefore do not contain any metadata on, for instance, parts of speech. At the moment, there is no open access to the CLC, while the Finnish corpus is freely available.

Corpus based on valuable data

- The Finnish corpus was founded because there is no sense in packing valuable research material into cardboard boxes people cannot access. Material from the Finnish National Foreign Language Certificate examinations has been sought after, and the corpus will meet this need by facilitating access to the data for research and teaching purposes. Using the corpus will be easier if the end users of the data also do the searches themselves, Mirja Tarnanen says.

The test data are confidential and therefore the corpus contains only the kind of information which does not violate test takers' privacy. The test takers' performance samples, background information, language proficiency assessments, and speaking and writing subtests are linked by individual ID numbers. Information which has not been authorised for reuse by the test takers will not be saved in the corpus.
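As a rough illustration of what linking by ID numbers makes possible, the sketch below joins background information to proficiency assessments and then answers a question of the kind Tarnanen mentions. It is not the corpus interface: the field names, the numeric levels and the three sample records are all invented for the example.

```python
from statistics import mean

# Hypothetical records. Field names and values are invented for illustration
# and do not reflect the actual structure of the corpus.
background = [
    {"id": 1, "gender": "F", "age": 45, "native_language": "Finnish"},
    {"id": 2, "gender": "M", "age": 31, "native_language": "Russian"},
    {"id": 3, "gender": "F", "age": 52, "native_language": "Finnish"},
]
assessments = [
    {"id": 1, "test_language": "English", "level": 4},
    {"id": 2, "test_language": "Finnish", "level": 3},
    {"id": 3, "test_language": "English", "level": 2},
]


def link_by_id(background_rows, assessment_rows):
    """Attach each assessment to the background record with the same ID."""
    by_id = {row["id"]: row for row in background_rows}
    linked_rows = []
    for assessment in assessment_rows:
        person = by_id.get(assessment["id"])
        if person is not None:  # skip assessments without background data
            linked_rows.append({**person, **assessment})
    return linked_rows


linked = link_by_id(background, assessments)

# Example question: English proficiency among Finnish-speaking women over 40.
selected = [
    row["level"]
    for row in linked
    if row["gender"] == "F"
    and row["age"] > 40
    and row["native_language"] == "Finnish"
    and row["test_language"] == "English"
]
average_level = mean(selected)
print(average_level)
```

The same shared ID could in principle link the writing or speaking samples as well, which is what would allow the quantitative and qualitative parts to be searched together.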

The writing subtests of certain languages in the corpus date back to early 2002, whereas the speaking subtests and background information have been collected from the autumn of 2004 onwards. At the moment, the corpus contains the language proficiency assessments and background information of about 14,000 persons.

The corpus has received funding from the Academy of Finland, because it has been part of the Human Technology infrastructure project of the University of Jyväskylä, which is funded by the Academy. The Centre for Applied Language Studies has also funded the corpus.

The corpus is available as a web database. Accessing the corpus requires a user name and a password, which can be ordered from the FSD. Users must send an access application and an agreement on material use conditions to the FSD in order to receive a user name.
