Swedish Test Data for SemEval 2020 Task 1

Nina Tahmasebi, Simon Hengchen, Dominik Schlechtweg, Barbara McGillivray, Haim Dubossarsky

Feb 19, 2020

Abstract

This data collection contains the Swedish test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. It consists of a Swedish text corpus pair (corpus1/, corpus2/) and 31 lemmas that have been annotated for their lexical semantic change between the two corpora (targets.txt). We sample from the KubHist2 corpus, which was digitized by the National Library of Sweden and is available through the Språkbanken corpus infrastructure Korp (Borin et al., 2012). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with that lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist2 contains very frequent OCR errors, especially in the older data. More detail about the properties and quality of the KubHist corpus can be found in Adesam et al. (2019).
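As a rough illustration of how the layout described above might be used, the following Python sketch counts how often each target lemma occurs in the two corpora. It rests on several assumptions that the description does not confirm: that targets.txt holds one lemma per line, that the files under corpus1/ and corpus2/ are UTF-8 text (optionally gzip-compressed) with whitespace-separated tokens, and that the function names are purely illustrative. Tokens are compared without lower-casing, since unlemmatized words keep their original form.

```python
import gzip
from collections import Counter
from pathlib import Path


def read_targets(path):
    """Return the target lemmas, assuming one lemma per line (assumption)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def open_text(file):
    """Open a corpus file as UTF-8 text; .gz files are decompressed (assumption)."""
    if file.suffix == ".gz":
        return gzip.open(file, "rt", encoding="utf-8")
    return open(file, encoding="utf-8")


def count_tokens(corpus_dir):
    """Count whitespace-separated tokens in every file under corpus_dir.

    No lower-casing is applied, so "Krona" and "krona" are counted separately.
    """
    counts = Counter()
    for file in sorted(Path(corpus_dir).rglob("*")):
        if not file.is_file():
            continue
        with open_text(file) as f:
            for line in f:
                counts.update(line.split())
    return counts


def target_frequencies(data_dir):
    """Map each target lemma to its (corpus1, corpus2) token counts."""
    data_dir = Path(data_dir)
    targets = read_targets(data_dir / "targets.txt")
    c1 = count_tokens(data_dir / "corpus1")
    c2 = count_tokens(data_dir / "corpus2")
    return {t: (c1[t], c2[t]) for t in targets}
```

Calling `target_frequencies("path/to/data")` on a directory that contains corpus1/, corpus2/ and targets.txt would return a dictionary keyed by lemma. Because OCR errors are frequent, especially in the older data, raw counts of this kind are likely to underestimate true frequencies.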

Type

Wp2: Semantic Change Evaluation

Publication

Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Date

February, 2020
