Towards Computational Lexical Semantic Change Detection

Project Description

In this project, we worked on automatic, corpus-based methods for detecting semantic change and lexical replacement for Swedish and English. The project was the first of its kind. We investigated some of the fundamental questions of how, when, and why languages change to allow us to quantify language change and shift lexical typological research from small case studies done on limited data sets to larger scales and over wider time spans using various media types and sources.

The team involved in the project has changed over time. The original members, Richard Johansson, Susanne Vejdemo and Maria Koptjevskaja Tamm, were joined by Simon Hengchen, Dominik Schlechtweg, and Pierluigi Cassotti over the course of the project. We have also had several software engineers and invited guest researchers.

During this project, we started the LChange workshop series, from its first resounding success with LChange’19 to its follow-ups LChange’21, LChange’22 and LChange’23. The series will continue long after this project closes. We have also participated in the first ever shared task on lexical semantic change, SemEval2020 Task 1 on Unsupervised Lexical Semantic Change Detection. There have been multiple outcomes relating to this work, including several manually annotated datasets, among them the largest resource of graded, contextualized, diachronic word meaning annotation, in particular for Swedish, and a paper describing the resources at EMNLP’21. Since our task at SemEval2020, there have been many follow-ups, including for Italian, Russian, Spanish, Norwegian, and Chinese. We published the first book on computational approaches to semantic change and the first major survey in the field of lexical semantic change (LSC). We have also written book chapters to be included in the Routledge Handbook of Historical Linguistics and The Wiley Blackwell Companion to Diachronic Linguistics. As a part of our work, we contributed to the development of the DURel Annotation Tool, which is used to annotate sentence pairs (in graded Word-in-Context style), to form sense clusters of a word, and to visualize them over time. This can reveal lexical properties such as polysemy, vagueness or change of meaning of a word and is crucial for our continued study of semantic change.

This project led to a large research program, Change is Key!, on computational semantic change for the humanities and social sciences, funded by Riksbankens Jubileumsfond. The program spans six years starting in 2022, with a total of 17 researchers, a research engineer and six partner universities. It is a true testament to the progress that was made during this project.

The project was funded for 2019–2022 and was extended due to Covid-19. In early 2024, we will finally close the project and thus stop updating this webpage. We thank our funder, the Swedish Research Council, for funding us under project number 2018-01184.

News

May 2022 – May 2022

Dublin, Ireland

LChange’22 in conjunction with ACL2022

LChange’22

LChange’22 is the third workshop for computational approaches to historical language change with the focus on digital text corpora. This year, the workshop will feature a shared task on semantic change detection for Spanish.

Come join us for this exciting adventure!

August 2021 – Present

EMNLP'21

DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

EMNLP’21

Our paper was accepted for EMNLP’21. In it, we describe the manually annotated dataset used for SemEval2020 Task 1. This dataset contains the largest resource of graded, contextualized, diachronic word meaning annotation in four different languages, namely English, German, Swedish, and Latin, based on 100,000 human semantic proximity judgments. The Swedish DWUG dataset is freely available on Zenodo.

August 2021 – Present

online

Book on Computational Approaches to Semantic Change

LangSci

Our book on Computational approaches to semantic change is finally published and openly available. There are eleven chapters, ranging from a survey of computational methods for semantic change, through methods for and applications of semantic change, to a chapter on future challenges.

August 2021 – August 2021

Bangkok, Thailand

LChange’21 in conjunction with ACL2021

LChange’21

In this second LChange workshop, we will bring together the world’s pioneers and experts in computational approaches to historical language change with the focus on digital text corpora. In doing so, this workshop carries the triple goals of disseminating the state-of-the-art research on diachronic modelling of language change, fostering international cross-disciplinary collaborations, and exploring the fundamental theoretical and methodological challenges in this growing niche of computational linguistic research.

Papers can be found in the LChange’21 proceedings.

November 2020 – November 2020

Gothenburg, Sweden

2nd Workshop on Computational Detection of Language Change 2020

CDLC’2020

This second workshop on computational detection of language change is held in conjunction with SLTC 2020 and will be given completely online. It is intended as a smaller, Swedish version of the LChange workshop series, but has drawn international attention.

The keynotes are extremely interesting and are given by Dominik Schlechtweg, IMS Stuttgart, on Sparse Usage Graphs as Model for Word Meaning in Context, and by Stellan Petersson & Emma Sköldberg, University of Gothenburg on Lexical Semantics and Semantic Change - a Report from the Lexicographer’s Shop Floor.

Tune in and register via the SLTC website.

September 2019 – December 2020

COLING2020

Unsupervised Lexical Semantic Change Detection Challenge

SemEval2020

We participated in SemEval2020 with a task on unsupervised lexical semantic change detection for English, German, Swedish and Latin, together with Barbara McGillivray, Dominik Schlechtweg, Simon Hengchen, and Haim Dubossarsky. The Swedish test data is available here, and the post-evaluation data for all participating languages is available here. Our task description paper is now published.

The workshop will be held online on December 12-13. Come and join us!

August 2019 – August 2019

Helsinki

The two week-long paper

ACL2019

Our paper Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change got accepted at ACL2019!

Read about how it came about in The two week-long paper.

Selected Publications

Routledge Handbook of Historical Linguistics, 2nd Edition

In this chapter we provide an overview of computational modeling for semantic change using large and semi-large textual corpora

Nina Tahmasebi, Haim Dubossarsky

PDF

LSCDiscovery: A shared task on semantic change discovery and detection in Spanish

We present the first shared task on semantic change discovery and detection in Spanish. We create the first dataset of Spanish words …

Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg

PDF

DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

In this paper, we describe the creation of the largest resource of graded contextualized, diachronic word meaning annotation in four …

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray

PDF

Computational approaches to semantic change

This volume offers a survey of this exciting new direction in the study of semantic change, a discussion of the many remaining …

Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, Simon Hengchen

PDF

Survey of computational approaches to lexical semantic change detection

This article provides a comprehensive survey of recent computational techniques to tackle both diachronic conceptual change (semantic …

Nina Tahmasebi, Lars Borin, Adam Jatowt

PDF

Challenges for computational lexical semantic change

In this chapter, we aim to describe the most important of these challenges and outline future directions for the field of lexical …

Simon Hengchen, Nina Tahmasebi, Dominik Schlechtweg, Haim Dubossarsky

PDF

SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, …

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, Nina Tahmasebi

PDF

Post-Evaluation Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

This data collection contains the post-evaluation data for SemEval-2020 Task 1

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, Nina Tahmasebi

DOI

Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change

State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We have …

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, Dominik Schlechtweg

PDF

Models and Algorithms for Automatic Detection of Language Evolution

Nina Tahmasebi

PDF

Recent Publications

(Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection

In the universe of Natural Language Processing, Transformer-based language models like BERT and (Chat)GPT have emerged as lexical …

Francesco Periti, Haim Dubossarsky, Nina Tahmasebi

PDF

EMNLP 2023

We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating Swedish language models, a counterpart to the …

Aleksandrs Berdicevskis, Gerlof Bouma, Robin Kurtz, Felix Morger, Joey Öhman, Yvonne Adesam, Lars Borin, Dana Dannélls, Markus Forsberg, Tim Isbister, Anna Lindahl, Martin Malmsten, Faton Rekathati, Magnus Sahlgren, Elena Volodina, Love Börjeson, Simon Hengchen, Nina Tahmasebi

PDF

ChiWUG: A Graph-based Evaluation Dataset for Chinese Lexical Semantic Change Detection

We present the compilation of the first graph-based evaluation dataset for lexical semantic change in the context of the Chinese …

Jing Chen, Emmanuele Chersoni, Dominik Schlechtweg, Jelena Prokic, Chu-Ren Huang

PDF

Incremental Semantic Shift Detection

In this paper, we introduce a novel approach to tracing the evolution of word meaning over time.

Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi

PDF

Routledge Handbook of Historical Linguistics, 2nd Edition

In this chapter we provide an overview of computational modeling for semantic change using large and semi-large textual corpora

Nina Tahmasebi, Haim Dubossarsky

PDF

Nodalida 2023

In this work we investigate the hypothesis that enriching contextualized models using fine-tuning tasks can improve their capacity to …

Wei Zhou, Nina Tahmasebi, Haim Dubossarsky

PDF

LSCDiscovery: A shared task on semantic change discovery and detection in Spanish

We present the first shared task on semantic change discovery and detection in Spanish. We create the first dataset of Spanish words …

Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg

PDF

DiaWUG: A Dataset for Diatopic Lexical Semantic Variation in Spanish

We provide a novel dataset – DiaWUG – with judgements on diatopic lexical semantic variation for six Spanish variants in Europe and …

Gioia Baldissin, Dominik Schlechtweg, Sabine Schulte im Walde

PDF

DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

In this paper, we describe the creation of the largest resource of graded contextualized, diachronic word meaning annotation in four …

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray

PDF

Computational approaches to semantic change

This volume offers a survey of this exciting new direction in the study of semantic change, a discussion of the many remaining …

Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, Simon Hengchen

PDF

Recent & Upcoming Talks

An introduction to lexical semantic change

An introduction to LSC at IMS, Stuttgart, 2020 at the Online Workshop on Automatic Detection of Semantic Change

Oct 27, 2020 11:30 AM IMS, Stuttgart, 2020

Nina Tahmasebi

Slides

Unsupervised Lexical Semantic Change Detection – Past, Present & Future

This talk has unfortunately been cancelled.

Mar 25, 2020 10:45 AM Language Change: Theoretical and Empirical Perspectives Conference

Nina Tahmasebi

Discovering Semantic Shifts Using Diachronic Word Sense Embeddings

Lectures on Language Technology, Uppsala University, 2019

Nov 26, 2019 10:45 AM Uppsala University, 2019

Richard Johansson

On Lexical Semantic Change and Evaluation

A presentation at IMS, Stuttgart, 2019

Jun 25, 2019 11:30 AM IMS, Stuttgart, 2019

Nina Tahmasebi

PDF

Workshop on Lexical Semantic Change 2019

A presentation of LSC in Helsinki, 2019

Feb 18, 2019 1:00 PM Workshop on Lexical Semantic Change, Helsinki, 2019

Nina Tahmasebi

PDF

Estonian Digital Humanities Conference 2018

A Keynote for EDH2018 in Tartu.

Sep 27, 2018 9:30 AM 6th Estonian Digital Humanities Conference

Nina Tahmasebi

Slides

Project description

The most common word for ‘young female human’ changed from maiden in Old and Middle English to girl in Modern English. This is a case of lexical replacement: a bundle of semantic material is first symbolized by one word, and later in time by another word (onomasiology). Parallel to this, girl ‘young person’ came to mean ‘young female person’. This is a case of semantic change: a word stays the same over time while the semantic material it symbolizes changes (semasiology). It is often useful to talk about the semantic material of a word as clustering into several sub-meanings: senses. Senses can be added, removed or changed. A particularly interesting kind of alteration is positive/negative sentiment change: while the morpheme skit in skitdag ’shitty, bad day’ has a negative connotation, in the last few decades it has acquired a positive connotation as an intensifier in words like skitgott ’really good’. Sentiment analysis is increasingly important for commercial and political research and can greatly benefit from automatically handling lexical and semantic change.

All these intertwined processes make lexical and semantic change highly complex problems that rely on defining a particular sense (and on the allocation of senses to words), problems that are considered AI-complete, i.e., equivalent to making computers as intelligent as people. Recent NLP advances based on the distributional hypothesis of meaning have proven extremely useful in assisting researchers in untangling semantic change (SC) processes. The distributional hypothesis links semantic similarity to distributional similarity: words that appear in similar contexts tend to have similar meanings, so meaning can be induced from the contexts in which a word appears. Automatically induced senses are approximations of an underlying word sense and vary naturally depending on which sentences are used for the sense induction. A great challenge in automatic change detection is determining when two induced senses (for the same word at different times) are natural variations and when the differences represent sense change (cf. okkasionelle and usuelle Bedeutung in Paul (1886)).
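The distributional hypothesis can be illustrated with a minimal sketch. The code below is not the project's method; it only shows the general idea, assuming a simple bag-of-words context window and cosine similarity. The function names (`context_vector`, `cosine`) and the toy sentences are invented for illustration. It also hints at the challenge mentioned above: a similarity below 1 may reflect either natural variation between samples or actual sense change.

```python
import math
from collections import Counter


def context_vector(sentences, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Add every neighbour in the window except the target itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy, invented usages of "girl" from two hypothetical time periods.
early = ["the young girl and boy played", "a girl or boy of ten years"]
late = ["the girl wore her dress", "a girl and her mother"]

same = cosine(context_vector(early, "girl"), context_vector(early, "girl"))
shift = cosine(context_vector(early, "girl"), context_vector(late, "girl"))
print(f"same period: {same:.2f}, across periods: {shift:.2f}")
```

On real corpora, a single vector per word conflates all of its senses, which is one reason the project works with induced senses rather than one representation per word.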

The methods developed in this project will go beyond the state-of-the-art in the field in several aspects. Previously, semantic change detection projects have primarily focused on (i) a limited number of change types, e.g., only the birth of senses; (ii) a few (far apart) time points, e.g., 50-year slots; (iii) methods that find signals for change without differentiating between change types or separating the senses of a word (i.e., one topic/vector/cluster per word); and (iv) words in isolation, not their interplay within a semantic field. Existing techniques reduce complexity severely because, with yearly time buckets over two centuries and up to 5 senses per time period, the solution space is on the order of $5^{200}$, which is impossible to compute and evaluate. This project will build on the promising reduction techniques described by Tahmasebi and Risse (2017) to enable us to answer the what, how and when questions in full, and create a complete picture of all changes related to a word and its semantic field.
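To make the size of this solution space concrete, the estimate can be worked through as a short calculation. This sketch assumes one reading of the figure above: one of up to 5 candidate senses is chosen in each of 200 yearly buckets, giving 5 to the power of 200 combinations. The function name `solution_space` is hypothetical.

```python
def solution_space(max_senses, buckets):
    """Number of ways to pick one of `max_senses` senses in each time bucket."""
    return max_senses ** buckets


# Yearly buckets over two centuries, up to 5 senses per bucket.
yearly = solution_space(5, 200)

# The same two centuries in 50-year slots, for comparison.
coarse = solution_space(5, 200 // 50)

print(f"yearly buckets: a {len(str(yearly))}-digit number of combinations")
print(f"50-year slots: {coarse} combinations")
```

Under this reading, coarse 50-year slots yield only 625 combinations, while yearly buckets yield a 140-digit number, which is why exhaustive evaluation is out of reach without reduction techniques.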

For automatic lexical replacement (LR), we will set the state-of-the-art simply because there is almost no existing research. The problem is extremely complex because words must be linked based on their (stable) senses. In this project, we have a unique opportunity to study the LR problem because we are one of the few research groups that will attempt to solve the problem of word sense change first. We will begin by working on word sense induction for Swedish, as this is the core of our methodology. Once we can induce word senses automatically, we can begin to detect change in senses (SC) and then, as a third step, find word replacement (LR). Using these tools, we can study the varying speed and different processes of LR and SC in, e.g., different parts of the vocabulary and during different time periods. Thus far, methods for detecting sentiment change, i.e., words changing their sentiment value (Cook and Stevenson, 2010; Nguyen et al., 2012), have not differentiated between different senses: only the predominant value of a word has been considered. We will be able to overcome this hurdle by first solving word sense change. We build on ongoing efforts at Språkbanken and our sentiment lexicon, SenSaldo, to tackle diachronic sentiment analysis for Swedish.

For all kinds of change targeted in this project, we will provide textual evidence to support our claims, e.g., example sentences for each sense, to help users evaluate and understand the results. This is a prerequisite for uptake in the research community, in particular in the digital humanities and social sciences (DHSS).

We envision several use cases that will help researchers and nonprofessional users to study language changes themselves, enabling them to search and explore archival content and improve downstream large-scale text mining applications. For researchers that are interested in language changes in general, our results will offer answers to what has changed as well as how and when it changed. It will also be possible to answer more complex questions like how is change in one word connected to changes in others in the same semantic field?

For researchers who have an interest in the resources but not necessarily in the changes themselves, e.g., researchers in DHSS, our methods will help to gather evidence for concepts by finding linked vocabulary and their senses. For example, the word handikappad ’handicapped’ has been replaced over time (handikappad $\rightarrow$ funktionshindrad $\rightarrow$ funktionsnedsatt $\rightarrow$ funktionsvariation). The replacement aims to remove negatively connotated senses, but from the continuous replacements, we know that these senses eventually catch up with each new word. Upon completion of this project, it will be possible to study lexical replacements on the one hand and to tie them to semantic change on the other. For example, does a new word like funktionshindrad take over a subset of the senses of the previous word at first and then later add the negative senses? How fast are the negative senses added for each lexical replacement, and does this speed up with additional replacements? Do we include more or less in each sense, or add new senses over time?

The methods developed in the project will be generally applicable, but our primary target language is Swedish, with equal focus on historical and modern text. Examples include Swedish historical newspapers (Kubhist, 1750-1925), books (the Literature bank), parliamentary data (SUC, SOU) and modern newspapers, but also social media text, where there is evidence of high linguistic diversity and creativity (Goel et al., 2016). This makes the availability of large amounts of Swedish text crucial to the project. Språkbanken continuously collects texts written in Swedish; to date there are over 10 billion words of modern Swedish (e.g. fiction, news, politics and social media) and over one billion words of historical news materials. We will also use digital lexical resources, like SALDO, Svedbergs, Dahlin and SAOL.

We will replicate our research using English text such as the Corpus of Historical American English and Google Books. We will extend previous studies and quantify hypotheses regarding lexical and semantic change, work that can feed back into our tools for quality assessment and improvement.

Contact

Contact us and sign up for our mailing list.

If you want to collaborate with us on a Master's thesis project or a PhD project, send us an email with a short project proposal.

languagechange!@flov.gu[remove].se
