For the last decade, automatic detection of word sense change has primarily focused on the dominant change in a word's meaning. Most current methods rely on new, powerful embedding technologies but do not differentiate between the different senses of a word, a distinction that is needed in many applications in the digital humanities. Of course, this simplification radically reduces the complexity of the task, but it often fails to answer questions such as: what changed, how did it change, and when did the change occur?
In this talk, I will present methods for automatically detecting sense change from large amounts of diachronic data. I will focus on a study of a historical Swedish newspaper corpus, the Kubhist dataset, which contains digitized Swedish newspapers from 1749 to 1925. I will present our work on detecting and correcting OCR errors, normalizing spelling variation, and creating representations for individual words using a popular neural embedding method, namely Word2Vec.
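As background for the embedding step, the sketch below illustrates the widely used time-sliced setup with gensim's Word2Vec: train one model per period and compare a word's nearest neighbours across periods. This is a minimal illustration, not the actual Kubhist pipeline; the toy slices, parameter values, and the target word "tidning" are hypothetical placeholders.

```python
# Minimal time-sliced Word2Vec sketch (illustrative, not the talk's pipeline):
# train a separate model per period and compare a word's neighbourhoods.
from gensim.models import Word2Vec

def train_slice(sentences):
    # Toy settings for tiny data; real diachronic corpora need a larger
    # min_count and corpus-appropriate epochs.
    return Word2Vec(sentences, vector_size=50, window=5,
                    min_count=1, workers=1, epochs=20, seed=1)

# Hypothetical OCR-corrected, spelling-normalized, tokenized slices.
slice_1800s = [["telegraf", "nyhet", "tidning"],
               ["tidning", "krig", "nyhet"]]
slice_1900s = [["telefon", "nyhet", "tidning"],
               ["tidning", "sport", "nyhet"]]

model_early = train_slice(slice_1800s)
model_late = train_slice(slice_1900s)

# A shift in a word's nearest neighbours between periods is a simple
# signal of meaning change; sense-level analysis requires more than this.
word = "tidning"
print(model_early.wv.most_similar(word, topn=3))
print(model_late.wv.most_similar(word, topn=3))
```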
Methods for creating (neural) word embeddings are the state of the art in sense change detection, as in many other areas of study, but they have mainly been studied on English corpora, where the datasets are sufficiently large. I will discuss the limitations of such methods in this particular context: fairly small datasets with a high error rate, as is common in historical material for most languages. In addition, I will discuss the particularities of text mining methods for the digital humanities and what is needed to bridge the gap between computer science and the digital humanities.