This repository contains data and code for reproducing a study on the use of arch- in present-day English. The data come from the ENCOW corpus (Schäfer & Bildhauer 2012), described here.
Preliminaries
Loading packages:
```r
# install packages if needed
library(tidyverse)
library(remotes)
library(devtools)
# devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(vroom)
library(pbapply)
library(readxl)
library(ggrepel)
library(cluster)
library(ggiraph)
library(renv)

# snapshot of packages
# renv::snapshot()
```
Data wrangling
Read in the data from ENCOW. The data are in two frequency lists, one for nouns and one for adjectives.
We remove false hits and add a column for the bases (i.e. the words without the prefix arch-). We then export an Excel sheet for manually correcting the base lemma annotation obtained in this way, and we re-import the file after annotation.
```r
# remove false hits
arch_adj <- filter(arch_adj, Keep == "y")
arch_nouns <- filter(arch_nouns, Keep == "y")

# get bases
arch_adj$Base_Lemma <- gsub("^arch-?", "", arch_adj$Token)
arch_nouns$Base_Lemma <- gsub("^arch-?", "", arch_nouns$Token)

# make sheet for annotating lexemes / correcting lemmas
# rbind(mutate(arch_adj, pos = "ADJ"), mutate(arch_nouns, pos = "N")) %>%
#   writexl::write_xlsx("arch_adj_n.xlsx")

# read in again
arch <- read_xlsx("data/arch_adj_n.xlsx")

# only keep true positives
arch <- filter(arch, Keep == "y")
```
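To see what the regular expression `^arch-?` does, here is a minimal self-contained sketch (the example tokens are made up, not from the corpus): the optional `-?` means the prefix is stripped whether or not it is written with a hyphen.

```r
# toy tokens illustrating hyphenated and unhyphenated spellings
tokens <- c("arch-enemy", "archenemy", "archrival")

# strip the prefix "arch", with an optional hyphen, at the start of the token
gsub("^arch-?", "", tokens)
# → "enemy" "enemy" "rival"
```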
Model
We re-use the word2vec model trained for Hartmann & Ungerer (2024). The model can be downloaded here.
```r
model <- readRDS("path/to/file")
```
Visualization
For visualizing the results, we obtain the relevant lemmas from the annotated spreadsheet and the corresponding vectors from the model, and we apply multidimensional scaling (MDS) to show the semantic distances in two-dimensional space. We also calculate the k-means clusters to find groups within the data. We determine the appropriate number of clusters via visual inspection of the scree plot following Levshina (2015).
```r
# get lemmas
lemmas <- sort(unique(arch$Base_Lemma))

# get cosine distances between the lemma vectors and apply MDS
cosDist_mds <- cosineDist(model[[tolower(lemmas), average = FALSE]],
                          model[[tolower(lemmas), average = FALSE]]) %>%
  cmdscale()

# scree plot for choosing the number of k-means clusters
plot(1:10,
     sapply(1:10, function(x) kmeans(cosDist_mds, x, nstart = 25)$tot.withinss),
     type = "b", ylab = "WCSS")
```
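The elbow heuristic used above can be illustrated without the word2vec model. The following sketch uses toy data (not from the study): it simulates two well-separated point clouds, runs classical MDS on their distance matrix, and computes the within-cluster sum of squares (WCSS) for k = 1 to 5. The sharp drop from k = 1 to k = 2, after which the curve flattens, is the "elbow" one looks for in the scree plot.

```r
set.seed(42)

# two artificial clusters of 10 points each, standing in for the lemma vectors
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

# classical MDS on the pairwise distances, as in the analysis above
mds <- cmdscale(dist(pts), k = 2)

# WCSS for k = 1..5
wcss <- sapply(1:5, function(k) kmeans(mds, k, nstart = 25)$tot.withinss)

# the drop from k = 1 to k = 2 is much larger than any later drop,
# so visual inspection of plot(1:5, wcss, type = "b") suggests k = 2
round(wcss, 1)
```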
Click on the magnifying glass symbol at the upper-right corner to pan and zoom.
References
Hartmann, Stefan & Tobias Ungerer. 2024. Attack of the snowclones: A corpus-based analysis of extravagant formulaic patterns. Journal of Linguistics 60(3). 599–634. https://doi.org/10.1017/S0022226723000117.
Levshina, Natalia. 2015. How to do linguistics with R. Data exploration and statistical analysis. Amsterdam, Philadelphia: John Benjamins.
Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Terry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 486–493. Istanbul: ELRA.