arch- in ENCOW

About this repository

This repository contains data and code for reproducing a study on the use of arch- in present-day English. The data come from the ENCOW corpus (Schäfer & Bildhauer 2012), described here.

Preliminaries

Loading packages:

# install packages if needed


library(tidyverse)
library(remotes)
library(devtools)
#devtools::install_github("bmschmidt/wordVectors")
library(wordVectors) 
library(vroom)
library(pbapply)
library(readxl)
library(ggrepel)
library(cluster)
library(ggiraph)
library(renv)

# Snapshot of packages
# renv::snapshot()

Data wrangling

Read in the data from ENCOW. The data are in two frequency lists, one for nouns and one for adjectives.

# read nouns
arch_nouns <- read_xlsx("data/COW16 nouns cleaned.xlsx")

# read adj
arch_adj   <- read_xlsx("data/COW16 adjectives cleaned.xlsx")

We remove false hits and add a column for the bases (i.e. the words without the prefix arch-). We then export an Excel sheet for manually correcting the base lemma annotation obtained this way, and re-import the file after annotation.

# remove false hits
arch_adj <- filter(arch_adj, Keep=="y")
arch_nouns <- filter(arch_nouns, Keep=="y")

# get bases
arch_adj$Base_Lemma <- gsub("^arch-?", "", arch_adj$Token)
arch_nouns$Base_Lemma <- gsub("^arch-?", "", arch_nouns$Token)
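The regular expression handles both hyphenated and solid spellings of the prefix; the tokens below are invented examples, not actual corpus hits:

```r
# "^arch-?" matches "arch" at the start of the token,
# followed by an optional hyphen
tokens <- c("arch-enemy", "archbishop", "arch-rival")
gsub("^arch-?", "", tokens)
# "enemy" "bishop" "rival"
```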

# make sheet for annotating lexemes / correcting lemmas
# rbind(mutate(arch_adj, pos = "ADJ"), mutate(arch_nouns, pos = "N")) %>% writexl::write_xlsx("arch_adj_n.xlsx")

# read in again
arch <- read_xlsx("data/arch_adj_n.xlsx")

# only keep true positives
arch <- filter(arch, Keep == "y")

Model

We re-use the word2vec model trained for Hartmann & Ungerer (2024). The model can be downloaded here.

model <- readRDS("path/to/file")

Visualization

For visualizing the results, we obtain the relevant lemmas from the annotated spreadsheet and the corresponding vectors from the model, and we apply multidimensional scaling (MDS) to show the semantic distances in two-dimensional space. We also calculate the k-means clusters to find groups within the data. We determine the appropriate number of clusters via visual inspection of the scree plot following Levshina (2015).

# get lemmas
lemmas <- sort(unique(arch$Base_Lemma))

# get Cosine distance
vecs <- model[[tolower(lemmas), average = FALSE]]
cosDist_mds <- cosineDist(vecs, vecs) %>% cmdscale()
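For readers unfamiliar with `cmdscale()`, a minimal base-R sketch on a toy distance matrix (three mutually equidistant points, not the actual lemma vectors) shows what classical MDS returns:

```r
# classical MDS (stats::cmdscale) on a toy 3x3 distance matrix;
# the result is one row of 2D coordinates per item, just as for
# the cosine-distance matrix above
d <- as.dist(matrix(c(0, 1, 1,
                      1, 0, 1,
                      1, 1, 0), nrow = 3,
                    dimnames = list(c("a", "b", "c"), c("a", "b", "c"))))
coords <- cmdscale(d, k = 2)
dim(coords)  # 3 items x 2 dimensions
```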

# get k-means clusters
plot(1:10, sapply(1:10, function(x) kmeans(cosDist_mds, x, nstart = 25)$tot.withinss),
     type = "b", xlab = "k", ylab = "WCSS")
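The elbow heuristic relies on the fact that the total within-cluster sum of squares (WCSS) shrinks as k grows; a self-contained toy example with simulated data (not the actual MDS coordinates):

```r
# WCSS is non-increasing in k; the "elbow" in its curve marks the
# point of diminishing returns when choosing the number of clusters
set.seed(1705)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
wcss <- sapply(1:4, function(k) kmeans(toy, k, nstart = 25)$tot.withinss)
all(diff(wcss) <= 0)
```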

k_means_clusters <- kmeans(cosDist_mds, 3, nstart = 25)

# as df
cosDist_mds <- cosDist_mds %>% as.data.frame() %>% rownames_to_column()
colnames(cosDist_mds)[1] <- "lemma"

# add frequency information
freqs <- rbind(arch_adj, arch_nouns) %>% select(Base_Lemma, Frequency) %>%
  group_by(Base_Lemma) %>%
  summarise(
    n = sum(Frequency)
  ) %>% setNames(c("lemma", "n"))

# multidimensional scaling
cosDist_mds <- left_join(cosDist_mds, freqs) %>% replace_na(list(n = 0))
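The `left_join()` + `replace_na()` step can be mimicked in base R; the hypothetical lemmas below illustrate how entries absent from the frequency table come out of the left join as NA and are set to 0:

```r
# base-R sketch of the join logic (merge with all.x = TRUE
# corresponds to a left join)
left  <- data.frame(lemma = c("enemy", "bishop", "rival"))
right <- data.frame(lemma = c("enemy", "bishop"), n = c(10, 7))
joined <- merge(left, right, by = "lemma", all.x = TRUE)
joined$n[is.na(joined$n)] <- 0  # "rival" has no frequency entry
```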

# also add log frequency
cosDist_mds$logFreq <- log1p(cosDist_mds$n)
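`log1p()` is used rather than plain `log()` so that the zero-frequency lemmas introduced in the previous step stay finite:

```r
# log1p(x) computes log(1 + x), so n = 0 maps to 0 rather than -Inf
log1p(0)            # 0
log(0)              # -Inf
log1p(c(0, 1, 99))  # 0, log(2), log(100)
```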

# add k-means clusters to df
cosDist_mds$kcluster <- as.factor(as.numeric(k_means_clusters$cluster))

Finally, create the plot:

# plot for publication

cosDist_mds %>% 
  filter(n > 4) %>% 
  ggplot(aes(x = V1, y = V2, size = n * 5, label = lemma)) +
  stat_ellipse(aes(group = as.factor(kcluster), fill = as.factor(kcluster)),
               geom = "polygon", alpha = 0.1, color = NA) +
  geom_text_repel(max.overlaps = 35, seed = 1705, segment.size = .2) +
  theme_bw() +
  guides(size = "none", col = "none", fill = "none") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_size_continuous(range = c(5, 15))

ggsave("images/archdistances_k_means.png", width = 14, height = 14, dpi = 600)
set.seed(1705)
p <- cosDist_mds %>% 
  # filter(n > 4) %>% 
  ggplot(aes(x = V1, y = V2, size = n * 5, label = lemma)) +
  stat_ellipse(aes(group = as.factor(kcluster), fill = as.factor(kcluster)),
               geom = "polygon", alpha = 0.1, color = NA) +
  geom_text() +
  theme_bw() +
  guides(size = "none", col = "none", fill = "none") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_size_continuous(range = c(1, 6))

g <- girafe(ggobj = p,
            options = list(
              opts_zoom(max = 5),   
              opts_sizing(rescale = TRUE),
              opts_toolbar(saveaspng = TRUE)
            ))

g

Click on the magnifying glass symbol at the upper-right corner to pan and zoom.

References

Hartmann, Stefan & Tobias Ungerer. 2024. Attack of the snowclones: A corpus-based analysis of extravagant formulaic patterns. Journal of Linguistics 60(3). 599–634. https://doi.org/10.1017/S0022226723000117.
Levshina, Natalia. 2015. How to do linguistics with R. Data exploration and statistical analysis. Amsterdam, Philadelphia: John Benjamins.
Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Terry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 486–493. Istanbul: European Language Resources Association (ELRA).

Session info

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] renv_1.1.5      ggiraph_0.9.2   cluster_2.1.8.1 ggrepel_0.9.6  
 [5] readxl_1.4.5    pbapply_1.7-4   vroom_1.6.7     wordVectors_2.0
 [9] devtools_2.4.6  usethis_3.2.1   remotes_2.5.0   lubridate_1.9.4
[13] forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4     purrr_1.2.0    
[17] readr_2.1.6     tidyr_1.3.2     tibble_3.3.0    ggplot2_4.0.1  
[21] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.55               htmlwidgets_1.6.4      
 [4] tzdb_0.5.0              vctrs_0.6.5             tools_4.5.1            
 [7] generics_0.1.4          parallel_4.5.1          pkgconfig_2.0.3        
[10] RColorBrewer_1.1-3      S7_0.2.1                lifecycle_1.0.4        
[13] compiler_4.5.1          farver_2.1.2            fontLiberation_0.1.0   
[16] fontquiver_0.2.1        htmltools_0.5.9         yaml_2.3.12            
[19] pillar_1.11.1           crayon_1.5.3            MASS_7.3-65            
[22] ellipsis_0.3.2          cachem_1.1.0            sessioninfo_1.2.3      
[25] fontBitstreamVera_0.1.1 tidyselect_1.2.1        digest_0.6.39          
[28] stringi_1.8.7           labeling_0.4.3          fastmap_1.2.0          
[31] grid_4.5.1              cli_3.6.5               magrittr_2.0.4         
[34] pkgbuild_1.4.8          withr_3.0.2             gdtools_0.4.4          
[37] scales_1.4.0            bit64_4.6.0-1           timechange_0.3.0       
[40] rmarkdown_2.30          bit_4.6.0               otel_0.2.0             
[43] cellranger_1.1.0        hms_1.1.4               memoise_2.0.1          
[46] evaluate_1.0.5          knitr_1.51              rlang_1.1.6            
[49] Rcpp_1.1.0              glue_1.8.0              pkgload_1.4.1          
[52] rstudioapi_0.17.1       jsonlite_2.0.0          R6_2.6.1               
[55] systemfonts_1.3.1       fs_1.6.6