This package allows language researchers to generate lexical competitors for a given set of words. Generating these competitors is useful both for experimental control (e.g., balancing a word list on known lexical competitor types) and for testing hypotheses about how lexical competitors influence language processing. The package includes many competitor types frequently studied in word recognition research, such as cohorts, neighbors, and rhymes, and it uses lexical dimensions such as lexical frequency to calculate measures like frequency-weighted competitor probabilities. Importantly, the package can be extended so that researchers can add novel competitor definitions suited to their research questions.
Suppose you want the cohorts of ark in a small word set (ark, art, and bab):
library(LexFindR)
# Get cohort index of ark in dictionary of ark, art and bab
target <- "AA R K"
lexicon <- c("AA R K", "AA R T", "B AA B")
cohort <- get_cohorts(target, lexicon)
cohort
#> [1] 1 2
# To get string rather than the index
lexicon[cohort]
#> [1] "AA R K" "AA R T"
# Get count
length(cohort)
#> [1] 2
# Get the log Frequency Weighted Competitor Probabilities
target_freq <- 50
lexicon_freq <- c(50, 274, 45)
get_fwcp(target_freq, lexicon_freq)
#> [1] 0.2934352
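To make the two quantities above concrete, here is a minimal base R sketch that reproduces them by hand. It assumes that a cohort is any word sharing the target's first two phonemes, and that the value returned by `get_fwcp` here is the target's log frequency divided by the summed log frequencies of the supplied set. Both assumptions are consistent with the output shown above, but the names `first_two`, `by_hand_cohort`, and `by_hand_fwcp` are hypothetical, and this is not the package's implementation.

```r
target  <- "AA R K"
lexicon <- c("AA R K", "AA R T", "B AA B")

# Hypothetical helper: first two phonemes of a space-separated form
first_two <- function(form) {
  phonemes <- strsplit(form, " ", fixed = TRUE)[[1]]
  paste(head(phonemes, 2), collapse = " ")
}

# Assumption: cohorts share the target's first two phonemes
onsets <- vapply(lexicon, first_two, character(1), USE.NAMES = FALSE)
by_hand_cohort <- which(onsets == first_two(target))
by_hand_cohort
#> [1] 1 2

# Assumption: ratio of log target frequency to summed log frequencies
target_freq  <- 50
lexicon_freq <- c(50, 274, 45)
by_hand_fwcp <- log(target_freq) / sum(log(lexicon_freq))
round(by_hand_fwcp, 4)
#> [1] 0.2934
```

For real analyses, prefer `get_cohorts` and `get_fwcp`; the sketch is only meant to show what the numbers represent.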
You can install the released version of LexFindR from CRAN with:
install.packages("LexFindR")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("maglab-uconn/LexFindR")
The package comes with two lexicons: the 212-word slex lexicon (with only 14 phonemes) from the TRACE model of spoken word recognition [@TRACE] as a small data set for the user to experiment with, and a larger lexicon (lemmalex) that we compiled from various open-access, non-copyrighted materials. Let’s say that we wanted to find the rhymes of ARK within the lemmalex lexicon.
Running the most basic version of the command will give us the rhyme indices:
get_rhymes("AA R K", lemmalex$Pronunciation)
#> [1] 767 1217 3826 7094 8785 9434 11073 14010
If we want to get the actual competitors, we can run the following, where the first command will show the forms and the second command will show the orthographic labels:
get_rhymes("AA R K", lemmalex$Pronunciation, form = TRUE)
#> [1] "AA R K" "B AA R K" "D AA R K" "HH AA R K" "L AA R K" "M AA R K"
#> [7] "P AA R K" "SH AA R K"
lemmalex[get_rhymes("AA R K", lemmalex$Pronunciation), ]$Item
#> [1] "arc" "bark" "dark" "hark" "lark" "mark" "park" "shark"
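Note that lexical stress markers must be stripped before searching. Otherwise, a vowel such as AA0 will not match AA1 or AA2, and competitors will be missed.
# If lexical stress is not stripped, no cohorts are found
target <- "AA0 R K"
lexicon <- c("AA1 R K", "AA2 R T", "B AA3 B")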
get_cohorts(target, lexicon)
#> integer(0)
# Strip lexical stress using regex
target <- gsub("\\d", "", target)
lexicon <- gsub("\\d", "", lexicon)
print(target)
#> [1] "AA R K"
print(lexicon)
#> [1] "AA R K" "AA R T" "B AA B"
get_cohorts(target, lexicon)
#> [1] 1 2
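Because leftover stress digits produce silently empty results rather than an error, it can help to check for them before searching. The following is a minimal base R sketch of such a guard; `has_stress` and `strip_stress` are hypothetical helper names, not functions provided by LexFindR, and the sketch assumes that digits appear in pronunciations only as stress markers.

```r
# Hypothetical helpers, not part of LexFindR
has_stress <- function(forms) any(grepl("[0-9]", forms))
strip_stress <- function(forms) gsub("[0-9]", "", forms)

lexicon <- c("AA1 R K", "AA2 R T", "B AA3 B")

# Strip only when stress digits are actually present
if (has_stress(lexicon)) {
  lexicon <- strip_stress(lexicon)
}
lexicon
#> [1] "AA R K" "AA R T" "B AA B"
```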
In the examples above, we had one target word that we were analyzing. Often, however, we will want to find competitors for each word in our lexicon. Using the lapply function, this is possible:
# define the list of target words to compute cohorts for
target_df <- slex
# specify the lexicon; here it is the same, as we want
# to find all cohorts for all words in our lexicon
lexicon_df <- target_df
# we create "cohort_idx", a list of indices
# corresponding to each word's cohort set
target_df$cohort_idx <-
  lapply(
    target_df$Pronunciation,
    FUN = get_cohorts,
    lexicon = lexicon_df$Pronunciation
  )
# to see the forms, create cohort_str
target_df$cohort_str <-
  lapply(
    target_df$cohort_idx, function(idx) {
      lexicon_df$Item[idx]
    }
  )
# to see frequencies for each target's cohort
target_df$cohort_freq <-
  lapply(
    target_df$cohort_idx, function(idx) {
      lexicon_df$Frequency[idx]
    }
  )
# to get the count of cohorts for each item
target_df$cohort_count <- lengths(target_df$cohort_str)
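The same pattern extends to frequency-weighted measures. As a hedged sketch, the per-target cohort frequencies can be passed to `get_fwcp` with `mapply`, which pairs each target frequency with that target's vector of cohort frequencies. The column name `cohort_fwcp` is an assumption made for illustration, the two arguments are passed positionally as in the earlier `get_fwcp` call, and the sketch assumes all frequencies in slex are positive so that the logs are finite.

```r
library(LexFindR)

target_df <- slex
lexicon_df <- target_df

# Indices and frequencies of each target's cohorts, as in the example above
cohort_idx <- lapply(
  target_df$Pronunciation,
  FUN = get_cohorts,
  lexicon = lexicon_df$Pronunciation
)
cohort_freq <- lapply(cohort_idx, function(idx) lexicon_df$Frequency[idx])

# Pair each target frequency with its own cohort frequencies
# (cohort_fwcp is a hypothetical column name)
target_df$cohort_fwcp <- mapply(get_fwcp, target_df$Frequency, cohort_freq)

head(target_df[, c("Item", "cohort_fwcp")])
```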
To get faster run times, we can use the future.apply package, which builds on the future framework, to engage multiple cores. Using the same example as above with a larger lexicon, we replace lapply with future_lapply:
library(future.apply)
# get the total number of cores
# num_cores <- availableCores()
# using two cores for demo
num_cores <- 2
plan(multisession, workers = num_cores)
# we use a larger lexicon here
target_df <- lemmalex
lexicon_df <- target_df
# get the indices of the cohorts
target_df$cohort_idx <-
  future_lapply(
    target_df$Pronunciation,
    FUN = get_cohorts,
    lexicon = lexicon_df$Pronunciation
  )
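As a small, package-independent illustration of the drop-in replacement described above, the sketch below runs the same toy function with `lapply` and `future_lapply`, confirms that the results agree, and then returns to sequential processing so the background workers are released. The squaring function is a stand-in chosen for illustration; it is not part of the LexFindR workflow.

```r
library(future.apply)

# Two background R sessions, as in the demo above
plan(multisession, workers = 2)

# Toy stand-in for a per-word competitor search
square <- function(i) i^2

res_seq <- lapply(1:4, square)
res_par <- future_lapply(1:4, square)

# Results come back in input order in both cases
stopifnot(identical(res_par, res_seq))

# Return to sequential processing and shut down the workers
plan(sequential)
```

Resetting the plan at the end is optional, but it avoids leaving idle worker sessions open after the computation finishes.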
As in the example about finding competitors in the whole lexicon, the above parallelization example can be applied to get the forms of the cohorts above, the frequencies, etc.
For a much more detailed discussion of the package and its features, refer to the LexFindR manuscript. Preprint: https://osf.io/preprints/psyarxiv/8dyru/. Open access: https://doi.org/10.3758/s13428-021-01667-6.
Li, Z., Crinnion, A.M. & Magnuson, J.S. (2021). LexFindR: A fast, simple, and extensible R package for finding similar words in a lexicon. Behavior Research Methods, 1-15. https://doi.org/10.3758/s13428-021-01667-6.