The spell.replacer package provides probabilistic spelling correction for character vectors in R. It uses the Jaro-Winkler string distance metric combined with word frequency data from the Corpus of Contemporary American English (COCA) to automatically correct misspelled words.

The main function is spell_replace(), which takes a character vector and returns it with corrected spellings:
# Example text with misspellings
text <- c("This is a smple text with some mispelled words.",
          "We can corect them automaticaly.")
# Apply spell correction
corrected_text <- spell_replace(text)
print(corrected_text)
#> [1] "This is a simple text with some spelled words."
#> [2] "We can correct them automatically."
The package uses a two-step process: first, it uses the hunspell package to identify words not found in standard dictionaries; second, it replaces each flagged word with the candidate that scores best on Jaro-Winkler distance, weighted by COCA word frequency.

You can adjust the correction behavior with several parameters:
# More restrictive threshold (fewer corrections)
conservative <- spell_replace(text, threshold = 0.08)
# Ignore potential proper names
text_with_names <- "John went to Bostan yesterday."
corrected_names <- spell_replace(text_with_names, ignore_names = TRUE)
print(corrected_names)
#> [1] "John went to Boston yesterday."
You can also correct individual words using the correct() function:
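The sketch below is a minimal illustration; it assumes correct() takes a single misspelled word and returns the best-scoring replacement (the word "recieve" is an example input, not taken from the package's documentation):

```r
# Hypothetical single-word usage: correct() is assumed to take one word
# and return the highest-ranked candidate from the COCA word list
correct("recieve")
```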
One of the main benefits of spell.replacer is that it integrates seamlessly with tidyverse workflows. You can easily apply spell correction to entire columns of text data:
library(dplyr)
# Example dataframe with text column
docs <- data.frame(
  id = 1:3,
  text = c("This docment has misspellings.",
           "Anothr exmple with erors.",
           "The finl text sampel.")
)
# Apply spell correction using tidy syntax
docs %>%
  mutate(text = spell_replace(text))
The package processes approximately 1,000 words per second, making it suitable for large-scale text processing tasks.
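As a rough illustration of that throughput, the sketch below times spell_replace() on a vector of roughly 1,000 words using base R's system.time(); actual performance will vary with hardware and input text:

```r
# Build a corpus of roughly 1,000 words by repeating a short sentence
corpus <- rep("Ths sentense has severl mispelled words.", 150)

# Time the correction; at ~1,000 words/second the elapsed time
# should be on the order of a few seconds
system.time(corrected <- spell_replace(corpus))
```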
This makes spell.replacer practical for preprocessing large text datasets before analysis.
The package includes the coca_list dataset with the 100,000 most frequent words from COCA.
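Assuming coca_list is loaded with the package and ordered from most to least frequent (its exact structure is an assumption here), you can inspect it like this:

```r
# Inspect the bundled COCA word list; structure assumed to be a
# frequency-ordered character vector
head(coca_list)
```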