
fuzzystring provides fast, flexible fuzzy string
joins for data.frame and data.table objects
using approximate string matching. It combines
stringdist-based matching with a data.table
backend and compiled C++ result assembly to reduce overhead in large
joins while preserving standard join semantics.
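
As a rough illustration of what approximate string matching with a distance threshold means, the base-R sketch below computes edit distances with utils::adist() and keeps the pairs that fall within a cutoff. It does not describe fuzzystring's internals (the package relies on stringdist and compiled code). adist() computes a generalized Levenshtein distance, and the names `left`, `right`, and `threshold` are chosen here for illustration only.

```r
# Illustrative only: base R's adist() stands in for the stringdist-based
# matching that fuzzystring uses.
left  <- c("Idea", "Premiom", "Very Good")
right <- c("Ideal", "Premium", "VeryGood")
threshold <- 2  # hypothetical cutoff, playing the role of a maximum distance

# Matrix of generalized Levenshtein distances: rows = left, columns = right
d <- utils::adist(left, right)

# Keep every (left, right) pair whose distance is within the threshold
hits <- which(d <= threshold, arr.ind = TRUE)
pairs <- data.frame(
  left     = left[hits[, "row"]],
  right    = right[hits[, "col"]],
  distance = d[hits],
  stringsAsFactors = FALSE
)
pairs
```

Each of the three names is one edit away from its counterpart and far from the other two candidates, so only those three pairs survive the cutoff.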
Real-world identifiers rarely line up exactly.
fuzzystring is designed for workloads such as:
The package includes:
inner, left, right, full, semi, and anti joins
stringdist methods, including OSA, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine, Jaccard, and Soundex
(data.table, tibble, or base data.frame)

# Install from CRAN
install.packages("fuzzystring")
# Development version from GitHub
# pak::pak("PaulESantos/fuzzystring")
# remotes::install_github("PaulESantos/fuzzystring")

library(fuzzystring)

x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_left_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_right_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_full_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_semi_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_anti_join(x, y, by = c(name = "approx_name"), max_dist = 2)

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)

The package ships with misspellings, a dataset of common
misspellings adapted from Wikipedia for examples and testing.
data(misspellings)
head(misspellings)

fuzzystring keeps more of the join execution on a
compiled path than the original fuzzyjoin implementation.
In practice, the package combines:
data.table grouping and candidate planning

The benchmark article summarizes a precomputed comparison against
fuzzyjoin::stringdist_join() using the same methods and
sample sizes:
fuzzystring_join() can match across more than one string
column by applying the same distance method and threshold to each mapped
column.
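
One plausible reading of this rule, stated here as an assumption and not as a description of the package internals, is that a row pair is kept only when every mapped column is within the threshold. The base-R sketch below illustrates that reading with utils::adist(); `within_threshold` is a hypothetical helper, and the threshold is set to 2 because adist() computes Levenshtein distance, which counts the swapped letters in "Maira"/"Maria" as two edits where OSA counts one.

```r
# Illustrative only: per-column thresholding combined with a logical AND.
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales"),
  stringsAsFactors = FALSE
)
y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  stringsAsFactors = FALSE
)
threshold <- 2  # Levenshtein needs 2 for "Maira" vs "Maria"

# Hypothetical helper: TRUE where the pairwise distance is within the threshold
within_threshold <- function(a, b, threshold) {
  utils::adist(a, b) <= threshold
}

# A pair of rows matches only if BOTH columns are close enough
keep <- within_threshold(x_multi$first, y_multi$first_ref, threshold) &
  within_threshold(x_multi$last, y_multi$last_ref, threshold)

# Row indices into x_multi ("row") and y_multi ("col") for the matches
matches <- which(keep, arr.ind = TRUE)
matches
```

Under this reading, a close first name cannot rescue a distant last name: the condition is applied column by column and the results are combined.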
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)

y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)

fuzzystring builds on ideas popularized by
fuzzyjoin, while rebuilding the join pipeline around
data.table and compiled C++ result assembly.