chrismuir / refinr

Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms

refinr

refinr is designed to cluster and merge similar values within a character vector. It features two functions that implement clustering algorithms from the open source software OpenRefine. The clustering methods used are key collision and n-gram fingerprint (see the OpenRefine clustering documentation for more detail).

In addition, a few add-on features make the clustering and merging functions more useful: approximate string matching to allow merging despite minor misspellings, the option to pass a dictionary vector to dictate edit values, and the option to pass a vector of strings to ignore during the clustering process.

Please report issues, comments, or feature requests.

Installation

Install from CRAN:

install.packages("refinr")

Or install the dev version from this repo:

# install.packages("devtools")
devtools::install_github("ChrisMuir/refinr")

Example Usage

library(refinr)

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x)
#> [1] "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc."
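
To give a rough idea of what key collision clustering does, here is a minimal base R sketch of the fingerprint approach used by OpenRefine. The helpers fingerprint() and merge_by_key() are made up for this illustration. This is not refinr's implementation, and it skips ASCII normalization, business suffix handling, and dictionaries.

```r
# Hypothetical helper: build an order- and case-insensitive key for each string.
fingerprint <- function(x) {
  x <- gsub("[[:punct:]]", "", tolower(x))   # lowercase, drop punctuation
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(
    tokens,
    function(tk) paste(sort(unique(tk), method = "radix"), collapse = " "),
    character(1)
  )
}

# Hypothetical helper: replace each value with the most frequent value
# among all inputs that share its key.
merge_by_key <- function(x) {
  keys <- fingerprint(x)
  best <- tapply(x, keys, function(v) names(which.max(table(v))))
  unname(as.vector(best[keys]))
}

x <- c("Acme Pizza", "pizza acme", "ACME, pizza", "Acme Pizza")
fingerprint(x)   # every element maps to the key "acme pizza"
merge_by_key(x)  # every element becomes "Acme Pizza"
```

Because tokens are sorted and de-duplicated, word order and repeated words do not affect the key, which is why "pizza acme" and "Acme Pizza" collide.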

A dictionary character vector can be passed to key_collision_merge, which will dictate merge values when a cluster has a match within the dict vector.

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x, dict = c("Nicks Pizza", "acme PIZZA inc"))
#> [1] "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc"

Function n_gram_merge can be used to merge similar values that contain slight spelling differences. The stringdist package is used for calculating edit distance between strings. refinr links to the stringdist C API to improve the speed of the functions.

x <- c("Acmme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(x, weight = c(d = 0.2, i = 0.2, s = 1, t = 1))
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "ACME PIZA COMPANY"

# The behavior of the approximate string matching can be adjusted using parameters
# "weight" and/or "edit_threshold".
n_gram_merge(x, weight = c(d = 1, i = 1, s = 0.1, t = 0.1))
#> [1] "Acme Pizzazza LLC" "ACME PIZA COMPANY" "Acme Pizzazza LLC"
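
The idea behind the n-gram approach can be sketched in base R as follows. ngram_fingerprint() is a hypothetical helper written for this example, and utils::adist() (plain Levenshtein distance) stands in for the weighted stringdist metric that refinr uses, so treat this as an illustration of the technique and not the package's code.

```r
# Hypothetical helper: n-gram fingerprint of each string.
ngram_fingerprint <- function(x, n = 2) {
  vapply(x, function(s) {
    s <- gsub("[^a-z0-9]", "", tolower(s))   # keep lowercase letters and digits
    if (nchar(s) <= n) return(s)             # too short to split into n-grams
    starts <- seq_len(nchar(s) - n + 1)
    grams <- substring(s, starts, starts + n - 1)
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

k1 <- ngram_fingerprint("Acmme Pizza")
k2 <- ngram_fingerprint("ACME pizza")
k1 == k2               # FALSE: the extra "m" adds the bigram "mm"
utils::adist(k1, k2)   # edit distance between the two keys (2 here)
```

Keys that are not identical but lie within a small edit distance of each other are the candidates for approximate merging. Changing the edit weights changes which kinds of typos count as "close".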

Both key_collision_merge and n_gram_merge have an optional argument, ignore_strings, which takes a character vector of strings to ignore when merging values.

x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
#> [1] "BAKERSFIELD high" "BAKERSFIELD high" "BAKERSFIELD high"

The clustering is designed to be insensitive to common business name suffixes, such as "inc", "llc", and "co". This feature can be turned on or off with the function parameter bus_suffix.
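
One way to picture how ignore_strings and bus_suffix can affect clustering is to drop the ignored tokens before building each key. The sketch below uses only base R. key_without() is a hypothetical helper, not a refinr function, and the package itself may handle these options differently.

```r
# Hypothetical helper: build a sorted-token key, leaving out ignored tokens.
key_without <- function(x, ignore = character(0)) {
  x <- gsub("[[:punct:]]", "", tolower(x))
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(
    tokens,
    function(tk) paste(sort(setdiff(tk, ignore), method = "radix"), collapse = " "),
    character(1)
  )
}

x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_without(x)                                              # three different keys
key_without(x, ignore = c("high", "school", "highschool"))  # all "bakersfield"
```

A list of business suffixes could be passed through the same ignore argument, which is roughly the effect bus_suffix = TRUE is described as having.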

Workflow for checking the results of the refinr processes

library(dplyr)
library(knitr)

x <- c(
  "Clemsson University", 
  "university-of-clemson", 
  "CLEMSON", 
  "Clem son, U.", 
  "college, clemson u", 
  "M.I.T.", 
  "Technology, Massachusetts' Institute of", 
  "Massachusetts Inst of Technology", 
  "UNIVERSITY:  mit"
)

ignores <- c("university", "college", "u", "of", "institute", "inst")

x_refin <- x %>% 
  refinr::key_collision_merge(ignore_strings = ignores) %>% 
  refinr::n_gram_merge(ignore_strings = ignores)

# Create df for comparing the original values to the edited values.
# This is especially useful for larger input vectors.
inspect_results <- tibble(original_values = x, edited_values = x_refin) %>% 
  mutate(equal = original_values == edited_values)

# Display only the values that were edited by refinr.
knitr::kable(
  inspect_results[!inspect_results$equal, c("original_values", "edited_values")]
)
#> |original_values                         |edited_values                    |
#> |:---------------------------------------|:--------------------------------|
#> |Clemsson University                     |CLEMSON                          |
#> |university-of-clemson                   |CLEMSON                          |
#> |Clem son, U.                            |CLEMSON                          |
#> |college, clemson u                      |CLEMSON                          |
#> |Technology, Massachusetts' Institute of |Massachusetts Inst of Technology |
#> |UNIVERSITY:  mit                        |M.I.T.                           |
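
Another way to review the results is to list the original values that ended up in each cluster. The sketch below is base R only and uses a hand-typed subset of the values from the example above, so it runs without refinr installed.

```r
# Values taken from the example above (a subset, typed in by hand).
original <- c("Clemsson University", "university-of-clemson", "CLEMSON",
              "M.I.T.", "UNIVERSITY:  mit")
edited   <- c("CLEMSON", "CLEMSON", "CLEMSON", "M.I.T.", "M.I.T.")

# Group the original values by the value they were merged into,
# keeping only groups with more than one distinct original value.
clusters <- split(original, edited)
clusters <- clusters[lengths(lapply(clusters, unique)) > 1]
clusters
```

Scanning each group makes false positive merges easier to spot than reading the two-column table row by row.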

Notes

This package is NOT meant to replace OpenRefine for every use case. For situations in which merging accuracy is the most important consideration, OpenRefine is preferable. Because the merging steps in refinr are automated, they will usually produce more false positive merges than manually selecting clusters to merge in OpenRefine.
The advantages this package has over OpenRefine:
- Operations are fully automated.
- Facilitates a more reproducible workflow.
- Faster when working with large input data (character vectors of length 500000+).

refinr's Issues

Functions not handling accented chars properly

Testing this on a Mac and a PC and getting different results.

library(refinr)
vect <- c("César Moreira Nuñez", "cesar moreira nunez")

On the PC:

key_collision_merge(vect)
#> "César Moreira Nuñez" "César Moreira Nuñez" # This is the correct output
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

On the Mac:

key_collision_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"
n_gram_merge(vect)
#> "César Moreira Nuñez" "cesar moreira nunez"

The expected output for all four calls above is c("César Moreira Nuñez", "César Moreira Nuñez").

This issue is possibly related to issue #58 from the rOpenSci pkg tokenizers (and the reprex above was stolen from that issue).

Both the Mac and PC are running R v3.4.4, and here are the locale and encoding settings for each:

PC:

Sys.getlocale()
#> "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

getOption("encoding")
#> "native.enc"

Mac:

Sys.getlocale()
#> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

getOption("encoding")
#> native.enc
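
The cause is not pinned down yet. One thing worth trying is to compare accent-folded keys built with an explicit character map, which avoids any platform-dependent transliteration. A minimal base R sketch follows. fold_accents() is a hypothetical helper and only covers the characters listed in it.

```r
# Assumption: only the accented characters listed here occur in the data.
fold_accents <- function(x) {
  chartr(
    "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1\u00c1\u00c9\u00cd\u00d3\u00da\u00d1",
    "aeiounAEIOUN",
    enc2utf8(x)
  )
}

vect <- c("C\u00e9sar Moreira Nu\u00f1ez", "cesar moreira nunez")
keys <- tolower(fold_accents(vect))
keys[1] == keys[2]   # the two inputs share a key once accents are folded
```

The folded strings would only be used as clustering keys. The merged output should still be one of the original values, accents included.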

Punc with no surrounding spaces causing incorrect output

Getting incorrect output when input strings contain punctuation marks that are not surrounded by spaces. Here's an example:

vect <- c("cats,inc", "cats, inc", "cats,incorporated", "cats, incorporated")
refinr::key_collision_merge(vect)
#> [1] "cats,inc"          "cats, inc"         "cats,incorporated" "cats, inc"

The intended outcome is for all four of these strings to be edited to be identical.
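
A plausible explanation (not yet confirmed against the package code) is that punctuation is deleted outright before tokenizing, so "cats,inc" collapses into the single token "catsinc" and the business suffix is never seen. The base R sketch below shows the difference between deleting punctuation and replacing it with a space.

```r
vect <- c("cats,inc", "cats, inc", "cats,incorporated", "cats, incorporated")

# Deleting punctuation glues neighbouring words together.
deleted <- strsplit(gsub("[[:punct:]]", "", vect), "[[:space:]]+")
deleted[[1]]   # "catsinc"

# Replacing punctuation with a space keeps the words apart.
spaced <- strsplit(trimws(gsub("[[:punct:]]", " ", vect)), "[[:space:]]+")
spaced[[1]]    # "cats" "inc"
spaced[[2]]    # "cats" "inc"
```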

Feature Request: Street suffixes

Would it be reasonable to create something similar to the bus_suffix function, but for common street suffixes, such as "avenue," "Ave.", "ave", "street", "st.", "St.", and so on? Avenues, Streets, Boulevards, etc. would have to remain distinct from one another, but it would be helpful for the package to be insensitive to common variations within each type.
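
Until something like this exists in the package, street suffixes can be normalized before clustering. Here is a minimal base R sketch. The lookup table and the helper normalize_street() are hypothetical and cover only a handful of suffixes.

```r
# Hypothetical lookup: each variant maps to one canonical form per suffix type,
# so avenues, streets, and boulevards stay distinct from one another.
street_suffixes <- c(
  avenue = "ave", ave = "ave",
  street = "st", st = "st",
  boulevard = "blvd", blvd = "blvd"
)

normalize_street <- function(x) {
  tokens <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")
  vapply(tokens, function(tk) {
    hit <- tk %in% names(street_suffixes)
    tk[hit] <- street_suffixes[tk[hit]]
    paste(tk, collapse = " ")
  }, character(1))
}

normalize_street(c("12 Main Street", "12 main St.", "12 Main Ave"))
```

One caveat: "St" can also mean "Saint", so a real version would probably only rewrite the last token of an address.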

bus_suffix question

Dear Chris,

Many thanks for this package. I am trying to write a walkthrough using refinr on matching author and inventor names and will then move on to organisation names. I wanted to ask whether you could point me to the source of the bus_suffix data. I ask because I can't seem to locate it in the package code (and am possibly being dim). It would be useful to see this in a df (maybe in data) depending on where it comes from. Then users can assess the coverage.

Thanks again for the package!

Best,

Paul

Merge values not taking most frequent string

Seeing a bug in which the edit value assigned to a cluster is not the most frequent string in that cluster. Example:

refinr::key_collision_merge(c("cat bike", "bike cat", "bike cat"))
#> "cat bike" "cat bike" "cat bike"

I think the issue is within this line from file key_collision_merge_funcs.cpp:

// Get the string that appears most often in curr_vect.
String most_freq_string = curr_vect[which_max(table(curr_vect))];

I think the solution is to apply .sort() to curr_vect prior to calling table on it. Need to do some more testing.
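
The same indexing mistake can be reproduced in plain R, which makes it easier to reason about. This is an R analogue of the C++ line above, not the package code: the position returned by which.max() refers to the sorted table, not to the original vector.

```r
curr_vect <- c("cat bike", "bike cat", "bike cat")
counts <- table(curr_vect)          # names are sorted: "bike cat", "cat bike"

# Analogue of the suspect line: which.max() gives a position in `counts`,
# which is then (incorrectly) used as a position in `curr_vect`.
curr_vect[which.max(counts)]        # "cat bike"

# Indexing the table's own names returns the most frequent string.
names(counts)[which.max(counts)]    # "bike cat"

# Sorting first is not enough in general:
v <- c("a", "a", "b", "b", "b")
sort(v)[which.max(table(v))]        # "a", but the most frequent value is "b"
```

So sorting curr_vect happens to fix this example but not every case. Looking the result up in the names of the table looks like the safer fix.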

Comparing two separate columns

Hi Chris,

Great package! I'm a journalist and a lot of my time is spent merging two data frames on the country, municipality, name or party column. These columns often contain different spellings for the same entity.

Now your package comes in handy, but I haven't figured out yet how to compare two of 'the same' columns and harmonize the strings the way refinr does on a single vector. I'm not that experienced in R, so maybe this sounds a bit vague. Maybe my examples make things clearer.

library(tidyverse)
library(refinr)

# I would like to add the values (and the right names) of this example df...
df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
  )

# ...to the following df
df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
  )

# The following code of course produces NA values
df3 <- left_join(df2, df1, by = c("uid", "name"))

# While the following is the desired outcome

# # A tibble: 4 x 3
#   uid   name          value
#   <chr> <chr>         <dbl>
# 1 A     Red              13
# 2 B     Blue              3
# 3 C     yellow-purple   789
# 4 C     green            17

If this is possible, it would save me so much time!

Thanks in advance.
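
One possible approach, sketched here in base R without refinr: within each uid, match every name in the second data frame to the closest name in the first one, using edit distance on normalized strings. normalize() is a hypothetical helper, and utils::adist() is used as a simple stand-in for a fuzzy matcher.

```r
df1 <- data.frame(
  uid   = c("A", "A", "B", "B", "C", "C"),
  name  = c("Red", "violet", "Blue", "yellow", "yellow-purple", "green"),
  value = c(13, 145, 3, 56, 789, 17),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  uid  = c("A", "B", "C", "C"),
  name = c("red", "blu", "YellowPurple", "green"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase and strip everything but letters and digits.
normalize <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# For each row of df2, pick the df1 row with the same uid whose normalized
# name has the smallest edit distance to the df2 name.
rows <- lapply(seq_len(nrow(df2)), function(i) {
  candidates <- df1[df1$uid == df2$uid[i], ]
  d <- utils::adist(normalize(df2$name[i]), normalize(candidates$name))
  candidates[which.min(d), ]
})
result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

This always picks the nearest candidate, even a poor one, so on real data it would be wise to add a maximum distance and leave unmatched rows as NA.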

Edit key_collision_merge.R to be more efficient with an input vector of unique values

Within function key_collision_merge, if the input param vect is a vector of unique values, then the step that calculates the number of unique values within each cluster can be skipped, and the function can move directly to the cluster merging step.

More specifically, this is the code chunk that can be skipped if vect is made up of unique values:

  # If dict is NULL, for each element of clusters, get the number of unique
  # values within vect associated with that cluster. Otherwise, for each
  # element of clusters, get the number of unique values across both vect AND
  # dict associated with that cluster. Idea is to skip the merging step for all
  # elements of cluster for which each associated element of vect is already
  # identical (or identical to an element of dict). In those spots its
  # pointless to perform merging.
  if (is.null(dict)) {
    csize <- vapply(clusters, function(n) {
      vect_sub[which(equality(keys_vect_sub, n))] %>%
        unique %>%
        length
    },
    integer(1),
    USE.NAMES = FALSE)
  } else {
    csize <- vapply(clusters, function(n) {
      c(vect_sub[which(equality(keys_vect_sub, n))],
        dict[which(equality(keys_dict, n))]) %>%
        unique %>%
        length
    },
    integer(1),
    USE.NAMES = FALSE)
  }
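
A rough sketch of the shortcut for the dict = NULL case, in base R. The variable names mirror the chunk above, but tolower() is only a stand-in for the real key function and the whole thing is illustrative, not the planned patch.

```r
vect <- c("Acme Pizza", "ACME pizza", "Nicks Pizza")
keys <- tolower(vect)                        # stand-in for the fingerprint keys
clusters <- unique(keys[duplicated(keys)])   # keys shared by 2+ elements

# General path: count distinct values per cluster.
csize_general <- vapply(
  clusters,
  function(k) length(unique(vect[keys == k])),
  integer(1),
  USE.NAMES = FALSE
)

# Shortcut: if vect has no duplicates, every element in a cluster is already
# a distinct value, so the cluster size is just the number of matching keys.
if (anyDuplicated(vect) == 0L) {
  csize_fast <- as.integer(table(keys)[clusters])
}
identical(csize_general, csize_fast)
```

With unique input every cluster has at least two distinct values by construction, so the size filter never removes anything and the merging step can run directly.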

Use unordered_map within cpp functions

I've been playing around with incorporating std::unordered_map into the cpp functions that perform the value merging after the clusters have been generated. I believe I can get a substantial speed up by re-writing these functions to use unordered_map.

Error when passing empty strings to n_gram_merge()

refinr::n_gram_merge(c("cats", "CATS", ""))
#> Error in cpp_get_char_ngrams(., numgram = numgram) : 
#>   negative length vectors are not allowed

This is related to #2, in that they are both related to the fact that cpp function char_ngram currently doesn't check the length of each string as it's creating ngrams.

Functions not differentiating between NA and "NA"

Here's an example:

vect <- c("cats", "CATS", "NA", "na", "na na", NA, NA_character_, " ", "")
n_gram_merge(vect)
#> [1] "CATS" "CATS" NA     NA     NA     NA     NA     " "    ""

The expected output is this:

#> [1] "CATS" "CATS" "NA"   "NA"   "NA"   NA     NA     " "    ""

Where the string literal "NA" and the missing value NA are kept separate, and not merged together.
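
As a defensive pattern while this is open, missing values can be kept out of the merge function entirely and restored afterwards. merge_keep_na() below is a hypothetical wrapper, and toupper() only stands in for a real merge function. Whether this fully works around the bug depends on where inside the package the confusion happens.

```r
# Hypothetical wrapper: apply a merge function to non-missing values only.
merge_keep_na <- function(x, merge_fun, ...) {
  out <- x
  ok <- !is.na(x)
  out[ok] <- merge_fun(x[ok], ...)
  out
}

vect <- c("cats", "NA", NA, "na")
# toupper() stands in for a real merge function in this sketch.
merge_keep_na(vect, toupper)
```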

I cannot get edit_threshold to work

Either edit_threshold is not working in n_gram_merge or, more likely, I do not understand how the algorithm actually works. I posted my full question with a reproducible example on Stack Overflow.

Can you please provide a few examples of where changing numgram to 3 or 4, or changing edit_threshold, saves the day?

Bug when passing length 1 strings to n_gram_merge()

# This works as intended
refinr::n_gram_merge(c("cats", "CATS", "d", "h"))
#> [1] "CATS" "CATS" "d"    "h"

# Add a second lowercase "d" to the input vector, and it still works as intended
refinr::n_gram_merge(c("cats", "CATS", "d", "h", "d"))
#> [1] "CATS" "CATS" "d"    "h"    "d"

# However, adding an uppercase "D" to the input vector causes all length 1 strings 
# to be treated as a single cluster
refinr::n_gram_merge(c("cats", "CATS", "d", "h", "D"))
#> [1] "CATS" "CATS" "D"    "D"    "D"

This bug is related to the cpp function char_ngram, which currently does not check the length of each string as it creates ngrams.
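
The kind of length check that is missing can be sketched in R as follows. char_ngrams() here is a hypothetical R helper written for illustration, not the C++ function from the package, and the choice to fall back to the whole string is only one possible design.

```r
# Hypothetical helper: character n-grams with a length check.
char_ngrams <- function(s, n = 2) {
  len <- nchar(s)
  if (len < n) return(s)   # too short: use the whole string as its only "gram"
  starts <- seq_len(len - n + 1)
  substring(s, starts, starts + n - 1)
}

char_ngrams("cats")  # "ca" "at" "ts"
char_ngrams("d")     # "d"
char_ngrams("")      # ""
```

With a fallback like this, "d" and "h" keep different keys and are not lumped into one cluster, and the empty string no longer produces a negative length.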

In n_gram_merge(), issues when arg bus_suffix = FALSE and ignore_strings is non-NULL

In n_gram_merge(), getting incorrect output when arg bus_suffix is set to FALSE and a char vector is passed to arg ignore_strings. Here's an example:

vect <- c("cats, inc", "cats, incorporated", "cats, llc")
refinr::n_gram_merge(vect, bus_suffix = FALSE, ignore_strings = "dogs")
#> [1] "cats, inc" "cats, inc" "cats, llc"

The intended output is that none of the input values are merged together. Currently, if bus_suffix = FALSE and ignore_strings is not NULL, vect is run through business_suffix() within refinr:::get_fingerprint_ngram(). This should not happen, and it is what causes the issue.

ignore_strings is passed to gsub with perl = TRUE

Hi,
Thanks for the package :)

I noticed that in key_collision_merge() you remove the ignore strings with the remove_strings function, which is fine. But in n_gram_merge() you use gsub with perl = TRUE, which makes a mess if any of your ignore_strings contain regex metacharacters.

In my case I was generating the ignore strings from text, so I got all kinds of things like '+', '.', etc.

Error message if I pass '+' in ignore_strings

key_collision_merge(c('aa+', 'aa'), ignore_strings = '+') # works fine
n_gram_merge(c('aa+', 'aa'), ignore_strings = '+') # crashes
Error in gsub(regex, "", vect, perl = TRUE) : 
  invalid regular expression '\b(+|inc|corp|co|llc|ltd|div|ent|lp|and)\b| '

I can clean ignore_strings myself, but that is something to consider for the package too, maybe use remove_strings in both cases?
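
For anyone who needs a workaround in the meantime, regex metacharacters can be escaped before the strings are passed in. This is a base R sketch. escape_regex() is a hypothetical helper, not part of refinr.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that each
# string is matched literally when used inside a PCRE pattern.
escape_regex <- function(x) {
  gsub("([.\\\\+*?^$(){}|\\[\\]])", "\\\\\\1", x, perl = TRUE)
}

escape_regex(c("+", "a.b", "inc"))   # "\\+" "a\\.b" "inc"

# The escaped string can now be used safely as a pattern.
gsub(escape_regex("+"), "", "aa+", perl = TRUE)   # "aa"
```

Note that the pattern in the error message wraps the alternatives in \b anchors, and a token made only of punctuation does not always sit on a word boundary, so escaping alone may stop the crash without actually removing such tokens.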

CRAN docs issue

Email from CRAN:

Dear maintainer,

You have file 'refinr/man/refinr.Rd' with \docType{package}, likely
intended as a package overview help file, but without the appropriate
PKGNAME-package \alias as per "Documenting packages" in R-exts.

This seems to be the consequence of the breaking change

Using @docType package no longer automatically adds a -package alias.
Instead document _PACKAGE to get all the defaults for package
documentation.

in roxygen2 7.0.0 (2019-11-12) having gone unnoticed, see
r-lib/roxygen2#1491.

As explained in the issue, to get the desired PKGNAME-package \alias
back, you should either change to the new approach and document the new
special sentinel

"_PACKAGE"

or manually add

@aliases refinr-package

if remaining with the old approach.

Please fix in your master sources as appropriate, and submit a fixed
version of your package within the next few months.

Best,
-k
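
The first option from the email (documenting the "_PACKAGE" sentinel) would look roughly like this. The file name is an assumption based on common convention, and the existing title and description from refinr.Rd would need to be carried over or left to the defaults taken from DESCRIPTION.

```r
# R/refinr-package.R (file name assumed, any file under R/ works)

#' @keywords internal
"_PACKAGE"
```

After re-running roxygen2, the generated Rd file should contain the refinr-package alias that CRAN is asking for.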