k3jph / phonics-in-r

Phonetic Spelling Algorithms in R

Home Page: https://jameshoward.us/phonics-in-r

License: Other

R 83.99% C++ 12.75% TeX 3.27%
phonetic-spelling-algorithms soundex phonics nysiis metaphone text-processing linguistics record-linkage bsd-2-license

phonics-in-r's People

Contributors

ahood, howardjp, kylehaynes

phonics-in-r's Issues

soundex single characters

Hi James (sorry for calling you by your surname!),

Currently, single-character strings are returned without zero padding. Would you consider this a bug?

Looking at three implementations of soundex ...

phonics::soundex("A")
# [1] "A"
RecordLinkage::soundex("A")
# [1] "A000"
stringdist::phonetic("A")
# [1] "A000"

It's a fairly rare edge case, but the names I deal with are sometimes abbreviated. When doing linkage, if a name was "DA" in one dataset and "D" in another, I might consider them a pair, but blocking on the Soundex of the name would not produce one ("D" vs "D000").

Happy to do a pull request if you agree.

NYSIIS encoding of 'HANNAH'

Both nysiis_original() and nysiis_modified() return 'HANAH'. The encoding rule for a terminal 'H' is ambiguous here because it is defined in terms of the preceding and following letters, and the last letter of a name has no following letter. However, it seems more in the spirit of this phonetic encoding to omit the final 'H' (and therefore the second 'A') and to return 'HAN' instead. The latter interpretation has been adopted in the plurality of implementations here, by the way.

Ensure all algorithms return "" for input ""

  • Caverphone
  • Caverphone 2
  • Cologne
  • Lein
  • MRA
  • Metaphone
  • NYSIIS
  • Modified NYSIIS
  • Oxford Name Compression Algorithm
  • Phonex
  • Roger Root
  • Original Soundex
  • Apache Refined Soundex
  • Statistics Canada
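
A check like the one this checklist asks for could be sketched as below. This is only an illustration: `toy_encoder` and `returns_empty_for_empty` are hypothetical names, and the toy encoder stands in for the package's real functions so that the snippet runs without phonics installed.

```r
# Hypothetical stand-in encoder, not a function from phonics:
# it upper-cases the input and keeps only the letters A-Z.
toy_encoder <- function(word) {
  gsub("[^A-Z]", "", toupper(word))
}

# For each encoder in a named list, report whether it maps "" to "".
returns_empty_for_empty <- function(encoders) {
  vapply(encoders, function(f) identical(f(""), ""), logical(1))
}

result <- returns_empty_for_empty(list(toy = toy_encoder))
```

With the package loaded, the list would presumably hold the real encoders instead, for example `list(soundex = phonics::soundex, metaphone = phonics::metaphone)`.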

Use of perl = TRUE

Hi Howard,

Thanks for the package.

Have you ever considered the use of the perl = TRUE argument in a lot of your gsub() functions?

It offers considerable time benefits.

Below is an example having updated the nysiis_original function.

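# install.packages("babynames")
# install.packages("phonics")
# install.packages("microbenchmark")
# Note: nysiis_original_perl() is defined at the end of this listing; run it first.
library("babynames")
library("phonics")
library("microbenchmark")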

name <- babynames$name

length(name)
# 1858689

system.time(a <- nysiis_original_perl(name))
# user  system elapsed 
# 13.36    0.14   13.54 

system.time(b <- nysiis(name))
#  user  system elapsed 
# 22.75    0.24   23.02 

# All equal?
all.equal(a, b)
# [1] TRUE

# microbenchmark'ing
microbenchmark(
  nysiis_original_perl(name),
  nysiis(name), times = 25
)
# Unit: milliseconds
#                        expr      min       lq     mean   median       uq      max neval
#  nysiis_original_perl(name) 308.5931 311.0220 316.0347 312.2456 315.8408 345.8459    25
#                nysiis(name) 568.2662 573.1073 577.4318 575.4571 577.5975 606.7362    25

sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17763)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                       LC_TIME=English_Australia.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] phonics_1.1.0   babynames_0.3.0
# 
# loaded via a namespace (and not attached):
# [1] compiler_3.5.0 tools_3.5.0    pillar_1.3.1   tibble_1.4.2   Rcpp_1.0.0     crayon_1.3.4   rlang_0.3.0.1 


# nysiis_original with perl = TRUE ...
nysiis_original_perl <- function(word, maxCodeLen = 6) {

    ## First, remove any nonalphabetical characters and capitalize it
    word <- gsub("[^[:alpha:]]*", "", word, perl = TRUE)
    word <- toupper(word)

    ## Translate first characters of name: MAC to MCC, KN to NN, K to C,
    ## PH, PF to FF, SCH to SSS
    word <- gsub("^MAC", "MCC", word, perl = TRUE)
    word <- gsub("KN", "NN", word, perl = TRUE)
    word <- gsub("K", "C", word, perl = TRUE)
    word <- gsub("^PF", "FF", word, perl = TRUE)
    word <- gsub("PH", "FF", word, perl = TRUE)
    word <- gsub("SCH", "SSS", word, perl = TRUE)

    ## Translate last characters of name: EE to Y, IE to Y, DT, RT, RD,
    ## NT, ND to D
    word <- gsub("EE$", "Y", word, perl = TRUE)
    word <- gsub("IE$", "Y", word, perl = TRUE)
    word <- gsub("DT$", "D", word, perl = TRUE)
    word <- gsub("RT$", "D", word, perl = TRUE)
    word <- gsub("RD$", "D", word, perl = TRUE)
    word <- gsub("NT$", "D", word, perl = TRUE)
    word <- gsub("ND$", "D", word, perl = TRUE)

    ## First character of key = first character of name.
    first <- substr(word, 1, 1)
    word <- substr(word, 2, nchar(word))

    ## EV to AF else A, E, I, O, U to A
    word <- gsub("EV", "AF", word, perl = TRUE)
    word <- gsub("E|I|O|U", "A", word, perl = TRUE)

    ## Q to G, Z to S, M to N
    word <- gsub("Q", "G", word, perl = TRUE)
    word <- gsub("Z", "S", word, perl = TRUE)
    word <- gsub("M", "N", word, perl = TRUE)

    ## KN to N else K to C
    ## SCH to SSS, PH to FF
    ## Rules are implemented as part of opening block

    ## H: if the previous or next character is a non-vowel, replace H with the previous character
    word <- gsub("([^AEIOU])H", "\\1", word, perl = TRUE)
    word <- gsub("(.)H[^AEIOU]", "\\1", word, perl = TRUE)

    ## W: if the previous character is a vowel, replace W with A
    word <- gsub("([AEIOU])W", "A", word, perl = TRUE)

    ## If last character is S, remove it
    word <- gsub("S$", "", word, perl = TRUE)

    ## If last characters are AY, replace with Y
    word <- gsub("AY$", "Y", word, perl = TRUE)

    ## Remove duplicate consecutive characters
    word <- gsub("([A-Z])\\1+", "\\1", word, perl = TRUE)

    ## If last character is A, remove it
    word <- gsub("A$", "", word, perl = TRUE)

    ## Append word except for first character to first
    word <- paste(first, word, sep = "")

    ## Truncate to requested length
    word <- substr(word, 1, maxCodeLen)

    return(word)
}

NYSIIS encoding of 'CHRISTINA'

I noticed that phonics::nysiis('CHRISTINA') returns 'CHRASTAN' (for maxCodeLen >= 8), whereas it should return 'CRASTAN' per the original algorithm (see https://naldc.nal.usda.gov/download/27833/PDF or https://www.springer.com/us/book/9780387695020, and the somewhat vaguer https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; I can't find the original report by Taft). The steps are worked through here: christina.txt

The discrepancy appears to come from dropping the first letter of the name at nysiis.R line 107, i.e.
word <- substr(word, 2, nchar(word)), before the 'H' rule (Step 4.5) is applied.
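
The effect of that ordering can be reproduced in isolation. The sketch below is not the package's code: `drop_h_after_consonant` is a hypothetical helper that applies only the "H after a non-vowel" substitution, and the input is 'CHRISTINA' after the vowel-to-A step.

```r
# Hypothetical helper: drop an H that follows a non-vowel.
drop_h_after_consonant <- function(x) {
  gsub("([^AEIOU])H", "\\1", x, perl = TRUE)
}

word <- "CHRASTANA"  # 'CHRISTINA' after the vowels are mapped to A

# Order described in the issue: split off the first letter, then apply the rule.
# The H is now the first character of the tail, so nothing precedes it.
first <- substr(word, 1, 1)
tail_only <- paste0(first, drop_h_after_consonant(substr(word, 2, nchar(word))))

# Alternative order: apply the rule while the first letter is still attached.
whole_word <- drop_h_after_consonant(word)
```

Here `tail_only` keeps the H ("CHRASTANA") while `whole_word` drops it ("CRASTANA"); removing the trailing A afterwards gives 'CHRASTAN' and 'CRASTAN' respectively, matching the two encodings discussed above.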

Ensure all algorithms return NA for input NA

  • Caverphone
  • Caverphone 2
  • Cologne
  • Lein
  • MRA
  • Metaphone
  • NYSIIS
  • Modified NYSIIS
  • Oxford Name Compression Algorithm
  • Phonex
  • Roger Root
  • Original Soundex
  • Apache Refined Soundex
  • Statistics Canada
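
One way to get this behaviour uniformly would be a wrapper that passes NA through untouched. The following is a minimal sketch under that assumption: `with_na_passthrough` and `toy_encoder` are hypothetical names, and the package may well handle NA inside each algorithm instead.

```r
# Wrap an encoder so that NA inputs come back as NA_character_.
# `encoder` is any function that takes and returns a character vector.
with_na_passthrough <- function(encoder) {
  function(word, ...) {
    out <- rep(NA_character_, length(word))
    ok <- !is.na(word)
    if (any(ok)) {
      out[ok] <- encoder(word[ok], ...)
    }
    out
  }
}

# Hypothetical stand-in encoder used only for this illustration.
toy_encoder <- function(word) toupper(gsub("[^[:alpha:]]", "", word))

safe_encoder <- with_na_passthrough(toy_encoder)
encoded <- safe_encoder(c("ab1", NA, ""))
```

The wrapped encoder never sees the NA element, so the result keeps NA in the second position and leaves the other elements to the underlying encoder.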

Add warnings to Lein

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

Roger Root

Phonics should include an implementation of the Roger Root name comparison algorithm. See this USDA publication for more information.

Match Rating Approach

Phonics should include the match rating approach algorithm, including the comparison engine.

NYSIIS encoding of 'JOHN'

nysiis_original() returns 'J', whereas the encoding should be 'JAN'. The cause is a mistake in the use of gsub(): the previous and next letters were included in the pattern being replaced instead of being matched with lookarounds. I have forked the repository and will fix it.
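
The difference between a consuming pattern and a lookaround can be shown on the tail of 'JOHN'. This is a sketch of the idea only, not the fix that was made in the fork; the lookahead pattern is one assumed way to write it, and it does not require a preceding character or cover a terminal H.

```r
word <- "AHN"  # tail of 'JOHN' once J is split off and O is mapped to A

# Consuming pattern: the character after H is part of the match,
# so it is deleted together with the H.
consuming <- gsub("(.)H[^AEIOU]", "\\1", word, perl = TRUE)

# Lookahead variant: the following non-vowel is checked but not consumed.
lookahead <- gsub("H(?=[^AEIOU])", "", word, perl = TRUE)
```

The consuming version leaves only "A", which the later "remove a trailing A" step deletes, giving 'J'. The lookahead version leaves "AN", giving 'JAN' once the first letter is put back. Lookarounds need perl = TRUE in gsub().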

Add warnings to MRA

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

Add warnings to ONCA

  • Rewrite the unit tester
  • Add new test cases
  • Rewrite the code to process warnings

Metaphone crashing when encoding "gh"

Describe the bug
metaphone crashes when encoding "gh"

Possibly this is version dependent - I'm running an old R and cannot upgrade until I buy a new computer.

It's just strange that it seems to work for many words and only crash on gh. Maybe gh is producing some sort of strange unicode or something? IDK, I'm not much of a user of this package, but I need to get some code to run and this is breaking it. Any help would be appreciated.

To Reproduce
phonics::metaphone("sigh")

Or any other word with gh in it, as far as I can tell

Expected behavior
Should return the metaphone encoding for sigh.

Example

> phonics::metaphone("ruff")
[1] "RF"
> phonics::metaphone("rough")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
> phonics::metaphone("funhouse")
[1] "FNHS"
> phonics::metaphone("bughouse")
Error in metaphone_internal(word, maxCodeLen) : 
  c++ exception (unknown reason)
library(stringr); words[!str_detect(words,"gh")] %>% phonics::metaphone()
# works properly on 962 other words :-)

Desktop (please complete the following information):

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.1                         
year           2019                        
month          07                          
day            05                          
svn rev        76782                       
language       R                           
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes

Running phonics v1.3.9

Soundex returning single letter instead of augmenting with zeros

If I understand correctly from the Soundex algorithm steps on Wikipedia, the encoding of e.g. the string 'A' should be 'A000'. Indeed this is what is produced by other Soundex implementations I'm looking at. However, phonics::soundex('A') returns 'A'.

Happy to make a pull request if you agree that 'A000' is the correct encoding and if you agree with the rule that "If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers" (quoting from Step 4 in the Wikipedia article).
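
The padding rule quoted above could be applied as a final step along these lines. This is a minimal sketch, assuming the unpadded code has already been computed; `pad_soundex` is a hypothetical helper and not part of phonics.

```r
# Pad a Soundex-style code with trailing zeros up to a fixed width.
# Codes that are already long enough are returned unchanged.
pad_soundex <- function(code, width = 4) {
  missing <- pmax(width - nchar(code), 0)
  paste0(code, strrep("0", missing))
}

padded <- pad_soundex(c("A", "D", "R163"))
```

Note that this helper would also pad an empty string to "0000", so a real change would need to treat empty input separately.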
