SimString

A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching. This package is particularly useful for natural language processing tasks that require retrieving strings/texts from a very large corpus (large amounts of text). Currently, this package supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
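To make the two feature types concrete, here is a minimal sketch in plain Julia. The helper names `char_ngrams` and `word_ngrams` and the padding behaviour are assumptions made for illustration; they are not SimString's API, and the package's own feature extraction may differ in detail.

```julia
# Hypothetical helpers for illustration only; not part of SimString's API.

# Character n-grams: pad both ends (assumed behaviour), then slide a window
# of n characters over the padded string.
function char_ngrams(s::AbstractString, n::Int; pad::AbstractString=" ")
    padded = pad^(n - 1) * s * pad^(n - 1)
    chars = collect(padded)   # Vector{Char}, one entry per character
    return [String(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

# Word n-grams: split on whitespace, then slide a window of n tokens.
function word_ngrams(s::AbstractString, n::Int)
    tokens = split(s)
    return [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
end

char_ngrams("foo", 2)                  # [" f", "fo", "oo", "o "]
word_ngrams("a native julia port", 2)  # ["a native", "native julia", "julia port"]
```

Each string is then represented by its set of n-grams, and approximate matching compares those sets rather than the raw strings.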

Features

  • Fast algorithm for string matching
  • 100% exact retrieval
  • Unicode support
  • Support for building databases directly from text files
  • Mecab-based tokenizer support
  • Support for persistent databases like MongoDB

Supported String Similarity Measures

  • Dice coefficient
  • Jaccard coefficient
  • Cosine coefficient
  • Overlap coefficient
  • Exact match
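For reference, the measures listed above can be written as set operations over two feature sets X and Y. This is a sketch using the standard textbook definitions; the function names are hypothetical and are not the package's measure types.

```julia
# Hypothetical helper names; standard set-based definitions of each measure.
dice(X, Y)    = 2 * length(intersect(X, Y)) / (length(X) + length(Y))
jaccard(X, Y) = length(intersect(X, Y)) / length(union(X, Y))
cosine(X, Y)  = length(intersect(X, Y)) / sqrt(length(X) * length(Y))
overlap(X, Y) = length(intersect(X, Y)) / min(length(X), length(Y))
exact(X, Y)   = X == Y ? 1.0 : 0.0   # 1 only when the feature sets are equal

X = Set(["ab", "bc", "cd"])
Y = Set(["bc", "cd", "de"])

jaccard(X, Y)   # 0.5: two shared features out of four distinct ones
```

All five return a score between 0 and 1, so a single similarity threshold can be applied regardless of which measure is chosen.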

Installation

You can grab the latest stable version of this package from the Julia registries by running the command below.

NB: Don't forget to invoke Julia's package manager with ] first.

pkg> add SimString

The few (and selected) brave ones can grab the current experimental features by adding the main branch to their development environment after invoking the package manager with ]:

pkg> add SimString#main

You are good to go with bleeding edge features and breakages!

To revert to a stable version, you can simply run:

pkg> free SimString
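The same steps can also be scripted through Julia's Pkg API instead of the REPL prompt. The wrapper function names below are hypothetical; only the `Pkg.add` and `Pkg.free` calls inside them are real API.

```julia
using Pkg

# Hypothetical wrapper; mirrors `pkg> add SimString` and `pkg> add SimString#main`.
function install_simstring(; bleeding_edge::Bool=false)
    if bleeding_edge
        Pkg.add(name="SimString", rev="main")  # track the main branch
    else
        Pkg.add("SimString")                   # latest registered release
    end
end

# Mirrors `pkg> free SimString`: stop tracking the branch.
revert_simstring() = Pkg.free("SimString")
```

Calling `install_simstring()` or `install_simstring(bleeding_edge=true)` needs network access, just like the REPL commands.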

simstring.jl's Issues

Switch from array indexing to byte indexing for n-grams generation.

Special Unicode characters break the current n-gram feature generation because it uses array indexing, as reproduced below:

using SimString

db = DictDB(CharacterNGrams(2, " "))
push!(db, "„orosz állami részvénytársaság")

Output:

StringIndexError: invalid index [3], valid nearby indices [2]=>'„', [5]=>'o'

string_index_err(::String, ::Int64)@string.jl:12
getindex@string.jl:263[inlined]
(::SimString.var"#13#14"{String, Int64})(::Int64)@features.jl:25
iterate@generator.jl:47[inlined]
collect_to!@array.jl:782[inlined]
collect_to_with_first!@array.jl:760[inlined]
_collect(::UnitRange{Int64}, ::Base.Generator{UnitRange{Int64}, SimString.var"#13#14"{String, Int64}}, ::Base.EltypeUnknown, ::Base.HasShape{1})@array.jl:754
collect_similar@array.jl:653[inlined]
map@abstractarray.jl:2849[inlined]
init_ngrams@features.jl:24[inlined]
n_grams@features.jl:44[inlined]
extract_features(::SimString.CharacterNGrams{Int64, String}, ::String)@features.jl:70
push!(::SimString.DictDB{SimString.CharacterNGrams{Int64, String}, String, DataStructures.DefaultDict{Int64, Set{String}, SimString.var"#1#4"}, DataStructures.DefaultDict{Int64, DataStructures.DefaultOrderedDict{Tuple{String, Int64}, Set{String}}, SimString.var"#2#5"}, DataStructures.DefaultDict{Int64, DataStructures.DefaultDict{Tuple{String, Int64}, Set{String}}, SimString.var"#3#6"}}, ::String)@features.jl:124
top-level scope
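The error arises because Julia strings are indexed by byte offset, and a multi-byte character such as „ has no valid index in the middle of its encoding. One possible way to avoid it is to walk only the valid character boundaries. This is a sketch with a hypothetical function name, not the fix adopted by the package.

```julia
s = " „orosz"   # leading pad character, as in CharacterNGrams(2, " ")

# '„' occupies bytes 2 to 4, so byte 3 is not the start of a character.
isvalid(s, 3)   # false; s[3] throws StringIndexError

# Hypothetical helper: slide the window over valid character indices
# instead of assuming one byte per character.
function char_ngrams_safe(s::AbstractString, n::Int)
    starts = collect(eachindex(s))   # byte index where each character begins
    grams = String[]
    for k in 1:length(starts)-n+1
        first_char = starts[k]
        last_char  = starts[k+n-1]   # start index of the window's last character
        push!(grams, s[first_char:last_char])   # both endpoints are valid indices
    end
    return grams
end

char_ngrams_safe(s, 2)   # [" „", "„o", "or", "ro", "os", "sz"]
```

A range such as `s[2:5]` works because both endpoints are character start indices, and Julia includes the whole character that begins at the end index.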
