Coder Social home page Coder Social logo

fastenshtein's People

Contributors

danharltey avatar donaldhansen avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

fastenshtein's Issues

Assembly is unsigned

Consider signing the assembly which is unfortunately still a thing for many projects.

Parallel Levenshtein

It seems you should not reuse an instance of Levenshtein across multiple threads. If it's not worth fixing, perhaps a note in the README specifiying not to multithread calls to DistanceFrom on a shared instance. The static method works fantastic. For my examples below I'm calculating the distance of a user inputted drug name against a cache of ~32k drugs.

For example this produces incorrect results:

var returnItems = new ConcurrentBag<(int distance, Drug drug)>();
string item = "Ibuprophen";//user input
Levenshtein lev = new Levenshtein(item);
Parallel.ForEach(await GetFullCache(), drug =>//~32k Drug classes with Name property
{
    int distance = lev.DistanceFrom(drug.Name);
    if (distance <= 10)
    {
        returnItems.Add((distance, drug));
    }
});

Whereas this produces expected results:

var returnItems = new ConcurrentBag<(int distance, Drug drug)>();
string item = "Ibuprophen";
Parallel.ForEach(await GetFullCache(), drug =>
{
    int distance = Levenshtein.Distance(item, drug.Name);
    if (distance <= 10)
    {
        returnItems.Add((distance, drug));
    }
});

AutoCompleteLevenshtein bug

Please try test
[Fact]
public void Shorter_Value_Is_Distance_Test2()
{
Test("tst", "test", 1);
}

for AutoCompleteLevenshteinTests

it fails and AutoCompleteLevenshtein.Distance returns 2 instead of 1

Searching for similar substrings

Does Fastensthein include a method to find substrings that are similar to a search string?
I know that this can be done using a modified version of the Levenshtein distance algorithm, so it might be useful to implement it here.

Unsafe variant

Thought I would throw this out there. I think it will outperform even your instanced mode. Would love it if you could try it and let me know.

    /// <summary>
    /// Measures the difference between two strings.
    /// Uses the Levenshtein string difference algorithm.
    /// </summary>
    public static class LevenshteinUnsafe
    {
        /// <summary>
        /// Compares the two values to find the minimum Levenshtein distance. 
        /// Thread safe.
        /// </summary>
        /// <returns>Difference. 0 complete match. -1 if either value1 or value2 is null, -2 if value2 is too large.</returns>
        public static unsafe int Distance(string value1, string value2)
        {
            if (value1 == null || value2 == null)
            {
                return -1;
            }
            else if (value2.Length == 0)
            {
                return value1.Length;
            }
            else if (value1.Length == 0)
            {
                return value2.Length;
            }
            else if (value2.Length > 1024)
            {
                // prevent stack overflow
                return -2;
            }

            int* costs = stackalloc int[value2.Length];
            int* costsEnd = costs + value2.Length, ptr;
            int i = 0, j, insertionCost, cost, additionCost;
            char value1Char;

            for (ptr = costs; ptr != costsEnd; ptr++)
            {
                *ptr = ++i;
            }

            for (i = 0; i < value1.Length; i++)
            {
                // cost of the first index
                cost = additionCost = i;

                // cache value for inner loop to avoid index lookup and bounds checking, profiled this is quicker
                value1Char = value1[i];

                for (ptr = costs, j = 0; ptr != costsEnd; ptr++, j++)
                {
                    insertionCost = cost;
                    cost = additionCost;

                    // assigning this here reduces the array reads we do, improvement of the old version
                    additionCost = *ptr;

                    if (value1Char != value2[j])
                    {
                        if (insertionCost < cost)
                        {
                            cost = insertionCost;
                        }

                        if (additionCost < cost)
                        {
                            cost = additionCost;
                        }

                        ++cost;
                    }

                    *ptr = cost;
                }
            }

            // the last int is the cost
            return *(--costsEnd);
        }
    }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.