danharltey / fastenshtein Goto Github PK
View Code? Open in Web Editor NEWThe fastest .Net Levenshtein around
License: MIT License
The fastest .Net Levenshtein around
License: MIT License
Consider signing the assembly which is unfortunately still a thing for many projects.
It seems you should not reuse an instance of Levenshtein
across multiple threads. If it's not worth fixing, perhaps a note in the README specifiying not to multithread calls to DistanceFrom
on a shared instance. The static method works fantastic. For my examples below I'm calculating the distance of a user inputted drug name against a cache of ~32k drugs.
For example this produces incorrect results:
var returnItems = new ConcurrentBag<(int distance, Drug drug)>();
string item = "Ibuprophen";//user input
Levenshtein lev = new Levenshtein(item);
Parallel.ForEach(await GetFullCache(), drug =>//~32k Drug classes with Name property
{
int distance = lev.DistanceFrom(drug.Name);
if (distance <= 10)
{
returnItems.Add((distance, drug));
}
});
Whereas this produces expected results:
var returnItems = new ConcurrentBag<(int distance, Drug drug)>();
string item = "Ibuprophen";
Parallel.ForEach(await GetFullCache(), drug =>
{
int distance = Levenshtein.Distance(item, drug.Name);
if (distance <= 10)
{
returnItems.Add((distance, drug));
}
});
Missing tag and release for 1.0.0.8 and some other previous versions
Its useful to have these tags on the repo for comparison.
Is the current master HEAD 1.0.0.8 ?
I'm trying to use Fastenshtein in a SSIS package. To do that, the DLL must be registered in the GAC. From what I understand you have to sign your dll before you can register is in the GAC. Any chance we could get the dlls in nuget signed?
Is it possible to update your performance table to include FuzzySharp?
Please try test
[Fact]
public void Shorter_Value_Is_Distance_Test2()
{
Test("tst", "test", 1);
}
for AutoCompleteLevenshteinTests
it fails and AutoCompleteLevenshtein.Distance returns 2 instead of 1
Hi!
Is there any option to break your algo calculations if distance reach limit, that i set up?
Does Fastensthein include a method to find substrings that are similar to a search string?
I know that this can be done using a modified version of the Levenshtein distance algorithm, so it might be useful to implement it here.
Hi!
substitution operation is missing.
Thought I would throw this out there. I think it will outperform even your instanced mode. Would love it if you could try it and let me know.
/// <summary>
/// Measures the difference between two strings.
/// Uses the Levenshtein string difference algorithm.
/// </summary>
public static class LevenshteinUnsafe
{
/// <summary>
/// Compares the two values to find the minimum Levenshtein distance.
/// Thread safe.
/// </summary>
/// <returns>Difference. 0 complete match. -1 if either value1 or value2 is null, -2 if value2 is too large.</returns>
public static unsafe int Distance(string value1, string value2)
{
if (value1 == null || value2 == null)
{
return -1;
}
else if (value2.Length == 0)
{
return value1.Length;
}
else if (value1.Length == 0)
{
return value2.Length;
}
else if (value2.Length > 1024)
{
// prevent stack overflow
return -2;
}
int* costs = stackalloc int[value2.Length];
int* costsEnd = costs + value2.Length, ptr;
int i = 0, j, insertionCost, cost, additionCost;
char value1Char;
for (ptr = costs; ptr != costsEnd; ptr++)
{
*ptr = ++i;
}
for (i = 0; i < value1.Length; i++)
{
// cost of the first index
cost = additionCost = i;
// cache value for inner loop to avoid index lookup and bounds checking, profiled this is quicker
value1Char = value1[i];
for (ptr = costs, j = 0; ptr != costsEnd; ptr++, j++)
{
insertionCost = cost;
cost = additionCost;
// assigning this here reduces the array reads we do, improvement of the old version
additionCost = *ptr;
if (value1Char != value2[j])
{
if (insertionCost < cost)
{
cost = insertionCost;
}
if (additionCost < cost)
{
cost = additionCost;
}
++cost;
}
*ptr = cost;
}
}
// the last int is the cost
return *(--costsEnd);
}
}
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.