jabberhams / simhash-dotnet

This project forked from mrvancil/simhash-csharp

.NET port of a very good simhash implementation in Python. Recompiled for net8.0. Forked from mrvancil/simhash-sharp.

License: GNU General Public License v2.0

C# 100.00%

SimHash-DotNet

Upgraded to .NET 8.0

Forked from mrvancil/simhash-sharp

This is a C# port of a very clear and concise simhash implementation in Python (also on GitHub at https://github.com/liangsun/simhash). I have ported almost all of the tests as well, adding a couple of others along the way. The port stays pretty close to the original, but I have diverged from the Python version's overloading of the class and implemented explicit methods for calculation instead, because I prefer this approach.

Getting Started

There is an external (NuGet) dependency on three hashing packages: System.Data.HashFunction.Core, System.Data.HashFunction.Interfaces, and System.Data.HashFunction.Jenkins. Jenkins, MurmurHash, and FNV seem to be the most popular choices for hashing the feature set. The default in this library is Jenkins (you can specify a bit length of 64), but there is also an MD5 implementation (not recommended due to BigInteger messiness).
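To make "hashing the feature set" concrete, here is a minimal sketch of how a 64-bit simhash fingerprint can be built from a list of features. It is not this library's API: the names SimHashSketch, Fnv1a64, and Fingerprint are hypothetical, every feature is given equal weight, and an inline FNV-1a hash stands in for Jenkins so the example has no NuGet dependency.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch, not the library's actual classes or method names.
public static class SimHashSketch
{
    // FNV-1a 64-bit, used here only as a dependency-free stand-in
    // for the Jenkins hash that the library uses by default.
    public static ulong Fnv1a64(string feature)
    {
        const ulong offsetBasis = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;
        ulong hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(feature))
        {
            unchecked
            {
                hash ^= b;
                hash *= prime; // intentionally wraps modulo 2^64
            }
        }
        return hash;
    }

    // Each feature votes +1 or -1 on every bit position; the fingerprint
    // keeps a 1 wherever the positive votes win.
    public static ulong Fingerprint(IEnumerable<string> features)
    {
        var votes = new int[64];
        foreach (string feature in features)
        {
            ulong h = Fnv1a64(feature);
            for (int bit = 0; bit < 64; bit++)
            {
                votes[bit] += ((h >> bit) & 1UL) == 1UL ? 1 : -1;
            }
        }

        ulong fingerprint = 0;
        for (int bit = 0; bit < 64; bit++)
        {
            if (votes[bit] > 0)
            {
                fingerprint |= 1UL << bit;
            }
        }
        return fingerprint;
    }
}
```

Calling SimHashSketch.Fingerprint(text.Split(' ')) would treat whitespace-separated words as features, which is itself an assumption; the choice of features (words, character n-grams, weighted terms) is independent of the voting step shown here.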

Issues

Currently there are no known issues.

Performance Optimization

Since some of the types are not exactly the same between Python and C# (HoneyBadger don't care you got a huge number: Python integers grow to arbitrary size, while C# needs BigInteger for that), there might be some speed loss or improvement depending on how I implemented the native types in C#.

With Jenkins as the hashing algorithm, it takes roughly 2 minutes (on a smallish laptop) to generate fingerprints for 11,000 full text articles.

With MD5 as the hashing algorithm, it takes roughly 18 minutes (on the same smallish laptop) to generate fingerprints for 11,000 full text articles.

It takes roughly 15 seconds to calculate the Hamming distance for all the articles against those 11,000 fingerprints (still on the smallish laptop).
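For reference, the Hamming distance between two 64-bit fingerprints is just the number of bits in which they differ. The sketch below shows one way to compute it and to scan a list of fingerprints for near matches. The names HammingSketch, Distance, and FindNear are hypothetical and are not taken from this library; it assumes fingerprints are stored as ulong values.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical sketch, not the library's actual classes or method names.
public static class HammingSketch
{
    // XOR leaves a 1 in every position where the fingerprints disagree;
    // PopCount then counts those positions.
    public static int Distance(ulong a, ulong b)
    {
        return BitOperations.PopCount(a ^ b);
    }

    // Linear scan: returns the indexes of all fingerprints within
    // maxDistance bits of the query fingerprint.
    public static List<int> FindNear(ulong query, IReadOnlyList<ulong> fingerprints, int maxDistance)
    {
        var matches = new List<int>();
        for (int i = 0; i < fingerprints.Count; i++)
        {
            if (Distance(query, fingerprints[i]) <= maxDistance)
            {
                matches.Add(i);
            }
        }
        return matches;
    }
}
```

Comparing every article against every fingerprint this way is quadratic in the number of articles, which is one reason an indexed backend can be worthwhile for larger collections.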

Future Work

  • Database backends (Redis, HBase, SQL, Mongo) - this will mimic the concepts behind another popular simhash library (https://github.com/seomoz/simhash-db-py)
  • NuGet submission
  • Interfaces
  • Alternate hashing functions
  • More test coverage
  • Ensure the converters are up to snuff

Contributors

mrvancil, jabberhams
