simhash's Introduction

A simhash library designed specifically for Chinese documents

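Introduction

This project computes simhash values for Chinese documents. simhash is an algorithm Google uses for text deduplication, and it is now widely used in text processing.

For details, see "The simhash Algorithm: Principles and Implementation" (simhash算法原理及实现).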

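Features

  • Uses CppJieba for word segmentation and keyword extraction.
  • Uses jenkins as the hash function.
  • Header-only (hpp) style: all source code lives in .hpp files, so it is easy to use. No linking, no pain.
  • A by-product of this project, simhash_server, provides a simple simhash HTTP service.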
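
The pipeline these features describe (extract weighted keywords, hash each one, combine the hashes into a 64-bit fingerprint) can be sketched roughly as follows. This is a minimal illustration, not the library's code: `WeightedWord`, `fnv1a64`, and `simhashSketch` are hypothetical names, and an FNV-1a hash stands in for the jenkins hash the project uses, so the values it produces will not match the library's output.

```cpp
#include <stdint.h>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: a keyword paired with its weight (for example a TF-IDF score).
typedef std::pair<std::string, double> WeightedWord;

// Stand-in 64-bit hash (FNV-1a). The project itself uses a jenkins hash.
inline uint64_t fnv1a64(const std::string& s) {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 1099511628211ULL;             // FNV-1a 64-bit prime
    }
    return h;
}

// Each keyword votes on every bit position: +weight where its hash has a 1,
// -weight where it has a 0. Positions with a positive total become 1.
inline uint64_t simhashSketch(const std::vector<WeightedWord>& words) {
    double votes[64] = {0.0};
    for (std::size_t w = 0; w < words.size(); ++w) {
        const uint64_t h = fnv1a64(words[w].first);
        for (int bit = 0; bit < 64; ++bit) {
            votes[bit] += ((h >> bit) & 1ULL) ? words[w].second : -words[w].second;
        }
    }
    uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (votes[bit] > 0.0) fingerprint |= (1ULL << bit);
    }
    return fingerprint;
}
```

With a single keyword of positive weight the fingerprint is simply that keyword's hash; with several keywords, heavier ones dominate the vote, which is why documents sharing their top keywords tend to end up with nearby fingerprints.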

Dependencies

  • g++ (version >= 4.1 recommended) or clang++.

Usage

mkdir build
cd build
cmake ..
make

Testing

make test

Demo

./demo

The output is as follows:

Text: "我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,走上人生巅峰。"
Keyword sequence: ["蓝翔:11.7392", "CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089"]
simhash value: 17831459094038722629
Equality check for the simhash values 100010110110 and 110001110011:
With the Hamming distance threshold at its default of 3, isEqual returns: 0
With the Hamming distance threshold set to 5, isEqual returns: 1

See example/demo.cpp for details.
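
The equality check in the demo output can be reproduced by hand: the two binary strings differ in 5 bit positions, so they count as equal at threshold 5 but not at threshold 3. A minimal sketch follows. `parseBinary`, `hammingDistance`, and `isEqualSketch` are hypothetical helpers written for this illustration, not the library's own functions, and the sketch assumes "equal" means a distance less than or equal to the threshold, which is consistent with the output shown.

```cpp
#include <stdint.h>
#include <string>

// Parse a string of '0'/'1' characters, most significant bit first.
inline uint64_t parseBinary(const std::string& bits) {
    uint64_t value = 0;
    for (std::string::size_type i = 0; i < bits.size(); ++i) {
        value = (value << 1) | (bits[i] == '1' ? 1ULL : 0ULL);
    }
    return value;
}

// Count differing bits: XOR the values, then clear the lowest set bit
// repeatedly until nothing is left.
inline unsigned hammingDistance(uint64_t a, uint64_t b) {
    uint64_t diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;
        ++count;
    }
    return count;
}

// Assumed semantics: two fingerprints are "equal" when they differ in at
// most `threshold` bits (3 by default, as in the demo output).
inline bool isEqualSketch(uint64_t a, uint64_t b, unsigned threshold = 3) {
    return hammingDistance(a, b) <= threshold;
}
```

Calling `hammingDistance(parseBinary("100010110110"), parseBinary("110001110011"))` gives 5, so `isEqualSketch` is false with the default threshold and true when 5 is passed as the threshold.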

Benchmark

./benchmark/benchmarking

The output is as follows:

Running ./benchmark/benchmarking
Run on (16 X 2494.14 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 4096 KiB (x16)
  L3 Unified 36608 KiB (x1)
Load Average: 0.07, 0.04, 0.03
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
BENCHMARK_Simhasher_extract_text50_top5                     13478 ns        13478 ns        52013
BENCHMARK_Simhasher_extract_text50_top10                    13843 ns        13843 ns        50833
BENCHMARK_Simhasher_extract_text50_top15                    13929 ns        13929 ns        49488
BENCHMARK_Simhasher_extract_text50_top20                    13842 ns        13842 ns        50541
BENCHMARK_Simhasher_extract_text500_top5                   184074 ns       184067 ns         3775
BENCHMARK_Simhasher_make_text50_top5                        14457 ns        14457 ns        48341
BENCHMARK_Simhasher_make_text50_top10                       15170 ns        15169 ns        46203
BENCHMARK_Simhasher_make_text50_top15                       15585 ns        15585 ns        44903
BENCHMARK_Simhasher_make_text50_top20                       15743 ns        15742 ns        44466
BENCHMARK_Simhasher_binaryStringToUint64                    0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_toBinaryString                           63.9 ns         63.9 ns     10937009
BENCHMARK_Simhasher_make_from_predefined_keywords5            423 ns          423 ns      1644823
BENCHMARK_Simhasher_make_from_predefined_keywords10           735 ns          735 ns       950156
BENCHMARK_Simhasher_make_from_predefined_keywords20          1364 ns         1364 ns       508935
BENCHMARK_Simhasher_make_from_predefined_keywords50          7876 ns         7875 ns        89006
BENCHMARK_Simhasher_make_from_predefined_keywords100        21409 ns        21409 ns        32743
BENCHMARK_Simhasher_make_from_predefined_keywords200        47469 ns        47468 ns        14728
BENCHMARK_Simhasher_make_from_predefined_keywords500       124316 ns       124314 ns         5627
BENCHMARK_Simhasher_make_from_predefined_keywords1000      251336 ns       251329 ns         2785
BENCHMARK_Simhasher_binaryStringToUint64_isEqual            0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_binaryStringToUint64_isEqual_10k        0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_binaryStringToUint64_isEqual_1000k      0.000 ns        0.000 ns   1000000000

simhash's People

Contributors

alexyangfox, bitdeli-chef, innernull, micheal-zhang-0111, yanyiwu

simhash's Issues

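Path problem

I want to run the demo, but all the paths are relative and it does not run on my machine. I don't know C++ very well, and changing the paths one by one is too tedious. Do you have any good suggestions?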

Algorithm optimization question

Hi,
For the following two sentences, taking the top 32 keywords, the Hamming distance is 6.

都听别人说好做,自己还没尝试呢 我是个大学生,我也想开店,楼主要好好教我啊 加油把

都听别人说好做,自己还没尝试呢 我是个大学生,我也想开店,楼主要好好教我啊 学习了

What approaches could be used to optimize this?

Error happened when compiled with g++ 4.8.2 with option -std=c++11

In file included from /home/janfan/documents/simhash/src/CppJieba/MixSegment.hpp:5:0,
from /home/janfan/documents/simhash/src/CppJieba/KeywordExtractor.hpp:4,
from /home/janfan/documents/simhash/src/Simhasher.hpp:4,
from /home/janfan/documents/simhash/src/main.cpp:6:
/home/janfan/documents/simhash/src/CppJieba/MPSegment.hpp: In member function ‘bool CppJieba::MPSegment::cut(Limonp::LocalVector::const_iterator, Limonp::LocalVector::const_iterator, std::vector<Limonp::LocalVector >&) const’:
/home/janfan/documents/simhash/src/CppJieba/MPSegment.hpp:103:106: error: no matching function for call to ‘make_pair(size_t&, NULL)’
segmentChars[i].dag.insert(make_pair<DagType::key_type, DagType::mapped_type>(i, NULL));
^
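
One likely cause, offered as general C++11 behavior rather than a confirmed diagnosis of this code base: since C++11, `std::make_pair` takes its arguments by forwarding reference, so supplying explicit template arguments turns the first parameter into `size_t&&`, which cannot bind to the lvalue `i`. Constructing the pair type directly avoids the problem. The sketch below uses stand-in definitions; `DictUnit`, `DagType`, and `insertEmptySlot` are hypothetical and only mimic the shape of the failing line.

```cpp
#include <cstddef>
#include <map>
#include <utility>

struct DictUnit {};  // stand-in for the real dictionary entry type
typedef std::map<std::size_t, const DictUnit*> DagType;  // assumed shape

// Insert (i, null pointer) without naming make_pair's template arguments.
// Returns true if a new entry was inserted.
inline bool insertEmptySlot(DagType& dag, std::size_t i) {
    // A typed null pointer avoids any integer-to-pointer deduction issues.
    const DictUnit* none = NULL;
    // value_type is std::pair<const key_type, mapped_type>; constructing it
    // directly works whether the arguments are lvalues or rvalues.
    return dag.insert(DagType::value_type(i, none)).second;
}
```

Giving the null pointer an explicit pointer type before building the pair also sidesteps the related case where a bare `NULL` is deduced as an integer type.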

Assertion failed: (topN == wordweights.size())

I passed in the following input:
string s("我想吃饭,我最喜欢计算机了。");
Running it produces the following error:
Assertion failed: (topN == wordweights.size()), function make, file /Users/taowei/Documents/工程/simhash/simhash/Simhasher.hpp, line 33.

question about dict

Sorry, I am a newbie.

There are four UTF-8 files in the dict directory, and I am confused about where they come from and what each of them is used for. Can I change them?

Could you provide some clearer information about them? Thanks.

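A question about the number of bits in a simhash

Hi,
I have a question about the number of bits in a simhash. In the original paper the hash is 64 bits; if each char represents 4 bits (0-f), the result should be a string of length 16.
The demo, however, shows a string of length 20. If each char is 4 bits (I am not sure about this, since I only saw 0-9 and no a-f), what are the extra 16 bits for?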
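
A possible explanation, assuming the demo prints the value as an unsigned 64-bit integer in decimal (which the digits 0-9 suggest): a decimal character carries roughly 3.3 bits rather than 4, and the largest 64-bit value, 18446744073709551615, is 20 decimal digits long. Under that assumption no extra bits are involved, and the same value printed in hexadecimal takes 16 characters. The helper names below are hypothetical.

```cpp
#include <stdint.h>
#include <sstream>
#include <string>

// Format a 64-bit value in base 10, as the demo output appears to do.
inline std::string toDecimal(uint64_t v) {
    std::ostringstream out;
    out << v;
    return out.str();
}

// Format the same value in base 16 (4 bits per character, no leading zeros).
inline std::string toHex(uint64_t v) {
    std::ostringstream out;
    out << std::hex << v;
    return out.str();
}
```

For the demo value 17831459094038722629, `toDecimal` returns a 20-character string and `toHex` a 16-character one; both describe the same 64 bits.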

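I have been trying simhash for news deduplication and have three questions that I hope you can help answer

1. How were the IDF values in your dictionary computed? When processing a very large number of documents, do the IDF values need to be updated?
2. How do you handle sentences such as “鍗楁棆鎺ц偂寮 姤1.32鍏 楂树笂甯备环10% 銆  鍗楁棆鎺ц偂锛” and “懆浜斿憿鏄 [富锷涢槾闄╃殑鐜╀竴鎶婏纴涓嶈Е纰”?
3. When computing Hamming distances between several texts, I found documents that differ greatly yet have a very small Hamming distance. What might cause this? Is it a term-frequency setting problem?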

compiling error

simhash/cppjieba/../limonp/StdExtension.hpp:19: error: 'unordered_map' is already declared in this scope
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
Copyright © 2010 Free Software Foundation, Inc.

Has anybody else run into this?

How can I fix it?

Assertion failed:

tw:build tw$ ./bin/simhash.demo
Assertion failed: (buf.size() == DICT_COLUMN_NUM), function _loadDict, file /Users/taowei/simhash/src/CppJieba/DictTrie.hpp, line 161.
Abort trap: 6

The same problem occurs when I create a project and import src into Xcode.

Cannot compile when simhash and cppjieba are placed in the same directory

Simhasher.hpp: In member function ‘bool simhash::Simhasher::extract(const string&, std::vector<std::pair<std::__cxx11::basic_string, double> >&, size_t) const’:
Simhasher.hpp:23:58: error: void value not ignored as it ought to be
return _extractor.Extract(text, res, topN);

Documentation for simhash?

Does this project have any documentation?

In demo.cpp, what do

simhasher.extract(s, res, topN);
simhasher.make(s, topN, u64);

mean, respectively?

Compilation fails with gcc 4.4.6: error: invalid conversion from ‘long int’ to ‘const CppJieba::DictUnit*’

[ 16%] Building CXX object src/CMakeFiles/simhash.demo.dir/main.cpp.o
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algobase.h:66,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/char_traits.h:41,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/ios:41,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/ostream:40,
from /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/iostream:40,
from /search/billczhang/xfs/simhash/src/main.cpp:2:
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_pair.h: In constructor ‘std::pair<_T1, _T2>::pair(_U1&&, _U2&&) [with _U1 = size_t&, _U2 = long int, T1 = long unsigned int, T2 = const CppJieba::DictUnit]’:
/search/billczhang/xfs/simhash/src/CppJieba/MPSegment.hpp:96: instantiated from here
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_pair.h:90: error: invalid conversion from ‘long int’ to ‘const CppJieba::DictUnit

make[2]: *** [src/CMakeFiles/simhash.demo.dir/main.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/simhash.demo.dir/all] Error 2
make: *** [all] Error 2
