
workbench's Introduction

BitFunnel

This repo contains the code for the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search, a paper presented at SIGIR 2017. This video gives a good overview of the algorithm.

The codebase here was published to allow the research community to replicate results from the SIGIR paper. The documentation is pretty thin, but the paper and the overview video above are good starting points.


Dependencies

In order to build BitFunnel you will need CMake (2.8.11+) and a modern C++ compiler (gcc 5+, clang 3.5+, or VC 2015+). You can run CMake directly to generate the appropriate build setup for your platform. Alternatively, we provide scripts with the defaults that we use.

*nix

For *nix platforms (including OS X),

./Configure_Make.sh
cd build-make
make
make test

Note that while these instructions are for a make build, it's also possible to build using ninja by changing the cmake command to create ninja files instead of Makefiles. These aren't listed in the instructions because ninja requires installing an extra dependency for some developers, but if you want to use ninja it's available via apt-get, brew, etc., and is substantially faster than make.

Ubuntu

If you're on Ubuntu 15+, you can install dependencies with:

sudo apt-get install clang cmake

On Ubuntu 14 and below, you'll need to install a newer version of CMake. To install a new-enough CMake, see this link. If you're using gcc, you'll also need to make sure you have gcc-5 (sudo apt-get install g++-5).

To override the default compiler, set the CXX and CC environment variables. For example, if you have clang-3.8 installed as clang-3.8 and are using bash:

export CXX="clang++-3.8"
export CC="clang-3.8"

OS X

Install XCode and then run the following command to install required packages using Homebrew (http://brew.sh/):

brew install cmake

BitFunnel can be built on OS X using either standard *nix makefiles or XCode. To generate and build with makefiles, run the following in the root BitFunnel directory:

./Configure_Make.sh
cd build-make
make

If you want to create an Xcode project instead of using Makefiles, run:

./Configure_XCode.sh

If you use XCode, you'll have to either re-run Configure_XCode or run the ZERO_CHECK target when the CMakeLists changes, e.g., when source files are added or removed.

Windows

You will need CMake (2.8.11+) and Visual Studio 2015 or later.

Note: If you install Visual Studio for the first time and select the default install options, you won't get a C++ compiler. To force the install of the C++ compiler, you need to either create a new C++ project or open an existing C++ project.

Clone the BitFunnel repository and then run the following command in BitFunnel's root folder:

.\Configure_MSVC.bat

Note: You will need to modify the CMake -G option if you use a different version of Visual Studio. BitFunnel must be built as a 64-bit program, so 'Win64' must be part of the specified -G option text.

At this point, you can open the generated solution BitFunnel_CMake.sln from Visual Studio and then build it. Alternatively, you can build from the command line using cmake --build build-MSVC.

workbench's People

Contributors

danluu, mikehopcroft


workbench's Issues

JDK installation instructions missing

I'll update this for Linux, but that will still leave instructions missing for Windows and OS X. I think it's fine for those to be missing, but we should at least have a dependency list and add the JDK to it.

Stopwords being removed

This seems like it would screw up phrase queries?

An alternate fix would be to add Lucene's stopword list to our parser and submit the appropriately modified phrase query.
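To illustrate the concern, here is a sketch (not BitFunnel or workbench code — the stopword list and whitespace tokenizer are stand-ins) of how stripping stopwords on only one side breaks a phrase query, and how applying the same list to the query side restores a match:

```python
# Sketch: why removing stopwords breaks phrase queries.
# STOPWORDS is a toy list, similar in spirit to Lucene's English defaults.
STOPWORDS = {"the", "of", "to", "a", "in"}

def tokenize(text, strip_stopwords):
    tokens = text.lower().split()
    if strip_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur consecutively in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = "flight of the conchords"
query = "flight of the conchords"

# With stopwords intact on both sides, the phrase matches.
assert phrase_match(tokenize(doc, False), tokenize(query, False))

# If the index strips stopwords but the query does not, the phrase fails.
assert not phrase_match(tokenize(doc, True), tokenize(query, False))

# The alternate fix described above: strip the query the same way.
assert phrase_match(tokenize(doc, True), tokenize(query, True))
```

Note that even with the fix, phrase semantics change: "flight conchords" would now also match, which is the usual trade-off of stopword removal.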

We're not stemming some (any?) terms

I thought that we were doing this. If we're not and that's on purpose, that's fine, but we have (for example) downlink, downlinks, and downlinked in our DocumentFrequencyTable when we use the chunked1 corpus.
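As a sketch of what stemming would buy here — a toy suffix stripper, not Lucene's actual Porter stemmer — the three downlink variants collapse to a single term:

```python
# Sketch (hypothetical, not the workbench pipeline): a crude suffix stripper
# showing how stemming would merge the downlink variants into one
# DocumentFrequencyTable entry.
def crude_stem(term):
    for suffix in ("ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

variants = ["downlink", "downlinks", "downlinked"]
stems = {crude_stem(t) for t in variants}
assert stems == {"downlink"}  # one entry instead of three
```

A real fix would use the analyzer's stemmer rather than ad-hoc rules like these.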

Dates don't appear to be normalized

ba3f1d7af1500bd9,1,1,1.15318e-07,19.06.2002
2ed60e759534c745,1,1,1.15318e-07,1915aug10a

We also have nothing with /, which may mean that we're incorrectly splitting up dates with slashes in them.
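A hypothetical normalizer for the dotted and slashed forms might look like the sketch below; note that since the tokenizer apparently splits on "/", any such normalization would have to run before tokenization:

```python
import re

# Sketch (hypothetical normalizer): map dd.mm.yyyy and dd/mm/yyyy forms to a
# single canonical yyyy-mm-dd token so equivalent dates share one posting.
DATE_RE = re.compile(r"^(\d{1,2})[./](\d{1,2})[./](\d{4})$")

def normalize_date(token):
    m = DATE_RE.match(token)
    if not m:
        return token
    day, month, year = m.groups()
    return "%s-%02d-%02d" % (year, int(month), int(day))

assert normalize_date("19.06.2002") == "2002-06-19"
assert normalize_date("19/06/2002") == "2002-06-19"
assert normalize_date("1915aug10a") == "1915aug10a"  # left alone; needs its own rule
```

The dd.mm.yyyy-first ordering is an assumption; real data would need locale handling and rules for forms like 1915aug10a.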

i, ii, iii, etc.

It looks like we're indexing the index number in lists or something like that?

a8120ede65e124a2,1,1,0.068493,i
4148ef8fddabef20,1,1,0.0374597,ii
971fe168e0442370,1,1,0.0113684,iii
99c51617da0db9ac,1,1,0.00588676,iv

Wikipedia extraction seems to be giving bigrams

Repro:

BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b

StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text

Shouldn't have bigrams, shouldn't have capital letters:

Bigram where none expected (also capital letter):
72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,upāsikā

3ca7b8a975b95d4d,1,1,0.000144196,crisplock

Capital letter
49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell

7d8b10a0a2b9f455,1,0,0.000144196,Evolutionarily stable strategy

Random garbage
b651bc4fddcd84af,1,1,0.000144196,86p
6cc733ca24bc18e,1,1,0.000144196,ಹರಿವೆ
5847567b67dc03cb,1,1,0.000144196,xis

a31bc33fc17f3fc6,1,1,0.000144196,लाख

b3d2c5a33dd1efc6,1,1,0.000144196,königsberger

README.md shows wrong output

Near the bottom of the readme, the output section actually shows the input files. This should show the chunk files instead.

Here's the output

$ ls -l sample-output/
total 8
-rw-r--r-- 1 Mike 197121 836 May 15 23:55 Frost.txt
-rw-r--r-- 1 Mike 197121 1862 May 15 23:55 Whitman.txt

ProcessDocumentHeader() in WikipediaDumpProcessor should use analyzer.

Currently ProcessDocumentHeader() does not use the Lucene analyzer for the document title. This leads to problems with terms that contain colons. As an example, in the file AA\wiki_83, document 11327 https://en.wikipedia.org/?curid=11327 has the title "Wikipedia:Free On-line Dictionary of Computing/symbols - B". Since this title is not passed through the Lucene tokenizer, the colon makes it through and we end up with the term "Wikipedia:Free" in the Document Frequency Table. When we use the Document Frequency Table as a source of test queries, we try to parse the query "Wikipedia:Free" and fail because the parser thinks that "Wikipedia" is a stream name prefix.
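A sketch of the intended behavior, using a stand-in tokenizer in place of the Lucene analyzer (lowercasing and splitting on non-alphanumerics, which is roughly what an analyzer would do to this title):

```python
import re

# Sketch: pass the document title through the same analysis as body text,
# so punctuation like ':' cannot survive into an indexed term.
def analyze(text):
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

title = "Wikipedia:Free On-line Dictionary of Computing/symbols - B"
terms = analyze(title)

# The colon no longer survives into any term, so "Wikipedia:Free" cannot
# end up in the Document Frequency Table and later be misparsed as a
# stream-name prefix when the table is used as a source of test queries.
assert "wikipedia:free" not in terms
assert "wikipedia" in terms and "free" in terms
```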

Many n-grams in corpus

After processing wikipedia with the fixes as of 274293f3af97c507416f6387020507ee99ca3238, the tail of the DocFreqTable has a lot of n-grams:

724ddeaf8cb3c269,1,0,1.93455e-07,Vasilije Veljko Milovanović
e802585d5e004af1,1,0,1.93455e-07,2014 All-Arena Team
7c401744d5d61355,1,1,1.93455e-07,f.a.cortez
dafa24ba41b2a01d,1,0,1.93455e-07,Coeliades ramanatek
1a8055b58daaf330,1,0,1.93455e-07,Jeff Tobolski
adeb1f3f88d9bf92,1,1,1.93455e-07,shirt.turnfurlong
9dc6283de675270b,1,0,1.93455e-07,1978 Notre Dame Fighting Irish football team
5cc16879c0ad5653,1,1,1.93455e-07,shrambhushan
aea5e0ae16c34325,1,1,1.93455e-07,ca1703286
ce7ac1e3fa0fa95b,1,0,1.93455e-07,Hyperthaema sordida
bbd646c18643abf0,1,1,1.93455e-07,yelkhovoozersky
895697f8c748363f,1,0,1.93455e-07,Alashkert Stadium
d5ddbbd6281b2f91,1,0,1.93455e-07,Crédito Predial Português
7a18bab66de2a784,1,0,1.93455e-07,List of wars involving Iraqi Kurdistan
71000fa2b784fbb1,1,1,1.93455e-07,alox12p
6c47ffa2419cebfc,1,0,1.93455e-07,Republican Social Party of French Reconciliation
...
91ea6c89333d46fe,1,0,1.93455e-07,Janet Jackson filmography
596acddb187d2224,1,1,1.93455e-07,bingobingo
7f4e295958f0d3ad,1,0,1.93455e-07,Nawal al-Hawsawi
2b3c46a61d6a01,1,1,1.93455e-07,arachidconic
a99158398732ad89,1,0,1.93455e-07,Sâne Morte
a07381d76998301,1,1,1.93455e-07,blind.net
5252fcd8785074c4,1,1,1.93455e-07,aettn
2467896c19e4ae96,1,0,1.93455e-07,Montsweag Bay
9e49735fc54b7c76,1,0,1.93455e-07,"Friedrich Günther, Prince of Schwarzburg-Rudolstadt"
4c46bc6fc4ce2549,1,0,1.93455e-07,Herman Riddle Page
79ba22ae9e1cfc9c,1,1,1.93455e-07,689368
d50cfa30c7b6357a,1,0,1.93455e-07,McDonald's African American Heritage Series
f237df08253dc88,1,0,1.93455e-07,Ireland at the 2000 Summer Olympics
dfa3155bb84d1397,1,0,1.93455e-07,Preferential creditor
eef202c76f008699,1,0,1.93455e-07,Virgilio Tosi
577e733f140b86b2,1,1,1.93455e-07,agents13_en.html
b96d6bb3dd1da702,1,0,1.93455e-07,Ravi Gossain
d70665fd53174abe,1,1,1.93455e-07,mcdean's
92f39c417e294d1f,1,1,1.93455e-07,sonnenritter
558a9b05e72d7319,1,0,1.93455e-07,Althaea
5406174c6bc23256,1,1,1.93455e-07,ouleida
8542aa4e4d48249f,1,0,1.93455e-07,"John Savile, 4th Earl of Mexborough"
5274284f94fffeb6,1,0,1.93455e-07,Sue Timney
62d049ebf69b2705,1,1,1.93455e-07,commercially.it
35fec685ff1011ea,1,1,1.93455e-07,comfia
565d59d3b90b8fee,1,0,1.93455e-07,List of Banksia species
51120ae38af4f54b,1,1,1.93455e-07,spökprästgården

Many terms have underscores in them

For example:

6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college

We also have terms with double underscores that appear to be some kind of metadata?

868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__

As well as weird terms that have even more underscores:

b614cd7474e25139,1,1,5.76591e-07,f___
22ed3514efd6df2a,1,1,4.61273e-07,o___y
3ea21b09f892bac0,1,1,4.61273e-07,mother______
c4767d3137687cf6,1,1,9.6262e-07,i_______________________________________
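The double-underscore tokens like __NOTOC__ are MediaWiki behavior switches ("magic words") that leaked through extraction; single underscores usually come from unsplit link targets. One assumed cleanup (a sketch, not the actual pipeline) is to drop the magic words and split everything else on underscores:

```python
import re

# Sketch (hypothetical cleanup pass for underscore terms):
# - tokens of the form __word__ are MediaWiki magic words; drop them
# - otherwise split on underscores and keep the non-empty pieces
MAGIC_WORD = re.compile(r"^__[a-z]+__$")

def clean(token):
    if MAGIC_WORD.match(token):
        return []
    return [p for p in token.split("_") if p]

assert clean("__notoc__") == []
assert clean("government_medical_college") == ["government", "medical", "college"]
assert clean("f___") == ["f"]
```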

Many terms appear to be filenames

58824d7fb6a6fae3,1,1,4.61273e-07,mp_photo_guidelines.html
65f2c1768fda3aaa,1,0,4.61273e-07,engineering.svg
48f47d14ac8f85df,1,0,0.00314992,cover.jpg
39cf0dcb77b80dd8,1,0,0.000296944,filmposter.jpeg
44cd8472a0872e66,1,0,0.00436699,logo.png

Possibly spurious single letters

A large fraction of documents have single letter terms. This looks pretty suspicious.

be6572d261265461,1,1,0.0380624,s
4b2cb04c8105642d,1,1,0.0299786,c
14b93e0f59ea8e65,1,1,0.0286755,b

The vast majority of documents are tiny

If we look at the wikipedia dump currently hosted on Azure, the modal number of postings per document is 5, and things drop off rapidly from there:

Postings,Count
0,5
1,9013
2,161034
3,490873
4,752513
5,795627
6,458944
7,297922
8,187495
9,159601
10,122515
11,98068
12,93155
13,82168
14,80742
15,74154
16,69059
17,64268
18,67888
19,63546
20,63112
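The modal bucket can be recomputed directly from the rows above:

```python
# Postings-per-document counts, copied from the table above
# (buckets 0 through 20).
counts = {
    0: 5, 1: 9013, 2: 161034, 3: 490873, 4: 752513, 5: 795627,
    6: 458944, 7: 297922, 8: 187495, 9: 159601, 10: 122515,
    11: 98068, 12: 93155, 13: 82168, 14: 80742, 15: 74154,
    16: 69059, 17: 64268, 18: 67888, 19: 63546, 20: 63112,
}

# The bucket with the largest document count is the mode.
mode = max(counts, key=counts.get)
assert mode == 5
assert counts[mode] == 795627
```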
