
BitFunnel

This repo contains the code for the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search, a paper presented at SIGIR 2017. This video gives a good overview of the algorithm.

The codebase here was published to allow the research community to replicate results from the SIGIR paper. The documentation is pretty thin, but we encourage you to start with the paper, the video above, and the build instructions below.


Dependencies

In order to build BitFunnel you will need CMake (2.8.11+) and a modern C++ compiler (gcc 5+, clang 3.5+, or VC 2015+). You can run CMake directly to generate the appropriate build setup for your platform. Alternatively, we provide scripts with the defaults we use.

*nix

For *nix platforms (including OS X),

./Configure_Make.sh
cd build-make
make
make test

Note that while these instructions are for a make build, it's also possible to build with ninja by changing the cmake command to generate ninja files instead of Makefiles. This isn't in the instructions because ninja requires installing an extra dependency for some developers, but if you want to use it, ninja is available via apt-get, brew, etc., and is substantially faster than make.

Ubuntu

If you're on Ubuntu 15+, you can install dependencies with:

sudo apt-get install clang cmake

On Ubuntu 14 and below, you'll need to install a newer version of CMake. To install a new-enough CMake, see this link. If you're using gcc, you'll also need to make sure you have gcc-5 (sudo apt-get install g++-5).

To override the default compiler, set the CXX and CC environment variables. For example, if you have clang-3.8 installed as clang-3.8 and are using bash:

export CXX="clang++-3.8"
export CC="clang-3.8"

OS X

Install Xcode and then run the following command to install required packages using Homebrew (http://brew.sh/):

brew install cmake

BitFunnel can be built on OS X using either standard *nix makefiles or Xcode. To generate and build with makefiles, follow the *nix instructions above.

If you want to create an Xcode project instead of using Makefiles, run:

./Configure_XCode.sh

If you use Xcode, you'll have to either re-run Configure_XCode.sh or run the ZERO_CHECK target whenever CMakeLists.txt changes, e.g., when source files are added or removed.

Windows

You will need Visual Studio (2015 or later) and CMake.

Note: If you install Visual Studio for the first time and select the default install options, you won't get a C++ compiler. To force the install of the C++ compiler, you need to either create a new C++ project or open an existing C++ project.

Clone the BitFunnel repository and then run the following command in BitFunnel's root folder:

.\Configure_MSVC.bat

Note: You will need to modify the CMake -G option if you use a different version of Visual Studio. BitFunnel must be built as a 64-bit program, so 'Win64' must be part of the -G option text.

At this point, you can open the generated solution BitFunnel_CMake.sln from Visual Studio and then build it. Alternatively, you can build from the command line using cmake --build build-MSVC.

Contributors

danluu, jondgoodwin, mikehopcroft
mg4j-workbench's Issues

Investigate whether mg4j has a mode that counts matches.

The act of returning a list of query results may cause GC-related performance problems. Might want to consider counting the matches instead of returning them.

One way to do this would be to modify QueryEngine.getResults() (starting at line 387 of QueryEngine.java) to count the results rather than return them.
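A minimal sketch of the counting idea, with the mg4j iterator types stubbed out by a plain Iterator (the class and method names here are hypothetical, not mg4j's):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch: instead of materializing a result list (one allocation per match,
// hence GC pressure), walk the match iterator and count.
public class CountResults {
    // Collecting version: allocates one list entry per match.
    static List<Integer> collect(Iterator<Integer> matches) {
        List<Integer> results = new ArrayList<>();
        while (matches.hasNext()) results.add(matches.next());
        return results;
    }

    // Counting version: no per-match allocation.
    static long count(Iterator<Integer> matches) {
        long n = 0;
        while (matches.hasNext()) { matches.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(3, 7, 42);
        System.out.println(count(docs.iterator()));  // prints 3
    }
}
```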

Description of mg4j index configuration

This issue tracks how we configure / intend to configure the mg4j index for maximum performance in the experiment.

  1. [Not implemented] Disable positions.
  2. Disable scoring.
  3. [Not verified]. Use BitStreamHPIndexReader. Actually, we probably want to use the subclass InMemoryHPIndex. Check to see if this is used by default. Right now it looks like the code uses QuasiSuccinctIndex.
  4. [Not verified]. Use in-memory index. See JavaDocs for Index.UriKeys.
  5. [Not verified]. Use wired index.
  6. No stemming.
  7. No stop word elimination.
  8. ??? Disable advanced queries (e.g. near, WAND, phrase).
  9. ??? Disable forward index storage for titles.
  10. ??? Disable forward index storage for BM25F scoring information.
  11. Exporter for Partitioned Elias-Fano index generates a frequency of 1 for every posting.

Conflicting timing measurements

In 525d6d6, QueryLogRunner.go() implements three different strategies for measuring the time spent processing queries:

  1. Take the difference between the time the last thread finished processing and the first thread started.
  2. Take the difference between the start time of the finishSynchronizer and that of the performanceSynchronizer.
  3. Take the difference between the time the threads joined and the time the threads were started.

In the results below, from processing 06.efficiency_topics.all, we see that method (3) measures 24.25s, almost twice as large as the other two methods. The difference stems from the fact that QueryProcessorThread.run() processes the entire query log twice in order to warm up the system; methods (1) and (2) only measure the time for the second run. We should clean up the time reporting.

Thread count: 8
Query count: 100000
Total time: 11.793248
Total time (synchronizer): 11.793267
Total time (simple): 24.249336
QPS: 8479.414311
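The warm-up-then-measure semantics of methods (1) and (2) can be sketched as follows; the per-query work here is a stand-in for QueryProcessorThread's processing, and the class is hypothetical:

```java
// Sketch: run the query log once untimed to warm up the system, then time
// only the second pass, matching methods (1) and (2) above.
public class TimedRun {
    static void processQueryLog(Runnable perQuery, int queries) {
        for (int i = 0; i < queries; i++) perQuery.run();
    }

    // Runs the log twice; returns elapsed nanoseconds for the second pass only.
    static long timeSecondPass(Runnable perQuery, int queries) {
        processQueryLog(perQuery, queries);   // warm-up pass (untimed)
        long start = System.nanoTime();
        processQueryLog(perQuery, queries);   // measured pass
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long ns = timeSecondPass(() -> { /* stand-in for query processing */ }, 100_000);
        System.out.printf("QPS: %f%n", 100_000 / (ns / 1e9));
    }
}
```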

Description of query log preprocessing

This issue documents our pre-processing of the query logs.

  1. Started with 05.efficiency_topics and 06.efficiency_topics.all from the 2005 and 2006 TREC Terabyte Tracks.

  2. Topic files contain one query per line where each query is preceded by a line number and a colon. The line numbers were removed.

  3. The mg4j and BitFunnel query languages do not allow certain punctuation characters. These were filtered out by replacing matches to the regular expression [-;:'\+] with the empty string. May want to consider replacing punctuation with a space, rather than the empty string.

  4. Query terms were not stemmed.
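Steps 2 and 3 above can be sketched as a single cleaning function. This variant replaces punctuation with a space (the alternative the issue suggests considering, rather than the empty string actually used); the class name is hypothetical:

```java
// Sketch of the query-log preprocessing: strip the "<number>:" prefix and
// replace the punctuation characters that mg4j/BitFunnel queries reject.
public class PreprocessQueryLog {
    static String clean(String line) {
        String noPrefix = line.replaceFirst("^\\d+:", "");
        return noPrefix.replaceAll("[-;:'+]", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("42:hurricane-katrina damage"));
        // prints "hurricane katrina damage"
    }
}
```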

TREC Terabyte topics contain characters that are illegal in mg4j queries.

The 2006 TREC Terabyte topics contain the following characters that are illegal in mg4j (and BitFunnel) queries: '-', ';', ':', ''', and '+'. Right now QueryLogRunner.LoadQueries() replaces each of these characters with the empty string:

line.replaceAll("[-;:'\\+]", "")

We should preprocess these input files to remove these characters, and then update LoadQueries() to remove the regex code.

Crash in ChunkDocument.tryParseStream while attempting to read filtered version of GX229

This crash happens in ChunkDocument.tryParseStream() when buffer[writeCursor++] = (byte)c; attempts to write past the end of buffer. The size of buffer was 256k, based on the assertion that gov2 documents are truncated at 256KB.

The document that causes the crash has length 357895. This document was encountered while processing GX229-1000-1500.chunk, which was a version of GX229.chunk that was filtered by BitFunnel to contain documents with unique posting counts from 1000 to 1500.

Some observations:

  • BitFunnel was able to read GX229.chunk in order to generate the filtered chunk GX229-1000-1500.chunk. This suggests that GX229.chunk, which was created by mg4j, was well-formed, even if it contained long documents.
  • In the chunk processing pipeline, mg4j generated GX229.chunk, but never attempted to read it.

My leading theory is that the original gov2 GX229 directory contains a bundle (.txt file) with a document that Tika represents as longer than 256KB.
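A hypothetical fix sketch for the fixed-size buffer: grow it on demand instead of assuming the 256KB truncation limit. tryParseStream itself is not reproduced here; this shows only a resizing read loop:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch: start with the 256KB buffer but double it whenever writeCursor
// reaches the end, so over-long documents no longer crash the parse.
public class GrowableRead {
    static byte[] readAll(InputStream in) throws IOException {
        byte[] buffer = new byte[256 * 1024];
        int writeCursor = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (writeCursor == buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2);  // grow on demand
            }
            buffer[writeCursor++] = (byte) c;
        }
        return Arrays.copyOf(buffer, writeCursor);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = new byte[357_895];  // length of the document from this crash report
        byte[] out = readAll(new ByteArrayInputStream(doc));
        System.out.println(out.length);  // prints 357895
    }
}
```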

Chunks generated from Gov2 seem to have lots of duplicate DocId values.

Chunks generated from Gov2 seem to have lots of duplicate DocId values. For instance, index-273-100-150 (all 273 gov2 directories, retaining only documents with 100-150 terms) contains 3,870,096 documents, but only 123,544 unique DocIds. When ingesting the documents in Gov2 directory order (GX000, GX001, GX002, ...), the first duplicate is DocId=24 in GX001.
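The duplicate check implied by this issue (finding the first DocId seen twice in ingestion order) can be sketched with a HashSet; the class is hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: scan DocIds in ingestion order and report the first repeated value.
public class DuplicateDocIds {
    static Long firstDuplicate(long[] docIds) {
        Set<Long> seen = new HashSet<>();
        for (long id : docIds) {
            if (!seen.add(id)) return id;  // add() is false if already present
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate(new long[] {10, 24, 31, 24}));  // prints 24
    }
}
```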

Index built directly from chunk differs from index built from collection.

Chunk file was produced from the first bundle of gov2:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.document.TRECDocumentCollection ^
     -f HtmlDocumentFactory -p encoding=iso-8859-1 ^
    d:\data\work\out2.collection d:\data\gov2\gx000\gx000\00.txt
java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     org.bitfunnel.reproducibility.GenerateBitFunnelChunks ^
     -S d:\data\work\out2.collection d:\data\work\out2.chunk

Index from collection was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
      --keep-batches --downcase -S d:\data\work\out2.collection d:\data\work\out2

Index from chunk was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
     -o org.bitfunnel.reproducibility.ChunkDocumentSequence(d:\data\work\out2.chunk) ^
     d:\data\work\out3

The process ran to completion, but the properties of the two indexes (out2 is from the collection and out3 is from the chunk) were different. Notably, the number of documents, the occurrences, and the maxdocsize are the same in both indexes, but terms, postings, and maxcount differ.

Index built from collection:

d:\data\work>type out2-text.properties
documents=1092
terms=21604
postings=158659
maxcount=652
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=text
size=4417176
maxdocsize=10622
occurrences=295570

Index built from chunk:

d:\data\work>type out3-text.properties
documents=1092
terms=292005
postings=295570
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=text
size=11473690
maxdocsize=10622
occurrences=295570

Index built from collection:

D:\data\work>more out2-title.properties
documents=1092
terms=1631
postings=5740
maxcount=4
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=title
size=96009
maxdocsize=27
occurrences=5904

Index built from chunk:

D:\data\work>more out3-title.properties
documents=1092
terms=4725
postings=5904
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=title
size=154267
maxdocsize=27
occurrences=5904

Collection-building pipeline should be based on gz files.

Our collection-building pipeline should be based on gz files instead of uncompressed .txt files. There are too many .txt files to pass as command-line arguments (Windows limit for command line is 8191 characters).

This mainly involves updates to shell scripts and the README.md to use the -z flag.

Index-build needs to filter out documents not in the BitFunnel shard.

The workflow will be as follows:

  1. Create a collection from some subset of gov2.
  2. Create a BitFunnel chunk from this collection.
  3. Filter the BitFunnel chunk to include only those documents slated for a particular shard.
  4. Somehow build an mg4j index that contains only those documents.

One option is to make the chunk filtering program emit a file that informs the index build of which documents to incorporate. This could be as simple as a file with one line per document, consisting of the character 'T' or 'F', indicating if the document should be included in the index. This approach would require a modification to the mg4j indexer.
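Reading the proposed per-document mask file could be sketched as follows: line i is 'T' or 'F' according to whether document i belongs in the shard. The file format is the simple one proposed above; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: parse a 'T'/'F' mask file (one line per document) into a boolean
// array the indexer could consult when deciding which documents to ingest.
public class ShardMask {
    static boolean[] parse(List<String> lines) {
        boolean[] include = new boolean[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            include[i] = lines.get(i).trim().equals("T");
        }
        return include;
    }

    public static void main(String[] args) {
        boolean[] mask = parse(Arrays.asList("T", "F", "T"));
        System.out.println(mask[0] + " " + mask[1] + " " + mask[2]);  // prints "true false true"
    }
}
```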

Another approach would be to convert the filtered chunk back into something that looks like a gov2 file and then rerun the mg4j collection builder.

Mitigate performance impact of slf4j logging

Currently the single-query version of QueryEngine.Process() (starts at line 256 of QueryEngine.java) logs the fact that it is processing a query. This happens on line 257.

LOGGER.debug( "Processing query \"" + queries + "\", offset=" + offset + ", length="+ length );

This call to the slf4j logging will cause the parameter string to be serialized, which is likely to be expensive. See this note on slf4j logging performance for more information.

Note that using the nop logger (slf4j-nop-1.8.0-alpha1.jar) will not solve this performance problem because the parameter string will still be created.
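A standard slf4j mitigation is parameterized logging: with a "{}" template, the message string is only built after the level check passes, so a disabled or nop logger pays almost nothing. Sketched here with a tiny stand-in logger rather than the real slf4j API:

```java
// Sketch: a "{}" template defers string construction until the level check
// passes, unlike string concatenation, which always builds the message.
public class LazyLogging {
    static int messagesBuilt = 0;

    static String build(String template, Object arg) {
        messagesBuilt++;
        return template.replace("{}", String.valueOf(arg));
    }

    static void debug(boolean debugEnabled, String template, Object arg) {
        if (debugEnabled) {
            System.out.println(build(template, arg));  // built only when enabled
        }
    }

    public static void main(String[] args) {
        debug(false, "Processing query \"{}\"", "hurricane katrina");
        System.out.println(messagesBuilt);  // prints 0: nothing was serialized
    }
}
```

With the real slf4j API this corresponds to `LOGGER.debug("Processing query \"{}\", offset={}, length={}", queries, offset, length)`, which skips formatting entirely when debug is disabled.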

One possible solution is to use the query-array version of QueryEngine.Process() (starts on line 298). This version logs once per call, instead of once per query. The challenge with this version is that storing all of the matches for all of the queries might lead to GC-related performance problems.

Another possible solution is to incorporate the mg4j source code directly into our repository (instead of allowing maven to reference the .JAR files). With this strategy, we could comment out the logging.

Use a ConcatenatedDocumentCollection to combine chunk files

We should implement a class that creates a ConcatenatedDocumentCollection from a manifest file listing BitFunnel chunks. This will allow us to build the index in pieces. Originally, we had thought that we would have to build one collection for the entire index.

This new class will be passed to the IndexBuilder command line, the same way we pass ChunkDocumentSequence today.
