
BitFunnel

This repo contains the code for the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search, a paper presented at SIGIR 2017. This video gives a good overview of the algorithm.

The codebase here was published to allow the research community to replicate results from the SIGIR paper. The documentation is pretty thin, but we encourage you to start with the paper, the video above, and the build instructions below.


Dependencies

In order to build BitFunnel you will need CMake (2.8.11+) and a modern C++ compiler (gcc 5+, clang 3.5+, or VC 2015+). You can run CMake directly to generate the appropriate build setup for your platform. Alternatively, we provide scripts with the defaults we use.

*nix

For *nix platforms (including OS X),

./Configure_Make.sh
cd build-make
make
make test

Note that while these instructions are for a make build, it's also possible to build with ninja by changing the cmake command to generate ninja files instead of Makefiles. This isn't in the instructions because ninja requires installing an extra dependency for some developers, but if you want to use it, ninja is available via apt-get, brew, etc., and is substantially faster than make.

Ubuntu

If you're on Ubuntu 15+, you can install dependencies with:

sudo apt-get install clang cmake

On Ubuntu 14 and below, you'll need to install a newer version of CMake. To install a new-enough CMake, see this link. If you're using gcc, you'll also need to make sure you have gcc-5 (sudo apt-get install g++-5).

To override the default compiler, set the CXX and CC environment variables. For example, if you have clang-3.8 installed as clang-3.8 and are using bash:

export CXX="clang++-3.8"
export CC="clang-3.8"

OS X

Install Xcode and then run the following command to install required packages using Homebrew (http://brew.sh/):

brew install cmake

BitFunnel can be built on OS X using either standard *nix makefiles or Xcode. To generate and build with makefiles, follow the *nix instructions above.

If you want to create an Xcode project instead of using Makefiles, run:

./Configure_XCode.sh

If you use Xcode, you'll have to either re-run Configure_XCode.sh or run the ZERO_CHECK target whenever CMakeLists.txt changes, e.g., when source files are added or removed.

Windows

You will need Visual Studio (2015 or later) and CMake.

Note: If you install Visual Studio for the first time and select the default install options, you won't get a C++ compiler. To force the install of the C++ compiler, you need to either create a new C++ project or open an existing C++ project.

Clone the BitFunnel repository and then run the following command in BitFunnel's root folder:

.\Configure_MSVC.bat

Note: You will need to modify the CMake -G option if you use a different version of Visual Studio. BitFunnel must be built as a 64-bit program, so 'Win64' must be part of the -G option text.

At this point, you can open the generated solution BitFunnel_CMake.sln from Visual Studio and then build it. Alternatively, you can build from the command line using cmake --build build-MSVC.

Contributors

danluu, jondgoodwin, mikehopcroft
mg4j-workbench's Issues

Investigate whether mg4j has a mode that counts matches.

The act of returning a list of query results may cause GC-related performance problems. Might want to consider counting the matches instead of returning them.

One way to do this would be to modify QueryEngine.getResults() (starting at line 387 of QueryEngine.java) to count the results rather than return them.
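A minimal sketch of the counting idea, with the mg4j iterator types stubbed out by a plain Iterator (the class and method names here are hypothetical, not mg4j's):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch: instead of materializing a result list (one allocation per match,
// hence GC pressure), walk the match iterator and count.
public class CountResults {
    // Collecting version: allocates one list entry per match.
    static List<Integer> collect(Iterator<Integer> matches) {
        List<Integer> results = new ArrayList<>();
        while (matches.hasNext()) results.add(matches.next());
        return results;
    }

    // Counting version: no per-match allocation.
    static long count(Iterator<Integer> matches) {
        long n = 0;
        while (matches.hasNext()) { matches.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(3, 7, 42);
        System.out.println(count(docs.iterator()));  // prints 3
    }
}
```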

Description of mg4j index configuration

This issue tracks how we configure / intend to configure the mg4j index for maximum performance in the experiment.

  1. [Not implemented] Disable positions.
  2. Disable scoring.
  3. [Not verified]. Use BitStreamHPIndexReader. Actually, we probably want to use the subclass InMemoryHPIndex. Check to see if this is used by default. Right now it looks like the code uses QuasiSuccinctIndex.
  4. [Not verified]. Use in-memory index. See JavaDocs for Index.UriKeys.
  5. [Not verified]. Use wired index.
  6. No stemming.
  7. No stop word elimination.
  8. ??? Disable advanced queries (e.g. near, WAND, phrase).
  9. ??? Disable forward index storage for titles.
  10. ??? Disable forward index storage for BM25F scoring information.
  11. Exporter for Partitioned Elias-Fano index generates a frequency of 1 for every posting.

Conflicting timing measurements

In 525d6d6, QueryLogRunner.go() implements three different strategies for measuring the time spent processing queries:

  1. Take the difference between the time the last thread finished processing and the first thread started.
  2. Take the difference between the start time of the finishSynchronizer and that of the performanceSynchronizer.
  3. Take the difference between the time the threads joined and the time the threads were started.

In the results below, from processing 06.efficiency_topics.all, we see that method (3) measures 24.25s, almost twice as large as the other two methods. The difference stems from the fact that QueryProcessorThread.run() processes the entire query log twice in order to warm up the system; methods (1) and (2) only measure the time for the second run. We should clean up the time reporting.

Thread count: 8
Query count: 100000
Total time: 11.793248
Total time (synchronizer): 11.793267
Total time (simple): 24.249336
QPS: 8479.414311
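The warm-up-then-measure semantics of methods (1) and (2) can be sketched as follows; the per-query work here is a stand-in for QueryProcessorThread's processing, and the class is hypothetical:

```java
// Sketch: run the query log once untimed to warm up the system, then time
// only the second pass, matching methods (1) and (2) above.
public class TimedRun {
    static void processQueryLog(Runnable perQuery, int queries) {
        for (int i = 0; i < queries; i++) perQuery.run();
    }

    // Runs the log twice; returns elapsed nanoseconds for the second pass only.
    static long timeSecondPass(Runnable perQuery, int queries) {
        processQueryLog(perQuery, queries);   // warm-up pass (untimed)
        long start = System.nanoTime();
        processQueryLog(perQuery, queries);   // measured pass
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long ns = timeSecondPass(() -> { /* stand-in for query processing */ }, 100_000);
        System.out.printf("QPS: %f%n", 100_000 / (ns / 1e9));
    }
}
```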

Description of query log preprocessing

This issue documents our pre-processing of the query logs.

  1. Started with 05.efficiency_topics and 06.efficiency_topics.all from the 2005 and 2006 TREC Terabyte Tracks.

  2. Topic files contain one query per line where each query is preceded by a line number and a colon. The line numbers were removed.

  3. The mg4j and BitFunnel query languages do not allow certain punctuation characters. These were filtered out by replacing matches to the regular expression [-;:'\+] with the empty string. May want to consider replacing punctuation with a space, rather than the empty string.

  4. Query terms were not stemmed.
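Steps 2 and 3 above can be sketched as a single cleaning function. This variant replaces punctuation with a space (the alternative the issue suggests considering, rather than the empty string actually used); the class name is hypothetical:

```java
// Sketch of the query-log preprocessing: strip the "<number>:" prefix and
// replace the punctuation characters that mg4j/BitFunnel queries reject.
public class PreprocessQueryLog {
    static String clean(String line) {
        String noPrefix = line.replaceFirst("^\\d+:", "");
        return noPrefix.replaceAll("[-;:'+]", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("42:hurricane-katrina damage"));
        // prints "hurricane katrina damage"
    }
}
```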

TREC Terabyte topics contain characters that are illegal in mg4j queries.

The 2006 TREC Terabyte topics contain the following characters that are illegal in mg4j (and BitFunnel) queries: '-', ';', ':', ''', and '+'. Right now QueryLogRunner.LoadQueries() replaces each of these characters with the empty string:

line.replaceAll("[-;:'\\+]", "")

We should preprocess these input files to remove these characters, and then update LoadQueries() to remove the regex code.

Crash in ChunkDocument.tryParseStream while attempting to read filtered version of GX229

This crash happens in ChunkDocument.tryParseStream() when buffer[writeCursor++] = (byte)c; attempts to write past the end of buffer. The size of buffer was 256k, based on the assertion that gov2 documents are truncated at 256KB.

The document that causes the crash has length 357895. This document was encountered while processing GX229-1000-1500.chunk, which was a version of GX229.chunk that was filtered by BitFunnel to contain documents with unique posting counts from 1000 to 1500.

Some observations:

  • BitFunnel was able to read GX229.chunk in order to generate the filtered chunk GX229-1000-1500.chunk. This suggests that GX229.chunk, which was created by mg4j, was well-formed, even if it contained long documents.
  • In the chunk processing pipeline, mg4j generated GX229.chunk, but never attempted to read it.

My leading theory is that the original gov2 GX229 directory contains a bundle (.txt file) with a document that Tika represents as longer than 256KB.
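A hypothetical fix sketch for the fixed-size buffer: grow it on demand instead of assuming the 256KB truncation limit. tryParseStream itself is not reproduced here; this shows only a resizing read loop:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch: start with the 256KB buffer but double it whenever writeCursor
// reaches the end, so over-long documents no longer crash the parse.
public class GrowableRead {
    static byte[] readAll(InputStream in) throws IOException {
        byte[] buffer = new byte[256 * 1024];
        int writeCursor = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (writeCursor == buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2);  // grow on demand
            }
            buffer[writeCursor++] = (byte) c;
        }
        return Arrays.copyOf(buffer, writeCursor);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = new byte[357_895];  // length of the document from this crash report
        byte[] out = readAll(new ByteArrayInputStream(doc));
        System.out.println(out.length);  // prints 357895
    }
}
```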

Chunks generated from Gov2 seem to have lots of duplicate DocId values.

Chunks generated from Gov2 seem to have lots of duplicate DocId values. For instance, index-273-100-150 (all 273 gov2 directories, retaining only documents with 100-150 terms) contains 3,870,096 documents, but only 123,544 unique DocIds. When ingesting the documents in Gov2 directory order (GX000, GX001, GX002, ...), the first duplicate is DocId=24 in GX001.
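The duplicate check implied by this issue (finding the first DocId seen twice in ingestion order) can be sketched with a HashSet; the class is hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: scan DocIds in ingestion order and report the first repeated value.
public class DuplicateDocIds {
    static Long firstDuplicate(long[] docIds) {
        Set<Long> seen = new HashSet<>();
        for (long id : docIds) {
            if (!seen.add(id)) return id;  // add() is false if already present
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate(new long[] {10, 24, 31, 24}));  // prints 24
    }
}
```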

Index built directly from chunk differs from index built from collection.

Chunk file was produced from the first bundle of gov2:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.document.TRECDocumentCollection ^
     -f HtmlDocumentFactory -p encoding=iso-8859-1 ^
    d:\data\work\out2.collection d:\data\gov2\gx000\gx000\00.txt
java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     org.bitfunnel.reproducibility.GenerateBitFunnelChunks ^
     -S d:\data\work\out2.collection d:\data\work\out2.chunk

Index from collection was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
      --keep-batches --downcase -S d:\data\work\out2.collection d:\data\work\out2

Index from chunk was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
     -o org.bitfunnel.reproducibility.ChunkDocumentSequence(d:\data\work\out2.chunk) ^
     d:\data\work\out3

The process ran to completion, but the properties of the two indexes (out2 is from the collection and out3 is from the chunk) were different. Notably, the number of documents, the occurrences, and the maxdocsize are the same in both indexes, but terms, postings, and maxcount differ.

Index built from collection:

d:\data\work>type out2-text.properties
documents=1092
terms=21604
postings=158659
maxcount=652
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=text
size=4417176
maxdocsize=10622
occurrences=295570

Index built from chunk:

d:\data\work>type out3-text.properties
documents=1092
terms=292005
postings=295570
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=text
size=11473690
maxdocsize=10622
occurrences=295570

Index built from collection:

D:\data\work>more out2-title.properties
documents=1092
terms=1631
postings=5740
maxcount=4
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=title
size=96009
maxdocsize=27
occurrences=5904

Index built from chunk:

D:\data\work>more out3-title.properties
documents=1092
terms=4725
postings=5904
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=title
size=154267
maxdocsize=27
occurrences=5904

Collection-building pipeline should be based on gz files.

Our collection-building pipeline should be based on gz files instead of uncompressed .txt files. There are too many .txt files to pass as command-line arguments (Windows limit for command line is 8191 characters).

This mainly involves updates to shell scripts and the README.md to use the -z flag.

Index-build needs to filter out documents not in the BitFunnel shard.

The workflow will be as follows:

  1. Create a collection from some subset of gov2.
  2. Create a BitFunnel chunk from this collection.
  3. Filter the BitFunnel chunk to include only those documents slated for a particular shard.
  4. Somehow build an mg4j index that contains only those documents.

One option is to make the chunk filtering program emit a file that informs the index build of which documents to incorporate. This could be as simple as a file with one line per document, consisting of the character 'T' or 'F', indicating if the document should be included in the index. This approach would require a modification to the mg4j indexer.
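Reading the proposed per-document mask file could be sketched as follows: line i is 'T' or 'F' according to whether document i belongs in the shard. The file format is the simple one proposed above; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: parse a 'T'/'F' mask file (one line per document) into a boolean
// array the indexer could consult when deciding which documents to ingest.
public class ShardMask {
    static boolean[] parse(List<String> lines) {
        boolean[] include = new boolean[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            include[i] = lines.get(i).trim().equals("T");
        }
        return include;
    }

    public static void main(String[] args) {
        boolean[] mask = parse(Arrays.asList("T", "F", "T"));
        System.out.println(mask[0] + " " + mask[1] + " " + mask[2]);  // prints "true false true"
    }
}
```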

Another approach would be to convert the filtered chunk back into something that looks like a gov2 file and then rerun the mg4j collection builder.

Mitigate performance impact of slf4j logging

Currently the single-query version of QueryEngine.Process() (starts at line 256 of QueryEngine.java) logs the fact that it is processing a query. This happens on line 257.

LOGGER.debug( "Processing query \"" + queries + "\", offset=" + offset + ", length="+ length );

This call to the slf4j logging will cause the parameter string to be serialized, which is likely to be expensive. See this note on slf4j logging performance for more information.

Note that using the nop logger (slf4j-nop-1.8.0-alpha1.jar) will not solve this performance problem because the parameter string will still be created.
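A standard slf4j mitigation is parameterized logging: with a "{}" template, the message string is only built after the level check passes, so a disabled or nop logger pays almost nothing. Sketched here with a tiny stand-in logger rather than the real slf4j API:

```java
// Sketch: a "{}" template defers string construction until the level check
// passes, unlike string concatenation, which always builds the message.
public class LazyLogging {
    static int messagesBuilt = 0;

    static String build(String template, Object arg) {
        messagesBuilt++;
        return template.replace("{}", String.valueOf(arg));
    }

    static void debug(boolean debugEnabled, String template, Object arg) {
        if (debugEnabled) {
            System.out.println(build(template, arg));  // built only when enabled
        }
    }

    public static void main(String[] args) {
        debug(false, "Processing query \"{}\"", "hurricane katrina");
        System.out.println(messagesBuilt);  // prints 0: nothing was serialized
    }
}
```

With the real slf4j API this corresponds to `LOGGER.debug("Processing query \"{}\", offset={}, length={}", queries, offset, length)`, which skips formatting entirely when debug is disabled.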

One possible solution is to use the query-array version of QueryEngine.Process() (starts on line 298). This version logs once per call, instead of once per query. The challenge with this version is that storing all of the matches for all of the queries might lead to GC-related performance problems.

Another possible solution is to incorporate the mg4j source code directly into our repository (instead of allowing maven to reference the .JAR files). With this strategy, we could comment out the logging.

Use a ConcatenatedDocumentCollection to combine chunk files

We should implement a class that creates a ConcatenatedDocumentCollection from a manifest file listing BitFunnel chunks. This will allow us to build the index in pieces. Originally, we had thought that we would have to build one collection for the entire index.

This new class will be passed to the IndexBuilder command line, the same way we pass ChunkDocumentSequence today.
