Comments (9)
For a baseline... according to Anandtech, they get 30 QPS from Elasticsearch when they put Wikipedia on a Broadwell Xeon D-1540. They say that Wikipedia is "+/- 40GB", which implies they might be using English-language Wikipedia, articles only. That's currently 49GB, but maybe it was closer to 40GB when they benchmarked it? We should ask them exactly what configuration they ran when we do our benchmarks, if we end up having access to similar hardware.
from bitfunnel.
This ASPLOS '15 paper on something unrelated happens to use Lucene as one of its targets. They run 10k queries on "Wikipedia", and their latency (fig 2b) implies a single-digit QPS number per core. On "a server with two 8-core Intel 64-bit Xeon processors (2.30 GHz)... 64GB RAM", they show that it's possible to get to 40-ish QPS before tail latency spikes above 1 second (fig 4). See section 6.1 for details on methodology and hardware.
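To spell out the implied-throughput arithmetic: if each query occupies a core for its full latency, per-core QPS is just the reciprocal. The latency value below is hypothetical, merely in the range fig 2b suggests:

```python
# If a single query occupies one core for ~0.15 s (hypothetical value in
# the range fig 2b suggests), the implied per-core throughput is 1/latency.
latency_s = 0.15
qps_per_core = 1.0 / latency_s  # ~6.7 QPS: single digits, as claimed
```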
I tried asking this question publicly, and basically everyone who replied thought it should be possible to do better. However, there were very few concrete suggestions; the ones offered were:
- "possible by reducing term dictionary size. I bet the term dict for Wikipedia is large"
- "you need to reserve a good bit of the systems ram for FS caching (i.e. Don't give it all to the JVM)"
We can try those, although it would be a bit surprising if every public benchmark we can find has a poor setup. If that's the case, it indicates that there are non-obvious default settings that need changing, and that we should make sure our own defaults don't lead people into the same trap.
Someone who works at Elasticsearch claims that it's because the benchmarks are benchmarking worst-case queries with no stopwords, and that the results are per-thread. I can't see any evidence for either claim. I don't know how that guy could have determined that these are worst-case queries, since neither benchmark I linked to talks about query distribution.
The ASPLOS paper specifically notes that tail latency is worse with 4 threads than with 1 above 42 QPS and that, in general, tail latency degrades more with more threads. The y-axis doesn't go high enough to tell exactly, but the curve exits the top of the graph (1.5s) with a slope that indicates tail latency could easily hit 10s+ in the high-40 QPS range. Note that the units here are seconds, not milliseconds. In the results section, the ASPLOS paper also notes that it mixes "long" and "short" queries, where long and short refer to the time the query takes to complete, so it almost can't be the case that it's all worst-case queries (and we don't know that the long queries are worst-case queries). In section 6, the ASPLOS paper notes that they pull queries from the nightly tests and use "the term requests". Additionally, running with no stop words is what we do, and AFAIK what every major search engine has done for years.
That guy basically concludes with "The source code is open...", which seems to match what you got in your other interaction with the Elasticsearch folks.
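For reference, "tail latency" above means the high percentiles of the per-query latency distribution. A minimal sketch of how such percentiles are computed, with made-up sample latencies (not data from either benchmark):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the value at the pct-th percentile."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up per-query latencies in seconds; note the long tail.
latencies_s = [0.02, 0.03, 0.03, 0.05, 0.08, 0.12, 0.4, 0.9, 1.6, 5.0]
p50 = percentile(latencies_s, 50)  # typical query: 0.08 s
p99 = percentile(latencies_s, 99)  # tail: 5.0 s, ~60x the median
```

The point of the graphs in the paper is that p99-style numbers blow up long before the mean does.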
On the nightly regression tests mentioned above, they appear to get... 30-40 QPS on Wikipedia: http://home.apache.org/~mikemccand/lucenebench/Term.html. They note that they take best-case results ("Each of the 5 instances are run 50 times per JVM instance; we keep the best (fastest) time per task/query instance").
This code appears to be the code that's used to run their benchmarks. It looks like runNightly.cmd launches nightlyBench.py. There's almost no documentation in the actual code, but if I've skimmed it correctly, that launches r.runSimpleSearchBench in benchUtil.py. In benchUtil.py, we see
```python
# Skip this pctg of the slowest runs:
SLOW_SKIP_PCT = 10
```
If you believe the text from the main page, this means they have two separate filters that cut off the tail: they drop anything above the 90th percentile and then, after doing that, only look at the best of 50 runs?
But wait, this code also has
```python
# SELECT = 'min'
# SELECT = 'mean'
SELECT = 'median'
```
That seems to indicate that the text on the main page is outdated, and that they take the median instead of the min?
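If I'm reading the two filters right, the aggregation would look roughly like this (a sketch of my reading, not the actual benchUtil.py code):

```python
SLOW_SKIP_PCT = 10   # drop this percentage of the slowest runs first
SELECT = 'median'    # then reduce the survivors with min/mean/median

def aggregate(times, select=SELECT):
    """First filter: discard the slowest SLOW_SKIP_PCT of runs.
    Second filter: collapse what's left according to SELECT."""
    kept = sorted(times)[:len(times) - int(len(times) * SLOW_SKIP_PCT / 100)]
    if select == 'min':
        return kept[0]
    if select == 'mean':
        return sum(kept) / len(kept)
    return kept[(len(kept) - 1) // 2]  # median (lower middle for even counts)

runs = list(range(1, 11))       # 10 runs with times 1..10
aggregate(runs)                 # median of 1..9 after the 10 is dropped
aggregate(runs, select='min')   # the "best of N" the main page describes
```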
Also, the README for that repo says "In the second step, the setup procedure creates all necessary directories in the clones parent directory and downloads a 6 GB compressed Wikipedia line doc file from an Apache mirror.", which seems to indicate they're not running against all of Wikipedia?
For an ingestion baseline, this post talks a bit about Lucene's benchmark setup and mentions that, with an OCZ Vertex 3 SSD in a 2-socket Xeon X5680 machine overclocked to 4GHz, Lucene ingests "~102GB plain text per hour".
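As a sanity check on what that rate would imply for the ~49GB English Wikipedia dump mentioned earlier (just arithmetic, not a measurement):

```python
# At the quoted ~102 GB/hour, a ~49 GB dump would ingest in under half an hour.
ingest_rate_gb_per_hr = 102.0
wikipedia_gb = 49.0
minutes = wikipedia_gb / ingest_rate_gb_per_hr * 60  # ~28.8 minutes
```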
The benchmark takes a while to run. Even extracting Wikipedia takes over 40 minutes on the Mac we've been using:
```
expand-enwiki:
    [bunzip2] Expanding enwiki-20070527-pages-articles.xml.bz2 to /Users/visualstudio/dev/lucene-solr/lucene/benchmark/temp/enwiki-20070527-pages-articles.xml

BUILD SUCCESSFUL
Total time: 42 minutes 34 seconds
```
Another open source project we could use for a baseline is OpenAcoon / DeuSu. They claim:
> The above website runs on an Intel E3-1225 with 32gb RAM and two 500gb SSDs. The search-index on that site currently holds about 1.08 billion WWW-pages. On average a query takes about 0.2 seconds.
>
> ...
>
> The software was originally written in Delphi (=Pascal).
>
> ...
>
> Sorry for the quality of most of the code. Big parts of it were written 15 years ago when I was still young and stupid. :)
Here's an old blog post where someone at Elastic ran Term queries against all of Wikipedia, which I believe are basically what we support:
It looks like they get 8k QPS for single term queries, with increasing speed as they AND in more terms and decreasing speed when they OR in more terms, as expected.
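That scaling is what you'd expect from posting-list evaluation: ANDing in a term can only shrink the candidate set (and lets evaluation be driven off the shortest list), while ORing in a term adds another whole list that must be visited. A toy sketch with made-up postings (not Lucene's implementation):

```python
# Made-up sorted posting lists: term -> IDs of docs containing the term.
postings = {
    'cat': [1, 4, 7, 9],
    'dog': [2, 4, 9, 12],
    'fish': [4, 9, 15],
}

def and_query(terms):
    """Intersection: each extra term can only shrink the result."""
    result = set(postings[terms[0]])
    for term in terms[1:]:
        result &= set(postings[term])
    return sorted(result)

def or_query(terms):
    """Union: each extra term adds a whole posting list to visit."""
    result = set()
    for term in terms:
        result |= set(postings[term])
    return sorted(result)

and_query(['cat', 'dog', 'fish'])  # only docs 4 and 9 match all terms
or_query(['cat', 'dog', 'fish'])   # every doc any term touches
```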
We now have a setup that lets us compare against Lucene, here. Our Lucene results are in the same ballpark as the results from the Elastic post cited above, which puts us multiple orders of magnitude faster than most public benchmarks that have been cited, like Anandtech and the ASPLOS '15 paper linked to above.
I'm sure the setup could use a lot of work, but it appears to have results that are at least as fast as the fastest public results we can find on a machine that's no faster than any of the machines cited above.