
Comments (9)

danluu commented on July 2, 2024

For a baseline... according to AnandTech, they get 30 QPS from Elasticsearch when they put Wikipedia on a Broadwell Xeon D-1540. They say that Wikipedia is "+/- 40GB", which implies they might be using English-language Wikipedia, articles only. That's currently 49GB, but maybe it was closer to 40GB when they benchmarked it? We should ask them exactly what configuration they ran when we do our benchmarks, if we end up having access to similar hardware.


danluu commented on July 2, 2024

This ASPLOS '15 paper on something unrelated happens to use Lucene as one of its targets. They run 10k queries on "Wikipedia" and their latency (fig 2b) implies a single-digit QPS number per core. On "a server with two 8-core Intel 64-bit Xeon processors (2.30 GHz)... 64GB RAM", they show that it's possible to get to 40-ish QPS before tail latency spikes above 1 second (fig 4). See section 6.1 for details on methodology and hardware.
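
As a back-of-the-envelope check on that implication (my arithmetic, with a hypothetical latency number, not figures from the paper):

# Back-of-the-envelope check; the latency here is hypothetical, not a
# number from the paper. A closed loop issuing one query at a time per
# core serves 1/latency QPS on that core.
per_query_latency_s = 0.25              # assume ~250ms per query
cores = 16                              # 2 sockets x 8 cores
qps_per_core = 1 / per_query_latency_s  # 4.0 -> "single digit QPS per core"
ideal_aggregate = qps_per_core * cores  # 64.0 with perfect scaling; contention
                                        # and tail effects would drag a real
                                        # system toward the ~40 QPS in fig 4
print(qps_per_core, ideal_aggregate)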


danluu commented on July 2, 2024

I tried asking this question publicly, and basically everyone who replied thought it should be possible to do better. However, there were very few concrete suggestions; the only ones were:

  1. "possible by reducing term dictionary size. I bet the term dict for Wikipedia is large"
  2. "you need to reserve a good bit of the systems ram for FS caching (i.e. Don't give it all to the JVM)"

We can try those, although it would be a bit surprising if every public benchmark we can find has a poor setup. If so, that would indicate that there are non-obvious default settings that need changing, and that we should make sure our own defaults don't lead people into the same trap they fall into with Lucene.


danluu commented on July 2, 2024

Someone who works at Elasticsearch claims that it's because the benchmarks use worst-case queries with no stopwords, and that the results are per-thread. I can't see any evidence for either claim. I don't know how that guy could have determined that these are worst-case queries, since neither benchmark I linked to says anything about query distribution.

The ASPLOS paper specifically notes that tail latency is worse with 4 threads than with 1 above 42 QPS and that, in general, tail latency degrades more with more threads. The y-axis doesn't go high enough to tell exactly, but the curve climbs past the top of the graph (1.5s) with a slope suggesting that tail latency could easily hit 10s+ in the high-40s QPS range. Please note that the units here are seconds, not milliseconds. In the results section, the paper also notes that it mixes "long" and "short" queries, where long and short refer to the time a query takes to complete, so it almost can't be the case that these are all worst-case queries (and we don't know that the long queries are worst-case queries). In section 6, the paper notes that they pull queries from the nightly tests and use "the term requests". Additionally, running with no stopwords is what we do, and AFAIK what every major search engine has done for years.

That guy basically concludes with "The source code is open...", which seems to match what you got in your other interaction with the Elasticsearch folks.


danluu commented on July 2, 2024

On the nightly regression tests mentioned above, they appear to get 30-40 QPS on Wikipedia: http://home.apache.org/~mikemccand/lucenebench/Term.html. They note that they take best-case results ("Each of the 5 instances are run 50 times per JVM instance; we keep the best (fastest) time per task/query instance").

This code appears to be what's used to run their benchmarks. It looks like runNightly.cmd launches nightlyBench.py.

There's almost no documentation in the code itself, but if I've skimmed it correctly, nightlyBench.py then calls r.runSimpleSearchBench in benchUtil.py.

In benchUtil.py, we see

# Skip this pctg of the slowest runs:
SLOW_SKIP_PCT = 10

If you believe the text from the main page, this means they have two separate filters on the tail: they drop anything slower than the 90th percentile, and then, after doing that, only keep the best of the 50 runs?

But wait, this code also has

# SELECT = 'min'
# SELECT = 'mean'
SELECT = 'median'

That seems to indicate that the text on the main page is outdated, and that they take the median rather than the min?
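
Putting SLOW_SKIP_PCT and SELECT together, here's a minimal sketch of what that selection pipeline might amount to (my reconstruction for illustration; the function name and example numbers are made up, and this is not the actual benchUtil.py code):

# My reconstruction for illustration, not the actual benchUtil.py code:
# drop the slowest runs first, then aggregate the rest with SELECT.
import statistics

SLOW_SKIP_PCT = 10   # skip this pctg of the slowest runs
SELECT = 'median'    # or 'min' / 'mean', per the commented-out options

def reported_time(times_ms):
    """Collapse many runs of one task/query into the single reported number."""
    ordered = sorted(times_ms)
    keep = len(ordered) * (100 - SLOW_SKIP_PCT) // 100
    trimmed = ordered[:keep]           # drop the slowest 10%
    if SELECT == 'min':
        return trimmed[0]              # best-of-N, per the main page text
    if SELECT == 'mean':
        return statistics.mean(trimmed)
    return statistics.median(trimmed)  # what the code currently selects

print(reported_time([51, 50, 49, 55, 300]))  # 50.5; the 300ms outlier is gone

Either way, a slow outlier never makes it into the reported number, which is worth keeping in mind when comparing against benchmarks that report tail latency.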

Also, the README for that repo says "In the second step, the setup procedure creates all necessary directories in the clones parent directory and downloads a 6 GB compressed Wikipedia line doc file from an Apache mirror.", which seems to indicate they're not running against all of Wikipedia?


danluu commented on July 2, 2024

For an ingestion baseline, this post talks a bit about Lucene's benchmark setup and mentions that, with an OCZ Vertex 3 on a 2-socket Xeon X5680 system overclocked to 4GHz, Lucene ingests "~102GB plain text per hour" (roughly 28MB/s).

The benchmark takes a while to run; just extracting Wikipedia takes over 40 minutes on the Mac we've been using:

expand-enwiki:
  [bunzip2] Expanding enwiki-20070527-pages-articles.xml.bz2 to /Users/visualstudio/dev/lucene-solr/lucene/benchmark/temp/enwiki-20070527-pages-articles.xml
BUILD SUCCESSFUL
Total time: 42 minutes 34 seconds


danluu commented on July 2, 2024

Another open source project we could use for a baseline is OpenAcoon / DeuSu. They claim:

The above website runs on an Intel E3-1225 with 32gb RAM and two 500gb SSDs. The search-index on that site currently holds about 1.08 billion WWW-pages. On average a query takes about 0.2 seconds.
...
The software was originally written in Delphi (=Pascal).
...
Sorry for the quality of most of the code. Big parts of it were written 15 years ago when I was still young and stupid. :)


danluu commented on July 2, 2024

Here's an old blog post where someone at Elastic ran Term queries against all of Wikipedia, which I believe are basically what we support:

[figure: query throughput by query type, from the Elastic blog post]

It looks like they get 8k QPS for single-term queries, with increasing speed as they AND in more terms and decreasing speed as they OR in more terms, as expected.
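
That shape is what you'd expect if you think of term queries as posting-list operations. Here's a toy model (illustrative only, not Lucene's actual implementation):

# Toy posting-list model of why AND gets faster and OR gets slower as
# terms are added (illustrative; not Lucene's actual implementation).
index = {
    'cat':    {1, 4, 7, 9},   # hypothetical doc-id sets per term
    'dog':    {4, 9, 12},
    'ferret': {9},
}

def and_query(terms):
    # Intersection: each extra term can only shrink the candidate set,
    # and the scan can be driven from the rarest term's short list.
    result = min((index[t] for t in terms), key=len).copy()
    for t in terms:
        result &= index[t]
    return result

def or_query(terms):
    # Union: each extra term adds postings, so there are strictly more
    # documents to visit and score.
    result = set()
    for t in terms:
        result |= index[t]
    return result

print(and_query(['cat', 'dog', 'ferret']))  # {9}
print(or_query(['cat', 'dog', 'ferret']))   # {1, 4, 7, 9, 12}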


danluu commented on July 2, 2024

We now have a setup that lets us compare against Lucene, here. Our Lucene results are in the same ballpark as the results from the Elastic post cited above, which puts us multiple orders of magnitude faster than most of the public benchmarks cited here, like the AnandTech article and the ASPLOS '15 paper linked above.

I'm sure the setup could use a lot of work, but it already produces results at least as fast as the fastest public results we can find, on a machine that's no faster than any of the machines cited above.

