Comments (3)

fulmicoton commented on July 28, 2024

I don't have time to debug this thing. One thing you can do is pick one specific example where tantivy is outperformed by BM25, and use "explain".

The usual suspects are

  • your evaluation code
  • tokenization
  • the way queries are sanitized
  • BM25 constants

Tantivy has an explain function telling precisely how tantivy came up with a given score.
(The formula should be the same as Lucene's.)

https://docs.rs/tantivy/latest/tantivy/query/trait.Query.html#method.explain
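
For readers checking the "BM25 constants" suspect by hand, here is a minimal sketch of the textbook BM25 score for one term in one document, split into the kind of idf and term-frequency parts that an explain output typically reports. This is an illustration under assumptions, not Tantivy's or Lucene's actual code: the function name `bm25_term_score` is made up, the defaults `k1=1.2` and `b=0.75` are the commonly used values, and engines differ in details such as whether the `(k1 + 1)` factor is kept and how document lengths are stored.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 for a single term in a single document.

    Returns (idf, tf_norm, score) so each component can be compared
    against what a search engine reports for the same term and document.
    """
    # Inverse document frequency, in the always-positive "1 +" form.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b controls how strongly long documents are penalized.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Saturating term-frequency component: k1 controls how fast it saturates.
    tf_norm = term_freq * (k1 + 1.0) / (term_freq + length_norm)
    return idf, tf_norm, idf * tf_norm


if __name__ == "__main__":
    # Hypothetical numbers, only to show how the pieces move.
    for doc_len in (50, 100, 400):
        idf, tf_norm, score = bm25_term_score(
            term_freq=3, doc_len=doc_len, avg_doc_len=100,
            num_docs=5000, doc_freq=40)
        print(doc_len, round(idf, 4), round(tf_norm, 4), round(score, 4))
```

If two systems disagree on a score, comparing these components for the same term and document usually shows whether the gap comes from the constants, from the document length (tokenization), or from the document frequency (indexing).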

from tantivy.

triandco commented on July 28, 2024

Thank you @fulmicoton, I've gone through each of the usual suspects and verified them. I also ran the task against Lucene, which yielded scores much closer to Tantivy's. Looks like this is working as intended.

| Dataset | Tantivy nDCG@10 | Apache Lucene nDCG@10 | BEIR BM25 Flat nDCG@10 |
|---|---|---|---|
| Scifact | 0.6251573122952132 | 0.632431156289918 | 0.679 |
| NFCorpus | 0.20505084876906404 | 0.20712280950112716 | 0.322 |
| TREC-COVID | 0.0362915780899568 | 0.035369826134136535 | 0.595 |
| NQ | 0.2637953053727399 | 0.2803606345656689 | 0.306 |
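
Since "your evaluation code" is the first suspect on the list, a small reference implementation of the metric can serve as a sanity check. The sketch below computes nDCG@k with linear gain and a log2 rank discount. The function names and the toy data are hypothetical, and other implementations may differ (for example by using an exponential gain of 2^rel - 1), so treat it as a cross-check rather than a reproduction of any benchmark's scorer.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    # enumerate() is 0-based, so the 1-based rank r gets discount log2(r + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order the engine returned them.
    relevance: dict mapping doc id -> graded relevance judgment (0 if absent).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # Ideal ranking: all judged documents sorted by decreasing relevance.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    # Toy judgments and ranking, for illustration only.
    qrels = {"d1": 1, "d7": 1}
    ranking = ["d3", "d1", "d7", "d9"]
    print(ndcg_at_k(ranking, qrels, k=10))
```

Averaging this value over all queries of a dataset gives the per-dataset figure. A common source of mismatch is how queries with no retrieved or no judged documents are counted in that average.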

As for why BEIR got such high scores: their "BM25" retrieval task is just a wrapper around Elasticsearch. I'm evaluating Elasticsearch now and will update the results soon.

triandco commented on July 28, 2024

Updated the results with an Elasticsearch evaluation and increased the retrieval task complexity from single-field to multi-field. The current results look reasonable, as Elasticsearch by default does a bit more than plain BM25. I'll contact BEIR about the specifics of their test, since their results look a bit too pretty. I'll close this issue. Thank you @fulmicoton!

| Dataset | Tantivy | Apache Lucene | BEIR BM25 Flat | Elasticsearch |
|---|---|---|---|---|
| Scifact | 0.6110550406527024 | 0.6105774540257333 | 0.679 | 0.6563018879997284 |
| NFCorpus | 0.20174488628325865 | 0.2021653197430468 | 0.322 | 0.2116375800036891 |
| TREC-COVID | 0.03640657024103224 | 0.03705072222267741 | 0.595 | 0.05433894833185797 |
| NQ | 0.30181710921729077 | 0.301753090384626 | 0.306 | 0.310128528137924 |
