
Computing BM25 Similarity for 30 Queries and 85000 Documents

codingnoobneedshelp commented on June 28, 2024

Computing BM25 Similarity for 30 Queries and 85000 Documents

from text-analytics-with-python.

Comments (7)

dipanjanS commented on June 28, 2024

What are you trying to do with the 85K documents? You haven't said much about the problem you are trying to solve.

The similarity chapter is about showing how the algorithms are actually implemented, along with the math behind them. If you want to scale this out for similarity, search, and information retrieval, consider a more scalable solution such as Elasticsearch, which uses BM25 in the backend, rather than writing it in raw Python.


codingnoobneedshelp commented on June 28, 2024

Yes, this work belongs to the information retrieval chapter. I want to do learning to rank, but first I need to calculate the features, and BM25 is one of them. So I just run each query against the document corpus to get the scores. It should still be doable to change the function somehow so that I don't get the error, right?


dipanjanS commented on June 28, 2024

The BM25 code is just a mathematical function that has been converted into Python code based on the formula. It could be that the NumPy feature matrices do not fit in your system's RAM. It's still not clear which line of the code is throwing the memory error, though.

In general, the line corpus_features = corpus_features.toarray() can be moved outside the function to keep it from eating up RAM on every query (that is, generate the dense matrix just once instead of regenerating it inside the function each time you make a query).

But if you want to solve this problem in the real world, consider using Elasticsearch for ranking/querying 85K documents. It is more efficient and is the right way to retrieve and rank similar documents (and you can even tune the algorithm in the backend through constructs in the DSL queries).
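
A minimal sketch of the "do the corpus-side work once, outside the query function" idea. This is not the book's code (the BM25 function itself is not shown in this thread), and it swaps in a variation: instead of densifying once with toarray(), the matrix stays in SciPy's sparse format throughout. The function names, the k1 and b defaults, and the IDF variant (the non-negative "log(1 + ...)" form) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse


def precompute_bm25_weights(tf, k1=1.5, b=0.75):
    """Convert a (n_docs x n_terms) term-count matrix into sparse BM25 weights.

    Call this once, outside the per-query code. Hypothetical helper, not the
    book's implementation.
    """
    tf = sparse.csr_matrix(tf, dtype=np.float64, copy=True)
    tf.eliminate_zeros()
    n_docs, n_terms = tf.shape

    doc_len = np.asarray(tf.sum(axis=1)).ravel()
    avg_len = doc_len.mean()

    # Document frequency: how many documents contain each term.
    df = np.bincount(tf.indices, minlength=n_terms)
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

    # Row index of every stored entry, so only nonzeros are ever touched.
    rows = np.repeat(np.arange(n_docs), np.diff(tf.indptr))
    norm = k1 * (1.0 - b + b * doc_len[rows] / avg_len)

    weights = tf.copy()
    weights.data = idf[tf.indices] * tf.data * (k1 + 1.0) / (tf.data + norm)
    return weights


def bm25_scores(weights, query_term_ids):
    """Score all documents for one query, given as term column indices.

    Repeated query terms are counted once in this sketch.
    """
    q = np.zeros(weights.shape[1])
    q[list(query_term_ids)] = 1.0
    return weights @ q
```

With this split, precompute_bm25_weights(corpus_features) runs once, and each of the 30 queries only costs one sparse matrix-vector product. Memory use grows with the number of nonzero entries rather than with n_docs x n_terms.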


codingnoobneedshelp commented on June 28, 2024

Thanks for your help. Yes, the error comes from this line: corpus_features = corpus_features.toarray()

Could you maybe post how the code would look with the changes you suggest?

Big thanks


dipanjanS commented on June 28, 2024

I'm actually working on the second revision of this book for Python 3.x, so I'm a bit busy restructuring and reworking the code for the different chapters; some things will change, and all the code will be ported over to Python 3.

You just need to put that line of code in your main code file/segment rather than calling it repeatedly inside the function where BM25 is defined. Then, assuming you have enough RAM, it should work.

But like I said, prefer Elasticsearch for this kind of problem.


codingnoobneedshelp commented on June 28, 2024

ok...

If I put the code outside and just execute it, I still get the MemoryError, and I have 32 GB of RAM.

Is Elasticsearch easy to use?
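
A rough back-of-the-envelope check of why a single toarray() call can fail even with 32 GB. The vocabulary size is not given anywhere in this thread, so the values below are hypothetical, and 8 bytes per value assumes a 64-bit integer or float matrix.

```python
def dense_matrix_gib(n_docs, n_terms, bytes_per_value=8):
    """Size in GiB of a dense n_docs x n_terms array."""
    return n_docs * n_terms * bytes_per_value / 2**30


n_docs = 85_000
for n_terms in (10_000, 50_000, 100_000):  # hypothetical vocabulary sizes
    size = dense_matrix_gib(n_docs, n_terms)
    print(f"{n_terms:>7} terms -> {size:6.1f} GiB")
```

Under these assumptions the dense matrix needs roughly 6, 32, and 63 GiB respectively, so any vocabulary beyond about 50,000 terms would not fit in 32 GB no matter where the toarray() call is placed.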


dipanjanS commented on June 28, 2024

It is better to build an index over the 85K documents instead of repeatedly building a matrix in Python for all the queries.

Elasticsearch is very easy to learn and use: https://www.elastic.co/products/elasticsearch

There is also a Python client you can use on top of Elasticsearch: https://elasticsearch-py.readthedocs.io/en/master/
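
To illustrate what "building an index" means here, the following is a small in-memory inverted index with BM25 scoring in plain Python. It is only a sketch of the general technique, not how Elasticsearch is implemented; the class name, the k1 and b defaults, and the IDF variant are assumptions, and documents are assumed to be tokenized already.

```python
import math
from collections import Counter, defaultdict


class BM25Index:
    """Tiny in-memory inverted index with BM25 scoring (illustrative only)."""

    def __init__(self, tokenized_docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = [len(doc) for doc in tokenized_docs]
        self.n_docs = len(tokenized_docs)
        self.avg_len = sum(self.doc_len) / max(self.n_docs, 1)
        # term -> list of (doc_id, term_frequency); built once for all queries
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(tokenized_docs):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))

    def idf(self, term):
        df = len(self.postings.get(term, ()))
        return math.log(1.0 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query_tokens, top_k=10):
        """Return up to top_k (doc_id, score) pairs, best first."""
        scores = defaultdict(float)
        for term in set(query_tokens):
            idf = self.idf(term)
            # Only documents that contain the term are visited.
            for doc_id, tf in self.postings.get(term, ()):
                length_ratio = self.doc_len[doc_id] / self.avg_len
                norm = self.k1 * (1 - self.b + self.b * length_ratio)
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]
```

Each query touches only the postings lists of its own terms, so no documents-by-vocabulary matrix is ever materialized. For learning-to-rank features, any document missing from the results can be treated as having a BM25 score of 0 for that query.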

