
Comments (7)

dipanjanS commented on June 28, 2024

What are you trying to do for 85K documents? You haven't mentioned much about the problem you are trying to solve.

The similarity chapter is all about showing how the algorithms are actually implemented, along with the math behind them. If you want to scale this out for similarity, search, and information retrieval, consider a more scalable solution like Elasticsearch, which uses BM25 in the backend, rather than writing it in raw Python.

from text-analytics-with-python.

codingnoobneedshelp commented on June 28, 2024

Yes, this work belongs to the information retrieval chapter. I want to do learning to rank, but first I need to calculate the features, and BM25 is one of them. So I just run each query against the document corpus to get the scores. It should still be possible to change the function somehow so that I don't get the error, right?


dipanjanS commented on June 28, 2024

The BM25 code is just a mathematical function that has been converted into Python code based on the formula. It could be that the NumPy feature matrices do not fit in the RAM of your system. It's still not very clear which line of the code is throwing the memory error, though.

In general, the line corpus_features = corpus_features.toarray() can be moved outside the function so that it doesn't eat up RAM on every query. In other words, generate the dense matrix just once instead of regenerating it inside the function each time you make a query.

But if you want to solve this problem in the real world, for ranking/querying 85K documents consider using Elasticsearch, which is more efficient and the right way to retrieve and rank similar documents. You can even tune the algorithm in the backend through constructs in the DSL queries.


codingnoobneedshelp commented on June 28, 2024

Thanks for your help. Yes, the error is caused by this line: corpus_features = corpus_features.toarray()

Could you maybe post how the code would look with the changes you suggest?

Big thanks


dipanjanS commented on June 28, 2024

I'm actually working on the 2nd revision of this book for Python 3.x, so I'm a bit busy restructuring the code for the different chapters. Some things will change, and all the code will be ported over to Python 3.

You just need to put that line of code in your main code file/segment and not call it repeatedly in the function where BM25 is defined. Then, assuming you have enough RAM, it should work.
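A minimal sketch of that change, assuming the corpus features are a SciPy sparse matrix of term counts. The function name bm25_scores, its parameters, the toy data, and the IDF variant are illustrative assumptions, not the book's exact code. The only point is that .toarray() runs once, before the query loop.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bm25_scores(query_vec, corpus_dense, doc_lengths, idfs, k1=1.5, b=0.75):
    """Score every document against one query (hypothetical helper).

    corpus_dense is passed in already dense, so the sparse-to-dense
    conversion is not repeated on every call.
    """
    cols = np.flatnonzero(query_vec)       # term ids that occur in the query
    tf = corpus_dense[:, cols]             # shape (n_docs, n_query_terms)
    # Document-length normalisation term of BM25, one value per document.
    norm = k1 * (1 - b + b * doc_lengths / doc_lengths.mean())
    return (idfs[cols] * tf * (k1 + 1) / (tf + norm[:, None])).sum(axis=1)

# Toy corpus: 3 documents x 4 terms, stored sparse as a vectorizer would return it.
corpus_sparse = csr_matrix(np.array([[2, 0, 1, 0],
                                     [0, 1, 0, 0],
                                     [1, 1, 0, 3]], dtype=float))
doc_lengths = np.asarray(corpus_sparse.sum(axis=1)).ravel()
n_docs = corpus_sparse.shape[0]
doc_freq = np.asarray((corpus_sparse > 0).sum(axis=0)).ravel()
# One common smoothed IDF variant (an assumption, the book may use another).
idfs = np.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

corpus_dense = corpus_sparse.toarray()     # done ONCE, outside the function

queries = [np.array([1, 0, 0, 0]), np.array([0, 0, 0, 1])]
all_scores = [bm25_scores(q, corpus_dense, doc_lengths, idfs) for q in queries]
```

This only helps when the dense matrix fits in memory in the first place; hoisting the conversion avoids repeating it, but does not shrink it.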

But like I said, prefer using Elasticsearch for these kinds of problems.


codingnoobneedshelp commented on June 28, 2024

ok...

If I put the code outside and just execute it, I still get the MemoryError, and I have 32 GB of RAM.
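A rough back-of-the-envelope check suggests why 32 GB may not be enough. The vocabulary sizes below are assumptions, since the thread never states the number of terms. The estimate also assumes 8-byte cells (float64 or int64), which a dense array stores for every entry, zeros included.

```python
def dense_size_gib(n_docs, n_terms, bytes_per_cell=8):
    """Memory needed for a dense n_docs x n_terms array, in GiB."""
    return n_docs * n_terms * bytes_per_cell / 2**30

n_docs = 85_000                               # corpus size from the thread
for n_terms in (10_000, 50_000, 100_000):     # assumed vocabulary sizes
    print(f"{n_terms:>7} terms -> {dense_size_gib(n_docs, n_terms):.1f} GiB")
```

Under these assumptions a vocabulary of 100,000 terms would need roughly 63 GiB for the dense matrix alone, before any temporary arrays created during scoring.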

Is Elasticsearch easy to use?


dipanjanS commented on June 28, 2024

It is better to build an index of the 85K documents once instead of repeatedly building a matrix in Python for every query.
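To illustrate the idea of indexing once and querying many times, here is a minimal pure-Python sketch of an inverted index with BM25 scoring. It is not how Elasticsearch is implemented; the class name BM25Index, the whitespace tokenizer, the IDF variant, and the defaults k1=1.5 and b=0.75 are assumptions for illustration. It also assumes a non-empty corpus.

```python
import math
from collections import Counter, defaultdict

class BM25Index:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)
        self.doc_len = []
        for doc_id, text in enumerate(docs):
            tokens = text.lower().split()      # naive tokenizer (assumption)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len) / self.n_docs

    def idf(self, term):
        # Smoothed IDF; unseen terms get df = 0.
        df = len(self.postings.get(term, ()))
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query, top_k=10):
        """Return (doc_id, score) pairs, visiting only matching postings."""
        scores = defaultdict(float)
        for term in set(query.lower().split()):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = self.k1 * (
                    1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                )
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


index = BM25Index([
    "the cat sat on the mat",
    "dogs chase cats",
    "the quick brown fox",
])
results = index.search("cat mat")   # list of (doc_id, score), best first
```

Because each query only walks the postings of its own terms, memory use grows with the number of non-zero term counts rather than with documents times vocabulary size.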

Elasticsearch is very easy to learn and use: https://www.elastic.co/products/elasticsearch

There is also a Python client that you can use on top of Elasticsearch: https://elasticsearch-py.readthedocs.io/en/master/

