Comments (7)
What are you trying to do for 85K documents? You haven't mentioned much about the problem you are trying to solve.
The similarity chapter is all about showing how the algorithms are actually implemented, along with the math behind them. If you want to scale this out for similarity, search, and information retrieval, consider a more scalable solution like Elasticsearch, which uses BM25 in the backend, rather than writing it in raw Python.
from text-analytics-with-python.
Yes, this work belongs to the information retrieval chapter. I want to do learning to rank, but first I need to calculate the features, and BM25 is one of them. So I just run each query against the document corpus to get the scores. It should still be doable to change the function somehow so that I don't get the error, right?
from text-analytics-with-python.
The BM25 code is just a mathematical function that has been converted into Python code based on the formula. It could be that the NumPy feature matrices do not fit in the RAM of your system. It's still not clear which line of the code is throwing the memory error, though.
In general, the line corpus_features = corpus_features.toarray() can be moved outside the function so that it does not eat up RAM on every query (that is, generate the dense matrix just once instead of regenerating it inside the function each time a query is made).
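A minimal sketch of that change, assuming corpus_features is a SciPy sparse matrix of term counts. The score_query function is a hypothetical placeholder (a plain dot product), not the book's BM25 function, and the toy data is made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix


def score_query(query_vector, dense_corpus):
    """Hypothetical stand-in for the BM25 function: one score per document."""
    return dense_corpus @ query_vector


# Toy term-count matrix: 3 documents x 4 terms (made-up data).
corpus_features = csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [3, 0, 0, 1],
]))

# Densify once, outside the scoring function, and reuse the result.
dense_corpus = corpus_features.toarray()

queries = [np.array([1, 0, 0, 0]), np.array([0, 0, 1, 1])]
all_scores = [score_query(q, dense_corpus) for q in queries]
```

This only helps if the dense matrix fits in RAM at all. As a rough estimate with a hypothetical vocabulary of 50,000 terms, 85,000 x 50,000 float64 values come to about 34 GB, so a corpus of this size may need to stay sparse.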
But if you want to solve this problem in the real world, for ranking/querying 85K documents consider using Elasticsearch, which is more efficient and the right way to retrieve and rank similar documents (you can even tune the algorithm in the backend through constructs in the query DSL).
from text-analytics-with-python.
Thanks for your help. Yes, the error is because of this code: corpus_features = corpus_features.toarray()
Could you maybe post how the code would look with the changes you suggest?
Big thanks
from text-analytics-with-python.
I'm actually working on the 2nd revision of this book for Python 3.x, so I'm a bit busy restructuring and working on the code for the different chapters, since some things will change and all the code will also be ported over to Python 3.
You just need to put that line of code in your main code file/segment and not call it repeatedly in the function where BM25 is defined. Then, assuming you have enough RAM, it should work.
But like I said, prefer using Elasticsearch for these kinds of problems.
from text-analytics-with-python.
ok...
If I put the code outside and just execute it, I still get the MemoryError, and I have 32 GB of RAM.
Is Elasticsearch easy to use?
from text-analytics-with-python.
It is better to build an index over the 85K documents once instead of repeatedly building a matrix in Python for every query.
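To illustrate the idea of building an index once and scoring each query against it, here is a rough pure-Python sketch of an inverted index with BM25 scoring. It is not how Elasticsearch is implemented and not the book's code: the function names, the toy documents, the parameter defaults (k1=1.5, b=0.75), and the IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(tokenized_docs):
    """Build the inverted index once: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    doc_lengths = []
    for doc_id, tokens in enumerate(tokenized_docs):
        doc_lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths


def bm25_scores(query_tokens, postings, doc_lengths, k1=1.5, b=0.75):
    """Score only the documents that contain at least one query term."""
    n_docs = len(doc_lengths)
    if n_docs == 0:
        return {}
    avg_len = sum(doc_lengths) / n_docs
    scores = defaultdict(float)
    for term in query_tokens:
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue
        df = len(docs_with_term)
        # One common non-negative IDF variant (an assumption, see lead-in).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in docs_with_term.items():
            norm = tf + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return dict(scores)


# Toy corpus (made-up data): index once, then run any number of queries.
docs = [
    ["bm25", "ranking", "function"],
    ["learning", "to", "rank"],
    ["bm25", "bm25", "scores"],
]
postings, doc_lengths = build_index(docs)
scores = bm25_scores(["bm25"], postings, doc_lengths)
```

Because each query only touches the postings of its own terms, no dense documents-by-terms matrix is ever created, which is the main reason this approach avoids the MemoryError.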
Elasticsearch is very easy to learn and use: https://www.elastic.co/products/elasticsearch
There is also a Python client that you can use on top of Elasticsearch: https://elasticsearch-py.readthedocs.io/en/master/
from text-analytics-with-python.