Comments (7)
Hi @iibarant,
Please provide more information. 88,510 records should be small enough for my computer to handle (it can process more than 500,000 records).
from string_grouper.
Here's the code:
from string_grouper import match_strings
matches = match_strings(check2['full address'])
I ran the code on a MacBook Pro (2.4 GHz 8-core Intel Core i9, 32 GB 2667 MHz DDR4).
The dataframe contains three columns (name, phone, full address), and I need to run the match on the address only.
Thank you!
Thanks @iibarant
Curious! This is an unexpected error. Can you please provide the traceback (just copy and paste whatever Python prints) so that I can determine exactly where in the code the problem originates?
There you go ...
matches = match_strings(check2['full address'])
Traceback (most recent call last):
  File "", line 1, in
    matches = match_strings(check2['full address'])
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 131, in match_strings
    string_grouper = StringGrouper(master,
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 264, in fit
    matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 467, in _build_matches
    return awesome_cossim_topn(
  File "/opt/anaconda3/lib/python3.8/site-packages/sparse_dot_topn/awesome_cossim_topn.py", line 119, in awesome_cossim_topn
    alt_indices, alt_data = ct_thread.sparse_dot_topn_extd_threaded(
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 133, in sparse_dot_topn.sparse_dot_topn_threaded.__pyx_fuse_0sparse_dot_topn_extd_threaded
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 168, in sparse_dot_topn.sparse_dot_topn_threaded.sparse_dot_topn_extd_threaded
OverflowError: value too large to convert to int
It looks like the error stems from sparse_dot_topn, a package dependency.
I'm not sure whether it's platform-dependent (as far as I know, the package has only been tested on Linux and Microsoft Windows) or caused by something else.
Could you try the following command, just to see what happens?
matches = match_strings(check2['full address'], max_n_matches=20)
(This caps the number of matches returned per string, which limits the output a bit.)
Yes, that works. Thank you. Should I check whether the code works with a larger max_n_matches? I'm planning to keep only matches with a similarity score above 0.9. Would it be possible to apply that threshold in the same call?
OK, good. Yes, I suggest you try successively larger values of max_n_matches until the output size no longer changes. And yes, of course, you can also use min_similarity at the same time.
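For what it's worth, here is a minimal sketch of that "keep increasing until the output size stops changing" loop. The helper name grow_until_stable and its start, factor, and limit parameters are made up for illustration and are not part of string_grouper; the only string_grouper names used are match_strings, max_n_matches, and min_similarity, and they appear only in the commented usage at the bottom, which assumes check2 is the dataframe from this thread.

```python
def grow_until_stable(match_fn, start=20, factor=2, limit=1280):
    """Call match_fn(n) with growing n until the result size stops changing.

    Hypothetical helper, not part of string_grouper.
    Returns (result, n, stable); stable is False if `limit` was reached
    before two consecutive runs produced the same number of rows.
    """
    if start < 1 or factor < 2 or limit < start:
        raise ValueError("need start >= 1, factor >= 2 and limit >= start")
    n = start
    result = match_fn(n)
    while n * factor <= limit:
        n *= factor
        bigger = match_fn(n)
        # Same row count as the previous, smaller cap: treat as converged.
        if len(bigger) == len(result):
            return bigger, n, True
        result = bigger
    return result, n, False


# Usage with string_grouper (assumes `check2` is the dataframe from above):
# from string_grouper import match_strings
# matches, n, stable = grow_until_stable(
#     lambda n: match_strings(check2['full address'],
#                             max_n_matches=n, min_similarity=0.9)
# )
```

If stable comes back False, the cap given by limit was hit before the output size settled, so the result may still be truncated.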