Comments (10)
Thanks for the issue. This is partly the result of a poor design of the Index class. I am working on an improvement, but it is time-consuming.
There is a workaround. I will try to find some time tomorrow and post it here.
from recordlinkage.
Great, thanks. Cool library btw. 👍
So it seems that if I override the _blockindex(...) method and tell the merge to operate only on the left/right indices (not the column names), it proceeds without running into a MemoryError.
# Join
pairs = data_left.reset_index().merge(
    data_right.reset_index(),
    how='inner',
    left_on=left_on,
    right_on=right_on,
).set_index([df_a.index.name, df_b.index.name])
Now becomes:
# Join
pairs = data_left.reset_index().merge(
    data_right.reset_index(),
    how='inner',
    left_index=True,
    right_index=True,
).set_index([df_a.index.name, df_b.index.name])
Have I missed something important here? EDIT: yes. I've only removed the columns it was supposed to block on. Not very helpful at all.
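To illustrate why merging on the indices is not a substitute for merging on the block columns, here is a minimal sketch (the toy frames, index names, and column names are made up):

```python
import pandas as pd

# Hypothetical toy frames to illustrate the difference.
df_a = pd.DataFrame({'last_name': ['smith', 'jones']},
                    index=pd.Index([0, 1], name='id_a'))
df_b = pd.DataFrame({'last_name': ['jones', 'smith']},
                    index=pd.Index([0, 1], name='id_b'))

# Merging on the block column yields the intended candidate pairs.
on_cols = df_a.reset_index().merge(df_b.reset_index(),
                                   how='inner', on='last_name')

# Merging on the positional index just pairs row i with row i,
# regardless of the blocking key -- no blocking happens at all.
on_index = df_a.reset_index().merge(df_b.reset_index(), how='inner',
                                    left_index=True, right_index=True)
```

The first merge pairs records that share a last name; the second pairs records that happen to sit in the same row position, which is why the index-based variant silently returns the wrong candidate set.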
How many unique last names do you have? Can you estimate how many record pairs you are expecting? The built-in value_counts function can be helpful.
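As a rough sketch of that estimate (toy data): when blocking two files on one column, each key contributes count_a × count_b candidate pairs, so the totals from value_counts multiply per key.

```python
import pandas as pd

df_a = pd.DataFrame({'last_name': ['smith', 'smith', 'jones', 'lee']})
df_b = pd.DataFrame({'last_name': ['smith', 'jones', 'jones']})

counts_a = df_a['last_name'].value_counts()
counts_b = df_b['last_name'].value_counts()

# Each shared key contributes count_a * count_b pairs; keys missing
# from either file align to NaN and contribute nothing.
n_pairs = int((counts_a * counts_b).dropna().sum())
# smith: 2 * 1, jones: 1 * 2, lee: only in df_a -> 0
```

If this number is far larger than memory allows, the block key is too coarse.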
making candidate record pairs with large files
import pandas
import recordlinkage as rl

file_a_generator = pandas.read_csv('file_a.csv', ..., chunksize=10000)

matches = []

for df_a_chunk_i in file_a_generator:

    # re-create the reader for file_b here: the chunked reader is a
    # one-shot iterator and is exhausted after the first outer iteration
    file_b_generator = pandas.read_csv('file_b.csv', ..., chunksize=10000)

    for df_b_chunk_i in file_b_generator:

        # make pairs
        pcl = rl.Pairs(df_a_chunk_i, df_b_chunk_i)
        pairs = pcl.block(on='Last Name')

        # compare pairs
        ...

        # classify pairs
        ...

        # append matches to list
        ...
- This only works for linking records between two files, not for deduplication.
- If you need to train a classifier, train it on a random subset of the data. Thereafter, use the predict method.
I am trying to fix the chunk-size bug. I am also replacing the indexing class with another class.
# Join
pairs = data_left.reset_index().merge(
    data_right.reset_index(),
    how='inner',
    left_index=True,
    right_index=True,
).set_index([df_a.index.name, df_b.index.name])
This gives wrong results.
Thanks! The workaround for large files worked great.
Great.
I will keep this issue open until there is a solution for the deduplication case.
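For deduplication, a chunked scheme would have to cover both within-chunk and cross-chunk pairs. A minimal sketch of the bookkeeping (pure Python, not recordlinkage API; the function name is made up):

```python
from itertools import combinations_with_replacement

def chunk_combinations(n_chunks):
    """Enumerate the chunk combinations a chunked deduplication needs:
    each chunk against itself (within-chunk pairs) and each pair
    i < j (cross-chunk pairs), so every record pair across the whole
    file is generated exactly once."""
    return list(combinations_with_replacement(range(n_chunks), 2))
```

For a combination (i, i) one would build a deduplication index on that chunk; for (i, j) with i < j, a linking index between the two chunks, exactly as in the two-file workaround above.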
Hi there, just checking to see if there are any updates for the deduplication case? I'm currently running into that error. My file for deduplication is about 560k records, and there are 28k unique values on the block column.
(I ended up getting around this by blocking on multiple columns, as suggested in the Performance section of the docs.)
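The effect of blocking on multiple columns can be estimated with a plain pandas group-by before building any index (toy data; column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'last_name': ['smith'] * 4,
    'zip':       ['111', '111', '222', '333'],
})

def n_dedup_pairs(df, keys):
    # Within-file candidate pairs under blocking:
    # sum over blocks of k * (k - 1) / 2.
    sizes = df.groupby(keys).size()
    return int((sizes * (sizes - 1) // 2).sum())

single = n_dedup_pairs(df, ['last_name'])        # one block of 4 records
multi = n_dedup_pairs(df, ['last_name', 'zip'])  # blocks of 2, 1, 1
```

Adding a second block column splits large blocks, which is why it avoids the MemoryError at the cost of missing pairs that disagree on the extra column.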
Hi,
I have used the above solution, "making candidate record pairs with large files",
but I am getting an error:
Traceback (most recent call last):
File "feature.py", line 98, in
block()
File "feature.py", line 88, in block
features = compare_cl.compute(pairs, df_a_chunk_i, df_b_chunk_i)
File "/home/dhillon7/.local/lib/python2.7/site-packages/recordlinkage/base.py", line 849, in compute
results = self._compute(pairs, x, x_link)
File "/home/dhillon7/.local/lib/python2.7/site-packages/recordlinkage/base.py", line 712, in _compute
result = feat._compute(data1, data2)
File "/home/dhillon7/.local/lib/python2.7/site-packages/recordlinkage/base.py", line 454, in _compute
result = self._compute_vectorized(*tuple(left_on + right_on))
File "/home/dhillon7/.local/lib/python2.7/site-packages/recordlinkage/base.py", line 429, in _compute_vectorized
raise NotImplementedError()
NotImplementedError
Any help is really appreciated.
Hi,
I just wanted to know how you computed the feature vectors with chunking when the two files are not the same size. I am getting an error because one of my file readers runs out of data before the other.
Pavneet
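That behaviour is consistent with the chunked CSV reader being a one-shot iterator: once the reader for file_b is exhausted during the first pass, the inner loop is empty for every later chunk of file_a. A minimal stand-alone sketch of the fix (plain Python stand-in for pandas.read_csv with chunksize):

```python
def chunks(data, size):
    """Stand-in for pandas.read_csv(..., chunksize=size): yields
    successive slices and, like the real reader, is exhausted
    after one full pass."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

file_a = list(range(5))
file_b = list(range(7))

pairs_seen = 0
for chunk_a in chunks(file_a, 2):
    # Re-create the reader for file_b on every outer iteration.
    # Reusing a single generator here would leave the inner loop
    # empty after the first pass, silently skipping comparisons.
    for chunk_b in chunks(file_b, 3):
        pairs_seen += len(chunk_a) * len(chunk_b)

# Every record of file_a is now paired with every record of file_b,
# regardless of the two files' sizes.
```

With the real reader, this means calling pandas.read_csv('file_b.csv', ..., chunksize=...) again inside the outer loop rather than once up front.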