Hey there.
I'm having some trouble linking two large-ish CSVs (1M rows and 650k rows). I get a MemoryError when I run the pair indexing (blocking) step. I'm passing the chunks parameter (chunks=1000) to the Pairs class, but it doesn't seem to help.
Here's the traceback:
  File "<input>", line 1, in <module>
  File "/data/python/contact-match/match.py", line 122, in run_match
    pairs = pcl.block(left_on=left_cols, right_on=right_cols)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 478, in block
    on=on, left_on=left_on, right_on=right_on
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 339, in index
    pairs = index_func(self.df_a, self.df_b, *args, **kwargs)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 44, in index_name_checker
    return func(df_a, df_b, *args, **kwargs)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 174, in _blockindex
    ).set_index([df_a.index.name, df_b.index.name])
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/frame.py", line 4607, in merge
    copy=copy, indicator=indicator)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/tools/merge.py", line 62, in merge
    return op.get_result()
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/tools/merge.py", line 564, in get_result
    concat_axis=0, copy=self.copy)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4825, in concatenate_block_managers
    placement=placement) for placement, join_units in concat_plan]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4825, in <listcomp>
    placement=placement) for placement, join_units in concat_plan]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4922, in concatenate_join_units
    for ju in join_units]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4922, in <listcomp>
    for ju in join_units]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 5222, in get_reindexed_values
    fill_value=fill_value)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/algorithms.py", line 1100, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError
And this is the code that builds the pairs:
    pcl = rl.Pairs(df_left, df_right, chunks=1000)
    pairs = pcl.block(left_on='Last Name', right_on='Last Name')
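In case it helps with diagnosing this: the traceback ends inside a pandas merge, so the size of the joined result is probably what matters. Below is a rough sketch, in plain pandas and not part of recordlinkage, of how one could estimate how many candidate pairs blocking on a single column would produce without actually building the join. The helper name estimate_block_pairs and the toy frames are made up for illustration; only the 'Last Name' column comes from my code.

```python
import pandas as pd


def estimate_block_pairs(df_a, df_b, left_on, right_on):
    """Estimate the number of rows an inner join on the blocking key would produce.

    Hypothetical helper for illustration; rows with a missing key are
    ignored here because value_counts() drops NaN by default.
    """
    counts_a = df_a[left_on].value_counts()
    counts_b = df_b[right_on].value_counts()
    # A key seen m times on the left and n times on the right yields m * n pairs.
    # Keys present on only one side are filled with 0 and contribute nothing.
    per_key = counts_a.mul(counts_b, fill_value=0)
    per_key = per_key.sort_values(ascending=False)
    return int(per_key.sum()), per_key


# Toy data standing in for the real CSVs (made up for this example).
left = pd.DataFrame({'Last Name': ['Smith', 'Smith', 'Jones', 'Lee']})
right = pd.DataFrame({'Last Name': ['Smith', 'Smith', 'Smith', 'Jones', 'Kim']})

total, per_key = estimate_block_pairs(left, right, 'Last Name', 'Last Name')
print(total)           # total candidate pairs for this blocking key
print(per_key.head())  # the keys that contribute the most pairs
```

If a handful of very common last names dominate per_key, the blocked result could be far larger than either input, which might explain the MemoryError regardless of the chunks setting.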
I'm still fairly new to pandas, so please forgive me if I've missed something obvious.