
j535d165 / recordlinkage

912 stars · 32 watchers · 150 forks · 71.68 MB

A powerful and modular toolkit for record linkage and duplicate detection in Python

Home Page: http://recordlinkage.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 99.01% C 0.58% Cython 0.41%
record-linkage entity-resolution dedupe string-distance machine-learning privacy python python-library data-matching deduplication


recordlinkage's Issues

ImportError: cannot import name 'logsumexp'

Full error:

Traceback (most recent call last):

  File "<ipython-input-221-e68a19870618>", line 1, in <module>
    import recordlinkage

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/recordlinkage/__init__.py", line 6, in <module>
    from recordlinkage.classifiers import *

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/recordlinkage/classifiers.py", line 6, in <module>
    from sklearn import cluster, linear_model, naive_bayes, svm

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/cluster/__init__.py", line 6, in <module>
    from .spectral import spectral_clustering, SpectralClustering

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/cluster/spectral.py", line 17, in <module>
    from ..manifold import spectral_embedding

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/manifold/__init__.py", line 6, in <module>
    from .isomap import Isomap

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/manifold/isomap.py", line 11, in <module>
    from ..decomposition import KernelPCA

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/decomposition/__init__.py", line 19, in <module>
    from .online_lda import LatentDirichletAllocation

  File "/Users/connorhogendorn/anaconda/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py", line 22, in <module>
    from ..utils.fixes import logsumexp

ImportError: cannot import name 'logsumexp'

I'm using sklearn version '0.18.1'
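
A hedged aside: this error usually signals a scipy/scikit-learn version mismatch, since logsumexp moved between scipy namespaces over time. A minimal check, assuming both packages import at all:

    # Print the two versions involved; upgrading scipy and scikit-learn
    # together usually resolves this kind of mismatch.
    import scipy
    import sklearn

    print(scipy.__version__, sklearn.__version__)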

kernel died

My kernel dies while creating the index. Is there any way to run the job in parallel in recordlinkage?
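
A minimal sketch of the usual remedy, assuming the kernel dies from memory pressure: blocking keeps the candidate set far smaller than a full pairwise index. The column names below are assumptions for illustration.

    import pandas as pd
    import recordlinkage

    df_a = pd.DataFrame({"surname": ["smith", "jones"], "name": ["anna", "bob"]})
    df_b = pd.DataFrame({"surname": ["smith", "brown"], "name": ["ann", "carl"]})

    # Only records that agree on 'surname' become candidate pairs, which keeps
    # memory use bounded compared to a full index.
    indexer = recordlinkage.Index()
    indexer.block("surname")
    print(indexer.index(df_a, df_b))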

Include option to use Spark dataframes

Hi,

I'm considering writing an extension that makes it possible to use Spark dataframes with this tool. The Spark API is pretty similar to pandas dataframes, but does not necessarily have the same memory problems with large datasets.

However, I would like some input on this.

  1. Do you think it is possible? Are there any major issues I don't see?
  2. Does anyone know the current upper limit on data size before the current implementation gets into trouble? For example, if I try to link 90k records against 4.5 million potential matches (records), would this be problematic if proper blocking etc. is applied?
  3. Any constructive input about this idea is appreciated.
  4. Would anyone like to help me develop this? Either with coding or by answering questions as I go along. :)

recordlinkage python

Input:

import recordlinkage
import pandas as pd
import quandl, math
import numpy as np
import pandas._libs.sparse as splib
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

df_a = pd.read_csv("This pc/Downloads/ex.1.csv")
df_b = pd.read_csv("This pc/Downloads/ex2.csv")

indexer = recordlinkage.Index()
indexer.block('surname')
candidate_links = indexer.index(df_a, df_b)

c = recordlinkage.Compare()
c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)
feature_vectors = c.compute(candidate_links, df_a, df_b)  # was missing

logrg = recordlinkage.LogisticRegressionClassifier()
logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_CLASSES)  # placeholders for labelled training data
logrg.predict(feature_vectors)

ecm = recordlinkage.ECMClassifier()
matches = ecm.fit_predict(feature_vectors)  # assignment was missing
print(len(matches))

Output:

File "C:\Users\DELL\AppData\Roaming\Python\Python36\site-packages\scipy_lib_ccallback.py", line 1, in
from . import _ccallback_c

ImportError: cannot import name '_ccallback_c'
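
A hedged diagnostic: _ccallback_c is a compiled scipy module, so this import error points at a broken scipy installation rather than at recordlinkage itself; reinstalling scipy is the usual remedy.

    # If this import fails, the scipy installation itself is broken;
    # `pip install --force-reinstall scipy` typically fixes it.
    from scipy._lib import _ccallback_c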

Window indexing algorithm

I'm kind of stuck trying to write a custom blocking function.

I want to index by a date interval, say df_a['start'] + 2 days >= df_b['start'].

I just can't figure out how to implement a function that returns a MultiIndex like this. Any clues?

Thank you so much for such a great toolkit! :)
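
A hedged sketch of one way to do this, using the custom-algorithm hook from the docs (BaseIndexAlgorithm._link_index must return a pandas.MultiIndex). The cross-join below is an assumption and only suits modest frame sizes.

    import pandas as pd
    from recordlinkage.base import BaseIndexAlgorithm

    class DateWindowIndex(BaseIndexAlgorithm):
        """Pair records where df_a['start'] + 2 days >= df_b['start']."""

        def _link_index(self, df_a, df_b):
            # Cross-join both indices, then keep the pairs inside the window.
            pairs = pd.MultiIndex.from_product([df_a.index, df_b.index])
            starts_a = df_a.loc[pairs.get_level_values(0), "start"].to_numpy()
            starts_b = df_b.loc[pairs.get_level_values(1), "start"].to_numpy()
            return pairs[starts_a + pd.Timedelta(days=2) >= starts_b]

Usage would then be DateWindowIndex().index(df_a, df_b), which calls _link_index under the hood.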

Problem with Jellyfish

When using the string comparison, I get:
...
/Users/shaharb/.local/lib/python3.5/site-packages/recordlinkage/algorithms/string.py in jaro_winkler_apply(x)
48
49 try:
---> 50 return jellyfish.jaro_winkler(x[0], x[1])
51 except Exception as err:
52 if pandas.isnull(x[0]) or pandas.isnull(x[1]):

NameError: ("name 'jellyfish' is not defined", 'occurred at index (1, 1)')

Yet jellyfish itself imports fine and resolves to the C implementation (cjellyfish):
jellyfish.jaro_winkler
<function jellyfish.cjellyfish.jaro_winkler>

load_krebsregister() fails with [SSL: CERTIFICATE_VERIFY_FAILED]

Hi,

Following the readthedocs documentation, I get the following error when I try to load the Krebs register:

>>> krebs_data, krebs_match = load_krebsregister(missing_values=0)
Start downloading the data.
Issue with downloading the data: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/recordlinkage/datasets/external.py", line 75, in load_krebsregister
    data = pandas.concat([_krebsregister_block(bl) for bl in block])
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/recordlinkage/datasets/external.py", line 75, in <listcomp>
    data = pandas.concat([_krebsregister_block(bl) for bl in block])
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/recordlinkage/datasets/external.py", line 123, in _krebsregister_block
    compression='zip')
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/pandas/io/parsers.py", line 764, in __init__
    self._make_engine(self.engine)
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/pandas/io/parsers.py", line 985, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/user/Library/Python/3.6/lib/python/site-packages/pandas/io/parsers.py", line 1605, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 394, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)
  File "pandas/_libs/parsers.pyx", line 664, in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8001)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py", line 1082, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/user/Library/Python/3.6/lib/python/site-packages/recordlinkage/datasets/krebsregister/block_1.zip'

If there's an easy way to fix this, I'd be happy to create PR with the fix.
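
A hedged workaround rather than a fix: on macOS Python installs where the certifi bundle was never wired up, disabling verification for this one download is the common stopgap.

    # Stopgap only: skip certificate verification for the dataset download.
    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context

    from recordlinkage.datasets import load_krebsregister
    krebs_data, krebs_match = load_krebsregister(missing_values=0)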

Iterindex has one or two parameters to define block size

For linking:

c_pairs = recordlinkage.Pairs(A, B)

for pairs in c_pairs.fullindex(100, 100):
    print(len(pairs))

But for deduplication:

c_pairs = recordlinkage.Pairs(A)

for pairs in c_pairs.fullindex(100, 100):
    print(len(pairs))

Are two block-size parameters needed for deduplication? And what about arguments passed right after them: should they be args or kwargs?

Out of Memory Issue: Chunking?

Hey there.

I'm having some trouble linking two large-ish CSVs (1M rows & 650k rows). I'm running into MemoryError exceptions when I execute the pair indexing. I'm using the chunksize parameter on the Pairs class but it doesn't seem to help.

  File "<input>", line 1, in <module>
  File "/data/python/contact-match/match.py", line 122, in run_match
    pairs = pcl.block(left_on=left_cols, right_on=right_cols)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 478, in block
    on=on, left_on=left_on, right_on=right_on
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 339, in index
    pairs = index_func(self.df_a, self.df_b, *args, **kwargs)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 44, in index_name_checker
    return func(df_a, df_b, *args, **kwargs)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/recordlinkage/indexing.py", line 174, in _blockindex
    ).set_index([df_a.index.name, df_b.index.name])
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/frame.py", line 4607, in merge
    copy=copy, indicator=indicator)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/tools/merge.py", line 62, in merge
    return op.get_result()
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/tools/merge.py", line 564, in get_result
    concat_axis=0, copy=self.copy)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4825, in concatenate_block_managers
    placement=placement) for placement, join_units in concat_plan]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4825, in <listcomp>
    placement=placement) for placement, join_units in concat_plan]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4922, in concatenate_join_units
    for ju in join_units]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 4922, in <listcomp>
    for ju in join_units]
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/internals.py", line 5222, in get_reindexed_values
    fill_value=fill_value)
  File "/data/ve/ve-contact-match/lib/python3.4/site-packages/pandas/core/algorithms.py", line 1100, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError

and my code:

import recordlinkage as rl

pcl = rl.Pairs(df_left, df_right, chunks=1000)
pairs = pcl.block(left_on='Last Name', right_on='Last Name')

I am still a tad new to pandas, so please forgive me if I've missed something obvious.

Allow assigning weights to different features when using compare

Not all features are equally important when comparing whether two records are the same. It would be really nice if the package supported assigning weights when comparing.

Say we are doing record linkage on an online-shopping customer data set. The data have SSN (they shouldn't store SSNs; it is only used here as a demonstration), address, gender, birth date, telephone, email, etc. An example would be:

compare_cl.string('LASTNAME', 'LASTNAME', method='jarowinkler', threshold=0.9, label='LASTNAME', weight=0.7)
compare_cl.exact('SSN', 'SSN', label='SSN', weight=1)

In the end, even if the last name matches, it contributes only 0.7 to the feature vector, whereas an SSN match contributes 1.
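
A hedged workaround with the current API: compute unweighted features, then scale the columns yourself before summing. All names and weights below are assumptions for illustration.

    import pandas as pd
    import recordlinkage

    df_a = pd.DataFrame({"LASTNAME": ["smith", "jones"], "SSN": ["111", "222"]})
    df_b = pd.DataFrame({"LASTNAME": ["smyth", "jones"], "SSN": ["111", "333"]})

    indexer = recordlinkage.Index()
    indexer.full()
    candidate_links = indexer.index(df_a, df_b)

    compare_cl = recordlinkage.Compare()
    compare_cl.string("LASTNAME", "LASTNAME", method="jarowinkler",
                      threshold=0.9, label="LASTNAME")
    compare_cl.exact("SSN", "SSN", label="SSN")
    features = compare_cl.compute(candidate_links, df_a, df_b)

    # Weighted score: a last-name match adds 0.7, an SSN match adds 1.0.
    weights = {"LASTNAME": 0.7, "SSN": 1.0}
    score = sum(features[col] * w for col, w in weights.items())
    print(score)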

recordlinkage.preprocessing module is missing

I got this error after installing with Pip from the PyPI wheel, PyPI source, and GitHub source:

import recordlinkage.preprocessing
ModuleNotFoundError: No module named 'recordlinkage.preprocessing'

So I looked and found that, in all three cases, the preprocessing module was missing:

cd <stuff>/site-packages
ls -1XF recordlinkage
algorithms/
datasets/
__pycache__/
standardise/
base.py
classifiers.py
comparing.py
indexing.py
__init__.py
measures.py
rl_logging.py
types.py
utils.py
_version.py

And it looks like preprocessing is missing from setup.py. Is this intentional? What's the correct usage of recordlinkage.preprocessing? The documentation uses that module by name in the examples.
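
A hedged illustration of the likely root cause: when subpackages are enumerated by hand in setup.py, a newly added module is silently left out of the built wheel; setuptools' find_packages() avoids that class of bug. The snippet below is a sketch, not the project's actual setup.py.

    # Sketch of a setup.py that cannot forget a subpackage.
    from setuptools import find_packages, setup

    setup(
        name="recordlinkage",
        packages=find_packages(),  # picks up recordlinkage.preprocessing too
    )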

Possible bug when len(labels) > 1

https://github.com/J535D165/recordlinkage/blob/master/recordlinkage/base.py#L288-L294

        labels = []
        for i in range(0, n_cols):

            label_val = label[i] if label is not None else label_num
            label_num += 1

            labels.append(label_val)

        results[label_val] = c

This code creates the list labels and then discards it, attempting to assign c, which might be a 2D array, to only the final label_val column.

What is the correct behavior? My guess is this:

    if n_cols > 1:
        labels = []
        for i in range(0, n_cols):
            label_val = label[i] if label is not None else label_num
            label_num += 1
            labels.append(label_val)

            results[label_val] = c[:, i]
    else:
        label_val = label[0] if label is not None else label_num
        label_num += 1

        results[label_val] = c

Unfortunately you need to handle n_cols == 1 separately, because you can't assign to multiple columns at once in a DataFrame (I consider this a bug, not sure how the Pandas devs feel about it).

Remove requirement of pairs in Compare class

For example:

import recordlinkage
from recordlinkage.sampledata import censusdataA, censusdataB

index = recordlinkage.Index(censusdataA, censusdataB)
pairs = index.block('surname')
compare = recordlinkage.Compare()

instead of

compare = recordlinkage.Compare(pairs)

Comparing Numeric Strings

It looks like pandas is trying to infer data types in Compare.string(), which causes problems when a column automatically parses to something other than a string (a simple example: zip codes).

import pandas as pd
import recordlinkage

df1 = pd.DataFrame([['A', '60045'], ['A', '60046']], columns=['type', 'zip'])
df2 = pd.DataFrame([['A', '60045'], ['A', '60047']], columns=['type', 'zip'])

indexer = recordlinkage.BlockIndex('type')
pairs = indexer.index(df1, df2)

compare = recordlinkage.Compare()
compare.string('zip', 'zip', method='levenshtein', threshold=0.8, label='zip')
features = compare.compute(pairs, df1, df2)

returns the following error:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-136-2fa5013d65c7> in <module>()
     15 #compare.exact('haircolor', 'haircolor', missing_value=9)
     16 
---> 17 features = compare.compute(pairs, marketing_df, hub_df)
     18 

c:\python27\lib\site-packages\recordlinkage\base.pyc in compute(self, pairs, x, x_link)
    542 
    543         if self.n_jobs == 1:
--> 544             results = _compute(self, pairs, x, x_link)
    545         elif self.n_jobs > 1:
    546             results = _compute_parallel(self, pairs, x, x_link,

c:\python27\lib\site-packages\recordlinkage\base.pyc in _compute(self, pairs, x, x_link)
    278         data2 = tuple([df_b_indexed[lbl] for lbl in listify(lbl2)])
    279 
--> 280         c = f(*tuple(data1 + data2 + feat.args), **feat.kwargs)
    281 
    282         if isinstance(c, (pandas.Series, pandas.DataFrame)):

c:\python27\lib\site-packages\recordlinkage\comparing.pyc in func_wrapper(*args, **kwargs)
     37             mv = kwargs.pop('missing_value', missing_value)
     38 
---> 39             result = func(*args, **kwargs)
     40 
     41             # fill missing values if missing_value is not a missing value like

c:\python27\lib\site-packages\recordlinkage\comparing.pyc in _string_internal(s1, s2, call_method, threshold, *args, **kw)
     70 def _string_internal(s1, s2, call_method, threshold=None, *args, **kw):
     71 
---> 72     c = call_method(s1, s2, *args, **kw)
     73 
     74     if threshold:

c:\python27\lib\site-packages\recordlinkage\algorithms\string.pyc in jarowinkler_similarity(s1, s2)
     49                 raise err
     50 
---> 51     return conc.apply(jaro_winkler_apply)
     52 
     53 

c:\python27\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

c:\python27\lib\site-packages\recordlinkage\algorithms\string.pyc in jaro_winkler_apply(x)
     47                 return np.nan
     48             else:
---> 49                 raise err
     50 
     51     return conc.apply(jaro_winkler_apply)

TypeError: unicode argument expected

I'm not quite sure how to go about fixing this, but it seems the .string() method should force the series to be read as strings rather than letting pandas infer the data type.
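
A hedged workaround, reusing df1 and df2 from the example above: cast the columns to strings before comparing, so the jellyfish call always receives text.

    # Force string dtype so type inference can't hand jellyfish integers.
    df1["zip"] = df1["zip"].astype(str)
    df2["zip"] = df2["zip"].astype(str)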

AttributeError: module 'recordlinkage' has no attribute 'Index'

Hi @J535D165,

I am trying to run the example provided, but I am running into an error about the missing attribute 'Index':

index = recordlinkage.Index(df_a, df_b)

AttributeError: module 'recordlinkage' has no attribute 'Index'

Any help is much appreciated. Thank you.
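
A hedged aside: recordlinkage.Index only exists in newer releases, so this usually means the installed version predates the documentation being followed. A quick check:

    import recordlinkage

    # Older releases exposed Pairs instead of Index; upgrade if this is old.
    print(getattr(recordlinkage, "__version__", "unknown"))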

Handle set-wise comparison and pooling

If one or more of my datasets includes a list of possible values (e.g. alternative names in the authoritative record for some entity), I may want to do comparisons between all values of the corresponding fields in each dataset, and then pool over them (max, avg) to get an overall score for that pair of records. It might be a set of numerics, a set of strings, a set of addresses...

While I can handle this either with a custom comparison function, which in turn might contain a Compare, or by duplicating rows in my input for every combination of set values and constructing appropriate candidate links, neither seems straightforward for something that would seem to be a common need in record linkage.

Could we get, at least, a recipe for this, or a helper?
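
A hedged sketch of the row-duplication recipe asked for above: explode the alternative names, compare every combination, then pool with max per original record pair. The column names and the pandas explode call are assumptions.

    import pandas as pd
    import recordlinkage

    df_a = pd.DataFrame({"names": [["bob", "robert"], ["ann"]]})
    df_b = pd.DataFrame({"name": ["robert", "anne"]})

    # One row per alternative name, remembering the original record id.
    exploded = (df_a.explode("names")
                    .rename(columns={"names": "name"})
                    .reset_index()
                    .rename(columns={"index": "rec_a"}))

    indexer = recordlinkage.Index()
    indexer.full()
    pairs = indexer.index(exploded, df_b)

    compare = recordlinkage.Compare()
    compare.string("name", "name", method="jarowinkler", label="score")
    features = compare.compute(pairs, exploded, df_b)

    # Pool: map exploded rows back to their original record, then take the
    # max score over each (original record, df_b record) pair.
    rec_a = exploded.loc[features.index.get_level_values(0), "rec_a"].to_numpy()
    rec_b = features.index.get_level_values(1)
    pooled = features["score"].groupby([rec_a, rec_b]).max()
    print(pooled)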

Link two datasets: How to Save Results to CSV

Pardon me right from the start: I am new to Python (just a week old, because I have to learn record linking and deduplication for merging e-commerce data), so I guess my question may sound stupid to many.

I followed the Link two datasets tutorial, and right now I am sitting here, scratching my head, asking myself:

  1. How do I save the linked records to CSV?
  2. How do I save the unpaired records to CSV?

What is next? I see this bunch of stats, but how does that keep me from getting fired, seeing that my boss wants me to give him CSVs of linked and deduplicated data?

Please, please help me.
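
A hedged sketch of the usual pattern: the classifier's matches form a MultiIndex of linked pairs, so pull the matching rows out of each frame and write them with pandas. The frames and the `matches` index below are stand-ins for the tutorial's objects.

    import pandas as pd

    df_a = pd.DataFrame({"name": ["anna", "bob"]})
    df_b = pd.DataFrame({"name": ["ann", "carl"]})
    matches = pd.MultiIndex.from_tuples([(0, 0)])  # stand-in for classifier output

    # 1. Linked records: one row per matched pair, columns from both frames.
    linked = df_a.loc[matches.get_level_values(0)].reset_index(drop=True).join(
        df_b.loc[matches.get_level_values(1)].reset_index(drop=True),
        lsuffix="_a", rsuffix="_b")
    linked.to_csv("linked.csv", index=False)

    # 2. Unpaired records: rows whose index never appears among the matches.
    df_a[~df_a.index.isin(matches.get_level_values(0))].to_csv("unpaired_a.csv")
    df_b[~df_b.index.isin(matches.get_level_values(1))].to_csv("unpaired_b.csv")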

Version requirement on pandas should be 0.19

import pandas
import recordlinkage

fails at

from pandas.types.inference import is_list_like

in comparing.py at line 7 with the following error:

ImportError: No module named 'pandas.types.inference'

I ran this on pandas 0.18.1; the problem ceased to be an issue in version 0.19.1.

As far as I can gather, pandas.types.inference was added in 0.19.0.

Cheers,
Ed

Return after comparing information returns all comparisons

For example,

print(comp.exact(pairs['name_A'], pairs['name_B'], name='name'))
print(comp.exact(pairs['surname_A'], pairs['surname_B'], name='surname'))

The last call also returns the 'name' column. This causes problems. Replace by returning only the column that was just compared?

Missing values in SNI causing non-uniqueness

The following example does not work. Clue: missing values are not handled correctly.

import recordlinkage
from recordlinkage.datasets import load_febrl4

dfA, dfB = load_febrl4()
dfB.index.name = dfB.index.name + "_"

index = recordlinkage.Pairs(dfA, dfB)
pairs = index.sortedneighbourhood('given_name')

# Check whether the index is unique
pairs.is_unique
# False

StandardSeries inplace

StandardSeries does not support inplace standardisation correctly. Fix the wrapper that checks the type of the Series.

Same story for DataFrames

Parallel computing (indexing and comparing)

There is a need for parallel computations in the comparing and indexing modules.

The current situation:

  • Indexing functions don't use parallel computing. The functions are based on pandas functions.
  • Comparing uses numexpr if possible. This library performs parallel computing if possible.
  • String similarity functions use the C version of Jellyfish.

The desired situation:

  • Split the candidate links in parts and compare in separate processes.
  • Drop numexpr to prevent prevent computing of the library. Replace by 'pandas' parser engine or pure Cython functions
  • Use the chunks engine in the Index class for parallel processes. This might speed up the situation in some cases.

To do:

  • Add support for parallel computing in the Compare class.
  • Drop numexpr support.
  • Add support for parallel computing in the Indexing class.
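
A minimal self-contained sketch of the first to-do, assuming the n_jobs branching visible in the base.py traceback earlier on this page:

    import pandas as pd
    import recordlinkage

    df = pd.DataFrame({"name": ["anna", "ann", "bob", "bobby"]})

    indexer = recordlinkage.Index()
    indexer.full()
    pairs = indexer.index(df)

    # n_jobs > 1 splits the candidate links and compares them in separate
    # processes, matching the first to-do above.
    compare = recordlinkage.Compare(n_jobs=2)
    compare.string("name", "name", method="jarowinkler", label="name")
    print(compare.compute(pairs, df))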

ValueError: The truth value of a DataFrame is ambiguous.

Code:

from pandas import DataFrame as DF
import jellyfish
import recordlinkage

names = \
    [ {'name': 'Daniel'}
    , {'name': 'Danieel'}
    , {'name': 'Daaniel'}
    , {'name': 'Daniiel'}
    , {'name': 'Alex'}
    , {'name': 'George'}
    , {'name': 'Laura'}
    , {'name': 'Mary'}
    ]
names_df = DF(names)
names_df.index.name = 'aaa'
ps = recordlinkage.Pairs(names_df)
pairs = ps.full()
compare_cl = recordlinkage.Compare(pairs, names_df, names_df)
print compare_cl.fuzzy('name', 'name', method='jarowinkler', threshold=0.85)

The error I'm getting:

Traceback (most recent call last):
  File "rl.py", line 20, in <module>
    print compare_cl.fuzzy('name', 'name', method='jarowinkler', threshold=0.85)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __str__
    return self.__bytes__()
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 59, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 984, in __unicode__
    max_rows=max_rows)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1025, in to_string
    dtype=dtype, name=name, max_rows=max_rows)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 1053, in _get_repr
    result = formatter.to_string()
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/formats/format.py", line 224, in to_string
    fmt_index, have_header = self._get_formatted_index()
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/formats/format.py", line 206, in _get_formatted_index
    have_header = any(name for name in index.names)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 892, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Am I doing something wrong?

Other options to store record pairs

Record pairs are stored in pandas.MultiIndex objects. For several users, this object is hard to understand. It would be nice to add an option to store record pairs in other formats like numpy.arrays of even python sets.

Use fast .at indexer instead of slower .loc

The step between indexing and comparing is slow because of

df_a.loc[multiindex.get_level_values(0)]
df_b.loc[multiindex.get_level_values(1)]

Using .at might cause a significant speed up.
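
A hedged micro-benchmark of the claim; note that .at only does scalar access, so the speedup applies per lookup rather than to the vectorised .loc calls shown above.

    import timeit

    import pandas as pd

    df = pd.DataFrame({"v": range(100000)})
    print(timeit.timeit(lambda: df.loc[500, "v"], number=10000))
    print(timeit.timeit(lambda: df.at[500, "v"], number=10000))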

Index KeyError

If working with two dataframes that have identical index column names, _blockindex() in indexing.py will produce a KeyError.

Line 168 in indexing.py:

    # Join
    pairs = data_left.reset_index().merge(
        data_right.reset_index(),
        how='inner',
        left_on=left_on,
        right_on=right_on,
    ).set_index([df_a.index.name, df_b.index.name])

will fail because the clashing index column names will have been auto-renamed by the merge (e.g. to "ID_x" and "ID_y").

(My index columns are called "ID"; see the debug screenshot.)

Not a biggie. I just renamed my index columns in the data and carried on. There may be another easy way to rename them on the pandas dataframe itself, but I haven't looked into it.
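
A hedged workaround, shorter than editing the data: give the two frames distinct index names before indexing, which sidesteps the merge-suffix clash.

    import pandas as pd

    df_a = pd.DataFrame({"name": ["anna"]}).rename_axis("ID")
    df_b = pd.DataFrame({"name": ["ann"]}).rename_axis("ID")

    # Distinct index names avoid the clash inside _blockindex().
    df_a.index.name = "ID_a"
    df_b.index.name = "ID_b"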

Batch compare

Use batch compare to prevent indexing over and over again. This is especially useful for large dataframes.

Python recordlinkage identity

Similar issue as the R RecordLinkage identity problem, but in Python: the algorithm generates new identities that do not reflect the correct identities of the records that were matched, assuming deduplication within a single dataframe.

PS: It seems to be okay in the deduplication example.

Logistic regression classifiers attribute coefficients dimensions issue

The following is intuitive but does not work:

# Train the classifier
c = recordlinkage.LogisticRegressionClassifier()

c.coefficients = numpy.array([0.5, 0.5, 0.5, 0.5])  # 1-D array: fails

The following works:

# Train the classifier
c = recordlinkage.LogisticRegressionClassifier()

c.coefficients = numpy.array([[0.5, 0.5, 0.5, 0.5]])  # shape (1, n_features)

Tool to merge/combine two linked dataframes.

After successfully linking two dataframes, there is often a need to make one unified dataframe. There should be a tool to combine the information from both dataframes into one. I think one tool might not be enough to cover all use cases, so a set of tools might be a good idea. Ideas are welcome.

Similar tools are needed for deduplication. A new issue will be made for that case.

LogisticRegressionClassifier

When running the LogisticRegressionClassifier, I get:
"ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: True"

What does this mean, and how to resolve it?
Thanks!
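
A hedged explanation: the underlying sklearn solver needs examples of both classes, and recordlinkage derives the classes from which candidate pairs appear in the match index passed to fit. So the training links must contain known matches and known non-matches; a minimal sketch with synthetic vectors:

    import numpy as np
    import pandas as pd
    import recordlinkage

    # Four candidate pairs; the first two are known matches.
    vectors = pd.DataFrame(
        np.array([[1, 1], [1, 0], [0, 0], [0, 1]]),
        index=pd.MultiIndex.from_tuples([(0, 0), (1, 1), (2, 2), (3, 3)]))
    true_matches = pd.MultiIndex.from_tuples([(0, 0), (1, 1)])

    c = recordlinkage.LogisticRegressionClassifier()
    c.fit(vectors, true_matches)  # works: both classes are represented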

numpy divide by zero warning

I am getting the following warning when I run a block-indexed match between two dataframes:

string.py:151: RuntimeWarning: invalid value encountered in divide
  return np.divide(ab, np.multiply(a, b)).A1

Any thoughts?

TypeError: expected unicode, got str

Hi, I tried to run this:

compare_cl.string('Address', 'Address', method='jarowinkler', threshold=0.85)

and I am getting this:

TypeError: ('expected unicode, got str', u'occurred at index (93, 1934)')

I then thought it was related to the underlying jellyfish issue (jamesturk/jellyfish#45) and tried the solution suggested there, but it didn't work. Can you please let me know if I have to convert the columns to Unicode beforehand?
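
A hedged pre-processing step for Python 2, where this error occurs: decode the column to unicode before comparing, since the C jellyfish backend rejects byte strings.

    import pandas as pd

    df = pd.DataFrame({'Address': ['1 Main St']})
    df['Address'] = df['Address'].astype(unicode)  # Python 2 only; str is unicode on Python 3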

Increase Logging Verbosity

Currently, recordlinkage logs very few events. However, if more events were logged, users could track the progress of their recordlinkage code while it runs, which would be especially useful for large computations which take a long time to run. Events which might be logged could include beginning/ending a new stage (i.e. indexing, comparing, classifying, fusing), and beginning/ending jobs within this stage.

Using the current API, it is easy enough for the user to include logging/print statements in between method calls. However, this will become difficult when the Pipeline API is introduced and all major computations are initiated by a single class, and are executed in parallel.
