
string_grouper's Introduction

String Grouper


[graph-visualization image]

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when, say, due to memory-resource-limitations, direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score.

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.


string_grouper is a library that makes finding groups of similar strings within a single list, or between multiple lists, of strings easy — and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.

Installing

pip install string-grouper

Usage

import pandas as pd
from string_grouper import match_strings, match_most_similar, \
	group_similar_strings, compute_pairwise_similarities, \
	StringGrouper

As shown above, the library may be used together with pandas, and contains four high-level functions (match_strings, match_most_similar, group_similar_strings, and compute_pairwise_similarities) that can be used directly, and one class (StringGrouper) that allows for a more interactive approach.

The permitted calling patterns of the four functions, and their return types, are:

Function                        Parameters                                                  pandas Return Type
match_strings                   (master, **kwargs)                                          DataFrame
match_strings                   (master, duplicates, **kwargs)                              DataFrame
match_strings                   (master, master_id=id_series, **kwargs)                     DataFrame
match_strings                   (master, duplicates, master_id, duplicates_id, **kwargs)    DataFrame
match_most_similar              (master, duplicates, **kwargs)                              Series (if kwarg ignore_index=True); otherwise DataFrame (default)
match_most_similar              (master, duplicates, master_id, duplicates_id, **kwargs)    DataFrame
group_similar_strings           (strings_to_group, **kwargs)                                Series (if kwarg ignore_index=True); otherwise DataFrame (default)
group_similar_strings           (strings_to_group, strings_id, **kwargs)                    DataFrame
compute_pairwise_similarities   (string_series_1, string_series_2, **kwargs)                Series

In the rest of this document the names Series and DataFrame refer to the familiar pandas object types.

Parameters:

Name                        Description
master                      A Series of strings to be matched with themselves (or with those in duplicates).
duplicates                  A Series of strings to be matched with those of master.
master_id (or id_series)    A Series of IDs corresponding to the strings in master.
duplicates_id               A Series of IDs corresponding to the strings in duplicates.
strings_to_group            A Series of strings to be grouped.
strings_id                  A Series of IDs corresponding to the strings in strings_to_group.
string_series_1(_2)         A Series of strings, each of which is to be compared with its corresponding string in string_series_2(_1).
**kwargs                    Keyword arguments (see below).

New in version 0.6.0: each of the high-level functions listed above also has a StringGrouper method counterpart of the same name and parameters. Calling such a method on an instance of StringGrouper will not rebuild the instance's underlying corpus; rather, the existing corpus is used to perform the string-comparisons. The input Series to the method (master, duplicates, and so on) are thus encoded, or transformed, into tf-idf matrices using this corpus. For example:

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)

Functions:

  • match_strings

    Returns a DataFrame containing similarity-scores of all matching pairs of highly similar strings from master (and duplicates if given). Each matching pair in the output appears in its own row/record consisting of

    1. its "left" part: a string (with/without its index-label) from master,
    2. its similarity score, and
    3. its "right" part: a string (with/without its index-label) from duplicates (or master if duplicates is not given),

    in that order. Thus the column-names of the output are a collection of three groups:

    1. The name of master and the name(s) of its index (or index-levels) all prefixed by the string 'left_',
    2. 'similarity' whose column has the similarity-scores as values, and
    3. The name of duplicates (or master if duplicates is not given) and the name(s) of its index (or index-levels) prefixed by the string 'right_'.

    Indexes (or their levels) only appear when the keyword argument ignore_index=False (the default). (See tutorials/ignore_index_and_replace_na.md for a demonstration.)

    If either master or duplicates has no name, it assumes the name 'side' which is then prefixed as described above. Similarly, if any of the indexes (or index-levels) has no name it assumes its pandas default name ('index', 'level_0', and so on) and is then prefixed as described above.

    In other words, if only parameter master is given, the function will return pairs of highly similar strings within master. This can be seen as a self-join where both 'left_' and 'right_' prefixed columns come from master. If both parameters master and duplicates are given, it will return pairs of highly similar strings between master and duplicates. This can be seen as an inner-join where 'left_' and 'right_' prefixed columns come from master and duplicates respectively.

    The function also supports optionally inputting IDs (master_id and duplicates_id) corresponding to the strings being matched. In which case, the output includes two additional columns whose names are the names of these optional Series prefixed by 'left_' and 'right_' accordingly, and containing the IDs corresponding to the strings in the output. If any of these Series has no name, then it assumes the name 'id' and is then prefixed as described above.
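
    For instance, a minimal sketch (the strings and ID values here are hypothetical):

    master = pd.Series(['foo bar', 'foo bars'], name='name')
    master_id = pd.Series(['A1', 'A2'], name='id')
    match_strings(master, master_id=master_id)
    # The output now also carries 'left_id' and 'right_id' columns holding
    # the IDs corresponding to the matched strings.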

  • match_most_similar

    If ignore_index=True, returns a Series of strings, where for each string in duplicates the most similar string in master is returned. If there are no similar strings in master for a given string in duplicates (because there is no potential match where the cosine similarity is above the threshold [default: 0.8]) then the original string in duplicates is returned. The output Series thus has the same length and index as duplicates.

    For example, if an input Series with the values ['foooo', 'bar', 'baz'] is passed as the argument master, and ['foooob', 'bar', 'new'] as the values of the argument duplicates, the function will return a Series with the values: ['foooo', 'bar', 'new'].

    The name of the output Series is the same as that of master prefixed with the string 'most_similar_'. If master has no name, it is assumed to have the name 'master' before being prefixed.

    If ignore_index=False (the default), match_most_similar returns a DataFrame containing the same Series described above as one of its columns. So it inherits the same index and length as duplicates. The rest of its columns correspond to the index (or index-levels) of master and thus contain the index-labels of the most similar strings being output as values. If there are no similar strings in master for a given string in duplicates then the value(s) assigned to this index-column(s) for that string is NaN by default. However, if the keyword argument replace_na=True, then these NaN values are replaced with the index-label(s) of that string in duplicates. Note that such replacements can only occur if the indexes of master and duplicates have the same number of levels. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)

    Each column-name of the output DataFrame has the same name as its corresponding column, index, or index-level of master prefixed with the string 'most_similar_'.

    If both parameters master_id and duplicates_id are also given, then a DataFrame is always returned with the same column(s) as described above, but with an additional column containing those IDs from these input Series corresponding to the output strings. This column's name is the same as that of master_id prefixed in the same way as described above. If master_id has no name, it is assumed to have the name 'master_id' before being prefixed.
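
    A minimal sketch of the basic call, reusing the example above:

    master = pd.Series(['foooo', 'bar', 'baz'])
    duplicates = pd.Series(['foooob', 'bar', 'new'])
    match_most_similar(master, duplicates, ignore_index=True)
    # Expected values, per the example above: ['foooo', 'bar', 'new']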

  • group_similar_strings

    Takes a single Series of strings (strings_to_group) and groups them by assigning to each string one string from strings_to_group chosen as the group-representative for each group of similar strings found. (See tutorials/group_representatives.md for details on how the group-representatives are chosen.)

    If ignore_index=True, the output is a Series (with the same name as strings_to_group prefixed by the string 'group_rep_') of the same length and index as strings_to_group containing the group-representative strings. If strings_to_group has no name then the name of the returned Series is 'group_rep'.

    For example, an input Series with values ['foooo', 'foooob', 'bar'] will return ['foooo', 'foooo', 'bar']. Here 'foooo' and 'foooob' are grouped together into group 'foooo' because they are found to be similar. Another example can be found below.

    If ignore_index=False, the output is a DataFrame containing the above output Series as one of its columns with the same name. The remaining column(s) correspond to the index (or index-levels) of strings_to_group and contain the index-labels of the group-representatives as values. These columns have the same names as their counterparts prefixed by the string 'group_rep_'.

    If strings_id is also given, then the IDs from strings_id corresponding to the group-representatives are also returned in an additional column (with the same name as strings_id prefixed as described above). If strings_id has no name, it is assumed to have the name 'id' before being prefixed.

  • compute_pairwise_similarities

    Returns a Series of cosine similarity scores with the same length and index as string_series_1. Each score is the cosine similarity between the pair of strings at the corresponding position (row) of the two input Series, string_series_1 and string_series_2. This can be seen as an element-wise comparison between the two input Series.

All functions are built using a class StringGrouper. This class can be used through pre-defined functions, for example the four high level functions above, as well as using a more interactive approach where matches can be added or removed if needed by calling the StringGrouper class directly.

Options:

  • kwargs

    All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used (a usage sketch follows this list):

    • ngram_size: The number of characters in each n-gram. Default is 3.
    • regex: The regex string used to clean up the input strings. Default is r"[,-./]|\s".
    • ignore_case: Determines whether or not letter case in strings should be ignored. Defaults to True.
    • tfidf_matrix_dtype: The datatype for the tf-idf values of the matrix components. Allowed values are numpy.float32 and numpy.float64. Default is numpy.float32. (Note: numpy.float32 often leads to faster processing and a smaller memory footprint albeit less numerical precision than numpy.float64.)
    • max_n_matches: The maximum number of matching strings in master allowed per string in duplicates. Default is the total number of strings in master.
    • min_similarity: The minimum cosine similarity for two strings to be considered a match. Defaults to 0.8.
    • number_of_processes: The number of processes used by the cosine similarity calculation. Defaults to the number of cores on the machine minus 1.
    • ignore_index: Determines whether indexes are ignored or not. If False (the default), index-columns will appear in the output, otherwise not. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
    • replace_na: For function match_most_similar, determines whether NaN values in index-columns are replaced or not by index-labels from duplicates. Defaults to False. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
    • include_zeroes: When min_similarity ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to True. (See tutorials/zero_similarity.md.) Note: If include_zeroes is True and the kwarg max_n_matches is set then it must be sufficiently high to capture all nonzero-similarity-matches, otherwise an error is raised and string_grouper suggests an alternative value for max_n_matches. To allow string_grouper to automatically use the appropriate value for max_n_matches then do not set this kwarg at all.
    • group_rep: For function group_similar_strings, determines how group-representatives are chosen. Allowed values are 'centroid' (the default) and 'first'. See tutorials/group_representatives.md for an explanation.
    • force_symmetries: In cases where duplicates is None, specifies whether corrections should be made to the results to account for symmetry, thus compensating for those losses of numerical significance which violate the symmetries. Defaults to True.
    • n_blocks: This parameter is a tuple of two ints provided to help boost performance, if possible, of processing large DataFrames (see Subsection Performance), by splitting the DataFrames into n_blocks[0] blocks for the left operand (of the underlying matrix multiplication) and into n_blocks[1] blocks for the right operand before performing the string-comparisons block-wise. Defaults to 'guess', in which case the numbers of blocks are estimated based on previous empirical results. If n_blocks = 'auto', then splitting is done automatically in the event of an OverflowError.
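
For example, a sketch of overriding a few of these defaults in one call (companies is the Series of company names used in the examples below):

# Loosen the matching threshold, use 4-grams, and drop index-columns:
matches = match_strings(
    companies['Company Name'],
    min_similarity=0.7,
    ngram_size=4,
    ignore_index=True,
)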

Examples

In this section we will cover a few use cases for which string_grouper may be used. We will use the same data set of company names as used in: Super Fast String Matching in Python.

Find all matches within a single data set

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, \
	group_similar_strings, compute_pairwise_similarities, \
	StringGrouper
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
left_index left_Company Name similarity right_Company Name right_index
15 14 0210, LLC 0.870291 90210 LLC 4211
167 165 1 800 MUTUALS ADVISOR SERIES 0.931615 1 800 MUTUALS ADVISORS SERIES 166
168 166 1 800 MUTUALS ADVISORS SERIES 0.931615 1 800 MUTUALS ADVISOR SERIES 165
172 168 1 800 RADIATOR FRANCHISE INC 1.000000 1-800-RADIATOR FRANCHISE INC. 201
178 173 1 FINANCIAL MARKETPLACE SECURITIES LLC ... 0.949364 1 FINANCIAL MARKETPLACE SECURITIES, LLC 174

Find all matches between two data sets

The match_strings function finds similar items between two data sets as well. This can be seen as an inner join between two data sets:

# Create a small set of artificial company names:
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
left_index left_Company Name similarity right_side right_index
0 12 012 SMILE.COMMUNICATIONS LTD 0.944092 012 SMILE.COMMUNICATIONS 1
1 49777 B.A.S. MEDIA GROUP 0.854383 S MEDIA GROUP 0
2 49855 B4UTRADE COM CORP 1.000000 B4UTRADE COM CORP 3
3 49856 B4UTRADE COM INC 0.810217 B4UTRADE COM CORP 3
4 49857 B4UTRADE CORP 0.878276 B4UTRADE COM CORP 3

Out of the four company names in duplicates, three are found in the original company data set. One company is matched three times.

Finding duplicates in a DataFrame (e.g. a database extract) where IDs for rows are supplied

A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling. For example, "A.B. Corporation" and "AB Corporation". Using the optional ID parameters of the match_strings function, such duplicates can be found easily. A tutorial that steps through the process with an example data set is available.

For a second data set, find only the most similar match

In the example above, it's possible that multiple matches are found for a single string. Sometimes we just want a string to match with a single most similar string. If there are no similar strings found, the original string should be returned:

# Create a small set of artificial company names:
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'],\
                          name='New Company')
# Create all matches:
matches = match_most_similar(companies['Company Name'], new_companies, ignore_index=True)
# Display the results:
pd.concat([new_companies, matches], axis=1)
New Company most_similar_Company Name
0 S MEDIA GROUP B.A.S. MEDIA GROUP
1 012 SMILE.COMMUNICATIONS 012 SMILE.COMMUNICATIONS LTD
2 foo bar foo bar
3 B4UTRADE COM CORP B4UTRADE COM CORP

Deduplicate a single data set and show items with most duplicates

The group_similar_strings function groups strings that are similar using a single linkage clustering algorithm. That is, if item A and item B are similar; and item B and item C are similar; but the similarity between A and C is below the threshold; then all three items are grouped together.
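
A purely illustrative sketch of this chaining behaviour (whether these particular strings actually chain depends on their tf-idf similarities):

# Suppose A ≈ B and B ≈ C score above the threshold but A ≈ C does not;
# single linkage still places A, B and C in the same group:
chain = pd.Series(['ABC Enterprises', 'ABC Enterprises Inc', 'ABC Enterprises Incorporated'])
group_similar_strings(chain, ignore_index=True)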

# Add the grouped strings:
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'],
                                                       ignore_index=True)
# Show items with most duplicates:
companies.groupby('deduplicated_name')['Line Number'].count().sort_values(ascending=False).head(10)
deduplicated_name
ADVISORS DISCIPLINED TRUST                                      1824
AGL LIFE ASSURANCE CO SEPARATE ACCOUNT                           183
ANGELLIST-ART-FUND, A SERIES OF ANGELLIST-FG-FUNDS, LLC          116
AMERICREDIT AUTOMOBILE RECEIVABLES TRUST 2001-1                   87
ACE SECURITIES CORP. HOME EQUITY LOAN TRUST, SERIES 2006-HE2      57
ASSET-BACKED PASS-THROUGH CERTIFICATES SERIES 2004-W1             40
ALLSTATE LIFE GLOBAL FUNDING TRUST 2005-3                         39
ALLY AUTO RECEIVABLES TRUST 2014-1                                33
ANDERSON ROBERT E /                                               28
ADVENT INTERNATIONAL GPE VIII LIMITED PARTNERSHIP                 28
Name: Line Number, dtype: int64

The group_similar_strings function also works with IDs: imagine a DataFrame (customers_df) with the following content:

# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
   [
      ('BB016741P', 'Mega Enterprises Corporation'),
      ('CC082744L', 'Hyper Startup Incorporated'),
      ('AA098762D', 'Hyper Startup Inc.'),
      ('BB099931J', 'Hyper-Startup Inc.'),
      ('HH072982K', 'Hyper Hyper Inc.')
   ],
   columns=('Customer ID', 'Customer Name')
).set_index('Customer ID')
# Display the data:
customers_df
Customer Name
Customer ID
BB016741P Mega Enterprises Corporation
CC082744L Hyper Startup Incorporated
AA098762D Hyper Startup Inc.
BB099931J Hyper-Startup Inc.
HH072982K Hyper Hyper Inc.

The output of group_similar_strings can be directly used as a mapping table:

# Group customers with similar names:
customers_df[["group-id", "name_deduped"]]  = \
    group_similar_strings(customers_df["Customer Name"])
# Display the mapping table:
customers_df
Customer Name group-id name_deduped
Customer ID
BB016741P Mega Enterprises Corporation BB016741P Mega Enterprises Corporation
CC082744L Hyper Startup Incorporated CC082744L Hyper Startup Incorporated
AA098762D Hyper Startup Inc. AA098762D Hyper Startup Inc.
BB099931J Hyper-Startup Inc. AA098762D Hyper Startup Inc.
HH072982K Hyper Hyper Inc. HH072982K Hyper Hyper Inc.

Note that customers_df initially had only one column, "Customer Name" (before the group_similar_strings function call), and acquired two more columns, "group-id" (the index-column) and "name_deduped", after the call through a "setting with enlargement" (a pandas feature).

Simply compute the cosine similarities of pairs of strings

Sometimes we have pairs of strings that have already been matched but whose similarity scores need to be computed. For this purpose we provide the function compute_pairwise_similarities:

# Create a small DataFrame of pairs of strings:
pair_s = pd.DataFrame(
    [
        ('Mega Enterprises Corporation', 'Mega Enterprises Corporation'),
        ('Hyper Startup Inc.', 'Hyper Startup Incorporated'),
        ('Hyper Startup Inc.', 'Hyper Startup Inc.'),
        ('Hyper Startup Inc.', 'Hyper-Startup Inc.'),
        ('Hyper Hyper Inc.', 'Hyper Hyper Inc.'),
        ('Mega Enterprises Corporation', 'Mega Enterprises Corp.')
   ],
   columns=('left', 'right')
)
# Display the data:
pair_s
left right
0 Mega Enterprises Corporation Mega Enterprises Corporation
1 Hyper Startup Inc. Hyper Startup Incorporated
2 Hyper Startup Inc. Hyper Startup Inc.
3 Hyper Startup Inc. Hyper-Startup Inc.
4 Hyper Hyper Inc. Hyper Hyper Inc.
5 Mega Enterprises Corporation Mega Enterprises Corp.
# Compute their cosine similarities and display them:
pair_s['similarity'] = compute_pairwise_similarities(pair_s['left'], pair_s['right'])
pair_s
left right similarity
0 Mega Enterprises Corporation Mega Enterprises Corporation 1.000000
1 Hyper Startup Inc. Hyper Startup Incorporated 0.633620
2 Hyper Startup Inc. Hyper Startup Inc. 1.000000
3 Hyper Startup Inc. Hyper-Startup Inc. 1.000000
4 Hyper Hyper Inc. Hyper Hyper Inc. 1.000000
5 Mega Enterprises Corporation Mega Enterprises Corp. 0.826463

The StringGrouper class

The four functions mentioned above all create a StringGrouper object behind the scenes and call different functions on it. The StringGrouper class keeps track of all tuples of similar strings and creates the groups out of these. Since matches are often not perfect, a common workflow is to:

  1. Create matches
  2. Manually inspect the results
  3. Add and remove matches where necessary
  4. Create groups of similar strings

The StringGrouper class allows for this without having to re-calculate the cosine similarity matrix. See below for an example.

company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
companies = pd.read_csv(company_names)
  1. Create matches
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'], ignore_index=True)
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()

Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. StringGrouper will not match these since they are not similar enough.

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
companies[companies.deduplicated_name.str.contains('PWC')]
Line Number Company Name Company CIK Key deduplicated_name
485535 485536 PWC CAPITAL INC. 1690640 PWC CAPITAL INC.
485536 485537 PWC HOLDING CORP 1456450 PWC HOLDING CORP
485537 485538 PWC INVESTORS, LLC 1480311 PWC INVESTORS, LLC
485538 485539 PWC REAL ESTATE VALUE FUND I LLC 1668928 PWC REAL ESTATE VALUE FUND I LLC
485539 485540 PWC SECURITIES CORP /BD 1023989 PWC SECURITIES CORP /BD
485540 485541 PWC SECURITIES CORPORATION 1023989 PWC SECURITIES CORPORATION
485541 485542 PWCC LTD 1172241 PWCC LTD
485542 485543 PWCG BROKERAGE, INC. 67301 PWCG BROKERAGE, INC.

We can add these matches with the add_match function:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now let's check again:

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA

This can also be used to merge two groups:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA
662585 662586 ZUCKER MICHAEL 1629018 PRICEWATERHOUSECOOPERS LLP /TA
662604 662605 ZUCKERMAN MICHAEL 1303321 PRICEWATERHOUSECOOPERS LLP /TA
662605 662606 ZUCKERMAN MICHAEL 1496366 PRICEWATERHOUSECOOPERS LLP /TA

We can remove matches in the same way with the remove_match function:

string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA

Performance

Semilogx plots of run-times of match_strings() vs the number of blocks (n_blocks[1]) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number n_blocks[0] of blocks into which the left matrix-operand was split.

[Semilogx plot image]

String comparison, as implemented by string_grouper, is essentially matrix multiplication. A pandas Series of strings is converted (tokenized) into a matrix. Then that matrix is multiplied by the transpose of itself (or of another such matrix).
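
A minimal sketch of this idea using scikit-learn (an illustration of the principle only, not string_grouper's internal code):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

strings = pd.Series(['foo bar', 'foo bars', 'baz'])
# Tokenize each string into character trigrams and build a tf-idf matrix:
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
D = vectorizer.fit_transform(strings)  # one L2-normalized row per string
# Because the rows are L2-normalized, D times its own transpose yields
# the matrix of pairwise cosine similarities:
similarities = D @ D.T
print(similarities.toarray())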

Here is an illustration of the multiplication of two matrices D and Mᵀ:

[block matrix illustration]

It turns out that when the matrix (or Series) is very large, the computer proceeds quite slowly with the multiplication (apparently due to the RAM being too full). Some computers give up with an OverflowError.

To circumvent this issue, string_grouper now allows the division of the Series into smaller chunks (or blocks) and multiplies the chunks one pair at a time instead to get the same result:

[block-wise multiplication illustration]

But, surprisingly, the run-time of the process is sometimes drastically reduced as a result. For example, the speed-up of the following call is about 500% (here, the Series is divided into 200 blocks on the right operand, that is, 1 block on the left × 200 on the right) compared to the same call with no splitting [n_blocks=(1, 1), the default, which is what previous versions (0.5.0 and earlier) of string_grouper did]:

# A DataFrame of 668 000 records:
companies = pd.read_csv('data/sec__edgar_company_info.csv')

# The following call is more than 6 times faster than earlier versions of 
# match_strings() (that is, when n_blocks=(1, 1))!
match_strings(companies['Company Name'], n_blocks=(1, 200))

Further exploration of the block number space (see plot above) has revealed that for any fixed number of right blocks, the run-time gets longer the larger the number of left blocks specified. For this reason, it is recommended not to split the left matrix.

[block matrix illustration]

In general,

   total runtime = n_blocks[0] × n_blocks[1] × (mean runtime per block-pair)

                 = Left Operand Size × Right Operand Size ×
                   (mean runtime per block-pair) / (Left Block Size × Right Block Size)

since n_blocks[0] = Left Operand Size / Left Block Size and n_blocks[1] = Right Operand Size / Right Block Size. So for given left and right operands, minimizing the total runtime is the same as minimizing the runtime per string-pair comparison:

   runtime per string-pair comparison = (mean runtime per block-pair) / (Left Block Size × Right Block Size)

Below is a log-log-log contour plot of the runtime per string-pair comparison scaled by its value at Left Block Size = Right Block Size = 5000. Here, Block Size is the number of strings in that block, and mean runtime per block-pair is the time taken for the following call to run:

# note the parameter order!
match_strings(right_Series, left_Series, n_blocks=(1, 1))

where left_Series and right_Series, corresponding to Left Block and Right Block respectively, are random subsets of the Series companies['Company Name'] from the sec__edgar_company_info.csv sample data file.

[contour plot image]

It can be seen that when right_Series is roughly 80 000 strings long (denoted by the white dashed line in the contour plot above), the runtime per string-pair comparison is at its lowest for any fixed left_Series size. Above Right Block Size = 80 000, the matrix-multiplication routine begins to feel the limits of the computer's available memory space and thus its performance deteriorates, as evidenced by the increase in runtime per string-pair comparison there (above the white dashed line). This knowledge could serve as a guide for estimating the optimum block numbers — namely, those that divide the Series into blocks of size roughly equal to 80 000 for the right operand (or right_Series).
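
This suggests a rough heuristic for choosing n_blocks yourself (a sketch only; it assumes the ~80 000 sweet spot observed here carries over to your machine):

import math
# Split only the right operand, into blocks of roughly 80 000 strings each:
n_right = max(1, math.ceil(len(companies['Company Name']) / 80_000))
matches = match_strings(companies['Company Name'], n_blocks=(1, n_right))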

So what are the optimum block number values for any given Series? That is anyone's guess and likely depends on the data itself. Furthermore, as hinted above, the answer may vary from computer to computer.

We however encourage the user to make judicious use of the n_blocks parameter to boost performance of string_grouper whenever possible.

string_grouper's People

Contributors

bergvca, flindeberg, justasojourner, particularminer, stevenmaude, taimursajid


string_grouper's Issues

able to change default cosine similarity of .8?

I noticed that the default cosine similarity is .8, but it seems like it's not bringing back matches that differ by just one character in some cases, and I was wondering whether I could pass the cosine similarity as a parameter? I looked at your examples and it seemed like nothing did that. Or would I just have to import the class and change the default value myself?

pip install string-grouper doesn't work

Hi guys,

I tried to install string-grouper using anaconda prompt and got the following error. What could be the issue? Thanks!

ERROR: Could not find a version that satisfies the requirement string-grouper (from versions: none)
ERROR: No matching distribution found for string-grouper

Issue importing string_grouper

I want to use the match_strings function in the string_grouper library to find similar strings in two separate lists. When I attempt to import the necessary functions from the library, it throws a ValueError (shown below).


ValueError Traceback (most recent call last)
in
----> 1 from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper

~/anaconda3/lib/python3.7/site-packages/string_grouper/__init__.py in
----> 1 from .string_grouper import group_similar_strings, match_most_similar, match_strings, StringGrouperConfig, StringGrouper

~/anaconda3/lib/python3.7/site-packages/string_grouper/string_grouper.py in
6 from scipy.sparse.csr import csr_matrix
7 from typing import Tuple, NamedTuple, List, Optional
----> 8 from sparse_dot_topn import awesome_cossim_topn
9 from functools import wraps
10

~/anaconda3/lib/python3.7/site-packages/sparse_dot_topn/__init__.py in
3
4 if sys.version_info[0] >= 3:
----> 5 from sparse_dot_topn.awesome_cossim_topn import awesome_cossim_topn
6 else:
7 from awesome_cossim_topn import awesome_cossim_topn

~/anaconda3/lib/python3.7/site-packages/sparse_dot_topn/awesome_cossim_topn.py in
5
6 if sys.version_info[0] >= 3:
----> 7 from sparse_dot_topn import sparse_dot_topn as ct
8 from sparse_dot_topn import sparse_dot_topn_threaded as ct_thread
9 else:

__init__.pxd in init sparse_dot_topn.sparse_dot_topn()

ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject


Does anyone know why this is occurring? It looks to be having issues loading the sparse_dot_topn library. Thanks!

Return complete similarity matrix with get_matches() - including elements with 0 similarity

Is it possible to return the full similarity matrix when getting matches from the string grouper class?
Example:
string_grouper = StringGrouper(master=master, duplicates=duplicates[:1], master_id=master_ID, duplicates_id=duplicates_ID[:1], min_similarity=0.0, max_n_matches=10000, regex="[,-./#]").fit()
matches_df = string_grouper.get_matches()
matches_df would ideally contain a dataframe with the same number of rows as master, i.e. a complete similarity comparison of the one duplicate to all the master examples. But it seems to do a cutoff at some point (0) due to low similarity, and I can't change that no matter how low (negative) I set the min_similarity. Is there a way to allow the 0 similarities to be returned as well? I can pad them later, but it would be convenient.
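
Note: the include_zeroes option documented above targets this case. A hedged sketch, leaving max_n_matches unset so that string_grouper can size it automatically:

string_grouper = StringGrouper(
    master=master, duplicates=duplicates[:1],
    master_id=master_ID, duplicates_id=duplicates_ID[:1],
    min_similarity=0.0, include_zeroes=True,  # keep zero-similarity pairs
).fit()
matches_df = string_grouper.get_matches()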

match_strings on small data series

Hi,
I was just curious about what happens when I run this piece of code.
I came across this when I split my data into smaller chunks.

Code:
import pandas as pd
from string_grouper import match_strings

accounts = pd.DataFrame()
accounts['name'] = ['Jim Beam','Jim Boom','Jack Daniels','John Dummel','Bob Bubble','Seth Suckerman']

matches = match_strings(accounts['name'])
print(matches)

Output:
left_index left_name similarity right_name right_index
0 0 Jim Beam 1.0 Jim Beam 0
1 1 Jim Boom 1.0 Jim Boom 1
2 2 Jack Daniels 1.0 Jack Daniels 2
3 3 John Dummel 1.0 John Dummel 3
4 4 Bob Bubble 1.0 Bob Bubble 4
5 5 Seth Suckerman 1.0 Seth Suckerman 5

Am I doing something wrong here?
I hope this is not too dumb of a question, I am new to py and pandas.

Thank you for looking into this.

Tips for working with large datasets

Hi, I'm working with a 200MB file and using the command group_similar_strings; however, this is taking so long that it never completes (it has been running for several days). I've tried several n-gram sizes with no luck. Do you have any tips for running on large datasets?

Case is not ignored

It is problematic that case is not ignored, since case is often unimportant and just a consequence of the data used rather than containing any useful information in itself.

Formula for optimal matrix block-size

Hi @Bergvca

I think that the optimal matrix block-size, or the maximum number of strings N_max for the master Series (beyond which cache-misses begin to dominate the computation and thus lead to computational slowdown), would be directly proportional to the CPU cache-size M_CPU and inversely proportional to the density ρ_right of the right operand-matrix encoding the strings in master. That is,

   N_max ∝ M_CPU / ρ_right .

Since for my computer, N_max = 8 × 10⁴, M_CPU = 6 MB, and ρ_right is a number I don't know yet but can easily find during runtime (that is, the number of nonzero matrix-elements divided by the total number of matrix-elements), we can then determine the constant of proportionality and use it to find N_max for any other computer whose CPU cache-size is known or can be queried (using the python package psutil, for example).
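
For reference, the density of a scipy sparse matrix can be computed at runtime like this (a sketch, assuming a CSR matrix tfidf_matrix):

# Density = number of nonzero elements / total number of elements:
rho_right = tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])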

What do you think?

On Group Representatives

Hi @Bergvca,

@justasojourner and I have been discussing ways to enable String Grouper to choose other candidates to represent the groups it finds.

At the moment String Grouper’s choices are arbitrary — they likely depend on the sorting order of the input string data in the call to function group_similar_strings.

One prudent alternative we could think of was this: choose the earliest record in the group assuming of course that timestamp metadata is available. This suggestion is based on the premise that unnecessary duplicate records are often created after the original record.

As the code stands now, it may be possible (I’m not yet sure; I have to verify this) to achieve the above result by simply sorting the input Series by timestamp in ascending order before applying group_similar_strings on them.

But perhaps there’s a way to re-code group_similar_strings to accept any given specification for the group representatives.

Your thoughts on this are welcome!

Getting an error message while running match_strings

Hi there,

I would like to run match_strings on addresses in a df with 88510 rows and 3 columns.
All I get is
OverflowError: value too large to convert to int.

Is there a quick fix?

Thank you very much!

Including an ID column in String Grouper output

Firstly, thank you for developing the String Grouper function. It's fast!

I have a question regarding the functionality of String Grouper, hopefully it will be possible to do this.

Requirement

A very common scenario is to have a column (series) of values like, for example, the name of a company together with a column with IDs for each data row.

I have seen that String Group does (well and very fast) matching of a column of string data to find closely matched values.

What we would like to be able to achieve is to have the IDs related to the string value in a row considered when doing the string matches so they are returned in the String Grouper output Pandas dataframe.

Example

Assume a data file, 'accounts.csv', with the following data:

account_id,name
A000123,ABC Enterprises Inc.
B000987,XYZ Co.
C000456,ABC Enterprises Incorporated
D000345,XYX Company
E000678,XYX Co
F000345,ABC Enterprises
G000876,The Big Group

If a typical import is done to a Pandas dataframe we will get:

accounts = pd.read_csv('accounts.csv')
accounts
account_id name
0 A000123 ABC Enterprises Inc.
1 B000987 XYZ Co.
2 C000456 ABC Enterprises Incorporated
3 D000345 XYX Company
4 E000678 XYX Co
5 F000345 ABC Enterprises
6 G000876 The Big Group

If the account_id is made the index column we will get:

accounts = pd.read_csv('accounts.csv', index_col='account_id')
accounts
account_id name
A000123 ABC Enterprises Inc.
B000987 XYZ Co.
C000456 ABC Enterprises Incorporated
D000345 XYX Company
E000678 XYX Co
F000345 ABC Enterprises
G000876 The Big Group

This, we believe, would be the better option as the index value is fixed. I note also that Pandas reports this as a DataFrame even though it has only one value column.

String Grouper function

Currently I use the first import method accounts = pd.read_csv('accounts.csv') and then the following code

matches = match_strings(accounts['name'], min_similarity = 0.79)
dupes = matches[matches.left_side != matches.right_side]
dupes

which returns something like this

left_side right_side similarity
0 ABC Enterprises Inc. ABC Enterprises Incorporated 0.912345
3 ABC Enterprises ABC Enterprises Inc. 0.922345
8 ABC Enterprises Incorporated ABC Enterprises 0.892345

etc.

The Pandas index column value is dynamically assigned.

Question

Would it be possible to have the match_strings function (or an addition function) return the IDs of the row the string value is in, so:

id left_side id right_side similarity
0 A000123 ABC Enterprises Inc. C000456 ABC Enterprises Incorporated 0.912345
3 F000345 ABC Enterprises A000123 ABC Enterprises Inc. 0.922345
8 C000456 ABC Enterprises Incorporated F000345 ABC Enterprises 0.892345

or even dispense with the text values and just return the ID, because the ID in the resulting String Grouper dataframe can always be linked to the ID of the original dataframe to retrieve the text value in the relevant row, so:

left_side right_side similarity
0 A000123 C000456 0.912345
3 F000345 A000123 0.922345
8 C000456 F000345 0.892345

To do this would mean the ID values would have to be 'included' in the processing. That is, a DataFrame, not a Series, would be passed to String Grouper, even though the IDs would not be evaluated. I see that the current function uses a pandas Series. I don't have enough knowledge/experience of the code to know whether it is possible to pass a DataFrame and 'take along' the ID matching the string of each row. Would that be possible?

Current work around

I use Pandas merge to join the two dataframes:

result = pd.merge(dupes, accounts, left_on="left_side", right_on="name")
result

to give me the following columns:

left_side | right_side | similarity | account_id | name

from which I then make a series with unique account_id:

account_id_list = pd.Series(result.account_id.unique())

Could not install string-grouper

when I enter the command pip install string_grouper I get the following errors

LINK : fatal error LNK1158: cannot run 'rc.exe'
ERROR: Failed building wheel for topn
ERROR: Could not build wheels for sparse-dot-topn-for-blocks which use PEP 517 and cannot be installed directly

Group Connectivity Visualization may reveal other possible representatives

Hi @Bergvca , @justasojourner

I'm no expert in graph visualization, but take a look at these graph drawings of two of string-grouper's groups of the 1st 50 000 records of the sec__edgar_company_info.csv file:

[group0 and group1 graph images]

Images were rendered using Gephi 0.9.2.

It may be that one other group representative to be considered is the string with the highest number of matches (in graph theory this is the node of highest degree). What do you think?

Question / suggestion to use multiple n-grams to get more features

Hi @Bergvca and @ParticularMiner,

Hope you are doing good.

I got to work on the same project again and have a question / suggestion - would it be possible to use multiple n-grams to get more features? Like currently we have the following - ngram_size: The amount of characters in each n-gram. Default is 3.

What if we get n-grams in a list like [2,3,4] and get more vector components - ngrams=2 plus ngrams=3 and ngrams=4?
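
For comparison, scikit-learn's TfidfVectorizer already supports pooling a range of n-gram sizes into one feature space; a sketch of the idea (not string_grouper's current API):

from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams of sizes 2, 3 and 4 combined as features:
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
tfidf = vectorizer.fit_transform(['foo bar', 'foo bars'])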

What do you think?

By the way, the string_grouper approach is really good in terms of speed and efficiency. Great work!

Thank you,
iibarant

Error When matching Chinese name

Hi, I try to match the Chinese firm name and get errors

File "C:/Users/acemec/Documents/firm_data/name_match.py", line 14, in
matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)
File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 108, in match_most_similar
string_grouper = StringGrouper(master,
File "C:\Users\acemec\anaconda3\lib\site-packages\string_grouper\string_grouper.py", line 218, in init
raise TypeError('Input does not consist of pandas.Series containing only Strings')
TypeError: Input does not consist of pandas.Series containing only Strings

Here is my code:

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, compute_pairwise_similarities, StringGrouper
import dask.dataframe as dd
company_names = 'C:/Users/acemec/Documents/firm_data/company_annual.csv'
companies = dd.read_csv(company_names, on_bad_lines='skip',dtype=str,low_memory=False)

new_companies_name = 'C:/Users/acemec/Documents/firm_data/Pat_firm_list.csv'
new_companies = dd.read_csv(new_companies_name, on_bad_lines='skip',dtype=str,low_memory=False)

matches = match_most_similar(companies['company_name'], new_companies['assignee'], ignore_index=True)

match_result = pd.concat([new_companies, matches], axis=1)

df = pd.DataFrame(match_result)
df.to_csv('C:/Users/acemec/Documents/firm_data/file_name.csv', encoding='utf-8')

Could you give me some suggestions?

Need help

Instructions are purely Windows only.

Jupyter Notebook installation not working

I cannot get string_grouper to work in jupyter notebook as it is not building the sparse_dot_topn module, wondering if anyone could help?

Collecting string-grouper
Using cached string_grouper-0.5.0-py3-none-any.whl (20 kB)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.9/site-packages (from string-grouper) (0.24.2)
Collecting sparse-dot-topn>=0.3.1
Using cached sparse_dot_topn-0.3.1.tar.gz (17 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing wheel metadata ... done
Requirement already satisfied: pandas>=0.25.3 in /opt/conda/lib/python3.9/site-packages (from string-grouper) (1.3.2)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from string-grouper) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/lib/python3.9/site-packages (from string-grouper) (1.7.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.9/site-packages (from pandas>=0.25.3->string-grouper) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.9/site-packages (from pandas>=0.25.3->string-grouper) (2021.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.25.3->string-grouper) (1.16.0)
Requirement already satisfied: cython>=0.29.15 in /opt/conda/lib/python3.9/site-packages (from sparse-dot-topn>=0.3.1->string-grouper) (0.29.24)
Requirement already satisfied: setuptools>=42 in /opt/conda/lib/python3.9/site-packages (from sparse-dot-topn>=0.3.1->string-grouper) (57.4.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->string-grouper) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->string-grouper) (2.2.0)
Building wheels for collected packages: sparse-dot-topn
Building wheel for sparse-dot-topn (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python3.9 /opt/conda/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpi5pog1_w
cwd: /tmp/pip-install-z5tecldu/sparse-dot-topn_829b4cf792fa4587a955c95bc64c1bbb
Complete output (38 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/awesome_cossim_topn.py -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/__init__.py -> build/lib.linux-x86_64-3.9/sparse_dot_topn
running egg_info
writing sparse_dot_topn.egg-info/PKG-INFO
writing dependency_links to sparse_dot_topn.egg-info/dependency_links.txt
writing requirements to sparse_dot_topn.egg-info/requires.txt
writing top-level names to sparse_dot_topn.egg-info/top_level.txt
reading manifest file 'sparse_dot_topn.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'sparse_dot_topn.egg-info/SOURCES.txt'
copying sparse_dot_topn/array_wrappers.pxd -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/array_wrappers.pyx -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/sparse_dot_topn.pyx -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/sparse_dot_topn_parallel.h -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/sparse_dot_topn_source.h -> build/lib.linux-x86_64-3.9/sparse_dot_topn
copying sparse_dot_topn/sparse_dot_topn_threaded.pyx -> build/lib.linux-x86_64-3.9/sparse_dot_topn
running build_ext
cythoning ./sparse_dot_topn/array_wrappers.pyx to ./sparse_dot_topn/array_wrappers.cpp
/tmp/pip-build-env-93__ikic/normal/lib/python3.9/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-z5tecldu/sparse-dot-topn_829b4cf792fa4587a955c95bc64c1bbb/sparse_dot_topn/array_wrappers.pxd
tree = Parsing.p_module(s, pxd, full_module_name)
cythoning ./sparse_dot_topn/sparse_dot_topn.pyx to ./sparse_dot_topn/sparse_dot_topn.cpp
/tmp/pip-build-env-93__ikic/normal/lib/python3.9/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-z5tecldu/sparse-dot-topn_829b4cf792fa4587a955c95bc64c1bbb/sparse_dot_topn/sparse_dot_topn.pyx
tree = Parsing.p_module(s, pxd, full_module_name)
cythoning ./sparse_dot_topn/sparse_dot_topn_threaded.pyx to ./sparse_dot_topn/sparse_dot_topn_threaded.cpp
/tmp/pip-build-env-93__ikic/normal/lib/python3.9/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-z5tecldu/sparse-dot-topn_829b4cf792fa4587a955c95bc64c1bbb/sparse_dot_topn/sparse_dot_topn_threaded.pyx
tree = Parsing.p_module(s, pxd, full_module_name)
building 'sparse_dot_topn.array_wrappers' extension
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/sparse_dot_topn
gcc -pthread -B /opt/conda/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -I/opt/conda/include/python3.9 -I/tmp/pip-build-env-93__ikic/normal/lib/python3.9/site-packages/numpy/core/include -c ./sparse_dot_topn/array_wrappers.cpp -o build/temp.linux-x86_64-3.9/./sparse_dot_topn/array_wrappers.o -std=c++0x -pthread -O3
error: command 'gcc' failed: No such file or directory

ERROR: Failed building wheel for sparse-dot-topn
Failed to build sparse-dot-topn
ERROR: Could not build wheels for sparse-dot-topn which use PEP 517 and cannot be installed directly

Is it possible to swap the tfidfvectorizer with different vectorizer?

I absolutely love this library! Lightning-fast!

This is not really an issue, but a discussion.

I'm just wondering if it makes sense to be able to swap the tfidfvectorizer with other context-based embedding models/vectorizers, e.g., fasttext, word2vec. That is, compute the cosine similarity for the matrices from custom embedding models other than just using tf-idf matrix.

using string_grouper with a lookup column in orginal source data

Sorry I keep bugging you about your library, but thanks for your quick responses. I have a table that I'm searching through and a list of key words that I'm searching against. I want to include some sort of identifier from the table data in the resultant dataframe so that I can take that primary key and get more information from the table if I get any hits. Right now I just have the key word match, and since it's not a primary key, joining that back to the original table may take a minute and result in many more records than I need. So is it possible to have the left matched word, similarity score, right matched word, and then the left word's primary key as a column? Or would I have to join using the search words I'm matching against? It may be easier to just include all the columns from the left table that I care about to ensure I don't have to go back to the source, but I'm not sure if that will slow things down.

Update documentation after match_strings code change for ID

Hi @Bergvca — just wondering if you wanted me to do the task of updating the documentation after the addition of ID functionality to the match_strings function.

Happy to write it and submit as pull request when ready.

I was thinking of the necessary base updating of the calling parameters etc. of course and then as well (in a separate markdown document file) a mini tutorial explaining what sort of scenario match_strings with ID would be used in and how to use it.

The idea for a separate document file is so the README.md can stay succinct.

OK with you?

Duplicate (but swapped) right and left

Data

world-universities.txt

id,country,name,url
1,AD,University of Andorra,http://www.uda.ad/
2,AE,Abu Dhabi University,http://www.adu.ac.ae/
3,AE,Ajman University of Science & Technology,http://www.ajman.ac.ae/
4,AE,Alain University of Science and Technology,http://www.alainuniversity.ac.ae/
5,AE,Al Ghurair University,http://www.agu.ae/

Code

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper

data = pd.read_csv('world-universities.txt')
matches = match_strings(data['name'])
matches[matches.left_side != matches.right_side].head()

Output

                             left_side                          right_side  similarity
11      American University of Sharjah               University of Sharjah    0.800178
36               University of Sharjah      American University of Sharjah    0.800178
43  Aria Institute of Higher Education  Rana Institute of Higher Education    0.844736
76  Rana Institute of Higher Education  Aria Institute of Higher Education    0.844736
85                     Academy of Arts            National Academy of Arts    0.800615

Issue

Rows 1 and 2 are the same pair, but swapped, and the same goes for rows 3 and 4. How can these duplicates be avoided?
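
A sketch of one common pandas idiom that keeps just one of each swapped pair (assuming the left_side/right_side columns shown above):

# Keep only one lexicographic ordering of each matched pair:
deduped = matches[matches['left_side'] < matches['right_side']]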

Also: retaining ID

The records have an ID column, how to retain that in the final output?

Thank you!

does this work on python3.7?

I tried pip install, but it gave an error: "error: Microsoft Visual C++ 14.0 is required."

I am running C++ 19.

Appreciate your help.

thanks,

time complexity & accuracy

I just tested the algorithm for grouping records consisting of company names. In my case it works fine with only 100 records when using ngram=2. But when I gave it a Kaggle data set consisting of 660,000 records, it kept running for more than 65 minutes. Is this time-complexity problem common for everyone? Is there any way to make it faster?
Another problem I faced is accuracy for some similar records (for example, 11447 SECND STREET LLC and 11447 SECOND STREET LLC did not end up in the same group, while some records that use numbers to differentiate companies were combined into the same group). Is there any way to make this better?
Thanks

Index mismatch may lead to unintended consequences!

Issue: Unequal pandas Indexes of Input and Output data can cause unintended consequences

The following describes the problem, and how it can be fixed.

import pandas as pd
from string_grouper import group_similar_strings

Let's import some data:

companies = pd.read_csv('data/sec__edgar_company_info.csv')[0:50000]
companies
       Line Number                         Company Name  Company CIK Key
0                1                               !J INC          1438823
1                2        #1 A LIFESAFER HOLDINGS, INC.          1509607
2                3   #1 ARIZONA DISCOUNT PROPERTIES LLC          1457512
3                4                    #1 PAINTBALL CORP          1433777
4                5                                $ LLC          1427189
...            ...                                  ...              ...
49995        49996                       BABB DOUGLAS J          1190359
49996        49997                         BABB HENRY C          1193948
49997        49998               BABB INTERNATIONAL INC          1139504
49998        49999                          BABB JACK J          1280368
49999        50000                    BABB JAMES G. III          1575424

50000 rows × 3 columns

Let's give the data a different (unique-valued) index as most users might do:

companies.set_index(['Line Number', 'Company CIK Key'], inplace=True, verify_integrity=True)

Display the data to see the change:

companies
                                                Company Name
Line Number Company CIK Key
1           1438823                                   !J INC
2           1509607            #1 A LIFESAFER HOLDINGS, INC.
3           1457512       #1 ARIZONA DISCOUNT PROPERTIES LLC
4           1433777                        #1 PAINTBALL CORP
5           1427189                                    $ LLC
...         ...                                          ...
49996       1190359                           BABB DOUGLAS J
49997       1193948                             BABB HENRY C
49998       1139504                   BABB INTERNATIONAL INC
49999       1280368                              BABB JACK J
50000       1575424                        BABB JAMES G. III

50000 rows × 1 columns

Now let's do some grouping as usual:

grouped_data = group_similar_strings(companies['Company Name'])

Notice that the (default) index of the output (grouped_data) is different from that of the input DataFrame (companies): for instance, it starts from 0 instead of 1:

grouped_data
0                                    !J INC
1             #1 A LIFESAFER HOLDINGS, INC.
2        #1 ARIZONA DISCOUNT PROPERTIES LLC
3                         #1 PAINTBALL CORP
4                                     $ LLC
                        ...                
49995                        BABB DOUGLAS J
49996                          BABB HENRY C
49997                BABB INTERNATIONAL INC
49998                           BABB JACK J
49999                     BABB JAMES G. III
Length: 50000, dtype: object

This causes unintended consequences when, for example, blindly assigning it to a new column of the input DataFrame:

companies['Group Name'] = grouped_data

Notice the incorrect group names (NaN values) in each row of the companies DataFrame:

companies
                                                Company Name Group Name
Line Number Company CIK Key
1           1438823                                   !J INC        NaN
2           1509607            #1 A LIFESAFER HOLDINGS, INC.        NaN
3           1457512       #1 ARIZONA DISCOUNT PROPERTIES LLC        NaN
4           1433777                        #1 PAINTBALL CORP        NaN
5           1427189                                    $ LLC        NaN
...         ...                                          ...        ...
49996       1190359                           BABB DOUGLAS J        NaN
49997       1193948                             BABB HENRY C        NaN
49998       1139504                   BABB INTERNATIONAL INC        NaN
49999       1280368                              BABB JACK J        NaN
50000       1575424                        BABB JAMES G. III        NaN

50000 rows × 2 columns

Why did this happen? — Because in many pandas operations, including column/row value assignment, the indexes of the values being assigned must match those of the values being replaced.

How to fix this? — Simply ensure that the index of the output of our functions such as group_similar_strings() and match_most_similar() is exactly the same as that of the input strings.

Let's illustrate this here:

grouped_data.index = companies.index

See how the output Series grouped_data changed:

grouped_data
Line Number  Company CIK Key
1            1438823                                        !J INC
2            1509607                 #1 A LIFESAFER HOLDINGS, INC.
3            1457512            #1 ARIZONA DISCOUNT PROPERTIES LLC
4            1433777                             #1 PAINTBALL CORP
5            1427189                                         $ LLC
                                               ...                
49996        1190359                                BABB DOUGLAS J
49997        1193948                                  BABB HENRY C
49998        1139504                        BABB INTERNATIONAL INC
49999        1280368                                   BABB JACK J
50000        1575424                             BABB JAMES G. III
Length: 50000, dtype: object

Now let's perform the column assignment:

companies['Group Name'] = grouped_data

Now everything is as it should be:

companies
                                                Company Name                           Group Name
Line Number Company CIK Key
1           1438823                                   !J INC                               !J INC
2           1509607            #1 A LIFESAFER HOLDINGS, INC.        #1 A LIFESAFER HOLDINGS, INC.
3           1457512       #1 ARIZONA DISCOUNT PROPERTIES LLC   #1 ARIZONA DISCOUNT PROPERTIES LLC
4           1433777                        #1 PAINTBALL CORP                    #1 PAINTBALL CORP
5           1427189                                    $ LLC                                $ LLC
...         ...                                          ...                                  ...
49996       1190359                           BABB DOUGLAS J                       BABB DOUGLAS J
49997       1193948                             BABB HENRY C                         BABB HENRY C
49998       1139504                   BABB INTERNATIONAL INC               BABB INTERNATIONAL INC
49999       1280368                              BABB JACK J                          BABB JACK J
50000       1575424                        BABB JAMES G. III                    BABB JAMES G. III

50000 rows × 2 columns

Thus, if we do not wish users to go through this hassle themselves, it may mean capturing the index of the input Series/DataFrame in the StringGrouper instance and stashing it until the grouping is done, then reassigning it to the index of the output Series/DataFrame just before returning.

What do you think?
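Until something like this lands in the library, a minimal user-side wrapper (just a sketch, not an official API) restores the input index on the output:

import pandas as pd
from string_grouper import group_similar_strings

def group_with_original_index(strings: pd.Series, **kwargs):
    """Run group_similar_strings and restore the caller's index on the result."""
    result = group_similar_strings(strings, **kwargs)
    result.index = strings.index  # one output row per input row, so lengths match
    return result

# ignore_index=True yields a plain Series of group representatives
companies['Group Name'] = group_with_original_index(
    companies['Company Name'], ignore_index=True
)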

match_strings(): Any way to return additional columns?

I can't tell whether this is a design limitation or a matter of my being a "Spamonista" rather than a "Pythonista", so please bear with me.
I'm calling match_strings with two DataFrames like this:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn].squeeze(), dfSUBJECT[dfSUBJECT_KeyColumn].squeeze())

As you can see, since my dataframes contain more than 1 column, I specify the columns of interest (dfREFERENCE_KeyColumn & dfSUBJECT_KeyColumn).

Question: Is it fundamentally impossible to have matches return the other columns of the DataFrames behind the master and duplicates arguments to match_strings? Once I match on a text column (I'm doing "fuzzy matching" or "record linkage" between datasets with many columns), I want to retain all the other columns.

Again, please bear with me: while I'm a seasoned and aging IT guy, I mostly "push paperwork" these days, and am new to Python.
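A workaround sketch (the id columns in the output, assumed here to be named left_id and right_id, may differ between versions): pass each DataFrame's index as the id series, then join the remaining columns back through those ids.

matches = match_strings(
    dfREFERENCE[dfREFERENCE_KeyColumn],
    dfSUBJECT[dfSUBJECT_KeyColumn],
    master_id=dfREFERENCE.index.to_series(),
    duplicates_id=dfSUBJECT.index.to_series(),
)

# join all remaining columns back onto the matches via the returned ids
enriched = (
    matches
    .merge(dfREFERENCE, left_on='left_id', right_index=True)
    .merge(dfSUBJECT, left_on='right_id', right_index=True, suffixes=('_ref', '_subj'))
)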

Adding similarity column in the group_similar_strings output

Hi,

Thank you for this amazing code; it has been working great in my use case so far.
How can I add the similarity values from the computed cosine similarities to the output of the group_similar_strings function?
The output I am trying to build is a pandas Series containing each duplicated name together with its cosine similarity to the deduplicated_name.

So it would be something like this:
Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name

Any help, please?
Thank you.
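A sketch using compute_pairwise_similarities from the same library: first attach the group representative to each row, then score each name against its representative. (If your DataFrame has a non-default index, align the indexes first; see the index-mismatch issue above.)

from string_grouper import group_similar_strings, compute_pairwise_similarities

companies['deduplicated_name'] = group_similar_strings(
    companies['Company Name'], ignore_index=True
)
companies['Similarity'] = compute_pairwise_similarities(
    companies['Company Name'], companies['deduplicated_name']
)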

Some general questions about the package

Dear developer,

I've recently found this interesting package and I have a few questions. Not sure if this is the right place to post them.

  1. I'm working with a data set of company names that is regularly updated. The goal is to group the names into the same entities, as in the example https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md. Your matching algorithm is based on character-level n-grams and TF-IDF vectors. Because the TF-IDF weights depend on the corpus, the results may change as the data set is updated, and some old companies in the updated data might no longer match together. Do you have any experience working with dynamic data sets, and any advice on whether it is worth trying this package?
  2. In the documentation https://bergvca.github.io/2017/10/14/super-fast-string-matching.html, you mention that for Levenshtein distance the number of calculations grows quadratically. Actually, the complexity can be reduced to n log n using an appropriate data structure, such as a BK-tree. What is the computational complexity of this algorithm? Besides speed, is there any other reason why you would recommend n-grams and TF-IDF over Levenshtein-based metrics?
  3. Do the functions match_strings and group_similar_strings share the same logic but differ in output format? Is it possible that companies A and B are grouped together by group_similar_strings but are not matched by match_strings when using the same similarity threshold and the same data set?

Thanks in advance!

group_similar_strings with series length 1 produces ValueError

Using the group_similar_strings method in version 0.4.0, if a Series of length 1 is passed, a ValueError with no explanation is raised. To reproduce:

group_similar_strings(strings_to_group=pd.Series(['hello']))

I believe this wasn't the case in previous versions. I would have expected a DataFrame such as:

group_rep_index  group_rep
0                hello

full stacktrace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-f6cdcfd0b8c6> in <module>
----> 1 group_similar_strings(strings_to_group=pd.Series(['hello']))

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/string_grouper/string_grouper.py in group_similar_strings(strings_to_group, string_ids, **kwargs)
     78     """
     79     string_grouper = StringGrouper(strings_to_group, master_id=string_ids, **kwargs).fit()
---> 80     return string_grouper.get_groups()
     81 
     82 

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/string_grouper/string_grouper.py in wrapper(*args, **kwargs)
    181     def wrapper(*args, **kwargs):
    182         if args[0].is_build:
--> 183             return f(*args, **kwargs)
    184         else:
    185             raise StringGrouperNotFitException(f'{f.__name__} was called before the "fit" function was called.'

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/string_grouper/string_grouper.py in get_groups(self, ignore_index, replace_na)
    366         if ignore_index is None: ignore_index = self._config.ignore_index
    367         if self._duplicates is None:
--> 368             return self._deduplicate(ignore_index=ignore_index)
    369         else:
    370             if replace_na is None: replace_na = self._config.replace_na

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/string_grouper/string_grouper.py in _deduplicate(self, ignore_index)
    606         # pandas groupby transform function and enlargement enable both respectively in one step:
    607         group_of_master_index['group_rep'] = \
--> 608             group_of_master_index.groupby('raw_group_id', sort=False)['weight'].transform(method)
    609 
    610         # Prepare the output:

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in transform(self, func, *args, **kwargs)
    475         # result to the whole group. Compute func result
    476         # and deal with possible broadcasting below.
--> 477         result = getattr(self, func)(*args, **kwargs)
    478         return self._transform_fast(result, func)
    479 

~/.cache/pypoetry/virtualenvs/env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in wrapper(*args, **kwargs)
    654             if self.obj.ndim == 1:
    655                 # this can be called recursively, so need to raise ValueError
--> 656                 raise ValueError
    657 
    658             # GH#3688 try to operate item-by-item

ValueError: 

Question: How to achieve matching for multiple fields and priorities

Excellent work here. Great performance.
Can you please give some advice on how to achieve deduplication across multiple fields, some fuzzy, some 'hard' (exact) matches?

Like this one does:
https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html

compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

features = compare_cl.compute(pairs, dfA)

Thank you!
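A hedged sketch of one way to combine string_grouper with exact-match fields in plain pandas (df and its columns are placeholders, and the left_id/right_id column names may vary by version): fuzzy-match one field, then keep only the candidate pairs whose hard fields agree.

import pandas as pd
from string_grouper import match_strings

# fuzzy candidate pairs on one field, carrying row positions as ids
candidates = match_strings(
    df['surname'],
    master_id=pd.Series(df.index),
    min_similarity=0.85,
)

# look up both sides of each candidate pair
left = df.loc[candidates['left_id']].reset_index(drop=True)
right = df.loc[candidates['right_id']].reset_index(drop=True)

# enforce the 'hard' (exact) fields
hard_ok = (
    (left['date_of_birth'].values == right['date_of_birth'].values)
    & (left['state'].values == right['state'].values)
)
matched_pairs = candidates.reset_index(drop=True)[hard_ok]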

Question about version string_grouper group_similar_strings

Dear developer,

Could you give me an explanation of the different versions of string_grouper?
I only use one function, group_similar_strings. I am currently using version 0.1.1, but the latest version is now 0.6.1.

This library is very helpful and great, but when I used group_similar_strings with a custom similarity threshold, the results sometimes missed groups that I could spot by eye.
Is it worth upgrading to the latest version? What are the improvements?

Question: How to have built StringGrouper corpus persist across multiple match_string calls in a programming session

Hi, this is the logical next step in the clean-up process of data extracted from a database having an ID and a name field. It's a follow-on from the use case documented in the first String Grouper tutorial, which I wrote.

Background

To recap the requirement back then — there was a reasonably large database of accounts with an ID and customer name. Browsing the database it could be seen by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process) it is possible to (insanely) quickly get a list of IDs of duplicate records which are then imported back into the database to do a join with the original table. It's all in the tutorial.

The persons responsible are now starting the clean-up necessary to remove all the duplicates, but I want to set up a solution to ensure that when bulk imports are done in the future (typically via Excel), duplicates are proactively avoided. The web app does have functionality to identify duplicates, but it is very limited, and that lack is what got the database into having so many duplicates in the first place.

Process/Requirement

I will have a Python application which will open the Excel file, walk through the rows, do a number of checks and validations (missing data etc.), and reject rows that are not clean. One of the checks will be for potential duplicates — whether the customer name already exists in the database. I have access to the database in PostgreSQL.

PostgreSQL Search Options

PostgreSQL has a number of search options like levenshtein, soundex and trigrams, as well as tsvector. I have tried the first three, but one way or another they don't give me the same deduplication functionality I have seen with String Grouper using cosine similarities.

Possibility of Using String Grouper?

So I did a simple test in String Grouper importing the existing data (id & name) into a pandas DataFrame accounts and then making a one row DataFrame duplicates to hold the string being queried.

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

duplicates = pd.DataFrame(
    [
        ('1111', 'Some Company Ltd')
    ],
    columns=('duplicate_id', 'duplicate_name')
)

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8
)

It works. However, each call to the match_strings function takes some time, as I believe it rebuilds the corpus on every call, and I have to call the function for each row as I step through the tabular data being imported. Obviously this is not efficient.

Question

So, to the question: is it possible to run String Grouper such that the corpus is built only once for the duration of the running Python application? So:

  1. Initialise the Python program
  2. Load the dataframe with the many rows accounts
  3. Build the corpus — once only
  4. Loop:
    1. Step through the rows getting the candidate (potential duplicate) string name
    2. Run match_strings for each row.
  5. End loop

Thanks in advance for any help/guidance.

p.s. This might be good functionality to build into the package, it is likely a fairly common use case.
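Until the package offers a persistent fitted corpus, one workaround sketch is to batch all candidate rows from the spreadsheet into a single duplicates Series, so the corpus is built once per import instead of once per row (excel_df and its columns are placeholders):

import pandas as pd
from string_grouper import match_strings

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')
candidates = excel_df[['id', 'name']]  # all rows from the Excel file at once

matches = match_strings(
    accounts['name'],
    candidates['name'],
    master_id=accounts['id'],
    duplicates_id=candidates['id'],
    min_similarity=0.8,
)
# one call per import: every candidate row is checked against the corpus together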

[question] Partial matching of strings

Goal: group the following strings into the same group.

Should you raise an adverse event in a specific patient?
Although what you say will be treated in confidence, should you raise an adverse event in a specific patient?

The following code creates separate groups for those strings.

string_grouper = StringGrouper(question_table["question"])
string_grouper = string_grouper.fit()
question_table["labels"] = string_grouper.get_groups()

Question: is it possible to adjust string matching to reach the goal?

Thank you in advance for any hints!
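Because the two sentences differ greatly in length, many of the shorter one's n-grams are diluted in the longer one, which drags the cosine similarity below the default threshold of 0.8. One thing to try, sketched here with an illustrative threshold, is lowering min_similarity, which StringGrouper accepts as a keyword argument:

string_grouper = StringGrouper(
    question_table["question"],
    min_similarity=0.5,  # default is 0.8; lower values allow partial matches
).fit()
question_table["labels"] = string_grouper.get_groups()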

[question] How to import string_grouper_utils?

Hi all, I think this is a very simple question, but how can one import the string_grouper_utils module? I've installed string_grouper via pip, but I cannot seem to access the functions within string_grouper_utils, such as new_group_rep_by_highest_weight. Anyone?
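For what it's worth, string_grouper_utils ships as its own top-level module, so the import looks like this (assuming a version of the package recent enough to include it):

from string_grouper_utils import new_group_rep_by_highest_weight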

Setting min_similarity missing in __init__

Is there a clear way to set the min_similarity for the grouping? This doesn't seem supported. For example:

string_grouper = StringGrouper(
    dataset["Description"].astype(str),
    min_similarity=0.1,  # <-- not supported?
).fit()

Optimizing ._deduplicate()

Hi Bergcva,

This is not a terribly important issue per se -- I was just musing over your very instructive string_grouper.py code, and wondered whether a standard graph-theoretical numerical routine, such as connected_components() in scipy.sparse.csgraph, wouldn't perform better at grouping a pandas Series of strings linked to each other through cosine similarities. So I went ahead and re-coded the StringGrouper._deduplicate() function, which resulted in a speedup of about 3.6 times on the sec__edgar_company_info.csv sample data file.

If you are interested, do let me know, and I'll create a pull request to your code.

...
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
...

class StringGrouper(object):
    ...
    def _deduplicate2(self) -> pd.Series:
        N = len(self._master)
        # build an N x N sparse adjacency matrix from the matched index pairs
        graph = csr_matrix(
            (
                np.full(len(self._matches_list), 1),
                (self._matches_list.master_side.to_numpy(), self._matches_list.dupe_side.to_numpy())
            ),
            shape=(N, N)
        )
        # label each connected component of the (undirected) similarity graph
        group_labels = pd.Series(connected_components(csgraph=graph, directed=False)[1])
        grouped = pd.DataFrame({'group_label': group_labels, 'master_id': self._master.index.to_series()})
        # nominate the first member of each component as the group representative
        firstIDinGroup = grouped.groupby('group_label')['master_id'].first().rename('nominal_group').reset_index()
        group_id = \
            grouped.merge(firstIDinGroup, how='right', on='group_label').sort_values('master_id').reset_index(drop=True)
        return self._master[group_id.nominal_group].reset_index(drop=True)

Cheers!

Numpy version issue?

Something in string_grouper's pip install is removing numpy 1.23 and reverting to 1.22, but then the import fails with an error saying the wrong version of numpy is in use. Any ideas?

Regarding ID functionality for match_most_similar() and group_similar_strings()

Hi Bergcva,

@justasojourner and I have been discussing your comments in Issue #19 on adding ID functionality to the functions match_most_similar() and group_similar_strings(). While neither of us finds the proposed additional functionality (as we imagine it) particularly useful for our own purposes, we thought you might have a better idea, so we would first check with you on what exactly you meant by it.

Do let us know your thoughts on this.

Cheers!

Installation problem

System: Windows 10

(base) C:\Users\sachi>python --version
Python 3.8.10

(base) C:\Users\sachi>pip --version
pip 21.1.2 from c:\users\sachi\anaconda3\lib\site-packages\pip (python 3.8)
(base) C:\Users\sachi>pip install string-grouper==0.2.2
Collecting string-grouper==0.2.2
  Downloading string_grouper-0.2.2-py3-none-any.whl (11 kB)
Requirement already satisfied: pandas>=0.25.3 in c:\users\sachi\anaconda3\lib\site-packages (from string-grouper==0.2.2) (1.2.4)
Requirement already satisfied: numpy in c:\users\sachi\anaconda3\lib\site-packages (from string-grouper==0.2.2) (1.20.2)
Requirement already satisfied: scipy in c:\users\sachi\anaconda3\lib\site-packages (from string-grouper==0.2.2) (1.6.2)
Requirement already satisfied: scikit-learn in c:\users\sachi\anaconda3\lib\site-packages (from string-grouper==0.2.2) (0.24.2)
Collecting sparse-dot-topn>=0.2.6
  Downloading sparse_dot_topn-0.3.1.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\sachi\anaconda3\lib\site-packages (from pandas>=0.25.3->string-grouper==0.2.2) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in c:\users\sachi\anaconda3\lib\site-packages (from pandas>=0.25.3->string-grouper==0.2.2) (2021.1)
Requirement already satisfied: six>=1.5 in c:\users\sachi\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.25.3->string-grouper==0.2.2) (1.15.0)
Requirement already satisfied: setuptools>=42 in c:\users\sachi\anaconda3\lib\site-packages (from sparse-dot-topn>=0.2.6->string-grouper==0.2.2) (57.0.0)
Requirement already satisfied: cython>=0.29.15 in c:\users\sachi\anaconda3\lib\site-packages (from sparse-dot-topn>=0.2.6->string-grouper==0.2.2) (0.29.21)
Requirement already satisfied: joblib>=0.11 in c:\users\sachi\anaconda3\lib\site-packages (from scikit-learn->string-grouper==0.2.2) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\sachi\anaconda3\lib\site-packages (from scikit-learn->string-grouper==0.2.2) (2.1.0)
Building wheels for collected packages: sparse-dot-topn
  Building wheel for sparse-dot-topn (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\sachi\anaconda3\python.exe' 'c:\users\sachi\anaconda3\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' build_wheel 'C:\Users\sachi\AppData\Local\Temp\tmpo7rfwtmy'
       cwd: C:\Users\sachi\AppData\Local\Temp\pip-install-kg_5az86\sparse-dot-topn_4981ea8a5fd046dba8d68b644d539e9b
  Complete output (37 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.8
  creating build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\awesome_cossim_topn.py -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\__init__.py -> build\lib.win-amd64-3.8\sparse_dot_topn
  running egg_info
  writing sparse_dot_topn.egg-info\PKG-INFO
  writing dependency_links to sparse_dot_topn.egg-info\dependency_links.txt
  writing requirements to sparse_dot_topn.egg-info\requires.txt
  writing top-level names to sparse_dot_topn.egg-info\top_level.txt
  reading manifest file 'sparse_dot_topn.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  adding license file 'LICENSE'
  writing manifest file 'sparse_dot_topn.egg-info\SOURCES.txt'
  copying sparse_dot_topn\array_wrappers.pxd -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\array_wrappers.pyx -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn.pyx -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn_parallel.cpp -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn_parallel.h -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn_source.cpp -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn_source.h -> build\lib.win-amd64-3.8\sparse_dot_topn
  copying sparse_dot_topn\sparse_dot_topn_threaded.pyx -> build\lib.win-amd64-3.8\sparse_dot_topn
  running build_ext
  cythoning ./sparse_dot_topn/array_wrappers.pyx to ./sparse_dot_topn\array_wrappers.cpp
  cythoning ./sparse_dot_topn/sparse_dot_topn.pyx to ./sparse_dot_topn\sparse_dot_topn.cpp
  cythoning ./sparse_dot_topn/sparse_dot_topn_threaded.pyx to ./sparse_dot_topn\sparse_dot_topn_threaded.cpp
  building 'sparse_dot_topn.array_wrappers' extension
  C:\Users\sachi\AppData\Local\Temp\pip-build-env-ah2hjm16\normal\Lib\site-packages\Cython\Compiler\Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: C:\Users\sachi\AppData\Local\Temp\pip-install-kg_5az86\sparse-dot-topn_4981ea8a5fd046dba8d68b644d539e9b\sparse_dot_topn\array_wrappers.pxd
    tree = Parsing.p_module(s, pxd, full_module_name)
  C:\Users\sachi\AppData\Local\Temp\pip-build-env-ah2hjm16\normal\Lib\site-packages\Cython\Compiler\Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: C:\Users\sachi\AppData\Local\Temp\pip-install-kg_5az86\sparse-dot-topn_4981ea8a5fd046dba8d68b644d539e9b\sparse_dot_topn\sparse_dot_topn.pyx
    tree = Parsing.p_module(s, pxd, full_module_name)
  C:\Users\sachi\AppData\Local\Temp\pip-build-env-ah2hjm16\normal\Lib\site-packages\Cython\Compiler\Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: C:\Users\sachi\AppData\Local\Temp\pip-install-kg_5az86\sparse-dot-topn_4981ea8a5fd046dba8d68b644d539e9b\sparse_dot_topn\sparse_dot_topn_threaded.pyx
    tree = Parsing.p_module(s, pxd, full_module_name)
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  ----------------------------------------
  ERROR: Failed building wheel for sparse-dot-topn
Failed to build sparse-dot-topn
ERROR: Could not build wheels for sparse-dot-topn which use PEP 517 and cannot be installed directly

Error when importing

I tried to import:
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
and got this error:
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

What is the root cause?

Function of simply calculating pairwise similarity without matching

Hi,
Is there a function that can simply calculate pairwise (cosine) similarities for already-constructed pairs, with no matching step? If not, please consider adding one.
In some cases I already have pairs (e.g. var1 and var2, in the same data file) and want to calculate their pairwise similarity. For instance, I first use String Grouper to match two variables and get a file of pairs. Then I modify the two string variables (for example, by further excluding prefixes and suffixes). I need the similarity of the two modified string variables so that I can compare the two similarity scores.
With the current functions of String Grouper, I have to perform a full match and then merge the matches onto my pairs:
matches = match_strings(data["var1"], data["var2"], min_similarity=0.80)
This requires much more computation than I actually need.
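Note that the library now provides compute_pairwise_similarities, which does exactly this: it scores each string in one Series against the string at the same position in the other, with no matching step.

from string_grouper import compute_pairwise_similarities

# one cosine-similarity score per row of the pair file
data["similarity"] = compute_pairwise_similarities(data["var1"], data["var2"])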

Different matching behavior across versions

Hi there,

I've been using this package for some time now (thank you for writing it, it's very useful), but only recently upgraded my version, from 0.3.2 to 0.6.1. On the new version, I'm getting behavior I hadn't expected and which differs from the older version.

Given two Pandas Series s1 and s2, I initialize and fit a StringGrouper object:

sg = StringGrouper(
    s1, 
    s2, 
    min_similarity=0.70,
    max_n_matches=1
).fit()

Then I grab the raw matches from sg._matches_list. (I know this isn't the typical way to use this object, but it suits my use case.)

On 0.3.2 I get the behavior I usually rely on: the number of rows in the result (sg._matches_list.shape[0]) is equal to the number of unique master_side indices listed in the result (sg._matches_list['master_side'].nunique()). Meaning, there is exactly one row per record in the master list for which there was a sufficiently good match in the duplicates list. In my case, there are duplicates in the list of dupe_side indices, which is what I expect:

>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
193
>>> sg._matches_list['dupe_side'].nunique()
127

On 0.6.1, I get the exact opposite behavior - one row per record in the duplicates list for which there was a sufficiently good match in the master list:

>>> sg._matches_list.shape
(193, 3)
>>> sg._matches_list['master_side'].nunique()
128
>>> sg._matches_list['dupe_side'].nunique()
193

Does anyone know why the behavior is so different between the two versions?

Unable to allocate 8.41 GiB for an array with shape (2258174000,) and data type int32

I tried to match strings in a 1M-row dataset against a 10M-row dataset, with n_blocks set to (10, 2000).
The error occurred with this message:
File "", line 164, in match_strings\n File "", line 623, in fit\n File "", line 461, in _fit_blockwise_manual\n File "c:\program files\python38\lib\site-packages\topn\awesome_topn.py", line 88, in awesome_hstack_topn\n r = np.concatenate([b.indptr for b in blocks])\n File "<array_function internals>", line 180, in concatenate\nnumpy.core._exceptions._ArrayMemoryError: Unable to allocate 8.41 GiB for an array with shape (2258174000,) and data type int32\n<traceback object at 0x0000028A1BC113C0>.

Any solutions?

ModuleNotFoundError: No module named 'string_grouper'

Hi! Great package by the way, thanks for creating this code! I love the concept.

I used string_grouper two weeks ago on another project and loved it. I tried installing again yesterday on another computer and it says there is no module. Did something change or update? Thanks for your help!
