String Grouper

string_grouper is a library that makes finding groups of similar strings, within a single list or across multiple lists of strings, easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
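
To make the idea concrete, here is a minimal pure-Python sketch of tf-idf weighted cosine similarity over character trigrams. It is an illustration only and not string_grouper's implementation; the helper names (trigrams, tfidf_vectors, cosine) and the smoothed idf formula are assumptions made for this example.

```python
import math
from collections import Counter


def trigrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalised {n-gram: weight} dict per input string."""
    counts = [Counter(trigrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumption for this sketch): rare n-grams weigh more.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


vecs = tfidf_vectors(["foooo", "foooob", "bar"])
print(cosine(vecs[0], vecs[1]))  # high: the strings share most trigrams
print(cosine(vecs[0], vecs[2]))  # zero: no trigram in common
```

Because the vectors are normalised up front, the cosine similarity reduces to a dot product over the n-grams the two strings share.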

Installing

pip install string-grouper

Usage

import pandas as pd
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper

As shown above, the library may be used together with pandas. It contains three high-level functions (match_strings, match_most_similar and group_similar_strings) that can be used directly, and one class (StringGrouper) that allows for a more iterative approach.

The permitted calling patterns of the three functions, and their return types, are:

Function               Parameters                                                pandas Return Type
match_strings          (master, **kwargs)                                        DataFrame
match_strings          (master, duplicates, **kwargs)                            DataFrame
match_strings          (master, master_id=id_series, **kwargs)                   DataFrame
match_strings          (master, duplicates, master_id, duplicates_id, **kwargs)  DataFrame
match_most_similar     (master, duplicates, **kwargs)                            Series
match_most_similar     (master, duplicates, master_id, duplicates_id, **kwargs)  DataFrame
group_similar_strings  (strings_to_group, **kwargs)                              Series
group_similar_strings  (strings_to_group, strings_to_group_id, **kwargs)         DataFrame

In the rest of this document, the names Series and DataFrame refer to the familiar pandas object types.

Parameters:

Name                      Description
master                    A Series of strings to be matched with themselves (or with those in duplicates).
duplicates                A Series of strings to be matched with those of master.
master_id (or id_series)  A Series of IDs corresponding to the strings in master.
duplicates_id             A Series of IDs corresponding to the strings in duplicates.
strings_to_group          A Series of strings to be grouped.
strings_to_group_id       A Series of IDs corresponding to the strings in strings_to_group.
**kwargs                  Keyword arguments (see below).

Functions:

  • match_strings

    Returns all pairs of highly similar strings in a DataFrame. The column names of the output DataFrame are 'left_side', 'right_side' and 'similarity'.

    If only parameter master is given, it will return pairs of highly similar strings within master. This can be seen as a self-join (both 'left_side' and 'right_side' column values come from master). If both parameters master and duplicates are given, it will return pairs of highly similar strings between master and duplicates. This can be seen as an inner-join ('left_side' and 'right_side' column values come from master and duplicates respectively).

    The function also accepts optional IDs (master_id and duplicates_id) corresponding to the string values being matched. In that case, the output includes two additional columns, 'left_side_id' and 'right_side_id', containing the IDs corresponding to the string values in 'left_side' and 'right_side' respectively.

  • match_most_similar

    Returns a nameless Series of strings of the same length as the parameter duplicates, where for each string in duplicates the most similar string in master is returned. If there are no similar strings in master for a given string in duplicates (that is, no potential match has a cosine similarity above the threshold, which defaults to 0.8), the original string in duplicates is returned.

    For example, if the input series [foooo, bar, baz] is passed as the argument to master, and [foooob, bar, new] as the argument to duplicates, the function will return: [foooo, bar, new].

    If both parameters master_id and duplicates_id are also given, then a DataFrame with two unnamed columns is returned. The second column is the same as the Series of strings described above, and the first column contains the corresponding IDs.

  • group_similar_strings

    Takes a single Series (strings_to_group) of strings and groups them by assigning to each string one single string chosen as the group representative for each group of similar strings found. The output is a nameless Series of group-representative strings of the same length as the input Series.

    For example, the input series: [foooo, foooob, bar] will return [foooo, foooo, bar]. Here foooo and foooob are grouped together into group foooo because they are found to be similar. (Another example can be found here.)

    If strings_to_group_id is also given, then the IDs corresponding to the output Series above are also returned. The combined output is a DataFrame with two columns.

All functions are built using a class StringGrouper. This class can be used through pre-defined functions, for example the three high level functions above, as well as using a more iterative approach where matches can be added or removed if needed by calling the StringGrouper class directly.

Options:

  • kwargs

    All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:

    • ngram_size: The number of characters in each n-gram. Default is 3.
    • regex: The regex string used to clean up the input string. Default is "[,-./]|\s".
    • max_n_matches: The maximum number of matches allowed per string. Default is 20.
    • min_similarity: The minimum cosine similarity for two strings to be considered a match. Default is 0.8.
    • number_of_processes: The number of processes used by the cosine similarity calculation. Default is the number of cores on the machine minus 1.
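
To illustrate how ngram_size and regex interact, the following sketch first strips every character matched by the default regex and then slices the cleaned string into overlapping n-grams. The function names clean and n_grams are chosen for this example, and the clean-then-slice order is an assumption about the library rather than a description of its internals.

```python
import re

DEFAULT_REGEX = r"[,-./]|\s"  # default `regex` option listed above


def clean(text, regex=DEFAULT_REGEX):
    """Remove every character matched by the clean-up regex."""
    return re.sub(regex, "", text)


def n_grams(text, ngram_size=3, regex=DEFAULT_REGEX):
    """Clean the string, then slice it into overlapping n-grams."""
    cleaned = clean(text, regex)
    return [cleaned[i:i + ngram_size]
            for i in range(len(cleaned) - ngram_size + 1)]


print(n_grams("McDonalds"))
# ['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
print(clean("1-800-RADIATOR FRANCHISE INC."))  # 1800RADIATORFRANCHISEINC
print(clean("1 800 RADIATOR FRANCHISE INC"))   # 1800RADIATORFRANCHISEINC
```

Under these assumptions, two strings that differ only in punctuation and whitespace clean to the same text and produce identical n-grams, which would explain how a non-exact pair can reach a similarity of 1.0.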

Examples

In this section we will cover a few use cases for which string_grouper may be used. We will use the same data set of company names as used in: Super Fast String Matching in Python.

Find all matches within a single data set

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches.left_side != matches.right_side].head()
left_side right_side similarity
15 0210, LLC 90210 LLC 0.870291
167 1 800 MUTUALS ADVISOR SERIES 1 800 MUTUALS ADVISORS SERIES 0.931616
169 1 800 MUTUALS ADVISORS SERIES 1 800 MUTUALS ADVISOR SERIES 0.931616
171 1 800 RADIATOR FRANCHISE INC 1-800-RADIATOR FRANCHISE INC. 1.000000
178 1 FINANCIAL MARKETPLACE SECURITIES LLC ... 1 FINANCIAL MARKETPLACE SECURITIES, LLC 0.949364

Find all matches between two data sets

The match_strings function finds similar items between two data sets as well. This can be seen as an inner join between two data sets:

# Create a small set of artificial company names:
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
left_side right_side similarity
0 012 SMILE.COMMUNICATIONS LTD 012 SMILE.COMMUNICATIONS 0.944092
1 B.A.S. MEDIA GROUP S MEDIA GROUP 0.854383
2 B4UTRADE COM CORP B4UTRADE COM CORP 1.000000
3 B4UTRADE COM INC B4UTRADE COM CORP 0.810217
4 B4UTRADE CORP B4UTRADE COM CORP 0.878276

Out of the four company names in duplicates, three companies are found in the original company data set. One company is found 3 times.
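
The counts in the sentence above can be checked by tallying the right_side values. The sketch below works on plain tuples copied from the output shown earlier, so it runs without the data set; with the real DataFrame the same tally would be taken over its right_side column.

```python
from collections import Counter

# (left_side, right_side) pairs copied from the output above.
matches = [
    ("012 SMILE.COMMUNICATIONS LTD", "012 SMILE.COMMUNICATIONS"),
    ("B.A.S. MEDIA GROUP", "S MEDIA GROUP"),
    ("B4UTRADE COM CORP", "B4UTRADE COM CORP"),
    ("B4UTRADE COM INC", "B4UTRADE COM CORP"),
    ("B4UTRADE CORP", "B4UTRADE COM CORP"),
]
duplicates = ["S MEDIA GROUP", "012 SMILE.COMMUNICATIONS", "foo bar", "B4UTRADE COM CORP"]

# Count how many master strings each duplicate was matched with.
hits = Counter(right for _, right in matches)
for name in duplicates:
    print(f"{name!r}: {hits[name]} match(es)")

# Duplicates that never appear on the right side had no match at all.
unmatched = [name for name in duplicates if hits[name] == 0]
print(unmatched)  # ['foo bar']
```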

Finding duplicates from a (database extract to) DataFrame where IDs for rows are supplied.

A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling, for example "A.B. Corporation" and "AB Corporation". Using the optional 'ID' parameter in the match_strings function, duplicates can be found easily. A tutorial that steps through the process with an example data set is available.

For a second data set, find only the most similar match

In the example above, it's possible that multiple matches are found for a single string. Sometimes we just want a string to match with a single most similar string. If there are no similar strings found, the original string should be returned:

# Create a small set of artificial company names:
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_most_similar(companies['Company Name'], new_companies)
# Display the results:
pd.DataFrame({'new_companies': new_companies, 'duplicates': matches})
new_companies duplicates
0 S MEDIA GROUP B.A.S. MEDIA GROUP
1 012 SMILE.COMMUNICATIONS 012 SMILE.COMMUNICATIONS LTD
2 foo bar foo bar
3 B4UTRADE COM CORP B4UTRADE COM CORP
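
The "best match or original string" rule can be sketched in a few lines. Note that this example swaps in difflib.SequenceMatcher from the standard library as a stand-in similarity measure; it is not the tf-idf cosine similarity that string_grouper uses, and the function names similarity and most_similar are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the tf-idf cosine similarity."""
    return SequenceMatcher(None, a, b).ratio()


def most_similar(master, duplicates, min_similarity=0.8):
    """For each duplicate, return its best master match or the duplicate itself."""
    result = []
    for dup in duplicates:
        best = max(master, key=lambda m: similarity(m, dup))
        # Fall back to the original string when nothing clears the threshold.
        result.append(best if similarity(best, dup) >= min_similarity else dup)
    return result


print(most_similar(["foooo", "bar", "baz"], ["foooob", "bar", "new"]))
# ['foooo', 'bar', 'new']
```

The result always has the same length as duplicates, which is what makes it convenient to place next to the input as a new column.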

Deduplicate a single data set and show items with most duplicates

The group_similar_strings function groups strings that are similar using a single linkage clustering algorithm. That is, if item A and item B are similar; and item B and item C are similar; but the similarity between A and C is below the threshold; then all three items are grouped together.

# Add the grouped strings:
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'])
# Show items with most duplicates:
companies.groupby('deduplicated_name').count().sort_values('Line Number', ascending=False).head(10)['Line Number']
deduplicated_name
ADVISORS DISCIPLINED TRUST 1100                        188
ACE SECURITIES CORP HOME EQUITY LOAN TRUST 2005-HE4     32
AMERCREDIT AUTOMOBILE RECEIVABLES TRUST 2010-1          28
ADVENT LATIN AMERICAN PRIVATE EQUITY FUND II-A CV       25
ALLSTATE LIFE GLOBAL FUNDING TRUST 2004-1               24
ADVENT INTERNATIONAL GPE VII LIMITED PARTNERSHIP        24
7ADVISORS DISCIPLINED TRUST 1197                        23
AMERICREDIT AUTOMOBILE RECEIVABLES TRUST  2002 - D      23
ALLY AUTO RECEIVABLES TRUST 2010-1                      23
ANDERSON DAVID  A                                       23
Name: Line Number, dtype: int64
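
The single linkage behaviour described above (A, B and C end up together even though A and C are not directly similar) can be sketched with a small union-find structure. This is an illustration under assumptions, not the library's code: the function name group_single_linkage is hypothetical, and picking the earliest item as group representative is an arbitrary choice for the example.

```python
def group_single_linkage(items, similar_pairs):
    """Map every item to a group representative via union-find."""
    order = {item: i for i, item in enumerate(items)}
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the root that appears first in `items` as representative.
            keep, drop = sorted((root_a, root_b), key=order.__getitem__)
            parent[drop] = keep
    return [find(item) for item in items]


items = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "C")]  # A~C is below the threshold, so it is absent
print(group_single_linkage(items, pairs))  # ['A', 'A', 'A', 'D']
```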

The group_similar_strings function also works with IDs: imagine a DataFrame (customers_df) with the following content:

# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
   [
      ('BB016741P', 'Mega Enterprises Corporation'),
      ('CC082744L', 'Hyper Startup Incorporated'),
      ('AA098762D', 'Hyper Startup Inc.'),
      ('BB099931J', 'Hyper-Startup Inc.'),
      ('HH072982K', 'Hyper Hyper Inc.')
   ],
   columns=('Customer ID', 'Customer Name')
)
# Display the data:
customers_df
Customer ID Customer Name
0 BB016741P Mega Enterprises Corporation
1 CC082744L Hyper Startup Incorporated
2 AA098762D Hyper Startup Inc.
3 BB099931J Hyper-Startup Inc.
4 HH072982K Hyper Hyper Inc.

The output of group_similar_strings can be directly used as a mapping table:

# Group customers with similar names:
customers_df[["group-id", "name_deduped"]]  = \
    group_similar_strings(customers_df["Customer Name"], customers_df["Customer ID"])
# Display the mapping table:
customers_df
Customer ID Customer Name group-id name_deduped
BB016741P Mega Enterprises Corporation BB016741P Mega Enterprises Corporation
CC082744L Hyper Startup Incorporated CC082744L Hyper Startup Incorporated
AA098762D Hyper Startup Inc. CC082744L Hyper Startup Incorporated
BB099931J Hyper-Startup Inc. CC082744L Hyper Startup Incorporated
HH072982K Hyper Hyper Inc. CC082744L Hyper Startup Incorporated

Note that here customers_df initially had only two columns "Customer ID" and "Customer Name" (before the group_similar_strings function call); and it acquired two more columns "group-id" and "name_deduped" after the call.
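
As one possible use of such a mapping table, the sketch below collects the customer IDs that share a group-id. It uses plain tuples copied from the table above so that it runs on its own; with the DataFrame, a groupby on the "group-id" column would serve the same purpose.

```python
from collections import defaultdict

# (Customer ID, group-id) pairs copied from the mapping table above.
mapping = [
    ("BB016741P", "BB016741P"),
    ("CC082744L", "CC082744L"),
    ("AA098762D", "CC082744L"),
    ("BB099931J", "CC082744L"),
    ("HH072982K", "CC082744L"),
]

# Collect the members of every group.
members = defaultdict(list)
for customer_id, group_id in mapping:
    members[group_id].append(customer_id)

# Groups with more than one member are candidate duplicate records.
duplicates = {g: ids for g, ids in members.items() if len(ids) > 1}
print(duplicates)
# {'CC082744L': ['CC082744L', 'AA098762D', 'BB099931J', 'HH072982K']}
```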

The StringGrouper class

The three functions mentioned above all create a StringGrouper object behind the scenes and call different functions on it. The StringGrouper class keeps track of all tuples of similar strings and creates the groups out of these. Since matches are often not perfect, a common workflow is to:

  1. Create matches
  2. Manually inspect the results
  3. Add and remove matches where necessary
  4. Create groups of similar strings

The StringGrouper class allows for this without having to re-calculate the cosine similarity matrix. See below for an example.

company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# This time we use the full data set:
companies = pd.read_csv(company_names)
# 1. Create matches
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'])
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()

Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. StringGrouper will not match these since they are not similar enough.

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
companies[companies.deduplicated_name.str.contains('PWC')]
Line Number Company Name Company CIK Key deduplicated_name
485535 485536 PWC CAPITAL INC. 1690640 PWC CAPITAL INC.
485536 485537 PWC HOLDING CORP 1456450 PWC HOLDING CORP
485537 485538 PWC INVESTORS, LLC 1480311 PWC INVESTORS, LLC
485538 485539 PWC REAL ESTATE VALUE FUND I LLC 1668928 PWC REAL ESTATE VALUE FUND I LLC
485539 485540 PWC SECURITIES CORP /BD 1023989 PWC SECURITIES CORP /BD
485540 485541 PWC SECURITIES CORPORATION 1023989 PWC SECURITIES CORPORATION
485541 485542 PWCC LTD 1172241 PWCC LTD
485542 485543 PWCG BROKERAGE, INC. 67301 PWCG BROKERAGE, INC.

We can add this match with the add_match function:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now let's check again:

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA

This can also be used to merge two groups:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA
662585 662586 ZUCKER MICHAEL 1629018 PRICEWATERHOUSECOOPERS LLP /TA
662604 662605 ZUCKERMAN MICHAEL 1303321 PRICEWATERHOUSECOOPERS LLP /TA
662605 662606 ZUCKERMAN MICHAEL 1496366 PRICEWATERHOUSECOOPERS LLP /TA

We can remove strings from groups in the same way:

string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA
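
Why adding or removing a match does not require recomputing any similarities can be seen in a toy model: if the matched pairs are stored, groups can be rebuilt from the pairs alone. The PairGrouper class below is a hypothetical stand-in written for this example; it borrows the method names add_match, remove_match and get_groups from the text above but makes no claim about how StringGrouper is implemented.

```python
class PairGrouper:
    """Hypothetical stand-in that stores matched pairs and derives groups."""

    def __init__(self, items, pairs=()):
        self.items = list(items)
        self.pairs = {tuple(sorted(p)) for p in pairs}

    def add_match(self, a, b):
        self.pairs.add(tuple(sorted((a, b))))
        return self

    def remove_match(self, a, b):
        self.pairs.discard(tuple(sorted((a, b))))
        return self

    def get_groups(self):
        """Rebuild groups from the stored pairs; no similarity is recomputed."""
        order = {item: i for i, item in enumerate(self.items)}
        parent = {item: item for item in self.items}

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for a, b in self.pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # The earliest item of a group acts as its representative.
                keep, drop = sorted((root_a, root_b), key=order.__getitem__)
                parent[drop] = keep
        return [find(item) for item in self.items]


names = [
    "PRICEWATERHOUSECOOPERS LLP /TA",
    "PRICEWATERHOUSECOOPERS LLP",
    "PWC HOLDING CORP",
    "ZUCKER MICHAEL",
]
grouper = PairGrouper(names, pairs=[(names[0], names[1])])
grouper.add_match("PRICEWATERHOUSECOOPERS LLP", "PWC HOLDING CORP")
print(grouper.get_groups())  # first three names share one representative
grouper.add_match("PRICEWATERHOUSECOOPERS LLP", "ZUCKER MICHAEL")
grouper.remove_match("PRICEWATERHOUSECOOPERS LLP", "ZUCKER MICHAEL")
print(grouper.get_groups())  # unchanged: the added pair was removed again
```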
