vaneseltine / nominally

A maximum-strength name parser for record linkage.

License: GNU Affero General Public License v3.0

Python 95.53% TeX 4.47%

data-science parsing human-name entity-resolution record-linkage deduplication parser data-matching

nominally: a maximum-strength name parser for record linkage

🔗 Names

Nominally simplifies and parses a personal name written in Western name order into six core fields: title, first, middle, last, suffix, and nickname.

Typically, nominally is used to parse entire lists or pd.Series of names en masse. This package includes a command line tool to parse a single name for convenient one-off testing and examples.

Human names can be difficult to work with in data. Varying quality and practices across institutions and datasets introduce noise and cause misrepresentation, increasing linkage and deduplication challenges. Common errors and discrepancies include (and this list is by no means exhaustive):

Arbitrarily split first and middle names.
Misplaced prefixes of last names such as "van" and "de la."
Multiple last names partitioned into middle name fields.
Titles and suffixes variously recorded in different fields, with or without separators.
Inconsistent capture of accents, the ʻokina, and other non-ASCII characters.
Single name fields arbitrarily concatenating name parts.

Nominally produces fields intended for comparisons between or within datasets. As such, names come out formatted for data without regard to human syntactic preference: de von ausfern, mr johann g rather than Mr. Johann G. de von Ausfern.

📜 Documentation

Full nominally documentation is maintained on ReadTheDocs: https://nominally.readthedocs.io/en/latest/

⛏️ Installation

Releases of nominally are distributed on PyPI, so the recommended approach is to install via pip:

$ python -m pip install -U nominally

📓 Getting Started

Call parse_name() to parse out the six core fields:

$ python -q
>>> from nominally import parse_name
>>> parse_name("Vimes, jr, Mr. Samuel 'Sam'")
{
    'title': 'mr',
    'first': 'samuel',
    'middle': '',
    'last': 'vimes',
    'suffix': 'jr',
    'nickname': 'sam'
}

Dive into the Name class to parse out a reformatted string...

>>> from nominally import Name
>>> n = Name("Vimes, jr, Mr. Samuel 'Sam'")
>>> n
Name({
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
})
>>> str(n)
'vimes, mr samuel (sam) jr'

...or use the dict...

>>> dict(n)
{
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
}
>>> list(n.values())
['mr', 'samuel', '', 'vimes', 'jr', 'sam']

...or retrieve a more elaborate set of attributes...

>>> n.report()
{
  'raw': "Vimes, jr, Mr. Samuel 'Sam'",
  'cleaned': {'jr', 'sam', 'vimes, mr samuel'},
  'parsed': 'vimes, mr samuel (sam) jr',
  'list': ['mr', 'samuel', '', 'vimes', 'jr', 'sam'],
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
}

...or capture individual attributes.

>>> n.first
'samuel'
>>> n['last']
'vimes'
>>> n.get('suffix')
'jr'
>>> n.raw
"Vimes, jr, Mr. Samuel 'Sam'"

🖥️ Command Line

For a quick report, invoke the nominally command line tool:

$ nominally "Vimes, jr, Mr. Samuel 'Sam'"
       raw: Vimes, jr, Mr. Samuel 'Sam'
   cleaned: {'jr', 'vimes, mr samuel', 'sam'}
    parsed: vimes, mr samuel (sam) jr
      list: ['mr', 'samuel', '', 'vimes', 'jr', 'sam']
     title: mr
     first: samuel
    middle:
      last: vimes
    suffix: jr
  nickname: sam

🔬 Worked Examples

Binder hosts live Jupyter notebooks walking through examples of nominally.

These notebooks and additional examples reside in the Nominally Examples repository.

👩‍💻 Community

Interested in helping to improve nominally? Please see CONTRIBUTING.md.

CONTRIBUTING.md also includes directions to run tests, using a clone of the full repository.

Having problems with nominally? Need help or support? Feel free to open an issue here on GitHub, or contact me via email or Twitter (see my profile for links).

💡 Acknowledgements

Nominally started as a fork of the python-nameparser package, and has benefitted considerably from this origin, especially the wealth of examples and tests developed for python-nameparser.

nominally's Issues

Basic Name, .api documentation to readthedocs

Strip numbers following suffix extraction

Do a light profiling to ensure there aren't crazy problems

python -m pip install snakeviz
python -m cProfile -o nominally.cprof -m pytest
snakeviz nominally.cprof

Drop from OS to Python versions in CircleCI

Test in 3.6 and 3.8, lint and deploy from 3.7.

_extract_suffixes chokes on oddly spaced "andrews, dr "

zip safe?

Yea or nay?

Refactor Name._combine_rightmost_prefixes()

Flagged by codeclimate

Refactor using coverage

Contextualize and distribute examples

Just via github? In package at all? Links, perhaps?

The answer to this helps resolve #19

Hyphenated numbers are not fully ignored

$ python -m nominally "14,A,A"
       raw: 14,A,A
   cleaned: {'14, a, a'}
    parsed: a, a
      list: ['', 'a', '', 'a', '', '']
     title:
     first: a
    middle:
      last: a
    suffix:
  nickname:
$ python -m nominally "1-4,A,A"
       raw: 1-4,A,A
   cleaned: {'1-4, a, a'}
    parsed: a a
      list: ['', 'a', 'a', '', '', '']
     title:
     first: a
    middle: a
      last:
    suffix:
  nickname:

This was found via hypothesis generating "¼,A,A"
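
One possible way to make the two inputs above behave alike is to drop numeric tokens, including hyphen- or slash-joined ones, before splitting on commas. The following is a minimal sketch under that assumption, not nominally's actual cleaning code; `NUMERIC_TOKEN` and `strip_numeric_tokens` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical pattern: a run of digits, optionally joined by '-' or '/'
# to further runs of digits (e.g. "14", "1-4", "1/4").
NUMERIC_TOKEN = re.compile(r"\d+(?:[-/]\d+)*")


def strip_numeric_tokens(raw: str) -> str:
    """Remove numeric tokens, then drop the empty comma-separated pieces."""
    without_numbers = NUMERIC_TOKEN.sub("", raw)
    pieces = [piece.strip() for piece in without_numbers.split(",")]
    return ", ".join(piece for piece in pieces if piece)


print(strip_numeric_tokens("14,A,A"))
print(strip_numeric_tokens("1-4,A,A"))
```

With this approach both inputs reduce to the same string before any field assignment happens, so they could not be parsed differently.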

Make sure there are no hideous regressions

Run examples vs. python-nameparser to ensure nothing weird has slipped through the cracks.

Improve parsing of names with nicknames

By treating the nickname as a field separator, a first/middle or middle/last split could potentially be maintained. Currently the nickname is invisible to the subsequent processing.
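
The idea could be sketched roughly as follows: find the nickname, then keep the text on either side of it as two separate segments instead of gluing them back together. This is only an illustration, not the parser's current logic; `NICKNAME` and `split_on_nickname` are hypothetical, and the sketch assumes nicknames are delimited by straight quotes or parentheses.

```python
import re
from typing import Tuple

# Assumption: a nickname is wrapped in "...", '...', or (...).
# Limitation: an apostrophe inside a name (O'Brien) can confuse the
# single-quote branch; a real implementation would need to guard against it.
NICKNAME = re.compile(r"""\s*(?:"([^"]+)"|'([^']+)'|\(([^)]+)\))\s*""")


def split_on_nickname(raw: str) -> Tuple[str, str, str]:
    """Return (text before nickname, nickname, text after nickname)."""
    match = NICKNAME.search(raw)
    if match is None:
        return raw.strip(), "", ""
    nickname = next(group for group in match.groups() if group)
    return raw[: match.start()].strip(), nickname, raw[match.end() :].strip()


print(split_on_nickname("Mary Ann 'Molly' Smith Jones"))
```

Keeping the two segments apart means a later step could, for example, prefer to place the before-segment in first/middle and the after-segment in last.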

Fix "method_complexity" issue in nominally/parser.py _combine_rightmost_prefixes

Function _combine_rightmost_prefixes has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.

https://codeclimate.com/github/vaneseltine/nominally/nominally/parser.py#issue_5d859e0248d9f0000100002d

Leveraging last, first, middle comma split breaks idempotence

Name("von floogle, moogle mary, mcdoogle") loses its second comma when converted to the str "von floogle, moogle mary mcdoogle". Reparsing that string then moves "mary" into the middle name.
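
A property like this can be checked generically. The helper below is a small sketch; `is_idempotent` is a hypothetical name, and the two demonstration functions are toys rather than nominally code.

```python
from typing import Callable


def is_idempotent(roundtrip: Callable[[str], str], raw: str) -> bool:
    """True if applying roundtrip a second time changes nothing."""
    once = roundtrip(raw)
    return roundtrip(once) == once


# Toy demonstrations: lowercasing is idempotent, appending is not.
print(is_idempotent(str.lower, "Sam Vimes"))
print(is_idempotent(lambda s: s + "x", "Sam Vimes"))
```

For nominally, the roundtrip could plausibly be `lambda s: str(Name(s))`, which would flag inputs such as the one above.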

Fix cleaned version of names

Name: blueberry, jr r

May be related to #7

Add tests for api.py

from nominally.api import parse_name, report, prettier_print

def test_parse_name():
    pass

def test_report():
    pass

def test_prettier_print():
    pass

Increase suffix expectations when further initials are unlikely

If two full names precede, it's unlikely in the target use cases that a last name will be abbreviated. E.g.

William Henry Jameson V is the fifth of his name
William H J V has a last name starting with a 'V'
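
A rough sketch of that heuristic, for discussion only: treat a trailing generational token as a suffix only when at least two of the preceding tokens are spelled out in full. `GENERATIONAL` and `trailing_token_is_suffix` are hypothetical, and the set of numerals is an illustrative assumption rather than the package's list.

```python
from typing import List

# Assumed, illustrative set of generational suffixes.
GENERATIONAL = {"ii", "iii", "iv", "v"}


def trailing_token_is_suffix(tokens: List[str]) -> bool:
    """Guess whether the final token is a generational suffix."""
    cleaned = [token.strip(".").lower() for token in tokens]
    if len(cleaned) < 3 or cleaned[-1] not in GENERATIONAL:
        return False
    # Count preceding tokens that are full names rather than initials.
    full_names = [token for token in cleaned[:-1] if len(token) > 1]
    return len(full_names) >= 2


print(trailing_token_is_suffix("William Henry Jameson V".split()))
print(trailing_token_is_suffix("William H J V".split()))
```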

Check fingerprint between cleaned and final

junior -> jr

Initialize Name() with a Name

Currently only string input is accepted.
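
One way this might look, shown on a toy class rather than the real Name: accept either a string or an existing instance and reuse the instance's raw string. `ToyName` is a hypothetical stand-in; only the `raw` attribute mirrors something shown in the README.

```python
from typing import Union


class ToyName:
    """Hypothetical stand-in that keeps only the raw input string."""

    def __init__(self, raw: Union[str, "ToyName"] = "") -> None:
        # If given another instance, start over from its original raw string.
        if isinstance(raw, ToyName):
            raw = raw.raw
        self.raw: str = raw


original = ToyName("Vimes, jr, Mr. Samuel 'Sam'")
copy = ToyName(original)
print(copy.raw == original.raw)
```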

Smart quotes

Fix raw = "Ramsay “R.J.” Jackson Canning"
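
A minimal sketch of one possible fix is to map curly quotes to their straight equivalents before any nickname handling. `SMART_QUOTES` and `straighten_quotes` are hypothetical names, and the four characters covered are an assumption about which smart quotes matter.

```python
# Map left/right double and single curly quotes to straight quotes.
SMART_QUOTES = str.maketrans(
    {
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark
    }
)


def straighten_quotes(raw: str) -> str:
    """Replace curly quotes with straight ones, leaving other text alone."""
    return raw.translate(SMART_QUOTES)


print(straighten_quotes("Ramsay \u201cR.J.\u201d Jackson Canning"))
```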

only take one generational suffix

ii should probably be treated as a suffix more often, see #8

rogers, fred ii mcfeely
underhill, frodo v., ii

Empirically assess prefixes

Currently the list stands at:

PREFIXES = {
    "abu",
    "bin",
    "bon",
    "da",
    "dal",
    "de",
    "dei",
    "del",
    "dela",
    "della",
    "delle",
    "delli",
    "dello",
    "dem",
    "der",
    "di",
    "do",
    "dos",
    "du",
    "ibn",
    "la",
    "le",
    "mc",
    "mac",
    "san",
    "santa",
    "st",
    "ste",
    "van",
    "vel",
    "von",
}

Some of these are in fairly common use as given names:

Della   29,219
Van     22,943
Von      4,608
Del      4,454
Santa    2,852
San      2,308
La       1,663
Le       1,427
De       1,259

We certainly want to support Van, but what of Della and Del? San?
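
To make the question concrete, the counts above can be screened against a cutoff. This is only a sketch: the threshold of 4,000 is an arbitrary assumption, and `GIVEN_NAME_COUNTS` and `risky_prefixes` are hypothetical names.

```python
from typing import Dict, List

# Given-name counts copied from the table above.
GIVEN_NAME_COUNTS: Dict[str, int] = {
    "della": 29219,
    "van": 22943,
    "von": 4608,
    "del": 4454,
    "santa": 2852,
    "san": 2308,
    "la": 1663,
    "le": 1427,
    "de": 1259,
}


def risky_prefixes(counts: Dict[str, int], threshold: int) -> List[str]:
    """Prefixes seen as given names at least `threshold` times, most common first."""
    flagged = [prefix for prefix, count in counts.items() if count >= threshold]
    return sorted(flagged, key=lambda prefix: -counts[prefix])


print(risky_prefixes(GIVEN_NAME_COUNTS, threshold=4000))
```

A list like this only identifies candidates for review; whether to keep each one as a prefix is still a judgment call.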

tried running the [...] code that used nominally, and we hit an error with _grab_junior()

At line 181, the index() call either returns a value or raises ValueError.

I'll run it again later to get the string.
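
For reference, `list.index()` raises ValueError when the item is absent, so a guarded lookup is one common pattern. The sketch below is generic and is not the code of `_grab_junior()`; `find_token` is a hypothetical helper.

```python
from typing import List, Optional


def find_token(tokens: List[str], target: str) -> Optional[int]:
    """Return the position of target, or None instead of raising ValueError."""
    try:
        return tokens.index(target)
    except ValueError:
        return None


print(find_token(["vimes", "samuel", "jr"], "jr"))
print(find_token(["vimes", "samuel"], "jr"))
```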

Improve typing of list(name) and dict(name)

dict(Name())

nominally\__init__.py:15: error: No overload variant of "dict" matches argument type "Name"
nominally\__init__.py:15: note: Possible overload variants:
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     <1 more non-matching overload not shown>
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]

list(Name())

nominally\__init__.py:17: error: No overload variant of "list" matches argument type "Name"
nominally\__init__.py:17: note: Possible overload variant:
nominally\__init__.py:17: note:     def [_T] __init__(self, iterable: Iterable[_T]) -> List[_T]
nominally\__init__.py:17: note:     <1 more non-matching overload not shown>
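
Errors of this kind are commonly resolved by having the class implement a typed mapping interface, so that type checkers can see `keys()`, `__getitem__`, and `__iter__`. The sketch below shows the pattern on a toy class; `ToyName` is hypothetical and is not how Name is actually defined. In this toy, `list()` yields the keys, which may differ from the behavior wanted for Name.

```python
from typing import Dict, Iterator, Mapping


class ToyName(Mapping[str, str]):
    """Hypothetical stand-in exposing name fields as a typed mapping."""

    def __init__(self, **fields: str) -> None:
        self._fields: Dict[str, str] = dict(fields)

    def __getitem__(self, key: str) -> str:
        return self._fields[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._fields)

    def __len__(self) -> int:
        return len(self._fields)


toy = ToyName(first="samuel", last="vimes")
print(dict(toy))
print(list(toy))
print(list(toy.values()))
```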