vaneseltine / nominally

A maximum-strength name parser for record linkage.

License: GNU Affero General Public License v3.0

Python 95.53% TeX 4.47%
data-science parsing human-name entity-resolution record-linkage deduplication parser data-matching

nominally's Introduction

nominally: a maximum-strength name parser for record linkage

License: AGPL 3.0+ Distributed via PyPI

🔗 Names

Nominally simplifies and parses a personal name written in Western name order into six core fields: title, first, middle, last, suffix, and nickname.

Typically, nominally is used to parse entire lists or pd.Series of names en masse. This package includes a command line tool to parse a single name for convenient one-off testing and examples.

Human names can be difficult to work with in data. Varying quality and practices across institutions and datasets introduce noise and cause misrepresentation, increasing linkage and deduplication challenges. Common errors and discrepancies include (and this list is by no means exhaustive):

  • Arbitrarily split first and middle names.
  • Misplaced prefixes of last names such as "van" and "de la."
  • Multiple last names partitioned into middle name fields.
  • Titles and suffixes variously recorded in different fields, with or without separators.
  • Inconsistent capture of accents, the ʻokina, and other non-ASCII characters.
  • Single name fields arbitrarily concatenating name parts.

Nominally produces fields intended for comparisons between or within datasets. As such, names come out formatted for data without regard to human syntactic preference: "de von ausfern, mr johann g" rather than "Mr. Johann G. de von Ausfern".

📜 Documentation

Full nominally documentation is maintained on ReadTheDocs: https://nominally.readthedocs.io/en/latest/

⛏️ Installation

Releases of nominally are distributed on PyPI, so the recommended approach is to install via pip:

$ python -m pip install -U nominally

📓 Getting Started

Call parse_name() to parse out the six core fields:

$ python -q
>>> from nominally import parse_name
>>> parse_name("Vimes, jr, Mr. Samuel 'Sam'")
{
    'title': 'mr',
    'first': 'samuel',
    'middle': '',
    'last': 'vimes',
    'suffix': 'jr',
    'nickname': 'sam'
}

Dive into the Name class to parse out a reformatted string...

>>> from nominally import Name
>>> n = Name("Vimes, jr, Mr. Samuel 'Sam'")
>>> n
Name({
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
})
>>> str(n)
'vimes, mr samuel (sam) jr'

...or use the dict...

>>> dict(n)
{
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
}
>>> list(n.values())
['mr', 'samuel', '', 'vimes', 'jr', 'sam']

...or retrieve a more elaborate set of attributes...

>>> n.report()
{
  'raw': "Vimes, jr, Mr. Samuel 'Sam'",
  'cleaned': {'jr', 'sam', 'vimes, mr samuel'},
  'parsed': 'vimes, mr samuel (sam) jr',
  'list': ['mr', 'samuel', '', 'vimes', 'jr', 'sam'],
  'title': 'mr',
  'first': 'samuel',
  'middle': '',
  'last': 'vimes',
  'suffix': 'jr',
  'nickname': 'sam'
}

...or capture individual attributes.

>>> n.first
'samuel'
>>> n['last']
'vimes'
>>> n.get('suffix')
'jr'
>>> n.raw
"Vimes, jr, Mr. Samuel 'Sam'"

🖥️ Command Line

For a quick report, invoke the nominally command line tool:

$ nominally "Vimes, jr, Mr. Samuel 'Sam'"
       raw: Vimes, jr, Mr. Samuel 'Sam'
   cleaned: {'jr', 'vimes, mr samuel', 'sam'}
    parsed: vimes, mr samuel (sam) jr
      list: ['mr', 'samuel', '', 'vimes', 'jr', 'sam']
     title: mr
     first: samuel
    middle:
      last: vimes
    suffix: jr
  nickname: sam

🔬 Worked Examples

Binder hosts live Jupyter notebooks walking through examples of nominally.

     csv.ipynb on mybinder.org

     pandas_simple.ipynb on mybinder.org

These notebooks and additional examples reside in the Nominally Examples repository.

👩‍💻 Community

Interested in helping to improve nominally? Please see CONTRIBUTING.md.

CONTRIBUTING.md also includes directions to run tests, using a clone of the full repository.

Having problems with nominally? Need help or support? Feel free to open an issue here on GitHub, or contact me via email or Twitter (see my profile for links).

🧙‍ Author

Matt VanEseltine

https://pypi.org/user/matvan/

matvan@umich.edu

https://github.com/vaneseltine

https://twitter.com/vaneseltine

https://stackoverflow.com/users/7846185/matt-vaneseltine

💡 Acknowledgements

Nominally started as a fork of the python-nameparser package, and has benefited considerably from this origin, especially the wealth of examples and tests developed for python-nameparser.

nominally's People

Contributors

abnerjacobsen, arusahni, corbinbs, derek73, edwardbetts, kelvins, peterscott, svisser, tyvik, vaneseltine

nominally's Issues

Hyphenated numbers are not fully ignored

$ python -m nominally "14,A,A"
       raw: 14,A,A
   cleaned: {'14, a, a'}
    parsed: a, a
      list: ['', 'a', '', 'a', '', '']
     title:
     first: a
    middle:
      last: a
    suffix:
  nickname:
$ python -m nominally "1-4,A,A"
       raw: 1-4,A,A
   cleaned: {'1-4, a, a'}
    parsed: a a
      list: ['', 'a', 'a', '', '', '']
     title:
     first: a
    middle: a
      last:
    suffix:
  nickname:

This was found via hypothesis generating "¼,A,A"
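
The two runs above differ only in the hyphen: "14" is dropped, while "1-4" still occupies a position and shifts "a, a" into first and middle. One possible direction, as a minimal sketch only: treat a comma-separated piece as numeric if nothing but numeric characters remains once hyphens are removed. The helpers `is_numeric_piece` and `drop_numeric_pieces` are hypothetical and are not part of nominally.

```python
import re

# Hyphen-like characters that may join digits in stray identifiers.
_JOINERS = re.compile(r"[-\u2010-\u2015]")


def is_numeric_piece(piece: str) -> bool:
    """True if only numeric characters remain once hyphens are removed."""
    bare = _JOINERS.sub("", piece.strip())
    # str.isnumeric() also covers characters such as "¼".
    return bare.isnumeric()


def drop_numeric_pieces(raw: str) -> str:
    """Remove comma-separated pieces that are purely numeric (hypothetical helper)."""
    pieces = [p.strip() for p in raw.split(",")]
    return ", ".join(p for p in pieces if p and not is_numeric_piece(p))


for raw in ("14,A,A", "1-4,A,A", "¼,A,A"):
    print(repr(raw), "->", repr(drop_numeric_pieces(raw)))
```

Under this sketch all three inputs reduce to 'A, A' before any parsing happens, so the hyphenated and unhyphenated cases would be treated alike.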

Improve parsing of names with nicknames

By treating the nickname as a field separator, a first/middle or middle/last split could potentially be maintained. Currently the nickname is invisible to the subsequent processing.

Add tests for api.py

from nominally.api import parse_name, report, prettier_print

def test_parse_name():
    pass

def test_report():
    pass

def test_prettier_print():
    pass

Smart quotes

Fix raw = "Ramsay “R.J.” Jackson Canning"
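
A minimal sketch of one way to handle this, assuming curly quotes are normalized to straight quotes before nickname detection runs. The function `straighten_quotes` is a hypothetical name, not nominally's API.

```python
# Map curly quotation marks to their ASCII equivalents.
_QUOTE_TABLE = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
})


def straighten_quotes(raw: str) -> str:
    """Replace curly quotes with straight quotes (hypothetical helper)."""
    return raw.translate(_QUOTE_TABLE)


print(straighten_quotes("Ramsay \u201cR.J.\u201d Jackson Canning"))
# prints: Ramsay "R.J." Jackson Canning
```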

Empirically assess prefixes

Currently the list stands at:

PREFIXES = {
    "abu",
    "bin",
    "bon",
    "da",
    "dal",
    "de",
    "dei",
    "del",
    "dela",
    "della",
    "delle",
    "delli",
    "dello",
    "dem",
    "der",
    "di",
    "do",
    "dos",
    "du",
    "ibn",
    "la",
    "le",
    "mc",
    "mac",
    "san",
    "santa",
    "st",
    "ste",
    "van",
    "vel",
    "von",
}

Some of these are in fairly common use as given names:

Della   29,219
Van     22,943
Von      4,608
Del      4,454
Santa    2,852
San      2,308
La       1,663
Le       1,427
De       1,259

We certainly want to support Van, but what of Della and Del? San?
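
To make the question concrete, here is a small sketch that flags prefixes whose given-name count meets a cutoff. The counts are copied from the list above; the function name and the cutoff of 4,000 are assumptions chosen only for illustration.

```python
# Given-name counts as listed in this issue.
GIVEN_NAME_COUNTS = {
    "della": 29_219,
    "van": 22_943,
    "von": 4_608,
    "del": 4_454,
    "santa": 2_852,
    "san": 2_308,
    "la": 1_663,
    "le": 1_427,
    "de": 1_259,
}


def ambiguous_prefixes(counts, threshold):
    """Return prefixes whose given-name count meets the threshold, most common first."""
    flagged = [prefix for prefix, n in counts.items() if n >= threshold]
    return sorted(flagged, key=counts.get, reverse=True)


# The cutoff is arbitrary; an empirical assessment would have to justify it.
print(ambiguous_prefixes(GIVEN_NAME_COUNTS, threshold=4_000))
# prints: ['della', 'van', 'von', 'del']
```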

Clean nicknames

Nicknames are not cleaned in the same way as post-nickname full names.

4th, 5th

Continue to support /[4-9]th/ suffixes; otherwise they turn into "th."
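
A sketch of the idea, assuming the ordinal is pulled out before any step that strips digits. The pattern follows the /[4-9]th/ range given above; `pull_ordinal_suffix` is a hypothetical helper, not a function in nominally.

```python
import re
from typing import Tuple

# Ordinal generational suffixes named in the issue: 4th through 9th.
ORDINAL_SUFFIX = re.compile(r"\b[4-9]th\b", re.IGNORECASE)


def pull_ordinal_suffix(name: str) -> Tuple[str, str]:
    """Split off the first ordinal suffix, returning (remaining_name, suffix)."""
    match = ORDINAL_SUFFIX.search(name)
    if match is None:
        return name, ""
    remaining = name[: match.start()] + name[match.end():]
    # Collapse the whitespace left behind by the removed suffix.
    return " ".join(remaining.split()), match.group().lower()


print(pull_ordinal_suffix("Henry Tudor 8th"))
# prints: ('Henry Tudor', '8th')
```

The word boundaries keep the pattern from matching inside longer tokens such as "14th".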

Resolve ambiguity in first-name prefixes

Currently "van meep de ook" -> first "van meep", last "de ook"
But "de ook, van meep" -> first "van", middle "meep", last "de ook"

So should van/de/etc operate on first names at all? Should we just identify the last name cluster and only run it on that?

Support multi-word prefixes?

E.g. 'von der X' and 'von dem Y' without necessarily picking up 'der' or 'dem' on their own. Might not be necessary.
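
One way this could look, sketched under the assumption that names are already tokenized and lowercased: try two-word prefixes first, then fall back to single-word prefixes. Both sets below are illustrative stand-ins and `join_prefixes` is a hypothetical function; none of this is taken from nominally's code.

```python
# Hypothetical multi-word prefixes, taken from the examples in this issue.
MULTIWORD_PREFIXES = {("von", "der"), ("von", "dem")}
# A small illustrative subset of single-word prefixes; "der" and "dem" are left out on purpose.
SINGLE_PREFIXES = {"von", "van", "de"}


def join_prefixes(tokens):
    """Attach prefixes to the token that follows them, trying two-word prefixes first."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD_PREFIXES and i + 2 < len(tokens):
            out.append(" ".join(tokens[i:i + 3]))
            i += 3
        elif tokens[i] in SINGLE_PREFIXES and i + 1 < len(tokens):
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(join_prefixes(["johann", "von", "der", "x"]))
# prints: ['johann', 'von der x']
print(join_prefixes(["der", "y"]))
# prints: ['der', 'y']
```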

Refactor cleaning

Currently it's spread out across multiple stages, esp. for the swept strings.

strip() hyphens from final results

Otherwise hyphens are left over in unusual cases, e.g., misplaced id number: J 2309-23492-3234 Smith ends up with a middle name of ---.
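
A minimal sketch of the proposed cleanup, assuming it runs on the final parsed fields. The input dict is a hypothetical parse result for the example above, and `strip_hyphens` is not a function in nominally.

```python
def strip_hyphens(fields):
    """Strip leading and trailing hyphens from each parsed field."""
    return {key: value.strip("-") for key, value in fields.items()}


# Hypothetical parse result for "J 2309-23492-3234 Smith" once digits are removed.
parsed = {"first": "j", "middle": "---", "last": "smith"}
print(strip_hyphens(parsed))
# prints: {'first': 'j', 'middle': '', 'last': 'smith'}
```

Because str.strip("-") only touches the ends of a string, hyphens inside a field such as "smith-jones" are preserved.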

_grab_junior() error

tried running the [...] code that used nominally, and we hit an error with _grab_junior()

Line 181: the index function either returns a value or raises ValueError.

I'll run it again later to get the string.
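
The failing input is not known yet, so the following only illustrates the general hazard, assuming the call in question is list.index() (str.index() raises in the same way). The function `find_junior` is a hypothetical stand-in and does not reproduce `_grab_junior()`.

```python
from typing import List, Optional


def find_junior(pieces: List[str]) -> Optional[int]:
    """Return the position of the first 'junior' piece, or None if absent."""
    try:
        return pieces.index("junior")
    except ValueError:
        # list.index() raises ValueError rather than returning a sentinel.
        return None


print(find_junior(["samuel", "vimes", "junior"]))
# prints: 2
print(find_junior(["samuel", "vimes"]))
# prints: None
```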

Improve typing of list(name) and dict(name)

dict(Name())

nominally\__init__.py:15: error: No overload variant of "dict" matches argument type "Name"
nominally\__init__.py:15: note: Possible overload variants:
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     <1 more non-matching overload not shown>
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]

list(Name())

nominally\__init__.py:17: error: No overload variant of "list" matches argument type "Name"
nominally\__init__.py:17: note: Possible overload variant:
nominally\__init__.py:17: note:     def [_T] __init__(self, iterable: Iterable[_T]) -> List[_T]
nominally\__init__.py:17: note:     <1 more non-matching overload not shown> 
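
Both errors suggest that the type checker does not see Name as a Mapping or an Iterable. A sketch of one possible approach, assuming the class can declare the Mapping interface: `NameSketch` below is a hypothetical stand-in, not nominally's Name class.

```python
from typing import Dict, Iterator, Mapping


class NameSketch(Mapping[str, str]):
    """Hypothetical stand-in for Name that declares the Mapping interface."""

    def __init__(self, fields: Dict[str, str]) -> None:
        self._fields = dict(fields)

    def __getitem__(self, key: str) -> str:
        return self._fields[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._fields)

    def __len__(self) -> int:
        return len(self._fields)


n = NameSketch({"first": "samuel", "last": "vimes"})
as_dict: Dict[str, str] = dict(n)  # matches the Mapping overload of dict()
values = list(n.values())          # values() is provided by the Mapping mixin
keys = list(n)                     # iterating a Mapping yields its keys
print(as_dict, values, keys)
```

With the three abstract methods defined, Mapping supplies keys(), values(), items(), and get(), which lines up with the n.values() and n.get('suffix') usage shown in Getting Started.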
