vaneseltine / nominally

A maximum-strength name parser for record linkage.

License: GNU Affero General Public License v3.0

Python 95.53% TeX 4.47%
data-science parsing human-name entity-resolution record-linkage deduplication parser data-matching

nominally's People

Contributors

abnerjacobsen, arusahni, corbinbs, derek73, edwardbetts, kelvins, peterscott, svisser, tyvik, vaneseltine

nominally's Issues

strip() hyphens from final results

Otherwise hyphens are left over in unusual cases, e.g., with a misplaced ID number: "J 2309-23492-3234 Smith" ends up with a middle name of "---".
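A minimal sketch of the proposed fix, assuming the parsed fields are available as a plain dict of strings. The helper name `strip_hyphens` is hypothetical and not part of nominally.

```python
from typing import Dict


def strip_hyphens(parts: Dict[str, str]) -> Dict[str, str]:
    """Remove leading/trailing hyphens and spaces from every parsed field."""
    # Interior hyphens (e.g. "smith-jones") are left untouched.
    return {key: value.strip("- ") for key, value in parts.items()}


parsed = {"first": "j", "middle": "---", "last": "smith"}
print(strip_hyphens(parsed))  # {'first': 'j', 'middle': '', 'last': 'smith'}
```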

Hyphenated numbers are not fully ignored

ɪ (.venv) nominally › python -m nominally "14,A,A"
       raw: 14,A,A
   cleaned: {'14, a, a'}
    parsed: a, a
      list: ['', 'a', '', 'a', '', '']
     title:
     first: a
    middle:
      last: a
    suffix:
  nickname:
ɪ (.venv) nominally › python -m nominally "1-4,A,A"
       raw: 1-4,A,A
   cleaned: {'1-4, a, a'}
    parsed: a a
      list: ['', 'a', 'a', '', '', '']
     title:
     first: a
    middle: a
      last:
    suffix:
  nickname:

This was found via hypothesis generating "¼,A,A"
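One way to make both inputs parse the same is to drop a whole run of digits joined by hyphens or slashes in a single step, rather than dropping the digits and leaving the separator behind. The sketch below assumes NFKD normalization happens first (which turns "¼" into "1", a fraction slash, and "4"). The names `NUMERIC_RUN` and `drop_numeric_runs` are hypothetical, not nominally's own.

```python
import re
import unicodedata

# Hypothetical pattern: digits, optionally joined by hyphen, slash, or fraction slash.
NUMERIC_RUN = re.compile(r"\d+(?:[-/\u2044]\d+)*")


def drop_numeric_runs(raw: str) -> str:
    """Remove numeric runs, including their internal separators."""
    normalized = unicodedata.normalize("NFKD", raw)
    return NUMERIC_RUN.sub("", normalized)


for raw in ("14,A,A", "1-4,A,A", "\u00bc,A,A"):
    print(repr(drop_numeric_runs(raw)))  # ',A,A' each time
```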

Improve typing of list(name) and dict(name)

dict(Name())

nominally\__init__.py:15: error: No overload variant of "dict" matches argument type "Name"
nominally\__init__.py:15: note: Possible overload variants:
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note:     <1 more non-matching overload not shown>

list(Name())

nominally\__init__.py:17: error: No overload variant of "list" matches argument type "Name"
nominally\__init__.py:17: note: Possible overload variant:
nominally\__init__.py:17: note:     def [_T] __init__(self, iterable: Iterable[_T]) -> List[_T]
nominally\__init__.py:17: note:     <1 more non-matching overload not shown> 
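A possible direction, sketched with a hypothetical stand-in class rather than nominally's actual Name: `dict()` only needs `keys()` and `__getitem__`, and `list()` only needs `__iter__`, so annotating those three methods explicitly should give a type checker something to match against. Whether this satisfies the overloads of a given mypy and typeshed version is an assumption to verify.

```python
from typing import Iterable, Iterator, Tuple


class NameSketch:
    """Hypothetical stand-in for Name; not nominally's actual class."""

    _keys: Tuple[str, ...] = ("title", "first", "middle", "last", "suffix", "nickname")

    def __init__(self, **parts: str) -> None:
        self._parts = {key: parts.get(key, "") for key in self._keys}

    def keys(self) -> Iterable[str]:
        # dict(obj) calls keys() and then obj[key] for each key.
        return self._parts.keys()

    def __getitem__(self, key: str) -> str:
        return self._parts[key]

    def __iter__(self) -> Iterator[str]:
        # list(obj) iterates over the values, in field order.
        return iter(self._parts.values())


name = NameSketch(first="a", last="a")
print(list(name))  # ['', 'a', '', 'a', '', '']
print(dict(name)["last"])  # a
```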

4th, 5th

Continue to support /[4-9]th/ suffixes; otherwise they turn into "th."
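A minimal sketch of the check, assuming tokens can be tested for the suffix pattern before any digits are removed. The helper name `is_ordinal_suffix` is hypothetical.

```python
import re

# The pattern from the issue: a single digit 4-9 followed by "th".
ORDINAL_SUFFIX = re.compile(r"[4-9]th")


def is_ordinal_suffix(token: str) -> bool:
    """Return True only when the whole token is an ordinal such as '4th'."""
    return ORDINAL_SUFFIX.fullmatch(token.lower()) is not None


print(is_ordinal_suffix("4th"))  # True
print(is_ordinal_suffix("th"))   # False
```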

Support multi-word prefixes?

E.g. 'von der X' and 'von dem Y' without necessarily picking up 'der' or 'dem' on their own. Might not be necessary.
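If this turns out to be worth doing, one option is to match multi-word prefixes as token tuples, so that 'der' and 'dem' never count on their own. The sketch below is an illustration only; `MULTI_PREFIXES` and `join_multiword_prefixes` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical multi-word prefixes; "der" and "dem" are deliberately not listed alone.
MULTI_PREFIXES: Tuple[Tuple[str, ...], ...] = (("von", "der"), ("von", "dem"))


def join_multiword_prefixes(tokens: List[str]) -> List[str]:
    """Attach a matched multi-word prefix to the token that follows it."""
    result: List[str] = []
    i = 0
    while i < len(tokens):
        for prefix in MULTI_PREFIXES:
            end = i + len(prefix)
            # Require one more token after the prefix to attach it to.
            if tuple(tokens[i:end]) == prefix and end < len(tokens):
                result.append(" ".join(tokens[i:end + 1]))
                i = end + 1
                break
        else:
            result.append(tokens[i])
            i += 1
    return result


print(join_multiword_prefixes(["anna", "von", "der", "x"]))  # ['anna', 'von der x']
print(join_multiword_prefixes(["der", "x"]))  # ['der', 'x']
```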

Clean nicknames

Nicknames are not cleaned in the same way as post-nickname full names.

Add tests for api.py

from nominally.api import parse_name, report, prettier_print

def test_parse_name():
    pass

def test_report():
    pass

def test_prettier_print():
    pass

Empirically assess prefixes

Currently the list stands at:

PREFIXES = {
    "abu",
    "bin",
    "bon",
    "da",
    "dal",
    "de",
    "dei",
    "del",
    "dela",
    "della",
    "delle",
    "delli",
    "dello",
    "dem",
    "der",
    "di",
    "do",
    "dos",
    "du",
    "ibn",
    "la",
    "le",
    "mc",
    "mac",
    "san",
    "santa",
    "st",
    "ste",
    "van",
    "vel",
    "von",
}

Some of these are in fairly common use as given names:

Della   29,219
Van     22,943
Von      4,608
Del      4,454
Santa    2,852
San      2,308
La       1,663
Le       1,427
De       1,259

We certainly want to support Van, but what of Della and Del? San?
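To make the question concrete, the prefixes can be ranked by how often they occur as given names. The sketch below uses the counts quoted above; the cutoff of 5,000 and the function name `ambiguous_prefixes` are arbitrary assumptions for illustration.

```python
from typing import Dict, Iterable, List

# Given-name counts copied from the table above.
GIVEN_NAME_COUNTS: Dict[str, int] = {
    "della": 29219, "van": 22943, "von": 4608, "del": 4454, "santa": 2852,
    "san": 2308, "la": 1663, "le": 1427, "de": 1259,
}


def ambiguous_prefixes(prefixes: Iterable[str], counts: Dict[str, int], threshold: int) -> List[str]:
    """Return prefixes seen as given names at least `threshold` times, most common first."""
    hits = [p for p in prefixes if counts.get(p, 0) >= threshold]
    return sorted(hits, key=lambda p: -counts[p])


print(ambiguous_prefixes({"della", "van", "von", "mc"}, GIVEN_NAME_COUNTS, 5000))
# ['della', 'van']
```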

Improve parsing of names with nicknames

By treating the nickname as a field separator, a first/middle or middle/last split could potentially be maintained. Currently the nickname is invisible to the subsequent processing.
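A rough sketch of the idea, assuming nicknames are delimited by straight double quotes or parentheses. The pattern and the helper `split_on_nickname` are hypothetical and not how nominally currently extracts nicknames.

```python
import re
from typing import Tuple

# Assumption: a nickname is wrapped in straight double quotes or parentheses.
NICKNAME = re.compile(r'"([^"]*)"|\(([^)]*)\)')


def split_on_nickname(raw: str) -> Tuple[str, str, str]:
    """Return (text before, nickname, text after), keeping the split position."""
    match = NICKNAME.search(raw)
    if match is None:
        return raw.strip(), "", ""
    nickname = match.group(1) or match.group(2) or ""
    return raw[:match.start()].strip(), nickname.strip(), raw[match.end():].strip()


print(split_on_nickname('Ramsay "R.J." Jackson Canning'))
# ('Ramsay', 'R.J.', 'Jackson Canning')
```

Later stages could then treat the text before the nickname as first/middle material and the text after as middle/last material, instead of parsing one merged string.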

Resolve ambiguity in first-name prefixes

Currently "van meep de ook" -> first "van meep", last "de ook"
But "de ook, van meep" -> first "van", middle "meep", last "de ook"

So should van/de/etc operate on first names at all? Should we just identify the last name cluster and only run it on that?
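One way to explore the second option is to walk backward from the final token and absorb prefixes only there, leaving everything else alone. This is a sketch under stated assumptions: `split_last_cluster` is a hypothetical helper, the prefix set is a small subset for illustration, and at least one token is always left over for the given name.

```python
from typing import List, Set, Tuple

# Small subset of the prefix list, for illustration only.
PREFIXES: Set[str] = {"de", "van", "von"}


def split_last_cluster(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (remaining tokens, last-name cluster), scanning from the end."""
    if not tokens:
        return [], []
    start = len(tokens) - 1
    # Absorb prefixes directly before the surname, but keep at least one token back.
    while start > 1 and tokens[start - 1] in PREFIXES:
        start -= 1
    return tokens[:start], tokens[start:]


print(split_last_cluster("van meep de ook".split()))
# (['van', 'meep'], ['de', 'ook'])
```

With this approach "van" in the remaining tokens is treated as an ordinary given name, so both input orders would yield first "van", middle "meep", last "de ook".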

_grab_junior() error

Tried running the [...] code that used nominally, and we hit an error with _grab_junior().

Line 181: the index function either returns a value or raises ValueError.

I'll run it again later to get the string.
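The actual code at line 181 is not shown here, so the following is only a generic illustration of the failure mode: `list.index` raises ValueError when the item is absent, and a guard can turn that into None. The helper name `find_index` is hypothetical.

```python
from typing import List, Optional


def find_index(tokens: List[str], target: str) -> Optional[int]:
    """Return the position of target, or None instead of raising ValueError."""
    try:
        return tokens.index(target)
    except ValueError:
        return None


print(find_index(["john", "smith", "junior"], "junior"))  # 2
print(find_index(["john", "smith"], "junior"))  # None
```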

Refactor cleaning

Currently cleaning is spread out across multiple stages, especially for the swept strings.

Smart quotes

Fix raw = "Ramsay “R.J.” Jackson Canning"
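A possible fix is to map curly quotes to their straight equivalents before nickname extraction. This is a minimal sketch; `SMART_QUOTES` and `straighten_quotes` are hypothetical names, and where such a step belongs in the cleaning pipeline is left open.

```python
# Map left/right double and single curly quotes to straight quotes.
SMART_QUOTES = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})


def straighten_quotes(raw: str) -> str:
    """Replace smart quotes so later quote-based patterns can match."""
    return raw.translate(SMART_QUOTES)


raw = "Ramsay \u201cR.J.\u201d Jackson Canning"
print(straighten_quotes(raw))  # Ramsay "R.J." Jackson Canning
```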
