vaneseltine / nominally
A maximum-strength name parser for record linkage.
License: GNU Affero General Public License v3.0
Function _combine_rightmost_prefixes has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.
Consider confirming r'[a-z]' on each part so that, e.g., Name("#") is unparsable.
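A minimal sketch of such a check, assuming the parts are already lowercased strings. The names `has_letter` and `parsable` are hypothetical and not part of nominally's API.

```python
import re
from typing import List

# A part counts as name-like only if it contains at least one a-z letter.
LETTER = re.compile(r"[a-z]")


def has_letter(part: str) -> bool:
    """Return True if the part contains at least one lowercase ASCII letter."""
    return bool(LETTER.search(part))


def parsable(parts: List[str]) -> bool:
    """Return True only when there is at least one part and every part has a letter."""
    return bool(parts) and all(has_letter(part) for part in parts)
```

Under this sketch, `parsable(["#"])` is false, while parts such as `"o'brien"` still pass because they contain letters.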
By treating the nickname as a field separator, a first/middle or middle/last split could potentially be maintained. Currently the nickname is invisible to the subsequent processing.
Lots of variations on a theme.
Currently the list stands at:
PREFIXES = {
"abu",
"bin",
"bon",
"da",
"dal",
"de",
"dei",
"del",
"dela",
"della",
"delle",
"delli",
"dello",
"dem",
"der",
"di",
"do",
"dos",
"du",
"ibn",
"la",
"le",
"mc",
"mac",
"san",
"santa",
"st",
"ste",
"van",
"vel",
"von",
}
Some of these are in fairly common use as given names:
Della 29,219
Van 22,943
Von 4,608
Del 4,454
Santa 2,852
San 2,308
La 1,663
Le 1,427
De 1,259
We certainly want to support Van, but what of Della and Del? San?
", a, a
is broken
ɪ (.venv) nominally › python -m nominally "14,A,A"
raw: 14,A,A
cleaned: {'14, a, a'}
parsed: a, a
list: ['', 'a', '', 'a', '', '']
title:
first: a
middle:
last: a
suffix:
nickname:
ɪ (.venv) nominally › python -m nominally "1-4,A,A"
raw: 1-4,A,A
cleaned: {'1-4, a, a'}
parsed: a a
list: ['', 'a', 'a', '', '', '']
title:
first: a
middle: a
last:
suffix:
nickname:
This was found via hypothesis generating "¼,A,A"
Currently "van meep de ook" -> first "van meep", last "de ook"
But "de ook, van meep" -> first "van", middle "meep", last "de ook"
So should van/de/etc operate on first names at all? Should we just identify the last name cluster and only run it on that?
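One option raised above is to run prefix handling only on the last name cluster. The sketch below illustrates that idea under stated assumptions: tokens are lowercased, the prefix set is an abbreviated subset, and `join_prefixes` is a hypothetical helper rather than nominally's implementation.

```python
from typing import List

# Abbreviated subset of the prefix list, for illustration only.
PREFIXES = {"de", "der", "van", "von"}


def join_prefixes(tokens: List[str]) -> List[str]:
    """Fold each run of prefixes into the token that follows it."""
    joined: List[str] = []
    pending: List[str] = []
    for token in tokens:
        if token in PREFIXES:
            pending.append(token)  # hold prefixes until a non-prefix arrives
        else:
            joined.append(" ".join(pending + [token]))
            pending = []
    joined.extend(pending)  # trailing prefixes have nothing to attach to
    return joined
```

Applied to the whole string, `["van", "meep", "de", "ook"]` becomes `["van meep", "de ook"]`. Applied only to the pre-comma cluster of "de ook, van meep", `["de", "ook"]` becomes `["de ook"]` and the given names `van` and `meep` are left alone.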
Tagged by codeclimate.
Currently only string input is accepted.
Currently the apostrophes are conservatively considered part of the name; so Gwinifer 'Old Mother' Blackcap has 'Old Mother' as a middle name rather than Old Mother as a nickname.
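A possible approach, sketched under the assumption that a nickname is a single-quoted span with whitespace (or the string boundary) on both sides, so that apostrophes inside names such as o'brien are untouched. `extract_nickname` is a hypothetical helper, not nominally's API.

```python
import re
from typing import Tuple

# Opening quote must follow whitespace or the start of the string;
# closing quote must precede whitespace or the end of the string.
NICKNAME = re.compile(r"(?:^|(?<=\s))'([^']+)'(?=\s|$)")


def extract_nickname(raw: str) -> Tuple[str, str]:
    """Return (name without nickname, nickname); nickname is '' if none is found."""
    found = NICKNAME.findall(raw)
    remainder = " ".join(NICKNAME.sub(" ", raw).split())
    return remainder, " ".join(found)
```

With this pattern, "gwinifer 'old mother' blackcap" yields the nickname "old mother", while "mary o'brien" is returned unchanged with an empty nickname.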
Currently it's spread out across multiple stages, esp. for the swept strings.
Name("von floogle, moogle mary, mcdoogle") loses the comma into str "von floogle, moogle mary mcdoogle", which is reparsed to move the mary into the middle name.
Probably an issue to leave downstream -- but note in documentation.
May be related to #7
Observing Gregor, Ewan Gordon Mc
Flagged by codeclimate
Run examples vs. name-parser to ensure nothing weird has slipped through the cracks.
ii should probably be treated as a suffix more often, see #8
rogers, fred ii mcfeely
underhill, frodo v., ii
Fix raw = "Ramsay “R.J.” Jackson Canning"
E.g. 'von der X' and 'von dem Y' without necessarily picking up 'der' or 'dem' on their own. Might not be necessary.
from nominally.api import parse_name, report, prettier_print


def test_parse_name():
    pass


def test_report():
    pass


def test_prettier_print():
    pass
Just via github? In package at all? Links, perhaps?
The answer to this helps resolve #19
Yea or nay?
"Johann Gambolputty de von Ausfern" <-> "Ausfern, Johann Gambolputty de von"
dict(Name())
nominally\__init__.py:15: error: No overload variant of "dict" matches argument type "Name"
nominally\__init__.py:15: note: Possible overload variants:
nominally\__init__.py:15: note: def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note: def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note: <1 more non-matching overload not shown>
nominally\__init__.py:15: note: def [_KT, _VT] __init__(self, map: Mapping[_KT, _VT], **kwargs: _VT) -> Dict[_KT, _VT]
nominally\__init__.py:15: note: def [_KT, _VT] __init__(self, iterable: Iterable[Tuple[_KT, _VT]], **kwargs: _VT) -> Dict[_KT, _VT]
list(Name())
nominally\__init__.py:17: error: No overload variant of "list" matches argument type "Name"
nominally\__init__.py:17: note: Possible overload variant:
nominally\__init__.py:17: note: def [_T] __init__(self, iterable: Iterable[_T]) -> List[_T]
nominally\__init__.py:17: note: <1 more non-matching overload not shown>
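One way to satisfy both overloads is for the class to implement the `Mapping` interface. The sketch below is an assumption about how that could look, not nominally's actual `Name` class: `NameSketch` is hypothetical, and the field names are taken from the report output shown earlier on this page.

```python
from typing import Iterator, Mapping


class NameSketch(Mapping[str, str]):
    """Hypothetical stand-in for Name that type checkers can treat as a Mapping."""

    _keys = ("title", "first", "middle", "last", "suffix", "nickname")

    def __init__(self, **fields: str) -> None:
        # Missing fields default to the empty string.
        self._data = {key: fields.get(key, "") for key in self._keys}

    def __getitem__(self, key: str) -> str:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```

Note that iterating a mapping yields its keys, so `list(NameSketch())` would return the field names; getting the values would need `list(name.values())`.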
python -m pip install snakeviz
python -m cProfile -o nominally.cprof -m pytest
snakeviz nominally.cprof
Otherwise hyphens are left over in unusual cases, e.g., misplaced id number: J 2309-23492-3234 Smith ends up with a middle name of ---.
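A small sketch of one possible cleanup, assuming digits are stripped first (as the `---` residue suggests) and that hyphens are only meaningful between letters. `drop_stray_hyphens` is a hypothetical helper, not nominally's code.

```python
import re


def drop_stray_hyphens(cleaned: str) -> str:
    """Remove digits, then discard hyphens left dangling at token edges."""
    without_digits = re.sub(r"\d", "", cleaned)
    # Strip leading/trailing hyphens per token; interior hyphens are kept.
    tokens = [token.strip("-") for token in without_digits.split()]
    return " ".join(token for token in tokens if token)
```

Here "j 2309-23492-3234 smith" reduces to "j smith", while a hyphenated name such as "mary-anne smith" is left intact.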
jones, xin x
Maybe related to #8
Currently the periods are removed, leaving JRR||Tolkien
We want to recognize this use of periods and parse J|R R|Tolkien
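A minimal sketch of the idea, assuming periods can simply be treated as separators instead of being deleted. `split_initials` is a hypothetical name, and this ignores cases where a period belongs to a title or suffix.

```python
def split_initials(raw: str) -> str:
    """Replace periods with spaces so run-together initials become separate tokens."""
    return " ".join(raw.replace(".", " ").split())
```

With this, "j.r.r. tolkien" becomes four tokens (j, r, r, tolkien) instead of "jrr tolkien", so a later stage can assign first, middle, and last.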
Garc?a, Mart?n
Nicknames are not cleaned in the same way as post-nickname full names.
Continue to support /[4-9]th/ suffixes; otherwise they turn into "th."
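One way to keep these suffixes, sketched under the assumption that digit removal happens token by token. `strip_digits_keep_ordinals` is a hypothetical helper, not nominally's code.

```python
import re

# Matches exactly the ordinals named above: 4th through 9th.
ORDINAL = re.compile(r"[4-9]th")


def strip_digits_keep_ordinals(cleaned: str) -> str:
    """Remove digits from every token except whole-token ordinals like '5th'."""
    kept = []
    for token in cleaned.split():
        if ORDINAL.fullmatch(token):
            kept.append(token)  # protect the ordinal suffix as-is
        else:
            kept.append(re.sub(r"\d", "", token))
    return " ".join(token for token in kept if token)
```

The ordinal in "henry tudor 8th" survives, while other digits are still removed.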
i should be initial; ii and iii and iv are sweepable
Test in 3.6 and 3.8, lint and deploy from 3.7.
If two full names precede, it's unlikely in the target use cases that a last name will be abbreviated. E.g.,
William Henry Jameson V is the fifth of his name;
William H J V has a last name starting with a 'V'.
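The rule above can be sketched as follows, assuming lowercased tokens and treating any token longer than one letter as a "full name". `trailing_v_is_suffix` is a hypothetical helper, not nominally's implementation.

```python
from typing import List


def trailing_v_is_suffix(tokens: List[str]) -> bool:
    """Treat a final 'v' as a suffix only when at least two full names precede it."""
    if not tokens or tokens[-1] != "v":
        return False
    full_names = [token for token in tokens[:-1] if len(token) > 1]
    return len(full_names) >= 2
```

Under this sketch, the "v" in "william henry jameson v" is read as a suffix, while in "william h j v" it is left as the initial of a last name.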
tried running the [...] code that used nominally, and we hit an error with _grab_junior()
Line 181, index function either returns value or ValueError
I'll run it again later to get the string.
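For context, Python's `list.index` raises `ValueError` when the value is absent, so an unguarded call fails on inputs that lack the expected token. The sketch below shows one generic guard; `find_index` is a hypothetical helper, and whether this matches the actual failure in `_grab_junior()` is an assumption until the offending string is known.

```python
from typing import List, Optional


def find_index(items: List[str], target: str) -> Optional[int]:
    """Return the position of target in items, or None when it is absent."""
    try:
        return items.index(target)
    except ValueError:
        return None
```

A caller can then branch on `None` instead of letting the exception propagate.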