
apicrafter / metacrafter-registry


Registry of metadata identifier entities such as UUID, GUID, person full name, address and so on, linked with other sources.

Home Page: https://registry.apicrafter.io

License: Apache License 2.0

Python 100.00%
datadiscovery entity entity-recognition entity-registry metadata metadata-registry named-entity-recognition pii pii-detection

metacrafter-registry's Introduction

APICrafter

API wrapper for MongoDB databases

APICrafter creates a Python Eve wrapper over one or more MongoDB databases, creates an Eve schema for each collection and generates OpenAPI (Swagger) documentation.

Commands

Discover

Creates the apicrafter.yml API description file from a database or collection. Automatically generates data schemas from the original data.

Build the API definition as apicrafter.yml:

    apicrafter discover -h 127.0.0.1 -p 27017 -d rusregions

Run

Uses API definition from apicrafter.yml file and launches API server over MongoDB. You could

Run the server:

    apicrafter run

Examples

Please see the /examples directory for data and usage.


metacrafter-registry's Issues

Common field names as datatypes

There are several field names that are common across datasets:

  • name - name of the object/person/entity
  • title - also name of the object/person/organization, any entity
  • description - text field with description of the object
  • tags - list of comma-separated words or array of words linked to object entity
  • notes - text comments for an object entity
  • comments - also text comments for an object entity

But a field can also have a more specific type based on its contents: a field named name could hold a country name or an organization name. So should we add these common field names as basic data types, and have data types such as country name, organization name, etc. inherit from them?

Review Semgrep registry

The team behind the Semgrep tool created the Semgrep registry: https://semgrep.dev/r
It's a community-driven registry of rules used for bug and secret detection in source code.

Consider reviewing it as a model for a similar community-driven implementation of the semantic types and rules registry.

Add OSM related data types: osm_id, osm_user, osm_timestamp

Many datasets, especially on open data portals in France and other EU countries, are linked to OpenStreetMap (OSM).
Examples:

Several data field names are common:

  • osm_id - unique id of the geographic object in OSM
  • osm_user - username of the OSM user
  • osm_timestamp - datetime of OSM object creation

Wikidata only has a description of the numeric OSM user ID: https://www.wikidata.org/wiki/Property:P8754

One idea is to add osm_id as a new data type and osm_user as a subtype of [username](https://registry.apicrafter.io/datatype/username). For osm_timestamp it is not yet clear: it probably doesn't need a new data type and could be covered by detection rules only.
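A minimal sketch of how field-name-based detection rules for these three fields might look. The rule table, the datatype identifiers (`osm_id`, `osm_user`, `datetime`) and the value checks are assumptions made for illustration; in particular, the timestamp format is assumed to be `YYYY-MM-DDTHH:MM:SSZ` and is not taken from the registry.

```python
from datetime import datetime


def _is_osm_id(value):
    # Assumption: osm_id is a positive integer, possibly serialized as a string.
    text = str(value)
    return text.isdecimal() and int(text) > 0


def _is_osm_user(value):
    # Assumption: any non-empty string is accepted as a username.
    return isinstance(value, str) and value.strip() != ""


def _is_osm_timestamp(value):
    # Assumption: timestamps look like 2012-03-04T05:06:07Z.
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return False
    return True


# Hypothetical rule table: field name -> (datatype id, value check).
# osm_timestamp maps to a generic "datetime" id instead of a new datatype.
FIELD_RULES = {
    "osm_id": ("osm_id", _is_osm_id),
    "osm_user": ("osm_user", _is_osm_user),
    "osm_timestamp": ("datetime", _is_osm_timestamp),
}


def detect(field_name, values):
    """Return a datatype id if the field name is known and all values pass."""
    rule = FIELD_RULES.get(field_name.lower())
    if rule is None or not values:
        return None
    datatype, check = rule
    return datatype if all(check(v) for v in values) else None
```

For example, `detect("osm_timestamp", ["2012-03-04T05:06:07Z"])` yields the generic `datetime` id, while a field with the same name but unparsable values yields `None`.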

Add sync with wikidata entries

Add a script/code to extract Wikidata property data from its records. It could include:

  1. name
  2. description
  3. examples
  4. regular expression

Maybe something else?
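A rough sketch of the extraction step, assuming the property record is already loaded as a Python dict in the usual Wikidata entity JSON shape (`labels`, `descriptions`, `claims`). It relies on Wikidata property P1793 ("format as a regular expression") for the regular expression. Examples are left out of this sketch. The function name, the output keys and the sample record are invented for illustration and are not real Wikidata data.

```python
def extract_property(entity, lang="en"):
    """Pull registry-relevant fields out of one Wikidata property record."""

    def text(section):
        # labels/descriptions are keyed by language code.
        return entity.get(section, {}).get(lang, {}).get("value")

    regexes = []
    # P1793 = "format as a regular expression"; values are plain strings.
    for claim in entity.get("claims", {}).get("P1793", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, str):
            regexes.append(value)

    return {
        "wikidata_property": entity.get("id"),
        "name": text("labels"),
        "description": text("descriptions"),
        "regexp": regexes,
    }


# Invented sample record, shaped like Wikidata entity JSON.
sample = {
    "id": "P999999",
    "labels": {"en": {"language": "en", "value": "example identifier"}},
    "descriptions": {"en": {"language": "en", "value": "made-up property"}},
    "claims": {
        "P1793": [
            {"mainsnak": {"datavalue": {"value": "[1-9][0-9]*", "type": "string"}}}
        ]
    },
}

record = extract_property(sample)
```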

Semantic types navigation and permalinks

It would be helpful if each semantic type had a permalink, along with navigation to find and view information about each semantic type.

The permalink should look like https://[domain name]/[semantic type slug], for example meta.apicrafter.io/timerange (URL not working).
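As an illustration of the permalink scheme, here is a small sketch that turns a semantic type name into a slug and then into a URL. The slug rules (lowercase, non-alphanumeric runs collapsed to a hyphen) are an assumption; the issue does not specify them.

```python
import re


def slugify(name):
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def permalink(domain, name):
    """Build https://[domain name]/[semantic type slug]."""
    return f"https://{domain}/{slugify(name)}"
```

With these rules, `permalink("meta.apicrafter.io", "timerange")` gives `https://meta.apicrafter.io/timerange`, matching the example above.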

Navigation should include:

  • list of all semantic types with filter by category, language and country
  • search
  • system type page
  • downloads
  • about

Analyze FIBO ontology

The Financial Industry Business Ontology (FIBO) defines the sets of things that are of interest in financial business applications and the ways that those things can relate to one another. https://github.com/edmcouncil/fibo

It's mentioned in the DataHub blog post about business glossaries: https://blog.datahubproject.io/creating-a-business-glossary-and-putting-it-to-use-in-datahub-43a088323c12

  • Review FIBO's approach to defining a business glossary
  • Review the idea of integrating a business glossary with semantic types
  • Update the metacrafter registry ontology

Add semantic types from Fast Text Analysis repository

Original repository by Tim Segall https://github.com/tsegall/fta

  • Review and append missing semantic types to metacrafter registry
  • Add mapping table of FTA semantic types to registry identifiers
  • Add fta (Fast Text Analysis) to list of the tools

List 1. List of semantic types provided by Tim Segall

Semantic Type	Description	Locale
AIRPORT_CODE.IATA	IATA Airport Code	*
CHECKDIGIT.ABA	ABA Number (or Routing Transit Number (RTN))	*
CHECKDIGIT.CUSIP	North American Security Identifier	*
CHECKDIGIT.EAN13	EAN-13 Check digit (also UPC and ISBN-13)	*
CHECKDIGIT.IBAN	International Bank Account Number	*
CHECKDIGIT.ISBN	ISBN-13 identifiers (with hyphens)	*
CHECKDIGIT.ISIN	International Securities Identification Number	*
CHECKDIGIT.LUHN	Digit String that has a valid Luhn Check digit (and length between 8 and 30 inclusive)	*
CHECKDIGIT.SEDOL	UK/Ireland Security Identifier	*
CHECKDIGIT.UPC	Universal Product Code	*
CITY	City/Town	en
COLOR.HEX	Hex Color code	*
COMPANY_NAME	Company Name	en
CONTINENT.CODE_EN	Continent Code	en
CONTINENT.TEXT_EN	Continent Name	en
COORDINATE.LATITUDE_DECIMAL	Latitude (Decimal degrees)	*
COORDINATE.LONGITUDE_DECIMAL	Longitude (Decimal degrees)	*
COORDINATE.LATITUDE_DMS	Latitude (degrees/minutes/seconds)	*
COORDINATE.LONGITUDE_DMS	Longitude (degrees/minutes/seconds)	*
COORDINATE.EASTING	Coordinate - Easting	*
COORDINATE.NORTHING	Coordinate - Northing	*
COORDINATE_PAIR.DECIMAL	Coordinate Pair (Decimal degrees)	*
COUNTRY.ISO-3166-2	Country as defined by ISO 3166 - Alpha 2	*
COUNTRY.ISO-3166-3	Country as defined by ISO 3166 - Alpha 3	*
COUNTRY.TEXT_	Country as a string	de, en
CREDIT_CARD_TYPE	Type of Credit CARD - e.g. AMEX, VISA, ...	*
CURRENCY_CODE.ISO-4217	Currency as defined by ISO 4217	*
CURRENCY.TEXT_EN	Currency Name	en
DAY.DIGITS	Day represented as a number (1-31)	*
DAY.ABBR_	Day of Week Abbreviation (locale-specific, e.g. en-US for English language in US)	Current Locale
DAY.FULL_	Full Day of Week name (locale-specific, e.g. en-US for English language in US)	Current Locale
EMAIL	Email Address	*
EPOCH.MILLISECONDS	Unix Epoch (Timestamp) - milliseconds	*
EPOCH.NANOSECONDS	Unix Epoch (Timestamp) - nanoseconds	*
FREE_TEXT	Free Text field - e.g. Description, Notes, Comments, ...	de, en, fr
GENDER.TEXT_	Gender	bg, ca, de, en, es, fi, fr, hr, it, ja, ms, nl, pl, pt, ro, sv, tr, zh
GUID	Globally Unique Identifier, e.g. 30DD879E-FE2F-11DB-8314-9800310C9A67	*
HASH.SHA1_HEX	SHA1 Hash - hexadecimal	*
HASH.SHA256_HEX	SHA256 Hash - hexadecimal	*
HONORIFIC_EN	Title (English language)	en
IDENTITY.AADHAR_IN	Aadhar	en-IN, hi-IN
IDENTITY.DUNS	Data Universal Numbering System (Dun & Bradstreet)	*
IDENTITY.EIN_US	Employer Identification Number	en-US
IDENTITY.NHS_UK	NHS Number	en-UK
IDENTITY.SSN_FR	Social Security Number (France)	fr-FR
IDENTITY.SSN_CH	AVH Number / SSN (Switzerland)	de-CH, fr-CH, it-CH
IDENTITY.INDIVIDUAL_NUMBER_JA	Individual Number / My Number (Japan)	ja
INDUSTRY_EN	Industry Name	en
IPADDRESS.IPV4	IP V4 Address	*
IPADDRESS.IPV6	IP V6 Address	*
JOB_TITLE_EN	Job Title	en
LANGUAGE.ISO-639-2	Language code - ISO 639, two character	*
LANGUAGE.TEXT_EN	Language name, e.g. English, French, ...	en
MACADDRESS	MAC Address	*
MONTH.ABBR_	Month Abbreviation (locale-specific, e.g. en-US for English language in US)	Current Locale
MONTH.DIGITS	Month represented as a number (1-12)	*
MONTH.FULL_	Full Month name (locale-specific, e.g. en-US for English language in US)	Current Locale
NAME.FIRST	First Name	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NAME.FIRST_LAST	Merged Name (First Last)	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NAME.LAST	Last Name	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NAME.LAST_FIRST	Merged Name (Last, First)	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NAME.MIDDLE	Middle Name	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NAME.MIDDLE_INITIAL	Middle Initial	br, de, do, en, es, fr, gt, mx, nl, pr, pt
NATIONALITY_EN	Nationality	en
PERSON.AGE	Age (Person)	en, es, fr, es, it, pt
POSTAL_CODE.POSTAL_CODE_	Postal Code	AU, BG, CA, FR, JA, NL, UK, ES, MX, PT, SE, UY
POSTAL_CODE.ZIP5_US	Postal Code	en-CA, en-US
POSTAL_CODE.ZIP5_PLUS4_US	Postal Code + 4	en-CA, en-US
SSN	Social Security Number (US)	en-US
STATE_PROVINCE.COMMUNE_IT	Italian Commune	it-IT
STATE_PROVINCE.COUNTY_	County	en-UK, en-US, hu-HU
STATE_PROVINCE.DISTRICT_NAME_PT	Portuguese District Name	pt-PT
STATE_PROVINCE.MUNICIPALITY_BR	Brazilian Municipality	pt-BR
STATE_PROVINCE.STATE_	State Code	en-AU, pt-BR, es-MX, en-US
STATE_PROVINCE.STATE_NAME_	State Name	en-AU, pt-BR, de-DE, es-MX, en-US
STATE_PROVINCE.STATE_PROVINCE_NA	US State Code/Canadian Province Code/Mexican State Code	en-CA, en-US, es-MX
STATE_PROVINCE.PROVINCE_CA	Canadian Province Code	en-CA, en-US
STATE_PROVINCE.PROVINCE_IT	Italian Province Code	it-IT
STATE_PROVINCE.PROVINCE_ZA	South African Province Code	en-ZA
STATE_PROVINCE.PROVINCE_NAME_CA	Canadian Province Name	en-CA, en-US
STATE_PROVINCE.PROVINCE_NAME_IT	Italian Province Name	it-IT
STATE_PROVINCE.PROVINCE_NAME_ES	Spanish Province Name	es-ES
STATE_PROVINCE.PROVINCE_NAME_NL	Dutch Province Name	nl-NL
STATE_PROVINCE.PROVINCE_NAME_ZA	South African Province Name	en-ZA
STATE_PROVINCE.STATE_PROVINCE_NAME_NA	US State Name/Canadian Province Name	en-CA, en-US, es-MX
STATE_PROVINCE.DEPARTMENT_FR	French Department Name	fr-FR
STATE_PROVINCE.REGION_FR	French Region Name	fr-FR
STATE_PROVINCE.CANTON_CH	Swiss Canton Code	de-CH, fr-CH, it-CH
STATE_PROVINCE.CANTON_NAME_CH	Swiss Canton Name	de-CH, fr-CH, it-CH
STATE_PROVINCE.PREFECTURE_NAME_JP	Japanese Prefecture Name	ja
STREET_ADDRESS_EN	Street Address (English Language)	en
STREET_ADDRESS2_EN	Street Address - Line 2 (English Language)	en
STREET_MARKER_EN	Street Suffix (English Language)	en
TELEPHONE	Telephone Number (Generic)	*
TIMEZONE.IANA	IANA Time Zone (Olson)	*
URI.URL	URL - see RFC 3986	*
VIN	Vehicle Identification Number
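The mapping task could start from a sketch like the one below, which parses a few tab-separated rows from List 1 and attaches a registry identifier where a mapping exists. The `FTA_TO_REGISTRY` table and the identifiers `email` and `guid` are hypothetical placeholders, not confirmed registry ids.

```python
import csv
import io

# A few rows copied from List 1 (tab-separated); the VIN row has no locale.
FTA_ROWS = (
    "Semantic Type\tDescription\tLocale\n"
    "EMAIL\tEmail Address\t*\n"
    "GUID\tGlobally Unique Identifier, e.g. 30DD879E-FE2F-11DB-8314-9800310C9A67\t*\n"
    "CITY\tCity/Town\ten\n"
    "VIN\tVehicle Identification Number\n"
)

# Hypothetical mapping: FTA semantic type -> registry identifier.
FTA_TO_REGISTRY = {"EMAIL": "email", "GUID": "guid"}


def load_fta_types(text):
    """Parse the tab-separated list into dicts with a locale list and mapping."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    types = []
    for row in reader:
        locale = (row["Locale"] or "").strip()
        # "*" (or a missing value) is stored as an empty locale list.
        locales = [] if locale in ("", "*") else [p.strip() for p in locale.split(",")]
        types.append({
            "fta_type": row["Semantic Type"],
            "description": row["Description"],
            "locales": locales,
            "registry_id": FTA_TO_REGISTRY.get(row["Semantic Type"]),
        })
    return types


def unmapped(types):
    """FTA types that still need a registry identifier."""
    return [t["fta_type"] for t in types if t["registry_id"] is None]


fta_types = load_fta_types(FTA_ROWS)
```

Calling `unmapped(fta_types)` then lists the FTA types that are candidates for review or for new registry entries.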

Add gml_id as datatype

gml_id is commonly used when GML (Geography Markup Language) data is converted to CSV/JSON datasets. It's defined as gml:id in the GML standard and represented as gml_id in datasets.

There is no such datatype in Wikidata, but it is common in datasets: 774 occurrences in the OpenDataSoft metadata dataset and in Europe.

One idea is to investigate gml:id usage and to create a new datatype for it.

Analyze The Common Data Model (CDM)

The Common Data Model (CDM) is a standard and extensible collection of schemas (entities, attributes, relationships) that represents business concepts and activities with well-defined semantics, to facilitate data interoperability. Examples of entities include: Account, Contact, Lead, Opportunity, Product, etc. https://github.com/microsoft/CDM

It's mentioned in the DataHub blog post: https://blog.datahubproject.io/creating-a-business-glossary-and-putting-it-to-use-in-datahub-43a088323c12

  • Review CDM
  • Consider integration of CDM business terms to semantic types

Add Schema.org properties

https://schema.org/docs/full.html

Review following properties:

Add semantic data types from Sherlock (MIT Media Labs)

Sherlock supports 78 semantic data types, but there is no single source with a list of all of them.

  • Contact Sherlock team about list of semantic data types
  • Review Sherlock semantic data types and add data types to the registry if they are missing
  • Create Sherlock profile in tools section

Add geo_point_2d, geo_shape and objectid as datatypes

There is a combination of datatypes, geo_point_2d, geo_shape and objectid, used in thousands of French and EU datasets.
It looks like it is used in Geographic Information Systems (GIS) across Europe and integrated with ArcGIS.

  • geo_point_2d is a dict with lat and lon keys holding latitude and longitude
  • geo_shape is GeoJSON data with a type key and a geometry key. The type key can have values like 'Feature', and geometry includes a type key and coordinates as an array of 2-element arrays of floats.
  • objectid - unique id of the record in this dataset (not globally unique)

Consider adding these field names as datatypes
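The structures described above could be checked with a sketch like this. It follows the description in this issue literally (coordinates as a flat list of 2-element numeric pairs), which is an assumption: real GeoJSON geometries can nest coordinates differently depending on the geometry type. The function names are illustrative.

```python
def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_geo_point_2d(value):
    """A dict with numeric lat/lon keys within valid degree ranges."""
    return (
        isinstance(value, dict)
        and {"lat", "lon"} <= set(value)
        and _is_number(value["lat"]) and -90 <= value["lat"] <= 90
        and _is_number(value["lon"]) and -180 <= value["lon"] <= 180
    )


def is_geo_shape(value):
    """A dict with a type key and a geometry holding 2-element coordinate pairs."""
    if not isinstance(value, dict) or "type" not in value:
        return False
    geometry = value.get("geometry")
    if not isinstance(geometry, dict) or "type" not in geometry:
        return False
    coords = geometry.get("coordinates")
    return (
        isinstance(coords, list)
        and len(coords) > 0
        and all(
            isinstance(pair, list)
            and len(pair) == 2
            and all(_is_number(c) for c in pair)
            for pair in coords
        )
    )
```

objectid is left out because it is only unique within one dataset, so a value check alone cannot identify it; the field name would have to drive detection.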

Add links to Wikidata entities

Right now records link only to Wikidata properties.
To sync and import data from Wikidata, it would be wise to also add links to Wikidata entities.

  • Add 'wikidata_id' to schema validation
  • Add 'wikidata_id' to most records manually
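A minimal sketch of what the 'wikidata_id' check might look like, assuming the field is optional and holds a Wikidata item id (the letter Q followed by a positive integer). The record layout and the `validate_record` helper are hypothetical and do not reflect the registry's actual validation code.

```python
import re

# Wikidata item ids: "Q" followed by a positive integer without leading zeros.
WIKIDATA_ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def validate_record(record):
    """Return a list of error messages for the optional wikidata_id field."""
    errors = []
    wikidata_id = record.get("wikidata_id")
    if wikidata_id is None:
        # Optional: not every record has a Wikidata entity yet.
        return errors
    if not isinstance(wikidata_id, str) or not WIKIDATA_ITEM_ID.fullmatch(wikidata_id):
        errors.append(f"wikidata_id must look like 'Q42', got {wikidata_id!r}")
    return errors
```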

Redesign registry structure - ontology redesign

Several changes are important to improve the registry:

  1. Rename entity to semantic type. The scientific name for this kind of data classification is "semantic type detection". There are several research papers on this topic, and the name is used in Metabase, Dataprep and Google DataStudio.
  2. Add a country concept, similar to spoken language but country-related, used to set the country of rules and identifiers. For example, business codes/identifiers are country-related, not language-related.
  3. Add subtyping: any type can have a parent semantic type.
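One way to picture the redesigned structure is the sketch below: a semantic type record with an optional parent plus language and country lists, and a helper that walks the parent chain. The class, field names and sample identifiers (`name`, `orgname`, `countryname`) are assumptions for illustration, not the registry's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SemanticType:
    id: str
    name: str
    parent: Optional[str] = None          # id of the parent semantic type
    languages: List[str] = field(default_factory=list)
    countries: List[str] = field(default_factory=list)


def ancestors(type_id: str, registry: Dict[str, SemanticType]) -> List[str]:
    """Return parent ids from nearest to farthest, rejecting cycles."""
    chain: List[str] = []
    seen = {type_id}
    current = registry[type_id].parent
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent chain at {current!r}")
        seen.add(current)
        chain.append(current)
        current = registry[current].parent
    return chain


def is_a(type_id: str, other_id: str, registry: Dict[str, SemanticType]) -> bool:
    """True if type_id equals other_id or inherits from it."""
    return type_id == other_id or other_id in ancestors(type_id, registry)


# Hypothetical sample entries.
registry = {
    "name": SemanticType("name", "Name"),
    "orgname": SemanticType("orgname", "Organization name", parent="name"),
    "countryname": SemanticType("countryname", "Country name", parent="name",
                                languages=["en"]),
}
```

Keeping the parent as an id rather than an object reference keeps records easy to serialize, at the cost of needing the registry dict for lookups.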
