wikidata / soweego
Link Wikidata items to large catalogs
Home Page: https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
License: GNU General Public License v3.0
It should be used at dump extraction time to feed the link
table in MariaDB with "good" URLs.
A starting point would be to query Wikidata for URL domains of existing catalog identifiers.
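Such a query could start from the formatter URLs of external-identifier properties, from which the domains can be parsed. A minimal sketch against the Wikidata Query Service, using P1630 (formatter URL):

```sparql
# URL domains of existing catalog identifiers can be parsed out of the
# formatter URLs (P1630) attached to external-ID properties
SELECT ?property ?formatterURL WHERE {
  ?property wikibase:propertyType wikibase:ExternalId ;
            wdt:P1630 ?formatterURL .
}
```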
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

The investigation should answer the question: is a full-text index column in MariaDB better than a Lucene-based solution?
For each link to another catalog of a target identifier, match against Wikidata item identifiers.
For each Wikipedia/DBpedia link of a target identifier, match against Wikidata item site links.
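A hedged sketch of these two matching steps; the data structures are assumptions for illustration, not the project's actual ones:

```python
# Match a target identifier against Wikidata via shared third-party
# catalog IDs, or via shared Wikipedia/DBpedia links vs. site links
def match_target(target, wikidata_items):
    """target: dict with 'catalog_ids' and 'wiki_links' sets;
    wikidata_items: list of dicts with 'qid', 'catalog_ids' and
    'site_links' sets. Returns the first matching QID or None."""
    for item in wikidata_items:
        if target['catalog_ids'] & item['catalog_ids']:
            return item['qid']  # shared identifier in another catalog
        if target['wiki_links'] & item['site_links']:
            return item['qid']  # shared Wikipedia/DBpedia page
    return None
```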
It should perform the following actions:
Select the entities typed as rdf:type foaf:Person. For instance:
<http://data.bibsys.no/data/notrbib/authorityentry/x90054225> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
s51434__mixnmatch_large_catalogs_p database on Toolforge.

As per #19, if required criteria 2 or 3 are not met, then deprecate the statement.
The whole project should work in a virtual environment like Docker.
INPUT: full set of Wikidata musicians and bands QIDs with MusicBrainz ID; full set of MusicBrainz artists;
OUTPUT: set of Wikidata QIDs with invalid MusicBrainz ID
Please put the implementation under the soweego/validator folder.
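A minimal sketch of such a validator (function and variable names are hypothetical):

```python
# A Wikidata MusicBrainz ID is invalid when it does not occur in the
# full set of MusicBrainz artist IDs
def invalid_musicbrainz_ids(qid_to_mbid, musicbrainz_artist_ids):
    """qid_to_mbid: {QID: MusicBrainz ID}; returns QIDs to report."""
    return {qid for qid, mbid in qid_to_mbid.items()
            if mbid not in musicbrainz_artist_ids}
```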
Soundex, metaphone, and other phonetic algorithms can be useful to normalize ideographic languages like Japanese and Chinese.
The jellyfish Python library we use implements those algorithms, but does not seem to support ideographic languages (exceptions raised or empty strings as output).
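A quick probe illustrating the observation above (exact behavior on non-Latin input depends on the jellyfish version):

```python
import jellyfish

# Latin-script names work as expected
print(jellyfish.soundex('Robert'))    # R163
print(jellyfish.metaphone('Robert'))  # RBRT

# Ideographic input either raises or yields a useless result,
# depending on the library version
try:
    print(repr(jellyfish.soundex('坂本龍一')))
except Exception as error:
    print('unsupported:', error)
```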
This is the most straightforward one.
For each full name of a target identifier:
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

- [ ] Port this Java implementation into Python: https://github.com/fbk/utils/blob/master/utils-core/src/main/java/eu/fbk/utils/core/strings/JaroWinklerDistance.java. It is based on Apache Commons Lang: https://commons.apache.org/proper/commons-text/javadocs/api-release/index.html?org/apache/commons/text/similarity/JaroWinklerDistance.html; the original source code should be contained here: http://it.apache.contactlab.it//commons/lang/source/commons-lang3-3.7-src.tar.gz.
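For reference, a self-contained Python sketch of the standard algorithm with the usual 0.1 prefix weight; it has not been verified against the Java original, so treat it as a starting point:

```python
def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match if equal and within this sliding window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Count transpositions between the two matched sequences
    transpositions, j = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    jaro = (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
    # Winkler boost: reward a common prefix, capped at 4 characters
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)
```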
# unique matches / (sample length - #already linked Wikidata items)

Follow steps 1 to 3 of this link: https://tools.wmflabs.org/
While working on #62, we realized that BIBSYS suffers from data inconsistency.
IMDB is the next big fish in line.
#matches / sample length

Besides labels, aliases may be useful to augment matching.
We cannot be sure that the URL built from Wikidata is the same as the one available in the target databases.
We could split the URLs on '/' and '.' to get tokens. Excluding grammar words like "https" and so on, we can try to estimate the similarity between URLs.
E.g. (trivial):
https://twitter.com/BarackObama -> "https" "twitter" "com" "BarackObama"
After removing grammar words, we are left with "twitter" "BarackObama", which is the core information we need to match against another form of the Twitter URL, like http://www.twitter.com/BarackObama -> "twitter" "BarackObama" (after cleaning).
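A minimal sketch of this tokenization; the stop-token list is an assumption and would need tuning per catalog:

```python
from urllib.parse import urlsplit

# Tokens that carry no identity information ("grammar words")
STOP_TOKENS = {'http', 'https', 'www', 'com', 'org', 'net'}

def url_tokens(url):
    """Split a URL on '.' and '/' and drop stop tokens."""
    parts = urlsplit(url)
    raw = parts.netloc.split('.') + parts.path.split('/')
    return {token for token in raw
            if token and token.lower() not in STOP_TOKENS}

# Both spellings reduce to the same core tokens
assert url_tokens('https://twitter.com/BarackObama') \
    == url_tokens('http://www.twitter.com/BarackObama') \
    == {'twitter', 'BarackObama'}
```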
Currently, MusicBrainz calculations are done with an internal identifier and not with the one reachable from the outside.
Python 3.4 is the version deployed on Toolforge.
pkgutil.get_data returns a binary string, i.e., bytes.
The json.loads docstring in Python 3.4 reads: "Deserialize s (a str instance containing a JSON document) to a Python object."
In later versions it reads: "Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object."
Temporary fix: don't use get_data.
See discogs baseline matcher.
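If get_data is kept instead, an explicit decode also works on Python 3.4; a minimal sketch (package and resource names are hypothetical):

```python
import json
import pkgutil

# pkgutil.get_data returns bytes; Python 3.4's json.loads only
# accepts str, so decode explicitly before parsing
raw = pkgutil.get_data('soweego', 'resources/sample.json')
data = json.loads(raw.decode('utf-8'))
```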
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

Currently, only 12 out of 256k albums in Wikidata have a Discogs identifier.
Discogs has a huge amount of albums: we could either leverage the releases_url tag of each artist, or the releases dump directly, e.g.:
https://discogs-data.s3-us-west-2.amazonaws.com/data/2018/discogs_20180801_releases.xml.gz
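Given the size of that dump, a streaming parse avoids loading it all into memory; a hedged sketch (tag names are assumptions about the public dump format):

```python
import gzip
import xml.etree.ElementTree as ET

# Stream <release> elements out of the gzipped dump one at a time
with gzip.open('discogs_20180801_releases.xml.gz', 'rb') as dump:
    for _, element in ET.iterparse(dump):
        if element.tag == 'release':
            title = element.findtext('title')
            artists = [a.findtext('name') for a in element.iter('artist')]
            # ... match title and artists against Wikidata albums ...
            element.clear()  # release memory as we go
```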
In bibsys_baseline_helper.link_scraper.
- [ ] Pick the best implementation that fits our use case here: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python;
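Among those, the two-row dynamic-programming variant is a common fit for batch matching; a sketch:

```python
def levenshtein(s1, s2):
    """Edit distance keeping only two rows of the DP matrix."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,       # deletion
                               current[j - 1] + 1,    # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]

assert levenshtein('kitten', 'sitting') == 3
```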
Currently, only 37 out of 256k albums in Wikidata have a MusicBrainz identifier:
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q482994 ;
        wdt:P434 ?identifier .
}
Investigate how we could leverage the album data in MusicBrainz to both populate album identifiers and performer statements, like:
Animals, performer, Pink Floyd
They are all artist nodes in the dump.
Musicians have a groups tag.
Bands have a members tag.
Import them into 2 different tables in MariaDB.
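A hedged sketch of the split at import time, keying on the two tags mentioned above (the XML access is an assumption):

```python
# Route each artist node to the right table, based on which tag it has
def classify_artist(artist_node):
    if artist_node.find('members') is not None:
        return 'band'      # it has members, so it is a group
    if artist_node.find('groups') is not None:
        return 'musician'  # it belongs to groups, so it is a person
    return 'unknown'       # e.g., a solo artist with no group ties
```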
Currently, it is loaded from a file.
See check_existence in checks.py.
The whole project should be structured to serve both as a command line interface and as a Python module:
strephit is the parent module and all its subfolders are submodules;
add __init__.py and __main__.py files (see the sketch after this issue).

As per #19, if required criterion 1 is not met, then delete the statement.
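A minimal, hypothetical sketch of such a __main__.py using click (the command and option names are illustrative, not the project's actual CLI):

```python
# soweego/__main__.py
import click

@click.group()
def cli():
    """soweego command line interface."""

@cli.command()
@click.argument('catalog')
@click.option('--output', default='matches.json', help='Result file.')
def match(catalog, output):
    """Run the baseline matcher against CATALOG."""
    click.echo('Matching against {}, writing to {}'.format(catalog, output))

if __name__ == '__main__':
    cli()  # enables both `python -m soweego` and plain import
```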
Sir, Mlle (prefixes); Jr., Sr., PhD, MD, M.D. (suffixes); de, de la, of, von (infixes). See mix'n'match regexps below:
/[, ]+(Jr\.{0,1}|Sr\.{0,1}|PhD\.{0,1}|MD|M\.D\.)$
/^(Sir|Baron|Baronesse{0,1}|Graf|Gräfin)\s+/
/\b(Mmle|pseud\.|diverses)\b/
/ Bt$/ -> ' Baronet'
/^(.+)[ ,]+[sj]r\.{0,1}$/i
/^(.+)[ ,]+I+\.{0,1}$/i
M.
Diacritics should also be mapped, e.g., { 'à': 'a' }.
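A hedged Python sketch applying a couple of the patterns above; the diacritics map is deliberately tiny, a real one would cover far more characters:

```python
import re

# Suffix and prefix patterns transcribed from the mix'n'match regexps
SUFFIXES = re.compile(r'[, ]+(Jr\.?|Sr\.?|PhD\.?|MD|M\.D\.)$')
PREFIXES = re.compile(r'^(Sir|Baron|Baronesse?|Graf|Gräfin)\s+')
DIACRITICS = {'à': 'a', 'é': 'e', 'ö': 'o'}  # illustrative subset

def normalize_name(name):
    name = SUFFIXES.sub('', name)
    name = PREFIXES.sub('', name)
    return ''.join(DIACRITICS.get(char, char) for char in name)

print(normalize_name('Sir Arthur Conan Doyle'))  # Arthur Conan Doyle
print(normalize_name('Sammy Davis, Jr.'))        # Sammy Davis
```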
Set up a logging mechanism.
INPUT: full set of Wikidata musicians and bands QIDs with Discogs ID; full set of Discogs artists;
OUTPUT: set of Wikidata QIDs with invalid Discogs ID
Please put the implementation under the soweego/validator folder.
Import them into 2 different tables in MariaDB.
As per #35, MariaDB on Toolforge doesn't seem to support one.
Therefore, the fuzzy analyzer must be implemented before ingesting datasets into MariaDB.
We have 3 criteria, 2 generic and 1 specific to item type.
Community discussion initiated at:
So far, we have developed several different things in the target_selection module.
It's time now to move logic to appropriate places:
discogs, bne, musicbrainz matching_strategies.py -> linker module;
discogs/baseline_matcher#extract_data_from_dump -> importer module, with appropriate sub-folders.

In particular:
os.path.join
resources.get_data
click.argument
click.option
#matches / sample length
#matches / (sample length - #already linked Wikidata items)
#unique matches / (sample length - #already linked Wikidata items)

The final version will be published on the project page: https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego/Timeline
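A tiny sketch of how these three ratios could be computed (all names are hypothetical):

```python
def evaluation_metrics(matches, sample, already_linked):
    """matches: matched QIDs (possibly with duplicates); sample: the
    evaluated records; already_linked: count of sample items that
    already carry a Wikidata link."""
    unlinked = len(sample) - already_linked
    return {
        '#matches / sample length': len(matches) / len(sample),
        '#matches / unlinked': len(matches) / unlinked,
        '#unique matches / unlinked': len(set(matches)) / unlinked,
    }
```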