wikidata / soweego
Link Wikidata items to large catalogs
Home Page: https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
License: GNU General Public License v3.0
It should be used at dump extraction time to feed the link
table in MariaDB with "good" URLs.
A starting point would be to query Wikidata for URL domains of existing catalog identifiers.
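Such a query could start from the formatter URLs of external-identifier properties, from which the domains can be parsed. A minimal sketch against the Wikidata Query Service, using P1630 (formatter URL):

```sparql
# URL domains of existing catalog identifiers can be parsed out of the
# formatter URLs (P1630) attached to external-ID properties
SELECT ?property ?formatterURL WHERE {
  ?property wikibase:propertyType wikibase:ExternalId ;
            wdt:P1630 ?formatterURL .
}
```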
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

The investigation should answer the question: is a full-text index column in MariaDB better than a Lucene-based solution?
For each link to another catalog of a target identifier, match against Wikidata item identifiers.
For each Wikipedia/DBpedia link of a target identifier, match against Wikidata item site links.
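A hedged sketch of these two matching steps; the data structures are assumptions for illustration, not the project's actual ones:

```python
# Match a target identifier against Wikidata via shared third-party
# catalog IDs, or via shared Wikipedia/DBpedia links vs. site links
def match_target(target, wikidata_items):
    """target: dict with 'catalog_ids' and 'wiki_links' sets;
    wikidata_items: list of dicts with 'qid', 'catalog_ids' and
    'site_links' sets. Returns the first matching QID or None."""
    for item in wikidata_items:
        if target['catalog_ids'] & item['catalog_ids']:
            return item['qid']  # shared identifier in another catalog
        if target['wiki_links'] & item['site_links']:
            return item['qid']  # shared Wikipedia/DBpedia page
    return None
```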
It should perform the following actions:
Select the entities typed as rdf:type foaf:Person. For instance:
<http://data.bibsys.no/data/notrbib/authorityentry/x90054225> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
s51434__mixnmatch_large_catalogs_p database on Toolforge.

As per #19, if required criteria 2 or 3 are not met, then deprecate the statement.
The whole project should work in a virtual environment like Docker.
INPUT: full set of Wikidata musicians and bands QIDs with MusicBrainz ID; full set of MusicBrainz artists;
OUTPUT: set of Wikidata QIDs with invalid MusicBrainz ID
Please put the implementation under the soweego/validator folder.
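A minimal sketch of such a validator (function and variable names are hypothetical):

```python
# A Wikidata MusicBrainz ID is invalid when it does not occur in the
# full set of MusicBrainz artist IDs
def invalid_musicbrainz_ids(qid_to_mbid, musicbrainz_artist_ids):
    """qid_to_mbid: {QID: MusicBrainz ID}; returns QIDs to report."""
    return {qid for qid, mbid in qid_to_mbid.items()
            if mbid not in musicbrainz_artist_ids}
```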
Soundex, metaphone, and other phonetic algorithms can be useful to normalize ideographic languages like Japanese and Chinese.
The jellyfish Python library we use implements those algorithms, but does not seem to support ideographic languages (exceptions raised or empty strings as output).
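A quick probe illustrating the observation above (exact behavior on non-Latin input depends on the jellyfish version):

```python
import jellyfish

# Latin-script names work as expected
print(jellyfish.soundex('Robert'))    # R163
print(jellyfish.metaphone('Robert'))  # RBRT

# Ideographic input either raises or yields a useless result,
# depending on the library version
try:
    print(repr(jellyfish.soundex('坂本龍一')))
except Exception as error:
    print('unsupported:', error)
```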
This is the most straightforward one.
For each full name of a target identifier:
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

- [ ] Port this Java implementation into Python: https://github.com/fbk/utils/blob/master/utils-core/src/main/java/eu/fbk/utils/core/strings/JaroWinklerDistance.java. It is based on Apache Commons Lang: https://commons.apache.org/proper/commons-text/javadocs/api-release/index.html?org/apache/commons/text/similarity/JaroWinklerDistance.html; the original source code should be contained here: http://it.apache.contactlab.it//commons/lang/source/commons-lang3-3.7-src.tar.gz.
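For reference, a self-contained Python sketch of the standard algorithm with the usual 0.1 prefix weight; it has not been verified against the Java original, so treat it as a starting point:

```python
def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match if equal and within this sliding window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Count transpositions between the two matched sequences
    transpositions, j = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    jaro = (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
    # Winkler boost: reward a common prefix, capped at 4 characters
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)
```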
# unique matches / (sample length - #already linked Wikidata items)

Follow steps 1 to 3 of this link: https://tools.wmflabs.org/
While working on #62, we realized that BIBSYS suffers from data inconsistency.
IMDB is the next big fish in line.
#matches / sample length

Besides labels, aliases may be useful to augment matching.
We cannot be sure that the URL built from Wikidata is the same as the one available in the target databases.
We could split the URLs on '/' and '.' to get tokens. Excluding grammar words like "https" and so on, we can try to estimate the similarity between URLs.
E.g. (trivial):
https://twitter.com/BarackObama -> "https" "twitter" "com" "BarackObama"
After removing grammar words, we are left with "twitter" "BarackObama", which is the core information we need to match against another form of the Twitter URL, like http://www.twitter.com/BarackObama -> "twitter" "BarackObama" (after cleaning).
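A minimal sketch of this tokenization; the stop-token list is an assumption and would need tuning per catalog:

```python
from urllib.parse import urlsplit

# Tokens that carry no identity information ("grammar words")
STOP_TOKENS = {'http', 'https', 'www', 'com', 'org', 'net'}

def url_tokens(url):
    """Split a URL on '.' and '/' and drop stop tokens."""
    parts = urlsplit(url)
    raw = parts.netloc.split('.') + parts.path.split('/')
    return {token for token in raw
            if token and token.lower() not in STOP_TOKENS}

# Both spellings reduce to the same core tokens
assert url_tokens('https://twitter.com/BarackObama') \
    == url_tokens('http://www.twitter.com/BarackObama') \
    == {'twitter', 'BarackObama'}
```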
Currently, MusicBrainz calculations are done with an internal identifier and not with the one reachable from the outside.
Python 3.4 is the version deployed on Toolforge.
pkgutil.get_data returns a binary string, i.e., bytes.
The json.loads docstring in Python 3.4 reads: "Deserialize s (a str instance containing a JSON document) to a Python object."
In later versions it reads: "Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object."
Temporary fix: don't use get_data.
See discogs baseline matcher.
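If get_data is kept instead, an explicit decode also works on Python 3.4; a minimal sketch (package and resource names are hypothetical):

```python
import json
import pkgutil

# pkgutil.get_data returns bytes; Python 3.4's json.loads only
# accepts str, so decode explicitly before parsing
raw = pkgutil.get_data('soweego', 'resources/sample.json')
data = json.loads(raw.decode('utf-8'))
```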
It should perform the following actions:
s51434__mixnmatch_large_catalogs_p database on Toolforge.

Currently, only 12 out of 256k albums in Wikidata have a Discogs identifier.
Discogs has a huge amount of albums: we could either leverage the releases_url tag of each artist, or the releases dump directly, e.g.:
https://discogs-data.s3-us-west-2.amazonaws.com/data/2018/discogs_20180801_releases.xml.gz
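Given the size of that dump, a streaming parse avoids loading it all into memory; a hedged sketch (tag names are assumptions about the public dump format):

```python
import gzip
import xml.etree.ElementTree as ET

# Stream <release> elements out of the gzipped dump one at a time
with gzip.open('discogs_20180801_releases.xml.gz', 'rb') as dump:
    for _, element in ET.iterparse(dump):
        if element.tag == 'release':
            title = element.findtext('title')
            artists = [a.findtext('name') for a in element.iter('artist')]
            # ... match title and artists against Wikidata albums ...
            element.clear()  # release memory as we go
```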
In bibsys_baseline_helper.link_scraper.
- [ ] Pick the best implementation that fits our use case here: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python;
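Among those, the two-row dynamic-programming variant is a common fit for batch matching; a sketch:

```python
def levenshtein(s1, s2):
    """Edit distance keeping only two rows of the DP matrix."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,       # deletion
                               current[j - 1] + 1,    # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]

assert levenshtein('kitten', 'sitting') == 3
```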
Currently, only 37 out of 256k albums in Wikidata have a MusicBrainz identifier:
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q482994 ;
        wdt:P434 ?identifier .
}
Investigate how we could leverage the album data in MusicBrainz to both populate album identifiers and performer statements, like:
Animals, performer, Pink Floyd
They are all artist nodes in the dump.
Musicians have a groups tag.
Bands have a members tag.
Import them into 2 different tables in MariaDB.
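A hedged sketch of the split at import time, keying on the two tags mentioned above (the XML access is an assumption):

```python
# Route each artist node to the right table, based on which tag it has
def classify_artist(artist_node):
    if artist_node.find('members') is not None:
        return 'band'      # it has members, so it is a group
    if artist_node.find('groups') is not None:
        return 'musician'  # it belongs to groups, so it is a person
    return 'unknown'       # e.g., a solo artist with no group ties
```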
Currently, it is loaded from a file.
See check_existence in checks.py.
The whole project should be structured to serve both as a command line interface and as a Python module:
strephit is the parent module and all its subfolders are submodules;
add __init__.py and __main__.py files (see the sketch after this issue).

As per #19, if required criterion 1 is not met, then delete the statement.
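A minimal, hypothetical sketch of such a __main__.py using click (the command and option names are illustrative, not the project's actual CLI):

```python
# soweego/__main__.py
import click

@click.group()
def cli():
    """soweego command line interface."""

@cli.command()
@click.argument('catalog')
@click.option('--output', default='matches.json', help='Result file.')
def match(catalog, output):
    """Run the baseline matcher against CATALOG."""
    click.echo('Matching against {}, writing to {}'.format(catalog, output))

if __name__ == '__main__':
    cli()  # enables both `python -m soweego` and plain import
```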
Sir, Mlle (prefixes); Jr., Sr., PhD, MD, M.D. (suffixes); de, de la, of, von (infixes). See mix'n'match regexps below:
/[, ]+(Jr\.{0,1}|Sr\.{0,1}|PhD\.{0,1}|MD|M\.D\.)$
/^(Sir|Baron|Baronesse{0,1}|Graf|Gräfin)\s+/
/\b(Mmle|pseud\.|diverses)\b/
/ Bt$/ -> ' Baronet'
/^(.+)[ ,]+[sj]r\.{0,1}$/i
/^(.+)[ ,]+I+\.{0,1}$/i
M.
Diacritics should also be mapped, e.g., { 'à': 'a' }.
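A hedged Python sketch applying a couple of the patterns above; the diacritics map is deliberately tiny, a real one would cover far more characters:

```python
import re

# Suffix and prefix patterns transcribed from the mix'n'match regexps
SUFFIXES = re.compile(r'[, ]+(Jr\.?|Sr\.?|PhD\.?|MD|M\.D\.)$')
PREFIXES = re.compile(r'^(Sir|Baron|Baronesse?|Graf|Gräfin)\s+')
DIACRITICS = {'à': 'a', 'é': 'e', 'ö': 'o'}  # illustrative subset

def normalize_name(name):
    name = SUFFIXES.sub('', name)
    name = PREFIXES.sub('', name)
    return ''.join(DIACRITICS.get(char, char) for char in name)

print(normalize_name('Sir Arthur Conan Doyle'))  # Arthur Conan Doyle
print(normalize_name('Sammy Davis, Jr.'))        # Sammy Davis
```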
Set up a logging mechanism.
INPUT: full set of Wikidata musicians and bands QIDs with Discogs ID; full set of Discogs artists;
OUTPUT: set of Wikidata QIDs with invalid Discogs ID
Please put the implementation under the soweego/validator folder.
Import them into 2 different tables in MariaDB.
As per #35, MariaDB on Toolforge doesn't seem to support one.
Therefore, the fuzzy analyzer must be implemented before ingesting datasets into MariaDB.
We have 3 criteria, 2 generic and 1 specific to item type.
Community discussion initiated at:
So far, we have developed several different things in the target_selection module.
It's time now to move logic to appropriate places:
discogs, bne, musicbrainz matching_strategies.py -> linker module;
discogs/baseline_matcher#extract_data_from_dump -> importer module, with appropriate sub-folders.

In particular:
os.path.join
resources.get_data
click.argument
click.option
#matches / sample length
#matches / (sample length - #already linked Wikidata items)
#unique matches / (sample length - #already linked Wikidata items)

The final version will be published on the project page: https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego/Timeline
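A tiny sketch of how these three ratios could be computed (all names are hypothetical):

```python
def evaluation_metrics(matches, sample, already_linked):
    """matches: matched QIDs (possibly with duplicates); sample: the
    evaluated records; already_linked: count of sample items that
    already carry a Wikidata link."""
    unlinked = len(sample) - already_linked
    return {
        '#matches / sample length': len(matches) / len(sample),
        '#matches / unlinked': len(matches) / unlinked,
        '#unique matches / unlinked': len(set(matches)) / unlinked,
    }
```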