metadata-conversion's Introduction

CLARIN metadata conversion

Conversion between metadata formats, in particular conversion to/from CMDI. Also see the CMDI toolkit project.

Development

Please develop in a conversion-specific development branch with a clear name, such as dev-edm-cmdi for EDM-CMDI conversion.

CI & tests

A Travis CI configuration is included. It defines test.sh as its script file, which in turn runs every test.sh file found in the directories directly below the project's root.
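The discovery step described here can be illustrated with a small sketch. It is not the project's actual test.sh; the function names `find_test_scripts` and `run_all` are hypothetical, and running the scripts through `bash` is an assumption.

```python
import subprocess
from pathlib import Path
from typing import List


def find_test_scripts(root: Path) -> List[Path]:
    """Return the test.sh files located exactly one directory below root."""
    # The "*/test.sh" pattern skips both root/test.sh and deeper nested scripts.
    return sorted(p for p in root.glob("*/test.sh") if p.is_file())


def run_all(root: Path) -> bool:
    """Run each discovered script from its own directory; True if all succeed."""
    ok = True
    for script in find_test_scripts(root):
        # Assumption: the scripts are bash scripts and expect their own
        # directory as the working directory.
        result = subprocess.run(["bash", script.name], cwd=script.parent)
        ok = ok and result.returncode == 0
    return ok
```

Calling `run_all(Path("."))` from the project root would run every per-conversion test script and report whether all of them exited with status 0.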

metadata-conversion's People

Contributors

menzowindhouwer, twagoo

metadata-conversion's Issues

EDM: Preserve metadata hierarchies (hasPart)

EDM records may be 'metadata parents' having e.g.

<dcterms:hasPart 
  xmlns:dcterms="http://purl.org/dc/terms/"
  rdf:resource="http://data.theeuropeanlibrary.org/BibliographicResource/3000051827382"/>

or they may be 'metadata children' having e.g.

<dcterms:isPartOf 
  xmlns:dcterms="http://purl.org/dc/terms/"
  rdf:resource="http://data.theeuropeanlibrary.org/BibliographicResource/3000053640021"/>

Preserve this information in the generated CMDI by creating metadata type resource proxies in the parent record.
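As a rough illustration of the first step, the sketch below collects the dcterms:hasPart and dcterms:isPartOf targets from an EDM RDF/XML record. The conversion itself is done in XSLT; this Python function (`hierarchy_links`, a hypothetical name) only shows which values would feed the resource proxies, and it assumes the targets are always given as rdf:resource attributes, as in the examples above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def hierarchy_links(edm_xml: str) -> Dict[str, List[str]]:
    """Collect the rdf:resource targets of dcterms:hasPart and dcterms:isPartOf."""
    root = ET.fromstring(edm_xml)
    links: Dict[str, List[str]] = {"hasPart": [], "isPartOf": []}
    for name in links:
        for element in root.iter(f"{{{DCTERMS}}}{name}"):
            uri = element.get(f"{{{RDF}}}resource")
            if uri:  # skip relations that carry no rdf:resource attribute
                links[name].append(uri)
    return links
```

Each URI under "hasPart" would then become the reference of one metadata-type resource proxy in the parent's CMDI record.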

EDM: resolve vocabulary items by id or URI to value

Europeana records from the newspapers collections (possibly also others) use various identifiers for e.g. subject or resource type values. Resolving these would make the metadata better suited for indexing into the VLO.

  • IDs of Library of Congress Subject Headings appear as subject values, e.g. <dc-subject>sh85091614</dc-subject> (full record), which is a reference to http://id.loc.gov/authorities/subjects/sh85091614 "Newspapers--Sections, columns, etc" (skos RDF)

    • These always take the form of /sh[0-9]+/ as text content within dc:subject elements. The concept URIs don't appear to be used, i.e. no @rdf:resource.
  • Resource types are often encoded with concepts from the Getty Art and Architecture Thesaurus, which are included in expanded form in the RDF/XML representations harvested. Rather than rendering the full content we could also detect these and do a lookup or trim down the provided values to only include the most relevant information.

    • Example from the RDF: <dc:type xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:resource="http://vocab.getty.edu/aat/300026656"/>, which is expanded in the conversion to CMDI with all content found in the concept definition also included in the RDF/XML served by Europeana's OAI provider (in this case ten altLabels/prefLabels in different languages: Tageblätter, tidning, newspaper etc).
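The detection of Library of Congress subject IDs could look roughly like the sketch below. It only maps a matching dc:subject value to its concept URI, using the pattern and base URL quoted above; fetching and parsing the label behind that URI is left out. The function name `lcsh_uri` is hypothetical.

```python
import re
from typing import Optional

# Pattern and base URL as described in the issue text.
LCSH_ID = re.compile(r"sh[0-9]+")
LCSH_BASE = "http://id.loc.gov/authorities/subjects/"


def lcsh_uri(subject_text: str) -> Optional[str]:
    """Return the LCSH concept URI if the whole value is an LCSH ID, else None."""
    value = subject_text.strip()
    # fullmatch ensures ordinary subject strings that merely contain
    # something like "sh1" are not treated as identifiers.
    if LCSH_ID.fullmatch(value):
        return LCSH_BASE + value
    return None
```

For the example above, `lcsh_uri("sh85091614")` yields http://id.loc.gov/authorities/subjects/sh85091614, while a plain-text subject such as "Newspapers" is left untouched.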

An example EDM record and its current CMDI conversion:
BibliographicResource_30001170701972017.xml (RDF/XML)
BibliographicResource_30001170701972017.cmdi (CMDI)

Enhanced Datacite - CMDI conversion

Follow-up of #9. Selected records are being harvested from the DataCite OAI endpoint as Dublin Core and converted to CMDI using the default DC/OLAC-to-CMDI stylesheet. The end result is acceptable but not perfect. An enhanced custom conversion could produce:

  • DOI as a landing page link rather than a resource link
  • a nice self link and document ID (currently e.g. oai:oai.datacite.org:13701651)
  • more detailed and specific information, e.g. details on the creators and contributors, distinction between geo and temporal coverage, funding information

OLAC2CMDI: accept `doi:` identifiers

When processing dc:identifier while looking for resources, we now support 'regular' URLs, handles and urn:nbn. We could add DOI to that. A real use case of this popped up with the metadata provided by TROLLing (OAI) where content like the following element can be found:

<dc:identifier>doi:10.18710/AGL9FD</dc:identifier>

Currently this is ignored. However, since DOI is a valid and common scheme, we could choose to accept it as is (like we do with handle and urn:nbn), or rewrite it on the fly to a https://dx.doi.org/.... URL.
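The rewriting option can be sketched as follows. This is an illustration in Python rather than the stylesheet's XSLT; the function name `doi_to_url`, the `resolver` parameter and the exact pattern for what counts as a DOI are assumptions, and the suffix is not URL-encoded here.

```python
import re
from typing import Optional

# Assumed shape of a DOI: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^doi:(10\.\d{4,9}/\S+)$", re.IGNORECASE)


def doi_to_url(identifier: str, resolver: str = "https://dx.doi.org/") -> Optional[str]:
    """Rewrite a 'doi:' identifier to a resolver URL; None if it is not a DOI."""
    match = DOI_PATTERN.match(identifier.strip())
    if match is None:
        return None
    return resolver + match.group(1)
```

With the TROLLing example, `doi_to_url("doi:10.18710/AGL9FD")` gives https://dx.doi.org/10.18710/AGL9FD; identifiers in other schemes return None and can fall through to the existing handling.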

Errors converting DataCite metadata from TextGrid

When harvesting from the TextGrid OAI-PMH provider, conversion from DataCite to CMDI fails with an Ambiguous rule match error.

Example of a full stacktrace:

2022-04-04T18:09:46,846 ERROR [TextGrid Repository] TransformAction - Ambiguous rule match for /datacite:resource/datacite:identifier[1]/@identifierType
Matches both "attribute(Q{}identifierType)" on line 242 of https://raw.githubusercontent.com/clarin-eric/metadata-conversion/master/datacite-cmdi/datacite_to_cmdi-kernel4.xsl
and "attribute(Q{}identifierType)" on line 242 of https://raw.githubusercontent.com/clarin-eric/metadata-conversion/master/datacite-cmdi/datacite_to_cmdi-kernel3.xsl
net.sf.saxon.trans.XPathException: Ambiguous rule match for /datacite:resource/datacite:identifier[1]/@identifierType
Matches both "attribute(Q{}identifierType)" on line 242 of https://raw.githubusercontent.com/clarin-eric/metadata-conversion/master/datacite-cmdi/datacite_to_cmdi-kernel4.xsl
and "attribute(Q{}identifierType)" on line 242 of https://raw.githubusercontent.com/clarin-eric/metadata-conversion/master/datacite-cmdi/datacite_to_cmdi-kernel3.xsl
	at net.sf.saxon.trans.Mode.reportAmbiguity(Mode.java:809) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.searchRuleChain(Mode.java:556) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.getRule(Mode.java:483) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1040) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:276) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.processLeavingTail(ApplyTemplates.java:236) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:657) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Choose.processLeavingTail(Choose.java:871) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:138) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:429) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:371) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1056) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:276) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.processLeavingTail(ApplyTemplates.java:236) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:657) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:138) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.DocumentInstr.evaluateItem(DocumentInstr.java:320) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.DocumentInstr.evaluateItem(DocumentInstr.java:53) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.parser.ExpressionTool.evaluate(ExpressionTool.java:328) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.GeneralVariable.getSelectValue(GeneralVariable.java:473) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Instruction.assembleParams(Instruction.java:194) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.CallTemplate.process(CallTemplate.java:341) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.CallTemplate.processLeavingTail(CallTemplate.java:393) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:657) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1056) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:276) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates.process(ApplyTemplates.java:232) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:429) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:371) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:138) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:429) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:371) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:657) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:138) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:429) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:371) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1056) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.expr.instruct.ApplyTemplates$ApplyTemplatesPackage.processLeavingTail(ApplyTemplates.java:514) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.TextOnlyCopyRuleSet.process(TextOnlyCopyRuleSet.java:67) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1044) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.Controller.transformDocument(Controller.java:2088) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.Controller.transform(Controller.java:1911) [Saxon-HE-9.5.1-8.jar:?]
	at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:450) [Saxon-HE-9.5.1-8.jar:?]
	at nl.mpi.oai.harvester.action.TransformAction.perform(TransformAction.java:168) [oai-harvest-manager-1.2.1.6fb640.jar:?]
	at nl.mpi.oai.harvester.action.ActionSequence.runActions(ActionSequence.java:136) [oai-harvest-manager-1.2.1.6fb640.jar:?]
	at nl.mpi.oai.harvester.action.ActionSequence.runActions(ActionSequence.java:116) [oai-harvest-manager-1.2.1.6fb640.jar:?]
	at nl.mpi.oai.harvester.harvesting.Scenario.listRecords(Scenario.java:234) [oai-harvest-manager-1.2.1.6fb640.jar:?]
	at nl.mpi.oai.harvester.control.Worker.run(Worker.java:207) [oai-harvest-manager-1.2.1.6fb640.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

EDM: mapping of collection name (for newspapers)

See the example query 9200366_Ag_EU_TEL_a0641_Newspapers_Slovenia, which gives the following values for the collection facet (at time of testing):

Europeana Newspapers (47712)
The European Library: Newspapers Slovenia (27)

The issues are all in the former, larger collection, while the top-level title descriptions are in the latter. This may be an artifact of our mapping to collection name, which prefers 'is part of' relations; these are present for issue records but not for top-level title records. Investigate if there is a better approach...

EDM: filtering of content languages

Add a parameter to the EDM to CMDI conversion stylesheet to include/exclude content with certain language codes. This could considerably reduce the size of the records and, consequently, the size of the VLO's Solr index.
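The intended effect of such a parameter can be sketched outside XSLT. The function below (hypothetical name `filter_languages`) drops every element whose xml:lang value is not in an allow-list. It assumes that elements without an xml:lang attribute should always be kept and that language codes are compared as exact strings.

```python
import xml.etree.ElementTree as ET
from typing import Set

# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def filter_languages(root: ET.Element, keep: Set[str]) -> ET.Element:
    """Remove elements whose xml:lang is set and not in keep (modifies root)."""
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            lang = child.get(XML_LANG)
            if lang is not None and lang not in keep:
                parent.remove(child)
    return root
```

For a concept with labels in ten languages, calling `filter_languages(record, {"en"})` would leave only the English labels plus any untagged content.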
