Description
Several works from academia and industry exploit the "type" of DBpedia resources. This "type" is a class in the DBpedia ontology, such as Person, Movie or Device. The "type" is derived from (1) the Wikipedia infobox of the resource and (2) the mapping that humans have created for that infobox. Therefore, DBpedia extractors cannot assign a type to a resource when (1) the resource has no infobox in Wikipedia, or (2) the resource has an infobox that is not mapped. For many languages, the share of resources without a type reaches 50%.
Several experimental studies have tried to infer the type of a resource from the "connections" this resource has in the graph it belongs to. For instance, [1] follows a statistical approach, and [2] follows a machine learning approach.
However, validating these approaches is not simple: as DBpedia classes are arranged in a hierarchy (Writer is a subclass of Person, Poet is a subclass of Writer, etc.) with up to 7 levels, the deeper levels tend to have fewer resources. Therefore, the precision and recall of the "type predictors" must be validated per class or, at least, per level.
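As a rough illustration of what per-class and per-level validation could look like, the following Python sketch scores a predictor against manually checked types. The toy hierarchy, the resource names and the exact-match scoring rule are assumptions made for this example; they are not taken from [1] or [2], and a real evaluation might also give credit for predicting an ancestor class.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: class -> parent class (None for a root class).
PARENT = {"Person": None, "Writer": "Person", "Poet": "Writer", "Device": None}


def depth(cls):
    """Level of a class in the hierarchy (root classes are level 1)."""
    level = 1
    while PARENT[cls] is not None:
        cls = PARENT[cls]
        level += 1
    return level


def precision_recall(gold, predicted, key=lambda cls: cls):
    """Exact-match precision and recall, grouped by key(class).

    gold and predicted map resource -> class; a resource missing from
    predicted is treated as untyped. Returns {group: (precision, recall)},
    with None where the denominator is zero.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for resource, gold_cls in gold.items():
        pred_cls = predicted.get(resource)
        if pred_cls == gold_cls:
            tp[key(gold_cls)] += 1
        else:
            fn[key(gold_cls)] += 1
            if pred_cls is not None:
                fp[key(pred_cls)] += 1
    scores = {}
    for group in set(tp) | set(fp) | set(fn):
        p_den = tp[group] + fp[group]
        r_den = tp[group] + fn[group]
        scores[group] = (
            tp[group] / p_den if p_den else None,
            tp[group] / r_den if r_den else None,
        )
    return scores


# Made-up data: resource "d" is left untyped by the predictor.
gold = {"a": "Poet", "b": "Writer", "c": "Person", "d": "Device"}
predicted = {"a": "Writer", "b": "Writer", "c": "Person"}

per_class = precision_recall(gold, predicted)
per_level = precision_recall(gold, predicted, key=depth)
```

Passing key=depth groups the same counts by hierarchy level instead of by class, which makes it easier to see how the scores change at the deeper, sparser levels.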
[1] Paulheim, H., Bizer, C.: Type inference on noisy RDF data. ISWC 2013. LNCS, vol. 8218, pp. 510–525. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41335-3_32
[2] Rico, M., Santana-Pérez, I., Pozo-Jiménez, P., Gómez-Pérez, A.: Inferring Types on Large Datasets Applying Ontology Class Hierarchy Classifiers: The DBpedia Case. EKAW 2018. LNCS, vol. 11313. Springer. https://doi.org/10.1007/978-3-030-03667-6_21
Goals
To carry out this validation we need a "golden standard": a set of resources whose type has been manually verified, with several resources for each class of the ontology. This "golden standard" should be built using ad hoc software tools, ideally a web application.
Impact
Enhance the quality of DBpedia. With this golden standard we could more easily evaluate the approaches that assign a type to an untyped resource. It could also help us to assign alternative types to typed resources, for example a more specific (deeper) type or, maybe, a type in another branch of the DBpedia class hierarchy.
Caveats
- The DBpedia ontology keeps growing. The tool should be able to generate a golden standard for every new version of the DBpedia ontology. The latest version is here.
- DBpedia is not only the English DBpedia. There are several "chapters" of DBpedia, each one for a specific language. The list of available DBpedias is here.
- The class assigned to a resource could depend on the (human) evaluator. Therefore, a multi-evaluator tool is required. Fleiss' kappa could help us to measure the level of agreement among evaluators.
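As an illustration of the last caveat, the sketch below computes Fleiss' kappa from a table of evaluator votes. The input layout (one row per resource, one column per candidate class, each cell holding the number of evaluators who chose that class) and the sample votes are assumptions made for this example, not a format prescribed by the project.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of votes.

    counts[i][j] is the number of evaluators who assigned resource i to
    class j. Every resource must be rated by the same number of
    evaluators (at least two).
    """
    n_resources = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each resource needs the same number (>= 2) of ratings")
    n_classes = len(counts[0])

    # Share of all votes that went to each class.
    p_class = [
        sum(row[j] for row in counts) / (n_resources * n_raters)
        for j in range(n_classes)
    ]
    # Observed agreement per resource, then averaged over resources.
    p_resource = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_resource) / n_resources
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_class)
    if p_expected == 1:
        raise ValueError("kappa is undefined when all votes fall in one class")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up votes from 3 evaluators over 2 candidate classes.
full_agreement = fleiss_kappa([[3, 0], [0, 3]])
split_votes = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 means the evaluators always agree; values near or below 0 mean their agreement is no better than chance, which would suggest that the class definitions or the evaluation guidelines need revisiting.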
Ideal profile
Experience with Linked Data technologies (RDF, SPARQL) and with the development of web applications.
Warm up tasks
- For a given language (e.g. English):
  - Find 10 Wikipedia entries without infobox.
  - Find 10 Wikipedia entries with infobox but whose infobox has no mapping.
- Read these papers:
Mentors
Mariano Rico
Keywords
golden standard, resource type validation.