
datascienceontology's People

Contributors

dependabot[bot], epatters, sander, sander-cb


datascienceontology's Issues

Improve test harness

Currently, the test harness simply builds JSON documents from the YAML source and checks that they conform to the JSON schemas. This is a minimal test of syntactic well-formedness.

The tests should be extended to check that the annotations reference valid concepts and define valid Monocl expressions. The easiest way to do that is to just load everything into semanticflowgraph's ontology DB and see if any exceptions are thrown.
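The cross-referencing check described above could be sketched in plain Python. This is a minimal illustration, not the actual harness; the document shapes and field names are hypothetical stand-ins for the real YAML schemas.

```python
# Hypothetical in-memory documents; a real harness would load them from
# the YAML sources. Field names are illustrative.
concepts = [{"id": "k-means"}, {"id": "clustering-model"}]
annotations = [
    {"id": "sklearn-kmeans", "definition": {"concept": "k-means"}},
    {"id": "bad-annotation", "definition": {"concept": "does-not-exist"}},
]

def check_references(concepts, annotations):
    """Return IDs of annotations whose definitions reference unknown concepts."""
    known = {c["id"] for c in concepts}
    return [a["id"] for a in annotations
            if a["definition"]["concept"] not in known]

print(check_references(concepts, annotations))  # ['bad-annotation']
```

Loading everything into semanticflowgraph's ontology DB would subsume this check, but a lightweight pass like this could run quickly in CI.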

Package versions in annotations

Package versions should be attached to annotations. This is important for reproducibility and maintainability over time.

Implementation guidance: It would be too cumbersome to pin all the annotations for a particular package to a single version number. We will take a more lightweight approach. Individual annotations will have an optional field for package version. This can be bumped when breaking changes are made to the package's API.

Once implemented here, the Python and R flow graph packages should be updated to resolve annotations using the versioning information.
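Resolution in the flow graph packages could then work roughly as follows. This is a sketch under the assumption that the optional version field means "valid from this package version onward"; the field names and data are illustrative, not the actual schema.

```python
# Hypothetical annotations for the same function, differing only in the
# optional "version" field (None = valid for all versions).
annotations = [
    {"id": "k-means-fit", "package": "scikit-learn", "version": None},
    {"id": "k-means-fit", "package": "scikit-learn", "version": "0.20"},
]

def parse(version):
    """Parse a dotted version string into a comparable tuple."""
    return tuple(int(x) for x in (version or "0").split("."))

def resolve(annotations, installed):
    """Pick the latest annotation whose version does not exceed `installed`."""
    applicable = [a for a in annotations
                  if parse(a["version"]) <= parse(installed)]
    return max(applicable, key=lambda a: parse(a["version"]))

print(resolve(annotations, "0.22")["version"])  # 0.20
```

With this scheme, annotators only add a new record when a breaking API change occurs, rather than pinning every annotation to an exact version.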

Map upper-level concepts to ML-Schema

ML-Schema is an upper ontology for machine learning workflows, inspired by OpenML's data model and created by @joaquinvanschoren and others.

We should map the upper-level concepts in the DSO, like model and data, to concepts in ML-Schema. The overlap is currently small, but it may grow over time.

This mapping should be implemented by

  • external resources in the native format, as we do for Wikidata concepts
  • suitable OWL statements in the RDF/OWL export
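The two mechanisms above could look roughly like this. The field names, concept IDs, and ML-Schema term are illustrative assumptions, not the DSO's actual schema or a confirmed mapping.

```python
# Hypothetical external-resource entry in the native format, modeled on
# how Wikidata links might be recorded. Field names are illustrative.
concept = {
    "id": "model",
    "external": [
        {"ontology": "ML-Schema", "id": "mls:Model", "relation": "equivalent"},
    ],
}

# In the RDF/OWL export, the same mapping could be emitted as an OWL
# statement, e.g. this Turtle-style triple (prefixes assumed declared):
triple = "dso:model owl:equivalentClass mls:Model ."
print(triple)
```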

Generate OWL artifact

I'd like to generate an OWL artifact so we can ingest the data science ontology into the ASKEM TA2 domain knowledge graph. Would you be willing to accept a PR that generates this?

Publish backend code

Thank you for sharing this interesting project.

If I’m not mistaken, datascienceontology-frontend uses a CouchDB + Node.js backend at api.datascienceontology.org. I could not find the source code for it however. Is it publicly available somewhere for collaboration?

Bibliographic references for concepts

Concepts that are esoteric or not entirely standard should include references to the scholarly literature.

Currently I am adding these references as plain text in the description field. We should support a more structured format for references. Options include BibTeX, RIS, Citation Style Language, or a Semantic Web ontology such as Dublin Core, BIBO, or FRBR. For our purposes, a lightweight approach is preferable.
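As one concrete lightweight option, references could be embedded as CSL-JSON-style objects in a concept's metadata. This is a sketch; the `references` field name and the choice of concept are hypothetical, though the cited Parzen paper is real.

```python
# Hypothetical structured-reference field on a concept document, using
# CSL-JSON item conventions ("type", "author", "issued", etc.).
concept = {
    "id": "kernel-density-estimation",
    "references": [
        {
            "type": "article-journal",
            "author": [{"family": "Parzen", "given": "Emanuel"}],
            "title": "On estimation of a probability density function and mode",
            "container-title": "Annals of Mathematical Statistics",
            "issued": {"date-parts": [[1962]]},
        }
    ],
}
```

CSL-JSON has the advantage that existing citation processors can render it, while still being plain JSON that fits the DSO's YAML/JSON pipeline.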

Design principles

The purpose of this issue is to collect and discuss preliminary "design principles" for the DSO, as they come up. At some point we'll create a more permanent document.

OWL Provenance aware ontology parallel version

I love this data science ontology initiative and the full mission of the project.

I understand the explanation in the FAQ "Why doesn't the DSO use the Semantic Web standards?" about the importance of an ontology language based on the lambda calculus. However, I think there may be ways to overcome this drawback (e.g. http://west.uni-koblenz.de/lambda-dl), and the advantages of making inferences and of visualizing and reviewing the correctness of the axioms would take the DS ontology to another level.
So I came up with an ontology that contains the most important upper-level concepts and addresses the problem of how to link DS theory, its implementations (the annotations in your work), and provenance data generated from executions. I have added the main details of the k-means algorithm, so we can discuss around a realistic example.

I haven't yet added a provenance graph example using the k-means program examples from your paper, but let's see if all this makes sense to you all.

The main ideas are:

  • PROV-O is imported, and all objects, morphisms, and annotations (I prefer to call them implementations) should go in the appropriate places below concepts of the provenance ontology
  • Every object is a prov:Entity or a prov:Agent
  • Every morphism (function) is a DS entity (a subtype of prov:Entity)
  • Implementations are also entities, describing the exact data type implementation, the ordering of parameters in morphisms, etc. (although there are a few examples, such as the Python_3.6 class and the RD_Python_3.6 instance, where I still have doubts about the best modelling approach)
  • A DS activity occurs when a DS morphism implementation is executed

With all this in place, a DS program can be described as a graph of instances of DS activities, agents, and entities.
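The proposed alignment could be summarized as subject/predicate/object triples. The PROV-O class names are standard; the `dso:` terms are hypothetical names for the entities described in the bullets above.

```python
# Rough sketch of the proposed PROV alignment as triples. prov: and rdfs:
# terms follow the standard vocabularies; dso: names are illustrative.
triples = [
    ("dso:KMeansModel", "rdfs:subClassOf", "prov:Entity"),    # an object
    ("dso:KMeans", "rdfs:subClassOf", "dso:DSEntity"),        # a morphism
    ("dso:KMeansPython", "rdfs:subClassOf", "prov:Entity"),   # an implementation
    ("dso:FitKMeans", "rdfs:subClassOf", "prov:Activity"),    # an execution
]
for s, p, o in triples:
    print(s, p, o)
```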

DataScienceOnto.owl.zip

Many thanks in advance,
Roxana

Migrate to graph database

@epatters wrote in #16:

BTW, for a while I've been considering migrating to a graph database, possibly Dgraph, to enable more flexible querying, but I haven't yet been able to dedicate the time.

What kind of queries are you thinking about that cannot easily be handled using CouchDB views?

For the purpose of datascienceontology-frontend, I'm thinking the database design could even be "less intelligent" and easier to maintain. Since a built ontology just consists of static linked documents plus search/browse indices, all public, at the current scale an S3/IPFS/Dat bucket containing these static docs plus some static indices might be sufficient.
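The "static docs plus static indices" idea above could be sketched as a build-time step that precomputes a simple inverted index over the ontology documents. This is an illustration of the approach, not a proposal for the actual index format; document IDs and text are made up.

```python
# Hypothetical built-ontology documents, keyed by their static URL path.
docs = {
    "concept/k-means": "k-means clustering algorithm",
    "concept/linear-regression": "linear regression model",
}

def build_index(docs):
    """Precompute a token -> document-IDs inverted index at build time."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["model"]))  # ['concept/linear-regression']
```

The resulting index is itself a static document, so the whole ontology could be served from an S3/IPFS/Dat bucket with no query server at all.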

Asking because I'm interested in creating patches to make the frontend and collaboration workflow easier to use and more engaging. I'd like to explore several use cases for collaborative ontology building across concepts and code using this project. The direction this upstream project is going with database management impacts how I should focus my effort.

High-level, informal concepts of data science

The ontology should, perhaps, include high-level concepts of data science, such as "data cleaning/preprocessing", "inference", and "evaluation". The usefulness of such concepts is obvious, but there are several difficulties. Unlike the concepts currently in the ontology, these high-level concepts are

  1. informal and imprecise, i.e., do not admit a clean mathematical description
  2. usually present only implicitly in code or natural text, i.e., must be either inferred using NLP methods or manually annotated by the data analysis author

How to proceed is an open question.

Use consistent terminology in ontology and frontend

After much discussion, we've decided to adopt PLT-style terminology (types, functions, etc.) instead of category-theoretic terminology (objects, morphisms, etc.) for the ontology. This choice is already reflected in the frontend, but not in the ontology data itself. That may seem a small matter, but it creates a non-negligible barrier for new contributors.

To fix this, we need to update:

  • the JSON schemas for concepts and annotations
  • the concepts and annotations themselves
  • the downstream tools that consume the ontology
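Updating the concepts and annotations themselves could be a one-off migration script along these lines. The field name `kind` and the value mapping are illustrative assumptions about the document schema, not the actual JSON schema.

```python
# Hypothetical one-off migration: map category-theoretic field values to
# PLT-style ones. The "kind" field and value names are illustrative.
RENAME = {"object": "type", "morphism": "function"}

def migrate(doc):
    """Return a copy of `doc` with its kind renamed, if applicable."""
    if doc.get("kind") in RENAME:
        doc = {**doc, "kind": RENAME[doc["kind"]]}
    return doc

print(migrate({"id": "model", "kind": "object"}))  # kind becomes "type"
```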

Guards or conditionals for argument values in function annotations

In data science packages, a single function will often implement several conceptually distinct models or methods by dispatching on the value of an argument, as in R's glm or scikit-learn's LogisticRegression (which supports multiple forms of regularization). Function annotations should support dispatching on such arguments as well.

In this repository, it is enough to design a sensible schema and start writing the annotations. The language-specific flow graph packages should then be updated accordingly.
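One possible shape for such guards is a list of clauses, each pairing an argument condition with a concept, tried in order. The clause syntax and concept names below are hypothetical, offered only to make the idea concrete.

```python
# Hypothetical guarded annotation: each clause maps an argument condition
# to a concept; an empty condition acts as the default. Concept names and
# the "when"/"clauses" fields are illustrative, not the DSO schema.
annotation = {
    "function": "sklearn.linear_model.LogisticRegression",
    "clauses": [
        {"when": {"penalty": "l1"}, "concept": "l1-logistic-regression"},
        {"when": {"penalty": "l2"}, "concept": "l2-logistic-regression"},
        {"when": {}, "concept": "logistic-regression"},  # default clause
    ],
}

def dispatch(annotation, call_args):
    """Return the concept of the first clause matching the call's arguments."""
    for clause in annotation["clauses"]:
        if all(call_args.get(k) == v for k, v in clause["when"].items()):
            return clause["concept"]

print(dispatch(annotation, {"penalty": "l1"}))  # l1-logistic-regression
```

First-match semantics keeps the default case simple: the empty condition at the end matches any call that no earlier clause claimed.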

Type concept for dimensionality reduction?

Right now there is a distinction between feature-extraction and feature-extraction-model (I'm curious about the distinction, but I think I get it: one is the general method, the other is a model that reifies the method).

There is a concept for dimension-reduction-model (the name is missing "model"; I'll add that in a PR). But there is no concept for dimension-reduction, which would follow the same method-versus-reifying-model distinction as the feature extraction example above.

Is this omission intentional? Or is this a point-in-time proof-of-concept situation, and dimension-reduction should be added?

Corresponding items in the DSO browser:
https://www.datascienceontology.org/concept/dimension-reduction-model
https://www.datascienceontology.org/concept/feature-extraction-model
https://www.datascienceontology.org/concept/feature-extraction

Investigate preexisting ontologies Expose and DMOP

Several preexisting ontologies are worth investigating, both for inspiration and for their contents. In his PhD thesis, Joaquin Vanschoren describes an ontology called Expose.

The thesis cites a number of previous ontologies related to data mining and ML. One of these is DMOP, some of whose authors are involved in ML-Schema (cf. #14).
