
datascienceontology's People

Contributors

dependabot[bot], epatters, sander, sander-cb


datascienceontology's Issues

Improve test harness

Currently, the test harness simply builds JSON documents from the YAML source and checks that they conform to the JSON schemas. This is a minimal test of syntactic well-formedness.

The tests should be extended to check that the annotations reference valid concepts and define valid Monocl expressions. The easiest way to do that is to just load everything into semanticflowgraph's ontology DB and see if any exceptions are thrown.
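The cross-referencing check described above could be sketched in plain Python. This is a minimal illustration, not the actual harness; the document shapes and field names are hypothetical stand-ins for the real YAML schemas.

```python
# Hypothetical in-memory documents; a real harness would load them from
# the YAML sources. Field names are illustrative.
concepts = [{"id": "k-means"}, {"id": "clustering-model"}]
annotations = [
    {"id": "sklearn-kmeans", "definition": {"concept": "k-means"}},
    {"id": "bad-annotation", "definition": {"concept": "does-not-exist"}},
]

def check_references(concepts, annotations):
    """Return IDs of annotations whose definitions reference unknown concepts."""
    known = {c["id"] for c in concepts}
    return [a["id"] for a in annotations
            if a["definition"]["concept"] not in known]

print(check_references(concepts, annotations))  # ['bad-annotation']
```

Loading everything into semanticflowgraph's ontology DB would subsume this check, but a lightweight pass like this could run quickly in CI.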

Package versions in annotations

Package versions should be attached to annotations. This is important for reproducibility and maintainability over time.

Implementation guidance: It would be too cumbersome to pin all the annotations for a particular package to a single version number. We will take a more lightweight approach. Individual annotations will have an optional field for package version. This can be bumped when breaking changes are made to the package's API.

Once implemented here, the Python and R flow graph packages should be updated to resolve annotations using the versioning information.
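Resolution in the flow graph packages could then work roughly as follows. This is a sketch under the assumption that the optional version field means "valid from this package version onward"; the field names and data are illustrative, not the actual schema.

```python
# Hypothetical annotations for the same function, differing only in the
# optional "version" field (None = valid for all versions).
annotations = [
    {"id": "k-means-fit", "package": "scikit-learn", "version": None},
    {"id": "k-means-fit", "package": "scikit-learn", "version": "0.20"},
]

def parse(version):
    """Parse a dotted version string into a comparable tuple."""
    return tuple(int(x) for x in (version or "0").split("."))

def resolve(annotations, installed):
    """Pick the latest annotation whose version does not exceed `installed`."""
    applicable = [a for a in annotations
                  if parse(a["version"]) <= parse(installed)]
    return max(applicable, key=lambda a: parse(a["version"]))

print(resolve(annotations, "0.22")["version"])  # 0.20
```

With this scheme, annotators only add a new record when a breaking API change occurs, rather than pinning every annotation to an exact version.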

Map upper-level concepts to ML-Schema

ML-Schema is an upper ontology for machine learning workflows, inspired by OpenML's data model and created by @joaquinvanschoren and others.

We should map the upper-level concepts in the DSO, like model and data, to concepts in ML-Schema. The overlap is currently small, but it may grow over time.

This mapping should be implemented by

  • external resources in the native format, as we do for Wikidata concepts
  • suitable OWL statements in the RDF/OWL export
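The two mechanisms above could look roughly like this. The field names, concept IDs, and ML-Schema term are illustrative assumptions, not the DSO's actual schema or a confirmed mapping.

```python
# Hypothetical external-resource entry in the native format, modeled on
# how Wikidata links might be recorded. Field names are illustrative.
concept = {
    "id": "model",
    "external": [
        {"ontology": "ML-Schema", "id": "mls:Model", "relation": "equivalent"},
    ],
}

# In the RDF/OWL export, the same mapping could be emitted as an OWL
# statement, e.g. this Turtle-style triple (prefixes assumed declared):
triple = "dso:model owl:equivalentClass mls:Model ."
print(triple)
```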

Generate OWL artifact

I'd like to generate an OWL artifact so we can ingest the data science ontology into the ASKEM TA2 domain knowledge graph. Would you be willing to accept a PR that generates this?

Publish backend code

Thank you for sharing this interesting project.

If I’m not mistaken, datascienceontology-frontend uses a CouchDB + Node.js backend at api.datascienceontology.org. I could not find the source code for it however. Is it publicly available somewhere for collaboration?

Bibliographic references for concepts

Concepts that are esoteric or not entirely standard should include references to the scholarly literature.

Currently I am adding these references as plain text in the description field. We should support a more structured format for references. Options include BibTeX, RIS, Citation Style Language, or a Semantic Web ontology such as Dublin Core, BIBO, or FRBR. For our purposes, a lightweight approach is preferable.
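As one concrete lightweight option, references could be embedded as CSL-JSON-style objects in a concept's metadata. This is a sketch; the `references` field name and the choice of concept are hypothetical, though the cited Parzen paper is real.

```python
# Hypothetical structured-reference field on a concept document, using
# CSL-JSON item conventions ("type", "author", "issued", etc.).
concept = {
    "id": "kernel-density-estimation",
    "references": [
        {
            "type": "article-journal",
            "author": [{"family": "Parzen", "given": "Emanuel"}],
            "title": "On estimation of a probability density function and mode",
            "container-title": "Annals of Mathematical Statistics",
            "issued": {"date-parts": [[1962]]},
        }
    ],
}
```

CSL-JSON has the advantage that existing citation processors can render it, while still being plain JSON that fits the DSO's YAML/JSON pipeline.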

Design principles

The purpose of this issue is to collect and discuss preliminary "design principles" for the DSO, as they come up. At some point we'll create a more permanent document.

OWL Provenance aware ontology parallel version

I love this data science ontology initiative and the full mission of the project.

I understand the explanation in the FAQ "Why doesn't the DSO use the Semantic Web standards?" about the importance of an ontology language based on the lambda calculus. However, I think there may be ways to overcome this drawback (e.g. http://west.uni-koblenz.de/lambda-dl), and the advantages of making inferences and of visualizing and reviewing the correctness of the axioms would take the DS ontology to another level.
So I came up with an ontology that contains the most important upper-level concepts and addresses the problem of how to link DS theory, its implementations (the annotations in your work), and provenance data generated from executions. I have added the main details of the k-means algorithm, so we can discuss around a realistic example.

I haven't yet added a provenance graph example using the k-means program examples from your paper, but let's see if all this makes sense to you all.

The main ideas are:

  • PROV-O is imported, and all objects, morphisms, and annotations (I prefer to call them implementations) should go in the appropriate places below concepts of the provenance ontology
  • Every object is a prov:Entity or a prov:Agent
  • Every morphism (function) is a DS entity (a subtype of prov:Entity)
  • Implementations are also entities, describing the exact data type implementation, the ordering of parameters in morphisms, etc. (although there are a few examples, such as the Python_3.6 class and the RD_Python_3.6 instance, where I still have doubts about the best modelling approach)
  • A DS activity occurs when a DS morphism implementation is executed

With all this in place, a DS program can be described as a graph of instances of DS activities, agents, and entities.
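The proposed alignment could be summarized as subject/predicate/object triples. The PROV-O class names are standard; the `dso:` terms are hypothetical names for the entities described in the bullets above.

```python
# Rough sketch of the proposed PROV alignment as triples. prov: and rdfs:
# terms follow the standard vocabularies; dso: names are illustrative.
triples = [
    ("dso:KMeansModel", "rdfs:subClassOf", "prov:Entity"),    # an object
    ("dso:KMeans", "rdfs:subClassOf", "dso:DSEntity"),        # a morphism
    ("dso:KMeansPython", "rdfs:subClassOf", "prov:Entity"),   # an implementation
    ("dso:FitKMeans", "rdfs:subClassOf", "prov:Activity"),    # an execution
]
for s, p, o in triples:
    print(s, p, o)
```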

DataScienceOnto.owl.zip

Many thanks in advance,
Roxana

Migrate to graph database

@epatters wrote in #16:

BTW, for a while I've been considering migrating to a graph database, possibly Dgraph, to enable more flexible querying, but I haven't yet been able to dedicate the time.

What kind of queries are you thinking about that cannot easily be handled using CouchDB views?

For the purpose of datascienceontology-frontend, I'm thinking the database design could even be "less intelligent" and easier to maintain. Since a built ontology just consists of static linked documents plus search/browse indices, all public, at the current scale an S3/IPFS/Dat bucket containing these static docs plus some static indices might be sufficient.
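The "static docs plus static indices" idea above could be sketched as a build-time step that precomputes a simple inverted index over the ontology documents. This is an illustration of the approach, not a proposal for the actual index format; document IDs and text are made up.

```python
# Hypothetical built-ontology documents, keyed by their static URL path.
docs = {
    "concept/k-means": "k-means clustering algorithm",
    "concept/linear-regression": "linear regression model",
}

def build_index(docs):
    """Precompute a token -> document-IDs inverted index at build time."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["model"]))  # ['concept/linear-regression']
```

The resulting index is itself a static document, so the whole ontology could be served from an S3/IPFS/Dat bucket with no query server at all.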

Asking because I'm interested in creating patches to make the frontend and collaboration workflow easier to use and more engaging. I'd like to explore several use cases for collaborative ontology building across concepts and code using this project. The direction this upstream project is going with database management impacts how I should focus my effort.

High-level, informal concepts of data science

The ontology should, perhaps, include high-level concepts of data science, such as "data cleaning/preprocessing", "inference", and "evaluation". The usefulness of such concepts is obvious, but there are several difficulties. Unlike the concepts currently in the ontology, these high-level concepts are

  1. informal and imprecise, i.e., do not admit a clean mathematical description
  2. usually present only implicitly in code or natural text, i.e., must be either inferred using NLP methods or manually annotated by the data analysis author

How to proceed is an open question.

Use consistent terminology in ontology and frontend

After much discussion, we've decided to adopt PLT-style terminology (types, functions, etc.) instead of category-theoretic terminology (objects, morphisms, etc.) for the ontology. This choice is already reflected in the frontend, but not in the ontology data itself. That may seem a small matter, but it creates a non-negligible barrier for new contributors.

To fix this, we need to update:

  • the JSON schemas for concepts and annotations
  • the concepts and annotations themselves
  • the downstream tools that consume the ontology
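Updating the concepts and annotations themselves could be a one-off migration script along these lines. The field name `kind` and the value mapping are illustrative assumptions about the document schema, not the actual JSON schema.

```python
# Hypothetical one-off migration: map category-theoretic field values to
# PLT-style ones. The "kind" field and value names are illustrative.
RENAME = {"object": "type", "morphism": "function"}

def migrate(doc):
    """Return a copy of `doc` with its kind renamed, if applicable."""
    if doc.get("kind") in RENAME:
        doc = {**doc, "kind": RENAME[doc["kind"]]}
    return doc

print(migrate({"id": "model", "kind": "object"}))  # kind becomes "type"
```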

Guards or conditionals for argument values in function annotations

In data science packages, a single function will often implement several conceptually distinct models or methods by dispatching on the value of an argument, as in R's glm or scikit-learn's LogisticRegression (which supports multiple forms of regularization). Function annotations should support dispatching on such arguments as well.

In this repository, it is enough to design a sensible schema and start writing the annotations. The language-specific flow graph packages should then be updated accordingly.
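One possible shape for such guards is a list of clauses, each pairing an argument condition with a concept, tried in order. The clause syntax and concept names below are hypothetical, offered only to make the idea concrete.

```python
# Hypothetical guarded annotation: each clause maps an argument condition
# to a concept; an empty condition acts as the default. Concept names and
# the "when"/"clauses" fields are illustrative, not the DSO schema.
annotation = {
    "function": "sklearn.linear_model.LogisticRegression",
    "clauses": [
        {"when": {"penalty": "l1"}, "concept": "l1-logistic-regression"},
        {"when": {"penalty": "l2"}, "concept": "l2-logistic-regression"},
        {"when": {}, "concept": "logistic-regression"},  # default clause
    ],
}

def dispatch(annotation, call_args):
    """Return the concept of the first clause matching the call's arguments."""
    for clause in annotation["clauses"]:
        if all(call_args.get(k) == v for k, v in clause["when"].items()):
            return clause["concept"]

print(dispatch(annotation, {"penalty": "l1"}))  # l1-logistic-regression
```

First-match semantics keeps the default case simple: the empty condition at the end matches any call that no earlier clause claimed.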

Type concept for dimensionality reduction?

Right now there is a distinction between feature-extraction and feature-extraction-model (I'm curious about the distinction, but I think I get it: one is the general method, the other is a model that reifies the method).

There is a concept for dimension-reduction-model (the name is missing "model"; I'll add that in a PR). But there is no concept for dimension-reduction, which would follow the same method-versus-reifying-model distinction as the feature extraction example above.

Is this omission intentional? Or is this a point-in-time proof-of-concept situation, and dimension-reduction should be added?

Corresponding items in the DSO browser:
https://www.datascienceontology.org/concept/dimension-reduction-model
https://www.datascienceontology.org/concept/feature-extraction-model
https://www.datascienceontology.org/concept/feature-extraction

Investigate preexisting ontologies Expose and DMOP

Several preexisting ontologies are worth investigating, both for inspiration and for their contents. In his PhD thesis, Joaquin Vanschoren describes an ontology called Expose.

The thesis cites a number of previous ontologies related to data mining and ML. One of these is DMOP, some of whose authors are involved in ML-Schema (cf. #14).
