Data Science Ontology
Home Page: https://www.datascienceontology.org
License: Creative Commons Attribution 4.0 International
I have a couple of ontologies and controlled vocabularies in mind, such as the Intelligent Task Ontology (https://github.com/OpenBioLink/ITO), that might be relevant for alignment. (Note that Wikidata links are already present for some concepts, which might be helpful for getting started.)
Currently, the test harness simply builds JSON documents from the YAML source and checks that they conform to the JSON schemas. This is a minimal test of syntactic well-formedness.
The tests should be extended to check that the annotations reference valid concepts and define valid Monocl expressions. The easiest way to do that is to just load everything into semanticflowgraph's ontology DB and see if any exceptions are thrown.
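To make the current syntactic check concrete, here is a minimal sketch of validating a built JSON document against a schema. The schema, the sample documents, and the hand-rolled validator are all illustrative (the real harness would use a proper JSON Schema validator such as the jsonschema package against the DSO's actual schemas):

```python
import json

# Illustrative stand-in for a JSON schema: required fields plus an enum.
# Hand-rolled here so the sketch is dependency-free; the real test harness
# would call a JSON Schema validator instead.
CONCEPT_SCHEMA = {"required": ["id", "name", "kind"], "kinds": {"type", "function"}}

def check_document(doc: dict) -> bool:
    """Return True if the document is syntactically well-formed per the schema."""
    if any(key not in doc for key in CONCEPT_SCHEMA["required"]):
        return False
    return doc["kind"] in CONCEPT_SCHEMA["kinds"]

# Documents as they would look after building JSON from the YAML source.
good = json.loads('{"id": "k-means", "name": "k-means clustering", "kind": "function"}')
bad = json.loads('{"id": "k-means"}')
print(check_document(good), check_document(bad))  # True False
```

The proposed extension would go beyond this: after the schema check, load the documents into the ontology DB and let any semantic errors (dangling concept references, ill-typed Monocl expressions) surface as exceptions.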
Package versions should be attached to annotations. This is important for reproducibility and maintainability over time.
Implementation guidance: It would be too cumbersome to pin all the annotations for a particular package to a single version number. We will take a more lightweight approach. Individual annotations will have an optional field for package version. This can be bumped when breaking changes are made to the package's API.
Once implemented here, the Python and R flow graph packages should be updated to resolve annotations using the versioning information.
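A hedged sketch of how that resolution might work. The field name `package_version` and the annotation layout are assumptions, not the actual DSO schema: each annotation optionally records the first package version whose API it applies to, and the resolver picks the newest annotation not exceeding the installed version.

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple, e.g. '1.2' -> (1, 2)."""
    return tuple(int(part) for part in v.split("."))

def resolve(annotations: list, installed: str):
    """Pick the annotation with the highest package_version <= the installed version."""
    target = parse_version(installed)
    candidates = [
        a for a in annotations
        if parse_version(a.get("package_version", "0")) <= target
    ]
    return max(
        candidates,
        key=lambda a: parse_version(a.get("package_version", "0")),
        default=None,
    )

# Hypothetical annotations for one function; the version is bumped only
# when a breaking change is made to the package's API.
annotations = [
    {"id": "fit-v1", "package_version": "0.1"},
    {"id": "fit-v2", "package_version": "1.0"},
]
print(resolve(annotations, "0.9")["id"])  # fit-v1
print(resolve(annotations, "1.2")["id"])  # fit-v2
```

The "newest annotation not exceeding the installed version" rule is what makes the scheme lightweight: unchanged annotations never need their version field touched.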
ML-Schema is an upper ontology for machine learning workflows, inspired by OpenML's data model and created by @joaquinvanschoren and others.
We should map the upper-level concepts in the DSO, like model and data, to concepts in ML-Schema. Actually, the overlap is currently not too big, but it may grow over time.
This mapping should be implemented by
I'd like to generate an OWL artifact so we can ingest the data science ontology in the ASKEM TA2 domain knowledge graph - would you be willing to accept a PR that generates this?
Thank you for sharing this interesting project.
If I’m not mistaken, datascienceontology-frontend uses a CouchDB + Node.js backend at api.datascienceontology.org. I could not find the source code for it, however. Is it publicly available somewhere for collaboration?
Concepts that are esoteric or not entirely standard should include references to the scholarly literature.
Currently I am adding these references as plain text in the description field. We should support a more structured format for references. Options include BibTeX, RIS, Citation Style Language, and Semantic Web ontologies such as Dublin Core, BIBO, and FRBR. For our purposes, a lightweight approach is preferable.
The first page, https://www.datascienceontology.org/, should link to this repo with some "contribute-here" type of heading.
One option: a "Contribute" button aligned with the Browse, Learn More ones.
The purpose of this issue is to collect and discuss preliminary "design principles" for the DSO, as they come up. At some point we'll create a more permanent document.
I love this Data Science Ontology initiative and the full mission of the project.
I understand the explanation in the FAQ "Why doesn't the DSO use the Semantic Web standards?" about the importance of an ontology language based on the lambda calculus. However, I think there may be ways to overcome this drawback (e.g. http://west.uni-koblenz.de/lambda-dl), and the advantages of making inferences and of visualizing and reviewing the correctness of the axioms would take the DS ontology to another level.
So, I came up with this ontology, which contains the most important upper-level concepts and solves the problem of how to link DS theory, its implementations (the annotation of your work), and provenance data generated from executions. I have added the main details of the k-means algorithm, so we can discuss around a realistic example.
I haven't yet added a provenance graph example using the k-means program examples in your paper, but let's see if all this makes sense to you all.
The main ideas are:
Given all of these, a DS program can be described as a graph of instances of DS activities, agents, and entities.
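The activity/agent/entity graph can be sketched very simply. The structure below is an assumption, loosely following the W3C PROV split between activities, agents, and entities; the node names and relations are illustrative, not taken from the proposed ontology.

```python
# A program run described as a small provenance-style graph:
# (subject, relation, object) triples over activities, agents, and entities.
edges = [
    ("kmeans-run-1", "used", "iris-data"),                   # activity used an entity
    ("kmeans-run-1", "wasAssociatedWith", "analyst"),        # activity tied to an agent
    ("cluster-model-1", "wasGeneratedBy", "kmeans-run-1"),   # entity produced by activity
]

def neighbors(node: str):
    """All edges touching a node, in either direction."""
    return [e for e in edges if node in (e[0], e[2])]

print(neighbors("kmeans-run-1"))
```

Even this toy form shows the point of the proposal: a concrete run of a DS program becomes a queryable graph of instances rather than free text.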
Many thanks in advance,
Roxana
The R package broom aims to standardize the output of R models as tidy data. See slides by the maintainer for an overview.
Can we learn anything from this project? At the very least, it suggests a class of fairly popular R packages for us to annotate.
BTW, for a while I've been considering migrating to a graph database, possibly Dgraph, to enable more flexible querying, but I haven't yet been able to dedicate the time.
What kind of queries are you thinking about that cannot easily be handled using CouchDB views?
For the purpose of datascienceontology-frontend, I'm thinking the database design could even be "less intelligent" and easier to maintain. Since a built ontology just consists of static linked documents plus search/browse indices, all public, at the current scale an S3/IPFS/Dat bucket containing these static docs plus some static indices might be sufficient.
Asking because I'm interested in creating patches to make the frontend and collaboration workflow easier to use and more engaging. I'd like to explore several use cases for collaborative ontology building across concepts and code using this project. The direction this upstream project is going with database management impacts how I should focus my effort.
The ontology should, perhaps, include high-level concepts of data science, such as "data cleaning/preprocessing", "inference", and "evaluation". The usefulness of such concepts is obvious, but there are several difficulties. Unlike the concepts currently in the ontology, these high-level concepts are
How to proceed is an open question.
After much discussion, we've decided to adopt PLT-style terminology (types, functions, etc.) instead of category-theoretic terminology (objects, morphisms, etc.) for the ontology. This choice is already reflected in the frontend, but not in the ontology data itself. That may seem a small matter, but it creates a non-negligible barrier for new contributors.
To fix this, we need to update:
In data science packages, a single function will often implement several conceptually distinct models or methods by dispatching on the value of an argument, as in R's glm or scikit-learn's LogisticRegression (which supports multiple forms of regularization). Function annotations should support dispatching on such arguments as well.
In this package, it is enough to describe a schema that makes sense and start writing the annotations. The language-specific flow graph packages should then be updated accordingly.
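A possible shape for such a schema, sketched as a lookup table: annotations map a (function, argument) pair to a value-to-concept table. Everything here is hypothetical, including the concept names; the real annotations would live in the DSO's YAML source.

```python
# Hypothetical dispatch table: (function, argument) -> {value: concept}.
# Concept identifiers are made up for illustration.
DISPATCH_TABLE = {
    ("sklearn.linear_model.LogisticRegression", "penalty"): {
        "l1": "l1-regularized-logistic-regression",
        "l2": "l2-regularized-logistic-regression",
    },
    ("stats::glm", "family"): {
        "binomial": "logistic-regression",
        "poisson": "poisson-regression",
    },
}

def annotate_call(function: str, kwargs: dict, default: str = "unknown") -> str:
    """Resolve a call to a concept by dispatching on annotated argument values."""
    for (fn, arg), table in DISPATCH_TABLE.items():
        if fn == function and arg in kwargs:
            return table.get(kwargs[arg], default)
    return default

print(annotate_call("sklearn.linear_model.LogisticRegression", {"penalty": "l1"}))
# l1-regularized-logistic-regression
```

The flow graph packages would consult such a table at annotation-resolution time, falling back to the function-level annotation when the dispatching argument is absent or unrecognized.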
Right now there is a distinction between feature-extraction and feature-extraction-model (curious about the distinction; I think I get it: one is the general method, the other is a model that reifies the method).
There is a concept for dimension-reduction-model (the name is missing "model"; I'll add that in a PR). But there is no concept for dimension-reduction, which would follow the lines of the distinction between method and reifying model present in the example above for feature extraction.
Is this omission intentional? Or is this a point-in-time POC situation - should dimension-reduction be added?
Corresponding items in the DSO browser:
https://www.datascienceontology.org/concept/dimension-reduction-model
https://www.datascienceontology.org/concept/feature-extraction-model
https://www.datascienceontology.org/concept/feature-extraction
Several preexisting ontologies are worth investigating, both for inspiration and for their contents. In his PhD thesis, Joaquin Vanschoren describes an ontology called Exposé.
The thesis cites a number of previous ontologies related to data mining and ML. One of these is DMOP, some of whose authors are involved in ML-Schema (cf. #14).