chop-dbhi / data-models
Collection of various biomedical data models in parseable formats.
Home Page: https://data-models-service.research.chop.edu
The purpose of this file is to make it possible to translate the files into a common format for downstream consumption.
Table definition file fields: tables.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- label - Label for the table.
- description - Description of the table.

Field definition file fields: definitions/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- field (required) - Name of the field.
- label - Corresponds to the field label. This defaults to table + field.
- description - Corresponds to the description or documentation of the field.
- type - Describes the data type of the value the field holds.
- ref_table - The table this field references.
- ref_field - The field this field references.

Field schema file fields: schema/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- field (required) - Name of the field.
- type (required) - Data type of the field.
- length - Describes the maximum length of the value.
- precision - Describes the precision of the value specified in type, typically for a number.
- scale - Describes the scale of the value specified in type, typically for a number.
- default - Defines the default value for the field.

Constraints file fields: constraints/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Table the constraint applies to.
- fields - One or more fields the constraint applies to.
- type (required) - Type of constraint.
- name - Suggested name of the constraint.

Indexes file fields: indexes/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- fields (required) - One or more fields the index applies to.
- name - Suggested name of the index.
- type - Suggested type of the index.

I don't believe this is published yet.
This would enable integration with the Open Knowledge Foundation: http://data.okfn.org/doc/data-package
Currently the /compare endpoint needs to be accessed directly; however, links could be generated for each pair of versions of a model, e.g. v1 → v2 and v2 → v3.
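A minimal sketch of how such links might be generated, assuming the /compare/<model>/<version>/<model>/<version> path layout used elsewhere in this project and an already ordered list of versions. The function name `compare_links` is hypothetical and not part of the service.

```python
def compare_links(model, versions):
    """Return /compare paths for each pair of adjacent versions of one model."""
    # zip pairs every version with its successor: (v1, v2), (v2, v3), ...
    return [
        f"/compare/{model}/{old}/{model}/{new}"
        for old, new in zip(versions, versions[1:])
    ]


if __name__ == "__main__":
    for link in compare_links("pcornet", ["v1", "v2", "v3"]):
        print(link)
```

A model with a single version yields no links, since there is nothing to compare it against.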
The motivation is to support other structures such as REDCap data dictionaries (form, section, field), Harvest metadata (category, concept, field), etc. The benefit is that data models even with different structures can be maintained and represented in a similar way. The interesting bit is what will come out of defining the mappings between disparate models.
Support for this requires generalizing the format to support any DAG (directed acyclic graph; e.g. hierarchy) using a variable length path rather than a static table. For example a REDCap form → section → field could be represented this way:
model | version | path | name |
---|---|---|---|
redcap_project | v1 | | demographics |
redcap_project | v1 | demographics | location |
redcap_project | v1 | demographics/location | city |
An empty path denotes that name is a root element in the model (a table in the case of a relational model). The full identifier within a model would be path + name.
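To make the path + name idea concrete, here is a small sketch that reads rows in the proposed layout and derives each element's full identifier and parent. It assumes a "/" separator (as in demographics/location) and a root row with an empty path; the function names and the CSV encoding of the rows are illustrative only.

```python
import csv
import io

# Rows in the proposed model/version/path/name layout; the first row has an
# empty path, which marks "demographics" as a root element.
ROWS = """model,version,path,name
redcap_project,v1,,demographics
redcap_project,v1,demographics,location
redcap_project,v1,demographics/location,city
"""


def full_identifier(path, name):
    """Join path and name; an empty path means the element is a root."""
    return f"{path}/{name}" if path else name


def load_elements(text):
    """Map each full identifier to its parent identifier (None for roots)."""
    parents = {}
    for row in csv.DictReader(io.StringIO(text)):
        ident = full_identifier(row["path"], row["name"])
        parents[ident] = row["path"] or None
    return parents


elements = load_elements(ROWS)
roots = [ident for ident, parent in elements.items() if parent is None]
```

Because every element carries its own path, the hierarchy can have any depth without changing the file layout.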
Here are a few survey questions I would appreciate interested members answering:
I will note that this format could live alongside the current one.
The format will vary based on format. See #25
A renaming file declares name changes across two versions in a model. This can be used in two areas:
Currently there is just one; committed by accident (which will be removed by a pending PR). The old original i2b2 indexes are very heavy-weight and largely a huge waste of disk and load time as far as pedsnet is concerned.
This probably came from a Django implementation? In any case, some implementations obviously require a PK, but we shouldn't enforce that at the data model level. The Django model generator will create a surrogate PK if required. So if there's no PK on this table in the original DDL, let's remove the constraint.
For example, one CSV file per table.
This is primarily to jump between the major endpoints.
In the Oracle section under FULL DDL, the first table has the following:
conditionid VARCHAR2 NOT NULL,
encounterid VARCHAR2,
patid VARCHAR2 NOT NULL,
raw_condition VARCHAR2,
raw_condition_source VARCHAR2,
raw_condition_status VARCHAR2,
raw_condition_type VARCHAR2,
VARCHAR2 columns without a size specification will result in errors.
Parth
This will be implemented using SQLAlchemy since it has a robust and consistent way of extracting metadata from databases (as opposed to JDBC).
See #5 as a starting point.
Specific issues:
- pedsnet.v1.person.day_of_birth.type from numder -> number
- pedsnet.v2.vocabulary.vocabulary_concept_id from '' -> integer
- pedsnet.v1.indexes.idx_visit_person_id and ...idx_visit_concept_id replaced by ...idx_visit_person_date
- omop.v5.cohort_definition definitions field cohort_instantiation_date -> cohort_initiation_date
- omop.v5.concept definitions remove field concept_level
- omop.v5.fact_relationship schema field domain_concept_id split into ..._1 and ..._2
- omop.v5.constraints.primary_keys.xpk_cohort_attribute remove in favor of xpk_cohort_definition

Larger issues:
- schema.constraints.not_null and foreign_keys set to [] instead of none?)
- pcornet.v1.schema.vital fields have integer types with a length attribute, which is choking the SQLAlchemy constructors... What is the intended meaning of this attribute? Should these be number types (with the length attribute moved to precision instead)?
- pcornet models.
- i2b2.v1_7.schema fields share the integer with length problem described above.
- timestamp as used in i2b2.v1_7 significantly different from datetime elsewhere?

The motivation is to maintain a separate repository of private or internal data models that can be merged in with public data models.
These fields were originally used to represent references, but have since been replaced by reference file types. They are also no longer used in the service.
visit_occurrence table
- visit_start_time is not required
- ...date fields should be date type
- visit_end_date is not required
- place_of_service_source_value should be removed
- visit_type_concept_id should be added

procedure_occurrence table
- relevant_condition_concept_id should be removed
- procedure_date should be date type

provider table
- gender_source_concept_id should be included

care_site table
- specialty_source_value should be added

Other:
- relevant_condition_concept_id should not exist in OMOPV5.
- drug_exposure table

Currently, the directory of the datamodel.json file is walked and bound to the declared model. This presumes a structure which is not necessary since each file type can stand alone (each one contains the model and version).
The first step is to replace datamodel.json with a CSV file (see #45), then update the walk algorithm to rely on the model specified in the file.
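As a rough illustration of a walk that relies on the model declared in each file rather than on the directory layout, the sketch below groups CSV files by the (model, version) pairs found in their rows. It assumes every file has model and version columns, as the field lists in this project describe; `index_files` is a hypothetical helper, not the service's actual algorithm.

```python
import csv
import os
from collections import defaultdict


def index_files(root):
    """Group CSV files under root by the (model, version) declared in their rows."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".csv"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, newline="") as handle:
                for row in csv.DictReader(handle):
                    model, version = row.get("model"), row.get("version")
                    # Skip rows that do not declare both values.
                    if model and version:
                        index[(model, version)].add(path)
    return index
```

With this approach a file can live anywhere under the root, and a single file could even contribute rows to more than one model or version.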
This is to enable creators of new data models to have a starting point.
A renaming file specifies fields that have been renamed between versions which cannot be inferred from a simple diff mechanism.
Proposed fields:
model
source_version
source_table
source_field
target_version
target_table
target_field
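The sketch below shows one way a renaming file with the proposed fields could inform a diff, so that a renamed field is not reported as one removal plus one addition. The model, table, and field names in the sample data are made up, and `load_renames` and `diff_fields` are hypothetical helpers.

```python
import csv
import io

# Made-up sample data using the proposed column names.
RENAMES = """model,source_version,source_table,source_field,target_version,target_table,target_field
example,v1,person,dob,v2,person,birth_date
"""


def load_renames(text):
    """Map (model, source_version, target_version, table, field) to (table, field)."""
    renames = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], row["source_version"], row["target_version"],
               row["source_table"], row["source_field"])
        renames[key] = (row["target_table"], row["target_field"])
    return renames


def diff_fields(model, old_version, new_version, old_fields, new_fields, renames):
    """Classify (table, field) pairs as renamed, removed, or added."""
    renamed = {}
    for table, field in old_fields:
        target = renames.get((model, old_version, new_version, table, field))
        if target in new_fields:
            renamed[(table, field)] = target
    removed = old_fields - new_fields - set(renamed)
    added = new_fields - old_fields - set(renamed.values())
    return renamed, removed, added


old = {("person", "dob"), ("person", "person_id"), ("person", "zip")}
new = {("person", "birth_date"), ("person", "person_id"), ("person", "email")}
renamed, removed, added = diff_fields(
    "example", "v1", "v2", old, new, load_renames(RENAMES))
```

Here dob is reported as renamed to birth_date, while zip and email remain a plain removal and addition.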
/cc @aaron0browne @murphyke
Will reorganize the i2b2 directory to contain V1 and V2; existing stuff will move into V1. The V2 directory will be populated with content that can be digested by https://github.com/chop-dbhi/data-models, i.e. a handful of CSV files describing the data model.
These were used in the early stages to make it easier to do mass edits.
This may be a bit too specific to our use case, but declaring the base model could help to infer some things. For example, the PEDSnet data model is based on OMOP so we could rely on that for implicit mappings and add a file type to declare the diff from the base model (rather than redefining all the fields).
This simply makes it easy to read and navigate the documents.
The order of precedence is:
- Accept header, e.g. Accept: application/json
- /models/pcornet/v1.html
See #5 for the original discussion, but here are relevant fields:
Field schema file fields: schema/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- field (required) - Name of the field.
- type (required) - Data type of the field.
- length - Describes the maximum length of the value.
- precision - Describes the precision of the value specified in type, typically for a number.
- scale - Describes the scale of the value specified in type, typically for a number.
- default - Defines the default value for the field.

Constraints file fields: constraints/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Table the constraint applies to.
- field - One or more fields the constraint applies to.
- type (required) - Type of constraint.
- name - Suggested name of the constraint.

Indexes file fields: indexes/<table>.csv
- model (required) - Name of the model.
- version (required) - Version of the model.
- table (required) - Name of the table.
- field (required) - One or more fields the index applies to.
- name - Suggested name of the index.
- type - Suggested type of the index.
- unique - If true, denotes a unique index.
- order - For ordered indexes, specifies asc or desc.
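As an illustration of how the schema file fields might combine, the following sketch renders a generic type string from the type, length, precision, and scale columns. The sample rows, the "string" type name, and the rendering rules are assumptions for demonstration; they are not the project's actual type vocabulary or DDL generator.

```python
import csv
import io

# Made-up rows using the schema file columns listed above.
SCHEMA = """model,version,table,field,type,length,precision,scale,default
example,v1,person,person_id,integer,,,,
example,v1,person,name,string,255,,,
example,v1,person,weight,number,,10,2,
"""


def column_type(row):
    """Render a generic type string from one schema row."""
    if row["length"]:
        return f"{row['type']}({row['length']})"
    if row["precision"]:
        args = row["precision"]
        if row["scale"]:
            args += f",{row['scale']}"
        return f"{row['type']}({args})"
    # No size information: emit the bare type name.
    return row["type"]


columns = {
    row["field"]: column_type(row)
    for row in csv.DictReader(io.StringIO(SCHEMA))
}
```

Keeping length separate from precision and scale lets a generator decide per database whether a size is mandatory for a given type.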
This issue applies to v1 and v2:
Depends on #5
- datamodel.json file (#5)
- datamodel.json, read the raw files and generate Markdown
- /pcornet/v1 - HTML view
- /pcornet/v1.md - Renders raw markdown
- /compare/pcornet/v1/pcornet/v2 - Renders a diff between the two specified data models.

See pedsnet.v2.concept.concept_class_id.description for the example:
"The category or class of the concept along both the hierarchical tree as well as different domains within a vocabulary. Examples are “Clinical Drug�, “Ingredient�, “Clinical Finding� etc. "
Log data:
2015-05-22T11:46:02.976121598Z fatal: Not a git repository (or any of the parent directories): .git
2015-05-22T11:46:02.976765909Z time="2015-05-22T11:46:02Z" level=fatal msg="problem pulling repo: exit status 128"
A map file defines the relationship between two models. The intended audience is people who want to learn more about how data models are related. There is not enough detail in these map files for authors of ETL code.
The following fields are being proposed. Even though "source" and "target" imply directionality, there is no inherent directionality in the way fields are mapped. The comment field contains any useful or necessary high-level information about a non-obvious mapping.
source_model
source_version
source_table
source_field
target_model
target_version
target_table
target_field
comment
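Since the mapping has no inherent direction, a consumer might index each row both ways. The sketch below does that for the proposed columns; the sample row is invented and `load_map` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented sample row using the proposed column names.
MAP = """source_model,source_version,source_table,source_field,target_model,target_version,target_table,target_field,comment
model_a,v1,person,dob,model_b,v2,patient,birth_date,Hypothetical example row
"""

PARTS = ("model", "version", "table", "field")


def load_map(text):
    """Index mapped fields in both directions."""
    related = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        source = tuple(row[f"source_{part}"] for part in PARTS)
        target = tuple(row[f"target_{part}"] for part in PARTS)
        related[source].add(target)
        related[target].add(source)
    return related


related = load_map(MAP)
```

Looking up either side of a row returns the other, so which model was labeled "source" does not affect the result.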
/cc @aaron0browne @murphyke
This is the same feature as #51 but for fields.