
data-models's People

Contributors

boss-sam, bruth, burrowse, dcam2015, lucyshettel, murphyke, willshen99

data-models's Issues

Standardize field names in data model files

The purpose of standardizing these field names is to make it possible to translate the files into a common format for downstream consumption.

Table definition file fields: tables.csv

  • model (required) - Name of the model
  • version (required) - Version of the model
  • table (required) - Name of the table
  • label - Label for the table.
  • description - Description of the table.

Field definition file fields: definitions/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Name of the table.
  • field (required) - Name of the field.
  • label - Corresponds to the field label. This defaults to table + field.
  • description - Corresponds to the description or documentation of the field.
  • type - Describes the data type of the value the field holds.
  • ref_table - The table this field references.
  • ref_field - The field this field references.

Field schema file fields: schema/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Name of the table.
  • field (required) - Name of the field.
  • type (required) - Data type of the field.
  • length - Describes the maximum length of the value.
  • precision - Describes the precision of value specified in type, typically for a number.
  • scale - Describes the scale of the value specified in type, typically for a number.
  • default - Defines the default value for the field.

Constraints file fields: constraints/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Table the constraint applies to.
  • fields - One or more fields the constraint applies to.
  • type (required) - Type of constraint.
  • name - Suggested name of the constraint.

Indexes file fields: indexes/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Name of the table.
  • fields (required) - One or more fields the index applies to.
  • name - Suggested name of the index.
  • type - Suggested type of the index.
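
A minimal sketch of how the "required" markers above could be checked, assuming each file is a CSV with a header row. The names `REQUIRED` and `check_rows` and the sample rows are hypothetical and not part of the repository.

```python
import csv
import io

# Required columns per file type, copied from the lists above.
REQUIRED = {
    "tables": ("model", "version", "table"),
    "definitions": ("model", "version", "table", "field"),
    "schema": ("model", "version", "table", "field", "type"),
    "constraints": ("model", "version", "table", "type"),
    "indexes": ("model", "version", "table", "fields"),
}


def check_rows(file_type, text):
    """Return (line_number, column) pairs for every missing required value."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Data rows start on line 2 because line 1 is the header.
    # Line numbers are only accurate when no value spans multiple lines.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED[file_type]:
            if not (row.get(column) or "").strip():
                problems.append((line_number, column))
    return problems


# Hypothetical sample: the second data row is missing its field name.
sample = (
    "model,version,table,field,type,length\n"
    "pedsnet,v2,person,person_id,integer,\n"
    "pedsnet,v2,person,,date,\n"
)
print(check_rows("schema", sample))  # [(3, 'field')]
```

Optional columns such as length are deliberately ignored, so an empty value there is not reported.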

Proposal for supporting non-table based data models

The motivation is to support other structures such as REDCap data dictionaries (form, section, field), Harvest metadata (category, concept, field), etc. The benefit is that data models even with different structures can be maintained and represented in a similar way. The interesting bit is what will come out of defining the mappings between disparate models.

Support for this requires generalizing the format to support any DAG (directed acyclic graph; e.g. hierarchy) using a variable length path rather than a static table. For example a REDCap form → section → field could be represented this way:

model           version  path                   name
redcap_project  v1                              demographics
redcap_project  v1       demographics           location
redcap_project  v1       demographics/location  city

An empty path denotes that name is a root element in the model (a table in the case of a relational model). The full identifier within a model would be path + name.
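
As a rough illustration of the path + name rule, here is a sketch that builds full identifiers and groups elements by their parent path. It assumes "/" is also the separator between path and name, which the proposal does not state explicitly; `full_identifier` and `children_by_parent` are hypothetical names.

```python
from collections import defaultdict

# The three example rows from the REDCap form/section/field example.
rows = [
    ("redcap_project", "v1", "", "demographics"),
    ("redcap_project", "v1", "demographics", "location"),
    ("redcap_project", "v1", "demographics/location", "city"),
]


def full_identifier(path, name):
    """Join path and name; an empty path means name is a root element."""
    # Assumption: "/" joins path and name, mirroring the separator inside path.
    return f"{path}/{name}" if path else name


def children_by_parent(rows):
    """Map each parent path to the full identifiers of its direct children."""
    children = defaultdict(list)
    for _model, _version, path, name in rows:
        children[path].append(full_identifier(path, name))
    return dict(children)


print(children_by_parent(rows))
```

The empty-string key holds the root elements, which would be the tables in a relational model.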

Here are a few survey questions I would appreciate interested members answering:

  • Do you have non-relational models that you want represented in this format?
  • What is the advantage for you to represent all of your data models in one format?
  • For the data models that are related, do you intend to define mappings between them?
  • What changes to the current views (HTML/Markdown) would you want or expect to see?
  • Does this alternate naming feel more cumbersome or confusing to work with?

I will note that this format could live alongside the current one.

/cc @aaron0browne @murphyke @tjrivera

Integrate renaming file

A renaming file declares name changes across two versions in a model. This can be used in two areas:

  • Link on target version back to source version, e.g. Previously named...
  • Comparison output would show rename rather than add + remove

Add indexes for i2b2_pedsnet

Currently there is just one, committed by accident (it will be removed by a pending PR). The original i2b2 indexes are very heavyweight and largely a waste of disk space and load time as far as PEDSnet is concerned.

i2b2_pedsnet v2 `pk_i2b2` constraint refers to non-existent `i2b2.id` column

This probably came from a Django implementation? In any case, some implementations obviously require a PK, but we shouldn't enforce that at the data model level. The Django model generator will create a surrogate PK if required. So if there's no PK on this table in the original DDL, let's remove the constraint.

Needs change in table definition

In the Oracle section under FULL DDL, the first table has the following:
conditionid VARCHAR2 NOT NULL,
encounterid VARCHAR2,
patid VARCHAR2 NOT NULL,
raw_condition VARCHAR2,
raw_condition_source VARCHAR2,
raw_condition_status VARCHAR2,
raw_condition_type VARCHAR2,

VARCHAR2 without a size specification will result in errors.

Parth

Errors in csv data found through DDL work

Specific issues:

  • pedsnet.v1.person.day_of_birth.type from numder -> number
  • pedsnet.v2.vocabulary.vocabulary_concept_id from '' -> integer
  • pedsnet.v1.indexes.idx_visit_person_id and ...idx_visit_concept_id replaced by ...idx_visit_person_date
  • omop.v5.cohort_definition definitions field cohort_instantiation_date -> cohort_initiation_date
  • omop.v5.concept definitions remove field concept_level
  • omop.v5.fact_relationship schema field domain_concept_id split into ..._1 and ..._2
  • omop.v5.constraints.primary_keys.xpk_cohort_attribute remove in favor of xpk_cohort_definition

Larger issues:

  • Can the JSON object always adhere to the defined structure, even if the underlying lists are empty? (For example, can the pcornet.v3 json have schema.constraints.not_null and foreign_keys set to [] instead of none?)
  • The omop.v4 json endpoint is not returning anything.
  • Several pcornet.v1.schema.vital fields have integer types with a length attribute, which is choking the SQLAlchemy constructors... What is the intended meaning of this attribute? Should these be number types (with the length attribute moved to precision instead)?
  • Downcase all object names in the pcornet models.
  • Several i2b2.v1_7.schema fields share the integer with length problem described above.
  • Is timestamp as used in i2b2.v1_7 significantly different from datetime elsewhere?
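
For the first of the larger issues, one possible approach is to overlay each JSON document on a template so that list-valued keys are always present. This is only a sketch: the template mentions just the two keys named above, the real JSON layout may differ, and `TEMPLATE` and `with_defaults` are hypothetical names.

```python
import copy

# Hypothetical template built from the keys named in the issue.
TEMPLATE = {"schema": {"constraints": {"not_null": [], "foreign_keys": []}}}


def with_defaults(doc, template=TEMPLATE):
    """Return a copy of doc where keys missing or set to None fall back to the template."""
    result = copy.deepcopy(template)
    for key, value in (doc or {}).items():
        if value is None and key in template:
            continue  # keep the template default, e.g. an empty list
        if isinstance(value, dict) and isinstance(template.get(key), dict):
            result[key] = with_defaults(value, template[key])
        else:
            result[key] = copy.deepcopy(value)
    return result


partial = {"schema": {"constraints": {"not_null": None}}, "name": "pcornet"}
print(with_defaults(partial))
```

Keys that are not in the template, such as name here, pass through unchanged.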

PEDSnet CDM V2 notes

  1. visit_occurrence table
    1. visit_start_time is not required
    2. ...date fields should be date type
    3. visit_end_date is not required
    4. place_of_service_source_value should be removed
    5. visit_type_concept_id should be added
  2. procedure_occurrence table
    1. relevant_condition_concept_id should be removed
    2. procedure_date should be date type
  3. provider table
    1. gender_source_concept_id should be included
  4. care_site table
    1. specialty_source_value should be added

Other:

  • relevant_condition_concept_id should not exist in OMOPV5.drug_exposure table

Remove dependence on model definition

Currently, the directory of the datamodel.json file is walked and bound to the declared model. This presumes a structure which is not necessary since each file type can stand alone (each one contains the model and version).

The first step is to replace datamodel.json with a CSV file (see #45), then update the walk algorithm to rely on the model specified in the file.
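
A sketch of what the updated walk could look like, assuming every CSV file carries model and version columns as described elsewhere in this tracker. The function name `collect_rows` and the temporary directory layout are hypothetical.

```python
import csv
import os
import tempfile
from collections import defaultdict


def collect_rows(root):
    """Group every CSV row under root by the (model, version) it declares."""
    grouped = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".csv"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    model, version = row.get("model"), row.get("version")
                    # Rows without both values cannot be bound to a model.
                    if model and version:
                        grouped[(model, version)].append((path, row))
    return dict(grouped)


# Demonstration with a throwaway directory: the file location is arbitrary
# because the model is read from the rows, not from the directory structure.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "anywhere", "schema")
    os.makedirs(folder)
    with open(os.path.join(folder, "person.csv"), "w", newline="", encoding="utf-8") as handle:
        handle.write("model,version,table,field,type\npedsnet,v2,person,person_id,integer\n")
    found = collect_rows(root)

print(sorted(found))  # [('pedsnet', 'v2')]
```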

Add "renaming" files

A renaming file specifies fields that have been renamed between versions which cannot be inferred from a simple diff mechanism.

Proposed fields:

  • model
  • source_version
  • source_table
  • source_field
  • target_version
  • target_table
  • target_field
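
To illustrate why a plain diff is not enough, here is a sketch that compares two versions of a model and uses renaming entries to report a rename instead of an add plus a remove. The function name `diff_fields` and all sample field names are hypothetical.

```python
def diff_fields(source, target, renames):
    """Compare (table, field) pairs of two versions, reporting renames separately.

    renames maps (source_table, source_field) to (target_table, target_field).
    """
    removed = set(source) - set(target)
    added = set(target) - set(source)
    renamed = {}
    for old, new in renames.items():
        # Only treat it as a rename when both halves show up in the plain diff.
        if old in removed and new in added:
            renamed[old] = new
            removed.discard(old)
            added.discard(new)
    return {"added": sorted(added), "removed": sorted(removed), "renamed": renamed}


# Hypothetical versions of one table.
v1 = [("person", "person_id"), ("person", "dob"), ("person", "zip")]
v2 = [("person", "person_id"), ("person", "birth_date"), ("person", "ethnicity")]
renames = {("person", "dob"): ("person", "birth_date")}

result = diff_fields(v1, v2, renames)
print(result)
```

Without the renames argument, dob would be reported as removed and birth_date as added.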

/cc @aaron0browne @murphyke

Add field in models.csv to declare base model

This may be a bit too specific to our use case, but declaring the base model could help to infer some things. For example, the PEDSnet data model is based on OMOP so we could rely on that for implicit mappings and add a file type to declare the diff from the base model (rather than redefining all the fields).

Add schema, constraint, and index files for PEDSnet models

See #5 for the original discussion, but here are the relevant fields:

Field schema file fields: schema/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Name of the table.
  • field (required) - Name of the field.
  • type (required) - Data type of the field.
  • length - Describes the maximum length of the value.
  • precision - Describes the precision of value specified in type, typically for a number.
  • scale - Describes the scale of the value specified in type, typically for a number.
  • default - Defines the default value for the field.

Constraints file fields: constraints/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Table the constraint applies to.
  • field - One or more fields the constraint applies to.
  • type (required) - Type of constraint.
  • name - Suggested name of the constraint.

Indexes file fields: indexes/<table>.csv

  • model (required) - Name of the model.
  • version (required) - Version of the model.
  • table (required) - Name of the table.
  • field (required) - One or more fields the index applies to.
  • name - Suggested name of the index.
  • type - Suggested type of the index.
  • unique - If true, denotes a unique index.
  • order - For ordered indexes, specifies asc or desc.

This issue applies to v1 and v2:

  • Fill in remaining schema fields (default values)
  • Constraints
  • Indexes
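
While filling in the schema files, a check like the following might catch rows where an integer type carries a length, a combination reported elsewhere in this tracker as breaking the SQLAlchemy constructors. It is a sketch only: `integer_rows_with_length` and the sample field names are hypothetical.

```python
import csv
import io


def integer_rows_with_length(text):
    """Return (table, field) for schema rows whose type is integer but length is set."""
    flagged = []
    for row in csv.DictReader(io.StringIO(text)):
        is_integer = (row.get("type") or "").strip().lower() == "integer"
        has_length = bool((row.get("length") or "").strip())
        if is_integer and has_length:
            flagged.append((row["table"], row["field"]))
    return flagged


# Hypothetical rows using the schema file columns listed above.
schema_csv = (
    "model,version,table,field,type,length,precision,scale,default\n"
    "pcornet,v1,vital,example_a,integer,8,,,\n"
    "pcornet,v1,vital,example_b,integer,,,,\n"
    "pcornet,v1,vital,example_c,number,,8,0,\n"
)
print(integer_rows_with_length(schema_csv))  # [('vital', 'example_a')]
```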

Develop website for rendering various views of the data models

Depends on #5

  • Clone a local copy of the data models repo
  • Scan repository for directories with datamodel.json file (#5)
  • Develop view functions that read the datamodel.json, read the raw files and generate Markdown
    • Stats
    • Hierarchy
    • Diff between versions: see this for a starting point.
  • Markdown can be viewed/downloaded directly or rendered as HTML
  • Add endpoints for each data model and views
    • /pcornet/v1 - HTML view
    • /pcornet/v1.md - Renders raw markdown
    • /compare/pcornet/v1/pcornet/v2 - Renders a diff between the two specified data models.
  • Define a GitHub webhook endpoint to receive a payload when the repo changes
  • Write it in Go

Encoding error

See pedsnet.v2.concept.concept_class_id.description for the example:

"The category or class of the concept along both the hierarchical tree as well as different domains within a vocabulary. Examples are “Clinical Drug�, “Ingredient�, “Clinical Finding� etc. "

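The garbled closing quotes in the text above look like the Unicode replacement character (U+FFFD), which usually appears when bytes are decoded with the wrong encoding. Assuming that is the cause, a scan like this could locate other affected descriptions. `find_replacement_chars` and the sample rows are hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def find_replacement_chars(rows, column="description"):
    """Return (table, field, count) for rows whose text contains U+FFFD."""
    hits = []
    for row in rows:
        text = row.get(column) or ""
        if REPLACEMENT in text:
            hits.append((row.get("table"), row.get("field"), text.count(REPLACEMENT)))
    return hits


# Hypothetical rows; the first imitates the damaged description quoted above.
rows = [
    {
        "table": "concept",
        "field": "concept_class_id",
        "description": "Examples are \u201cClinical Drug\ufffd, \u201cIngredient\ufffd etc.",
    },
    {"table": "concept", "field": "concept_id", "description": "A plain description."},
]
print(find_replacement_chars(rows))  # [('concept', 'concept_class_id', 2)]
```

The scan only finds damage that has already been replaced with U+FFFD; it does not recover the original characters.
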
Refactor compare resource to use template

  • Define struct to contain the comparison outcome
  • Add direct links to the model, table, and field definitions
  • Add Summary section at the top to highlight all things that changed

Cannot find path to repository on refresh

Log data:

2015-05-22T11:46:02.976121598Z fatal: Not a git repository (or any of the parent directories): .git
2015-05-22T11:46:02.976765909Z time="2015-05-22T11:46:02Z" level=fatal msg="problem pulling repo: exit status 128" 

Add data model "mapping" files

A map file defines the relationship between two models. The intended audience is people who want to learn more about how data models are related. There is not enough detail in these map files for authors of ETL code.

The following fields are being proposed. Even though "source" and "target" imply directionality, there is no inherent directionality in the way fields are mapped. The comment field contains any useful or necessary high-level information about a non-obvious mapping.

  • source_model
  • source_version
  • source_table
  • source_field
  • target_model
  • target_version
  • target_table
  • target_field
  • comment
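
Because the mapping has no inherent direction, a consumer would probably want to look fields up from either side. A minimal sketch under that assumption follows; `build_index` and the sample mapping row are hypothetical.

```python
from collections import defaultdict

SIDES = ("model", "version", "table", "field")


def build_index(mappings):
    """Index mapping rows so a field can be looked up from either side."""
    index = defaultdict(set)
    for row in mappings:
        source = tuple(row[f"source_{part}"] for part in SIDES)
        target = tuple(row[f"target_{part}"] for part in SIDES)
        # Register both directions because the mapping is symmetric.
        index[source].add(target)
        index[target].add(source)
    return index


# Hypothetical mapping row between two related models.
mappings = [
    {
        "source_model": "pedsnet", "source_version": "v2",
        "source_table": "person", "source_field": "person_id",
        "target_model": "omop", "target_version": "v5",
        "target_table": "person", "target_field": "person_id",
        "comment": "",
    },
]

index = build_index(mappings)
print(index[("omop", "v5", "person", "person_id")])
```

The comment column is left out of the index here; it could be stored alongside each pair if the views need to display it.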

/cc @aaron0browne @murphyke
