karrlab / obj_tables

Tools for creating and reusing high-quality spreadsheets

Home Page: https://objtables.org

License: MIT License

Python 83.63% CSS 1.82% JavaScript 0.39% HTML 13.88% Dockerfile 0.29%

complex-datasets relational-data data-tables schema object-mapping excel csv tsv python

obj_tables's Introduction

ObjTables: Tools for creating and reusing high-quality spreadsheets

ObjTables is a toolkit that makes it easy to use spreadsheets (e.g., XLSX workbooks) to work with complex datasets by combining spreadsheets with rigorous schemas and an object-relational mapping (ORM) system similar to Active Record (Ruby), Django (Python), Doctrine (PHP), Hibernate (Java), Propel (PHP), and SQLAlchemy (Python). With this combination, users can view and edit spreadsheets in programs such as Microsoft Excel, LibreOffice Calc, and OpenOffice Calc, and use schemas and the ObjTables software to validate the syntax and semantics of datasets, compare and merge datasets, and parse datasets into object-oriented data structures for further querying and analysis with languages such as Python.

ObjTables makes it easy to:

Use collections of tables (e.g., an XLSX workbook) to represent complex data consisting of multiple related objects of multiple types (e.g., rows of worksheets), each with multiple attributes (e.g., columns).
Use complex data types (e.g., numbers, strings, numerical arrays, symbolic mathematical expressions, chemical structures, biological sequences, etc.) within tables.
Use programs such as Excel and LibreOffice as a graphical interface for viewing and editing complex datasets.
Use embedded tables and grammars to encode relational information into columns and groups of columns of tables.
Define clear schemas for tabular datasets.
Use schemas to rigorously validate tabular datasets.
Use schemas to parse tabular datasets into data structures for further analysis in languages such as Python.
Compare, merge, split, revision, and migrate tabular datasets.

The ObjTables toolkit includes five components:

Format for schemas for tabular datasets
Numerous data types
Format for tabular datasets
Software tools for parsing, validating, and manipulating tabular datasets
Python package for more flexibility and analysis

Please see https://objtables.org for more information.

Installing the command-line program and Python API

Please see the documentation.

Examples, tutorials, and documentation

Please see the user documentation, developer documentation, and tutorials.

License

ObjTables is released under the MIT license.

Development team

ObjTables was developed by the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA, and the Applied Mathematics and Computer Science, from Genomes to the Environment research unit at the National Research Institute for Agriculture, Food and Environment in Jouy-en-Josas, France.

Questions and comments

Please contact the developers with any questions or comments.


obj_tables's Issues

Support schema migration

Stored Models, such as spreadsheets and delimited files on disk, can become incompatible with updates to model definitions in obj_model. Create a utility that migrates stored models to be compatible with a modified model. E.g., Django has a built-in migration utility, which grew out of the third-party South project.

Fix reading of inherited classes

Multiple tables in single file

Table header: \n!!SBtab ... \n

Example:

!!SBtab   TableID='def_table' SBtabVersion='1.0' TableType='Definition'   TableName='Allowed_types'
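A minimal sketch of how such headers could be used to split one file into several tables. The helper names `parse_sbtab_header` and `split_tables` are hypothetical and are not part of obj_model; the sketch assumes every attribute is written as key='value' with single quotes, as in the example above.

```python
import re

# Matches key='value' pairs such as TableID='def_table' (assumed header syntax)
_ATTR_PATTERN = re.compile(r"(\w+)='([^']*)'")


def parse_sbtab_header(line):
    """Parse a '!!SBtab' declaration line into a dict of its attributes."""
    line = line.strip()
    if not line.startswith('!!SBtab'):
        raise ValueError('not an SBtab header: {!r}'.format(line))
    return dict(_ATTR_PATTERN.findall(line))


def split_tables(text):
    """Split text holding several tables into (header attributes, body lines) pairs."""
    tables = []
    for raw_line in text.splitlines():
        if raw_line.startswith('!!SBtab'):
            # A new header starts a new table
            tables.append((parse_sbtab_header(raw_line), []))
        elif tables and raw_line.strip():
            # Non-blank lines belong to the most recent table
            tables[-1][1].append(raw_line)
    return tables
```

Blank lines between tables are dropped, and any text before the first header is ignored.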

Clean up docstring and documentation problems.

Sphinx generates 21 WARNINGs.

Add get_or_create method

Create RelatedManagers for higher levels of inheritance and redefinition of related_name

create attribute class for sympy

filter output of pprint()

while very handy, pprint() produces too much output. filter it.

latent bug in ModelMeta.validate_attributes

consider this code:

is_attr = False
for base in bases:
    if attr_name in dir(base):
        is_attr = True

This code is risky because dir(base) may contain a string that matches attr_name but is not an attribute inherited from base. E.g., the name of a method could match.
A better approach would be to check directly whether attr_name is an attribute of a base that is an obj_model.core.Model. namespace doesn't have the same problem.
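A sketch of the stricter check, using stand-in classes rather than the real obj_model ones. `Attribute`, `Model`, `Base`, and `is_inherited_attribute` are hypothetical names chosen for illustration; the point is only that a method name appears in dir(base) but is not an Attribute instance.

```python
class Attribute(object):
    """Stand-in for an obj_model attribute (hypothetical)."""


class Model(object):
    """Stand-in for obj_model.core.Model (hypothetical)."""


class Base(Model):
    id = Attribute()

    def name(self):  # a method, not an Attribute, but still listed by dir(Base)
        return 'base'


def is_inherited_attribute(attr_name, bases):
    """Return True only if a Model base defines attr_name as an Attribute."""
    for base in bases:
        if not issubclass(base, Model):
            continue
        # Inspect each class's own namespace instead of relying on dir()
        for cls in base.__mro__:
            if isinstance(vars(cls).get(attr_name), Attribute):
                return True
    return False
```

With this check, 'id' is reported as inherited from Base while 'name' is not, even though both strings are in dir(Base).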

Provide serialization of nested structures to JSON or YAML within Excel

Provide JSON/YAML export up to a maximum depth
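One possible shape for depth-limited export, sketched on plain dicts and lists rather than on obj_model objects. The function name `to_limited_dict` and the '...' placeholder are assumptions made for this illustration.

```python
import json


def to_limited_dict(value, max_depth):
    """Copy nested dicts/lists, replacing anything deeper than max_depth with '...'."""
    if isinstance(value, dict):
        if max_depth <= 0:
            return '...'
        return {key: to_limited_dict(val, max_depth - 1) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        if max_depth <= 0:
            return '...'
        return [to_limited_dict(val, max_depth - 1) for val in value]
    # Scalars are always kept
    return value


nested = {'id': 'gene_1', 'refs': [{'id': 'ref_1', 'authors': [{'name': 'A'}]}]}
limited_json = json.dumps(to_limited_dict(nested, 2), sort_keys=True)
```

Here each container (dict or list) counts as one level, so a depth of 2 keeps the gene and its list of references but truncates each reference.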

More safely evaluate expressions

Potentially use RestrictedPython when there is more documentation to understand how to use it.

Future enhancements to migration

Listed in decreasing order of my subjective assessment of importance

avoid need for obj_model.core.ModelMeta.CHECK_SAME_RELATED_ATTRIBUTE_NAME by comparing related models by value, not name
use Model.revision to label git commit of wc_lang and automatically migrate models to current schema and report inconsistency between a schema and model file
move generate_wc_lang_migrator() to wc_lang
provide a well-documented example
YAML config examples with multiple existing_files and multiple migrated_files
use deepcopy on obj_model.ontology.OntologyAttribute attributes when deepcopy of pronto terms works
associate schema pairs with renaming maps
separately specified default value for attribute
improve performance of test_migration
obtain sort order of sheets in existing model file and replicate in migrated model file
confirm migration works for json, etc.
test sym links in Migrator.parse_module_path
use PARSED_EXPR everywhere applicable

Better error message for set_value()

Hi Jonathan

In def set_value(self, obj, new_value), at ../obj_model/obj_model/core.py:3884: ValueError
an example error looks like:
ValueError: Attribute '<wc_lang.core.RateLaw object at 0x7fbb31678a58>' of '<wc_lang.core.RateLawEquation object at 0x7fbb31678b00>' must be None

I think it would be better if the first value was the string value of the attribute and the 2nd was something like the classname, id and name of the Model.
Thanks
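A sketch of what the suggested message could look like. `RateLaw` here is a stand-in class and `describe` and `must_be_none_message` are hypothetical helpers, not functions in obj_model.

```python
class RateLaw(object):
    """Stand-in for a model with id and name attributes (hypothetical)."""

    def __init__(self, id, name=''):
        self.id = id
        self.name = name


def describe(obj):
    """Summarize an object by class name, id and name instead of its memory address."""
    return "{}(id='{}', name='{}')".format(
        type(obj).__name__, getattr(obj, 'id', None), getattr(obj, 'name', None))


def must_be_none_message(attr_name, obj):
    """Build the error text from the attribute's name and a readable object summary."""
    return "Attribute '{}' of {} must be None".format(attr_name, describe(obj))
```

For example, must_be_none_message('rate_law', RateLaw('rl_1', 'forward')) names the attribute and the object rather than printing two memory addresses.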

Use Excel validation

SlugAttribute --> custom
IntAttribute --> Whole number (min, max)
FloatAttribute --> Decimal (min, max)
OneToOneAttribute --> List
ManyToOneAttribute --> List
DateAttribute --> Date
TimeAttribute --> Time
EnumAttribute --> List

Hide extra rows and columns in Excel output

Add method to get nested objects

Features

traverse object graph in only one direction
filter related objects by attributes

Example:

Get DOIs of all nested references of a gene in a knowledge base
- Get all nested objects
- Filter for objects of type Identifier
- Filter for namespace = DOI
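The DOI example could be sketched as a one-directional graph traversal with filters. The classes below are stand-ins, and `get_nested`, `of_type`, and the `related` list are hypothetical names; a real implementation would follow the model's related attributes instead.

```python
class Node(object):
    """Stand-in for a model instance; `related` holds forward links only."""

    def __init__(self, **attrs):
        self.related = attrs.pop('related', [])
        self.__dict__.update(attrs)


class Gene(Node):
    pass


class Reference(Node):
    pass


class Identifier(Node):
    pass


def get_nested(root, of_type=None, **attr_filters):
    """Collect objects reachable from root, optionally filtered by type and attributes."""
    found, seen, stack = [], {id(root)}, list(root.related)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue  # guard against cycles and shared objects
        seen.add(id(obj))
        stack.extend(obj.related)
        if of_type is not None and not isinstance(obj, of_type):
            continue
        if all(getattr(obj, key, None) == val for key, val in attr_filters.items()):
            found.append(obj)
    return found


doi = Identifier(namespace='DOI', id='10.1000/xyz')
pmid = Identifier(namespace='PubMed', id='12345')
ref = Reference(id='ref_1', related=[doi, pmid])
gene = Gene(id='gene_1', related=[ref])
dois = get_nested(gene, of_type=Identifier, namespace='DOI')
```

Because only `related` is followed, the traversal never walks back from an identifier to the objects that point to it.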

Organize extra_attributes module in bio, chem, and math modules

Create context manager for models

... if they make sense

Add Meta attribute to control which columns are printed

Allow unused columns not to be printed
Check that unprinted columns all have None values

Understand unexpectedly high memory usage

Query functions (both class & instance) for attributes.

Class methods

cls.get_related_attributes() - returns names of all RelatedAttributes of the class
cls.get_scalar_attributes() - returns all LiteralAttributes of the class
cls.get_attributes() - returns names of all Attributes
cls.get_related_name(attribute) - for a given RelatedAttribute of this class, get the related name

Instance methods

self.get_empty_scalar_attributes() - returns LiteralAttributes that are set to None
self.get_nonempty_scalar_attributes() - opposite of above
self.get_empty_related_attributes() - returns RelatedAttributes that are set to None or []
self.get_nonempty_related_attributes() - opposite of above
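A rough sketch of how a few of these query methods might behave, written against stand-in classes. `Attribute`, `LiteralAttribute`, `RelatedAttribute`, and `Model` below are simplified placeholders, and `_attribute_map` is a hypothetical helper; the real obj_model classes are more involved.

```python
class Attribute(object):
    """Stand-in for an obj_model attribute (hypothetical)."""


class LiteralAttribute(Attribute):
    pass


class RelatedAttribute(Attribute):
    def __init__(self, related_name=None):
        self.related_name = related_name


class Model(object):
    @classmethod
    def _attribute_map(cls):
        """Map attribute names to Attribute objects, including inherited ones."""
        attrs = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, Attribute):
                    attrs[name] = value
        return attrs

    @classmethod
    def get_attributes(cls):
        return sorted(cls._attribute_map())

    @classmethod
    def get_related_attributes(cls):
        return sorted(n for n, a in cls._attribute_map().items()
                      if isinstance(a, RelatedAttribute))

    @classmethod
    def get_scalar_attributes(cls):
        return sorted(n for n, a in cls._attribute_map().items()
                      if isinstance(a, LiteralAttribute))

    @classmethod
    def get_related_name(cls, attribute):
        return cls._attribute_map()[attribute].related_name

    def get_empty_scalar_attributes(self):
        # vars(self) holds only values set on the instance
        return [n for n in self.get_scalar_attributes() if vars(self).get(n) is None]

    def get_nonempty_scalar_attributes(self):
        return [n for n in self.get_scalar_attributes() if vars(self).get(n) is not None]

    def get_empty_related_attributes(self):
        return [n for n in self.get_related_attributes()
                if vars(self).get(n) in (None, [])]


class Gene(Model):
    id = LiteralAttribute()
    name = LiteralAttribute()
    references = RelatedAttribute(related_name='genes')
```

In this sketch the class methods return sorted attribute names, which is one possible reading of the list above.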

Implement views

Refactor io to reduce memory

Add ability to filter related attributes based on type

Better error checking on attribute_order

Verify that attribute_order is a tuple (or at least not a string) so that a model definition like this

class A(obj_model.Model):
    id = SlugAttribute()

    class Meta(obj_model.Model.Meta):
        attribute_order = ('id')

does not return an incomprehensible error like this:

                    raise ValueError("`attribute_order` must contain attribute names; '{}' not found in "
>                                    "attributes of {}".format(attr_name, name))
E                   ValueError: `attribute_order` must contain attribute names; 'i' not found in attributes of A
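The root cause is that ('id') is just the string 'id', which is then iterated character by character. A sketch of an up-front check, with `check_attribute_order` as a hypothetical function name:

```python
def check_attribute_order(attribute_order, attribute_names, model_name):
    """Reject strings and other non-sequences before checking individual names."""
    if isinstance(attribute_order, str) or not isinstance(attribute_order, (tuple, list)):
        raise TypeError(
            "`attribute_order` of {} must be a tuple of attribute names, not {}; "
            "did you forget a trailing comma, e.g. ('id',)?".format(
                model_name, type(attribute_order).__name__))
    for attr_name in attribute_order:
        if attr_name not in attribute_names:
            raise ValueError(
                "`attribute_order` must contain attribute names; '{}' not found in "
                "attributes of {}".format(attr_name, model_name))
```

With this check, attribute_order = ('id') fails with a message that points at the missing comma, while ('id',) passes.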

test __prep_expr_for_tokenization

Add a test of ParsedExpression.__prep_expr_for_tokenization() to obj_model/tests/test_expression.py. It's covered because it's always called, but the substitutions it makes need to be tested.

Allow object identifiers that start with digits

ensure that subclasses of MigrationWrapper satisfy its method signatures

Ensure that subclasses of MigrationWrapper satisfy its method signatures by type-checking statically with typing and using mypy.api to verify subclasses.

Correct errors in Excel data validation

Incorrect links to column oriented worksheets.

For example, in worksheet 19 of h1_hesc KB /xl/worksheets/sheet19.xml 'Cell'!$B$0:$XFD$0 should be 'Cell'!$B$1:$XFD$1

<dataValidation type="list" errorStyle="warning" allowBlank="1" showInputMessage="1" showErrorMessage="1" errorTitle="Cell" error="Value must be a value from &quot;Cell:0&quot; or blank." promptTitle="Cell" prompt="Select a value from &quot;Cell:0&quot; or blank." sqref="B1:B2">
  <formula1>'Cell'!$B$0:$XFD$0</formula1>
</dataValidation>

Incorrect links to parent models which are represented by multiple worksheets. For example,
- Reference to "Polymer species types" from "Polymer" column of "Genes" worksheet. This should not reference a worksheet because PolymerSpeciesType is a parent class which is represented by multiple worksheets.
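The first problem above looks like a 0-based index leaking into a 1-based Excel reference, although that is a guess from the example rather than a confirmed diagnosis. A sketch of a helper that converts explicitly, with `header_row_range` as a hypothetical name:

```python
def header_row_range(sheet_name, header_row_index, first_col='B', last_col='XFD'):
    """Build an absolute range for a header row.

    header_row_index is 0-based, as in Python; Excel rows are 1-based.
    """
    if header_row_index < 0:
        raise ValueError('header_row_index must be non-negative')
    excel_row = header_row_index + 1
    return "'{0}'!${1}${3}:${2}${3}".format(sheet_name, first_col, last_col, excel_row)
```

Calling header_row_range('Cell', 0) produces the corrected reference from the example.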

Docstring issue

In the return values of obj_model.io.get_fields(), what's the difference between attrs & sub_attrs? Docs say:

    :obj:`list` of :obj:`Attribute`: attributes in the order they should be printed
    :obj:`list` of tuple of :obj:`Attribute`: attributes in the order they should be printed

Implement context to allow object construction from strings

See example problem Andrew encountered and I debugged.

Unspecific error

A modeler writing a wc_lang spreadsheet might have trouble fixing this error:

   ValueError: The model cannot be loaded because '2_species_1_reaction.xlsx' contains error(s):
  Taxon
    The attributes must be defined in this order:
      Id
      Name
      Rank
      Comments
      References
  Submodel
    The attributes must be defined in this order:
      Id
      Name
      Algorithm
      Compartment
      Biomass reaction
      Objective function
      Comments
      References
  Compartment
    The attributes must be defined in this order:
      Id
      Name
      Initial volume
      Comments
      References
  SpeciesType
    The attributes must be defined in this order:
      Id
      Name
      Structure
      Empirical formula
      Molecular weight
      Charge
      Type
      Comments
      References
  Observable
    The attributes must be defined in this order:
      Id
      Name
      Species
      Observables
      Comments
  Function
    The attributes must be defined in this order:
      Id
      Name
      Expression
      Comments
  StopCondition
    The attributes must be defined in this order:
      Id
      Name
      Expression
      Comments
  Reference
    The attributes must be defined in this order:
      Id
      Name
      Title
      Author
      Editor
      Year
      Type
      Publication
      Publisher
      Series
      Volume
      Number
      Issue
      Edition
      Chapter
      Pages
      Comments

Support self references

Make serialization/deserialization easier with textx or similar

http://www.igordejanovic.net/textX/

check 'return self.default_cleaned_value()'

the line return self.default_cleaned_value() in

        if isinstance(self.default_cleaned_value, (
                six.types.FunctionType, six.types.MethodType, six.types.LambdaType)):
            return self.default_cleaned_value()

looks wrong. default_cleaned_value isn't defined as a function anywhere.

don't have time to investigate now.
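If the intent is to allow a default that is either a plain value or a function producing one, the built-in callable() expresses that more directly than listing function types. This is a sketch under that assumption; the `Attribute` class and `get_default_cleaned_value` method below are hypothetical, not the obj_model code.

```python
class Attribute(object):
    """Stand-in attribute whose default may be a value or a zero-argument callable."""

    def __init__(self, default_cleaned_value=None):
        self.default_cleaned_value = default_cleaned_value

    def get_default_cleaned_value(self):
        default = self.default_cleaned_value
        # Call factories so each use gets a fresh value; return plain values as-is
        return default() if callable(default) else default
```

Note that callable() is also true for classes, so a default such as list would be called and yield a new empty list.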

Change obj_model.expression.ParsedExpression from disallowing illegal token names to only allowing legal token names

ILLEGAL_TOKENS_NAMES --> LEGAL_TOKENS_NAMES
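A sketch of the allow-list approach, assuming the token names are those produced by Python's tokenize module. The contents of `LEGAL_TOKEN_NAMES` and the function `illegal_tokens` are illustrative choices, not the set used by ParsedExpression.

```python
import io
import token
import tokenize

# Illustrative allow-list; anything not named here is rejected
LEGAL_TOKEN_NAMES = frozenset(['NAME', 'NUMBER', 'OP', 'NEWLINE', 'ENDMARKER'])


def illegal_tokens(expression):
    """Return (token name, token text) pairs for tokens outside the allow-list."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(expression).readline):
        name = token.tok_name[tok.type]
        if name not in LEGAL_TOKEN_NAMES:
            found.append((name, tok.string))
    return found
```

The advantage of an allow-list is that token types added in future Python versions are rejected by default instead of being silently accepted.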

Migration enhancements to do later

add migrate commands to H1 & Mp
remove branch from Git metadata
expose the optional locations for migrated files
obtain schema commit metadata from pip installed package
detect erroneous schema changes file annotations
ensure that the Git version of a data file is a sentinel commit
enforce this invariant: each sentinel commit must be identified by one schema changes file

Add feature to define groups of Excel columns

Add attribute to obj_model.Model.Meta to define column groupings and the heading for the group
Export additional row with merged cells that are headings for multiple individual columns
Update Excel import

io.JsonReader.run() ignores ignore_extra_sheets=True

Generates

            decoded = {}
            for json_obj in json_objs:
>               model = models_by_name[json_obj['__type']]
E               KeyError: 'DataRepoMetadata'
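A sketch of how the lookup could honor the option. `decode_objects` is a hypothetical standalone function and not the JsonReader code; it assumes, as in the traceback, that each JSON object names its type under the '__type' key.

```python
def decode_objects(json_objs, models_by_name, ignore_extra_sheets=False):
    """Pair each JSON object with its model, optionally skipping unknown types."""
    decoded = []
    for json_obj in json_objs:
        type_name = json_obj['__type']
        model = models_by_name.get(type_name)
        if model is None:
            if ignore_extra_sheets:
                continue  # unknown type: skip instead of raising KeyError
            raise ValueError("Unknown model type '{}'".format(type_name))
        decoded.append((model, json_obj))
    return decoded
```

When ignore_extra_sheets is False, the unknown type is still reported, but with a message that names it explicitly.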