fairtracks / omnipy


Omnipy is a high level Python library for type-driven data wrangling and scalable workflow orchestration (under development)

Home Page: http://omnipy.readthedocs.io/

License: Apache License 2.0

Python 100.00%
data-wrangling etl metadata prefect research-data fair json ontologies pydantic tabular type-driven data-models orchestration data universal workflow

omnipy's Introduction

Omnipy logo

Omnipy is a high-level Python library for type-driven data wrangling and scalable workflow orchestration.

Conceptual overview of Omnipy

Updates

  • Feb 3, 2023: Documentation of the Omnipy API is still sparse. However, for examples of running code, please check out the omnipy-examples repo.
  • Dec 22, 2022: Omnipy is the new name of the Python package formerly known as uniFAIR. We are very grateful to Dr. Jamin Chen, who graciously transferred ownership of the (mostly unused) "omnipy" name on PyPI to us!

Installation and use

For basic information on installation and use of omnipy, read the INSTALL.md file.

Contribute to omnipy development

For basic information on how to set up a development environment to effectively contribute to the omnipy library, read the CONTRIBUTING.md file.

Overview of Omnipy

Generic functionality

(NOTE: Read the section Transformation on the FAIRtracks.net website for a more detailed and better formatted version of the following description!)

Omnipy is designed primarily to simplify development and deployment of (meta)data transformation processes in the context of FAIRification and data brokering efforts. However, the functionality is very generic and can also be used to support research data (and metadata) transformations in a range of fields and contexts beyond life science, including day-to-day research scenarios:

Data wrangling in day-to-day research

Researchers in life science and other data-centric fields often need to extract, manipulate and integrate data and/or metadata from different sources, such as repositories, databases or flat files. Much research time is spent on trivial and not-so-trivial details of such "data wrangling":

  • reformat data structures
  • clean up errors
  • remove duplicate data
  • map and integrate dataset fields
  • etc.

General software for data wrangling and analysis, such as Pandas, R or Frictionless, is useful, but researchers still regularly end up with hard-to-reuse scripts, often with manual steps.

Step-wise data model transformations

With the Omnipy Python package, researchers can import (meta)data in almost any shape or form: nested JSON; tabular (relational) data; binary streams; or other data structures. Through a step-by-step process, data is continuously parsed and reshaped according to a series of data model transformations.

"Parse, don't validate"

Omnipy follows the principles of "Type-driven design" (read Technical note #2: "Parse, don't validate" on the FAIRtracks.net website for more info). It makes use of cutting-edge Python type hints and the popular pydantic package to "pour" data into precisely defined data models that can range from very general (e.g. "any kind of JSON data", "any kind of tabular data", etc.) to very specific (e.g. "follow the FAIRtracks JSON Schema for track files with the extra restriction of only allowing BigBED files").
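As a rough illustration of such a precisely defined data model, here is a minimal sketch using plain pydantic (v1-style parsing). The model below is hypothetical and much simpler than the actual FAIRtracks JSON Schema; it only illustrates narrowing a general field down to a single allowed value.

from typing import Literal

from pydantic import BaseModel, HttpUrl


class BigBedTrackFile(BaseModel):
    # Hypothetical, heavily simplified stand-in for a FAIRtracks track file record
    name: str
    file_url: HttpUrl
    # Restricting a general "file format" field to one allowed value is the kind of
    # extra restriction mentioned above (only allowing BigBED files):
    file_format: Literal['bigbed']


# Parsing "pours" raw data into the model; data that does not fit fails loudly:
record = BigBedTrackFile.parse_obj({
    'name': 'H3K4me3 peaks',
    'file_url': 'https://example.org/tracks/h3k4me3.bb',
    'file_format': 'bigbed',
})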

Data types as contracts

Omnipy tasks (single steps) or flows (workflows) are defined as transformations from specific input data models to specific output data models. pydantic-based parsing guarantees that the input and output data always follow the data models (i.e. data types). Thus, the data models define "contracts" that simplify reuse of tasks and flows in a mix-and-match fashion.
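A minimal sketch of such a contract, modelled on the remove_columns excerpt shown under "omnipy's Issues" further down this page. The import paths and the dict-like iteration over dataset items are assumptions, not documented API:

from omnipy.compute.task import TaskTemplate  # assumed import path
from omnipy.modules.json.datasets import JsonListOfDictsOfAnyDataset  # assumed import path


@TaskTemplate()
def drop_empty_records(dataset: JsonListOfDictsOfAnyDataset) -> JsonListOfDictsOfAnyDataset:
    # Input and output are declared with the same data model, so any flow that
    # composes this task knows exactly which shape of data it receives and returns.
    output_dataset = JsonListOfDictsOfAnyDataset()
    for item_name, records in dataset.items():  # assumed: dataset iterates like a dict of data items
        output_dataset[item_name] = [record for record in records if record]
    return output_dataset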

Catalog of common processing steps

Omnipy is built from the ground up to be modular. We aim to provide a catalog of commonly useful functionality, including:

  • data import from REST API endpoints, common flat file formats, database dumps, etc.
  • flattening of complex, nested JSON structures
  • standardization of relational tabular data (i.e. removing redundancy)
  • mapping of tabular data between schemas
  • lookup and mapping of ontology terms
  • semi-automatic data cleaning (through e.g. Open Refine)
  • support for common data manipulation software and libraries, such as Pandas, R, Frictionless, etc.

In particular, we will provide a FAIRtracks module that contains data models and processing steps to transform metadata to follow the FAIRtracks standard.

Catalog of commonly useful processing steps, data modules and tool integrations

Refine and apply templates

An Omnipy module typically consists of a set of generic task and flow templates with related data models, (de)serializers, and utility functions. The user can then pick task and flow templates from this extensible, modular catalog, further refine them in the context of a custom, use case-specific flow, and apply them to the desired compute engine to carry out the transformations needed to wrangle data into the required shape.
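As a rough usage sketch of this refine-and-apply pattern, reusing the remove_columns template from the excerpts below. The refine()/apply() calls and their parameters (such as fixed_params, which appears in the code excerpts on this page) are assumptions about the template API rather than documented signatures:

# Hypothetical refinement: fix one parameter value and give the template a new name
strip_internal_ids = remove_columns.refine(
    name='strip_internal_ids',
    fixed_params=dict(column_keys_for_data_items={'samples': ['internal_id']}),
)

# Applying the refined template is assumed to bind it to the configured compute
# engine, turning it into a runnable job:
strip_internal_ids_job = strip_internal_ids.apply()
output_dataset = strip_internal_ids_job(input_dataset)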

Rerun only when needed

When piecing together a custom flow in Omnipy, the user has persistent access to the state of the data at every step of the process. Persistent intermediate data allows for caching of tasks based on the input data and parameters. Hence, if the input data and parameters of a task do not change between runs, the task is not rerun. This is particularly useful for importing from REST API endpoints, as a flow can be continuously rerun without taxing the remote server; data import will only be carried out in the initial iteration or when the REST API signals that the data has changed.

Scale up with external compute resources

In the case of large datasets, the researcher can set up a flow based on a representative sample of the full dataset, at a size suited for running locally on, say, a laptop. Once the flow has produced the correct output on the sample data, the operation can be seamlessly scaled up to the full dataset and sent off in software containers to run on external compute resources, using e.g. Kubernetes. Such offloaded flows can be easily monitored using a web GUI.

Working with Omnipy directly from an Integrated Development Environment (IDE)

Industry-standard ETL backbone

Offloading of flows to external compute resources is provided by the integration of Omnipy with a workflow engine based on the Prefect Python package. Prefect is an industry-leading platform for dataflow automation and orchestration that brings a series of powerful features to Omnipy:

  • Predefined integrations with a range of compute infrastructure solutions
  • Predefined integration with various services to support extraction, transformation, and loading (ETL) of data and metadata
  • Code as workflow ("If Python can write it, Prefect can run it")
  • Dynamic workflows: no predefined Directed Acyclic Graphs (DAGs) needed!
  • Command line and web GUI-based visibility and control of jobs
  • Trigger jobs from external events such as GitHub commits, file uploads, etc.
  • Define continuously running workflows that still respond to external events
  • Run tasks concurrently through support for asynchronous tasks
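To make the "code as workflow" and "dynamic workflows" points above concrete, here is a minimal plain-Prefect sketch (independent of Omnipy) using the standard Prefect 2 flow/task decorators; the task bodies are made up purely for illustration:

from prefect import flow, task


@task
def fetch_numbers() -> list[int]:
    return [1, 2, 3, 4]


@task
def square(x: int) -> int:
    return x * x


@flow
def squares_flow() -> list[int]:
    numbers = fetch_numbers()
    # Ordinary Python control flow decides how many task runs are created,
    # so no DAG needs to be declared up front:
    return [square(n) for n in numbers]


if __name__ == '__main__':
    print(squares_flow())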

Overview of the compute and storage infrastructure integrations that come built in with Prefect

Pluggable workflow engines

It is also possible to integrate Omnipy with other workflow backends by implementing new workflow engine plugins. This is relatively easy to do, as the core architecture of Omnipy allows the user to easily switch the workflow engine at runtime. Omnipy supports both traditional DAG-based and the more avant-garde code-based definition of flows. Two workflow engines are currently supported: local and prefect.

omnipy's People

Contributors

bianchini88, jcheneby-zz, joshbaskaran, pavelvaz, pavelvazquez, sveinugu


Forkers

maanst

omnipy's Issues

Check if we can move to explicit definition of __root__ field at the object level in pydantic 2.0 (when it is released)

# TODO: Check if we can move to explicit definition of __root__ field at the object

        # As long as models are not created concurrently, setting the class members temporarily
        # should not have averse effects
        # TODO: Check if we can move to explicit definition of __root__ field at the object
        #       level in pydantic 2.0 (when it is released)
        if cls == Model:
            cls._depopulate_root_field()

Refactor to remove dependency

Also, add test for not allowing override of fixed_params

# TODO: Refactor to remove dependency

                    param_keys = set(inspect.signature(job).parameters.keys())

                    # TODO: Refactor to remove dependency
                    #       Also, add test for not allowing override of fixed_params
                    if hasattr(job, 'param_key_map'):
                        for key, val in job.param_key_map.items():

Possibly reimplement logic using a state machine, e.g. "transitions" package

# TODO: Possibly reimplement logic using a state machine, e.g. "transitions" package

                 persist_outputs: Optional[PersistOutputsOptions] = None,
                 restore_outputs: Optional[RestoreOutputsOptions] = None):

        # TODO: Possibly reimplement logic using a state machine, e.g. "transitions" package
        if persist_outputs is None:
            self._persist_outputs = PersistOpts.FOLLOW_CONFIG if self._has_job_config else None
        else:

refactor using state machine

# TODO: refactor using state machine

    def __init__(self, *args: object, name: Optional[str] = None, **kwargs: object):
        # super().__init__()

        # TODO: refactor using state machine

        if not isinstance(self, JobTemplate) and not isinstance(self, Job):
            raise JobStateException('JobBase and subclasses not inheriting from JobTemplate '

implement general solution to make sure that one does not modify input data by automatically copying or otherwise. Perhaps setting immutable/frozen option in pydantic, if available?

# TODO: implement general solution to make sure that one does not modify input data by

@TaskTemplate()
def remove_columns(json_dataset: JsonListOfDictsOfAnyDataset,
                   column_keys_for_data_items: Dict[str, List[str]]) -> JsonListOfDictsOfAnyDataset:
    # TODO: implement general solution to make sure that one does not modify input data by
    #       automatically copying or otherwise. Perhaps setting immutable/frozen option in pydantic,
    #       if available?
    #

when parsing config from file is implemented, make sure that the new engine config classes here reparse the config files

# TODO: when parsing config from file is implemented, make sure that the new engine

        return getattr(self.objects, engine_choice)

    def _new_engine_config_if_new_cls(self, engine: IsEngine, engine_choice: EngineChoice) -> None:
        # TODO: when parsing config from file is implemented, make sure that the new engine
        #       config classes here reparse the config files
        engine_config_cls = engine.get_config_cls()
        if self._get_engine_config(engine_choice).__class__ is not engine_config_cls:

Add test for get_model_class

# TODO: Add test for get_model_class

        if not self.__doc__:
            self._set_standard_field_description()

    # TODO: Add test for get_model_class

    def get_model_class(self) -> ModelT:
        return self.__fields__.get(DATA_KEY).type_

    # TODO: Update _raise_no_model_exception() text. Model is now a requirement

    @staticmethod
    def _raise_no_model_exception() -> None:

Update _raise_no_model_exception() text. Model is now a requirement

# TODO: Update _raise_no_model_exception() text. Model is now a requirement

        if not self.__doc__:
            self._set_standard_field_description()

    # TODO: Add test for get_model_class

    def get_model_class(self) -> ModelT:
        return self.__fields__.get(DATA_KEY).type_

    # TODO: Update _raise_no_model_exception() text. Model is now a requirement

    @staticmethod
    def _raise_no_model_exception() -> None:

change model type to params: Union[Type[Any], Tuple[Type[Any], ...]] as in GenericModel

# TODO: change model type to params: Union[Type[Any], Tuple[Type[Any], ...]]

        del cls.__annotations__[ROOT_KEY]

    def __class_getitem__(cls, model: Union[Type[RootT], TypeVar]) -> Union[Type[RootT], TypeVar]:
        # TODO: change model type to params: Union[Type[Any], Tuple[Type[Any], ...]]
        #       as in GenericModel

        # For now, only singular model types are allowed. These lines are needed for

switch from plural to singular for names of modules in omnipy modules

# TODO: switch from plural to singular for names of modules in omnipy modules

                                        JsonModel,
                                        JsonNestedDictsModel)

# TODO: switch from plural to singular for names of modules in omnipy modules
# TODO: call omnipy modules something else than modules, to distinguish from Python modules.
#       Perhaps plugins?

JsonModelT = TypeVar('JsonModelT', bound=Union[JsonModel, JsonListModel, JsonDictModel])

Refactor

        datetime_str = run_time.strftime('%Y_%m_%d-%H_%M_%S')
        return datetime_str

    # TODO: Refactor
    def _deserialize_and_restore_outputs(self) -> Dataset:
        output_path = Path(self.config.persist_data_dir_path)
        if os.path.exists(output_path):

call omnipy modules something else than modules, to distinguish from Python modules. Perhaps plugins?

# TODO: call omnipy modules something else than modules, to distinguish from Python modules.

                                        JsonModel,
                                        JsonNestedDictsModel)

# TODO: switch from plural to singular for names of modules in omnipy modules
# TODO: call omnipy modules something else than modules, to distinguish from Python modules.
#       Perhaps plugins?

JsonModelT = TypeVar('JsonModelT', bound=Union[JsonModel, JsonListModel, JsonDictModel])

Reimplement logic using a state machine, e.g. "transitions" package

# TODO: Reimplement logic using a state machine, e.g. "transitions" package

        return bool(other_job_same_unique_name) and id(other_job_same_unique_name) != id(job)

    def _update_job_registration(self, job: IsJob, state: RunState) -> None:
        # TODO: Reimplement logic using a state machine, e.g. "transitions" package
        if self._other_job_registered_with_same_unique_name(job):
            while self._other_job_registered_with_same_unique_name(job):
                job.regenerate_unique_name()
