microsoft / datashaper

Processing engine and React components for constructing configuration-based data transformation and processing pipelines.

Home Page: https://microsoft.github.io/datashaper/

License: MIT License

JavaScript 0.34% TypeScript 64.55% EJS 0.02% Python 9.52% Jupyter Notebook 25.57%

datashaper's Introduction

DataShaper

This project provides a collection of components for executing processing pipelines, particularly oriented to data wrangling. Detailed documentation is provided in subfolders, with an overview of high-level goals and concepts here. Most of the documentation within individual packages is tailored to developers needing to understand how the code is organized and executed. Higher-level concepts for the project as a whole, constructing workflows, etc. are in the root docs folder.

Motivation

There are four primary goals of the project:

  1. Create a shareable client/server schema for describing data processing steps. This is in the schema folder. TypeScript types and JSONSchema generation are in javascript/schema, and published schemas are copied out to schema along with test cases that are executed by JavaScript and Python builds to ensure parity. Stable released versions of DataShaper schemas are hosted on github.io for permanent reference (described below).
  2. Maintain an implementation of a basic client-side wrangling engine (largely based on Arquero). This is in the javascript/workflow folder. This contains a reactive execution engine, along with individual verb implementations.
  3. Maintain a python implementation using common wrangling libraries (e.g., pandas) for backend or data science deployments. This is in the python folder. The execution engine is less complete than in JavaScript, but has complete verb implementations and test suite parity. A fuller-featured generalized pipeline execution engine is forthcoming.
  4. Provide an application framework along with some reusable React components so wrangling operations can be incorporated into web applications easily. This is in the javascript/app-framework and javascript/react folders.

Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder.
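
To make the pipeline concept concrete, here is a minimal sketch of a workflow specification. The overall shape (a list of steps, each with a verb and args) follows the published workflow schema, but the specific verbs, argument shapes, and the $schema URL shown here are illustrative assumptions; consult the published schemas described under Schema management below for the authoritative format.

{
  "$schema": "https://microsoft.github.io/datashaper/schema/workflow/workflow.json",
  "name": "stocks-cleanup",
  "steps": [
    { "verb": "select", "args": { "columns": ["symbol", "date", "price"] } },
    { "verb": "rename", "args": { "columns": { "price": "close_price" } } }
  ]
}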

We currently have seven primary JavaScript packages:

  • app-framework - this provides web application infrastructure for creating data-driven apps with minimal boilerplate.
  • react - this is a set of React components for each verb that you can include in web apps to enable transformation pipeline building.
  • schema - this is a set of core types and associated JSONSchema definitions for formalizing our data package and resource models (including the definitions for table parsing, Codebooks, and Workflows).
  • tables - this is the primary set of functions for loading and parsing data tables, using Arquero under the hood.
  • utilities - this is a set of helpers for working with files, etc., to ease building data wrangling applications.
  • webapp - this is the deployable DataShaper webapp that includes all of the verb components and allows creation, execution, and saving of pipeline JSON files. We also rely on this to demonstrate example code, including a TestApp profile. If you're wondering how to build an app with DataShaper components, start here!
  • workflow - this is the primary engine for pipeline execution. It includes low-level operational primitives to execute a wide variety of relational algebra transformations over Arquero tables.

Also note that each JavaScript package has a generated docs folder containing Markdown API documentation extracted from code comments using api-extractor.

The Python packages are much simpler because there is no associated web application or component code.

  • engine - contains the core verb implementations.
  • workflow.py - this is the primary execution engine that loads and interprets pipelines, and iterates through the steps to produce outputs.
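
As a quick illustration of how these two packages fit together, the sketch below loads a workflow file and runs it over a pandas table. The import path, the Workflow constructor arguments, and the output accessor are assumptions for illustration; check the python/datashaper documentation for the actual API.

import json

import pandas as pd

# Assumed import path and class name -- see python/datashaper for the real API.
from datashaper import Workflow

# Load a table and a workflow specification (hypothetical file names).
stocks = pd.read_csv("stocks.csv")
with open("workflow.json") as f:
    spec = json.load(f)

# Assumed constructor signature: the schema dict plus named input tables.
workflow = Workflow(schema=spec, input_tables={"stocks": stocks})
workflow.run()                 # iterate through the steps to produce outputs
result = workflow.output()     # assumed accessor for the default output table
print(result.head())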

Schema management

We generate JSONSchema for formal project artifacts, including resource definitions and workflow specifications. This allows validation by any consumer or implementor. Schema versions are published on github.io for permanent reference: each variant of a schema is hosted in perpetuity with semantic versioning, and aliases to the most recent (unversioned latest) and major revisions are also published. Direct links to the latest versions of our primary schemas are published with the project documentation on github.io.

Note that for the purposes of pipeline development, the workflow schema is primary. The rest are largely used for package management and table bundling in the web application.
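
Because the schemas are published as plain JSONSchema documents, any consumer can validate a workflow file before executing it. A minimal sketch using the Python jsonschema package follows; the schema URL below is an assumed pattern, so substitute the actual published link.

import json
import urllib.request

import jsonschema

# Assumed URL pattern for the published workflow schema -- replace with the real link.
SCHEMA_URL = "https://microsoft.github.io/datashaper/schema/workflow/workflow.json"

with urllib.request.urlopen(SCHEMA_URL) as response:
    schema = json.load(response)

with open("workflow.json") as f:
    workflow = json.load(f)

# Raises jsonschema.ValidationError if the workflow does not conform.
jsonschema.validate(instance=workflow, schema=schema)
print("workflow.json conforms to the published workflow schema")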

Creating new verbs

For new verbs within the DataShaper toolkit, you must first determine whether JavaScript and Python parity is desired. For operations that should be configurable via a UX, a JavaScript implementation is necessary. However, if the verb is primarily useful for data science workflows and has potentially complicated parameters, a Python-only implementation may be fine. We prefer parity, to reduce confusion and to allow cross-platform execution of any pipelines created with the tool, but we also recognize the value of the Python-based execution engine for configuring data science and ETL workflows that will only ever be run server-side.

Core verbs

Core verbs are built into the toolkit, and should generally have JavaScript and Python parity. Creating these verbs involves the following steps:

  1. Schema definition - this is done by authoring TypeScript types in the javascript/schema folder, which are then generated as JSONSchema during a build step.
  2. Cross-platform tests - these are defined in schema/fixtures, primarily in the workflow folder. Each fixture includes a workflow.json and an expected output CSV file; executors run in both JavaScript and Python to confirm that outputs match the expected table (see the example fixture after this list).
  3. JavaScript implementation - verbs are implemented in javascript/workflow/verbs
  4. Verb UX - individual verb UX components are in javascript/react
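
For step 2, each fixture pairs a workflow with the table both engines must produce. A hypothetical fixture for the strings.upper verb might contain a workflow.json like the following; the argument names mirror the Python signature shown below, but the exact fixture layout is an assumption.

{
  "name": "strings_upper",
  "steps": [
    {
      "verb": "strings.upper",
      "args": { "column": "name", "to": "name_upper" }
    }
  ]
}

Alongside it sits the expected output CSV that both the JavaScript and Python executors must reproduce.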

Python implementation

  1. Verbs are implemented in python/verbs
  2. Create a verb file whose package structure follows the JSON schema. For example, if the verb is defined in the schema as:
"verb": {
    "const": "strings.upper",
    "type": "string"
}

The location of the verb must be in datashaper.engine.verbs.strings.upper.

  3. Create a function that replicates the same functionality as the JavaScript version and use the @verb decorator to make it available to the Workflow engine. The name parameter of the decorator must match the package name defined in the schema. For example:
@verb(name="my_package.upper")
def upper(input: VerbInput, column: str, to: str):
    ...

Important note: if a verb already exists with the same name, you will get a ValueError, so pick a unique name for each verb. For example, attempting to register a new "strings.upper" will raise a ValueError; to create a custom version of this verb, use a distinct name such as "my_package.upper" as in the example above.

Custom verbs

The Python implementation supports custom verbs supplied by your application - this allows you to build arbitrary processing pipelines that contain custom logic and processing steps.

TODO: document custom verb format
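
Until that documentation lands, the sketch below assumes custom verbs register through the same @verb decorator used for core verbs; that is an assumption rather than a documented contract, and the import path is also a guess.

# Hypothetical custom verb registration -- assumes the core-verb @verb decorator
# and VerbInput type are also the extension mechanism; the real custom-verb API
# may differ once it is documented.
from datashaper import VerbInput, verb  # assumed import path


@verb(name="my_app.flag_outliers")  # app-scoped name avoids ValueError collisions
def flag_outliers(input: VerbInput, column: str, to: str, threshold: float = 3.0):
    # Mark rows whose value in `column` deviates more than `threshold` standard
    # deviations from the mean, writing the boolean flag to a new column `to`.
    ...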

Build and test

JavaScript

  • You need node and yarn installed
  • Operate from project root
  • Run: yarn
  • Then: yarn build
  • Run the webapp locally: yarn start

Python

  • You need Python and poetry installed
  • Operate from python/datashaper folder
  • Run: poetry install
  • Then: poetry run poe test

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

datashaper's People

Contributors

alonsoguevara, andresmor-ms, arunsathiya, darthtrevino, dayesouza, dependabot[bot], dreness, dworthen, gaudyb, gbm2494, microsoft-github-operations[bot], microsoftopensource, monik182, natoverse

datashaper's Issues

@data-wrangling-components/react missing dependency @essex/themed-components

After upgrading to the latest version of @data-wrangling-components/react, ^3.0.0, I get the following error when running yarn install.

YN0000: [webapp]: ✘ [ERROR] Could not resolve "@essex/themed-components"
➤ YN0000: [webapp]:
➤ YN0000: [webapp]:     ../../node_modules/@data-wrangling-components/react/dist/Steps/ManageSteps/ManageSteps.js:2:30:
➤ YN0000: [webapp]:       2 │ import { DialogConfirm } from '@essex/themed-components';
➤ YN0000: [webapp]:         ╵                               ~~~~~~~~~~~~~~~~~~~~~~~~~~

I believe this is because @essex/themed-components is listed as a devDependency but used as an actual runtime dependency.

No matching export in @data-wrangling-components/core for import "ParseType"

After upgrading to @data-wrangling-components/core 4.2.1 and @data-wrangling-components/react 3.0.0 I get the following error during build and bundling

 YN0000: [webapp]: ✘ [ERROR] No matching export in "../../node_modules/@data-wrangling-components/core/dist/index.js" for import "ParseType"
➤ YN0000: [webapp]:
➤ YN0000: [webapp]:     ../../node_modules/@data-wrangling-components/react/dist/verbs/Convert/Convert.js:2:9:
➤ YN0000: [webapp]:       2 │ import { ParseType } from '@data-wrangling-components/core';
➤ YN0000: [webapp]:         ╵          ~~~~~~~~~
➤ YN0000: [webapp]:
➤ YN0000: [webapp]: ✘ [ERROR] No matching export in "../../node_modules/@data-wrangling-components/core/dist/index.js" for import "ParseType"
➤ YN0000: [webapp]:
➤ YN0000: [webapp]:     ../../node_modules/@data-wrangling-components/react/dist/verbs/Convert/ConvertDescription.js:2:9:
➤ YN0000: [webapp]:       2 │ import { ParseType } from '@data-wrangling-components/core';
➤ YN0000: [webapp]:         ╵          ~~~~~~~~~

I think the latest version of @dwc/react in npm is depending on features of @dwc/core that have not made it to npm yet.

It appears ParseType is in the latest API of @dwc/core in https://github.com/microsoft/data-wrangling-components/blob/main/javascript/core/docs/core.api.md but ParseType is not present in the released npm module (https://cdn.jsdelivr.net/npm/@data-wrangling-components/[email protected]/dist/types/enums.d.ts).

Update workflow runner to execute verbs concurrently

When the steps are topologically sorted, some verbs may end up being runnable at the same time. We will need to update our status-updating callbacks to account for multiple inflight steps executing at once.
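
A rough sketch of the idea, using Python's standard-library graphlib to surface ready steps and asyncio to run them together; the step names and run_step coroutine are placeholders, not the engine's actual internals.

import asyncio
from graphlib import TopologicalSorter


async def run_step(step: str) -> None:
    # Placeholder for executing a verb and emitting a status update for it.
    print(f"running {step}")
    await asyncio.sleep(0.1)


async def run_workflow(dependencies: dict[str, set[str]]) -> None:
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    while sorter.is_active():
        ready = sorter.get_ready()                                 # steps whose inputs are complete
        await asyncio.gather(*(run_step(step) for step in ready))  # run them concurrently
        for step in ready:
            sorter.done(step)                                      # unlock downstream steps


# Hypothetical dependency graph: both bin and binarize depend on filter.
asyncio.run(run_workflow({"filter": set(), "bin": {"filter"}, "binarize": {"filter"}}))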

Binarize->Bin Steps Result in Inf. Render Loop

With the sample datasets (in react/tests/public/*.csv):

  • Select the stocks table
  • Binarize a numeric column
  • Select the "Bin" operation on that numeric column
  • Observe render errors in console

Allow verbs to define their own input invariants

Currently, if a verb does not receive a default input, the workflow engine will throw an error. This works most of the time, but in some cases it may be preferable to allow verbs to declare their own named input invariants.

@datashaper/utilities - FileWithPath doesn't use path

new FileWithPath(content, filename, path) will only save files into a path if the path is expressed in filename. However, the path argument is required. We should either respect the path argument or specify that the filename is fully qualified.

add `pin` option to verbs

In order to enable debugging and complex output scenarios, we can include a pin verb configuration that will flush the verb's output to disk as parquet.

e.g.

{
  "verb": "binarize",
  "args": {...},
  "pin": {
     "enabled": true,
     "name": "binarized-data-table",
     "format": "parquet"
  }
}

Workflow `output` specification

Workflows should specify named outputs that can be referenced downstream. These outputs should have a name, and a specification on which nodes they are derived from:

e.g.

{
  "output": {
     "nodes": { "source": "graph-node", "output": "nodes" },
     "edges": { "source": "graph-node", "output": "edges" }
  }
}
