
mason's Issues

Add STDOUT as a path type

A lot of code can be cleaned up if stdout can be specified as an output path type, and it can be set as the default for local synchronous operator runs (see: #60)
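A minimal sketch of the idea, where "stdout" is just another path scheme (the OutputPath class here is hypothetical, not mason's actual API):

    import sys

    class OutputPath:
        """Hypothetical output path that treats "stdout" as a special scheme."""
        def __init__(self, path_str: str = "stdout"):  # stdout as the local default
            self.path_str = path_str

        def write(self, data: str) -> None:
            if self.path_str == "stdout":
                sys.stdout.write(data)
            else:
                with open(self.path_str, "w") as f:
                    f.write(data)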

Automatic operator installation

Need the ability to reference operators in a location without installing them, by specifying the operator home and just using them implicitly.
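One way this could look, as a sketch using stdlib importlib (load_operator and the layout of the operator home are assumptions):

    import importlib.util
    from pathlib import Path

    def load_operator(operator_home: str, name: str):
        """Hypothetical: import an operator module straight from operator_home,
        skipping any explicit install step."""
        path = Path(operator_home).expanduser() / name / "__init__.py"
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module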

Make Table deferred data type

Make table a shell of what it is now, and delegate schema population to the execution engine (table.populate or something to that effect)
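Roughly (a sketch; Table's actual fields and the engine interface are assumptions):

    class Table:
        """Hypothetical deferred table: a thin reference whose schema is
        populated lazily by the execution engine."""
        def __init__(self, name: str, engine):
            self.name = name
            self.engine = engine
            self.schema = None          # deferred until populate() is called

        def populate(self):
            if self.schema is None:
                self.schema = self.engine.infer_schema(self.name)  # engine-owned
            return self.schema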

Format Operator

Baseline code for format operator irrespective of client implementation.

Build S3 backend plugin for airflow?

As per this issue, there is still a gap between s3 and airflow:

apache/airflow#9937

For now we can address this with documentation in mason describing what we are doing (syncing local dags to s3), but that seems unsavory. Building the plugin would at least give users something to use directly.

Figure out serial representations of Engine Models

I want a way for engine models to be serialized and passed between various programming languages (python -> scala for example) or libraries without having to depend on transpiling (like Jython).

Something like an avro or protobuf representation with client interpreters in each language.

python -> python is easy (just share libraries via pypi), but that handicaps the ability to scale beyond the python ecosystem later on.
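As a stdlib-only illustration of the idea (an avro or protobuf schema would replace this with a typed, versioned contract; the model fields are hypothetical):

    import json
    from dataclasses import asdict, dataclass

    @dataclass
    class TableModel:                       # hypothetical engine model
        name: str
        columns: list

    # a language-neutral payload that a scala (or any other) client can parse
    payload = json.dumps(asdict(TableModel("orders", ["id", "total"])))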

Revamp async mode

Make it so that run and run_async are the same again in operator definitions.

Make async an implicit property of a particular execution engine; maybe merge OperatorResponse and DelayedOperatorResponse (see the sketch below)

Move things such as sampling files in infer into the local execution engine

Start to work on actual asynchronous execution for operator runs
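A sketch of what the merged response type could look like (names and shape are assumptions, not the current API):

    from concurrent.futures import Future

    class OperatorResponse:
        """Hypothetical merged response: wraps either an immediate result or a
        Future, so sync and async engines share one return type."""
        def __init__(self, result=None, future: Future = None):
            self._result = result
            self._future = future

        def resolve(self):
            return self._future.result() if self._future else self._result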

Build out "metadatabase" namespace in operator examples

In the metadatabase namespace the query operator, for example, would allow a query that spans multiple distinct sources, i.e. federated query. For example:

mason operator metadatabase query SELECT * from source1.table join source2.table on ID

where source1 is a database in presto and source2 is an s3 bucket.

Would require doing: #15

Notebook operator

Operator that executes the contents of a jupyter notebook with papermill configuration.
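The operator's run step might reduce to little more than a papermill call (paths and parameters below are illustrative):

    import papermill as pm

    # papermill executes the notebook, injecting parameters into a tagged cell
    pm.execute_notebook(
        "input.ipynb",
        "output.ipynb",
        parameters={"table_name": "orders"},
    )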

Figure out job "fan out" and aggregation

Suppose you wanted to have a workflow that did:

table list

followed by

join (for each item listed)

followed by

summarize (for the joined table)

Figure out how this would be handled in terms of the fan out of the mason job scheduling and workflow handling.

                         +-------------+       +-----------------+
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |                                                                   v
+------------+                                                        +------------+    +--------------+
| table list |                                                        | table join |--->| table dedupe |
+------------+                                                        +------------+    +--------------+
          |                                                                   ^
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
                         +-------------+       +-----------------+
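
In scheduling terms this is a map over the listed tables followed by a single aggregation; a minimal local sketch (the operator callables are stand-ins):

    from concurrent.futures import ThreadPoolExecutor

    def fan_out(list_tables, query, summarize, join, dedupe):
        """Hypothetical shape of the workflow above: fan out per table,
        then aggregate the results."""
        tables = list_tables()
        with ThreadPoolExecutor() as pool:
            summaries = list(pool.map(lambda t: summarize(query(t)), tables))
        return dedupe(join(summaries))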

Add calcite and express operators in terms of it

This is the first step in the following very ambitious 3-step process:

Step 1.
Re-express as many mason operators as possible in calcite. In terms of execution, this means those jobs would all become QueryJob. Validate the calcite SQL, so that serialized jobs send calcite SQL across the wire (as opposed to SparkSQL, HiveQL, or PrestoSQL).
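In which case a QueryJob could be as thin as (a sketch; the real job model may differ):

    from dataclasses import dataclass

    @dataclass
    class QueryJob:
        """Hypothetical: the serialized job carries validated calcite SQL only;
        dialect translation is left to the receiving engine."""
        calcite_sql: str

    job = QueryJob("SELECT id, total FROM orders")  # calcite SQL, not SparkSQL/HiveQL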

Step 2.
In mason-spark use Coral to translate the calcite to SparkSQL (RelNode -> Spark Catalyst). Use this to build mason-hive and mason-presto (and get a lot of operator support for free using coral).

Step 3.
Mason operators and workflows are now a curated collection of calcite SQL pipelines with some additional connective tissue that goes beyond what SQL (really should) express. Look into the effort to add logica (a datalog-based query language) as a view language for coral. This could make it possible for constraints to address that additional connective tissue (types for governance, authentication, io formatting, workflow specification?). Mason operators/workflows would then be expressed completely within Datalog, which could be particularly interesting for expressing things like job fan out and aggregation.

Parameter Aliases

Right now, when using S3 as a metastore, referring to the S3 bucket as "database_name" and the S3 path as "table_name" is consistent but awkward. Would like to have parameter aliases so that they could be called "bucket", "path", etc.

These would likely be client-specific overrides.
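A sketch of how a client-specific alias map could resolve back to canonical parameter names (the mapping itself is illustrative):

    # hypothetical alias map for the S3 metastore client
    S3_ALIASES = {"bucket": "database_name", "path": "table_name"}

    def resolve_params(params: dict, aliases: dict) -> dict:
        """Rewrite aliased keys back to the canonical parameter names."""
        return {aliases.get(k, k): v for k, v in params.items()}

    resolve_params({"bucket": "my-bucket", "path": "data/"}, S3_ALIASES)
    # -> {"database_name": "my-bucket", "table_name": "data/"}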

Change how tables, databases, etc are referenced by connection strings

Right now some operators reference table_name and database_name. I want to consolidate these into a SQLAlchemy-like connection string:

<SOURCE>://<DB_NAME>/<TABLE_NAME>

for example:

athena://database/table
s3://bucket/path...
glue://database/table

etc.

Also think about how to reference multiple such objects (lists for joins, for example).
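For parsing, the stdlib already handles this shape; a minimal sketch:

    from urllib.parse import urlparse

    ref = urlparse("athena://database/table")
    source, db_name = ref.scheme, ref.netloc   # "athena", "database"
    table_name = ref.path.lstrip("/")          # "table"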

Clean up error messages

Right now errors are collected into a single-line string. Either add line breaks, or collect them into an array so that the message resembles a log or a stack trace rather than one continuous error string.
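The array approach might look like (a sketch; mason's actual error types may differ):

    class Errors:
        """Hypothetical: collect errors as a list, render one per line."""
        def __init__(self):
            self.messages: list = []

        def add(self, msg: str) -> None:
            self.messages.append(msg)

        def __str__(self) -> str:
            return "\n".join(self.messages)  # reads like a log, not one long string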

All requirements are version locked?

With all of the requirements pinned with ==, it can be quite hard to load mason into any kind of shared environment. Are all of the dependencies known to only work at the locked versions, or could the version requirements be opened up?
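For example, in requirements.txt (package name illustrative):

    # pinned: hard to co-install in a shared env
    boto3==1.14.7
    # relaxed to a compatible range
    boto3>=1.14,<2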

Allow config_id to be set on any rest api endpoint

Currently you would have to run the REST equivalent of

mason config -s 4
mason operator table format ...

to run an operator by itself with a particular config_id. This creates issues with statefulness. I want to make it possible to pass config_id as a parameter to the rest endpoint to avoid having to do this. Setting the config_id would then just set the "default" config_id. I will rename the rest api endpoint appropriately.
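A hedged sketch of the endpoint shape, flask-style (the route and helper functions are hypothetical, not mason's actual REST layer):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/operator/table/format")  # hypothetical route
    def table_format():
        # fall back to the stored default when config_id is not supplied;
        # default_config_id() and run_operator() are hypothetical helpers
        config_id = request.args.get("config_id", default_config_id())
        return run_operator("table", "format", config_id=config_id)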

Refactor Mason Registry

The registry is currently just local files in ~/.mason. Want to add:

(1) sha-style versioning
(2) more backend options than the local file system, for example a distributed registry
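
For (1), a content hash over the registry files would be the simplest sha-style version; a stdlib sketch:

    import hashlib
    from pathlib import Path

    def registry_sha(root: str = "~/.mason") -> str:
        """Hypothetical: hash registry contents, git-style, for versioning."""
        digest = hashlib.sha1()
        for f in sorted(Path(root).expanduser().rglob("*")):
            if f.is_file():
                digest.update(f.read_bytes())
        return digest.hexdigest()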

Autogenerate aspects of swagger files

Some aspects of swagger.yml definitions for workflows and operators can be generated based on the structure of the workflows and parameters.

This includes the 200 status, etc.
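
A sketch of deriving one path entry from operator metadata (the field names are assumptions about the swagger.yml shape):

    def swagger_path(operator: str, parameters: list) -> dict:
        """Hypothetical: generate a swagger path entry from operator structure."""
        return {
            f"/operator/{operator}": {
                "get": {
                    "parameters": [
                        {"name": p, "in": "query", "type": "string"}
                        for p in parameters
                    ],
                    "responses": {"200": {"description": "Successful run"}},
                }
            }
        }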

Split out validations into their own library

There is a lot of code in both mason and mason-dask focused on validating typed inputs from the api as well as from json-schemas. It might be worth streamlining that code or separating it into its own repo (mason-validations).
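Much of it presumably boils down to calls like this (using the jsonschema library; the schema is illustrative):

    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {"table_name": {"type": "string"}},
        "required": ["table_name"],
    }

    try:
        validate(instance={"table_name": "orders"}, schema=schema)
    except ValidationError as e:
        print(e.message)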

Allow multiple engine types per operator

Currently each operator supports just one engine of each type. Would like to have multiple engines per type, for example:

metastores: [hive, s3]
execution: [spark, dask]

and the ability to use multiple of them within a single operator.

Export Workflow

Export Workflow ==

Query Operator
followed by
Format Operator

Need to think about the asynchronicity of the Query operator. Callback mechanism?
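One callback-based shape for the chaining (the operator callables are stand-ins):

    def export_workflow(query_async, format_op, on_complete):
        """Hypothetical: the query operator runs asynchronously and the format
        operator is chained through its completion callback."""
        query_async(callback=lambda result: on_complete(format_op(result)))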

Dedupe Operator

Details to come, basically dedupe by a single key on a metastore table.
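
On a pandas- or dask-backed execution engine this could reduce to a one-liner (column names illustrative):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 1, 2], "total": [10, 10, 30]})
    deduped = df.drop_duplicates(subset=["id"])  # dedupe by a single key column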
