
mason's Issues

Add STDOUT as a path type

A lot of code can be cleaned up if stdout can be specified as an output path type, and it can be set as the default for local synchronous operator runs (see: #60)
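A minimal sketch of the idea, where "stdout" is just another path scheme (the OutputPath class here is hypothetical, not mason's actual API):

    import sys

    class OutputPath:
        """Hypothetical output path that treats "stdout" as a special scheme."""
        def __init__(self, path_str: str = "stdout"):  # stdout as the local default
            self.path_str = path_str

        def write(self, data: str) -> None:
            if self.path_str == "stdout":
                sys.stdout.write(data)
            else:
                with open(self.path_str, "w") as f:
                    f.write(data)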

Automatic operator installation

Need the ability to reference operators in a location without installing them, by specifying the operator home and just using them implicitly.
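One way this could look, as a sketch using stdlib importlib (load_operator and the layout of the operator home are assumptions):

    import importlib.util
    from pathlib import Path

    def load_operator(operator_home: str, name: str):
        """Hypothetical: import an operator module straight from operator_home,
        skipping any explicit install step."""
        path = Path(operator_home).expanduser() / name / "__init__.py"
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module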

Make Table deferred data type

Make table a shell of what it is now, and delegate schema population to the execution engine (table.populate or something to that effect)
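Roughly (a sketch; Table's actual fields and the engine interface are assumptions):

    class Table:
        """Hypothetical deferred table: a thin reference whose schema is
        populated lazily by the execution engine."""
        def __init__(self, name: str, engine):
            self.name = name
            self.engine = engine
            self.schema = None          # deferred until populate() is called

        def populate(self):
            if self.schema is None:
                self.schema = self.engine.infer_schema(self.name)  # engine-owned
            return self.schema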

Format Operator

Baseline code for format operator irrespective of client implementation.

Build S3 backend plugin for airflow?

As per this issue, there is still a gap between s3 and airflow:

apache/airflow#9937

For now we can address this with documentation in mason describing what we are doing (syncing local dags to s3), but that seems unsavory. Building the plugin would at least give users something to use directly.

Figure out serial representations of Engine Models

I want a way for engine models to be serialized and passed between various programming languages (python -> scala for example) or libraries without having to depend on transpiling (like Jython).

Something like an avro or protobuf representation with client interpreters in each language.

python -> python is easy (just share libraries via pypi), but that handicaps the ability to scale beyond the python ecosystem later on.
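As a stdlib-only illustration of the idea (an avro or protobuf schema would replace this with a typed, versioned contract; the model fields are hypothetical):

    import json
    from dataclasses import asdict, dataclass

    @dataclass
    class TableModel:                       # hypothetical engine model
        name: str
        columns: list

    # a language-neutral payload that a scala (or any other) client can parse
    payload = json.dumps(asdict(TableModel("orders", ["id", "total"])))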

Revamp async mode

Make it so that run and run_async are the same again in operator definitions.

Make async an implicit property of a particular execution engine; maybe merge OperatorResponse and DelayedOperatorResponse (see the sketch below)

Move things such as sampling files in infer into the local execution engine

Start to work on actual asynchronous execution for operator runs
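A sketch of what the merged response type could look like (names and shape are assumptions, not the current API):

    from concurrent.futures import Future

    class OperatorResponse:
        """Hypothetical merged response: wraps either an immediate result or a
        Future, so sync and async engines share one return type."""
        def __init__(self, result=None, future: Future = None):
            self._result = result
            self._future = future

        def resolve(self):
            return self._future.result() if self._future else self._result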

Build out "metadatabase" namespace in operator examples

In the metadatabase namespace the query operator, for example, would allow a query that spans multiple distinct sources, i.e. federated query. For example:

mason operator metadatabase query SELECT * from source1.table join source2.table on ID

where source1 is a database in presto and source2 is an s3 bucket.

Would require doing: #15

Notebook operator

Operator that executes the contents of a jupyter notebook with papermill configuration.
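The operator's run step might reduce to little more than a papermill call (paths and parameters below are illustrative):

    import papermill as pm

    # papermill executes the notebook, injecting parameters into a tagged cell
    pm.execute_notebook(
        "input.ipynb",
        "output.ipynb",
        parameters={"table_name": "orders"},
    )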

Figure out job "fan out" and aggregation

Suppose you wanted to have a workflow that did:

table list

followed by

join (for each item listed)

followed by

summarize (for the joined table)

Figure out how this would be handled in terms of the fan out of the mason job scheduling and workflow handling.

                         +-------------+       +-----------------+
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |                                                                   v
+------------+                                                        +------------+    +--------------+
| table list |                                                        | table join |--->| table dedupe |
+------------+                                                        +------------+    +--------------+
          |                                                                   ^
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
          |              +-------------+       +-----------------+            |
          |              +-------------+       +-----------------+            |
          +------------->| table query |------>| table summarize |------------+
                         +-------------+       +-----------------+
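
In scheduling terms this is a map over the listed tables followed by a single aggregation; a minimal local sketch (the operator callables are stand-ins):

    from concurrent.futures import ThreadPoolExecutor

    def fan_out(list_tables, query, summarize, join, dedupe):
        """Hypothetical shape of the workflow above: fan out per table,
        then aggregate the results."""
        tables = list_tables()
        with ThreadPoolExecutor() as pool:
            summaries = list(pool.map(lambda t: summarize(query(t)), tables))
        return dedupe(join(summaries))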

Add calcite and express operators in terms of it

This is the first step in the following very ambitious 3-step process:

Step 1.
Re-express as many mason operators as possible in calcite. In terms of execution, this means those jobs would all become QueryJob. Validate the calcite SQL, so that serialized jobs send calcite SQL across the wire (as opposed to SparkSQL, HiveQL, or PrestoSQL).
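In which case a QueryJob could be as thin as (a sketch; the real job model may differ):

    from dataclasses import dataclass

    @dataclass
    class QueryJob:
        """Hypothetical: the serialized job carries validated calcite SQL only;
        dialect translation is left to the receiving engine."""
        calcite_sql: str

    job = QueryJob("SELECT id, total FROM orders")  # calcite SQL, not SparkSQL/HiveQL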

Step 2.
In mason-spark use Coral to translate the calcite to SparkSQL (RelNode -> Spark Catalyst). Use this to build mason-hive and mason-presto (and get a lot of operator support for free using coral).

Step 3.
Mason operators and workflows are now a curated collection of calcite SQL pipelines with some additional connective tissue that goes beyond what SQL (really should) express. Look into the effort to add logica (a datalog-based query language) as a view language for coral. This could make it possible for constraints to address that additional connective tissue (types for governance, authentication, io formatting, workflow specification?). Mason operators/workflows would then be expressed completely within Datalog, which could be particularly interesting for expressing things like job fan out and aggregation.

Parameter Aliases

Right now, when using S3 as a metastore, referring to the S3 bucket as "database_name" and the S3 path as "table_name" is consistent but awkward. Would like to have parameter aliases so that they could be called "bucket", "path", etc.

These would likely be client-specific overrides.
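A sketch of how a client-specific alias map could resolve back to canonical parameter names (the mapping itself is illustrative):

    # hypothetical alias map for the S3 metastore client
    S3_ALIASES = {"bucket": "database_name", "path": "table_name"}

    def resolve_params(params: dict, aliases: dict) -> dict:
        """Rewrite aliased keys back to the canonical parameter names."""
        return {aliases.get(k, k): v for k, v in params.items()}

    resolve_params({"bucket": "my-bucket", "path": "data/"}, S3_ALIASES)
    # -> {"database_name": "my-bucket", "table_name": "data/"}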

Change how tables, databases, etc are referenced by connection strings

Right now some operators reference table_name and database_name. I want to consolidate these into a SQLAlchemy-like connection string:

<SOURCE>://<DB_NAME>/<TABLE_NAME>

for example:

athena://database/table
s3://bucket/path...
glue://database/table

etc.

Also think about how to reference multiple such objects (lists for joins, for example).
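For parsing, the stdlib already handles this shape; a minimal sketch:

    from urllib.parse import urlparse

    ref = urlparse("athena://database/table")
    source, db_name = ref.scheme, ref.netloc   # "athena", "database"
    table_name = ref.path.lstrip("/")          # "table"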

Clean up error messages

Right now errors are collected into a single-line string. Either add line breaks, or collect them into an array so that the message resembles a log or a stack trace rather than one continuous error string.
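The array approach might look like (a sketch; mason's actual error types may differ):

    class Errors:
        """Hypothetical: collect errors as a list, render one per line."""
        def __init__(self):
            self.messages: list = []

        def add(self, msg: str) -> None:
            self.messages.append(msg)

        def __str__(self) -> str:
            return "\n".join(self.messages)  # reads like a log, not one long string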

All requirements are version locked?

With all of the requirements pinned with ==, it can be quite hard to load mason into any kind of shared environment. Are all of the dependencies known to only work at the locked versions, or could the version requirements be opened up?
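For example, in requirements.txt (package name illustrative):

    # pinned: hard to co-install in a shared env
    boto3==1.14.7
    # relaxed to a compatible range
    boto3>=1.14,<2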

Allow config_id to be set on any rest api endpoint

Currently you would have to run the REST equivalent of

mason config -s 4
mason operator table format ...

to run an operator by itself with a particular config_id. This creates issues with statefulness. I want to make it possible to pass config_id as a parameter to the rest endpoint to avoid having to do this. Setting the config_id would then just set the "default" config_id. I will rename the rest api endpoint appropriately.
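A hedged sketch of the endpoint shape, flask-style (the route and helper functions are hypothetical, not mason's actual REST layer):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/operator/table/format")  # hypothetical route
    def table_format():
        # fall back to the stored default when config_id is not supplied;
        # default_config_id() and run_operator() are hypothetical helpers
        config_id = request.args.get("config_id", default_config_id())
        return run_operator("table", "format", config_id=config_id)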

Refactor Mason Registry

The registry is currently just local files in ~/.mason. Want to add:

(1) sha-style versioning
(2) more backend options than the local file system, for example a distributed registry
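
For (1), a content hash over the registry files would be the simplest sha-style version; a stdlib sketch:

    import hashlib
    from pathlib import Path

    def registry_sha(root: str = "~/.mason") -> str:
        """Hypothetical: hash registry contents, git-style, for versioning."""
        digest = hashlib.sha1()
        for f in sorted(Path(root).expanduser().rglob("*")):
            if f.is_file():
                digest.update(f.read_bytes())
        return digest.hexdigest()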

Autogenerate aspects of swagger files

Some aspects of swagger.yml definitions for workflows and operators can be generated based on the structure of the workflows and parameters.

This includes the 200 status, etc.
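
A sketch of deriving one path entry from operator metadata (the field names are assumptions about the swagger.yml shape):

    def swagger_path(operator: str, parameters: list) -> dict:
        """Hypothetical: generate a swagger path entry from operator structure."""
        return {
            f"/operator/{operator}": {
                "get": {
                    "parameters": [
                        {"name": p, "in": "query", "type": "string"}
                        for p in parameters
                    ],
                    "responses": {"200": {"description": "Successful run"}},
                }
            }
        }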

Split out validations into their own library

There is a lot of code in both mason and mason-dask focused on validating typed inputs from the api as well as from json-schemas. It might be worth streamlining that code or separating it into its own repo (mason-validations).
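Much of it presumably boils down to calls like this (using the jsonschema library; the schema is illustrative):

    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {"table_name": {"type": "string"}},
        "required": ["table_name"],
    }

    try:
        validate(instance={"table_name": "orders"}, schema=schema)
    except ValidationError as e:
        print(e.message)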

Allow multiple engine types per operator

Currently each operator supports just one engine of each type. Would like to have multiple engines per type, for example:

metastores: [hive, s3]
execution: [spark, dask]

and the ability to use multiple of them within a single operator.

Export Workflow

Export Workflow ==

Query Operator
followed by
Format Operator

Need to think about the asynchronicity of the Query operator. Callback mechanism?
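One callback-based shape for the chaining (the operator callables are stand-ins):

    def export_workflow(query_async, format_op, on_complete):
        """Hypothetical: the query operator runs asynchronously and the format
        operator is chained through its completion callback."""
        query_async(callback=lambda result: on_complete(format_op(result)))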

Dedupe Operator

Details to come, basically dedupe by a single key on a metastore table.
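
On a pandas- or dask-backed execution engine this could reduce to a one-liner (column names illustrative):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 1, 2], "total": [10, 10, 30]})
    deduped = df.drop_duplicates(subset=["id"])  # dedupe by a single key column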
