yasp's People

Contributors

dsnicola, giucris

Forkers

dsnicola

yasp's Issues

Yasp yaml file from external location

Is your feature request related to a problem? Please describe.
Currently only a local file is supported. This can be a problem in cases where you cannot upload a file directly to the driver instance.

Describe the solution you'd like
Read the file from any external location, using a strategy that can be easily extended.
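One possible reading of "a strategy that can be easily extended" is a small registry keyed by URI scheme. The sketch below is only an illustration under that assumption: PlanReader, LocalFileReader and PlanReaders are hypothetical names that do not exist in yasp, and only the local file case is implemented.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.{Failure, Try}

// Hypothetical abstraction: one reader per URI scheme.
trait PlanReader {
  def scheme: String              // URI scheme handled by this reader
  def read(uri: URI): Try[String] // raw yaml content
}

// Reader for the only location supported today: the local file system.
object LocalFileReader extends PlanReader {
  val scheme = "file"
  def read(uri: URI): Try[String] =
    Try(new String(Files.readAllBytes(Paths.get(uri)), StandardCharsets.UTF_8))
}

// Registry that picks a reader from the scheme of the requested location.
final class PlanReaders(readers: Seq[PlanReader]) {
  private val byScheme: Map[String, PlanReader] =
    readers.map(r => r.scheme -> r).toMap

  def read(location: String): Try[String] =
    Try(URI.create(location)).flatMap { parsed =>
      // A location without a scheme is treated as a local path.
      val uri = if (parsed.getScheme == null) Paths.get(location).toUri else parsed
      byScheme.get(uri.getScheme) match {
        case Some(reader) => reader.read(uri)
        case None =>
          Failure(new IllegalArgumentException(
            s"No reader registered for scheme '${uri.getScheme}'"))
      }
    }
}
```

Supporting another location would then mean adding one more PlanReader to the sequence passed to PlanReaders, without touching the lookup logic.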

Describe alternatives you've considered
...

Additional context
...

More fields to yaml file should raise an exception

Describe the bug
When yasp runs with more yaml fields than required, or with a typo in a yaml field name, no error is raised, because yasp treats the corresponding optional field as None.

Expected behavior
It should raise a parser exception
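A minimal sketch of such a check, independent of whichever yaml library yasp uses: compare the keys found in a section with the keys the model accepts and fail on anything left over. StrictFields and its signature are hypothetical and only illustrate the idea.

```scala
object StrictFields {
  // Returns Left with a message when `actual` contains keys outside `allowed`.
  def validate(
      section: String,
      allowed: Set[String],
      actual: Set[String]
  ): Either[String, Unit] = {
    val unknown = actual.diff(allowed)
    if (unknown.isEmpty) Right(())
    else Left(
      s"Unknown field(s) in '$section': ${unknown.toList.sorted.mkString(", ")}")
  }
}
```

With this in place a misspelled key such as optionz would be reported instead of silently leaving the options field as None.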

Add assembly jar as ci artifacts

Is your feature request related to a problem? Please describe.
There is no way to retrieve the result of the CI build stage.
It would be useful to upload the build artifacts as GitHub Actions artifacts.

Describe the solution you'd like
...

Describe alternatives you've considered
...

Additional context
...

Compatibility for Spark 2.x

Is your feature request related to a problem? Please describe.
Yasp was initially written using Spark 2.4.7 and then migrated to Spark 3.x to improve performance and integration with the table format.
Supporting only Spark 3.x may currently be a limitation.

Describe the solution you'd like
...

Describe alternatives you've considered
...

Additional context
...

Build process for light and fat package

Is your feature request related to a problem? Please describe.
Currently there is no way to build yasp as a lightweight package.

Describe the solution you'd like
As is already done for the Spark version, it would be useful to have an sbt variable that configures the kind of package.
Something like:

  • sbt -Dyasp.build.type=fat ... for a fat jar that contains all the required libraries
  • sbt -Dyasp.build.type=light ... for a lightweight package that does not include the Spark libraries
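The two switches above could be resolved with plain Scala, which is also what an sbt build definition accepts. This is a sketch under assumptions: BuildType is a hypothetical helper, and defaulting to fat when the property is missing is a choice made here, not something the issue states.

```scala
object BuildType {
  sealed trait Kind
  case object Fat extends Kind
  case object Light extends Kind

  // Accepts the two values proposed in the issue, ignoring case and whitespace.
  def parse(value: String): Either[String, Kind] =
    value.trim.toLowerCase match {
      case "fat"   => Right(Fat)
      case "light" => Right(Light)
      case other =>
        Left(s"Unsupported yasp.build.type '$other', expected 'fat' or 'light'")
    }

  // Reads -Dyasp.build.type; the "fat" default is an assumption of this sketch.
  def fromSystemProperty(): Either[String, Kind] =
    parse(sys.props.getOrElse("yasp.build.type", "fat"))

  // A light build would declare the Spark dependencies as "provided",
  // so that the packaging step can leave them out of the jar.
  def sparkScope(kind: Kind): String = kind match {
    case Fat   => "compile"
    case Light => "provided"
  }
}
```

The string returned by sparkScope is meant to be used as the configuration of the Spark dependencies in the build definition.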

Describe alternatives you've considered
Currently there is no useful alternative: anyone who wants a lightweight build has to clone the repository and change the code.

Additional context

Fix GitHub Actions warning on Node.js 12

Describe the bug
GitHub Actions has recently deprecated Node.js 12, which might cause issues.
All actions that use Node.js 12 should be migrated to their latest versions.

To Reproduce
Steps to reproduce the behavior:

  1. Check one of the latest CI runs

Expected behavior
...

Environment (please describe your environment):

  • ..

Added spark sql validation on dry-run

Is your feature request related to a problem? Please describe.
Currently, the dry-run command is only useful for testing parsing from yaml to YaspPlan.
However, several issues may still be present that the dry-run command does not catch.
For example, Spark SQL queries could be parsed and validated before running the real application.

Describe the solution you'd like
...

Describe alternatives you've considered
...

Additional context
...

Improve clarity in sink

In sinks, the id is used differently from the id in sources and processes.
This can confuse users.

Missing documentation regarding yasp light build and fat build

Describe the bug
Currently yasp supports the fat and light packaging, but there is no documentation for them.

To Reproduce
Steps to reproduce the behavior:
...

Expected behavior
A clear and concise description of what you expected to happen.
...

Environment (please describe your environment):

  • ..

Added docs for latest supported source and dest

Is your feature request related to a problem? Please describe.
The YaspPlan.md file has no section listing the supported sources and destinations.

Describe the solution you'd like
A list of the supported sources and destinations.

Describe alternatives you've considered
...

Additional context
...

Automatic process order

Is your feature request related to a problem? Please describe.
Currently, there is no guarantee that processes are parsed from the yaml list into the Scala list in order. This may cause unexpected exceptions.
To avoid this, yasp should handle dependencies between processes automatically, semi-automatically or manually.

Describe the solution you'd like
An automatic approach should be preferred; one could use Spark's logical plan.

Describe alternatives you've considered
An alternative option is to provide the user with a depends_on keyword, which would however increase the complexity of the yaml file
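The depends_on alternative amounts to a topological sort over the declared dependencies. The sketch below illustrates that alternative only, not the preferred automatic approach based on Spark's logical plan; ProcessOrder and the shape of its input are hypothetical.

```scala
import scala.annotation.tailrec

object ProcessOrder {
  // `deps` maps each process id to the ids listed in its (hypothetical) depends_on.
  // Ids that are not keys of `deps`, such as source ids, count as already available.
  def sort(deps: Map[String, Set[String]]): Either[String, List[String]] = {
    @tailrec
    def loop(
        remaining: Map[String, Set[String]],
        done: List[String]
    ): Either[String, List[String]] =
      if (remaining.isEmpty) Right(done.reverse)
      else {
        // A process is ready once none of its dependencies is still pending.
        val ready = remaining.collect {
          case (id, ds) if ds.forall(d => !remaining.contains(d)) => id
        }.toList.sorted
        if (ready.isEmpty)
          Left(s"Cyclic dependency among: ${remaining.keys.toList.sorted.mkString(", ")}")
        else loop(remaining -- ready, ready.reverse ::: done)
      }
    loop(deps, Nil)
  }
}
```

Processes that become ready in the same round are emitted in alphabetical order, so the result is deterministic regardless of the order in which the yaml list was parsed.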

Additional context
...

Fix cache parser for lower case input string

Describe the bug
The Yasp parser raises an exception when attempting to parse a cache field in lower or upper case.

To Reproduce
Steps to reproduce the behavior:

  1. Create a yml file with the following source config:
.
.
.
sources:
    - id: users
      source:
        format: csv
        options:
          path: examples/example-1/input/
          header: 'true'
      cache: MEMORY
.
.
.
...
  2. Run java -jar yasp-app-0.0.1-SNAPSHOT.jar --file ${path to the file}
  3. The following exception is raised:
Exception in thread "main" scala.MatchError: MEMORY (of class java.lang.String)
        at it.yasp.app.support.DecodersSupport.$anonfun$cacheLayerDecoder$2(DecodersSupport.scala:46)

Expected behavior
Parse without failure
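The scala.MatchError in the trace suggests that the decoder matches on exact strings. A hedged sketch of a case-insensitive alternative follows; CacheLayerParser and the layer names other than MEMORY are assumptions and are not taken from the yasp code base.

```scala
import java.util.Locale

object CacheLayerParser {
  sealed trait CacheLayer
  case object Memory extends CacheLayer
  case object Disk extends CacheLayer
  case object MemoryAndDisk extends CacheLayer

  // Normalises the input before matching and returns a Left instead of
  // throwing a MatchError when the value is not recognised.
  def parse(raw: String): Either[String, CacheLayer] =
    raw.trim.toUpperCase(Locale.ROOT) match {
      case "MEMORY"          => Right(Memory)
      case "DISK"            => Right(Disk)
      case "MEMORY_AND_DISK" => Right(MemoryAndDisk)
      case _                 => Left(s"Unknown cache layer: '$raw'")
    }
}
```

Returning an Either keeps the failure inside the decoding step, where it can be turned into a proper parser error.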

Environment (please describe your environment):

  • Windows 11
  • JDK 8

Simplify YaspPlan with flat structure

Describe the solution you'd like
Currently YaspPlan takes sources, processes and sinks. It would be more useful and more readable if the yasp plan were a flat list that accepts any YaspAction.

For example:
Current:

...
...
plan:   
  sources:
    - id: my_csv      
      source:           
        format: csv   
        options:       
          header: 'true'
          path: path/to/input/csv/
  processes:
    - id: my_csv_filtered   
      process:              
        query: >-         
          SELECT * 
          FROM my_csv 
          WHERE id=1
  sinks:                   
    - id: my_csv_filtered  
      dest:                 
        format: csv     
        options:          
          header: 'true'
          path: path/to/out/csv/

This could be simplified using a flat list as follows:

...
...
plan:   
    - id: my_csv      
      source:           
        format: csv   
        options:       
          header: 'true'
          path: path/to/input/csv/
    - id: my_csv_filtered   
      process:              
        query: >-         
          SELECT * 
          FROM my_csv 
          WHERE id=1
    - id: my_csv_filtered  
      dest:                 
        format: csv     
        options:          
          header: 'true'
          path: path/to/out/csv/

Describe alternatives you've considered
No alternatives considered

Additional context

Contributing and Code of Conduct md title doesn't have a proper markdown format

Describe the bug
Looking at CONTRIBUTING.md and CODE_OF_CONDUCT.md, it seems that the titles were written in the wrong md format.

The two files are rendered as:
#CONTRIBUTING #CODE_OF_CONDUCT

To Reproduce
Steps to reproduce the behavior:

  1. Open CONTRIBUTING.md and CODE_OF_CONDUCT.md

Expected behavior
The two files should be rendered with a proper title style

Environment (please describe your environment):

  • ..

Missing some automatic version system

Is your feature request related to a problem? Please describe.
Currently there is no automatic versioning system. Every time, we have to upgrade the version manually.

Describe the solution you'd like
As we are following semver, it would be useful to have an sbt plugin such as sbt-dynver or sbt-git that retrieves the version directly from the tag.

Describe alternatives you've considered

  • Use ci script with an sbt variable configured

Additional context
...

Add some model to reduce complexity for most used source and sink

Describe the solution you'd like
In some cases it can be tedious to write the full format configuration for the most common sources and sinks.
A shorthand for them would be useful.

For example:

...
    - id: my_csv      
      source:           
        format: csv   
        options:       
          header: 'true'
          path: path/to/input/csv/

It could be done in the following way:

...
    - id: my_csv      
      source:           
        csv: path/to/input/csv/
        options:       
          header: 'true' 
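The shorthand could be expanded back into the full form right after parsing. The sketch below assumes a simplified, hypothetical model: Source, SourceShorthand and the set of shorthand formats are illustrative and not taken from yasp.

```scala
object SourceShorthand {
  // Simplified stand-in for the real source model.
  final case class Source(format: String, options: Map[String, String])

  // Formats that may be used as a shorthand key (assumed list).
  private val knownFormats = Set("csv", "json", "parquet")

  // `fields` holds the scalar entries found under `source:`,
  // `options` holds the nested options map.
  def expand(
      fields: Map[String, String],
      options: Map[String, String]
  ): Either[String, Source] =
    fields.get("format") match {
      // Full form: nothing to expand.
      case Some(format) => Right(Source(format, options))
      case None =>
        fields.filter { case (key, _) => knownFormats(key) }.toList match {
          // Shorthand: the key is the format and the value is the path.
          case (format, path) :: Nil =>
            Right(Source(format, options + ("path" -> path)))
          case Nil =>
            Left("Neither 'format' nor a shorthand key was provided")
          case many =>
            Left(s"Ambiguous shorthand keys: ${many.map(_._1).sorted.mkString(", ")}")
        }
    }
}
```

Both spellings of the my_csv example then produce the same Source value, so the rest of the pipeline would not need to know which one the user wrote.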

Added a --dry-run args to run Yasp with NoOp process

Is your feature request related to a problem? Please describe.
With a huge ETL flow, it is easy to write a yml file that is badly formatted, or simply a configuration that yasp does not accept. It might be useful to add a dry-run command that lets the user execute all the yasp logic without necessarily starting the stream.

Describe the solution you'd like
Add a dry-run command that will create a YaspExecution with a NoOp service

Describe alternatives you've considered
No other solution

Additional context
No other context required

Fix Session.kind parser lower case input string

Describe the bug
The Yasp parser raises an exception when attempting to parse a Session.kind other than 'Local' or 'Distributed'.

To Reproduce
Steps to reproduce the behavior:

  1. Create a yml file with the following content:
session:
  kind: local
...
  2. Run java -jar yasp-app-0.0.1-SNAPSHOT.jar --file ${path to the file}
  3. The following exception is raised:
Exception in thread "main" scala.MatchError: local (of class java.lang.String)
        at it.yasp.app.support.DecodersSupport.$anonfun$sessionTypeDecoder$2(DecodersSupport.scala:46)

Expected behavior
Parse without failure

Environment (please describe your environment):

  • Windows 11
  • JDK 8

Added support for custom source process and destination

Is your feature request related to a problem? Please describe.
Currently, it is not possible to execute custom code for any type of operation.
This may be a limitation in use.

Describe the solution you'd like
Probably the best solution is to expose libraries for building yasp plugins for readers, writers and processors. Something like:

....
source:
  classPath: my.plugin.classpath
  options: 
      x: y
...

And then the users should provide the jar with the proper plugin implementation

Describe alternatives you've considered
...

Additional context
...

Deploy yasp library and packages

Is your feature request related to a problem? Please describe.
Support for library checkout and direct package checkout should be provided.
Currently, each user should build his own version on his own machine.

Describe the solution you'd like
Just make it usable as a library or as a distributed package.

Describe alternatives you've considered
Currently the only alternative is to provide the source code and let users build it.

Additional context
...

Yasp docs site on github pages

Is your feature request related to a problem? Please describe.
Add Yasp site to provide docs

Describe the solution you'd like
Use GitHub Pages with a tool like Hugo to generate the user docs

Describe alternatives you've considered
...

Additional context
...

Added Apache Iceberg support

Is your feature request related to a problem? Please describe.
Currently only deltalake is supported as table format.
Given the growing adoption of Iceberg and its high interoperability, not supporting it might be a limitation.

Describe the solution you'd like
...

Describe alternatives you've considered
...

Additional context
...

GitHub Actions do not start for pull requests coming from a fork

Describe bug
While trying to merge the following PR: #85
I noticed that all Github actions defined in the project workflow were not being started.

Expected behaviour
When a PR is submitted all checks should be performed in order to provide the reviewer some feedback on the code style, test and so on.
