giucris / yasp
Yet Another SPark Framework
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
Currently only local files are supported. This can be an issue in cases where you cannot upload a file directly to the driver instance.
Describe the solution you'd like
Read the file from any external location, using a strategy that can be easily extended.
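One way to picture such an extensible strategy is a small registry of loaders keyed by URI scheme. This is only a sketch: `PlanLoader`, `LocalFileLoader` and `PlanLoaders` are hypothetical names, not part of yasp, and only the local case is implemented.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.Try

object PlanLoaders {
  // Hypothetical strategy interface: one implementation per URI scheme.
  trait PlanLoader {
    def scheme: String
    def load(uri: URI): Either[String, String]
  }

  // Reads the file from the local file system (the only case supported today).
  object LocalFileLoader extends PlanLoader {
    val scheme = "file"
    def load(uri: URI): Either[String, String] =
      Try(new String(Files.readAllBytes(Paths.get(uri.getPath)), StandardCharsets.UTF_8))
        .toEither.left.map(e => s"Unable to read $uri: $e")
  }

  // Supporting a new location would mean registering another loader here.
  val loaders: Map[String, PlanLoader] = Map(LocalFileLoader.scheme -> LocalFileLoader)

  def load(location: String): Either[String, String] =
    Try(URI.create(location)).toEither.left.map(e => s"Invalid location: $e").flatMap { uri =>
      // A location without a scheme is treated as a local path.
      val scheme = Option(uri.getScheme).getOrElse("file")
      loaders.get(scheme).toRight(s"Unsupported scheme: $scheme").flatMap(_.load(uri))
    }
}
```

With this shape, an unsupported location fails with an explicit message rather than a file-not-found error on the driver.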
Describe alternatives you've considered
...
Additional context
...
It could be useful to have something to support metrics and check computation
Describe the bug
When yasp is run with more YAML fields than required, or with a typo in a YAML field name, no error is raised: yasp treats the corresponding optional field as None.
Expected behavior
It should raise a parser exception
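The expected check could look roughly like the following. It is a sketch over a plain `Map`, not over yasp's real decoders, and the allowed field names passed in are assumptions for illustration.

```scala
object StrictFields {
  // Rejects a YAML node that carries keys the model does not define,
  // so a typo such as "cahce" fails instead of silently becoming None.
  def validate(node: Map[String, Any], allowed: Set[String]): Either[String, Map[String, Any]] = {
    val unknown = node.keySet.diff(allowed)
    if (unknown.isEmpty) Right(node)
    else Left(s"Unknown field(s): ${unknown.toList.sorted.mkString(", ")}")
  }
}
```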
Is your feature request related to a problem? Please describe.
There is no way to check the result of the CI build stage.
It could be useful to upload the build artifacts as GitHub Actions artifacts.
Describe the solution you'd like
...
Describe alternatives you've considered
...
Additional context
...
Is your feature request related to a problem? Please describe.
Yasp was initially written using Spark 2.4.7 and then migrated to Spark 3.x to improve performance and integration with the table format.
Supporting only Spark 3.x may currently be a limitation.
Describe the solution you'd like
...
Describe alternatives you've considered
...
Additional context
...
Describe the bug
The Roadmap section of README.md still lists support for Apache Iceberg, which has already been added to the main branch.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment (please describe your environment):
Is your feature request related to a problem? Please describe.
Currently there is no way to build yasp in a lightweight mode.
Describe the solution you'd like
As is already done for the Spark version, it could be useful to have an sbt variable that configures the kind of package.
Something like:
sbt -Dyasp.build.type=fat ...
for a fat jar that contains all the required libraries
sbt -Dyasp.build.type=light ...
for a lightweight package that does not include the Spark libraries
Describe alternatives you've considered
Currently there is no useful alternative: anyone who wants to build it in lightweight mode has to clone the repository and change the code.
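A minimal sketch of how the proposed `yasp.build.type` property might be interpreted. `BuildSettings` and its members are hypothetical, and defaulting to `fat` is an assumption; in a real `build.sbt` the resulting string could be used as the configuration of the Spark dependencies.

```scala
object BuildSettings {
  sealed trait BuildType
  case object Fat extends BuildType
  case object Light extends BuildType

  // Accepts the value in any letter case and rejects anything unexpected.
  def parse(value: String): Either[String, BuildType] =
    value.trim.toLowerCase(java.util.Locale.ROOT) match {
      case "fat"   => Right(Fat)
      case "light" => Right(Light)
      case _       => Left(s"Unsupported yasp.build.type: $value")
    }

  // Dependency configuration for the Spark libraries:
  // "provided" keeps them out of the packaged jar.
  def sparkScope(buildType: BuildType): String = buildType match {
    case Fat   => "compile"
    case Light => "provided"
  }

  // Reads -Dyasp.build.type=...; the default value is an assumption.
  def fromSystemProperties(default: String = "fat"): Either[String, BuildType] =
    parse(sys.props.getOrElse("yasp.build.type", default))
}
```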
Additional context
Describe the bug
GitHub Actions has recently deprecated Node.js 12, which might cause issues.
All actions that use Node.js 12 should be migrated to their latest versions.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
...
Environment (please describe your environment):
Is your feature request related to a problem? Please describe.
Currently, the dry-run command is only useful for testing the parsing from YAML to YaspPlan.
However, several issues may remain that the dry-run command does not catch.
For example, Spark SQL queries could be parsed and validated before running the real application.
Describe the solution you'd like
...
Describe alternatives you've considered
...
Additional context
...
In sinks, the id is used differently from the id in sources and processes.
This can cause issues on the user side.
Describe the bug
Currently yasp supports the fat and light packaging but there are no
To Reproduce
Steps to reproduce the behavior:
...
Expected behavior
A clear and concise description of what you expected to happen.
...
Environment (please describe your environment):
Is your feature request related to a problem? Please describe.
The YaspPlan.md file has no section listing the supported sources and destinations.
Describe the solution you'd like
A list of the supported sources and destinations.
Describe alternatives you've considered
...
Additional context
...
Is your feature request related to a problem? Please describe.
Currently, there is no guarantee that processes are parsed from the YAML list into the Scala list in order. This may cause unexpected exceptions.
Yasp should handle dependencies between processes automatically, semi-automatically or manually, in order to avoid this case.
Describe the solution you'd like
An automatic approach should be preferred; one could use Spark's logical plan.
Describe alternatives you've considered
An alternative option is to provide the user with a depends_on keyword, which would however increase the complexity of the yaml file
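To make the depends_on alternative concrete, here is a sketch of ordering processes by their declared dependencies. It illustrates the manual option only, not the preferred approach based on Spark's logical plan, and `Process` is a hypothetical model rather than yasp's own type.

```scala
import scala.annotation.tailrec

object ProcessOrdering {
  // Hypothetical model: a process id plus the ids it reads from.
  final case class Process(id: String, dependsOn: Set[String])

  // Orders processes so that each one runs after everything it depends on.
  // `sources` are ids that are available before any process runs.
  def sort(sources: Set[String], processes: List[Process]): Either[String, List[Process]] = {
    @tailrec
    def loop(pending: List[Process], done: Vector[Process]): Either[String, List[Process]] =
      if (pending.isEmpty) Right(done.toList)
      else {
        val resolved = sources ++ done.map(_.id)
        val (ready, blocked) = pending.partition(_.dependsOn.subsetOf(resolved))
        // No progress means a cycle or a reference to an unknown id.
        if (ready.isEmpty) Left(s"Unresolvable dependencies: ${blocked.map(_.id).mkString(", ")}")
        else loop(blocked, done ++ ready)
      }
    loop(processes, Vector.empty)
  }
}
```

A cycle or a missing id is reported up front instead of surfacing later as an unexpected exception at execution time.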
Additional context
...
Describe the bug
The Yasp parser raises an exception when attempting to parse a cache field in lower or upper case.
To Reproduce
Steps to reproduce the behavior:
.
.
.
sources:
  - id: users
    source:
      format: csv
      options:
        path: examples/example-1/input/
        header: 'true'
    cache: MEMORY
.
.
.
...
${path to the file}
Exception in thread "main" scala.MatchError: MEMORY (of class java.lang.String)
at it.yasp.app.support.DecodersSupport.$anonfun$cacheLayerDecoder$2(DecodersSupport.scala:46)
Expected behavior
Parse without failure
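A case-insensitive decoder would avoid the `MatchError` shown above. The following is a sketch, not the code in `DecodersSupport`: `CacheLayer` and its variants are assumed names (only `MEMORY` appears in the report), and unknown values come back as a `Left` instead of throwing. The same pattern would apply to other enum-like fields.

```scala
import java.util.Locale

object CacheLayerDecoder {
  // Assumed variants, for illustration only.
  sealed trait CacheLayer
  case object Memory extends CacheLayer
  case object Disk extends CacheLayer

  // Normalises the raw value before matching, so "memory", "Memory"
  // and "MEMORY" all decode to the same layer.
  def decode(raw: String): Either[String, CacheLayer] =
    raw.trim.toUpperCase(Locale.ROOT) match {
      case "MEMORY" => Right(Memory)
      case "DISK"   => Right(Disk)
      case _        => Left(s"Unknown cache layer: $raw")
    }
}
```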
Environment (please describe your environment):
Describe the solution you'd like
Currently a YaspPlan takes sources, processes and sinks. It could be more useful and readable if the plan were a flat list that accepts any YaspAction.
For example:
Current:
...
...
plan:
  sources:
    - id: my_csv
      source:
        format: csv
        options:
          header: 'true'
          path: path/to/input/csv/
  processes:
    - id: my_csv_filtered
      process:
        query: >-
          SELECT *
          FROM my_csv
          WHERE id=1
  sinks:
    - id: my_csv_filtered
      dest:
        format: csv
        options:
          header: 'true'
          path: path/to/out/csv/
This could be simplified using a flat list as follows:
...
...
plan:
  - id: my_csv
    source:
      format: csv
      options:
        header: 'true'
        path: path/to/input/csv/
  - id: my_csv_filtered
    process:
      query: >-
        SELECT *
        FROM my_csv
        WHERE id=1
  - id: my_csv_filtered
    dest:
      format: csv
      options:
        header: 'true'
        path: path/to/out/csv/
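On the Scala side, a flat plan could be modelled as a single list over a sealed trait. This is a sketch under assumptions: the case class names and fields below are hypothetical and simplified, not yasp's actual model.

```scala
object FlatPlan {
  // Every entry of the flat list is some kind of action.
  sealed trait YaspAction { def id: String }
  final case class SourceAction(id: String, format: String, options: Map[String, String]) extends YaspAction
  final case class ProcessAction(id: String, query: String) extends YaspAction
  final case class SinkAction(id: String, format: String, options: Map[String, String]) extends YaspAction

  final case class YaspPlan(actions: List[YaspAction]) {
    // The grouped views remain available, preserving the order of the list.
    def sources: List[SourceAction]    = actions.collect { case s: SourceAction => s }
    def processes: List[ProcessAction] = actions.collect { case p: ProcessAction => p }
    def sinks: List[SinkAction]        = actions.collect { case s: SinkAction => s }
  }
}
```

The flat list keeps the order in which the user wrote the actions, while code that still expects the three groups can derive them.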
Describe alternatives you've considered
No alternatives considered
Additional context
Describe the bug
As reported in the article below on the Scala website, sbt 1.4.9 is affected by the vulnerabilities CVE-2021-44228 and CVE-2021-45046.
The vulnerabilities are resolved in log4j 2.17.1, which ships with the latest sbt release, 1.8.0.
Describe the bug
Looking at CONTRIBUTING.md and CODE_OF_CONDUCT.md, it seems that the titles were written in an incorrect Markdown format.
The two files are rendered as:
#CONTRIBUTING #CODE_OF_CONDUCT
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The two files should be rendered with a proper title style.
Environment (please describe your environment):
Is your feature request related to a problem? Please describe.
Currently there is no automatic versioning system. Every time, we have to upgrade the version manually.
Describe the solution you'd like
As we are adopting semver, it could be useful to have an sbt plugin such as sbt-dynver or sbt-git that retrieves the version directly from the tag.
Describe alternatives you've considered
Additional context
...
Describe the solution you'd like
In some cases it can be annoying to write the full format configuration for the most common sources and sinks.
It could be useful to have a shorthand.
For example:
...
- id: my_csv
  source:
    format: csv
    options:
      header: 'true'
      path: path/to/input/csv/
It could be done in the following way:
...
- id: my_csv
  source:
    csv: path/to/input/csv/
    options:
      header: 'true'
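The shorthand could be expanded into the full form before decoding. This sketch works on plain maps; `SourceShorthand`, `Source` and the set of known formats are assumptions for illustration.

```scala
object SourceShorthand {
  final case class Source(format: String, options: Map[String, String])

  // Formats that accept the shorthand; this list is an assumption.
  val knownFormats: Set[String] = Set("csv", "json", "parquet")

  // `fields` holds the scalar keys under `source`, `options` the nested options.
  def expand(fields: Map[String, String], options: Map[String, String]): Either[String, Source] =
    fields.get("format") match {
      // Full form: nothing to expand.
      case Some(format) => Right(Source(format, options))
      case None =>
        fields.filter { case (key, _) => knownFormats(key) }.toList match {
          // Shorthand form: the key is the format, the value is the path.
          case (format, path) :: Nil => Right(Source(format, options + ("path" -> path)))
          case Nil                   => Left("Missing format")
          case many                  => Left(s"Ambiguous formats: ${many.map(_._1).sorted.mkString(", ")}")
        }
    }
}
```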
Is your feature request related to a problem? Please describe.
Sometimes, with a huge ETL flow, it is easy to write YAML that is badly formatted, or simply a configuration that yasp does not accept. It might be useful to add a dry-run command that lets the user execute all the yasp logic without actually starting the stream.
Describe the solution you'd like
Add a dry-run command that will create a YaspExecution with a NoOp service
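The idea can be sketched as follows. The types are hypothetical and heavily simplified compared with a real YaspExecution: the execution walks the plan as usual, while the NoOp service only records what it would have run.

```scala
import scala.collection.mutable.ListBuffer

object DryRun {
  // Hypothetical service interface: executes one action of the plan.
  trait YaspService {
    def execute(actionId: String): Unit
  }

  // NoOp implementation: performs no work, only records the visited ids.
  final class NoOpService extends YaspService {
    private val buffer = ListBuffer.empty[String]
    def execute(actionId: String): Unit = { buffer += actionId; () }
    def visited: List[String] = buffer.toList
  }

  // The execution logic is the same regardless of the service it is given.
  final case class YaspExecution(actionIds: List[String], service: YaspService) {
    def run(): Unit = actionIds.foreach(service.execute)
  }
}
```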
Describe alternatives you've considered
No other solution
Additional context
No other context required
Describe the bug
The Yasp parser raises an exception when attempting to parse a Session.kind written in any form other than 'Local' or 'Distributed'.
To Reproduce
Steps to reproduce the behavior:
session:
  kind: local
...
${path to the file}
Exception in thread "main" scala.MatchError: local (of class java.lang.String)
at it.yasp.app.support.DecodersSupport.$anonfun$sessionTypeDecoder$2(DecodersSupport.scala:46)
Expected behavior
Parse without failure
Environment (please describe your environment):
Is your feature request related to a problem? Please describe.
Currently, it is not possible to execute custom code for any type of operation.
This may be a limitation in use.
Describe the solution you'd like
Probably the best solution is to expose libraries for building yasp plugins for readers, writers and processors. Something like:
....
source:
  classPath: my.plugin.classpath
  options:
    x: y
...
The user should then provide the jar with the proper plugin implementation.
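Loading such a plugin from the configured class name could be done with plain JVM reflection. A minimal sketch, assuming a hypothetical `SourcePlugin` trait and a public no-argument constructor on the plugin class:

```scala
import scala.util.Try

object PluginLoader {
  // Hypothetical contract that a user-provided reader would implement.
  trait SourcePlugin {
    def read(options: Map[String, String]): Seq[String]
  }

  // Instantiates the class named in the YAML and checks that it
  // implements the expected contract.
  def load(className: String): Either[String, SourcePlugin] =
    Try[Any](Class.forName(className).getDeclaredConstructor().newInstance())
      .toEither.left.map(e => s"Unable to load $className: $e")
      .flatMap {
        case plugin: SourcePlugin => Right(plugin)
        case _                    => Left(s"$className does not implement SourcePlugin")
      }
}
```

Returning an `Either` lets a missing jar or a class of the wrong type be reported as a configuration error before any data is read.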
Describe alternatives you've considered
...
Additional context
...
Is your feature request related to a problem? Please describe.
Support for library checkout and direct package checkout should be provided.
Currently, each user has to build their own version on their own machine.
Describe the solution you'd like
Just make it usable as a library or as a distributed package.
Describe alternatives you've considered
Currently the only alternative is to provide the source code and let the user build it.
Additional context
...
Is your feature request related to a problem? Please describe.
Add Yasp site to provide docs
Describe the solution you'd like
Use GitHub Pages with a tool such as Hugo to generate the user docs.
Describe alternatives you've considered
...
Additional context
...
Is your feature request related to a problem? Please describe.
Currently only Delta Lake is supported as a table format.
Given the growing adoption of Iceberg and its high interoperability, not supporting it might be a limitation.
Describe the solution you'd like
...
Describe alternatives you've considered
...
Additional context
...
Describe bug
While trying to merge the following PR: #85
I noticed that none of the GitHub Actions defined in the project workflow were being started.
Expected behaviour
When a PR is submitted, all checks should be performed in order to give the reviewer some feedback on code style, tests and so on.