ploomber / ploomber
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Home Page: https://docs.ploomber.io
License: Apache License 2.0
Both executors have to iterate the DAG until no more tasks can run, abstract this in the DAG object
Line 175 in f9cec77
Should create a shallow copy and run build on it
When a File is created, a .source file keeps a copy of the source code that generated it, but the timestamp is retrieved from the actual file's metadata. If an entire pipeline's product tree is moved, all these timestamps will change, but this should not happen.
Use case: branching off using git
Say your env.yaml looks like this:
path:
  data: /data/project/{{git}}
You are working on the dev branch and you want to experiment with a new feature, so you branch off from dev to new-feature. Since there is no /data/project/new-feature, the DAG will have to run end-to-end again.
But at this point both branches are identical, so you could just copy /data/project/dev to /data/project/new-feature and the DAG should look up-to-date.
like _module
Missing "force"
Line 175 in f9cec77
The "RuntimeError: Kernel didn't respond in 60 seconds" has been resolved in nbconvert 5.6.0; we need to specify this as the minimum dependency since papermill does not do so.
We could make the name argument optional if a reasonable value can be inferred.
In an early ploomber version, if name was None, it was inferred from the product argument; the problem is that if a Product contains tags, its representation is only known after dag.render(), hence this was discarded.
A better solution is to use information from the source. For PythonCallable this is easy: we can just use the function name (its __name__ attribute).
For tasks that accept Placeholders as source, we can use the filename (without extension); to keep things consistent we can make Placeholder.name return this.
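A minimal sketch of the inference rule described above (infer_name is a hypothetical helper, not part of ploomber's API): callables use their __name__, path-like sources use the filename stem.

```python
from pathlib import Path

def infer_name(source):
    """Infer a default task name from its source (hypothetical helper)."""
    if callable(source):
        # PythonCallable case: use the function's own name
        return source.__name__
    # Placeholder/path case: filename without extension
    return Path(str(source)).stem

def clean_data(df):
    pass

infer_name(clean_data)          # 'clean_data'
infer_name('scripts/load.sql')  # 'load'
```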
Current error example:
File "/.../python3.6/site-packages/ploomber/validators/validators.py", line 154, in data_frame_validator
raise AssertionError(str(assert_))
AssertionError: 2 errors found:
* validate_schema: missing columns {'SOME_COLUMN'}.
* validate_schema: wrong dtype for column "SOME_ID". Expected: "int64". Got: "object"
There is little context here; the first line should point to a schema validation error.
Currently, on_render, on_finish and on_failure hooks can only be set at the task level, making it impossible to set the same hook for many tasks at once.
One use case for this is static analysis: an on_render hook could be set for all NotebookRunner tasks to run static analysis on the source code and detect potential issues before executing.
Potential API (similar to how dag.clients work):
from ploomber.tasks import PythonCallable
t = PythonCallable(...)
t.on_render = task_level_on_render
dag.class_on_render[PythonCallable] = class_level_on_render
dag.on_render = dag_level_on_render
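Under that proposed API, hook resolution could work like this (a sketch with minimal stand-in classes; resolve_on_render, class_on_render and the precedence order are all assumptions, nothing here exists in ploomber yet):

```python
class DAG:
    """Minimal stand-in for the proposed dag-level hook API."""
    def __init__(self):
        self.class_on_render = {}  # maps task class -> hook
        self.on_render = None

class Task:
    def __init__(self):
        self.on_render = None

class PythonCallable(Task):
    pass

def resolve_on_render(task, dag):
    # most specific wins: task-level, then class-level, then dag-level
    if task.on_render is not None:
        return task.on_render
    hook = dag.class_on_render.get(type(task))
    if hook is not None:
        return hook
    return dag.on_render
```

An open design question is whether a more specific hook should replace the less specific ones (as sketched here) or whether all applicable hooks should run.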
Placeholders should be render-aware and return appropriate values in their __repr__ method (show rendered values if available). We cannot do this with __str__, since str(Product) is always assumed to return an already-rendered value and should raise an exception otherwise.
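The asymmetry between __repr__ and __str__ could look like this (a simplified sketch: str.format stands in for the jinja2 rendering ploomber actually uses, and the class body is hypothetical):

```python
class Placeholder:
    """Render-aware placeholder sketch (simplified, hypothetical)."""

    def __init__(self, template):
        self._template = template
        self._rendered = None

    def render(self, **params):
        # real implementation uses jinja2; str.format is enough here
        self._rendered = self._template.format(**params)
        return self._rendered

    def __repr__(self):
        # show the rendered value when available, the raw template otherwise
        shown = self._rendered if self._rendered is not None else self._template
        return f'Placeholder({shown!r})'

    def __str__(self):
        # str() must always return an already-rendered value
        if self._rendered is None:
            raise RuntimeError('Placeholder must be rendered before str()')
        return self._rendered
```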
If we pass a templated string to PostgresCopy, it will initialize its source as a Generic Source:
ploomber/src/ploomber/tasks/sql.py
Line 238 in b82cc28
GenericSource is just an implementation of the abc class:
ploomber/src/ploomber/sources/sources.py
Line 243 in b82cc28
Such a class uses Placeholder to support templates; Placeholder validates that there are no missing parameters.
The problem is that PostgresCopy's source parameter should be either a path to a file or a templated string referencing an upstream dependency. In the second case, the '{{product}}' tag won't appear, only an '{{upstream}}' tag. This same problem appears in other Tasks.
Solution: move the parameter validation logic out of Placeholder and into Source, and let the task decide how to validate the source, with a few options (think about use cases to make this as simple as possible).
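A rough sketch of per-task validation options (validate_source_tags is hypothetical; a plain regex stands in for the jinja2-based tag extraction a real implementation would use):

```python
import re

def validate_source_tags(source, required=None, optional=None):
    """Check which tags a templated source references (simplified sketch).

    The task decides which tags are required (e.g. a SQL script needs
    {{product}}) and which are merely allowed (e.g. PostgresCopy's
    source may reference {{upstream}} only).
    """
    found = set(re.findall(r'\{\{\s*(\w+)', source))
    required = set(required or ())
    allowed = required | set(optional or ())
    missing = required - found
    unexpected = found - allowed
    if missing or unexpected:
        raise ValueError(f'missing tags: {missing}, '
                         f'unexpected tags: {unexpected}')
    return found
```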
Each task should use its own copy of params; if more than one task uses the same dict and one of them modifies it, the other tasks will be affected.
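A minimal sketch of the aliasing bug and the proposed fix (the Task class here is a stand-in, not ploomber's):

```python
params = {'path': '/data'}

class Task:
    def __init__(self, params):
        # copy on init so tasks never share mutable state (the fix);
        # storing `params` directly would alias the caller's dict
        self.params = dict(params)

t1, t2 = Task(params), Task(params)
t1.params['path'] = '/other'
assert t2.params['path'] == '/data'  # t2 is unaffected
```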
Input tasks are always up-to-date, but downstream tasks will be declared as data outdated since their upstream Input has no timestamp associated:
ploomber/src/ploomber/products/Product.py
Line 96 in 90285f4
And include information about which content failed and which function is using it
When on_finish is executed, the task has finished (hence metadata and product were already saved); next time it is run, the task will be skipped.
An on_render hook should be executed after rendering each task; this will allow embedding logic that tries to find problems before actually executing the pipeline (and prevents a long-running build from being wasted when errors are found, possibly at runtime). The use case that comes to mind is doing static analysis on Python code.
Example:
# ... very long running code
# ... that never declared variable "a"
# ...
# code will break here, after running for a long time
a + 1
ploomber/src/ploomber/tasks/sql.py
Line 242 in b82cc28
PostgresCopy does not accept templates as sources, but it should: when an upstream task creates a File, the source could reference it dynamically as '{{upstream["another_task"]}}' to resolve the path.
Edit: This is related to issue #2 - maybe the validation logic should be part of the source object itself, with the option to provide more granular validation in _init_source for specific use cases
Placeholders contain their rendered value, so they should not be used in more than one Task. A fix could be to make sure Tasks use a copy
The load_env decorator should only attempt to load the environment when the decorated function is called, not when the decorator is initialized.
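The lazy behavior could look like this (a sketch: _load is a hypothetical stand-in for ploomber's env loading, and the decorator shape is assumed):

```python
import functools

def _load():
    # hypothetical stand-in for reading env.yaml from disk
    return {'name': 'dev'}

def load_env(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # the env is loaded here, when the decorated function is
        # called, not at decoration time
        return fn(_load(), *args, **kwargs)
    return wrapper

@load_env
def task(env, x):
    return env['name'], x
```

Loading at call time means importing a module that uses the decorator never fails just because env.yaml is missing.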
We are using chained exceptions, so the minimum supported version must be at least Python 3.3: https://www.python.org/dev/peps/pep-3134/
Can't remember if there are any other restrictions. We should configure Travis to test using more Python versions.
Avoid passing None as first argument
Handling exceptions inside hooks:
* on_render: TaskRenderError exceptions make the task status change to TaskErroredRender; any other exceptions are logged
* on_finish: similar behavior but with TaskBuildError and status TaskErrored (also AssertionError)
* on_fail: catch all exceptions and log them; if we reach this hook, the task status is already TaskErrored
This is related to #27. If more than one hook is applicable (say dag-level and task-level), we should probably run all of them and raise all tracebacks in a single exception.
How should on_fail be managed? Log all exceptions.
What is the appropriate logging behavior? Send everything to logger.exception (with traceback)? Send to standard error?
Note: Is "Errored" a valid word? https://english.stackexchange.com/questions/3059/is-errored-correct-usage
It should show the full traceback
See #10
ploomber/src/ploomber/helpers.py
Line 108 in f9cec77
__getstate__ (used by the pickle and copy modules) deletes the _connection attribute. Pickling works since it has to go through __init__ (?) again, hence sets _connection to None, but copying fails with: "AttributeError: 'SQLAlchemyClient' object has no attribute '_connection'"
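One way to make both pickling and copying work is to pair __getstate__ with a __setstate__ that restores the attribute (a sketch with a stand-in Client class, not SQLAlchemyClient itself):

```python
import copy
import pickle

class Client:
    """Sketch: object with an unpicklable attribute that stays copyable."""

    def __init__(self):
        self._connection = None  # stands in for a live DB connection

    def __getstate__(self):
        state = self.__dict__.copy()
        # drop the live connection, it cannot be pickled
        del state['_connection']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # restore the attribute so copies do not hit AttributeError
        self._connection = None

c = copy.copy(Client())
assert c._connection is None  # no AttributeError on copies
```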
When Env is initialized with a filename and the file does not exist, the following is raised:
FileNotFoundError: Could not find file "None"
The variable holding the filename is overwritten before the error is raised; that's why the error shows "None".
Instead of relying on files and bash commands, use a db backend, the product only needs to provide a way to compute a URI
Task.params are used in a bunch of places: when rendering tasks, when running them, and they are also accessible via Task.params. One could inadvertently modify them and cause hard-to-debug errors. Once a Task is initialized with params, there is no need to modify them (except internally, when adding the Product in Task.render), so it is safer to make them read-only.
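A minimal sketch of the read-only idea using the standard library's MappingProxyType (the Task class here is a stand-in; the _params attribute name is an assumption):

```python
from types import MappingProxyType

class Task:
    def __init__(self, params):
        # private dict for internal updates (e.g. adding the Product)
        self._params = dict(params)
        # read-only view exposed to user code
        self.params = MappingProxyType(self._params)

t = Task({'key': 'value'})
t._params['product'] = 'out.csv'  # internal updates still possible
# t.params['key'] = 'other'       # would raise TypeError
```

The proxy is a live view, so internal writes through _params remain visible via params without copying.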
Update: make this available via DAGConfigurator via a hot_reload option.
This should be passed to sources which should re-load from disk when .render is called
Hello! Just checking out this new library. There is one line that confuses me. The comment of course says what it does, but the API syntax is very strange to me:
task_add_one.on_finish = on_finish
What is strange to me is that the on_finish function somehow gets attached to the task_add_one and you don't pass a parameter to it. Does this line belong here?
Related to #14
Currently, the only way to add an on_finish hook to a task is:
task.on_finish = some_callable
Even though that API is fine, it should also be possible to add it in the Task's constructor:
Task(..., on_finish=some_callable)
Our current test suite depends on a connection to a PostgreSQL db. Since forks do not have credentials, it will fail.
I think Travis has postgres installed, so I should connect to localhost by default.
See #8 for an example
SQL scripts raise errors if jinja rendering fails (e.g. a needed tag is not passed); this helps catch errors locally and avoids sending ill-defined scripts to the db. Currently, PythonCallable has an empty rendering method, but we could use the inspect module to do some basic checks, such as detecting parameters that are not declared in the function; the same logic applies to NotebookRunner.
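A sketch of the inspect-based check (validate_params is a hypothetical helper; a real render method might also check for missing required parameters):

```python
import inspect

def validate_params(fn, params):
    """Fail at render time if params reference undeclared arguments."""
    declared = set(inspect.signature(fn).parameters)
    extra = set(params) - declared
    if extra:
        raise TypeError(f'{fn.__name__} does not declare: {extra}')

def transform(upstream, product):
    pass

validate_params(transform, {'upstream': 1, 'product': 2})  # ok
```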