ploomber / ploomber
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Home Page: https://docs.ploomber.io
License: Apache License 2.0
Both executors have to iterate the DAG until no more tasks can run, abstract this in the DAG object
Line 175 in f9cec77
Should create a shallow copy and run build on it
When a File is created, a .source file keeps a copy of the source code that generated it, but the timestamp is retrieved from the actual file's metadata. If an entire pipeline's product tree is moved, all these timestamps will change, but this should not happen.
Use case: branching off using git
Say your env.yaml looks like this:
path:
  data: /data/project/{{git}}
You are working on the dev branch and you want to experiment with a new feature, so you branch off from dev to new-feature. Since there is no /data/project/new-feature, the DAG will have to run end-to-end again.
But at this point both branches are identical, so you could just copy /data/project/dev to /data/project/new-feature and the DAG should look up-to-date.
like _module
Missing "force"
Line 175 in f9cec77
The "RuntimeError: Kernel didn't respond in 60 seconds" has been resolved in nbconvert 5.6.0; we need to specify this as the minimum dependency since papermill does not do so.
We could make the name argument optional if a reasonable value can be inferred.
In an early ploomber version, if name was None, it was inferred from the product argument; the problem is that if a Product contains tags, its representation is only known after dag.render(), hence this was discarded.
A better solution is to use information from the source. For PythonCallable this is easy: we can just use the function name (its __name__ attribute).
For tasks that accept Placeholders as source, we can use the filename (without extension); to keep things consistent we can make Placeholder.name return this.
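A minimal sketch of the inference rule described above (infer_name is a hypothetical helper, not part of ploomber's API): callables use their __name__, path-like sources use the filename stem.

```python
from pathlib import Path

def infer_name(source):
    """Infer a default task name from its source (hypothetical helper)."""
    if callable(source):
        # PythonCallable case: use the function's own name
        return source.__name__
    # Placeholder/path case: filename without extension
    return Path(str(source)).stem

def clean_data(df):
    pass

infer_name(clean_data)          # 'clean_data'
infer_name('scripts/load.sql')  # 'load'
```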
Current error example:
File "/.../python3.6/site-packages/ploomber/validators/validators.py", line 154, in data_frame_validator
raise AssertionError(str(assert_))
AssertionError: 2 errors found:
* validate_schema: missing columns {'SOME_COLUMN'}.
* validate_schema: wrong dtype for column "SOME_ID". Expected: "int64". Got: "object"
There is little context here; the first line should point to a schema validation error.
Currently, on_render, on_finish and on_failure hooks can only be set at the task level, making it impossible to set the same hook for many tasks at once.
One use case for this is static analysis: an on_render hook could be set for all NotebookRunner tasks to run static analysis on the source code and detect potential issues before executing.
Potential API (similar to how dag.clients work):
from ploomber.tasks import PythonCallable
t = PythonCallable(...)
t.on_render = task_level_on_render
dag.class_on_render[PythonCallable] = class_level_on_render
dag.on_render = dag_level_on_render
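Under that proposed API, hook resolution could work like this (a sketch with minimal stand-in classes; resolve_on_render, class_on_render and the precedence order are all assumptions, nothing here exists in ploomber yet):

```python
class DAG:
    """Minimal stand-in for the proposed dag-level hook API."""
    def __init__(self):
        self.class_on_render = {}  # maps task class -> hook
        self.on_render = None

class Task:
    def __init__(self):
        self.on_render = None

class PythonCallable(Task):
    pass

def resolve_on_render(task, dag):
    # most specific wins: task-level, then class-level, then dag-level
    if task.on_render is not None:
        return task.on_render
    hook = dag.class_on_render.get(type(task))
    if hook is not None:
        return hook
    return dag.on_render
```

An open design question is whether a more specific hook should replace the less specific ones (as sketched here) or whether all applicable hooks should run.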
Placeholders should be render-aware and return appropriate values in their __repr__ method (show rendered values if available). We cannot do this with __str__, since str(Product) is always assumed to return an already-rendered value and should raise an exception otherwise.
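The asymmetry between __repr__ and __str__ could look like this (a simplified sketch: str.format stands in for the jinja2 rendering ploomber actually uses, and the class body is hypothetical):

```python
class Placeholder:
    """Render-aware placeholder sketch (simplified, hypothetical)."""

    def __init__(self, template):
        self._template = template
        self._rendered = None

    def render(self, **params):
        # real implementation uses jinja2; str.format is enough here
        self._rendered = self._template.format(**params)
        return self._rendered

    def __repr__(self):
        # show the rendered value when available, the raw template otherwise
        shown = self._rendered if self._rendered is not None else self._template
        return f'Placeholder({shown!r})'

    def __str__(self):
        # str() must always return an already-rendered value
        if self._rendered is None:
            raise RuntimeError('Placeholder must be rendered before str()')
        return self._rendered
```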
If we pass a templated string to PostgresCopy, it will initialize its source as a Generic Source:
ploomber/src/ploomber/tasks/sql.py
Line 238 in b82cc28
GenericSource is just an implementation of the abc class:
ploomber/src/ploomber/sources/sources.py
Line 243 in b82cc28
Such a class uses Placeholder to support templates; Placeholder validates that there are no missing parameters.
The problem is that PostgresCopy's source parameter should be either a path to a file or a templated string referencing an upstream dependency. In the second case, the '{{product}}' tag won't appear, only an '{{upstream}}' tag. This same problem appears in other Tasks.
Solution: move the parameter validation logic out of Placeholder and into Source, and let the task decide how to validate the source, with a few options (think about use cases to make this as simple as possible).
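A rough sketch of per-task validation options (validate_source_tags is hypothetical; a plain regex stands in for the jinja2-based tag extraction a real implementation would use):

```python
import re

def validate_source_tags(source, required=None, optional=None):
    """Check which tags a templated source references (simplified sketch).

    The task decides which tags are required (e.g. a SQL script needs
    {{product}}) and which are merely allowed (e.g. PostgresCopy's
    source may reference {{upstream}} only).
    """
    found = set(re.findall(r'\{\{\s*(\w+)', source))
    required = set(required or ())
    allowed = required | set(optional or ())
    missing = required - found
    unexpected = found - allowed
    if missing or unexpected:
        raise ValueError(f'missing tags: {missing}, '
                         f'unexpected tags: {unexpected}')
    return found
```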
Each task should use its own copy of params; if more than one task uses the same dict and one of them modifies it, the other tasks will be affected.
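A minimal sketch of the aliasing bug and the proposed fix (the Task class here is a stand-in, not ploomber's):

```python
params = {'path': '/data'}

class Task:
    def __init__(self, params):
        # copy on init so tasks never share mutable state (the fix);
        # storing `params` directly would alias the caller's dict
        self.params = dict(params)

t1, t2 = Task(params), Task(params)
t1.params['path'] = '/other'
assert t2.params['path'] == '/data'  # t2 is unaffected
```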
Input tasks are always up-to-date, but downstream tasks will be declared as data outdated since their upstream Input has no timestamp associated:
ploomber/src/ploomber/products/Product.py
Line 96 in 90285f4
And include information about which content failed and which function is using it
When on_finish is executed, the task has finished (hence metadata and product were already saved); next time it is run, the task will be skipped.
An on_render hook should be executed after rendering each task; this will allow embedding logic that tries to find problems before actually executing the pipeline (and prevents a long-running build from being wasted when errors are found, possibly at runtime). The use case that comes to mind is doing static analysis on Python code.
Example:
# ... very long running code
# ... that never declared variable "a"
# ...
# code will break here, after running for a long time
a + 1
ploomber/src/ploomber/tasks/sql.py
Line 242 in b82cc28
PostgresCopy does not accept templates as sources, but it should: when an upstream task creates a File, the source could reference it dynamically as '{{upstream["another_task"]}}' to resolve the path.
Edit: This is related to issue #2 - maybe the validation logic should be part of the source object itself, with the option to provide more granular validation in _init_source for specific use cases
Placeholders contain their rendered value, so they should not be used in more than one Task. A fix could be to make sure Tasks use a copy
The load_env decorator should only attempt to load the environment when the decorated function is called, not when the decorator is initialized.
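The lazy behavior could look like this (a sketch: _load is a hypothetical stand-in for ploomber's env loading, and the decorator shape is assumed):

```python
import functools

def _load():
    # hypothetical stand-in for reading env.yaml from disk
    return {'name': 'dev'}

def load_env(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # the env is loaded here, when the decorated function is
        # called, not at decoration time
        return fn(_load(), *args, **kwargs)
    return wrapper

@load_env
def task(env, x):
    return env['name'], x
```

Loading at call time means importing a module that uses the decorator never fails just because env.yaml is missing.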
We are using chained exceptions, so the minimum supported version must be at least Python 3.3: https://www.python.org/dev/peps/pep-3134/
Can't remember if there are any other restrictions. We should configure Travis to test using more Python versions.
Avoid passing None as first argument
Handling exceptions inside hooks:
* on_render: TaskRenderError exceptions make the task status change to TaskErroredRender; any other exceptions are logged
* on_finish: similar behavior but with TaskBuildError and status TaskErrored (also AssertionError)
* on_fail: catch all exceptions and log them; if we reach this hook, the task status is already TaskErrored
This is related to #27. If more than one hook is applicable (say dag-level and task-level), we should probably run all of them and raise all tracebacks in a single exception.
How should on_fail be managed? Log all exceptions.
What is the appropriate logging behavior? Send everything to logger.exception (with traceback)? Send to standard error?
Note: Is "Errored" a valid word? https://english.stackexchange.com/questions/3059/is-errored-correct-usage
It should show the full traceback
See #10
ploomber/src/ploomber/helpers.py
Line 108 in f9cec77
__getstate__ (used by the pickle and copy modules) deletes the _connection attribute. Pickling works since it has to go through __init__ (?) again, hence sets _connection to None, but copying fails with: "AttributeError: 'SQLAlchemyClient' object has no attribute '_connection'"
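One way to make both pickling and copying work is to pair __getstate__ with a __setstate__ that restores the attribute (a sketch with a stand-in Client class, not SQLAlchemyClient itself):

```python
import copy
import pickle

class Client:
    """Sketch: object with an unpicklable attribute that stays copyable."""

    def __init__(self):
        self._connection = None  # stands in for a live DB connection

    def __getstate__(self):
        state = self.__dict__.copy()
        # drop the live connection, it cannot be pickled
        del state['_connection']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # restore the attribute so copies do not hit AttributeError
        self._connection = None

c = copy.copy(Client())
assert c._connection is None  # no AttributeError on copies
```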
When Env is initialized with a filename and the file does not exist, the following is raised:
FileNotFoundError: Could not find file "None"
The variable holding the filename is overwritten before the error is raised; that's why the error shows "None".
Instead of relying on files and bash commands, use a db backend, the product only needs to provide a way to compute a URI
Task.params are used in a bunch of places: when rendering tasks, when running them, and they are also accessible via Task.params. One could inadvertently modify them and cause hard-to-debug errors. Once a Task is initialized with params, there is no need to modify them (except internally, when adding the Product in Task.render), so it is safer to make them read-only.
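A minimal sketch of the read-only idea using the standard library's MappingProxyType (the Task class here is a stand-in; the _params attribute name is an assumption):

```python
from types import MappingProxyType

class Task:
    def __init__(self, params):
        # private dict for internal updates (e.g. adding the Product)
        self._params = dict(params)
        # read-only view exposed to user code
        self.params = MappingProxyType(self._params)

t = Task({'key': 'value'})
t._params['product'] = 'out.csv'  # internal updates still possible
# t.params['key'] = 'other'       # would raise TypeError
```

The proxy is a live view, so internal writes through _params remain visible via params without copying.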
Update: make this available via DAGConfigurator via a hot_reload option.
This should be passed to sources which should re-load from disk when .render is called
Hello! Just checking out this new library. There is one line that confuses me. The comment of course says what it does, but the API syntax is very strange to me:
task_add_one.on_finish = on_finish
What is strange to me is that the on_finish function somehow gets attached to the task_add_one and you don't pass a parameter to it. Does this line belong here?
Related to #14
Currently, the only way to add an on_finish hook to a task is:
task.on_finish = some_callable
Even though that API is fine, it should also be possible to add it in the Task's constructor:
Task(..., on_finish=some_callable)
Our current test suite depends on a connection to a PostgreSQL db. Since forks do not have credentials, it will fail.
I think Travis has postgres installed, so I should connect to localhost by default.
See #8 for an example
SQL scripts raise errors if jinja rendering fails (e.g. a needed tag is not passed); this helps catch errors locally and avoids sending ill-defined scripts to the db. Currently, PythonCallable has an empty rendering method, but we could use the inspect module to do some basic checks, such as detecting parameters that are not declared in the function; the same logic applies to NotebookRunner.
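A sketch of the inspect-based check (validate_params is a hypothetical helper; a real render method might also check for missing required parameters):

```python
import inspect

def validate_params(fn, params):
    """Fail at render time if params reference undeclared arguments."""
    declared = set(inspect.signature(fn).parameters)
    extra = set(params) - declared
    if extra:
        raise TypeError(f'{fn.__name__} does not declare: {extra}')

def transform(upstream, product):
    pass

validate_params(transform, {'upstream': 1, 'product': 2})  # ok
```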