ploomber / ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Home Page: https://docs.ploomber.io

License: Apache License 2.0

Python 96.39% HTML 3.27% Jupyter Notebook 0.23% R 0.11%
workflow machine-learning data-science data-engineering mlops papermill jupyter jupyter-notebooks pipelines vscode

ploomber's Introduction


Tip

Deploy AI apps for free on Ploomber Cloud!

Join our community | Newsletter | Contact us | Docs | Blog | Website | YouTube

Ploomber is the fastest way to build data pipelines ⚡️. Use your favorite editor (Jupyter, VSCode, PyCharm) to develop interactively and deploy ☁️ without code changes (Kubernetes, Airflow, AWS Batch, and SLURM). Do you have legacy notebooks? Refactor them into modular pipelines with a single command.

Installation

Compatible with Python 3.7 and higher.

Install with pip:

pip install ploomber

Or with conda:

conda install ploomber -c conda-forge

Getting started

Try the tutorial:

Community

Main Features

⚡️ Get started quickly

A simple YAML API to get started quickly, a powerful Python API for total flexibility.
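
As an illustrative sketch (the task sources and product paths here are hypothetical), a spec-based pipeline declares tasks and the files they produce:

```yaml
# pipeline.yaml (illustrative example; names and paths are made up)
tasks:
  - source: scripts/get.py
    product:
      nb: output/get.ipynb
      data: output/data.csv
  - source: scripts/clean.py
    product:
      nb: output/clean.ipynb
      data: output/clean.csv
```

Each task's upstream dependencies are inferred from its source, so the spec stays short while the Python API remains available for anything the YAML cannot express.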

get-started.mp4

⏱ Shorter development cycles

Automatically cache your pipeline’s previous results and only re-compute tasks that have changed since your last execution.

shorter-cycles.mp4

☁️ Deploy anywhere

Run as a shell script in a single machine or distributively in Kubernetes, Airflow, AWS Batch, or SLURM.

deploy.mp4

📙 Automated migration from legacy notebooks

Bring your old monolithic notebooks, and we’ll automatically convert them into maintainable, modular pipelines.

refactor.mp4

I want to migrate my notebook.

Show me a demo.

Resources

About Ploomber

Ploomber is a big community of data enthusiasts pushing the boundaries of Data Science and Machine Learning tooling.

Whatever your skillset is, you can contribute to our mission. So whether you're a beginner or an experienced professional, you're welcome to join us on this journey!

Click here to learn how you can contribute to Ploomber.

ploomber's People

Contributors

94rain, aadityasinha-dotcom, anirudhviyer, arturomf94, bibhashthakur, dependabot[bot], e1ha, edublancas, fferegrino, grnnja, hypefi, idomic, jennifertieu, jramirez857, judahrand, lbellomo, maticortesr, mehtamohit013, neelasha23, qixuan27, raj-pansuriya, rehman000, rodolfoferro, shizuchanw, tomarm, tonykploomber, vinay26k, wxl19980214, yafimvo, zhenye-na


ploomber's Issues

Use render infrastructure for preventing errors before executing DAG.build

SQL scripts raise errors if Jinja rendering fails (e.g., a needed tag is not passed); this helps catch errors locally and avoids sending ill-defined scripts to the db. Currently, PythonCallable has an empty rendering method, but we could use the inspect module to do some basic checks, such as detecting parameters passed that are not declared in the function. The same logic applies to NotebookRunner.

Improve error message when schema validation fails

Current error example:

  File "/.../python3.6/site-packages/ploomber/validators/validators.py", line 154, in data_frame_validator
    raise AssertionError(str(assert_))
AssertionError: 2 errors found: 
 * validate_schema: missing columns {'SOME_COLUMN'}.
 * validate_schema: wrong dtype for column "SOME_ID". Expected: "int64". Got: "object"

There is little context here; the first line should point to a schema validation error.

Create a params copy for tasks

Each task should use its own copy of params: if more than one task uses the same dict and one of them modifies it, the other tasks will be affected.
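
A minimal sketch of the proposed behavior, using a stand-in Task class (not ploomber's real one):

```python
import copy


class Task:
    """Illustrative stand-in for ploomber's Task, not the real class."""

    def __init__(self, params):
        # each task keeps its own deep copy, so one task mutating its
        # params cannot affect another task initialized with the same dict
        self.params = copy.deepcopy(params)


shared = {"n": 1}
a, b = Task(shared), Task(shared)
a.params["n"] = 99
print(b.params["n"])  # still 1: b is unaffected by a's mutation
```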

load_env decorator error

The load_env decorator should only attempt to load the environment when the decorated function is called, not when the decorator is initialized.
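
A sketch of the fix, deferring all loading to call time (the environment lookup is a placeholder, not the real Env implementation):

```python
import functools


def load_env(fn):
    """Illustrative sketch: nothing is loaded at decoration time;
    the environment is resolved only when the function is called."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        env = {"loaded": True}  # placeholder for the real Env lookup
        return fn(env, *args, **kwargs)

    return wrapper


@load_env  # decorating must not trigger any loading
def report(env):
    return env["loaded"]


print(report())  # env is loaded only when report() is invoked
```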

dag-level, class-level and task-level hooks

Currently, the on_render, on_finish, and on_failure hooks can only be set at the task level, making it impossible to set the same hook for many tasks at once.

One use case for this is static analysis: an on_render hook could be set for all NotebookRunner tasks to run static analysis on the source code and detect potential issues before executing.

Potential API (similar to how dag.clients work):

from ploomber.tasks import PythonCallable

t = PythonCallable(...)

t.on_render = task_level_on_render
dag.class_on_render[PythonCallable] = class_level_on_render
dag.on_render = dag_level_on_render
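
The precedence between the three levels could be resolved with the most specific hook winning; a sketch using the proposed attribute names, with stand-in DAG and PythonCallable classes:

```python
class DAG:
    """Stand-in for the real DAG, holding the proposed hook attributes."""

    def __init__(self):
        self.class_on_render = {}
        self.on_render = None


class PythonCallable:
    on_render = None  # task-level hook, unset by default


def resolve_on_render(task, dag):
    # most specific wins: task-level, then class-level, then dag-level
    if task.on_render is not None:
        return task.on_render
    hook = dag.class_on_render.get(type(task))
    if hook is not None:
        return hook
    return dag.on_render


dag = DAG()
dag.on_render = "dag-level"
dag.class_on_render[PythonCallable] = "class-level"

t = PythonCallable()
print(resolve_on_render(t, dag))  # class-level

t.on_render = "task-level"
print(resolve_on_render(t, dag))  # task-level
```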

Travis fails for PRs

Our current test suite depends on a connection to a PostgreSQL db. Since forks do not have credentials, it will fail.

Travis has postgres installed, so we should connect to localhost by default.

See #8 for an example

Reload sources on render

  • If sources are strings they cannot change once the DAG is declared
  • If they are Paths/Placeholders they can if the underlying file changes (we should be able to reload)
  • For callables it is technically possible

Update: make this available through DAGConfigurator via a hot_reload option.

This should be passed to sources, which should re-load from disk when .render is called.

PostgresCopy does not accept template as sources

raise SourceInitializationError('{} does not support templates as '

PostgresCopy does not accept templates as sources, but it should: when an upstream task creates a File, the source could reference the path dynamically as '{{upstream["another_task"]}}'.

Edit: This is related to issue #2; maybe the validation logic should be part of the source object itself, with the option to provide more granular validation in _init_source for specific use cases.

Make Task.params read-only

Task.params is used in a bunch of places: when rendering tasks, when running them, and it is also accessible via Task.params. One could inadvertently modify it and cause hard-to-debug errors. Once a Task is initialized with params, there is no need to modify them (except internally, when adding the Product in Task.render), so it is safer to make them read-only.

  • Create a read-only dictionary object
  • Implement a to_dict() function to return a copy of the dictionary representation
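
A minimal sketch covering both bullets, assuming a Mapping-based wrapper (the class name is made up):

```python
from collections.abc import Mapping


class ReadOnlyParams(Mapping):
    """Illustrative read-only dictionary; item assignment raises TypeError
    because Mapping does not implement __setitem__."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def to_dict(self):
        # return a copy so callers cannot mutate internal state
        return dict(self._data)


params = ReadOnlyParams({"product": "out.csv"})
params.to_dict()["product"] = "x"  # mutates only the copy
print(params["product"])  # out.csv
```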

Add on_render hook

An on_render hook should be executed after rendering each task. This allows embedding logic that finds problems before actually executing the pipeline (and prevents a long-running build from being wasted by errors that only surface at runtime). The use case that comes to mind is running static analysis on Python code.

Example:

# ... very long running code
# ... that never declared variable "a"
# ...

# code will break here, after running for a long time
a + 1

SQLAlchemyClient deepcopy fails

kwargs = deepcopy(task_kwargs)

__getstate__ (used by the pickle and copy modules) deletes the _connection attribute. Pickling works since it has to go through __init__ (?) again, hence sets _connection to None, but copying fails with: "AttributeError: 'SQLAlchemyClient' object has no attribute '_connection'"
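
A sketch of one possible fix: drop the live connection in __getstate__ but restore it as None in __setstate__, so both pickle and deepcopy work (Client is a stand-in, not the real SQLAlchemyClient):

```python
import copy


class Client:
    """Illustrative sketch only, not the real SQLAlchemyClient."""

    def __init__(self, uri):
        self.uri = uri
        self._connection = object()  # stand-in for a live connection

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_connection"]  # live connections cannot be serialized
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._connection = None  # re-established lazily on first use


c = copy.deepcopy(Client("postgresql://localhost/db"))
print(c._connection)  # None, instead of raising AttributeError
```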

QUESTION on example at README

Hello! Just checking out this new library. There is one line that is confusing to me which I don't understand. The comment of course says what it does, but the API syntax is very strange to me:

task_add_one.on_finish = on_finish

What is strange to me is that the on_finish function somehow gets attached to the task_add_one and you don't pass a parameter to it. Does this line belong here?

Add on_finish argument to Task constructor

Related to #14

Currently, the only way to add an on_finish hook to a task is:

task.on_finish = some_callable

Even though that API is fine, it should also be possible to add it in the Task's constructor:

Task(..., on_finish=some_callable)

Improve error message

When env is init with a filename and the file does not exist, the following is raised:

FileNotFoundError: Could not find file "None"

The variable holding the filename that was passed gets overwritten before the error is raised; that's why the error shows "None".

Abstract iteration logic

Both executors have to iterate the DAG until no more tasks can run; abstract this logic in the DAG object.
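
The shared loop can be sketched as follows (function and parameter names are hypothetical): repeatedly run every task whose dependencies are all done, until none remain.

```python
def iter_until_done(tasks, deps, run):
    """Illustrative sketch of the loop both executors duplicate.

    tasks: iterable of task names
    deps: dict mapping task -> set of upstream tasks
    run: callable invoked once per runnable task
    """
    done = set()
    while len(done) < len(tasks):
        # a task is ready when all of its upstream tasks are done
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("no runnable tasks left: cycle or failed upstream")
        for t in ready:
            run(t)
            done.add(t)
    return done


order = []
iter_until_done(["a", "b", "c"], {"b": {"a"}, "c": {"b"}}, order.append)
print(order)  # ['a', 'b', 'c']
```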

Document supported dbs

  • For moving data from db to local file: any db with a PEP 249 compatible driver or supported by SQLAlchemy
  • For creating tables/views: only sqlite and postgres

Add tests for __repr__ (Placeholders)

Placeholders should be render-aware and return appropriate values in their repr method (show rendered values if available). We cannot do this with str since str(Product) is always assumed to return an already-rendered value and should raise an exception otherwise
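
A sketch of the intended behavior (simplified; the real Placeholder renders Jinja templates, not format strings): repr degrades gracefully to the raw template, while str insists on a rendered value.

```python
class Placeholder:
    """Illustrative sketch of a render-aware placeholder."""

    def __init__(self, template):
        self._template = template
        self._rendered = None

    def render(self, params):
        self._rendered = self._template.format(**params)

    def __repr__(self):
        # show the rendered value if available, else the raw template
        value = self._rendered if self._rendered is not None else self._template
        return f"Placeholder({value!r})"

    def __str__(self):
        # str always assumes an already-rendered value
        if self._rendered is None:
            raise RuntimeError("str() called before render()")
        return self._rendered


p = Placeholder("/data/{name}.csv")
print(repr(p))  # Placeholder('/data/{name}.csv')
p.render({"name": "raw"})
print(repr(p))  # Placeholder('/data/raw.csv')
```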

PostgresCopy render fails due to missing value for "product" parameter

If we pass a templated string to PostgresCopy, it will initialize its source as a Generic Source:

def _init_source(self, source):

GenericSource is just an implementation of the abc class:

class GenericSource(Source):

Such class uses Placeholder to support templates, placeholder validates that there are no missing parameters:

def render(self, params, optional=None):

The problem is that PostgresCopy source parameter should either be a path to a file or a templated string referencing an upstream dependency. For the second case, the '{{product}}' tag won't appear, only an '{{upstream}}' tag. This same problem appears in other Tasks.

Solution: move the parameter validation logic out of Placeholder and into Source, and let the task decide how to validate the source, with a few options (think about use cases to keep this as simple as possible).

Store timestamp in File.source

When a File is created, a .source file keeps a copy of the source code that generated it, but the timestamp is retrieved from the actual file's metadata. If an entire pipeline's product tree is moved, all these timestamps will change, but this should not happen.

Use case: branching off using git

Say your env.yaml looks like this:

path:
    data: /data/project/{{git}}

You are working in the dev branch and you want to experiment with a new feature, so you branch off from dev to new-feature. Since there is no /data/project/new-feature, the DAG will have to run end-to-end again.

But at this point both branches are the same so you could just copy /data/project/dev to /data/project/new-feature and the DAG should look up-to-date
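
The manual workaround can be sketched like this (function name and layout are hypothetical); note that a plain copy changes the files' timestamps, which is exactly why storing the timestamp alongside the source matters:

```python
import shutil
from pathlib import Path


def branch_products(root, source_branch, new_branch):
    """Illustrative sketch: copy one branch's products so the DAG
    looks up-to-date after branching off."""
    src = Path(root) / source_branch
    dst = Path(root) / new_branch
    if not dst.exists():
        # copytree replicates the whole products tree of the source branch
        shutil.copytree(src, dst)
    return dst
```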

Improve GenericProduct

Instead of relying on files and bash commands, use a db backend, the product only needs to provide a way to compute a URI

Handling exceptions inside hooks

Handling exceptions inside hooks:

  • on_render: TaskRenderError exceptions make the task status change to TaskErroredRender, any other exceptions are logged
  • on_finish: Similar behavior but with TaskBuildError and status TaskErrored (also AssertionError)
  • on_fail: Catch all exceptions and log them, if we reach this hook, the task status is already TaskErrored

This is related to #27. If more than one hook is applicable (say dag-level and task-level), we should probably run all of them and raise all tracebacks in a single exception.
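
The "run all of them, raise all tracebacks at once" idea can be sketched as (names are hypothetical):

```python
import traceback


def run_hooks(hooks, task):
    """Illustrative sketch: run every applicable hook (dag-, class-, and
    task-level), collect failures, and raise them as a single exception."""
    errors = []
    for hook in hooks:
        try:
            hook(task)
        except Exception:
            errors.append(traceback.format_exc())
    if errors:
        raise RuntimeError(
            "{} hook(s) failed:\n{}".format(len(errors), "\n".join(errors))
        )


def ok(task):
    pass


def bad(task):
    raise ValueError("boom")


try:
    run_hooks([ok, bad], task="t1")
except RuntimeError as e:
    print("combined:", e)
```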

How to manage on_fail? Log all exceptions

What is the appropriate logging behavior? Send everything to logger.exception (with the traceback)? Send it to standard error?

Note: Is "Errored" a valid word? https://english.stackexchange.com/questions/3059/is-errored-correct-usage

Making Task.name optional

We could make the name argument optional if a reasonable value can be inferred.

In an early ploomber version, if name was None, it was inferred from the product argument; the problem is that if a Product contains tags, its representation is only known after dag.render(), hence this was discarded.

A better solution is to use information from the source. For PythonCallable this is easy: we can just use the function name (the __name__ attribute).

For tasks that accept Placeholders as source, we can use the filename (without the extension); to keep things consistent, we can make Placeholder.name return this.
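
Both cases fit in one small helper (a sketch; the function name is made up):

```python
from pathlib import Path


def infer_name(source):
    """Illustrative sketch: infer a task name from its source, using the
    function's __name__ for callables and the file stem for paths."""
    if callable(source):
        return source.__name__
    return Path(source).stem


def clean_data(product):
    pass


print(infer_name(clean_data))        # clean_data
print(infer_name("scripts/get.py"))  # get
```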
