ploomber / ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Home Page: https://docs.ploomber.io

License: Apache License 2.0

Python 96.39% HTML 3.27% Jupyter Notebook 0.23% R 0.11%
workflow machine-learning data-science data-engineering mlops papermill jupyter jupyter-notebooks pipelines vscode

ploomber's Introduction


Tip

Deploy AI apps for free on Ploomber Cloud!

Join our community | Newsletter | Contact us | Docs | Blog | Website | YouTube

Ploomber is the fastest way to build data pipelines ⚡️. Use your favorite editor (Jupyter, VSCode, PyCharm) to develop interactively and deploy ☁️ without code changes (Kubernetes, Airflow, AWS Batch, and SLURM). Do you have legacy notebooks? Refactor them into modular pipelines with a single command.

Installation

Compatible with Python 3.7 and higher.

Install with pip:

pip install ploomber

Or with conda:

conda install ploomber -c conda-forge

Getting started

Try the tutorial:

Community

Main Features

⚡️ Get started quickly

A simple YAML API to get started quickly, a powerful Python API for total flexibility.
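
As an illustrative sketch (the task sources and product paths here are hypothetical), a spec-based pipeline declares tasks and the files they produce:

```yaml
# pipeline.yaml (illustrative example; names and paths are made up)
tasks:
  - source: scripts/get.py
    product:
      nb: output/get.ipynb
      data: output/data.csv
  - source: scripts/clean.py
    product:
      nb: output/clean.ipynb
      data: output/clean.csv
```

Each task's upstream dependencies are inferred from its source, so the spec stays short while the Python API remains available for anything the YAML cannot express.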

get-started.mp4

⏱ Shorter development cycles

Automatically cache your pipeline’s previous results and only re-compute tasks that have changed since your last execution.

shorter-cycles.mp4

☁️ Deploy anywhere

Run as a shell script in a single machine or distributively in Kubernetes, Airflow, AWS Batch, or SLURM.

deploy.mp4

📙 Automated migration from legacy notebooks

Bring your old monolithic notebooks, and we’ll automatically convert them into maintainable, modular pipelines.

refactor.mp4

I want to migrate my notebook.

Show me a demo.

Resources

About Ploomber

Ploomber is a big community of data enthusiasts pushing the boundaries of Data Science and Machine Learning tooling.

Whatever your skillset is, you can contribute to our mission. So whether you're a beginner or an experienced professional, you're welcome to join us on this journey!

Click here to learn how you can contribute to Ploomber.

ploomber's People

Contributors

94rain, aadityasinha-dotcom, anirudhviyer, arturomf94, bibhashthakur, dependabot[bot], e1ha, edublancas, fferegrino, grnnja, hypefi, idomic, jennifertieu, jramirez857, judahrand, lbellomo, maticortesr, mehtamohit013, neelasha23, qixuan27, raj-pansuriya, rehman000, rodolfoferro, shizuchanw, tomarm, tonykploomber, vinay26k, wxl19980214, yafimvo, zhenye-na


ploomber's Issues

Use render infrastructure for preventing errors before executing DAG.build

SQL scripts raise errors if Jinja rendering fails (e.g., a needed tag is not passed); this helps catch errors locally and avoids sending ill-defined scripts to the db. Currently, PythonCallable has an empty rendering method, but we could use the inspect module to do some basic checks, such as detecting parameters passed that are not declared in the function. The same logic applies to NotebookRunner.

Improve error message when schema validation fails

Current error example:

  File "/.../python3.6/site-packages/ploomber/validators/validators.py", line 154, in data_frame_validator
    raise AssertionError(str(assert_))
AssertionError: 2 errors found: 
 * validate_schema: missing columns {'SOME_COLUMN'}.
 * validate_schema: wrong dtype for column "SOME_ID". Expected: "int64". Got: "object"

There is little context here; the first line should point to a schema validation error.

Create a params copy for tasks

Each task should use its own copy of params: if more than one task uses the same dict and one of them modifies it, the other tasks will be affected.
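
A minimal sketch of the proposed behavior, using a stand-in Task class (not ploomber's real one):

```python
import copy


class Task:
    """Illustrative stand-in for ploomber's Task, not the real class."""

    def __init__(self, params):
        # each task keeps its own deep copy, so one task mutating its
        # params cannot affect another task initialized with the same dict
        self.params = copy.deepcopy(params)


shared = {"n": 1}
a, b = Task(shared), Task(shared)
a.params["n"] = 99
print(b.params["n"])  # still 1: b is unaffected by a's mutation
```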

load_env decorator error

The load_env decorator should only attempt to load the environment when the decorated function is called, not when the decorator is initialized.
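
A sketch of the fix, deferring all loading to call time (the environment lookup is a placeholder, not the real Env implementation):

```python
import functools


def load_env(fn):
    """Illustrative sketch: nothing is loaded at decoration time;
    the environment is resolved only when the function is called."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        env = {"loaded": True}  # placeholder for the real Env lookup
        return fn(env, *args, **kwargs)

    return wrapper


@load_env  # decorating must not trigger any loading
def report(env):
    return env["loaded"]


print(report())  # env is loaded only when report() is invoked
```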

dag-level, class-level and task-level hooks

Currently, the on_render, on_finish, and on_failure hooks can only be set at the task level, making it impossible to set the same hook for many tasks at once.

One use case for this is static analysis: an on_render hook could be set for all NotebookRunner tasks to run static analysis on the source code and detect potential issues before executing.

Potential API (similar to how dag.clients work):

from ploomber.tasks import PythonCallable

t = PythonCallable(...)

t.on_render = task_level_on_render
dag.class_on_render[PythonCallable] = class_level_on_render
dag.on_render = dag_level_on_render
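
The precedence between the three levels could be resolved with the most specific hook winning; a sketch using the proposed attribute names, with stand-in DAG and PythonCallable classes:

```python
class DAG:
    """Stand-in for the real DAG, holding the proposed hook attributes."""

    def __init__(self):
        self.class_on_render = {}
        self.on_render = None


class PythonCallable:
    on_render = None  # task-level hook, unset by default


def resolve_on_render(task, dag):
    # most specific wins: task-level, then class-level, then dag-level
    if task.on_render is not None:
        return task.on_render
    hook = dag.class_on_render.get(type(task))
    if hook is not None:
        return hook
    return dag.on_render


dag = DAG()
dag.on_render = "dag-level"
dag.class_on_render[PythonCallable] = "class-level"

t = PythonCallable()
print(resolve_on_render(t, dag))  # class-level

t.on_render = "task-level"
print(resolve_on_render(t, dag))  # task-level
```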

Travis fails for PRs

Our current test suite depends on a connection to a PostgreSQL db. Since forks do not have credentials, it will fail.

Travis has postgres installed, so we should connect to localhost by default.

See #8 for an example

Reload sources on render

  • If sources are strings they cannot change once the DAG is declared
  • If they are Paths/Placeholders they can if the underlying file changes (we should be able to reload)
  • For callables it is technically possible

Update: make this available through DAGConfigurator via a hot_reload option.

This should be passed to sources, which should re-load from disk when .render is called.

PostgresCopy does not accept template as sources

raise SourceInitializationError('{} does not support templates as '

PostgresCopy does not accept templates as sources, but it should: when an upstream task creates a File, the source could reference the path dynamically as '{{upstream["another_task"]}}'.

Edit: This is related to issue #2; maybe the validation logic should be part of the source object itself, with the option to provide more granular validation in _init_source for specific use cases.

Make Task.params read-only

Task.params is used in a bunch of places: when rendering tasks, when running them, and it is also accessible via Task.params. One could inadvertently modify it and cause hard-to-debug errors. Once a Task is initialized with params, there is no need to modify them (except internally, when adding the Product in Task.render), so it is safer to make them read-only.

  • Create a read-only dictionary object
  • Implement a to_dict() function to return a copy of the dictionary representation
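
A minimal sketch covering both bullets, assuming a Mapping-based wrapper (the class name is made up):

```python
from collections.abc import Mapping


class ReadOnlyParams(Mapping):
    """Illustrative read-only dictionary; item assignment raises TypeError
    because Mapping does not implement __setitem__."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def to_dict(self):
        # return a copy so callers cannot mutate internal state
        return dict(self._data)


params = ReadOnlyParams({"product": "out.csv"})
params.to_dict()["product"] = "x"  # mutates only the copy
print(params["product"])  # out.csv
```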

Add on_render hook

An on_render hook should be executed after rendering each task. This allows embedding logic that finds problems before actually executing the pipeline (and prevents a long-running build from being wasted by errors that only surface at runtime). The use case that comes to mind is running static analysis on Python code.

Example:

# ... very long running code
# ... that never declared variable "a"
# ...

# code will break here, after running for a long time
a + 1

SQLAlchemyClient deepcopy fails

kwargs = deepcopy(task_kwargs)

__getstate__ (used by the pickle and copy modules) deletes the _connection attribute. Pickling works since it has to go through __init__ (?) again, hence sets _connection to None, but copying fails with: "AttributeError: 'SQLAlchemyClient' object has no attribute '_connection'"
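
A sketch of one possible fix: drop the live connection in __getstate__ but restore it as None in __setstate__, so both pickle and deepcopy work (Client is a stand-in, not the real SQLAlchemyClient):

```python
import copy


class Client:
    """Illustrative sketch only, not the real SQLAlchemyClient."""

    def __init__(self, uri):
        self.uri = uri
        self._connection = object()  # stand-in for a live connection

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_connection"]  # live connections cannot be serialized
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._connection = None  # re-established lazily on first use


c = copy.deepcopy(Client("postgresql://localhost/db"))
print(c._connection)  # None, instead of raising AttributeError
```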

QUESTION on example at README

Hello! Just checking out this new library. There is one line that is confusing to me which I don't understand. The comment of course says what it does, but the API syntax is very strange to me:

task_add_one.on_finish = on_finish

What is strange to me is that the on_finish function somehow gets attached to the task_add_one and you don't pass a parameter to it. Does this line belong here?

Add on_finish argument to Task constructor

Related to #14

Currently, the only way to add an on_finish hook to a task is:

task.on_finish = some_callable

Even though that API is fine, it should also be possible to add it in the Task's constructor:

Task(..., on_finish=some_callable)

Improve error message

When env is init with a filename and the file does not exist, the following is raised:

FileNotFoundError: Could not find file "None"

The variable holding the filename that was passed gets overwritten before the error is raised; that's why the error shows "None".

Abstract iteration logic

Both executors have to iterate the DAG until no more tasks can run; abstract this logic in the DAG object.
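
The shared loop can be sketched as follows (function and parameter names are hypothetical): repeatedly run every task whose dependencies are all done, until none remain.

```python
def iter_until_done(tasks, deps, run):
    """Illustrative sketch of the loop both executors duplicate.

    tasks: iterable of task names
    deps: dict mapping task -> set of upstream tasks
    run: callable invoked once per runnable task
    """
    done = set()
    while len(done) < len(tasks):
        # a task is ready when all of its upstream tasks are done
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("no runnable tasks left: cycle or failed upstream")
        for t in ready:
            run(t)
            done.add(t)
    return done


order = []
iter_until_done(["a", "b", "c"], {"b": {"a"}, "c": {"b"}}, order.append)
print(order)  # ['a', 'b', 'c']
```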

Document supported dbs

  • For moving data from db to local file: any db with a PEP 249 compatible driver or supported by SQLAlchemy
  • For creating tables/views: only sqlite and postgres

Add tests for __repr__ (Placeholders)

Placeholders should be render-aware and return appropriate values in their repr method (show rendered values if available). We cannot do this with str since str(Product) is always assumed to return an already-rendered value and should raise an exception otherwise
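
A sketch of the intended behavior (simplified; the real Placeholder renders Jinja templates, not format strings): repr degrades gracefully to the raw template, while str insists on a rendered value.

```python
class Placeholder:
    """Illustrative sketch of a render-aware placeholder."""

    def __init__(self, template):
        self._template = template
        self._rendered = None

    def render(self, params):
        self._rendered = self._template.format(**params)

    def __repr__(self):
        # show the rendered value if available, else the raw template
        value = self._rendered if self._rendered is not None else self._template
        return f"Placeholder({value!r})"

    def __str__(self):
        # str always assumes an already-rendered value
        if self._rendered is None:
            raise RuntimeError("str() called before render()")
        return self._rendered


p = Placeholder("/data/{name}.csv")
print(repr(p))  # Placeholder('/data/{name}.csv')
p.render({"name": "raw"})
print(repr(p))  # Placeholder('/data/raw.csv')
```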

PostgresCopy render fails due to missing value for "product" parameter

If we pass a templated string to PostgresCopy, it will initialize its source as a Generic Source:

def _init_source(self, source):

GenericSource is just an implementation of the abc class:

class GenericSource(Source):

Such class uses Placeholder to support templates, placeholder validates that there are no missing parameters:

def render(self, params, optional=None):

The problem is that PostgresCopy source parameter should either be a path to a file or a templated string referencing an upstream dependency. For the second case, the '{{product}}' tag won't appear, only an '{{upstream}}' tag. This same problem appears in other Tasks.

Solution: move the parameter validation logic out of Placeholder and into Source, and let the task decide how to validate the source, with a few options (think about use cases to keep this as simple as possible).

Store timestamp in File.source

When a File is created, a .source file keeps a copy of the source code that generated it, but the timestamp is retrieved from the actual file's metadata. If an entire pipeline's product tree is moved, all these timestamps will change, but this should not happen.

Use case: branching off using git

Say your env.yaml looks like this:

path:
    data: /data/project/{{git}}

You are working in the dev branch and you want to experiment with a new feature, so you branch off from dev to new-feature. Since there is no /data/project/new-feature, the DAG will have to run end-to-end again.

But at this point both branches are the same so you could just copy /data/project/dev to /data/project/new-feature and the DAG should look up-to-date
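
The manual workaround can be sketched like this (function name and layout are hypothetical); note that a plain copy changes the files' timestamps, which is exactly why storing the timestamp alongside the source matters:

```python
import shutil
from pathlib import Path


def branch_products(root, source_branch, new_branch):
    """Illustrative sketch: copy one branch's products so the DAG
    looks up-to-date after branching off."""
    src = Path(root) / source_branch
    dst = Path(root) / new_branch
    if not dst.exists():
        # copytree replicates the whole products tree of the source branch
        shutil.copytree(src, dst)
    return dst
```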

Improve GenericProduct

Instead of relying on files and bash commands, use a db backend, the product only needs to provide a way to compute a URI

Handling exceptions inside hooks

Handling exceptions inside hooks:

  • on_render: TaskRenderError exceptions make the task status change to TaskErroredRender, any other exceptions are logged
  • on_finish: Similar behavior but with TaskBuildError and status TaskErrored (also AssertionError)
  • on_fail: Catch all exceptions and log them, if we reach this hook, the task status is already TaskErrored

This is related to #27. If more than one hook is applicable (say dag-level and task-level), we should probably run all of them and raise all tracebacks in a single exception.
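
The "run all of them, raise all tracebacks at once" idea can be sketched as (names are hypothetical):

```python
import traceback


def run_hooks(hooks, task):
    """Illustrative sketch: run every applicable hook (dag-, class-, and
    task-level), collect failures, and raise them as a single exception."""
    errors = []
    for hook in hooks:
        try:
            hook(task)
        except Exception:
            errors.append(traceback.format_exc())
    if errors:
        raise RuntimeError(
            "{} hook(s) failed:\n{}".format(len(errors), "\n".join(errors))
        )


def ok(task):
    pass


def bad(task):
    raise ValueError("boom")


try:
    run_hooks([ok, bad], task="t1")
except RuntimeError as e:
    print("combined:", e)
```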

How to manage on_fail? Log all exceptions

What is the appropriate logging behavior? Send everything to logger.exception (with the traceback)? Send it to standard error?

Note: Is "Errored" a valid word? https://english.stackexchange.com/questions/3059/is-errored-correct-usage

Making Task.name optional

We could make the name argument optional if a reasonable value can be inferred.

In an early ploomber version, if name was None, it was inferred from the product argument; the problem is that if a Product contains tags, its representation is only known after dag.render(), hence this was discarded.

A better solution is to use information from the source. For PythonCallable this is easy: we can just use the function name (the __name__ attribute).

For tasks that accept Placeholders as source, we can use the filename (without the extension); to keep things consistent, we can make Placeholder.name return this.
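
Both cases fit in one small helper (a sketch; the function name is made up):

```python
from pathlib import Path


def infer_name(source):
    """Illustrative sketch: infer a task name from its source, using the
    function's __name__ for callables and the file stem for paths."""
    if callable(source):
        return source.__name__
    return Path(source).stem


def clean_data(product):
    pass


print(infer_name(clean_data))        # clean_data
print(infer_name("scripts/get.py"))  # get
```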
