TinyBaker: Lightweight tool for defining composable file-to-file transformations

TinyBaker is still in beta and is not yet suitable for production use.

Install with pip: pip install tinybaker

TinyBaker allows programmers to define file-to-file transformations in a concise format and compose them together with clarity.

The model

The main component of TinyBaker is a Transform: a standalone mapping from one set of files to another.

                 ___________
---[ file1 ]--->|           |
                |           |->--[ file4 ]---
---[ file2 ]--->| Transform |
                |           |->--[ file5 ]---
---[ file3 ]--->|___________|

For example, suppose we're running predictions from an ML model. That might look like this:

                  ___________
---[ config ]--->|           |
                 |           |->--[ predictions ]---
---[ model ]---->|  Predict  |
                 |           |->--[ performance ]---
---[ data ]----->|___________|

TinyBaker calls the label associated with each input/output file a tag.

                  ___________
---[ config ]--->|           |
      ^ Tag      |           |->--[ predictions ]---
---[ model ]---->|  Predict  |       ^ Tag
      ^ Tag      |           |->--[ performance ]---
---[ data ]----->|___________|       ^ Tag
      ^ Tag

We might want the locations of input and output files to be configurable, and even to live on different filesystems. TinyBaker lets you define the transform in terms of tags alone, without worrying about where each file is or which filesystem it's on.

                                       ___________
ftp://path/to/config >--[ config ]--->|           |
                                      |           |->--[ predictions ]---> ./output.pkl
/path/to/model.pkl >----[ model ]---->|  Predict  |
                                      |           |->--[ performance ]---> ./performance.pkl
/path/to/data.pkl >-----[ data ]----->|___________|
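
In code, binding tags to paths might look like the following sketch. The Predict transform and its script body are hypothetical; the input_paths / output_paths call style is the one used throughout this README.

from tinybaker import Transform

class Predict(Transform):
  # Tags matching the diagram above
  input_tags = {"config", "model", "data"}
  output_tags = {"predictions", "performance"}

  def script(self):
    # Hypothetical body: load the model and data, write predictions and stats
    ...

# Each tag can be bound to a path on a different filesystem:
Predict(
  input_paths={
    "config": "ftp://path/to/config",
    "model": "/path/to/model.pkl",
    "data": "/path/to/data.pkl",
  },
  output_paths={
    "predictions": "./output.pkl",
    "performance": "./performance.pkl",
  },
).run()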

Now imagine two file transformations that could, in principle, compose:

                   ________________
                  |                |
---[ raw_logs ]-->| BuildDataFrame |->--[ df ]---
                  |________________|
                  
             ____________
            |            |
---[ df ]-->| BuildModel |->--[ model ]---
            |____________|

TinyBaker allows you to compose these two transformations together:

                   ___________________________
                  |                           |
---[ raw_logs ]-->| BuildDataFrame+BuildModel |->--[ model ]---
                  |___________________________|

We now only need to specify the locations of two files; TinyBaker handles linking the two steps together:

                                 ___________________________
                                |                           |
/raw/logs.txt ---[ raw_logs ]-->| BuildDataFrame+BuildModel |->--[ model ]--- /path/to/model.pkl
                                |___________________________|
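
In code, this composition might look like the sketch below, using the sequence helper described later in this README (the class bodies are hypothetical stand-ins matching the diagram):

from tinybaker import Transform, sequence

class BuildDataFrame(Transform):
  input_tags = {"raw_logs"}
  output_tags = {"df"}
  def script(self):
    ...  # hypothetical body

class BuildModel(Transform):
  input_tags = {"df"}
  output_tags = {"model"}
  def script(self):
    ...  # hypothetical body

# Compose the two steps; the intermediate "df" tag is linked internally.
BuildModelFromRawLogs = sequence(BuildDataFrame, BuildModel)

BuildModelFromRawLogs(
  input_paths={"raw_logs": "/raw/logs.txt"},
  output_paths={"model": "/path/to/model.pkl"},
).run()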

Extra dependencies are propagated to the top level, ensuring you'll never miss one buried in step 5 of 17, e.g.

                   ________________
                  |                |
---[ raw_logs ]-->| BuildDataFrame |->--[ df ]---
                  |________________|
                  
                 ____________
---[ df ]------>|            |
                | BuildModel |->--[ model ]---
---[ config ]-->|____________|
            
# Goes to...

                   ___________________________
---[ raw_logs ]-->|                           |
                  | BuildDataFrame+BuildModel |->--[ model ]---
---[ config ]---->|___________________________|
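
Concretely, running the composed transform would then require a path for config as well. A sketch with hypothetical paths, reusing the classes from the sketch above but supposing BuildModel also declares "config" among its input_tags:

Pipeline = sequence(BuildDataFrame, BuildModel)

Pipeline(
  input_paths={
    "raw_logs": "/raw/logs.txt",
    # "config" isn't produced by any step, so it surfaces as a top-level input:
    "config": "/path/to/config.json",
  },
  output_paths={"model": "/path/to/model.pkl"},
).run()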

In-code anatomy of a single transform

The following is a minimal transform one can define in TinyBaker:

from tinybaker import Transform

class SampleTransform(Transform):
  # 1 tag per input file
  input_tags = {"first_input", "second_input"}
  output_tags = {"some_output"}

  # script() describes what actually executes when the transform runs
  def script(self):
    # Transforms provide self.input_files and self.output_files, dictionaries of
    # fully-qualified references to files that can be directly opened:
    with self.input_files["first_input"].open() as f:
      do_something_with(f)
    with self.input_files["second_input"].open() as f:
      do_something_else_with(f)

    # ...and write to the output file:
    with self.output_files["some_output"].open() as f:
      write_something_to(f)

This would then be executed via:

SampleTransform(
  input_paths={"first_input": "path/to/input1", "second_input": "path/to/input2"},
  output_paths={"some_output": "path/to/write/output"}
).run()

Real-world example of a single transform

For a real-world example, consider training an ML model. This is a transformation from the two files some/path/train.csv and some/path/test.csv to a pickled ML model another/path/some_model.pkl and a statistics summary. With TinyBaker, you can specify this individual configurable step as follows:

# train_step.py
from tinybaker import Transform
import pickle
import pandas as pd
from some_cool_ml_library import train_model, test_model

class TrainModelStep(Transform):
  input_tags = {"train_csv", "test_csv"}
  output_tags = {"pickled_model", "results"}

  def script(self):
    # Read from files
    with self.input_files["train_csv"].open() as f:
      train_data = pd.read_csv(f)
    with self.input_files["test_csv"].open() as f:
      test_data = pd.read_csv(f)

    # Run computations
    X = train_data.drop(columns=["label"])
    Y = train_data[["label"]]
    [model, train_results] = train_model(X, Y)
    test_results = test_model(model, test_data)

    # Write to output files
    with self.output_files["results"].open() as f:
      results = train_results.formatted_summary() + test_results.formatted_summary()
      f.write(results)
    with self.output_files["pickled_model"].openbin() as f:
      pickle.dump(model, f)

The script that consumes this may look like:

# script.py
from .train_step import TrainModelStep

train_csv_path = "s3://data/train.csv"
test_csv_path = "s3://data/test.csv"
pickled_model_path = "./model.pkl"
results_path = "./results.txt"

TrainModelStep(
  input_paths={
    "train_csv": train_csv_path,
    "test_csv": test_csv_path,
  },
  output_paths={
    "pickled_model": pickled_model_path,
    "results": results_path
  }
).run()

This will perform standard error handling, such as raising early if certain files are missing.
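
For instance, pointing a tag at a nonexistent input should fail before the script body runs. The exact exception type is tinybaker's own, so this sketch just catches broadly:

try:
  TrainModelStep(
    input_paths={"train_csv": "does/not/exist.csv", "test_csv": test_csv_path},
    output_paths={"pickled_model": pickled_model_path, "results": results_path},
  ).run()
except Exception as err:  # hypothetical; catch tinybaker's specific error in practice
  print(f"Failed before training started: {err}")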

Operating over multiple filesystems

Since TinyBaker uses pyfilesystem2 for file access, it can operate over any filesystem that pyfilesystem2 supports. For example, you can enable s3 support by installing https://github.com/PyFilesystem/s3fs.

This makes testing steps easy: test suites can operate on local data, while production jobs run against s3 data.
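
As a sketch, a pytest test might exercise the same step against local fixture files (the fixture paths and the tmp_path fixture are assumptions here):

# test_train_step.py
from train_step import TrainModelStep

def test_train_model_step(tmp_path):
  TrainModelStep(
    input_paths={
      "train_csv": "tests/__data__/train.csv",
      "test_csv": "tests/__data__/test.csv",
    },
    output_paths={
      "pickled_model": str(tmp_path / "model.pkl"),
      "results": str(tmp_path / "results.txt"),
    },
  ).run()
  assert (tmp_path / "model.pkl").exists()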

Combining several build steps

We can compose several build steps together using the methods merge and sequence.

from tinybaker import Transform, sequence, merge

class CleanLogs(Transform):
  input_tags = {"raw_logfile"}
  output_tags = {"cleaned_logfile"}
  ...

class BuildDataframe(Transform):
  input_tags = {"cleaned_logfile"}
  output_tags = {"dataframe"}
  ...

class BuildLabels(Transform):
  input_tags = {"cleaned_logfile"}
  output_tags = {"labels"}
  ...

class TrainModelFromDataframe(Transform):
  input_tags = {"dataframe", "labels"}
  output_tags = {"trained_model"}
  ...


TrainFromRawLogs = sequence(
  CleanLogs,
  merge(BuildDataframe, BuildLabels),
  TrainModelFromDataframe
)

task = TrainFromRawLogs(
  input_paths={"raw_logfile": "/path/to/raw.log"},
  output_paths={"trained_model": "/path/to/model.pkl"}
)

task.run()

Inputs and outputs are hooked up by tag name: if step 1 outputs tag "foo" and step 2 takes tag "foo" as an input, they will be linked together automatically.

Propagation of inputs and outputs

If task 3 of 4 in a sequence requires tag "foo" but no previous step generates it, the dependency is propagated to the top level: the sequence as a whole will have a dependency on tag "foo".

Additionally, if task 3 of 4 generates a tag "bar", but no further step requires "bar", then the sequence exposes "bar" as an output.
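
Taken together, the two rules mean a sequence's top-level signature can be read off its steps. A minimal sketch, with hypothetical tag names:

from tinybaker import Transform, sequence

class Extract(Transform):
  input_tags = {"raw"}
  output_tags = {"clean"}
  ...

class Train(Transform):
  input_tags = {"clean", "foo"}   # "foo" isn't produced by any earlier step...
  output_tags = {"model", "bar"}  # ...and "bar" isn't consumed by any later step
  ...

class Package(Transform):
  input_tags = {"model"}
  output_tags = {"bundle"}
  ...

Pipeline = sequence(Extract, Train, Package)
# Inferred top-level signature:
#   inputs:  {"raw", "foo"}    ("foo" propagated up)
#   outputs: {"bundle", "bar"} ("bar" exposed as an output)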

expose_intermediates

If you need to expose intermediate files within a sequence, you can use the keyword argument expose_intermediates to additionally output the listed intermediate tags, e.g.

sequence([A, B, C], expose_intermediates={"some_intermediate", "some_other_intermediate"})
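
Once exposed, an intermediate tag can be bound to a path like any other output. A sketch with hypothetical tags, keeping the call style from the line above:

Pipeline = sequence([A, B, C], expose_intermediates={"some_intermediate"})

Pipeline(
  input_paths={"initial_input": "/path/to/input"},   # hypothetical input tag of A
  output_paths={
    "final_output": "/path/to/output",               # hypothetical output tag of C
    "some_intermediate": "/path/to/intermediate",    # now writable at the top level
  },
).run()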

Renaming

Right now, association of files from one step to the next is based on tags, so we may end up in situations where tags need renaming so that adjacent steps line up. We can use map_tags to do this:

from tinybaker import map_tags

MappedStep = map_tags(
  SomeStep,
  input_mapping={"old_input_name": "new_input_name"},
  output_mapping={"old_output_name": "new_output_name"})
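
For example (step and tag names hypothetical), if an upstream step emits tag "df" but a downstream step expects "dataframe", remapping the output tag lets the two link up inside a sequence:

AdaptedStep = map_tags(
  SomeDataframeStep,                   # hypothetical upstream step emitting "df"
  output_mapping={"df": "dataframe"},
)
Pipeline = sequence(AdaptedStep, SomeTrainingStep)  # downstream step expects "dataframe"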

Filesets

Warning: The Filesets interface will probably be changed at some point in the future!

If a step operates over a dynamic set of files (e.g. logs from n different days), you can use the filesets interface to specify that. Tags that begin with the prefix fileset:: are interpreted as filesets rather than individual files.

If a sequence includes a fileset as an intermediate, the developer is expected to

Example

A concat task can be done as follows:

class Concat(Transform):
    input_tags = {"fileset::files"}
    output_tags = {"concatted"}

    def script(self):
        content = ""
        for ref in self.input_files["fileset::files"]:
            with ref.open() as f:
                content = content + f.read()

        with self.output_files["concatted"].open() as f:
            f.write(content)

Concat(
    input_paths={
        "fileset::files": ["./tests/__data__/foo.txt", "./tests/__data__/bar.txt"],
    },
    output_paths={"concatted": "/tmp/concatted"},
    overwrite=True,
).run()

Contributing

Please contribute! I appreciate any and all help!
