Comments (7)

rgbkrk commented on May 22, 2024

Great question!

"arrays tend to be too big, so store them somewhere else and record the path to it"

That would be my recommendation, and I'd like to make it a simple built-in if it makes sense across users. I think it requires some infrastructure as well as opinion. The stance I'd likely take inside Netflix as a paved path would be to serialize the array (or pandas dataframe), probably with Arrow, and store it on HDFS or S3. The field we store in a papermill record would then be the full path to the stored object (which can be an S3 path or an HDFS path). What object storage would you end up using?
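A minimal sketch of the "store it elsewhere, record the path" idea. Assumptions: NumPy's `.npy` format is swapped in for Arrow, a local temporary directory stands in for S3 or HDFS, and `save_array` / `load_array` are hypothetical helper names, not part of papermill.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_array(array: np.ndarray, directory: Path, name: str) -> dict:
    """Write the array to disk and return a small JSON-friendly record."""
    path = directory / f"{name}.npy"
    np.save(path, array)  # binary .npy file; numeric dtypes need no pickling
    return {"path": str(path), "shape": list(array.shape), "dtype": str(array.dtype)}


def load_array(record: dict) -> np.ndarray:
    """Read the array back from the path stored in the record."""
    return np.load(record["path"])


storage_dir = Path(tempfile.mkdtemp())  # stand-in for an S3 or HDFS location
weights = np.arange(12, dtype=np.float64).reshape(3, 4)

record = save_array(weights, storage_dir, "weights")
record_json = json.dumps(record)  # only this small string would go into the notebook
restored = load_array(json.loads(record_json))
```

The notebook only ever carries the small `record` dictionary, so its size stays constant no matter how large the array is.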

How lossy would it be for you to use Pandas' JSON Table Schema + row oriented JSON option?
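For reference, a rough sketch of what that option looks like in pandas, assuming a small dataframe with made-up column names. `orient="table"` writes a Table Schema header followed by row-oriented data.

```python
import io
import json

import pandas as pd

df = pd.DataFrame(
    {
        "feature": ["a", "b", "c"],
        "count": [1, 2, 3],
        "weight": [0.5, 1.5, 2.5],
    }
)

# orient="table" emits a Table Schema description plus row-oriented data.
payload = df.to_json(orient="table")
schema_fields = [field["name"] for field in json.loads(payload)["schema"]["fields"]]

# Reading with the same orient uses the embedded schema when rebuilding columns.
restored = pd.read_json(io.StringIO(payload), orient="table")
```

A plain numpy array would have to be wrapped in a dataframe first, and row-oriented JSON is verbose, so this is probably only attractive for small to medium tables.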

An alternative would be to add a step to pickle/serialize types that can't be stored as JSON in record before dumping them into the notebook.
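One way that step could look, as a sketch only: values JSON can already encode pass through untouched, and everything else is pickled and base64-encoded under a tag. The helper names and the `__pickled__` tag are hypothetical, not papermill API.

```python
import base64
import json
import pickle


def to_json_safe(value):
    """Return value unchanged if JSON can encode it, else a tagged pickle payload."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        blob = base64.b64encode(pickle.dumps(value)).decode("ascii")
        return {"__pickled__": blob}  # hypothetical tag name


def from_json_safe(value):
    """Invert to_json_safe. Only unpickle data you trust."""
    if isinstance(value, dict) and set(value) == {"__pickled__"}:
        return pickle.loads(base64.b64decode(value["__pickled__"]))
    return value


original = {3, 1, 2}  # a set is not JSON-serializable
stored = json.dumps(to_json_safe(original))  # plain JSON text, safe to embed
recovered = from_json_safe(json.loads(stored))
```

The payload is opaque to anything that is not Python, and base64 inflates it by about a third, so this suits small objects better than large arrays.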

As much as possible we want to keep the data types language agnostic (as there are R bindings for papermill as well). I think it would be ok to serialize them, so long as you're the one expecting it on the other side of reading them (where you == you and your collaborators).

from papermill.

rgbkrk commented on May 22, 2024

@charsmith -- any thoughts here?

mpacer commented on May 22, 2024

If you are going the pickle route, it's probably better to use dill.

rgbkrk commented on May 22, 2024

Shameless plug - have you tried cloudpickle?

lukasheinrich commented on May 22, 2024

As mentioned in my PR #56, there is a use case for persisting the data that is recorded. Right now it assumes the data is JSON-able via pandas' to_json(), but perhaps other persistence approaches could be made pluggable.

betatim commented on May 22, 2024

I think pickle (and friends) aren't the right way to go because you'd want to be able to load the data in R (and maybe other languages).

There are docs on how to write pandas dataframes to files but not simple numpy arrays. This is nice because you don't have to use an object store.

How lossy would it be for you to use Pandas' JSON Table Schema + row oriented JSON option?

The use case I have in mind is storing scikit-learn classifiers, which are mostly (very) big numpy arrays. You could turn them into a dataframe first ... having said that, scikit-learn classifiers mean nothing to R, so using a Python-only storage format might be OK.

Maybe the common denominator here is: a method for recording a large "bunch of bytes" which records the path where they were stored in the papermill metadata.
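To make that concrete, here is a rough sketch using only the standard library. Everything in it is an assumption for illustration: the helper names, the metadata keys, the content-hash file naming, and a temporary directory standing in for whatever storage is actually used.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def record_bytes(payload: bytes, store_dir: Path, fmt: str) -> dict:
    """Write payload under a content-derived name and return JSON-friendly metadata."""
    digest = hashlib.sha256(payload).hexdigest()
    path = store_dir / digest
    path.write_bytes(payload)
    return {"path": str(path), "sha256": digest, "size": len(payload), "format": fmt}


def read_bytes(entry: dict) -> bytes:
    """Load the payload and check that it still matches the recorded digest."""
    data = Path(entry["path"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError(f"content at {entry['path']} does not match recorded digest")
    return data


store = Path(tempfile.mkdtemp())  # stand-in for any shared storage location
entry = record_bytes(b"\x00\x01\x02" * 1000, store, fmt="raw")
metadata = json.dumps(entry)  # the only part that would live in the notebook
round_tripped = read_bytes(json.loads(metadata))
```

The `format` field is left as a free-form label, so the reader decides how to decode the bytes; that keeps the mechanism itself independent of Python, R, or any particular serializer.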

rgbkrk commented on May 22, 2024

a method for recording a large "bunch of bytes" which records the path where they were stored in the papermill metadata.

That seems reasonable, even if I don't see what the right way to do that would be.
