Comments (7)
Great question!
"arrays tend to be too big, so store them somewhere else and record the path to it"
That would be my recommendation, and I'd like to make it a simple built-in if it makes sense across users. I think it requires some infrastructure as well as opinion. The stance I'd likely take inside Netflix as a paved path would be to record the array (or pandas dataframe), likely using Arrow to store it to HDFS or S3. The field we store in a papermill record would then be the full path to the stored object (which can be an S3 path or an HDFS path). What object storage would you end up using?
How lossy would it be for you to use Pandas' JSON Table Schema + row oriented JSON option?
An alternative would be to add a step to pickle/serialize types that can't be stored as JSON in record before dumping them into the notebook.
As much as possible we want to keep the data types language agnostic (as there are R bindings for papermill as well). I think it would be ok to serialize them, so long as you're the one expecting it on the other side of reading them (where you == you and your collaborators).
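A minimal sketch of the "serialize what JSON can't hold" alternative, using only the standard library. The helper names `to_json_safe` and `from_json_safe` and the `"__pickled__"` tag are hypothetical, made up for illustration; they are not part of papermill's API. The result is Python-only by construction, which is the trade-off described above.

```python
import base64
import json
import pickle


def to_json_safe(value):
    """Return `value` unchanged if JSON can encode it, else a tagged pickle wrapper.

    Hypothetical helper: the "__pickled__" tag is an assumption of this sketch.
    """
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Pickle bytes are not valid JSON, so base64-encode them into ASCII text.
        payload = base64.b64encode(pickle.dumps(value)).decode("ascii")
        return {"__pickled__": payload}


def from_json_safe(stored):
    """Invert to_json_safe. Only unpickle data from a source you trust."""
    if isinstance(stored, dict) and set(stored) == {"__pickled__"}:
        return pickle.loads(base64.b64decode(stored["__pickled__"]))
    return stored
```

A value such as a `set` would go through the pickle branch, while lists, strings, and numbers pass through untouched, so readers in other languages can still consume the plain JSON values.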
from papermill.
@charsmith -- any thoughts here?
If you are going the pickle route, it's probably better to use dill.
Shameless plug: have you tried cloudpickle?
As mentioned in my PR #56, there is a use case for persisting the data that is recorded. Right now it assumes the data is JSON-able via pandas' to_json(), but perhaps other persistence approaches could be made pluggable.
I think pickle (and friends) aren't the right way to go because you'd want to be able to load the data in R (and maybe other languages).
There are docs on how to write pandas dataframes to files but not simple numpy arrays. This is nice because you don't have to use an object store.
How lossy would it be for you to use Pandas' JSON Table Schema + row oriented JSON option?
The use case I have in mind is storing scikit-learn classifiers, which are mostly (very) big numpy arrays. You could turn them into a dataframe first. Having said that, scikit-learn classifiers mean nothing to R, so using a Python-only storage format might be OK.
Maybe the common denominator here is: a method for recording a large "bunch of bytes" which records the path where they were stored in the papermill metadata.
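One way the "bunch of bytes" idea could look, sketched with the standard library and a local directory standing in for S3 or HDFS. The names `store_blob` and `load_blob` and the fields of the returned entry are assumptions for illustration only; nothing here is an existing papermill function. The bytes themselves could come from any serializer (Arrow, `numpy.save`, pickle), which is left to the caller.

```python
import hashlib
from pathlib import Path


def store_blob(data: bytes, directory: str, name: str) -> dict:
    """Write `data` under `directory` and return a small JSON-friendly entry.

    Hypothetical helper: the entry layout (path, size, sha256) is an assumption.
    """
    target = Path(directory) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return {
        "path": str(target),
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def load_blob(entry: dict) -> bytes:
    """Read the bytes back and check them against the recorded digest."""
    data = Path(entry["path"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError(f"contents of {entry['path']} do not match the recorded sha256")
    return data
```

Only the small entry returned by `store_blob` would end up in the notebook metadata. Because it is plain strings and integers, an R reader could follow the path even when it cannot interpret the bytes.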
a method for recording a large "bunch of bytes" which records the path where they were stored in the papermill metadata.
That seems reasonable, even if I don't see what the right way to do that would be.