Comments (10)
I think being able to use Apache Spark within Metaflow would be extremely useful. When your feature engineering workflow is written in PySpark, translating everything to pandas is kind of a pain, and it's hard to predict how well the result will work on large datasets.
from metaflow.
@tduffy000 We have an in-house dataframe implementation which provides faster primitive operations with a lower memory footprint than Pandas. It is supported both on a local instance and in the cloud. One can use this implementation inside a step or even outside of Metaflow (just like the metaflow.s3 client).
Maybe Metaflow could be combined with Dask, which supports bigger-than-memory dataframes, to solve this issue. I am not sure if or how it would be possible to serialize and restore Dask's big, lazily evaluated dataframes between steps, though.
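One way to sidestep serializing a lazy dataframe is to write its partitions to shared storage in one step and hand only the file paths to the next step. The sketch below is a rough illustration of that hand-off in plain Python, under stated assumptions: `produce_partitions` and `consume_partitions` are hypothetical stand-ins for two steps, CSV files in a temporary directory stand in for something like Parquet on shared storage, and no Metaflow or Dask API is used.

```python
import csv
import tempfile
from pathlib import Path


def produce_partitions(rows, partition_size, out_dir):
    """Hypothetical 'feature engineering' step: write rows in fixed-size
    partitions and return only the file paths (the small artifact)."""
    out_dir = Path(out_dir)
    paths = []
    for start in range(0, len(rows), partition_size):
        path = out_dir / f"part-{start // partition_size:05d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "value"])
            writer.writerows(rows[start:start + partition_size])
        paths.append(str(path))
    return paths


def consume_partitions(paths):
    """Hypothetical 'model' step: stream one partition at a time, so the
    full dataset is never held in memory at once."""
    total = 0.0
    for path in paths:
        with open(path, newline="") as f:
            for record in csv.DictReader(f):
                total += float(record["value"])
    return total


# Illustrative data only: ten rows split into partitions of four.
rows = [(i, i * 0.5) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    partition_paths = produce_partitions(rows, partition_size=4, out_dir=tmp)
    total = consume_partitions(partition_paths)
```

With a real lazy dataframe library, the second step would presumably rebuild the lazy frame from the paths rather than loop over files by hand; the point here is only that the artifact passed between steps stays small.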
@savingoyal any update to release the dataframe implementation?
Adding modin as a distributed drop-in for pandas dfs
Would something like https://vaex.readthedocs.io/en/latest/index.html be a possible solution here?
Another one worth mentioning: pandas on Spark https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
@romain-intel is the idea to support this locally or on an AWS instance?
Wondering if the idea is just to make it more integrated with Apache Spark (via pyspark), or to find a way, like an IterableDataset in PyTorch, to split loading among workers and have the data loaded at model time.
I imagine the difficulty might be in the atomicity of a @step, given that a feature selection & engineering step would be wholly separated from the model step. I know from experience that there are still a lot of pandas fans out there.
Would be curious to hear your thoughts on this.
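The IterableDataset-style idea of splitting loading among workers can be sketched roughly as follows. This is plain Python with no PyTorch, Spark, or Metaflow calls; `shard_paths`, `iter_rows`, and `fake_load` are hypothetical names, and the file names are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def shard_paths(paths: List[str], worker_id: int, num_workers: int) -> List[str]:
    """Assign every num_workers-th file to this worker (round-robin).

    Sorting first makes the assignment deterministic across workers.
    """
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return sorted(paths)[worker_id::num_workers]


def iter_rows(paths: Iterable[str],
              load: Callable[[str], Iterable[Tuple[str, int]]]) -> Iterator[Tuple[str, int]]:
    """Lazily yield rows file by file, so a worker only holds one
    file's worth of data at a time."""
    for path in paths:
        yield from load(path)


def fake_load(path: str) -> List[Tuple[str, int]]:
    """Stand-in loader for illustration: pretends each file has two rows."""
    return [(path, 0), (path, 1)]


# Seven hypothetical partition files split among three workers.
all_paths = [f"part-{i:03d}.parquet" for i in range(7)]
shards = [shard_paths(all_paths, w, 3) for w in range(3)]
rows_worker0 = list(iter_rows(shards[0], fake_load))
```

Each worker sees a disjoint subset of files, and together the shards cover every file exactly once. A real loader would replace `fake_load` with whatever reads the partition format in use.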
Maybe something like a dataflow transferred between steps, as in Bonobo.
Also, here is another example of a software product that uses datapickle and Dask to run dataflows on a cluster in the cloud.
Agreed, the Pandas on Spark reference by @talebzeghmi would be valuable, but you would still need a Spark context. I think being able to declare that your task runs in AWS Glue would potentially allow for either Pandas on Spark or just vanilla PySpark as a step.
@savingoyal any update to release the dataframe implementation?
Still interested! Would appreciate any update, especially if it's "yeah, we're not going to do this in the foreseeable future after all".