Comments (10)

benjaminbluhm commented on May 17, 2024 11

I think the ability to use Apache Spark within Metaflow would be extremely useful. When your feature engineering workflow is written in PySpark, it's kind of a pain to translate everything to pandas, and it's hard to predict how well the result will work on large datasets.

from metaflow.

savingoyal commented on May 17, 2024 10

@tduffy000 We have an in-house dataframe implementation which provides faster primitive operations with a lower memory footprint than pandas. It is supported both on a local instance and in the cloud. One can use this implementation inside a step or even outside of Metaflow (just like the metaflow.s3 client).

leftys commented on May 17, 2024 10

Maybe Metaflow could somehow be combined with Dask, which supports larger-than-memory dataframes, to solve this issue. I am not sure if or how it would be possible to serialize and restore Dask's large, lazily evaluated dataframes between steps, though.
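
One possible way around the serialization concern, sketched with only the standard library: instead of pickling the lazy dataframe itself, a step writes its partitions to shared storage and hands the next step just the list of paths, which is then read lazily. The names `write_partitions` and `iter_rows` are hypothetical, and the CSV files stand in for whatever partitioned format the dataframe library would actually use. Nothing here is Metaflow or Dask API.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def write_partitions(
    rows: Iterable[Dict[str, str]], out_dir: Path, rows_per_file: int
) -> List[str]:
    """Write rows to fixed-size CSV partitions and return their paths.

    The returned list of paths is small and picklable, so it could be kept
    as a step artifact in place of the data itself (an assumption, not a
    documented Metaflow pattern).
    """
    paths: List[str] = []
    batch: List[Dict[str, str]] = []

    def flush() -> None:
        # Write the current batch to its own file, then start a new batch.
        if not batch:
            return
        path = out_dir / f"part-{len(paths):05d}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(batch[0].keys()))
            writer.writeheader()
            writer.writerows(batch)
        paths.append(str(path))
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) >= rows_per_file:
            flush()
    flush()  # write any remaining rows
    return paths


def iter_rows(paths: List[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield rows one partition at a time."""
    for path in paths:
        with open(path, newline="") as fh:
            yield from csv.DictReader(fh)


# "Step 1": materialize partitions and keep only their paths.
tmp_dir = Path(tempfile.mkdtemp())
source = ({"id": str(i), "value": str(i * i)} for i in range(10))
partition_paths = write_partitions(source, tmp_dir, rows_per_file=4)

# "Step 2": rebuild a lazy view from the paths alone.
total = sum(int(row["value"]) for row in iter_rows(partition_paths))
```

The trade-off is that the storage location must be reachable from every step, and the data is no longer versioned together with the other artifacts of the run.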

crypdick commented on May 17, 2024 6

@savingoyal any update to release the dataframe implementation?

Also adding Modin to the list, as a distributed drop-in replacement for pandas DataFrames.

tekumara commented on May 17, 2024 1

Would something like https://vaex.readthedocs.io/en/latest/index.html be a possible solution here?

talebzeghmi commented on May 17, 2024 1

Another option worth mentioning is the pandas API on Spark: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

tduffy000 commented on May 17, 2024

@romain-intel is the idea to support this locally or on an AWS instance?

I'm wondering if the idea is just to integrate more closely with Apache Spark (via PySpark), or to find a way, like an IterableDataset in PyTorch, to split loading among workers and have the data loaded at model time.
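
The second option, splitting loading among workers, can be sketched without any framework. This is only an illustration of round-robin sharding, similar in spirit to how an iterable dataset is usually divided per worker; `shard_for_worker` and the file names are hypothetical, and nothing here is Metaflow or PyTorch API.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(
    items: Iterable[T], worker_id: int, num_workers: int
) -> Iterator[T]:
    """Return an iterator over every num_workers-th item, offset by worker_id."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # islice(start, stop, step) skips ahead lazily without copying the input.
    return islice(items, worker_id, None, num_workers)


# Seven hypothetical partition files split across three workers.
files = [f"part-{i:03d}.parquet" for i in range(7)]
shards = [list(shard_for_worker(files, w, 3)) for w in range(3)]
```

Each worker sees a disjoint subset, and together the shards cover every file exactly once, so no coordination is needed beyond agreeing on `worker_id` and `num_workers`.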

I imagine the difficulty might be in the atomicity of a @step, given that a feature selection and engineering step would be wholly separated from the model step. I know from experience that there are still a lot of pandas fans out there.

Would be curious to hear your thoughts on this.

juarezr commented on May 17, 2024

Maybe something like a dataflow transferred between steps, as in Bonobo.

Also, here is another example of a software product that uses datapickle and Dask to run dataflows on a cluster in the cloud.

jimmycfa commented on May 17, 2024

Agreed, the pandas on Spark reference from @talebzeghmi would be valuable, but you would still need a Spark context. I think being able to declare that your task runs in AWS Glue would potentially allow for either pandas on Spark or just vanilla PySpark as a step.

dsjoerg commented on May 17, 2024

@savingoyal any update to release the dataframe implementation?

Still interested! Would appreciate any update, especially if it's "yeah, we're not going to do this in the foreseeable future after all."
