It's not clear from the documentation

Does Quilt support HDFS big data and Spark (pySpark)? (CLOSED)

quiltdata commented on September 3, 2024

Does Quilt support HDFS big data and Spark (pySpark)?

from quilt.

Comments (8)

kevinemoore commented on September 3, 2024

Thanks for the question! Quilt supports reading packages in pySpark, and structured datasets are read into Spark DataFrames. In Spark, you'll most often want to access a large partitioned table as a Spark DataFrame by accessing its GroupNode and calling _data() (or simply calling the node itself with ()). Quilt doesn't yet support distributed package creation (build) in pySpark. Is that something you would find useful?

Quilt doesn't yet support running a separate registry on HDFS, but that's definitely on the roadmap.

chanansh commented on September 3, 2024

I am not sure I understand the terminology of build and call. I will read your documentation. Essentially, I have ~1 TB of HDFS data that changes from time to time, and I would like to manage its versions. Currently I just use folders named by date and set a string in the code to point to the relevant path. Can Quilt help me?
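
For reference, the date-folder scheme described above can be sketched as follows. This is a minimal illustration only, assuming folders are named in ISO format (YYYY-MM-DD) and live on a local filesystem rather than HDFS; the function name `latest_snapshot` and the folder layout are hypothetical and are not part of Quilt.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_snapshot(root: Path, as_of: Optional[date] = None) -> Path:
    """Return the newest date-named subfolder of root.

    Assumes subfolders are named YYYY-MM-DD (an assumption, not from the thread).
    If as_of is given, folders dated after it are ignored.
    """
    candidates = []
    for child in root.iterdir():
        if not child.is_dir():
            continue
        try:
            snapshot_date = date.fromisoformat(child.name)
        except ValueError:
            continue  # skip folders whose names are not dates
        if as_of is None or snapshot_date <= as_of:
            candidates.append((snapshot_date, child))
    if not candidates:
        raise FileNotFoundError(f"no dated snapshot folders under {root}")
    # Pick the entry with the most recent date.
    return max(candidates, key=lambda pair: pair[0])[1]
```

Passing `as_of` pins the code to an older snapshot instead of editing a hard-coded path string, which is the manual step a versioning tool would replace.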

kevinemoore commented on September 3, 2024

@chanansh, build is a Quilt command that creates a new snapshot of your data (as a new package version). By call I simply meant the Python __call__ method, which is callable by just adding () to a node in the Quilt package.
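
To illustrate the `__call__` point in plain Python: the class below is a hypothetical stand-in for a package node, not Quilt's actual implementation. It only shows why adding () to an object can be equivalent to calling a method such as `_data()`.

```python
class Node:
    """Hypothetical stand-in for a package node (not Quilt's real class)."""

    def __init__(self, payload):
        self._payload = payload

    def _data(self):
        # Return whatever this node wraps.
        return self._payload

    def __call__(self):
        # Defining __call__ makes instances callable: node() runs this method.
        return self._data()


node = Node([1, 2, 3])
# Both spellings return the same object.
same = node() is node._data()
```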

My guess is that Quilt will need a couple of small extensions (which are all on the roadmap) before it will do what you want. Basically, we'll need to add support for using HDFS in a PackageStore. That could be as easy as swapping the normal Python os.path and pathlib calls for hdfs3 when using HDFS and pyspark.
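
One way to picture the swap described above is a store that talks to a small filesystem interface instead of calling os.path directly. The sketch below is an assumption about how such a layer could look, not Quilt's PackageStore code; `FileSystem`, `LocalFileSystem`, and `PackageStoreSketch` are hypothetical names. An HDFS-backed object exposing the same two methods could then be passed in place of the local one; whether hdfs3 matches this shape exactly should be checked against its documentation.

```python
import os
from typing import List, Protocol


class FileSystem(Protocol):
    """Minimal interface the store needs (hypothetical, for illustration)."""

    def exists(self, path: str) -> bool: ...

    def ls(self, path: str) -> List[str]: ...


class LocalFileSystem:
    """Backend built on the normal Python os and os.path calls."""

    def exists(self, path: str) -> bool:
        return os.path.exists(path)

    def ls(self, path: str) -> List[str]:
        return sorted(os.listdir(path))


class PackageStoreSketch:
    """Toy store that never touches os.path itself, only the injected backend."""

    def __init__(self, root: str, fs: FileSystem):
        self.root = root
        self.fs = fs

    def list_objects(self) -> List[str]:
        # Return an empty listing when the store root has not been created yet.
        if not self.fs.exists(self.root):
            return []
        return self.fs.ls(self.root)
```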

What type of data are you storing in HDFS? Parquet files?

chanansh commented on September 3, 2024

Yep. Parquet

kevinemoore commented on September 3, 2024

Great! And, is it a set of partitioned tables? We just added support for Parquet input data, but the code doesn't handle partitioned tables/dataframes yet. I can probably get a prototype of that working this afternoon if you'd like to give it a try.

chanansh commented on September 3, 2024

Partitioned. Thanks anyway.

leferrad commented on September 3, 2024

I'm also interested in loading and versioning data from HDFS through Quilt, so that it can be integrated into a Spark pipeline for distributed data processing. Can someone give me details of this Quilt + HDFS integration? Is it feasible? Thanks!

akarve commented on September 3, 2024

Current support is for S3, and S3 connectors can be used to provide access via Hadoop. The general trend is away from HDFS and towards S3. Feel free to reopen if this is still an issue.
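
As a rough sketch of the S3-connector route from Spark: Hadoop's S3A connector exposes buckets under the `s3a://` scheme, so a Parquet table stored in S3 can be read with `spark.read.parquet`. The helper names, bucket, and prefix below are hypothetical, and the sketch assumes the cluster already has the S3A connector (hadoop-aws) and credentials configured.

```python
def s3a_uri(bucket: str, key_prefix: str) -> str:
    """Build an s3a:// URI for the Hadoop S3A connector."""
    return f"s3a://{bucket}/{key_prefix.strip('/')}"


def read_parquet_from_s3(spark, bucket: str, key_prefix: str):
    """Read a Parquet dataset from S3 using an existing SparkSession.

    `spark` is expected to be a pyspark.sql.SparkSession created elsewhere.
    """
    return spark.read.parquet(s3a_uri(bucket, key_prefix))
```

With a live session this would be called as `read_parquet_from_s3(spark, "my-bucket", "tables/events")`, returning a Spark DataFrame.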
