Comments (13)

martindurant commented on August 18, 2024

I haven't tried this, but you might be able to use merge in the directory above the partitions, passing the relative paths of all of the parquet files, which then builds the metadata file. There is no specific way to read a set of isolated parquet files.
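
A minimal sketch of that approach, assuming a local layout under mydata/ (the directory and file names are hypothetical):

from glob import glob
import fastparquet

# Collect the partition files; glob order is arbitrary, so sort explicitly.
file_list = sorted(glob('mydata/*/part.*.parquet'))

# merge() writes a consolidated _metadata file into the common root, mydata/.
fastparquet.writer.merge(file_list)

# The directory can now be opened as a single dataset via its _metadata.
pf = fastparquet.ParquetFile('mydata')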

martindurant commented on August 18, 2024

If you don't need an index (and it seems you don't, or maybe don't even have a column that would be appropriate), you can use infer_divisions=False, which should skip gathering metadata from all of the files before constructing the graph. In general, though, the size of each partition matters a great deal for performance, and you might want to create your data with larger partitions if you have the memory to spare.
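
For illustration, a sketch with dask.dataframe; infer_divisions is the keyword named above, and newer dask releases expose the same idea under other names, so treat it as version-dependent:

import dask.dataframe as dd

# Skip collecting per-file statistics before building the task graph;
# the resulting frame simply has no known divisions.
df = dd.read_parquet('mydata/', engine='fastparquet', infer_divisions=False)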

mingzhou commented on August 18, 2024

Hi @martindurant. I find that parquet.enable.summary-metadata has been set to false by default since Spark 2, for the two reasons given in SPARK-15719. I hope fastparquet takes those reasons into account. Until fastparquet adds this feature, we can save newly partitioned parquet from Spark by setting parquet.enable.summary-metadata to true. However, what do you mean by merging existing partitioned parquet? Could you share more details? Thanks.

martindurant commented on August 18, 2024

I shall follow your link and consider it. I guess that at read time, it must walk the directory structure and find all parquet-like files before performing any action? That seems like an expensive operation.

I was talking about using the function fastparquet.writer.merge() to create the metadata from the set of parquet files.

mingzhou commented on August 18, 2024

Thanks @martindurant.

martindurant commented on August 18, 2024

Do you think it would be useful to provide a way to open a list-of-parquet-files as if there were a metadata file, as a shortcut to merging? I do not want to have to walk through the directory structure, touching each file to see if it is parquet or not, and that would still leave the ambiguity of ordering. Something like the following would be easy enough to implement:

from glob import glob
import fastparquet

filelist = sorted(glob('mydirectory/*/*.parq'))  # glob order is arbitrary, so sort explicitly
pf = fastparquet.ParquetFile(filelist)

mrocklin commented on August 18, 2024

Is it common to have non-parquet files in a directory that was written as a parquet dataset? How bad is it if we just assume that everything is parquet-like and raise an error when something fails?

Is there a logical merge operation in fastparquet? If so how expensive is this?

martindurant commented on August 18, 2024

There is a merge function which creates a metadata file; the two parts can easily be split.
Each input file needs to be accessed, even if we realise that we don't want the data in some of them, and the directory structure could potentially be deep.
At the moment, the ParquetFile constructor will try the path as given, assuming it is a file, or try adding _metadata, assuming it is a directory (for S3, we have no way to know whether a path is a directory). Trying those and then walking the directory to find all files would be expensive for a remote store; I am very surprised Spark thought this was a good idea.
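
To make that resolution order concrete, a short sketch (the paths are hypothetical):

import fastparquet

# A path that points at a single file is opened directly.
pf_single = fastparquet.ParquetFile('data/part.0.parquet')

# A directory path is assumed to contain a _metadata file.
pf_dataset = fastparquet.ParquetFile('data')  # reads data/_metadata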

On non-parquet files in a parquet directory, I have no idea.

mingzhou commented on August 18, 2024

Yeah, a file list could be a solution, encapsulated inside the ParquetFile constructor. I think performance should be the second consideration, after functionality. Non-parquet files in a parquet directory should be avoided by users, since they make no sense in actual use scenarios. 😄

DigitalPig commented on August 18, 2024

Just want to follow up on this issue. Is it still the case that we need to turn on metadata file writing on the Spark side in order for fastparquet to read those files?

martindurant commented on August 18, 2024

That is the best way, but you can read the files anyway. To read without the metadata, you first need to get the list of paths yourself; fastparquet does not walk the directory structure for you.

martindurant commented on August 18, 2024

At some point, we should perhaps add the ability to also walk the directories and find all files that look parquet-ish.
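
As a sketch of what such a walk might look like from user code today, using s3fs's recursive find (bucket and prefix are hypothetical):

import s3fs

s3 = s3fs.S3FileSystem()

# Recursively list everything under the root, keep files that look
# parquet-ish by extension, and sort, since listing order is not guaranteed.
paths = sorted(p for p in s3.find('my-bucket/dataset')
               if p.endswith(('.parquet', '.parq')))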

DigitalPig commented on August 18, 2024

Thanks for the quick reply. I have a folder generated by Apache Spark, partitioned by date, with 6800 parquet files in total.

I am currently using:

import s3fs
import dask.dataframe as dd

s3 = s3fs.S3FileSystem()
# Gather all partition files and turn the bare keys into s3:// URLs.
filelist = s3.glob('test-bucket/full_dataset_pq/*/*.parquet')
filelist_s3 = ['s3://' + x for x in filelist]
source = dd.read_parquet(filelist_s3)

It takes a very long time to read everything on a Dask cluster (10 workers, 4 TB memory in total). read_parquet seems to run only on the scheduler at the moment, as I did not see any other activity on the cluster in the Dask UI.

Is there a way to speed up this process? Maybe I am using it wrong? Thank you! @martindurant @mrocklin
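
One approach consistent with the advice above, sketched on the assumption that the bucket is writable and all files share a schema: consolidate the footers once with fastparquet's merge, so later reads fetch a single _metadata file instead of touching 6800 footers.

import s3fs
import fastparquet

s3 = s3fs.S3FileSystem()
paths = sorted(s3.glob('test-bucket/full_dataset_pq/*/*.parquet'))

# One-off step: writes test-bucket/full_dataset_pq/_metadata via s3fs.
fastparquet.writer.merge(paths, open_with=s3.open)

Afterwards, dd.read_parquet('s3://test-bucket/full_dataset_pq') should need only that single metadata fetch.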
