Comments (13)

martindurant commented on August 18, 2024

I haven't tried this, but you might be able to use merge in the directory above the partitions, passing the relative paths of all of the parquet files, which then builds the metadata file. There is no specific way to read a set of isolated parquet files.
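
A minimal sketch of that approach, assuming a local layout under mydata/ (the directory and file names are hypothetical):

from glob import glob
import fastparquet

# Collect the partition files; glob order is arbitrary, so sort explicitly.
file_list = sorted(glob('mydata/*/part.*.parquet'))

# merge() writes a consolidated _metadata file into the common root, mydata/.
fastparquet.writer.merge(file_list)

# The directory can now be opened as a single dataset via its _metadata.
pf = fastparquet.ParquetFile('mydata')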

martindurant commented on August 18, 2024

If you don't need an index (and it seems you don't, or maybe don't even have a column that would be appropriate), you can use infer_divisions=False, which should skip gathering metadata from all of the files before constructing the graph. In general, though, the size of each partition matters a great deal for performance, and you might want to create your data with larger partitions if you have the memory to spare.
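
For illustration, a sketch with dask.dataframe; infer_divisions is the keyword named above, and newer dask releases expose the same idea under other names, so treat it as version-dependent:

import dask.dataframe as dd

# Skip collecting per-file statistics before building the task graph;
# the resulting frame simply has no known divisions.
df = dd.read_parquet('mydata/', engine='fastparquet', infer_divisions=False)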

mingzhou commented on August 18, 2024

Hi @martindurant. I find that parquet.enable.summary-metadata has been set to false by default since Spark 2, for the two reasons given in SPARK-15719. I hope fastparquet takes those reasons into account. Until fastparquet adds this feature, we can save newly partitioned parquet from Spark by setting parquet.enable.summary-metadata to true. However, what do you mean by merging existing partitioned parquet? Could you share more details? Thanks.

martindurant commented on August 18, 2024

I shall follow your link and consider it. I guess that at read time, it must walk the directory structure and find all parquet-like files before performing any action? That seems like an expensive operation.

I was talking about using the function fastparquet.writer.merge() to create the metadata from the set of parquet files.

mingzhou commented on August 18, 2024

Thanks @martindurant.

martindurant commented on August 18, 2024

Do you think it would be useful to provide a way to open a list-of-parquet-files as if there were a metadata file, as a shortcut to merging? I do not want to have to walk through the directory structure, touching each file to see if it is parquet or not, and that would still leave the ambiguity of ordering. Something like the following would be easy enough to implement:

from glob import glob
import fastparquet

filelist = sorted(glob('mydirectory/*/*.parq'))  # glob order is arbitrary, so sort explicitly
pf = fastparquet.ParquetFile(filelist)

mrocklin commented on August 18, 2024

Is it common to have non-parquet files in a directory that was written as a parquet dataset? How bad is it if we just assume that everything is parquet-like and raise an error when something fails?

Is there a logical merge operation in fastparquet? If so how expensive is this?

martindurant commented on August 18, 2024

There is a merge function which creates a metadata file; the two parts can easily be split.
Each input file needs to be accessed, even if we realise that we don't want the data in some of them, and the directory structure could potentially be deep.
At the moment, the ParquetFile constructor will try the path as given, assuming it is a file, or try adding _metadata, assuming it is a directory (for S3, we have no way to know whether a path is a directory). Trying those and then walking the directory to find all files would be expensive for a remote store; I am very surprised Spark thought this was a good idea.
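
To make that resolution order concrete, a short sketch (the paths are hypothetical):

import fastparquet

# A path that points at a single file is opened directly.
pf_single = fastparquet.ParquetFile('data/part.0.parquet')

# A directory path is assumed to contain a _metadata file.
pf_dataset = fastparquet.ParquetFile('data')  # reads data/_metadata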

On non-parquet files in a parquet directory, I have no idea.

mingzhou commented on August 18, 2024

Yeah, a file list could be a solution, encapsulated inside the ParquetFile constructor. I think performance should be the second consideration, after functionality. Non-parquet files in a parquet directory should be avoided by users, since they make no sense in actual use scenarios. 😄

DigitalPig commented on August 18, 2024

Just want to follow up on this issue. Is it still the case that we need to turn on metadata file writing on the Spark side in order for fastparquet to read those files?

martindurant commented on August 18, 2024

That is the best way, but you can read the files anyway. To read without the metadata, you first need to get the list of paths yourself; fastparquet does not walk the directory structure for you.

martindurant commented on August 18, 2024

At some point, we should perhaps add the ability to also walk the directories and find all files that look parquet-ish.
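
As a sketch of what such a walk might look like from user code today, using s3fs's recursive find (bucket and prefix are hypothetical):

import s3fs

s3 = s3fs.S3FileSystem()

# Recursively list everything under the root, keep files that look
# parquet-ish by extension, and sort, since listing order is not guaranteed.
paths = sorted(p for p in s3.find('my-bucket/dataset')
               if p.endswith(('.parquet', '.parq')))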

DigitalPig commented on August 18, 2024

Thanks for the quick reply. I have a folder generated by Apache Spark, partitioned by date, with 6800 parquet files in total.

I am currently using:

import s3fs
import dask.dataframe as dd

s3 = s3fs.S3FileSystem()
# Gather all partition files and turn the bare keys into s3:// URLs.
filelist = s3.glob('test-bucket/full_dataset_pq/*/*.parquet')
filelist_s3 = ['s3://' + x for x in filelist]
source = dd.read_parquet(filelist_s3)

It takes a very long time to read everything on a Dask cluster (10 workers, 4 TB memory in total). read_parquet seems to run only on the scheduler at the moment, as I did not see any other activity on the cluster in the Dask UI.

Is there a way to speed up this process? Maybe I am using it wrong? Thank you! @martindurant @mrocklin
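
One approach consistent with the advice above, sketched on the assumption that the bucket is writable and all files share a schema: consolidate the footers once with fastparquet's merge, so later reads fetch a single _metadata file instead of touching 6800 footers.

import s3fs
import fastparquet

s3 = s3fs.S3FileSystem()
paths = sorted(s3.glob('test-bucket/full_dataset_pq/*/*.parquet'))

# One-off step: writes test-bucket/full_dataset_pq/_metadata via s3fs.
fastparquet.writer.merge(paths, open_with=s3.open)

Afterwards, dd.read_parquet('s3://test-bucket/full_dataset_pq') should need only that single metadata fetch.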
