
Comments (15)

martindurant avatar martindurant commented on August 18, 2024

The "path_in_schema" reference in a column chunk should be ids.eid. The name of the schema element is eid, and it is possible to have non-unique names if there is nesting.

Currently, fastparquet does not understand any kind of nested schema.

I have begun some work in this branch https://github.com/martindurant/fastparquet/tree/structured_types to consider at least list and map types (one-level nesting), and also some code to view the schema as the tree it's meant to be. That might be useful for you.
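The dotted path_in_schema convention can be sketched with a small tree walk over a nested schema (a hypothetical illustration of the naming rule, not fastparquet's actual code):

```python
def paths(node, prefix=""):
    """Yield dotted path_in_schema strings for every leaf in a nested schema.

    ``node`` is a dict mapping element names either to a nested dict
    (a group) or to None (a leaf column).
    """
    for name, child in node.items():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(child, dict):
            yield from paths(child, full)
        else:
            yield full

# A schema with two distinct "eid" leaves, distinguished only by their paths:
schema = {"ids": {"eid": None}, "events": {"eid": None}, "url": None}
print(list(paths(schema)))  # ['ids.eid', 'events.eid', 'url']
```

This is why the column chunk must reference the full path `ids.eid`: the bare name `eid` is not unique once nesting is allowed.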

from fastparquet.

j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Are you still working on the structured_types branch? I was trying to use this branch, but it looks like it's very much behind master.


martindurant avatar martindurant commented on August 18, 2024

LIST and MAP types have been implemented in the master branch and recent releases, and I believe that deeper-nested schemas should not cause a problem, so long as you restrict to loading only the top-level ones (this has not been tested).


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Oh I see. I have a requirement to load parquet files with multiple levels of nesting, so I guess fastparquet can't do that yet. Is there a timeline to support this?


martindurant avatar martindurant commented on August 18, 2024

The reason this has not been implemented is that a nested schema does not map well to the tabular layout of pandas dataframes. The best one could do is iteratively build Python dicts and lists out of the levels of nesting, which is a job better handled by more flexible formats such as Thrift or Avro.
In short, implementing this seems (to me) to cost too much effort, when most practical use cases are served without it.
Out of interest, could you share the schema of your data? You should be able to do

print(fastparquet.ParquetFile(...).schema)
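As a hypothetical sketch of the dict-building mentioned above, one could reassemble nested records from flat dotted column names in plain Python (this is an illustration, not part of fastparquet):

```python
def nest(record):
    """Rebuild a nested dict from a flat mapping with dotted keys,
    i.e. the inverse of flattening a nested schema into dotted columns."""
    out = {}
    for key, value in record.items():
        parts = key.split(".")
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

print(nest({"visitor.ip": "1.2.3.4", "url": "http://example.com"}))
# {'visitor': {'ip': '1.2.3.4'}, 'url': 'http://example.com'}
```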


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Here it is: https://gist.github.com/j-bennet/5920397f6673e5abc8d2087d1343bfaa. It represents the data coming from a javascript tracker on the client's page.


martindurant avatar martindurant commented on August 18, 2024

Thanks for providing!
I can imagine being able to read the many optional->optional columns (these would flatten easily), but once it gets to optional->optional->list, list->list and deeper, it would require a lot of work.
Are you able to read the top-level columns like "url", or any of the one-level-deep columns such as "visitor.ip"? I suspect that in fact most of the OPTIONAL labels should be REQUIRED, because it is standard for Spark to label everything as OPTIONAL whether it is nullable or not (Spark keeps its own separate metadata in the footer comments), and so the schema is not actually as deeply nested as it would appear.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant I can see how it would be problematic to map this data into a tabular format. My use case involves loading those events into dataframes and aggregating them over a few time intervals (5 min, 1 day), to create dataframes with totaled metrics (such as page views or engaged time). For this use case, it might be enough if fastparquet flattened the hierarchical dataframe and created columns with dotted names (such as metadata.authors.urls).
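The dotted-name flattening described here is essentially what `pandas.json_normalize` does for nested dicts. A minimal sketch on made-up event records (the field names are assumptions based on the schema discussed above; list-valued fields stay as Python lists rather than being exploded):

```python
import pandas as pd

# Two hypothetical events with the kind of nesting described above.
events = [
    {"url": "http://example.com/a", "visitor": {"ip": "1.2.3.4"},
     "metadata": {"authors": ["alice"]}},
    {"url": "http://example.com/b", "visitor": {"ip": "5.6.7.8"},
     "metadata": {"authors": ["bob"]}},
]

# json_normalize flattens nested dicts into dotted column names;
# list-valued fields are kept as objects in a single column.
df = pd.json_normalize(events)
print(sorted(df.columns))  # ['metadata.authors', 'url', 'visitor.ip']
```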


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant No, reading is just broken. If I try to only read urls, for example:

df = dd.read_parquet(parquet_paths[0], columns=['url'])

I get this:

Traceback (most recent call last):
  File "main_dask.py", line 37, in <module>
    main()
  File "main_dask.py", line 32, in main
    df = dd.read_parquet(parquet_paths[0], columns=['url'])
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/dask/dataframe/io/parquet.py", line 286, in read_parquet
    categories=categories, index=index)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/dask/dataframe/io/parquet.py", line 68, in _read_fastparquet
    minmax = fastparquet.api.sorted_partitioned_columns(pf)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 643, in sorted_partitioned_columns
    s = statistics(pf)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 607, in statistics
    for n in ['min', 'max', 'null_count', 'distinct_count']}
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 607, in <dictcomp>
    for n in ['min', 'max', 'null_count', 'distinct_count']}
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 606, in <dictcomp>
    for col in obj.columns}
KeyError: u'visitor'


martindurant avatar martindurant commented on August 18, 2024

Would you mind trying directly with fastparquet:

fastparquet.ParquetFile(parquet_paths[0]).to_pandas(columns=['url'])

The dask reader makes some additional assumptions on top of fastparquet, so all implementation is always done in the lower-level library first ;)

If you would like to post a small sample of your data (if this is possible, or it can be suitably de-sensitised), I could have a look at flattening non-repeating elements. There would be a pretty high chance of me not having the time to follow through, though.


j-bennet avatar j-bennet commented on August 18, 2024

With fastparquet directly, I can read urls:

In [6]: df = ParquetFile(sample_file_path).to_pandas(columns=['url'])

In [7]: df.head()
Out[7]:
                                                 url
0            http://blog.parsely.com/post/1928/cass/
1     http://blog.parsely.com/post/3886/pykafka-now/
2  http://blog.parsely.com/post/2503/4-steps-to-d...
3  http://blog.parsely.com/post/1630/how-to-find-...
4     http://blog.parsely.com/post/3886/pykafka-now/

I can't read visitor.ip:

In [8]: df = ParquetFile(sample_file_path).to_pandas(columns=['visitor.ip'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-ccaa144d4722> in <module>()
----> 1 df = ParquetFile(sample_file_path).to_pandas(columns=['visitor.ip'])

/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.pyc in to_pandas(self, columns, categories, filters, index, timestamp96)
    380         if index and index not in columns:
    381             columns.append(index)
--> 382         check_column_names(self.columns, columns, categories)
    383         df, views = self.pre_allocate(size, columns, categories, index,
    384                                       timestamp96=timestamp96)

/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/util.pyc in check_column_names(columns, *args)
    131                 raise ValueError("Column name not in list.\n"
    132                                  "Requested %s\n"
--> 133                                  "Allowed %s" % (arg, columns))
    134
    135

ValueError: Column name not in list.
Requested ['visitor.ip']
Allowed [u'url', u'referrer', u'action', u'extra_data', u'user_agent', u'__version__', u'visitor', u'display', u'timestamp_info', u'session', u'slot', u'metadata', u'engaged_time', u'flags', u'timestamp', u'timestamp_5min']

I can read visitor, but all the values end up being None:

In [17]: df = ParquetFile(sample_file_path).to_pandas(columns=['visitor'])

In [18]: df.head()
Out[18]:
  visitor
0    None
1    None
2    None
3    None
4    None

In [19]: df[~df.visitor.isnull()]
Out[19]:
Empty DataFrame
Columns: [visitor]
Index: []

I know they are not None, because I looked at the same data with sc.read.parquet(sample_file_path).


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Thanks for trying to help! I'll generate some sample data for you to look at.


martindurant avatar martindurant commented on August 18, 2024

I dare say there should not really be a "visitor" column at all, since it has no values of its own, only "visitor.*" columns.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant I agree that is how it should work.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Here is some generated sample data:

parquet.zip


