
Comments (15)

martindurant avatar martindurant commented on August 18, 2024

The "path_in_schema" reference in a column chunk should be ids.eid. The name of the schema element is eid, and it is possible to have non-unique names if there is nesting.

Currently, fastparquet does not understand any kind of nested schema.

I have begun some work in this branch https://github.com/martindurant/fastparquet/tree/structured_types to consider at least list and map types (one-level nesting), and also some code to view the schema as the tree it's meant to be. That might be useful for you.
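The dotted path_in_schema convention can be sketched with a small tree walk over a nested schema (a hypothetical illustration of the naming rule, not fastparquet's actual code):

```python
def paths(node, prefix=""):
    """Yield dotted path_in_schema strings for every leaf in a nested schema.

    ``node`` is a dict mapping element names either to a nested dict
    (a group) or to None (a leaf column).
    """
    for name, child in node.items():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(child, dict):
            yield from paths(child, full)
        else:
            yield full

# A schema with two distinct "eid" leaves, distinguished only by their paths:
schema = {"ids": {"eid": None}, "events": {"eid": None}, "url": None}
print(list(paths(schema)))  # ['ids.eid', 'events.eid', 'url']
```

This is why the column chunk must reference the full path `ids.eid`: the bare name `eid` is not unique once nesting is allowed.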

from fastparquet.

j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Are you still working on the structured_types branch? I was trying to use this branch, but it looks like it's very much behind master.


martindurant avatar martindurant commented on August 18, 2024

LIST and MAP types have been implemented in the master branch and recent releases, and I believe that deeper-nested schemas should not cause a problem, so long as you restrict to loading only the top-level ones (this has not been tested).


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Oh I see. I have a requirement to load parquet files with multiple levels of nesting, so I guess fastparquet can't do that yet. Is there a timeline to support this?


martindurant avatar martindurant commented on August 18, 2024

The reason this has not been implemented is that a nested schema does not map well to the tabular layout of pandas dataframes. The best one could do is iteratively build Python dicts and lists out of the levels of nesting, which is a job better handled by more flexible formats such as Thrift or Avro.
In short, implementing this seems (to me) to cost too much effort, when most practical use cases are served without it.
Out of interest, could you share the schema of your data? You should be able to do

print(fastparquet.ParquetFile(...).schema)
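As a hypothetical sketch of the dict-building mentioned above, one could reassemble nested records from flat dotted column names in plain Python (this is an illustration, not part of fastparquet):

```python
def nest(record):
    """Rebuild a nested dict from a flat mapping with dotted keys,
    i.e. the inverse of flattening a nested schema into dotted columns."""
    out = {}
    for key, value in record.items():
        parts = key.split(".")
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

print(nest({"visitor.ip": "1.2.3.4", "url": "http://example.com"}))
# {'visitor': {'ip': '1.2.3.4'}, 'url': 'http://example.com'}
```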


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Here it is: https://gist.github.com/j-bennet/5920397f6673e5abc8d2087d1343bfaa. It represents the data coming from a javascript tracker on the client's page.


martindurant avatar martindurant commented on August 18, 2024

Thanks for providing!
I can imagine being able to read the many optional->optional columns (these would flatten easily), but once it gets to optional->optional->list, list->list and deeper, it would require a lot of work.
Are you able to read the top-level columns like "url", or any of the one-level-deep columns such as "visitor.ip"? I suspect that in fact most of the OPTIONAL labels should be REQUIRED, because it is standard for Spark to label everything as OPTIONAL whether it is nullable or not (Spark keeps its own separate metadata in the footer comments), and so the schema is not actually as deeply nested as it would appear.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant I can see how it would be problematic to map this data into a tabular format. My use case involves loading those events into dataframes and aggregating them over a few time intervals (5 min, 1 day), to create dataframes with totaled metrics (such as page views or engaged time). For this use case, it might be enough if fastparquet flattened the hierarchical dataframe and created columns with dotted names (such as metadata.authors.urls).
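The dotted-name flattening described here is essentially what `pandas.json_normalize` does for nested dicts. A minimal sketch on made-up event records (the field names are assumptions based on the schema discussed above; list-valued fields stay as Python lists rather than being exploded):

```python
import pandas as pd

# Two hypothetical events with the kind of nesting described above.
events = [
    {"url": "http://example.com/a", "visitor": {"ip": "1.2.3.4"},
     "metadata": {"authors": ["alice"]}},
    {"url": "http://example.com/b", "visitor": {"ip": "5.6.7.8"},
     "metadata": {"authors": ["bob"]}},
]

# json_normalize flattens nested dicts into dotted column names;
# list-valued fields are kept as objects in a single column.
df = pd.json_normalize(events)
print(sorted(df.columns))  # ['metadata.authors', 'url', 'visitor.ip']
```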


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant No, reading is just broken. If I try to only read urls, for example:

df = dd.read_parquet(parquet_paths[0], columns=['url'])

I get this:

Traceback (most recent call last):
  File "main_dask.py", line 37, in <module>
    main()
  File "main_dask.py", line 32, in main
    df = dd.read_parquet(parquet_paths[0], columns=['url'])
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/dask/dataframe/io/parquet.py", line 286, in read_parquet
    categories=categories, index=index)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/dask/dataframe/io/parquet.py", line 68, in _read_fastparquet
    minmax = fastparquet.api.sorted_partitioned_columns(pf)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 643, in sorted_partitioned_columns
    s = statistics(pf)
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 607, in statistics
    for n in ['min', 'max', 'null_count', 'distinct_count']}
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 607, in <dictcomp>
    for n in ['min', 'max', 'null_count', 'distinct_count']}
  File "/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.py", line 606, in <dictcomp>
    for col in obj.columns}
KeyError: u'visitor'


martindurant avatar martindurant commented on August 18, 2024

Would you mind trying directly with fastparquet:

fastparquet.ParquetFile(parquet_paths[0]).to_pandas(columns=['url'])

The dask reader makes some additional assumptions on top of fastparquet, so all implementation is always done in the lower-level library first ;)

If you would like to post a small sample of your data (if this is possible, or it can be suitably de-sensitised), I could have a look at flattening non-repeating elements. There would be a pretty high chance of me not having the time to follow through, though.


j-bennet avatar j-bennet commented on August 18, 2024

With fastparquet directly, I can read urls:

In [6]: df = ParquetFile(sample_file_path).to_pandas(columns=['url'])

In [7]: df.head()
Out[7]:
                                                 url
0            http://blog.parsely.com/post/1928/cass/
1     http://blog.parsely.com/post/3886/pykafka-now/
2  http://blog.parsely.com/post/2503/4-steps-to-d...
3  http://blog.parsely.com/post/1630/how-to-find-...
4     http://blog.parsely.com/post/3886/pykafka-now/

I can't read visitor.ip:

In [8]: df = ParquetFile(sample_file_path).to_pandas(columns=['visitor.ip'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-ccaa144d4722> in <module>()
----> 1 df = ParquetFile(sample_file_path).to_pandas(columns=['visitor.ip'])

/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/api.pyc in to_pandas(self, columns, categories, filters, index, timestamp96)
    380         if index and index not in columns:
    381             columns.append(index)
--> 382         check_column_names(self.columns, columns, categories)
    383         df, views = self.pre_allocate(size, columns, categories, index,
    384                                       timestamp96=timestamp96)

/Users/irina/.pyenv/versions/redask/lib/python2.7/site-packages/fastparquet/util.pyc in check_column_names(columns, *args)
    131                 raise ValueError("Column name not in list.\n"
    132                                  "Requested %s\n"
--> 133                                  "Allowed %s" % (arg, columns))
    134
    135

ValueError: Column name not in list.
Requested ['visitor.ip']
Allowed [u'url', u'referrer', u'action', u'extra_data', u'user_agent', u'__version__', u'visitor', u'display', u'timestamp_info', u'session', u'slot', u'metadata', u'engaged_time', u'flags', u'timestamp', u'timestamp_5min']

I can read visitor, but all the values end up being None:

In [17]: df = ParquetFile(sample_file_path).to_pandas(columns=['visitor'])

In [18]: df.head()
Out[18]:
  visitor
0    None
1    None
2    None
3    None
4    None

In [19]: df[~df.visitor.isnull()]
Out[19]:
Empty DataFrame
Columns: [visitor]
Index: []

I know they are not None, because I looked at the same data with sc.read.parquet(sample_file_path).


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Thanks for trying to help! I'll generate some sample data for you to look at.


martindurant avatar martindurant commented on August 18, 2024

I dare say there should not really be a "visitor" column at all, since it has no values of its own, only "visitor.*" columns.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant I agree that is how it should work.


j-bennet avatar j-bennet commented on August 18, 2024

@martindurant Here is some generated sample data:

parquet.zip


