
parquet-python's People

Contributors

arahuja, cjstep, felipensp, generalpiston, jaguarx, jcrobak, liancheng, satreix, sqm2050, turicas


parquet-python's Issues

Parquet-Python in Conda?

Hi,

would you consider adding parquet-python to conda-forge?

Right now it seems to be the only tool out there that can read parquet data as a bytestream without requiring an actual file, which is really helpful for reading parquet files from remote HDFS with the hdfs package.

For me it's the best parquet reader and the only one that works error-free, as I'm having trouble with fastparquet and dask. Having it on Conda would make it easier to install and probably interesting to a larger audience, since there would be no need to additionally use pip.
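
For context, a minimal sketch of the bytestream use case described above (host and path are hypothetical; assumes the hdfs and parquet packages are installed):

    import io

    import parquet
    from hdfs import InsecureClient  # HdfsCLI client

    client = InsecureClient('http://namenode:50070')  # hypothetical NameNode URL

    # Stream the remote file into memory; parquet.reader needs a seekable
    # file-like object, so wrap the bytes in BytesIO.
    with client.read('/data/example.parquet') as remote:  # hypothetical path
        fo = io.BytesIO(remote.read())

    for row in parquet.reader(fo):
        print(row)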

pandas Series support

We should implement pandas Series support, as was done in #13, as an optional API to improve performance for that use case.

python-parquet with Python 3

python-parquet relies on cStringIO, which is gone in Python 3. Is there a plan for porting python-parquet to Python 3?
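
For reference, a common compatibility shim for this kind of port (a sketch, not a claim about how parquet-python was eventually fixed):

    try:
        from cStringIO import StringIO as BytesIO  # Python 2
    except ImportError:
        from io import BytesIO  # Python 3

    buf = BytesIO(b"raw page bytes here")  # behaves the same on both versions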

Can you push a new version to PyPI?

Hi,
Thank you for the great project.

The current version on PyPI doesn't have two already-implemented fixes that I need for a pure-Python stack (#69 / #70 and #64).

Are you able to push a new version to PyPI?

KeyError: 15

Traceback (most recent call last):
  File "pd.py", line 5, in <module>
    for row in parquet.reader(fo, columns=['channel']):
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 459, in reader
    page_header = _read_page_header(file_obj)
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 92, in _read_page_header
    page_header.read(pin)
  File "/usr/lib64/python2.7/site-packages/thriftpy/thrift.py", line 150, in read
    iprot.read_struct(self)
  File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 250, in read_struct
    fname, ftype, fid = self.read_field_begin()
  File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 181, in read_field_begin
    return None, self._get_ttype(type), fid
  File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 134, in _get_ttype
    return TTYPES[byte & 0x0f]
KeyError: 15

Unable to read parquet file containing multiple MAP_KEY_VALUE columns

There seems to be a mishandling of MAP columns: those columns contain groups named key_value with elements key and value, and those are treated as already seen, which leads to the following error:

File "/usr/local/lib/python3.9/site-packages/parquet/schema.py", line 20, in init
assert len(self.schema_elements) == len(self.schema_elements_by_name)

'snappy' is not defined

After installing parquet 1.3.1 on Windows 10, the following code was run:

with open("myparquetfile.parquet","rb") as fo:
for row in parquet.reader(fo, columns=['tconst', 'nconst']):
print(",".join([str(r) for r in row])

This errored, stack trace below:

Traceback (most recent call last):
  File "c:\working\test.py", line 25, in <module>
    for row in parquet.reader(fo, columns=['tconst', 'nconst']):
  File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 464, in reader
    values = read_data_page(file_obj, schema_helper, page_header, cmd,
  File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 283, in read_data_page
    raw_bytes = _read_page(file_obj, page_header, column_metadata)
  File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 229, in _read_page
    raw_bytes = snappy.decompress(bytes_from_file)
NameError: name 'snappy' is not defined

Thx
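
For illustration, this NameError is what you get when a snappy-compressed page is hit but the snappy module was never imported. A hypothetical sketch of how that situation arises and of a guarded pattern that avoids a hard dependency (not the library's actual code):

    try:
        import snappy  # provided by the python-snappy package
    except ImportError:
        snappy = None

    def decompress_page(raw_bytes, codec):
        # Hypothetical helper: fail with a clear message instead of a NameError
        # when the page codec needs snappy but it is not installed.
        if codec == 'SNAPPY':
            if snappy is None:
                raise RuntimeError("page is snappy-compressed; install python-snappy")
            return snappy.decompress(raw_bytes)
        return raw_bytes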

Parquet file with a null value causes a traceback

When I try to read data from a parquet file which contains a null value for some key, I get the error below.

Traceback (most recent call last):
  File "tt.py", line 12, in <module>
    for r in parquet.DictReader(fo):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 420, in DictReader
    for row in reader(fo, columns):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 467, in reader
    dict_items)
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 380, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))
  File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 227, in read_rle_bit_packed_hybrid
    res += read_bitpacked(io_obj, header, width, debug_logging)
  File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 146, in read_bitpacked
    b = raw_bytes[current_byte]
IndexError: list index out of range

"pip install parquet" doesn't have the fix for Unicode issue

The following stack trace is due to a bug that, according to a different issue reported last year, was fixed in October 2016, but that fix doesn't seem to be present in the version you get when you do a "pip install parquet". Since that issue was closed after the fix was posted, I am opening a new issue here so it gets attention or I can get some responses:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    for line in parquet.DictReader(fo, columns=['pixel', 'querystring']):
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 375, in DictReader
    footer = _read_footer(fo)
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 71, in _read_footer
    footer_size = _get_footer_size(fo)
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 64, in _get_footer_size
    tup = struct.unpack("<i", fo.read(4))
TypeError: Struct() argument 1 must be string, not unicode

Upload to PyPI

I'm creating the Parquet rows plugin and need this library available on PyPI to do it properly.

As it was not available there and the last commit here is almost 1 year old, I've uploaded it: https://pypi.python.org/pypi/parquet

@jcrobak, if you want access on PyPI, please tell me your username and I can give it to you.

If the library is not maintained anymore, I will probably create a fork and maintain the PyPI version from that fork.

Example Code

It would be great to have some example code to show how to use this interesting library. Trying to tease it out from the test cases has proven to be unsuccessful so far.
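
In the meantime, a minimal usage sketch based on the API that appears in the tracebacks elsewhere on this page (the file and column names are hypothetical):

    import parquet

    with open('example.parquet', 'rb') as fo:
        # reader() yields rows as lists in column order
        for row in parquet.reader(fo, columns=['col_a', 'col_b']):
            print(row)

        # DictReader() yields rows as dicts keyed by column name:
        # for row in parquet.DictReader(fo):
        #     print(row)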

TypeError: Struct() argument 1 must be string, not unicode

I've installed this parquet module from PyPI and tried to unpack a file.
I got this error:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    for line in parquet.DictReader(fo,columns=['pixel','querystring']):
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 375, in DictReader
    footer = _read_footer(fo)
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 71, in _read_footer
    footer_size = _get_footer_size(fo)
  File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 64, in _get_footer_size
    tup = struct.unpack("<i", fo.read(4))
TypeError: Struct() argument 1 must be string, not unicode

It looks like the constant string "<i" is somehow interpreted as a unicode string. I deleted this string and retyped it with vim, and the error went away.
But after that Python warns me about another unicode string, "<{}i".
I think all such strings should be replaced.

$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)
$ python --version
Python 2.7.5
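
A sketch of the kind of replacement being suggested, assuming the root cause is `from __future__ import unicode_literals` turning the format constants into unicode on Python 2: make the struct format strings explicit byte strings.

    import struct

    def get_footer_size(fo):
        """Read the 4-byte little-endian footer size using a byte-string format."""
        fo.seek(-8, 2)
        return struct.unpack(b"<i", fo.read(4))[0]  # b"<i" instead of u"<i"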

Structures cause error

Hi Joe and others
I am trying to use your module to read a parquet file, and I ran into a problem here:
schema.py, line 21:
assert len(self.schema_elements) == len(self.schema_elements_by_name)
Apparently the __init__ method assumes that my structure has no duplicate field names. The module works correctly if you comment out this line, though.
Originally these files were used by Hive, and here is the list of fields in the table:

fileid bigint,
version bigint,
ip_geocode struct<countrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string>,
timestamp bigint,
region bigint,
pixel bigint,
uuid bigint,
uuid_exists boolean,
referingurl string,
useragent string,
ip string,
querystring string,
campaignsinfo array<struct<campaign_id:bigint,media_types:array<bigint>,advertiser_id:bigint,funnel_step_id:bigint,funnel_step_value:bigint,track_conversion:boolean>>,
opted_out boolean,
event_id string

Here is the list of fields that the module sees:

name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None
name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1

Apparently there are two elements each named 'array_element' and 'bag'; I assume these fields just come with structures.

support for optional items

Currently, parquet-python isn't making use of definition levels, which are needed to support optional items in a struct (they encode nulls).

error: bad char in struct format in read_plain_int96

"<qi" * count produces something like <qi<qi<qi

The docs indicate that the first character of the format string can be used to indicate the byte order, size and alignment.

I've tested potential fixes for this but I suspect the results may be incorrect because the values aren't what I expected. I tried calling an old version of read_plain_int96 multiple times, but that didn't produce values I expected either.

Does anybody have a good test case for this or know what the format ought to be?

Traceback (most recent call last):
  File "/usr/lib64/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/usr/lib64/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/usr/lib64/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "transformParquet.py", line 1, in <module>
    import parquet
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 379, in DictReader
    for row in reader(fo, columns):
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 433, in reader
    dict_items = read_dictionary_page(fo, ph, cmd)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 359, in read_dictionary_page
    page_header.dictionary_page_header.num_values)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 88, in read_plain
    return conv(fo, count)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 46, in read_plain_int96
    items = struct.unpack("<qi" * count, fo.read(12) * count)
error: bad char in struct format
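
One possible shape for the fix, hedged in the same spirit as the question above: put the byte-order character once at the front of the format and read all 12 * count bytes, rather than repeating "<qi" and re-reading the same 12 bytes.

    import struct

    def read_plain_int96(fo, count):
        """Read `count` INT96 values as (low 64 bits, high 32 bits) pairs."""
        items = struct.unpack("<" + "qi" * count, fo.read(12 * count))
        # How to combine the two halves into a single value is the open question
        # raised above, so just return the raw pairs here.
        return list(zip(items[0::2], items[1::2]))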

Reading gzip compressed parquet files

Hello!
How can I read a compressed parquet file ("filename.gz.parquet")?
I tried reading it straight and it's giving me an assertion error:

    with open('part-0001.gz.parquet') as fo:
        for r in parquet.DictReader(fo):
            print r

Traceback (most recent call last):
  File , line 2, in
  File "build/bdist.macosx-10.11-x86_64/egg/parquet/__init__.py", line 356, in DictReader
  File "build/bdist.macosx-10.11-x86_64/egg/parquet/__init__.py", line 371, in reader
  File "build/bdist.macosx-10.11-x86_64/egg/parquet/schema.py", line 12, in __init__
AssertionError

Thanks!

M

Writing parquet files

Hi,
We need to be able to write python dicts to parquet. What are the chances that you'll have time to work on this? I.e. a writer class.

My team is totally new to parquet so we have a lot to learn. We did see #13, which claims to add writer functionality, but that PR is out of sync and tries to solve a couple of other things at the same time.

Would appreciate your thoughts on this project's near future.

cc @adngdb

bit_width of 0 causes a crash

In read_data_page() of parquet/__init__.py, when the bit_width is zero, we need to implement what this class provides:
https://github.com/apache/parquet-mr/blob/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/values/rle/ZeroIntegerValuesReader.java

I have a fix for this, but I need to test it some more:

        if bit_width == 0:
            vals += [dictionary[0] for i in range(0,daph.num_values)]
        else:
            # TODO jcrobak -- not sure that this loop is needed?
            while total_seen < daph.num_values:
                values = encoding.read_rle_bit_packed_hybrid(
                    dict_values_io_obj, bit_width, len(dict_values_bytes))
                if len(values) + total_seen > daph.num_values:
                    values = values[0: daph.num_values - total_seen]
                vals += [dictionary[v] for v in values]
                total_seen += len(values)

The problem happens on a very small parquet file with, in my case, just 4 rows.
The problem column represents a user_id, and all rows' user_ids are the same. The user id bytes are stored in the dictionary section, and from what I could gather from the Java code, one is supposed to replicate the single dictionary value daph.num_values times.

Negative Seek Offset in Python 3.5

I've used your library successfully in Python 2.7, but when I tried it on Python 3.5, I ran into a problem. Evidently, Python 3.5 no longer supports negative seek offsets in this situation. When I try to open a simple file, I get the following exception:

/home/czdn2lq/local/anaconda3/lib/python3.5/site-packages/parquet/__init__.py in _get_footer_size(fo)
     61 def _get_footer_size(fo):
     62     "Readers the footer size in bytes, which is serialized as little endian"
---> 63     fo.seek(-8, 2)
     64     tup = struct.unpack("<i", fo.read(4))
     65     return tup[0]

UnsupportedOperation: can't do nonzero end-relative seeks

For now, I'll go back to using Python 2.7 so it's not a huge deal. But I'd really like to start migrating to Python 3 (it's only been out for... 8 years, now). :-)

Thanks,
Bill
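
For what it's worth, CPython 3 still allows end-relative seeks on files opened in binary mode; the UnsupportedOperation above is what a text-mode file object raises. A minimal check (hypothetical file name):

    import parquet

    with open('example.parquet', 'rb') as fo:  # 'rb' keeps fo.seek(-8, 2) legal on Python 3
        for row in parquet.reader(fo):
            print(row)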

ci-build is broken

flake8 and pylint are failing:

pylint:

C:137,37: Do not use `len(SEQUENCE)` as condition value (len-as-condition)

flake8:

AttributeError: module 'pydocstyle' has no attribute 'ConventionChecker'

TypeError: cannot convert 'int' object to bytes

I had an error when trying to open a parquet file:

Traceback (most recent call last):
  File "/local/workplace/lib/python3.6/site-packages/lambda_handlers/parquet_test.py", line 57, in lambda_handler
    for row in parquet.reader(fin):
  File "/local/workplace/lib/python3.6/site-packages/parquet/__init__.py", line 470, in reader
    dict_items)
  File "/local/workplace/lib/python3.6/site-packages/parquet/__init__.py", line 340, in read_data_page
    if schema_element.converted_type is not None else read_values
  File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 68, in convert_column
    return [Decimal(intbig(unscaled)) * scale_factor for unscaled in data]
  File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 68, in <listcomp>
    return [Decimal(intbig(unscaled)) * scale_factor for unscaled in data]
  File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 42, in intbig
    return int.from_bytes(data, 'big', signed=True)
TypeError: cannot convert 'int' object to bytes
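
A defensive sketch of the helper in the last frame (hypothetical, not the project's actual fix): tolerate unscaled values that the plain decoder already returned as Python ints.

    def intbig(data):
        """Convert a big-endian byte string to a signed int; pass ints through."""
        if isinstance(data, int):  # already an int, e.g. when the physical type is INT32/INT64
            return data
        return int.from_bytes(data, 'big', signed=True)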

issue reading byte array with precision 10 and scale 2

I have the following parquet schema:

field4: BINARY UNCOMPRESSED DO:0 FPO:170 SZ:58/58/1.00 VC:1 ENC:PLAIN,BIT_PACKED ST:[min: 32505002.09, max: 32505002.09, num_nulls: 0]

json:

{"field4":"32505002.09"}

However, if I try to read it I get the following value:

325050020.90

I have more examples:

parquet -> 62753276.08
parquet-python -> 627532760.80

parquet -> 57768428.82
parquet-python -> 577684288.20

parquet -> 32505002.09
parquet-python -> 325050020.90

Is that normal behavior?

Thanks!
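
For reference, a DECIMAL column is expected to decode as unscaled_value * 10^-scale, so a result that is exactly ten times too large looks like the declared scale of 2 being applied as 1. A hypothetical sketch of the expected conversion (not the library's code):

    from decimal import Decimal

    def decode_decimal(raw_bytes, scale):
        """Decode a DECIMAL stored as a big-endian byte array."""
        unscaled = int.from_bytes(raw_bytes, 'big', signed=True)
        return Decimal(unscaled).scaleb(-scale)

    # With scale=2, an unscaled value of 3250500209 decodes to Decimal('32505002.09').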

Hive-partitioned parquet files are broken

Due to this line (I think!): https://github.com/dask/fastparquet/blob/master/fastparquet/core.py#L347

The following code:

import fastparquet
import pandas as pd


fastparquet.write('test.parquet', pd.DataFrame({
    'literal': ['40+2', '1e-10', '"5"', "2018-10-09", "2018-10-10"],
    'idx': [1, 2, 3, 4, 5]
}), partition_on=['literal'], file_scheme='hive')

fastparquet.ParquetFile('test.parquet').to_pandas()

produces the following output

(screenshot of the resulting DataFrame omitted)

I would make a PR to fix this, but I can't really fathom what the intention was here. Do you need fastparquet to parse certain partition values as literals for some reason, or can I just remove the function call?

Infinite loop for Impala-generated file

Hi there,

Was wondering what condition would cause an infinite loop in this while-loop block: https://github.com/jcrobak/parquet-python/blob/master/parquet/__init__.py#L354-L360

Using the following file, which we generated from Impala: https://www.dropbox.com/s/kah986gqjt7mrnr/movies.0.parquet. At some point, while reading bytes 65278 -> 112466, it gets stuck in an endless loop because the values stop updating. However, we've been able to read smaller Impala-generated files, so I'm not sure if this is a limitation related to file size (the file is 100MB+ but there are only 5 columns of data).

Any insight would be hugely appreciated, thanks!

Jenny

Consider major rewrite

@martindurant has put together a shiny new implementation that improves performance, adds interop with dataframe libraries, and adds write support. See dask#3.

The major changes are new interfaces and dependencies on several new packages (numpy, pandas, numba, dask). I'd love feedback from folks using parquet-python on how invasive those changes would be...especially given the historic problems installing some of those libraries.

Please let me know what you think. Some folks that have contributed and may have an opinion include @SergeNov @turicas @spaztic1215 but anyone is welcome to chime in!

parquet.ParquetFormatException: Unsupported encoding: RLE_DICTIONARY

> echo -e 'hi' | parquet-fromcsv  --input-file /dev/stdin  --schema <(  echo 'message schema { OPTIONAL BYTE_ARRAY key (STRING); }' ) --output-file test.parquet
> base64 -w 0 < test.pq 
UEFSMRUEFRoVGkwVBBUAEgAAAwAAAGtleQIAAABoaRUAFRIVEiwVBBUQFQYVBhxYA2tleRgCaGkAAAACAAAABAEBAwIVDBk1AAYQGRgDa2V5FQAWBBaAARaAASY2JgAcWANrZXkYAmhpAAAZEQIZGAJoaRkYA2tleRUAGRYAABkcFj4VShYAAAAVAhksSAxhcnJvd19zY2hlbWEVAgAVDCUCGANrZXklAEwcAAAAFgQZHBkcJogBHBUMGTUABhAZGANrZXkVABYEFoABFoABJj4mCBxYA2tleRgCaGkAABb+ARUUFtYBFSgAFoABFgQmCBaAARQAABkcGAxBUlJPVzpzY2hlbWEYnAEvLy8vLzJ3QUFBQVFBQUFBQUFBS0FBd0FDZ0FKQUFRQUNnQUFBQkFBQUFBQUFRUUFDQUFJQUFBQUJBQUlBQUFBQkFBQUFBRUFBQUFVQUFBQUVBQVVBQkFBRGdBUEFBUUFBQUFJQUJBQUFBQVlBQUFBREFBQUFBQUFBUVVRQUFBQUFBQUFBQVFBQkFBRUFBQUFBd0FBQUd0bGVRQT0AGBlwYXJxdWV0LXJzIHZlcnNpb24gNDYuMC4wADoBAABQQVIx
> python3 -m parquet test.pq
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/tmp/pq2/parquet-python/parquet/__main__.py", line 63, in <module>
    main()
  File "/tmp/pq2/parquet-python/parquet/__main__.py", line 59, in main
    parquet.dump(args.file, args)
  File "/tmp/pq2/parquet-python/parquet/__init__.py", line 526, in dump
    return _dump(file_obj, options=options, out=out)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/pq2/parquet-python/parquet/__init__.py", line 506, in _dump
    for row in DictReader(file_obj, options.col):
  File "/tmp/pq2/parquet-python/parquet/__init__.py", line 415, in DictReader
    for row in reader(file_obj, columns):
  File "/tmp/pq2/parquet-python/parquet/__init__.py", line 464, in reader
    values = read_data_page(file_obj, schema_helper, page_header, cmd,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/pq2/parquet-python/parquet/__init__.py", line 376, in read_data_page
    raise ParquetFormatException("Unsupported encoding: {}".format(
parquet.ParquetFormatException: Unsupported encoding: RLE_DICTIONARY

ValueError ordinal must be >= 1

I'm trying to use parquet.reader(file_obj), but when I do so on my parquet file I get this error:

    for row in parquet.reader(fo):
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/__init__.py", line 472, in reader
    dict_items = _read_dictionary_page(file_obj, schema_helper, page_header, cmd)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/__init__.py", line 395, in _read_dictionary_page
    return convert_column(values, schema_element) if schema_element.converted_type is not None else values
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/converted_types.py", line 68, in convert_column
    return [datetime.date.fromordinal(d) for d in data]
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/converted_types.py", line 68, in <listcomp>
    return [datetime.date.fromordinal(d) for d in data]

What can I do?
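
For context, the Parquet DATE converted type stores days since the Unix epoch (1970-01-01), whereas datetime.date.fromordinal counts days from year 1 and rejects values below 1, which matches the error above. A hedged sketch of the expected conversion (not the library's current code):

    import datetime

    EPOCH = datetime.date(1970, 1, 1)

    def decode_date(days_since_epoch):
        """Convert a Parquet DATE value (days since 1970-01-01) to a datetime.date."""
        return EPOCH + datetime.timedelta(days=days_since_epoch)

    # decode_date(0) == datetime.date(1970, 1, 1)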

Two different errors when reading two different files

I'm using parquet on Windows 10 and I have two different parquet files for testing, one is snappy-compressed, one is not compressed.

Simple test code for reading:

with open(filename,'r') as f:
    for row in parquet.reader(f):
        print row

The uncompressed file throws this error:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
	for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
	dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 275, in read_data_page
	raw_bytes = _read_page(fo, page_header, column_metadata)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 244, in _read_page
	page_header.uncompressed_page_size)

AssertionError: found 87 raw bytes (expected 367)

Reading the compressed file like that gives:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
	for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
	footer = _read_footer(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 71, in _read_footer
	footer_size = _get_footer_size(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 64, in _get_footer_size
	tup = struct.unpack("<i", fo.read(4))

error: unpack requires a string argument of length 4

I can open both files with fastparquet 0.0.5 just fine so there's nothing wrong with the files.

What am I doing wrong?
Do I have to explicitly uncompress the data with snappy, or does parquet do that by itself?
Can you in general provide some more documentation on the basic usage?
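
One thing worth checking, offered as an assumption since the snippet above opens the file in text mode: on Windows, open(filename, 'r') translates line endings (and, in Python 2, can stop early at a Ctrl-Z byte), which corrupts binary data and can produce exactly this kind of size/unpack mismatch. A minimal variant:

    import parquet

    filename = 'example.parquet'  # whichever of the two test files

    with open(filename, 'rb') as f:  # binary mode; plain 'r' mangles the bytes on Windows
        for row in parquet.reader(f):
            print(row)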

import parquet crashes Python (ipython + 3.6)

In [1]: import parquet
Python(76650,0x7fffb31593c0) malloc: *** error for object 0x1048ea580: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
[1]    76650 abort      ipython3

IPython + Python 2.7 works fine.

Mandatory snappy installation

Some use cases don't involve Snappy, and indeed it seems the code can handle scenarios where python-snappy is not installed. However, the parquet-python package does require python-snappy explicitly (thereby also possibly forcing installation of system dependencies required by python-snappy, such as snappy itself).

Is it OK if I make a PR to turn python-snappy into an optional requirement in setup.py (i.e., it would not get installed by default but only when pip is invoked like so: pip install parquet-python[snappy])?
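
For illustration, the packaging change being proposed could look roughly like this (a sketch of a setup.py fragment, not the project's actual file; the install_requires list is hypothetical):

    from setuptools import setup

    setup(
        name='parquet',
        # ...
        install_requires=['thriftpy'],       # python-snappy no longer unconditional
        extras_require={
            'snappy': ['python-snappy'],     # enables: pip install parquet[snappy]
        },
    )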
