jcrobak / parquet-python
Python implementation of the Parquet columnar file format.
License: Apache License 2.0
Hi,
would you consider adding parquet-python to conda-forge?
Right now it seems to be the only tool out there that can read Parquet data as a bytestream without requiring a file, which is really helpful for reading Parquet files from remote HDFS with the hdfs package.
For me it's the best Parquet reader and the only one that works error-free, as I'm having trouble with fastparquet and dask. Having it on conda-forge would make it easier to install, and it would probably interest a larger audience since there would be no need to additionally use pip.
thriftpy is not maintained anymore; thriftpy2, a pure-Python implementation, is available as a replacement.
The following instances should use tostring on Python 2 and tobytes on Python 3:
test/test_encoding.py
114: encoded_bitstring = array.array('B', raw_data_in).tostring()
134: 'B', [0b00000101, 0b00111001, 0b01110111]).tostring()
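A version-agnostic sketch of the same call (array.tostring() was deprecated in Python 3 and removed in 3.9; tobytes() is the replacement):

```python
import array
import sys

raw_data_in = [0b00000101, 0b00111001, 0b01110111]  # sample values from the tests
arr = array.array('B', raw_data_in)

# tobytes() is the Python 3 spelling; tostring() is the Python 2 one
if sys.version_info[0] >= 3:
    encoded_bitstring = arr.tobytes()
else:
    encoded_bitstring = arr.tostring()
```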
We should implement pandas Series support, as was done in #13, as an optional API to improve performance for that use case.
python-parquet relies on cStringIO, which is gone in Python 3. Is there a plan for porting python-parquet to Python 3?
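One common porting pattern, sketched here under the assumption that only an in-memory bytes buffer is needed, is to fall back to io.BytesIO (this is not the project's actual code):

```python
try:
    from cStringIO import StringIO as BytesIO  # Python 2
except ImportError:
    from io import BytesIO  # Python 3: cStringIO is gone

buf = BytesIO(b"PAR1")  # byte-oriented in-memory file on both versions
```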
I'm testing this script and getting the error: TypeError: Struct() argument 1 must be string, not unicode
with open("nation.dict.parquet") as fo:
    for row in parquet.DictReader(fo):
        print str(row)
Traceback (most recent call last):
File "pd.py", line 5, in <module>
for row in parquet.reader(fo,columns=['channel']):
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 459, in reader
page_header = _read_page_header(file_obj)
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 92, in _read_page_header
page_header.read(pin)
File "/usr/lib64/python2.7/site-packages/thriftpy/thrift.py", line 150, in read
iprot.read_struct(self)
File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 250, in read_struct
fname, ftype, fid = self.read_field_begin()
File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 181, in read_field_begin
return None, self._get_ttype(type), fid
File "/usr/lib64/python2.7/site-packages/thriftpy/protocol/compact.py", line 134, in _get_ttype
return TTYPES[byte & 0x0f]
KeyError: 15
There seems to be a mishandling of MAP columns since those columns contain groups named key_value with elements key and value, and those are considered already seen which leads to the following error:
File "/usr/local/lib/python3.9/site-packages/parquet/schema.py", line 20, in __init__
assert len(self.schema_elements) == len(self.schema_elements_by_name)
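One possible direction for a fix, sketched here with simplified inputs (not the library's actual schema helper): key elements by their full path rather than their bare name, since paths stay unique even though every MAP column repeats the names key_value, key and value:

```python
def index_by_path(elements):
    """Index schema elements by their full dotted path instead of the bare
    name, so repeated names like 'key_value', 'key' and 'value' do not
    collide.  `elements` is a list of (path_tuple, element) pairs -- a
    simplification of the real schema tree for illustration."""
    by_path = {}
    for path, element in elements:
        # full paths are unique even when leaf names repeat across columns
        assert path not in by_path
        by_path[path] = element
    return by_path
```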
After installing parquet-1.3.1 on Windows 10, the following code was run:
with open("myparquetfile.parquet", "rb") as fo:
    for row in parquet.reader(fo, columns=['tconst', 'nconst']):
        print(",".join([str(r) for r in row]))
This errored, stack trace below:
Traceback (most recent call last):
File "c:\working\test.py", line 25, in <module>
for row in parquet.reader(fo, columns=['tconst', 'nconst']):
File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 464, in reader
values = read_data_page(file_obj, schema_helper, page_header, cmd,
File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 283, in read_data_page
raw_bytes = _read_page(file_obj, page_header, column_metadata)
File "C:\Python\Python38\lib\site-packages\parquet\__init__.py", line 229, in _read_page
raw_bytes = snappy.decompress(bytes_from_file)
NameError: name 'snappy' is not defined
Thx
When I try to read data from a parquet file which contains null value for some key, I got below error.
Traceback (most recent call last):
File "tt.py", line 12, in <module>
for r in parquet.DictReader(fo):
File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 420, in DictReader
for row in reader(fo, columns):
File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 467, in reader
dict_items)
File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 380, in read_data_page
dict_values_io_obj, bit_width, len(dict_values_bytes))
File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 227, in read_rle_bit_packed_hybrid
res += read_bitpacked(io_obj, header, width, debug_logging)
File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 146, in read_bitpacked
b = raw_bytes[current_byte]
IndexError: list index out of range
The following stack trace is due to a bug that, according to a different issue thread from last year, was fixed in October 2016, but the fix doesn't seem to be present in the version you get from "pip install parquet". Since that issue was closed after the fix was posted, I am opening a new issue here so it gets attention:
Traceback (most recent call last):
File "test.py", line 7, in <module>
for line in parquet.DictReader(fo,columns=['pixel','querystring']):
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 375, in DictReader
footer = _read_footer(fo)
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 71, in _read_footer
footer_size = _get_footer_size(fo)
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 64, in _get_footer_size
tup = struct.unpack("<i", fo.read(4))
TypeError: Struct() argument 1 must be string, not unicode
I'm creating the Parquet rows plugin and need this library available on PyPI to do it properly.
As it was not available there and the last commit here is almost 1 year old, I've uploaded it: https://pypi.python.org/pypi/parquet
@jcrobak, if you want access on PyPI, please tell me your username and I can give it to you.
If the library is not maintained anymore, then probably I'm going to create a fork of it and maintain the version on PyPI with the fork.
It would be great to have some example code to show how to use this interesting library. Trying to tease it out from the test cases has proven to be unsuccessful so far.
I've installed this parquet module from PyPI and tried to unpack a file.
Got this error
Traceback (most recent call last):
File "test.py", line 7, in <module>
for line in parquet.DictReader(fo,columns=['pixel','querystring']):
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 375, in DictReader
footer = _read_footer(fo)
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 71, in _read_footer
footer_size = _get_footer_size(fo)
File "/usr/lib/python2.7/site-packages/parquet/__init__.py", line 64, in _get_footer_size
tup = struct.unpack("<i", fo.read(4))
TypeError: Struct() argument 1 must be string, not unicode
It looks like the constant string "<i" is somehow interpreted as a unicode string. I deleted the string and retyped it with vim, and the error went away.
But after that Python warned me about another unicode string, "<{}i".
I think all such strings should be replaced.
$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)
$ python --version
Python 2.7.5
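On some older Python 2.7 builds (such as the 2.7.5 shipped with CentOS 7), struct raises this TypeError when given a unicode format string, e.g. one pulled in via from __future__ import unicode_literals. The common workaround, sketched here, is to force the format to the native str type; on Python 3, str() is a no-op:

```python
import struct

# force the format to native str; unicode formats break struct.unpack
# on some Python 2 builds, and str("<i") is harmless on Python 3
fmt = str("<i")
footer_size, = struct.unpack(fmt, b"\x10\x00\x00\x00")  # little-endian int32
```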
Hi Joe and others
I am trying to use your module to read a parquet file, and I ran into a problem here:
schema.py, line 21:
assert len(self.schema_elements) == len(self.schema_elements_by_name)
Apparently the __init__ method assumes that field names are unique, but my structure has multiple fields with the same name. The module works correctly if you comment out this line, though.
Originally these files were used by Hive, and here is the list of fields in the table:
fileid bigint,
version bigint,
ip_geocode struct<countrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string>,
timestamp bigint,
region bigint,
pixel bigint,
uuid bigint,
uuid_exists boolean,
referingurl string,
useragent string,
ip string,
querystring string,
campaignsinfo array<struct<campaign_id:bigint,media_types:array<bigint>,advertiser_id:bigint,funnel_step_id:bigint,funnel_step_value:bigint,track_conversion:boolean>>,
opted_out boolean,
event_id string
Here is the list of fields that the module sees:
name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None
name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
Apparently there are two elements named 'array_element' and two named 'bag'; I assume these fields just come with nested structures.
Currently, parquet-python isn't making use of definition levels, which are needed to support optional items in a struct (they encode nulls).
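A minimal sketch of what using definition levels would look like for a flat optional column (hypothetical helper, not the library's API): a value's definition level equals the maximum level when the value is present, and is lower when the value is null:

```python
def expand_with_nulls(values, definition_levels, max_definition_level):
    """Re-insert nulls into a stream of non-null values using each record's
    definition level (flat optional column only; nesting needs more levels)."""
    it = iter(values)
    return [next(it) if level == max_definition_level else None
            for level in definition_levels]
```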
pip install parquet
fails on Python 3.11 because thriftpy2 won't build on Python 3.11.
"<qi" * count
produces something like <qi<qi<qi
The docs indicate that the first character of the format string can be used to indicate the byte order, size and alignment.
I've tested potential fixes for this but I suspect the results may be incorrect because the values aren't what I expected. I tried calling an old version of read_plain_int96 multiple times, but that didn't produce values I expected either.
Does anybody have a good test case for this or know what the format ought to be?
Traceback (most recent call last):
File "/usr/lib64/python2.7/pdb.py", line 1314, in main
pdb._runscript(mainpyfile)
File "/usr/lib64/python2.7/pdb.py", line 1233, in _runscript
self.run(statement)
File "/usr/lib64/python2.7/bdb.py", line 400, in run
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "transformParquet.py", line 1, in <module>
import parquet
File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 379, in DictReader
for row in reader(fo, columns):
File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 433, in reader
dict_items = read_dictionary_page(fo, ph, cmd)
File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 359, in read_dictionary_page
page_header.dictionary_page_header.num_values)
File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 88, in read_plain
return conv(fo, count)
File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 46, in read_plain_int96
items = struct.unpack("<qi" * count, fo.read(12) * count)
error: bad char in struct format
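A likely shape for the fix, sketched here: struct honors a byte-order character only as the first character of a format, so build one '<' followed by repeated type codes (and note the original also reads fo.read(12) * count, which repeats the same 12 bytes instead of reading 12 * count fresh ones). Whether 'qi' is the right decoding for INT96 is exactly the open question above:

```python
import struct

count = 3
# one byte-order char up front, then the type codes repeated per item;
# "<qi" * count would scatter '<' mid-format, which struct rejects
fmt = "<" + "qi" * count
size = struct.calcsize(fmt)  # 12 bytes per item: 8 for 'q' plus 4 for 'i'

# decoding a single 12-byte item
low, high = struct.unpack("<qi", b"\x01" + b"\x00" * 7 + b"\x02\x00\x00\x00")
```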
Hello!
How can I read a compressed parquet file ("filename.gz.parquet")?
I tried reading it straight and it's giving me an assertion error:
with open('part-0001.gz.parquet') as fo:
    for r in parquet.DictReader(fo):
        print r
Traceback (most recent call last):
File , line 2, in
File "build/bdist.macosx-10.11-x86_64/egg/parquet/__init__.py", line 356, in DictReader
File "build/bdist.macosx-10.11-x86_64/egg/parquet/__init__.py", line 371, in reader
File "build/bdist.macosx-10.11-x86_64/egg/parquet/schema.py", line 12, in __init__
AssertionError
Thanks!
M
Can this support protobuf?
Hi,
We need to be able to write Python dicts to Parquet. What are the chances that you'll have time to work on this, i.e. a writer class?
My team is totally new to parquet so we have a lot to learn. We did see #13 which claims to have a writer functionality but that PR is out-of-sync and tries to solve a couple of other things at the same time.
Would appreciate your thoughts on this project's near future.
cc @adngdb
In read_data_page() of parquet/__init__.py,
when the bit_width is zero, we need to implement what this class provides:
https://github.com/apache/parquet-mr/blob/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/values/rle/ZeroIntegerValuesReader.java
I have a fix for this, but I need to test this some more:
if bit_width == 0:
    vals += [dictionary[0] for i in range(0, daph.num_values)]
else:
    # TODO jcrobak -- not sure that this loop is needed?
    while total_seen < daph.num_values:
        values = encoding.read_rle_bit_packed_hybrid(
            dict_values_io_obj, bit_width, len(dict_values_bytes))
        if len(values) + total_seen > daph.num_values:
            values = values[0: daph.num_values - total_seen]
        vals += [dictionary[v] for v in values]
        total_seen += len(values)
The problem happens on a very small parquet file with, in my case, just 4 rows.
The problem column represents a user_id, and all row user_ids are the same. The user_id bytes are stored in the dictionary section, and from what I could gather from the Java code, one is supposed to replicate the single dictionary value daph.num_values times.
I've used your library successfully in Python 2.7, but when I tried it on Python 3.5, I ran in to a problem. Evidently, Python 3.5 no longer supports negative seek offsets. When I try to open a simple file, I get the following exception:
/home/czdn2lq/local/anaconda3/lib/python3.5/site-packages/parquet/__init__.py in _get_footer_size(fo)
61 def _get_footer_size(fo):
62 "Readers the footer size in bytes, which is serialized as little endian"
---> 63 fo.seek(-8, 2)
64 tup = struct.unpack("<i", fo.read(4))
65 return tup[0]
UnsupportedOperation: can't do nonzero end-relative seeks
For now, I'll go back to using Python 2.7 so it's not a huge deal. But I'd really like to start migrating to Python 3 (it's only been out for... 8 years, now). :-)
Thanks,
Bill
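For what it's worth, Python 3 rejects end-relative seeks only on text-mode streams; binary streams still allow them, so opening the file with "rb" is the usual fix. A minimal sketch against an in-memory stream (the byte layout is a fake file tail, for illustration only):

```python
import io
import struct

# fake tail of a parquet file: ...<4-byte footer size><'PAR1' magic>
stream = io.BytesIO(b"\x00" * 32 + struct.pack("<i", 42) + b"PAR1")

stream.seek(-8, 2)  # end-relative seek: fine on binary streams in Python 3
footer_size, = struct.unpack("<i", stream.read(4))
```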
flake8 and pylint are failing:
pylint:
C:137,37: Do not use `len(SEQUENCE)` as condition value (len-as-condition)
flake8:
AttributeError: module 'pydocstyle' has no attribute 'ConventionChecker'
I had an error when trying to open a parquet file:
Traceback (most recent call last):
File "/local/workplace/lib/python3.6/site-packages/lambda_handlers/parquet_test.py", line 57, in lambda_handler
for row in parquet.reader(fin):
File "/local/workplace/lib/python3.6/site-packages/parquet/__init__.py", line 470, in reader
dict_items)
File "/local/workplace/lib/python3.6/site-packages/parquet/__init__.py", line 340, in read_data_page
if schema_element.converted_type is not None else read_values
File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 68, in convert_column
return [Decimal(intbig(unscaled)) * scale_factor for unscaled in data]
File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 68, in <listcomp>
return [Decimal(intbig(unscaled)) * scale_factor for unscaled in data]
File "/local/workplace/lib/python3.6/site-packages/parquet/converted_types.py", line 42, in intbig
return int.from_bytes(data, 'big', signed=True)
TypeError: cannot convert 'int' object to bytes
I have the following parquet schema:
field4: BINARY UNCOMPRESSED DO:0 FPO:170 SZ:58/58/1.00 VC:1 ENC:PLAIN,BIT_PACKED ST:[min: 32505002.09, max: 32505002.09, num_nulls: 0]
json:
{"field4":"32505002.09"}
However, if I try to read it I get the following value:
325050020.90
I have more examples:
parquet -> 62753276.08
parquet-python -> 627532760.80
parquet -> 57768428.82
parquet-python -> 577684288.20
parquet -> 32505002.09
parquet-python -> 325050020.90
Is this expected behavior?
Thanks!
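The reported values are exactly a factor of 10 too large, which points at the DECIMAL scale being applied incorrectly rather than expected behavior. A hedged sketch of what the conversion should do per the Parquet DECIMAL definition (value = unscaled integer x 10^-scale; this is an illustration, not the library's actual converter):

```python
from decimal import Decimal

def decode_decimal(unscaled_bytes, scale):
    """Decode a Parquet DECIMAL stored as a big-endian two's-complement
    unscaled integer, shifting the decimal point left by `scale` digits."""
    unscaled = int.from_bytes(unscaled_bytes, "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)
```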
Due to this line (I think!): https://github.com/dask/fastparquet/blob/master/fastparquet/core.py#L347
The following code:
import fastparquet
import pandas as pd
fastparquet.write('test.parquet', pd.DataFrame({
    'literal': ['40+2', '1e-10', '"5"', "2018-10-09", "2018-10-10"],
    'idx': [1, 2, 3, 4, 5]
}), partition_on=['literal'], file_scheme='hive')
fastparquet.ParquetFile('test.parquet').to_pandas()
produces incorrect output: the partition values come back parsed as numbers and dates instead of the original strings.
I would make a PR to fix this but I can't really fathom what the intention was here. Do you need fastparquet to parse certain partition values as literals for some reason or can I just remove the function call?
Hi there,
Was wondering what condition would cause an infinite loop in this while-loop block: https://github.com/jcrobak/parquet-python/blob/master/parquet/__init__.py#L354-L360
We're using the following file, which we generated from Impala: https://www.dropbox.com/s/kah986gqjt7mrnr/movies.0.parquet . At some point, while reading bytes 65278 -> 112466, it gets stuck in an endless loop because the values stop updating. However, we've been able to read smaller Impala-generated files, so I'm not sure if this is a limitation related to file size (the file is 100MB+ but there are only 5 columns of data).
Any insight would be hugely appreciated, thanks!
Jenny
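Without the file it's hard to say definitively, but that loop keeps calling the decoder while total_seen < num_values; if a read ever returns zero values (e.g. on a page the decoder doesn't fully understand), total_seen stops advancing and the loop spins forever. A defensive sketch with hypothetical helper names, not the library's code:

```python
def drain_values(read_chunk, num_values):
    """Collect num_values items from repeated read_chunk() calls, failing
    fast instead of looping forever when the decoder makes no progress."""
    out = []
    while len(out) < num_values:
        chunk = read_chunk()
        if not chunk:  # no progress: the original loop would spin here
            raise ValueError("decoder returned no values before num_values was reached")
        out.extend(chunk[:num_values - len(out)])  # trim any overshoot
    return out
```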
@martindurant has put together a shiny new implementation that improves performance, adds interop with dataframes libraries, and adds write support. See dask#3
The major changes are new interfaces and dependencies on several new packages (numpy, pandas, numba, dask). I'd love feedback from folks using parquet-python on how invasive those changes would be...especially given the historic problems installing some of those libraries.
Please let me know what you think. Some folks that have contributed and may have an opinion include @SergeNov @turicas @spaztic1215 but anyone is welcome to chime in!
> echo -e 'hi' | parquet-fromcsv --input-file /dev/stdin --schema <( echo 'message schema { OPTIONAL BYTE_ARRAY key (STRING); }' ) --output-file test.parquet
> base64 -w 0 < test.pq
UEFSMRUEFRoVGkwVBBUAEgAAAwAAAGtleQIAAABoaRUAFRIVEiwVBBUQFQYVBhxYA2tleRgCaGkAAAACAAAABAEBAwIVDBk1AAYQGRgDa2V5FQAWBBaAARaAASY2JgAcWANrZXkYAmhpAAAZEQIZGAJoaRkYA2tleRUAGRYAABkcFj4VShYAAAAVAhksSAxhcnJvd19zY2hlbWEVAgAVDCUCGANrZXklAEwcAAAAFgQZHBkcJogBHBUMGTUABhAZGANrZXkVABYEFoABFoABJj4mCBxYA2tleRgCaGkAABb+ARUUFtYBFSgAFoABFgQmCBaAARQAABkcGAxBUlJPVzpzY2hlbWEYnAEvLy8vLzJ3QUFBQVFBQUFBQUFBS0FBd0FDZ0FKQUFRQUNnQUFBQkFBQUFBQUFRUUFDQUFJQUFBQUJBQUlBQUFBQkFBQUFBRUFBQUFVQUFBQUVBQVVBQkFBRGdBUEFBUUFBQUFJQUJBQUFBQVlBQUFBREFBQUFBQUFBUVVRQUFBQUFBQUFBQVFBQkFBRUFBQUFBd0FBQUd0bGVRQT0AGBlwYXJxdWV0LXJzIHZlcnNpb24gNDYuMC4wADoBAABQQVIx
> python3 -m parquet test.pq
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/tmp/pq2/parquet-python/parquet/__main__.py", line 63, in <module>
main()
File "/tmp/pq2/parquet-python/parquet/__main__.py", line 59, in main
parquet.dump(args.file, args)
File "/tmp/pq2/parquet-python/parquet/__init__.py", line 526, in dump
return _dump(file_obj, options=options, out=out)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pq2/parquet-python/parquet/__init__.py", line 506, in _dump
for row in DictReader(file_obj, options.col):
File "/tmp/pq2/parquet-python/parquet/__init__.py", line 415, in DictReader
for row in reader(file_obj, columns):
File "/tmp/pq2/parquet-python/parquet/__init__.py", line 464, in reader
values = read_data_page(file_obj, schema_helper, page_header, cmd,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pq2/parquet-python/parquet/__init__.py", line 376, in read_data_page
raise ParquetFormatException("Unsupported encoding: {}".format(
parquet.ParquetFormatException: Unsupported encoding: RLE_DICTIONARY
I'm trying to use parquet.reader(file_obj), but when I run it on my parquet file I get this error:
for row in parquet.reader(fo):
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/__init__.py", line 472, in reader
dict_items = _read_dictionary_page(file_obj, schema_helper, page_header, cmd)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/__init__.py", line 395, in _read_dictionary_page
return convert_column(values, schema_element) if schema_element.converted_type is not None else values
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/converted_types.py", line 68, in convert_column
return [datetime.date.fromordinal(d) for d in data]
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/parquet/converted_types.py", line 68, in <listcomp>
return [datetime.date.fromordinal(d) for d in data]
What can I do?
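For context, the Parquet DATE converted type stores days since the Unix epoch (1970-01-01), while datetime.date.fromordinal expects days since 0001-01-01, so fromordinal can raise or silently produce wrong dates on valid data. A sketch of the epoch-based conversion (an illustration, not the library's actual converter):

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)  # Parquet DATE values count days from here

def parquet_date(days_since_epoch):
    """Convert a Parquet DATE value (days since 1970-01-01) to a date."""
    return EPOCH + datetime.timedelta(days=days_since_epoch)
```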
I'm using parquet on Windows 10 and I have two different parquet files for testing, one is snappy-compressed, one is not compressed.
Simple test code for reading:
with open(filename, 'r') as f:
    for row in parquet.reader(f):
        print row
The uncompressed file throws this error:
File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
for row in parquet.reader(f):
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
dict_items)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 275, in read_data_page
raw_bytes = _read_page(fo, page_header, column_metadata)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 244, in _read_page
page_header.uncompressed_page_size)
AssertionError: found 87 raw bytes (expected 367)
Reading the compressed file like that gives:
File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
for row in parquet.reader(f):
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
footer = _read_footer(fo)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 71, in _read_footer
footer_size = _get_footer_size(fo)
File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 64, in _get_footer_size
tup = struct.unpack("<i", fo.read(4))
error: unpack requires a string argument of length 4
I can open both files with fastparquet 0.0.5 just fine so there's nothing wrong with the files.
What am I doing wrong?
Do I have to explicitly uncompress the data with snappy, or does parquet do that by itself?
Can you in general provide some more documentation on the basic usage?
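Both errors are consistent with the file being opened in text mode: with 'r' on Windows, newline bytes are translated during reads, so page byte counts and struct.unpack inputs come out wrong. You should not need to decompress Snappy yourself; the reader handles that when python-snappy is installed. A small sketch of the text-vs-binary difference, using a throwaway file:

```python
import os
import struct
import tempfile

# write raw bytes that include b'\r\n', which text mode on Windows would mangle
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\r\n" + struct.pack("<i", 367))

with open(path, "rb") as f:  # binary mode: exact bytes back, no translation
    prefix = f.read(2)
    size, = struct.unpack("<i", f.read(4))
os.remove(path)
```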
Some use cases don't involve Snappy, and indeed it seems the code can handle scenarios where python-snappy is not installed. However, the parquet-python package requires python-snappy explicitly (thereby also possibly forcing installation of system dependencies required by python-snappy, such as snappy itself).
Is it OK if I made a PR to make python-snappy an optional requirement in setup.py (i.e. it would not get installed by default, but only when pip is used like so: pip install parquet-python[snappy])?
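A sketch of what that could look like (illustrative fragment with assumed dependency names, not the project's actual setup.py):

```python
# illustrative setup.py fragment: python-snappy moves from install_requires
# to extras_require, so plain installs skip it and
# `pip install parquet-python[snappy]` opts in.
SETUP_KWARGS = dict(
    name="parquet",
    install_requires=["thriftpy2"],  # assumption: the remaining hard deps
    extras_require={
        "snappy": ["python-snappy"],
    },
)
# in the real setup.py this would be passed to setuptools.setup(**SETUP_KWARGS)
```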