apache / parquet-format

Apache Parquet Format

Home Page: https://parquet.apache.org/

License: Apache License 2.0

Languages: Thrift 63.71%, Python 23.21%, Shell 11.40%, Makefile 1.67%
Topics: parquet, apache, parquet-format

parquet-format's Issues

Add a special ColumnOrder for testing

PR #46 introduced ColumnOrder with the limitation that a reader should ignore stats for a column if the corresponding ColumnOrder in FileMetaData contains an unknown value. In order to test this logic, it would be helpful to have a special value InvalidOrder or UnsupportedOrder that would never be supported by a reader. I assume this may be helpful to test other implementations, too.
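
A minimal sketch of the reader-side rule such a value would exercise (the names KNOWN_ORDERS and UNSUPPORTED_ORDER are illustrative assumptions, not the Thrift definitions):

# Hypothetical reader-side check: statistics are only trusted when the file
# declares a column order this reader understands. A testing-only value such
# as "UNSUPPORTED_ORDER" would always fall into the ignore branch.
KNOWN_ORDERS = {"TYPE_DEFINED_ORDER"}

def usable_statistics(column_order, statistics):
    if column_order not in KNOWN_ORDERS:
        return None          # unknown order: min/max must be ignored
    return statistics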

Reporter: Lars Volker / @lekv
Assignee: Lars Volker / @lekv

Note: This issue was originally created as PARQUET-974. Please see the migration documentation for further details.

Store `dictionary entries` of parquet columns that will be used for joins

It would be great if Parquet would store dictionary entries for columns marked to be used for joins.

When a column is used for a join (it could be a surrogate key or a natural key), the value of the column itself is actually not so important.

So we could join directly on dictionary entries instead of values and save CPU cycles (no need to decompress, etc.), as sketched below.

Inspired by Oracle In-memory columnar storage improvements in 12.2
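
An illustrative sketch of the idea (plain Python, not a Parquet API): if both sides keep the join column as dictionary codes over the same dictionary, the equi-join compares small integers and never decodes the values.

# Illustration only: join on dictionary codes instead of decoded values.
dictionary = ["cust_001", "cust_002", "cust_003"]       # shared dictionary entries
fact_codes = [0, 2, 2, 1]                               # fact-side join column, stored as codes
dim_rows   = {0: "Alice", 1: "Bob", 2: "Carol"}         # dim side, keyed by the same codes

joined = [(dictionary[c], dim_rows[c]) for c in fact_codes if c in dim_rows]
print(joined)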

Reporter: Ruslan Dautkhanov / @Tagar

Note: This issue was originally created as PARQUET-966. Please see the migration documentation for further details.

Upgrade snappy-java to 1.1.1.6

Upgrade snappy-java to 1.1.1.6 (the latest version), since 1.0.5 is no longer maintained in https://github.com/xerial/snappy-java, and 1.1.1.6 supports broader platforms including PowerPC, IBM-AIX 6.4, SunOS, etc. It also has a better native code loading mechanism (allowing snappy-java to be used from multiple class loaders).

The compression format is compatible between 1.0.5 and 1.1.1.6. The 1.1.1.x versions add framing format support, but Parquet currently does not use the framing format, so I think this upgrade causes no data format incompatibility.

Reporter: Taro L. Saito / @xerial
Assignee: Taro L. Saito / @xerial

Note: This issue was originally created as PARQUET-133. Please see the migration documentation for further details.

Format: Add a flag when min/max are truncated

PARQUET-372 drops page and column chunk stats when values are larger than 4k to avoid storing very large values in page headers and the file footer. An alternative approach is to truncate the values, which would still allow filtering on page stats. The problem with truncating values is that the value in the stats may not be the true min or max, so engines that use these values as the result of aggregations like min(col) would return incorrect data. We should consider adding metadata that records that a value has been truncated, so truncated values can still be used for filtering without being mistaken for exact results.
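
A sketch of how such a flag could be consumed (field names like is_min_exact are assumptions for illustration, not the spec): truncated bounds stay safe for page skipping as long as the writer only widens them, but they must not be returned as min(col)/max(col).

# Assumes a truncated min is still <= the true min and a truncated max is
# still >= the true max, i.e. truncation only widens the range.
def can_skip_page(literal, stats):
    # Pruning stays correct with widened bounds: a matching value cannot
    # exist outside [min, max].
    return literal < stats["min"] or literal > stats["max"]

def min_for_aggregate(stats):
    # Only an exact value may be used to answer min(col) directly.
    return stats["min"] if stats.get("is_min_exact", False) else None

stats = {"min": b"app", "max": b"bana", "is_min_exact": False}    # truncated bounds
print(can_skip_page(b"cherry", stats), min_for_aggregate(stats))  # True None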

Reporter: Ryan Blue / @rdblue

Related issues:

Note: This issue was originally created as PARQUET-411. Please see the migration documentation for further details.

add logical type timestamp with timezone (per SQL)

timestamp with timezone (per SQL): timestamps are adjusted to UTC and stored as integers.
The metadata is in the logical types PR; see the discussion here: #51 (comment)
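
A stdlib-only sketch of the storage model described above (microsecond precision is an assumption for the example):

# A zoned timestamp is adjusted to UTC and then stored as an integer.
from datetime import datetime, timedelta, timezone

local = datetime(2017, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-8)))   # 09:30 at UTC-8
utc = local.astimezone(timezone.utc)                                        # adjust to UTC
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
stored = (utc - epoch) // timedelta(microseconds=1)                         # int64 column value
print(stored)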

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-906. Please see the migration documentation for further details.

Min-max should be computed based on logical type

The min/max stats are currently underspecified: it is not clear in all cases from the spec what the expected ordering is.

There are some related issues, like PARQUET-686 to fix specific problems, but there seems to be a general assumption that the min/max should be defined based on the primitive type instead of the logical type.

However, this makes the stats nearly useless for some logical types. E.g. consider a DECIMAL encoded into a (variable-length) BINARY. The min/max of the underlying binary type is based on the lexical order of the byte string, but that does not correspond to any reasonable ordering of the decimal values: e.g. 16 (0x1 0x0) will be ordered between 1 (0x1) and 2 (0x2). This makes min/max filtering a lot less effective and would force query engines using Parquet to implement workarounds (e.g. custom comparators) to produce correct results.
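
The problem is easy to reproduce with plain byte strings. Here 256 is used as a concrete two-byte unscaled value (0x01 0x00) so the arithmetic is exact:

# Lexicographic byte order places the two-byte value between the one-byte
# values 1 and 2, even though it is numerically far larger.
one = bytes([0x01])          # decimal 1
two = bytes([0x02])          # decimal 2
big = bytes([0x01, 0x00])    # unscaled value 256 in big-endian two's complement

print(sorted([one, two, big]))                  # [b'\x01', b'\x01\x00', b'\x02']
print(min(one, two, big), max(one, two, big))   # b'\x01' b'\x02' -- 256 falls "inside" the min/max range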

Reporter: Tim Armstrong / @timarmstrong

Related issues:

Note: This issue was originally created as PARQUET-839. Please see the migration documentation for further details.

ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

I get "scala.ScalaReflectionException: is not a term" when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF

Has anyone else encountered this problem?

I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3

Here is my thrift IDL:

namespace scala com.junk
namespace java com.junk

struct Junk {
    10: i64 junkID,
    20: string junkString
}

from a spark-shell:

val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, "junk3") )
val junksRDD = sc.parallelize(junks)
junksRDD.toDF

Exception thrown:

scala.ScalaReflectionException: <none> is not a term
	at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
	at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
	at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
	at $iwC$$iwC$$iwC.<init>(<console>:40)
	at $iwC$$iwC.<init>(<console>:42)
	at $iwC.<init>(<console>:44)
	at <init>(<console>:46)
	at .<init>(<console>:50)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Reporter: Tim Chan / @zzztimbo

Related issues:

Note: This issue was originally created as PARQUET-293. Please see the migration documentation for further details.

Add microsecond time and timestamp annotations

When the date/time type annotations were added, we decided not to add precisions smaller than milliseconds because there wasn't a clear requirement. I think that the requirement is for nanosecond precision. The SQL spec requires at least microsecond. Some databases support nanosecond, including SQL engines on Hadoop like Phoenix. Hive and Impala currently support nanosecond times using an int96, but intend to move to microsecond precision with this spec.

I propose adding the following type annotations:

  • TIME_MICROS: annotates an int64 (8 bytes), represents the number of microseconds from midnight.
  • TIMESTAMP_MICROS: annotates an int64 (8 bytes), represents the number of microseconds from the unix epoch.
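
A stdlib-only sketch of the values the two proposed annotations would store:

from datetime import datetime, time, timedelta, timezone

# TIMESTAMP_MICROS: int64 microseconds from the unix epoch.
ts = datetime(2015, 6, 1, 12, 34, 56, 789000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
timestamp_micros = (ts - epoch) // timedelta(microseconds=1)

# TIME_MICROS: int64 microseconds from midnight.
t = time(12, 34, 56, 789000)
time_micros = ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

print(timestamp_micros, time_micros)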

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

PRs and other links:

Note: This issue was originally created as PARQUET-200. Please see the migration documentation for further details.

[Format] HALF precision FLOAT Logical type

Reporter: Julien Le Dem / @julienledem
Assignee: Anja Boskovic / @anjakefala

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-758. Please see the migration documentation for further details.

META-INF for slf4j should not be in parquet-format jar

$ jar tf parquet-format-2.2.0-rc1.jar  | grep org\\.slf
META-INF/maven/org.slf4j/
META-INF/maven/org.slf4j/slf4j-api/
META-INF/maven/org.slf4j/slf4j-api/pom.xml
META-INF/maven/org.slf4j/slf4j-api/pom.properties

It is not clear to me why these are here. I suspect they should not be.

Reporter: koert kuipers
Assignee: Ryan Blue / @rdblue

PRs and other links:

Note: This issue was originally created as PARQUET-178. Please see the migration documentation for further details.

Add index pages to the format to support efficient page skipping

When a Parquet file is sorted, we can define an index consisting of the boundary values for the pages of the columns it is sorted on, as well as the offsets and lengths of those pages in the file.
The goal is to optimize lookup and range-scan queries by using this index to read only the pages containing data matching the filter.
We'd require the pages to be aligned across columns.

[~marcelk] will add a link to the Google doc to discuss the spec.
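
A sketch of the kind of lookup such an index enables (the dict layout is an assumption for illustration, not the eventual spec):

# Each sorted column's pages record boundary values plus offset/length; a
# point or range query reads only the pages whose range can contain matches.
page_index = [
    {"min": 0,   "max": 99,  "offset": 4,      "length": 8192},
    {"min": 100, "max": 199, "offset": 8196,   "length": 8011},
    {"min": 200, "max": 349, "offset": 16207,  "length": 7944},
]

def pages_for_range(lo, hi, index):
    # With sorted data this could be a binary search; a scan keeps the sketch short.
    return [p for p in index if p["max"] >= lo and p["min"] <= hi]

print(pages_for_range(150, 150, page_index))   # only the second page needs to be read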

Reporter: Julien Le Dem / @julienledem
Assignee: Marcel Kinard

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-922. Please see the migration documentation for further details.

Incorrect delta-encoding example

The minimum and the number of bits are incorrect in delta encoding Example 2 in Encodings.md.

In the example (Example 2), for the values

7, 5, 3, 1, 2, 3, 4, 5

the deltas would be

-2, -2, -2, 1, 1, 1, 1

The minimum is -2, so the relative deltas are:

0, 0, 0, 3, 3, 3, 3

The encoded data is

header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)

block 0 (minimum delta), 2 (bitwidth), 000000111111b (0,0,0,3,3,3 packed on 2 bits)

The minimum is -2 and the relative deltas are 0, 0, 0, 3, 3, 3, 3. So, this should be corrected as below:

block -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
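
The corrected numbers can be checked mechanically:

# Recomputes the deltas, minimum delta, relative deltas and bit width for the
# corrected example above.
values = [7, 5, 3, 1, 2, 3, 4, 5]

deltas = [b - a for a, b in zip(values, values[1:])]   # [-2, -2, -2, 1, 1, 1, 1]
min_delta = min(deltas)                                # -2
relative = [d - min_delta for d in deltas]             # [0, 0, 0, 3, 3, 3, 3]
bit_width = max(relative).bit_length()                 # 2

print(deltas, min_delta, relative, bit_width)
# 7 relative deltas * 2 bits = 14 packed bits, i.e. 00000011111111b.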

Reporter: choi woo cheol

Note: This issue was originally created as PARQUET-407. Please see the migration documentation for further details.

Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

Prepare parquet-format for Apache release

Need to prepare the parquet source for release. We're planning on two releases:

  • 2.2.0 as com.twitter:parquet-format
  • 2.3.0 as org.apache.parquet:parquet-format

2.3.0 will be identical to 2.2.0 other than changing the parquet packages to org.apache.parquet and updating the coordinate.

For both releases, we need to go through the Incubator checklist and follow steps for publishing maven artifacts.

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-72. Please see the migration documentation for further details.

Format: Do not shade slf4j-api

PARQUET-369 fixed warnings from shading slf4j-api, but a consequence of shading is that the log messages from thrift for parquet-format classes are dropped. This was an accepted trade-off until PARQUET-305 changed logging in the rest of the library to SLF4J. Now that the slf4j-api is a dependency for all of Parquet except parquet-format, it no longer makes sense to suppress the format thrift logs to avoid exposing it.

This also requires PARQUET-371 because thrift 0.7.0 relies on a very old version of slf4j-api.

Reporter: Ryan Blue / @rdblue
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-412. Please see the migration documentation for further details.

parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations

I'm currently working on adding support for writing min/max statistics to Parquet files in Impala (IMPALA-3909). I noticed that the comments in parquet.thrift#L201 don't seem to match the implementations in parquet-mr and Hive.

The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), this should be "4 byte length stored as little endian, followed by bytes".

Looking at BinaryStatistics.java#L61, it seems to return the bytes without a length-prefix. Writing a parquet file with Hive also shows this behavior.

Similarly, though less ambiguously, PLAIN encoding for booleans uses bit-packing. It seems to be implied that for a single bit (the min/max of a boolean column) this means setting the least significant bit of a single byte. This could be made clearer in the parquet.thrift file, too.
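
The difference for a BYTE_ARRAY min value is easy to show (assuming the value b"abc"):

import struct

value = b"abc"

# What the parquet.thrift comment describes: PLAIN encoding, i.e. a 4-byte
# little-endian length prefix followed by the bytes.
plain_encoded = struct.pack("<I", len(value)) + value   # b'\x03\x00\x00\x00abc'

# What parquet-mr and Hive actually write for the statistics: the raw bytes.
raw_bytes = value                                       # b'abc'

print(plain_encoded, raw_bytes)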

Reporter: Lars Volker / @lekv
Assignee: Lars Volker / @lekv

Note: This issue was originally created as PARQUET-826. Please see the migration documentation for further details.

Document ENUM as a logical type

ENUM is used to annotate enum type in Thrift, Avro, and ProtoBuf, but it's not documented anywhere in parquet-format.

According to current (1.8-SNAPSHOT) code base, ENUM is only used to annotate BINARY. For data models which lack a native enum type, BINARY (ENUM) should be interpreted as a UTF-8 string.

Reporter: Cheng Lian / @liancheng
Assignee: Jakub Kukul / @jkukul

PRs and other links:

Note: This issue was originally created as PARQUET-322. Please see the migration documentation for further details.

Small typos/issues in parquet-format documentation

I noticed several typos/omissions in parquet format documentation:

  • HDFS should be all uppercase (acronym)
  • enncoding instead of encoding
  • markdown issues
  • no link to the thrift definition file
  • the integer format (LE vs BE) is not specified for the file metadata
  • the order of information in a data page

Reporter: Laurent Goujon / @laurentgo
Assignee: Laurent Goujon / @laurentgo

PRs and other links:

Note: This issue was originally created as PARQUET-450. Please see the migration documentation for further details.

Add "Floating Timestamp" logical type

Unlike the current Parquet timestamp, which is stored in UTC, a "floating timestamp" has no timezone; it is up to the reader to interpret the timestamps based on their own timezone.
This is the behavior of a TIMESTAMP in the SQL standard.
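
A stdlib-only sketch of the semantics: the stored value carries no zone, and each reader attaches its own timezone when interpreting it (the microsecond encoding here is an assumption for the example):

from datetime import datetime, timedelta, timezone

stored_micros = 1_489_500_000_000_000                     # zone-less stored value
wall_clock = datetime(1970, 1, 1) + timedelta(microseconds=stored_micros)

# Two readers see the same wall-clock time but different absolute instants.
reader_utc = wall_clock.replace(tzinfo=timezone.utc)
reader_pst = wall_clock.replace(tzinfo=timezone(timedelta(hours=-8)))
print(reader_utc.isoformat(), reader_pst.isoformat())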

Reporter: Julien Le Dem / @julienledem

Related issues:

Note: This issue was originally created as PARQUET-905. Please see the migration documentation for further details.

Allow for Unsigned Statistics in Binary Type

BinaryStatistics currently only have a min/max, which are compared as signed byte[]. However, for true UTF-8-friendly lexicographic comparison, e.g. for string columns, we would want to calculate the BinaryStatistics based on a comparator that treats the bytes as unsigned.
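
A small demonstration of why signed comparison breaks UTF-8 ordering (the signed() helper reinterprets bytes the way a Java byte[] comparator would):

a = "a".encode("utf-8")         # [0x61]
e_acute = "é".encode("utf-8")   # [0xc3, 0xa9] -- non-ASCII UTF-8 bytes are >= 0x80

def signed(b):
    # Reinterpret each byte as a signed 8-bit value.
    return [x - 256 if x >= 0x80 else x for x in b]

print(min(a, e_acute))                    # b'a'        -- unsigned order, correct for UTF-8
print(min([a, e_acute], key=signed))      # b'\xc3\xa9' -- signed order puts 'é' first, which is wrong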

Reporter: Andrew Duffy
Assignee: Ryan Blue / @rdblue

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-686. Please see the migration documentation for further details.

Parquet is not storing the type for the column

  1. Create Text file format table
    CREATE EXTERNAL TABLE IF NOT EXISTS emp(
    id INT,
    first_name STRING,
    last_name STRING,
    dateofBirth STRING,
    join_date INT
    )
    COMMENT 'This is Employee Table Date Of Birth of type String'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/employee/beforePartition';

  2. Load the data into table
    load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' into table emp;
    select * from emp;

  3. Create Partitioned table with file format as Parquet (dateofBirth STRING))

    create external table emp_afterpartition(
    id int, first_name STRING, last_name STRING, dateofBirth STRING)
    COMMENT 'Employee partitioned table with dateofBirth of type string'
    partitioned by (join_date int)
    STORED as parquet
    LOCATION '/user/employee/afterpartition';

  4. Fetch the data from Partitioned column

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table emp_afterpartition partition (join_date) select * from emp;
    select * from emp_afterpartition;

  5. Create Partitioned table with file format as Parquet (dateofBirth TIMESTAMP))

    CREATE EXTERNAL TABLE IF NOT EXISTS employee_afterpartition_timestamp_parq(
    id INT,first_name STRING,last_name STRING,dateofBirth TIMESTAMP)
    COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
    PARTITIONED BY (join_date INT)
    STORED AS PARQUET
    LOCATION '/user/employee/afterpartition';

    select * from employee_afterpartition_timestamp_parq;
    -- 0 records returned

    -- Impala: alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
    -- Hive:   MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;
    -- (MSCK works in Hive and RECOVER PARTITIONS works in Impala: the metastore check command with the repair-table option)

    select * from employee_afterpartition_timestamp_parq;

Actual result: fails with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

Expected result: the data should be displayed.

Note: if the file format is text instead of Parquet, I am able to fetch the data.
Observation: the two tables have different column types but point to the same HDFS location.

Sample data:

1,Joyce,Garza,2016-07-17 14:42:18,201607
2,Jerry,Ortiz,2016-08-17 21:36:54,201608
3,Steven,Ryan,2016-09-10 01:32:40,201609
4,Lisa,Black,2015-10-12 15:05:13,201610
5,Jose,Turner,2015-011-10 06:38:40,201611
6,Joyce,Garza,2016-08-02,201608
7,Jerry,Ortiz,2016-01-01,201601
8,Steven,Ryan,2016/08/20,201608
9,Lisa,Black,2016/09/12,201609
10,Jose,Turner,09/19/2016,201609
11,Jose,Turner,20160915,201609

Reporter: Narasimha

Note: This issue was originally created as PARQUET-723. Please see the migration documentation for further details.

INT96 should be marked as deprecated

As discussed on the mailing list, INT96 is only used to represent nanosecond timestamps in Impala for historical reasons, and should be deprecated. Since nanosecond precision is rarely a real requirement, one possible and simple solution would be replacing INT96 with INT64 (TIMESTAMP_MILLIS) or INT64 (TIMESTAMP_MICROS).

Several projects (Impala, Hive, Spark, ...) support INT96.
We need a clear spec of the replacement and the path to deprecation.
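
A sketch of the replacement path, converting an Impala-style INT96 value to INT64 TIMESTAMP_MICROS. The 12-byte layout assumed here (8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day) is the commonly described Impala convention, not something this issue specifies:

import struct

JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01

def int96_to_timestamp_micros(raw12: bytes) -> int:
    nanos_of_day, julian_day = struct.unpack("<qI", raw12)
    # Nanosecond precision is dropped when moving to TIMESTAMP_MICROS.
    return (julian_day - JULIAN_DAY_OF_EPOCH) * 86_400_000_000 + nanos_of_day // 1_000

# Midnight on 1970-01-02 encoded as INT96 -> 86,400,000,000 microseconds.
print(int96_to_timestamp_micros(struct.pack("<qI", 0, 2_440_589)))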

Reporter: Cheng Lian / @liancheng
Assignee: Lars Volker / @lekv

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-323. Please see the migration documentation for further details.

Parquet need to support empty list and empty map

In the current Hive upstream 2.1 version, when Hive tries to insert an empty list or empty map into a Parquet table, it fails with the error:
parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead

It seems that Parquet currently only supports null values; we should find a way to support empty list and empty map values.
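
This is not the Hive code path, but a quick pyarrow illustration (assuming pyarrow is installed) that the format can distinguish an empty list from a null list:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"xs": pa.array([[1, 2], [], None], type=pa.list_(pa.int64()))})
pq.write_table(table, "/tmp/empty_list_example.parquet")

# The empty list round-trips as [], distinct from the null row.
print(pq.read_table("/tmp/empty_list_example.parquet").column("xs").to_pylist())
# [[1, 2], [], None]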

Reporter: Yongzhi Chen

Related issues:

Note: This issue was originally created as PARQUET-596. Please see the migration documentation for further details.

Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

Parquet-format shades SLF4J to parquet.org.slf4j (see here). This also accidentally shades this line

private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class";

to

private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class";

and thus LoggerFactory can never find the correct StaticLoggerBinder implementation even if we provide dependencies like slf4j-log4j12 on the classpath.

This happens in Spark. Whenever we write a Parquet file, we see the following famous message and can never get rid of it:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Reporter: Cheng Lian / @liancheng
Assignee: Ryan Blue / @rdblue

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-369. Please see the migration documentation for further details.

Add NULL type to bring Parquet logical types to par with Arrow

Missing:

  • Null
  • Interval types
  • Union
  • half precision float

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-757. Please see the migration documentation for further details.
