apache / parquet-format

Apache Parquet Format

Home Page: https://parquet.apache.org/

License: Apache License 2.0

Languages: Thrift 63.71%, Python 23.21%, Shell 11.40%, Makefile 1.67%
Topics: parquet, apache, parquet-format

parquet-format's Issues

Add a special ColumnOrder for testing

PR #46 introduced ColumnOrder with the limitation that a reader should ignore stats for a column if the corresponding ColumnOrder in FileMetaData contains an unknown value. In order to test this logic, it would be helpful to have a special value InvalidOrder or UnsupportedOrder that would never be supported by a reader. I assume this may be helpful to test other implementations, too.
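
A minimal sketch of the reader-side rule such a value would exercise (the names KNOWN_ORDERS and UNSUPPORTED_ORDER are illustrative assumptions, not the Thrift definitions):

# Hypothetical reader-side check: statistics are only trusted when the file
# declares a column order this reader understands. A testing-only value such
# as "UNSUPPORTED_ORDER" would always fall into the ignore branch.
KNOWN_ORDERS = {"TYPE_DEFINED_ORDER"}

def usable_statistics(column_order, statistics):
    if column_order not in KNOWN_ORDERS:
        return None          # unknown order: min/max must be ignored
    return statistics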

Reporter: Lars Volker / @lekv
Assignee: Lars Volker / @lekv

Note: This issue was originally created as PARQUET-974. Please see the migration documentation for further details.

Store `dictionary entries` of parquet columns that will be used for joins

It would be great if Parquet would store dictionary entries for columns marked to be used for joins.

When a column is used for a join (it could be a surrogate key or a natural key), the value of the column itself is actually not so important.

So we could join directly on dictionary entries instead of values and save CPU cycles (no need to decompress, etc.), as sketched below.

Inspired by Oracle In-memory columnar storage improvements in 12.2
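
An illustrative sketch of the idea (plain Python, not a Parquet API): if both sides keep the join column as dictionary codes over the same dictionary, the equi-join compares small integers and never decodes the values.

# Illustration only: join on dictionary codes instead of decoded values.
dictionary = ["cust_001", "cust_002", "cust_003"]       # shared dictionary entries
fact_codes = [0, 2, 2, 1]                               # fact-side join column, stored as codes
dim_rows   = {0: "Alice", 1: "Bob", 2: "Carol"}         # dim side, keyed by the same codes

joined = [(dictionary[c], dim_rows[c]) for c in fact_codes if c in dim_rows]
print(joined)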

Reporter: Ruslan Dautkhanov / @Tagar

Note: This issue was originally created as PARQUET-966. Please see the migration documentation for further details.

Upgrade snappy-java to 1.1.1.6

Upgrade snappy-java to 1.1.1.6 (the latest version), since 1.0.5 is no longer maintained in https://github.com/xerial/snappy-java, and 1.1.1.6 supports broader platforms including PowerPC, IBM-AIX 6.4, SunOS, etc. It also has a better native code loading mechanism (allowing snappy-java to be used from multiple class loaders).

The compression format is compatible between 1.0.5 and 1.1.1.6. The 1.1.1.x versions add framing format support, but Parquet currently does not use the framing format, so I think this upgrade causes no data format incompatibility.

Reporter: Taro L. Saito / @xerial
Assignee: Taro L. Saito / @xerial

Note: This issue was originally created as PARQUET-133. Please see the migration documentation for further details.

Format: Add a flag when min/max are truncated

PARQUET-372 drops page and column chunk stats when values are larger than 4k to avoid storing very large values in page headers and the file footer. An alternative approach is to truncate the values, which would still allow filtering on page stats. The problem with truncating values is that the value in the stats may not be the true min or max, so engines that use these values as the result of aggregations like min(col) would return incorrect data. We should consider adding metadata that records that a value has been truncated, so truncated values can still be used for filtering without being mistaken for exact results.
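
A sketch of how such a flag could be consumed (field names like is_min_exact are assumptions for illustration, not the spec): truncated bounds stay safe for page skipping as long as the writer only widens them, but they must not be returned as min(col)/max(col).

# Assumes a truncated min is still <= the true min and a truncated max is
# still >= the true max, i.e. truncation only widens the range.
def can_skip_page(literal, stats):
    # Pruning stays correct with widened bounds: a matching value cannot
    # exist outside [min, max].
    return literal < stats["min"] or literal > stats["max"]

def min_for_aggregate(stats):
    # Only an exact value may be used to answer min(col) directly.
    return stats["min"] if stats.get("is_min_exact", False) else None

stats = {"min": b"app", "max": b"bana", "is_min_exact": False}    # truncated bounds
print(can_skip_page(b"cherry", stats), min_for_aggregate(stats))  # True None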

Reporter: Ryan Blue / @rdblue

Related issues:

Note: This issue was originally created as PARQUET-411. Please see the migration documentation for further details.

add logical type timestamp with timezone (per SQL)

timestamp with timezone (per SQL): timestamps are adjusted to UTC and stored as integers.
The metadata is in the logical types PR; see the discussion here: #51 (comment)
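
A stdlib-only sketch of the storage model described above (microsecond precision is an assumption for the example):

# A zoned timestamp is adjusted to UTC and then stored as an integer.
from datetime import datetime, timedelta, timezone

local = datetime(2017, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-8)))   # 09:30 at UTC-8
utc = local.astimezone(timezone.utc)                                        # adjust to UTC
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
stored = (utc - epoch) // timedelta(microseconds=1)                         # int64 column value
print(stored)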

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-906. Please see the migration documentation for further details.

Min-max should be computed based on logical type

The min/max stats are currently underspecified: it is not clear in all cases from the spec what the expected ordering is.

There are some related issues, like PARQUET-686 to fix specific problems, but there seems to be a general assumption that the min/max should be defined based on the primitive type instead of the logical type.

However, this makes the stats nearly useless for some logical types. E.g. consider a DECIMAL encoded into a (variable-length) BINARY. The min/max of the underlying binary type is based on the lexical order of the byte string, but that does not correspond to any reasonable ordering of the decimal values: e.g. 16 (0x1 0x0) will be ordered between 1 (0x1) and 2 (0x2). This makes min/max filtering a lot less effective and would force query engines using Parquet to implement workarounds (e.g. custom comparators) to produce correct results.
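
The problem is easy to reproduce with plain byte strings. Here 256 is used as a concrete two-byte unscaled value (0x01 0x00) so the arithmetic is exact:

# Lexicographic byte order places the two-byte value between the one-byte
# values 1 and 2, even though it is numerically far larger.
one = bytes([0x01])          # decimal 1
two = bytes([0x02])          # decimal 2
big = bytes([0x01, 0x00])    # unscaled value 256 in big-endian two's complement

print(sorted([one, two, big]))                  # [b'\x01', b'\x01\x00', b'\x02']
print(min(one, two, big), max(one, two, big))   # b'\x01' b'\x02' -- 256 falls "inside" the min/max range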

Reporter: Tim Armstrong / @timarmstrong

Related issues:

Note: This issue was originally created as PARQUET-839. Please see the migration documentation for further details.

ScalaReflectionException when trying to convert an RDD of Scrooge to a DataFrame

I get "scala.ScalaReflectionException: is not a term" when I try to convert an RDD of Scrooge to a DataFrame, e.g. myScroogeRDD.toDF

Has anyone else encountered this problem?

I'm using Spark 1.3.1, Scala 2.10.4 and scrooge-sbt-plugin 3.16.3

Here is my thrift IDL:

namespace scala com.junk
namespace java com.junk

struct Junk {
    10: i64 junkID,
    20: string junkString
}

from a spark-shell:

val junks = List( Junk(123L, "junk1"), Junk(567L, "junk2"), Junk(789L, "junk3") )
val junksRDD = sc.parallelize(junks)
junksRDD.toDF

Exception thrown:

scala.ScalaReflectionException: <none> is not a term
	at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
	at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:73)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:148)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
	at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
	at $iwC$$iwC$$iwC.<init>(<console>:40)
	at $iwC$$iwC.<init>(<console>:42)
	at $iwC.<init>(<console>:44)
	at <init>(<console>:46)
	at .<init>(<console>:50)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Reporter: Tim Chan / @zzztimbo

Related issues:

Note: This issue was originally created as PARQUET-293. Please see the migration documentation for further details.

Add microsecond time and timestamp annotations

When the date/time type annotations were added, we decided not to add precisions smaller than milliseconds because there wasn't a clear requirement. I think that the requirement is for nanosecond precision. The SQL spec requires at least microsecond. Some databases support nanosecond, including SQL engines on Hadoop like Phoenix. Hive and Impala currently support nanosecond times using an int96, but intend to move to microsecond precision with this spec.

I propose adding the following type annotations:

  • TIME_MICROS: annotates an int64 (8 bytes), represents the number of microseconds from midnight.
  • TIMESTAMP_MICROS: annotates an int64 (8 bytes), represents the number of microseconds from the unix epoch.
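
A stdlib-only sketch of the values the two proposed annotations would store:

from datetime import datetime, time, timedelta, timezone

# TIMESTAMP_MICROS: int64 microseconds from the unix epoch.
ts = datetime(2015, 6, 1, 12, 34, 56, 789000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
timestamp_micros = (ts - epoch) // timedelta(microseconds=1)

# TIME_MICROS: int64 microseconds from midnight.
t = time(12, 34, 56, 789000)
time_micros = ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

print(timestamp_micros, time_micros)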

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

PRs and other links:

Note: This issue was originally created as PARQUET-200. Please see the migration documentation for further details.

[Format] HALF precision FLOAT Logical type

Reporter: Julien Le Dem / @julienledem
Assignee: Anja Boskovic / @anjakefala

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-758. Please see the migration documentation for further details.

META-INF for slf4j should not be in parquet-format jar

$ jar tf parquet-format-2.2.0-rc1.jar  | grep org\\.slf
META-INF/maven/org.slf4j/
META-INF/maven/org.slf4j/slf4j-api/
META-INF/maven/org.slf4j/slf4j-api/pom.xml
META-INF/maven/org.slf4j/slf4j-api/pom.properties

It is not clear to me why these are here. I suspect they should not be.

Reporter: koert kuipers
Assignee: Ryan Blue / @rdblue

PRs and other links:

Note: This issue was originally created as PARQUET-178. Please see the migration documentation for further details.

Add index pages to the format to support efficient page skipping

When a Parquet file is sorted, we can define an index consisting of the boundary values for the pages of the columns it is sorted on, as well as the offsets and lengths of those pages in the file.
The goal is to optimize lookup and range-scan queries by using this index to read only the pages containing data matching the filter.
We'd require the pages to be aligned across columns.

[~marcelk] will add a link to the Google doc to discuss the spec.
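
A sketch of the kind of lookup such an index enables (the dict layout is an assumption for illustration, not the eventual spec):

# Each sorted column's pages record boundary values plus offset/length; a
# point or range query reads only the pages whose range can contain matches.
page_index = [
    {"min": 0,   "max": 99,  "offset": 4,      "length": 8192},
    {"min": 100, "max": 199, "offset": 8196,   "length": 8011},
    {"min": 200, "max": 349, "offset": 16207,  "length": 7944},
]

def pages_for_range(lo, hi, index):
    # With sorted data this could be a binary search; a scan keeps the sketch short.
    return [p for p in index if p["max"] >= lo and p["min"] <= hi]

print(pages_for_range(150, 150, page_index))   # only the second page needs to be read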

Reporter: Julien Le Dem / @julienledem
Assignee: Marcel Kinard

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-922. Please see the migration documentation for further details.

Incorrect delta-encoding example

The minimum and the number of bits are incorrect in delta encoding Example 2 in Encodings.md.

In the example (Example 2), for the values

7, 5, 3, 1, 2, 3, 4, 5

the deltas would be

-2, -2, -2, 1, 1, 1, 1

The minimum is -2, so the relative deltas are:

0, 0, 0, 3, 3, 3, 3

The encoded data is

header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)

block 0 (minimum delta), 2 (bitwidth), 000000111111b (0,0,0,3,3,3 packed on 2 bits)

The minimum is -2 and the relative deltas are 0, 0, 0, 3, 3, 3, 3. So, this should be corrected as below:

block -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
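
The corrected numbers can be checked mechanically:

# Recomputes the deltas, minimum delta, relative deltas and bit width for the
# corrected example above.
values = [7, 5, 3, 1, 2, 3, 4, 5]

deltas = [b - a for a, b in zip(values, values[1:])]   # [-2, -2, -2, 1, 1, 1, 1]
min_delta = min(deltas)                                # -2
relative = [d - min_delta for d in deltas]             # [0, 0, 0, 3, 3, 3, 3]
bit_width = max(relative).bit_length()                 # 2

print(deltas, min_delta, relative, bit_width)
# 7 relative deltas * 2 bits = 14 packed bits, i.e. 00000011111111b.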

Reporter: choi woo cheol

Note: This issue was originally created as PARQUET-407. Please see the migration documentation for further details.

Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

Prepare parquet-format for Apache release

Need to prepare the parquet source for release. We're planning on two releases:

  • 2.2.0 as com.twitter:parquet-format
  • 2.3.0 as org.apache.parquet:parquet-format

2.3.0 will be identical to 2.2.0 other than changing the parquet packages to org.apache.parquet and updating the coordinate.

For both releases, we need to go through the Incubator checklist and follow steps for publishing maven artifacts.

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-72. Please see the migration documentation for further details.

Format: Do not shade slf4j-api

PARQUET-369 fixed warnings from shading slf4j-api, but a consequence of shading is that the log messages from thrift for parquet-format classes are dropped. This was an accepted trade-off until PARQUET-305 changed logging in the rest of the library to SLF4J. Now that the slf4j-api is a dependency for all of Parquet except parquet-format, it no longer makes sense to suppress the format thrift logs to avoid exposing it.

This also requires PARQUET-371 because thrift 0.7.0 relies on a very old version of slf4j-api.

Reporter: Ryan Blue / @rdblue
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-412. Please see the migration documentation for further details.

parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations

I'm currently working on adding support for writing min/max statistics to Parquet files in Impala (IMPALA-3909). I noticed that the comments in parquet.thrift#L201 don't seem to match the implementations in parquet-mr and Hive.

The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), this should be "4 byte length stored as little endian, followed by bytes".

Looking at BinaryStatistics.java#L61, it seems to return the bytes without a length-prefix. Writing a parquet file with Hive also shows this behavior.

Similarly, though less ambiguously, PLAIN encoding for booleans uses bit-packing. It seems to be implied that for a single bit (the min/max of a boolean column) this means setting the least significant bit of a single byte. This could be made clearer in the parquet.thrift file, too.
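
The difference for a BYTE_ARRAY min value is easy to show (assuming the value b"abc"):

import struct

value = b"abc"

# What the parquet.thrift comment describes: PLAIN encoding, i.e. a 4-byte
# little-endian length prefix followed by the bytes.
plain_encoded = struct.pack("<I", len(value)) + value   # b'\x03\x00\x00\x00abc'

# What parquet-mr and Hive actually write for the statistics: the raw bytes.
raw_bytes = value                                       # b'abc'

print(plain_encoded, raw_bytes)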

Reporter: Lars Volker / @lekv
Assignee: Lars Volker / @lekv

Note: This issue was originally created as PARQUET-826. Please see the migration documentation for further details.

Document ENUM as a logical type

ENUM is used to annotate enum type in Thrift, Avro, and ProtoBuf, but it's not documented anywhere in parquet-format.

According to current (1.8-SNAPSHOT) code base, ENUM is only used to annotate BINARY. For data models which lack a native enum type, BINARY (ENUM) should be interpreted as a UTF-8 string.

Reporter: Cheng Lian / @liancheng
Assignee: Jakub Kukul / @jkukul

PRs and other links:

Note: This issue was originally created as PARQUET-322. Please see the migration documentation for further details.

Small typos/issues in parquet-format documentation

I noticed several typos/omissions in parquet format documentation:

  • HDFS should be all uppercase (acronym)
  • enncoding instead of encoding
  • markdown issues
  • no link to the thrift definition file
  • the integer format (LE vs BE) is not specified for the file metadata
  • the order of information in a data page

Reporter: Laurent Goujon / @laurentgo
Assignee: Laurent Goujon / @laurentgo

PRs and other links:

Note: This issue was originally created as PARQUET-450. Please see the migration documentation for further details.

Add "Floating Timestamp" logical type

Unlike the current Parquet timestamp, which is stored in UTC, a "floating timestamp" has no timezone; it is up to the reader to interpret the timestamps based on their own timezone.
This is the behavior of a TIMESTAMP in the SQL standard.
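
A stdlib-only sketch of the semantics: the stored value carries no zone, and each reader attaches its own timezone when interpreting it (the microsecond encoding here is an assumption for the example):

from datetime import datetime, timedelta, timezone

stored_micros = 1_489_500_000_000_000                     # zone-less stored value
wall_clock = datetime(1970, 1, 1) + timedelta(microseconds=stored_micros)

# Two readers see the same wall-clock time but different absolute instants.
reader_utc = wall_clock.replace(tzinfo=timezone.utc)
reader_pst = wall_clock.replace(tzinfo=timezone(timedelta(hours=-8)))
print(reader_utc.isoformat(), reader_pst.isoformat())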

Reporter: Julien Le Dem / @julienledem

Related issues:

Note: This issue was originally created as PARQUET-905. Please see the migration documentation for further details.

Allow for Unsigned Statistics in Binary Type

BinaryStatistics currently only have a min/max, which are compared as signed byte[]. However, for true UTF-8-friendly lexicographic comparison, e.g. for string columns, we would want to calculate the BinaryStatistics based on a comparator that treats the bytes as unsigned.
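
A small demonstration of why signed comparison breaks UTF-8 ordering (the signed() helper reinterprets bytes the way a Java byte[] comparator would):

a = "a".encode("utf-8")         # [0x61]
e_acute = "é".encode("utf-8")   # [0xc3, 0xa9] -- non-ASCII UTF-8 bytes are >= 0x80

def signed(b):
    # Reinterpret each byte as a signed 8-bit value.
    return [x - 256 if x >= 0x80 else x for x in b]

print(min(a, e_acute))                    # b'a'        -- unsigned order, correct for UTF-8
print(min([a, e_acute], key=signed))      # b'\xc3\xa9' -- signed order puts 'é' first, which is wrong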

Reporter: Andrew Duffy
Assignee: Ryan Blue / @rdblue

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-686. Please see the migration documentation for further details.

Parquet is not storing the type for the column

  1. Create Text file format table
    CREATE EXTERNAL TABLE IF NOT EXISTS emp(
    id INT,
    first_name STRING,
    last_name STRING,
    dateofBirth STRING,
    join_date INT
    )
    COMMENT 'This is Employee Table Date Of Birth of type String'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/employee/beforePartition';

  2. Load the data into table
    load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' into table emp;
    select * from emp;

  3. Create Partitioned table with file format as Parquet (dateofBirth STRING))

    create external table emp_afterpartition(
    id int, first_name STRING, last_name STRING, dateofBirth STRING)
    COMMENT 'Employee partitioned table with dateofBirth of type string'
    partitioned by (join_date int)
    STORED as parquet
    LOCATION '/user/employee/afterpartition';

  4. Fetch the data from Partitioned column

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table emp_afterpartition partition (join_date) select * from emp;
    select * from emp_afterpartition;

  5. Create Partitioned table with file format as Parquet (dateofBirth TIMESTAMP))

    CREATE EXTERNAL TABLE IF NOT EXISTS employee_afterpartition_timestamp_parq(
    id INT,first_name STRING,last_name STRING,dateofBirth TIMESTAMP)
    COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
    PARTITIONED BY (join_date INT)
    STORED AS PARQUET
    LOCATION '/user/employee/afterpartition';

    select * from employee_afterpartition_timestamp_parq;
    -- 0 records returned

    -- Impala: alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
    -- Hive:   MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;
    -- (MSCK works in Hive and RECOVER PARTITIONS works in Impala: the metastore check command with the repair-table option)

    select * from employee_afterpartition_timestamp_parq;

Actual result: fails with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

Expected result: the data should be displayed.

Note: if the file format is text instead of Parquet, I am able to fetch the data.
Observation: the two tables have different column types but point to the same HDFS location.

Sample data:

1,Joyce,Garza,2016-07-17 14:42:18,201607
2,Jerry,Ortiz,2016-08-17 21:36:54,201608
3,Steven,Ryan,2016-09-10 01:32:40,201609
4,Lisa,Black,2015-10-12 15:05:13,201610
5,Jose,Turner,2015-011-10 06:38:40,201611
6,Joyce,Garza,2016-08-02,201608
7,Jerry,Ortiz,2016-01-01,201601
8,Steven,Ryan,2016/08/20,201608
9,Lisa,Black,2016/09/12,201609
10,Jose,Turner,09/19/2016,201609
11,Jose,Turner,20160915,201609

Reporter: Narasimha

Note: This issue was originally created as PARQUET-723. Please see the migration documentation for further details.

INT96 should be marked as deprecated

As discussed on the mailing list, INT96 is only used to represent nanosecond timestamps in Impala for historical reasons, and should be deprecated. Since nanosecond precision is rarely a real requirement, one possible and simple solution would be replacing INT96 with INT64 (TIMESTAMP_MILLIS) or INT64 (TIMESTAMP_MICROS).

Several projects (Impala, Hive, Spark, ...) support INT96.
We need a clear spec of the replacement and the path to deprecation.
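
A sketch of the replacement path, converting an Impala-style INT96 value to INT64 TIMESTAMP_MICROS. The 12-byte layout assumed here (8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day) is the commonly described Impala convention, not something this issue specifies:

import struct

JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01

def int96_to_timestamp_micros(raw12: bytes) -> int:
    nanos_of_day, julian_day = struct.unpack("<qI", raw12)
    # Nanosecond precision is dropped when moving to TIMESTAMP_MICROS.
    return (julian_day - JULIAN_DAY_OF_EPOCH) * 86_400_000_000 + nanos_of_day // 1_000

# Midnight on 1970-01-02 encoded as INT96 -> 86,400,000,000 microseconds.
print(int96_to_timestamp_micros(struct.pack("<qI", 0, 2_440_589)))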

Reporter: Cheng Lian / @liancheng
Assignee: Lars Volker / @lekv

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-323. Please see the migration documentation for further details.

Parquet need to support empty list and empty map

In the current Hive upstream 2.1 version, when Hive tries to insert an empty list or empty map into a Parquet table, it fails with the error:
parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead

It seems that Parquet currently only supports null values; we should find a way to support empty list and empty map values.
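
This is not the Hive code path, but a quick pyarrow illustration (assuming pyarrow is installed) that the format can distinguish an empty list from a null list:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"xs": pa.array([[1, 2], [], None], type=pa.list_(pa.int64()))})
pq.write_table(table, "/tmp/empty_list_example.parquet")

# The empty list round-trips as [], distinct from the null row.
print(pq.read_table("/tmp/empty_list_example.parquet").column("xs").to_pylist())
# [[1, 2], [], None]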

Reporter: Yongzhi Chen

Related issues:

Note: This issue was originally created as PARQUET-596. Please see the migration documentation for further details.

Shading SLF4J prevents SLF4J locating org.slf4j.impl.StaticLoggerBinder

Parquet-format shades SLF4J to parquet.org.slf4j (see here). This also accidentally shades this line

private static String STATIC_LOGGER_BINDER_PATH = "org/slf4j/impl/StaticLoggerBinder.class";

to

private static String STATIC_LOGGER_BINDER_PATH = "parquet/org/slf4j/impl/StaticLoggerBinder.class";

and thus LoggerFactory can never find the correct StaticLoggerBinder implementation even if we provide dependencies like slf4j-log4j12 on the classpath.

This happens in Spark. Whenever we write a Parquet file, we see the following famous message and can never get rid of it:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Reporter: Cheng Lian / @liancheng
Assignee: Ryan Blue / @rdblue

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-369. Please see the migration documentation for further details.

Add NULL type to bring Parquet logical types to par with Arrow

Missing:

  • Null
  • Interval types
  • Union
  • half precision float

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-757. Please see the migration documentation for further details.
