parquet-rs's People

Contributors

andygrove, asdetrefle, gnieto, liurenjie1024, sadikovi, sunchao, xrl

parquet-rs's Issues

Write support for Parquet (low-level writes)

This is an RFC for write support in this crate.

Prototype PR #127. The API closely resembles the read support; see the design overview and implementation details in the comments below. A rough sketch of how the writer API might look follows the sub-task list.

Sub-tasks:

  • Add writer properties (#128).
  • Extend and refactor IO module to include Thrift related streams/buffers and file sink (#129).
  • Add schema type conversion to Thrift (#130).
  • Add metadata conversion to Thrift (#131).
  • Add page writer (#133).
  • Add column writer (#138).
  • Add row group writer and file writer (#149).
  • Update all crate documentation with regard to writes (#157).
  • Add reader properties and update the batch size (#161).
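
To make the proposal concrete, here is a rough sketch of how the write path could mirror the read path. The names below (WriterProperties, SerializedFileWriter, next_row_group, next_column, write_batch) are taken from the sub-tasks above and are assumptions about the eventual API, not its final form.

use std::fs::File;
use std::rc::Rc;

// Sketch only: mirrors SerializedFileReader / RowGroupReader / ColumnReader on
// the write side. Type is the crate's schema type, Result the crate's error
// type; all signatures here are assumptions, not the final API.
fn write_file(schema: Rc<Type>, props: Rc<WriterProperties>) -> Result<()> {
    let file = File::create("sample.parquet")?;
    let mut file_writer = SerializedFileWriter::new(file, schema, props)?;

    // One row group at a time, one column at a time, matching the read path.
    let mut row_group_writer = file_writer.next_row_group()?;
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        // write_batch would be the writer-side analogue of read_batch:
        // values plus optional definition/repetition levels, e.g.
        // typed_writer.write_batch(&values, Some(&def_levels), Some(&rep_levels))?;
        row_group_writer.close_column(col_writer)?;
    }
    file_writer.close_row_group(row_group_writer)?;
    file_writer.close()?;
    Ok(())
}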

Add support for reading columns as Apache Arrow arrays

This is going to be easy. I have some ugly prototype code working already.

    // Assumed imports: std::{fs::File, mem, path::Path, slice}, plus
    // SerializedFileReader/ColumnReader from this crate and the memory, Buffer,
    // Array and ArrayData types from the early `arrow` crate.
    let path = Path::new(&args[1]);
    let file = File::open(&path).unwrap();
    let parquet_reader = SerializedFileReader::new(file).unwrap();

    let row_group_reader = parquet_reader.get_row_group(0).unwrap();

    for i in 0..row_group_reader.num_columns() {
        match row_group_reader.get_column_reader(i) {
            Ok(ColumnReader::Int32ColumnReader(ref mut r)) => {
                // Allocate an aligned buffer up front and view it as a mutable
                // i32 slice that read_batch can fill directly.
                let batch_size = 1024;
                let sz = mem::size_of::<i32>();
                let p = memory::allocate_aligned((batch_size * sz) as i64).unwrap();
                let ptr_i32 = unsafe { mem::transmute::<*const u8, *mut i32>(p) };
                let mut buf = unsafe { slice::from_raw_parts_mut(ptr_i32, batch_size) };

                match r.read_batch(batch_size, None, None, &mut buf) {
                    Ok((count, _)) => {
                        // Hand the filled buffer to Arrow without copying.
                        let arrow_buffer = Buffer::from_raw_parts(ptr_i32, count as i32);
                        let arrow_array = Array::from(arrow_buffer);

                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            },
                            _ => println!("wrong type")
                        }
                    },
                    _ => println!("error")
                }
            }
            _ => println!("column type not supported")
        }
    }

Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.

Profiling encoding & decoding

It may be useful to do some profiling on encoding & decoding. We can use the existing benches for this. Some useful scripts:

For CPU:

perf record -g -F 1000 <bench-name> --bench && perf script | stackcollapse-perf | flamegraph > flamegraph.svg

The stackcollapse-perf and flamegraph scripts are from the FlameGraph project.

For cache performance (from this thread):

valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes <bench-name>

I also found that we may need to add the following section to Cargo.toml:

[profile.bench]
debug = true
opt-level = 1

If opt-level is too high, rustc will inline most of the functions, making it hard to see which function is the bottleneck. I'm not sure if there's a better way to do this.

Update crate documentation

Currently we do not publish documentation when releasing a new version of the crate. It would be very useful for people getting started if we had it.

Items to cover:

  • Compile documentation for the crate (module docs, structs, methods, etc.) so it can be published.
  • Add some references in the README.
  • Add a badge pointing to the latest docs.
  • Fix the docs.rs build somehow (?); see comments below.

HDFS Support?

I'm curious if there are plans to support HDFS within this crate?

The Java parquet library allows parquet files to be read locally or from HDFS and in both cases it is possible to push down the projection and only retrieve the columns needed, which can make a huge difference in performance.

Investigate publishing the project as a Cargo crate

At some point we should investigate the possibility of publishing parquet-rs as a crate so that it can be more easily used by other projects. So far I think one main issue is that parquet.rs needs to be generated from the Thrift definition before compile time, and there is no easy way to do this (something similar was done for the protobuf Rust bindings, which we may borrow). Otherwise, we have to check the generated parquet.rs into the repo, which is not very desirable.
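
As a sketch of the build-script approach: assuming a Thrift compiler with the Rust generator is installed on the build machine, a build.rs along these lines could regenerate parquet.rs at compile time. The file names and output paths are illustrative assumptions, not the project's actual layout.

// build.rs -- illustrative sketch; assumes `thrift` (with the Rust generator)
// is on PATH and that the IDL lives at src/parquet.thrift.
use std::process::Command;

fn main() {
    let out_dir = std::env::var("OUT_DIR").expect("OUT_DIR not set by cargo");

    // Equivalent to running: thrift --gen rs -out $OUT_DIR src/parquet.thrift
    let status = Command::new("thrift")
        .args(&["--gen", "rs", "-out", &out_dir, "src/parquet.thrift"])
        .status()
        .expect("failed to run the thrift compiler");
    assert!(status.success(), "thrift code generation failed");

    // Re-run only when the IDL changes.
    println!("cargo:rerun-if-changed=src/parquet.thrift");
}

The generated module could then be pulled in with include!(concat!(env!("OUT_DIR"), "/parquet.rs")), so nothing generated has to live in the repo.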

Writing?

This crate doesn't seem to support writing Parquet files. Is this planned? I'd like to convert CSV files to Parquet.

Planning milestone 1

I compiled a list of items that could be fixed in the near future, though it is not a complete list:

  • Adding read support; this includes a high-level API for reading data, plus tooling similar to parquet-mr's parquet-tools CLI.
  • Adding write support, this includes adding a high level API for writing data.
  • Improving dictionary encoding/decoding (I am not sure about the effort here, but might require a few changes).
  • Improving delta binary packed encoding/decoding (I made some mistakes earlier, so this might require quite a few changes).
  • Adding benchmarks for delta byte array and delta length byte array encoding/decoding.
  • Fixing TODO items.

Questions I have:

  • Do we need to add more encoders/decoders?
  • Are there any other items that I missed that should be listed?
  • What is the priority of tasks? What should be done first?
  • Are there any ideas/suggestions for tracking such progress?

PLAIN encoding for boolean is not following Parquet format

The Parquet encoding format specifies that booleans should be encoded as bit-packed, LSB first. In the current implementation we just encode it as a plain 1-bit value.

It is a little confusing, though, that the bit-packed encoding is already deprecated and replaced by the RLE/bit-packed hybrid encoding. Nevertheless, parquet-mr follows this specification and still uses the bit-packed encoding, while parquet-cpp uses the same plain encoding as us.
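
For reference, this is what LSB-first bit-packing of booleans looks like; a minimal standalone sketch, not the crate's encoder:

/// Pack booleans 8 per byte, least-significant bit first, as the Parquet
/// PLAIN encoding for BOOLEAN requires. Standalone illustration only.
fn pack_bools_lsb(values: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        if v {
            out[i / 8] |= 1 << (i % 8); // bit 0 of each byte holds the first value
        }
    }
    out
}

fn main() {
    // true, false, true, true -> 0b0000_1101 = 13
    assert_eq!(pack_bools_lsb(&[true, false, true, true]), vec![13]);
}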

Problems installing Thrift

I was rushing through this a bit, to be honest, so I may have failed to properly read all the instructions, but I copied the steps from the Travis YAML file to clone the Thrift repo.

My build fails with:

$ make
make  all-recursive
make[1]: Entering directory '/home/andy/git/thrift'
Making all in compiler/cpp
make[2]: Entering directory '/home/andy/git/thrift/compiler/cpp'
Making all in src
make[3]: Entering directory '/home/andy/git/thrift/compiler/cpp/src'
/bin/bash ../../../ylwrap thrift/thrifty.yy y.tab.c thrift/thrifty.cc y.tab.h `echo thrift/thrifty.cc | sed -e s/cc$/hh/ -e s/cpp$/hpp/ -e s/cxx$/hxx/ -e s/c++$/h++/ -e s/c$/h/` y.output thrift/thrifty.output -- yacc -d 
yacc: e - line 1 of "/home/andy/git/thrift/compiler/cpp/src/thrift/thrifty.yy", syntax error
%code requires {
^

I'm using Ubuntu 16.04.4 LTS

Add row group filtering

Right now we deserialize the file metadata from Thrift, but this process has to handle all row groups, which is inefficient when only a subset is needed. It would be better to do something similar to parquet-mr.
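
As a hypothetical illustration of what such a filter could look like (the RowGroupInfo struct and the predicate hook are assumptions for the sketch, not existing types in the crate):

// Hypothetical sketch: a caller-supplied predicate decides which row groups to
// keep, so only those go through full Thrift-to-native metadata conversion.
struct RowGroupInfo {
    index: usize,
    num_rows: i64,
    total_byte_size: i64,
}

fn filter_row_groups<P>(row_groups: Vec<RowGroupInfo>, mut predicate: P) -> Vec<RowGroupInfo>
where
    P: FnMut(&RowGroupInfo) -> bool,
{
    // Row groups that fail the predicate are skipped entirely.
    row_groups.into_iter().filter(|rg| predicate(rg)).collect()
}

fn main() {
    let groups = vec![
        RowGroupInfo { index: 0, num_rows: 1000, total_byte_size: 4096 },
        RowGroupInfo { index: 1, num_rows: 0, total_byte_size: 0 },
    ];
    // Keep only non-empty row groups.
    let kept = filter_row_groups(groups, |rg| rg.num_rows > 0);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].index, 0);
}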

File coverage is not shown on coveralls

When I look up coverage per file for a commit, coveralls reports the following:

The file "file/metadata.rs" isn't available on github. Either it's been removed, or the repo root directory needs to be updated.

I found that one could set repository root directory in coveralls: lemurheavy/coveralls-public#886 (comment)

@sunchao Could you have a look and possibly fix it in coveralls, so we can see which lines are not covered and add more unit tests?

Add benchmarks

To evaluate performance, we should add benchmarks for encoding & decoding. Perhaps we can borrow some from the C++/Java version and compare the performance between these implementations.
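
A minimal sketch of what such a bench could look like with the nightly test::Bencher harness; the encoding function here is a stand-in, not the crate's encoder:

// benches/encoding.rs -- requires nightly and `cargo bench`.
#![feature(test)]
extern crate test;

use test::Bencher;

// Stand-in "encoder": delta-encode an i32 slice; a real bench would call the
// crate's PLAIN/dictionary/delta encoders instead.
fn delta_encode(values: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(values.len());
    let mut prev = 0i32;
    for &v in values {
        out.push(v.wrapping_sub(prev));
        prev = v;
    }
    out
}

#[bench]
fn bench_delta_encode(b: &mut Bencher) {
    let values: Vec<i32> = (0..4096).collect();
    // black_box prevents the optimizer from discarding the result.
    b.iter(|| test::black_box(delta_encode(&values)));
}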

Issue when reading PLAIN_DICTIONARY fields

I am currently working on the Usage/Examples section in the README and wanted to add a couple of code snippets on how to use the library (I hope you do not object to this). While working on it, I found a potential issue when reading columns with PLAIN_DICTIONARY encoding.

I am trying to read the data/alltypes_plain.snappy.parquet file that is in the repository, using the code attached below.

I discovered that this returns empty buffers when I read values. For example, the first column in the Parquet file is "id", which has values 6 and 7:

column 0:
--------------------------------------------------------------------------------
column type: INT32
column path: "id"
encodings: RLE PLAIN_DICTIONARY PLAIN
file path: N/A
file offset: 55
num of values: 2
total compressed size (in bytes): 51
total uncompressed size (in bytes): 47
data page offset: 27
index page offset: N/A
dictionary page offset: 4

When I run my code it prints the following:

Row group: 0
Read column 0: 0 values (0 levels): [0, 0, 0, 0, 0, 0, 0, 0]
! Skip column 1, unknown type: BOOLEAN
Read column 2: 0 values (0 levels): [0, 0, 0, 0, 0, 0, 0, 0]
Read column 3: 0 values (0 levels): [0, 0, 0, 0, 0, 0, 0, 0]
Read column 4: 0 values (0 levels): [0, 0, 0, 0, 0, 0, 0, 0]
! Skip column 5, unknown type: INT64
! Skip column 6, unknown type: FLOAT
! Skip column 7, unknown type: DOUBLE
! Skip column 8, unknown type: BYTE_ARRAY
! Skip column 9, unknown type: BYTE_ARRAY
! Skip column 10, unknown type: INT96

All returned buffers are empty.

It turns out that there are a couple of issues with reading values: when reading the dictionary page we should not increment the number of seen values, because those values belong to the dictionary and we have not read any actual values yet; there is also a bug in the encoding assignment for the data page.

The diff below also contains an updated test_file_reader test:

diff --git a/src/column/reader.rs b/src/column/reader.rs
index 9d36918..cea0bc5 100644
--- a/src/column/reader.rs
+++ b/src/column/reader.rs
@@ -213,9 +213,11 @@ impl<'a, T: DataType> ColumnReaderImpl<'a, T> where T: 'static {
                 self.def_level_decoder = Some(def_decoder);
               }
 
-              if encoding == Encoding::PLAIN_DICTIONARY {
-                encoding == Encoding::RLE_DICTIONARY;
-              }
+              let encoding = if encoding == Encoding::PLAIN_DICTIONARY {
+                Encoding::RLE_DICTIONARY
+              } else {
+                encoding
+              };
 
               let decoder =
                 if encoding == Encoding::RLE_DICTIONARY {
diff --git a/src/file/reader.rs b/src/file/reader.rs
index ea7871d..5863f57 100644
--- a/src/file/reader.rs
+++ b/src/file/reader.rs
@@ -308,7 +308,6 @@ impl PageReader for SerializedPageReader {
           assert!(page_header.dictionary_page_header.is_some());
           let dict_header = page_header.dictionary_page_header.as_ref().unwrap();
           let is_sorted = dict_header.is_sorted.unwrap_or(false);
-          self.seen_num_values += dict_header.num_values as i64;
           Page::DictionaryPage {
             buf: ByteBufferPtr::new(buffer), num_values: dict_header.num_values as u32,
             encoding: Encoding::from(dict_header.encoding), is_sorted: is_sorted
@@ -396,7 +395,7 @@ mod tests {
     let mut page_reader_0: Box<PageReader> = page_reader_0_result.unwrap();
     let mut page_count = 0;
     while let Ok(Some(page)) = page_reader_0.get_next_page() {
-      let is_dict_type = match page {
+      let is_expected_page = match page {
         Page::DictionaryPage{ buf, num_values, encoding, is_sorted } => {
           assert_eq!(buf.len(), 32);
           assert_eq!(num_values, 8);
@@ -404,12 +403,22 @@ mod tests {
           assert_eq!(is_sorted, false);
           true
         },
-        _ => false
+        Page::DataPage { buf, num_values, encoding, def_level_encoding, rep_level_encoding } => {
+          assert_eq!(buf.len(), 11);
+          assert_eq!(num_values, 8);
+          assert_eq!(encoding, Encoding::PLAIN_DICTIONARY);
+          assert_eq!(def_level_encoding, Encoding::RLE);
+          assert_eq!(rep_level_encoding, Encoding::BIT_PACKED);
+          true
+        },
+        _ => {
+          false
+        }
       };
-      assert!(is_dict_type);
+      assert!(is_expected_page);
       page_count += 1;
     }
-    assert_eq!(page_count, 1);
+    assert_eq!(page_count, 2);
   }
 
   fn get_test_file<'a>(file_name: &str) -> fs::File {

With these changes I get the following output, which looks correct:

Row group: 0
Read column 0: 2 values (0 levels): [6, 7, 0, 0, 0, 0, 0, 0]
! Skip column 1, unknown type: BOOLEAN
Read column 2: 2 values (0 levels): [0, 1, 0, 0, 0, 0, 0, 0]
Read column 3: 2 values (0 levels): [0, 1, 0, 0, 0, 0, 0, 0]
Read column 4: 2 values (0 levels): [0, 1, 0, 0, 0, 0, 0, 0]
! Skip column 5, unknown type: INT64
! Skip column 6, unknown type: FLOAT
! Skip column 7, unknown type: DOUBLE
! Skip column 8, unknown type: BYTE_ARRAY
! Skip column 9, unknown type: BYTE_ARRAY
! Skip column 10, unknown type: INT96

Add CLI tools for reading files

Add CLI tools like parquet-mr's parquet-tools as binaries, so people can use them to inspect files. We have already started doing this for file schema/metadata; we should add something to list the records in a file, e.g. in JSON or tabular format.

Add type-safe accessors to Row

When processing parquet files I would like the ability to get values like this:

let a = row.get_bool().unwrap();
let b = row.get_float().unwrap();
...

Currently I have to write pattern-matching code each time I want to extract a value.
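
A minimal sketch of what such accessors could look like, assuming row fields are stored in a variant enum; the Field enum is illustrative (not necessarily the crate's representation), and the column index parameter is an addition for concreteness:

// Illustrative sketch of type-safe accessors over a variant field type.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Bool(bool),
    Float(f32),
    Str(String),
}

struct Row {
    fields: Vec<Field>,
}

impl Row {
    /// Returns the boolean at column `i`, or None if the column holds another type.
    fn get_bool(&self, i: usize) -> Option<bool> {
        match self.fields.get(i) {
            Some(Field::Bool(v)) => Some(*v),
            _ => None,
        }
    }

    /// Returns the float at column `i`, or None if the column holds another type.
    fn get_float(&self, i: usize) -> Option<f32> {
        match self.fields.get(i) {
            Some(Field::Float(v)) => Some(*v),
            _ => None,
        }
    }
}

fn main() {
    let row = Row { fields: vec![Field::Bool(true), Field::Float(1.5)] };
    assert_eq!(row.get_bool(0), Some(true));
    assert_eq!(row.get_float(1), Some(1.5));
    assert_eq!(row.get_bool(1), None); // wrong type -> None
}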

Boost test coverage

Right now the test coverage is quite poor. We should improve it before working on new features.

Issue when reading required INT32 column with PLAIN encoding

For required fields with PLAIN encoding, the current read_batch code in ColumnReaderImpl goes into an infinite loop. When a field is required it has no repetition/definition levels, so values_to_read keeps being set to 0, which makes the buffer slice &[values_read..values_read + values_to_read] empty. Decoding an empty buffer returns 0, so the loop never progresses.

The fix is to set values_to_read explicitly when the field is required (the assertion is optional):

diff --git a/src/column/reader.rs b/src/column/reader.rs
index 9d36918..2a230e8 100644
--- a/src/column/reader.rs
+++ b/src/column/reader.rs
@@ -154,6 +154,8 @@ impl<'a, T: DataType> ColumnReaderImpl<'a, T> where T: 'static {
             }
           }
         }
+      } else {
+        values_to_read = batch_size;
       }
 
       if self.descr.max_rep_level() > 0 && rep_levels.is_some() {
diff --git a/src/encodings/decoding.rs b/src/encodings/decoding.rs
index cf98276..a76ace8 100644
--- a/src/encodings/decoding.rs
+++ b/src/encodings/decoding.rs
@@ -127,6 +127,7 @@ default impl<T: DataType> Decoder<T> for PlainDecoder<T> {
   #[inline]
   fn get(&mut self, buffer: &mut [T::T]) -> Result<usize> {
     assert!(self.data.is_some());
+    assert!(buffer.len() != 0);
 
     let data = self.data.as_mut().unwrap();
     let num_values = cmp::min(buffer.len(), self.num_values);

Here is the schema information of a file for which it fails:

version: 1
num of rows: 4
created by: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
message spark_schema {
  REQUIRED INT32 a;
  REQUIRED INT64 b;
  OPTIONAL BYTE_ARRAY c;
  REQUIRED group d {
    REQUIRED INT32 a;
    REQUIRED INT64 b;
    OPTIONAL BYTE_ARRAY c;
  }
  REQUIRED group e (LIST) {
    REPEATED group list {
      REQUIRED INT32 element;
    }
  }
}

Releasing 0.2.0?

We made several important changes to the code base recently:

  1. Got rid of the Thrift dependency, which makes this crate much easier for other projects to use (#82).
  2. Updated the documentation; with 1) we will be able to publish the docs to docs.rs, which will be very useful (#81).
  3. Improved decoding performance for various encodings, including dict, rle, delta_binary_packed, etc. (#80).

I think it may be a good time to publish version 0.2.0.

Build fails with Rust 1.25-nightly

Looks like the build fails on the latest nightly version:

nightly-x86_64-unknown-linux-gnu (default)
rustc 1.25.0-nightly (27a046e93 2018-02-18)

I get the following errors:

error[E0599]: no method named `set_data` found for type `encodings::decoding::PlainDecoder<T>` in the current scope
   --> src/column/reader.rs:443:18
    |
443 |       dictionary.set_data(page.buffer().clone(), num_values as usize)?;
    |                  ^^^^^^^^
    | 
   ::: src/encodings/decoding.rs:83:1
    |
83  | pub struct PlainDecoder<T: DataType> {
    | ------------------------------------ method `set_data` not found for this
    |
    = note: the method `set_data` exists but the following trait bounds were not satisfied:
            `encodings::decoding::PlainDecoder<T> : encodings::decoding::Decoder<_>`
    = help: items from traits can only be used if the trait is implemented and in scope
    = note: the following trait defines an item `set_data`, perhaps you need to implement it:
            candidate #1: `encodings::decoding::Decoder`

error[E0275]: overflow evaluating the requirement `encodings::decoding::PlainDecoder<T>: encodings::decoding::Decoder<T>`
   --> src/column/reader.rs:446:24
    |
446 |       decoder.set_dict(Box::new(dictionary))?;
    |                        ^^^^^^^^^^^^^^^^^^^^
    |
    = note: required because of the requirements on the impl of `encodings::decoding::Decoder<T>` for `encodings::decoding::PlainDecoder<T>`
    = note: required for the cast to the object type `encodings::decoding::Decoder<T>`

error: aborting due to 2 previous errors

error: Could not compile `parquet`.

Suggestion: Add examples

With a new Rust project I usually go look at the examples first. I think it would be helpful to have some simple examples for reading parquet files. Perhaps one column-oriented and one row-oriented?
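
As a starting point, here is a minimal column-oriented sketch based on the read API used elsewhere in this tracker (SerializedFileReader, get_row_group, get_column_reader, read_batch); error handling is simplified and the file path is a placeholder.

// examples/read_column.rs -- sketch only, built on the crate's read API as
// shown in the Arrow prototype above.
use std::fs::File;

use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Placeholder path -- point this at a real Parquet file.
    let file = File::open("data/alltypes_plain.snappy.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();

    let row_group = reader.get_row_group(0).unwrap();
    for i in 0..row_group.num_columns() {
        if let Ok(ColumnReader::Int32ColumnReader(mut r)) = row_group.get_column_reader(i) {
            let mut values = vec![0i32; 16];
            // No definition/repetition level buffers requested here.
            let (num_values, _levels) = r.read_batch(16, None, None, &mut values).unwrap();
            println!("column {}: {:?}", i, &values[..num_values]);
        }
    }
}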

Update and refactor tests in src/column/reader.rs

Currently we have tests in src/column/reader.rs, which are split into 3 main functions: test_plain, test_dict and test_read_batch. We should refactor the common functionality and improve the test structure so that it is easy to add new tests in the future, e.g. for different encodings or types (see the sketch below).

Relates to #48.
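
One possible shape for the refactor; a sketch only, with a generic round-trip helper that the per-encoding tests would call. The helper name and the closure-based encode/decode hooks are assumptions, not existing test code.

// Sketch of a shared test helper: each test supplies an encode and a decode
// step, and the helper does the round-trip check. Names are illustrative.
#[cfg(test)]
mod tests {
    fn run_roundtrip_test<T, E, D>(input: &[T], encode: E, decode: D)
    where
        T: PartialEq + std::fmt::Debug + Clone,
        E: Fn(&[T]) -> Vec<u8>,
        D: Fn(&[u8], usize) -> Vec<T>,
    {
        let encoded = encode(input);
        let decoded = decode(&encoded, input.len());
        assert_eq!(decoded, input.to_vec());
    }

    #[test]
    fn test_plain_i32_roundtrip() {
        // Stand-in plain encoding: little-endian i32s.
        let values: Vec<i32> = (0..100).collect();
        run_roundtrip_test(
            &values,
            |vals| vals.iter().flat_map(|v| v.to_le_bytes().to_vec()).collect(),
            |bytes, n| {
                (0..n)
                    .map(|i| {
                        let mut buf = [0u8; 4];
                        buf.copy_from_slice(&bytes[i * 4..i * 4 + 4]);
                        i32::from_le_bytes(buf)
                    })
                    .collect()
            },
        );
    }
}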

Cannot run parquet-read binary

I cloned the repo, ran cargo clean and then cargo build, but I cannot run the parquet-read binary.

andy@freedom:~/git/parquet-rs$ cargo build
   Compiling cc v1.0.9
   Compiling libc v0.2.40
   Compiling cfg-if v0.1.2
   Compiling void v1.0.2
   Compiling alloc-no-stdlib v1.2.0
   Compiling num-traits v0.2.2
   Compiling integer-encoding v1.0.5
   Compiling lazy_static v1.0.0
   Compiling byteorder v1.2.2
   Compiling parquet v0.1.0 (file:///home/andy/git/parquet-rs)
   Compiling try_from v0.2.2
   Compiling byteorder v0.5.3
   Compiling quick-error v1.2.1
   Compiling x86intrin v0.4.5
   Compiling log v0.4.1
   Compiling unreachable v0.1.1
   Compiling brotli-decompressor v1.2.1
   Compiling snap v0.2.4
   Compiling num_cpus v1.8.0
   Compiling rand v0.4.2
   Compiling num-traits v0.1.43
   Compiling log v0.3.9
   Compiling ordered-float v0.5.0
   Compiling threadpool v1.7.1
   Compiling miniz-sys v0.1.10
   Compiling thrift v0.0.4
   Compiling brotli v1.2.0
   Compiling flate2 v0.2.20
    Finished dev [unoptimized + debuginfo] target(s) in 23.33 secs

andy@freedom:~/git/parquet-rs$ ./target/debug/parquet-read 
./target/debug/parquet-read: error while loading shared libraries: libstd-990ef2cc9d91a9e2.so: cannot open shared object file: No such file or directory

I do not have this libstd library. The closest I have are:

$ find /usr -name "libstd*.so"
/usr/lib/x86_64-linux-gnu/coreutils/libstdbuf.so
/usr/lib/gcc/x86_64-linux-gnu/5/libstdc++.so
