In this issue, I would like to propose modifying the `parquet.RowReader` and `parquet.RowWriter` interfaces to improve the usability of those abstractions, and to reduce their overhead to improve application performance.
Context
At this time, the APIs in question have the following definitions:
```go
package parquet

type Row []Value

type RowReader interface {
	ReadRow(Row) (Row, error)
}

type RowWriter interface {
	WriteRow(Row) error
}
```
Qualities
The goal of these APIs is to allow the expression of generic algorithms working on sequences of parquet rows. For example, when writing rows to a `parquet.Writer`, applications can call `WriteRow` repeatedly to append rows and construct a parquet file. When reading rows, the API allows stream-like consumption of the content of a parquet file, which is useful when the dataset cannot be loaded entirely in memory (a common case when working with large parquet files).
Simplicity of the design
The use of `parquet.Row` offers a simple abstraction (a slice doesn't hide much about the implementation details), and enables constructs that leverage language features like `append` or `for ... range`. For example, some programs can simply iterate over the values of a row with a construct like this:
```go
for _, value := range row {
	...
}
```
Similarities with the `io` package
Being somewhat similar in design to `io.Reader` and `io.Writer`, this model also reuses concepts that Go developers are often familiar with, helping them ramp up when learning to use the package. The parallels go beyond the interfaces into APIs like `parquet.CopyRows`, which mimics the design of `io.Copy`, creating a strong sense of familiarity for developers who have worked with the standard `io` package.
Flaws
There are limitations in the design of these APIs that seem to have a greater impact than originally anticipated, and which we might be able to mitigate by striking different balances in the design.
Asymmetry of `parquet.RowReader` and `parquet.RowWriter`
When writing rows to a `parquet.RowWriter`, the application forms a full `parquet.Row` and passes it to the writer, allowing the `WriteRow` method to offer very simple input and output. The intuitive corresponding API for `parquet.RowReader` would be to consume and return the next row, for example:
```go
interface {
	ReadRow() (Row, error)
}
```
Performance considerations arise with this model, since the `parquet.Row` type needs to be somewhat flexible in the amount of memory that it consumes (e.g. repeated columns may cause the number of values in a row to vary). Returning a newly allocated `parquet.Row` value on every call would greatly increase GC pressure on the application, as well as impact the compute footprint due to extra malloc calls.
To address these concerns, the contract of `parquet.RowReader` could establish that the returned `parquet.Row` remains valid only until the next call to `ReadRow`, letting the reader reuse the buffer it had returned. While effective at addressing the performance concerns, this model weakens the safety of the application, because nothing in the language prevents it from retaining the `parquet.Row` and having the inner buffers unexpectedly mutated on the next call to `ReadRow`. Explicit copies of the `parquet.Row` values would be required when they need to be retained, but developers would need to remember to copy the rows, since the programs would appear correct either way from the compiler's perspective.
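To make the hazard concrete, here is a minimal self-contained sketch of this hypothetical buffer-reusing `ReadRow()` model, using toy `Value`/`Row` stand-ins and a toy `toyReader` type rather than the real package APIs:

```go
package main

import "fmt"

type Value int
type Row []Value

// toyReader mimics a hypothetical ReadRow() (Row, error) implementation
// that reuses a single internal buffer between calls.
type toyReader struct {
	rows [][]Value
	buf  Row
	next int
}

func (r *toyReader) ReadRow() (Row, error) {
	row := append(r.buf[:0], r.rows[r.next]...)
	r.next++
	r.buf = row
	return row, nil
}

func main() {
	r := &toyReader{rows: [][]Value{{1, 2}, {3, 4}}}

	first, _ := r.ReadRow()
	retained := first // aliases the reader's internal buffer!

	safe := make(Row, len(first))
	copy(safe, first) // an explicit copy survives the next call

	r.ReadRow() // overwrites the shared backing array

	fmt.Println(retained[0]) // 3: silently mutated
	fmt.Println(safe[0])     // 1: still valid
}
```

The compiler accepts both the aliasing and the copying version; only the copying one is correct, which is exactly the weakness described above.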
The current approach taken by parquet-go is to allow the application to pass a pre-allocated `parquet.Row` value that the reader will append values to. This makes memory management more explicit, allowing the application to choose where to allocate the buffer, or to pass `nil` and let the `append` calls allocate memory on demand. This model better leverages the language infrastructure to achieve greater efficiency, but at the expense of usability: the asymmetry between the writer and reader interfaces forces more complexity onto the application on the reader side.
Performance overhead of the API
While the interfaces are named `parquet.RowReader` and `parquet.RowWriter`, the models are more similar to iterators than readers/writers, since applications can only read or write a single row at a time. By contrast, the `io.Reader` and `io.Writer` interfaces allow bulk operations on arrays of bytes; they don't require applications to read or write bytes one by one.
Bulk operations yield performance benefits over iterator-like APIs, as they allow applications to balance memory and compute utilization: by using larger memory buffers, we can reduce the number of calls through the abstraction layers, amortizing the runtime cost of the abstractions.
This becomes more important the more wrappers we use, which is a common approach in Go programs. For example, one might wrap an `io.Reader` to transparently count the bytes read from an underlying medium. The cost of the interface calls increases linearly with the number of wrappers, quickly becoming a measurable portion of the compute time consumed by an application. Reducing the number of calls by allowing bulk operations is an effective way to get the best of both worlds: leveraging the power of Go interfaces while keeping the cost of going through the abstraction layers to a minimum.
The following screen capture shows a subset of a CPU profile measuring compute utilization during the `MergeRowGroups` benchmark. Because we are reading a single row at a time, we must cross through the abstraction layer each time to reach the column that we need to read values from. In this case, it accounts for ~16% of the CPU time used by this code path (and tends to increase linearly with the number of columns):
Proposal
With this proposal, we are attempting to strike a better balance in the design of the row readers and writers: one that retains the qualities of the existing design while also offering solutions to the issues highlighted above.
The change would consist of modifying the `parquet.RowReader` and `parquet.RowWriter` interfaces to adopt the following definitions:
```go
package parquet

type RowReader interface {
	ReadRows([]Row) (int, error)
}

type RowWriter interface {
	WriteRows([]Row) (int, error)
}
```
The interfaces would receive slices of rows, returning the number of rows that were successfully read or written (starting from the first).
On addressing the lack of symmetry between the reader and writer
As we can see in the interface definitions, the methods have the same inputs and outputs, removing the asymmetry that exists today between the interfaces.
This model mimics the standard `io.Reader` and `io.Writer` interfaces more closely, as well as the `parquet.ValueReader` and `parquet.ValueWriter` interfaces from this package. It would result in APIs that are better understood by developers, smoothing the learning curve and reducing the cognitive overhead of using parquet-go.
On addressing the performance overhead of the API
With the proposed change, multiple rows can be read or written in each call to `ReadRows` or `WriteRows`. Applications that intend to achieve high throughput of row processing (whether reading or writing) can increase the row buffer size to read or write more rows per call, amortizing the cost of going through the abstraction layers.
To prevent allocations of individual rows on each call to `ReadRows`, the readers must be allowed to append to the `Row` values if entries of the `[]parquet.Row` slice are set to non-nil values. This contract not only optimizes memory allocations by reusing the backing arrays of the slices, it also allows for a simpler and less error-prone approach than the current model. The following example shows how code scanning rows combines both simplicity and efficiency:
```go
var rows [100]parquet.Row
for {
	n, err := reader.ReadRows(rows[:])
	// The call to ReadRows has reused the inner buffers of the `rows`
	// array, setting each row to the column values while retaining the
	// backing array of each slice to be reused in the next call.
	for _, row := range rows[:n] {
		...
	}
	if err != nil {
		...
	}
}
```
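On the writer side, a symmetric loop can drain a batch of rows, handling partial writes much as `io.Writer` callers do. The `memWriter` below is a toy in-memory implementation of the proposed contract (with toy `Value`/`Row` types), not part of the package:

```go
package main

import "fmt"

type Value int
type Row []Value

// memWriter is a toy RowWriter implementing the proposed
// WriteRows([]Row) (int, error) contract.
type memWriter struct {
	rows []Row
}

func (w *memWriter) WriteRows(rows []Row) (int, error) {
	w.rows = append(w.rows, rows...)
	return len(rows), nil
}

func main() {
	w := &memWriter{}
	buf := []Row{{1, 2}, {3}, {4, 5, 6}}

	// One call can write the whole batch; callers grow the buffer
	// to amortize the cost of crossing the interface.
	for len(buf) > 0 {
		n, err := w.WriteRows(buf)
		if err != nil {
			break
		}
		buf = buf[n:] // advance past rows successfully written
	}
	fmt.Println(len(w.rows)) // 3
}
```

Returning the count of rows written (starting from the first) lets a wrapper stop mid-batch while still reporting progress, exactly like the `io.Writer` convention.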
Other considerations
Non-contiguous grouping of column values in memory
The proposed solution highlights another missed opportunity: optimizing the layout of column values in memory. The `[]Row` type keeps the values of each row contiguous in memory, but the values of each column are spread across random locations.
A contiguous memory layout of columnar data can be achieved by reading values from columns individually through `parquet.ValueReader` (and reciprocally writing them through `parquet.ValueWriter`). Constructing an API that offers both the qualities of an efficient columnar layout AND the simplicity of manipulating rows is a balance that seems hard to strike, and would likely require much more advanced software constructs. It also seems to overlap with the goals of other projects such as Apache Arrow, which may be better suited to solve these problems than duplicating the efforts in parquet-go.
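To make the layout difference concrete, here is a small sketch (with toy types, assuming all rows have the same number of columns) that regroups row-oriented values into contiguous per-column slices:

```go
package main

import "fmt"

type Value int
type Row []Value

// transpose regroups row-oriented values into per-column slices,
// giving each column a single contiguous backing array.
// It assumes every row has the same width as the first one.
func transpose(rows []Row) [][]Value {
	if len(rows) == 0 {
		return nil
	}
	cols := make([][]Value, len(rows[0]))
	for i := range cols {
		cols[i] = make([]Value, 0, len(rows))
	}
	for _, row := range rows {
		for i, v := range row {
			cols[i] = append(cols[i], v)
		}
	}
	return cols
}

func main() {
	rows := []Row{{1, 10}, {2, 20}, {3, 30}}
	fmt.Println(transpose(rows)) // [[1 2 3] [10 20 30]]
}
```

Real parquet data complicates this considerably (repeated and optional columns make rows ragged), which is part of why a combined row-and-column API is hard to design.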
I believe there is room for both: the `parquet.RowReader` and `parquet.RowWriter` interfaces should be the logical medium that the package is built upon, while different APIs allow optimizations that accelerate processing through columnar data layouts.