scritchley / orc Goto Github PK

View Code? Open in Web Editor NEW

128.0 128.0 44.0 30.04 MB

An ORC file format reader and writer for Go.

Home Page: https://godoc.org/github.com/scritchley/orc

License: MIT License

Go 100.00%

big-data golang orc

orc's People

Contributors

Stargazers

Watchers

orc's Issues

Issues reading Orc's from Spark output

Noticed two issues reading Orc generated by Spark.

Reading a file getmerged out of HDFS results in proto: proto.StripeFooter: wiretype end group for non-group. The schema can be read fine, but can't access the stream.
Reading a file generated via Spark on local, iteration through the stripes isn't working (infinite loop) and the types are coming up as zeros.

For the former, can't replicate on local and the data is work related, but the latter you can replicate pretty easily:

How to get the rows number of a ORC file without reading all content ?

hi, @scritchley
I need to know the number of rows in a orc file without reading all rows. Does this package provide some method ? Thanks.

fix cursor.go

Hi, we are using the library to read and write orc files.

We've noticed an issue about "orc/cursor.go:85" 's comparison processing between currentRow and stripeRow Count.
Seems like there is no chance for currentRow's data to be reset, so if the file has more than 2 stripe, all the objects comming after the first one won't be read out. We've already done with the program modification which the data will return to be 0 in Stripes(), so if you don't mind, could you please merge it?

Thank you.

Files created by library not parseable by Presto when containing a multiple of 10,000 rows

Hello,
I have been using ORC files created by this library with AWS Athena and have been running into an issue when one of these files contains exactly 10,000 rows. It is failing a check in Presto that I honestly have no idea the intent of. Every other tool that deals with ORC files has no issues with the file.

Without fully understanding the ORC file format, I made a local change to the library that does not call the recordPositions function in flushWriters if recordPositions has just been called on the last write. This appears to fix the issue parsing the files and does not upset any of the unit tests. I'm not sure if there are other actions that should be skipped in this case, so I'm usure if this change creates files that are malformed in other ways. I'm hoping someone with a better understanding of the format can evaluate my patch and determine if it's the right action.

Here's the patch:

diff --git a/writer.go b/writer.go
index 63718e0..2b4d993 100644
--- a/writer.go
+++ b/writer.go
@@ -51,6 +51,7 @@ type Writer struct {
        indexOffset          uint64
        chunkOffset          uint64
        compressionCodec     CompressionCodec
+       lastRecordPositions  uint64
 }
 
 func ptrInt64(i int64) *int64 {
@@ -162,6 +163,7 @@ func (w *Writer) Write(values ...interface{}) error {
        if w.totalRows%uint64(w.footer.GetRowIndexStride()) == 0 {
                // Records and resets indexes for each writer.
                w.recordPositions()
+               w.lastRecordPositions = w.totalRows
 
                if w.treeWriters.size() >= w.stripeTargetSize {
                        return w.writeStripe()
@@ -217,7 +219,9 @@ func (w *Writer) flushWriters() error {
        if err := w.treeWriter.Flush(); err != nil {
                return err
        }
-       w.recordPositions()
+       if w.lastRecordPositions != w.totalRows {
+               w.recordPositions()
+       }
        return nil
 }

Thanks
-matt

Can there be a tagged release for go mod versioning?

Read ORC file missing partition key column

I have a ORC file which corresponding external hive table is created by(Presto):

CREATE TABLE IF NOT EXISTS db.aggregate.f_market_selected_hourly_gcyu (
    network_id                            BIGINT,
    asset_id                              BIGINT,
    series_id                             BIGINT,
    site_id                               BIGINT,
    site_section_id                       BIGINT,
    country_id                            BIGINT,
    time_position_class                   VARCHAR,
    device_type                           VARCHAR,
    dsp_id                                BIGINT,
    deal_id                               BIGINT,
    buyer_id                              BIGINT,
    buyer_group_id                        BIGINT,
    integration_type                      VARCHAR,
    error_code                            VARCHAR,
    asset_group_ids                       ARRAY<BIGINT>,
    site_section_group_ids                ARRAY<BIGINT>,
    error_frequency                       BIGINT,
    received_bid                          BIGINT,
    resolved_bid                          BIGINT,
    selected_primary_bid                  BIGINT,
    selected_fallback_bid                 BIGINT,
    selected_bid_in_watched_slot_primary  BIGINT,
    selected_bid_in_watched_slot_fallback BIGINT,
    total_received_bid_price              DOUBLE,
    total_resolved_bid_price              DOUBLE,
    total_bid_won_price                   DOUBLE,
    process_batch_id                      VARCHAR,
    demand_total_received_bid_price       DOUBLE,
    demand_total_resolved_bid_price       DOUBLE,
    demand_total_bid_won_price            DOUBLE,
    dsp_currency_id                       BIGINT,
    buyer_platform_id                     BIGINT,
    seat_id                               BIGINT,
    market_ad_id                          BIGINT,
    event_date                            TIMESTAMP
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['event_date'],
  external_location = 's3a://fw1-dev-eng/gcyu/orc_files/f_market_selected_hourly_gcyu/'
);

When reading this ORC file, all goes well but missing the partition key column event_date. Can anyone tell me if this is expected and why this happens?

Many thanks,
gcyu

How to support orc with zstd compression

When decoding files with zstd compression, it throws expection "unsupported compression kind zstd". So is there anyway to support ?

How to decrease memory usage when read orc file

I read an ORC file using code much like this:

	cursor := reader.Select(readColumns...)
	defer cursor.Close()
	err = cursor.PrepareStripe(sr.StripeObject.StripeIndex)
	for cursor.Next() {
		row0 := cursor.Row()
		row := make([]driver.Value, len(dataBlock.Columns.Columns))
		for i := range row0 {
			row[i] = driver.Value(row0[i])
		}
		for i, j := len(row0), 0; j < len(sr.StripeObject.ExtraValues); i, j = i+1, j+1 {
			row[i] = driver.Value(sr.StripeObject.ExtraValues[j])
		}
		dataBlock.Payload = append(dataBlock.Payload, row

I try to build a dataBlock with 20000 rows and then insert this dataBlock to ClickHouse, However this process consume too much memory. So is there any way which can be used to decrease memory usage when read orc file to build dataBlock?

Compatibility with Hive

Has anyone had any luck getting Hive to read ORC files written with this libary?

hive --orcfiledump test.orc

Currently, I'm getting the following error from Hive if I write the file with zlib compression:

Processing data file test.orc [length: 3403]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 8026884
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:257)
	at java.io.InputStream.read(InputStream.java:101)
	at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11063)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
	at org.apache.orc.tools.FileDump.main(FileDump.java:134)
	at org.apache.orc.tools.FileDump.main(FileDump.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

And this error if I write it without compression:

Processing data file test.orc [length: 3945]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
	at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
	at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124)
	at com.google.protobuf.CodedInputStream.readGroup(CodedInputStream.java:241)
	at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:488)
	at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11069)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
	at org.apache.orc.tools.FileDump.main(FileDump.java:134)
	at org.apache.orc.tools.FileDump.main(FileDump.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Using schema

		struct<
			t_int:int,
			t_int64:bigint,
			t_float32:float,
			t_float64:double,
			t_string:string,
			t_bool:boolean,
			t_timestamp:timestamp,
			t_list:array<int>,
			t_map:map<string,int>,
			t_nested:struct<
				t_int:int,
				t_int64:bigint,
				t_float32:float,
				t_float64:double,
				t_string:string,
				t_bool:boolean,
				t_timestamp:timestamp
			>
		>

can we read user metadata from orc?

Is there a way we can read orc user metadata using this library ? If yes, can you guide me through how to do that?

Add More Examples

I used this orc reader for a project I was working and I wanted to offer an example I needed which helped me. I had nested orc structures I wanted to convert to json. If would like a PR I would be happy to submit to the README.md

// Example 2

   r, err := Open("./examples/demo-12-zlib.orc")
    if err != nil {
        log.Fatal(err)
    }
    defer r.Close()

    rootColumns := r.Schema().Columns()
    count := len(rootColumns)
    values := make([]interface{}, count)

    // Create a new Cursor reading the provided columns.
    c := r.Select(rootColumns...)


    returnData := make(map[string]interface{})
    // Iterate over each stripe in the file.
    for c.Stripes() {
        
        // Iterate over each row in the stripe.
        for c.Next() {
              
            // Retrieve a slice of interface values for the current row.
            log.Println(c.Row())
            
            currentRow := c.Row()
	    for i, _ := range values {
		masterData[rootColumns[i]] = currentRow[i]
	    }
			
	    slicer, _ := json.Marshal(returnData)
	    fmt.Println(string(slicer))
        }
       
    }

    if err := c.Err(); err != nil {
        log.Fatal(err)
    }

This is taken from https://github.com/vfrank66/aws-s3-orc-to-kinesis.

Incorrect values returned by reader with double type and nil values.

When we create an orc file using the writer with a column of type double and has null values for certain rows, then reading the file returns nil value for all the rows for that column.

Sample code to replicate the issue

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io/ioutil"
	"strings"
	"sync"

	"github.com/scritchley/orc"
)

func generateSchema(columns map[string]string, colNames []string) (*orc.TypeDescription, error) {
	keys := colNames

	var sb strings.Builder
	sb.WriteString("struct<")

	// Ensure keys are sorted
	schemaKey := keys

	first := true
	for _, key := range schemaKey {

		typ := columns[key]
		if !first {
			sb.WriteByte(0x2c) // ,
		}
		first = false

		sb.WriteString(key)
		sb.WriteByte(0x3a) // :
		sb.WriteString(typ)
	}

	sb.WriteByte(0x3e) // >
	return orc.ParseSchema(sb.String())
}

func main() {
	filepath := "/Users/ankit.sinha/gopath/src/proto-learn/new-file.orc"
	writeFile(filepath)
	readFile(filepath)
}

func writeFile(filepath string) {
	schema, err := generateSchema(map[string]string{"val": "double"}, []string{"val"})
	b := &sync.Pool{
		New: func() interface{} {
			return bytes.NewBuffer(make([]byte, 0, 16*1<<20))
		},
	}
	buffer := b.Get().(*bytes.Buffer)

	writer, err := orc.NewWriter(buffer,
		orc.SetSchema(schema),
		orc.SetCompression(orc.CompressionZlib{Level: flate.DefaultCompression}))

	row := []interface{}{}
	row = append(row, 2.0)
	if err := writer.Write(row...); err != nil {
		fmt.Printf("%+v\n flush: error writing row", err)
	}
	row = []interface{}{}
	row = append(row, nil)
	if err := writer.Write(row...); err != nil {
		fmt.Printf("%+v\n flush: error writing row", err)
	}

	if err := writer.Close(); err != nil {
		fmt.Printf("%+v\n flush: error closing writer", err)
	}

	err = ioutil.WriteFile(filepath, buffer.Bytes(), 0644)
	if err != nil {
		fmt.Printf("%+v\n", err)
	}
}

func readFile(filepath string) {
	r, err := orc.Open(filepath)
	if err != nil {
		fmt.Printf("%+v\n", err)
	}
	defer r.Close()

	// Create a new Cursor reading the provided columns.
	c := r.Select("val")

	// Iterate over each stripe in the file.
	for c.Stripes() {
		// Iterate over each row in the stripe.
		for c.Next() {
			fmt.Printf("%+v\n", c.Row())

		}
	}

	if err := c.Err(); err != nil {
		fmt.Printf("%+v\n", err)
	}
}

Sample output when read using the same orc reader

[<nil>]
[<nil>]

Sample output when using the orc-tools library to read the same file.

{"val":2}
{"val":null}

NULLs in writer

Sorry if this is something obvious. Trying to write nils to float64, int64 and strings. Getting a number of errors (Expected float64 got and similar) Was wondering if that is supported and if so what's the right way to have null in a column? Otherwise non-null values are written just fine. Thanks a lot for an awesome work!

Support Bloom filter indexes

https://orc.apache.org/docs/spec-index.html

TimeStamp statistic not available?

Im trying to currently extract the timestamp static out of the fields, but it always seems like its nil.

I modified one of the writer test to just write a timestamp and read the file in, but theres no timestamp statistic.

c := r.Select("int1", "timestamp1")
	for c.Stripes() {
		for c.Next() {
			fmt.Println(r.getTypes())
			fmt.Println(c.Row())
		}
	}
        //custom defined method to get statistic for a certain field
	fmt.Println(r.ColumnStatistics("timestamp1"))

I see the type is recorded correctly, but theres no statistic.
numberOfValues:5 hasNull:false

The full output of statistics is
[numberOfValues:5 hasNull:false numberOfValues:5 hasNull:false numberOfValues:5 intStatistics:<minimum:51 maximum:9410 sum:22770 > hasNull:false ]

More writer types

orc.Date
orc.Decimal
(others?)

(Are you planning to promote the Writer to a supported state?)

Support ACID transactions

as per: https://orc.apache.org/docs/acid.html

Support writing with Snappy compression

subsequent calls to select do not reset row index (cursor)

Hi,

code is like:

		files, err := orc.Open(orcfile)
		if err != nil {
			println(err.Error())
		}

and later:

		c := files.Select("fid", "fname", "filesize")

First pass, it returns values.

When I don't open the file again and run c := files.Select("fid", "fname", "filesize") again later, it seems to return nothing.

It seems like the cursor runs to the end of the recordset and subsequent calls to renew the cursor do not reset it's position?

If I open the file again and run select, I get output.

my loop code like:

for c.Stripes() {
   println("stripe ok")
      for c.Next() {

subsequent calls, it doesn't get to the print statement.

Is README accurate?

It appears you have no writer support listed in the README but I see a writer implementation and it looks to have been committed after the update to the README. What is the current status of write capabilities?

Road Map?

Regarding "work in progress" status, do you have a road map in mind for this project?

Just came across the lib. I have some work-related interest in reading Orc files in Go. I'd be interesting in jumping in.

Missing opening brace in example program

The example program in your README.md is broken. The line:

for c.Stripes()

... should be ...

for c.Stripes() {

Why name can not start with sharp like this? struct<#field1:int,field2:int,field3:int>

Why name can not start with sharp like this? structstruct<#field1:int,field2:int,field3:int>

Set OrcProto.Footer.writer and PostScript.writerVersion

We've added a code for your writer implementation. You should set:

Footer.writer = 3
PostScript.writerVersion = 6 (or higher)

to ensure that the different ORC readers can distinguish between the various writers when they need to work around bugs in the writers.

Your implementation's writer id was added here - https://issues.apache.org/jira/browse/ORC-249

Verify row index implementation

Not sure whether the existing row index implementation is correct. The documentation is slightly hard to interpret. Particularly these sections from https://orc.apache.org/docs/spec-index.html:

To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.

For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.

Implement UnionTreeWriter

What is the benefit of iterating over the stripes?

I'm looking at the example in the README, and the loop looks like this:

...

// Iterate over each stripe in the file.
for c.Stripes() {
    
    // Iterate over each row in the stripe.
    for c.Next() {
          
        // Retrieve a slice of interface values for the current row.
        log.Println(c.Row())
        
    }
   
}

...

What is the reason for the user having to iterate over the stripes? It feels very low-level. An alternative would be to simplify the API to

...

// Iterate over each row in the stripe.
for c.Next() {

    // Retrieve a slice of interface values for the current row.
    log.Println(c.Row())
   
}

...

...and have the cursor handle the stripes internally. Thoughts?

Boolean data is written wrong after 10000 rows

Something that I discovered which debugging something was that boolean data gets written out incorrectly after 10000 rows have been written out. It could happen less since I have seen it in our services that are capped at 5000 rows, but I could consistently replicate the issue using the TestWriter test with 1000 rows.

I set up the test to write the following schema
struct<int1:int,boolean1:bool>

I wrote a combination of true, false, and nils out for the boolean values then read the file in row by row and did an assertion that it was the right row.

After 10000 rows, the test would break because of an assertion failure in which the boolean value for boolean1 was incorrect. Trues would be falses and falses would be true. I was able to mitigate this by changing the DefaultStripeTargetRowCount to 10000 and was then able to do a test with 100000 rows and have it pass.

I can't seem to find where in the boolean writer, or the tree writers this issue is occurring.

Repository description says this project has a writer

https://github.com/scritchley/orc says

An ORC file format reader and writer for Go.

...but this project doesn't implement a writer. How about updating the project description until that's actually in place?

Reading Decimals

I'm working on reading the OSM Planet data, which has a pair of columns for latitude and longitude, defined as : lat:decimal(9,7),lon:decimal(10,7)

On Open, typdescription's withPrecision fires off an error: "precision 9 is out of range of 1 .. 10". Reading the case for this:

if precision < 1 || precision > maxPrecision || t.scale > precision

I assume that t.scale (which defaults to 10) being larger than the lat precision is where the problem lies. Naively flipping that operator around does allow the file to be read, and I can read other columns fine, but both lat and lon are always {<nil> 0}, so I assume there's another problem elsewhere.

Thoughts?

The 50GB file in question is publicly available at s3://osm-pds/planet/planet-latest.orc if that helps. I'm also happy to provide more information as needed. Thanks

Proposal: Add Scan method on cursor implementation

Similar to the database/sql method. Scan should copy the columns in the current row into the values pointed at by dest. The number of values in dest must be the same as the number of columns in Rows.
func (c *Cursor) Scan(dest ...interface{}) error

Implement DecimalTreeWriter

Is this the intended behavior of the Reader?

Hello, is this the expected behavior of your Reader / Cursor? I'm getting unusual results.

(Implementation follows example output.)

Git ref

❯ cd $GOPATH/src/github.com/scritchley/orc
❯ git log --pretty=oneline --max-count=1
4020c3e12e90c23f58b84edb1738b3032704acb1 ...

Generated ORC using example writer

~/tmp/go/orc❯ go run main.go
~/tmp/go/orc❯ ll
total ...
-rw-r--r--  1 ...  254449427   337B Jul  6 08:57 hello.orc
-rw-r--r--  1 ...  254449427   1.4K Jul  5 18:48 main.go

Inspect ORC using Apache orc-contents

~/tmp/go/orc❯ orc-contents ./hello.orc
{"hello": "hi", "goodbye": "bye"}
{"hello": "ok", "goodbye": null}
{"hello": null, "goodbye": "ok"}

Read ORC using orc.Cursor + Row()

~/tmp/go/orc❯ go run ./main.go -read ./hello.orc -scan=false
next stripe
[]interface {}{"hi", "bye"}
[]interface {}{"ok", interface {}(nil)}
[]interface {}{interface {}(nil), "ok"}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}

Read ORC using orc.Cursor + Scan()

~/tmp/go/orc❯ go run ./main.go -read ./hello.orc
next stripe
[]interface {}{"ok", "ok"}
[]interface {}{"", interface {}(nil)}
[]interface {}{interface {}(nil), ""}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}

Example reader / writer

package main

import (
	"flag"
	"fmt"
	"os"

	"github.com/scritchley/orc"
)

var read = flag.String("read", "", "")
var scan = flag.Bool("scan", true, "")

func main() {
	flag.Parse()

	rr := *read
	switch {
	case *scan && rr != "":
		scanFile(rr)
	case rr != "":
		readFile(rr)
	}

	f, err := os.Create("hello.orc")
	if err != nil {
		panic(err)
	}

	schema, err := orc.ParseSchema("struct<" +
		"hello:string," +
		"goodbye:string>")
	if err != nil {
		panic(err)
	}

	r, err := orc.NewWriter(f, orc.SetSchema(schema))
	if err != nil {
		panic(err)
	}

	for _, row := range [][]interface{}{
		{"hi", "bye"},
		{"ok", nil},
		{nil, "ok"},
	} {
		if err := r.Write(row...); err != nil {
			panic(err)
		}
	}

	if err := r.Close(); err != nil {
		panic(err)
	}
}

func readFile(path string) {
	r, err := orc.Open(path)
	if err != nil {
		panic(err)
	}

	c := r.Select(r.Schema().Columns()...)

	for c.Stripes() {
		println("next stripe")
		for c.Next() {
			fmt.Printf("%#v\n", c.Row())
		}
	}

	if err := r.Close(); err != nil {
		panic(err)
	}
}

func scanFile(path string) {
	r, err := orc.Open(path)
	if err != nil {
		panic(err)
	}

	cols := r.Schema().Columns()
	c := r.Select(cols...)

	for c.Stripes() {
		println("next stripe")
		row := make([]interface{}, len(cols))
		for c.Next() {
			if err := c.Scan(row...); err != nil {
				panic(err)
			}
			fmt.Printf("%#v\n", row)
		}
	}

	if err := r.Close(); err != nil {
		panic(err)
	}
}

scritchley / orc Goto Github PK

orc's People

Contributors

Stargazers

Watchers

Forkers

orc's Issues

Git ref

Generated ORC using example writer

Inspect ORC using Apache orc-contents

Inspect ORC using Apache orc-metadata

Inspect ORC using orc-metadata --raw

Inspect ORC using orc-metadata --verbose

Inspect ORC using orc-metadata --raw --verbose

Read ORC using orc.Cursor + Row()

Read ORC using orc.Cursor + Scan()

Example reader / writer

Recommend Projects

Recommend Topics

Recommend Org