scritchley / orc
An ORC file format reader and writer for Go.
Home Page: https://godoc.org/github.com/scritchley/orc
License: MIT License
Noticed two issues reading ORC files generated by Spark:
1. Running getmerge out of HDFS results in proto: proto.StripeFooter: wiretype end group for non-group.
2. The schema can be read fine, but the stream can't be accessed.
For the former, I can't replicate it locally and the data is work related, but the latter you can replicate pretty easily:
Hi @scritchley,
I need to know the number of rows in an ORC file without reading all the rows. Does this package provide such a method? Thanks.
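The ORC file footer records numberOfRows, so the count is available without scanning any rows. A minimal sketch, assuming the Reader exposes that footer field through a NumRows-style accessor (this method name is an assumption; verify it against the godoc):
r, err := orc.Open("file.orc")
if err != nil {
	log.Fatal(err)
}
defer r.Close()
fmt.Println(r.NumRows()) // hypothetical accessor over the footer's numberOfRows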
Hi, we are using the library to read and write ORC files.
We've noticed an issue with the comparison between currentRow and the stripe row count at orc/cursor.go:85.
It seems there is no point at which currentRow is reset, so if the file has more than two stripes, none of the rows after the first stripe are read. We've already made the modification, resetting the counter to 0 in Stripes(); if you don't mind, could you please merge it?
Thank you.
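A minimal sketch of the fix being proposed, with hypothetical field names (the real comparison lives around orc/cursor.go:85):
// Hypothetical shape of the fix: reset the per-stripe row counter whenever
// the cursor advances to the next stripe, so rows in the second and later
// stripes are compared against that stripe's own row count.
func (c *Cursor) Stripes() bool {
	// ... advance to the next stripe as before ...
	c.currentRow = 0 // reset; otherwise the currentRow/stripeRowCount check stays exhausted
	return true
}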
Hello,
I have been using ORC files created by this library with AWS Athena and have been running into an issue when one of these files contains exactly 10,000 rows. It fails a check in Presto whose intent I honestly don't understand; every other tool that deals with ORC files has no issues with the file.
Without fully understanding the ORC file format, I made a local change to the library that skips the call to recordPositions in flushWriters if recordPositions has just been called on the last write. This appears to fix the issue parsing the files and does not upset any of the unit tests. I'm not sure whether there are other actions that should be skipped in this case, so I'm unsure whether this change creates files that are malformed in other ways. I'm hoping someone with a better understanding of the format can evaluate my patch and determine whether it's the right action.
Here's the patch:
diff --git a/writer.go b/writer.go
index 63718e0..2b4d993 100644
--- a/writer.go
+++ b/writer.go
@@ -51,6 +51,7 @@ type Writer struct {
 	indexOffset      uint64
 	chunkOffset      uint64
 	compressionCodec CompressionCodec
+	lastRecordPositions uint64
 }
 
 func ptrInt64(i int64) *int64 {
@@ -162,6 +163,7 @@ func (w *Writer) Write(values ...interface{}) error {
 	if w.totalRows%uint64(w.footer.GetRowIndexStride()) == 0 {
 		// Records and resets indexes for each writer.
 		w.recordPositions()
+		w.lastRecordPositions = w.totalRows
 	}
 	if w.treeWriters.size() >= w.stripeTargetSize {
 		return w.writeStripe()
@@ -217,7 +219,9 @@ func (w *Writer) flushWriters() error {
 	if err := w.treeWriter.Flush(); err != nil {
 		return err
 	}
 
-	w.recordPositions()
+	if w.lastRecordPositions != w.totalRows {
+		w.recordPositions()
+	}
 	return nil
 }
Thanks
-matt
I have an ORC file whose corresponding external Hive table was created by (Presto):
CREATE TABLE IF NOT EXISTS db.aggregate.f_market_selected_hourly_gcyu (
network_id BIGINT,
asset_id BIGINT,
series_id BIGINT,
site_id BIGINT,
site_section_id BIGINT,
country_id BIGINT,
time_position_class VARCHAR,
device_type VARCHAR,
dsp_id BIGINT,
deal_id BIGINT,
buyer_id BIGINT,
buyer_group_id BIGINT,
integration_type VARCHAR,
error_code VARCHAR,
asset_group_ids ARRAY<BIGINT>,
site_section_group_ids ARRAY<BIGINT>,
error_frequency BIGINT,
received_bid BIGINT,
resolved_bid BIGINT,
selected_primary_bid BIGINT,
selected_fallback_bid BIGINT,
selected_bid_in_watched_slot_primary BIGINT,
selected_bid_in_watched_slot_fallback BIGINT,
total_received_bid_price DOUBLE,
total_resolved_bid_price DOUBLE,
total_bid_won_price DOUBLE,
process_batch_id VARCHAR,
demand_total_received_bid_price DOUBLE,
demand_total_resolved_bid_price DOUBLE,
demand_total_bid_won_price DOUBLE,
dsp_currency_id BIGINT,
buyer_platform_id BIGINT,
seat_id BIGINT,
market_ad_id BIGINT,
event_date TIMESTAMP
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['event_date'],
external_location = 's3a://fw1-dev-eng/gcyu/orc_files/f_market_selected_hourly_gcyu/'
);
When reading this ORC file, everything works except that the partition key column event_date is missing. Can anyone tell me whether this is expected and why it happens?
Many thanks,
gcyu
When decoding files with zstd compression, the reader throws the exception "unsupported compression kind zstd". Is there any way to support it?
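Any zstd support would boil down to decompressing each compressed chunk body with a zstd decoder before the usual stream parsing. A self-contained sketch of that decompression step using github.com/klauspost/compress/zstd (this is not wired into the library's CompressionCodec interface, just the core operation it would need):
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

// decompressChunk decompresses one zstd-compressed chunk body.
func decompressChunk(compressed []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return dec.DecodeAll(compressed, nil)
}

func main() {
	// Round-trip example so the sketch is runnable.
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		log.Fatal(err)
	}
	compressed := enc.EncodeAll([]byte("hello orc"), nil)
	enc.Close()

	out, err := decompressChunk(compressed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out)) // hello orc
}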
I read an ORC file using code much like this:
cursor := reader.Select(readColumns...)
defer cursor.Close()
err = cursor.PrepareStripe(sr.StripeObject.StripeIndex)
for cursor.Next() {
	row0 := cursor.Row()
	row := make([]driver.Value, len(dataBlock.Columns.Columns))
	for i := range row0 {
		row[i] = driver.Value(row0[i])
	}
	for i, j := len(row0), 0; j < len(sr.StripeObject.ExtraValues); i, j = i+1, j+1 {
		row[i] = driver.Value(sr.StripeObject.ExtraValues[j])
	}
	dataBlock.Payload = append(dataBlock.Payload, row)
}
I try to build a dataBlock with 20000 rows and then insert the dataBlock into ClickHouse; however, this process consumes too much memory. Is there any way to decrease memory usage when reading the ORC file to build the dataBlock?
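One way to bound memory is to flush in smaller batches instead of accumulating all 20000 rows before inserting. A minimal sketch building on the loop above, assuming a hypothetical insertBatch function that performs the ClickHouse insert:
const batchSize = 2000 // bounds how many rows are held in memory at once

batch := make([][]driver.Value, 0, batchSize)
for cursor.Next() {
	row0 := cursor.Row()
	row := make([]driver.Value, len(row0))
	for i := range row0 {
		row[i] = driver.Value(row0[i])
	}
	batch = append(batch, row)
	if len(batch) == batchSize {
		if err := insertBatch(batch); err != nil { // insertBatch is hypothetical
			log.Fatal(err)
		}
		batch = batch[:0] // reuse the backing array between batches
	}
}
if len(batch) > 0 {
	if err := insertBatch(batch); err != nil {
		log.Fatal(err)
	}
}
Smaller batches trade a few extra network round trips for a flat memory profile.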
Has anyone had any luck getting Hive to read ORC files written with this library?
hive --orcfiledump test.orc
Currently, I'm getting the following error from Hive if I write the file with zlib compression:
Processing data file test.orc [length: 3403]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 8026884
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:257)
at java.io.InputStream.read(InputStream.java:101)
at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11063)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
at org.apache.orc.tools.FileDump.main(FileDump.java:134)
at org.apache.orc.tools.FileDump.main(FileDump.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
And this error if I write it without compression:
Processing data file test.orc [length: 3945]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124)
at com.google.protobuf.CodedInputStream.readGroup(CodedInputStream.java:241)
at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:488)
at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11069)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
at org.apache.orc.tools.FileDump.main(FileDump.java:134)
at org.apache.orc.tools.FileDump.main(FileDump.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Using schema:
struct<
t_int:int,
t_int64:bigint,
t_float32:float,
t_float64:double,
t_string:string,
t_bool:boolean,
t_timestamp:timestamp,
t_list:array<int>,
t_map:map<string,int>,
t_nested:struct<
t_int:int,
t_int64:bigint,
t_float32:float,
t_float64:double,
t_string:string,
t_bool:boolean,
t_timestamp:timestamp
>
>
Is there a way we can read ORC user metadata using this library? If yes, can you guide me through how to do that?
I used this ORC reader for a project I was working on, and I wanted to offer an example that helped me: I had nested ORC structures I wanted to convert to JSON. If you would like a PR, I would be happy to submit one to the README.md.
// Example 2
r, err := Open("./examples/demo-12-zlib.orc")
if err != nil {
	log.Fatal(err)
}
defer r.Close()
rootColumns := r.Schema().Columns()
count := len(rootColumns)
values := make([]interface{}, count)
// Create a new Cursor reading the provided columns.
c := r.Select(rootColumns...)
returnData := make(map[string]interface{})
// Iterate over each stripe in the file.
for c.Stripes() {
	// Iterate over each row in the stripe.
	for c.Next() {
		// Retrieve a slice of interface values for the current row.
		log.Println(c.Row())
		currentRow := c.Row()
		for i := range values {
			returnData[rootColumns[i]] = currentRow[i]
		}
		slicer, _ := json.Marshal(returnData)
		fmt.Println(string(slicer))
	}
}
if err := c.Err(); err != nil {
	log.Fatal(err)
}
This is taken from https://github.com/vfrank66/aws-s3-orc-to-kinesis.
When we create an ORC file using the writer with a column of type double that has null values for certain rows, reading the file back returns a nil value for every row in that column.
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io/ioutil"
	"strings"
	"sync"

	"github.com/scritchley/orc"
)

func generateSchema(columns map[string]string, colNames []string) (*orc.TypeDescription, error) {
	keys := colNames
	var sb strings.Builder
	sb.WriteString("struct<")
	// Ensure keys are sorted
	schemaKey := keys
	first := true
	for _, key := range schemaKey {
		typ := columns[key]
		if !first {
			sb.WriteByte(0x2c) // ,
		}
		first = false
		sb.WriteString(key)
		sb.WriteByte(0x3a) // :
		sb.WriteString(typ)
	}
	sb.WriteByte(0x3e) // >
	return orc.ParseSchema(sb.String())
}

func main() {
	filepath := "/Users/ankit.sinha/gopath/src/proto-learn/new-file.orc"
	writeFile(filepath)
	readFile(filepath)
}

func writeFile(filepath string) {
	schema, err := generateSchema(map[string]string{"val": "double"}, []string{"val"})
	b := &sync.Pool{
		New: func() interface{} {
			return bytes.NewBuffer(make([]byte, 0, 16*1<<20))
		},
	}
	buffer := b.Get().(*bytes.Buffer)
	writer, err := orc.NewWriter(buffer,
		orc.SetSchema(schema),
		orc.SetCompression(orc.CompressionZlib{Level: flate.DefaultCompression}))
	row := []interface{}{}
	row = append(row, 2.0)
	if err := writer.Write(row...); err != nil {
		fmt.Printf("%+v\n flush: error writing row", err)
	}
	row = []interface{}{}
	row = append(row, nil)
	if err := writer.Write(row...); err != nil {
		fmt.Printf("%+v\n flush: error writing row", err)
	}
	if err := writer.Close(); err != nil {
		fmt.Printf("%+v\n flush: error closing writer", err)
	}
	err = ioutil.WriteFile(filepath, buffer.Bytes(), 0644)
	if err != nil {
		fmt.Printf("%+v\n", err)
	}
}

func readFile(filepath string) {
	r, err := orc.Open(filepath)
	if err != nil {
		fmt.Printf("%+v\n", err)
	}
	defer r.Close()
	// Create a new Cursor reading the provided columns.
	c := r.Select("val")
	// Iterate over each stripe in the file.
	for c.Stripes() {
		// Iterate over each row in the stripe.
		for c.Next() {
			fmt.Printf("%+v\n", c.Row())
		}
	}
	if err := c.Err(); err != nil {
		fmt.Printf("%+v\n", err)
	}
}
Sample output when the file is read back using the same ORC reader:
[<nil>]
[<nil>]
Sample output when using the orc-tools library to read the same file.
{"val":2}
{"val":null}
Sorry if this is something obvious. I'm trying to write nils to float64, int64, and string columns and I'm getting a number of errors ("Expected float64 got ..." and similar). I was wondering whether that is supported and, if so, what the right way is to have a null in a column? Non-null values are written just fine. Thanks a lot for the awesome work!
I'm currently trying to extract the timestamp statistics for the fields, but they always seem to be nil.
I modified one of the writer tests to just write a timestamp and read the file back in, but there's no timestamp statistic.
c := r.Select("int1", "timestamp1")
for c.Stripes() {
	for c.Next() {
		fmt.Println(r.getTypes())
		fmt.Println(c.Row())
	}
}
// custom defined method to get the statistic for a certain field
fmt.Println(r.ColumnStatistics("timestamp1"))
I see the type is recorded correctly, but there's no statistic.
numberOfValues:5 hasNull:false
The full output of statistics is
[numberOfValues:5 hasNull:false numberOfValues:5 hasNull:false numberOfValues:5 intStatistics:<minimum:51 maximum:9410 sum:22770 > hasNull:false ]
orc.Date
orc.Decimal
(Are you planning to promote the Writer to a supported state?)
Hi,
my code is like:
files, err := orc.Open(orcfile)
if err != nil {
	println(err.Error())
}
and later:
c := files.Select("fid", "fname", "filesize")
On the first pass, it returns values.
When I don't open the file again and later rerun c := files.Select("fid", "fname", "filesize"), it seems to return nothing.
It seems like the cursor runs to the end of the recordset, and subsequent calls to renew the cursor do not reset its position?
If I open the file again and run Select, I get output.
My loop code is like:
for c.Stripes() {
	println("stripe ok")
	for c.Next() {
		// ...
	}
}
On subsequent calls, it doesn't get to the print statement.
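Based on the observation above that reopening the file works, a minimal workaround sketch is to open a fresh Reader (and therefore a fresh cursor) for each pass, using only the methods already shown in this report:
// readTwice does two full passes over the file by reopening it each time,
// since an exhausted cursor does not appear to reset its position.
func readTwice(orcfile string) error {
	for pass := 0; pass < 2; pass++ {
		files, err := orc.Open(orcfile)
		if err != nil {
			return err
		}
		c := files.Select("fid", "fname", "filesize")
		for c.Stripes() {
			for c.Next() {
				fmt.Println(c.Row())
			}
		}
		if err := c.Err(); err != nil {
			files.Close()
			return err
		}
		files.Close()
	}
	return nil
}
Whether Select is supposed to return a rewound cursor on an already-open Reader is the open question here.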
It appears you have no writer support listed in the README, but I see a writer implementation, and it looks to have been committed after the update to the README. What is the current status of write capabilities?
Regarding "work in progress" status, do you have a road map in mind for this project?
Just came across the lib. I have some work-related interest in reading ORC files in Go. I'd be interested in jumping in.
The example program in your README.md is broken. The line:
for c.Stripes()
... should be ...
for c.Stripes() {
Why can't a field name start with a sharp (#), like this? struct<#field1:int,field2:int,field3:int>
We've added a code for your writer implementation. You should set:
Footer.writer = 3
PostScript.writerVersion = 6 (or higher)
to ensure that the different ORC readers can distinguish between the various writers when they need to work around bugs in the writers.
Your implementation's writer id was added here - https://issues.apache.org/jira/browse/ORC-249
Not sure whether the existing row index implementation is correct. The documentation is slightly hard to interpret, particularly these sections from https://orc.apache.org/docs/spec-index.html:
To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.
For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.
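To make the quoted text concrete, here is a small illustrative sketch of the position tuples it describes, using hypothetical types (this is a reading of the spec, not the package's API):
// Per-stream position entries as described by the spec text above.
type uncompressedPosition struct {
	rleRunStart   uint64 // byte offset of the RLE run's start in the stream
	valuesIntoRun uint64 // number of values to consume from that run
}

type compressedPosition struct {
	chunkStart         uint64 // byte offset of the compression chunk in the stream
	decompressedOffset uint64 // decompressed bytes to consume within the chunk
	valuesIntoRun      uint64 // number of values to consume from the RLE run
}

// For a column with multiple streams, the spec concatenates the per-stream
// sequences into one flat list of numbers, which is the part the spec's
// author calls error-prone: a reader must know each stream's layout to know
// how many numbers belong to it.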
I'm looking at the example in the README, and the loop looks like this:
...
// Iterate over each stripe in the file.
for c.Stripes() {
// Iterate over each row in the stripe.
for c.Next() {
// Retrieve a slice of interface values for the current row.
log.Println(c.Row())
}
}
...
What is the reason for the user having to iterate over the stripes? It feels very low-level. An alternative would be to simplify the API to
...
// Iterate over each row in the file.
for c.Next() {
// Retrieve a slice of interface values for the current row.
log.Println(c.Row())
}
...
...and have the cursor handle the stripes internally. Thoughts?
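One way to prototype that without changing the package is a thin wrapper that advances stripes inside Next. A minimal sketch using only the cursor methods shown in the README (Stripes, Next, Row, Err):
// rowCursor hides stripe iteration: callers just call Next until it
// returns false, as in the simplified API proposed above.
type rowCursor struct {
	c interface {
		Stripes() bool
		Next() bool
		Row() []interface{}
		Err() error
	}
	inStripe bool
}

func (rc *rowCursor) Next() bool {
	for {
		if rc.inStripe && rc.c.Next() {
			return true
		}
		// First call, or the current stripe is exhausted: advance stripes.
		if !rc.c.Stripes() {
			return false
		}
		rc.inStripe = true
	}
}

func (rc *rowCursor) Row() []interface{} { return rc.c.Row() }
func (rc *rowCursor) Err() error         { return rc.c.Err() }
With this wrapper the loop collapses to: for rc.Next() { log.Println(rc.Row()) }.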
Something I discovered while debugging was that boolean data gets written out incorrectly after 10000 rows have been written. It may also happen with fewer rows, since I have seen it in our services that are capped at 5000 rows, but I could consistently replicate the issue using the TestWriter test with 1000 rows.
I set up the test to write the following schema
struct<int1:int,boolean1:bool>
I wrote a combination of true, false, and nil values for the boolean column, then read the file in row by row and asserted that each row was correct.
After 10000 rows, the test would break because of an assertion failure in which the boolean value for boolean1 was incorrect: trues would be falses and falses would be trues. I was able to mitigate this by changing the DefaultStripeTargetRowCount to 10000, and was then able to run a test with 100000 rows and have it pass.
I can't seem to find where in the boolean writer, or in the tree writers, this issue is occurring.
https://github.com/scritchley/orc says
An ORC file format reader and writer for Go.
...but this project doesn't implement a writer. How about updating the project description until that's actually in place?
I'm working on reading the OSM Planet data, which has a pair of columns for latitude and longitude, defined as: lat:decimal(9,7),lon:decimal(10,7)
On Open, TypeDescription's withPrecision fires an error: "precision 9 is out of range of 1 .. 10". Reading the check for this:
if precision < 1 || precision > maxPrecision || t.scale > precision
I assume that t.scale (which defaults to 10) being larger than the lat precision is where the problem lies. Naively flipping that operator around does allow the file to be read, and I can read other columns fine, but both lat and lon are always {<nil> 0}, so I assume there's another problem elsewhere.
Thoughts?
The 50GB file in question is publicly available at s3://osm-pds/planet/planet-latest.orc if that helps. I'm also happy to provide more information as needed. Thanks
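To illustrate the suspicion above: if scale defaults to 10 before the column's actual scale is parsed, the scale > precision clause fires for decimal(9,7) even though 7 <= 9. A hypothetical sketch of a validation order that avoids the false positive (not the package's code; ORC's documented maximum precision is 38):
// validateDecimal checks precision on its own first, and only compares
// scale once it has actually been parsed for this type.
func validateDecimal(precision, scale int, scaleParsed bool) error {
	const maxPrecision = 38
	if precision < 1 || precision > maxPrecision {
		return fmt.Errorf("precision %d is out of range of 1 .. %d", precision, maxPrecision)
	}
	if scaleParsed && scale > precision {
		return fmt.Errorf("scale %d exceeds precision %d", scale, precision)
	}
	return nil
}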
Similar to the database/sql method, Scan should copy the columns in the current row into the values pointed at by dest. The number of values in dest must be the same as the number of columns in the row.
func (c *Cursor) Scan(dest ...interface{}) error
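A minimal sketch of how such a Scan could sit on top of the existing Row output, using reflection to assign into pointer destinations (a sketch of the proposal, not the package's implementation):
import (
	"fmt"
	"reflect"
)

// scanRow copies row values into the pointers in dest, mirroring the
// database/sql contract described above.
func scanRow(row []interface{}, dest ...interface{}) error {
	if len(dest) != len(row) {
		return fmt.Errorf("scan: expected %d destinations, got %d", len(row), len(dest))
	}
	for i, d := range dest {
		dv := reflect.ValueOf(d)
		if dv.Kind() != reflect.Ptr || dv.IsNil() {
			return fmt.Errorf("scan: destination %d is not a non-nil pointer", i)
		}
		sv := reflect.ValueOf(row[i])
		if !sv.IsValid() {
			// A nil column value zeroes the destination.
			dv.Elem().Set(reflect.Zero(dv.Elem().Type()))
			continue
		}
		if !sv.Type().AssignableTo(dv.Elem().Type()) {
			return fmt.Errorf("scan: cannot assign %T to %s (column %d)", row[i], dv.Elem().Type(), i)
		}
		dv.Elem().Set(sv)
	}
	return nil
}
A Cursor method would then just be: func (c *Cursor) Scan(dest ...interface{}) error { return scanRow(c.Row(), dest...) }.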
Hello, is this the expected behavior of your Reader / Cursor? I'm getting unusual results.
(Implementation follows example output.)
❯ cd $GOPATH/src/github.com/scritchley/orc
❯ git log --pretty=oneline --max-count=1
4020c3e12e90c23f58b84edb1738b3032704acb1 ...
~/tmp/go/orc❯ go run main.go
~/tmp/go/orc❯ ll
total ...
-rw-r--r-- 1 ... 254449427 337B Jul 6 08:57 hello.orc
-rw-r--r-- 1 ... 254449427 1.4K Jul 5 18:48 main.go
~/tmp/go/orc❯ orc-contents ./hello.orc
{"hello": "hi", "goodbye": "bye"}
{"hello": "ok", "goodbye": null}
{"hello": null, "goodbye": "ok"}
http://pastebin.centos.org/120216/
http://pastebin.centos.org/120221/
http://pastebin.centos.org/120226/
http://pastebin.centos.org/120236/
~/tmp/go/orc❯ go run ./main.go -read ./hello.orc -scan=false
next stripe
[]interface {}{"hi", "bye"}
[]interface {}{"ok", interface {}(nil)}
[]interface {}{interface {}(nil), "ok"}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
~/tmp/go/orc❯ go run ./main.go -read ./hello.orc
next stripe
[]interface {}{"ok", "ok"}
[]interface {}{"", interface {}(nil)}
[]interface {}{interface {}(nil), ""}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
[]interface {}{interface {}(nil), interface {}(nil)}
package main

import (
	"flag"
	"fmt"
	"os"

	"github.com/scritchley/orc"
)

var read = flag.String("read", "", "")
var scan = flag.Bool("scan", true, "")

func main() {
	flag.Parse()
	rr := *read
	switch {
	case *scan && rr != "":
		scanFile(rr)
	case rr != "":
		readFile(rr)
	}
	f, err := os.Create("hello.orc")
	if err != nil {
		panic(err)
	}
	schema, err := orc.ParseSchema("struct<" +
		"hello:string," +
		"goodbye:string>")
	if err != nil {
		panic(err)
	}
	r, err := orc.NewWriter(f, orc.SetSchema(schema))
	if err != nil {
		panic(err)
	}
	for _, row := range [][]interface{}{
		{"hi", "bye"},
		{"ok", nil},
		{nil, "ok"},
	} {
		if err := r.Write(row...); err != nil {
			panic(err)
		}
	}
	if err := r.Close(); err != nil {
		panic(err)
	}
}

func readFile(path string) {
	r, err := orc.Open(path)
	if err != nil {
		panic(err)
	}
	c := r.Select(r.Schema().Columns()...)
	for c.Stripes() {
		println("next stripe")
		for c.Next() {
			fmt.Printf("%#v\n", c.Row())
		}
	}
	if err := r.Close(); err != nil {
		panic(err)
	}
}

func scanFile(path string) {
	r, err := orc.Open(path)
	if err != nil {
		panic(err)
	}
	cols := r.Schema().Columns()
	c := r.Select(cols...)
	for c.Stripes() {
		println("next stripe")
		row := make([]interface{}, len(cols))
		for c.Next() {
			if err := c.Scan(row...); err != nil {
				panic(err)
			}
			fmt.Printf("%#v\n", row)
		}
	}
	if err := r.Close(); err != nil {
		panic(err)
	}
}
Reading is currently supported; however, writing a compressed file is not. https://orc.apache.org/docs/compression.html
Hi -- I've noticed a number of issues have been closed or removed from the 1.0 milestone. Is this project still moving?
I have seen how to read Parquet files on HDFS, but I don't know how to read or write ORC files on HDFS. Could you add an example?