branky / cascading.hive
Provide support for reading/writing data in Hive native file format in Cascading.
License: Other
Hi,
Do you plan to release cascading.hive to a public Maven repository?
just an fyi really. https://github.com/bfabry/cascading.hive/tree/cdh5.4.4
Hi Team,
We are looking to read and write ORC files that have a MAP field. We found that the current version does not support MAP. Is there any way to support MAP? If not, is there any plan to support it in the near future? Please advise.
Thanks in advance.
Best Regards
Manoj K Nair
I am using the HCatTap to read from a partitioned Hive table. The table is partitioned into this pattern of paths:
hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=<partition>/
where every directory contains a file named MPP-CONSOLIDATED-OrderReport-.osv, giving the following example path:
hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501/MPP-CONSOLIDATED-OrderReport-.osv
I am getting this error:
Caused by: java.io.IOException: Not a file: hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:212)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1106)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1098)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:922)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Looking at the code, I see that HCatTap gets the location of every partition and passes it to the MultiSourceTap, but the actual source files live underneath each partition directory, so FileInputFormat is handed a directory where it expects a file.
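To make the fix concrete, here is a minimal, Hive-agnostic sketch (the class and method names are hypothetical, not part of cascading.hive) of the expansion HCatTap would need before handing paths on: replace each partition directory with the regular files it contains, so FileInputFormat never sees a bare directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class PartitionPathExpander {

    // For each partition location, emit the regular files inside it rather
    // than the directory itself; passing the directory is what triggers
    // "java.io.IOException: Not a file" in FileInputFormat.getSplits.
    static List<Path> expand(List<Path> partitionDirs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path dir : partitionDirs) {
            try (Stream<Path> children = Files.list(dir)) {
                children.filter(Files::isRegularFile)
                        .sorted()
                        .forEach(files::add);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Mimic .../mpp_order_report/p=20100501/<data file> on the local FS.
        Path root = Files.createTempDirectory("mpp_order_report");
        Path partition = Files.createDirectory(root.resolve("p=20100501"));
        Files.createFile(partition.resolve("part-00000.osv"));

        List<Path> expanded = expand(List.of(partition));
        System.out.println(expanded.size());               // 1
        System.out.println(expanded.get(0).getFileName()); // part-00000.osv
    }
}
```

The same idea applied inside HCatTap would mean listing each partition's location via the Hadoop FileSystem API before constructing the MultiSourceTap.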
Hi,
Does cascading.hive support multiple partitions for ORC when using Tez?
Thanks
Hi Team,
We are looking to create ORC files with partitions, but it looks like partitioning is not supported when writing an ORC file. Please advise.
Best Regards
Manoj K Nair
Hi,
The work here looks very interesting; however, I'm a little unsure as to the license under which it's released. Could you please clarify?
Hi there,
First, thanks for the good work.
I wanted to add support for RCFile in local mode for Scalding.
Basically, I wanted to create a class along the lines of RCFileLocalScheme extends Scheme <...>
so that we can override the localScheme of the RCFile class in the ColumnarSerDeSource.
Would that interest you?
Hi
We were testing PartitionTap on Tez (our input/output are ORC files) using the Cascading 3.0.0-wip-63 libs, Tez 0.5.3, and the cascading.hive 0.0.4 snapshot jar, and encountered the following ClassCastException:
Caused by: java.lang.ClassCastException: org.apache.tez.dag.api.TezConfiguration cannot be cast to org.apache.hadoop.mapred.JobConf
at cascading.hive.ORCFile.sinkConfInit(ORCFile.java:72)
at cascading.tap.Tap.sinkConfInit(Tap.java:206)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:399)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:106)
at cascading.tap.hadoop.io.TapOutputCollector.initialize(TapOutputCollector.java:96)
at cascading.tap.hadoop.io.TapOutputCollector.&lt;init&gt;(TapOutputCollector.java:91)
at cascading.tap.hadoop.PartitionTap.createTupleEntrySchemeCollector(PartitionTap.java:159)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.getCollector(BasePartitionTap.java:130)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.collect(BasePartitionTap.java:228)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:95)
at cascading.flow.stream.element.SinkStage.receive(SinkStage.java:98)
The failure is in the function
public void sinkConfInit(FlowProcess flowProcess, Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf) of ORCFile in cascading.hive.
It seems that ORCFile does not support receiving a TezConfiguration. Can you please check this?
Thanks.
Trying to write data through HCatalog in Parquet format, I see the exception below in the Hadoop logs. Could you please point out what is going wrong? I use cascading-hive built from the latest master (built right after Parquet support was mentioned). The code I use is extremely simple: just read data from one table and put it into another, with Parquet as the storage format.
Thanks in advance and best regards,
Boris
Error: java.lang.RuntimeException: Should never be used
at parquet.hive.DeprecatedParquetOutputFormat.getRecordWriter(DeprecatedParquetOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.init(MapTask.java:799)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:422)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
The Java type for Hive's binary type is supposed to be byte[] according to the object inspector. However, when a byte[] object is provided in the tuple, the RCFile scheme first converts the object to a String and then back to an array of bytes when sinking the field. This is an issue, since calling toString() on a byte[] does not return the actual contents of the array.
Is there another way around this issue, or do we have to make an exception for byte[] objects in the RCFile code?
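To illustrate the round-trip loss described above, here is a minimal stand-alone example (not cascading.hive code) comparing the toString()-based conversion with a direct byte copy:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayToStringDemo {
    public static void main(String[] args) {
        byte[] original = {72, 105, 0, (byte) 0xFF}; // "Hi" plus non-text bytes

        // What a toString()-then-getBytes() round trip produces: toString()
        // on an array yields an identity string like "[B@1b6d3586",
        // not the array's contents.
        byte[] viaToString = original.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(original, viaToString)); // false
        System.out.println(original.toString().startsWith("[B@")); // true

        // Sinking the field should copy the bytes directly instead.
        byte[] direct = Arrays.copyOf(original, original.length);
        System.out.println(Arrays.equals(original, direct)); // true
    }
}
```

This is why a byte[] sunk through the current RCFile path cannot be recovered: the identity string, not the payload, is what gets written.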
Please advise how to make cascading.hive work with Parquet through HCatalog.
I have a very simple application that works well if the table is text-based. When I switch to Parquet, a ClassNotFoundException arises pointing at HiveOutputFormat.
Thanks in advance.
I am making use of cascading.hive with Scalding. To do so, I created some new classes derived from the SchemedSource object in Scalding.
They may be of use to others, so I wanted to share them.
https://gist.github.com/hellertime/10020639
I'm happy to make a pull request with the changes, but wasn't sure how best to integrate the code, so I would appreciate some feedback in that regard before I begin.
-Chris
I was wondering if you have already deployed jars of this project and if so, in which repo they are. I searched conjars, but I did not find it. Thx!