branky / cascading.hive
Provide support for reading/writing data in Hive native file format in Cascading.
License: Other
Hi,
Do you plan to release cascading.hive to a public Maven repository?
just an fyi really. https://github.com/bfabry/cascading.hive/tree/cdh5.4.4
Hi Team,
We are looking to read and write ORC files that have a MAP field. We found that the current version does not support MAP. Is there any way to support MAP? If not, is there any plan to support it in the near future? Please advise.
Thanks in advance.
Best Regards
Manoj K Nair
I am using the HCatTap to read from a partitioned Hive table. The table is partitioned into this pattern of paths:
hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=<partition>/
where every directory contains a file named MPP-CONSOLIDATED-OrderReport-.osv, giving the following example path:
hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501/MPP-CONSOLIDATED-OrderReport-.osv
I am getting this error:
Caused by: java.io.IOException: Not a file: hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:212)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1106)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1098)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:922)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Looking at the code, I see that HCatTap gets the location of every partition and passes it to the MultiSourceTap, but the actual source files live underneath each partition directory, so FileInputFormat is handed a directory where it expects a file.
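To make the fix concrete, here is a minimal, Hive-agnostic sketch (the class and method names are hypothetical, not part of cascading.hive) of the expansion HCatTap would need before handing paths on: replace each partition directory with the regular files it contains, so FileInputFormat never sees a bare directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class PartitionPathExpander {

    // For each partition location, emit the regular files inside it rather
    // than the directory itself; passing the directory is what triggers
    // "java.io.IOException: Not a file" in FileInputFormat.getSplits.
    static List<Path> expand(List<Path> partitionDirs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path dir : partitionDirs) {
            try (Stream<Path> children = Files.list(dir)) {
                children.filter(Files::isRegularFile)
                        .sorted()
                        .forEach(files::add);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Mimic .../mpp_order_report/p=20100501/<data file> on the local FS.
        Path root = Files.createTempDirectory("mpp_order_report");
        Path partition = Files.createDirectory(root.resolve("p=20100501"));
        Files.createFile(partition.resolve("part-00000.osv"));

        List<Path> expanded = expand(List.of(partition));
        System.out.println(expanded.size());               // 1
        System.out.println(expanded.get(0).getFileName()); // part-00000.osv
    }
}
```

The same idea applied inside HCatTap would mean listing each partition's location via the Hadoop FileSystem API before constructing the MultiSourceTap.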
Hi,
Does cascading.hive support multiple partitions for ORC when using Tez?
Thanks
Hi Team,
We are looking to create ORC files with partitions, but it looks like partitioning is not supported when writing an ORC file. Please advise.
Best Regards
Manoj K Nair
Hi,
The work here looks very interesting; however, I'm a little unsure as to the license under which it's released. Could you please clarify?
Hi there,
First, thanks for the good work.
I wanted to add support for RCFile in local mode for Scalding.
Basically, I wanted to create a class along the lines of RCFileLocalScheme extends Scheme <...>
so that we can override the localScheme of the RCFile class in the ColumnarSerDeSource.
Would that interest you?
Hi
We were testing PartitionTap on Tez (our input/output are ORC files) using the Cascading 3.0.0-wip-63 libs, Tez 0.5.3, and the cascading.hive 0.0.4 snapshot jar, and encountered the following ClassCastException:
Caused by: java.lang.ClassCastException: org.apache.tez.dag.api.TezConfiguration cannot be cast to org.apache.hadoop.mapred.JobConf
at cascading.hive.ORCFile.sinkConfInit(ORCFile.java:72)
at cascading.tap.Tap.sinkConfInit(Tap.java:206)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:399)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:106)
at cascading.tap.hadoop.io.TapOutputCollector.initialize(TapOutputCollector.java:96)
at cascading.tap.hadoop.io.TapOutputCollector.&lt;init&gt;(TapOutputCollector.java:91)
at cascading.tap.hadoop.PartitionTap.createTupleEntrySchemeCollector(PartitionTap.java:159)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.getCollector(BasePartitionTap.java:130)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.collect(BasePartitionTap.java:228)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:95)
at cascading.flow.stream.element.SinkStage.receive(SinkStage.java:98)
The failure is in the function
public void sinkConfInit(FlowProcess flowProcess, Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf) of ORCFile in cascading.hive.
It seems that ORCFile does not support receiving a TezConfiguration. Can you please check this?
Thanks.
Trying to write data through HCatalog in Parquet format, I see the exception below in the Hadoop logs. Could you please point out what is going wrong? I use cascading-hive built from the latest master (built right after Parquet support was mentioned). The code I use is extremely simple: just read data from one table and put it into another, with Parquet as the storage format.
Thanks in advance and best regards,
Boris
Error: java.lang.RuntimeException: Should never be used
at parquet.hive.DeprecatedParquetOutputFormat.getRecordWriter(DeprecatedParquetOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.init(MapTask.java:799)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:422)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
The Java type for Hive's binary type is supposed to be byte[] according to the object inspector. However, when a byte[] object is provided in the tuple, the RCFile scheme first converts the object to a String and then back to an array of bytes when sinking the field. This is an issue, since calling toString() on a byte[] does not return the actual contents of the array.
Is there another way around this issue, or do we have to make an exception for byte[] objects in the RCFile code?
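To illustrate the round-trip loss described above, here is a minimal stand-alone example (not cascading.hive code) comparing the toString()-based conversion with a direct byte copy:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayToStringDemo {
    public static void main(String[] args) {
        byte[] original = {72, 105, 0, (byte) 0xFF}; // "Hi" plus non-text bytes

        // What a toString()-then-getBytes() round trip produces: toString()
        // on an array yields an identity string like "[B@1b6d3586",
        // not the array's contents.
        byte[] viaToString = original.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(original, viaToString)); // false
        System.out.println(original.toString().startsWith("[B@")); // true

        // Sinking the field should copy the bytes directly instead.
        byte[] direct = Arrays.copyOf(original, original.length);
        System.out.println(Arrays.equals(original, direct)); // true
    }
}
```

This is why a byte[] sunk through the current RCFile path cannot be recovered: the identity string, not the payload, is what gets written.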
Please advise how to make cascading.hive work with Parquet through HCatalog.
I have a very simple application that works well if the table is text-based. When I switch to Parquet, a ClassNotFoundException arises pointing at HiveOutputFormat.
Thanks in advance.
I am making use of cascading.hive with Scalding. To do so, I created some new classes derived from the SchemedSource object in Scalding.
They may be of use to others, so I wanted to share them.
https://gist.github.com/hellertime/10020639
I'm happy to make a pull request with the changes, but wasn't sure how best to integrate the code, so I would appreciate some feedback in that regard before I begin.
-Chris
I was wondering if you have already deployed jars of this project and if so, in which repo they are. I searched conjars, but I did not find it. Thx!