twitter / elephant-bird
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
License: Apache License 2.0
Hello-
Do you have any suggestions for ways to use Elephant-Bird-generated protobuf Writables with Hadoop Streaming? As far as I can tell, Hadoop Streaming more or less calls toString() on each input key and value and sends the result to stdout, which is a problem: protobufs don't serialize to text all that well, and their serializations often include newlines, which plays havoc with Streaming.
The best option I've come up with so far involves writing a custom InputFormat that Base64-encodes each protobuf, then decoding the Base64 in my streaming code and deserializing it back into a protocol buffer. This seems like it ought to work, but it also seems like a ton of extra overhead, and I feel like there's got to be a better way that I'm somehow missing. Googling around, it looks like Dumbo's "TypedBytes" might do some of what I need, but for a variety of reasons I can neither upgrade nor patch my Hadoop installation directly.
Thanks in advance for any ideas that anybody might have...
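For what it's worth, the Base64 route described above is less fragile than it may sound: standard Base64 output contains no newline or tab characters, so line-oriented Streaming handles it safely. Here is a minimal round-trip sketch of that idea (class and method names are mine, not elephant-bird's; I use `java.util.Base64` for brevity, where the code of this era would more likely use commons-codec):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the Base64 shim discussed above: encode each serialized protobuf
// before handing it to Streaming, decode it on the other side. Basic Base64
// output never contains '\n', so line-oriented Streaming stays intact.
public class B64Shim {
    // Encode raw message bytes into a single newline-free line.
    public static String encodeRecord(byte[] messageBytes) {
        return Base64.getEncoder().encodeToString(messageBytes);
    }

    // Decode one line of Streaming input back into the original message bytes.
    public static byte[] decodeRecord(String line) {
        return Base64.getDecoder().decode(line);
    }

    public static void main(String[] args) {
        // A payload with embedded newlines and NULs, as a serialized protobuf may contain.
        byte[] payload = "field1\nfield2\u0000binary".getBytes(StandardCharsets.ISO_8859_1);
        String line = encodeRecord(payload);
        if (line.contains("\n")) throw new AssertionError("newline leaked into the encoded line");
        byte[] back = decodeRecord(line);
        if (back.length != payload.length) throw new AssertionError("round trip failed");
    }
}
```

The custom InputFormat would emit `encodeRecord(proto.toByteArray())` as the value; the streaming script splits on tab, decodes, and calls `parseFrom()` on the result.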
We have run into some issues with LzoJsonRecordReader and JsonLoader where on some JSON data -- we haven't been able to narrow it down yet -- both classes will fail with a NullPointerException.
For instance, here is the relevant snippet of LzoJsonRecordReader#decodeLineToJson(). What seems to happen is that LzoJsonRecordReader#nextKeyValue() passes a null value to decodeLineToJson(). The problem might therefore originate in the line "int newSize = in_.readLine(currentLine_);" in nextKeyValue(). From what I remember, json-simple's parse() method can also return null (so adding a safeguard to LzoJsonLoader and JsonLoader might be a good idea anyway), but in this case the null value is passed to json-simple before the parser could itself return null.
A similar NPE issue exists for the JsonLoader#parseStringToTuple().
/*
 * LzoJsonRecordReader.java
 */
public static boolean decodeLineToJson(JSONParser parser, Text line, MapWritable value) {
  try {
    JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
    for (Object key : jsonObj.keySet()) { // *** The NPE is thrown here when jsonObj == null ***
      Text mapKey = new Text(key.toString());
      Text mapValue = new Text();
      if (jsonObj.get(key) != null) {
        mapValue.set(jsonObj.get(key).toString());
      }
      value.put(mapKey, mapValue);
    }
    return true;
  } catch (ParseException e) {
    LOG.warn("Could not json-decode string: " + line, e);
    return false;
  } catch (NumberFormatException e) {
    LOG.warn("Could not parse field into number: " + line, e);
    return false;
  }
}
A quick fix is to simply catch an NPE in LzoJsonRecordReader#decodeLineToJson() and JsonLoader#parseStringToTuple(), as is already done for e.g. ParseException. This fix has worked fine for us. As I said above, however, it might not address the root of the problem.
I am not so familiar with the code, so I am not sure whether just catching NPEs is the best fix. Maybe you have a better idea!
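An alternative to a blanket catch (NullPointerException) is explicit null guards on both the incoming line and the parser's result. The sketch below is independent of json-simple and Hadoop (the `parse` helper is a stand-in for JSONParser.parse(), which, like json-simple, may return null on bad input):

```java
import java.util.Map;

// Stand-in for the decodeLineToJson() guard discussed above. `parse` is a
// placeholder for JSONParser.parse(); like json-simple, it may return null.
public class JsonGuard {
    public static boolean decodeLine(String line, Map<String, String> out) {
        if (line == null) {          // nextKeyValue() handed us nothing
            return false;
        }
        Map<String, String> parsed = parse(line);
        if (parsed == null) {        // parser returned null instead of throwing
            return false;
        }
        out.putAll(parsed);
        return true;
    }

    // Hypothetical parser: returns null for anything that isn't "key=value".
    static Map<String, String> parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) return null;
        return Map.of(line.substring(0, eq), line.substring(eq + 1));
    }
}
```

Guarding explicitly keeps a genuine NPE elsewhere in the method from being silently swallowed, which a catch-all would hide.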
When I try `ant nonothing compile` on a machine with Thrift 0.6 installed, I get the output below. Clearly some protobuf references have slipped in. I'd be happy to help with fixes, but I don't know the code nearly well enough to make sensible changes.
Also, I find it hard to stomach installing Thrift 0.5, but note that if I simply don't use the Thrift support, commenting out the version check makes everything work for me.
Teds-MacBook-Pro:elephant-bird[master]$ ant nonothing compile
Buildfile: /Users/tdunning/Apache/tmp/elephant-bird/build.xml
init:
[mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build
[mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build/classes
[mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build/test
nonothing:
[echo] building without protobuf and thrift support
[javac] /Users/tdunning/Apache/tmp/elephant-bird/build.xml:103: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 71 source files to /Users/tdunning/Apache/tmp/elephant-bird/build/classes
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/input/MultiInputFormat.java:20: package com.twitter.data.proto.BlockStorage does not exist
[javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:6: package com.twitter.data.proto.BlockStorage does not exist
[javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:27: cannot find symbol
[javac] symbol : class SerializedBlock
[javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
[javac] private SerializedBlock curBlock_;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:105: cannot find symbol
[javac] symbol : class SerializedBlock
[javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
[javac] public SerializedBlock parseNextBlock() throws IOException {
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:14: package com.twitter.data.proto.Misc does not exist
[javac] import com.twitter.data.proto.Misc.CountedMap;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:123: cannot find symbol
[javac] symbol : class SerializedBlock
[javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
[javac] SerializedBlock block = SerializedBlock.parseFrom(byteArray);
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:123: cannot find symbol
[javac] symbol : variable SerializedBlock
[javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
[javac] SerializedBlock block = SerializedBlock.parseFrom(byteArray);
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:119: cannot find symbol
[javac] symbol : variable CountedMap
[javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
[javac] fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName())) {
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:122: cannot find symbol
[javac] symbol : class CountedMap
[javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
[javac] CountedMap cm = (CountedMap) m;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:122: cannot find symbol
[javac] symbol : class CountedMap
[javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
[javac] CountedMap cm = (CountedMap) m;
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:124: operator + cannot be applied to long,CountedMap.getValue
[javac] map.put(cm.getKey(), (curCount == null ? 0L : curCount) + cm.getValue());
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:252: cannot find symbol
[javac] symbol : variable CountedMap
[javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
[javac] fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName()) && fieldDescriptor.isRepeated()) {
[javac] ^
[javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:414: cannot find symbol
[javac] symbol : variable CountedMap
[javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
[javac] fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName()) && fieldDescriptor.isRepeated()) {
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/util/TypeRef.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 13 errors
BUILD FAILED
/Users/tdunning/Apache/tmp/elephant-bird/build.xml:74: The following error occurred while executing this line:
/Users/tdunning/Apache/tmp/elephant-bird/build.xml:103: Compile failed; see the compiler error output for details.
Hyphens are stripped from filenames before protoc generates Java class names, but this function isn't doing that. I imagine other characters are stripped too.
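For context, this is my reading of how protoc derives the outer class name from a .proto filename (not code taken from elephant-bird): non-alphanumeric characters such as hyphens and underscores are dropped, and the letter that follows each is capitalized. A fix to the function above would presumably need to mirror that:

```java
// Sketch of protoc's outer-class-name derivation: strip the ".proto" suffix,
// drop every non-alphanumeric character, and capitalize the character that
// follows each dropped one (and the first character). Hypothetical helper,
// written to illustrate the behavior the issue describes.
public class ProtoNames {
    public static String outerClassName(String filename) {
        String base = filename.endsWith(".proto")
                ? filename.substring(0, filename.length() - ".proto".length())
                : filename;
        StringBuilder sb = new StringBuilder();
        boolean capNext = true; // the first character is capitalized too
        for (char c : base.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(capNext ? Character.toUpperCase(c) : c);
                capNext = false;
            } else {
                capNext = true; // strip the char, capitalize what follows
            }
        }
        return sb.toString();
    }
}
```

So `address-book.proto` and `address_book.proto` both map to `AddressBook`.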
With either Pig 0.8.1 or Pig 0.9.1 and elephant-bird 53a814f, I try this:
$ ../pig/bin/pig -x local
register 'target/pig-vector-1.0-jar-with-dependencies.jar';
x = load 'x' using PigStorage('\t') as (a:int, b:chararray, c:chararray);
y = foreach x generate a,b;
store y into 'y.w' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.TextConverter' ) ;
z = load 'y.w' using com.twitter.elephantbird.pig.load.SequenceFileLoader (
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.TextConverter' ) as (a:int, b:chararray);
illustrate z;
and this happens:
java.lang.NullPointerException: Signature is null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperties(SequenceFileLoader.java:296)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperty(SequenceFileLoader.java:306)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.setLocation(SequenceFileLoader.java:411)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:147)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:116)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:91)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:119)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:194)
at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:222)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:154)
at org.apache.pig.PigServer.getExamples(PigServer.java:1245)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67)
at org.apache.pig.Main.run(Main.java:487)
at org.apache.pig.Main.main(Main.java:108)
2012-01-03 20:09:49,315 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : Signature is null
This seems wrong.
Very minor issue I wanted to report, more documentation than anything:
The dependency on Google Collections/Guava isn't mentioned in the README. If you're using ProtoBufs, that makes sense, but for a non-protobuf build the dependency isn't obvious without digging into the source.
Otherwise, fantastically useful!
I receive this error when using ILLUSTRATE:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
ERROR 2999: Unexpected internal error. null
java.lang.NullPointerException
at com.twitter.elephantbird.pig.util.PigCounterHelper.incrCounter(PigCounterHelper.java:26)
at com.twitter.elephantbird.pig.load.LzoBaseLoadFunc.incrCounter(LzoBaseLoadFunc.java:61)
at com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader.getNext(LzoProtobufB64LinePigLoader.java:83)
at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:209)
at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:189)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:131)
at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:166)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:91)
at org.apache.pig.PigServer.getExamples(PigServer.java:1155)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:630)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:500)
I'm guessing these two commits in May removed some classes that are used in the examples:
- whole of elephantbird/proto directory can be removed.
- remove lib thrift and protobuf-java jars from lib/
I had to do a "export PIG_OPTS=-Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/" in order to run the json example. Maybe put that in the documentation? Or am I doing something wrong?
Writing data with LzoTokenizedStorage appears to write data incorrectly. For example, the following load+store:
raw_data = LOAD '/data.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
store raw_data into '/user/travis/test' USING com.twitter.elephantbird.pig.store.LzoTokenizedStorage('\n');
writes these data:
[B@9ba6076
[B@48fd918a
[B@7f5e2075
[B@7ca522a6
[B@3d860038
The following, by contrast, writes correctly compressed data, leading us to believe the issue lives in LzoTokenizedStorage rather than hadoop-lzo:
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
raw_data = LOAD '/data.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
STORE raw_data INTO '/user/travis/test' USING PigStorage();
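The `[B@...` lines above are the telltale output of calling `toString()` on a Java byte array: you get the array's type tag and identity hash, never its contents. That suggests the store path is stringifying the raw `byte[]` instead of decoding it. A minimal demonstration of the symptom (assumed cause, not a confirmed diagnosis of LzoTokenizedStorage):

```java
import java.nio.charset.StandardCharsets;

// Demonstrates the "[B@9ba6076"-style output seen above: byte[].toString()
// yields "[B@" plus an identity hash code, not the array's contents. The fix
// in a store function is to decode or write the bytes explicitly.
public class ByteArrayToString {
    public static void main(String[] args) {
        byte[] row = "some\ttab\tseparated\trow".getBytes(StandardCharsets.UTF_8);
        System.out.println(row.toString());                          // prints e.g. [B@1b6d3586
        System.out.println(new String(row, StandardCharsets.UTF_8)); // the actual row text
    }
}
```

If the loader hands LzoTokenizedStorage tuple fields as `DataByteArray` or raw `byte[]`, the store function needs to convert them (e.g. via `new String(bytes, UTF_8)` or by writing the bytes directly) rather than relying on implicit string conversion.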
If the size of the lzo file is over 64 MB, Hive queries fail with the following exception (full stack trace attached at the end):
java.io.IOException: Compressed length 1602367537 exceeds max block size 67108864 (probably corrupt file)
One may advise indexing the files, but it seems the index is not recognized either. However, a Pig script works fine with or without an index.
Full stacktrace:
java.io.IOException: Compressed length 1602367537 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:286)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:256)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:63)
at com.twitter.elephantbird.util.StreamSearcher.search(StreamSearcher.java:47)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.skipToNextSyncPoint(BinaryBlockReader.java:102)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.parseNextBlock(BinaryBlockReader.java:107)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.setupNewBlockIfNeeded(BinaryBlockReader.java:139)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNextProtoBytes(BinaryBlockReader.java:75)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNext(BinaryBlockReader.java:63)
at com.twitter.elephantbird.mapreduce.io.ProtobufBlockReader.readProtobuf(ProtobufBlockReader.java:42)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:85)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:28)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:193)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Hello-
I'm running into trouble building elephant-bird. Ant is failing on the compile-generated-protobuf
task, with errors that look like the protocol buffer jar file isn't ending up on the classpath:
Buildfile: /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build.xml
release:
[echo] Building in release mode...
init:
compile-protobuf:
[apply] Applied thrift to 1 file and 0 directories.
[apply] Applied protoc to 4 files and 0 directories.
[javac] Compiling 9 source files to /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/classes
[javac] /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:12: cannot find symbol
[javac] symbol : class MessageOrBuilder
[javac] location: package com.google.protobuf
[javac] extends com.google.protobuf.MessageOrBuilder {
[javac] ^
[javac]
... snip ...
/Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:1275: cannot find symbol
[javac] symbol : method onChanged()
[javac] location: class com.twitter.data.proto.tutorial.AddressBookProtos.Person.Builder
[javac] onChanged();
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 100 errors
Any ideas what might be going on here? It's running protoc just fine, and from what I can tell from build.xml, javac's classpath ought to include protobuf-java-2.3.0.jar from elephant-bird's lib directory, but it sure looks like it isn't actually getting set up correctly.
I'm having problems loading a list in a Thrift structure using the latest version of elephant-bird. My Pig script looks like:
raw_data = load 'client_event.lzo'
using com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader('com.twitter.clientapp.gen.LogEvent');
records = foreach raw_data generate
((event_details is null OR event_details.item_ids is null ) ? TOBAG((long)null) : event_details.item_ids) as item_ids,
describe records;
dump records;
Here is the thrift structure.
struct LogEvent {
1: optional EventDetails event_details
}
struct EventDetails {
1: optional list item_ids
}
In Pig 0.8, the mapper fails with a stack trace like:
java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:482)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POIsNull.getNext(POIsNull.java:162)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POOr.getNext(POOr.java:83)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:89)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:338)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:290)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
In Pig 0.10, the stack trace is:
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias raw_data
at org.apache.pig.PigServer.dumpSchema(PigServer.java:802)
at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:276)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:313)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:553)
at org.apache.pig.Main.main(Main.java:108)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245:
<file /tmp/webclient.pig, line 12, column 11> Cannot get schema from loadFunc com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:154)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1691)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1666)
at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1391)
at org.apache.pig.PigServer.getOperatorForAlias(PigServer.java:1384)
at org.apache.pig.PigServer.dumpSchema(PigServer.java:788)
... 7 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2218: Invalid resource schema: bag schema must have tuple as its field
at org.apache.pig.ResourceSchema$ResourceFieldSchema.throwInvalidSchemaException(ResourceSchema.java:213)
at org.apache.pig.impl.logicalLayer.schema.Schema.getPigSchema(Schema.java:1881)
at org.apache.pig.impl.logicalLayer.schema.Schema.getPigSchema(Schema.java:1871)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151)
... 18 more
My apologies if this is a silly question, but does anyone else see this when compiling?
ed@curry:~/Projects/elephant-bird$ ant
Buildfile: build.xml
init:
compile-protobuf:
[exec] Result: 1
[apply] Applied thrift to 1 file and 0 directories.
[apply] Applied protoc to 4 files and 0 directories.
[javac] Compiling 9 source files to /home/ed/Projects/elephant-bird/build/classes
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:142: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:149: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/Misc.java:118: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/Misc.java:125: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:210: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:217: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:542: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:549: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:940: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:947: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/elephantbird/examples/proto/ThriftFixtures.java:316: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/elephantbird/examples/proto/ThriftFixtures.java:323: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 12 errors
BUILD FAILED
/home/ed/Projects/elephant-bird/build.xml:162: The following error occurred while executing this line:
/home/ed/Projects/elephant-bird/build.xml:148: Compile failed; see the compiler error output for details.
Total time: 3 seconds
It seems that you guys are aware of this issue and that it's really a larger Pig issue (I saw one of Kevin's emails on the Pig list mentioning this), but new users will be thankful to have this information upfront.
There are two problems here:
a) The loader doesn't do any LZO decompression in local mode, so the data should be fed uncompressed to the loader in local mode. It's simple to write a wrapper script to work around this. It appears that Pig's local mode doesn't use the slicing interface (which is what wraps an LZO codec around a slice's input stream in elephant-bird), which is why this problem exists in the first place.
b) The loader doesn't actually feed any tuples to Pig in local mode. I've written a number of printlns in getNext() showing that protobufs are successfully deserialized into Pig tuples in local mode, but a DUMP of the data in Pig doesn't actually output anything. I've poked around the code and I'm still baffled by this. Can anyone shed some light on why this is the case?
As reported by Torben:
Hey Dmitriy,
i found two typo issues in people_phone_number_count_thrift.pig and ThriftUtils.java
Just search the branch with:
grep -rE '(thirft|thift)' *
Best regards,
In ProtobufBlockWriter.java, the given example (see code below) throws a NullPointerException when the number of records written is not a multiple of numRecordsPerBlock:
ProtobufBlockWriter<Person> writer = new ProtobufBlockWriter<Person>(
new FileOutputStream("person_data"), Person.class);
writer.write(person1);
...
writer.write(person100000);
writer.finish();
writer.close();
The issue is that you can't call `writer.close()` after `writer.finish()`. The first line in `close()` calls `finish()`, so `finish()` is actually called twice. However, `finish()` can't be called twice: once `serialize()` is invoked, `builder_` is no longer usable. One way to fix this defect is to add `builder_ = reinitializeBlockBuilder();` right after `serialize();` in `finish()`:
public void finish() throws IOException {
  if (builder_.getProtoBlobsCount() > 0) {
    serialize();
    // add: builder_ = reinitializeBlockBuilder();
  }
}

public void close() throws IOException {
  finish();
  out_.close();
}
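The suggested fix amounts to making `finish()` idempotent. Here is a stripped-down model of that interaction, with a plain `List` standing in for the protobuf `builder_` that becomes unusable once serialized (names and structure are illustrative, not elephant-bird's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Stripped-down model of the finish()/close() interaction above. The real
// builder_ is a protobuf Builder that can't be reused after serialization;
// here a List plays that role. Re-initializing it after serialize() makes
// finish() idempotent, so close() calling finish() again is harmless.
public class BlockWriterSketch {
    private List<byte[]> builder = new ArrayList<>();
    int blocksFlushed = 0; // visible for the demonstration

    public void write(byte[] record) {
        builder.add(record);
    }

    public void finish() {
        if (!builder.isEmpty()) {
            serialize();
            builder = new ArrayList<>(); // the proposed reinitializeBlockBuilder()
        }
    }

    public void close() {
        finish(); // now safe even if the caller already called finish()
    }

    private void serialize() {
        blocksFlushed++; // stand-in for writing the serialized block to out_
    }
}
```

With the reinitialization in place, the README's `writer.finish(); writer.close();` sequence flushes the partial block exactly once instead of throwing.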
Hello,
I'm running into trouble while trying to run one of the examples. Here's what I did:
$ ~/hadoop-0.20.2/bin/hadoop jar ~/elephant-bird/examples/dist/elephant-bird-examples-1.0.jar com.twitter.elephantbird.examples.ProtobufMRExample s3n://hassanrom-sna/name_age_sample.txt s3n://hassanrom-sna/name_age_sample.lzo
11/11/25 00:42:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/protobuf/GeneratedMessage
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at com.twitter.elephantbird.examples.ProtobufMRExample.runLzoToText(Unknown Source)
at com.twitter.elephantbird.examples.ProtobufMRExample.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.GeneratedMessage
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 18 more
Any ideas what I'm doing wrong? I've tried setting HADOOP_CLASSPATH to point to where the libraries are, and that didn't work. I've also tried bundling all the jars together, and that didn't work either. I also tried passing the jars via -libjars, with no luck. Some documentation on how to get the examples running would definitely be nice, especially for folks who are new to Hadoop.
Thanks in advance!
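For anyone else hitting this NoClassDefFoundError: the protobuf jar usually has to be visible both to the local client JVM and to the task JVMs. A rough sketch of what that might look like (the jar paths below are placeholders, and -libjars only takes effect when the main class runs through ToolRunner/GenericOptionsParser):

```
# Make protobuf-java visible to the local client JVM; this is what resolves
# the NoClassDefFoundError thrown before the job is even submitted.
export HADOOP_CLASSPATH=/path/to/protobuf-java.jar:/path/to/elephant-bird.jar

# -libjars ships the jars to the task JVMs. It is parsed by
# GenericOptionsParser, so it is silently ignored unless the main class
# uses ToolRunner.
hadoop jar elephant-bird-examples.jar \
  com.twitter.elephantbird.examples.ProtobufMRExample \
  -libjars /path/to/protobuf-java.jar \
  input/ output/
```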
The README states that you can use GitHub's raw-file support to access the repo as a Maven repository. I can't get that to work.
Recursive messages cause a java.lang.StackOverflowError when using ProtobufBytesToTuple.
For example:
.proto file:
syntax = "proto2";
package test;
option java_package = "com.test";

message a {
  message b {
    repeated b test = 1;
  }
  optional b test = 2;
}
pig script:
define ProtoFileLoader com.twitter.elephantbird.pig.load.LzoProtobufBlockPigLoader('some.proto.message$message');
define test1 com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple('com.test.T$a');
record = LOAD '/somefile/' USING ProtoFileLoader;
packedrecord = FOREACH record GENERATE test1(protobuf.data);
error:
================================================================================
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. null
java.lang.StackOverflowError
at java.util.AbstractList$Itr.<init>(AbstractList.java:318)
at java.util.AbstractList$Itr.<init>(AbstractList.java:318)
at java.util.AbstractList.iterator(AbstractList.java:273)
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1007)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1006)
at com.twitter.elephantbird.pig.util.ProtobufToPig.toSchema(ProtobufToPig.java:225)
at com.twitter.elephantbird.pig.util.ProtobufToPig.messageToFieldSchema(ProtobufToPig.java:256)
at com.twitter.elephantbird.pig.util.ProtobufToPig.toSchema(ProtobufToPig.java:227)
at com.twitter.elephantbird.pig.util.ProtobufToPig.messageToFieldSchema(ProtobufToPig.java:256)
...
With Raghu's refactoring, we can set the inputFormat and outputFormat as follows:
job.setInputFormatClass(
LzoProtobufB64LineInputFormat.getInputFormatClass(MyProtobufClass.class, conf)
);
We need to do the same for Writables, as this:
job.setOutputValueClass(ThriftWritable.class);
doesn't work.
I'm using an LZO-compressed (LZO v1.01) Protocol Buffer file on HDFS. When I use the elephant-bird API to read the file in my Java MR job, no input records are passed to my Map class. Debugging in detail, I found that the com.twitter.elephantbird.util.StreamSearcher.search(InputStream) method always returns false. Is this a known issue? If you need more details, please do not hesitate to reach me at [email protected].
Regards,
Karthik.
Current readme says to go here:
https://raw.github.com/kevinweil/elephant-bird/master/repo
which doesn't exist. Poking around, this does exist:
https://raw.github.com/kevinweil/elephant-bird/master/
but it looks like someone checked in a .gitignore file for it.
I'm using the ThriftWritableConverter to convert a pig tuple into a Thrift object. I get the following exception when trying to create the converter:
ERROR 2999: Unexpected internal error. could not instantiate 'com.twitter.elephantbird.pig.store.SequenceFileStorage' with arguments '[-c com.twitter.elephantbird.pig.util.ThriftWritableConverter AdvertisementId, -c com.twitter.elephantbird.pig.util.ThriftWritableConverter Advertisement]'
java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.store.SequenceFileStorage' with arguments '[-c com.twitter.elephantbird.pig.util.ThriftWritableConverter AdvertisementId, -c com.twitter.elephantbird.pig.util.ThriftWritableConverter Advertisement]'
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:502)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:5660)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:4034)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1501)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:1013)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:825)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1612)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1562)
at org.apache.pig.PigServer.registerQuery(PigServer.java:534)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:871)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:388)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:470)
... 21 more
Caused by: java.lang.RuntimeException: Failed to create WritableConverter instance
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getWritableConverter(SequenceFileLoader.java:247)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.(SequenceFileLoader.java:140)
at com.twitter.elephantbird.pig.store.SequenceFileStorage.(SequenceFileStorage.java:103)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getWritableConverter(SequenceFileLoader.java:222)
... 28 more
Caused by: java.lang.RuntimeException: while trying to find id in class AdvertisementId
at com.twitter.elephantbird.util.ThriftUtils.getFiedlType(ThriftUtils.java:110)
at com.twitter.elephantbird.thrift.TStructDescriptor$Field.(TStructDescriptor.java:201)
at com.twitter.elephantbird.thrift.TStructDescriptor$Field.(TStructDescriptor.java:128)
at com.twitter.elephantbird.thrift.TStructDescriptor.build(TStructDescriptor.java:104)
at com.twitter.elephantbird.thrift.TStructDescriptor.getInstance(TStructDescriptor.java:88)
at com.twitter.elephantbird.pig.util.ThriftToPig.(ThriftToPig.java:56)
at com.twitter.elephantbird.pig.util.ThriftToPig.newInstance(ThriftToPig.java:52)
at com.twitter.elephantbird.pig.util.ThriftWritableConverter.(ThriftWritableConverter.java:61)
... 33 more
Caused by: java.lang.NoSuchFieldException: id
at java.lang.Class.getDeclaredField(Class.java:1882)
at com.twitter.elephantbird.util.ThriftUtils.getFiedlType(ThriftUtils.java:107)
... 40 more
This is the part of my schema:
union AdvertisementId {
  1: string id;
}

struct Advertisement {
  1: required AdvertisementId id;
}
And this is the structure of the pig relation:
ads: {advertisementId: (id: chararray)}
The TStructDescriptor tries to look up the field via reflection, but in the case of a union the generated classes don't contain actual member fields.
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libthrift</artifactId>
  <version>0.7.0</version>
</dependency>
We've always depended on 0.5.0 AFAIK and I don't think this should change?
Hi there,
we ran into an issue where the following test in src/test/com/twitter/elephantbird/pig/piggybank/TestInvoker.java
may fail on slower machines. In our case, for instance, this happens on our build server. On my local dev box, the test sometimes passes and sometimes it doesn't (hmm).
@Test
public void testSpeed() throws IOException, SecurityException, ClassNotFoundException, NoSuchMethodException {
  EvalFunc<Double> log = new Log();
  Tuple tup = tf_.newTuple(1);
  long start = System.currentTimeMillis();
  for (int i = 0; i < 1000000; i++) {
    tup.set(0, (double) i);
    log.exec(tup);
  }
  long staticSpeed = (System.currentTimeMillis() - start);
  start = System.currentTimeMillis();
  log = new InvokeForDouble("java.lang.Math.log", "Double", "static");
  for (int i = 0; i < 1000000; i++) {
    tup.set(0, (double) i);
    log.exec(tup);
  }
  long dynamicSpeed = System.currentTimeMillis() - start;
  System.err.println("Dynamic to static ratio: " + ((float) dynamicSpeed) / staticSpeed);
  assertTrue(((float) dynamicSpeed) / staticSpeed < 5);
}
The culprit is the last line of the method, which checks whether the dynamic-to-static ratio is smaller than 5 (see my question below -- I do not know why 5 in particular is used here).
The "problem" with this test is that its success depends on the build environment. Even though a relative ratio is used, that alone apparently does not guarantee that the test passes on every box the build runs on (e.g. it may pass if the build box is otherwise idle but fail if other tasks are running in the background). Also, this unit test is not a functional test but a performance test.
Hence I'd say one can argue pro and con about a) whether it should be enabled by default, and b) whether failing this test should fail the full build (maybe a warning would be more appropriate?).
On a related note: where does the value of 5 (last line in testSpeed()) come from? Was it arbitrarily chosen? FWIW, on my dev box most dynamic-to-static ratios are in the 4.9-5.3 range, and on our build server they are in the 6.1-6.5 range. For the time being we have decided to @Ignore this test in our build, because otherwise the build always fails due to this single test.
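For what it's worth, one gentler alternative is to compute the ratio but only warn when it crosses the threshold instead of failing the build. A minimal, self-contained sketch of that idea (this is not EB's actual test code; it times reflective vs. direct calls to Math.log as a stand-in for the InvokeForDouble path):

```java
import java.lang.reflect.Method;

// Sketch: time direct vs. reflective calls to Math.log and report the
// dynamic-to-static ratio, warning rather than failing when it is high.
public class RatioCheck {
    static volatile double sink; // keeps the JIT from eliding the calls

    public static double measureRatio(int iters) {
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            sink = Math.log(i + 1.0);
        }
        long staticNanos = System.nanoTime() - start;

        try {
            Method log = Math.class.getMethod("log", double.class);
            start = System.nanoTime();
            for (int i = 0; i < iters; i++) {
                sink = (Double) log.invoke(null, i + 1.0);
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        long dynamicNanos = System.nanoTime() - start;
        return (double) dynamicNanos / staticNanos;
    }

    public static void main(String[] args) {
        double ratio = measureRatio(1_000_000);
        System.err.println("Dynamic to static ratio: " + ratio);
        if (ratio >= 5) {
            System.err.println("WARNING: ratio above threshold; would have failed testSpeed()");
        }
    }
}
```

The measured ratio still varies with machine load, which is exactly why a hard assert on it is fragile.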
To reproduce:
mvn -e package
org.apache.maven.lifecycle.LifecycleExecutionException: Error configuring: com.github.igor-petruk.protobuf:protobuf-maven-plugin. Reason: Invalid or missing parameters: [Mojo parameter [name: 'inputDirectories'; alias: 'null']] for mojo: com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.4:run
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:723)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalWithLifecycle(DefaultLifecycleExecutor.java:556)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:535)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:387)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:348)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:180)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:328)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: org.apache.maven.plugin.PluginParameterException: Error configuring: com.github.igor-petruk.protobuf:protobuf-maven-plugin. Reason: Invalid or missing parameters: [Mojo parameter [name: 'inputDirectories'; alias: 'null']] for mojo: com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.4:run
at org.apache.maven.plugin.DefaultPluginManager.checkRequiredParameters(DefaultPluginManager.java:1117)
at org.apache.maven.plugin.DefaultPluginManager.getConfiguredMojo(DefaultPluginManager.java:722)
at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:468)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:694)
... 17 more
Many thanks!!
JSON is currently handled a little inconsistently. Pig can read JSON from any file input format, whereas MapReduce jobs can only read JSON from LZO files. Additionally, parsing is done somewhat differently in the two places.
I think the ideal solution would be a JSON input format that wraps a given underlying input format, parsing each valid value into JSON. The JSON Pig loader would then simply use the JSON input format and convert each record into a tuple (rather than parsing records into JSON itself, like it does today).
Thoughts? I think it would be cool to separate the app data format (Pig tuple) from the storage record format (JSON) and the file format (LZO-compressed file), so users can mix and match as they see fit. For example, there's no reason someone who wants to compress files with Snappy shouldn't be able to use the JSON and Pig layers.
I started playing around with this idea in the recent json loader refactor, and tonight in https://github.com/traviscrawford/elephant-bird/tree/json_record_reader but want to get some feedback before going too far.
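To make the proposed layering concrete, here is a toy sketch. None of this is EB's actual API; the "record decoder" is a stub that splits k=v pairs, standing in for a real JSON parser, and the "tuple" is just an ordered list:

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy illustration of the separation: the record format is a function from a
// raw line to a map, and the app format (a Pig-tuple-like list) is built from
// that map. Either layer can be swapped independently of the file format.
public class LayeredDecode {
    // Record-format layer: raw line -> map. Stand-in for a real JSON parser.
    public static Map<String, String> decodeRecord(String line) {
        return Arrays.stream(line.split(","))
                .map(kv -> kv.split("=", 2))
                .collect(Collectors.toMap(a -> a[0], a -> a[1]));
    }

    // App-format layer: map -> ordered "tuple" of field values.
    public static List<String> toTuple(Map<String, String> record, List<String> fields) {
        return fields.stream().map(record::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> rec = decodeRecord("user=bob,count=3");
        System.out.println(toTuple(rec, Arrays.asList("user", "count"))); // [bob, 3]
    }
}
```

The point is only the shape: the loader composes a file format, a record decoder, and a tuple builder, instead of baking all three into one class.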
I get this after a git clone and then running "ant". I am not an Ivy expert, but something simple must be wrong.
resolve:
No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
[ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = /Users/shiv/Documents/elephant-bird/ivysettings.xml
[ivy:retrieve] [xml parsing: ivy.xml:4:117: cvc-complex-type.3.2.2: Attribute 'defaultconf' is not allowed to appear in element 'configurations'. in file:/Users/shiv/Documents/elephant-bird/ivy.xml
[ivy:retrieve] ]
BUILD FAILED
/Users/shiv/Documents/elephant-bird/build.xml:114: syntax errors in ivy file: java.text.ParseException: [xml parsing: ivy.xml:4:117: cvc-complex-type.3.2.2: Attribute 'defaultconf' is not allowed to appear in element 'configurations'. in file:/Users/shiv/Documents/elephant-bird/ivy.xml
]
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser$AbstractParser.checkErrors(AbstractModuleDescriptorParser.java:89)
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser$AbstractParser.getModuleDescriptor(AbstractModuleDescriptorParser.java:342)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorParser.parseDescriptor(XmlModuleDescriptorParser.java:102)
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser.parseDescriptor(AbstractModuleDescriptorParser.java:48)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:183)
at org.apache.ivy.Ivy.resolve(Ivy.java:502)
at org.apache.ivy.ant.IvyResolve.doExecute(IvyResolve.java:234)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.ivy.ant.IvyPostResolveTask.ensureResolved(IvyPostResolveTask.java:207)
at org.apache.ivy.ant.IvyPostResolveTask.prepareAndCheck(IvyPostResolveTask.java:154)
at org.apache.ivy.ant.IvyRetrieve.doExecute(IvyRetrieve.java:49)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Total time: 2 seconds
I have downloaded the sources and am following the instructions in the readme.md file. The platform is Windows.
The build fails while compiling the examples directory with the following error:
compile-protobuf:
[apply] --twadoop_out: protoc-gen-twadoop: The system cannot find the file specified.
I see this file in the examples directory, but somehow protoc is not able to find it.
With one of the more recent WIPs, Cascading changed SourceCall from a class to an interface. Elephant-bird uses SourceCall here and in other places; to make this work, the project needs to be recompiled against Cascading 2.0-wip226.
I tested with a simple text file plus LZO before moving on to encoding with Protocol Buffers and compressing with lzop. But when I try to run a simple MapReduce job on top of it, I get these errors:
12-06-13 01:19:16,446 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201206121825_0041_m_000000_1 : java.lang.RuntimeException: error rate while reading input records crossed threshold
at com.twitter.elephantbird.mapreduce.input.LzoRecordReader$InputErrorTracker.incErrors(LzoRecordReader.java:155)
at com.twitter.elephantbird.mapreduce.input.LzoBinaryB64LineRecordReader.nextKeyValue(LzoBinaryB64LineRecordReader.java:135)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.Exception: Unknown error
at com.twitter.elephantbird.mapreduce.input.LzoRecordReader$InputErrorTracker.incErrors(LzoRecordReader.java:138)
Hi Folks,
A small documentation question...
The Readme.md says "2. Pig 0.6 (not compatible with 0.7+)"
Is that still the case?
I am looking for a plain old JSON loader. Elephant Bird seems to include one, but only if the data is also LZO compressed. Should I carry on trying to figure out Elephant Bird even though my data isn't LZO compressed?
thanks
Hi Kevin & all,
I have two quick questions:
First, I have seen that Dmitriy merged his Pig 0.8 compatibility branch of EB (kevinweil/elephant-bird@ebaa220c5c728) into Kevin's eb-dev. Kevin's master branch of kevinweil/elephant-bird now contains the "2.0.2 bug fix release" and is just one version-bump commit behind eb-dev. So my question would be: is EB 2.0.2 considered a stable or official release? Is Kevin's repo the recommended code base, or rather Dmitriy's? (I'm a bit confused about what development happens in each repo, sorry :-D)
Second, gerritjvv forked Dmitriy's EB repo some time ago (gerritjvv/elephant-bird) and added the class LzoPigStorage.java. This class appears to be a very simple, straightforward LZO-enabled wrapper for standard PigStorage -- less than 10 lines of effective code. If the implementation is sound, do you think it would make sense to bring LzoPigStorage into your repository?
Many thanks for your work on ElephantBird, it's appreciated!
Best,
Michael
Hi all,
I couldn't find a good way to email anyone, so I'm posting this as an issue in the hopes that it gets to someone.
I just have a very simple question: in Protobufs, you have a byte[] called KNOWN_GOOD_POSITION_MARKER. I'm curious about the chosen pattern--why did you pick those particular byte values? And why is it 16 bytes long? I'm sure it's a simple reason, I just can't seem to wrap my head around it, for some reason.
Thanks,
Scott Fines
[email protected]
Almost there getting elephant-bird compiled! I now get the error below when I run "ant". On the main wiki, there are no instructions on getting "protoc".
Thanks, Shiv
generate-protobuf:
BUILD FAILED
/Users/shiv/Documents/elephant-bird/build.xml:131: The following error occurred while executing this line:
/Users/shiv/Documents/elephant-bird/build.xml:59: Execute failed: java.io.IOException: Cannot run program "protoc": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at java.lang.Runtime.exec(Runtime.java:593)
at org.apache.tools.ant.taskdefs.Execute$Java13CommandLauncher.exec(Execute.java:862)
at org.apache.tools.ant.taskdefs.Execute.launch(Execute.java:481)
at org.apache.tools.ant.taskdefs.Execute.execute(Execute.java:495)
at org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:631)
at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:672)
at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:498)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:68)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:398)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 37 more
Hi!
I've just found an issue using JsonStringToMap on Pig 0.8.1. The schema "json: [chararray]" throws a ParseException.
The following works on PIG 0.8.1:
return Utils.getSchemaFromString("json: []", DataType.CHARARRAY);
This seems related to https://issues.apache.org/jira/browse/PIG-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017764#comment-13017764
I did a fresh git clone this evening and am getting lots of errors about being unable to find StorageBlock when compiling with "ant noproto release-jar":
[echo] building without protobuf support
[javac] /Users/jreitman/Development/web_development/elephant-bird/build.xml:89: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 83 source files to /Users/jreitman/Development/web_development/elephant-bird/build/classes
[javac] /Users/jreitman/Development/web_development/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:6: package com.twitter.data.proto.BlockStorage does not exist
[javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
[javac] ^
[javac] /Users/jreitman/Development/web_development/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:27: cannot find symbol
json-simple throws an exception on 'mykey' (single quotes); double quotes must be used. However, the example in json_word_count.pig uses single quotes.
Currently, if the Protobuf or Thrift structure contains a map with non-String keys, we give up, since Pig does not support non-String map keys.
We should allow the user to specify an alternative behavior, and provide at least one built-in alternative (call toString() on the key and use that).
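A minimal sketch of the toString() fallback described above (this is a proposal, not existing EB behavior, and the helper name is made up):

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the proposed fallback: convert non-String map keys to Strings via
// toString() so the map can still be represented as a Pig map.
public class MapKeyFallback {
    public static <K, V> Map<String, V> stringifyKeys(Map<K, V> in) {
        return in.entrySet().stream()
                .collect(Collectors.toMap(
                        e -> String.valueOf(e.getKey()), // toString() on the key
                        Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<Integer, String> m = new HashMap<>();
        m.put(42, "answer");
        System.out.println(stringifyKeys(m)); // {42=answer}
    }
}
```

One caveat worth deciding up front: distinct keys can collide after toString() (e.g. an Integer 1 and a String "1"), so the real implementation would need to pick a collision policy.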
I have a similar problem to #112, but with the SequenceFileLoader in mapreduce mode.
The signature is null and a command fails because of it. The error is:
java.lang.NullPointerException: Signature is null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperties(SequenceFileLoader.java:296)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperty(SequenceFileLoader.java:306)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.setLocation(SequenceFileLoader.java:411)
The commands leading to this error are:
A = load 'hdfs://XXX' USING com.twitter.elephantbird.pig.load.SequenceFileLoader (
'-c com.twitter.elephantbird.pig.util.TextConverter',
'-c com.twitter.elephantbird.pig.util.ProtobufWritableConverter de.pc2.dedup.fschunk.pig.PigProtocol.File');
ILLUSTRATE A;
I have a repeated field in my protobuf message:
message Tweet {
....
repeated string hashtags = 6;
....
}
Here is the generated Java protobuf code for the repeated field:
// repeated string hashtags = 6;
public static final int HASHTAGS_FIELD_NUMBER = 6;
private java.util.List<java.lang.String> hashtags_ = java.util.Collections.emptyList();
Everything looks good. However, when I created a Hive table, the repeated field 'hashtags_' was missing from the table. All other non-repeated fields show up correctly in the table as columns.
CREATE EXTERNAL TABLE tweet
ROW FORMAT SERDE 'protobuf.hive.serde.LzoTweetProtobufHiveSerde'
STORED AS INPUTFORMAT 'protobuf.mapred.input.DeprecatedLzoTweetProtobufBlockInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/path/to/hdfs/dir';
describe tweet;
OK
id_ bigint from deserializer
createdat_ string from deserializer
text_ string from deserializer
source_ string from deserializer
user_ protobuf.TweetProtos$User from deserializer
inreplytoscreenname_ string from deserializer
....
Am I missing something here? Thanks!
data = LOAD 'hdfs://localhost//foo/23,hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader ()
produces:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:280)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.(Path.java:126)
at org.apache.hadoop.fs.Path.(Path.java:50)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1063)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1002)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:966)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
... 14 more
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.checkChar(URI.java:2992)
at java.net.URI$Parser.parse(URI.java:3008)
at java.net.URI.(URI.java:736)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
... 29 more
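One possible workaround, assuming the intent is to load both directories: use a single path with a Hadoop glob instead of a top-level comma-separated list, so the comma never reaches the URI scheme parser. Whether this helps depends on how your Pig/Hadoop versions expand globs:

```
data = LOAD 'hdfs://localhost/foo/{23,24}' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
```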
Kevin's hadoop-lzo project (which EB depends on anyway) already offers LZO I/O formats for Text, etc. -- just not protobuf.
Why does EB choose to maintain a forked copy?
Would it be acceptable if I contributed a patch that makes the EB classes extend the hadoop-lzo ones (with tests to ensure we don't break existing users)?
Some users of EB classes have reported that they do not gel properly with Hive's CombineHiveInputFormat usage (which is now the default input format).
For the same simple "README.lzo" text-file input that spans four blocks (512 bytes each):
hive> describe extended lzoblocks;
OK
r string
Detailed Table Information Table(tableName:lzoblocks, dbName:default, owner:root, createTime:1329727318, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:r, type:string, comment:null)], location:hdfs://localhost/user/hive/warehouse/lzoblocks, inputFormat:com.hadoop.mapred.DeprecatedLzoTextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1329727393}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.078 seconds
hive> describe extended ebblocks;
OK
r string
Detailed Table Information Table(tableName:ebblocks, dbName:default, owner:root, createTime:1329730707, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:r, type:string, comment:null)], location:hdfs://localhost/user/hive/warehouse/ebblocks, inputFormat:com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1329732671}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.07 seconds
Doing a simple select on the hadoop-lzo-based table works fine, while the EB-based one fails - when used with CDH3u3. The error the ebblocks table produces on a select * is [0].
This may be MAPREDUCE-1597, but I still think EB should be reusing classes rather than duplicating code and diverging in behavior.
[0] - 2012-02-17 11:04:07,598 WARN com.hadoop.compression.lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
java.io.IOException: Compressed length 439750499 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:285)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:255)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:63)
at com.twitter.elephantbird.util.StreamSearcher.search(StreamSearcher.java:47)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.skipToNextSyncPoint(BinaryBlockReader.java:102)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.parseNextBlock(BinaryBlockReader.java:107)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.setupNewBlockIfNeeded(BinaryBlockReader.java:139)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNextProtoBytes(BinaryBlockReader.java:75)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNext(BinaryBlockReader.java:63)
at com.twitter.elephantbird.mapreduce.io.ProtobufBlockReader.readProtobuf(ProtobufBlockReader.java:42)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:84)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:27)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
I use an LZO (v1.01) compressed Protocol Buffer file on HDFS. When using the elephant-bird API to read the file in my Java MR job, no input records are passed on to my Map class. On debugging in detail, I found that the com.twitter.elephantbird.util.StreamSearcher.search(InputStream) method always returns false. Is this a known issue? If you need more details, please do not hesitate to reach me at [email protected]. I should also mention that my LZO file contains a list of repeated Protobuf objects.
The pattern:
new byte[] { 0x29, (byte) 0xd8, (byte) 0xd5, 0x06, 0x58,
             (byte) 0xcd, 0x4c, 0x29, (byte) 0xb2,
             (byte) 0xbc, 0x57, (byte) 0x99, 0x21, 0x71,
             (byte) 0xbd, (byte) 0xff };
does not seem to match anywhere in the LZO-compressed GPB file.
Regards,
Karthik.
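For context on what a "search returns false" means here: the reader scans the decompressed stream for a fixed 16-byte block sync marker, and if the marker bytes never occur the file was most likely not written in the block format at all (e.g. it is a raw or line-oriented protobuf file). A minimal sketch of such a scan (this is illustrative, not elephant-bird's actual StreamSearcher):

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch of a byte-pattern scan over an InputStream: returns true when
 * the stream is positioned just past the first occurrence of pattern.
 */
public class PatternSearch {
    public static boolean search(InputStream in, byte[] pattern) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == pattern[matched]) {
                matched++;
                if (matched == pattern.length) {
                    return true;
                }
            } else {
                // Simplified restart: only re-checks the current byte against
                // pattern[0]; a production scanner would handle longer overlaps.
                matched = ((byte) b == pattern[0]) ? 1 : 0;
            }
        }
        return false;
    }
}
```

Running a scan like this directly over the lzop-decompressed bytes of the file is a quick way to confirm whether the sync marker is really absent, independent of the record reader.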
I'm trying to serialize a message that contains another nested, repeated message using LzoProtobufB64LinePigStorage, but I get a ClassCastException (DefaultDataBag to java.util.List). This is using Cloudera's CDH2 distribution and Pig 0.5.0.
You can find a protocol buffer interface + script that triggers this bug here: http://github.com/voberoi/elephant-bird-bug.
The default Hadoop partitioner uses the key's hashCode() to decide which reducer to send each key/value pair to. ProtobufWritable returns the hashCode of the underlying protocol buffers object.
So far so good. Unfortunately, a protocol buffers (2.3) object returns different hash codes across JVM invocations. This means that if you use a ProtobufWritable as the key, data that should end up on the same node is sent to different ones.
Example code (replace the ProtobufWritable with one of your own). If you run this a few times in new JVMs it will print different partitions.
ProtobufNodeKeyWritable key1 = new ProtobufNodeKeyWritable();
key1.set(NodeKey.newBuilder().setKey("key").setNode("node").build());
HashPartitioner&lt;ProtobufNodeKeyWritable, NullWritable&gt; p = new HashPartitioner&lt;ProtobufNodeKeyWritable, NullWritable&gt;();
System.out.println(p.getPartition(key1, null, 5));
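One possible workaround, assuming the instability reported above, is to derive the partition from the message's serialized bytes rather than from its hashCode(): serialization is a pure function of the field values, so the same message always lands on the same reducer. A sketch of the core computation (StableBytesPartitioner is a hypothetical name, not an elephant-bird class):

```java
import java.util.Arrays;

/**
 * Sketch: compute a reducer partition from a protobuf message's
 * serialized bytes instead of its (unstable) hashCode().
 */
public class StableBytesPartitioner {
    public static int getPartition(byte[] serializedMessage, int numPartitions) {
        // Arrays.hashCode depends only on the byte contents, so it is
        // stable across objects and across JVM invocations.
        return (Arrays.hashCode(serializedMessage) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In a real job this logic would live in a custom Partitioner set via job configuration, with the bytes obtained from the wrapped message's toByteArray(); the names here are assumptions for illustration.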