twitter / elephant-bird

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

License: Apache License 2.0

Java 97.84% Thrift 1.23% PigLatin 0.06% Shell 0.86%

elephant-bird's People

Contributors: angushe, aniket486, ankurbarua, azymnis, bryanherger, dieu, dvryaboy, eugencepoi, gerashegalov, ianoc, isnotinvain, jcoveney, johanoskarsson, johnynek, julienledem, kevinweil, ktrivedisf, lewismc, panisson, rangadi, rore, rubanm, sagemintblue, sorenmacbeth, tmwoodruff, traviscrawford, voberoi, wuman, xstevens, yuxutw

elephant-bird's Issues

Hadoop Streaming best practices?

Hello-

Do you guys have any suggestions for ways to use Elephant-Bird-generated protobuf Writables with Hadoop Streaming? As far as I can tell, Hadoop Streaming more or less calls toString() on each input key and value and sends the result to stdout, which is a problem since protobufs don't serialize to text all that well, and their serializations often include newlines, which play havoc with Streaming.

The best option I've come up with thus far involves writing a custom InputFormat that would Base-64-encode each PB, and then, in my streaming code, decode the Base-64 and deserialize it back into a protocol buffer. This seems like it ought to work, but also seems like a ton of extra overhead, and I feel like there's got to be a better way that I'm somehow just missing. Googling around, it looks like Dumbo's "TypedBytes" might do some of what I need, but for a variety of reasons I can neither upgrade nor patch my Hadoop installation directly.
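For what it's worth, here is a minimal sketch of the Base64 round trip described above. The class and method names are illustrative, not part of elephant-bird, and it assumes Apache commons-codec is on the classpath:

import java.nio.charset.StandardCharsets;

import org.apache.commons.codec.binary.Base64;
import com.google.protobuf.Message;

public class B64ProtoCodec {
  /** Serialize a protobuf into a single newline-free text line for Streaming. */
  public static String encode(Message message) {
    // encodeBase64 (not the chunked variant) emits no newlines, so the
    // record survives Streaming's line-oriented transport
    return new String(Base64.encodeBase64(message.toByteArray()), StandardCharsets.US_ASCII);
  }

  /** Recover the raw protobuf bytes from one Base64 text line; feed these
   *  to YourMessage.parseFrom(bytes) on the consuming side. */
  public static byte[] decode(String line) {
    return Base64.decodeBase64(line);
  }
}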

Thanks in advance for any ideas that anybody might have...

LzoJsonRecordReader and JsonLoader may fail with NullPointerException

We have run into some issues with LzoJsonRecordReader and JsonLoader where on some JSON data -- we haven't been able to narrow it down yet -- both classes will fail with a NullPointerException.

For instance, here is the relevant snippet of LzoJsonRecordReader#decodeLineToJson(). What seems to happen is that LzoJsonRecordReader#nextKeyValue() passes a null value to decodeLineToJson(). The problem might therefore originate in the line "int newSize = in_.readLine(currentLine_);" in nextKeyValue(). From what I remember, json-simple's parse method might also return a null value (so adding a safeguard to LzoJsonLoader and JsonLoader might be a good idea anyway), but in this case the null value is passed to json-simple before json-simple would even get the chance to return null itself.

A similar NPE issue exists for JsonLoader#parseStringToTuple().

/*
    LzoJsonRecordReader.java
*/

  public static boolean decodeLineToJson(JSONParser parser, Text line, MapWritable value) {
    try {
      JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
      for (Object key : jsonObj.keySet()) {   /* *** The NPE is thrown here when jsonObj == null *** */
        Text mapKey = new Text(key.toString());
        Text mapValue = new Text();
        if (jsonObj.get(key) != null) {
          mapValue.set(jsonObj.get(key).toString());
        }

        value.put(mapKey, mapValue);
      }
      return true;
    } catch (ParseException e) {
      LOG.warn("Could not json-decode string: " + line, e);
      return false;
    } catch (NumberFormatException e) {
      LOG.warn("Could not parse field into number: " + line, e);
      return false;
    }
  }

A quick fix is to simply catch the NPE in LzoJsonRecordReader#decodeLineToJson() and JsonLoader#parseStringToTuple(), as is already done for e.g. ParseException. This fix has worked fine for us. As I said above, however, this might not fix the root of the problem.

I am not so familiar with the code, so I am not sure whether just catching NPEs is the best fix. Maybe you have a better idea!
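
For reference, here is what the null-guard variant might look like. This is a minimal sketch of the reporter's suggestion (guarding both the input line and json-simple's return value), not the project's committed fix:

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class SafeJsonDecode {
  public static boolean decodeLineToJson(JSONParser parser, Text line, MapWritable value) {
    if (line == null) {
      return false;  // guard: nextKeyValue() handed us no line at all
    }
    try {
      JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
      if (jsonObj == null) {
        return false;  // guard: json-simple returned null instead of throwing
      }
      for (Object key : jsonObj.keySet()) {
        Text mapKey = new Text(key.toString());
        Text mapValue = new Text();
        if (jsonObj.get(key) != null) {
          mapValue.set(jsonObj.get(key).toString());
        }
        value.put(mapKey, mapValue);
      }
      return true;
    } catch (ParseException e) {
      return false;  // malformed JSON line
    } catch (NumberFormatException e) {
      return false;  // unparseable numeric field
    }
  }
}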

nonothing build broken

When I try "ant nonothing compile" on a machine with Thrift 0.6 installed, I get the output below. Clearly some protobuf references have slipped in. I would be happy to help with fixes, but I don't know the code nearly well enough to make sensible changes.

Also, I find it difficult to stomach installing Thrift 0.5, but note that if I simply don't use the Thrift support, commenting out the version check makes everything work for me.

Teds-MacBook-Pro:elephant-bird[master]$ ant nonothing compile
Buildfile: /Users/tdunning/Apache/tmp/elephant-bird/build.xml

init:
    [mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build
    [mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build/classes
    [mkdir] Created dir: /Users/tdunning/Apache/tmp/elephant-bird/build/test

nonothing:
     [echo] building without protobuf and thrift support
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/build.xml:103: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 71 source files to /Users/tdunning/Apache/tmp/elephant-bird/build/classes
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/input/MultiInputFormat.java:20: package com.twitter.data.proto.BlockStorage does not exist
    [javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
    [javac]                                           ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:6: package com.twitter.data.proto.BlockStorage does not exist
    [javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
    [javac]                                           ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:27: cannot find symbol
    [javac] symbol  : class SerializedBlock
    [javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
    [javac]   private SerializedBlock curBlock_;
    [javac]           ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:105: cannot find symbol
    [javac] symbol  : class SerializedBlock
    [javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
    [javac]   public SerializedBlock parseNextBlock() throws IOException {
    [javac]          ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:14: package com.twitter.data.proto.Misc does not exist
    [javac] import com.twitter.data.proto.Misc.CountedMap;
    [javac]                                   ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:123: cannot find symbol
    [javac] symbol  : class SerializedBlock
    [javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
    [javac]     SerializedBlock block = SerializedBlock.parseFrom(byteArray);
    [javac]     ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:123: cannot find symbol
    [javac] symbol  : variable SerializedBlock
    [javac] location: class com.twitter.elephantbird.mapreduce.io.BinaryBlockReader<M>
    [javac]     SerializedBlock block = SerializedBlock.parseFrom(byteArray);
    [javac]                             ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:119: cannot find symbol
    [javac] symbol  : variable CountedMap
    [javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
    [javac]           fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName())) {
    [javac]                                                             ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:122: cannot find symbol
    [javac] symbol  : class CountedMap
    [javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
    [javac]           CountedMap cm = (CountedMap) m;
    [javac]           ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:122: cannot find symbol
    [javac] symbol  : class CountedMap
    [javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
    [javac]           CountedMap cm = (CountedMap) m;
    [javac]                            ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:124: operator + cannot be applied to long,CountedMap.getValue
    [javac]           map.put(cm.getKey(), (curCount == null ? 0L : curCount) + cm.getValue());
    [javac]                                ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:252: cannot find symbol
    [javac] symbol  : variable CountedMap
    [javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
    [javac]         fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName()) && fieldDescriptor.isRepeated()) {
    [javac]                                                           ^
    [javac] /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/pig/util/ProtobufToPig.java:414: cannot find symbol
    [javac] symbol  : variable CountedMap
    [javac] location: class com.twitter.elephantbird.pig.util.ProtobufToPig
    [javac]         fieldDescriptor.getMessageType().getName().equals(CountedMap.getDescriptor().getName()) && fieldDescriptor.isRepeated()) {
    [javac]                                                           ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: /Users/tdunning/Apache/tmp/elephant-bird/src/java/com/twitter/elephantbird/util/TypeRef.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 13 errors

BUILD FAILED
/Users/tdunning/Apache/tmp/elephant-bird/build.xml:74: The following error occurred while executing this line:
/Users/tdunning/Apache/tmp/elephant-bird/build.xml:103: Compile failed; see the compiler error output for details.

Writable loading seems broken in local mode

With either pig 0.8.1 or pig 0.9.1 and elephant-bird 53a814f

I try this:

$ ../pig/bin/pig -x local
register 'target/pig-vector-1.0-jar-with-dependencies.jar';
x = load  'x' using PigStorage('\t') as (a:int, b:chararray, c:chararray);
y = foreach x generate a,b;
store y into 'y.w' using com.twitter.elephantbird.pig.store.SequenceFileStorage ( 
      '-c com.twitter.elephantbird.pig.util.IntWritableConverter', 
      '-c com.twitter.elephantbird.pig.util.TextConverter' ) ;
z = load 'y.w' using com.twitter.elephantbird.pig.load.SequenceFileLoader ( 
      '-c com.twitter.elephantbird.pig.util.IntWritableConverter', 
      '-c com.twitter.elephantbird.pig.util.TextConverter' ) as (a:int, b:chararray);
illustrate z;

and this happens:

java.lang.NullPointerException: Signature is null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperties(SequenceFileLoader.java:296)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperty(SequenceFileLoader.java:306)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.setLocation(SequenceFileLoader.java:411)
    at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:147)
    at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:116)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:91)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:119)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:194)
    at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
    at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:222)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:154)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1245)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67)
    at org.apache.pig.Main.run(Main.java:487)
    at org.apache.pig.Main.main(Main.java:108)
2012-01-03 20:09:49,315 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : Signature is null

This seems wrong.

Dependency on Guava

Very minor issue I wanted to report, more documentation than anything:

The dependency on Google Collections/Guava isn't mentioned in the README. If you're using ProtoBufs, that makes sense, but for a non-protobuf build the dependency isn't obvious without digging into the source.

Otherwise, fantastically useful!

ILLUSTRATE error

I receive this error when using ILLUSTRATE:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null

Pig Stack Trace

ERROR 2999: Unexpected internal error. null

java.lang.NullPointerException
at com.twitter.elephantbird.pig.util.PigCounterHelper.incrCounter(PigCounterHelper.java:26)
at com.twitter.elephantbird.pig.load.LzoBaseLoadFunc.incrCounter(LzoBaseLoadFunc.java:61)
at com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader.getNext(LzoProtobufB64LinePigLoader.java:83)
at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:209)
at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:189)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:131)
at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:166)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:91)
at org.apache.pig.PigServer.getExamples(PigServer.java:1155)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:630)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:107)

'ant examples' fails on generate-protobuf task

I'm guessing these two commits in May removed some classes that are used in the examples:

- whole of elephantbird/proto directory can be removed.
- remove lib thrift and protobuf-java jars from lib/

LzoTokenizedStorage writes bogus data

LzoTokenizedStorage appears to write data incorrectly. For example, the following load+store:

raw_data = LOAD '/data.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
store raw_data into '/user/travis/test' USING com.twitter.elephantbird.pig.store.LzoTokenizedStorage('\n');

writes these data:

[B@9ba6076
[B@48fd918a
[B@7f5e2075
[B@7ca522a6
[B@3d860038

In contrast, the following writes correctly compressed data, leading us to believe the issue lives in LzoTokenizedStorage rather than hadoop-lzo:

SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
raw_data = LOAD '/data.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
STORE raw_data INTO '/user/travis/test' USING PigStorage();
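
The [B@... strings are the giveaway: that is Java's default Object.toString() on a byte[] (type tag plus identity hash), which is what you get when a store function stringifies a byte array field instead of its contents. A quick demonstration:

public class ByteArrayToStringDemo {
  public static void main(String[] args) {
    byte[] bytes = "hello".getBytes();
    System.out.println(bytes.toString());   // prints something like [B@9ba6076
    System.out.println(new String(bytes));  // prints hello
  }
}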

Hive queries failed with big files

If the size of the lzo file is over 64m, hive queries will fail with the following exception (full stacktrace attached at the end):

 java.io.IOException: Compressed length 1602367537 exceeds max block size 67108864 (probably corrupt file)

One may advise indexing the files, but the index does not seem to be recognized either. However, a Pig script works fine with or without an index.

Full stacktrace:

java.io.IOException: Compressed length 1602367537 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:286)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:256)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:63)
at com.twitter.elephantbird.util.StreamSearcher.search(StreamSearcher.java:47)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.skipToNextSyncPoint(BinaryBlockReader.java:102)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.parseNextBlock(BinaryBlockReader.java:107)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.setupNewBlockIfNeeded(BinaryBlockReader.java:139)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNextProtoBytes(BinaryBlockReader.java:75)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNext(BinaryBlockReader.java:63)
at com.twitter.elephantbird.mapreduce.io.ProtobufBlockReader.readProtobuf(ProtobufBlockReader.java:42)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:85)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:28)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:193)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

Elephant-Bird does not work with protoc 2.4

Hello-

I'm running into trouble building elephant-bird. Ant is failing on the compile-generated-protobuf task, with errors that look like the protocol buffer jar file isn't ending up on the classpath:

Buildfile: /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build.xml

release:
     [echo] Building in release mode...

init:

compile-protobuf:
    [apply] Applied thrift to 1 file and 0 directories.
    [apply] Applied protoc to 4 files and 0 directories.
    [javac] Compiling 9 source files to /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/classes
    [javac] /Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:12: cannot find symbol
    [javac] symbol  : class MessageOrBuilder
    [javac] location: package com.google.protobuf
    [javac]       extends com.google.protobuf.MessageOrBuilder {
    [javac]                                  ^
    [javac]
... snip ...
/Users/steven/Desktop/Hacking/protobuf/elephant_bird/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:1275: cannot find symbol
 [javac] symbol  : method onChanged()
 [javac] location: class com.twitter.data.proto.tutorial.AddressBookProtos.Person.Builder
 [javac]           onChanged();
 [javac]           ^
 [javac] Note: Some input files use unchecked or unsafe operations.
 [javac] Note: Recompile with -Xlint:unchecked for details.
 [javac] 100 errors

Any ideas what might be going on here? It's running protoc just fine, and from what I can tell from build.xml, javac's classpath ought to include protobuf-java-2.3.0.jar from elephant-bird's lib directory, but it sure looks like it isn't actually getting set up correctly.

Load list in thrift structure

I'm having problems loading a list in a Thrift structure using the latest version of elephant-bird. My Pig script looks like:

raw_data = load 'client_event.lzo'
using com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader('com.twitter.clientapp.gen.LogEvent');

records = foreach raw_data generate
((event_details is null OR event_details.item_ids is null ) ? TOBAG((long)null) : event_details.item_ids) as item_ids,

describe records;
dump records;


Here is the thrift structure.

struct LogEvent {
1: optional EventDetails event_details
}

struct EventDetails {
1: optional list item_ids
}


In Pig 0.8, the mapper failed with a stack trace like:
java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:482)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:480)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POIsNull.getNext(POIsNull.java:162)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POOr.getNext(POOr.java:83)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNext(POBinCond.java:89)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:338)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:290)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:237)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

In Pig 0.10, the stack trace is:
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias raw_data
at org.apache.pig.PigServer.dumpSchema(PigServer.java:802)
at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:276)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:313)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:553)
at org.apache.pig.Main.main(Main.java:108)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245:
<file /tmp/webclient.pig, line 12, column 11> Cannot get schema from loadFunc com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:154)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1691)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1666)
at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1391)
at org.apache.pig.PigServer.getOperatorForAlias(PigServer.java:1384)
at org.apache.pig.PigServer.dumpSchema(PigServer.java:788)
... 7 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2218: Invalid resource schema: bag schema must have tuple as its field
at org.apache.pig.ResourceSchema$ResourceFieldSchema.throwInvalidSchemaException(ResourceSchema.java:213)
at org.apache.pig.impl.logicalLayer.schema.Schema.getPigSchema(Schema.java:1881)
at org.apache.pig.impl.logicalLayer.schema.Schema.getPigSchema(Schema.java:1871)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151)
... 18 more

boolean cannot be dereferenced

My apologies if this is a silly question, but does anyone else see this when compiling?

ed@curry:~/Projects/elephant-bird$ ant
Buildfile: build.xml

init:

compile-protobuf:
[exec] Result: 1
[apply] Applied thrift to 1 file and 0 directories.
[apply] Applied protoc to 4 files and 0 directories.
[javac] Compiling 9 source files to /home/ed/Projects/elephant-bird/build/classes
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:142: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/BlockStorage.java:149: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/Misc.java:118: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/Misc.java:125: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:210: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:217: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:542: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:549: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:940: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/data/proto/tutorial/AddressBookProtos.java:947: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/elephantbird/examples/proto/ThriftFixtures.java:316: boolean cannot be dereferenced
[javac] return newBuilder().mergeDelimitedFrom(input).buildParsed();
[javac] ^
[javac] /home/ed/Projects/elephant-bird/build/gen-java/com/twitter/elephantbird/examples/proto/ThriftFixtures.java:323: boolean cannot be dereferenced
[javac] .buildParsed();
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 12 errors

BUILD FAILED
/home/ed/Projects/elephant-bird/build.xml:162: The following error occurred while executing this line:
/home/ed/Projects/elephant-bird/build.xml:148: Compile failed; see the compiler error output for details.

Total time: 3 seconds

Local mode in LzoProtobufB64LinePigLoader not working

It seems that you guys are aware of this issue and that it's really a larger Pig issue (I saw one of Kevin's emails on the Pig list mentioning this), but new users will be thankful to have this information upfront.

There are two problems here:

a) The loader doesn't do any LZO decompression in local mode, so the data should be fed uncompressed to the loader in local mode. It's simple to write a wrapper script to work around this. It appears that Pig's local mode doesn't use the slicing interface (which is what wraps an LZO codec around a slice's input stream in elephant-bird), which is why this problem exists in the first place.

b) The loader doesn't actually feed any tuples to Pig in local mode. I've written a number of printlns in getNext() showing that protobufs are successfully deserialized into Pig tuples in local mode, but a DUMP of the data in Pig doesn't actually output anything. I've poked around the code and I'm still baffled by this. Can anyone shed some light on why this is the case?

Typos in "Thrift"

As reported by Torben:

Hey Dmitriy,
I found two typo issues in people_phone_number_count_thrift.pig and ThriftUtils.java.
Just search the branch with:

grep -rE '(thirft|thift)' *

Best regards,

NullPointerException thrown when both writer.finish() and writer.close() called

In ProtobufBlockWriter.java, the given example (see code below) throws a NullPointerException when the number of records written is not a multiple of numRecordsPerBlock:

ProtobufBlockWriter<Person> writer = new ProtobufBlockWriter<Person>(
new FileOutputStream("person_data"), Person.class);
writer.write(person1);
...
writer.write(person100000);
writer.finish();
writer.close();

The issue is that you can't call `writer.close()` after `writer.finish()`. The first line in `close()` calls `finish()`, so `finish()` is actually called twice. However, `finish()` can't be called twice, since once `serialize()` is invoked, `builder_` is no longer usable. One way to fix this defect is adding `builder_ = reinitializeBlockBuilder();` right after `serialize();` in `finish()`.

public void finish() throws IOException {
  if (builder_.getProtoBlobsCount() > 0) {
    serialize();
    // add: builder_ = reinitializeBlockBuilder();
  }
}

public void close() throws IOException {
  finish();
  out_.close();
}
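
Applying the suggested one-line fix, finish() would look like this (a sketch of the reporter's proposal, not necessarily the committed patch):

public void finish() throws IOException {
  if (builder_.getProtoBlobsCount() > 0) {
    serialize();
    // reset the builder so a second finish() (e.g. from close()) is a no-op
    builder_ = reinitializeBlockBuilder();
  }
}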

How to run example

Hello,

I'm running into trouble while trying to run one of the examples. Here's what I did:

$ ~/hadoop-0.20.2/bin/hadoop jar ~/elephant-bird/examples/dist/elephant-bird-examples-1.0.jar com.twitter.elephantbird.examples.ProtobufMRExample s3n://hassanrom-sna/name_age_sample.txt s3n://hassanrom-sna/name_age_sample.lzo
11/11/25 00:42:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/protobuf/GeneratedMessage
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at com.twitter.elephantbird.examples.ProtobufMRExample.runLzoToText(Unknown Source)
at com.twitter.elephantbird.examples.ProtobufMRExample.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.GeneratedMessage
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 18 more

Any ideas what I'm doing wrong? I've tried setting HADOOP_CLASSPATH to point to where the libraries are, and that didn't work. I've also tried bundling all the jars together, and that didn't work either. I also tried passing the jars via -libjars, with no luck. Some documentation on how to get the examples running would definitely be nice, especially for folks who are new to Hadoop.

Thanks in advance!

  • Hassan

ProtobufBytesToTuple support for recursive messages

Recursive messages cause a java.lang.StackOverflowError when using ProtobufBytesToTuple.

For example:

.proto file:

syntax = "proto2";
package test;
option java_package="com.test";

message a {
 message b {
    repeated b test = 1;
  }
  optional b test = 2;
}

pig script:

define ProtoFileLoader com.twitter.elephantbird.pig.load.LzoProtobufBlockPigLoader('some.proto.message$message');
  define test1 com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple('com.test.T$a');
record = LOAD '/somefile/' USING ProtoFileLoader;
packedrecord = FOREACH record GENERATE test1(protobuf.data);

error:

================================================================================
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. null

java.lang.StackOverflowError
        at java.util.AbstractList$Itr.<init>(AbstractList.java:318)
        at java.util.AbstractList$Itr.<init>(AbstractList.java:318)
        at java.util.AbstractList.iterator(AbstractList.java:273)
        at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1007)
        at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1006)
        at com.twitter.elephantbird.pig.util.ProtobufToPig.toSchema(ProtobufToPig.java:225)
        at com.twitter.elephantbird.pig.util.ProtobufToPig.messageToFieldSchema(ProtobufToPig.java:256)
        at com.twitter.elephantbird.pig.util.ProtobufToPig.toSchema(ProtobufToPig.java:227)
        at com.twitter.elephantbird.pig.util.ProtobufToPig.messageToFieldSchema(ProtobufToPig.java:256)
...
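
One way such recursion is usually broken, sketched below: track the message types on the current expansion path and stop when a type repeats. This is an illustrative approach, not elephant-bird's actual fix:

import java.util.HashSet;
import java.util.Set;

import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FieldDescriptor;

public class RecursionSafeSchemaWalk {
  /** Walks nested message types, skipping any type already being expanded
   *  higher up the call stack, so recursive .proto definitions terminate. */
  public static void walk(Descriptor descriptor, Set<String> onPath) {
    if (!onPath.add(descriptor.getFullName())) {
      return; // recursive reference: stop instead of overflowing the stack
    }
    for (FieldDescriptor field : descriptor.getFields()) {
      if (field.getJavaType() == FieldDescriptor.JavaType.MESSAGE) {
        walk(field.getMessageType(), onPath);
      }
    }
    onPath.remove(descriptor.getFullName());
  }

  public static void walk(Descriptor descriptor) {
    walk(descriptor, new HashSet<String>());
  }
}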

EB 3.0.0 release tasks

  • Clean up Readme.md, add new material regarding Sonatype + MC repos
  • Update materials in examples/ dir
  • Leave repo/ alone for a while
    • Remove repo/ once release artifacts are in MC
  • Deploy remaining 3rd-party jar to Sonatype OSS repo, promote to MC
    • Remove lib/ dir
    • Remove path from parent pom's dep management section
  • Ensure side-effect files generated by unit tests are placed beneath module target/ dirs somewhere

Writables need instantiation similar to InputFormats

With Raghu's refactoring, we can set the inputFormat and outputFormat as follows:

job.setInputFormatClass(
  LzoProtobufB64LineInputFormat.getInputFormatClass(MyProtobufClass.class, conf)
);

We need to do the same for Writables, as this:
job.setOutputValueClass(ThriftWritable.class);
doesn't work.
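
Until such a factory exists, one hypothetical workaround is a tiny concrete subclass that pins the type and has the no-arg constructor Hadoop needs. This sketch assumes ThriftWritable exposes a constructor taking a TypeRef<M>, and MyThriftStruct stands in for a generated Thrift class:

import com.twitter.elephantbird.mapreduce.io.ThriftWritable;
import com.twitter.elephantbird.util.TypeRef;

// Hypothetical sketch: a reflectively instantiable Writable with its
// Thrift type baked in
public class MyThriftWritable extends ThriftWritable<MyThriftStruct> {
  public MyThriftWritable() {
    super(new TypeRef<MyThriftStruct>() {});
  }
}

With that, job.setOutputValueClass(MyThriftWritable.class) hands Hadoop a concrete, default-constructible class.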

LZO Compressed Protocol Buffer fails to read the file correctly

I use an LZO (v1.01) compressed Protocol Buffer file on HDFS. While using the elephant-bird API to read the file in my Java MR job, no input records were passed on to my Map class. On debugging in detail, I found that the "com.twitter.elephantbird.util.StreamSearcher.search(InputStream)" method always returns false. Is this a known issue? If you need more details, please do not hesitate to reach me at [email protected].

Regards,
Karthik.

Pig Complex thrift structures

I'm using the ThriftWritableConverter to convert a pig tuple into a Thrift object. I get the following exception when trying to create the converter:

ERROR 2999: Unexpected internal error. could not instantiate 'com.twitter.elephantbird.pig.store.SequenceFileStorage' with arguments '[-c com.twitter.elephantbird.pig.util.ThriftWritableConverter AdvertisementId, -c com.twitter.elephantbird.pig.util.ThriftWritableConverter Advertisement]'

java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.store.SequenceFileStorage' with arguments '[-c com.twitter.elephantbird.pig.util.ThriftWritableConverter AdvertisementId, -c com.twitter.elephantbird.pig.util.ThriftWritableConverter Advertisement]'
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:502)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:5660)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:4034)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1501)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:1013)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:825)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1612)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1562)
at org.apache.pig.PigServer.registerQuery(PigServer.java:534)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:871)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:388)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:470)
... 21 more
Caused by: java.lang.RuntimeException: Failed to create WritableConverter instance
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getWritableConverter(SequenceFileLoader.java:247)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.<init>(SequenceFileLoader.java:140)
at com.twitter.elephantbird.pig.store.SequenceFileStorage.<init>(SequenceFileStorage.java:103)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.twitter.elephantbird.pig.load.SequenceFileLoader.getWritableConverter(SequenceFileLoader.java:222)
... 28 more
Caused by: java.lang.RuntimeException: while trying to find id in class AdvertisementId
at com.twitter.elephantbird.util.ThriftUtils.getFiedlType(ThriftUtils.java:110)
at com.twitter.elephantbird.thrift.TStructDescriptor$Field.<init>(TStructDescriptor.java:201)
at com.twitter.elephantbird.thrift.TStructDescriptor$Field.<init>(TStructDescriptor.java:128)
at com.twitter.elephantbird.thrift.TStructDescriptor.build(TStructDescriptor.java:104)
at com.twitter.elephantbird.thrift.TStructDescriptor.getInstance(TStructDescriptor.java:88)
at com.twitter.elephantbird.pig.util.ThriftToPig.<init>(ThriftToPig.java:56)
at com.twitter.elephantbird.pig.util.ThriftToPig.newInstance(ThriftToPig.java:52)
at com.twitter.elephantbird.pig.util.ThriftWritableConverter.<init>(ThriftWritableConverter.java:61)
... 33 more
Caused by: java.lang.NoSuchFieldException: id
at java.lang.Class.getDeclaredField(Class.java:1882)
at com.twitter.elephantbird.util.ThriftUtils.getFiedlType(ThriftUtils.java:107)
... 40 more

This is the part of my schema:

union AdvertisementId {
1: string id;
}

struct Advertisement {
1: required AdvertisementId id;
}

And this is the structure of the pig relation:
ads: {advertisementId: (id: chararray)}

TStructDescriptor tries to look up a field by name, but in the case of a union the generated classes don't have actual member fields.

EB-3 depends on thrift 0.7.0 instead of 0.5.0

  <dependency>
    <groupId>org.apache.thrift</groupId>
    <artifactId>libthrift</artifactId>
    <version>0.7.0</version>
  </dependency>

We've always depended on 0.5.0 AFAIK and I don't think this should change?

TestInvoker#testSpeed() may fail on slow machines

Hi there,

we ran into an issue where the following test in src/test/com/twitter/elephantbird/pig/piggybank/TestInvoker.java may fail on slow(er) machines. In our case, for instance, this happens on our build server. On my local dev box, the test sometimes passes and sometimes it doesn't (hmm).

    @Test
    public void testSpeed() throws IOException, SecurityException, ClassNotFoundException, NoSuchMethodException {
        EvalFunc<Double> log = new Log();
        Tuple tup = tf_.newTuple(1);
        long start = System.currentTimeMillis();
        for (int i=0; i < 1000000; i++) {
            tup.set(0, (double) i);
            log.exec(tup);
        }
        long staticSpeed = (System.currentTimeMillis()-start);
        start = System.currentTimeMillis();
        log = new InvokeForDouble("java.lang.Math.log", "Double", "static");
        for (int i=0; i < 1000000; i++) {
            tup.set(0, (double) i);
            log.exec(tup);
        }
        long dynamicSpeed = System.currentTimeMillis()-start;
        System.err.println("Dynamic to static ratio: "+((float) dynamicSpeed)/staticSpeed);
        assertTrue( ((float) dynamicSpeed)/staticSpeed < 5);
    }

The culprit is the last line in the method, where it checks whether the dynamic-to-static ratio is smaller than "5" (see my question below -- I do not know why 5 in particular is used here).

The "problem" with this test is that its success seems to depend on the build environment. Even though a relative ratio is used, this alone apparently does not guarantee that the test completes successfully on any box that the build is being run on (e.g. it may pass if the build box is otherwise idle but it may fail if there are other tasks running in the background). Also, this unit test is not a functional test but a performance test.

Hence I'd say one can argue pro and con whether a) this test should be enabled by default, and b) whether failing it should make the full build fail (maybe a warning would be more appropriate?).

On a related note: Where does the value/number of 5 (last line in testSpeed() ) come from? Was it arbitrarily chosen? FWIW, on my dev box most dynamic-to-static ratios are in the 4.9-5.3 range, and on our build server they are in the 6.1-6.5 range. At the moment, we have decided to @Ignore this test in our build for the time being because otherwise the build always fails due to this single test.

Maven won't build

To reproduce:

mvn -e package

org.apache.maven.lifecycle.LifecycleExecutionException: Error configuring: com.github.igor-petruk.protobuf:protobuf-maven-plugin. Reason: Invalid or missing parameters: [Mojo parameter [name: 'inputDirectories'; alias: 'null']] for mojo: com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.4:run
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:723)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalWithLifecycle(DefaultLifecycleExecutor.java:556)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:535)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:387)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:348)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:180)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:328)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: org.apache.maven.plugin.PluginParameterException: Error configuring: com.github.igor-petruk.protobuf:protobuf-maven-plugin. Reason: Invalid or missing parameters: [Mojo parameter [name: 'inputDirectories'; alias: 'null']] for mojo: com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.4:run
at org.apache.maven.plugin.DefaultPluginManager.checkRequiredParameters(DefaultPluginManager.java:1117)
at org.apache.maven.plugin.DefaultPluginManager.getConfiguredMojo(DefaultPluginManager.java:722)
at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:468)
at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:694)
... 17 more

json handling improvements

JSON is currently handled a little oddly. Pig can read JSON from any file input format; map reduce jobs, however, can only read JSON from LZO files. Additionally, parsing is done a little differently in the two places.

I think the ideal situation would be a json input format that wraps a given actual input format, converting each valid value to json. The json pig loader would simply use the json input format, and convert each record into a tuple (rather than parsing records into json like it does today).
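
To make the wrapping idea concrete, here is a minimal sketch. The class name and the delegation to LineRecordReader are illustrative assumptions, not the refactor in the branch linked below:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

/** Sketch of the wrapping idea: delegate record boundaries to an underlying
 *  reader and only do JSON parsing here, so any file format that yields text
 *  lines can feed JSON consumers. */
public class JsonWrappingRecordReader extends RecordReader<LongWritable, JSONObject> {
  private final LineRecordReader delegate = new LineRecordReader();
  private final JSONParser parser = new JSONParser();
  private JSONObject currentValue;

  @Override public void initialize(InputSplit split, TaskAttemptContext ctx)
      throws IOException, InterruptedException {
    delegate.initialize(split, ctx);
  }

  @Override public boolean nextKeyValue() throws IOException, InterruptedException {
    while (delegate.nextKeyValue()) {
      Text line = delegate.getCurrentValue();
      try {
        currentValue = (JSONObject) parser.parse(line.toString());
        if (currentValue != null) {
          return true;
        }
      } catch (ParseException e) {
        // skip unparseable lines and try the next one
      }
    }
    return false;
  }

  @Override public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return delegate.getCurrentKey();
  }

  @Override public JSONObject getCurrentValue() {
    return currentValue;
  }

  @Override public float getProgress() throws IOException, InterruptedException {
    return delegate.getProgress();
  }

  @Override public void close() throws IOException {
    delegate.close();
  }
}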

Thoughts? I think it would be cool to separate the app data format (pig tuple), from the storage record format (json) from the file format (lzo compressed file) so users can mix & match as they see fit. For example, there's no reason why someone who wants to compress files with snappy shouldn't be able to use the json & pig layers.

I started playing around with this idea in the recent json loader refactor, and tonight in https://github.com/traviscrawford/elephant-bird/tree/json_record_reader but want to get some feedback before going too far.

Cannot compile elephant-bird, ivy.xml parse error

I get this after a git clone and then running "ant". I am not an ivy expert. However, something simple must be wrong.

resolve:
No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
[ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = /Users/shiv/Documents/elephant-bird/ivysettings.xml
[ivy:retrieve] [xml parsing: ivy.xml:4:117: cvc-complex-type.3.2.2: Attribute 'defaultconf' is not allowed to appear in element 'configurations'. in file:/Users/shiv/Documents/elephant-bird/ivy.xml
[ivy:retrieve] ]

BUILD FAILED
/Users/shiv/Documents/elephant-bird/build.xml:114: syntax errors in ivy file: java.text.ParseException: [xml parsing: ivy.xml:4:117: cvc-complex-type.3.2.2: Attribute 'defaultconf' is not allowed to appear in element 'configurations'. in file:/Users/shiv/Documents/elephant-bird/ivy.xml
]
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser$AbstractParser.checkErrors(AbstractModuleDescriptorParser.java:89)
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser$AbstractParser.getModuleDescriptor(AbstractModuleDescriptorParser.java:342)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorParser.parseDescriptor(XmlModuleDescriptorParser.java:102)
at org.apache.ivy.plugins.parser.AbstractModuleDescriptorParser.parseDescriptor(AbstractModuleDescriptorParser.java:48)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:183)
at org.apache.ivy.Ivy.resolve(Ivy.java:502)
at org.apache.ivy.ant.IvyResolve.doExecute(IvyResolve.java:234)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.ivy.ant.IvyPostResolveTask.ensureResolved(IvyPostResolveTask.java:207)
at org.apache.ivy.ant.IvyPostResolveTask.prepareAndCheck(IvyPostResolveTask.java:154)
at org.apache.ivy.ant.IvyRetrieve.doExecute(IvyRetrieve.java:49)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)

Total time: 2 seconds

Build failing in example directory

I have downloaded the sources and am following the instructions in the readme.md file. The platform is Windows.

The build fails while compiling the examples directory with the following error:

compile-protobuf:

[apply] --twadoop_out: protoc-gen-twadoop: The system cannot find the file specified.

I see this file in the examples directory, but somehow protoc is not able to find it.

elephant-bird fails w/ Cascading 2.0-wip226

With one of the more recent wips, Cascading changed SourceCall from a class to an interface. Elephant-bird uses SourceCall here and in other places -- to make this work, the project needs to be recompiled against Cascading 2.0-wip226.

Protocol Buffer+Lzo Errors

I tested with a simple text file plus LZO before moving on to encoding with Protocol Buffers and using lzop to compress it. But when I try to run a simple map reduce job on top of it, I get these errors:

12-06-13 01:19:16,446 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201206121825_0041_m_000000_1 : java.lang.RuntimeException: error rate while reading input records crossed threshold
at com.twitter.elephantbird.mapreduce.input.LzoRecordReader$InputErrorTracker.incErrors(LzoRecordReader.java:155)
at com.twitter.elephantbird.mapreduce.input.LzoBinaryB64LineRecordReader.nextKeyValue(LzoBinaryB64LineRecordReader.java:135)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.Exception: Unknown error
at com.twitter.elephantbird.mapreduce.input.LzoRecordReader$InputErrorTracker.incErrors(LzoRecordReader.java:138)

Documentation clarification

Hi Folks,

A small documentation question...

The Readme.md says "2. Pig 0.6 (not compatible with 0.7+)"
Is that still the case?

I am looking for a plain old Json Loader - Elephant Bird seems to include one - but only if the data is also LZO compressed. Should I carry on trying to figure out Elephant Bird even though my data isn't LZO compressed?

thanks

Release status of ElephantBird 2.x + LzoPigStorage ?

Hi Kevin & all,

I have two quick questions:

First, I have seen that Dmitriy merged (kevinweil/elephant-bird@ebaa220c5c728) his Pig 0.8 compatibility branch of EB into Kevin's eb-dev. Kevin's master branch of kevinweil/elephant-bird now contains the "2.0.2 bug fix release" and is just one version-bump commit behind eb-dev. So my question would be: Is EB 2.0.2 considered stable or an official release? Is Kevin's repo the recommended code base, or rather Dmitriy's? (I'm a bit confused about what development happens in each repo, sorry :-D)

Second, gerritjvv forked Dmitriy's EB repo some time ago (gerritjvv/elephant-bird). He added the class LzoPigStorage.java. This class appears to be a very simple, straight-forward LZO-enabled wrapper for standard PigStorage -- less than 10 lines of effective code. If the implementation is sound, do you think it would make sense to bring LzoPigStorage into your repository?

Many thanks for your work on ElephantBird, it's appreciated!

Best,
Michael

Not really an issue, but a question

Hi all,

I couldn't find a good way to email anyone, so I'm posting this as an issue in the hopes that it gets to someone.

I just have a very simple question: in Protobufs, you have a byte[] called KNOWN_GOOD_POSITION_MARKER. I'm curious about the chosen pattern -- why did you pick those particular byte values? And why is it 16 bytes long? I'm sure it's a simple reason; I just can't seem to wrap my head around it.

Thanks,

Scott Fines
[email protected]

Ant build fails on protobuf

Almost there with getting elephant-bird compiled! I now get the error below. On the main wiki, there are no instructions on getting "protoc". This happens when I run "ant".

Thanks, Shiv

generate-protobuf:

BUILD FAILED
/Users/shiv/Documents/elephant-bird/build.xml:131: The following error occurred while executing this line:
/Users/shiv/Documents/elephant-bird/build.xml:59: Execute failed: java.io.IOException: Cannot run program "protoc": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at java.lang.Runtime.exec(Runtime.java:593)
at org.apache.tools.ant.taskdefs.Execute$Java13CommandLauncher.exec(Execute.java:862)
at org.apache.tools.ant.taskdefs.Execute.launch(Execute.java:481)
at org.apache.tools.ant.taskdefs.Execute.execute(Execute.java:495)
at org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:631)
at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:672)
at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:498)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.taskdefs.Sequential.execute(Sequential.java:68)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.taskdefs.MacroInstance.execute(MacroInstance.java:398)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 37 more

Unable to compile with noproto

I did a fresh git clone this evening and am getting lots of errors about being unable to find BlockStorage when compiling with "ant noproto release-jar":

[echo] building without protobuf support
[javac] /Users/jreitman/Development/web_development/elephant-bird/build.xml:89: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 83 source files to /Users/jreitman/Development/web_development/elephant-bird/build/classes
[javac] /Users/jreitman/Development/web_development/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:6: package com.twitter.data.proto.BlockStorage does not exist
[javac] import com.twitter.data.proto.BlockStorage.SerializedBlock;
[javac] ^
[javac] /Users/jreitman/Development/web_development/elephant-bird/src/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java:27: cannot find symbol

Add optional conversion of non-String map keys to Thrift, Proto converters

Currently, if the Protobuf or Thrift structure contains a map with non-String keys, we give up, since Pig does not support non-String map keys.

We should allow the user to specify an alternative behavior, and provide at least one alternative built-in option (call toString() on the key and use that).
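
A hedged sketch of the proposed conversion (the method and class names are illustrative, not the actual converter API):

import java.util.HashMap;
import java.util.Map;

public class MapKeyConversionSketch {
  public static Map<String, Object> toPigMap(Map<?, ?> source, boolean convertKeys) {
    Map<String, Object> result = new HashMap<String, Object>();
    for (Map.Entry<?, ?> e : source.entrySet()) {
      Object key = e.getKey();
      if (!(key instanceof String) && !convertKeys) {
        // Current behavior: non-String keys are unsupported.
        throw new IllegalArgumentException("Pig maps require String keys: " + key);
      }
      // Alternative behavior: fall back to the key's String representation.
      result.put(String.valueOf(key), e.getValue());
    }
    return result;
  }
}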

Signature is null in SequenceFileLoader

I have a problem similar to #112, but with the SequenceFileLoader in mapreduce mode.

The signature is null and a command fails because of that. The error is:

java.lang.NullPointerException: Signature is null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperties(SequenceFileLoader.java:296)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperty(SequenceFileLoader.java:306)
    at com.twitter.elephantbird.pig.load.SequenceFileLoader.setLocation(SequenceFileLoader.java:411)

The commands leading to this error are:

A = load 'hdfs://XXX' USING com.twitter.elephantbird.pig.load.SequenceFileLoader (
    '-c com.twitter.elephantbird.pig.util.TextConverter',
    '-c com.twitter.elephantbird.pig.util.ProtobufWritableConverter de.pc2.dedup.fschunk.pig.PigProtocol.File');
ILLUSTRATE A;

repeated fields missing from Hive table

I have a repeated field in my protobuf message:

message Tweet {
....
    repeated string hashtags = 6;
....
}

Here is the generated Java protobuf code for the repeated field:

// repeated string hashtags = 6;
public static final int HASHTAGS_FIELD_NUMBER = 6;
private java.util.List<java.lang.String> hashtags_ = java.util.Collections.emptyList();

Everything looks good. However, when I created a Hive table, the repeated field 'hashtags_' is missing from the table. All other non-repeated fields show up correctly in the table as columns.

CREATE EXTERNAL TABLE tweet 
ROW FORMAT SERDE 'protobuf.hive.serde.LzoTweetProtobufHiveSerde'
STORED AS INPUTFORMAT 'protobuf.mapred.input.DeprecatedLzoTweetProtobufBlockInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/path/to/hdfs/dir';

describe tweet;
OK
id_ bigint  from deserializer
createdat_  string  from deserializer
text_   string  from deserializer
source_ string  from deserializer
user_   protobuf.TweetProtos$User   from deserializer
inreplytoscreenname_    string  from deserializer
....

Am I missing something here? Thanks!

SequenceFileLoader fails if given comma-separated list of files

data = LOAD 'hdfs://localhost//foo/23,hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader ()

produces:

Backend error message during job submission

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:280)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:50)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1063)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1066)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1002)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:966)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
... 14 more
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.checkChar(URI.java:2992)
at java.net.URI$Parser.parse(URI.java:3008)
at java.net.URI.<init>(URI.java:736)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
... 29 more

Why is Elephant-Bird offering its own LZO Input/Output Formats?

Kevin's project "hadoop-lzo" (which EB depends on anyway) already offers LZO I/O formats for Text, etc. -- just not protobuf.

Why does EB choose to maintain a forked copy?

Would it be acceptable if I contributed a patch that makes EB classes extend the hadoop-lzo ones (with tests to ensure we don't break users of them)?
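
A minimal sketch of the idea, using DeprecatedLzoTextInputFormat as an example (the class names come from the describe output below; the empty body is illustrative, not a finished patch):

package com.twitter.elephantbird.mapred.input;

// Reuse hadoop-lzo's split and record-reader logic instead of maintaining a
// duplicate; the subclass exists only to preserve the elephant-bird class name
// for existing users.
public class DeprecatedLzoTextInputFormat
    extends com.hadoop.mapred.DeprecatedLzoTextInputFormat {
}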

Some users of EB classes have reported that they do not gel properly with Hive's CombineHiveInputFormat (which is now the default input format).

For the same simple "README.lzo" text file input that spans four blocks (512 bytes each):

hive> describe extended lzoblocks;
OK
r string

Detailed Table Information Table(tableName:lzoblocks, dbName:default, owner:root, createTime:1329727318, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:r, type:string, comment:null)], location:hdfs://localhost/user/hive/warehouse/lzoblocks, inputFormat:com.hadoop.mapred.DeprecatedLzoTextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1329727393}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.078 seconds

hive> describe extended ebblocks;
OK
r string

Detailed Table Information Table(tableName:ebblocks, dbName:default, owner:root, createTime:1329730707, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:r, type:string, comment:null)], location:hdfs://localhost/user/hive/warehouse/ebblocks, inputFormat:com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1329732671}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.07 seconds

A simple select against the hadoop-lzo-based table works fine, while the latter fails when used with CDH3u3. The error the ebblocks table gets on a select * is [0].

This may be MAPREDUCE-1597 but I still think EB should be reusing classes rather than duplicating code and diverging in behavior.

[0] - 2012-02-17 11:04:07,598 WARN com.hadoop.compression.lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
java.io.IOException: Compressed length 439750499 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:285)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:255)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:63)
at com.twitter.elephantbird.util.StreamSearcher.search(StreamSearcher.java:47)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.skipToNextSyncPoint(BinaryBlockReader.java:102)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.parseNextBlock(BinaryBlockReader.java:107)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.setupNewBlockIfNeeded(BinaryBlockReader.java:139)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNextProtoBytes(BinaryBlockReader.java:75)
at com.twitter.elephantbird.mapreduce.io.BinaryBlockReader.readNext(BinaryBlockReader.java:63)
at com.twitter.elephantbird.mapreduce.io.ProtobufBlockReader.readProtobuf(ProtobufBlockReader.java:42)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:84)
at com.twitter.elephantbird.mapred.input.DeprecatedLzoProtobufBlockRecordReader.next(DeprecatedLzoProtobufBlockRecordReader.java:27)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)

LZO Compressed Protocol Buffer fails to read the file correctly - New

I use an LZO (v1.01) compressed Protocol Buffer file on HDFS. While using the elephant-bird API to read the file in my Java MR job, no input records were passed on to my Map class. On debugging in detail, I found that the com.twitter.elephantbird.util.StreamSearcher.search(InputStream) method always returns false. Is this a known issue? If you need more details, please do not hesitate to reach me at [email protected]. I also wanted to mention that my LZO file contains a list of repeated Protobuf objects in it.

The pattern:

new byte[] { 0x29, (byte)0xd8, (byte)0xd5, 0x06, 0x58,
             (byte)0xcd, 0x4c, 0x29, (byte)0xb2, (byte)0xbc,
             0x57, (byte)0x99, 0x21, 0x71, (byte)0xbd, (byte)0xff };

does not seem to match anywhere in the LZO-compressed GPB file.

Regards,
Karthik.
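
One note that may help: that byte pattern is elephant-bird's block-format sync marker, so StreamSearcher can only find it in files written in the block format; an lzop-compressed stream of raw serialized messages will not contain it. A minimal writing sketch, assuming elephant-bird's ProtobufBlockWriter and a generated Tweet class (the Tweet name and path are hypothetical):

import java.io.FileOutputStream;
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter;

public class BlockWriterSketch {
  public static void main(String[] args) throws Exception {
    FileOutputStream out = new FileOutputStream("/tmp/tweets.pb_block");
    ProtobufBlockWriter<Tweet> writer =
        new ProtobufBlockWriter<Tweet>(out, Tweet.class);
    writer.write(Tweet.newBuilder().setText("hello").build());
    writer.finish();  // flush the pending block; the sync marker precedes each block
    writer.close();
  }
}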

Using a ProtobufWritable as key breaks partitioning

The default Hadoop partitioner uses the key's hashCode() to decide which reducer to send the key-value pair to. ProtobufWritable returns the hashCode from the underlying protocol buffers object.
So far so good. Unfortunately, a protocol buffers (2.3) object returns different hashCodes across different JVM invocations. This means that if you use a ProtobufWritable as the key, data that should be on the same node is sent to different ones.

Example code (replace the ProtobufWritable with one of your own); if you run this a few times in new JVMs, it will give different partitions:

// ProtobufNodeKeyWritable / NodeKey are the reporter's own generated classes.
ProtobufNodeKeyWritable key1 = new ProtobufNodeKeyWritable();
key1.set(NodeKey.newBuilder().setKey("key").setNode("node").build());

// HashPartitioner hashes via key1.hashCode(), which is not stable across JVMs.
HashPartitioner<ProtobufNodeKeyWritable, NullWritable> p =
    new HashPartitioner<ProtobufNodeKeyWritable, NullWritable>();
System.out.println(p.getPartition(key1, null, 5));
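
A hedged workaround sketch: partition on a deterministic hash of the serialized message bytes rather than the message's hashCode(). StableProtobufPartitioner is an illustrative name, not part of elephant-bird:

import java.util.Arrays;
import org.apache.hadoop.mapreduce.Partitioner;
import com.google.protobuf.Message;
import com.twitter.elephantbird.mapreduce.io.ProtobufWritable;

public class StableProtobufPartitioner<M extends Message, V>
    extends Partitioner<ProtobufWritable<M>, V> {
  @Override
  public int getPartition(ProtobufWritable<M> key, V value, int numPartitions) {
    // Arrays.hashCode over the serialized form is stable across JVMs, unlike
    // the identity-based hashCode of protobuf 2.3 messages.
    int hash = Arrays.hashCode(key.get().toByteArray());
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be installed with job.setPartitionerClass(StableProtobufPartitioner.class); note that two messages with equal contents still serialize identically, so grouping is preserved.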
