
maple's Introduction

This repo is deprecated. Please see: scalding/maple

which now hosts the HBase taps, the MemoryTap, and Etsy's LocalTap. For JDBC, please see: Cascading-JDBC.

This deprecation is purely due to the management cost of the extra repo. If the old code meets your needs, there is little reason to upgrade.

maple

A collection of useful Cascading taps.

Building

Maple uses Leiningen 2.0 to build.

  1. lein with-profile dev deps
  2. lein with-profile dev uberjar
  3. lein with-profile dev install

The above should build a jar with all dependencies; the install step then adds this jar to your local Maven repository under ~/.m2/repository.

Usage

Maple is hosted on Conjars. We expect most users will pull "com.twitter/maple" with the version they need. If you are submitting a patch, you will need to follow the above steps in Building.
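
For orientation, here is a minimal sketch of wiring one of the taps into a Cascading 2.x flow. The HBaseScheme(keyFields, family, valueFields) and HBaseTap(tableName, scheme) constructor shapes follow this repo's HBase package, but check the source for your version; the paths, table, and field names are placeholders.

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;
    import com.twitter.maple.hbase.HBaseScheme;
    import com.twitter.maple.hbase.HBaseTap;

    public class MapleExample {
        public static void main(String[] args) {
            // Source: a tab-separated file of (key, foo) pairs on HDFS.
            Tap source = new Hfs(new TextDelimited(new Fields("key", "foo"), "\t"), "input/path");

            // Sink: write the "foo" field into column family "cf" of the
            // "test" HBase table, keyed on the "key" field.
            HBaseScheme scheme = new HBaseScheme(new Fields("key"), "cf", new Fields("foo"));
            Tap sink = new HBaseTap("test", scheme);

            Pipe copy = new Pipe("copy");
            Flow flow = new HadoopFlowConnector().connect(source, sink, copy);
            flow.complete();
        }
    }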

License

Copyright (C) 2012 Twitter Inc

Distributed under the Eclipse Public License, the same as Clojure.

maple's People

Contributors

arkajit, avi-stripe, azymnis, johnynek, koertkuipers, mishok13, noitcudni, r0man, senior, sritchie


maple's Issues

HBaseScheme can only serialize strings

I have an HBase scheme/tap written for a Cascading 2.0 pre-release version that I would love to replace with the HBase tap/scheme in maple. One issue I'm running into is that the code assumes row keys and values are strings. I'm using bytes as the key and Thrift structures serialized to bytes for the values.

Is there any interest in making the maple HBaseScheme more flexible in this regard? It looks like the scheme's source code just puts the bytes in a tuple; maybe the sink code could do the same?
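
One possible direction, sketched on the assumption that the current sink stringifies every value before converting to bytes: pass binary values through untouched and only fall back to string conversion otherwise. The helper below is hypothetical, not code from this repo.

    import java.util.Arrays;

    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical helper for the sink path: only fall back to string
    // conversion when the value is not already binary.
    public final class CellBytes {
        private CellBytes() {}

        public static byte[] toBytes(Object value) {
            if (value == null) {
                return new byte[0]; // tolerate missing values
            }
            if (value instanceof ImmutableBytesWritable) {
                ImmutableBytesWritable w = (ImmutableBytesWritable) value;
                return Arrays.copyOfRange(w.get(), w.getOffset(), w.getOffset() + w.getLength());
            }
            if (value instanceof byte[]) {
                return (byte[]) value; // e.g. Thrift structures serialized to bytes
            }
            return Bytes.toBytes(value.toString()); // current string-only behavior
        }
    }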

HBase Tap reports misleading error 'table is missing'

When there is a configuration issue or ZooKeeper isn't running, the error message reported is that the table does not exist. In the case below, the table does exist; the tap just doesn't know that because it hasn't connected to ZooKeeper.

2012-07-25 17:40:20,588 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: TABLENAME does not exist !
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.initialize(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.prepare(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at cascading.flow.stream.SinkStage.prepare(SinkStage.java:60)
at cascading.flow.stream.StreamGraph.prepare(StreamGraph.java:165)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:107)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
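
A rough sketch of a check that would separate "table missing" from "cannot reach ZooKeeper"; the HBaseAdmin calls are standard HBase client API, but where such a check would live inside HBaseTap.sinkConfInit is an assumption.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ZooKeeperConnectionException;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Sketch: distinguish a connection failure from a genuinely missing table.
    public class TableCheck {
        public static void requireTable(Configuration conf, String tableName) throws IOException {
            HBaseAdmin admin;
            try {
                admin = new HBaseAdmin(conf); // fails fast if ZK is unreachable
            } catch (ZooKeeperConnectionException e) {
                throw new IOException("cannot connect to ZooKeeper; check hbase-site.xml / quorum settings", e);
            }
            if (!admin.tableExists(tableName)) {
                throw new IOException("table " + tableName + " does not exist!");
            }
        }
    }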

Support DataDrivenDBInputFormat

This would support splits for databases that don't allow limit and offset.

Similar implementations exist in Sqoop (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java) and Hadoop (https://github.com/apache/hadoop-mapreduce/blob/trunk/src/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java). However, due to implementation differences, some translation and porting would have to occur.

I can do this work but I wanted to see what others think. Do you guys have any recommendations for this?
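
For context, the core idea shared by both linked implementations is to run a bounding query over an indexed split column and cut the resulting [min, max] range into splits, instead of paging with LIMIT/OFFSET. A minimal sketch of that idea, with the table, column, and connection details as placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch of DataDrivenDBInputFormat's bounding-query idea: split on an
    // indexed column's [min, max] range rather than paging with LIMIT/OFFSET.
    public class BoundingQuery {
        public static long[] bounds(Connection conn, String table, String splitCol) throws SQLException {
            String sql = "SELECT MIN(" + splitCol + "), MAX(" + splitCol + ") FROM " + table;
            try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                rs.next();
                return new long[] { rs.getLong(1), rs.getLong(2) };
            }
        }

        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db"); // placeholder URL
            long[] b = bounds(conn, "my_table", "id");
            long numSplits = 4;
            long chunk = Math.max(1, (b[1] - b[0] + 1) / numSplits);
            for (long lo = b[0]; lo <= b[1]; lo += chunk) {
                long hi = Math.min(lo + chunk - 1, b[1]);
                // each split becomes: WHERE id >= lo AND id <= hi
                System.out.println("split: id >= " + lo + " AND id <= " + hi);
            }
        }
    }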

Missing cell value causes a NullPointerException

Here's a stack trace.

cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.io.ImmutableBytesWritable.<init>(ImmutableBytesWritable.java:60)
at com.twitter.maple.hbase.HBaseScheme.source(Unknown Source)
at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
... 6 more

If my table in HBase has a schema that looks like the following:
table_name: test
column_family: cf

Say that my HBaseScheme is expecting the value fields foo and bar, and the test table has the following rows.

1, cf:foo="hello", cf:bar="world"
2, cf:foo="bye"

Row 2 will cause the exception described above.

I'd expect an empty byte array to be returned for row 2's cf:bar column.
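
A minimal sketch of the guard being asked for, assuming the tuple is built roughly as below; the actual internals of HBaseScheme.source may differ:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    import cascading.tuple.Tuple;

    // Sketch: substitute an empty byte array when a cell is absent, instead of
    // passing null into ImmutableBytesWritable (which throws the NPE above).
    public class SafeSource {
        static final byte[] EMPTY = new byte[0];

        public static Tuple toTuple(Result row, String family, String[] qualifiers) {
            Tuple tuple = new Tuple();
            for (String qualifier : qualifiers) {
                byte[] value = row.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier));
                tuple.add(new ImmutableBytesWritable(value == null ? EMPTY : value));
            }
            return tuple;
        }
    }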

Use of OFFSET very inefficient with large Postgres DB

When using maple to import a 40GB+ Postgres database, I noticed that queries became too slow and the complete Hadoop job failed because of the use of OFFSET.

After changing this line to this:

            // HARDCODING PRIMARY KEY.....
            query.append(" WHERE id >= ").append(split.getStart());
            query.append(" LIMIT ").append(split.getLength());

With this change the query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your index might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.
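
A more generic variant of the same idea is to make the split column configurable and bound each split with a range predicate; a minimal sketch, with all names as placeholders rather than maple's API:

    // Sketch: range predicate on a configurable, indexed split column instead
    // of LIMIT/OFFSET. The names here are placeholders.
    public class RangeQuery {
        public static String build(String baseSelect, String splitCol, long start, long end) {
            return baseSelect
                + " WHERE " + splitCol + " >= " + start
                + " AND " + splitCol + " < " + end
                + " ORDER BY " + splitCol;
        }

        public static void main(String[] args) {
            System.out.println(build("SELECT id, payload FROM big_table", "id", 0L, 100000L));
        }
    }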

Has this library been used on large Postgres DBs before? I would like to gain some insight into best practices. Even with the above optimization my import time is around 3 hours.

Thanks for your work on maple.

Cheers,
Jeroen

Testing with twitter scalding JobTest

Hi everyone,
Could you kindly provide some detail on how to use maple's HBase support when testing jobs with the JobTest class included in Twitter Scalding?
Thank you in advance.

JDBCTap fails when trying to write byte arrays

Hi,

I recently ran into an issue when trying to write byte arrays using JDBCTap (I'm using the bytea type in PostgreSQL). The issue is almost identical to the one resolved by this pull request, but concerns writing rather than reading objects from Postgres. Basically it boils down to the fact that cascading.tuple.Tuple's get method does a cast to Comparable, which of course breaks types that don't implement said interface.

I have a patch for this issue, which solves it for me without breaking any of the existing code.

Cheers.
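
For reference, the Comparable cast described above can be sidestepped with Tuple.getObject, which returns the element without a cast; a minimal illustration (the surrounding sink loop is assumed, not copied from this repo):

    import cascading.tuple.Tuple;

    // Sketch: Tuple.get(i) casts the element to Comparable, which breaks byte[];
    // Tuple.getObject(i) returns it as a plain Object instead.
    public class SinkValue {
        public static Object valueAt(Tuple tuple, int i) {
            return tuple.getObject(i); // safe for byte[] bound for a bytea column
        }
    }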

JDBCTap fails on Oracle with "ORA-00911: invalid character"

When using the JDBCTap with an Oracle database (using Oracle's ojdbc6.jar driver) the flow fails with an IOException:

Caused by: java.io.IOException: unable to execute insert batch [msglength: 29][totstmts: 1000][crntstmts: 1000][batch: 1000] ORA-00911: invalid character

at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.createThrowMessage(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.JDBCTapCollector.collect(Unknown Source)
at com.twitter.maple.jdbc.JDBCScheme.sink(Unknown Source)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)

This is fixed by removing the query.append(";"); on line 276 of DBOutputFormat.java and removing the semicolon from the query.append(");") on line 231. Apparently the Oracle JDBC driver doesn't accept a semicolon at the end of the SQL statement.
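
In other words, the statement handed to the driver must end without a terminator. A small sketch of the corrected construction described above, with the table and column names as placeholders:

    // Sketch: build the parameterized INSERT without a trailing semicolon;
    // Oracle's JDBC driver rejects ";" inside a statement (ORA-00911).
    public class InsertSql {
        public static String insert(String table, String[] columns) {
            StringBuilder query = new StringBuilder("INSERT INTO ").append(table).append(" (");
            for (int i = 0; i < columns.length; i++) {
                query.append(i > 0 ? ", " : "").append(columns[i]);
            }
            query.append(") VALUES (");
            for (int i = 0; i < columns.length; i++) {
                query.append(i > 0 ? ", " : "").append("?");
            }
            query.append(")"); // previously query.append(");"), with the semicolon removed
            return query.toString();
        }
    }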
