
maple's Introduction

This repo is deprecated. Please see: scalding/maple

which now hosts the HBase taps, the MemoryTap, and Etsy's LocalTap. For JDBC, please see: Cascading-JDBC.

This deprecation is purely due to the management cost of the extra repo. If the old code meets your needs, there is little reason to upgrade.

maple

A collection of useful Cascading taps.

Building

Maple uses Leiningen 2.0 to build.

  1. lein with-profile dev deps
  2. lein with-profile dev uberjar
  3. lein with-profile dev install

The above should build a jar with all dependencies; the install step then adds this jar to your local Maven repository under ~/.m2/repository.

Usage

Maple is hosted on Conjars. We expect most users will pull "com.twitter/maple" with the version they need. If you are submitting a patch, you will need to follow the above steps in Building.
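
For orientation, here is a minimal sketch of wiring one of the taps into a Cascading 2.x flow. The HBaseScheme(keyFields, family, valueFields) and HBaseTap(tableName, scheme) constructor shapes follow this repo's HBase package, but check the source for your version; the paths, table, and field names are placeholders.

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;
    import com.twitter.maple.hbase.HBaseScheme;
    import com.twitter.maple.hbase.HBaseTap;

    public class MapleExample {
        public static void main(String[] args) {
            // Source: a tab-separated file of (key, foo) pairs on HDFS.
            Tap source = new Hfs(new TextDelimited(new Fields("key", "foo"), "\t"), "input/path");

            // Sink: write the "foo" field into column family "cf" of the
            // "test" HBase table, keyed on the "key" field.
            HBaseScheme scheme = new HBaseScheme(new Fields("key"), "cf", new Fields("foo"));
            Tap sink = new HBaseTap("test", scheme);

            Pipe copy = new Pipe("copy");
            Flow flow = new HadoopFlowConnector().connect(source, sink, copy);
            flow.complete();
        }
    }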

License

Copyright (C) 2012 Twitter Inc

Distributed under the Eclipse Public License, the same as Clojure.

maple's People

Contributors

arkajit, avi-stripe, azymnis, johnynek, koertkuipers, mishok13, noitcudni, r0man, senior, sritchie


maple's Issues

HBaseScheme can only serialize strings

I have an HBase scheme/tap written for a Cascading 2.0 pre-release version that I would love to replace with the HBase tap/scheme in maple. One issue I'm running into is that the code assumes row keys and values are strings. I'm using bytes as the key and Thrift structures serialized to bytes for the values.

Is there any interest in making the maple HBaseScheme more flexible in this regard? It looks like the scheme's source code just puts the bytes in a tuple; maybe the sink code could do the same?
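
One possible direction, sketched on the assumption that the current sink stringifies every value before converting to bytes: pass binary values through untouched and only fall back to string conversion otherwise. The helper below is hypothetical, not code from this repo.

    import java.util.Arrays;

    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical helper for the sink path: only fall back to string
    // conversion when the value is not already binary.
    public final class CellBytes {
        private CellBytes() {}

        public static byte[] toBytes(Object value) {
            if (value == null) {
                return new byte[0]; // tolerate missing values
            }
            if (value instanceof ImmutableBytesWritable) {
                ImmutableBytesWritable w = (ImmutableBytesWritable) value;
                return Arrays.copyOfRange(w.get(), w.getOffset(), w.getOffset() + w.getLength());
            }
            if (value instanceof byte[]) {
                return (byte[]) value; // e.g. Thrift structures serialized to bytes
            }
            return Bytes.toBytes(value.toString()); // current string-only behavior
        }
    }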

HBase Tap reports misleading error 'table is missing'

When there is a configuration issue or ZooKeeper isn't running, the error message reported is that the table does not exist. In the case below, the table does exist; the tap just doesn't know that because it hasn't connected to ZooKeeper.

2012-07-25 17:40:20,588 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: TABLENAME does not exist !
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.initialize(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.prepare(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at cascading.flow.stream.SinkStage.prepare(SinkStage.java:60)
at cascading.flow.stream.StreamGraph.prepare(StreamGraph.java:165)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:107)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
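
A rough sketch of a check that would separate "table missing" from "cannot reach ZooKeeper"; the HBaseAdmin calls are standard HBase client API, but where such a check would live inside HBaseTap.sinkConfInit is an assumption.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ZooKeeperConnectionException;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Sketch: distinguish a connection failure from a genuinely missing table.
    public class TableCheck {
        public static void requireTable(Configuration conf, String tableName) throws IOException {
            HBaseAdmin admin;
            try {
                admin = new HBaseAdmin(conf); // fails fast if ZK is unreachable
            } catch (ZooKeeperConnectionException e) {
                throw new IOException("cannot connect to ZooKeeper; check hbase-site.xml / quorum settings", e);
            }
            if (!admin.tableExists(tableName)) {
                throw new IOException("table " + tableName + " does not exist!");
            }
        }
    }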

Support DataDrivenDBInputFormat

This would support splits for databases that don't allow limit and offset.

Similar implementations exist in Sqoop (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java) and Hadoop (https://github.com/apache/hadoop-mapreduce/blob/trunk/src/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java). However, due to implementation differences, some translation and porting would have to occur.

I can do this work but I wanted to see what others think. Do you guys have any recommendations for this?
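
For context, the core idea shared by both linked implementations is to run a bounding query over an indexed split column and cut the resulting [min, max] range into splits, instead of paging with LIMIT/OFFSET. A minimal sketch of that idea, with the table, column, and connection details as placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch of DataDrivenDBInputFormat's bounding-query idea: split on an
    // indexed column's [min, max] range rather than paging with LIMIT/OFFSET.
    public class BoundingQuery {
        public static long[] bounds(Connection conn, String table, String splitCol) throws SQLException {
            String sql = "SELECT MIN(" + splitCol + "), MAX(" + splitCol + ") FROM " + table;
            try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                rs.next();
                return new long[] { rs.getLong(1), rs.getLong(2) };
            }
        }

        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db"); // placeholder URL
            long[] b = bounds(conn, "my_table", "id");
            long numSplits = 4;
            long chunk = Math.max(1, (b[1] - b[0] + 1) / numSplits);
            for (long lo = b[0]; lo <= b[1]; lo += chunk) {
                long hi = Math.min(lo + chunk - 1, b[1]);
                // each split becomes: WHERE id >= lo AND id <= hi
                System.out.println("split: id >= " + lo + " AND id <= " + hi);
            }
        }
    }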

Missing cell value causes a NullPointerException

Here's a stack trace.

cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.io.ImmutableBytesWritable.<init>(ImmutableBytesWritable.java:60)
at com.twitter.maple.hbase.HBaseScheme.source(Unknown Source)
at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
... 6 more

If my table in HBase has a schema that looks like the following:
table_name: test
column_family: cf

Say that my HBaseScheme is expecting the value fields foo and bar, and the test table has the following rows.

1, cf:foo="hello", cf:bar="world"
2, cf:foo="bye"

Row 2 will cause the exception described above.

I'd expect an empty byte array to be returned for row 2's cf:bar column.
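
A minimal sketch of the guard being asked for, assuming the tuple is built roughly as below; the actual internals of HBaseScheme.source may differ:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    import cascading.tuple.Tuple;

    // Sketch: substitute an empty byte array when a cell is absent, instead of
    // passing null into ImmutableBytesWritable (which throws the NPE above).
    public class SafeSource {
        static final byte[] EMPTY = new byte[0];

        public static Tuple toTuple(Result row, String family, String[] qualifiers) {
            Tuple tuple = new Tuple();
            for (String qualifier : qualifiers) {
                byte[] value = row.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier));
                tuple.add(new ImmutableBytesWritable(value == null ? EMPTY : value));
            }
            return tuple;
        }
    }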

Use of OFFSET very inefficient with large Postgres DB

When using maple to import a 40GB+ Postgres database, I noticed that queries became too slow and the complete Hadoop job failed because of the use of OFFSET.

After changing this line to this:

            // HARDCODING PRIMARY KEY.....
            query.append(" WHERE id >= ").append(split.getStart());
            query.append(" LIMIT ").append(split.getLength());

With this change the query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your index might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.
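
A more generic variant of the same idea is to make the split column configurable and bound each split with a range predicate; a minimal sketch, with all names as placeholders rather than maple's API:

    // Sketch: range predicate on a configurable, indexed split column instead
    // of LIMIT/OFFSET. The names here are placeholders.
    public class RangeQuery {
        public static String build(String baseSelect, String splitCol, long start, long end) {
            return baseSelect
                + " WHERE " + splitCol + " >= " + start
                + " AND " + splitCol + " < " + end
                + " ORDER BY " + splitCol;
        }

        public static void main(String[] args) {
            System.out.println(build("SELECT id, payload FROM big_table", "id", 0L, 100000L));
        }
    }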

Has this library been used on large Postgres DBs before? I would like to gain some insight into best practices. Even with the above optimization my import time is around 3 hours.

Thanks for your work on maple.

Cheers,
Jeroen

Testing with twitter scalding JobTest

Hi everyone,
Could you kindly provide some detail on how to use maple's HBase support when testing jobs with the JobTest class included in Twitter Scalding?
Thank you in advance.

JDBCTap fails when trying to write byte arrays

Hi,

I recently ran into an issue when trying to write byte arrays using JDBCTap (I'm using the bytea type in PostgreSQL). The issue is almost identical to the one resolved by this pull request, but concerns writing rather than reading objects from Postgres. Basically it boils down to the fact that cascading.tuple.Tuple's get method does a cast to Comparable, which of course breaks types that don't implement said interface.

I have a patch for this issue, which solves it for me without breaking any of the existing code.

Cheers.
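
For reference, the Comparable cast described above can be sidestepped with Tuple.getObject, which returns the element without a cast; a minimal illustration (the surrounding sink loop is assumed, not copied from this repo):

    import cascading.tuple.Tuple;

    // Sketch: Tuple.get(i) casts the element to Comparable, which breaks byte[];
    // Tuple.getObject(i) returns it as a plain Object instead.
    public class SinkValue {
        public static Object valueAt(Tuple tuple, int i) {
            return tuple.getObject(i); // safe for byte[] bound for a bytea column
        }
    }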

JDBCTap fails on Oracle with "ORA-00911: invalid character"

When using the JDBCTap with an Oracle database (using Oracle's ojdbc6.jar driver) the flow fails with an IOException:

Caused by: java.io.IOException: unable to execute insert batch [msglength: 29][totstmts: 1000][crntstmts: 1000][batch: 1000] ORA-00911: invalid character

at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.createThrowMessage(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.JDBCTapCollector.collect(Unknown Source)
at com.twitter.maple.jdbc.JDBCScheme.sink(Unknown Source)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)

This is fixed by removing the query.append(";"); on line 276 of DBOutputFormat.java and removing the semicolon from the query.append(");") on line 231. Apparently the Oracle JDBC driver doesn't accept a semicolon at the end of the SQL statement.
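
In other words, the statement handed to the driver must end without a terminator. A small sketch of the corrected construction described above, with the table and column names as placeholders:

    // Sketch: build the parameterized INSERT without a trailing semicolon;
    // Oracle's JDBC driver rejects ";" inside a statement (ORA-00911).
    public class InsertSql {
        public static String insert(String table, String[] columns) {
            StringBuilder query = new StringBuilder("INSERT INTO ").append(table).append(" (");
            for (int i = 0; i < columns.length; i++) {
                query.append(i > 0 ? ", " : "").append(columns[i]);
            }
            query.append(") VALUES (");
            for (int i = 0; i < columns.length; i++) {
                query.append(i > 0 ? ", " : "").append("?");
            }
            query.append(")"); // previously query.append(");"), with the semicolon removed
            return query.toString();
        }
    }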
