lasthbase's Introduction

== lasthbase ==

A Java library that Last.fm uses with HBase. Currently it contains only table
input and output formats for using Dumbo with HBase.
 - HBase: http://wiki.apache.org/hadoop/Hbase
 - Dumbo: http://klbostee.github.com/dumbo/

=== Using dumbo over HBase ===

These formats assume that everything is stored in hbase as byte strings.
Note also that the tables and families you read from and write to
must already exist in hbase.

The input format passes key/value pairs to your mapper as:
(row, {family: {qualifier1: value, qualifier2 : value2}, family2: {qualifier3: value3} })
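
As an illustration, one such record can be written out as a plain Python structure. This is only a sketch that runs without hadoop or hbase; the row, family, and qualifier names are made up, and `cells` is a hypothetical helper, not part of lasthbase or dumbo.

```python
# Hypothetical sample record, shaped like what the input format hands to a mapper.
row = "row1"
columns = {
    "family1": {"qualifier1": "value1", "qualifier2": "value2"},
    "family2": {"qualifier3": "value3"},
}


def cells(row, columns):
    """Flatten one record into sorted (row, family, qualifier, value) tuples."""
    out = []
    for family in sorted(columns):
        for qualifier in sorted(columns[family]):
            out.append((row, family, qualifier, columns[family][qualifier]))
    return out


for cell in cells(row, columns):
    print(cell)
```

A single cell is then just `columns["family1"]["qualifier2"]`.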

The output format expects records in the same format. The row, families,
qualifiers, and values must all be strings (for now).
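
Because everything must be a string, it can help to check a record before yielding it. The following is a minimal sketch of such a check; `check_record` is a hypothetical helper, not something lasthbase or dumbo provides.

```python
def check_record(row, columns):
    """Raise TypeError unless row, families, qualifiers and values are all strings."""

    def need_str(what, obj):
        if not isinstance(obj, str):
            raise TypeError("%s must be a string, got %r" % (what, obj))

    need_str("row", row)
    for family, qualifiers in columns.items():
        need_str("family", family)
        for qualifier, value in qualifiers.items():
            need_str("qualifier", qualifier)
            need_str("value", value)
    return row, columns


# Numbers have to be converted explicitly, for example with str().
print(check_record("row1", {"family1": {"count": str(42)}}))
```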

To use:
1. You must already have hadoop and hbase set up so that you can run java
   mapreduce jobs over hbase (hbase jar in the hadoop lib folder, etc).
2. Build lasthbase.jar with ant. This project is not using a release version
   of hbase yet, so replace the bundled jars with the hbase.jar you are using
   before compiling.
3. Write your dumbo job.

e.g. using the input format:

# test_in.py
import dumbo

def mapper(key, columns):
    for family in columns:
        for qualifier, value in columns[family].iteritems():
            yield key, (family, qualifier, value)

def runner(job):
    job.additer(mapper)

if __name__ == "__main__":
    dumbo.main(runner)

e.g. using the output format:

# test_out.py
import dumbo

def mapper(key, cells):
    # cells is an iterable of (family, qualifier, value) triples;
    # group them into the {family: {qualifier: value}} shape.
    columns = {}
    for family, qualifier, value in cells:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns

def runner(job):
    job.additer(mapper)

if __name__ == "__main__":
    dumbo.main(runner)
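
Before submitting the job, the output mapper can be exercised locally on a few hand-written records. The sketch below assumes each input value is an iterable of (family, qualifier, value) triples, which is what the mapper above iterates over; the sample data is made up and dumbo is not needed to run it.

```python
def mapper(key, triples):
    # Same grouping logic as the test_out.py mapper.
    columns = {}
    for family, qualifier, value in triples:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns


# Hypothetical input: one row with three cells.
sample = [
    ("row1", [("family1", "qualifier1", "a"),
              ("family1", "qualifier2", "b"),
              ("family2", "qualifier3", "c")]),
]

results = [out for key, value in sample for out in mapper(key, value)]
print(results)
```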

4. Start your dumbo job over hbase:

$ dumbo test_in.py -hadoop <hadoopdir> -libjar lasthbase.jar \
-inputformat fm.last.hbase.mapred.TypedBytesTableInputFormat \
-hadoopconf hbase.mapred.tablecolumns="family1:qualifier1 family1:qualifier2 family2:qualifier3" \
-input input_table -output output_dir
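
The hbase.mapred.tablecolumns value above is a space-separated list of family:qualifier pairs. If the columns are kept in a Python structure, the string could be built along these lines; `table_columns` is a hypothetical helper, shown only as a sketch.

```python
def table_columns(spec):
    """Join {family: [qualifier, ...]} into a space-separated family:qualifier string."""
    return " ".join(
        "%s:%s" % (family, qualifier)
        for family in sorted(spec)
        for qualifier in spec[family]
    )


print(table_columns({
    "family1": ["qualifier1", "qualifier2"],
    "family2": ["qualifier3"],
}))
```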

$ dumbo test_out.py -hadoop <hadoopdir> -libjar lasthbase.jar \
-outputformat fm.last.hbase.mapred.TypedBytesTableOutputFormat \
-input test_data \
-jobconf hbase.mapred.outputtable=output_table \
-output ignoresthisbutyouneeditsorry

lasthbase's People

Contributors

luk, tims

lasthbase's Issues

TypedBytesTableOutputFormat broken

Commit 9c287d7 broke TypedBytesTableOutputFormat completely.
Before that commit, it was not possible to write non-UTF-8 bytes to hbase.

The issue was discussed on the dumbo-user mailing list.

I pointed out that this could be resolved by changing the mapping of hadoop streaming types to python types in the typedbytes python module to:

  • read typedbytes bytes to regular python strings
  • read typedbytes strings to python unicode strings
  • write regular python strings to typedbytes bytes
  • write unicode python strings to typedbytes strings

Klaas pointed out that this would degrade performance for dumbo client code that deals with text input: hadoop streaming emits text input as typedbytes strings, so a lot of UTF-8 to python unicode conversion overhead would be paid.

He further pointed out that the issue could instead be resolved by changing the mapping in typedbytes to:

  • read typedbytes bytes to regular python strings
  • read typedbytes strings to regular python strings
  • write regular python strings to typedbytes bytes
  • write unicode python strings to typedbytes strings

which would be less intuitive.
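
To make the "write" half of these mappings concrete, here is a rough sketch of how a byte string and a unicode string would be serialized differently. It is not the typedbytes module's code. It assumes the typedbytes layout as I understand it (a one-byte type code, 0 for bytes and 7 for string, followed by a four-byte big-endian length and the payload), and it is written for Python 3, where `bytes` and `str` stand in for Python 2's regular and unicode strings.

```python
import struct

TYPE_BYTES = 0   # typedbytes type code for a raw byte sequence (assumed)
TYPE_STRING = 7  # typedbytes type code for a UTF-8 string (assumed)


def encode(obj):
    """Encode bytes or text as: type code, 4-byte big-endian length, payload."""
    if isinstance(obj, bytes):
        code, payload = TYPE_BYTES, obj
    elif isinstance(obj, str):
        code, payload = TYPE_STRING, obj.encode("utf-8")
    else:
        raise TypeError("unsupported type: %r" % type(obj))
    return struct.pack(">bi", code, len(payload)) + payload


print(repr(encode(b"\xff\xfe")))     # arbitrary bytes, not valid UTF-8
print(repr(encode(u"caf\u00e9")))    # text, encoded as UTF-8
```

With this split, non-UTF-8 data only ever travels as typedbytes bytes and is never forced through a UTF-8 decode.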

TypedBytesTableOutputFormat alters row key

TypedBytesTableOutputFormat uses TypedBytesWritable.getBytes() to retrieve the row key. However, this method returns the whole backing byte array, in which only the bytes at indices 0 to getLength() - 1 are valid. It also returns the bytes added by the typedbytes protocol.

As a result, TypedBytesTableOutputFormat stores the bytes \x03\x00\x00\x00\x01\x00\x00 for a yielded key int(1).
I would expect \x01 to be used as the row key in this case.
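
The effect can be reproduced in a small Python simulation. This is not the Java code in question; it only models a backing buffer that is longer than its valid length, assuming that type code 3 marks a four-byte big-endian int (consistent with the bytes reported above). The two padding bytes are chosen to match the report.

```python
import struct

# typedbytes encoding of int 1: type code 3 followed by a 4-byte big-endian value.
encoded = struct.pack(">bi", 3, 1)

# Simulated backing buffer with two unused trailing bytes of spare capacity.
buffer = encoded + b"\x00\x00"
length = len(encoded)            # stands in for what getLength() would report

whole = buffer                   # everything in the array, as stored today
valid = buffer[:length]          # only the valid bytes
payload = valid[1:]              # valid bytes without the type code
value = struct.unpack(">i", payload)[0]

print(repr(whole))
print(repr(valid))
print(repr(payload), value)
```

Even after trimming to the valid length and dropping the type code, the payload is the four-byte form of the integer, so how the decoded value should be serialized as a row key remains a separate decision.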
