Coder Social home page Coder Social logo

sparkhbaseexample-1's Introduction

HBase Snapshot to Spark Example

This project shows how to analyze an HBase Snapshot using Spark.

Why do this?
The main motivation for writing this code is to reduce the impact on the HBase Region Servers while analyzing HBase records. By creating a snapshot of the HBase table, we can run Spark jobs against the snapshot, eliminating the impact to region servers and reducing the risk to operational systems.

At a high-level, here's what the code is doing:

  1. Reads an HBase Snapshot into a Spark
  2. Parses the HBase KeyValue to a Spark Dataframe
  3. Applies arbitrary data processing (timestamp and rowkey filtering)
  4. Saves the results back to an HBase (HFiles / KeyValue) format within HDFS, using HFileOutputFormat.
    • The output format maintains the original rowkey, timestamp, column family, qualifier, and value structure.
  5. From here, you can bulkload the HDFS file into HBase.

Here's more detail on how to run this project:
1. Create an HBase table and populate it with data (or you can use an existing table). I've included two ways to simulate the HBase table within this repo (for testing purposes). Use the SimulateAndBulkLoadHBaseData.scala code (preferred method) or you can use write_to_hbase.py (this is very slow compared to the scala code).

2. Take an HBase Snapshot: snapshot 'hbase_simulated_1m', 'hbase_simulated_1m_ss'

3. (Optional) The HBase Snapshot will already be in HDFS (at /apps/hbase/data), but you can use this if you want to load the HBase Snapshot to an HDFS location of your choice:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot hbase_simulated_1m_ss -copy-to /tmp/ -mappers 2

4. Run the included Spark (scala) code against the HBase Snapshot. This code will read the HBase snapshot, filter records based on rowkey range (80001 to 90000) and based on a timestamp threshold (which is set in the props file), then write the results back to HDFS in HBase format (HFiles/KeyValue).

a.) Build project: mvn clean package

b.) Run Spark job: spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props

c.) NOTE: Adjust the properties within the props file (if needed) to match your configuration.

Preliminary Performance Metrics:
Number of Records Spark Runtime (without write to HDFS) Spark Runtime (with write to HDFS) HBase Shell Scan (without write to HDFS)
1,000,000 27.07 seconds 36.05 seconds 6.8600 seconds
50,000,000 417.38 seconds 764.801 seconds 7.5970 seconds
100,000,000 741.829 seconds 1413.001 seconds 8.1380 seconds

NOTE: Here is the HBase Shell scan that was used scan 'hbase_simulated_100m', {STARTROW => "\x00\x01\x38\x81", ENDROW => "\x00\x01\x5F\x90", TIMERANGE => [1474571655001,9999999999999]}. This scan will filtered a 100 Million record HBase table based on rowkey range of 80001-90000 (\x00\x01\x38\x81 - \x00\x01\x5F\x90) and also from an arbitrary timerange, specified in unix timestamp.

Sample output of HBase simulated data structure (using SimulateAndBulkLoadHBaseData.scala):



Sample output of HBase simulated data structure (using write_to_hbase.py):




Versions:
This code was tested using Hortonworks HDP HDP-2.5.0.0
HBase version 1.1.2
Spark version 1.6.2
Scala version 2.10.5


References:
HBaseConfiguration Class
HBase TableSnapshotInputFormat Class
HBase KeyValue Class
HBase Bytes Class
HBase CellUtil Class

sparkhbaseexample-1's People

Contributors

randerzander avatar zaratsian avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.