The sparkhbaseexample-1 from 0xqq

HBase Snapshot to Spark Example

This project shows how to analyze an HBase Snapshot using Spark.

Why do this?
The main motivation for writing this code is to reduce the impact on the HBase Region Servers while analyzing HBase records. By creating a snapshot of the HBase table, we can run Spark jobs against the snapshot, eliminating the impact to region servers and reducing the risk to operational systems.

At a high-level, here's what the code is doing:

Reads an HBase Snapshot into a Spark
Parses the HBase KeyValue to a Spark Dataframe
Applies arbitrary data processing (timestamp and rowkey filtering)
Saves the results back to an HBase (HFiles / KeyValue) format within HDFS, using HFileOutputFormat.
- The output format maintains the original rowkey, timestamp, column family, qualifier, and value structure.
From here, you can bulkload the HDFS file into HBase.

Here's more detail on how to run this project:
1. Create an HBase table and populate it with data (or you can use an existing table). I've included two ways to simulate the HBase table within this repo (for testing purposes). Use the SimulateAndBulkLoadHBaseData.scala code (preferred method) or you can use write_to_hbase.py (this is very slow compared to the scala code).

2. Take an HBase Snapshot: snapshot 'hbase_simulated_1m', 'hbase_simulated_1m_ss'

3. (Optional) The HBase Snapshot will already be in HDFS (at /apps/hbase/data), but you can use this if you want to load the HBase Snapshot to an HDFS location of your choice:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot hbase_simulated_1m_ss -copy-to /tmp/ -mappers 2

4. Run the included Spark (scala) code against the HBase Snapshot. This code will read the HBase snapshot, filter records based on rowkey range (80001 to 90000) and based on a timestamp threshold (which is set in the props file), then write the results back to HDFS in HBase format (HFiles/KeyValue).

a.) Build project: mvn clean package

b.) Run Spark job:

spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props

c.) NOTE: Adjust the properties within the props file (if needed) to match your configuration.

Preliminary Performance Metrics:

Number of Records	Spark Runtime (without write to HDFS)	Spark Runtime (with write to HDFS)	HBase Shell Scan (without write to HDFS)
1,000,000	27.07 seconds	36.05 seconds	6.8600 seconds
50,000,000	417.38 seconds	764.801 seconds	7.5970 seconds
100,000,000	741.829 seconds	1413.001 seconds	8.1380 seconds

NOTE: Here is the HBase Shell scan that was used

scan 'hbase_simulated_100m', {STARTROW => "\x00\x01\x38\x81", ENDROW => "\x00\x01\x5F\x90", TIMERANGE => [1474571655001,9999999999999]}

. This scan will filtered a 100 Million record HBase table based on rowkey range of 80001-90000 (\x00\x01\x38\x81 - \x00\x01\x5F\x90) and also from an arbitrary timerange, specified in unix timestamp.

Sample output of HBase simulated data structure (using SimulateAndBulkLoadHBaseData.scala):

Sample output of HBase simulated data structure (using write_to_hbase.py):

Versions:
This code was tested using Hortonworks HDP HDP-2.5.0.0
HBase version 1.1.2
Spark version 1.6.2
Scala version 2.10.5

References:
HBaseConfiguration Class
HBase TableSnapshotInputFormat Class
HBase KeyValue Class
HBase Bytes Class
HBase CellUtil Class

0xqq / sparkhbaseexample-1 Goto Github PK

sparkhbaseexample-1's Introduction

HBase Snapshot to Spark Example

sparkhbaseexample-1's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent