deanjain / bloom

This project forked from americanexpress/bloom

BLooM is a configuration-driven big data framework for loading massive data into MemSQL

License: Apache License 2.0

Shell 6.40% Java 93.60%

BLooM - Bulk Loader of MemSQL

BLooM is a configuration-driven, big-data-based framework for loading massive data into a MemSQL database.

Features:

  • Blazing performance: 50 million records loaded into a MemSQL database in about 90 seconds, leveraging Apache Spark.

  • Loads data into all three types of MemSQL tables: Columnstore, Rowstore, and Reference tables.

  • Supports Control-A and comma-delimited files as input.

  • Also supports loading data directly from a Hive table into MemSQL.

  • Supports processing multiple files in a directory.

  • The framework can run in three modes:

    • Full load (full refresh): Does not check whether records already exist in MemSQL; it loads all the data directly and overwrites any old data. If the data file has values for only a few of the table's columns, the remaining columns are loaded as null.
    • Delta load (upsert/update):
      • If the incoming record is newer than the existing one, based on the last updated timestamp, it performs only an UPDATE on the DB.
      • If the incoming record is older, the update is ignored.
      • The input file can have data for all the columns or for only a few.
      • If the input has data for only a few columns and the record is not already present, it inserts the incoming columns and null for the remaining columns.
      • If the incoming record is older and mandatory columns are configured, it does an UPSERT of those mandatory columns into the DB and keeps the other columns intact.
    • Load Append: Does not check whether records already exist in MemSQL and simply appends the incoming data to the existing data. This mode is supported only for Columnstore tables.
  • Accepts a YAML config describing the input file or Hive table from which data is to be loaded into MemSQL.

  • Whether the run is a full load or a delta load can be given as input on the command line.

  • The MemSQL table must have a lastModifiedTimeStamp column, because delta load uses this column to determine which record is the latest.

  • Full load does not check whether records already exist in MemSQL and loads all the data directly, so please ensure the table is truncated before running a full load.

  • For a full load, the input file can have data for all the columns or for only a few. If the input has data for only a few columns, the remaining columns are set to null.

  • Delta load checks whether the record already exists in MemSQL. It performs an upsert only if the incoming record is newer than the one in the DB (based on the last updated timestamp); if the incoming record is older, the update is ignored.

  • For a delta load, the input file can have data for all the columns or for only a few. If the input has data for only a few columns and the record is not already present, it inserts the incoming columns and null for the remaining columns. If the record already exists but is older, it updates the incoming columns and keeps the other columns intact.
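
The full load and delta load rules above can be sketched per record as follows. This is a minimal illustration in plain Python, not BLooM's actual Java/Spark code: the function names are hypothetical, rows are modelled as dictionaries, and treating an equal timestamp as "not newer" is an assumption, since the text only distinguishes newer from older records. The mandatory-columns case is not modelled.

```python
from datetime import datetime
from typing import Optional

# Column name taken from the feature list above; everything else is illustrative.
TS_COL = "lastModifiedTimeStamp"


def full_load_row(incoming: dict, table_columns: list) -> dict:
    """Full load: keep the incoming values and fill absent columns with None (null)."""
    return {col: incoming.get(col) for col in table_columns}


def delta_merge(existing: Optional[dict], incoming: dict, table_columns: list) -> Optional[dict]:
    """Return the row to write, or None when the incoming record is ignored."""
    if existing is None:
        # Record not present yet: insert incoming columns, null for the rest.
        return full_load_row(incoming, table_columns)
    if incoming[TS_COL] <= existing[TS_COL]:
        # Incoming record is older (or, by assumption, equally old): ignore it.
        return None
    # Incoming record is newer: update the supplied columns, keep the others intact.
    merged = dict(existing)
    merged.update({c: v for c, v in incoming.items() if c in table_columns})
    return merged


if __name__ == "__main__":
    cols = ["id", "name", "city", TS_COL]
    stored = {"id": 1, "name": "a", "city": "X", TS_COL: datetime(2020, 1, 1)}
    newer = {"id": 1, "name": "b", TS_COL: datetime(2020, 1, 2)}
    print(delta_merge(stored, newer, cols))
```

In this sketch a partial, newer record changes only `name` and the timestamp, while `city` keeps its stored value.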

Prerequisites:

  • If the input is a CSV file, it must have proper column headers. The order of the header names need not match the column order of the MemSQL table. The CSV can contain data for a few columns or for all of them.
  • The MemSQL table into which the data is loaded must have a last updated timestamp column.
  • The utility supports all MemSQL data types. When configuring data types in MemSQL, check whether the incoming data can exceed the size of the configured data type; if it does, MemSQL will downcast it to the closest value.
  • The input CSV file should have exactly one record per primary key (or combination of primary key columns).
  • For Columnstore tables, a staging table is mandatory; it is used during the upsert. The incoming data is first loaded into the staging table and then compared with the existing data to derive the net insert data, which is finally written back to the original table.
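
The CSV prerequisites above can be checked before a load with a few lines of code. The following is a hedged sketch in plain Python rather than part of BLooM: `check_csv` and its parameters are hypothetical names, and requiring the key columns to appear in the header is an assumption that the text does not state explicitly.

```python
import csv
import io
from collections import Counter


def check_csv(text: str, table_columns: list, primary_key: list) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = []

    # Header order is irrelevant, but every header must name a table column.
    unknown = [c for c in header if c not in table_columns]
    if unknown:
        problems.append(f"unknown columns: {unknown}")

    # Assumption: the key columns must be present to test uniqueness at all.
    missing_pk = [c for c in primary_key if c not in header]
    if missing_pk:
        problems.append(f"missing key columns: {missing_pk}")
        return problems

    # Each primary key value (or combination of values) may appear only once.
    counts = Counter(tuple(row[c] for c in primary_key) for row in reader)
    dupes = sorted(k for k, n in counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate keys: {dupes}")
    return problems


if __name__ == "__main__":
    columns = ["id", "name", "lastModifiedTimeStamp"]
    print(check_csv("name,id\na,1\nb,2\n", columns, ["id"]))
    print(check_csv("id,nick\n1,a\n1,b\n", columns, ["id"]))
```

The first call uses a reordered, partial header and passes; the second reports an unknown column and a repeated key.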

Performance Stats:

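| Scenario                 | Table Type | Records Processed | Run Time              | Executor Count | Mode of Processing | Executor Memory | Executor Cores |
|--------------------------|------------|-------------------|-----------------------|----------------|--------------------|-----------------|----------------|
| Fresh Load               | Rowstore   | 5 million         | 45 seconds            | 100            | FULL-REFRESH       | 8G              | 8              |
| Fresh Load               | Rowstore   | 5 million         | 57 seconds            | 10             | FULL-REFRESH       | 8G              | 8              |
| Fresh Load               | Rowstore   | 50 million        | 93 seconds            | 100            | FULL-REFRESH       | 8G              | 8              |
| 50 million data changed  | Rowstore   | 50 million        | 37 min and 37 seconds | 100            | UPSERT             | 22G             | 8              |
| 25 million data changed  | Rowstore   | 50 million        | 24 min and 39 seconds | 100            | UPSERT             | 8G              | 8              |
| 2.5 million data changed | Rowstore   | 5 million         | 179 seconds           | 100            | UPSERT             | 8G              | 8              |
| 2.5 million data changed | Rowstore   | 5 million         | 9 min and 56 seconds  | 10             | UPSERT             | 8G              | 8              |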
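
To compare the runs above it can help to convert each one to records per second. The sketch below is plain Python with hypothetical helper names; it only does arithmetic on the figures in the stats and adds no new measurements. For example, the 50 million record full refresh in 93 seconds works out to roughly 538,000 records per second, while the 50 million record upsert in 37 min and 37 seconds (2,257 seconds) is roughly 22,000 records per second.

```python
import re


def to_seconds(run_time: str) -> int:
    """Convert strings such as '93 seconds' or '37 min and 37 seconds' to seconds."""
    minutes = re.search(r"(\d+)\s*min", run_time)
    seconds = re.search(r"(\d+)\s*sec", run_time)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def records_per_second(records: int, run_time: str) -> float:
    """Throughput for one run: records processed divided by run time in seconds."""
    return records / to_seconds(run_time)


if __name__ == "__main__":
    # (records processed, run time) pairs copied from the stats above.
    runs = [
        (5_000_000, "45 seconds"),
        (50_000_000, "93 seconds"),
        (50_000_000, "37 min and 37 seconds"),
    ]
    for records, run_time in runs:
        print(run_time, round(records_per_second(records, run_time)))
```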

Contributing

We welcome Your interest in the American Express Open Source Community on Github. Any Contributor to any Open Source Project managed by the American Express Open Source Community must accept and sign an Agreement indicating agreement to the terms below. Except for the rights granted in this Agreement to American Express and to recipients of software distributed by American Express, You reserve all right, title, and interest, if any, in and to Your Contributions. Please fill out the Agreement.

License

Any contributions made under this project will be governed by the Apache License 2.0.

Code of Conduct

This project adheres to the American Express Community Guidelines. By participating, you are expected to honor these guidelines.
