
binlog2hive's Introduction

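Project Background

Synchronize RDS data to HDFS in real time and map it to Hive.

How It Works

The program parses the RDS binlog to synchronize incremental RDS data to HDFS, then maps and loads that data into Hive external partitioned tables.

Because the second column of every RDS table is a datetime column, that column is used as the Hive partition field.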
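
As a rough illustration of how a partition could be derived from the second column, the sketch below builds a path of the form /DATA/PUBLIC/<table>/day=<date>. The class and method names, the sample table name t_order, the yyyy-MM-dd format of the day value, and the assumption that the datetime arrives as a java.util.Date are all hypothetical; the project's actual code may differ.

```java
import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPathSketch {
    // Root directory, taken from the /DATA/PUBLIC/ prefix used in this README.
    static final String ROOT = "/DATA/PUBLIC/";

    // Builds the partition directory for one row.
    // Assumption: row[1] (the second column) holds a java.util.Date.
    static String partitionPath(String table, Serializable[] row, TimeZone zone) {
        SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy-MM-dd"); // assumed day format
        dayFormat.setTimeZone(zone);
        return ROOT + table + "/day=" + dayFormat.format((Date) row[1]);
    }

    public static void main(String[] args) {
        // Hypothetical row: id, datetime (epoch 0), payload.
        Serializable[] row = { 1L, new Date(0L), "demo" };
        System.out.println(partitionPath("t_order", row, TimeZone.getTimeZone("UTC")));
        // prints /DATA/PUBLIC/t_order/day=1970-01-01
    }
}
```

Passing the time zone explicitly keeps the day boundary predictable; which zone the real tables use is not stated here.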

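Configuration Files

  • doc/creat table.sql: CREATE TABLE statements for the Hive tables. Apart from the static tables, all of them are external tables partitioned by day.
  • binglog2Hive_conf.properties: lists every table that needs to be synchronized to HDFS.
  • mysql.properties: MySQL Druid connection pool configuration.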
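
The layout of binglog2Hive_conf.properties is not described here. Purely as an assumption for illustration, the sketch below treats every key in the file as the name of a table to synchronize and shows how such a list might be loaded and used as a filter. The class name, method name, and sample table names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

public class SyncTableConfigSketch {
    // Reads a properties source and returns its keys as the set of tables to sync.
    // Assumption: one table name per key; the real file format may differ.
    static Set<String> loadTables(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return new TreeSet<>(props.stringPropertyNames());
    }

    public static void main(String[] args) throws IOException {
        // Inline stand-in for the file's contents (hypothetical table names).
        Set<String> tables = loadTables(new StringReader("t_order=\nt_user=\n"));
        System.out.println(tables);                   // prints [t_order, t_user]
        System.out.println(tables.contains("t_log")); // prints false
    }
}
```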

Program Description

binlog parsing framework: https://github.com/shyiko/mysql-binlog-connector-java

The core class is BinlogClient.

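  1. The program mainly deserializes the following events:
  • TABLE_MAP: contains the table name and database name
  • WRITE_ROWS: contains the incremental business records
  • On startup, the program first reads the last synchronization state from the t_position table and uses it to determine the binlog position to start reading from.
  • BinaryLogClient first deserializes the TABLE_MAP event, uses the binlog2Hive_conf.properties configuration to filter for the tables that need to be synchronized, and then deserializes the WRITE_ROWS event.
  • While parsing WRITE_ROWS, entries of the form </DATA/PUBLIC/table name, record> are stored in ConcurrentHashMap<String, List<Serializable[]>> mapA.
  • Once the number of parsed records exceeds the threshold option.countInterval, the records are written to HDFS files in one batch.
  • When writing to HDFS, the program iterates over mapA, groups the records by table name into entries of the form </DATA/PUBLIC/table name/day=xxx, record>, and stores them in ConcurrentHashMap<String, ArrayList<Serializable[]>> mapB. It then iterates over mapB and writes the data to HDFS; the key of mapB determines which file each record is written to.
  • File operations live in FSUtils. There are three cases when writing a file:
    1. If the directory does not exist, create the file and map the Hive table partition to this path.
    2. If the file already exists and is smaller than 250 MB, append to it.
    3. If the file exceeds 250 MB, start writing a new file, using HDFS_BLOCK_SIZE as the limit.
  • After a file is written successfully, the current synchronization state (binlogfilename, nextposition) is updated in the t_position table.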
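
The buffering, regrouping, and file-selection steps above can be sketched roughly as follows. This is a minimal single-threaded illustration using only the JDK, not the project's BinlogClient or FSUtils code: the class name, method names, the WriteAction enum, and the partitioner hook are all hypothetical, and no HDFS or binlog library calls are made. It also assumes a file of exactly 250 MB is treated like a larger one, which the description leaves open.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

public class FlushSketch {
    // 250 MB roll-over limit mentioned in the list above.
    static final long MAX_FILE_BYTES = 250L * 1024 * 1024;

    // The three write cases (hypothetical names).
    enum WriteAction { CREATE_AND_MAP_PARTITION, APPEND, START_NEW_FILE }

    // mapA: table path -> buffered rows, filled while WRITE_ROWS events are parsed.
    final Map<String, List<Serializable[]>> mapA = new ConcurrentHashMap<>();
    final int countInterval; // stands in for option.countInterval
    int buffered = 0;

    FlushSketch(int countInterval) {
        this.countInterval = countInterval;
    }

    // Buffers one row; returns true once the count exceeds the threshold and a flush is due.
    boolean add(String tablePath, Serializable[] row) {
        mapA.computeIfAbsent(tablePath, k -> new ArrayList<>()).add(row);
        return ++buffered > countInterval;
    }

    // Regroups mapA into mapB, keyed by partition path, then clears the buffer.
    // The partitioner is a hypothetical hook returning e.g. "<tablePath>/day=xxx".
    Map<String, ArrayList<Serializable[]>> regroup(
            BiFunction<String, Serializable[], String> partitioner) {
        Map<String, ArrayList<Serializable[]>> mapB = new ConcurrentHashMap<>();
        for (Map.Entry<String, List<Serializable[]>> entry : mapA.entrySet()) {
            for (Serializable[] row : entry.getValue()) {
                String key = partitioner.apply(entry.getKey(), row);
                mapB.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
        }
        mapA.clear();
        buffered = 0;
        return mapB;
    }

    // Picks one of the three write cases from the target's existence and current size.
    static WriteAction chooseAction(boolean exists, long sizeBytes) {
        if (!exists) {
            return WriteAction.CREATE_AND_MAP_PARTITION;
        }
        return sizeBytes < MAX_FILE_BYTES ? WriteAction.APPEND : WriteAction.START_NEW_FILE;
    }

    public static void main(String[] args) {
        FlushSketch sketch = new FlushSketch(2);
        String table = "/DATA/PUBLIC/t_order"; // hypothetical table name
        sketch.add(table, new Serializable[] { 1L, "2020-01-01" });
        sketch.add(table, new Serializable[] { 2L, "2020-01-02" });
        boolean flushDue = sketch.add(table, new Serializable[] { 3L, "2020-01-01" });
        // Toy partitioner: uses the second column's string value as the day.
        Map<String, ArrayList<Serializable[]>> mapB =
                sketch.regroup((path, row) -> path + "/day=" + row[1]);
        System.out.println(flushDue + " " + mapB.size()); // prints true 2
        System.out.println(chooseAction(true, 10L));      // prints APPEND
    }
}
```

In a real pipeline the caller would write each mapB entry to the file chosen for its key and only then record the binlog file name and next position, so that a failed write is retried from the last saved position.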

Sensitive business information has been removed from the project.

binlog2hive's People

Contributors

mobin-f
