Coder Social home page Coder Social logo

hadoop-sync's Introduction

hadoop-sync

hadoop-sync is a Java application that synchronizes block metadata from the HDFS NameNode to the CitusDB master node. The application also creates a colocated foreign table on CitusDB worker nodes for each block in the Hadoop File System (HDFS). Optionally, the application also collects statistics about the underlying HDFS blocks.

hadoop-sync is idempotent. You can run it multiple times; it calculates metadata differences between the HDFS NameNode and CitusDB master node, and only syncs metadata for newly added or removed HDFS blocks. If no such blocks exist, the application does nothing.

hadoop-sync also leaves metadata on the CitusDB master node in a consistent state. When you run the application, it either completes by propagating enough metadata that makes recent HDFS blocks queriable; or it errors out by reverting recent metadata updates and leaves CitusDB in the queriable state it was in before hadoop-sync was run.

hadoop-sync assumes that you are already running a Hadoop cluster and you have your CitusDB databases colocated on this cluster. For step by step instructions on how to get started, please see our documentation page at http://citusdata.com/docs/sql-on-hadoop

Usage

Our documentation page explains in detail running the hadoop-sync application. The following section talks about a few details that relate to hadoop-sync's internals. Please note that these details may change over time as hadoop-sync is still in Beta form.

Now, assuming that you already have a Hadoop cluster with data loaded into it, CitusDB deployed on the Hadoop cluster, and a distributed foreign table created on the CitusDB master node, you synchronize metadata like the following:

java -jar target/hadoop-sync-0.1.jar table_name [--fetch-min-max]

The distributed foreign table name is a required argument, and hadoop-sync will only synchronize block metadata for that particular table. The fetch-min-max is an optional argument that triggers the collection of min/max statistics for each new HDFS block. Since this requires going over all records in the HDFS block, passing this argument notably increases synchronization times. On the other hand, with these statistics, CitusDB can perform optimizations such as partition or join pruning that dramatically reduce query execution times.

It is worth mentioning that these min/max statistics are only relevant in the context of distributed queries. For local queries, the worker nodes themselves already use PostgreSQL's statistics collector to optimize queries, if the foreign data wrapper supports data sampling.

It is also worth noting that hadoop-sync relies on a properties file to determine node names and port numbers that Hadoop and CitusDB clusters use. This sync.properties file is bundled with the JAR file; and you can change the default settings by using emacs or vi on the JAR bundle, or by copying the file from target/classes/sync.properties to the current directory and editing it.

hadoop-sync's People

Watchers

James Cloos avatar baeeq avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.