This project forked from greenplum-db/pxf-archive

License: Apache License 2.0

Introduction

PXF is an extensible framework that allows a distributed database like GPDB to query external data files whose metadata is not managed by the database. PXF includes built-in connectors for accessing data stored in HDFS files, Hive tables, HBase tables, and more. Users can also create their own connectors to other data stores or processing engines. To create these connectors as Java plugins, see the PXF API and Reference Guide on GPDB.

Package Contents

server/

Contains the server-side code of PXF, along with the PXF Service and all the plugins.

cli/

Contains the command line interface code for PXF.

automation/

Contains the automation and integration tests for PXF against the various data sources.

singlecluster/

Hadoop testing environment used to exercise the PXF automation tests.

concourse/

Resources for PXF's continuous integration pipelines.

PXF Development

Below are the steps to build and install PXF along with its dependencies, including GPDB and Hadoop.

To start, ensure you have a ~/workspace directory and have cloned pxf and its prerequisites (shown below) under it. (The name workspace is not strictly required, but it is used throughout this guide.)

mkdir -p ~/workspace
cd ~/workspace

git clone https://github.com/greenplum-db/pxf.git

Alternatively, you may create a symlink to your existing repo folder.

ln -s ~/<git_repos_root> ~/workspace

Install Dependencies

To build PXF, you must have:

  • JDK 1.8+
  • Go (1.9 or later)

To install Go on CentOS, run sudo yum install go.

For other platforms, see the Go downloads page.

Once you have installed Go, you will need the dep and ginkgo tools, which install Go dependencies and run Go tests, respectively. Assuming go is on your PATH, you can run:

go get github.com/golang/dep/cmd/dep
go get github.com/onsi/ginkgo/ginkgo

to install them.
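
Before building, it can help to confirm that the tools listed above are actually reachable. The following is a minimal sketch, not part of the PXF repo: missing_tools is a hypothetical helper name, and it only checks that each command is on your PATH, not that its version is recent enough.

```shell
# Print each named tool that is NOT found on PATH, one per line.
# Returns 0 when every tool is present, 1 otherwise.
# (missing_tools is a hypothetical helper, not shipped with PXF.)
missing_tools() {
  mt_status=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      printf '%s\n' "$mt_tool"
      mt_status=1
    fi
  done
  return "$mt_status"
}
```

For example, missing_tools java go dep ginkgo prints nothing when all four prerequisites are installed.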

How to Build

PXF builds with Gradle and provides a wrapper Makefile for convenience.

cd ~/workspace/pxf

# Compile and test PXF
make

# Run only the unit tests
make unittest

Demonstrating Hadoop Integration

To demonstrate end-to-end functionality, you will need GPDB and Hadoop installed.

Hadoop

All the related Hadoop components (HDFS, Hive, HBase, ZooKeeper, etc.) are bundled into a single artifact named singlecluster. Download the singlecluster-HDP.tar.gz file, which contains everything needed to run Hadoop, then move it into your workspace and untar it:

mv singlecluster-HDP.tar.gz ~/workspace/
cd ~/workspace
tar xzf singlecluster-HDP.tar.gz

GPDB

git clone https://github.com/greenplum-db/gpdb.git

You'll end up with a directory structure like this:

~
└── workspace
    ├── pxf
    ├── singlecluster-HDP
    └── gpdb
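
A quick way to confirm the layout above is to check for the three directories. This is a minimal sketch under the assumption that your workspace root is ~/workspace (or a symlink to it); missing_workspace_dirs is a hypothetical helper, not something the repo provides.

```shell
# Print each expected workspace subdirectory that is missing under the
# given root (default: ~/workspace). Returns 1 if anything is missing.
# The directory names come from the tree shown above.
missing_workspace_dirs() {
  ws_root=${1:-"$HOME/workspace"}
  ws_status=0
  for ws_dir in pxf singlecluster-HDP gpdb; do
    # -d follows symlinks, so a symlinked workspace also passes
    if [ ! -d "$ws_root/$ws_dir" ]; then
      printf '%s\n' "$ws_dir"
      ws_status=1
    fi
  done
  return "$ws_status"
}
```

Running missing_workspace_dirs with no arguments checks ~/workspace; pass another root as the first argument if you chose a different name.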

If you already have GPDB installed and running per the instructions in the GPDB README, you can skip the Setup GPDB section below and simply follow the steps in Setup Hadoop and Setup PXF.

If you don't wish to use Docker, make sure you install a JDK manually.

Development With Docker

NOTE: Since the Docker container will house the singlecluster Hadoop, Greenplum, and PXF, we recommend allocating at least 4 CPUs and 6 GB of memory to Docker. These settings are available under Docker preferences.

The following commands run the Docker container, set up the gpadmin user, and switch to that user.

# Get the latest image
docker pull pivotaldata/gpdb-pxf-dev:centos6

# If you want to use gdb to debug gpdb you need the --privileged flag in the command below
docker run --rm -it \
  -p 5432:5432 \
  -p 5888:5888 \
  -p 8000:8000 \
  -p 5005:5005 \
  -p 8020:8020 \
  -p 9000:9000 \
  -p 9090:9090 \
  -p 50070:50070 \
  -w /home/gpadmin/workspace \
  -v ~/workspace/gpdb:/home/gpadmin/workspace/gpdb \
  -v ~/workspace/pxf:/home/gpadmin/workspace/pxf \
  -v ~/workspace/singlecluster-HDP:/home/gpadmin/workspace/singlecluster \
  pivotaldata/gpdb-pxf-dev:centos6 /bin/bash -c \
  "/home/gpadmin/workspace/pxf/dev/set_up_gpadmin_user.bash && /sbin/service sshd start && su - gpadmin"

Setup GPDB

Configure, build, and install GPDB. This is needed only the first time you use the container with the GPDB source.

~/workspace/pxf/dev/build_gpdb.bash
~/workspace/pxf/dev/install_gpdb.bash

For subsequent minor changes to GPDB source you can simply do the following:

~/workspace/pxf/dev/install_gpdb.bash

Create Greenplum Cluster

source /usr/local/greenplum-db-devel/greenplum_path.sh
make -C ~/workspace/gpdb create-demo-cluster
source ~/workspace/gpdb/gpAux/gpdemo/gpdemo-env.sh

Setup Hadoop

HDFS is needed to demonstrate functionality. You can choose to start additional Hadoop components (Hive, HBase) if you need them.

Set up user impersonation before starting the Hadoop components (this allows the gpadmin user to access Hadoop data).

~/workspace/pxf/dev/configure_singlecluster.bash

Set up and start HDFS:

pushd ~/workspace/singlecluster/bin
echo y | ./init-gphd.sh
./start-hdfs.sh
popd
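
Services may take a moment to accept requests after their start script returns. The sketch below is a generic retry helper, not part of the singlecluster scripts; wait_for is a hypothetical name, and the usage line assumes the hdfs client is on your PATH inside the container.

```shell
# Retry a command until it succeeds.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run (its output is discarded)
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  wf_attempts=$1
  wf_delay=$2
  shift 2
  wf_try=1
  while [ "$wf_try" -le "$wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wf_try=$((wf_try + 1))
    # Sleep only if another attempt remains
    if [ "$wf_try" -le "$wf_attempts" ]; then
      sleep "$wf_delay"
    fi
  done
  return 1
}
```

For example, wait_for 30 2 hdfs dfs -ls / polls the HDFS root every 2 seconds, up to 30 times.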

Start other optional components as needed:

pushd ~/workspace/singlecluster/bin
# Start Hive
./start-yarn.sh
./start-hive.sh

# Start HBase 
./start-zookeeper.sh
./start-hbase.sh
popd

Setup Minio (optional)

Minio is an S3-API compatible local storage solution. The development docker image comes with Minio software pre-installed. To start the Minio server, run the following script:

source ~/workspace/pxf/dev/start_minio.bash

After the server starts, you can access the Minio UI at http://localhost:9000 from the host OS. Use admin for the access key and password for the secret key when connecting to your local Minio instance.

The script also sets PROTOCOL=minio so that the automation framework uses the local Minio server when running S3 automation tests. If you later want to run Hadoop HDFS tests, unset this variable with the unset PROTOCOL command.
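
If you switch between Minio and HDFS often, it may be handy to see which target is active and to run a single command against HDFS without losing the Minio setting. This is a minimal sketch: both function names are hypothetical, and treating an empty PROTOCOL the same as an unset one is an assumption.

```shell
# Print the storage target implied by PROTOCOL.
# Assumption: unset or empty PROTOCOL means the HDFS tests run.
current_test_target() {
  if [ -n "${PROTOCOL:-}" ]; then
    printf '%s\n' "$PROTOCOL"
  else
    printf 'hdfs\n'
  fi
}

# Run one command with PROTOCOL unset. The subshell keeps the change
# from leaking into the calling shell.
run_against_hdfs() {
  ( unset PROTOCOL; "$@" )
}
```

For example, from the automation directory, run_against_hdfs make TEST=HdfsSmokeTest runs the smoke test against HDFS and leaves PROTOCOL=minio in place afterwards.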

Setup PXF

Install PXF Server

# Install PXF
make -C ~/workspace/pxf install

# Initialize PXF
export PXF_CONF=~/pxf
$PXF_HOME/bin/pxf init

# Start PXF
$PXF_HOME/bin/pxf start

Finally, if you don't have any servers configured, go ahead and copy the Hadoop templates to the default configuration location:

cp "${PXF_CONF}"/templates/*-site.xml "${PXF_CONF}/servers/default"
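
If some servers are already configured, a plain cp would overwrite their files. The following sketch copies only the templates that are missing from the destination; copy_site_templates is a hypothetical helper, not a PXF command.

```shell
# Copy every *-site.xml template from directory $1 into directory $2,
# skipping any file that already exists there so local edits survive.
copy_site_templates() {
  cst_src=$1
  cst_dest=$2
  mkdir -p "$cst_dest" || return 1
  for cst_file in "$cst_src"/*-site.xml; do
    # When the glob matches nothing, the literal pattern is left; skip it
    [ -e "$cst_file" ] || continue
    cst_target=$cst_dest/$(basename "$cst_file")
    if [ -e "$cst_target" ]; then
      printf 'kept %s\n' "$cst_target"
    else
      cp "$cst_file" "$cst_target" || return 1
      printf 'copied %s\n' "$cst_target"
    fi
  done
  return 0
}
```

For example: copy_site_templates "${PXF_CONF}/templates" "${PXF_CONF}/servers/default".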

Install PXF client (ignore if this is already done)

if [ -d ~/workspace/gpdb/gpAux/extensions/pxf ]; then
	PXF_EXTENSIONS_DIR=gpAux/extensions/pxf
else
	PXF_EXTENSIONS_DIR=gpcontrib/pxf
fi
make -C ~/workspace/gpdb/${PXF_EXTENSIONS_DIR} installcheck
psql -d template1 -c "create extension pxf"
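
To decide whether the client step is already done, you can ask the database whether the extension exists. This is a minimal sketch, assuming psql can connect with your current environment and that the pg_extension catalog is available in your GPDB version; pxf_extension_installed is a hypothetical helper name.

```shell
# Succeed if the pxf extension is registered in the given database
# (default: template1). -A and -t make psql print the bare value only.
pxf_extension_installed() {
  pei_db=${1:-template1}
  pei_found=$(psql -d "$pei_db" -Atc "SELECT extname FROM pg_extension WHERE extname = 'pxf'")
  [ "$pei_found" = "pxf" ]
}
```

For example: pxf_extension_installed || psql -d template1 -c "create extension pxf".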

Run PXF Tests

All tests use a database named pxfautomation.

pushd ~/workspace/pxf/automation

# Initialize default server configs using template
cp ~/pxf/templates/*.xml ~/pxf/servers/default

# Run specific tests. Example: Hdfs Smoke Test
make TEST=HdfsSmokeTest

# Run all tests. This will be time consuming.
make GROUP=gpdb

# To run tests against a different storage protocol, set the following variable (e.g., s3)
export PROTOCOL=s3
popd

If you see any HBase failures, try copying pxf-hbase-*.jar to the HBase classpath, and restart HBase:

cp ${PXF_HOME}/lib/pxf-hbase-*.jar ~/workspace/singlecluster/hbase/lib
~/workspace/singlecluster/bin/stop-hbase.sh
~/workspace/singlecluster/bin/start-hbase.sh

Make Changes to PXF

To deploy your changes to PXF in the development environment, stop PXF, reinstall it, and start it again:

# The $PXF_HOME folder is replaced each time you run make install.
# If you have made any config changes, back them up first.
$PXF_HOME/bin/pxf stop
make -C ~/workspace/pxf install

# Make any config changes you had backed up previously
$PXF_HOME/bin/pxf start
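
Because make install replaces $PXF_HOME, a backup has to live outside it. The sketch below copies a directory to a timestamped folder under a separate backup root; backup_dir is a hypothetical helper, and the paths in the usage line are illustrative assumptions, not locations defined by PXF.

```shell
# Copy directory $1 into backup root $2 as <name>.<timestamp> and print
# the path of the new copy. Fails if $1 is not a directory.
backup_dir() {
  bd_src=${1%/}
  bd_root=$2
  if [ ! -d "$bd_src" ]; then
    printf 'not a directory: %s\n' "$bd_src" >&2
    return 1
  fi
  mkdir -p "$bd_root" || return 1
  bd_dest="$bd_root/$(basename "$bd_src").$(date +%Y%m%d%H%M%S)"
  cp -R "$bd_src" "$bd_dest" || return 1
  printf '%s\n' "$bd_dest"
}
```

For example, backup_dir "$PXF_HOME/conf" ~/pxf-backups before stopping PXF; substitute whichever directory holds the files you changed.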

IDE Setup (IntelliJ)

  • Start IntelliJ. Click "Open" and select the directory to which you cloned the pxf repo.
  • Select File > Project Structure.
  • Make sure you have a JDK selected.
  • In the Project Settings > Modules section, import two modules for the pxf/server and pxf/automation directories. The first time, you'll get an error saying that there's no JDK set for Gradle. Cancel and retry; the error goes away the second time.
  • Restart IntelliJ.
  • Check that it worked by running a test (Cmd+O).

To run a Kerberized Hadoop Cluster

Requirements

  • Download bin_gpdb (from any of the pipelines)
  • Download pxf_tarball (from any of the pipelines)

The following commands start a container for running a Kerberized cluster:

docker run --rm -it \
  --privileged \
  --hostname c6401.ambari.apache.org \
  -p 5432:5432 \
  -p 5888:5888 \
  -p 8000:8000 \
  -p 8080:8080 \
  -p 8020:8020 \
  -p 9000:9000 \
  -p 9090:9090 \
  -p 50070:50070 \
  -w /home/gpadmin/workspace \
  -v ~/workspace/gpdb:/home/gpadmin/workspace/gpdb_src \
  -v ~/workspace/pxf:/home/gpadmin/workspace/pxf_src \
  -v ~/workspace/singlecluster-HDP:/home/gpadmin/workspace/singlecluster \
  -v ~/Downloads/bin_gpdb:/home/gpadmin/workspace/bin_gpdb \
  -v ~/Downloads/pxf_tarball:/home/gpadmin/workspace/pxf_tarball \
  -e CLUSTER_NAME=hdp \
  -e NODE=c6401.ambari.apache.org \
  -e REALM=AMBARI.APACHE.ORG \
  -e TARGET_OS=centos \
  pivotaldata/gpdb-pxf-dev:centos6-hdp-secure /bin/bash

# Inside the container run the following command:
pxf_src/concourse/scripts/test_pxf_secure.bash

echo "+----------------------------------------------+"
echo "| Kerberos admin principal: admin/admin@$REALM |"
echo "| Kerberos admin password : admin              |"
echo "+----------------------------------------------+"

su - gpadmin

Contributors

shivzone, frankgh, denalex, divyabhargov, hornn, benchristel, kavinderd, oliverralbertini, rvs, outofmem0ry, radarwave, tumuguskun, edespino, jiadexin, avocader, kdunn926, michaelandrepearce, mgoddard-pivotal, wangzw
