
Terraform module to deploy StreamSets on Oracle Cloud Infrastructure (OCI)

License: Apache License 2.0

HCL 26.91% Shell 22.49% Roff 0.01% Python 50.60%
cloud oci terraform streamsets partner-led oracle

oci-streamsets's Introduction

oci-streamsets

These are Terraform modules that deploy StreamSets on Oracle Cloud Infrastructure (OCI). They are developed jointly by Oracle and StreamSets.

Getting Started

Welcome! This folder contains Terraform scripts that set up the StreamSets Data Collector (SDC) so you can ingest data quickly and easily. The current level of this directory contains the Terraform files that create a single compute instance running one Data Collector. This setup is commonly used for learning or developing on the StreamSets Data Operations Platform, but it can also be used for production-ready data movement and transformation.

The folder titled, "SDC Standalone with EDH Cluster", will create single SDC instance ready for data movement inside a Cloudera Enterprise Data Hub. This SDC instance will reside in the same subnet(s) as the worker nodes in the cluster. This is mainly for easy development and learning how the StreamSets Data Operations Platform extends to Hadoop infrastructure.

The folder titled, "SDC via CDH Parcel Manager", will create SDC instances on all the worker nodes in the cluster and enable things like clustered execution of pipelines or REST-based microservices pipelines. This is not currently production ready and is still in development but stay tuned!

Standalone StreamSets Data Collector Architecture

Prerequisites

In addition to an active tenancy on OCI, you will need a functional installation of Terraform and an API key for a privileged user in the tenancy. See these documentation links for more information:

Getting Started with Terraform on OCI

How to Generate an API Signing Key

Once the prerequisites are in place, you will need to copy the templates from this repository to the machine where you have Terraform installed.

Clone the Terraform template

Now, you'll want a local copy of this repo. You can make that with the commands:

git clone https://github.com/oracle-quickstart/oci-streamsets.git
cd oci-streamsets
ls

Update Template Configuration

Update the environment variables in the config file env-vars to specify your OCI account details, such as tenancy_ocid, user_ocid, and compartment_ocid. To source this file prior to installation, either reference it in your shell's .rc file or run the following:

    source env-vars
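
For reference, env-vars is a shell script that exports the values Terraform reads as input variables. A minimal sketch is shown below; the variable names follow the usual OCI quickstart convention, so confirm them against the env-vars file shipped in this repo and fill in your own values:

    # OCI authentication details (placeholder values)
    export TF_VAR_tenancy_ocid="ocid1.tenancy.oc1..<unique_id>"
    export TF_VAR_user_ocid="ocid1.user.oc1..<unique_id>"
    export TF_VAR_compartment_ocid="ocid1.compartment.oc1..<unique_id>"
    export TF_VAR_fingerprint="<api_signing_key_fingerprint>"
    export TF_VAR_private_key_path="$HOME/.oci/oci_api_key.pem"
    # Region and SSH key used for the compute instance
    export TF_VAR_region="us-ashburn-1"
    export TF_VAR_ssh_public_key="$(cat $HOME/.ssh/id_rsa.pub)"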

Deployment & Post Deployment

Deploy using the standard Terraform commands:

    terraform init
    terraform plan
    terraform apply

SSH to SDC Node

When terraform apply completes, the terminal will display the public IP address of the Data Collector instance. The default login is opc. You can SSH into the machine with a command like this:

    ssh -i ~/.ssh/id_rsa opc@${data.oci_core_vnic.datacollector_vnic.public_ip_address}

You can then open the Data Collector web console at http://${data.oci_core_vnic.datacollector_vnic.public_ip_address}:18630/. The default username and password are both admin.

Data Collector Web Console

What is the StreamSets Data Collector?

StreamSets Data Collector is a lightweight, powerful design and execution engine that streams data in real time. SDC is used to route and process data in your data streams from almost any origin to almost any destination.

To define the flow of data, you design a pipeline in SDC. A pipeline consists of one or more stages that represent the origins and destinations of the pipeline, as well as any additional processing that you want to perform. After you design the pipeline, you can preview it to assist with debugging. When you are ready to run the pipeline live, you click Start and SDC goes to work.

Once SDC is running, it processes the data when it arrives at the origin and waits quietly when not needed. You can view real-time statistics about your data, inspect data as it passes through the pipeline, or take a closer look at a snapshot of data.

How should I use SDC?

Use SDC like a pipe for a data stream. Throughout your enterprise data topology, you have streams of data that you need to move, collect, and process on the way to their destinations. SDC provides the crucial connection between hops in the stream.

To solve your ingest needs, you can use a single SDC to run one or more pipelines. Or you might install a series of Data Collectors to stream data across your enterprise data topology.

How does this really work?

Let's walk through it...

After you run the Terraform script for a standalone SDC, you use the Data Collector UI to log in and create your first pipeline.

What do you want it to do? Let's say you want to read XML files from a directory and remove the newline characters before moving the data into HDFS. To do this, you start with a Directory origin stage and configure it to point to the source file directory. (You can also have the stage archive processed files and write files that were not fully processed to a separate directory for review.)

To remove the newline characters, connect Directory to an Expression Evaluator processor and configure it to remove the newline character from the last field in the record.
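
For example, the Expression Evaluator could write the cleaned value back to the field with an expression along these lines (the field path /text is purely illustrative, and the exact escaping may vary by SDC version, so treat this as a sketch):

    ${str:replaceAll(record:value('/text'), '\\n', '')}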

To make the data available to HDFS, you connect the Expression Evaluator to a Hadoop FS destination stage. You configure the stage to write the data as a JSON object (though you can use other data formats as well).

You preview data to see how source data moves through the pipeline and notice that some fields have missing data. So you add a Field Replacer to replace null values in those fields.

Now that the data flow is done, you configure the pipeline error record handling to write error records to a file, you create a data drift alert to let you know when field names change, and you configure an email alert to let you know when the pipeline generates more than 100 error records. Then, you start the pipeline and Data Collector goes to work.

The Data Collector goes into Monitor mode and displays summary and error statistics immediately. To get a closer look at the activity, you take a snapshot of the pipeline so you can examine how a set of data passed through the pipeline. You see some unexpected data in the pipeline, so you create a data rule for a link between two stages to gather information about similar data and set an alert to notify you when the numbers get too high.

And what about those error records being written to file? They're saved with error details, so you can create an error pipeline to reprocess that data. Et voila!

StreamSets Data Collector is a powerful tool, but we're making it as simple as possible to use. So give it a try, click the Help icon for information, and contact us if you need a hand. For more use cases and examples to learn from, please visit: https://github.com/streamsets/tutorials

oci-streamsets's People

Contributors

acahn, benofben, yamachinu


oci-streamsets's Issues

Glitches during deployment

I suggest the following additions/modifications to this post.

1) Space Problem
The output blocks at lines 161 and 168 of the main.tf files contain names with spaces, which is caught as an error. Removing the spaces, for example by replacing them with underscores, resolves it.

output "SSH to Data Collector instance " {
output "Access the Web Console for Data Collector instance " {

2) Directory Ownership Problem
The directory /opt/streamsets-datacollector is owned by the 'root' user after the Terraform deployment. This prevents any additional package installations through the StreamSets web UI, since it runs as the 'sdc' user. Perhaps adding a note like the one below would be helpful.

Post-Installation Task:
Change the ownership of the following directory to the 'sdc' user:

sudo chown -R sdc:sdc /opt/streamsets-datacollector/
