kettle-beam

Kettle plugins for Apache Beam

First

Build and install the kettle-beam-core project:

https://github.com/mattcasters/kettle-beam-core

Build

mvn clean install

Note: you need the Pentaho settings.xml in your ~/.m2 directory: https://github.com/pentaho/maven-parent-poms/blob/master/maven-support-files/settings.xml
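
A minimal sketch of putting the Pentaho settings.xml in place before building. The function name fetch_pentaho_settings and the variable PENTAHO_SETTINGS_URL are made up for this example, and the raw.githubusercontent.com URL is assumed to be the raw-download form of the link above. The sketch leaves an existing settings.xml alone, since you may already have your own Maven settings.

```shell
#!/bin/sh
# Assumption: raw-download form of the settings.xml link given in this README.
PENTAHO_SETTINGS_URL="https://raw.githubusercontent.com/pentaho/maven-parent-poms/master/maven-support-files/settings.xml"

# Hypothetical helper: download settings.xml into the given Maven directory
# (default: ~/.m2) unless a settings.xml is already there.
fetch_pentaho_settings() {
  m2_dir="${1:-$HOME/.m2}"
  if [ -e "$m2_dir/settings.xml" ]; then
    echo "keeping existing $m2_dir/settings.xml" >&2
    return 0
  fi
  mkdir -p "$m2_dir" || return 1
  curl -fsSL -o "$m2_dir/settings.xml" "$PENTAHO_SETTINGS_URL"
}
```

With the function defined, a build could then be started with: fetch_pentaho_settings && mvn clean install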

Install

  • Create a new directory called kettle-beam in <PDI-DIR>/plugins/
  • Copy target/kettle-beam-.jar to <PDI-DIR>/plugins/kettle-beam/
  • Copy the other jar files in target/lib to <PDI-DIR>/plugins/kettle-beam/lib/
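
The install steps above could be scripted roughly as follows. This is a sketch, not part of the project: the function name install_kettle_beam and its arguments are hypothetical, and the glob kettle-beam-*.jar simply stands in for whatever the built jar is called in target/.

```shell
#!/bin/sh
# Hypothetical helper: install_kettle_beam <PDI-DIR> [project-dir]
# Copies the built plugin and its dependencies into a PDI installation.
install_kettle_beam() {
  pdi_dir="$1"
  project_dir="${2:-.}"
  plugin_dir="$pdi_dir/plugins/kettle-beam"

  # Create <PDI-DIR>/plugins/kettle-beam/ together with its lib/ folder.
  mkdir -p "$plugin_dir/lib" || return 1

  # Copy the plugin jar; the glob matches any jar whose name starts with kettle-beam-.
  cp "$project_dir"/target/kettle-beam-*.jar "$plugin_dir/" || return 1

  # Copy the dependency jars from target/lib.
  cp "$project_dir"/target/lib/*.jar "$plugin_dir/lib/" || return 1
}
```

Run from the project directory after the build, for example: install_kettle_beam /path/to/pdi (the path is a placeholder for your own <PDI-DIR>).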

Configure

File Definitions

Describe the file layout for the input and output of your pipeline using:

Spoon menu Beam / Create a file definition

Specify this file layout in your "Beam Input" and "Beam Output" steps. If you do not specify a file definition in the "Beam Output" step, all fields arriving at the step are written with a comma as the separator and double quotes as the enclosure, using the formatting defined on the fields.
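
To illustrate the default output layout described above, here is a small sketch of what a line with a comma separator and double-quote enclosure looks like. The function format_row is purely illustrative and is not the plugin's code. The text does not say whether every field or only string fields are enclosed, nor how embedded quotes are handled; this sketch assumes every field is enclosed and does no escaping.

```shell
#!/bin/sh
# Illustrative only: join the arguments into one line, enclosing each
# field in double quotes and separating fields with commas.
format_row() {
  out=""
  sep=""
  for field in "$@"; do
    out="$out$sep\"$field\""
    sep=","
  done
  printf '%s\n' "$out"
}
```

Under these assumptions, format_row id name amount produces a line with three quoted, comma-separated fields.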

Beam Job Configurations

A Beam Job configuration is needed to run your transformation on Apache Beam. Specify which Runner to use (Direct and Dataflow are supported).
You can use variables to make your transformations completely generic. For example, you can set an INPUT_LOCATION variable to:

  • /some/folder/* for a Direct execution during testing
  • gs://mybucket/input/* for an execution on GCP Dataflow
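
The idea can be sketched as a small lookup that yields the INPUT_LOCATION value for each runner. The function input_location_for is hypothetical and only mirrors the two example values above; in practice the value would be entered as a variable in the Beam Job Configuration, and the transformation would refer to it using the Kettle variable syntax ${INPUT_LOCATION} (assumed here to be accepted wherever the input location is configured).

```shell
#!/bin/sh
# Hypothetical helper: print the INPUT_LOCATION value to use for a runner.
# The two values are the examples given in this README.
input_location_for() {
  case "$1" in
    Direct)   printf '%s\n' '/some/folder/*' ;;
    Dataflow) printf '%s\n' 'gs://mybucket/input/*' ;;
    *)        echo "unknown runner: $1" >&2; return 1 ;;
  esac
}
```

The transformation itself stays unchanged between the two runs; only the variable value differs per job configuration.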

Supported

  • Input: Beam Input, Google Pub/Sub Subscribe and Google BigQuery Input
  • Output: Beam Output, Google Pub/Sub Publish and Google BigQuery Output
  • Windowing with the Beam Window step and adding timestamps to bounded data for streaming (Beam Timestamp)
  • Sort rows: not yet supported, and it will never be supported in the generic sense that it is in Kettle.
  • Group By step: experimental. Supports SUM (Integer, Number), COUNT, MIN, MAX and FIRST, and throws errors for anything that is not supported.
  • Merge Join
  • Stream Lookup (side loading data)
  • Filter rows (including targeting steps for true/false)
  • Switch/Case
  • Plugin support through the Beam Job Configuration: specify which plugins to include in the runtime

Runners

  • Beam Direct : working
  • Google Cloud Dataflow : working
  • Apache Spark : mostly untested, configurable (feedback welcome)
  • Apache Flink : not started yet, stubbed out code
  • Apache Apex : not started yet, stubbed out code
  • JStorm : not started yet

More information

http://diethardsteiner.github.io/pdi/2018/12/01/Kettle-Beam.html
