Coder Social home page Coder Social logo

Riffl

gradle

Riffl is a generic streaming data delivery and integration framework currently executing on top of Flink and leveraging Table API. It aims for its processing to be simple to define and reason about with YAML configuration and SQL expressions. Deploys into a wide range of environments be it Hadoop, Kubernetes, or in any other Flink supported ways, and it is self-contained. Riffl puts data quality first with exactly-once guarantees but also output optimization so that query engines can utilize their features to operate efficiently.

Features

  • Streaming directly into storage ready for querying
  • High performance, horizontally scalable, low latency
  • Exactly-once guarantees
  • No coding required only Yaml and SQL
  • Output optimization

Configuration

config.yaml

Distributed file system location (hdfs|s3a|file|...) hdfs:///riffl/config/config.yaml

name: Riffl application
# [optional] Properties to expand placeholders specified as ${property...}
properties:
  catalog.name: "custom_catalog"
  s3.bucket: "<S3 bucket>"

# Execution configuration overrides
execution:
  type: FLINK
  configuration:
    execution.checkpointing.interval: 45s
    execution.checkpointing.mode: EXACTLY_ONCE
    
# Data metrics store location to support advanced distribution algorithms
metrics:
  storeUri: s3://${properties.s3.bucket}/metrics/
  
# [optional] Catalog/Database definitions to support external integration points like Hive or Iceberg
catalogs:                                        
  - createUri: hdfs:///riffl/example/catalog.ddl
   [create: "CREATE CATALOG ${properties.catalog.name} (...)"]             
databases:                                          
  - createUri: example/database.ddl
   [create: "CREATE DATABASE (...)"]
   
# Source definitions to load data from e.g. Kafka, Kinesis
sources:
  - createUri: example/source.ddl
   [create: "CREATE TABLE (...)"]
    mapUri: example/source-map.ddl
   [map: "SELECT column FROM (...)"]
    # Source stream rebalance in case of input data skew  [default: false]
    rebalance: false

# Sink definitions to define output location and format e.g. AWS S3 as Parquet
sinks:
  - createUri: example/sink-1.ddl
   [create: "CREATE TABLE (...)"]
   # Name of a table if already created in an external catalog
   [table: "iceberg_catalog.riffle.sink_1"]
    queryUri: example/sink-1-query.ddl
   [query: "SELECT column FROM (...)"]
  - createUri: example/sink-2.ddl
    queryUri: example/sink-2-query.ddl
    
    # [optional] Parallism of sink [dafault: application paralleism]
    parallelism: 5
    # [optional] Custom data distribution configuartion to optimize the output  
    distribution:
      className: "io.riffl.sink.row.KeyByFactory"
      properties:
        keys:
          - "someField_2"
        keyParallelism: 2

Source

Data source and format defined as one of Flink connectors supporting the "Unbounded Scan".

e.g. location hdfs:///riffl/config/source.ddl
CREATE TABLE source_table (
    `timestamp` STRING,
    `user` STRING,
    product STRING,
    price DOUBLE,
    ingredients ARRAY<STRING>
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'test-topic',
    'connector.properties.bootstrap.servers' = '<ips>',
    'format.type' = 'json',
    'format.fail-on-missing-field' = 'false'
)

Sink

Data output destination and format defined with standard filesystem connector or custom e.g. Iceberg connector.

e.g. location hdfs:///riffl/config/sink.ddl
CREATE TABLE IF NOT EXISTS sink_default (
  product_id INT,
  product_type INT,
  product_name STRING,
  product_desc STRING,
  dt STRING,
  hr STRING
) PARTITIONED BY (dt, hr)
WITH (
'connector'='filesystem',
'format'='parquet',
'path'='${properties.sink.path}'
)"

Deployment

Supported Flink versions:

  • 1.15

Build

./gradlew clean build

Local

./gradlew runLocal --args='--application example/application.yaml'

riffl's Projects

riffl icon riffl

Riffl - generic streaming data ingestion framework

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.