Coder Social home page Coder Social logo

0xqq / spark-streaming-clickhouse Goto Github PK

View Code? Open in Web Editor NEW

This project forked from dmitrybe/spark-streaming-clickhouse

1.0 1.0 0.0 1.01 MB

Apache Spark structured streaming connector for Yandex ClickHouse OLAP

License: Other

Makefile 0.29% Scala 99.71%

spark-streaming-clickhouse's Introduction

Spark structured streaming Clickhouse sink

Dump Spark structured streaming output to Yandex ClickHouse OLAP

Quick start

Run ClickHouse server (local, docker)

docker run -it -p 8123:8123 -p 9000:9000 --name clickhouse yandex/clickhouse-server

Run ClickHouse client

docker run -it --net=host --rm yandex/clickhouse-client

Create ClickHouse databases

CREATE DATABASES IF NOT EXISTS db01
SHOW DATABASES

Create a project, define Spark structured streaming sink for ClickHouse

// input events
case class Event(word: String, timestamp: Timestamp)

// stream internal state
case class State(c: Int)

// stream output
case class StateUpdate(updateTimestamp: Timestamp, word: String, c: Int)

// clickhouse sink
class ClickHouseStateUpdatesSinkProvider extends ClickHouseSinkProvider[StateUpdate]{
  override def clickHouseServers: Seq[(String, Int)] = Seq(("localhost", 8123))
  override def dbName: String = "db01"
  override def tableName = Some("stateUpdates")
  override def eventDateColumnName: String = "eventDate"
  override def indexColumns: Seq[String] = Seq("word")
  override def partitionFunc: (Row) => Date =
    (row) => {
      // use event timestamp as partition key
      new java.sql.Date(row.getAs[Timestamp](0).getTime)
      // use current
      //new java.sql.Date(Calendar.getInstance().getTimeInMillis())
    }
}

Run nc -lk 9999

Describe Spark structured stream and start query

// spark session
val spark = SparkSession
    .builder
    .master("local[*]")
    .config("spark.sql.streaming.checkpointLocation", "./spark-checkpoints")
    .appName("streaming-test")
    .getOrCreate()

import spark.implicits._

val host = "localhost"
val port = "9999"

// define socket source
val lines = spark.readStream
    .format("socket")
    .option("host", host)
    .option("port", port)
    .option("includeTimestamp", true)
    .load()

// transform input data to stream of events
val events = lines
    .as[(String, Timestamp)]
    .flatMap { case (line, timestamp) =>
    line.split(" ").map(word => Event(word = word, timestamp))
    }

println(s"Events schema:")
events.printSchema()

// statefull transformation: word => Iterator[Event] => Iterator[StateUpdate]
val stateStream = events.groupByKey((x) => x.word)
    .flatMapGroupsWithState[State, StateUpdate](OutputMode.Append(), GroupStateTimeout.NoTimeout())((key, iter, state) => {

    // get / create new state
    val wState = state.getOption.getOrElse(State(0))
    val count = wState.c + iter.length

    // update state
    state.update(State(count))

    // output: Iterator[StateUpdate]
    List(
        StateUpdate(new Timestamp(Calendar.getInstance().getTimeInMillis), key, count)
    ).toIterator
    })

val query = stateStream.writeStream
    .outputMode("append")
    .format("ClickHouseEventsSinkProvider") // clickhouse sink
    //.format("console")
    .start()

query.awaitTermination()

spark-streaming-clickhouse's People

Contributors

dmitrybe avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.