Coder Social home page Coder Social logo

struct-type-encoder's Introduction

struct-type-encoder

Build Status Join the chat at https://gitter.im/struct-type-encoder/Lobby Maven Central Stories in Ready

Deriving Spark DataFrame schemas from case classes.

Installation

struct-type-encoder is available on maven central with the following coordinates:

"com.github.benfradet" %% "struct-type-encoder" % "0.5.0"

Motivation

When reading a DataFrame/Dataset from a data source the schema of the data has to be inferred. In practice, this translates into looking at every record of all the files and coming up with a schema that can satisfy every one of these records, as shown here for JSON.

As anyone can guess, this can be a very time-consuming task, especially if you know in advance the schema of your data. A common pattern is to do the following:

case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
  .read
  .json("/some/dir/*.json")
  .as[MyCaseClass]

In this case, there is no need to spend time inferring the schema as the DataFrame is directly converted to a Dataset of MyCaseClass. However, it can be a lot of boilerplate to bypass the inference by specifying your own schema.

import org.apache.spark.sql.types._
val schema = SructType(
  StructField("a", IntegerType) ::
  StructField("b", StringType) ::
  StructField("c", DoubleType) :: Nil
)
val specified = spark
  .read
  .schema(schema)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

struct-type-encoder derives instances of StructType (how Spark represents a schema) from your case class automatically:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val derived = spark
  .read
  .schema(StructTypeEncoder[MyCaseClass].encode)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

No inference, no boilerplate!

Additional features

Spark Metadata support

It is possible to add Metada information to StructFields with the Meta annotation:

import org.apache.spark.sql.types._
import ste._

val metadata = new MetadataBuilder()
  .putLong("foo", 10)
  .putString("bar", "baz")
  .build()

case class Foo(a: String, @Meta(metadata) b: Int)

Flattening schemas

Using the ste.Flatten annotation we can eliminate repetitions from case class definitions. Take the following example:

import ste._
case class Foo(a: String, b: Int)
case class Bar(@Flatten(2) a: Seq[Foo], @Flatten(1, Seq("x", "y")) b: Map[String, Foo], @Flatten c: Foo)

StructTypeEncoder[Bar].encode

The derived schema is the following:

StructType(
  StructField("a.0.a", StringType, false) ::
  StructField("a.0.b", IntegerType, false) ::
  StructField("a.1.a", StringType, false) ::
  StructField("a.1.b", IntegerType, false) ::
  StructField("b.x.a", StringType, false) ::
  StructField("b.x.b", IntegerType, false) ::
  StructField("b.y.a", StringType, false) ::
  StructField("b.y.b", IntegerType, false) ::
  StructField("c.a", StringType, false) ::
  StructField("c.b", IntegerType, false) :: Nil
)

Now we want to read our data source with a flat schema:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val df = spark
  .read
  .schema(StructTypeEncoder[Bar].encode)
  .csv("/some/dir/*.csv")

struct-type-encoder can derive the nested projection of a Dataframe and convert it to a Dataset by providing the class:

import StructTypeSelector._

val ds: Dataset[Bar] = df.asNested[Bar]

Benchmarks

This project includes JMH benchmarks to prove that inferring schemas and coming up with the schema satisfying all records is expensive. The benchmarks compare the average time spent parsing a thousand files each containing a hundred rows when the schema is inferred (by Spark, not user-specified) and derived (thanks to struct-type-encoder).

derived inferred
CSV 5.936 ± 0.035 s 6.494 ± 0.209 s
JSON 5.092 ± 0.048 s 6.019 ± 0.049 s

We see that when deriving the schemas we spend 16.7% less time reading JSON data and a 8.98% for CSV.

struct-type-encoder's People

Contributors

benfradet avatar gaborbarna avatar gitter-badger avatar scala-steward avatar waffle-iron avatar zoltan-nagy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

struct-type-encoder's Issues

Investigate coproducts

  • not supported by Spark:
sealed trait A
final case class B(b: Int) extends A
final case class C(c: String) extends A
spark.createDataset(Seq(B(1), C("c")))
// java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with A found
  • might still be useful if not converting to dataset / see if frameless can be useful here

  • should either be:

StructType(
  StructField("B", StructType(StructField("b", IntegerType) :: Nil) ::
  StructField("C", StructType(StructField("c", StringType) :: Nil) :: Nil
)

or flattened set:

StructType(
  StructField("b", IntegerType) ::
  StructField("c", StringType) :: Nil
)

how to query on SPARKSQL

Hi Team,

I am trying following SQL:-

var work__store_level_vend_pack_loc_final_data =

sparksession.read.format("csv")
.option("header", "true")
.option("delimiter", "|")
.option("inferSchema", "true")
.load("C:\Users\jabin\Desktop\project_files\work__store_level_vend_pack_loc_final_data.txt");

work__store_level_vend_pack_loc_final_data.registerTempTable("work__store_level_vend_pack_loc_final_data_table");

var r1 = sparksession.sqlContext.sql( "SELECT shc_item_id ,'K' as source_owner_cd,item_purchase_status_cd, vendor_package_id,vendor_package_purchase_status_cd,flow_type_cd as vendor_package_flow_type_cd,vendor_carton_qty,vendor_stock_nbr,ksn_package_id,ksn_purchase_status_cd,import_ind,sears_divission_nbr,sears_item_nbr,sears_sku_nbr,scan_based_trading_ind,cross_merchandising_cd,retail_carton_vendor_package_id,vendor_package_owner_cd,can_carry_model_id,'' AS days_to_check_begin_day_qty,'' AS days_to_check_end_day_qty ,dotcom_allocation_ind ,retail_carton_internal_package_qty,allocation_replenishment_cd,shc_item_type_cd,idrp_order_method_cd,source_package_qty as store_source_package_qty,order_duns_nbr FROM work__store_level_vend_pack_loc_final_data_table WHERE flow_type_cd = 'JIT' OR servicing_dc_nbr > '0' ")
// .collect.foreach(println)

now i want to distinct all column of the r1 using sparksession.sqlContext.sql("")
how to do above thing?

Find a way to deal with Tuples

spark.createDataset((1 to 100).map(i => (i, i.toString))).schema
// res0: org.apache.spark.sql.types.StructType = StructType(StructField(_1,IntegerType,false), StructField(_2,StringType,true))

case class R(a: (Int, String))
spark.createDataset((1 to 100).map(i => R(i, i.toString))).schema
// res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StructType(StructField(_1,IntegerType,false), StructField(_2,StringType,true)),true))

https://github.com/circe/circe/blob/master/project/Boilerplate.scala

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.