benfradet / struct-type-encoder Goto Github PK

View Code? Open in Web Editor NEW

44.0 4.0 13.0 117 KB

Deriving Spark DataFrame schemas from case classes

License: Apache License 2.0

Scala 100.00%

spark sparksql

struct-type-encoder's Introduction

struct-type-encoder

Deriving Spark DataFrame schemas from case classes.

Installation

struct-type-encoder is available on maven central with the following coordinates:

"com.github.benfradet" %% "struct-type-encoder" % "0.5.0"

Motivation

When reading a DataFrame/Dataset from a data source the schema of the data has to be inferred. In practice, this translates into looking at every record of all the files and coming up with a schema that can satisfy every one of these records, as shown here for JSON.

As anyone can guess, this can be a very time-consuming task, especially if you know in advance the schema of your data. A common pattern is to do the following:

case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
  .read
  .json("/some/dir/*.json")
  .as[MyCaseClass]

In this case, there is no need to spend time inferring the schema as the DataFrame is directly converted to a Dataset of MyCaseClass. However, it can be a lot of boilerplate to bypass the inference by specifying your own schema.

import org.apache.spark.sql.types._
val schema = SructType(
  StructField("a", IntegerType) ::
  StructField("b", StringType) ::
  StructField("c", DoubleType) :: Nil
)
val specified = spark
  .read
  .schema(schema)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

struct-type-encoder derives instances of StructType (how Spark represents a schema) from your case class automatically:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val derived = spark
  .read
  .schema(StructTypeEncoder[MyCaseClass].encode)
  .json("/some/dir/*.json")
  .as[MyCaseClass]

No inference, no boilerplate!

Additional features

Spark `Metadata` support

It is possible to add Metada information to StructFields with the Meta annotation:

import org.apache.spark.sql.types._
import ste._

val metadata = new MetadataBuilder()
  .putLong("foo", 10)
  .putString("bar", "baz")
  .build()

case class Foo(a: String, @Meta(metadata) b: Int)

Flattening schemas

Using the ste.Flatten annotation we can eliminate repetitions from case class definitions. Take the following example:

import ste._
case class Foo(a: String, b: Int)
case class Bar(@Flatten(2) a: Seq[Foo], @Flatten(1, Seq("x", "y")) b: Map[String, Foo], @Flatten c: Foo)

StructTypeEncoder[Bar].encode

The derived schema is the following:

StructType(
  StructField("a.0.a", StringType, false) ::
  StructField("a.0.b", IntegerType, false) ::
  StructField("a.1.a", StringType, false) ::
  StructField("a.1.b", IntegerType, false) ::
  StructField("b.x.a", StringType, false) ::
  StructField("b.x.b", IntegerType, false) ::
  StructField("b.y.a", StringType, false) ::
  StructField("b.y.b", IntegerType, false) ::
  StructField("c.a", StringType, false) ::
  StructField("c.b", IntegerType, false) :: Nil
)

Now we want to read our data source with a flat schema:

import ste.StructTypeEncoder
import ste.StructTypeEncoder._
val df = spark
  .read
  .schema(StructTypeEncoder[Bar].encode)
  .csv("/some/dir/*.csv")

struct-type-encoder can derive the nested projection of a Dataframe and convert it to a Dataset by providing the class:

import StructTypeSelector._

val ds: Dataset[Bar] = df.asNested[Bar]

Benchmarks

This project includes JMH benchmarks to prove that inferring schemas and coming up with the schema satisfying all records is expensive. The benchmarks compare the average time spent parsing a thousand files each containing a hundred rows when the schema is inferred (by Spark, not user-specified) and derived (thanks to struct-type-encoder).

	derived	inferred
CSV	5.936 ± 0.035 s	6.494 ± 0.209 s
JSON	5.092 ± 0.048 s	6.019 ± 0.049 s

We see that when deriving the schemas we spend 16.7% less time reading JSON data and a 8.98% for CSV.

struct-type-encoder's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger waffle-iron elenaviter gaborbarna kobelzy scala-steward adserver-bot kmate pchanumolu kai2002 semanticbeeng zoltan-nagy

struct-type-encoder's Issues

Bump Spark to 2.3.1

Map Option to nullable

License headers

Support missing data types

~~CalendarIntervalType~~ not applicable since it's backed by a type internal to Spark and can't be inferred
DateType
DecimalType
NullType
ObjectType: could maybe resort to this if no implicit encoders are found
TimestampType

Solve classpath issue

It seems there is an issue in travis for builds in #51 https://travis-ci.org/BenFradet/struct-type-encoder/builds/579154896?utm_source=github_status&utm_medium=notification

Add scala-steward badge

https://github.com/fthomas/scala-steward/#wanna-have-a-badge-

Update the benchmarks' results on Spark 2.4.0

Investigate coproducts

not supported by Spark:

sealed trait A
final case class B(b: Int) extends A
final case class C(c: String) extends A
spark.createDataset(Seq(B(1), C("c")))
// java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with A found

might still be useful if not converting to dataset / see if frameless can be useful here
should either be:

StructType(
  StructField("B", StructType(StructField("b", IntegerType) :: Nil) ::
  StructField("C", StructType(StructField("c", StringType) :: Nil) :: Nil
)

or flattened set:

StructType(
  StructField("b", IntegerType) ::
  StructField("c", StringType) :: Nil
)

travis

Add support for Scala 2.12

Benchmarks

Bump Spark to 2.4.0

Make the traits sealed

how to query on SPARKSQL

Hi Team,

I am trying following SQL:-

var work__store_level_vend_pack_loc_final_data =

sparksession.read.format("csv")
.option("header", "true")
.option("delimiter", "|")
.option("inferSchema", "true")
.load("C:\Users\jabin\Desktop\project_files\work__store_level_vend_pack_loc_final_data.txt");

work__store_level_vend_pack_loc_final_data.registerTempTable("work__store_level_vend_pack_loc_final_data_table");

var r1 = sparksession.sqlContext.sql( "SELECT shc_item_id ,'K' as source_owner_cd,item_purchase_status_cd, vendor_package_id,vendor_package_purchase_status_cd,flow_type_cd as vendor_package_flow_type_cd,vendor_carton_qty,vendor_stock_nbr,ksn_package_id,ksn_purchase_status_cd,import_ind,sears_divission_nbr,sears_item_nbr,sears_sku_nbr,scan_based_trading_ind,cross_merchandising_cd,retail_carton_vendor_package_id,vendor_package_owner_cd,can_carry_model_id,'' AS days_to_check_begin_day_qty,'' AS days_to_check_end_day_qty ,dotcom_allocation_ind ,retail_carton_internal_package_qty,allocation_replenishment_cd,shc_item_type_cd,idrp_order_method_cd,source_package_qty as store_source_package_qty,order_duns_nbr FROM work__store_level_vend_pack_loc_final_data_table WHERE flow_type_cd = 'JIT' OR servicing_dc_nbr > '0' ")
// .collect.foreach(println)

now i want to distinct all column of the r1 using sparksession.sqlContext.sql("")
how to do above thing?

case class A(a: Unit)
StructTypeEncoder[A].encode
// StructType(StructField(a,StructType(),true))

Find a way to support all combinators

Find a way to deal with Tuples

spark.createDataset((1 to 100).map(i => (i, i.toString))).schema
// res0: org.apache.spark.sql.types.StructType = StructType(StructField(_1,IntegerType,false), StructField(_2,StringType,true))

case class R(a: (Int, String))
spark.createDataset((1 to 100).map(i => R(i, i.toString))).schema
// res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,StructType(StructField(_1,IntegerType,false), StructField(_2,StringType,true)),true))

https://github.com/circe/circe/blob/master/project/Boilerplate.scala

benfradet / struct-type-encoder Goto Github PK

struct-type-encoder's Introduction

struct-type-encoder

Installation

Motivation

Additional features

Spark Metadata support

Flattening schemas

Benchmarks

struct-type-encoder's People

Contributors

Stargazers

Watchers

Forkers

struct-type-encoder's Issues

Recommend Projects

Recommend Topics

Recommend Org

Spark `Metadata` support