spark-daria

Spark helper methods to maximize developer productivity.

Setup

Option 1: Maven

Fetch the JAR file from the Spark Packages Maven repository by adding the following to your build.sbt file.

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

libraryDependencies += "mrpowers" % "spark-daria" % "0.26.0"

Option 2: JitPack

Update your build.sbt file as follows.

resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.github.mrpowers" % "spark-daria" % "v0.26.0"

Accessing spark-daria versions for different Spark versions

Different spark-daria versions are compatible with different Spark versions. In general, the latest spark-daria versions are compatible with the latest Spark versions.

Spark version    spark-daria 0.26.0
2.0.0            ❌
2.1.0            ✅
2.2.2            ✅
2.3.0            ✅
2.3.1            ✅

Email me if you need a custom spark-daria version and I'll help you out 😉

Usage

spark-daria provides different types of functions that will make your life as a Spark developer easier:

  1. Core extensions
  2. Column functions / UDFs
  3. Custom transformations
  4. Helper methods
  5. DataFrame validators

The following overview will give you an idea of the types of functions that are provided by spark-daria, but you'll need to dig into the docs to learn about all the methods.

Core extensions

The core extensions add methods to existing Spark classes that will help you write beautiful code.

The native Spark API forces you to write code like this.

col("is_nice_person").isNull && col("likes_peanut_butter") === false

When you import the spark-daria ColumnExt class, you can write idiomatic Scala code like this:

import com.github.mrpowers.spark.daria.sql.ColumnExt._

col("is_nice_person").isNull && col("likes_peanut_butter").isFalse

This blog post describes how to use the spark-daria createDF() method that's much better than the toDF() and createDataFrame() methods provided by Spark.

See the ColumnExt, DataFrameExt, and SparkSessionExt objects for all the core extensions offered by spark-daria.
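For readers curious how a method can be added to an existing Spark class, here is a minimal sketch of the implicit class pattern. The object name ColumnExtSketch is hypothetical, and this is an illustration of the general technique rather than spark-daria's actual source.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical object name used for illustration only;
// spark-daria ships its own ColumnExt object.
object ColumnExtSketch {
  // The implicit class wraps a Column, so the compiler can resolve
  // isFalse on any Column once ColumnExtSketch._ is imported.
  implicit class ColumnMethods(c: Column) {
    def isFalse: Column = c === lit(false)
  }
}

object ColumnExtSketchDemo {
  import ColumnExtSketch._

  // Builds the same condition as the example above; no SparkSession
  // is needed just to construct a Column expression.
  val condition: Column =
    col("is_nice_person").isNull && col("likes_peanut_butter").isFalse
}
```

The extension only becomes visible where the import is in scope, which keeps the extra methods opt-in.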

Column functions

Column functions can be used in addition to the org.apache.spark.sql.functions.

Here is how to remove all whitespace from a string with the native Spark API:

import org.apache.spark.sql.functions._

regexp_replace(col("first_name"), "\\s+", "")

The spark-daria removeAllWhitespace() function lets you express this logic with code that's more readable.

import com.github.mrpowers.spark.daria.sql.functions._

removeAllWhitespace(col("first_name"))
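A column function is simply a function that takes a Column and returns a Column. The sketch below wraps the native regexp_replace call shown earlier; the name stripAllWhitespace is hypothetical and the body is an assumption about how such a helper could be written, not a copy of the spark-daria implementation.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, regexp_replace}

object ColumnFunctionSketch {
  // Hypothetical helper: replaces every run of whitespace with
  // the empty string, using the native regexp_replace function.
  def stripAllWhitespace(c: Column): Column =
    regexp_replace(c, "\\s+", "")

  // Usage: the helper composes like any other Column expression.
  val cleaned: Column = stripAllWhitespace(col("first_name"))
}
```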

Custom transformations

Custom transformations have the following method signature so they can be passed as arguments to the Spark DataFrame#transform() method.

def someCustomTransformation(arg1: String)(df: DataFrame): DataFrame = {
  // code that returns a DataFrame
}
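To make the signature concrete, here is a small sketch with two custom transformations chained through transform(). The function names withGreeting and withNameLength, the column names, and the local SparkSession settings are all assumptions made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, lit}

object TransformSketch {
  // First parameter list: configuration. Second: the DataFrame.
  // Supplying only the first list yields a DataFrame => DataFrame function.
  def withGreeting(greeting: String)(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit(greeting))

  // A transformation without configuration keeps an empty first list.
  def withNameLength()(df: DataFrame): DataFrame =
    df.withColumn("name_length", length(col("name")))

  // Builds a tiny DataFrame and chains both transformations.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("alice", "bob")
      .toDF("name")
      .transform(withGreeting("hello"))
      .transform(withNameLength())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("transform-sketch")
      .getOrCreate()
    demo(spark).show()
    spark.stop()
  }
}
```

Keeping the DataFrame in its own trailing parameter list is what lets each call slot into transform() without an explicit lambda.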

The spark-daria snakeCaseColumns() custom transformation snake_cases all of the column names in a DataFrame.

import com.github.mrpowers.spark.daria.sql.transformations._

val betterDF = df.transform(snakeCaseColumns())

Protip: You'll always want to deal with snake_case column names in Spark - use this function if your column names contain spaces or uppercase letters.
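The renaming rule itself can be sketched in plain Scala, without a Spark dependency. The helper below is hypothetical and only covers the two cases named above (spaces and uppercase letters); spark-daria's own snakeCaseColumns() may handle names differently.

```scala
object SnakeCaseSketch {
  // Hypothetical helper: lower-cases the name and replaces each
  // run of whitespace with a single underscore.
  def toSnakeCase(name: String): String =
    name.trim.toLowerCase.replaceAll("\\s+", "_")

  // Applies the rule to a whole list of column names.
  def snakeCaseAll(names: Seq[String]): Seq[String] =
    names.map(toSnakeCase)
}
```

With a real DataFrame, one way to apply such a rule is to fold over df.columns and call withColumnRenamed for each name.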

Helper methods

The DataFrame helper methods make it easy to convert DataFrame columns into Arrays or Maps. Here's how to convert a column to an Array.

import com.github.mrpowers.spark.daria.sql.DataFrameHelpers

val arr = DataFrameHelpers.columnToArray[Int](sourceDF, "num")
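For intuition, a helper of this kind could be written roughly as follows. The name columnToArraySketch is hypothetical and the body is an assumption, not the spark-daria source. Note that collect() pulls every row to the driver, so this approach only suits columns that fit comfortably in driver memory.

```scala
import scala.reflect.ClassTag
import org.apache.spark.sql.DataFrame

object DataFrameHelpersSketch {
  // Hypothetical helper: selects one column, collects the rows to the
  // driver, and reads the single field of each Row as type T.
  // The ClassTag is required to build an Array[T] at runtime.
  def columnToArraySketch[T: ClassTag](df: DataFrame, colName: String): Array[T] =
    df.select(colName).collect().map(row => row.getAs[T](0))
}
```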

DataFrame validators

DataFrame validators check that DataFrames contain certain columns or a specific schema, and throw a descriptive error message if the DataFrame schema is not as expected. They are a great way to make sure your application fails early with an error that explains what is missing.

Let's look at a method that makes sure a DataFrame contains the expected columns.

val sourceDF = Seq(
  ("jets", "football"),
  ("nacional", "soccer")
).toDF("team", "sport")

val requiredColNames = Seq("team", "sport", "country", "city")

validatePresenceOfColumns(sourceDF, requiredColNames)

// throws this error message: com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [country, city] columns are not included in the DataFrame with the following columns [team, sport]
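The check behind such a validator is essentially a set difference on column names. The plain-Scala sketch below illustrates the idea; the object, method, and exception names are hypothetical, and the message format simply mirrors the error shown above.

```scala
object ValidatorSketch {
  // Hypothetical exception type, used only in this sketch.
  final case class MissingColumnsException(message: String)
      extends Exception(message)

  // Required names that are absent from the actual column list,
  // in the order they were requested.
  def missingColumns(actual: Seq[String], required: Seq[String]): Seq[String] =
    required.diff(actual)

  // Throws when at least one required column is missing.
  def validatePresence(actual: Seq[String], required: Seq[String]): Unit = {
    val missing = missingColumns(actual, required)
    if (missing.nonEmpty) {
      throw MissingColumnsException(
        s"The [${missing.mkString(", ")}] columns are not included in " +
          s"the DataFrame with the following columns [${actual.mkString(", ")}]"
      )
    }
  }
}
```

With a real DataFrame, the actual column list would come from df.columns.toSeq.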

Documentation

Here is the latest spark-daria documentation.

Studying these docs will make you a better Spark developer!

👭 👬 👫 Contribution Criteria

We are actively looking for contributors to add functionality that fills in the gaps of the Spark source code.

To get started, fork the project and submit a pull request. Please write tests!

After submitting a couple of good pull requests, you'll be added as a contributor to the project.

Continued excellence will be rewarded with push access to the master branch.
