spark-df-profiling's Introduction

HTML profiling reports from Apache Spark DataFrames

Generates profile reports from an Apache Spark DataFrame. It is based on pandas_profiling, but for Spark's DataFrames instead of pandas'.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
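Several of the statistics listed above are derived from simpler ones. The following is a minimal pure-Python sketch of those relationships on a small in-memory list. It is only an illustration and not the package's implementation: the function name `derived_statistics`, the inclusive quantile method and the use of the sample standard deviation are all assumptions made here.

```python
import statistics


def derived_statistics(values):
    """Compute a few derived statistics for a non-empty list of numbers.

    Hypothetical helper for illustration only; the conventions used
    (inclusive quantiles, sample standard deviation) are assumptions.
    """
    ordered = sorted(values)
    # With n=4, statistics.quantiles (Python 3.8+) returns [Q1, median, Q3].
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    mean = statistics.mean(ordered)
    std = statistics.stdev(ordered)
    return {
        # Range: distance between the largest and smallest value.
        "range": ordered[-1] - ordered[0],
        # Interquartile range: spread of the middle half of the data.
        "iqr": q3 - q1,
        # Coefficient of variation: standard deviation relative to the mean
        # (undefined when the mean is zero).
        "cv": std / mean,
        # Median absolute deviation: median distance from the median.
        "mad": statistics.median([abs(v - median) for v in ordered]),
    }


stats = derived_statistics([1, 2, 3, 4, 5])
```

For the sample list, the range is 4, the interquartile range is 2 and the median absolute deviation is 1.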

All statistics are computed without Python UDFs or .map() transformations. Every operation is expressed through Spark SQL, so the work is planned by the Catalyst optimizer and executed by the Tungsten engine instead of being passed row by row through the Python interpreter.
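To make the idea concrete, here is a hedged sketch of what "statistics through Spark SQL only" can look like: a helper that builds a single aggregate query for one column. The function `build_stats_query`, the alias names and the assumption that the DataFrame has been registered as a temporary table are all hypothetical and are not taken from the package's source.

```python
def build_stats_query(table, column):
    """Return a Spark SQL query with basic aggregates for one column.

    Hypothetical helper: `table` is assumed to be the name of a temporary
    table already registered for the DataFrame being profiled.
    """
    if "`" in column:
        raise ValueError("column names containing backticks are not handled")
    col = "`{}`".format(column)
    aggregates = [
        "COUNT(*) AS n_rows",
        "COUNT({c}) AS n_non_null",  # COUNT(col) skips NULLs
        "COUNT(DISTINCT {c}) AS n_unique",
        "MIN({c}) AS min_value",
        "MAX({c}) AS max_value",
        "AVG({c}) AS mean_value",
        "SUM({c}) AS sum_value",
    ]
    select_list = ", ".join(a.format(c=col) for a in aggregates)
    return "SELECT {} FROM {}".format(select_list, table)


query = build_stats_query("my_table", "price")
# In a pyspark session the string could then be run with, for example:
#   df.registerTempTable("my_table")
#   sqlContext.sql(query).show()
```

Because the whole computation is a declarative query, Spark can evaluate every aggregate in a single pass over the column, and no Python function is called per row. The number of missing values follows as `n_rows - n_non_null`.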

Demo

Available here.

Installation

If you are using Anaconda, you already have all the needed dependencies, so install the package without its dependencies to keep pip from overwriting the versions you already have:

pip install --no-deps spark-df-profiling

If you don't have pandas and/or matplotlib installed:

pip install spark-df-profiling

Usage

The profile report is written in HTML5 and CSS3, which means that you may require a modern browser.

Keep in mind that you need a working Spark cluster (or a local Spark installation), and that the report must be created from pyspark. To point the pyspark driver at the right interpreter, set the environment variable PYSPARK_DRIVER_PYTHON to the Python executable of the environment where spark-df-profiling is installed. For example, for Anaconda:

export PYSPARK_DRIVER_PYTHON=/path/to/your/anaconda/bin/python

And then you can execute /path/to/your/bin/pyspark to enter pyspark's CLI.
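Once inside the pyspark session, it can be useful to confirm that the driver is really running the interpreter you configured. The following is a small sketch of such a check. The helper name `driver_python_matches` is hypothetical, and the check assumes PYSPARK_DRIVER_PYTHON holds an absolute path to a Python executable, as in the Anaconda example.

```python
import os
import sys


def driver_python_matches(env=None, executable=None):
    """Return True if PYSPARK_DRIVER_PYTHON points at the running interpreter.

    Hypothetical helper: `env` defaults to os.environ and `executable`
    defaults to sys.executable; both can be overridden for testing.
    """
    env = os.environ if env is None else env
    executable = sys.executable if executable is None else executable
    configured = env.get("PYSPARK_DRIVER_PYTHON")
    if not configured:
        # The variable is unset or empty, so nothing was configured.
        return False
    # Resolve symlinks so that equivalent paths compare equal.
    return os.path.realpath(configured) == os.path.realpath(executable)


print(driver_python_matches())
```

If this prints False, the session is probably using a different Python installation than the one where spark-df-profiling was installed.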

Jupyter Notebook (formerly IPython)

We recommend generating reports interactively by using the Jupyter notebook.

To use pyspark with Jupyter, you must also set PYSPARK_DRIVER_PYTHON:

export PYSPARK_DRIVER_PYTHON=/path/to/your/anaconda/bin/python

And then:

IPYTHON_OPTS="notebook" /path/to/your/bin/pyspark

In Spark 2.0.x, IPYTHON_OPTS was removed. Set the environment variable PYSPARK_DRIVER_PYTHON_OPTS instead:

PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/to/your/bin/pyspark

Now you can create a new notebook, which will run pyspark.

To use spark-df-profiling, start by loading your data into a Spark DataFrame, for example:

# sqlContext is probably already created for you.
# To load a Parquet file as a Spark DataFrame, you can:
df = sqlContext.read.parquet("/path/to/your/file.parquet")
# And you probably want to cache it, since a lot of
# operations will be done while the report is being generated:
df_spark = df.cache()

To display the report in a Jupyter notebook, run:

import spark_df_profiling
spark_df_profiling.ProfileReport(df_spark)

If you want to generate an HTML report file, assign the ProfileReport to a variable and call its .to_file() method:

profile = spark_df_profiling.ProfileReport(df_spark)
profile.to_file(outputfile="/tmp/myoutputfile.html")

Dependencies

  • Python (>=2.7)
  • Apache Spark (who would imagine!) -> requires Spark >=1.5.0 (compatible with 2.0.0 also).
  • An internet connection. The generated report needs it to download the Bootstrap and jQuery libraries. You can embed them in the HTML template code instead, should you desire.
  • jinja2 (>=2.8) -> needed for template rendering. Only needed in the Spark driver.
  • matplotlib (>=1.4) -> needed for histogram creation. Only needed in the Spark driver.
  • pandas (>=0.17.0) -> needed for internal data arrangement. Only needed in the Spark driver.
  • six (>=1.9.0) -> needed for py2/3 compatibility. Only needed in the Spark driver.
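Since the Python packages above are only needed on the Spark driver, a quick way to check a driver environment is to try importing each of them. This is a minimal sketch with a hypothetical helper, `missing_modules`. It only checks that the modules can be imported and does not verify the minimum versions listed above.

```python
import importlib

# Module names taken from the dependency list above.
DRIVER_MODULES = ["jinja2", "matplotlib", "pandas", "six"]


def missing_modules(names=DRIVER_MODULES):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Missing driver modules:", missing_modules())
```

An empty list means all four modules are importable in the current environment.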
