Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Packages
Resources
- Books
- MOOCS
- Workshops
- Projects Using Spark
- Blogs

Packages

Language Bindings

Flambo - Clojure DSL.
Mobius - C# bindings.

Notebooks and IDEs

Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
Spark Notebook - Scalable and stable Scala- and Spark-focused notebook bridging the gap between the JVM and data scientists (incl. extendable, typesafe, and reactive charts).

General Purpose Libraries

Succinct - Support for efficient queries on compressed data.

SQL Data Sources

Spark CSV - CSV reader and writer.
Spark Avro - Apache Avro reader and writer.
Spark XML - XML parser and writer.
Spark-Mongodb - MongoDB reader and writer.
Spark Cassandra Connector - Cassandra support, including a data source, an API, and support for arbitrary queries.
Spark Riak Connector - Riak TS & Riak KV connector.

Bioinformatics

ADAM - A set of tools designed to analyse genomics data.

GIS

Magellan - Geospatial analytics using Spark.
GeoSpark - A cluster computing system for processing large-scale spatial data.

Time Series Analytics

Spark-Timeseries - A Scala / Java / Python library for interacting with time series data on Apache Spark.

Graph Processing

Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
GraphFrames - DataFrame-based graph API.

Machine Learning Extension

dbscan-on-spark - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc, based on the paper by He, Yaobin, et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data".
Spark DBSCAN - Another implementation of the DBSCAN clustering algorithm by alitouka.
Apache SystemML - Declarative machine learning framework on top of Spark.
Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
spark-sklearn - Scikit-learn integration with distributed model training.
KeystoneML - Type-safe machine learning pipelines with RDDs.

REST interfaces

Livy - REST server with extensive language support (Python, R, Scala), the ability to maintain interactive sessions, and object sharing.
spark-jobserver - A simple Spark-as-a-Service that supports object sharing using so-called named objects. JVM only.

Resources

Books

MOOCS

Workshops

AMP Camp

Projects Using Spark

Oryx 2 - A lambda architecture built on Apache Spark and Apache Kafka with specialization for real-time, large-scale machine learning.
PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

Blogs

Spark Technology Center - A great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

License

This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.
