SparkMultiTool

Tools for Spark that we use on a daily basis. The library contains:

  • A loader for HDFS files that combines small files (uses Hadoop CombineFileInputFormat)
  • Future: cosine calculation
  • Future: quantile calculation

#Requirements

This library was successfully tested with CDH 5.1.2 and Spark 1.1.0. SBT must be installed to build it.

#Build

The build is based on CDH 5.1.2 and Spark 1.1.0. Edit build.sbt if you have a different environment.

To build, install sbt, open a terminal, change the current directory to sparkmultitool, and run:

sbt package
sbt test

Next copy spark-multitool*.jar from ./target/scala-2.10/... to the lib folder of your sbt project.
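
As a rough illustration, the build.sbt of the consuming project could look like the sketch below. The project name and version are taken from the tst_2.10-0.1.jar used in the usage section; the exact Scala patch version is an assumption, so adjust everything to match your cluster.

```scala
// build.sbt of the consuming project (a sketch, not the project's actual file)
name := "tst"

version := "0.1"

// Assumed 2.10.x patch release; any 2.10.x matching your Spark build should do
scalaVersion := "2.10.4"

// Spark is supplied by the cluster at runtime, so it is marked "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"

// No entry is needed for spark-multitool: sbt puts every jar found in ./lib
// on the classpath as an unmanaged dependency.
```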

#Usage

Pass spark-multitool*.jar to spark-submit through the --jars option, like this:

spark-submit --master local --executor-memory 2G --class "Tst" --num-executors 1 --executor-cores 1 --jars lib/spark-multitool_2.10-0.1.jar target/scala-2.10/tst_2.10-0.1.jar

See the examples folder.

##Loaders

ru.retailrocket.spark.multitool.Loaders combines input files before they reach the mappers by means of Hadoop CombineFileInputFormat. In our case, it reduced the number of mappers from 100000 to approximately 3000 and made the job significantly faster. Parameters:

  • path - path to the files (as in spark.textFile)
  • size - size of the target partition in megabytes. The optimal value equals the HDFS block size
  • delim - line delimiter string

This example loads the files matching "file:///test/*" and combines them before they reach the mappers.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)

    val path = "file:///test/*"

    // Simplest form: combine small files using the default settings.
    // Or: sc.combineTextFile(path, size = 256, delim = "\n"),
    // where size is the split size in megabytes and delim is the line-break string.
    val sessions = sc.combineTextFile(path)
    println(sessions.count())

    // The same via the new API.
    val sessionsViaBuilder = sc.forPath(path)
      .setSplitSize(256)    // optional
      .setRecordDelim("\n") // optional
      .combine()
    println(sessionsViaBuilder.count())

    // You can also get an RDD[(String, String)] of (file, line) pairs.
    val sessionsWithPath = sc.forPath(path)
      .combineWithPath()
    println(sessionsWithPath.count())

    // Or add a path filter, e.g. for partitioning.
    class FileNameEqualityFilter extends Filter {
      def check(rules: Traversable[Filter.Rule], path: Array[String]) = {
        rules.forall {
          case (k, Array(eq)) =>
            k match {
              case "file" => eq == path.last // accept only a matching file name
              case _ => false
            }
          case _ => false // rules without exactly one value never match
        }
      }
    }
    val filteredSessions = sc.forPath(path)
      .addFilter(classOf[FileNameEqualityFilter], Seq("file" -> Array("file.name")))
      .combine()
    println(filteredSessions.count())
  }
}
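
To get a feel for why the size parameter cuts the mapper count, the combining step can be approximated by greedily packing file sizes into splits. This is a simplified model written for illustration, not the library's code: Hadoop's CombineFileInputFormat also takes node and rack locality into account, which the sketch ignores. The names CombineSketch and pack are hypothetical.

```scala
object CombineSketch {
  /** Greedily packs file sizes (in bytes) into splits of at most maxSplitBytes.
    * In this simplified model a file larger than the limit gets its own split. */
  def pack(fileSizes: Seq[Long], maxSplitBytes: Long): Vector[Vector[Long]] =
    fileSizes.foldLeft(Vector.empty[Vector[Long]]) { (splits, size) =>
      splits.lastOption match {
        // The file still fits into the last split: append it there
        case Some(last) if last.sum + size <= maxSplitBytes =>
          splits.init :+ (last :+ size)
        // Otherwise start a new split
        case _ =>
          splits :+ Vector(size)
      }
    }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val files = Seq.fill(1000)(1 * mb) // 1000 hypothetical files of 1 MB each
    val splits = pack(files, 256 * mb)
    println(files.size + " files -> " + splits.size + " splits")
  }
}
```

Without combining, each of the 1000 files would typically become its own input split; with a 256 MB target they collapse into 4.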

##Algorithms

ru.retailrocket.spark.multitool.algs.cosine - cosine similarity function.
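
The signature of the library function is not shown here, so the following is only a minimal sketch of cosine similarity itself, assuming sparse vectors represented as Map[String, Double]. The names CosineSketch and cosine are hypothetical and do not describe the library's API.

```scala
object CosineSketch {
  /** Cosine similarity of two sparse vectors keyed by feature name.
    * Returns 0.0 when either vector has zero norm. */
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    // Only keys present in both vectors contribute to the dot product
    val dot = a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum
    val normA = math.sqrt(a.valuesIterator.map(v => v * v).sum)
    val normB = math.sqrt(b.valuesIterator.map(v => v * v).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val x = Map("a" -> 1.0, "b" -> 2.0)
    val y = Map("a" -> 2.0, "b" -> 4.0)
    val z = Map("c" -> 3.0)
    println(cosine(x, y)) // parallel vectors: close to 1.0
    println(cosine(x, z)) // no shared keys: 0.0
  }
}
```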

##Utility

ru.retailrocket.spark.multitool.HashFNV - a simple but useful hash function. The original idea comes from org.apache.pig.piggybank.evaluation.string.HashFNV
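
For readers unfamiliar with FNV hashing, here is a small standalone sketch of the 32-bit FNV-1a variant. Which FNV variant and which signature HashFNV actually uses is not stated above, so treat this as an illustration of the hash family only; FnvSketch and fnv1a32 are hypothetical names.

```scala
object FnvSketch {
  private val OffsetBasis = 0x811c9dc5L // standard 32-bit FNV offset basis
  private val Prime = 0x01000193L       // standard 32-bit FNV prime

  /** 32-bit FNV-1a over the UTF-8 bytes of s, returned as an unsigned value in a Long. */
  def fnv1a32(s: String): Long =
    s.getBytes("UTF-8").foldLeft(OffsetBasis) { (hash, b) =>
      // XOR in the byte first, then multiply, keeping only the low 32 bits
      ((hash ^ (b & 0xff)) * Prime) & 0xffffffffL
    }

  def main(args: Array[String]): Unit = {
    val h = fnv1a32("a")
    println(java.lang.Long.toHexString(h)) // prints e40c292c
  }
}
```

A typical use of such a hash is to map string keys to a fixed number of buckets, for example fnv1a32(key) % numBuckets.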

