This project forked from rakutentech/spark-dirty-cat

Similarity encoding of dirty categorical variables (strings)

License: Apache License 2.0

Dirty Cat: Dealing with dirty categorical variables (strings).

DirtyCat (Scala) is a package that leverages Spark ML for large-scale machine learning and provides an alternative way to encode string variables. It is largely based on the original Python code: https://github.com/dirty-cat

Documentation

  • https://github.com/dirty-cat
  • Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning journal, Springer. 2018.

Getting started: How to use it

The DirtyCat project is built for Scala 2.11.x against Spark v2.3.0.

This package is provided as is, so you will have to build and install it yourself. The instructions below should help you get started.

Build it yourself: Installation

This project can be built with SBT 1.1.x.

Edit build.sbt to match your Scala and Spark installations. Then run the following on the command line:

sbt clean

sbt compile

sbt package

This will generate a .jar file at target/scala_VERSION/PACKAGE.jar, where PACKAGE = com.rakuten.dirty_cat_VERSION-0.1-SNAPSHOT
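The build.sbt adjustment mentioned above might look roughly like the fragment below. This is only a sketch, not the repository's actual build file: the Scala patch version (2.11.12), the list of Spark modules, and the "provided" scope are assumptions chosen to match the Scala 2.11.x / Spark 2.3.0 combination named in this README. Replace them with the versions installed on your cluster.

```scala
// build.sbt (sketch). All values here are assumptions; adapt them to your setup.

// Any 2.11.x release should match the Scala line this README targets.
scalaVersion := "2.11.12"

// Keep this in sync with the Spark version you run against.
val sparkVersion = "2.3.0"

// "provided" assumes Spark is already on the classpath at runtime
// (spark-submit, spark-shell, or a Toree kernel), so it is not bundled in the jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```

The `%%` operator appends the Scala binary version (for example `_2.11`) to each artifact name, so changing `scalaVersion` also changes which Spark artifacts are resolved.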

If you are using Jupyter notebooks (Scala) with Apache Toree, you can add this file to the __TOREE_SPARK_OPTS__ entry of your Jupyter kernel.

  • Find your available kernels by running:
jupyter kernelspec list 
  • Go to your Scala kernel's kernel.json and add:
"env": {
    "DEFAULT_INTERPRETER": "Scala",
    "__TOREE_SPARK_OPTS__": "--conf spark.driver.memory=2g --conf spark.executor.cores=4 --conf spark.executor.memory=1g --jars PATH/target/scala_VERSION/PACKAGE.jar"
}

To submit your spark application, run

spark-submit --master local[3]  --jars target/scala-2.11/dirty_cat_2.11-1.0.jar YOUR_APPLICATION

Create a local package

make publish 

Usage with Spark ML

Declaration

import com.rakuten.dirty_cat.feature.SimilarityEncoder

val encoder = (new SimilarityEncoder()
  .setInputCol("devices")
  .setOutputCol("devicesEncoded")
  .setSimilarityType("nGram")
  .setVocabSize(1000))
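To give an intuition for what the "nGram" similarity type computes, here is a minimal, Spark-free sketch of n-gram similarity encoding as described in the cited paper: each string is compared with every vocabulary entry through the Jaccard similarity of their character 3-gram sets. The object and method names are hypothetical and this is not the package's internal implementation, which may differ (for example in how n-grams are counted or how the vocabulary is selected).

```scala
// Hypothetical illustration only; not part of the spark-dirty-cat API.
object NGramSimilaritySketch {

  // Set of character n-grams of a string; strings shorter than n
  // are treated as a single n-gram (an assumption made for this sketch).
  def ngrams(s: String, n: Int = 3): Set[String] =
    if (s.length < n) Set(s) else s.sliding(n).toSet

  // Jaccard similarity between the n-gram sets of two strings, in [0, 1].
  def similarity(a: String, b: String, n: Int = 3): Double = {
    val ga = ngrams(a, n)
    val gb = ngrams(b, n)
    (ga intersect gb).size.toDouble / (ga union gb).size
  }

  // Encode a value as the vector of its similarities to each vocabulary entry.
  def encode(value: String, vocabulary: Seq[String], n: Int = 3): Seq[Double] =
    vocabulary.map(v => similarity(value, v, n))
}
```

For instance, `NGramSimilaritySketch.encode("iphone 6", Seq("iphone", "android"))` gives a high similarity for "iphone" and zero for "android". Near-duplicate or misspelled categories therefore land close together instead of becoming unrelated one-hot columns.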

Using it in a pipeline

import org.apache.spark.ml.Pipeline

val pipeline = (new Pipeline().setStages(Array(encoder, YOUR_ESTIMATOR)))
val pipelineModel = pipeline.fit(dataframe)

Serialization

pipelineModel.write.overwrite().save("pipeline.parquet") 

History

Andrés Hoyos-Idrobo started this implementation of DirtyCat as a way to improve his Spark/Scala skills.

Contributions from:

  • Andrés Hoyos-Idrobo

Corporate (Code) Contributors:

  • Rakuten Institute of Technology
