themis

Lifecycle: experimental CRAN status Travis build status Codecov test coverage

themis contains extra steps for the recipes package for dealing with unbalanced data. The name themis is that of the ancient Greek goddess who is typically depicted with a balance.

Installation

You can install the released version of themis from CRAN with:

install.packages("themis")

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("emilhvitfeldt/themis")

Example

The following is an example of using the SMOTE algorithm to deal with unbalanced data:

library(recipes)
library(themis)

data(okc)

sort(table(okc$Class, useNA = "always"))
#> 
#>  <NA>  stem other 
#>     0  9539 50316

ds_rec <- recipe(Class ~ age + height, data = okc) %>%
  step_meanimpute(all_predictors()) %>%
  step_smote(Class) %>%
  prep()

table(juice(ds_rec)$Class, useNA = "always")
#> 
#>  stem other  <NA> 
#> 50316 50316     0
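
To make the idea behind SMOTE more concrete, here is a minimal base R sketch of how a single synthetic sample can be generated: pick a minority-class point, pick one of its nearest minority-class neighbours, and interpolate between the two. The helper `smote_one()` and the toy `minority` matrix are hypothetical and only illustrate the general technique; this is not the code that step_smote() runs.

```r
# Hypothetical helper (not part of themis): build one synthetic point
# from a numeric matrix whose rows are minority-class observations.
smote_one <- function(minority, k = 5) {
  # Pick a random minority observation as the starting point.
  i <- sample(nrow(minority), 1)
  # Euclidean distance from row i to every minority row.
  d <- sqrt(rowSums(sweep(minority, 2, minority[i, ])^2))
  d[i] <- Inf  # never pick the point itself as its own neighbour
  # Indices of the (at most) k nearest minority neighbours.
  neighbours <- order(d)[seq_len(min(k, nrow(minority) - 1))]
  j <- neighbours[sample(length(neighbours), 1)]
  # Interpolate a random fraction of the way from point i to point j.
  gap <- runif(1)
  minority[i, ] + gap * (minority[j, ] - minority[i, ])
}

set.seed(1)
# Toy minority-class data, made up for illustration.
minority <- cbind(age = c(25, 27, 30, 41), height = c(65, 70, 68, 72))
synthetic <- smote_one(minority, k = 2)
synthetic
```

Because the new point lies on the line segment between two existing minority observations, each of its coordinates stays within the range already observed for that class.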

Methods

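Below is some unbalanced data, used for the examples later on.

# factor() keeps class a factor on R >= 4.0, where stringsAsFactors defaults to FALSE
example_data <- data.frame(class = factor(letters[rep(1:5, 1:5 * 10)]),
                           x = rnorm(150))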

library(ggplot2)

example_data %>%
  ggplot(aes(class)) +
  geom_bar()

Upsample / Over-sampling

The following methods all share the tuning parameter over_ratio, which is the ratio of the minority-to-majority frequencies after over-sampling.

| name | function | Multi-class |
|------|----------|-------------|
| Random minority over-sampling with replacement | step_upsample() | ✔️ |
| Synthetic Minority Over-sampling Technique | step_smote() | ✔️ |
| Borderline SMOTE-1 | step_bsmote(method = 1) | ✔️ |
| Borderline SMOTE-2 | step_bsmote(method = 2) | ✔️ |
| Adaptive synthetic sampling approach for imbalanced learning | step_adasyn() | ✔️ |
| Generation of synthetic data by Randomly Over Sampling Examples | step_rose() | |

Setting over_ratio = 1 brings the number of samples in every minority class up to 100% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 1) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()

Setting over_ratio = 0.5 upsamples any minority class with fewer samples than 50% of the majority class up to 50% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
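
The effect of over_ratio on class sizes can be worked out by hand. The sketch below uses base R only and the class counts of example_data as constructed above (10, 20, 30, 40 and 50). The helper `upsampled_counts()` is hypothetical, and the assumption that the target size is `floor(majority * over_ratio)` is an illustration rather than a statement about the internals of step_upsample().

```r
# Class counts of example_data: a = 10, b = 20, c = 30, d = 40, e = 50.
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper (not part of themis): expected class sizes after
# over-sampling, assuming the target is floor(majority * over_ratio).
upsampled_counts <- function(counts, over_ratio = 1) {
  target <- floor(max(counts) * over_ratio)
  # Classes already at or above the target are left untouched.
  pmax(counts, target)
}

up_full <- upsampled_counts(counts, over_ratio = 1)    # every class raised to 50
up_half <- upsampled_counts(counts, over_ratio = 0.5)  # a and b raised to 25
up_full
up_half
```

With over_ratio = 0.5 only classes a and b fall below the target of 25, so classes c, d and e keep their original sizes.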

Downsample / Under-sampling

Most of the following methods share the tuning parameter under_ratio, which is the ratio of the majority-to-minority frequencies after under-sampling.

| name | function | Multi-class | under_ratio |
|------|----------|-------------|-------------|
| Random majority under-sampling with replacement | step_downsample() | ✔️ | ✔️ |
| NearMiss-1 | step_nearmiss() | ✔️ | ✔️ |
| Extraction of majority-minority Tomek links | step_tomek() | | |

Setting under_ratio = 1 brings the number of samples in every majority class down to 100% of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 1) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()

Setting under_ratio = 2 downsamples any majority class with more than 200% of the samples of the minority class down to 200% of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 2) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
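
The same arithmetic applies to under_ratio, with the smallest class as the reference. The sketch below uses base R only and the class counts of example_data. The helper `downsampled_counts()` is hypothetical, and the assumption that the target size is `floor(minority * under_ratio)` is an illustration rather than a statement about the internals of step_downsample().

```r
# Class counts of example_data: a = 10, b = 20, c = 30, d = 40, e = 50.
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper (not part of themis): expected class sizes after
# under-sampling, assuming the target is floor(minority * under_ratio).
downsampled_counts <- function(counts, under_ratio = 1) {
  target <- floor(min(counts) * under_ratio)
  # Classes already at or below the target are left untouched.
  pmin(counts, target)
}

down_full   <- downsampled_counts(counts, under_ratio = 1)  # every class cut to 10
down_double <- downsampled_counts(counts, under_ratio = 2)  # c, d and e cut to 20
down_full
down_double
```

With under_ratio = 2 the target is 20, so class a stays at 10, class b is already at the target, and classes c, d and e are reduced to 20.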

Code of Conduct

Please note that the ‘themis’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
