dfuzz

Lifecycle: experimental

The goal of {dfuzz} is to help you clean up a messy column of character strings in your tibble or data.frame.

This package is highly experimental and is not yet ready for use in real applications.

It is built around two dependencies, which themselves have no dependencies:

  • {rlang}

  • {stringdist}; the full power of the stringdist() function from this excellent package remains available.
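As a rough illustration of what {stringdist} provides on its own, the sketch below computes pairwise edit distances between the fruit names used in the example further down. It does not describe how {dfuzz} calls stringdist() internally: the Levenshtein method (`method = "lv"`) and the cut-off of one edit are assumptions chosen only for this illustration.

```r
library(stringdist)

fruit <- c("banana", "blueberry", "limon", "pinapple",
           "aple", "apple", "ApplE", "bonana")

# Pairwise Levenshtein distances between all values
d <- stringdistmatrix(fruit, fruit, method = "lv")
dimnames(d) <- list(fruit, fruit)

# Keep each unordered pair once and list the pairs that are one edit apart
# (the threshold of 1 is an assumption made for this sketch)
close <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(a = fruit[close[, "row"]],
                    b = fruit[close[, "col"]],
                    stringsAsFactors = FALSE)
pairs

# Case differences count as edits: "apple" and "ApplE" are two substitutions apart
d["apple", "ApplE"]
```

With this cut-off, "ApplE" is not paired with "apple", which is one reason to normalise case before matching.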

{dfuzz} aims to be compatible with both tidyverse and base R dialects.

Installation

You can install this package using {remotes} (or {devtools}):

remotes::install_github("courtiol/dfuzz")

Example

library(dfuzz)

## a toy example:
test_df <- data.frame(fruit = c("banana", "blueberry", "limon", "pinapple",
                                "aple", "apple", "ApplE", "bonana"))
test_df
#>       fruit
#> 1    banana
#> 2 blueberry
#> 3     limon
#> 4  pinapple
#> 5      aple
#> 6     apple
#> 7     ApplE
#> 8    bonana

## fast and dirty workflow:
clean_df1 <- fuzzy_tidy(test_df, fruit)
clean_df1
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>          aple       aple
#> 6     apple        <NA>          aple       aple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## more subtle workflow:
template_fruit <- fuzzy_match(test_df, fruit)
template_fruit
#>   selected  syn_1  syn_2
#> 1     aple   aple  apple
#> 2   banana banana bonana
template_fruit$selected[1] <- "apple"
clean_df2 <- fuzzy_tidy(test_df, fruit, template_fruit)
clean_df2
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>         apple      apple
#> 6     apple        <NA>         apple      apple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## fast and dirty workflow with {tidyverse}:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#> ✓ tibble  3.0.4     ✓ dplyr   1.0.2
#> ✓ tidyr   1.1.2     ✓ stringr 1.4.0
#> ✓ readr   1.4.0     ✓ forcats 0.5.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
test_df %>%
  fuzzy_tidy(fruit) %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 banana   
#> 2 blueberry
#> 3 limon    
#> 4 pinapple 
#> 5 aple     
#> 6 aple     
#> 7 ApplE    
#> 8 banana

## more subtle workflow with {tidyverse}:
test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_match(fruit) -> template_fruit
template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <chr>    <chr>  <chr> 
#> 1 Aple     Aple   Apple 
#> 2 Banana   Banana Bonana

template_fruit %>%
  mutate(selected = fct_recode(selected, Apple = "Aple")) -> better_template_fruit

better_template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <fct>    <chr>  <chr> 
#> 1 Apple    Aple   Apple 
#> 2 Banana   Banana Bonana

test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_tidy(fruit, better_template_fruit) -> clean_df3
clean_df3
#> # A tibble: 8 x 4
#>   fruit     fruit.clean fruit.cleaned fruit.tidy
#>   <chr>     <chr>       <chr>         <chr>     
#> 1 Banana    <NA>        Banana        Banana    
#> 2 Blueberry Blueberry   <NA>          Blueberry 
#> 3 Limon     Limon       <NA>          Limon     
#> 4 Pinapple  Pinapple    <NA>          Pinapple  
#> 5 Aple      <NA>        Apple         Apple     
#> 6 Apple     <NA>        Apple         Apple     
#> 7 Apple     <NA>        Apple         Apple     
#> 8 Bonana    <NA>        Banana        Banana

clean_df3 %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 Banana   
#> 2 Blueberry
#> 3 Limon    
#> 4 Pinapple 
#> 5 Apple    
#> 6 Apple    
#> 7 Apple    
#> 8 Banana
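For base R users, the last step of the tidyverse workflows (overwriting the original column and dropping the helper columns) could be sketched as follows. To keep the sketch runnable without {dfuzz}, the data frame is copied by hand from the clean_df2 output shown above. The names `combined`, `helper_cols` and `result` are made up for this illustration. The relation between the three helper columns is only read off the printed output, not taken from the package documentation.

```r
# Columns copied by hand from the clean_df2 output above,
# so that the sketch runs without {dfuzz} installed
clean_df2 <- data.frame(
  fruit         = c("banana", "blueberry", "limon", "pinapple",
                    "aple", "apple", "ApplE", "bonana"),
  fruit.clean   = c(NA, "blueberry", "limon", "pinapple",
                    NA, NA, "ApplE", NA),
  fruit.cleaned = c("banana", NA, NA, NA,
                    "apple", "apple", NA, "banana"),
  fruit.tidy    = c("banana", "blueberry", "limon", "pinapple",
                    "apple", "apple", "ApplE", "banana"),
  stringsAsFactors = FALSE
)

# Observation from the printed output (an inference, not documented behaviour):
# fruit.tidy holds fruit.cleaned where it is set, and fruit.clean otherwise
combined <- ifelse(is.na(clean_df2$fruit.cleaned),
                   clean_df2$fruit.clean,
                   clean_df2$fruit.cleaned)
identical(combined, clean_df2$fruit.tidy)

# Base R counterpart of
#   mutate(fruit = fruit.tidy) %>% select(-contains("fruit."))
clean_df2$fruit <- clean_df2$fruit.tidy
helper_cols <- grepl("fruit.", names(clean_df2), fixed = TRUE)
result <- clean_df2[, !helper_cols, drop = FALSE]
result
```

Using `fixed = TRUE` keeps the dot literal, so the `fruit` column itself is not matched and only the three helper columns are dropped.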

Help & feedback wanted!

If you find that this package is an idea worth pursuing, please let me know. Developing is always more fun when it becomes a collaborative effort, so please email me (or open an issue) if you want to get involved!
