teebusch / noah

An R package for generating pseudonyms that are delightful and easy to remember. It creates adorable anonymous animals like the Likable Leech and the Proud Chickadee.

Home Page: https://teebusch.github.io/noah/

License: Other

R 100.00%
r package pseudonymisation rstats

noah

Lifecycle: maturing CRAN status R build status Codecov test coverage

noah (no animals were harmed) generates pseudonyms that are delightful and easy to remember. It creates adorable anonymous animals like the Likable Leech and the Proud Chickadee.

Installation

Install from CRAN with:

install.packages("noah")

Or install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("teebusch/noah")

Usage

Generate pseudonyms

Use pseudonymize() to generate a unique pseudonym for every unique element / row in a vector or data frame. pseudonymize() accepts multiple vectors and data frames as arguments, and will pseudonymize them row by row.

library(noah)

pseudonymize(1:9)
#> [1] "Impartial Rat"       "Superficial Bird"    "Royal Orca"         
#> [4] "Earsplitting Python" "Fascinated Donkey"   "Defeated Trout"     
#> [7] "Encouraging Stoat"   "Null Grouse"         "Axiomatic Octopus"

pseudonymize(
  c("🐰", "🐰", "🐰"),
  c("🥕", "🥕", "🐰")
)
#> [1] "Bloody Clam"     "Bloody Clam"     "Depressed Egret"

For extra delight, we can ask noah to generate only alliterations:

pseudonymize(1:9, .alliterate = TRUE)
#> [1] "Safe Sole"             "Callous Clownfish"     "Polite Panda"         
#> [4] "Best Badger"           "Like Leopard"          "Many Mole"            
#> [7] "Smiling Slug"          "Sweltering Silverfish" "Sick Sloth"
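
As a rough illustration of what an alliteration filter involves, the base R sketch below keeps only adjective/animal pairs that share a first letter. The word lists and the helper is_alliteration() are made up for this example. It is not noah's internal code, and noah's actual rule may differ.

```r
# Hypothetical word lists, for illustration only (not noah's built-in data)
adjectives <- c("Polite", "Safe", "Callous", "Giant")
animals    <- c("Panda", "Sole", "Clownfish", "Lobster")

# TRUE when both words start with the same letter, ignoring case
is_alliteration <- function(adjective, animal) {
  tolower(substr(adjective, 1, 1)) == tolower(substr(animal, 1, 1))
}

# Build all adjective/animal combinations, then keep only the alliterations
pairs <- expand.grid(adjective = adjectives, animal = animals,
                     stringsAsFactors = FALSE)
alliterations <- pairs[is_alliteration(pairs$adjective, pairs$animal), ]
alliterative_names <- paste(alliterations$adjective, alliterations$animal)
alliterative_names
#> [1] "Polite Panda"      "Safe Sole"         "Callous Clownfish"
```

Requiring alliteration shrinks the pool of available names considerably: here only 3 of the 16 possible combinations qualify.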

Add pseudonyms to a data frame

You can use pseudonymize() with dplyr::mutate() to add a column with pseudonyms to a data frame. In this example we use the diabetic retinopathy dataset from the package survival and add a new column with a pseudonym for each unique id. We also use dplyr::relocate() to move the pseudonyms to the first column:

library(dplyr)
diabetic <- as_tibble(survival::diabetic)

diabetic %>% 
  mutate(pseudonym = pseudonymize(id)) %>% 
  relocate(pseudonym)
#> # A tibble: 394 x 9
#>    pseudonym               id laser   age eye     trt  risk  time status
#>    <chr>                <int> <fct> <int> <fct> <int> <int> <dbl>  <int>
#>  1 Possessive Armadillo     5 argon    28 left      0     9  46.2      0
#>  2 Possessive Armadillo     5 argon    28 right     1     9  46.2      0
#>  3 Crowded Vole            14 xenon    12 left      1     8  42.5      0
#>  4 Crowded Vole            14 xenon    12 right     0     6  31.3      1
#>  5 Productive Heron        16 xenon     9 left      1    11  42.3      0
#>  6 Productive Heron        16 xenon     9 right     0    11  42.3      0
#>  7 Frequent Okapi          25 xenon     9 left      0    11  20.6      0
#>  8 Frequent Okapi          25 xenon     9 right     1    11  20.6      0
#>  9 Giant Lobster           29 xenon    13 left      0    10   0.3      1
#> 10 Giant Lobster           29 xenon    13 right     1     9  38.8      0
#> # ... with 384 more rows

For your convenience, noah also provides add_pseudonyms(), which wraps mutate() and relocate() and supports tidyselect syntax for selecting the key columns:

diabetic %>% 
  add_pseudonyms(id, where(is.factor))
#> # A tibble: 394 x 9
#>    pseudonym                id laser   age eye     trt  risk  time status
#>    <chr>                 <int> <fct> <int> <fct> <int> <int> <dbl>  <int>
#>  1 Doubtful Horse            5 argon    28 left      0     9  46.2      0
#>  2 Caring Heron              5 argon    28 right     1     9  46.2      0
#>  3 Grey Chicken             14 xenon    12 left      1     8  42.5      0
#>  4 Giddy Vole               14 xenon    12 right     0     6  31.3      1
#>  5 Overrated Caterpillar    16 xenon     9 left      1    11  42.3      0
#>  6 Angry Oribi              16 xenon     9 right     0    11  42.3      0
#>  7 Roasted Sawfish          25 xenon     9 left      0    11  20.6      0
#>  8 Spectacular Lion         25 xenon     9 right     1    11  20.6      0
#>  9 Panoramic Owl            29 xenon    13 left      0    10   0.3      1
#> 10 Orange Bear              29 xenon    13 right     1     9  38.8      0
#> # ... with 384 more rows

Keeping track of pseudonyms with an Ark

To make sure that all pseudonyms are unique and consistent, pseudonymize() and add_pseudonyms() use an object of class Ark (a pseudonym archive). By default, a new Ark is created for each function call, but you can also provide an Ark yourself. This allows you to keep track of the pseudonyms that have been used and make sure that the same keys always get assigned the same pseudonym:

ark <- Ark$new()

# split dataset into left and right eye and pseudonymize separately
diabetic_left <- diabetic %>% 
  filter(eye == "left") %>% 
  add_pseudonyms(id, .ark = ark)

diabetic_right <- diabetic %>% 
  filter(eye == "right") %>% 
  add_pseudonyms(id, .ark = ark)

# reunite the data sets again
bind_rows(diabetic_left, diabetic_right) %>% 
  arrange(id)
#> # A tibble: 394 x 9
#>    pseudonym          id laser   age eye     trt  risk  time status
#>    <chr>           <int> <fct> <int> <fct> <int> <int> <dbl>  <int>
#>  1 Faulty Swift        5 argon    28 left      0     9  46.2      0
#>  2 Faulty Swift        5 argon    28 right     1     9  46.2      0
#>  3 Tart Crab          14 xenon    12 left      1     8  42.5      0
#>  4 Tart Crab          14 xenon    12 right     0     6  31.3      1
#>  5 Sticky Barnacle    16 xenon     9 left      1    11  42.3      0
#>  6 Sticky Barnacle    16 xenon     9 right     0    11  42.3      0
#>  7 Brainy Moth        25 xenon     9 left      0    11  20.6      0
#>  8 Brainy Moth        25 xenon     9 right     1    11  20.6      0
#>  9 Poised Urial       29 xenon    13 left      0    10   0.3      1
#> 10 Poised Urial       29 xenon    13 right     1     9  38.8      0
#> # ... with 384 more rows

The ark now contains 197 pseudonyms, as many as there are unique ids in the dataset.

length(unique(diabetic$id))
#> [1] 197
length(ark)
#> [1] 197

Customizing an Ark

Building your own Ark allows you to customize the name parts that are used to create pseudonyms (by default, adjectives and animals). It also allows you to use names with more than two parts:

ark <- Ark$new(parts = list(
  c("Charles", "Louis", "Henry", "George"),
  c("I", "II", "III", "IV"),
  c("The Good", "The Wise", "The Brave", "The Mad", "The Beloved")
))

pseudonymize(1:8, .ark = ark)
#> [1] "Louis IV The Brave"   "George II The Good"   "Louis I The Good"    
#> [4] "Charles IV The Wise"  "Charles IV The Brave" "Louis II The Mad"    
#> [7] "Charles I The Brave"  "George I The Beloved"
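
The number of distinct pseudonyms such an Ark can hold is presumably the product of the lengths of its parts, assuming every combination is allowed. The base R sketch below checks that arithmetic for the parts used above. It does not call noah, and the variable names are only illustrative.

```r
parts <- list(
  c("Charles", "Louis", "Henry", "George"),
  c("I", "II", "III", "IV"),
  c("The Good", "The Wise", "The Brave", "The Mad", "The Beloved")
)

# One choice per part, so the number of combinations is the product
# of the part lengths: 4 * 4 * 5
capacity <- prod(lengths(parts))
capacity
#> [1] 80

# Enumerate every combination to double-check the count
all_names <- do.call(paste, expand.grid(parts, stringsAsFactors = FALSE))
length(unique(all_names))
#> [1] 80
```

Requesting more pseudonyms than this from a single Ark cannot succeed, so it is worth checking the capacity against the number of unique keys before choosing custom parts.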

You can also configure an Ark so that it generates only alliterations. Note that this behavior can still be overridden temporarily by using .alliterate = FALSE when you call pseudonymize().

ark <- Ark$new(alliterate = TRUE)

pseudonymize(1:12, .ark = ark)
#>  [1] "Hard-To-Find Hyena" "Well-Made Whippet"  "Momentous Mosquito"
#>  [4] "Mushy Macaw"        "Complete Clownfish" "Three Tahr"        
#>  [7] "Phobic Pheasant"    "Squealing Swallow"  "Subdued Swan"      
#> [10] "Mundane Marsupial"  "Complex Centipede"  "Cruel Crane"

Gotchas

Noah will treat numerically identical whole numbers of type double and integer as different and give them different pseudonyms. This can cause some unexpected behavior. Consider this example:

ark <- Ark$new()

pseudonymize(1:2, .ark = ark)  # creates a vector of integers c(1L, 2L)
pseudonymize(1, .ark = ark)    # creates a double

You might expect to get two pseudonyms, because the second call to pseudonymize() requests a pseudonym for the number 1, which is already in the Ark. Instead, you get three:

length(ark)
#> [1] 3

Noah will warn you when it thinks you are making this mistake, but it might not catch it all the time. A workaround is to coerce types explicitly, for example by using as.double(), as.integer(), or 1L to create integers.
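
The underlying distinction can be seen in base R, without noah. The sketch below shows that an integer and a double with the same value compare as equal but are not identical, and that coercing to one type removes the difference. The variable names are only illustrative.

```r
x_int <- 1:2   # the colon operator yields integers
x_dbl <- 1     # a bare number is a double

typeof(x_int)
#> [1] "integer"
typeof(x_dbl)
#> [1] "double"

# Equal in value, but not the same as far as identical() is concerned
x_int[1] == x_dbl
#> [1] TRUE
identical(x_int[1], x_dbl)
#> [1] FALSE

# Coercing both to one type before pseudonymizing removes the ambiguity
identical(x_int[1], as.integer(x_dbl))
#> [1] TRUE
identical(as.double(x_int[1]), x_dbl)
#> [1] TRUE
```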

Related R packages

There are multiple R packages that generate fake data, including fake names, phone numbers, addresses, credit card numbers, gene sequences and more:

If you need watertight anonymization you should check out these packages for anonymizing personal identifiable information in data sets:

noah's Issues

create hex image

It seems like every real package needs a hex sticker nowadays. How about a boat with some happy animals on it? Something friendly and silly. Abyss free font maybe?

Custom name parts

Instead of the default (adjective animal) there should be an option to supply your own name parts when the ark is created. This would also allow tests that run more quickly.

equivalent integers and floats don't get the same key

R creates floats by default when using single numbers, but integers when using the range notation 1:3.
This creates unexpected behaviour:

library(noah)
ark <- Ark$new(parts = list(
  foo = c("one", "two", "three"),
  bar = c("fi", "fa", "fu")
))
max_total <- 9
ark$pseudonymize(1:5)
#> [1] "Two Fi"   "Three Fi" "Two Fu"   "Three Fa" "One Fi"
ark$pseudonymize(5L)  # gets the existing pseudonym for 5
#> [1] "One Fi"
ark$pseudonymize(5)   # gets new pseudonym
#> [1] "Three Fu"
print(ark)
#> # An Ark: 6 / 9 pseudonyms used (1%)
#>   key         pseudonym
#>   <md5>       <Attribute Animal>
#> 1 14fa27a6... Three Fu
#> 2 216deaa6... Two Fi
#> 3 3297603a... One Fi
#> 4 8357d673... Three Fi
#> 5 9380ec58... Three Fa
#> 6 ee415bd5... Two Fu

allow customizing built-in data

Some of the adjectives or animals could be perceived as offensive or irritating when used in a more serious context. It would be nice to have a way to customize the word lists.

  • Perhaps a "family-friendly" word list should be provided?
  • Perhaps, the data should be made external (user-accessible), so that it can be modified (filtered) easily by the user?

add option to return only alliterations

Alliterations are more fun and may be easier to remember. The pseudonymize() function should have an option to return only pseudonyms with alliterations.

permute index using generator-like function

Currently, the pseudonym name parts are shuffled using index_shuffled, which is an integer vector containing a permutation of the index (from 1 to max_length). This permutation is stored with the Ark object.
This is feasible as long as max_length isn't too large (i.e., the number of name parts is small). However, as max_length increases, the memory use will increase. Many of the stored indices may never be used.
A more efficient way to store the permutation could be a generator-like function, that yields a new (unique) random value from the range 1:max_length whenever one is requested.
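
One way to sketch such a generator in base R is a lazy Fisher-Yates shuffle that records only the positions it has already swapped, so memory grows with the number of values drawn rather than with max_length. The function make_index_generator() and everything inside it are hypothetical names for this illustration, not part of noah.

```r
# Returns a function that yields each value in 1:n exactly once, in random
# order, storing only the positions that have been swapped so far.
make_index_generator <- function(n) {
  n <- as.integer(n)
  swapped <- new.env(hash = TRUE)  # sparse record of displaced values
  drawn <- 0L                      # how many values have been handed out

  # Value currently sitting at position `pos` of the virtual vector 1:n
  value_at <- function(pos) {
    key <- sprintf("%d", pos)
    if (exists(key, envir = swapped, inherits = FALSE)) {
      get(key, envir = swapped, inherits = FALSE)
    } else {
      pos
    }
  }

  function() {
    if (drawn >= n) stop("all ", n, " indices have been used")
    drawn <<- drawn + 1L
    # Pick a random position among those not yet handed out
    pick <- drawn + sample.int(n - drawn + 1L, 1L) - 1L
    result <- value_at(pick)
    # Move the value at the front of the remaining range into the gap
    assign(sprintf("%d", pick), value_at(drawn), envir = swapped)
    result
  }
}

next_index <- make_index_generator(10)
drawn_values <- vapply(1:10, function(i) next_index(), integer(1))
sort(drawn_values)
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

Each call costs one random draw and at most one stored entry, so an Ark that hands out only a few pseudonyms from a very large pool would keep only a few entries. The trade-off is that each lookup goes through an environment instead of a plain vector.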

Get only alliterations from existing "non-alliterating" ark

Currently, the only way to make alliterations is to create an Ark with Ark$new(alliterate = TRUE). Then all pseudonyms from that Ark will be alliterations. It would be better to be able to temporarily request only alliterations from an existing Ark.

Add pkgdown

A pkgdown site bundles the documentation in an easily accessible format and is generated automatically from the package. It would be good to have one.

Encode index more efficiently

Currently the shuffled index is stored as a 1:n numeric vector, which wastes a lot of space. One could use a combination of Fisher-Yates random sampling and run-length encoding to save memory.

customizable random seed for Ark?

If the user wants the ark to be reproducible without storing it somewhere, it should be enough to use a random seed.
It's possible for the user to do this themselves, but adding it as a function argument .random_seed = NULL might still be more convenient.
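
The idea can be sketched in base R: if the shuffle is produced by R's random number generator, fixing the seed beforehand reproduces the same permutation. The helper shuffled_index() below is hypothetical and only illustrates the principle. The .random_seed argument is a proposal in this issue, not an existing noah parameter.

```r
# Hypothetical helper: returns a permutation of 1:n that depends only on
# `seed` when one is given. Note that set.seed() changes the global RNG state.
shuffled_index <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  sample.int(n)
}

a <- shuffled_index(10, seed = 42)
b <- shuffled_index(10, seed = 42)
identical(a, b)
#> [1] TRUE
```

Because set.seed() affects every later random draw in the session, a built-in argument could restore the previous RNG state after shuffling, which is harder to get right when users set the seed themselves.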

Add alliteration info to print function

  • The print function should show the settings of the Ark (currently, whether it alliterates by default)
  • The print function should make clear how many alliterations are left in the ark, and how many pseudonyms are in it in total.
