ericarcher / rfpermute

Estimate Permutation p-Values for Random Forest Importance Metrics

R 1.76% HTML 98.24%

rfpermute's Issues

plotImportance n argument error

From @shannonrankin

The plotImportance function accepts an argument ‘n’, which is meant to plot the first n values.

However, the plot actually shows the LAST n values rather than the FIRST n values.
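
The difference between the intended and the reported behaviour can be illustrated with a minimal sketch. The importance values below are hypothetical, and this is not plotImportance's actual code; it only shows that taking the tail of a decreasingly sorted vector returns the least important predictors instead of the most important ones.

```r
# Hypothetical importance scores (not taken from a real model).
imp <- c(a = 4.9, b = 3.5, c = 2.4, d = 1.0, e = -0.7)

# Sort so that the most important predictor comes first.
imp_sorted <- sort(imp, decreasing = TRUE)

n <- 2
first_n <- head(imp_sorted, n)  # the n most important predictors
last_n  <- tail(imp_sorted, n)  # the n least important predictors

print(names(first_n))
print(names(last_n))
```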

Error in isOpen(con): invalid connection

Running rfPermute with num.cores generates the following error:

Error in isOpen(con): invalid connection

This seems to be caused by the closeAllConnections() function call in rfPermute. For instance, if I do the following:

library(parallel)
library(caret)
library(rfPermute)

num.cores <- 8
cl <-  makeForkCluster(num.cores)
stopCluster(cl)
closeAllConnections()

I get the same Error in isOpen(con): invalid connection error.

Here's a full reproducible example:

library(parallel)
library(caret)
library(rfPermute)

set.seed(2969)
imbal_train <- twoClassSim(2000, intercept = -20, linearVars = 20)
table(imbal_train$Class)

rp = rfPermute(Class ~ ., data=imbal_train, ntree = 50, mtry = 4, 
                norm.votes = FALSE, nrep=10, num.cores=8)

I can't find anything useful on this error. Any ideas?

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRblas.so
LAPACK: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] rfPermute_2.1.5     randomForest_4.6-12 caret_6.0-79       
[4] ggplot2_2.2.1       lattice_0.20-34    

loaded via a namespace (and not attached):
 [1] magic_1.5-6          maps_3.3.0           swfscMisc_1.2       
 [4] ddalpha_1.3.2        tidyr_0.8.0          sfsmisc_1.1-1       
 [7] jsonlite_1.5         splines_3.4.1        foreach_1.4.4       
[10] prodlim_1.6.1        assertthat_0.2.0     stats4_3.4.1        
[13] DRR_0.0.3            robustbase_0.92-7    ipred_0.9-6         
[16] pillar_1.2.1         glue_1.2.0           uuid_0.1-2          
[19] digest_0.6.13        polyclip_1.6-1       colorspace_1.3-2    
[22] recipes_0.1.2        Matrix_1.2-12        plyr_1.8.4          
[25] psych_1.7.8          timeDate_3012.100    pkgconfig_2.0.1     
[28] CVST_0.2-1           broom_0.4.4          purrr_0.2.4         
[31] scales_0.5.0         tensor_1.5           gower_0.1.2         
[34] lava_1.6.1           spatstat.utils_1.8-0 tibble_1.4.2        
[37] mgcv_1.8-17          withr_2.1.1          repr_0.12.0         
[40] mapdata_2.3.0        nnet_7.3-12          lazyeval_0.2.1      
[43] mnormt_1.5-5         deldir_0.1-14        survival_2.40-1     
[46] magrittr_1.5         crayon_1.3.4         evaluate_0.10.1     
[49] nlme_3.1-131         MASS_7.3-48          dimRed_0.1.0        
[52] foreign_0.8-67       class_7.3-14         tools_3.4.1         
[55] stringr_1.2.0        kernlab_0.9-25       munsell_0.4.3       
[58] bindrcpp_0.2         compiler_3.4.1       RcppRoll_0.2.2      
[61] rlang_0.2.0          grid_3.4.1           pbdZMQ_0.3-1        
[64] iterators_1.0.9      IRkernel_0.8.11      goftest_1.1-1       
[67] geometry_0.3-6       gtable_0.2.0         ModelMetrics_1.1.0  
[70] codetools_0.2-15     abind_1.4-5          reshape2_1.4.3      
[73] R6_2.2.2             gridExtra_2.3        lubridate_1.7.4     
[76] dplyr_0.7.4          bindr_0.1.1          spatstat.data_1.2-0 
[79] stringi_1.1.6        spatstat_1.54-0      IRdisplay_0.4.4     
[82] Rcpp_0.12.14         rpart_4.1-13         DEoptimR_1.0-8      
[85] tidyselect_0.2.4

There is NA in %IncMSE.pval.

Hi,
Many of the values in %IncMSE.pval are NA. If I change the seed or ntree, the number of NAs increases or decreases.
%IncMSE    | %IncMSE.pval | IncNodePurity | IncNodePurity.pval
4.9089802  | 0.02970297   | 1262.8835     | 0.00990099
3.4689366  | 0.12871287   | 952.313       | 0.13861386
2.3781035  | NA           | 491.9594      | 0.6039604
2.3378941  | NA           | 953.8426      | 0.07920792
2.1870641  | NA           | 675.8061      | 0.20792079
1.7889514  | NA           | 479.9947      | 0.4950495
1.7102849  | 0.18811881   | 451.7091      | 0.65346535
1.6046331  | NA           | 656.7826      | 0.30693069
1.0245411  | NA           | 784.2426      | 0.17821782
0.463957   | NA           | 479.3047      | 0.51485149
-0.7036787 | 0.5049505    | 441.574       | 0.55445545
-1.3277221 | NA           | 431.3065      | 0.56435644
-1.9111734 | NA           | 413.4493      | 0.75247525
-2.2478055 | NA           | 210.2557      | 1
-2.9267568 | NA           | 241.034       | 0.96039604

Code:
set.seed(123)
otu_rfP <- rfPermute(SR ~ ., data = otu, ntree = 500,
                     na.action = na.omit, nrep = 999, num.cores = 1)

Remove unrelated dependencies

Including the convenience package swfscMisc, also written by @EricArcher, drags in an enormous amount of overhead. swfscMisc depends on the R package sf, which in turn depends on the geospatial libraries gdal and geos (both must be installed at the operating-system level). Neither is likely to be used in the context of rfPermute.

Building from source on an older Mac using Fink was rather complicated. The first surprise was a dependency on the R package s2 (via sf), which took some time to compile. After that, sf failed to compile because the configuration phase found the gdal and geos libraries missing at the OS level. These two libraries dragged in another 30+ dependencies, and the overall compilation time was three hours.

Add option to feed trained randomForest model into rfPermute()

In my opinion, it would be a great feature to be able to pass rfPermute() an already trained randomForest model, so that it uses the feature importance scores from that model instead of training a new forest. This would allow a more coherent analysis throughout the research application.

bug fix in clean.rf.data

#bug fix suggestion

#testing
install.packages("rfPermute")
library(rfPermute)
Data = data.frame(replicate(5, rnorm(100)))  #columns X1 to X5
Data[,1] = 7     #constant column "X1" should be removed
Data[2,2] = NA   #row 2 has a missing value and should be removed
head(Data)
out = clean.rf.data(1:4, 5, Data)  #predictors in columns 1-4, response in column 5
head(out) #column "X1" is still there


#fix
#overwrite function in global environment
clean.rf.data = function (x, y, data, max.levels = 30)
{
  x <- setdiff(x, y)
  sub.df <- data[, c(y, x)]
  sub.df <- sub.df[complete.cases(sub.df), , drop = TRUE]
  delete.pred <- character(0)
  for (pred in x) {
    pred.vec <- sub.df[[pred]]
    if (length(unique(pred.vec)) <= 1) #changed from == 0 so that constant columns are caught
      delete.pred <- c(delete.pred, pred)
    if (is.factor(pred.vec) & (nlevels(pred.vec) > max.levels))
      delete.pred <- c(delete.pred, pred)
  }
  delete.pred <- unique(delete.pred)
  if (length(delete.pred) > 0)
    x <- setdiff(x, delete.pred)
  if (is.factor(sub.df[[y]]) & nlevels(sub.df[[y]][, drop = TRUE]) <
      2)
    return(NULL)
  sub.df[, c(y, x)]
}

#checking
out2 = clean.rf.data(1:4, 5, Data)
head(out2) #column "X1" is now gone

argument "y" is missing, with no default

Hello,

I am having this issue with the y argument. Could you tell me what I am doing wrong? This code was taken from Stack Overflow, where it worked (I only changed the rfPermute function to match the correction made in the repository for the clusterExport issue). Maybe this is a simple error, but I have just started programming. Thanks

"Error in rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, :
argument "y" is missing, with no default"

Code:

library(datasets)
data(iris)
library(randomForest)

rows <- sample(rownames(iris), replace = TRUE, size = length(rownames(iris))*0.8)
train <- iris[rows,]
validation <- iris[-as.numeric(names(table(rows))),]

fit <- randomForest(Species ~ .,data=train, importance=TRUE, ntree=1000)
Prediction <- predict(fit, validation)
confmatrix <- table(validation[,"Species"], Prediction)
caret::confusionMatrix(confmatrix)

rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, nrep = 50)
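
The error message shows that a formula was passed straight to a .default method. As a general illustration of how S3 dispatch behaves in that situation, here is a minimal sketch with a toy generic. The names toy_fit, toy_fit.formula and toy_fit.default are hypothetical and are not part of rfPermute; the sketch assumes only that the default method expects x and y while the formula method expects a formula and data.

```r
# Toy S3 generic with a formula method and a default (x, y) method.
# All names here are hypothetical and only mimic the general pattern.
toy_fit <- function(x, ...) UseMethod("toy_fit")

toy_fit.formula <- function(x, data, ...) {
  # Build the response and the predictors from the formula, then delegate.
  mf <- model.frame(x, data = data)
  toy_fit.default(mf[, -1, drop = FALSE], mf[[1]], ...)
}

toy_fit.default <- function(x, y, ...) {
  n_obs <- length(y)  # fails here if y was never supplied
  sprintf("fitting %d predictors against %d responses", ncol(x), n_obs)
}

# Calling the generic dispatches on the formula and reaches the formula method.
ok <- toy_fit(Species ~ ., data = iris)
print(ok)

# Calling the default method directly skips dispatch, so y is never supplied.
msg <- tryCatch(
  toy_fit.default(Species ~ ., data = iris),
  error = function(e) conditionMessage(e)
)
print(msg)
```

In this toy setting, calling the generic (toy_fit) rather than the method (toy_fit.default) is what lets the formula interface work.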

How to understand the P-values from rfPermute？

Hi Eric,

I am a little confused about the p-values from the following command:
rp <- rfPermute(factor(am) ~ ., mtcars, nrep = 100, num.cores = 1)
How should the p-values from rfPermute be interpreted? Thanks.

David
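
As background for this question, the general idea of a permutation p-value can be sketched as follows. This is a generic illustration, not rfPermute's implementation: the statistic is the absolute correlation standing in for an importance score, perm_pval is a hypothetical helper, and the "+ 1" convention in the final ratio is one common choice that is assumed here.

```r
# Generic permutation p-value for a single predictor.
# The absolute correlation stands in for an importance score (an assumption).
perm_pval <- function(x, y, nrep = 100) {
  observed <- abs(cor(x, y))
  # Null distribution: recompute the statistic with the response shuffled,
  # which breaks any real association between x and y.
  null_stats <- replicate(nrep, abs(cor(x, sample(y))))
  # Proportion of null values at least as large as the observed value,
  # counting the observed value itself as one replicate.
  (sum(null_stats >= observed) + 1) / (nrep + 1)
}

set.seed(1)
p_wt <- perm_pval(mtcars$wt, mtcars$am)  # car weight against transmission type
print(p_wt)
```

A small p-value here means that the observed statistic is rarely matched when the response is shuffled, so the association is unlikely to be due to chance alone.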

error in rfPermute: unused argument (environment())

I am desperately trying to understand where this error comes from, but I have not managed to fix it. I think it has something to do with the function and the environment. Do you have any suggestions?

GR_rfP <- rfPermute(MSIR_2 ~ ., data = data.rf, ntree = 1000, na.action = na.omit, nrep = 50)

Error in parallel::clusterExport(cl, "x", "rf.call", environment()) :
unused argument (environment())
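
The message can be reproduced from the signature of parallel::clusterExport alone, without starting a cluster. The sketch below uses match.call() to show how the arguments in the failing call line up with the formals; it illustrates the argument matching only and is not necessarily the change made in the package.

```r
library(parallel)

# clusterExport() takes a cluster, ONE character vector of names, and an environment.
print(names(formals(clusterExport)))

# Passing the names as separate arguments leaves environment() unmatched.
bad <- tryCatch(
  match.call(clusterExport,
             quote(clusterExport(cl, "x", "rf.call", environment()))),
  error = function(e) conditionMessage(e)
)
print(bad)

# Grouping the names with c() lets all three formals match.
good <- match.call(
  clusterExport,
  quote(clusterExport(cl, c("x", "rf.call"), environment()))
)
print(good)
```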

Hypothesis Testing/CI on Variable Importance in random forests

Dear Eric,

I would like to compare variable importance between variables using random forests. To do so, I wanted to build CI around the estimated mean decrease accuracy and see if they would overlap between variables.

I was wondering if the approach below is valid: taking the SE estimated by rfPermute and constructing a CI using the formula ±1.96*SE

results <- rfPermute(Y ~ some Xs, data = data, na.action = na.omit)
importance <- as.matrix(results$importanceSD[,3])
importance1 <- as.matrix(importance(results, type = 1, scale= FALSE))
blabla <- importance1 - 1.96 * importance
blabla1 <- importance1 + 1.96 * importance
output <- cbind(importance1, importance, blabla, blabla1)

If I am willing to assume that the mean decrease in accuracy across OOB samples is normally distributed, is this approach valid?
Thanks in advance,
Best,
Pascal
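
The arithmetic of the interval described above can be written as a small helper. This is a sketch of the calculation only: normal_ci is a hypothetical function, the estimates and standard errors are made-up numbers, and the code does not settle whether the normality assumption holds for importance scores.

```r
# Normal-approximation interval: estimate +/- z * SE.
normal_ci <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
  cbind(estimate = estimate, se = se,
        lower = estimate - z * se,
        upper = estimate + z * se)
}

# Hypothetical importance estimates and standard errors for two predictors.
ci <- normal_ci(estimate = c(x1 = 4.0, x2 = 2.5), se = c(0.5, 0.4))
print(round(ci, 3))
```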

Using combine randomForest loses information

My workflow is essentially: build several smaller RFs on different nodes, then combine.

rf1 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf2 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf3 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf.all <- combine(rf1, rf2, rf3)

I would like to use rfPermute to calculate importance p.values. However, when I use the combine function I lose the subsequent values in null.dist for rf2 & 3, with no warning.

This is a fairly common workflow, so is adding a custom combine method feasible? Or would it be better to combine the values in null.dist manually beforehand?
