
rfpermute's People

Contributors

ericarcher

Forkers

katielong

rfpermute's Issues

plotImportance n argument error

From @shannonrankin

The plotImportance function accepts an argument ‘n’, which is meant to plot the first n values.

The plot is wrong: it actually shows the LAST n values rather than the FIRST n values.
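The difference between the two behaviours can be sketched in base R. This is only an illustration with made-up importance values (the vector `imp` is hypothetical) and does not reproduce the internals of plotImportance:

```r
# Toy importance scores; names and values are invented for illustration
imp <- c(a = 5.2, b = 3.1, c = 0.4, d = 1.7)

# Sort from most to least important
imp_sorted <- sort(imp, decreasing = TRUE)

# Expected behaviour: the first n entries, i.e. the n most important predictors
first_n <- head(imp_sorted, 2)

# Reported behaviour: the last n entries, i.e. the n least important predictors
last_n <- tail(imp_sorted, 2)
```

If the function sorts in ascending order before taking the first n rows, the same symptom would appear, but that is a guess about the cause.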

Error in isOpen(con): invalid connection

Running rfPermute with num.cores generates the following error:

Error in isOpen(con): invalid connection

This seems to be caused by the closeAllConnections() function call in rfPermute. For instance, if I do the following:

library(parallel)
library(caret)
library(rfPermute)

num.cores <- 8
cl <-  makeForkCluster(num.cores)
stopCluster(cl)
closeAllConnections()

I get the same Error in isOpen(con): invalid connection error.
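If closeAllConnections() is indeed the trigger, one alternative is to release only the cluster that was created rather than every open connection. The sketch below is not the package's code: `run_on_cluster` is a hypothetical helper, and it uses makePSOCKcluster instead of makeForkCluster so that it also runs on platforms without fork support.

```r
library(parallel)

# Hypothetical helper: create a cluster, use it, and shut down only that cluster
run_on_cluster <- function(n_workers = 2) {
  cl <- makePSOCKcluster(n_workers)
  # stopCluster() closes the connections owned by this cluster only;
  # on.exit() guarantees it runs even if the computation fails
  on.exit(stopCluster(cl), add = TRUE)
  unlist(parLapply(cl, 1:4, function(i) i^2))
}

squares <- run_on_cluster(2)
```

Because cleanup is tied to the cluster object, no call to closeAllConnections() is needed afterwards.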

Here's a full reproducible example:

library(parallel)
library(caret)
library(rfPermute)

set.seed(2969)
imbal_train <- twoClassSim(2000, intercept = -20, linearVars = 20)
table(imbal_train$Class)

rp = rfPermute(Class ~ ., data=imbal_train, ntree = 50, mtry = 4, 
                norm.votes = FALSE, nrep=10, num.cores=8)

I can't find anything useful on this error. Any ideas?

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRblas.so
LAPACK: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] rfPermute_2.1.5     randomForest_4.6-12 caret_6.0-79       
[4] ggplot2_2.2.1       lattice_0.20-34    

loaded via a namespace (and not attached):
 [1] magic_1.5-6          maps_3.3.0           swfscMisc_1.2       
 [4] ddalpha_1.3.2        tidyr_0.8.0          sfsmisc_1.1-1       
 [7] jsonlite_1.5         splines_3.4.1        foreach_1.4.4       
[10] prodlim_1.6.1        assertthat_0.2.0     stats4_3.4.1        
[13] DRR_0.0.3            robustbase_0.92-7    ipred_0.9-6         
[16] pillar_1.2.1         glue_1.2.0           uuid_0.1-2          
[19] digest_0.6.13        polyclip_1.6-1       colorspace_1.3-2    
[22] recipes_0.1.2        Matrix_1.2-12        plyr_1.8.4          
[25] psych_1.7.8          timeDate_3012.100    pkgconfig_2.0.1     
[28] CVST_0.2-1           broom_0.4.4          purrr_0.2.4         
[31] scales_0.5.0         tensor_1.5           gower_0.1.2         
[34] lava_1.6.1           spatstat.utils_1.8-0 tibble_1.4.2        
[37] mgcv_1.8-17          withr_2.1.1          repr_0.12.0         
[40] mapdata_2.3.0        nnet_7.3-12          lazyeval_0.2.1      
[43] mnormt_1.5-5         deldir_0.1-14        survival_2.40-1     
[46] magrittr_1.5         crayon_1.3.4         evaluate_0.10.1     
[49] nlme_3.1-131         MASS_7.3-48          dimRed_0.1.0        
[52] foreign_0.8-67       class_7.3-14         tools_3.4.1         
[55] stringr_1.2.0        kernlab_0.9-25       munsell_0.4.3       
[58] bindrcpp_0.2         compiler_3.4.1       RcppRoll_0.2.2      
[61] rlang_0.2.0          grid_3.4.1           pbdZMQ_0.3-1        
[64] iterators_1.0.9      IRkernel_0.8.11      goftest_1.1-1       
[67] geometry_0.3-6       gtable_0.2.0         ModelMetrics_1.1.0  
[70] codetools_0.2-15     abind_1.4-5          reshape2_1.4.3      
[73] R6_2.2.2             gridExtra_2.3        lubridate_1.7.4     
[76] dplyr_0.7.4          bindr_0.1.1          spatstat.data_1.2-0 
[79] stringi_1.1.6        spatstat_1.54-0      IRdisplay_0.4.4     
[82] Rcpp_0.12.14         rpart_4.1-13         DEoptimR_1.0-8      
[85] tidyselect_0.2.4   

There are NAs in %IncMSE.pval.

Hi,
There are many NA values in %IncMSE.pval. If I change the seed or ntree, the number of NAs increases or decreases.
%IncMSE | %IncMSE.pval | IncNodePurity | IncNodePurity.pval
4.9089802 | 0.02970297 | 1262.8835 | 0.00990099
3.4689366 | 0.12871287 | 952.313 | 0.13861386
2.3781035 | NA | 491.9594 | 0.6039604
2.3378941 | NA | 953.8426 | 0.07920792
2.1870641 | NA | 675.8061 | 0.20792079
1.7889514 | NA | 479.9947 | 0.4950495
1.7102849 | 0.18811881 | 451.7091 | 0.65346535
1.6046331 | NA | 656.7826 | 0.30693069
1.0245411 | NA | 784.2426 | 0.17821782
0.463957 | NA | 479.3047 | 0.51485149
-0.7036787 | 0.5049505 | 441.574 | 0.55445545
-1.3277221 | NA | 431.3065 | 0.56435644
-1.9111734 | NA | 413.4493 | 0.75247525
-2.2478055 | NA | 210.2557 | 1
-2.9267568 | NA | 241.034 | 0.96039604

Code:
set.seed(123)
otu_rfP <- rfPermute(SR ~ ., data = otu, ntree = 500,
                     na.action = na.omit, nrep = 999, num.cores = 1)

Remove unrelated dependencies

Including the convenience package swfscMisc, also written by @EricArcher, drags in a large amount of overhead. swfscMisc depends on the R package sf, which in turn depends on the geospatial libraries gdal and geos (both of which must be installed at the operating system level). Neither is likely to be used in the context of rfPermute.

Building from source on an older Mac using Fink was rather complicated. The first surprise was a dependency on the R package s2 (via sf), which took some time to compile. After that, sf failed to compile because its configuration step found the gdal and geos libraries missing at the OS level. These two libraries pulled in another 30+ dependencies, and the overall compilation time was three hours.

Add option to feed trained randomForest model into rfPermute()

It would be a great feature to be able to pass rfPermute() an already trained randomForest model, so that it uses the feature importance scores from that model instead of training a new forest. This would allow a more coherent analysis across a research application.

bug fix in clean.rf.data

#bug fix suggestion

#testing
install.packages("rfPermute")
library(rfPermute)
Data = data.frame(replicate(5,rnorm(100)))  #columns are named X1..X5
Data[,1] = 7     #constant column "X1" should be removed
Data[2,2] = NA   #row 2 has a missing value and should be removed
head(Data)
out = clean.rf.data(1:4, 5, Data)
head(out) #column "X1" is still there


#fix
#overwrite function in global environment
clean.rf.data = function (x, y, data, max.levels = 30)
{
  x <- setdiff(x, y)
  sub.df <- data[, c(y, x)]
  sub.df <- sub.df[complete.cases(sub.df), , drop = TRUE]
  delete.pred <- character(0)
  for (pred in x) {
    pred.vec <- sub.df[[pred]]
    if (length(unique(pred.vec)) <= 1) #changed from == 0 so that constant columns are caught
      delete.pred <- c(delete.pred, pred)
    if (is.factor(pred.vec) && (nlevels(pred.vec) > max.levels))
      delete.pred <- c(delete.pred, pred)
  }
  delete.pred <- unique(delete.pred)
  if (length(delete.pred) > 0)
    x <- setdiff(x, delete.pred)
  if (is.factor(sub.df[[y]]) && nlevels(sub.df[[y]][, drop = TRUE]) < 2)
    return(NULL)
  sub.df[, c(y, x)]
}

#checking
out2 = clean.rf.data(1:4, 5, Data)
head(out2) #column "X1" is now gone

argument "y" is missing, with no default

Hello,

I am having this issue with the y argument. Could you tell me what I am doing wrong? This code was taken from Stack Overflow, where it worked (I only changed the rfPermute function to match the correction in the repository for the clusterExport issue). Maybe this is a simple error, but I have just started programming. Thanks

"Error in rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, :
argument "y" is missing, with no default"

Code:

library(datasets)
data(iris)
library(randomForest)

rows <- sample(rownames(iris), replace = TRUE, size = length(rownames(iris))*0.8)
train <- iris[rows,]
validation <- iris[-as.numeric(names(table(rows))),]

fit <- randomForest(Species ~ .,data=train, importance=TRUE, ntree=1000)
Prediction <- predict(fit, validation)
confmatrix <- table(validation[,"Species"], Prediction)
caret::confusionMatrix(confmatrix)

rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, nrep = 50)
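A likely explanation is that the default method was called directly with a formula. The error message shows that rfPermute.default expects a y argument, so a formula in the first position leaves y unfilled. The sketch below reproduces the same pattern with a hypothetical generic, `fit_demo`, and assumes that the package offers a formula method alongside the default one, as randomForest does.

```r
# Hypothetical generic standing in for a formula/default method pair
fit_demo <- function(x, ...) UseMethod("fit_demo")

# Default method: expects predictors in x and the response in y
fit_demo.default <- function(x, y, ...) {
  paste("default method:", length(y), "responses")
}

# Formula method: builds x and y from the formula, then calls the default
fit_demo.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  fit_demo.default(x = mf[, -1, drop = FALSE], y = model.response(mf), ...)
}

# Calling the generic dispatches on the formula and works
via_generic <- fit_demo(Species ~ ., data = iris)

# Calling the default method directly skips dispatch, so y is never supplied
direct <- tryCatch(
  fit_demo.default(Species ~ ., data = iris),
  error = function(e) conditionMessage(e)
)
```

Under that assumption, calling rfPermute(Species ~ ., data = iris, ...) rather than rfPermute.default(...) would let R pick the formula method.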

How to understand the P-values from rfPermute?

Hi Eric,

I am a little confused about the p-values from the following command:
rp <- rfPermute(factor(am) ~ ., mtcars, nrep = 100, num.cores = 1)
How should the p-values from rfPermute be interpreted? Thanks.
image

David
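As general background, a permutation p-value compares an observed statistic with the values obtained after the response has been shuffled, which breaks any real association. The sketch below shows the idea in base R with a correlation statistic. The helper `perm_p_value` and the simulated data are hypothetical, and the exact formula used by rfPermute should be checked in the package documentation.

```r
# Hypothetical helper: share of null values at least as large as the observed one.
# Adding 1 to numerator and denominator counts the observed value itself,
# so the p-value can never be exactly zero.
perm_p_value <- function(observed, null_values) {
  (sum(null_values >= observed) + 1) / (length(null_values) + 1)
}

set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Observed statistic on the real data
observed <- abs(cor(x, y))

# Null distribution: the same statistic after permuting the response
null_values <- replicate(100, abs(cor(x, sample(y))))

p <- perm_p_value(observed, null_values)
```

In the random forest setting, the statistic would be a predictor's importance score and each permutation would require fitting a new forest to the shuffled response. A small p-value then means the observed importance is rarely matched when the response carries no signal.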

error in rfPermute: unused argument (environment())

I am desperately trying to understand where this error comes from, but I have not managed to fix it. I think it has something to do with the function and the environment. Do you have any suggestions?

GR_rfP <- rfPermute(MSIR_2 ~ ., data = data.rf, ntree = 1000, na.action = na.omit, nrep = 50)

Error in parallel::clusterExport(cl, "x", "rf.call", environment()) :
unused argument (environment())
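The message is consistent with the signature of parallel::clusterExport(cl, varlist, envir): the variable names must be passed as a single character vector, so "x" and "rf.call" given as separate arguments push environment() into a fourth position that does not exist. A minimal sketch of the corrected call pattern follows. The wrapper `export_demo` and the values inside it are hypothetical and are not the package's code.

```r
library(parallel)

# Hypothetical wrapper showing how to export local variables to workers
export_demo <- function() {
  x <- 1:5
  rf.call <- quote(sum(x))

  cl <- makeCluster(1)
  on.exit(stopCluster(cl), add = TRUE)

  # All names go in one character vector; envir says where to look them up
  clusterExport(cl, c("x", "rf.call"), envir = environment())

  # Evaluate the exported call on the worker
  clusterEvalQ(cl, eval(rf.call))[[1]]
}

result <- export_demo()
```

Passing envir = environment() matters here because x and rf.call live inside the function rather than in the global environment.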

Hypothesis Testing/CI on Variable Importance in random forests

Dear Eric,

I would like to compare variable importance between variables using random forests. To do so, I wanted to build CI around the estimated mean decrease accuracy and see if they would overlap between variables.

I was wondering whether the approach below is valid: taking the SE estimated by rfPermute and constructing a CI using the formula ±1.96*SE

results <- rfPermute(Y ~ some Xs, data = data, na.action = na.omit)
importance <- as.matrix(results$importanceSD[,3])
importance1 <- as.matrix(importance(results, type = 1, scale= FALSE))
blabla <- importance1 - 1.96 * importance
blabla1 <- importance1 + 1.96 * importance
output <- cbind(importance1, importance, blabla, blabla1)

If I am willing to assume that the mean decrease in accuracy across OOB samples is normally distributed, is this approach valid?
Thanks in advance,
Best,
Pascal

Using combine randomForest loses information

My workflow is essentially: build several smaller RFs on different nodes, then combine.

rf1 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf2 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf3 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf.all <- combine(rf1, rf2, rf3)

I would like to use rfPermute to calculate importance p-values. However, when I use the combine function, the null.dist values for rf2 and rf3 are lost, with no warning.

This is a fairly common workflow, so would adding a custom combine method be feasible? Or would it be better to combine the values in null.dist manually beforehand?
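The manual route could look roughly like the sketch below, which pools null importance values from several runs and recomputes a p-value for one predictor. Everything here is a stand-in: `null_runs` and `observed` are simulated, and the real layout of null.dist inside an rfPermute object is not assumed.

```r
set.seed(1)

# Hypothetical stand-ins: one vector of null importance values per run,
# all for the same predictor
null_runs <- list(rnorm(100), rnorm(100), rnorm(100))

# Hypothetical observed importance for that predictor
observed <- 2.5

# Pool the null values from every run into one distribution
pooled_null <- unlist(null_runs)

# Permutation p-value against the pooled null distribution
p_value <- (sum(pooled_null >= observed) + 1) / (length(pooled_null) + 1)
```

One caveat is that each null value comes from a 500-tree forest while the combined forest has 1500 trees, so the pooled null distribution and the combined importance may not be strictly comparable.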
