ericarcher / rfpermute
Estimate Permutation p-Values for Random Forest Importance Metrics
From @shannonrankin
The plotImportance function accepts an argument ‘n’ that is meant to plot the first n values.
However, the plot actually shows the LAST n values rather than the FIRST n values.
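The difference between the two behaviours can be shown in base R. This is only a sketch with a made-up importance vector (the names `imp`, `n`, `top_n` and `last_n` are hypothetical); it is not the code of plotImportance.

```r
# Hypothetical importance scores; names and values are made up for illustration.
imp <- c(a = 0.9, b = 0.1, c = 0.5, d = 0.7, e = 0.3)
n <- 2

# Sort from most to least important.
imp_sorted <- sort(imp, decreasing = TRUE)

# Intended behaviour: the first n entries are the n most important predictors.
top_n <- head(imp_sorted, n)

# Reported behaviour: the last n entries, i.e. the n least important predictors.
last_n <- tail(imp_sorted, n)

print(names(top_n))
print(names(last_n))
```

If the plotting code reverses the order for display (for example, so that the largest bar sits at the top of a horizontal bar chart), taking the subset before the reversal rather than after it would be one way to keep the first n values.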
Running rfPermute with num.cores generates the following error:
Error in isOpen(con): invalid connection
This seems to be caused by the closeAllConnections() function call in rfPermute. For instance, if I do the following:
library(parallel)
library(caret)
library(rfPermute)
num.cores <- 8
cl <- makeForkCluster(num.cores)
stopCluster(cl)
closeAllConnections()
I get the same Error in isOpen(con): invalid connection error.
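Assuming the diagnosis above is right, one way to avoid the problem is to clean up only the cluster that was created, instead of closing every open connection. The following is a minimal sketch using the parallel package; `run_on_cluster` is a hypothetical helper and not part of rfPermute.

```r
library(parallel)

# Hypothetical helper: run a function on a small cluster and clean up
# only the cluster created here.
run_on_cluster <- function(xs, fun, num.cores = 2) {
  cl <- makeCluster(num.cores)
  # stopCluster() closes the worker connections that belong to 'cl',
  # so no call to closeAllConnections() is made afterwards.
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, xs, fun)
}

res <- run_on_cluster(1:4, function(i) i^2)
print(unlist(res))
```

Registering the cleanup with on.exit() also means the workers are stopped if the computation fails partway through.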
Here's a full reproducible example:
library(parallel)
library(caret)
library(rfPermute)
set.seed(2969)
imbal_train <- twoClassSim(2000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
rp = rfPermute(Class ~ ., data=imbal_train, ntree = 50, mtry = 4,
norm.votes = FALSE, nrep=10, num.cores=8)
I can't find anything useful on this error. Any ideas?
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRblas.so
LAPACK: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_ML/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rfPermute_2.1.5 randomForest_4.6-12 caret_6.0-79
[4] ggplot2_2.2.1 lattice_0.20-34
loaded via a namespace (and not attached):
[1] magic_1.5-6 maps_3.3.0 swfscMisc_1.2
[4] ddalpha_1.3.2 tidyr_0.8.0 sfsmisc_1.1-1
[7] jsonlite_1.5 splines_3.4.1 foreach_1.4.4
[10] prodlim_1.6.1 assertthat_0.2.0 stats4_3.4.1
[13] DRR_0.0.3 robustbase_0.92-7 ipred_0.9-6
[16] pillar_1.2.1 glue_1.2.0 uuid_0.1-2
[19] digest_0.6.13 polyclip_1.6-1 colorspace_1.3-2
[22] recipes_0.1.2 Matrix_1.2-12 plyr_1.8.4
[25] psych_1.7.8 timeDate_3012.100 pkgconfig_2.0.1
[28] CVST_0.2-1 broom_0.4.4 purrr_0.2.4
[31] scales_0.5.0 tensor_1.5 gower_0.1.2
[34] lava_1.6.1 spatstat.utils_1.8-0 tibble_1.4.2
[37] mgcv_1.8-17 withr_2.1.1 repr_0.12.0
[40] mapdata_2.3.0 nnet_7.3-12 lazyeval_0.2.1
[43] mnormt_1.5-5 deldir_0.1-14 survival_2.40-1
[46] magrittr_1.5 crayon_1.3.4 evaluate_0.10.1
[49] nlme_3.1-131 MASS_7.3-48 dimRed_0.1.0
[52] foreign_0.8-67 class_7.3-14 tools_3.4.1
[55] stringr_1.2.0 kernlab_0.9-25 munsell_0.4.3
[58] bindrcpp_0.2 compiler_3.4.1 RcppRoll_0.2.2
[61] rlang_0.2.0 grid_3.4.1 pbdZMQ_0.3-1
[64] iterators_1.0.9 IRkernel_0.8.11 goftest_1.1-1
[67] geometry_0.3-6 gtable_0.2.0 ModelMetrics_1.1.0
[70] codetools_0.2-15 abind_1.4-5 reshape2_1.4.3
[73] R6_2.2.2 gridExtra_2.3 lubridate_1.7.4
[76] dplyr_0.7.4 bindr_0.1.1 spatstat.data_1.2-0
[79] stringi_1.1.6 spatstat_1.54-0 IRdisplay_0.4.4
[82] Rcpp_0.12.14 rpart_4.1-13 DEoptimR_1.0-8
[85] tidyselect_0.2.4
Hi,
There are many NA values in %IncMSE.pval. If I change the seed or ntree, the number of NAs increases or decreases.
%IncMSE | %IncMSE.pval | IncNodePurity | IncNodePurity.pval
4.9089802 | 0.02970297 | 1262.8835 | 0.00990099
3.4689366 | 0.12871287 | 952.313 | 0.13861386
2.3781035 | NA | 491.9594 | 0.6039604
2.3378941 | NA | 953.8426 | 0.07920792
2.1870641 | NA | 675.8061 | 0.20792079
1.7889514 | NA | 479.9947 | 0.4950495
1.7102849 | 0.18811881 | 451.7091 | 0.65346535
1.6046331 | NA | 656.7826 | 0.30693069
1.0245411 | NA | 784.2426 | 0.17821782
0.463957 | NA | 479.3047 | 0.51485149
-0.7036787 | 0.5049505 | 441.574 | 0.55445545
-1.3277221 | NA | 431.3065 | 0.56435644
-1.9111734 | NA | 413.4493 | 0.75247525
-2.2478055 | NA | 210.2557 | 1
-2.9267568 | NA | 241.034 | 0.96039604
Code:
set.seed(123)
otu_rfP<- rfPermute(SR ~ ., data = otu, ntree=500,
na.action = na.omit,nrep = 999,num.cores = 1)
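For orientation, a permutation p-value is commonly computed as the share of null replicates at least as large as the observed value, with 1 added to the numerator and denominator. The non-missing p-values in the table are multiples of 1/101 (for example 0.00990099 = 1/101 and 0.02970297 = 3/101), which fits a formula of this shape with 100 replicates. The sketch below uses a hypothetical helper `perm_pval` and made-up numbers. It is not rfPermute's internal code, and whether the NAs above arise this way is an assumption, but it shows how a single missing value in a null distribution can turn the whole p-value into NA.

```r
# Hypothetical helper; not rfPermute's internal code.
perm_pval <- function(obs, null, na.rm = FALSE) {
  # Optionally drop missing null replicates before counting.
  if (na.rm) null <- null[!is.na(null)]
  # Count null replicates at least as large as the observed value,
  # then add 1 to numerator and denominator so p is never exactly 0.
  (sum(null >= obs) + 1) / (length(null) + 1)
}

null_ok <- c(0.2, 1.5, -0.3, 0.8, 2.6)  # made-up null importances
null_na <- c(0.2, NA, -0.3, 0.8, 2.6)   # same, with one missing replicate

p_ok <- perm_pval(2.4, null_ok)                # one replicate >= 2.4
p_na <- perm_pval(2.4, null_na)                # NA propagates through sum()
p_rm <- perm_pval(2.4, null_na, na.rm = TRUE)  # missing replicate dropped

print(c(p_ok = p_ok, p_na = p_na, p_rm = p_rm))
```

Checking the null distribution of the affected predictors for NA or NaN entries would be a reasonable first diagnostic.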
Inclusion of the convenience package swfscMisc, also written by @EricArcher, drags in an incredible amount of overhead. This package depends on the R package sf, which in turn depends on the geospatial libraries gdal and geos (to be installed at the operating system (OS) level). Both are most likely not used in the context of rfPermute.
Building from source on an older Mac using Fink was rather complicated. The first surprise was a dependency on the R package s2 (via package sf), which took some time to compile. After that, package sf did not compile successfully because, during the configuration phase, the gdal and geos libraries were found to be missing at the OS level. These two libraries dragged in another 30+ dependencies, and the overall compilation time was three (3) hours.
In my opinion, it would be a great feature to have the opportunity to feed rfPermute() an already trained randomForest model so that it uses the feature importance scores from that model instead of training a new forest. This allows for a more coherent analysis throughout the research application.
#bug fix suggestion
#testing
install.packages("rfPermute")
library(rfPermute)
Data = data.frame(replicate(5,rnorm(100)))
Data[,1] = 7 #constant column "X1" should be removed
Data[2,2] = NA #missing value in row 2 should be removed
head(Data)
out = clean.rf.data(1:4, 5, Data)
head(out) #column "X1" is still there
#fix
#overwrite function in global environment
clean.rf.data = function (x, y, data, max.levels = 30)
{
x <- setdiff(x, y)
sub.df <- data[, c(y, x)]
sub.df <- sub.df[complete.cases(sub.df), , drop = TRUE]
delete.pred <- character(0)
for (pred in x) {
pred.vec <- sub.df[[pred]]
if (length(unique(pred.vec)) <= 1) #changed from "== 0" so that constant columns are removed
delete.pred <- c(delete.pred, pred)
if (is.factor(pred.vec) & (nlevels(pred.vec) > max.levels))
delete.pred <- c(delete.pred, pred)
}
delete.pred <- unique(delete.pred)
if (length(delete.pred) > 0)
x <- setdiff(x, delete.pred)
if (is.factor(sub.df[[y]]) & nlevels(sub.df[[y]][, drop = TRUE]) <
2)
return(NULL)
sub.df[, c(y, x)]
}
#checking
out2 = clean.rf.data(1:4, 5, Data)
head(out2) #column "X1" is now gone
Hello,
I am having an issue with the y argument. Would you be able to tell me what I am doing wrong? This code was taken from Stack Overflow and it worked (I only changed the rfPermute function, as it was corrected in the repository, because of the clusterExport issue). Maybe this is a simple error, but I have only just started programming. Thanks
"Error in rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, :
argument "y" is missing, with no default"
Code:
library(datasets)
data(iris)
library(randomForest)
rows <- sample(rownames(iris), replace = TRUE, size = length(rownames(iris))*0.8)
train <- iris[rows,]
validation <- iris[-as.numeric(names(table(rows))),]
fit <- randomForest(Species ~ .,data=train, importance=TRUE, ntree=1000)
Prediction <- predict(fit, validation)
confmatrix <- table(validation[,"Species"], Prediction)
caret::confusionMatrix(confmatrix)
rfPermute.default(Species ~ ., data = iris, ntree = 100, na.action = na.omit, nrep = 50)
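The error message is consistent with how S3 dispatch works in R. The sketch below uses a toy generic (`fit`, `fit.default` and `fit.formula` are hypothetical names) and assumes rfPermute follows the usual layout of a default method taking x and y plus a formula method, as randomForest does. Calling the default method directly with a formula binds the formula to x and leaves y unsupplied.

```r
# Toy generic with the same shape of methods (hypothetical names).
fit <- function(x, ...) UseMethod("fit")

# Default method expects a predictor table 'x' and a response 'y'.
fit.default <- function(x, y, ...) {
  paste("default method:", ncol(x), "predictors,", length(y), "responses")
}

# Formula method builds x and y from the formula, then calls the default.
fit.formula <- function(x, data, ...) {
  mf <- model.frame(x, data = data)
  fit.default(mf[, -1, drop = FALSE], mf[[1]], ...)
}

# Calling the generic dispatches on the class of the first argument.
ok <- fit(Species ~ ., data = iris)

# Calling the default method directly binds the formula to 'x' and leaves
# 'y' unsupplied, which reproduces the same kind of error.
err <- tryCatch(fit.default(Species ~ ., data = iris),
                error = function(e) conditionMessage(e))

print(ok)
print(err)
```

Under that assumption, calling the generic, rfPermute(Species ~ ., data = iris, ...), rather than rfPermute.default(...) would let R pick the formula method.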
I am desperately trying to understand where this error comes from, but I can't manage to fix it. I think it has something to do with the function and the environment. I don't know what to do. Do you have any suggestions?
GR_rfP <- rfPermute(MSIR_2 ~ ., data = data.rf, ntree = 1000, na.action = na.omit, nrep = 50)
Error in parallel::clusterExport(cl, "x", "rf.call", environment()) :
unused argument (environment())
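The message matches the signature of parallel::clusterExport(cl, varlist, envir): the variable names have to be passed as one character vector, so a call with "x" and "rf.call" as separate arguments leaves environment() as a fourth, unused argument. Below is a minimal sketch of the expected call form; `export_and_check` and the stand-in objects are made up for illustration and are not rfPermute's code.

```r
library(parallel)

# Hypothetical helper showing the expected clusterExport() call form.
export_and_check <- function() {
  x <- 1:3
  rf.call <- quote(mean(x))  # stand-in objects, made up for this sketch
  cl <- makeCluster(2)
  on.exit(stopCluster(cl), add = TRUE)
  # Names go in one character vector; the environment is passed as 'envir'.
  clusterExport(cl, c("x", "rf.call"), envir = environment())
  # Each worker reports on the objects it received.
  clusterEvalQ(cl, list(sum_x = sum(x), has_call = is.call(rf.call)))
}

res <- export_and_check()
print(res[[1]])
```

Passing envir = environment() matters here because the objects live inside the calling function rather than in the global environment.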
Dear Eric,
I would like to compare variable importance between variables using random forests. To do so, I wanted to build CIs around the estimated mean decrease in accuracy and see whether they overlap between variables.
I was wondering if the approach below is valid: taking the SE estimated by rfPermute and constructing a CI using the formula ±1.96*SE
results <- rfPermute(Y ~ some Xs, data = data, na.action = na.omit)
importance <- as.matrix(results$importanceSD[,3])
importance1 <- as.matrix(importance(results, type = 1, scale= FALSE))
blabla <- importance1 - 1.96 * importance
blabla1 <- importance1 + 1.96 * importance
output <- cbind(importance1, importance, blabla, blabla1)
If I am willing to assume that the mean decrease in accuracy among OOB samples is normally distributed, is this approach valid?
Thanks in advance,
Best,
Pascal
My workflow is essentially: build several smaller RFs on different nodes, then combine.
rf1 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf2 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf3 <- rfPermute(Ozone ~ ., data = airquality, ntree = 500, na.action = na.omit, nrep = 100)
rf.all <- combine(rf1, rf2, rf3)
I would like to use rfPermute to calculate importance p-values. However, when I use the combine function, I lose the values in null.dist for rf2 and rf3, with no warning.
This is a fairly common workflow, so is adding a custom combine method feasible? Or would I be better off manually combining the values in null.dist beforehand?