kogalur / randomForestSRC
DOCUMENTATION:
Home Page: https://www.randomforestsrc.org/
License: GNU General Public License v3.0
I'm working with a dataset of around 70 variables in the competing risk setting (the same dataset as in my last issue, although I've reduced it to 100,000 rows). I expect some of these variables to be noisy and others to be strong, so I ran the max.subtree
function to look at the top variables. However, the threshold returned is extremely high (about 22) while the highest order is only 11. Suspicious of these results, I created a fake random covariate entirely unrelated to the response and introduced it into the model, and it received an order of about 6 (the threshold still being 22).
Thinking this might be unique to my dataset, or random chance, I tried the survival example in the max.subtree documentation (the veteran dataset). There I introduced 10 extra random covariates and all of them were included in the top variables. (Not all of the original variables made the cut once I added the extra ones, which differs from my dataset, although they were all considered strong when I included no extra variables.) I've run it several times with the same results, so I know it's not that the random covariates happen by chance to be related to the response.
I don't know enough about maximal subtrees and their assumptions to know whether this is a problem with my dataset or not, but being able to reproduce it on the example dataset was surprising. Any insight would be appreciated.
Here is some R code for what I did in the example.
require(randomForestSRC)
data(veteran, package = "randomForestSRC")
for (j in 1:10) {
  veteran[, paste0("random", j)] <- rnorm(nrow(veteran))  # add pure-noise covariates
}
v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)
v.max$order
v.max$threshold
v.max$topvars
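As a cross-check on the minimal-depth results above, a sketch of my own using the package's exported vimp() function; permutation importance should flag the noise covariates near zero if they are truly unrelated to the response:

```r
# Sketch: permutation VIMP as a second opinion on the "random1" ... "random10"
# noise covariates added above. Truly unrelated variables should have
# importance values close to zero (or negative).
v.imp <- vimp(v.obj)
print(v.imp$importance)
```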
Here is my sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForestSRC_2.6.1
loaded via a namespace (and not attached):
[1] compiler_3.4.4 parallel_3.4.4 tools_3.4.4
Hello, I was working with a dataset that happened to have some NaNs (distinct from NAs) in one of its columns, and upon running rfsrc
many ugly error messages appeared that led to R crashing completely. I've included some example code below that replicates the problem.
I'm unsure how I would expect the package to treat NaNs, since they can be distinct from NAs in certain problems (mine included); but informative error messages identifying the NaNs as the cause would be helpful, or perhaps a precheck that forces users to handle NaNs themselves.
Here's example code that triggers the problem. I suggest running it in base R, as RStudio doesn't always display all of the errors.
x = rnorm(100)
z = x + rnorm(100)
y = 5 + 2*x - z + rnorm(100)
d = data.frame(x,z,y)
require(randomForestSRC)
rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # so far so good - no issue
d$z[1:10] = NA
rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # Still works fine, though we impute
d$z[1] = NaN
rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # massive failure
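A minimal sketch of the precheck idea suggested above (my own illustration, not an existing package feature):

```r
# Illustrative precheck (not part of randomForestSRC): fail fast with a
# clear message when any numeric column contains NaN, so users must handle
# NaNs themselves before calling rfsrc().
stop.if.nan <- function(data) {
  nan.cols <- names(data)[sapply(data, function(col) {
    is.numeric(col) && any(is.nan(col))
  })]
  if (length(nan.cols) > 0) {
    stop("NaN found in column(s): ", paste(nan.cols, collapse = ", "),
         "; please recode or remove them before calling rfsrc()")
  }
  invisible(data)
}
stop.if.nan(d)  # errors here once d$z[1] has been set to NaN
```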
install.packages("randomForestSRC")
Installing package into ‘/home/cnsun/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.mtu.edu/src/contrib/randomForestSRC_2.5.1.tar.gz'
Content type 'application/x-gzip' length 903705 bytes (882 KB)
==================================================
downloaded 882 KB
The downloaded source packages are in
‘/tmp/RtmpA1tGwr/downloaded_packages’
Warning message:
In install.packages("randomForestSRC") :
installation of package ‘randomForestSRC’ had non-zero exit status
Hi,
When I tried rfsrc with distance=TRUE, I get the error below. It looks like nativeOutput$distance is NULL.
data(iris)
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.omit", distance=TRUE)
Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length
mtcars.unspv <- rfsrc(Unsupervised() ~., data = mtcars, distance=TRUE)
Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length
Thank you!
I'm using version 2.9.0. When applying quantreg
to get predictions on new data, the predicted quantile values appear to be constant.
Example based on an example in the quantreg
documentation:
library(randomForestSRC)
set.seed(1)
o <- quantreg(mpg ~ ., mtcars[1:20,])
o.tst <- quantreg(object = o, newdata = mtcars[-(1:20),-1])
o$quantreg$quantiles # not constant
o.tst$quantreg$quantiles # constant in both directions
# Try on a subset of the original data
o.tst2 <- quantreg(object = o, newdata = mtcars[1:5, -1])
o.tst2$quantreg$quantiles # constant
I might be misunderstanding something. Are these constant values on new data the expected behaviour?
Thanks.
Hi there,
According to the help for the rfsrc()
function, the multivariate option can be specified using two different syntaxes:
My question is how to select from a data.frame all the columns for y1, y2, ..., yd when there are hundreds of response variables (d >= 100).
I have tried with positions:
rfsrc(cbind(261:279) ~., data = birds1)
and with some Pattern Matching:
rfsrc(Multivar(grep('y_', colnames(birds1), value = TRUE)) ~., data = birds1)
# My response variables start with "y_NAME"
But it always returns some character string, and the answer is always
Error in parseFormula(formula, data, ytry) : the formula is incorrectly specified.
Any suggestion for selecting a large number of response variables without needing to write them all out?
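One workaround sketch (my own, assuming the responses really are all named y_*): build the formula as a string and convert it with as.formula(), since Multivar()/cbind() expect bare column names rather than a character vector:

```r
# Sketch: construct the multivariate formula programmatically.
# Assumes the response columns start with "y_"; 'birds1' is the user's data.
ynames <- grep("^y_", colnames(birds1), value = TRUE)
fmla <- as.formula(paste0("Multivar(", paste(ynames, collapse = ", "), ") ~ ."))
obj <- rfsrc(fmla, data = birds1)
```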
R crash on predict: (minimal example provided as attachment)
crash_rfsrc.zip
RF-SRC
RF-SRC: *** ERROR ***
RF-SRC: Numerical Recipes Run-Time Error:
RF-SRC:
Illegal indices in gvector().
RF-SRC: Please Contact Technical Support.<simpleError in doTryCatch(return(expr), name, parentenv, handler):
RF-SRC: The application will now exit.
Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.
Calls: predict -> predict.rfsrc -> generic.predict.rfsrc
Execution halted
RF-SRC
RF-SRC: *** ERROR ***
RF-SRC: Numerical Recipes Run-Time Error:
RF-SRC:
Illegal indices in gvector().
RF-SRC: Please Contact Technical Support.Error:
RF-SRC: The application will now exit.
Fatal error: error during cleanup
Invocation: Rscript crash_rfsrc.R
SessInfo:
R version 3.5.2 (2018-12-20)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForestSRC_2.8.0 forcats_0.3.0 stringr_1.3.1
[4] dplyr_0.7.8 purrr_0.3.0 readr_1.3.1
[7] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0
[10] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.2
[5] plyr_1.8.4 bindr_0.1.1 tools_3.5.2 jsonlite_1.6
[9] lubridate_1.7.4 gtable_0.2.0 nlme_3.1-137 lattice_0.20-38
[13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 rstudioapi_0.9.0
[17] parallel_3.5.2 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[21] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2
[25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0
[29] readxl_1.2.0 modelr_0.1.2 magrittr_1.5 backports_1.1.3
[33] scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-0
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 broom_0.5.1
[41] crayon_1.3.4
Hello, I was hoping to get some context as to how I could get a predicted negative error rate for a model.
Hello, I've noticed that with competing risk data, if I first predict on a dataset that has a response and then predict on a dataset without one, R crashes entirely. Here's a script that can reliably trigger it. It causes a crash on all three computers I tested it on, but they're all Linux, so I don't know whether it's cross-platform or not.
set.seed(500)
n = 1500
data <- data.frame(x=rnorm(n), delta=sample(1:2, replace=TRUE, size=n))
data$T <- rexp(n, rate=ifelse(data$delta==1, 1/10, 1/15))
censorTimes <- rexp(n, rate=1/9)
data$delta = ifelse(data$T < censorTimes, 0, data$delta)
data$T = pmin(data$T, censorTimes)
trainingData <- data[1:1000,]
testData <- data[1001:1500,]
newData <- data.frame(x=rnorm(20))
library(randomForestSRC)
# Log-rank split rule is only used for speed; it still crashes on default splitrule
modelRfsrc = rfsrc(Surv(T, delta) ~ x, trainingData,
ntree=1000, nodesize=10, mtry=1,
nsplit=0, splitrule = "logrank")
testSetPredictions <- predict(modelRfsrc, testData)
# This line triggers the crash. I've tried sometimes running it before the predictions for testData
# and often it then *won't* crash, but it sometimes still does. It always triggers a crash though if
# I've run the predictions for testData before, even if before that I had successfully run this line.
newDataPredictions <- predict(modelRfsrc, newData)
Here's my sessionInfo():
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForestSRC_2.8.0
loaded via a namespace (and not attached):
[1] compiler_3.4.4 parallel_3.4.4 tools_3.4.4 yaml_2.2.0
Hi there,
I am interested in multivariate prediction of real-valued Ys. However, I would like to know whether the prediction method takes into account the joint distribution of the Ys. The splitting rule (weighted variance splitting) is explained for a single Y_i. Does this rule change when the model is multivariate?
Would you recommend any particular approach to estimate prediction intervals for this approach?
Hi, I am using the new version 2.9, and when I try to predict on new test data with a random survival forest I get the error "attempt to apply non-function". I had previously used the beta code (v2.8.0.11) you provided in another issue (issue #29) and my code ran without error. Nothing has changed in my code besides the updated package (I am using it in a Shiny app and can't use a locally installed package when deploying).
I believe I have traced the error to lines 311 and 321 of the 'generic.predict.rfsrc' function, where the variable 'sampsize' is assigned.
The current line is 'sampsize <- round(object$sampsize(nrow(xvar)))', but object$sampsize is just an integer, so the call crashes (and for me, nrow(xvar) equals object$sampsize). In the previous version of this function (v2.8.0) the line was 'sampsize <- object$sampsize', which seems correct.
Is there something else in the 'predict.rfsrc' function I am missing? I am calling it exactly as before, predict(model, newdata = new_data), and this happens with both competing risk and survival models.
I REALLY need the code on CRAN to be updated as I have to present my Masters project next Friday (May 31) - do you know if debugging this error and uploading to CRAN is something you're able to do soon? Thank you!!
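For what it's worth, a possible stop-gap until CRAN is updated (an untested sketch based on my tracing above, with illustrative object names): since the new code calls object$sampsize as a function of the sample size, wrapping the stored integer in a function before predicting might avoid the crash:

```r
# Untested workaround sketch: make object$sampsize callable, as the new
# generic.predict.rfsrc code apparently expects a function of n.
size <- model$sampsize                    # the integer stored at grow time
model$sampsize <- function(n) size        # ignore n, return the stored size
pred <- predict(model, newdata = new_data)
```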
In the PDF manual, a number of times "without replacement" is written when "with replacement" is meant. For instance, the documentation for rfsrc() samptype reads
Choices are swor (sampling without replacement) and swr (sampling without replacement).
Likewise, for sampsize it says
For sampling without replacement, it is the requested size of the sample, which by default is .632 times the sample size. For sampling without replacement, it is the sample size.
Hello,
I am trying to use randomForestSRC
on a classification problem.
Below the error message I get:
all scheduled cores encountered errors in user codeError in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"try-error"’ to a data.frame
I suspected the usage of a data.table object, but the following code worked fine.
data(iris)
dt_iris <- as.data.table(iris)
iris_modl_rfsrc <- rfsrc(Species ~., data = dt_iris)
iris_pred_rfsrc.pred <- predict(object=iris_modl_rfsrc,newdata=dt_iris[,.SD,.SDcols=-"Species"])
Thanks for your help.
My sessionInfo:
version R version 3.5.0 (2018-04-23)
system x86_64, linux-gnu
ui RStudio (1.1.447)
language (EN)
Packages --------------------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
base * 3.5.0 2018-04-25 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.5.0)
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
compiler 3.5.0 2018-04-25 local
data.table * 1.10.4-3 2017-10-27 CRAN (R 3.5.0)
datasets * 3.5.0 2018-04-25 local
devtools 1.13.5 2018-02-18 CRAN (R 3.5.0)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr 0.7.4 2017-09-28 CRAN (R 3.5.0)
DT 0.4 2018-01-30 CRAN (R 3.5.0)
evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0)
foreign 0.8-70 2018-04-23 CRAN (R 3.5.0)
ggplot2 2.2.1 2016-12-30 CRAN (R 3.5.0)
glue 1.2.0 2017-10-29 CRAN (R 3.5.0)
graphics * 3.5.0 2018-04-25 local
grDevices * 3.5.0 2018-04-25 local
grid 3.5.0 2018-04-25 local
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
htmlwidgets 1.2 2018-04-19 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
knitr 1.20 2018-02-20 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
maptools 0.9-2 2017-03-25 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.0 2018-04-25 local
munsell 0.4.3 2016-02-13 CRAN (R 3.5.0)
parallel 3.5.0 2018-04-25 local
pillar 1.2.1 2018-02-27 CRAN (R 3.5.0)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
plotly 4.7.1 2017-07-29 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
purrr 0.2.4 2017-10-18 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
randomForestSRC * 2.6.0 2018-05-02 CRAN (R 3.5.0)
Rcpp 0.12.16 2018-03-13 CRAN (R 3.5.0)
rgeos 0.3-26 2017-10-31 CRAN (R 3.5.0)
rlang 0.2.0 2018-02-20 CRAN (R 3.5.0)
rmarkdown 1.9 2018-03-01 CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
scales 0.5.0 2017-08-24 CRAN (R 3.5.0)
sp 1.2-7 2018-01-19 CRAN (R 3.5.0)
splitstackshape * 1.4.4 2018-03-29 CRAN (R 3.5.0)
stats * 3.5.0 2018-04-25 local
stringi 1.1.7 2018-03-12 CRAN (R 3.5.0)
stringr 1.3.0 2018-02-19 CRAN (R 3.5.0)
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyr 0.8.0 2018-01-29 CRAN (R 3.5.0)
tools 3.5.0 2018-04-25 local
utils * 3.5.0 2018-04-25 local
viridisLite 0.3.0 2018-02-01 CRAN (R 3.5.0)
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
yaml 2.1.18 2018-03-08 CRAN (R 3.5.0)
Hello Sir,
I have a dataset of 500,000 observations.
I want to run rfsrc on it with 500 trees, but that requires a lot of memory. So I came up with a possible workaround: first build 10 random survival forests of 50 trees each, each grown on the entire data with a different seed (1001, 1002, and so on up to 1010), then average the results across the 10 models for each observation. That is, the survival probabilities per observation per month, over the 24-month horizon for which probabilities must be forecast, are obtained by adding the 10 models' probabilities and dividing by 10. I thought this would be equivalent to a single 500-tree rfsrc model.
But surprisingly, the accuracy of the 10 combined models is much worse than that of even a single 50-tree survival forest on the entire data. Yes, that's 50, not 500.
Why is this happening, in your opinion? Can I do anything to simulate a 500-tree rfsrc model?
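For reference, the averaging described above looks roughly like this (a sketch with illustrative names: 'dat', 'time', and 'status' stand in for the real data, and $survival is the per-observation survival probability matrix returned for survival families):

```r
library(randomForestSRC)
library(survival)
# Sketch: grow 10 small forests with different seeds and average the
# predicted survival probabilities element-wise across the models.
fits <- lapply(1001:1010, function(s) {
  set.seed(s)
  rfsrc(Surv(time, status) ~ ., data = dat, ntree = 50)
})
surv.list <- lapply(fits, function(f) predict(f, newdata = dat)$survival)
avg.surv <- Reduce(`+`, surv.list) / length(surv.list)  # n x ntime matrix
```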
This is the code I ran (unsupervised random forest) and the error message:
urf.elm <- rfsrc(data = ap, ntree = 10000, proximity = "oob", distance = "oob")
Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length
My data has no NAs and only numeric variables.
Thanks for the cool package!
Dear Professor,
I've tried to install the randomForestSRC package on a CentOS system and failed. On a Windows 7 system I eventually succeeded by running
install.packages("randomForestSRC", dependencies = T, repos = 'http://cran.rstudio.com/')
(directly installing the downloaded package on Windows 7 also failed at first).
I'm sorry that I know little about compilation; I hope you can give me some help to solve the problem.
At first I tried to install the package with install.packages('randomForestSRC_2.5.1.tar.gz')
on the CentOS system, having downloaded the package file into the working directory. But it incurs the error below:
Installing package into ‘/public/home/pengruijiao/R/x86_64-pc-linux-gnu-library/3.3’
(as ‘lib’ is unspecified)
inferring 'repos = NULL' from 'pkgs'
* installing *source* package ‘randomForestSRC’ ...
** package ‘randomForestSRC’ successfully unpacked and MD5 sums checked
checking for gcc... icc -std=gnu99
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether icc -std=gnu99 accepts -g... yes
checking for icc -std=gnu99 option to accept ISO C89... none needed
checking for icc -std=gnu99 option to support OpenMP... -fopenmp
configure: creating ./config.status
config.status: creating src/Makevars
** libs
icc -std=gnu99 -I/public/software/R-3.3.3/lib64/R/include -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -std=c99 -c R_init_randomForestSRC.c -o R_init_randomForestSRC.o
icc: command line warning #10121: overriding '-std=gnu99' with '-std=c99'
icc -std=gnu99 -I/public/software/R-3.3.3/lib64/R/include -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -std=c99 -c randomForestSRC.c -o randomForestSRC.o
icc: command line warning #10121: overriding '-std=gnu99' with '-std=c99'
randomForestSRC.c(2336): error: identifier "M_E" is undefined
RF_vimpCLSptr[p][j][k] = M_E * result / (double) cumDenomCount;
^
compilation aborted for randomForestSRC.c (code 2)
make: *** [randomForestSRC.o] Error 2
ERROR: compilation failed for package ‘randomForestSRC’
* removing ‘/public/home/pengruijiao/R/x86_64-pc-linux-gnu-library/3.3/randomForestSRC’
Warning message:
In install.packages("randomForestSRC_2.5.1.tar.gz") :
installation of package ‘randomForestSRC_2.5.1.tar.gz’ had non-zero exit status
Then I found an introduction to installing the package (although it's mainly about enabling OpenMP):
http://ccs.miami.edu/~hishwaran/rfsrc.html, and I followed method 1:
1. Download the package source code randomForestSRC_X.x.x.tar.gz. The X's indicate the version posted. Do not download the binary.
2. Open a console, navigate to the directory containing the tarball, and untar it using the command tar -xvf randomForestSRC_X.x.x.tar.gz
3. This will create a directory structure with the root directory of the package named randomForestSRC. Change into the root directory of the package using the command cd randomForestSRC
4. Run autoconf using the command autoconf
5. Change back to your working directory using the command cd .. From your working directory, execute the command R CMD INSTALL --preclean --clean randomForestSRC on the modified package. Ensure that you do not target the unmodified tarball, but instead act on the directory structure you just modified.
I then tried installing the package on Windows 7. At first I failed, but I later succeeded with
install.packages("randomForestSRC", dependencies = T, repos = 'http://cran.rstudio.com/')
I tried the same command on CentOS, but it just didn't work. The error is unchanged from the above, except that it downloads something before the installation:
Content type 'application/x-gzip' length 903705 bytes (882 KB)
==================================================
downloaded 882 KB
I get some "unknown software exceptions" when I run the example below on Windows (32 GB, Windows 10). It runs fine when I reduce to n = 33000. It also works fine with n = 34000 on my 2017 MacBook Pro (8 GB) and on a big Linux machine.
library(randomForestSRC)
n <- 34000
p <- 4
x <- replicate(p, rnorm(n))
time <- round(runif(n, 0, 100))
status <- round(runif(n , 0, 2))
dat <- data.frame(time = time, status = status, x)
rfsrc(Surv(time, status) ~ ., dat, ntree = 5, cause = 1)
I have randomForestSRC 2.9.1 (latest from CRAN). I tried R 3.5.1 and 3.6.0 with the same result.
Both in the example provided:
# Veteran's Administration Lung Cancer Trial. Randomized
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)
plot(v.obj)
plot.survival(v.obj)
and with my own data, the error rate is constant (a horizontal line), independent of the number of trees. Is this normal, i.e. the expected behaviour?
Hello, the max.subtree function is throwing an error "Error in if (local.obj$stumpCnt == 0) { : argument is of length zero"
At first I thought there was a problem with my data, but when I ran the example provided with the function in the package documentation the same error occurred. To reproduce error run:
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)
#v.max will not exist because "Error in if (local.obj$stumpCnt == 0) { : argument is of length zero"
I have a problem where I'm trying to predict 15 minutes into the future. I've set up my model and have good results, but when I try to apply the model to current data where my y-values are still unknown, predict(model, test) won't give me any output for those rows.
I've tried:
1. Not supplying the yvar data in the data.frame for predict(). This silently errors, predicting 0 for all rows.
2. Using na.action = "na.impute". This gives me correct output, but takes a significant amount of time (6x longer than predict with "na.omit" in my case). Since there are no NAs in my xvars, I'm assuming it is imputing values for irrelevant data, such as unused variables in my data.frame or the yvars.
3. Supplying yvars as part of a separate data.frame (e.g. rfsrc(yvars$yval ~ xvar1 + xvar2, data = train)). This fails with Error in parseFormula(formula, data, ytry) : formula is incorrectly specified.
Is there any way I can supply the y variables separately, so that they aren't needed in the prediction phase?
In the news file there is mention of quantileReg(), which I cannot seem to find. Further, the quantreg() command seems to crash R when used. This is version 2.9.1 with R 3.6.1 (64-bit). I am using a continuous outcome with only two predictors and 1000 cases. The crash seems independent of the data and occurs on multiple machines.
I am using rfsrc to build a competing risk survival random forest. The model builds fine without error but fails at prediction. The following is an example taken from the "survival" package's vignette "compete":
data("mgus2")
etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
mgus2$etime <- etime
mgus2$event <- event
xx <- rfsrc(Surv(etime, event)~sex, data=mgus2)
predict(xx)
I got error :
Error in Math.factor(cens) : ‘floor’ not meaningful for factors
Enter a frame number, or 0 to exit
1: predict(xx)
2: predict.rfsrc(xx)
3: generic.predict.rfsrc(object, newdata, m.target = m.target, importance = importance, err.block = err.block, na.action = na
4: get.event.info(object)
5: Math.factor(cens)
After looking into the function "get.event.info", I see it fails at
if (!all(floor(cens) == abs(cens), na.rm = TRUE)) {
stop("for survival families censoring variable must be coded as a non-negative integer")
}
This stop message contradicts the competing risk analysis requirement that the event be a factor. Is there a misunderstanding on my part, or does the package not support competing risk survival?
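For what it's worth, a sketch of the recoding the stop() message seems to ask for (my own reading, not confirmed by the package authors): keep the event as a non-negative integer with 0 = censored rather than converting it to a factor:

```r
# Sketch: retain the 0/1/2 integer coding (0 = censored, 1 = pcm, 2 = death),
# which is what the stop() message requests, instead of a factor.
library(survival)          # provides mgus2 and Surv()
library(randomForestSRC)
data("mgus2")
mgus2$etime  <- with(mgus2, ifelse(pstat == 0, futime, ptime))
mgus2$event2 <- with(mgus2, ifelse(pstat == 0, 2 * death, 1))  # integer, not factor
xx <- rfsrc(Surv(etime, event2) ~ sex, data = mgus2)
pred <- predict(xx)
```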
My system information
version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On
randomForestSRC version: 2.6.0
survival version: 2.41-3
Hi,
I am trying to compile randomForestSRC for use with OpenMP, following the instructions at
https://kogalur.github.io/randomForestSRC/building.html
As you can see below, I have clang8, gfortran 6.1.0, ant 1.10.7, and java 1.8.0 (Mac OS Mojave 10.14.6). The problem appears to be a non-existent directory when attempting `ant source-cran`. Perhaps I've got the build.xml file in the wrong directory to begin with. Not sure. Advice appreciated. Best, -- Jay
math172m-01:tmp jay$ echo $PATH
/usr/local/ant/bin:/usr/local/gfortran/bin:/usr/local/clang8/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin
math172m-01:tmp jay$ clang --version
clang version 8.0.0 (tags/RELEASE_800/final)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /usr/local/clang8/bin
math172m-01:tmp jay$ gfortran --version
GNU Fortran (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
math172m-01:tmp jay$ ant -version
Apache Ant(TM) version 1.10.7 compiled on September 1 2019
math172m-01:tmp jay$ java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
math172m-01:tmp jay$ ant source-cran
Buildfile: /Users/jay/Desktop/tmp/build.xml
init:
[echo] --------- randomForestSRC ---------
[echo]
[echo] Version: 2.9.1
[echo] Build: bld20190708a
[echo]
[echo] Date: 2019-10-04
[echo] Time: 04:07:42
[echo]
[echo] Platform Details:
[echo] OS name Mac OS X
[echo] OS version 10.14.6
[echo] OS arch x86_64
[echo] Java arch 64
clean-cran:
[delete] Deleting directory /Users/jay/Desktop/tmp/target/cran
source-cran:
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/inst
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/data
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/man
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/R
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/src
BUILD FAILED
/Users/jay/Desktop/tmp/build.cran.xml:29: /Users/jay/Desktop/tmp/src/main/resources/cran does not exist.
Total time: 0 seconds
math172m-01:tmp jay$
Hello, I ran a model on a large competing risk dataset (250,000 observations and 74 covariates). I wasn't able to use more observations without running into errors about allocating vectors longer than 32-bit indexing allows, but for 250,000 rows it ran without complaint. I only mention this because it may be related.
Anyway, my output from print.rfsrc(model)
is:
Sample size: 250000
Number of events: 102523, 22320
Was data imputed: yes
Number of trees: 10000
Forest terminal node size: 6
Average no. of terminal nodes: 1888.076
No. of variables tried at each split: 9
Total no. of variables: 74
Analysis: RSF
Family: surv-CR
Splitting rule: logrankCR *random*
Number of random split points: 3
Error rate: -15.41%, 34.04%
The error rate for the first event is negative (-15.41%), which, if I understand how the error rate is derived from the concordance index, shouldn't be possible.
Here is the call I made: model = rfsrc(formula = Surv(u, delta) ~ . - sub_grade, data = data, ntree = 10000, nsplit = 3, importance = "none", na.action = "na.impute", ntime = 0:37, cause = 1, proximity = FALSE, sampsize = 10000, forest.wt = FALSE)
Here is the sessionInfo()
on the machine that trained the model:
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] survival_2.41-3 randomForestSRC_2.6.1
loaded via a namespace (and not attached):
[1] compiler_3.4.4 Matrix_1.2-12 parallel_3.4.4 splines_3.4.4
[5] grid_3.4.4 lattice_0.20-35
For reference I earlier ran the same call on a smaller subset of 100,000 rows which gave error rates of 42.63%, 34.17%.
Dear Professor,
I'm a PhD student in Actuarial Sciences working on survival analysis. Currently I'm studying a random forest method which aims to model E[phi(T)|X], where:
I'm using RSF algorithm from randomForestSRC package as a benchmark to my method. There is a presentation here if you are curious.
I have a small problem: I need to run repeated experiments on large datasets (from 10,000 to 100,000 observations), and I find the rfsrc function a bit slow on data of this size. I followed the various pieces of advice given in the function documentation for reducing computation time:
In fact my question is about the split rules you mention in the articles "Random Survival Forest (2008)" and the R vignette "Random Survival Forests for R (2007)".
In "Random Survival Forests for R (2007)", you talk about different splitrules :
In "Random Survival Forest (2008)", in the "Empirical Comparisons" paragraph, you mention :
So "approximate logrank" was replaced by "logrank random". My question is: can you confirm that the "approximate logrank" splitrule is no longer featured in the current randomForestSRC package?
Finally, I would like to thank you for the very great RSF algorithm !
Best,
Yohann le Faou
When a data.frame has a factor column and is turned into a tibble, it seems that the package can no longer deal with the factor column.
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.5.0 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
#> [5] tools_3.5.0 htmltools_0.3.6 yaml_2.1.19 Rcpp_0.12.16
#> [9] stringi_1.2.2 rmarkdown_1.9 knitr_1.20 stringr_1.3.1
#> [13] digest_0.6.15 evaluate_0.10.1
library("randomForestSRC")
#>
#> randomForestSRC 2.6.1
#>
#> Type rfsrc.news() to see new features, changes, and bug fixes.
#>
data(veteran, package = "randomForestSRC")
veteran$trt <- factor(veteran$trt)
rfsrc(Surv(time, status) ~ trt, data = veteran, ntree = 100, tree.err=TRUE)
#> Sample size: 137
#> Number of deaths: 128
#> Number of trees: 100
#> Forest terminal node size: 3
#> Average no. of terminal nodes: 2
#> No. of variables tried at each split: 1
#> Total no. of variables: 1
#> Analysis: RSF
#> Family: surv
#> Splitting rule: logrank
#> Error rate: 73.17%
rfsrc(Surv(time, status) ~ trt, data = dplyr::as_tibble(veteran), ntree = 100, tree.err=TRUE)
#>
#> RF-SRC: *** ERROR ***
#> RF-SRC: X-var factor level in data inconsistent with number of levels indicated: [ 1] = 1 vs. 0
#> RF-SRC: Please Contact Technical Support.<simpleError in doTryCatch(return(expr), name, parentenv, handler):
#> RF-SRC: The application will now exit.
#> >
#> Error in rfsrc(Surv(time, status) ~ trt, data = dplyr::as_tibble(veteran), : An error has occurred in the grow algorithm. Please turn trace on for further analysis.
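A workaround that appears to avoid the error (a sketch, not an official fix) is to coerce the tibble back to a plain data.frame before fitting:

```r
library(randomForestSRC)

data(veteran, package = "randomForestSRC")
veteran$trt <- factor(veteran$trt)

## Coercing the tibble back to a plain data.frame restores the
## attributes rfsrc expects for factor handling.
vet.df <- as.data.frame(dplyr::as_tibble(veteran))
o <- rfsrc(Surv(time, status) ~ trt, data = vet.df, ntree = 100)
```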
Hello Udaya,
Perhaps it would be useful to export functions to programmatically collect rfsrc object summary statistics.
For instance:
## ---- extract_rf_brier
#' Extract a Brier score from a randomForestSRC object.
#'
#' @param x rfsrc object. An rfsrc object to extract from.
#'
#' @export extract_rf_brier
#' @md
extract_rf_brier <- function(x){
  ## classification families only (grepl avoids the data.table
  ## dependency that %like% would introduce)
  if (grepl("class", x$family)){
    if (!is.null(x$err.rate)){
      conf.matx <- table(x$yvar,
                         if (!is.null(x$class.oob) && !all(is.na(x$class.oob)))
                           x$class.oob else x$class)
      conf.matx <- cbind(conf.matx,
                         class.error = round(1 - diag(conf.matx) /
                                               rowSums(conf.matx, na.rm = TRUE), 4))
      names(dimnames(conf.matx)) <- c("  observed", "predicted")
      ## mean squared difference between class indicators and
      ## predicted class probabilities, averaged over classes
      .brier <- function(ytest, pred){
        cl <- colnames(pred)
        mean(sapply(seq_along(cl), function(k){
          mean((1 * (ytest == cl[k]) - pred[, k])^2, na.rm = TRUE)
        }), na.rm = TRUE)
      }
      brierS <- .brier(x$yvar,
                       if (!is.null(x$predicted.oob) && !all(is.na(x$predicted.oob)))
                         x$predicted.oob else x$predicted)
      list(confusion = conf.matx, brier = brierS)
    } else {
      NULL
    }
  } else {
    NA
  }
}
Sincerely,
Andrew
For the comparison of the RSF model to the mixed outcome model (page 48 of the CRAN docs), why is one computed with get.cindex and the other computed with 1-get.cindex?
Hi - is there a way to display one of the trees in the forest? I'm able to get the split statistics on the variables using the stat.split function, but I'm having a little trouble parsing what an actual tree looks like.
Thanks.
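As an aside that may help: newer releases of the package export a get.tree() function for extracting and drawing a single tree. A minimal sketch, assuming a version that provides get.tree() (it relies on the data.tree package for drawing):

```r
library(randomForestSRC)

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

## get.tree() extracts one tree from the forest; plotting the result
## draws that tree's split structure.
tree5 <- get.tree(v.obj, 5)
plot(tree5)
```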
How can one send pull requests to this repository for the R package?
I'm not familiar with the build process used by this package, but I've contributed to many R packages. I'm just wondering whether you have a README on how to effectively send a PR for the package and have it checked with a CI service such as travis-ci.com.
I have been following the instructions on Working with R's OOPS in the rpy2 documentation here: https://rpy2.readthedocs.io/en/version_2.8.x/robjects_oop.html and I am trying to create a Python class to call the function rfsrc in the R package randomForestSRC.
When I run the code below from a Jupyter Notebook (Python 3, R 3.5.1), I get the error:
Error in (function (f, signature = character(), where = topenv(parent.frame()), : no generic function found for 'rfsrc'.
Does this mean that I cannot call rfsrc from Python? Thanks.
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
packnames = ('randomForestSRC', 'survival', 'tidyverse', 'magrittr', 'ggRandomForests', 'mlr')
utils.install_packages(StrVector(packnames))
from rpy2.robjects.packages import importr
randomForestSRC = importr('randomForestSRC')
from rpy2.robjects.methods import RS4Auto_Type
import six
class rfsrc(six.with_metaclass(RS4Auto_Type)):
rname = 'rfsrc'
rpackagename = 'randomForestSRC'
Hi @kogalur ,
I wrote a custom split function, getCustomSplitStatisticMultivariateRegressionTwo(), and followed the steps exactly as mentioned.
i.e.,
After doing the above,
when I tried to grow the tree with the split rule set to "custom2", I got an error.
Kindly let me know if I am missing anything.
Regards,
Vinodh
ic -fpic -fPIC -c randomForestSRC.c -o randomForestSRC.o
randomForestSRC.c: In function ‘updateGenericVimpEnsemble’:
randomForestSRC.c:2361: error: expected end of line before ‘update’
randomForestSRC.c: In function ‘updateProximity’:
randomForestSRC.c:20712: error: expected end of line before ‘update’
randomForestSRC.c:20717: error: expected end of line before ‘update’
make: *** [randomForestSRC.o] Error 1
ERROR: compilation failed for package ‘randomForestSRC’
I read on SO that it's in your plans.
Is it adequate for imbalance in multiclass classification problems?
Thanks.
I obtain the following error when trying to predict an rfsrc object:
<simpleError in (object$nativeFactorArray)$mwcpPT: $ operator is invalid for atomic vectors>
Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.
This is the first time in months that I have seen this error; I am just loading the same RDS model object and generating predictions on new data. I even tried predicting on the same data that was used to build the model, and I get the same error.
The model was built using package Version: 2.5.1.
I only started to see this error once I installed the latest randomForestSRC package: Version 2.7.0.
To confirm, I reverted back to Version 2.5.1 and predict started to work again.
It seems that the error rate plot method was broken in release 2.6.
library(randomForestSRC, verbose = TRUE)
randomForestSRC 2.6.1
Type rfsrc.news() to see new features, changes, and bug fixes.
Using the example from the help file, you can see that the plot outputs a constant error rate, as when the tree.err option is set to FALSE:
## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err=TRUE)
plot(v.obj)
You get the same plot when explicitly setting tree.err = FALSE:
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = FALSE)
plot(v.obj)
The variable importance plot is still working:
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = TRUE,
importance = TRUE)
plot(v.obj)
Version 2.5.1 was working fine. I removed version 2.6.1, installed 2.5.1, and got the correct OOB error rate plot.
remove.packages("randomForestSRC", lib="~/R/win-library/3.5")
install.packages("C:/Users/*******/Downloads/randomForestSRC_2.5.1.tar.gz", repos = NULL, type = "source")
library(randomForestSRC, verbose = TRUE)
randomForestSRC 2.5.1
Type rfsrc.news() to see new features, changes, and bug fixes.
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = TRUE)
plot(v.obj)
Session Info for reference:
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForestSRC_2.5.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2 parallel_3.5.0 htmltools_0.3.6 tools_3.5.0
[8] yaml_2.1.19 Rcpp_0.12.17 stringi_1.2.2 rmarkdown_1.9 knitr_1.20 stringr_1.3.1 digest_0.6.15
[15] evaluate_0.10.1
Any help would be appreciated, thanks!
I was trying to get the raw data used by the plot function plot.subsample(model.sm.rf, ...) in order to use the data with my other analysis functions and harmonize the look of my figures. However, I had some difficulties getting it.
From /src/R/plot.subsample.rfsrc.R I noticed that the plot function calls extract.subsample.rfsrc.local(obj = ...), which has the boxplot.dta variable supposedly containing the values of each run (e.g. B = 100), but all the values in that data frame were identical, which I did not quite understand, as the standard plot functioned fine.
I managed to use the var.jk.sel.Z variable, which contains the mean value with the upper and lower bounds.
Any idea why the following call did not contain the actual results from the various runs?
oo <- extract.subsample(x, alpha = alpha, target = target, standardize = standardize)
boxplot.dta <- oo$boxplot.dta
Is it possible to use randomForestSRC for evaluating a special form of Cox PH regression called conditional logistic regression, of the form:
coxph(formula = Surv(rep(1, 200L), event) ~ group + strata(id),
method = "exact")
Thanks
Hello, according to the documentation, log-rank splitting for competing risks data when we specify cause tries to maximize the log-rank test statistic for that cause. However, I've discovered that in certain datasets the chosen split is not necessarily the one that actually maximizes the score, although it's close. Below is a script that can replicate what I've seen; here is the data.txt used in the script.
library(survival)
library(randomForestSRC)
data <- read.csv("data.txt") # Github won't let me upload a .csv
# We use no bootstrapping so that results can be replicated,
# one tree with a maximum node depth of 1 so that there's only one split to look at.
# nsplit=0 so that the optimal split can be selected.
# cause=2 because interestingly cause=1 is optimal.
rfsrc.model <- rfsrc(Surv(u, delta) ~ x, data, ntree=1, bootstrap="none", nodedepth = 1, nsplit = 0, cause=2)
rfsrcIsLeftHand <- data$x <= rfsrc.model$forest$nativeArray[1,4]
rfsrc.model$forest$nativeArray # split chosen to be <= 0.0370275
# Other theoretical split on x
otherPossibleLeftHand <- data$x <= 0.0225499335063258
newData <- data.frame(u=data$u, delta=data$delta, rfsrcIsLeftHand, otherPossibleLeftHand)
newData$isEvent1 <- newData$delta==1
newData$isEvent2 <- newData$delta==2
# Survdiff from the survival package runs by default a log-rank test
survdiff(Surv(u, isEvent2)~rfsrcIsLeftHand, newData)
# Chi-sq value of 76.5
survdiff(Surv(u, isEvent2)~otherPossibleLeftHand, newData)
# Chi-sq value of 77.4; higher.
Here is my sessionInfo():
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForestSRC_2.8.0 survival_2.43-3
loaded via a namespace (and not attached):
[1] compiler_3.4.4 Matrix_1.2-12 parallel_3.4.4 tools_3.4.4 yaml_2.2.0 splines_3.4.4 grid_3.4.4
[8] lattice_0.20-35
The error "factor level in data inconsistent with number of levels indicated" is thrown amidst a tuning run.
During the tuning process the train/test set split is not modified.
Example is attached (failrfsrc.tgz): ~> Rscript rfsrc_bug.R
[Tune] Started tuning learner classif.randomForestSRC for parameter set:
Type len Def Constr Req Tunable Trafo
ntree integer - - 100 to 500 - TRUE -
mtry integer - - 5 to 50 - TRUE -
nodesize integer - - 1 to 10 - TRUE -
nodedepth integer - - 3 to 16 - TRUE -
nsplit integer - - 1 to 50 - TRUE -
bootstrap discrete - - by.root - TRUE -
With control class: TuneControlMBO
Imputation value: -0
[Tune-x] 1: ntree=261; mtry=10; nodesize=7; nodedepth=12; nsplit=48; bootstrap=by.root
[Tune-y] 1: acc.test.mean=0.7435397; time: 0.0 min
[Tune-x] 2: ntree=383; mtry=40; nodesize=2; nodedepth=8; nsplit=38; bootstrap=by.root
[Tune-y] 2: acc.test.mean=0.7747563; time: 0.1 min
[Tune-x] 3: ntree=196; mtry=41; nodesize=4; nodedepth=7; nsplit=17; bootstrap=by.root
[Tune-y] 3: acc.test.mean=0.7657664; time: 0.1 min
[Tune-x] 4: ntree=317; mtry=34; nodesize=9; nodedepth=12; nsplit=31; bootstrap=by.root
[Tune-y] 4: acc.test.mean=0.7219982; time: 0.1 min
[Tune-x] 5: ntree=452; mtry=18; nodesize=5; nodedepth=9; nsplit=34; bootstrap=by.root
[Tune-y] 5: acc.test.mean=0.7483707; time: 0.1 min
[Tune-x] 6: ntree=283; mtry=27; nodesize=2; nodedepth=10; nsplit=14; bootstrap=by.root
[Tune-y] 6: acc.test.mean=0.7703074; time: 0.1 min
[Tune-x] 7: ntree=438; mtry=14; nodesize=7; nodedepth=12; nsplit=23; bootstrap=by.root
[Tune-y] 7: acc.test.mean=0.7438340; time: 0.1 min
[Tune-x] 8: ntree=326; mtry=39; nodesize=2; nodedepth=14; nsplit=33; bootstrap=by.root
[Tune-y] 8: acc.test.mean=0.7702152; time: 0.1 min
[Tune-x] 9: ntree=237; mtry=17; nodesize=5; nodedepth=16; nsplit=2; bootstrap=by.root
Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
write error, closing pipe to the parent process
Calls: train ... extract.factor -> mclapply -> lapply -> FUN -> sendMaster
RF-SRC: *** ERROR ***
RF-SRC: Y-var factor level in data inconsistent with number of levels indicated: 1 0
Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.
Calls: train ... predictLearner.classif.randomForestSRC -> predict -> predict.rfsrc -> generic.predict.rfsrc
Execution halted
Please find the attached R script as an example which shows that the subset argument in the plot.variable function doesn't seem to be working.
Hi there,
I would like to extract the Brier score directly from the rf1 object by modifying the source code of plot.survival.rfsrc so that it returns the brier.score and crps datasets. However, it gives me an error that it could not find the function get.event.info, and I cannot find its source code. Could you please explain the usage of get.event.info, or let plot.survival.rfsrc also output the Brier score and CRPS?
By the way, the reason I cannot use pec::pec to get the Brier score is that there are missing values in my test set. When pec::predictSurvProb.rfsrc in pec::pec works on the rfsrc object, you can see in the call to predict below that na.action = "na.impute" is not there; hence it can only output predicted values for the non-missing rows, which gives an error.
predictSurvProb.rfsrc <- function(object, newdata, times, ...){
ptemp <- predict(object,newdata=newdata,importance="none",...)$survival
pos <- prodlim::sindex(jump.times=object$time.interest,eval.times=times)
p <- cbind(1,ptemp)[,pos+1,drop=FALSE]
if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
stop(paste("\nPrediction matrix has wrong dimensions:\nRequested newdata x times: ",NROW(newdata)," x ",length(times),"\nProvided prediction matrix: ",NROW(p)," x ",NCOL(p),"\n\n",sep=""))
p
}
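As a stopgap (a sketch, not pec's official API), one could define a local variant of the wrapper quoted above that forwards na.action = "na.impute" to predict, so that observations with missing covariates are imputed rather than dropped:

```r
library(prodlim)

## Local copy of pec's wrapper with na.action forwarded, so rows with
## missing covariates are imputed instead of silently dropped.
predictSurvProb.rfsrc.impute <- function(object, newdata, times, ...){
  ptemp <- predict(object, newdata = newdata, importance = "none",
                   na.action = "na.impute", ...)$survival
  pos <- prodlim::sindex(jump.times = object$time.interest,
                         eval.times = times)
  p <- cbind(1, ptemp)[, pos + 1, drop = FALSE]
  if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
    stop("Prediction matrix has wrong dimensions")
  p
}
```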
Thanks in advance!
Shengnan
At the moment it is near impossible to exclude text output from functions like var.select() in R markdown, since the chunk options of knitr can only suppress all text, or that output by message(), warning() and error() selectively. Please replace all instances of cat(), print(), printf() etc. by the appropriate message(), warning() and error().
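In the meantime, a workaround (a sketch, not a package feature) is to discard the text via capture.output(), since an assignment inside the captured expression still takes effect:

```r
library(randomForestSRC)

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

## capture.output() swallows everything written via cat()/print(),
## while the assignment inside is still performed.
log <- capture.output(vs <- var.select(v.obj))
vs$topvars
```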
I'm getting the following error with randomForestSRC, as compiled below, when trying to cluster a rather large data frame of mostly logical features. It appears the vector length should be fine for a 64-bit int. Is that intended? Also, is the warning likely related to the ultimate 'kill 9'?
> dim(chunk)
[1] 119674 392
> rf.fit <- randomForestSRC::rfsrc(data = select(chunk, -plus.master_id), ntree=10000, proximity="oob")
RF-SRC: *** WARNING ***
RF-SRC: S.E.X.P. vector element length exceeds 32-bits: 7160992975
RF-SRC: S.E.X.P. ALLOC: proximity
RF-SRC: Please Reduce Dimensionality If Possible.
Killed: 9
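For what it's worth, the reported length is exactly the number of entries in the lower triangle (including the diagonal) of the n x n proximity matrix, which exceeds the 2^31 - 1 length limit of a classic R vector:

```r
n <- 119674
## the proximity matrix is stored as a lower triangle incl. the diagonal
len <- n * (n + 1) / 2
len            # 7160992975, the value in the warning
len > 2^31 - 1 # TRUE: exceeds the 32-bit vector length limit
```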
System version:
> version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
Package compile log:
checking for gcc... gcc-7
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc-7 accepts -g... yes
checking for gcc-7 option to accept ISO C89... none needed
checking for gcc-7 option to support OpenMP... -fopenmp
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -fopenmp -fPIC -Wall -g -O2 -c R_init_randomForestSRC.c -o R_init_randomForestSRC.o
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -fopenmp -fPIC -Wall -g -O2 -c randomForestSRC.c -o randomForestSRC.o
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -fopenmp -fPIC -Wall -g -O2 -c splitCustom.c -o splitCustom.o
gcc-7 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o randomForestSRC.so R_init_randomForestSRC.o randomForestSRC.o splitCustom.o -fopenmp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
I am using randomForestSRC to estimate the variable importance of 34 potential predictors of survival. When all of the potential predictor variables are numeric (or coded as numeric), the package runs without a problem. When I appropriately code factor variables as factors, I get the following error: Error in Math.factor(cens) : 'floor' not meaningful for factors. I have tried using the package defaults to be sure that nothing I'm specifying is causing a problem. Coding all of my variables as numeric (which is incorrect) eliminates the problem. If I specify the factor variables as factors (which is correct), I get this error. I do not know why a rounding operation, floor, would be applied to factor variables. I have looked through the code running in the package and cannot identify where the problem is. Thank you for your assistance.
I am getting an error while using the subsample function for a competing risk problem under the following setting:
The error does not occur for a different splitting rule or for nimpute > 1.
I tried my best to investigate this problem, and it appears to be related to the rfsrc function rather than the subsample function.
Hello Sir,
I have run RFSRC on my training data and scored the test data. While the model performs very well on the training data, it shows huge overfitting on the test data. I thought that presenting the problem to you might be useful, given your expert knowledge of how survival forests work.
The data is attached with this mail.
I am first running the below code to make the model.
library(randomForestSRC)
# Running the model
[train_data.zip](https://github.com/kogalur/randomForestSRC/files/3417313/train_data.zip)
rfsrc_model <- rfsrc(Surv(monthlap,islapsed) ~ .,
data=train,
ntree = 100 ,
do.trace = TRUE)
# Making predictions on the test dataset
pred_test <- predict.rfsrc(rfsrc_model, test)
# Making the dataset containing 24-month probabilities and the time (monthlap)
# and status (islapsed) variables for the test data
ncap_surv = as.data.frame(cbind(pred_test$survival, monthlap = test$monthlap, islapsed = test$islapsed))
Then, the way I'm checking the performance on the test data is as follows. I estimate monthly survival probabilities for everyone in the data over a 24-month period and set a cut-off probability; if any member's probability falls below it in some month, they are considered dead in that month. I then draw a graph with the percentage of deaths per month over 24 months on the y-axis and time on the x-axis. This graph is drawn for both the train and test data, and in my case the test line hugely overpredicts the death percentage compared to the train line, particularly beyond the 6th month. I'm attaching one plot with this question.
Train_test_plot.pdf
This is as per the below code:
# input the probability cut-off
x <- as.numeric(0.60)
# Plotting the graph
ncap_surv2 <- data.frame()
ncap_surv5 <- data.frame()
ncap_surv2 <- ifelse(ncap_surv[,c(1:24)]<x,1,0)
ncap_surv2 = cbind(ncap_surv2,monthlap = ncap_surv$monthlap, islapsed = ncap_surv$islapsed)
for (i in 24:2){
ncap_surv2[, i] = ncap_surv2[, i] - ncap_surv2[, i-1]
}
ncap_surv2 <- as.data.frame(ncap_surv2)
ncap_surv3 <- ncap_surv2 %>%
group_by(monthlap, islapsed) %>%
summarise(n = n())
ncap_surv4 <- colSums(ncap_surv2[, c(1:24)])
ncap_surv5 <- cbind(ncap_surv4, monthlap = 1:24, islapsed = 1)
ncap_surv5 <- as.data.frame(ncap_surv5)
ncap_surv6 <- merge(x = ncap_surv3, y = ncap_surv5, by = c('monthlap', 'islapsed'), all.x = TRUE)
ncap_surv7 <- ncap_surv6[!(ncap_surv6$islapsed == 0), ]
colnames(ncap_surv7) <- c('monthlap', 'islapsed', 'actual', 'predicted')
ncap_surv7[, c("actual", "predicted")] <- ncap_surv7[, c("actual", "predicted")]/nrow(ncap_surv)
ncap_surv7 <- within(ncap_surv7, cum_actual <- cumsum(actual))
ncap_surv7 <- within(ncap_surv7, cum_predicted <- cumsum(predicted))
library(ggplot2)
p = ggplot() +
geom_line(data = ncap_surv7, aes(x = monthlap, y = cum_actual), color = "blue") +
geom_line(data = ncap_surv7, aes(x = monthlap, y = cum_predicted), color = "red") +
xlab('Month') +
ylab('percent')
print(p)
I am totally at my wits' end trying to ascertain why the survival probabilities suddenly decrease beyond the 6-month mark. What should I look into to determine which variable or other cause is decreasing the survival probabilities?
Thanks!
Hello,
I'm currently trying to get a deeper understanding of Random Survival Forests and how they work.
Since an individual can end up in many terminal nodes in different trees of the forest, I assume the survival function for that individual is averaged over all terminal nodes.
https://kogalur.github.io/randomForestSRC/theory.html mentions the KM-estimator for estimating the survival function but not how the ensemble survival function is calculated.
In "Evaluating Random Forests for Survival Analysis using Prediction Error Curves" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194196/) it says that the ensemble survival function is derived from the ensemble CHF.
I just wanted to know what case is used in randomForestSRC. The actual equation would be the icing on the cake.
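For reference, the construction in the prediction-error-curves paper derives the ensemble survival function from the tree-averaged Nelson-Aalen CHF via S(t) = exp(-H(t)); here is a toy sketch of that computation with made-up per-tree CHF values (not the package's internals):

```r
## Toy example: 3 trees, CHF evaluated on a common time grid.
chf.per.tree <- rbind(c(0.1, 0.3, 0.6),
                      c(0.2, 0.4, 0.7),
                      c(0.1, 0.2, 0.5))

## Ensemble CHF = average of the per-tree CHFs ...
chf.ens <- colMeans(chf.per.tree)
## ... and ensemble survival S(t) = exp(-H(t)).
surv.ens <- exp(-chf.ens)
```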
I hope this is the right place to ask such a question and that I don't bother anyone (I have like two more questions).
Anyway, thanks for all the effort put in the papers and this R package!
Chris
library(randomForestSRC)
library(ggplot2)
library(Hmisc)
set.seed(1)
data(veteran, package = "randomForestSRC")
Model <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = TRUE, seed = 1)
print(1-Model$err.rate[100])
print(rcorr.cens(predict(Model)$predicted,Surv(veteran$time, veteran$status))[1])
According to your documentation, the C-index given by the model and the C-index estimated by Harrell's method should be equal. However, they are quite different. Could you please provide more information about this? Thanks!