kogalur / randomforestsrc

DOCUMENTATION:

Home Page: https://www.randomforestsrc.org/

License: GNU General Public License v3.0

R 32.90% M4 0.07% C 58.20% Java 7.94% Shell 0.01% CSS 0.47% HTML 0.42%


randomforestsrc's Issues

Maximal Subtrees Not Detecting Noisy Variables

I'm working with a dataset of around 70 variables in the competing-risks setting (the same dataset as in my last issue, although I've reduced my rows to 100,000). I expect some of these variables to be noisy while others are stronger, so I ran the max.subtree function to look at the top variables. However, the threshold returned is extremely high (about 22) while the highest order is only 11. Suspicious of these results, I created a fake, random covariate entirely unrelated to the response, introduced it into the model, and it received an order of about 6 (the threshold being 22).

Thinking that this might be unique to my dataset or down to random chance, I tried the survival example in the max.subtree documentation (the veteran dataset). In that example I introduced 10 extra random covariates, and they were all included in the top variables. (Not all of the original variables made it once I added the extra ones, which differs from my dataset, although they were all considered strong when I included no extra variables.) I've run it several times with the same results, so I know it's not that the random covariates are somehow related to the response by chance.

I don't know enough about maximal subtrees and their assumptions to know whether this is a problem with my dataset or not, but being able to reproduce it in the example dataset was surprising. Any insight would be appreciated.

Here is some R code for what I did in the example.

require(randomForestSRC)

data(veteran, package = "randomForestSRC")

for(j in 1:10){
  veteran[,paste0("random", j)] = rnorm(nrow(veteran))
}

v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)

v.max$order
v.max$threshold
v.max$topvars

Here is my sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForestSRC_2.6.1

loaded via a namespace (and not attached):
[1] compiler_3.4.4 parallel_3.4.4 tools_3.4.4

Complete crash on presence of NaN in covariate

Hello, I was working with a dataset that happened to have some NaNs (distinct from NAs) in one of the columns, and upon running rfsrc many ugly error messages appeared that led to R crashing completely. I've included example code below that replicates the problem.

I'm unsure how I would expect the package to treat NaNs, as they can be distinct from NAs in certain problems (mine included); but informative error messages identifying the presence of NaNs as the cause would be helpful, or perhaps a precheck that forces users to handle NaNs themselves.

Here's example code that triggers the problem. I suggest running it in base R, as RStudio doesn't always display all of the errors.

x = rnorm(100)
z = x + rnorm(100)

y = 5 + 2*x - z + rnorm(100)

d = data.frame(x,z,y)

require(randomForestSRC)

rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # so far so good - no issue

d$z[1:10] = NA

rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # Still works fine, though we impute

d$z[1] = NaN
rfsrc(y~x+z, d, na.action="na.impute", ntree=500) # massive failure
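As a stopgap on the user side, a small precheck along these lines (my own sketch, not part of the package) recodes NaN as NA before calling rfsrc, so that na.action = "na.impute" handles those entries:

```r
# Sketch of a user-side precheck: recode NaN (which trips up rfsrc)
# as NA (which na.action = "na.impute" can handle) in every numeric column.
nan_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[is.nan(col)] <- NA
    col
  })
  df
}

d <- nan_to_na(d)
rfsrc(y ~ x + z, d, na.action = "na.impute", ntree = 500)  # no NaNs remain
```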

installing error

install.packages("randomForestSRC")
Installing package into ‘/home/cnsun/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.mtu.edu/src/contrib/randomForestSRC_2.5.1.tar.gz'
Content type 'application/x-gzip' length 903705 bytes (882 KB)
==================================================
downloaded 882 KB

  * installing *source* package ‘randomForestSRC’ ...
    ** package ‘randomForestSRC’ successfully unpacked and MD5 sums checked
    checking for gcc... gcc -m64 -std=gnu99
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether gcc -m64 -std=gnu99 accepts -g... yes
    checking for gcc -m64 -std=gnu99 option to accept ISO C89... none needed
    checking for gcc -m64 -std=gnu99 option to support OpenMP... -fopenmp
    configure: creating ./config.status
    config.status: creating src/Makevars
    ** libs
    gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -I/usr/local/include -fopenmp -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -fpic -fPIC -c R_init_randomForestSRC.c -o R_init_randomForestSRC.o
    gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -I/usr/local/include -fopenmp -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -fpic -fPIC -c randomForestSRC.c -o randomForestSRC.o
    randomForestSRC.c: In function ‘updateProximity’:
    randomForestSRC.c:19867: error: expected end of line before ‘update’
    randomForestSRC.c:19872: error: expected end of line before ‘update’
    make: *** [randomForestSRC.o] Error 1
    ERROR: compilation failed for package ‘randomForestSRC’
  * removing ‘/home/cnsun/R/x86_64-redhat-linux-gnu-library/3.4/randomForestSRC’

The downloaded source packages are in
‘/tmp/RtmpA1tGwr/downloaded_packages’
Warning message:
In install.packages("randomForestSRC") :
installation of package ‘randomForestSRC’ had non-zero exit status

Error with distance=TRUE in rfsrc()

Hi,

When I tried rfsrc with distance=TRUE, I get an error like the one below. It looks like nativeOutput$distance is NULL.

data(airquality)
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.omit", distance=TRUE)
Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length

mtcars.unspv <- rfsrc(Unsupervised() ~., data = mtcars, distance=TRUE)
Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length

Thank you!

quantreg constant predictions on new data

I'm using version 2.9.0. When applying quantreg to get predictions on new data, the predicted quantile values appear to be constant.

Example based on an example in the quantreg documentation:

library(randomForestSRC)

set.seed(1)

o <- quantreg(mpg ~ ., mtcars[1:20,])

o.tst <- quantreg(object = o, newdata = mtcars[-(1:20),-1])

o$quantreg$quantiles      # not constant
o.tst$quantreg$quantiles  # constant in both directions

# Try on a subset of the original data
o.tst2 <- quantreg(object = o, newdata = mtcars[1:5, -1])

o.tst2$quantreg$quantiles  # constant

I might be misunderstanding something. Are these constant values on new data the expected behaviour?

Thanks.

Selecting Several columns as Multivar Responses

Hi there,
According to the help page for the rfsrc() function, the multivariate option can be specified using two different syntaxes:

  1. rfsrc(Multivar(y1, y2, ..., yd) ~ . , my.data, ...)
  2. rfsrc(cbind(y1, y2, ..., yd) ~ . , my.data, ...)

My question is how to select from a data.frame all the columns y1, y2, ..., yd when there are hundreds of response variables (d >= 100).

I have tried with positions:
rfsrc(cbind(261:279) ~., data = birds1)
and with some pattern matching:
rfsrc(Multivar(grep('y_', colnames(birds1), value = TRUE)) ~., data = birds1) # My response variables start with "y_NAME"

But it always returns some character string, and the answer is always
Error in parseFormula(formula, data, ytry) : the formula is incorrectly specified.

Any suggestions for selecting a large number of response variables without needing to write them all out?
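One workaround sketch (plain base R, not an official randomForestSRC API): build the formula as a string from the matched column names and convert it with as.formula:

```r
# Assumes the response columns of birds1 all start with "y_".
ynames <- grep("^y_", colnames(birds1), value = TRUE)

# Build "Multivar(y_A, y_B, ...) ~ ." as text, then parse it into a formula.
fmla <- as.formula(
  paste0("Multivar(", paste(ynames, collapse = ", "), ") ~ .")
)

fit <- rfsrc(fmla, data = birds1)
```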

Numerical Recipes Run-Time Error: Illegal indices in gvector

R crash on predict: (minimal example provided as attachment)
crash_rfsrc.zip

RF-SRC
RF-SRC: *** ERROR ***
RF-SRC: Numerical Recipes Run-Time Error:
RF-SRC:
Illegal indices in gvector().
RF-SRC: Please Contact Technical Support.<simpleError in doTryCatch(return(expr), name, parentenv, handler):
RF-SRC: The application will now exit.

Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.
Calls: predict -> predict.rfsrc -> generic.predict.rfsrc
Execution halted

RF-SRC
RF-SRC: *** ERROR ***
RF-SRC: Numerical Recipes Run-Time Error:
RF-SRC:
Illegal indices in gvector().
RF-SRC: Please Contact Technical Support. Error:
RF-SRC: The application will now exit.
Fatal error: error during cleanup

Invocation: Rscript crash_rfsrc.R

sessionInfo():
R version 3.5.2 (2018-12-20)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] randomForestSRC_2.8.0 forcats_0.3.0 stringr_1.3.1
[4] dplyr_0.7.8 purrr_0.3.0 readr_1.3.1
[7] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0
[10] tidyverse_1.2.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.2
[5] plyr_1.8.4 bindr_0.1.1 tools_3.5.2 jsonlite_1.6
[9] lubridate_1.7.4 gtable_0.2.0 nlme_3.1-137 lattice_0.20-38
[13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 rstudioapi_0.9.0
[17] parallel_3.5.2 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[21] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2
[25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0
[29] readxl_1.2.0 modelr_0.1.2 magrittr_1.5 backports_1.1.3
[33] scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-0
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 broom_0.5.1
[41] crayon_1.3.4

Negative Error Rate

Hello, I was hoping to get some context as to how I could end up with a negative predicted error rate for a model.

Competing Risks - Predicting on dataset without response can crash R

Hello, I've noticed that with competing-risks data, if I first predict on a dataset that has a response and then predict on a dataset without one, R crashes entirely. Here's a script that can reliably trigger it. The script causes a crash on all three computers I tested it on, but they all run Linux, so I don't know whether it's cross-platform or not.

set.seed(500)

n = 1500

data <- data.frame(x=rnorm(n), delta=sample(1:2, replace=TRUE, size=n))
data$T <- rexp(n, rate=ifelse(data$delta==1, 1/10, 1/15))

censorTimes <- rexp(n, rate=1/9)
data$delta = ifelse(data$T < censorTimes, 0, data$delta)
data$T = pmin(data$T, censorTimes)

trainingData <- data[1:1000,]
testData <- data[1001:1500,]

newData <- data.frame(x=rnorm(20))

library(randomForestSRC)

# Log-rank split rule is only used for speed; it still crashes on default splitrule
modelRfsrc = rfsrc(Surv(T, delta) ~ x, trainingData, 
                   ntree=1000, nodesize=10, mtry=1, 
                   nsplit=0, splitrule = "logrank")


testSetPredictions <- predict(modelRfsrc, testData)

# This line triggers the crash. I've tried sometimes running it before the predictions for testData
# and often it then *won't* crash, but it sometimes still does. It always triggers a crash though if
# I've run the predictions for testData before, even if before that I had successfully run this line.
newDataPredictions <- predict(modelRfsrc, newData)

Here's my sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForestSRC_2.8.0

loaded via a namespace (and not attached):
[1] compiler_3.4.4 parallel_3.4.4 tools_3.4.4    yaml_2.2.0   

Multivariate with real Ys

Hi there,

I am interested in multivariate prediction of real-valued Ys. However, I would like to know whether the prediction method takes the joint distribution of the Ys into account. The splitting function (weighted variance splitting) is explained for a single Y_{i}. Does this function change when the model is multivariate?
Would you recommend any particular approach for estimating prediction intervals here?

ERROR: attempt to apply non-function in predict using version 2.9

Hi, I am using the new version 2.9, and when I try to predict with a random survival forest on new test data I get the error "attempt to apply non-function". I had previously used the beta code (v2.8.0.11) you provided in another issue (issue #29) and my code ran without error. Nothing has changed in my code besides the updated package (I am using it in a Shiny app and can't use a locally installed package when deploying).

I believe I have traced the error to lines 311 and 321 in the 'generic.predict.rfsrc' function - when assigning the variable 'sampsize'.

The current line is 'sampsize <- round(object$sampsize(nrow(xvar)))', where object$sampsize is just an integer, so it crashes (and for me, nrow(xvar) is equal to object$sampsize). I looked at previous versions of this function (v2.8.0), where the line was 'sampsize <- object$sampsize', which seems correct.

Is there something else in the 'predict.rfsrc' function I am missing? I am calling it exactly the same way I had been, predict(model, newdata = new_data), and this happens with both competing-risks and survival models.

I REALLY need the code on CRAN to be updated, as I have to present my Masters project next Friday (May 31). Do you know if debugging this error and uploading to CRAN is something you're able to do soon? Thank you!!

Documentation says "without replacement" instead of "with replacement"

In the PDF manual, "without replacement" is written a number of times when "with replacement" is meant. For instance, the documentation for the samptype argument of rfsrc() reads

Choices are swor (sampling without replacement) and swr (sampling without replacement).

Likewise, for sampsize it says

For sampling without replacement, it is the requested size of the sample, which by default is .632 times the sample size. For sampling without replacement, it is the sample size.

Error when using predict on a large rfsrc classification model

Hello,
I am trying to use randomForestSRC on a classification problem.

Below the error message I get:

all scheduled cores encountered errors in user codeError in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class ‘"try-error"’ to a data.frame

I suspected the usage of a data.table object, but the following code worked fine.

library(data.table)
library(randomForestSRC)

data(iris)

dt_iris <- as.data.table(iris)
iris_modl_rfsrc      <- rfsrc(Species ~ ., data = dt_iris)
iris_pred_rfsrc.pred <- predict(object = iris_modl_rfsrc, newdata = dt_iris[, .SD, .SDcols = -"Species"])

Thanks for your help.

My sessionInfo:

version  R version 3.5.0 (2018-04-23)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.447)           
 language (EN)  
Packages --------------------------------------------------------------------------------------------------------------------
 package         * version  date       source        
 assertthat        0.2.0    2017-04-11 CRAN (R 3.5.0)
 backports         1.1.2    2017-12-13 CRAN (R 3.5.0)
 base            * 3.5.0    2018-04-25 local         
 base64enc         0.1-3    2015-07-28 CRAN (R 3.5.0)
 bindr             0.1.1    2018-03-13 CRAN (R 3.5.0)
 bindrcpp          0.2.2    2018-03-29 CRAN (R 3.5.0)
 colorspace        1.3-2    2016-12-14 CRAN (R 3.5.0)
 compiler          3.5.0    2018-04-25 local         
 data.table      * 1.10.4-3 2017-10-27 CRAN (R 3.5.0)
 datasets        * 3.5.0    2018-04-25 local         
 devtools          1.13.5   2018-02-18 CRAN (R 3.5.0)
 digest            0.6.15   2018-01-28 CRAN (R 3.5.0)
 dplyr             0.7.4    2017-09-28 CRAN (R 3.5.0)
 DT                0.4      2018-01-30 CRAN (R 3.5.0)
 evaluate          0.10.1   2017-06-24 CRAN (R 3.5.0)
 foreign           0.8-70   2018-04-23 CRAN (R 3.5.0)
 ggplot2           2.2.1    2016-12-30 CRAN (R 3.5.0)
 glue              1.2.0    2017-10-29 CRAN (R 3.5.0)
 graphics        * 3.5.0    2018-04-25 local         
 grDevices       * 3.5.0    2018-04-25 local         
 grid              3.5.0    2018-04-25 local         
 gtable            0.2.0    2016-02-26 CRAN (R 3.5.0)
 htmltools         0.3.6    2017-04-28 CRAN (R 3.5.0)
 htmlwidgets       1.2      2018-04-19 CRAN (R 3.5.0)
 httr              1.3.1    2017-08-20 CRAN (R 3.5.0)
 jsonlite          1.5      2017-06-01 CRAN (R 3.5.0)
 knitr             1.20     2018-02-20 CRAN (R 3.5.0)
 lattice           0.20-35  2017-03-25 CRAN (R 3.5.0)
 lazyeval          0.2.1    2017-10-29 CRAN (R 3.5.0)
 magrittr          1.5      2014-11-22 CRAN (R 3.5.0)
 maptools          0.9-2    2017-03-25 CRAN (R 3.5.0)
 memoise           1.1.0    2017-04-21 CRAN (R 3.5.0)
 methods         * 3.5.0    2018-04-25 local         
 munsell           0.4.3    2016-02-13 CRAN (R 3.5.0)
 parallel          3.5.0    2018-04-25 local         
 pillar            1.2.1    2018-02-27 CRAN (R 3.5.0)
 pkgconfig         2.0.1    2017-03-21 CRAN (R 3.5.0)
 plotly            4.7.1    2017-07-29 CRAN (R 3.5.0)
 plyr              1.8.4    2016-06-08 CRAN (R 3.5.0)
 purrr             0.2.4    2017-10-18 CRAN (R 3.5.0)
 R6                2.2.2    2017-06-17 CRAN (R 3.5.0)
 randomForestSRC * 2.6.0    2018-05-02 CRAN (R 3.5.0)
 Rcpp              0.12.16  2018-03-13 CRAN (R 3.5.0)
 rgeos             0.3-26   2017-10-31 CRAN (R 3.5.0)
 rlang             0.2.0    2018-02-20 CRAN (R 3.5.0)
 rmarkdown         1.9      2018-03-01 CRAN (R 3.5.0)
 rprojroot         1.3-2    2018-01-03 CRAN (R 3.5.0)
 scales            0.5.0    2017-08-24 CRAN (R 3.5.0)
 sp                1.2-7    2018-01-19 CRAN (R 3.5.0)
 splitstackshape * 1.4.4    2018-03-29 CRAN (R 3.5.0)
 stats           * 3.5.0    2018-04-25 local         
 stringi           1.1.7    2018-03-12 CRAN (R 3.5.0)
 stringr           1.3.0    2018-02-19 CRAN (R 3.5.0)
 tibble            1.4.2    2018-01-22 CRAN (R 3.5.0)
 tidyr             0.8.0    2018-01-29 CRAN (R 3.5.0)
 tools             3.5.0    2018-04-25 local         
 utils           * 3.5.0    2018-04-25 local         
 viridisLite       0.3.0    2018-02-01 CRAN (R 3.5.0)
 withr             2.1.2    2018-03-15 CRAN (R 3.5.0)
 yaml              2.1.18   2018-03-08 CRAN (R 3.5.0)

Are 10 RFSRC random forests with 50 trees equivalent to one RFSRC model with 500 trees?

Hello Sir,
I have a data of 500,000 observations.
I want to run RFSRC on it with 500 trees. But it requires lot of memory. So, I came up with a possible solution that first I will make 10 random survival forests with 50 trees each, each time on entire data, each with a different seed (I use seeds: 1001, 1002, 1003, and so on till 1010) and then average the results for each member/observation obtained from the above 10 models, by adding the 10 results and dividing by 10 (i.e., survival probabilities per member/observation per month for 24 months, which is the time period for which probabilities have to be forecasted, are obtained by averaging the 10 probabilities from 10 models for that member for that month). I thought that the result of this would be equivalent to the result of one 500 tree RFSRC model.
But surprisingly, the accuracy of the 10 combined models is much worse than the accuracy of even a single 50 survival tree model on the entire data. Yes, that's 50, not 500.
Why is this happening, as per your opinion? Can I do anything to simulate a 500 tree RFSRC model?
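For reference, the averaging described above can be sketched as follows (my own reconstruction; my.data is a placeholder for the actual dataset, and it assumes the survival probabilities are read from the survival.oob component of each fit, with all fits sharing the same event-time grid since they use the same data):

```r
library(randomForestSRC)

seeds <- 1001:1010
fits <- lapply(seeds, function(s) {
  set.seed(s)
  rfsrc(Surv(time, status) ~ ., data = my.data, ntree = 50)
})

# Average the out-of-bag survival curves across the 10 forests.
# Each $survival.oob is an n x length(time.interest) matrix.
surv.avg <- Reduce(`+`, lapply(fits, `[[`, "survival.oob")) / length(fits)
```

Note that each small forest computes its out-of-bag estimate on a different out-of-bag set, so this average is not the same quantity as the OOB estimate of a single 500-tree forest.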

error on distance = "oob"

This is the code I ran (an unsupervised random forest) and the error message:

urf.elm <- rfsrc(data = ap, ntree = 10000, proximity = "oob", distance = "oob")

Error in distance.out[k, 1:k] <- nativeOutput$distance[(count + 1):(count + :
number of items to replace is not a multiple of replacement length

My data has no NAs and only numeric variables.

Thanks for the cool package!

SIGSEGV in randomForestSRC.so - virtuallySplitNode

  • thread #1, name = 'R', stop reason = signal SIGSEGV: invalid address (fault address: 0x89de0)
    frame #0: 0x00007fffe218fde0 randomForestSRC.so`virtuallySplitNode(treeID=0, factorFlag='\xb9', mwcpSizeAbsolute=3795600968, randomCovariate=0, repMembrIndx=0x000000002bd4a2b4, repMembrSize=910216212, nonMissMembrIndx=0x000000003640cc14, nonMissMembrSize=72704, indxx=0x000000001bc63284, splitVectorPtr=0x0000000013197a28, offset=1, localSplitIndicator="\x85", leftSize=0x00007ffffffa8c40, priorMembrIter=0, currentMembrIter=0x00007ffffffa8c3c) at randomForestSRC.c:14434
    14431 }
    14432 }
    14433 else {
    -> 14434 if ((((double*) splitVectorPtr)[offset] - RF_observation[treeID][randomCovariate][ repMembrIndx[nonMissMembrIndx[indxx[*currentMembrIter]]] ]) >= 0.0) {
    14435 daughterFlag = LEFT;
    14436 }
    14437 else {

Install Error: identifier "M_E" is undefined; compilation aborted

Dear Professor,

I've tried to install the randomForestSRC package on a CentOS system, but I failed. With a Windows 7 system I at last succeeded by running
install.packages("randomForestSRC", dependencies = T, repos = 'http://cran.rstudio.com/')
(I had also failed when directly installing the downloaded package on Windows 7.)
I'm sorry that I know little about compilation. I hope you can give me some help to solve the problem.

Try1:

At first I tried to install the package with install.packages('randomForestSRC_2.5.1.tar.gz') on the CentOS system, as I had downloaded the package file into the working directory. But it incurs the error below:

Installing package into ‘/public/home/pengruijiao/R/x86_64-pc-linux-gnu-library/3.3’
(as ‘lib’ is unspecified)
inferring 'repos = NULL' from 'pkgs'
* installing *source* package ‘randomForestSRC’ ...
** package ‘randomForestSRC’ successfully unpacked and MD5 sums checked
checking for gcc... icc -std=gnu99
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether icc -std=gnu99 accepts -g... yes
checking for icc -std=gnu99 option to accept ISO C89... none needed
checking for icc -std=gnu99 option to support OpenMP... -fopenmp
configure: creating ./config.status
config.status: creating src/Makevars
** libs
icc -std=gnu99 -I/public/software/R-3.3.3/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp  -fpic  -g -O2 -std=c99  -c R_init_randomForestSRC.c -o R_init_randomForestSRC.o
icc: command line warning #10121: overriding '-std=gnu99' with '-std=c99'
icc -std=gnu99 -I/public/software/R-3.3.3/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp  -fpic  -g -O2 -std=c99  -c randomForestSRC.c -o randomForestSRC.o
icc: command line warning #10121: overriding '-std=gnu99' with '-std=c99'
randomForestSRC.c(2336): error: identifier "M_E" is undefined
                    RF_vimpCLSptr[p][j][k] = M_E * result / (double) cumDenomCount;
                                             ^

compilation aborted for randomForestSRC.c (code 2)
make: *** [randomForestSRC.o] Error 2
ERROR: compilation failed for package ‘randomForestSRC’
* removing ‘/public/home/pengruijiao/R/x86_64-pc-linux-gnu-library/3.3/randomForestSRC’
Warning message:
In install.packages("randomForestSRC_2.5.1.tar.gz") :
  installation of package ‘randomForestSRC_2.5.1.tar.gz’ had non-zero exit status

Try2:

Then I found instructions about installing the package, although they are mainly about enabling OpenMP:
http://ccs.miami.edu/~hishwaran/rfsrc.html, and I followed method 1:

1. Download the package source code randomForestSRC_X.x.x.tar.gz. The X's indicate the version posted. Do not download the binary.

2. Open a console, navigate to the directory containing the tarball, and untar it using the command

tar -xvf randomForestSRC_X.x.x.tar.gz

3. This will create a directory structure with the root directory of the package named randomForestSRC. Change into the root directory of the package using the command

cd randomForestSRC

4. Run autoconf using the command

autoconf

5. Change back to your working directory using the command

cd ..

From your working directory, execute the command

R CMD INSTALL --preclean --clean randomForestSRC

on the modified package. Ensure that you do not target the unmodified tarball, but instead act on the directory structure you just modified.

But the same error occurs, just like the error above.

Try3:

And I tried to install the package on Windows 7. At first I failed, but I succeeded later with
install.packages("randomForestSRC", dependencies = T, repos = 'http://cran.rstudio.com/').
I tried the same command on CentOS, but it just didn't work. The error is unchanged from above, except that it downloads something before the installation:

Content type 'application/x-gzip' length 903705 bytes (882 KB)
==================================================
downloaded 882 KB

Competing risks on big dataset (Windows only)

I get some "unknown software exceptions" when I run the example below on Windows (32 GB Windows 10). It runs fine when I reduce to n=33000. It also works fine with n=34000 on my 2017 MacBook Pro (8GB) and on a big Linux machine.

library(randomForestSRC)

n <- 34000
p <- 4

x <- replicate(p, rnorm(n))
time <- round(runif(n, 0, 100))
status <- round(runif(n , 0, 2))
dat <- data.frame(time = time, status = status, x)

rfsrc(Surv(time, status) ~ ., dat, ntree = 5, cause = 1)

I have randomForestSRC 2.9.1 (latest from CRAN). I tried R 3.5.1 and 3.6.0 with the same result.

Error rate is constant, independent of the number of trees

Both in the example provided:

# Veteran's Administration Lung Cancer Trial. Randomized
# trial of two treatment regimens for lung cancer.
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# Plot the error.
plot(v.obj)

# Plot the survival estimates.
plot.survival(v.obj)

and with my own data the error rate is constant (horizontal line), independent of the number of trees. Is this normal, i.e. the expected behaviour?

max.subtree errors

Hello, the max.subtree function is throwing an error "Error in if (local.obj$stumpCnt == 0) { : argument is of length zero"

At first I thought there was a problem with my data, but when I ran the example provided with the function in the package documentation the same error occurred. To reproduce error run:

## ------------------------------------------------------------
## survival analysis
## first and second order depths for all variables
## ------------------------------------------------------------

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)

#v.max will not exist because "Error in if (local.obj$stumpCnt == 0) { : argument is of length zero"

Test data requires y column for prediction

I have a problem where I'm trying to predict 15 minutes into the future. I've set up my model and have good results, but when I try to apply the model to current data where my y-vals are still unknown, the predict(model,test) won't give me any output for those rows.

I've tried:

  1. Not supplying the yvar data in the data.frame for predict(). This silently errors, predicting 0 for all rows.

  2. Using na.action = "na.impute". This gives me correct output, but spends a significant amount of time (6x longer than predict with "na.omit" in my case). Since there are no NAs in my xvars, I'm assuming it is imputing values for irrelevant data, such as unused variables in my data.frame or the yvars.

  3. Supplying yvars as part of a separate data.frame (ex. rfsrc(yvars$yval ~ xvar1 + xvar2, data=train)). This fails with Error in parseFormula(formula, data, ytry) : formula is incorrectly specified.

Is there any way that I can supply y vars separately, such that they aren't needed for the prediction phase?

quantreg crashing R

In the news file there is mention of quantileReg(), which I cannot seem to find. Further, the quantreg() command seems to crash R when used. This is version 2.9.1 and R (64-bit) version 3.6.1. I am trying to use a continuous outcome with only two predictors and 1000 cases. The crash seems independent of the data and occurs on multiple machines.

competing risk survival analysis

I am using rfsrc to build a competing-risks survival random forest. The model builds fine without error but fails at prediction. The following is an example I took from the "survival" package's vignette "compete":

data("mgus2")
etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
mgus2$etime <- etime
mgus2$event <- event
xx <- rfsrc(Surv(etime, event)~sex, data=mgus2)
predict(xx)

I got error :

Error in Math.factor(cens) : ‘floor’ not meaningful for factors

Enter a frame number, or 0 to exit

1: predict(xx)
2: predict.rfsrc(xx)
3: generic.predict.rfsrc(object, newdata, m.target = m.target, importance = importance, err.block = err.block, na.action = na
4: get.event.info(object)
5: Math.factor(cens)

After looking into the function "get.event.info", I see that it fails at

      if (!all(floor(cens) == abs(cens), na.rm = TRUE)) {
        stop("for survival families censoring variable must be coded as a non-negative integer")
      }

This stop message contradicts the competing risk requirement that the event be a factor. Am I misunderstanding something, or does the package not support competing risk survival?
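For what it's worth, the check quoted above suggests a workaround (an assumption on my part, not a documented fix): keep the event coded as a non-negative integer, with 0 for censoring and 1, 2, ... for the competing event types, instead of converting it to a factor. A sketch based on the vignette code above:

```r
library(survival)         # for mgus2 and Surv()
library(randomForestSRC)

data("mgus2", package = "survival")
mgus2$etime <- with(mgus2, ifelse(pstat == 0, futime, ptime))
# keep integer codes: 0 = censored, 1 = pcm, 2 = death (no factor())
mgus2$event <- with(mgus2, ifelse(pstat == 0, 2 * death, 1))

xx <- rfsrc(Surv(etime, event) ~ sex, data = mgus2)
predict(xx)
```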

My system information

version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On
randomForestSRC version: 2.6.0
survival version: 2.41-3

failed attempt to build on mac osx with openmp

Hi,

I am trying to compile randomForestSRC for use with OpenMP, following the instructions at

https://kogalur.github.io/randomForestSRC/building.html

As you can see below, I have clang 8, gfortran 6.1.0, ant 1.10.7, and Java 1.8.0 (macOS Mojave 10.14.6). The problem appears to be a non-existent directory when attempting `ant source-cran`. Perhaps I've got the build.xml file in the wrong directory to begin with. Not sure. Advice appreciated. Best, -- Jay

math172m-01:tmp jay$ echo $PATH
/usr/local/ant/bin:/usr/local/gfortran/bin:/usr/local/clang8/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin

math172m-01:tmp jay$ clang --version
clang version 8.0.0 (tags/RELEASE_800/final)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /usr/local/clang8/bin

math172m-01:tmp jay$ gfortran --version
GNU Fortran (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

math172m-01:tmp jay$ ant -version
Apache Ant(TM) version 1.10.7 compiled on September 1 2019

math172m-01:tmp jay$ java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)

math172m-01:tmp jay$ ant source-cran
Buildfile: /Users/jay/Desktop/tmp/build.xml

init:
[echo] --------- randomForestSRC ---------
[echo]
[echo] Version: 2.9.1
[echo] Build: bld20190708a
[echo]
[echo] Date: 2019-10-04
[echo] Time: 04:07:42
[echo]
[echo] Platform Details:
[echo] OS name Mac OS X
[echo] OS version 10.14.6
[echo] OS arch x86_64
[echo] Java arch 64

clean-cran:
[delete] Deleting directory /Users/jay/Desktop/tmp/target/cran

source-cran:
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/inst
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/data
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/man
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/R
[mkdir] Created dir: /Users/jay/Desktop/tmp/target/cran/randomForestSRC/src

BUILD FAILED
/Users/jay/Desktop/tmp/build.cran.xml:29: /Users/jay/Desktop/tmp/src/main/resources/cran does not exist.

Total time: 0 seconds
math172m-01:tmp jay$

Negative Error Rate in Competing Risk Setting

Hello, I ran a model on a large competing risk dataset (250,000 observations and 74 covariates). I wasn't able to use more observations without running into errors about allocating vectors longer than 32 bits allow, but with 250,000 rows it ran without complaint. I only mention this because it may be related.

Anyway, my output from print.rfsrc(model) is:

                         Sample size: 250000
                    Number of events: 102523, 22320
                    Was data imputed: yes
                     Number of trees: 10000
           Forest terminal node size: 6
       Average no. of terminal nodes: 1888.076
No. of variables tried at each split: 9
              Total no. of variables: 74
                            Analysis: RSF
                              Family: surv-CR
                      Splitting rule: logrankCR *random*
       Number of random split points: 3
                          Error rate: -15.41%, 34.04%

The error rate for the first event is negative 15.41%, which, if I understand how the error rate is calculated with the concordance index, shouldn't be possible.
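For context, a concordance-based error rate is a proportion of discordant usable pairs, so it should always fall in [0, 1]. A minimal sketch of such an estimator (my own illustration, not the package's exact calculation) makes the bound explicit:

```r
# toy concordance error for survival: fraction of usable pairs where the
# subject with the shorter time has the LOWER predicted risk (discordant)
cindex_error <- function(time, status, risk) {
  num <- den <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] < time[j] && status[i] == 1) {        # pair is usable
      den <- den + 1
      if (risk[i] < risk[j]) num <- num + 1
      else if (risk[i] == risk[j]) num <- num + 0.5   # ties count half
    } else if (time[j] < time[i] && status[j] == 1) {
      den <- den + 1
      if (risk[j] < risk[i]) num <- num + 1
      else if (risk[i] == risk[j]) num <- num + 0.5
    }
  }
  num / den   # a proportion, so always between 0 and 1
}
```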

Here is the call I made: model = rfsrc(formula = Surv(u, delta) ~ . - sub_grade, data = data, ntree = 10000, nsplit = 3, importance = "none", na.action = "na.impute", ntime = 0:37, cause = 1, proximity = FALSE, sampsize = 10000, forest.wt = FALSE)

Here is the sessionInfo() on the machine that trained the model:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] survival_2.41-3       randomForestSRC_2.6.1

loaded via a namespace (and not attached):
[1] compiler_3.4.4  Matrix_1.2-12   parallel_3.4.4  splines_3.4.4  
[5] grid_3.4.4      lattice_0.20-35

For reference I earlier ran the same call on a smaller subset of 100,000 rows which gave error rates of 42.63%, 34.17%.

Split rules for Survival Analysis, and speed

Dear Professor,

I'm a PhD student in Actuarial Sciences and I'm working on the topic of survival analysis. Currently I'm studying a random forest method which aims to model E[phi(T)|X], where:

  • T is a censored time random variable
  • X is a vector of covariates
  • phi is a given real function

I'm using the RSF algorithm from the randomForestSRC package as a benchmark for my method. There is a presentation here if you are curious.

I have a small problem since I need to run repeated experiments on large datasets (from 10000 to 100000 observations). I found the rfsrc function a bit slow on data of this size. I followed the advice given in the function documentation to reduce computation time:

  • setting nsplit to a small value
  • setting nodesize to a higher value
  • manually specifying the ntime parameter
  • etc.
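For concreteness, those settings combine into a call like the sketch below (the specific values are illustrative, not recommendations):

```r
library(randomForestSRC)
library(survival)
data(veteran, package = "randomForestSRC")

fit <- rfsrc(Surv(time, status) ~ ., data = veteran,
             ntree    = 500,
             nsplit   = 10,   # few random candidate split points per variable
             nodesize = 15,   # larger terminal nodes -> shallower trees
             ntime    = 50)   # coarser time grid for the ensemble estimates
```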

In fact my question is about the split rules you mention in the article "Random Survival Forest (2008)" and the R vignette "Random Survival Forests for R (2007)".

In "Random Survival Forests for R (2007)", you talk about different split rules:

  • logrank
  • logrank score splitting
  • approximate logrank score splitting
  • Conservation of events splitting (does it apply to right-censored data?)

In "Random Survival Forest (2008)", in the "Empirical Comparisons" paragraph, you mention :

  • logrank
  • logrank score
  • logrank random
  • Conservation of events splitting

So "approximate logrank" has been replaced by "logrank random". My question is: can you confirm that the "approximate logrank" split rule is not featured in today's randomForestSRC package?

Finally, I would like to thank you for the great RSF algorithm!

Best,
Yohann le Faou

Error when tibble has factor column.

When a data.frame with a factor column is turned into a tibble, it seems that the package can no longer handle the factor column.

sessionInfo()                                                                                
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.0  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
#>  [5] tools_3.5.0     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.16   
#>  [9] stringi_1.2.2   rmarkdown_1.9   knitr_1.20      stringr_1.3.1  
#> [13] digest_0.6.15   evaluate_0.10.1
library("randomForestSRC")                                                                   
#> 
#>  randomForestSRC 2.6.1 
#>  
#>  Type rfsrc.news() to see new features, changes, and bug fixes. 
#> 
data(veteran, package = "randomForestSRC")                                                   
veteran$trt <- factor(veteran$trt)                                                           
rfsrc(Surv(time, status) ~ trt, data = veteran, ntree = 100, tree.err=TRUE)                  
#>                          Sample size: 137
#>                     Number of deaths: 128
#>                      Number of trees: 100
#>            Forest terminal node size: 3
#>        Average no. of terminal nodes: 2
#> No. of variables tried at each split: 1
#>               Total no. of variables: 1
#>                             Analysis: RSF
#>                               Family: surv
#>                       Splitting rule: logrank
#>                           Error rate: 73.17%
rfsrc(Surv(time, status) ~ trt, data = dplyr::as_tibble(veteran), ntree = 100, tree.err=TRUE)
#> 
#> RF-SRC:  *** ERROR *** 
#> RF-SRC:  X-var factor level in data inconsistent with number of levels indicated:  [         1] =          1 vs.          0
#> RF-SRC:  Please Contact Technical Support.<simpleError in doTryCatch(return(expr), name, parentenv, handler): 
#> RF-SRC:  The application will now exit.
#> >
#> Error in rfsrc(Surv(time, status) ~ trt, data = dplyr::as_tibble(veteran), : An error has occurred in the grow algorithm.  Please turn trace on for further analysis.
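Until the tibble path is handled, a workaround I would expect to help (an untested assumption on my part): coerce the tibble back to a plain data.frame before the call.

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")
veteran$trt <- factor(veteran$trt)

# coerce the tibble back to a base data.frame before fitting
rfsrc(Surv(time, status) ~ trt,
      data = as.data.frame(dplyr::as_tibble(veteran)),
      ntree = 100, tree.err = TRUE)
```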

Enhancement: export functions that operate on rfsrc objects to collect summary statistics

Hello Udaya,

Perhaps it would be useful to export functions to programmatically collect rfsrc object summary statistics.

For instance:

   ## ---- extract_rf_brier
   #' Extract a Brier score from a randomfrestSRC object.
   #'
   #' @param x rfsrc object. An rfsrc object to extract from.
   #'
   #' @export extract_rf_brier
   #' @md

   extract_rf_brier <- function(x){

     if (x$family %like% "class"){  # %like% is from data.table
       if (!is.null(x$err.rate)){

         conf.matx <- table(x$yvar,
                            if (!is.null(x$class.oob) && !all(is.na(x$class.oob)))
                              x$class.oob else x$class)
         conf.matx <- cbind(conf.matx,
                            class.error = round(1 - diag(conf.matx) /
                              rowSums(conf.matx, na.rm = TRUE), 4))
         names(dimnames(conf.matx)) <- c("  observed", "predicted")

         .brier <- function(ytest, pred){
           cl <- colnames(pred)
           mean(sapply(seq_along(cl), function(k){
             mean((1 * (ytest == cl[k]) - pred[, k])^2, na.rm = TRUE)
           }), na.rm = TRUE)
         }
         brierS <- .brier(x$yvar,
                          if (!is.null(x$predicted.oob) && !all(is.na(x$predicted.oob)))
                            x$predicted.oob else x$predicted)
       } else {
         conf.matx <- brierS <- NULL
       }
     } else {
       return(NA)
     }
     # closing return added; the snippet as posted ended without one
     list(confusion = conf.matx, brier = brierS)
   }

Sincerely,
Andrew

c-index distinction

For the comparison of the RSF model to the mixed outcome model (page 48 of the CRAN docs), why is one computed with get.cindex and the other with 1 - get.cindex?

Displaying sample tree

Hi - is there a way to display one of the trees in the forest? I'm able to get the split statistics on the variables using the stat.split function, but I'm having a little trouble parsing what an actual tree looks like.

Thanks.

How to send PR?

How can one send pull requests to this repository for the R package?

I'm not familiar with the build process used in this package, but I've contributed to many R packages. Just wondering if you have a README on how to effectively send a PR for the package and have it checked with a CI service such as travis-ci.com.

@wshannon01 @mattrosen

randomForestSRC and generic functions - calling rfsrc from Python

I have been following the instructions on Working with R's OOPS in the rpy2 documentation here: https://rpy2.readthedocs.io/en/version_2.8.x/robjects_oop.html and I am trying to create a Python class to call the function rfsrc in the R package randomForestSRC.

When I run the code below from a Jupyter Notebook (Python 3, R 3.5.1), I get the error:
Error in (function (f, signature = character(), where = topenv(parent.frame()), : no generic function found for 'rfsrc'.

Does this mean that I cannot call rfsrc from Python? Thanks.

import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector

utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
packnames = ('randomForestSRC', 'survival', 'tidyverse', 'magrittr', 'ggRandomForests', 'mlr')
utils.install_packages(StrVector(packnames))

from rpy2.robjects.packages import importr
randomForestSRC = importr('randomForestSRC')
from rpy2.robjects.methods import RS4Auto_Type
import six

class rfsrc(six.with_metaclass(RS4Auto_Type)):
    __rname__ = 'rfsrc'
    __rpackagename__ = 'randomForestSRC'

custom split not getting registered:

Hi @kogalur ,
I wrote a custom split function, getCustomSplitStatisticMultivariateRegressionTwo(), and followed the steps exactly as described, i.e.:

  • wrote the definition for getCustomSplitStatisticMultivariateRegressionTwo()
  • declared it in splitCustom.h
  • Registered it by calling registerThis (&getCustomSplitStatisticMultivariateRegressionTwo, REGR_FAM, 2); inside registerCustomFunctions()
  • compiled the sourced code successfully and installed the library in my default library path.

After doing the above, when I try to grow a tree using the split rule "custom2", I am getting the error:


RF-SRC: *** ERROR ***
RF-SRC: Custom split rule not registered: 2
RF-SRC: Please register the rule and recompile the package.<simpleError in doTryCatch(return(expr), name, parentenv, handler):

Kindly let me know if I am missing anything.

Regards,
Vinodh

Will not install; fails with this error no matter what type of install I try

ic -fpic -fPIC -c randomForestSRC.c -o randomForestSRC.o
randomForestSRC.c: In function ‘updateGenericVimpEnsemble’:
randomForestSRC.c:2361: error: expected end of line before ‘update’
randomForestSRC.c: In function ‘updateProximity’:
randomForestSRC.c:20712: error: expected end of line before ‘update’
randomForestSRC.c:20717: error: expected end of line before ‘update’
make: *** [randomForestSRC.o] Error 1
ERROR: compilation failed for package ‘randomForestSRC’

  • removing ‘/root/R/x86_64-redhat-linux-gnu-library/3.4/randomForestSRC’

Error when trying to predict using version 2.7.0

I obtain the following error when trying to predict an rfsrc object:

<simpleError in (object$nativeFactorArray)$mwcpPT: $ operator is invalid for atomic vectors>
Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.

This is the first time I have seen this error in months, as I am just loading the same RDS model object and generating predictions on new data. I even tried predicting on the data that was used to build the model, and I get the same error.

The model was built using package version 2.5.1.
I only started to see this error once I installed the latest randomForestSRC package, version 2.7.0.
To confirm, I reverted back to version 2.5.1 and predict started working again.

OOB error rate plot broken with version 2.6.0

Issue

It seems that the error rate plot method was broken in release 2.6

library(randomForestSRC, verbose = TRUE)

randomForestSRC 2.6.1

Type rfsrc.news() to see new features, changes, and bug fixes.

Using the example from the help file, you can see that the plot outputs a constant error rate, just as when the tree.err option is set to FALSE:

## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err=TRUE)

plot(v.obj)

[screenshot: plot showing a flat, constant OOB error rate]

You get the same plot when explicitly setting tree.err = FALSE:

v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = FALSE)

plot(v.obj)

[screenshot: the same flat error rate plot]

The variable importance plot is still working:

v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = TRUE,
                     importance = TRUE)

plot(v.obj)

[screenshot: variable importance plot rendering correctly]

Version 2.5.1 was working fine. I removed version 2.6.1, installed 2.5.1, and got the correct OOB error rate plot.

remove.packages("randomForestSRC", lib="~/R/win-library/3.5")
install.packages("C:/Users/*******/Downloads/randomForestSRC_2.5.1.tar.gz", repos = NULL, type = "source")
library(randomForestSRC, verbose = TRUE)

randomForestSRC 2.5.1

Type rfsrc.news() to see new features, changes, and bug fixes.

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err = TRUE)
plot(v.obj)

[screenshot: correct OOB error rate curve under version 2.5.1]

Session Info for reference:

sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] randomForestSRC_2.5.1

loaded via a namespace (and not attached):
[1] compiler_3.5.0 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2 parallel_3.5.0 htmltools_0.3.6 tools_3.5.0
[8] yaml_2.1.19 Rcpp_0.12.17 stringi_1.2.2 rmarkdown_1.9 knitr_1.20 stringr_1.3.1 digest_0.6.15
[15] evaluate_0.10.1

Any help would be appreciated, thanks!

Getting the raw boxplot data from "subsample" method

I was trying to get the raw data used by the plot function plot.subsample(model.sm.rf, ...) in order to use the data with my other analysis functions and harmonize the look of my figures. However, I had some difficulty getting it.

From /src/R/plot.subsample.rfsrc.R I noticed that the plot function calls extract.subsample.rfsrc.local(obj = ...), whose boxplot.dta variable supposedly contains the values from each run (e.g. B = 100). However, all the values in that data frame were identical, which I did not quite understand, since the standard plot function works fine.

I managed to use the var.jk.sel.Z variable, which contains the mean value along with the upper and lower bounds.

[screenshot: boxplot produced by plot.subsample]

Any idea why the following call does not contain the actual results from the various runs?

oo <- extract.subsample(x, alpha = alpha, target = target, standardize = standardize)
    boxplot.dta <- oo$boxplot.dta

Conditional Logistic Regression

Is it possible to use randomForestSRC to evaluate a special form of Cox PH regression called conditional logistic regression, of the form:

coxph(formula = Surv(rep(1, 200L), event) ~ group + strata(id), 
    method = "exact")

Thanks

Log-rank split rule for competing risks not always choosing optimal split

Hello, according to the documentation, log-rank splitting for competing risks data with cause specified tries to maximize the log-rank test statistic for that cause. However, I've discovered that in certain datasets the chosen split is not the one that actually maximizes the statistic, although it is close. Below is a script that replicates what I've seen; here is the data.txt used in the script.

library(survival)
library(randomForestSRC)

data <- read.csv("data.txt") # Github won't let me upload a .csv

# We use no bootstrapping so that results can be replicated, 
# one tree with a maximum node depth of 1 so that there's only one split to look at. 
# nsplit=0 so that the optimal split can be selected. 
# cause=2 because interestingly cause=1 is optimal.
rfsrc.model <- rfsrc(Surv(u, delta) ~ x, data, ntree=1, bootstrap="none", nodedepth = 1, nsplit = 0, cause=2)
rfsrcIsLeftHand <- data$x <= rfsrc.model$forest$nativeArray[1,4]
rfsrc.model$forest$nativeArray # split chosen to be <= 0.0370275

# Other theoretical split on x
otherPossibleLeftHand <- data$x <= 0.0225499335063258


newData <- data.frame(u=data$u, delta=data$delta, rfsrcIsLeftHand, otherPossibleLeftHand)
newData$isEvent1 <- newData$delta==1
newData$isEvent2 <- newData$delta==2

# Survdiff from the survival package runs by default a log-rank test

survdiff(Surv(u, isEvent2)~rfsrcIsLeftHand, newData)
# Chi-sq value of 76.5

survdiff(Surv(u, isEvent2)~otherPossibleLeftHand, newData)
# Chi-sq value of 77.4; higher.

Here is my sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForestSRC_2.8.0 survival_2.43-3      

loaded via a namespace (and not attached):
[1] compiler_3.4.4  Matrix_1.2-12   parallel_3.4.4  tools_3.4.4     yaml_2.2.0      splines_3.4.4   grid_3.4.4     
[8] lattice_0.20-35

Error "factor level in data inconsistent" during tuning

The error "factor level in data inconsistent with number of levels indicated" is thrown in the middle of a tuning run.
The train/test split is not modified during the tuning process.
An example is attached (failrfsrc.tgz): ~> Rscript rfsrc_bug.R

failrfsrc.tar.zip

[Tune] Started tuning learner classif.randomForestSRC for parameter set:
Type len Def Constr Req Tunable Trafo
ntree integer - - 100 to 500 - TRUE -
mtry integer - - 5 to 50 - TRUE -
nodesize integer - - 1 to 10 - TRUE -
nodedepth integer - - 3 to 16 - TRUE -
nsplit integer - - 1 to 50 - TRUE -
bootstrap discrete - - by.root - TRUE -
With control class: TuneControlMBO
Imputation value: -0
[Tune-x] 1: ntree=261; mtry=10; nodesize=7; nodedepth=12; nsplit=48; bootstrap=by.root
[Tune-y] 1: acc.test.mean=0.7435397; time: 0.0 min
[Tune-x] 2: ntree=383; mtry=40; nodesize=2; nodedepth=8; nsplit=38; bootstrap=by.root
[Tune-y] 2: acc.test.mean=0.7747563; time: 0.1 min
[Tune-x] 3: ntree=196; mtry=41; nodesize=4; nodedepth=7; nsplit=17; bootstrap=by.root
[Tune-y] 3: acc.test.mean=0.7657664; time: 0.1 min
[Tune-x] 4: ntree=317; mtry=34; nodesize=9; nodedepth=12; nsplit=31; bootstrap=by.root
[Tune-y] 4: acc.test.mean=0.7219982; time: 0.1 min
[Tune-x] 5: ntree=452; mtry=18; nodesize=5; nodedepth=9; nsplit=34; bootstrap=by.root
[Tune-y] 5: acc.test.mean=0.7483707; time: 0.1 min
[Tune-x] 6: ntree=283; mtry=27; nodesize=2; nodedepth=10; nsplit=14; bootstrap=by.root
[Tune-y] 6: acc.test.mean=0.7703074; time: 0.1 min
[Tune-x] 7: ntree=438; mtry=14; nodesize=7; nodedepth=12; nsplit=23; bootstrap=by.root
[Tune-y] 7: acc.test.mean=0.7438340; time: 0.1 min
[Tune-x] 8: ntree=326; mtry=39; nodesize=2; nodedepth=14; nsplit=33; bootstrap=by.root
[Tune-y] 8: acc.test.mean=0.7702152; time: 0.1 min
[Tune-x] 9: ntree=237; mtry=17; nodesize=5; nodedepth=16; nsplit=2; bootstrap=by.root
Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
write error, closing pipe to the parent process
Calls: train ... extract.factor -> mclapply -> lapply -> FUN -> sendMaster

RF-SRC: *** ERROR ***
RF-SRC: Y-var factor level in data inconsistent with number of levels indicated: 1 0
Error in generic.predict.rfsrc(object, newdata, ensemble = ensemble, m.target = m.target, :
An error has occurred in prediction. Please turn trace on for further analysis.
Calls: train ... predictLearner.classif.randomForestSRC -> predict -> predict.rfsrc -> generic.predict.rfsrc
Execution halted

Extract the brier score for survival random forest object

Hi there,

I would like to extract the Brier score directly from the rf1 object by modifying the source code of plot.survival.rfsrc so that it returns the brier.score and crps datasets. However, I get an error that it could not find function "get.event.info", and I cannot find its source code. Could you please explain how get.event.info is used, or let plot.survival.rfsrc also output the Brier score and CRPS?

By the way, the reason I cannot use pec::pec to get the Brier score is that there are missing values in my test set. When pec::predictSurvProb.rfsrc in pec::pec operates on the rfsrc object, you can see in the predict call below that na.action = "na.impute" is not passed; hence it only outputs predicted values for the non-missing rows, which leads to an error.

predictSurvProb.rfsrc <- function(object, newdata, times, ...){
    ptemp <- predict(object,newdata=newdata,importance="none",...)$survival
    pos <- prodlim::sindex(jump.times=object$time.interest,eval.times=times)
    p <- cbind(1,ptemp)[,pos+1,drop=FALSE]
    if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
        stop(paste("\nPrediction matrix has wrong dimensions:\nRequested newdata x times: ",NROW(newdata)," x ",length(times),"\nProvided prediction matrix: ",NROW(p)," x ",NCOL(p),"\n\n",sep=""))
    p
} 

Thanks in advance!
Shengnan

For Rmarkdown usability, use message() instead of cat()

At the moment it is nearly impossible to exclude text output from functions like var.select() in R Markdown, since knitr's chunk options can only suppress all text, or selectively suppress output from message(), warning() and stop(). Please replace all instances of cat(), print(), printf() etc. with the appropriate message(), warning() and stop() calls.
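A minimal illustration of the difference (my own example): message() emits a condition that can be muffled selectively, while cat() writes straight to stdout and cannot.

```r
f_cat <- function() cat("progress...\n")    # plain stdout; cannot be muffled
f_msg <- function() message("progress...")  # emits a condition on stderr

suppressMessages(f_msg())  # silent
suppressMessages(f_cat())  # still prints "progress..."
```

In knitr this corresponds to the message=FALSE chunk option, which silences message() output but leaves cat() output in the rendered document.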

Warning for u

I'm getting the following error with randomForestSRC, compiled as shown below, when trying to cluster a rather large data frame of mostly logical features. It appears the vector length should be fine for a 64-bit int. Is that intended? Also, is the warning likely related to the ultimate 'kill 9'?

> dim(chunk)
[1] 119674    392
> rf.fit <- randomForestSRC::rfsrc(data = select(chunk, -plus.master_id), ntree=10000, proximity="oob")

RF-SRC:  *** WARNING ***
RF-SRC:  S.E.X.P. vector element length exceeds 32-bits:            7160992975
RF-SRC:  S.E.X.P. ALLOC:  proximity
RF-SRC:  Please Reduce Dimensionality If Possible.Killed: 9

System version:

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.0                         
year           2017                        
month          04                          
day            21                          
svn rev        72570                       
language       R                           
version.string R version 3.4.0 (2017-04-21)
nickname       You Stupid Darkness         

Package compile log:

checking for gcc... gcc-7
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc-7 accepts -g... yes
checking for gcc-7 option to accept ISO C89... none needed
checking for gcc-7 option to support OpenMP... -fopenmp
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG   -I/usr/local/include  -fopenmp  -fPIC  -Wall -g -O2  -c R_init_randomForestSRC.c -o R_init_randomForestSRC.o
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG   -I/usr/local/include  -fopenmp  -fPIC  -Wall -g -O2  -c randomForestSRC.c -o randomForestSRC.o
gcc-7 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG   -I/usr/local/include  -fopenmp  -fPIC  -Wall -g -O2  -c splitCustom.c -o splitCustom.o
gcc-7 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o randomForestSRC.so R_init_randomForestSRC.o randomForestSRC.o splitCustom.o -fopenmp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

'Floor' error when factor variables specified

I am using randomForestSRC to estimate the variable importance of 34 potential predictors of survival. When all of the potential predictor variables are numeric (or coded as numeric), the package runs without a problem. When I add in / appropriately code factor variables as factors, I get the following error: Error in Math.factor(cens) : ‘floor’ not meaningful for factors. I have tried using the package defaults to be sure that nothing I'm specifying is causing a problem. Coding all of my variables as numeric - which is incorrect - eliminates the problem. If I specify the factor variables as factors - which is correct - I get this error. I do not know why rounding (floor) would be applied to factor variables. I have looked through the package code and cannot identify where the problem is. Thank you for your assistance.

Error while implementing subsample function

I am getting an error while implementing subsample function for competing risk problem under the following setting:

  1. Splitting rule is logrankCR
  2. Data is imputed using na.action = "na.impute"
  3. nimpute = 1

The error does not occur with a different splitting rule or with nimpute > 1.
I tried my best to investigate this problem, and it appears to be related to the rfsrc function rather than the subsample function.

Rfsrc curve is overfitting quite a bit. What could be the reasons?

test_data.zip

Hello Sir,

I have run RFSRC on my train data and scored the test data. While the model performs very well on the train data, it shows huge overfitting on the test data. I thought that presenting the problem to you might be useful, given your expert knowledge of how survival forests work.
The data is attached to this issue.
I am first running the below code to make the model.

library(randomForestSRC)

[train_data.zip](https://github.com/kogalur/randomForestSRC/files/3417313/train_data.zip)

# Running the model
rfsrc_model <- rfsrc(Surv(monthlap,islapsed) ~ .,
                     data=train, 
                     ntree = 100 , 
                     do.trace = TRUE)

# Making predictions on the test dataset
pred_test <- predict.rfsrc(rfsrc_model, test)

# Making the dataset containing 24-month probabilities and the time (monthlap)
# and status (islapsed) variables for the test data
Test_Survival = as.data.frame(cbind(pred_test$survival, monthlap = test$monthlap, islapsed = test$islapsed))

Then, the way I'm checking the performance on the test data is as follows. I estimate monthly survival probabilities from the model for everyone in the data over a period of 24 months and set a cut-off probability; if any member's probability drops below the cut-off in any month, that member is considered dead in that month. I then draw a graph with the percentage of deaths per month over the 24 months on the y-axis and time on the x-axis. This graph is drawn for both the train and test data, and in my case the test line hugely overpredicts the death percentage compared to the train line, particularly beyond the 6th month. I'm attaching one plot with this question.
Train_test_plot.pdf
This is as per the below code:

# input the probability cut-off
x <- as.numeric(0.60)

# Plotting the graph
# (ncap_surv is assumed to be the Test_Survival data frame built above)
ncap_surv2 <- data.frame()
ncap_surv5 <- data.frame()
ncap_surv2 <- ifelse(ncap_surv[, c(1:24)] < x, 1, 0)
ncap_surv2 = cbind(ncap_surv2,monthlap = ncap_surv$monthlap, islapsed = ncap_surv$islapsed)
for (i in 24:2){
  ncap_surv2[, i] = ncap_surv2[, i] - ncap_surv2[, i-1]
}
ncap_surv2 <- as.data.frame(ncap_surv2)
ncap_surv3 <- ncap_surv2 %>% 
  group_by(monthlap, islapsed) %>% 
  summarise(n = n())
ncap_surv4 <- colSums(ncap_surv2[, c(1:24)])
ncap_surv5 <- cbind(ncap_surv4, monthlap = 1:24, islapsed = 1)
ncap_surv5 <- as.data.frame(ncap_surv5)
ncap_surv6 <- merge(x = ncap_surv3, y = ncap_surv5, by = c('monthlap', 'islapsed'), all.x = TRUE)
ncap_surv7 <- ncap_surv6[!(ncap_surv6$islapsed == 0), ]
colnames(ncap_surv7) <- c('monthlap', 'islapsed', 'actual', 'predicted')
ncap_surv7[, c("actual", "predicted")] <- ncap_surv7[, c("actual", "predicted")]/nrow(ncap_surv)
ncap_surv7 <- within(ncap_surv7, cum_actual <- cumsum(actual))
ncap_surv7 <- within(ncap_surv7, cum_predicted <- cumsum(predicted))

library(ggplot2)
p = ggplot() + 
  geom_line(data = ncap_surv7, aes(x = monthlap, y = cum_actual), color = "blue") +
  geom_line(data = ncap_surv7, aes(x = monthlap, y = cum_predicted), color = "red") +
  xlab('Month') +
  ylab('percent')
print(p)

I am at my wits' end trying to understand why the survival probabilities drop so sharply beyond the 6-month mark. What should I look into to identify which variable or other cause is lowering the survival probabilities?

Thanks!
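For diagnosing which covariates drive the divergence, one standard route in this package is variable importance and partial dependence. A minimal sketch, assuming `rfsrc_model` is the fitted forest from the code above (the plotted variables and their effects are, of course, data-dependent):

```r
library(randomForestSRC)

# Permutation (VIMP) importance: large positive values flag variables whose
# noising-up degrades OOB prediction the most
vi <- vimp(rfsrc_model)
print(vi$importance)

# Partial dependence plots for the top variables: these show how predicted
# mortality changes as each covariate varies, holding the others fixed,
# and can reveal which variables pull survival down after month 6
plot.variable(rfsrc_model, partial = TRUE)
```

A sharp change in a partial plot for one variable, combined with a distribution shift of that variable between train and test, is a common explanation for this kind of divergence.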

Question: How is the Ensemble Survival Function calculated

Hello,

I'm currently trying to get a deeper understanding of Random Survival Forests and how they work.
Since an individual can end up in many terminal nodes in different trees of the forest, I assume the survival function for that individual is averaged over all terminal nodes.
https://kogalur.github.io/randomForestSRC/theory.html mentions the KM-estimator for estimating the survival function but not how the ensemble survival function is calculated.
In "Evaluating Random Forests for Survival Analysis using Prediction Error Curves" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194196/) it says that the ensemble survival function is derived from the ensemble CHF.
I just wanted to know what case is used in randomForestSRC. The actual equation would be the icing on the cake.
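To make the convention concrete, here is a self-contained sketch (not the package's internal code) of the CHF-based route: for each tree, compute the Nelson-Aalen CHF from the cases in the terminal node the individual falls into, average the CHFs over trees, and derive survival as S(t) = exp(-H(t)). The two toy "terminal nodes" below stand in for the nodes of two trees; randomForestSRC also returns a survival estimate built by averaging per-tree Kaplan-Meier curves, which is a distinct (though usually close) quantity.

```r
# Nelson-Aalen CHF evaluated at eval.times, from (time, status) in one node
nelson_aalen <- function(time, status, eval.times) {
  sapply(eval.times, function(t) {
    ev <- sort(unique(time[status == 1 & time <= t]))  # event times up to t
    if (length(ev) == 0) return(0)
    # sum of d(s)/n(s): deaths at s over number at risk at s
    sum(sapply(ev, function(s) sum(time == s & status == 1) / sum(time >= s)))
  })
}

# toy data: the terminal nodes (one per tree) containing the same individual
node1 <- data.frame(time = c(2, 3, 5, 7), status = c(1, 0, 1, 1))
node2 <- data.frame(time = c(1, 4, 6, 8), status = c(1, 1, 0, 1))
eval.times <- 1:8

H1 <- nelson_aalen(node1$time, node1$status, eval.times)
H2 <- nelson_aalen(node2$time, node2$status, eval.times)

H_ens <- (H1 + H2) / 2   # ensemble CHF: average over trees
S_ens <- exp(-H_ens)     # survival derived from the ensemble CHF
```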

I hope this is the right place to ask such a question and that I'm not bothering anyone (I have about two more questions).

Anyway, thanks for all the effort put in the papers and this R package!

Chris

An issue about the C-index estimation in survival analysis

library(randomForestSRC)
library(ggplot2)
library(Hmisc)
set.seed(1)
Model <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, tree.err=TRUE,seed=1)
print(1-Model$err.rate[100])
print(rcorr.cens(predict(Model)$predicted,Surv(veteran$time, veteran$status))[1])

According to your documentation, the C-index reported by the model and the C-index estimated by Harrell's method should be equal. However, they are quite different. Could you please provide more information about this? Thanks!
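A likely source of the discrepancy, sketched below under two assumptions: (1) `Model$err.rate` is computed from the *out-of-bag* predicted mortality, while `predict(Model)$predicted` is not OOB; and (2) higher predicted mortality means *worse* survival, so the sign must be flipped to match `rcorr.cens`'s convention that larger values predict longer survival:

```r
library(randomForestSRC)
library(Hmisc)

# rfsrc documents its seed as a negative integer
Model <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100, seed = -1)

# OOB C-index as reported by the package
c_pkg <- 1 - Model$err.rate[Model$ntree]

# Harrell's C on the OOB mortality; negated so that larger values
# correspond to longer survival
c_harrell <- rcorr.cens(-Model$predicted.oob,
                        Surv(veteran$time, veteran$status))["C Index"]

# c_pkg and c_harrell should now agree (up to Monte Carlo noise);
# newer package versions also export get.cindex() for this comparison
```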
