harrelfe / rms Goto Github PK

View Code? Open in Web Editor NEW

167.0 12.0 49.0 3.42 MB

Regression Modeling Strategies

Home Page: https://hbiostat.org/R/rms

License: Other

R 96.13% Fortran 3.59% PHP 0.16% C 0.12%

rms's Introduction

rms

Regression Modeling Strategies

Current Goals

Implement estimation and prediction methods for the Bayesian partial proportional odds model blrm function

Web Sites

Overall: http://hbiostat.org/R/rms/
Book: http://hbiostat.org/rms/
CRAN: http://cran.r-project.org/web/packages/rms/
Changelog: https://github.com/harrelfe/rms/commits/master/

To Do

Fix survplot so that explicitly named adjust-to values are still in subtitles. See tests/cph2.s.
Fix fit.mult.impute to average sigma^2 and then take square root, instead of averaging sigma
Implement user-added distributions in psm - see #41

rms's People

Contributors

Stargazers

Watchers

Forkers

guniorobot smcintosh sumprain erardnicolas sivartravis atcjochems zeemanary klainfo englianhu vjd xueyuan11 caumike couthcommander jphdotam varungoelpmx shearer ckogan guhjy anhnguyendepocen rix133 uhasan1 dupontct lspeetluk slzhao dgkf rsaporta thecodemasterk minghao2016 shanshan-he85 yadevi dincerti mkclairhong owain-s krassowski hivewang valentynbez nmdodds mclark-amp3d marymlucas davidaarmstrong bailliem artlesshao eegk jbarsotti northnomad yhpua whendric09 eddelbuettel vishalbelsare

rms's Issues

Error in lrm() function with factors: subscript out of bounds

Dear Prof. Harrel,

I'm trying to fit a logistic regression model using lrm() function in a dataset with 47 factors and 1 integer (the target variable) and I'm getting the following error:

str(dataset)
'data.frame': 147166 obs. of 48 variables:
$ X1 : Factor w/ 7 levels "01",..: 3 5 3 4 1 2 7 2 5 6 ...
$ X2 : Factor w/ 2 levels "01",..: 2 2 2 2 1 1 2 1 2 2 ...
$ X3 : Factor w/ 11 levels "11:",..: 2 2 2 8 2 2 2 2 2 2 ...
$ TARGET : int 1 1 1 1 1 1 1 1 1 1 ...

require(rms)

fit <- lrm(TARGET ~ ., data = dataset, x=TRUE, y=TRUE, se.fit=TRUE)
Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds

Is there any clues on what's going on?

Thanks in advance,
Lucas Sala

Update:
I'm running R 3.3.1, rms 4.5-0 and Hmisc 3.17-4

Error in terms.formula(formula) : '.' in formula and no 'data' argument

Using dots in formulas works with normal R models, but apparently not with rms? The error message seems to be wrong, as the data argument was specified.

> rms::ols(Sepal.Length ~ ., data = iris)
Error in terms.formula(formula) : '.' in formula and no 'data' argument
> glm(Sepal.Length ~ ., data = iris)

Call:  glm(formula = Sepal.Length ~ ., data = iris)

Coefficients:
      (Intercept)        Sepal.Width       Petal.Length        Petal.Width  Speciesversicolor   Speciesvirginica  
            2.171              0.496              0.829             -0.315             -0.724             -1.023  

Degrees of Freedom: 149 Total (i.e. Null);  144 Residual
Null Deviance:	    102 
Residual Deviance: 13.6 	AIC: 79.1

Version:

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.2
rms_5.1-1

Unexpected results when using `robcov` with dichotmous clustering variable

I came across a strange result - negative F statistics - when fitting an ols model and then using robcov with a dichotomous clustering variable. A reproducible example + results is below. The original context was using robust sandwich estimation to adjust for clustering by study site type; the results look as expected when clustering by individual sites, but not when clustering by type (say, community vs academic hospital).

## Create simple data frame: outcome + two continuous covariates, as well as a five-level site variable
set.seed(56)
df <- data.frame(y = rnorm(n = 100),
                   x1 = rnorm(n = 100),
                   x2 = rnorm(mean = 5, sd = 0.5, n = 100),
                   site = sample(LETTERS[1:5], size = 100, replace = TRUE))

## Collapse into "site types"
df$site.type <- with(df, factor(ifelse(site %in% c('A', 'B'), 'Type 1', 'Type 2')))

## Fit original ols model (issue does not reproduce in this case unless I use rcs())
orgmod <- ols(y ~ rcs(x1, 3) + rcs(x2, 3), data = df, x = TRUE, y = TRUE)

## Sandwich estimation with clustering a) by individual site and b) by site type
robcov.orgmod.site <- robcov(orgmod, cluster = df$site)
robcov.orgmod.sitetype <- robcov(orgmod, cluster = df$site.type)

## Results:
## Original model (no robust sandwich estimation)
> anova(orgmod)
                Analysis of Variance          Response: y 

 Factor          d.f. Partial SS MS        F    P     
 x1               2    1.7144676 0.8572338 0.94 0.3931
  Nonlinear       1    0.7848435 0.7848435 0.86 0.3552
 x2               2    1.0351252 0.5175626 0.57 0.5679
  Nonlinear       1    0.9487821 0.9487821 1.04 0.3096
 TOTAL NONLINEAR  2    1.6913972 0.8456986 0.93 0.3980
 REGRESSION       4    2.7925991 0.6981498 0.77 0.5487
 ERROR           95   86.3726934 0.9091862            

## Robust sandwich estimation using individual site (5 levels) as clusters
> anova(robcov.orgmod.site)
                Analysis of Variance          Response: y 

 Factor          d.f. Partial SS MS        F    P     
 x1               2    2.6469454 1.3234727 1.46 0.2384
  Nonlinear       1    0.8815979 0.8815979 0.97 0.3273
 x2               2    3.6743828 1.8371914 2.02 0.1382
  Nonlinear       1    1.8518288 1.8518288 2.04 0.1568
 TOTAL NONLINEAR  2    1.8936596 0.9468298 1.04 0.3570
 REGRESSION       4    5.3662876 1.3415719 1.48 0.2157
 ERROR           95   86.3726934 0.9091862          
  
## Robust sandwich estimation using site type (2 levels) as clusters
> anova(robcov.orgmod.sitetype, tol = 1e-500)
## had to set tolerance very high to get convergence; this is also true for my original data,
## regardless of whether and how I cluster
                Analysis of Variance          Response: y 

 Factor          d.f. Partial SS    MS            F             P     
 x1               2   -8.688465e+14 -4.344232e+14 -4.778154e+14 1.0000
  Nonlinear       1    5.413141e-01  5.413141e-01  6.000000e-01 0.4423
 x2               2    1.707539e+13  8.537696e+12  9.390481e+12 <.0001
  Nonlinear       1    3.426543e+00  3.426543e+00  3.770000e+00 0.0552
 TOTAL NONLINEAR  2    1.526208e+13  7.631040e+12  8.393264e+12 <.0001
 REGRESSION       4   -1.403561e+15 -3.508903e+14 -3.859389e+14 1.0000
 ERROR           95    8.637269e+01  9.091862e-01

survplot with npsurv: n.risk at time=0 is same across strata

I am not sure if this is the intended behaviour, but when using survplot with an npsurv object and n.risk=T, the number at risk for all strata but the first appears to be incorrect at t=0. E.g.

example(survplot)    # reuse objects of the example 
survplot(npsurv(S~sex),n.risk=T)

Penalized regression modelling for cph

Respected Sir,
Will it be possible to run penalised regression for models made by cph, in the same lines as is possible with pentrace for ols/lrm models? If it is not possible, can it be incorporated in pentrace function?
Thanks for the excellent packages rms and Hmisc

number at risk created by cph is misspecified

When using survplot with a cph object, the number at risk table is not stratified according to the covariate. using npsurv() does this correctly.

library(survival)
data(lung)
fit <- cph(Surv(time, status==2) ~ sex,data = lung,surv = TRUE,x=TRUE, y=TRUE)
survplot(fit, n.risk=TRUE)
# compared to:
survplot(npsurv(fit$sformula, data = lung), conf= "none",n.risk=TRUE)

For usual use cases I guess re-running the model with npsurv will do, but for others, like when wanting to use fit.mult.impute, you want to use the original cph itself.

print method for lrm when adjusted with robcov() contains unwanted LaTeX code

Thanks for this truly amazing package! I've found a small error that I thought you might like to know about. See the code below, adapted from the help page for lrm():

n <- 1000 # define sample size
age <- rnorm(n, 50, 10)
cholesterol <- rnorm(n, 200, 25)
ch <- cut2(cholesterol, g=40, levels.mean=TRUE)
d <- data.frame(ch, age, cluster=rep(1:10,100))
f <- lrm(ch ~ age, x=T, y=T, data=d)
print(robcov(f, d$cluster), coefs=4)
print(robcov(f, d$cluster), coefs=4, latex=FALSE)

fit.mult.impute no longer accepts "fitter = lrm"

I recently updated the rms package to version 4.4-1. When running through my analyses, I got an "Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds".

The old models that use to work fine read:

fit <- fit.mult.impute(x ~ y, fitter = lrm, xtrans = imp, data = data)

but I have now have to fit using fitter = glm, family = "binomial" to run the models. I really miss the goodness of the plot(summary(fit)) and plot(anova(fit)) that I used to get, not to mention the convenience of getting the factor levels to print instead of simply: variable2.

Is this an error, or have you decided to drop fitter=lrm from fit.mult.impute?

Time-dependent coefficients for cph

Dear prof. Harrell,

I've recently learned of a way to work with time-dependent coefficients when the proportional hazards assumption is violated in the cox-regression: By splitting the dataset using the Epi-package I use the new start-times as an interaction term. Unfortunately this use was not anticipated and causes some issues with the cph() and its related functions.

I've forked your package and worked on a fix but while I get the same coefficients as coxph() gives (although there are some minimal differences) my favorite function, the contrast(), still doesn't work with this fix. I'm not entirely sure how to best implement the needed functionality. You can find my test-file here and the fixes are on line 143 and line 152-155 .

Gls() returns an error message

Dear Prof. Harrell,

The Gls() from rms package version 4.4-1 returns an error message when tested on the codes from the rms documentation (page 71).

library(nlme)
library(rms)
ns <- 20 # no. subjects
nt <- 10 # no. time points/subject
B <- 10 # no. bootstrap resamples
# usually do 100 for variances, 1000 for nonparametric CLs
rho <- .5 # AR(1) correlation parameter
V <- matrix(0, nrow=nt, ncol=nt)
V <- rho^abs(row(V)-col(V)) # per-subject correlation/covariance matrix
d <- expand.grid(tim=1:nt, id=1:ns)
d$trt <- factor(ifelse(d$id <= ns/2, 'a', 'b'))
true.beta <- c(Intercept=0,tim=.1,'tim^2'=0,'trt=b'=1)
d$ey <- true.beta['Intercept'] + true.beta['tim']*d$tim +
true.beta['tim^2']*(d$tim^2) + true.beta['trt=b']*(d$trt=='b')
set.seed(13)
library(MASS) # needed for mvrnorm
d$y <- d$ey + as.vector(t(mvrnorm(n=ns, mu=rep(0,nt), Sigma=V)))
dd <- datadist(d); options(datadist='dd')

# this returns an error message: Error in !fixSig : invalid argument type 
f <- Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id), data=d, B=B); f

# using gls() from nlme package
f.nlme <- gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id), na.action=na.omit, data=d); f.nlme

Thank you in advance for looking into this.

Error(s) with latex.anova.rms

In the current version of rms I get error(s) when using latex.anova.rms. The following code illustrates the problem(s):

library("rms")

set.seed(1)
n <- 100
y <- rexp(n)
s <- rep(1, n)
x <- runif(n)
x2 <- runif(n)

fm_ols <- ols(y ~ x)
latex(anova(fm_ols), file = "")
## Error in ifelse(substring(rowl, 1, 5) %in% c("REGRE", "ERROR"), bold(rowl),  : 
##  could not find function "bold"

fm_lrm <- lrm(y > 1 ~ x)
latex(anova(fm_lrm), file = "")
## Error in ifelse(sn %nin% c("d.f.", "MS", "Partial SS"), math(sn), sn) : 
##  could not find function "math"

fm_cph <- cph(Surv(y, s) ~ x)
latex(anova(fm_cph), file = "")
## Error in ifelse(sn %nin% c("d.f.", "MS", "Partial SS"), math(sn), sn) : 
##  could not find function "math"

## And with an interaction
fm_cph_int <- cph(Surv(y, s) ~ x * x2)
latex(anova(fm_cph_int), file = "")
## Error in paste0(specs$lspace, specs$italics(substring(rowl, 2)), sep = "") : 
##  attempt to apply non-function

Package: rms
Version: 5.1-0
Date: 2017-01-01

platform       x86_64-w64-mingw32                         
arch           x86_64                                     
os             mingw32                                    
system         x86_64, mingw32                            
status         Patched                                    
major          3                                          
minor          3.2                                        
year           2017                                       
month          01                                         
day            11                                         
svn rev        71964                                      
language       R                                          
version.string R version 3.3.2 Patched (2017-01-11 r71964)
nickname       Sincere Pumpkin Patch

LaTeX of anova.rms object gives error message

Dear Prof. Harrell,

I've recently updated to the latest CRAN version of rms (5.1-0). When trying to print the results from an ANOVA using LaTeX i get the following error message:

Error in paste0(specs$lspace, specs$italics(substring(rowl, 2)), sep = "") :
attempt to apply non-function

Please see below for an example of when this happens:

library(rms)

# Example from the anova.rms help text: 

n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
treat <- factor(sample(c('a','b','c'), n,TRUE))
num.diseases <- sample(0:4, n,TRUE)
age <- rnorm(n, 50, 10)
cholesterol <- rnorm(n, 200, 25)
weight <- rnorm(n, 150, 20)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age'      # label is in Hmisc
label(num.diseases) <- 'Number of Comorbid Diseases'
label(cholesterol) <- 'Total Cholesterol'
label(weight) <- 'Weight, lbs.'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc


# Specify population model for log odds that Y=1
L <- .1*(num.diseases-2) + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
                                 3.5*(treat=='b')+2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)


fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
             log(cholesterol+10) + treat:log(cholesterol+10))

a <- anova(fit)

# Print anova results
print(a) # Works fine!

# Plot anova results
plot(a) # No problem here either!

# Latex
latex(a, file = "")

# Produces the following error:
# Error in paste0(specs$lspace, specs$italics(substring(rowl, 2)), sep = "") : 
# attempt to apply non-function

I'm using R 3.3.2 on macOS. Any thought on why this error would occur?

Thanks,
Daniel

fit failure in 500 resamples. Might try increasing maxit. Ignore it or not?

My data is 188318 rows and 146 columns. After i fit the Glm into my data, it throws an error when i use bootstrap covariance:
fit failure in 500 resamples. Might try increasing maxit
Not sure why this happens and whether should i ignore it and proceed to the next step?
My code:

fit_glm<-Glm(loss~., x=T, y=T, data=xtrain)
boot_fit_lm <- bootcov(fit_glm, B=500, pr=T)

boot_fit_lm <- bootcov(fit_glm, B=500, pr=T)
Iteration: 500 of 500
Warning message:
In bootcov(fit_glm, B = 500, pr = T) :
fit failure in 500 resamples. Might try increasing maxit

Automated checking of results against vanilla functions

When discussing the recently found bug in Gls, Frank suggested that it is always a good idea to check pivotal results against vanilla functions.

Thinking about this, I came up with the following little script; it does nothing really special, just makes this automated:

check_rms_with_vanilla <- function( fit, target = c( coefsqdiff = coef, vcovsqdiff = vcov ) ) {
  if ( !attr( fit, "class" )[ 1 ]%in%c( "ols", "lrm", "cph", "Gls" ) ) {
    stop( "Check works only with rms models that can be reduced to lm, glm,
          coxph or gls (i.e. ols, lrm, cph and Gls)!" )
  }
  vanillaclass <- tail( attr( fit, "class" ), 1 )
  callargs <- lapply( as.list( fit$call )[ -1 ], eval )
  if( attr( fit, "class" )[ 1 ]=="lrm" ) {
    callargs <- c( callargs, family = binomial )
  }
  if ( sum( sapply( names( fit$call ), grepl, pattern = "penalty" ) )>0 ) {
    stop( "There is no vanilla (i.e. comparison) model when penalization is used!" )
  }
  if ( attr( fit, "class" )[ 1 ]=="lrm"&length( fit$freq )>2 ) {
    stop( "There is no vanilla (i.e. comparison) model for proportional odds
          logistic regression!" )
  }
  vanilla <- do.call( vanillaclass, callargs )
  sapply( target, function( t ) quantile( ( t( fit )-t( vanilla ) )^2 ) )
}

Predict() Error in fit$Design : object of type 'closure' is not subsettable

When calling Predict() on lrm.fit objects I'm inconsistently getting the following error:
Error in fit$Design : object of type 'closure' is not subsettable

It's unclear to me why I can call this function successfully on some lrm.fit objects but not others. Any ideas why this may be?

Suggesting two little performance improvements

The following suggestions bring a performance improvement that increases as dataset size increases. As far as I have seen, performance improvements easily reaches ~20% on a "small" 64MB dataset. I am currently testing on a 200MB dataset. If those suggestions will be implemented, I would test them on a 6.9GB dataset with about 150 milions of rows. I am expecting more than 20% perfomance improvements on bigger datasets. By the way, I'm testing this using MRO (Microsoft R Open) which is faster than R. I don't know why but cph() gains more than coxph() using MKL (which is a fast multithread math library from Intel shipped with MRO).

Suggestion n. 1

I would like to suggest an edit to R/cph.s in order to better improve performance when linear.predictors=F. I think that this suggestion accomodates the intention of setting linear.predictors=F: everything related to linear.predictors shouldn't be calculated, so GiniMd() and dxy.cens() shouldn't be invoked.

Lines 270 and 271 now are:

gindex <- GiniMd(f$linear.predictors)
dxy <- dxy.cens(f$linear.predictors, Y, type='hazard')['Dxy']

I suggest the following:

gindex <- NA; if(linear.predictors) gindex <- GiniMd(f$linear.predictors)
dxy <- NA; if(linear.predictors) dxy <- dxy.cens(f$linear.predictors, Y, type='hazard')['Dxy']

Suggestion n. 2

Following commit 23a4c06 the nonames parameter should be also evaluated at line 297.

Line 297 now is:

names(f$linear.predictors) <- rnam

I suggest the following:

if(nonames) names(f$linear.predictors) <- rnam

deprecation warning

After upgrading to R version 3.4, ols generates a deprecation warning:

require(rms)
getHdata(lead)
dd <- datadist(lead)
options(datadist='dd')
ols(maxfwt ~ age + ld72 + ld73, data=lead)

Warning message:
In structure(ia, dimnames = NULL) :
  Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
  Consider 'structure(list(), *)' instead.

See rms.s, line 134, where ia is initialized to NULL.

predict() results incorrect when using categorical predictors with ordered levels

Dear Prof. Harrell,

Thank you for the useful rms package. I was fortunate to attend your course a few years ago.

I would like to bring to our attention an issue (and hopefully a fix) with the predict() method for ols(). When an ordered factor is defined and used as a predictor, the predict() method with newdata won't give correct results. The self-contained example below illustrates the issue.

library(rms)
df <- data.frame(y=rnorm(21), day=weekdays(Sys.Date()-(1:21), abbr=T))
df$day.ordered <- with(df, factor(as.character(day), levels=c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered=T))
options(contrasts=c("contr.treatment", "contr.treatment"))
fit1 <- ols(yday, data=df)
fit2 <- ols(yday.ordered, data=df)
df.char <- df; df.char$day.ordered <- as.character(df.char$day.ordered)
cbind(orig=predict(fit1), orig.newdata=predict(fit1,newdata=df), ordered=predict(fit2), ordered.newdata=predict(fit2,newdata=df), ordered.workaround=predict(fit2,newdata=df.char))

Sample output:
orig orig.newdata ordered ordered.newdata ordered.workaround
1 -0.9806541 -0.9806541 -0.9806541 -1.7049262 -0.9806541
2 0.8522859 0.8522859 0.8522859 -1.7049439 0.8522859
3 -0.9279295 -0.9279295 -0.9279295 -1.8552874 -0.9279295
4 -0.7078337 -0.7078337 -0.7078337 0.5226366 -0.7078337
5 0.3559332 0.3559332 0.3559332 -1.5590623 0.3559332
[truncated]

You will note that the column "ordered.newdata" shows different results than the other columns. Redefining "day.ordered" as character serves as a workaround. Neither lm() or glm() has this issue.

I dug further into the predict() methods. I noted that predict.lm() defines an "xlev" argument when calling model.frame():

m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)

I've looked into the predictrms() function defined in github:
https://github.com/harrelfe/rms/blob/master/R/predictrms.s
At line 250, we see that the model.frame() is called without "xlev" argument

X <- model.frame(Terms, newdata, na.action=na.action, ...)
I would suggest replacing that line with the following:

X <- model.frame(Terms, newdata, na.action=na.action, xlev = fit$xlevels, ...)
Hopefully, this would fix the issue.

Best regards,
Jerome Asselin

version
_
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
packageVersion("rms")
[1] ‘4.5.0’

Error in !se.fit : invalid argument type

There's an error in cph().
Passing surv="summary" to cph() as suggested by the official documentation returns the following error:
Error in !se.fit : invalid argument type
More generally, it seems that setting surv to a non boolean value generates the error. Probably this line of code needs to be corrected:
https://github.com/harrelfe/rms/blob/master/R/cph.s#L328

May be the docs are wrong about setting surv="summary"?

Documentation of R2 for Cox PH

Apologies if this is the wrong place to put this.

The documentation for the cph function does not specify what R2 is returned, and indeed the value returned disagrees with that from survival::coxph. I think both are using the Cox & Snell generalized R-squared but cph is reporting R^2/R^2_max whereas coxph reports R^2 (and separately R^2_max). This took some amount of googling to figure out so it would be nice if was in the docs!

diffbands does not observe xlim

A minor plotting problem occurs when using survplot with diffbands and xlim: the gray confidence band extends beyond the plot margin. E.g.:

library(rms)
n <- 100
sex <- factor(sample(c('male','female'), n, TRUE))
dt <- 20*runif(n)
e <- sample(0:1,n,replace=T)
S <- Surv(dt,e)
f <- npsurv(S ~ sex)
survplot(f, conf='diffbands', xlim=c(0,10))

A minimal fix may be in survplot.npsurv.s:

@ -266,7 +268,7 @@ survplot.npsurv <-
                               col = col.fill[i], type = "s")
                 }
                 else if(conf == 'diffbands')
-                    survdiffplot(fit.orig, conf=conf, fun=fun, convert=convert)
+                    survdiffplot(fit.orig, conf=conf, fun=fun, convert=convert, xlim=xlim)

                 else {
                     j <- if(ns == 1) TRUE else vs == olev[i]
@@ -405,7 +407,8 @@ survdiffplot <-
         if(conf == 'diffbands') {
             lo <- surv - 0.5 * z * se
             hi <- surv + 0.5 * z * se
-            k <- !is.na(times + lo + hi)
+            k <- !is.na(times + lo + hi) & times < xlim[2]
             polyg(c(times[k], rev(times[k])), c(lo[k], rev(hi[k])),
                   col=gray(.9), type='s')
             return(invisible(slev))

Problem in cph, when there is missing value and cluster

Here is a minimal reproducible example:

set.seed(1)

RawData <- data.frame( time = runif( 100, 1, 5 ), event = sample( 0:1, 100, replace = TRUE ),
x1 = c( NA, rnorm( 99 ) ), x2 = as.factor( sample( 0:1, 100, replace = TRUE ) ),
ID = c( 1:70, 1:30 ) )

dd <- datadist( RawData )
options( datadist = "dd" )

cph( Surv( time, event ) ~ x1 + x2 + cluster( ID ), data = RawData )

It is not even an error, but a warning, but it nevertheless shouldn't be there. After some experimentation I think the problem is with the presence of a missing value and cluster().

The problem seems to occur at the very end of rms:::Design. I think the problem is that jz will index every right hand side term (including the cluster), but fname will only contain the explanatory variables (without the cluster). Thus
names(nmiss)[jz] <- fname[asm != 9]
will try to assign a length-2 vector to one of length 3.

Add support for polytomous models

As a suggestion, would it be possible to extend rms to also support polytomous (multinomial) logistic regression models?

ggplot.Predict uneven legend colors on multiplot

I'd like to plot a multi-panel plot from a multivariate regression with ggplot.Predict() but I'm running into issues with the paneling have different colors than in the legend. It appears the first panel has a different palette, but I'm not entirely sure why. When plotting more than two panels, it is only the first panel which is different; all of the other panels have the same color palette, but is not identified in the legend. The base graphics plot() works fine.

There is also an error message:

Warning messages:
1: show_guide has been deprecated. Please use show.legend instead.

Which eluded me to think it was version problem with ggplot, but after downgrading ggplot Hmisc would not load because it needs ggplot 2.0. Do you think this error could be causing the problem?

Here is an example:

dd = data.frame(x1 = 2 + (runif(200) * 6), x12 = 100 + (runif(200) * 6))
dd$y1 = rep(c(1.2, 1.4), each = 100) * dd$x1 + (runif(200) / 5)

library(rms)
ddist <- datadist(dd)
options("datadist" = "ddist")

g <- ols(y1 ~ x1 + x12, data = dd, x = TRUE, y = TRUE)
a <- Predict(g)

h <- ols(y1 ~ I(x1^2) + I(x12^2), data = dd, x = TRUE, y = TRUE)
b <- Predict(h)

p <- rbind(a,b)
ggplot(p, group = ".set.")

ggplot Plot

System Info:

R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)

other attached packages:
[1] rms_4.4-1 SparseM_1.7 Hmisc_3.17-1 ggplot2_2.0.0 Formula_1.2-1 survival_2.38-3 lattice_0.20-33

Error with %ia% in cph

In the current version of rms I get an error when using the %ia% function within a cph() call. A minimal reproducible example that triggers the error is

library("rms")
N <- 100
set.seed(1)
time <- rexp(N)
status <- sample(0:1, N, replace = TRUE)
S <- Surv(time, status)
x1 <- gl(2, 50)
x2 <- runif(N)
fc <- cph(S ~ x1 + rcs(x2) + x1 %ia% x2)
## Error in model.matrix(sformula, X)[, mmcolnames, drop = FALSE] : 
##   subscript out of bounds

There appear to be no problem within ols() or lrm() calls:

fo <- ols(time ~ x1 + rcs(x2) + x1 %ia% x2)
fl <- lrm(status ~ x1 + rcs(x2) + x1 %ia% x2)

rms 4.4-0

platform       x86_64-w64-mingw32                         
arch           x86_64                                     
os             mingw32                                    
system         x86_64, mingw32                            
status         Patched                                    
major          3                                          
minor          2.2                                        
year           2015                                       
month          10                                         
day            30                                         
svn rev        69588                                      
language       R                                          
version.string R version 3.2.2 Patched (2015-10-30 r69588)
nickname       Fire Safety

HRs from a loglog orm do not fall within their corresponding CIs.

Dear Prof Harrell,
I've noticed that the hazard ratios from a loglog orm do not fall within their corresponding CIs. Below, I've provided a minimal example.
Thank you for looking into this and for creating the rms package!
Yonghao

create a datadist object

mtcars.dd <- datadist(mtcars)
options(datadist="mtcars.dd")

mod.loglog <- orm (mpg ~ drat + vs , family =loglog, data = mtcars) # loglog CPM

HR ratios are not within their corresponding CIs

HRs for the predictors are probably exp(beta)

summary (mod.loglog)

Error with contrast, three-way interaction and %ia%

I get an error from contrast() when fitting a model with a three-way interaction containing %ia% in rms version 4.4-2. Consider the following code

library(rms)

set.seed(1)

N <- 100
d <- expand.grid(x1 = rep(c("c", "d"), each = N/4), x2 = c("a", "b"))
d$x3 <- runif(N)
d$y <- rnorm(N)

dd <- datadist(d); options(datadist = "dd")

## Interaction between x2 and x3 restricted to the linear term in x3
## Also gives warning messages
f1 <- ols(y ~ x1 * (rcs(x3, 3) + x2 + x2 %ia% x3), data = d)
## Warning messages:
## 1: In FUN(newX[, i], ...) : coercing argument of type 'double' to logical
## 2: In FUN(newX[, i], ...) : coercing argument of type 'double' to logical

## Full interaction between x2 and x3
f2 <- ols(y ~ x1 * (rcs(x3, 3) * x2), data = d)

## Subsetting on x1 == "c"
f3 <- ols(y ~ rcs(x3, 3) + x2 + x2 %ia% x3, data = d, subset = x1 == "c")

Applying contrast() on f1 throws an error

contrast(f1, list(x2 = "a"), list(x2 = "b"))
## Error in matxv(X, betas) : columns in a (12) must be <= length of b (10)

Applying contrast() in a similar manner on both f2 and f3 works.

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252    LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C                   
[5] LC_TIME=Swedish_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] nlme_3.1-122    rms_4.4-2       SparseM_1.7     Hmisc_3.17-2    ggplot2_2.0.0   Formula_1.2-1   survival_2.38-3 lattice_0.20-33
[9] MASS_7.3-45    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3         cluster_2.0.3       splines_3.2.3       munsell_0.4.2       colorspace_1.2-6    multcomp_1.4-3     
 [7] plyr_1.8.3          tools_3.2.3         nnet_7.3-11         grid_3.2.3          gtable_0.1.2        quantreg_5.19      
[13] TH.data_1.0-7       latticeExtra_0.6-26 MatrixModels_0.4-1  polspline_1.1.12    Matrix_1.2-3        gridExtra_2.0.0    
[19] RColorBrewer_1.1-2  codetools_0.2-14    acepack_1.3-3.3     rpart_4.1-10        sandwich_2.3-4      scales_0.3.0       
[25] mvtnorm_1.0-5       foreign_0.8-66      zoo_1.7-12

cph fails with boolean input

It seems that model.matrix() returns TRUE together with the variable name. Not sure when this happend, can't find any note of it but the following code used to run fine:

set.seed(10)
n <- 500
ds <- data.frame(
  ftime = rexp(n),
  fstatus = sample(0:1, size = n, replace = TRUE),
  x = factor(sample(LETTERS[1:4], size = n, replace = TRUE)),
  boolean = sample(c(TRUE, FALSE), size = n, replace = TRUE),
  subsetting = factor(sample(c(TRUE, FALSE), size = n, replace = TRUE)))

library(rms)
dd <<- datadist(ds)
options(datadist="dd")

fit <- cph(Surv(ftime, fstatus == 1) ~ x + boolean, data=ds)

Now it throws an Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds as the mmcolnames has the name for the boolean variable while the X <- model.matrix(sformula, X) returns the boolean column with the name booleanTRUE.

Btw - congratulations on the second edition. Great to see that it has finally arrived.

User-added distributions in psm

I am writing you regarding the rms package of which you are the author. I have a rather specific question, which has been posed before on stack-exchanged but without an answer, so I hope you can find the time to look into it.
For my PhD in modelling of drug exposure prior to hematopoietic cell transplantation I am trying to generate a plot using your “rms-package’.
More specifically, I want to use the predict function on a survival model and have done so successfully using the following code:

#——start Code——#

dd <- datadist(Biomarker)
options(datadist='dd’)
S<-Surv(Time,Event)
Weibull: f <- psm(S ~ rcs(Biomarker,4)+gr2, dist="weibull”) #parametric option
Cox: f <- cph(S ~ rcs(Biomarker,4)+gr2, x=TRUE, y=TRUE) #semi parametric option
plot(dm=Predict(f, Biomarker))

#——End Code——#

The problem I encounter is that the survival of our cohort typically follows a Gompertz distribution, not a Weibull.
This distribution is not taken up into “survreg.distributions” and therefore no default option. As I scroll to the code via ''getAnywere(psm)”, I notice an option where a list could be imputed instead of a character.
However, when I try this with the example given in the “survreg” helpfile (MyCauchy), it gives the following warnings/error:

#——start Code——#

mycauchy <- list(name='Cauchy',
init= function(x, weights, ...)
c(median(x), mad(x)),
density= function(x, parms) {
temp <- 1/(1 + x^2)
cbind(.5 + atan(x)/pi, .5+ atan(-x)/pi,
temp/pi, -2 xtemp, 2temp(4x^2temp -1))
},
quantile= function(p, parms) tan((p-.5)*pi),
deviance= function(...) stop('deviance residuals not defined')
)

f <- survreg(S ~ rcs(Flu,4)+gr2, dist=mycauchy)
#no error follows the above with survreg
f <- psm(S ~ rcs(Flu,4)+gr2, dist=mycauchy)
Warning messages:
1: In if (dist == "extreme") warning("Unlike earlier versions of survreg, dist="extreme" does not fit\na Weibull distribution as it uses an identity link. To fit the Weibull\ndistribution use the default for dist or specify dist="weibull".") :
the condition has length > 1 and only the first element will be used
2: In if (dist %in% c("weibull", "exponential", "lognormal", "loglogistic")) c("log(T)", :
the condition has length > 1 and only the first element will be used

f
Error in survreg.distributions[[dist]] : invalid subscript type 'list'

Jurgen Langenhorst
PhD Candidate / Pharmacist
UMC Utrecht
The Netherlands
[email protected]

lrm error with variable names

I have encountered a minor bug with lrm function. lrm could not fit a model with 2 variables named "X2" and "X21" in which the first one is binary with 2 levels 0 and 1, and the second one is continuous. The error message is "singular information matrix in lrm.fit". Problem is solved if I change the name of variable X2 to something else for example D2. I do not have the same problem with glm function.

I understand that this issue only happens in simulation study when people name their variable in a convenient way so it is kind of trivial, but I report it anyway.

new problem with < and > in html(summaryM)

Code like the following used to work fine for me:

library(Hmisc)
mydat <- data.frame(x= rbinom(100, 1, 0.5), y= rnorm(100), z= rbinom(100, 1, 0.3))

mysum <- summaryM(
    y + z ~ x,
    data= mydat
)

print(Hmisc::html(mysum))

but now the output does not render correctly. Here is one of the lines from the output:

<td style='padding: 0 7px 0 7px; border-bottom: 2px solid grey; text-align: center;'>0.27 <span style="font-size: 80%;"> (15)</span></td>

I think the problem is that the html output now has < and > where it should have < and >. I haven't updated Hmisc recently, but I have updated other packages; I'm wondering whether one of the packages listed below under "loaded via a namespace (and not attached)" is causing the problem?

Thank you!

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Hmisc_4.0-3     ggplot2_2.2.1   Formula_1.2-2   survival_2.41-3 lattice_0.20-35

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14        compiler_3.4.2      RColorBrewer_1.1-2  plyr_1.8.4          bindr_0.1          
 [6] base64enc_0.1-3     tools_3.4.2         rpart_4.1-11        digest_0.6.12       tibble_1.3.4       
[11] gtable_0.2.0        checkmate_1.8.5     htmlTable_1.11.0    pkgconfig_2.0.1     rlang_0.1.4        
[16] Matrix_1.2-12       rstudioapi_0.7      yaml_2.1.15         bindrcpp_0.2        gridExtra_2.3      
[21] stringr_1.2.0       knitr_1.17          dplyr_0.7.4         cluster_2.0.6       htmlwidgets_0.9    
[26] grid_3.4.2          nnet_7.3-12         data.table_1.10.4-3 glue_1.2.0          R6_2.2.2           
[31] foreign_0.8-69      latticeExtra_0.6-28 purrr_0.2.4         tidyr_0.7.2         magrittr_1.5       
[36] scales_0.5.0        backports_1.1.1     htmltools_0.3.6     splines_3.4.2       assertthat_0.2.0   
[41] colorspace_1.3-2    stringi_1.1.6       acepack_1.4.1       lazyeval_0.2.1      munsell_0.4.3

contrast in probabilities when using lrm

Apparently, I cannot use a function such as 1/(1 + exp(-x)) in the contrast command.

age <- rnorm(200,40,12)
sex <- factor(sample(c('female','male'),200,TRUE))
logit <- (sex=='male') + (age-40)/5
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(age,2)*sex)
getProb <- function (x)  { 1/(1 + exp(-x)) }
contrast(f, list(sex='female', age=30), list(sex='male', age=40), fun = getProb)

Any ideas on how to get a contrast in probabilities?

Can't use factors() any more in lrm()

I get the following error
"Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds"
when I use "factor(x)" in the regression formula. I downgraded to older versions of Hmisc and rms.

lrm() error: Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds

I've been getting an error using lrm(). The equivalent glm() model runs without error.

Reproducible example below:

Small subset of my data:

tbi_ss <- structure(list(PosIntFinal = structure(c(1L, 1L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("No", "Yes"), class = "factor"), GCSTotal = c(15, 
15, 14, 15, 15, 15, 15, 15, 15, 14, 15, 15, 15, 15, 15, 15, 15, 
15, 15, 15, 15)), .Names = c("PosIntFinal", "GCSTotal"), row.names = c(NA, 
-21L), class = c("tbl_df", "tbl", "data.frame"))

m_lrm <- lrm(PosIntFinal ~ as.factor(GCSTotal), data=tbi_ss)

produces the error:
Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds

m_glm <- glm(PosIntFinal ~ as.factor(GCSTotal), family="binomial", data=tbi_ss)

fits the model without complaint.

survplot n.risk and competing risks

There appear to be some unintended overplotting when n.risk=T is enabled together with competing risks. This is seen e.g. in the third plot of example(npsurv).

Error in htmlSpecial("part") : illegal character name:part

After updating to the Github version following #52, I get a new error:

> (ols_fit = rms::ols(f, data = d2, x = T, y = T))
Error in htmlSpecial("part") : illegal character name:part

The formula is quite complex because it consists of some 400 terms:

> f
S ~ .a + .b + .c + .d + .e + .f + .g + .h + .i + .j + .k + .l + 
    .m + .n + .o + .ø + .p + .r + .s + .t + .u + .v + .w + .x + 
    .y + .z + .aa + .ab + .ac + .ad + .ae + .ag + .ah + .ai + 
    .aj + .ak + .al + .am + .an + .ar + .as + .at + .au + .av + 
    .aw + .ay + .az + .bd + .be + .bi + .bj + .bo + .br + .ce + 
    .ch + .ci + .ck + .cl + .co + .dd + .de + .di + .do + .dr + 
    .ds + .dt + .du + .dy + .ee + .ef + .eg + .eh + .ei + .ej + 
    .ek + .el + .em + .en + .eo + .ep + .er + .es + .et + .ev + 
    .ew + .ex + .ey + .ff + .fi + .fr + .gi + .gn + .go + .gr + 
    .gu + .hi + .hm + .hn + .ho + .hr + .hu + .ik + .il + .im + 
    .in + .io + .ir + .is + .it + .iu + .iv + .iz + .jo + .jø + 
    .ju + .kk + .kl + .ko + .kr + .ks + .ll + .lm + .lo + .ls + 
    .lt + .lu + .lv + .ly + .mm + .mo + .mu + .my + .nn + .no + 
    .nr + .ns + .nt + .nu + .ny + .op + .or + .os + .ot + .ou + 
    .ør + .rr + .rs + .rt + .ru + .ry + .rz + .ss + .st + .su + 
    .sv + .sy + .sz + .tt + .ty + .tz + .abd + .abe + .abi + 
    .abr + .ach + .adi + .agn + .ahm + .all + .alt + .amm + .ann + 
    .ans + .ant + .anu + .ars + .art + .ast + .bel + .ben + .ber + 
    .bet + .bir + .bjø + .chr + .cil + .der + .din + .dit + 
    .dor + .eil + .ein + .ell + .elm + .els + .enn + .enr + .ens + 
    .ent + .ert + .ess + .est + .ett + .fin + .git + .gor + .hil + 
    .hor + .iko + .ill + .imo + .inn + .ino + .irs + .iss + .ist + 
    .itt + .itz + .jos + .jør + .lly + .lot + .lou + .nny + 
    .nor + .ort + .ott + .rst + .abel + .bert + .bjør + .dort + 
    .gitt + .lott + ._a + ._b + ._c + ._d + ._e + ._f + ._g + 
    ._h + ._i + ._j + ._k + ._l + ._m + ._n + ._o + ._p + ._r + 
    ._s + ._t + ._u + ._v + ._w + ._y + ._z + ._ab + ._ad + ._ag + 
    ._al + ._am + ._an + ._ar + ._as + ._ay + ._be + ._bi + ._bj + 
    ._bo + ._br + ._ce + ._ch + ._cl + ._co + ._de + ._di + ._do + 
    ._ei + ._ej + ._el + ._em + ._er + ._es + ._ev + ._fi + ._fr + 
    ._gi + ._gr + ._gu + ._hi + ._in + ._ir + ._is + ._iv + ._jo + 
    ._ju + ._kr + ._lo + ._lu + ._mo + ._mu + ._no + ._ru + ._st + 
    ._su + ._sv + ._sy + ._abd + ._agn + ._ann + ._ben + ._ber + 
    ._bet + ._bir + ._chr + ._dor + ._ell + ._els + ._hil + ._jos + 
    .a_ + .b_ + .d_ + .e_ + .f_ + .g_ + .h_ + .i_ + .j_ + .k_ + 
    .l_ + .m_ + .n_ + .o_ + .r_ + .s_ + .t_ + .u_ + .v_ + .w_ + 
    .y_ + .z_ + .ad_ + .ah_ + .ai_ + .aj_ + .al_ + .am_ + .an_ + 
    .ar_ + .as_ + .aw_ + .ce_ + .ch_ + .ck_ + .co_ + .de_ + .di_ + 
    .dy_ + .ek_ + .el_ + .en_ + .er_ + .es_ + .et_ + .ik_ + .il_ + 
    .im_ + .in_ + .ir_ + .is_ + .it_ + .iz_ + .ll_ + .ly_ + .my_ + 
    .nn_ + .no_ + .nt_ + .ny_ + .or_ + .rt_ + .ry_ + .st_ + .sz_ + 
    .tt_ + .ann_ + .ben_ + .ert_ + .itt_ + .lly_ + .nny_ + length + 
    vowel_frac + stop_frac + nasal_frac + has_dash

Note that this error is in the print function. The actual object is returned without issues.

> ols_fit
Error in htmlSpecial("part") : illegal character name:part
> traceback()
6: stop(paste0("illegal character name:", x))
5: htmlSpecial("part")
4: prStats(obj[[1]], obj[[2]], lang = lang)
3: prModFit(x, title = title, z, digits = digits, coefs = coefs, 
       ...)
2: print.ols(x)
1: function (x, ...) 
   UseMethod("print")(x)

I installed the CRAN version (5.1-1) and confirmed that this does not have the bug, though it has the validate issue. (#52)

Suggested fix for legend.nomabbrev()

I was working on a nomogram and ran into an error with the tick abbreviations in your legend.nomabbrev() function of plot.nomogram.s. I could easily have been doing something wrong, but I did notice I was able to correct the function by using…

abb <- attr(object, 'info')$Abbrev[[which]]  # note the capital ‘A’ rather than lowercase

predictor HRs from loglog orm are not within their corresponding CIs

Dear Prof Harrell,
I've noticed that the HRs from a loglog orm do not fall within their corresponding CIs (kindly refer to the minimal example below). Thank you for looking into this and for creating the rms package!
Yonghao

## create a datadist object
mtcars.dd <- datadist(mtcars)   
options(datadist="mtcars.dd")

mod.loglog <- orm (mpg ~ drat + vs , family =loglog, data = mtcars)  # loglog CPM
## HR ratios are not within their corresponding CIs
## HRs for the predictors are probably exp(beta)
summary (mod.loglog)

declaring a main effect after an interaction triggers an error

Dear Prof Harrell,

When using %ia% in a model formula, I've noticed an error is triggered simply by declaring a main effect after the interaction. Here is a reproducible example.

library(rms)
df <- data.frame(y=rnorm(21), x1=rnorm(21), x2=rnorm(21))
options(contrasts=c("contr.treatment", "contr.treatment"))
fit1 <- ols(y~rcs(x1)+rcs(x2)+rcs(x1)%ia%rcs(x2),data=df)
fit2 <- ols(y~rcs(x1)+rcs(x1)%ia%rcs(x2)+rcs(x2),data=df)

The final lines issues an error,

Error in Design(X) :
x1 x2 cannot be used in %ia% since not listed as main effect

I'm not sure if this behavior is intended. The workaround is to reorder the formula with the main effects before the interaction. I would expect fit1 and fit2 to return equivalent model objects, even if the terms aren't necessarily ordered the same way.

Best regards,
Jerome Asselin

version
_
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
packageVersion('rms')
[1] ‘5.0.0’

Problems with offset() in Glm/ols

I'm currently implementing a Poisson regression model for survival data and I need to use the offset term for the count in that category. I've previously noted that the rms regression functions tend to behave strange when the offset() command is applied for ols() but this seems to be a more general issue for the regression functions. Below are examples based upon the code from this blog of Glm vs glm and ols vs lm:

# Setup some variables suited for poisson regression
Y  <- c(15,  7, 36,  4, 16, 12, 41, 15)
N  <- c(4949, 3534, 12210, 344,
        6178, 4883, 11256, 7125)
x1 <- c(-0.1, 0, 0.2, 0,
        1, 1.1, 1.1, 1)
x2 <- c(2.2, 1.5, 4.5, 7.2, 4.5, 3.2, 9.1, 5.2)
# Setup the rms environment
library(rms)
ddist <- datadist(Y, N, x1, x2)
options(datadist="ddist")

#############################
# Do the poisson regression #
#############################
reg_poisson_glm <- glm(Y ~ offset(log(N)) + (x1 + x2), family=poisson)
rms_poisson_glm <- Glm(formula(reg_poisson_glm), family=poisson)

# Adding the offset using the offset argument
glm(Y ~ x1 + x2, family=poisson, offset=log(N))
Glm(Y ~ x1 + x2, family=poisson, offset=log(N))
ols(Y ~ x1 + x2, offset=log(N))

# This works OK
coef(reg_poisson_glm) == coef(rms_poisson_glm)

# This gives different results
predict(reg_poisson_glm, newdata=data.frame(x1=1, x2=1.5, N=1000)) ==
  predict(rms_poisson_glm, newdata=data.frame(x1=1, x2=1.5, N=1000))

#####################
# Do the lm example #
#####################
reg_lm <- lm(Y ~ x1 + x2 + offset(N))
rms_ols <- ols(Y ~ x1 + x2 + offset(N))

# All coefficients differ
coef(reg_lm) == coef(rms_ols)
# see here
cbind(coef(reg_lm),coef(rms_ols))

# It seems that ols ignores the offset
rms_ols_no_offset <- ols(Y ~ x1 + x2)
coef(rms_ols) == coef(rms_ols_no_offset)

# This is also confirmed by using the specific offset() argument
# where the function seems to work correctly
rms_ols_offset_param <- ols(Y~x1+x2, offset=N)
coef(reg_lm) == coef(rms_ols_offset_param)

# The Glm regular regression works though better
rms_glm <- Glm(Y ~ x1 + x2 + offset(N))
coef(reg_lm) == coef(rms_glm)

# Although the predict still doesn't work for neither the
# glm or the ols param version
predict(reg_lm, newdata=data.frame(x1=1, x2=1.5, N=1000)) ==
  predict(rms_glm, newdata=data.frame(x1=1, x2=1.5, N=1000))

predict(reg_lm, newdata=data.frame(x1=1, x2=1.5, N=1000)) ==
  predict(rms_ols_offset_param, newdata=data.frame(x1=1, x2=1.5, N=1000))

# The predictrms seems to ignore the offset term
predict(rms_glm, newdata=data.frame(x1=1, x2=1.5, N=1000))==
  predict(rms_glm, newdata=data.frame(x1=1, x2=1.5))

# There is a floating point difference but the values are essentially the same
cbind(predict(reg_lm, newdata=data.frame(x1=1, x2=1.5, N=0)),
      predict(rms_glm, newdata=data.frame(x1=1, x2=1.5, N=1000)))

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables

I get a strange sounding error when trying to use validate() on a fitted ols:

Error in lsfit(x, y) : only 0 cases, but 2 variables

The dataset has n=1890 with about 400 predictors in the model. Almost all the predictors are dichotomous dummies indicating whether some regex pattern matched a name or not. Some of these only have a few true cases (but at least 10). This is a preliminary fit before I am doing some penalization to improve the model fit and final predictors (done with LASSO in glmnet). However, I wanted to validate the validity of the initial model. My guess is that the error occurs due to the resampling ending up with no cases for a given variable in the training set, which causes it to fail to fit / not be able to use that variable in the prediction in the test set.

For a reproducible example, here's a similar dataset based on iris:

#sim some data
iris2 = iris
set.seed(1)
iris2$letter1 = sample(letters, size = 150, replace = T)
iris2$letter2 = sample(letters, size = 150, replace = T)
iris2$letter3 = sample(letters, size = 150, replace = T)

#fit
(fit = rms::ols(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width, Petal.Length, data = iris2, x = T, y = T))
validate(fit)

Gives:

Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 150 missing values deleted

The dataset has no missing data.

In my own simple cross-validation implementation discussed here, I got around this issue by simply ignoring runs that produce errors. See this question: https://stats.stackexchange.com/questions/213837/k-fold-cross-validation-nominal-predictor-level-appears-in-the-test-data-but-no Maybe this too should be done for rms?

cph fails with rhs in the form of `factor(x)`

Using a formula like "surv ~ factor(x)" produces the error X[, mmcolnames, drop = FALSE] : subscript out of bounds.

set.seed(10)
n <- 500
ds <- data.frame(
  ftime = rexp(n),
  fstatus = sample(0:1, size = n, replace = TRUE),
  x = sample(LETTERS[1:4], size = n, replace = TRUE) )

library(rms)
dd <<- datadist(ds)
options(datadist="dd")

fit <- cph(Surv(ftime, fstatus) ~ factor(x), data=ds)

Passing an already-factored x works as expected. Not sure if this is the intended behavior, and/or the error message may be made more explicit. The debug argument clarified the problem.

> fit <- cph(Surv(ftime, fstatus == 1) ~ factor(x), data=ds,debug=T)

sformula

Surv(ftime, fstatus == 1) ~ factor(x)
     colnames(X)  mmcolnames Design colnames
[1,] "factor(x)B" "xB"       "x=B"          
[2,] "factor(x)C" "xC"       "x=C"          
[3,] "factor(x)D" "xD"       "x=D"          
Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds

This may be analogous to #15 .

Thanks!

survfit cph not working with id vector

Dear Professor Frank,

I am trying to run survfit using cph object according to the syntax mentioned here: https://www.rdocumentation.org/packages/rms/versions/5.1-1/topics/survfit.cph

My data is in a counting style format as it has TVC’s, however, it is ignoring the subject identifier “id” vector mentioned. Survfit is fitting the model by treating all observations of the same subject to be different from each other (as if it were different subjects) despite specifying the id vector. That is if there are 10 subjects with 3 time stops (1,2,3) for each subject, it is giving 30 different survival curves rather than 10 survival curves each belonging to one subject. I am really not sure what am I doing wrong?

My code is as follows:
library(survival)
install.packages("rms", dependencies = TRUE)
library(rms)

'01. input modelling data'
MOD_DS <- read.csv("V:/CoxPH_SMPL_FOR_R.csv",header=TRUE)

'02. run Cox regression using TVC'
COX_MODEL_V01 <- cph(Surv(TIME_LAG,TIME, BAD) ~ V1+ V2 + V3, data =MOD_DS,y=TRUE,x=TRUE,surv=TRUE)

'03. store covariate values to score accounts by the model' 'The covariates dataset must be in the same format as the modelling ds. It should have a BAD variable as well, though it could be coded to any dummy value as it is not used by survfit'
covs_FOR_SCORING <- read.csv("V:/covs_FOR_SCORING.csv",header=TRUE)

'04. Define the subject identifier vector'
id <- covs_FOR_SCORING[["sub_id"]]

'05. Run survfit using cph'
SCORED_POP <- summary (survfit(COX_MODEL_V01, newdata = covs_FOR_SCORING, individual = FALSE,conf.type =("none"), se.fit = F, id , type="counting"))
As noted in step 05 above, I am using id vector which uses the sub_id mentioned in the newdata = covs_FOR_SCORING dataset but the code runs ignoring the subject identifier. The number of rows in the id vector = number of rows in newdata.

P.S. I don’t want to use the survfit with coxph object as I have a dataset with 120K subjects and it takes 4 hours to fit my model on it. The cph survfit runs in 1 min!!!!

I am mentioning an example dataset for your perusal and debugging.

please use this an input for modelling using Cox cph
please use the same dataset for getting survival estimates through survfit
<--------------------------------------- Example dataset --------------------------------------------->

1	2	3	4	5	6
id	death	age	lt_age	time0	time1
1	1	20	0	0	1
2	0	21	0	0	1
3	0	19	0	0	1
3	1	19	16.05686	1	7
4	0	22	0	0	1
4	0	22	18.59216	1	7
4	1	22	22	7	10
5	0	20	0	0	1
5	0	20	16.90196	1	7
5	0	20	20	7	10
6	0	24	0	0	1
6	0	24	20.28235	1	7
6	0	24	24	7	10
6	1	24	26.73464	10	13
7	0	28	8.42884	0	2
7	0	28	19.57116	2	5
7	0	28	23.66275	5	7
7	0	28	25.28652	7	8
7	0	28	28	8	10
7	1	28	31.19041	10	13

library(survival)
install.packages("rms", dependencies = TRUE)
library(rms)

'01. input modelling data (please input the above dataset'
MOD_DS <- read.csv("V:/sample_ds_for_surv_cph.csv",header=TRUE)

'02. run Cox regression using TVC'
COX_MODEL_V01 <- cph(Surv(time0,time1, death) ~ age + lt_age , data =MOD_DS,y=TRUE,x=TRUE,surv=TRUE)

'03. store covariate values to score accounts by the model'
covs_FOR_SCORING <- read.csv("V:/MOD_DS .csv",header=TRUE)

id <- covs_FOR_SCORING[["id"]]

'04. score using survfit cph'
SCORED_POP <- summary(survfit(COX_MODEL_V01, newdata = covs_FOR_SCORING, individual = FALSE,conf.type =("none"),se.fit = F, id ,type="counting"))

'05. score using survfit coxph'
COX_MODEL_V01 <- coxph(Surv(time0,time1, death) ~ age + lt_age, data = MOD_DS)
SCORED_POP <- summary(survfit(COX_MODEL_V01, newdata = covs_FOR_SCORING,id=id, individual = FALSE,conf.type =("none"),se.fit = F ))

O/P from survfit coxph is different from O/P from survfit cph. The former has 7 survival curves, one for each subject. The latter has 20 survival curves.

survfit.cph output
time n.risk n.event survival1 survival2 survival3 survival4 survival5 survival6 survival7
1 7 1 0.829 0.929 0.62 0 0.971 0 0
7 5 1 0.829 0.929 0.62 0 0.971 0 0
10 4 1 0.829 0.929 0.62 0 0.971 0 0
13 2 2 0.829 0.929 0.62 0 0.971 0 0
survival8 survival9 survival10 survival11 survival12 survival13 survival14 survival15
0.829 0 0 0.996 0 0 0 0.962
0.829 0 0 0.996 0 0 0 0.962
0.829 0 0 0.996 0 0 0 0.962
0.829 0 0 0.996 0 0 0 0.962
survival16 survival17 survival18 survival19 survival20
2.73e-42 0 0 0 0
2.73e-42 0 0 0 0
2.73e-42 0 0 0 0
2.73e-42 0 0 0 0

survfit.coxph output

1

1
time   n.risk n.event survival
1.000    7.000    1.000    0.829

2
time   n.risk n.event survival
1.000    7.000    1.000    0.929

3
time n.risk n.event survival
1      7       1    0.620
7      5       1    0.401

4
time n.risk n.event survival
1      7       1    0.971
7      5       1    0.831
10      4       1    0.623

5
time n.risk n.event survival
1      7       1    0.829
7      5       1    0.608
10      4      1    0.384

6
time n.risk n.event survival
1      7       1    0.996
7      5       1    0.920
10      4       1    0.768
13      2       2    0.109

7
time n.risk n.event survival
1      7       1    0.962
7      5       1    0.943
10      4       1    0.878
13      2       2    0.307

Get all model fit statistics?

Per my Stackoverflow question:

https://stackoverflow.com/questions/47724189/extract-all-model-statistics-from-rms-fits/47724190#47724190

I cannot find a proper way to extract certain fit statistics from the fitted object.

library(pacman)
p_load(rms, stringr, readr)

#fit
> (fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris))
Linear Regression Model

 rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + 
     Petal.Width + Species, data = iris)

                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
 Obs     150    LR chi2    302.96    R2       0.867    
 sigma0.3068    d.f.            5    R2 adj   0.863    
 d.f.    144    Pr(> chi2) 0.0000    g        0.882    

 Residuals

       Min        1Q    Median        3Q       Max 
 -0.794236 -0.218743  0.008987  0.202546  0.731034 


                    Coef    S.E.   t     Pr(>|t|)
 Intercept           2.1713 0.2798  7.76 <0.0001 
 Sepal.Width         0.4959 0.0861  5.76 <0.0001 
 Petal.Length        0.8292 0.0685 12.10 <0.0001 
 Petal.Width        -0.3152 0.1512 -2.08 0.0389  
 Species=versicolor -0.7236 0.2402 -3.01 0.0031  
 Species=virginica  -1.0235 0.3337 -3.07 0.0026

How does one get all the outputted data? Standard errors, p values, adjusted R2 are not available in the fit object. As far as I can tell.

Looking over the codebase, the easiest solution seems to be to have the print.ols etc. functions invisibly return a list with an extended list of model statistics.

ols() fails for interactions without base term

I want to estimate an OLS model with an interaction term, but I don't want to include all variables from the interaction individually. ols calls Design, which is unable to handle this case. For example

y <- rnorm(100)
x1 <- rnorm(100)
x2 <- rnorm(100)
ols(y ~ x1 + x1:x2)

returns:

Error in if (!length(fname) || !any(fname == zname)) { : 
  missing value where TRUE/FALSE needed

While a statistician may simply demand the inclusion of x2, there are legitimate cases where it is not feasible or desirable. For example, x2 may only vary across entities for which fixed effects are included in the model.

contrast.rms collapses small p-values to zero

The function contrast.rms reports the p-values for all Wald statistics absolutely larger than ca. 9 as an exact zero.

Example adapted from the contrast.rms help page with Z=10.36:

set.seed(1)
x  = rnorm(500)
xb = 1+2*x
y  = ifelse(runif(500) <= plogis(xb), 1, 0)
f  = lrm(y ~ x)
ct = contrast(f, list(x=1))
ct
ct$P==0 ## Exact zero

The problem seems to be the line below in contrast.s where the p-value is calculated:

P <- if(length(idf)) 2 * (1 - pt(abs(Z), idf)) else 2 * (1 - pnorm(abs(Z)))

Replacing this with

P <- if(length(idf)) 2 * (1 - pt(abs(Z), idf)) else 2 *  pnorm(-abs(Z))

should do the trick.

rms 4.3-1

               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          2.2                         
year           2015                        
month          08                          
day            14                          
svn rev        69053                       
language       R                           
version.string R version 3.2.2 (2015-08-14)
nickname       Fire Safety

Using get() for independent variable in lrm causes subscript out of bounds error

RStudio Version 1.0.136, R 3.3.2 on Windows 8.1. Package rms 5.1-0

I am trying to do bivariate comparisons by running lrm against a large set of 21 outcome (y) and 6 predictor (x) variables using a programmed loop. The code below for lrm fails with "Error in X[, mmcolnames, drop = FALSE] : subscript out of bounds". If I change the call to use glm and alter the other expected parameters accordingly then the call works. It seems that the use of get() in the predictor position is causing some unhandled situation in lrm.

I'd tried to essentially achieve the following (removing all the looping mechanics):

vdrug <- "Alcohol"
vpred <- "Sex"
m0 <- lrm(get(vdrug) ~ get(vpred), data = w13drug_dt)

Interestingly, the problem is not with any use of get() as the following code with get() in the y position works just fine:

vdrug <- "Alcohol"
m0 <- lrm(get(vdrug) ~ Sex, data = w13drug_dt)

And the idea of using get() in both y and x positions also works with glm; e.g.,

vdrug <- "Alcohol"
vpred <- "Sex"
m0 <- glm(get(vdrug) ~ get(vpred), data = w13drug_dt, family=binomial(link=logit))

However, the following code fails with the above-mentioned error message:

vpred <- "Sex"
m0 <- lrm(Alcohol ~ get(vpred), data = w13drug_dt)

In case it is relevant, I already had run datadist for the predictor (x) variables. The code calling lrm also works perfectly if the variable names are hardcoded; i.e.,

m0 <- lrm(Alcohol ~ Sex, data = w13drug_dt)

In case it was an interaction with just get(), I also tried:

vpred <- quote(Sex)
m0 <- lrm(Alcohol ~ eval(vpred), data = w13drug_dt)

but this causes the same subscript error. However, as with get(), the following code works:

vdrug <- quote(Alcohol)
m0 <- lrm(eval(vdrug) ~ Sex, data = w13drug_dt)

For the input data.frame/data.table, selected output from str(w13drug_dt) is:

str(w13drug_dt)
Classes ‘data.table’ and 'data.frame': 172 obs. of 42 variables:
$ Qnum : chr "A-07-001" "A-07-002" "A-07-003" "A-07-004" ...
$ Alcohol : int 0 1 1 0 0 0 0 1 0 0 ...
(snip)
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 1 2 1 2 ...

Size of cph() return value is much bigger compared to coxph()

Linear predictors and residuals returned by cph() have a bigger size compared to the values returned by coxph() from the survival package.
Casting them to double solves the issue and brings them to the same size as coxph(). Some pseudo-code:
as.double(cph(...)$linear.predictors)
as.double(cph(...)$residuals)

This is an issue for large models. For example, using the same data (~2.7 milions of rows), same formula and same methods, coxph() returns an object of ~186Mb size while cph() returns an object of ~331Mb. Anyone can try with his own data and compare the twos.

strata.coefs option for printing lrm doesn't work

I was trying to implement support for rendering objects from rms in markdown using pander and while printing lrm in different ways, I realized that setting strata.coefs causes and error. Taking example from ?lrm

> n <- 1000    # define sample size
> set.seed(17) # so can reproduce the results
> age            <- rnorm(n, 50, 10)
> blood.pressure <- rnorm(n, 120, 15)
> cholesterol    <- rnorm(n, 200, 25)
> sex            <- factor(sample(c('female','male'), n,TRUE))
> label(age)            <- 'Age'      # label is in Hmisc
> label(cholesterol)    <- 'Total Cholesterol'
> label(blood.pressure) <- 'Systolic Blood Pressure'
> label(sex)            <- 'Sex'
> units(cholesterol)    <- 'mg/dl'   # uses units.default in Hmisc
> units(blood.pressure) <- 'mmHg'
> f <- lrm(ch ~ age)
> print(f, strata.coefs = TRUE)
Error in print.lrm(f, strata.coefs = TRUE) : 
  attempt to apply non-function

Looking at code of rms:::print.lrm, it seems that there is an error here:

    if (strata.coefs) {
        cof <- c(cof, x$strata.coef)
        vv <- c(vv, x$Varcov(x, which = "strata.var.diag"))
        if (length(pm)) 
            penalty.scale <- c(penalty.scale, rep(NA, x$nstrata - 
                                                      1))
    }

lrm objects don't have Varcov field and moreover, since lrm object is not an S4 object, I don't think field can be a function.

The half (1/2) unicode sign is not properly printed in survplotp legend

Seems like a relatively recent issue, but when we render survplotp, instead of the legend for the grey middle band being labeled as "Difference ½ 0.95 CL", we rather get the unicode sequence (not the symbol) printed, like so "Difference ½ 0.95 CL"

I guess some escaping rules somewhere (maybe in plotly?) changed.

Here's an example taken from the code in the ?suvrplot help page that reproduces the issue:

library(rms)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('male','female'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"

S <- Surv(dt,e)
f <- npsurv(S ~ sex)
survplotp(f)

The relevant bits from session_info() are likely:

Session info ----------------------------------------------------------------
 setting  value
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, darwin15.6.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       US/Pacific
 date     2017-09-06

Packages --------------------------------------------------------------------
 Hmisc        * 4.0-3      2017-05-02 CRAN (R 3.4.0)
 plotly         4.7.1      2017-07-29 CRAN (R 3.4.1)
 rms          * 5.1-1      2017-05-03 CRAN (R 3.4.0)
 shiny          1.0.5      2017-08-23 CRAN (R 3.4.1)

harrelfe / rms Goto Github PK

rms's Introduction

rms

Current Goals

Web Sites

To Do

rms's People

Contributors

Stargazers

Watchers

Forkers

rms's Issues

Suggestion n. 1

Suggestion n. 2

create a datadist object

HR ratios are not within their corresponding CIs

HRs for the predictors are probably exp(beta)

1

Recommend Projects

Recommend Topics

Recommend Org