Comments (6)
This does disturb me. Is there a convenient way for me to reproduce the data and PS model? Fitting decisions to consider (a sketch of how these map to Cyclops control settings follows this list):
- auto- vs. grid-search (auto-search seems to work better in my hands)
- number of folds (10 is pretty standard, but the data may be very noisy)
- number of replicates (the default is 1, but the data may be very noisy)
- how we pick the hold-out set at each hyperparameter (I am concerned that I coded a very silly choice here)
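For concreteness, a minimal sketch of where each decision lives, assuming the cvType, fold, cvRepetitions, and seed arguments of Cyclops::createControl (argument names may differ across versions):

library("Cyclops")

# Hedged sketch: each fitting decision above as a createControl() argument.
control <- createControl(
  cvType = "auto",      # auto- vs. grid-search over the prior variance
  fold = 10,            # number of cross-validation folds
  cvRepetitions = 1,    # number of replicates of the full CV loop
  seed = 123,           # fixes how the hold-out sets are drawn
  noiseLevel = "silent")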
Below is some simple test code that reproduces the problem (I think). I'm using a train-test split, so we should be able to rule out the overfitting hypothesis.
library("Cyclops")
library("pROC")
predictOnTest <- function(fit,test){
betas <- coef(fit)
intercept <- betas[1]
betas <- betas[2:length(betas)]
betas <- data.frame(beta = as.numeric(betas),covariateId = as.numeric(names(betas)))
prediction <- merge(test$covariates,betas)
prediction$value = prediction$covariateValue * prediction$beta
prediction <- aggregate(value ~ rowId,data=prediction,sum)
prediction$value = prediction$value + intercept
link <- function(x) {
return(1/(1+exp(-x)))
}
prediction$value = link(prediction$value)
return(prediction)
}
evaluate <- function(fit,test){
pred <- predictOnTest(fit,test)
predVsTruth <- merge(pred,data$outcomes[,c("rowId","y")])
auc <- roc(response = predVsTruth$y, predictor = predVsTruth$value)$auc
writeLines(paste("Variance =",fit$variance,", AUC=",as.character(auc)))
}
ntest <- 1000
ntrain <- 1000
data <- simulateData(nstrata=1,nrows=ntest+ntrain,ncovars=2000,model="logistic")
test <- list(outcomes = data$outcomes[1:ntest,], covariates = data$covariates[data$covariates$rowId %in% data$outcomes$rowId[1:ntest],])
train <- list(outcomes = data$outcomes[(ntest+1):(ntest+ntrain),], covariates = data$covariates[data$covariates$rowId %in% data$outcomes$rowId[(ntest+1):(ntest+ntrain)],])
cyclopsData <- convertToCyclopsData(train$outcomes,train$covariates,modelType = "lr",addIntercept = TRUE)
prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(lowerLimit=0.01, upperLimit=10, fold=5, noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData,prior=prior,control=control)
evaluate(fit,test)
prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData,prior=prior,control=control)
evaluate(fit,test)
prior <- createPrior("laplace", variance = 0.1)
control <- createControl(noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData,prior=prior,control=control)
evaluate(fit,test)
prior <- createPrior("laplace", variance = 1)
control <- createControl(noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData,prior=prior,control=control)
evaluate(fit,test)
prior <- createPrior("laplace", variance = 10)
control <- createControl(noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData,prior=prior,control=control)
evaluate(fit,test)
It appears that we should reconsider our cross-validation selection criterion. We are currently attempting to maximize the predicted (log) likelihood of the hold-out data. The following updated evaluation functions
# Same as above, but also keep the linear predictor (xBeta) so we can compute
# the hold-out log likelihood.
predictOnTest <- function(fit, test) {
  betas <- coef(fit)
  intercept <- betas[1]
  betas <- betas[2:length(betas)]
  betas <- data.frame(beta = as.numeric(betas),
                      covariateId = as.numeric(names(betas)))
  prediction <- merge(test$covariates, betas)
  prediction$value <- prediction$covariateValue * prediction$beta
  prediction <- aggregate(value ~ rowId, data = prediction, sum)
  prediction$value <- prediction$value + intercept
  link <- function(x) {
    return(1 / (1 + exp(-x)))
  }
  prediction$xBeta <- prediction$value
  prediction$value <- link(prediction$value)
  return(prediction)
}

# Report the test-set AUC and the predicted logistic log likelihood:
# sum(y * xBeta) - sum(log(1 + exp(xBeta))).
evaluate <- function(fit, test) {
  pred <- predictOnTest(fit, test)
  predVsTruth <- merge(pred, data$outcomes[, c("rowId", "y")])
  auc <- roc(response = predVsTruth$y, predictor = predVsTruth$value)$auc
  predLogLik <- sum(predVsTruth$y * predVsTruth$xBeta) - sum(log(1 + exp(predVsTruth$xBeta)))
  writeLines(paste("Variance =", fit$variance, ", AUC =", as.character(auc), ", PL =", predLogLik))
}
now also report the predicted (log) likelihood of the test dataset. CV does seem to be doing a reasonable job of finding a maximum when cvRepetitions is pumped up to, say, 10.
For reproducibility, I have been using
set.seed(666)
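A minimal sketch of those settings combined, assuming createControl exposes a cvRepetitions argument (as in the Cyclops versions I have seen):

set.seed(666)
# Repeat the full cross-validation loop 10 times to stabilize the estimates:
prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(cvRepetitions = 10, noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control)
evaluate(fit, test)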
What easy-to-compute criterion should we be using if we want to maximize discrimination (AUC)?
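One candidate worth noting (my suggestion, not from the thread): the AUC itself is cheap to compute on each hold-out set via the Mann-Whitney U statistic, with no ROC machinery needed. A minimal sketch:

# Rank-based AUC (Mann-Whitney U): O(n log n); mid-ranks handle ties.
fastAuc <- function(y, p) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  r <- rank(p)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

With mid-ranks for ties this should match the usual trapezoidal AUC, so it could serve as a hold-out criterion without pulling in pROC.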
We should also exclude the intercept term from regularization via
prior <- createPrior("laplace", exclude = c(0), useCrossValidation = TRUE)
Hi Marc,
I'm not sure what exactly solved the problem, but it is solved now. The variance that is optimal in terms of likelihood also leads to good, though not optimal, AUC. But optimizing on AUC does not seem like a very good idea anyway. After fitting a propensity score using cross-validation (using the default settings, so the auto approach), we now get covariate balance.
My guess is that fixing the folds leads to more stable estimates of performance at the grid points, and therefore a better estimate of the optimal variance. Anyway, I'm closing this issue.
(It would be nice if we could use parallelization to speed up the cross-validation though ;-) )
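On that wish, an assumption on my part: newer Cyclops releases appear to expose a threads argument on createControl, which would let the fold fits run in parallel while the seed keeps the fold assignment fixed. A hypothetical sketch:

# Hypothetical: parallel cross-validation with fixed fold assignment.
control <- createControl(cvRepetitions = 10, threads = 4, seed = 666,
                         noiseLevel = "silent")
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control)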
Just an afterthought: unstable cross-validation estimates due to re-randomization of the folds would mostly affect evaluations at large prior variances; at smaller variances performance would be more stable because everything shrinks toward 0. This would bias the optimization toward selecting smaller variances, which is what I think I observed.
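One way to probe this hypothesis (my illustration, not from the original thread; it reuses predictOnTest, train, test, and data from the code above): refit at a small and a large fixed prior variance on several random halves of the training data and compare the spread of the hold-out log likelihood. Under the hypothesis, the spread should be wider at the large variance.

# Hold-out log likelihood, reusing the xBeta column from predictOnTest().
holdOutLogLik <- function(fit, test) {
  pred <- predictOnTest(fit, test)
  predVsTruth <- merge(pred, data$outcomes[, c("rowId", "y")])
  sum(predVsTruth$y * predVsTruth$xBeta) - sum(log(1 + exp(predVsTruth$xBeta)))
}

for (v in c(0.1, 10)) {
  pls <- sapply(1:5, function(i) {
    # Refit on a random half of the training data:
    idx <- sample(nrow(train$outcomes), nrow(train$outcomes) / 2)
    sub <- list(outcomes = train$outcomes[idx, ],
                covariates = train$covariates[train$covariates$rowId %in% train$outcomes$rowId[idx], ])
    cd <- convertToCyclopsData(sub$outcomes, sub$covariates, modelType = "lr", addIntercept = TRUE)
    f <- fitCyclopsModel(cd,
                         prior = createPrior("laplace", variance = v),
                         control = createControl(noiseLevel = "silent"))
    holdOutLogLik(f, test)
  })
  writeLines(paste("variance =", v, ": sd of hold-out log likelihood =", sd(pls)))
}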