ohdsi / cohortmethod

An R package for performing new-user cohort studies in an observational database in the OMOP Common Data Model.

Home Page: https://ohdsi.github.io/CohortMethod

CohortMethod

CohortMethod is part of HADES.

Introduction

CohortMethod is an R package for performing new-user cohort studies in an observational database in the OMOP Common Data Model.

Features

Extracts the necessary data from a database in OMOP Common Data Model format.
Uses a large set of covariates for both the propensity and outcome model, including for example all drugs, diagnoses, procedures, as well as age, comorbidity indexes, etc.
Uses large-scale regularized regression to fit the propensity and outcome models.
Includes functions for trimming, stratifying, matching, and weighting on propensity scores.
Includes diagnostic functions, including propensity score distribution plots and plots showing covariate balance before and after matching and/or trimming.
Supported outcome models are (conditional) logistic regression, (conditional) Poisson regression, and (conditional) Cox regression.

Screenshots


[Figures: propensity (preference score) distribution; covariate balance plot]

Technology

CohortMethod is an R package, with some functions implemented in C++.

System Requirements

Requires R (version 3.6.0 or higher). Installation on Windows requires RTools. Libraries used in CohortMethod require Java.

Installation

See the instructions here for configuring your R environment, including RTools and Java.
In R, use the following commands to download and install CohortMethod:

install.packages("remotes")
remotes::install_github("ohdsi/CohortMethod")

Optionally, run this to check if CohortMethod was correctly installed:

library(CohortMethod)

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "my_server.org",
                                             user = "joe",
                                             password = "super_secret")

checkCmInstallation(connectionDetails)

Here dbms, server, user, and password must be changed to match your database environment. Type

?createConnectionDetails

for more details on how to configure your database connection.

User Documentation

Documentation can be found on the package website.

PDF versions of the documentation are also available:

Vignette: Single studies using the CohortMethod package
Vignette: Running multiple analyses at once using the CohortMethod package
Package manual: CohortMethod.pdf

Support

Developer questions/comments/feedback: OHDSI Forum
We use the GitHub issue tracker for all bugs/issues/enhancements

Contributing

Read here how you can contribute to this package.

License

CohortMethod is licensed under Apache License 2.0

Development

CohortMethod is being developed in RStudio.

Development status

CohortMethod is actively being used in several studies and is ready for use.

Acknowledgements

This project is supported in part through the National Science Foundation grant IIS 1251151.

cohortmethod's Issues

Variable case inconsistency

Line 848 in PsFunctions.R has inconsistent casing in a variable name. It is currently beforeMatchingsumComparator; should it be beforeMatchingSumComparator?

A minor issue but I noticed this when working through the comparative cohort analysis workflow and the SqlRender function converting camel case to snake case bit me with this issue.

About restrictToCommonPeriod parameter

I'm trying to perform the Risk of hip fracture on bisphosphonates study.
(http://www.ohdsi.org/web/wiki/doku.php?id=research:bisphosphonates_and_hip_fracture)

I got an error when I use the AlendronateVsRaloxifene package.

As far as I know, the restrictToCommonPeriod parameter was added to CohortMethod after the AlendronateVsRaloxifene study was finished.

Should either CohortMethod or AlendronateVsRaloxifene package be modified?

Thank you,

Too long expressions in query

Hi Martijn,

In Oracle, the maximum number of expressions in a list is 1000. If length(nsaids) in the example of "Single studies using the CohortMethod package" is greater than 1000, an error will occur. How do we deal with this issue?

One way could be:
split the list into several sub-lists with length<=100 and combine them with OR clauses in the query's condition clause.
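
The workaround suggested above could be sketched in R roughly as follows. This is only an illustration, not code from CohortMethod: buildChunkedInClause is a hypothetical helper, drug_concept_id is an assumed column name, and the IDs are placeholders rather than real NSAID concepts.

```r
# Hypothetical helper: builds "(col IN (...) OR col IN (...))" with at most
# maxItems values per IN list, to stay under Oracle's 1000-expression limit.
buildChunkedInClause <- function(column, ids, maxItems = 1000) {
  stopifnot(length(ids) > 0, maxItems >= 1)
  # Assign every ID to a chunk of at most maxItems elements
  chunks <- split(ids, ceiling(seq_along(ids) / maxItems))
  clauses <- vapply(chunks, function(chunk) {
    # format() avoids scientific notation for large numeric IDs
    values <- format(chunk, scientific = FALSE, trim = TRUE)
    sprintf("%s IN (%s)", column, paste(values, collapse = ", "))
  }, character(1))
  paste0("(", paste(clauses, collapse = " OR "), ")")
}

nsaids <- 1:2500  # placeholder concept IDs
clause <- buildChunkedInClause("drug_concept_id", nsaids)

# Small example with two items per list
smallClause <- buildChunkedInClause("x", c(1, 2, 3), maxItems = 2)
```

The resulting string can be substituted wherever the query currently uses a single IN list; the surrounding parentheses keep the OR clauses from interacting with other conditions.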

Thanks,

Zuoyi

CDM v5 compatibility

Currently, CohortMethod only works with CDM v4. Need to add support for v5.

Add functions for running the all-by-all-by-all

Add functions for efficiently running the CohortMethod across multiple drug-comparator, outcome, and analysis design choices.

Add elapsed time message to createPs

Add the same basic message that other functions have upon completion, i.e.:

Analysis took 2.03 mins

Error reproducing vignette in postgresql

Hi everyone,

When I try running the vignette in postgresql, I get the following error at 35% completion when calling getDbCohortData:

DBMS:
postgresql

Error:
execute JDBC update query failed in dbSendUpdate (ERROR: operator does not exist: text + character varying
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.
Position: 954)

Any suggestions for how to deal with this?

Gratefully,
Trevor

CohortMethod plots: allow users to add titles?

I would like to add a title to the various plots, something that allows me to define the name of the treated/comparator group and also, if applicable, details which analysis the graph refers to. I have achieved this in the past by saving the plot object and then adding ggtitle() before saving it, but this can cause some annoying formatting issues, particularly if the title is long or multi-lined. Would it be possible to have a 'title' parameter on our graphs that formats titles better?

Add inverse probability weighting by propensity scores

Change standardized caliper to use logit scale

Currently the caliper on the standardized scale is defined as the SD of the propensity score here. However, it is recommended to take the logit of the PS instead because 'the logit of the propensity score is more likely to be normally distributed than the propensity score itself'.
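
To make the difference concrete, here is a small base R sketch comparing the two definitions. The propensity scores are simulated, the 0.2 multiplier is an assumed example value, and withinCaliper is a hypothetical helper, not a CohortMethod function.

```r
# Simulated propensity scores, for illustration only
set.seed(42)
ps <- plogis(rnorm(1000, mean = -0.5, sd = 1.2))

caliperMultiplier <- 0.2  # assumed example value

# Standardized caliper on the propensity score scale
caliperPs <- caliperMultiplier * sd(ps)

# Standardized caliper on the logit scale: logit(ps) = log(ps / (1 - ps))
logitPs <- qlogis(ps)
caliperLogit <- caliperMultiplier * sd(logitPs)

# Hypothetical helper: two subjects may be matched if their logit
# propensity scores differ by no more than the caliper
withinCaliper <- function(psA, psB, caliper) {
  abs(qlogis(psA) - qlogis(psB)) <= caliper
}
```

On the logit scale, a fixed difference near 0 or 1 corresponds to a much smaller difference in raw propensity score than the same difference near 0.5, which is why the two calipers can select different matches.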

Enforce consistency in language

Terms like drug and treatment should be dropped in favor of terms like target and exposure.

Add flow chart of population / attrition diagram

Create a diagram showing how many patients are 'lost' due to different filtering steps, stratified by treatment status. At least these steps should be distinguished:

Having at least the washout period's worth of observation time
Requiring the first exposure to be after the washout period
Prior outcome
Matching / trimming

Covariate names are too long

Currently covariate names can be hundreds of characters long, too long for visualizations and for including them in readable tables. Maybe we should (also) generate abbreviated names?

Add (conditional) logistic regression outcome model

(Conditional) logistic regression is not yet implemented as an outcome model.

change SQL in getOutcomes to SELECT INTO instead of INSERT statements

SELECT INTO or CTAS queries seem more efficient than creating a table and then inserting records into it. We definitely can do this in getOutcomes; it's a bit less clear how to optimize getCovariates.

make loadRDS wrapper of readRDS for consistency

We save using saveRDS and load using readRDS, so our own loadRDS wrapper around readRDS might be useful?

Create vignette for all-by-all

Create a vignette for running multiple analyses at once.

Comparing Cohorts with CohortMethod

I am curious about your thoughts on the virtues of using CohortMethod as a general comparator tool for cohorts. I have a project where I am not comparing outcome differences between groups but rather want to see how the groups differ.

It strikes me that CM does this already in its intermediate steps. Any caveats I should consider?

Patrick and I have chatted about a comparison view for Heracles, but in this case I am looking more for a rank-ordered list of concepts.

Dependencies missing

When installing this package with devtools, the dependencies

ohdsi/DatabaseConnector
ohdsi/SqlRender
ohdsi/DatabaseConnector
ohdsi/FeatureExtraction

are missing. This means that devtools::install_github("ohdsi/CohortMethod") fails.
Installing the dependencies by hand allows devtools to install CohortMethod.

Graceful handling of cohortData objects with empty exposure and/or outcome cohorts

Currently functions like summary.cohortData(), saveCohortData() and createPs() are not very friendly when the user provides a cohortData object that has empty exposure and/or outcome cohorts (e.g. because of invalid concept IDs). These functions need to be modified.

Bug: CohortMethod.sql, varchar field in cohort_covariate_ref table not large enough?

This field doesn't seem to be large enough for all cases:

https://github.com/OHDSI/CohortMethod/blob/master/inst/sql/sql_server/CohortMethod.sql#L154

When that gets translated to redshift (at least in my case), the field becomes varchar(256). But this isn't long enough for some cases at this line in the CohortMethod.sql code:

https://github.com/OHDSI/CohortMethod/blob/master/inst/sql/sql_server/CohortMethod.sql#L681

That field can end up too long if the concept_name field is long. For example, for concept_id 439181, we have that concept_name is: "Cortex contusion without open intracranial wound AND with prolonged loss of consciousness (more than 24 hours) without return to pre-existing conscious level" which ends up being too long (total length of concatenated string is 274).

I have code that replicates this if anyone is curious (I doubt it), but I'd have to clean it a bit so I'll only do that if someone's interested. I guess the field size should be increased (though the "max" there is a little ominous) or maybe the field should just be changed to "text"? I'm not sure what the best solution is myself since I'm still not too comfortable with all the inner workings of SqlRender, etc.

Error on getPsModel

Let me know if I can provide any additional details.

propensityModel <- getPsModel(results, cohortData)
Error in abs(cfs$coefficient) :
non-numeric argument to mathematical function
In addition: Warning message:
In merge.ffdf(ff::as.ffdf(cfs), cohortData$covariateRef, by.x = "id", :
No match found, returning NULL as ffdf can not contain 0 rows

traceback
function (x = NULL, max.lines = getOption("deparse.max.lines"))
{
if (is.null(x) && !is.null(x <- get0(".Traceback", envir = baseenv()))) {
}
else if (is.numeric(x))
x <- .Internal(traceback(x))
n <- length(x)
if (n == 0L)
cat(gettext("No traceback available"), "\n")
else {
for (i in 1L:n) {
label <- paste0(n - i + 1L, ": ")
m <- length(x[[i]])
if (!is.null(srcref <- attr(x[[i]], "srcref"))) {
srcfile <- attr(srcref, "srcfile")
x[[i]][m] <- paste0(x[[i]][m], " at ", basename(srcfile$filename),
"#", srcref[1L])
}
if (m > 1)
label <- c(label, rep(substr(" ", 1L,
nchar(label, type = "w")), m - 1L))
if (is.numeric(max.lines) && max.lines > 0L && max.lines <
m) {
cat(paste0(label[1L:max.lines], x[[i]][1L:max.lines]),
sep = "\n")
cat(label[max.lines + 1L], " ...\n")
}
else cat(paste0(label, x[[i]]), sep = "\n")
}
}
invisible(x)
}
<bytecode: 0x000000001505bed0>
<environment: namespace:base>

plotPS enhancements: adding information about equipoise

On the propensity score distribution plot, could we add some statistics like:

total cohort size in treated/comparator (perhaps "n=xxx" on the legend)
% of treated/comparator group in clinical equipoise (e.g. % with 0.4<=PrefScore<=0.6)
% of treated/comparator group without overlap
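
As a rough sketch of how the first two statistics might be computed, the base R snippet below derives preference scores from propensity scores and reports the share of each group inside the 0.4 to 0.6 range mentioned above. The data are simulated and computePreferenceScore is a hypothetical helper; the transformation used is logit(preference) = logit(ps) - logit(proportion treated).

```r
# Simulated cohort: treatment indicator and propensity scores (illustrative only)
set.seed(1)
n <- 2000
treatment <- rbinom(n, 1, 0.3)
ps <- plogis(rnorm(n,
                   mean = qlogis(0.3) + ifelse(treatment == 1, 0.4, -0.2),
                   sd = 0.8))

# Hypothetical helper: convert propensity scores to preference scores by
# removing the overall treatment prevalence on the logit scale
computePreferenceScore <- function(ps, treatment) {
  proportionTreated <- mean(treatment)
  plogis(qlogis(ps) - qlogis(proportionTreated))
}

prefScore <- computePreferenceScore(ps, treatment)

# Cohort sizes per group (candidate for an "n=..." legend entry)
groupSizes <- table(treatment)

# Percentage of each group in clinical equipoise (0.4 <= PrefScore <= 0.6)
inEquipoise <- prefScore >= 0.4 & prefScore <= 0.6
equipoiseByGroup <- tapply(inEquipoise, treatment, mean) * 100
```

The percentage without overlap is not covered here, since it depends on how the overlap region is defined.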

How to treat the patients in CohortMethod if the patients took exposure and comparator drugs on the index date?

Error When Running CM Analysis

I received this error when running CM (tough to get after 9.72 hours! :) ):

Error in UseMethod("open") :
no applicable method for 'open' applied to an object of class "data.frame"

Here is the full output, and below is my study configuration:

Connecting using Oracle driver

Constructing treatment and comparator cohorts
Executing multiple queries. This could take a while
|==================================================================================================================================| 100%
Analysis took 0.822 secs
Fetching data from server
Loading took 2.56 secs
Constructing default covariates
|==================================================================================================================================| 100%
Analysis took 9.72 hours
Done
Fetching data from server
Loading took 4.11 mins
Removing redundant covariates
Normalizing covariates

Constructing outcomes
Executing multiple queries. This could take a while
|==================================================================================================================================| 100%
Analysis took 6.27 secs
Done
Fetching data from server
Loading took 0.306 secs
Error in UseMethod("open") :
no applicable method for 'open' applied to an object of class "data.frame"
In addition: Warning message:
In lowLevelQuerySql.ffdf(connection, sql) :
Data has zero rows, returning an empty data frame

Study configuration (I did not use any excluded concepts, which I realize is not good, but I don't think it is the cause of the error):

covarSettings <- createCovariateSettings(useCovariateDemographics = TRUE,
useCovariateConditionOccurrence = TRUE,
useCovariateConditionOccurrence365d = TRUE,
useCovariateConditionOccurrence30d = TRUE,
useCovariateConditionOccurrenceInpt180d = TRUE,
useCovariateConditionEra = TRUE,
useCovariateConditionEraEver = TRUE,
useCovariateConditionEraOverlap = TRUE,
useCovariateConditionGroup = TRUE,
useCovariateDrugExposure = TRUE,
useCovariateDrugExposure365d = TRUE,
useCovariateDrugExposure30d = TRUE,
useCovariateDrugEra = TRUE,
useCovariateDrugEra365d = TRUE,
useCovariateDrugEra30d = TRUE,
useCovariateDrugEraEver = TRUE,
useCovariateDrugEraOverlap = TRUE,
useCovariateDrugGroup = TRUE,
useCovariateProcedureOccurrence = TRUE,
useCovariateProcedureOccurrence365d = TRUE,
useCovariateProcedureOccurrence30d = TRUE,
useCovariateProcedureGroup = TRUE,
useCovariateObservation = TRUE,
useCovariateObservation365d = TRUE,
useCovariateObservation30d = TRUE,
useCovariateObservationCount365d = TRUE,
useCovariateMeasurement365d = TRUE,
useCovariateMeasurement30d = TRUE,
useCovariateMeasurementCount365d = TRUE,
useCovariateMeasurementBelow = TRUE,
useCovariateMeasurementAbove = TRUE,
useCovariateConceptCounts = TRUE,
useCovariateRiskScores = TRUE,
useCovariateRiskScoresCharlson = TRUE,
useCovariateRiskScoresDCSI = TRUE,
useCovariateRiskScoresCHADS2 = TRUE,
useCovariateInteractionYear = FALSE,
useCovariateInteractionMonth = FALSE,
deleteCovariatesSmallCount = 100)

cohortMethodData <- getDbCohortMethodData(connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
oracleTempSchema = resultsDatabaseSchema,
targetId = 1082,
comparatorId = 1081,
indicationConceptIds = c(),
washoutWindow = 183,
indicationLookbackWindow = 183,
studyStartDate = "",
studyEndDate = "",
outcomeIds = 1080,
outcomeConditionTypeConceptIds = c(),
exposureDatabaseSchema = resultsDatabaseSchema,
exposureTable = "CATH_STUDY",
outcomeDatabaseSchema = resultsDatabaseSchema,
outcomeTable = "CATH_STUDY",
excludeDrugsFromCovariates = FALSE,
covariateSettings = covarSettings,
cdmVersion = cdmVersion)

Add parameter checking to getDbCohortData

I was accidentally passing an empty character vector as indication_concept_ids to getDbCohortData (https://github.com/OHDSI/CohortMethod/blob/master/R/DataLoadingSaving.R#L114) and the only reason I noticed was because the auc came out as larger than 1. In fact, the default parameter of c() does exactly the same thing. Wouldn't it be better if we had a parameter check to make sure that indication_concept_ids is a non-empty vector of integers? Maybe the default parameter should be something legal or should be removed entirely?

I also suspect that if you pass an indication_concept_ids vector with invalid ids (i.e. a number which does not show up in the database), this might have the same effect (assuming there is a SQL call somewhere with "... WHERE indication_concept_id IN (####)"). If #### is an id that doesn't show up in the database, then that might have the same effect as having no number there at all (which results in a messed up auc). Maybe there needs to be a check deeper somewhere? Or maybe that's too complex and in that situation we can't save the user from himself.
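
A check along these lines could look like the sketch below. checkConceptIds is a hypothetical helper, not part of CohortMethod, and it only validates the argument itself; detecting IDs that are absent from the database would need a separate query.

```r
# Hypothetical validator for a vector of concept IDs
checkConceptIds <- function(conceptIds, name = "indicationConceptIds") {
  # Catches both c() (NULL) and empty vectors such as character(0)
  if (length(conceptIds) == 0) {
    stop(name, " must contain at least one concept ID")
  }
  if (!is.numeric(conceptIds)) {
    stop(name, " must be numeric")
  }
  if (anyNA(conceptIds) || any(conceptIds != round(conceptIds))) {
    stop(name, " must contain whole numbers without missing values")
  }
  invisible(TRUE)
}

# Passes silently for a valid vector
checkConceptIds(c(439926))

# Each of these would stop with an informative error:
# checkConceptIds(c())
# checkConceptIds(character(0))
# checkConceptIds(c(1.5, 2))
```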

Add covariates for genders other than male and female

Drop uninformative strata earlier to boost performance

Oftentimes there are many strata in the outcome models that are uninformative (no outcomes in comparator and target group). Dropping these earlier could save compute time.
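
As an illustration of the idea, the base R sketch below removes strata with no outcomes before any model fitting. The data frame and its column names (stratumId, treatment, outcomeCount) are assumptions made for this example.

```r
# Toy stratified population (illustrative values only)
population <- data.frame(
  stratumId    = c(1, 1, 2, 2, 3, 3, 3),
  treatment    = c(1, 0, 1, 0, 1, 0, 0),
  outcomeCount = c(0, 1, 0, 0, 0, 0, 2)
)

# Total number of outcomes per stratum, across both groups
outcomesPerStratum <- tapply(population$outcomeCount, population$stratumId, sum)

# Keep only strata in which at least one outcome was observed
informativeIds <- names(outcomesPerStratum)[outcomesPerStratum > 0]
informative <- population[as.character(population$stratumId) %in% informativeIds, ]
```

In this toy example stratum 2 has no outcomes in either group and is dropped, leaving five of the seven rows.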

Run CohortMethod on an exemplar study, and document in a vignette

To see if the package has the functionality we need to do studies, we need to replicate an existing study. We'll probably do the classic coxibs-vs-non-selective-NSAIDs-for-UGIB study. Once we've done that, we can capture the process of doing the study in a vignette using knitr.

Allow covariates to be created for limited subset of CONCEPTs within a table

Right now, creating CONDITION covariates creates all available CONDITIONs, but there may be instances where a user wants to only include a small subset of concepts. An additional parameter would be required to allow this filtering.

Same would apply to DRUGs, PROCEDUREs, OBSERVATIONs.

TestUnits fail on Travis-CI

Some tests in test-parameterSweep.R fail on Travis-CI (and have been commented out), but succeed on my Mac OS X install and (I assume) under @schuemie via Windows.

Create a simple simulation framework for generating example data

For the vignette in issue #8 we need simulated data that looks just like the real thing. It should be relatively straightforward to generate simulated data as a cohortData object based on the observed statistics in the exemplar study.
Basically, we'll follow these steps:

Generate covariate data, sampling from observed prevalences.
Generate treatment status using the generated covariates and observed betas in the PS model.
Generate outcomes using the generated covariates, treatment status, and observed betas in the outcome model.
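
The three steps above might be sketched in base R as follows. All prevalences, betas, and intercepts are made-up placeholders standing in for statistics observed in the exemplar study, and the Poisson outcome model is an assumption chosen for simplicity.

```r
set.seed(123)
n <- 5000

# Placeholder inputs (would come from the exemplar study)
prevalences      <- c(0.10, 0.30, 0.05)   # covariate prevalences
psBetas          <- c(0.8, -0.5, 1.2)     # propensity model coefficients
psIntercept      <- -1.0
outcomeBetas     <- c(0.4, 0.2, 0.9)      # outcome model coefficients
outcomeIntercept <- -4.0
treatmentLogRr   <- log(1.5)              # assumed true treatment effect

# Step 1: generate binary covariates from the prevalences
covariates <- sapply(prevalences, function(p) rbinom(n, 1, p))

# Step 2: generate treatment status from the propensity model
ps <- plogis(psIntercept + drop(covariates %*% psBetas))
treatment <- rbinom(n, 1, ps)

# Step 3: generate outcome counts from covariates and treatment
rate <- exp(outcomeIntercept + drop(covariates %*% outcomeBetas) +
              treatmentLogRr * treatment)
outcomeCount <- rpois(n, rate)

simulated <- data.frame(rowId = seq_len(n),
                        treatment = treatment,
                        outcomeCount = outcomeCount)
```

Because the true effect (treatmentLogRr) is known, data generated this way can also be used to check whether an analysis recovers it.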

New parameter to restrict cohorts to common time period

When comparing two treatments, it is possible to select treatments that were not both available during the same time period; during the non-overlapping time they do not represent valid counterfactual comparisons. For example, when comparing two drugs, one approved in 2013 and another approved in 2014, the only valid time on which to base a comparison would be 2014 onward, because the second treatment wasn't available in 2013 (and any propensity score model would flag this if INDEX_YEAR were included in the model). A proposed solution: provide an analysis parameter to 'limit cohorts to period of overlapping calendar time', which would restrict the data to start at the maximum of the minimum cohort start dates of the two cohorts and to run through the minimum of the maximum time-at-risk end dates.
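
The proposed restriction could be sketched in base R like this. The column names and the 90-day time-at-risk are assumptions for the example, and filtering on cohort start date (rather than truncating time-at-risk) is one possible reading of the proposal.

```r
# Toy cohorts: treatment 1 = target, 0 = comparator (illustrative dates)
cohorts <- data.frame(
  treatment = c(1, 1, 1, 0, 0, 0),
  cohortStartDate = as.Date(c("2013-03-01", "2014-06-15", "2015-01-10",
                              "2014-02-01", "2014-09-30", "2016-05-05"))
)
cohorts$riskEndDate <- cohorts$cohortStartDate + 90  # assumed time-at-risk

isTarget <- cohorts$treatment == 1

# Maximum of the two groups' minimum cohort start dates
periodStart <- max(min(cohorts$cohortStartDate[isTarget]),
                   min(cohorts$cohortStartDate[!isTarget]))

# Minimum of the two groups' maximum time-at-risk end dates
periodEnd <- min(max(cohorts$riskEndDate[isTarget]),
                 max(cohorts$riskEndDate[!isTarget]))

# Keep only subjects who enter the cohort within the common period
restricted <- cohorts[cohorts$cohortStartDate >= periodStart &
                        cohorts$cohortStartDate <= periodEnd, ]
```

Here the common period runs from 2014-02-01 through 2015-04-10, so the 2013 target entry and the 2016 comparator entry are excluded.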

Allow dbGetCohortData to use cohorts in a different schema

Currently dbGetCohortData assumes the cohort table is in the CDM schema, but most people do not have write access there. We need to add parameters to allow people to use cohort tables in other schemas.

Error when indicationConceptIds is null

DataLoadingSaving.R runs a query to summarize #indicated_cohorts, but that table only exists if indicationConceptIds is not null.

I think line 182 has to be changed to: if (indicationConceptIds[1] != ""){

Better handling when PS betas are all zero (except the intercept)

Some functions such as getPsModel behave badly if all betas (except the intercept) are zero. Need to generate a meaningful response.

Error running getDbCohortData: cohort_definition_id

I get an error when running getDbCohortData with the default parameters. The issue seems to stem from this commit: dfe0a4d

Here's the code that reproduces the error. The code runs without problems for me on this commit: f8e5a84 (assuming you uncomment the code in the test case corresponding to the change in function name).

library(SqlRender)
library(CohortMethod)

setwd("/tmp")

connectionDetails <- createConnectionDetails(
    dbms = "redshift",
    server = "omop-datasets.cqlmv7nlakap.us-east-1.redshift.amazonaws.com/truven",
    user = Sys.getenv("USER"),
    password = Sys.getenv("MYPGPASSWORD"),
    schema = "mslr_cdm4",
    port = "5439")

# Works on commit f8e5a848b9f55f61785fac1aa1d9e50d97f2628d
#cohortdata <- getDbCohortDataObject(
#    connectionDetails,
#    cdmSchema = connectionDetails$schema,
#    resultsSchema = connectionDetails$schema)

# Does not work on master branch.
cohortdata <- getDbCohortData(
    connectionDetails,
    cdmSchema = connectionDetails$schema,
    resultsSchema = connectionDetails$schema)

Here is the error message:

DBMS:
redshift

Error:
execute JDBC update query failed in dbSendUpdate (ERROR: column c1.cohort_definition_id does not exist)

SQL:
INSERT INTO raw_cohort (cohort_id, person_id, cohort_start_date, cohort_end_date, observation_period_end_date)
SELECT DISTINCT raw_cohorts.cohort_id,
  raw_cohorts.person_id,
  raw_cohorts.cohort_start_date,
  raw_cohorts.cohort_end_date
  AS cohort_end_date,
  op1.observation_period_end_date
  AS observation_period_end_date
FROM (



        SELECT CASE
                WHEN c1.cohort_definition_id = 755695
                    THEN 1
                WHEN c1.cohort_definition_id = 739138
                    THEN 0
                ELSE - 1
                END AS cohort_id,
            c1.subject_id as person_id,
            min(c1.cohort_start_date) AS cohort_start_date,
            min(c1.cohort_end_date) AS cohort_end_date
        FROM mslr_cdm4.drug_era c1
        WHERE c1.cohort_definition_id in (755695,739138)
        GROUP BY c1.cohort_definition_id,
            c1.subject_id

    ) raw_cohorts
INNER JOIN mslr_cdm4.observation_period op1
    ON raw_cohorts.person_id = op1.person_id 

INNER JOIN (
    SELECT person_id,
        condition_start_date AS indication_date
    FROM mslr_cdm4.condition_occurrence
    WHERE condition_concept_id IN (
            SELECT descendant_concept_id
            FROM mslr_cdm4.concept_ancestor
            WHERE ancestor_concept_id IN (439926)
            )
    ) indication
    ON raw_cohorts.person_id = indication.person_id
  AND raw_cohorts.cohort_start_date <= ( indication.indication_date +  183)
  AND raw_cohorts.cohort_start_date >= indication.indication_date

WHERE raw_cohorts.cohort_start_date >= ( op1.observation_period_start_date +  183)
    AND raw_cohorts.cohort_start_date <= op1.observation_period_end_date

Change computeCovariateBalance function to deal with strata

Currently computeCovariateBalance() only computes the overall covariate balance, which is fine when having performed 1-on-1 matching, but pointless when having performed variable ratio matching or stratification.

Need to change the function to compute means per stratum, and aggregate.
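
One possible way to compute means per stratum and aggregate them is sketched below in base R. Weighting strata by their share of all subjects and standardizing by the pooled SD of the unstratified groups are assumptions made for this example, not necessarily what the package should do.

```r
# Toy data: one binary covariate, two strata (illustrative values only)
population <- data.frame(
  stratumId = c(1, 1, 1, 2, 2, 2, 2),
  treatment = c(1, 0, 0, 1, 1, 0, 0),
  covariate = c(1, 0, 1, 0, 1, 0, 0)
)

strata <- split(population, population$stratumId)

# Weight each stratum by its share of all subjects (assumed weighting scheme)
stratumWeights <- vapply(strata, nrow, integer(1)) / nrow(population)

# Covariate mean per stratum, separately for target and comparator
meanTarget <- vapply(strata, function(s) mean(s$covariate[s$treatment == 1]),
                     numeric(1))
meanComparator <- vapply(strata, function(s) mean(s$covariate[s$treatment == 0]),
                         numeric(1))

# Aggregate the per-stratum means into one weighted mean per group
aggTarget <- sum(stratumWeights * meanTarget)
aggComparator <- sum(stratumWeights * meanComparator)

# Standardized difference using the pooled SD of the unstratified groups
pooledSd <- sqrt((var(population$covariate[population$treatment == 1]) +
                    var(population$covariate[population$treatment == 0])) / 2)
stdDiff <- (aggTarget - aggComparator) / pooledSd
```

A stratum that lacks either group would yield NaN for that group's mean, so a real implementation would need to decide how to handle such strata.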

ERROR: relation "cov_m_below" does not exist

I'm trying to run the R code generated from Atlas for some simple cohorts. Everything runs fine up until I run this code:

> cohortMethodData <- getDbCohortMethodData(connectionDetails = connectionDetails,
+                                               cdmDatabaseSchema = cdmDatabaseSchema,
+                                               oracleTempSchema = resultsDatabaseSchema,
+                                               targetId = 12,
+                                               comparatorId = 11,
+                                               outcomeIds = 56,
+                                               studyStartDate = "",
+                                               studyEndDate = "",
+                                               exposureDatabaseSchema = resultsDatabaseSchema,
+                                               exposureTable = exposureTable,
+                                               outcomeDatabaseSchema = resultsDatabaseSchema,
+                                               outcomeTable = outcomeTable,
+                                               cdmVersion = cdmVersion,
+                                               excludeDrugsFromCovariates = FALSE,
+                                               firstExposureOnly = FALSE,
+                                               removeDuplicateSubjects = TRUE,
+                                               washoutPeriod = 365,
+                                               covariateSettings = covariateSettings)
Connecting using PostgreSQL driver

Constructing treatment and comparator cohorts
  |=========================================================================================================================================================================================| 100%
Analysis took 0.809 secs
Fetching cohorts from server
Fetching cohorts took 1.14 secs
Constructing default covariates
  |============================================================================================                                                                                             |  49%Error executing SQL: Error in .local(conn, statement, ...): execute JDBC update query failed in dbSendUpdate (ERROR: relation "cov_m_below" does not exist
  Position: 829)

An error report has been created at  /tmp/errorReport.txt
Error in value[[3L]](cond) : no loop for break/next, jumping to top level

What is cov_m_below?

createPs returns NaN

If I run the script below, the auc that is returned is NaN. It's not the same problem as in issue #24 because in this case getdrugfromindication() is actually returning some ids.

library(SqlRender)
library(Cyclops)
library(CohortMethod)


# Login info.
connectionDetails <- createConnectionDetails(
    dbms = "redshift",
    user = Sys.getenv("USER"),
    password = Sys.getenv("MYPGPASSWORD"), 
    server = "omop-datasets.cqlmv7nlakap.us-east-1.redshift.amazonaws.com/truven",
    schema = "mslr_cdm4",
    port = "5439")


# The function that does the analysis.
test <- function() {
    drug_concept_id <- 1342001
    drug_concept_name <- "Enalaprilat"
    comparator_drug_concept_id <- 974166
    comparator_drug_concept_name <- "Hydrochlorothiazide"
    indication_concept_id <- 21001432
    indication_concept_name <- "Hypertension"

    lowBackPain = 194133

    # Get SNOMED-CT drug_concept_id from indication.
    drug_indication_concept_ids <- getdrugfromindication(
        connectionDetails,
        indication_concept_id)

    num_ids <- length(unique(drug_indication_concept_ids))
    print(num_ids)

    # Cohort Method.
    cohortdata <- getDbCohortData(
        connectionDetails,
        cdmSchema = connectionDetails$schema,
        resultsSchema = connectionDetails$schema,
        targetDrugConceptId = drug_concept_id,
        comparatorDrugConceptId = comparator_drug_concept_id,
        indicationConceptIds = drug_indication_concept_ids)

    num_persons <- length(unique(cohortdata$cohorts$personId))
    print(num_persons)
    num_covariates <- length(unique(cohortdata$covariates$covariateId))
    print(num_covariates)

    ps <- createPs(
        cohortdata,
        lowBackPain)

    auc <- computePsAuc(ps)
    print(auc)

    return(auc)
}


getdrugfromindication <- function(connectionDetails, indication_concept_id) {
    sql <- "
    SELECT DISTINCT
        c2.concept_id
    FROM (
        SELECT
            *
        FROM vocabulary.concept
        WHERE
            concept_id = @indication_concept_id
        ) t1 INNER JOIN vocabulary.concept_relationship cr1
            ON t1.concept_id = cr1.concept_id_1
        INNER JOIN vocabulary.concept c1
            ON cr1.concept_id_2 = c1.concept_id
            AND c1.vocabulary_id = 1
        INNER JOIN vocabulary.concept_ancestor ca1
            ON c1.concept_id = ca1.ancestor_concept_id
        INNER JOIN vocabulary.concept c2
            ON ca1.descendant_concept_id = c2.concept_id
            AND c2.vocabulary_id = 1
    ;
    "

    sql <- renderSql(
        sql = sql,
        indication_concept_id = indication_concept_id)$sql

    conn <- connect(connectionDetails)
    data <- dbGetQuery(conn, sql)
    dbDisconnect(conn)

    data$concept_id
}


auc <- test()

Add includeCovariateIds parameter to createPS function

Sometimes you want to include only a subset of the covariates

Currently, createPS only allows removal of covariates using the 'excludeCovariateIds' parameter, but I think it would also be helpful to have an 'includeCovariateIds' parameter. Default behavior, if null, would be to include all covariates. However, if non-null, the list of covariates used should be restricted to those in the list.
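
The requested behavior can be illustrated on a plain data frame. filterCovariates is a hypothetical helper written for this example; it does not reflect how createPS stores or filters covariates internally.

```r
# Hypothetical helper combining include and exclude lists
filterCovariates <- function(covariates,
                             includeCovariateIds = NULL,
                             excludeCovariateIds = NULL) {
  keep <- rep(TRUE, nrow(covariates))
  # NULL means "include everything"; otherwise restrict to the listed IDs
  if (!is.null(includeCovariateIds)) {
    keep <- keep & covariates$covariateId %in% includeCovariateIds
  }
  # Exclusions are applied on top of any inclusion list
  if (!is.null(excludeCovariateIds)) {
    keep <- keep & !(covariates$covariateId %in% excludeCovariateIds)
  }
  covariates[keep, , drop = FALSE]
}

# Toy sparse covariate table (illustrative IDs)
covariates <- data.frame(rowId = c(1, 1, 2, 3),
                         covariateId = c(10, 20, 10, 30),
                         covariateValue = 1)

allCovariates <- filterCovariates(covariates)
onlyTen <- filterCovariates(covariates, includeCovariateIds = 10)
tenOrTwentyMinusTwenty <- filterCovariates(covariates,
                                           includeCovariateIds = c(10, 20),
                                           excludeCovariateIds = 20)
```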

Change schema calls to allow different SQL Server schemas inside of database

SQL Server has both a database and a schema. Currently our SQL assumes the schema is 'dbo'. We can remove this assumption, thereby requiring SQL Server users to put 'DBName.SchemaName' in the string for the cdmSchema and resultsSchema parameters. Then, when this code is rendered for SQL Server, it will flexibly work for all database/schema combinations, and when translated to Oracle/Postgres, the schema (if .dbo) could be removed and not included.

Will make this change after we have a stable version working, because it'll change all our current calls to getDBCohortData used in our development.

Show strata in PS plot when using stratification

When using stratification (instead of matching), plotPs() should show the boundary lines between strata.

Move Cyclops interface functions to Cyclops

Once we're (mildly) happy with the Cyclops interface functions, they need to move to the Cyclops package.

Add era construction function

Often we'd like to define our treatment and comparator cohorts not as a single drug, but a combination (e.g. drug classes comprising multiple drugs). In that case it is important to construct correct eras (periods of non-overlapping continuous use of the drug) for those combinations of exposures. SQL for performing this task is floating around OHDSI, but should ideally be captured in a function. This could live in the CohortMethod package for now, although it is a function that is generic to all methods.

Bring documentation up to date

All functions have some documentation, but it's far from complete. Also, the DESCRIPTION file needs some prose.

Add parameter to allow for sampling of T and C in the data fetch step

Use case: sometimes, during initial feasibility, it may be useful to sample from T and C to fit a propensity score model and execute diagnostics to assess the adequacy of a study, prior to implementing a full study for the outcome of interest. Sampling T/C can reduce the data size and the wait time associated with computing feature extraction and data download.

I think the parameter should be added to the function getDbCohortMethodData.

The PatientLevelPrediction package has an analogous parameter in getPLPData called 'sampleSize', seen here.
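
To illustrate the idea outside the database, the base R sketch below caps each of T and C at a maximum size. sampleCohorts is a hypothetical helper, and applying the cap per group (rather than to the combined cohort) is an assumption about how such a parameter might behave.

```r
# Hypothetical helper: keep at most sampleSize subjects per treatment group
sampleCohorts <- function(cohorts, sampleSize) {
  stopifnot(sampleSize >= 1)
  rowsByGroup <- split(seq_len(nrow(cohorts)), cohorts$treatment)
  sampledRows <- unlist(lapply(rowsByGroup, function(rows) {
    # Groups already at or below the cap are kept in full
    if (length(rows) <= sampleSize) rows else sample(rows, sampleSize)
  }), use.names = FALSE)
  cohorts[sort(sampledRows), , drop = FALSE]
}

# Toy cohorts: 100 target subjects and 30 comparator subjects
set.seed(7)
cohorts <- data.frame(rowId = 1:130,
                      treatment = c(rep(1, 100), rep(0, 30)))
sampled <- sampleCohorts(cohorts, sampleSize = 50)
```

In the real function the sampling would more usefully happen in SQL, before covariates are constructed, since that is where most of the time is spent.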

Make sure we correctly handle all sources for exposures and outcomes

getDbCohortData needs to correctly handle fetching exposures from:

drug_exposure
drug_era
cohort (either within the CDM schema or a separate schema)

and fetching outcomes from:

condition_occurrence
condition_era
cohort (either within the CDM schema or a separate schema)

In all scenarios, the function should select the appropriate variable names, and use type_concept_id fields when available.

I think currently this is implemented consistently for all scenarios

Cyclops' cross-validation does not pick very optimal hyperparameters for CohortMethod

When using cross-validation to pick the hyperparameter, the propensity scores have very low AUCs, and do not lead to good covariate balance. Simply picking hyperparameter=0.1 gives much better results. We need to figure out if this is due to overfitting (i.e. the cross-validation is correct), a mismatch between optimization functions, or something else.

Add option to turn off restriction to first outcomes only

Currently CohortMethod only considers the first occurrence of an outcome, and censors people after their first outcome. It would be nice to be able to turn that feature off.