atorus-research / cdisc_pilot_replication

A modern replication of the CDISC pilot table outputs, using the PHUSE Test Data Factory ADaM data, using the R programming language

Home Page: https://www.atorusresearch.com

License: MIT License


cdisc_pilot_replication's Introduction

CDISC Pilot Replication in R

Updates

We've updated this repository to be compatible with Huxtable v5.0.0! Huxtable v5.0.0 introduced some changes that break backwards compatibility. All updates within this repository are compatible back to Huxtable v4.7.1. See a full list of changes to Huxtable here.

The changes most of interest to a user of this repository are:

The way indexing is handled has changed. You can no longer index columns like ht[1:5]. The new syntax is ht[1:5, ].
Additionally, the default of the add_colnames argument changed from FALSE to TRUE. We updated config.R to reset the default option so that the code stays consistent. You can do this like so:

options(huxtable.add_colnames = FALSE)

Introduction

Welcome to the Atorus CDISC Pilot replication repository! In 2007, the original pilot project submission package was finalized and released following a review by FDA staff, in which the CDISC data and metadata contained within the package were evaluated for their suitability in meeting the needs and expectations of medical and statistical reviewers. In 2019, the PHUSE Test Data Factory took on the goal of replicating the SDTM and ADaM data within the CDISC pilot package to match more modern data standards, bringing the ADaM data up to version 1.1. Atorus Research has now regenerated the table outputs within the CDISC Pilot Project using the PHUSE Test Data Factory project’s data and the R programming language. Our motivation behind this project was to:

Demonstrate that we were able to obtain matching outputs using R
Provide open source code to the public to demonstrate how we were able to do this
Demonstrate our first publicly released R package, pharmaRTF, in action. You can find our package pharmaRTF right here

Setup Instructions

To obtain the data for this repository, you can download the data from the PHUSE Github Repository, using this link for ADaM data and this link for the SDTM.

This repository was programmed using R 3.6. For further system information, see our session information.

Notes on Data

Every effort was made to use best programming practices to recreate the values on the CDISC Pilot displays. However, some values on our outputs do not align with the values on the CDISC Pilot displays, for the following reasons:

General:

The ADaMs we used to regenerate the CDISC Pilot displays were the PHUSE CDISC Pilot replication ADaMs following ADaM V1.1. Since the CDISC Pilot displays were not regenerated using the PHUSE CDISC Pilot replication data there are likely discrepancies between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication ADaMs.
SAS and R round differently. The SAS ROUND function rounds a value that ends in exactly 5 away from zero, while R's round() follows the IEC 60559 standard and rounds it to the nearest even digit. For example, 2.5 rounds to 3 in SAS but to 2 in R.
In some circumstances, R packages will not produce a p-value if the counts within the data are not high enough to make it statistically meaningful. An example of this is BILIRUBIN on Table 14-6.05. The High at Baseline stratum only has a single count, and the mantelhaen.test function in R requires that each stratum has more than 1 observation.
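
The rounding and stratum-size points above can be illustrated with base R alone. This is a minimal sketch, not code from the repository: round_half_up() is a hypothetical helper that only approximates round-half-away-from-zero, and the 2 x 2 x 2 table is made-up illustrative data rather than the BILIRUBIN counts.

```r
# round() follows IEC 60559: exact halves go to the nearest even digit.
r_default <- round(c(0.5, 1.5, 2.5, 3.5))

# Hypothetical helper (not part of the repository): round halves away from
# zero. The small fuzz guards against values stored just below a half.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  fuzz <- sqrt(.Machine$double.eps)
  sign(x) * floor(abs(x) * scale + 0.5 + fuzz) / scale
}
r_half_up <- round_half_up(c(0.5, 1.5, 2.5, 3.5))

# Made-up 2 x 2 x 2 table: the second stratum holds a single observation,
# so stats::mantelhaen.test() stops instead of returning a p-value.
tab <- array(c(10, 5, 4, 12,
               1, 0, 0, 0),
             dim = c(2, 2, 2))
cmh_result <- tryCatch(
  mantelhaen.test(tab),
  error = function(e) conditionMessage(e)
)
```

Here r_default and r_half_up differ exactly at the values that end in 5, and cmh_result holds the error message instead of a test object. Whether a helper like round_half_up() reproduces a given SAS value still depends on how that value is stored in floating point.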

Output Specific Details:

Table 14-3.07 ADAS Cog (11) - Change from Baseline to Week 24 - Completers at Wk 24-Observed Cases-Windowed
- Difference in baseline values. Despite following the analysis results metadata (ARM) in the original CDISC Pilot Define.xml, the baseline counts are off by 1. Following the information available within the original ARM and SAP there is a discrepancy between the available data and the display using both the PHUSE CDISC Pilot replication data and the original CDISC Pilot analysis data.
Table 14-3.11 ADAS Cog (11) - Repeated Measures Analysis of Change from Baseline to Week 24
- Differences in p-values. See ARM in the original CDISC Pilot Define.xml (ARM-Leaf0046) for details on implementation in SAS. These numbers end up being slightly off. To anyone who finds this and can match the numbers, please feel free to submit a PR and correct our implementation! To the best of our knowledge, we've matched what we could. It's not explicit, but the default covariance structure in the lme4 package is unstructured.
Table 14-3.12 Mean NPI-X Total Score from Week 4 through Week 24 – Windowed
- Difference in values of Mean of Weeks 4-24. This was programmed using the derived NPTOTMN variable. The ARM in the original CDISC Pilot Define.xml was followed as closely as possible to determine the subset. The resulting counts differ from those based on the original CDISC Pilot analysis data. That data is no longer available, so we were not able to investigate the discrepancy with the current PHUSE CDISC Pilot replication data. The subsequent statistical summaries therefore also have differences.
Table 14-6.05 Shifts of Laboratory Values During Treatment, Categorized Based on Threshold Ranges
- Difference in the values for BILIRUBIN values for the Xan. Low group and MONOCYTES values for the Placebo group. The Analysis Reference Range Indicator and Shift variables are used as is from the ADaM which indicates there are likely discrepancies for reference ranges and shifts between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication data. The subsequent statistical summaries therefore also have differences.
Table 14-6.06 Shifts of Hy's Law Values During Treatment
- Difference in the values for Transaminase 1.5 x ULN for the Xan. Low group and Total Bili 1.5 x ULN and Transaminase 1.5 x ULN for all groups. The Analysis Reference Range Indicator and Shift variables are used as is from the ADaM which indicates there are likely discrepancies for reference ranges and shifts between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication data. The subsequent statistical summaries therefore also have differences.
Table 14-7.01 Summary of Vital Signs at Baseline and End of Treatment
- Difference in values for End of Treatment for all groups. The End of Treatment flag is used as is from the ADaM which indicates there are likely discrepancies for end of treatment between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication data.
Table 14-7.02 Summary of Vital Signs Change from Baseline at End of Treatment
- Difference in values throughout table. The End of Treatment flag is used as is from the ADaM which indicates there are likely discrepancies for end of treatment between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication data.
Table 14-7.03 Summary of Weight Change from Baseline at End of Treatment
- Difference in values for End of Treatment for all groups. The End of Treatment flag is used as is from the ADaM which indicates there are likely discrepancies for end of treatment between the original CDISC Pilot analysis data and the PHUSE CDISC Pilot replication data.

Notes on R Packages

As many programmers in the R community do, we relied on the tidyverse for much of our data processing. There are a few additional libraries that we used that are worth mentioning:

For the CMH test of the alternative hypothesis that row means differ, the package vcdExtra was used, which is not included in the base distribution of R. We additionally had to make a slight modification to this library, which is available in this fork of the package. The update is needed because the solve function in R throws an error when processing large, sparse tables. Replacing solve with MASS::ginv bypasses the error. We as an organization plan to perform further testing of this update and submit it back to the original author.
For mixed models, the lme4 package was used.
For ANCOVA models, car was used, and the emmeans package was used to compute LSMEANS.
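
The solve versus MASS::ginv substitution mentioned above can be sketched on a tiny singular matrix. This is only an illustration of the general technique; the matrix m is made up, and the actual change in the fork of vcdExtra is not reproduced here.

```r
# Made-up example: a rank-1 matrix, which is exactly singular.
m <- matrix(c(1, 2, 2, 4), nrow = 2)

# solve() refuses to invert a singular matrix and raises an error.
solve_msg <- tryCatch(
  {
    solve(m)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

# MASS::ginv() returns the Moore-Penrose generalized inverse instead,
# which exists for any matrix.
m_pinv <- MASS::ginv(m)

# Defining property of a generalized inverse: m %*% m_pinv %*% m equals m.
reconstructed <- m %*% m_pinv %*% m
```

For an invertible matrix the generalized inverse agrees with the ordinary inverse up to numerical error, so the swap only changes behavior in the singular case. Whether the resulting statistic remains appropriate there is a separate question, which is presumably why further testing is planned.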


cdisc_pilot_replication's Issues

t-14-3-13.R: tidyr::replace_na breaks with column name `0`

At line 87 of t-14-3-13.R, the data manipulation of

replace_na(list(`0`=' 0       ', `54` = ' 0       ', `81`=' 0       '))

would break.

One example fix is given below.

     mutate(
         `0` = replace_na(`0`, ' 0       '),
         `54` = replace_na(`54`, ' 0       '),
         `81` = replace_na(`81`, ' 0       ')
     )

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pharmaRTF_0.1.0  assertthat_0.2.1 haven_2.2.0      forcats_0.5.0    stringr_1.4.0    dplyr_0.8.5     
 [7] purrr_0.3.4      readr_1.3.1      tidyr_1.0.2      tibble_3.0.1     ggplot2_3.3.0    tidyverse_1.3.0 
[13] glue_1.4.0      

loaded via a namespace (and not attached):
[1] nlme_3.1-142      fs_1.4.1          lubridate_1.7.8   httr_1.4.1        tools_3.6.2       backports_1.1.6  
 [7] R6_2.4.1          DBI_1.1.0         gnm_1.1-1         colorspace_1.4-1  nnet_7.3-12       withr_2.2.0      
[13] tidyselect_1.0.0  emmeans_1.4.6     curl_4.3          compiler_3.6.2    cli_2.0.2         rvest_0.3.5      
[19] xml2_1.3.2        scales_1.1.0      lmtest_0.9-37     mvtnorm_1.1-0     digest_0.6.25     foreign_0.8-72   
[25] relimp_1.0-5      minqa_1.2.4       rmarkdown_2.1     ca_0.71.1         rio_0.5.16        pkgconfig_2.0.3  
[31] htmltools_0.4.0   lme4_1.1-23       dbplyr_1.4.3      rlang_0.4.6       readxl_1.3.1      rstudioapi_0.11  
[37] generics_0.0.2    zoo_1.8-8         jsonlite_1.6.1    zip_2.0.4         car_3.0-7         magrittr_1.5     
[43] huxtable_4.7.1    qvcalc_1.0.2      Matrix_1.2-18     Rcpp_1.0.4.6      munsell_0.5.0     fansi_0.4.1      
[49] abind_1.4-5       lifecycle_0.2.0   stringi_1.4.6     yaml_2.2.1        carData_3.0-3     MASS_7.3-51.4    
[55] grid_3.6.2        crayon_1.3.4      lattice_0.20-38   splines_3.6.2     hms_0.5.3         knitr_1.28       
[61] pillar_1.4.4      boot_1.3-23       estimability_1.3  vcdExtra_0.7-4    reprex_0.3.0      evaluate_0.14    
[67] data.table_1.12.8 modelr_0.1.7      vcd_1.4-7         vctrs_0.2.4       nloptr_1.2.2.1    cellranger_1.1.0 
[73] gtable_0.3.0      xfun_0.13         openxlsx_4.1.5    xtable_1.8-4      broom_0.5.6       coda_0.19-3      
[79] statmod_1.4.34    ellipsis_0.3.0

t-14-5-01.R and t-14-5-02.R: different behaviors when run in console vs. as a job

These R scripts run fine in a clean console session. However, when they are submitted as Local Jobs, errors occur because the ae_counts() function fails to resolve one of its default parameters, N_counts, to the header_n object.
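
One plausible mechanism for this kind of failure, which the issue does not confirm, is that a default argument is looked up through the environment where the function was defined, not where the calling script created its objects. The sketch below uses only base R. ae_counts_demo() is a hypothetical stand-in for ae_counts(), the value 10L is arbitrary, and the sketch assumes no object named header_n exists in the global environment.

```r
# Hypothetical stand-in for ae_counts(): its default argument refers to a
# free variable, header_n, that must be found via the function's enclosure.
ae_counts_demo <- function(N_counts = header_n) N_counts

# Environment where the helper "lives" (think: a sourced functions file).
funcs_env <- new.env(parent = baseenv())
environment(ae_counts_demo) <- funcs_env

# Case 1: header_n sits where the function can see it, so the default works.
assign("header_n", 10L, envir = funcs_env)
resolved <- ae_counts_demo()
rm("header_n", envir = funcs_env)

# Case 2: header_n exists only in a separate script environment, which is
# not on the function's lookup path, so the default cannot be resolved.
script_env <- new.env(parent = baseenv())
assign("header_n", 10L, envir = script_env)
default_fails <- inherits(try(ae_counts_demo(), silent = TRUE), "try-error")

# Passing the value explicitly does not depend on where header_n lives.
explicit <- ae_counts_demo(N_counts = get("header_n", envir = script_env))
```

If this is indeed what happens in the job session, passing N_counts explicitly at each call site would make the scripts behave the same way in both settings.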

MMRM analysis

I have been looking a bit into the MMRM analysis in the report for "CDISCPILOT01 – Initial Case Study of the CDISC SDTM/ADaM Pilot Project". More specifically, I looked into the generation of Table 14-3.11 and the SAS output supporting it. The title of the table is "... change from baseline to week 24". It turns out that the numbers presented are not for the change to week 24; they are the average change across weeks 8, 16, and 24. The numbers presented are created using this piece of SAS code (after having applied the following filter: EFFFL == "Y" & PARAMCD == 'ACTOT' & ANL01FL == 'Y' & DTYPE != 'LOCF' & AVISITN > 0):

proc mixed noclprint data=temp01 ic ;
class trtpcd_f aweekc sitegr1 usubjid;
model chg=trtpcd_f aweekc sitegr1 trtpcd_f*aweekc base base*aweekc/s ddfm=kr;
repeated aweekc/subject=usubjid type=un;
lsmeans trtpcd_f/ diff cl;
ods output diffs=temp02;
ods output LSMeans=temp03;
run;

What should be done instead is this (note the difference in the lsmeans statement):

proc mixed noclprint data=temp01 ic ;
class trtpcd_f aweekc sitegr1 usubjid;
model chg=trtpcd_f aweekc sitegr1 trtpcd_f*aweekc base base*aweekc/s ddfm=kr;
repeated aweekc/subject=usubjid type=un;
lsmeans trtpcd_f*aweekc/ diff cl;
ods output diffs=temp02;
ods output LSMeans=temp03;
run;

In R I did the following:

First approach

test1<-lmer(CHG ~ TRTPCD_F + SITEGR1 + AWEEKC + TRTPCD_F:AWEEKC + BASE + BASE:AWEEKC + (AVISITN | USUBJID),data=adas)

Second approach using gls

test1<-gls(CHG ~ TRTPCD_F + SITEGR1 + AWEEKC + TRTPCD_F:AWEEKC + BASE + BASE:AWEEKC,
corr=corSymm(form= ~1 | USUBJID),weights=varIdent(form= ~1 | AVISITN),data=adas)

Third approach: lmer with modification as suggested by Daniel Sabanes Bove

test1<-lmer(CHG ~ TRTPCD_F + SITEGR1 + AWEEKC + TRTPCD_F:AWEEKC + BASE + BASE:AWEEKC + (0 + AWEEKC | USUBJID),data=adas, control=lmerControl(check.nobs.vs.nRE="ignore"))

LSmeans per group

test2<-emmeans::lsmeans(test1, ~TRTPCD_F:AWEEKC, lmer.df='kenward-roger')

LSmeans for differences

test3<-emmeans::contrast(test2, method="pairwise", adjust=NULL)

95% CI

test4 <- confint(test3)

The results are as follows:

In SAS:

PBO: 2.3291 [0.9680; 3.6903]
High dose: 1.5009 [-0.1475; 3.1494]
Low dose: 1.7352 [0.2247; 3.2457]
Differences:
High vs. PBO: 0.8282 [-1.2856; 2.9420] p=0.4403
Low vs. PBO: 0.5939 [-1.4136; 2.6014] p=0.5600

First approach: with lmer:

Pbo Week 24 2.32 [0.962; 3.68]
Xan_Hi Week 24 1.49 [-0.154; 3.14]
Xan_Lo Week 24 1.77 [0.263; 3.28]
Differences:
Xan_Hi Week 24 - Pbo Week 24 0.8270 [-1.283; 2.9367] p=0.4404
Xan_Lo Week 24 - Pbo Week 24 0.5459 [-1.459; 2.5510] p=0.5919

Second approach: with gls (using approximative Satterthwaite):

Pbo Week 24 2.331 [0.972; 3.69]
Xan_Hi Week 24 1.502 [-0.140; 3.14]
Xan_Lo Week 24 1.735 [0.229; 3.24]
Differences:
Xan_Hi Week 24 - Pbo Week 24 0.8294 [-1.278; 2.9371] p=0.4384
Xan_Lo Week 24 - Pbo Week 24 0.5959 [-1.407; 2.5991] p=0.5578

Third approach: lmer with modification as suggested by Daniel Sabanes Bove

Pbo Week 24 2.329 [0.968; 3.69]
Xan_Hi Week 24 1.501 [-0.147; 3.15]
Xan_Lo Week 24 1.735 [0.225; 3.25]
Differences:
Xan_Hi Week 24 - Pbo Week 24 0.8282 [-1.285; 2.9415] p=0.4402
Xan_Lo Week 24 - Pbo Week 24 0.5939 [-1.413; 2.6010] p=0.5599

Error: is_huxtable(x = ht) is not TRUE

I am getting this error when running the following code:

# Write into doc object and pull titles/footnotes from excel file
doc <- rtf_doc(ht) %>% titles_and_footnotes_from_df(
  from.file='./data/titles.xlsx',
  reader=example_custom_reader,
  table_number='14-2.01') %>%
  set_font_size(10) %>%
  set_ignore_cell_padding(TRUE) %>%
  set_column_header_buffer(top=1)

summarize() or dplyr::summarize()?

In the R console, when the runtime environment is not in a clean state, a call to summarize() may fail because both dplyr and plyr export a function with that name.

One easy fix is to prefix the call with dplyr:: or plyr::, whichever is intended.

RStudio Server Pro (v1.3+) offers a convenient new feature, Global Replace, for searching and replacing a string or a regex in all files in a project, inside a folder, and so on.
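
The masking problem and the namespace-prefix fix can be reproduced without dplyr or plyr installed. The sketch below is an analogy, not the repository's code: it masks stats::filter with a made-up function of the same name, which plays the role that plyr::summarize plays when it hides dplyr::summarize.

```r
# A later definition named filter() masks stats::filter(), much like
# loading plyr after dplyr masks dplyr::summarize().
filter <- function(x, ...) stop("the wrong filter() was called")

x <- c(1, 2, 3, 4, 5)

# The unqualified call now reaches the masking function and fails.
masked_fails <- inherits(try(filter(x, rep(1 / 2, 2)), silent = TRUE),
                         "try-error")

# The namespace-qualified call always reaches the intended function:
# here, a two-point moving average over current and previous values.
smoothed <- stats::filter(x, rep(1 / 2, 2), sides = 1)

# Clean up so later unqualified calls behave normally again.
rm(filter)
```

Calling conflicts() in an interactive session lists the names that are currently defined in more than one place on the search path, which can help decide where a prefix is needed.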

Error: `correct_defaults(var)` must be a vector, a bare list, a data frame or a matrix.

Greetings.

An error occurred when creating the doc object:

# Write into doc object and pull titles/footnotes from excel file
doc <- rtf_doc(ht) %>% titles_and_footnotes_from_df(
  from.file='./data/titles.xlsx',
  reader=example_custom_reader,
  table_number='14-1.01') %>%
  set_font_size(10) %>%
  set_ignore_cell_padding(TRUE) %>%
  set_column_header_buffer(top=1)

Error details:

Error: `correct_defaults(var)` must be a vector, a bare list, a data frame or a matrix.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/tibble_error_need_rhs_vector>
`correct_defaults(var)` must be a vector, a bare list, a data frame or a matrix.
Backtrace:
  1. `%>%`(...)
  5. pharmaRTF::titles_and_footnotes_from_df(...)
  6. pharmaRTF:::read_hf(...)
  7. pharmaRTF:::fill_missing_data(.data, columns)
  9. tibble:::`[<-.tbl_df`(`*tmp*`, is.na(.data[var]), var, value = NULL)
 10. tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
 11. tibble:::vectbl_wrap_rhs_row(value, value_arg)

> rlang::last_trace()
<error/tibble_error_need_rhs_vector>
`correct_defaults(var)` must be a vector, a bare list, a data frame or a matrix.
Backtrace:
     x
  1. +-`%>%`(...)
  2. +-pharmaRTF::set_column_header_buffer(., top = 1)
  3. +-pharmaRTF::set_ignore_cell_padding(., TRUE)
  4. +-pharmaRTF::set_font_size(., 10)
  5. \-pharmaRTF::titles_and_footnotes_from_df(...)
  6.   \-pharmaRTF:::read_hf(...)
  7.     \-pharmaRTF:::fill_missing_data(.data, columns)
  8.       +-base::`[<-`(`*tmp*`, is.na(.data[var]), var, value = NULL)
  9.       \-tibble:::`[<-.tbl_df`(`*tmp*`, is.na(.data[var]), var, value = NULL)
 10.         \-tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
 11.           \-tibble:::vectbl_wrap_rhs_row(value, value_arg)

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
system code page: 936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.3.1     pharmaRTF_0.1.3  huxtable_5.4.0   assertthat_0.2.1 haven_2.4.1      forcats_0.5.1    stringr_1.4.0   
 [8] dplyr_1.0.7      purrr_0.3.4      readr_1.4.0      tidyr_1.1.3      tibble_3.1.2     ggplot2_3.3.5    tidyverse_1.3.1 
[15] glue_1.4.2      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       cellranger_1.1.0 pillar_1.6.1     compiler_4.0.3   dbplyr_2.1.1     tools_4.0.3      lubridate_1.7.10
 [8] jsonlite_1.7.2   lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.11     reprex_2.0.0     cli_3.0.0       
[15] rstudioapi_0.13  DBI_1.1.1        xfun_0.24        xml2_1.3.2       withr_2.4.2      httr_1.4.2       knitr_1.33      
[22] fs_1.5.0         generics_0.1.0   vctrs_0.3.8      hms_1.1.0        grid_4.0.3       tidyselect_1.1.1 R6_2.5.0        
[29] fansi_0.5.0      modelr_0.1.8     magrittr_2.0.1   backports_1.2.1  scales_1.1.1     ellipsis_0.3.2   rvest_1.0.0     
[36] colorspace_2.0-2 utf8_1.2.1       stringi_1.6.2    munsell_0.5.0    broom_0.7.8      crayon_1.4.1

Missing data/titles.xlsx

Impressive work on replicating the CDISC Pilot project with R and releasing both the reproducible R code and the pharmaRTF package for table generation as open source. I was able to obtain the data for your work using the setup instructions. However, I was not able to find the titles.xlsx file used to create titles and footnotes. Is that file something you can share?

doc <- rtf_doc(final, header_rows = 2) %>% 
   titles_and_footnotes_from_df(
     from.file='./data/titles.xlsx', 
     reader=example_custom_reader,
     table_number='14-6.01') %>%
  set_font_size(10) %>%
  set_ignore_cell_padding(TRUE) %>%
  set_column_header_buffer(top=1)

Thanks,
