cbiit / r-cometsanalytics
R package development for COMETS Analytics
I used input files based on the sample dataset data_input.zip.
With these files, I was unable to "Create input". Instead, these files returned the following error:
Java Exception .jcall(cell, "V", "setCellValue", value)
[ ] replace all _ in variable names with .
[ ] add lm and lmer code
[ ] summary statistics for covariates in modeldata$gdta
[ ] truncate really long names, or take only the first if more than one appears in the display
In the test file that I had prepared, the N for non-white/European persons was quite small. In fact, only 1 individual had a race_grp=2. This seems to be causing all kinds of problems in the adjusted/stratified analyses.
To test, run in interactive mode:
Exposure: Age
Outcome: Any individual metabolite
Adjusted covariates: race_grp
Strata by: BMI_grp
Two of the three values returned will have a value of NA. Possibly, this reflects a degrees of freedom issue?
Input file is below.
We should implement an acceptable-values check so that we can do integrity checks. For example, a BMI of 0 is not acceptable.
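A minimal sketch of such a check (the function and column names here are hypothetical, not part of the COMETS API):

```r
# Hypothetical sketch of an acceptable-values (range) check for integrity testing.
# Returns the row indices whose values fall outside the acceptable range.
check_range <- function(x, lo, hi) {
  which(!is.na(x) & (x < lo | x > hi))
}

subjects <- data.frame(bmi = c(22.5, 0, 31.2))
bad_rows <- check_range(subjects$bmi, lo = 10, hi = 80)  # flags the BMI of 0
```

An integrity check could report these row indices back to the user rather than letting an implausible 0 propagate into the models.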
My IMS collaborators have requested that we add the model tab to our zip file so that they can double-check that the correct models were run and have it documented. I think this is a great idea. We may want to add the varmap as well.
We will define the object to have the following slots:
For batch mode, model age.1, let's replace the existing warning message with "We removed one or more dummy variables that were redundant (i.e. perfectly correlated with another variable)."
Also is it possible to specify the variable?
Currently, categorical variables are not properly adjusted for--they are entered into the model as continuous variables. Models should distinguish between categorical and continuous variables using a new column that will be added to the Varmap tab.
This change will also require a change to the Sample file (to be logged separately) and to the "Create Input" utility (it needs to add this column to the input file that it creates).
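To illustrate why the distinction matters, a small sketch (the variable name race_grp is from the examples above; how the new Varmap column will be named is not yet decided):

```r
# Sketch: the same variable entered as continuous vs. categorical in lm()
d <- data.frame(y = c(1.2, 0.8, 1.5, 2.1, 1.9, 0.7, 1.1, 1.6, 2.0),
                race_grp = rep(1:3, times = 3))

fit_cont <- lm(y ~ race_grp, data = d)          # treated as continuous: one slope
fit_cat  <- lm(y ~ factor(race_grp), data = d)  # dummy-coded: one term per extra level

length(coef(fit_cont))  # 2 coefficients (intercept + slope)
length(coef(fit_cat))   # 3 coefficients (intercept + 2 dummies)
```

Wrapping the variable in factor() is what proper categorical adjustment amounts to; entering it bare imposes a single linear trend across the category codes.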
The "Run model" step checks the data file again on the website side, but it doesn't need to--the data is already saved in the created list and everything is there.
Resolving this will help speed the analyses.
some weird UIDs to be checked:
C14_0_CE_+NH4
C16_0_CE_+NH4
C16_1_CE_+NH4
C18_0_CE_+NH4
C18_0_MAG_+NH4
C18_1_CE_+NH4
C18_2_CE_+NH4
In the pairwise analysis, the exposure_uid and exposure columns are reporting the exposurespec. Would it be possible to get the actual UID and name in these two columns?
We will use rio to allow other file formats.
Certain combinations of adjustments and stratification can still cause problems with models. One of the simplest scenarios uses the data below, with the model as follows:
Exposure: age
Outcome: glycine (can also use All Metabolites)
Adjusted: bmi_grp, alc_grp
Strata: smk_grp
Initially, I thought this could be due to a code reversion, but that was a false lead.
I then thought it could reflect metabolites with high numbers of values below the limit of detection (i.e. little meaningful variance), but I tested against glycine (for which this issue does not apply) and still had the same problem.
I am thus forced to conclude that the issue reflects something about the joint distribution of the adjusted and strata variables that we are not quite fully handling.
Ella and Ewy, the data are attached. Let me know if you have any insights. I hope to test again toward the end of today.
Currently, the code in the vignettes and examples does not match. To minimize confusion, it may be best to sync those up at some point...
Do you agree?
After harmonization of study data, in our updated process, we will need to update COMETS with a new UID file to account for newly added metabolites. Is there a way to automate this so that, when the file changes, we can upload it to a specific spot and COMETS uses it without any input processing?
Does the file we send need to be modified to make something like this work?
This issue occurs when using a dataset with r-unfriendly metabolite names in interactive mode, when selecting individual metabolites.
So, for example:
Exposure: Age
Outcome: HOMOVANILLATE (HVA)
Will result in a warning of "Undefined columns".
This issue can be triggered using the dataset in issue #49.
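One way to sidestep the R-unfriendly-name problem, sketched below with base R's make.names() (the actual fix in COMETS may take a different form):

```r
# Sketch: convert R-unfriendly metabolite names to syntactic names,
# keeping a lookup so the original display names can be recovered
raw  <- c("HOMOVANILLATE (HVA)", "C14_0_CE_+NH4")
safe <- make.names(raw)   # invalid characters (spaces, parens, "+") become "."
names(raw) <- safe        # lookup table: safe name -> display name

raw[["HOMOVANILLATE..HVA."]]   # recovers "HOMOVANILLATE (HVA)"
```

Using the syntactic names internally for column subsetting, and the lookup only for display, would avoid the "Undefined columns" warning.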
Sometimes the metabolites do not harmonize, even when HMDB IDs are present. To my understanding, Ella was looking into this issue. This also occurred with the R. Kelly VDAART file, which I can supply, if needed.
For our rollout, we have proposed a two-step process for the cohorts:
Prepare data file, test integrity, download "harmonization" file, and run one simple analysis (age.2). Send harmonization file and results file to IMS so that they can begin harmonization.
IMS sends back a "Metabolites" tab that is identical to the original, except with an additional UID_01 column. With this new tab, the cohort then goes back to COMETS-Analytics and runs "All models". These models are now "pre-harmonized".
To accommodate this process change, I have two minor edits to the harmonization file, per discussion with Nathan Appel and David Ruggieri of IMS.
The variable that is currently called "UID_01" should be renamed to make room for the IMS UID_01 variable. My suggested rename is "UID_01.comets_analytics"--which reflects the fact that this UID_01 is based on the COMETS-Analytics algorithm. Making room for both columns also will give us data to track our algorithm's performance over time (% match between algorithm and IMS final UID).
The harmonization file changes the case (lower case vs. upper case) of the metabid variable as compared with the original input. To ensure that IMS can fully replicate the original harmonization file, we should preserve the original case. Remember that if the cohort is not proceeding with the "All models" analysis initially, then IMS only has the "Harmonization" file to work with and not the "Input file".
Several investigators have been trying to input missing data, which causes errors. All metabolites and subjectdata variables should be tested for this.
Possible fixes include:
a) An improved tutorial
b) Tests in the Data Input function
c) Tests in the "Read datafile", or "Integrity check", or "getModel" functions.
Because this check could add some time, we could add this as an option, i.e. add a checkbox to test the variables, default is no test of the variables.
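An optional missing-data check along these lines could be quite simple (sketch; the function name is hypothetical, not the COMETS API):

```r
# Hypothetical sketch: flag variables that contain missing values,
# to be run only when the user ticks the optional checkbox
find_missing <- function(df) {
  names(df)[vapply(df, anyNA, logical(1))]
}

d <- data.frame(age = c(50, NA, 61), glycine = c(1.2, 0.8, 1.1))
find_missing(d)  # flags "age"
```

Reporting the flagged variable names up front would turn the current downstream errors into an actionable message for the investigator.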
Interactive mode - if non-R-friendly names are used for metabolites, it fails when selecting individual metabolite names; a workaround is to use the tag function (Comets R). A lower-priority issue for now.
I have been working with a new test dataset from our collaborators at the American Cancer Society. At least one of their models has been jamming the queue, for reasons unknown. I can confirm that the first two models are fine, and that the problem is not solely due to the "all metabolites*all metabolites" analysis.
More testing to be done once the queue is unjammed.
When running a model that has several XXX, the heatmap function fails because there are duplicate rownames:
> excorrdata <- COMETS::runCorr(exmodeldata,exmetabdata,"DPP")
NULL
NULL
[1] "running unadjusted"
> COMETS::showHClust(excorrdata)
Error: Duplicate identifiers for rows (532, 611), (1143, 1222)
This is a new error, and I'm not sure why it's cropping up now, since the vignette hasn't been changed in a while.
For COMETS 1.4, I would like to focus on three things: 1) Harmonization; 2) Error handling; and 3) Queue management/troubleshooting. This issue applies to the first of these.
Currently, we are doing all the harmonization on the backend at IMS. For each cohort, they start with our attempt to auto-harmonize but then revise/edit substantially, until all entries are logically consistent. Nathan pointed out that, once this has been done for each study, the most sensible approach is to send our UID back to the cohort as a column to add to their data file, so that the file is permanently harmonized from then forward. Ella, Ewy, and I should meet with Nathan to discuss, but on a preliminary basis, I agree.
If we go this route, we will need to accommodate a new column for each data file in our harmonization algorithm. It may also change the (non-software) workflow for each cohort--for example, we have each study run the Integrity Check and one or two tables that they send to IMS for pre-harmonization. Then we feed back the harmonized metabolite UID, the cohort analyst adds it to their file and runs one or two tables again. Then, if IMS is able to harmonize these easily, the cohort runs the whole analysis.
Let's discuss once 1.3 is complete.
COMETS manuscripts will need descriptive data from each of the participating studies that we can show in our Table 1. The descriptive data should be output as a zip file table. For categorical variables, the percent in each category will likely suffice. For continuous variables, I suggest outputting the mean, the standard deviation, and the values at the 0th (minimum), 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 100th (maximum) percentiles.
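In base R, the proposed descriptives could be sketched as follows (the function name is hypothetical):

```r
# Hypothetical sketch of Table 1 descriptives for a continuous variable:
# mean, SD, and the requested percentiles (0th through 100th)
describe_cont <- function(x) {
  p <- c(0, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 1)
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    quantile(x, probs = p, na.rm = TRUE))
}

# For categorical variables, the percent in each category:
pct <- prop.table(table(c("never", "never", "former", "current"))) * 100
```

Each study's output could then be written into the zip file as one table per variable type.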
The test site will not run any correlations with the new R package.
I've quick-fixed the vignette code because the getModelData() function was not working. I figured out it was because the original call was as follows:
exmodeldata <- getModelData(exmetabdata, colvars="age", modbatch="1.1 Unadjusted")
However, if modbatch is not specified, it errors because of this line (#62):
mods <- dplyr::filter(as.data.frame(readData[["mods"]]), model==modbatch)
I've fixed the call to the following, which now works:
exmodeldata <- getModelData(exmetabdata, colvars="age", modbatch="1.1 Unadjusted")
However, does this make sense? If it's in batch mode, then all models should be read in, right?
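One hedged way to make that filtering step tolerate a missing modbatch (a sketch only; the real getModelData() signature and data structures may differ):

```r
# Sketch: filter to one model only when modbatch is supplied; otherwise keep all
get_models <- function(readData, modbatch = NULL) {
  mods <- as.data.frame(readData[["mods"]])
  if (!is.null(modbatch)) {
    mods <- mods[mods$model == modbatch, , drop = FALSE]
  }
  mods
}

readData <- list(mods = data.frame(model = c("1.1 Unadjusted", "1.2 Adjusted"),
                                   stringsAsFactors = FALSE))
nrow(get_models(readData))                    # batch mode: all models kept
nrow(get_models(readData, "1.1 Unadjusted"))  # interactive mode: one model
```

This would make batch mode (all models) the default while preserving the single-model behavior when modbatch is given.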
If the "where" statement is used in either "interactive" mode or "batch" mode, the N listed does not update. This will create downstream problems when calculating standard errors for meta-analysis, so it is important to fix.
I like the addition of the warnings--it will make testing easier.
Now that they are visible, there may be some tweaks needed. One such tweak is that, when running an analysis stratified by BMI, I received the following warning: "Warning: one of your models specifies bmi_grp as a stratification but that variable only has one possible value. Model will run without bmi_grp stratified"
There is a stratum of BMI that had very few observations, but bmi_grp itself definitely has more than one possible value, as evidenced in the screenshot below. Any suggestions for how to modify the wording?
Still investigating...
Test using this file and "Age 2.3"
The minimum level of adjustment needed to provoke the issue is shown below. Notably, even just adding bmi_grp to the model triggered warning messages about the Penrose matrix, etc.:
This occurs when any metabolite with an R-unfriendly name is used as an "individual metabolite" in an interactive mode analysis. Technically a bug (see issue #50) that will likely be eliminated in COMETS 1.4
The results files are outputting correctly from COMETS-Analytics. However, to link the results from one cohort to another's, we need to pull in their metabolite metadata table in its entirety, most likely as a separate table in the zip file. In addition, if the results were auto-harmonized, we should also pull in the UID column, and possibly one or two other columns.
Without meta-data or the UID, we cannot harmonize the metabolites on the back end. This is a high priority fix.
A number of changes need to be made to the test/sample file, including:
It will sometimes happen that a metabolite has no variance, i.e. has the same value for every single participant. When this occurs, there should be no analysis/results for that metabolite, but analysis/results for other metabolites should carry forward as normal. Currently, however, the analysis crashes when it runs into any metabolite with variance = 0.
We need a better method for handling metabolites where variance=0.
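A sketch of one possible handling (data and names here are hypothetical): drop zero-variance metabolites up front so the rest of the analysis proceeds normally.

```r
# Sketch: skip metabolites with zero variance instead of crashing
metabs <- data.frame(glycine = c(1.1, 0.9, 1.3),
                     flat    = c(2.0, 2.0, 2.0))   # same value for everyone

keep <- vapply(metabs, function(x) isTRUE(stats::var(x, na.rm = TRUE) > 0),
               logical(1))
analyzed <- metabs[, keep, drop = FALSE]  # "flat" is dropped; others carry forward
```

The dropped metabolite names could be listed in a warning so that the study knows why those results are absent.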
In tests done to date, all investigators have elected to use the variable names that we use. They are not using the variable matching in any meaningful way.
Thus, I think we could perhaps encourage users to simply code "COHORTVARIABLE" the same as "VARREFERENCE" and, if using the "Create input" utility, we could assume as a default the VARREFERENCE names. This could help to streamline the process of making data input and our writing of the tutorial. We should discuss as a group.
When running the following model, the app enters an infinite loop:
Exposure: Age
Outcome: All metabolites
Adjusted covariates: race_grp
I have noted that this model can work when running individual metabolites or when running all models that are not adjusting for race. This suggests that the problem is related to metabolites where there are only one or a few values being combined with an adjustment where some categories have only one or a few values.
Thus, this could be a model singularity issue, like that described in issue #32 .
The "where" functionality no longer appears to be working. This issue needs to be fixed before I can complete testing on categorical adjustment, since the model we have been testing includes a "where" statement.
Can you give us the differences between these 3 variables?
Also, why is the lone COMP_ID variable missing sometimes?
Thanks,
Nate