
metaboprep's Introduction

Metabolite data preparation pipeline - version 1.0

by: David Hughes and Laura Corbin; date: June 3rd, 2019


This package

  1. Reads in and processes (un)targeted metabolite data, saving the datasets in tab-delimited format for use elsewhere
  2. Provides useful summary data in the form of tab-delimited text files and an html report.
  3. Filters the data set using a standard pipeline and according to user-defined thresholds.

Known issues

Dec 22, 2022

  1. Nightingale Health has updated their data release format and has changed the spelling/format of some metabolite names. We have updated our annotation file for automatic annotation of Nightingale Health data, but you will - at present - need to extract the metabolite and sample data from your Nightingale Health release and run them as flat text files.
  2. Nightingale Health: if there is a metabolite ID that metaboprep cannot automatically annotate because the spelling has changed, it is best to identify this change and edit the spelling manually before running metaboprep.
  3. Nightingale Health: if there is a metabolite ID that metaboprep cannot automatically annotate because it is a new feature, you can still run metaboprep. However, if that feature is a derived feature - like a ratio - and you want to exclude ratios from some of the QC steps, such as identifying independent features or estimating PCs, metaboprep will - at present - not be able to exclude it. We are aware of this issue and are looking for a fix.
  4. Metabolon has a new data release format. We are working to incorporate it into the package. At present you will have to extract the metabolite, sample annotation, and feature annotation files as flat text files and run them as such.

Install metaboprep

  1. To install, do the following

    1. quick install

      1. start an R session

      2. install the metaboprep package with

        devtools::install_github("MRCIEU/metaboprep")
      3. from this repo download a copy of the following files

        1. run_metaboprep_pipeline.R
        2. parameter_file.txt
        • You can also download or clone the entire repo with

           git clone https://github.com/MRCIEU/metaboprep.git
    2. alternatively you can download the package manually

      1. download a copy of the repository

      2. unzip/unpack the download

      3. place the directory somewhere sensible

      4. start an R session

      5. set your working directory to the parent directory of the repo

      6. install R package with:

        devtools::install("metaboprep")
    3. A common installation failure is caused by errors installing dependent packages. If you experience this, install those dependent packages manually with BiocManager, then attempt the installation of metaboprep again. You may have to repeat this step more than once.

      if (!requireNamespace("BiocManager", quietly = TRUE))
          install.packages("BiocManager")
      BiocManager::install("MISSINGPACKAGENAME")

To run metaboprep

  1. Edit the parameter file (parameter_file.txt)

    1. do not add any spaces before or after the "=" sign in the parameter file.
    2. the parameter file can be located anywhere
  2. Move to a directory containing a copy of:

    1. run_metaboprep_pipeline.R
  3. Make sure that R is available in your environment - an often necessary step when working on an HPC.

    1. for example: module add languages/R-3.5-ATLAS-gcc-7.1.0
  4. Run the metaboprep pipeline on a terminal command line as follows:

    Rscript run_metaboprep_pipeline.R /FULL/PATH/TO/example_data/excel/parameter_file.txt
  5. If you experience issues with the generation of the html report on an HPC, the report can be generated on a local machine as follows.

    1. move to your newly generated metaboprep project directory.

      • it will take the form of "../metaboprep_release_TODAYSDATE/"
    2. You should find an R data object called "ReportData.Rdata". Save a copy locally.

    3. Open an R session

    4. produce the report with the function metaboprep::generate_report() as follows:

      output_dir_path = paste0("FULL/PATH/TO/DIR/OF/CHOICE/")
      rdfile = paste0(output_dir_path, "ReportData.Rdata")
      generate_report( full_path_2_Rdatafile = rdfile, dir_4_report = output_dir_path )
      

Example data

An example data set can be found in the folder "example_data" here on the repository. It is a simulated data set of 100 metabolites for 100 samples. There is (1) a Metabolon-like (v1 format) excel file of the data set, and (2) a flat text (tab-delimited) version of the data set. Both are accompanied by a parameter file to help guide you through the example. This example data includes data from two hypothetical mass spectrometry run modes or platforms, "neg" and "pos". A subset of the metabolites was run (simulated) in the "neg" run mode in two batches, and the remaining metabolites were run (simulated) in the "pos" run mode in three batches. Each batch was simulated with a different mean abundance to help illustrate possible batch effects and the normalization procedure. If looking at the flat text version of the example, information on which metabolites were simulated in which run mode or platform can be found in the column "platform" of the flat text file "feature_data.txt". To identify which samples belonged to which batch, for each run mode, use the columns "neg" and "pos" in the flat text file "sample_data.txt".
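
A minimal sketch of how you might inspect the run modes and batches in the flat text version of the example data in R. The file paths below are illustrative; adjust them to wherever the flat text files sit in your copy of the repository:

    ## read the tab-delimited example files; the first column is assumed to hold the IDs
    feature_anno <- read.delim("example_data/flat_text/feature_data.txt", row.names = 1)
    sample_anno  <- read.delim("example_data/flat_text/sample_data.txt",  row.names = 1)

    ## which metabolites were simulated in the "neg" and "pos" run modes / platforms
    table(feature_anno$platform)

    ## which batch each sample belongs to, within each run mode
    table(sample_anno$neg)
    table(sample_anno$pos)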

Data Preparation steps in brief

See the Wiki page here for a detailed synopsis of the pipeline.

(A) General Outline of metaboprep

  1. Read in the parameter file
  2. Read in the data - (typically from a commercially provided excel file)
    • metabolite abundance
    • sample annotation
    • feature annotation
  3. Write metabolite data, sample annotation and feature annotation to flat text file.
  4. Normalize data
    • If the data are from Metabolon, or from any other technology that has run-mode or platform batches, median normalize the data (a minimal sketch of per-batch median normalization follows this list).
  5. Estimate summary statistics on the raw data set (step B below)
    • write summary stats to file
  6. Perform the data filtering (step C below)
    • using parameters passed in the parameter file
    • write data filtering (metaboprep) data set to file
  7. Estimate summary statistics on the filtered data set (step B below)
    • write summary stats to file
  8. Generate html report
  9. Print scatter plot, histogram, and summary stats for every metabolite to a single PDF.
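
As an illustration of item 4 above (not the package's internal routine), per-batch median normalization can be sketched as follows. The function name and the choice to centre every batch at 1 are assumptions of this sketch:

    ## metab: numeric matrix, samples in rows, features in columns
    ## batch: vector of batch labels, one per sample (row)
    median_normalize <- function(metab, batch){
      out <- metab
      for(b in unique(batch)){
        rows <- which(batch == b)
        ## per-feature medians within this batch
        batch_medians <- apply(metab[rows, , drop = FALSE], 2, median, na.rm = TRUE)
        ## divide each feature by its batch median, centring every batch at 1
        out[rows, ] <- sweep(metab[rows, , drop = FALSE], 2, batch_medians, FUN = "/")
      }
      out
    }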

(B) Summary Statistics Estimated

  1. Sample Summary Statistics
    • sample missingness
      • all features
      • to the exclusion of xenobiotic and/or derived variables
    • sample total sum abundance (TSA) (derived variables excluded)
      • with all features
      • with complete features only (no missingness)
    • count of how many times a sample is an outlier across all features
      • each feature analyzed within its own sample distribution
      • outliers determined as those +/- 5 IQR of the median.
  2. Feature Summary Statistics
    • feature missingness
      • all samples
      • to the exclusion of sample(s) with extreme missingness (>= 50%)
    • distribution statistics
      • Shapiro's W-statistic of normality on the raw distribution
      • Shapiro's W-statistic of normality on the log10 distribution
      • skewness
      • kurtosis
      • N, sample size
      • variance
      • standard deviation
      • coefficient of variation
      • mean
      • median
      • min
      • max
    • feature outlying sample count
      • count of outlying samples
        • outliers determined as those +/- 5 IQR of the median.
  3. Feature and Sample structure (a minimal sketch of these steps follows this list)
    • feature:feature correlation structure (derived variables excluded)
      • only includes features with at least 50 measurements
        • or, if the data set has an N < 50, the missingness allowed is 0.8 * N
      • estimate the number of independent features
      • tag representative features of feature clusters
    • sample:sample correlation structure (derived variables excluded)
      • principal components (PCA)
        • derived from independent features
        • missing data is imputed to the median estimate of each feature
        • identify PC outliers
          • +/- 3,4,5 SD of mean for all significant PCs
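
Purely for illustration, the independent-feature and PCA steps described in item 3 above could be sketched as follows. The Spearman correlation, complete linkage, and the 0.5 tree cut height are assumptions of this sketch, not the package's internal settings:

    ## metab: numeric matrix, samples in rows, features in columns (derived variables removed)

    ## feature:feature correlation on pairwise-complete observations
    cormat <- cor(metab, method = "spearman", use = "pairwise.complete.obs")

    ## cluster features on 1 - |correlation| and cut the tree at a chosen height
    tree     <- hclust(as.dist(1 - abs(cormat)), method = "complete")
    clusters <- cutree(tree, h = 0.5)    ## hypothetical cut height

    ## tag one representative feature per cluster: the one with the least missingness
    missingness <- apply(metab, 2, function(x) mean(is.na(x)))
    independent <- tapply(colnames(metab), clusters, function(f) f[which.min(missingness[f])])

    ## PCA on the independent features, with missing values imputed to each feature's median
    pcadata <- metab[, unlist(independent), drop = FALSE]
    pcadata <- apply(pcadata, 2, function(x){ x[is.na(x)] <- median(x, na.rm = TRUE); x })
    pcs     <- prcomp(pcadata, center = TRUE, scale. = TRUE)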

(C) Data Filtering Steps

  1. If the data are from Metabolon, exclude (but retain for step 14) xenobiotic metabolites from the analysis.
  2. Estimate sample missingness and exclude extreme samples, those with a missingness >= 0.80 (or 80%) (derived variables excluded)
  3. Estimate feature missingness and exclude extreme features, those with a missingness >= 0.80 (or 80%)
  4. Re-estimate sample missingness and exclude samples >= user defined threshold (units: 0.2 or 20% missing) (derived variables excluded)
  5. Re-estimate feature missingness and exclude features >= user defined threshold (units: 0.2 or 20% missing)
  6. Estimate total sum abundance/area (the sum of all values) for each individual (a minimal sketch of steps 6 and 7 follows the note below)
    • first z-transform each distribution
    • second shift the mean of each distribution to the absolute minimum of ALL observed values
  7. TSA sample exclusion using a user defined threshold (units: +/- SD from the mean) (derived variables excluded)
    • To ignore this step, set the threshold to NA in the parameter file
  8. Identify outlier values for each feature using a user defined threshold (units: +/- IQR from median)
  9. The user defines what to do with outliers, for the purposes of deriving the PCs only:
    • "leave_be", which means the outlier values will be left as they are.
    • "turn_NA", which means they will be median imputed for the PCA.
    • "winsorize", to the 100th quantile of all remaining (non-outlier) values.
  10. Build feature:feature correlation matrix on the filtered data derived from steps 1-6 above (derived variables excluded)
    • To be included, a feature must have a minimum of 50 observations, or N*0.8 observations if the data set includes fewer than 50 individuals.
  11. Identify "independent" features using the correlation matrix from step 10 and a user defined tree cut height.
    • we retain the feature with the least missingness within a cluster; otherwise we select the first feature in the list.
  12. Estimate principal components using the independent features from step 11
  13. PC outlier samples exclusions using user defined threshold (units: +/- SD from the mean)
  14. If the data is from Metabolon we place the xenobiotic metabolites back into the filtered data set.

NOTE: Derived variables are those that are ratios or percentages of two or more features already present in a data set, such as those found in Nightingale data.
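
Purely as an illustration of filtering steps 6 and 7 above (not the package's exact implementation), total sum abundance and the SD-based sample exclusion can be sketched as follows; the 5 SD threshold is a hypothetical value:

    ## metab: numeric matrix, samples in rows, features in columns (derived variables removed)

    ## step 6: z-transform each feature, then shift every distribution so that its mean
    ## sits at the absolute minimum of ALL observed (z-scored) values
    z       <- scale(metab)
    shifted <- z + abs(min(z, na.rm = TRUE))

    ## total sum abundance (TSA) per sample
    tsa <- rowSums(shifted, na.rm = TRUE)

    ## step 7: flag samples beyond a user defined threshold (units: +/- SD from the mean)
    tsa_sd_threshold <- 5    ## hypothetical value; in practice set via the parameter file
    tsa_outliers <- which( abs(tsa - mean(tsa)) > tsa_sd_threshold * sd(tsa) )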

HTML Report includes

---example figures provided for illustration---

  1. General information on study
  2. Raw data summary
    • sample size
    • missingness
    • table of data filtering exclusions
    • figure of PCA outlier exclusions
  3. Summary of the filtered data
    • sample size
    • summary figures
      • missingness distributions
      • feature dendrogram
      • PCA
    • distribution of Shapiro W-statistics
    • outlier summary
  4. Batch effects
    • feature missingness
      • as influenced by:
        • feature SUPER_PATHWAY (categorical function)
        • MS method
          • LC/MS Polar, Pos Late, Pos Early, Neg
    • sample missingness
      • as influenced by:
        • BOX_ID, storage box
        • RUN_DAY, the day the samples were processed
      • multivariate analysis of both BOX_ID and RUN_DAY on missingness
    • sample total peak|abundance area
      • as influenced by:
        • BOX_ID, storage box
        • RUN_DAY, the day the samples were processed
      • multivariate analysis of both BOX_ID and RUN_DAY on total abundance
  5. Power Analysis
    • presence -vs- absence
    • continuous

change log

  1. 2022, Feb 7th: Edited the run_metaboprep_pipeline.R script to run the Metabolon normalization only when parameter 6 is Metabolon, rather than whenever it is not Nightingale (line 430 in the script).

metaboprep's Issues

Alter flat text sample and feature annotation read ins

Mandate that the first column of the sample annotation file and the feature annotation file is an ID column.
During read-in, automatically push these IDs into the row names with the read-in function.
Check that the IDs in each match those in the metabolite data sheet. If they do not, STOP.
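
A minimal sketch of the proposed read-in behaviour (the file names here are just placeholders):

  ## push the first column (IDs) of each annotation file into the row names
  sample_anno  <- read.delim("sample_data.txt",  row.names = 1, stringsAsFactors = FALSE)
  feature_anno <- read.delim("feature_data.txt", row.names = 1, stringsAsFactors = FALSE)
  metab        <- read.delim("metabolite_data.txt", row.names = 1)

  ## STOP if the IDs do not match those in the metabolite data sheet
  if( !setequal(rownames(sample_anno),  rownames(metab)) ||
      !setequal(rownames(feature_anno), colnames(metab)) ){
    stop("sample and/or feature annotation IDs do not match the metabolite data")
  }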

xenobiotics not included in independent list

Right now, xenobiotics are excluded from being an "independent" feature when running through the QC steps.

They can, however, be an independent feature when we run sumstats over the final QC'd data set.

report: missingness figure

Section 1, figure 1 of report: the missingness figure has a legend that states:

"Figure Legend: Missingness structure across the raw data table. White cells depict missing data. Individuals are in rows, metabolites are in columns."

The legend does not match the figure. The legend states the figure is "across the raw data table", but the figure is for the filtered data only.

version: 1.0.1
using metaboprep_Report_v0.Rmd

Check and consider revising criteria for including metab in tree/PCA

Statement in the readme says:

"feature:feature correlation structure (derived variables excluded)
only includes features with at least 50 measurments
or if the data set has an N<50 the missingness allowed is 0.8 * N"

Need to check firstly that this is actually how things are implemented in the code - both for the summary stats and QC steps where the tree/PCA analysis is done.

Also, need to consider appropriateness of including xenobiotics and/or derived measures in this step.

And also, whether imputing to the median will have any unforeseen consequences when the missingness rates for some metabolites in the analysis are high.

Check all log statements re. thresholds applied

When including/excluding features/samples, in the code itself I think you generally use >=/<=, but in the printed statements you just use >/<. This is a minor point, but there needs to be consistency here: if people try to manually reproduce the same lists of inclusions/exclusions, any values equal to the threshold will cause slight discrepancies.

error in QC_Report.Rmd

I am getting the following error back when the QC script reaches nearly the end (98%) of the .Rmd report:

Quitting from lines 751-763 (QC_Report.Rmd)
Error in seq.default(100, half, 100) : wrong sign in 'by' argument
Calls: ... run.pa.imbalanced.power.make.plot -> seq -> seq.default

Possible inconsistency in perform.metabolite.qc.R

It appears from the code that, if the user wants to, they can opt not to make any sample exclusions based on TPA by setting that value to NA in the parameter file ( "if( is.na(tpa_out_SD) == FALSE)" ), although that is not clear in the readme or in the parameter file guidance (hashed-out instructions in the parameter file).
This doesn't seem to be an option for the PCA-based exclusions. Question is, should it be? I guess you could just set the value really high to stop any exclusions being made, so maybe it is not that important. Could be one to put on the wish list to sort out?

.rmd error - 'reshape' package

When I ran the run_MetaboQC_pipeline.R script there was an error generated during the running of the .rmd report, as follows:

Quitting from lines 123-141 (QC_Report.Rmd)
Error in loadNamespace(name) : there is no package called 'reshape'
Calls: ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart

I was able to fix this by manually installing the 'reshape' package prior to running the run_MetaboQC_pipeline.R script, but presumably this package should be installed as a dependency as part of the main package install?
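
For reference, the manual workaround until 'reshape' is declared as a package dependency:

  install.packages("reshape")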

BUG in Nightingale pipeline

The current code depends on information contained within a file 'ng_anno' within the package itself to identify 'derived' features in the Nightingale Health NMR data. However, where Nightingale datasets have additional metabolites/features not in the list provided and/or metabolites/features are named differently, this part of the pipeline fails. As a consequence, steps reliant on this information including the generation of metabolite/feature and sample missingness stats and PCA, and subsequent filtering based on these are not working as expected.

We recommend extreme caution when using this pipeline whilst we investigate this issue. It may be possible to avoid the problem by specifying 'Other' in place of 'Nightingale' in the parameter file but this approach has not been fully tested at this time.

Query re. perform.metabolite.qc.R, step (10)

In step (10), PC outliers are ID'd and removed. I'm not sure about how this works ... If it is pulling the list of independent features from the summary stats part of the script (i.e. PC outliers section in run_MetaboQC_pipeline.R), is this correct or does the tree need to be rebuilt in the qc'd dataset? This only seems to happen if ind_feature_names[1] doesn't already exist - not sure when that is the case? See code snippet below.

So, can we just check this part of the code and its dependencies?


if( is.na(ind_feature_names[1]) == TRUE ){
  cat( paste0("\t\t- QCstep: identify independent features through correlation analysis and dendrogram clustering.\n") )
  ## we need to estimate independent features denovo, if not available
  featuresumstats = feature.sum.stats( wdata = wdata, sammis = samplemis )
  w = which( featuresumstats$table$independent_features_binary == 1 )
  ind_feature_names = rownames( featuresumstats$table )[w]
  cat( paste0("\t\t\t* ", length(ind_feature_names), " independent features identified.\n") )
}

TPA or TSA

We have started calling TPA "total sum abundance" (TSA) in the manuscript, since it is not peak area for NMR. We need to update the PDF QC report to reflect this, and possibly also the code ...

0,1 Data in metabolite data file

Olink data seems to sometimes have binary calls of 0 and 1. This creates issues when building the correlation and distance matrices, which fail when making the hclust().
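
One possible guard (a sketch only, not currently in the package) would be to flag features with two or fewer distinct observed values and set them aside before the correlation/distance matrix is built:

  ## metab: numeric matrix, samples in rows, features in columns
  is_binary  <- apply(metab, 2, function(x) length(unique(na.omit(x))) <= 2)
  cont_metab <- metab[, !is_binary, drop = FALSE]   ## pass this on to the correlation / hclust step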

perform.metabolite.qc function broken

Using the current version of the package, I am getting the following error printed to the log/screen when i run the full pipeline:

V. Performing data quality control.
a. Performing QC on data.
Error in perform.metabolite.qc(wdata = mydata$metabolitedata, fmis = feature_missingness, :
unused argument (ind_feature_names = ind)
Execution halted

This must relate to the latest update of this function ...

multivariate anova in report NA

If some of the batch variables are perfectly correlated, i.e. redundant, you will get NAs in the multivariate table.
Add a step to remove variable redundancies.
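
One possible approach (a sketch, not the package's code) is to drop any batch variable whose levels map one-to-one onto those of another variable before fitting the multivariate model:

  ## batch: data.frame of categorical batch variables (e.g. BOX_ID, RUN_DAY), one row per sample
  drop_redundant <- function(batch){
    if( ncol(batch) < 2 ) return(batch)
    keep <- rep(TRUE, ncol(batch))
    for(i in seq_len(ncol(batch) - 1)){
      for(j in (i + 1):ncol(batch)){
        if(keep[i] && keep[j]){
          tab <- table(batch[[i]], batch[[j]])
          ## perfectly correlated (redundant) when the levels map one-to-one
          if(all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)) keep[j] <- FALSE
        }
      }
    }
    batch[, keep, drop = FALSE]
  }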

Potential issue with new format Nightingale data

Nightingale has changed the format of their data, meaning it needs to be extracted as flat text files for input to metaboprep. This may be causing issues where the feature names are not in the expected column and so cannot be retrieved.
