genelab_data_processing's Issues
[Microarray] Issue combining dataframes using bind_rows() when getBM() returns no results
Description
The following error occurred when rendering Affymetrix.qmd for one dataset (the same issue could potentially arise in Agile1CMP.qmd too):
Error in `dplyr::bind_rows()`:
! Can't combine `..1$ensembl_gene_id` <logical> and `..2$ensembl_gene_id` <character>.
The error occurs when getBM() returns no rows: the empty ensembl_gene_id column is then assumed to be of logical type, causing the type incompatibility in bind_rows().
Solution
Check whether chunk_results contains any rows before binding to df_mapping in Affymetrix.qmd / Agile1CMP.qmd:
if (nrow(chunk_results) > 0) {
df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
}
[BulkRNASeq] Update default publish output directory when processing GLDS data
Description
The current default output directory is named after the GLDS ID; however, with the release of OSDR, the default directory should indicate both the OSD and GLDS IDs.
Proposed Solution
New default directory of OSD-###_GLDS-###
ENSEMBL annotations not populating correctly
Hi, I'm running this pipeline on the SJSU HPC and I'm having an issue where ~8.5k of my 29k rows of data have both the symbol and gene name populated as "NA". In a given NA row, there is a valid ENSEMBL ID that, if I look it up on ENSEMBL, leads to a valid gene product with an annotation that simply isn't being populated. I am running OSD-511 using the following script (using the cached files established on the spartan01 HPC by Jonathan Oribello):
NXF_SINGULARITY_CACHEDIR=/home/joribello/test_install/singularity nextflow run NF_RCP-F_1.0.3/main.nf -profile singularity,slurm -resume --gldsAccession GLDS-511 -c /home/joribello/test_install/cos_hpc_nextflow.config -c give_ALIGN_STAR_more_memory.config --runsheetPath /home/carnold/GLDS-511/Metadata/GLDS-511_bulkRNASeq_v1_runsheet.csv
The original runsheet was edited to correct the switched R1 and R2 files for one of the samples (they were entered incorrectly in the downloaded version from GL).
give_ALIGN_STAR_more_memory.config contains the following:
process {
    withName: 'ALIGN_STAR' {
        memory = '45GB'
    }
    withName: 'SORT_INDEX_BAM' {
        memory = '45GB'
    }
    withName: 'COUNT_ALIGNED' {
        maxRetries = 3
        errorStrategy = 'retry'
        memory = { 8.GB + 4.GB * task.attempt }
    }
}
[BulkRNASeq] Add V&V Check: Assert adaptor content removed using Trimmed Reads FastQC MultiQC
[BulkRNASeq] Handle cases where distinct group names resolve to the same R safe name
Description
Factor value strings can collapse into the same R-safe strings.
E.g., "p53+/Fbxw7+/-" and "p53-/Fbxw7-/-" both collapse into "p53..Fbxw7...".
This breaks the DESeq2 R scripts.
Steps to Reproduce
- Run processing on OSD-432
Expected Behavior
The DESeq2 R script safely handles cases where multiple original factor values convert to the same R-safe name.
Actual Behavior
R DESeq2 script raises an exception.
Impact on Data
This is a non-silent edge case, so no released processed data should be impacted.
It blocked processing for one dataset at the start of this ticket.
Possible Solution (optional)
Append a unique integer or letter to each original factor value; this suffix will differentiate all factor values when converting to and using R-safe names.
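The suffix approach can be sketched as follows. This is a Python illustration of the idea only; the pipeline's scripts are in R, and `to_safe_names` merely approximates R's `make.names()` behavior:

```python
import re

def to_safe_names(values):
    """Approximate R's make.names(): replace characters that are not
    alphanumeric or '.' with '.', and prefix names starting with a digit."""
    safe = [re.sub(r"[^A-Za-z0-9.]", ".", v) for v in values]
    return [s if not s[0].isdigit() else "X" + s for s in safe]

def disambiguate(values):
    """Append a unique integer suffix to each factor value so that no
    two values collapse to the same safe name."""
    safe = to_safe_names(values)
    if len(set(safe)) == len(safe):
        return safe  # already unique, no suffix needed
    return [f"{s}.{i}" for i, s in enumerate(safe, start=1)]

groups = ["p53+/Fbxw7+/-", "p53-/Fbxw7-/-"]
print(to_safe_names(groups))   # both collapse to the same string
print(disambiguate(groups))    # suffixes keep them distinct
```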
[BulkRNASeq] Underspecification of memory for STAR_ALIGN
Expected Behavior
Memory specified for STAR_ALIGN should be appropriate.
Actual Behavior
Memory specified is often insufficient when using larger references (e.g., for organisms such as Homo sapiens).
Depending on compute backend, this may either result in inefficient usage of resources or worse, workflow exceptions due to insufficient memory (often presenting as a 137 error from STAR_ALIGN tasks).
Workaround
Specifying RAM appropriate for the reference size (~40 GB for Homo sapiens) can overcome this mis-specification.
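As an illustration, the workaround can be applied via a custom config passed with `-c`. This is a sketch only; it assumes the process is named ALIGN_STAR as in the memory-override config shown earlier on this page, and the name may differ between workflow versions:

```groovy
// Hypothetical override config; pass with `nextflow run ... -c star_memory.config`
process {
    withName: 'ALIGN_STAR' {
        memory = '40GB'   // ~40 GB is adequate for a Homo sapiens reference
    }
}
```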
[Microarray Affymetrix] Processed data protocol
Track updates to generate_protocol.sh for Affymetrix pipeline.
BulkRNASeq workflow should determine adaptor type automatically
Currently the workflow user is expected to replace this value manually in the workflow module file.
Instead, the adapter should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.
DPPD Reference
Workflow Reference
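One possible way to auto-detect the adapter is to scan the first reads of the raw FASTQ for well-known Illumina adapter sequences. A hedged Python sketch; the adapter catalog and the `detect_adapter` helper are illustrative, not part of the workflow:

```python
import gzip

# Standard Illumina adapter prefixes (TruSeq and Nextera)
ADAPTERS = {
    "illumina_truseq": "AGATCGGAAGAGC",
    "nextera": "CTGTCTCTTATACACATCT",
}

def detect_adapter(fastq_path, max_reads=100_000):
    """Count known adapter sequences in the first max_reads reads and
    return the most frequent adapter name, or None if none is seen."""
    counts = {name: 0 for name in ADAPTERS}
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # sequence lines in FASTQ records
                for name, seq in ADAPTERS.items():
                    if seq in line:
                        counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```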
[Microarray] Issue creating cache directories in quarto render command for updated Nextflow version
Description
No issues occurred when using Nextflow v22.10.0.5826, but after updating to Nextflow v23.10.1.5891, the following error arose in the AGILE1CH / PROCESS_AFFYMETRIX processes from the quarto render command:
Could not create TypeScript compiler cache location: "/home/<user>/.cache/deno/gen"
Check the permission of the directory.
According to here, this cache location can be changed using the environment variable DENO_DIR.
That resolves the issue, but a second error follows:
Uncaught Error: Read-only file system (os error 30), mkdir '/home/<user>/.cache/quarto'
Both cache directories default to locations inside the user's home directory. This suggests setting the HOME env var to a temp directory, which would cover the first issue too without the need to separately specify DENO_DIR.
Solution
Set HOME to the working directory inside the AGILE1CH / PROCESS_AFFYMETRIX processes:
export HOME=$PWD;
As a result, both cache directories will appear inside the working directory within the work/ folder:
work/<hash>/
|- .cache/
   |- deno/
   |- quarto/
[Metagenomics Illumina] Some datasets have too many files in the MAGs and bins folders
Some datasets can have over 1,000 files in the MAGs and bins folders of the Assembly-based_processing directories (for example: https://genelab-tools.arc.nasa.gov/jira/browse/GLDATAPROC-694). There's no need to have all of them available individually.
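One option (a sketch with illustrative paths) is to bundle each folder into a single compressed archive, so the directory holds one file instead of 1,000+:

```shell
# Demo setup with a throwaway directory standing in for the real output
mkdir -p Assembly-based_processing/MAGs
touch Assembly-based_processing/MAGs/MAG-1.fasta \
      Assembly-based_processing/MAGs/MAG-2.fasta \
      Assembly-based_processing/MAGs/MAG-3.fasta

# Replace the per-file folder with one archive
tar -czf Assembly-based_processing/MAGs.tar.gz -C Assembly-based_processing MAGs
rm -r Assembly-based_processing/MAGs
```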
Parallelize and optimize Wald test and group statistics
With a large number of groups, this step can be prohibitively slow.
For the Wald test, parallel computation would help.
For group statistics, both parallelization and a switch to vector-based computation would help.
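The vectorization idea can be sketched as follows. This is a Python/NumPy illustration only; the pipeline's actual group statistics are computed in R:

```python
import numpy as np

def group_stats(counts, groups):
    """Vectorized per-group mean and standard deviation.

    counts: (genes x samples) matrix; groups: per-sample group labels.
    Each group's samples are sliced out as a block and reduced with a
    single vectorized call, instead of looping over genes.
    """
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    means, stdevs = {}, {}
    for g in np.unique(groups):
        cols = counts[:, groups == g]      # all samples in group g at once
        means[g] = cols.mean(axis=1)
        stdevs[g] = cols.std(axis=1, ddof=1)
    return means, stdevs

counts = [[10, 12, 100, 110],
          [5, 7, 50, 52]]
groups = ["A", "A", "B", "B"]
means, stdevs = group_stats(counts, groups)
print(means["A"])  # per-gene means for group A
```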
[Methyl-Seq] Pipeline specifies incorrect input for deduplication step.
The pipeline specifies a coordinate-sorted BAM as input for deduplicate_bismark (see GL-DPPD-7113 Step 6), but the deduplicate_bismark tool requires read-name-sorted input files (see the Bismark documentation).
[BulkRNASeq] Remove generation of visualization extended DGE table
Description
This file has been retired as the relevant columns are now generated on the fly where used by GeneLab visualization.
Proposed Solution
Remove generating code.
[Metagenomics Illumina] explicitly setting humann3 reference db locations in rule
- an issue can arise if the user is pointing to previously installed humann3 reference databases, or if the locations were manually changed for testing at some point
- this linked commit on the metagenomics branch integrated always explicitly setting the locations based on what is in the config.yaml: 1405949
- changes are in the humann3_PE and humann3_SE rules
- in case it's helpful, I'm also attaching a modified Snakefile that has the changes in it (as a zip due to file-type restrictions): Snakefile.zip
Add reference table support for "Euglena gracilis"
Note: This organism doesn't appear to be available on Ensembl based on my searches, which may block this request.
Add reference table support for "Pseudomonas aeruginosa"
Transfer from dev issue 149
- Annotation Reference Table Support
Files Not available from source
When I run this code (link), there are some errors: the JSON files don't appear to be available.
Thanks in advance
[BulkRNASeq] Handling Technical Replicates
Description
Workflow should handle technical replicates appropriately.
Approaches
DESeq2 provides a collapseReplicates function that sums counts based on a factor to group samples by.
The rationale has two major points:
- Summing opposed to averaging is appropriate for maintaining expected Poisson distribution
- DESeq2 is designed to normalize for library size differences. Summing technical replicates is akin to having a higher sequencing depth for a sample.
Implementation Suggested
Encode Technical Replicate Groups in the Runsheet
Encode technical replicates as a column in the runsheet simply using integers for each technical replicate group.
Eventually, this technical replicate column should be automatically derived from ISA archive metadata; however, in the meantime, a workflow user should be able to supply a two column csv mapping sample name to technical replicate group which will be incorporated into the runsheet.
Use the Technical Replicate Group Column in the Runsheet for DESeq2 collapseReplicates
https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html
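The collapse step itself amounts to summing counts across samples that share a technical replicate group. A minimal Python/pandas sketch of that operation (the workflow would use DESeq2's collapseReplicates in R; sample and gene names here are illustrative):

```python
import pandas as pd

# Toy counts matrix: genes x samples
counts = pd.DataFrame(
    {"S1_rep1": [10, 0, 3], "S1_rep2": [12, 1, 2], "S2": [7, 5, 9]},
    index=["gene1", "gene2", "gene3"],
)

# Two-column mapping as proposed above: sample name -> tech replicate group
tech_rep_group = {"S1_rep1": 1, "S1_rep2": 1, "S2": 2}

# Sum counts within each technical replicate group (columns collapse)
collapsed = counts.T.groupby(tech_rep_group).sum().T
print(collapsed)
# group 1 holds the summed S1 replicates; group 2 is S2 unchanged
```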
Validation Plan
- Validate that the approach produces reasonable results as follows:
Run the following approaches
- NF_RCP-F_1.0.3 (i.e. no technical replicate handling)
- collapseReplicates (summed tech. replicates)
- median replicates
- mean replicates
- filter to first replicate only (drop others)
Assessment Metrics:
- DGE results
- Regression Test Criteria
- Core tests should run without change in outcomes (since core tests don't include any technical replicates)
[BulkRNASeq] Combine MultiQC reports
MultiQC reports are currently generated separately for each tool. Perhaps combine into one or two reports (e.g., FASTQC/Trim Galore! into one report where metrics are reported at read-level, STAR/RSeQC/RSEM in another where metrics are reported at sample-level)
from genelab study id to pubmed
Hi,
how can I map a GeneLab study ID to a PubMed ID?
For example, for study ID OSD-70 I would like to find the related PubMed ID. From the filenames I can see there is GSE33779, and I can then go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33779, check the citation, and see PMID 23806134.
Is there any smarter way of mapping?
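One programmatic route (a sketch, not an official GeneLab mapping) is NCBI E-utilities: esearch resolves the GSE accession to a GEO UID, and elink follows the GEO-to-PubMed link. The helpers below only construct the query URLs; fetching and XML parsing are left out:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def geo_search_url(gse_accession):
    """esearch URL: look up the GEO (db=gds) UID for a GSE accession."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "gds", "term": f"{gse_accession}[ACCN]"}
    )

def geo_to_pubmed_url(gds_uid):
    """elink URL: find PubMed records linked to a GEO UID."""
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": "gds", "db": "pubmed", "id": gds_uid}
    )

print(geo_search_url("GSE33779"))
```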
[Microarray Agilent 1-channel] Generate processed data protocol
Generate the protocol from a template (similar to GENERATE_PROTOCOL in the Affymetrix pipeline).
[BulkRNASeq] Workflow fails to launch using approach 3 when gldsAccession is not supplied
Expected Behavior
Launching the workflow using approach 3 (custom runsheet) should not require setting gldsAccession.
Actual Behavior
The workflow checks whether gldsAccession is unset and raises an error when attempting to start.
Workaround
Supply gldsAccession as follows:
nextflow run ... --gldsAccession CustomAnalysis
Note: This will output all results to a directory called CustomAnalysis. The name may be changed if an alternative output directory name is desired.
[BulkRNASeq] V&V program mistypes samplenames as ints when possible
Description
When sample names can be interpreted as numbers instead of strings (e.g., 12), certain V&V checks incorrectly do so, resulting in a failure to match '1' and 1.
Approaches
The sample name column should always be read in as the string datatype.
Implementation Suggested
All runsheet loading should use a standard interface that interprets samplename as string datatype.
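A minimal sketch of such an interface in Python/pandas, assuming the runsheet column is named "Sample Name" (adjust if the actual runsheet differs):

```python
import io
import pandas as pd

def load_runsheet(path_or_buffer):
    """Standard runsheet loader: force the sample name column to string
    so names like "12" never become integers."""
    return pd.read_csv(path_or_buffer, dtype={"Sample Name": str})

# A purely numeric sample name survives as a string
runsheet_csv = io.StringIO("Sample Name,Factor Value\n12,spaceflight\n")
df = load_runsheet(runsheet_csv)
print(df["Sample Name"].tolist())  # ['12'], not [12]
```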
Validation Plan
GLDS-201 triggers the error and can serve as a good test case.
Impact
No impact on prior data, since the error causes a workflow halt rather than a silent mismatch.
[Microarray] Assertion fails in GENERATE_SOFTWARE_TABLE when data files are not compressed
Description
R.utils is only used to unzip data files when they are compressed (e.g., .CEL.gz). The following assertion fails with uncompressed data files (e.g., .CEL) because R.utils is not used:
Solution
Modify AFFYMETRIX_SOFTWARE_DPPD to exclude R.utils when data files are not compressed. The same can be done for AGILENT_SOFTWARE_DPPD in the Agilent pipeline.
[Microarray Affymetrix] Issue loading annotation package in read.celfiles() due to incomplete download
Description
The following error occurred when rendering Affymetrix.qmd:
Error in oligo::read.celfiles(df_local_paths$`Local Paths`, sampleNames = df_local_paths$`Sample Name`) :
The annotation package, pd.mogene.1.0.st.v1, could not be loaded.
read.celfiles() automatically detects and downloads the annotation package. Upon further inspection, pd.mogene.1.0.st.v1 is a relatively large file, and the download did not look complete, which seems to be causing the error.
The same incomplete download occurs when downloading manually using download.file(). Assuming read.celfiles() uses a similar download method, the default value for timeout is 60 seconds.
Solution
Set a global option to increase the timeout in Affymetrix.qmd:
options(timeout=1000)
[BulkRNASeq] Generate CSV of parsed metrics
Parse logs from tools (or MultiQC report) to generate metrics CSV (each row is a sample, each column is a metric)
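As a sketch of the idea, the example below parses STAR's Log.final.out-style "key | value" lines and writes one row per sample. The sample names and the choice of log are illustrative; the actual implementation might parse the MultiQC data files instead:

```python
import csv
import io

def parse_star_log(text):
    """Parse STAR Log.final.out-style 'key | value' lines into a dict."""
    metrics = {}
    for line in text.splitlines():
        if "|" in line:
            key, _, value = line.partition("|")
            metrics[key.strip()] = value.strip()
    return metrics

def metrics_table(samples):
    """samples: dict of sample name -> log text. Returns CSV text with
    one row per sample and one column per metric."""
    parsed = {name: parse_star_log(text) for name, text in samples.items()}
    columns = sorted({k for m in parsed.values() for k in m})
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["sample"] + columns)
    for name, m in parsed.items():
        writer.writerow([name] + [m.get(c, "") for c in columns])
    return out.getvalue()

log = "Number of input reads |\t1000\nUniquely mapped reads % |\t95.00%"
print(metrics_table({"sample_1": log}))
```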
[Microarray] Unintentional renaming of columns causes issues later in selection of columns
Description
The following error occurred when rendering Affymetrix.qmd for one dataset:
Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Problem while computing `Group.Mean_(1G) = rowMeans(dplyr::select(., all_of(current_samples)))`.
Caused by error:
! error in evaluating the argument 'x' in selecting a method for function 'rowMeans': Problem while evaluating `all_of(current_samples)`.
In this particular dataset, some columns were unintentionally renamed because they happen to contain the substring that's being replaced (for other columns), causing this error when trying to select them later on.
Solution
Be more explicit about which columns we want to rename using rename_with() here in Affymetrix.qmd:
df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition'), group_name_mapping = design_data$mapping)
The same can be done here for Agile1CMP.qmd to prevent something similar from happening in the future:
df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition|^Genes\\.'), group_name_mapping = design_data$mapping)
[BulkRNASeq] Handle processing with experimental groups where N = 1
Description
Certain datasets have experimental groups of single samples.
This currently breaks differential expression approaches and likely means differential expression is impossible for such datasets.
Steps to Reproduce
- Create subset of Runsheet rows with at least one group with N = 1 samples
Expected Behavior
Processing workflows should identify these cases and process up to DE but not run DE.
In these cases, normalized data should be available for release.
Actual Behavior
Processing attempts DE and raises exception.
Impact on Data
Non-silent bug: no processed data has been released with this causing an issue.
Known to impact one dataset at the start of this issue.
Possible Solution (optional)
Process through normalization, then stop.
This will require some modification to post-processing, which assumes complete processing.
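The guard could look something like this (a Python sketch; the group column name is illustrative):

```python
import pandas as pd

def de_is_possible(runsheet: pd.DataFrame, group_col: str = "Group") -> bool:
    """Return False if any experimental group has only one sample,
    in which case the workflow should normalize but skip DE."""
    return bool((runsheet[group_col].value_counts() >= 2).all())

# Example: group "KO" has a single sample, so DE should be skipped
runsheet = pd.DataFrame({"Group": ["WT", "WT", "KO"]})
print(de_is_possible(runsheet))  # False: group KO has N = 1
```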