genelab_data_processing's Issues
[Microarray] Issue combining dataframes using bind_rows() when getBM() returns no results
Description
The following error occurred when rendering Affymetrix.qmd for one dataset (the same issue could potentially arise in Agile1CMP.qmd too):
Error in `dplyr::bind_rows()`:
! Can't combine `..1$ensembl_gene_id` <logical> and `..2$ensembl_gene_id` <character>.
The error occurs when getBM() returns no rows: the empty ensembl_gene_id column is then assumed to be of logical type, causing the type incompatibility in bind_rows().
Solution
Check whether chunk_results contains any rows before binding to df_mapping in Affymetrix.qmd / Agile1CMP.qmd:
if (nrow(chunk_results) > 0) {
df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
}
[BulkRNASeq] Update default publish output directory when processing GLDS data
Description
The current default output directory is named after the GLDS ID; however, with the release of OSDR, the default directory should indicate both the OSD and GLDS IDs.
Proposed Solution
New default directory of OSD-###_GLDS-###
ENSEMBL annotations not populating correctly
Hi, I'm running this pipeline on the SJSU HPC and I'm having an issue where ~8.5k of my 29k rows of data have both the symbol and gene name populated as "NA". In a given NA row, there is a valid ENSEMBL ID that, if I look it up on ENSEMBL, leads to a valid gene product with an annotation that simply isn't being populated. I am running OSD-511 using the following script (using the cached files established on the spartan01 HPC by Jonathan Oribello):
NXF_SINGULARITY_CACHEDIR=/home/joribello/test_install/singularity nextflow run NF_RCP-F_1.0.3/main.nf -profile singularity,slurm -resume --gldsAccession GLDS-511 -c /home/joribello/test_install/cos_hpc_nextflow.config -c give_ALIGN_STAR_more_memory.config --runsheetPath /home/carnold/GLDS-511/Metadata/GLDS-511_bulkRNASeq_v1_runsheet.csv
The original runsheet was edited to correct the switched R1 and R2 files for one of the samples (they were entered incorrectly in the downloaded version from GL).
give_ALIGN_STAR_more_memory.config contains the following:
process {
    withName: 'ALIGN_STAR' {
        memory = '45GB'
    }
    withName: 'SORT_INDEX_BAM' {
        memory = '45GB'
    }
    withName: 'COUNT_ALIGNED' {
        maxRetries = 3
        errorStrategy = 'retry'
        memory = { 8.GB + 4.GB * task.attempt }
    }
}
[BulkRNASeq] Add V&V Check: Assert adaptor content removed using Trimmed Reads FastQC MultiQC
[BulkRNASeq] Handle cases where distinct group names resolve to the same R safe name
Description
Factor value strings can collapse into the same R-safe strings.
E.g., "p53+/Fbxw7+/-" and "p53-/Fbxw7-/-" both collapse into "p53..Fbxw7...".
This breaks the DESeq2 R scripts.
Steps to Reproduce
- Run processing on OSD-432
Expected Behavior
The DESeq2 R script safely handles cases where multiple original factor values convert to the same R-safe name.
Actual Behavior
R DESeq2 script raises an exception.
Impact on Data
This is a non-silent edge case, so no released processed data should be impacted.
It blocked processing for one dataset at the start of this ticket.
Possible Solution (optional)
Append a unique integer or letter to each original factor value; this suffix will differentiate all factor values when converting to and using R-safe names.
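The suffix approach can be sketched as follows. This is a Python illustration of the idea only; the pipeline's scripts are in R, and `to_safe_names` merely approximates R's `make.names()` behavior:

```python
import re

def to_safe_names(values):
    """Approximate R's make.names(): replace characters that are not
    alphanumeric or '.' with '.', and prefix names starting with a digit."""
    safe = [re.sub(r"[^A-Za-z0-9.]", ".", v) for v in values]
    return [s if not s[0].isdigit() else "X" + s for s in safe]

def disambiguate(values):
    """Append a unique integer suffix to each factor value so that no
    two values collapse to the same safe name."""
    safe = to_safe_names(values)
    if len(set(safe)) == len(safe):
        return safe  # already unique, no suffix needed
    return [f"{s}.{i}" for i, s in enumerate(safe, start=1)]

groups = ["p53+/Fbxw7+/-", "p53-/Fbxw7-/-"]
print(to_safe_names(groups))   # both collapse to the same string
print(disambiguate(groups))    # suffixes keep them distinct
```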
[BulkRNASeq] Underspecification of memory for STAR_ALIGN
Expected Behavior
Memory specified for STAR_ALIGN should be appropriate.
Actual Behavior
Memory specified is often insufficient when using larger references (e.g., for organisms such as Homo sapiens).
Depending on compute backend, this may either result in inefficient usage of resources or worse, workflow exceptions due to insufficient memory (often presenting as a 137 error from STAR_ALIGN tasks).
Workaround
Specifying RAM appropriate for the reference size (~40 GB for Homo sapiens) can overcome this mis-specification.
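As an illustration, the workaround can be applied via a custom config passed with `-c`. This is a sketch only; it assumes the process is named ALIGN_STAR as in the memory-override config shown earlier on this page, and the name may differ between workflow versions:

```groovy
// Hypothetical override config; pass with `nextflow run ... -c star_memory.config`
process {
    withName: 'ALIGN_STAR' {
        memory = '40GB'   // ~40 GB is adequate for a Homo sapiens reference
    }
}
```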
[Microarray Affymetrix] Processed data protocol
Track updates to generate_protocol.sh for Affymetrix pipeline.
BulkRNASeq workflow should determine adaptor type automatically
Currently the workflow user is expected to replace this value manually in the workflow module file.
Instead, the adapter should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.
DPPD Reference
Workflow Reference
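One possible way to auto-detect the adapter is to scan the first reads of the raw FASTQ for well-known Illumina adapter sequences. A hedged Python sketch; the adapter catalog and the `detect_adapter` helper are illustrative, not part of the workflow:

```python
import gzip

# Standard Illumina adapter prefixes (TruSeq and Nextera)
ADAPTERS = {
    "illumina_truseq": "AGATCGGAAGAGC",
    "nextera": "CTGTCTCTTATACACATCT",
}

def detect_adapter(fastq_path, max_reads=100_000):
    """Count known adapter sequences in the first max_reads reads and
    return the most frequent adapter name, or None if none is seen."""
    counts = {name: 0 for name in ADAPTERS}
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # sequence lines in FASTQ records
                for name, seq in ADAPTERS.items():
                    if seq in line:
                        counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```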
[Microarray] Issue creating cache directories in quarto render command for updated Nextflow version
Description
No issues occurred when using Nextflow v22.10.0.5826, but after updating to Nextflow v23.10.1.5891, the following error arose in the AGILE1CH / PROCESS_AFFYMETRIX processes from the quarto render command:
Could not create TypeScript compiler cache location: "/home/<user>/.cache/deno/gen"
Check the permission of the directory.
According to here, this cache location can be changed using the environment variable DENO_DIR.
That resolves the issue, but a second error follows:
Uncaught Error: Read-only file system (os error 30), mkdir '/home/<user>/.cache/quarto'
Both cache directories default to locations inside the user's home directory. This suggests setting the HOME env var to a temp directory, which would cover the first issue too without the need to separately specify DENO_DIR.
Solution
Set HOME to the working directory inside the AGILE1CH / PROCESS_AFFYMETRIX processes:
export HOME=$PWD;
As a result, both cache directories will appear inside the working directory within the work/ folder:
work/<hash>/
|- .cache/
   |- deno/
   |- quarto/
[Metagenomics Illumina] Some datasets have too many files in the MAGs and bins folders
Some datasets can have over 1,000 files in the MAGs and bins folders of the Assembly-based_processing directories (for example: https://genelab-tools.arc.nasa.gov/jira/browse/GLDATAPROC-694). There's no need to have all of them available individually.
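One option (a sketch with illustrative paths) is to bundle each folder into a single compressed archive, so the directory holds one file instead of 1,000+:

```shell
# Demo setup with a throwaway directory standing in for the real output
mkdir -p Assembly-based_processing/MAGs
touch Assembly-based_processing/MAGs/MAG-1.fasta \
      Assembly-based_processing/MAGs/MAG-2.fasta \
      Assembly-based_processing/MAGs/MAG-3.fasta

# Replace the per-file folder with one archive
tar -czf Assembly-based_processing/MAGs.tar.gz -C Assembly-based_processing MAGs
rm -r Assembly-based_processing/MAGs
```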
Parallelize and optimize Wald test and group statistics
With a large number of groups, this step can be prohibitively slow.
For the Wald test, parallel computation would help.
For group statistics, both parallelization and a switch to vector-based computation would help.
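The vectorization idea can be sketched as follows. This is a Python/NumPy illustration only; the pipeline's actual group statistics are computed in R:

```python
import numpy as np

def group_stats(counts, groups):
    """Vectorized per-group mean and standard deviation.

    counts: (genes x samples) matrix; groups: per-sample group labels.
    Each group's samples are sliced out as a block and reduced with a
    single vectorized call, instead of looping over genes.
    """
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    means, stdevs = {}, {}
    for g in np.unique(groups):
        cols = counts[:, groups == g]      # all samples in group g at once
        means[g] = cols.mean(axis=1)
        stdevs[g] = cols.std(axis=1, ddof=1)
    return means, stdevs

counts = [[10, 12, 100, 110],
          [5, 7, 50, 52]]
groups = ["A", "A", "B", "B"]
means, stdevs = group_stats(counts, groups)
print(means["A"])  # per-gene means for group A
```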
[Methyl-Seq] Pipeline specifies incorrect input for deduplication step.
The pipeline specifies a coordinate-sorted BAM as input for deduplicate_bismark (see GL-DPPD-7113 Step 6), but the deduplicate_bismark tool requires read-name-sorted input files (see the Bismark documentation).
[BulkRNASeq] Remove generation of visualization extended DGE table
Description
This file has been retired as the relevant columns are now generated on the fly where used by GeneLab visualization.
Proposed Solution
Remove generating code.
[Metagenomics Illumina] explicitly setting humann3 reference db locations in rule
- an issue can arise if the user is pointing to previously installed humann3 reference databases, or if the locations were manually changed for testing at some point
- this linked commit on the metagenomics branch integrated always explicitly setting the locations based on what is in the config.yaml: 1405949
- changes are in the humann3_PE and humann3_SE rules
- in case it's helpful, I'm also attaching a modified Snakefile that has the changes in it (as a zip due to file-type restrictions): Snakefile.zip
Add reference table support for "Euglena gracilis"
Note: This organism doesn't appear to be available on Ensembl based on my searches, which may block this request.
Add reference table support for "Pseudomonas aeruginosa"
Transfer from dev issue 149
- Annotation Reference Table Support
Files Not available from source
When I run this code (link), there are some errors: the JSON files don't appear to be available.
Thanks in advance
[BulkRNASeq] Handling Technical Replicates
Description
Workflow should handle technical replicates appropriately.
Approaches
DESeq2 provides a collapseReplicates function that sums counts based on a factor to group samples by.
The rationale has two major points:
- Summing opposed to averaging is appropriate for maintaining expected Poisson distribution
- DESeq2 is designed to normalize for library size differences. Summing technical replicates is akin to having a higher sequencing depth for a sample.
Implementation Suggested
Encode Technical Replicate Groups in the Runsheet
Encode technical replicates as a column in the runsheet simply using integers for each technical replicate group.
Eventually, this technical replicate column should be automatically derived from ISA archive metadata; however, in the meantime, a workflow user should be able to supply a two column csv mapping sample name to technical replicate group which will be incorporated into the runsheet.
Use the Technical Replicate Group Column in the Runsheet for DESeq2 collapseReplicates
https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html
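The collapse step itself amounts to summing counts across samples that share a technical replicate group. A minimal Python/pandas sketch of that operation (the workflow would use DESeq2's collapseReplicates in R; sample and gene names here are illustrative):

```python
import pandas as pd

# Toy counts matrix: genes x samples
counts = pd.DataFrame(
    {"S1_rep1": [10, 0, 3], "S1_rep2": [12, 1, 2], "S2": [7, 5, 9]},
    index=["gene1", "gene2", "gene3"],
)

# Two-column mapping as proposed above: sample name -> tech replicate group
tech_rep_group = {"S1_rep1": 1, "S1_rep2": 1, "S2": 2}

# Sum counts within each technical replicate group (columns collapse)
collapsed = counts.T.groupby(tech_rep_group).sum().T
print(collapsed)
# group 1 holds the summed S1 replicates; group 2 is S2 unchanged
```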
Validation Plan
- Validate that the approach produces reasonable results as follows:
Run the following approaches
- NF_RCP-F_1.0.3 (i.e. no technical replicate handling)
- collapseReplicates (summed tech. replicates)
- median replicates
- mean replicates
- filter to first replicate only (drop others)
Assessment Metrics:
- DGE results
- Regression Test Criteria
- Core tests should run without change in outcomes (since core tests don't include any technical replicates)
[BulkRNASeq] Combine MultiQC reports
MultiQC reports are currently generated separately for each tool. Perhaps combine into one or two reports (e.g., FASTQC/Trim Galore! into one report where metrics are reported at read-level, STAR/RSeQC/RSEM in another where metrics are reported at sample-level)
from genelab study id to pubmed
Hi,
how can I map a GeneLab study ID to a PubMed ID?
For example, for study ID OSD-70 I would like to find the related PubMed ID. From the filenames I can see there is GSE33779, and I can then go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33779, check the citation, and see PMID 23806134.
Is there any smarter way of mapping?
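One programmatic route (a sketch, not an official GeneLab mapping) is NCBI E-utilities: esearch resolves the GSE accession to a GEO UID, and elink follows the GEO-to-PubMed link. The helpers below only construct the query URLs; fetching and XML parsing are left out:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def geo_search_url(gse_accession):
    """esearch URL: look up the GEO (db=gds) UID for a GSE accession."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "gds", "term": f"{gse_accession}[ACCN]"}
    )

def geo_to_pubmed_url(gds_uid):
    """elink URL: find PubMed records linked to a GEO UID."""
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": "gds", "db": "pubmed", "id": gds_uid}
    )

print(geo_search_url("GSE33779"))
```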
[Microarray Agilent 1-channel] Generate processed data protocol
Generate the protocol from a template (similar to GENERATE_PROTOCOL in the Affymetrix pipeline).
[BulkRNASeq] Workflow fails to launch using approach 3 when gldsAccession is not supplied
Expected Behavior
Launching the workflow using approach 3 (custom runsheet) should not require setting gldsAccession.
Actual Behavior
The workflow checks whether gldsAccession is unset and raises an error when attempting to start.
Workaround
Supply gldsAccession as follows:
nextflow run ... --gldsAccession CustomAnalysis
Note: This will output all results to a directory called CustomAnalysis. The name may be changed if an alternative output directory name is desired.
[BulkRNASeq] V&V program mistypes samplenames as ints when possible
Description
When sample names can be interpreted as numbers instead of strings (e.g., 12), certain V&V checks incorrectly do so, resulting in a failure to match '1' and 1.
Approaches
The sample name column should always be read in as the string datatype.
Implementation Suggested
All runsheet loading should use a standard interface that interprets samplename as string datatype.
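A minimal sketch of such an interface in Python/pandas, assuming the runsheet column is named "Sample Name" (adjust if the actual runsheet differs):

```python
import io
import pandas as pd

def load_runsheet(path_or_buffer):
    """Standard runsheet loader: force the sample name column to string
    so names like "12" never become integers."""
    return pd.read_csv(path_or_buffer, dtype={"Sample Name": str})

# A purely numeric sample name survives as a string
runsheet_csv = io.StringIO("Sample Name,Factor Value\n12,spaceflight\n")
df = load_runsheet(runsheet_csv)
print(df["Sample Name"].tolist())  # ['12'], not [12]
```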
Validation Plan
GLDS-201 triggers the error and can serve as a good test case.
Impact
No impact on prior data, since the error causes a workflow halt rather than a silent mismatch.
[Microarray] Assertion fails in GENERATE_SOFTWARE_TABLE when data files are not compressed
Description
R.utils is only used to unzip data files when they are compressed (e.g., .CEL.gz). The following assertion fails with uncompressed data files (e.g., .CEL) because R.utils is not used:
Solution
Modify AFFYMETRIX_SOFTWARE_DPPD to exclude R.utils when data files are not compressed. The same can be done for AGILENT_SOFTWARE_DPPD in the Agilent pipeline.
[Microarray Affymetrix] Issue loading annotation package in read.celfiles() due to incomplete download
Description
The following error occurred when rendering Affymetrix.qmd:
Error in oligo::read.celfiles(df_local_paths$`Local Paths`, sampleNames = df_local_paths$`Sample Name`) :
The annotation package, pd.mogene.1.0.st.v1, could not be loaded.
read.celfiles() automatically detects and downloads the annotation package. Upon further inspection, pd.mogene.1.0.st.v1 is a relatively large file, and the download did not look complete, which seems to be causing the error.
The same incomplete download occurs when downloading manually using download.file(). Assuming read.celfiles() uses a similar download method, the default value for timeout is 60 seconds.
Solution
Set a global option to increase the timeout in Affymetrix.qmd:
options(timeout=1000)
[BulkRNASeq] Generate CSV of parsed metrics
Parse logs from tools (or MultiQC report) to generate metrics CSV (each row is a sample, each column is a metric)
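As a sketch of the idea, the example below parses STAR's Log.final.out-style "key | value" lines and writes one row per sample. The sample names and the choice of log are illustrative; the actual implementation might parse the MultiQC data files instead:

```python
import csv
import io

def parse_star_log(text):
    """Parse STAR Log.final.out-style 'key | value' lines into a dict."""
    metrics = {}
    for line in text.splitlines():
        if "|" in line:
            key, _, value = line.partition("|")
            metrics[key.strip()] = value.strip()
    return metrics

def metrics_table(samples):
    """samples: dict of sample name -> log text. Returns CSV text with
    one row per sample and one column per metric."""
    parsed = {name: parse_star_log(text) for name, text in samples.items()}
    columns = sorted({k for m in parsed.values() for k in m})
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["sample"] + columns)
    for name, m in parsed.items():
        writer.writerow([name] + [m.get(c, "") for c in columns])
    return out.getvalue()

log = "Number of input reads |\t1000\nUniquely mapped reads % |\t95.00%"
print(metrics_table({"sample_1": log}))
```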
[Microarray] Unintentional renaming of columns causes issues later in selection of columns
Description
The following error occurred when rendering Affymetrix.qmd for one dataset:
Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Problem while computing `Group.Mean_(1G) = rowMeans(dplyr::select(., all_of(current_samples)))`.
Caused by error:
! error in evaluating the argument 'x' in selecting a method for function 'rowMeans': Problem while evaluating `all_of(current_samples)`.
In this particular dataset, some columns were unintentionally renamed because they happen to contain the substring that's being replaced (for other columns), causing this error when trying to select them later on.
Solution
Be more explicit about which columns we want to rename using rename_with() here in Affymetrix.qmd:
df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition'), group_name_mapping = design_data$mapping)
The same can be done here for Agile1CMP.qmd to prevent something similar from happening in the future:
df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition|^Genes\\.'), group_name_mapping = design_data$mapping)
[BulkRNASeq] Handle processing with experimental groups where N = 1
Description
Certain datasets have experimental groups of single samples.
This currently breaks differential expression approaches and likely means differential expression is impossible for such datasets.
Steps to Reproduce
- Create subset of Runsheet rows with at least one group with N = 1 samples
Expected Behavior
Processing workflows should identify these cases and process up to DE but not run DE.
In these cases, normalized data should be available for release.
Actual Behavior
Processing attempts DE and raises exception.
Impact on Data
Non-silent bug: no processed data has been released with this causing an issue.
Known to impact one dataset at the start of this issue.
Possible Solution (optional)
Process through normalization, then stop.
This will require some modification to post-processing, which assumes complete processing.
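The guard could look something like this (a Python sketch; the group column name is illustrative):

```python
import pandas as pd

def de_is_possible(runsheet: pd.DataFrame, group_col: str = "Group") -> bool:
    """Return False if any experimental group has only one sample,
    in which case the workflow should normalize but skip DE."""
    return bool((runsheet[group_col].value_counts() >= 2).all())

# Example: group "KO" has a single sample, so DE should be skipped
runsheet = pd.DataFrame({"Group": ["WT", "WT", "KO"]})
print(de_is_possible(runsheet))  # False: group KO has N = 1
```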