genelab_data_processing's People

Contributors

asaravia-butler, astrobiomike, bnovak32, cyouh95, j-81, torres-alexis

genelab_data_processing's Issues

[Microarray] Issue combining dataframes using bind_rows() when getBM() returns no results

Description

The following error occurred when rendering Affymetrix.qmd for one dataset (the same issue could potentially arise in Agile1CMP.qmd too):

Error in `dplyr::bind_rows()`:
  ! Can't combine `..1$ensembl_gene_id` <logical> and `..2$ensembl_gene_id` <character>.

The error occurs when getBM() returns no rows. The ensembl_gene_id column of the empty result is then assumed to be of logical type, which is incompatible with the character column from other chunks when passed to bind_rows().

df_mapping <- data.frame()
for (i in seq_along(probe_id_chunks)) {
    probe_id_chunk <- probe_id_chunks[[i]]
    print(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})"))
    message(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) # NON_DPPD
    chunk_results <- biomaRt::getBM(
        attributes = c(
            expected_attribute_name,
            "ensembl_gene_id"
        ),
        filters = expected_attribute_name,
        values = probe_id_chunk,
        mart = ensembl)
    df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
    Sys.sleep(10) # Slight break between requests to prevent back-to-back requests
}

Solution

Check if chunk_results contains any rows before binding to df_mapping in Affymetrix.qmd / Agile1CMP.qmd:

if (nrow(chunk_results) > 0) {
    df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
}

ENSEMBL annotations not populating correctly

Hi, I'm running this pipeline from the SJSU HPC and I'm having an issue where ~8.5k out of my 29k rows of data are populating with both the symbol and genename as "NA". In a given NA row, there is a valid ENSEMBL ID that, if I look it up on ENSEMBL, leads to a valid gene product with an annotation that looks as though it is just not being populated correctly. I am running OSD-511 using the following script (using the cached files established on the spartan01 HPC by Jonathan Oribello):

NXF_SINGULARITY_CACHEDIR=/home/joribello/test_install/singularity nextflow run NF_RCP-F_1.0.3/main.nf -profile singularity,slurm -resume --gldsAccession GLDS-511 -c /home/joribello/test_install/cos_hpc_nextflow.config -c give_ALIGN_STAR_more_memory.config --runsheetPath /home/carnold/GLDS-511/Metadata/GLDS-511_bulkRNASeq_v1_runsheet.csv

The original runsheet was edited to correct the switched R1 and R2 files for one of the samples (they were entered incorrectly in the downloaded version from GL).
ALIGN_STAR_more_memory.config goes as follows:
process {
    withName:'ALIGN_STAR' {
        memory='45GB'
    }
    withName:'SORT_INDEX_BAM' {
        memory='45GB'
    }
    withName: "COUNT_ALIGNED" {
        maxRetries = 3
        errorStrategy = 'retry'
        memory = { 8.GB + 4.GB * task.attempt }
    }
}

[BulkRNASeq] Handle cases where distinct group names resolve to the same R safe name

Description

Distinct factor value strings can collapse into the same R-safe string.
E.g., "p53+/Fbxw7+/-" and "p53-/Fbxw7-/-" both collapse into "p53..Fbxw7...".
This breaks the DESeq2 R scripts.

Steps to Reproduce

  1. Run processing on OSD-432

Expected Behavior

The R DESeq2 script safely handles cases where multiple original factor values convert to the same safe name.

Actual Behavior

R DESeq2 script raises an exception.

Impact on Data

Non-silent edge case, so no released processed data should be impacted.

Blocks processing for 1 dataset at start of this ticket.

Possible Solution (optional)

Append an integer or letter to each unique original factor value. This character will keep all factor values distinct when they are converted to, and used as, R-safe names.
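
The idea can be illustrated outside of R with a minimal Python sketch. Here r_safe() is only a rough, hypothetical approximation of R's make.names() (it ignores reserved words and locale-specific letters), and unique_safe_names() is an illustrative helper, not code from the workflow. Every distinct original value gets its own integer suffix, so two values can never share a safe name.

```python
import re


def r_safe(name: str) -> str:
    """Rough approximation of R's make.names(): characters other than
    letters, digits, dots and underscores become dots."""
    safe = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # R prepends "X" unless the name starts with a letter,
    # or with a dot that is not followed by a digit.
    if not re.match(r"[A-Za-z]|\.(?!\d)", safe):
        safe = "X" + safe
    return safe


def unique_safe_names(factor_values):
    """Map each distinct original value to a safe name with a unique integer suffix."""
    distinct = list(dict.fromkeys(factor_values))  # de-duplicate, keep first-seen order
    return {
        value: f"{r_safe(value)}_{index}"
        for index, value in enumerate(distinct, start=1)
    }


mapping = unique_safe_names(["p53+/Fbxw7+/-", "p53-/Fbxw7-/-", "WT"])
for original, safe in mapping.items():
    print(f"{original!r} -> {safe!r}")
```

Keeping the mapping from original to safe name also makes it possible to restore the original group names in output tables.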

[BulkRNASeq] Underspecification of memory for STAR_ALIGN

Expected Behavior

Memory specified for STAR_ALIGN should be appropriate.

Actual Behavior

The memory specified is often insufficient when using larger references (e.g. from certain organisms such as Homo sapiens).
Depending on the compute backend, this may result in inefficient usage of resources or, worse, workflow exceptions due to insufficient memory (often presenting as a 137 error from STAR_ALIGN tasks).
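
For readers unfamiliar with the 137 status: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), and 137 - 128 = 9 is SIGKILL, which is what an out-of-memory kill typically sends. A small sketch, where describe_exit_code() is a hypothetical helper and not part of the workflow:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Statuses above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a signal number known on this platform
            return f"exited with status {code} (unknown signal {code - 128})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL alone does not prove a memory problem, so the scheduler or kernel logs are still worth checking.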

Workaround

Specifying an amount of RAM appropriate for the reference size (~40 GB for Homo sapiens) works around this under-specification.

BulkRNASeq workflow should determine adaptor type automatically

Currently, the workflow user is expected to replace this value manually in the workflow module file.
Instead, the adaptor should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.

DPPD Reference

--illumina \ # if adapters are not illumina, replace with adapters used

Workflow Reference

trim_galore --gzip \
--cores $task.cpus \
--illumina \
--phred33 \
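
One possible way to pick the flag automatically is sketched below. Note that this swaps in a different technique from the one suggested above: instead of parsing FastQC/MultiQC reports, it counts how many reads in a sample contain the start of each common adapter sequence. The function guess_adapter_flag() and the toy reads are hypothetical; the three sequences are the commonly published Illumina universal, Nextera, and small RNA adapter prefixes, and the returned names correspond to the trim_galore options --illumina, --nextera and --small_rna. Trim Galore's own documentation describes a similar auto-detection when no adapter option is given, which may make an explicit flag unnecessary.

```python
# Adapter prefixes keyed by the matching trim_galore option name
ADAPTERS = {
    "illumina": "AGATCGGAAGAGC",
    "nextera": "CTGTCTCTTATA",
    "small_rna": "TGGAATTCTCGG",
}


def guess_adapter_flag(reads):
    """Return (best_flag_or_None, counts), counting reads that contain each adapter."""
    counts = {name: 0 for name in ADAPTERS}
    for read in reads:
        for name, sequence in ADAPTERS.items():
            if sequence in read:
                counts[name] += 1
    best = max(counts, key=counts.get)
    # Report no adapter at all if nothing matched
    return (best if counts[best] > 0 else None), counts


# Toy reads for illustration only
toy_reads = [
    "ACGTAGATCGGAAGAGCACGT",
    "TTTTAGATCGGAAGAGCTT",
    "GGGGCTGTCTCTTATAGG",
    "ACGTACGT",
]
flag, counts = guess_adapter_flag(toy_reads)
print(flag, counts)
```

In practice only a subset of reads would be scanned, and a tie or a very low match count should probably fall back to manual specification.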

[Microarray] Issue creating cache directories in quarto render command for updated Nextflow version

Description

No issues occurred with Nextflow v.22.10.0.5826, but after updating to Nextflow v.23.10.1.5891, the following error arose in the AGILE1CH / PROCESS_AFFYMETRIX processes from the quarto render command:

Could not create TypeScript compiler cache location: "/home/<user>/.cache/deno/gen"
  Check the permission of the directory.

The Deno cache location can be changed using the DENO_DIR environment variable.

That resolves this issue, but a second error follows:

Uncaught Error: Read-only file system (os error 30), mkdir '/home/<user>/.cache/quarto'

Both cache directories appear to default to locations inside the user's home directory, which suggests setting the HOME environment variable to a temporary directory. This would also cover the first issue without needing to set DENO_DIR separately.

Solution

Set HOME to be the working directory inside AGILE1CH / PROCESS_AFFYMETRIX processes:

export HOME=$PWD;

As a result, both cache directories appear inside the task's working directory within the work/ folder:

work/<hash>/
|- .cache/
  |- deno
  |- quarto

[Metagenomics Illumina] explicitly setting humann3 reference db locations in rule

  • an issue can arise if the user is pointing to previously installed humann3 reference databases, or if the locations were manually changed for testing at some point
  • this linked commit on the metagenomics branch integrated always explicitly setting the locations based on what is in the config.yaml: 1405949
  • changes are in the humann3_PE and humann3_SE rules
  • in case it's helpful, i'm also attaching a modified Snakefile that has the changes in it (as a zip due to file-type restrictions)
    Snakefile.zip

[BulkRNASeq] Handling Technical Replicates

Description

Workflow should handle technical replicates appropriately.

Approaches

DESeq2 provides a collapseReplicates function that sums counts based on a factor to group samples by.
The rationale has two major points:

  1. Summing, as opposed to averaging, is appropriate for maintaining the expected Poisson distribution.
  2. DESeq2 is designed to normalize for library size differences. Summing technical replicates is akin to having a higher sequencing depth for a sample.

Implementation Suggested

Encode Technical Replicate Groups in the Runsheet

Encode technical replicates as a column in the runsheet simply using integers for each technical replicate group.
Eventually, this technical replicate column should be automatically derived from ISA archive metadata; however, in the meantime, a workflow user should be able to supply a two column csv mapping sample name to technical replicate group which will be incorporated into the runsheet.

Use Technical Replicate Groups Column in Runsheet for DESeq2 collapseReplicates

https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html
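
To make the summing step concrete, here is a minimal pure-Python illustration of collapsing counts by technical replicate group. It is not DESeq2's implementation. The data layout (nested dictionaries), the sample names, and the helper collapse_replicates() are all assumptions made for this example.

```python
from collections import defaultdict


def collapse_replicates(counts, replicate_group):
    """Sum per-gene counts over samples that share a technical replicate group.

    counts:          {sample_name: {gene: raw_count}}
    replicate_group: {sample_name: group_id}, e.g. taken from a runsheet column
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for sample, gene_counts in counts.items():
        group = replicate_group[sample]
        for gene, count in gene_counts.items():
            collapsed[group][gene] += count
    return {group: dict(genes) for group, genes in collapsed.items()}


# Hypothetical counts: S1 was sequenced twice, S2 once
counts = {
    "S1_rep1": {"geneA": 10, "geneB": 0},
    "S1_rep2": {"geneA": 12, "geneB": 3},
    "S2": {"geneA": 5, "geneB": 7},
}
replicate_group = {"S1_rep1": 1, "S1_rep2": 1, "S2": 2}

collapsed = collapse_replicates(counts, replicate_group)
print(collapsed)
```

The collapsed column for group 1 behaves like a single, more deeply sequenced library, which is the situation library size normalization is meant to handle.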

Validation Plan

  1. Validate that the approach gives reasonable results as follows:

     Run the following approaches:

       • NF_RCP-F_1.0.3 (i.e. no technical replicate handling)
       • collapseReplicates (summed tech. replicates)
       • median replicates
       • mean replicates
       • filter to first replicate only (drop others)

     Assessment Metrics:

       • DGE results

  2. Regression Test Criteria

       • Core tests should run without change in outcomes (since core tests don't include any technical replicates)

[BulkRNASeq] Combine MultiQC reports

MultiQC reports are currently generated separately for each tool. Perhaps combine into one or two reports (e.g., FASTQC/Trim Galore! into one report where metrics are reported at read-level, STAR/RSeQC/RSEM in another where metrics are reported at sample-level)

[BulkRNASeq] Workflow fails to launch using approach 3 when gldsAccession is not supplied

Expected behavior

Workflow launch using approach 3 (custom runsheet) should not require setting gldsAccession

Actual behavior

Workflow checks if gldsAccession is unset and raises an error on attempting to start.

Workaround

Supplying gldsAccession as follows:

nextflow run ... --gldsAccession CustomAnalysis

Note: This will output all results to a directory called CustomAnalysis. The name may be changed if an alternative name for the output directory is desired.

[BulkRNASeq] V&V program mistypes samplenames as ints when possible

Description

When sample names can be interpreted as numbers rather than strings (e.g. 12), certain V&V checks incorrectly do so, resulting in a failure to match, for example, '1' and 1.

Approaches

The samplename column should be read in as string datatype.

Implementation Suggested

All runsheet loading should use a standard interface that interprets samplename as string datatype.
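
A minimal sketch of such a loading interface, using only the Python standard library. The function load_runsheet(), the "Sample Name" column header, and the inline CSV are assumptions for illustration, since the actual V&V code is not shown here. csv.DictReader performs no type inference, so every value, including a sample name such as 1, stays a string.

```python
import csv
import io


def load_runsheet(handle, sample_column="Sample Name"):
    """Read a runsheet CSV and index rows by sample name, always as a string."""
    rows = list(csv.DictReader(handle))
    return {row[sample_column]: row for row in rows}


# Hypothetical runsheet whose sample names look like integers
runsheet_csv = io.StringIO(
    "Sample Name,Factor Value[Spaceflight]\n"
    "1,Ground Control\n"
    "2,Space Flight\n"
)
runsheet = load_runsheet(runsheet_csv)

print(list(runsheet))    # sample names as strings
print("1" in runsheet)   # string lookup succeeds
print(1 in runsheet)     # integer lookup does not
```

If the V&V code loads runsheets with pandas (an assumption), passing dtype=str for the sample name column to read_csv would serve the same purpose.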

Validation Plan

GLDS-201 triggers the error and can serve as a good test case.

Impact

No impact on prior data since error causes a false workflow halt.

[Microarray] Assertion fails in GENERATE_SOFTWARE_TABLE when data files are not compressed

Description

R.utils is only used when data files are compressed (e.g., .CEL.gz) to unzip them. The following assertion fails with uncompressed data files (e.g., .CEL) because R.utils is not used:

assert len(AFFYMETRIX_SOFTWARE_DPPD) == len(df), f"Not all software accounted for! Missing: {set(AFFYMETRIX_SOFTWARE_DPPD) - set(df['name'].str.lower())}"

Solution

Modify AFFYMETRIX_SOFTWARE_DPPD to exclude R.utils if the data files are not compressed. The same can be done for AGILENT_SOFTWARE_DPPD in the Agilent pipeline.
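
A minimal sketch of this adjustment, assuming the expected-software constant is a list of package names and that compression can be recognized from a .gz suffix. The helper expected_software() and the example list contents are hypothetical, since the real contents of AFFYMETRIX_SOFTWARE_DPPD are not shown in this issue.

```python
def expected_software(base_software, data_files):
    """Return the software expected in the table, dropping R.utils
    when none of the data files are gzip-compressed."""
    compressed = any(str(path).lower().endswith(".gz") for path in data_files)
    return [
        name for name in base_software
        if compressed or name.lower() != "r.utils"
    ]


# Hypothetical stand-in for AFFYMETRIX_SOFTWARE_DPPD
base_software = ["oligo", "R.utils", "limma"]

print(expected_software(base_software, ["sample1.CEL", "sample2.CEL"]))
print(expected_software(base_software, ["sample1.CEL.gz"]))
```

The assertion would then compare len(df) against the length of the filtered list rather than the full constant.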

[Microarray Affymetrix] Issue loading annotation package in read.celfiles() due to incomplete download

Description

The following error occurred when rendering Affymetrix.qmd:

Error in oligo::read.celfiles(df_local_paths$`Local Paths`, sampleNames = df_local_paths$`Sample Name`) : 
    The annotation package, pd.mogene.1.0.st.v1, could not be loaded.

read.celfiles() automatically detects and downloads the annotation package. Upon further inspection, pd.mogene.1.0.st.v1 is a relatively large file, and the download did not look complete, which seems to be causing the error.

The same incomplete download occurs when downloading manually with download.file(), for which the default timeout is 60 seconds. Presumably read.celfiles() uses a similar download method and is subject to the same timeout.

Solution

Set global option to increase timeout in Affymetrix.qmd:

options(timeout=1000)

[Microarray] Unintentional renaming of columns causes issues later in selection of columns

Description

The following error occurred when rendering Affymetrix.qmd for one dataset:

Error in (function (cond)  : 
    error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Problem while computing `Group.Mean_(1G) = rowMeans(dplyr::select(.,
  all_of(current_samples)))`.
  Caused by error:
  ! error in evaluating the argument 'x' in selecting a method for function 'rowMeans': Problem while evaluating `all_of(current_samples)`.

In this particular dataset, some columns were unintentionally renamed because they happen to contain the substring that's being replaced (for other columns), causing this error when trying to select them later on.

Solution

Be more explicit about which columns we want to rename using rename_with() here in Affymetrix.qmd:

df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition'), group_name_mapping = design_data$mapping)

The same can be done here for Agile1CMP.qmd to prevent something similar from happening in the future:

df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition|^Genes\\.'), group_name_mapping = design_data$mapping)

[BulkRNASeq] Handle processing with experimental groups where N = 1

Description

Certain datasets have experimental groups of single samples.
This currently breaks differential expression approaches and likely means differential expression will be impossible.

Steps to Reproduce

  1. Create subset of Runsheet rows with at least one group with N = 1 samples

Expected Behavior

Processing workflows should identify these cases and process up to DE but not run DE.
In these cases, normalized data should be available for release.

Actual Behavior

Processing attempts DE and raises exception.

Impact on Data

Non-silent bug: no processed data has been released with this issue.
Known to impact 1 dataset at start of this issue.

Possible Solution (optional)

Process through normalization, then stop.
This will require some modification for post processing which assumes complete processing.
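
Detecting the condition before attempting DE could look roughly like the following sketch. The sample-to-group mapping would come from the runsheet; the helper names and example data are hypothetical.

```python
from collections import Counter


def groups_with_single_sample(sample_to_group):
    """Return the experimental groups that contain exactly one sample."""
    sizes = Counter(sample_to_group.values())
    return sorted(group for group, size in sizes.items() if size == 1)


def can_run_de(sample_to_group):
    """DE is only attempted when every group has at least two samples."""
    return not groups_with_single_sample(sample_to_group)


# Hypothetical runsheet-derived mapping
sample_to_group = {
    "S1": "Ground Control",
    "S2": "Ground Control",
    "S3": "Space Flight",
}

print(groups_with_single_sample(sample_to_group))
print(can_run_de(sample_to_group))
```

A workflow could use such a check to skip the DE step while still publishing normalized counts.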
