
swgs-absolutecn's People

Contributors: phil9s
Stargazers: 2
Watchers: 5
Forkers: jbreynier

swgs-absolutecn's Issues

May wish to disambiguate combine function to avoid confusion with deprecated dplyr function

rds.list <- lapply(rds.filename, FUN = function(x) readRDS(x))

collapse_rds <- function(rds.list){
  comb <- rds.list[[1]][[1]]
  if(length(rds.list) > 1){
    for(i in 2:length(rds.list)){
      add <- rds.list[[i]][[1]]
      comb <- Biobase::combine(comb, add) # namespaced to avoid the deprecated dplyr::combine
    }
  } # close the if here so that a list of length 1 still returns its single object
  rds.obj <- comb
  return(rds.obj)
}

Indexing issue when invoking base::match in qdnaseq_rel_to_abs.R

Hello,

This is just a quick thing that I noticed about indexing in one of the scripts. The problem arises when you try to cross-reference sample IDs between the RDS object and the metadata QC table in order to enter ploidy, purity and other information into the RDS object.

This is done using base::match; if we think of the arguments as base::match(x,y), the function returns, for each item of x, the position of its first match in y.
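
As a quick illustration of the argument order (a toy example, not taken from the pipeline):

match(c("b","c"), c("c","b","a"))
#> [1] 2 1  # "b" sits at position 2 of the table, "c" at position 1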

The problem is that the current version of the script uses the RDS indices to index the metadata sheet, whereas it should be using indices from the metadata sheet instead:

https://github.com/Phil9S/sWGS-absoluteCN/blob/1f1a155b4ae7c3bc3306565e2241b9703938d358/scripts/qdnaseq_rel_to_abs.R#L53-L56

If we just change the order of the arguments in the call to the function, then that should fix the problem:

pData(rds.rel)$purity <- samples$purity[match(pData(rds.rel)$name,samples$SAMPLE_ID)]

If the user follows the protocol exactly as planned, I don't think this should be a problem, as the RDS object and the metadata datasheet should be in the same order anyway. However, there may be issues if the user edits the QC sheet or the RDS file manually, in which case samples will be assigned incorrect ploidy and purity values before converting from relative to absolute copy numbers.

Empty BAM files generate an error in the gridsearch_filter rule

The error is generated within the collapse_rds function of the scripts/gridsearch_results_filtering.R script.

When a BAM file is empty, the phenoData attribute of the corresponding RDS object contains columns consisting entirely of NA values, which causes R to infer those column types as logical.

This causes an error when you try to use BiocGenerics::combine, as the data types are inconsistent between the two data objects being combined:

https://github.com/Phil9S/sWGS-absoluteCN/blob/e18de14656a042d5895e8baaa55a9fcf64b4a050/scripts/gridsearch_results_filtering.R#L27

I guess a short-term fix here would be to explicitly set column types for the relevant objects, but perhaps a longer-term solution would be to validate the BAM files before they are processed, to ensure that they are not empty?
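
For the short-term option, something along these lines might work (a minimal sketch; the column names here are placeholders, not necessarily the pipeline's actual phenoData fields):

harmonise_pheno_types <- function(obj){
  pd <- Biobase::pData(obj)
  numeric_cols <- c("used.reads", "expected.variance") # hypothetical column names
  for(col in intersect(numeric_cols, colnames(pd))){
    pd[[col]] <- as.numeric(pd[[col]]) # all-NA logical columns become numeric NAs
  }
  Biobase::pData(obj) <- pd
  obj
}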

Missing Slurm configuration file.

Hi Phil,

Apologies, it has been a while since I last used this repo, so I haven't posted anything here recently.

This is a pretty straightforward issue.

The file profile/slurm/cluster_config.yaml is missing from the repo.

Cheers,
Thomas

Error using 30 kb bin

I am getting an error after running the pipeline with 30 kb bins, in the 'gridsearch_filter' rule. Since I don't know the TP53 allele frequencies, I have set those values to 'NA'. However, although the {project}fit_QC_predownsampling.tsv and filtered.tsv files contain the correct fits, I am not getting any plots.

Maybe the problem is that I do not have enough reads for 30 kb bins? Using 100 kb bins with the same inputs works correctly.

Thanks.
[attached screenshot: error]

Paired-end data

Hi Phil,

I was just wondering if the pipeline needs to be specifically configured to deal with paired-end sequencing data?

For example, when the binReadCounts function is called:

readCounts <- mclapply(X=bam_list, FUN=binReadCounts, bins=bins, mc.cores=ncores)

I think this function has a few options, where you can specify whether the data is paired-end or not (cf. https://rdrr.io/bioc/QDNAseq/man/binReadCounts.html)
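
For example, one option for paired-end data (a sketch only; isPaired and isFirstMateRead are Rsamtools-style flag filters listed on the man page above, so worth verifying against your installed QDNAseq version) would be to count only first mates so that each fragment is counted once:

readCounts <- mclapply(X = bam_list, FUN = binReadCounts, bins = bins,
                       isPaired = TRUE, isFirstMateRead = TRUE,
                       mc.cores = ncores)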

I am not sure how exactly QDNAseq uses this information to change/alter downstream processing, however.

EDIT: Paired-end reads will give misleading counts when they span genomic bin junctions; however, I am not sure whether QDNAseq corrects for this internally.

I think that for such paired-end reads (i.e. the ones spanning bin junctions) the counts deriving from these reads would need to be doubled.

Thanks

Edge case error with get_gene_bin function

Self-reported issue - fixed in an alternative implementation

Error in the get_gene_seg() function, part of the qdnaseq_rel_to_abs.R script. This is an edge case in which genes falling outside the available bins are assigned to the closest bin present, resulting in an unrelated copy number value being assigned to a gene region. Across testing this affected only 36 of 18,338 gene loci. Additionally, genes where either the start or end value fails to match a bin will also select the nearest bin, which affects 10 of the 18,338 gene loci.

get_gene_seg <- function(target=NULL,abs_data=NULL){
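  # target is a "chr:start-end" string; relies on stringr::str_split and Biobase::fData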
  to_use <- fData(abs_data)$use
  cn_obj <- abs_data[to_use,]
  bin_pos <- fData(cn_obj)[,c("chromosome","start","end")]
  chr <- as.numeric(str_split(string = target,pattern = ":",simplify = T)[1])
  start <- as.numeric(str_split(string = target,pattern = ":|-",simplify = T)[2])
  end <- as.numeric(str_split(string = target,pattern = ":|-",simplify = T)[3])

  gene_chr_pos <- bin_pos[bin_pos$chromosome == chr,]
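  # The nearest-bin matching below is the reported edge case: a gene with no
  # overlapping bin is silently assigned the closest bin on the chromosome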
  min_start <- min(which(min(abs(gene_chr_pos$start - start)) == abs(gene_chr_pos$start - start)))
  min_end <- max(which(min(abs(gene_chr_pos$end - end)) == abs(gene_chr_pos$end - end)))
  if(gene_chr_pos$start[min_start] > start & min_start != 1){
    min_start <- min_start - 1
  }
  if(gene_chr_pos$end[min_end] < end & min_end != length(gene_chr_pos$end)){
    min_end <- min_end + 1
  }
  index_min <- which(bin_pos$chromosome == chr & bin_pos$start == gene_chr_pos[min_start,2])
  index_max <- which(bin_pos$chromosome == chr & bin_pos$end == gene_chr_pos[min_end,3])
  gene_pos <- seq.int(index_min,index_max,1)
  return(gene_pos)
}

A fix is to be implemented, but it should not affect existing analysis pipelines, only the use of alternative anchoring loci. See the local analysis function get_gene_bin() for the implementation.

Update QDNAseqmod

Update the QDNAseqmod installation to use the source code implementation rather than the byte-compiled version.

More documentation for snakemake newbies?

I only encountered #22 because I didn't realise that I could just pop the additional file in the input directory and add it to the run sheet, and snakemake would then take care of it without reanalysing the whole cohort. Maybe something could be added to the documentation?

Error executing pipeline

When submitting the dry run from Step 5, Stage 1, I receive multiple errors.

It seems to be having problems creating the bam.ok file, or detecting it. Below is some of the output I receive. Please could I get some advice on what may be causing this error? Thank you in advance.


input: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/PSTT10.bam
output: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bam.ok
jobid: 5
reason: Missing output files: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bam.ok

rule sym_link:
input: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/PSTT10.bam, /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bam.ok
output: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bams/PSTT10.bam
jobid: 4
reason: Missing output files: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bams/PSTT10.bam; Input files updated by another job: /rds/general/project/ettpstt/live/sWGS_profiling_test/bams_from_hg19/output_test/1_sample_test/out/sWGS_fitting/ettpstt_100kb/bam.ok

TP53 bin number is hard-coded into the ploidy-purity search script

Hi Phil,

Something that I noticed is that the bin index/number for determining the TP53 copy number is hard-coded into the ploidy and purity search script:

https://github.com/Phil9S/sWGS-absoluteCN/blob/1fefcc4bfdba3c81765541a50e178f885fd546d9/scripts/ploidy_purity_search_standard_error.R#L143

I think 73504 is the correct bin number for 30 kb, but this will mean that an error is generated, or the incorrect bin is selected, when different bin sizes are used.
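
One possible fix (a sketch, assuming QDNAseq-style fData columns and GRCh37 TP53 coordinates chr17:7,565,097-7,590,856; rds.obj here stands in for the relevant copy number object) would be to derive the bin index from coordinates, so that it works for any bin size:

anno <- Biobase::fData(rds.obj)
tp53_bins <- which(anno$chromosome == "17" &
                   anno$start <= 7590856 & anno$end >= 7565097) # bins overlapping TP53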

Compatibility with different reference genomes

I guess this issue is inspired a little by #9 - when I started thinking about variable bin sizes, I also started to think about the reference genome itself.

I was just wondering about the compatibility of this repo with different reference genomes. I wouldn't imagine there would be any problem using any of the different iterations of GRCh37, but I guess this repo in its current form is not compatible with GRCh38 alignments. I am not sure what would happen if you ran GRCh38 BAMs through this pipeline; I suspect you would get an error message somewhere along the line.

Would it perhaps be a good idea to add a line or two to the documentation on reference genome compatibility?

Sample sheet schema does not test for unique sample IDs

Hello,

If I have understood correctly, then line 12 in the sample sheet schema is meant to test for unique identifiers in the sample ID column:

https://github.com/Phil9S/sWGS-absoluteCN/blob/1f1a155b4ae7c3bc3306565e2241b9703938d358/schemas/samples.schema.yaml#L9-L12

However, as the schema/linter only assesses the sample sheet line-by-line (as far as I am aware), this tag is not testing the uniqueness of the SAMPLE_ID column but is rather testing the uniqueness of each individual sample ID string - which is unique by definition, because it is not an array.

When duplicate sample IDs are present in the sample sheet, the schema does not catch this, and the following error occurs when building the DAG of rules for stage_1:

Input not found: file

This is due to an indexing issue when searching for the BAM file path, as the sample ID maps to multiple entries.

Incorporate a measure of TP53 MAF uncertainty during manual curation of fits

Hi Phil,

This is just a suggested change/enhancement.

I think it may be a good idea to incorporate some measure of uncertainty regarding our TP53 MAF estimates during the manual curation process. One criterion for judging the quality of fits is the difference between the observed TP53 MAF and what would be expected from a particular absolute copy number fit. However, at the minute, without cross-referencing against different data sheets, the user has no sense of how reliable the TP53 MAF estimate is for a given sample.

There are a number of criteria that the user applies when judging which fits pass or fail selection. If the user knows during manual curation how reliable the TP53 MAFs are, they can decide how to (subjectively) weight this measure against all the other measures of fit quality. For example, the user may rely more on the observed TP53 MAF when making the final call on a fit if the variant is an SNV with a MAF range of 0.02 between two technical replicates, compared to an indel where the range of MAFs is 0.20 and you are a lot less confident of the true MAF for that sample.

One measure of uncertainty that could be used is the standard error, but if there are only two technical replicates, using the range between them may be better.
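
As a rough sketch of what could be reported alongside each sample (plain R; the replicate MAF values are hypothetical):

maf_uncertainty <- function(mafs){
  c(se = sd(mafs) / sqrt(length(mafs)), # standard error of the mean
    range = diff(range(mafs)))          # range; arguably more informative with n = 2
}
maf_uncertainty(c(0.18, 0.20)) # two hypothetical technical replicates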

This suggestion wouldn't change anything in the actual algorithm (apart from the generation of metadata/QC sheets) - it is just a suggestion to aid the manual curator of the fits.

Using integer patient identifiers in the sample sheet throws an error

A very small issue, but one problem I have had is that an error is raised when the PATIENT_ID column of the sample sheet is of the following form:

PATIENT_ID
1
2
3
4
5
...

Pandas reads this data in as a non-string type, which then causes an error to be thrown once the sample sheet undergoes validation.

As I think this numbering/naming system would not be uncommon, perhaps the documentation could be updated to warn users about using numbers in this column - or rather, to specify that the required type for this column is string?

The input files for the gridsearch_filter rule are specified in both the snakefile and the script

The input files for this rule are specified both within rules/filter_gridsearch.smk and the relevant script:

https://github.com/Phil9S/sWGS-absoluteCN/blob/e18de14656a042d5895e8baaa55a9fcf64b4a050/scripts/gridsearch_results_filtering.R#L19

https://github.com/Phil9S/sWGS-absoluteCN/blob/e18de14656a042d5895e8baaa55a9fcf64b4a050/scripts/gridsearch_results_filtering.R#L38

The Snakefile derives the sample names directly from the sample sheet whilst the script performs a pattern match within a specified directory path.

This generates an error when there is a mismatch between the two. For example, a user may begin the analysis with a particular instance of the sample sheet, and then edit their sample sheet half-way through their analysis if they choose to remove samples (e.g. a BAM file was empty - cf #3, a sample was mislabelled etc.).

Input file information from snakemake can be used within R via the snakemake@input object. Alternatively, I guess the responsibility could be left to the user to clean their directories if they edit their sample sheet halfway through their analysis.
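
For the first option, a minimal sketch (this assumes the script is invoked via snakemake's script: directive, which exposes the S4 snakemake object to R):

rds.filename <- unlist(snakemake@input) # inputs exactly as declared in the rule
rds.list <- lapply(rds.filename, readRDS)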
