morinlab / gamblr

Set of standardized functions to operate with genomic data

Home Page: https://morinlab.github.io/GAMBLR/

License: MIT License

R 91.37% Perl 8.63%

genomics bioinformatics lymphoma lymphoma-classification cancer cancer-research

gamblr's Introduction

GAMBLR - an R package with convenience functions for working with GAMBL results.

Installation

GAMBLR is an open-source package. It can be easily installed directly from GitHub:

devtools::install_github("morinlab/GAMBLR", repos = BiocManager::repositories())

This will install the full set of GAMBLR-verse child packages (GAMBLR.data, GAMBLR.helpers, GAMBLR.utils, GAMBLR.viz, GAMBLR.results) with all necessary dependencies. The last of these (GAMBLR.results) requires access to the GSC resources and is not intended to be used outside of GSC. If you are interested in standalone functionality, please refer to the documentation of the GAMBLR.data package or any other individual child package.

Contributing

If you have access to gphost, the easiest way to obtain and contribute to GAMBLR is to clone the repository:

cd
git clone git@github.com:morinlab/GAMBLR.git

In your R editor of choice, set your working directory to the directory where you just cloned the repo.

setwd("~/GAMBLR")

Install the package in R by running the following command (requires the devtools package)

devtools::install()

As GAMBL users (GAMBLRs, so to speak) rely on the functionality of this package, the Master branch is protected. All commits must be submitted via pull request on a branch. Please refer to the GAMBL documentation for details on how to do this.

New Functions

Please always ensure that a new function goes into the corresponding child package according to its intended use. If you are not sure which package the new function belongs to, please ask by opening a new issue on this repository or, if you are a member of the Morin lab, by starting a new thread on Slack.

When designing new functions, please refer to the guidelines and best practices detailed here. Always provide the required documentation for any new function. See this section for more details on best practices for documenting R functions. Unsure what information goes where in a function's documentation? Here is a brief outline of what the different sections should include and, as an example, here is an adequately documented GAMBLR function. For more information, see this.

Title

The title is taken from the first sentence. It should be written in sentence case, not end in a full stop, and be followed by a blank line. The title is shown in various function indexes (e.g. help(package = "some_package")) and is what the user will usually see when browsing multiple functions.

Description

The description is taken from the next paragraph. It’s shown at the top of documentation and should briefly describe the most important features of the function.

Details

Additional details are anything after the description. Details are optional, but can be any length, so they are useful if you want to dig deep into some important aspect of the function. Note that, even though the details come right after the description in the introduction, they appear much later in the rendered documentation. If you want to add code in any language other than R, this is also the section to do so. For example, the new function may rely on some bash code in order to utilize the GAMBLR code. You can detail such code here by simply adding a code block as you would in a regular markdown file.

Parameters

Detailed parameter descriptions should be included for all functions. Remember to state the required data types, default values, if the parameter is required or optional, etc.

Return

Specify the returned object: is it a data frame, a list, a vector of characters, etc.?

Import

Always import all the packages from which you are calling functions, apart from base R and the R packages that get loaded by default. Remember not to import tidyverse; rather, import the individual tidyverse packages that the function depends on. If the function needs any packages that are not yet part of the GAMBLR dependencies, the developer needs to run usethis::use_package("package_name") to add the new dependencies to DESCRIPTION. Warning: do not edit DESCRIPTION by hand; instead, use the approach detailed here.

Export

Should this function be exported to NAMESPACE (i.e. made directly accessible to anyone who loads GAMBLR), or is it an internal/helper function? To have the function populate NAMESPACE, the developer has to run devtools::document(). All functions that have the @export line in their documentation will be added to NAMESPACE. Helper functions should not include this line in the function documentation. Note that such functions are still accessible with GAMBLR:::helper_function_name. If the new function is indeed a helper/internal function, ensure that this is made clear in both the function description and details (see this example). In addition, it should be clear what purpose the helper function serves (i.e. which other GAMBLR functions call it).

Examples

Please provide fully reproducible examples for the function. Ideally, the examples should demonstrate basic usage as well as more advanced usage with different parameter combinations. Note that example lines cannot exceed 100 characters, since longer lines are truncated in the rendered PDF manual. In addition, the developer needs to load any packages (besides GAMBLR) that are needed to run the examples. For instance, if the example code calls %>%, load dplyr or magrittr to make the pipe available. It is advised to write your examples so that loading external packages is avoided as much as possible; prioritize base R instead. In some cases, it is undesirable to have a function run its examples. This applies to functions that write files and to helper functions. To prevent such examples from running, simply wrap the example in:

\dontrun{
do_not_run = some_function()
}
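As an illustration of where the `\dontrun{}` wrapper sits inside a roxygen block, here is a minimal sketch. The function `write_example_table` and its parameters are hypothetical and are not part of GAMBLR; it writes a file, which is the kind of function whose examples should not run automatically.

```r
#' Write a data frame to a tab-separated file
#'
#' Hypothetical function, used here only to illustrate `\dontrun{}`.
#'
#' @param df A data frame to write. Required.
#' @param out_path Character. Path of the file to create. Required.
#'
#' @return The value of `out_path`, invisibly.
#'
#' @export
#'
#' @examples
#' \dontrun{
#' write_example_table(head(mtcars), "mtcars_head.tsv")
#' }
#'
write_example_table = function(df,
                               out_path){
  # Writing to disk is a side effect, hence the \dontrun{} wrapper above
  utils::write.table(df, file = out_path, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(out_path)
}
```

The example itself is still shown in the rendered help page; it is only skipped when the examples are executed.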

Helper Function Specific Instructions

If the newly added function is used by GAMBLR as an internal or helper function, you should also add the @noRd field to the function documentation. This prevents an .Rd file from being created and populated in the man/ folder. This is important, since such functions should not be represented on the website that is built from the source code. In addition, make sure that you have also followed the helper-function-specific instructions under Examples and Export.
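A minimal sketch of a helper documented this way is shown below. The function name `format_region_string` and the calling function named in its details are made up for illustration; the points to note are the `@noRd` tag, the absence of `@export`, and a description that states the function is internal.

```r
#' Format a region string
#'
#' INTERNAL FUNCTION (hypothetical example), not meant for out-of-package usage.
#'
#' @details Helper that would be called by a plotting function to build a
#' region string from its three components.
#'
#' @param chromosome Character. Chromosome name. Required.
#' @param start Numeric. Start coordinate. Required.
#' @param end Numeric. End coordinate. Required.
#'
#' @return A character string of the form "chromosome:start-end".
#'
#' @noRd
#'
format_region_string = function(chromosome,
                                start,
                                end){
  paste0(chromosome, ":", start, "-", end)
}

region = format_region_string("chr1", 100, 200)
```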

Testing New Functions

So you have added a new function (carefully following the steps in the previous section!) and you are obviously extremely proud and eager to test it out (and let others test it). There are basically two different approaches to do so.

Option 1

Your first option, and likely the preferred route, is to make sure that the working directory in RStudio is set to the GAMBLR folder with your updated code and then run devtools::load_all() to load all the functions available in the R/ folder of the same repo. This should make all such functions available to call.

Option 2

As an alternative, you can also run devtools::install() from the updated GAMBLR directory. As the name implies, this will install the complete package with its dependencies, remotes, etc. Note that if you go with the second option, you should restart your R session with .rs.restartR() after installing the package and then load GAMBLR with library(GAMBLR). Now you have installed the updated branch of GAMBLR and are free to call any functions available in the R/ folder.

Function Documentation Template

For your convenience, here is a documentation template for GAMBLR functions.

#' @title
#'
#' @description
#'
#' @details
#'
#' @param a_parameter
#' @param another_parameter
#'
#' @return
#'
#' @import
#' @export
#'
#' @examples
#' #this is an example
#' ###For your reference, this line is exactly 100 characters. Do not exceed 100 characters per line
#'
function_name = function(a_parameter,
                         another_parameter){
                         }
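For comparison, a filled-in version of the template might look like the sketch below. The function `summarise_read_support` is hypothetical and exists only to show what each field could contain, including parameter types, defaults and whether a parameter is required.

```r
#' @title Summarise read support
#'
#' @description Return the median of a vector of read counts after filtering.
#'
#' @details Hypothetical function used to illustrate the documentation template.
#' Counts below `min_reads` are dropped before the median is calculated.
#'
#' @param read_counts Numeric vector of per-variant read counts. Required.
#' @param min_reads Numeric. Counts below this value are dropped. Optional, default is 0.
#'
#' @return A single numeric value, or NA if no counts pass the filter.
#'
#' @importFrom stats median
#' @export
#'
#' @examples
#' summarise_read_support(c(2, 10, 30), min_reads = 5)
#'
summarise_read_support = function(read_counts,
                                  min_reads = 0){
  kept = read_counts[read_counts >= min_reads]
  if(length(kept) == 0){
    return(NA_real_)
  }
  stats::median(kept)
}
```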

gamblr's Issues

hg38 support for collate_nfkbiz_results

collate_nfkbiz_results needs to be updated to work with hg38 projection.

Remove hardcoding in collate_csr_results

The path to the file loaded by collate_csr_results is hard-coded. This should be changed to use the config and the path where this file lives should also be somewhere central. We can discuss in a GAMBL meeting what location makes the most sense.

collate_csr_results = function(sample_table,seq_type_filter = "genome"){
   if(seq_type_filter=="capture"){
     return(sample_table) #result doesn't exist or make sense for this seq_type
   }
   csr = suppressMessages(read_tsv("/projects/rmorin/projects/gambl-repos/gambl-nthomas/results/icgc_dart/mixcr_current/level_3/mixcr_genome_CSR_results.tsv"))
...
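One way to remove the hard-coded path is to build it from config values. The sketch below uses a plain list as a stand-in for the config object; the key names (`project_base`, `results_flatfiles`, `csr`) and the paths are assumptions for illustration and do not reflect the actual GAMBLR config.

```r
# Hypothetical config; the real config keys and locations may differ.
example_config = list(
  project_base = "/path/to/project",
  results_flatfiles = list(csr = "results/mixcr_genome_CSR_results.tsv")
)

# Resolve the full path of a named result from the config instead of a literal string.
get_result_path = function(config,
                           result_name){
  relative_path = config$results_flatfiles[[result_name]]
  if(is.null(relative_path)){
    stop("No path configured for result: ", result_name)
  }
  file.path(config$project_base, relative_path)
}

csr_path = get_result_path(example_config, "csr")
```

With this shape, moving the file to a central location only requires a config change rather than a code change.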

assign_cn_to_ssm is broken

assign_cn_to_ssm internally calls get_ssm_by_sample and this function requires this_seq_type parameter to be specified, but currently it's not.

estimate_purity is broken when only sample_id is provided

It looks like some updates have broken how estimate_purity would obtain the MAF data for a sample. This should be fixed so we can use the augment_ssm output with the auto-removal of variants with low read support.

pure=estimate_purity(sample_id="12-17272_tumorB")
fetching: slms-3
using flatfile: /projects/nhl_meta_analysis_scratch/gambl/results_local/gambl/battenberg_current/99-outputs/seg/genome--grch37/12-17272_tumorB--12-17272_normal_subclones.igv.seg
Rows: 45 Columns: 6
── Column specification ────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): ID, chrom
dbl (4): start, end, LOH_flag, log.ratio

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "NULL"

Add option for user to provide/obtain full gene expression data frame

Currently the get_gene_expression function requires the user to specify a set of gene IDs (ENSG or HGNC) and it subsets the tidy data frame based on that information. We should add functionality to allow the user to specify that they want to get the full matrix back. An empty gene list is probably not the right approach since it could give an unsuspecting user a massive data frame unintentionally. If we add another parameter that defaults to FALSE (all_genes=FALSE) and then check for that OR a gene list, we should be able to return the full data frame. To make this functionality helpful we also need the function to accept the same data frame as input and (when provided) use it directly and skip the step of loading it from disk. The purpose of this is to avoid users having to re-load that data from disk multiple times if they plan on running this function on different gene sets in an interactive session. Hence, the function will need a second new argument, full_expression_df or something similarly named, that is optional.
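The argument logic proposed above could be sketched roughly as follows. This is not the existing function: `get_gene_expression_sketch`, the `load_expression` callback (standing in for the read from disk) and the `gene` column name are assumptions made for the example.

```r
get_gene_expression_sketch = function(genes = NULL,
                                      all_genes = FALSE,
                                      full_expression_df = NULL,
                                      load_expression = function() stop("no loader given")){
  # Require an explicit choice so nobody gets the full table by accident
  if(is.null(genes) && !all_genes){
    stop("Provide a gene list or set all_genes = TRUE")
  }
  # Reuse the data frame supplied by the caller; only load from disk when absent
  if(is.null(full_expression_df)){
    expression_df = load_expression()
  }else{
    expression_df = full_expression_df
  }
  if(all_genes){
    return(expression_df)
  }
  expression_df[expression_df$gene %in% genes, , drop = FALSE]
}

# Toy data with made-up values
toy_expression = data.frame(gene = c("MYC", "BCL2", "MYC"),
                            sample_id = c("s1", "s1", "s2"),
                            expression = c(5.1, 3.2, 6.0))

everything = get_gene_expression_sketch(all_genes = TRUE, full_expression_df = toy_expression)
myc_only = get_gene_expression_sketch(genes = "MYC", full_expression_df = everything)
```

In an interactive session the full table is loaded once and then passed back in for each new gene set.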

Reduce duplicated code related to get_coding_ssm_status

Update function to call get_coding_ssm_status directly to avoid duplicated code.

get_coding_ssm for capture data does not return variants

The get_coding_ssm for capture data returns maf file for 1-2 samples. This is a reproducible example:

capture_meta <- get_gambl_metadata(seq_type_filter = c('capture')) %>%
  dplyr::filter(consensus_pathology =='DLBCL') %>%
  dplyr::filter(COO_consensus == 'ABC')

capture_abc_maf <- get_coding_ssm(limit_samples = capture_meta$sample_id, 
                                  basic_columns = TRUE, 
                                  exclude_cohort = c('dlbcl_chapuy'),
                                  seq_type = "capture")

which returns data for only 1 sample

reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/capture--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
mutations from 1023 samples
after linking with metadata, we have mutations from 1 samples

I think this is because the call to all_meta here always uses the default, which is genome. I think modifying this to

all_meta = get_gambl_metadata(from_flatfile = from_flatfile, seq_type = seq_type)

should resolve the issue

get_gene_cn_and_expression does not return CN states

The function get_gene_cn_and_expression does not return CN states. All values are NA, and the column name is incorrectly assigned as ._CN instead of <gene>_CN.

I think the issue is in this line here (and correspondingly here, when the Ensembl ID is specified instead of the gene symbol). The lookup returns no gene coordinates, and does so without any error or warning. The object grch37_all_gene_coordinates referred to in these 2 lines should instead refer to grch37_gene_coordinates. I can confirm that this modification returns the gene coordinates and CN states:

this_row = grch37_gene_coordinates %>%
  dplyr::filter(hugo_symbol == "KMT2D")
this_region = paste0(this_row$chromosome, ":", this_row$start, "-", this_row$end)
gene_name = "KMT2D"
gene_cn = get_cn_states(regions_list = c(this_region), region_names = c(gene_name)) %>%
  as.data.frame()
table(gene_cn)
gene_cn
   0    1    2    3    4    5    6    7   10   26   75 
   4   24 1133  181  102   11   15    2    1    1    1

Compared to what is currently on master:

this_row = grch37_all_gene_coordinates %>%
  dplyr::filter(hugo_symbol == "KMT2D")
this_region = paste0(this_row$chromosome, ":", this_row$start, "-", this_row$end)
gene_name = "KMT2D"
gene_cn = get_cn_states(regions_list = c(this_region), region_names = c(gene_name)) %>%
  as.data.frame()
table(gene_cn)
gene_cn
   2 
1475
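Independently of which coordinates object is used, the silent failure could be avoided by checking the lookup result before building the region. The sketch below assumes a coordinates table with the columns used above (hugo_symbol, chromosome, start, end); the function name `get_gene_region` and the toy values are made up.

```r
get_gene_region = function(gene_coordinates,
                           gene_symbol){
  this_row = gene_coordinates[gene_coordinates$hugo_symbol %in% gene_symbol, , drop = FALSE]
  # Fail loudly instead of building a malformed region from zero (or several) rows
  if(nrow(this_row) != 1){
    stop("Expected exactly one coordinate row for ", gene_symbol,
         " but found ", nrow(this_row))
  }
  paste0(this_row$chromosome, ":", this_row$start, "-", this_row$end)
}

# Toy coordinates, not real gene positions
toy_coordinates = data.frame(hugo_symbol = "GENE_A",
                             chromosome = "12",
                             start = 1000L,
                             end = 2000L)

region_a = get_gene_region(toy_coordinates, "GENE_A")
```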

Collate_results is broken

I can't seem to get collate_results to generate the cached outputs nor can I get it to load the cached result. If I specify from_cache = T it still loads the mutations as if it's ignoring that option completely. This really needs to be addressed.

collated_genome = collate_results(get_gambl_metadata() %>% dplyr::select(sample_id),TRUE,seq_type_filter = "genome",from_cache = F,write_to_file = T)

/projects/nhl_meta_analysis_scratch/gambl/results_local/shared/gambl_genome_results.tsv
Slow option: not using cached result. I suggest from_cache = TRUE whenever possible
Checking permissions on: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/maf/all_slms-3--grch37.maf
[1] "loading /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/maf/all_slms-3--grch37.maf"
Rows: 17949660 Columns: 116                                                                                                                       
── Column specification ───────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (53): Hugo_Symbol, Center, NCBI_Build, Chromosome, Strand, Variant_Classification, Variant_Type, Refere...
dbl (19): Entrez_Gene_Id, Start_Position, End_Position, t_depth, t_ref_count, t_alt_count, n_depth, n_ref_c...
lgl (44): dbSNP_Val_Status, Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allel...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mutations from 1566 samples
Joining, by = "sample_id"
Joining, by = "sample_id"
Joining, by = "sample_id"
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Joining, by = "sample_id"                                                                                                                         
Error in `standardise_join_by()`:                                                                                                                 
! `by` must be supplied when `x` and `y` have no common variables.
ℹ use by = character()` to perform a cross-join.
Run `rlang::last_error()` to see where the error occurred.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
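The join error above indicates that two tables reached a join without sharing any column, which suggests that one of the intermediate tables lacked `sample_id`. A possible guard is sketched below in base R; `join_on_key` is a hypothetical stand-in for the joins inside collate_results, and the toy tables are made up.

```r
# Left-join two data frames on explicit key columns, failing early with a clear
# message if either table lacks a key.
join_on_key = function(x,
                       y,
                       by = "sample_id"){
  for(table_name in c("x", "y")){
    this_table = if(table_name == "x") x else y
    missing_keys = setdiff(by, names(this_table))
    if(length(missing_keys) > 0){
      stop("Table '", table_name, "' is missing join column(s): ",
           paste(missing_keys, collapse = ", "))
    }
  }
  # all.x = TRUE keeps every row of x, as a left join would
  merge(x, y, by = by, all.x = TRUE)
}

samples = data.frame(sample_id = c("s1", "s2"))
one_result = data.frame(sample_id = "s1", result_a = "POS")
joined = join_on_key(samples, one_result)
```

Naming the offending table in the message makes it easier to see which collate step returned an unexpected result. Note that `merge()` sorts by the key, unlike a dplyr join.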

ashm_multi_rainbow_plot broken?

The example for this function in our vignette produces just a very busy legend.

ashm_multi_rainbow_plot colour issue

The lymphgen colouring example shown in the vignette using ashm_multi_rainbow_plot shows every point in grey. No errors are thrown. The points should be coloured as per the legend, which is actually coloured properly, interestingly.

lymphgen_colours = get_gambl_colours(classification = "lymphgen")

# This package comes with some custom (curated) data such as the regions recurrently affected by hypermutation in B-NHLs
ashm_multi_rainbow_plot(regions_to_display = c("BCL2-TSS", "MYC-TSS", "SGK1-TSS", "IGL"), custom_colours = lymphgen_colours)

splendidHeatmap mismatches the features between clusters if metadata is not sorted

If the sample metadata provided to splendidHeatmap is not sorted by the column that will be used for splitting the heatmap, features are misannotated between clusters and not shown correctly. This needs to be fixed by improving the logic that sets the variable comparison_groups within that function.

In addition, the color scheme for the top annotation is reversed compared to the color scheme of the clusters and the stacked barplot on the left.
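One possible direction, sketched under the assumption that the groups are derived from the order of the metadata rows, is to sort the metadata by the splitting column inside the function before deriving the groups. The helper name `order_metadata_for_split` and the toy table are hypothetical.

```r
order_metadata_for_split = function(metadata,
                                    split_column){
  if(!split_column %in% names(metadata)){
    stop("Column not found in metadata: ", split_column)
  }
  # Sort first so that samples and groups are derived in a consistent order
  sorted = metadata[order(metadata[[split_column]]), , drop = FALSE]
  list(metadata = sorted,
       comparison_groups = unique(sorted[[split_column]]))
}

toy_metadata = data.frame(sample_id = c("s1", "s2", "s3", "s4"),
                          cluster = c("B", "A", "B", "A"))
ordered = order_metadata_for_split(toy_metadata, "cluster")
```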

Vignette refers to some functionality that is deprecated or gone

The vignette needs a lot of TLC. There are many sections relating to the database that are obsolete at this point. There is at least one function used in an example that doesn't exist in the current version of GAMBLR. This should be removed or updated to use the new functionality

```{r ssms_and_maftools, out.width = "100%", fig.dim = c(8,3)}
all_ssms = get_ssm_by_gene(gene_symbol = c("CCND3"), coding_only = TRUE)

all_ssms = all_ssms %>%
  as.data.frame()

# Make a MAFtools object and plot a lollipop plot
maf_obj = read.maf(all_ssms)
lollipopPlot(maf_obj, gene = "CCND3")
```

Object "genes" not found in get_gene_expression

If Hugo symbols are provided for this function, the genes_regex step throws an error, looking for a variable named "genes". Likely renaming of a parameter that did not get updated correctly across the board for this function.

For now, please use this function with ensembl_gene_ids instead of Hugo symbols.

Vignette for new users implemented as an update to the README.md displayed on the package home page

With the package being opened to the public, we need to update the README.md to display the vignette for new users.

`fancy_qc_plot` bugs

When I try to run GAMBLR::fancy_qc_plot as follows:

setwd("/projects/rmorin/projects/DLBCL_Trios/DLBCL_Trios_Manuscript")

trios_md <- read_tsv("data/metadata/metadata_sequencing.tsv") %>% 
  rename(sample_id = DNAseq_sample_id)

fancy_qc_plot(
  these_samples = trios_md$sample_id, 
  plot_data = "MeanCorrectedCoverage", 
  metadata = trios_md
)

I get the following error:

Error in `h()`:                                                                                                                                                                     
! Problem with `filter()` input `..1`.
ℹ Input `..1` is `sample_id %in% dplyr::pull(sample_table, sample_id)`.
x error in evaluating the argument 'table' in selecting a method for function '%in%': no applicable method for 'pull' applied to an object of class "character"

This traces back to the internal call to GAMBLR:::collate_qc_results, but running that directly doesn't produce the error.

qc_data <- GAMBLR:::collate_qc_results(trios_md) # Returns a lovely table of qc data

So something about the way fancy_qc_plot() is passing the metadata table to collate_qc_results is not working properly.

Also could you please update the documentation to show all the possible QC metric arguments that can be supplied to plot_data along with what each shows? E.g. I'm not totally sure what MeanCorrectedCoverage returns, and it was hard to figure out that using the return_plotdata argument was there just to display the variables.

Lastly I think that collate_qc_results should be a full exported function so that users can make supplemental tables for writing papers.

TY!

Fixes for Lymphgen in get_gambl_metadata and collate_lymphgen

The way we track lymphgen results has changed such that one result is pre-populated into gambl_biopsy_metadata.tsv in the gambl repository and all the different flavours of outputs are individually in version control in that repo under versioned_results/LymphGen

Currently, collate_lymphgen is using the wrong path to load these data and this is clobbering some of the data in our metadata table. The call to collate_lymphgen in get_gambl_metadata is actually now obsolete and that line should be removed. The function is still useful for other applications (e.g. when called by collate_results). However, it needs to be updated to use the files in the correct path, mentioned above. Please update that function to use the correct path. This should go hand-in-hand with the addition of a new global config value e.g. versioned_results in the GAMBLR config. The error message that is only triggered in verbose mode should also be changed to be triggered by default if these files are not found. The fact that this error was silent has made it harder to track down.
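The request to raise the error by default, rather than only in verbose mode, could look something like this sketch. `read_required_tsv` is a hypothetical helper and not the current collate_lymphgen code.

```r
read_required_tsv = function(path){
  # Stop regardless of any verbose setting so a wrong path cannot go unnoticed
  if(!file.exists(path)){
    stop("Required results file not found: ", path)
  }
  utils::read.delim(path, stringsAsFactors = FALSE)
}

# Usage with a temporary file holding made-up content
toy_path = tempfile(fileext = ".tsv")
writeLines(c("sample_id\tsubtype", "s1\tA", "s2\tB"), toy_path)
toy_result = read_required_tsv(toy_path)
```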

`annotate_driver_ssm` has the wrong argument name

The function annotate_driver_ssm calls for the following arguments:

function (maf_df, lymphoma_type, driver_genes, noncoding_regions = c(NFKBIZ = "chr3:101578206-101578365", 
    HNRNPH1 = "chr5:179,045,946-179,046,683"))

Within the function body, the non-coding argument is referenced as include_noncoding.

This should be updated so the function can read from its arguments.

`collate_results` has unexpected behaviour on join_with_full_metadata

The documentation for this function implies that the join_with_full_metadata argument can be toggled to join with all metadata columns. However the function joins with all metadata columns from get_gambl_metadata() instead of from the user-supplied metadata table, so setting join_with_full_metadata = TRUE when running this function ignores the user-supplied metadata table.

if (join_with_full_metadata) {
    # INSERT ANOTHER IF STATEMENT HERE TO USE THE USER_PROVIDED METADATA TABLE IF IT EXISTS
    full_meta = get_gambl_metadata(seq_type_filter = seq_type_filter)
    full_table = left_join(full_meta, sample_table)
    full_table = full_table %>% mutate(MYC_SV_any = case_when(ashm_MYC > 
      3 ~ "POS", manta_MYC_sv == "POS" ~ "POS", ICGC_MYC_sv == 
      "POS" ~ "POS", myc_ba == "POS" ~ "POS", TRUE ~ "NEG"))
    full_table = full_table %>% mutate(BCL2_SV_any = case_when(ashm_BCL2 > 
      3 ~ "POS", manta_BCL2_sv == "POS" ~ "POS", ICGC_BCL2_sv == 
      "POS" ~ "POS", bcl2_ba == "POS" ~ "POS", TRUE ~ 
      "NEG"))
    full_table = full_table %>% mutate(DoubleHitBCL2 = ifelse(BCL2_SV_any == 
      "POS" & MYC_SV_any == "POS", "Yes", "No"))
    return(full_table)
  }
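The missing branch flagged in the comment above could be sketched as follows. The parameter name `these_samples_metadata` and the `load_default_metadata` callback (standing in for the get_gambl_metadata call) are assumptions; the toy tables are made up.

```r
add_full_metadata = function(sample_table,
                             these_samples_metadata = NULL,
                             load_default_metadata = function() stop("no loader given")){
  # Prefer the user-supplied metadata; fall back to the default only when absent
  if(is.null(these_samples_metadata)){
    full_meta = load_default_metadata()
  }else{
    full_meta = these_samples_metadata
  }
  # Metadata on the left, as in the original left_join(full_meta, sample_table)
  merge(full_meta, sample_table, by = "sample_id", all.x = TRUE)
}

user_meta = data.frame(sample_id = c("s1", "s2"), pathology = c("DLBCL", "BL"))
collated = data.frame(sample_id = c("s1", "s2", "s3"), ashm_MYC = c(4, 0, 1))
full_table = add_full_metadata(collated, these_samples_metadata = user_meta)
```

Here the output is restricted to the samples in the user's table (s1 and s2), and the columns come from that table rather than from the default metadata.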

Separate annotation of lymphoma drivers that aren't exactly hot spots

There are several genes that are recurrently mutated but need only a subset of their mutations considered in some contexts because their function is firmly established. This includes NOTCH1 and NOTCH2 (PEST domain mutations). GAMBLR should be enhanced to handle these in addition to CREBBP KAT. Any other mutations used by LymphGen with a prescribed subset such as this should also be included. This applies to CD79B in theory but possibly not in practice based on my recollection of how LymphGen works.

Synchronise bed files of lymphoma genes with lymphoma genes list

The current bed files of lymphoma genes (both in grch37 and hg38 coordinates) are not synchronised with the lymphoma_genes set. More genes need to be added to the lymphoma_genes set and they need to be brought in sync with associated bed files.

Identify helper/internal and obsolete functions.

We need to go over the functions listed in each of the R scripts, to highlight functions that should be considered as helper/internal and unused functions (i.e not export to NAMESPACE and update function description accordingly). A suggested approach could be to list such functions in the comment section of this issue, I could then update the scripts accordingly and link the PR in the comments.

Broken functionality (from_flatfile) in get_manta_sv

I haven't investigated the cause of this issue. I wonder if this ever worked properly in the first place. We definitely need all functions to work with the from_flatfile option and TRUE should become the new default once it does

unannotated_sv = get_manta_sv(from_flatfile = TRUE) 
Error in read_tsv(sv_file, col_types = "cnncnncnccccnnnnccc", col_names = cnames) : 
  object 'cnames' not found

Issue with rtracklayer specific to R4

The liftover_bedpe function gives an error that occurs only in R version 4 (not R3.6):

> hg19_sv <- get_manta_sv() %>% head(20)
> hg38_sv <- liftover_bedpe(bedpe_df = hg19_sv, target_build="hg38")
  CHROM_A  START_A    END_A CHROM_B  START_B    END_B NAME SOMATIC_SCORE STRAND_A STRAND_B TYPE FILTER VAF_tumour
1    chr1  1556541  1556547    chr1  1556664  1556670    .            40        -        -  BND   PASS          0
2    chr1  6012725  6012732    chr1  6012825  6012832    .            48        +        +  BND   PASS          0
3    chr1  8464072  8464090    chr1  8464293  8464311    .            40        +        +  BND   PASS          0
4    chr1 10084099 10084251    chr1 10084266 10084411    .            48        -        -  BND   PASS          0
5    chr1 10526162 10526172    chr1 10526290 10526300    .            40        -        -  BND   PASS          0
6    chr1 15878430 15878436    chr1 15878608 15878614    .            40        +        +  BND   PASS          0
  VAF_normal DP_tumour DP_normal tumour_sample_id normal_sample_id pair_status
1          0        55        73  00-14595_tumorA  00-14595_normal     matched
2          0        32        85  00-14595_tumorA  00-14595_normal     matched
3          0       215        84  00-14595_tumorA  00-14595_normal     matched
4          0        43        50  00-14595_tumorA  00-14595_normal     matched
5          0       121        92  00-14595_tumorA  00-14595_normal     matched
6          0       126       101  00-14595_tumorA  00-14595_normal     matched
Error: 'chr1	1621161	1621167	.	0	-	1621161	1621167	0	1	6	0' does not exist in current working directory ('/projects/rmorin/projects/gambl-repos/gambl-kdreval').
In addition: Warning message:
In if (!grepl("chr", original_bedpe$CHROM_A)) { :
  the condition has length > 1 and only the first element will be used

@mattssca tracked it down to the rtracklayer issue with this line:

bedpe_obj = rtracklayer::import(text = char_vec, format = "bed")

The solution needs to ensure functionality is maintained both in R3.6 and R4.
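Separately from the rtracklayer problem, the warning in the log comes from passing a whole column to `if()`. From R 4.2.0 onwards a condition of length greater than one is an error rather than a warning, so the check needs to be reduced to a single value. A minimal sketch follows, with hypothetical helper names; it anchors the pattern with `^chr`, which is slightly stricter than the original `grepl("chr", ...)`.

```r
# TRUE if at least one chromosome name lacks the "chr" prefix (single logical value)
needs_chr_prefix = function(chromosomes){
  !all(grepl("^chr", chromosomes))
}

# Add the prefix only where it is missing
add_chr_prefix = function(chromosomes){
  ifelse(grepl("^chr", chromosomes), chromosomes, paste0("chr", chromosomes))
}

chrom_a = c("1", "chrX", "8")
if(needs_chr_prefix(chrom_a)){
  chrom_a = add_chr_prefix(chrom_a)
}
```

This part behaves the same in R 3.6 and R 4.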

ensure Morinlab fork of g3viz is installed as a dependency

It seems that the auto-installation of dependencies is relying on the main g3viz version, which doesn't handle MAFs the way we need for certain functions. For some GAMBLR functions to work we need users to have the version here:

https://github.com/morinlab/g3viz

Can GAMBLR be set up to automatically install from this fork?

Functions for QC data

We should set up some functions (and update the config) to leverage the outputs of the new QC pipeline Kostia implemented. The main output is a single tabular file that could even be stitched on with the other data brought in by the collate* functions. Perhaps adding a new collate function for this and the paths in the config are a first step then we can figure out some functions to summarize the QC metrics and report a sample-level set of QC details in the context of all other samples.

Update get_coding_ssm to load variants from latest results

get_ssm_by_sample has recently been updated to load variants from the latest and greatest deblacklisted MAFs. We definitely need to update all other functions that get ssms to load from the same files or their derivatives. The most important one is get_coding_ssm. I've created the .CDS derivative files for the files that are loaded by get_ssm_by_sample so this should be relatively easy to adopt into the other functions by pointing them to these new files.

color not properly matching to MZL

While making a pretty oncoplot,
all_meta_new = get_gambl_metadata()
get_coding_ssm_df=get_coding_ssm(from_flatfile=TRUE)
maf = read.maf(get_coding_ssm_df,clinicalData = all_meta_new)
bl_genes=c("MYC","ID3","TP53","ARID1A","FBXO11","GNA13","TCF3","TFAP4","HNRNPU","FOXO1","CCND3","SMARCA4","DDX3X")
dlbcl_genes = c("EZH2","KMT2D","MEF2B","CREBBP","MYD88")
genes=c(bl_genes,dlbcl_genes)

prettyOncoplot(maftools_obj = maf, genes = genes,
           these_samples_metadata = all_meta_new,
           metadataColumns = c("pathology","COO_consensus",
                                "lymphgen","EBV_status_inf"),
           minMutationPercent = 0.01, include_noncoding=NULL)

I was getting this error:

All mutation types: Nonsense_Mutation, Missense_Mutation, Multi_Hit, Frame_Shift_Del, Splice_Site, Frame_Shift_Ins,
In_Frame_Del, Translation_Start_Site, Nonstop_Mutation, In_Frame_Ins.
alter_fun is assumed vectorizable. If it does not generate correct plot, please set alter_fun_is_vectorized = FALSE in oncoPrint().
Following at are removed: RNA, 3'UTR, Splice_Region, hot_spot, because no color was defined for them.
No legend element is put in the last 1 row under nrow = 3, maybe you should set by_row = FALSE? Reset nrow to 2.
Following at are removed: RNA, 3'UTR, Splice_Region, hot_spot, because no color was defined for them.
No legend element is put in the last 1 row under nrow = 3, maybe you should set by_row = FALSE? Reset nrow to 2.
Error: pathology: cannot map colors to some of the levels: MZL
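A quick way to find which levels have no colour assigned, before the plotting call fails, is to compare the metadata values against the names of the colour vector. The helper name and the hex codes below are arbitrary placeholders, not the GAMBLR palette.

```r
find_unmapped_levels = function(values,
                                colours){
  observed = unique(as.character(values[!is.na(values)]))
  setdiff(observed, names(colours))
}

pathology = c("DLBCL", "BL", "MZL", NA, "BL")
pathology_colours = c(DLBCL = "#479450", BL = "#926CAD")  # placeholder colours
unmapped = find_unmapped_levels(pathology, pathology_colours)
```

Any value returned here needs an entry in the colour vector, or needs to be recoded, before the annotation can be drawn.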

GAMBLR documentation formatting is incorrect for many functions

It looks like the description is being interpreted as the title for some functions because there is not a separate title line in the documentation. I've fixed this for a few functions but I'm sure it's a problem elsewhere. Here's one I fixed. Originally it had all the text on the first line, which caused ROxygen to use that as the title and the description, which cluttered the docs page due to the font size and duplication. Be sure all functions either have a sufficiently brief title that this isn't a concern OR (ideally) all have both a title and description, separated by an empty line.

#' Get MAF-format data frame for more than one sample and combine together
#' 
#' This function internally runs get_ssm_by_sample. See get_ssm_by_sample for more information
#' 
#' @param these_sample_ids A vector of sample_id that you want results for.

Issues with the `collate_sbs_results` part of `collate_results`

collate_results uses the cached results table by default. This is good because it has preserved mutational signature data that has since gone away (namely the current mutational signature table omits SBS9 which has to be the worst one to omit from GAMBL).

However the sbs_manipulation argument effectively does nothing with the cached results because only SBS1, 8, and 9 have scaled results cached, so all the other mutational signatures have only original data preserved. Also I think expected behavior is that if the user wants scaled data returned, the function should ONLY return scaled data, not scaled + unscaled data in cryptically named columns.

suppress annoying readr messages

Messages like the one below are unnecessary in pretty much every place we see them (possibly all of them). Can @mattssca please update all such lines so they stop printing this?

 Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
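For reference, a minimal sketch of two common ways to quiet this message, assuming readr 2.0 or later. The file and column names are made up for illustration and are not taken from GAMBLR.

```r
library(readr)

# Hypothetical two-column TSV written to a temp file for the demo.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("sample_id\tcount", "A\t1", "B\t2"), tmp)

# Option 1: keep type guessing but silence the column specification message.
quiet_guess <- read_tsv(tmp, show_col_types = FALSE)

# Option 2: state the column types explicitly, which also avoids the message.
explicit <- read_tsv(tmp, col_types = cols(sample_id = col_character(),
                                           count = col_integer()))
```

Option 2 has the side benefit of failing loudly if a file's columns ever change type.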

get_ssm_by_region incorrectly subsets on five columns, regardless of whether basic_columns is TRUE

This function needs to be updated so that all MAF columns are retained.

Currently, when reading the indexed MAF, the function subsets on Chromosome, Start_Position, End_Position, Tumor_Sample_Barcode and Read_Support. As a result, basic_columns = TRUE cannot subset on columns 1-45 and throws an error.

This function could possibly be simplified by using the vroom package to read the data into R.
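A rough sketch of the intended behaviour, assuming the full set of MAF columns has already been read: subset to the first 45 columns only when basic_columns is TRUE. The helper name select_maf_columns and the toy data frame are hypothetical, not part of GAMBLR.

```r
# Hypothetical helper: subset columns only when the caller asks for it.
select_maf_columns <- function(maf_df, basic_columns = TRUE, n_basic = 45) {
  if (!basic_columns) {
    return(maf_df)  # keep every column that was read
  }
  # Guard against inputs narrower than the basic set instead of erroring.
  keep <- seq_len(min(n_basic, ncol(maf_df)))
  maf_df[, keep, drop = FALSE]
}

# Toy data frame standing in for a MAF with 50 columns.
toy_maf <- as.data.frame(matrix(0, nrow = 2, ncol = 50))
ncol(select_maf_columns(toy_maf, basic_columns = TRUE))   # 45
ncol(select_maf_columns(toy_maf, basic_columns = FALSE))  # 50
```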

Update/complete/fix missing/old documentation

Many GAMBLR functions have a disconnect between their documentation and actual current set of arguments. We need to make a concerted effort to backfill this information and correct it where necessary.

remove arguments from documentation when they are no longer available
add missing arguments
add description for arguments that are not described

Bug in ashm_rainbow_plot

While writing new examples for the SSM vignette, I discovered that the classification_column parameter does not work as advertised for this function. Or more specifically, this line in the function does not do what it is supposed to do:

meta_arranged$classification = factor(meta_arranged[,classification_column], levels = unique(meta_arranged[,classification_column]))

I am assuming that the goal here is to create a new column in the meta_arranged data frame called classification, which is later used by the colour parameter inside the ggplot call. This code instead creates a new column filled with NAs (perhaps because it looks for a factor with the same name as the value of classification_column?), so the resulting plot has no colouring based on the selected classification column.

The example plot here was generated with the following code:

mybed = data.frame(start = 128747680,
                   end = 128753674,
                   name = "MYC")

region = "chr8:128737680-128763674"

fl_dlbcl_metadata = get_gambl_metadata() %>%
  dplyr::filter(pathology %in% c("FL", "DLBCL"))

my_mutations = get_ssm_by_region(region = region)

ashm_rainbow_plot(mutations_maf = my_mutations,
                  drop_unmutated = TRUE,
                  metadata = fl_dlbcl_metadata,
                  hide_ids = FALSE,
                  bed = mybed,
                  region = region,
                  classification_column = "pathology",
                  custom_colours = get_gambl_colours("pathology"))

In addition, it would be nice to have the function automatically subset the colours in the retrieved palette to the levels present in classification_column (here FL and DLBCL for pathology).
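One possible fix for the line quoted above, sketched under the assumption that meta_arranged is a tibble: single-bracket indexing on a tibble returns a one-column tibble rather than a vector, so factor() never sees the plain values. Extracting with `[[` avoids that. The toy metadata and the palette colours below are made up, and the palette is assumed to be a named character vector.

```r
# Toy metadata; drop = FALSE below mimics how a tibble behaves with `[`.
meta_arranged <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  pathology = c("FL", "DLBCL", "FL"),
  stringsAsFactors = FALSE
)
classification_column <- "pathology"

# `[` with a column name keeps the data-frame wrapper (always the case for tibbles).
wrapped <- meta_arranged[, classification_column, drop = FALSE]
is.data.frame(wrapped)  # TRUE, so factor() does not receive a plain vector

# `[[` extracts the underlying vector, for data frames and tibbles alike.
values <- meta_arranged[[classification_column]]
meta_arranged$classification <- factor(values, levels = unique(values))

# Hypothetical palette (made-up colours), assumed to be a named character vector.
palette <- c(FL = "#EA8368", DLBCL = "#479450", BL = "#926CAD")
used_colours <- palette[names(palette) %in% levels(meta_arranged$classification)]
```

The last line also covers the palette request: only colours whose names match a level that is actually present are kept.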

some segmented data not found for unmatched tumours

Some of the controlfreec results seem to be missing from GAMBLR. It might be that hg38 lifted segmented data from controlfreec were never loaded into the database. Is this a new thing?

all_segs_unmatched %>% pull(ID) %>% unique()
  [1] "00-14595_tumorC"           "00-15201_tumorA"           "00-15201_tumorB"          
  [4] "00-23442_tumorA"           "00-23442_tumorB"           "00-26427_tumorA"          
  [7] "01-12047_tumorB"           "01-12047_tumorC"           "01-14774_tumorA"          
 [10] "01-14774_tumorB"           "01-16433_tumorA"           "01-16433_tumorB"          
 [13] "01-20774T"                 "01-23117_tumorA"           "01-23117_tumorB"          
 [16] "01-23942_tumorA"           "01-23942_tumorB"           "01-28152_tumorA"          
 [19] "01-28152_tumorB"           "02-13135T"                 "02-15630_tumorA"          
 [22] "02-15630_tumorB"           "02-18356_tumorA"           "02-18356_tumorB"          
 [25] "02-20170T"                 "02-22991T"                 "02-24492_tumorA"          
 [28] "02-28397_tumorA"           "02-28397_tumorB"           "03-10363T"                
 [31] "03-10440_tumorA"           "03-10440_tumorB"           "03-11110T"                
 [34] "03-19103_tumorB"           "03-23488_tumorA"           "03-23488_tumorB"          
 [37] "03-33266_tumorA"           "03-33266_tumorB"           "04-14066_tumorA"          
 [40] "04-14066_tumorB"           "04-14093_tumorA"           "04-14093_tumorB"          
 [43] "04-21856_tumorA"           "04-24061_tumorA"           "04-24061_tumorB"          
 [46] "04-24937T"                 "04-28140T"                 "04-35039T"                
 [49] "05-12939T"                 "05-17793T"                 "05-18426T"                
 [52] "05-21634T"                 "05-22052T"                 "05-23110T"                
 [55] "05-24065T"                 "05-24395T"                 "05-24401T"                
 [58] "05-24666T"                 "05-25439T"                 "05-25674T"                
 [61] "05-32150T"                 "05-32150_tumorA"           "05-32150_tumorB"          
 [64] "05-32762T"                 "05-32947T"                 "06-10398T"                
 [67] "06-11535T"                 "06-11677_tumorA"           "06-11677_tumorB"          
 [70] "06-14634T"                 "06-15256T"                 "06-15922T"                
 [73] "06-16716T"                 "06-19919T"                 "06-22057T"                
 [76] "06-22314_tumorA"           "06-22314_tumorB"           "06-23907T"                
 [79] "06-24255_tumorC"           "06-24255_tumorD"           "06-24915T"                
 [82] "06-24925T"                 "06-25674T"                 "06-30025T"                
 [85] "06-30145T"                 "06-33777T"                 "06-34043T"                
 [88] "07-10483T"                 "07-17613T"                 "07-25012T"                
 [91] "07-25994_tumorB"           "07-25994_tumorC"           "07-30628T"                
 [94] "07-31833T"                 "07-35482T"                 "07-40648_tumorA"          
 [97] "07-40648_tumorB"           "07-41887_tumorA"           "07-41887_tumorB"          
[100] "08-10249_tumorA"           "08-10249_tumorB"           "08-13706T"                
[103] "08-15460T"                 "08-15460_tumorA"           "08-15460_tumorB"          
[106] "08-15555T"                 "08-17645_tumorB"           "08-19764T"                
[109] "08-25894T"                 "08-29440_tumorA"           "08-29440_tumorB"          
[112] "08-33625T"                 "09-11467T"                 "09-12737T"                
[115] "09-12864T"                 "09-15842_tumorA"           "09-15842_tumorB"          
[118] "09-16981T"                 "09-21480T"                 "09-27193T"                
[121] "09-31008_tumorA"           "09-31008_tumorB"           "09-31233T"                
[124] "09-31895T"                 "09-33003T"                 "09-33003_tumorA"          
[127] "09-33003_tumorB"           "09-41082T"                 "09-41114T"                
[130] "10-10826T"                 "10-11584_tumorA"           "10-11584_tumorB"          
[133] "10-15025T"                 "10-27154T"                 "10-28165T"                
[136] "10-31625T"                 "10-32847T"                 "10-36955_tumorA"          
[139] "10-36955_tumorB"           "10-40676T"                 "10-41170T"                
[142] "11-14210T"                 "11-21727T"                 "11-35935T"                
[145] "12-17272_tumorA"           "12-17272_tumorB"           "12-29259T"                
[148] "13-22818T"                 "13-26601T"                 "13-26835_tumorB"          
[151] "13-30451T"                 "13-31210T"                 "13-38657_tumorA"          
[154] "13-38657_tumorB"           "13-42815T"                 "14-10498_tumorA"          
[157] "14-10498_tumorB"           "14-11247T"                 "14-13959T"                
[160] "14-14094T"                 "14-16281T"                 "14-16707T"                
[163] "14-20552_tumorA"           "14-20552_tumorB"           "14-20962T"                
[166] "14-23891T"                 "14-24648_tumorA"           "14-24648_tumorB"          
[169] "14-25466T"                 "14-27873T"                 "14-29443_tumorB"          
[172] "14-32442T"                 "14-33262T"                 "14-33436T"                
[175] "14-35026T"                 "14-35472_tumorA"           "14-35472_tumorB"          
[178] "14-35632T"                 "14-36022T"                 "14-37722T"                
[181] "14-41461T"                 "15-10535T"                 "15-11617T"                
[184] "15-13365T"                 "15-13383_tumorB"           "15-15757T"                
[187] "15-21654T"                 "15-24058T"                 "15-24306T"                
[190] "15-26538T"                 "15-29858T"                 "15-31924T"                
[193] "15-34472T"                 "15-38154T"                 "15-43657T"                
[196] "15-43891T"                 "16-11636T"                 "16-12281_tumorB"          
[199] "16-13732T"                 "16-16192T"                 "16-16723T"                
[202] "16-17861T"                 "16-18029T"                 "16-18623T"                
[205] "16-23208T"                 "16-27074_tumorA"           "16-27074_tumorB"          
[208] "16-27413T"                 "16-29329T"                 "16-31791T"                
[211] "16-37587_tumorB"           "17-12136T"                 "17-23504T"                
[214] "17-23711_tumorA"           "17-23711_tumorB"           "17-33596_tumorA"          
[217] "17-33596_tumorB"           "17-36275T"                 "17-36275_tumorB"          
[220] "17-40409_tumorA"           "17-40409_tumorB"           "76-10146_tumorA"          
[223] "76-10146_tumorB"           "81-52884T"                 "89-62169T"                
[226] "92-38267_tumorA"           "92-38267_tumorB"           "92-38626_tumorA"          
[229] "92-38626_tumorB"           "94-15772_tumorA"           "94-15772_tumorB"          
[232] "94-25764_tumorB"           "94-26795T"                 "95-32141_tumorA"          
[235] "95-32141_tumorB"           "95-32814T"                 "96-31596T"                
[238] "97-14402T"                 "97-18502_tumorA"           "97-18502_tumorB"          
[241] "98-22532T"                 "99-13280T"                 "99-27137T"                
[244] "DOHH-2"                    "Farage"                    "FL1001T2"                 
[247] "FL1002T2"                  "FL1003T2"                  "FL1004T2"                 
[250] "FL1005T2"                  "FL1006T2"                  "FL1007T2"                 
[253] "FL1008T2"                  "FL1010T2"                  "FL1012T2"                 
[256] "FL1013T2"                  "FL1015T2"                  "FL1016T2"                 
[259] "FL1018T2"                  "FL1019T2"                  "FL1020T2"                 
[262] "HBL-1"                     "HT"                        "HTMCP-01-01-00003-01D-03D"
[265] "HTMCP-01-01-00012-01A-01D" "HTMCP-01-01-00451-01A-01D" "HTMCP-01-02-00013-01A-01D"
[268] "HTMCP-01-02-00017-01A-01D" "HTMCP-01-06-00036-01E"     "HTMCP-01-06-00105-01A-01D"
[271] "HTMCP-01-06-00146-01A-01D" "HTMCP-01-06-00175-01A-01D" "HTMCP-01-06-00185-01A-01D"
[274] "HTMCP-01-06-00206-01A-01D" "HTMCP-01-06-00227-01A-01D" "HTMCP-01-06-00232-01A-01D"
[277] "HTMCP-01-06-00242-01A-01D" "HTMCP-01-06-00253-01A-01D" "HTMCP-01-06-00255-01A-01D"
[280] "HTMCP-01-06-00299-01A-01D" "HTMCP-01-06-00306-01A-01D" "HTMCP-01-06-00307-01A-01D"
[283] "HTMCP-01-06-00310-01B-01D" "HTMCP-01-06-00314-01A-01D" "HTMCP-01-06-00419-01B-01D"
[286] "HTMCP-01-06-00422-01A-01D" "HTMCP-01-06-00443-01A-01D" "HTMCP-01-06-00485-01A-01D"
[289] "HTMCP-01-06-00497-01A-01D" "HTMCP-01-06-00500-01A-01D" "HTMCP-01-06-00526-01A-01D"
[292] "HTMCP-01-06-00563-01A-01D" "HTMCP-01-06-00594-01A-01D" "HTMCP-01-06-00606-01A-01D"
[295] "HTMCP-01-06-00611-01A-01D" "HTMCP-01-06-00634-01A-01D" "HTMCP-01-07-00336-01A-01E"
[298] "HTMCP-01-10-00160-01A-01D" "HTMCP-01-10-00778-01A-01D" "HTMCP-01-15-00366-01A-01E"
[301] "HTMCP-01-15-00367-01A-01E" "HTMCP-01-15-00370-01A-01E" "HTMCP-01-16-00265-01A-01E"
[304] "HTMCP-01-20-00272-01A-01E" "Karpas422"                 "MD903"                    
[307] "OCI-Ly10"                  "OCI-Ly3"                   "SU-DHL-10"                
[310] "SU-DHL-4"                  "SU-DHL-5"                  "SU-DHL-6"                 
[313] "SU-DHL-9"                  "Toledo"                    "WSU-NHL"

Add license

Add license (same one used in LCR modules) to prepare for opening the repo.

Tidyverse isn't really "importable"

I noticed today when testing Ryan's GAMBLR remote files that get_gambl_metadata() throws an error if tidyverse hasn't been explicitly loaded, even though it's listed as a dependency.

Turns out tidyverse isn't importable in this way and you are supposed to import individual packages from within tidyverse.

https://stackoverflow.com/a/62377442/16737334

Possible error in read_tsv

When installing GAMBLR from the current master branch, some users have reported getting the following messages:

Note: possible error in 'read_tsv(full_path, col_names = c("chrpos", ': unused argument (show_col_types = FALSE)
Note: possible error in 'read_tsv("/projects/rmorin/projects/gambl-repos/gambl-rmorin/config/exclude.tsv", ': unused argument (show_col_types = FALSE)
Note: possible error in 'read_tsv(full_cnv_path, ': unused argument (show_col_types = FALSE)

Can anyone else either confirm or debunk this? I was unable to reproduce these messages when installing from MASTER.
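One explanation worth checking (an assumption, not confirmed in this thread) is that the affected users have an older readr installed, since show_col_types appears to have been introduced in readr 2.0.0. A minimal sketch of a version-guarded wrapper follows; quiet_read_tsv is a hypothetical name, not an existing GAMBLR function.

```r
# Hypothetical wrapper: pass show_col_types only when the installed readr has it.
quiet_read_tsv <- function(path, ...) {
  if (utils::packageVersion("readr") >= "2.0.0") {
    readr::read_tsv(path, show_col_types = FALSE, ...)
  } else {
    # Older readr: silence the column specification message instead.
    suppressMessages(readr::read_tsv(path, ...))
  }
}

# Toy file for the demo; the column names are made up.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("chrpos\tvalue", "chr1:100\t5"), tmp)
result <- quiet_read_tsv(tmp)
```

Alternatively, requiring a minimum readr version in the package DESCRIPTION would make the installation fail early with a clearer message.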

Make `get_gene_expression` work for non-`morinlab` users

Currently the get_gene_expression function only works for users with morinlab group permissions because the config accesses only icgc_dart outputs. I've now updated my salmon/DESeq2 Snakefile to also generate the tidy gene expression matrix for each unix_group. Could you please update the function to use either the gambl version or icgc_dart version depending on a user's permission levels?

For those with morinlab permissions the file will be results/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv (i.e. current status quo)

For those with only gambl permissions the file will be results/gambl/DESeq2-0.0_salmon-1.0/mrna/vst-matrix-Hugo_Symbol_tidy.tsv

Also, now that the Salmon Snakefile has been set up to generate the tidy matrix, we should consider modifying the current tidy_gene_expression function to NOT write to file by default as this will create some conflicts with the Snakefile version.
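A minimal sketch of the permission-based choice, assuming it is enough to test which file the current user can read. pick_readable_file is a hypothetical helper; the two paths are the ones quoted above and would still need to be joined to the project base from the config.

```r
# Hypothetical helper: return the first candidate file the current user can read.
pick_readable_file <- function(candidates) {
  # file.access() returns 0 when the requested access (mode 4 = read) is allowed.
  readable <- file.access(candidates, mode = 4) == 0
  if (!any(readable)) {
    stop("None of the candidate expression files are readable: ",
         paste(candidates, collapse = ", "))
  }
  candidates[which(readable)[1]]
}

# Paths quoted in this issue, in order of preference (relative to the project base).
expression_candidates <- c(
  "results/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv",
  "results/gambl/DESeq2-0.0_salmon-1.0/mrna/vst-matrix-Hugo_Symbol_tidy.tsv"
)
# expression_file <- pick_readable_file(expression_candidates)
```

Listing the icgc_dart file first preserves the current behaviour for morinlab users.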

get_gambl_metadata relies on files that aren't in the repo

Some of the sample sets that can be accessed with get_gambl_metadata cause the function to break if run off-site. This is because the files it uses do not exist in the repo or the path is specified incorrectly. For example:

get_gambl_metadata(case_set="BL-DLBCL-manuscript")
Error in data.table::fread("/projects/rmorin/projects/gambl-repos/gambl-kdreval/data/metadata/BLGSP--DLBCL-case-set.tsv") :                                                                               
  File '/projects/rmorin/projects/gambl-repos/gambl-kdreval/data/metadata/BLGSP--DLBCL-case-set.tsv' does not exist or is non-readable. getwd()=='/Users/rmorin/git/GAMBLR'

All files that GAMBLR loads must be either in the gambl repo (with the path specified properly using the relative path and config) or, in rare cases, bundled with GAMBLR.
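As an illustration of that rule, here is a rough sketch that builds the path from a configured project base plus a relative path and fails with a readable message. resolve_gambl_file and the project_base argument are hypothetical names; how GAMBLR actually reads its config is not shown here.

```r
# Hypothetical helper: build a path from a configured base and fail with a clear message.
resolve_gambl_file <- function(project_base, relative_path) {
  full_path <- file.path(project_base, relative_path)
  if (!file.exists(full_path)) {
    stop("Required file not found: ", full_path,
         ". Check the project base in your config.", call. = FALSE)
  }
  full_path
}

# Demo with a temporary directory standing in for the gambl repo.
demo_base <- file.path(tempdir(), "gambl-demo")
dir.create(file.path(demo_base, "data", "metadata"), recursive = TRUE,
           showWarnings = FALSE)
rel <- file.path("data", "metadata", "BLGSP--DLBCL-case-set.tsv")
writeLines("sample_id", file.path(demo_base, rel))
case_set_path <- resolve_gambl_file(demo_base, rel)
```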

PrettyOncoplot example in vignette throws an error

This section of the vignette throws an error that I believe has a known fix involving specifying a NULL value for one argument. Can we fix this function to handle this error more elegantly and update the vignette? I suspect the issue is due to the new MAFs no longer having the noncoding mutations in them.

prettyOncoplot(maftools_obj = maf,
               genes = genes,
               these_samples_metadata = all_meta,
               metadataColumns = c("pathology", "lymphgen", "sex", "EBV_status_inf", "cohort"),
               sortByColumns = c("pathology", "sex", "lymphgen", "EBV_status_inf", "cohort"),
               keepGeneOrder = TRUE,
               splitGeneGroups = gene_groups,
               splitColumnName = "pathology",
               metadataBarHeight = 5,
               metadataBarFontsize = 8,
               fontSizeGene = 11,
               recycleOncomatrix = TRUE,
               removeNonMutated = FALSE)
NFKBIZ and 3'UTR
HNRNPH1 and Splice_Region
Error in if (mat[gene, samp] == "") { : 
  missing value where TRUE/FALSE needed
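"missing value where TRUE/FALSE needed" is what R reports when if() receives NA. Assuming the NA comes from a cell of the oncomatrix for which no value was recorded, a guard along these lines would avoid the error. The helper is_unmutated and the toy matrix are hypothetical and not taken from prettyOncoplot.

```r
# Hypothetical helper: treat NA the same as an empty string.
is_unmutated <- function(cell) {
  is.na(cell) || cell == ""
}

# Toy oncomatrix with one missing cell (filled column by column).
mat <- matrix(c("Missense_Mutation", NA, "", "Splice_Site"), nrow = 2,
              dimnames = list(c("NFKBIZ", "HNRNPH1"), c("s1", "s2")))

mat["HNRNPH1", "s1"] == ""          # NA, which if() cannot evaluate
is_unmutated(mat["HNRNPH1", "s1"])  # TRUE
is_unmutated(mat["NFKBIZ", "s1"])   # FALSE
```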

ashm_rainbow_plot example in vignette is broken

Running the example throws an error complaining about Chromosome. This may be related to the new retrieval method for variants in a region, or it may be unrelated.

ashm_rainbow_plot(mutations_maf = my_mutations, metadata = all_metadata,
                  bed = mybed, region = region, classification_column = "lymphgen", 
                  custom_colours = get_gambl_colours("lymphgen"))

Error: Problem with `filter()` input `..1`. ℹ Input `..1` is `&...`. x object 'Chromosome' not found Run `rlang::last_error()` to see where the error occurred.

bug in get_sample_cn_segments

with_chr_prefix for get_sample_cn_segments does not work as advertised.

get_ssm_by_regions() is duplicating variants, and assigning them to the incorrect regions

Running get_ssm_by_regions(grch37_ashm_regions) yields the following dataframe:

For some reason, this function is only returning variants upstream of MYC and assigning them to (seemingly) ALL of the regions provided.

get_ssm_by_sample updates needed

All functions that internally call get_gambl_metadata() should have been updated to require the user specify the seq_type of the sample when not providing the metadata. This is required because the function cannot call get_gambl_metadata() with the right seq_type otherwise. Currently, if a user runs get_ssm_by_sample and specifies a sample_id from any "capture" data set then the function fails unless they also provide the metadata. The fixes for this are twofold:

make seq_type a required argument UNLESS the user specifies the argument these_samples_metadata
this_sample_id should not be a required argument because, as we do for many functions, this id can be pulled from these_samples_metadata. Hence, the function should require that the user specify either these_samples_metadata or this_sample_id, and use whichever is provided

Here's an example where the function is currently failing:

slms3_var_calls <- get_ssm_by_sample(
     this_sample_id = "PA001", 
     augmented = FALSE,
     min_read_support = 3
)
Error in if (pair_status == "unmatched") { : argument is of length zero
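The two fixes could be sketched as an argument check like the one below. resolve_sample_args is a hypothetical helper, and it assumes the metadata has sample_id and seq_type columns; it is not the actual get_ssm_by_sample code.

```r
# Hypothetical argument check; the parameter names follow the ones used in this issue.
resolve_sample_args <- function(this_sample_id = NULL,
                                these_samples_metadata = NULL,
                                seq_type = NULL) {
  if (!is.null(these_samples_metadata)) {
    # Metadata wins: pull the id (if not given) and the seq_type from it.
    if (is.null(this_sample_id)) {
      this_sample_id <- these_samples_metadata$sample_id[1]
    }
    row <- these_samples_metadata[these_samples_metadata$sample_id == this_sample_id, ]
    if (nrow(row) == 0) stop("this_sample_id not found in these_samples_metadata")
    return(list(sample_id = this_sample_id, seq_type = row$seq_type[1]))
  }
  if (is.null(this_sample_id)) {
    stop("Provide either these_samples_metadata or this_sample_id")
  }
  if (is.null(seq_type)) {
    stop("seq_type is required when these_samples_metadata is not provided")
  }
  list(sample_id = this_sample_id, seq_type = seq_type)
}

# Toy metadata row for the sample in the failing example above.
meta <- data.frame(sample_id = "PA001", seq_type = "capture",
                   stringsAsFactors = FALSE)
from_meta <- resolve_sample_args(these_samples_metadata = meta)
from_id <- resolve_sample_args(this_sample_id = "PA001", seq_type = "capture")
```

With a check like this, the failing call would stop early with a message about seq_type instead of the "argument is of length zero" error.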

COMPOSITE class as default is a bug

There's a bug in get_gambl_metadata that causes any case without a LymphGen class to be assigned as COMPOSITE. This should be changed to either set them to "Other" or, better yet, a class that isn't part of LymphGen but also doesn't imply a composite class.

get_cn_segments is broken

It looks like the only way the get_cn_segments function works is when the database is used. The from_flatfile option is available but when I try supplying it I get an error:

my_segments = get_cn_segments(region="chr8:128,723,128-128,774,067",from_flatfile=TRUE)
 Error in rbind(all_segs_matched, all_segs_unmatched) : 
object 'all_segs_matched' not found

SRAdb dependency error

While running devtools::install() from within my RStudio on gphost10 I ran into this error:

* installing to library ‘/home/lhilton/R/x86_64-centos7-linux-gnu-library/3.6’
ERROR: dependency ‘SRAdb’ is not available for package ‘GAMBLR’
* removing ‘/home/lhilton/R/x86_64-centos7-linux-gnu-library/3.6/GAMBLR’

I was able to manually install SRAdb with BiocManager::install("SRAdb"). After that GAMBLR installed without error.

get_mutation_frequency_bin_matrix example in vignette causes R session to hang

This issue is self-explanatory. It would be good to know if someone else can reproduce this issue or if this was specific to my session.

pretty_lollipop_plot example in vignette not working

#load maf data.
maf = get_coding_ssm(limit_samples = metadata$sample_id, basic_columns = TRUE)

#construct pretty_lollipop_plot.
pretty_lollipop_plot(maf_df = maf, #a data frame containing the mutation data (from a MAF).
                     gene = "MYC", #the gene symbol to plot.
                     plot_title = "Mutation data for MYC", #optional (defaults to gene name).
                     plot_theme = "blue") #Options: cbioportal(default), blue, simple, nature, nature2, ggplot2, and dark.

Running the example above gives the following error:

Error in mapMutationTypeToMutationClass(maf.df[, variant.class.col], mutation.type.to.class.df) :
object 'mutation.table.df' not found