tidyomics / plyranges Goto Github PK

A grammar of genomic data transformation

Home Page: https://tidyomics.github.io/plyranges/

Makefile 0.07% R 99.93%

bioconductor data-analysis dplyr genomic-ranges genomics tidy-data

plyranges's Introduction

Overview

The tidyomics ecosystem is a set of packages for omics data analysis that work together in harmony; they share common Bioconductor data representations and API design consistent with the tidyverse ecosystem. The tidyomics package is designed to make it easy to install and load core packages from the tidyomics ecosystem with a single command.

The core packages are:

tidyomics_packages()

#   [1] "tidySummarizedExperiment" "tidySingleCellExperiment"
#   [3] "tidyseurat"               "tidybulk"                
#   [5] "plyranges"                "nullranges"              
#   [7] "purrr"                    "rlang"                   
#   [9] "stringr"                  "cli"                     
#  [11] "utils"                    "tidyomics"

The tidyomics ecosystem

You can find out more about each package in the tidyomics ecosystem here:

Package	Intro	GitHub	Description
tidybulk	Vignette	GitHub	Tidy bulk RNA-seq data analysis
tidySummarizedExperiment	Vignette	GitHub	Tidy manipulation of SummarizedExperiment objects
tidySingleCellExperiment	Vignette	GitHub	Tidy manipulation of SingleCellExperiment objects
tidySeurat	Vignette	GitHub	Tidy manipulation of Seurat objects
tidySpatialExperiment	Vignette	GitHub	Tidy manipulation of SpatialExperiment objects
plyranges	Vignette	GitHub	Tidy manipulation of genomics ranges
plyinteractions	Vignette	GitHub	Tidy manipulation of genomic interactions
nullranges	Vignette	GitHub	Generation of null genomic range sets

Installation

Installing the tidyomics package will install all core packages of the tidyomics ecosystem. The tidyomics package can be installed from Bioconductor:

BiocManager::install("tidyomics")

plyinteractions and tidySpatialExperiment are two new packages in the tidyomics ecosystem. plyinteractions and tidySpatialExperiment are both ready to use and are available in Bioconductor. The packages are now reaching maturity and will be added to the core packages for automatic installation mid-2024.

For the time being, plyinteractions and tidySpatialExperiment can be installed independently:

BiocManager::install("plyinteractions")
BiocManager::install("tidySpatialExperiment")

Loading the tidyomics ecosystem

The core tidyomics packages can be attached with:

library(tidyomics)

This command also produces a summary of package versions and function conflicts. Function conflicts are a point of ongoing development and will be addressed over time.

plyinteractions and tidySpatialExperiment can be loaded independently:

library(plyinteractions)
library(tidySpatialExperiment)

You are now ready to start using the tidyomics ecosystem.

plyranges's People

Contributors

Stargazers

Watchers

Forkers

hongzhonglu shians iimog jonocarroll alexyfyf rowancallahan shengwang whtns kmckanic xchromosome219 mikelove liupfskygre arrendi wisren sppearce guokai8 snystrom csoneson immuneaxisa

plyranges's Issues

coverage

Was looking at set_coverage(). What is the rationale for using the "set" verb there, since we are generating an entirely new set of ranges? Seems like it should be compute_coverage() or similar. Also, I think compute_coverage() or whatever we call it could just delegate to coverage() by default. Then, it calls some generic (maybe as_ranges()?) to generate a better data structure than an Rle or RleList. Rle would map to an IRanges, RleList to GRanges. This design lets us automatically support other coverage() methods, like BamFile and character (file path).

bedtools tutorial

It's possible to still include this in the pkgdown site but not have it built as a vignette on the Bioconductor side.

Steps for doing so

create inst/tutorials/
write the tutorial

Should Ranges arithmetic ignore strandedness by default?

Did we agree that any GenomicRanges that have an ignore.strand should be TRUE by default?

documentation with examples

a few methods have very sparse docs or are unclear in what they do...

run Rbioc cmd check
make sure no errors or warnings in R CMD CHECK
complete docs

promote joins to S4 methods?

mainly cause it would be nice to have double dispatch and so other developers could implement them on classes like RangedSummarisedExperiment etc

Suggestion: Add figures of overlap operations to vignette

Hey Stuart,

Looks like this is coming along nicely. A superficial suggestion and probably something you've thought of: perhaps you could add figures/cartoons to illustrate what's happening with overlap-based ops on Ranges (https://github.com/sa-lee/plyranges/blob/master/vignettes/introduction-plyranges.md#joins-or-another-way-at-looking-at-overlaps-between-ranges). I think that will help many users understand the output (I certainly struggle with just the R output).

The IRanges vignette has some, but I was really thinking about the bedtools docs (e.g., http://bedtools.readthedocs.io/en/latest/content/tools/merge.html) which I found very helpful when reasoning about or teaching interval algebra. Oh, I just realised that HelloRanges uses some of the bedtools figs http://bioconductor.org/packages/release/bioc/vignettes/HelloRanges/inst/doc/tutorial.pdf ...

And if you happen to write a function that makes pretty pictures of Ranges and overlaps I'm sure that'd be useful for teaching purposes

as_(i)(g)ranges method

in place of the Ranges constructor I've currently got.

Unexpected behaviour with stretching w/o anchor

Hi,

I'm trying to stretch a genomic region using different anchoring. I also tried without anchoring, which yielded the (unexpected) result below:

> x = makeGRangesFromDataFrame(data.frame(
    seqnames=1:2, start=c(10,100), end=c(200,300), strand=c('+','-')))

> x
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1    10-200      +
  [2]        2   100-300      -

> stretch(x, 2)
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1   104-106      +
  [2]        2   199-201      -

Why does it change the regions this way? (anchored stretching works as expected)

Using IRanges_2.14.10 and plyranges_1.0.3 (current Bioconductor release).

summarize doesn't work with weighted mean

library(plyranges)

set.seed(2018)
n <- 10
r <- GRanges(seqnames = rep("chr1", n),
                  ranges = IRanges(start = sample(20, n, replace = T),
                                   width = sample(6,  n, replace = T))
                  )
mcols(r) <- data.frame(score = runif(n, 0, 100), condition = rep_len(c("One","Two"), n))
r

## GRanges object with 10 ranges and 2 metadata columns:
##        seqnames    ranges strand |            score condition
##           <Rle> <IRanges>  <Rle> |        <numeric>  <factor>
##    [1]     chr1       7-9      * | 26.0100793326274       One
##    [2]     chr1     10-13      * | 56.8360368022695       Two
##    [3]     chr1       2-7      * | 15.1498265331611       One
##    [4]     chr1       4-8      * | 9.49368453584611       Two
##    [5]     chr1     10-14      * |  76.191659620963       One
##    [6]     chr1      7-10      * | 53.5928548080847       Two
##    [7]     chr1     13-14      * | 32.7880936674774       One
##    [8]     chr1       3-6      * | 92.4356027971953       Two
##    [9]     chr1     20-24      * | 59.8311740672216       One
##   [10]     chr1     11-15      * | 6.46366507280618       Two
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

These work:

r %>% group_by(condition) %>% summarize(score = mean(score))

## DataFrame with 2 rows and 2 columns
##   condition            score
##    <factor>        <numeric>
## 1       One 41.9941666442901
## 2       Two 43.7643688032404

r %>% group_by(condition) %>% summarize(score = mean(width))

## DataFrame with 2 rows and 2 columns
##   condition     score
##    <factor> <numeric>
## 1       One       4.2
## 2       Two       4.4

r %>% group_by(condition) %>% reduce_ranges(score = weighted.mean(score, width))

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand | condition            score
##          <Rle> <IRanges>  <Rle> |  <factor>        <numeric>
##   [1]     chr1      2-14      * |       One 38.4664801647887
##   [2]     chr1     20-24      * |       One 59.8311740672216
##   [3]     chr1      3-15      * |       Two 40.5111238942482
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

... but this doesn't:

r %>% group_by(condition) %>% summarize(score = weighted.mean(score, width))

## Error in as.vector(x, mode): coercing an AtomicList object to an atomic vector is supported only for
##   objects with top-level elements of length <= 1

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Fedora 28 (Workstation Edition)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=ru_RU.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] knitr_1.20           rtracklayer_1.40.3   plyranges_1.1.6     
## [4] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0  IRanges_2.14.10     
## [7] S4Vectors_0.18.2     BiocGenerics_0.26.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17                pillar_1.2.3               
##  [3] BiocInstaller_1.30.0        compiler_3.5.0             
##  [5] git2r_0.21.0                XVector_0.20.0             
##  [7] bindr_0.1.1                 bitops_1.0-6               
##  [9] tools_3.5.0                 zlibbioc_1.26.0            
## [11] digest_0.6.15               evaluate_0.10.1            
## [13] tibble_1.4.2                memoise_1.1.0              
## [15] lattice_0.20-35             pkgconfig_2.0.1            
## [17] rlang_0.2.1                 Matrix_1.2-14              
## [19] DelayedArray_0.6.0          curl_3.2                   
## [21] yaml_2.1.19                 bindrcpp_0.2.2             
## [23] GenomeInfoDbData_1.1.0      stringr_1.3.1              
## [25] withr_2.1.2                 httr_1.3.1                 
## [27] dplyr_0.7.5                 Biostrings_2.48.0          
## [29] devtools_1.13.5             tidyselect_0.2.4           
## [31] grid_3.5.0                  glue_1.2.0                 
## [33] Biobase_2.40.0              R6_2.2.2                   
## [35] XML_3.98-1.11               BiocParallel_1.14.1        
## [37] tidyr_0.8.1                 purrr_0.2.4                
## [39] magrittr_1.5                Rsamtools_1.32.0           
## [41] matrixStats_0.53.1          GenomicAlignments_1.16.0   
## [43] assertthat_0.2.0            SummarizedExperiment_1.10.1
## [45] stringi_1.2.2               RCurl_1.95-4.10

Implement gather/spread and unite/extract from tidyr

Related to #16 (the last check box)

I feel like these names in tidyr are confusing - perhaps we should rename them?

gather takes a data.frame from wide to long (gathers columns into a key value pairs)

https://tidyr.tidyverse.org/reference/gather.html
For a GRanges object this should only be able to performed on the metadata and not touch the core GRanges. But what about the case of having a GAlignmentPairs or DataFrame with GRanges columns? There should be another verb maybe gather_as_granges or something similar?

For the formal arguments of gather I don't think anything after the ... is relevant, ideally the key should be an RLE column - which does suggest we use a new verb?

spread is the inverse of gather
https://tidyr.tidyverse.org/reference/spread.html
This has the ability to modify the type of an input GRanges - again suggesting a new verb?

unite/ extract combine multiple columns into one or takes one column and splits them to many.
It is probably a useful helper function given how sample names are usually compositional.
https://tidyr.tidyverse.org/reference/unite.html

joins (migrated from query)

Implement on ranges objects.

Some relevant literature from database theory:
GenAp a distributed SQL for genomics using spark

review paper on interval joins

book on temporal databases - in general it would be useful to learn more database theory for this project

slide deck on temporal databases from theory of databases class at Warwick. The algebra covers most use cases encountered in GenomicRanges:

And some new verbs that could be useful such as collapse, expand, pack/unpack

dplyr-like API for joins . not sure the operator ojoin is clearest syntax here. Limiting directions to left and inner for overlap joins. Generally we think that the overlapping start/end/equal are not particularly useful.

join type	description	targets
`left_join`, `right_join` `full_join`, `inner_join` etc.	usual SQL like joins in `dplyr`	acts only on mcols slot of Ranges
`find_overlaps`	findOverlaps but acts as user expects with strand, takes into account grouping	GenomicRanges/Ranges
`group_by_overlaps`	groups matches by the query hit	GenomicRanges/Ranges
`count_overlaps`	convenience function since this is a common operations	GenomicRanges/Ranges
`ojoin_self`	findOverlapPairs on a single ranges object	GenomicRanges/Ranges
`ojoin_inner` /`ojoin_inner`	overlaps operator described above. Implemented via findOverlapsPairs(type = "any") + pairs operator such as union/intersect. Can match on an additional key which makes it slightly different to find_overlaps.	GenomicRanges/Ranges
`*_within`	same functions as above but with type = "within"	GenomicRanges/Ranges
`*_includes`	inverse of within functions	GenomicRanges/Ranges
`join_nearest`/`join_nearest_left/right`/`join_nearest_up/downstream`	nearest analogue	GenomicRanges/Ranges
`join_precede`/`join_precede_left/right`/`join_precede_up/downstream`	precede analogue	GenomicRanges/Ranges
`join_follow`/`join_follow_left/right`/`join_follow_up/downstream`	follow analogue	GenomicRanges/Ranges

coordinate mapping

Mapping tools like mapToTranscripts() from the GenomicFeatures package are very useful. Is it worth having wrappers map_to_transcripts() and map_from_transcripts() that use grouped GRanges instead of GRangesList?

`combine_ranges()`

It is not clear what this function does from the name. Is the "interweaving" of the left and right that useful in general? If we want to keep it, maybe interweave() is a better name? Or maybe the tidyverse already has one...

Is there a tidyverse verb that concatenates two or more objects while somehow adding a column indicating their origin?

I could imagine simplifying the last example in the README with

interweave(left=left, right=right)

It adds a column indicating the origin as either "left" or "right". That resembles the mstack() function in IRanges. It's not perfect, in part because it creates the origin column (what do we even call it?) implicitly. The closest tidyverse analog might be

melt(list(left=left, right=right))

which also allows specifying the name of the distinguishing column.

Of course, in general we do not want to interweave, just combine. Maybe the general verb would be combine_groups() or similar?

general improvements to documentation

some functions have very sparse/unclear documentation, especially the read_ methods.
formatting issues in documentation

Formalise anchoring

implement anchors as a delegating class on top of I/GRanges
anchors should be length 1
fix bugs in anchoring by strand

summarise should return a DataFrame

this also removes dependency on tibble

Aware of valr?

Stumbled across this (never used it)

https://f1000research.com/articles/6-1025/v1
http://valr.hesselberthlab.org/

anchoring bug

From Bioconductor/Contributions#670

The anchoring mechanism only works properly on IRanges and GRanges instances. When used on an IRanges or GRanges derivative, it returns an IRanges or GRanges instance which is not what one would expect. For example, with a VRanges object (derives from GRanges):

library(VariantAnnotation)
example(VRanges)
set_width(anchor_start(vr), 20)
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | tumorSpecific
|
[1] chr1 1-20 * | FALSE
[2] chr2 10-29 * | TRUE

seqinfo: 2 sequences from an unspecified genome; no seqlengths

filter should follow the dplyr generic

currently we have

filter(.data, expr)

but it should match the dplyr generic

filter(.data, ...)

blocks from BED files

Right now the user can get the transcript/exon structure from a BED file using the blocks() function, which converts the "blocks" column to a GRangesList. I think we'd rather have a GRanges of exons grouped by transcript. Since this is specific to BED files though, I wonder whether we should instead just have a read_bed_transcripts() function that goes directly from BED to the transcript GRanges.

R>3.5.0 requirements

Hi,

Your DESCRIPTION file specifies that R>3.5.0 is required which has not been released yet. This makes your Github HEAD not installable with the current R release (which is, I guess, roughly 100% of your users).

I found a warning on the Bioconductor build process, which I think is the reason why you changed it, or is there any other reason as well?

If not, you might want to have a separate branch for the Bioconductor submission for now so that people can actually use your package with the current stable R.

other endomorphic methods

Some gaps in the API that are covered by GenomicRanges are slidingWindows and tile, we could make the following:
set_tiles
set_windows

refactor methods to reflect changes to Bioc3.7

From Bioconductor/Contributions#670

Some important changes were made in BioC 3.7 to the IRanges and GRanges class hierarchies. In particular the Ranges class in BioC 3.6 got renamed IntegerRanges in BioC 3.7. The Ranges class in BioC 3.7 is actually a new virtual class that is the common parent of the IntegerRanges, GenomicRanges, and GAlignments classes. We didn't have a common parent for these 3 classes before.

You probably want to revisit your use of Ranges in the light of that. For example, here you only need one method, the method for Ranges objects:

plyranges:::set_width.Ranges
function (x, width = 0L)
{
width(x) <- width
x
}
<bytecode: 0x29e36918>
<environment: namespace:plyranges>
plyranges:::set_width.GenomicRanges
function (x, width = 0L)
{
width(x) <- width
x
}
<bytecode: 0x1b693a60>
<environment: namespace:plyranges>

You have a lot of methods defined for Ranges and GenomicRanges objects. Everywhere the foo.Ranges and foo.GenomicRanges methods are identical (like with set_width.Ranges and set_width.GenomicRanges above), you only need the 1st one. It will work on any Ranges derivative for which width<- works. If the methods are not identical, like in the case of reduce_ranges.Ranges and reduce_ranges.GenomicRanges, it's probably safer to replace reduce_ranges.Ranges with reduce_ranges.IntegerRanges since the reduce_ranges.Ranges method is only guaranteed to do the right thing on an IntegerRanges derivative but not on a range-based object that inherits from Ranges in general.

dplyr::n() method

current approach attempts to capture an n() call before it hits the work horse functions in summarise, filter, mutate etc.

@lawremi suggested another approach by writing out a local function in these methods, so they understand to run length on the data.

We also discussed that this implies there should be some kind of pronoun to represent the entire ranges object and to have methods act in special ways when the keyword is mentioned.

anchoring redesign.

There was some general confusion about this during the talk/workshop at Bioc2018. People felt it was not intuitive that the anchoring is dropped after a mutate call and that anchoring decorates a GRanges object. There was a suggestion of making anchoring act like dplyr's scoped variants so something like:

gr %>% mutate_at_anchor(..., anchor = "start")

could be a possibility. I think I need to think about this a bit more though.

anchor functions should append metadata slot

note to myself - the current implementation considers anchoring to be done in isolation but this needs to be fixed if functions are to be chained - probably easiest to implement via a while loop

vignettes as case studies

I think the following would give sufficient coverage of the API

case study on the join grammar: overlap joins between differentially methylated regions and cpg islands. This could detail both joins and the arithmetic of resizing/anchoring.
case study on aggregation: summarising overlaps by some feature of interest, the grouping classes. one example would be average coverage over some genomic feature (group_by_overlaps)

An `mtcars` or `diamonds` dataset for plyranges

I feel like this is something ggplot and dplyr do quite well in that they provide concrete examples via a simple dataset. I think this makes things simpler to learn and gives intuition about how the API works.

Is there anything equivalent we could use for plyranges?

`group_by` should throw an error here

ht @mtmorgan

gr <- GRanges("chr1", IRanges(start = 1:6, width = 10))
foo <- 1:6
group_by(gr, foo)

'optimising' group_by operations

See aggregateWithDots in https://github.com/Bioconductor/S4Vectors/blob/master/R/aggregate-methods.R for some ideas of how to do this.

The current implementations evals in individual slices rather than over all the groups.

structure of vignettes / new vignettes

We have some good material now for actually using plyranges thanks to the BiocWorkshops which should be linked in both the README and also the intro vignette.

In my opinion, the intro vignette should not reference how things are done / comparisons to GenomicRanges API. That could be split out into a small vignette for users that are actually interested in those details.

Since the overlaps ideas seem to be a point of pain for all users, I think more concrete examples are needed here in a separate vignette. I think we could use the pasillaBamSubset to incorporate the more specialised BAM reading functionality along with this.

So structure would be:

introduction (include link to workshops) (this I think of as plyranges for users of tidyverse)
plyranges for users of GenomicRanges
overlaps

We are introducing two new (short) vignettes, plus a more concrete introduction

`read_bam()` return value

Current read_bam() returns a GAlignments or GAlignmentsList. I think we want to hide that data structure from the user. It could instead return a GRanges (GroupedGRanges for a GAlignmentsList) with fields from the GAlignments as metadata columns. There then needs to be a way to do the equivalent of grglist(ga), i.e., chop the ranges (into a GroupedGRanges) by introns or any gap. Maybe chop_by_introns() and chop_by_gaps() (both N's and D's).

unit tests checklists

test-overlaps
test-nest
test-io
test-disjoin

Goal is to get coverage to about 70% before release

Deferred I/O

To summarise our discussion earlier here's what we're going to implement with regards to I/O (see also #15):

read_bam with paired = TRUE returns a grouped granges grouped by pair.
implement DeferredGRanges class with a cache slot and an ops to allow for deferred reading of data inputs.
implement specs for filtering and selection on the DeferredRanges according to @lawremi suggestion on Oct 17 for issue #1
chop functions
implement method for putting GRanges from wide to long form.

how flat is too flat?

for things like join_overlap_inner_within_directed() have we gone too far? should we be passing things to arguments here instead? I'm not sure how much of a big deal this is with tab completion in most IDEs but it does seem inelegant.

flank() and ignore.strand

It looks like the flank_ functions are assuming that ignore.strand defaults to TRUE, but it defaults to FALSE.

group_by, filter, and reduce_ranges don't work together

Hi, great package!

However, I ran into a strange problem: group_by, filter, and reduce_ranges seem to work individually and in pairs, but not when I apply all three. Here's a self-contained example.

library(plyranges)

set.seed(2018)
n <- 10
r <- GRanges(seqnames = rep("chr1", n),
                  ranges = IRanges(start = sample(20, n, replace = T),
                                   width = sample(6,  n, replace = T))
                  )
mcols(r) <- data.frame(score = runif(n, 0, 100), condition = rep_len(c("One","Two"), n))
r

## GRanges object with 10 ranges and 2 metadata columns:
##        seqnames    ranges strand |            score condition
##           <Rle> <IRanges>  <Rle> |        <numeric>  <factor>
##    [1]     chr1       7-9      * | 26.0100793326274       One
##    [2]     chr1     10-13      * | 56.8360368022695       Two
##    [3]     chr1       2-7      * | 15.1498265331611       One
##    [4]     chr1       4-8      * | 9.49368453584611       Two
##    [5]     chr1     10-14      * |  76.191659620963       One
##    [6]     chr1      7-10      * | 53.5928548080847       Two
##    [7]     chr1     13-14      * | 32.7880936674774       One
##    [8]     chr1       3-6      * | 92.4356027971953       Two
##    [9]     chr1     20-24      * | 59.8311740672216       One
##   [10]     chr1     11-15      * | 6.46366507280618       Two
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

These all work:

r %>% group_by(condition) %>% reduce_ranges

## GRanges object with 3 ranges and 1 metadata column:
##       seqnames    ranges strand | condition
##          <Rle> <IRanges>  <Rle> |  <factor>
##   [1]     chr1      2-14      * |       One
##   [2]     chr1     20-24      * |       One
##   [3]     chr1      3-15      * |       Two
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

r %>% filter(score > 2) %>% reduce_ranges

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      2-15      *
##   [2]     chr1     20-24      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

r %>% group_by(condition) %>% filter(score > 2)

## GRanges object with 10 ranges and 2 metadata columns:
## Groups: condition [2]
##        seqnames    ranges strand |            score condition
##           <Rle> <IRanges>  <Rle> |        <numeric>  <factor>
##    [1]     chr1       7-9      * | 26.0100793326274       One
##    [2]     chr1     10-13      * | 56.8360368022695       Two
##    [3]     chr1       2-7      * | 15.1498265331611       One
##    [4]     chr1       4-8      * | 9.49368453584611       Two
##    [5]     chr1     10-14      * |  76.191659620963       One
##    [6]     chr1      7-10      * | 53.5928548080847       Two
##    [7]     chr1     13-14      * | 32.7880936674774       One
##    [8]     chr1       3-6      * | 92.4356027971953       Two
##    [9]     chr1     20-24      * | 59.8311740672216       One
##   [10]     chr1     11-15      * | 6.46366507280618       Two
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

... but this doesn't:

r %>% group_by(condition) %>% filter(score > 2) %>% reduce_ranges

## Error: subscript contains invalid names

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Fedora 28 (Workstation Edition)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=ru_RU.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] plyranges_1.0.1      GenomicRanges_1.32.3 GenomeInfoDb_1.16.0 
## [4] IRanges_2.14.10      S4Vectors_0.18.2     BiocGenerics_0.26.0 
## [7] knitr_1.20          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17                compiler_3.5.0             
##  [3] pillar_1.2.3                XVector_0.20.0             
##  [5] bindr_0.1.1                 bitops_1.0-6               
##  [7] tools_3.5.0                 zlibbioc_1.26.0            
##  [9] evaluate_0.10.1             tibble_1.4.2               
## [11] lattice_0.20-35             pkgconfig_2.0.1            
## [13] rlang_0.2.0                 Matrix_1.2-14              
## [15] DelayedArray_0.6.0          bindrcpp_0.2.2             
## [17] GenomeInfoDbData_1.1.0      rtracklayer_1.40.2         
## [19] stringr_1.3.1               dplyr_0.7.5                
## [21] Biostrings_2.48.0           grid_3.5.0                 
## [23] tidyselect_0.2.4            glue_1.2.0                 
## [25] Biobase_2.40.0              R6_2.2.2                   
## [27] XML_3.98-1.11               BiocParallel_1.14.1        
## [29] tidyr_0.8.1                 purrr_0.2.4                
## [31] magrittr_1.5                Rsamtools_1.32.0           
## [33] matrixStats_0.53.1          GenomicAlignments_1.16.0   
## [35] assertthat_0.2.0            SummarizedExperiment_1.10.1
## [37] stringi_1.2.2               RCurl_1.95-4.10

winding and unwinding

MongoDb's aggregation pipeline has the verb unwind() that is analogous to S4Vectors::expand(). Given a list (nested) column, it unlists the column and repeats the other columns accordingly. I prefer expand() and collapse() but dplyr already uses expand() as an alias for expand.grid() (why not expand_grid()?). Anyway, we could use the notion of "wound" to represent nested data, like how a GRangesList is a nested GRanges. So we could have something like:

tx <- wind(exons, tx_id, optionally_more_tx_level_columns...)
exons <- unwind(tx)

I guess in general wind() would convert its argument to a list of data frames, and make it a column in a new data frame that also includes the columns used in the winding (like the tx-level columns in this case). When called on a GRanges, it would create a GRangesList, but the API would treat it as a GRanges-like table, not a list. Hopefully this is effectively endomorphic enough.

In general unwind() would take a second argument for the nested column to unwind by. If there is no second argument, the rule is for the object itself to be unwound. So a GRangesList is expanded to a GRanges.

Does that make sense?

virtual columns for Seqinfo objects

see commit 11f7ae0 where the function set_genome_info appends the seqinfo object to the Ranges object. This has the flaw that you are unable to group by these columns since having them in the mcols slot means the GenomicRanges object is invalid. Alternatives proposed to fix this problem are:

rename them to be valid cols i.e.:
- seqlengths --> lengths; isCircular --> is_circular; genome --> reference
  This might be confusing for the user since it means having to remember more names.
virtual columns, this could be achieved in the following ways:
1. create a GRanges class with a virtual mcols slot to store the seqinfo columns
2. bind the seqinfo element names to the ranges environment when a summarise/mutate/filter call is made. i.e. modify as.env.
3. store the seqinfo columns that are parallel to the ranges in the metadata slot.

Preference is probably for 2, which I guess would make the set_genome_info function redundant.

import helpers

these are lightweight wrappers to the import family of functions in Rtracklayer and
Rsamtools::scanBam that have consistent arguments, and should always return a Ranges (this may be slightly harder for BAM files)

We need to read chain files to use rtracklayer::liftOver(). Chain objects are not GRanges, but they could be. For the sake of consistency, it would be nice to have:

read_chain

parallelisation

from @lawremi:

It would be interesting to explore an API based on BiocParallel; something like:

Set the BiocParallelParam:
parallelize(x, param)

Specify chunking:
chunk_by(x, variable) # often seqnames
chunk_count(x, count)
chunk_size(x, size)
chunk_by_overlap(x, ranges) # like tiles

That might help with our argument about scalability.

library(dplyr)
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                 start = 1:5,
                 width = 5)
plyranges::as_granges(df) %>%
    plyranges::filter(seqnames == "chr1")

However, the following does not:

library(dplyr)
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                 start = 1:5,
                 width = 5)
plyranges::as_granges(df) %>%
    plyranges::filter(seqnames %in% c("chr1", "chr2"))

Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments

The latter works again when attaching the package, but there are cases when you would like to avoid a utility script to clutter your global namespace.