smartgamer / bioc-refcard Goto Github PK
View Code? Open in Web Editor NEWThis project forked from mikelove/bioc-refcard
Bioconductor cheat sheet
This project forked from mikelove/bioc-refcard
Bioconductor cheat sheet
# Bioconductor cheat sheet Also useful are Kasper Daniel Hansen's [Bioconductor class notes](http://kasperdanielhansen.github.io/genbioconductor/). ## Install ``` source("http://bioconductor.org/biocLite.R") biocLite() biocLite(c("package1","package2")) library(BiocInstaller) biocVersion() # what Bioc version is on your machine? biocValid() # are packages up to date? # what Bioc version is release http://bioconductor.org/bioc-version # what Bioc versions are release/devel http://bioconductor.org/js/versions.js ``` more information at http://www.bioconductor.org/install/ ## help within R Simple help: ``` ?functionName ?"eSet-class" # classes need the '-class' on the end help(package="foo",help_type="html") # launch web browser help vignette("topic") browseVignettes(package="package") # show vignettes for the package ``` Help for advanced users: ``` functionName # prints source code getMethod(method,"class") # prints source code for method selectMethod(method, "class") # will climb the inheritance to find method showMethods(classes="class") # show all methods for class methods(class="GRanges") # this will work in R >= 3.2 ?"functionName,class-method" # method help for S4 objects, e.g.: ?"plotMA,data.frame-method" # from library(geneplotter) ?"method.class" # method help for S3 objects e.g.: ?"plot.lm" sessionInfo() # necessary info for getting help packageVersion("foo") # what version of package ``` Bioconductor support website: https://support.bioconductor.org If you use RStudio, then you already get nicely rendered documentation using `?` or `help`. If you are a command line person, then you can use this alias to pop up a help page in your web browser with `rhelp functionName packageName`. ``` alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args" ``` ## debugging R ``` traceback() # what steps lead to an error # debug a function debug(myFunction) # step line-by-line through the code in a function undebug(myFunction) # stop debugging debugonce(myFunction) # same as above, but doesn't need undebug() # also useful if you are writing code is to put # the function browser() inside a function at a critical point # this plus devtools::load_all() can be useful for programming # to jump in function on error: options(error=recover) # turn that behavior off: options(error=NULL) # debug, e.g. estimateSizeFactors from DESeq2... # debugging an S4 method is more difficult; this gives you a peek inside: trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet") ``` ## Show package-specific methods for a class These two long strings of R code do approximately the same thing: obtain the methods that operate on an object of a given class, which are defined in a specific package. ``` intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2")) sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet", where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE)) ``` ## Annotations **Edit (May 2017)**: the below needs to be updated to show off AnnotationHub. For AnnotationHub examples, see: https://www.bioconductor.org/help/workflows/annotation/Annotation_Resources The following is how to work with the organism database packages, and biomart. [AnnotationDbi](http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) ``` # using one of the annotation packges library(AnnotationDbi) library(org.Hs.eg.db) # or, e.g. Homo.sapiens columns(org.Hs.eg.db) keytypes(org.Hs.eg.db) head(keys(org.Hs.eg.db, keytype="ENTREZID")) # returns a named character vector, see ?mapIds for multiVals options res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID") # generates warning for 1:many mappings res <- select(org.Hs.eg.db, keys=k, columns=c("ENTREZID","ENSEMBL","SYMBOL"), keytype="ENTREZID") ``` [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) ``` # map from one annotation to another using biomart library(biomaRt) m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") map <- getBM(mart = m, attributes = c("ensembl_gene_id", "entrezgene"), filters = "ensembl_gene_id", values = some.ensembl.genes) ``` ## Genomic ranges [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) ``` library(GenomicRanges) z <- GRanges("chr1",IRanges(1000001,1001000),strand="+") start(z) end(z) width(z) strand(z) mcols(z) # the 'metadata columns', any information stored alongside each range ranges(z) # gives the IRanges seqnames(z) # the chromosomes for each ranges seqlevels(z) # the possible chromosomes seqlengths(z) # the lengths for each chromosome ``` ### Intra-range methods Affects ranges independently function | description --- | --- shift | moves left/right narrow | narrows by relative position within range resize | resizes to width, fixing start for +, end for - flank | returns flanking ranges to the left +, or right - promoters | similar to flank restrict | restricts ranges to a start and end position trim | trims out of bound ranges +/- | expands/contracts by adding/subtracting fixed amount * | zooms in (positive) or out (negative) by multiples ### Inter-range methods Affects ranges as a group function | description --- | --- range | one range, leftmost start to rightmost end reduce | cover all positions with only one range gaps | uncovered positions within range disjoin | breaks into discrete ranges based on original starts/ends ### Nearest methods Given two sets of ranges, `x` and `subject`, for each range in `x`, returns... function | description --- | --- nearest | index of the nearest neighbor range in subject precede | index of the range in subject that is directly preceded by the range in x follow |index of the range in subject that is directly followed by the range in x distanceToNearest | distances to its nearest neighbor in subject (Hits object) distance | distances to nearest neighbor (integer vector) A Hits object can be accessed with `queryHits`, `subjectHits` and `mcols` if a distance is associated. ### set methods If `y` is a GRangesList, then use `punion`, etc. All functions have default `ignore.strand=FALSE`, so are strand specific. ``` union(x,y) intersect(x,y) setdiff(x,y) ``` ### Overlaps ``` x %over% y # logical vector of which x overlaps any in y fo <- findOverlaps(x,y) # returns a Hits object queryHits(fo) # which in x subjectHits(fo) # which in y ``` ### Seqnames and seqlevels [GenomicRanges](http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html) and [GenomeInfoDb](http://www.bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html) ``` gr.sub <- gr[seqlevels(gr) == "chr1"] seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1' ``` ## Sequences [Biostrings](http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html) see the [Biostrings Quick Overview PDF](http://www.bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) For naming, see [cheat sheet for annotation](http://genomicsclass.github.io/book/pages/annoCheat.html) ``` library(BSgenome.Hsapiens.UCSC.hg19) dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet # also Views() for Bioconductor >= 3.1 ``` ``` library(Biostrings) dnastringset <- readDNAStringSet("transcripts.fa") ``` ``` substr(dnastringset, 1, 10) # to character string subseq(dnastringset, 1, 10) # returns DNAStringSet Views(dnastringset, 1, 10) # lightweight views into object complement(dnastringset) reverseComplement(dnastringset) matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome vmatchPattern("ACGTT", dnastringset) # also vcountPattern letterFrequecy(dnastringset, "CG") # how many C's or G's # also letterFrequencyInSlidingView alphabetFrequency(dnastringset, as.prob=TRUE) # also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency # transcribe/translate for imitating biological processes ``` ## Sequencing data [Rsamtools](http://www.bioconductor.org/packages/release/bioc/html/Rsamtools.html) `scanBam` returns lists of raw values from BAM files ``` library(Rsamtools) which <- GRanges("chr1",IRanges(1000001,1001000)) what <- c("rname","strand","pos","qwidth","seq") param <- ScanBamParam(which=which, what=what) # for more BamFile functions/details see ?BamFile # yieldSize for chunk-wise access bamfile <- BamFile("/path/to/file.bam") reads <- scanBam(bamfile, param=param) res <- countBam(bamfile, param=param) # for more sophisticated counting modes # see summarizeOverlaps() below # quickly check chromosome names seqinfo(BamFile("/path/to/file.bam")) # DNAStringSet is defined in the Biostrings package # see the Biostrings Quick Overview PDF dnastringset <- scanFa(fastaFile, param=granges) ``` [GenomicAlignments](http://www.bioconductor.org/packages/release/bioc/html/GenomicAlignments.html) returns Bioconductor objects (GRanges-based) ``` library(GenomicAlignments) ga <- readGAlignments(bamfile) # single-end ga <- readGAlignmentPairs(bamfile) # paired-end ``` ## Transcript databases [GenomicFeatures](http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) ``` # get a transcript database, which stores exon, trancript, and gene information library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene # or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site) txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf") # or build a txdb from Biomart (however, not as easy to reproduce later) txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl") # in Bioconductor >= 3.1, also makeTxDbFromGRanges # saving and loading saveDb(txdb, file="txdb.sqlite") loadDb("txdb.sqlite") # extracting information from txdb g <- genes(txdb) # GRanges, just start to end, no exon/intron information tx <- transcripts(txdb) # GRanges, similar to genes() e <- exons(txdb) # GRanges for each exon ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene ebt <- exonsBy(txdb, by="tx") # similar but by transcript # then get the transcript sequence txSeq <- extractTranscriptSeqs(Hsapiens, ebt) ``` ## Summarizing information across ranges and experiments The SummarizedExperiment is a storage class for high-dimensional information tied to the same GRanges or GRangesList across experiments (e.g., read counts in exons for each gene). ``` library(GenomicAlignments) fls <- list.files(pattern="*.bam$") library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ebg <- exonsBy(txdb, by="gene") # see yieldSize argument for restricting memory bf <- BamFileList(fls) library(BiocParallel) register(MulticoreParam(4)) # lots of options in the man page # singleEnd, ignore.strand, inter.features, fragments, etc. se <- summarizeOverlaps(ebg, bf) # operations on SummarizedExperiment assay(se) # the counts from summarizeOverlaps colData(se) rowRanges(se) ``` My preferred quantification method is [Salmon](https://combine-lab.github.io/salmon/), with `--gcBias` option enabled unless you know there is no GC dependence in the data, followed by [tximport](http://bioconductor.org/pacakges/tximport). Here is an example of usage: ```{r} coldata <- read.table("samples.txt") rownames(coldata) <- coldata$id files <- coldata$files; names(files) <- coldata$id txi <- tximport(files, type="salmon", tx2gene=tx2gene) dds <- DESeqDataSetFromTximport(txi, coldata, ~condition) ``` Another fast Bioconductor read counting method is featureCounts in [Rsubread](http://www.bioconductor.org/packages/release/bioc/html/Rsubread.html). ``` library(Rsubread) res <- featureCounts(files, annot.ext="annotation.gtf", isGTFAnnotationFile=TRUE, GTF.featureType="exon", GTF.attrType="gene_id") res$counts ``` ## RNA-seq gene-wise analysis [DESeq2](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) My preferred pipeline for DESeq2 users is to start with a lightweight transcript abundance quantifier such as [Salmon](https://combine-lab.github.io/salmon/) and to use [tximport](http://bioconductor.org/packages/tximport), followed by `DESeqDataSetFromTximport`. Here, `coldata` is a *data.frame* with `group` as a column. ``` library(DESeq2) # from tximport dds <- DESeqDataSetFromTximport(txi, coldata, ~ group) # from SummarizedExperiment dds <- DESeqDataSet(se, ~ group) # from count matrix dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group) # minimal filtering helps keep things fast # one can set 'n' to e.g. min(5, smallest group sample size) keep <- rowSums(counts(dds) >= 10) >= n dds <- dds[keep,] dds <- DESeq(dds) res <- results(dds) # no shrinkage of LFC, or: res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs ``` [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) ``` # this chunk from the Quick start in the edgeR User Guide library(edgeR) y <- DGEList(counts=counts,group=group) keep <- filterByExpr(y) y <- y[keep,] y <- calcNormFactors(y) design <- model.matrix(~group) y <- estimateDisp(y,design) fit <- glmFit(y,design) lrt <- glmLRT(fit) topTags(lrt) # or use the QL methods: qlfit <- glmQLFit(y,design) qlft <- glmQLFTest(qlfit) topTags(qlft) ``` [limma-voom](http://www.bioconductor.org/packages/release/bioc/html/limma.html) ``` library(limma) design <- model.matrix(~ group) y <- DGEList(counts) keep <- filterByExpr(y) y <- y[keep,] y <- calcNormFactors(y) v <- voom(y,design) fit <- lmFit(v,design) fit <- eBayes(fit) topTable(fit) ``` [Many more RNA-seq packages](http://www.bioconductor.org/packages/release/BiocViews.html#___RNASeq) ## Expression set ``` library(Biobase) data(sample.ExpressionSet) e <- sample.ExpressionSet exprs(e) pData(e) fData(e) ``` ## Get GEO dataset ``` library(GEOquery) e <- getGEO("GSE9514") ``` ## Microarray analysis ``` library(affy) library(limma) phenoData <- read.AnnotatedDataFrame("sample-description.csv") eset <- justRMA("/celfile-directory", phenoData=phenoData) design <- model.matrix(~ Disease, pData(eset)) fit <- lmFit(eset, design) efit <- eBayes(fit) topTable(efit, coef=2) ``` ## iCOBRA performance metrics ``` library(iCOBRA) cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df) cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC") cobraplot <- prepare_data_for_plot(cp) plot_fdrtprcurve(cobraplot) # interactive shiny app: COBRAapp(cd) ```
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.