raphg / biostat-578
A repository for all of my teaching material
License: Creative Commons Zero v1.0 Universal
This led to problems seeing images on my system (R 3.0.2, Ubuntu 12.04):
Linux is case-sensitive for file and folder names; Windows usually is not. Your Git repo has Images spelled with a capital letter 'I', but not all .Rpres files match this. Some use both spellings.
$ grep -l 'images/' *.Rpres
Batch_effects.Rpres
Biology_basics.Rpres
Introduction_to_R.Rpres
Microarrays.Rpres
Probe_summary.Rpres
RNA-seq.Rpres
Sequence_analysis.Rpres
$ grep -l 'Images/' *.Rpres
Differential_expression.Rpres
Normalization.Rpres
Probe_summary.Rpres
RNA-seq.Rpres
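If the on-disk folder really is the capitalized Images/ (as described above), one possible scripted fix is to rewrite the lowercase references in the .Rpres files. This is a sketch, and fix_image_case() is a hypothetical helper name, not something from the repo:

```r
# Hypothetical helper (a sketch, not from the repo): rewrite lowercase
# "images/" references to the capitalized "Images/" so they match the
# folder name on a case-sensitive filesystem.
fix_image_case <- function(dir = ".") {
  for (f in list.files(dir, pattern = "\\.Rpres$", full.names = TRUE)) {
    txt <- readLines(f)
    fixed <- gsub("images/", "Images/", txt, fixed = TRUE)
    if (!identical(txt, fixed)) writeLines(fixed, f)
  }
}
```

Renaming the folder to lowercase and fixing the references the other way would work equally well; the point is that folder name and references must agree.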
Regarding "Advanced_data_manipulation.Rpres" and the slide "Creating your own SQLite database in R" ...
If the database tables ("hai", "cohort") already exist (i.e. the code has already been run once), the following code returns an error and the slide presentation will not continue.
db <- dbConnect(SQLite(), dbname="./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, header = TRUE, sep="\t")
Error: Table hai exists in database, and both overwrite and append are FALSE
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", row.names = FALSE, header = TRUE, sep="\t")
Error: Table cohort exists in database, and both overwrite and append are FALSE
These errors can be fixed by passing overwrite=TRUE to dbWriteTable().
db <- dbConnect(SQLite(), dbname="./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
We are running (R 3.0.2, Ubuntu 12.04)... and see the following issue...
Some .Rpres files have this:
# Let's first turn on the cache for increased performance.
# Set some global knitr options
library(knitr)
opts_chunk$set(cache=TRUE)
While others just have:
# Let's first turn on the cache for increased performance.
# Set some global knitr options
opts_chunk$set(cache=TRUE)
For the files without library(knitr), knitr may not be loaded when the presentation runs, and this call will fail (after a long wait).
Please add library(knitr) before every instance of opts_chunk$set(cache=TRUE).
Reference:
http://stackoverflow.com/questions/16752765/knitr-r-package-check-error-object-opts-chunk-not-found
Here are the files containing this opts_chunk setting; you can see which ones do not run library(knitr) directly beforehand:
$ grep -B1 -i 'opts_chunk$set(cache=TRUE)' *.Rpres
Advanced_data_manipulation.Rpres-# Set some global knitr options
Advanced_data_manipulation.Rpres:opts_chunk$set(cache=TRUE)
--
Batch_effects.Rpres-library(knitr)
Batch_effects.Rpres:opts_chunk$set(cache=TRUE)
--
Differential_expression.Rpres-# Set some global knitr options
Differential_expression.Rpres:opts_chunk$set(cache=TRUE)
--
GSEA.Rpres-library(knitr)
GSEA.Rpres:opts_chunk$set(cache=TRUE)
--
RNA-seq.Rpres-library(knitr)
RNA-seq.Rpres:opts_chunk$set(cache=TRUE)
--
Sequence_analysis.Rpres-# Set some global knitr options
Sequence_analysis.Rpres:opts_chunk$set(cache=TRUE)
You can see that 6 .Rpres files use the setting, but only 3 of them load knitr.
$ grep -l 'opts_chunk$set(cache=TRUE)' *.Rpres
Advanced_data_manipulation.Rpres
Batch_effects.Rpres
Differential_expression.Rpres
GSEA.Rpres
RNA-seq.Rpres
Sequence_analysis.Rpres
$ grep -l 'library(knitr)' *.Rpres
Batch_effects.Rpres
GSEA.Rpres
RNA-seq.Rpres
When running the code in a presentation, packages must be loaded before attempting to use them, or the commands will fail.
In Sequence_analysis.Rpres, these two packages are used without loading them:
IRanges
GenomicRanges
Please modify the presentation to load these packages before they are used.
Downloading the GSE29617 data file will fail with the code in Bioconductor_intro.Rpres:
> # Download all raw data. This should only be evaluated once
> # Then the data would be stored locally in the data directory
> getGEOSuppFiles("GSE29617", makeDirectory = FALSE, baseDir = "./Data/GEO/")
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE29nnn/GSE29617/suppl/"
Error in download.file(file.path(url, i), destfile = file.path(storedir, :
cannot open destfile './Data/GEO//GSE29617_RAW.tar', reason 'No such file or directory'
Please either add the ./Data/GEO/ folder to your Git repo or use an R command to create the destination folder first if it is not present:
dir.create(file.path("./Data/GEO"), showWarnings = FALSE)
With this change, you may be able to remove the second approach you mention next:
# If the above commands do not work, try:
getGEOSuppFiles("GSE29617", makeDirectory = TRUE, baseDir = "./")
Regardless, one or the other alternative should be commented out, since downloading twice is redundant. (And the downloads are slow.)
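As a further sketch, the folder-creation line can also create the parent ./Data directory in the same call:

```r
# recursive = TRUE also creates ./Data if it is missing;
# showWarnings = FALSE keeps re-runs silent.
dir.create("./Data/GEO", recursive = TRUE, showWarnings = FALSE)
```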
I found that the following change made it work for me (R 3.0.2, Ubuntu 12.04).
This:
xRle <- Rle(xVector)
Should be this:
xRle <- rle(xVector)
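For reference, base R's rle() computes run lengths directly; a quick demonstration follows. (The S4 analogue Rle() lives in the IRanges package and fails unless IRanges is loaded first, which may be what caused the original problem.)

```r
# rle() encodes consecutive runs of equal values.
x <- c("a", "a", "b", "b", "b", "a")
r <- rle(x)
# r$lengths is c(2, 3, 1); r$values is c("a", "b", "a")
```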
Apparently the T14 file is "problematic" and needs to be replaced with this one:
https://www.dropbox.com/s/k2vufecktt5nyy0/GSE45735_T14_pbmc.txt.gz
gd <- getGEO("GSE45735", destdir = "Data/GEO/")
pd <- pData(gd[[1]])
#getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = "Data/GEO/")
## The T14 file is problematic and needs to be fixed by hand
Please update the .Rpres file with the download link and where to place the downloaded file.
Please insert code to install Bioconductor packages before their first use.
Example:
This code fails because biocLite() was not used to install limma and edgeR before loading them with library():
> # Set some global knitr options
> library(knitr)
> opts_chunk$set(cache=TRUE)
> library(data.table)
> library(ggplot2)
> library(limma)
Error in library(limma) : there is no package called ‘limma’
> library(edgeR)
Error in library(edgeR) : there is no package called ‘edgeR’
> library(GEOquery)
> biocLite(c("limma, "edgeR")
Error: unexpected symbol in "biocLite(c("limma, "edgeR"
> biocLite(c("limma, "edgeR"))
Error: unexpected symbol in "biocLite(c("limma, "edgeR"
So we need to install limma and edgeR:
> biocLite(c("limma", "edgeR"))
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.13 (BiocInstaller 1.12.0), R version 3.0.2.
Installing package(s) 'limma' 'edgeR'
trying URL 'http://bioconductor.org/packages/2.13/bioc/bin/windows/contrib/3.0/limma_3.18.13.zip'
Content type 'application/zip' length 3190124 bytes (3.0 Mb)
opened URL
downloaded 3.0 Mb
trying URL 'http://bioconductor.org/packages/2.13/bioc/bin/windows/contrib/3.0/edgeR_3.4.2.zip'
Content type 'application/zip' length 1815887 bytes (1.7 Mb)
opened URL
downloaded 1.7 Mb
package ‘limma’ successfully unpacked and MD5 sums checked
package ‘edgeR’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Documents and Settings\high\Local Settings\Temp\2\RtmpuKjnTn\downloaded_packages
Now it works:
> # Let's first turn on the cache for increased performance.
> # Set some global knitr options
> library(knitr)
> opts_chunk$set(cache=TRUE)
> library(data.table)
> library(ggplot2)
> library(limma)
Attaching package: ‘limma’
The following object is masked from ‘package:BiocGenerics’:
plotMA
> library(edgeR)
> library(GEOquery)
This block of code did not work because dt_merged_melt was a data frame, not a data.table (R 3.0.2, Ubuntu 12.04).
This line is needed:
dt_merged_melt<-data.table(dt_merged_melt)
See full code running below.
> library(ggplot2)
> #library(reshape2) # Already had to do this earlier in the presentation!
> ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
Error in `[.data.frame`(dt_merged_melt, , list(mean = mean(value)), by = GC_count) :
unused argument (by = GC_count)
# Need this extra line...
> dt_merged_melt<-data.table(dt_merged_melt)
# Now it works!
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
> library(ggplot2)
> #library(reshape2) # Already had to do this earlier in the presentation!
> ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
Your version of this script has been modified to remove the ability to use require() to test whether a package is already installed. Since your version installs all packages whether or not they are already present, you may want to remove the section of the introductory comments that contradicts the new behaviour; these comments could confuse a user who expects the behaviour they describe. (Compare to my version for reference.)
# [...], either with "install.packages" or "biocLite",
# then tries to load them into memory with "require". The intent is to
# ensure that any packages required for the scripts to run have been
# installed previously.
#
# Only those packages which are not already installed will be installed.
# The "require" function is used to check those packages after
# they are installed to make sure they can load into memory. Therefore,
# even if packages have already been loaded with "library", this script
# should run very quickly once the packages have been initially installed.
It appears that a function has been renamed, which prevents the GSEA presentation code from working.
sets_indices <- symbols2indices(gene_ids, rownames(new_set))
Results in this error message:
Error: could not find function "symbols2indices"
From: http://www.bioconductor.org/packages/release/bioc/news/limma/NEWS
Changes in version 3.22.0:
symbols2indices() renamed to ids2indices().
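One way to keep the slide working across limma versions is a small compatibility shim. This is a sketch: indices_compat() is a hypothetical helper name, and the two function names are taken from the NEWS entry quoted above; it assumes limma is attached.

```r
# A sketch: call ids2indices() when available (limma >= 3.22),
# otherwise fall back to the old symbols2indices() name.
indices_compat <- function(gene.ids, identifiers) {
  if (exists("ids2indices")) {
    ids2indices(gene.ids, identifiers)
  } else {
    symbols2indices(gene.ids, identifiers)
  }
}
```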
In order to get these presentations to work, we had to add "install.packages" code before these packages were loaded with "library".
We understand that some people may have already installed them, and that installing them repeatedly each time the presentations are run would be redundant and slow. But for those who have not, and are new to R, the errors from trying to load packages that have not been installed can be very confusing and discouraging.
Is there some way (in R or RStudio) to install them "conditionally"? That is, is there code that could check whether a package is already installed and, if not, install it?
A quick search shows that there are a few ways to do this in R:
Please consider modifying your presentations in the ways described in these links to conditionally install needed packages.
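One minimal sketch of such a conditional install, using only base R (install_if_missing() is a hypothetical helper name, not from the presentations):

```r
# Install only the CRAN packages that are not already present.
install_if_missing <- function(pkgs) {
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
```

For example, install_if_missing(c("data.table", "ggplot2")) would skip any package that is already installed.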
For reference, here are the ones we had to install with "install.packages" to get your presentations to work for us. (Does not include packages installed with "biocLite".)
Advanced_data_manipulation.Rpres:
install.packages("data.table")
install.packages("microbenchmark")
install.packages("sqldf")
Advanced_graphics_in_R.Rpres:
install.packages("ggplot2")
install.packages("ggthemes")
Bioconductor_intro.Rpres:
install.packages("knitr")
install.packages("data.table")
Sequence_analysis.Rpres:
install.packages("reshape2")
install.packages("IRanges")
install.packages("GenomicRanges")
We run R 3.0.2 on BioLinux 7 (Ubuntu 12.04), which has these R packages installed using the OS package management tool:
r-base
r-base-core
r-base-dev
r-base-html
r-bioc-affy
r-bioc-affyio
r-bioc-annotate
r-bioc-annotationdbi
r-bioc-biobase
r-bioc-biocgenerics
r-bioc-biocinstaller
r-bioc-biomart
r-bioc-biostrings
r-bioc-bitseq
r-bioc-deseq
r-bioc-edger
r-bioc-genefilter
r-bioc-geneplotter
r-bioc-genomicranges
r-bioc-hilbertvis
r-bioc-impute
r-bioc-iranges
r-bioc-limma
r-bioc-multtest
r-bioc-pcamethods
r-bioc-preprocesscore
r-bioc-qvalue
r-bioc-rsamtools
r-bioc-zlibbioc
r-cran-abind
r-cran-ade4
r-cran-ape
r-cran-aplpack
r-cran-bitops
r-cran-boot
r-cran-catools
r-cran-class
r-cran-cluster
r-cran-codetools
r-cran-dbi
r-cran-digest
r-cran-evaluate
r-cran-foreign
r-cran-gdata
r-cran-gee
r-cran-gplots
r-cran-gtools
r-cran-kernsmooth
r-cran-lattice
r-cran-leaps
r-cran-lme4
r-cran-locfit
r-cran-mass
r-cran-matrix
r-cran-matrixstats
r-cran-mgcv
r-cran-nlme
r-cran-nnet
r-cran-permute
r-cran-plotrix
r-cran-prettyr
r-cran-r.methodss3
r-cran-rcolorbrewer
r-cran-rcpp
r-cran-rcurl
r-cran-relimp
r-cran-rggobi
r-cran-rgl
r-cran-rgtk2
r-cran-rpart
r-cran-rsqlite
r-cran-rwave
r-cran-samr
r-cran-scatterplot3d
r-cran-sp
r-cran-spatial
r-cran-stringr
r-cran-survival
r-cran-tcltk2
r-cran-testthat
r-cran-vegan
r-cran-waveslim
r-cran-wavethresh
r-cran-xml
r-cran-xtable
r-doc-html
r-doc-info
r-doc-pdf
r-mathlib
r-recommended
The use of tar = "gtar" with R's untar() does not work on either Windows or Linux using standard installations of these operating systems. Use tar = Sys.getenv("TAR") instead of "gtar".
For example...
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = "gtar")
sh: 1: gtar: not found
Warning message:
In untar("Data/GEO/GSE29617_RAW.tar", exdir = "Data/GEO/GSE29617/", :
‘gtar -xf 'Data/GEO/GSE29617_RAW.tar' -C 'Data/GEO/GSE29617/'’ returned error code 127
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = "gtar")
Warning message:
In untar("Data/GEO/GSE29617_RAW.tar", exdir = "Data/GEO/GSE29617/", :
‘gtar -xf "Data/GEO/GSE29617_RAW.tar" -C "Data/GEO/GSE29617/"’ returned error code 127
tar = Sys.getenv("TAR") does work on both Linux and Windows:
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = Sys.getenv("TAR"))
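If portability matters, a further fallback can be sketched: use the TAR environment variable when it is set, and R's internal tar implementation otherwise. tar_for_untar() is a hypothetical helper name; the example below demonstrates it with a self-contained round trip in a temporary directory rather than the GEO file.

```r
# Pick a tar program for untar(): the TAR environment variable when set,
# otherwise R's internal implementation (as on a default Windows install).
tar_for_untar <- function() {
  t <- Sys.getenv("TAR")
  if (nzchar(t)) t else "internal"
}

# Self-contained round trip in a temporary directory:
old <- setwd(tempdir())
writeLines("hello", "example.txt")
tar("example.tar", "example.txt", tar = "internal")
untar("example.tar", exdir = "unpacked", tar = tar_for_untar())
setwd(old)
```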
Summary:
This issue contains code to fix the "problematic" GSE45735 data file in R without having to resort to a cumbersome manual edit procedure. It shows how you can filter a compressed data file in R, saving it back into the compressed (e.g., gzip) format. Using this code will help reproducibility.
Background:
The T14 file in the GSE45735 data set is "problematic" -- it contains lines which should be removed before running read.table().
Example code which gives an error:
getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = "Data/GEO/")
# downloading...
files <- list.files(path = "Data/GEO/", pattern = "GSE45735_T.*.gz", full.names = TRUE)
file_list <- lapply(files, read.table, header=TRUE)
This is the error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 21969 did not have 12 elements
This problem was discussed in class. The recommendation was to extract the gz file, edit the txt file by hand, then recompress the file. The comment in the RNA-seq.Rpres file is:
## The T14 file is problematic and needs to be fixed by hand
Solution using R code:
This is another alternative which can be scripted in R and thus avoid the need to manually fix the file.
fname <- "Data/GEO/GSE45735_T14_pbmc.txt.gz"
write(grep('^Column',grep('^([^\t]+\t){1,}[^\t]+$',
readLines(gzfile(fname)),
perl=TRUE, value=TRUE),
perl=TRUE, value=TRUE, invert = TRUE),
gzfile(fname))
closeAllConnections()
This extracts the file, filters it according to two regular expressions, then recompresses it and closes the open file connections. The two regular expressions filter out lines starting with "Column" and any line not containing a tab delimiter. For this particular file, that deletes the last 28 lines before recompression.
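The two filters can be seen in isolation on made-up sample lines (not the real file):

```r
lines <- c("GeneA\t1\t2",
           "Column 1\tColumn 2",    # junk line starting with "Column"
           "no tab delimiter here", # junk line without a tab
           "GeneB\t3\t4")
# Keep only lines that contain at least one tab delimiter...
with_tabs <- grep('^([^\t]+\t){1,}[^\t]+$', lines, perl = TRUE, value = TRUE)
# ...then drop the ones that start with "Column".
kept <- grep('^Column', with_tabs, perl = TRUE, value = TRUE, invert = TRUE)
# kept is c("GeneA\t1\t2", "GeneB\t3\t4")
```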
After this is run, then this read.table() command will work:
file_list <- lapply(files, read.table, header=TRUE)
EDIT:
All of the files can be "cleaned" using a new function, filter.gz(), and lapply():
filter.gz <- function(fname) {
write(grep('^Column',
grep('^([^\t]+\t){1,}[^\t]+$',
readLines(gzfile(fname)),
perl=TRUE, value=TRUE),
perl=TRUE, value=TRUE, invert = TRUE),
gzfile(fname)
)
closeAllConnections()
}
result <- lapply(files, filter.gz)
Another, completely different approach is:
file_list <- lapply(files, read.table, sep='\t', header=TRUE, blank.lines.skip=TRUE, skipNul=TRUE)
na.omit(file_list[[3]])
That still leaves the rows with a Gene name containing "Column", so you can remove those as well by assuming all Gene names match a regular expression containing no spaces.
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
file_list[[3]] <- subset(file_list[[3]], grepl('^[^[:space:]]+$', Gene))
Or to apply to all data frames in the list:
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
file_list <- lapply(file_list, function(df) subset(df, grepl('^[^[:space:]]+$', Gene)))
Essentially, the default separator for read.table() is whitespace, not the tab that this file requires. Gene names containing spaces ("Column #") therefore appear to contain too many columns and become "problematic"; using sep='\t' addresses this.
Then, since some rows are clearly not columnar data and some are blank, those (and the rows with Gene names like "Column") are removed with subset().
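Both points can be illustrated with a tiny made-up table:

```r
txt <- "Gene\tA\tB\nGeneA\t1\t2\nColumn 5\t3\t4"
# With the default whitespace separator, "Column 5" splits into two
# fields and read.table() fails, mirroring the error above:
bad <- try(read.table(text = txt, header = TRUE), silent = TRUE)
# With sep = '\t', "Column 5" stays a single Gene value...
d <- read.table(text = txt, header = TRUE, sep = "\t")
# ...which subset() can then drop because it contains a space.
cleaned <- subset(d, grepl('^[^[:space:]]+$', Gene))
```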
We get a "could not find function "melt"" error when running code in Sequence_analysis.Rpres.
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
Error: could not find function "melt"
> # This line is not needed if you're using the latest dev version of data.table
> dt_merged_melt <- data.table(dt_merged_melt)
Error in data.table(dt_merged_melt) : object 'dt_merged_melt' not found
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
Error: object 'dt_merged_melt' not found
Because we need to load "reshape2" first...
library(reshape2)
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
> # This line is not needed if you're using the latest dev version of data.table
> dt_merged_melt <- data.table(dt_merged_melt)
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]