raphg / biostat-578
A repository for all of my teaching material
License: Creative Commons Zero v1.0 Universal
This led to problems seeing images on my system (R 3.0.2, Ubuntu 12.04):
Linux is case-sensitive for file and folder names; Windows usually is not. Your Git repo has Images spelled with a capital letter 'I', but not all .Rpres files match this. Some use both spellings.
$ grep -l 'images/' *.Rpres
Batch_effects.Rpres
Biology_basics.Rpres
Introduction_to_R.Rpres
Microarrays.Rpres
Probe_summary.Rpres
RNA-seq.Rpres
Sequence_analysis.Rpres
$ grep -l 'Images/' *.Rpres
Differential_expression.Rpres
Normalization.Rpres
Probe_summary.Rpres
RNA-seq.Rpres
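If the on-disk folder really is the capitalized Images/ (as described above), one possible scripted fix is to rewrite the lowercase references in the .Rpres files. This is a sketch, and fix_image_case() is a hypothetical helper name, not something from the repo:

```r
# Hypothetical helper (a sketch, not from the repo): rewrite lowercase
# "images/" references to the capitalized "Images/" so they match the
# folder name on a case-sensitive filesystem.
fix_image_case <- function(dir = ".") {
  for (f in list.files(dir, pattern = "\\.Rpres$", full.names = TRUE)) {
    txt <- readLines(f)
    fixed <- gsub("images/", "Images/", txt, fixed = TRUE)
    if (!identical(txt, fixed)) writeLines(fixed, f)
  }
}
```

Renaming the folder to lowercase and fixing the references the other way would work equally well; the point is that folder name and references must agree.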
Regarding "Advanced_data_manipulation.Rpres" and the slide "Creating your own SQLite database in R" ...
If the database tables ("hai", "cohort") already exist (i.e. the code has already been run once), the following code returns an error and the slide presentation will not continue.
db <- dbConnect(SQLite(), dbname="./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, header = TRUE, sep="\t")
Error: Table hai exists in database, and both overwrite and append are FALSE
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", row.names = FALSE, header = TRUE, sep="\t")
Error: Table cohort exists in database, and both overwrite and append are FALSE
These errors can be fixed by passing overwrite=TRUE to dbWriteTable().
db <- dbConnect(SQLite(), dbname="./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
We are running (R 3.0.2, Ubuntu 12.04)... and see the following issue...
Some .Rpres files have this:
# Let's first turn on the cache for increased performance.
# Set some global knitr options
library(knitr)
opts_chunk$set(cache=TRUE)
While others just have:
# Let's first turn on the cache for increased performance.
# Set some global knitr options
opts_chunk$set(cache=TRUE)
For the files without library(knitr), knitr may not be loaded when the presentation runs, and this call will fail (after a long wait).
Please add library(knitr) before every instance of opts_chunk$set(cache=TRUE).
Reference:
http://stackoverflow.com/questions/16752765/knitr-r-package-check-error-object-opts-chunk-not-found
Here are the files containing this opts_chunk setting; you can see which ones do not run library(knitr) directly beforehand:
$ grep -B1 -i 'opts_chunk$set(cache=TRUE)' *.Rpres
Advanced_data_manipulation.Rpres-# Set some global knitr options
Advanced_data_manipulation.Rpres:opts_chunk$set(cache=TRUE)
--
Batch_effects.Rpres-library(knitr)
Batch_effects.Rpres:opts_chunk$set(cache=TRUE)
--
Differential_expression.Rpres-# Set some global knitr options
Differential_expression.Rpres:opts_chunk$set(cache=TRUE)
--
GSEA.Rpres-library(knitr)
GSEA.Rpres:opts_chunk$set(cache=TRUE)
--
RNA-seq.Rpres-library(knitr)
RNA-seq.Rpres:opts_chunk$set(cache=TRUE)
--
Sequence_analysis.Rpres-# Set some global knitr options
Sequence_analysis.Rpres:opts_chunk$set(cache=TRUE)
You can see that 6 .Rpres files use the setting, but only 3 of them load knitr.
$ grep -l 'opts_chunk$set(cache=TRUE)' *.Rpres
Advanced_data_manipulation.Rpres
Batch_effects.Rpres
Differential_expression.Rpres
GSEA.Rpres
RNA-seq.Rpres
Sequence_analysis.Rpres
$ grep -l 'library(knitr)' *.Rpres
Batch_effects.Rpres
GSEA.Rpres
RNA-seq.Rpres
When running the code in a presentation, packages must be loaded before attempting to use them, or the commands will fail.
In Sequence_analysis.Rpres, these two packages are used without loading them:
IRanges
GenomicRanges
Please modify the presentation to load these packages before they are used.
Downloading the GSE29617 data file will fail with the code in Bioconductor_intro.Rpres:
> # Download all raw data. This should only be evaluated once
> # Then the data would be stored locally in the data directory
> getGEOSuppFiles("GSE29617", makeDirectory = FALSE, baseDir = "./Data/GEO/")
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE29nnn/GSE29617/suppl/"
Error in download.file(file.path(url, i), destfile = file.path(storedir, :
cannot open destfile './Data/GEO//GSE29617_RAW.tar', reason 'No such file or directory'
Please either add the ./Data/GEO/ folder to your Git repo or use an R command to create the destination folder first if it is not present:
dir.create(file.path("./Data/GEO"), showWarnings = FALSE)
With this change, you may be able to remove the second approach you mention next:
# If the above commands do not work, try:
getGEOSuppFiles("GSE29617", makeDirectory = TRUE, baseDir = "./")
Regardless, one or the other alternative should be commented out, since downloading twice is redundant. (And the downloads are slow.)
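As a further sketch, the folder-creation line can also create the parent ./Data directory in the same call:

```r
# recursive = TRUE also creates ./Data if it is missing;
# showWarnings = FALSE keeps re-runs silent.
dir.create("./Data/GEO", recursive = TRUE, showWarnings = FALSE)
```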
I found that the following change made it work for me (R 3.0.2, Ubuntu 12.04).
This:
xRle <- Rle(xVector)
Should be this:
xRle <- rle(xVector)
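For reference, base R's rle() computes run lengths directly; a quick demonstration follows. (The S4 analogue Rle() lives in the IRanges package and fails unless IRanges is loaded first, which may be what caused the original problem.)

```r
# rle() encodes consecutive runs of equal values.
x <- c("a", "a", "b", "b", "b", "a")
r <- rle(x)
# r$lengths is c(2, 3, 1); r$values is c("a", "b", "a")
```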
Apparently the T14 file is "problematic" and needs to be replaced with this one:
https://www.dropbox.com/s/k2vufecktt5nyy0/GSE45735_T14_pbmc.txt.gz
gd <- getGEO("GSE45735", destdir = "Data/GEO/")
pd <- pData(gd[[1]])
#getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = "Data/GEO/")
## The T14 file is problematic and needs to be fixed by hand
Please update the .Rpres file with the download link and where to place the downloaded file.
Please insert code to install Bioconductor packages before their first use.
Example:
This code fails because biocLite() was not used to install limma and edgeR before loading them with library():
> # Set some global knitr options
> library(knitr)
> opts_chunk$set(cache=TRUE)
> library(data.table)
> library(ggplot2)
> library(limma)
Error in library(limma) : there is no package called ‘limma’
> library(edgeR)
Error in library(edgeR) : there is no package called ‘edgeR’
> library(GEOquery)
> biocLite(c("limma, "edgeR")
Error: unexpected symbol in "biocLite(c("limma, "edgeR"
> biocLite(c("limma, "edgeR"))
Error: unexpected symbol in "biocLite(c("limma, "edgeR"
So we need to install limma and edgeR:
> biocLite(c("limma", "edgeR"))
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.13 (BiocInstaller 1.12.0), R version 3.0.2.
Installing package(s) 'limma' 'edgeR'
trying URL 'http://bioconductor.org/packages/2.13/bioc/bin/windows/contrib/3.0/limma_3.18.13.zip'
Content type 'application/zip' length 3190124 bytes (3.0 Mb)
opened URL
downloaded 3.0 Mb
trying URL 'http://bioconductor.org/packages/2.13/bioc/bin/windows/contrib/3.0/edgeR_3.4.2.zip'
Content type 'application/zip' length 1815887 bytes (1.7 Mb)
opened URL
downloaded 1.7 Mb
package ‘limma’ successfully unpacked and MD5 sums checked
package ‘edgeR’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Documents and Settings\high\Local Settings\Temp\2\RtmpuKjnTn\downloaded_packages
Now it works:
> # Let's first turn on the cache for increased performance.
> # Set some global knitr options
> library(knitr)
> opts_chunk$set(cache=TRUE)
> library(data.table)
> library(ggplot2)
> library(limma)
Attaching package: ‘limma’
The following object is masked from ‘package:BiocGenerics’:
plotMA
> library(edgeR)
> library(GEOquery)
This block of code did not work because dt_merged_melt was a data frame, not a data.table (R 3.0.2, Ubuntu 12.04).
This line is needed:
dt_merged_melt<-data.table(dt_merged_melt)
See full code running below.
> library(ggplot2)
> #library(reshape2) # Already had to do this earlier in the presentation!
> ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
Error in `[.data.frame`(dt_merged_melt, , list(mean = mean(value)), by = GC_count) :
unused argument (by = GC_count)
# Need this extra line...
> dt_merged_melt<-data.table(dt_merged_melt)
# Now it works!
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
> library(ggplot2)
> #library(reshape2) # Already had to do this earlier in the presentation!
> ggplot(dt_merged_melt[variable=="GSM733942.CEL.gz"],aes(x=as.factor(GC_count), y=value))+geom_violin()+geom_point(data=dt_merged_melt_sum,aes(x=as.factor(GC_count), y=mean), col="red", size=6)+theme_minimal(base_size = 18)
Your version of this script has been modified to remove the ability to use require() to test whether a package is already installed. Since your version installs all packages whether or not they are already present, you may want to remove the section of the introductory comments that contradicts the new behaviour; these comments could confuse a user who expects the behaviour they describe. (Compare to my version for reference.)
# [...], either with "install.packages" or "biocLite",
# then tries to load them into memory with "require". The intent is to
# ensure that any packages required for the scripts to run have been
# installed previously.
#
# Only those packages which are not already installed will be installed.
# The "require" function is used to check those packages after
# they are installed to make sure they can load into memory. Therefore,
# even if packages have already been loaded with "library", this script
# should run very quickly once the packages have been initially installed.
It appears that a function has been renamed, which prevents the GSEA presentation code from working.
sets_indices <- symbols2indices(gene_ids, rownames(new_set))
Results in this error message:
Error: could not find function "symbols2indices"
From: http://www.bioconductor.org/packages/release/bioc/news/limma/NEWS
Changes in version 3.22.0:
symbols2indices() renamed to ids2indices().
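One way to keep the slide working across limma versions is a small compatibility shim. This is a sketch: indices_compat() is a hypothetical helper name, and the two function names are taken from the NEWS entry quoted above; it assumes limma is attached.

```r
# A sketch: call ids2indices() when available (limma >= 3.22),
# otherwise fall back to the old symbols2indices() name.
indices_compat <- function(gene.ids, identifiers) {
  if (exists("ids2indices")) {
    ids2indices(gene.ids, identifiers)
  } else {
    symbols2indices(gene.ids, identifiers)
  }
}
```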
In order to get these presentations to work, we had to add "install.packages" code before these packages were loaded with "library".
We understand that some people may have already installed them, and that installing them repeatedly each time the presentations are run would be redundant and slow. But for those who have not, and are new to R, the errors from trying to load packages that have not been installed can be very confusing and discouraging.
Is there some way (in R or RStudio) to install them "conditionally"? That is, is there code that could check whether a package is already installed and, if not, install it?
A quick search shows that there are a few ways to do this in R:
Please consider modifying your presentations in the ways described in these links to conditionally install needed packages.
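One minimal sketch of such a conditional install, using only base R (install_if_missing() is a hypothetical helper name, not from the presentations):

```r
# Install only the CRAN packages that are not already present.
install_if_missing <- function(pkgs) {
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
```

For example, install_if_missing(c("data.table", "ggplot2")) would skip any package that is already installed.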
For reference, here are the ones we had to install with "install.packages" to get your presentations to work for us. (Does not include packages installed with "biocLite".)
Advanced_data_manipulation.Rpres:
install.packages("data.table")
install.packages("microbenchmark")
install.packages("sqldf")
Advanced_graphics_in_R.Rpres:
install.packages("ggplot2")
install.packages("ggthemes")
Bioconductor_intro.Rpres:
install.packages("knitr")
install.packages("data.table")
Sequence_analysis.Rpres:
install.packages("reshape2")
install.packages("IRanges")
install.packages("GenomicRanges")
We run R 3.0.2 on BioLinux 7 (Ubuntu 12.04), which has these R packages installed using the OS package management tool:
r-base
r-base-core
r-base-dev
r-base-html
r-bioc-affy
r-bioc-affyio
r-bioc-annotate
r-bioc-annotationdbi
r-bioc-biobase
r-bioc-biocgenerics
r-bioc-biocinstaller
r-bioc-biomart
r-bioc-biostrings
r-bioc-bitseq
r-bioc-deseq
r-bioc-edger
r-bioc-genefilter
r-bioc-geneplotter
r-bioc-genomicranges
r-bioc-hilbertvis
r-bioc-impute
r-bioc-iranges
r-bioc-limma
r-bioc-multtest
r-bioc-pcamethods
r-bioc-preprocesscore
r-bioc-qvalue
r-bioc-rsamtools
r-bioc-zlibbioc
r-cran-abind
r-cran-ade4
r-cran-ape
r-cran-aplpack
r-cran-bitops
r-cran-boot
r-cran-catools
r-cran-class
r-cran-cluster
r-cran-codetools
r-cran-dbi
r-cran-digest
r-cran-evaluate
r-cran-foreign
r-cran-gdata
r-cran-gee
r-cran-gplots
r-cran-gtools
r-cran-kernsmooth
r-cran-lattice
r-cran-leaps
r-cran-lme4
r-cran-locfit
r-cran-mass
r-cran-matrix
r-cran-matrixstats
r-cran-mgcv
r-cran-nlme
r-cran-nnet
r-cran-permute
r-cran-plotrix
r-cran-prettyr
r-cran-r.methodss3
r-cran-rcolorbrewer
r-cran-rcpp
r-cran-rcurl
r-cran-relimp
r-cran-rggobi
r-cran-rgl
r-cran-rgtk2
r-cran-rpart
r-cran-rsqlite
r-cran-rwave
r-cran-samr
r-cran-scatterplot3d
r-cran-sp
r-cran-spatial
r-cran-stringr
r-cran-survival
r-cran-tcltk2
r-cran-testthat
r-cran-vegan
r-cran-waveslim
r-cran-wavethresh
r-cran-xml
r-cran-xtable
r-doc-html
r-doc-info
r-doc-pdf
r-mathlib
r-recommended
The use of tar = "gtar" with R's untar() does not work on either Windows or Linux using standard installations of these operating systems. Use tar = Sys.getenv("TAR") instead of "gtar".
For example...
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = "gtar")
sh: 1: gtar: not found
Warning message:
In untar("Data/GEO/GSE29617_RAW.tar", exdir = "Data/GEO/GSE29617/", :
‘gtar -xf 'Data/GEO/GSE29617_RAW.tar' -C 'Data/GEO/GSE29617/'’ returned error code 127
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = "gtar")
Warning message:
In untar("Data/GEO/GSE29617_RAW.tar", exdir = "Data/GEO/GSE29617/", :
‘gtar -xf "Data/GEO/GSE29617_RAW.tar" -C "Data/GEO/GSE29617/"’ returned error code 127
tar = Sys.getenv("TAR") does work on both Linux and Windows:
> untar("Data/GEO/GSE29617_RAW.tar", exdir="Data/GEO/GSE29617/", tar = Sys.getenv("TAR"))
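If portability matters, a further fallback can be sketched: use the TAR environment variable when it is set, and R's internal tar implementation otherwise. tar_for_untar() is a hypothetical helper name; the example below demonstrates it with a self-contained round trip in a temporary directory rather than the GEO file.

```r
# Pick a tar program for untar(): the TAR environment variable when set,
# otherwise R's internal implementation (as on a default Windows install).
tar_for_untar <- function() {
  t <- Sys.getenv("TAR")
  if (nzchar(t)) t else "internal"
}

# Self-contained round trip in a temporary directory:
old <- setwd(tempdir())
writeLines("hello", "example.txt")
tar("example.tar", "example.txt", tar = "internal")
untar("example.tar", exdir = "unpacked", tar = tar_for_untar())
setwd(old)
```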
Summary:
This issue contains code to fix the "problematic" GSE45735 data file in R without having to resort to a cumbersome manual edit procedure. It shows how you can filter a compressed data file in R, saving it back into the compressed (e.g., gzip) format. Using this code will help reproducibility.
Background:
The T14 file in the GSE45735 data set is "problematic" -- it contains lines which should be removed before running read.table().
Example code which gives an error:
getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = "Data/GEO/")
# downloading...
files <- list.files(path = "Data/GEO/", pattern = "GSE45735_T.*.gz", full.names = TRUE)
file_list <- lapply(files, read.table, header=TRUE)
This is the error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 21969 did not have 12 elements
This problem was discussed in class. The recommendation was to extract the gz file, edit the txt file by hand, then recompress the file. The comment in the RNA-seq.Rpres file is:
## The T14 file is problematic and needs to be fixed by hand
Solution using R code:
This is another alternative which can be scripted in R and thus avoid the need to manually fix the file.
fname <- "Data/GEO/GSE45735_T14_pbmc.txt.gz"
write(grep('^Column',grep('^([^\t]+\t){1,}[^\t]+$',
readLines(gzfile(fname)),
perl=TRUE, value=TRUE),
perl=TRUE, value=TRUE, invert = TRUE),
gzfile(fname))
closeAllConnections()
This extracts the file, filters it according to two regular expressions, then recompresses it and closes the open file connections. The two regular expressions filter out lines starting with "Column" and any line not containing a tab delimiter. For this particular file, that deletes the last 28 lines before recompression.
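The two filters can be seen in isolation on made-up sample lines (not the real file):

```r
lines <- c("GeneA\t1\t2",
           "Column 1\tColumn 2",    # junk line starting with "Column"
           "no tab delimiter here", # junk line without a tab
           "GeneB\t3\t4")
# Keep only lines that contain at least one tab delimiter...
with_tabs <- grep('^([^\t]+\t){1,}[^\t]+$', lines, perl = TRUE, value = TRUE)
# ...then drop the ones that start with "Column".
kept <- grep('^Column', with_tabs, perl = TRUE, value = TRUE, invert = TRUE)
# kept is c("GeneA\t1\t2", "GeneB\t3\t4")
```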
After this is run, then this read.table() command will work:
file_list <- lapply(files, read.table, header=TRUE)
EDIT:
All of the files can be "cleaned" using a new function, filter.gz(), and lapply():
filter.gz <- function(fname) {
write(grep('^Column',
grep('^([^\t]+\t){1,}[^\t]+$',
readLines(gzfile(fname)),
perl=TRUE, value=TRUE),
perl=TRUE, value=TRUE, invert = TRUE),
gzfile(fname)
)
closeAllConnections()
}
result <- lapply(files, filter.gz)
Another, completely different approach is:
file_list <- lapply(files, read.table, sep='\t', header=TRUE, blank.lines.skip=TRUE, skipNul=TRUE)
na.omit(file_list[[3]])
That still leaves the rows with a Gene name containing "Column", so you can remove those as well by assuming all Gene names match a regular expression containing no spaces.
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
file_list[[3]] <- subset(file_list[[3]], grepl('^[^[:space:]]+$', Gene))
Or to apply to all data frames in the list:
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
file_list <- lapply(file_list, function(df) subset(df, grepl('^[^[:space:]]+$', Gene)))
Essentially, the default separator for read.table() is whitespace, not the tab that this file requires. Gene names containing spaces ("Column #") therefore appear to contain too many columns and become "problematic"; using sep='\t' addresses this.
Then, since some rows are clearly not columnar data and some are blank, those (and the rows with Gene names like "Column") are removed with subset().
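Both points can be illustrated with a tiny made-up table:

```r
txt <- "Gene\tA\tB\nGeneA\t1\t2\nColumn 5\t3\t4"
# With the default whitespace separator, "Column 5" splits into two
# fields and read.table() fails, mirroring the error above:
bad <- try(read.table(text = txt, header = TRUE), silent = TRUE)
# With sep = '\t', "Column 5" stays a single Gene value...
d <- read.table(text = txt, header = TRUE, sep = "\t")
# ...which subset() can then drop because it contains a space.
cleaned <- subset(d, grepl('^[^[:space:]]+$', Gene))
```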
We get a "could not find function "melt"" error when running code in Sequence_analysis.Rpres.
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
Error: could not find function "melt"
> # This line is not needed if you're using the latest dev version of data.table
> dt_merged_melt <- data.table(dt_merged_melt)
Error in data.table(dt_merged_melt) : object 'dt_merged_melt' not found
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]
Error: object 'dt_merged_melt' not found
Because we need to load "reshape2" first...
library(reshape2)
> dt_exprs <- data.table(probe_name=gsub("_PM", "", rownames(GSE29617_set2)), exprs(GSE29617_set2))
> # frequency for all probes
> freq <- alphabetFrequency(probes)
> # Compute the GC content
> GC_count <- freq[,"G"]+freq[,"C"]
> dt_probes <- data.table(probe_name=hgu133plus2probe$Probe.Set.Name, GC_count)
> setkey(dt_exprs, "probe_name")
> setkey(dt_probes, "probe_name")
> dt_merged <- dt_exprs[dt_probes]
> dt_merged_melt <- melt(dt_merged, id.vars = c("probe_name", "GC_count"))
> # This line is not needed if you're using the latest dev version of data.table
> dt_merged_melt <- data.table(dt_merged_melt)
> dt_merged_melt_sum <- dt_merged_melt[,list(mean=mean(value)),by=GC_count]