Comments (3)
Thanks a ton once again for your prompt & detailed response @AstrobioMike; it was silly of me not to pay attention to the [,2]
in the code. Everything is perfect now and I'm so very grateful to you for your kindness & boundless patience! May the force be with you!!! :-)
from astrobiomike.github.io.
Heya :)
When i started trying this, i immediately hit one thing that was unnecessarily confusing. And i think it might have been the only real source of a problem, ha
# this part
phyla_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="phylum"))[,2])
# would be more clear if written like this
phyla_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="phylum"))[,"phylum"])
# as the [,2] at the end was specific to trying to pull out the phylum column
# and writing out "phylum" makes it easier to spot quickly that we need to change that too if trying a different rank
So that's updated on the page now 👍
Here's what works for me doing it with the tutorial data:
## using phyloseq to make a count table that has summed all ASVs that were in the same phylum
genera_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="genus"))
## making a vector of genera names to set as row names
genera_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="genus"))[,"genus"])
rownames(genera_counts_tab) <- as.vector(genera_tax_vec)
## this looks like this
head(genera_counts_tab)
# OTU Table: [6 taxa and 16 samples]
# taxa are rows
# BW1 BW2 R10 R11BF R11 R12 R1A R1B R2 R3 R4 R5 R6 R7 R8 R9
# Nitrospira 0 0 843 12 415 400 297 346 311 355 407 341 229 68 340 240
# Woeseia 0 12 460 25 509 724 431 562 567 669 566 727 534 201 428 257
# Candidatus Nitrosopumilus 9 32 310 407 233 1117 882 959 879 722 653 588 759 623 235 137
# Candidatus Nitrosopelagicus 0 0 52 0 30 70 99 137 313 218 214 117 89 61 60 45
# BD1-7 clade 0 0 305 0 547 38 31 23 51 62 0 112 13 2 115 112
# AqS1 0 0 112 0 20 144 125 157 189 138 186 128 132 9 218 12
## we also have to account for sequences that weren't assigned any
## taxonomy at the genus level
## these came into R as 'NAs' in the taxonomy table, but their counts are
## still in the count table
## so we can get that value for each sample by subtracting the column sums
## of this new table (that has everything that had a genus assigned to it)
## from the column sums of the starting count table (that has all
## representative sequences)
unclassified_tax_counts_for_genus_level_summary <- colSums(count_tab) - colSums(genera_counts_tab)
genera_and_unidentified_counts_tab <- rbind(genera_counts_tab, "Unclassified"=unclassified_tax_counts_for_genus_level_summary)
## and now we have our unclassified at the genus level added to the bottom of the table
tail(genera_and_unidentified_counts_tab)
# BW1 BW2 R10 R11BF R11 R12 R1A R1B R2 R3 R4 R5 R6 R7 R8 R9
# Candidatus Tenderia 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
# IS-44 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
# Vicingus 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0
# Candidatus Omnitrophus 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
# Rubrobacter 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
# Unclassified 1486 3000 5668 5714 4289 6936 5686 7452 8747 9112 11436 9263 7588 4004 7000 5455
## checking we aren't missing any counts (should come back 'TRUE')
identical(colSums(genera_and_unidentified_counts_tab), colSums(count_tab))
## now we'll generate a proportions table for summarizing:
genera_proportions_tab <- apply(genera_and_unidentified_counts_tab, 2, function(x) x/sum(x)*100)
From there, because there are so many with lower proportions, deciding what we want to show is might be trickier. With the tutorial data, this is 161 unique genera (well 160 plus the unclassified one which makes up a large portion due to the nature of it holding all not classified at the genus level). So it's really gonna be up to the researcher to decide what specific they'd want to present. But i ran through it with a similar min-percent-filtering process as in the tutorial when working with phylum and proteo classes just to make sure it'd work:
# if we check the dimensions of this table at this point
dim(genera_proportions_tab)
# we see there are currently 161 rows (tutorial data), which is likely way too busy for a
# summary figure
# many of these taxa make up a very small percentage, so we're going to
# filter some out
# this is a completely arbitrary decision solely to ease visualization and
# intepretation, entirely up to your data and you here
# i messed with the min percent in any sample required to keep a row (genera) based on how many were retained overall
temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 5, ])
dim(temp_filt_genera_proportions_tab)
# 5% left only 6
temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 2, ])
dim(temp_filt_genera_proportions_tab)
# 2% left 17, which seemed like a lot
temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 2.5, ])
dim(temp_filt_genera_proportions_tab)
# 2.5% minimum in any sample left 10
# though each of the filtered taxa made up less than 2.5% alone in any given sample, together they
# may add up and should still be included in the overall summary
# so we're going to add a row called "Other" that keeps track of how much we
# filtered out (which will also keep our totals at 100%)
genera_filtered_proportions <- colSums(genera_proportions_tab) - colSums(temp_filt_genera_proportions_tab)
filt_genera_proportions_tab <- rbind(temp_filt_genera_proportions_tab, "Other" = genera_filtered_proportions)
And just taking that to the first bar plot:
# first let's make a copy of our table that's safe for manipulating
filt_genera_proportions_tab_for_plot <- filt_genera_proportions_tab
# and add a column of the taxa names so that it is within the table, rather
# than just as row names (this makes working with ggplot easier)
filt_genera_proportions_tab_for_plot$Genus <- row.names(filt_genera_proportions_tab_for_plot)
# now we'll transform the table into narrow, or long, format (also makes
# plotting easier)
filt_genera_proportions_tab_for_plot.g <- pivot_longer(filt_genera_proportions_tab_for_plot, !Genus, names_to = "Sample", values_to = "Proportion") %>% data.frame()
# now we want a table with "color" and "characteristics" of each sample to
# merge into our plotting table so we can use that more easily in our plotting
# function
# here we're making a new table by pulling what we want from the sample
# information table
sample_info_for_merge<-data.frame("Sample"=row.names(sample_info_tab), "char"=sample_info_tab$char, "color"=sample_info_tab$color, stringsAsFactors=F)
# and here we are merging this table with the plotting table we just made
filt_genera_proportions_tab_for_plot.g2 <- merge(filt_genera_proportions_tab_for_plot.g, sample_info_for_merge)
# and now we're ready to make some summary figures with our wonderfully
# constructed table
## a good color scheme can be hard to find, i included the viridis package
## here because it's color-blind friendly and sometimes it's been really
## helpful for me, though this is not demonstrated in all of the following :/
# one common way to look at this is with stacked bar charts for each taxon per sample:
ggplot(filt_genera_proportions_tab_for_plot.g2, aes(x=Sample, y=Proportion, fill=Genus)) +
geom_bar(width=0.6, stat="identity") +
theme_bw() +
scale_fill_viridis(discrete=TRUE) +
theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1)) +
labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")
As expected at this level, Unclassified dominates. If wanting to show genus level, you could remove Unclassified in order to see more of what was classified. But i'd be sure to note that it is only focusing on what was classified at the genus level, and specifically mention what proportion of seqs were not classified and therefore are missing. Could be done something like this:
unclassified_at_genera_level <- filt_genera_proportions_tab["Unclassified",] %>% as.vector
unclassified_at_genera_level
# [1] 78.33421 63.69427 64.10314 82.09770 65.21210 65.15735 66.61980 66.78616 70.33612 73.24170 77.54272 72.26556 72.64025 71.37255 74.05057 81.47872
mean(unclassified_at_genera_level)
# [1] 71.55831
sd(unclassified_at_genera_level)
# [1] 6.051418
And could say something like "Only those classified are depicted, proportion of unclassified sequences across all samples was 71.6 ± 6.05% (mean ± 1SD)".
Hopefully this helps!
from astrobiomike.github.io.
Likewise! Happy to help :)
from astrobiomike.github.io.
Related Issues (20)
- Mistake, bug, or typo HOT 2
- New Topic Idea HOT 2
- Error in installing sabre HOT 3
- De novo genome assembly HOT 3
- REgarding the full example workflow DADA2 HOT 2
- suggestion HOT 2
- Tutorial on Filtering Host Reads - just a thoughtNew Topic Idea HOT 2
- Support for SILVA v138.1 HOT 2
- decontam bug in subsetting fasta file when there are no contaminants HOT 1
- Dada2 merging doubt HOT 4
- How to define vector for decontam in a merged sequence table? HOT 3
- Mistake when running the protocol HOT 6
- Error in running the process in my data HOT 2
- Deseq error HOT 2
- Error in # generating and visualizing the PCoA with phyloseq HOT 2
- Phylogenetic tree construction HOT 1
- Nano issue HOT 1
- Mistake, bug, or typo HOT 3
- update conda intro page
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from astrobiomike.github.io.