Hi, So sorry to bother you with yet another silly query; you can ign

Thanks a ton once again for your prompt & detailed response <a class="user-mention

Genus/Family-based stacked bar charts about astrobiomike.github.io HOT 3 CLOSED

Anto007 commented on July 18, 2024

Genus/Family-based stacked bar charts

from astrobiomike.github.io.

Comments (3)

Anto007 commented on July 18, 2024 1

Thanks a ton once again for your prompt & detailed response @AstrobioMike; it was silly of me not to pay attention to the [,2] in the code. Everything is perfect now and I'm so very grateful to you for your kindness & boundless patience! May the force be with you!!! :-)

from astrobiomike.github.io.

AstrobioMike commented on July 18, 2024

Heya :)

When i started trying this, i immediately hit one thing that was unnecessarily confusing. And i think it might have been the only real source of a problem, ha

# this part
phyla_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="phylum"))[,2])

# would be more clear if written like this
phyla_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="phylum"))[,"phylum"])

    # as the [,2] at the end was specific to trying to pull out the phylum column
    # and writing out "phylum" makes it easier to spot quickly that we need to change that too if trying a different rank

So that's updated on the page now 👍

Here's what works for me doing it with the tutorial data:

## using phyloseq to make a count table that has summed all ASVs that were in the same phylum
genera_counts_tab <- otu_table(tax_glom(ASV_physeq, taxrank="genus"))

## making a vector of genera names to set as row names
genera_tax_vec <- as.vector(tax_table(tax_glom(ASV_physeq, taxrank="genus"))[,"genus"]) 
rownames(genera_counts_tab) <- as.vector(genera_tax_vec)

## this looks like this
head(genera_counts_tab)

# OTU Table:          [6 taxa and 16 samples]
#                      taxa are rows
#                             BW1 BW2 R10 R11BF R11  R12 R1A R1B  R2  R3  R4  R5  R6  R7  R8  R9
# Nitrospira                    0   0 843    12 415  400 297 346 311 355 407 341 229  68 340 240
# Woeseia                       0  12 460    25 509  724 431 562 567 669 566 727 534 201 428 257
# Candidatus Nitrosopumilus     9  32 310   407 233 1117 882 959 879 722 653 588 759 623 235 137
# Candidatus Nitrosopelagicus   0   0  52     0  30   70  99 137 313 218 214 117  89  61  60  45
# BD1-7 clade                   0   0 305     0 547   38  31  23  51  62   0 112  13   2 115 112
# AqS1                          0   0 112     0  20  144 125 157 189 138 186 128 132   9 218  12

## we also have to account for sequences that weren't assigned any
## taxonomy at the genus level 
## these came into R as 'NAs' in the taxonomy table, but their counts are
## still in the count table
## so we can get that value for each sample by subtracting the column sums
## of this new table (that has everything that had a genus assigned to it)
## from the column sums of the starting count table (that has all
## representative sequences)

unclassified_tax_counts_for_genus_level_summary <- colSums(count_tab) - colSums(genera_counts_tab)
genera_and_unidentified_counts_tab <- rbind(genera_counts_tab, "Unclassified"=unclassified_tax_counts_for_genus_level_summary)

## and now we have our unclassified at the genus level added to the bottom of the table
tail(genera_and_unidentified_counts_tab)
#                         BW1  BW2  R10 R11BF  R11  R12  R1A  R1B   R2   R3    R4   R5   R6   R7   R8   R9
# Candidatus Tenderia       0    0    0     0    0    0    0    0    0    0     0    0    0    3    0    0
# IS-44                     0    0    0     0    0    2    0    0    0    0     0    0    0    0    0    0
# Vicingus                  0    0    0     0    0    0    0    0    0    0     2    0    0    0    0    0
# Candidatus Omnitrophus    0    0    0     0    0    0    0    0    0    0     0    2    0    0    0    0
# Rubrobacter               0    0    0     0    0    0    0    0    0    0     0    0    2    0    0    0
# Unclassified           1486 3000 5668  5714 4289 6936 5686 7452 8747 9112 11436 9263 7588 4004 7000 5455

## checking we aren't missing any counts (should come back 'TRUE')
identical(colSums(genera_and_unidentified_counts_tab), colSums(count_tab)) 

## now we'll generate a proportions table for summarizing:
genera_proportions_tab <- apply(genera_and_unidentified_counts_tab, 2, function(x) x/sum(x)*100)

From there, because there are so many with lower proportions, deciding what we want to show is might be trickier. With the tutorial data, this is 161 unique genera (well 160 plus the unclassified one which makes up a large portion due to the nature of it holding all not classified at the genus level). So it's really gonna be up to the researcher to decide what specific they'd want to present. But i ran through it with a similar min-percent-filtering process as in the tutorial when working with phylum and proteo classes just to make sure it'd work:

  # if we check the dimensions of this table at this point
dim(genera_proportions_tab)
  # we see there are currently 161 rows (tutorial data), which is likely way too busy for a
    # summary figure
    # many of these taxa make up a very small percentage, so we're going to
    # filter some out
    # this is a completely arbitrary decision solely to ease visualization and
    # intepretation, entirely up to your data and you here
    # i messed with the min percent in any sample required to keep a row (genera) based on how many were retained overall

temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 5, ])
dim(temp_filt_genera_proportions_tab)
    # 5% left only 6 

temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 2, ])
dim(temp_filt_genera_proportions_tab)
    # 2% left 17, which seemed like a lot

temp_filt_genera_proportions_tab <- data.frame(genera_proportions_tab[apply(genera_proportions_tab, 1, max) > 2.5, ])
dim(temp_filt_genera_proportions_tab)
    # 2.5% minimum in any sample left 10

  # though each of the filtered taxa made up less than 2.5% alone in any given sample, together they
    # may add up and should still be included in the overall summary
    # so we're going to add a row called "Other" that keeps track of how much we
    # filtered out (which will also keep our totals at 100%)
genera_filtered_proportions <- colSums(genera_proportions_tab) - colSums(temp_filt_genera_proportions_tab)
filt_genera_proportions_tab <- rbind(temp_filt_genera_proportions_tab, "Other" = genera_filtered_proportions)

And just taking that to the first bar plot:

  # first let's make a copy of our table that's safe for manipulating
filt_genera_proportions_tab_for_plot <- filt_genera_proportions_tab

  # and add a column of the taxa names so that it is within the table, rather
  # than just as row names (this makes working with ggplot easier)
filt_genera_proportions_tab_for_plot$Genus <- row.names(filt_genera_proportions_tab_for_plot)

  # now we'll transform the table into narrow, or long, format (also makes
  # plotting easier)
filt_genera_proportions_tab_for_plot.g <- pivot_longer(filt_genera_proportions_tab_for_plot, !Genus, names_to = "Sample", values_to = "Proportion") %>% data.frame()

  # now we want a table with "color" and "characteristics" of each sample to
    # merge into our plotting table so we can use that more easily in our plotting
    # function
    # here we're making a new table by pulling what we want from the sample
    # information table
sample_info_for_merge<-data.frame("Sample"=row.names(sample_info_tab), "char"=sample_info_tab$char, "color"=sample_info_tab$color, stringsAsFactors=F)

  # and here we are merging this table with the plotting table we just made
filt_genera_proportions_tab_for_plot.g2 <- merge(filt_genera_proportions_tab_for_plot.g, sample_info_for_merge)

  # and now we're ready to make some summary figures with our wonderfully
    # constructed table

  ## a good color scheme can be hard to find, i included the viridis package
    ## here because it's color-blind friendly and sometimes it's been really
    ## helpful for me, though this is not demonstrated in all of the following :/ 

  # one common way to look at this is with stacked bar charts for each taxon per sample:
ggplot(filt_genera_proportions_tab_for_plot.g2, aes(x=Sample, y=Proportion, fill=Genus)) +
    geom_bar(width=0.6, stat="identity") +
    theme_bw() +
    scale_fill_viridis(discrete=TRUE) +
    theme(axis.text.x=element_text(angle=90, vjust=0.4, hjust=1)) +
    labs(x="Sample", y="% of 16S rRNA gene copies recovered", title="All samples")

As expected at this level, Unclassified dominates. If wanting to show genus level, you could remove Unclassified in order to see more of what was classified. But i'd be sure to note that it is only focusing on what was classified at the genus level, and specifically mention what proportion of seqs were not classified and therefore are missing. Could be done something like this:

unclassified_at_genera_level <- filt_genera_proportions_tab["Unclassified",] %>% as.vector
unclassified_at_genera_level

# [1] 78.33421 63.69427 64.10314 82.09770 65.21210 65.15735 66.61980 66.78616 70.33612 73.24170 77.54272 72.26556 72.64025 71.37255 74.05057 81.47872

mean(unclassified_at_genera_level)
# [1] 71.55831

sd(unclassified_at_genera_level)
# [1] 6.051418

And could say something like "Only those classified are depicted, proportion of unclassified sequences across all samples was 71.6 ± 6.05% (mean ± 1SD)".

Hopefully this helps!

from astrobiomike.github.io.

AstrobioMike commented on July 18, 2024

Likewise! Happy to help :)

from astrobiomike.github.io.

Genus/Family-based stacked bar charts about astrobiomike.github.io HOT 3 CLOSED

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent