Coder Social home page Coder Social logo

Comments (2)

nextgenusfs avatar nextgenusfs commented on July 20, 2024

Hi @peterjc, sorry for my delayed response.

Because of how you submit samples to SRA they have to be minimally processed, ie each sample is a demuxed FASTQ, thus it is confusing (at least in my opinion). You can get the single file FASTQ.GZ files from the OSF repository instead of trying to re-create from SRA data.

What is the meaning of the prefix m3, m4A, m4b, m4C, m5 or m6 in the filenames?

Because we ended up sequencing the same environmental samples multiple times, it was just a prefix we were using to have unique idenfiers on each sample (ie when we submit to sequencing vendor for MiSeq used these names).

So Figure 4 is derived from the m3 (ion torrent) and the m4A (miseq) datasets. These were chronologically earlier on in this project -- as you can imagine we started with a simple query of trying to validate the read abundances from our new (at the time) Ion Torrent PGM sequencer in comparison to MiSeq, we then of course discovered what we wrote up in the paper along the way. These data were generated by pulling out these sample and mapping the reads to the "biological mock" references sequences.

This is from the OSF Jupyter notebook on how those data were generated:

#while the literature says that it is not a good idea to subsampling your reads so that you have equal number 
#of input reads in each sample will allow you to use abundance metrics, since people still do this, lets see what
#happens with the data if we subsample.  Note this is also incorrectly called rarefying...
#as you would expect, this does not change the results... subsampling doesn't allow you trust abundance
#combine the biomock samples
!cat ion3.biomock.fq.gz m4a.biomock.fq.gz > biomock.fq.gz
!amptk sample -i biomock.fq.gz -n 100000 -o biomock.sampled.fq.gz
#convert to fasta
!vsearch --fastq_filter biomock.sampled.fq.gz --fastaout biomock.sampled.fa
#map to biomock OTUs
!vsearch --usearch_global biomock.sampled.fa --otutabout biomock.sampled.otu_table.txt --id 0.97 \
    --db $HOME/amptk/DB/amptk_mock2.fa
#reorder OTU table for drawing heatmap
!amptk filter -i biomock.sampled.otu_table.txt -f $HOME/amptk/DB/amptk_mock2.fa -p 0 --col_order m3-stds m3-mock3-16000 m3-mock3-32000a m3-mock3-32000b m4A-stds m4A-mock3-16000 m4A-mock3-32000a m4A-mock3-32000b
#draw heatmap
!amptk heatmap -i biomock.sampled.final.txt --color YlOrBr -m heatmap --annotate -o biomock.sampled.pdf

Figure 6 was derived from datasets m5 (ion torrent) and m6 (miseq) which contained our synthetic spike in control (SynMock). To make the figure look they way it did, I did a little extra re-ordering, etc:

#Run UPARSE clustering on Miseq and Ion datasets containing SynMock
!amptk cluster -i ion5_1mm.demux.fq.gz -o ion5_1mm
!amptk cluster -i ion5.demux.fq.gz -o ion5
!amptk cluster -i m6_1mm.demux.fq.gz -o m6_1mm
!amptk cluster -i m6.demux.fq.gz -o m6

#using amptk filter command we can map the OTUs to the mock community
!amptk filter -i ion5_1mm.otu_table.txt -f ion5_1mm.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5_1mm --keep_mock
!amptk filter -i ion5.otu_table.txt -f ion5.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5 --keep_mock
!amptk filter -i m6_1mm.otu_table.txt -f m6_1mm.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6_1mm --keep_mock
!amptk filter -i m6.otu_table.txt -f m6.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6 --keep_mock

#now want to take a look at synmock distribution in entire miseq6 run, see where bleed is happening
#use the 0 mismatch dataset as that is what I think one should use.

import pandas as pd
from natsort import natsorted
#load the miseq table, no filtering version
df = pd.read_csv('m6.final.txt', sep='\t', index_col=0)
#re-order the columns 
keepers = ['m6-301-1', 'm6-301-2', 'm6-500-1', 'm6-500-2', 'm6-712-1', 'm6-712-2', 
           'm6-736-1', 'm6-736-2', 'm6-744-1', 'm6-744-2', 'm6-755-1', 'm6-755-2', 
           'm6-757-1', 'm6-757-2', 'm6-766-1', 'm6-766-2', 
           'm6-mock3-16000', 'm6-mock3-32000a', 'm6-mock3-32000b', 'm6-stds', 'm6-SynMock']
df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure.m6.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure.m6_synmock.txt', sep='\t')

!amptk heatmap -i figure.m6_synmock.txt -d tsv -o figure3.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 120000

# same method with the ion torrent m5 dataset
df = pd.read_csv('ion5.final.txt', sep='\t', index_col=0)
#Figure 2 should be about index bleed, so generate a table containing the samples that were run in all runs
keepers = ['m5-301-1', 'm5-301-2', 'm5-500-1', 'm5-500-2', 'm5-712-1', 'm5-712-2', 'm5-736-1', 'm5-736-2', 
           'm5-744-1', 'm5-744-2', 'm5-755-1', 'm5-755-2', 'm5-757-1', 'm5-757-2', 'm5-766-1', 'm5-766-2', 
           'm5-mock3.0-16000', 'm5-mock3.5-16000', 'm5-SynMock1.736-1.736-2', 'm5-SynMock2.757-1.757-2',
           'm5-SynMock']

df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure6.ion.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure6.ion_synmock.txt', sep='\t')

!amptk heatmap -i figure6.ion_synmock.txt -d tsv -o figure6_ion.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 17000

from amptk.

peterjc avatar peterjc commented on July 20, 2024

Thank you! My primary interest here is the synthetic spike in control (SynMock) as used with the Illumina data shown in Figure 6, so I think I only need to focus on the m6 Illumina files.

from amptk.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.