Cross reference <a class="issue-link js-issue-link" data-error-text="Failed to load ti

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Matching MockA, MockB1 and MockB2 to FASTQ filenames about amptk HOT 2 CLOSED

peterjc commented on July 20, 2024

Matching MockA, MockB1 and MockB2 to FASTQ filenames

from amptk.

Comments (2)

nextgenusfs commented on July 20, 2024

Hi @peterjc, sorry for my delayed response.

Because of how you submit samples to SRA they have to be minimally processed, ie each sample is a demuxed FASTQ, thus it is confusing (at least in my opinion). You can get the single file FASTQ.GZ files from the OSF repository instead of trying to re-create from SRA data.

What is the meaning of the prefix m3, m4A, m4b, m4C, m5 or m6 in the filenames?

Because we ended up sequencing the same environmental samples multiple times, it was just a prefix we were using to have unique idenfiers on each sample (ie when we submit to sequencing vendor for MiSeq used these names).

So Figure 4 is derived from the m3 (ion torrent) and the m4A (miseq) datasets. These were chronologically earlier on in this project -- as you can imagine we started with a simple query of trying to validate the read abundances from our new (at the time) Ion Torrent PGM sequencer in comparison to MiSeq, we then of course discovered what we wrote up in the paper along the way. These data were generated by pulling out these sample and mapping the reads to the "biological mock" references sequences.

This is from the OSF Jupyter notebook on how those data were generated:

#while the literature says that it is not a good idea to subsampling your reads so that you have equal number 
#of input reads in each sample will allow you to use abundance metrics, since people still do this, lets see what
#happens with the data if we subsample.  Note this is also incorrectly called rarefying...
#as you would expect, this does not change the results... subsampling doesn't allow you trust abundance
#combine the biomock samples
!cat ion3.biomock.fq.gz m4a.biomock.fq.gz > biomock.fq.gz
!amptk sample -i biomock.fq.gz -n 100000 -o biomock.sampled.fq.gz
#convert to fasta
!vsearch --fastq_filter biomock.sampled.fq.gz --fastaout biomock.sampled.fa
#map to biomock OTUs
!vsearch --usearch_global biomock.sampled.fa --otutabout biomock.sampled.otu_table.txt --id 0.97 \
    --db $HOME/amptk/DB/amptk_mock2.fa
#reorder OTU table for drawing heatmap
!amptk filter -i biomock.sampled.otu_table.txt -f $HOME/amptk/DB/amptk_mock2.fa -p 0 --col_order m3-stds m3-mock3-16000 m3-mock3-32000a m3-mock3-32000b m4A-stds m4A-mock3-16000 m4A-mock3-32000a m4A-mock3-32000b
#draw heatmap
!amptk heatmap -i biomock.sampled.final.txt --color YlOrBr -m heatmap --annotate -o biomock.sampled.pdf

Figure 6 was derived from datasets m5 (ion torrent) and m6 (miseq) which contained our synthetic spike in control (SynMock). To make the figure look they way it did, I did a little extra re-ordering, etc:

#Run UPARSE clustering on Miseq and Ion datasets containing SynMock
!amptk cluster -i ion5_1mm.demux.fq.gz -o ion5_1mm
!amptk cluster -i ion5.demux.fq.gz -o ion5
!amptk cluster -i m6_1mm.demux.fq.gz -o m6_1mm
!amptk cluster -i m6.demux.fq.gz -o m6

#using amptk filter command we can map the OTUs to the mock community
!amptk filter -i ion5_1mm.otu_table.txt -f ion5_1mm.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5_1mm --keep_mock
!amptk filter -i ion5.otu_table.txt -f ion5.cluster.otus.fa -p 0 -b m5-SynMock -m synmock -o ion5 --keep_mock
!amptk filter -i m6_1mm.otu_table.txt -f m6_1mm.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6_1mm --keep_mock
!amptk filter -i m6.otu_table.txt -f m6.cluster.otus.fa -p 0 -b m6-SynMock -m synmock -o m6 --keep_mock

#now want to take a look at synmock distribution in entire miseq6 run, see where bleed is happening
#use the 0 mismatch dataset as that is what I think one should use.

import pandas as pd
from natsort import natsorted
#load the miseq table, no filtering version
df = pd.read_csv('m6.final.txt', sep='\t', index_col=0)
#re-order the columns 
keepers = ['m6-301-1', 'm6-301-2', 'm6-500-1', 'm6-500-2', 'm6-712-1', 'm6-712-2', 
           'm6-736-1', 'm6-736-2', 'm6-744-1', 'm6-744-2', 'm6-755-1', 'm6-755-2', 
           'm6-757-1', 'm6-757-2', 'm6-766-1', 'm6-766-2', 
           'm6-mock3-16000', 'm6-mock3-32000a', 'm6-mock3-32000b', 'm6-stds', 'm6-SynMock']
df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure.m6.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure.m6_synmock.txt', sep='\t')

!amptk heatmap -i figure.m6_synmock.txt -d tsv -o figure3.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 120000

# same method with the ion torrent m5 dataset
df = pd.read_csv('ion5.final.txt', sep='\t', index_col=0)
#Figure 2 should be about index bleed, so generate a table containing the samples that were run in all runs
keepers = ['m5-301-1', 'm5-301-2', 'm5-500-1', 'm5-500-2', 'm5-712-1', 'm5-712-2', 'm5-736-1', 'm5-736-2', 
           'm5-744-1', 'm5-744-2', 'm5-755-1', 'm5-755-2', 'm5-757-1', 'm5-757-2', 'm5-766-1', 'm5-766-2', 
           'm5-mock3.0-16000', 'm5-mock3.5-16000', 'm5-SynMock1.736-1.736-2', 'm5-SynMock2.757-1.757-2',
           'm5-SynMock']

df2 = df.reindex(columns=keepers, index=natsorted(df.index))
mocks = []
for x in df2.index:
    if x.startswith('mock'):
        mocks.append(x)
otus = [item for item in df2.index if item not in mocks]

df3 = df2.reindex(columns=keepers, index=mocks)
df3.to_csv('figure6.ion.onlymocks.txt', sep='\t')
df4 = df2.reindex(columns=keepers, index=otus)
#get number of OTUs from df4 as well as total reads
reads = df4.sum(axis=0)
reads.name ="Other OTUs"
otu_count = df4[df4 > 0].count(axis=0, numeric_only=True)
otu_count.name = "Number of OTUs"
#print reads
#print otu_count
#now append the read numbers to df3 dataframe
df3 = df3.append(reads)
#print df3
total_reads = df3.sum(axis=0)
#print total_reads
df3.to_csv('figure6.ion_synmock.txt', sep='\t')

!amptk heatmap -i figure6.ion_synmock.txt -d tsv -o figure6_ion.pdf --format pdf -m heatmap --annotate --color YlOrBr --vmax 17000

from amptk.

peterjc commented on July 20, 2024

Thank you! My primary interest here is the synthetic spike in control (SynMock) as used with the Illumina data shown in Figure 6, so I think I only need to focus on the m6 Illumina files.

from amptk.

Matching MockA, MockB1 and MockB2 to FASTQ filenames about amptk HOT 2 CLOSED

Comments (2)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent