Coder Social home page Coder Social logo

wshuai294 / pstrain Goto Github PK

View Code? Open in Web Editor NEW
8.0 2.0 2.0 100.75 MB

Pstrain profiles strains in metagenomics data. It infers strain abundance and genotype for each species. Also, it has a single species mode; where given a BAM and VCF, it can phase the variants for any species.

License: MIT License

Python 8.51% Shell 91.49%
strain phase metagenomics-toolkit profile-strains metaphlan3 genotype-frequency marker-gene-analysis metagenotyping snp microbial-strain

pstrain's People

Contributors

wshuai294 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

kckoh liaoherui

pstrain's Issues

How to make a marker gene?

Hi Shuai

I have a new genome, but I don't know how to make genome to marker gene. Can you give me some idea?

Bests
Chengkai

Java version for GATK?

I got the following the error, which should be at the GATK step:

[Sun May 23 22:52:52 CST 2021] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2155872256

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: HaplotypeCaller
ERROR ------------------------------------------------------------------------------------------

I googled and someone said it is because of the Java version, so what't the java version should be?

Best,

Yuxiang

Threading and parallelisation across species

Hello,

Can I check that -p parallelises over samples, of which each pipeline is mostly single-threaded? And -n only parallelises bowtie (not GATK/picard/metaphlan etc). Thus the maximum number of processes will be p * n at the bowtie stage, and p subsequently? In this case, I assume maximum efficiency would be gained by -p over the number of processors and -n 1?

Having tested overnight with two pharyngeal samples (cached metaphlan runs, and -p 8), I found that the pipeline took most of its time at the iterative pulp stage - presumably working through species (~60) one by one (two samples in parallel, I hope!). It took about 7 hours.

I have >200 samples in total, and even with 32 processes this will exceed the run-time of my HPC jobs. What could be useful for me (and others, I suspect) is a way of batching the pipeline by species after metaphlan, BAM and VCF steps (keeping the advantage of multi-threading and competitive read mapping across all species). For example, if PStrain could produce intermediate output, and a list of the species eligible for strain-calling, it could then be run from a batch job per-species -- many, like me, will have access to much greater computing power by parallelising tasks as batch jobs rather than running as single large jobs.

It may be that single_species.py already does this -- to my first look though I believe it is designed for full separation (i.e totally separate species references, alignments and VCFs).

Final separate question, is there any way of getting strain model fitting statistics out of the pipeline (e.g. some measure of how well the deconvoluted phased strains and RAs account for the reads in each sample-species combination)? I assume this might be extractable from stdout/log in some way (though interestingly, no log is produced in my output folder, despite the code suggesting it should be produced).

Thanks again for such a useful tool!

Andrew

Error running PStrain_V30.py script

Hello,

I am running PStrain_V30.py script on test dataset but it is showing an error:

Could not build fai index test_dir/test//ref//merged_ref.fa.fai

This is the complete report:

python3 PStrain_V30.py --metaphlan3 /home/rishav/.local/bin/metaphlan --dbdir_V30 ../db_V30 -c ../test/config.txt -o test_dir --bowtie2 /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2 --bowtie2-build /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2-build --samtools /home/rishav/anaconda3/envs/biobakery3/bin/samtools --picard /home/rishav/anaconda3/envs/biobakery3/bin/picard --gatk /home/rishav/anaconda3/envs/biobakery3/bin/gatk
Start Profiling test...
[faidx] Could not build fai index test_dir/test//ref//merged_ref.fa.fai
Settings:
Output files: "test_dir/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void
:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
test_dir/test//ref//merged_ref.fa
Warning: Empty fasta file: 'test_dir/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2-build-s --wrapper basic-0 -f test_dir/test//ref//merged_ref.fa test_dir/test//ref//merged_ref
(ERR): "test_dir/test//ref//merged_ref" does not exist or is not a Bowtie 2 index
Exiting now ...
[main_samview] fail to read the header from "-".
[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
Error: Invalid or corrupt jarfile /home/rishav/anaconda3/envs/biobakery3/bin/picard
rm: cannot remove 'test_dir/test//map/mapped.sort.bam': No such file or directory
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.bam" : No such file or directory
samtools index: failed to open "test_dir/test//map/mapped.bam": No such file or directory
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.bam" : No such file or directory
samtools depth: Cannot open input file "test_dir/test//map/mapped.bam": No such file or directory
Error: Invalid or corrupt jarfile /home/rishav/anaconda3/envs/biobakery3/bin/gatk
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

Kindly help me with this concern.
Thank you

--similarity cutoff default

Hi,

I notice that while --similarity cutoff is by default mention as 0.95 it actually uses 0.80 cutoff while merging and clustering results from samples. I feel 0.80 is bit conservative. What do you suggest?

Additionally, I am strain profiling species that is present with very high coverage in metagenomes (makes up to > 80% based on Kraken2 and Metaphlan3). In that case do you recommend depth (DP) or mapping quality (MAPQ) based filtering to filter out low quality SNPs?

Also let me know if you have any other recommendations?

Thanks,
Shriram

Error: File truncated at line 1 Could not build fai index out/test//ref//merged_ref.fa.fai

Hi Shuai WANG,

I am using this code python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --proc 20 --nproc 20 to run test file bu the error still persists:

Start Profiling test...
run metaphlan...
read_metaphlan out/test//metaphlan//metaphlan_output.txt
[E::fai_build_core] File truncated at line 1
[faidx] Could not build fai index out/test//ref//merged_ref.fa.fai
Settings:
Output files: "out/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void
:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
out/test//ref//merged_ref.fa
Warning: Empty fasta file: 'out/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/rishav/anaconda3/envs/pstrain/bin/bowtie2-build-s --wrapper basic-0 -f out/test//ref//merged_ref.fa out/test//ref//merged_ref
(ERR): "out/test//ref//merged_ref" does not exist or is not a Bowtie 2 index
Exiting now ...
[main_samview] fail to read the header from "-".
[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
[Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups INPUT=out/test//map/mapped.sort.bam OUTPUT=out/test/map/mapped.bam RGLB=whatever RGPL=illumina RGPU=whatever RGSM=whatever RGID=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon May 06 11:51:34 IST 2024] Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) JdkDeflater
[Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: /home/rishav/PStrain/test/out/test/map/mapped.sort.bam
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:338)
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:325)
at htsjdk.samtools.util.IOUtil.assertInputIsValid(IOUtil.java:301)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:107)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
rm: cannot remove 'out/test//map/mapped.sort.bam': No such file or directory
[E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory
samtools index: failed to open "out/test//map/mapped.bam": No such file or directory
[E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory
samtools depth: Cannot open input file "out/test//map/mapped.bam": No such file or directory
INFO 11:51:37,029 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,031 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 11:51:37,032 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 11:51:37,032 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:51:37,034 HelpFormatter - Program Args: -T HaplotypeCaller -R out/test//ref//merged_ref.fa -allowPotentiallyMisencodedQuals -I out/test//map/mapped.bam -o out/test//map/mapped.vcf.gz
INFO 11:51:37,036 HelpFormatter - Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16.
INFO 11:51:37,037 HelpFormatter - Date/Time: 2024/05/06 11:51:37
INFO 11:51:37,037 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,037 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,086 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:51:39,399 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Fasta index file /home/rishav/PStrain/test/out/test/ref/merged_ref.fa.fai for reference /home/rishav/PStrain/test/out/test/ref/merged_ref.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
ERROR ------------------------------------------------------------------------------------------

[E::hts_open_format] Failed to open file "out/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

this is the location /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases

mpa_latest
mpa_vJun23_CHOCOPhlAnSGB_202307.3.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.pkl
mpa_vJun23_CHOCOPhlAnSGB_202307_VINFO.csv
mpa_vJun23_CHOCOPhlAnSGB_202307.1.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.4.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.rev.1.bt2l
README.txt
mpa_vJun23_CHOCOPhlAnSGB_202307.2.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.fna
mpa_vJun23_CHOCOPhlAnSGB_202307.rev.2.bt2l

this is the content of mpa_vJun23_CHOCOPhlAnSGB_202307.fna

VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
CTTTTTAAAACAATCACTTAGATTGTTTTTTAGATCGATCCTGTGGATCGGGTCGAAGAC
CCTTTCCTTATACGCGTGCGCGTGCGCACACCCGTATCCTCCAGGATCTTTTAAAATATA
TAATAATATTAAATATAATAATACCTTTAGGTATTATATTTAATTATTATAATTATTTAT
ATAAATAAATATATTAATATATTAAGATCTTAAAACGCGTGCGGGCGCGCGCGTGATCCT
TATTAAGAGATCGTGATCGGACAGACGTCCTGGTTTGGGTACCCTGTGTATTTTTTACGT
GGTTTCCCTATCGTGTGTTCAGACCACCACTCGATGTCCAAAACCCATAGCTTCGGTGGC
CGCCTCCCTGGACGGCAGACTCTGACAAGTCGACCCACACCACCAACCCGTCCGGGGACG
GATTTTTATTCCAACGACAAGCTCCACGAAAATCCCCGCAGAGGGACTCACTGCTGTGTG
ACTCACCGGCTCCTGGACGCCTACGCCTCGAGCCGGTAATTTTTTGATTACGTGAGTTTG

Thank you for your persistence

Auto detection of dependenies

I put the db and packages into the PStrain folder and it worked for the test.

However, when I was running my own folders, it failed. Especially it will give me error messages began with:

Error! numpy python library not detected!!
rm: cannot remove ‘PS_out/C1_WGSerr_r0//metaphlan2//bowtie.out.bz2’: No such file or directory
Could not build fai index PS_out/C1_WGSerr_r0//ref//merged_ref.fa.fai

I am very confused about why it worked for test but not any other samples. Could the guideline be more precise in the readme?

Thank you!

Yuxiang

Could not build fai index ./output/test//ref//merged_ref.fa.fai

Hi shuai!
I ran "test", but it showed following. What should I do?

Database is loaded.
##############
Could not build fai index ./output/test//ref//merged_ref.fa.fai
Settings:
Output files: "./output/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void
:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
./output/test//ref//merged_ref.fa
Warning: Empty fasta file: './output/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f ./output/test//ref//merged_ref.fa ./output/test//ref//merged_ref
Could not locate a Bowtie index corresponding to basename "./output/test//ref//merged_ref"
Error: Encountered internal Bowtie 2 exception (#1)
Command: /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/../install/packages/bowtie2-2.3.1-legacy/bowtie2-align-s --wrapper basic-0 -x ./output/test//ref//merged_ref -p 1 -1 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/test_1.fq.gz -2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/test_2.fq.gz
(ERR): bowtie2-align exited with value 1
[Fri Mar 12 16:13:32 CST 2021] picard.sam.AddOrReplaceReadGroups INPUT=./output/test//map/mapped.sort.bam OUTPUT=./output/test/map/mapped.bam RGLB=whatever RGPL=illumina RGPU=whatever RGSM=whatever RGID=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Mar 12 16:13:32 CST 2021] Executing as [email protected] on Linux 2.6.32-696.30.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_265-b11; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) JdkDeflater
INFO 2021-03-12 16:13:32 AddOrReplaceReadGroups Created read group ID=1 PL=illumina LB=whatever SM=whatever

[Fri Mar 12 16:13:32 CST 2021] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2027945984
INFO 16:13:38,458 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,462 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 16:13:38,462 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:13:38,463 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:13:38,468 HelpFormatter - Program Args: -T HaplotypeCaller -R ./output/test//ref//merged_ref.fa -allowPotentiallyMisencodedQuals -I ./output/test//map/mapped.bam -o ./output/test//map/mapped.vcf.gz
INFO 16:13:38,479 HelpFormatter - Executing as [email protected] on Linux 2.6.32-696.30.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_265-b11.
INFO 16:13:38,480 HelpFormatter - Date/Time: 2021/03/12 16:13:38
INFO 16:13:38,480 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,480 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,595 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:13:39,470 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,470 HttpMethodDirector - Retrying request
INFO 16:13:39,472 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,473 HttpMethodDirector - Retrying request
INFO 16:13:39,474 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,475 HttpMethodDirector - Retrying request
INFO 16:13:39,476 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,476 HttpMethodDirector - Retrying request
INFO 16:13:39,478 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,478 HttpMethodDirector - Retrying request

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Fasta index file /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/./output/test/ref/merged_ref.fa.fai for reference /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/./output/test/ref/merged_ref.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
ERROR ------------------------------------------------------------------------------------------

......................................
[E::hts_open_format] Failed to open file "./output/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

Sincerely,
Liang

Use existing metaphlan3 run

Hello,

Is it possible to accelerate this useful tool when metaphlan3 has already been run on samples?

Thanks,

Andrew

Reproducibility

Hello,

Thanks so much for the help to get 221 samples successfully analysed! No rush to respond here, but share in case it's worth exploring further.

As a first start, I've taken two samples which come from the same individual a week apart (oropharynx). I thought it would be helpful to see how reproducible the results are, given I would not expect much change in many organisms (though, worth acknowledging the individual would have been in hospital between the first and second swabs, and potentially on antibiotics). Kraken RAs (GTDB bacterial database) show fair similarity (Pearson R^2 on log10 RAs=0.81), with most species within an order of magnitude between the two samples, across four orders of magnitude of RA.

38 bacterial species give results from both samples (plus 24 in sample 1 alone and 4 in sample 2 alone - interesting as sample 2 has near twice the bacterial depth of sample 1, though sample 2 shows greater diversity)

Looking at those 38 reported in both samples, I hoped to see merged strains (0.95) mostly shared between samples, especially dominant ones. I confess I'm not full sure what the 95% corresponds to - 95% ANI of the marker genes, or 95% ANI of the SNPs (the latter would make sense to me).

In aggregate, of 77 strain clusters only 10 are shared between samples. Each are from separate species with no other strain clusters. One comprises a single matching strain between both samples (Oribacterium sinus). The others have multiple strains within a single strain cluster shared between both samples.

Species which have a matching strain are have similar abundances in sample 1 to those without (unshared species median 0.32% vs shared species 0.45%), but higher abundances in sample 2 (0.42 vs 1.3%).

28 species have no strain clusters shared between samples.

To take an example with high species RA in both samples, Porphyromonas somerae has a separate cluster in both samples, each comprising three strains. The frequency distribution is remarkably similar in both samples (~15%, ~32%, ~52%), and it would have been pleasing if three pairs of strains were instead clustered.

The reference sequences are 170,712 bases and 1367 shared polymorphic sites are identified (1726 in sample 1 and 2305 in sample 2). This seems reasonable as it would correspond to less than 1% divergence in general.

A quick pairwise alignment and identity matrix for the shared loci (69=sample 1 and 71=sample 2):

          str_1_71  str_2_71  str_3_71
str_1_69 0.8456474 0.9392831 0.7973665
str_2_69 0.8580834 0.9502560 0.7834674
str_3_69 0.8573519 0.9597659 0.7710315

It appears that strain_1_69 pairs with strain_2_71 (although the frequencies don't correspond; 14 vs 33%), however both strain_2_69 (31%) and strain_3_69 (55%) are closest to strain_2_71 (33%). None are closest to strain_3_71 (52%).

Assuming the strain profile should be stable between the two samples, I wonder if the issue could be phasing between genes or within genes, producing hybrid strains.

I've started to have a look by focusing on the 15 genes with 25 or more SNPs. The signals are a little confusing to me. Mean coverage (depth file) indicates a fairly wide range in depth: 32-57 in sample 1 (median 44) and 14-62 in sample 2 (median 30).

A0A069ZK39 strain_1_69 is dissimilar to all three strains in the second sample (72-89% identity), strain_2_69 is identical to str_3_71 (100% identity) and similar to strain_2_71 (96%). Strain_3_69 is dissimilar to all three (55-66% identity).

Many of the other genes have unusual patterns because the strains within sample 1 share identical sequence, e.g. for A0A134B4G4 all strains from the first sample are identical to strain_2_71, 84% identical to strain_1_71 and 52% identical to strain_3_71.

Do you have any ideas about what might be going on?

Thanks,

Andrew

The strain_RA.txt different from the intermediate metaphlan2_output.txt

Hi shuai!
I found that strain_RA.txt different from the intermediate metaphlan2_output.txt.
The strain_RA results contradicted my simulated data while the metaphlan2_output.txt showed the correct abundance, so I checked the metaphlan2_output.txt and strain_RA result by comparing their species's abundance. The abundance of the species is different between strain_RA.txt and metaphlan2_output.txt.

strain_RA.txt

Species | Species_RA

Bifidobacterium_bifidum | 36.7256
Bacteroides_thetaiotaomicron | 13.17946
Odoribacter_splanchnicus | 10.54507
Odoribacter_splanchnicus | 10.54507
Odoribacter_splanchnicus | 10.54507
Prevotella_intermedia | 9.94046
Prevotella_intermedia | 9.94046
Prevotella_intermedia | 9.94046
Alistipes_finegoldii | 7.29353
Clostridium_butyricum | 4.98109
Clostridium_butyricum | 4.98109
Clostridium_butyricum | 4.98109
Clostridium_sporogenes | 3.69829
Clostridium_sporogenes | 3.69829
Clostridium_sporogenes | 3.69829
Clostridium_pasteurianum | 2.43552
Clostridium_pasteurianum | 2.43552
Clostridium_tetani | 1.57689
Clostridium_stercorarium | 1.5142
Clostridium_stercorarium | 1.5142
Clostridium_stercorarium | 1.5142
Clostridium_stercorarium | 1.5142
Megasphaera_elsdenii | 1.27493
Bordetella_bronchiseptica_parapertussis | 0.68471
Bordetella_bronchiseptica_parapertussis | 0.68471
Bordetella_bronchiseptica_parapertussis | 0.68471
Bordetella_holmesii | 0.61797
Bordetella_pertussis | 0.58863
Bordetella_pertussis | 0.58863
Bordetella_pertussis | 0.58863
Bordetella_petrii | 0.55972
Bordetella_petrii | 0.55972
Taylorella_equigenitalis | 0.50617
Desulfovibrio_africanus | 0.40631
Desulfovibrio_vulgaris | 0.34669
Desulfovibrio_vulgaris | 0.34669
Desulfovibrio_vulgaris | 0.34669
Desulfovibrio_vulgaris | 0.34669
Haemophilus_influenzae | 0.33584
Achromobacter_xylosoxidans | 0.40631
Haemophilus_ducreyi | 0.29188
Coraliomargarita_akajimensis | 10.54507
Roseburia_hominis | 9.94046
Clostridium_botulinum | 2.43552
Clostridium_saccharobutylicum | 0.50617
Desulfitobacterium_dichloroeliminans | 0.32522

metaphlan2_output.txt

s__Bordetella_pertussis |   | 36.7256
s__Bordetella_petrii |   | 13.17946
s__Coraliomargarita_akajimensis |   | 10.54507
s__Roseburia_hominis |   | 9.94046
s__Clostridium_stercorarium |   | 7.29353
s__Clostridium_butyricum |   | 4.98109
s__Bordetella_bronchiseptica_parapertussis | 3.69829
s__Clostridium_botulinum |   | 2.43552
s__Clostridium_pasteurianum |   | 1.57689
s__Prevotella_intermedia |   | 1.5142
s__Desulfovibrio_vulgaris |   | 1.27493
s__Odoribacter_splanchnicus |   | 0.68471
s__Bordetella_holmesii |   | 0.61797
s__Bifidobacterium_bifidum |   | 0.58863
s__Clostridium_sporogenes |   | 0.55972
s__Clostridium_saccharobutylicum | 0.50617
s__Achromobacter_xylosoxidans |   | 0.40631
s__Bacteroides_thetaiotaomicron |   | 0.34669
s__Candidatus_Arthromitus_unclassified | 0.33744
s__Taylorella_equigenitalis |   | 0.33584
s__Desulfovibrio_africanus |   | 0.32804
s__Desulfitobacterium_dichloroeliminans | 0.32522
s__Alistipes_finegoldii |   | 0.31922
s__Haemophilus_ducreyi |   | 0.29188
s__Haemophilus_influenzae |   | 0.27104
s__Clostridium_tetani |   | 0.26774
s__Achromobacter_unclassified |   | 0.21043
s__Megasphaera_unclassified |   | 0.13154
s__Megasphaera_elsdenii |   | 0.1292
s__Pusillimonas_unclassified |   | 0.07615
s__Butyrivibrio_unclassified |   | 0.06459
s__Coriobacterium_glomerans |   | 0.03642

I'm not sure if I'm running it wrong. Here is my run command and log.
$ cat run.sh
python /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/scripts/PStrain.py -c config.txt -o ./output -p 4 -n 8 --dbdir /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/db --prior /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/db/prior_beta.pickle --bowtie2-build /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/bowtie2-2.3.1-legacy/bowtie2-build --bowtie2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/bowtie2-2.3.1-legacy/bowtie2 --samtools /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/SamTools-1.3.1/samtools --metaphlan2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/metaphlan2/metaphlan2.py --gatk /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/GATK_3.5/GenomeAnalysisTK.jar --picard /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/picard-tools-2.1.0/picard.jar

log is too big to show, so I attach it here.
log.txt

Best,
Liang

A snippet offering a for-loop based config construction script

With input fastq in a root directory, in individual folders for each sample, and runs named $sample_1.fastq and $sample_2.fastq, and the names are assigned to a variable, $AllSamples

for I in $AllSamples; do echo "//";echo sample : $I ; echo fq1: /rootdir/$I/$I_1.fastq ; echo fq2: /rootdir/$I/$I_2.fastq ; done

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.