wshuai294 / pstrain Goto Github PK

Pstrain profiles strains in metagenomics data. It infers strain abundance and genotype for each species. Also, it has a single species mode; where given a BAM and VCF, it can phase the variants for any species.

License: MIT License

Python 8.51% Shell 91.49%

strain phase metagenomics-toolkit profile-strains metaphlan3 genotype-frequency marker-gene-analysis metagenotyping snp microbial-strain

pstrain's People

Contributors

Stargazers

Watchers

Forkers

kckoh liaoherui

pstrain's Issues

How to make a marker gene?

Hi Shuai

I have a new genome, but I don't know how to make genome to marker gene. Can you give me some idea?

Bests
Chengkai

Java version for GATK?

I got the following the error, which should be at the GATK step:

[Sun May 23 22:52:52 CST 2021] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2155872256

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: HaplotypeCaller

ERROR ------------------------------------------------------------------------------------------

I googled and someone said it is because of the Java version, so what't the java version should be?

Best,

Yuxiang

Threading and parallelisation across species

Hello,

Can I check that -p parallelises over samples, of which each pipeline is mostly single-threaded? And -n only parallelises bowtie (not GATK/picard/metaphlan etc). Thus the maximum number of processes will be p * n at the bowtie stage, and p subsequently? In this case, I assume maximum efficiency would be gained by -p over the number of processors and -n 1?

Having tested overnight with two pharyngeal samples (cached metaphlan runs, and -p 8), I found that the pipeline took most of its time at the iterative pulp stage - presumably working through species (~60) one by one (two samples in parallel, I hope!). It took about 7 hours.

I have >200 samples in total, and even with 32 processes this will exceed the run-time of my HPC jobs. What could be useful for me (and others, I suspect) is a way of batching the pipeline by species after metaphlan, BAM and VCF steps (keeping the advantage of multi-threading and competitive read mapping across all species). For example, if PStrain could produce intermediate output, and a list of the species eligible for strain-calling, it could then be run from a batch job per-species -- many, like me, will have access to much greater computing power by parallelising tasks as batch jobs rather than running as single large jobs.

It may be that single_species.py already does this -- to my first look though I believe it is designed for full separation (i.e totally separate species references, alignments and VCFs).

Final separate question, is there any way of getting strain model fitting statistics out of the pipeline (e.g. some measure of how well the deconvoluted phased strains and RAs account for the reads in each sample-species combination)? I assume this might be extractable from stdout/log in some way (though interestingly, no log is produced in my output folder, despite the code suggesting it should be produced).

Thanks again for such a useful tool!

Andrew

Error running PStrain_V30.py script

Hello,

I am running PStrain_V30.py script on test dataset but it is showing an error:

Could not build fai index test_dir/test//ref//merged_ref.fa.fai

This is the complete report:

python3 PStrain_V30.py --metaphlan3 /home/rishav/.local/bin/metaphlan --dbdir_V30 ../db_V30 -c ../test/config.txt -o test_dir --bowtie2 /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2 --bowtie2-build /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2-build --samtools /home/rishav/anaconda3/envs/biobakery3/bin/samtools --picard /home/rishav/anaconda3/envs/biobakery3/bin/picard --gatk /home/rishav/anaconda3/envs/biobakery3/bin/gatk
Start Profiling test...
[faidx] Could not build fai index test_dir/test//ref//merged_ref.fa.fai
Settings:
Output files: "test_dir/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
test_dir/test//ref//merged_ref.fa
Warning: Empty fasta file: 'test_dir/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/rishav/anaconda3/envs/biobakery3/bin/bowtie2-build-s --wrapper basic-0 -f test_dir/test//ref//merged_ref.fa test_dir/test//ref//merged_ref
(ERR): "test_dir/test//ref//merged_ref" does not exist or is not a Bowtie 2 index
Exiting now ...
[main_samview] fail to read the header from "-".
[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
Error: Invalid or corrupt jarfile /home/rishav/anaconda3/envs/biobakery3/bin/picard
rm: cannot remove 'test_dir/test//map/mapped.sort.bam': No such file or directory
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.bam" : No such file or directory
samtools index: failed to open "test_dir/test//map/mapped.bam": No such file or directory
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.bam" : No such file or directory
samtools depth: Cannot open input file "test_dir/test//map/mapped.bam": No such file or directory
Error: Invalid or corrupt jarfile /home/rishav/anaconda3/envs/biobakery3/bin/gatk
[E::hts_open_format] Failed to open file "test_dir/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

Kindly help me with this concern.
Thank you

--similarity cutoff default

Hi,

I notice that while --similarity cutoff is by default mention as 0.95 it actually uses 0.80 cutoff while merging and clustering results from samples. I feel 0.80 is bit conservative. What do you suggest?

Additionally, I am strain profiling species that is present with very high coverage in metagenomes (makes up to > 80% based on Kraken2 and Metaphlan3). In that case do you recommend depth (DP) or mapping quality (MAPQ) based filtering to filter out low quality SNPs?

Also let me know if you have any other recommendations?

Thanks,
Shriram

metaphlan3 support?

Are there any plans to support metaphlan3 databases?

Error: File truncated at line 1 Could not build fai index out/test//ref//merged_ref.fa.fai

Hi Shuai WANG,

I am using this code python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --proc 20 --nproc 20 to run test file bu the error still persists:

Start Profiling test...
run metaphlan...
read_metaphlan out/test//metaphlan//metaphlan_output.txt
[E::fai_build_core] File truncated at line 1
[faidx] Could not build fai index out/test//ref//merged_ref.fa.fai
Settings:
Output files: "out/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
out/test//ref//merged_ref.fa
Warning: Empty fasta file: 'out/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/rishav/anaconda3/envs/pstrain/bin/bowtie2-build-s --wrapper basic-0 -f out/test//ref//merged_ref.fa out/test//ref//merged_ref
(ERR): "out/test//ref//merged_ref" does not exist or is not a Bowtie 2 index
Exiting now ...
[main_samview] fail to read the header from "-".
[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
[Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups INPUT=out/test//map/mapped.sort.bam OUTPUT=out/test/map/mapped.bam RGLB=whatever RGPL=illumina RGPU=whatever RGSM=whatever RGID=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon May 06 11:51:34 IST 2024] Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) JdkDeflater
[Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: /home/rishav/PStrain/test/out/test/map/mapped.sort.bam
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:338)
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:325)
at htsjdk.samtools.util.IOUtil.assertInputIsValid(IOUtil.java:301)
at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:107)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
rm: cannot remove 'out/test//map/mapped.sort.bam': No such file or directory
[E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory
samtools index: failed to open "out/test//map/mapped.bam": No such file or directory
[E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory
samtools depth: Cannot open input file "out/test//map/mapped.bam": No such file or directory
INFO 11:51:37,029 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,031 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 11:51:37,032 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 11:51:37,032 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:51:37,034 HelpFormatter - Program Args: -T HaplotypeCaller -R out/test//ref//merged_ref.fa -allowPotentiallyMisencodedQuals -I out/test//map/mapped.bam -o out/test//map/mapped.vcf.gz
INFO 11:51:37,036 HelpFormatter - Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16.
INFO 11:51:37,037 HelpFormatter - Date/Time: 2024/05/06 11:51:37
INFO 11:51:37,037 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,037 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:51:37,086 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:51:39,399 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Fasta index file /home/rishav/PStrain/test/out/test/ref/merged_ref.fa.fai for reference /home/rishav/PStrain/test/out/test/ref/merged_ref.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

ERROR ------------------------------------------------------------------------------------------

[E::hts_open_format] Failed to open file "out/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

this is the location /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases

mpa_latest
mpa_vJun23_CHOCOPhlAnSGB_202307.3.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.pkl
mpa_vJun23_CHOCOPhlAnSGB_202307_VINFO.csv
mpa_vJun23_CHOCOPhlAnSGB_202307.1.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.4.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.rev.1.bt2l
README.txt
mpa_vJun23_CHOCOPhlAnSGB_202307.2.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.fna
mpa_vJun23_CHOCOPhlAnSGB_202307.rev.2.bt2l

this is the content of mpa_vJun23_CHOCOPhlAnSGB_202307.fna

VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US
CTTTTTAAAACAATCACTTAGATTGTTTTTTAGATCGATCCTGTGGATCGGGTCGAAGAC
CCTTTCCTTATACGCGTGCGCGTGCGCACACCCGTATCCTCCAGGATCTTTTAAAATATA
TAATAATATTAAATATAATAATACCTTTAGGTATTATATTTAATTATTATAATTATTTAT
ATAAATAAATATATTAATATATTAAGATCTTAAAACGCGTGCGGGCGCGCGCGTGATCCT
TATTAAGAGATCGTGATCGGACAGACGTCCTGGTTTGGGTACCCTGTGTATTTTTTACGT
GGTTTCCCTATCGTGTGTTCAGACCACCACTCGATGTCCAAAACCCATAGCTTCGGTGGC
CGCCTCCCTGGACGGCAGACTCTGACAAGTCGACCCACACCACCAACCCGTCCGGGGACG
GATTTTTATTCCAACGACAAGCTCCACGAAAATCCCCGCAGAGGGACTCACTGCTGTGTG
ACTCACCGGCTCCTGGACGCCTACGCCTCGAGCCGGTAATTTTTTGATTACGTGAGTTTG

Thank you for your persistence

Auto detection of dependenies

I put the db and packages into the PStrain folder and it worked for the test.

However, when I was running my own folders, it failed. Especially it will give me error messages began with:

Error! numpy python library not detected!!
rm: cannot remove ‘PS_out/C1_WGSerr_r0//metaphlan2//bowtie.out.bz2’: No such file or directory
Could not build fai index PS_out/C1_WGSerr_r0//ref//merged_ref.fa.fai

I am very confused about why it worked for test but not any other samples. Could the guideline be more precise in the readme?

Thank you!

Yuxiang

Could not build fai index ./output/test//ref//merged_ref.fa.fai

Hi shuai!
I ran "test", but it showed following. What should I do?

Database is loaded.
##############
Could not build fai index ./output/test//ref//merged_ref.fa.fai
Settings:
Output files: "./output/test//ref//merged_ref..bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
./output/test//ref//merged_ref.fa
Warning: Empty fasta file: './output/test//ref//merged_ref.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f ./output/test//ref//merged_ref.fa ./output/test//ref//merged_ref
Could not locate a Bowtie index corresponding to basename "./output/test//ref//merged_ref"
Error: Encountered internal Bowtie 2 exception (#1)
Command: /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/../install/packages/bowtie2-2.3.1-legacy/bowtie2-align-s --wrapper basic-0 -x ./output/test//ref//merged_ref -p 1 -1 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/test_1.fq.gz -2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/test_2.fq.gz
(ERR): bowtie2-align exited with value 1
[Fri Mar 12 16:13:32 CST 2021] picard.sam.AddOrReplaceReadGroups INPUT=./output/test//map/mapped.sort.bam OUTPUT=./output/test/map/mapped.bam RGLB=whatever RGPL=illumina RGPU=whatever RGSM=whatever RGID=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Mar 12 16:13:32 CST 2021] Executing as [email protected] on Linux 2.6.32-696.30.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_265-b11; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) JdkDeflater
INFO 2021-03-12 16:13:32 AddOrReplaceReadGroups Created read group ID=1 PL=illumina LB=whatever SM=whatever

[Fri Mar 12 16:13:32 CST 2021] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2027945984
INFO 16:13:38,458 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,462 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 16:13:38,462 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:13:38,463 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:13:38,468 HelpFormatter - Program Args: -T HaplotypeCaller -R ./output/test//ref//merged_ref.fa -allowPotentiallyMisencodedQuals -I ./output/test//map/mapped.bam -o ./output/test//map/mapped.vcf.gz
INFO 16:13:38,479 HelpFormatter - Executing as [email protected] on Linux 2.6.32-696.30.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_265-b11.
INFO 16:13:38,480 HelpFormatter - Date/Time: 2021/03/12 16:13:38
INFO 16:13:38,480 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,480 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:13:38,595 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:13:39,470 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,470 HttpMethodDirector - Retrying request
INFO 16:13:39,472 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,473 HttpMethodDirector - Retrying request
INFO 16:13:39,474 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,475 HttpMethodDirector - Retrying request
INFO 16:13:39,476 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,476 HttpMethodDirector - Retrying request
INFO 16:13:39,478 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable (connect failed)
INFO 16:13:39,478 HttpMethodDirector - Retrying request

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Fasta index file /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/./output/test/ref/merged_ref.fa.fai for reference /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/test/./output/test/ref/merged_ref.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

ERROR ------------------------------------------------------------------------------------------

......................................
[E::hts_open_format] Failed to open file "./output/test//map/mapped.vcf.gz" : No such file or directory
No need to merge in case of one sample.

Sincerely,
Liang

Use existing metaphlan3 run

Hello,

Is it possible to accelerate this useful tool when metaphlan3 has already been run on samples?

Thanks,

Andrew

Tutorial issues with gzip files

-deleted while I work through this, and then post a solution later-

Reproducibility

Hello,

Thanks so much for the help to get 221 samples successfully analysed! No rush to respond here, but share in case it's worth exploring further.

As a first start, I've taken two samples which come from the same individual a week apart (oropharynx). I thought it would be helpful to see how reproducible the results are, given I would not expect much change in many organisms (though, worth acknowledging the individual would have been in hospital between the first and second swabs, and potentially on antibiotics). Kraken RAs (GTDB bacterial database) show fair similarity (Pearson R^2 on log10 RAs=0.81), with most species within an order of magnitude between the two samples, across four orders of magnitude of RA.

38 bacterial species give results from both samples (plus 24 in sample 1 alone and 4 in sample 2 alone - interesting as sample 2 has near twice the bacterial depth of sample 1, though sample 2 shows greater diversity)

Looking at those 38 reported in both samples, I hoped to see merged strains (0.95) mostly shared between samples, especially dominant ones. I confess I'm not full sure what the 95% corresponds to - 95% ANI of the marker genes, or 95% ANI of the SNPs (the latter would make sense to me).

In aggregate, of 77 strain clusters only 10 are shared between samples. Each are from separate species with no other strain clusters. One comprises a single matching strain between both samples (Oribacterium sinus). The others have multiple strains within a single strain cluster shared between both samples.

Species which have a matching strain are have similar abundances in sample 1 to those without (unshared species median 0.32% vs shared species 0.45%), but higher abundances in sample 2 (0.42 vs 1.3%).

28 species have no strain clusters shared between samples.

To take an example with high species RA in both samples, Porphyromonas somerae has a separate cluster in both samples, each comprising three strains. The frequency distribution is remarkably similar in both samples (~15%, ~32%, ~52%), and it would have been pleasing if three pairs of strains were instead clustered.

The reference sequences are 170,712 bases and 1367 shared polymorphic sites are identified (1726 in sample 1 and 2305 in sample 2). This seems reasonable as it would correspond to less than 1% divergence in general.

A quick pairwise alignment and identity matrix for the shared loci (69=sample 1 and 71=sample 2):

          str_1_71  str_2_71  str_3_71
str_1_69 0.8456474 0.9392831 0.7973665
str_2_69 0.8580834 0.9502560 0.7834674
str_3_69 0.8573519 0.9597659 0.7710315

It appears that strain_1_69 pairs with strain_2_71 (although the frequencies don't correspond; 14 vs 33%), however both strain_2_69 (31%) and strain_3_69 (55%) are closest to strain_2_71 (33%). None are closest to strain_3_71 (52%).

Assuming the strain profile should be stable between the two samples, I wonder if the issue could be phasing between genes or within genes, producing hybrid strains.

I've started to have a look by focusing on the 15 genes with 25 or more SNPs. The signals are a little confusing to me. Mean coverage (depth file) indicates a fairly wide range in depth: 32-57 in sample 1 (median 44) and 14-62 in sample 2 (median 30).

A0A069ZK39 strain_1_69 is dissimilar to all three strains in the second sample (72-89% identity), strain_2_69 is identical to str_3_71 (100% identity) and similar to strain_2_71 (96%). Strain_3_69 is dissimilar to all three (55-66% identity).

Many of the other genes have unusual patterns because the strains within sample 1 share identical sequence, e.g. for A0A134B4G4 all strains from the first sample are identical to strain_2_71, 84% identical to strain_1_71 and 52% identical to strain_3_71.

Do you have any ideas about what might be going on?

Thanks,

Andrew

The strain_RA.txt different from the intermediate metaphlan2_output.txt

Hi shuai!
I found that strain_RA.txt different from the intermediate metaphlan2_output.txt.
The strain_RA results contradicted my simulated data while the metaphlan2_output.txt showed the correct abundance, so I checked the metaphlan2_output.txt and strain_RA result by comparing their species's abundance. The abundance of the species is different between strain_RA.txt and metaphlan2_output.txt.

strain_RA.txt

Species | Species_RA

metaphlan2_output.txt

I'm not sure if I'm running it wrong. Here is my run command and log.
$ cat run.sh
python /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/scripts/PStrain.py -c config.txt -o ./output -p 4 -n 8 --dbdir /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/db --prior /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/db/prior_beta.pickle --bowtie2-build /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/bowtie2-2.3.1-legacy/bowtie2-build --bowtie2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/bowtie2-2.3.1-legacy/bowtie2 --samtools /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/SamTools-1.3.1/samtools --metaphlan2 /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/metaphlan2/metaphlan2.py --gatk /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/GATK_3.5/GenomeAnalysisTK.jar --picard /ldfssz1/ST_META/P18Z10200N0127_GRJ/liangweiting2/software/PStrain/install/packages/picard-tools-2.1.0/picard.jar

log is too big to show, so I attach it here.
log.txt

Best,
Liang

A snippet offering a for-loop based config construction script

With input fastq in a root directory, in individual folders for each sample, and runs named $sample_1.fastq and $sample_2.fastq, and the names are assigned to a variable, $AllSamples

for I in $AllSamples; do echo "//";echo sample : $I ; echo fq1: /rootdir/$I/$I_1.fastq ; echo fq2: /rootdir/$I/$I_2.fastq ; done

wshuai294 / pstrain Goto Github PK

pstrain's People

Contributors

Stargazers

Watchers

Forkers

pstrain's Issues

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: HaplotypeCaller

ERROR ------------------------------------------------------------------------------------------

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR ------------------------------------------------------------------------------------------

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR ------------------------------------------------------------------------------------------

strain_RA.txt

Species | Species_RA

metaphlan2_output.txt

Recommend Projects

Recommend Topics

Recommend Org