liaoherui / strainscan Goto Github PK

View Code? Open in Web Editor NEW

36.0 4.0 5.0 58.08 MB

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers

Home Page: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w

License: MIT License

Python 98.13% R 0.06% C 0.68% Shell 1.13%

k-mers metagenomics reference-based bacterial-genomes strains

strainscan's Introduction

StrainScan

One efficient, accurate and high-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers. StrainScan takes reference database and sequencing data as input, outputs strain-level microbiome compistion analysis report.

Contributor: Liao Herui and Ji Yongxin (Ph.D of City University of Hong Kong, EE), Nick Youngblut

E-mail: [email protected] / [email protected]

Version: V1.0.14 (update at 2023-10-13)

Click here to check the log of all updates

[Update - 2022 - 05 - 01] :

V1.0.3: StrainScan can be installed via bioconda now!

[Update - 2022 - 06 - 07] :

V1.0.10: Add multuple threads to the reference database constrcution!

[Update - 2023 - 02 - 07] :

Two new intra-cluster searching modes are updated: plasmid_mode and extraRegion_mode.

[Update - 2023 - 04 - 22] :

StrainScan is able to take gzipped and PE FASTQs as input now!

[Update - 2023 - 05 - 04] :

StrainScan is able to take the custom clustering file generated by other clustering methods (e.g. PopPunk)! Additionally, we have made updates to the script (StrainScan_subsample.py) which now enables users to subsample their strains using hierarchical clustering.

[Update - 2023 - 05 - 15] :

Add a parameter "-b" that enables StrainScan to output the probability of detecting a strain in samples with low sequencing depth (e.g. <1X).

[Update - 2023 - 09 - 23] :

One database containing 1627 Staphylococcus aureus strains is publicly available!

[Update - 2023 - 09 - 29] :

V1.0.13: Update Bioconda version to latest GitHub version, which has more and new functions!!

[Update - 2023 - 10 - 03] :

One database containing 1124 Lactobacillus crispatus strains is publicly available!

[Update - 2023 - 10 - 13] :

V1.0.14: Fix a bug in identify.py about the identification of low-depth strains!

Overview of StrainScan:

Dependencies:

Python ==3.7.x
R
Sibeliaz ==1.2.2 (https://github.com/medvedevgroup/SibeliaZ)
Required python package: numpy==1.17.3, pandas==1.0.1, biopython==1.74, scipy==1.3.1, scikit-learn==0.23.1, bidict==0.21.3, treelib==1.6.1

Make sure these programs have been installed before using StrainScan.

Install (Linux or ubuntu only)

Option 1 - The first way to install StrainScan, is to use bioconda. Once you have bioconda environment installed, install package strainscan:

conda install -c bioconda strainscan

It should be noted that some commands have been replaced if you install StrainScan using bioconda. (See below)

Command (Not bioconda)	Command (bioconda)
python StrainScan.py -h	strainscan -h
python StrainScan_build.py -h	strainscan_build -h

Option 2 - Also, yon can install StrainScan via Anaconda using the commands below:

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Option 3 - Or, you can install all dependencies of StrainScan mannually and then run the commands below.

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Pre-built databases download

The table below offers information about the pre-built databases of 6 bacterial species used in the paper and 2 additional bacterial species. Users can download these databases and use them to identify strains directly.

Species	Source	Number of Strains	Number of Clusters	Download link
Akkermansia muciniphila	NCBI	157	53	Google drive
Cutibacterium acnes	NCBI	275	28	Google drive
Prevotella copri	NCBI	112	51	Google drive
Escherichia coli	NCBI	1433	823	Google drive
Mycobacterium tuberculosis	NCBI	792	25	Google drive
Staphylococcus epidermidis	NCBI	995	378	Google drive
Staphylococcus aureus	NCBI	1627	202	Google drive
Lactobacillus crispatus	NCBI	1124	311	Google drive

You can also use bash scripts in the folder "Download_DB_script" to download the pre-built databases from Google drive. For example,

cd Download_DB_script
sh ecoli_db.sh

If you can not download databases from the google drive, you may try to download databases from the Baidu Netdisk.
Baidu Netdisk link: https://pan.baidu.com/s/1YFtHv2weJEBdwTz4LmTKOQ
Extraction code: ASDF

Usage

One example about database construction and identification commands can be found in "test_run.sh".

Use StrainScan to build your own custom database.

python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir>

eg: python StrainScan_build.py -i Test_genomes -o DB_Small

(Note: input fasta can be gzipped format)

Use StrainScan to build your own custom database with custom clustering file.

python StrainScan_build.py -i <Input_Genomes> -c <Cluster_file> -o <Database_Dir>

The data format of the input clustering file can be found in the demo file Custom_cluster_demo/custom_cls.txt, where the first column is the cluster ID, the second column is the cluster size, and the last column is the prefix of the reference genomes in the cluster.

Use StrainScan_subsample to subsample your large-scale strains with high redundancy.

python StrainScan_subsample.py -i <Input_Genomes> -o <Output_Dir>

If you have large-scale strains with high redundancy and you want to remove the redundancy. Then you can use this script, which utilizes dashing and hierarchical clustering to subsample strains efficiently. The subsampled strains can be found in <Output_Dir>/Rep_ref and clustering information can be found in <Output_Dir>/Cls_res. More parameters can be found using python StrainScan_subsample.py -h.

Use StrainScan to identify bacterial strains in short reads.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir>

eg: python StrainScan.py -i Sim_Data/GCF_003812785.fq -d DB_Small -o Test_Sim/GCF_003812785
or python StrainScan.py -i Sim_Data_mul/GCA_000144385_5X_GCF_008868325_5X.fq -d DB_Small -o Test_Sim/GCA_000144385_5X_GCF_008868325_5X

PE reads (can be gzipped FASTQ format)
python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785

Note: if the sequencing depth of targeted strains is super low (e.g., <1X), then you may get "Warning: No clusters can be detected!". In this case, you can use parameter "-b" to output the probability of detecting a strain (cluster) in low-depth samples. The higher the probability, the more likely the strain (cluster) is to be present.
eg: python StrainScan.py -i <Input_reads> -d <Database_Dir> -b 1 -o <Output_Dir>

Use StrainScan to identify plasmids of bacterial strains in short reads.

option-1: identify possible plasmids by using contigs <100000 bp:
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 1 -r <Ref_genome_Dir> -o <Output_Dir>

option-2: identify possible plasmids (or possible strains) using reference genomes provided by "-r".
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 2 -r <Ref_genome_Dir> -o <Output_Dir>

<Ref_genome_Dir> refer to the dir of reference genomes of identified clusters or all strains used to build the database.

Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.

This mode will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. If there is a novel strain not in the database, then its closest relative can be one specific strain while its partial regions (we call them "extraRegion" ) in the genome can be similar to other strains. In this case, this mode can search its closest relative and return strains with "extraRegion" covered for downstream analysis.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -e 1 -o <Output_Dir>

Full command-line options

Identification - StrainScan.py (Default k-mer size: 31)

StrainScan - A kmer-based strain-level identification tool.

Example: python StrainScan.py -i  <Input_reads> -d <Database_Dir> -o <Output_Dir>

required arguments:
    -i, --input_fastq             Input fastq data.
    -j, --input_fastq_2		  Input fastq data (for pair-end data).
    -d, --database_dir            Path of StrainScan database.

optional arguments:
    -h, --help                    Show help message and exit.
    -o, --output_dir              The output directory. (Default: ./StrainScan_Result)
    -k, --kmer_size               The size of k-mer, should be odd number. (Default: k=31)
    -l, --low_dep                 This parameter can be set to "1" if the sequencing depth of input data is very low (e.g. < 5x). For super low depth ( < 1x ), you can use "-l 2" (default: -l 0)
    -b, --strain_prob		  If this parameter is set to 1, then the algorithm will output the probabolity of detecting a strain (or cluster) in low-depth (e.g. <1x) samples.  (default: -b 0)
    -p,	--plasmid_mode		  If this parameter is set to 1, the intra-cluster searching process will search possible plasmids using short contigs (<100000 bp) in strain genomes, which are likely to be plasmids. 
                                  If this parameter is set to 2, the intra-cluster searching process will search possible plasmids or strains using given reference genomes by "-r".
    				  Reference genome sequences (-r) are required if this mode is used. (default: -p 0)
    -r, --ref_genome		  The dir of reference genomes of identified cluster or all strains. If plasmid_mode is used, then this parameter is required.
    -e, --extraRegion_mode	  If this parameter is set to 1, the intra-cluster searching process will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. (default: -e 0)
    -s, --minimum_snv_num         The minimum number of SNVs during the iterative matrix multiplication at Layer-2 identification. (Default: s=40)

Build database - StrainScan_build.py (Default k-mer size: 31)

StrainScan - A kmer-based strain-level identification tool.

Example:  python StrainScan_build.py -i <Input_genomes> -o <Database_Dir>

required arguments:
     -i, --input_fasta             The path of input genomes. ("fasta" format)
     
optional arguments:
     -o, --output_dir              The output directory of constructed database. (Default: ./StrainScan_DB)
     -c, --cls_file		   The custom clustering file provided by user.
     -k, --kmer_size               The size of k-mer, should be odd number. (Default: k=31)
     -t, --threads		   The threads used to build the database. (default: t=1)
     -u, --uk_num                  The maximum number of unique k-mers in each genome to extract. (Default: 100000)
     -g, --gk_ratio                The ratio of group-specific k-mers to extract. (Default: g=1.0)        
     -m, --strainest_sample        If this parameter is 1, then the program will search joint kmer sets from msa generated by Strainest. To use this parameter, you have to make sure Strainest can run normally. (Default: 0)
     -e, --memory_efficient	   If this parameter is 1, then the program will build the database with the memory efficient mode.
     -n, --mink_cutoff             Minimum k-mer number cutoff in a node of the cluster search tree (CST). (Default: n=1000)
     -x, --maxk_cutoff             Maximum k-mer number cutoff in a node of the cluster search tree (CST). (Default: x=30000)
     -r, --maxn_cutoff             Maximum cluster number for node reconstruction of the cluster search tree(CST). (Default: r=3000)

Output Format

The output of StrainScan contains two parts. The first part is the final identification report file in text format. This file contains all identified strains and their predicted depth and relative abundance, etc. The second part is the strain identification report files inside each cluster.

For your reference, two output files are given as example in the folder "Output_Example" in this repository. These files contain identification results of one single-strain and one two-strain (depth: 5X and 5X) simulated datasets, respectively.

Explaination about the columns in the final identification report file (E.g. "Output_Example/GCA_000144385_5X_GCF_008868325_5X/final_report.txt") of StrainScan.

Column_name	Description
Strain_ID	The numerical id of identified strains in the ascending order.
Strain_Name	The name of identified strains. (In the example output, the name refers to the NCBI RefSeq accession id)
Cluster_ID	The cluster id of identified strains. (For cluster information, users can check "<Database_Dir>/Cluster_Result/hclsMap_95_recls.txt")
Relative_Abundance_Inside_Cluster	The predicted relative abundance of identified strains inside the cluster.
Predicted_Depth (Enet)	The predicted sequencing depth of identified strains inside the cluster using elastic net model.
Predicted_Depth (Ab*cls_depth)	The final predicted sequencing depth of identified strains.
Coverage	The estimated k-mer-based coverage of identified strains.
Coverd/Total_kmr	The number of "covered" and "total" k-mers of identified strains.
Valid_kmr	The valid k-mer refers to the k-mer belonging to the identified strain during the iterative matrix multiplication. More valid k-mers there are, more likely this strain exist.
Remain_Coverage	The coverage calculated by "covered" / "total" k-mers during the iterative matrix multiplication.
CV	The number of "covered" and "valid" k-mers of identified strains.
Exist evidence	By default, identified strains with "relative abundance > 0.02 and coverage >0.7" will be marked as "". Strains with "" are more likely to exist. However, for low-depth samples, this parameter may be not useful.

References:

how to cite this tool:

Liao, H., Ji, Y. & Sun, Y. High-resolution strain-level microbiome composition analysis from short reads. Microbiome 11, 183 (2023). https://doi.org/10.1186/s40168-023-01615-w

strainscan's People

Contributors

Stargazers

Watchers

Forkers

johnson951101 nick-youngblut orin-beep csigabigusz pi-bioinfo

strainscan's Issues

Issue with StrainScan_build during database construction

Hi,

I am encountering a problem with StrainScan during the construction of a Vibrio parahaemolyticus database. The process has been stuck at the "merge genomes" and "run Sibeliaz" (as shown in the screenshot atttached). This has been the case for several weeks, regardless of the number of genomes processed.

I have updated the StrainScan to the latest version. Could you please provide some guidance or a solution to overcome this problem? Any help would be greatly appreciated.

Thank you.

Best.
Bing

Genomes used to build the database

Could you provide me with the assembly_accession of the Escherichia coli (1433) used to build the database?

Thank you in advance!

Error for the usage of parameter "-b"

Hello,

when i use the command 1 to identify strains,
command 1: strainscan -i ./E.coli-3lr.fastq -d ./strainscan/DB_Ecoli -o ./strainscan/Test/Ecoli/same-3
i got "Warning: No clusters can be detected!"

So, i use the command 2 to identify strains, which adds "-b 1"
command 2: strainscan -i ./E.coli-3lr.fastq -d ./strainscan/DB_Ecoli -b 1 -o ./strainscan/Test/Ecoli/same-3
however, i got the error below:

usage: StrainScan.py [-h] -i INPUT_FQ -d DB_DIR [-o OUT_DIR] [-k KSIZE]
                     [-l LDEP] [-s MSN]
StrainScan.py: error: unrecognized arguments: -b 1

i dont't know what's wrong

Thanks for any feedback you might have!

How to create custom databases

Thanks for the program.

I'm not sure how to go about building a custom database, for example I am interested in looking at S. aureus strains in my metagenome samples but the database build directions aren't really clear. Do I need to download all the available genomes for S.aureus from NCBI then use them as input for StrainScan_build.py?

Bugs with database construction, strain scanning

Hi,

Thank you for your work on this tool, it looks very interesting. I have been trying it out and noticed a few problems:

When generating a database of very similar strains (differing only by SNPs) with a .fasta extension, I get this error:

Traceback (most recent call last):
  File "/opt/conda/envs/strainscan/bin/strainscan_build", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
    Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 266, in build_tree
    cls_dist, mapping, tree, depths, depths_mapping = hierarchy(fna_mapping, dist)
  File "/opt/conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 71, in hierarchy
    for i in tree_relationship[parent]:
KeyError: 1

It looks like StrainScan only supports uncompressed reads because of JellyFish not being able to natively handle .gz files. Based on the following lines of code:


    os.system(dir_jf + " count -m 31 -s 100M -t 8 --if " + db_dir+"/kmer.fa -o temp_"+uid+".jf " + fq_path)
    os.system(dir_jf+" dump -c temp_"+uid+".jf > temp_"+uid+".fa")
    os.system("rm temp_"+uid+".jf")

It looks like you could just check if the input data has a ".gz" extension, then zcat the data into jellyfish to get the kmer counts. If that is too bothersome, an CLI option to provide the kmer counts would also work perfectly.

Thanks for any feedback you might have!

构建本地数据库报错

Traceback (most recent call last):
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/bin/strainscan_build", line 10, in
sys.exit(main())
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 287, in build_tree
cls_dist, mapping, tree, depths, depths_mapping = hierarchy(fna_mapping, dist)
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 59, in hierarchy
mapping[i] -= 2
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_mut.py", line 78, in setitem
self._put(key, val, self.on_dup)
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 219, in _put
dedup_result = self._dedup_item(key, val, on_dup)
File "/usr/lishasha/biosoft/miniconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 254, in _dedup_item
raise KeyAndValueDuplicationError(key, val)
bidict.KeyAndValueDuplicationError: (513, 282)

在构建本地数据库时发生如上报错，总共构建了3个菌，两个菌成功运行，1个失败。成功的两个菌基因组个数在300左右，失败的是600多，是对加入的基因组个数有限制么？

├── Cluster_Result
│   ├── distance_matrix_rebuild.txt
│   ├── distance_matrix.txt
│   ├── hclsMap_95_recls.txt
│   ├── hclsMap_95_Rep.txt
│   ├── hclsMap_95.txt
│   └── Other_Strain_CN.txt
├── Kmer_Sets_L2
└── Tree_database
├── nodes_kmer
├── overlap
└── test

报错的1个菌已经生成的文件目录

system calls: not report on errors

It appears that system calls (eg., calling dashing) are not done in a way that returns the stderr if the child job fails. For example:

def construct_matrix(input_fa):	
	'''
	pwd=os.getcwd()
	dir_work=os.path.split(os.path.abspath(__file__))[0]
	os.chdir(dir_work)
	'''
	path_file='genome_path_tem.txt'
	o=open(path_file,'w+')
	for filename in os.listdir(input_fa):
		o.write(input_fa+'/'+filename+'\n')
	o.close()
	#pwd=os.getcwd()	
	dir_dash=os.path.split(os.path.abspath(__file__))[0]+'/dashing_s128'
	#print(dir_dash+' dist -p10 -k31 -Odistance_matrix.txt -osize_estimates.txt -Q '+path_file+' -F '+path_file)
	#print(pwd)
	#exit()
	os.system(dir_dash+' dist  -p10 -k31 -Odistance_matrix.txt -osize_estimates.txt -Q  '+path_file+' -F '+path_file)
	#exit()
	nn=rebuild_matrix('distance_matrix.txt')
	os.system('rm genome_path_tem.txt size_estimates.txt')
	#os.system('rm genome_path_tem.txt size_estimates.txt')
	return nn

Switching os.system to Popen would be helpful. For instance:

    # cmd
    cmd = dir_dash+' dist  -p10 -k31 -Odistance_matrix.txt -osize_estimates.txt -Q  '+path_file+' -F '+path_file
    # run child job
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    output, err = p.communicate()
    ## check for errors
    if p.returncode != 0:
        raise ValueError(f'Error running {cmd}: {err.decode()}')

strainscan_build stuck after processing a cluster

Hi,
thanks for writing StrainScan, it's very useful!

However, I'm having some problems during the strainscan_build step.
I want to build the database on all the 358 NCBI RefSeq genomes of Lacticaseibacillus rhamnosus. I'm using ubuntu version 22.04 (installed as Windows Subsystem for Linux version 2 on a DELL mounting windows 11), with ~25.6 GB memory allocated.
Also, I installed strainscan using conda:
conda create -n strainscan -c bioconda strainscan=1.0.14 (latest version).

The command I'm trying to use for the building step is:
strainscan_build -i rhamnosus_genomes/ -o strainscan_rhamnosus_dir/ -k 31 -t 6 -u 100000 -e 1

The command starts running well, it detects 92 clusters, and then extracts k-mer from most of those clusters. However, after correctly processing one cluster (specifically, cluster named as C5) it stops and gets stuck. It does not start to process the next cluster, and running 'htop' I see that the processes are in S, no longer running.
I also left the computer run for one night, but the situation did not change: it does not start processing the next cluster.

I also tried several times running the same command, but it always gets stuck after processing cluster C5.

Any guesses on how to solve this?

Thank you!!

How to infer different species-level abundances based on strain-level abundances

Hi!
Thanks for the clear documentation and awesome tool.
I have read your explanation about Strainscan, and it seems that it is used to analyze the abundance of each strain in a specified species.
But I want to analyze multiple species. Can I build a strain database from multiple species to analyze the abundance of each strain to infer the abundance of different species?

use a custom database to identify strains give IndexError: list index out of range

Hello！Thank you for the tools you developed. When I use a custom database to identify strains, it give these error:
`terminate called after throwing an instance of 'std::runtime_error'
what(): Unsupported format
Aborted (core dumped)
Failed to open input file 'temp_415d474aef2c11eea34abc97e1c3cf11.jf'
rm: cannot remove 'temp_415d474aef2c11eea34abc97e1c3cf11.jf': No such file or directory

193: weak

parent node: 193 ->
194: weak
1: weak
1: 0.000000 | 0.000000 0

parent node: 194 ->
195: weak
2: weak
2: 0.000000 | 0.000000 0

Traceback (most recent call last):
File "/home/data/t220324/miniconda3/envs/env4mamba/envs/strainscan/bin/strainscan", line 10, in
sys.exit(main())
File "/home/data/t220324/miniconda3/envs/env4mamba/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan.py", line 196, in main
cls_dict = identify_low_mem.identify_cluster(in_fq, db_dir + '/Tree_database', [0.1, 0.4, 1])
File "/home/data/t220324/miniconda3/envs/env4mamba/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify_low_mem.py", line 408, in identify_cluster
search(pending, match_results, db_dir, valid_kmers, length, cov, abundance, cov_cutoff, ab_cutoff, results, leaves, res_temp, tree, overlapping_info)
File "/home/data/t220324/miniconda3/envs/env4mamba/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify_low_mem.py", line 247, in search
print("parent node: %d ->"%tree.parent(group[0].identifier).identifier)
IndexError: list index out of range`

Everything seems to be fine during the custom database building process，and the version is latest，any ideas?
Maybe it's the bug of the memory efficient mode?

How to create custom databases

Hi!
I would like to use your tool with my own reference database.

I installed using option 2 after reading through #10

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

However, when I try to run the StrainScan_build.py command with your test data, I get the following error:
~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small

2023-10-06 12:14:33,705 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:14:33,822 - Hierarchical clustering
2023-10-06 12:14:34,117 - Constructing the tree
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 127, in main
    31, params])
  File "/home/StrainScan/library/Build_tree.py", line 385, in build_tree
    kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
  File "/home/StrainScan/library/Build_tree.py", line 115, in extract_kmers
    reverse = seqpy.revcomp(forward)
AttributeError: module 'seqpy' has no attribute 'revcomp'

When I try to use it with my own genomes, I get the following different error:
python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB

2023-10-06 12:23:53,950 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:23:54,539 - Hierarchical clustering
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 116, in main
    cls_file, cls_res)
  File "/home/StrainScan/library/select_rep.py", line 46, in pick_rep
    final_res[ele[2]]=int(ele[0])
IndexError: list index out of range

Any ideas?

数据库构建 killed

用您提供的测试文件可以正常运行。但是用自己的文件构建数据库的时候显示killed。请问作者这种情况应该怎么处理呢。

How to compare sequencing depth between different samples

Hello,

I'm curious about how to interpret the sequencing depth statistic in the output between different samples.

Is this dependent on the amount of sequencing data? Or is normalised somehow?

Apologies, I'm a biologist and not able to follow the mathematics describing the calculation of it in the pre-print.

Thanks,

Phil

Build custom database use too much memory

Hello！Thank you for the tools you developed. When I build a custom database, It will need about twice as much memory as the number of strains. Is there any way to reduce the amount of memory?

IndexError: list index out of range

Hello,
when I use the command 1 to identify strains,
command 1: python StrainScan.py -i E.coli-3lr.fastq -d ./download_database/DB_Ecoli -b 1 -o ./test
I got the errors below:

I don't know what's wrong

By the way, the code that I install StrainScan is:

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan
conda env create -f environment_candidate.yaml
conda activate strainscan
chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Thanks for any feedback you might have!

Error in building a custom database

Hi,

I am trying to build a custom database through your provided "Test_genomes" directory, but got an error. Here is the code:

(strainscan) [rishabh@localhost StrainScan]$ python StrainScan_build.py -i Test_genomes -o DB_Small /home/rishabh/StrainScan 2023-12-07 09:05:17,690 - Constructing matrix with dashing (jaccard index) 2023-12-07 09:05:17,769 - Hierarchical clustering Traceback (most recent call last): File "StrainScan_build.py", line 162, in <module> main() File "StrainScan_build.py", line 116, in main cls_file, cls_res) File "/home/rishabh/StrainScan/library/select_rep.py", line 68, in pick_rep index.append(dpi[t]) KeyError: 'GCA_005937645.1_ASM593764v1_genomic'

Any help would be appreciated.

ZeroDivisionError: division by zero

Hi,
I was running your software and got the following error. I am using the newest version of strainscan that I cloned from github 3 days ago.
Traceback (most recent call last):
File "StrainScan/StrainScan.py", line 275, in
sys.exit(main())
File "StrainScan/StrainScan.py", line 194, in main
cls_dict = identify_low_mem.identify_cluster(in_fq, db_dir + '/Tree_database', [0.1, 0.4, 1])
File "/home/ubuntu/Pig_microbiome/Culturomics/mesophilic_growth/strainscan/StrainScan/library/identify_low_mem.py", line 408, in identify_cluster
search(pending, match_results, db_dir, valid_kmers, length, cov, abundance, cov_cutoff, ab_cutoff, results, leaves, res_temp, tree, overlapping_info)
File "/home/ubuntu/Pig_microbiome/Culturomics/mesophilic_growth/strainscan/StrainScan/library/identify_low_mem.py", line 225, in search
cov[node] = len(k_profile)/length[node]
ZeroDivisionError: division by zero

Best wishes
Rafał

unrecognized arguments: -j (v. 1.0.10)

Hello,

Thank you very much for StrainScan development.

I am trying to use it with PE reads, but the option -j is not recognized.

The output from strainscan -h does not list -j:

usage: StrainScan.py [-h] -i INPUT_FQ -d DB_DIR [-o OUT_DIR] [-k KSIZE]
                     [-l LDEP] [-s MSN]

StrainScan - A kmer-based strain-level identification tool.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FQ, --input_fastq INPUT_FQ
                        The dir of input fastq data --- Required
  -d DB_DIR, --database_dir DB_DIR
                        The dir of your database --- Required
  -o OUT_DIR, --output_dir OUT_DIR
                        Output dir (default: current dir/StrainVote_Result)
  -k KSIZE, --kmer_size KSIZE
                        The size of kmer, should be odd number. (default:
                        k=31)
  -l LDEP, --low_dep LDEP
                        This parameter can be set to "1" if the sequencing
                        depth of input data is very low (e.g. < 10x). For
                        super low depth ( < 1x ), you can use "-l 2" (default:
                        -l 0)
  -s MSN, --minimum_snv_num MSN
                        The minimum number of SNV at Layer-2 identification.
                        (default: k=40)

My version is 1.0.10.

If possible, can you also explain what kind of modifications changing --low_dep does? :)

Output explanation - different total kmers and relationship between depth and coverage

Hi,

I'm interested in using StrainScan to analyse P. copri genomes. I cloned the github version (conda version doesn't take pair reads yet), and installed the conda env to get the dependencies.

It ran successfully on an example dataset, which is great! I used your example P. copri database.

However, I was wondering if you could explain two things.

This is my output:

I was surprised by two things:

There doesn't appear to be a correlation between predicted depth and coverage. Normally with sequencing experiments, we expect that with similar depth of sequencing there should be similar coverage, but this is not the case here, with very similar depth but very different coverage.
Perhaps related to the first point, why is the total number of k-mers different between the two clusters? Both clusters contain only a single genome, of similar size, therefore, I'd expect the number of k-mers to be the same. Or are these unique k-mers within the overall database and C15 is more divergent from the other clusters than C11?

All the best,

Phil

paired-end, interleaved, or just read1?

The README docs are not clear about what reads can be used: paired-end (separate files for each in the pair), interleaved (1 files for all paired-end reads), or just read1.

It appears that GCF_003812785.fq includes both read pairs (only fastq header format), but the reads are not interleaved: all read2 sequences are just appended below read1.

Lastly, does StrainScan.py accept compressed read files?

Given that uncompressing, interleaving, or other operations for large read datasets is computationally expensive (and requires more code), it would be quite helpful if the user has flexibility on the input.

liaoherui / strainscan Goto Github PK

strainscan's Introduction

StrainScan

Contributor: Liao Herui and Ji Yongxin (Ph.D of City University of Hong Kong, EE), Nick Youngblut

E-mail: [email protected] / [email protected]

Version: V1.0.14 (update at 2023-10-13)

[Update - 2022 - 05 - 01] :

[Update - 2022 - 06 - 07] :

[Update - 2023 - 02 - 07] :

[Update - 2023 - 04 - 22] :

[Update - 2023 - 05 - 04] :

[Update - 2023 - 05 - 15] :

[Update - 2023 - 09 - 23] :

[Update - 2023 - 09 - 29] :

[Update - 2023 - 10 - 03] :

[Update - 2023 - 10 - 13] :

Overview of StrainScan:

Dependencies:

Install (Linux or ubuntu only)

Pre-built databases download

Usage

Use StrainScan to build your own custom database.

Use StrainScan to build your own custom database with custom clustering file.

Use StrainScan_subsample to subsample your large-scale strains with high redundancy.

Use StrainScan to identify bacterial strains in short reads.

Use StrainScan to identify plasmids of bacterial strains in short reads.

Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.

Full command-line options

Output Format

References:

strainscan's People

Contributors

Stargazers

Watchers

Forkers

strainscan's Issues

Recommend Projects

Recommend Topics

Recommend Org