astrobiomike / bit

Bioinformatics Tools

License: GNU General Public License v3.0

Python 87.62% Shell 12.38%

bit's Introduction

Hi there!

I’m Mike Lee, a bioinformatician with NASA GeneLab and a research scientist with Blue Marble Space Institute of Science located at NASA’s Ames Research Center in northern California, USA. I focus primarily on microbial ecology and evolution in all kinds of different systems ranging from the bottoms of our oceans up to the International Space Station 👽

If interested, you can find publications here: microbialomics.org/research 🙂

astrobiomike.github.io <- bioinformatics for beginners
GToTree <- phylogenomics for all
Twitter
Google Scholar

bit's People

Forkers

vikash84 hannahhchu arkadiy-garber germant13 ayixon abdo3a anyihu akshayonly sterrettjd

bit's Issues

add option to set seed for bit-mutate-seqs

so that things could be reproduced the same way if wanted

also write the seed that was used to the log tsv (i guess put it as a comment at the top, and change from .tsv to .log)
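
A minimal sketch of what a seeded run could look like, assuming a simple per-base substitution model. The function names (`mutate_seq`, `mutate_with_seed`) and the log columns are hypothetical, not bit's actual implementation. The seed is recorded as a comment on the first line of the log so a run can be repeated exactly:

```python
import random


def mutate_seq(seq, rate, rng):
    """Substitute each base with probability `rate`, using the supplied RNG."""
    bases = "ACGT"
    out = []
    for base in seq:
        if rng.random() < rate:
            # Pick a base that differs from the current one
            out.append(rng.choice([b for b in bases if b != base.upper()]))
        else:
            out.append(base)
    return "".join(out)


def mutate_with_seed(seq, rate, seed=None):
    """Mutate `seq` and return (mutated_seq, log_lines) with the seed recorded."""
    # If no seed was requested, draw one so it can still be reported in the log
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)
    mutated = mutate_seq(seq, rate, rng)
    # Hypothetical log layout: seed as a leading comment, then one row per change
    log_lines = [f"# seed: {seed}", "position\toriginal\tmutated"]
    for pos, (old, new) in enumerate(zip(seq, mutated), start=1):
        if old != new:
            log_lines.append(f"{pos}\t{old}\t{new}")
    return mutated, log_lines
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the result independent of any other code that also draws random numbers.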

Feature request: download any accession

Hi, would it be possible to add functionality for downloading any accession ID? I'm trying to download some protein accessions at the moment, but the tool says they are not found. For example, AMR44288.1.

Thanks!

More species than there were before

After running bit to combine taxonomy, I ended up with more taxa than I had before.
For example, after running bracken I had about 7,000 species, but after running bit there are ~11,000.
Is this expected?

I used bit-combine-bracken-and-add-lineage and followed the guidance from here
https://hackmd.io/@astrobiomike/bit-bracken-combine-and-add-lineage

add option to bypass doing MAG recovery from metagenomics workflow

Create documentation for workflows

Indentation Error in bit-dl-ncbi-assemblies

Hi,
I tried to run the bit-dl-ncbi-assemblies but got this indentation error. When I fixed this, the script worked like a charm.
Thanks for a nice toolkit.

File "/data/emil/miniconda3/envs/bit/bin/helper-bit-parse-assembly-summary-file.py", line 41
    if not dl_acc:
                 ^
IndentationError: unindent does not match any outer indentation level
cut: 1651172846.bit-dl-ncbi.tmp: No such file or directory

     ******************************* NOTICE *******************************
		  68 input accessions were not found at NCBI.

		  Written to "NCBI-accessions-not-found.txt".
     **********************************************************************

	  Remaining total targets: 0

cat: 1651172846.bit-dl-ncbi.tmp: No such file or directory
rm: cannot remove '1651172846.bit-dl-ncbi.tmp': No such file or directory

bit-summarize-assembly when same assembly name but different path

Currently, if bit-summarize-assembly is given several inputs that share the same base filename (but have different paths, of course), it overwrites the same entry for all of them.

I need to add something to the names in this case so that they remain unique. Keeping full paths is an option, but it's messy and not worth it, since this probably does not come up that frequently.

Maybe check at the start of the loop whether they are all the same, and if so just add a number, then print out a map linking those new IDs to the full input paths.
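
One way the numbering idea could be sketched, assuming the inputs arrive as a list of paths. The helper name `make_unique_ids` and the `_1`, `_2` suffix style are illustrative choices, not what bit does:

```python
import os
from collections import Counter


def make_unique_ids(paths):
    """Return (ids, id_to_path); base names that repeat get a numeric suffix."""
    counts = Counter(os.path.basename(p) for p in paths)
    seen = Counter()
    ids = []
    id_to_path = {}
    for path in paths:
        name = os.path.basename(path)
        if counts[name] > 1:
            # Duplicate base name: number it in input order
            seen[name] += 1
            new_id = f"{name}_{seen[name]}"
        else:
            new_id = name
        ids.append(new_id)
        id_to_path[new_id] = path
    return ids, id_to_path
```

The returned mapping can be written out as a two-column table so each new ID can be traced back to its full input path. This sketch does not guard against an input whose real base name already ends in such a suffix.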

Batch downloading fasta files from a list of Accession#

Hi Mike,

Thanks for making this tool - exactly what I'm looking for.

I'm trying to batch download fasta files from a list of Genbank Accessions that I've made as .csv file.

However, when I try to run the command, I get the following:

~/Downloads$ bit-dl-ncbi-assemblies -w list.csv -f fasta -j 10

Targeting 157 genomes in fasta format.

Downloading ncbi assembly summaries to be able to construct ftp links...

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 1 seconds. 10 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 2 seconds. 9 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:29 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 4 seconds. 8 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 8 seconds. 7 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 16 seconds. 6 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 32 seconds. 5 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 64 seconds. 4 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:29 --:--:-- 0
curl: (28) Connection timed out after 30000 milliseconds
Warning: Transient problem: timeout Will retry in 128 seconds. 3 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0
curl: (28) Connection timed out after 30001 milliseconds
Warning: Transient problem: timeout Will retry in 256 seconds. 2 retries left.

Is there an easy fix to this? Is it because I'm requesting from .csv rather than .txt file? Any other parameters I need to set?
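
On the .csv question specifically (separate from the connection timeouts in the log above), here is a small sketch for pulling assembly accessions out of CSV text into a plain one-per-line list. It assumes the accessions are standard GCA_/GCF_ assembly accessions and that a one-accession-per-line text file is the safer input; the function name `accessions_from_csv` is hypothetical:

```python
import csv
import io
import re

# NCBI assembly accessions: GCA_ or GCF_, nine digits, optional ".version"
ACC_PATTERN = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def accessions_from_csv(text):
    """Collect unique assembly accessions from any cell of CSV text, in order."""
    found = []
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            cell = cell.strip()
            if ACC_PATTERN.match(cell) and cell not in found:
                found.append(cell)
    return found
```

Writing `"\n".join(accessions_from_csv(open("list.csv").read()))` to a .txt file would then give a single-column list with headers, extra columns, and stray whitespace removed.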

Many thanks for your help!

[Metagenomics wf] capture and report MAG coverage across samples

contig-level coverage is already generated and provided for individual samples, but that would only include the MAGs from that sample
should make a combined bowtie2 db of all MAGs and then map each sample to that to get MAG coverages across all samples
- note that this would be dividing coverage/abundance estimates between closely related MAGs recovered from different samples. So a dRep step may also be wanted, then making a dRep'd MAG bowtie2 db for mapping...
- or maybe just providing a skani/ANI output for all MAGs and leaving mapping to all...

Example bracken combining outputs and adding lineage info

I have some questions about your bracken results (Thank you very much)

(1) bracken just gives you one taxonomy level; how can you get all levels in one bracken file,

like example-bracken-output-1.tsv?

(2) If I have more than 10 samples, each with outputs at different levels (P = phylum, C = class, O = order, ...) in one folder, how can I make bracken-sample-name-map.tsv?

The files look like:
P21_b-output_P.tsv
P21_b-output_C.tsv
P21_b-output_O.tsv
P21_b-output_G.tsv
...
P22_b-output_P.tsv
P22_b-output_C.tsv
P22_b-output_O.tsv
P22_b-output_G.tsv
...

P32_b-output_P.tsv
P32_b-output_C.tsv
P32_b-output_O.tsv
P32_b-output_G.tsv
...

taxid not found when using bit/bit-kraken2-to-taxon-summaries

Lots of taxids are reported as not found when using bit/bit-kraken2-to-taxon-summaries, but when I check at NCBI, I can find them. Please help.
21:30:19.605 [WARN] taxid 2613770 was deleted
21:30:19.605 [WARN] taxid 2843216 not found
21:30:19.605 [WARN] taxid 2884264 not found
21:30:19.605 [WARN] taxid 2782568 not found
21:30:19.605 [WARN] taxid 2836373 not found
21:30:19.605 [WARN] taxid 2842355 not found
21:30:19.605 [WARN] taxid 2849779 not found
21:30:19.605 [WARN] taxid 2789776 not found
21:30:19.605 [WARN] taxid 2865832 not found
21:30:19.605 [WARN] taxid 2884263 not found
21:30:19.605 [WARN] taxid 2819238 not found
21:30:19.605 [WARN] taxid 2781735 not found
21:30:19.605 [WARN] taxid 2800788 not found

gzip downloads are malformatted

I have a file containing accession IDs named mouse_acc.txt.

>$ bit-dl-ncbi-assemblies -w mouse_acc.txt -f fasta -j 10

    Targeting 16 genomes in fasta format.

                                                               
			  DONE!

>$ ls -lth
total 12G
-rw-r--r-- 1 blk6 tgen 709M Apr 15 15:00 GCA_001632615.1.fa.gz
-rw-r--r-- 1 blk6 tgen 716M Apr 15 15:00 GCA_001632555.1.fa.gz
-rw-r--r-- 1 blk6 tgen 715M Apr 15 15:00 GCA_001632525.1.fa.gz
-rw-r--r-- 1 blk6 tgen 714M Apr 15 15:00 GCA_001632575.1.fa.gz
-rw-r--r-- 1 blk6 tgen 711M Apr 15 15:00 GCA_001624775.1.fa.gz
-rw-r--r-- 1 blk6 tgen 700M Apr 15 15:00 GCA_001624835.1.fa.gz
-rw-r--r-- 1 blk6 tgen 708M Apr 15 15:00 GCA_001624535.1.fa.gz
-rw-r--r-- 1 blk6 tgen 716M Apr 15 15:00 GCA_001624745.1.fa.gz
-rw-r--r-- 1 blk6 tgen 720M Apr 15 15:00 GCA_001624475.1.fa.gz
-rw-r--r-- 1 blk6 tgen 796M Apr 15 15:00 GCA_000001635.8.fa.gz
-rw-r--r-- 1 blk6 tgen 722M Apr 15 15:00 GCA_001624295.1.fa.gz
-rw-r--r-- 1 blk6 tgen 709M Apr 15 15:00 GCA_001624505.1.fa.gz
-rw-r--r-- 1 blk6 tgen 715M Apr 15 15:00 GCA_001624185.1.fa.gz
-rw-r--r-- 1 blk6 tgen 718M Apr 15 15:00 GCA_001624215.1.fa.gz
-rw-r--r-- 1 blk6 tgen 714M Apr 15 15:00 GCA_001624675.1.fa.gz
-rw-r--r-- 1 blk6 tgen 705M Apr 15 15:00 GCA_001624445.1.fa.gz
-rw-r--r-- 1 blk6 tgen 361M Apr 15 14:55 ncbi_assembly_info.tsv
-rw-r--r-- 1 blk6 tgen  256 Apr 15 14:49 mouse_acc.txt

>$ gunzip *.gz

gzip: GCA_000001635.8.fa.gz: invalid compressed data--format violated

Note that I get this `invalid compressed data--format violated` error for all downloaded .gz files. I've also tried running with the accessions and commands from the docs.

Am I doing something wrong or is this a classic case of NCBI changing things up and breaking peoples code? 🙂
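
For checking which downloads are affected without unpacking them in place, a sketch like the following could help. It only uses Python's standard `gzip` module; the function name `gzip_is_valid` is made up for this example:

```python
import gzip
import zlib


def gzip_is_valid(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses cleanly, False otherwise."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so large genomes are not held in memory
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError: not a gzip file; EOFError: truncated; zlib.error: corrupt data
        return False
```

Reading to the end matters because a truncated or partly corrupted download can have a valid gzip header and only fail later in the stream.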

Conda bit python requirements

Hi,
was trying to install bit into an environment I already had installed with Python=3.8, and bit won't install, due to the requirements:

bit -> python[version='>=3.7,<3.8.0a0']

Not sure if there is a major bug or reason bit won't work with >3.7, but wanted to see if this was something that could be corrected.

[Metagenomics wf] Add option to run bbnorm

[Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample

e.g., to be able to give an estimate stating something like "X% of reads from sample X recruit to the MAGs recovered from sample X" (or put another way, "How much of the starting read data made it through assembly and high-quality binning and is represented by the recovered MAGs?")
contig-level coverage is already generated and provided for individual samples, might be able to piggyback on that to get how much the MAGs capture of the total reads for a given sample
if i don't see an easier way to generate this info from what is already produced, the "long" way could be (for each sample) making new bowtie2 indexes of all recovered MAGs, mapping reads, and parsing/summarizing that
or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation... Will have to bug @ctb about it :)
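
A toy sketch of the k-mer idea, to make the question concrete: what fraction of k-mers in the reads (counted with multiplicity) also occur in the MAGs? This is an exact, in-memory version for illustration only; a real run would need a sketching or streaming approach, and the names and the default `k=21` are assumptions:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def canonical_kmers(seq, k):
    """Yield canonical k-mers from a sequence (uppercased first)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def read_kmer_fraction_in_mags(reads, mag_seqs, k=21):
    """Fraction of read k-mers that are present in any MAG sequence."""
    mag_kmers = set()
    for seq in mag_seqs:
        mag_kmers.update(canonical_kmers(seq, k))
    total = 0
    hits = 0
    for read in reads:
        for kmer in canonical_kmers(read, k):
            total += 1
            hits += kmer in mag_kmers
    return hits / total if total else 0.0
```

Because matching is exact, reads from variants of the same population that differ by even one base within a k-mer will not count, which is the underestimation concern raised above.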

Conda install takes >30 min in clean environment

Hi,

I am trying to install this tool and it's taking over 30 minutes in a clean conda environment. Any chance you can provide a list of dependencies for a GitHub install instead?

Thanks for making this!
Josh

Feature request - Ability to download gene nucleotide sequences when using `bit-dl-ncbi-assemblies`

Hey @AstrobioMike thanks for this amazing tool. I have a small feature request.
I was wondering if it will be possible to update bit-dl-ncbi-assemblies where it can be used to download nucleotide sequences of genes (basically ffn files). Currently, bit-dl-ncbi-assemblies doesn't support the download of ffn (gene sequences in nucleotide format) but allows the download of faa files (gene sequences in amino acid format).
I'm happy to help in any way I can.
Thanks

ImportError: No module named pandas

Dear Mike,

I get the errors shown below.

I actually have pandas installed in my base environment, so I don't know why this happens. Thank you!

Code:
bit-update-ncbi-taxonomy

cd /home/lbhuang/Moore/results/K2/k2_outputs/
for f1 in *_k-output.txt
do
    filename=$(basename "$f1")
    sample="${filename%_k-output.txt}"
    bit-kraken2-to-taxon-summaries -i "${sample}_k-output.txt" -o "${sample}-k-tax.tsv"
done

bit-combine-kraken2-taxon-summaries -i *-k-tax.tsv -o combined-k-tax.tsv

Errors:
Traceback (most recent call last):
  File "/home/lbhuang/mambaforge/envs/bit/bin/bit-kraken2-to-taxon-summaries", line 26, in <module>
    import pandas as pd
ImportError: No module named pandas
Traceback (most recent call last):
  File "/home/lbhuang/mambaforge/envs/bit/bin/bit-kraken2-to-taxon-summaries", line 26, in <module>
    import pandas as pd
ImportError: No module named pandas
Traceback (most recent call last):
  File "/home/lbhuang/mambaforge/envs/bit/bin/bit-combine-kraken2-taxon-summaries", line 9, in <module>
    import pandas as pd
ImportError: No module named pandas
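
One thing that may be worth checking here is which Python interpreter is actually running and whether pandas is importable from that interpreter, since a package installed in the base environment is not automatically visible from a separate conda environment. A small diagnostic sketch (the helper name `describe_module` is made up for this example):

```python
import importlib.util
import sys


def describe_module(name):
    """Report the running interpreter and whether `name` is importable from it."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "python_version": sys.version.split()[0],
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_module("pandas").items():
        print(f"{key}: {value}")
```

Running this with the same `python` that the bit scripts resolve to shows whether the mismatch is in the environment or in the interpreter being picked up.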

add program for downloading test data

make a bit-get-test-data program
make it have submodules to grab different types like bit-get-workflow
for metagenomics test data, take the same files i have pulled by genelab-utils here

[Metagenomics wf] add option to skip contig and gene-level taxonomy (so just binning/MAG recovery)

as per email request