metagentools / graphbin
✨🧬 Refined binning of metagenomic contigs using assembly graphs
Home Page: https://graphbin.readthedocs.io/en/stable/
License: BSD 3-Clause "New" or "Revised" License
Hello,
I am running graphbin v 1.7.1 and python version 3.10.13
First, I was unable to get the python support scripts to work so I renamed all of my files to make naming consistent and made the csv file with custom scripts.
When I tried to run the following code:
graphbin --contigs /home/ejunkins/LS01_hifi_coveragebin/LS01_001_assembly_renamed_edges.fasta --binned /home/ejunkins/LS01_hifi_coveragebin/bins/001/contignames/edges/LS01_001_all_edges_graphbin.csv --graph /home/ejunkins/jgi_assemblygraphs/NGXTG/flye/assembly_graph.gfa --output /home/ejunkins/LS01_hifi_coveragebin/bins/001/graphbin_out --prefix graphbin_metabat2_LS01_001_bin_with_cov --assembler flye
I get this error:
2024-02-12 12:28:23,452 - ERROR - Please make sure to provide the path to the contigs.paths file.
2024-02-12 12:28:23,453 - INFO - Exiting GraphBin... Bye...!
My understanding was that this was only for spades assemblies...
Hi, I recently heard about your tool and am hoping it can improve some of my binning results.
I cloned the version on github today (July 17 2020) and installed following the recommendations on the wiki. I have encountered a couple of errors that may be easy to resolve but wanted to share.
1 - I think the path to the assembler-specific scripts may be missing a forward slash. I got the following Errno 2:
python ${path}/apps/GraphBin/graphbin.py --assembler spades --graph assembly_graph_with_scaffolds.gfa --paths contigs.paths --binned ../initial_contig_bins.csv --output ../../gb_bins/
python: can't open file '${path}/apps/GraphBinsrc/graphbin_SPAdes.py': [Errno 2] No such file or directory
Looking in the graphbin.py script, I added a forward slash in the SPAdes section so it points to: ${path}/apps/GraphBin/src/graphbin_SPAdes.py
which worked.
2 - then I got a missing module error:
python ${path}//apps/GraphBin/graphbin.py --assembler spades --graph assembly_graph_with_scaffolds.gfa --paths contigs.paths --binned ../initial_contig_bins.csv --output ../../gb_bins/
Traceback (most recent call last):
  File "${path}//apps/GraphBin/src/graphbin_SPAdes.py", line 24, in <module>
    from igraph import *
ModuleNotFoundError: No module named 'igraph'
I'm working on a cluster and was able to install igraph locally and get GraphBin to run but wanted to share this in case others have these issues.
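The missing separator in item 1 is the kind of bug that joining path components with `os.path.join` avoids. The snippet below is only a sketch: the helper name `script_path` and the install directory are hypothetical and are not taken from the GraphBin source.

```python
import os


def script_path(base_dir, *parts):
    """Join path components so a separator is never dropped.

    Hypothetical helper for illustration; not part of GraphBin.
    """
    return os.path.join(base_dir, *parts)


if __name__ == "__main__":
    base = "/opt/apps/GraphBin"  # hypothetical install location

    # Plain string concatenation silently drops the separator.
    print(base + "src/graphbin_SPAdes.py")

    # os.path.join inserts the separator where it is needed.
    print(script_path(base, "src", "graphbin_SPAdes.py"))
```

Building the path this way gives the same result whether or not the base directory ends with a slash.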
Speed up the final file-writing step by writing to the individual bin files simultaneously.
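One way this could look, as a rough sketch only: group the contigs by bin and hand each bin file to a worker thread. The function names, the `bin_<id>.fasta` naming scheme and the in-memory dictionaries are assumptions made for illustration, not GraphBin's actual code or output layout.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_bin(out_dir, bin_id, records):
    """Write all (name, sequence) records of one bin to its own FASTA file."""
    path = Path(out_dir) / f"bin_{bin_id}.fasta"  # assumed naming scheme
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")
    return path


def write_bins_parallel(out_dir, contig_bins, sequences, max_workers=4):
    """Write one FASTA file per bin, one worker task per file.

    contig_bins maps contig name -> bin id; sequences maps contig name -> sequence.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    # Group contigs by bin so that each file is written by exactly one task.
    grouped = defaultdict(list)
    for contig, bin_id in contig_bins.items():
        grouped[bin_id].append((contig, sequences[contig]))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_bin, out_dir, bin_id, records)
            for bin_id, records in grouped.items()
        ]
        # result() re-raises any exception that occurred in a worker.
        return sorted(future.result() for future in futures)
```

Because every file is owned by a single task, no locking is needed. Whether this is actually faster depends on the file system, so it would need to be measured.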
I've used MEGAHIT to assemble samples individually, and then ran vamb in order to bin them all together.
I was wondering whether GraphBin can cope with / be used to refine this type of input. There is one binning input, so that should be all right as long as I make sure the contig names are the same. For the contigs file, I can concatenate the individual contigs so there is one input file. But I'm confused about the assembly graph file. I guess I could concatenate all necessary fastg files, while taking care to have only one begin and end line, and then convert to gfa... but in that case, should the fastg file include one, or multiple 'assembly name' lines? Do you have any idea?
Kind regards,
Laura
Hello,
A long-time user of GraphBin recommended this program to me, and I'm excited to use it. Yesterday, I was able to install the software successfully using the instructions on Github (the ones on the readthedocs page didn't work out), but since then, I'm having a couple of issues running GraphBin.
My main issue deals with the fastg2gfa script. I have questions about this.
1.1: My install of the parent software, gfaview, is failing. After git clone and make, I get the following error:
$ make
make: Warning: File `gfa.c' has modification time 17 s in the future
gcc -c -g -Wall -Wc++-compat -O2 -I. gfa.c -o gfa.o
gfa.c: In function ‘gfa_print’:
gfa.c:534:17: warning: variable ‘len’ set but not used [-Wunused-but-set-variable]
int max = 0, len;
^
gfa.c:564:17: warning: variable ‘len’ set but not used [-Wunused-but-set-variable]
int max = 0, len;
^
gcc -c -g -Wall -Wc++-compat -O2 -I. gfaview.c -o gfaview.o
gcc -g -Wall -Wc++-compat -O2 gfa.o gfaview.o -o gfaview -lz
make: warning: Clock skew detected. Your build may be incomplete.
How do I fix this, please?
1.2: Even if I got gfaview to compile properly, how do I run a script that is in the misc directory of this program?
Any help troubleshooting this will be much appreciated. Thank you very much.
Suggest the following structural changes to enable distribution via PyPI and also for Windows users:
src/
    graphbin/
        __init__.py     # this should be your current graphbin file
        utils/
            ...         # all files currently under graphbin_utils
        support/
            ...         # all files currently under support
tests/
    data/               # test_data dir renamed to here
    ...                 # test scripts
pyproject.toml          # replace setup.py with this, hook into scripts
Hi GraphBin group, I was able to run GraphBin and get output that looks correct for the majority of my files. I have a subset of 8 of my 45 files that are all getting the same errors. I have double checked the content of these files, which seem to be fine. I'm copying the code & output below, would you let me know if there are any workarounds you might suggest? Thanks!
python ${path}/GraphBin/graphbin.py --assembler spades --graph ${spades_dir}/${name}/assembly_graph_with_scaffolds.gfa --paths ${spades_dir}/${name}/contigs.paths --binned ${path}/graphbin/inputs/CSVs/${name}_initial_contig_bins.csv --output ${outdir}/${name}
2020-07-23 13:12:22,545 - INFO - Welcome to GraphBin: Refined Binning of Metagenomic Contigs using Assembly Graphs.
2020-07-23 13:12:22,547 - INFO - This version of GraphBin makes use of the assembly graph produced by SPAdes which is based on the de Bruijn graph approach.
2020-07-23 13:12:22,547 - INFO - Input arguments:
2020-07-23 13:12:22,547 - INFO - Assembly graph file: ${path}/PM2-C1D1/assembly_graph_with_scaffolds.gfa
2020-07-23 13:12:22,547 - INFO - Contig paths file: ${path}/PM2-C1D1/contigs.paths
2020-07-23 13:12:22,547 - INFO - Existing binning output file: ${path}/inputs/CSVs/PM2-C1D1_initial_contig_bins.csv
2020-07-23 13:12:22,547 - INFO - Final binning output file: ${path}/gb_bins/PM2-C1D1/
2020-07-23 13:12:22,547 - INFO - Maximum number of iterations: 100
2020-07-23 13:12:22,547 - INFO - Difference threshold: 0.1
2020-07-23 13:12:22,547 - INFO - GraphBin started
2020-07-23 13:12:22,567 - INFO - Number of bins available in the initial binning result: 14
2020-07-23 13:12:22,567 - INFO - Constructing the assembly graph
2020-07-23 13:12:23,228 - INFO - Total number of contigs available: 60554
2020-07-23 13:12:28,439 - INFO - Total number of edges in the assembly graph: 7011
2020-07-23 13:12:28,439 - INFO - Obtaining the initial binning result
2020-07-23 13:12:28,452 - INFO - Determining ambiguous vertices
2020-07-23 13:12:28,936 - INFO - Removing labels of ambiguous vertices
2020-07-23 13:12:28,988 - INFO - Obtaining the refined binning result
2020-07-23 13:12:28,988 - INFO - Deteremining vertices which are not isolated and not in components without any labels
2020-07-23 13:12:36,162 - INFO - Number of non-isolated contigs: 4678
Traceback (most recent call last):
  File "${path}/GraphBin/src/labelpropagation/labelprop.py", line 113, in process_data_line
    for edge in edges:
TypeError: 'int' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "${path}/GraphBin/src/graphbin_SPAdes.py", line 497, in <module>
    lp.load_data_from_mem(data)
  File "${path}/GraphBin/src/labelpropagation/labelprop.py", line 99, in load_data_from_mem
    self.process_data_line(line)
  File "${path}/apps/GraphBin/src/labelpropagation/labelprop.py", line 121, in process_data_line
    raise Exception("Coundn't parse vertex from line")
Exception: Coundn't parse vertex from line
Use nox for the test suite
Convert to use click instead of argparse to parse the input arguments.
Hello!
Thanks for your research. I found that this work does not consider the weight of the connection between two contigs. I wonder whether that weight has an effect on the final clustering result. Also, can weighted connections between contigs be generated with the scripts from this work?
Thanks!
https://graphbin.readthedocs.io/en/latest/usage/
The link on how to convert fastg to gfa files on the page above has disappeared.
I would like to know how to convert fastg to gfa.
Thank you
Hi!
I'm testing GraphBin with my data and I'm unable to use it with a MEGAHIT graph.
2020-07-01 19:14:50,429 - INFO - Welcome to GraphBin: Refined Binning of Metagenomic Contigs using Assembly Graphs.
2020-07-01 19:14:50,429 - INFO - This version of GraphBin makes use of the assembly graph produced by MEGAHIT which is based on the de Bruijn graph approach.
2020-07-01 19:14:50,429 - INFO - Assembly graph file: ../assembly/assembly.graph.gfa
2020-07-01 19:14:50,429 - INFO - Existing binning output file: ../metabat_bins.csv
2020-07-01 19:14:50,429 - INFO - Final binning output file: ../graphbin_result/
2020-07-01 19:14:50,430 - INFO - Maximum number of iterations: 100
2020-07-01 19:14:50,430 - INFO - Difference threshold: 0.1
2020-07-01 19:14:50,430 - INFO - GraphBin started
2020-07-01 19:14:50,464 - INFO - Number of bins available in the initial binning result: 26
2020-07-01 19:14:50,464 - INFO - Constructing the assembly graph
2020-07-01 19:14:59,047 - INFO - Total number of contigs available: 0
2020-07-01 19:14:59,177 - INFO - Total number of edges in the assembly graph: 0
2020-07-01 19:14:59,178 - INFO - Obtaining the initial binning result
2020-07-01 19:14:59,179 - ERROR - Please make sure that you have provided the correct assembler type and the correct path to the binning result file in the correct format.
2020-07-01 19:14:59,179 - INFO - Exiting GraphBin... Bye...!
The graph was generated with megahit_toolkit contig2fastg, however the format is different from the examples in this repository:
>NODE_1_length_302_cov_2.0000_ID_1;
GTGGACCTCTCAGCGGTCATTCACGAAGAAACCCAGGATGACCTCCATCGCCGCCGACGGCGTTTCGTACGCACGCCAGCAGTCGGATTTCGATCTGTACCGCCGTGGAAGCACGTGGTACCTGGTGGAGAACGGCGTCTGGTTCCGCTCCGATTCGTGGAAGGGCCCTTTCGTGTCGATCCGCGCGAAGGATGTTCCGAGGGCCATCTGGAGCATCCCGCCGGCCTACCGACGCCACTGGGTTCCAGCCGTTCGCTAGACGAGCGGGGTCCCTGGGCGCCGGGGCTGTATAGCGCCTCGGG
>NODE_1_length_302_cov_2.0000_ID_1';
CCCGAGGCGCTATACAGCCCCGGCGCCCAGGGACCCCGCTCGTCTAGCGAACGGCTGGAACCCAGTGGCGTCGGTAGGCCGGCGGGATGCTCCAGATGGCCCTCGGAACATCCTTCGCGCGGATCGACACGAAAGGGCCCTTCCACGAATCGGAGCGGAACCAGACGCCGTTCTCCACCAGGTACCACGTGCTTCCACGGCGGTACAGATCGAAATCCGACTGCTGGCGTGCGTACGAAACGCCGTCGGCGGCGATGGAGGTCATCCTGGGTTTCTTCGTGAATGACCGCTGAGAGGTCCAC
>NODE_2_length_305_cov_1.0000_ID_3;
GTGCCGCCGCCGCCGAAGAAGATGCCACTGACTACGGCGTTCCAGCCGCTGATTCGGGCCAGCTTGAACCGCACGTTTCCGGCCACGTTCCAGATCAGGTATGCACTCTTCCGTGGCAATAAAGCGCATGGCGCGCAAAGCCCGAACTTTATGCAGCGAGTTTCCCTCTTTAATCAGCTCCCCTAAATTTTCTTGCAGGGCCGTCCGTTCTTGGGCAATTTTTGCGGGCGGAATTTGGCCGGCGGCTTCGGCGGCTATAAAGAGCGCGCGCCGGCGCGCCAGCTTTTCGCCGTCGCTTTTTAGGC
>NODE_2_length_305_cov_1.0000_ID_3';
GCCTAAAAAGCGACGGCGAAAAGCTGGCGCGCCGGCGCGCGCTCTTTATAGCCGCCGAAGCCGCCGGCCAAATTCCGCCCGCAAAAATTGCCCAAGAACGGACGGCCCTGCAAGAAAATTTAGGGGAGCTGATTAAAGAGGGAAACTCGCTGCATAAAGTTCGGGCTTTGCGCGCCATGCGCTTTATTGCCACGGAAGAGTGCATACCTGATCTGGAACGTGGCCGGAAACGTGCGGTTCAAGCTGGCCCGAATCAGCGGCTGGAACGCCGTAGTCAGTGGCATCTTCTTCGGCGGCGGCGGCAC
(...)
Hello,
Is there a way to restart Graphbin from a checkpoint if something fails? I had a script running for 4 days that failed due to a node issue and I'd like not to have to wait that long again.
Please add this feature if it currently does not exist. It would be very helpful.
Thank you,
Taruna
Add Yu Lin as a contributor
Integrate coverage information from co-assemblies to make use of differential coverage across multiple samples.
Set up a test suite using pytest to test GraphBin commands
Hi!
I've noticed that prepResult.py doesn't support .fna files, which is a pretty common extension for bins. It'd be cool if support for this extension was added.
Also, I noticed that subprocess is not being imported into the script, causing a "NameError: name 'subprocess' is not defined" error.
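As a rough illustration of the request, the sketch below collects bin files by a set of accepted extensions and builds the contig-to-bin table in the `contig,bin` layout shown elsewhere in this thread. It is not the actual prepResult.py; the function names and the extension set are assumptions made for this example.

```python
import csv
from pathlib import Path

# Assumed set of accepted extensions; extend as needed.
FASTA_EXTENSIONS = {".fasta", ".fa", ".fna"}


def contigs_in_fasta(path):
    """Yield the contig identifier from every header line of a FASTA file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    # Keep only the identifier, dropping any description.
                    yield header.split()[0]


def build_bin_table(bin_dir):
    """Return (contig, bin_id) rows for every FASTA-like file in bin_dir."""
    bin_files = sorted(
        p for p in Path(bin_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS
    )
    rows = []
    for bin_id, path in enumerate(bin_files, start=1):
        for contig in contigs_in_fasta(path):
            rows.append((contig, bin_id))
    return rows


def write_bin_table(rows, out_csv):
    """Write the rows as a two-column CSV: contig name, bin id."""
    with open(out_csv, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

Matching on a set of suffixes means another extension can be supported by adding one entry. Files with other suffixes in the same folder are ignored.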
Hey all,
Thanks for a great tool. I was wondering if you have any plans or work in the pipeline to be able to use this for long-read assemblies.
Thanks again!
Change software license to BSD-3.
Hello,
I ran SPAdes assemblies and, before binning with MaxBin2, I renamed the assembly contigs with simple deflines (eg >c_0000001, >c_0000002, etc.). All the bins thus have the new simpler contig names. To run GraphBin, I replaced all the contig names in the original SPAdes 'contigs.paths' file with the corresponding renamed deflines. The bin mapping file also uses the new contig names.
I've modified all the input files with the renamed contig deflines, but GraphBin still seems to think the contigs.paths file does not exist. Does it require contigs to have the standard SPAdes name formats if the assembler input is --spades? My full command is below.
graphbin --assembler spades --contigs contigs-renamed.fasta --graph assembly_graph_with_scaffolds.gfa --paths contigs-renamed.paths --binned MaxBin2_graphbin_map.csv --output graphbin
thanks,
Nastassia
Add args.paths validation for Flye input to check that assembly_info.txt is provided.
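A check along these lines might look like the sketch below. The function name `validate_paths` and the message wording are hypothetical. The assembler-to-file mapping is inferred from this thread (contigs.paths for SPAdes, assembly_info.txt for Flye) rather than copied from GraphBin's code.

```python
import os

# Assumed mapping, inferred from the issues in this thread.
REQUIRED_PATHS_FILE = {
    "spades": "contigs.paths",
    "flye": "assembly_info.txt",
}


def validate_paths(assembler, paths):
    """Return an error message if the --paths value is unusable, else None.

    Hypothetical helper; not GraphBin's actual validation code.
    """
    expected = REQUIRED_PATHS_FILE.get(assembler.lower())
    if expected is None:
        # This assembler is assumed not to need a paths file.
        return None
    if not paths:
        return (
            f"Please make sure to provide the path to the {expected} file "
            f"(required for {assembler} assemblies)."
        )
    if not os.path.isfile(paths):
        return f"Cannot find '{paths}'; expected the {expected} file."
    return None
```

Naming the expected file in the message would address the confusion reported above, where a Flye run was told to supply contigs.paths.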
Hi!
I was wondering if there is any threads option for GraphBin. The help page for the command did not mention any such option for the tool and wanted to know if the tool automatically picks up that information?
Also, if the tool is single-threaded, is it possible to explore a multi-threaded version of the tool for future updates?
Please depend on the igraph instead of python-igraph package on PyPI. See igraph/python-igraph#699 for an explanation.
I believe this requires changes only in requirements.txt and pyproject.toml here:
https://github.com/search?q=repo%3Ametagentools%2FGraphBin%20python-igraph&type=code
Note that on conda-forge, the name stays python-igraph!
Hi,
thank you for providing this package. I'm excited to use it.
Would you consider adding it to a Conda repository such as Bioconda? I believe this would improve the installation process and make it more accessible for users.
I'd be glad to help drafting a recipe so the package can be added to Bioconda, if you agree.
Best wishes,
Vini
(graphbin_env) jespinozlt2-osx:GraphBin jespinoz$ python graphbin --graph ~/assembly_graph_with_scaffolds.gfa --binned ~/scaffolds_to_bins.csv --output graphin_output --paths ~/scaffolds.paths --assembler "spades"
2021-03-27 13:24:53,962 - INFO - Welcome to GraphBin: Refined Binning of Metagenomic Contigs using Assembly Graphs.
2021-03-27 13:24:53,962 - INFO - This version of GraphBin makes use of the assembly graph produced by SPAdes which is based on the de Bruijn graph approach.
2021-03-27 13:24:53,962 - INFO - Input arguments:
2021-03-27 13:24:53,962 - INFO - Assembly graph file: /Users/jespinoz/assembly_graph_with_scaffolds.gfa
2021-03-27 13:24:53,962 - INFO - Contig paths file: /Users/jespinoz/scaffolds.paths
2021-03-27 13:24:53,962 - INFO - Existing binning output file: /Users/jespinoz/binning/scaffolds_to_bins.csv
2021-03-27 13:24:53,962 - INFO - Final binning output file: graphin_output/
2021-03-27 13:24:53,962 - INFO - Maximum number of iterations: 100
2021-03-27 13:24:53,962 - INFO - Difference threshold: 0.1
2021-03-27 13:24:53,962 - INFO - GraphBin started
2021-03-27 13:24:53,964 - INFO - Number of bins available in the initial binning result: 2
2021-03-27 13:24:53,964 - INFO - Constructing the assembly graph
2021-03-27 13:24:54,173 - INFO - Total number of contigs available: 25728
2021-03-27 13:24:59,473 - INFO - Total number of edges in the assembly graph: 1373
2021-03-27 13:24:59,473 - INFO - Obtaining the initial binning result
2021-03-27 13:24:59,473 - ERROR - Please make sure that you have provided the correct assembler type and the correct path to the binning result file in the correct format.
2021-03-27 13:24:59,473 - INFO - Exiting GraphBin... Bye...!
I can't figure out what is going wrong with my files. I used metaspades and MaxBin2 for my binning.
Here is my version:
(graphbin_env) jespinozlt2-osx:GraphBin jespinoz$ python graphbin --version
GraphBin version 1.3
Also, "SPAdes" isn't an accepted argument.
Any help would be greatly appreciated.
Hello, thanks for the exciting tool.
I would like to try out the tool but I am not sure about the requested binning file.
I tried using the prepResult.py script but I suspect the output is wrong. As input, I used the folder of the SPAdes output (metasample1/metaSpades). I ran it as following
python prepResult.py --binned 'metasample1/metaSpades' --assembler SPAdes --output 'metasample1/metaSpades/z_graphbin'
The following message was sent to stdout:
Formatting initial binning results
Writing initial binning results to output file
Formatted initial binning results can be found at /metasample1/metaSpades/z_graphbin/initial_contig_bins.csv
Bin IDs and corresponding names of fasta files can be found at metasample1/metaSpades/z_graphbin/bin_ids.csv
Thank you for using prepResult for GraphBin!
The file bin_ids.csv has this:
before_rr.fasta,1
contigs.fasta,2
first_pe_contigs.fasta,3
scaffolds.fasta,4
While the file initial_contig_bins.csv has this:
NODE_1,1
NODE_2,1
NODE_3,1
...
NODE_452809,4
NODE_452810,4
If I understood correctly, does this mean that all contigs belong to 4 bins?
Also, if this is correct, which .gfa file should I use as input? SPAdes produces assembly_graph_after_simplification.gfa, assembly_graph_with_scaffolds.gfa, and strain_graph.gfa. I tried using all of them with contigs.paths and initial_contig_bins.csv and obtained the same error:
ERROR - Please make sure that you have provided the correct assembler type and the correct path to the binning result file in the correct format.
Sorry for the long post I am new to whole metagenomics, trying to catch up.
Hello, I want to ask how ARI is calculated in GraphBin, because the number of contigs labelled by different binning tools is different. For example, MetaBAT has very high precision, but the number of contigs it bins is very small. How do you account for the different numbers of binned contigs across tools when calculating ARI? If ARI is calculated only over the contigs that were actually binned, the ARI of MetaBAT should be very high, shouldn't it?
Another question: when will MetaCoAG be officially released? Can I cite your method in my paper? It's a good tool.
Thank you very much!
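To make the question concrete, here is a minimal sketch of the standard Adjusted Rand Index restricted to the contigs a tool actually binned, reported together with the fraction of ground-truth contigs it covers. This is not necessarily how the GraphBin authors computed their numbers; the function name and the dictionary inputs are assumptions for illustration.

```python
from collections import Counter
from math import comb


def ari_on_binned(truth, predicted):
    """Return (ARI over contigs present in both mappings, fraction of truth covered).

    truth and predicted map contig name -> label. Unbinned contigs are simply
    absent from predicted, so they do not enter the ARI.
    """
    shared = [contig for contig in predicted if contig in truth]
    n = len(shared)
    coverage = n / len(truth) if truth else 0.0

    # Contingency counts: cells, row sums (truth) and column sums (predicted).
    cells = Counter((truth[c], predicted[c]) for c in shared)
    rows = Counter(truth[c] for c in shared)
    cols = Counter(predicted[c] for c in shared)

    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        # Fewer than two shared contigs: treated as perfect agreement by convention.
        return 1.0, coverage

    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case (e.g. a single cluster on both sides).
        return 1.0, coverage
    return (sum_cells - expected) / (maximum - expected), coverage


if __name__ == "__main__":
    truth = {"a": 1, "b": 1, "c": 2, "d": 2}
    conservative = {"a": "x", "b": "x", "c": "y"}  # leaves contig "d" unbinned
    print(ari_on_binned(truth, conservative))
```

In the usage example the tool bins three of the four contigs, all correctly, so the restricted ARI is perfect while the coverage is 0.75. Reporting both numbers side by side is one way to keep a conservative binner from looking better than it is.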
Is it possible to add a matrix of binning methods? For example in the test data you have maxbin2, metabat2, etc. Can all of these be used at once?
import subprocess


def exec_command(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE):
    """executes shell command and returns stdout if completes exit code 0

    Parameters
    ----------
    cmnd : str
        shell command to be executed
    stdout, stderr : streams
        Default value (PIPE) intercepts process output, setting to None
        blocks this."""
    proc = subprocess.Popen(cmnd, shell=True, stdout=stdout, stderr=stderr)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"FAILED: {cmnd}\n{err}")
    return out.decode("utf8") if out is not None else None
Then you can write test functions as
import pytest


def test_some_command():
    cmd = "graphbin <args>"
    exec_command(cmd)
indicate developer install via flit
import pytest


@pytest.fixture(scope="session")
def tmp_dir(tmpdir_factory):
    return tmpdir_factory.mktemp("sqlitedb")


@pytest.fixture(autouse=True)
def workingdir(tmp_dir, monkeypatch):
    # this sets the working directory for all tests in this module
    # as a tmp dir
    monkeypatch.chdir(tmp_dir)


def test_assert_something(tmp_dir):
    # this will be running within workingdir auto-magically thanks to pytest
    # run commands so that they write output to tmp_dir
    pass  # placeholder so the example is valid Python