younglululu / cocacola

COCACOLA: a general framework for binning contigs in metagenomic studies incorporating read COverage, CorrelAtion, sequence COmposition and paired-end read LinkAge

License: GNU General Public License v2.0

Perl 4.43% MATLAB 77.33% C++ 18.24%

cocacola's Issues

Listing of dependencies

As I understand it, MATLAB (which requires purchasing a license) is a required dependency, right? Could this be listed in your installation instructions/documentation?

Abundance profile format contradicts given example

Your script states that abundance profiles should be in a format where each line corresponds to a contig and each column corresponds to a sample.
As an example, you show how to call CONCOCT's gen_input_table.py to create a coverage profile in tabular format.

However, in your example this script is simply run without the subsequent parsing step that CONCOCT requires. The profile table would therefore include an extra column near the beginning of the table, containing each contig's length. This would contradict your format specification above. In the CONCOCT example workflow, this "length" column is removed from the table using cut -f1,3- and the result is written to the final input file "input-tableR.tsv".

Based on the naming scheme used in your example and on the format specification above, I assume that the raw output of CONCOCT's gen_input_table.py (without subsequent processing/parsing) would NOT be correct input for COCACOLA?
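
For anyone without cut at hand, the column-removal step can be sketched in Python. This is a minimal sketch, assuming the table is tab-separated with the contig name in the first column and the contig length in the second (an assumption about the gen_input_table.py output, based on the description in this issue). The function name and file names are hypothetical.

```python
def drop_length_column(in_path, out_path):
    """Copy a tab-separated table, omitting its second column.

    Assumption: column 1 is the contig name and column 2 is the contig
    length, so dropping column 2 leaves one coverage column per sample.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if fields == [""]:
                continue  # skip blank lines
            # Keep field 1, then everything from field 3 onward.
            dst.write("\t".join(fields[:1] + fields[2:]) + "\n")


# Hypothetical usage:
# drop_length_column("input-table.tsv", "input-tableR.tsv")
```

Whether COCACOLA also expects the header row to be kept is not stated in this issue, so the sketch leaves it in place.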

please create a github repository for the python version, also

This would enable people to send pull requests for any bugfixes they may have found

Any options for speed-up

I tested a BAM file from the CAMI challenge with COCACOLA and it took about 450 hours.
I looked for a multi-threading option but could not find one.
Could you suggest any option or pre-processing step to speed it up?

Log File?

I am running COCACOLA on a cluster. Binning approximately 4 million contigs has been running for more than 18 hours now.

However, the output log shows absolutely nothing new. I could see the outputs from the FragGeneScan and HMM runs within an hour of submitting the job, but nothing since then.

The job seems to keep running. For context, MetaBAT in --superspecific mode binned this dataset in approx. 15 h.

Is there a way to output some progress information?

Error: KeyError: 'contig1'

Hi,

I'm trying to run cocacola.py, but I get an error just after running it.

python /services/tools/cocacola-python/20170305/cocacola.py --contig_file contig.fa --abundance_profiles depth.cov --composition_profiles kmer_4_tmp.csv --edge_list linkage.tsv --output result_cocacola.csv

With Python 2, I get this error:
Traceback (most recent call last):
  File "/services/tools/cocacola-python/20170305/cocacola.py", line 180, in <module>
    edgelist = pd.DataFrame(np.genfromtxt(args.edge_list, dtype=str, converters={0: lambda s: mapObj[s], 1: lambda s: mapObj[s], 2: lambda s: float(s)})).as_matrix()
  File "/services/tools/anaconda-2.2.0/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1778, in genfromtxt
    missing_values=missing_values[i],)
  File "/services/tools/anaconda-2.2.0/lib/python2.7/site-packages/numpy/lib/_iotools.py", line 841, in update
    tester = func(testing_value or b'1')
  File "/services/tools/cocacola-python/20170305/cocacola.py", line 180, in <lambda>
    edgelist = pd.DataFrame(np.genfromtxt(args.edge_list, dtype=str, converters={0: lambda s: mapObj[s], 1: lambda s: mapObj[s], 2: lambda s: float(s)})).as_matrix()
KeyError: 'contig1'

With Python 3, I get the same error:
Traceback (most recent call last):
  File "/services/tools/cocacola-python/20170305/cocacola.py", line 180, in <module>
    edgelist = pd.DataFrame(np.genfromtxt(args.edge_list, dtype=str, converters={0: lambda s: mapObj[s], 1: lambda s: mapObj[s], 2: lambda s: float(s)})).as_matrix()
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/numpy/lib/npyio.py", line 1778, in genfromtxt
    missing_values=missing_values[i],)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/numpy/lib/_iotools.py", line 841, in update
    tester = func(testing_value or b'1')
  File "/services/tools/cocacola-python/20170305/cocacola.py", line 180, in <lambda>
    edgelist = pd.DataFrame(np.genfromtxt(args.edge_list, dtype=str, converters={0: lambda s: mapObj[s], 1: lambda s: mapObj[s], 2: lambda s: float(s)})).as_matrix()
KeyError: b'contig1'

Any idea about what is going on?

Thanks in advance.
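
A possible explanation, offered as a guess rather than a diagnosis: the converters in the traceback look up each name from the first two edge-list columns in mapObj, so the KeyError means 'contig1' is not a key there. This can happen if the edge list still carries a header row (for example one starting with contig1) or uses names that differ from those in the contig file. A minimal sketch for checking this before running cocacola.py follows. It assumes mapObj is keyed by the FASTA header up to the first whitespace, which is an assumption about the script, and both function names are hypothetical.

```python
def read_fasta_ids(fasta_path):
    """Collect sequence names: header text up to the first whitespace.

    Assumption: this matches how the contig names are keyed in mapObj.
    """
    ids = set()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def unknown_edge_names(edge_path, known_ids):
    """Return (line_number, name) pairs for edge-list names not in known_ids."""
    problems = []
    with open(edge_path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # blank or malformed line; not checked here
            for name in fields[:2]:
                if name not in known_ids:
                    problems.append((lineno, name))
    return problems


# Hypothetical usage with the files from the command above:
# bad = unknown_edge_names("linkage.tsv", read_fasta_ids("contig.fa"))
# for lineno, name in bad[:10]:
#     print(lineno, name)
```

If the only hits are on line 1, removing the header row from the edge list would be the first thing to try.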

FragGeneScan failed

Hi!

I know this might be a long shot since this tool is somewhat old(er), but I ran into a specific problem when running cocacola. I am able to run it successfully on the test dataset, but when running it on my own, it always fails at the FragGeneScan step. I put the output below. Any idea on why it might be failing on my dataset, but not your default test dataset?

thanks in advance!
maureen

(cocacola_env) [mberg@ln006 COCACOLA-python]$ python cocacola.py --contig_file /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna --abundance_profiles /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_depth_cocacola.tsv --composition_profiles /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_cocacola_kmer_profile.csv --output /clusterfs/jgi/scratch/science/metagen/mberg/TBL_MDA_cocacola/result_MDA_cocacola.csv

Input args: contig_file: /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna
abundance_profiles: /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_depth_cocacola.tsv
composition_profiles: /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_cocacola_kmer_profile.csv
edge_list: Not Available
clusters: Auto
cocacola.py:157: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  covMat = pd.read_csv(args.abundance_profiles,sep='\t',usecols=range(1, covHeader.shape[1])).as_matrix()
cocacola.py:162: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  shuffled_compositMat = pd.read_csv(args.composition_profiles,sep=',',usecols=range(1, compositHeader.shape[1])).as_matrix()
No cluster number specified. Now automatically detect...
exec cmd: /clusterfs/jgi/groups/science/homes/mberg/COCACOLA-python/auxiliary/FragGeneScan1.19/run_FragGeneScan.pl -genome=/clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna -out=/clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna.frag -complete=0 -train=complete -thread=10 1>/clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna.frag.out 2>/clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna.frag.err
FragGeneScan failed! Not exist: /clusterfs/jgi/scratch/science/metagen/mberg/TBL_2017_BAM_bbmap_files/TBL_MDA_2017_assembly_A_for_N.fna.frag.faa
(626, 5)
k:5 silhouette:0.24014230713806714 elapsed time:0.069806098938
(626, 10)
k:10 silhouette:0.2049718493831629 elapsed time:0.124846935272 bestk:10 silVal:0.24014230713806714
[t=0] Starting the optimization...
[t=0.02] Finished solving H
[t=0.03] Finished solving W
[t=0.03] Finished iteration 0. DiffVal=37.8457
[t=0.05] Finished solving H
[t=0.06] Finished solving W
[t=0.06] Finished iteration 1. DiffVal=18.1606
[t=0.08] Finished solving H
[t=0.09] Finished solving W
[t=0.09] Finished iteration 2. DiffVal=11.7705
[t=0.11] Finished solving H
[t=0.12] Finished solving W
[t=0.12] Finished iteration 3. DiffVal=7.9393
[t=0.14] Finished solving H
[t=0.15] Finished solving W
[t=0.15] Finished iteration 4. DiffVal=9.5812
[t=0.17] Finished solving H
[t=0.18] Finished solving W
[t=0.19] Finished iteration 5. DiffVal=9.5700
[t=0.20] Finished solving H
[t=0.22] Finished solving W
[t=0.22] Finished iteration 6. DiffVal=5.2064
[t=0.23] Finished solving H
[t=0.25] Finished solving W
[t=0.25] Finished iteration 7. DiffVal=6.5615
[t=0.26] Finished solving H
[t=0.28] Finished solving W
[t=0.28] Finished iteration 8. DiffVal=4.0475
[t=0.29] Finished solving H
[t=0.31] Finished solving W
[t=0.31] Finished iteration 9. DiffVal=3.6770
[t=0.32] Finished solving H
[t=0.34] Finished solving W
[t=0.34] Finished iteration 10. DiffVal=3.7477
[t=0.36] Finished solving H
[t=0.37] Finished solving W
[t=0.37] Finished iteration 11. DiffVal=2.9558
[t=0.39] Finished solving H
[t=0.40] Finished solving W
[t=0.40] Finished iteration 12. DiffVal=3.8752
[t=0.42] Finished solving H
[t=0.43] Finished solving W
[t=0.43] Finished iteration 13. DiffVal=4.2484
[t=0.45] Finished solving H
[t=0.46] Finished solving W
[t=0.46] Finished iteration 14. DiffVal=2.4847
[t=0.48] Finished solving H
[t=0.49] Finished solving W
[t=0.49] Finished iteration 15. DiffVal=2.4846
[t=0.51] Finished solving H
[t=0.52] Finished solving W
[t=0.52] Finished iteration 16. DiffVal=2.0670
[t=0.54] Finished solving H
[t=0.55] Finished solving W
[t=0.55] Finished iteration 17. DiffVal=2.7411
[t=0.57] Finished solving H
[t=0.58] Finished solving W
[t=0.58] Finished iteration 18. DiffVal=2.6665
[t=0.60] Finished solving H
[t=0.61] Finished solving W
[t=0.61] Finished iteration 19. DiffVal=1.2217

FragGeneScan takes forever

FragGeneScan seems to take forever. I see that FragGeneScan v1.19 is being called. Can I replace it with v1.30, which is supposed to be much faster?

Is there a reason why the legacy version is still being used?

Best,

Assertion Error

I am using the Python version with a contig file and abundance and composition profiles, as suggested. I get the following error:

Traceback (most recent call last):
  File "COCACOLA-python/cocacola.py", line 167, in <module>
    assert (shuffled_namelist[contigIdx] in mapObj)
AssertionError

How to calculate weights for edge_lists from paired-end linkage or co-alignment info?

The format for edge lists seems pretty straightforward: "contigA<TAB>contigB<TAB>weight"
But how exactly should the weights be determined, ideally?

E.g. if I create a linkage table based on paired-end reads using the CONCOCT scripts, I get a table with the following columns PER SAMPLE:

contig1<TAB>contig2<TAB>nr_links_inward_n<TAB>nr_links_outward_n<TAB>nr_links_inline_n<TAB>nr_links_inward_or_outward_n<TAB>read_count_contig1_n<TAB>read_count_contig2_n[<TAB>...]
(where n is the respective sample name)

Would the weight here be:
- the total sum of linking reads across all samples?
- the ratio between linking reads and total reads across all samples?
- simply the number of samples supporting the linkage?

And how would you calculate the weight for co-alignment-based linkages? Simply by using the alignment score?

Do you perhaps have an example workflow/script for this?
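
To make the candidates concrete, the first and third options from the list above can be computed as follows. This is only a sketch of those candidate definitions, not a statement of how COCACOLA expects weights to be derived. It assumes the linkage table has a header row using the column-name prefixes quoted above, and it counts inward, outward and inline links as "linking reads" while ignoring the inward_or_outward column; both choices are assumptions, and the function name is hypothetical.

```python
import csv
from collections import defaultdict

# Column-name prefixes taken from the table layout quoted in this issue.
LINK_PREFIXES = ("nr_links_inward_", "nr_links_outward_", "nr_links_inline_")
SKIP_PREFIX = "nr_links_inward_or_outward_"  # would otherwise match "inward_"


def candidate_weights(linkage_path):
    """Yield (contig1, contig2, total_links, n_supporting_samples) per row."""
    with open(linkage_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        # Map column index -> sample name for the link-count columns we sum.
        link_cols = {}
        for idx, name in enumerate(header):
            if name.startswith(SKIP_PREFIX):
                continue
            for prefix in LINK_PREFIXES:
                if name.startswith(prefix):
                    link_cols[idx] = name[len(prefix):]
        for row in reader:
            if not row:
                continue  # skip blank lines
            per_sample = defaultdict(int)
            for idx, sample in link_cols.items():
                per_sample[sample] += int(row[idx])
            total = sum(per_sample.values())
            support = sum(1 for count in per_sample.values() if count > 0)
            yield row[0], row[1], total, support


# Hypothetical usage: write "contigA<TAB>contigB<TAB>weight" using total links.
# with open("edges.tsv", "w") as out:
#     for c1, c2, total, support in candidate_weights("linkage.tsv"):
#         if total > 0:
#             out.write("%s\t%s\t%d\n" % (c1, c2, total))
```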

Bug in python-version cocacola.py: "os.getcwd()" is not the correct method to determine the script's directory

The Python version of cocacola comes with external tools (such as FragGeneScan and HMMER) in a subdirectory "auxiliary/" within the script's main directory.

When calling these external tools, you seem to determine the script's directory using os.getcwd().
Example for FragGeneScan (line 197 in cocacola.py):
fragScanURL = os.path.join(os.getcwd(),'auxiliary','FragGeneScan1.19','run_FragGeneScan.pl'); os.system("chmod 777 " + fragScanURL);

However, os.getcwd() returns only the current working directory, not the directory in which the script resides.
Therefore all of these external commands fail unless cocacola is run from within the script's own directory. This may go unnoticed, because cocacola continues even after these external calls fail, albeit without precise information on the expected number of clusters.

The correct way to determine the script's directory in Python would be os.path.dirname(os.path.realpath(__file__))

All 4 occurrences of os.getcwd() in the script that aim at finding the path to auxiliary tools should be changed to os.path.dirname(os.path.realpath(__file__))
(this concerns the variables fragScanURL, hmmExeURL, markerExeURL & markerURL)

(Also, failure to find the expected external tools should perhaps raise an exception in the future.)

EDIT: Another downstream problem: the script sets the permissions for run_FragGeneScan.pl in order to make it executable, but this does not seem to be enough. The supplementary scripts "FragGeneScan", "FGS_gff.py" and "post_process.pl" need to be made executable too, otherwise the external call fails (again unnoticed, as this does not raise an exception).
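
The proposed change could look roughly like this. It is a minimal sketch, not a patch against the actual cocacola.py: the helper names auxiliary_path and make_executable are hypothetical, and the sketch also raises an error when a tool is missing, as suggested above.

```python
import os
import stat


def auxiliary_path(script_file, *parts):
    """Resolve a path relative to the directory that contains script_file.

    Pass __file__ from the calling script, so the result does not depend
    on the current working directory.
    """
    script_dir = os.path.dirname(os.path.realpath(script_file))
    path = os.path.join(script_dir, *parts)
    if not os.path.isfile(path):
        raise FileNotFoundError("auxiliary tool not found: " + path)
    return path


def make_executable(path):
    """Add execute permission for user, group and others, keeping other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Hypothetical usage inside cocacola.py:
# fragScanURL = auxiliary_path(__file__, 'auxiliary', 'FragGeneScan1.19',
#                              'run_FragGeneScan.pl')
# make_executable(fragScanURL)
```

Adding only the execute bits is a narrower alternative to chmod 777, which also grants write access to everyone.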
