merenlab / anvio

An analysis and visualization platform for 'omics data

Home Page: http://merenlab.org/software/anvio

License: GNU General Public License v3.0

Shell 1.74% Python 84.81% HTML 2.51% CSS 0.85% JavaScript 8.91% R 1.03% Dockerfile 0.09% Makefile 0.06%
metagenomics metatranscriptomics pangenomics comparative-genomics science visualization bioinformatics phylogenomics population-genetics python

anvio's Introduction

Releases

The GitHub releases page lists all stable releases of anvi'o.

Installation and tutorials

The anvi'o project page gives access to installation manuals, user tutorials, and other sweets.

Help on anvi'o programs and artifacts

The anvi'o help pages describe individual anvi'o programs as well as artifacts they consume or produce.

Coding style considerations

Please see relevant discussions.

Community chat

Click this link to join the anvi'o Discord channel.

Others on anvi'o

Read our user testimonials.

anvio's People

Contributors

blankenberg, ctb, dogancankilment, eburgoswisc, efogarty11, ekiefl, farukuzun, floriantrigodet, ge0rges, gkmngrgn, gokmen, isaacfink21, ivagljiva, jessica-pan, jessika-fuessel, kekananen, mahmoudyousef98, matthewlawrenceklein, meren, metehaansever, mooreryan, mschecht, ozcan, qclayssen, semiller10, shaiberalon, srinidhi202, telatin, vinisalazar, watsonar

anvio's Issues

annotation.py needs love.

Tables with the annotation_ prefix should use orfs_ or functional_ as a prefix instead. It will clear up a great deal of confusion.

Also, papi-gen-annotation should behave identically to papi-populate-*.

Colors change between views

From one view (mean_coverage):

image

To another (standard_dev):

image

Thank you Ozcan :) Let me know when you feel overwhelmed! :)

PhymmBL

Anyone who wants to use PhymmBL annotation will need to add this line to scoreReads.pl script that comes with the PhymmBL distribution:

use Cwd; use File::Basename; chdir(dirname($ARGV[0])) or die "cannot change working directory: $!\n";

I don't even know how I am going to check this other than putting it in the documentation.

check contig names for silly characters

check_contig_names is in utils. It needs to be filled in and called from a reasonable place.

Sometimes BAM files contain contigs with characters that shouldn't be there. This needs to be pointed out before profiling.
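A minimal sketch of what check_contig_names could look like; the accepted character set and the error style are assumptions, not the actual utility:

```python
import re

def check_contig_names(contig_names):
    """Reject contig names containing characters that tend to break
    downstream tools. The whitelist (letters, digits, '_', '.', '-')
    is an assumption for this sketch.
    """
    allowed = re.compile(r'^[A-Za-z0-9_.-]+$')
    bad = [name for name in contig_names if not allowed.match(name)]
    if bad:
        raise ValueError("%d contig name(s) contain characters that "
                         "shouldn't be there, e.g. '%s'" % (len(bad), bad[0]))
    return True
```

Calling this once before profiling would surface problematic BAM headers early instead of failing deep inside an analysis.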

Timeseries SVG reporting problem

When SVG files for projects created by merging multiple profiles are exported from the interface, the SVG contains no report about the layers. If you can't reproduce this error, let me know and I'll put a sample dataset somewhere :)

interactive-binning += "Title"

When more than one project is open, it becomes impossible for the user to keep track of which project they are working on in which window.

With commit fe51b81 I added a data hook; it is now possible to get the project's name from within JavaScript:

image

It would be delightful to display this somewhere prominent in the upper left corner of the tree.

MERGE_RUNS

Hey awesome dudes! I am trying to merge runs and I have encountered this error. I successfully merged runs for other groups of samples and this set is no different other than the number of samples being processed.
Thanks for the help

jvineis@rocket:papi-merge-multiple-runs 204_*/RUNINFO.cPickle -o MERGED_RUNS
output_dir .......................................................: /automounts/bpcstorage01/production/users/jvineis/HMP_temp/204/BAM/MERGED_RUNS
num_runs_processed ...............................................: 5
num_splits_found .................................................: 2,926
contigs_total_length .............................................: 152,442
contigs_fasta ....................................................: /automounts/bpcstorage01/production/users/jvineis/HMP_temp/204/BAM/MERGED_RUNS/CONTIGS-CONSENSUS.fa
tnf_matrix .......................................................: /automounts/bpcstorage01/production/users/jvineis/HMP_temp/204/BAM/MERGED_RUNS/TETRANUCLEOTIDE-FREQ-MATRIX.txt
[05 Aug 14 10:13:31 Generating TNF tree] ... Traceback (most recent call last):
File "/groups/merenlab/PaPi/bin/papi-merge-multiple-runs", line 315, in
MultipleRuns(args).merge()
File "/groups/merenlab/PaPi/bin/papi-merge-multiple-runs", line 119, in merge
tnf_tree = self.generate_tnf_tree()
File "/groups/merenlab/PaPi/bin/papi-merge-multiple-runs", line 156, in generate_tnf_tree
PaPi.utils.get_newick_tree_data(self.run.info_dict['tnf_matrix'], newick_tree_file_path)
File "/groups/merenlab/PaPi/PaPi/utils.py", line 566, in get_newick_tree_data
normalized_vector = [p / denominator for p in vector]
ZeroDivisionError: float division by zero
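The crash above comes from normalizing a TNF vector whose entries sum to zero. A minimal sketch of a guard, assuming an all-zero vector can simply be passed through (the actual fix may prefer to warn about or drop such a split):

```python
def normalize_vector(vector):
    """Normalize a vector to sum to 1, guarding against the all-zero
    rows that trigger ZeroDivisionError in get_newick_tree_data.

    Sketch only: here a zero-sum vector is returned unchanged.
    """
    denominator = float(sum(vector))
    if denominator == 0:
        return list(vector)  # nothing to normalize
    return [p / denominator for p in vector]
```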

server ctrl^c

Output messages on the terminal for papi-interactive-binning must be clearer.

GC Content for merged

Change the way GC content is calculated for merged runs. The consensus needs to be taken into account.
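A sketch of the idea, assuming the consensus sequence is available as a plain string; ambiguous characters such as Ns are ignored so they do not distort the ratio:

```python
def gc_content(consensus_sequence):
    """Fraction of G/C among unambiguous bases of a consensus sequence.

    Sketch only: for merged runs this would be fed the consensus
    rather than the original reference contig.
    """
    seq = consensus_sequence.upper()
    gc = seq.count('G') + seq.count('C')
    at = seq.count('A') + seq.count('T')
    total = gc + at  # Ns and other ambiguous characters are ignored
    return gc / total if total else 0.0
```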

Consensus sequences for splits

Profiler should export consensus sequences for splits as well as contigs. If we have access to splits, tools like the phymmbl annotation script will be much less complex, and users will be able to use them on data that were not generated by PaPi.

Critical DB issue

Well, I am getting this error with large files; it does not happen with smaller ones.

This may be a db.commit() issue. Needs to be checked properly:

meren SSH://MBL /workspace/shared/tom/Infant-gut-FASTA-files $ papi-populate-search-table Infant-gut-assembly-1kb.fa Infant-gut-assembly-1kb.db -L 20000
Database .....................................: A new database, Infant-gut-assembly-1kb.db, has been created.
Split length .................................: 20000
HMM profiles .................................: 3 sources have been loaded: Dupont_et_al (111 genes), Campbell_et_al (139 genes), Wu_et_al (31 genes)

Finding ORFs in contigs
===============================================
Genes ........................................: /tmp/tmpqjiEz2/contigs.genes
Proteins .....................................: /tmp/tmpqjiEz2/contigs.proteins
Log file .....................................: /tmp/tmpqjiEz2/00_log.txt

HMM Profiling for Dupont_et_al
===============================================
Reference ....................................: Dupont et al, http://www.nature.com/ismej/journal/v6/n6/full/ismej2011189a.html
Pfam model ...................................: /groups/merenlab/PaPi/PaPi/data/hmm/Dupont_et_al/genes.hmm.gz
Number of genes ..............................: 111
Temporary work dir ...........................: /tmp/tmpYslaQa
HMM scan output ..............................: /tmp/tmpYslaQa/hmm.output
HMM scan hits ................................: /tmp/tmpYslaQa/hmm.hits
Log file .....................................: /tmp/tmpYslaQa/00_log.txt
Number of raw hits ...........................: 3,945

HMM Profiling for Campbell_et_al
===============================================
Reference ....................................: Campbell et al, http://www.pnas.org/content/110/14/5540.short
Pfam model ...................................: /groups/merenlab/PaPi/PaPi/data/hmm/Campbell_et_al/genes.hmm.gz
Number of genes ..............................: 139
Temporary work dir ...........................: /tmp/tmpI3mhZw
HMM scan output ..............................: /tmp/tmpI3mhZw/hmm.output
HMM scan hits ................................: /tmp/tmpI3mhZw/hmm.hits
Log file .....................................: /tmp/tmpI3mhZw/00_log.txt
Number of raw hits ...........................: 2,364

HMM Profiling for Wu_et_al
===============================================
Reference ....................................: Wu et al, http://genomebiology.com/2008/9/10/R151
Pfam model ...................................: /groups/merenlab/PaPi/PaPi/data/hmm/Wu_et_al/genes.hmm.gz
Number of genes ..............................: 31
Temporary work dir ...........................: /tmp/tmpz_f6JE
HMM scan output ..............................: /tmp/tmpz_f6JE/hmm.output
HMM scan hits ................................: /tmp/tmpz_f6JE/hmm.hits
Log file .....................................: /tmp/tmpz_f6JE/00_log.txt
Number of raw hits ...........................: 946
Traceback (most recent call last):
  File "/groups/merenlab/PaPi/bin/papi-populate-search-table", line 118, in <module>
    main(args)
  File "/groups/merenlab/PaPi/bin/papi-populate-search-table", line 78, in main
    g.populate_search_tables(annotation_db, sources)
  File "/groups/merenlab/PaPi/PaPi/annotation.py", line 179, in populate_search_tables
    search_tables.append(source, reference, kind_of_search, all_genes_searched_against, search_results_dict)
  File "/groups/merenlab/PaPi/PaPi/annotation.py", line 261, in append
    self.db.create_table(self.search_info_table, search_info_table_structure, search_info_table_types)
  File "/groups/merenlab/PaPi/PaPi/db.py", line 68, in create_table
    self._exec('''CREATE TABLE %s (%s)''' % (table_name, db_fields))
  File "/groups/merenlab/PaPi/PaPi/db.py", line 110, in _exec
    ret_val = self.cursor.execute(sql_query)
sqlite3.OperationalError: table search_info already exists
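One way to avoid the crash above is to make table creation idempotent; whether append() should instead detect the existing table and add rows, or fail with a clearer message, is a separate design decision. A minimal sketch:

```python
import sqlite3

def create_table_if_missing(conn, table_name, db_fields):
    """Create a table only if it does not already exist, sidestepping
    the 'table search_info already exists' OperationalError.

    Sketch only; the real db.create_table may want to inspect
    sqlite_master and decide explicitly whether to append or abort.
    """
    conn.execute('CREATE TABLE IF NOT EXISTS %s (%s)' % (table_name, db_fields))
```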

papi-profile: --contigs vs --splits

When papi-profile is used with a PROFILE.cPickle as input, the user may want to specify the splits to retain from the results by their split names (obtained through the web interface) instead of contig names. When --contigs is used with split names, it produces an error. Something must be done about this.

Selecting contigs from tree

It would be really great if we could remove/select nodes from the tree by clicking on the outer layers instead of on the leaves of the tree, which can be very dense and difficult to select.

Phylogram view

The circular tree causes performance problems when there are too many layers. To get around this, we should be able to offer a non-circular tree drawing option.

Merging SVG bugs

A little bug report for Ozcan.

I ran into this with infant 1kb. Here is the circular tree; that missing white piece at the end is:

image

This is a screenshot from the phylogram view. It seems group names do not correspond to the colors shown, and there are white ones even where there is a group assignment:

image

For instance, everything shown here is Group_28 (according to the mouse-over menu) :)

image

Best,

Work with relative paths..

At this point full paths are embedded in RUNINFO.cp and SUMMARY.cp files. The merge-multiple-runs and papi-interactive-binning scripts contain procedures to fix directories when these files are carried over from a different machine.

If PaPi worked with relative paths, life would be much easier.
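A sketch of the conversion, using a plain dict to stand in for a RUNINFO-style structure (the key names are illustrative only):

```python
import os

def make_paths_relative(runinfo_dict, base_dir):
    """Turn absolute paths stored in a RUNINFO-style dict into paths
    relative to the project directory, so the project can move
    between machines without path-fixing procedures.

    Sketch only; non-path values are passed through untouched.
    """
    return {key: (os.path.relpath(value, base_dir)
                  if isinstance(value, str) and os.path.isabs(value)
                  else value)
            for key, value in runinfo_dict.items()}
```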

SUMMARY.cp

RUNINFO.cp works with relative paths now, but the merger still puts full paths in SUMMARY.cp.

annotation.db issue

If papi-populate-genes-table has not been run on an annotation db, papi-interactive crashes.

mock data to test performance

Create a mock dataset to test performance. Now that we are measuring tree drawing time, it could be useful to mention an "expected" time for the test data to be drawn on a mid-level computer; any computer that takes much longer than that value would not be the best platform to run PaPi.

split length

Split length should be a standard layer for merged runs, just like GC content.

A fix to stop joe's complaints

My current installation of PaPi (the current version downloaded from github) is returning this error

jvineis@Joes-MacBook-Pro-2: papi-interactive-binning -r RUNINFO-mean_coverage.cp
Traceback (most recent call last):
File "/Applications/PaPi/bin/papi-interactive-binning", line 81, in
d = interactive.InputHandler(parser.parse_args())
File "/Applications/PaPi/PaPi/interactive.py", line 62, in init
self.load_from_runinfo_dict(args)
File "/Applications/PaPi/PaPi/interactive.py", line 216, in load_from_runinfo_dict
self.profile_db = PaPi.db.DB(self.P(self.runinfo['profile_db']), PaPi.profiler.version)
KeyError: 'profile_db'

metadata.py

Metadata.py is a mess. Profiler needs a lib similar to annotation.py where all database operations are handled.

Taxonomy - Metadata

TAXONOMY.txt may have more entries than necessary, as long as it contains every instance that appears in METADATA.txt. This will make it more efficient to reuse one TAXONOMY.txt across different merging operations.

search box to highlight matching contigs

A first in the history of computing: a feature request in verse:

While Philae sits all alone on a slope of 67P,
I think of PaPi with my eyes closed.
If only, I say, there were a search box,
And PaPi went and searched the contig names
For the text we typed into it,
And highlighted on the tree whatever it found...
   And what is more,
      maybe a button in the search tab,
          to say, for instance, "Take these, add them next to the other groups",
               if only that were possible, baby, oh if only it were possible...

Contigs <-> Annotation DB

Each annotation database should keep the sha1sum of the FASTA file it is generated from. Right now it is possible to generate an annotation table, then populate it with search results coming from a totally different FASTA file.

This mistake should not be possible to make.
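A sketch of the check: hash the FASTA once when the database is created, store the digest, and compare it before accepting search results. Chunked reading keeps memory flat for large files.

```python
import hashlib

def fasta_hash(fasta_path):
    """sha1 digest of a FASTA file, so an annotation database can
    record which file it came from and later refuse search results
    generated from a different FASTA.

    Sketch only; where the digest is stored is up to the db layer.
    """
    h = hashlib.sha1()
    with open(fasta_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()
```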

annotation.db needs to know which tables are populated

self table in the annotation.db should be updated when any of the papi-populate-*-table scripts are run.

it should be clear to the profiler whether a papi-populate-*-table was run and the results were empty, or it was never run at all.
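A sketch of the bookkeeping, using a plain dict to stand in for the self table (the key names are assumptions):

```python
def mark_table_populated(self_table, table_name, num_entries):
    """Record that a papi-populate-*-table script ran and how many
    entries it produced, so the profiler can tell 'ran but empty'
    apart from 'never ran'.

    Sketch only: self_table here is a dict standing in for the real
    self table in annotation.db.
    """
    self_table['%s_was_run' % table_name] = True
    self_table['%s_num_entries' % table_name] = num_entries
```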

Inclusion of high coverage elements

Could we find a way to include contigs that are highly abundant but too short to make it into the assembly? Perhaps we could use a percentage approach: compute the relative abundance of all the short contigs, and if they are greater than 0.1 percent (or some selected value) of the total dataset, form them into a separate bin.

We don't want to miss these contigs!
screen shot 2014-08-11 at 10 57 55 am
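The percentage approach could look like this sketch, where coverages maps contig name to total mapped coverage and the 0.1% default mirrors the value suggested above:

```python
def high_coverage_short_contigs(coverages, min_fraction=0.001):
    """Return the contigs whose share of the total coverage meets the
    threshold (0.1% by default), candidates for a separate bin.

    Sketch only; how 'short' contigs are identified upstream is out
    of scope here.
    """
    total = float(sum(coverages.values()))
    if not total:
        return []
    return [name for name, cov in coverages.items()
            if cov / total >= min_fraction]
```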

Contigs with Ns

Hi PaPi,
Could you set a flag in the profiling to ignore contigs with Ns, so they are not included in the analysis? Many assemblers insert Ns for various reasons, and it would be great to be able to control for this in the analysis.
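A sketch of the requested behavior, with a skip_Ns flag standing in for whatever the real command-line option would be called:

```python
def filter_contigs_with_Ns(contigs, skip_Ns=True):
    """Drop contigs containing Ns when the flag is set; contigs maps
    contig name -> sequence.

    Sketch only; the flag name is made up for illustration.
    """
    if not skip_Ns:
        return dict(contigs)
    return {name: seq for name, seq in contigs.items()
            if 'N' not in seq.upper()}
```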

Selection Control

This just happened to me: I had a tree with various selections, and then while I was moving it around I inadvertently clicked on one of the root branches. All my selections were overwritten by this new selection, and I had to wait a minute for PaPi to add 1.5K contigs into a bin.

Maybe there should be a popup to warn the user with something like "Yo, you just requested to bin XXX splits into Group X. Do you wish to continue?" if the user attempts to select more than, e.g., 500 splits.

Specialized binning for a group

In some cases TNF resolves a fine cluster, yet more than one genome can be mixed within that cluster with different coverages. There must be a way to focus on that clade and re-order contigs based on some other information (without breaking up splits), such as coverage.
