
vgteam / vg

tools for working with genome variation graphs

Home Page: https://biostars.org/tag/vg/

License: Other

C++ 94.14% Makefile 0.45% Shell 2.62% Python 1.69% XSLT 0.28% HTML 0.52% R 0.18% Julia 0.06% Ruby 0.01% Dockerfile 0.04%
dna genome-graph genomics graph variation-graph

vg's People

Contributors

6br, adamnovak, alexandersavelyev, alexjironkin, apregier, buske, cmarkello, code-s-witch, edawson, ekg, emikoifish, glennhickey, jeizenga, jervenbolleman, jltsiren, jmonlong, jonassibbesen, kevyin, ldenti, mlin, mr-c, mzueva, ocxtal, parsaeskandar, richarddurbin, stephenhwang, tanessav, xchang1, yhoogstrate, yoheirosen

vg's Issues

vg on RNA-seq data

Hi,
This is great work and I look forward to trying it! I had some questions and I thought it would be better to ask rather than not to.

  • Would vg work on RNA-seq data?
  • Would it be able to do spliced alignment directly?
  • Could the known introns be coded as indels and then re-interpreted as introns when mapped back onto the reference space?
  • Would it be possible to build a graph just with transcripts and move on from there?

thanks in advance,

Inti

circular graphs

vg can now support circular graphs, but there are no tests of this functionality. Make one, and verify this can work for at least some operations.

Steps to compile vg locally without needing sudo access

Hi Erik,

I wrote up the following steps, in case anyone would like to install vg without needing sudo access. Feel free to change/integrate them any way you prefer :) Below are the steps:

  1. Download and install jansson in your home directory, replacing YOUR_USERNAME with your username on the system:
git clone https://github.com/akheron/jansson
cd jansson
autoreconf -i
./configure --prefix=/home/YOUR_USERNAME/apps/jansson
make
make install
cd ..
  2. Download vg and enter its directory:
git clone --recursive https://github.com/ekg/vg.git
cd vg
  3. Update INCLUDES and LDFLAGS in Makefile:

Add to INCLUDES the following, with YOUR_USERNAME being set appropriately:

-I/home/YOUR_USERNAME/apps/jansson/include

Add to LDFLAGS the following, with YOUR_USERNAME being set appropriately:

-L/home/YOUR_USERNAME/apps/jansson/lib

They should look something like this, with YOUR_USERNAME being set appropriately:

INCLUDES=-I./ -I/home/YOUR_USERNAME/apps/jansson/include -Icpp -I$(VCFLIB)/src -I$(VCFLIB) -Ifastahack -Igssw/src -Iprotobuf/build/include -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache -Ihtslib -Isha1
LDFLAGS=-L./ -L/home/YOUR_USERNAME/apps/jansson/lib -Lvcflib -Lgssw/src -Lprotobuf -Lsnappy -Lrocksdb -Lprogressbar -Lhtslib -lvcflib -lgssw -lprotobuf -lhts -lpthread -ljansson -lncurses -lrocksdb -lsnappy -lz -lbz2
  4. Next compile vg:
make
  5. For the final step, update your LD_LIBRARY_PATH to include the jansson library directory, and add the line to your .bashrc file to keep it persistent between sessions. As before, replace YOUR_USERNAME with your username on the system:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/YOUR_USERNAME/apps/jansson/lib

Hope it helps others as well,
Paul

Possibility for MPI integration...

Hi Erik,

Just out of curiosity, have you tried to integrate with MPI to work across multiple machines? MPI plays really well with OpenMP, and would allow access to even more cores to speed things up.

Just a thought,
Paul

Make include guards unique

I find that include guards like "INDEX_H" and "PATH_H" are too short for the safe reuse of your header files (when they belong to an application programming interface).

vg surject: Assertion `alignment.has_path()' failed.

aln.gam (16GiB) has one lane of HiSeq 2500 100x2 reads mapped with `vg map'.

$ time vg/vg surject -d wg.index.k27e11 -t 32 -p 17 -b aln.gam > aln.surj.bam
vg: alignment.cpp:462: std::string vg::cigar_against_path(const vg::Alignment&): Assertion `alignment.has_path()' failed.
Aborted  

real    0m12.513s  
user    0m21.291s
sys     0m2.333s  

edges should be included in paths

The protobuf format doesn't currently have a way to represent when edges are part of paths.

Case in point (from the test directory):

➜  test git:(master) ✗ vg construct -r tiny/tiny.fa >t.vg; vg align -s CAAATAAGGCTTGGAAATTATATTCCAACTCTCTT -Q query t.vg | vg mod -i - t.vg | vg view -
H       HVN:Z:1.0
S       2       CAAATAAGGCTTGGAAATT
P       2       x       +       19M
L       2       -       5       +       0M
L       2       -       4       +       0M
S       4       TTCTGGAGTTCTATT
P       4       x       +       15M
L       4       -       5       +       0M
S       5       ATATTCCAACTCTCTG
P       5       x       +       16M

Nothing in the GFA output (or other output) can refer to the added path query. This is a problem with the schema itself.

FastG output

Hi,

Would you be interested in outputting FastG format so that graphs can be viewed in @rrwick's Bandage tool?

Long deletion simplification

Paths imply edges. However, we do not simplify paths where many nodes are skipped by deletions. Each node will be referenced in the path with a mapping to_length of 0. This is redundant and we can simply skip these nodes when describing the path. Downstream this will help when extending the graph with a set of alignments.
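The simplification described above can be sketched as follows. This is a minimal illustration, not vg's implementation: the `Mapping` tuple is a hypothetical stand-in for the protobuf Mapping message, reduced to a node id and a `to_length`.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    # Hypothetical, simplified stand-in for vg's protobuf Mapping.
    node_id: int
    to_length: int  # number of path bases placed on this node


def simplify_deletions(path: List[Mapping]) -> List[Mapping]:
    """Drop mappings that place no path sequence on their node."""
    return [m for m in path if m.to_length > 0]


# Nodes 2 and 3 are skipped by a deletion, so they carry to_length == 0.
path = [Mapping(1, 8), Mapping(2, 0), Mapping(3, 0), Mapping(4, 5)]
simplified = simplify_deletions(path)
```

After simplification the path visits nodes 1 and 4 only, and the jump between them is the edge the path implies.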

graph can imply unobserved sequences

What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? e.g. variants at different locations which are not seen in a single individual but are on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?
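A toy example may make the question concrete. The sketch below assumes a hypothetical graph with two biallelic sites separated by shared sequence. Two haplotypes are observed, but the graph admits four walks.

```python
from itertools import product

# Hypothetical toy graph: two bubbles with shared flanking sequence.
sites = [("A", "G"), ("C", "T")]
flank = ("TT", "GG", "AA")


def spell(alleles):
    """Spell the sequence of the walk that takes the given allele at each site."""
    return flank[0] + alleles[0] + flank[1] + alleles[1] + flank[2]


# Only two haplotypes were actually seen in individuals.
observed = {spell(("A", "C")), spell(("G", "T"))}
# Every combination of alleles is a valid walk through the graph.
implied = {spell(choice) for choice in product(*sites)}
unobserved = sorted(implied - observed)
```

Here `unobserved` holds the two recombinant sequences that the graph encodes although no individual carries them. Recording the observed haplotypes as paths is one way to keep that context.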

shared-lib errors

I just installed vg on a CentOS 6.4 box (after failing on OSX in #3); it seemed to compile fine but when I try to run it on a simple chr1 fasta file:

$ ./vg construct -r /hpc/users/willir31/data/refs/hg19.chr1.fasta
./vg: /usr/lib64/libgomp.so.1: version `GOMP_4.0' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.19' not found (required by ./vg)

I don't really know what to make of these errors. Any ideas?

adding variation with map / mod does not seem to work

I am trying to use map and mod to add sequences to the graph (as new paths). This does not work as expected on some simple examples (derived from existing unit tests). (Originally mentioned a while ago in email, but adding to GitHub, where it should have been in the first place, for posterity, and in case Adam wants to take a look.)

Using your test tiny/tiny.fa, I tried to make a point mutation (A->G 2nd base).

vg construct -r tiny/tiny.fa >t.vg
vg index -s -k 11 t.vg
vg view t.vg
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M

vg map -s CGAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Shouldn't I see a bubble in the graph? Same deal if I insert GGG at same position:

vg map -s CGGGAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Inserting GGG at position 20 seems to work

vg map -s CAAATAAGGCTTGGAAATTTGGGTCTGGAGTTCTATTATATTCCAACTCTCTG t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 2 CAAATAAGGCTTGGAAATTT
P 2 x + 20M
L 2 - 3 + 0M
L 2 - 4 + 0M
S 3 TCTGGAGTTCTATTATATTCCAACTCTCTG
P 3 x + 30M
S 4 GGG
L 4 - 3 + 0M

but I only see the one path for the sequence "x" in tiny.fa. I'd like to have a 2nd path be added that includes the insertion.

Break banded alignment

I'm looking for failing unitig/contig alignments.

Reads past -B bases in length are aligned in "banded" mode, in which the read is broken into overlapping subreads, each subread is mapped independently, and the results are merged by finding common points in the paths of the subreads.

This method does work, but there are certainly problems with my implementation. It might be much better to implement banded alignment in gssw. Either way I am looking for tests.
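The splitting step of the banded mode described above can be sketched like this. It is an illustration only: the function name and the `band_width` and `overlap` parameters are assumptions, and the merge of subread paths is not shown.

```python
def split_into_bands(read: str, band_width: int, overlap: int):
    """Return (offset, subread) pairs covering the read with overlapping windows."""
    if band_width <= overlap:
        raise ValueError("band_width must exceed overlap")
    step = band_width - overlap
    bands = []
    start = 0
    while True:
        bands.append((start, read[start:start + band_width]))
        if start + band_width >= len(read):
            break  # this window reaches the end of the read
        start += step
    return bands


bands = split_into_bands("ACGTACGTAC", band_width=4, overlap=2)
```

Each subread shares `overlap` bases with its neighbor, which is what gives the merge step common points to look for.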

GCSA2 output problem

@jltsiren reports:

The forward and backward edges don't always match in the example.

In the first case, kmer AAGAATACAAG starts at position 35:4 and has A as a predecessor. If I follow the backward edge on A, I reach kmer AAAGAATACAA, which can be found at position 35:3. However, because AAAGAATACAA does not have G as a successor, I can't reach the original kmer AAGAATACAAG by following a forward edge.

In all cases, the successor positions of the two kmers seem to differ by more than 1, so one of them is probably wrong.

Reverse: Node AAAGAATACAA is missing successor(G): AAGAATACAAG
AAAGAATACAA 35:3 G A 37:1
AAGAATACAAG 35:4 A A 37:3

Reverse: Node TACTCCACATC is missing successor(A): ACTCCACATCA
TACTCCACATC 200:0 C,G C 202:0
ACTCCACATCA 200:1 T A 202:2

Reverse: Node GATGCTTGTGA is missing successor(A): ATGCTTGTGAA
GATGCTTGTGA 77:24 T T 79:0
ATGCTTGTGAA 77:25 G G 82:2

Reverse: Node CATTGTCAACA is missing successor(C): ATTGTCAACAC
CATTGTCAACA 107:1 T A 109:0
ATTGTCAACAC 107:2 C A 109:2

Reverse: Node TCTCTTCACTG is missing successor(C): CTCTTCACTGC
TCTCTTCACTG 153:2 A G 155:0
CTCTTCACTGC 153:3 T C 155:2

Reverse: Node TGGGTCCTGGT is missing successor(G): GGGTCCTGGTG
TGGGTCCTGGT 15:9 C T 20:0
GGGTCCTGGTG 15:10 T C 20:2

Reverse: Node TGGTTCCTGGT is missing successor(G): GGTTCCTGGTG
TGGTTCCTGGT 15:9 C T 20:0
GGTTCCTGGTG 15:10 T C 20:2

Reverse: Node GTGATGCTTGT is missing successor(A): TGATGCTTGTA
GTGATGCTTGT 77:22 G G 78:1
TGATGCTTGTA 77:23 G A 82:0
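The consistency check behind these reports can be sketched as follows. The sketch assumes that the third and fourth columns of each record are the predecessor and successor characters, which matches the first case described above. The function name and data layout are hypothetical.

```python
def find_missing_successors(records):
    """records maps kmer -> (predecessor chars, successor chars).

    Return (previous kmer, missing char, kmer) for every backward edge
    that has no matching forward edge.
    """
    missing = []
    for kmer, (preds, _succs) in records.items():
        for c in preds:
            prev = c + kmer[:-1]  # kmer reached by following the backward edge
            if prev in records and kmer[-1] not in records[prev][1]:
                missing.append((prev, kmer[-1], kmer))
    return missing


# First pair from the report, with columns read as assumed above.
records = {
    "AAAGAATACAA": ({"G"}, {"A"}),
    "AAGAATACAAG": ({"A"}, {"A"}),
}
problems = find_missing_successors(records)
```

On this pair the check flags AAAGAATACAA as missing successor G, which agrees with the first reported line.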

path storage change

To dynamically modify paths, they should be stored in-memory as linked lists: std::list<Mapping>, rather than literal protobuf path objects (which are only efficient in append-only mode).

vg surject: unhygienic output for unmapped reads

Unmapped reads (RNAME=*) in the BAM output from vg surject seem to have an arbitrary value filled in for POS and perhaps other fields as well. These could be dirty values left over in some data structure from a previous use.

$ samtools view aln.surn.bam
C2KC2ACXX_1:6:1101:3304:0/1     6       *       11075625        0       1S99M   *       11075625
        0       TGGGTTGATGCCATGGAAAGGGGCAGTAACTTCCTGATGTTACCATGGCAACAGTAAACTAACATGGCACACTGGTGTCTAATG
GGGGAGGTGCTTCTGC    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B><8B><8B><8B><8B><8B><8B>
<8B><8B><8B><88><8B><8B><84><88><84><88><8B><8B><8B><8B><8B><8B><88><88><8B><8B><8B><88><8B><88><88>
<88><88><88><84><88><8B><8B><8B><88><88><84><88><88><8B><8B><88><88><88><88><88><88><88><88><84><88>
<88><88><88><88><88><88><84><84><84><88><88><84><88><88><88><88><88><88><84><88><88>~<84><84><84>
<84><84><88><88><88>
C2KC2ACXX_1:6:1101:3573:0/1     22      *       11077961        0       100M    *       11077961
        0       AGCAGCAGTGTTTCTGAACAGCTTCAGGAAGAGCTTGCCACTTTCAGGCTCTCACAAATGGAGAGACTTCTTATTAATCTCTTT
CTCTCCACTGCAGGCA    <84><84><84><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88>
<88><88><84><88><88><88><88><88><8B><8B><88><8B><88><8B><88><88><88><88><88><8B><8B>~<88><88><8B>
<88><88><88><88><88><88><88><8B><8B><8B><8B><8B><88><8B><8B>~<88><84><84><88><88><88><8B><8B><8B>
<8B><88><8B><84><88><88><88><88><88><88>~<88><88><88>~<84><88><88><84><88><84><84><84><84><84><84>
<84><84><84><84>
C2KC2ACXX_1:6:1101:3928:0/1     22      *       11030553        0       1S99M   *       11030553
        0       GGGTAGTCTGAAAGAGCTTGTTCCTCCCCGCCTCTCTCTCTCTCTTGCTCTCTCTCTTGCCATGTAACATTCAGGCTCCTCCTT
CACCTTCCAACATGGT    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B>~<88><88><88><8B><88>
<8B><8B><8B><88><88><8B><8B><88><8B><8B><8B><8B><8B><88><8B><8B><8B><8B><8B><8B><8B><8B><8B><8B><88>
<8B><8B><88><8B><8B><8B><8B><8B><8B><8B><84><88><88><8B><8B><88><88><8B><88><88><88><88><88><88><88>
<84><84><88><84>riiiriyririirrr<84>iiriririyi

paths should include edges

Right now these are implicit, which is causing problems that are obvious as soon as one uses vg view -d to convert a graph to dot format. Non-reference sequences are colored as red, and in some cases we see that deletions are colored as black (implying they are part of a known path). This can be corrected by examining the path membership of the nodes they connect--- if there is a gap we'll know that we are not in the path. However, perhaps it is better to add edges to paths? Unclear.
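The membership test suggested above (an edge belongs to a path only if the nodes it connects are adjacent in that path) can be sketched as below. This is a simplified illustration that ignores node orientation; the names are hypothetical.

```python
def edge_on_path(path_nodes, edge):
    """True if the edge joins two nodes that are consecutive in the path."""
    return any((a, b) == edge for a, b in zip(path_nodes, path_nodes[1:]))


# Reference path through nodes 1, 2, 3, 5.
path_nodes = [1, 2, 3, 5]
deletion_edge = (1, 3)   # skips node 2; both endpoints are on the path
reference_edge = (3, 5)
```

The deletion edge connects two path nodes but is not itself on the path, so it should not be drawn as part of a known path.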

unitig mapping: Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed

I'm running vg map on some paired-end reads now, appears to be going smoothly.

I'd previously tried it on some @lh3 fermikit unitigs and hit the following:

$ time vg/vg map -f mlin_unitigs.mag.gz -d wg.index.k27e11 -t 32 -FX 1.9 >aln.gam
vg: path.cpp:510: vg::Path vg::merge_paths(const vg::Path&, const vg::Path&, int&, int&): Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed.
Aborted

real    0m12.258s
user    0m11.355s
sys     0m2.657s

Here are the first couple unitigs from mlin_unitigs.mag.gz:

@374566:656953435       108     .       1079017880,56;1596667180,64;2213272247,65;2213272249,65;
AAAAAAAGAAAGAAAAAGAGAGAGAGAAAATAAAAGAAAATTAATATCATTGGCTGTTTTTAAGTTCATCTTTCCCTCCTCTGTCATCTCACAGGTATTAGTAAGAACCGCTGTTACACTGCGTGCCACACTGAATTTCAACTATCCCTCTATCTGCTTTGTCTTCTCTCCCAGCCAGTAAGCTACTAAATGATTTTGGATGAATAAATAAACATCTAGGAATGGGAAAGAGAGCAAAATTGAACAAATAGTAATGAATTAGAGTAATCCTTTAAAAGGTGGAAATTATTGGAACAGATATGCAGTTTAAATAAGTTGCAGACTAAGATAGCAGCATAAAACATACAGGAATATGGCCGGGCGCGGTGGCTCAAGCCTGTAATCCCAGCACTTTGG
+
'''''((()*****++++++++,,,-..///////012234556667777888899:::;;;<<<<<<=====>>>>??@@ABBCCCDDDEEDEEEEEEFFFFFFEEEDCCCCDCDDEEEEEDEEDCCBBBBCCDCBAA@?>>===<<<<;;;<;;;;;::;::;<<<;<<<<;;;<;;:::9988887766667776665677888889::::::;::;<<==>>???@@A@@?@ABBBCDEEEEEEEEFGGGHHHIIJJIIIHGGHHGGGGGGGFGGGFFFGGGGFFFFFFFFGFGGFFFEEDEEEFEDEEDEDCDDCBCBA@@???>>>>>>=<;;;:988888888766544322111111110000000000////..----------,+*
@2445285:1581873033     5       1497293684,98;1884040499,93;    189839833,99;434191570,89;1562090913,91;1633644090,82;
AAAAAAAAAAAGTCATGGGAGAGGATGGTAAAGCTAAGTATCTTTTGCACCTACTCCCCAGCCCCACCACTGCAGAAGCTGAAGGGGTTCCTAGAGGCTTCTTCTGCC
+
""###$%%&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&%%$$$#""

Installation-from-source issues on OSX

Hi there, I've hit a few snags trying to install from source on OSX 10.10.2. All of my cmds / output are in this gist but I'll link to specific files / sections within it below:

That's as far as I can get right now.

I've installed "real" gcc/g++ via brew and they are the only ones on my path. I did this due to earlier errors like:

In file included from vg.cpp:1:
./vg.hpp:9:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^
1 error generated.

Some googling led me to believe that this was the result of using clang, cf. this SO answer.

When running into the pb2json linker errors above, I thought they might be due to my having run brew install protobuf-c when I was still using clang, so I brew remove'd and re-brew install'd it, and that actually got me further, to errors in index.cpp:

 Tue 19:41:36  ryan@mbp: vg:master$ make
g++ -std=c++11 -fopenmp -g  -O3 -c -o vg.o vg.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o cpp/vg.pb.o cpp/vg.pb.cc -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o main.o main.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o index.o index.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
index.cpp: In member function 'const string vg::Index::key_for_node(int64_t)':
index.cpp:119:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_from_to(int64_t, int64_t)':
index.cpp:133:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_to_from(int64_t, int64_t)':
index.cpp:151:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_kmer(const string&, int64_t)':
index.cpp:168:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_node_path(int64_t, int64_t, int64_t)':
index.cpp:182:30: error: 'htobe64' was not declared in this scope
     node_id = htobe64(node_id);
                              ^
index.cpp: In member function 'const string vg::Index::key_for_path_position(int64_t, int64_t, int64_t)':
index.cpp:202:30: error: 'htobe64' was not declared in this scope

For unknown reasons, I can't even get back to this point after having removed everything and started over; I'm currently stuck at the aforementioned pb2json linker errors.

Just putting all of this here in case people have ideas / anyone else hits the same issues.

update mapping algorithm

A big performance win lies at the other end of a small optimization. If we can guess what the best mapping target is on the basis of kmer matches, we can try to run fewer gssw alignments. If this can happen we can run a lot quicker when mapping.

One idea would be to measure the informativeness of each kmer, and then build a conditional entropy metric for each mapping target. We want to evaluate the mapping targets where our kmer hits are rare, and where there are many hits, before mapping against a huge complex of lower quality hits.
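One way to rank targets along these lines is sketched below. Note that it substitutes a simpler score for the conditional entropy metric proposed above: each target is scored by the summed self-information (-log2 of index frequency) of the kmers that hit it. All names and data here are hypothetical.

```python
import math
from collections import defaultdict


def rank_targets(kmer_hits, index_counts):
    """Rank mapping targets by the summed information of their kmer hits.

    kmer_hits: kmer -> list of target ids hit by that kmer.
    index_counts: kmer -> number of occurrences in the whole index.
    """
    total = sum(index_counts.values())
    scores = defaultdict(float)
    for kmer, targets in kmer_hits.items():
        info = -math.log2(index_counts[kmer] / total)  # rare kmers score higher
        for t in set(targets):
            scores[t] += info
    return sorted(scores, key=lambda t: scores[t], reverse=True)


index_counts = {"AAAA": 90, "ACGT": 9, "GATC": 1}
kmer_hits = {"AAAA": ["t1", "t2", "t3"], "ACGT": ["t2"], "GATC": ["t2", "t3"]}
order = rank_targets(kmer_hits, index_counts)
```

Targets supported by many rare kmers come first, so the expensive gssw alignments could be tried in that order and stopped early.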

Defining personal genomes on a variation graph

Hi,
For some analyses, like allele-specific expression, one wants to have a personal version of the reference, that is, the reference modified to represent the known sequence of an individual. It seems this would be trivial with variation graphs if one has a set of variants for which an individual is hom/het. Question: is it possible to align the reads against a subset of the paths of the graph? For example, could one define a subset of the graph by removing paths for alleles not present in a personal genome (where it is hom at the site) and keeping only paths for het sites?
Thanks in advance
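The allele selection asked about here can be sketched on plain data. This is not a vg feature or command, only an illustration of the idea; the site names, allele lists and genotype encoding are hypothetical.

```python
def personal_alleles(sites, genotypes):
    """Keep only the alleles an individual carries at each site.

    sites: site -> list of allele sequences (index 0 is the reference).
    genotypes: site -> tuple of allele indices carried by the individual.
    """
    return {
        site: [alleles[i] for i in sorted(set(genotypes[site]))]
        for site, alleles in sites.items()
    }


sites = {"s1": ["A", "G"], "s2": ["C", "T"]}
genotypes = {"s1": (0, 1), "s2": (1, 1)}  # het at s1, hom alt at s2
kept = personal_alleles(sites, genotypes)
```

The het site keeps both alleles (a bubble), while the hom site collapses to the single allele the individual carries.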

mapping qualities

We need mapping qualities. They can perhaps be estimated by looking at the size of the kmer space of the read vs. the kmer space of the entire index. A more nuanced approach is to build a ML model that calculates mapping accuracy given some parameters we can extract from the alignment (akin to Mosaik). But simple may be enough in this case.

paired-end reads

Is there any way to keep the separation distance of paired-end reads or is that information lost in graph format?

use many alignments to modify a graph

Due to the destructive nature of the changes on the internal representation of the graph, it's not trivial to take many graph alignments and include them in the graph. However, it should be possible to do with a little care, and in particular is not hard if we can handle all the mappings to a single node at once. This means sorting the changes we want to add up-front, modifying the graph in one step for each node.
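The "sort up-front, modify each node once" idea can be sketched as below. As an assumption, edits are reduced to breakpoints (node id, offset) at which a node must be cut. Real alignments would also carry the new sequence to attach, which is not shown.

```python
from collections import defaultdict


def split_nodes(nodes, breakpoints):
    """Cut each node once at all of its breakpoints.

    nodes: node_id -> sequence.
    breakpoints: iterable of (node_id, offset) collected from all alignments.
    Returns node_id -> list of sequence pieces.
    """
    cuts = defaultdict(set)
    for node_id, offset in breakpoints:
        if 0 < offset < len(nodes[node_id]):  # cuts at the node ends change nothing
            cuts[node_id].add(offset)
    pieces = {}
    for node_id, seq in nodes.items():
        bounds = [0] + sorted(cuts[node_id]) + [len(seq)]
        pieces[node_id] = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    return pieces


nodes = {1: "CAAATAAGGC", 2: "TTGG"}
breakpoints = [(1, 7), (1, 3), (1, 7), (2, 0)]
pieces = split_nodes(nodes, breakpoints)
```

Because every breakpoint for a node is known before the node is touched, offsets never have to be recomputed against a node that has already been split.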

Assertion `table' failed in sparsehash

I just got this very weird assertion error after completing whole-genome kmer indexing:

vg.hgi: sparsehash/build/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const std::pair<long int, long int>, vg::Edge*>; Key = std::pair<long int, long int>; HashFcn = std::hash<std::pair<long int, long int> >; ExtractKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SelectKey; SetKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SetKey; EqualKey = std::equal_to<std::pair<long int, long int> >; Alloc = google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.

I'm trying to reproduce it on a smaller set. It'd be nice to know what's going on.

Protobuf "info" map fields take too much memory

Every VG protobuf message has a "map<string, Info> info" field. On my system this field takes 88 bytes in memory when empty, and amounts to more than half of the in-memory protobuf Graph representation. It probably also matches a good chunk of the memory used by the various hash_map indexes (if we assume 2 * 8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of it if possible, or moving it into Graph-level maps, i.e. instead of having an info record per Node, have a map in Graph that maps node IDs to info records. This seems to me to be a low-hanging-fruit way to cut memory usage by about half. It's something I can do, but I would like feedback on it, as I don't really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info, ::google::protobuf::internal::WireFormatLite::TYPE_STRING, ::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn
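The "more than half" estimate follows directly from the sizes reported above. A quick check, using only the figures from this report:

```python
INFO_FIELD = 88  # bytes for an empty info map field, as reported
NODE = 152       # sizeof(vg::Node)
EDGE = 144       # sizeof(vg::Edge)

node_fraction = INFO_FIELD / NODE   # share of a Node taken by the empty field
edge_fraction = INFO_FIELD / EDGE   # share of an Edge taken by the empty field
dense_map_entry = 2 * 8 * 1.78      # assumed bytes per dense_map entry
```

That is roughly 58% of each Node and 61% of each Edge, compared with about 28.5 bytes for one dense_map entry under the stated assumption.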

GFA output is not really GFA

It's come to my attention that vg is not outputting correct GFA format.

That said, it's not yet clear to me whether GFA can support the vg graphs that @adamnovak's recent commits enable, as such graphs have four kinds of edges.

Any suggestions about how to deal with this would be welcome. I think GFA is a useful format as it enables text processing of graphs on the command line.

round trip from .vg to db and back

It should be possible to dump a .vg format file from a vg database, and also to read it back in. Could this be done quickly using something like the node range query in vg find?
