vgteam / vg

tools for working with genome variation graphs

License: Other

C++ 94.14% Makefile 0.45% Shell 2.62% Python 1.69% XSLT 0.28% HTML 0.52% R 0.18% Julia 0.06% Ruby 0.01% Dockerfile 0.04%

dna genome-graph genomics graph variation-graph


vg's Issues

vg on RNA-seq data

Hi,
This is great work and I look forward to trying it! I had some questions and thought it would be better to ask than not.

Would vg work on RNA-seq data?
Would it be able to do spliced alignment directly?
Could known introns be encoded as indels and then reinterpreted as introns when mapped back onto the reference space?
Would it be possible to build a graph just from transcripts and move on from there?

thanks in advance,

Inti

circular graphs

vg can now support circular graphs, but there are no tests of this functionality. Make one, and verify this can work for at least some operations.

Should an alignment be a graph too? Express a sample's sequencing results as a labeled graph

Per-node, per-sample quality and count information on graph

Steps to compile vg locally without needing sudo access

Hi Erik,

I wrote up the following steps, in case anyone would like to install vg without needing sudo access. Feel free to change/integrate them any way you prefer :) Below are the steps:

Download and install jansson in your home directory, and please replace YOUR_USERNAME with your username on the system:

git clone https://github.com/akheron/jansson
cd jansson
autoreconf -i
./configure --prefix=/home/YOUR_USERNAME/apps/jansson
make
make install
cd ..

Download vg and enter its directory:

git clone --recursive https://github.com/ekg/vg.git
cd vg

Update INCLUDES and LDFLAGS in Makefile:

Add to INCLUDES the following, with YOUR_USERNAME being set appropriately:

-I/home/YOUR_USERNAME/apps/jansson/include

Add to LDFLAGS the following, with YOUR_USERNAME being set appropriately:

-L/home/YOUR_USERNAME/apps/jansson/lib

They should look something like this, with YOUR_USERNAME being set appropriately:

INCLUDES=-I./ -I/home/YOUR_USERNAME/apps/jansson/include -Icpp -I$(VCFLIB)/src -I$(VCFLIB) -Ifastahack -Igssw/src -Iprotobuf/build/include -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache -Ihtslib -Isha1
LDFLAGS=-L./ -L/home/YOUR_USERNAME/apps/jansson/lib -Lvcflib -Lgssw/src -Lprotobuf -Lsnappy -Lrocksdb -Lprogressbar -Lhtslib -lvcflib -lgssw -lprotobuf -lhts -lpthread -ljansson -lncurses -lrocksdb -lsnappy -lz -lbz2

Next compile vg:

make

For the final step, update your LD_LIBRARY_PATH with the jansson library directory, and add the line to your .bashrc file to keep it persistent between sessions. As before, replace YOUR_USERNAME with your username on the system:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/YOUR_USERNAME/apps/jansson/lib

Hope it helps others as well,
Paul

Extra dependencies needed

These deps were also needed on my system before the build worked:

automake
libtool

Possibility for MPI integration...

Hi Erik,

Just out of curiosity, have you tried integrating with MPI to work across multiple machines? MPI plays really well with OpenMP, and would allow access to even more cores to speed things up.

Just a thought,
Paul

Make include guards unique

I find that include guards like "INDEX_H" and "PATH_H" are too short for the safe reuse of your header files (when they belong to an application programming interface).

vg surject: Assertion `alignment.has_path()' failed.

aln.gam (16GiB) has one lane of HiSeq 2500 100x2 reads mapped with `vg map'.

$ time vg/vg surject -d wg.index.k27e11 -t 32 -p 17 -b aln.gam > aln.surj.bam
vg: alignment.cpp:462: std::string vg::cigar_against_path(const vg::Alignment&): Assertion `alignment.has_path()' failed.
Aborted  

real    0m12.513s  
user    0m21.291s
sys     0m2.333s  

Factor longer functions in main.cpp into library functions for easier reuse

Genotyping using dynamic programming genotyping model and compressed sequence results against graph

edges should be included in paths

The protobuf format doesn't currently have a way to represent when edges are part of paths.

Case in point (from the test directory):

➜  test git:(master) ✗ vg construct -r tiny/tiny.fa >t.vg; vg align -s CAAATAAGGCTTGGAAATTATATTCCAACTCTCTT -Q query t.vg | vg mod -i - t.vg | vg view -
H       HVN:Z:1.0
S       2       CAAATAAGGCTTGGAAATT
P       2       x       +       19M
L       2       -       5       +       0M
L       2       -       4       +       0M
S       4       TTCTGGAGTTCTATT
P       4       x       +       15M
L       4       -       5       +       0M
S       5       ATATTCCAACTCTCTG
P       5       x       +       16M

Nothing in the GFA output (or other output) can refer to the added path query. This is a problem with the schema itself.

FastG output

Hi,

Would you be interested in outputting FASTG format so that graphs can be viewed in @rrwick's Bandage tool?

K-mers input to GCSA: kmer, node+position, previous characters, successive characters

Dynamic programming method to estimate path qualities given per-node qualities and counts

Paths imply edges. However, we do not simplify paths where many nodes are skipped by deletions. Each node will be referenced in the path with a mapping to_length of 0. This is redundant and we can simply skip these nodes when describing the path. Downstream this will help when extending the graph with a set of alignments.
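
The simplification described above can be sketched as follows. This is a minimal illustration, not vg's implementation: it assumes a path is a list of hypothetical (node_id, to_length) pairs and treats any mapping with a to_length of 0 as a node skipped by a deletion.

```python
def simplify_path(mappings):
    """Drop mappings that consume no sequence and list the edges the result implies.

    `mappings` is a list of (node_id, to_length) pairs; this layout is an
    assumption made for the sketch, not vg's protobuf schema.
    """
    # Keep only the mappings that actually place sequence on a node.
    kept = [(node_id, to_length) for node_id, to_length in mappings if to_length > 0]
    # Consecutive kept nodes imply an edge, including edges that jump over
    # the nodes skipped by a deletion.
    implied_edges = [(a[0], b[0]) for a, b in zip(kept, kept[1:])]
    return kept, implied_edges
```

For a path that visits nodes 1, 2, 3 and 4 but skips 2 and 3, the simplified path keeps only nodes 1 and 4 and implies a single edge between them.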

graph can imply unobserved sequences

What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? e.g. variants at different locations which are not seen in a single individual but are on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?

Read paired-end data in FASTQ format

shared-lib errors

I just installed vg on a CentOS 6.4 box (after failing on OSX in #3); it seemed to compile fine but when I try to run it on a simple chr1 fasta file:

$ ./vg construct -r /hpc/users/willir31/data/refs/hg19.chr1.fasta
./vg: /usr/lib64/libgomp.so.1: version `GOMP_4.0' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.19' not found (required by ./vg)

I don't really know what to make of these errors. Any ideas?

Catch exceptions in main()?

C++ programs usually support exception handling, so I wonder why your "main" function does not yet contain corresponding try and catch blocks.

What do you think about the recommendations by Matthew Wilson in an article?

Would you like to adjust the implementation in light of the effects of uncaught/unhandled exceptions as described by Danny Kalev?

adding variation with map / mod does not seem to work

I am trying to use map and mod to add sequences to the graph (as new paths). This does not work as expected on some simple examples (derived from existing unit tests). (I originally mentioned this a while ago in email, but am adding it to GitHub, where it should have been in the first place, for posterity and in case Adam wants to take a look.)

Using your test tiny/tiny.fa, I tried to make a point mutation (A->G 2nd base).

vg construct -r tiny/tiny.fa >t.vg
vg index -s -k 11 t.vg
vg view t.vg
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M

vg map -s CGAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Shouldn't I see a bubble in the graph? Same deal if I insert GGG at same position:

vg map -s CGGGAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Inserting GGG at position 20 seems to work

vg map -s CAAATAAGGCTTGGAAATTTGGGTCTGGAGTTCTATTATATTCCAACTCTCTG t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 2 CAAATAAGGCTTGGAAATTT
P 2 x + 20M
L 2 - 3 + 0M
L 2 - 4 + 0M
S 3 TCTGGAGTTCTATTATATTCCAACTCTCTG
P 3 x + 30M
S 4 GGG
L 4 - 3 + 0M

but I only see the one path for the sequence "x" in tiny.fa. I'd like to have a 2nd path be added that includes the insertion.

Break banded alignment

I'm looking for failing unitig/contig alignments.

Reads longer than -B bases are aligned in "banded" mode, in which the read is broken into overlapping subreads, each subread is mapped independently, and the results are merged by finding common points in the paths of the subreads.

This method does work, but there are certainly problems with my implementation. It might be much better to implement banded alignment in gssw. Either way I am looking for tests.
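
For anyone writing tests, the read-splitting step of banded mode can be sketched like this. It is an illustration only: the `band` and `overlap` parameters are hypothetical names, the merge step is not shown, and vg's own code may choose subread boundaries differently.

```python
def overlapping_subreads(read, band, overlap):
    """Split `read` into windows of at most `band` bases that share `overlap` bases.

    Returns (start_offset, subread) pairs so that each alignment can later be
    placed back onto the full read.
    """
    if overlap >= band:
        raise ValueError("overlap must be smaller than the band width")
    if len(read) <= band:
        return [(0, read)]
    step = band - overlap
    subreads = []
    start = 0
    while True:
        end = min(start + band, len(read))
        subreads.append((start, read[start:end]))
        if end == len(read):
            break
        start += step
    return subreads
```

The shared bases between neighbouring subreads are what give the merge step common points to look for in the subread paths.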

GCSA2 output problem

@jltsiren reports:

The forward and backward edges don't always match in the example.

In the first case, kmer AAGAATACAAG starts at position 35:4 and has A as a predecessor. If I follow the backward edge on A, I reach kmer AAAGAATACAA, which can be found at position 35:3. However, because AAAGAATACAA does not have G as a successor, I can't reach the original kmer AAGAATACAAG by following a forward edge.

In all cases, the successor positions of the two kmers seem to differ by more than 1, so one of them is probably wrong.

Reverse: Node AAAGAATACAA is missing successor(G): AAGAATACAAG
AAAGAATACAA 35:3 G A 37:1
AAGAATACAAG 35:4 A A 37:3

Reverse: Node TACTCCACATC is missing successor(A): ACTCCACATCA
TACTCCACATC 200:0 C,G C 202:0
ACTCCACATCA 200:1 T A 202:2

Reverse: Node GATGCTTGTGA is missing successor(A): ATGCTTGTGAA
GATGCTTGTGA 77:24 T T 79:0
ATGCTTGTGAA 77:25 G G 82:2

Reverse: Node CATTGTCAACA is missing successor(C): ATTGTCAACAC
CATTGTCAACA 107:1 T A 109:0
ATTGTCAACAC 107:2 C A 109:2

Reverse: Node TCTCTTCACTG is missing successor(C): CTCTTCACTGC
TCTCTTCACTG 153:2 A G 155:0
CTCTTCACTGC 153:3 T C 155:2

Reverse: Node TGGGTCCTGGT is missing successor(G): GGGTCCTGGTG
TGGGTCCTGGT 15:9 C T 20:0
GGGTCCTGGTG 15:10 T C 20:2

Reverse: Node TGGTTCCTGGT is missing successor(G): GGTTCCTGGTG
TGGTTCCTGGT 15:9 C T 20:0
GGTTCCTGGTG 15:10 T C 20:2

Reverse: Node GTGATGCTTGT is missing successor(A): TGATGCTTGTA
GTGATGCTTGT 77:22 G G 78:1
TGATGCTTGTA 77:23 G A 82:0
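
The consistency rule behind these reports can be sketched as a small checker. It assumes each line has the five whitespace-separated columns shown above (kmer, position, predecessor characters, successor characters, successor positions) and, as a simplification, merges records by kmer and ignores positions, so it is not the check GCSA2 itself performs.

```python
def missing_successors(lines):
    """Report kmers whose backward edge has no matching forward edge."""
    table = {}
    for line in lines:
        kmer, _pos, prev_chars, next_chars, _next_pos = line.split()
        preds, succs = table.setdefault(kmer, (set(), set()))
        preds.update(prev_chars.split(","))
        succs.update(next_chars.split(","))

    problems = []
    for kmer, (preds, _succs) in table.items():
        for char in sorted(preds):
            # Following the backward edge on `char` reaches this kmer.
            pred_kmer = char + kmer[:-1]
            if pred_kmer not in table:
                continue  # the listing may be incomplete; nothing to compare
            # The predecessor must list our last base as a successor.
            if kmer[-1] not in table[pred_kmer][1]:
                problems.append((pred_kmer, kmer[-1], kmer))
    return problems
```

Run on the first pair of lines in the report, it returns the same triple as the "Reverse:" message: the node AAAGAATACAA, the missing successor G, and the kmer AAGAATACAAG.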

Generalization to assembly graphs

(although directional, nothing is intrinsically DAG-based except alignment)

path storage change

To dynamically modify paths, they should be stored in-memory as linked lists: std::list<Mapping>, rather than literal protobuf path objects (which are only efficient in append-only mode).

vg surject: unhygienic output for unmapped reads

Unmapped reads (RNAME=*) in the BAM output from vg surject seem to have an arbitrary value filled in for POS and perhaps other fields as well. These could be dirty values left over in some data structure from a previous use.

$ samtools view aln.surn.bam
C2KC2ACXX_1:6:1101:3304:0/1     6       *       11075625        0       1S99M   *       11075625
        0       TGGGTTGATGCCATGGAAAGGGGCAGTAACTTCCTGATGTTACCATGGCAACAGTAAACTAACATGGCACACTGGTGTCTAATG
GGGGAGGTGCTTCTGC    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B><8B><8B><8B><8B><8B><8B>
<8B><8B><8B><88><8B><8B><84><88><84><88><8B><8B><8B><8B><8B><8B><88><88><8B><8B><8B><88><8B><88><88>
<88><88><88><84><88><8B><8B><8B><88><88><84><88><88><8B><8B><88><88><88><88><88><88><88><88><84><88>
<88><88><88><88><88><88><84><84><84><88><88><84><88><88><88><88><88><88><84><88><88>~<84><84><84>
<84><84><88><88><88>
C2KC2ACXX_1:6:1101:3573:0/1     22      *       11077961        0       100M    *       11077961
        0       AGCAGCAGTGTTTCTGAACAGCTTCAGGAAGAGCTTGCCACTTTCAGGCTCTCACAAATGGAGAGACTTCTTATTAATCTCTTT
CTCTCCACTGCAGGCA    <84><84><84><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88>
<88><88><84><88><88><88><88><88><8B><8B><88><8B><88><8B><88><88><88><88><88><8B><8B>~<88><88><8B>
<88><88><88><88><88><88><88><8B><8B><8B><8B><8B><88><8B><8B>~<88><84><84><88><88><88><8B><8B><8B>
<8B><88><8B><84><88><88><88><88><88><88>~<88><88><88>~<84><88><88><84><88><84><84><84><84><84><84>
<84><84><84><84>
C2KC2ACXX_1:6:1101:3928:0/1     22      *       11030553        0       1S99M   *       11030553
        0       GGGTAGTCTGAAAGAGCTTGTTCCTCCCCGCCTCTCTCTCTCTCTTGCTCTCTCTCTTGCCATGTAACATTCAGGCTCCTCCTT
CACCTTCCAACATGGT    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B>~<88><88><88><8B><88>
<8B><8B><8B><88><88><8B><8B><88><8B><8B><8B><8B><8B><88><8B><8B><8B><8B><8B><8B><8B><8B><8B><8B><88>
<8B><8B><88><8B><8B><8B><8B><8B><8B><8B><84><88><88><8B><8B><88><88><8B><88><88><88><88><88><88><88>
<84><84><88><84>riiiriyririirrr<84>iiriririyi
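
A quick way to flag such records in text SAM output (for example from samtools view) is sketched below. The conventions it checks are my reading of the SAM specification and should be treated as assumptions: a read with RNAME "*" is expected to have POS 0 and CIGAR "*", and a read with RNEXT "*" is expected to have PNEXT 0.

```python
def unhygienic_fields(sam_line):
    """Return the names of fields that look like leftovers on an unplaced read."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    rnext, pnext = fields[6], int(fields[7])

    problems = []
    if rname == "*":
        # No reference sequence, so a coordinate or CIGAR has nothing to refer to.
        if pos != 0:
            problems.append("POS")
        if cigar != "*":
            problems.append("CIGAR")
    if rnext == "*" and pnext != 0:
        problems.append("PNEXT")
    return problems
```

Applied to the first record above (RNAME *, POS 11075625, CIGAR 1S99M, RNEXT *, PNEXT 11075625), it reports POS, CIGAR and PNEXT.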

paths should include edges

Right now these are implicit, which causes problems that become obvious as soon as one uses vg view -d to convert a graph to dot format. Non-reference sequences are colored red, and in some cases deletions are colored black (implying they are part of a known path). This can be corrected by examining the path membership of the nodes they connect: if there is a gap, we know the edge is not in the path. However, perhaps it is better to add edges to paths? Unclear.
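
The path-membership check could look roughly like this. It is a sketch under simplifying assumptions: a path is a plain list of node ids, edges are (from, to) pairs, and node orientation is ignored.

```python
def edges_on_path(path_nodes, edges):
    """Mark an edge as on the path only if its endpoints are adjacent path steps."""
    consecutive = set(zip(path_nodes, path_nodes[1:]))
    return {edge: edge in consecutive for edge in edges}
```

With a path over nodes 1, 2 and 4, the deletion edge (1, 4) connects two path nodes but skips node 2 between them, so it is reported as off the path, while (1, 2) and (2, 4) are on it.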

Genotyping of paths using freebayes-like genotyping model

unitig mapping: Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed

I'm running vg map on some paired-end reads now, appears to be going smoothly.

I'd previously tried it on some @lh3 fermikit unitigs and hit the following:

$ time vg/vg map -f mlin_unitigs.mag.gz -d wg.index.k27e11 -t 32 -FX 1.9 >aln.gam
vg: path.cpp:510: vg::Path vg::merge_paths(const vg::Path&, const vg::Path&, int&, int&): Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed.
Aborted

real    0m12.258s
user    0m11.355s
sys     0m2.657s

Here are the first couple unitigs from mlin_unitigs.mag.gz:

@374566:656953435       108     .       1079017880,56;1596667180,64;2213272247,65;2213272249,65;
AAAAAAAGAAAGAAAAAGAGAGAGAGAAAATAAAAGAAAATTAATATCATTGGCTGTTTTTAAGTTCATCTTTCCCTCCTCTGTCATCTCACAGGTATTAGTAAGAACCGCTGTTACACTGCGTGCCACACTGAATTTCAACTATCCCTCTATCTGCTTTGTCTTCTCTCCCAGCCAGTAAGCTACTAAATGATTTTGGATGAATAAATAAACATCTAGGAATGGGAAAGAGAGCAAAATTGAACAAATAGTAATGAATTAGAGTAATCCTTTAAAAGGTGGAAATTATTGGAACAGATATGCAGTTTAAATAAGTTGCAGACTAAGATAGCAGCATAAAACATACAGGAATATGGCCGGGCGCGGTGGCTCAAGCCTGTAATCCCAGCACTTTGG
+
'''''((()*****++++++++,,,-..///////012234556667777888899:::;;;<<<<<<=====>>>>??@@ABBCCCDDDEEDEEEEEEFFFFFFEEEDCCCCDCDDEEEEEDEEDCCBBBBCCDCBAA@?>>===<<<<;;;<;;;;;::;::;<<<;<<<<;;;<;;:::9988887766667776665677888889::::::;::;<<==>>???@@A@@?@ABBBCDEEEEEEEEFGGGHHHIIJJIIIHGGHHGGGGGGGFGGGFFFGGGGFFFFFFFFGFGGFFFEEDEEEFEDEEDEDCDDCBCBA@@???>>>>>>=<;;;:988888888766544322111111110000000000////..----------,+*
@2445285:1581873033     5       1497293684,98;1884040499,93;    189839833,99;434191570,89;1562090913,91;1633644090,82;
AAAAAAAAAAAGTCATGGGAGAGGATGGTAAAGCTAAGTATCTTTTGCACCTACTCCCCAGCCCCACCACTGCAGAAGCTGAAGGGGTTCCTAGAGGCTTCTTCTGCC
+
""###$%%&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&%%$$$#""

Travis is not happy - should I wait?

Hi Erik,

Below is a link to the Travis report:

https://travis-ci.org/ekg/vg

I assume you already know about this. Should I wait for a fix?

Thanks,
~p

Installation-from-source issues on OSX

Hi there, I've hit a few snags trying to install from source on OSX 10.10.2. All of my cmds / output are in this gist but I'll link to specific files / sections within it below:

git clone --recursive, cd vg, make:
- option -s is obsolete and being ignored warning and error
- this is freebayes#83
fix by commenting out the ,-s in vcfutils/smithwaterman/Makefile
make again
- Error: ar: smithwaterman/sw.o: No such file or directory
- turns out the previously interrupted make has left some inconsistent state
make clean
- Mostly works, but fails in snappy module and exits with an error code
make again
- Undefined symbols for architecture x86_64: related to google::protobuf things in pb2json

That's as far as I can get right now.

I've installed "real" gcc/g++ via brew and they are the only ones on my path. I did this due to earlier errors like:

In file included from vg.cpp:1:
./vg.hpp:9:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^
1 error generated.

Some googling led me to believe that this was the result of using clang, cf. this SO answer.

When running into the pb2json linker errors above, I thought they might be due to my having run brew install protobuf-c when I was still using clang, so I brew remove'd and re-brew install'd it, and that actually got me further, to errors in index.cpp:

 Tue 19:41:36  ryan@mbp: vg:master$ make
g++ -std=c++11 -fopenmp -g  -O3 -c -o vg.o vg.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o cpp/vg.pb.o cpp/vg.pb.cc -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o main.o main.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o index.o index.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
index.cpp: In member function 'const string vg::Index::key_for_node(int64_t)':
index.cpp:119:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_from_to(int64_t, int64_t)':
index.cpp:133:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_to_from(int64_t, int64_t)':
index.cpp:151:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_kmer(const string&, int64_t)':
index.cpp:168:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_node_path(int64_t, int64_t, int64_t)':
index.cpp:182:30: error: 'htobe64' was not declared in this scope
     node_id = htobe64(node_id);
                              ^
index.cpp: In member function 'const string vg::Index::key_for_path_position(int64_t, int64_t, int64_t)':
index.cpp:202:30: error: 'htobe64' was not declared in this scope

For unknown reasons, I can't even get back to this point after having removed everything and started over; I'm currently stuck at the aforementioned pb2json linker errors.

Just putting all of this here in case people have ideas / anyone else hits the same issues.

properly handle pairing information in surjection

This is now included in the GAM stream but not handled by surject.

Genotype likelihood generation

(given a source and sink, genotype paths)

update mapping algorithm

A big performance win lies at the other end of a small optimization. If we can guess what the best mapping target is on the basis of kmer matches, we can try to run fewer gssw alignments. If this can happen we can run a lot quicker when mapping.

One idea would be to measure the informativeness of each kmer, and then build a conditional entropy metric for each mapping target. We want to evaluate the mapping targets where our kmer hits are rare, and where there are many hits, before mapping against a huge complex of lower quality hits.
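
As a starting point, here is a sketch of ranking targets by kmer informativeness. It swaps the conditional entropy metric for a simpler self-information score: a kmer that hits n of N candidate targets contributes log2(N / n) bits to each target it hits. The input layout (a dict from kmer to the set of target ids it hits) is hypothetical.

```python
import math
from collections import defaultdict


def rank_targets(kmer_hits):
    """Order candidate mapping targets so the most informative ones come first."""
    all_targets = set()
    for targets in kmer_hits.values():
        all_targets.update(targets)
    total = len(all_targets)

    scores = defaultdict(float)
    for targets in kmer_hits.values():
        if not targets:
            continue
        # Rare kmers carry more bits; a kmer hitting every target carries none.
        bits = math.log2(total / len(targets))
        for target in targets:
            scores[target] += bits
    # Highest score first; ties are broken by target id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Targets at the top of the list would be aligned with gssw first, and alignment could stop early once a good enough hit is found.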

Defining personal genomes on a variation graph

Hi,
For some analyses, like allele-specific expression, one wants to have a personal version of the reference, that is, the reference modified to represent the known sequence of an individual. It seems this would be trivial with variation graphs if one has a set of variants for which an individual is hom/het. Question: is it possible to align the reads against a subset of the paths of the graph? For example, could one define a subset of the graph by removing paths for alleles not present in a personal genome (it is hom at the site) and keeping only paths for het sites?
Thanks in advance

mapping qualities

We need mapping qualities. They can perhaps be estimated by looking at the size of the kmer space of the read vs. the kmer space of the entire index. A more nuanced approach is to build a ML model that calculates mapping accuracy given some parameters we can extract from the alignment (akin to Mosaik). But simple may be enough in this case.

Properly handle pairs in alignment

path inclusion should actually include the named path

vg mod should use the alignment name as the path name in the graph. This is required for building up graphs from alignments.

Improve "cannot setRegion" error message / failure mode

I am running vg and getting this error:

$ ./vg construct -r hg19.fasta -v NA12878.vcf
cannot setRegion on a non-tabix indexed file

Help debugging would be appreciated.
A more descriptive error message would also be helpful here.

I see that it comes from vcflib but that's all I've got.
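
The message suggests the VCF is expected to be compressed and tabix-indexed (typically produced with bgzip followed by tabix -p vcf; that this is the cause here is an assumption on my part). A pre-flight check of the kind that could back a more descriptive error is sketched below. The function name is hypothetical, and the check cannot tell plain gzip from bgzip.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip or bgzip file


def vcf_index_problems(vcf_path):
    """Return human-readable reasons why region queries on this VCF may fail."""
    problems = []
    if not vcf_path.endswith(".gz"):
        problems.append("file is not compressed (expected a bgzip-compressed .vcf.gz)")
    else:
        with open(vcf_path, "rb") as handle:
            if handle.read(2) != GZIP_MAGIC:
                problems.append("file does not start with the gzip magic bytes")
    if not os.path.exists(vcf_path + ".tbi"):
        problems.append("no tabix index (.tbi) found next to the file")
    return problems
```

An empty list means the basic preconditions look satisfied. Otherwise the reasons can be printed in place of the bare "cannot setRegion" message.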

paired-end reads

Is there any way to keep the separation distance of paired-end reads or is that information lost in graph format?

use many alignments to modify a graph

Due to the destructive nature of the changes on the internal representation of the graph, it's not trivial to take many graph alignments and include them in the graph. However, it should be possible to do with a little care, and in particular is not hard if we can handle all the mappings to a single node at once. This means sorting the changes we want to add up-front, modifying the graph in one step for each node.
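
The "sort up-front, then modify each node once" idea might be sketched as follows. The data layout is hypothetical: each alignment is reduced to a list of (node_id, offset) breakpoints where the node needs to be cut, and a node is just its sequence string.

```python
from collections import defaultdict


def group_breakpoints(alignments):
    """Collect every cut position per node across all alignments."""
    by_node = defaultdict(set)
    for breakpoints in alignments:
        for node_id, offset in breakpoints:
            by_node[node_id].add(offset)
    return {node_id: sorted(offsets) for node_id, offsets in by_node.items()}


def split_node(sequence, offsets):
    """Cut one node's sequence at all of its sorted breakpoints in a single step."""
    inner = [o for o in offsets if 0 < o < len(sequence)]
    cuts = [0] + inner + [len(sequence)]
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]
```

Because each node is visited once with its full set of breakpoints, later alignments never refer to a node that an earlier edit has already split.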

Examples of using `vg view` / generating dot files

There don't seem to be any, from what I can see.

It seems like the default behavior emits GFA.

Assertion `table' failed in sparsehash

I just got this very weird assertion error after completing whole-genome kmer indexing:

vg.hgi: sparsehash/build/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const std::pair<long int, long int>, vg::Edge*>; Key = std::pair<long int, long int>; HashFcn = std::hash<std::pair<long int, long int> >; ExtractKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SelectKey; SetKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SetKey; EqualKey = std::equal_to<std::pair<long int, long int> >; Alloc = google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.

I'm trying to reproduce it on a smaller set. It'd be nice to know what's going on.

Protobuf "info" map fields take too much memory

Every VG protobuf message has a "map<string, Info> info" field. On my system, this field takes 88 bytes in memory when empty, which amounts to more than half of the in-memory protobuf Graph representation. It probably also matches a good chunk of the memory used by the various hash_map indexes (if we assume 2 * 8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of it if possible, or moving it into Graph-level maps - ie instead of having an info record per Node, have a map in Graph that maps node id's to info records. This seems to me to be a low-hanging-fruit way to cut memory usage by about half. Something I can do, but would like feedback on as I don't really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info, ::google::protobuf::internal::WireFormatLite::TYPE_STRING, ::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144
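
The "more than half" figure follows directly from the sizes above. Here is the arithmetic as a small script, using only the numbers quoted in this issue:

```python
INFO_FIELD_BYTES = 88   # empty map<string, Info> field
NODE_BYTES = 152        # sizeof(vg::Node)
EDGE_BYTES = 144        # sizeof(vg::Edge)


def info_share(message_bytes, info_bytes=INFO_FIELD_BYTES):
    """Fraction of a message's in-memory size taken by the empty info field."""
    return info_bytes / message_bytes


node_share = info_share(NODE_BYTES)   # roughly 0.58
edge_share = info_share(EDGE_BYTES)   # roughly 0.61
dense_map_entry_bytes = 2 * 8 * 1.78  # 28.48 bytes per entry, per the estimate above
```

So the empty info field accounts for about 58% of each Node and 61% of each Edge.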

thanks
-Glenn

Perceptual DNA hashing to reduce index memory usage

cf. A perceptual hash function to store and retrieve large scale DNA sequences

GFA output is not really GFA

It's come to my attention that vg is not outputting correct GFA format.

That said, it's not yet clear to me whether GFA can support the vg graphs that @adamnovak's recent commits enable, as such graphs have four kinds of edges.

Any suggestions about how to deal with this would be welcome. I think GFA is a useful format as it enables text processing of graphs on the command line.
