Coder Social home page Coder Social logo

lh3 / minigraph Goto Github PK

View Code? Open in Web Editor NEW
397.0 28.0 38.0 950 KB

Sequence-to-graph mapper and graph generator

Home Page: https://lh3.github.io/minigraph

License: MIT License

C 68.30% Makefile 0.62% Roff 1.70% JavaScript 12.15% TeX 14.25% Gnuplot 2.85% Shell 0.13%
bioinformatics genomics sequence-alignment genome-graph pan-genome

minigraph's Introduction

Getting Started

git clone https://github.com/lh3/minigraph
cd minigraph && make
# Map sequence to sequence, similar to minimap2 without base alignment
./minigraph test/MT-human.fa test/MT-orangA.fa > out.paf
# Map sequence to graph
./minigraph test/MT.gfa test/MT-orangA.fa > out.gaf
# Incremental graph generation (-l10k necessary for this toy example)
./minigraph -cxggs -l10k test/MT.gfa test/MT-chimp.fa test/MT-orangA.fa > out.gfa
# Call per-sample path in each bubble/variation (-c not needed for this)
./minigraph -xasm -l10k --call test/MT.gfa test/MT-orangA.fa > orangA.call.bed

# Extract localized structural variations
gfatools bubble out.gfa > SV.bed

# Generate human MHC graph and call SVs jointly (~10 min)
curl -sL https://zenodo.org/record/8245267/files/mg-cookbook-v1_x64-linux.tar.bz2?download=1 | tar -jxf -
cd mg-cookbook-v1_x64-linux && ./00run.sh

Table of Contents

Introduction

Minigraph is a sequence-to-graph mapper and graph constructor. For graph generation, it aligns a query sequence against a sequence graph and incrementally augments an existing graph with long query subsequences diverged from the graph. The figure on the right briefly explains the procedure.

Minigraph borrows ideas and code from minimap2. It is fairly efficient and can construct a graph from 90 human assemblies in a couple of days using 24 CPU cores. Older versions of minigraph was unable to produce base alignment. The latest version can. Please add option -c for graph generation as it generally improves the quality of graphs.

Users' Guide

Installation

To install minigraph, type make in the source code directory. The only non-standard dependency is zlib. For better performance, it is recommended to compile with recent compliers.

Sequence-to-graph mapping

To map sequences against a graph, you should prepare the graph in the GFA format, or preferrably the rGFA format. If you don't have a graph, you can generate a graph from multiple samples (see the Graph generation section below). The typical command line for mapping is

minigraph -cx lr graph.gfa query.fa > out.gaf

You may choose the right preset option -x according to input. Minigraph output mappings in the GAF format, which is a strict superset of the PAF format. The only visual difference between GAF and PAF is that the 6th column in GAF may encode a graph path like >MT_human:0-4001<MT_orang:3426-3927 instead of a contig/chromosome name.

The minigraph GFA parser seamlessly parses FASTA and converts it to GFA internally, so you can also provide sequences in FASTA as the reference. In this case, minigraph will behave like minimap2, though likely producing different alignments due to differences between the two implementations.

Graph generation

The following command-line generates a graph in rGFA:

minigraph -cxggs -t16 ref.fa sample1.fa sample2.fa > out.gfa

which is equivalent to

minigraph -cxggs -t16 ref.fa sample1.fa > sample1.gfa
minigraph -cxggs -t16 sample1.gfa sample2.fa > out.gfa

File ref.fa is typically the reference genome (e.g. GRCh38 for human). It can also be replaced by a graph in rGFA. Minigraph assumes sample1.fa to be the whole-genome assembly of an individual. This is an important assumption: minigraph only considers 1-to-1 orthogonal regions between the graph and the individual FASTA. If you use raw reads or put multiple individual genomes in one file, minigraph will filter out most alignments as they cover the input graph multiple times.

The output rGFA can be converted to a FASTA file with gfatools:

gfatools gfa2fa -s graph.gfa > out.stable.fa

The output out.stable.fa will always include the initial reference ref.fa and may additionally add new segments diverged from the initial reference.

Calling structural variations

A minigraph graph is composed of chains of bubbles with the reference as the backbone. Each bubble represents a structural variation. It can be multi-allelic if there are multiple paths through the bubble. You can extract these bubbles with

gfatools bubble graph.gfa > var.bed

The output is a BED-like file. The first three columns give the position of a bubble/variation and the rest of columns are:

  • (4) # GFA segments in the bubble including the source and the sink of the bubble
  • (5) # all possible paths through the bubble (not all paths present in input samples)
  • (6) 1 if the bubble involves an inversion; 0 otherwise
  • (7) length of the shortest path (i.e. allele) through the bubble
  • (8) length of the longest path/allele through the bubble
  • (9-11) please ignore
  • (12) list of segments in the bubble; first for the source and last for the sink
  • (13) sequence of the shortest path (* if zero length)
  • (14) sequence of the longest path (NB: it may not be present in the input samples)

Given an assembly, you can find the path/allele of this assembly in each bubble with

minigraph -cxasm --call -t16 graph.gfa sample-asm.fa > sample.bed

On each line in the BED-like output, the last colon separated field gives the alignment path through the bubble, the path length in the graph, the mapping strand of sample contig, the contig name, the approximate contig start and contig end. The number of lines in the file is the same as the number of lines in the output of gfatools bubble.

SV calling showcase (human MHC)

The following example generates a graph for 61 humam MHC haplotypes and calls SVs from them. Primary sequences are retrieved from an AGC archive.

# Obtain cookbook data and precompiled binaries
curl -sL https://zenodo.org/record/8245267/files/mg-cookbook-v1_x64-linux.tar.bz2?download=1 | tar -jxf -
cd mg-cookbook-v1_x64-linux

# Generate graph. This takes ~7 minutes.
./agc listset MHC-61.agc | awk '!/GRC/{a=a" <(./agc getset MHC-61.agc "$1")"}END{print "./minigraph -cxggs <(./agc getset MHC-61.agc MHC-00GRCh38)"a}' | bash > MHC-61.gfa 2> MHC-61.gfa.log

# Call SVs per sample. This takes a couple of minutes.
./agc listset MHC-61.agc | xargs -i echo ./minigraph -cxasm --call -t1 MHC-61.gfa '<(./agc getset MHC-61.agc {})' \> {}.bed 2\> {}.bed.log | parallel -j16

# Merge per-sample calls and generate VCF. `-r0` indicates the reference sample.
paste *.bed | ./k8 mgutils.js merge -s <(./agc listset MHC-61.agc) - | gzip > MHC-61.sv.bed.gz
./k8 mgutils-es6.js merge2vcf -r0 MHC-61.sv.bed.gz > MHC-61.sv.vcf

In this example, the GRCh38 haplotype is named "MHC-00GRCh38" in the AGC archive and is taken as the reference. The awk command line generates a command line that retrieves each haplotype on the fly and feeds it to minigraph. misc/mgutils.js merge combines per-sample calls and generates a merged BED file. The final misc/mgutils-es6.js merge2vcf derives a VCF file. This script requires the latest k8 JavaScript runtime.

Prebuilt graphs

Prebuilt human graphs in the rGFA format can be found at Zenodo.

Algorithm overview

In the following, minigraph command line options have a dash ahead and are highlighted in bold. The description may help to tune minigraph parameters.

  1. Read all reference bases, extract (-k,-w)-minimizers and index them in a hash table.

  2. Read -K [=500M] query bases in the mapping mode, or read all query bases in the graph construction mode. For each query sequence, do step 3 through 5:

  3. Find colinear minimizer chains using the minimap2 algorithm, assuming segments in the graph are disconnected. These are called linear chains.

  4. Perform another round of chaining, taking each linear chain as an anchor. For a pair of linear chains, minigraph tries to connect them by doing graph wavefront alignment algorithm (GWFA). If minigraph fails to find an alignment within an edit distance threshold, it will find up to 15 shortest paths between the two linear chains and chooses the path of length closest to the distance on the query sequence. Chains found at this step are called graph chains.

  5. Identify primary chains and estimate mapping quality with a method similar to the one used in minimap2. Perform base alignment.

  6. In the graph construction mode, collect all mappings longer than -d [=10k] and keep their query and graph segment intervals in two lists, respectively.

  7. For each mapping longer than -l [=100k], finds poorly aligned regions. A region is filtered if it overlaps two or more intervals collected at step 6.

  8. Insert the remaining poorly aligned regions into the input graph. This constructs a new graph.

Limitations

  • A complex minigraph subgraph is often suboptimal and may vary with the order of input samples. It may not represent the evolution history or the functional relevance at the locus. Please do not overinterpret complex subgraphs. If you are interested in a particular subgraph, it is recommended to extract the input contig subsequences involved in the subgraph with the --call option and manually curated the results.

  • Minigraph needs to find strong colinear chains first. For a graph consisting of many short segments (e.g. one generated from rare SNPs in large populations), minigraph will fail to map query sequences.

  • The base alignment in the current version of minigraph is slow for species of high diversity.

minigraph's People

Contributors

kojix2 avatar lh3 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

minigraph's Issues

Fragmented back-mapping in a 2-sample graph

Hello Heng,
After building a graph from two high quality assemblies I am in the process of tracing the paths through the graph. When mapping back the original sequences the first one, which is my reference, maps back quite nicely, giving me a single path retracing the entire chromosome. The second assembly, however, ends up mapping in a very fragmented way. When looking at the gaf file I notice cases where a much longer alignment is apparent but the program breaks it up.

Below is an example. The first mapping terminates at segment s485 and the second mapping begins with it at almost the exact position on the query. The start position on the second path is not 1 but 388910, which suggests that perhaps there is a large deletion on the query sequence. Is this the reason for the split? Following it there are two additional short and weak hits which are also labeled as Primary, but I'm not sure they are relevant to the split.

l used the following parameters: -x asm --vc --min-cov-blen=10000. Could you please help me better understand this behavior?


smpl_Per1.pseudomolecule_1      64599618        19670355        21354424        +       >s447>s448>s449>s2259>s451>s2260>s453>s2261>s455>s2262>s457>s2263>s458>s460>s2264>s461>s463>s2265>s465>s2266>s466>s468>s470>s472>s474>s2267>s476>s478>s479>s480>s481>s482>s2270>s484>s485       2405973   184122  2015775 1176009 1998796 60      tp:A:P  cm:i:200540     s1:i:765412     s2:i:21952      dv:f:0.0216
smpl_Per1.pseudomolecule_1      64599618        21354444        22017736        +       >s485>s487>s2877>s489>s490>s2879>s492>s2880>s494>s2881>s495>s497>s499>s501>s503>s2882>s505>s2883>s506>s2884>s508>s2885>s510>s512>s2886>s514>s2887>s515>s2888>s517>s519>s2889>s521>s2890>s523    1434328   388910  1181586 599124  795822  60      tp:A:P  cm:i:101971     s1:i:483101     s2:i:1766       dv:f:0.0084

program Aborted

Hi, I compile the minigraph correctly but when I run it as followling command:

$MINIGRAPH/minigraph -xggs -t4 SoyC01.genome.fasta SoyC02.genome.fasta SoyC03.genome.fasta > out.gfa

I met an error like that:

[M::main::21.3950.12] loaded the graph from "SoyC01.genome.fasta"
[M::mg_index::57.748
0.81] indexed the graph
[M::mg_opt_update::59.4300.81] occ_weight=20, occ_max1=250; 95 percentile: 3
[M::ggen_map::79.927
0.64] loaded file "SoyC02.genome.fasta"
[M::ggen_map::1479.4133.75] mapped 59 sequence(s) to the graph
[M::mg_ggsimple::1484.693
3.74] inserted 1275 events, including 0 inversions
[M::mg_index::1519.0763.68] indexed the graph
[M::mg_opt_update::1520.781
3.68] occ_weight=20, occ_max1=250; 95 percentile: 3
[M::ggen_map::1540.8603.63] loaded file "SoyC03.genome.fasta"
[M::ggen_map::3275.153
3.73] mapped 41 sequence(s) to the graph
minigraph: ggsimple.c:77: mg_ggsimple: Assertion `rs >= 0 && re > rs && re < g->seg[lc->v>>1].len' failed.
/gss1/home/hjb20181119/.lsbatch/1606288350.865135.shell: line 8: 11876 Aborted (core dumped) $MINIGRAPH/minigraph -xggs -t4 SoyC01.genome.fasta SoyC02.genome.fasta SoyC03.genome.fas
ta > out.gfa

could you help me fix the problem? Thank you very much . My genome size is about 1Gb.

Feature Request - Reference Path

Dear Heng,

would it be possible to add the choosen reference as path to the final GFA? Makes it easier to use it with other tools e.g., vg.

Bests,
Felix

Coordinates of arbitrary sequence in graph

Hi, I have built recently a graph of 10 genome assemblies with:

minigraph -xggs -t 10 oryza_sativa_reference.fna fasta/*.fna -o oryza_sativa.gfa

Is there a way to find the coordinates of any arbitrary sequence from one of those assemblies in the graph without remapping it?
Thanks for your help,
Bruno

Few nodes = poor alignment or good ?

Hi Heng,
When building a graph of two fairly divergent chromosomes (snp rate ~1.5%) I get very few nodes in the graph. In my naive understanding this could mean two opposite things - either the genomes aligned very well with little divergence at the structural level, OR that there were very few alignments/chains detected in the first place. When I align the chromosomes with minimap2 the alignments cover ~95% of the chromosomes, if that's of any help. Is there a way to have minigraph show what portion of the sequence did not get aligned or some other way to troubleshoot this?

Here are the stats for the graph:

Number of segments: 13
Number of links: 18
Number of arcs: 36
Max rank: 1
Total segment length: 30091832
Average segment length: 2314756.308
Sum of rank-0 segment lengths: 30091711
Max degree: 2
Average degree: 1.385
[M::main] Version: 0.4-r214-dirty

(The total segment length is close to the size of one of the chromosomes)

Thanks!

Interpretation of "." in --call output

Hello,
After calling --call on a sample, there are multiple lines with the : separated output as mentioned in the documentation. However, there are also occasionally lines where the output is for example

chr4	181668	181723	>s6	>s8	.

It looks like this is the "default" output unless p->t is set here, presumably when a some match is found. Is the correct interpretation then that a . means no segment of the called sample matches any part of that bubble or its surroundings, hence not called as a deletion (*)?

How to visualize constructed reference graph

I have constructed a reference graph from 11 different microbe strains, and want to inspect structural variations via some visualization tools, is there any such tool fit for this? Thanks.

How should we interpret `SR` of the `L` line in rGFA

Hello Heng,

While I'm playing with the example assemblies on the FTP site, I noticed that (all) the L-lines in the rGFA file also have the SR tag, which isn't a required tag according to the rGFA format (only the S-Lines are required to).

So how should we interpret them?

Thanks!
Steve

Does minigraph build genome graph including structural variants from a SV caller?

Hello,
I'm also keen to integrate into the genome graph the structural variability detected by Sv callers (from short reads and/or long-reads). I was wondering if minigraph handles that?
Those are often represented in a vcf which some programs can integrate into a genome-graph. I am using vg at the moment (https://github.com/vgteam/vg) which is fast at constructing the graph but slow at mapping. Hence, we are super interested in using minigraph, as it seems to be very efficient (and I remember the efficiency of minimap!). I may also have a second reference assembly so the idea in the end would be to build a graph integrating several assemblies and local SV;
Thank you for your help and attention,
Claire

Typo?

minigraph -x lr graph.gfa query.fa > out.gaf

should be:

minigraph -x lr graph.gfa query.fa > out.gfa

in the readme?

Request add P line

Hello Li,
The minigaraph is very excellent software to build the pangenome graph, but the output GFA format is not be accepted by vg, so can you consider to add the function for the GFA format, such as add P line H line etc.

Best wishes~

empty gfa file

Hi,I am trying to create a graph file using multiple files. I used the following command but got an empty gfa output. Could you provide any suggestions? Thanks

minigraph -xggs -t16 reference.fa sample1.fa sample2.fa sample3.fa sample4.fa sample5.fa sample6.fa> out.gfa

segmentation fault

Hi,

When mapping sequences (from nanopore reads) to a graph (~ 5.2M) I get no output an segmentation fault. Works OK when I run the test - is there a limit on the size of graph/do I need to re-compile with different settings?

How to get the GFA format not default rGFA?

Hello Li,
I have used the minigraph to build the pangenome graphs, but get the rGFA format is not standard GFA format, so couldn't be used the input file of vg. How to get the GFA format when we use the minigraph to build the pangenome graph?

Best wishes~

extract non-reference sequence from the var.bed

Hi, @lh3

We already have constructed a graph from 46 plant genomes. Some complex subgraph from the minigraph really impressive, but it need careful inspection since the vast repeat content in the plant genome.

So We want to focus on the infomation from the non-reference insertion sequence in the graph. Since we have the var.bed from the gfatools bubble and reference calls from the minigraph -t 64 -xasm --call G46.gfa reference.fasta > ref.bed,can I extract non-reference insestion sequence in the gfa? If it is practial, how can I find the insertion from the gfa? Non-reference can use the "SN:Z" to distinctied with the reference.

[question] Which preset for indiviual assembly gene mapping to minigraph gfa?

Hi, @lh3

We have build a pangenome graph with minigraph. We also do gene prediction for each indiviual assembly. Since graph gene annotation is not avaliable , so we choose to retrive each gene fasta (utr+exon+intron) for mapping to graph. minigraph have two preset -x asm and -x lr, which perset do you recommend for those gene mapping?

Should GAF have a similar concept to secondary alignment from SAM?

Hello Heng,

As I was experimenting with the GAF format, I run into the following two mappings from the same read (GIAB HG002 15kb + 20kb CCS reads from their FTP site, mapped to GRCh38-freeze1 rGFA from your FTP site)

m64011_190830_220126/157353505/ccs	20538	8032	20531	+	<chr1:73847000-73939836<chr1:73717356-73846749	222229	8102593530	12002	12508	60	tp:A:P	cm:i:2149	s1:i:11931	s2:i:646	dv:f:0.0022
m64011_190830_220126/157353505/ccs	20538	2	7857	-	chr11	135086622	103201000	103208858	73457861	60	tp:A:P	cm:i:1315	s1:i:7294	s2:i:0	dv:f:0.0037

Notice they are both tp:A:P (not a bug, is it?), so I am wondering should there be a concept in GAF that is similar to the secondary alignment in SAM (i.e. flag 256)? It's immediately clear what that would mean though.

Note: there's a 251-bp HET deletion at this site (chr1) from HG002 relative to GRCh38.

Thanks
Steve

Could Make create a library to link against?

Hi,

Could the Makefile for this project be modified to create a library file, similar to minimap2?
I am trying to compile and link against minigraph.

And could this be done in a way that does not generate duplicate symbol errors when linked together with minimap2 in the same project? I think these are great libraries and several people will try to use minimap2 and minigraph code in the future.

Thank you,
Cristian

A 1 mb bubble - fact or fiction?

Hello Heng,
In the graph I made from two complete haplomes of a large plant genome (basically, two sets of chromosomes with a fairly high confidence in the assembly), I see a few very large "bubbles", one as large as 1mb on the longer segment and 30kb on the shorter. According to the minigraph paper, it "only considers SVs of 100 bp–100 kb in length". Does this mean that the 1mb divergence is likely an artifact or are there scenarios where the algorithm can detect collinearity over a gap this large?

Thanks!

Question - Different ranks

Dear Heng,

Could you comment one the following minigraph warnings:

[W] stable sequence 'flattened_line_31558' associated with different ranks on segment 's298030': 15 != 22

Does this indicate deletions?

[W::mg_ggsimple] query interval flattened_line_61364:1069-1069 is not covered

This just means that a target did not get any query alignment?

Bests,
Felix

transcriptome graphs

Dear Dr Li,

Thank you for yet another great tool. It seems to me that the idea of the genome graphs could also be applied to isoforms in a transcriptome graph. I was thinking of using polished IsoSeq data, cluster isoforms based on genome alignment and then generate individual gene-specific grahs and then concatenate them to form a single transcriptome graph. Do you think minigraph could be used in this fashion?
I would be worried about full transcriptome-transcriptome alignment because of the high amount of partial alignments between paralogous genes. I think a transcriptome graph may be useful to increase mapping efficiency of RNA-seq reads and make it easier to quantify and compare isoform transcription levels.
I would love to hear your opinion and recommendations on this approach.

Kind regards,

Juan D. Montenegro

Graph generation (-xggs) seems to be skipping some homozygous events

I have 2 sequences, namely A and B, from an assembly of a human sample. The 2 sequences are very similar, and both contain an insertion of ~2150 bps, with breakpoints being mapped to the same location on GRCh38.

I tried to add these insertions to my reference genome graph by feeding the fasta file of the A and B to minigraph, expecting at least 1 of the 2 insertions will be added to the GFA output. However it was not the case:

$ minigraph -t8 -xggs ref.gfa.gz A_and_B.fasta > AB_ref.gfa.gz
[M::main::14.4911.00] loaded the graph from "ref.gfa.gz"
[M::mg_index::71.894
1.48] indexed the graph
[M::mg_opt_update::76.3591.45] occ_weight=20, occ_max1=138; 95 percentile: 2
[M::ggen_map::76.376
1.45] loaded file "A_and_B.fasta"
[M::ggen_map::76.4221.45] mapped 2 sequence(s) to the graph
[M::mg_ggsimple::76.422
1.45] inserted 0 events
[M::main] Version: 0.5-r294-dirty
........

But the first insertion is successfully added if the A and B are separated into 2 query files:

$ minigraph -t8 -xggs ref.gfa.gz A.fasta B.fasta > A_then_B_ref.gfa.gz
[M::main::14.4951.00] loaded the graph from "ref.gfa.gz"
[M::mg_index::72.652
1.47] indexed the graph
[M::mg_opt_update::77.1531.45] occ_weight=20, occ_max1=138; 95 percentile: 2
[M::ggen_map::77.156
1.45] loaded file "A.fasta"
[M::ggen_map::77.1971.45] mapped 1 sequence(s) to the graph
[M::mg_ggsimple::78.418
1.44] inserted 1 events
[M::mg_index::137.8161.48] indexed the graph
[M::mg_opt_update::142.177
1.47] occ_weight=20, occ_max1=138; 95 percentile: 2
[M::ggen_map::142.1811.47] loaded file "B.fasta"
[M::ggen_map::142.201
1.47] mapped 1 sequence(s) to the graph
[M::mg_ggsimple::142.201*1.47] inserted 0 events
[M::main] Version: 0.5-r294-dirty
........

Is it a behavior by design?

how are ranks determined

Hello Heng,

I'm looking at the example 20-sample human rGFA you shared on the FTP site (ftp://ftp.dfci.harvard.edu/pub/hli/minigraph), and guessed that in general the rank of segments is the order in which the deriving-sample was added to the graph (ps. the readme is for 14 samples which I guess is slightly out of date).

However, I noticed that for samples NA12878, NA24385 and PG1, segments marked as derived from them (SN tag) have two ranks. For example, there are 4058 segments derived from NA12878 ranked as rank-4, and 2908 of rank-5 (no significant difference amongst their lengths, based on a quick glance).

So I'm wondering if this is desired and in general, where can we read more about the rank calculation.

Thanks!
Steve

Mapping high quality assemblies to rGFA

Hello,
I was able to build a fairly large graph from 12 complete and polished assemblies of individuals of a certain grass species (~270 mb genomes). I am now trying to map them back to the rGFA graph. Running minigraph with all defaults, the resulting GAF file has shorter alignments and fewer long paths than I would expect, so I'm suspecting that the default parameters might not be appropriate here. Would you be able to recommend a set of flags for a case like this?

The reference graph is quite fragmented, and here's the breakdown of the segment sizes:
#46629 total values totaling 279430451.0000. <5992.632289 +/- 24558.585829>
#Most likely bin: [ 0 - 100 ] 8549 counts
#Median bin: [ 400 - 500 ] 1502 counts

In the GAF file only 10% of the alignments are multi-segment paths. These paths add up to about half of the total genome size. Here's the breakdown of the aligned block sizes in the milti-segment paths:
#2584 total values totaling 152743151.0000. <59111.126548 +/- 73694.925338>
#Range: [ 180 - 589673 ]
#Most likely bin: [ 2000 - 3000 ] 83 counts
#Median bin: [ 32000 - 33000 ] 23 counts

Perhaps this is expected, but I'm wondering what can be tweaked to help increase the block size, thus capturing more and longer paths. I'd be happy to share the data with you if you're interested.

Regards

extract bubble unitigs from unitig.gfa

Hello Dr. Li,
I saw a usage of gfatools something like below:

Extract localized structural variations

gfatools bubble out.gfa > SV.bed

and something below:
Given an assembly, you can find the path/allele of this assembly in each bubble with
minigraph -xasm --call graph.gfa sample-asm.fa > sample.bed

I wonder if gfatools or minigraph has a function to extract bubble unitigs from "hifiasm.r_utg.noseq.gfa" or "hifiasm.p_utg.noseq.gfa".

Plus, does Minigraph or gfatools support the usage of converting a unitig.gfa to contig.gfa?
thanks very much,
zhenzhen

Question about the coordinates in bubble output

Dear Dr. Li,
When running minigraph --call, how are the positions of the "alleles" (bubble vertices) determined? I am assuming that the reference sequences are mapped to the graph, and that's how the coordinates are calculated. Is that correct?
Below, see an example where I queried the graph with the same sequence that was used as the reference sample during the graph construction phase (minigraph -xasm --call foo.gfa Bd21C1.fa)

Bd21C1  481920  481939  >s1     >s3     >s2:19:+:Bd21C1:481915:481941

Notice how the allele coordinate is different from the bubble position (presumably on the same sequence). Is that a mapping artifact?

Also, what happens when a sequence maps to the bubble in two or more places?

Thank you

[morecore] insufficient memory

Dear Heng,

I am trying to build a rgfa from two large genomes (>10Gb). They are both finished down to chromosomes. I get the following output:

[M::main::23.960*1.00] loaded the graph from "reference.fasta"
[M::mg_index::380.806*1.79] indexed the graph
[M::mg_opt_update::393.356*1.76] occ_weight=20, occ_max1=2182; 95 percentile: 7
[M::ggen_map::426.277*1.71] loaded file "query.fasta"
[morecore] insufficient memory

If I use contigs instead of chromosomes as query it works. Some chromosomes are >800Mb. Is that happening because minigraph reads all query bases in the graph construction mode?

I rerun the chromosome vs. chromosome attempt with --no-kalloc, as expected minigraph slows down but it does not break any more, at least until now.

Thx,
Felix

Does minigraph support mixed-case sequence?

Hello Heng,
My minigraph run is crashing on the construction of a 2-genome graph. In fact, it's my server that's killing it, which suggests a memory overrun. I've been able to build much larger graphs in the past (i.e. larger in terms of total sequence length), so I'm scratching my head here. The only odd thing about the sequence is that it has mixed case letters, so I wanted to double check that the program supports this.
Thanks!

unnecessary branches in rGFA ?

Thank you for developing this tool - it looks very promising!
I tried it on a three-genome set and got an rGFA file which I then converted to GFA2 and loaded into a visualize (I need the 'set' functionality of the graph to be able to be able to view different sample segments by color). What I'm seeing quite often are reference segments with edges only to other reference segments, and the offsets suggesting that the two segments immediately follow each other. If so, then what is the reason for splitting the stable reference segment at that position in the first place?

Below is an extract from the graph where you can observe this. Segments s8162 and s8163 are both from the Ref sample and are adjacent, according to the offset and lengh values. The edge between them is not contested by any other edge. I checked the original sequence, and there are not N's, gaps, or repeats in this region. Can you please explain?

S	s8161	*	LN:i:82910	SN:Z:Ref	SO:i:26052560	SR:i:0
S	s8162	*	LN:i:578	SN:Z:Ref	SO:i:26135470	SR:i:0
S	s8163	*	LN:i:71969	SN:Z:Ref	SO:i:26136048	SR:i:0
S	s8164	*	LN:i:139743	SN:Z:Ref	SO:i:26208017	SR:i:0
S	s13329	*	LN:i:86	SN:Z:Bd21_3	SO:i:25992879	SR:i:1
S	s14767	*	LN:i:171	SN:Z:Bd25	SO:i:20514505	SR:i:2
L	s8161	+	s13329	+	0M	SR:i:1	L1:i:82910	L2:i:86
L	s8161	+	s8162	+	0M	SR:i:0	L1:i:82910	L2:i:578
L	s8162	+	s8163	+	0M	SR:i:0	L1:i:578	L2:i:71969
L	s8163	+	s8164	+	0M	SR:i:0	L1:i:71969	L2:i:139743
L	s8163	+	s14767	+	0M	SR:i:2	L1:i:71969	L2:i:171

The command I used was the one from the readme: minigraph -xggs ref.fa sample1.fa sample2.fa

Thank you

Question about 1-1 mappings

Hi Heng,

I have a question about 1-1 mappings. If I build a graph with for example ref and sample1 and then I align sample1 back to the graph to get its path through the graph will minigraph report a path that does not "reuse" parts of the query?

I ask because I am looking a a couple of alignments where the path length in column 7 of the GAF is much longer than the query end - query start.

If this is not supposed to happen, let me know and I will put together a minimal test case.

Thanks!
Mitchell

Degenerate graph constructed

Hello!

I am trying to construct a graph using two very simple (artificial) sequences for testing purposes.

I tried several different cases (two sequences cases each):

    • A-B-C
    • A-D-C
      (middle block is substituted by another block, files seq{1,2}_2_substitution_mid.fa)
    • A-B-C
    • A---C
      (middle block is deleted, files seq{1,2}_2_mid_block_indel.fa)
      where A, B, C and D are randomly generated 3000 bp sequences.

I tried to run it with all parameters at default as well as tried to vary some of the parameters. I attached the resulting GFA files for case 1 (substitute of the middle block). The same result is for case 2.

The following commands were used for default and one case of specific parameter values:

minigraph -x ggs -t 16 ./seqFile_2_substitution_mid.files/seq1_2_substitution_mid.fa ./seqFile_2_substitution_mid.files/seq2_2_substitution_mid.fa > ./out/test2.gfa

minigraph -x ggs -L 10 -l 100 -d 100 -q 1 -k 15 -q 10 -K 10k -t 16 ./seqFile_2_substitution_mid.files/seq1_2_substitution_mid.fa ./seqFile_2_substitution_mid.files/seq2_2_substitution_mid.fa > ./out/test.gfa

The file minigraphGithubIssue.zip contains all mentioned files.

Is it something that I do not understand or do something wrong? Or is it a glitch of minigraph?

By the way, is there a way to also get Paths in resulting GFA? Or it can only be done through separate alignment later (through converting GAF file separately?

Thank you very much in advance.

GFA1 to rGFA

Hello,
is there a way to convert from a GFA1 generated through other algorithms (e.g. vg) to a rGFA?
Thank you in advance
Andrea

Establishing paths faithfully

Hi Heng,
Once the graph is constructed, one often needs to trace paths through it that correspond to the constituent samples. The only apparent way to do this in minigraph is to map the assemblies back onto the graph, which is what I've been doing. The accuracy of this depends heavily on the alignment criteria and the overall divergence between the samples, and so I've been struggling to get a faithful set of paths without leaving big chunks of the sample "unwalked". It seems like a more direct (and hopefully more accurate) way would be to encode the paths right at the time of the incremental graph generation, lifting them over to the next set of nodes as new samples are added. This is, of course, assuming that the mapping/chaining information is stores somewhere internally for each iteration. I'm sure you've considered this, and I'd love to hear your thoughts on whether this is a good/doable idea.

generating a vcf output

How about to add a module to generate a vcf output from Minigraph mapping or genotyping so it is compatible with other tools?

Question on allele

Hello,
I've generate a graph using minigraph, extracted the bubbles and then called the samples using the guidelines in the website.
I'd like to see which allele a genome carries in comparison with the graph.
Fow example, if I paste the results of gfatools bubble and minigraph --call, I get a line like this:

genome1.chr1      2659    2829    4       2       0       170     170     -1      -1      -1      s2184,s307773,s2185,s2186       gtcctagatatatttttactatttttatgatcattcttttttttaatttttaaaaaaattttaaatcctctattactcctttaattttcattttttataacctatttattgctttgcaaaaaaaacccctattttttaaagcaaacttcatatatatatatatatatatg      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa      genome1.chr1      2659    2829    >s2184  >s2186  >s2185:170:+:genome1.chr1:2657:2830

I can see that genome1 carries the allele in path 2185. Is there a way to know which of the two alleles is 2185, or do I have to go back to the graph?

Thank you in advance
Andrea

Other tools to contruct GFA from genome assemblies

Are there other tools which can generate genome graphs (GFA/rGFA format or other) de novo from genome assemblies like minigraph does? I am not talking about tools which use read alignment or VCFs for construction of graph. I want to use genome assemblies and generate the graph from them.

For a comparative study.

Question about anchor chaining algorithm

Hi Heng,

Thanks for sharing the minigraph preprint.
I've a question about your new anchor chaining algorithm described in the pre-print. I appreciate the use of RMQ data structure to consider a much larger set of anchors while building chains. I'm curious to know the rationale of using different gap costs for anchors in small v/s large windows.

Assuming the gap-cost function to be c_1 ·g_ji + c_2 ·d_ji (i'm ignoring the third log term: log2 g_ji for now), it should be possible to process anchors in both large and small window by using a RMQ-based algorithm as was described by Abouelhoda and Ohlebusch, 2005 (section 4.2.1). Their paper refers to the above cost function as "sum-of-pairs" gap cost model. Wondering if you also considered implementing the chaining algorithm that way? I think this would allow you to compute optimal chains in all scenarios. Although if the third log term plays an important role, then this won't work.

Also, did you observe any significant runtime (or memory) overhead using the new chaining algorithm of minigraph over chaining algorithm in minimap2?

Thanks in advance!

a bubble with all paths matching?

After making bubble calls I find occasional bubbles that don't make sense. The cases below are from a graph of only two sequences, and in both cases the reference and the alternative paths are the same. Is this due to a mis-mapping of the sequences back to the graph? If so, is there a way to remedy this with some parameter tweaking?

LimO7	24611971	24611971	>s159	>s160	*:0:+:LimO7:24611965:24611979	*:0:+:AurM7:24119755:24200162

LimO7	24699624	24700054	>s172	>s174	>s173:430:+:LimO7:24699615:24700060	>s173:430:-:AurM7:23642126:23701406

read mapping to sam?

Hello

Thank you for developing these fantastic tools!

I'm attempting to map short paired-end reads to a structural graph and call variants, similar to VG but I prefer how I can determine the sequence paths for bubbles etc. in your program, and I'm wondering if you've developed a way to produce a sam file for these mapped reads?

I was also looking to contrast the number of reads that mapped to the structural graph with the linear reference, using minimap2, and I'm seeing a lot more read pairs mapping to the linear reference than the graph, as well as reads mapping to different regions of the same reference. I'm curious why this may be.

minigraph -xsr -t16 all5.gfa $R1 $R2 > r12_all5.gaf
# with 13,527,026 reads mapped

# minimap2:
minnimap2 -xsr Ref.fasta $R1 $R2 > r12_single.paf
# with 27,968,534 reads mapped

#same reads mapping to different scaffolds:

$grep "HWI-D00256:413:C7N5GANXX:3:1101:2734:2086" CPB*20*paf

HWI-D00256:413:C7N5GANXX:3:1101:2734:2086	126	5	62	+	F_KS_tig00081383_RagTag_RagTag	19065987	1623340	1623397	42	57	12	tp:A:P	cm:i:2	s1:i:42	s2:i:0	rl:i:0
HWI-D00256:413:C7N5GANXX:3:1101:2734:2086	126	9	96	+	F_KS_tig00027608_RagTag_RagTag	3676404	2027258	2027345	67	87	0	tp:A:P	cm:i:5	s1:i:67	s2:i:67	rl:i:0

$ grep "HWI-D00256:413:C7N5GANXX:3:1101:2734:2086" CPB*20*gaf

HWI-D00256:413:C7N5GANXX:3:1101:2734:2086	252	5	96	+	>LI_2018_scaffold1025_size140985:6799-7281	482	89	180	84	91	60	tp:A:P	cm:i:7	s1:i:84	s2:i:0	dv:f:0.0295	ql:B:i,126,126
HWI-D00256:413:C7N5GANXX:3:1101:2734:2086	252	156	243	-	F_KS_tig00027608_RagTag_RagTag	3676404	1894488	1894575	67	87	0	tp:A:P	cm:i:6	s1:i:67	s2:i:67	dv:f:0.0368	ql:B:i,126,126

Runtime error: Assertion `k == n_a' failed

I recently tested minigraph (0.5-r294-dirty) on the 12 YPRP yeast genomes and was impressed by the speed and small output file size. I'm now trying to create graphs using a set of five public maize genomes, starting with analysis of separate chromosomes. The maize chromosomes range in length from 150-330 Mbp and are estimated to be up to 85% transposable elements.

I'm using the same parameters (-x ggs -t 10) that I used for the yeast genomes, but my first three tests with chromosomes 1, 2, & 3 each failed with error message:

minigraph: lchain.c:226: mg_update_anchors: Assertion `k == n_a' failed.

It appears to me that the error occurs while mapping the first or second non-reference sequence:

chr01 log

[M::main::1.046*0.99] loaded the graph from "Zm_B73_chr01.fna"
[M::mg_index::14.496*1.46] indexed the graph
[M::mg_opt_update::15.174*1.44] occ_weight=20, occ_max1=625; 95 percentile: 5
[M::ggen_map::16.640*1.40] loaded file "Zm_DK105_chr01.fna"
minigraph: lchain.c:226: mg_update_anchors: Assertion `k == n_a' failed.

chr02 log

[M::main::0.848*0.98] loaded the graph from "Zm_B73_chr02.fna"
[M::mg_index::12.921*1.42] indexed the graph
[M::mg_opt_update::13.435*1.40] occ_weight=20, occ_max1=553; 95 percentile: 5
[M::ggen_map::14.641*1.37] loaded file "Zm_DK105_chr02.fna"
minigraph: lchain.c:226: mg_update_anchors: Assertion `k == n_a' failed.

chr03 log

[M::main::0.808*0.98] loaded the graph from "Zm_B73_chr03.fna"
[M::mg_index::11.379*1.56] indexed the graph
[M::mg_opt_update::11.884*1.53] occ_weight=20, occ_max1=535; 95 percentile: 5
[M::ggen_map::13.075*1.48] loaded file "Zm_DK105_chr03.fna"
[M::ggen_map::80176.139*1.00] mapped 1 sequence(s) to the graph
[M::mg_ggsimple::80176.781*1.00] inserted 1332 events
[M::mg_index::80186.833*1.00] indexed the graph
[M::mg_opt_update::80187.546*1.00] occ_weight=20, occ_max1=544; 95 percentile: 5
[M::ggen_map::80191.878*1.00] loaded file "Zm_EP1_chr03.fna"
minigraph: lchain.c:226: mg_update_anchors: Assertion `k == n_a' failed.

Beyond this, it's not obvious to me from the error message what the exact problem is, so I'm not sure how best to modify the parameters. Any insights or suggestions would be appreciated!

gfa not finishing (stops at fasta indexing?)

Hi Heng,

I have some transcripts as FASTA that I'm trying to graph relative to their reference (unspliced).

When I put in ./minigraph -xggs ref.fa multifasta.fa > out.gfa, the output is just the indexed fasta.

I tried splitting the multifasta.fa into single fasta, and moving to directory. No luck.

Using minigraph version 0.15.

The sample files also don't work.

Meaning of optional tag fields in GAF format

Hi @lh3
Thank for developing and maintaining minigraph. I believe it has a lot of potential.

Would you be able to comment on the meaning of the optional tags found at the end of each line in the GAF format ?

For example:
tp:A:P cm:i:541394 s1:i:3239312 s2:i:7646 dv:f:0.0159

I could not find the answer in neither GAF nor PAF formats specification

Thanks

--call outputs a path that doesn't exist in the graph

Hi Heng,

I notice a path >s48420>s48421>s48428 in the --call outputs that doesn't actually exist in the GRCh38-freeze1.gfa because there is no such link between s48421 and s48428.
Here are the calls containing this path:

chr7	144186940	144296182	>s48420	>s48428	>s48421:10670:-:HG00438#2#h2tg000019l:15485853:15496541
chr7	144186940	144296182	>s48420	>s48428	>s48421:10670:-:HG01123#1#h1tg000013l:13689309:13699991
chr7	144186940	144296182	>s48420	>s48428	>s48421:10670:+:HG01243#2#h2tg000196l:341837:352526
chr7	144186940	144296182	>s48420	>s48428	>s48421:10670:+:HG02080#2#h2tg000103l:15430162:15440860
chr7	144186940	144296182	>s48420	>s48428	>s48421:10670:+:HG03492#2#h2tg000104l:354601:365286

The GRCh38-freeze1.gfa and its calls are both from ftp://ftp.dfci.harvard.edu/pub/hli/minigraph/HPP/

I'm not sure how to interpret this result.

Thanks,
Wen-Wei

Effect of changing the minimizer occurrence threshold interval

As a workaround to memory overruns you suggested in #38 to lower -U, which did allow the run to complete (I set it to 25,100). Separately, on a larger machine, I was also able to complete the default run, but that machine is not always available. So now I'm wondering how safe it is to use the modified -U values in the future. I compared the results, and there is a noticeable drop in the number of segments and the corresponding increase in their average length. Could you please explain why that is and what effect this parameter has when lowering the upper and the lower boundaries of the interval? If what I am seeing is loss of sensitivity (i.e. due to fewer minimizers picked), then I'd have to be cautious of that in the future. Thanks in advance!

>cat h1.h2.minimap._on_bigmem_i.GFA.stats
Number of segments: 248396
Number of links: 341923
Number of arcs: 683846
Max rank: 1
Total segment length: 2664944961
Average segment length: 10728.615
Sum of rank-0 segment lengths: 2339104873
Max degree: 2
Average degree: 1.377
------------------------------------------------------------------------
> cat h1.h2.minimap-U25,100.GFA.stats
Number of segments: 224545
Number of links: 309235
Number of arcs: 618470
Max rank: 1
Total segment length: 2617000485
Average segment length: 11654.682
Sum of rank-0 segment lengths: 2339104873
Max degree: 2
Average degree: 1.377

question about minimum shared sequence to add genome to graph

hi,

thank you for this amazing tool. what is the minimum overlap 2 sequences can have, so that minigraph will construct a node shared by the 2 underlying genomes?

i am asking bc/ when i construct a DBG from a couple of genomes i used for testing, i find several 31mers that are shared among them (ie a nodes connecting the corresponding colors) -- however, when i run these genomes through minigraph, they remain disconnected (both w/ default params as well as when reducing minimum variant length).

kind regards

different ranks?

Hello. I'm a user of minimap so I was relieved to see this new program! Anyway, I'm looking to generate a reference graph from 4 whole genome assemblies that were constructed using different assemblers from different individuals (obvi). But I'm coming into some issues.
Do the contig/scaffold names affect the graph generation and what does this error from ./gfatools stat means can I resolve it?

[W] stable sequence 'tig00006712' associated with different ranks on segment 's621820': 1 != 2

Ultimately, I'll have to convert this gfa to a vg file in order to map and call variants (as this isn't yet available, here?)

Thanks!

minigraph requires GFA 1.0 overlaps, which are optional in the spec

The GFA specs state that overlaps for L lines are optional, but minigraph seems to require these. I think this happens because there is no way to avoid parsing a CIGAR string in this code block.

For GFA 2.0, the "*" placeholder is used to denote a lack of an overlap CIGAR. GFA 1.0 doesn't specify a placeholder, just that the field is optional. I have always assumed that a GFA with no CIGAR in the L/E lines implies a non-overlap (i.e., a CIGAR of 0M).

Would it be possible to adjust the default condition so that the parser can handle all valid GFA 1 files?

Which preset for mapping CCS reads?

Hello Heng,

A simple question: which preset do you recommend for mapping (typical) CCS reads to (r)GFA?
CCS doesn't quite fit noisy long reads or asm.

Thanks!

Steve

Very short segments in assembly

Hi @lh3,

I assembled 12 individual genomes assemblies together to create a pan-genome graph reference with minigraph. The shorter contigs in these individual assemblies were around 1300b.

I found out that in the resulting graph I have a quite a few very short segments (1 to 100 bases).
Even when I collapsing the data at stable sequence level I still have a number of short stable sequence. I am not convinced that these very short segments mean anything, in particular when considering nanopore error rate in particular in homopolymers. I also tried to increase the length of minimal variants (-L 500) but it doesn't seem to solve my issue.
Here is a length distribution plot both at stable sequence and segment level.

index

Are there any parameters to prevent these very short segments/contig to be added to the graph ?
Would it be an issue to manually prune out the shorter segments + associated links ?

Thanks
Ad

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.