pin_hic's Introduction

Pin_hic (Hi-C scaffolding)

Scaffolding tool based on Hi-C reads.

Overview

Pin_hic is a scaffolder that uses Hi-C data. It applies a dual selection and local optimal strategy to bridge pairs of contigs and outputs a SAT file for each iteration. The SAT format is an extension of the GFA format that records the scaffolding process and can also be useful for further genomic analysis.

Dependencies

  1. zlib

Installation

Run the following commands to install pin_hic:

git clone https://github.com/dfguan/pin_hic.git
cd pin_hic/src && make

Usage

Scaffolding with Hi-C reads

Hi-C Read preprocessing

Given a list hiclist of Hi-C read files (assumed to be in fastq.gz format, with the two files of a pair on one line) and the assembly asm, use the following commands to generate Hi-C alignment files.

bwa index "$asm"
while read -r r1 r2
do
	# name the BAM after the first read file of the pair
	prefix=`basename "$r1" .fastq.gz`
	# align the two files listed on this line of hiclist
	bwa mem -SP -B10 -t12 "$asm" "$r1" "$r2" | samtools view -b - > "$prefix.bam"
done < "$hiclist"
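
The loop above reads two whitespace-separated paths from every line of hiclist. A minimal sketch of a pre-flight check under that assumption; `check_hiclist` is a hypothetical helper and not part of pin_hic.

```shell
#!/bin/sh
# check_hiclist FILE: hypothetical helper, not part of pin_hic.
# Reports every line that does not hold exactly two whitespace-separated
# fields and returns non-zero if at least one such line is found.
check_hiclist() {
	awk 'NF != 2 { printf "line %d: expected 2 fields, found %d\n", NR, NF; bad = 1 }
	     END { exit bad }' "$1"
}
```

It could be called as `check_hiclist "$hiclist" || exit 1` before `bwa index`. It only checks the layout of the list, not that the files exist.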

Hi-C scaffolding

Given Hi-C read alignment BAM files, a draft assembly asm, and an existing output directory outdir, run the following commands to build scaffolds with Hi-C in N rounds (default: 3). The final assembly will be named scaffolds_final.fa.

samtools faidx $asm 
./bin/pin_hic_it -i $N -x $asm.fai -r $asm -O $outdir $bam1 $bam2 $bam3 ... 

Alternatively, build scaffolds step by step:

Step 1. contact matrix calculation

From a draft assembly asm:

samtools faidx $asm
./bin/pin_hic link $bam1 $bam2 $bam3 ... > link.matrix  # this will calculate contact numbers between any pair of contigs.

From a sat file:

./bin/pin_hic link -s $sat $bam1 $bam2 $bam3 ... > link.matrix  # this will calculate contact numbers between any pair of contigs.

Step 2. Scaffolding graph construction

From a draft assembly asm:

./bin/pin_hic build -w100 -k3 -c $asm.fai link.matrix > scaffolds.sat # this will generate scaffolding paths.

From a sat file:

./bin/pin_hic build -w100 -k3 -s $sat link.matrix > scaffolds.sat # this will generate scaffolding paths.

Step 3. Mis-join detection

Given a sat file:

./bin/pin_hic break $sat $bam1 $bam2 $bam3 ... > scaffs.bk.sat
./bin/pin_hic gets -c $asm scaffs.bk.sat > scaffolds_final.fa # get scaffold sequences.

A scaffolding pipeline of 3 iterations:

samtools faidx $asm
for i in `seq 1 3`
do
	if [ $i -eq 1 ]
	then 
		./bin/pin_hic link $bam1 $bam2 $bam3 ... > links_$i.matrix
		./bin/pin_hic build -w100 -k3 -c $asm.fai links_$i.matrix > scaffolds_$i.sat
	else
		./bin/pin_hic link -s scaffolds_$pi.sat $bam1 $bam2 $bam3 ... > links_$i.matrix
		./bin/pin_hic build -w100 -k3 -s scaffolds_$pi.sat links_$i.matrix > scaffolds_$i.sat 
	fi
	pi=$i  # remember this round so the next one can read scaffolds_$pi.sat
done
./bin/pin_hic break -s scaffolds_$i.sat $bam1 $bam2 $bam3 ... > scaffolds_bk.sat
./bin/pin_hic gets -c $asm scaffolds_bk.sat > scaffolds_final.fa
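
The same loop can be generalized to any number of rounds. The sketch below only prints the commands (a dry run) so the chaining of SAT files from one round to the next is visible. `pin_hic_plan` is a hypothetical helper, and it reuses only the sub-commands and flags shown in the pipeline above.

```shell
#!/bin/sh
# pin_hic_plan N ASM BAM...: hypothetical dry-run helper, not part of pin_hic.
# Prints, without running them, the link/build/break/gets commands for
# N >= 1 rounds. Round 1 starts from the assembly index; later rounds start
# from the SAT file written by the previous round.
pin_hic_plan() {
	n=$1; asm=$2; shift 2
	bams=$*
	echo "samtools faidx $asm"
	i=1
	prev=
	while [ "$i" -le "$n" ]; do
		if [ -z "$prev" ]; then
			echo "./bin/pin_hic link $bams > links_$i.matrix"
			echo "./bin/pin_hic build -w100 -k3 -c $asm.fai links_$i.matrix > scaffolds_$i.sat"
		else
			echo "./bin/pin_hic link -s scaffolds_$prev.sat $bams > links_$i.matrix"
			echo "./bin/pin_hic build -w100 -k3 -s scaffolds_$prev.sat links_$i.matrix > scaffolds_$i.sat"
		fi
		prev=$i
		i=$((i + 1))
	done
	echo "./bin/pin_hic break -s scaffolds_$prev.sat $bams > scaffolds_bk.sat"
	echo "./bin/pin_hic gets -c $asm scaffolds_bk.sat > scaffolds_final.fa"
}
```

For example, `pin_hic_plan 3 asm.fa hic1.bam hic2.bam` prints the three-round pipeline for two BAM files; the file names here are placeholders.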

Output format: SAT (V 0.1)

SAT format is extended from the GFA 1.0 format.

Record types

Tag  Description           Comment
H    Header                optional
S    Sequence              required
L    Link                  optional
P    Path                  optional
A    Scaffold set          optional
C    Current scaffold set  optional

H Header

Col  Field  Regexp             Description  Comment
1    TAG    H                  Tag          Required
2    VER    VN:Z:[0-9]\.[0-9]  Version      Required

S Sequence

Col  Field  Regexp        Description      Comment
1    TAG    S             Tag              Required
2    SNAME  .+            Sequence name    Required, primary key
3    SLEN   [0-9]+        Sequence length  Required
4    SEQ    \*|[A-Za-z]+  Sequence         Required

L Link

Col  Field  Regexp                Description           Comment
1    TAG    L                     Tag                   Required
2    SRCS   .+                    Source sequence name  Required, foreign key S:SNAME
3    SRCE   [-+]                  Source end            Required, + for 5' end and - for 3' end
4    TGTS   .+                    Target sequence name  Required, foreign key S:SNAME
5    TGTE   [-+]                  Target end            Required, + for 5' end and - for 3' end
6    WGT    wt:f:[0-9]*\.?[0-9]+  Link weight           Optional

P Path

Col  Field  Regexp                                                 Description                           Comment
1    TAG    P                                                      Tag                                   Required
2    PNAME  [cu][0-9]{9}                                           Path name                             Required, primary key
3    PLEN   [0-9]+                                                 Path length                           Required
4    NAMEL  ((.+[-+],)*(.+[-+]))|((u[0-9]{9}[-+],)*u[0-9]{9}[-+])  List of sequence names or path names  Required, foreign keys S:SNAME

A Scaffold set (or assembly set?)

Col  Field   Regexp                        Description         Comment
1    TAG     A                             Tag                 Required
2    ANAME   a[0-9]{5}                     Scaffold set name   Required
3    PNAMEL  ([cu][0-9]{9},)*[cu][0-9]{9}  List of path names  Required, foreign keys P:PNAME

C Current scaffold set

Col  Field  Regexp     Description                Comment
1    TAG    C          Tag                        Required
2    CNAME  a[0-9]{5}  Current scaffold set name  Required, foreign key A:ANAME

Example

H	VN:Z:0.1
S	LR132056.1.4	138023	*
S	LR132056.1.5	1128790	*
S	LR132056.1.6	4496575	*
P	u000000004	662215	LR132053.1.4+,LR132053.1.5+,LR132053.1.6+
L	LR132051.1.5	+	LR132051.1.4	-	wt:f:0.028248
L	LR132051.1.6	+	LR132051.1.5	-	wt:f:0.009367
A	a00000	1	u000000004
C	a00000
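
Because SAT is tab-separated with the record type in the first column, it can be inspected with standard tools. The following is a minimal sketch, not part of pin_hic: `sat_record_counts` tallies records by tag, and `sat_current_paths` follows the C record to its A record and lists the path names. Both function names are hypothetical. The path list is taken from the last column of the A record, which is an assumption made because the example above carries one more column than the field table.

```shell
#!/bin/sh
# sat_record_counts FILE: hypothetical helper, prints "TAG COUNT" per record type.
sat_record_counts() {
	awk -F '\t' '{ n[$1]++ } END { for (t in n) print t, n[t] }' "$1" | sort
}

# sat_current_paths FILE: hypothetical helper, prints one path name per line
# for the scaffold set named on the C record; fails if that set is missing.
sat_current_paths() {
	awk -F '\t' '
		$1 == "A" { sets[$2] = $NF }   # path list of each scaffold set (last column, assumed)
		$1 == "C" { cur = $2 }         # name of the current scaffold set
		END {
			if (cur == "" || !(cur in sets)) exit 1
			n = split(sets[cur], p, ",")
			for (k = 1; k <= n; k++) print p[k]
		}' "$1"
}
```

Run on the example above, `sat_current_paths` prints the single path name u000000004.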

Limitation

FAQ

Contact

Everyone is welcome to use and distribute the package. For bug reports or other suggestions, please use the GitHub page or email me at [email protected].

pin_hic's People

Contributors

dfguan

pin_hic's Issues

No change after pin_hiC

Dear @dfguan
Thanks for your pin_hic.

There is no change after running pin_hic, whether I use BAM files cleaned by allhic or by juicer. I don't know what went wrong.

$ wc -l all_ont_selfcor_hificor_longest120X_canu.contigs.fasta.fai
2811 all_ont_selfcor_hificor_longest120X_canu.contigs.fasta.fai
$ grep ">" scaffolds_final.fa | wc -l
2811

The sizes are also unchanged; I checked the top 10.

Thank you in advance.
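
The comparison made in this report (index lines against FASTA headers) can be wrapped in a small function. This is a sketch with hypothetical names, not part of pin_hic; a result of 0 means the number of sequences did not change between input and output.

```shell
#!/bin/sh
# count_fai_seqs FILE: number of non-empty lines in a samtools .fai index.
count_fai_seqs() {
	grep -c . "$1"
}

# count_fasta_seqs FILE: number of FASTA header lines (lines starting with ">").
count_fasta_seqs() {
	grep -c '^>' "$1"
}

# seq_count_diff FAI FASTA: input sequence count minus output sequence count.
seq_count_diff() {
	echo $(( $(count_fai_seqs "$1") - $(count_fasta_seqs "$2") ))
}
```

For the files in this report the call would be `seq_count_diff all_ont_selfcor_hificor_longest120X_canu.contigs.fasta.fai scaffolds_final.fa`.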

No need to set enzyme ?

Dear @dfguan

Thanks for your powerful tool. When starting with this tool, is there no need to set the enzyme type and restriction site?

Sincerely
Johnson

Core dump error

Hi, everyone,

I got a core dump error when using pin_hic

The command is "pin_hic break scaffolds.sat M001.bam > M001.bk.sat"

The error showed like this:
Program starts
[M::mk_brks] initiate contigs
Segmentation fault (core dumped)

I have no idea how to solve this problem.
Could anyone give me a hand?
Thanks to all.

Segfault when output directory does not exist

Hi Dengfeng,

Thank you for your work on this scaffolding approach and software! I'm excited to see the results.

I wanted to mention that I ran into a segfault when running pin_hic_it for the first time because I (mistakenly) assumed the output directory would be created when the code was run. I took a look at the core dump and read back through the README, and it became reasonably clear to me that an existing directory was a pre-requisite... but, that initial segfault was not the expected behavior. So, maybe just mentioning this in case anyone runs into the same issue.

Thanks again for your work!
-brant
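
A minimal sketch of a guard for the situation described above, assuming only that the directory passed with -O has to exist before pin_hic_it starts. `ensure_outdir` is a hypothetical helper, not part of pin_hic.

```shell
#!/bin/sh
# ensure_outdir DIR: hypothetical helper, not part of pin_hic.
# Creates DIR (with any missing parents) and succeeds only if it then
# exists as a writable directory.
ensure_outdir() {
	mkdir -p "$1" && [ -d "$1" ] && [ -w "$1" ]
}
```

It could be placed in front of the README command, for example `ensure_outdir "$outdir" || exit 1`, before calling `./bin/pin_hic_it`.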

Haplotype scaffolding

Hi @dfguan,

Thanks for the excellent tool.

I am trying to scaffold two separate phased haplotype assemblies of the same diploid plant genome (assembled with HIFIasm). Since I am using the same HiC data for both haplotype assemblies, do you think the HiC mapq>10 is stringent enough to filter for haplotype-specific HiC reads?
What parameters in bwa mem and pin_hic_it would you tweak to improve the quality of haplotype scaffolds?

I also wondered what is the accurate mode (-a) in pin_hic_it?

Thanks,
Mojtaba

Error: ‘for’ loop initial declarations are only allowed in C99 mode

Hi,

I hit an error when installing pin_hic. The error occurred in 'build_graph': 'build_graph.c:62:2: error: ‘for’ loop initial declarations are only allowed in C99 mode'. I guess my gcc version may not be right.
My gcc version is 4.8.5.
Could you help me?
Thanks~

Bests,
Kmanjor

Here are logs:

gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o bamlite.o bamlite.c
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o bed.o bed.c
In file included from bed.c:23:0:
bed.c: In function ‘ks_seek’:
kseq.h:41:17: warning: no return statement in function returning non-void [-Wreturn-type]
typedef struct __kstream_t {
^
kseq.h:160:2: note: in expansion of macro ‘__KS_TYPE’
__KS_TYPE(type_t)
^
kseq.h:167:57: note: in expansion of macro ‘KSTREAM_INIT2’
#define KSTREAM_INIT(type_t, __read, __seek, __bufsize) KSTREAM_INIT2(static, type_t, __read, __seek, __bufsize)
^
bed.c:24:1: note: in expansion of macro ‘KSTREAM_INIT’
KSTREAM_INIT(gzFile, gzread, gzseek, 0x10000)
^
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o cdict.o cdict.c
cdict.c: In function ‘cd_filt’:
cdict.c:83:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t’ [-Wformat=]
fprintf(stderr, "%d\t%d\t%d\n", t->lim, max_wt, sum_wt);
^
cdict.c: In function ‘cd2_set_lim’:
cdict.c:92:14: warning: unused variable ‘j’ [-Wunused-variable]
uint32_t i, j;
^
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o graph.o graph.c
In file included from graph.c:25:0:
graph.c: In function ‘ks_seek’:
kseq.h:41:17: warning: no return statement in function returning non-void [-Wreturn-type]
typedef struct __kstream_t {
^
kseq.h:160:2: note: in expansion of macro ‘__KS_TYPE’
__KS_TYPE(type_t)
^
kseq.h:252:2: note: in expansion of macro ‘KSTREAM_INIT2’
KSTREAM_INIT2(SCOPE, type_t, __read, __seek, 16384)
^
kseq.h:257:43: note: in expansion of macro ‘KSEQ_INIT2’
#define KSEQ_INIT(type_t, __read, __seek) KSEQ_INIT2(static, type_t, __read, __seek)
^
graph.c:33:1: note: in expansion of macro ‘KSEQ_INIT’
KSEQ_INIT(gzFile, gzread, gzseek)
^
graph.c: In function ‘simp_graph’:
graph.c:370:12: warning: unused variable ‘vt’ [-Wunused-variable]
vertex_t *vt = g->vtx.vertices;
^
graph.c: In function ‘add_a’:
graph.c:874:6: warning: variable ‘node_n’ set but not used [-Wunused-but-set-variable]
int node_n;
^
graph.c: In function ‘cp_seq’:
graph.c:948:3: warning: array subscript has type ‘char’ [-Wchar-subscripts]
for ( i = 0; i < len; ++i) s[i] = rc_table[t[len -i - 1]];
^
graph.c: In function ‘add_e’:
graph.c:817:11: warning: ‘wt’ may be used uninitialized in this function [-Wmaybe-uninitialized]
add_dedge(g, n1, d1 == '+', n2, d2 == '+', wt);
^
graph.c:817:11: warning: ‘d2’ may be used uninitialized in this function [-Wmaybe-uninitialized]
graph.c:817:11: warning: ‘d1’ may be used uninitialized in this function [-Wmaybe-uninitialized]
graph.c:817:11: warning: ‘n2’ may be used uninitialized in this function [-Wmaybe-uninitialized]
graph.c: In function ‘add_p’:
graph.c:860:20: warning: ‘plen’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (ns.n) add_path(g, name, plen, ns.a, ns.n, name[0]=='c');
^
graph.c:828:8: warning: ‘nodes_str’ may be used uninitialized in this function [-Wmaybe-uninitialized]
char *nodes_str;
^
graph.c: In function ‘add_a’:
graph.c:873:8: warning: ‘nodes_str’ may be used uninitialized in this function [-Wmaybe-uninitialized]
char *nodes_str;
^
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o pin_10x.o pin_10x.c
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o sdict.o sdict.c
gcc -O2 -Wall -D VERBOSE -D PRINT_COVERAGE -c -o build_graph.o build_graph.c
build_graph.c: In function ‘col_ctgs_from_graph’:
build_graph.c:62:16: error: redeclaration of ‘i’ with no linkage
for (uint32_t i = 0; i < ctgs->n_seq; ++i) fprintf(stderr, "ctgs: %s %d\n", ctgs->seq[i].name, ctgs->seq[i].is_circ);
^
build_graph.c:53:11: note: previous declaration of ‘i’ was here
uint32_t i, j;
^
build_graph.c:62:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (uint32_t i = 0; i < ctgs->n_seq; ++i) fprintf(stderr, "ctgs: %s %d\n", ctgs->seq[i].name, ctgs->seq[i].is_circ);
^
build_graph.c:62:2: note: use option -std=c99 or -std=gnu99 to compile your code
build_graph.c:53:14: warning: unused variable ‘j’ [-Wunused-variable]
uint32_t i, j;
^
build_graph.c: In function ‘print_cdict2’:
build_graph.c:143:12: warning: variable ‘icnt’ set but not used [-Wunused-but-set-variable]
uint32_t icnt;
^
build_graph.c: In function ‘norm_links’:
build_graph.c:247:9: warning: variable ‘icnt’ set but not used [-Wunused-but-set-variable]
float icnt;
^
build_graph.c: In function ‘nns_mst’:
build_graph.c:410:13: warning: unused variable ‘len2’ [-Wunused-variable]
uint32_t len2 = ctgs->seq[ind].len;
^
build_graph.c:409:19: warning: unused variable ‘is_l2’ [-Wunused-variable]
uint32_t is_l, is_l2;
^
build_graph.c:409:13: warning: unused variable ‘is_l’ [-Wunused-variable]
uint32_t is_l, is_l2;
^
build_graph.c:400:7: warning: unused variable ‘isf’ [-Wunused-variable]
int isf = 1;
^
build_graph.c:398:12: warning: unused variable ‘len1’ [-Wunused-variable]
uint32_t len1 = ctgs->seq[i].len;
^
build_graph.c:521:23: warning: unused variable ‘tt’ [-Wunused-variable]
float hh, ht, th, tt, ort_cnt[4];
^
build_graph.c:521:19: warning: unused variable ‘th’ [-Wunused-variable]
float hh, ht, th, tt, ort_cnt[4];
^
build_graph.c:521:15: warning: unused variable ‘ht’ [-Wunused-variable]
float hh, ht, th, tt, ort_cnt[4];
^
build_graph.c:521:11: warning: unused variable ‘hh’ [-Wunused-variable]
float hh, ht, th, tt, ort_cnt[4];
^
build_graph.c:520:14: warning: unused variable ‘lenznext’ [-Wunused-variable]
uint32_t lenznext = ctgs->seq[znext].len;
^
build_graph.c:519:14: warning: unused variable ‘lenz’ [-Wunused-variable]
uint32_t lenz = ctgs->seq[z].len;
^
build_graph.c: In function ‘nns_straight’:
build_graph.c:630:13: warning: unused variable ‘len2’ [-Wunused-variable]
uint32_t len2 = ctgs->seq[ind].len;
^
build_graph.c:618:12: warning: unused variable ‘len1’ [-Wunused-variable]
uint32_t len1 = ctgs->seq[i].len;
^
build_graph.c: In function ‘nns_mst2’:
build_graph.c:736:10: warning: variable ‘ocnt’ set but not used [-Wunused-but-set-variable]
float ocnt = 0;
^
build_graph.c: In function ‘main_bldg’:
build_graph.c:1080:8: warning: unused variable ‘msn’ [-Wunused-variable]
float msn = .7, mdw = 0.95;
^
make: *** [build_graph.o] Error 1
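
The compiler note in the log suggests -std=c99 or -std=gnu99. GCC releases before 5 default to the older gnu90 dialect, which rejects a declaration inside a for initializer, so a version check can tell whether the extra flag is likely needed. The sketch below is a hypothetical helper, not part of pin_hic.

```shell
#!/bin/sh
# needs_c99_flag VERSION: hypothetical helper, not part of pin_hic.
# Succeeds when the GCC major version is below 5, i.e. when the compiler
# still defaults to gnu90 and needs -std=c99 or -std=gnu99 for this code.
needs_c99_flag() {
	major=${1%%.*}   # "4.8.5" -> "4", "11" -> "11"
	[ "$major" -lt 5 ]
}
```

With `needs_c99_flag "$(gcc -dumpversion)"` returning success, one option is to rebuild with the flag appended to the options shown in the log, for example `make CFLAGS="-O2 -Wall -D VERBOSE -D PRINT_COVERAGE -std=gnu99"`. That command assumes the Makefile passes its options through a CFLAGS variable, which should be checked in src/Makefile first.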

Non ACTGN characters introduced in pin_hic output

Hi, just wanted to make you/users aware there are non ACTGN characters in the final scaffolds produced by pin_hic. I tried to diagnose how or why this is happening; I introduced ambiguity characters during polishing with pilon prior to scaffolding with pin_hic. In the first and last contigs within scaffolds, pin_hic will output the ambiguities, but in contigs between the flanking contigs of scaffolds, ambiguities are replaced with a non ACTGN character (variously interpreted by programs as ?, �, EOT). My solution was to import into geneious, where the non-ACTGN character can be converted to Ns, then export for further polishing to correct/replace those bases; I'm sure there is a more elegant one liner to replace the swapped out characters.

Not sure if this is a problem other users have experienced. It should quickly become evident if others are facing this issue as programs like quast will fail, while others such as busco or pilon will produce spurious results (due to misalignments from the non ACTGN character).

I hope this helps with improving the program. This is the only scaffolding program that was easy to implement while correcting for over-scaffolding (instagraal appears to be a suitable competitor, but requires setting up an environment and creating input files with hicstuff; not a huge deal, but it creates barriers not present here with pin_hic). I think it would be immensely helpful to have more information on interpreting the log output columns, etc. Cheers.
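
One possible command-line alternative to the Geneious round trip described above, as a sketch: it leaves header lines alone and, on sequence lines, replaces every byte that is not an IUPAC nucleotide code with N. `mask_non_iupac` is a hypothetical name and not part of pin_hic. Keeping the IUPAC ambiguity codes rather than only A, C, G, T and N is a choice made here, so the set in the bracket expression should be narrowed if downstream tools need strict ACGTN.

```shell
#!/bin/sh
# mask_non_iupac: hypothetical filter, not part of pin_hic.
# Reads FASTA on stdin and writes FASTA on stdout. Header lines pass through
# unchanged; on sequence lines each byte outside the IUPAC nucleotide
# alphabet becomes N. LC_ALL=C makes awk work byte by byte, so stray
# control or non-ASCII bytes are matched reliably.
mask_non_iupac() {
	LC_ALL=C awk '
		/^>/ { print; next }
		{ gsub(/[^ACGTURYSWKMBDHVNacgturyswkmbdhvn]/, "N"); print }'
}
```

Usage would be `mask_non_iupac < scaffolds_final.fa > scaffolds_final.masked.fa`, where the output file name is a placeholder. Because replacement is per byte, a multi-byte character becomes several Ns.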

Hi-c contact heatmap visualization

Dear, Guan

Thanks for your wonderful tool in scaffolding.
But I would like to know how to visualize the Hi-C heatmap as in juicebox. Could you please give me some help?
Thank you.

Zhenpeng

OmniC data

Dear Sirs,
does pin_hic work with OmniC?
thanks,
Diego

pin_hic gets causes segmentation fault

Hello :)
I'm currently trying to integrate pin_hic into my assembly downstream pipeline written in snakemake (python). For that reason I downloaded pin_hic from bioconda (3.0.0). I followed all the steps from the README successfully until I tried to execute the last step:

pin_hic gets -c asm scaffs.bk.sat > scaffols_final.fa

I receive the following error:
[M::main_get_seq] program starts
[M::get_seq] load scaffolding graph to memory
[M::get_seq] get contigs to memory
[M::get_seq] get scaffolds
Segmentation fault

Any chance you can check what went wrong? I really want to test your tool and have an alternative tool for SALSA.
