qiao-xin / dupgen_finder Goto Github PK

A pipeline used to identify different modes of duplicated gene pairs

C++ 55.61% C 1.00% Makefile 0.29% Perl 42.32% Shell 0.78%

dupgen_finder's Introduction

DupGen_finder

The DupGen_finder was developed to identify different modes of duplicated gene pairs. MCScanX algorithm was incorporated in this pipeline.


Authors	Xin Qiao (Xin Qiao)
	Qionghou Li (Qionghou Li)
	Yupeng Wang (Yupeng Wang)
	Shaoling Zhang (Shaoling Zhang)
	Andrew H. Paterson (PGML)
Email	[email protected]

The schematic diagram of DupGen_finder pipeline

Figure 1: The flowchart of DupGen_finder pipeline

Dependencies
Installation
Preparing input files
Running
Result Files
Citation

Dependencies

Perl

Installation

cd ~/software  # or any directory of your choice
git clone https://github.com/qiao-xin/DupGen_finder.git
cd DupGen_finder
make
chmod 775 DupGen_finder.pl
chmod 775 DupGen_finder-unique.pl
chmod 775 set_PATH.sh
source set_PATH.sh

Test you can run DupGen_finder:

DupGen_finder.pl

DupGen_finder should print its 'help' text.

**Note**

DupGen_finder software package includes a custom MCScanX algorithm which can output sorted gff files such as Ath.gff.sorted, and is slightly different from the original MCScanX algorithm implemented in MCScanX software package.

Preparing input files

Pre-computed BLAST results (-outfmt 6) and gene location information (GFF format) are required for running DupGen_finder successfully.

For the target genome in which gene duplicaiton modes will be classified, please prepare two input files:
- target_species.gff, a gene position file for the target species, following a tab-delimited format. For example, "Ath.gff".
- target_species.blast, a blastp output file (-outfmt 6) for the target species (self-genome comparison). For example, "Ath.blast".
For the outgroup genome, please prepare two input files:
- [target_species]_[outgroup_species].gff, a gene position file for the target_species and outgroup_species, following a tab-delimited format.
- [target_species]_[outgroup_species].blast, a blastp output file (-outfmt 6) between the target and outgroup species (cross-genome comparison).

For example, assuming that you are going to classify gene duplication modes in Arabidopsis thaliana (Ath), using Nelumbo nucifera (Nnu) as outgroup, you need to prepare 4 input files: Ath.gff,Ath.blast, Ath_Nnu.gff, Ath_Nnu.blast

gff file

Ath.gff is in the following format (tab separated):

Species_abbrev-Chr_ID	gene_ID	start_position	end_position

The data in Ath.gff looks like this (tab separated):

Ath-Chr1	AT1G01010.1	3631	5899
Ath-Chr1	AT1G01020.1	5928	8737
Ath-Chr1	AT1G01030.1	11649	13714
Ath-Chr1	AT1G01040.2	23416	31120
Ath-Chr1	AT1G01050.1	31170	33153

The below command can be used to creat Ath_Nnu.gff:

cat Ath.gff Nnu.gff >Ath_Nnu.gff

The data in Ath_Nnu.gff looks like this (tab separated):

Ath-Chr1	AT1G01010.1	3631	5899
Ath-Chr1	AT1G01020.1	5928	8737
...
Nnu-megascaffold_32	NNU_00001-RA	57481	63788
Nnu-megascaffold_32	NNU_00002-RA	32491	41125
...

blast file

Ath.blast is in the following format:

query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

The data in Ath.blast looks like this (tab separated):

ATCG00500.1	ATCG00500.1	100.00	488	0	0	1	488	1	488	0.0	 932
ATCG00510.1	ATCG00510.1	100.00	37	0	0	1	37	1	37	2e-19	73.9
ATCG00280.1	ATCG00280.1	100.00	473	0	0	1	473	1	473	0.0	 876
ATCG00890.1	ATCG01250.1	100.00	389	0	0	1	389	1	389	0.0	 660
ATCG00890.1	ATCG00890.1	100.00	389	0	0	1	389	1	389	0.0	 660

Here is the typical parameter setting for generating the xyz.blast file:

The example file Ath.pep contains the whole genome protein sequences (FASTA format) of Arabidopsis.

For BLAST software

# Create a reference database
makeblastdb -in Ath.pep -dbtype prot -title Ath -parse_seqids -out Ath
# Align protein query sequences against the reference database
blastp -query query_file -db database -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out xyz.blast
# For example
blastp -query Ath.pep -db Ath -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out Ath.blast

For DIAMOND software

# Create a reference database
diamond makedb --in Ath.pep -d Ath
# Align protein query sequences against the reference database
diamond blastp -d Ath -q Ath.pep -o Ath.blast -p 20 --sensitive --max-target-seqs 5 --evalue 1e-10 --quiet

NOTE: All above input files should be stored under the same folder (the "-i" option). For more parameters please see below.

Running

Run the following command to get help information about DupGen_finder:

DupGen_finder.pl

This command will print a full list of options:

Usage: DupGen_finder.pl -i data_directory -t target_species -c outgroup_species -o output_directory
#####################
Optional:
-a 1 or 0(are segmental duplicates ancestral loci or not? default: 1, yes)
-d number_of_genes(maximum distance to call proximal, default: 10)
#####################
The following are optional MCScanX parameters:
-k match_score(cutoff score of collinear blocks for MCScanX, default: 50)
-g gap_penalty(gap penalty for MCScanX, default: -1)
-s match_size(number of genes required to call a collinear block for MCScanX, default: 5)
-e e_value(alignment significance for MCScanX, default: 1e-05)
-m max_gaps(maximum gaps allowed for MCScanX, default: 25)
-w overlap_window(maximum distance in terms of gene number, to collapse BLAST matches for MCScanX, default: 5)

A typical command to identify different modes of duplicated gene pairs in a given species could look like this:

DupGen_finder.pl -i data -t Ath -c Nnu -o results

Here, DupGen_finder attempts to identify the different modes of duplicated gene pairs in A.thaliana by using N.nucifera as outgroup. All required data files should be stored under this directory data. The output files will be stored under this directory results. For more details please see below. Ath: A.thaliana, Nnu: N.nucifera.

Note: We recommend that the "data_directory" or "output_directory" should be given a full path. For example, /home/the_path_to_your_data_directory/

DupGen_finder-unique

Moreover, to eliminate redundant duplicate genes among different modes, we provide a stricter version of DupGen_finder named DupGen_finder-unique by which each duplicate gene was assigned to a unique mode after all of the duplicated gene pairs were classified into different gene duplication types. The priority of the duplicate genes is as follows: WGD > tandem > proximal > transposed > dispersed.

DupGen_finder-unique.pl -i data -t Ath -c Nnu -o results

Result Files

1 - Duplicated gene pairs:

Ath.wgd.pairs
Ath.tandem.pairs
Ath.proximal.pairs
Ath.transposed.pairs
Ath.dispersed.pairs

These files includes duplicated gene pairs derived from five modes of gene duplication, including WGD (Ath.wgd.pairs), tandem duplication (Ath.tandem.pairs), proximal duplication (Ath.proximal.pairs), transposed duplication (Ath.transposed.pairs), dispersed duplication (Ath.dispersed.pairs). The gene pairs contained in these files looks like this (tab separated):

Duplicate 1	Location	Duplicate 2	Location	E-value
AT1G01010.1	Ath-Chr1:3631	AT4G01550.1	Ath-Chr4:673862	5e-52
AT1G01020.1	Ath-Chr1:5928	AT4G01510.1	Ath-Chr4:642733	5e-74
AT1G01030.1	Ath-Chr1:11649	AT1G13260.1	Ath-Chr1:4542168	3e-45
AT1G01050.1	Ath-Chr1:31170	AT3G53620.1	Ath-Chr3:19880504	6e-126
AT1G01060.1	Ath-Chr1:33666	AT5G17300.1	Ath-Chr5:5690227	3e-30

2 - Ath.singletons

It includes genes that have no homologous genes within target species.

GeneID	Location
AT2G32600.1	Ath-Chr2:13833545
AT5G11810.1	Ath-Chr5:3808720
AT4G16610.1	Ath-Chr4:9354321
AT1G66520.1	Ath-Chr1:24816128
AT1G51110.1	Ath-Chr1:18935329

3 - Ath.stats

The number of duplicated gene pairs derived from different modes.

Types	NO. of gene pairs
WGD-pairs	4352
TD-pairs	2063
PD-pairs	788
TRD-pairs	4447
DSD-pairs	16130

4 - Collinearity files

Ath.collinearity
Ath_Nnu.collinearity

Citation

Qiao X, Li Q, Yin H, Qi K, Li L, Wang R, Zhang S, Paterson AH: Gene duplication and evolution in recurring polyploidization–diploidization cycles in plants. Genome Biology 2019, 20:38.

dupgen_finder's People

Contributors

Stargazers

Watchers

dupgen_finder's Issues

the number of genes in 'xxx.xxx.pairs' was not consistent with the number in 'xxx.xxx.genes'

Dear Dr. Qiao,

Of the output files from DupGen_finder, I found that the number of genes in the file 'xxx.wgd.pairs' (after deduplication) is equal to the number in the file 'xxx.wgd.genes'.

But why the number of genes in the file 'xxx.proximal.pairs' (after deduplication) is more than the number in 'xxx.proximal.genes' (4051 vs. 3425)?

Can you give some explanations?

Thanks a lot.

conda version of DupGen_finder

Hello,
Curious if the conda version of DupGen_finder is available?

Regards,

Add document about how to use DupGen_finder without out group species.

Hi,

I have only one species and want to find DepGen using itself, but I can not found any document about how to do that, and I got errors when I using itself as target and outgroup species.
cp input/Pca.blast input/Pca_Pca.blast
cp input/Pca.gff input/Pca_Pca.gff

ls input|cat
Pca.blast
Pca.gff
Pca_Pca.blast
Pca_Pca.gff

perl /Bio/software/DupGen_finder/DupGen_finder.pl -i /state/partition1/WORK/DupGen_finder/input -t Pca -c Pca -o /state/partition1/WORK/DupGen_finder/output/
rm: cannot remove `/state/partition1/WORK/DupGen_finder/output/Pca_Pca.gff.sorted': No such file or directory

And I got nothing!

Best,
Kun

"*.gff.sorted". No such file or directory

Hi!
I am trying to use DupGen_finder.pl, but there is some error.

I am using in command line:

perl DupGen_finder.pl -i /media/dragonstone/dayana/KaKs/data_dir/ -t gma -c pvu -o /media/dragonstone/dayana/KaKs/results

error

rm: cannot remove '/media/dragonstone/dayana/KaKs/results/gma_pvu.gff.sorted': No such file or directory
Is this file "*.gff.sorted" not being created?
Thanks if you can help me.

Dayana.

Content of my data_dir:

gma.blast
gma.gff
gma_pvu.blast
gma_pvu.gff

dating WGD ?

I do not actually know WGD in detail, but I am in a situation where I have to do WGD analysis. So I want to use your pipeline.
Fortunately, this pipeline is easy to use.
But I have a problem that I can not solve.
Can you tell me how to check the WGD date after using this pipeline?

how to choose a proper species outgroup

Hi,
I'm wondering about what's the criteria to choose an outgroup species for DupGen_finder analyses, is it same or different with outgroup species when phylogeny construction?

error when running script

Hi, thanks so much for building this software, very helpful.
I am running into the following error when running the script. See below:

-bash-4.1$ ./DupGen_finder.pl -i Dniv_Aalp_Data/ -t Dniv -c Aalp -o /usit/abel/u1/michaeno/nobackup/program_dirs/DupGen_finder/Dniv_Results_13Jan2020
rm: cannot remove `/usit/abel/u1/michaeno/nobackup/program_dirs/DupGen_finder/Dniv_Results_13Jan2020/Dniv_Aalp.gff.sorted': No such file or directory

Any idea why this is occurring?
I notice that a previous version of this script did not have this problem.
Thanks!

DupGene_finder.pl or DupGene_finder-unique.pl

Dear Qiao,

Recently, I carried out the analyses to identify the different modes of gene duplication using both DupGene_finder.pl and DupGene_finder-unique.pl. I am going to calculate the ka, ks, ka/ks values for each gene pairs. Should I use the results from DupGene_finder.pl or DupGene_finder-unique.pl ??

Based on the same blast files, I also identified the WGD events using the DupGene_finder.pl (WGD-pairs) and another software, which exhibited the same WGDs within all genomes. Now, I want to classify the WGD-pairs into different groups according to the WGD events. Although the same WGDs were identified within genomes, the two softwares generated different numbers of WGD-pairs for each WGD event. Could I merge these WGD-pairs of the same WGD event into one non-redundant WGD-pair set or just used the result from either software ??

Thanks a lot!

The number of genes in the result file was not consistent with the total number of species gene sets

Hello, I used the sample file in the data folder to test the program dupgene_finder-unique. The number of genes is less than the total number of genes in Arabidopsis. Did I do anything wrong? Here's what I got for Arabidopsis.

113 Ath.singletons
3697 Ath.dispersed.genes-unique
1056 Ath.proximal.genes-unique
2890 Ath.tandem.genes-unique
4317 Ath.transposed.genes-unique
7545 Ath.wgd.genes-unique

Two collinearity documents without results.

Hello,
When "DupGen_finder.pl -i /home/wxy/data5 -t Os -c Aa -o results" was done, the results of "Os.collinearity" and "Os_Aa.collinearity" were all empty. Can you answer that for me? Thank you so much!

ancestral and novel transposed duplications

Hello,

In the species.transposed.pairs file, is column 1 ("Duplicate 1") the ancestral copy and column 3 ("Duplicate 2") the non-ancestral copy? Or are ancestral and non ancestral pairs shuffled between column 1 and column 3 and I need to run OrthoFinder to determine which is ancestral?

Thank you,
Chris

How to make this pipeline resolve the genes relationships of genome with three subgenomes

Hi,
How to make this pipeline resolve the gene relationships of genome with three subgenomes?
Thanks

Outgroups seem unused

stdout from execution below...

Reading BLAST file and pre-processing
Generating BLAST list
708823 matches imported (1542503 discarded)
2307 pairwise comparisons
7316 alignments generated
Pairwise collinear blocks written to outdir/targ.collinearity [41.574 seconds elapsed]
Reading BLAST file and pre-processing
Generating BLAST list
0 matches imported (2563444 discarded)
0 pairwise comparisons
0 alignments generated
Pairwise collinear blocks written to outdir/targ_outgroup.collinearity [8.996 seconds elapsed]

Seems like my outgroup blast file isn't working for whatever reason. Is there a some secondary filtering on the alignments happening? I've tried swapping which sample I use as query and which I use as database and it doesn't seem to matter. All the alignments get discarded either way. Any thoughts on what I could check? Is this expected output somehow?

Blast matches all discarded

Hi,

I think I am not running the software properly, but I wasn't sure what I am missing. It seems that all the BLAST matches are discarded, and all the genes are classified as singletons.

DupGen_finder.pl -i data/ -t SH -c SP -o DupGeneFinder_out
Reading BLAST file and pre-processing
Generating BLAST list
0 matches imported (120631 discarded)
0 pairwise comparisons
0 alignments generated
Pairwise collinear blocks written to DupGeneFinder_out/SH.collinearity [1.121 seconds elapsed]
Reading BLAST file and pre-processing
Generating BLAST list
0 matches imported (152117 discarded)
0 pairwise comparisons
0 alignments generated
Pairwise collinear blocks written to DupGeneFinder_out/SH_SP.collinearity [2.420 seconds elapsed]

Here is how the input files look like:

head data/SH.blast
SH1353_030184-RA        SH1353_030184-RA        100.000 114     0       0       1       114     1       114     1.41e-83      239
SH1353_030184-RA        SH1353_034720-RA        44.737  114     61      1       1       114     1       112     4.40e-27      96.7
SH1353_030184-RA        SH1353_034913-RA        38.971  136     53      2       1       106     1       136     2.62e-25      92.4

head data/SH.gff
SH-Chr1 SH1353_000001-RA        115428495       115433209
SH-Chr1 SH1353_000002-RA        115436316       115436785
SH-Chr1 SH1353_000003-RA        115438433       115438915
SH-Chr1 SH1353_000004-RA        115439932       115444476
SH-Chr1 SH1353_000005-RA        115447639       115458502

head data/SH_SP.blast
SH1353_030184-RA        Sopen04g021750.1        42.636  129     44      1       16      114     10      138     1.09e-26      97.1
SH1353_030184-RA        Sopen07g019450.1        40.157  127     46      2       2       98      402     528     4.50e-24      95.9
SH1353_030184-RA        Sopen03g009020.1        60.784  51      20      0       64      114     15      65      6.67e-13      62.8

 head data/SH_SP.gff
SH-Chr1 SH1353_000001-RA        115428495       115433209
SH-Chr1 SH1353_000002-RA        115436316       115436785
SH-Chr1 SH1353_000003-RA        115438433       115438915

tail data/SH_SP.gff
SP-Chr12        Sopen12g034940.1        83043827        83055077
SP-Chr12        Sopen12g034950.1        83056418        83058947
SP-Chr12        Sopen12g034960.1        83060006        83065975

Does something look off to you?
Thank you for your help!
Kyungyong

how to use Ath.wgd.pairs ?

Dear Qiao Xin,
Many thanks for your software,I wonder if it is possible to extract the WGD gene pair for both species from the Ath.wgd.pairs? in other words, we can be sure that it is these gene pairs that produce WGD , And then use gene pairs to analyze.

There are duplicate results in wgd.pairs

Hi,
I used DupGen_finder.pl for identifying duplicate genes, but accidentally found duplicate results in the results file Mlaxif.wgd.pairs.

cat Mlaxif.wgd.pairs|sort |uniq -d
Mla04G006270 Ml04:45595018 Mla07G001620 Ml07:3879510 5.8e-137
Mla04G006570 Ml04:51630775 Mla07G001750 Ml07:4173033 7e-234
Mla05G012620 Ml05:63796747 Mla09G015150 Ml09:87056266 2.6e-74
Mla05G016220 Ml05:95592974 Mla11G022290 Ml11:86599946 0
Mla08G001180 Ml08:2361079 Mla09G008860 Ml09:38325911 1.6e-51
Mla12G000260 Ml12:279900 Mla12G004330 Ml12:3831634 8.9e-293
Mla12G000550 Ml12:516995 Mla12G004370 Ml12:3862576 0.0e+00
Mla12G000860 Ml12:930867 Mla12G003070 Ml12:2149606 2.6e-152
Can you tell what caused this result?
Thanks

selecting outgroup

I am curious if there are specific criteria for selecting an outgroup when you have access to multiple genomic datasets.

Kindly suggest!

Regards,
B

How to select a outgroup species?

Hi sir:

Just like the topic says, i am wondering how to select a outgroup species for my query species？ Such as maize.

Thanks

The gene number assigned in gene type not equal to the total gene in the species

Hi Qiaoxin
I am confused by the problem as the title show, I run the program, however ,the results obtained that The total gene number assigned in six gene type not equal to the total gene in the species, is that normal ?
Thanks!