tohanwei / genome2or

Annotate Olfactory receptor CDS from genome

License: GNU General Public License v3.0

Python 100.00%

genome2or's Introduction

Genome2OR

Annotate olfactory receptor coding sequences (CDS) from a genome.

Contents Index

Cite
Abstract
Installation
Contact us

Cite

If you use Genome2OR, please cite our paper:
Han W, Wu Y, Zeng L, Zhao S. Building the Chordata Olfactory Receptor Database using more than 400,000 receptors annotated by Genome2OR. Sci China Life Sci. 2022 Dec;65(12):2539-2551. doi: 10.1007/s11427-021-2081-6. Epub 2022 Jun 10. PMID: 35696018.
Article
Support information
Resource
The "Chordata Olfactory Receptor Database (CORD)" provides a large collection of olfactory receptor data for Chordata.

Abstract

Genome2OR is a gene annotation tool built on HMMER, MAFFT and CD-HIT.
HMMER searches biological sequence databases for homologous sequences, using either single sequences or multiple sequence alignments as queries. It implements profile hidden Markov models (profile HMMs).

MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of alignment methods, including L-INS-i (accurate; for alignments of fewer than ∼200 sequences) and FFT-NS-2 (fast; for alignments of fewer than ∼30,000 sequences).

CD-HIT is a fast sequence clustering program that can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks, and it helps in understanding the data structure and correcting bias within a dataset.
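Since the pipeline depends on these three external programs, it can help to confirm that they are visible before a long run. The following is a minimal sketch, not part of Genome2OR. The executable names are assumptions: nhmmer and hmmbuild are HMMER programs, mafft belongs to MAFFT and cd-hit to CD-HIT. Whether the conda package exposes them on PATH is also an assumption.

```python
import shutil

# Assumed executable names (not taken from the Genome2OR documentation):
# nhmmer and hmmbuild come with HMMER, mafft with MAFFT, cd-hit with CD-HIT.
DEFAULT_TOOLS = ("nhmmer", "hmmbuild", "mafft", "cd-hit")


def missing_tools(tools=DEFAULT_TOOLS):
    """Return the tool names that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

An empty list means every listed program was found. A non-empty list only says the name is not on PATH, not that the installation is broken.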

Installation

Install with conda (Anaconda):

conda install -c whanwei genome2or

After installation, either of the following two commands can be used to annotate the olfactory receptor genes in a genome.

Basic annotation method

genome2or Actinopteri outputdir genome.fasta -e 1e-10 -c 4 -p prefix

For more information, run:

genome2or --help

    ____ _____ _   _  ___  __  __ _____ ____   ___  ____
   / ___| ____| \ | |/ _ \|  \/  | ____|___ \ / _ \|  _ \
  | |  _|  _| |  \| | | | | |\/| |  _|   __) | | | | |_) |
  | |_| | |___| |\  | |_| | |  | | |___ / __/| |_| |  _ <
   \____|_____|_| \_|\___/|_|  |_|_____|_____|\___/|_| \_|


usage: genome2or.py [-h] [-e] [-l] [-c] [-p] [-k] [-v] [-V] profile outputdir genome

Annotating Olfactory Receptor Genes in Vertebrate Genomes in One Step.

positional arguments:
  profile               Select an HMM profile for the annotated species from the following options: "Actinopteri", "Amphibia", "Aves", "Branchiostoma_floridae",
                        "Chondrichthyes", "Cladistia", "Coelacanthimorpha", "Crocodylia", "Hyperoartia", "Lepidosauria", "Mammalia", "Myxini", "Reptiles", "Testudines".
                        Alternativa, provide the path to a HMM profile file. However, we do not generally recommend doing so unless there is no corresponding option for the
                        species you need to annotate in the list we provide.
  outputdir             String. Directory path where the output files are stored.[default:Current directory]
  genome                String. File path of the genome to be annotated.

optional arguments:
  -h, --help            show this help message and exit
  -e , --EvalueLimit    Float, The e-value threashold used for extract olfactory receptor gene fragment(s) from the genome.[default:1e-20]
  -l , --SeqLengthLimit
                        Integer. Threshold of sequence length, sequences shoter than this value will not be considered as the preferred targets for functional olfactory
                        receptors.[default:868]
  -c , --cpus           Integer. number of parallel, with 0, all CPUs will be used.[default='2/3 of all cores']
  -p , --prefix         String. Output file name prefix.[default:ORannotation]
  -k , --keepfile       Bool. whether to keep detailed intermediate file(True/False).[default:True]
  -v, --verbose         Print detailed running messages of the program.
  -V, --version         Show version message and exit.

https://link.springer.com/article/10.1007/s11427-021-2081-6
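The -l option above sets a minimum length (default 868) below which a sequence is not treated as a preferred candidate for a functional receptor. The sketch below illustrates how such a length threshold could be combined with simple reading-frame checks. It is a hypothetical illustration, not Genome2OR's actual code: the function name, the labels, the order of the checks and the start/stop codon test are all assumptions.

```python
# Standard stop codons of the universal genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(seq, min_length=868):
    """Label a candidate CDS as 'functional' or give a pseudogene reason.

    Hypothetical helper; min_length mirrors the documented -l default.
    """
    seq = seq.upper()
    if len(seq) < min_length:
        return "pseudogene: too short"
    if len(seq) % 3 != 0:
        # An insertion or deletion shifts the reading frame.
        return "pseudogene: frameshift"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon before the final codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return "pseudogene: premature stop codon"
    # Assumption: an intact CDS starts with ATG and ends with a stop codon.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return "pseudogene: missing start or stop codon"
    return "functional"
```

Raising or lowering min_length changes only the first check, which is the part of the behaviour the -l option is documented to control.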

Iterative annotation of the genome

Iteration Actinopteri outputdir genome.fasta -i 3 -e 1e-10 -c 4 -p prefix

For more information, run:

Iteration --help

     ___ _____ _____ ____      _  _____ ___ ___  _   _
    |_ _|_   _| ____|  _ \    / \|_   _|_ _/ _ \| \ | |
     | |  | | |  _| | |_) |  / _ \ | |  | | | | |  \| |
     | |  | | | |___|  _ <  / ___ \| |  | | |_| | |\  |
    |___| |_| |_____|_| \_\/_/   \_\_| |___\___/|_| \_|


usage: Iteration.py [-h] [-i] [-e] [-l] [-c] [-p] [-k] [-v] [-V] profile outputdir genome

Iterative annotation of olfactory receptor genes in the genome.

positional arguments:
  profile               Select an HMM profile for the annotated species from the following options: "Actinopteri", "Amphibia", "Aves", "Branchiostoma_floridae",
                        "Chondrichthyes", "Cladistia", "Coelacanthimorpha", "Crocodylia", "Hyperoartia", "Lepidosauria", "Mammalia", "Myxini", "Reptiles", "Testudines".
                        Alternativa, provide the path to a HMM profile file. However, we do not generally recommend doing so unless there is no corresponding option for the
                        species you need to annotate in the list we provide.
  outputdir             String. Directory path where the output files are stored.[default:Current directory]
  genome                String. File path of the genome to be annotated.

optional arguments:
  -h, --help            show this help message and exit
  -i , --iteration      Int. Number of iterations.[default:2]
  -e , --EvalueLimit    Float, The e-value threashold used for extract olfactory receptor gene fragment(s) from the genome.[default:1e-20]
  -l , --SeqLengthLimit
                        Integer. Threshold of sequence length, sequences shoter than this value will not be considered as the preferred targets for functional olfactory
                        receptors.[default:868]
  -c , --cpus           Integer. number of parallel, with 0, all CPUs will be used.[default='2/3 of all cores']
  -p , --prefix         String. Output file name prefix.[default:ORannotation]
  -k , --keepfile       Bool. whether to keep detailed intermediate file(True/False).[default:True]
  -v, --verbose         Print detailed running messages of the program.
  -V, --version         Show version message and exit.

https://link.springer.com/article/10.1007/s11427-021-2081-6
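The help text for -c says that 0 means all CPUs and that the default is "2/3 of all cores". A rough sketch of how that rule could be resolved to a concrete number is shown below. The function name and the rounding (floor, with a minimum of 1) are assumptions, and the real program may round differently.

```python
import os


def resolve_cpus(requested=None, total=None):
    """Translate a -c style value into a concrete worker count.

    requested=None models the documented default (2/3 of all cores),
    requested=0 models "use all CPUs". The rounding is an assumption.
    """
    total = total or os.cpu_count() or 1
    if requested is None:
        return max(1, (2 * total) // 3)
    if requested == 0:
        return total
    return requested
```

On a 12-core machine this gives 8 workers by default, 12 for -c 0, and 4 for -c 4.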

Batch annotation

For batch annotation, a simple shell script such as the following is sufficient.

GenomeDir="Path to the directory where you store your genome"
# Glob instead of parsing `ls`, and quote paths so names with spaces work.
for genome in "$GenomeDir"/*; do
	name=$(basename "$genome")
	genome2or Actinopteri outputdir "$genome" -e 1e-10 -c 4 -p "${name%.*}"
done

Here we assume that every genome stored in the directory belongs to "Actinopteri". To annotate species from other groups, select the corresponding HMM profile with the "profile" parameter.

To use iterative annotation in batch mode instead:

GenomeDir="Path to the directory where you store your genome"
# Glob instead of parsing `ls`, and quote paths so names with spaces work.
for genome in "$GenomeDir"/*; do
	name=$(basename "$genome")
	Iteration Actinopteri outputdir "$genome" -i 3 -e 1e-10 -c 4 -p "${name%.*}"
done
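The same batch logic can be driven from Python, which makes it easier to inspect the commands before running them. This is a sketch under stated assumptions: build_commands is a hypothetical helper, not part of Genome2OR, and it uses only the options shown in the help text above (-i, -e, -c, -p). It assumes the genome2or and Iteration commands are on PATH.

```python
from pathlib import Path
import subprocess


def build_commands(genome_dir, profile="Actinopteri", outputdir="outputdir",
                   evalue="1e-10", cpus=4, iterations=None):
    """Build one command line per genome file found in genome_dir.

    With iterations=None the basic 'genome2or' command is used;
    otherwise 'Iteration' is used with '-i <iterations>'.
    """
    program = "genome2or" if iterations is None else "Iteration"
    commands = []
    for genome in sorted(Path(genome_dir).iterdir()):
        if not genome.is_file():
            continue  # skip sub-directories
        cmd = [program, profile, outputdir, str(genome)]
        if iterations is not None:
            cmd += ["-i", str(iterations)]
        # genome.stem drops the last extension, like ${name%.*} in the shell.
        cmd += ["-e", evalue, "-c", str(cpus), "-p", genome.stem]
        commands.append(cmd)
    return commands


def run_batch(genome_dir, **options):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_commands(genome_dir, **options):
        subprocess.run(cmd, check=True)
```

Calling build_commands("genomes", iterations=3) returns the Iteration command lines without executing anything, and run_batch("genomes", iterations=3) executes them in sorted file order.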

Contact us

[email protected], Suwen Zhao
[email protected], Wei Han

genome2or's Issues

Genome2OR installed centrally causes OSError: [Errno 30] Read-only file system

The same command works if I create a personal mamba environment, but in a centrally installed mamba environment where the application directory is read-only to users, it fails with the error "OSError: [Errno 30] Read-only file system: '/apps/pkg/mambaforge/4.14/envs/genome2or-1.1.0/bin/hmmbuild'"

$ genome2or Actinopteri outputdir genome.fasta -e 1e-10 -c 16 -p prefix

    ____ _____ _   _  ___  __  __ _____ ____   ___  ____
   / ___| ____| \ | |/ _ \|  \/  | ____|___ \ / _ \|  _ \
  | |  _|  _| |  \| | | | | |\/| |  _|   __) | | | | |_) |
  | |_| | |___| |\  | |_| | |  | | |___ / __/| |_| |  _ <
   \____|_____|_| \_|\___/|_|  |_|_____|_____|\___/|_| \_|

2024-02-22 13:22:06,358 Functions.py:INFO ###genome2or program starts running###
Traceback (most recent call last):
  File "genome2or.py", line 155, in <module>
  File "genome2or.py", line 123, in main
  File "src/Functions.py", line 1830, in chmod
OSError: [Errno 30] Read-only file system: '/apps/pkg/mambaforge/4.14/envs/genome2or-1.1.0/bin/hmmbuild'
[396575] Failed to execute script 'genome2or' due to unhandled exception!
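The traceback shows the failure occurring in a chmod call on a file inside the environment's bin directory. A quick way to see whether a given installation would be affected is to test whether that file is writable by the current user. This is only a diagnostic sketch, not a fix. The helper name is hypothetical, and a chmod can still fail for other reasons, such as file ownership.

```python
import os


def is_writable(path):
    """Return True if the current user may write to an existing path.

    Returns False for missing paths and for files on a read-only
    file system.
    """
    return os.access(path, os.W_OK)


if __name__ == "__main__":
    # Path taken from the error message in this report.
    tool = "/apps/pkg/mambaforge/4.14/envs/genome2or-1.1.0/bin/hmmbuild"
    print(tool, "writable:", is_writable(tool))
```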

conda install error

conda install -c whanwei genome2or 
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.17=0
  - feature:|@/linux-64::__glibc==2.17=0

Your installed version is: 2.17

When

Stuck at Recombination hit list...

Thanks for the magic tool! However, I ran into an issue when using it on some genome assemblies: the tool gets stuck at the "Recombination hit list..." stage. Please help me figure it out. Here is an example log:

Internal pipeline statistics summary:

Query model(s): 1 (933 nodes)
Target sequences: 10042 (1643500208 residues searched)
Residues passing SSV filter: 230121808 (0.14); expected (0.02)
Residues passing bias filter: 137319152 (0.0836); expected (0.02)
Residues passing Vit filter: 23211691 (0.0141); expected (0.003)
Residues passing Fwd filter: 1005657 (0.000612); expected (3e-05)
Total number of hits: 120 (5.89e-05)

CPU time: 284.11u 0.93s 00:04:45.04 Elapsed: 00:00:03.61

Mc/sec: 424531.06

//
[ok]
284.17user 1.04system 0:03.78elapsed 7527%CPU (0avgtext+0avgdata 1324568maxresident)k
0inputs+64outputs (0major+424205minor)pagefaults 0swaps
2023-04-28 17:20:24,357 Functions.py:INFO run_nhmmer use 3.792 seconds

NHMMER program is complete.

2023-04-28 17:20:24,357 nhmmer.py:INFO ###Program nhmmer.py finish###
2023-04-28 17:20:24,407 Functions.py:INFO ###FindOR program starts running###
2023-04-28 17:20:24,408 Functions.py:INFO function platform_info() is running
The system you use is Linux.
Python3 is used.
2023-04-28 17:20:24,411 Functions.py:INFO platform_info use 0.003 seconds
Process nhmmer output file...
2023-04-28 17:20:24,411 Functions.py:INFO function proc_nhmmer_out() is running
2023-04-28 17:20:24,411 Functions.py:INFO 2 truncated gene(s) was discovered
2023-04-28 17:20:24,412 Functions.py:INFO proc_nhmmer_out use 0.001 seconds
Extract cds from genomic file...
2023-04-28 17:20:24,412 Functions.py:INFO function extract_cds() is running
2023-04-28 17:20:26,673 Functions.py:INFO extract_cds use 2.262 seconds
Find ATG and STOP codons for each sequence...
2023-04-28 17:20:26,673 Functions.py:INFO function find_cds() is running
2023-04-28 17:20:26,779 Functions.py:INFO find_cds use 0.105 seconds
Write data to file...
2023-04-28 17:20:26,786 Functions.py:INFO Merge pseudogene fragement
2023-04-28 17:20:26,787 Functions.py:INFO ###The result as follows###
2023-04-28 17:20:26,787 Functions.py:INFO Clarias_batrachus processing completed
2023-04-28 17:20:26,787 Functions.py:INFO 118 OR fragments found by nhmmer.
2023-04-28 17:20:26,787 Functions.py:INFO 91 potential functional ORs were discover.
2023-04-28 17:20:26,787 Functions.py:INFO -48 pseudogenes fragment were merged
2023-04-28 17:20:26,787 Functions.py:INFO 66 potential pseudogene ORs were discover.
2023-04-28 17:20:26,787 Functions.py:INFO 2 pseudogenes cause by too short sequence length.
2023-04-28 17:20:26,787 Functions.py:INFO 1 pseudogenes cause by insert or delect base.
2023-04-28 17:20:26,787 Functions.py:INFO 63 pseudogenes cause by contains termination codons.
2023-04-28 17:20:26,787 Functions.py:INFO ###The program finish###
2023-04-28 17:20:26,787 FindOR.py:INFO ###Program FindOR.py finish###
2023-04-28 17:20:26,831 Functions.py:INFO ###IdentityFunc program starts running###
2023-04-28 17:20:26,831 Functions.py:INFO function platform_info() is running
The system you use is Linux.
Python3 is used.
2023-04-28 17:20:26,835 Functions.py:INFO platform_info use 0.003 seconds
Process hit sequence file...
2023-04-28 17:20:26,835 Functions.py:INFO function refact_hitfile() is running
2023-04-28 17:20:26,835 Functions.py:INFO refact_hitfile use 0.000 seconds
Recombination hit list...
2023-04-28 17:20:26,835 Functions.py:INFO function refact_list() is running

FileNotFoundError: [Errno 2] No such file or directory: '../output/ORannotation_ORs_pro.fa'

The third step does not work:
I use the command:
python IdentifyFunc.py ../output/ORannotation_ORs_pro.fa ../output/ORannotation_ORs_dna.fa -o ../output -p Identity -v
But there is a problem:
2022-07-03 00:03:07,255 Functions.py:INFO ###IdentityFunc program starts running###
2022-07-03 00:03:07,256 Functions.py:INFO function platform_info() is running
The system you use is Linux.
Python3 is used.
2022-07-03 00:03:07,256 Functions.py:INFO platform_info use 0.000 seconds
Process hit sequence file...
2022-07-03 00:03:07,256 Functions.py:INFO function refact_hitfile() is running
Traceback (most recent call last):
  File "/home/HHJ/getOR/Genome2OR/scripts/IdentifyFunc.py", line 56, in <module>
    template, hit_dict = refact_hitfile(hitpro)
  File "/home/HHJ/getOR/Genome2OR/scripts/src/Functions.py", line 36, in logtimer
    temp = func(*args, **kwargs)
  File "/home/HHJ/getOR/Genome2OR/scripts/src/Functions.py", line 719, in refact_hitfile
    seqlist = ReadSampleFasta(hitfile)
  File "/home/HHJ/getOR/Genome2OR/scripts/src/Functions.py", line 662, in ReadSampleFasta
    with open(seqfile) as seqf:
FileNotFoundError: [Errno 2] No such file or directory: '../output/ORannotation_ORs_pro.fa'

In addition, there are many spelling mistakes in your documentation. Correcting them would save people some detours.
