A bioinformatics pipeline for meta-genome scale functional de novo annotation of LTR retrotransposons
Due to their enormous contribution to genome structure and genome evolution, transposable elements allow us to study fundamental mechanisms of phenotypic adaptation, diversification, and evolution. In particular, understanding how the genetic regulatory machinery recognizes and regulates transposable elements will enable us to systematically identify the key players and key processes that drive niche adaptation and species diversification on the genetic level.
The LTRpred pipeline aims to provide an integrated software framework to predict potentially functional LTR transposons in any genomic sequence of interest. First, LTRpred retrieves de novo annotations of retrotransposons via LTRharvest and LTRdigest, and second, it efficiently screens, filters, and annotates those predictions for potentially functional elements.
LTR transposons have the capacity to move to new sites in genomes through a copy-and-paste mechanism and by doing so are able to contribute generatively to genome evolution and environmental sensing on the genetic level. Hence, predicting the presence of LTR transposons within genomes as well as their capacity to perform this copy-and-paste strategy enables us to quantify the extent to which transposons shape the adaptation and evolution of life in general.
In particular, the following analyses can be performed with LTRpred:
- de novo prediction of LTR retrotransposons (nested, overlapping, or pure template) using LTRharvest and LTRdigest
- annotation of predicted LTR retrotransposons using Dfam or Repbase as reference
- solo LTR prediction based on specialized BLAST searches
- LTR retrotransposon family clustering using vsearch
- open reading frame prediction in LTR retrotransposons using usearch
- age estimation of predicted LTR retrotransposons in Mya (not implemented yet, but coming soon)
- CHH, CHG, CG, ... content quantification in predicted LTR retrotransposons
- filtering for (potentially) functional LTR retrotransposons
- quality assessment of input genomes used to predict LTR retrotransposons
- run LTRpred on entire kingdoms of life using only one command (see ?LTRpred.meta)
- perform meta genomics studies customized for LTR retrotransposons
- cluster LTR retrotransposons within and between species
- quantify the diversity space of LTR retrotransposons for entire kingdoms of life
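# install the current version of LTRpred on your system
# note: biocLite() has been retired by Bioconductor, so devtools is used here instead;
# Bioconductor dependencies may need to be installed separately via BiocManager::install()
install.packages("devtools")
devtools::install_github("HajkD/LTRpred")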
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:
https://github.com/HajkD/LTRpred/issues
The current status of the package as well as a detailed history of the functionality of each version of LTRpred can be found in the NEWS section.
This tutorial introduces users to LTRpred:
Users can also read the tutorials within RStudio:
library(LTRpred)
browseVignettes("LTRpred")
In the LTRpred framework users can find:
- LTRpred(): Major pipeline to predict LTR retrotransposons in a given genome
- LTRpred.meta(): Perform Meta-Analyses with LTRpred
- meta.summarize(): Summarize (concatenate) all predictions of an LTRpred.meta() run
- meta.apply(): Apply functions to meta data generated by LTRpred()
- LTRharvest(): Run LTRharvest to predict putative LTR Retrotransposons
- LTRdigest(): Run LTRdigest to predict putative LTR Retrotransposons
- CLUSTpred(): Cluster Sequences with VSEARCH
- cluster.members(): Select members of a specific cluster
- clust2fasta(): Export sequences of TEs belonging to the same cluster to fasta files
- AllPairwiseAlign(): Compute all pairwise (global) alignments with VSEARCH
- filter.uc(): Filter for cluster members
- SimMatAbundance(): Compute histogram shape similarity between species
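To make the idea of selecting cluster members more concrete, here is a minimal base R sketch that parses a few made-up records in the 10-column USEARCH cluster (.uc) layout and extracts the members of one cluster. It does not call any LTRpred function; the column names, the toy records, and the object names are assumptions chosen for illustration only.

```r
# Column names are illustrative labels for the 10 tab-separated .uc fields.
uc_cols <- c("type", "cluster", "size", "perc_identity", "strand",
             "unused_1", "unused_2", "alignment", "query", "target")

# Made-up records: S = seed (centroid), H = hit, C = cluster summary.
toy_uc <- c(
  "S\t0\t5200\t*\t*\t*\t*\t*\tLTR_1\t*",
  "H\t0\t5100\t97.8\t+\t0\t0\t5100M\tLTR_2\tLTR_1",
  "S\t1\t4300\t*\t*\t*\t*\t*\tLTR_3\t*",
  "C\t0\t2\t*\t*\t*\t*\t*\tLTR_1\t*",
  "C\t1\t1\t*\t*\t*\t*\t*\tLTR_3\t*"
)

# "*" marks an empty field in this format, so it is read as NA.
uc <- read.table(text = toy_uc, sep = "\t", col.names = uc_cols,
                 na.strings = "*", stringsAsFactors = FALSE)

# Seeds and hits are the per-sequence records; C rows only summarize clusters.
seq_rows <- uc$type %in% c("S", "H")

# Members of cluster 0 (cluster numbers start at 0 in this format).
members <- uc$query[seq_rows & uc$cluster == 0]
print(members)

# Number of sequences per cluster.
cluster_sizes <- table(uc$cluster[seq_rows])
print(cluster_sizes)
```

Keeping only the S and H records avoids counting each centroid twice, since the C summary rows repeat the centroid label.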
- ltr.cn(): Detect solo LTR copies of predicted LTR transposons
- cn2bed(): Write copy number estimation results to BED file format
- filter.jumpers(): Detect LTR retrotransposons that are potential jumpers
- tidy.datasheet(): Select most important columns of 'LTRpred' output for further analytics
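As a rough illustration of what filtering for potentially functional elements can look like, the base R sketch below keeps toy predictions whose two LTRs are highly similar (commonly read as a sign of recent insertion) and that contain at least one open reading frame. The data frame, its column names, the helper filter_candidates(), and the thresholds are all hypothetical; they are not the criteria or the interface used by filter.jumpers().

```r
# Hypothetical prediction table; column names are assumptions for this sketch.
toy <- data.frame(
  id = paste0("LTR_", 1:4),
  ltr_similarity = c(99.2, 85.0, 97.5, 99.8),  # % identity between 5' and 3' LTR
  n_orfs = c(2L, 0L, 1L, 0L),                  # predicted open reading frames
  stringsAsFactors = FALSE
)

# Keep elements that pass both illustrative thresholds.
filter_candidates <- function(df, min_similarity = 95, min_orfs = 1L) {
  keep <- df$ltr_similarity >= min_similarity & df$n_orfs >= min_orfs
  df[keep, , drop = FALSE]
}

candidates <- filter_candidates(toy)
print(candidates$id)
```

Exposing the thresholds as arguments makes it easy to test how sensitive the candidate set is to each cutoff.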
- read.prediction(): Import the output of LTRharvest or LTRdigest
- read.tabout(): Import information sheet returned by LTRdigest
- read.orfs(): Read output of ORFpred()
- read.seqs(): Import sequences of predicted LTR transposons
- read.ltrpred(): Import the data sheet file generated by LTRpred()
- read.uc(): Read file in USEARCH cluster format
- read.blast6out(): Read file in blast6out format generated by USEARCH or VSEARCH
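For readers unfamiliar with the blast6out format, the following base R sketch reads two made-up hit lines in the standard 12-column tabular layout and keeps the high-identity hits. It is independent of read.blast6out(); the column labels, the toy hits, and the 95% cutoff are assumptions made for this example.

```r
# Illustrative labels for the 12 tab-separated blast6out fields.
blast6_cols <- c("query", "subject", "perc_identity", "alig_length",
                 "mismatches", "gap_openings", "q_start", "q_end",
                 "s_start", "s_end", "evalue", "bit_score")

# Two made-up hits of the same query against different chromosomes.
toy_hits <- c(
  "ltr_1\tchr1\t98.5\t200\t3\t0\t1\t200\t1000\t1199\t1e-100\t360",
  "ltr_1\tchr2\t91.0\t180\t16\t0\t5\t184\t5000\t5179\t2e-60\t250"
)

hits <- read.table(text = toy_hits, sep = "\t", col.names = blast6_cols,
                   stringsAsFactors = FALSE)

# Keep only hits at or above an illustrative identity cutoff.
strong <- hits[hits$perc_identity >= 95, , drop = FALSE]
print(strong[, c("query", "subject", "perc_identity")])
```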
- pred2bed(): Format LTR prediction data to BED file format
- pred2fasta(): Save the sequence of the predicted LTR Transposons in a fasta file
- pred2gff(): Format LTR prediction data to GFF3 file format
- pred2annotation(): Match LTRharvest, LTRdigest, or LTRpred prediction with a given annotation file in GFF3 format
- pred2csv(): Format LTR prediction data to CSV file format
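One detail that matters when exporting predictions to BED is the coordinate convention: GFF3 uses 1-based, inclusive coordinates, whereas BED uses a 0-based start and an exclusive end. The base R sketch below shows that conversion on a toy table. The helper to_bed(), the input column names, and the constant score are assumptions for illustration and do not describe how pred2bed() is implemented.

```r
# Hypothetical predictions with 1-based, inclusive coordinates (as in GFF3).
preds <- data.frame(
  chromosome = c("chr1", "chr1", "chr2"),
  start = c(1001L, 50200L, 300L),
  end = c(6000L, 58000L, 4999L),
  id = c("LTR_1", "LTR_2", "LTR_3"),
  strand = c("+", "-", "+"),
  stringsAsFactors = FALSE
)

# Convert to the six-column BED layout: shift the start by one, keep the end.
to_bed <- function(df) {
  data.frame(
    chrom = df$chromosome,
    chromStart = df$start - 1L,  # 0-based start
    chromEnd = df$end,           # exclusive end equals the 1-based inclusive end
    name = df$id,
    score = 0L,                  # placeholder score
    strand = df$strand,
    stringsAsFactors = FALSE
  )
}

bed <- to_bed(preds)

# BED files are tab-separated and carry no header line.
bed_file <- tempfile(fileext = ".bed")
write.table(bed, bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
print(readLines(bed_file))
```

With this convention the element length is simply chromEnd minus chromStart.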
- ORFpred(): Open Reading Frame prediction in putative LTR transposons
- dfam.query(): Annotation of de novo predicted LTR transposons via Dfam searches
- read.dfam(): Import Dfam Query Output
- repbase.clean(): Clean the initial Repbase database for BLAST
- repbase.query(): Query the RepBase to annotate putative LTRs
- repbase.filter(): Filter the Repbase query output
- motif.count(): Low level function to detect motifs in strings
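The CHH, CHG, and CG quantification mentioned earlier comes down to counting short motifs in a sequence string. Below is a minimal base R sketch of that idea, where H stands for A, C, or T. The helper count_motif() is hypothetical and is not the signature of motif.count(); the sketch also only scans the given strand and ignores the reverse complement.

```r
# Count (possibly overlapping) occurrences of a regex motif in one string.
# count_motif() is a hypothetical helper written for this example.
count_motif <- function(seq, pattern) {
  # A zero-width lookahead lets overlapping matches be counted.
  hits <- gregexpr(paste0("(?=", pattern, ")"), toupper(seq), perl = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# Cytosine contexts written as regular expressions (H = A, C, or T).
contexts <- c(CG = "CG", CHG = "C[ACT]G", CHH = "C[ACT][ACT]")

toy_seq <- "ACGTCAGCCTACCG"
counts <- vapply(contexts, function(p) count_motif(toy_seq, p), integer(1))
print(counts)
```

Dividing each count by the sequence length would give a content fraction that can be compared across elements of different sizes.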
- plot_ltrsim_individual(): Plot the age distribution of predicted LTR transposons
- plot_ltrwidth_individual(): Plot the width distribution of putative LTR transposons or LTRs for individual species
- plot_ltrwidth_species(): Plot the width distribution of putative LTR transposons or LTRs for all species
- plot_ltrwidth_kingdom(): Plot the width distribution of putative LTR transposons or LTRs for all kingdoms
- plot_copynumber_individual(): Plot the copy number distribution of putative LTR transposons or LTRs for individual species
- plot_copynumber_species(): Plot the copy number distribution of putative LTR transposons or LTRs for all species
- plot_copynumber_kingdom(): Plot the copy number distribution of putative LTR transposons or LTRs for all kingdoms
- plotLTRRange(): Plot Genomic Ranges of putative LTR transposons
- PlotSimCount(): Plot LTR Similarity vs. predicted LTR count
- plotSize(): Plot Genome size vs. LTR transposon count
- plotSizeJumpers(): Plot Genome size vs. LTR transposon count for jumpers
- plotFamily(): Visualize the Superfamily distribution of predicted LTR retrotransposons
- plotDomain(): Visualize the Protein Domain distribution of predicted LTR retrotransposons
- plotCN(): Plot correlation between LTR copy number and methylation context
- plotCluster(): Plot correlation between Cluster Number and any other variable
- PlotInterSpeciesCluster(): Plot inter species similarity between TEs (for a specific cluster)
- PlotMainInterSpeciesCluster(): Plot inter species similarity between TEs (for the top n clusters)
- bcolor(): Beautiful colors for plots
- file.move(): Move folders from one location to another
- get.pred.filenames(): Retrieve file names of files generated by LTRpred
- get.seqs(): Quickly retrieve the sequences of a 'Biostrings' object
- ws.wrap.path(): Wrap whitespace in paths
- rename.fasta(): rename.fasta
I would like to thank the Paszkowski team for incredible support and motivating discussions that led to the realization of this project.