Coder Social home page Coder Social logo

tefinder's Introduction

TEfinder

A bioinformatics tool for detecting novel transposable element insertions

Authors: Vista Sohrab & Dilay Hazal Ayhan

TEfinder uses discordant reads to detect novel transposable element insertion events in paired-end sample sequencing data.

Software dependencies:

  • Bedtools 2.28.0 or later
  • Samtools 1.3 or later
  • Picard 2.0.1 or later

Required inputs:

  • Sample alignment file(.bam | .sam)
  • Reference genome FASTA (.fa)
  • Reference TE annotation (.gff | .gtf)
  • TEs of interest in a single column text file (.txt)

Optional arguments with default values in brackets:

  • -fis: short-read sequencing fragment insert size [400]
  • -picard: path to Picard Tools .jar file [picard.jar]
  • -md: maximum distance between reads for bedtools merge [150]
  • -k: maximum TE target site duplication (TSD) length [20]
  • -maxHeapMem: java maximum heap memory allocation for picard in MB [2000]
  • -workingdir: working directory name [TEfinder_Date]
  • -out: output format as GTF [BED]
  • -outname: output name prefix added to file names [null]
  • -threads: number of threads for samtools multi-threading [1]
  • -intermed: keep intermediate files created by pipeline [no]
  • -h: prints help option

Usage:

Default run:

TEfinder -alignment sample.bam -fa reference.fa -gtf TEs.gtf -te List_of_TEs.txt

Note: In the test_dataset folder, the alignment file used for testing is gzipped so please use gunzip sample.bam.gz prior to running TEfinder. The expected test run result (sample_TEinsertions.bed) is provided for comparison.

Example command to change fragment insert size to 500, set maximum TSD length to 30, create GTF output, and include all intermediate files:

TEfinder -alignment sample.bam -fa reference.fa -gtf TEs.gtf -te List_of_TEs.txt \
         -fis 500 -k 30 -out GTF -intermed yes

Output files:

  • TE_insertions.bed contains identified TE insertion events in sample (in the final column, FILTER attribute with "PASS" refers to high confidence insertion events while instances labeled as "in_repeat", "weak_evidence", "strand bias" or a combination of these three labels indicate less confident insertion events)
  • TE_insertions.gtf is provided with the same information as the BED file if using -out GTF
  • DiscordantReads.bam contains all discordant reads that have been identified based on the TEs of interest that have been submitted to TEfinder

Notes:

  • The sample alignment file must include read group information in the header. An example command to include read group if using BWA for alignment would be:

    bwa index reference.fasta
    bwa mem -R ‘@RG\tID:sample\tPL:illumina\tLB:LIB\tSM:sample’ \
            reference.fasta sample_R1.fq sample_R2.fq > sample.bam
    

    If alignment has already been completed without specifying read group information, Picard's AddOrReplaceReadGroups function can generate read group information provided the binary alignment file (BAM):

    java -jar picard.jar AddOrReplaceReadGroups \
        I=input.bam \
        O=sample.bam \
        RGID=sample \
        RGLB=LIB \
        RGPL=illumina \
        RGPU=unit1 \
        RGSM=sample
    
  • Please ensure that the TE names of interest are separated by newline character "\n" and not carriage return "\r". If carriage return is present, please replace all instances of "\r\n" with "\n". This issue is generally encountered by Windows users.

  • To remove simple repeats from the RepeatMasker TE annotation output file, an example command would be the following:

    grep -v -iE '(Motif\:[ATGC]+\-rich)|(Motif\:\([ATGC]+\)n)' RepeatMasker_output.gff > TEs.gtf
    
  • To create a list of TE names from the TE GTF file:

     awk -F '\t' '{print $9}' TEs.gtf | awk -F '"' '{print $2}' | sort | uniq > List_of_TEs.txt
    

    This command may not yield the desired format of TE names as it depends on the structure of the attribute column of the GTF file and would need to be modified for specific cases.

  • Modifying the maximum TSD length (-k) could be useful if there is an unexpected number of insertion events identified with the default parameter. The optimal maximum TSD length can vary across datasets.

  • Modifying the fragment insert size (-fis) based on the sequencing library preparation can be useful.

DOI

tefinder's People

Contributors

vistasohrab avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.