A set of scripts for modifying the output of Tandem Repeats Finder (TRF).
It finds candidate telomeric sequences in the NGS data output of TRF.
Tested on Ubuntu 16.04 with Python 2.7.
You can either run TRM together with TRF, starting from the .fasta files, or, if you already have NGS output data from TRF, run TRM alone.
This version is primarily used for Galaxy's toolshed repository definition, but it can also be used on the command line; just follow this README.
- Place your data in .fasta format into one folder (e.g. ./data/).
- Download Tandem Repeats Finder from https://tandem.bu.edu/trf/ and place it into this folder. If your binary is not named `trf407b.linux64` or you want to use a different path than `$PWD`, modify `iterateTRF.sh`.
  A better solution is to use conda with the env.yaml configuration file. Just call `conda env create -f ./env.yaml -n trm-env` and, after the installation process, call `source activate trm-env`.
- Change the variable `dataDir` inside `./scripts/runAllWithTRF.sh` to point to the directory with your input data. You may also want to change the default name of the output data (variable `shortName`). The same script shows the default settings of the other input parameters. They can be changed inside the script or passed on the command line as follows: `./runAllWithTRF.sh 3 4 2 7 7 80 10 50 15 2 90 0 -h`. The script creates a specific folder structure.
- Assuming you already have TRF's NGS output data, place them into the `./scripts/res/TRF_res` directory with the `.dat` extension.
- You may also change the variable `myDir` inside the `./scripts/runAllNoTRF.sh` script and place your input data into the `${myDir}/TRF_res` directory accordingly.
- This particular script has far fewer input parameters to set. They can be changed inside the script or passed on the command line as follows: `./runAllNoTRF.sh 3 4 90 0`. The script creates a specific folder structure.
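The staging step for `runAllNoTRF.sh` could be sketched in Python roughly as follows. This is only an illustration and not part of TRM: the helper name `stage_trf_output` and its arguments are hypothetical, and it assumes the layout described above, where the script reads `.dat` files from a `TRF_res` directory under `myDir`.

```python
import glob
import os
import shutil


def stage_trf_output(source_dir, my_dir):
    """Copy every *.dat file from source_dir into <my_dir>/TRF_res.

    Hypothetical helper, not part of TRM. Returns the sorted list of
    file names that were copied.
    """
    target_dir = os.path.join(my_dir, "TRF_res")
    # Create the target directory if it does not exist yet.
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    copied = []
    for path in sorted(glob.glob(os.path.join(source_dir, "*.dat"))):
        shutil.copy2(path, target_dir)
        copied.append(os.path.basename(path))
    return copied
```

For example, `stage_trf_output("/path/to/trf_output", "./scripts/res")` would copy the `.dat` files into `./scripts/res/TRF_res`; files with other extensions are left alone.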
All the input parameters are contained together in the `runAllWithTRF.sh` script, so the explanation below is taken from there (for now, they must be given in the specified order and in the right position):
- minNumberOfRepeats="3" ... min number of repeats
- minLengthOfPattern="4" ... min length of repeating pattern
- trf_match="2" ... TRF's matching weight
- trf_mism="7" ... TRF's mismatching penalty
- trf_delta="7" ... TRF's indel penalty
- trf_pm="80" ... TRF's match probability (whole number)
- trf_pi="10" ... TRF's indel probability (whole number)
- trf_min="50" ... TRF's minimum alignment score to report
- trf_max="15" ... TRF's maximum period size to report
- trf_longest="2" ... TRF's maximum TR length expected (in millions)
- readLength="90" ... for restrZeros.py
- relOccur="0" ... set the value to 1 to enable it; otherwise it is preset to 0
- trf_html="" ... TRF's html output; if you want to suppress it, change the value to '-h'
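Because the parameters are positional, it is easy to put a value in the wrong slot. The following Python sketch builds the argument list from named values in the order listed above. It is an illustration only: `build_command` is a hypothetical helper, and dropping the last argument when `trf_html` is empty is an assumption about how the script treats a missing final argument.

```python
from collections import OrderedDict

# Defaults copied from the parameter list above, in the required order.
DEFAULTS = OrderedDict([
    ("minNumberOfRepeats", "3"),
    ("minLengthOfPattern", "4"),
    ("trf_match", "2"),
    ("trf_mism", "7"),
    ("trf_delta", "7"),
    ("trf_pm", "80"),
    ("trf_pi", "10"),
    ("trf_min", "50"),
    ("trf_max", "15"),
    ("trf_longest", "2"),
    ("readLength", "90"),
    ("relOccur", "0"),
    ("trf_html", ""),
])


def build_command(script="./runAllWithTRF.sh", **overrides):
    """Return the command as a list: script followed by positional values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    params = OrderedDict(DEFAULTS)
    params.update(overrides)  # existing keys keep their original position
    args = [str(value) for value in params.values()]
    # Assumption: an empty trf_html is simply omitted from the command line.
    if args[-1] == "":
        args.pop()
    return [script] + args
```

Calling `build_command(trf_html="-h")` reproduces the example command shown earlier, and the resulting list can be passed to `subprocess.call` from the `scripts` directory.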
- res ... predefined output directory name (can be changed in the variable `myDir` in the scripts `runAllWithTRF.sh` and `runAllNoTRF.sh`)
  - parsed
    - dataset_6484_ppr.txt ... intermediate file
    - dataset_6485_ppr.txt ... intermediate file
    - dataset_6486_ppr.txt ... intermediate file
  - res
    - dataset_6484_ppr_sorted.txt ... intermediate file
    - dataset_6485_ppr_sorted.txt ... intermediate file
    - dataset_6486_ppr_sorted.txt ... intermediate file
    - joined_fixed_pairedReverseComplement_merged_sorted_FINAL.txt ... FINAL output file with reverse-complement-paired sequences of tandem repeats with number of occurrences in the input datasets
    - joined_fixed_pairedReverseComplement_merged_sorted.txt ... intermediate file
    - joined_fixed_pairedReverseComplement_merged.txt ... intermediate file
    - joined_fixed_pairedReverseComplement.txt ... intermediate file
    - joined_fixed.txt ... intermediate file
    - joined_fixed_without_pairedReverseComplement_sorted_FINAL.txt ... FINAL output file sorted according to the number of occurrences of tandem repeats in the input datasets
    - joined_fixed_without_pairedReverseComplement_sorted.txt ... intermediate file
    - joined_fixed_without_pairedReverseComplement.txt ... intermediate file
    - joined.txt ... intermediate file
  - TRF_res ... directory containing all TRF outputs (it is either filled automatically, in the case of `runAllWithTRF.sh`, or you must copy your input here, in the case of `runAllNoTRF.sh`)
    - dataset_6484.dat ... NGS data from TRF
    - dataset_6485.dat ... NGS data from TRF
    - dataset_6486.dat ... NGS data from TRF
- parsed
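To illustrate what "reverse-complement-paired" means for the FINAL output file, here is a minimal Python sketch of the general idea. It is not TRM's actual implementation: the function names are hypothetical, the input is assumed to be a plain mapping from repeat sequence to occurrence count, and choosing the alphabetically smaller strand as the representative is an arbitrary convention for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def merge_reverse_complements(counts):
    """Sum the counts of each sequence and its reverse complement.

    counts maps repeat sequence -> number of occurrences. The
    alphabetically smaller strand is used as the key of each pair.
    """
    merged = {}
    for seq, n in counts.items():
        key = min(seq.upper(), reverse_complement(seq))
        merged[key] = merged.get(key, 0) + n
    return merged
```

For instance, the vertebrate telomeric repeat `TTAGGG` and its reverse complement `CCCTAA` would be counted together under a single entry.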