danbuchan / eigen_scripts
short scripts for data munging eigenthreader
Follow these steps in order to re-generate the data for the eigenthreader paper. Some scripts get edited to point at certain directories (see steps 20, 21). Apologies to the programmers who feel this is anathema.

Everything is in Python 3.

run_dssp.py - run dssp over a dir of pdb files : DEPRECATED

1) run_strsum_eigen.py - runs the eigenthreader tdb generator over a db of pdb and dssp files
   Uses the HHSearch_overlap dir as the source of which files to generate
   Update: Scop70 1.75 12/nov/2016
   http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
   13,730 folds
2) equalise_fold_libs.py - read the HHSearch lib and delete anything that isn't in the et_data lib
   13,613 folds
4) run_blast.py - runs blastp over a dir of fasta files

Foldlibs now at:
/mnt/bioinf/archive0/eigen_thread/et_data/
/mnt/bioinf/archive0/eigen_thread/HHSearch_hmm_complete/

Sorting the fold libs:

Removing homologues from fold libs:
**** DEPRECATED: No longer removing blast relatives from the foldlibs
3) extract_fasta.py - extracts the fasta files from the HHSearch hhms
   hhsearch.fasta
5) remove_homolgues.py - removes the homologues from the fold libs found in the blast run
**** DEPRECATED

6) find_homologues.py - read the scop classification and identify the SCOP structural homologues for the PSICOV benchmark set
   pdb_scop_class.txt: pdb Domain ID to SCOP class for ALL classified pdb chains
   scop_class_members.txt: SCOP class to a list of domains included in that class
   scop_superfamily_members.txt: SCOP superfamily to list of domain ids
   non_redundant_list.txt: A list of all the benchmark pdb ids AND ids of their immediate homologues
   non_redundant_list_superfamily.txt: as above but superfamily members
   benchmark_family_members.txt : scop family members for the benchmark set
   benchmark_superfamily_members.txt :
   benchmark_domain_id.txt : scop domain IDs for the benchmark set

7) run_eigen_threader_morecambe.py: run it changing the t value from 1 to 20 (c=10)
8) run_eigen_threader_morecambe.py: run it changing the c value from 1 to 20 (t=20)
   Results currently on morecambe2
   /home/dbuchan/eigen_benchmark/results/

9) Walk through the HHSearch_build.txt instructions to generate the HHSearch comparison data
   HHSearch/HHBlits library at /scratch1/NOT_BACKED_UP/dbuchan/HHSearch_eigen/
   Results at /scratch0/NOT_BACKED_UP/dbuchan/hhresults/paper_revision/

10) Then run genthreader with the et_foldlib lib: run_genthreader.py. Edit GenThreader.sh for the fold lib and list

    for i in `ls /scratch1/NOT_BACKED_UP/dbuchan/eigenthreader/seq_files/ | sed -e 's/\.fasta//'`; do echo $i; ./GenThreader.sh -i /scratch1/NOT_BACKED_UP/dbuchan/eigenthreader/seq_files/$i.fasta -j $i -s; done

    Results at: /scratch0/NOT_BACKED_UP/dbuchan/Applications/genthreader/results_new/

Process benchmark data:

11) count_eigenvector_benchmark.py - takes the eigen_vector benchmark data and outputs the t1, t2, t5 and t10 performances (skipping the immediate structural homologues as given by SCOP). The t and c values are changed by editing the script. To ensure we are only looking at good models, an overlap of 40% is required
12) count_performance.py - work out the t1, t2, t5 and t10 performance for graphing. The c and t values are changed by editing the script
13) draw_benchmark_performance_graph.R - takes the eigenvector performance numbers and the distance numbers and plots some graphs

Check graphs and then run optimised eigenthreader!!!!

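The t1, t2, t5 and t10 counting in steps 11 and 12 can be pictured with a small sketch. This is not the code in count_eigenvector_benchmark.py or count_performance.py. It assumes that "tN" means the fraction of queries with at least one correct fold among the top N hits once immediate homologues are skipped. The function name `top_n_performance` and its input dictionaries are hypothetical, and the 40% overlap filter is not modelled here.

```python
def top_n_performance(ranked_hits, correct, excluded, cutoffs=(1, 2, 5, 10)):
    """Fraction of queries with a correct hit in the top N, for each N.

    ranked_hits: dict of query id -> list of hit ids, best hit first
    correct:     dict of query id -> set of hit ids counted as correct folds
    excluded:    dict of query id -> set of hit ids to skip
                 (assumed here to be the immediate homologues)
    """
    counts = {n: 0 for n in cutoffs}
    for query, hits in ranked_hits.items():
        skip = excluded.get(query, set())
        good = correct.get(query, set())
        # Drop the excluded hits before ranks are counted
        filtered = [h for h in hits if h not in skip]
        for n in cutoffs:
            if any(h in good for h in filtered[:n]):
                counts[n] += 1
    total = len(ranked_hits)
    return {n: (counts[n] / total if total else 0.0) for n in cutoffs}


# Toy data: q1 has its homologue h1 removed, so "a" becomes the top hit;
# q2 only finds its correct fold "c" at rank 3.
example_hits = {"q1": ["h1", "a", "b"], "q2": ["x", "y", "c"]}
example_correct = {"q1": {"a"}, "q2": {"c"}}
example_excluded = {"q1": {"h1"}}
example_perf = top_n_performance(example_hits, example_correct, example_excluded)
```

Filtering before truncating to the top N means a removed homologue does not use up a rank; whether the real scripts do it in that order is an assumption.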
14) count_comparison_perfomance.py - takes the genth, hhsearch and optimal eig results and outputs the best 10 results for each method
    see /processed_comparison
15) sum_comparison_performance.py - work out the t1, t2, t5 and t10 performance for each method
16) count_gdt_perfomance.py - takes the genth, hhsearch and optimal eig models and calculates the min modelling performance and variance
17) draw_coll mparison.R - create barchart of the fold recognition comparison performance

### AND THERE'S MORE

18) subset_contacts.py - takes the metapsicov contact files and outputs L/2, L/5 and L/10 subsets, both top-ranked and random, of the long range contacts (separation > 21 residues)
19) run_eigen_threader.py - takes the query input files and outputs an eigenthreader prediction, edited to take the subset contacts and output to an appropriate subset output dir
20) count_eigenvector_benchmark.py : change the path to the .out files
21) count_performance.py : change paths for the random/top .top files
22) draw_top_performance_graph.R - draws the graphs for the top L and random L charts

### TO DO

Count the number of folds that could be hit in the fold lib for each domain

#### Adjunct

For eigenthreader optimisation only

filter_eigen_output.py : grab the top 50 hits for each run, filtering out family and superfamily matches, append the first correct analogous fold if not in the top 50.
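The contact subsetting described in step 18 could look roughly like the sketch below. It is not the contents of subset_contacts.py. It assumes contacts are already parsed into `(i, j, score)` tuples, that L is the sequence length, that L/2, L/5 and L/10 are rounded down, and that the random subsets are drawn from the same long range pool as the top subsets. The function name `subset_contacts` and its parameters are hypothetical.

```python
import random


def subset_contacts(contacts, seq_len, min_sep=21, divisors=(2, 5, 10), seed=0):
    """Build top-scoring and random L/n subsets of long range contacts.

    contacts: list of (i, j, score) tuples with residue indices i and j
    seq_len:  sequence length L
    min_sep:  a contact is kept only if |i - j| > min_sep
    """
    # Keep only the long range contacts
    long_range = [c for c in contacts if abs(c[1] - c[0]) > min_sep]
    # Highest scoring contacts first
    ranked = sorted(long_range, key=lambda c: c[2], reverse=True)
    rng = random.Random(seed)  # fixed seed so the random subsets are repeatable
    subsets = {}
    for d in divisors:
        # Never ask for more contacts than are available
        n = min(seq_len // d, len(ranked))
        subsets[f"top_L{d}"] = ranked[:n]
        subsets[f"random_L{d}"] = rng.sample(ranked, n)
    return subsets


# Toy data: (3, 10, 0.99) has a separation of 7, so it is dropped
# despite having the best score.
example_contacts = [(1, 30, 0.9), (2, 40, 0.8), (3, 10, 0.99), (5, 50, 0.7), (4, 60, 0.6)]
example_subsets = subset_contacts(example_contacts, seq_len=20)
```

Drawing the random subsets at the same size as the top subsets keeps the two directly comparable in the top L versus random L charts of step 22.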