
#-------------------------------------------------------------------------------
This directory contains the bash and R scripts (the R package "e1071" is required) for training the HARE model and assigning HARE, as described in Fang et al. (in preparation). The scripts assume the LSF batch job system; for other job schedulers, the submission commands in the bash scripts must be modified, along with "LSB_JOBNAME" (line 23) and "LSB_JOBINDEX" (line 25) in "svm_fold.sh".
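As a minimal sketch of such a port, the LSF variables can be read with scheduler-independent fallbacks; SLURM_JOB_NAME and SLURM_ARRAY_TASK_ID are SLURM's counterparts of LSB_JOBNAME and LSB_JOBINDEX, and the defaults ("svm1" and 1) are assumptions for illustration only:

```shell
# Read the job name and array index from LSF if present, otherwise from
# SLURM, otherwise fall back to illustrative defaults (assumptions).
job_name="${LSB_JOBNAME:-${SLURM_JOB_NAME:-svm1}}"
job_index="${LSB_JOBINDEX:-${SLURM_ARRAY_TASK_ID:-1}}"
echo "running array element ${job_index} of job ${job_name}"
```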

Contact: Huaying Fang ([email protected])
Date: 20190213
#-------------------------------------------------------------------------------
hare_batch.sh: The main driver shell script. It outlines a five-step procedure that selects the SVM tuning parameters through cross-validation and then assigns HARE. The five steps must be run sequentially; steps 2 and 4 each spawn multiple jobs.
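The five-step flow can be sketched as follows. The step 1-4 script names come from this README, but the step-5 script name and the arguments are placeholders (assumptions), so the plan is printed rather than executed:

```shell
# Sketch of the sequence hare_batch.sh drives; the exact invocations inside
# hare_batch.sh may differ. Steps 2 and 4 spawn parallel jobs, so steps 3
# and 5 must wait for them to finish before running.
plan='Rscript para1.R       # step 1: write the coarse grid to tmp/para1.csv
bash svm_fold.sh round1     # step 2: one CV job per fold x grid point
bash svm1summ.sh            # step 3: after step-2 jobs finish -> tmp/para2.csv
bash svm_fold.sh round2     # step 4: second round of CV on the finer grid
bash svm2summ.sh            # step 5: pick the optimum, assign HARE (assumed name)'
printf '%s\n' "$plan"
```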

/********Input Data********/
The directory "input" contains example data. The input files are (1) a self-reported race/ethnicity (SIRE) file, "pop1_sire.txt", and (2) a principal-component (PC) file, "pop1_30pcs.txt". See input/ for examples. In the SIRE file, individuals with missing or inconsistent SIRE should be coded as NA. Both the SIRE and PC files must contain a column named "IID".
These two files are used by demo.R to generate an R data file, "input_sirePC.rdat", containing a data.frame object "data_svm" for the subsequent steps.
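A quick sanity check before running demo.R is to verify that the required "IID" column is present. The snippet below builds a tiny stand-in SIRE file (in a throwaway "input_demo/" directory, an assumption, so the real input/ is untouched) and checks its header:

```shell
# Create a minimal tab-delimited stand-in for pop1_sire.txt; note the NA
# coding for missing/inconsistent SIRE, as the README requires.
mkdir -p input_demo
printf 'IID\tsire\nID001\tgroup1\nID002\tNA\n' > input_demo/pop1_sire.txt
# Split the header row into one field per line and look for "IID" exactly.
head -1 input_demo/pop1_sire.txt | tr '\t' '\n' | grep -qx 'IID' \
  && echo "IID column found" || echo "IID column missing"
```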

/********HARE Steps********/
Step 1: Run para1.R to set up a coarse grid for the tuning parameters. The SVM has two tuning parameters; they are selected through a coarse grid search, followed by a second search on a finer grid. This first step generates the list of parameter combinations that make up the coarse grid.
tmp/para1.csv: example input parameter list file, generated by para1.R.
The range and step size of the grid are specified in para1.R.
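The grid file can be pictured as every combination of the two parameters, one row each. For an RBF-kernel SVM in e1071 the two tuning parameters are typically cost and gamma; those names and the ranges below are assumptions (the real grid lives in para1.R), sized to match the 5 x 6 coarse grid mentioned under step 2:

```shell
# Write an illustrative coarse grid to a throwaway file (assumed names/values).
out=tmp_demo/para1_example.csv
mkdir -p tmp_demo
echo "cost,gamma" > "$out"
for c in 0.1 1 10 100 1000; do            # 5 cost values
  for g in 1e-4 1e-3 1e-2 1e-1 1 10; do   # 6 gamma values -> 5x6 = 30 rows
    echo "$c,$g" >> "$out"
  done
done
wc -l < "$out"   # 31 lines: header + 30 grid points
```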

Step 2: Run the first round of tuning-parameter selection on the coarse grid: train an SVM at each parameter combination using five-fold CV.
svm_fold.sh calls svm_fold.R, which trains an SVM at one parameter combination on one training/testing split and outputs the testing accuracy. The per-job output files are written to tmp/svm1.
This step runs nfold * ngrid_points jobs in parallel (in our setup, 5 x (5x6) = 150 jobs).
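On LSF, the 150 jobs would typically be submitted as one job array, e.g. `bsub -J "svm1[1-150]" ...` (the actual submission line lives in svm_fold.sh). Each array element then decodes its index into a (fold, grid point) pair; the decoding scheme below is an assumption for illustration:

```shell
# Decode a 1-based array index into fold (1..5) and grid point (1..30),
# the way an LSB_JOBINDEX value could be interpreted (assumed layout:
# index runs fastest over folds).
for idx in 1 75 150; do                 # sample indices
  fold=$(( (idx - 1) % 5 + 1 ))         # fold 1..5
  grid=$(( (idx - 1) / 5 + 1 ))         # grid point 1..30
  echo "index ${idx}: fold ${fold}, grid point ${grid}"
done
```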

Step 3: Use the results of the first grid search to narrow the range for the second.
svm1summ.sh: Aggregates the CV accuracy across all folds and compares accuracy across grid points to narrow to a smaller region for the second, finer grid search. It then writes the second-round input parameters (tmp/para2.csv, analogous to tmp/para1.csv).
This script must wait until all step-2 jobs have finished.
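There are two common ways to enforce that wait. On LSF a dependency expression handles it at submission time, e.g. `bsub -w 'done("svm1[1-150]")' bash svm1summ.sh` (the job name "svm1" is an assumption, chosen to match the tmp/svm1 output folder). A scheduler-independent fallback is to count the per-job output files before summarizing:

```shell
# Count step-2 output files under tmp/svm1; in the example setup the
# expected count is 150. ls on a missing directory yields a count of 0.
expected=150
count=$(ls tmp/svm1 2>/dev/null | wc -l)
echo "finished ${count}/${expected} step-2 jobs"
```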

Step 4: Run the second round of tuning-parameter selection on the finer grid: train an SVM at each parameter combination using five-fold CV.
This step calls svm_fold.sh again, just with different parameters.

Step 5: Analogous to step 3, aggregate the second round of grid search to find the optimal tuning parameters. Using these parameter values, HARE is assigned to all individuals.
This script must wait until all step-4 jobs have finished.

/*********Output***********/
The output directory "output/" for HARE includes 2 files. "HARE_output.txt" holds the HARE assignments in 3 columns: "IID", "sire", and "hare". The accompanying R data file includes 4 R objects: "data_hare" (the HARE assignments), "mod_svm" (the SVM model trained on individuals with SIRE), "P1P2Psire" (the probability ratios and L1 class), and "pred_svm" (the probability predictions for all individuals).

#-------------------------------------------------------------------------------
The directory "tmp" is a temporary folder holding cross-validation (CV) information. "para1.csv" and "para2.csv" are the parameter files for the first and second CV rounds. "summ1.csv" and "summ2.csv" are the CV accuracies for the first and second rounds. "svm1raw.csv" and "svm2raw.csv" collect the per-job prediction accuracies from "svm1/" and "svm2/".
