Coder Social home page Coder Social logo

gags's Introduction

GAGS

Genetic Algorithm for Group Selection

Written by Andrew Wood - [email protected] - Genetics of Complex Traits - University of Exeter

Purpose

GAGS is a C++ program created to sample data with specific means and standard deviations from large datasets. The original purpose was to select phenotypic distributions from large scale studies, such as the UK Biobank.

Compiling

GAGS can be compiled using the 'make' file provided

make 

Installation can occur with root permissions or by manually changing the Makefile.

make install

Input Files

  • Phenotype File (required). A tab delimited file containing ordered columns for ID and phenotypic values
ID    PHENO
A     178
B     170
C     167
D     169
...
  • Exclusion File (optional): A list of IDs (without header) for which you do not want the associated phenotypic values to be used by the algorithm and included in the output.

Runnings GAGS

Options

A full list of options can be found by simply typing 'gags' at the command line:

gags  --pheno      -p  [phenotype file]
      --ns         -n  [list of group sizes]
      --means      -m  [list of respective means]
      --sds        -s  [list of respective SDs]
      --seed       -r  [seed - default 10000]
      --mrate      -y  [max mutation rate - default 0.01]
      --popsize    -z  [initial solutions - default 10]
      --exclusions -e  [optional: file of IDs to exclude]
      --iterations -i  [max iterations - default 50000]
      --out        -o  [output file prefix]

Running GAGS

To extract a subset of 1000 people from the a phenotype file with mean = 90 and sd = 15:

gags --pheno example_data/ExamplePheno.txt --ns 1000 --means 90 --sds 15 --out MyPhenoSubset.txt

To extract a subset of 1000 people with mean = 90 and sd = 15 and a subset of 1000 people with mean = 110 and SD = 12:

gags --pheno example_data/ExamplePheno.txt --ns 1000,1000 --means 90,110 --sds 15,12 --out MyPhenoSubset.txt

To create multiple non-overlapping distributions then the order of N, mean and SD values need to be respective of each other on the command line.

To repeat the sampling process to create series of non-identical solutions you should use the --seed flag:

gags --pheno example_data/ExamplePheno.txt --ns 1000,1000 --means 90,110 --sds 15,12 --out MyPhenoSubset1.txt --seed 10
gags --pheno example_data/ExamplePheno.txt --ns 1000,1000 --means 90,110 --sds 15,12 --out MyPhenoSubset2.txt --seed 20
gags --pheno example_data/ExamplePheno.txt --ns 1000,1000 --means 90,110 --sds 15,12 --out MyPhenoSubset3.txt --seed 30
...

Note that the algorithm will aim to match the same precision as provided on the command line after rounding.

Output File

The output file will contain the group ID individual ID and phenotypic value

Group ID    Phenotype
1     A     178
1     C     167
...
2     B     170
2     D     169
...

Fine-tuning the algorithm

Depending on the mean and SD required, it may be necessary to adjust the initial chromosome population size, maximum mutation rate and the maximum number of iterations if it becomes difficult to find a solution. The error for the best solution is output to the console (or log files for HPC). Therefore, you should run GAGS with several tunings of the parameters and compare the error to determine which seems to be the most effective in getting closer to the solution you want. Additional releases of this software will automate this.

Asking for the impossible

The ability to find a desired set that meets the distribution requirements is dependent on how far away the mean and SD are from the full dataset. Be aware that the further away, the smaller the N as a proportion needs to be in order to maximise the chances of obtaining a solution. For example, if the phenotype mean = 200 and SD = 5 in a dataset of 500,000, then it will be hard to find a solution for a subset of 499,999 with mean=150 and SD=4. More guidance on this will be available here shortly.

Note also, the current version is able to sample up to two non-overlapping groups simultaneously with specific means and SDs in an efficient manner when the total N across required subsets = total N in the phenotype file. Creating additional groups simultaneously requires the total N of the subsets to be less than the sample size available in the phenotype file. More updates will be made to cater for this.

Contact

Please email me any questions or comments to [email protected]

Reference

If you use GAGS, please cite:

Tyrrell et al. Gene-obesogenic environment interactions in the UK Biobank study, Int J Epidemiol. 2017 Apr 1;46(2):559-575

gags's People

Contributors

drarwood avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.