This repository contains the refpop R scripts, for achieving reproducible data analysis and results, according to the FAIR principles

Home Page: https://www.nature.com/articles/s41438-020-00408-8

License: GNU General Public License v3.0

R 97.06% Shell 2.94%
genomic-prediction heritability-estimation genotype-environment-management-analysis genotype-environment-management-modeling pedigree-population-structure-analysis phenotype-population-structure-analysis genotype-population-structure-analysis phenotype-outlier-detection haplotype-association-mapping

refpop

Licence, status and metrics

License: GPL v3 | Lifecycle: Experimental | Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Languages and technologies

R, Shell script, Python, Plotly, RStudio

Objective, automated tasks & analyses, and instructions:

🎯 Objective

This repository hosts R scripts for characterizing the refpop population. The scripts are designed to make the data analysis and results reproducible, following the FAIR principles:

  • Findable

  • Accessible

  • Interoperable

  • Reusable

🔄 Automated tasks & analyses

📂 Repository overview

The refpop/ repository hosts R scripts that aim to automate the following tasks and analyses to the best of their ability:

  • Phenotype outlier detection: Intended to automatically produce a clean phenotype dataset, i.e., one that is free of outliers, using robust multivariate and knowledge-based rule methods

  • Spatial heterogeneity correction: Aims to correct spatial heterogeneity for each trait in the cleaned phenotype dataset, then to identify environments with low heritability using a robust univariate outlier detection method and exclude them from further analysis. For each trait, heritability distributions across all environments are computed before and after spatial heterogeneity correction; these distributions are also computed by management type

  • Adjusted ls-means computation: Focuses on computing adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment

  • Data structure inference: Focuses on inferring the genomic, phenotypic, and pedigree data structures of the refpop population

  • Genomic prediction accuracy evaluation: Seeks to evaluate, for each trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means

  • G x E x M evaluation: Directed towards evaluating, for each trait, the interactions between genotypes, their environments and management types

β”” πŸ“ Repository details

The refpop/ repository contains three main folders that are central to the tasks and analyses performed by the refpop/ R scripts:

  • src/: Houses the R scripts

  • data/: Contains the data sourced, processed and analyzed by the R scripts. Note that some processed data, such as the clean phenotype dataset for example, are written and stored in this folder

  • results/: Stores the results generated by the R script tasks and analyses

    Furthermore, the R scripts in src/ are distributed across several subfolders: refpop_data_structure_analysis/, refpop_data_treatment_and_analysis/, refpop_gem_interaction_analysis/ and refpop_predictive_modeling_and_analysis/. These folders are organized according to the specific tasks and analyses performed by the user, such as general data treatment and analyses like outlier identification and removal, spatial heterogeneity correction, and adjusted ls-means, as well as more specialized tasks like genomic prediction modeling and analysis, etc.

  • 🧩 Interdependencies of scripts for analyses

    Most scripts rely on the preprocessed data generated by the scripts in refpop_data_treatment_and_analysis/. Therefore, it is strongly recommended to execute the data treatment steps first before proceeding with any further analyses. Otherwise, there are no interdependencies between the R scripts.

  • 📜 Details of scripts analyses

    The analyses performed by the different R scripts in the subfolders of src/ are described as follows:

    • refpop_data_treatment_and_analysis/

      • refpop_0_phenotype_outlier_detection_per_env.R: This script conducts outlier detection for phenotypes using two methods.
        The first method employs robust multivariate outlier detection, which considers the covariance structure among phenotypic measurements of traits. This method utilizes the Mahalanobis distance (MD) with the minimum covariance determinant (MCD) estimator, which limits the influence of contaminating points (i.e., outliers) on the estimated location and scatter during the detection process. Given that MD can be sensitive to the curse of dimensionality, resulting in diminished outlier detection accuracy with an increasing number of variables, a principal component analysis (PCA) is performed beforehand to reduce dimensionality.
        The second approach is a knowledge-based rule univariate method that applies specific criteria according to the refpop phenotyping protocol:

        • An outlier is identified if the sample size exceeds 20
        • An outlier is identified if the sample size exceeds the total number of fruits per tree (at harvest date)
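
        The two detection methods described above can be sketched in R as follows. This is only a minimal illustration under stated assumptions, not the code of refpop_0_phenotype_outlier_detection_per_env.R: the function names, the number of retained principal components (`n_pc`), the chi-squared cutoff (`alpha`) and the argument names of the rule-based check are all hypothetical. The MCD estimate comes from `MASS::cov.rob()`, used here as a stand-in for whichever MCD implementation the script relies on.

        ```r
        library(MASS)  # cov.rob() offers an MCD estimator

        # Hypothetical helper: flag multivariate outliers among the rows of a trait matrix.
        detect_md_outliers <- function(traits, n_pc = 2, alpha = 0.975) {
          # 1. PCA on scaled traits to reduce dimensionality before computing distances
          scores <- prcomp(traits, center = TRUE, scale. = TRUE)$x[, seq_len(n_pc), drop = FALSE]
          # 2. Robust location and scatter from the minimum covariance determinant estimator
          mcd <- MASS::cov.rob(scores, method = "mcd")
          # 3. Squared robust Mahalanobis distance of every observation
          d2 <- mahalanobis(scores, center = mcd$center, cov = mcd$cov)
          # 4. Flag points beyond a chi-squared quantile (an assumed, commonly used cutoff)
          d2 > qchisq(alpha, df = n_pc)
        }

        # Hypothetical helper for the two knowledge-based rules listed above.
        flag_rule_outliers <- function(sample_size, total_fruits) {
          sample_size > 20 | sample_size > total_fruits
        }

        # Simulated phenotypes (200 trees, 4 traits) with one gross outlier in row 1
        set.seed(1)
        pheno <- matrix(rnorm(200 * 4), ncol = 4)
        pheno[1, ] <- c(15, -15, 15, -15)
        flags <- detect_md_outliers(pheno)
        which(flags)
        ```

        Because the MCD estimate is computed from the most concentrated subset of points, the contaminated row has little effect on the estimated centre and covariance, which is what keeps its distance large.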

      • refpop_1_spat_hetero_correct_per_env_trait_and_h2_estim.R: For each trait, this script applies spatial heterogeneity correction to the cleaned phenotype dataset and computes the distributions of heritability values, which are derived from estimates in each environment, both before and after the correction. It also calculates the distributions of heritabilities by management type before and after the correction. The spatial heterogeneity correction is carried out using the spatial analysis of field trials with splines (SpATS). Additionally, this script identifies environments with low heritability using the median absolute deviation (MAD) as a one-sided test and excludes them from further analyses and computations, such as the adjusted least-squares means (ls-means) of phenotypes or the heritability computed using pooled data from all environments.
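
        The one-sided MAD test used to find low-heritability environments can be illustrated with base R. This is a sketch under assumptions: the function name, the environment names, the heritability values and the threshold `k = 2.5` are hypothetical and are not taken from the refpop script.

        ```r
        # Hypothetical helper: flag environments whose heritability is unusually low.
        flag_low_h2 <- function(h2, k = 2.5) {
          centre <- median(h2, na.rm = TRUE)
          spread <- mad(h2, na.rm = TRUE)  # mad() rescales by 1.4826 by default
          h2 < centre - k * spread         # one-sided: only the lower tail is tested
        }

        # Made-up per-environment heritabilities for a single trait
        h2 <- c(env_a = 0.82, env_b = 0.78, env_c = 0.85,
                env_d = 0.80, env_e = 0.25, env_f = 0.79)
        low <- flag_low_h2(h2)
        names(h2)[low]  # "env_e"
        ```

        Testing only the lower tail means that environments with unusually high heritability are kept, which matches the stated goal of excluding poorly informative environments only.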

      • refpop_2_adjusted_lsmeans_phenotype.R: This script computes the adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment. The least-squares means (ls-means) are computed after fitting a multiple linear regression model that includes genotype and environment effects as covariates, along with the overall mean.
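
        As an illustration of the ls-means computation described above, the following base R sketch fits an additive genotype + environment model and averages the fitted values of each genotype over all environments. The function name, the column names (`genotype`, `environment`, `phenotype`) and the toy data are assumptions made for the example.

        ```r
        # Hypothetical helper: ls-means per genotype from an additive linear model.
        compute_lsmeans <- function(df) {
          df$genotype <- factor(df$genotype)
          df$environment <- factor(df$environment)
          fit <- lm(phenotype ~ genotype + environment, data = df)
          # Predict every genotype in every environment, then average over environments
          grid <- expand.grid(genotype = levels(df$genotype),
                              environment = levels(df$environment))
          grid$pred <- predict(fit, newdata = grid)
          tapply(grid$pred, grid$genotype, mean)
        }

        # Unbalanced toy trial: genotype g3 was not observed in environment e2
        trial <- data.frame(
          genotype    = c("g1", "g1", "g2", "g2", "g3"),
          environment = c("e1", "e2", "e1", "e2", "e1"),
          phenotype   = c(10, 12, 11, 13, 14)
        )
        lsm <- compute_lsmeans(trial)
        lsm
        ```

        In this toy trial the raw mean of g3 is 14, while its ls-mean is 15, because the model accounts for the environment in which g3 is missing. That adjustment is the reason for preferring ls-means over raw means when the design is unbalanced.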

      Note that the R scripts in this subfolder are prefixed with refpop_0, refpop_1, and refpop_2, indicating their sequential execution order.

    • refpop_data_structure_analysis/

      • refpop_genomic_data_structure_analysis.R: This script performs structure analyses for the refpop genomic data using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
      • refpop_pedigree_and_phenotype_data_structure_analysis.R: This script performs structure analyses for the refpop pedigree data, phenotype data, and their combination, using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
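
        The PCA part of such a structure analysis can be sketched with base R alone. UMAP is not shown because it needs an add-on package. The function name, the 0/1/2 marker coding and the simulated two-group population below are assumptions for illustration only.

        ```r
        # Hypothetical helper: PCA coordinates and variance explained for a genotype matrix
        # (individuals in rows, markers in columns).
        summarise_structure <- function(geno, n_pc = 2) {
          geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]  # drop monomorphic markers
          pca <- prcomp(geno, center = TRUE, scale. = FALSE)
          var_explained <- pca$sdev^2 / sum(pca$sdev^2)
          list(coords = pca$x[, seq_len(n_pc), drop = FALSE],
               var_explained = var_explained[seq_len(n_pc)])
        }

        set.seed(42)
        # Two simulated groups of 30 individuals with different allele frequencies at 50 markers
        freq_a <- runif(50, 0.1, 0.4)
        freq_b <- runif(50, 0.6, 0.9)
        geno <- rbind(
          matrix(rbinom(30 * 50, 2, rep(freq_a, each = 30)), nrow = 30),
          matrix(rbinom(30 * 50, 2, rep(freq_b, each = 30)), nrow = 30)
        )
        res <- summarise_structure(geno)
        plot(res$coords, col = rep(c("steelblue", "tomato"), each = 30), pch = 19)
        ```

        With allele frequencies this far apart, the first principal component is expected to separate the two simulated groups.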

    • refpop_predictive_modeling_and_analysis/

      • refpop_genomic_prediction_and_analysis_trait.R: This script evaluates, for a defined trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means. These distributions are evaluated using K-fold cross-validation, for n shuffling scenarios of the refpop population, and using the Pearson correlation as a measure of accuracy between the predicted and observed adjusted phenotypic ls-means. The implemented prediction methods are random forest (RF), support vector regression (SVR), reproducing kernel Hilbert space regression (RKHS), genomic BLUP (GBLUP) and least absolute shrinkage and selection operator (LASSO).
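
        The evaluation scheme described above (K-fold cross-validation repeated over several shuffles, with Pearson correlation as accuracy) can be sketched as follows. This is a simplified illustration, not the refpop implementation: the function names are hypothetical, the predictor is a GBLUP-style ridge solution with a fixed shrinkage parameter `lambda` (which would normally be estimated), and computing one correlation per shuffle from the pooled out-of-fold predictions is an assumption.

        ```r
        # Simplified GBLUP-style predictor with a fixed shrinkage parameter (assumption).
        gblup_predict <- function(geno_train, y_train, geno_test, lambda = 1) {
          p <- ncol(geno_train)
          mu <- mean(y_train)
          k_train <- tcrossprod(geno_train) / p            # relationship matrix, train x train
          k_test <- tcrossprod(geno_test, geno_train) / p  # relationship matrix, test x train
          alpha <- solve(k_train + lambda * diag(nrow(k_train)), y_train - mu)
          as.numeric(mu + k_test %*% alpha)
        }

        # Hypothetical helper: one accuracy (Pearson r) per shuffle of the population.
        cv_accuracy <- function(geno, y, k = 5, n_shuffles = 10) {
          n <- length(y)
          sapply(seq_len(n_shuffles), function(s) {
            folds <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
            pred <- numeric(n)
            for (f in seq_len(k)) {
              test <- folds == f
              pred[test] <- gblup_predict(geno[!test, , drop = FALSE], y[!test],
                                          geno[test, , drop = FALSE])
            }
            cor(pred, y, method = "pearson")
          })
        }

        # Simulated data: 200 individuals, 50 centred markers, additive trait plus noise
        set.seed(123)
        n <- 200
        p <- 50
        geno <- scale(matrix(rbinom(n * p, 2, 0.5), nrow = n), scale = FALSE)
        y <- as.numeric(geno %*% rnorm(p) + rnorm(n, sd = 2))
        acc <- cv_accuracy(geno, y, k = 5, n_shuffles = 10)
        summary(acc)
        ```

        Reshuffling the fold assignment yields a distribution of accuracies rather than a single value, so prediction methods can be compared on both their typical accuracy and its variability.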

    • refpop_gem_interaction_analysis/

      • script_1.R:
      • script_2.R:
      • ..........:

💻 Instructions

Download the refpop repository into the current user's directory on a computing cluster or personal computer using one of the following commands:

  • git clone git@github.com:ljacquin/refpop.git

    or
  • git clone https://github.com/ljacquin/refpop.git

⚠️ Make sure git is installed beforehand; if not, install it with sudo apt install git.

  • Assuming R ≥ 4.1.2 is already installed, use the following command to install the R libraries required by refpop:

    • R -q --vanilla < src/requirements.R

  • Within the refpop folder, run the following command to make the execution script executable:

    • chmod u+rwx execute_refpop_tasks_and_analyses.sh

  • Replace the data/ folder with the directory found at https://data :

  • Finally, run one of the following commands to execute the refpop tasks and analyses:

    • sbatch execute_refpop_tasks_and_analyses.sh

      or
    • ./execute_refpop_tasks_and_analyses.sh (i.e., interactive execution)

⚠️ The tasks and analyses performed by the R scripts in the refpop repository can be run in either Unix/Linux or Windows environments, as long as R and the necessary libraries are installed. For local computations in RStudio, ensure that the computation_mode variable is set to "local" in the R scripts located in src/. With this setting, each R script can be run on its own in RStudio, provided the required sequential execution order is respected.

References

  • Jung, Michaela, et al. "The apple REFPOP—a reference population for genomics-assisted breeding in apple." Horticulture Research 7 (2020).

  • Leys, Christophe, et al. "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median." Journal of experimental social psychology 49.4 (2013): 764-766.

  • Hubert, Mia, and Michiel Debruyne. "Minimum covariance determinant." Wiley interdisciplinary reviews: Computational statistics 2.1 (2010): 36-43.

  • Higham, Nicholas J. "Computing the nearest correlation matrix—a problem from finance." IMA Journal of Numerical Analysis 22.3 (2002): 329-343.

  • Breiman, Leo. "Random forests." Machine learning 45 (2001): 5-32.

  • Smola, Alex J., and Bernhard Schölkopf. "A tutorial on support vector regression." Statistics and Computing 14 (2004): 199-222.

  • Jacquin L, Cao T-V and Ahmadi N (2016) A Unified and Comprehensible View of Parametric and Kernel Methods for Genomic Prediction with Application to Rice. Front. Genet. 7:145. doi: 10.3389/fgene.2016.00145

  • Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996): 267-288.
