refpop

Licence, status and metrics

Languages and technologies

Objective, automated tasks & analyses, and instructions:

🎯 Objective

This repository hosts R scripts designed to ensure reproducible data analysis and results for characterizing the refpop population, in accordance with the FAIR principles:

- Findable
- Accessible
- Interoperable
- Reusable

🔄 Automated tasks & analyses

📂 Repository overview

The refpop/ repository hosts R scripts that automate, as far as possible, the following tasks and analyses:

- Phenotype outlier detection: Automatically produces a clean phenotype dataset, i.e., one that is free of outliers, using robust multivariate and knowledge-based rule methods.
- Spatial heterogeneity correction: Corrects spatial heterogeneity for each trait in the cleaned phenotype dataset, then identifies environments with low heritability using a robust univariate outlier detection method and excludes them from further analysis. For each trait, heritability distributions across all environments are computed before and after spatial heterogeneity correction; these distributions are also computed by management type.
- Adjusted ls-means computation: Computes adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment.
- Data structure inference: Infers the genomic, phenotypic, and pedigree data structures of the refpop population.
- Genomic prediction accuracy evaluation: Evaluates, for each trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means.
- G x E x M evaluation: Evaluates, for each trait, the interactions between genotypes, their environments and management types.

└ 📁 Repository details

The refpop/ repository contains three main folders that are central to the tasks and analyses performed by the refpop/ R scripts:

- src/: Houses the R scripts.
- data/: Contains the data sourced, processed and analyzed by the R scripts. Note that some processed data, such as the clean phenotype dataset, are written to and stored in this folder.
- results/: Stores the results generated by the tasks and analyses of the R scripts.

Furthermore, the R scripts in src/ are distributed across several subfolders: refpop_data_structure_analysis/, refpop_data_treatment_and_analysis/, refpop_gem_interaction_analysis/ and refpop_predictive_modeling_and_analysis/. These subfolders are organized by type of task and analysis: general data treatment and analysis (outlier identification and removal, spatial heterogeneity correction, and adjusted ls-means computation), as well as more specialized tasks such as genomic prediction modeling and analysis.

🧩 Interdependencies of scripts for analyses

Most scripts rely on the preprocessed data generated by the scripts in refpop_data_treatment_and_analysis/. It is therefore strongly recommended to execute the data treatment steps before proceeding with any further analyses. Apart from this, there are no interdependencies between the R scripts.

📜 Details of scripts analyses

The analyses performed by the different R scripts in the subfolders of src/ are described as follows:
- refpop_data_treatment_and_analysis/
  - refpop_0_phenotype_outlier_detection_per_env.R: This script conducts outlier detection for phenotypes using two methods.
    The first method employs robust multivariate outlier detection, which considers the covariance structure among phenotypic measurements of traits. This method uses the Mahalanobis distance (MD) with the minimum covariance determinant (MCD) estimator to limit the influence of contaminating points (i.e., outliers) during the detection process. Because MD is sensitive to the curse of dimensionality, so that outlier detection accuracy diminishes as the number of variables increases, a principal component analysis (PCA) is performed beforehand to reduce dimensionality.
    The second approach is a knowledge-based rule univariate method that applies specific criteria according to the refpop phenotyping protocol:
    - An outlier is identified if the sample size exceeds 20
    - An outlier is identified if the sample size exceeds the total number of fruits per tree (at harvest date)
  - refpop_1_spat_hetero_correct_per_env_trait_and_h2_estim.R: For each trait, this script applies spatial heterogeneity correction to the cleaned phenotype dataset and computes the distributions of heritability values, which are derived from estimates in each environment, both before and after the correction. It also computes the distributions of heritabilities by management type before and after the correction. The spatial heterogeneity correction is carried out using the spatial analysis of field trials with splines (SpATS). Additionally, this script identifies environments with low heritability using the median absolute deviation (MAD) as a unilateral test and excludes them from further analysis and computations, such as the adjusted least-squares means (ls-means) of phenotypes or the heritability computed using pooled data from all environments.
  - refpop_2_adjusted_lsmeans_phenotype.R: This script computes the adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment. The least-squares means (ls-means) are computed after fitting a multiple linear regression model that includes genotype and environment effects as covariates, along with the overall mean.
  Note that the R scripts in this subfolder are prefixed with refpop_0, refpop_1, and refpop_2, indicating their sequential execution order.
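
The MAD-based unilateral test used to exclude low-heritability environments can be illustrated with a minimal sketch. The repository's scripts are written in R; the Python below is only an illustration, and the function name, the cutoff k = 2.5, the consistency constant 1.4826 (both taken from the recommendations in Leys et al., 2013) and the heritability values are assumptions, not values read from the refpop scripts.

```python
from statistics import median


def low_heritability_envs(h2_by_env, k=2.5, scale=1.4826):
    """Return environments whose heritability falls below median - k * MAD.

    Hypothetical helper: k and scale are assumed defaults, not settings
    taken from the refpop R scripts.
    """
    values = list(h2_by_env.values())
    med = median(values)
    # Scaled MAD: a robust estimate of spread around the median.
    mad = scale * median([abs(v - med) for v in values])
    # Unilateral (lower-tail) test: only unusually LOW values are flagged.
    threshold = med - k * mad
    return sorted(env for env, h2 in h2_by_env.items() if h2 < threshold)


# Made-up heritability estimates for one trait, one value per environment.
h2_by_env = {"env_a": 0.62, "env_b": 0.58, "env_c": 0.65, "env_d": 0.60, "env_e": 0.12}
print(low_heritability_envs(h2_by_env))
```

Because the test is one-sided, an environment with unusually high heritability is never excluded; only the lower tail is screened.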
- refpop_data_structure_analysis/
  - refpop_genomic_data_structure_analysis.R: This script performs structure analyses for the refpop genomic data using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
  - refpop_pedigree_and_phenotype_data_structure_analysis.R: This script performs structure analyses for the refpop pedigree data, phenotype data, and their combination, using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
- refpop_predictive_modeling_and_analysis/
  - refpop_genomic_prediction_and_analysis_trait.R: This script evaluates, for a defined trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means. These distributions are evaluated using K-fold cross-validation, for n shuffling scenarios of the refpop population, with the Pearson correlation between the predicted and observed adjusted phenotypic ls-means as the measure of accuracy. The implemented prediction methods are random forest (RF), support vector regression (SVR), reproducing kernel Hilbert space regression (RKHS), genomic BLUP (GBLUP) and least absolute shrinkage and selection operator (LASSO).
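
The evaluation scheme described above (K-fold cross-validation repeated over n shufflings, with Pearson correlation as accuracy) can be sketched as follows. This is an illustration in Python, not the repository's R implementation: all function names are hypothetical, the data are simulated, and a single-covariate least-squares regression stands in for the actual prediction methods (RF, SVR, RKHS, GBLUP, LASSO), which operate on genomic marker data.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_simple_regression(x, y):
    """Stand-in predictor (assumption): one-covariate least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return lambda v: intercept + slope * v


def cv_accuracies(x, y, k=5, n_shuffles=3, seed=0):
    """Return one accuracy per fold and per shuffling scenario."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    accuracies = []
    for _ in range(n_shuffles):
        rng.shuffle(idx)  # new random partition of the population
        folds = [idx[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            predict = fit_simple_regression([x[i] for i in train_idx],
                                            [y[i] for i in train_idx])
            predicted = [predict(x[i]) for i in test_idx]
            observed = [y[i] for i in test_idx]
            accuracies.append(pearson(predicted, observed))
    return accuracies


# Simulated data: 40 individuals, one covariate, noisy linear signal.
noise = random.Random(1)
x = [float(i) for i in range(40)]
y = [2.0 * v + 1.0 + noise.gauss(0.0, 5.0) for v in x]
acc = cv_accuracies(x, y, k=5, n_shuffles=3)
print(len(acc), sum(acc) / len(acc))
```

Collecting one accuracy per fold and per shuffle, rather than a single averaged value, is what yields a distribution of accuracies for each method.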
- refpop_gem_interaction_analysis/
  - script_1.R:
  - script_2.R:
  - ..........:

💻 Instructions

Download the refpop repository into the current user's directory on a computing cluster or personal computer using one of the following commands:

git clone git@github.com:ljacquin/refpop.git

or
git clone https://github.com/ljacquin/refpop.git

⚠️ Make sure git is installed beforehand; if not, install it with sudo apt install git.

Assuming R ≥ 4.1.2 is already installed, use the following command to install the R libraries required by refpop:
- R -q --vanilla < src/requirements.R
Within the refpop folder, execute the following command to make the execution script executable:
- chmod u+rwx execute_refpop_tasks_and_analyses.sh
Replace the data/ folder with the directory found at https://data :
Finally, run the refpop tasks and analyses with one of the following commands:
- sbatch execute_refpop_tasks_and_analyses.sh
  
  or
- ./execute_refpop_tasks_and_analyses.sh (i.e., interactive execution)

⚠️ The tasks and analyses performed by the R scripts in the refpop repository can be run in either Unix/Linux or Windows environments, as long as R and the necessary libraries are installed. For local computations in RStudio, set the computation_mode variable to "local" in the R scripts located in src/. With this setting, each R script can be run independently in RStudio, provided the required sequential execution order is respected.

References

Jung, Michaela, et al. "The apple REFPOP—a reference population for genomics-assisted breeding in apple." Horticulture research 7 (2020).
Leys, Christophe, et al. "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median." Journal of experimental social psychology 49.4 (2013): 764-766.
Hubert, Mia, and Michiel Debruyne. "Minimum covariance determinant." Wiley interdisciplinary reviews: Computational statistics 2.1 (2010): 36-43.
Higham, Nicholas J. "Computing the nearest correlation matrix—a problem from finance." IMA journal of Numerical Analysis 22.3 (2002): 329-343.
Breiman, Leo. "Random forests." Machine learning 45 (2001): 5-32.
Smola, Alex J., and Bernhard Schölkopf. "A tutorial on support vector regression." Statistics and computing 14 (2004): 199-222.
Jacquin L, Cao T-V and Ahmadi N (2016) A Unified and Comprehensible View of Parametric and Kernel Methods for Genomic Prediction with Application to Rice. Front. Genet. 7:145. doi: 10.3389/fgene.2016.00145
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996): 267-288.
