This repository hosts `R` scripts designed to ensure reproducible data analysis and results, aimed at characterizing the refpop population, in accordance with the FAIR principles:

- Findable
- Accessible
- Interoperable
- Reusable

The `refpop/` repository hosts `R` scripts that aim to automate, as far as possible, the following tasks and analyses:

- Phenotype outlier detection: produces a clean phenotype dataset, i.e., one that is free of outliers, using robust multivariate and knowledge-based rule methods
- Spatial heterogeneity correction: corrects spatial heterogeneity for each trait in the cleaned phenotype dataset, then identifies environments with low heritability using a robust univariate outlier detection method and excludes them from further analysis. For each trait, heritability distributions across all environments are computed before and after spatial heterogeneity correction; these distributions are also computed by management type
- Adjusted ls-means computation: computes adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment
- Data structure inference: infers the genomic, phenotypic, and pedigree data structures of the refpop population
- Genomic prediction accuracy evaluation: evaluates, for each trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means
- G x E x M evaluation: evaluates, for each trait, the interactions between genotypes, environments and management types

The `refpop/` repository contains three main folders that are central to the tasks and analyses performed by the `refpop/` `R` scripts:

- `src/`: Houses the `R` scripts
- `data/`: Contains the data sourced, processed and analyzed by the `R` scripts. Note that some processed data, such as the clean phenotype dataset, are written and stored in this folder
- `results/`: Stores the results generated by the `R` script tasks and analyses

Furthermore, the `R` scripts in `src/` are distributed across several subfolders: `refpop_data_structure_analysis/`, `refpop_data_treatment_and_analysis/`, `refpop_gem_interaction_analysis/` and `refpop_predictive_modeling_and_analysis/`. These folders are organized according to the tasks and analyses performed by the user: general data treatment and analyses such as outlier identification and removal, spatial heterogeneity correction, and adjusted ls-means computation, as well as more specialized tasks such as genomic prediction modeling and analysis.

Most scripts rely on the preprocessed data generated by the scripts in `refpop_data_treatment_and_analysis/`. It is therefore strongly recommended to execute the data treatment steps before proceeding with any further analyses. Apart from this, there are no interdependencies between the `R` scripts.

The analyses performed by the different `R` scripts in the subfolders of `src/` are described as follows:

- `refpop_data_treatment_and_analysis/`
  - `refpop_0_phenotype_outlier_detection_per_env.R`: This script detects phenotype outliers using two methods.
    The first method is a robust multivariate outlier detection that accounts for the covariance structure among the phenotypic measurements of traits. It uses the Mahalanobis distance (MD) with the minimum covariance determinant (MCD) estimator to limit the influence of contamination points (i.e., outliers) during detection. Because MD is sensitive to the curse of dimensionality, with outlier detection accuracy decreasing as the number of variables increases, a principal component analysis (PCA) is performed beforehand to reduce dimensionality.
    The second method is a knowledge-based univariate rule that applies specific criteria from the refpop phenotyping protocol:
    - An outlier is identified if the sample size exceeds 20
    - An outlier is identified if the sample size exceeds the total number of fruits per tree (at harvest date)
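
The two detection methods described above can be illustrated with a minimal sketch on simulated data. This is not the code of `refpop_0_phenotype_outlier_detection_per_env.R`: the use of `MASS::cov.rob()` as the MCD implementation, the two retained principal components, the 0.975 chi-square cutoff, and the column names `sample_size` and `n_fruits_per_tree` are all assumptions made for illustration.

```r
# Sketch only: simulated data, not the refpop dataset.
# Assumptions: MASS::cov.rob() as the MCD implementation, two retained
# principal components, and a 0.975 chi-square quantile as the cutoff.
library(MASS)

set.seed(1)
n <- 200
latent <- rnorm(n)
# Five correlated "traits" driven by one latent factor
pheno <- sapply(1:5, function(j) latent + rnorm(n, sd = 0.5))
colnames(pheno) <- paste0("trait_", 1:5)
pheno[1:5, ] <- pheno[1:5, ] + 10  # contaminate the first five rows

# Step 1: PCA to reduce dimensionality before computing distances
pca <- prcomp(pheno, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Step 2: robust location and scatter via MCD, then squared Mahalanobis distances
mcd <- cov.rob(scores, method = "mcd")
md2 <- mahalanobis(scores, center = mcd$center, cov = mcd$cov)

# Step 3: flag observations whose squared distance exceeds the cutoff
cutoff <- qchisq(0.975, df = ncol(scores))
outlier_idx <- which(md2 > cutoff)

# Knowledge-based rules (hypothetical column names): flag a record when the
# sample size exceeds 20 or exceeds the number of fruits on the tree
samples <- data.frame(sample_size = c(10, 25, 12),
                      n_fruits_per_tree = c(30, 40, 8))
rule_outlier <- samples$sample_size > 20 |
  samples$sample_size > samples$n_fruits_per_tree
```

Because the MCD estimate is computed from the most concentrated part of the data, the contaminated rows have little influence on the estimated center and scatter, so their distances stay large instead of being masked.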
  - `refpop_1_spat_hetero_correct_per_env_trait_and_h2_estim.R`: For each trait, this script applies spatial heterogeneity correction to the cleaned phenotype dataset and computes the distributions of heritability values, derived from the estimates in each environment, both before and after the correction. It also computes the distributions of heritabilities by management type before and after the correction. The spatial heterogeneity correction is carried out using the spatial analysis of field trials with splines (SpATS). Additionally, this script identifies environments with low heritability using a one-sided test based on the median absolute deviation (MAD) and excludes them from further analyses and computations, such as the adjusted least-squares means (ls-means) of phenotypes or the heritability computed from the pooled data of all environments.
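
The one-sided MAD test mentioned above can be sketched as follows. The heritability values, the environment names and the threshold `k = 2.5` are hypothetical; the script itself may use a different threshold or scaling.

```r
# Sketch only: hypothetical per-environment heritability estimates for one trait
h2 <- c(env_a = 0.82, env_b = 0.78, env_c = 0.85, env_d = 0.80,
        env_e = 0.31, env_f = 0.79, env_g = 0.83)

k <- 2.5  # assumed threshold, not taken from the refpop scripts

# One-sided test: only values far BELOW the median are flagged.
# stats::mad() scales the raw MAD by 1.4826 by default.
lower_bound <- median(h2) - k * mad(h2)
low_h2_env <- names(h2)[h2 < lower_bound]
```

Using the median and the MAD instead of the mean and the standard deviation keeps the bound from being pulled down by the very environments it is meant to detect.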
  - `refpop_2_adjusted_lsmeans_phenotype.R`: This script computes the adjusted phenotypic ls-means for each genotype across environments, using the adjusted phenotypes obtained from spatial heterogeneity correction in each environment. The least-squares means (ls-means) are computed after fitting a multiple linear regression model that includes genotype and environment effects as covariates, along with the overall mean.
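
As an illustration of what ls-means are in this setting, the sketch below fits a linear model with genotype and environment effects on simulated, unbalanced data, then averages the model predictions for each genotype over all environments with equal weights. The data, effect sizes and variable names are assumptions, and the script may compute ls-means differently (for example with a dedicated package).

```r
# Sketch only: simulated adjusted phenotypes for 4 genotypes in 3 environments
set.seed(2)
dat <- expand.grid(genotype = paste0("g", 1:4),
                   environment = paste0("e", 1:3))
true_g <- c(g1 = 0, g2 = 1, g3 = 2, g4 = 3)
true_e <- c(e1 = 0, e2 = 5, e3 = -2)
dat$y <- 10 + true_g[as.character(dat$genotype)] +
  true_e[as.character(dat$environment)] + rnorm(nrow(dat), sd = 0.1)
dat <- dat[-1, ]  # drop g1 in e1 so that the design is unbalanced

# Linear model with an overall mean, genotype effects and environment effects
fit <- lm(y ~ genotype + environment, data = dat)

# ls-means: predict every genotype in every environment, then average the
# predictions per genotype, giving each environment the same weight
grid <- expand.grid(genotype = levels(dat$genotype),
                    environment = levels(dat$environment))
grid$pred <- predict(fit, newdata = grid)
lsmeans <- tapply(grid$pred, grid$genotype, mean)

# Raw means for comparison: g1 is biased because it was not observed in e1
raw_means <- tapply(dat$y, dat$genotype, mean)
```

In this example the raw mean of `g1` is shifted by the environments in which it happens to be observed, whereas its ls-mean is adjusted to the same set of environments as the other genotypes.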
    Note that the `R` scripts in this subfolder are prefixed with `refpop_0`, `refpop_1`, and `refpop_2`, indicating their sequential execution order.
- `refpop_data_structure_analysis/`
  - `refpop_genomic_data_structure_analysis.R`: This script performs structure analyses for the refpop genomic data using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
  - `refpop_pedigree_and_phenotype_data_structure_analysis.R`: This script performs structure analyses for the refpop pedigree data, phenotype data, and their combination, using both uniform manifold approximation and projection (UMAP) and principal component analysis (PCA).
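
To give an idea of what a PCA-based structure analysis looks like, here is a minimal sketch on a simulated marker matrix with two subpopulations. The simulated allele frequencies, group labels and dimensions are assumptions; the UMAP step is left out because it requires an additional package.

```r
# Sketch only: simulated 0/1/2 marker data for two hypothetical subpopulations
set.seed(4)
n_per_group <- 50
p <- 200
freq_a <- runif(p, 0.1, 0.9)
# Second group: allele frequencies drifted away from the first group
freq_b <- pmin(pmax(freq_a + rnorm(p, sd = 0.2), 0.05), 0.95)

sim_group <- function(freq) {
  # One row per individual, one column per marker
  t(replicate(n_per_group, rbinom(length(freq), size = 2, prob = freq)))
}
geno <- rbind(sim_group(freq_a), sim_group(freq_b))
group <- rep(c("a", "b"), each = n_per_group)

# PCA on the centered marker matrix
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Mean position of each group on the first principal component
pc1_by_group <- tapply(pca$x[, 1], group, mean)
```

Plotting `pca$x[, 1]` against `pca$x[, 2]`, colored by `group`, would show the two simulated subpopulations as separate clusters along the first component.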
- `refpop_predictive_modeling_and_analysis/`
  - `refpop_genomic_prediction_and_analysis_trait.R`: This script evaluates, for a defined trait, the distributions of genomic prediction accuracies associated with several prediction methods, for the adjusted phenotypic ls-means. These distributions are evaluated using K-fold cross-validation, for n shuffling scenarios of the refpop population, with the Pearson correlation between the predicted and observed adjusted phenotypic ls-means as the measure of accuracy. The implemented prediction methods are random forest (RF), support vector regression (SVR), reproducing kernel Hilbert space regression (RKHS), genomic BLUP (GBLUP) and least absolute shrinkage and selection operator (LASSO).
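
The evaluation scheme described above (K-fold cross-validation repeated over n shuffles, with Pearson correlation as accuracy) can be sketched as follows. The predictor here is a plain ridge regression on markers, used only as a simple stand-in and not as one of the methods implemented in the script. The simulated data, the value of `lambda`, the numbers of folds and shuffles, and the choice of recording one accuracy per fold are all assumptions.

```r
# Sketch only: simulated markers and phenotypes, ridge regression as stand-in
set.seed(3)
n <- 120
p <- 300
geno <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)
beta <- rnorm(p, sd = 0.2)
y <- as.vector(geno %*% beta) + rnorm(n)

# Ridge regression in its dual form, convenient when p > n
ridge_predict <- function(x_train, y_train, x_test, lambda) {
  mu <- mean(y_train)
  col_means <- colMeans(x_train)
  x_tr <- sweep(x_train, 2, col_means)
  x_te <- sweep(x_test, 2, col_means)
  k <- tcrossprod(x_tr)  # n_train x n_train linear kernel
  alpha <- solve(k + diag(lambda, nrow(k)), y_train - mu)
  mu + as.vector(x_te %*% crossprod(x_tr, alpha))
}

k_folds <- 5
n_shuffles <- 10
lambda <- 25  # assumed fixed penalty, not tuned

acc <- numeric(0)
for (s in seq_len(n_shuffles)) {
  # New random assignment of individuals to folds at each shuffle
  fold_id <- sample(rep(seq_len(k_folds), length.out = n))
  for (f in seq_len(k_folds)) {
    test <- fold_id == f
    pred <- ridge_predict(geno[!test, , drop = FALSE], y[!test],
                          geno[test, , drop = FALSE], lambda)
    # Accuracy: Pearson correlation between predicted and observed values
    acc <- c(acc, cor(pred, y[test]))
  }
}
summary(acc)
```

The vector `acc` holds the distribution of accuracies for one method; repeating the loop with another predictor in place of `ridge_predict()` gives distributions that can be compared across methods on identical folds if the same seed is reused.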
- `refpop_gem_interaction_analysis/`
  - `script_1.R`:
  - `script_2.R`:
  - ..........:

- Download the `refpop` repository into the current user's directory on a computing cluster or personal computer using one of the following commands:
  - `git clone git@github.com:ljacquin/refpop.git`
  - `git clone https://github.com/ljacquin/refpop.git`

  Make sure `git` is installed beforehand; if not, install it with `sudo apt install git` (on Debian or Ubuntu systems).
- Given that `R ≥ 4.1.2` is already installed, use the following command to install the `R` libraries required by `refpop`: `R -q --vanilla < src/requirements.R`
- Within the `refpop` folder, execute the following command to make the scripts and programs executable: `chmod u+rwx execute_refpop_tasks_and_analyses.sh`
- Replace the `data/` folder with the directory found at https://data :
- Finally, run one of the following commands to execute the refpop tasks and analyses:
  - `sbatch execute_refpop_tasks_and_analyses.sh` (i.e., batch submission through Slurm)
  - `./execute_refpop_tasks_and_analyses.sh` (i.e., interactive execution)

The `R` scripts in the `refpop` repository can be run in either `Unix/Linux` or `Windows` environments, as long as `R` and the necessary libraries are installed. For local computations in `RStudio`, ensure that the `computation_mode` variable is set to "local" in the `R` scripts located in `src/`. With this setting, and provided the required sequential execution order is respected, each `R` script can be run independently in `RStudio`.

- Jung, Michaela, et al. "The apple REFPOP-a reference population for genomics-assisted breeding in apple." Horticulture Research 7 (2020).
- Leys, Christophe, et al. "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median." Journal of Experimental Social Psychology 49.4 (2013): 764-766.
- Hubert, Mia, and Michiel Debruyne. "Minimum covariance determinant." Wiley Interdisciplinary Reviews: Computational Statistics 2.1 (2010): 36-43.
- Higham, Nicholas J. "Computing the nearest correlation matrix-a problem from finance." IMA Journal of Numerical Analysis 22.3 (2002): 329-343.
- Breiman, Leo. "Random forests." Machine Learning 45 (2001): 5-32.
- Smola, Alex J., and Bernhard Schölkopf. "A tutorial on support vector regression." Statistics and Computing 14 (2004): 199-222.
- Jacquin, L., T.-V. Cao, and N. Ahmadi. "A unified and comprehensible view of parametric and kernel methods for genomic prediction with application to rice." Frontiers in Genetics 7 (2016): 145. doi: 10.3389/fgene.2016.00145
- Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996): 267-288.