privefl / bigsnpr Goto Github PK

R package for the analysis of massive SNP arrays.

Home Page: https://privefl.github.io/bigsnpr/

R 75.68% C++ 23.48% Shell 0.01% TeX 0.84%

big-data parallel-computing r r-package statistical-methods polygenic-scores bioinformatics population-structure-inference snp-data memory-mapped-file

bigsnpr's Introduction

bigsnpr

{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It enhances the features of package {bigstatsr} for the purpose of analyzing genotype data.

To get you started:

Quick demo
List of functions from bigsnpr and from bigstatsr
Extended documentation with more examples + course recording

Installation

In R, run

# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")

or for the CRAN version

install.packages("bigsnpr")

Input formats

This package reads bed/bim/fam files (PLINK preferred format) using functions snp_readBed() and snp_readBed2(). Before reading into this package's special format, quality control and conversion can be done using PLINK, which can be called directly from R using snp_plinkQC() and snp_plinkKINGQC().

This package can also read UK Biobank BGEN files using function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.

This package uses a class called bigSNP for representing SNP data. A bigSNP object is a list with some elements:

$genotypes: A FBM.code256. Rows are samples and columns are variants. This stores genotype calls or dosages (rounded to 2 decimal places).
$fam: A data.frame with some information on the individuals.
$map: A data.frame with some information on the variants.

Note that most of the algorithms of this package don't handle missing values. You can use snp_fastImpute() (taking a few hours for a chip of 15K x 300K) and snp_fastImputeSimple() (taking a few minutes only) to impute missing values of genotyped variants.

Package {bigsnpr} also provides functions that directly work on bed files with a few missing values (the bed_*() functions). See paper "Efficient toolkit implementing..".

Polygenic scores

Polygenic scores are one of the main focus of this package. There are 3 main methods currently available:

Penalized regressions with individual-level data (see paper and tutorial)
Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual level data (see paper and tutorial).
LDpred2 with summary statistics (see paper and tutorial)

Possible upcoming features

Multiple imputation for GWAS (https://doi.org/10.1371/journal.pgen.1006091).
More interactive (visual) QC.

You can request some feature by opening an issue.

Bug report / Support

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr} (the big_*() functions), please open an issue on {bigstatsr}'s repo, or post on Stack Overflow with the tag bigstatsr.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.

References

Privé, Florian, et al. "Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr." Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, et al. "Efficient implementation of penalized regression for genetic risk prediction." Genetics 212.1 (2019): 65-74.
Privé, Florian, et al. "Making the most of Clumping and Thresholding for polygenic scores." The American Journal of Human Genetics 105.6 (2019): 1213-1221.
Privé, Florian, et al. "Efficient toolkit implementing best practices for principal component analysis of population genetic data." Bioinformatics 36.16 (2020): 4449-4457.
Privé, Florian, et al. "LDpred2: better, faster, stronger." Bioinformatics 36.22-23 (2020): 5424-5431.
Privé, Florian. "Optimal linkage disequilibrium splitting." Bioinformatics 38.1 (2022): 255–256.
Privé, Florian. "Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics." Bioinformatics 38.13 (2022): 3477-3480.
Privé, Florian, et al. Inferring disease architecture and predictive ability with LDpred2-auto. bioRxiv (2022).

bigsnpr's People

Contributors

Stargazers

Watchers

Forkers

jimhester moisesexpositoalonso guhjy monsanto-pinheiro mblumuga bvilhjal alice-macqueen crerecombinase srcbs pythseq choishingwan timo-cpr hxchengtj genomicsnx suensummit chen-zhao bbitarello saafdk shugamoe dianacornejo mfknie alyssacl tahseen-igib alxm0182 minghao2016 ritawang0427 sumeetmankar171 jujiaokang romainfrancois jordanero yidingdd aeverall nvrivera aviezerl jean997 kerwin12580 rnaimehaom kevinluolk wzhlin barracuda156 hugolyu dramanica yuzhengdun1999

bigsnpr's Issues

Faster decompression of BGEN files

Seems like snp_readBGEN() and snp_assocBGEN() are taking most of the time decompressing variant blocks.
Maybe, it would be possible to use a faster decompression algorithm? https://github.com/ebiggers/libdeflate?

Error in trunc(x) : non-numeric argument to mathematical function

The error I'm getting is : " Error in trunc(x) : non-numeric argument to mathematical function "
The error occured after running the following code :

##### convert vcf to bed ########
vcffile <- "Ha412_Jan2017.vcf"
prefix  <- sub("\\.vcf$", "", vcffile)
plink <- download_plink()
bedfile <- snp_plinkQC(plink.path = plink, prefix.in = prefix, prefix.out = tempfile(), file.type = "--vcf", extra.options =  "--aec --vcf-half-call m")
########## read the bed file ############
rds <- snp_readBed(bedfile, backingfile = tempfile())
########## get the BigSNP object #############
XX <- snp_attach(rds)
################## dealing with missing values : imputation ##########################
G <- XX$genotypes
CHR <- XX$map$chromosome
system.time(
 Z <- snp_fastImpute(G, CHR, ncores = nb_cores()) ### ERROR HERE!
)

Packages versions:

bigstatsr : 0.7.0
bigsnpr: 0.6.1

I'm afraid I am not allowed to provide the vcf file data.
Thanks

Parallel algorithms are slow to start when using large bed objects.

Possibly when there are lots of variants.
Too much data to transfer?

Should try to see if this can be improved.

Change tolerance of tests for pcadapt

Need new version of pcadapt first.

Installation Error - lgfortran on Mac

Hello,

I have successfully installed bigstatsr but get the following error when I try to install bigsnpr with devtools::install_github("privefl/bigsnpr"):


ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [bigsnpr.so] Error 1
ERROR: compilation failed for package ‘bigsnpr’
* removing ‘/Library/Frameworks/R.framework/Versions/3.3/Resources/library/bigsnpr’
Installation failed: Command failed (1)

Thanks.

Recode `roll_mean` to avoid the useless `if`s

https://github.com/privefl/bigsnpr/blob/master/src/rollmean.cpp

Fast imputation with median/mode

In snp_fastImpute(), add a first step that imputes SNPs that have less than e.g. 1% of missing values using the median/mode.

Basic usage of creation of PRS using package

I am a bit confused by conflicting differences between your demo for creating polygenic risk scores (PRS) and the list of features of your package (from https://privefl.github.io/bigsnpr/).

The demo instructs users to use the big_spLogReg() function for the creation of PRS from a BigSNP object. However, this function is not listed in the feature list of the package, and meanwhile the feature list does mention a snp_PRS() function for creating PRS'.

I am unsure about the difference between these functions and whether one is superior to the other, or should be used in different circumstances or for creating different results?

Look at exporting bed accessors

Further pruning using snp_autoSVD( ) on previously pruned data

I am running the snp_autoSVD function on a set of SNPs and then once again on a subsequent subset of SNPs afterward. After the pruning and removal of long-range LD regions, I ran the same function on randomly selected subsets (10K SNPs and 50K SNPs) of the already pruned set of SNPs and both are pruned even further, reduced down to a set of ~6K SNPs and ~28K SNPs.

I am wondering what is happening and I am hoping that you could shed some light on if snp_autoSVD() might come into play here. Why would a subset of the pruned SNPs be pruned even further?

Error in subset.default(infos, pNA > 0.001): object 'pNA' not found

I'm having some trouble with the "preprocessing" script and not able to generate the plot at the end.

I'm using my own data, loading in .ped and .map files and converting them with snp_plinkQC() and snp_plinkIBDQC().

plink <- download_plink()
NCORES <- nb_cores()
technoQC.bed <- snp_plinkQC(prefix.in = "data/technoTeens", plink.path = plink, file.type = "--file", # ped/map geno = 0.05, mind = 0.05, maf = 0.05, hwe = 1e-10, autosome.only = TRUE )

PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to data/technoTeens_QC.log.
Options in effect:
--autosome
--file data/technoTeens
--geno 0.05
--hwe 1e-10
--maf 0.05
--make-bed
--mind 0.05
--out data/technoTeens_QC

257658 MB RAM detected; reserving 128829 MB for main workspace.
Allocated 96621 MB successfully, after larger attempt(s) failed.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (669934 variants, 423 people).
--file: data/technoTeens_QC-temporary.bed + data/technoTeens_QC-temporary.bim +
data/technoTeens_QC-temporary.fam written.
669934 variants loaded from .bim file.
423 people (0 males, 0 females, 423 ambiguous) loaded from .fam.
Ambiguous sex IDs written to data/technoTeens_QC.nosex .
1 person removed due to missing genotype data (--mind).
ID written to data/technoTeens_QC.irem .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 422 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.992935.
8604 variants removed due to missing genotype data (--geno).
--hwe: 2294 variants removed due to Hardy-Weinberg exact test.
338458 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
320578 variants and 422 people pass filters and QC.
Note: No phenotypes present.
--make-bed to data/technoTeens_QC.bed + data/technoTeens_QC.bim +
data/technoTeens_QC.fam ... done.

technoQC2.bed <- snp_plinkIBDQC( bedfile.in = technoQC.bed, plink.path = plink, ncores = NCORES )
PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /tmp/Rtmpg7a6zA/file6adc6e7d579f.log.
Options in effect:
--bfile data/technoTeens_QC
--indep-pairwise 100 1 0.2
--out /tmp/Rtmpg7a6zA/file6adc6e7d579f

257658 MB RAM detected; reserving 128829 MB for main workspace.
Allocated 96621 MB successfully, after larger attempt(s) failed.
320578 variants loaded from .bim file.
422 people (0 males, 0 females, 422 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /tmp/Rtmpg7a6zA/file6adc6e7d579f.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 422 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997882.
320578 variants and 422 people pass filters and QC.
Note: No phenotypes present.
Pruned 13777 variants from chromosome 1, leaving 11023.
Pruned 15056 variants from chromosome 2, leaving 10855.
Pruned 12506 variants from chromosome 3, leaving 9649.
Pruned 11249 variants from chromosome 4, leaving 8904.
Pruned 10497 variants from chromosome 5, leaving 8289.
Pruned 16064 variants from chromosome 6, leaving 8472.
Pruned 10190 variants from chromosome 7, leaving 7938.
Pruned 9629 variants from chromosome 8, leaving 7051.
Pruned 7762 variants from chromosome 9, leaving 6492.
Pruned 8934 variants from chromosome 10, leaving 7148.
Pruned 8982 variants from chromosome 11, leaving 6730.
Pruned 8438 variants from chromosome 12, leaving 6958.
Pruned 6310 variants from chromosome 13, leaving 5307.
Pruned 5699 variants from chromosome 14, leaving 4841.
Pruned 5450 variants from chromosome 15, leaving 4709.
Pruned 5894 variants from chromosome 16, leaving 5071.
Pruned 5003 variants from chromosome 17, leaving 4729.
Pruned 5036 variants from chromosome 18, leaving 4680.
Pruned 3750 variants from chromosome 19, leaving 3755.
Pruned 4288 variants from chromosome 20, leaving 3808.
Pruned 2469 variants from chromosome 21, leaving 2258.
Pruned 2496 variants from chromosome 22, leaving 2432.
Pruning complete. 179479 of 320578 variants removed.
Marker lists written to /tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.in and
/tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.out .
PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /tmp/Rtmpg7a6zA/file6adc6e7d579f.log.
Options in effect:
--bfile data/technoTeens_QC
--extract /tmp/Rtmpg7a6zA/file6adc6e7d579f.prune.in
--genome
--min 0.08
--out /tmp/Rtmpg7a6zA/file6adc6e7d579f
--threads 23

system.time( technoQC.rds <- snp_readBed(technoQC2.bed, "technoQC") )
user system elapsed
1.109 0.054 1.161

str(technoTeens, max.level = 2)
List of 3
$ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 10 fields
..and 22 methods, of which 8 are possibly relevant:
.. add_columns, as.FBM, copy#envRefClass, initialize,
.. initialize#FBM, save, show#envRefClass, show#FBM
$ fam :'data.frame': 195 obs. of 6 variables:
..$ family.ID : int [1:195] 1 3 4 6 8 11 12 13 17 21 ...
..$ sample.ID : chr [1:195] "202528140036_R01C01" "202528140036_R02C01" "202528140036_R02C02" "202528140036_R03C02" ...
..$ paternal.ID: int [1:195] 0 0 0 0 0 0 0 0 0 0 ...
..$ maternal.ID: int [1:195] 0 0 0 0 0 0 0 0 0 0 ...
..$ sex : int [1:195] 0 0 0 0 0 0 0 0 0 0 ...
..$ affection : int [1:195] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
$ map :'data.frame': 320578 obs. of 6 variables:
..$ chromosome : int [1:320578] 1 1 1 1 1 1 1 1 1 1 ...
..$ marker.ID : chr [1:320578] "GSA-rs114420996" "rs3131972" "rs12127425" "rs4970383" ...
..$ genetic.dist: num [1:320578] 0 0 0 0 0 0 0 0 0 0 ...
..$ physical.pos: int [1:320578] 58814 752721 794332 838555 840753 846808 853954 854250 861808 866893 ...
..$ allele1 : chr [1:320578] "A" "G" "A" "A" ...
..$ allele2 : chr [1:320578] "G" "A" "G" "C" ...

attr(*, "class")= chr "bigSNP"

G <- technoTeens$genotypes
big_counts(G, ind.col = 1:10)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
0 0 31 2 23 38 27 26 14 40 8
1 55 91 31 87 99 89 89 87 97 51
2 133 70 162 83 58 79 80 94 58 136
7 3 0 2 0 0 0 0 0 0

system.time( infos <- snp_fastImpute(G, technoTeens$map$chromosome, ncores = NCORES) )
user system elapsed
0.791 1.128 185.524

plot(subset(infos, pNA > 0.001), pch = 19, cex = 0.5)
Error in subset.default(infos, pNA > 0.001) : object 'pNA' not found``

Include wrappers to {rsnps}

For now:

devtools::source_gist("42b41d771bbeae63245b8304ef283c70", filename = "get-genes.R")
rsid <- c("rs3934834", "rs3737728", "rs6687776", "rs9651273", "rs4970405",
          "rs12726255", "rs2298217", "rs4970362", "rs9660710", "rs4970420")
snp_gene(rsid)

installation issue

Hi,
I tried to install bigsnpr and bigstatsr on a Mac book Pro with macOS Sierra and R 3.5.1 and I get the following error.

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"

devtools::install_github("privefl/bigstatsr")
Downloading GitHub repo privefl/bigstatsr@master
tar: Failed to set default locale
tar: Failed to set default locale
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_TIME failed, using "C"
3: Setting LC_MESSAGES failed, using "C"
4: Setting LC_MONETARY failed, using "C"
v checking for file '/private/var/folders/r6/fttwjvts4tx456v09tzw9flr0000gn/T/RtmpQcCqhO/remotescecb4eb9368a/privefl-bigstatsr-fe0da7e/DESCRIPTION' ...

preparing 'bigstatsr':
v checking DESCRIPTION meta-information ...
cleaning src
running 'cleanup'
checking for LF line-endings in source and make files and shell scripts
checking for empty or unneeded directories
building 'bigstatsr_0.9.3.tar.gz'

Error: (converted from warning) Setting LC_CTYPE failed, using "C"
Execution halted
Error in i.p(...) :
(converted from warning) installation of package '/var/folders/r6/fttwjvts4tx456v09tzw9flr0000gn/T//RtmpQcCqhO/filececb109eb0bd/bigstatsr_0.9.3.tar.gz' had non-zero exit status

Any clue?

Thanks

FBM. code256

Excuse me,How to convert genotype files into FBM. code256 format?
How to determine the code file？

Checking NAs is slow

Would be better to temporary replace NAs with 3s in codes and then checking x == 3 instead of R_IsNA(x).

For example, could improve performance of snp_cor.

snp_grid_PRS error

Hi,
When I test the tmp-data (the example of SCT), snp_grid_PRS is error. The error is task 57 failed - "'lpS.keep' should have only positive values.".
The dimension of beta, lpval and G are consistent.
Could you tell me how to deal with the error.
Thanks!

Best,
Sheng

Function for downloading PLINK2

Convert data.frame to FBM.code256

Hi Florian,
I have some difficulties with your bigsnpr package. I would like to transform an SNP matrix (https://github.com/ahereil/Test-Data/blob/master/geno_test.txt) to FBM.code256 format in order to use the snp_clumping() function.

Thanks

Switch 0 <-> 2 when reading from PLINK files

I am reading PLINK files the same as package {gaston}.
However, it seems that we switch 0s and 2s (thus often reading more 2s than 0s).

Handling missing values in SNP matrix

Thanks for making this tool available. I wasn't sure whether to mark this as a question or issue because it's unclear to me if there is existing support for PCA and GWAS with SNP matrices that contain NAs. I am accustomed to setting thresholds in PLINK and GEMMA for which all SNPs missing in more than X genotypes are excluded, but haven't found a similar feature for bigsnpr in documentation or the demo.

I'm trying to run my data, which I've formatted the same way as the demo data, but there are NAs in my SNP set (from QC filtering, indels, etc) and this leads me to the error below.

Do I need to format the NAs a particular way, or do anything else to build models that include SNPs not found in all genotypes?

Error: You can't have missing values in 'X'.
Traceback:

1. big_univLogReg(G, y01.train = y01, covar.train = covariates, ncores = NCORES)
2. check_args()
3. with(args, eval(parse(text = check[i])))
4. with.default(args, eval(parse(text = check[i])))
5. eval(substitute(expr), data, enclos = parent.frame())
6. eval(substitute(expr), data, enclos = parent.frame())
7. eval(parse(text = check[i]))
8. eval(parse(text = check[i]))
9. assert_noNA(X)
10. stop2("You can't have missing values in '%s'.", deparse(substitute(x)))
11. stop(sprintf(...), call. = FALSE)

Implement inference of LD blocks

E.g. Berisa and Pickrell (2016)
https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtv546

Add percentage of variance explained by PCs

If you have a bigSNP object, see privefl/bigstatsr#83.

If you have a bed object (directly mapping a bed file), see this example:

library(bigsnpr)
bedfile <- system.file("extdata/example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)
obj.svd <- bed_autoSVD(obj.bed)
ind.keep <- attr(obj.svd, "subset")

rssq <- bigsnpr:::prod_and_rowSumsSq(
  obj.bed, ind_row = rows_along(obj.bed), ind_col = ind.keep,
  center = obj.svd$center, scale = obj.svd$scale, 
  V = matrix(0, length(ind.keep), 0)  # skip product here
)[[2]]
sum(rssq)  # squared frobenius norm of genotype matrix for subset (= total variance explained)
var_exp <- obj.svd$d^2 / sum(rssq)
signif(var_exp, 2)
round(cumsum(var_exp), 3)

Implement bed_cor() (similar to snp_cor())

Would be then possible to compute LD scores directly on bed files.

Implement GWAS power calculations

Support for visualisation of LD

Example:

# computed with snp_cor() (do not hesitate to be stringent on thr_r2 or alpha)
corr <- readRDS(url("https://www.dropbox.com/s/65u96jf7y32j2mj/spMat.rds?raw=1"))

library(tidyverse)

## Transform sparse representation into (i,j,x) triplets
corrT <- as(corr, "dgTMatrix")
upper <- (corrT@i <= corrT@j)
df <- data.frame(
  i = corrT@i[upper], 
  j = corrT@j[upper],
  r2 = corrT@x[upper]
) %>%
  mutate(y = (j - i) / 2)

ggplot(df) +
  geom_point(aes(i + y, y, color = r2, alpha = r2), size = rel(0.5)) +
  coord_fixed() + 
  scale_color_gradientn(colours = rev(colorRamps::matlab.like2(100))) + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(x = "Position", y = NULL) +
  scale_alpha(guide = 'none')

snp_grid_PRS dimension mismatch

Following along the vignette entitled "Computing polygenic scores using Stacked Clumping and Thresholding (SCT)" everything works until the function snp_grid_PRS. The problem is mismatched dimensions between the G, bigsnpr object, and the summary statistic betas and pvals. Specifically, G contains more snps than the summary statistics. The exact call that is throwing the problem is assert_lengths().

A temporary fix is to create a new G, although it could be nice and thought worthwhile to point out, that there could be a more definite solution.

Dosage data

For now, see this vignette as a proof of concept.

Non-numeric Chromosome IDs

I noticed that many of the functions (ex. snp_indLRLDR) appear to assume integer chromosome IDs (1-26 etc) and produce errors when non-numeric. I was wondering if either there was support for non-traditional chromosome IDs (NC_####, chr##, etc)? Or if there was an easy way to transition the non-numeric IDs to numeric values within the objects used.

Going through the package, I couldn't see why it would be absolutely required for them to be numeric and could help increase the flexibility as lots of system require all sorts of designations for chromosomes.

Thanks!

snp_readBGEN() does not work with duplicated variants

Reprex:

# write bgen/bgi files
library(magrittr)
bgen_file <- tempfile(fileext = ".bgen")
system.file("testdata", "bgen_example.rds", package = "bigsnpr") %>%
  readRDS() %>% writeBin(bgen_file, useBytes = TRUE)
system.file("testdata", "bgi_example.rds",  package = "bigsnpr") %>%
  readRDS() %>% writeBin(paste0(bgen_file, ".bgi"), useBytes = TRUE)

IDs <- rep("1_1001_A_G", 3)
library(bigsnpr)
snp_attach(snp_readBGEN(bgen_file, tempfile(), list(IDs)))$genotypes[1:5, ]
     [,1] [,2] [,3]
[1,] 2.00    0    0
[2,] 1.98    0    0
[3,] 0.00    0    0
[4,] 1.00    0    0
[5,] 1.00    0    0

snp_plinkKINGQC() - version check for PLINK v2 fails when using older PLINK v2 builds

(I've never reported an issue on Github before -- hope this is well described and replicable)

Running the following command:

snp_plinkKINGQC(plink2_path, in_stem, out_stem)

gives the following error message:

Error: This requires PLINK v2; got 'PLINK v2.00a2LM 64-bit Intel (4 Mar 2019)' instead.

The additional warning message tells us why:

In addition: Warning message:
In system(paste(plink2.path, "--version"), intern = T) :
  running command '[redacted path to my org's PLINK2 version] --version' had status 9

The !identical(substr(v, 1, 8)) == "PLINK v2" check in snp_plinkKINGQC() fails because the v object (and its substr() output) has attr(,"status") [1] 9, unlike the string "PLINK v2".

However, the more recent PLINK v2.00a2.3LM does not display the behaviour of returning a non-zero exit status when --version is run, so I don't get the error when I point plink2_path to that build.

myassert_size was not declared in this scope

I am using R 3.6.1 and gcc 4.8.5 on Linux (CentOS 7.4), and when I try to install the package I get the following error:

clumping.cpp: In function 'Rcpp::List clumping_chr_cached(Rcpp::Environment, arma::sp_mat, const IntegerVector&, const IntegerVector&, const IntegerVector&, const IntegerVector&, const IntegerVector&, const NumericVector&, const NumericVector&, int, double)':
clumping.cpp:101:32: error: 'myassert_size' was not declared in this scope
   myassert_size(spInd.size(), m);

Can you please advise? Thank you.

Installation issue: OS X "-lgfortran" and "-lquadmath" errors

Hi Florian,
I've been using your bigsnpr package in a class I teach, and some of the students were having issues installing. One of the errors was this:

ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lquadmath
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [bigstatsr.so] Error 1
ERROR: compilation failed for package ‘bigstatsr’

removing ‘/Library/Frameworks/R.framework/Versions/3.3/Resources/library/bigstatsr’
Installation failed: Command failed (1)

Move PLINK wrappers in a separate package?

Could add more functions too.

What to do about lambda_GC?

Currently, lambda_GC(X2) = lambda_GC(Z)^2.
This can be confusing when comparing Q-Q plots using different statistics.
What to do about it?

Allow to use genetic distance for window in snp_cor()

Question: is it possible to OADP project samples onto existing smartpca output?

I've tested out OADP and I'm quite happy with the results. It's not absolutely perfect, but it looks to be the best projection method I've tried, aside from smartpca's computationally insane shrinkmode: YES.

I've been trying to apply OADP to existing smartpca output. I can generate projected coordinates, but they're not scaled correctly: PC1 values are x ~850 what they should be, PC2 are x ~400, PC3 are x ~-150, etc. Moreover, the overall appearance of a plot of all projected PCs changes depending on how many PCs I attempt to project on.

Attempted implementation below (hopefully without elementary errors):

library(dplyr)
library(bigsnpr)

# bed file that was used as smartpca input
pca_bed <- bed(pca_bed_file)
# bed file of GTs on exact same loci as above, but for individuals we want to project
project_bed <- bed(project_bed_file)

# as I understand, when creating "scaled" GT values for the individuals we want to 
# project, we should use the "scale" (per-loci means + sds) from the fileset used to 
# calculate the PCA
pca_bed_scale <- bed_scaleBinom(pca_bed)

# using the above, read in scaled GT values for the individuals to project
project_bed_scaled <- project_bed %>%
    bigsnpr:::read_bed_scaled(1:.$nrow, 1:.$ncol, pca_bed_scale$center, pca_bed_scale$scale) 

# read in smartpca SNP loadings for 10 PCs, dropping 1st 3 cols w/ rsid chr pos:
snpweight <- paste0(result_stem, '.snpweight') %>% 
    read.table() %>%
    select(-(1:3)) %>%
    as.matrix()

# and read in smartpca's eigenvalue file:
evals <- paste0(result_stem, '.eval') %>% 
    scan(nmax = 10)

# I think we now have the ingredients for OADP.
# follow pca_OADP_proj helppage by sqrting evals to get singular values
proj <- bigutilsr::pca_OADP_proj(project_bed_scaled, snpweight, sqrt(evals))

I've tried to modulate how I calculate the singular values (e.g. sqrt(evals * (length(evals) - 1)), where the length of evals is the same as the number of samples used to generate the original PCA), but this doesn't seem to work; the overall appearance of the plots changing based on the number of PCs I attempt to project onto seems to be an important clue.

Anyway, I'm not mathematically sophisticated enough to figure this out. If you could point me in the right direction, I would really appreciate it. I'm probably not the only person who has some "legacy" PCA output that would be annoying to run again or can't be run again in bigsnpr due to it being dependent on functionality that doesn't (yet!) exist in the package. Thanks.

log file empty, no bed file produced

Hi Florian

I'm having trouble with snp_plinkQC. It runs, but it produces a log file that is empty, and it does not produce the bed file.

The data is available at (http://dgrp2.gnets.ncsu.edu/data.html) "plink formatted genotype". My code is pasted below.
I realise I haven't given you much to go on but do you have a feeling where the issue may be?
Thanks in advance!

library(bigsnpr)
bedfile <- "/home/mist/projects/dgrp2.bed"
print(bedfile)

prefix  <- sub("\\.bed$", "", bedfile)
plink <- download_plink()
print(prefix)
bigsnp <- snp_plinkQC(plink.path = plink,
                      prefix.in = prefix,
                      prefix.out = tempfile(),
                      file.type = "--bfile",
                      maf = 0.05, 
                      geno = 0,   
                      mind = 1,   
                      hwe = 0,    
                      autosome.only = FALSE) 
bigsnp <- snp_readBed(bigsnp)

Loading required package: bigstatsr
[1] "/home/mist/projects/dgrp2.bed"
trying URL 'https://www.cog-genomics.org/static/bin/plink180911/plink_linux_x86_64.zip'
Content type 'application/zip' length 2284660 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

[1] "/home/mist/projects/dgrp2"
[1] "/tmp/RtmpP40ZZj/file627364b519ff.bed"
Error: File '/tmp/RtmpP40ZZj/file627364b519ff.bed' doesn't exist.

ind.row, ind.col do not work in snp_readBed2

If the number of rows or columns is large, calling
snp_readBed2(bed.file, backingfile = rds.basename, ind.row=ind.row, ind.col=ind.col)
gives the following error message:

Error in if (file.size(.self$backingfile) != (size * .self$type_size)) stop2("Inconsistency between size of backingfile and dimensions.") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In .self$nrow * .self$ncol : NAs produced by integer overflow

Also, calling
snp_readBed2(bed.file, backingfile = rds.basename, ind.row=ind.row)
throws:

Error in ncol(x) : object 'obj.bed' not found

Installation Error

Hello,

I am also having an error while installing bigsnpr. Despite having installed and loaded bigstatsr prior to attempting to install bigsnpr, I get this error:

ERROR: compilation failed for package ‘bigstatsr’

removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/bigstatsr’
restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/bigstatsr’
Installation failed: Command failed (1)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ
--no-save --no-restore --quiet CMD INSTALL
'/private/var/folders/ct/wh2w15qd0fxfq3sbcpdp1_mr0000gn/T/Rtmpr61UIl/devtools1a043d4dcc42/privefl-bigsnpr-25b97e9'
--library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
--install-tests
installing source package ‘bigsnpr’ ...
** libs
Agreeing to the Xcode/iOS license requires admin privileges, please run “sudo xcodebuild -license” and then retry this command.
ERROR: compilation failed for package ‘bigsnpr’
removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/bigsnpr’
Installation failed: Command failed (1)

I may be doing something incorrectly, but if you have some insight into what might be going wrong I would appreciate the feedback. Thanks a lot.

Chad

Add vignette on how to import genotype data in bigSNP format

Match alleles between two datasets

For now, I have only bigsnpr::same_ref(), which is limited.

Problems specifying "backingfile " in snp_readBGEN

I am trying to read a BGEN file using bigsnpr::snp_readBGEN().
I am using the following code in a google virtual machine with 100 GB ram and 16 cores.

system.time(
  rds <- bigsnpr::snp_readBGEN(
    bgenfiles = glue::glue("/data/bgen/ukb_imp_chr{chr}_v3.bgen", chr = 1:22),
    list_snp_id = list_snp_id,
    backingfile = "data/try_1",
    ind_row = ind.indiv[sub],
    bgi_dir = "/data/bgen",
    ncores = 12
  )
)

However, I get an error that is related to the specification of the backingfile.
In particular, the error I encounter the following error:

Error in getXPtrFBM(.self$bk, .self$nrow, .self$ncol, .self$type) : 
  Error when mapping file:
  No such file or directory.
In addition: Warning message:
In normalizePath(bkfile) :
  path[1]="/data/try_1.bk": No such file or directory
Timing stopped at: 0.032 0 0.033

In my understading, the backingfile is created anew during snp_readBGEN execution.
However, I get the error that specifies that "No such file or directory" exists.

I tried using an alternative line to specify the backingfile :
backingfile = tempfile(pattern = "lasso_try_1")
It works. However, the temporary file is created in the temporary folder of R and the file is too big. I would need to specify a different server location outside of the temporary folder of R.

snp_grid_stacking error: 'y.train' should be composed of different values

Thanks for developing this efficient method for PRS. I am trying to go through the SCT with my own dataset. In the stacking step the code final_mod <- snp_grid_stacking(multi_PRS, y[ind.train], ncores = NCORES, K = 4) returns this error 'y.train' should be composed of different values . In my dataset, the y <- obj.bigSNP$fam$affection - 1 is all -10. I found that in the testing dataset you provided the y is 0/1 binary values thus big_spLogReg() is used for stacking. In my dataset, does this error return by big_spLinReg()? And what is the potential solution for this problem?

Fast way to remove "duplicate" variants

I.e. find a faster way to perform clumping with large thr.r2 (say 0.95 or 0.99).

Implement LDSc regression

Faster snp_clumping

It's somehow quite "slow".
Wonder if would be good to use a different way of parallelism.

Categorical Covariate

How would one include categorical covariates using snp_grid_PRS() or predict() such as sex or membership of a certain stratum? I see that covar.row can be used to load in results from a PCA but is there a way to specify separate categorical covariables, or include them together with the PCA? Thank you.

Handling of Dosage data in PRS calculation

Hi there, I'd like to ask if the use of snp_PRS() is compatible with BGEN file input? And if so, is the dosage data hard-coded?

Thanks

big_randomSVD() hangup

Ran the example here (https://privefl.github.io/bigsnpr/articles/demo.html) successfully and started trying my own data (started slow with 353 samples but only 17 variants in a small interval). Ran the following code:

library(bigsnpr)
library(ggplot2)
setwd("XXX/XXX")
tmpfile <- tempfile()
snp_readBed("test_region.bed", backingfile = tmpfile)
obj.bigSNP <- snp_attach(paste0(tmpfile, ".rds"))
str(obj.bigSNP, max.level = 2, strict.width = "cut")
G <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
big_counts(G, ind.col = 1:10)

NCORES <- nb_cores()
ind.keep <- snp_clumping(G, infos.chr = CHR, ncores = NCORES)
obj.svd <- big_randomSVD(G, fun.scaling = snp_scaleBinom(), ind.col = ind.keep, ncores = NCORES)

Everything looked fine until the last step. The "big_randomSVD" hung up indefinitely without any error message. Based on the number of samples and variants this shouldn't take very long. Any advices? I could provide the "test_region.bed" data used here.

privefl / bigsnpr Goto Github PK

bigsnpr's Introduction

bigsnpr

Installation

Input formats

Polygenic scores

Possible upcoming features

Bug report / Support

References

bigsnpr's People

Contributors

Stargazers

Watchers

Forkers

bigsnpr's Issues

Recommend Projects

Recommend Topics

Recommend Org