zhanxw / seqminer Goto Github PK

View Code? Open in Web Editor NEW

30.0 3.0 12.0 7.16 MB

Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R

Home Page: http://zhanxw.github.io/seqminer/

License: Other

R 0.63% C++ 4.89% C 94.10% Shell 0.01% M4 0.03% Makefile 0.12% Batchfile 0.04% CMake 0.15% Starlark 0.04%

sequencing annotation next-generation-sequencing meta-analysis vcf bcf tabix bgen plink workflow

seqminer's People

Stargazers

Watchers

Forkers

jevon1985 dajianggroup srenan funfunchen lzlgboy yang-lina wangdi2014 wenjianbi haizi-zh qbrc pekkarr

seqminer's Issues

New features for seqminer (8.0)

Hi Xiaowei,

I am using seqminer (v8.0) and it works pretty well under multiple OS. I am wondering if you can add some features to the current functions.

Usually, we do not need all subjects in analysis. So, for readBGENToMatrixByRange() and readVCFToMatrixByRange(), can you add one more argument such as 'subjIDs' or 'subjIndex' to specify the subjects in analysis. That can save a lot of memory sometimes.
Can you add one more function to split all markers into multiple ranges, and each range includes similar number of markers. When conducting a genome-wide analysis, we cannot put the genotype of all markers into memory. Hence, this function can greatly help us for that purpose. If possible, I suggest the new function should be like splitRange(fileName, memoryChunk = 4GB, subjIDs, ...). Output can be a data.frame object in which each row is for one range.
Sometimes, the plink bed/bim/fam files or bgen bgen/bgi files have different prefix names. I am wondering if you can let users specify the different names for different files. That would be also helpful.

Thanks,
Wenjian

`tabix.read` remote files

Is it possible to query indexed VCF files that are stored remotely?

Thanks!

vcf_url <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"

tab <- seqminer::tabix.read(tabixFile =vcf_url, tabixRange =  "8:48673708-48692728")
print(tab)

### character(0)

I have a dataset of size with 487409 samples and 571622 SNPs. Is it possible to load this entire dataset with seqminer? I tried requesting 2000 Gb memory on compute Canada but didn't work although 4874095716224 bytes per float is equal to 1,114.45483 gigabytes. Theoretically this should work. So seqminer is using a lot more memory than it theoretically should. I found readBGENToMatrixByRange is more efficient than readBGENToListByRange, but still readBGENToMatrixByRange doesn't work.

How is seqminer treating missing data?

How is seqminer treating missing data? I don't see this information being mentioned in the documentation, but all the loaded matrices are complete, so I suppose these data are dropped?

Support for csi index

Could you add support for csi indexes?

Missing license version

Dear maintainer,

We noticed that you list GPL as License in your Description file. Can you clarify the specific GPL version your package is licensed under (e.g. GPL-2 or GPL-3)?

On behalf of bioconda and conda-forge teams,
Jens

Technical background

Your package is currently available on the bioconda conda channel. We'd like to migrated the package to conda-forge. conda-forge requires inclusion of the license text during packaging. The R distribution already contains some licenses, but GPL is not one of them (however, GPL-2 and GPL-3 are included). Let me know if you need further information.

BGEN to Matrix variant IDs

Hi @zhanxw,

Would it be possible to add some additional granularity to the bgen to matrix variant ID format? Right now, the IDs are chrom:pos, but having the SNPID type format of CHR:POS:REF:ALT would be helpful.

Thanks!

Josh

`tabix.createIndex`: [ti_index_core] the file out of order at line ####

Hello,

When trying to use tabix.createIndex to index a munged GWAS summary stats file (tab-separated and compressed by Rsamtools::bgzip), I keep getting the following error:

[ti_index_core] the file out of order at line 776464
Create tabix index failed for [ /Users/schilder/Downloads/pgc-bip2021-all_munged.tsv.bgz ]!

And yet, I've made sure to sort the file by CHR (chromosome) and BP (position). I even visually confirmed that the positions are in order at the line it is referencing (after it's been bgzip compressed):

Reprex

The data can be downloaded here.

#### Set up paths ####
fullSS_path_vcf <- "~/Downloads/pgc-bip2021-all.vcf.tsv.gz" 
fullSS_path_tsv <- gsub("\\.vcf","",fullSS_path_vcf) 
fullSS_path_munged <- gsub("-all","-all_munged",fullSS_path_tsv)
#### Edit ####
dat <- data.table::fread(fullSS_path_vcf, 
                         skip = "#CHROM")
colnames(dat) <- gsub("#","",colnames(dat))
#### Sort ####
data.table::setkey(dat, CHROM, POS)
#### Save ####
data.table::fwrite(x = dat, 
                   file = fullSS_path_tsv, 
                   sep="\t")
#### Munge ####
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_tsv, 
                                              save_path = fullSS_path_munged, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE)

#### Compress ####
bgz_file <- Rsamtools::bgzip(file = fullSS_path, 
                                 overwrite = TRUE)

#### Index ####
 seqminer::tabix.createIndex(
        bgzipFile = bgz_file,
        sequenceColumn = 2,
        startColumn = 3,
        endColumn = 3,
        ## Just use the first column's name (since none have the `#` symbol)
        metaChar = "SNP"
    )

Any help would be appreciated.

Best,
Brian

Session info

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_7.0.0      ggimage_0.3.0    ggplot2_3.3.5    dplyr_1.0.8      hexSticker_0.4.9
[6] echotabix_0.99.3

loaded via a namespace (and not attached):
  [1] AnnotationHub_3.2.2           BiocFileCache_2.2.1           systemfonts_1.0.4            
  [4] igraph_1.2.11                 BiocParallel_1.28.3           GenomeInfoDb_1.30.1          
  [7] digest_0.6.29                 yulab.utils_0.0.4             htmltools_0.5.2              
 [10] magick_2.7.3                  fansi_1.0.2                   magrittr_2.0.2               
 [13] memoise_2.0.1                 BSgenome_1.62.0               echoverseTemplate_0.99.0     
 [16] ontologyPlot_1.6              openxlsx_4.2.5                Biostrings_2.62.0            
 [19] matrixStats_0.61.0            R.utils_2.11.0                sysfonts_0.8.5               
 [22] prettyunits_1.1.1             colorspace_2.0-3              blob_1.2.2                   
 [25] rappdirs_0.3.3                textshaping_0.3.6             xfun_0.30                    
 [28] crayon_1.5.0                  RCurl_1.98-1.6                echodata_0.99.6              
 [31] jsonlite_1.8.0                hexbin_1.28.2                 graph_1.72.0                 
 [34] VariantAnnotation_1.40.0      glue_1.6.2                    gtable_0.3.0                 
 [37] zlibbioc_1.40.0               XVector_0.34.0                DelayedArray_0.20.0          
 [40] Rgraphviz_2.38.0              BiocGenerics_0.40.0           scales_1.1.1                 
 [43] DBI_1.1.2                     Rcpp_1.0.8.2                  showtextdb_3.0               
 [46] xtable_1.8-4                  progress_1.2.2                gridGraphics_0.5-1           
 [49] bit_4.0.4                     clisymbols_1.2.0              stats4_4.1.0                 
 [52] DT_0.21                       htmlwidgets_1.5.4             httr_1.4.2                   
 [55] ontologyIndex_2.7             ellipsis_0.3.2                pkgconfig_2.0.3              
 [58] XML_3.99-0.9                  R.methodsS3_1.8.1             farver_2.1.0                 
 [61] seqminer_8.4                  dbplyr_2.1.1                  utf8_1.2.2                   
 [64] here_1.0.1                    ggplotify_0.1.0               tidyselect_1.1.2             
 [67] labeling_0.4.2                rlang_1.0.2                   later_1.3.0                  
 [70] AnnotationDbi_1.56.2          BiocVersion_3.14.0            munsell_0.5.0                
 [73] tools_4.1.0                   cachem_1.0.6                  cli_3.2.0                    
 [76] generics_0.1.2                RSQLite_2.2.10                evaluate_0.15                
 [79] stringr_1.4.0                 fastmap_1.1.0                 yaml_2.3.5                   
 [82] ragg_1.2.2                    knitr_1.37                    bit64_4.0.5                  
 [85] fs_1.5.2                      zip_2.2.0                     purrr_0.3.4                  
 [88] KEGGREST_1.34.0               gh_1.3.0                      showtext_0.9-5               
 [91] mime_0.12                     R.oo_1.24.0                   xml2_1.3.3                   
 [94] biomaRt_2.50.3                brio_1.1.3                    compiler_4.1.0               
 [97] rstudioapi_0.13               interactiveDisplayBase_1.32.0 filelock_1.0.2               
[100] curl_4.3.2                    png_0.1-7                     testthat_3.1.2               
[103] paintmap_1.0                  tibble_3.1.6                  stringi_1.7.6                
[106] GenomicFeatures_1.46.5        desc_1.4.1                    lattice_0.20-45              
[109] Matrix_1.4-0                  vctrs_0.3.8                   pillar_1.7.0                 
[112] lifecycle_1.0.1               BiocManager_1.30.16           data.table_1.14.2            
[115] bitops_1.0-7                  httpuv_1.6.5                  rtracklayer_1.54.0           
[118] GenomicRanges_1.46.1          R6_2.5.1                      BiocIO_1.4.0                 
[121] promises_1.2.0.1              IRanges_2.28.0                ontoProc_1.16.0              
[124] assertthat_0.2.1              pkgload_1.2.4                 SummarizedExperiment_1.24.0  
[127] rprojroot_2.0.2               rjson_0.2.21                  withr_2.5.0                  
[130] GenomicAlignments_1.30.0      Rsamtools_2.10.0              S4Vectors_0.32.3             
[133] GenomeInfoDbData_1.2.7        parallel_4.1.0                hms_1.1.1                    
[136] grid_4.1.0                    ggfun_0.0.5                   tidyr_1.2.0                  
[139] rmarkdown_2.13                MatrixGenerics_1.6.0          piggyback_0.1.1              
[142] Biobase_2.54.0                shiny_1.7.1                   lubridate_1.8.0              
[145] restfulr_0.0.13

Configuration fails with missing -fPIE

Configuration fails with the following relevant error:

configure:2372: g++ -m64 -std=gnu++11 -o conftest  -I/usr/local/include -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1 conftest.cpp  >&5
/usr/bin/ld: /tmp/ccNfljhx.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE

`readVCFToListByRange`: Report 'VCF/BCF Files does not have header!'

seqminer::readVCFToListByRange currently crashes Rstudio. The VCF I'm querying is definitely not missing a valid header (contrary to the error message).

Moreover, even if the file was missing a header, the function should be able to handle this situation more gracefully than causing Rstudio to crash (and lose all the user's variables). Instead, a standard stop message should be triggered.

Reprex 1

 target_path <- paste(
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
        "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
        sep="/")
query_str <- "4:14884541-16649679"

main_cols <- c("CHROM","POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
samples <- c('HG00097','HG00099','HG00100','HG00101','HG00102')
vcfColumn <- unique(c(main_cols,toupper(samples)))

 out <- seqminer::readVCFToListByRange(fileName = target_path,
                                              range = query_str, 
                                              vcfColumn = vcfColumn)

Error

Reprex 2

This imports the table successfully.

dat <- seqminer::tabix.read.table(tabixFile = target_path, 
                                          tabixRange = query_str)
dim(dat)
## [1] 24376  1101

For further discussion on querying VCFs using various R packages, see here.

Session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] echotabix_0.99.5            dplyr_1.0.8                 VariantAnnotation_1.40.0    Rsamtools_2.10.0           
 [5] Biostrings_2.62.0           XVector_0.34.0              SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0              S4Vectors_0.32.4           
[13] MatrixGenerics_1.6.0        matrixStats_0.61.0          BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] fs_1.5.2                 bitops_1.0-7             lubridate_1.8.0          bit64_4.0.5             
 [5] filelock_1.0.2           progress_1.2.2           httr_1.4.2               rprojroot_2.0.3         
 [9] gh_1.3.0                 tools_4.1.3              utf8_1.2.2               R6_2.5.1                
[13] DT_0.22                  DBI_1.1.2                withr_2.5.0              tidyselect_1.1.2        
[17] prettyunits_1.1.1        bit_4.0.4                curl_4.3.2               compiler_4.1.3          
[21] cli_3.2.0                xml2_1.3.3               desc_1.4.1               DelayedArray_0.20.0     
[25] rtracklayer_1.54.0       readr_2.1.2              rappdirs_0.3.3           stringr_1.4.0           
[29] digest_0.6.29            piggyback_0.1.1          R.utils_2.11.0           htmltools_0.5.2         
[33] pkgconfig_2.0.3          echodata_0.99.7          dbplyr_2.1.1             fastmap_1.1.0           
[37] BSgenome_1.62.0          htmlwidgets_1.5.4        rlang_1.0.2              rstudioapi_0.13         
[41] RSQLite_2.2.12           BiocIO_1.4.0             generics_0.1.2           jsonlite_1.8.0          
[45] echoconda_0.99.5         BiocParallel_1.28.3      zip_2.2.0                R.oo_1.24.0             
[49] RCurl_1.98-1.6           magrittr_2.0.3           GenomeInfoDbData_1.2.7   Matrix_1.4-1            
[53] Rcpp_1.0.8.3             fansi_1.0.3              reticulate_1.24-9000     lifecycle_1.0.1         
[57] R.methodsS3_1.8.1        stringi_1.7.6            yaml_2.3.5               zlibbioc_1.40.0         
[61] brio_1.1.3               BiocFileCache_2.2.1      grid_4.1.3               blob_1.2.2              
[65] parallel_4.1.3           crayon_1.5.1             lattice_0.20-45          GenomicFeatures_1.46.5  
[69] hms_1.1.1                KEGGREST_1.34.0          seqminer_8.4             pillar_1.7.0            
[73] rjson_0.2.21             clisymbols_1.2.0         biomaRt_2.50.3           pkgload_1.2.4           
[77] XML_3.99-0.9             glue_1.6.2               data.table_1.14.2        tzdb_0.3.0              
[81] png_0.1-7                vctrs_0.4.0              testthat_3.1.3           tidyr_1.2.0             
[85] purrr_0.3.4              assertthat_0.2.1         cachem_1.0.6             openxlsx_4.2.5          
[89] restfulr_0.0.13          tibble_3.1.6             GenomicAlignments_1.30.0 AnnotationDbi_1.56.2    
[93] memoise_2.0.1            ellipsis_0.3.2

`tabix.read.table`: Error: Error: non-character argument

tabix.read.table fails on Windows. It seems that your GHA checks are also indicating this to you, but those have expired so I can't check.

@zhanxw is seqminer still being maintained or should it be considered deprecated and no longer used? I'm wondering because it's (currently) a dep for several of my packages.

remotes::install_github("RajLabMSSM/echodata")
remotes::install_github("RajLabMSSM/echotabix")
library(echotabix)

BST1 <- echodata::BST1 
    fullSS_path <- echodata::example_fullSS()
    fullSS_tabix <- convert(fullSS_path = fullSS_path,
                            start_col = "BP")
# Wraps seqminer and some other tools
tab1 <- query_tabular(
        fullSS_tabix = fullSS_tabix,
        chrom = BST1$CHR[1],
        start_pos = min(BST1$POS),
        end_pos = max(BST1$POS)
    )

Error report

Found here.

-- Error (test-query_tabular.R:9:5): query_tabular works -----------------------
Error: Error: non-character argument
Backtrace:
    x
 1. \-echotabix::query_tabular(...) test-query_tabular.R:9:4
 2.   \-seqminer::tabix.read.table(tabixFile = fullSS_tabix, tabixRange = coords)
 3.     +-base::do.call(rbind, strsplit(body, "\t"))
 4.     \-base::strsplit(body, "\t")

Seems to be related to this.

Thanks,
Brian

Feature request: region as chromosome on its own.

It would be nice to be able to define a region as chromosome name on its own as below:

tabix.read("GeneticMap1KG.txt.gz","chr1")

Currently it gives below message and reads in everything, not just chr1:

This range does not conform 1:100-200 format -- skip chr1

Can seqminer reads and merges these large GWAS files?

Hi, there:

A recent Nature Genetics paper posted 35 large GWAS files at https://doi.org/10.35092/yhjc.12355382. Compared with the regular .tsv format, I assume that these BED format files should be much faster to be queries and managed.

Now I want to combine the BETA and SE of all these 35 GWAS into a single file. Reading each of them into R using read.table() and then merge() would take impossible computational time and that did not utilize the BED format. Can you please let me know if seqminer or something else could do this merging?

Similarly, these GWAS files don’t have rsIDs, and I need to add them. Can you please also suggest a good approach to do this. If these were VCF files, I could use bcftools annotate to add rsID from a dbSNP reference file.

Thank you & Best regards,
Jie

readBGENToMatrixByRange cannot found variants but not so after pruning by PLINK2

Why is it that the dimension increased after pruning? In PLINK2 I did:

./plink2 --bgen ukb_imp_chr1_v3.bgen ref-first --sample ukb51283_imp_chr1_v3_s487283.sample --indep-pairwise 50 25 0.2 --rm-dup exclude-all --out ukb_imp_chr1_v3

./plink2 --bgen ukb_imp_chr1_v3.bgen ref-first --sample ukb51283_imp_chr1_v3_s487283.sample --extract ukb_imp_chr1_v3_pruned.prune.in --export bgen-1.2 bits=8 --out ukb_imp_chr1_v3_pruned

Then I use seqminer to check the dimension.

> library(seqminer)

> end_pos <- 1000000

> ukb_imp_chr1_v3.bgen <- "ukb_imp_chr1_v3.bgen"

> ukb_imp_chr1_v3_pruned.bgen <- "ukb_imp_chr1_v3_pruned.bgen"

> ukb_imp_chr1_v3.bgen <- readBGENToMatrixByRange(ukb_imp_chr1_v3.bgen, paste0("1:1-", format(end_pos, scientific=F)))

1 region to be extracted.

> ukb_imp_chr1_v3_pruned.bgen <- readBGENToMatrixByRange(ukb_imp_chr1_v3_pruned.bgen, paste0("1:1-", format(end_pos, scientific=F)))

1 region to be extracted.

> dim(ukb_imp_chr1_v3.bgen[[1]])

[1]      0 487409

> dim(ukb_imp_chr1_v3_pruned.bgen[[1]])

[1]   1008 487409

It means there wasn't any variant between POS 1 to 1000000 in the original file (which isn't the case) and now there are 1008 variants after pruning. This is very strange.

Cannot install package “ERROR: no information for variable 'CXX11', 'CXX11STD' and 'CXX11FLAGS'”

See issue on StackOverflow

gcc11 requires -fPIE

With gcc11 CXXFLAGS=-fPIE is required to install, otherwise ./configure (or installing via R) fails with the following output:

This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

It was created by seqminer configure 8.3, which was
generated by GNU Autoconf 2.69.  Invocation command line was

  $ ./configure 

## --------- ##
## Platform. ##
## --------- ##

hostname = xpc15a6.math.canterbury.ac.nz
uname -m = x86_64
uname -r = 5.16.8-200.fc35.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT Tue Feb 8 20:58:59 UTC 2022

/usr/bin/uname -p = x86_64
/bin/uname -X     = unknown

/bin/arch              = x86_64
/usr/bin/arch -k       = unknown
/usr/convex/getsysinfo = unknown
/usr/bin/hostinfo      = unknown
/bin/machine           = unknown
/usr/bin/oslevel       = unknown
/bin/universe          = unknown

PATH: /home/kieran/.emacs.d/bin
PATH: /usr/condabin
PATH: /usr/lib64/ccache
PATH: /usr/local/bin
PATH: /usr/local/sbin
PATH: /usr/bin
PATH: /usr/sbin
PATH: /var/lib/snapd/snap/bin
PATH: /home/kieran/bin


## ----------- ##
## Core tests. ##
## ----------- ##

configure:2057: CC = gcc -m64
configure:2059: CFLAGS = 
configure:2061: CXX = g++ -m64 -std=gnu++11
configure:2063: CXXFLAGS = -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection
configure:2184: checking for C++ compiler version
configure:2193: g++ -m64 -std=gnu++11 --version >&5
g++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

configure:2204: $? = 0
configure:2193: g++ -m64 -std=gnu++11 -v >&5
Using built-in specs.
COLLECT_GCC=/usr/bin/g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-11.2.1-20220127/obj-x86_64-redhat-linux/isl-install --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.1 20220127 (Red Hat 11.2.1-9) (GCC) 
... rest of stderr output deleted ...
configure:2204: $? = 0
configure:2193: g++ -m64 -std=gnu++11 -V >&5
g++: error: unrecognized command-line option '-V'
g++: fatal error: no input files
compilation terminated.
configure:2204: $? = 1
configure:2193: g++ -m64 -std=gnu++11 -qversion >&5
g++: error: unrecognized command-line option '-qversion'; did you mean '--version'?
g++: fatal error: no input files
compilation terminated.
configure:2204: $? = 1
configure:2224: checking whether the C++ compiler works
configure:2246: g++ -m64 -std=gnu++11  -I/usr/local/include -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 conftest.cpp  >&5
configure:2250: $? = 0
configure:2298: result: yes
configure:2301: checking for C++ compiler default output file name
configure:2303: result: a.out
configure:2309: checking for suffix of executables
configure:2316: g++ -m64 -std=gnu++11 -o conftest  -I/usr/local/include -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 conftest.cpp  >&5
configure:2320: $? = 0
configure:2342: result: 
configure:2364: checking whether we are cross compiling
configure:2372: g++ -m64 -std=gnu++11 -o conftest  -I/usr/local/include -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 conftest.cpp  >&5
/usr/bin/ld: /tmp/ccIrYH7b.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE
collect2: error: ld returned 1 exit status
configure:2376: $? = 1
configure:2383: ./conftest
./configure: line 2385: ./conftest: No such file or directory
configure:2387: $? = 127
configure:2394: error: in `/home/kieran/src/seqminer':
configure:2396: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details

## ---------------- ##
## Cache variables. ##
## ---------------- ##

ac_cv_env_CCC_set=
ac_cv_env_CCC_value=
ac_cv_env_CPPFLAGS_set=
ac_cv_env_CPPFLAGS_value=
ac_cv_env_CXXCPP_set=
ac_cv_env_CXXCPP_value=
ac_cv_env_CXXFLAGS_set=
ac_cv_env_CXXFLAGS_value=
ac_cv_env_CXX_set=
ac_cv_env_CXX_value=
ac_cv_env_LDFLAGS_set=
ac_cv_env_LDFLAGS_value=
ac_cv_env_LIBS_set=
ac_cv_env_LIBS_value=
ac_cv_env_build_alias_set=
ac_cv_env_build_alias_value=
ac_cv_env_host_alias_set=
ac_cv_env_host_alias_value=
ac_cv_env_target_alias_set=
ac_cv_env_target_alias_value=

## ----------------- ##
## Output variables. ##
## ----------------- ##

CPPFLAGS='-I/usr/local/include'
CXX='g++ -m64 -std=gnu++11'
CXXCPP=''
CXXFLAGS=''
DEFS=''
ECHO_C=''
ECHO_N='-n'
ECHO_T=''
EGREP=''
EXEEXT=''
GREP=''
HAVE_BZIP2=''
HAVE_SQLITE=''
HAVE_ZSTD=''
LDFLAGS='-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1'
LIBOBJS=''
LIBS=''
LTLIBOBJS=''
OBJEXT=''
PACKAGE_BUGREPORT='[email protected]'
PACKAGE_NAME='seqminer'
PACKAGE_STRING='seqminer 8.3'
PACKAGE_TARNAME='seqminer'
PACKAGE_URL=''
PACKAGE_VERSION='8.3'
PATH_SEPARATOR=':'
PKG_CFLAGS=''
PKG_CPPFLAGS=''
PKG_LIBS=''
SHELL='/bin/sh'
ac_ct_CXX=''
bindir='${exec_prefix}/bin'
build_alias=''
datadir='${datarootdir}'
datarootdir='${prefix}/share'
docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
dvidir='${docdir}'
exec_prefix='NONE'
host_alias=''
htmldir='${docdir}'
includedir='${prefix}/include'
infodir='${datarootdir}/info'
libdir='${exec_prefix}/lib'
libexecdir='${exec_prefix}/libexec'
localedir='${datarootdir}/locale'
localstatedir='${prefix}/var'
mandir='${datarootdir}/man'
oldincludedir='/usr/include'
pdfdir='${docdir}'
prefix='NONE'
program_transform_name='s,x,x,'
psdir='${docdir}'
runstatedir='${localstatedir}/run'
sbindir='${exec_prefix}/sbin'
sharedstatedir='${prefix}/com'
sysconfdir='${prefix}/etc'
target_alias=''

## ----------- ##
## confdefs.h. ##
## ----------- ##

/* confdefs.h */
#define PACKAGE_NAME "seqminer"
#define PACKAGE_TARNAME "seqminer"
#define PACKAGE_VERSION "8.3"
#define PACKAGE_STRING "seqminer 8.3"
#define PACKAGE_BUGREPORT "[email protected]"
#define PACKAGE_URL ""

configure: exit 1

seqminer::tabix.read Aborted (core dumped)

Hi,

I want to read a tbi file from local, however, it shows the error message of " Aborted (core dumped)".

Could anyone help me to solve this issue?

Here is a demo to reproduce the bug.

 library(seqminer)
 if (.Platform$endian == "little") {
   fileName = system.file("vcf/all.anno.filtered.extract.vcf.gz", package = "seqminer")
   snp <- tabix.read(fileName, "1:196623337-196632470")
}
/usr/include/c++/8/bits/stl_vector.h:950: std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) const [with _Tp = std::__cxx11::basic_string<char>; _Alloc = std::allocator<std::__cxx11::basic_string<char> >; std::vector<_Tp, _Alloc>::const_reference = const std::__cxx11::basic_string<char>&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
Aborted (core dumped)

R session info

library(seqminer)
sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 8

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.12.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] seqminer_8.1

loaded via a namespace (and not attached):
[1] compiler_4.0.5

tabix.read.table crashes R

Tabixed files are at:
https://github.com/oncogenetics/LocusExplorer/tree/dev/Data/GeneticMap1KG

This works:

tabix.read("GeneticMap1KG.txt.gz","chr1:721290-729948")`
# [1] "chr1\t721290\t2.69\r" "chr1\t723819\t2.82\r" "chr1\t723891\t2.98\r" "chr1\t728242\t2.98\r" "chr1\t729948\t3.08\r"

This crashes R with following error message:

tabix.read.table("GeneticMap1KG.txt.gz","chr1:721290-729948")

# terminate called after throwing an instance of 'std::out_of_range'
#  what(): basic_string::substr
#
# And Rstudio crashes...

Session info:

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] seqminer_4.5

loaded via a namespace (and not attached):
[1] tools_3.2.1

not being able to read a tabix file over http

Hi,
I have problem reading the seqminer::tabix.read.table function to read from http over AWS S3 storage.
I get the following error:

[khttp_connect_file] fail to open file (HTTP code: 0).
Cannot open specified tabix file:

However, With the same setting, I don't get any error using tabix tool itself.
Could anyone give me any clue what could be the cause of this error?
Thanks

install issue

Hi,

I would like to install the package "seqminer" in HPC where I did not have the manager permisson.
The R version is 3.2.5 in HPC.
When I installed it, it showed the following error message.
Could you help me to solve this issue? Thank you.

install.packages("seqminer", lib = "..../library", repos = "http://cran.us.r-project.org")
Error in readRDS(file) :
cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
Error in readRDS(file) :
cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
Error in readRDS(file) :
cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
Error in readRDS(file) :
cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
trying URL 'http://cran.us.r-project.org/src/contrib/seqminer_8.0.tar.gz'
Content type 'application/x-gzip' length 4055251 bytes (3.9 MB)
==================================================
downloaded 3.9 MB

installing source package 'seqminer' ...
** package 'seqminer' successfully unpacked and MD5 sums checked
configure: CC = icc -std=gnu99
configure: CFLAGS = -g -O2 -std=c99
configure: CXX = ERROR: no information for variable 'CXX11' ERROR: no information for variable 'CXX11STD'
configure: CXXFLAGS = ERROR: no information for variable 'CXX11FLAGS'
checking whether the C++ compiler works... no
configure: error: in /tmp/2967577.1.long.q/RtmpXlvYqd/R.INSTALL1ee473a5617c6/seqminer': configure: error: C++ compiler cannot create executables See config.log' for more details
ERROR: configuration failed for package 'seqminer'
removing '.../library/seqminer'

The downloaded source packages are in
'/tmp/2967577.1.long.q/RtmpJkxLe0/downloaded_packages'
Warning message:
In install.packages("seqminer", lib = ".../library", :
installation of package 'seqminer' had non-zero exit status

question: readPlinkToMatrixByIndex

Hi developer,

Thanks for this library. I would like to process data where sample size is 200K+ using following function:

readPlinkToMatrixByIndex(plinkFilePrefix, sampleIndex, markerIndex)

I would like know:
Does the plink file loaded into the library first, or, there are read in chunks?

I am developing pipeline for the 200K in plink file per chromosome basis.

Thank you.

https://rdrr.io/cran/seqminer/man/readPlinkToMatrixByIndex.html

seqminer is missing some SNPs in range compared to PLINK

I made a comparison between the PLINK2 glm result from the some range versus reading the dosage matrix with seqminer. The result from seqminer is missing some SNPs that appeared in the PLINK result from the same range.

PLINK

Prune the .bgen file with PLINK and do glm.

./plink2 --bfile ukb_imp_chr1_v3_pruned --from-bp 54505148 --to-bp 56530526  --chr 1 --out ukb_imp_chr1_v3_pruned_trimmed_2m --export bgen-1.2 bits=8
./plink2 --bgen ukb_imp_chr1_v3_pruned_trimmed_2m.bgen ref-first --sample ukb_imp_chr1_v3_pruned_trimmed_2m.sample --glm --pheno ldl_pcs_joined_with_geno.tsv --pheno-name LDL --covar ldl_pcs_joined_with_geno.tsv --covar-name PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10, PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20, PC21,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30, PC31,PC32,PC33,PC34,PC35,PC36,PC37,PC38,PC39,PC40 --out ukb_imp_chr1_v3_pruned_trimmed_2m

Inspect PLINK glm result and see which positions are used.

> ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear <- read_tsv("ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear.tsv") %>% filter(TEST=="ADD") %>% mutate(CHROM=as.integer(CHROM), POS=as.integer(POS))
> length(ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear $POS)
[1] 11984
> ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear $POS
   [1] 54505201 54505304 54505437 54505703 54505756 54505887 54506600 54507000 54507273 54507506 54507734 54507907 54507996
  [14] 54508051 54508231 54508394 54508564 54509028 54509259 54509316 54509594 54509616 54509786 54509957 54509994 54510577
  [27] 54510860 54511200 54511248 54511349 54512553 54512583 54513132 54513323 54513367 54513825 54513867 54513921 54514065
  [40] 54514261 54514341 54514570 54514675 54515041 54515365 54516228 54516280 54516334 54516526 54516601 54516792 54516899
  [53] 54517324 54517633 54517874 54518434 54518725 54519045 54519107 54519119 54519158 54519292 54519295 54519731 54519800
  [66] 54520603 54520921 54521284 54521434 54521855 54522000 54522240 54522280 54522307 54522331 54522355 54522645 54522830
  [79] 54523285 54523304 54523344 54523630 54523639 54523689 54523881 54524519 54524575 54524608 54524940 54525150 54525207

seqminer

Load the .bgen file as a dosage matrix with seqminer and inspect the data size. 5766 is way smaller than 11984. So about half the SNPs are missed by seqminer.

library(tidyverse)
library(seqminer)
ukb_imp_chr1_v3_pruned.bgen <- "ukb_imp_chr1_v3_pruned.bgen"
start_pos  <- 54505148
end_pos <- 56530526
geno <- t(readBGENToMatrixByRange(ukb_imp_chr1_v3_pruned.bgen, paste0("1:", start_pos,"-",end_pos))[[1]])
> print(dim(geno))
[1] 487409   5766
> colnames(geno)
   [1] "1:54505201" "1:54506600" "1:54507273" "1:54507506" "1:54507734"
   [6] "1:54507996" "1:54508051" "1:54508394" "1:54508564" "1:54509028"
  [11] "1:54509316" "1:54509594" "1:54509957" "1:54510860" "1:54511248"
  [16] "1:54511349" "1:54513132" "1:54513323" "1:54513367" "1:54513825"
  [21] "1:54513921" "1:54514261" "1:54514341" "1:54514675" "1:54515365"
  [26] "1:54516228" "1:54516280" "1:54516526" "1:54516792" "1:54516899"
  [31] "1:54517633" "1:54517874" "1:54518434" "1:54518725" "1:54519045"

Allele T gets converted to TRUE

Using command line:

# tabix (htslib) 1.9
$ tabix  myfile.gz 10:85547273-85547273 | cut -f1-5
# 10:85547273:C:T 10      85547273        C       T

vs using seqminer:

# seqminer_8.0
# R version 3.6.0 (2019-04-26)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)

tabixSubset <- tabix.read.table("myfile.gz", "10:85547273-85547273")
tabixSubset[, 1:5]
#               V1 V2       V3 V4   V5
#1 10:85547273:C:T 10 85547273  C TRUE

zhanxw / seqminer Goto Github PK

seqminer's People

Stargazers

Watchers

Forkers

seqminer's Issues

Technical background

Reprex

Session info

Reprex 1

Error

Reprex 2

Session info

Error report

PLINK

seqminer

Recommend Projects

Recommend Topics

Recommend Org