
pandas-plink's Introduction

pandas-plink

Pandas-plink is a Python package for reading the PLINK binary file format and realized relationship matrices (PLINK or GCTA). Files are read via lazy loading, which saves memory by reading only the genotypes that are actually accessed by the user.

Notable changes can be found at the CHANGELOG.md.

Install

It can be installed using pip:

pip install pandas-plink

Alternatively, it can be installed via conda:

conda install -c conda-forge pandas-plink

Usage

It is as simple as

>>> from pandas_plink import read_plink1_bin
>>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 779)>
dask.array<shape=(14, 779), dtype=float64, chunksize=(14, 779)>
Coordinates:
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) object '11_316849996' '11_316874359' ... '11_345698259'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    fid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    gender   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    i        (sample) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13
    iid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) <U2 '-9' '-9' '-9' '-9' '-9' ... '-9' '-9' '-9' '-9' '-9'
    a0       (variant) <U1 'C' 'G' 'G' 'C' 'C' 'T' ... 'T' 'A' 'C' 'A' 'A' 'T'
    a1       (variant) <U1 'T' 'C' 'C' 'T' 'T' 'A' ... 'C' 'G' 'T' 'G' 'C' 'C'
    chrom    (variant) <U2 '11' '11' '11' '11' '11' ... '11' '11' '11' '11' '11'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int64 157439 181802 248969 ... 28937375 28961091 29005702
    snp      (variant) <U9 '316849996' '316874359' ... '345653648' '345698259'
>>> print(G.sel(sample="B003", variant="11_316874359").values)
0.0
>>> print(G.a0.sel(variant="11_316874359").values)
G
>>> print(G.sel(sample="B003", variant="11_316941526").values)
2.0
>>> print(G.a1.sel(variant="11_316941526").values)
C

Portions of the genotype matrix are read only as the user accesses them.
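
Because loading is lazy, it is cheap to slice first and materialize later; only the selected portion is read from disk. A minimal sketch, reusing the chr11 files from the example above:

>>> from pandas_plink import read_plink1_bin
>>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False)
>>> subset = G.sel(sample=["B001", "B002"])  # lazy: no genotypes are read yet
>>> subset.values.shape                      # genotypes for the two samples are read here
(2, 779)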

Covariance matrices can also be read very easily. Example:

>>> from pandas_plink import read_rel
>>> K = read_rel("plink2.rel.bin")
>>> print(K)
<xarray.DataArray (sample_0: 10, sample_1: 10)>
array([[ 0.885782,  0.233846, -0.186339, -0.009789, -0.138897,  0.287779,
         0.269977, -0.231279, -0.095472, -0.213979],
       [ 0.233846,  1.077493, -0.452858,  0.192877, -0.186027,  0.171027,
         0.406056, -0.013149, -0.131477, -0.134314],
       [-0.186339, -0.452858,  1.183312, -0.040948, -0.146034, -0.204510,
        -0.314808, -0.042503,  0.296828, -0.011661],
       [-0.009789,  0.192877, -0.040948,  0.895360, -0.068605,  0.012023,
         0.057827, -0.192152, -0.089094,  0.174269],
       [-0.138897, -0.186027, -0.146034, -0.068605,  1.183237,  0.085104,
        -0.032974,  0.103608,  0.215769,  0.166648],
       [ 0.287779,  0.171027, -0.204510,  0.012023,  0.085104,  0.956921,
         0.065427, -0.043752, -0.091492, -0.227673],
       [ 0.269977,  0.406056, -0.314808,  0.057827, -0.032974,  0.065427,
         0.714746, -0.101254, -0.088171, -0.063964],
       [-0.231279, -0.013149, -0.042503, -0.192152,  0.103608, -0.043752,
        -0.101254,  1.423033, -0.298255, -0.074334],
       [-0.095472, -0.131477,  0.296828, -0.089094,  0.215769, -0.091492,
        -0.088171, -0.298255,  0.910274, -0.024663],
       [-0.213979, -0.134314, -0.011661,  0.174269,  0.166648, -0.227673,
        -0.063964, -0.074334, -0.024663,  0.914586]])
Coordinates:
  * sample_0  (sample_0) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
  * sample_1  (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    fid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    iid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
>>> print(K.values)
[[ 0.89  0.23 -0.19 -0.01 -0.14  0.29  0.27 -0.23 -0.10 -0.21]
 [ 0.23  1.08 -0.45  0.19 -0.19  0.17  0.41 -0.01 -0.13 -0.13]
 [-0.19 -0.45  1.18 -0.04 -0.15 -0.20 -0.31 -0.04  0.30 -0.01]
 [-0.01  0.19 -0.04  0.90 -0.07  0.01  0.06 -0.19 -0.09  0.17]
 [-0.14 -0.19 -0.15 -0.07  1.18  0.09 -0.03  0.10  0.22  0.17]
 [ 0.29  0.17 -0.20  0.01  0.09  0.96  0.07 -0.04 -0.09 -0.23]
 [ 0.27  0.41 -0.31  0.06 -0.03  0.07  0.71 -0.10 -0.09 -0.06]
 [-0.23 -0.01 -0.04 -0.19  0.10 -0.04 -0.10  1.42 -0.30 -0.07]
 [-0.10 -0.13  0.30 -0.09  0.22 -0.09 -0.09 -0.30  0.91 -0.02]
 [-0.21 -0.13 -0.01  0.17  0.17 -0.23 -0.06 -0.07 -0.02  0.91]]

Please refer to the pandas-plink documentation for more information.

Authors

License

This project is licensed under the MIT License.

pandas-plink's People

Contributors

cholotook, dbolser, francois-a, horta


pandas-plink's Issues

Accessing and Subsetting Error

Nice package!
I tried to implement the example and it works except when I want to print the values using .sel().values. I am using python 3.8 on a cluster.
Here is my code and the error.
Thanks.

import scipy
from scipy import linalg
import numpy as np
#import csv
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import pandas
from pandas_plink import read_plink1_bin
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
geno = read_plink1_bin("~/plink_file/EUR_100k.bed",
                          "~/plink_file/EUR_100k.bim",
                          "~/plink_file/EUR_100k.fam")
print(geno)
Coordinates:
  * sample   (sample) object 'id2_0' 'id2_1' 'id2_2' ... 'id2_99998' 'id2_99999'
  * variant  (variant) object '22_rs375684679' ... '22_rs370652263'
    fid      (sample) <U9 'id1_0' 'id1_1' 'id1_2' ... 'id1_99998' 'id1_99999'
    iid      (sample) <U9 'id2_0' 'id2_1' 'id2_2' ... 'id2_99998' 'id2_99999'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    gender   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) float64 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    chrom    (variant) <U2 '22' '22' '22' '22' '22' ... '22' '22' '22' '22' '22'
    snp      (variant) <U11 'rs375684679' 'rs376238049' ... 'rs370652263'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int64 16052167 16052962 16052986 ... 51237364 51237712
    a0       (variant) <U50 'AAAAC' 'T' 'A' 'T' 'A' 'A' ... 'A' 'AT' 'C' 'G' 'A'
    a1       (variant) <U51 'A' 'C' 'C' 'A' 'C' 'C' ... 'T' 'G' 'A' 'T' 'A' 'G'
print(geno.sel(sample="id2_1", variant="22_rs370652263").values)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/ENV/lib/python3.8/site-packages/xarray/core/dataarray.py", line 567, in values
    return self.variable.values
  File "/home/user/ENV/lib/python3.8/site-packages/xarray/core/variable.py", line 448, in values
    return _as_array_or_item(self._data)
  File "/home/user/ENV/lib/python3.8/site-packages/xarray/core/variable.py", line 254, in _as_array_or_item
    data = np.asarray(data)
  File "/home/user/ENV/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/user/ENV/lib/python3.8/site-packages/dask/array/core.py", line 1336, in __array__
    x = self.compute()
  File "/home/user/ENV/lib/python3.8/site-packages/dask/base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/user/ENV/lib/python3.8/site-packages/dask/base.py", line 438, in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
  File "/home/user/ENV/lib/python3.8/site-packages/dask/base.py", line 217, in collections_to_dsk
    _opt_list.append(opt(_graph_and_keys[0], _graph_and_keys[1], **kwargs))
  File "/home/user/ENV/lib/python3.8/site-packages/dask/array/optimization.py", line 46, in optimize
    dsk = ensure_dict(dsk)
  File "/home/user/ENV/lib/python3.8/site-packages/dask/utils.py", line 1033, in ensure_dict
    result.update(dd)
  File "/home/user/ENV/lib/python3.8/_collections_abc.py", line 720, in __iter__
    yield from self._mapping
  File "/home/user/ENV/lib/python3.8/site-packages/dask/blockwise.py", line 229, in __iter__
    return iter(self._dict)
  File "/home/user/ENV/lib/python3.8/site-packages/dask/blockwise.py", line 212, in _dict
    dsk, _ = fuse(self.dsk, [self.output])
  File "/home/user/ENV/lib/python3.8/site-packages/dask/optimization.py", line 496, in fuse
    if not config.get("optimization.fuse.active"):
  File "/home/user/ENV/lib/python3.8/site-packages/dask/config.py", line 459, in get
    result = result[k]
KeyError: 'optimization'

genotype values issue

I get genotype codes of 2 for a0 and 0 for a1 when I use pandas-plink; my pandas-plink version is 2.0.4.

Chromosome names for X, Y and MT?

Sorry if I'm doing something wrong, but when I use plink2 ... --recode vcf I get chromosomes called 21, 22, X, Y and even MT... However, using read_plink(files), they are encoded as 21, 22, 23, 24 and 25.

I know this encoding is expected:
https://www.cog-genomics.org/plink/1.9/input

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4.

However, is there a way to 'fix it' in the output like recode vcf does?

I don't see anything in the documentation about this...

I'm currently writing files out as:

...
# Find the SNVs

p = bim.a0.str.len() == 1
q = bim.a1.str.len() == 1

snv = bim[p & q]

print("SNVs:", snv.shape)

snv.to_csv("sensible_name.tsv", sep="\t", columns=["chrom", "pos", "snp", "a0", "a1"], index=False)

So I'm trying to avoid going in and messing with the DataFrame/array line by line...
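
For what it is worth, one possible workaround (a sketch only, assuming human data with 22 autosomes so that the n+1..n+4 rule quoted above applies) is to remap the chrom column of the bim DataFrame before writing it out:

# Hypothetical remapping for human data (22 autosomes): 23->X, 24->Y, 25->XY, 26->MT.
chrom_map = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}
bim["chrom"] = bim["chrom"].astype(str).replace(chrom_map)
# snv can then be rebuilt and written to TSV exactly as above.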

Hangs on bed.compute()

Transforming the dask object to Pandas w/ the .compute() function hangs for a long period of time.

I have a BED matrix of 22M variants x 300 samples and selecting 1 row:

bed[0,:].compute()

takes ~10 mins.

Using Python 3.6.4

error on importing plink data

I am getting this error

In [3]: (bim, fam, G) = read_plink('Results/variants.filt_v2.0_imputed')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-d18b061fd1ff> in <module>()
----> 1 (bim, fam, G) = read_plink('Results/variants.filt_v2.0_imputed')

/Users/inti/DATA/miniconda/envs/ngs/lib/python2.7/site-packages/pandas_plink/read.pyc in read_plink(file_prefix, verbose)
    103     provide a single view of the files.
    104     """
--> 105     from dask.array import concatenate
    106
    107     file_prefixes = glob(file_prefix)

/Users/inti/DATA/miniconda/envs/ngs/lib/python2.7/site-packages/dask/array/__init__.py in <module>()
      2
      3 from ..utils import ignoring
----> 4 from .core import (Array, block, concatenate, stack, from_array, store,
      5                    map_blocks, atop, to_hdf5, to_npy_stack, from_npy_stack,
      6                    from_delayed, asarray, asanyarray,

/Users/inti/DATA/miniconda/envs/ngs/lib/python2.7/site-packages/dask/array/core.py in <module>()
     19
     20 try:
---> 21     from cytoolz import (partition, concat, join, first,
     22                          groupby, valmap, accumulate, assoc)
     23     from cytoolz.curried import filter, pluck

/Users/inti/DATA/miniconda/envs/ngs/lib/python2.7/site-packages/cytoolz/__init__.py in <module>()
     16
     17 # Aliases
---> 18 comp = compose
     19
     20 # Always-curried functions

NameError: name 'compose' is not defined

Any ideas why?

Py3.7 with conda

Is it possible to update the conda recipe to allow for py-37 to work? Or is there a fundamental issue with py-37 and the codebase?

Speed up reading of single variants?

Hi Danilo,

Thanks for making this package, it's very useful!

One relatively common use case for me is that I have a big PLINK file and want to get a data frame with the genotypes for a single variant. To do that I run something like this:

G = read_plink1_bin('file.bed', verbose=False)
var = G.sel(variant='{}_{}'.format(chrom, snp_id))
gt = pd.DataFrame({'sample': var.sample.values, 'gt': var.values})

This does work, but if the input file is big it takes surprisingly long. In particular the read_plink1_bin step is very slow. I assume this is because it's reading quite a lot of data about all the other variants into memory, which is never actually used.

Is there a way to tell read_plink1_bin that I only care about a single variant in the file and ignore the rest?
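
One workaround worth trying (a sketch, not a benchmark; it reuses chrom and snp_id from the snippet above and a placeholder file prefix) is the tuple-returning read_plink interface: locate the variant's row in the bim metadata and materialize only that row of the lazy bed matrix:

import pandas as pd
from pandas_plink import read_plink

# "file" is a placeholder prefix; bed is a lazy (variants x samples) dask array.
(bim, fam, bed) = read_plink("file", verbose=False)

# bim.chrom is stored as text, so compare against a string.
row = bim[(bim.chrom == str(chrom)) & (bim.snp == snp_id)].i.values[0]
gt = pd.DataFrame({"sample": fam.iid.values, "gt": bed[row, :].compute()})

Parsing the bim/fam metadata for ~11M variants still takes time, but only the chunk containing that variant is read from the .bed file.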

support for int8

I'd like to read in plink data as int8 arrays rather than float32. Looking at _bed_read.py, it seems this could be done at the chunk level by modifying _read_bed_chunk(). Are there any gotchas I'm missing? Happy to contribute this feature.
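
As a user-side sketch (not the in-reader change proposed here), the conversion can already be done lazily on top of the returned dask array, with missing genotypes mapped to -1:

import dask.array as da
import numpy as np
from pandas_plink import read_plink

# "file" is a placeholder prefix.
(bim, fam, bed) = read_plink("file", verbose=False)

# Downcast chunk by chunk: NaN (missing) -> -1, then int8.
bed_i8 = da.map_blocks(
    lambda b: np.where(np.isnan(b), -1, b).astype(np.int8),
    bed,
    dtype=np.int8,
)

Materialized blocks then take one byte per genotype, although the reader itself still decodes into floats first.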

Genotype code wrong!

Hi there,

I was surprised to find a major error in your pandas-plink package! I used it to read bed/bim/fam files into Python, but I found that my minor-allele homozygotes were coded as 0 and major-allele homozygotes were coded as 2! That inverts all my results! Can you please check your source code and correct this? Otherwise it's really dangerous for others to keep using pandas-plink!

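For reference, a hedged sketch of two ways to obtain the other coding, assuming the ref argument of read_plink1_bin behaves as its name suggests (it accepts "a1" or "a0"); "file.bed" is a placeholder:

from pandas_plink import read_plink1_bin

# ref selects which allele the dosage counts.
G = read_plink1_bin("file.bed", verbose=False, ref="a0")

# Or flip an already-loaded matrix: 0 <-> 2, 1 stays 1, NaN (missing) stays NaN.
# Note the a0/a1 coordinates still carry the original allele labels.
G_flipped = 2 - G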

Increase of memory

Dear Developers:

I notice that after the last pandas-plink update, the memory needed to lazily load the same matrix has increased significantly. Could you look into this issue?

Slower performance of read_plink1_bin vs read_plink

Hi,

I've only compared version 2.2.2 (after upgrading from 2.0.5), but am getting significantly slower load times with the new function:

from pandas_plink import read_plink
bim, fam, bed = read_plink(plink_prefix_path, verbose=True)

takes ~25 s for a VCF with ~11M variants and ~850 samples, whereas

from pandas_plink import read_plink1_bin
G = read_plink1_bin(plink_prefix_path+'.bed', verbose=True)

takes ~7.5 min on the same VCF. The interface of read_plink is very convenient -- why is this being deprecated?

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Hi,
the following code generate the error "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
G = read_plink1_bin("./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.bed", "./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.bim", "./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.fam", verbose=True)

The input files were generated by PLINK v1.90b6.20. The detailed error message is:
ValueError Traceback (most recent call last)
in
----> 1 G = read_plink1_bin("./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.bed", "./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.bim", "./3.filtering_samples_and_SNPs/prov_sample.indF3_mds_pca.fam")

~/software/miniconda3/lib/python3.7/site-packages/pandas_plink/read.py in read_plink1_bin(bed, bim, fam, verbose, ref, chunk)
255 nsamples = fam_df.shape[0]
256 sample_ids = fam_df["iid"]
--> 257 variant_ids = bim_df["chrom"] + "_" + bim_df["snp"]
258
259 if ref == "a1":

~/software/miniconda3/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
62
63 other = item_from_zerodim(other)
---> 64
65 return method(self, other)
66

~/software/miniconda3/lib/python3.7/site-packages/pandas/core/ops/init.py in wrapper(left, right)
501 elif is_list_like(right) and not isinstance(right, (ABCSeries, ABCDataFrame)):
502 # GH17901
--> 503 right = to_series(right)
504
505 if flex is not None and isinstance(right, ABCDataFrame):

~/software/miniconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op, str_rep)
195 def comparison_op(left: ArrayLike, right: Any, op) -> ArrayLike:
196 """
--> 197 Evaluate a comparison operation =, !=, >=, >, <=, or <.
198
199 Parameters

~/software/miniconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_arithmetic_op(left, right, op, str_rep)
147 # In this case we do not fall back to the masked op, as that
148 # will handle complex numbers incorrectly, see GH#32047
--> 149 raise
150 result = masked_arith_op(left, right, op)
151

~/software/miniconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in evaluate(op, a, b, use_numexpr)
229 op_str = _op_str_mapping[op]
230 if op_str is not None:
--> 231 use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
232 if use_numexpr:
233 return _evaluate(op, op_str, a, b) # type: ignore

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

looking forward to your help.
Thank you.

ImportError: No module named 'pandas_plink.bed_reader'

I am encountering a similar problem to one that I believe has come up in the past.
I am using python 3.5.2 and pandas_plink 1.2.25, which was installed with pip.

I have tried this in a clean virtualenv

The full error message:

$ python -c "import pandas_plink; pandas_plink.test()"
=============================================== test session starts ================================================
platform linux -- Python 3.5.2, pytest-3.4.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/user/analysis, inifile:
plugins: pep8-1.0.6
collected 10 items                                                                                                 

pandas_plink/__init__.py .                                                                                   [ 10%]
pandas_plink/bed_read.py .                                                                                   [ 20%]
pandas_plink/builder.py .                                                                                    [ 30%]
pandas_plink/conftest.py .                                                                                   [ 40%]
pandas_plink/read.py .F

===================================================== FAILURES =====================================================
______________________________________ [doctest] pandas_plink.read.read_plink ______________________________________
056         1     1   rs2949420  0.0  45257  C  T  1
057         2     1   rs2949421  0.0  45413  0  0  2
058         3     1   rs2691310  0.0  46844  A  T  3
059         4     1   rs4030303  0.0  72434  0  G  4
060         >>> print(fam.head()) #doctest: +NORMALIZE_WHITESPACE
061                 fid       iid    father    mother gender trait  i
062         0  Sample_1  Sample_1         0         0      1    -9  0
063         1  Sample_2  Sample_2         0         0      2    -9  1
064         2  Sample_3  Sample_3  Sample_1  Sample_2      2    -9  2
065         >>> print(bed.compute()) #doctest: +NORMALIZE_WHITESPACE
UNEXPECTED EXCEPTION: ImportError("No module named 'pandas_plink.bed_reader'",)
Traceback (most recent call last):

  File "/usr/lib/python3.5/doctest.py", line 1321, in __run
    compileflags, 1), test.globs)

  File "<doctest pandas_plink.read.read_plink[5]>", line 1, in <module>

  File "/home/user/bio/lib/python3.5/site-packages/dask/base.py", line 143, in compute
    (result,) = compute(self, traverse=False, **kwargs)

  File "/home/user/bio/lib/python3.5/site-packages/dask/base.py", line 392, in compute
    results = get(dsk, keys, **kwargs)

  File "/home/user/bio/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)

  File "/home/user/bio/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)

  File "/home/user/bio/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc

  File "/home/user/bio/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)

  File "/home/user/bio/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)

  File "/home/userl/bio/lib/python3.5/site-packages/pandas_plink/bed_read.py", line 6, in read_bed_chunk
    from .bed_reader import ffi, lib

ImportError: No module named 'pandas_plink.bed_reader'

/home/user/bio/lib/python3.5/site-packages/pandas_plink/read.py:65: UnexpectedException
======================================== 1 failed, 5 passed in 0.14 seconds ========================================

Thanks in advance for any help you may be able to provide

ImportError: No module named 'pandas_plink.bed_reader'

As the title says, when I try to import pandas_plink it fails on bed_reader.
I'm on

python 3.5.2
pandas-plink (1.2.15)

the complete error message is as follows

$ python -c "import pandas_plink; pandas_plink.test()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/.virtualenvs/default/lib/python3.5/site-packages/pandas_plink/__init__.py", line 6, in <module>
    from .read import read_plink
  File "/home/user/.virtualenvs/default/lib/python3.5/site-packages/pandas_plink/read.py", line 10, in <module>
    from .bed_read import read_bed
  File "/home/user/.virtualenvs/default/lib/python3.5/site-packages/pandas_plink/bed_read.py", line 3, in <module>
    from .bed_reader import ffi, lib
ImportError: No module named 'pandas_plink.bed_reader'

I used pip to install this package.
I have a working setup with version 1.1.6.

Bug - 'join' is not defined

Trying to load a plink file with the plink prefix: '/hps/nobackup/hipsci/scratch/genotypes/imputed/2017-03-27/Full_Filtered_SNPs_Plink-F/hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.norm.renamed'

And I get this error:

/nfs/software/stegle/users/dseaton/conda-envs/limix_env/lib/python2.7/site-packages/pandas_plink/read.pyc in _clean_prefixes(prefixes)
232 path = p
233 else:
--> 234 path = join(dirn, basen)
235 paths.append(path)
236 return list(set(paths))

NameError: global name 'join' is not defined

Bim and fam files are ordered by index

Currently the bim and fam files are being sorted by index (df.sort_index(inplace=True)), which is undesirable: bim and fam files are ordered the same way as the genotype matrix, so sorting them shuffles the annotation of the genotypes.
This is only transparent if, for example, the samples in the genotype matrix were ordered alphabetically in the first place, but not in any other case.

Edit: the original order can be restored by sorting the dataframes by column "i". It remains questionable, though, whether sorting the bim and fam files by index makes sense in the first place.
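
A minimal sketch of that restoration step ("my_data" is a placeholder prefix):

from pandas_plink import read_plink

# In the affected versions, bim/fam come back index-sorted.
(bim, fam, bed) = read_plink("my_data")

# Restore the genotype-matrix order using the "i" column, as noted in the edit above.
bim = bim.sort_values("i").reset_index(drop=True)
fam = fam.sort_values("i").reset_index(drop=True)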

Cannot import name 'read_plink1_bin' from 'pandas_plink'

I have installed pandas_plink via pip, but I keep hitting the same problem.

----> 1 from pandas_plink import read_plink1_bin

ImportError: cannot import name 'read_plink1_bin' from 'pandas_plink' (/home/username/anaconda3/lib/python3.7/site-packages/pandas_plink/__init__.py)

I tried to update and reinstall Anaconda, but it didn't help.
What should I do?
The function read_plink works, but I do not completely understand how to use it.

Thanks in advance.

requirements

It seems the following requirement is currently not specified:

dateutil>=2.5

Install 2.2.9 using pip?

Sorry for being a newb, but how do I get 2.2.9 using pip?

I've set up a virtualenv, and then typed pip install pandas-plink>=2.2.9, but when I look at .venv/lib/python3.6/site-packages/pandas_plink/__init__.py, I see __version__ = "2.2.4".

I'm asking because I'm seeing this error:

Traceback (most recent call last):
  File "convert_to_plink.py", line 5, in <module>
    import pandas_plink
  File ".venv/lib/python3.6/site-packages/pandas_plink/__init__.py", line 1, in <module>
    from ._chunk import Chunk
  File ".venv/lib/python3.6/site-packages/pandas_plink/_chunk.py", line 1, in <module>
    from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'

reading plink2 format?

Hi developer,

Thanks for developing this nice package. I was wondering if there are plans to support the plink2 format in the future.

Thanks!

Yanyu

❓ `np.nan` value from bed.compute()

Hi,

I am trying to follow 1000G_example.ipynb with the read_plink function. When I compute the genotype codes from the bed file, I get some nan values; do these represent the binary genotype code 3, i.e. homozygous for the second allele?

Thank you
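
As far as I can tell, NaN marks missing genotypes (the PLINK missing code) rather than a homozygous call. A quick sketch for inspecting them (placeholder prefix):

import numpy as np
from pandas_plink import read_plink

(bim, fam, bed) = read_plink("1kg_phase1_chr2")  # placeholder prefix

arr = bed[:1000, :].compute()   # materialize a small slice
print(np.isnan(arr).sum())      # count of missing genotype entries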

Reading Data to Xarray Taking a long Time

Hi,

I am trying to read in a subsample of UK Biobank data (1000 individuals) at ~500,000 SNPs across all chromosomes, and it's taking a long time. I am trying to read this directly into an xarray:

X = G[indiv,SNPs].compute()

where indiv and SNPs are binary vectors. I'm working on a server that I believe has enough memory (256 GB) to hold this directly in memory. It is taking a very long time (hours) to read this in. Could this have anything to do with the chunksize, and is there any chunksize you would recommend? Is there anything you would recommend for speeding this up?

Thanks in advance for your help; this package has been very useful for me.
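
Two things that may be worth trying (hedged; not benchmarked on UK Biobank data): control the read chunk size via the Chunk helper, and convert the boolean masks to integer indices before computing. The file name below is a placeholder, and indiv and SNPs are the vectors from the snippet above:

import numpy as np
from pandas_plink import Chunk, read_plink1_bin

# Larger chunks mean fewer, bigger reads; tune to your access pattern.
G = read_plink1_bin("ukb_subset.bed", verbose=False,
                    chunk=Chunk(nsamples=1024, nvariants=8192))

indiv_idx = np.flatnonzero(indiv)   # boolean mask -> integer positions
snp_idx = np.flatnonzero(SNPs)
X = G.isel(sample=indiv_idx, variant=snp_idx).compute()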

Support for additional dtypes?

Thanks for developing this package!

Is there a specific reason for the hard-coded int64 type? It would be great to have some flexibility here, for example to load imputed dosages as float32, or genotypes as int8 with missing values set to -1.

write_plink1_bin() gives error "TypeError: to_csv() got an unexpected keyword argument 'line_terminator'".

I have Python 3.9.18 and pandas 2.1.1. My pandas-plink version is 2.2.9.

The command:
write_plink1_bin(G_sample1, "sample1.bed")

gives this error output:
"TypeError: to_csv() got an unexpected keyword argument 'line_terminator'".
The current pandas version has changed the "line_terminator" parameter to "lineterminator".

Do I need to downgrade my Pandas version to use pandas_plink?

Thank you

write_plink1_bin(G_sample1, "sample1.bed")
Writing BED: 100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.54it/s]
Writing FAM...


TypeError Traceback (most recent call last)
Cell In[86], line 1
----> 1 write_plink1_bin(G_sample1, "sample1.bed")

File ~/miniforge3/envs/Genomics/lib/python3.9/site-packages/pandas_plink/_write.py:183, in write_plink1_bin(G, bed, bim, fam, major, verbose)
180 write_bed(bed, G, major, verbose)
182 _echo("Writing FAM... ", end="", disable=not verbose)
--> 183 _write_fam(fam, G)
184 _echo("done.", disable=not verbose)
186 _echo("Writing BIM... ", end="", disable=not verbose)

File ~/miniforge3/envs/Genomics/lib/python3.9/site-packages/pandas_plink/_write.py:261, in _write_fam(filepath, G)
258 df[col] = G.sample[col].values
259 df[col] = df[col].astype(col_type)
--> 261 df.to_csv(
262 filepath,
263 index=False,
264 sep="\t",
265 header=False,
266 encoding="ascii",
267 line_terminator="\n",
268 )

TypeError: to_csv() got an unexpected keyword argument 'line_terminator'
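
Until the package catches up with the pandas 2.x rename, the options seem to be pinning pandas below 2.0 or shimming the removed keyword. An untested workaround sketch, assuming the only incompatibility is the keyword rename:

import pandas as pd

_orig_to_csv = pd.DataFrame.to_csv

def _to_csv_compat(self, *args, line_terminator=None, **kwargs):
    # Forward the old keyword under its new name (renamed in pandas 2.0).
    if line_terminator is not None:
        kwargs["lineterminator"] = line_terminator
    return _orig_to_csv(self, *args, **kwargs)

pd.DataFrame.to_csv = _to_csv_compat
# ...then call write_plink1_bin(G_sample1, "sample1.bed") as before.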

Memory explosion

When I try to load a small subset of the 1000 Genomes data, my memory blows up during the data import.
In particular, it seems dask uses all available threads to spawn Python sessions, which in turn import the whole dataset.

I am not sure how to solve this issue. Do I need to spawn a local cluster before I start the import?

I would be thankful for any help!

See below for a reproducible example. The used data is available at ftp://climb.genomics.cn/pub/10.5524/100001_101000/100116/1kg_phase1_chr2.tar.gz.

import numpy as np
from pandas_plink import read_plink

(bim, fam, bed) = read_plink('data/genotypes/1kg_phase1_chr2')  
rand = np.random.choice(bim.i.values, 1000) 
X = bed[rand, :].compute()
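
Two hedged mitigations that might help here (a sketch, not a fix of the underlying behaviour): draw unique, sorted row indices, and run the compute on dask's synchronous scheduler so that multiple worker threads are not holding chunks at the same time:

import numpy as np
from pandas_plink import read_plink

(bim, fam, bed) = read_plink('data/genotypes/1kg_phase1_chr2')

# Unique, sorted indices keep the fancy-indexing task graph small.
rand = np.sort(np.random.choice(bim.i.values, 1000, replace=False))

# The synchronous scheduler runs the graph in the calling thread only.
X = bed[rand, :].compute(scheduler="synchronous")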
