pharmgkb / pgxpop

PGxPOP

License: Mozilla Public License 2.0

Python 100.00%

pgxpop's Issues

CYP3A4

Based on https://pharmgkb.blogspot.com/2021/05/cyp3a4-now-available-in-pharmvar.html,
is there reason to hope that PGxPOP might soon report CYP3A4?

UnicodeDecodeError

I'm using one of the sorted files that PGxPOP provides for testing, but running the software fails with an error. As you can see in the last line of the traceback, there is a problem with the utf-8 codec:

File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I ran a hexdump on the file to check its first bytes:
hexdump -n 100 s1s1.sorted.vcf-gz.tbi (this is one of the test files PGxPOP provides).

So the problem is the 8b1f at the start of the file.

Then I checked which encodings are available on my Linux system. I'm working on Windows, using Ubuntu under WSL2.

I tried to switch to utf-8 with export LC_ALL=utf8 followed by export LANG="$LC_ALL", but without success: -bash: warning: setlocale: LC_ALL: cannot change locale (utf8): No such file or directory.

I really don't know what to do, and I don't even understand why this is happening, since en_US.utf8 should be working. I would appreciate any guidance you could give me!
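
The byte 0x8b at position 1 matches the second byte of the gzip magic number (1f 8b), which bgzip-compressed VCFs and their tabix .tbi indexes both start with; hexdump's default 16-bit display prints that pair as 8b1f on a little-endian machine. One possible reading, then, is that a compressed file is being opened as plain text, in which case locale settings would not matter. The sketch below illustrates the effect on a throwaway file; looks_gzipped is a hypothetical helper and none of this is PGxPOP's own code.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip/BGZF stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demo on a throwaway file (hypothetical content, not a PGxPOP test file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(b"##fileformat=VCFv4.2\n")

    is_gz = looks_gzipped(demo_path)

    # Reading the compressed bytes as UTF-8 text raises UnicodeDecodeError.
    try:
        with open(demo_path, encoding="utf-8") as handle:
            handle.read()
        decode_error = None
    except UnicodeDecodeError as err:
        decode_error = err

    # Opening through gzip in text mode decodes the content correctly.
    with gzip.open(demo_path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\n")

print(is_gz, type(decode_error).__name__, first_line)
```

If the same check returns True for the path given to --vcf, it may be worth confirming which file (the .vcf.gz or its .tbi index) the option is meant to receive.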

Create more informative error message when a gene is missing from provided VCF

Just creating a tracking issue for this improvement. @gregmcinnes will add when he has a chance.

ModuleNotFoundError: No module named 'tabix'

I have a problem running your script. I used the following command:
python bin/PGxPOP.py --vcf ./prueba/C11_v1.vcf.gz.tbi --phased --g CYP2D6 --build hg19 -o ./prueba/

And I get the following traceback:

Traceback (most recent call last):
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/PGxPOP.py", line 16, in <module>
    import Gene
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/Gene.py", line 8, in <module>
    from Variant import Variant
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/Variant.py", line 2, in <module>
    from DawgToys import clean_chr, iupac_nt
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/DawgToys.py", line 2, in <module>
    import tabix
ModuleNotFoundError: No module named 'tabix'

I have installed tabix for Python in:
/home/rembukai/.local/lib/python3.8/site-packages (0.1)
How can I tell PGxPOP where to find tabix?

Thank you very much
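
A package installed under one interpreter's site-packages is not visible to a different interpreter, so one thing worth checking is whether the python that runs PGxPOP.py is the one the package was installed for. A small standard-library sketch follows; module_available is a hypothetical helper, and the guess that the tabix module comes from the pytabix package is an assumption based on the reported version 0.1.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running and whether it can see the module.
print("interpreter:", sys.executable)
print("tabix importable:", module_available("tabix"))
```

If this prints False when run with the same python used for PGxPOP.py, installing through that interpreter (python -m pip install, followed by the package name) keeps the two in sync.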

Unicode Error in MacOSX

Hi,

I am currently trying to use PGxPOP to haplotype the UK Biobank's VCF files (v4.2, ASCII). I installed PGxPOP on my Mac, following the listed commands, in a conda environment with Python 3.6. However, when testing the software with the test data (the file on your page: VcfReaderTest-phasing.vcf), PGxPOP throws this error:

File "/Users/sharmaa9/Desktop/Conda/envs/python3.6/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I also tried setting PYTHONIOENCODING=UTF-8, but it still fails.

For your information, I am using the following command:
python bin/PGxPOP.py --vcf /Users/sharmaa9/Desktop/PGx/PGXPOP/PGxPOP/test_vcf.vcf.gz.tbi -g CYP2D6 --phased --build grch38 -o test_new.txt

For my own dataset, I first phased the file with Eagle and then ran PGxPOP (after compressing with bgzip and indexing with tabix), which throws the same error.

I apologize if this question is naive, but I am new to the world of population genetics as well as programming.

I truly appreciate your help.

Thanks

Recommendation for updated allele definitions?

Hello

First of all thank you for PGxPOP, it has made my work much easier!

Now that it is no longer actively maintained, do you have any suggestions for how to keep using PGxPOP with updated allele definitions from PharmVar or PharmGKB?
Perhaps there is a script that can convert PharmGKB variant tables into allele definition .json files?
Or is the solution to use PharmCAT instead?

Any help is much appreciated.
Thank you!

Alexander

tuple index out of range error

When running the command

python PGxPOP/bin/PGxPOP.py --vcf chr10_HRC.vcf.gz --gene CYP2C19 --phased --build hg19 --batch --output cyp2c19

I'm getting the error

Traceback (most recent call last):
  File "PGxPOP/bin/PGxPOP.py", line 308, in <module>
    cd.run()
  File "PGxPOP/bin/PGxPOP.py", line 54, in run
    results = self.process_gene(g)
  File "PGxPOP/bin/PGxPOP.py", line 68, in process_gene
    diplotypes, sample_variants, uncallable = self.get_calls(gene, gt_matrices)
  File "PGxPOP/bin/PGxPOP.py", line 177, in get_calls
    for samp in range(gt_mat[0].shape[1]):
IndexError: tuple index out of range

Here is a gist with the --debug output.

Is it possibly because not all of the variants are in the VCF file?
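
The failing expression reads the second entry of an array's shape. Assuming gt_mat[0] is a NumPy-style array, an empty or one-dimensional array has a shape tuple with fewer than two entries, and indexing position 1 raises exactly this IndexError. The sketch below uses plain tuples as stand-ins for the shapes; sample_count is a hypothetical helper, not a function in PGxPOP, and the shapes are made up.

```python
def sample_count(shape):
    """Return the number of columns (samples) of a 2-D shape, or raise a clearer error."""
    if len(shape) < 2:
        raise ValueError(
            f"expected a 2-D genotype matrix, got shape {shape}; "
            "possibly no variants were read for this gene"
        )
    return shape[1]


empty_shape = (0,)      # stand-in for the shape of an empty 1-D array
full_shape = (12, 500)  # hypothetical: 12 variants x 500 samples

# Indexing position 1 of a one-entry shape reproduces the reported message.
try:
    empty_shape[1]
    raw_error = None
except IndexError as err:
    raw_error = str(err)

print(raw_error)
print(sample_count(full_shape))
```

A guard of this kind would turn the bare IndexError into a message that points at the likely cause.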

extra comma

PGxPOP/bin/PGxPOP.py

Line 267 in 3da8adf

    f"{r['phenotype_presumptive']},{r['activity_score']},{r['uncallable']},,{r['extra_variants']}\n")

This line seems to have an extra comma (between uncallable and extra_variants), which causes problems when parsing the results with CSV readers.
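
To see the effect on a CSV reader, the sketch below formats one row with and without the doubled comma and counts the parsed fields. The row values and the four-column header are hypothetical; only the comma layout mirrors the quoted f-string, and whether the empty field is unintended depends on the header PGxPOP actually writes.

```python
import csv
import io

# Hypothetical header covering only the four fields visible in the quoted line.
header = ["phenotype_presumptive", "activity_score", "uncallable", "extra_variants"]

# Hypothetical row values.
r = {
    "phenotype_presumptive": "NM",
    "activity_score": "2.0",
    "uncallable": "none",
    "extra_variants": "none",
}

with_extra = (
    f"{r['phenotype_presumptive']},{r['activity_score']},"
    f"{r['uncallable']},,{r['extra_variants']}\n"
)
without_extra = (
    f"{r['phenotype_presumptive']},{r['activity_score']},"
    f"{r['uncallable']},{r['extra_variants']}\n"
)

row_extra = next(csv.reader(io.StringIO(with_extra)))
row_fixed = next(csv.reader(io.StringIO(without_extra)))

# The doubled comma yields one more field than the header has columns.
print(len(header), len(row_extra), len(row_fixed))
```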

Everything returns as normal

Running the script on phased GSA array data, I have tried CYP2D6 and CYP2C19 on chr22 and chr10 respectively, and everything comes back reported as *1/*1 and NM. Is there any troubleshooting I could perform to find out why this is happening?

Thank you.

Can PGxPOP handle unphased VCFs?

Hi Greg, Adam,

Many thanks for releasing this tool and for providing a nice overview of CYP AF in UKB!

One question: does PGxPOP handle unphased VCFs?

--phased being an optional argument seems to suggest the input can be either phased or unphased:

    ________________________________________
    |      ___  ___     ___  ___  ___        |
    |     | _ \/ __|_ _| _ \/\  \| _ \       | 
    |     |  _/ (_ \ \ /  _/  \  |  _/       |
    |     |_|  \___/_\_\_|  \__\/|_|         |
    |                                        |
    |                 v1.0                   |
    |              Written by                |     
    |     Adam Lavertu and Greg McInnes      |
    |        with help from PharmGKB.        |
    |________________________________________|
    
Copyright (C) 2020 Stanford University.
Distributed under the Mozilla Public License 2.0 open source license.
    
usage: PGxPOP.py [-h] [-f VCF] [-g GENE] [--phased] [--build BUILD] [--extra_variants] [-d] [-b] [-o OUTPUT]

CityDawg determines star allele haplotypes for samples in a VCF file and outputs predicted pharmacogenetic phenotypes.

optional arguments:
  -h, --help            show this help message and exit
  -f VCF, --vcf VCF     Input VCF
  -g GENE, --gene GENE  Gene to run. Select from list. Run all by default. CFTR, CYP2C9, CYP2D6, CYP4F2, IFNL3, TPMT, VKORC1, CYP2C19,
                        CYP3A5, DPYD, SLCO1B1, UGT1A1, CYP2B6, NUDT15
  --phased              Data is phased. Will try to determine phasing status from VCF by default.
(...)

The GitHub README.md, on the other hand, mentions only phased data input:

PGxPOP is a population-scale PGx allele caller designed to handle 100,000s of samples. Input is a phased VCF file, that has been indexed with tabix.

Many thanks,

Chris

Diplotype frequency calculation

Hi,

I am computing diplotype frequencies from PGxPOP output on phased data, and I have a question. Because of phasing, the same diplotype can appear in two forms, for example *1/*17 and *17/*1, and their counts should be pooled because they refer to the same diplotype. Is there a way to make the representation uniform in PGxPOP, for example a single form such as *17/*1 for all samples, or do I need to write separate code to handle this in downstream analysis?

Thank you!
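
If the pooling has to happen downstream, one option is to sort the two alleles of each diplotype before counting. The sketch below is one way to do that in Python; normalize_diplotype and allele_key are hypothetical helpers, the calls are made-up examples, and the approach deliberately discards which haplotype carries which allele.

```python
import re
from collections import Counter


def allele_key(allele):
    """Sort star alleles by their leading number; other names sort last."""
    match = re.match(r"\*(\d+)", allele)
    if match:
        return (int(match.group(1)), allele)
    return (float("inf"), allele)


def normalize_diplotype(diplotype):
    """Return the diplotype with its alleles in a fixed order, e.g. *17/*1 -> *1/*17."""
    alleles = diplotype.split("/")
    return "/".join(sorted(alleles, key=allele_key))


# Hypothetical calls standing in for the diplotype column of the output.
calls = ["*1/*17", "*17/*1", "*1/*1", "*2/*1"]
counts = Counter(normalize_diplotype(d) for d in calls)
print(counts)
```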

different PGxPOP outputs with same vcf input

Hello PGxPOP team,

I have run PGxPOP v1.0 on the two attached VCF files, which in principle should be identical apart from some header differences related to the bcftools command used to generate them. However, the PGxPOP output differs. For instance:
The vcf file HG02236.a.vcf.gz gives me:
sample_id,gene,diplotype,
HG02236,CYP2C19,*1/*1

The vcf file HG02236.b.vcf.gz gives me:
HG02236,CYP2C19,*1/*2

This is how I run PGxPOP:
python bin/PGxPOP.py --vcf HG02236.a.vcf.gz -o HG02236.a.txt
python bin/PGxPOP.py --vcf HG02236.b.vcf.gz -o HG02236.b.txt

Can you think of anything I might be doing wrong? Any help would be much appreciated.

Many thanks
Jorge

HG02236.a.vcf.gz
HG02236.b.vcf.gz

HG02236.b.txt
HG02236.a.txt
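
One way to test the assumption that the two files differ only in their headers is to compare the record lines directly. The sketch below does this with Python's standard gzip module, which can also read bgzip output; vcf_records and record_diff are hypothetical helpers, and the demo files are made-up two-record stand-ins rather than the attached VCFs.

```python
import gzip
import os
import tempfile


def vcf_records(path):
    """Return the non-header lines of a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]


def record_diff(path_a, path_b):
    """Return (records only in A, records only in B), ignoring header lines."""
    a, b = set(vcf_records(path_a)), set(vcf_records(path_b))
    return sorted(a - b), sorted(b - a)


def write_gz(path, text):
    """Write text to a gzip file (demo helper)."""
    with gzip.open(path, "wt") as out:
        out.write(text)


# Demo: different headers, one shared record, one differing genotype.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.vcf.gz")
    path_b = os.path.join(tmp, "b.vcf.gz")
    shared = "10\t200\tC\tT\t1|1\n"
    write_gz(path_a, "##source=cmdA\n#CHROM\tPOS\n" + shared + "10\t100\tA\tG\t0|0\n")
    write_gz(path_b, "##source=cmdB\n#CHROM\tPOS\n" + shared + "10\t100\tA\tG\t0|1\n")
    only_a, only_b = record_diff(path_a, path_b)

print(only_a)
print(only_b)
```

Calling record_diff("HG02236.a.vcf.gz", "HG02236.b.vcf.gz") on the real files would list any records that are not shared; two empty lists would point the search back toward the headers.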