pharmgkb / pgxpop
License: Mozilla Public License 2.0
Based on https://pharmgkb.blogspot.com/2021/05/cyp3a4-now-available-in-pharmvar.html,
is there reason to hope that PGxPOP might soon report CYP3A4?
I'm using one of the sorted files that PGxPOP provides for testing, but when running the software I get this error:
As you can see in the last line, there is a problem with the utf-8 codec:
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I ran hexdump on the file to check those first bytes:
hexdump -n 100 s1s1.sorted.vcf-gz.tbi (this is one of the test files PGxPOP provides).
So the problem seems to be that leading 8b1f.
Then I explored which codecs I have on my Linux system. I'm working on Windows using WSL2 with Ubuntu. So:
I tried to change the codec to utf-8 with export LC_ALL=utf8 followed by export LANG="$LC_ALL", but without success: -bash: warning: setlocale: LC_ALL: cannot change locale (utf8): No such file or directory.
I really don't know what to do, and I don't even understand why this is happening, since en_US.utf8 should be working. I would appreciate it a lot if you could give me some guidance!
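One way to read this error: the bytes 0x1f 0x8b are the gzip magic number (hexdump prints them as the 16-bit word 8b1f), so the file is compressed binary data rather than UTF-8 text, and changing the locale would not alter that. The following is a minimal Python sketch, not PGxPOP's own code, that checks for those bytes and opens a file accordingly. The function names are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip or bgzip file


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzip-compressed text file for reading as UTF-8."""
    if looks_gzipped(path):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A .tbi index also starts with these bytes, but it decompresses to a binary index rather than VCF text, so it may be worth checking that --vcf points at the .vcf.gz file and not at its .tbi index.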
Just creating a tracking issue for this improvement. @gregmcinnes will add when he has a chance.
I have a problem running your script! I used the following command:
python bin/PGxPOP.py --vcf ./prueba/C11_v1.vcf.gz.tbi --phased --g CYP2D6 --build hg19 -o ./prueba/
And I get the following output:
Traceback (most recent call last):
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/PGxPOP.py", line 16, in <module>
    import Gene
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/Gene.py", line 8, in <module>
    from Variant import Variant
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/Variant.py", line 2, in <module>
    from DawgToys import clean_chr, iupac_nt
  File "/home/rembukai/BIOSOFT/PGxPOP/bin/DawgToys.py", line 2, in <module>
    import tabix
ModuleNotFoundError: No module named 'tabix'
I have installed tabix along with python in:
/home/rembukai/.local/lib/python3.8/site-packages (0.1)
How can I tell PGxPOP where to find tabix?
Thank you very much
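A check that sometimes helps with this kind of error is to confirm that the interpreter running PGxPOP is the same one the package was installed for. The sketch below is generic Python, not part of PGxPOP, and module_location is a hypothetical helper name.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Run this with the same "python" command used to launch PGxPOP.
    print("interpreter:", sys.executable)
    print("tabix module:", module_location("tabix"))
```

If the second line prints None, installing into that exact interpreter with python -m pip install pytabix may help, assuming the tabix import here is the module provided by the pytabix package.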
Hi,
I am currently trying to use PGxPOP for haplotyping the UK Biobank's VCF files (v4.2, ASCII). I installed PGxPOP on my Mac using the list of commands given, in an environment created with conda using Python 3.6. However, when testing the software with the test data (the one you have on your page: VcfReaderTest-phasing.vcf), PGxPOP throws this error:
  File "/Users/sharmaa9/Desktop/Conda/envs/python3.6/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I also tried setting PYTHONIOENCODING=UTF-8, but it still fails.
Also, just for your information, I am using the following command:
python bin/PGxPOP.py --vcf /Users/sharmaa9/Desktop/PGx/PGXPOP/PGxPOP/test_vcf.vcf.gz.tbi -g CYP2D6 --phased --build grch38 -o test_new.txt
For my own dataset, I first phased the file with Eagle and then ran PGxPOP (after bgzipping and tabix-indexing), which throws the same error.
I apologize if this question is naive, but I am new to the world of population genetics as well as programming.
I truly appreciate your help.
Thanks
Hello,
First of all, thank you for PGxPOP; it has made my work much easier!
Now that it is not being actively maintained, do you have any suggestions for how to keep using PGxPOP with updated allele definitions from PharmVar or PharmGKB?
Perhaps there is a script that can convert PharmGKB variant tables into allele definition .json files?
Or is the solution to use PharmCAT instead?
Any help is much appreciated.
Thank you!
When running the command
python PGxPOP/bin/PGxPOP.py --vcf chr10_HRC.vcf.gz --gene CYP2C19 --phased --build hg19 --batch --output cyp2c19
I'm getting the error
Traceback (most recent call last):
  File "PGxPOP/bin/PGxPOP.py", line 308, in <module>
    cd.run()
  File "PGxPOP/bin/PGxPOP.py", line 54, in run
    results = self.process_gene(g)
  File "PGxPOP/bin/PGxPOP.py", line 68, in process_gene
    diplotypes, sample_variants, uncallable = self.get_calls(gene, gt_matrices)
  File "PGxPOP/bin/PGxPOP.py", line 177, in get_calls
    for samp in range(gt_mat[0].shape[1]):
IndexError: tuple index out of range
Here is a gist with the --debug output.
Is it possibly because not all of the variants are in the VCF file?
Line 267 in 3da8adf
seems to have an extra comma, which causes problems when parsing the results with CSV readers.
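Until the output is fixed, a reader can tolerate the trailing comma by dropping empty trailing header fields. This is a minimal sketch, assuming a header such as sample_id,gene,diplotype, with a trailing comma; read_rows is a hypothetical name and the sample data is made up for illustration.

```python
import csv
import io


def read_rows(handle):
    """Yield each CSV row as a dict, ignoring empty trailing header fields."""
    reader = csv.reader(handle)
    header = next(reader)
    # A trailing comma produces an empty final column name; drop it.
    while header and header[-1] == "":
        header.pop()
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip() stops at the shorter sequence, so extra trailing cells are ignored.
        yield dict(zip(header, row))


sample = "sample_id,gene,diplotype,\nHG02236,CYP2C19,*1/*1\n"
rows = list(read_rows(io.StringIO(sample)))
```

Because zip() truncates to the header length, any data sitting past the last named column would be silently dropped, which is acceptable only if the extra column is always empty.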
I am running the script on phased GSA array data. I have tried 2D6 and 2C19 on CHR22 and CHR0 respectively, and everything comes back reported as *1/*1 and NM. Is there any troubleshooting I could perform to find out why this is happening?
Thank you.
Hi Greg, Adam,
Many thanks for releasing this tool and for providing a nice overview of CYP AF in UKB!
One question: does PGxPOP handle unphased VCFs? --phased being an optional argument seems to suggest the input can be either phased or unphased:
________________________________________
| ___ ___ ___ ___ ___ |
| | _ \/ __|_ _| _ \/\ \| _ \ |
| | _/ (_ \ \ / _/ \ | _/ |
| |_| \___/_\_\_| \__\/|_| |
| |
| v1.0 |
| Written by |
| Adam Lavertu and Greg McInnes |
| with help from PharmGKB. |
|________________________________________|
Copyright (C) 2020 Stanford University.
Distributed under the Mozilla Public License 2.0 open source license.
usage: PGxPOP.py [-h] [-f VCF] [-g GENE] [--phased] [--build BUILD] [--extra_variants] [-d] [-b] [-o OUTPUT]
CityDawg determines star allele haplotypes for samples in a VCF file and outputs predicted pharmacogenetic phenotypes.
optional arguments:
-h, --help show this help message and exit
-f VCF, --vcf VCF Input VCF
-g GENE, --gene GENE Gene to run. Select from list. Run all by default. CFTR, CYP2C9, CYP2D6, CYP4F2, IFNL3, TPMT, VKORC1, CYP2C19,
CYP3A5, DPYD, SLCO1B1, UGT1A1, CYP2B6, NUDT15
--phased Data is phased. Will try to determine phasing status from VCF by default.
(...)
The GitHub README.md, on the other hand, mentions only phased data input:
PGxPOP is a population-scale PGx allele caller designed to handle 100,000s of samples. Input is a phased VCF file, that has been indexed with tabix.
Many thanks,
Chris
Hi,
I was computing diplotype frequencies (PGxPOP output, phased data) and have a question about this. Because of phasing, the same diplotype appears in two forms, for example *1/*17 and *17/*1, and their counts should be pooled because they refer to the same diplotype. Is there a way to make this uniform in PGxPOP, for example a single representation such as *17/*1 for all samples, or do I need to write separate code to handle this in downstream analysis?
Thank you!
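If no such option is available, the pooling can be done downstream by putting the two haplotypes of every diplotype into a fixed order before counting. This is a minimal sketch, assuming diplotypes are written as two star alleles separated by "/" as in the examples above; all function names are hypothetical.

```python
import re
from collections import Counter


def allele_key(allele):
    """Sort key: numeric part of a star allele first, then the full string."""
    match = re.match(r"\*(\d+)", allele)
    number = int(match.group(1)) if match else float("inf")
    return (number, allele)


def canonical_diplotype(diplotype):
    """Return the diplotype with its two haplotypes in a fixed order."""
    return "/".join(sorted(diplotype.split("/"), key=allele_key))


def diplotype_counts(diplotypes):
    """Count diplotypes after pooling the two phase orientations."""
    return Counter(canonical_diplotype(d) for d in diplotypes)


counts = diplotype_counts(["*1/*17", "*17/*1", "*1/*1"])
```

This puts the lower-numbered allele first (*1/*17); reversing the sort would give *17/*1 instead. Either choice works as long as it is applied to every sample.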
Hello PGxPOP team,
I have run PGxPOP v1.0 on the two attached VCF files, which in principle should be the same apart from some differences in the header related to the bcftools command used to generate them. However, I found that the PGxPOP output differs. For instance:
The vcf file HG02236.a.vcf.gz gives me:
sample_id,gene,diplotype,
HG02236,CYP2C19,*1/*1
The vcf file HG02236.b.vcf.gz gives me:
HG02236,CYP2C19,*1/*2
This is how I run PGxPOP:
python bin/PGxPOP.py --vcf HG02236.a.vcf.gz -o HG02236.a.txt
python bin/PGxPOP.py --vcf HG02236.b.vcf.gz -o HG02236.b.txt
Is there something I am doing wrong that you can think of? Any help will be much appreciated.
Many thanks
Jorge
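One way to narrow this down is to confirm that the two files really carry the same records, since a difference in genotype separators ("/" versus "|") or chromosome naming would be easy to miss by eye. The sketch below is a generic comparison, not part of PGxPOP, and assumes gzip-compressed, tab-delimited VCFs; the function names are hypothetical.

```python
import gzip


def vcf_records(path):
    """Map (chrom, pos, ref, alt) to the sample columns of each record."""
    records = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            key = (fields[0], fields[1], fields[3], fields[4])
            records[key] = tuple(fields[9:])  # per-sample genotype columns
    return records


def diff_vcfs(path_a, path_b):
    """Return the records whose sample columns differ between two VCFs."""
    a, b = vcf_records(path_a), vcf_records(path_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

Calling diff_vcfs("HG02236.a.vcf.gz", "HG02236.b.vcf.gz") returns an empty dict when the records match, in which case the header differences would be the next thing to look at.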