Comments (10)
@sbslee , thank you for all your advice. For now, I've decided to convert my vcf file into csv to reduce memory usage. I plan to change my computer to a better one and then your fuc.pyvcf will be very useful for me.
Thanks a lot!
from fuc.
@emoranska, thanks for moving our discussion to GitHub! You will notice that I updated your post so that your code block is highlighted, which makes it easier to read.
I also noticed that you are creating a VcfFrame in an unconventional way :)
just_gt = vf.strip('GT').df
new_vf = pyvcf.VcfFrame([''], just_gt)
Above is rather redundant because the pyvcf.VcfFrame.strip
method already returns a VcfFrame object. So you can just do:
new_vf = vf.strip('GT')
Finally, I'm afraid I still need some more information to solve your problem. What exactly do you mean by "remove all records with the same genotype calls in all samples"? For example, do you want to remove a variant record if all samples have '0/1' genotype?
from fuc.
@emoranska, perhaps, is this what you were looking for?
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103],
... 'ID': ['.', '.', '.', '.'],
... 'REF': ['G', 'T', 'T', 'T'],
... 'ALT': ['A', 'C', 'A', 'C'],
... 'QUAL': ['.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT'],
... 'Steven': ['0/1', '0/0', '0/0', '0/1'],
... 'Sara': ['0/1', '0/1', '0/0', '0/0'],
... 'James': ['0/1', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0 chr1 100 . G A . . . GT 0/1 0/1 0/1
1 chr1 101 . T C . . . GT 0/0 0/1 0/1
2 chr1 102 . T A . . . GT 0/0 0/0 0/0
3 chr1 103 . T C . . . GT 0/1 0/0 0/1
>>> i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
>>> vf.df = vf.df[i]
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0 chr1 101 . T C . . . GT 0/0 0/1 0/1
1 chr1 103 . T C . . . GT 0/1 0/0 0/1
from fuc.
@sbslee, thank you for the answer!
Of course, my silly mistake with creating a VcfFrame object, thanks. Exactly, I would like to remove records with the same genotype for all samples (0/0, 0/1 or 1/1) and your script helps to achieve it. Now I would like to find regions of homozygosity where >=3 samples are 0/0 and >=3 samples are 1/1. Parsing vcf file is something new for me so you help me to learn it :)
Thank you!
from fuc.
@sbslee Could you explain to me how exactly your code works? Maybe could you write it to me without the lambda function? I'm learning Python and I'll be grateful for help in understanding well the code.
from fuc.
Unfortunately, when I've tried to run this script for my 1,9 GB vcf file, it raises me Memory Error...
from fuc.
The lambda function can be replaced from
i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
vf.df = vf.df[i]
to this:
def one_row(r):
return len(set(r[9:])) != 1
i = vf.df.apply(one_row, axis=1)
vf.df = vf.df[i]
Basically, for each given record the one_row
function returns True
if all samples have the same genotype call and False
if otherwise. This is done by counting how many unique genotypes are present for each record. For example, assuming there are only two samples A and B, if both have '0/1', then there is only one unique genotype. If samples A and B have '0/0' and '0/1', respectively, then there will be two unique genotypes.
With the apply
method, the one_row
method is applied to each record, which returns a series of booleans. This series is then used to select desired records.
from fuc.
I'm also not quite sure why you are running into a memory error. Did you strip your VcfFrame object with vf = vf.strip('GT')
before running the script I provided to you? If you want, you can send me your VCF file so I can test it myself.
P.S. When you said your VCF file is 1.9 Gb, did you mean when it's zipped? In other words, does your VCF filename end with .vcf.gz
?
from fuc.
@sbslee ,
Thank you for the explanation. I've commented a strip method and tried to run the script provided by you with saving the result to vcf file:
from fuc import pyvcf
vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')
# just_gt = vf.strip('GT')
#
# i = just_gt.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
no_the_same_variant = vf.df[i]
to_vcf = pyvcf.VcfFrame([''], no_the_same_variant)
P1_no_the_same_variant = to_vcf.to_file('P1_no_the_same_variant.vcf')
and it raises such an error:
Traceback (most recent call last):
File "/home/tomasz/Pulpit/burak/vcf_open.py", line 6, in
vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1632, in from_file
vf = cls.from_string(s)
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1704, in from_string
df = pd.read_table(StringIO(s), skiprows=skiprows,
MemoryError
Basically, I have two vcf files. The first of it, 1,9 Gb, is uncompressed (.vcf) and the second one, 10,7 Gb, I've received in zipped version (.zip, 3,3 Gb).
from fuc.
@emoranska, your second VCF file (10.7 Gb) is definitely too big to be loaded on a Jupyter Notebook all at once. Maybe the first file too. In any case, one thing you could try is loading a VCF file by one chromosome at a time:
from fuc import pyvcf
vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf.gz', regions='chr22:1-51,304,566')
Try above and see if it can load your VCF file at all. But before you do that, please read the following first:
- Above will only import VCF records for chromosome 22. This should reduce memory burden in your system.
- In order to import only a part of the VCF file, it needs to be compressed and indexed (for random access). You will notice that your VCF filename has the .gz extension ('P1_no_Un_NW_MT.vcf.gz'). You can both compress and index a VCF file at the same time using the
fuc vcf-index
command. - For the
regions
argument, we could have put 'chr22' instead of 'chr22:1-51,304,566' and it should work. However, I noticed that there was a bug in the the0.27.0
version where if I just put 'chr22' it will raise an error. I fixed this bug in the0.28.0-dev
branch, but I just wanted you to know in theory we shouldn't have to specify the start and end of the chromosome.
Hope this helps.
from fuc.
Related Issues (20)
- [VCF] Add function to find intersection between VCF files HOT 1
- [MAF/VCF] Add function to convert unannotated VCF to MAF HOT 1
- [MAF/VCF] Add function to create rainfall plots HOT 1
- [VCF] Add function to convert missing genotypes (./.) to REF homozygous (0/0)
- [BAM] Add function to plot uniformity in read depth
- [VCF] Add function to plot summary statistics HOT 1
- [VCF] Add function to convert 23andMe data to VCF
- [VCF] Error related to `pyvcf.VcfFrame.plot_hist` HOT 11
- [VCF] Add function to create a scatter plot of allele frequency for two datasets HOT 1
- [General] Error during installation fuc via conda HOT 2
- [VCF] Add function to compute AC/AN/AF in the INFO column HOT 1
- [VCF] Add function to remove samples with high missingness
- [General] Error while importing pyvcf HOT 2
- [VCF] Update `pyvcf.VcfFrame.filter_sampnum` to be more robust HOT 1
- [MAF] maf-oncoplt Index Error HOT 3
- [MAF/VCF] `pymaf.MafFrame.from_vcf` assumes CSQ is the first field in the INFO record HOT 3
- [VCF] Issue reading vcf from mutect2, strelka2 HOT 2
- [VCF] Question on usage HOT 7
- [MAF] Variant color coding mismatch HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fuc.