[VCF] How to remove all rows with the same variant in VCF file using `pyvcf` about fuc HOT 10 CLOSED

sbslee commented on June 11, 2024

[VCF] How to remove all rows with the same variant in VCF file using `pyvcf`

from fuc.

Comments (10)

emoranska commented on June 11, 2024 1

@sbslee , thank you for all your advice. For now, I've decided to convert my vcf file into csv to reduce memory usage. I plan to change my computer to a better one and then your fuc.pyvcf will be very useful for me.
Thanks a lot!

from fuc.

sbslee commented on June 11, 2024

@emoranska, thanks for moving our discussion to GitHub! You will notice that I updated your post so that your code block is highlighted, which makes it easier to read.

I also noticed that you are creating a VcfFrame in an unconventional way :)

just_gt = vf.strip('GT').df

new_vf = pyvcf.VcfFrame([''], just_gt)

Above is rather redundant because the pyvcf.VcfFrame.strip method already returns a VcfFrame object. So you can just do:

new_vf = vf.strip('GT')

Finally, I'm afraid I still need some more information to solve your problem. What exactly do you mean by "remove all records with the same genotype calls in all samples"? For example, do you want to remove a variant record if all samples have '0/1' genotype?

from fuc.

sbslee commented on June 11, 2024

@emoranska, perhaps, is this what you were looking for?

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/0', '0/0', '0/1'],
...     'Sara': ['0/1', '0/1', '0/0', '0/0'],
...     'James': ['0/1', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/0  0/0   0/0
3  chr1  103  .   T   C    .      .    .     GT    0/1  0/0   0/1
>>> i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
>>> vf.df = vf.df[i]
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
1  chr1  103  .   T   C    .      .    .     GT    0/1  0/0   0/1

from fuc.

emoranska commented on June 11, 2024

@sbslee, thank you for the answer!
Of course, my silly mistake with creating a VcfFrame object, thanks. Exactly, I would like to remove records with the same genotype for all samples (0/0, 0/1 or 1/1) and your script helps to achieve it. Now I would like to find regions of homozygosity where >=3 samples are 0/0 and >=3 samples are 1/1. Parsing vcf file is something new for me so you help me to learn it :)
Thank you!

from fuc.

emoranska commented on June 11, 2024

@sbslee Could you explain to me how exactly your code works? Maybe could you write it to me without the lambda function? I'm learning Python and I'll be grateful for help in understanding well the code.

from fuc.

emoranska commented on June 11, 2024

Unfortunately, when I've tried to run this script for my 1,9 GB vcf file, it raises me Memory Error...

from fuc.

sbslee commented on June 11, 2024

@emoranska

The lambda function can be replaced from

i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

vf.df = vf.df[i]

to this:

def one_row(r):
    return len(set(r[9:])) != 1
    
i = vf.df.apply(one_row, axis=1)
vf.df = vf.df[i]

Basically, for each given record the one_row function returns True if all samples have the same genotype call and False if otherwise. This is done by counting how many unique genotypes are present for each record. For example, assuming there are only two samples A and B, if both have '0/1', then there is only one unique genotype. If samples A and B have '0/0' and '0/1', respectively, then there will be two unique genotypes.

With the apply method, the one_row method is applied to each record, which returns a series of booleans. This series is then used to select desired records.

from fuc.

sbslee commented on June 11, 2024

I'm also not quite sure why you are running into a memory error. Did you strip your VcfFrame object with vf = vf.strip('GT') before running the script I provided to you? If you want, you can send me your VCF file so I can test it myself.

P.S. When you said your VCF file is 1.9 Gb, did you mean when it's zipped? In other words, does your VCF filename end with .vcf.gz?

from fuc.

emoranska commented on June 11, 2024

@sbslee ,
Thank you for the explanation. I've commented a strip method and tried to run the script provided by you with saving the result to vcf file:

from fuc import pyvcf

vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')

# just_gt = vf.strip('GT')
#
# i = just_gt.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

no_the_same_variant = vf.df[i]

to_vcf = pyvcf.VcfFrame([''], no_the_same_variant)

P1_no_the_same_variant = to_vcf.to_file('P1_no_the_same_variant.vcf')

and it raises such an error:

Traceback (most recent call last):
File "/home/tomasz/Pulpit/burak/vcf_open.py", line 6, in
vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1632, in from_file
vf = cls.from_string(s)
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1704, in from_string
df = pd.read_table(StringIO(s), skiprows=skiprows,
MemoryError

Basically, I have two vcf files. The first of it, 1,9 Gb, is uncompressed (.vcf) and the second one, 10,7 Gb, I've received in zipped version (.zip, 3,3 Gb).

from fuc.

sbslee commented on June 11, 2024

@emoranska, your second VCF file (10.7 Gb) is definitely too big to be loaded on a Jupyter Notebook all at once. Maybe the first file too. In any case, one thing you could try is loading a VCF file by one chromosome at a time:

from fuc import pyvcf

vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf.gz', regions='chr22:1-51,304,566')

Try above and see if it can load your VCF file at all. But before you do that, please read the following first:

Above will only import VCF records for chromosome 22. This should reduce memory burden in your system.
In order to import only a part of the VCF file, it needs to be compressed and indexed (for random access). You will notice that your VCF filename has the .gz extension ('P1_no_Un_NW_MT.vcf.gz'). You can both compress and index a VCF file at the same time using the fuc vcf-index command.
For the regions argument, we could have put 'chr22' instead of 'chr22:1-51,304,566' and it should work. However, I noticed that there was a bug in the the 0.27.0 version where if I just put 'chr22' it will raise an error. I fixed this bug in the 0.28.0-dev branch, but I just wanted you to know in theory we shouldn't have to specify the start and end of the chromosome.

Hope this helps.

from fuc.

[VCF] How to remove all rows with the same variant in VCF file using `pyvcf` about fuc HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent