Coder Social home page Coder Social logo

Comments (10)

emoranska avatar emoranska commented on June 11, 2024 1

@sbslee , thank you for all your advice. For now, I've decided to convert my vcf file into csv to reduce memory usage. I plan to change my computer to a better one and then your fuc.pyvcf will be very useful for me.
Thanks a lot!

from fuc.

sbslee avatar sbslee commented on June 11, 2024

@emoranska, thanks for moving our discussion to GitHub! You will notice that I updated your post so that your code block is highlighted, which makes it easier to read.

I also noticed that you are creating a VcfFrame in an unconventional way :)

just_gt = vf.strip('GT').df

new_vf = pyvcf.VcfFrame([''], just_gt)

Above is rather redundant because the pyvcf.VcfFrame.strip method already returns a VcfFrame object. So you can just do:

new_vf = vf.strip('GT')

Finally, I'm afraid I still need some more information to solve your problem. What exactly do you mean by "remove all records with the same genotype calls in all samples"? For example, do you want to remove a variant record if all samples have '0/1' genotype?

from fuc.

sbslee avatar sbslee commented on June 11, 2024

@emoranska, perhaps, is this what you were looking for?

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'T'],
...     'ALT': ['A', 'C', 'A', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Steven': ['0/1', '0/0', '0/0', '0/1'],
...     'Sara': ['0/1', '0/1', '0/0', '0/0'],
...     'James': ['0/1', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  100  .   G   A    .      .    .     GT    0/1  0/1   0/1
1  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
2  chr1  102  .   T   A    .      .    .     GT    0/0  0/0   0/0
3  chr1  103  .   T   C    .      .    .     GT    0/1  0/0   0/1
>>> i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)
>>> vf.df = vf.df[i]
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara James
0  chr1  101  .   T   C    .      .    .     GT    0/0  0/1   0/1
1  chr1  103  .   T   C    .      .    .     GT    0/1  0/0   0/1

from fuc.

emoranska avatar emoranska commented on June 11, 2024

@sbslee, thank you for the answer!
Of course, my silly mistake with creating a VcfFrame object, thanks. Exactly, I would like to remove records with the same genotype for all samples (0/0, 0/1 or 1/1) and your script helps to achieve it. Now I would like to find regions of homozygosity where >=3 samples are 0/0 and >=3 samples are 1/1. Parsing vcf file is something new for me so you help me to learn it :)
Thank you!

from fuc.

emoranska avatar emoranska commented on June 11, 2024

@sbslee Could you explain to me how exactly your code works? Maybe could you write it to me without the lambda function? I'm learning Python and I'll be grateful for help in understanding well the code.

from fuc.

emoranska avatar emoranska commented on June 11, 2024

Unfortunately, when I've tried to run this script for my 1,9 GB vcf file, it raises me Memory Error...

from fuc.

sbslee avatar sbslee commented on June 11, 2024

@emoranska

The lambda function can be replaced from

i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

vf.df = vf.df[i]

to this:

def one_row(r):
    return len(set(r[9:])) != 1
    
i = vf.df.apply(one_row, axis=1)
vf.df = vf.df[i]

Basically, for each given record the one_row function returns True if all samples have the same genotype call and False if otherwise. This is done by counting how many unique genotypes are present for each record. For example, assuming there are only two samples A and B, if both have '0/1', then there is only one unique genotype. If samples A and B have '0/0' and '0/1', respectively, then there will be two unique genotypes.

With the apply method, the one_row method is applied to each record, which returns a series of booleans. This series is then used to select desired records.

from fuc.

sbslee avatar sbslee commented on June 11, 2024

I'm also not quite sure why you are running into a memory error. Did you strip your VcfFrame object with vf = vf.strip('GT') before running the script I provided to you? If you want, you can send me your VCF file so I can test it myself.

P.S. When you said your VCF file is 1.9 Gb, did you mean when it's zipped? In other words, does your VCF filename end with .vcf.gz?

from fuc.

emoranska avatar emoranska commented on June 11, 2024

@sbslee ,
Thank you for the explanation. I've commented a strip method and tried to run the script provided by you with saving the result to vcf file:

from fuc import pyvcf

vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')

# just_gt = vf.strip('GT')
#
# i = just_gt.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

i = vf.df.apply(lambda r: len(set(r[9:])) != 1, axis=1)

no_the_same_variant = vf.df[i]

to_vcf = pyvcf.VcfFrame([''], no_the_same_variant)

P1_no_the_same_variant = to_vcf.to_file('P1_no_the_same_variant.vcf')

and it raises such an error:

Traceback (most recent call last):
File "/home/tomasz/Pulpit/burak/vcf_open.py", line 6, in
vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf')
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1632, in from_file
vf = cls.from_string(s)
File "/home/tomasz/miniconda3/envs/burak/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 1704, in from_string
df = pd.read_table(StringIO(s), skiprows=skiprows,
MemoryError

Basically, I have two vcf files. The first of it, 1,9 Gb, is uncompressed (.vcf) and the second one, 10,7 Gb, I've received in zipped version (.zip, 3,3 Gb).

from fuc.

sbslee avatar sbslee commented on June 11, 2024

@emoranska, your second VCF file (10.7 Gb) is definitely too big to be loaded on a Jupyter Notebook all at once. Maybe the first file too. In any case, one thing you could try is loading a VCF file by one chromosome at a time:

from fuc import pyvcf

vf = pyvcf.VcfFrame.from_file('P1_no_Un_NW_MT.vcf.gz', regions='chr22:1-51,304,566')

Try above and see if it can load your VCF file at all. But before you do that, please read the following first:

  • Above will only import VCF records for chromosome 22. This should reduce memory burden in your system.
  • In order to import only a part of the VCF file, it needs to be compressed and indexed (for random access). You will notice that your VCF filename has the .gz extension ('P1_no_Un_NW_MT.vcf.gz'). You can both compress and index a VCF file at the same time using the fuc vcf-index command.
  • For the regions argument, we could have put 'chr22' instead of 'chr22:1-51,304,566' and it should work. However, I noticed that there was a bug in the the 0.27.0 version where if I just put 'chr22' it will raise an error. I fixed this bug in the 0.28.0-dev branch, but I just wanted you to know in theory we shouldn't have to specify the start and end of the chromosome.

Hope this helps.

from fuc.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.