forestqc's People

Contributors

avallonking, eg-r

forestqc's Issues

ForestQC stat error

Hello, I noticed an error when using ForestQC stat:

ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading files...
Computing...
Traceback (most recent call last):
  File "/root/miniconda3/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.4', 'console_scripts', 'ForestQC')())
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 129, in main_stat
    vcf_process(target_file, stat_file, gc_file, ped_file, discord_geno_dict, hwe_file, gender_file, dp, gq,
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/stat.py", line 63, in vcf_process
    gc = getGC(pos, gc_table_by_chr[chr])
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/vcf_stat.py", line 99, in getGC
    step = gc_table.iloc[2,1] - gc_table.iloc[1,1]
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 925, in __getitem__
    return self._getitem_tuple(key)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1506, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 754, in _has_valid_tuple
    self._validate_key(k, i)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1409, in _validate_key
    self._validate_integer(key, axis)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

Is there any solution for this?
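
The last ForestQC frame in the traceback evaluates `gc_table.iloc[2,1]`, so the lookup needs at least three rows and two columns in the GC table for the chromosome being processed. A minimal sketch for spotting chromosomes with too few rows follows. It assumes the GC content file is tab-separated with the chromosome name in the first column. That layout, the helper name `chromosomes_with_too_few_rows`, and the example rows are assumptions for illustration and are not taken from the ForestQC documentation.

```python
from collections import Counter


def chromosomes_with_too_few_rows(lines, min_rows=3, has_header=False):
    """Return chromosomes whose GC table has fewer than min_rows rows.

    Assumes tab-separated lines with the chromosome in the first column;
    this layout is an assumption, not taken from the ForestQC docs.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip a header row if the file has one
        line = line.rstrip("\n")
        if not line:
            continue
        counts[line.split("\t")[0]] += 1
    return sorted(chrom for chrom, n in counts.items() if n < min_rows)


# Hypothetical rows: chr1 has three windows, chr2 only two.
example = [
    "chr1\t0\t0.41\n",
    "chr1\t1000\t0.39\n",
    "chr1\t2000\t0.44\n",
    "chr2\t0\t0.40\n",
    "chr2\t1000\t0.42\n",
]
short = chromosomes_with_too_few_rows(example)
print(short)
```

For a real file, the lines could come from `open(path)`; any chromosome reported here would be expected to hit the same out-of-bounds lookup.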

Python version for ForestQC

Hi avallonking,

I was trying to install ForestQC using conda, as suggested in the wiki. The installation failed, and the package specifications show the following:

forestqc -> python[version ='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']

I have Python 3.8.
Does ForestQC only run on Python < 3.8? Is there any way around this besides downgrading my Python version?

Thank you.
regards
Smeeta

A problem with set_outlier

Hi, I have a question about set_outlier.
When I run set_outlier, I only get a result for the Outlier_GQ threshold; the Outlier_DP threshold is missing (NA). The VCF files come from GATK.
The result is:
Outlier_DP threshold:
Outlier_GQ threshold: 0

Multi-allelic variants may fail to split

I found that the split module misses some multi-allelic variants.

ForestQC stat -i vcf -o stat.tsv -c gc_content_hg19.tsv --dp 14 --gq 60
ForestQC split -i stat.tsv -o part.tsv
grep -w 9770690 stat.tsv bad.part.tsv good.part.tsv gray.part.tsv | cut -f 1-5

stat.tsv:chr1:9770690 chr1 9770690 CAG C
stat.tsv:chr1:9770690 chr1 9770690 C CAG
good.part.tsv:chr1:9770690 chr1 9770690 C CAG

As shown above, the number of variants does not match between the input and the outputs:
wc -l stat.tsv bad.part.tsv good.part.tsv gray.part.tsv

48008 stat.tsv
3289 bad.part.tsv
30420 good.part.tsv
14036 gray.part.tsv
95753 total

I would be very grateful for a helpful reply. Thank you again for providing this software.
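
In the grep output, both stat.tsv records share the same first-column value (chr1:9770690) even though their REF/ALT alleles differ. One possible explanation, which is an assumption and not confirmed by the source, is that rows are keyed on that column somewhere downstream, so only one of them survives. A minimal sketch for listing such repeated IDs, assuming a tab-separated file with one header line (the helper name and the header names are placeholders):

```python
from collections import Counter


def repeated_ids(lines, has_header=True):
    """Map each first-column ID that occurs more than once to its count."""
    counts = Counter()
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip the header row
        if line.strip():
            counts[line.split("\t", 1)[0]] += 1
    return {vid: n for vid, n in counts.items() if n > 1}


# Hypothetical rows modelled on the grep output above.
stat_rows = [
    "ID\tCHR\tPOS\tREF\tALT\n",  # header names are placeholders
    "chr1:9770690\tchr1\t9770690\tCAG\tC\n",
    "chr1:9770690\tchr1\t9770690\tC\tCAG\n",
    "chr1:9770700\tchr1\t9770700\tA\tG\n",
]
dups = repeated_ids(stat_rows)
print(dups)
```

Comparing the number of surplus rows found this way with the gap between stat.tsv and the three part files would show whether shared IDs account for the missing variants.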

Issue when using ped file

Hello, I am experiencing an error at the ForestQC stat stage.

Error message:

ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading files...
Traceback (most recent call last):
  File "/home/eduardo/anaconda3/envs/forestqc/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
  File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 129, in main_stat
    vcf_process(target_file, stat_file, gc_file, ped_file, discord_geno_dict, hwe_file, gender_file, dp, gq,
  File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/stat.py", line 34, in vcf_process
    male_list, female_list = getSexInfo(ped_file, gender_file)
  File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/vcf_stat.py", line 152, in getSexInfo
    assert len(male) > 1, 'There should be at least 2 males.'
AssertionError: There should be at least 2 males.

Is there any solution?
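
The assertion fails when fewer than two males are found while `getSexInfo` reads the ped and gender files. A minimal sketch for counting the sexes in a ped file before running ForestQC stat, assuming the conventional whitespace-separated layout with sex in the fifth column (1 = male, 2 = female). ForestQC's own parser may read the file differently, and the helper name and sample IDs are hypothetical.

```python
def count_sexes(ped_lines):
    """Count males and females in PED-style lines.

    Assumes the conventional layout: family ID, individual ID, father,
    mother, sex (1 = male, 2 = female), phenotype.
    """
    males = females = 0
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 5 or line.startswith("#"):
            continue  # skip blank, short, or comment lines
        if fields[4] == "1":
            males += 1
        elif fields[4] == "2":
            females += 1
    return males, females


# Hypothetical trio: one father, one mother, one daughter.
ped = [
    "FAM1 S1 0 0 1 -9",
    "FAM1 S2 0 0 2 -9",
    "FAM1 S3 S1 S2 2 -9",
]
males, females = count_sexes(ped)
print(males, females)
```

A result with fewer than two males, as in this example, would be consistent with the assertion message above.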

gender file function

Hello, I would like to know what the gender file is used for. For example, if my VCF data contains only autosomal chromosomes, do I still need to create a gender file?

set_outlier triggers "OverflowError: Python int too large to convert to C int" when -m is > 1G

When running ForestQC set_outlier -m 2G -i Test.vcf.gz, we're observing this error on both CentOS 8 and macOS:

Traceback (most recent call last):
  File "/Users/glb/miniconda3/envs/ForestQC/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
  File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/__main__.py", line 112, in main_set_outlier
    set_outlier(file_list, temp_dir, 'temp.external_sort.out', mem)
  File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 247, in set_outlier
    sorter.sort(filenames, temp_dir, outfilename)
  File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 219, in sort
    merger.merge(gq_block_filenames, os.path.join(temp_dir, 'gq.' + outfilename), gq_buffer_size)
  File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 186, in merge
    outfile = open(outfilename, 'w', buffer_size)
OverflowError: Python int too large to convert to C int

It looks as though the buffer_size integer passed to open() ends up too large to be converted to a C int when using (1) a large input file and/or (2) more than 1G of memory. This makes it difficult to run ForestQC set_outlier with large files, as we cannot allocate enough memory for it to complete in a reasonable timeframe. We get the same error with the Test.vcf.gz in the examples directory when allocating more than 1G.
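
The built-in `open()` takes its buffering argument as a C int, so on platforms where that type is 32 bits any value above 2**31 - 1 raises this OverflowError before the file is opened. A minimal sketch of capping the value first follows. The helper `clamp_buffer_size` is hypothetical, and how ForestQC derives `buffer_size` from `-m` is not shown in the traceback, so the 2 GiB figure is only an assumption.

```python
C_INT_MAX = 2**31 - 1  # upper limit of a 32-bit signed C int


def clamp_buffer_size(requested, limit=C_INT_MAX):
    """Return a buffer size that open() can accept (hypothetical helper)."""
    if requested < 1:
        raise ValueError("buffer size must be positive")
    return min(requested, limit)


two_gib = 2 * 1024**3  # what "-m 2G" might translate to (assumption)
safe = clamp_buffer_size(two_gib)
print(two_gib > C_INT_MAX, safe)  # prints: True 2147483647
```

In practice a much smaller `limit` (a few MiB, for example) may be preferable, since a write buffer close to 2 GiB per file is unlikely to speed up the merge.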

set_outlier "killed"

Hello, I have a multi-sample VCF of almost 50 GB (uncompressed) with 11,749,030 variants from exome sequencing. I want to run QC on all chromosomes (autosomes, chrX, chrY, and chrM). However, whenever I run set_outlier, even after trying many values of the -m parameter, the process is always killed. I have 27 GB of usable RAM. Is this because I do not have enough memory? If it is impossible to process all chromosomes together, is it possible to run the QC on each chromosome separately and merge the results at the end?

ForestQC split: AttributeError: 'DataFrame' object has no attribute 'append'

Hello, I am experiencing some errors in "ForestQC split":

ForestQC set_outlier -i test.vcf.gz -m 500M

ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Outlier_DP threshold: 20
Outlier_GQ threshold: 51

ForestQC stat -i test.vcf.gz -o example.result.tsv -d concordance_rate_SNP.txt.gz -c gc_content_hg19.chrX.tsv -as user_features.tsv -p PedStructSeqID.txt --dp 20.0 --gq 51.0

ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading files...
Computing...
Done.

ForestQC split -i example.result.tsv -as user_features.tsv -t user_thresholds.tsv

ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading data...
Data processing...

Current filter settings:

Good variants
----------------
Mendel_Error <= 0.04478
Missing_Rate < 0.11
HWE > 0.01
0.3 <= ABHet_deviation <= 0.7

Bad variants
----------------
Rare variants (MAF < 0.02):
        Mendel_Error > 0.004
        Missing_Rate > 0.2
        HWE < 0.005
        ABHet_deviation > 0.25

Common variants (MAF >= 0.02):
        Mendel_Error > 0.07463
        Missing_Rate > 0.25
        HWE < 0.0005
        ABHet_deviation > 0.3

Outlier variants
----------------
Rare variants (MAF < 0.02):
        Mendel_Error > 0.1194
        Missing_Rate > 0.08
        HWE < 0.002

Common variants (MAF >= 0.02):
        Mendel_Error > 0.14925
        Missing_Rate > 0.12
        HWE < 1e-08
Traceback (most recent call last):
  File "/root/mambaforge-pypy3/envs/forestqc/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
  File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 159, in main_split
    execute_split(input_file, output_file, model, user_feature_names, thresholds_setting)
  File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/data_preprocessing.py", line 260, in execute_split
    good, bad, grey = model_selection[model](data, thresholds_setting)
  File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/data_preprocessing.py", line 143, in separateDataB
    grey = variants.append([good, bad])
  File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/pandas/core/generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

Is there any solution for this?
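
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, with `pd.concat` as the documented replacement, so this traceback is consistent with an environment running pandas 2.x. For a list argument, `variants.append([good, bad])` corresponds to `pd.concat([variants, good, bad])`. A minimal sketch for checking which side of that change an environment is on (the helper names are hypothetical, and whether an older pandas resolves the issue for ForestQC is untested here):

```python
from importlib import metadata


def has_dataframe_append(pandas_version):
    """True if this pandas version still ships DataFrame.append."""
    major = int(pandas_version.split(".")[0])
    return major < 2


def installed_pandas_version():
    """Return the installed pandas version string, or None if absent."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


version = installed_pandas_version()
if version is None:
    print("pandas is not installed in this environment")
else:
    print(version, "has DataFrame.append:", has_dataframe_append(version))
```

Running this inside the same conda environment as ForestQC shows whether the installed pandas predates the removal.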

Unable to calculate GQ outlier in freebayes VCF

Hello,
I was testing QC with ForestQC on a VCF created by freebayes. However, the GQ outlier threshold came out blank. I then found that the VCF does not contain GQ information. Does anyone know how to annotate this VCF with GQ information? I tried VariantAnnotator (GATK), but I could not find the right parameter for GQ.
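
A VCF declares its per-sample fields in `##FORMAT=<ID=...>` meta-information lines, so a missing GQ (or DP) can be detected before running set_outlier. A minimal sketch, with a hypothetical helper name and a made-up header:

```python
import re

FORMAT_ID = re.compile(r"^##FORMAT=<ID=([^,>]+)")


def declared_format_ids(vcf_lines):
    """Collect FORMAT IDs declared in the VCF meta-information lines."""
    ids = set()
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM line
        match = FORMAT_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


# Hypothetical header with GT and DP but no GQ.
header = [
    "##fileformat=VCFv4.2\n",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
]
ids = declared_format_ids(header)
print("GQ" in ids, "DP" in ids)
```

For a compressed file, the lines can be read with `gzip.open(path, "rt")` from the standard library.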

empty good, bad, gray files after split

Hi,

I ran your test code to check that ForestQC was installed correctly, and it worked. Next, I tried one of my own VCF files to see whether ForestQC would work before moving to a larger set. I used only the required arguments for each ForestQC command and none of the optional ones. Computing outliers, running ForestQC stat, and compute_gc all worked. However, after running split I end up with empty good, bad, and gray files. Do you have any suggestions as to why this might happen? Thank you!

Best,
Kristofer
