avallonking / forestqc
Quality control on genetic variants from next-generation sequencing data using random forest
License: MIT License
Hello, I noticed an error when using ForestQC stat:
ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading files...
Computing...
Traceback (most recent call last):
File "/root/miniconda3/bin/ForestQC", line 33, in <module>
sys.exit(load_entry_point('ForestQC==1.1.5.4', 'console_scripts', 'ForestQC')())
File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
command_functions[command](**args)
File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 129, in main_stat
vcf_process(target_file, stat_file, gc_file, ped_file, discord_geno_dict, hwe_file, gender_file, dp, gq,
File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/stat.py", line 63, in vcf_process
gc = getGC(pos, gc_table_by_chr[chr])
File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/vcf_stat.py", line 99, in getGC
step = gc_table.iloc[2,1] - gc_table.iloc[1,1]
File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1506, in _getitem_tuple
self._has_valid_tuple(tup)
File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 754, in _has_valid_tuple
self._validate_key(k, i)
File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1409, in _validate_key
self._validate_integer(key, axis)
File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
Is there any solution for this?
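The failing line, `gc_table.iloc[2,1] - gc_table.iloc[1,1]`, reads the second and third rows of the per-chromosome GC table, so the IndexError suggests that at least one chromosome in the GC content file has fewer than three rows (or that the table has fewer than two columns). A minimal sketch for checking the row counts, assuming a tab-separated file with the chromosome name in the first column; the column layout and the helper name `chromosomes_with_few_rows` are assumptions, not part of ForestQC:

```python
import csv
from collections import Counter


def chromosomes_with_few_rows(path, min_rows=3, has_header=False):
    """Return {chromosome: row_count} for chromosomes with fewer than min_rows rows.

    Assumption: the file is tab-separated and its first column holds the
    chromosome name. This layout is a guess, not taken from ForestQC.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if row:  # ignore blank lines
                counts[row[0]] += 1
    return {chrom: n for chrom, n in counts.items() if n < min_rows}
```

Any chromosome reported here would be a candidate for the out-of-bounds access; regenerating the GC file with a smaller window for short contigs, or removing those contigs from the VCF, are possible but untested workarounds.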
Hi avallonking,
I was trying to install ForestQC using conda, as suggested in the wiki. The installation failed, and the reported specification conflict shows the following:
forestqc -> python[version ='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']
I have Python 3.8.
Does this mean ForestQC can only run on Python < 3.8? Is there any way around this besides downgrading my Python version?
Thank you.
regards
Smeeta
Hi, I have a question about set_outlier.
When I perform set_outlier, I only get a result for the Outlier_GQ threshold; the Outlier_DP threshold is NA. The VCF files come from GATK.
The result is:
Outlier_DP threshold:
Outlier_GQ threshold: 0
I found that the split module drops some variants at multi-allelic sites.
ForestQC stat -i vcf -o stat.tsv -c gc_content_hg19.tsv --dp 14 --gq 60
ForestQC split -i stat.tsv -o part.tsv
grep -w 9770690 stat.tsv bad.part.tsv good.part.tsv gray.part.tsv | cut -f 1-5
stat.tsv:chr1:9770690 chr1 9770690 CAG C
stat.tsv:chr1:9770690 chr1 9770690 C CAG
good.part.tsv:chr1:9770690 chr1 9770690 C CAG
As shown above, the number of variants in stat.tsv does not match the total across the split files.
wc -l stat.tsv bad.part.tsv good.part.tsv gray.part.tsv
48008 stat.tsv
3289 bad.part.tsv
30420 good.part.tsv
14036 gray.part.tsv
95753 total
I would be very grateful for a helpful reply. Thank you again for the software you provided.
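In the `grep` output above, both alleles at chr1:9770690 share the same value in the first column of stat.tsv, and only one of them appears in the split output. One way to see how many rows are affected is to list first-column values that occur more than once. A minimal sketch, assuming stat.tsv is tab-separated with a header row and with the identifier, chromosome, position, REF, and ALT in the first five columns, as the `cut -f 1-5` output suggests. The helper name `shared_ids` is hypothetical, and whether shared identifiers actually cause the missing rows is not confirmed here:

```python
import csv
from collections import defaultdict


def shared_ids(path, has_header=True):
    """Return {identifier: [(ref, alt), ...]} for identifiers used by several rows.

    Assumption: tab-separated input with the identifier in column 1 and
    REF/ALT in columns 4 and 5.
    """
    alleles = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if len(row) >= 5:
                alleles[row[0]].append((row[3], row[4]))
    return {vid: pairs for vid, pairs in alleles.items() if len(pairs) > 1}
```

Comparing the number of extra rows found this way with the gap between stat.tsv and the three split files would show whether shared identifiers account for the difference.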
Hello, I am experiencing an error at the ForestQC stat stage.
Error message:
ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading files...
Traceback (most recent call last):
File "/home/eduardo/anaconda3/envs/forestqc/bin/ForestQC", line 33, in <module>
sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
command_functions[command](**args)
File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 129, in main_stat
vcf_process(target_file, stat_file, gc_file, ped_file, discord_geno_dict, hwe_file, gender_file, dp, gq,
File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/stat.py", line 34, in vcf_process
male_list, female_list = getSexInfo(ped_file, gender_file)
File "/home/eduardo/anaconda3/envs/forestqc/lib/python3.9/site-packages/ForestQC/vcf_stat.py", line 152, in getSexInfo
assert len(male) > 1, 'There should be at least 2 males.'
AssertionError: There should be at least 2 males.
Is there any solution?
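The assertion is raised in `getSexInfo`, which receives the pedigree and gender files, so it may help to check how many males the input actually declares. A minimal sketch, assuming a standard whitespace-delimited six-column PED file (family ID, individual ID, father, mother, sex, phenotype) with sex coded 1 for male and 2 for female. The format ForestQC expects for its gender file may differ, and `count_sexes` is a hypothetical helper:

```python
from collections import Counter


def count_sexes(ped_path):
    """Count the sex codes found in the fifth column of a PED file.

    Assumption: standard PED layout, where "1" means male and "2" means female.
    """
    counts = Counter()
    with open(ped_path) as handle:
        for line in handle:
            fields = line.split()
            if line.startswith("#") or len(fields) < 5:
                continue  # skip comments and malformed lines
            counts[fields[4]] += 1
    return counts
```

If `count_sexes("PedStructSeqID.txt")["1"]` is below 2, the assertion shown in the traceback would be expected to fire.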
Hello, I would like to know the function of the gender file. For example, if I have only autosomal chromosomes in VCF data, will it be necessary to create a gender file?
When running ForestQC set_outlier -m 2G -i Test.vcf.gz, we're observing this error on both CentOS 8 and macOS:
Traceback (most recent call last):
File "/Users/glb/miniconda3/envs/ForestQC/bin/ForestQC", line 33, in <module>
sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/__main__.py", line 201, in main
command_functions[command](**args)
File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/__main__.py", line 112, in main_set_outlier
set_outlier(file_list, temp_dir, 'temp.external_sort.out', mem)
File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 247, in set_outlier
sorter.sort(filenames, temp_dir, outfilename)
File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 219, in sort
merger.merge(gq_block_filenames, os.path.join(temp_dir, 'gq.' + outfilename), gq_buffer_size)
File "/Users/glb/miniconda3/envs/ForestQC/lib/python3.8/site-packages/ForestQC/setOutlier.py", line 186, in merge
outfile = open(outfilename, 'w', buffer_size)
OverflowError: Python int too large to convert to C int
It looks as though the buffer_size integer ends up being too large to convert to a C int when using 1) a large input file and/or 2) more than 1G of memory. This makes it difficult to run ForestQC set_outlier with large files, as we cannot allocate enough memory for it to complete in a reasonable timeframe. We also get the same error with the Test.vcf.gz in the examples directory when allocating more than 1G.
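The traceback ends at `open(outfilename, 'w', buffer_size)`. In CPython the `buffering` argument of `open` has to fit in a C `int`, so a value above 2**31 - 1 raises this OverflowError. A minimal sketch of one possible workaround is to clamp the value before opening the file. The helper name `clamp_buffer_size` and the idea of patching `setOutlier.py` this way are assumptions, not a fix confirmed by the maintainers:

```python
C_INT_MAX = 2**31 - 1  # largest value a 32-bit C int can hold


def clamp_buffer_size(requested, limit=C_INT_MAX):
    """Return the requested buffer size, capped at limit so open() accepts it."""
    return min(int(requested), limit)


# Hypothetical use at the failing call site:
#   outfile = open(outfilename, 'w', clamp_buffer_size(buffer_size))
```

A smaller `limit` can be passed if a buffer close to 2 GiB is undesirable.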
Hello, I have a multisample VCF of almost 50 GB (uncompressed) with 11749030 variants from exome sequencing. I want to perform QC on all chromosomes (autosomes, chrX, chrY, and chrM). However, when I tried computing the outliers (with many different values of the -m parameter), the process was always killed. I have 27 GB of usable RAM. Is this because I do not have enough RAM? If it is impossible to process all chromosomes together, is it possible to run the QC for each chromosome separately and then join the results?
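For the per-chromosome route, the input can be split in a single streaming pass without loading it into memory. A minimal sketch using only the standard library; the function name `split_vcf_by_chromosome` is hypothetical. Whether per-chromosome ForestQC results can be validly combined is a question for the maintainers, since outlier thresholds computed per chromosome may differ from thresholds computed on the full data set:

```python
import gzip
import os


def split_vcf_by_chromosome(vcf_path, out_dir):
    """Stream a VCF and write one uncompressed VCF per chromosome into out_dir.

    Returns the sorted list of chromosome names that were written.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    os.makedirs(out_dir, exist_ok=True)
    header = []   # header lines, repeated at the top of every output file
    handles = {}  # chromosome name -> open output file
    try:
        with opener(vcf_path, "rt") as src:
            for line in src:
                if line.startswith("#"):
                    header.append(line)
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    out = open(os.path.join(out_dir, chrom + ".vcf"), "w")
                    out.writelines(header)
                    handles[chrom] = out
                handles[chrom].write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

The sketch uses the chromosome name directly as a file name, so contig names containing characters that are invalid in file names would need to be sanitized first.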
Hello, I am experiencing some errors in "ForestQC split":
ForestQC set_outlier -i test.vcf.gz -m 500M
ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Outlier_DP threshold: 20
Outlier_GQ threshold: 51
ForestQC stat -i test.vcf.gz -o example.result.tsv -d concordance_rate_SNP.txt.gz -c gc_content_hg19.chrX.tsv -as user_features.tsv -p PedStructSeqID.txt --dp 20.0 --gq 51.0
ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading files...
Computing...
Done.
ForestQC split -i example.result.tsv -as user_features.tsv -t user_thresholds.tsv
ForestQC v1.1.5.7 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading data...
Data processing...
Current filter settings:
Good variants
----------------
Mendel_Error <= 0.04478
Missing_Rate < 0.11
HWE > 0.01
0.3 <= ABHet_deviation <= 0.7
Bad variants
----------------
Rare variants (MAF < 0.02):
Mendel_Error > 0.004
Missing_Rate > 0.2
HWE < 0.005
ABHet_deviation > 0.25
Common variants (MAF >= 0.02):
Mendel_Error > 0.07463
Missing_Rate > 0.25
HWE < 0.0005
ABHet_deviation > 0.3
Outlier variants
----------------
Rare variants (MAF < 0.02):
Mendel_Error > 0.1194
Missing_Rate > 0.08
HWE < 0.002
Common variants (MAF >= 0.02):
Mendel_Error > 0.14925
Missing_Rate > 0.12
HWE < 1e-08
Traceback (most recent call last):
File "/root/mambaforge-pypy3/envs/forestqc/bin/ForestQC", line 33, in <module>
sys.exit(load_entry_point('ForestQC==1.1.5.7', 'console_scripts', 'ForestQC')())
File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
command_functions[command](**args)
File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/__main__.py", line 159, in main_split
execute_split(input_file, output_file, model, user_feature_names, thresholds_setting)
File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/data_preprocessing.py", line 260, in execute_split
good, bad, grey = model_selection[model](data, thresholds_setting)
File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/ForestQC/data_preprocessing.py", line 143, in separateDataB
grey = variants.append([good, bad])
File "/root/mambaforge-pypy3/envs/forestqc/lib/python3.9/site-packages/pandas/core/generic.py", line 6204, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
Is there any solution for this?
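`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, which would explain this AttributeError on a recent pandas. The documented replacement is `pandas.concat`. A minimal sketch of the equivalent call, using made-up data frames named after the variables in the traceback. Editing `data_preprocessing.py` this way, or instead installing a pandas version below 2.0, are untested suggestions rather than fixes confirmed by the maintainers:

```python
import pandas as pd

# Made-up stand-ins for the data frames named in the traceback.
variants = pd.DataFrame({"RSID": ["v1", "v2", "v3"]})
good = variants.iloc[[0]]
bad = variants.iloc[[1]]

# Old call, which fails on pandas >= 2.0:
#   grey = variants.append([good, bad])
# Equivalent call with pandas.concat (stacks the three frames row-wise):
grey = pd.concat([variants, good, bad])
```

The surrounding ForestQC logic is not shown in the traceback, so any other `.append` calls on data frames in the package would need the same change.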
Hello,
I was testing QC with ForestQC on a VCF created by freebayes. However, the GQ outlier threshold came out blank. I then found that the VCF does not have GQ information. Does anyone know how to annotate this VCF with GQ information? I tried VariantAnnotator (GATK), but I could not find the correct parameter for GQ.
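Before annotating, it can be useful to confirm which per-sample fields the VCF actually carries. In the VCF format, the ninth column (FORMAT) lists the keys, such as GT, DP, and GQ, that are present for each sample. A minimal sketch that counts how many records include a given key; the helper name `count_format_key` is hypothetical:

```python
import gzip


def count_format_key(vcf_path, key="GQ"):
    """Return (records_with_key, total_records) for a plain or gzipped VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with_key = total = 0
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # no FORMAT column (sites-only record)
            total += 1
            if key in fields[8].split(":"):
                with_key += 1
    return with_key, total
```

The same check with `key="DP"` applies when the Outlier_DP threshold is reported as empty.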
Hi,
I ran your test code to check that ForestQC was installed successfully, and it worked. Next, I tried one of my own VCF files to see whether ForestQC would work before I use a larger set. I used only the required arguments for the ForestQC commands and none of the optional ones. Computing outliers, running ForestQC stat, and compute_gc all worked. However, after running split I end up with empty good, bad, and gray files. Do you have any suggestions as to why this might be the case? Thank you!
Best,
Kristofer
Hello, I tried to change filter settings in the ForestQC split using the "user_thresholds.tsv" file. However, I could not change all the filters specified in the file. Is there any solution?
Adding a line that skips cases where GT is NA solves this; without it, the program runs into an error and stops.