
Clair: Exploring the limit of using deep neural network on pileup data for germline variant calling

License: BSD 3-Clause "New" or "Revised" License

variant-calling deep-learning bioinformatics computational-biology

clair's Introduction

Clair - deep neural network based variant caller

Contact: Ruibang Luo
Email: [email protected]


Introduction

Clair3, released in May 2021, is the successor of Clair. Please try out Clair3 at https://github.com/HKU-BAL/Clair3.

Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited the new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single-molecule sequencing data. For ONT data, Clair achieves the best precision, recall, and speed compared with several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks.

This is the formal release of Clair (Clair v2, Dec 2019). You can find the experimental Clair v1 (Jan 2019) at https://github.com/aquaskyline/Clair. The preprint of Clair v2 is available on bioRxiv.


Contents


What are we working on right now?

  • A full alignment representation for higher performance in low-complexity genomic regions.
  • Testing small techniques to resolve complex variants, e.g. a deletion that spans a closely following SNP.

What's new?

  • 20200831
    • added support for the alternative allele "*". GetTruth.py now requires a reference genome as input. You don't need to change your usage if you use callVarBam.py for automatic script generation.
  • 20200416
    • added two new options for haploid calling, --haploid_precision and --haploid_sensitive (in #24)
    • added a simple post-calling solution to handle overlapping variants (in #15)
    • fixed haploid GT output (in #17)
  • 20200309
    • an ONT model trained with up to 578-fold coverage HG002 data from the Human Pangenome Reference Consortium is now available in Pretrained Models. The accompanying test results are biased (i.e., the testing samples were included in training), so they are not for benchmarking but suggest the performance cap of each model at different coverages. The new model shows significantly improved performance at high coverages.


Installation

Option 1. Bioconda

# make sure channels are added in conda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment named "clair-env"
conda create -n clair-env -c bioconda clair
conda activate clair-env

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

The conda environment has the Pypy3 interpreter installed, but one Pypy3 package, intervaltree, is still missing. It is not installed by default because it is not yet available in any conda repository. To install the package for Pypy3, after activating the conda environment, please run the following commands:

pypy3 -m ensurepip
pypy3 -m pip install --no-cache-dir intervaltree==3.0.2

Then download the trained models:

# download the trained model for ONT
mkdir ont && cd ont
wget http://www.bio8.cs.hku.hk/clair_models/ont/122HD34.tar
tar -xf 122HD34.tar
cd ../

# download the trained model for PacBio CCS
mkdir pacbio && cd pacbio
wget http://www.bio8.cs.hku.hk/clair_models/pacbio/ccs/15.tar
tar -xf 15.tar
cd ../

# download the trained model for Illumina
mkdir illumina && cd illumina
wget http://www.bio8.cs.hku.hk/clair_models/illumina/12345.tar
tar -xf 12345.tar
cd ../

Option 2. Build an anaconda virtual environment step by step

Please install anaconda using the installation guide at https://docs.anaconda.com/anaconda/install/

# create and activate the environment named clair
conda create -n clair python=3.7
conda activate clair

# install pypy and packages in the clair environment
conda install -c conda-forge pypy3.6
pypy3 -m ensurepip
pypy3 -m pip install intervaltree==3.0.2

# install python packages on clair environment
pip install numpy==1.18.0 blosc==1.8.3 intervaltree==3.0.2 tensorflow==1.13.2 pysam==0.15.3 matplotlib==3.1.2
conda install -c anaconda pigz==2.4
conda install -c conda-forge parallel=20191122 zstd=1.4.4
conda install -c bioconda samtools=1.10 vcflib=1.0.0 bcftools=1.10.2

# clone Clair
git clone --depth 1 https://github.com/HKU-BAL/Clair.git
cd Clair
chmod +x clair.py
export PATH=`pwd`":$PATH"

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

Then download the trained models as described in Installation - Option 1.

Option 3. Docker

# clone Clair
git clone --depth 1 https://github.com/HKU-BAL/Clair.git
cd Clair

# build a docker image named clair_docker_image
docker build -f ./Dockerfile -t clair_docker_image . # You might need root privilege

# run docker image
docker run -it clair_docker_image # You might need root privilege

# store clair.py PATH into $CLAIR variable
CLAIR=`which clair.py`

# run clair like this afterwards
python $CLAIR --help

Then download the trained models as described in Installation - Option 1.
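
When running Clair inside the container, the downloaded models and your input data live on the host, so they need to be mounted into the container. A minimal sketch, assuming the models were downloaded to /path/to/models and the BAM/FASTA files are under /path/to/data (both paths are illustrative):

# mount the host directories into the container (paths are illustrative)
docker run -it \
  -v /path/to/models:/models \
  -v /path/to/data:/data \
  clair_docker_image

# inside the container, point --chkpnt_fn at the mounted model prefix, e.g. /models/ont/model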

After Installation

To check the version of Tensorflow you have installed:

python -c 'import tensorflow as tf; print(tf.__version__)'

To do variant calling using trained models, a CPU will suffice. Clair uses 4 threads by default in callVarBam; the number of threads can be controlled with the --threads parameter. To train a new model, a high-end GPU and the GPU version of TensorFlow are needed. To install the GPU version of TensorFlow:

pip install tensorflow-gpu==1.13.2

The installation of the blosc library might fail if your CPU doesn't support the AVX2 instruction set. Alternatively, you can compile and install it from the latest source code available on GitHub with the DISABLE_BLOSC_AVX2 environment variable set.
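
A minimal sketch of such a source build, assuming the python-blosc sources honour the DISABLE_BLOSC_AVX2 variable at build time (the exact steps may differ; please check the python-blosc documentation):

# build and install python-blosc from source with AVX2 disabled (sketch; exact steps may differ)
git clone https://github.com/Blosc/python-blosc.git
cd python-blosc
git submodule update --init --recursive   # pull the bundled c-blosc sources
DISABLE_BLOSC_AVX2=1 pip install .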


Quick demo

conda activate clair-env
mkdir clairDemo
cd clairDemo
wget 'http://www.bio8.cs.hku.hk/clair_models/demo/clairDemo.sh'
bash clairDemo.sh
  • Check the results using less -S ./training/chr21.vcf

Usage

General usage

CLAIR="[PATH_TO_CLAIR]/clair.py"

# to run a submodule using python
python $CLAIR [submodule] [options]

# to run a Pypy-able submodule using pypy (if `pypy3` is the executable command for Pypy)
pypy3 $CLAIR [submodule] [options]

Setup variables for variant calling commands afterwards

CLAIR="[PATH_TO_CLAIR]/clair.py"                         # e.g. clair.py
MODEL="[MODEL_PATH]"                                     # e.g. [PATH_TO_CLAIR_MODEL]/ont/model
BAM_FILE_PATH="[YOUR_BAM_FILE]"                          # e.g. chr21.bam
REFERENCE_FASTA_FILE_PATH="[YOUR_REFERENCE_FASTA_FILE]"  # e.g. chr21.fa
KNOWN_VARIANTS_VCF="[YOUR_VCF_FILE]"                     # e.g. chr21.vcf

Notes

  • Each model has three files: model.data-00000-of-00001, model.index, and model.meta. Set the MODEL variable to the path prefix model (i.e., the path up to and including model), for example:
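
For example, assuming the ONT model was extracted into [PATH_TO_CLAIR_MODEL]/ont/ as in the download step above (the path is a placeholder):

MODEL="[PATH_TO_CLAIR_MODEL]/ont/model"
ls "${MODEL}".*   # should list model.data-00000-of-00001, model.index and model.meta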

Call variants at known variant sites or in a chromosome (using callVarBam)

For whole-genome variant calling, please use callVarBamParallel to generate multiple commands that invoke callVarBam on smaller chromosome chunks.

Call variants in a chromosome

# variables
VARIANT_CALLING_OUTPUT_PATH="[YOUR_OUTPUT_PATH]"         # e.g. calls/chr21.vcf (please make sure the directory exists)
CONTIG_NAME="[CONTIG_NAME_FOR_VARIANT_CALLING]"          # e.g. chr21
SAMPLE_NAME="[SAMPLE_NAME]"                              # e.g. HG001

python $CLAIR callVarBam \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--ctgName "$CONTIG_NAME" \
--sampleName "$SAMPLE_NAME" \
--call_fn "$VARIANT_CALLING_OUTPUT_PATH"

cd "$VARIANT_CALLING_OUTPUT_PATH"

Call variants at known variant sites in a chromosome

# variables
VARIANT_CALLING_OUTPUT_PATH="[YOUR_OUTPUT_PATH]"         # e.g. calls/chr21.vcf (please make sure the directory exists)
CONTIG_NAME="[CONTIG_NAME_FOR_VARIANT_CALLING]"          # e.g. chr21
SAMPLE_NAME="[SAMPLE_NAME]"                              # e.g. HG001
KNOWN_VARIANTS_VCF="[YOUR_VCF_PATH]"                     # e.g. chr21_candidates.vcf

python $CLAIR callVarBam \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--ctgName "$CONTIG_NAME" \
--sampleName "$SAMPLE_NAME" \
--vcf_fn "$KNOWN_VARIANTS_VCF" \
--call_fn "$VARIANT_CALLING_OUTPUT_PATH"

cd "$VARIANT_CALLING_OUTPUT_PATH"

Call whole-genome variants in parallel (using callVarBamParallel)

# variables
SAMPLE_NAME="NA12878"
OUTPUT_PREFIX="call/var"                        # please make sure the call/ directory exists

# create command.sh for running the jobs in parallel
python $CLAIR callVarBamParallel \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--threshold 0.2 \
--sampleName "$SAMPLE_NAME" \
--output_prefix "$OUTPUT_PREFIX" > command.sh

# disable GPU if you have one installed
export CUDA_VISIBLE_DEVICES=""

# run Clair with 4 concurrencies
cat command.sh | parallel -j4

# Find incomplete VCF files and rerun them
for i in ${OUTPUT_PREFIX}.*.vcf; do if ! [ -z "$(tail -c 1 "$i")" ]; then echo "$i"; fi ; done | grep -f - command.sh | sh

# concatenate vcf files and sort the variants called
vcfcat ${OUTPUT_PREFIX}.*.vcf | bcftools sort -m 2G | bgziptabix snp_and_indel.vcf.gz

Notes

Parallelization
  • callVarBamParallel generates a file of callVarBam commands that can be run in parallel.
  • Use GNU parallel to run commands in parallel - parallel -j4 runs four jobs concurrently using GNU parallel. We suggest using half the number of available CPU cores.
  • An alternative to GNU parallel - If GNU parallel is not installed, please try awk '{print "\""$0"\""}' command.sh | xargs -P4 -L1 sh -c
Options
  • Haploid Precision Mode - Use --haploid_precision option for haploid samples
    (output homozygous variants only).
  • Haploid Sensitive Mode - Use --haploid_sensitive option for haploid samples
    (output all variants except variants with genotype 1/2).
  • Choosing genome sequences and positions for variant calling - By default, callVarBamParallel generates commands for chromosomes {1..22},X,Y (with or without the "chr" prefix). To call variants in other sequences, either provide your own BED file via the --bed_fn option, with three columns giving the target sequence name, start position and end position, or use the --includingAllContigs option to include all sequences in the input FASTA file. If you work on a non-human sample, please always use a BED file or the --includingAllContigs option to define the sequences you want Clair to work on (see the example BED file below).
  • For more accurate Indel calling - You may consider using the --pysam_for_all_indel_bases option for more accurate Indel results. On Illumina and PacBio CCS data, the option increases the running time by 20% to 50%. On ONT data, Clair can run up to ten times slower, while the improvement in accuracy is not significant.
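
A minimal example of the three-column BED file mentioned above (tab-separated sequence name, 0-based start, end); the sequence names and coordinates here are illustrative only:

# create regions.bed with two illustrative regions
printf 'chr21\t0\t46709983\nchr22\t0\t50818468\n' > regions.bed

# then pass it to callVarBamParallel with --bed_fn regions.bed
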
Other considerations
  • Setting an appropriate allele frequency cutoff - Please refer to About Setting the Alternative Allele Frequency Cutoff
  • Check for incomplete (unfinished) VCF files - Incomplete VCF files happen when 'out of memory' or other errors occur. The command in the example checks for a newline at the end of each VCF file and reruns the commands for the incomplete ones.
  • Disabling GPU: Clair uses CPU for variant calling - To prevent the TensorFlow library from using a GPU, set CUDA_VISIBLE_DEVICES="" to make GPUs invisible to Clair so it will only use the CPU for variant calling. Note that unless you want to run command.sh serially, you cannot use a GPU, because one running copy of Clair will occupy all the available memory of a GPU. Since the bottleneck of callVarBam is the CreateTensor script, which runs only on CPU, the effect of GPU acceleration is insignificant (roughly 15% faster). If you have multiple GPU cards in your system and want to use them for variant calling, split command.sh into parts and run each part after first exporting CUDA_VISIBLE_DEVICES="$i", where $i is an integer from 0 identifying the GPU to be used (see the sketch at the end of these notes).
  • Concatenating results - vcfcat and bgziptabix commands are from vcflib, and are installed by default.
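
A minimal sketch of the multi-GPU splitting described above, assuming command.sh was generated by callVarBamParallel and two GPUs (IDs 0 and 1) are available:

# split command.sh into two parts without breaking lines (GNU coreutils split)
split -n l/2 command.sh gpu_part_          # creates gpu_part_aa and gpu_part_ab

# run each part on its own GPU
(export CUDA_VISIBLE_DEVICES="0"; bash gpu_part_aa) &
(export CUDA_VISIBLE_DEVICES="1"; bash gpu_part_ab) &
wait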

Submodule Descriptions

Submodules in clair/ are for variant calling and model training. Submodules in dataPrepScripts are for data preparation.

For the submodules listed below, use the -h or --help option to list the available options.

clair/ (Note: submodules under this folder are Pypy incompatible; run them with Python)
  • call_var - Call variants using candidate variant tensors.
  • callVarBam - Call variants directly from a BAM file.
  • callVarBamParallel - Generate callVarBam commands that can be run in parallel. A BED file is required to specify the regions for variant calling. --refChunkSize sets the genome chunk size per job.
  • evaluate - Evaluate a model.
  • plot_tensor - Create high-resolution PNG figures to visualize an input tensor.
  • train - Train a model using adaptive learning rate decay. By default, the learning rate decays three times. Inputting a binary tensor file created by Tensor2Bin is highly recommended.
  • train_clr - Train a model using Cyclical Learning Rate (CLR).

dataPrepScripts/ (Note: submodules under this folder are Pypy compatible unless specified otherwise)
  • ExtractVariantCandidates - Extract the positions of variant candidates.
    Input: BAM; reference FASTA.
    Important options:
    --threshold "Minimum alternative allele frequency to report a candidate"
    --minCoverage "Minimum coverage to report a candidate"
  • GetTruth - Extract the variants from a truth VCF.
    Input: VCF; reference FASTA if the VCF contains asterisks in the ALT field.
  • CreateTensor - Create tensors for candidates or truth variants.
    Input: a candidate list; BAM; reference FASTA.
  • PairWithNonVariants - Pair truth variant tensors with non-variant tensors.
    Input: truth variant tensors; candidate variant tensors.
    Important option: --amp x "1-time truth variants + x-time non-variants".
  • Tensor2Bin - Create a compressed binary tensor file to facilitate and speed up future usage. (Pypy incompatible)
    Input: mixed tensors from PairWithNonVariants; truth variants from GetTruth; a BED file marking the high-confidence regions in the reference genome.
  • CombineBins - Merge smaller bins from Tensor2Bin into a complete larger bin. (Pypy incompatible)
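
As a minimal sketch of running a Pypy-able data preparation submodule, assuming $CLAIR, $BAM_FILE_PATH and $REFERENCE_FASTA_FILE_PATH are set as in the Usage section (the output file name is illustrative):

# extract candidate variant positions on chr21 using the Pypy interpreter
pypy3 $CLAIR ExtractVariantCandidates \
--bam_fn "$BAM_FILE_PATH" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--ctgName chr21 \
--threshold 0.2 \
--minCoverage 4 \
--can_fn chr21_candidates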

Pretrained Models

Please download models from here or click on the links below.

Folder     | Tech              | Suggested | Sample used      | Aligner   | Download
illumina   | Illumina          | *         | HG001,2,3,4,5    | Novoalign | Download
pacbio/ccs | PacBio CCS (HiFi) | *         | HG001,5          | Minimap2  | Download
ont        | ONT R9.4.1        |           | HG001,2          | Minimap2  | Download
ont        | ONT R9.4.1        |           | HG001,2,3,4      | Minimap2  | Download
ont        | ONT R9.4.1        | *         | HG001,2,2HD,3,4  | Minimap2  | Download

Advanced Guides

About Setting the Alternative Allele Frequency Cutoff

Different from model training, in which all genome positions are candidates but are randomly subsampled for training, variant calling using a trained model requires the user to define a minimum alternative allele frequency cutoff for a genome position to be considered as a candidate for variant calling. For all sequencing technologies, the lower the cutoff, the lower the speed. Setting the cutoff too low will increase the false positive rate significantly, while setting it too high will increase the false negative rate significantly.
The option --threshold controls the cutoff in the submodules callVarBam, callVarBamParallel and ExtractVariantCandidates. The suggested cutoffs for different sequencing technologies are listed below. A higher cutoff will increase the accuracy on datasets with poor sequencing quality, while a lower cutoff will increase the sensitivity in applications like clinical research. Setting a lower cutoff and further filtering the variants by their quality is also good practice.

Sequencing Technology | Alt. AF Cutoff
Illumina              | 0.1
PacBio CCS            | 0.2
ONT                   | 0.2
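
For example, to apply the suggested Illumina cutoff when calling a chromosome with callVarBam (variables as defined in the Usage section above):

python $CLAIR callVarBam \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--ctgName "$CONTIG_NAME" \
--threshold 0.1 \
--call_fn "$VARIANT_CALLING_OUTPUT_PATH"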

Variant quality cutoff selection

ONT data

The variant quality distribution of Clair on ONT data is usually bimodal. The best quality cutoff is usually the valley between the two peaks plus 50. The image below shows the quality distribution of the variants in HG002 called using ~50-fold coverage ONT data; the best quality cutoff is 748.
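
A minimal sketch for inspecting the quality distribution and applying a cutoff, assuming a bgzipped call set named snp_and_indel.vcf.gz as produced in the parallel calling example above (the cutoff 748 comes from the ONT example; substitute your own):

# rough text histogram of variant qualities (bin width 50) to locate the valley between the two peaks
zcat snp_and_indel.vcf.gz | grep -v '^#' | cut -f6 | \
awk '{ bin = int($1/50)*50; count[bin]++ } END { for (b in count) print b, count[b] }' | sort -n

# keep only variants at or above the chosen quality cutoff
bcftools view -i 'QUAL>=748' snp_and_indel.vcf.gz -O z -o snp_and_indel.pass.vcf.gz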

PacBio CCS data

The image below shows the quality distribution of the variants in HG005 called using ~30-fold coverage PacBio CCS data. The best quality cutoff is 143.

Illumina data

The image below shows the quality distribution of the variants in HG002 called using ~60-fold coverage Illumina data. The best quality cutoff is 113.

Clair uses PyPy for speedup

Without any change to the code, using the PyPy Python interpreter on the TensorFlow-independent modules such as ExtractVariantCandidates and CreateTensor gives a 5-10 times speedup. The PyPy interpreter can be installed via apt-get, yum, Homebrew, MacPorts, etc. If you have no root access to your system, the official PyPy website provides a portable binary distribution for Linux. Besides the conda installation method in Installation, the following is a rundown extracted from PyPy's website (PyPy3.6 v7.2.0 in this case) on how to install the binaries.

wget https://github.com/squeaky-pl/portable-pypy/releases/download/pypy3.6-7.2.0/pypy3.6-7.2.0-linux_x86_64-portable.tar.bz2
tar -jxf pypy3.6-7.2.0-linux_x86_64-portable.tar.bz2
cd pypy3.6-7.2.0-linux_x86_64-portable/bin
./pypy3 -m pip install -U pip wheel intervaltree
# Use pypy3 as an in-place substitute for python to run pypy-able scripts

To guarantee a good user experience (good speed), PyPy must be installed to run callVarBam (call variants from a BAM) and callVarBamParallel (generate parallelizable commands to run callVarBam). TensorFlow is optimized using Cython and is thus not compatible with PyPy3. For the list of scripts compatible with PyPy3, please refer to Submodule Descriptions.

PyPy is an awesome Python JIT interpreter; consider donating to the project.

clair's People

Contributors

aquaskyline, chaklim, ncl935, zhengzhenxian


clair's Issues

An error when running it

Hi:
I have installed Clair with Anaconda3.
When I try to run clairDemo.sh, I get this error:
from clair.model import Clair
ImportError: cannot import name 'Clair' from 'clair.model' (unknown location)

Could you help me with this?
Thanks

sars-cov2 model

Hi all,

Just wondering if there is a pre-trained model for SARS-CoV-2 anywhere?

Liam

Installing without conda nor docker

Hi,

Is it possible to install Clair via make or pip instead of Conda? I have a "base" docker container where we actively avoid using Conda and I would like to add Clair to the docker.

Thanks a lot for any advice or pointers!

Which Reference to use during Training?

clair

In the highlighted portion
"REFERENCE_FILE_PATH="[YOUR_FASTA_FILE_PATH]" # e.g. hg001.fasta"

Should the reference be the standard GRCh38 or GRCh37 FASTA file, or, as mentioned in the example, the FASTA sequence of the sample being trained?

About execution error

Hello.

I have experience with Clairvoyante and Clair v1.
I'm trying to use Clair v2. I installed Clair using Docker and using Conda with a virtual environment, as you suggested.
But I got the same error messages from running Clair.

Delay 0 seconds before starting variant calling ...
Traceback (most recent call last):
File "/home/macarima/bioapps/Clair/clair/../clair.py", line 93, in
main()
File "/home/macarima/bioapps/Clair/clair/../clair.py", line 87, in main
submodule.main()
File "/home/macarima/bioapps/Clair/dataPrepScripts/ExtractVariantCandidates.py", line 454, in main
make_candidates(args)
File "/home/macarima/bioapps/Clair/dataPrepScripts/ExtractVariantCandidates.py", line 349, in make_candidates
zero_based_position - (0 if reference_start is None else (reference_start - 1))
File "/home/macarima/bioapps/Clair/dataPrepScripts/ExtractVariantCandidates.py", line 27, in evc_base_from
return base if base == "N" else BASE2ACGT[base]
KeyError: 'a'
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
ExtractVariantCandidates.py or GetTruth.py exited with exceptions. Exiting...

Clair is installed on Ubuntu 16.04 with CUDA 10.
My genome file is repeat masked (soft mask).

How can I fix this error? May I ask for your help?

Thank you.

With best regards,
Hui-Su Kim

GVCF output

Hello,

Would be possible to add a GVCF output to Clair? GLnexus could be used on the output GVCFs to make a joint VCF afterwards.

Thank you for your help.

Best regards,
Guillaume

filtering an ONT VCF Clair generated callset from a bi-modal result

The documentation does well to suggest a quality cutoff point: "The best quality cutoff is usually the valley between two peaks plus 50."
It looks like the example used in the documentation to illustrate this is using the MEDIAN point of the valley. The Rscript I am using to produce the histograms draws the MEAN point (although it could be scripted to generate the median, I suppose).
Just to be clear, the approximate centerpoint (median) between the peaks, plus 50, is your suggestion for a cutoff. Right?

SNP/indel tag in vcf

Hi,
I would like to ask if there is an option to add a type of variant tag to the vcf file, so I can select the indels easily? If not, do you plan to implement it?
Thanks a lot in advance,
Anna

VCF reporting for haploid samples

Hi, I recently ran a bacterial sample (nanopore sequenced) through Clair callVarBam to find variants in a gyrA gene. I used the --haploid flag and the resulting VCF file is shown below. My question is why the genotype (GT) is still reported as 1/1 (suggesting diploidy, no?) instead of only 1. Maybe I misunderstand how the genotype is reported, or the data has evidence that suggests diploidy. Thanks in advance.

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=LENGUESS,Number=.,Type=Integer,Description="Best guess of the indel length">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##contig=<ID=gyrA>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
gyrA 262 . G A 1015 . . GT:GQ:DP:AF 1/1:1015:188:0.7713
gyrA 491 . T TC 516 . . GT:GQ:DP:AF 1/1:516:188:0.1596

trained clair model taking long time to run

Hi,

I've trained a Clair model, but the calling seems to take >5 hrs to run (I stopped the job),
while with one of your provided models it was ~1 hr for the same chunk of data.

Do you have any insight why this could be the case?

Thanks

train clair with single strands?

Good morning,

I wanted to let you know that after the release of the latest trained models for ONT with coverage up to 550, Clair is performing really well.
I am using ONT to sequence a single gene with a PCR approach, and Clair found only one false positive SNV. This particular false positive can be easily identified when looking at the strands separately in the BAM file, as it is present in only one of the two strands. The same is true for basically all the false positive indels called.
I was wondering if it could be worth trying to train the models with single-strand data.

Kind regards

quality threshold advice

Hi folks,

I need to set up clair to run on a large set of samples and I would prefer to not manually check the score distribution each time to find the quality threshold. We are running R9 chemistry promethion flowcells and every sample that we manually check ends up with a threshold of 750. Is it safe to use a fixed threshold like this when the same library construction/chemistry/sequencer combination is applied to all samples?

thanks,
Richard

How to handle such "duplicate" calls?

Hi,

I am using Clair on my raw ONT reads (human), and found the following two calls that I don't know how to handle. They essentially differ by 1 base. So I hope there's a "merge" module in Clair to post process such "duplicated" calls.

chr1	25808786	.	GGAGGTGAGGACAGCTGGGGTGCGACGTGGGGCCCCTCC	G	691	.	.	GT:GQ:DP:AF	0/1:691:31:0.2258
chr1	25808787	.	GAGGTGAGGACAGCTGGGGTGCGACGTGGGGCCCCTCCGC	G	582	.	.	GT:GQ:DP:AF	0/1:582:31:0.3548

Screen Shot 2020-02-09 at 5 16 59 PM

Unfortunately, I cannot share the data.

Thanks!

--haploid setting less sensitive than default

Hello,
I'm not sure if this is intended behavior, but the --haploid flag appears to be significantly less sensitive at detecting variants than the default setting. Looking at the code in call_var.py it appears to be because of this snippet:

        if output_config.is_haploid_mode_enabled:
            if (
                is_hetero_SNP or is_hetero_ACGT_Ins or is_hetero_InsIns or
                is_hetero_ACGT_Del or is_hetero_DelDel or is_insertion_and_deletion
            ):
                return (
                    (True, False, False, False, False, False, False, False, False, False),
                    (reference_base_ACGT, reference_base_ACGT)
                )

Which based on the later return statement in that function,

    return (
        (
            is_reference, is_homo_SNP, is_hetero_SNP,
            is_homo_insertion, is_hetero_ACGT_Ins, is_hetero_InsIns,
            is_homo_deletion, is_hetero_ACGT_Del, is_hetero_DelDel,
            is_insertion_and_deletion
        ),
        (reference_base, alternate_base)
    )

, it seems that any time there is a heterozygous variant, it defaults to reporting the reference variant when in --haploid mode.

My expectation was that --haploid mode would be more sensitive at detecting low frequency variants rather than defaulting to the reference more frequently. If this is intended behavior it may be worth clarifying what --haploid mode is doing behind the scenes and what assumptions it's making in the --help statement.
Thanks!

HETEROSNP is not defined in the header

When using
vcfcat ${OUTPUT_PREFIX}.*.vcf | bcftools sort -m 2G | bgziptabix snp_and_indel.vcf.gz

to concatenate the vcf file generated from CallVarBamParallel, it raised the following error:

[W::vcf_parse] INFO '.,HETEROSNP' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##INFO=<ID=.,HETEROSNP,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for INFO '.,HETEROSNP'

Specific use case

Hello,

I would like to know if Clair would be suitable for the following use case:

  • I have a haploid genome of 100 Mb on which I mapped illumina 250 pb reads. Then I did a SNP call with DeepVariant.
    Now I want to know what threshold of GQ to use to filter my variants. Note, that the genome is variant dense. For DeepVariant it mattered, and I had to change the training model otherwise it was not calling the SNPs.

It happens that now I have a phased diploid assembly of the genome, from pacbio HiFi.

My logic is the following:
map the phased diploid PacBio HiFi genome onto the haploid reference, and check whether the SNPs called with Illumina correspond to heterozygous sites in the phased diploid HiFi assembly. That sounds simple, but it turns out that most tools are not designed to check for SNPs at a coverage of 2x (of course, when mapping the HiFi assembly the coverage is at most 2x), hence they return nothing.

Would Clair be able to do that? Basically calling a SNP with support from a single sequence out of 2?
Example

haploid genome: ATGGCGTA
snp call ATCGCGTA
hifi assemble: ATGGCGTAblablabla
ATCGCGTAblablabla

would Clair report the G-C heterozygous site?

Thanks a lot

Bioconda Installation Issues - Error: samtoolssamtools: error while loading shared libraries: libcrypto.so.1.0.0

After following the Option 1 installation instructions on the README.md, I attempted to run the demo and received the following error messages (samtoolssamtools: error while loading shared libraries: libcrypto.so.1.0.0):

Screen Shot 2020-11-09 at 1 30 09 PM

I've noticed similar problems in other issues such as 1, and 2, and was wondering if there were any quick fixes possible with Clair.

Operating on Linux CentOS 7 aarch64. Clair version 2.1.1 was installed.

Trouble when reproducing the results

Great Job!

I am running Clair on an HG002 BAM file and evaluating its performance against the truth variants using vcfeval from RTG Tools. But the performance I got is really poor: the metrics are around 10%. Even after I removed the GA4GH low-complexity regions from GIAB's high-confidence regions, as described in the Supplementary Information, the metrics are still very low.
So could you please help by providing a script to reproduce your results, or give me some advice?
Many thanks!

Automatically call SNPs for all contigs in fasta file?

Hi, I have a question regarding the enforced argument to always supply the contig name for SNP calling. Sometimes one might have a fragmented but still high-quality assembly that can be used for SNP calling; in that case, one would have to run SNP calling several times. I guess one alternative would be to concatenate all contigs (but that might affect performance). A desirable feature would be for Clair to automatically call SNPs for all contigs unless a contig name is provided!

ONT R9.4.1 vs R10.3

Hello!

Clair3 has been working great, and thank you guys for sharing such solid tech.

This is mainly a question, but:

I've processed a few PromethION and MinION samples through Clair3, but those were run using R9.4.1.
My team is interested in ONT tech and wants to try it out, but will likely use the newer R10.3 flowcells.

Based on this, is it a bad idea to continue using the existing ONT models in Clair3 for variant calling? Or would creating a new model trained on these flowcells be the correct step?

Source of input BAM file?

Hi HKU-BAL,

I have gotten promising results from using Clairvoyante for variant detection in my Oxford Nanopore seq data.
I am moving forward to install Clair now.

Can you provide guidance on the proper workflow for producing the BAM files for Clair/Clairvoyante input?

In particular, I wonder if any polishing step should be performed prior to variant detection.

I see the ONT training sets were made using minimap2.

Would I need to make a new training set / model if I wanted to use bam files that were produced from the medaka workflow (which presumably adds some polishing)?

Thanks for any guidance and suggestions you can provide.
Chris

wget '500 Internal Server Error' when downloading training models

Installation went well, but when I try wget of the different training models I get the following error:
--2020-05-18 15:02:24-- http://www.bio8.cs.hku.hk/clair_models/pacbio/ccs/15.tar
Resolving www.bio8.cs.hku.hk (www.bio8.cs.hku.hk)... 147.8.179.16
Connecting to www.bio8.cs.hku.hk (www.bio8.cs.hku.hk)|147.8.179.16|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2020-05-18 15:02:24 ERROR 500: Internal Server Error.

Regards.

Coverage is less than igv show

Hi Ruibang

I use callVarBam with the --haploid option to call variants in a small genome (29K). I find that the DP in the output VCF is lower than the BAM file coverage shown in IGV or reported by pysam.PileupColumn.nsegments (440 vs 479 at the test site). I have tried both --stop_consider_left_edge and --pysam_for_all_indel_bases, but neither worked. Do you have any idea?

clarification on -dcov option

When I use callVarBam.py with the default dcov option of 250, I get variants called with a depth of 500 and higher.

Is this correct? If so, can someone explain to me what dcov actually does?

Clair force calling using callVarBam calls more variants than are included in the VCF file

Hello,
The VCF file I used for force calling had 57 million SNVs for whole-genome variant calling. Force calling using callVarBam with the VCF provided (with --call_fn) resulted in 322 million SNVs being called. Does force calling mean calling the variants in the VCF file plus additional variants, or is something wrong? Normally, force calling should only call variants at the sites provided in the VCF file. I used the following script:

python clair.py callVarBam --chkpnt_fn "/clair/ont/model" --ref_fn "ref.fasta" --bam_fn "alignment.bam" --threshold "0.2" --minCoverage "4" --pypy "pypy3" --samtools "samtools" --delay "10" --threads "4" --sampleName "sample1" --vcf_fn "all_merged.vcf" --ctgName "$CONTIG_NAME" --call_fn "input.vcf"

Any help please?

SNV force calling

Hello,
is there a way to perform SNV force calling using Clair?
E.g.: I did variant calling for 7 different genotypes. Afterwards I removed Indels and filtered for quality and AF ending up with 7 VCF files. After that, I merged the files using bcftools merge.
Now I want to use this VCF to do a force calling of this list of SNVs.
Many thanks,
Paul

Quality cut off determination for CLR data

Hi,

I have used the model for PacBio subread data that you provided in the issues forum. I have a question regarding the quality distribution for this model and choosing the cutoff, as the distribution is different from that of ONT, CCS, and Illumina data.

Is there any way to determine this ?

Alarm

Hi, I'm getting a rare issue on a small GRCh38 contig (chrUn_KI270336v1) that is making my WDL pipeline fall over. In this instance nothing has mapped, but I don't think that that is the issue.

No read has been process, either the genome region you specified has no read cover, or please check the correctness of your BAM input (/working/genomeinfo/cromwell-dev/cromwell-executions/ONT/7de38df6-2b8b-4c07-b082-8d014fc06cb5/call-clair/shard-136/inputs/1486994909/sampleID.bam).
INFO:tensorflow:Restoring parameters from /reference/software/claire/claire-2.0.6/ont/model
Restoring parameters from /reference/software/claire/claire-2.0.6/ont/model
Calling variants ...
Processed 0 tensors
Total time elapsed: 0.00 s
[INFO] Using 9 CPU threads
Delay 2 seconds before starting variant calling ...
/var/spool/PBS/mom_priv/jobs/10425955.hpcpbs02.SC: line 25: 13868 Alarm clock clair.py callVarBam --threads 10 --chkpnt_fn /reference/software/claire/claire-2.0.6/ont/model --ref_fn /working/genomeinfo/cromwell-dev/cromwell-executions/ONT/7de38df6-2b8b-4c07-b082-8d014fc06cb5/call-clair/shard-136/inputs/862458108/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa --bam_fn /working/genomeinfo/cromwell-dev/cromwell-executions/ONT/7de38df6-2b8b-4c07-b082-8d014fc06cb5/call-clair/shard-136/inputs/1486994909/sampleID.bam --ctgName chrUn_KI270336v1 --sampleName chrUn_KI270336v1_name --call_fn chrUn_KI270336v1.vcf

I think it comes from line 203 of https://github.com/HKU-BAL/Clair/blob/90dc1a9c47c1d346b834f0ea54d2e185779f1c8a/clair/callVarBam.py . What is the purpose of the alarm here? It's possible that if my server is heavily loaded, 2 seconds is simply not enough to account for network lag etc.

What would you suggest as a work around?

Thanks in advance,
Liam

Memory problem

Hi,
I am running Clair on ONT data (45X) as described in the Call variants in a chromosome section, and only chromosomes 18, 19, 20, 21, 22, X and Y finished.

These jobs took around 60 GB of RAM and more than 18 hours. However, none of the other chromosomes finished, because they required more than 60 GB after already running for 20 hours.

Is there any way to solve this memory problem?

Thanks for your time,

Error when take the output vcf as input for VariantQC

Hi Ruibang,

Clair is great and I have successfully run it and got the output VCF. But I encountered an ERROR when taking this VCF as input for VariantQC (https://github.com/BimberLab/DISCVRSeq).

The ERROR is as follows: "UNEXPECTED ERROR: The provided VCF file is malformed at approximately line number 2964659: unparsable vcf record with allele GBG, for input source: /input/clair.vcf.gz"

It may be related to this issue: BimberLab/DISCVRSeq#63.

Could you please give me some clues to avoid this error?

Many thanks!

model quantization experiments

Hello,

I am planning to do some experiments with model quantization both by post-training and retraining approach. The end goal is to reduce the size of the model to fit on an accelerator with smaller on-chip memory. Currently, the size of the model (checkpoint files) is around 15MB, and I think it can potentially come down by 4x by fixed-point representation quantization for example.

I was hoping to get some inputs to achieve this without impacting the accuracy of variant calling. Please let me know if you have any recommendations based on your expertise so far.

Thanks,
Ram

ValueError: invalid literal for int() with base 10: 'CCA'

Hi,

I am training a model on multiple individuals. At step 8-create-splited-binaries-using-the-tensor2bin-submodule, I get the error ValueError: invalid literal for int() with base 10: 'CCA'.

error code:

8. Create splitted binaries using the Tensor2Bin submodule

Loading the dataset ...
Traceback (most recent call last):
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/clair.py", line 94, in <module>
    main()
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/clair.py", line 88, in main
    submodule.main()
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/dataPrepScripts/Tensor2Bin.py", line 63, in main
    Run(args)
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/dataPrepScripts/Tensor2Bin.py", line 25, in Run
    is_allow_duplicate_chr_pos=args.allow_duplicate_chr_pos
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/clair/utils.py", line 138, in get_training_array
    Y = variant_map_from(var_fn, tree, is_tree_empty)
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/clair/utils.py", line 127, in variant_map_from
    Y[key] = output_labels_from_vcf_columns(columns)
  File "/home/kewinogink/anaconda3/envs/clair-env/bin/clair/task/main.py", line 53, in output_labels_from_vcf_columns
    genotype_1, genotype_2 = int(columns[4]), int(columns[5])
ValueError: invalid literal for int() with base 10: 'CCA'

I have checked, and it is because the all_var_prefixed file has 7 columns instead of the 6 it has when training on only one sample: the random prefix is now the first column, followed by the other 6.

one sample

gzip -fdc all_var_prefixed | head -5
CHR1 18 ACC CCA 1 1
CHR1 83 AGC AC 1 1
CHR1 109 G C 0 1
CHR1 115 C G 0 1
CHR1 136 G A 1 1

multiple samples

gzip -fdc train_clair_output/var/all_var_prefixed | head -5
xtBOD CHR1 485 TCA CCA 0 0
xtBOD CHR1 501 CTC CC 0 0
xtBOD CHR1 509 G A 0 0
xtBOD CHR1 519 C T 0 0
xtBOD CHR1 530 A G 1 1

My question is what I should do. Did I somehow make a mistake with these prefixes? What should the all_var_prefixed look like?

Best,

Kewin

Clair and Clairvoyante having trouble with bam file from human transcript alignments

Hello,

thank you for all of your hard work to make Clair, and its predecessor Clairvoyante. I have used both to successfully call human genome variants for a particular gene with ONT data from PCR amplification of this gene from genomic DNA.

However, when I try the exact same workflow with cDNA for this gene, using an appropriately modified reference, bedfile, and contig name, I always get the error:
No read has been process, either the genome region you specified has no read cover, or please check the correctness of your BAM input

I have sorted and indexed my BAM file, and I know that my BAM file is fine for the provided reference and bedfile, as bedtools intersect works as expected, and so does IGV using this BAM file and the reference transcript (from which I created a .genome file for IGV).

Please could you tell me what might be going on?

Thank you,
Ali

clair and ONT samples aligned with hg19

Hi all,

My team has recently started to run Clair on samples that were aligned to an hg19 reference. Unfortunately this reference does not have the "chr" prefix in front of the numbers in the header lines indicating the start of each chromosome, e.g.:
>22 CM000684.1 Homo sapiens chromosome 22, GRCh37 primary reference assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

This is correctly reflected in the generated BAMs.
7923634f-8126-4e55-be50-3e9cb09f7964 0 22 10001 1 106S15M1D5M1D11M1D5M...

But Clair keeps complaining with the error:
No read has been process, either the genome region you specified has no read cover, or please check the correctness of your BAM input....

I have tried a few things but it seems that Clair requires a bam to contain chromosome names containing "chr".

Please comment if this is correct or not and any solution that can be suggested.
thanks

clair can not find model.meta

Hi

I tried clair.py callVarBamParallel, but the script reported the following:

[ERROR] file ~/FDA_SNVcall_challenge/clair/models/model.meta not found
(clair-env) [@gc015 clair]$ ls models/model.meta
models/model.meta
(clair-env) [@gc015 clair]$ ls models/model.*
models/model.data-00000-of-00001  models/model.index  models/model.meta

directory:

(clair-env) [@gc015 clair]$ tree
.
├── 122HD34.tar
├── call
├── cmd_gen.sh
├── command.sh
└── models
    ├── model.data-00000-of-00001
    ├── model.index
    └── model.meta

my script:
(in this script, $CLAIR is not used)

#! /bin/bash
CLAIR="~/anaconda2/envs/clair-env/bin/clair.py"
MODEL="~/FDA_SNVcall_challenge/clair/models/model"
BAM_FILE_PATH="~/FDA_SNVcall_challenge/FDA_data/sam/HG002_GM24385_1_2_3_Guppy_3.6.0_prom.bam"
REFERENCE_FASTA_FILE_PATH="~/FDA_SNVcall_challenge/FDA_data/hg38.fa"
SAMPLE_NAME="HG002"
OUTPUT_PREFIX="call/hg002"
mkdir -p call
clair.py callVarBamParallel \
--chkpnt_fn "$MODEL" \
--ref_fn "$REFERENCE_FASTA_FILE_PATH" \
--bam_fn "$BAM_FILE_PATH" \
--threshold 0.2 \
--sampleName "$SAMPLE_NAME" \
--output_prefix "$OUTPUT_PREFIX" > command.sh

I'm not sure why clair could not find model.meta.
Can anyone solve this?

Number of needed tensors to be processed

Dear Clair team

It would be great if Clair printed the total number of tensors to be processed, maybe like this: Processed 5000 out of 1000000 tensors. This would help me to know how much time it needs to finish.

Delay 7 seconds before starting variant calling ...
[INFO] Using 3 CPU threads
INFO:tensorflow:Restoring parameters from /Volumes/sina2tb/model
Restoring parameters from /Volumes/sina2tb/model
Calling variants ...
Processed 1000 tensors
Processed 2000 tensors
Processed 3000 tensors
Processed 4000 tensors
Processed 5000 tensors

Regards,
Sina

KeyError: '*T'

Hi,
Thank you for a really nice program.
With the release of the GIAB 4.x truth sets, they seem to have introduced the "*" allele for deletions overlapping a position. Are there any plans to incorporate this in Clair? It currently throws a "KeyError: '*T'" when you create the tensors for training.

Somatic and germline

Hi @aquaskyline ,
Would the variants detected by Clair from a tumor sample be all the germline and somatic variants in that tumor sample? Is that right?

Thanks,
Vahid.

Any chance of a clair model trained on guppy >=3.6.0, and on guppy 4.4.2 with bonito?

Hi there,

I noticed that your models for ONT variant calling are at least two major basecaller updates behind the state of the art. When we tested clair on some in-house GM24385 data, we actually saw a drop in variant call accuracy when using guppy-bonito basecalls, despite an increase in raw read accuracy.

Is there any likelihood of newer models being produced for the newer, more accurate basecalling?

Can I use the model training on PacBio CCS data for PacBio non-CCS data?

Hello,

After your suggestion to switch to Clair, I'm exploring the option. However, I see that for PacBio there is only a trained model for CCS data. Since I have some slightly older PacBio data which is not CCS, I'm concerned that this model will not be appropriate for my data, which has a much higher error rate (~15%) than CCS data.

My instinct is that I can't use this model for my non-CCS PacBio data, but I would appreciate your opinion.

Best,

Phil

PacBio Subread Data

Hi,
I see that you uploaded a trained model for CLR data, and I was wondering what threshold would be adequate under usual circumstances. You have 0.2 for PacBio CCS data; does this apply to subread data as well?

If there is a separate model for subread data, could it be uploaded?

Thank you

Support for 'N' bases

I'm seeing an error that looks like it involves 'N' bases (see the stack trace below). I'm not sure if the Ns are in the ref genome (we're using b37) or the BAM file, but either way this is a showstopper, since we can't guarantee that either will never have an N.

Here's the stack trace:

Traceback (most recent call last):
  File "/opt/clair/clair/../clair.py", line 93, in <module>
    main()
  File "/opt/clair/clair/../clair.py", line 87, in main
    submodule.main()
  File "/opt/clair/dataPrepScripts/ExtractVariantCandidates.py", line 450, in main
    make_candidates(args)
  File "/opt/clair/dataPrepScripts/ExtractVariantCandidates.py", line 290, in make_candidates
    base = BASE2ACGT[SEQ[query_position]]
KeyError: 'N'

Really awesome work otherwise! I think your approach has a lot of advantages over the DeepVariant method (which seems a lot like just an improved VQSR); I'm pretty excited about Clair in general!

problem during training

Hi,
I am training Clair on NA12878 following the training tutorial:

CLAIR="/home/user/miniconda3/envs/clair-env/bin/clair.py"
PYPY="/home/user/miniconda3/envs/clair-env/bin/pypy3"

VCF_FILE_PATH="/homeb/data/NA12878-SNP/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz"
BAM_FILE_PATH="/homeb/data/NA12878-SNP/reads2ref.sort.bam"
REFERENCE_FILE_PATH="/homeb/data/NA12878-SNP/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna" 

DATASET_FOLDER_PATH="/homeb/data/NA12878-SNP/train_output"
DEPTHS=(1.000 0.800 0.600 0.400 0.200)
SUBSAMPLED_BAMS_FOLDER_PATH="/homeb/data/NA12878-SNP/subsample_bams"
CHR_PREFIX="chr"
CHR=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X)
THREADS=40
THREADS_LOW=20
DEPTHS_PER_SAMPLE=${#DEPTHS[@]}
ESTIMATED_SPLIT_NO_OF_LINES=$((180000 * $DEPTHS_PER_SAMPLE))
MINIMUM_COVERAGE=4
VARIANT_FOLDER_PATH="${DATASET_FOLDER_PATH}/var"
CANDIDATE_FOLDER_PATH="${DATASET_FOLDER_PATH}/can"
TENSOR_VARIANT_FOLDER_PATH="${DATASET_FOLDER_PATH}/tensor_var"
TENSOR_CANDIDATE_FOLDER_PATH="${DATASET_FOLDER_PATH}/tensor_can"
TENSOR_PAIR_FOLDER_PATH="${DATASET_FOLDER_PATH}/tensor_pair"
SHUFFLED_TENSORS_FOLDER_PATH="${DATASET_FOLDER_PATH}/all_shuffled_tensors"
BINS_FOLDER_PATH="${DATASET_FOLDER_PATH}/all_bins"

mkdir ${DATASET_FOLDER_PATH}
cd ${DATASET_FOLDER_PATH}
mkdir ${VARIANT_FOLDER_PATH}
mkdir ${CANDIDATE_FOLDER_PATH}
mkdir ${TENSOR_VARIANT_FOLDER_PATH}
mkdir ${TENSOR_CANDIDATE_FOLDER_PATH}
mkdir ${TENSOR_PAIR_FOLDER_PATH}
mkdir ${SHUFFLED_TENSORS_FOLDER_PATH}
mkdir ${BINS_FOLDER_PATH}

for j in "${!DEPTHS[@]}"
do
  cd ${TENSOR_VARIANT_FOLDER_PATH}
  mkdir ${DEPTHS[j]}

  cd ${TENSOR_CANDIDATE_FOLDER_PATH}
  mkdir ${DEPTHS[j]}

  cd ${TENSOR_PAIR_FOLDER_PATH}
  mkdir ${DEPTHS[j]}
done

cd ${DATASET_FOLDER_PATH}

parallel --joblog ./get_truth.log -j${THREADS} \
"${PYPY} ${CLAIR} GetTruth \
--vcf_fn ${VCF_FILE_PATH} \
--var_fn ${VARIANT_FOLDER_PATH}/var_{1} \
--ref_fn ${REFERENCE_FILE_PATH} \
--ctgName ${CHR_PREFIX}{1}" ::: ${CHR[@]}

# merge all truth variants into a single file (named all_var)
cat ${VARIANT_FOLDER_PATH}/var_* > ${VARIANT_FOLDER_PATH}/all_var

parallel --joblog ./evc.log -j${THREADS} \
"${PYPY} ${CLAIR} ExtractVariantCandidates \
--bam_fn ${BAM_FILE_PATH} \
--ref_fn ${REFERENCE_FILE_PATH} \
--can_fn ${CANDIDATE_FOLDER_PATH}/can_{1} \
--ctgName ${CHR_PREFIX}{1} \
--gen4Training" ::: ${CHR[@]}


parallel --joblog ./create_tensor_var.log -j${THREADS} \
"${PYPY} ${CLAIR} CreateTensor \
--bam_fn ${SUBSAMPLED_BAMS_FOLDER_PATH}/{1}.bam \
--ref_fn ${REFERENCE_FILE_PATH} \
--can_fn ${VARIANT_FOLDER_PATH}/var_{2} \
--minCoverage ${MINIMUM_COVERAGE} \
--tensor_fn ${TENSOR_VARIANT_FOLDER_PATH}/{1}/tensor_var_{2} \
--ctgName ${CHR_PREFIX}{2}" ::: ${DEPTHS[@]} ::: ${CHR[@]}

parallel --joblog ./create_tensor_can.log -j${THREADS} \
"${PYPY} ${CLAIR} CreateTensor \
--bam_fn ${SUBSAMPLED_BAMS_FOLDER_PATH}/{1}.bam \
--ref_fn ${REFERENCE_FILE_PATH} \
--can_fn ${CANDIDATE_FOLDER_PATH}/can_{2} \
--minCoverage ${MINIMUM_COVERAGE} \
--tensor_fn ${TENSOR_CANDIDATE_FOLDER_PATH}/{1}/tensor_can_{2} \
--ctgName ${CHR_PREFIX}{2}" ::: ${DEPTHS[@]} ::: ${CHR[@]}

parallel --joblog ./create_tensor_pair.log -j${THREADS} \
"${PYPY} ${CLAIR} PairWithNonVariants \
--tensor_can_fn ${TENSOR_CANDIDATE_FOLDER_PATH}/{1}/tensor_can_{2} \
--tensor_var_fn ${TENSOR_VARIANT_FOLDER_PATH}/{1}/tensor_var_{2} \
--output_fn ${TENSOR_PAIR_FOLDER_PATH}/{1}/tensor_pair_{2} \
--amp 2" ::: ${DEPTHS[@]} ::: ${CHR[@]}

ls tensor_pair/*/tensor_pair* | \
parallel --joblog ./uncompress_tensors.log -j${THREADS_LOW} -N2 \
--line-buffer --shuf --verbose --compress stdbuf -i0 -o0 -e0 pigz -p4 -dc ::: | \
parallel --joblog ./round_robin_cat.log -j${THREADS} \
--line-buffer --pipe -N1000 --no-keep-order --round-robin --compress \
"split - -l ${ESTIMATED_SPLIT_NO_OF_LINES} --filter='shuf | pigz -p4 > \$FILE.gz' -d ${SHUFFLED_TENSORS_FOLDER_PATH}/split_{#}_"

ls ${SHUFFLED_TENSORS_FOLDER_PATH}/split_* | \
parallel --joblog ./tensor2Bin.log -j${THREADS_LOW} \
"python ${CLAIR} Tensor2Bin \
--tensor_fn {} \
--var_fn ${VARIANT_FOLDER_PATH}/all_var \
--bin_fn ${BINS_FOLDER_PATH}/{/.}.bin \
--allow_duplicate_chr_pos"

cd ${DATASET_FOLDER_PATH}
python ${CLAIR} CombineBins

MODEL_NAME="NA12878_20X_WGS_ONT_GRCh38_Guppy4_20210416"
MODEL_FOLDER_PATH="/homeb/data/NA12878-SNP/trained_model/${MODEL_NAME}"
TENSOR_FILE_PATH="${BINS_FOLDER_PATH}/tensor.bin"

mkdir ${MODEL_FOLDER_PATH}

export CUDA_VISIBLE_DEVICES="0"

python $CLAIR train \
--bin_fn "$TENSOR_FILE_PATH" \
--ochk_prefix "${MODEL_FOLDER_PATH}/model"

It shows the error:

[E::hts_open_format] Failed to open file /homeb/data/NA12878-SNP/subsample_bams/1.000.bam
samtools view: failed to open "/homeb/data/NA12878-SNP/subsample_bams/1.000.bam" for reading: No such file or directory

at the step "Create tensors for truth variants using the CreateTensor submodule". Can you give me some suggestions?

Thanks
Neng Huang

Interpretation of quality scores

Hello,

I see in the clair pre-print that quality scores are defined as follows:

"The variant quality is calculated as the square of the Phred score of the distance between the largest and the second-largest likelihood values."

Currently, I'm interpreting this as:

If a variant has a quality of 400, then there is a 1 in 100 chance (sqrt(400) = 20, phred score 20 = 1 in 100) that the selected variant was chosen when in reality, the "second choice" variant should have been chosen.

Is this interpretation anywhere near the truth?

Thanks,

Phil
