Coder Social home page Coder Social logo

brentp / somalier Goto Github PK

View Code? Open in Web Editor NEW
254.0 11.0 35.0 1.45 MB

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

License: MIT License

Shell 4.00% HTML 29.86% Nim 53.62% CSS 6.20% Dockerfile 0.29% Python 6.03%
cancer-genomics bioinformatics genomics

somalier's Introduction

somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF

Actions Status Cite install with bioconda

Quick Start

somalier makes checking any number of samples for identity easy directly from the alignments or from jointly-called VCFs:

The first step is to extract sites. For VCF just use:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $cohort.vcf.gz

with a sites file from releases

For BAM or CRAM, use: This is parallelizable by sample via cluster or cloud, but here, using a for loop:

for f in *.cram; do
    somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $f
done

--sites is a VCF of known polymorphic sites in VCF format. A good set is provided in the releases but any set of common variants will work.

โš ๏ธ somalier can work on GVCF and individual VCFs, but it is recommended to extract from bam/cram when possible. It is also good to extract from a jointly-called VCF/BCF when only looking within that cohort. While extracting from a single-sample VCF is possible (with --unknown) and GVCF is also supported, these options are less accurate and more prone to problems.

The next step is to calculate relatedness on the extracted data:

somalier relate --ped $pedigree extracted/*.somalier

This will create text and interactive HTML output that makes it fast and easy to detect mismatched samples and sample-swaps. The files created are:

  • `{output_prefix}.samples.tsv # creates a .ped like file with extra QC columns
  • `{output_prefix}.pairs.tsv # shows IBS for all possible sample pairs
  • `{output_prefix}.groups.tsv # shows pairs of samples above a certain relatedness
  • `{output_prefix}.html # interactive html

Example html output is here

Note that the somalier relate command runs extremely quickly (< 2 seconds for 600 samples and ~1 minute for 4,500 samples) so it's possible to add/remove samples or adjust a pedigree file and re-run iteratively.

For example to add the n + 1th samples, just run somalier extract on the new sample and then re-use the already extracted data from the n original samples.

For huge sample-sets, if you run into a bash error for argument list too long, you can pass the somalier files as quoted glob strings like: "/path/to/set-a/*.somalier" "/path/to/set-b/*.somalier".

Example Output

  • Interactive output from somalier relate is here
  • Interactive output from somalier ancestry is here

Infer

somalier can also infer first-degree relationships (parent-child) when both-parents are present and can often build entire pedigrees on high-qualty data. To do this, use somalier relate --infer ... and the samples.tsv output will be a pedigree file indicating the inferred relationships.

See wiki for more detail.

Usage

The usage is also described above. Briefly, after downloading the somalier binary and a sites vcf from the releases run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $sample.bam

for each sample to create a small binary file of the ref and alt counts for the variants listed in sites.hg38.vcf.gz.

for a vcf, run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $cohort.bcf

somalier can extract from a multi or single-sample VCF or a GVCF. This will be much faster, in cases where it's available, this would look like:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa joint.vcf.gz

following this, there will be a $sample.somalier file for each sample in the joint.vcf.gz

Note that somalier uses the AD field to extract depth information. If that FORMAT field is not present in the header, then it will use the genotypes only and use a total depth of 20 (10,10 for heterozygote), etc.

Then run:

somalier relate --ped $pedigree_file cohort/*.somalier

This will create an html file for QC in a few seconds.

Note that if a new sample is added to the cohort, it's only necessary to perform the extract step on that sample and then run the (fast) relate step again with all of the extracted files.

Extended Usage

For each command of somalier, extended parameters are listed in --help of each subcommand.

$./somalier --help
Commands:
  extract      :   extract genotype-like information for a single sample from VCF/BAM/CRAM.
  relate       :   aggregate `extract`ed information and calculate relatedness among samples.
  ancestry     :   perform ancestry prediction on a set of samples, given a set of labeled samples
  find-sites   :   create a new sites.vcf.gz file from a population VCF (this is rarely needed).

somalier extract

extract genotype-like information for a single-sample at selected sites

Usage:
  somalier extract [options] sample_file

Arguments:
  sample_file      single-sample CRAM/BAM/GVCF file or multi/single-sample VCF from which to extract

Options:
  -s, --sites=SITES          sites vcf file of variants to extract
  -f, --fasta=FASTA          path to reference fasta file
  -d, --out-dir=OUT_DIR      path to output directory (default: .)
  --sample-prefix=SAMPLE_PREFIX
                             prefix for the sample name stored inside the digest

somalier relate

calculate relatedness among samples from extracted, genotype-like information

Usage:
  somalier relate [options] [extracted ...]

Arguments:
  [extracted ...]  $sample.somalier files for each sample. the first 10 are tested as a glob patterns

Options:
  -g, --groups=GROUPS        optional path  to expected groups of samples (e.g. tumor normal pairs).
specified as comma-separated groups per line e.g.:
    normal1,tumor1a,tumor1b
    normal2,tumor2a
  --sample-prefix=SAMPLE_PREFIX
                             optional sample prefixes that can be removed to find identical samples. e.g. batch1-sampleA batch2-sampleA
  -p, --ped=PED              optional path to a ped/fam file indicating the expected relationships among samples.
  -d, --min-depth=MIN_DEPTH  only genotype sites with at least this depth. (default: 7)
  --min-ab=MIN_AB            hets sites must be between min-ab and 1 - min_ab. set this to 0.2 for RNA-Seq data (default: 0.3)
  -u, --unknown              set unknown genotypes to hom-ref. it is often preferable to use this with VCF samples that were not jointly called
  -i, --infer                infer relationships (https://github.com/brentp/somalier/wiki/pedigree-inference)
  -o, --output-prefix=OUTPUT_PREFIX
                             output prefix for results. (default: somalier)

Note that for large cohorts, by default, somalier relate will subset to interesting sample-pairs so as not to balloon the size of the output. By default, pairs that are expected to be unrelated and have a relatedness <= 0.05. If you wish to force all samples to be reported, then set the environment variable SOMALIER_REPORT_ALL_PAIRS to any non-empty value, e.g. export SOMALIER_REPORT_ALL_PAIRS=1

find-sites

To create a set of new sites, use somalier find-sites on a population VCF. More info on this tool is available in the wiki

Install

Somalier is available via bioconda, see here.

Alternatively, you can get a static binary from here.

Users can also get a docker image here which contains htslib and a somalier binary ready-for-use.

How it works

somalier takes a list of known polymorphic sites. Even a few hundred (or dozen) sites can be a very good indicator of relatedness. The best sites are those with a population allele frequency close to 0.5 as that maximizes the probability that any 2 samples will differ. A list of such sites is provided in the releases for GRCh37 and hg38.

In order to quickly calculate genotypes at these sites, somalier assays the exact base. The extraction step is done directly from the bam/cram files 1 sample at a time.

The relate step is run on the output of the extract commands. It runs extremely quickly so that new samples can be added and compared. It uses 3 bit-vectors per sample for hom-ref, het, hom-alt. Each bitvector is a sequence of 64 bit integers where each bit is set if the variant at that index in the sample is for example, heterozygous. With this setup, we can use fast bitwise operations and popcount hardware instructions to calculate relatedness extremely quickly.

For each sample-pair, it reports:

  1. IBS0 -- the number of sites where one sample is hom-ref and another is hom-alt
  2. IBS2 -- the number of sites where the samples have the same genotype
  3. shared-hets -- the number of sites where both samples are heterozygotes
  4. shared-hom-alts -- the number of sites where both samples are homozygous alternate

These are used to calculate relatedness and a measure of relatedness that is unaffected by loss-of-heterozygosity that is often seen in some cancers. The interactive output allows toggling between any of these measures.

It also reports depth information and the count of HET, HOM_REF, HOM_ALT, and unknown genotypes for each sample along with a number of metrics that are useful for general QC.

Example

example

Here, each point is a pair of samples. We can see that the expected identical sample-pairs (e.g. tumor-normal pairs) specified by the user and drawn in red mostly cluster together on the right. Unrelateds cluster on the lower left. The sample-swaps are the blue points that cluster with the red. In the somalier output, the user can hover to see which sample-pairs are involved each point

Ancestry Estimate

somalier can predict ancestry on a set of query samples given a set of labelled samples. You can read more about this in the wiki

Usage

Usage is intentionally very simple and running somalier extract or somalier relate will give sufficient help for nearly all cases.

By default somalier will only consider variants that have a "PASS" or "RefCall" FILTER. To extend this list, set the environment variable SOMALIER_ALLOWED_FILTERS to a comma-delimited list of additional filters to allow.

by default sites with an allele balance < 0.01 will be considered homozygous reference. To adjust this, use e.g. : SOMALIER_AB_HOM_CUTOFF=0.04 somalier relate ...

Other Work

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499645/

https://academic.oup.com/bioinformatics/article/33/4/596/2624551

Acknowledgement

This work was motivated by interaction and discussions with Preeti Aahir and several early users who provided valuable feedback.

somalier's People

Contributors

balthasar-eu avatar brentp avatar brwnj avatar johanneskoester avatar kpalin avatar tstannius avatar zztin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

somalier's Issues

Large number of n_unknown

Hi Brent,

I ran somalier on ~500 targeted sequencing samples with average coverage of around 600X and ~1000 sites. I noticed that most samples are showing n_unknown of around 600 and as such most pairs only have about 250 sites in common used for comparison.

Any idea why so many sites are not able to be classified?

add genomic depth view

given evenly spaced probes, can plot x=index, y=normalized-depth for each sample as a nice QC.

show warning when .somalier files have no sites with depth > 0

I am trying to use somalier to confirm matches between tumor and normal samples from the same patient. somalier extract works fine for both .bam files using sites.hg19.vcf.

The trouble is with relate, which I can't seem to figure out the parameters for. I have tried:

  • somalier relate cohort/*somalier
  • somalier relate -p pedigree.txt cohort/*somalier
  • somalier relate -p pedigree.txt -g group.txt cohort/*somalier

With some simple group and pedigree files, but I only output .tsv files with empty rows.

This feels like it should be a simple use-case, but it is fairly befuddling. Any pointers?


cat group.txt
normal0,tumor0

cat pedigree.txt
fam	normal0	0	0	0	0
fam	tumor0	 0	0	0	0

cancer ChIPseq data and whole exome data

Hi Brent,

Is it possible to match ChIPseq and WES bam from the same patient? The ChIPseq is 37 bp short single end while the WES is 75bp paired-end. Is this problematic? want to have a try with this tool.

Thanks!
Tommy

meaning of warning:cant use more than 65535 sites

When running somalier extract I got a warning message:

warning:cant use more than 65535 sites

What does this mean? Should we reduce the number of variants we feed into somalier?
I'm using version: 0.2.9

Thanks for writing this great tool!

Somalier underestimates relatedness if low sites overlap

Hi,

I am running some different samples (RNA vs cfDNA) through somalier and one of them came out with a very low relatedness, even though 77 (out of 78) sites are IBS2:

#sample_a   sample_b   relatedness  ibs0  ibs2  hom_conc hets_a  hets_b  shared_hets  hom_alts_a  hom_alts_b  shared_hom_alts  n    x_ibs0  x_ibs2  expected_relatedness
165962      778965     0.101        0     77    0.193    228     556     23           166         866         32               78   0       3       1.0

I think that, even though there are a lot of heterozygote sites, they do not actually overlap (the reason why n=78). On the RNA data probably because of non-expressed genes, and on the cfDNA prob due to the low coverage.

According to somalier's paper, relatedness is calculated as:

(shared-hets(i,j) - 2 * ibs0(i, j)) / min(hets(i), hets(j))

However, in this case, even though one sample has 228 hets and the other 556, they only overlap on 78 snps. So, shouldn't the formula be a bit more like:

(shared-hets(i,j) - 2 * ibs0(i, j)) / min(hets_in_common_pos(i), hets_in_common_pos(j))

where hets_in_common_pos(i) stands for the number of hets in sample "i" among the positions shared ("n"). This change should have no effect when comparing the same type of seq (since the overlap should be quite high) and improve comparisons of different types of seq.

Occasional access errors with http file pointers: Read block operation failed with error 2 after 0 of 4 bytes

# /home/avilella/somalier/somalier -t 1 /home/avilella/CEGX_Run704/CEGX_Run704.CEGX_Run704-12345678.input.list -f /home/avilella/Genomes/Homo_sapiens/NCBI/GRCh38Decoy/GRCh38Decoy.fa -s /home/avilella/Annotations/somalier/GRCh38/GRCh38.AF.0p50.vcf.gz -o /home/avilella/CEGX
_Run704//CEGX_Run704.CEGX_Run704-12345678.input

Any ideas what these may be? Would they affect the results?

avilella@avilellaM710t:~/CEGX_Run701$ [somalier] checked reference for 10000 sites
[somalier] checked reference for 20000 sites
[somalier] checked reference for 30000 sites
[somalier] checked reference for 40000 sites
[somalier] sites:48666 threads:1
[somalier] spawning sample:0
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 3 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
[hts-nim] error in bam.query:-2

new tool fam-finder

create a new tool to create a ped/fam file from *.somalier. it will only find first degree relationships, but this is often useful when there is no pedigree file for a cohort.

Limit number of threads used

Dear all,

is it possible to limit the number of threads somalieruses?
Right now it seems to use all available cpus.

thanks,

`Argument list too long' on 30k samples

Hi,

Found a small corner-case where the current command for running relate becomes an issue. When calculating the relatedness on around 30.5k sketches the command argument becomes too long and triggers a bash error: `Argument list too long'. My understanding is that the star-syntax unfolds the full lists of paths, which exceeds some OS-specific max length.

I was able to solve this by calling the program from within the sketch path (which shortens the unfolded command):
> ./somalier relate ./somalier_sketches/*.somalier -o output
> ../somalier relate *.somalier -o output

This won't be a remedy if the number of samples is even larger. [1] suggests this is fixable by bumping up ARG_MAX and ulimit but I wasn't able to verify it yet (plus, probably not an option on every managed cluster).

Thought I should report back, though not really a core software issue. Thank you for writing the tool, it is incredibly useful! Can confirm that somalier scales up to at least 24k samples, though most of the time is spent writing the 30GB .pairs.tsv file.

Alex

[1] https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command

Crash with 0.2.9

Hello,

I was upgrading to 0.2.9 and I'm getting a crash now for some reason. I have a parameterized snakemake command that looks like this:

{params.somalier} extract \
            -s {params.sites} \
            -f {params.ref} \
            -d {params.som_dir} \
            {input.vcf}

The only change I made was in the somalier executable path from the 0.2.5 binary to the 0.2.9 binary. Weirdly, the 0.2.9 binary seems to work on some other VCF files I have generated from a different pipeline, but not on these which were historically working in 0.2.5. Thoughts on debugging or additional details needed?

relatedness = -nan and no HETs found

Hi,
I have some WGS, RNAseq and targeted RNAseq normal (no tumour) samples and I wanted to run somalier on them to find out if any of them are from the same individual (i.e., if for example I have WGS and RNAseq data from the same sample).

I had no problems running somalier, but the results are very confusing because somalier does not find any HETs in the data and the relatedness metric that I get is '-nan' for all the pairs of samples. This is the first few lines of the resulting somaliers.pairs.tsv file:

#sample_a sample_b relatedness hom_concordance hets_a hets_b shared_hets hom_alts_a hom_alts_b shared_hom_alts ibs0 ibs2 n
ERR2322353 ERR2322354 -nan 0.827 0 0 0 4208 3456 2858 0 2858 2858
ERR2322353 ERR2322355 -nan 0.795 0 0 0 4208 3811 3030 0 3030 3030
ERR2322353 ERR2322356 -nan 0.847 0 0 0 4208 4875 3563 0 3563 3563
ERR2322353 ERR2322357 -nan 0.811 0 0 0 4208 3555 2884 0 2884 2884
ERR2322353 ERR2322358 -nan 0.805 0 0 0 4208 3710 2987 0 2987 2987

I had tried to match these samples previously just by calling variants on all of them and checking the genotypes. While doing this, I was able to call both HOMs and REFs in all the samples (I'm using the HaplotypeCaller from GATK). So I was wondering if you could please let me know what the problem might be? I'm using the sites.vcf file provided by somalier.

Many thanks

cross individual contamination estimation

Hi,

I have been using Conpair in the past for concordance and cross individual contamination for tumor-normal pairs. Since you are keeping track of the genotype and allele frequency you should be able to do a similar calculation for the contamination. The Conpair method uses the normal as a prior when calculation contamination in the tumor (order of the pairs matters), so that might complicate things. Maybe only do when groups are provided.

Conpair Paper

Feature request: option to ignore filters

Hello, I was testing on some files (gVCFs again) and I noticed I wasn't capturing all of the regions in a small custom-made sites file. After tinkering a bit, I ended up creating a gVCF file where all the variants were labeled "PASS" in the filter and that allowed somalier to correctly extract the missing variants. Of course, this is understandably the default behavior but I'm wondering if we can get an option that will allow somalier to extract those variants even if they are not labeled as "PASS".

Erroneous "index not found" errors

Brent: Great looking tool. I was able to get it to work occasionally.

When passing many samples via *.bam, and notably, only in multi-threaded mode, one or more of somalier's threads will report

index not found for:<bamfile.bam>

When run in single-thread mode it works (as far as I can tell, but is too slow when working with cohorts of hundreds for me to have waited for results yet). When run directly and only on two samples for which prior runs had given the error, it works fine, even in multithreaded mode. (with only limited testing, admittedly).

Could there be a data race somewhere?

relatedness calculation

The relatedness values you report are negative in some cases presumably because you provide an adjusted relatedness value. Is there more info on this somewhere or would it be possible to provide a little more context in terms of interpreting those values? Thanks - great tool!

nan values break output .html

One of the sample I used had low coverage (~0.5X) which caused a number of nan's in the html data, which broke it for a cohort of 500 samples.

{"sample":"I-H-134715-T2-1-D1-1","gt_depth_mean":0.0,"gt_depth_std":nan,"gt_depth_skew":nan,"depth_mean":0.1543086172344688,"depth_std":0.5164978344309251,"depth_skew":4.036674201975204,"ab_mean":0.0,"ab_std":nan,"ab_skew":nan,"pct_other_alleles":27.21822541966425,"n_hom_ref":0,"n_het":0,"n_hom_alt":0,"n_unknown":998,"n_known":0},

Add plot option to ancestry-predict.py

It would be nice to have an option to save the output plot into a file, for example by adding:

p.add_argument("--plot", help="output plot filename (format is infered from extension)")

and at the end of the script:

plt.savefig(args.plot)

Some grouped pairs in groups.csv shown as 'unrelated' in .html

In a cohort of 500 samples, some samples pairs were marked as unrelated despite being grouped in the grouped.csv file.

for example

grep 133675 somalier_groups_impact_independant.csv
I-H-133675-T3-1-D1-1,I-H-133675-N2-1-D1-1,I-H-133675-T4-1-D1-1,I-H-133675-T5-1-D1-1,I-H-133675-T6-1-D1-1,I-H-133675-T7-1-D1-1,I-H-133675-T8-1-D1-1,I-H-133675-T9-1-D1-1,I-H-133675-T10-1-D1-1,I-H-133675-N2-2-D1-1

Screen Shot 2019-03-27 at 4 44 34 PM

unspecified parent

given a pedigree with a missing parent like:

fam	mom	0	0	2	2
fam	kid1	0	mom	2	2
fam	kid2	0	mom	2	2

somalier will give an expected relatedness of 0.25 to kid1 and kid2. It's probably more common that they will have the same father and therefore have an expected relatedness of 0.5.

wrong header on relate pairs output

The pairs output from somalier relate seems to have an extra ibs2 in the header:

#sample_a       sample_b        relatedness     ibs0    ibs2    hom_concordance hets_a  hets_b  shared_hets     hom_alts_a      hom_alts_b
      shared_hom_alts ibs2    n       x_ibs0  x_ibs2  expected_relatedness

and should be:

#sample_a       sample_b        relatedness     ibs0    ibs2    hom_concordance hets_a  hets_b  shared_hets     hom_alts_a      hom_alts_b
      shared_hom_alts    n       x_ibs0  x_ibs2  expected_relatedness

unify markers for GRCh37, hg38

if the markers for hg38 and GRCh37 are the same, then we can QC across genome builds, which can be important.

  • do marker selection in one build
  • keep sites that can lift
  • have A and B allele rather than ref and alt.

headers for unprefixed genome builds have chr prefix

Hi. I am not sure if this is a problem for Somalier but I noticed that both sites.hg38.nochr.vcf.gz and sites.GRCh37.vcf.gz have unprefixed chromosome names for all the variants but the header still contains chr prefixed contig names. For example:

zgrep contig= sites.GRCh37.vcf.gz | head
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>

This might cause some confusion or issues with other workflows even it is compatible with Somalier.

site-selection with BS-Seq in mind

if the sites did not include reference C's, then this could be used for BS-Seq data.

currently, the sites file includes 38399 sites and 27978 of them are not C's and 30415 of them are not C to T.
So we could just distribute a sites file with no C/T variants.

Suggestion: use groups info (same individual) to infer expected relatedness

somalier relate has two command-line options to control relatedness between samples: --groups and --ped. From what I could see, the former assigns groups of samples from the same individual, while the second allows for specifying a pedigree to infer relatedness between different individuals.

However, the information on --groups is not used to infer pedigree relatedness. I mean, if I have this --groups file:

normal0,tumor0

and this --ped file:

FAM001  normal0  normal1   normal2     2       -9
FAM001  normal1  0         0           1       -9
FAM001  normal2  0         0           2       -9

the inferred relatedness will be:

normal0 / normal1 = 0.5
normal0 / normal2 = 0.5
normal1 / normal2 = -1
tumor0 / normal1 = -1
tumor0 / normal2 = -1

However, it would be very nice if the last two could also inferred to be 0.5, since they are the same individual as normal0.

Error: unhandled exception: Illegal parameter in singular value decomposition gesdd

I'm encountering an error from running the following command:

somalier ancestry --labels my-ancestry-labels-1kg.tsv 1000G_somalier/*.somalier ++ *.somalier

STDOUT looks like this:

** On entry to SGESDD parameter number 10 had an illegal value
 ** On entry to SGESDD parameter number 10 had an illegal value

STDERR looks like this:

somalier version: 0.2.10
[somalier] subset from 17384 to 0 high call-rate sites (removed 100.00%)
decomposition_lapack.nim(287) gesdd
Error: unhandled exception: Illegal parameter in singular value decomposition gesdd: 10 [ValueError]

Note, that I'm using 1000G files that I generated myself, not the ones that you provide.

This is my first time using somalier. Is there any advantage to running the command with all of the samples versus one sample at a time? i.e. when you give it the 1000G samples does it run PCA for the test samples, projected on the 1000G space, so you'd get the same result one sample at a time or as a group?

Thanks,
Eric

Feature Request: Support gVCF

I've been able to use somalier without issue on Dragen VCF files, but I've encountered a nondescript error when trying on the Dragen gVCF file for the same sample. Here's the message I'm getting:

[XXX@XXX ~]$ somalier extract -s sites.hg38.vcf.gz -f hg38.fa -d ~ [MASKED].hard-filtered.gvcf.gz
somalier version: 0.2.3
SIGABRT: Abnormal termination.

Not sure if this is a bug or a feature request.

ancestry

the ancestry program should subset the labeled (and query) samples by sites that are well-behaved in the query (and labeled) samples.
currently there is no subsetting.

Feature request: output per site info (e.g. genotype, depth)

Dear all,

would it be possible to get more detailed per-site info for QC? Right now somalier outputs only per sample and pairs of samples info.

There is already something similar on depthview, but it is very broad and only on HTML.
Would it be possible to get that info on a TSV also? Maybe reporting for each site (rows) and each individual (columns) the coverage for each allele as well as somalier's called genotype.

thanks,

subset unrelated samples for large cohorts

from #31

with > 2K samples, the html output becomes nearly unusable. but in large cohorts, nearly all samples will be unrelated. we can sub-sample pairs that are expected to be unrelated and appear unrelated by phenotype.

this will reduce the memory usage and make somalier html output useful for huge cohorts.
it will require a substantial change in the html as that's expecting the full matrix of all vs all. will need a sparse representation instead.

`ancestry-predict.py` error

Error when running ancestry-predict.py:

Traceback (most recent call last):
  File "ancestry-predict.py", line 128, in <module>
    alpha=0.15, c=[colors[i % len(colors)]], ec='none')
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/matplotlib/__init__.py", line 1810, in inner
    return func(ax, *args, **kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 4300, in scatter
    collection.update(kwargs)
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/matplotlib/artist.py", line 916, in update
    ret = [_update_property(self, k, v) for k, v in props.items()]
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/matplotlib/artist.py", line 916, in <listcomp>
    ret = [_update_property(self, k, v) for k, v in props.items()]
  File "/services/tools/anaconda3/4.4.0/lib/python3.6/site-packages/matplotlib/artist.py", line 912, in _update_property
    raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property ec

I think lines should be:

        axes[0].scatter(bg_reduced[sel, 0], bg_reduced[sel, 1], label=l, s=8,
                alpha=0.15, c=[colors[i % len(colors)]])

output % of split reads

the genotyping step on bams iterates over reads. we can count the number of split reads and total reads and samples with a lot of split reads often have some kind of contamination.

`[somalier] found 0 sites' error when extracting from array .vcf.gz

Hi,

Following up on the question I've posted on twitter. Briefly, my goal is to check the relatedness between around ten thousand WES samples (no variant calls yet, so somalier's "sketches" would be very helpful) and existing array genotyping datasets (between 5k and 25k samples). I'm using the provided b38 site file and hs38DH.fa. For the WES samples, the sketch extraction works smoothly and the relatedness ckeck produces sensible results.

However, I'm getting puzzling errors when extracting the sites from VCFs. With 0.2.3 it reports [somalier] found 0 sites (and produces zero variant .somalier files). With 0.2.4 it's SIGSEGV: Illegal storage access. (Attempt to read from nil?).

Following your suggestion to check the REFs and ALTs, I've made toy VCF with three variants matching the sites file.

In the sites:

chr1    33572614        .       T       C       .       PASS    AC=127807;AF=0.509354
chr1    33772682        .       T       C       .       PASS    AC=74131;AF=0.295319
chr1    36307804        .       T       G       .       PASS    AC=78047;AF=0.498041

In the VCF:

chr1	33572614	AX-11402824	T	C	.	.	PR	GT	1/1	0/1 ...
chr1	33772682	AX-11522782	T	C	.	.	PR	GT	0/1	0/1 ...
chr1	36307804	AX-11462332	T	G	.	.	PR	GT	0/1	1/1 ...

This looks like it should work, but produces the same 0 sites/illegal storage errors. I've also tried setting --sites and the extraction file to the three variant VCF itself: both versions report found 0 sites. Any suggestions what else to check?

Appreciate your help!

add option to scale ibs0 and ibs2 by n

the current plot outputs have IBS0 and IBS2, but these vary depending on number of sites with data from each sample. likely much more informative to use IBS0[i, j] / n[i, j] for samples i, j.

Illegals storage access. (Attempt to read from nil)

Hi Brent,

I'm trying to compare two VCFs (hg19) for relatedness and stumbled on your tool.

somalier extract -d extracted/ --sites /home/victor/commondata/sites.hg19.vcf.gz -f /home/victor/commondata/human_g1k_v37_decoy.fasta 071181/071181_xAtlas.recode.sorted.vcf.gz
somalier version: 0.2.10
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
[somalier] FORMAT field 'AD' not found for depth information. using genotype only
[somalier] found 10531 sites
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

any ideas how to resolve this? its probably a file that cannot be accessed similar to one ticked reported for slivar

Basic IBSX question

With somalier, we get the ibs0, ibs2, and n outputs in the *.pairs.tsv file. This is likely a dumb question, but is this a correct inference from that info?

ibs1 = n - ibs0 - ibs2

I'm assuming that "n" is the total number of variants that overlap between the samples (i.e. non-shared are filtered out) and is basically counting the total number of variants used by somalier at any point in the process. However, I couldn't find a description of each field in that file, so I wasn't sure if some other filtering took place that might make that inference incorrect.

release 0.2.1?

update styling?

also could clean up style.css and results.html from master.

gVCF v. VCF outputs

Hello,

I ran into a strange difference in output between gVCF and VCF runs of the same samples and was hoping to get some input into how it happens.

Background: I've been using strelka2 which outputs both a gVCF and VCF file per sample (in my tests, the samples were run individually). I then run somalier extract to pull out the variants and then do a somalier relate where the only difference is that I add a -u for the VCF version to account for 0/0 calls essentially being removed from the VCF.

Issue: When I look at the outputs, I get drastically different numbers isolated to the hom_alts related fields, enough that for particular samples it can drastically skew relatedness. Here's an example of the file using gVCF inputs and no -u in the relate (I've clipped it for brevity, first row are replicates, second row completely unrelated samples):

#sample_a	sample_b	relatedness	hom_concordance	hets_a	hets_b	shared_hets	hom_alts_a	hom_alts_b	shared_hom_altibs0	ibs2	n	x_ibs0	x_ibs2	expected_relatedness
SL362490	SL362491	0.999	0.999	6889	6885	6880	5679	5682	5676	0	17328	17328	0	359	-1.0
SL362490	SL409548	-0.004	-0.026	6889	6977	2929	5679	5694	2807	1477	7845	17316	66	162	-1.0

And here's the values from the same strelka2 runs, only using the VCFs and the -u option during the relate step:

#sample_a	sample_b	relatedness	hom_concordance	hets_a	hets_b	shared_hets	hom_alts_a	hom_alts_b	shared_hom_altibs0	ibs2	n	x_ibs0	x_ibs2	expected_relatedness
SL362490	SL362491	0.999	0.997	6889	6885	6880	1315	1317	1315	2	17368	17384	0	197	-1.0
SL362490	SL409548	0.274	-0.381	6889	6977	2929	1315	1363	541	521	8855	17384	0	42	-1.0

There's a huge jump in relatedness (-0.004 -> 0.274) and the obvious culprit is the change the hom related fields. I'm planning to simply use the gVCF files going forward, but I wasn't able to really explain why the output was so drastically different given that these are from the exact same runs of the software. Thoughts on how to proceed?

Distinguish between bam files having the same bam sample ID

Hi,

Thanks for developing this tool, which is very fast and easy to use.
I am trying to do WES WGS concordance check. One problem is that they may have the same bam sample ID in both the WES and WGS bam files. Could you add a feature to modify the bam sample ID in the extracted binary file? e.g., adding a prefix or suffix? I tried to change the file name but that does not seem to help when I run relate.

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.