Coder Social home page Coder Social logo

ngsrelate's Introduction

Build Status

30juni 2018. Added new version which does analysis from bcf/vcf and outputs all nine jacquards

This page refers to the new v2 of NgsRelate which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

Brief description

This page contains information about the program called NgsRelate, which can be used to infer relatedness, inbreeding coefficients and many other summary statistics for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population allele frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in example 1 below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start. As of version 2, VCF/BCF files can also be parsed.

Options

$ ./ngsRelate 

Usage main analyses: ./ngsrelate  [options] 
Options:
   -f <filename>       Name of file with frequencies
   -O <filename>       Output filename
   -L <INT>            Number of genomic sites. Must be provided if -f (allele frequency file) is NOT provided 
   -m <INTEGER>        model 0=normalEM 1=acceleratedEM
   -i <UINTEGER>       Maximum number of EM iterations
   -t <FLOAT>          Tolerance for breaking EM
   -r <FLOAT>          Seed for rand
   -R <chr:from-to>    Region for analysis (only for bcf)
   -g gfile            Name of genotypellh file
   -p <INT>            threads (default 4)
   -c <INT>            Should call genotypes instead?
   -s <INT>            Should you swich the freq with 1-freq?
   -F <INT>            Estimate inbreeding instead of estimating the nine jacquard coefficients
   -o <INT>            estimating the 3 jacquard coefficient, assumming no inbreeding
   -v <INT>            Verbose. print like per iteration
   -e <FLOAT>            Errorrates when calling genotypes?
   -a <INT>            First individual used for analysis? (zero offset)
   -b <INT>            Second individual used for analysis? (zero offset)
   -B <INT>            Number of bootstrap replicates for (only for single pairs)
   -N <INT>            How many times to start each pair with random seed?
   -n <INT>            Number of samples in glf.gz
   -l <INT>            minMaf or 1-Maf filter
   -z <INT>            Name of file with IDs (optional)
   -T <STRING>         For -h vcf use PL (default) or GT tag
   -A <STRING>         For -h vcf use allele frequency TAG e.g. AFngsrelate (default)
   -P <filename>       plink name of the binary plink file (excluding the .bed)

Or
 ./ngsrelate extract_freq_bim pos.glf.gz plink.bim plink.freq
Or
 ./ngsrelate extract_freq .mafs.gz .pos.glf.gz [-rmTrans]
Or
 ./ngsrelate -h my.bcf [DEVELOPMENT ONLY]

How to download and install

On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:

git clone --recursive https://github.com/SAMtools/htslib
git clone https://github.com/ANGSD/ngsRelate
cd htslib/;make -j2;cd ../ngsRelate;make HTSSRC=../htslib/

Run examples

Run example 1: using only NGS data

Below is an example of how NgsRelate can be used to coestimate relatedness and inbreeding from NGS data. Assume we have file (filelist) containing paths to 100 BAM/CRAM files; one line per BAM/CRAM file. Then we can use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling and in the end produce the the two input files needed for the NgsRelate program as follows:

### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq

### run NgsRelate
./ngsrelate  -g angsdput.glf.gz -n 100 -f freq  -O newres

The output should be a file called newres that contains the output for all pairs between six individuals.

Run example 2: using only VCF/BCF files

As of version 2, NgsRelate can parse BCF/VCF files using htslib with the following command:

./ngsrelate  -h my.VCF.gz -O vcf.res

By default, NgsRelate will estimate the allele frequencies using the individuals provided in the VCF files. Allele frequencies from the INFO field can used be used instead using -A TAG. The TAG usually take the form of AF or AF1 but can be set to anything. By default the PL data (Phred-scaled likelihoods for genotypes) is parsed, however, the called genotypes can also be used instead with the -T GT option. If called genotypes are being used, the software requires an additional argument (-c 1). If using -c 2, ngsRelate calls genotypes assuming hardy-weinberg.

Input file format

Genotype likelihood input

NgsRelate takes two files as input: a file with genotype likelihoods (-g) and a file with population allele frequencies (-f) for the sites there are genotype likelihoods for. The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed. The frequency file needs to contain a line per site with the allele frequency of the site in it.

NgsRelate also calculates a few summary statistics based on the 2dsfs instead of known population allele frequencies. Even if the population allele frequencies are not available, these summary statistics can still be estimated with NgsRelate by providing it with the number of sites (-L) instead of the allele frequency file (-f).

VCF input

NgsRelate takes a standard VCF file generated with e.g. bcftools. By default, NgsRelate will estimate the allele frequencies using the individuals provided in the VCF files. As shown in example 2, external allele frequencies can be provided with -f. These frequencies will overwrite the ones estimated from the VCF file.

Output format

a  b  nSites  J9        J8        J7        J6        J5        J4        J3        J2        J1        rab       Fa        Fb        theta     inbred_relatedness_1_2  inbred_relatedness_2_1  fraternity  identity  zygosity  2of3IDB   FDiff      loglh           nIter  coverage  2dsfs                                                                             R0        R1        KING       2dsfs_loglike   2dsfsf_niter
0  1  99927   0.384487  0.360978  0.001416  0.178610  0.071681  0.000617  0.002172  0.000034  0.000005  0.237300  0.002828  0.250330  0.127884  0.001091                0.035846                0.001451    0.000005  0.001456  0.345411  -0.088997  -341223.335664  103    0.999270  0.154920,0.087526,0.038724,0.143087,0.155155,0.139345,0.038473,0.087632,0.155138  0.497548  0.290124  0.000991   -356967.175857  7

The first two columns contain indices of the two individuals used for the analysis. The third column is the number of genomic sites considered. The following nine columns are the maximum likelihood (ML) estimates of the nine jacquard coefficients, where K0==J9; K1==J8; K2==J7 in absence of inbreeding. Based on these Jacquard coefficients, NgsRelate calculates 11 summary statistics:

  1. rab is the pairwise relatedness (J1+J7+0.75*(J3+J5)+.5*J8) Hedrick et al
  2. Fa is the inbreeding coefficient of individual a J1+J2+J3+J4 Jacquard
  3. Fb is the inbreeding coefficient of individual b J1+J2+J5+J6 Jacquard
  4. theta is the coefficient of kinship J1 + 0.5*(J3+J5+J7) + 0.25*J8) Jacquard
  5. inbred_relatedness_1_2 J1+0.5*J3 Ackerman et al
  6. inbred_relatedness_2_1 J1+0.5*J5 Ackerman et al
  7. fraternity J2+J7 Ackerman et al
  8. identity J1 Ackerman et al
  9. zygosity J1+J2+J7 Ackerman et al
  10. Two-out-of-three IBD J1+J2+J3+J5+J7+0.5*(J4+J6+J8) Miklos csuros
  11. Inbreeding difference 0.5*(J4-J6) Miklos csuros
  12. the log-likelihood of the ML estimate.
  13. number of EM iterations. If a -1 is displayed. A boundary estimate had a higher likelihood.
  14. If differs from -1, a boundary estimate had a higher likelihood. Reported loglikelihood should be highly similar to the corresponding value reported in loglh
  15. fraction of sites used for the ML estimate

The remaining columns relate to statistics based on a 2D-SFS.

  1. 2dsfs estimates using the same methodology as implemented in realSFS, see ANGSD
  2. R0 Waples et al
  3. R1 Waples et al
  4. KING Waples et al
  5. the log-likelihood of the 2dsfs estimate.
  6. number of iterations for 2dsfs estimate

You can also input a file with the IDs of the individuals (on ID per line), using the -z option, in the same order as in the file filelist used to make the genotype likelihoods or the VCF file. If you do the output will also contain these IDs as column 3 and 4.

Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if these have the highest likelihood).

Help and additional options

To get help and a list of all options simply type

./ngsrelate

Note that for the new version of ngsRelate it is not necessary to flip the allele frequencies if the input allele frequencies are for the minor allele.

running version 1 of NgsRelate

To run the old version (V1) that assumes both individuals to be out-bred, just add -o 1 to your command. This will force Jacqaurd coefficient 1-6 (the coefficients related to inbreeding) to zero.

./ngsrelate -o 1

Test for inbreeding only

To only estimate inbreeding, use the following command

./ngsrelate -F 1

Citing and references

Changelog

Contacts

[email protected], [email protected], and [email protected]

Testdataset

bcftools call -v -m test.bcf -Ob >small.bcf
bcftools view small.bcf -Oz -o small.vcf.gz -m2 -M2 -v snps
plink --vcf small.vcf.gz --make-bed --out small 
bgzip small.bed
md5sum small.* >small.md5

check glf file in R

a <- file("test.glf", "rb")
nobs <- 1e8 # large number 3*N*Nsites
N <- 6 # number of individuals
mat <- matrix(readBin(a, "double", nobs ), byrow=T, ncol=3*N)
con <- file("outfile.glf", "wb")
writeBin(c(t(mat)), con=con)
close(con)

ngsrelate's People

Contributors

aalbrechtsen avatar angsd avatar idamoltke avatar khanghoj avatar nicklohr avatar stephenturner avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

ngsrelate's Issues

Interpreting the inbreeding only Output.

Hello,

I am analyzing a population of 72 individuals, 18 adults and 54 of their offspring, and I am trying to determine of there was any inbreeding before moving on to the rest of my analysis. I did the inbreeding-only option "-F 1", but I am a little confused about how to interpret the results. Are Z0 and Z1 values of 1 and 0 indicative of no inbreeding? If so, what do numbers other than those mean? An explanation of the other values would also be appreciated.

Thanks

How does NgsRelate handle "missing data?"

Hi, I was wondering how NgsRelate handles missing data. I am interested in the pairwise SFS's that NgsRelate creates. In the beagle file input, there may be some loci where both individuals of a pair have missing data, i.e. that the genotype likelihoods are 0.33/0.33/0.33 for both individuals.
How does this influence the result? Could missing data end up making individuals look more similar to each other (because they have the "same" genotype at many places) than they would if they had higher coverage?
NgsDist has an option "--pairwise_del: pairwise deletion of missing data." to get rid of sites where the three genotypes are the same. Is there something similar that can be done with NgsRelate?
Thanks!
-Teresa

Make option to print IDs

Hej Thorfinn. Note så vi husker det: Det ville være et hit at kunne få den til også at printe individ IDs (hvis brugeren giver dem) :) Ida

Problem with guess in emStep

Hello
I am trying to bootstrap my relatedness estimations.
I am using the command: ~/software/NgsRelate/ngsRelate -f freq -g angsdput.glf.gz -O test -L 521328 -n 81 -a ind1 -b ind2 -B 100

For most of the files (I am looping over ~200 individual pair combinations which are a subset of the whole callset contained in the .glf and .maf files) I get this error:

Problem with guess in emStep: -nan -nan -nan -nan -nan -nan -nan -nan -nan

I get the error when I loop over a list, and also when I specify individuals as a single command.

For 5 of the comparisons it seems to work fine (estimation has been made 101 times, as I guess the bootstrapping is zero indexed...). Can't work out what is going on.

Please could I trouble you for some assistance?

I have attached a Drive link to a compressed folder with the data and the analysis script (bash wrapper):
https://drive.google.com/open?id=1eraeCDVdOa-idliPg00TgMDkeRLM8zTz

Thank you!

Tristan

Problem with guess in emStep: -nan -nan -nan -nan

I am receiving the following error when I run ngsRelate with the glf.gz file as input:

Problem with guess in emStep: -nan -nan -nan -nan

The command that I am using is:
./ngsRelate -g ~/genotyping/angsdmPCR_noMAF.glf.gz -p 10 -n 284 -f ~/genotyping/freqmPCR_noMAF -O res_bams_mPCR_noMAF

I have attached a copy of the glf file. Please let me know if you have any idea what the issue could be.

Many thanks

angsdmPCR_noMAF.glf.gz

Coverage

Hello,

NGsrelate provides as output a "coverage" statistic.
Could you please tell me more about how it's calculated?
Seems like this is the proportion of sites sequenced in the pair of sample?

Thanks!

Muriel

please describe the new output

Hi folks - New ngsRelate version from June 2018 generates a lot more output than the previous one - looks super fun but I could not find any description of what all that stuff is... thanks a lot in advance!
Misha

nIter = -1? Actually bestoptimll = -1?

I think ngsrelate 2 may've been updated but the tutorial has not. The tutorial talks about why "nIter" might = -1, but in my results, I see a lot of "bestoptiml" = -1 results.

If there are a ton of -1 results, does that mean anything about the relatedness results? Is there any reason to change the -i or -t options to extent the number of iterations or restrict the tolerance? For that matter, what are the default values for those options? Thanks!

verbose output

Thank you for continuing to update the program! I was exploring some parameters, and I found I couldn't get the verbose option to work. Using -v gives a message of "invalid option" so it would be great if you could check that. I also tried to add an integer (-v 1) but it didn't produced the same message.

Thanks!
Nathan

extract_freq_bim error

Hi,

I would like to use NgsRelate on a set of 8 samples, with allele frequencies from a population in PLINK format.
I used the latest github package, and followed guidelines reported here http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694#Run_example_2:_using_NGS_data_with_population_frequencies_estimated_from_genetic_data_from_PLINK_files

This is the last command line I launched
/home/bin/NgsRelate/ngsRelate extract_freq_bim angsd_out.glf.pos.gz plink.out.bim plink.out.frq > freq

I got this error message

posfile:angsd_out.glf.pos.gz bimfile:plink.out.bim ffile:plink.out.frq
ngsRelate: filereaders.cpp:212: posMap getBim(char*, char*): Assertion `rMap.find(rs)==rMap.end()' failed.
Aborted (core dumped) 

Could you please help me understanding what's wrong?
Thanks for your attention
Best wishes,
Maria Angela

Problem with guess in emStep: 0.350616 0.350616 ... (unrelated to previous posts)

Hi,

I am trying to run ngsRelate with a population of 6 ind. using the ANGSD -doGlf3 output, i.e. the glf.gz and mafs.gz files.

Unfortunately, when running
zcat WES_All.mafs.gz | cut -f6 | sed 1d > freq
ngsRelate -g WES_All.glf.gz -n 6 -f freq -O WES.ngsRelate

OR

ngsRelate -g WES_All.glf.gz -n 6 -L 8397285 -O WES.ngsRelate

I receive the output (or something along the lines of):

-> Seed is: 1908108041
-> Frequency file: 'freq' contain 8397285 number of sites
-> nind:6 overall_number_of_sites:8397285
-> Done reading data from file: 10.68 11.00
-> Starting analysis now
-> length of joblist:15
Problem with guess in emStep: 0.351857 0.351857 0.351857 0.351857 0.351857 0.351857 0.351857 0.351857 0.351857

Here, the values are all the same and not "nan" or "0", as previous users have reported.

Also, when I run the first of the above commands with -F 1, everything completes

-> Seed is: 103441962
-> Frequency file: 'freq' contain 8397285 number of sites
-> nind:6 overall_number_of_sites:8397285
-> Done reading data from file: 10.91 11.00
-> Starting analysis now
-> length of joblist:6
[ALL done] cpu-time used = 68.76 sec (filereading took: 10.91 sec)
[ALL done] walltime used = 41.00 sec (filereading took: 11.00 sec)

but all inbreeding values are 0.

Ind Z=0 Z=1 loglh nIter coverage sites
5 1.000000 0.000000 -6179642.753674 22 0.999557 8393569
4 1.000000 0.000000 -6080941.970638 -1 0.999600 8393929
0 1.000000 0.000000 -6069533.186939 -1 0.999605 8393972
1 1.000000 0.000000 -6159287.329214 -1 0.999566 8393640
2 1.000000 0.000000 -6116124.318296 -1 0.999586 8393811
3 1.000000 0.000000 -6105983.497060 -1 0.999584 8393792

I was wondering what is going wrong here (don't believe in a bug ...) or whether you have encountered such results before.

Best

Interpretation of relatedness (rab) coefficient

Hi,

I used your software following your guidelines. I also used the R script you provided to plot results. Please find attached the relatedness plot (these results were quite expected from a previous analysis).
NgsRelate_relatedness.pdf

Could you please tell me if the first range on the right, from 0.5 to 1.0, corresponds to first degree relationship or monozygotic twins?

Thank you.
Best,
Maria Angela

ngsrelate memory-issue?

Hello,
I try to run ngsrelate on a some large files and cannot get the run complete due to RAM overflow.

files:
73G test_gwas.glf.gz
10G freq

command:

/home/mmoser/NgsRelate/ngsRelate -g test_gwas.glf.gz -n 22 -f freq > gl.res

error:

 -> Frequency file: 'freq' contain 1177564412 number of sites
/opt/sge/default/spool/binfservas08/job_scripts/283373: line 3: 124295 Killed                  /home/mmoser/NgsRelate/ngsRelate -g test_gwas.glf.gz -n 22 -f freq > gl.res

Is there anyway to minimize memory usage for ngsrelated?

Thanks,
Michel

Different number of sites assumed vs. read

I'm running ngsrelate using the code below and get an error regarding the number of sites assumed vs read (also below). I ran angsd on regions of chromosomes and merged them (this is a subset of the merged files). Can you please advise?

zcat *.mafs.gz | cut -f5 | sed 1d > freq_sub
zcat *.glf.gz | gzip > sub.glf.gz
ngsRelate -g sub.glf.gz -n 192 -f freq_sub -O genotype_res

Seed is: 776063293
Frequency file: 'freq_sub' contain 5023472 number of sites
nsites: 5023472 assumed but 5023224 read

unexpected high theta values

Hi,

I am testing NgsRelate2 on some related samples from the 1000 Genomes Project, specifying allele frequencies from European 1000 Genomes population. It worked very well.
Then I generated simulated samples with a coverage ranging from 1 to 5X and in all cases only KING was able to correctly predict the degree of relationship. On the other side, theta values resulted very high (ranging from 0.76 to 0.82), suggesting identical individuals.
Could you please help me to understand the reason I am getting these results?

Thank you,
BW

Maria Angela

autocorrelation?

Hello and thank you for this nifty program! I'm using it to determine relatedness among three sympatric populations (within ~5mi radius) across two years and find an extremely high level of relatedness among all individuals. I'm using the r0, r1 and king-robust values for visualization but I'm concerned at the seemingly perfect correlation between these values among the individuals. Below is the angsd code used to generate the likelihoods. The reference genome is large at >2Gb, so I mapped the low coverage reads (1-3x) to the first half of the genome.

#map fastqs to reference:
bwa-mem2 mem -t32 noCont.fasta "$in1" "$in2" | samtools view -bS - | samtools sort - > sample."$z".bam

# did not mark duplicates, could this drive the difference?
#angsd.
angsd -b 176.temp -gl 2 -nInd 176 -minInd 158 -doCounts 1 -setMaxDepth 1580 -rf chrom5.rf.list -setMinDepth 320 -setMinDepthInd 1 -setMaxDepthInd 10 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3 -nThreads 8 -out chrom5_90pc
#ngsrelate.
ngsRelate -g chrom5_90pc.glf.gz -n 176 -z 176.20201.list -f 179_chrom5.freq -O 179_chrom5_out

pecan_related20201

PL field in VCF

Hi,

I would like to apply NgsRelate to a vcf obtained with ATLAS (task=call method=MLE). I am able to process this vcf with bcftools and the analysis by NgsRelate using GT tag works. However, I would like to test it parsing PL, but I got this error

Problem with guess in emStep: nan nan nan nan nan nan nan nan nan

This is how PL is reported in my vcf

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled normalized genotype likelihoods">
1 10234 . C T . . DP=3;AC=2;AN=6 GT:GQ:AD:GL:PL:AB:AI ./.:.:.:.:.:.:. 1/1:3:0,1:-10,-0.30103,0:103,3,0:0:0.5 0/0:3:1,0:0,-0.300987,-4:0,3,40:1:0.5

Could you please tell me if this is the correct format required by NgsRelate?
Thanks for your help.

BW,
Maria Angela

Something wrong with freq

Hi,

I am trying to run the following command: ./ngsRelate -g der.glf.gz -n 32 -f derfreq -O masters

I have followed the instructions from angsd to convert the necessary files and installed ngsRelate, however, when I run the command I get the following error:

-> Seed is: 1469169591

something is wrong with derfreq

There is no other information provided. Why would I be getting this error?
Thank you

Problem with guess in emStep: -nan

Hi, I've seen this issue come up several times but none of the current responses have seemed relevant to fixing mine. I'm running from a angsd output on one chromosome across 548 individuals.

runing ngsrelate
        -> Seed is: 1858380949
        -> Frequency file: '.../relatedness_scaf8_autosomes_freq' contain 206165 number of sites
        -> nind:548 overall_number_of_sites:206165
        -> Done reading data from file: 12.78 13.00
        -> Starting analysis now
        -> length of joblist:149878
        ->Sites with both 129 and 268 having data: 0
Problem with guess in emStep:  -nan  -nan  -nan  -nan  -nan  -nan  -nan  -nan  -nan

My command is
ngsRelate -g ${angsd_dir}/${prefix}.rad.${loc}_${prefix}.glf.gz -f $angsd_dir/${prefix}_${loc}_freq -p $nt -n 548 -O $angsd_dir/${prefix}_${loc}_newres

Please let me know what other information you require as I am unsure where to start, or where to send my freq and glf file, as it is confidential.

Thank you so much!

Runtime

Hello,

I am running the following using a vcf file output from ANGSD (output was bcf and converted to vcf.gz). AF was calculated from ANGSD.

I have ~11 million sites total, 7 million above 1% MAF, and 6 million above 5% MAF.

NgsRelate-2.0/ngsRelate -h GPs/Filt6b_Post.vcf.gz -A AF -O Filt6b_Post.vcf.res

What would you recommend in order to have a reasonable runtime?

[Question] Merging ANGSD Outputs

Hi everyone,

Firstly, thanks for this software, it works amazingly.

And to my question:
I was wondering if it was possible to merge GLF files from two different ANGSD runs, with identical settings, and then run this merged GLF through NGSRelate?

I want to do this because it seems inefficient and costly ( running in cloud) to re-run ~100 samples plus whatever new dataset I want to check relatedness with.

The ANGSD command in question.

./angsd -b Embryos.beagle-test.bamslist -out BeagleTest-embryos -gl 2 -domajorminor 3 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 2 -nThreads 6 -nLines 140 -checkBamHeaders 0 -sites 1240K_hg38_sites.txt 

Let's say I have two sets of data.

References.glf.gz
References.mafs.gz

NewSamples.glf.gz
NewSamples.mafs.gz

Is this as trivial as:

zcat References.glf.gz NewSamples.glf.gz .... | gzip - >merge.glf.gz

Can I avoid the frequency files by just passing -L ?
Also, I will be testing this with the beagle format...

Any feedback would be much appreciated!

Issue with glf file - too many sites

Hi,
I am trying to make a kinship matrix from 47 individuals. I am running the following code:

angsd -b bams_qst_adults -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3 -out kinMAT
zcat kinMAT.mafs.gz | cut -f5 |sed 1d > freq
./ngsRelate -g kinMAT.glf.gz -n 47 -f freq -O kinMAT

Which produces the following error file:

./ngsRelate -g kinMAT.glf.gz -n 47 -f freq -O kinMAT
-> Seed is: 1211251308
-> Frequency file: 'freq' contain 139458 number of sites
-> Too many sites in glf file. Looks out of sync, or make sure you supplied correct number of individuals (-n)
-> Or that the number of sites provided (-L) it is correct

I double checked that my file list contains 47 individuals. Any assistance you can provide would be greatly appreciated.
Thanks!

elevated KING relatedness

Hello,

I was looking at my out file from ngsrelate. After looking at my KING values, more than 3/4 of my samples are half-sib or more. I plotted R0-R1 and R1-KING.

Screen Shot 2023-07-27 at 9 31 03 PM Screen Shot 2023-07-27 at 9 32 06 PM

they look very strange to me. When I ran angsd I ended up with my MAF frequencies in column 6 (below). Publications that use ngsrelate don't do a good job about their parameters or what value of relatedness they are using. I was wondering if you had any insight into my problem? I think I might be running the script incorrectly, but have no way of testing it.

I know this isn't an issue with the program, but appreciate any advice.

--Josh Hallas

ANGSD script:

angsd -nThreads 16 -bam md.list -ref finalmuledeergenome_filtered.fasta -out md_jul18 \
      -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -C 50 -baq 1 \
      -minMapQ 20 -minQ 20 -minInd 74 -setMaxDepth 400 -doCounts 1 \
      -GL 2 -doGlf 2 -doMajorMinor 1 -doMaf 1 -minMaf 0.05 -SNP_pval 1e-6 \
      -doGeno 16 -doPost 1 -doDepth 1 -dumpCounts 2

MAF file:

chromo  position        major   minor   ref     knownEM pK-EM   nInd
HiCscaffold1pilon       53902   G       A       G       0.087058        0.000000e+00    160
HiCscaffold1pilon       53926   A       G       A       0.112653        0.000000e+00    160
HiCscaffold1pilon       104909  C       T       C       0.108280        0.000000e+00    160
HiCscaffold1pilon       105737  C       T       C       0.089310        0.000000e+00    160
HiCscaffold1pilon       105746  C       T       C       0.167795        0.000000e+00    160
HiCscaffold1pilon       105753  G       T       G       0.110993        0.000000e+00    160

ngsRelate script:

NGSRELATE -G md_jul18.beagle.gz -f relatedness_md.freq -z relatedness_sample_ID.list -n 160 -O out.ld -p 4 -i 50000

Problem reading full chunk error

Hi,

I am using ngsRelate version 2, and getting an error that hasn't already been raised in issues. I followed the template on the GitHub page. My code:

angsd -b samples4relatedness.txt -out angsd4relatedness -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3 -P 15

zcat angsd4relatedness.mafs.gz | cut -f5 | sed 1d > freq.txt

ngsRelate -g angsd4relatedness.glf.gz -n 100 -p 15 -f freq.txt -O output.res

I get the following error:

        -> Frequency file: 'freq.txt' contain 1458673 number of sites
        -> Problem reading full chunk

Thanks in advance,
Adam

PCAngsd didn't converge

ood.discovery.neu.edu_pun_sys_dashboard_files_fs__work_seedpod_test_pcangsd_41265911.out.pdf
sbatch script that I'm using!! echo "Job started at: $(date)"

module load singularity/3.10.3

Define paths

DATA_DIR="/work/seedpod/output/angsd_spal2/angsd_result.beagle.gz"
OUT_PREFIX="/work/seedpod/output/pcangsd/pcangsd_spal"
SINGULARITY_IMAGE="/work/seedpod/test/seedpod_latest.sif"

echo "Running pcangsd inside Singularity container..."
singularity exec -B "/work:/work" $SINGULARITY_IMAGE /opt/miniconda/bin/pcangsd -b $DATA_DIR -o $OUT_PREFIX -t 32 -e 1 --maf 0.01 --iter 3000 --tole 1e-7

echo "Job finished at: $(date)"

Output file:
Job started at: Mon Mar 4 13:00:36 EST 2024
Running pcangsd inside Singularity container...

PCAngsd v1.21
Jonas Meisner and Anders Albrechtsen.
Using 32 thread(s).

Parsing Beagle file.
Loaded 730243 sites and 374 individuals.
Estimating minor allele frequencies.
EM (MAF) converged at iteration: 213
Number of sites after MAF filtering (0.01): 730243
Estimating covariance matrix.
Using 1 principal components (manually selected).
Individual allele frequencies estimated (1).
Individual allele frequencies estimated (2). RMSE=0.103376299
Individual allele frequencies estimated (3). RMSE=0.103394873
Individual allele frequencies estimated (4). RMSE=0.001676962
Individual allele frequencies estimated (5). RMSE=0.001336438
Individual allele frequencies estimated (6). RMSE=0.001089854
Individual allele frequencies estimated (7). RMSE=0.000904074
Individual allele frequencies estimated (8). RMSE=0.00076122
.
.
.
Individual allele frequencies estimated (2996). RMSE=6.68e-07
Individual allele frequencies estimated (2997). RMSE=6.48e-07
Individual allele frequencies estimated (2998). RMSE=6.67e-07
Individual allele frequencies estimated (2999). RMSE=6.62e-07
Individual allele frequencies estimated (3000). RMSE=6.12e-07
Individual allele frequencies estimated (3001). RMSE=6.31e-07
PCAngsd did not converge!
Saved covariance matrix as /work/seedpod/output/pcangsd/pcangsd_spal.cov

Total elapsed time: 74m53s
Job finished at: Mon Mar 4 14:15:30 EST 2024

possible to convert beagle to binary beagle (doGlf 2 to doGlf 3) or use beagle file as input?

Hello!
I am using NgsRelate as part of a large project, and I was wondering if there is a way for me to use the beagle files I have already generated rather than running ANGSD again just to get a "binary beagle" (output of -doGlf 3) file for NgsRelate input. The other things I am doing tend to use the beagle output from -doGlf 2, so I already have these made.

I know I can just have ANGSD calculate genotype likelihoods again and output them in the -doGlf 3 format, but my project uses many species with full whole genomes and this will add up to a lot of time and computation to repeat calculations that have already been done.

Based on the NgsRelate documentation it seems like the "binary beagle" format is the only one that will work, is that correct?

If that's the case, is there a way to convert a beagle file (output of -doGlf 2 in ANGSD) to a binary beagle file (output of -doGlf 3 in ANGSD)? The two file formats carry the same information, it's just that one is binary, right? (As far as I can tell from the ANGSD documentation about the -doGlf 3 option, which is very brief).

Thanks!
-Teresa

minor request to update readme

Hi,
Thank you for maintaining this software. In my first attempt at running the program I received an error when trying to use a .vcf file. After searching through the issues posts I came across this thread which resolved my problem: if invoking the -T parameter to read from the GT field of the .vcf file, you also need to invoke the -c parameter.
Perhaps I'm mistaken with these parameters, but if this is in fact true, it would be helpful to make a minor update to the readme/documentation when using vcf files to remind users to use those flags (if needed). Maybe something like this?:

./ngsrelate  -h my.VCF.gz -O vcf.res
./ngsrelate  -h my.VCF.gz -O vcf.res -T GT -c    ## if specifying values from GT field in .vcf

Cheers,
Devon

running program with -L information, without MAF info

Hi Kristian,

Similar to issue #20 I'm having a related problem getting ngsRelate to run with a beagle file as the input data type. When I run:

./ngsRelate -g my.beagle.gz -n 184 -L 1831344 -O myluNGSrelate.res -p 22

I receive the following error:

        -> Seed is: 1972650776
        -> Allele frequencies file (-f) is not provided. Only summary statistitics based on 2dsfs will be reported
        -> Problem reading full chunk

Unlike the earlier post, I didn't import the mafs.gz information from the ANGSD output - I was just supplying the number of sites with the -L argument. There are 555 columns in the beagle.gz file, which I believe is what I'd expect with 184 individuals. Just to confirm: the value I used as input in the -L argument (1831344) represents the number of sites, which should be one less than the total number of lines in the .beagle.gz file (1831345), correct?

The same .beagle file has been used in other applications in the ANGSD family (ngsAdmix and PCAngsd), so I do not suspect that the file is corrupt.

Thanks for any troubleshooting advice you can offer.

Devon

"Problem with guess in emStep"

Hello,

I ran the following command:

/usr/local/softw/NgsRelate/ngsRelate -h infile.vcf -T GT -O play.res

Got the following output:

-> Seed is: 101258788
-> Will use TAG: 'GT' from the VCF file
-> Will use TAG: 'AFngsrelate' in the VCF file as allele frequency if present. Otherwise allele frequencies are estimated from the data
-> readbcfvcf seek:(null) nind:2
-> [file='1000G_CEU_MAF0.45_200K_merged_CEPH.vcf'][chr='(null)'] Read 82626 records 82437 of which were SNPs number of sites with data:82437
-> nind:104 overall_number_of_sites:82437
-> Done reading data from file: 2.38 3.00
-> Starting analysis now
-> length of joblist:5356

Problem with guess in emStep: 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

I looked at the sourcecode where the error is but could not figure out the cause. Can you please help?

Wrong results?

I ran this command:
/usr/local/softw/NgsRelate/ngsRelate -h input.vcf -T GT -O results.res -c 1

Here is the relevant part of the result

a b nSites J9 J8 J7 J6 J5 J4 J3 J2 J1 rab Fa Fb theta inbred_relatedness_1_2 inbred_relatedness_2_1 fraternity identity zygosity 2of3_IDB FDiff loglh nIter coverage 2dsfs R0 R1 KING 2dsfs_loglike 2dsfsf_niter

100 101 26688 0.154319 0.125469 0.000000 0.016806 0.092420 0.000582 0.102905 0.000000 0.507499 0.716727 0.610986 0.616725 0.636528 0.558951 0.553709 0.000000 0.507499 0.507499 0.774252 -0.008112 -44505.357314 5000 0.323738 5.113872e-01,5.571375e-02,1.177190e-02,5.602682e-02,6.669632e-02,6.594131e-02,1.065887e-02,6.341072e-02,1.583932e-01 0.336312 0.253095 0.058306 -42959.275380 10

indiv.s 100 and 101 are siblings. Theoretically, for a sibling pair I would expect k0=0.25, k1=0.5, k2=0.25

However that is not what I see. Similarly for theta, which should be 0.25 but calculated as 0.636528.

Can you please help?

I tried -i 10000 but didn't help.

Thanks a lot.

rmTrans option

I'm quite interested in using the remove transitions option mentioned in the options list, but I cannot seem to get it to work. I tried various configurations, e.g. :

./ngsrelate extract_freq .mafs.gz .pos.glf.gz -rmTrans
./ngsrelate extract_freq .mafs.gz .pos.glf.gz -rmTrans 1
./ngsrelate -f .freq -g .pos.glf.gz -rmTrans
./ngsrelate -f .freq -g .pos.glf.gz -rmTrans 1

However, they all appear give the same results as without -rmTran (e.g. identical number of sites per pairwise comparison and essentially identical values for metrics). Have I missed something? I can rerun ANGSD with -rmTrans, but it would be quite handy to include this option for ease.

Cheers,
Nathan

Bad results starting from beagle files?

Hello,
I am trying to run ngsRelate starting from a beagle file (option -G) obtained with ANGSD and a bit more than 14 M SNPs.
Here is the head of the beagle (first 8 SNPs with 10 individuals)
marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 Ind2 Ind2 Ind2 Ind3 Ind3 Ind3 Ind4 Ind4 Ind4 Ind5 Ind5 Ind5 Ind6 Ind6 Ind6 Ind7 Ind7 Ind7 Ind8 Ind8 Ind8 Ind9 Ind9 Ind9 pcla8_s000007_103 1 0 0.999494 0.000506 0.000000 0.000708 0.999292 0.000000 0.000000 1.000000 0.000000 0.999491 0.000509 0.000000 0.999992 0.000008 0.000000 0.004439 0.995561 0.000000 0.999871 0.000129 0.000000 0.000000 1.000000 0.000000 0.998993 0.001007 0.000000 0.999746 0.000254 0.000000 pcla8_s000007_131 0 3 0.999746 0.000254 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.999873 0.000127 0.000000 0.999998 0.000002 0.000000 0.000060 0.999940 0.000000 0.999491 0.000509 0.000000 0.000000 1.000000 0.000000 0.999489 0.000511 0.000000 0.999746 0.000254 0.000000 pcla8_s000007_140 3 1 0.999936 0.000064 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.999747 0.000253 0.000000 0.999984 0.000016 0.000000 0.000000 1.000000 0.000000 0.998991 0.001009 0.000000 0.000000 1.000000 0.000000 0.984350 0.015650 0.000000 0.999873 0.000127 0.000000 pcla8_s000007_224 0 3 0.998000 0.002000 0.000000 0.997995 0.002005 0.000000 0.999968 0.000032 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000005 0.999995 0.000000 0.998000 0.002000 0.000000 0.000000 0.999998 0.000002 0.000000 0.000506 0.999494 pcla8_s000007_248 1 0 0.969147 0.030853 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.999992 0.000008 0.000000 0.999998 0.000002 0.000000 0.000000 1.000000 0.000000 0.995953 0.004047 0.000000 0.000038 0.999962 0.000000 0.995985 0.004015 0.000000 0.999490 0.000510 0.000000 pcla8_s000007_257 1 2 0.984251 0.015749 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.999984 0.000016 0.000000 0.999984 0.000016 0.000000 0.000001 0.999999 0.000000 0.995953 0.004047 0.000000 0.017529 0.982471 0.000000 0.995985 0.004015 0.000000 0.999489 0.000511 0.000000 pcla8_s000007_258 0 1 0.992033 0.007967 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.999984 0.000016 0.000000 0.999984 0.000016 0.000000 0.000000 1.000000 0.000000 0.995981 0.004019 0.000000 0.017524 0.982476 0.000000 0.995989 0.004011 0.000000 0.999492 0.000508 0.000000 pcla8_s000007_275 2 3 0.984213 0.015787 0.000000 0.000006 0.999994 0.000000 0.000000 1.000000 0.000000 0.998990 0.001010 0.000000 0.999984 0.000016 0.000000 0.000000 1.000000 0.000000 0.991990 0.008010 0.000000 0.001770 0.998230 0.000000 0.999743 0.000257 0.000000 0.999872 0.000128 0.000000

here is the freq file obtained from .mafs.gz for the same sites
0.199546 0.200035 0.200432 0.350231 0.201001 0.198874 0.198675 0.200459

The log file looks like -> Seed is: 396728323 -> Frequency file: '/mnt/lustre/scratch/jbledoux/ANGSD/25-10-2022/Altare/Altare_freq' contain 14460747 number of sites -> Beagle - Reading from: /mnt/lustre/scratch/jbledoux/ANGSD/07-10-2022/Altare/1_Altare_SNP-no-missing_filtering-SITES.beagle.gz. Assuming 10 Ind and 14460747 sites -> Beagle - done processing 14460747 sites -> nind:10 overall_number_of_sites:14460747 -> Done reading data from file: 119.44 120.00 -> Starting analysis now -> length of joblist:10 [ALL done] cpu-time used = 232.02 sec (filereading took: 119.44 sec) [ALL done] walltime used = 166.00 sec (filereading took: 120.00 sec)

and results are for instance when estimating inbreeding only Ind Z=0 Z=1 loglh nIter coverage sites 0 1.000000 0.000000 -7134469.657723 -1 0.686355 9925209 1 1.000000 0.000000 -7033090.765785 -1 0.683340 9881608 2 1.000000 0.000000 -7169492.084132 -1 0.692961 10020737 3 1.000000 0.000000 -7247286.059557 -1 0.696883 10077449 4 1.000000 0.000000 -7132969.390517 21 0.694411 10041702 5 1.000000 0.000000 -7292105.232651 18 0.697278 10083157 6 1.000000 0.000000 -7451324.151384 -1 0.693961 10035196 7 1.000000 0.000000 -7224298.675603 -1 0.692677 10016624 8 1.000000 0.000000 -7045421.842850 -1 0.685920 9918913 9 1.000000 0.000000 -7471602.681979 -1 0.693180 10023897

It looks like the job does not start and I can not figure out why? I make test and everything seem ok.
Any suggestion is very welcome.
Thank you
Jean-Baptiste

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.