gregor-mendel-institute / snpmatch

A simple python library to identify the most likely strain from the population

Home Page: https://arageno.gmi.oeaw.ac.at/

License: MIT License

Python 68.88% Dockerfile 0.16% Jupyter Notebook 30.97%
genotype snps arageno genotyping-by-sequencing genotyping 1001genomes arabidopsis pipeline snpmatch

snpmatch's Introduction


SNPmatch

SNPmatch is a Python toolkit for genotyping a sample from as few as 4,000 markers against a database of known lines. SNPmatch can genotype samples efficiently and economically using a simple likelihood approach.

Installation & Usage

The steps below cover running SNPmatch on a local machine. Also consider using Nextflow when deploying it on a cluster; we have provided best-practice scripts here.

Installation using conda

SNPmatch can be installed with either conda (an environment file is provided) or pip. SNPmatch depends on several Python packages (NumPy, pandas, scikit-allel), which are installed automatically. Follow the commands below.

## Conda installation, after cloning the repo
conda env create -f environment.yml
## or installing SNPmatch from the GitHub repository
pip install git+https://github.com/Gregor-Mendel-Institute/SNPmatch.git
## or from PyPI
pip install SNPmatch

Database files

Database files containing the known genotype information for many strains must be provided as HDF5-formatted files. They can be generated from the markers or variants present in a VCF file, using the command given below.

The command below requires the bcftools executable to be in your PATH. The database files are read using the PyGWAS package, so for now the VCF file must contain only biallelic SNPs.

snpmatch makedb -i input_database.vcf -o db

The above command generates four files:

  • db.csv
  • db.hdf5
  • db.acc.hdf5
  • db.csv.json

The two HDF5 files are the main database files used in downstream analyses. They contain the same information but are chunked differently for efficiency. Pass db.hdf5 and db.acc.hdf5 to the SNPmatch command via the -d and -e options, respectively.

For Arabidopsis thaliana users, we have made SNP database files for the RegMap and 1001Genomes panels available; they can be downloaded here.

If you are working with other genomes, the above command also generates a JSON file containing chromosome information. Provide this JSON file to the cross and genotype_cross commands via the --genome option.

Input file

As input, SNPmatch takes genotype information in two file formats (BED and VCF). Example input files are given in the sample_files folder. Briefly, a BED file should have three tab-separated columns (chromosome, position, and genotype), as shown below.

1 125 0/0
1 284 0/0
1 336 0/0
1 346 1/1
1 353 0/0
1 363 0/0
1 465 0/0
1 471 0/1
1 540 0/0
1 564 0/0
1 597 0/0
1 612 1/1
1 617 0/1
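The three-column layout above can be sanity-checked with a short Python sketch. This is a hypothetical helper, not part of SNPmatch; it assumes diploid genotype strings such as 0/0, 0/1, 1/1:

```python
import io

def check_bed_lines(handle):
    """Validate SNPmatch-style BED input: chromosome, position, genotype.

    Returns the number of valid rows; raises ValueError on a malformed row.
    """
    valid_gts = {"0/0", "0/1", "1/0", "1/1"}
    n = 0
    for lineno, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 tab-separated columns" % lineno)
        chrom, pos, gt = fields
        if not pos.isdigit():
            raise ValueError("line %d: position %r is not an integer" % (lineno, pos))
        if gt not in valid_gts:
            raise ValueError("line %d: unexpected genotype %r" % (lineno, gt))
        n += 1
    return n

# Example: three rows from the listing above
sample = "1\t125\t0/0\n1\t346\t1/1\n1\t471\t0/1\n"
print(check_bed_lines(sample and io.StringIO(sample)))  # prints 3
```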

VCF files should follow the default format (see the link). The main fields SNPmatch requires are CHROM and POS from the header columns and the GT genotype field. The PL field (normalized, Phred-scaled likelihoods of the possible genotypes), if present, improves SNPmatch's accuracy.

Usage

SNPmatch can be run using the shell commands below. A detailed manual for each subcommand is available via -h.

snpmatch inbred -v -i input_file -d db.hdf5 -e db.acc.hdf5 -o output_file
# or
snpmatch parser -v -i input_file -o input_npz
snpmatch inbred -v -i input_npz -d db.hdf5 -e db.acc.hdf5 -o output_file

AraGeno

A. thaliana researchers can run SNPmatch directly through the web tool AraGeno.

Output files for inbred

SNPmatch outputs two files:

  1. output_file.scores.txt --- tab-separated file
1 2 3 4 5 6 7 8
8426 4946 4987 0.99 517.57 1.0 5525 4.55
8427 4861 5194 0.93 4897.21 9.46 5525 4.55
6191 4368 4933 0.88 8652.07 16.72 5525 4.55

The columns are: strain ID, number of matched SNPs, total informative SNPs, probability of match, likelihood, likelihood ratio against the best hit, total number of SNPs in the input, and average depth of SNPs. Strains that do not match the sample can be filtered out using the likelihood ratio, which is approximately chi-square distributed.
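As a worked example of that filtering rule, the standard-library sketch below keeps only strains whose likelihood ratio (column 6) falls below a threshold. This is not SNPmatch code: the default 3.841 is merely the 95% chi-square critical value with 1 degree of freedom, chosen for illustration; SNPmatch's own cutoff may differ.

```python
import csv
import io

def likely_matches(scores_text, lr_threshold=3.841):
    """Return strain IDs from a .scores.txt whose likelihood ratio < threshold.

    lr_threshold is illustrative (chi-square 0.95 critical value, 1 df);
    it is NOT SNPmatch's internal cutoff.
    """
    hits = []
    for row in csv.reader(io.StringIO(scores_text), delimiter="\t"):
        strain_id, lr = row[0], float(row[5])  # columns 1 and 6
        if lr < lr_threshold:
            hits.append(strain_id)
    return hits

scores = ("8426\t4946\t4987\t0.99\t517.57\t1.0\t5525\t4.55\n"
          "8427\t4861\t5194\t0.93\t4897.21\t9.46\t5525\t4.55\n")
print(likely_matches(scores))  # prints ['8426']
```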

  2. output_file.matches.json --- JSON file. This file may be easier to parse with a JSON parser. It lists all strains that potentially match the sample, along with a simple interpretation of the SNPmatch result.

Genotyping a hybrid

SNPmatch can be used to identify hybrid individuals when the parental strains are present in the database. For such individuals, SNPmatch can be run in windows across the genome, using the command below.

snpmatch cross -v -d db.hdf5 -e db.acc.hdf5 -i input_file -b window_size_in_bp -o output_file
# to identify the windows matching each parent in a hybrid

These scripts default to the A. thaliana genome sizes. When working with other genomes, set the --genome option to the JSON file generated by the makedb command.

Output files for cross

SNPmatch generates three output files for the cross match:

  1. output_file.scores.txt --- tab-separated file. The file is exactly the same as described above; additionally, simulated F1 results are appended to it.
  2. output_file.windowscore.txt --- tab-separated file. The file lists the strains that match the input sample in each window across the genome.
1 2 3 4 5 6 7 8
1006 11 11 1.0 1.0 1 222 1
1158 11 11 1.0 1.0 1 222 1
1166 11 11 1.0 1.0 1 222 1

Here the columns are: strain ID, number of matched SNPs, informative SNPs, probability of match, likelihood, whether the window is identical to the line (based on a simple binomial test), number of strains matching this window, and window ID (numbered from 1, covering the genome linearly). Filtering this table for rows where column 7 equals 1 yields the homozygous windows.
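That filter can be expressed as a small stand-alone Python sketch (a hypothetical helper, not SNPmatch code), grouping strain IDs by window ID for rows where column 7 equals 1:

```python
import csv
import io
from collections import defaultdict

def homozygous_windows(windowscore_text):
    """Map window ID -> strain IDs for windows matched by exactly one strain."""
    by_window = defaultdict(list)
    for row in csv.reader(io.StringIO(windowscore_text), delimiter="\t"):
        strain_id, n_strains, window_id = row[0], int(row[6]), row[7]
        if n_strains == 1:  # column 7: only one strain matches -> homozygous
            by_window[window_id].append(strain_id)
    return dict(by_window)

# First row matches 222 strains (ambiguous); second matches exactly one.
text = ("1006\t11\t11\t1.0\t1.0\t1\t222\t1\n"
        "1158\t11\t11\t1.0\t1.0\t1\t1\t2\n")
print(homozygous_windows(text))  # prints {'2': ['1158']}
```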

  3. output_file.matches.json --- JSON file

This file contains the list of matched strains, the homozygous windows with the strains matched to them, and a simple interpretation.

Identifying the underlying haplotype for an experimental cross

For a given hybrid sample and its parents, SNPmatch can determine the underlying haplotype structure (homozygous or heterozygous).

snpmatch genotype_cross -v -e db.acc.hdf5 -p "parent1xparent2" -i input_file -o output_file -b window_size
# or if parents have VCF files individually
snpmatch genotype_cross -v -p parent1.vcf -q parent2.vcf -i input_file -o output_file -b window_size

This can also be modeled with a Markov chain (an HMM, requiring the hmmlearn Python package) by running the above command with --hmm. The starting probabilities are based on Mendelian segregation (1:2:1 for an F2); they may need to be changed for more advanced crosses. The transition probability matrix is adapted from R/qtl (Broman 2009, doi:10.1007/978-0-387-92125-9).

The output file is a tab delimited file as below.

1 2 3 4 5 6 7
1 1 300000 14 1114 NA 1.47,1.64,1.00
1 300001 600000 19 1248 2 2.46,2.29,1.00
1 600001 900000 8 1018 2 nan,3.28,1.00
1 900001 1200000 15 1036 2 2.83,2.59,1.00
1 1200001 1500000 12 995 2 2.71,2.71,1.00

The columns are: chromosome ID, window start position, window end position, number of sample SNPs in the window, number of segregating SNPs, the underlying genotype (0, 1, 2 for homozygous parent1, heterozygous, and homozygous parent2), and the likelihood ratio test statistic for each genotype (or the number of SNPs supporting each genotype under the HMM).
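A minimal parsing sketch for this layout (a hypothetical helper, not part of SNPmatch; it assumes the genotype column may be NA, as in the first row above):

```python
import csv
import io

def parse_cross_windows(text):
    """Parse genotype_cross output rows into dictionaries."""
    windows = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        chrom, start, end, n_snps, n_seg, geno, stats = row
        windows.append({
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "n_sample_snps": int(n_snps),
            "n_segregating": int(n_seg),
            # 0 = homozygous parent1, 1 = heterozygous, 2 = homozygous parent2
            "genotype": None if geno == "NA" else int(geno),
            "stats": stats.split(","),  # per-genotype test statistics
        })
    return windows

w = parse_cross_windows("1\t300001\t600000\t19\t1248\t2\t2.46,2.29,1.00\n")[0]
print(w["genotype"], w["stats"])  # prints: 2 ['2.46', '2.29', '1.00']
```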

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

Credits

  • Rahul Pisupati (rahul.pisupati[at]gmi.oeaw.ac.at)
  • Ümit Seren (uemit.seren[at]gmi.oeaw.ac.at)

Citation

Pisupati, R. et al. Verification of Arabidopsis stock collections using SNPmatch, a tool for genotyping high-plexed samples. Scientific Data 4, 170184 (2017). doi:10.1038/sdata.2017.184

snpmatch's People

Contributors

ellisztamas · rbpisupati · timeu


snpmatch's Issues

error when using snpmatch; throw an error if bcftools doesn't exist in the path

Hello Mr. Rahul,

I am Vinay, a master's student at the University of Hohenheim. I decided to use SNPmatch as a probability test to check my master's thesis maize samples against related original varieties from CIMMYT. I ran the command to create a database file from the original CIMMYT samples but encountered the following error. Could you please take a look and help me resolve this issue?

The error is as follows:
2018-08-08 12:23:14,565 - root - ERROR -
Traceback (most recent call last):
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/snpmatch/__init__.py", line 163, in main
    args['func'](args)
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/snpmatch/__init__.py", line 139, in makedb_vcf_to_hdf5
    makedb.makedb_from_vcf(args)
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/snpmatch/core/makedb.py", line 85, in makedb_from_vcf
    makeHDF5s(args['db_id'] + '.csv', args['db_id'])
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/snpmatch/core/makedb.py", line 73, in makeHDF5s
    GenotypeData = genotype.load_csv_genotype_data(csvFile)
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/pygwas/core/genotype.py", line 47, in load_csv_genotype_data
    data = data_parsers.parse_genotype_csv_file(csv_files, format)
  File "/afs/uni-hohenheim.de/hhome/v/vinay410/.SNPmatch/lib/python2.7/site-packages/pygwas/core/data_parsers.py", line 80, in parse_genotype_csv_file
    first_line = reader.next()
StopIteration

Thanks,
Vinay Kumar Reddy NANNURU

An error occurs when using the recommended version of scikit-allel (0.20.3)

After installing snpmatch using the recommended conda environment, I tried running it with the following command:

snpmatch inbred -d all_chromosomes_binary.hdf5 -e all_chromosomes_binary.acc.hdf5 -i 701_501.filter.vcf -o output_snpmatch

getting this error:

2020-07-16 09:50:20,704 - root - ERROR - 'module' object has no attribute 'read_vcf'
Traceback (most recent call last):
  File "/home/franco/anaconda3/envs/snpmatch/lib/python2.7/site-packages/snpmatch/__init__.py", line 175, in main
    args['func'](args)
  File "/home/franco/anaconda3/envs/snpmatch/lib/python2.7/site-packages/snpmatch/__init__.py", line 120, in snpmatch_inbred
    snpmatch.potatoGenotyper(args)
  File "/home/franco/anaconda3/envs/snpmatch/lib/python2.7/site-packages/snpmatch/core/snpmatch.py", line 229, in potatoGenotyper
    inputs = parsers.ParseInputs(inFile = args['inFile'], logDebug = args['logDebug'])
  File "/home/franco/anaconda3/envs/snpmatch/lib/python2.7/site-packages/snpmatch/core/parsers.py", line 66, in __init__
    (snpCHR, snpPOS, snpGT, snpWEI, DPmean) = self.read_vcf(inFile, logDebug)
  File "/home/franco/anaconda3/envs/snpmatch/lib/python2.7/site-packages/snpmatch/core/parsers.py", line 131, in read_vcf
    vcf = allel.read_vcf(inFile, samples = [0], fields = '*')
AttributeError: 'module' object has no attribute 'read_vcf'

I solved this problem by installing a more recent version of scikit-allel (v1.1.0); maybe the environment file can be edited to reflect this?
Regards

Incongruities at program results

Hello, I'm using SNPmatch on a VCF file containing 7 replicates of 3 different plants (21 samples in total). I built the database from them, then split the file into 21 separate VCFs to run the program on each one. But when I ran the program 21 times and made a heatmap of the "probability of match" results, I see inconsistencies. For example, comparing the subsets RGS vs. REED, subset 1 shows more similarity than subset 2. How should I interpret this?

[screenshot: SNPmatch results heatmap]

Add flag to distinguish cases in the identify statistics

In order to distinguish between cases 1, 2, 3 and 4, add additional information to the matches.json statistics.

Something like this:

{
  "overlap": 0.8,
  "matches": ...,
  "interpretation": { 
      "case": 2,
      "text": "Ambiguous matches due to population structure"
  }
}

SNPmatch error

Dear Rahul,

Hello,

I tried to use SNPmatch but it generated error.
Can you please take a look?

Can you also check the Arageno?

http://tools.1001genomes.org/strain_id/#progress/116f3740-f33c-4268-adeb-b9a932ebb7fc

Thanks !

/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/csmatch.py:52: RuntimeWarning: invalid value encountered in less
  NumAmb = np.where(likeliHoodRatio < snpmatch.lr_thres)[0]
2018-05-10 14:56:04,744 - root - ERROR - The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Traceback (most recent call last):
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/__init__.py", line 163, in main
    args['func'](args)
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/__init__.py", line 105, in snpmatch_cross
    csmatch.potatoCrossIdentifier(args)
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/csmatch.py", line 232, in potatoCrossIdentifier
    crossIdentifier(args['binLen'],snpCHR, snpPOS, snpWEI, DPmean, GenotypeData, GenotypeData_acc, args['outFile'])
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/csmatch.py", line 187, in crossIdentifier
    (ScoreList, NumInfoSites, NumMatSNPs) = crossWindower(binLen, snpCHR, snpPOS, snpWEI, DPmean, GenotypeData, outFile)
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/csmatch.py", line 101, in crossWindower
    writeBinData(out_file, i, GenotypeData, ScoreList, NumInfoSites)
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/csmatch.py", line 50, in writeBinData
    (likeliScore, likeliHoodRatio) = snpmatch.calculate_likelihoods(ScoreList, NumInfoSites)
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/snpmatch.py", line 41, in calculate_likelihoods
    LikeLiHoods = [likeliTest(NumInfoSites[i], int(ScoreList[i])) for i in range(num_lines)]
  File "/mnt/d/scratch/bin/python_default/local/lib/python2.7/site-packages/snpmatch/core/snpmatch.py", line 27, in likeliTest
    if n > 0 and n != y and y > 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Trouble creating SNP database files

Hi,

I am confused about how to create the HDF5 SNP database files for SNPmatch. I modified 01.convertVCF_SNPmat_csv.sh in /scripts to work with my setup, and it creates the .csv file fine, but it fails once it calls 02.convertSNPmat_csv_HDF5.py, with the following error at line 108 of data_parsers.py from pygwas.core.

ValueError: invalid literal for long() with base 10: '0|0:36:4'

I forgot to grab the whole traceback, but I will add it when I can. Do you have any instructions for generating the database file or can you point me to some good vcf files I can test to see if I am making a mistake somewhere? Thanks!

VCF file:
I am using "1001genomes_snp-short-indel_only_ACGTN.vcf.gz" from http://1001genomes.org/data/GMI-MPI/releases/v3.1/ .

System Info:
ubuntu 16.04
python 2.7.13


Different results when input is shuffled

I have installed SNPmatch using docker-miniconda and created several tests. If you want more information, I can provide all the steps I used to produce the examples below. For now, I will just show the relevant steps.

I created eg3.bed, which is completely identical to sample 1 in sample.vcf and the results are as expected.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1	1000	1000	1.0	1.0	1.0	1000	NA
2	504	1000	0.504	8443.54247132949	8443.54247132949	1000	NA
3	533	1000	0.533	7911.4903138388245	7911.4903138388245	1000	NA
4	504	1000	0.504	8443.54247132949	8443.54247132949	1000	NA
5	491	1000	0.491	8683.14132921334	8683.14132921334	1000	NA
6	500	1000	0.5	8517.193193903857	8517.193193903857	1000	NA
7	483	1000	0.483	8830.922877708486	8830.922877708486	1000	NA
8	497	1000	0.497	8572.47323619864	8572.47323619864	1000	NA
9	508	1000	0.508	8369.95575353433	8369.95575353433	1000	NA
10	500	1000	0.5	8517.193193903857	8517.193193903857	1000	NA

However, when I shuffle the input using shuf (this just randomises the lines) the results are not as expected. The following steps are identical apart from running shuf.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | shuf | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1	496	1000	0.496	8590.907917160912	1.0637249543580847	1000	NA
2	504	1000	0.504	8443.54247132949	1.045478186536507	1000	NA
3	499	1000	0.499	8535.61587463412	1.056878701787307	1000	NA
4	524	1000	0.524	8076.249299185785	1.0	1000	NA
5	501	1000	0.501	8498.774513176264	1.0523170098319123	1000	NA
6	508	1000	0.508	8369.95575353433	1.0363666899656199	1000	NA
7	521	1000	0.521	8131.2411580875205	1.006809083878488	1000	NA
8	469	1000	0.469	9090.157529759197	1.1255419679374719	1000	NA
9	522	1000	0.522	8112.906530450911	1.004538892982021	1000	NA
10	514	1000	0.514	8259.695714936273	1.0227143082085124	1000	NA

If I print every second line (thus the input is still sorted) instead of shuffling, I also get unexpected results.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | perl -nle "if ($. % 2 == 0){ print }" | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1	514	1000	0.514	8259.695714936273	1.0368199969939522	1020	NA
2	490	1000	0.49	8701.600014528602	1.0922911947702092	1020	NA
3	501	1000	0.501	8498.774513176264	1.066830991033872	1020	NA
4	530	1000	0.53	7966.373853594235	1.0	1020	NA
5	487	1000	0.487	8757.000081471551	1.099245433670001	1020	NA
6	484	1000	0.484	8812.436172983842	1.1062041946484704	1020	NA
7	492	1000	0.492	8664.686645197171	1.0876575471395777	1020	NA
8	512	1000	0.512	8296.433052811099	1.041431547813683	1020	NA
9	474	1000	0.474	8997.483502817286	1.1294327467142204	1020	NA
10	476	1000	0.476	8960.441974174311	1.1247830115494237	1020	NA

Even more perplexing: when I shuffle and then re-sort the input, I get yet another result!

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | shuf | sort -k2,2n | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1	992	1000	0.992	100.7710316093164	1.0	1000	NA
2	506	1000	0.506	8406.741111258392	83.42418428195569	1000	NA
3	529	1000	0.529	7984.676397077538	79.23583067040217	1000	NA
4	504	1000	0.504	8443.54247132949	83.78938209211381	1000	NA
5	493	1000	0.493	8646.235962207935	85.80080826927428	1000	NA
6	502	1000	0.502	8480.359832467335	84.15473868864628	1000	NA
7	487	1000	0.487	8757.000081471551	86.89997454250489	1000	NA
8	495	1000	0.495	8609.346598381862	85.43473715501706	1000	NA
9	506	1000	0.506	8406.741111258392	83.42418428195569	1000	NA
10	500	1000	0.5	8517.193193903857	84.52025406393113	1000	NA

Do you know what's going on?

Error when providing data without PL

Hi, I was testing the program, but I get an error when parsing data that does not contain PL (either a VCF file without PL, or BED format).
To check whether it was my input file, I tried the provided BED data 701_502.filter.bed and got the same error.

(base) [user@host SNPmatch]$ snpmatch parser -v -i 701_502.filter.bed -o test
2022-08-31 17:32:58,070 - snpmatch.core.parsers - INFO - running snpmatch parser!
2022-08-31 17:32:58,071 - snpmatch.core.parsers - INFO - reading the position file
2022-08-31 17:32:58,100 - snpmatch.core.parsers - INFO - creating snpmatch parser file: test.npz
2022-08-31 17:32:58,117 - root - ERROR - ufunc 'add' did not contain a loop with signature matching types (dtype('<U2'), dtype('<U2')) -> None
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.9/site-packages/snpmatch/__init__.py", line 177, in main
    args['func'](args)
  File "/home/user/miniconda3/lib/python3.9/site-packages/snpmatch/__init__.py", line 134, in snpmatch_parser
    parsers.potatoParser(inFile = args['inFile'], logDebug =  args['logDebug'], outFile = args['outFile'])
  File "/home/user/miniconda3/lib/python3.9/site-packages/snpmatch/core/parsers.py", line 217, in potatoParser
    inputs = ParseInputs(inFile, logDebug, outFile)
  File "/home/user/miniconda3/lib/python3.9/site-packages/snpmatch/core/parsers.py", line 86, in __init__
    self.case_interpret_inputs(outFile + ".stats.json")
  File "/home/user/miniconda3/lib/python3.9/site-packages/snpmatch/core/parsers.py", line 113, in case_interpret_inputs
    statdict["depth"] = np.nanmean(self.dp)
  File "<__array_function__ internals>", line 180, in nanmean
  File "/home/user/miniconda3/lib/python3.9/site-packages/numpy/lib/nanfunctions.py", line 1034, in nanmean
    return np.mean(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims,
  File "<__array_function__ internals>", line 180, in mean
  File "/home/user/miniconda3/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 3432, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/home/user/miniconda3/lib/python3.9/site-packages/numpy/core/_methods.py", line 180, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
numpy.core._exceptions._UFuncNoLoopError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U2'), dtype('<U2')) -> None

Just to remark: when I provide a VCF file with PL I do not get the error.

== Update ==
I've just found where the problem comes from:
When parsing from .bed, DPmean is set to "NA", but afterwards statdict["depth"] = np.nanmean(self.dp) is computed, which gives the error. The same happens with a VCF file: if DP is not present (or not present in INFO), it is set to a list of "NA"s, which afterwards gives the same error.

TypeError in JSON file output of genotyper inbred

Here is the detailed error,

Traceback (most recent call last):
  File "/mygit/SNPmatch/snpmatch/__init__.py", line 191, in main
    args['func'](args)
  File "/mygit/SNPmatch/snpmatch/__init__.py", line 116, in snpmatch_inbred
    snpmatch.potatoGenotyper(args)
  File "/mygit/SNPmatch/snpmatch/core/snpmatch.py", line 155, in potatoGenotyper
    (ScoreList, NumInfoSites) = genotyper(snpCHR, snpPOS, snpGT, snpWEI, DPmean, args['hdf5File'], args['hdf5accFile'], args['outFile'])
  File "/mygit/SNPmatch/snpmatch/core/snpmatch.py", line 148, in genotyper
    print_topHits(outFile + ".matches.json", GenotypeData.accessions, ScoreList, NumInfoSites, overlap, NumMatSNPs)
  File "/mygit/SNPmatch/snpmatch/core/snpmatch.py", line 91, in print_topHits
    out_stats.write(json.dumps(topHitsDict))
  File "/home/GMI/rahul.pisupati/anaconda2/envs/spyder/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/home/GMI/rahul.pisupati/anaconda2/envs/spyder/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/GMI/rahul.pisupati/anaconda2/envs/spyder/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/home/GMI/rahul.pisupati/anaconda2/envs/spyder/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 14471 is not JSON serializable

But the dictionary that needs to be output looks fine!
