simonhmartin / genomics_general Goto Github PK

View Code? Open in Web Editor NEW

322.0 322.0 92.0 707 KB

General tools for genomic analyses.

Python 99.33% R 0.67%

genomics_general's People

Contributors

Stargazers

Watchers

Forkers

altingia arvin580 cbrock2 jkreinz gmkov dileep-kishore nsalamin daxiw zhanglzu sujanmamidi fengyaa attrna kizbaolin hjsbio zyworship bigtigr dawei1203 tep0007 asendellprice ktp-forked-repos cgnat kaiserinlissi srikanthkris zhutao1009 a601866950 keithmp hj1994412 dilerhaji yazbraimah gillichu sdws1983 lonemen hongzhonglu yanccscu asingh164 gsandstedt kant lepisorus novapyth mitramenon nathofme dendisuhubdy adc0032 ausseriridische02 xgdong nitinra chaturvedi-lab lizhizhong1992 mhelmkampf joanam binlu1981 schmitt8 danielyao12 xma82 linzhi2013 shangshanzhizhe jim0912 tholtzem thomasbrazier laneatmore xxllgg yuzhenpeng biocko matiasgoco sunnyev rancilhac sunshichao0916 wyzhh linhai1221 1661848937 flzh628 wangjie07070910 wangchengww edelrieutrottin rakakp007 odko2008 hybio crazzy-rabbit ningshuang-yao hrluo93 zjwang6 samhuairen swift-213 grgong amazingshi jkfo002 odielwoolmore guo-cheng mtholder btrainee

genomics_general's Issues

No results written

Hi,

I‘ve been trying using popgenWindows.py to calculate the population statistics between pre-defined populations. All process works without error reports with the command:

python2.7 ~/local/genomics_general/popgenWindows.py -w 50000 -m 5000 -f phased -g merged.test.geno -o output.csv.gz -T 4 -p pop1 -p pop2 --popsFile Poplist.txt --windCoords WinCoords.txt

And I also used similar command to calculate more than two groups:

python2.7 ~/local/genomics_general/popgenWindows.py -w 50000 -m 5000 -f phased -g merged.two.geno.gz -o output.csv.gz -p pop1 -p pop2 -p pop3 -p pop4-p pop5 -p pop6 -p pop7 -p pop8 --popsFile Poplist.txt --windCoords WinCoords.txt

The geno file looks like this

But I got nothing except the header.
And the log told me like this:

Writing final results...

187 windows queued, 187 results received, 0 results written.
187 windows were tested.
0 results were written.

Done.

I don't know why this happened. Could you help with this problem?

Thanks!

George.

rawdatas

when I change my vcf file to geno,using the following command
python parseVCFs.py -t 25 -i mydata.vcf -o mydata.geno
the screen occurs many warnings like "Could not load .tbi/.csi index of mydata.vcf
WARNING empty window: mydata.vcf chrUnn_random 5300001-5400000"
At last the mydata.geno file is empty. Do you know what's wrong ? By the way ,Can you add the sample datas used in your scripts?

raxml_sliding_windows.py modification: raxmlHPC

Because the command for RAxML execution on my server (and by default) is raxmlHPC and not raxml, I had to modify raxml_sliding_windows.py to get it work on my server. This is a simple fix: replace all instances of 'raxml' with 'raxmlHPC' EXCEPT line 33, where the line must read
treeFile = open("RAxML_bestTree." + localName, "r")

Cheers,
Mark

Error in popPairDist

Hi Simon,

I am having the same issue with popPairDist that Robert (rfitak) described in December:

Traceback (most recent call last):
File "/opt/python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "popgenWindows.py", line 58, in stats_wrapper
values = [round(statsDict[stat], roundTo) for stat in stats]
KeyError: 'dxy_pop3_pop4'

The popDist analysis of pi seems to be working fine. Any insight would be appreciated. Thanks!

Kevin

error in distMat.py

Hi,

I've been trying to use distMat.py to calculate pairwise divergence between several individuals. It works fine with two individuals, with the following command:

/usr/bin/python ~/my_programs/genomics_general_v3/distMat.py -g wg_filtered_wHead.geno.gz -f diplo --outFormat raw --windType cat --samples species1,species2 -o output_file

However, when I try to use more than two individuals (or for all the species in the file), like so:

/usr/bin/python ~/my_programs/genomics_general_v3/distMat.py -g wg_filtered_wHead.geno.gz -f diplo --outFormat raw --windType cat --samples species1,species2,species3 -o output_file

...the script crashes after a while (it gives the normal number of windows queued info), with the following error:

...
0 windows queued 0 results received 0 results written.
0 windows queued 0 results received 0 results written.
0 windows queued 0 results received 0 results written.
0 windows queued 0 results received 0 results written.
0 windows queued 0 results received 0 results written.
Traceback (most recent call last):
File "/home/mafalda_ferreira/my_programs/genomics_general_v3/distMat.py", line 293, in
windowQueue.put((windowsQueued,window))
File "/usr/lib/python2.7/multiprocessing/queues.py", line 392, in put
return send(obj)
IOError: bad message length
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/mafalda_ferreira/my_programs/genomics_general_v3/distMat.py", line 102, in checkStats
print >> sys.stderr, windowsQueued, "windows queued", resultsReceived, "results received", resultsWritten, "results written."
AttributeError: 'NoneType' object has no attribute 'stderr'

I'm not sure why this is happening though... Because the two individuals comparison works, I'm not sure what could be causing it! Do you have any idea?

Thank you!

error in popgenWindows.py

Hi Simon,

I am having an issue with popgenWindows.py:
Traceback (most recent call last):
File "/data/users/wangyujian/software/genomics_general_can_culculate_Dxy/popgenWindows.py", line 381, in
for window in windowGenerator:
File "/data/users/wangyujian/software/genomics_general_can_culculate_Dxy/genomics.py", line 1867, in slidingCoordWindows
site = reader.nextSite(asDict = extractSpecificGTs)
File "/data/users/wangyujian/software/genomics_general_can_culculate_Dxy/genomics.py", line 1825, in nextSite
self.precompDict, addToPrecomp=self.precompDict["counter"]<self.precompDict["maxSize"])
File "/data/users/wangyujian/software/genomics_general_can_culculate_Dxy/genomics.py", line 1783, in parseGenoLine
"position": int(lineData[posCol]) if posCol >= 0 else None,
ValueError: invalid literal for int() with base 10: 'filters'

My command line is as follows:
python /data/users/wangyujian/software/genomics_general_can_culculate_Dxy/popgenWindows.py -w 50000 -m 5000 -g /data/users/zhanglei/ossutil/shuju/damai-dp5-miss0.1-maf0.05.vcf -o output.csv.gz -f phased -T 10 -p popB B12,B16,B21,B25,B26,B34,B40,B44,B52,B59,B62,B63,B8 -p popC C13,C15,C24,C31,C34,C35,C3,C53,C5,C66,C9
Any insight would be appreciated. Thanks!

Yujian Wang

Error when running popgenWindows.py

Hi Simon,
I was running the following command:
CHR="NC_042546.1"
SCRIPT=/scripts/genomics_general/popgenWindows.py
INPUT=/geno_chr/${CHR}.geno.gz
OUTPUT=/stats/${CHR}_dxy_wood.txt
POP=/popmap/pops.wood.txt
WINDOW=100000
STEP=25000
MIN=500
NTHREADS=10
python $SCRIPT -g $INPUT -f diplo -o $OUTPUT -w $WINDOW -m $MIN -s $STEP -p AGUL -p ANVB -p TEAL --popsFile $POP --writeFailedWindows -T $NTHREADS

I got the following message:
Starting 10 worker threads
Process Process-2:
Traceback (most recent call last):
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/popgenWindows.py", line 44, in stats_wrapper
Process Process-1:
Traceback (most recent call last):
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
Aln = genomics.genoToAlignment(window.seqDict(), sampleData, genoFormat = genoFormat)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 1051, in genoToAlignment
seqList = splitSeq(seqDict[indName], genoFormat)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 380, in splitSeq
if genoFormat == "diplo": sequence = [haplo(d) for d in sequence]
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 27, in haplo
def haplo(diplo): return diploHaploDict[diplo]
KeyError: '.'
self.run()
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/popgenWindows.py", line 44, in stats_wrapper
Aln = genomics.genoToAlignment(window.seqDict(), sampleData, genoFormat = genoFormat)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 1051, in genoToAlignment
Process Process-3:
Traceback (most recent call last):
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/cm/shared/applications/Python/2.7.16/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/popgenWindows.py", line 44, in stats_wrapper
seqList = splitSeq(seqDict[indName], genoFormat)
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 380, in splitSeq
if genoFormat == "diplo": sequence = [haplo(d) for d in sequence]
File "/scratch-lustre/peuclide/Sockeye/raw_data_1/yue/scripts/genomics_general/genomics.py", line 27, in haplo
def haplo(diplo): return diploHaploDict[diplo]
KeyError: '.'

I am pretty sure that the input geno file is the correct format, see blow:

Could you help me figure out why I got the error message?

Thanks,
Yue

fasta option for variant only vcfs

Hi @simonhmartin,
Thank you for making a very useful set of tools for popgen.
For my species the --emit_all_confident_sites creates a very large file. Thus I call variants with HaplotyeCaller (only outputting the variant positions), filter the individual vcfs, union for population variable sites, and then recall missing sites using freebayes with population information.
I think this causes an issue for me with your set of scripts since some of the window slicing relies on invariant sites which the README on twisst indicates you leave in your vcfs. Or am I mistaken?
Would it be possible to add a option to take a fasta file, the same used when snp calling, to fill the invariant sites when calling phyml_sliding_windows.py?
thank you,
@stsmall

popgenWindows.py output don't have result ?

Hi,

I have run the parseVCF.py for get the input for popgenWindows.py
python ~/biosoft/genomics_general/VCF_processing/parseVCF.py -i ~/07.lizard/Hongwei-Project/02.call_snp/Hong_vcf_files/HongWei.NJ.outgroup.use.vcf.gz --skipIndels --gtf flag=DP min=5 | bgzip > 2.2.output.geno.gz

the out geno file is like this:
#CHROM POS YPX45352 YPX30472 YPX30473 YPX30474 YPX45271 YPX45272 YPX45273 YPX45274 YP LG01 2341 N/N N/N C/C N/N N/N N/N C/C N/N C/C C/C N/N C/C N/N C/C C/C C/C C......

then the popgenWindows.py code is
python ~/biosoft/genomics_general/popgenWindows.py -w 50000 -m 5000 -f phased -g LG01.test.geno.gz -o 1.output.csv.gz --writeFailedWindows --verbose --addWindowID -T 4 -p pop2 YPX39994,YPX39995,YPX39996 -p pop3 YPX40091,YPX40095 -p popS YPX60188,YPX60189,YPX60190,YPX60191,YPX60192,YPX60199,YPX60200,YPX60201,YPX60202,YPX60203 -p popD YPX45273,YPX45274 --popsFile pop.txt
but the result is
windowID,scaffold,start,end,mid,sites,pi_pop2,pi_pop3,pi_popS,pi_popD,dxy_pop2_pop3,dxy_pop2_popS,dxy_pop2_popD,dxy_pop3_popS,dxy_pop3_popD,dxy_po 1,LG01,1,50000,23205,194,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
it don't have any error.

I don't know where is the problem.

dictionary update typo on parseVCF.py

I might be wrong here, but on line 355 of script parseVCF.py, where it says

ploidyDict.upldate

shouldn't it say

ploidyDict.update?

Same issue in script parseVCFs.py.

windows with fd > 1

Hi Simon,
I used your script to calculate fd statistics in sliding windows across chromosomes for my data. I used a window-size of 50kb, which resulted in 200-600 SNPs used for the analysis. I noticed that some fd values were greater than 1, which I found odd. I looked into the specific windows and I think this effect comes from sites that have SNPs only shared between P1 and P2 similar to this:
P1 P2 P3 P4
0.8 0.7 0.0 0.0

These sites are not informative for ABBA/BABA statistics, but they are still used for calculating the denominator for the f-statistic, right? When you set P3=P2, because P2>P3, then this leads to the creation of "artificial" BABAs that only affect the denominator. Thus, the denominator can be smaller than the numerator and fd greater than 1. Because this is summed over the whole window, you can still get a positive D, so the fd value should be valid.

Did you experience this before? Or do you have SNP filtering criteria to avoid this? I am also worried that this could affect the fd estimate in other windows, where it is not greater than 1 but still biased upwards. What do you think?

Best,
Hannes

Error while processing geno file to frequency using freq.py

I used the parseVCF.py script to convert the VCF file into geno file without any error using command:
python genomics_general-master/VCF_processing/parseVCF.py -i All_BIALLELIC_90mis_DP3.vcf.gz --skipIndels --minQual 30 --gtf flag=DP min=3 | bgzip -c > All_Biiallelic.geno.gz
When I tried to run the freq.py script it gives me an error:
Starting 1 worker threads
Process Process-1:
Traceback (most recent call last):
File "/home/genomics/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/genomics/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/media/genomics/2TB_WGS-NEW/ABBA-BABA/genomics_general-master/freq.py", line 42, in freqs_wrapper
window = genomics.parseGenoFile(fileSlice, headerLine, names=sampleData.indNames)
File "/media/genomics/2TB_WGS-NEW/ABBA-BABA/genomics_general-master/genomics.py", line 1842, in parseGenoFile
GTs = [siteData["GTs"][name] for name in names] if extractSpecificGTs else siteData["GTs"]
KeyError: '79'
After which the script runs without any output.
The format of the file seems to be good with no obvious errors.
I am not sure what's happening.
Kindly help.
Thanks
Akankshs

no attribute 'ploidy' while running genoToSeq.py

Hi Simon,

After updating to the current version of genomics_general I ran into an issue while running genoToSeq.py to generate a fasta file from genotypes.

I'm re-using some code that worked before, but now it produces an error:

$ python VCF_processing/parseVCF.py \
   -i simulated.phased.vcf.gz | \
   gzip > test.geno.gz

$ python genoToSeq.py \
     -g test.geno.gz \
     -s test.fa \
     -f fasta \
     --splitPhased

Traceback (most recent call last):
  File "genomics_general/genoToSeq.py", line 57, in <module>
    window = genomics.parseGenoFile(genoFile, names=samples, splitPhased=args.splitPhased)
  File "genomics_general/genomics.py", line 1683, in parseGenoFile
    reader=GenoFileReader(genoFile, splitPhased=splitPhased, ploidy=ploidy)
  File "genomics_general/genomics.py", line 1655, in __init__
    if splitPhased: self.names = makeHaploidNames(self.names, self.ploidy)
AttributeError: GenoFileReader instance has no attribute 'ploidy'

I traced back the issue and it seems to emerge when going from commit

"54b0d75a79a6d4023bbc0e4cfc0c9719678bdde6"

"d6758ea9742c0b649286bce4ebe645f8b2bbfa8c".

So, when using the old version I'm able to get my fasta.
The following runs fine:

git checkout 54b0d75a79a6d4023bbc0e4cfc0c9719678bdde6
python genoToSeq.py -g test.geno.gz -s test.fa -f fasta --splitPhased

Did I miss a change in the way that genoToSeq.py to be run, or is this a unintended side-effect of the update?

best regards,
Kosmas

concerns about filtering missing data

Hi Simons,

This has been a very useful tool that I am using for one of my dissertation chapters. Thanks a lot!
I have a question concerning to missing data. While using the ParseVCF script, missing data is denoted as N. So this mean that no missing data is allowed, right?

If so, is there an argument that I can add to this command to exclude some level of missing data? (something similar to the "--max-missing 0.5" argument from VCF tools?).

Thanks in advance for your advice!

ABBABABA error when running tutorial

When running both my data and the tutorial data with ABBABABAwindows.py I get the following error:

Traceback (most recent call last):
File "genomics_general-master/ABBABABAwindows.py", line 223, in
if not args.addWindowID: outFile.write("scaffold,start,end,mid,sites,sitesUsed,ABBA,BABA,D,fd,fdM\n")
File "/usr/lib/python3.6/gzip.py", line 260, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

I am using Python 3.6.9.

I am following the tutorial so this was the command:
python3 genomics_general-master/ABBABABAwindows.py
-g data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased
-o data/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz
-P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num
--popsFile data/hel92.pop.txt -w 25000 -m 250 -T 2

Thanks,
Rebecca

extractCDSAlignments KeyError

Hi @simonhmartin ,
I really like the idea of this script, it works well with the modules in genomics.py. However, I seem to be running into some problems when running it. I have a gff file and a geno file that I created a tabix index using tabix -p vcf -h FOO.geno.gz. When I run the command:

python genomics_general/extractCDSAlignments.py --gff test.gff -g FOO.geno.gz -s sample1.1 sample1.2
It returns a KeyError:
Extracting 2 gene sequences from X
AFUN000184-RA.mrna1
Getting region X:913961-917834 from geno file...
Traceback (most recent call last):
File "genomics_general/extractCDSAlignments.py", line 46, in
window = genomics.parseGenoFile(genoStream.stdout, names=args.samples, includePositions=True, splitPhased = args.split, headerLine=headerLine)
File "genomics_general/genomics.py", line 1592, in parseGenoFile
names = [n + "_" + letter for n in names for letter in string.ascii_uppercase[:ploidyDict[n]]]
KeyError: 'sample.1.1'

If I leave out the sample names it defaults to printing 4 haplotypes but it names the output as N|N or G|G, basically from the nucleotide names.

Any suggestions?
thank you,
@stsmall

population setting in ABBABABAwindows.py

Hi Simonh,

I noticed that you changed the syntax of population setting in the new version of ABBABABAwindows.py. Could you please update the example command? THX

Langqing

popsFile format for popgenWindows.py

Hi there,

Any change you could provide an example of the format required for the popsFile in the popgenWindows.py script?
I have a simple .txt file with 2 columns and a header with Samples /t Population.

Thanks!

Fst analysis on haploid sex chromosomes

Hi Simon,
referring to your paper here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3814882/), I have done a similar comparison between autosomal data and haploid sex chromosome data (homogametic individuals) using Fst (popgenWindows.py). However, I found some comments from people saying that Fst doesn't work on haploid data, since it depends on calculating expected heterozygosity. May I know if there is any substance to that claim? I am using homogametic birds (ZW).

Thank you very much

typo genoToEigenstrat.py

Dear Dr. Martin Simons,

there is a small typo in your genoToEigenstrat.py script on line 33. The line should end with a ).

Best regards,
Steven

issue with parseVCF.py reading my vcf

Hi Simon!
I'm running into an issue reading my vcf into the parseVCF.py program.
When I run: python3.5 ~/genomics_general-master/VCF_processing/parseVCF.py -i $input.vcf.gz --minQual 30 --skipIndels --gtf flag=DP min=5 | gzip > $output.geno.gz

It returns the error message:
Traceback (most recent call last):
File "/nas/longleaf/home/jcoug/genomics_general-master/VCF_processing/parseVCF.py", line 266, in
headData = getHeadData(args.inFile)
File "/nas/longleaf/home/jcoug/genomics_general-master/VCF_processing/parseVCF.py", line 156, in getHeadData
return parseHeaderLines(headLines)
File "/nas/longleaf/home/jcoug/genomics_general-master/VCF_processing/parseVCF.py", line 134, in parseHeaderLines
if line.startswith("##contig"):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

I don't think there's anything unusual about the format of my vcf, but I've included the first few lines in case you see something.
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

Any help you can offer is appreciated!
Thanks very much,
Jenn

genoToVCF Error

Hi I am getting error fro geno to vcf conversion. I downloaded the .geno files from https://reich.hms.harvard.edu/datasets

**python genoToVCF.py -g panel7.geno -o panal17.vcf**

Traceback (most recent call last):
  File "genoToVCF.py", line 61, in <module>
    genoFileReader = genomics.GenoFileReader(genoFile)
  File "/storage/artemis/com/shg29ny/geno_vcf/genomics.py", line 1805, in __init__
    if not headerLine: headerLine = getNext(genoFile)
  File "/storage/artemis/com/shg29ny/geno_vcf/genomics.py", line 1799, in getNext
    return generator.next() if sys.version_info.major < 3 else next(generator)
  File "/home/shg29ny/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 221: invalid start byte

Thank you for the support!

2 minor phylo bugs (with fixes?)

Hi Simon,

Thanks for the great tool! I've been using genomics general to run some windowed phylogenetic trees with phyml and raxml, and I ran in to a couple small (but seemingly easy to fix) bugs.

For phyml_sliding_windows.py:

I was using pre-defined windows and kept getting this error message:

Traceback (most recent call last):
  File "/home/tt164677e/bin/genomics_general/phyml_sliding_windows.py", line 408, in <module>
    for window in windowGenerator:
  File "/home/tt164677e/bin/genomics_general/genomics.py", line 2031, in predefinedCoordWindows
    reader=GenoFileReader(genoFile, headerLine, splitPhased=splitPhased, ploidy=ploidy)
  File "/home/tt164677e/bin/genomics_general/genomics.py", line 1835, in __init__
    self.names = headerLine.split()[firstSampleCol:]
AttributeError: 'list' object has no attribute 'split'

I think the problem is in line 406 of phyml_sliding_windows.py, with the positional arguments of genomics.predefinedCoordWindows not matching up to the positions as they're defined in genomics.py. Changing line 406 to use the argument name for indNames fixed it, e.g.:

windowGenerator = genomics.predefinedCoordWindows(genoFile, windCoords, names = indNames)

For raxml_sliding_windows.py:

At first I couldn't even run python raxml_sliding_windows.py -h, which would give me this error:

  File "/home/tt164677e/bin/genomics_general/phylo/raxml_sliding_windows.py", line 329
    print "\nDone."
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("\nDone.")?

Changing line 329 to the error message's suggestion of print("\nDone.") did the trick.

If it matters, I'm using Python v3.8.3. I forgot to write down what day I download genomics general, but looks like I'm probably on commit ff736c5.

Thanks,
Tim

Result is differ from vcftools

Its a very good tools. but I use genomics_general and vcftools process the same vcf data with same parameter(window=50K,step=10K), they give me different PI and FST. I'm confused, why it happened?
I convert the vcf file to geno file without any filter.
Need your help, Thanks.

class vcfGenoData does not allow haploid data

Hi,
the documentation says that popgenWindows.py can process haploid data from the geno file. However, the parseVCF.py does not seem to deal with haploid vcf (at least I didn't find out how if so).
Therefore one needs to process the data twice using perl during the pipeline what is not very convenient:

	# make vcf diploid
perl -pe 's/([01])(:[0-9]*,[0-9:]*,[0-9]*)/$1\|$1$2/g;' -pe 's/.:.:.:.:./.|.:.:.:.:./g;' $vcf.vcf |
	gzip > $vcf.dip.vcf.gz

	# parse vcf
parseVCF.py -i $vcf.dip.vcf.gz --skipIndel --skipMono --minQual 30 --gtf flag=DP min=5 | gzip > $fgeno

	# make vcf haploid again
zcat $fgeno | perl -pe 's/([ATGCN])[\|\/][ATGCN]/$1/g;' | gzip > $fgeno.temp
mv $fgeno.temp $fgeno

cheers,
Ludovic

popgenWindows generating empty output

Hi Simon,

The command below runs without error, and prints a lot of progress messages, but the output is an empty file. It creates the output file, but apparently nothing is written to it. What am I doing wrong? Thanks!

python /home/rmarco3/bin/simon-martin-scripts/popgenWindows.py --ploidy 2 -f phased -g asp_con_din_par_quality_filtered_ONLY.geno --windType coordinate -w 50000 -s 25000 --popsFile pops_file_for_martin -p din -p asp --verbose -o simon_stats_dinVasp --writeFailedWindows

Here's the begginning of the progress messages:

started worker 0
Sorter received result 0Result 0 sent to writerWriter received result 0
Sorter received result 1Result 1 sent to writerWriter received result 1
Sorter received result 2Result 2 sent to writerWriter received result 2
Sorter received result 3Result 3 sent to writerWriter received result 3
Sorter received result 4Result 4 sent to writerWriter received result 4
Sorter received result 5Result 5 sent to writerWriter received result 5
Sorter received result 6Result 6 sent to writerWriter received result 6
Sorter received result 7Result 7 sent to writerWriter received result 7
Sorter received result 8Result 8 sent to writerWriter received result 8
Sorter received result 9Result 9 sent to writerWriter received result 9

Calculating global Fst

Hi,
popgenWindows.py calculated pairwise Fst in a window size base. Any chance it can also report the global estimate of Fst between two populations?

Best,

Unable to allocate array with shape When running ABBABABAwindows.py

Hi @simonhmartin ,
I tried to run the script ABBABABAwindows.py to calculate D and fd, while I found this issue
Traceback (most recent call last):
File "/home1/xx/miniconda2/envs/geno/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home1/xx/miniconda2/envs/geno/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, *self._kwargs)
File "/home1/xx/software/script/github/genomics_general/ABBABABAwindows.py", line 38, in ABBABABA_wrapper
statsDict = genomics.ABBABABA(Aln, P1, P2, P3, O, minData)
File "/home1/xx/software/script/github/genomics_general/genomics.py", line 1597, in ABBABABA
all4freqs = all4Aln.siteFreqs(sites=goodSites)
File "/home1/xx/software/script/github/genomics_general/genomics.py", line 1032, in siteFreqs
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/home1/xx/software/script/github/genomics_general/genomics.py", line 1032, in
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/home1/xx/software/script/github/genomics_general/genomics.py", line 593, in binBaseFreqs
else: return 1. np.bincount(numArr, minlength=4) / n
File "<array_function internals>", line 6, in bincount
MemoryError: Unable to allocate array with shape (140160792457881,) and data type int64

It seems that a memory error occured, but in a python3 environment. Then I change to a python2 environment and rerun this script, got this error,

File "/home1/xx/software/script/github/genomics_general/ABBABABAwindows.py", line 64
print("Sorter received result", resNumber, file=sys.stderr)
^
SyntaxError: invalid syntax

Ummmm how could I find a solution to make it work. Thanks in advance.

Best regards,
Yung-Chien

issue about mergeGeno.py and parseVCFs.py

Hi dear @simonhmartin
Thanks for these useful scripts! Recently when I used the mergeGeno.py to combine two geno files(they were prduced by your parseVCF.py and worked well), but somehow the result contained no information, just like this:
#CHROM POS A1 A2 A3 chr1 1 N N N
And the result was same to the outcome of parseVCFs.py(I tried to produce a geno file including all samples but failed).
Here is the code I used:python mergeGeno.py -i A_pop.geno.gz -i B_pop.geno.gz -f my_reference.fai --method all | bgzip > out.geno
So I wonder if there are anything wrong with my code or samples?
Thanks in advanced!!!

popgenWindows.py popFreq MemoryError

Thank you for providing these tools - they are incredibly useful. I am running into a problem running popFreq analysis. The geno file I have created that works fine for popDist and popPairDist but hangs when running popFreq.

I am attempting to run using the following command:

$ python $SCRIPTS/popgenWindows.py -g $GENOS/Mayaguez.geno.gz -w 50000 -s 10000 -m 100 -o $GENOS/Mayaguez.freq_output.csv -f phased -T 1 --windType coordinate --analysis "popFreq" -p urban KMW_344,KMW_558,KMW_563,KMW_575,KMW_586,KMW_589,KMW_605,KMW_616,KMW_623,KMW_626,KMW_632,KMW_633,KMW_641,KMW_647,KMW_659,KMW_661 -p forest KMW_380,KMW_386,KMW_396,KMW_408,KMW_414,KMW_421,KMW_425,KMW_426,KMW_431,KMW_461,KMW_478,KMW_480,KMW_492,KMW_535,KMW_539,KMW_553

I get the following error:

Starting 1 worker threads

211 windows queued, 210 results received, 119 results written.
/home/geneva/apps/genomics_general/genomics.py:625: RuntimeWarning: invalid value encountered in double_scalars
  D = d / np.sqrt(e1*S + e2*S*(S-1))
Process Process-1:
Traceback (most recent call last):
  File "/home/geneva/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/geneva/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/geneva/apps/genomics_general/popgenWindows.py", line 49, in stats_wrapper
    statsDict.update(Aln.groupFreqStats())
  File "/home/geneva/apps/genomics_general/genomics.py", line 993, in groupFreqStats
    siteBaseCounts = np.apply_along_axis(binBaseFreqs,0,self.numArray[np.ix_(seqIdx,siteIdx)],asCounts=True)
  File "<__array_function__ internals>", line 5, in apply_along_axis
  File "/home/geneva/anaconda3/lib/python3.8/site-packages/numpy/lib/shape_base.py", line 402, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
  File "/home/geneva/apps/genomics_general/genomics.py", line 592, in binBaseFreqs
    if asCounts: return np.bincount(numArr, minlength=4)
  File "<__array_function__ internals>", line 5, in bincount
MemoryError: Unable to allocate 1020. TiB for an array with shape (140143801524529,) and data type int64

It then repeats this line every few minutes, seeming indefinitely:
211 windows queued, 210 results received, 119 results written.

There doesn't seem to be anything special about window 120 and if I change the window size the analysis hangs at a totally different coordinate. The same error also occurs with geno files built using other VCFs from other populations. I do not have 1020 TiB of RAM :) but I don't believe that should actually be needed for this analysis.

Many thanks for any ideas you have!

Update - The above occurred using python 3.8.3. A very similar error occurs when using 2.7 with slightly different error syntax:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/geneva/apps/genomics_general/popgenWindows.py", line 49, in stats_wrapper
    statsDict.update(Aln.groupFreqStats())
  File "/home/geneva/apps/genomics_general/genomics.py", line 993, in groupFreqStats
    siteBaseCounts = np.apply_along_axis(binBaseFreqs,0,self.numArray[np.ix_(seqIdx,siteIdx)],asCounts=True)
  File "/home/geneva/.local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 403, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
  File "/home/geneva/apps/genomics_general/genomics.py", line 592, in binBaseFreqs
    if asCounts: return np.bincount(numArr, minlength=4)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

genomic statistics in sliding windows computation

Hi Simon,

I tried to calculate Fst and Dxy for one chromosome, but the output file did not give me the values requested.
Is there something i am not considering?

Here is the code I used:

python popgenWindows.py -w 50000 -m 5000 -g CM020245.1.geno.gz -o CM020245.1.csv.gz -f phased -T 5 --popsFile pops.txt

The populations as listed as:

GRFL00 A1
GRFL01 A1
AMFL02 B1
AMFL04 B2

Thanks in advance for your advice!

only getting pi_all in popgenWindows.py output

Hi Simon,
I am a new user implementing your popgenWindows.py to obtain Fst and Dxy calculations across 6 populations from RAD-Seq data, but my output only contains pi_all calculation. I have tried various -w and -m values, but that still does not give me fst or dxy columns.

Here's an example of what I've run:
python popgenWindows.py -w 20000 -g vcf.geno.gz -o csv.gz -f phased -T 10 --writeFailedWindow -s 2500 -m 1 --popsFile geopopname.txt

Here is an example output:
scaffold,start,end,mid,sites,pi_all
scf7180000845757,1,20000,8258,1,0.106
scf7180000787816,1,20000,7592,1,0.1035
scf7180000975282,1,20000,9347,1,0.2884
scf7180001008466,1,20000,987,3,0.152
scf7180000807401,1,20000,7048,6,0.1228
scf7180000777439,1,20000,10585,1,0.1018

Any idea why I might not be getting Fst and Dxy in the output?
I'm also wondering why I am only getting pi_all rather than pi for each of the six populations.

Thanks,
Adriana

Suggestion for behavior of -m flag in popgenWindows.py

Hi Martin,

While running popgenwindows on an alignment with several species (I have a single individual representing each species) to calculate Dxy between species pairs, I noticed that if I have several species in the geno file, setting -m will exclude windows with less then "m" sites for the global alignment with all species, but not for the specific comparison of two species. For example, a given window might have information for 2000 sites if I take all individuals into account, but when comparing individuals A and B, in fact there might only be 1000 sites. If -m is set to 2000, this window of 1000 sites will not be excluded for the specific comparison of A and B. Also, "sites" in the output will be 2000, reflecting the information for the global alignment, so I can't filter out these windows a posteriori in the specific pairwise comparisons.

For testing purposes, I've been going around this by creating an intermediate geno file just with the two individuals I wish to compare and with no missing data, but the process of creating this intermediate file for each comparison and then running popgenwindows is quite time consuming and takes up a lot of space. Using the same example above, I've confirmed that indeed the Dxy calculations are taking into account the available sites between the comparison of A and B (i.e., using the example above, the script is dividing differences in the window by the 1000 informative sites between A and B, and not by 2000 sites that exist in the entire alignment). So, would it be possible to add "nan" for this particular window and pairwise comparison, for example? Or is there a way to output the number of sites per window for a specific species pair, so it would be possible to filter out specific windows a posteriori?

Thank you,
Mafalda

strange pi and dxy

I'm calculating windowed pi and dxy within or between close related populations using you scripts. It's strange that the estimated pi and dxy is quite large, such as mean value of 0.1 of pi and 0.2 of dxy. But i know the pi of my populations is about 0.001, and dxy between populations should below 0.01. So, do you have any idea of those strange results? I suspect it may related to the correction of window size, if so, how should i do to get the right value of pi and dxy?

possible issues with multiprocessing

Hi Simon,
I am running into some issues with the most recent version of popgenWindows.py. I had used it before (maybe a year ago) without any problems. I parsed and compressed my VCF input file without issue using parseVCF.py (see uncompressed format below). When I run popgenWindows.py I get a series of error messages related to the multiprocessing module, and I think there is an issue reading across the various threads. The program continues to run however, producing the progress report:

9 windows queued, 6 results received, 0 results written

numerous times in a never ending process. Here is the error printed out:

1381 scaffolds will be excluded.started worker 0started worker 1Process Process-2:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "popgenWindows.py", line 44, in stats_wrapper
    statsDict.update(Aln.groupFreqStats())
  File "/Users/rfitak/Downloads/CAMEL/genomics_general/genomics.py", line 775, in groupFreqStats
    siteBaseCounts = np.apply_along_axis(binBaseFreqs,0,self.numArray[np.ix_(seqIdx,siteIdx)],asCounts=True)
TypeError: apply_along_axis() got an unexpected keyword argument 'asCounts'
Process Process-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "popgenWindows.py", line 44, in stats_wrapper
    statsDict.update(Aln.groupFreqStats())
  File "/Users/rfitak/Downloads/CAMEL/genomics_general/genomics.py", line 775, in groupFreqStats
    siteBaseCounts = np.apply_along_axis(binBaseFreqs,0,self.numArray[np.ix_(seqIdx,siteIdx)],asCounts=True)
TypeError: apply_along_axis() got an unexpected keyword argument 'asCounts'

My input file looks like this:

#CHROM	POS	DC158_53	DC269	DC399	DC400	DC402	DC408	DC423	DROM802	Drom439	Drom795	Drom796	Drom797	Drom800_55	Drom806	Drom816	Drom820	WC214	WC216	WC218	WC219	WC220	WC247	WC303_108	WC304	WC305
NW_006210177.1	864	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/A	G/G	A/A	G/A	G/A	G/G	G/G	A/A	A/A	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G
NW_006210177.1	995	T/T	C/C	T/T	T/T	T/T	C/T	C/C	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	C/T	C/C	C/C	C/T	T/T	C/T	T/T	T/T	T/T
NW_006210177.1	1022	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/A	A/A	G/G	G/A	G/A	A/A	A/A	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G
NW_006210177.1	1072	G/T	G/G	G/T	T/T	T/T	G/T	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/T	G/G	G/G	G/T	T/T	G/T	T/T	T/T	T/T
NW_006210177.1	1141	C/C	C/C	C/C	C/C	C/C	C/C	C/C	C/T	T/T	C/C	C/T	C/T	T/T	T/T	C/C	C/C	C/C	C/C	C/C	C/C	C/C	C/C	C/C	C/C	C/C
NW_006210177.1	1155	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/A	G/G	A/A	G/A	G/A	G/G	G/G	A/A	G/A	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G	G/G
NW_006210177.1	1173	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/G	G/G	T/T	T/G	T/G	G/G	G/G	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T
NW_006210177.1	1255	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/C	C/C	T/T	T/C	T/C	C/C	C/C	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T
NW_006210177.1	1316	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/A	T/T	A/A	T/A	T/A	T/T	T/T	A/A	T/A	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T	T/T

Furthermore, when I use individual based statistics, i.e. indHet or indPairDist the program runs without issue. I am using:
python 2.7.6
numpy==1.15.4

AssertionError: Sample ploidy (2) doesn't match number of sequences (3)

Hi Simon,
I used popgenWindows.py to analysize the pi and fst of two populations with the commnad：
python2 /public/home/kefushi/software/genomics_general/popgenWindows.py -w 100000 -s 5000 -m 100 -g /public/home/kefushi/data/dbm/bi_o_gene20k_095_005.recode.geno.gz -o /public/home/kefushi/data/dbm/NA_AF.csv -f pairs -T 1 -p NA NA_M_11,NA_M_12,NA_M_5,NA_M_6,NA_M_8 -p AF POP10_AF_3,POP10_AF_4,POP10_AF_5,POP10_AF_7,POP10_AF_8

but it came up with error:
Traceback (most recent call last):
File "/public/home/kefushi/software/miniconda3/envs/python2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/public/home/kefushi/software/miniconda3/envs/python2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/public/home/kefushi/software/genomics_general/popgenWindows.py", line 39, in stats_wrapper
Aln = genomics.genoToAlignment(window.seqDict(), sampleData, genoFormat = genoFormat)
File "/public/home/kefushi/software/genomics_general/genomics.py", line 1019, in genoToAlignment
if ploidy is not None: assert len(seqList) == ploidy, "Sample ploidy (" + str(ploidy) + ") doesn't match number of sequences (" + str(len(seqList)) + ")"
AssertionError: Sample ploidy (2) doesn't match number of sequences (3)

1519 windows queued, 1514 results received, 0 results written.

=========================================================

But if I change ploidy from 2 (which is the right number) into 3, it works.

Any suggerstions here?

Cheers,

Fushi

Testing introgression of individuals

Hi there,
Have been really enjoying your code and have been reading your papers. I got a quick questions here about using ABBABABAwindows.py in genomics_general to test whether there are certain windows exhibit evidence of introgression at the individual level.
For ABBABABAwindows.py (in the example of the manual page):

python ABBABABAwindows.py -g /zoo/disk1/shm45/vcf/set62/set62.chr21.DP5GQ30.AN100MAC1.diplo.gz -f phased -o output.csv -w 100000 -m 100 -s 100000 -P1 A -P2 B -P3 C -O D -T 10 --minData 0.5 --popsFile pops.txt --writeFailedWindows --polarize &

Could A, B, C, and D be an individual instead of a population? Thank you!

NA output of raxml_sliding_windows.py

Hi Simon,
Recently I was following your instruction in general genomics pakage and running the tree for sliding wimdow with raxml. However my output file, .tree.gz file only showed 'NA' for each lines. I was wondering what may result to such a file.
Following is my command line:
python2.7 /home/caiyc/software/genomics_general/phylo/raxml_sliding_windows.py -g Bi_SNP_9_pop_DF_miss_polar_paralog_Density.geno -f phased --windType coordinate -w 50000 -M 500 -S 50000 -p slidingTREE/phased --individuals FC-01,HC-02,HM-01,HN-02,PC-01,PN01,ZG_M1 --outgroup M_kudoi --raxml /home/caiyc/software/RAxML_8.2.8/raxmlHPC --model GTR -T 4

Thanks!
caiyc

"ValueError: zero length field name in format" in parseVCF

Hi again Simon,

Sorry about so many questions, but what could be causing the error below in parseVCF?


[rmarco3@mike1 29SEP2020]$ python /home/rmarco3/bin/simon-martin-scripts/VCF_processing/parseVCF.py -i REMERGED_post_filter_MASKED.recode.vcf > test-keep-indels.geno
Traceback (most recent call last):
  File "/home/rmarco3/bin/simon-martin-scripts/VCF_processing/parseVCF.py", line 369, in <module>
    ploidyMismatchToMissing=args.ploidyMismatchToMissing,expandMulti=args.expandMulti)
  File "/home/rmarco3/bin/simon-martin-scripts/VCF_processing/parseVCF.py", line 173, in getGenotypes
    expandMulti=expandMulti)
  File "/home/rmarco3/bin/simon-martin-scripts/VCF_processing/parseVCF.py", line 135, in getGenotype
    self.genoData[sample]["GT"], ploidy)) 
ValueError: zero length field name in format

Any suggestion on how to analyze synonymous and nonsynonymous site pi, Fst, Dxy

Hi. This is a great tool I've been using for a manuscript I'm working on so thanks!
I had one comment from a reviewer asking to do the Dxy analysis by synonymous or nonsynonymous variations. Do you have any suggestions what the best approach would be if I wanted to do this?

Problem with ABBABABAwindows.py

Hi,
Thank you very much for your scripts.
I have been struggling a lot with ABBABABAwindows.py
I am using python 3

When I run this command
python3 ABBABABAwindows.py -g $FILE.geno.gz -o $FILE.fd2.csv -w 20 -m 10 -s 20 -f phased --minData 0.5 -P1 P1 -P2 P2-P3 P3 -O P4--popsFile pop_file.txt

I get

Traceback (most recent call last):
File "ABBABABAwindows.py", line 256, in
windowQueue = SimpleQueue()
TypeError: init() missing 1 required keyword-only argument: 'ctx'

and when I run
python3 ABBABABAwindows.py
I get

usage: ABBABABAwindows.py [-h] [--windType {sites,coordinate,predefined}] [-w sites] [-s sites] [-m sites]
[--overlap sites] [-D MAXDIST] [--windCoords WINDCOORDS] [--minData proportion] -P1
popName [[samples] ...] -P2 popName [[samples] ...] -P3 popName [[samples] ...] -O popName
[[samples] ...] [--popsFile POPSFILE] [--haploid sample names] [-g GENOFILE] [-o OUTFILE]
[--exclude EXCLUDE] [--include INCLUDE] -f {phased,pairs,haplo,diplo} [-T threads]
[--verbose] [--addWindowID] [--writeFailedWindows]

Can you please assist.
Thank you very much,

Error with distMat.py

Hi Simon,

Thanks for making these scripts public! I've found them extremely helpful. I'm trying to implement your script for pairwise distance matricies (--window cat) style but I either get the program finishing with no output, or this error attached below.

Here is the command I've been using: python3 distMat.py -g ../herb_finalfilt.geno -f phased --windType cat -o herb_finalfilt_raw.dist -w 100000 -s 100000 -m 250 --outFormat raw

You'll notice I made the min number of SNPs rather small as I was worried too many windows were being thrown out due to the output msgs "0 windows queued | 0 results received | 0 results written.

Traceback (most recent call last):
  File "distMat.py", line 314, in <module>
    windowQueue.put((windowsQueued,window))
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 204, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Thanks!!!

Why the ABBA, BABA, and D formula are different from MBE 2011 paper

Hi Simon,
I have been using your genomics library to calculate ABBA, BABA and D statistics from four populations. It runs well but I have two issues with the latest script.
1, I notices in genomics.py. The formulas of ABBA and BABA are different from MBE 2011 publication. It's
(1-p1)p2p3*(1-p4) + p1*(1-p2)*(1-p3)p4
in the script, but
(1-p1)p2p3(1-p4) in the paper. Why changed it and what's the difference?
2, I did a few SNPs manually. The D's calculated by ABBABABAwindows.py were equal to the output from D_new function, while in the ABBABABA function it was calculated by the old D function. I couldn't understand how it happened. The code chunk is like,
D = D(p1,p2,p3,p4) # I think this is the old D function
fd = fd(p1,p2,p3,p4)
fdm = fdm(p1,p2,p3,p4)
ABBA = ABBA(p1,p2,p3,p4) # and this is the new ABBA formula
BABA = BABA(p1,p2,p3,p4)

Output sites used

Hi,
Thank you again for the great script. It made my life much easier!
I am wondering if there is a way to output the sites used, i.e. the snps? This would help the downstream analyses of identifying alleles associated a certain trait.

Thank you very much

Extremely high fd output with ABBA-BABA script

Hi,
When using ABBABABAwindows.py to calculate introgression between groups, I come across such a strange value as the figure shows.

I know little about the ABBA-BABA statistics. But I've read the cited reference and learned about that the fd statitic will be underestimated compared to the D statistic. I wonder how did this high value happen? Is the value right?

The command I used is shown as below.
python2.7 ABBABABAwindows.py -g Selected.gono -f haplo -o ABBABABAoutput.csv -w 50000 -P1 Pop1 -P2 Pop2 -P3 Pop3 -O OG -T 10 --minData0.5 --popsFile Poplist --writeFailedWindows --haploid Q,W,E,R,T,Y,U,I,O,P,A,S,F,G,J,H

Thanks.

raxml_sliding_windows.py "--silent" option causing problems

I was just using your nice raxml_sliding_windows.py script. I ran into a bit of trouble with it. It turns out the "-- silent" option given to my raxml (RAxML version 8.0.26) is causing it to choke - seems it doesnt know this flag "invalid argument -- -" (or something similar). I fixed it by deleting "--silent" in the python script here:
raxCommand = raxml + " -s temp." + uniqueTag + ".phy -n " + uniqueTag + " -m " + model + og + " -V -f d -p 12345 --silent >>

wrong processing of agp "strand" in vcfChromTransfer.py

in agp file, the strand "-" represents "reverse complement", but in your coding, you just reverse the sequences.

    -snip-
    # line 128 - 133
    for vcfLine in vcfLines:
        fields = vcfLine.split("\t")
        assert fields[0] == chrom, "Something went wrong: Found chrom {} but expected chrom {}.".format(fields[0],chrom)
        fields[0] = newChrom
        fields[1] = str(newPos(int(fields[1]), start=int(start), newStart=int(newStart), newEnd=int(newEnd), reverse=reverse))
        outStream.write("\t".join(fields)+"\n")

you should also "complement" (A:T, T:A, G:C, C:G, ACG:CGT) the bases at the same time.

the error about popgenWindows.py

hi @simonhmartin
I have used the script of parseVCFs to the geno.gz file. it worked well and this was the results
.
While i run popgenWindows.py , it got errors ,
Traceback (most recent call last):
File "/vol3/agis/likui_group/yinhongwei/software/genomics_general/popgenWindows.py", line 381, in
for window in windowGenerator:
File "/vol3/agis/likui_group/yinhongwei/software/genomics_general/genomics.py", line 1894, in slidingCoordWindows
site = reader.nextSite(asDict = extractSpecificGTs)
File "/vol3/agis/likui_group/yinhongwei/software/genomics_general/genomics.py", line 1852, in nextSite
self.precompDict, addToPrecomp=self.precompDict["counter"]<self.precompDict["maxSize"])
File "/vol3/agis/likui_group/yinhongwei/software/genomics_general/genomics.py", line 1810, in parseGenoLine
"position": int(lineData[posCol]) if posCol >= 0 else None,
ValueError: invalid literal for int() with base 10: 'POS'
The script code is "${popgenWindows} -w 50000 -m 25000 -g sweep.mac1.chr_all.SNP.geno.gz -o sweep.mac1.csv.txt -f phased -T 4 -p popA -p popB --popsFile pops.txt"
Could you give me some advice?
your sincerely
Hongwei Yin

dictionary update sequence element #85 has length 0; 2 is required

Hi Simon,

What might be causing the error below?

Thanks!

[rmarco3@mike2 29SEP2020]$ python /home/rmarco3/bin/simon-martin-scripts/popgenWindows.py --ploidy 2 -T 8 -g REMERGED_post_filter_MASKED.recode.geno --windType coordinate -w 50000 -s 25000 --popsFile pops_file_for_martin -p asp -p din -f phased --verbose -o simon_stats_aspVdin
Traceback (most recent call last):
  File "/home/rmarco3/bin/simon-martin-scripts/popgenWindows.py", line 217, in <module>
    with open(args.popsFile, "rt") as pf: popDict = dict([ln.split() for ln in pf])
ValueError: dictionary update sequence element #85 has length 0; 2 is required

Here's the head of my geno file and my population file:


#CHROM	POS	UWBM_RTB349	LSU_B94570	UWBM_RTB587	LSU_B94713	UWBM_RTB589	LSU_B94661	LSU_B38330	UWBM_RTB448	UWBM_RTB320	LSU_B94519	UWBM_RTB447	UWBM_RTB446	UWBM_RTB579	LSU_B94788	UWBM_RTB586	UWBM_RTB329	UWBM_RTB513	LSU_B94781	UWBM_RTB341	LSU_B87974	UWBM_RTB314	UWBM_RTB590	UWBM_RTB324	UWBM_DCS5320	LSU_B94627	UWBM_RTB438	UWBM_RTB511	UWBM_RCF2148	UWBM_RTB431	LSU_B94463	LSU_B87007	LSU_B38326	UWBM_RTB578	UWBM_RTB432	LSU_B94622	UWBM_RTB437	UWBM_RTB435	LSU_B94528	LSU_B94714	UWBM_RTB445	UWBM_RTB365	UWBM_RTB384	UWBM_RTB376	UWBM_DCS5409	UWBM_RTB381	LSU_B34686	UWBM_RTB374	UWBM_RTB368	UWBM_RTB370	UWBM_RTB386	LSU_B18624	UWBM_RTB378	UWBM_RTB367	UWBM_RTB380	UWBM_RTB551	UWBM_RCF2167	UWBM_RTB540	UWBM_RTB524	UWBM_RTB573	UWBM_RTB541	UWBM_RCF2166	UWBM_RTB526	UWBM_RCF2164	UWBM_VGR296	CUMV_52575	UWBM_RTB561	UWBM_RTB537	UWBM_RTB545	UWBM_RTB406	UWBM_RTB413	UWBM_RTB412	UWBM_RTB410	LSU_B58322	UWBM_RTB398	LSU_B590	UWBM_RTB397	UWBM_RTB401	UWBM_RTB400	UWBM_RTB402	UWBM_RTB415	UWBM_RTB414	UWBM_RTB411	UWBM_RTB396
19	1013	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	T/T	T/C	T/T	T/T	T/T	T/T	T/T	T/T	T/T	C/C	T/T	T/T	T/T	T/T	T/T
19	1624	G/G	G/A	G/G	G/G	G/A	A/A	G/G	G/G	G/A	G/G	G/G	G/G	G/A	G/G	G|A	G/A	G/G	G/G	G/A	G/A	G/A	G/A	G/G	G/G	G/G	G/G	G/G	G/G	G/G	N/N	G/G	N/N	G/A	G/G	G/G	G/A	G/G	G/G	G/A	G/G	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N	N/N


LSU_B38326	con
LSU_B38330	con
LSU_B87007	con
LSU_B87040	con
LSU_B87974	con
LSU_B94463	con
LSU_B94519	con
LSU_B94528	con
LSU_B94570	con
LSU_B94622	con

popgenWindows.py error

Hi Simon,
used popgenWindows.py from your general genomics pakage to estimate diversity and divergence.
I was running the following command:
python popgenWindows.py -w 50000 -m 5000 -s 50000 --windType coordinate -g all_rename_filter2_geno0_phased_renammed_chr.geno.gz -o all.Fst.Dxy.pi.csv.gz -f phased -T 8 -p ang ang -p bux bux -p der der -p eur eur -p imb imb -p ory ory -p scr scr -p spe spe -p str str -p syl syl

and got the following message:
Starting 8 worker threads
Traceback (most recent call last):
File "popgenWindows.py", line 420, in
exclude = scafsToExclude)
TypeError: slidingCoordWindows() got an unexpected keyword argument 'headerLine'

I rechecked everithing but could not find out what I am doing wrog.
Your advice would be much appreciated
Joro

simonhmartin / genomics_general Goto Github PK

genomics_general's People

Contributors

Stargazers

Watchers

Forkers

genomics_general's Issues

Recommend Projects

Recommend Topics

Recommend Org