atkinson-lab / tractor

Scripts for implementing the Tractor pipeline

License: MIT License

Python 0.79% Jupyter Notebook 98.62% R 0.59%

gwas-model tractor tractor-pipeline

tractor's People

rxseadew stoikumo lauralew11 medupalli kscott-1 nakanof

tractor's Issues

No gzip support in run_tractor.R

I have come across a minor issue where extract_tracts.py allows for compressing output, yet run_tractor.R cannot accept these .gz files.

PR to follow.
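
Until run_tractor.R reads compressed input, one workaround is to decompress the extract_tracts.py outputs before passing them on. A minimal Python sketch, assuming the compressed files simply carry a .gz suffix (the helper name decompress_gz is hypothetical and not part of Tractor):

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(gz_path):
    """Write an uncompressed copy of gz_path next to it and return the new path.

    Hypothetical helper, not part of Tractor: it only strips a trailing
    ".gz" from the file name and streams the contents out.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path}")
    out_path = gz_path.with_suffix("")  # e.g. x.dosage.txt.gz -> x.dosage.txt
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

The original .gz file is left in place, so the compressed copy can still be archived.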

GWAS with genotype x phenotype interaction

Hello:
I was wondering whether it is possible to run a GWAS with genotype x phenotype interaction using Tractor and, if so, how this would be done.
Thank you for your help

Getting errors after running extract_tracts.py

Hi,

After running

$ python3 Tractor/scripts/extract_tracts.py --vcf subset1/query_file_phased.vcf --msp subset1/query_results.msp --output-dir output/ --num-ancs 8
INFO (__main__ 91): # VCF File                    : subset1/query_file_phased.vcf
INFO (__main__ 92): # Prefix of output file names : query_file_phased
INFO (__main__ 93): # VCF File is compressed?     : False
INFO (__main__ 94): # Number of Ancestries in VCF : 8
INFO (__main__ 95): # Output Directory            : output/
INFO (__main__ 101): Creating output files for 8 ancestries
INFO (__main__ 116): Iterating through VCF file
Traceback (most recent call last):
  File "/path/Tractor/scripts/extract_tracts.py", line 240, in <module>
    extract_tracts(**vars(args))
  File "/path/Tractor/scripts/extract_tracts.py", line 170, in extract_tracts
    window = (ancs_entry[0], int(ancs_entry[1]), int(ancs_entry[2]))
ValueError: invalid literal for int() with base 10: '0.0'

Both the VCF and msp are output files from the LAI tool G-Nomix, run with the pre-trained model and 8 ancestries. The VCF seems complete, so I suspect the issue lies in the msp file, which has the following header:

#Subpopulation order/codes: EUR=0       EAS=1   NAT=2   AFR=3   SAS=4   AHG=5   OCE=6   WAS=7
#chm    spos    epos    sgpos   egpos   n snps  sample_1 sample_2 ... sample_n
        13273   779322  0.0     2.02544 696     5       5       5
...

Maybe the issue is with the EUR=0 tag, the empty chromosome field, or the 0.0 in the centimorgan positions.

Any help will be appreciated.
Thank you.
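
One possible explanation, not confirmed by the maintainers in this thread, is the empty chromosome field: if the line is split on arbitrary whitespace, the leading empty field disappears, every column shifts left by one, and the genetic position 0.0 lands where an integer end position is expected. A sketch, assuming the msp file is tab-delimited with the column order shown in the header above (parse_msp_window is a hypothetical name, not Tractor's function):

```python
def parse_msp_window(line):
    """Parse one data line of an msp file, keeping empty fields in place.

    Assumes tab-delimited columns: chm, spos, epos, sgpos, egpos, n snps,
    followed by one ancestry call per haplotype.
    """
    fields = line.rstrip("\n").split("\t")
    return {
        "chm": fields[0],            # may be empty, as in the report above
        "spos": int(fields[1]),
        "epos": int(fields[2]),
        "sgpos": float(fields[3]),   # genetic positions are floats (cM)
        "egpos": float(fields[4]),
        "n_snps": int(fields[5]),
        "calls": [int(c) for c in fields[6:]],
    }


line = "\t13273\t779322\t0.0\t2.02544\t696\t5\t5\t5\n"

# Splitting on any whitespace drops the empty chromosome field, so the
# third item is the genetic position rather than the end position.
print(line.split()[2])                 # 0.0
print(parse_msp_window(line)["epos"])  # 779322
```

If this is the cause, filling in the chromosome column of the msp file before running extract_tracts.py would be the least invasive thing to try.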

IndexError

Hi,

I got this error while running the tract-extraction script, ExtractTracts.py. The input VCF file is not phased, though RFMix still gave reasonable results. So my question is: does Tractor require a phased VCF file to run this step? Thanks a lot for the help!

INFO (main 42): Creating output files for 2 ancestries
INFO (main 48): Opening input and output files for reading and writing
Traceback (most recent call last):
File "/Tractor/ExtractTracts.py", line 184, in
extract_tracts(**vars(args))
File "/Tractor/ExtractTracts.py", line 126, in extract_tracts
geno_b = str(geno[1])
~~~~^^^
IndexError: list index out of range
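
The traceback shows the script indexing the second element of a split genotype, so a genotype that does not split into two alleles would fail in exactly this way. Assuming the split is on the phased separator "|" (an inference from the traceback, not something confirmed here), unphased calls such as 0/1 would trigger it. A small pre-check sketch with hypothetical function names:

```python
def split_phased(gt):
    """Return the two alleles of a phased genotype such as '0|1'."""
    alleles = gt.split(":")[0].split("|")  # drop any FORMAT subfields first
    if len(alleles) != 2:
        raise ValueError(f"genotype {gt!r} is not a phased diploid call")
    return alleles[0], alleles[1]


def first_unphased(vcf_lines):
    """Return (CHROM, POS, sample index, genotype) for the first bad call, else None."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        for i, gt in enumerate(fields[9:]):  # sample columns start at index 9
            try:
                split_phased(gt)
            except ValueError:
                return fields[0], fields[1], i, gt
    return None
```

Running first_unphased over the lines of the input VCF points at the first record that would need phasing or filtering before tract extraction.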

Possibility of outputting SNP effect variance in Tractor output

Hi,

Based on what I can see from the tractor output, there is no error term output for the SNP effect. I was wondering if there is some option I can set in the program so that the error term for each ancestral effect estimate is outputted?

Thanks for any help you can offer.

`extract_tracts.py`: NameError: name 'output_files' is not defined

Hi,

I was trying to run the extract_tracts.py script, but it threw an error:

$ python3 /u/home/b/biona001/Tractor/scripts/extract_tracts.py \
    --vcf /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/vcf_phased/chr22.vcf.gz \
    --msp /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/output/chr22.msp.tsv \
    --num-ancs 3 \
    --output-dir /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/tracks

INFO (__main__ 90): # VCF File                    : /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/vcf_phased/chr22.vcf.gz
INFO (__main__ 91): # Prefix of output file names : chr22
INFO (__main__ 92): # VCF File is compressed?     : True
INFO (__main__ 93): # Number of Ancestries in VCF : 3
INFO (__main__ 94): # Output Directory            : /u/home/b/biona001/project-loes/ForBen_genotypes_subset/LAI/tracks
INFO (__main__ 100): Creating output files for 3 ancestries
Traceback (most recent call last):
  File "/u/home/b/biona001/Tractor/scripts/extract_tracts.py", line 239, in <module>
    extract_tracts(**vars(args))
  File "/u/home/b/biona001/Tractor/scripts/extract_tracts.py", line 102, in extract_tracts
    output_files[f"dos{i}"] = f"{output_path}anc{i}.dosage.txt{file_extension}"
NameError: name 'output_files' is not defined

Any tips/suggestions would be highly appreciated.

Cannot find example of phenotype file

Hello,
I am unable to find the exact format of the Phe.txt file described here:

https://github.com/Atkinson-Lab/Tractor-tutorial/blob/main/Local.md

python RunTractor.py --hapdose ADMIX_COHORT/ASW.phased --phe PHENO/Phe.txt --method linear --out SumStats.tsv

Please help
Thank you
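
Another issue on this page shows a phenotype file whose header is IID and y, separated by a tab. A sketch that writes a file in that shape; the sample IDs, the values, and the extra covariate columns (age, sex) are made up for illustration, and the tutorial remains the authority on which columns are accepted:

```python
import csv


def write_phenotype_file(path, rows, covariates=()):
    """Write a tab-delimited phenotype file with IID, y, then any covariates.

    rows is a list of dicts keyed by column name. The covariate columns are
    an assumption for illustration; only the IID and y header comes from the
    issue thread.
    """
    columns = ["IID", "y", *covariates]
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


write_phenotype_file(
    "Phe_example.txt",
    [
        {"IID": "sample_1", "y": 1, "age": 54, "sex": 0},
        {"IID": "sample_2", "y": 0, "age": 61, "sex": 1},
    ],
    covariates=("age", "sex"),
)
```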

Query cohort imputed with TOPMed reference panel (Michigan server hg38 build)

Hi, we have a genotyping cohort with samples (N>5000) of multiple races, which has been imputed using the TOPMed Imputation Server (https://topmedimpute.readthedocs.io/en/latest/getting-started/), because this is the largest multi-ethnic reference panel to date. The imputed output from the server is not phased, and it is in the hg38 build.

Can you briefly describe how I should proceed if my intent is to run a Tractor GWAS with all 5 major ancestries using the logistic model?

Do I need to lift over the imputed VCF files from hg38 to hg19, so that I can use the 1000G_Phase3 reference panel (which is hg19) for phasing (SHAPEIT), LAI (RFMix), and so on?
Or
Should I use the 1000G_Phase3 (hg38) phased VCF files, and convert other supporting files, such as the genetic map, to the hg38 build?
Have you tested extract_tracts.py --num-ancs 5? If so, do you foresee any problems with the results I may see?
I'm assuming we do not need to provide any population-based covariates (for example, PCs derived from EIGENSTRAT) to Tractor. It wasn't mentioned anywhere in the tutorial, but I'm assuming so because this is a local-ancestry-aware GWAS.

I'm not sure if I'm asking all the right questions to plan out my work.
Thank you for your time.

Related samples

Hi,

I would like to apply Tractor to a dataset with high relatedness, but it looks like Tractor does not accept a GRM.
Besides using an unrelated subset, do you have other suggestions?

Thanks,
Wanying

How were the painted karyograms made?

Hail does only linear and not logistic regression?

Hi,

I've seen that the Hail version of the Tractor pipeline only does linear regression (in the example).
What about logistic regression? Does Hail perform logistic regression too?

Thank you

Error while running run_tractor.R

After running the script, I got the following error:

Error in rep(NA, ncol(mat)) : invalid 'times' argument
Calls: RunTractor -> subset_mat_NA -> t -> sapply -> lapply -> FUN
Execution halted

The output was the following:

CHR	POS	ID	REF	ALT	AF_anc0	AF_anc1	AF_anc2	AF_anc3	AF_anc4	AF_anc5	AF_anc6	AF_anc7	LAprop_anc0	LAprop_anc1	LAprop_anc2	LAprop_anc3	LAprop_anc4	LAprop_anc5	LAprop_anc6	LAprop_anc7	LAeff_anc0	LAeff_anc1	LAeff_anc2	LAeff_anc3	LAeff_anc4	LAeff_anc5	LAeff_anc6	LApval_anc0	LApval_anc1	LApval_anc2	LApval_anc3	LApval_anc4	LApval_anc5	LApval_anc6	Geff_anc0	Geff_anc1	Geff_anc2	Geff_anc3	Geff_anc4	Geff_anc5	Geff_anc6	Geff_anc7	Gpval_anc0	Gpval_anc1	Gpval_anc2	Gpval_anc3	Gpval_anc4	Gpval_anc5	Gpval_anc6	Gpval_anc7
chr1	662622	chr1:727242:G:A	G	A	NA	NA	NA	NA	NA	0.025	NA	NA	0	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0.726245	NA	NA	NA	NA	NA	NA	NA	0.0792779608694168	NA	NA

I suspect that R is not handling the NAs well?

IndexError: list index out of range

Hello,

I am having an issue with the ExtractTracts.py portion of Tractor. I have a fairly large primate dataset of 887 individuals, 9 of which are reference animals (equally split between two closely related species). I have used Tractor on this same dataset about 4 or 5 times now without issue. I have been trying different combinations of reference panels following the protocol of RFMix and then Tractor and it has worked up until now. Using samples extracted from the same master VCF as before, run through RFMix using the same code, I am now getting this error using ExtractTracts.py and am not sure what to do.

The following is the code I used to generate the error:

module load python

for i in {1..20}; do
python ExtractTracts.py \
--msp $SCRATCH/rfmix/Apr2_2023_UnrelatedandFounders.QueryPanel.Chr${i} \
--vcf-prefix $SCRATCH/Beagle_Software/Beagle5.4/Apr2_2023_UnrelatedandFounders.QueryPanel.Chr${i} \
--zipped \
--output-path Apr2_2023_UnrelatedandFounders.QueryPanel.Chr${i} \
--num-ancs 2; done 2>Tractor_error5.log

The error is shown below; it also occurs if I run a single chromosome on its own:

INFO (main 42): Creating output files for 2 ancestries
INFO (main 48): Opening input and output files for reading and writing
Traceback (most recent call last):
File "ExtractTracts.py", line 184, in
extract_tracts(**vars(args))
File "ExtractTracts.py", line 126, in extract_tracts
geno_b = str(geno[1])
IndexError: list index out of range

Thank you for your help!

Example Hail code for binary traits

Hi,
Thanks for creating a very useful method. I was wondering if you have example Hail code (similar to Tractor-Example-GWAS.py / Tractor-Example-GWAS.ipynb) for running Tractor on binary traits? The issue is that the hl.agg.linreg() Hail function used in the example code doesn't have an equivalent function for logistic regression. There is the hl.logistic_regression_rows() function, but it only allows a single predictor (x) to be used, thus it's not possible to also include the haplotype counts or the non-index allele dosage. Of course one could implement this outside of Hail, but if you already have a solution it would be easier. Any insight would be very helpful.

Thanks,
Stephane

Index error using imputed VCF when extracting tracts

Hi!
I am encountering an error for which a few issues have already been raised, but I have been trying to troubleshoot it and still haven't worked it out. The thing is I am using imputed files (from TopMed), but they have been filtered (by MAF and INFO) using PLINK. RFMix handled these vcf without problems, but when running the ExtractTracts.py, I get this message:

File "/mnt/lustre/scratch/nlsas/home/usc/gb/sdd/lat23/TRACTOR/Tractor/scripts/ExtractTracts.py", line 126, in extract_tracts
geno_b = str(geno[1])

This is the VCF header:

##fileformat=VCFv4.3
##fileDate=20231123
##source=PLINKv2.00
##filedate=2023.3.13
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">
##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">
##pipeline=michigan-imputationserver-1.7.1
##imputation=minimac4-1.0.2
##phasing=eagle-2.4
##panel=apps@[email protected]
##r2Filter=0.3
##contig=<ID=5>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

Sample genotypes are split into columns by "\t", and genotype calls are separated by "|". It works fine when using the raw files from imputation instead (without filtering), but it takes a lot of time just to run chr22 (and the output files are also very large). I have tried modifying line 87 of the script in case the problem was the "\t" separator between samples, but it still throws the error. I would much appreciate your help here!

Thank you! :)

NaNs for some variants in output

I'm running a 3-ancestry Tractor analysis and I get NaN for some variants. I've noticed it's only for those variants that have one or more populations with AF_ancX = 0.

Can you explain whether Tractor is actually choosing not to calculate p-values for such positions, and why?
Can tractor not treat such positions as a regular globalized GWAS association?

I verified that when I run tractor on the same input, but only give 1 population hap and dosage files as input, I get Pvalue results for all those positions. This would mean that tractor is not doing a local ancestry aware GWAS when only 1 population dosages are provided, is it?

Example output:
Three-population Tractor output, where anc0 = AFR, anc1 = EUR, and anc2 = AMR (my cohort is predominantly AFR and EUR):
CHROM POS ID REF ALT AF_anc0 AF_anc1 AF_anc2 LAprop_anc0 LAprop_anc1 LAprop_anc2 LAeff_anc0 LAeff_anc1 LApval_anc0 LApval_anc1 Geff_anc0 Geff_anc1 Geff_anc2 Gpval_anc0 Gpval_anc1 Gpval_anc2
8 205821 . C T 0.03498 0.00047 0.00094 0.41455 0.53202 0.05342 0.5867933365769665 -0.275980166234514 0.0012209901508823107 0.15753876256506594 0.04648066093110761 -17.450806869947634 -18.175099461425173 0.863883930882097 0.9989229240723104 0.9995039199153428
8 206716 . G C 0.01773 0.00218 0.0066 0.41455 0.53202 0.05342 0.5843741523379564 -0.28597165650323303 0.001261827582320208 0.1432488601362374 0.055573211202005056 1.330571431049564 -17.54550391111194 0.8882255510926501 0.19749680296697525 0.9988180297161879
8 206747 . C G 0.01713 0.00218 0.0066 0.41455 0.53202 0.05342 0.5837735937670595 -0.2860281420231862 0.001276367249093116 0.14317199618539617 0.09385740867275061 1.3307863458279647 -17.54529555393314 0.8125228667403726 0.1974232992189573 0.9988180556108245
8 207242 . G A 0.04616 0.00019 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 208175 . T G 0.06486 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 208227 . A C 0.03462 0.00047 0.00094 0.41455 0.53202 0.05342 0.58649948554488 -0.275949037607527 0.0012281548283880866 0.1575858179938755 0.05462753081609236 -17.450673605916016 -18.17479306605417 0.8410591302452655 0.9989229409247578 0.9995039282782405
8 209006 . G C 0.0611 0.00028 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 211365 . G A 0.0345 0.00047 0.00094 0.41455 0.53202 0.05342 0.5864994855448779 -0.27594903760753037 0.0012281548283881785 0.1575858179938706 0.05462753081608959 -17.450673605916016 -18.174793066054686 0.8410591302452743 0.9989229409247578 0.9995039282782406
8 211559 . A T 0.03547 0.00558 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 211635 . A C 0.00255 0.1308 0.03393 0.41455 0.53202 0.05342 0.5633672510332808 -0.27457186750119905 0.001879978203044346 0.16478825079831816 -21.427942353297524 -0.2360630636660421 -20.542693693652247 0.9994105894895378 0.4225715386127862 0.9992785348646407
8 211831 . A G 0.03437 0.00047 0.00094 0.41455 0.53202 0.05342 0.5863420974023827 -0.27594723308428054 0.0012318491446336143 0.1575886887838127 0.05907664030957955 -17.450584915698936 -18.174632779883186 0.8283067215504775 0.9989229507441875 0.9995039326531734
8 212134 . T C 0.06595 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 212431 . G A 0.01628 0.0 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 212585 . C T 0.03243 0.00047 0.00094 0.41455 0.53202 0.05342 0.586727459572817 -0.27604712744082566 0.0012228544583496716 0.15743135595710156 0.04933957840521465 -17.450826454128542 -18.175186862413863 0.8606224880774684 0.9989229226291162 0.999503917529778
8 212596 . G A 0.03207 0.00047 0.00094 0.41455 0.53202 0.05342 0.5862340972984453 -0.27606973908617627 0.0012343370077641566 0.1573955498682199 0.0635453815950705 -17.450556446357133 -18.174733095827484 0.8210885360654389 0.9989229521486803 0.9995039299150978
8 212796 . G C 0.0351 0.00047 0.00094 0.41455 0.53202 0.05342 0.5872028540737119 -0.27596227263118217 0.0012117519959782528 0.15756547160577028 0.035074669225241245 -17.45105960063877 -18.175506282013092 0.8975416557249976 0.9989228975956651 0.999503908811391
8 213962 . T G 0.01385 0.00019 0.00094 0.41455 0.53202 0.05342 0.5818012611002261 -0.2746938967833825 0.0013349880275129622 0.1596021880992482 0.4904680546030868 -17.893114903301132 -18.167808800088824 0.18618597007533477 0.9993093215293589 0.9995041189102106
8 213988 . C G 0.02769 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 214676 . C T 0.06413 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan
8 281474 . A G 0.0 0.00151 0.00094 0.41322 0.53319 0.05359 nan nan nan nan nan nan nan nan nan nan
8 294589 . A C 0.0 0.02304 0.00468 0.41298 0.53324 0.05378 nan nan nan nan nan nan nan nan nan nan

Single-population Tractor output for the same positions as above (anc0 = AFR):
CHROM POS ID REF ALT AF_anc0 LAprop_anc0 Geff_anc0 Gpval_anc0
8 205821 . C T 0.03498 1.0 0.6412522731228197 0.016018128393202897
8 206716 . G C 0.01773 1.0 0.6131118431244662 0.11886720160894383
8 206747 . C G 0.01713 1.0 0.6453075687149562 0.10098343499619976
8 207242 . G A 0.04616 1.0 0.12872366437608104 0.6611541190663489
8 208175 . T G 0.06486 1.0 0.9339493605726387 4.185585197316541e-07
8 208227 . A C 0.03462 1.0 0.6622026139834044 0.013408363791798283
8 209006 . G C 0.0611 1.0 0.3743355007576348 0.10392807669612138
8 211365 . G A 0.0345 1.0 0.6622026139834056 0.013408363791798042
8 211559 . A T 0.03547 1.0 0.6668868413362581 0.00735160056366909
8 211635 . A C 0.00255 1.0 -20.060908661749846 0.9990991633464656
8 211831 . A G 0.03437 1.0 0.6659527732795265 0.012891117266950454
8 212134 . T C 0.06595 1.0 0.9099151786072148 7.449635967596704e-07
8 212431 . G A 0.01628 1.0 1.273847584139422 1.1416550191617484e-05
8 212585 . C T 0.03243 1.0 0.6537431021225084 0.018000536254355784
8 212596 . G A 0.03207 1.0 0.6657178634342557 0.015993569517582372
8 212796 . G C 0.0351 1.0 0.6436369321530563 0.01624838163305082
8 213962 . T G 0.01385 1.0 1.0937502687294598 0.0028722620116477713
8 213988 . C G 0.02769 1.0 0.1295030868752873 0.7353893286125512
8 214676 . C T 0.06413 1.0 0.9386673906669657 3.2801637359264637e-07
8 281474 . A G 0.0 1.0 nan nan
8 294589 . A C 0.0 1.0 nan nan
I'm also attaching these outputs as files here, since they may not render properly.
github_issue_output_1pop.txt
github_issue_output_3pops.txt

Can you help me understand this, and advise on how I could combine the two runs to get the best of both results?

Thank You.
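
One plausible reading of this pattern, which nobody in the thread confirms, is that a regression cannot estimate an effect for a predictor that never varies: when an ancestry carries no copies of the alternate allele (AF_ancX = 0), its dosage column is all zeros, and a joint fit that includes that column fails. A sketch for flagging such variants before fitting, with hypothetical names and toy data:

```python
def constant_predictors(dosages_by_ancestry):
    """Return the ancestries whose dosage column has a single distinct value.

    dosages_by_ancestry maps an ancestry label to the per-sample dosages at
    one variant. A constant column carries no information about the effect.
    """
    return [
        anc
        for anc, dosages in dosages_by_ancestry.items()
        if len(set(dosages)) <= 1
    ]


variant = {
    "anc0": [0, 1, 0, 2, 1],
    "anc1": [0, 0, 1, 0, 0],
    "anc2": [0, 0, 0, 0, 0],  # no alternate alleles on this ancestry
}
print(constant_predictors(variant))  # ['anc2']
```

Variants flagged this way could in principle be tested with a model that omits the constant term, though whether that is statistically appropriate is a question for the maintainers.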

Error in ExtractTracts.py Script

I have encountered the following error a few times using the updated ExtractTracts script:
Traceback (most recent call last):
  File "/home/puckett3/software/Tractor/ExtractTracts.py", line 163, in <module>
    extract_tracts(**vars(args))
  File "/home/puckett3/software/Tractor/ExtractTracts.py", line 106, in extract_tracts
    geno_b = str(geno[1])
IndexError: list index out of range

I have checked that the number of samples in the VCF is half the number in the MSP file.
I have also run the script changing the second header line in the MSP file from:
#chm spos epos sgpos egpos n snps
to
#chm spos epos sgpos egpos nsnps

Yet I get the same error.

Do you have any suggestions for troubleshooting?

With thanks,
Emily

Dependency on RFMIX Output Format

Hi! I have been looking at Tractor for use in a new project. The local ancestry program we are likely to use in this project is Flare. It outputs predicted ancestry in VCF format with AN1 and AN2 as fields, and it predicts ancestry only for the variants at which the input file has GT data.

Given that Tractor seems to require input in RFMIX format, there are some concerns with attempting to use the program. Is there anything in development to support other LA programs such as Flare?

Any help would be appreciated, thanks!

Question regarding VCFs and issue with the R script

I'm trying to perform a 3-way model.

Is the R script limited to a maximum number of samples/features? After running

Rscript path/Tractor/scripts/run_tractor.R  \
    --hapdose path/chr21/chr21.annotated.LiftOver.dose \
    --phe path/phe.txt 
    --method logistic --out sumstats.tsv

I get the following output:

Tractor Script Version: 1.1.0 
Loading required package: optparse
Running Tractor... 
Error in data[[1]] : subscript out of bounds
Calls: RunTractor
Execution halted

After a quick look up, it's clear that I'm having an issue with the hapcount/dosage files, but I'm unsure what's going on.

On the other hand, I would like to use PLINK in order to test alternative models using the VCFs that one gets after running extract_tracts.py. According to your published paper, I would need to run 3 different GWAS and then perform a meta-analysis, albeit I'm unsure of something.

As far as I understand, in this case the complete model would be

$$\text{logit} (p)= \beta_0 + \beta_{LA1}LA_1 +\beta_{LA2}LA_2 + \beta_{G0} G_0 + \beta_{G1}G_1 + \beta_{G2} G_2 $$

If, let's say, I would like to test the deconvolved VCF from ancestry 0, would you recommend using the haplotype counts from the other two ancestries as covariates? Why or why not? In your wiki you ignore the counts. Why? It also intrigues me that you are not using principal components as covariates.

Thanks.

Switch from per chromosome to autosomes

Hello Dr. Atkinson,

Thank you so much for these scripts and example code.

In the RFMix_v2 step, your example code produces per chromosome rfmix output.
In the Extract Tracts step, it's unclear whether you've used per chromosome or an autosome file (I used per chromosome MSP files and VCF files).
However in the example Hail code you've read in an autosomes.anc0.dosage.txt file.

I'm wondering where, when, and how the autosomes file should be made. Or rather, did you run RFMix_v2 on the whole genome instead of by chromosome as the example suggests?

Thank you for clarifying!
Heidi
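
If an autosome-wide file is needed, one option is to concatenate the per-chromosome extract-tracts outputs, keeping the header once. This is a sketch under the assumption that every per-chromosome file has an identical header line (same samples, same order); the thread does not say how the example autosomes file was actually made.

```python
def concat_keep_one_header(paths, out_path):
    """Concatenate text tables, writing the header from the first file only.

    Raises ValueError if any later file has a different header, since that
    would mean the sample columns do not line up.
    """
    first_header = None
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                header = fh.readline()
                if first_header is None:
                    first_header = header
                    out.write(header)
                elif header != first_header:
                    raise ValueError(f"header mismatch in {path}")
                for line in fh:
                    out.write(line)
```

With hypothetical per-chromosome names, this could be called as concat_keep_one_header([f"chr{i}.anc0.dosage.txt" for i in range(1, 23)], "autosomes.anc0.dosage.txt"), and repeated for each ancestry and for the hapcount files.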

NaN output

I am applying the pipeline on imputed data (chr 21) of 934 cases and 946 controls. The output I got contains all nan. Is it possibly due to the small sample size?

Typo on line 34 of ExtractTracts-Flags.py

mspfile = open(args.msp, + '.msp.tsv', 'r') #comma after args.msp should be removed

Version (number) of Tractor

Hi, all,

We are writing a paper that benchmarks the performance of Tractor against other GWAS methods. Is there a version number for Tractor, so that it would be easier for readers to track our simulation scenario? We would like to use the latest Tractor release, identified by its version number. Thanks!

Best,
Zikun

Error message in the tutorial

Hi,

I was following the tutorial with the example dataset, and I got the error message when I was running the following code,

python Tractor/ExtractTracts.py
--msp ADMIX_COHORT/ASW.deconvoluted
--vcf ADMIX_COHORT/ASW.phased
--zipped
--num-ancs 2

The error message is

File "Tractor/ExtractTracts.py", line 24
def extract_tracts(msp: str, vcf_prefix: str, zipped: bool, zip_output:bool, output_path:str, num_ancs: int = 2):
^
SyntaxError: invalid syntax

Could you let me know where I went wrong? Thanks!
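
A SyntaxError pointing at the function signature is consistent with the script being run by Python 2, which does not understand type annotations such as msp: str. Other scripts quoted on this page also use f-strings, which need Python 3.6 or later, so the sketch below assumes 3.6 as the minimum; that threshold and the helper name are assumptions, not a documented requirement.

```python
import sys


def interpreter_is_new_enough(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the interpreter is at least the given (major, minor)."""
    return tuple(version_info[:2]) >= minimum


# Written without annotations or f-strings so it also runs on Python 2.
if not interpreter_is_new_enough():
    raise SystemExit(
        "Python %d.%d found; run the script with python3 instead"
        % tuple(sys.version_info[:2])
    )
```

On systems where python still points to Python 2, invoking the script as python3 Tractor/ExtractTracts.py is the first thing to try.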

Inclusion of stderr and other possible enhancements

Hi again, I am currently running some Tractor analysis and have the need to compile multiple cohorts from separate Tractor runs and meta analyze the results. To do that, we need standard error in the output files in addition to the BETA & P columns. The way your code is written, it extracts these values from the coefficient matrix of each glm (through the matrix helper function). It only keeps BETA & P, however, standard error & z score are also values in that matrix. Z score may not be important, but standard error should be reported. The changes are minimal to allow for this, but it would change the output files by default for everyone. I would say it should be that way, but maybe not what you want. With more of a script overhaul, BETA/SE/P could become a parameter for the user to specify. After all, there are also cases where users may exclusively care about effect size and not P due to low sample size or something.

One other enhancement I would recommend is to allow the user to specify the number of decimals to round to. The script prints 6 decimals, but some users may want fewer. This is an easy change using an optional flag with the default set to 6, so that nothing changes for users who don't care. Edit: On second thought, P will always need to be reported with as many decimals as possible, so this option could be confusing and is probably not worth it, since it would apply to non-P columns only.

I may open a PR soon for the SE & rounding issues since I am already writing those for my own use case, but feel free to comment any thoughts.

Best -
Kyle
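
For context on why the standard error matters, a fixed-effect inverse-variance meta-analysis weights each cohort's estimate by 1/SE². The sketch below is a generic textbook version, not code from Tractor, and the input numbers are invented:

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    Returns the pooled effect, its standard error, the z score, and a
    two-sided p-value from the normal distribution.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Two hypothetical cohorts reporting the same ancestry-specific effect.
beta, se, z, p = ivw_meta([0.20, 0.10], [0.05, 0.10])
```

Without SE in the per-cohort output, the weights cannot be computed directly and would have to be back-calculated from BETA and P, which loses precision for very small p-values.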

Does the pheno file accept missing values for a variable/covariable?

Hi,
I am wondering if the pheno file accepts missing values when running the GWAS locally...
Thanks

Differences in AF between original version and v 1.1.0

Hello,

I ran tractor on the same cohort using the same rfmix output and noticed that the AF for some variants differed between my original tractor output and the v1.1.0 output. Was there any difference in how AF was calculated in the new version of tractor?
Thanks for any help you can provide.

RunTractor.py ERROR Phenotype ID must match with Hapdose file ID

Hi
After running RunTractor.py (version 0.0.1), I am getting an error but also an output summary stats file. I think my output summary stats is reasonable, and this could just be a minor bug but I'm not sure. Hence I'm raising an issue here.

python3.7 Tractor/RunTractor.py --hapdose MEG_phase2.chr22.phase \
--method logistic \
--phe summ_using_PCsAPOL/Phe.eskd2021_nicole.txt \
--out summ_using_PCsAPOL/MEG_phase2.chr22.eskd2021.summ.tsv &> summ_using_PCsAPOL/MEG_phase2.chr22.eskd2021.summ.tsv.log

cat summ_using_PCsAPOL/MEG_phase2.chr22.eskd2021.summ.tsv.log 
v 0.0.1
Reading files....
------
MEG_phase2.chr22.phase.anc0.hapcount.txt
------
MEG_phase2.chr22.phase.anc1.hapcount.txt
------
MEG_phase2.chr22.phase.anc0.dosage.txt
------
MEG_phase2.chr22.phase.anc1.dosage.txt
------
Notice:
Tractor drop one local ancestry term for regression. Therefore, MEG_phase2.chr22.phase.anc1.hapcount.txt will not be used.
------
ERROR: Phenotype ID must match with Hapdose file ID
END of calculation


wc -l summ_using_PCsAPOL/Phe.eskd2021_nicole.txt
     2775 summ_using_PCsAPOL/Phe.eskd2021_nicole.txt
head -1 summ_using_PCsAPOL/Phe.eskd2021_nicole.txt
IID	y

head -1 MEG_phase2.chr22.phase.anc0.dosage.txt |wc -w
2779   (CHROM	POS	ID	REF	ALT	1000560	1000625	1000628	1000633	1000634 ..... )
head -1 MEG_phase2.chr22.phase.anc0.hapcount.txt |wc -w
2779   (CHROM	POS	ID	REF	ALT	1000560	1000625	1000628	1000633	1000634 ..... )

You can see the error message above. However, the header of the pheno file and the number of samples in the pheno and hapdosage files all tally correctly.

I just want to make sure that everything is correct.
Thank You.
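
The counts reported above do tally (2775 lines minus a header, and 2779 header columns minus five variant columns, both give 2774), but matching counts do not guarantee matching IDs: the same number of samples can still differ in spelling or order. A sketch for comparing the two, assuming the first five header columns of the dosage and hapcount files are CHROM, POS, ID, REF, ALT as shown above, and that the phenotype file's first column is IID (the function name is hypothetical, and whether RunTractor.py requires identical order is not stated in this thread):

```python
def compare_sample_ids(phe_lines, hapdose_header, n_variant_cols=5):
    """Compare phenotype IIDs with the sample columns of a hapdose header."""
    # Skip the phenotype header row and any blank trailing lines.
    phe_ids = [
        line.split("\t")[0].strip() for line in phe_lines[1:] if line.strip()
    ]
    geno_ids = hapdose_header.rstrip("\n").split("\t")[n_variant_cols:]
    return {
        "only_in_phe": sorted(set(phe_ids) - set(geno_ids)),
        "only_in_hapdose": sorted(set(geno_ids) - set(phe_ids)),
        "same_order": phe_ids == geno_ids,
    }
```

Two empty lists with same_order set to False would point to an ordering difference rather than missing samples.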