wansonchoi / cookhla


An accurate and efficient HLA imputation method.

Python 89.27% R 5.14% Perl 0.59% Shell 5.00%

cookhla's Issues

version of reference panel

Hello @WansonChoi,
Thank you for creating this HLA imputation method. I noticed that the software automatically converts the target data to hg18, but most current data are based on hg19 or hg38. What should I do if I want to construct my own reference panel in hg19 and impute against it?
Looking forward to your reply.

some errors happen when I run your example code

I just ran your example code, and it raised errors like those in the picture above.

python -m MakeGeneticMap \
    -i example/1958BC.hg19 \
    -hg 19 \
    -ref 1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC \
    -o MyAGM/1958BC+1000G_REF.EUR

An error also occurs when I run the following example code, which uses the genetic map file you provide in the example folder:

python CookHLA.py -i example/1958BC.hg19 -hg 19 -o MyHLAImputation/1958BC+HM_CEU_REF -ref example/HM_CEU_REF -gm example/AGM.1958BC+HM_CEU_REF.mach_step.avg.clpsB -ae example/AGM.1958BC+HM_CEU_REF.aver.erate

Thanks a lot for any help.

Meta-analysis

Hi there,
I'm wondering if it's possible to meta-analyse more than 2 datasets at once?
Thanks

Dependencies

In the readme it's mentioned that the user has to download the dependencies by themselves, but the dependencies are already there in the dependency folder which comes with CookHLA. Can someone please clarify?

Exon3 imputation for 3 days+?

Hi @WansonChoi and team, thank you for this helpful tool. I am using this pipeline with the T1DGC reference to impute HLA alleles for ~60k samples. I am performing this on a desktop with 32 GB memory and 4.29 GHz 6-core processor, using the options -mem 29g -mp 6 -nth 8. The software got to the point of Performing HLA imputation(exon3 / overlap: 5000) in about 2 hours, then was stuck there for 3 days. Would you be able to comment on whether this is normal, or what is the usual expected runtime for this many samples on such a machine?

I also got a warning initially that ~114k markers failed to lift down to hg18. Would that indicate the use of a different reference? Currently using hg19 which I believe is correct for the samples, and I tried different ones but CookHLA would give errors.

Thanks in advance!

Producing input files for MakeGeneticMap

I am having difficulty preparing the input files for this function. Starting from CRAM files, I convert to BAM and then to BED using samtools and bedtools, and then sort the BED files. The code and examples seem to suggest that I need to merge these BED files, but I have not managed to do so. I tried joining them all with cat and then running mergeBed, but it reported an out-of-order record whose start coordinate lies outside the region I specified (28999852, when I specified a start coordinate of 29000000). Does anyone have a solution?
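
Not an answer from the authors, but the clip, sort, and merge steps described above can be sketched in plain Python. This is a minimal sketch that assumes the files are bedtools-style interval BED records (chromosome, start, end); the function name `clip_sort_merge` and the toy records are made up for illustration. Note that the PLINK .bed/.bim/.fam fileset is a different, binary genotype format, so it is worth checking which kind of input MakeGeneticMap actually expects.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def clip_sort_merge(records: Iterable[Interval], chrom: str,
                    region_start: int, region_end: int) -> List[Interval]:
    """Clip intervals to one region, sort them, and merge overlaps."""
    clipped = []
    for c, start, end in records:
        if c != chrom or end <= region_start or start >= region_end:
            continue  # other chromosome, or entirely outside the region
        # Trim records that straddle the region boundary.
        clipped.append((c, max(start, region_start), min(end, region_end)))

    # Sorting after concatenation avoids "out of order" complaints.
    clipped.sort(key=lambda r: (r[1], r[2]))

    merged: List[Interval] = []
    for c, start, end in clipped:
        if merged and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (c, prev[1], max(prev[2], end))  # extend previous
        else:
            merged.append((c, start, end))
    return merged


# Toy records (hypothetical); the second starts before the requested region.
reads = [
    ("chr6", 29000100, 29000250),
    ("chr6", 28999852, 29000050),
    ("chr6", 29000200, 29000400),
]
print(clip_sort_merge(reads, "chr6", 29000000, 34000000))
```

The record starting at 28999852 is trimmed to the region start rather than dropped, which is one possible way to handle reads that overlap the boundary.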

Is it possible with ~43k sample?

I set up CookHLA for our study of ~43k samples, but it failed at the BEAGLE step even though I reserved 250 GB of RAM. Is it possible to run a sample this large? It worked when I used only 2,491 samples.

The screen output for the ~43k-sample run is as follows:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/rds/user/jhz22/hpc-work/CookHLA/src/HLA_Imputation_BEAGLE5.py", line 555, in IMPUTE
subprocess.run(re.split('\s+', command), check=True, stdout=f_log, stderr=f_log)
File "/usr/local/software/master/python/3.7/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-Djava.io.tmpdir=/home/jhz22/Caprion/analysis/work/hla_CookHLA.javatmpdir', '-Xmx250000m', '-jar', './dep>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/software/master/python/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/rds/user/jhz22/hpc-work/CookHLA/src/HLA_Imputation_BEAGLE5.py", line 559, in IMPUTE
raise CookHLAImputationError(std_ERROR_MAIN_PROCESS_NAME + "Imputation({} / overlap:{}) failed.\n".format(_exonN, _overlap))
src.CookHLAError.CookHLAImputationError:
[HLA_Imputation_BEAGLE5.py::ERROR]: Imputation(exon3 / overlap:1.5) failed.

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "CookHLA.py", line 1035, in <module>
f_save_IMPUTATION_INPUT=args.save_IMPUTATION_INPUT)
File "CookHLA.py", line 862, in CookHLA
f_measureAcc_v2=f_measureAcc_v2)
File "/rds/user/jhz22/hpc-work/CookHLA/src/HLA_Imputation_BEAGLE5.py", line 179, in __init__
self.dict_IMP_Result[_exonN][_overlap] = dict_Pool[_exonN][_overlap].get()
File "/usr/local/software/master/python/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
src.CookHLAError.CookHLAImputationError:
[HLA_Imputation_BEAGLE5.py::ERROR]: Imputation(exon3 / overlap:1.5) failed.

I got error on convert vcf step "Inconsistent marker IDs: [markers file]"

[CookHLA.py]: CookHLA : Performing HLA imputation for 'MyHLAImputation/APMDA.samples.phased.COPY.LiftDown_hg18.NoAmbig'

Java memory = 8000m(Mb)
Using Local Embedding.
Using Adaptive Genetic Map.
Small Sample mode. (because # of target samples < 100)
[1] Extracting SNPs from the MHC.
[2] Performing SNP quality control.
Warning: 3 variants had at least one non-A/C/G/T allele name.
Warning: At least 2 duplicate IDs in --exclude file.
62248
62248
62248
[3] Converting data to beagle format.
Exception in thread "main" java.lang.IllegalArgumentException: Inconsistent marker IDs: [markers file]=SNPS_DQA1_2481_32639939_intron1 [BEAGLE file]=chr6:32667088:G:A
at beagleutil.Beagle2Vcf.checkConsistency(Beagle2Vcf.java:153)
at beagleutil.Beagle2Vcf.main(Beagle2Vcf.java:67)

[HLA_Imputation.py::ERROR]: Input file for imputation('MyHLAImputation/APMDA.samples.51.MHC.QC.vcf') contains nothing. Please check it again.

[HLA_Imputation.py::ERROR]: Input file for imputation('MyHLAImputation/APMDA.samples.51.MHC.QC.vcf') contains nothing. Please check it again.
....
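
For anyone hitting the same "Inconsistent marker IDs" message: it suggests that the marker order or naming in the markers file and the BEAGLE file has diverged. A minimal sketch for locating the first mismatch is shown here. It assumes the markers file has the ID in its first whitespace-separated column and that marker rows in the BEAGLE file start with `M` followed by the ID; both are assumptions about the file layout, and the helper names and toy rows are hypothetical.

```python
from typing import List, Optional, Tuple


def marker_ids_from_markers(lines: List[str]) -> List[str]:
    """IDs from a .markers file: first whitespace-separated column."""
    return [ln.split()[0] for ln in lines if ln.strip()]


def marker_ids_from_bgl(lines: List[str]) -> List[str]:
    """IDs from a BEAGLE file (assumed layout: rows 'M <id> <alleles...>')."""
    ids = []
    for ln in lines:
        fields = ln.split()
        if len(fields) >= 2 and fields[0] == "M":
            ids.append(fields[1])
    return ids


def first_mismatch(markers: List[str], bgl: List[str]
                   ) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, markers_id, bgl_id) of the first difference, or None."""
    for i in range(max(len(markers), len(bgl))):
        a = markers[i] if i < len(markers) else None
        b = bgl[i] if i < len(bgl) else None
        if a != b:
            return i, a, b
    return None


# Toy rows (hypothetical), reusing the two IDs from the error message.
markers_rows = ["rs1 100 A G", "SNPS_DQA1_2481_32639939_intron1 32639939 P A"]
bgl_rows = ["I id S1 S1", "M rs1 A G", "M chr6:32667088:G:A G A"]
print(first_mismatch(marker_ids_from_markers(markers_rows),
                     marker_ids_from_bgl(bgl_rows)))
```

With real data, read the two files with `open(path).read().splitlines()` and pass the resulting lists to the same helpers.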

Input problem

Hello,

I'm sorry to bother you, but I'm having trouble with CookHLA. We are very interested in CookHLA, which is great work.

We would like to know whether CookHLA can be run on GWAS summary data, since it appears to require .bed, .fam, and .bim files.

If so, how should the GWAS summary file be prepared?

Looking forward to your reply.

MakeGeneticMap with FATAL ERROR Marker # is duplicated

@WansonChoi
Hi, when I try to use CookHLA with a reference panel I built myself, the MakeGeneticMap script fails with "FATAL ERROR Marker AA_C_-18_31347808_Rx is duplicated".
I checked the .markers file, and it includes these markers:
AA_C_-18_31347808_R 31347808 P A
AA_C_-18_31347808_x 31347808 P A
AA_C_-18_31347808_Q 31347808 P A
AA_C_-18_31347808_X 31347808 P A
AA_C_-18_31347808_Rx 31347808 P A
AA_C_-18_31347808_RQ 31347808 P A
AA_C_-18_31347808_RX 31347808 P A
The error may be caused by "AA_C_-18_31347808_Rx 31347808 P A" and "AA_C_-18_31347808_RX 31347808 P A",
because the two IDs differ only in case (Rx/RX).
Sometimes I also get an error like "Error: Duplicate ID 'chr6_31529929_C_T'."
Do you have any suggestions?
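
The Rx/RX hypothesis above can be checked with a few lines of Python. This is a minimal sketch that groups marker IDs by their lowercased form; whether the tool that raises the error really compares IDs case-insensitively is an assumption taken from the post, and the function name is made up.

```python
from collections import defaultdict
from typing import Dict, List


def case_insensitive_collisions(marker_ids: List[str]) -> Dict[str, List[str]]:
    """Group IDs that become identical once case is ignored."""
    groups = defaultdict(list)
    for marker_id in marker_ids:
        groups[marker_id.lower()].append(marker_id)
    # Keep only groups with more than one distinct spelling.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# The seven marker IDs listed in the issue.
ids = [
    "AA_C_-18_31347808_R",
    "AA_C_-18_31347808_x",
    "AA_C_-18_31347808_Q",
    "AA_C_-18_31347808_X",
    "AA_C_-18_31347808_Rx",
    "AA_C_-18_31347808_RQ",
    "AA_C_-18_31347808_RX",
]
for key, group in case_insensitive_collisions(ids).items():
    print(key, "->", group)
```

On this list the check flags the Rx/RX pair and also the x/X pair, so renaming only one of them might not be enough if case-insensitivity is indeed the cause.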

Reference 1000G ALL not working?

Thanks for making this software available. I can successfully impute from the 1000G individual superpopulation files, but I'm seeing an error when I try to use the combined overall 1000G reference panel. Specifically Beagle says:

java.lang.IllegalArgumentException: 3
	at vcf.BitSetGTRec.get(BitSetGTRec.java:171)
	at vcf.BasicGT.allele(BasicGT.java:136)
	at vcf.SplicedGT.allele(SplicedGT.java:104)
	at phase.ImputeBaum.unscaledAlProbs(ImputeBaum.java:151)
	at phase.ImputeBaum.imputeInterval(ImputeBaum.java:124)
	at phase.ImputeBaum.phase(ImputeBaum.java:107)
	at phase.PhaseLS.lambda$runStage2$2(PhaseLS.java:148)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Exactly which exon raises the error is not consistent, though: sometimes it's 2.1, but sometimes 2.1.5.

Have you been able to use the 1000G_ALL reference successfully? Is there anything that needs to be changed or checked to work with the larger reference panel?

Requirements.txt

Hi,
I don't want to use conda; I would like to use virtualenv to run CookHLA instead, but I am unable to use the YML files. Would it be possible to provide a requirements.txt file?

Thanks,
Pooja.
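
Until an official requirements.txt exists, one possible stopgap is to translate the `dependencies:` list of a conda environment file into pip-style lines. This is a rough sketch, not something provided by the project: the function name and the sample YAML text are hypothetical, conda package names do not always match PyPI names, and entries such as `python` or non-Python tools have to be removed by hand.

```python
import re
from typing import List

# conda pins look like "name=version" or "name=version=build".
_CONDA_PIN = re.compile(r"([A-Za-z0-9_.\-]+)=([^=]+)(=.*)?")


def conda_deps_to_requirements(yml_text: str) -> List[str]:
    """Translate the 'dependencies:' list of a conda YAML into pip lines."""
    reqs: List[str] = []
    in_deps = False
    for raw in yml_text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.strip().startswith("#"):
            continue
        if not line.startswith((" ", "-")):  # a new top-level key
            in_deps = line.split(":")[0] == "dependencies"
            continue
        item = line.strip()
        if not in_deps or not item.startswith("- "):
            continue
        item = item[2:].strip()
        if item.endswith(":"):  # nested section header such as "pip:"
            continue
        if "==" in item:  # already pip syntax (nested pip list)
            reqs.append(item)
            continue
        match = _CONDA_PIN.fullmatch(item)
        # Unpinned or otherwise unrecognised entries are passed through as-is.
        reqs.append(f"{match.group(1)}=={match.group(2)}" if match else item)
    return reqs


# Hypothetical environment file, not CookHLA's actual one.
sample = """name: demo
dependencies:
  - python=3.7
  - numpy=1.19.2=py37h
  - pandas
  - pip:
    - examplepkg==0.1
"""
print("\n".join(conda_deps_to_requirements(sample)))
```

The output can be written to requirements.txt and installed with `pip install -r requirements.txt` inside a virtualenv, after reviewing each line.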

Reference and target files have no markers in common in interval

@WansonChoi
Thank you very much for developing this software.
I encountered this problem when using the Pan-Asian reference panel:
ERROR: Reference and target files have no markers in common in interval:
6:25002566-26995909
Can you tell me how to solve this problem?
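
A rough first check for this kind of error is to count how many target and reference markers actually fall inside the reported interval. The sketch below reads PLINK .bim rows (chromosome, ID, genetic distance, base-pair position, allele 1, allele 2) and compares positions only; it ignores IDs, alleles, and genome build, the function names are made up, and the toy rows are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def bim_positions(bim_lines: Iterable[str], start: int, end: int) -> Set[int]:
    """Base-pair positions (4th .bim column) that lie within [start, end]."""
    positions = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        pos = int(fields[3])
        if start <= pos <= end:
            positions.add(pos)
    return positions


def shared_in_interval(target_lines: Iterable[str], ref_lines: Iterable[str],
                       start: int, end: int) -> Tuple[int, int, List[int]]:
    """Return (#target markers, #reference markers, shared positions)."""
    target = bim_positions(target_lines, start, end)
    ref = bim_positions(ref_lines, start, end)
    return len(target), len(ref), sorted(target & ref)


# Toy rows (hypothetical); the interval is the one from the error message.
target_rows = ["6 rs1 0 25100000 A G", "6 rs2 0 26000000 C T"]
ref_rows = ["6 rsX 0 29900000 A G"]
print(shared_in_interval(target_rows, ref_rows, 25002566, 26995909))
```

If the reference count is zero for the interval, the target data may simply extend beyond the region the reference panel covers, in which case restricting the target to the panel's range is one thing to try.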

can not get alleles result but no error

Hi @WansonChoi ,

I am running CookHLA with target data (N > 50,000) and the 1000G reference data bundled with your software (N=504). Everything ran without an error, but no results were produced. I am wondering whether this happened because my sample is much larger than the example in your GitHub repository, from which I took the "mem" (2g) and "window" (5) parameters. The imputation log is as follows:

respri.hg19.hla.MHC.QC.exon2.0.5.raw_imputation_out.log

I would be very grateful for a reply!

Thanks,
Guo

[HLA_Imputation_BEAGLE5.py::ERROR]: Imputation(exon2 / overlap:0.5) failed.

Hello @WansonChoi
I am facing an error in the last step of the CookHLA pipeline. I used my data (chr6, 29mb-34mb) with the reference data from the 1000 Genomes reference panel; however, it failed with the following error:
[4] Performing HLA imputation(exon2 / overlap:0.5).

[HLA_Imputation_BEAGLE5.py::ERROR]: Imputation(exon2 / overlap:0.5) failed.

Traceback (most recent call last):
File "/share/home/aamir/CookHLA-master/src/HLA_Imputation_BEAGLE5.py", line 555, in IMPUTE
subprocess.run(re.split('\s+', command), check=True, stdout=f_log, stderr=f_log)
File "/share/home/aamir/anaconda3/envs/CookHLA/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-Djava.io.tmpdir=MyHLAImputation/data+1000G_REF.EAS.javatmpdir', '-Xmx2000m', '-jar', './dependency/beagle5.jar', 'gt=MyHLAImputation/data1+1000G_REF.EAS.MHC.QC.vcf', 'ref=MyHLAImputation/1000G_REF.EAS.chr6.hg18.29mb-34mb.inT1DGC.exon2.phased.vcf', 'out=MyHLAImputation/data1+1000G_REF.EAS.MHC.QC.exon2.0.5.raw_imputation_out', 'impute=true', 'gp=true', 'overlap=0.5', 'err=0.00350207085828343', 'map=MyHLAImputation/data1+1000G_REF.EAS.mach_step.avg.clpsB.exon2.txt', 'window=5', 'ne=10000', 'nthreads=1']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "CookHLA.py", line 1035, in <module>
f_save_IMPUTATION_INPUT=args.save_IMPUTATION_INPUT)
File "CookHLA.py", line 862, in CookHLA
f_measureAcc_v2=f_measureAcc_v2)
File "/share/home/aamir/CookHLA-master/src/HLA_Imputation_BEAGLE5.py", line 154, in __init__
self.AVER, self.dict_ExonN_AGM[_exonN], f_prephasing=f_prephasing)
File "/share/home/aamir/CookHLA-master/src/HLA_Imputation_BEAGLE5.py", line 559, in IMPUTE
raise CookHLAImputationError(std_ERROR_MAIN_PROCESS_NAME + "Imputation({} / overlap:{}) failed.\n".format(_exonN, _overlap))
src.CookHLAError.CookHLAImputationError:
[HLA_Imputation_BEAGLE5.py::ERROR]: Imputation(exon2 / overlap:0.5) failed.
Can you please guide me on how to solve this issue?

Fail to reproduce the toy example

Thank you for developing this wonderful tool. I have two questions for you as I failed to reproduce the toy example.

When I tried to generate the adaptive genetic map using the toy dataset available in the folder, I got the following errors. I have created MyAGM folder in the working directory, so I don't know if I missed anything.

python -m MakeGeneticMap -i example/1958BC.hg19 -hg 19 -ref 1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC -o MyAGM/1958BC+1000G_REF.EUR
Namespace(human_genome='19', input='example/1958BC.hg19', out='MyAGM/1958BC+1000G_REF.EUR', reference='1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC')
sh: 1: None: not found
Traceback (most recent call last):
  File "/home/wem26/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wem26/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wem26/CookHLA/MakeGeneticMap/__main__.py", line 102, in <module>
    CookHLA_MakeGeneticMap(args.input, args.human_genome, args.reference, args.out)
  File "/home/wem26/CookHLA/MakeGeneticMap/__main__.py", line 39, in __init__
    self.GeneticMap = MakeGeneticMap(_input, _reference, _out)
  File "/home/wem26/CookHLA/MakeGeneticMap/MakeGeneticMap.py", line 26, in MakeGeneticMap
    N_sample_target = getSampleNumbers(_input+'.fam')
  File "/home/wem26/CookHLA/src/checkInput.py", line 26, in getSampleNumbers
    with open(_fam, 'r') as f_fam:
FileNotFoundError: [Errno 2] No such file or directory: 'MyAGM/1958BC.hg19.COPY.LiftDown_hg18.fam'
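
One reading of the log above, not confirmed by the authors: "sh: 1: None: not found" looks like a shell command built from a value that was `None`, for example an executable path that was never resolved, so the lifted-over .fam file was never written. A minimal pre-flight check under that assumption is sketched here; the helper names are made up, and the tool names `plink` and `java` are guesses about what needs to be on the PATH.

```python
import os
import shutil
from typing import List


def missing_plink_files(prefix: str) -> List[str]:
    """Return the members of a PLINK fileset (.bed/.bim/.fam) that are absent."""
    return [prefix + ext for ext in (".bed", ".bim", ".fam")
            if not os.path.isfile(prefix + ext)]


def missing_tools(names: List[str]) -> List[str]:
    """Return the executables that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Prefixes taken from the command in the issue; tool names are assumptions.
for prefix in ("example/1958BC.hg19",
               "1000G_REF/1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC"):
    print(prefix, "missing files:", missing_plink_files(prefix))
print("missing tools:", missing_tools(["plink", "java"]))
```

If both filesets are complete and a tool is reported missing, installing it or pointing the dependency folder at a working binary would be the first thing to try.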

Is there a way to merge reference panels, e.g., all 1000Genome panels?

advice on using the Han panel

I am trying to use the Han reference panel for which the authors have provided .markers and .bgl files. Any advice on how to generate the other reference files needed by cookHLA would be most helpful. Most of the file conversion tools I have tried complain about indels and A/P markers. Thank you for your guidance.