anhig / imgthla

Github for files currently published in the IPD-IMGT/HLA FTP Directory hosted at the European Bioinformatics Institute

Home Page: http://www.ebi.ac.uk/ipd/imgt/hla/

License: Other

alleles bioinformatics hla hla-database nomenclature

imgthla's Issues

File format

Is there documentation for the txt alignment format (for example: A_gen)?

Thank you for hosting this on github!

Comma in Description field of Deleted_alleles.txt file

Line 106 of the Deleted_alleles.txt file (HLA00615,DQA1*05013,To take account of coding polymorphism in the leader peptide, sequence renamed DQA1*05:05 (April 1998)) includes a comma in the Description field.

This results in an extra column being added for this line when parsing the file as a .csv document.

Could this comma be removed? It doesn't change the meaning of the entry.
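
Until the comma is removed, one way to read the file without gaining an extra column is to split on the first two commas only. This is a minimal sketch that assumes every record has exactly three fields (allele ID, allele name, description), as in the line quoted above; `parse_deleted_allele` is a hypothetical helper, not part of any IPD-IMGT/HLA tooling, and header or comment lines are not handled.

```python
def parse_deleted_allele(line):
    """Split one record into (allele_id, allele_name, description)."""
    # Only the first two commas are treated as field separators; anything
    # after them is kept as part of the free-text description.
    allele_id, allele_name, description = line.rstrip("\n").split(",", 2)
    return allele_id, allele_name, description


record = ("HLA00615,DQA1*05013,To take account of coding polymorphism in the "
          "leader peptide, sequence renamed DQA1*05:05 (April 1998)")
fields = parse_deleted_allele(record)
```

With this approach the description keeps its internal comma and the record still yields three fields.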

incorrect/missing alignmentreference elements in hla.xml

For the DPB1 alleles, the alignmentreference element attributes have an empty alleleid attribute, and the allelename attribute contains "DPB1*01:01:01", but the allele element in the file has the extended name "DPB1*01:01:01:01", so the reference is not made.

DRBx alleles also have an empty alleleid alignmentreference attribute, but in these cases the DRB1*01:01:01 allele is named consistently.

john

Errors in assigning intron numbers to DRB4*03:01N intron sequences?

In the hla.xml for release 3.33.0, the names of the DRB4*03:01N intron features do not match the feature order numbers for other DRB intron features.

Here are the intron elements for DRB4*03:01N:

     <feature id="914.5" order="5" featuretype="Intron" name="Intron 1">
        <SequenceCoordinates start="1" end="2684" />
     </feature>
      <feature id="914.7" order="7" featuretype="Intron" name="Intron 2">
        <SequenceCoordinates start="2967" end="3670" />
     </feature>
      <feature id="914.9" order="9" featuretype="Intron" name="Intron 3">
        <SequenceCoordinates start="3782" end="4255" />
     </feature>
      <feature id="914.11" order="11" featuretype="Intron" name="Intron 4">
        <SequenceCoordinates start="4280" end="4581" />
    </feature>

Here are the corresponding intron elements for other DRB alleles (e.g., DRB4*01:03:01:03):

      <feature id="6603.3" order="3" featuretype="Intron" name="Intron 1">
        <SequenceCoordinates start="414" end="9976" />
     </feature>
      <feature id="6603.5" order="5" featuretype="Intron" name="Intron 2">
        <SequenceCoordinates start="10247" end="12983" />
     </feature>
      <feature id="6603.7" order="7" featuretype="Intron" name="Intron 3">
        <SequenceCoordinates start="13266" end="13969" />
     </feature>
      <feature id="6603.9" order="9" featuretype="Intron" name="Intron 4">
        <SequenceCoordinates start="14081" end="14554" />
     </feature>
      <feature id="6603.11" order="11" featuretype="Intron" name="Intron 5">
        <SequenceCoordinates start="14579" end="14880" />
     </feature>

Shouldn't all DRB Intron 1 sequences be intron order 3, and all intron sequences of intron order 5 be intron 2?
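
The expectation in the question can be written as a small consistency check. The sketch below assumes the layout implied by the second excerpt (Intron 1 at order 3, Intron 2 at order 5, and so on, so Intron n sits at order 2n + 1). The `<features>` wrapper element and the function name are hypothetical, added only to make the snippet parse on its own.

```python
import xml.etree.ElementTree as ET

# Two features from each excerpt above, wrapped in a hypothetical root element.
SNIPPET = """
<features>
  <feature id="914.5" order="5" featuretype="Intron" name="Intron 1"/>
  <feature id="914.7" order="7" featuretype="Intron" name="Intron 2"/>
  <feature id="6603.3" order="3" featuretype="Intron" name="Intron 1"/>
  <feature id="6603.5" order="5" featuretype="Intron" name="Intron 2"/>
</features>
"""


def mismatched_introns(root):
    """Return ids of introns whose name disagrees with their order."""
    bad = []
    for feat in root.iter("feature"):
        if feat.get("featuretype") != "Intron":
            continue
        named = int(feat.get("name").split()[-1])
        # Assumed layout: Intron n is expected at order 2n + 1.
        expected = (int(feat.get("order")) - 1) // 2
        if named != expected:
            bad.append(feat.get("id"))
    return bad


flagged = mismatched_introns(ET.fromstring(SNIPPET))
```

Under that assumption the two DRB4*03:01N features are flagged and the two DRB4*01:03:01:03 features are not.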

C*02:10:01GG

Hi all,

This extra G is causing us some issues.

allele id="HLA18583" name="HLA-C*02:02:37" dateassigned="2018-03-29"
hla_g_group status="C*02:10:01GG"
hla_p_group status="C*02:02P"
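
A name check along these lines would catch the doubled suffix before it reaches downstream code. This is only a sketch: the pattern assumes a G group name is a locus, an asterisk, three colon-separated numeric fields and a single trailing "G", which matches the names quoted in these issues but is not taken from an official grammar.

```python
import re

# Assumed shape: LOCUS*field:field:fieldG (for example C*02:10:01G).
G_GROUP = re.compile(r"[A-Z0-9]+\*\d+(:\d+){2}G")


def is_valid_g_group(name):
    """Return True if the name matches the assumed G group pattern."""
    return G_GROUP.fullmatch(name) is not None


checks = {name: is_valid_g_group(name)
          for name in ("C*02:10:01G", "C*02:10:01GG", "DQA1*05:01:01G")}
```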

Thank you!
Marney

Difference between fasta and alignments for A*01:11N

One base pair before point mutation 968G>T, the sequences seem to diverge. The mutation (T) is highlighted:

From alignment file (that I think is correct):
GGAGAACGGTAA...
vs the fasta section:
GGAGAACGACCC...

Problems with the 3.34.0 nuc.txt and prot.txt alignments for HLA-B and -C

In the 3.34.0 HLA-B protein alignment, the HLA-B*13:120Q peptide sequence is 11 amino acids longer than the reference, but these positions are not accounted for in the reference with . symbols. As a result, even though the last sequence block for all other alleles only includes 69 amino-acid positions, the last 11 amino acids of the HLA-B*13:120Q sequence appear in a separate block, as below.

This also occurs for the B_nuc.txt alignment, as below.

The same thing is also true for the C*04:09N allele in the C_prot.txt and C_nuc.txt alignments.

It seems like these extra peptide positions should be included in the reference sequences as sequence indels.

Release 3.36.0 - file inconsistencies

There are two new alleles where "dateassigned" is blank in the hla.xml file, DQA1 05:05:01:20 (HLA22679) and DRB4 01:03:01:10 (HLA22663). The dates are listed appropriately in the hla_nom file.
There is an inconsistency between hla_nom and hla.xml for HLA00886, where the xml file has the allele name as v2 DRB3 010101 while the nom file has v3 DRB3 01:01:01. Could you explain this for us?
The hla.xml file has a G group listed as C*07:726N:01G while nom_g lists it as 07:726:01G. Could you please look into this one too?
There is an inconsistency between nom_p and hla.xml regarding DQA1 05:05:01:20. This allele is listed as part of DQA1*05:01P in nom_p but has no p group status in the xml file.

Any help on the above is greatly appreciated. Thanks!

C*17:01:01:02

Hi James,

During the processing of a bunch of new alleles, we ran into an issue with C*17:01:01:02.
The hla.dat file we pulled from the git repository has Exon 5 marked as "pseudo", while on the IPD-IMGT/HLA website it is not marked as such. A cursory look at the history of the sequence does not indicate any recent changes. We were wondering whether this was intentional and something we should take into account in our workflow?

Cheers,
Vineeth

hla.dat file not updated in release 3370

Error in Sequence tag

Hi James,

In the new release 3.33.0 of hla.dat, some DRB1 sequences are invalid. See for example DRB1*13:09: the substring "y/alignment_libraries/libs/drb1345genomiclib:drb1_13:09" should not be there, I think.

FH Key Location/Qualifiers
FH
FT source 1..325
FT /organism="Homo sapiens"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:9606"
FT /ethnic="Hispanic"
FT /cell_line="MJD"
FT /cell_line="NT01111"
FT CDS <1..270
FT /codon_start=1
FT /partial
FT /gene="HLA-DRB1"
FT /allele="HLA-DRB1*13:09"
FT /product="MHC Class II HLA-DRB1*13:09 sequence"
FT /translation="RFLEYSTSECHFFNGTERVRFLDRYFHNQEENVRFDSDVGEFRAV
FT TELGRPDAEYWNSQKDILEQARAAVDTYCRHNYGVVESFTVQRR"
FT exon 1..270
FT /number="2"
FT UTR 271..328
SQ Sequence 325 BP; 58 A; 67 C; 100 G; 51 T; 49 other;
cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc 60
ggttcctgga cagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg 120
gggagttccg ggcggtgacg gagctggggc ggcctgatgc cgagtactgg aacagccaga 180
aggacatcct ggagcaggcg cgggccgcgg tggacaccta ctgcagacac aactacgggg 240
ttgtggagag cttcacagtg cagcggcgag y/alignmen t_librarie s/libs/drb 300
1345genomi clib:drb1_ 13:09 325
//
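
Corruption of this kind can be detected by scanning the sequence block for anything that is not a nucleotide symbol. The sketch below assumes that hla.dat sequences use only lowercase a, c, g and t, and that each sequence line ends with a running position counter, as in the excerpt above; both helper names are hypothetical.

```python
import re


def sequence_from_block(lines):
    """Join the residues of the sequence lines, dropping the trailing
    position counter and the spaces between 10-base groups."""
    residues = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[-1].isdigit():
            tokens = tokens[:-1]
        residues.append("".join(tokens))
    return "".join(residues)


def invalid_symbols(sequence):
    """Return the sorted set of characters other than a, c, g, t."""
    return sorted(set(re.findall(r"[^acgt]", sequence)))


clean = ["cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc 60"]
corrupt = [
    "ttgtggagag cttcacagtg cagcggcgag y/alignmen t_librarie s/libs/drb 300",
    "1345genomi clib:drb1_ 13:09 325",
]
clean_report = invalid_symbols(sequence_from_block(clean))
corrupt_report = invalid_symbols(sequence_from_block(corrupt))
```

A stricter or looser alphabet (for example one that admits IUPAC ambiguity codes) is a design choice that depends on what the file is documented to contain.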

Cheers,
Markus

HLA.Dat user manual not matching hla.dat file

The user manual and the HLA.Dat file appear to be out of sync. The user manual states that the DT Entry will have 3 per entry. When I look at the 3.30.0 HLA.Dat file, there are only 2 per entry.

hla.xml 3.32.0 HLA-C*02:10:01:01 <hla_g_group status="C*02:10:01GG"/>

Greetings,

Allele HLA-C*02:10:01:01 has an extra 'G' <hla_g_group status="C*02:10:01GG"/> in hla.xml for 3.32.0.

May the force be with you,
Marney

Invalid character � in dat file for 3.21.0, 3.22.0, 3.23.0 and 3.24.0

The following line is found in the hla.dat file for 3.21.0, 3.22.0, 3.23.0 and 3.24.0.

RA   Balas A, S�nchez-Gordo F, Garcia-S�nchez F, Gomez-Zumaquero JM, Vicario JL;

This prevents these files from being properly parsed.

Here are the specific alleles that have this issue:

Release = 3210, line # = 121045, Allele = HLA-A*11:210N
Release = 3210, line # = 177260, Allele = HLA-A*26:107N
Release = 3220, line # = 125142, Allele = HLA-A*11:210N
Release = 3220, line # = 183644, Allele = HLA-A*26:107N
Release = 3230, line # = 127802, Allele = HLA-A*11:210N
Release = 3230, line # = 187727, Allele = HLA-A*26:107N
Release = 3240, line # = 129967, Allele = HLA-A*11:210N
Release = 3240, line # = 191426, Allele = HLA-A*26:107N
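
Lines like these can be located by trying to decode each line and reporting the ones that fail. This sketch assumes the consumer expects UTF-8 and that the offending bytes are Latin-1 (a guess based on the accented surname; the actual bytes in the file are not shown in this issue). The sample data is invented for illustration.

```python
def undecodable_lines(raw, encoding="utf-8"):
    """Yield (line_number, text) for lines that fail to decode."""
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this never fails;
            # whether it is the right encoding is an assumption.
            yield number, line.decode("latin-1")


# Hypothetical three-line sample with one Latin-1 encoded character.
sample = b"ID   HLA00001\nRA   Balas A, S\xe1nchez-Gordo F;\nXX\n"
hits = list(undecodable_lines(sample))
```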

no newline following XML declaration in hla_ambigs.xml

On line 1 of hla_ambigs.xml, the XML declaration is not followed by a newline character, so the tns:ambiguityData start-tag appears on the same line.

A newline character is not required by the XML spec, but could be a helpful aesthetic enhancement.

Incorrectly using join for DRB5 sequences in 3.20.0 and 3.21.0

In the hla.dat files for 3.20.0 and 3.21.0 a join is being used for the CDS sequence when it shouldn't be which causes parsers to fail. Here's an example:

DR   EMBL; AJ427352; AJ427352.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..270
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="Barpay"
FT   CDS             join(1..270)
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DRB5"
FT                   /allele="HLA-DRB5*01:12"
FT                   /product="MHC Class II HLA-DRB5*01:12 sequence"
FT                   /translation="RFLQQDKYECHFFNGTERVRFLHRDIYNQEEDLRFDSDVGEYRAV
FT                   TELGRPDAESWNSQKDFLERRRAEVDTVCRHNYGVGESFTVQRR"

Should be FT CDS 1..270 or FT CDS <1..270> instead.

Here's a list of all the alleles that have this:

HLA01638.1 HLA-DRB5*01:11
HLA01634.1 HLA-DRB5*01:12
HLA01871.1 HLA-DRB5*01:13
HLA00927.1 HLA-DRB5*02:03
HLA00928.1 HLA-DRB5*02:04
HLA01280.1 HLA-DRB5*02:05
HLA00916.1 HLA-DRB5*01:01:02
HLA00918.2 HLA-DRB5*01:03
HLA00920.1 HLA-DRB5*01:05
HLA00921.1 HLA-DRB5*01:06
HLA00922.1 HLA-DRB5*01:07
HLA00924.1 HLA-DRB5*01:09
HLA01012.3 HLA-DRB5*01:10N
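
As a workaround on the consumer side, a location string can be normalized before it is handed to a strict parser. The sketch below is one possible approach, not a fix to the file itself: it unwraps `join(...)` only when the parentheses contain a single range, and leaves genuine multi-part joins untouched. The function name is hypothetical.

```python
import re

# Matches join(...) whose content has no comma, i.e. a single range.
SINGLE_JOIN = re.compile(r"join\(([^,()]+)\)")


def unwrap_single_join(location):
    """Rewrite join(X) as X when the join has only one part."""
    return SINGLE_JOIN.sub(r"\1", location)


fixed = unwrap_single_join("join(1..270)")
partial = unwrap_single_join("join(<1..284)")
untouched = unwrap_single_join("join(1..270,400..500)")
```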

gGroup and gGroupAllele names in hla_ambigs.xml don't use full gene names

The gGroup and gGroupAllele names in hla_ambigs.xml don't use the full gene names. For example, in place of "HLA-A", they use "A". This makes them inconsistent with the allele names in hla.xml.

Below are file excerpts to further illustrate the issue.

From hla.xml:
<allele id="HLA00001" name="HLA-A*01:01:01:01" dateassigned="1989-08-01">

From hla_ambigs.xml:
<tns:gGroup name="A*01:01:01G" gid="HGG00001">
<tns:gGroupAllele name="A*01:01:01:01" alleleid="HLA00001" />

Please consider revising the gGroup and gGroupAllele names in hla_ambigs.xml to use the full gene names.
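
In the meantime, names from the two files can be compared after adding the prefix on the hla_ambigs.xml side. A minimal sketch, assuming the only difference is the missing "HLA-" prefix as in the excerpts above; `with_gene_prefix` is a hypothetical helper.

```python
def with_gene_prefix(name, prefix="HLA-"):
    """Return the name with the prefix, adding it only if absent."""
    return name if name.startswith(prefix) else prefix + name


ambigs_name = "A*01:01:01:01"      # as written in hla_ambigs.xml
xml_name = "HLA-A*01:01:01:01"     # as written in hla.xml
same_allele = with_gene_prefix(ambigs_name) == xml_name
```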

Errant Icon file in alignments directory.

https://github.com/ANHIG/IMGTHLA/blob/Latest/alignments/Icon%0D

DPA1_gen.fasta renamed to DPA_gen.fasta

But the alignment file was not renamed? The pir and msf files were also renamed.
Are sequences for the DPA2 pseudogene forthcoming?
This isn't a technical issue, just a consistency issue.

Genomic alignment of DPA1*04:01 and DPA1*04:02 in the DPA1_gen.txt file

The alignment in the DPA1_gen.txt file for DPA1 *04:01 and *04:02 makes it appear that these alleles differ significantly in their sequence for positions 1061 to 1093, as below.

However, the sequences of these alleles are identical through these positions, and it seems like the sequence for *04:02 should only include a 3 nucleotide deletion, relative to the reference, for positions 1061 - 1063, as below.

Missing archive zip file

Hello,

In the README, you note that a "zip compressed archive of all the text-format alignment files is available from the top-level directory". However, I am unable to find such a zip file. The only zip file appears to be the Alignment_Rel_3350.zip that contains the alignments from the current release.

In particular, I would like to find archive versions of the alignment files and the archive versions of the fasta files.

Can you point me in the right direction?

Thanks,
Rachel

incomplete fasta file

hi,
the hla_gen.fasta from the latest version contains sequences for only 5773 alleles.
where are the other alleles? can't find DPA1*03:02 for instance.

thanks,

HLA-E's nuclear data is incomplete

It is missing the final 2 bases (AG) of the 7th exon and the 8th exon (CCTGA). This is comparing against the genetic data.

Sequence length error found for DRB1*14:13 (HLA00845)

For DRB1*14:13 (HLA00845), we noticed that the exon regions do not match the overall sequence length. As you can see from this snippet, the declared sequence length is 687 but the actual sequence listed is only 549 in length.
FT exon 1..270
FT /number="2"
FT exon 271..549
FT /number="3"
FT exon 553..663
FT /number="4"
FT exon 664..687
FT /number="5"
SQ Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;
cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc 60
ggttcctgga gagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg 120
gggagtaccg ggcggtgacg gagctggggc ggcctagcgc cgagtactgg aacagccaga 180
aggacctcct ggagcagagg cgggccgcgg tggacaccta ctgcagacac aactacgggg 240
ttggtgagag cttcacagtg cagcggcgag tccatcctaa ggtgactgtg tatccttcaa 300
agacccagcc cctgcagcac cacaacctcc tggtctgttc tgtgagtggt ttctatccag 360
gcagcattga agtcaggtgg ttccggaatg gccaggaaga gaagactggg gtggtgtcca 420
caggcctgat ccacaatgga gactggacct tccagaccct ggtgatgctg gaaacagttc 480
ctcggagtgg agaggtttac acctgccaag tggagcaccc aagcgtgaca agccctctca 540
cagtggaat 549

DQA1*05:01:04 is not in P or G group in hla.xml.

Good morning again,

We noticed an inconsistency between the files. Will you correct whichever needs to be corrected, please?

allele id="HLA18836" name="HLA-DQA1*05:01:04" dateassigned="2018-04-30"
hla_g_group status="None"/
hla_p_group status="None"/

hla_nom_g.txt
DQA1*;05:01:01:01/05:01:01:02/05:01:01:03/05:01:04/05:03:01:01/05:03:01:02/05:05:01:01/05:05:01:02/05:05:01:03/05:05:01:04/05:05:01:05/05:05:01:06/05:05:01:07/05:05:01:08/05:05:01:09/05:05:01:10/05:06:01:01/05:06:01:02/05:07/05:08/05:09/05:11;05:01:01G

DQA1*;05:01:01:01/05:01:01:02/05:01:01:03/05:01:02/05:01:04/05:03:01:01/05:03:01:02/05:05:01:01/05:05:01:02/05:05:01:03/05:05:01:04/05:05:01:05/05:05:01:06/05:05:01:07/05:05:01:08/05:05:01:09/05:05:01:10/05:06:01:01/05:06:01:02/05:07/05:08/05:09/05:11;05:01P
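
To cross-check the group files against hla.xml, each line can be split into its group name and member alleles. This is a sketch based on the layout of the two lines quoted above (locus prefix, slash-separated alleles, group name, separated by semicolons). Treating an empty third field as "no group" is an assumption, and comment lines are not handled.

```python
def parse_group_line(line):
    """Split 'LOCUS*;allele/allele/...;GROUP' into (group, [alleles])."""
    locus, alleles, group = line.strip().split(";")
    members = [locus + allele for allele in alleles.split("/")]
    # Assumption: an empty third field means the allele has no group.
    group_name = locus + group if group else None
    return group_name, members


group, members = parse_group_line("DQA1*;05:01:01:01/05:01:04;05:01:01G")
in_group = "DQA1*05:01:04" in members
```

The resulting membership can then be compared with the hla_g_group and hla_p_group status values for the same allele in hla.xml.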

Thank you!
May the force be with you,
Marney

HLA xsd file doesn't match HLA xml file structure

hugogenename is a new attribute in the HLA locus node, but it does not exist in the HLA.xsd file, so hla.xml won't pass an XSD validator.

Errant A*11 allele in A_gen

There is a misaligned A*11 allele in A_gen, between A*32:86 and A*32:93.

Some alleles are in Fasta but not in alignments for A_gen

A*68:01:24, A*32:01:24, A*31:01:24

ClassI_nuc.txt alignment issue (extra insertion placeholders in B,C alleles cause misalignment)

Extra insertion placeholders are found in B and C alleles (not A) starting at line 122460, causing the exon boundary to not align around codon 182.

It looks like A, B, and C got out of alignment due to an insertion placeholder present in the B alleles, but not in A and C, starting on line 98736 in B*07:02:01:01 (due to a '-' symbol in B*40:345N, line 101665).

I can't attach the file, too big.

Assembly version

Hello all,

I am working on a neoantigen pipeline and using OptiType for HLA detection. OptiType has an older FASTA version (2013) and the same alleles differ.
What is the assembly version of the most recent FASTA files here (2018)? I am looking at hla_nuc.fasta and hla_prot.fasta. GRCH39/HG39?
I was unable to find the info in readme/version report/change log, nor is it at
https://www.ebi.ac.uk/ipd/imgt/hla/ .
I think it would be useful to have it somewhere clearly visible.

Thank you

C*02:137 Missing from deleted_alleles.txt

Hi all,

We noticed that C*02:137 is listed on other deleted allele resources, but not deleted_alleles.txt.

Can you hook us up?

Many thanks,
Marney

HLA-DMB*01:02 - Invalid join

HLA00490 - 3.30.0

The join(<1..284) is invalid because a join should have at least two parts.

DR   EMBL; Z24750; Z24750.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..284
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="YAR"
FT   CDS             join(<1..284)
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DMB"
FT                   /allele="HLA-DMB*01:02"
FT                   /product="MHC Class II HLA-DMB*01:02 sequence"
FT                   /translation="PPSVQVAKTTPFNTREPVMLACYVWGFYPAEVTITWRKNGKLVMP
FT                   HSSEHKTAQPNGDWTYQTLSHLALTPSYGDTYTCVVEHIGAPEPILRDW"
FT   exon            1..284
FT                   /number="3"
FT                   /partial
SQ   Sequence 284 BP; 67 A; 83 C; 74 G; 60 T; 0 other;
     ggccaccatc tgtgcaagta gccaaaacca ctccttttaa cacgagggag cctgtgatgc        60
     tggcctgcta tgtgtggggc ttctatccag cagaagtgac tatcacgtgg aggaagaacg       120
     ggaagcttgt catgcctcac agcagtgagc acaagactgc ccagcccaat ggagactgga       180
     cataccagac cctctcccat ttagccttaa ccccctctta cggggacact tacacctgtg       240
     tggtagagca cattggggct cctgagccca tccttcggga ctgg                        284
//

Having this error in the hla.dat file causes bio parsers to fail.

DRB5*01:01:01 not listed under alignments directory.

I find alignment flat file format useful as it already has intron exon boundaries embedded.

DRB5*01:01:01 allele is not listed under "alignments" directory whereas it is listed under "msf" directory.

Is this because there is only one full-length allele of DRB5? But in the README file, gen.txt description says:

"Please note for alleles that do not possess genomic sequences, there will be no entry in the file"

So for DRB5 even with one allele, there should be DRB5_gen.txt file containing the DRB5*01:01:01 allele.

Under msf directory, it is listed under DRB5_gen.msf but there is no corresponding alignment file DRB5_gen.txt under alignments directory.

Strange deletion at A*01:18N peptide position 341

In the A_prot.txt alignment, the sequence for the final peptide position for A*01:18N is a deletion (.), but the sequence for the preceding 158 peptide positions is unknown (*).

This does not correspond to the A_nuc.txt alignment, where exon 8 nucleotide sequence is *****.

This terminal deletion does not show up in the .fasta, .msf or .pir alignments (but honestly, it isn't clear how it could).

3.29.0 - Expected sequence length 687, found 549 (HLA00845.2)

The hla.dat file for 3.29.0 has the incorrect sequence length for HLA00845.2. The sequence tag should have 549 instead of 687.

SQ Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;

ID   HLA00845; SV 2; standard; DNA; HUM; 549 BP.
XX
AC   HLA00845;
XX
SV   HLA00845.2
XX
DT   06-AUG-1993 (Rel. 1.0.0, Created, Version 1)
DT   16-AUG-2017 (Rel. 3.29.0.1, Last Updated, Version 2)
XX
DE   HLA-DRB1*14:13, Human MHC Class II sequence (partial)
XX
KW   Human MHC; HLA; Class II; HLA-DRB1; Allele; HLA-DRB1*14:13;
XX
OS   Homo Sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates;
OC   Catarrhini; Hominidae; Homo.
XX
CC   --------------------------------------------------------------------------
CC   IPD-IMGT/HLA Release Version 3.29.0.1
CC   --------------------------------------------------------------------------
CC   Copyrighted by the IPD-IMGT/HLA Database, Distributed under the Creative
CC   Commons Attribution-NoDerivs License, see;
CC   http://www.ebi.ac.uk/ipd/imgt/hla/licence.html for further details.
CC   --------------------------------------------------------------------------
XX
RN   [1]
RP   1-549
RX   PUBMED; 8168862.
RA   Pando M, Theiler G, Melano R, Petzl-Erler ML, Satz ML;
RT   "A new HLA-DR6 allele (DRB1*1413) found in a tribe of Brazilian Indians";
RL   Immunogenetics 39:377-377(1994).
XX
CC   --------------------------------------------------------------------------
CC   The sequence below is the official allele sequence as approved by the
CC   WHO Nomenclature Committee for Factors of the HLA System.
CC   Any cross references may differ from the sequence shown below.
CC   --------------------------------------------------------------------------
XX
DR   EMBL; AM110001; AM110001.0.
DR   EMBL; L21755; L21755.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..549
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="American Indian"
FT                   /cell_line="GRC-138"
FT   CDS             <1..549>
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DRB1"
FT                   /allele="HLA-DRB1*14:13"
FT                   /product="MHC Class II HLA-DRB1*14:13 sequence"
FT                   /translation="RFLEYSTSECHFFNGTERVRFLERYFHNQEENVRFDSDVGEYRAV
FT                   TELGRPSAEYWNSQKDLLEQRRAAVDTYCRHNYGVGESFTVQRRVHPKVTVYPSKTQPL
FT                   QHHNLLVCSVSGFYPGSIEVRWFRNGQEEKTGVVSTGLIHNGDWTFQTLVMLETVPRSG
FT                   EVYTCQVEHPSVTSPLTVE"
FT   exon            1..270
FT                   /number="2"
FT   exon            271..549
FT                   /number="3"
SQ   Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;
     cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc        60
     ggttcctgga gagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg       120
     gggagtaccg ggcggtgacg gagctggggc ggcctagcgc cgagtactgg aacagccaga       180
     aggacctcct ggagcagagg cgggccgcgg tggacaccta ctgcagacac aactacgggg       240
     ttggtgagag cttcacagtg cagcggcgag tccatcctaa ggtgactgtg tatccttcaa       300
     agacccagcc cctgcagcac cacaacctcc tggtctgttc tgtgagtggt ttctatccag       360
     gcagcattga agtcaggtgg ttccggaatg gccaggaaga gaagactggg gtggtgtcca       420
     caggcctgat ccacaatgga gactggacct tccagaccct ggtgatgctg gaaacagttc       480
     ctcggagtgg agaggtttac acctgccaag tggagcaccc aagcgtgaca agccctctca       540
     cagtggaat                                                               549
//

Inconsistent knowledge for DPB1*01:01:01

Alignment (alignments/DPB_nuc.txt) implies that we don't have sequence information for the final alanine, but the fasta (fasta/DPB_nuc.fasta) has it.

Extra sequence information for C*04:09N

I think there's an extra "ATGTGT" at the end of C_nuc.fasta, at least in comparison to the sequence in the alignment file.

Some alleles are missing from hla.xml

During my recent investigation, I found that some alleles that are in hla.dat are missing from hla.xml. For example, HLA-H*02:06. There are ~300 alleles in this situation.

Is this intended?

Thank you,
Marcell

Class II protein alignment files are blank for 3.32.0

All class II protein alignment files are blank for 3.32.0.

Identical sequences with different feature annotations - 174 alleles

Feature annotations should not differ between database releases if the sequence is the same. If an annotation is changed in a later database release, then it should also be updated in all previous database releases that contain that sequence. The feature annotations for 174 alleles change between database releases even though the sequences do not. These differences mainly impact intron-4, exon-5, and intron-5 for HLA-DQB1. Below is a table of all the observed instances of this issue.

DB	Allele	# Features Removed	# Features Added	# Features Differ	Features Removed	Features Added	Features that Differ
3160	HLA-B*15:302N	0	0	3			exon_5 exon_2 exon_3
3160	HLA-C*08:89N	0	0	1			exon_2
3170	HLA-B*15:302N	0	0	1			exon_5
3180	HLA-B*39:97N	0	0	1			exon_3
3180	HLA-C*08:89N	0	0	1			exon_2
3190	HLA-C*08:89N	0	0	1			exon_2
3220	HLA-B*07:251N	0	0	1			exon_3
3280	HLA-B*15:149N	0	0	1			exon_4
3280	HLA-B*15:246N	0	0	1			exon_4
3280	HLA-C*08:89N	0	0	1			exon_2
3290	HLA-B*15:149N	0	0	1			exon_4
3290	HLA-B*15:246N	0	0	1			exon_4
3300	HLA-A*24:155N	1	0	0	exon_5
3300	HLA-A*26:01:01:03N	0	0	2			intron_4 exon_4
3300	HLA-B*07:44N	0	0	2			intron_4 exon_4
3300	HLA-B*15:01:01:02N	0	1	1		exon_1	intron_1
3300	HLA-B*15:149N	0	0	1			exon_4
3300	HLA-B*15:246N	0	0	2			exon_5 exon_4
3300	HLA-B*44:02:01:02S	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:02:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:02:04	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:53Q	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:62	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:79	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:80	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:81	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:82	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:83	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:84	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*02:96N	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:04	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:05	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:06	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:07	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:08	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:09	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:10	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:11	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:12	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:14	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:15	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:16	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:17	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:01:18	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:17	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:22	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:35	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:36	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:01:37	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:01:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:09	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:12	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:21	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:22	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:23	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:02:24	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:03:02:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:03:02:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:03:02:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:03:04	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:04:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:04:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:05:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:150	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:191	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:195	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:196	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:197Q	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:19:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:211	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:239	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:243	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:245	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:246	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:247	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:248	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:249	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:250	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:251	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:252	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:253	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:254	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*03:263	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:01:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:02:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:02:11	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:02:12	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:11	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*04:32	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:01:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:01:04	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:01:05	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:23	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:01:24	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:02:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:02:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:02:01:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:02:07	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:02:11	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:102	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:103	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:104	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:106	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:108	0	1	1		exon_5	exon_6
3300	HLA-DQB1*05:133	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:134	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:135	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:136	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:137	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:148	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:149	1	1	0	exon_6	exon_5
3300	HLA-DQB1*05:31	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:43:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:52	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:57	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*05:96	0	1	1		exon_5	exon_6
3300	HLA-DQB1*05:97	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:01:08	0	1	1		exon_5	exon_6
3300	HLA-DQB1*06:01:10	0	1	1		exon_5	exon_6
3300	HLA-DQB1*06:01:11	0	1	1		exon_5	exon_6
3300	HLA-DQB1*06:02:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:01:03	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:17	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:22	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:23	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:25	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:26	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:27	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:02:28	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:12	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:14	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:20	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:21	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:23	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:24	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:25	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:03:26	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:04:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:09:01:01	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:09:01:02	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:103	0	1	1		exon_5	exon_6
3300	HLA-DQB1*06:111	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:117	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:125	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:187	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:188	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:217	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:218	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:219	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:221	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:222	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:223	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:224	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:225	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:226	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:227	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:228	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:37	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:44	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:84	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:90	0	2	1		exon_5 intron_5	intron_4
3300	HLA-DQB1*06:99:02	0	1	1		exon_5	exon_6
3320	HLA-C*07:02:01:17N	0	0	2			intron_3 exon_3

multiple sequence alignment reference

Hi,
I see the list of references used for the multiple sequence alignments here:
http://www.ebi.ac.uk/ipd/imgt/hla/nomenclature/alignments.html
Are the alleles in this list part of the human reference genome GRCh38?
Does GRCh38 contain all HLA genes?

Many thanks,

Mengyao

What does '|' mean in the multiple sequence alignment?

In the A_gen.txt file in the 'alignments' folder, several lines contain a " | " symbol, for example:
A*01:01:01:01 G | ATGGCCGTC ATGGCGCCCC GAACCCTCCT CCTGCTACTC TCGGGGGCCC TGGCCCTGAC CCAGACCTGG GCGG | GTGAGT GCGGGGTCGG GAGGGAAACC
A*01:01:01:02N - | --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- | ------ ---------- ----------
A*01:01:01:03 * | --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- | ------ ---------- ----------

May I ask what these " | " symbols mean?

Many thanks,

Mengyao
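
For anyone parsing these rows in the meantime, here is a minimal sketch that splits a row on the " | " symbols. It assumes, as far as I understand the alignment format, that " | " separates consecutive sequence features (for example UTR, exon, intron) and that the spaces inside a segment are only visual grouping. Please treat both points as assumptions until they are confirmed against the official format description. The function name `split_alignment_line` is made up for illustration.

```python
from typing import List, Tuple


def split_alignment_line(line: str) -> Tuple[str, List[str]]:
    """Split one alignment row into (allele name, segments between '|').

    Assumption: '|' separates consecutive features and the blanks inside
    a segment are only there for readability, so they are removed.
    """
    name, _, rest = line.strip().partition(" ")
    segments = ["".join(part.split()) for part in rest.split("|")]
    return name, segments


# Shortened example row in the same shape as the rows quoted above.
row = "A*01:01:01:03 * | --------- ---------- | ------"
name, segments = split_alignment_line(row)
print(name, [len(segment) for segment in segments])
```

Keeping the segments as a list, rather than stripping the " | " symbols outright, preserves the boundary positions in case they turn out to matter for downstream processing.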

Urgent issues with zip files

The following URL is broken: https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/xml/hla.xml.zip. It returns a corrupted ZIP file (about 1 KB in size).

Other URLs seem to be working, e.g. https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/wmda/hla_nom.txt

Could you please take a look at this and fix the zip files? We would appreciate the help ASAP as this is blocking our processing of the latest release.
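
In case it helps others hit by the same problem, below is a rough sketch of how a pipeline could check a downloaded file before trying to process it. It uses only the Python standard library. The helper name `looks_like_valid_zip` is hypothetical, and the download step is left out on purpose, so the function just takes the raw bytes.

```python
import io
import zipfile


def looks_like_valid_zip(data: bytes) -> bool:
    """Return True if the bytes form a readable ZIP whose members pass CRC checks."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    try:
        with zipfile.ZipFile(buffer) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


# Build a small archive in memory to exercise the check.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as archive:
    archive.writestr("hla.xml", "<alleles/>")

print(looks_like_valid_zip(good.getvalue()))
print(looks_like_valid_zip(b"version https://git-lfs.github.com/spec/v1\n"))
```

A check like this fails fast on a tiny placeholder file instead of letting a later parsing step produce a confusing error.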

Spelling error in oid/README.md

In line 19 of the https://github.com/ANHIG/IMGTHLA/edit/Latest/oid/README.md document, the word 'donated' should be 'denoted'.

hla.dat file not downloading properly

I tried downloading the IMGT zip but the hla.dat file does not contain the alleles as expected. Instead it contains the following:
version https://git-lfs.github.com/spec/v1
oid sha256:1b26676d2366ba8768122a973aa0add3641671430a52a431e03b6700b8459ff1
size 113160320
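
Those three lines look like a Git LFS pointer file rather than the data itself, which would mean the real hla.dat has to be fetched through Git LFS (for example with `git lfs pull` in a clone). Here is a small sketch for detecting such a pointer in a downloaded file. The function name `parse_lfs_pointer` is made up for illustration, and the check only looks at the `version` line shown above.

```python
from typing import Dict, Optional


def parse_lfs_pointer(text: str) -> Optional[Dict[str, str]]:
    """Return the pointer fields if text looks like a Git LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("version https://git-lfs.github.com/spec/"):
        return None
    fields = {}
    for line in lines:
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:1b26676d2366ba8768122a973aa0add3641671430a52a431e03b6700b8459ff1
size 113160320
"""
fields = parse_lfs_pointer(pointer_text)
print(fields["size"] if fields else "not a pointer")
```

The `size` field gives the expected size in bytes of the real file, which is handy for checking a later download.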

nucleotide CDS alignment (MSA) file of release 3.9.0

I want to download the multiple sequence alignment files of release 3.9.0 because we want to finish the remaining portion of an old project. However, I am unable to find those files in this repository. Specifically, I need the file DQA_nuc.txt or DQA1_nuc.txt for release 3.9.0, as I already have the files for the other genes I am interested in.

Let me know if there is any way I can find that file.

Thank you

<?xml version="1.0" encoding="UTF-8"?>
	<tns:ambiguityData xmlns:tns="http://www.example.org/ambig-aw"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.example.org/ambig-aw ambig-aw.xsd ">
	<tns:releaseVersion currentRelease="" date="" />
	<tns:geneList>

typos in README.md

It seems the COPYRIGHT NOTICE section of the README.md file here contains 1-2 typos.

The section indicates 2015 as the publication date for the Nucleic Acids Research article, but Google Scholar indicates 2014. I think 2015 is a typo.

Another typo: the word "stongly".