I'm not sure if this is a bug, but I would expect the number of entries (pipe-delimited values) to be the same for CLINSIG, CLNDBN, CLNACC, etc. However, in the cases below, splitting on "|" gives different numbers of fields for CLINSIG and CLNACC:
% cat hg19_clinvar_20150330.txt | cut -f 6 | sed -e 's/;/ /g' | awk '{numsig=split($1,sig,"|");numacc=split($4,acc,"|"); if (numsig!=numacc) print}' | head
CLINSIG=pathogenic|pathogenic|pathogenic CLNDBN=Paragangliomas_4|Pheochromocytoma|Hereditary_cancer-predisposing_syndrome,Phaeochromocytoma|Cowden-like_syndrome CLNREVSTAT=single|single|single,single|single CLNACC=RCV000013623.23|RCV000013624.16|RCV000129929.2,RCV000148870.1|RCV000148871.1 CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet|GeneReviews:MedGen:OMIM:Orphanet|MedGen:SNOMED_CT,MedGen|MedGen:OMIM:Orphanet CLNDSDBID=NBK1548:C1861848:115310:ORPHA29072|NBK1548:C0031511:171300:ORPHA29072|C0027672:699346009,CN221602|C2676500:612359:ORPHA201
CLINSIG=pathogenic|pathogenic|pathogenic CLNDBN=Paragangliomas_4|Pheochromocytoma|Hereditary_cancer-predisposing_syndrome,Phaeochromocytoma|Cowden-like_syndrome CLNREVSTAT=single|single|single,single|single CLNACC=RCV000013623.23|RCV000013624.16|RCV000129929.2,RCV000148870.1|RCV000148871.1 CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet|GeneReviews:MedGen:OMIM:Orphanet|MedGen:SNOMED_CT,MedGen|MedGen:OMIM:Orphanet CLNDSDBID=NBK1548:C1861848:115310:ORPHA29072|NBK1548:C0031511:171300:ORPHA29072|C0027672:699346009,CN221602|C2676500:612359:ORPHA201
CLINSIG=pathogenic|pathogenic CLNDBN=Elliptocytosis_1|Protein_4.1_lille,Elliptocytosis_1|Protein_4.1_madrid CLNREVSTAT=single|single,single|single CLNACC=RCV000018198.26|RCV000018199.26,RCV000018196.26|RCV000018197.22 CLNDSDB=MedGen:OMIM:Orphanet|.,MedGen:OMIM:Orphanet|. CLNDSDBID=C2678497:611804:ORPHA288|.,C2678497:611804:ORPHA288|.
CLINSIG=pathogenic|pathogenic CLNDBN=Elliptocytosis_1|Protein_4.1_lille,Elliptocytosis_1|Protein_4.1_madrid CLNREVSTAT=single|single,single|single CLNACC=RCV000018198.26|RCV000018199.26,RCV000018196.26|RCV000018197.22 CLNDSDB=MedGen:OMIM:Orphanet|.,MedGen:OMIM:Orphanet|. CLNDSDBID=C2678497:611804:ORPHA288|.,C2678497:611804:ORPHA288|.
CLINSIG=other CLNDBN=Epilepsy\x2c_idiopathic_generalized\x2c_susceptibility_to\x2c_12,not_provided|Glucose_transporter_type_1_deficiency_syndrome CLNREVSTAT=single,single|single CLNACC=RCV000082868.1,RCV000128117.1|RCV000147523.1 CLNDSDB=MedGen:OMIM,MedGen|GeneReviews:MedGen:OMIM:Orphanet CLNDSDBID=CN158708:614847,CN221809|NBK1430:C1847501:606777:ORPHA71277
CLINSIG=other CLNDBN=Epilepsy\x2c_idiopathic_generalized\x2c_susceptibility_to\x2c_12,not_provided|Glucose_transporter_type_1_deficiency_syndrome CLNREVSTAT=single,single|single CLNACC=RCV000082868.1,RCV000128117.1|RCV000147523.1 CLNDSDB=MedGen:OMIM,MedGen|GeneReviews:MedGen:OMIM:Orphanet CLNDSDBID=CN158708:614847,CN221809|NBK1430:C1847501:606777:ORPHA71277
CLINSIG=pathogenic CLNDBN=Glut1_deficiency_syndrome_1\x2c_autosomal_recessive,Glucose_transporter_type_1_deficiency_syndrome|not_provided CLNREVSTAT=single,mult|single CLNACC=RCV000017489.24,RCV000017491.27|RCV000081432.3 CLNDSDB=MedGen,GeneReviews:MedGen:OMIM:Orphanet|MedGen CLNDSDBID=C3149117,NBK1430:C1847501:606777:ORPHA71277|CN221809
CLINSIG=pathogenic CLNDBN=Glut1_deficiency_syndrome_1\x2c_autosomal_recessive,Glucose_transporter_type_1_deficiency_syndrome|not_provided CLNREVSTAT=single,mult|single CLNACC=RCV000017489.24,RCV000017491.27|RCV000081432.3 CLNDSDB=MedGen,GeneReviews:MedGen:OMIM:Orphanet|MedGen CLNDSDBID=C3149117,NBK1430:C1847501:606777:ORPHA71277|CN221809
CLINSIG=pathogenic CLNDBN=MYH-associated_polyposis,Hereditary_cancer-predisposing_syndrome|Carcinoma_of_colon CLNREVSTAT=single,mult|single CLNACC=RCV000123141.1,RCV000115749.4|RCV000144636.1 CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet,MedGen:SNOMED_CT|MedGen:SNOMED_CT CLNDSDBID=NBK107219:C1837991:608456:ORPHA220460,C0027672:699346009|C0699790:269533000
CLINSIG=probable-non-pathogenic|other CLNDBN=MYH-associated_polyposis|Hereditary_cancer-predisposing_syndrome,MYH-associated_polyposis|Hereditary_cancer-predisposing_syndrome CLNREVSTAT=single|mult,mult|single CLNACC=RCV000119223.2|RCV000126890.3,RCV000005617.2|RCV000163049.1 CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet|MedGen:SNOMED_CT,GeneReviews:MedGen:OMIM:Orphanet|MedGen:SNOMED_CT CLNDSDBID=NBK107219:C1837991:608456:ORPHA220460|C0027672:699346009,NBK107219:C1837991:608456:ORPHA220460|C0027672:699346009
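
The awk one-liner above only compares CLINSIG against CLNACC. As a rough sketch of the same check extended to every field, here is a small Python version. It assumes the column-6 annotation is a semicolon-separated list of KEY=value pairs, as the sed step implies. The function names `pipe_counts` and `is_consistent` are hypothetical helpers made up for this illustration, and the example string is a shortened copy of one record from the output above (first four fields only).

```python
def pipe_counts(annotation):
    """Return {field name: number of pipe-delimited entries} for one annotation string."""
    counts = {}
    # Assumption: fields are separated by ';' and each looks like KEY=value.
    for pair in annotation.split(";"):
        key, _, value = pair.partition("=")
        counts[key] = len(value.split("|"))
    return counts


def is_consistent(annotation):
    """True when every field has the same number of pipe-delimited entries."""
    return len(set(pipe_counts(annotation).values())) == 1


# Shortened form of one record from the output above.
example = (
    "CLINSIG=pathogenic|pathogenic;"
    "CLNDBN=Elliptocytosis_1|Protein_4.1_lille,Elliptocytosis_1|Protein_4.1_madrid;"
    "CLNREVSTAT=single|single,single|single;"
    "CLNACC=RCV000018198.26|RCV000018199.26,RCV000018196.26|RCV000018197.22"
)

print(pipe_counts(example))
print(is_consistent(example))
```

For this record the sketch counts 2 entries for CLINSIG and 3 for each of the other fields, which is the same kind of mismatch the awk command flags.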