ensembl-genes's People

Contributors

acastanza, dhimmel, eric-czech, ravwojdyla

ensembl-genes's Issues

Ensembl human release 110: assembly_exception table is empty

Ensembl 110 was released on 2023-07-17 and includes a note:

human genome assembly has been updated to the latest patch release GRCh38.p14. Note, however, that genes on patches will only appear on scaffold coordinates. Further, in the GFF3 annotation files, you will now find that MANE and Ensembl canonical attributes have been added as tags. Y pseudoautosomal region (PAR) genes are now stand-alone genes and are no longer taken from X, but MANE attributes remain on X PAR genes only.

ensembl_genes datasets --species=human --release=110 queries homo_sapiens_core_110_38 and fails with:

ValueError: expected at most 1 primary assembly gene per alt_allele_group

The two genes are ENSG00000273592 and ENSG00000276935, both of which have the symbol MBOAT7.
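To locate the offending groups before the export raises, one option is to count the primary assembly genes in each alt_allele group. The following is a minimal sketch in plain Python, not the package's own check. The helper name `find_ambiguous_groups` and the row layout are assumptions for illustration, and the group IDs in the sample rows are made up.

```python
from collections import defaultdict


def find_ambiguous_groups(rows):
    """Map each alt_allele_group_id to its primary assembly genes,
    keeping only groups that have more than one such gene."""
    primary_by_group = defaultdict(list)
    for row in rows:
        if row["primary_assembly"]:
            primary_by_group[row["alt_allele_group_id"]].append(row["ensembl_gene_id"])
    return {
        group_id: sorted(gene_ids)
        for group_id, gene_ids in primary_by_group.items()
        if len(gene_ids) > 1
    }


# Illustrative rows: the group IDs are hypothetical, and the third gene
# is only a stand-in for a well-behaved group.
rows = [
    {"alt_allele_group_id": 1, "ensembl_gene_id": "ENSG00000273592", "primary_assembly": True},
    {"alt_allele_group_id": 1, "ensembl_gene_id": "ENSG00000276935", "primary_assembly": True},
    {"alt_allele_group_id": 2, "ensembl_gene_id": "ENSG00000000003", "primary_assembly": True},
]
ambiguous = find_ambiguous_groups(rows)
```

Printing `ambiguous` for a real release would list every group that trips the check, which makes it easier to judge whether the problem is limited to MBOAT7.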

Transcripts missing in ensembl xrefs data

Nearly 5% of transcript IDs in DeepMind's AlphaMissense data do not match a transcript ID in the xrefs output of EnsemblGenesTask. This may be a consequence of using different Ensembl versions (98 for AlphaMissense, 105 for RS).
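A quick way to quantify the mismatch is to compare the two ID sets after dropping any version suffix. This is a rough sketch: the function names are hypothetical, the sample IDs are invented, and it assumes IDs may look like `ENST...` followed by an optional `.N` version.

```python
def strip_version(transcript_id):
    """Drop a trailing '.N' version suffix, if present."""
    return transcript_id.split(".", 1)[0]


def unmatched_fraction(query_ids, reference_ids):
    """Fraction of distinct query transcript IDs absent from the reference set."""
    reference = {strip_version(t) for t in reference_ids}
    query = {strip_version(t) for t in query_ids}
    if not query:
        return 0.0
    return len(query - reference) / len(query)


# Invented IDs, for illustration only.
fraction = unmatched_fraction(
    ["ENST00000000001.1", "ENST00000000002.3"],
    ["ENST00000000001.4"],
)
```

If the unmatched fraction drops sharply when the reference comes from the same Ensembl release as the query data, the version difference is the likely cause.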

Address and Filter NCBI Gene IDs misassigned due to read-through transcripts

There seems to be an issue, apparently originating on the NCBI side, whereby the NCBI gene ID of a read-through gene can end up assigned to one of(?) the parent Ensembl genes.

Here's an example from Biomart (taken in Ensembl 103) which demonstrates this issue:

| Ensembl Gene ID | NCBI Gene ID | HGNC Gene ID | Gene Symbol | Gene Title |
| --- | --- | --- | --- | --- |
| ENSG00000278232 | 1394 | HGNC:2357 | CRHR1 | corticotropin releasing hormone receptor 1 [Source:HGNC Symbol;Acc:HGNC:2357] |
| ENSG00000278232 | 104909134 | HGNC:51483 | CRHR1 | corticotropin releasing hormone receptor 1 [Source:HGNC Symbol;Acc:HGNC:2357] |
| ENSG00000282456 | 104909134 | HGNC:51483 | LINC02210-CRHR1 | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |
| ENSG00000204650 | 147081 | HGNC:26327 | LINC02210 | long intergenic non-protein coding RNA 2210 [Source:HGNC Symbol;Acc:HGNC:26327] |

https://www.ncbi.nlm.nih.gov/gene/?term=1394
https://www.ncbi.nlm.nih.gov/gene/?term=104909134

From a quick look at your genes sheet:

| ensembl_gene_id | ensembl_gene_version | gene_symbol | gene_symbol_source_db | gene_symbol_source | gene_biotype | gene_description |
| --- | --- | --- | --- | --- | --- | --- |
| ENSG00000278232 | 4 | LINC02210-CRHR1 | HGNC | HGNC:51483 | protein_coding | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |
| ENSG00000282456 | 1 | LINC02210-CRHR1 | HGNC | HGNC:51483 | lncRNA | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |

it would seem to be affected
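One possible filter, sketched here against the BioMart rows quoted in this issue, flags an (Ensembl gene, NCBI gene ID) pair when the NCBI ID also belongs to a gene described as a readthrough while the Ensembl gene itself is not. Both the function name and the "description contains readthrough" heuristic are assumptions, not something the package currently does.

```python
def flag_readthrough_xrefs(rows):
    """rows: iterable of (ensembl_gene_id, ncbi_gene_id, gene_title).

    Return the set of (ensembl_gene_id, ncbi_gene_id) pairs where the NCBI ID
    is shared with a readthrough gene but the Ensembl gene is not one itself.
    """
    rows = list(rows)
    # Heuristic (an assumption): a gene is a readthrough if its title says so.
    readthrough_genes = {gene for gene, _, title in rows if "readthrough" in title.lower()}
    readthrough_ncbi_ids = {ncbi for gene, ncbi, _ in rows if gene in readthrough_genes}
    return {
        (gene, ncbi)
        for gene, ncbi, _ in rows
        if ncbi in readthrough_ncbi_ids and gene not in readthrough_genes
    }


# Rows copied from the BioMart example in this issue (titles shortened).
biomart_rows = [
    ("ENSG00000278232", "1394", "corticotropin releasing hormone receptor 1"),
    ("ENSG00000278232", "104909134", "corticotropin releasing hormone receptor 1"),
    ("ENSG00000282456", "104909134", "LINC02210-CRHR1 readthrough"),
    ("ENSG00000204650", "147081", "long intergenic non-protein coding RNA 2210"),
]
suspect_pairs = flag_readthrough_xrefs(biomart_rows)
```

On these rows only the pairing of ENSG00000278232 with NCBI gene 104909134 is flagged, which matches the misassignment described above. Whether flagged pairs should be dropped or merely reported is a separate decision.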

Automate detection & export of new ensembl releases

@cthoyt tweeted:

Why not automate even further? Have it check on a daily basis if Ensembl has been updated since the last release of your artifacts so even if you don’t personally manage this anymore, it can continue on. I was thinking about this a lot lately and have been accumulating scripts for checking database versions in https://github.com/biopragmatics/bioversions. I just added one for ensembl, feel free to rely on that package or deconstruct the parts that are important and include directly in your source

This is a great idea and would reduce future maintenance. Happy to use bioversions for this.

We will need to detect if an output already exists. Should be able to do this by looking at the git branches.

Sometimes exports will fail, for example if a release changes the schema. These changes take a non-trivial amount of effort to fix. For this reason I lean towards weekly scheduled jobs, so when this is failing it becomes a weekly and not daily annoyance.
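The "does an output already exist" check could look roughly like the sketch below, which parses the text printed by `git ls-remote --heads <remote>` and tests for a branch named after the database. The `output/<database>` branch naming and both function names are assumptions for illustration. Obtaining the latest release number (for example via bioversions) is left out.

```python
def parse_ls_remote_heads(ls_remote_output):
    """Extract branch names from the output of `git ls-remote --heads <remote>`.

    Each line has the form '<sha>\trefs/heads/<branch>'.
    """
    prefix = "refs/heads/"
    branches = set()
    for line in ls_remote_output.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2 and parts[1].startswith(prefix):
            branches.add(parts[1][len(prefix):])
    return branches


def needs_export(database, branches, branch_prefix="output/"):
    """True when no branch exists yet for this database (assumed naming scheme)."""
    return f"{branch_prefix}{database}" not in branches


# Made-up ls-remote output with one existing export branch.
sample = (
    "abc123\trefs/heads/main\n"
    "def456\trefs/heads/output/homo_sapiens_core_110_38\n"
)
branches = parse_ls_remote_heads(sample)
export_111 = needs_export("homo_sapiens_core_111_38", branches)
export_110 = needs_export("homo_sapiens_core_110_38", branches)
```

A weekly scheduled job would run this check first and skip the export when the branch is already present, so that a schema-breaking release fails at most once per week.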

Use tags instead of branches for different versions

Effectively, tags are kind of like branches, but GitHub has much deeper support for tags/releases. Additionally, you could hook this up to Zenodo to automatically provide an archived backup for each if you used tags.

Ensembl alt_allele table does not contain all alternative allele gene groups

We select a single representative gene for each group of alternative allele genes. These groups are based on the upstream alt_allele table, which provides a mapping between gene_ids and alt_allele_group_ids.

However, there appear to be groups of genes that are alternative alleles of each other but are not included in this table. One example is the set of human genes with the symbol GP6. Will elaborate further in subsequent comments.
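One way to surface candidates like GP6 is to look for symbols shared by several genes where at least one of those genes is absent from the alt_allele table. The sketch below is only a heuristic, since a shared symbol does not prove that genes are alternative alleles. The function name and the placeholder gene IDs are hypothetical.

```python
from collections import defaultdict


def symbols_missing_from_alt_alleles(genes, alt_allele_gene_ids):
    """genes: iterable of (ensembl_gene_id, gene_symbol).

    Return {symbol: [gene ids not in the alt_allele table]} for symbols that
    are shared by more than one gene.
    """
    alt_allele_gene_ids = set(alt_allele_gene_ids)
    genes_by_symbol = defaultdict(set)
    for gene_id, symbol in genes:
        if symbol:  # skip genes without a symbol
            genes_by_symbol[symbol].add(gene_id)
    candidates = {}
    for symbol, gene_ids in genes_by_symbol.items():
        missing = gene_ids - alt_allele_gene_ids
        if len(gene_ids) > 1 and missing:
            candidates[symbol] = sorted(missing)
    return candidates


# Placeholder gene IDs, not real Ensembl identifiers.
genes = [
    ("GENE_A1", "GP6"),
    ("GENE_A2", "GP6"),
    ("GENE_B1", "TSPAN6"),
    ("GENE_C1", None),
]
candidates = symbols_missing_from_alt_alleles(genes, alt_allele_gene_ids={"GENE_A1"})
```

Candidates found this way would still need manual review, for example by comparing sequence regions, before being treated as an alternative allele group.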

Error Exporting Ensembl 111


INFO:root:exporting ensembl genes to output/homo_sapiens_core_111_38: version 111
INFO:root:connection_url: mysql+mysqlconnector://[email protected]:3306/homo_sapiens_core_111_38
INFO:root:exporting genes data
INFO:root:Ran 'genes' query returning 70,711 rows. Head:

   ensembl_gene_id  ensembl_gene_version  ... seq_region_strand primary_assembly
0  ENSG00000000003                    16  ...                -1             True
1  ENSG00000000005                     6  ...                 1             True
2  ENSG00000000419                    14  ...                -1             True
3  ENSG00000000457                    14  ...                -1             True

[4 rows x 19 columns]
INFO:root:Ran 'gene_xrefs' query returning 505,967 rows. Head:

   ensembl_gene_id   xref_source  ... xref_info_type xref_linkage_annotation
0  ENSG00000000003  ArrayExpress  ...         DIRECT                    None
1  ENSG00000000003    EntrezGene  ...      DEPENDENT                    None
2  ENSG00000000003     GeneCards  ...      DEPENDENT                    None
3  ENSG00000000003          HGNC  ...         DIRECT                    None

[4 rows x 7 columns]
INFO:root:Ran 'gene_alt_alleles' query returning 14,511 rows. Head:

   ensembl_gene_id  ...  ensembl_created_date
0  ENSG00000282572  ...   2015-06-01 18:57:05
1  ENSG00000281951  ...   2015-06-01 18:57:05
2  ENSG00000282572  ...   2015-06-01 18:57:05
3  ENSG00000273644  ...   2014-06-09 10:49:07

[4 rows x 7 columns]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/commands.py:39   │
│ in export_all                                                                │
│                                                                              │
│   36 │   def export_all(species: str = "human", release: str = "latest") ->  │
│   37 │   │   """Export datasets and then notebooks."""                       │
│   38 │   │   # Cannot use a classmethod here <https://github.com/related-sci │
│ ❱ 39 │   │   Commands.export_datasets(species=species, release=release)      │
│   40 │   │   Commands.export_notebooks(species=species, release=release)     │
│   41 │                                                                       │
│   42 │   @staticmethod                                                       │
│                                                                              │
│ ╭───── locals ──────╮                                                        │
│ │ release = '111'   │                                                        │
│ │ species = 'human' │                                                        │
│ ╰───────────────────╯                                                        │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/commands.py:24   │
│ in export_datasets                                                           │
│                                                                              │
│   21 │   │   │   f"exporting ensembl genes to {ensgc.output_directory}: vers │
│   22 │   │   )                                                               │
│   23 │   │   logging.info(f"connection_url: {ensgc.connection_url}")         │
│ ❱ 24 │   │   ensgc.export_datasets()                                         │
│   25 │                                                                       │
│   26 │   @staticmethod                                                       │
│   27 │   @cli.command(name="notebooks")                                      │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │   ensgc = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer       │ │
│ │           object at 0x7f986b2355a0>                                      │ │
│ │ release = '111'                                                          │ │
│ │ species = 'human'                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :549 in export_datasets                                                      │
│                                                                              │
│   546 │   def export_datasets(self) -> None:                                 │
│   547 │   │   for export in self.exports:                                    │
│   548 │   │   │   logging.info(f"exporting {export.name} data")              │
│ ❱ 549 │   │   │   self.write_dataset(export)                                 │
│   550 │                                                                      │
│   551 │   def write_dataset(self, export: DatasetExport) -> None:            │
│   552 │   │   df = getattr(self, export.query_fxn)                           │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ export = DatasetExport(                                                  │ │
│ │          │   name='genes',                                               │ │
│ │          │   query_fxn='gene_df',                                        │ │
│ │          │   description='Primary table of ensembl genes with IDs,       │ │
│ │          symbols, and genomic location informati'+150,                   │ │
│ │          │   export_formats=[                                            │ │
│ │          │   │   <ExportFormat.parquet: 'parquet'>,                      │ │
│ │          │   │   <ExportFormat.tsv: 'tsv'>,                              │ │
│ │          │   │   <ExportFormat.excel: 'excel'>,                          │ │
│ │          │   │   <ExportFormat.json: 'json'>                             │ │
│ │          │   ]                                                           │ │
│ │          )                                                               │ │
│ │   self = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer object │ │
│ │          at 0x7f986b2355a0>                                              │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :552 in write_dataset                                                        │
│                                                                              │
│   549 │   │   │   self.write_dataset(export)                                 │
│   550 │                                                                      │
│   551 │   def write_dataset(self, export: DatasetExport) -> None:            │
│ ❱ 552 │   │   df = getattr(self, export.query_fxn)                           │
│   553 │   │   assert isinstance(df, pd.DataFrame)                            │
│   554 │   │   gz_compression = {"method": "gzip", "mtime": 0}                │
│   555 │   │   if ExportFormat.parquet in export.export_formats:              │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ export = DatasetExport(                                                  │ │
│ │          │   name='genes',                                               │ │
│ │          │   query_fxn='gene_df',                                        │ │
│ │          │   description='Primary table of ensembl genes with IDs,       │ │
│ │          symbols, and genomic location informati'+150,                   │ │
│ │          │   export_formats=[                                            │ │
│ │          │   │   <ExportFormat.parquet: 'parquet'>,                      │ │
│ │          │   │   <ExportFormat.tsv: 'tsv'>,                              │ │
│ │          │   │   <ExportFormat.excel: 'excel'>,                          │ │
│ │          │   │   <ExportFormat.json: 'json'>                             │ │
│ │          │   ]                                                           │ │
│ │          )                                                               │ │
│ │   self = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer object │ │
│ │          at 0x7f986b2355a0>                                              │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/functools.py:981 in   │
│ __get__                                                                      │
│                                                                              │
│   978 │   │   │   │   # check if another thread filled cache while we awaite │
│   979 │   │   │   │   val = cache.get(self.attrname, _NOT_FOUND)             │
│   980 │   │   │   │   if val is _NOT_FOUND:                                  │
│ ❱ 981 │   │   │   │   │   val = self.func(instance)                          │
│   982 │   │   │   │   │   try:                                               │
│   983 │   │   │   │   │   │   cache[self.attrname] = val                     │
│   984 │   │   │   │   │   except TypeError:                                  │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    cache = {                                                             │ │
│ │            │   'species': Species(                                       │ │
│ │            │   │   name='homo_sapiens',                                  │ │
│ │            │   │   common_name='human',                                  │ │
│ │            │   │   assembly='38',                                        │ │
│ │            │   │   ensembl_gene_pattern='^ENSG[0-9]{11}$',               │ │
│ │            │   │   enable_mhc=True,                                      │ │
│ │            │   │   mhc_chromosome='6',                                   │ │
│ │            │   │   mhc_lower=28510120,                                   │ │
│ │            │   │   mhc_upper=33480577,                                   │ │
│ │            │   │   xmhc_lower=25726063,                                  │ │
│ │            │   │   xmhc_upper=33410226,                                  │ │
│ │            │   │   chromosomes=[                                         │ │
│ │            │   │   │   '1',                                              │ │
│ │            │   │   │   '2',                                              │ │
│ │            │   │   │   '3',                                              │ │
│ │            │   │   │   '4',                                              │ │
│ │            │   │   │   '5',                                              │ │
│ │            │   │   │   '6',                                              │ │
│ │            │   │   │   '7',                                              │ │
│ │            │   │   │   '8',                                              │ │
│ │            │   │   │   '9',                                              │ │
│ │            │   │   │   '10',                                             │ │
│ │            │   │   │   ... +15                                           │ │
│ │            │   │   ]                                                     │ │
│ │            │   ),                                                        │ │
│ │            │   'release': '111',                                         │ │
│ │            │   'database': 'homo_sapiens_core_111_38',                   │ │
│ │            │   'output_directory':                                       │ │
│ │            PosixPath('output/homo_sapiens_core_111_38'),                 │ │
│ │            │   '_xref_raw_df':         ensembl_gene_id   xref_source     │ │
│ │            ... xref_info_type xref_linkage_annotation                    │ │
│ │            0       ENSG00000000003  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            1       ENSG00000000003    EntrezGene  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            2       ENSG00000000003     GeneCards  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            3       ENSG00000000003          HGNC  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            4       ENSG00000000003      MIM_GENE  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            ...                 ...           ...  ...            ...     │ │
│ │            ...                                                           │ │
│ │            505962  ENSG00000293556  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505963  ENSG00000293557  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505964  ENSG00000293558  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505965  ENSG00000293559  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505966  ENSG00000293560  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │                                                                          │ │
│ │            [505967 rows x 7 columns],                                    │ │
│ │            │   'xref_lrg_df':         ensembl_gene_id lrg_gene_id        │ │
│ │            89      ENSG00000000971      LRG_47                           │ │
│ │            144     ENSG00000001084    LRG_1166                           │ │
│ │            264     ENSG00000001626     LRG_663                           │ │
│ │            349     ENSG00000001631     LRG_650                           │ │
│ │            438     ENSG00000002586    LRG_1023                           │ │
│ │            ...                 ...         ...                           │ │
│ │            455788  ENSG00000277027     LRG_163                           │ │
│ │            458738  ENSG00000277586     LRG_259                           │ │
│ │            467390  ENSG00000279220    LRG_1105                           │ │
│ │            475102  ENSG00000282608     LRG_424                           │ │
│ │            476974  ENSG00000283122    LRG_1035                           │ │
│ │                                                                          │ │
│ │            [1324 rows x 2 columns]                                       │ │
│ │            }                                                             │ │
│ │ instance = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer      │ │
│ │            object at 0x7f986b2355a0>                                     │ │
│ │    owner = <class                                                        │ │
│ │            'ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer'>    │ │
│ │     self = <functools.cached_property object at 0x7f9876531780>          │ │
│ │      val = <object object at 0x7f9884334340>                             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :99 in gene_df                                                               │
│                                                                              │
│    96 │   │   gene_df = gene_df.join(desc_df)                                │
│    97 │   │   # add ensembl_representative_gene_id column                    │
│    98 │   │   gene_repr_df = gene_df.merge(                                  │
│ ❱  99 │   │   │   self.alt_allele_df[["ensembl_gene_id", "ensembl_representa │
│   100 │   │   │   how="left",                                                │
│   101 │   │   )                                                              │
│   102 │   │   gene_repr_df.ensembl_representative_gene_id = (                │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ desc_df = │   │   │   │   │   │   │   │   │   │   gene_description  ...  │ │
│ │           gene_description_source_id                                     │ │
│ │           0                                          tetraspanin 6  ...  │ │
│ │           HGNC:11858                                                     │ │
│ │           1                                            tenomodulin  ...  │ │
│ │           HGNC:17757                                                     │ │
│ │           2      dolichyl-phosphate mannosyltransferase subunit...  ...  │ │
│ │           HGNC:3005                                                      │ │
│ │           3                               SCY1 like pseudokinase 3  ...  │ │
│ │           HGNC:19285                                                     │ │
│ │           4      FIGNL1 interacting regulator of recombination ...  ...  │ │
│ │           HGNC:25565                                                     │ │
│ │           ...                                                  ...  ...  │ │
│ │           ...                                                            │ │
│ │           70706                                   novel transcript  ...  │ │
│ │           NaN                                                            │ │
│ │           70707                                   novel transcript  ...  │ │
│ │           NaN                                                            │ │
│ │           70708                                   novel transcript  ...  │ │
│ │           NaN                                                            │ │
│ │           70709                                   novel transcript  ...  │ │
│ │           NaN                                                            │ │
│ │           70710                                   novel transcript  ...  │ │
│ │           NaN                                                            │ │
│ │                                                                          │ │
│ │           [70711 rows x 3 columns]                                       │ │
│ │ gene_df = │      ensembl_gene_id  ...  gene_description_source_id        │ │
│ │           0      ENSG00000000003  ...                  HGNC:11858        │ │
│ │           1      ENSG00000000005  ...                  HGNC:17757        │ │
│ │           2      ENSG00000000419  ...                   HGNC:3005        │ │
│ │           3      ENSG00000000457  ...                  HGNC:19285        │ │
│ │           4      ENSG00000000460  ...                  HGNC:25565        │ │
│ │           ...                ...  ...                         ...        │ │
│ │           70706  ENSG00000293556  ...                         NaN        │ │
│ │           70707  ENSG00000293557  ...                         NaN        │ │
│ │           70708  ENSG00000293558  ...                         NaN        │ │
│ │           70709  ENSG00000293559  ...                         NaN        │ │
│ │           70710  ENSG00000293560  ...                         NaN        │ │
│ │                                                                          │ │
│ │           [70711 rows x 23 columns]                                      │ │
│ │    self = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer       │ │
│ │           object at 0x7f986b2355a0>                                      │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/functools.py:981 in   │
│ __get__                                                                      │
│                                                                              │
│   978 │   │   │   │   # check if another thread filled cache while we awaite │
│   979 │   │   │   │   val = cache.get(self.attrname, _NOT_FOUND)             │
│   980 │   │   │   │   if val is _NOT_FOUND:                                  │
│ ❱ 981 │   │   │   │   │   val = self.func(instance)                          │
│   982 │   │   │   │   │   try:                                               │
│   983 │   │   │   │   │   │   cache[self.attrname] = val                     │
│   984 │   │   │   │   │   except TypeError:                                  │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    cache = {                                                             │ │
│ │            │   'species': Species(                                       │ │
│ │            │   │   name='homo_sapiens',                                  │ │
│ │            │   │   common_name='human',                                  │ │
│ │            │   │   assembly='38',                                        │ │
│ │            │   │   ensembl_gene_pattern='^ENSG[0-9]{11}$',               │ │
│ │            │   │   enable_mhc=True,                                      │ │
│ │            │   │   mhc_chromosome='6',                                   │ │
│ │            │   │   mhc_lower=28510120,                                   │ │
│ │            │   │   mhc_upper=33480577,                                   │ │
│ │            │   │   xmhc_lower=25726063,                                  │ │
│ │            │   │   xmhc_upper=33410226,                                  │ │
│ │            │   │   chromosomes=[                                         │ │
│ │            │   │   │   '1',                                              │ │
│ │            │   │   │   '2',                                              │ │
│ │            │   │   │   '3',                                              │ │
│ │            │   │   │   '4',                                              │ │
│ │            │   │   │   '5',                                              │ │
│ │            │   │   │   '6',                                              │ │
│ │            │   │   │   '7',                                              │ │
│ │            │   │   │   '8',                                              │ │
│ │            │   │   │   '9',                                              │ │
│ │            │   │   │   '10',                                             │ │
│ │            │   │   │   ... +15                                           │ │
│ │            │   │   ]                                                     │ │
│ │            │   ),                                                        │ │
│ │            │   'release': '111',                                         │ │
│ │            │   'database': 'homo_sapiens_core_111_38',                   │ │
│ │            │   'output_directory':                                       │ │
│ │            PosixPath('output/homo_sapiens_core_111_38'),                 │ │
│ │            │   '_xref_raw_df':         ensembl_gene_id   xref_source     │ │
│ │            ... xref_info_type xref_linkage_annotation                    │ │
│ │            0       ENSG00000000003  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            1       ENSG00000000003    EntrezGene  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            2       ENSG00000000003     GeneCards  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            3       ENSG00000000003          HGNC  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            4       ENSG00000000003      MIM_GENE  ...      DEPENDENT     │ │
│ │            None                                                          │ │
│ │            ...                 ...           ...  ...            ...     │ │
│ │            ...                                                           │ │
│ │            505962  ENSG00000293556  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505963  ENSG00000293557  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505964  ENSG00000293558  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505965  ENSG00000293559  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │            505966  ENSG00000293560  ArrayExpress  ...         DIRECT     │ │
│ │            None                                                          │ │
│ │                                                                          │ │
│ │            [505967 rows x 7 columns],                                    │ │
│ │            │   'xref_lrg_df':         ensembl_gene_id lrg_gene_id        │ │
│ │            89      ENSG00000000971      LRG_47                           │ │
│ │            144     ENSG00000001084    LRG_1166                           │ │
│ │            264     ENSG00000001626     LRG_663                           │ │
│ │            349     ENSG00000001631     LRG_650                           │ │
│ │            438     ENSG00000002586    LRG_1023                           │ │
│ │            ...                 ...         ...                           │ │
│ │            455788  ENSG00000277027     LRG_163                           │ │
│ │            458738  ENSG00000277586     LRG_259                           │ │
│ │            467390  ENSG00000279220    LRG_1105                           │ │
│ │            475102  ENSG00000282608     LRG_424                           │ │
│ │            476974  ENSG00000283122    LRG_1035                           │ │
│ │                                                                          │ │
│ │            [1324 rows x 2 columns]                                       │ │
│ │            }                                                             │ │
│ │ instance = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer      │ │
│ │            object at 0x7f986b2355a0>                                     │ │
│ │    owner = <class                                                        │ │
│ │            'ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer'>    │ │
│ │     self = <functools.cached_property object at 0x7f986b234b20>          │ │
│ │      val = <object object at 0x7f9884334340>                             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :142 in alt_allele_df                                                        │
│                                                                              │
│   139 │   │   │   "representative_gene_method",                              │
│   140 │   │   ]                                                              │
│   141 │   │   if not alt_allele_df.empty:                                    │
│ ❱ 142 │   │   │   alt_allele_df = alt_allele_df.groupby("alt_allele_group_id │
│   143 │   │   │   │   self._alt_allele_add_representative                    │
│   144 │   │   │   )                                                          │
│   145 │   │   else:                                                          │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ alt_allele_df = │      ensembl_gene_id  ...  ensembl_created_date        │ │
│ │                 0      ENSG00000282572  ...   2015-06-01 18:57:05        │ │
│ │                 1      ENSG00000281951  ...   2015-06-01 18:57:05        │ │
│ │                 2      ENSG00000282572  ...   2015-06-01 18:57:05        │ │
│ │                 3      ENSG00000273644  ...   2014-06-09 10:49:07        │ │
│ │                 4      ENSG00000273644  ...   2014-06-09 10:49:07        │ │
│ │                 ...                ...  ...                   ...        │ │
│ │                 14506  ENSG00000284613  ...   2017-06-13 10:44:55        │ │
│ │                 14507  ENSG00000292409  ...   2023-04-14 17:13:51        │ │
│ │                 14508  ENSG00000254581  ...   2010-11-01 15:31:55        │ │
│ │                 14509  ENSG00000254581  ...   2010-11-01 15:31:55        │ │
│ │                 14510  ENSG00000292410  ...   2023-04-14 17:13:51        │ │
│ │                                                                          │ │
│ │                 [14511 rows x 7 columns]                                 │ │
│ │ expected_cols = [                                                        │ │
│ │                 │   'ensembl_gene_id',                                   │ │
│ │                 │   'alt_allele_group_id',                               │ │
│ │                 │   'alt_allele_is_representative',                      │ │
│ │                 │   'primary_assembly',                                  │ │
│ │                 │   'seq_region',                                        │ │
│ │                 │   'alt_allele_attrib',                                 │ │
│ │                 │   'ensembl_created_date',                              │ │
│ │                 │   'ensembl_representative_gene_id',                    │ │
│ │                 │   'is_representative_gene',                            │ │
│ │                 │   'representative_gene_method'                         │ │
│ │                 ]                                                        │ │
│ │          self = <ensembl_genes.ensembl_genes.Ensembl_Gene_Catalog_Writer │ │
│ │                 object at 0x7f986b2355a0>                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.10/lib/p │
│ ython3.10/site-packages/pandas/core/groupby/groupby.py:1567 in apply         │
│                                                                              │
│   1564 │   │   │   │   with rewrite_warning(                                 │
│   1565 │   │   │   │   │   old_msg, FutureWarning, new_msg                   │
│   1566 │   │   │   │   ) if is_np_func else nullcontext():                   │
│ ❱ 1567 │   │   │   │   │   result = self._python_apply_general(f, self._sele │
│   1568 │   │   │   except TypeError:                                         │
│   1569 │   │   │   │   # gh-20949                                            │
│   1570 │   │   │   │   # try again, with .apply acting as a filtering        │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       args = ()                                                          │ │
│ │          f = <function                                                   │ │
│ │              Ensembl_Gene_Queries._alt_allele_add_representative at      │ │
│ │              0x7f986b26fb50>                                             │ │
│ │       func = <function                                                   │ │
│ │              Ensembl_Gene_Queries._alt_allele_add_representative at      │ │
│ │              0x7f986b26fb50>                                             │ │
│ │ is_np_func = False                                                       │ │
│ │     kwargs = {}                                                          │ │
│ │    new_msg = 'The operation <function                                    │ │
│ │              Ensembl_Gene_Queries._alt_allele_add_representative at      │ │
│ │              0'+160                                                      │ │
│ │    old_msg = 'The default value of numeric_only'                         │ │
│ │  orig_func = <function                                                   │ │
│ │              Ensembl_Gene_Queries._alt_allele_add_representative at      │ │
│ │              0x7f986b26fb50>                                             │ │
│ │       self = <pandas.core.groupby.generic.DataFrameGroupBy object at     │ │
│ │              0x7f986a48ff10>                                             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.10/lib/p │
│ ython3.10/site-packages/pandas/core/groupby/groupby.py:1629 in               │
│ _python_apply_general                                                        │
│                                                                              │
│   1626 │   │   Series or DataFrame                                           │
│   1627 │   │   │   data after applying f                                     │
│   1628 │   │   """                                                           │
│ ❱ 1629 │   │   values, mutated = self.grouper.apply(f, data, self.axis)      │
│   1630 │   │   if not_indexed_same is None:                                  │
│   1631 │   │   │   not_indexed_same = mutated or self.mutated                │
│   1632 │   │   override_group_keys = False                                   │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             data = │      ensembl_gene_id  ...  ensembl_created_date     │ │
│ │                    0      ENSG00000282572  ...   2015-06-01 18:57:05     │ │
│ │                    1      ENSG00000281951  ...   2015-06-01 18:57:05     │ │
│ │                    2      ENSG00000282572  ...   2015-06-01 18:57:05     │ │
│ │                    3      ENSG00000273644  ...   2014-06-09 10:49:07     │ │
│ │                    4      ENSG00000273644  ...   2014-06-09 10:49:07     │ │
│ │                    ...                ...  ...                   ...     │ │
│ │                    14506  ENSG00000284613  ...   2017-06-13 10:44:55     │ │
│ │                    14507  ENSG00000292409  ...   2023-04-14 17:13:51     │ │
│ │                    14508  ENSG00000254581  ...   2010-11-01 15:31:55     │ │
│ │                    14509  ENSG00000254581  ...   2010-11-01 15:31:55     │ │
│ │                    14510  ENSG00000292410  ...   2023-04-14 17:13:51     │ │
│ │                                                                          │ │
│ │                    [14511 rows x 7 columns]                              │ │
│ │                f = <function                                             │ │
│ │                    Ensembl_Gene_Queries._alt_allele_add_representative   │ │
│ │                    at 0x7f986b26fb50>                                    │ │
│ │           is_agg = False                                                 │ │
│ │     is_transform = False                                                 │ │
│ │ not_indexed_same = None                                                  │ │
│ │             self = <pandas.core.groupby.generic.DataFrameGroupBy object  │ │
│ │                    at 0x7f986a48ff10>                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.10/lib/p │
│ ython3.10/site-packages/pandas/core/groupby/ops.py:839 in apply              │
│                                                                              │
│    836 │   │   │                                                             │
│    837 │   │   │   # group might be modified                                 │
│    838 │   │   │   group_axes = group.axes                                   │
│ ❱  839 │   │   │   res = f(group)                                            │
│    840 │   │   │   if not mutated and not _is_indexed_like(res, group_axes,  │
│    841 │   │   │   │   mutated = True                                        │
│    842 │   │   │   result_values.append(res)                                 │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │          axis = 0                                                        │ │
│ │          data = │      ensembl_gene_id  ...  ensembl_created_date        │ │
│ │                 0      ENSG00000282572  ...   2015-06-01 18:57:05        │ │
│ │                 1      ENSG00000281951  ...   2015-06-01 18:57:05        │ │
│ │                 2      ENSG00000282572  ...   2015-06-01 18:57:05        │ │
│ │                 3      ENSG00000273644  ...   2014-06-09 10:49:07        │ │
│ │                 4      ENSG00000273644  ...   2014-06-09 10:49:07        │ │
│ │                 ...                ...  ...                   ...        │ │
│ │                 14506  ENSG00000284613  ...   2017-06-13 10:44:55        │ │
│ │                 14507  ENSG00000292409  ...   2023-04-14 17:13:51        │ │
│ │                 14508  ENSG00000254581  ...   2010-11-01 15:31:55        │ │
│ │                 14509  ENSG00000254581  ...   2010-11-01 15:31:55        │ │
│ │                 14510  ENSG00000292410  ...   2023-04-14 17:13:51        │ │
│ │                                                                          │ │
│ │                 [14511 rows x 7 columns]                                 │ │
│ │             f = <function                                                │ │
│ │                 Ensembl_Gene_Queries._alt_allele_add_representative at   │ │
│ │                 0x7f986b26fb50>                                          │ │
│ │         group = │    ensembl_gene_id  ...  ensembl_created_date          │ │
│ │                 501  ENSG00000273592  ...   2014-06-09 10:49:07          │ │
│ │                 502  ENSG00000276935  ...   2014-06-09 10:49:07          │ │
│ │                                                                          │ │
│ │                 [2 rows x 7 columns]                                     │ │
│ │    group_axes = [                                                        │ │
│ │                 │   Int64Index([501, 502], dtype='int64'),               │ │
│ │                 │   Index(['ensembl_gene_id', 'alt_allele_group_id',     │ │
│ │                 │      'alt_allele_is_representative',                   │ │
│ │                 'primary_assembly', 'seq_region',                        │ │
│ │                 │      'alt_allele_attrib', 'ensembl_created_date'],     │ │
│ │                 │     dtype='object')                                    │ │
│ │                 ]                                                        │ │
│ │    group_keys = Int64Index([44429, 44430, 44431, 44432, 44433, 44434,    │ │
│ │                 44435, 44436, 44437,                                     │ │
│ │                 │   │   │   44438,                                       │ │
│ │                 │   │   │   ...                                          │ │
│ │                 │   │   │   48534, 48535, 48536, 48537, 48538, 48539,    │ │
│ │                 48540, 48541, 48542,                                     │ │
│ │                 │   │   │   48543],                                      │ │
│ │                 │   │      dtype='int64', name='alt_allele_group_id',    │ │
│ │                 length=3993)                                             │ │
│ │           key = 44458                                                    │ │
│ │       mutated = False                                                    │ │
│ │           res = │    ensembl_gene_id  ...    representative_gene_method  │ │
│ │                 497  ENSG00000094796  ...  alt_allele_is_representative  │ │
│ │                 498  ENSG00000094796  ...  alt_allele_is_representative  │ │
│ │                 499  ENSG00000262993  ...  alt_allele_is_representative  │ │
│ │                 500  ENSG00000292029  ...  alt_allele_is_representative  │ │
│ │                                                                          │ │
│ │                 [4 rows x 10 columns]                                    │ │
│ │ result_values = [                                                        │ │
│ │                 │      ensembl_gene_id  ...                              │ │
│ │                 representative_gene_method                               │ │
│ │                 0  ENSG00000282572  ...  alt_allele_is_representative    │ │
│ │                 1  ENSG00000281951  ...  alt_allele_is_representative    │ │
│ │                 2  ENSG00000282572  ...  alt_allele_is_representative    │ │
│ │                                                                          │ │
│ │                 [3 rows x 10 columns],                                   │ │
│ │                 │      ensembl_gene_id  ...                              │ │
│ │                 representative_gene_method                               │ │
│ │                 3  ENSG00000273644  ...  alt_allele_is_representative    │ │
│ │                 4  ENSG00000273644  ...  alt_allele_is_representative    │ │
│ │                 5  ENSG00000282333  ...  alt_allele_is_representative    │ │
│ │                                                                          │ │
│ │                 [3 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 6   ENSG00000232325  ...  alt_allele_is_representative   │ │
│ │                 7   ENSG00000232325  ...  alt_allele_is_representative   │ │
│ │                 8   ENSG00000281993  ...  alt_allele_is_representative   │ │
│ │                 9   ENSG00000282645  ...  alt_allele_is_representative   │ │
│ │                 10  ENSG00000288372  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 11  ENSG00000242611  ...  alt_allele_is_representative   │ │
│ │                 12  ENSG00000242611  ...  alt_allele_is_representative   │ │
│ │                 13  ENSG00000282155  ...  alt_allele_is_representative   │ │
│ │                 14  ENSG00000282557  ...  alt_allele_is_representative   │ │
│ │                 15  ENSG00000288288  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 16  ENSG00000242474  ...  alt_allele_is_representative   │ │
│ │                 17  ENSG00000242474  ...  alt_allele_is_representative   │ │
│ │                 18  ENSG00000282226  ...  alt_allele_is_representative   │ │
│ │                 19  ENSG00000282662  ...  alt_allele_is_representative   │ │
│ │                 20  ENSG00000288472  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 21  ENSG00000240859  ...  alt_allele_is_representative   │ │
│ │                 22  ENSG00000240859  ...  alt_allele_is_representative   │ │
│ │                 23  ENSG00000282075  ...  alt_allele_is_representative   │ │
│ │                 24  ENSG00000282461  ...  alt_allele_is_representative   │ │
│ │                 25  ENSG00000288417  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 26  ENSG00000261795  ...  alt_allele_is_representative   │ │
│ │                 27  ENSG00000261795  ...  alt_allele_is_representative   │ │
│ │                 28  ENSG00000281767  ...  alt_allele_is_representative   │ │
│ │                 29  ENSG00000282781  ...  alt_allele_is_representative   │ │
│ │                 30  ENSG00000288502  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 31  ENSG00000239715  ...  alt_allele_is_representative   │ │
│ │                 32  ENSG00000239715  ...  alt_allele_is_representative   │ │
│ │                 33  ENSG00000281349  ...  alt_allele_is_representative   │ │
│ │                 34  ENSG00000282309  ...  alt_allele_is_representative   │ │
│ │                 35  ENSG00000288311  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 36  ENSG00000240093  ...  alt_allele_is_representative   │ │
│ │                 37  ENSG00000240093  ...  alt_allele_is_representative   │ │
│ │                 38  ENSG00000281788  ...  alt_allele_is_representative   │ │
│ │                 39  ENSG00000282575  ...  alt_allele_is_representative   │ │
│ │                 40  ENSG00000288449  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   │   ensembl_gene_id  ...                             │ │
│ │                 representative_gene_method                               │ │
│ │                 41  ENSG00000177706  ...  alt_allele_is_representative   │ │
│ │                 42  ENSG00000177706  ...  alt_allele_is_representative   │ │
│ │                 43  ENSG00000281429  ...  alt_allele_is_representative   │ │
│ │                 44  ENSG00000282147  ...  alt_allele_is_representative   │ │
│ │                 45  ENSG00000288499  ...  alt_allele_is_representative   │ │
│ │                                                                          │ │
│ │                 [5 rows x 10 columns],                                   │ │
│ │                 │   ... +19                                              │ │
│ │                 ]                                                        │ │
│ │          self = <pandas.core.groupby.ops.BaseGrouper object at           │ │
│ │                 0x7f986a48fd60>                                          │ │
│ │      splitter = <pandas.core.groupby.ops.FrameSplitter object at         │ │
│ │                 0x7f986a48e080>                                          │ │
│ │        zipped = <zip object at 0x7f9856a41e80>                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :203 in _alt_allele_add_representative                                       │
│                                                                              │
│   200 │   │   Apply to alt_allele_df grouped by `alt_allele_group_id` to add │
│   201 │   │   `ensembl_representative_gene_id`, `is_representative_gene`, `r │
│   202 │   │   """                                                            │
│ ❱ 203 │   │   representative, method = Ensembl_Gene_Queries._alt_allele_get_ │
│   204 │   │   df["ensembl_representative_gene_id"] = representative          │
│   205 │   │   df["is_representative_gene"] = (                               │
│   206 │   │   │   df.ensembl_gene_id == df.ensembl_representative_gene_id    │
│                                                                              │
│ ╭─────────────────────── locals ───────────────────────╮                     │
│ │ df = │    ensembl_gene_id  ...  ensembl_created_date │                     │
│ │      501  ENSG00000273592  ...   2014-06-09 10:49:07 │                     │
│ │      502  ENSG00000276935  ...   2014-06-09 10:49:07 │                     │
│ │                                                      │                     │
│ │      [2 rows x 7 columns]                            │                     │
│ ╰──────────────────────────────────────────────────────╯                     │
│                                                                              │
│ /home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py │
│ :187 in _alt_allele_get_representative                                       │
│                                                                              │
│   184 │   │   if len(representatives) == 1:                                  │
│   185 │   │   │   return representatives[0], "primary_assembly"              │
│   186 │   │   if len(representatives) > 1:                                   │
│ ❱ 187 │   │   │   raise ValueError(                                          │
│   188 │   │   │   │   "expected at most 1 primary assembly gene per alt_alle │
│   189 │   │   │   )                                                          │
│   190 │   │   return (                                                       │
│                                                                              │
│ ╭───────────────────────────── locals ──────────────────────────────╮        │
│ │              df = │    ensembl_gene_id  ...  ensembl_created_date │        │
│ │                   501  ENSG00000273592  ...   2014-06-09 10:49:07 │        │
│ │                   502  ENSG00000276935  ...   2014-06-09 10:49:07 │        │
│ │                                                                   │        │
│ │                   [2 rows x 7 columns]                            │        │
│ │ representatives = ['ENSG00000273592', 'ENSG00000276935']          │        │
│ ╰───────────────────────────────────────────────────────────────────╯        │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: expected at most 1 primary assembly gene per alt_allele_group
Error: Process completed with exit code 1.
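
The failure above comes from one alt_allele_group (key 44458 in the locals) holding two genes that both count as primary assembly. A rough sketch of how such groups could be listed before the groupby-apply runs, so the offending rows are visible without the full traceback. Assumptions: `primary_assembly` is boolean-like (its dtype is not shown in the log), the helper name `find_ambiguous_groups` is hypothetical, and the frame below is toy data apart from the two MBOAT7 gene IDs and their group key.

```python
import pandas as pd


def find_ambiguous_groups(alt_allele_df: pd.DataFrame) -> pd.DataFrame:
    """Return rows of alt_allele groups with more than one distinct primary assembly gene."""
    # Assumption: primary_assembly is truthy for genes located on the primary assembly.
    primary = alt_allele_df[alt_allele_df["primary_assembly"].astype(bool)]
    # Count distinct gene IDs, because the same gene can appear on several rows of a group.
    n_primary = primary.groupby("alt_allele_group_id")["ensembl_gene_id"].nunique()
    ambiguous_ids = n_primary[n_primary > 1].index
    return alt_allele_df[alt_allele_df["alt_allele_group_id"].isin(ambiguous_ids)]


# Toy frame: group 1 is made up; group 44458 mirrors the two genes from the traceback.
toy_df = pd.DataFrame(
    {
        "ensembl_gene_id": ["GENE_A", "GENE_A", "GENE_B", "ENSG00000273592", "ENSG00000276935"],
        "alt_allele_group_id": [1, 1, 1, 44458, 44458],
        "primary_assembly": [True, True, False, True, True],
    }
)
ambiguous_df = find_ambiguous_groups(toy_df)
print(ambiguous_df)
```

Running something like this against the real `alt_allele_df` would show whether MBOAT7 is the only affected group in release 110 or one of several.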

Use `pyarrow` instead of `fastparquet` to write parquet data

pyarrow is the default pandas parquet engine, and by default its output works better across the ecosystem (including pyspark). Specifically, genes.snappy.parquet can't be read by pyspark 3.2.0, due to:

org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)

Btw, fastparquet has a Spark-compatible mode for timestamps: times="int96".

Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2:

nanosecond resolution times: the new extended “logical” types system supports nanoseconds alongside the previous millis and micros. We now emit these for the default pandas time type, and produce full parquet schema including both “converted” and “logical” type information. Note that all output has isAdjustedToUTC=True, i.e., these are timestamps rather than local time. The time-zone is stored in the metadata, as before, and will be successfully recreated only in fastparquet and (py)arrow. Otherwise, the times will appear to be UTC. For compatibility with Spark, you may still want to use times="int96" when writing.

homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene

In the homo_sapiens_core_104_38 database, ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all ensembl gene mappings to ncbigenes for SMN1 & SMN2:

| ensembl_gene_id | gene_symbol | ensembl_representative_gene_id | is_representative | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000172062 | SMN1 | ENSG00000172062 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000275349 | SMN1 | ENSG00000172062 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |

Some notes from the table:

  • ENSG00000172062 / SMN1 only maps to SMN1 in ncbigene and not SMN2
  • ENSG00000172062 / SMN1 has a single non-representative alt-allele, which is ENSG00000275349
  • ENSG00000205571 / SMN2 has two non-representative alt-alleles, which are ENSG00000273772 and ENSG00000277773.
  • Alt alleles have the same mappings as their representative gene, so any fix to the mappings of ENSG00000205571 should also be applied to its alt alleles.

I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.

Python code to generate the table above:

import pandas as pd

# Pin the data to the commit that holds the homo_sapiens_core_104_38 export
commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
xrefs_df = pd.read_parquet(url)
smn_symbols = {"SMN1", "SMN2"}
# Keep only EntrezGene xrefs whose label is SMN1 or SMN2
smn_df = (
    xrefs_df
    .query("xref_source == 'EntrezGene'")
    .query("xref_label in @smn_symbols")
)
# Attach gene symbol and representative gene to each xref row;
# merge joins on the columns common to both frames (here ensembl_gene_id)
smn_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(smn_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
smn_df
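
To check whether SMN2 is an isolated case, the same xrefs table could be scanned for every Ensembl gene that maps to more than one EntrezGene accession. A small sketch of that check follows; the helper name `multi_mapped_genes` is hypothetical, and the toy frame only reproduces three rows from the table above rather than loading the full parquet file.

```python
import pandas as pd


def multi_mapped_genes(xrefs_df: pd.DataFrame, source: str = "EntrezGene") -> pd.DataFrame:
    """Return Ensembl genes that map to more than one accession in the given xref source."""
    source_df = xrefs_df[xrefs_df["xref_source"] == source]
    # Count distinct accessions per Ensembl gene.
    n_accessions = source_df.groupby("ensembl_gene_id")["xref_accession"].nunique()
    return n_accessions[n_accessions > 1].rename("n_accessions").reset_index()


# Toy subset of the table above: SMN1 has one EntrezGene xref, SMN2 has two.
toy_xrefs_df = pd.DataFrame(
    [
        ("ENSG00000172062", "EntrezGene", "6606", "SMN1"),
        ("ENSG00000205571", "EntrezGene", "6606", "SMN1"),
        ("ENSG00000205571", "EntrezGene", "6607", "SMN2"),
    ],
    columns=["ensembl_gene_id", "xref_source", "xref_accession", "xref_label"],
)
multi_df = multi_mapped_genes(toy_xrefs_df)
print(multi_df)
```

Applied to the full `xrefs_df`, the number of rows returned would give a sense of how common one-to-many EntrezGene mappings are in this release.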

Retry query when MySQL connection is lost
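
One possible shape for this, as a sketch only: a small retry helper with exponential backoff that could wrap the `pd.read_sql_query` call in `run_query`. The helper name `retry`, its parameters, and the flaky stand-in function are all hypothetical. In the real code the exception to catch would be the driver's or SQLAlchemy's OperationalError; that wiring is not shown here, and the builtin ConnectionError stands in for it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Stand-in for a query that loses its connection twice before succeeding.
calls = {"n": 0}


def flaky_query() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Lost connection to MySQL server during query")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
result = retry(flaky_query, exceptions=(ConnectionError,), attempts=3, sleep=delays.append)
print(result, calls["n"], delays)
```

Only the connection-loss error is retried, so genuine SQL errors would still fail on the first attempt.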

Got the following error in this extraction:

Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 523, in cmd_query
    self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Lost connection to MySQL server during query

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 269, in execute
    result = self._cnx.cmd_query(stmt, raw=self._raw,
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 528, in cmd_query
    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.OperationalError: 2013 (HY000): Lost connection to MySQL server during query

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 611, in command
    fire.Fire(commands)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 583, in export_all
    cls.export_datasets(species=species, release=release)
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 571, in export_datasets
    ensgc.export_datasets()
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 507, in export_datasets
    self.write_dataset(export)
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 510, in write_dataset
    df = getattr(self, export.query_fxn)
  File "/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/functools.py", line 993, in __get__
    val = self.func(instance)
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 388, in xref_go_df
    xref_go_df = self.run_query("gene_xrefs_go").merge(
  File "/home/runner/work/ensembl-genes/ensembl-genes/ensembl_genes/ensembl_genes.py", line 69, in run_query
    df = pd.read_sql_query(sql=query, con=self.connection_url)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/pandas/io/sql.py", line 399, in read_sql_query
    return pandas_sql.read_query(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/pandas/io/sql.py", line 1554, in read_query
    result = self.execute(*args)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/pandas/io/sql.py", line 1399, in execute
    return self.connectable.execution_options().execute(*args, **kwargs)
  File "<string>", line 2, in execute
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 401, in warned
    return fn(*args, **kwargs)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3146, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1274, in execute
    return self._exec_driver_sql(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1578, in _exec_driver_sql
    ret = self._execute_context(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/mysql/connector/cursor_cext.py", line 269, in execute
    result = self._cnx.cmd_query(stmt, raw=self._raw,
  File "/home/runner/.cache/pypoetry/virtualenvs/ensembl-genes-GU6ps7Hy-py3.9/lib/python3.9/site-packages/mysql/connector/connection_cext.py", line 528, in cmd_query
    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
sqlalchemy.exc.OperationalError: (mysql.connector.errors.OperationalError) 2013 (HY000): Lost connection to MySQL server during query
[SQL: -- get Gene Ontology annotations for genes
-- GO xrefs in ensembl are linked to transcripts not genes.
-- Refs internal Related Sciences issue 316.
SELECT
  gene.stable_id AS ensembl_gene_id,
  -- external_db.db_name AS xref_source,
  xref.dbprimary_acc AS go_id,
  -- xref.display_label AS xref_label,
  xref.description AS go_label,
  GROUP_CONCAT(DISTINCT object_xref.linkage_annotation ORDER BY object_xref.linkage_annotation) AS go_evidence_codes,
  GROUP_CONCAT(DISTINCT xref.info_type ORDER BY xref.info_type) AS xref_info_types,
  GROUP_CONCAT(DISTINCT transcript.stable_id ORDER BY transcript.stable_id) AS ensembl_transcript_ids
FROM gene
INNER JOIN transcript 
  ON gene.gene_id = transcript.gene_id 
INNER JOIN object_xref 
  ON transcript.transcript_id = object_xref.ensembl_id 
  AND object_xref.ensembl_object_type = 'Transcript'
INNER JOIN xref 
  ON xref.xref_id = object_xref.xref_id
INNER JOIN external_db 
  ON xref.external_db_id = external_db.external_db_id 
  AND external_db.db_name = 'GO'
WHERE
  -- all genes were current when query was written, ensure this is always the case
  gene.is_current AND
  -- refs internal Related Sciences issue 289.
  gene.biotype != "LRG_gene"
GROUP BY gene.stable_id, external_db.db_name, xref.dbprimary_acc
ORDER BY ensembl_gene_id, go_id
-- LIMIT 10
]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

The SQLAlchemy docs might have some helpful info on how we can retry lost connections: https://docs.sqlalchemy.org/en/14/core/pooling.html#pool-disconnects.
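As a rough illustration of the retry idea, here is a minimal sketch of a generic retry wrapper with exponential backoff. The helper name `retry_on_disconnect` and its parameters are hypothetical and not part of ensembl-genes or SQLAlchemy; it only assumes that the failing call can safely be re-run from scratch, which should hold for a read-only query.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_disconnect(
    run: Callable[[], T],
    retry_exceptions: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run() until it succeeds or max_attempts is reached (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run()
        except retry_exceptions:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error unchanged
            # exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_query() -> str:
        # Stand-in for a query that loses its connection twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("Lost connection to MySQL server during query")
        return "rows"

    print(retry_on_disconnect(flaky_query, (ConnectionError,), base_delay=0.01))
```

In run_query, the wrapped callable could plausibly be a lambda around the existing pd.read_sql_query(sql=query, con=self.connection_url) call, with sqlalchemy.exc.OperationalError (the exception raised in the traceback above) as the retryable type. Whether a plain retry is enough for the long-running GO query is untested.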

Missing argument 'CLS' when Running Workflow

Trying to run the workflow for Ensembl 110 (also tried with 'latest', which seemed to default to 109) and getting the following error (same error for all species):

Run poetry run ensembl_genes all --species="human" --release=110
Usage: ensembl_genes all [OPTIONS] CLS
Try 'ensembl_genes all --help' for help.
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Missing argument 'CLS'.                                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
Error: Process completed with exit code 2.

I'm trying to get the "_old_to_newest" tables for our MSigDB build somewhat urgently.
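One possible reading of the usage line "ensembl_genes all [OPTIONS] CLS", offered as an assumption rather than a diagnosis: a CLI framework that builds its arguments from a function signature will treat `cls` as a required positional argument if it is handed the raw function underneath a classmethod instead of the bound method. The sketch below shows the difference using only the standard library. The `Commands` class and `required_params` helper are hypothetical stand-ins, not the actual ensembl-genes code.

```python
import inspect
from typing import Callable, List


class Commands:
    @classmethod
    def export_all(cls, species: str = "human", release: str = "latest") -> None:
        """Hypothetical stand-in for the command behind `ensembl_genes all`."""


def required_params(fn: Callable) -> List[str]:
    """Names of positional parameters that have no default value."""
    return [
        name
        for name, param in inspect.signature(fn).parameters.items()
        if param.default is inspect.Parameter.empty
        and param.kind in (param.POSITIONAL_ONLY, param.POSITIONAL_OR_KEYWORD)
    ]


# The raw function stored on the class still has `cls` in its signature ...
raw_function = Commands.__dict__["export_all"].__func__
# ... while the bound method obtained via attribute access does not.
bound_method = Commands.export_all

print(required_params(raw_function))  # ['cls']
print(required_params(bound_method))  # []
```

If this is what is happening, registering the bound method (or a plain module-level function) with the CLI framework would be one way to make the CLS argument disappear.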

Ensembl release 109 seq_region table needs repair

When running ensembl_genes datasets --release=109, I'm getting the following error:

DatabaseError: (mysql.connector.errors.DatabaseError) 1194 (HY000): Table 'seq_region' is marked as crashed and should be repaired

This error occurred when connecting to mysql+mysqlconnector://[email protected]:3306/homo_sapiens_core_109_38. See the query causing the error below:

SELECT
  gene.stable_id AS ensembl_gene_id,
  gene.version AS ensembl_gene_version,
  -- gene symbol methods https://github.com/cogent3/ensembldb3/issues/7
  -- Release 104 retired clone-based gene symbols,
  -- leading to ensembl genes without a symbol. Fill with the stable ID,
  -- as per https://www.ensembl.info/2021/03/15/retirement-of-clone-based-gene-names/
  COALESCE(xref.display_label, gene.stable_id) AS gene_symbol,
  external_db.db_name AS gene_symbol_source_db,
  xref.dbprimary_acc AS gene_symbol_source_id,
  gene.biotype AS gene_biotype,
  gene.description AS gene_description,
  gene.source AS ensembl_source,
  gene.created_date AS ensembl_created_date,
  gene.modified_date AS ensembl_modified_date,
  coord_system.version AS coord_system_version,
  coord_system.name AS coord_system,
  -- get chromosome: refs internal Related Sciences issue 606.
  CASE WHEN coord_system.name = "chromosome"
       THEN COALESCE(exc_seq_region.name, seq_region.name)
       END AS chromosome,
  assembly_exception.exc_type AS seq_region_exc_type,
  seq_region.name AS seq_region,
  gene.seq_region_start AS seq_region_start,
  gene.seq_region_end AS seq_region_end,
  gene.seq_region_strand AS seq_region_strand,
  assembly_exception.exc_seq_region_id IS NULL AS primary_assembly
FROM gene
LEFT JOIN xref ON xref.xref_id = gene.display_xref_id
LEFT JOIN external_db ON xref.external_db_id = external_db.external_db_id
LEFT JOIN seq_region ON gene.seq_region_id = seq_region.seq_region_id
LEFT JOIN coord_system ON seq_region.coord_system_id = coord_system.coord_system_id
LEFT JOIN assembly_exception ON seq_region.seq_region_id = assembly_exception.seq_region_id
  -- keep exc_type in (PATCH_FIX, PATCH_NOVEL, HAP)
  -- refs internal Related Sciences issue 606.
  AND NOT assembly_exception.exc_type <=> "PAR"
LEFT JOIN seq_region AS exc_seq_region ON assembly_exception.exc_seq_region_id = exc_seq_region.seq_region_id
WHERE
  -- all genes were current when query was written, ensure this is always the case
  gene.is_current AND
  -- refs internal Related Sciences issue 289.
  gene.biotype != "LRG_gene"
ORDER BY ensembl_gene_id

I believe this is an upstream issue entirely out of our hands, but wanted to document and report it.

MHC / xMHC genomic coordinates for rat & mouse

Following #4, we now support human, mouse, and rat exports. However, in the species configuration for mouse and rat, we set enable_mhc=False because we haven't yet put in accurate coordinates for the major histocompatibility complex region and extended MHC in these species.

See the code that needs updating for mouse:

# FIXME: mhc coordinates (H2 complex)
# https://doi.org/10.1002/9780470015902.a0000921.pub4
enable_mhc=False,
mhc_chromosome="17",
mhc_lower=28_510_120,
mhc_upper=33_480_577,
xmhc_lower=25_726_063,
xmhc_upper=33_410_226,

And for rat:

# FIXME: mhc coordinates
# https://github.com/related-sciences/ensembl-genes/pull/6#discussion_r729259953
enable_mhc=False,
mhc_chromosome="20",
mhc_lower=28_510_120,
mhc_upper=33_480_577,
xmhc_lower=25_726_063,
xmhc_upper=33_410_226,

If anyone stumbles on this issue and can help that would be much appreciated! We will want coordinates in the assembly specified for each species: currently GRCm39 for mouse and Rnor_6.0 for rat.
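To make clear what the coordinates are for, here is a minimal sketch of how a gene could be classified against the configured intervals. The `MhcConfig` dataclass and `classify_mhc` function are hypothetical illustrations, not the package's actual implementation. The field names mirror the configuration keys above, and the numbers are the unverified placeholder values from the mouse config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MhcConfig:
    """Hypothetical container mirroring the species configuration keys."""
    enable_mhc: bool
    mhc_chromosome: str
    mhc_lower: int
    mhc_upper: int
    xmhc_lower: int
    xmhc_upper: int


def overlaps(start: int, end: int, lower: int, upper: int) -> bool:
    """True when the closed intervals [start, end] and [lower, upper] intersect."""
    return start <= upper and end >= lower


def classify_mhc(config: MhcConfig, chromosome: str, start: int, end: int) -> Optional[str]:
    """Return 'MHC', 'xMHC', or 'no'; None when MHC annotation is disabled."""
    if not config.enable_mhc:
        return None
    if chromosome != config.mhc_chromosome:
        return "no"
    if overlaps(start, end, config.mhc_lower, config.mhc_upper):
        return "MHC"
    if overlaps(start, end, config.xmhc_lower, config.xmhc_upper):
        return "xMHC"
    return "no"


# Placeholder coordinates copied from the mouse config above (flagged FIXME there);
# enable_mhc is switched on here only to exercise the logic.
placeholder = MhcConfig(
    enable_mhc=True,
    mhc_chromosome="17",
    mhc_lower=28_510_120,
    mhc_upper=33_480_577,
    xmhc_lower=25_726_063,
    xmhc_upper=33_410_226,
)

print(classify_mhc(placeholder, "17", 30_000_000, 30_010_000))  # MHC
print(classify_mhc(placeholder, "17", 26_000_000, 26_001_000))  # xMHC
print(classify_mhc(placeholder, "1", 30_000_000, 30_010_000))   # no
```

Once assembly-specific coordinates are available, only the six configuration values would need to change and enable_mhc could be switched on.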

Separate description source from gene description text

Example gene descriptions by species:

  • human: tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
  • mouse: guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
  • rat: glutamate decarboxylase 1 [Source:RGD Symbol;Acc:2652]

Notice the trailing bracketed source information like "[Source:HGNC Symbol;Acc:HGNC:11858]". It would be nice to split this source information into its own columns, so that the actual description can be isolated.

Question: is the source string always going to be in the format of [Source:SOURCE;Acc:CURIE] for all species and descriptions?
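A minimal sketch of the split, assuming the trailing block always looks like [Source:SOURCE;Acc:ACCESSION] when present (the open question above). The function name `split_description` and the returned keys are hypothetical. Note that the rat example carries a bare accession (2652) rather than a CURIE, so the accession is captured as-is.

```python
import re
from typing import Dict, Optional

# Trailing "[Source:...;Acc:...]" block; the description is everything before it.
DESCRIPTION_PATTERN = re.compile(
    r"^(?P<description>.*?)\s*\[Source:(?P<source>[^;\]]+);Acc:(?P<accession>[^\]]+)\]\s*$"
)


def split_description(raw: str) -> Dict[str, Optional[str]]:
    """Split an Ensembl gene description into text, source, and accession."""
    match = DESCRIPTION_PATTERN.match(raw)
    if match is None:
        # No recognizable source block: keep the whole string as the description.
        return {"description": raw.strip(), "source": None, "accession": None}
    return match.groupdict()


examples = [
    "tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]",
    "guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]",
    "glutamate decarboxylase 1 [Source:RGD Symbol;Acc:2652]",
]
for example in examples:
    print(split_description(example))
```

Descriptions that do not match the pattern are passed through unchanged, which makes it easy to count how many rows deviate from the assumed format before relying on it.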
