hubiodatalab / crossbarv2

This is a repo for migration of CROssBAR data to the Neo4j database via BioCypher

crossbarv2's Introduction

CROssBAR-BioCypher-Migration

This is a repo for migrating CROssBARv2 data to the Neo4j database via BioCypher. CROssBARv2 is a heterogeneous, general-purpose biomedical knowledge graph (KG) based system, and an extended and improved version of our previous work (for v1, please check CROssBAR).

This repo is currently under development, so you may encounter some problems while replicating it. Feel free to open an issue if you do.

Installation

The project uses Poetry. You can install it like this:

git clone https://github.com/HUBioDataLab/CROssBAR-BioCypher-Migration.git
cd CROssBAR-BioCypher-Migration
poetry install

Poetry will create a virtual environment according to your configuration (either centrally or in the project folder). You can activate it by running poetry shell inside the project directory.

Note about pycurl

You may encounter an error when executing the UniProt adapter about the SSL backend in pycurl: ImportError: pycurl: libcurl link-time ssl backend (openssl) is different from compile-time ssl backend (none/other)

Should this happen, it can be fixed as described here: https://stackoverflow.com/questions/68167426/how-to-install-a-package-with-poetry-that-requires-cli-args. Run poetry shell followed by pip list and note the version of pycurl. Then run pip install --compile --install-option="--with-openssl" --upgrade --force-reinstall pycurl==<version> to provide the correct SSL backend.

crossbarv2's Issues

Unable to download edge data InterPro adapter

Hi,

I recently started using the InterPro adapter, and it turns out I am unable to download edge data: there seems to be an error coming from pypath when retrieving the input data. Please see the code and error below (note that node data was downloaded without any issues):

Code

from interpro_adapter import InterPro
yeast_interpro_1 = InterPro(organism=559292) # S. cerevisiae S-559292
yeast_interpro_1.download_domain_node_data()
yeast_interpro_1.download_domain_edge_data()

Error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 # Downloading all edges from interpro
----> 2 yeast_interpro_1.download_domain_edge_data()

File /acasadesus/AC_yeast_project/biocypher/interpro_adapter.py:163, in InterPro.download_domain_edge_data(self)
    159 t0 = time()
    161 if self.organism:                
    162     # WARNING: decrease page_size parameter if there is a curl error about timeout in the pypath_log
--> 163     self.interpro_annotations = interpro.interpro_annotations(page_size = self.page_size, reviewed = True, tax_id = self.organism)
    164 else:
    165     self.interpro_annotations = interpro.interpro_annotations(page_size = self.page_size, reviewed = True, tax_id = '')

File /acasadesus/AC_yeast_project/biocypher/.venv/lib/python3.10/site-packages/pypath/inputs/interpro.py:274, in interpro_annotations(page_size, reviewed, tax_id)
    267 c = curl.Curl(
    268     next_page_url,
    269     silent = False,
    270     large = False
    271 )
    273 res = inputs_common.json_read(c.result)
--> 274 totalrec = int(res['count'])
    276 _log(
    277     'Downloading page %u (total: %s).' % (
    278         page + 1,
   (...)
    282     )
    283 )
    285 for entry in res['results']:

TypeError: 'NoneType' object is not subscriptable

Note that I tried setting the page_size parameter to 100, 50, 20, and 10, since the code comment advises reducing it when curl errors occur, but the error persisted. I even tried increasing it to 200 and 500, just in case.

Would anybody be so kind to look into this and let me know if I am missing something or if indeed there is any problem to be fixed? I presume it is only pypath related.

Thanks!
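
The traceback above ends in pypath subscripting a response that is None, which hides the underlying download failure. A minimal sketch of a guard that would surface the problem earlier is given here; `require_payload` and `EmptyDownloadError` are hypothetical names for illustration and are not part of pypath or the adapter.

```python
from typing import Any, Mapping, Optional


class EmptyDownloadError(RuntimeError):
    """Raised when a download step yields no data (hypothetical helper)."""


def require_payload(res: Optional[Mapping[str, Any]], source: str) -> Mapping[str, Any]:
    """Return the parsed response, or fail with a descriptive message if it is None."""
    if res is None:
        raise EmptyDownloadError(
            f"{source}: the download returned no data; "
            "check the network connection and the download log."
        )
    return res


# A successful response passes through unchanged.
page = require_payload({"count": 2, "results": ["a", "b"]}, "InterPro")
total = int(page["count"])

# A failed download is reported at the point of failure.
try:
    require_payload(None, "InterPro")
except EmptyDownloadError as err:
    message = str(err)
```

The guard does not fix the download itself; it only replaces the opaque TypeError with a message that points at the failed request.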

cannot import name 'IntactEdgeField' from 'bccb.ppi_adapter'

Hi, I just cloned the repository, ran "create_crossbar.py" and ran into the following error:

cannot import name 'IntactEdgeField' from 'bccb.ppi_adapter' (C:\Users\Name\Crossbar_to_biocypher\CROssBAR-BioCypher-Migration\bccb\ppi_adapter.py)

However, when removing the "s" from the "IntactEdgeFields"/"BiogridEdgeFields"/"StringEdgeFields" classes in the ppi_adapter, it seems to work and downloads the UniProt data.

I'm still running into another issue further along in the execution, and I'm not sure whether it is related to this. This is the second error I get:
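
One way to keep both spellings importable while the names are being harmonised is a module-level alias. The sketch below assumes the adapter defines the plural class name, as the report suggests; the enum members are placeholders, not the adapter's real fields.

```python
from enum import Enum


class IntactEdgeFields(Enum):
    # Placeholder members for illustration only.
    SOURCE = "source"
    METHOD = "method"


# Alias so that `import IntactEdgeField` and `import IntactEdgeFields`
# both resolve to the same class.
IntactEdgeField = IntactEdgeFields
```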
INFO -- Downloading uniprot data...
100%|██████████| 14/14 [00:00<00:00, 21.93it/s]
INFO -- Acquired UniProt data in 0.07 mins.
INFO -- Preprocessing UniProt data.
100%|██████████| 14/14 [00:03<00:00, 3.53it/s]
Traceback (most recent call last):

File ~.conda\envs\biocypher\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
exec(code, globals, locals)

File c:\users\name\crossbar_to_biocypher\crossbar-biocypher-migration\create_crossbar.py:133
main()

File c:\users\name\crossbar_to_biocypher\crossbar-biocypher-migration\create_crossbar.py:101 in main
driver.write_nodes(uniprot_adapter.get_nodes())

File ~\biocypher\biocypher_core.py:239 in write_nodes
self._get_writer()

File ~\biocypher\biocypher_core.py:203 in _get_writer
translator=self._get_translator(),

File ~\biocypher\biocypher_core.py:173 in _get_translator
self._translator = Translator(

File ~\biocypher\biocypher_translate.py:69 in __init__
self._update_ontology_types()

File ~\biocypher\biocypher_translate.py:396 in _update_ontology_types
self._add_translation_mappings(labels, value['label_as_edge'])

File ~\biocypher\biocypher_translate.py:472 in _add_translation_mappings
self.mappings[on] = self.name_sentence_to_pascal(

File ~\biocypher\biocypher_translate.py:500 in name_sentence_to_pascal
return _misc.sentencecase_to_pascalcase(name)

File ~\biocypher\biocypher_misc.py:205 in sentencecase_to_pascalcase
return re.sub(r'(?:^| )([a-zA-Z])', lambda match: match.group(1).upper(), s)

File ~.conda\envs\biocypher\lib\re.py:209 in sub
return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

Let me know if I should open a separate issue for the second error. Thanks in advance!
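
The second traceback ends in re.sub receiving something that is not a string. Which value reached it cannot be determined from the traceback alone. As a rough sketch, the conversion shown in the traceback can be wrapped with a type check so the offending value is named in the error; the check is an illustrative addition, not BioCypher's code.

```python
import re


def sentencecase_to_pascalcase(s: str) -> str:
    """Convert 'protein domain' to 'ProteinDomain'; reject non-string input early."""
    if not isinstance(s, str):
        raise TypeError(f"expected a string label, got {type(s).__name__}: {s!r}")
    # Same substitution as in the traceback: upper-case the first letter
    # of each word and drop the separating space.
    return re.sub(r"(?:^| )([a-zA-Z])", lambda match: match.group(1).upper(), s)


converted = sentencecase_to_pascalcase("protein domain")

try:
    sentencecase_to_pascalcase(None)
except TypeError as err:
    message = str(err)
```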

PyPath download error in try-except block

After poetry install of the latest CROssBAR adapter version (pypath v14.16 as per poetry.lock file), I am getting the following error:

Traceback (most recent call last):
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/scripts/create_crossbar.py", line 12, in <module>
    uniprot_data.uniprot_data_download(cache=True)
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/bccb/protein.py", line 80, in uniprot_data_download
    self.data[query_key] = uniprot.uniprot_data(
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py", line 528, in uniprot_data
    return dict(
ValueError: dictionary update sequence element #126959 has length 3; 2 is required

Since this happens in the except part of the try-except block (line 80) that I already complained about in #1, I am not sure what the actual issue is. ;)

Do you get this error as well? Can you try a new installation of the project and see if it works out-of-the-box for you?
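
For context, this ValueError is what dict() raises when one element of the input sequence has three items instead of a key and a value. The sketch below reproduces that with made-up tab-separated lines; whether an extra tab in the UniProt response is what triggered the pypath failure is an assumption.

```python
# Made-up example lines; the second one contains an extra tab.
lines = ["P12345\tGENE1", "Q67890\tGENE2\textra"]

# Splitting on every tab yields a 3-element row for the second line,
# which dict() cannot interpret as a (key, value) pair.
try:
    dict(line.split("\t") for line in lines)
except ValueError as err:
    message = str(err)

# Limiting the split to the first tab keeps every row at two elements.
safe = dict(line.split("\t", 1) for line in lines)
```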

Break up PPI adapter into individual resources

It should be relatively easy to modularise the existing PPI adapter into the three input resources, which is more aligned with the BioCypher design philosophy.

Try-except block not used correctly

https://github.com/HUBioDataLab/CROssBAR-BioCypher-Migration/blob/559591ec6f45761cd7882249f64a68a82a1b34a2/bccb/protein.py#L99

The exception is not propagated. I would use the retry functionality of pypath.curl rather than a try-except block to attempt the download twice. The default number of retries is 3, and it should be on by default, right?
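
As a generic illustration of retrying without swallowing the error, here is a minimal sketch of a retry wrapper that re-raises the last exception. `with_retries` and `flaky_download` are hypothetical and do not reflect pypath.curl's own retry mechanism.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], retries: int = 3, delay: float = 0.0) -> T:
    """Call func up to `retries` times and re-raise the last error if all attempts fail."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # propagate instead of swallowing the failure
            time.sleep(delay)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky_download() -> str:
    """Fail on the first call and succeed on the second (stand-in for a download)."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary failure")
    return "data"


result = with_retries(flaky_download)
```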

Provide better user interface for adapter

I started by introducing an Enum class for UniProt fields (2c6f5b9), which can be used to dynamically set the fields to download and pass to BioCypher in the crossbar build script. However, I am not sure whether Enums are the best way from a user-friendliness perspective. The aim is to abstract the process of using any given adapter for someone who wants to build a database, particularly when the user does not know the specifics of the primary source's API. Enums offer autocomplete, so at least "knowing about adapter contents" is facilitated a bit.

We can think about it and discuss alternatives.
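
To make the discussion concrete, here is a minimal sketch of Enum-based field selection. The member names, values, and the `resolve_fields` helper are assumptions for illustration and do not mirror the adapter in the referenced commit.

```python
from enum import Enum
from typing import Iterable, List, Optional


class UniprotField(Enum):
    # Hypothetical members; values mimic field names of the primary source.
    LENGTH = "length"
    MASS = "mass"
    ORGANISM = "organism"


def resolve_fields(fields: Optional[Iterable[UniprotField]] = None) -> List[str]:
    """Return the source field names to download; default to every field."""
    selected = list(UniprotField) if fields is None else list(fields)
    for field in selected:
        if not isinstance(field, UniprotField):
            raise TypeError(f"not a UniprotField: {field!r}")
    return [field.value for field in selected]


all_fields = resolve_fields()
subset = resolve_fields([UniprotField.LENGTH, UniprotField.MASS])
```

Defaulting to all members lets a user who knows nothing about the source API get a complete download, while autocomplete on the Enum exposes the available options.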

PPI adapter does not work if not all fields are selected

If you select fewer than all edge fields (e.g. only some of the IntAct edge fields), the adapter fails because the column name re-assignment (e.g. https://github.com/HUBioDataLab/CROssBAR-BioCypher-Migration/blob/b7062ec2f55da787888f1ef8e4b46a229217f46a/bccb/ppi_adapter.py#L158) is not dynamic (column names are hardcoded).
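
One possible direction is to derive the renaming from the selected fields instead of a fixed list. The sketch below uses plain dictionaries rather than the adapter's data frames; the enum members and the `RENAMES` mapping are hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, List


class IntactEdgeField(Enum):
    # Hypothetical members: each value is a column name in the raw data.
    SOURCE = "source_db"
    PUBMED_IDS = "pubmeds"
    METHOD = "methods"


# Hypothetical mapping from raw column name to the name used downstream.
RENAMES: Dict[str, str] = {
    "source_db": "source",
    "pubmeds": "pubmed_ids",
    "methods": "method",
}


def rename_selected(rows: List[dict], fields: Iterable[IntactEdgeField]) -> List[dict]:
    """Keep only the selected columns and rename them via the mapping."""
    wanted = [field.value for field in fields]
    return [{RENAMES[col]: row[col] for col in wanted} for row in rows]


rows = [{"source_db": "IntAct", "pubmeds": "123", "methods": "two hybrid"}]
renamed = rename_selected(rows, [IntactEdgeField.SOURCE, IntactEdgeField.METHOD])
```

Because the output columns are computed from the selection, choosing a subset of fields no longer conflicts with a fixed-length list of names.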

Ontology error

Hi @slobentanzer,

After adding InterPro data to create_crossbar.py, I ran into the following issue:

Traceback (most recent call last):
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\create_crossbar.py", line 150, in <module>
    main()
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\create_crossbar.py", line 133, in main
    bc.write_nodes(uniprot_adapter.get_nodes())
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_core.py", line 239, in write_nodes
    self._get_writer()
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_core.py", line 204, in _get_writer
    ontology=self._get_ontology(),
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_core.py", line 186, in _get_ontology
    self._ontology = Ontology(
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_ontology.py", line 283, in __init__
    self._main()
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_ontology.py", line 301, in _main
    self._extend_ontology()
  File "d:\crossbar\CROssBAR-BioCypher-Migration\scripts\biocypher\_ontology.py", line 421, in _extend_ontology
    raise ValueError(
ValueError: Node protein to protein domain association not found in ontology, but also has no inheritance definition. Please check your schema for spelling errors or a missing `is_a` definition.

I didn't encounter this problem while creating CROssBAR with my earlier version (I am now using the latest version of BioCypher). How can I solve this problem? Should I add is_a to the related association and, if so, what should its value be?
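
The rule stated in the error message (an entry must either exist in the ontology or declare an `is_a` parent) can be sketched as a small check. The schema dictionary, the set of ontology classes, and the choice of "association" as parent are assumptions for illustration; this is not BioCypher's implementation, and the appropriate parent depends on the ontology in use.

```python
from typing import Dict, List, Set


def entries_missing_parent(schema: Dict[str, dict], ontology_classes: Set[str]) -> List[str]:
    """List schema entries that are neither in the ontology nor given an is_a parent."""
    unresolved = []
    for name, entry in schema.items():
        if name not in ontology_classes and "is_a" not in entry:
            unresolved.append(name)
    return unresolved


# Hypothetical ontology classes and schema entries.
ontology_classes = {"protein", "association"}
schema = {
    "protein": {"represented_as": "node"},
    "protein to protein domain association": {"represented_as": "edge"},
}

before = entries_missing_parent(schema, ontology_classes)

# Declaring a parent that exists in the ontology resolves the entry.
schema["protein to protein domain association"]["is_a"] = "association"
after = entries_missing_parent(schema, ontology_classes)
```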

BioGRID adapter not working

Hi,

I have just attempted to use the BioGRID adapter to download data from BioGRID and ran into the following error (see below). I checked with one of BioCypher's developers (@slobentanzer), and the problem seems to be on the adapter's side. Please let me know when this can be fixed, thanks.

from biogrid_adapter import BioGRID
yeast_biogrid = BioGRID(organism=9606, export_csvs=True, output_dir='./', test_mode=True)
yeast_biogrid.download_biogrid_data()

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 yeast_biogrid.download_biogrid_data()

File /biocypher/biogrid_adapter.py:119, in BioGRID.download_biogrid_data(self)
116 self.biogrid_ints = biogrid.biogrid_all_interactions(self.organism, 9999999999, False)
118 # download these fields for mapping from gene symbol to uniprot id
--> 119 self.uniprot_to_gene = uniprot.uniprot_data("genes", "", True)
120 self.uniprot_to_tax = uniprot.uniprot_data("organism-id", "", True)
123 if self.test_mode:

File /biocypher/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py:566, in uniprot_data(field, organism, reviewed)
563 get['query'] = rev.strip(' AND ')
565 c = curl.Curl(url, get = get, silent = False, large = True, compr = 'gz')
--> 566 _ = next(c.result)
569 _id, variables = zip((
570 line.strip('\n\r').split('\t')
571 for line in c.result if line.strip('\n\r')
572 ))
574 result = dict(
575 (
576 f,
(...)
579 for f, v in zip(field, variables)
580 )

TypeError: 'NoneType' object is not an iterator