seffnet / seffnet

Network representation learning on drug-target-side effects-indication graphs for side effect prediction

Home Page: https://seffnet.readthedocs.io

License: MIT License

Python 11.61% Jupyter Notebook 88.04% HTML 0.32% Dockerfile 0.03%

machine-learning network-representation-learning side-effects

seffnet's Introduction

SEffNet

SEffNet (Side Effect Network embeddings) is a tool that optimizes, trains, and evaluates predictive models for biomedical networks containing drug, target, and side effect information. It applies several network representation learning methods in an attempt to understand the causes of side effects.

This package was developed during the master's thesis of Rana Aldisi.

Structure

notebooks: Notebooks used for training and evaluating models and for interpreting the prediction model
resources: The graphs and materials that are used for training and testing

Installation

seffnet can be installed on Python 3.7+ from the latest code on GitHub with:

$ pip install git+https://github.com/seffnet/seffnet.git

Usage

Using the predictive model

If you've installed seffnet locally, you can use the default model from the GitHub repository with:

from seffnet.default_predictor import predictor

# Find new relations for a given entity based on its CURIE
results = predictor.find_new_relations(curie='pubchem.compound:5095')
...

You can get the embeddings for phenotype entities with:

from seffnet.default_predictor import predictor

phenotype_to_embedding = {
    node_data['identifier']: predictor.embeddings[node_id]
    for node_id, node_data in predictor.node_id_to_info.items()
    if node_data['namespace'] == 'umls'
}
# could use sklearn.metrics.pairwise.cosine_similarity on the values in this dict
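
As a minimal sketch of the similarity calculation hinted at in the comment above, the following computes cosine similarity between two embedding vectors in plain Python. The toy dictionary and its UMLS-style keys are made up for illustration; in practice the vectors would come from phenotype_to_embedding.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical stand-in for phenotype_to_embedding (identifiers are invented)
toy_embeddings = {
    "C0000001": [1.0, 0.0, 1.0],
    "C0000002": [1.0, 0.0, 1.0],
    "C0000003": [0.0, 1.0, 0.0],
}

same = cosine_similarity(toy_embeddings["C0000001"], toy_embeddings["C0000002"])
orthogonal = cosine_similarity(toy_embeddings["C0000001"], toy_embeddings["C0000003"])
```

Identical vectors score close to 1 and orthogonal vectors score 0, so ranking phenotypes by this value gives a rough notion of which ones sit near each other in the embedding space.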

You can use the default model in the CLI:

$ seffnet predict pubchem.compound:5095

You can predict on new chemicals via their SMILES strings based on their similarity to chemicals included in the network. Warning: we haven't benchmarked how well this actually works yet.

$ seffnet predictc "C1=CC=C(C=C1)C2=CC=C(C=C2)CCO"

Rebuilding the resources

You can rebuild all the graphs and maps created for this project by running the following:

$ seffnet rebuild

Note that this command requires the RDKit package to be installed in your environment.

Model training and evaluation

You can train an NRL model using the following:

$ seffnet train --input-path ./resources/basic_graphs/fullgraph_with_chemsim.edgelist --evaluation --method node2vec

For further CLI options and parameters, use --help or -h.

Optimizing hyperparameters

Network representation learning models can be optimized with:

$ seffnet optimize --input-path ./resources/basic_graphs/fullgraph_with_chemsim.edgelist --method node2vec

For further CLI options and parameters, use --help or -h.

Web Application

The web application allows users to get results from the model programmatically. Make sure the extra dependencies have been installed as well using the [web] extra. Unfortunately, this doesn't work when installing directly from GitHub, so see the setup.cfg for the Flask dependencies.

$ pip install -e .[web]

Run development server with:

$ seffnet web --host localhost --port 5000

Run through docker with:

$ docker-compose up

A user interface can be found at http://localhost:5000
An auto-generated swagger UI can be found at http://localhost:5000/apidocs

As an example, you can check the chemicals predicted to interact with HDAC6 at http://localhost:5000/predict/uniprot:Q9UBN7?results_type=chemical.

seffnet's Issues

Check if entity does not exist

https://github.com/AldisiRana/SE_KGE/blob/4ad1f07d265710b7c30a28166b8867ed63c2caf2/src/se_kge/find_relations.py#L34-L39

Why are so many edges weighted with 0.0?

After looking at https://github.com/seffnet/seffnet/blob/master/resources/basic_graphs/weighted_training_set.edgelist, I saw that many of the edge weights are 0.0.

First, it might be good to generate some charts along with these edgelist files that show the distributions of weights by edge type. Then we need to figure out how this affects training.
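
A first look at the distribution does not need a chart. The sketch below counts weights per bin and tallies exact zeros. It assumes each line is whitespace-separated as "source target weight" with weights in [0, 1], which may not match the real file, and it ignores edge type, since that would need a node-to-type lookup not shown here.

```python
from collections import Counter


def summarize_weights(lines, n_bins=5):
    """Count edge weights per equal-width bin over [0, 1], plus exact zeros."""
    bins = Counter()
    zeros = 0
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or unweighted lines
        weight = float(parts[2])
        total += 1
        if weight == 0.0:
            zeros += 1
        # A weight of exactly 1.0 falls into the last bin
        index = min(int(weight * n_bins), n_bins - 1)
        bins[index] += 1
    return {"total": total, "zeros": zeros, "bins": dict(bins)}


# Toy lines standing in for the real edgelist
toy_lines = ["1 2 0.0", "1 3 0.0", "2 3 0.45", "3 4 1.0", ""]
summary = summarize_weights(toy_lines)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly.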

Validation of chemical embedding imputation

For each chemical, impute its embedding (after throwing away the 1.0 similarity to itself).

Scenario 1: Global Losses

Calculate a loss of that embedding against all other embeddings. This would allow for the goodness of different imputation procedures to be compared, but isn't so easy to assess on its own.

Scenario 2: Standard Evaluation

How good is the imputed embedding at predicting the real edges that were already in the network? Report true positive rate and false negative rate.

Negative sampling could also be used to allow reporting of MCC, ROC-AUC, PR-AUC, and other metrics.
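
To make the metrics in Scenario 2 concrete, here is a minimal sketch that computes the true positive rate, false negative rate, and MCC from binary labels. The label lists are toy data; in a real evaluation the negatives would come from negative sampling and the predictions from the imputed embeddings.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates_and_mcc(y_true, y_pred):
    """Return (true positive rate, false negative rate, MCC)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    positives = tp + fn
    tpr = tp / positives if positives else 0.0
    fnr = fn / positives if positives else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return tpr, fnr, mcc


# Toy labels: four real edges followed by four sampled non-edges
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
tpr, fnr, mcc = rates_and_mcc(y_true, y_pred)
```

Without negatives only the true positive and false negative rates are defined, which is why MCC and the AUC-style metrics depend on negative sampling.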

Update imports

I'm getting two warnings from sklearn that we should prepare to fix:

/Users/cthoyt/.virtualenvs/seffnet/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.linear_model.logistic module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.linear_model. Anything that cannot be imported from sklearn.linear_model is now part of the private API.
  warnings.warn(message, FutureWarning)
/Users/cthoyt/.virtualenvs/seffnet/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.21.3 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.

Add description to the repo

Generate subgraphs between entities

Implement imputation of chemical embeddings

Use element-wise average of all other chemicals weighted by similarity
Do a "validation" where for each chemical you calculate the embedding then use a loss function to see how far away it is from the real one
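
The two points above can be sketched as follows. This is only an illustration under assumed data shapes: similarities and embeddings are plain dictionaries keyed by chemical, the function names are hypothetical, and squared error stands in for whichever loss is eventually chosen.

```python
def impute_embedding(similarities, embeddings):
    """Similarity-weighted element-wise average of the known embeddings."""
    dim = len(next(iter(embeddings.values())))
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for chemical, weight in similarities.items():
        if chemical not in embeddings or weight <= 0:
            continue  # ignore unknown chemicals and non-positive similarities
        total_weight += weight
        for i, value in enumerate(embeddings[chemical]):
            weighted_sum[i] += weight * value
    if total_weight == 0:
        raise ValueError("no similar chemicals with known embeddings")
    return [value / total_weight for value in weighted_sum]


def squared_error(u, v):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy data: the held-out chemical is excluded from both dictionaries
embeddings = {"chem:A": [1.0, 0.0], "chem:B": [0.0, 1.0]}
similarities = {"chem:A": 0.75, "chem:B": 0.25}
imputed = impute_embedding(similarities, embeddings)
loss = squared_error(imputed, [1.0, 0.0])
```

Repeating this for every chemical in turn, each time leaving its own embedding out, gives one loss per chemical that can be averaged to compare imputation procedures.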

Retrospective validation on withdrawn drugs

Fix pubchem namespace

Everywhere you're mentioning pubchem you should really be using pubchem.compound

Lower case package name

Update requirements (install fail on dateutil)

Got an error during installation

(seffnet) sichom@sichom-Precision-T7600:~/projects/seffnet$ seffnet web
Traceback (most recent call last):
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (python-dateutil 2.8.1 (/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages), Requirement.parse('python-dateutil<2.8.1,>=2.1; python_version >= "2.7"'), {'botocore'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sichom/software/miniconda/envs/seffnet/bin/seffnet", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3251, in <module>
    @_call_aside
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3235, in _call_aside
    f(*args, **kwargs)
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3264, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 585, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 598, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (python-dateutil 2.8.1 (/home/sichom/software/miniconda/envs/seffnet/lib/python3.7/site-packages), Requirement.parse('python-dateutil<2.8.1,>=2.1; python_version >= "2.7"'), {'botocore'})

Refactor CLI

Click options that are reused can be saved as variables at the top of the file
No business logic in the CLI functions! Only handling user/file input/output. That means that there should be a wrapper function that takes care of all of the if statements living somewhere else. This is the case for optimize and train

implement extra set to avoid overoptimization

add an extra set for evaluating the optimized hyperparameters to avoid overoptimization
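
One way to read this is a three-way split: train on one part, tune hyperparameters on a second, and report final numbers on a third that the optimizer never sees. The sketch below shows this on a toy edge list. The function name and fractions are assumptions, and a real split for link prediction would also need to keep the training graph connected, which this does not attempt.

```python
import random


def split_edges(edges, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle edges and split them into train, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Toy edge list of (source, target) pairs
toy_edges = [(i, i + 1) for i in range(100)]
train, valid, test = split_edges(toy_edges)
```

Hyperparameters would be chosen by their score on valid, and only the winning configuration would be scored once on test.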

Online embedding of new chemicals

What happens when we want to make a prediction about a new chemical that isn't already in the graph?

Given a PubChem compound identifier, look up the SMILES string. Maybe also consider other identifier to SMILES resolvers, or allow direct input of SMILES
Convert that SMILES string to a descriptor vector, same as how the graph was created.
Calculate similarities to all chemicals in the graph using the same cutoff as before
Do some random walks for this new chemical
Stick those random walks in the Word2Vec model and generate the embedding

I think this means we need a more robust way of storing the fingerprints of all of the chemicals in the given graph, and also to store the word2vec model in addition to all of the other things
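
The middle steps can be sketched on a toy graph. The code below connects a new chemical to existing ones above a similarity cutoff and generates uniform random walks from it. Uniform walks are a simplification of node2vec's biased walks, all names and values are hypothetical, and the final step of feeding the walks to the stored Word2Vec model is left out.

```python
import random


def add_chemical(graph, new_node, similarities, cutoff):
    """Connect new_node to each existing node whose similarity meets the cutoff."""
    graph[new_node] = set()
    for node, similarity in similarities.items():
        if node != new_node and node in graph and similarity >= cutoff:
            graph[new_node].add(node)
            graph[node].add(new_node)


def random_walks(graph, start, n_walks, walk_length, rng):
    """Generate uniform random walks of up to walk_length nodes from start."""
    walks = []
    for _ in range(n_walks):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = sorted(graph[walk[-1]])
            if not neighbors:
                break  # dead end: the new chemical has no similar neighbors
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


# Toy undirected graph stored as an adjacency dictionary of sets
graph = {
    "chem:A": {"chem:B"},
    "chem:B": {"chem:A", "chem:C"},
    "chem:C": {"chem:B"},
}
similarities = {"chem:A": 0.9, "chem:B": 0.2, "chem:C": 0.7}
add_chemical(graph, "chem:new", similarities, cutoff=0.5)
walks = random_walks(graph, "chem:new", n_walks=3, walk_length=4, rng=random.Random(0))
```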

Deal with duplicate pubchem compounds

example:
https://pubchem.ncbi.nlm.nih.gov/compound/2462
https://pubchem.ncbi.nlm.nih.gov/compound/5281004
https://pubchem.ncbi.nlm.nih.gov/compound/Budesonide

These are three different PubChem IDs for the same compound.
SIDER graph and DrugBank graph have the same compound with different pubchem IDs, so when merging the two graphs, the compound is saved as two different nodes

Add documentation to front page about calculating similarity between two entities by UMLS

Motivated by https://twitter.com/rguha/status/1224730259947474950

Use CURIEs instead of node identifiers

You spent a lot of effort to assign numeric identifiers to each node (which we then treat as strings)... would it be possible to skip all of that and just use the CURIEs directly? Or somewhere do these numbers correspond to the indexes in some matrices?

Add explanation of the hits@k and mean value evaluation metrics to "Train and Evaluate Models.ipynb"

Make sure that someone reading this notebook can follow along, especially if they're not familiar with this type of machine learning.

@mali-git would probably be happy to send you a couple references that explain it really well. Make sure you include these citation(s) in the notebook, too.

Add more information to entities when returned to user

For example, is a phenotype an indication or a side effect?

Use Europe PMC for entity co-occurrence lookup

Given a chemical (with pubchem.compound identifier) and a phenotype (with UMLS identifier), query the Europe PMC API for articles that mention both of the entities. Put this in its own module like seffnet.literature or something
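
A rough sketch of the query construction is shown below. It only builds a URL for the Europe PMC REST search endpoint and makes no network call. It assumes the two identifiers have already been resolved to names, since a free-text query is the simplest form to illustrate; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_cooccurrence_url(chemical_name, phenotype_name, page_size=25):
    """Build a search URL for articles mentioning both names."""
    query = f'"{chemical_name}" AND "{phenotype_name}"'
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


# Example names chosen only for illustration
url = build_cooccurrence_url("budesonide", "headache")
```

Fetching the URL and reading the hit count from the JSON response would then give a simple co-occurrence score for a chemical-phenotype pair.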

See the information they sent me via Twitter: https://twitter.com/cthoytp/status/1171540212746457090

Rebrand as SEffNet

but the python package should be seffnet

Setup tox.ini and Travis CI

Add weight to edges from SIDER

Update Bio2BEL package to propagate through frequency information
Make a mapping function from those frequencies to continuous real number space.
Make new export of SIDER edge list including weighting (this will need a new format, right?)

We'll probably have to do some iteration through several different functions to see which works best
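
As a starting point for that iteration, one candidate mapping could look like the sketch below. The label weights are placeholders chosen only for illustration, not values from SIDER or this project, and the assumption that frequencies arrive either as percentage strings or as short labels would need checking against the actual export.

```python
# Placeholder weights, invented for illustration only
LABEL_TO_WEIGHT = {"rare": 0.001, "infrequent": 0.01, "frequent": 0.1}


def frequency_to_weight(value, default=None):
    """Map a frequency string to a weight in [0, 1], or default if unknown."""
    text = value.strip().lower()
    if text.endswith("%"):
        # Percentage strings such as "5%" become fractions
        return float(text[:-1]) / 100.0
    return LABEL_TO_WEIGHT.get(text, default)


percent_weight = frequency_to_weight("5%")
label_weight = frequency_to_weight("Frequent")
unknown_weight = frequency_to_weight("not reported")
```

Swapping in a different dictionary or a different transform is then a one-line change, which makes it easier to compare several mapping functions.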

Bug in setup requirements

Line 39 of setup.cfg

python-dateutils < 2.8.1

should be

python-dateutil < 2.8.1

Make super cool web thing

Make Dockerfile

Add autocompletion to web application

Not clear what to type. It should be clearer that you need to type "pubchem.compound:X". Maybe implement auto completion?

Add weight to edges from DrugBank

Check if DrugBank sucks in nominal IC50 or Ki values. If so, we're done. Otherwise:

Map drugs to ChEMBL identifiers
Map drugs' targets to ChEMBL identifiers
Look up pChEMBL values for drug-target interactions using the ChEMBL python package
Generate new edge list using that as the weighting
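
If only raw IC50 or Ki values are available, a pChEMBL-style value can be derived directly, since pChEMBL is the negative base-10 logarithm of the molar activity value. The sketch below assumes the input is in nanomolar; the function name is hypothetical.

```python
import math


def nanomolar_to_p_value(value_nm):
    """Convert an activity in nM to -log10 of its molar concentration."""
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    # 1 nM is 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


strong = nanomolar_to_p_value(1.0)
weak = nanomolar_to_p_value(1000.0)
```

Higher values mean tighter binding, so the result could be used as an edge weight directly or after rescaling.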

How consistent are the random seeds?

Do we need to seed with both numpy and python? Why are the seeds being set over and over and over?

Make an option to output the prediction to a file

https://github.com/AldisiRana/SE_KGE/blob/1060f4fabb6195b00d3db046b9296f3125326060/src/se_kge/find_relations.py#L197-L221

Seed shouldn't be set during optimization

I'm worried the evaluation won't be fair if they're all doing prediction using the same random seed. I think what you should do is generate a random seed (between 1 and a billion, or whatever) yourself and store that as a user attribute for each run to promote reproducibility.
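
A minimal sketch of that idea: draw a fresh seed per run, record it next to the score, and reuse it only when replaying a run. The plain dictionary here stands in for the per-run user attribute, and toy_objective is a hypothetical placeholder for training and evaluating a model.

```python
import random


def new_seed():
    """Draw a seed between 1 and one billion from the OS entropy source."""
    return random.SystemRandom().randint(1, 1_000_000_000)


def run_trial(train_and_score, seed=None):
    """Run one trial with its own seed and return the seed with the score."""
    if seed is None:
        seed = new_seed()
    rng = random.Random(seed)
    score = train_and_score(rng)
    return {"seed": seed, "score": score}


def toy_objective(rng):
    """Placeholder for model training; its result depends only on rng."""
    return sum(rng.random() for _ in range(10))


first = run_trial(toy_objective)
replay = run_trial(toy_objective, seed=first["seed"])
```

Each run sees different randomness, yet any single run can be reproduced exactly from its stored seed.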

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L17

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L48

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L83

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L112

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L139

https://github.com/AldisiRana/SE_KGE/blob/6e7eca5474bae3e7997616c856e13b87e560174e/src/se_kge/optimization.py#L168