Standards, benchmarks and bias checks for human-virus interaction datasets and models

License: MIT License

Human-Virus Interactions (HVI) Toolkit

Tools to work with human-virus interactions (HVI) and to evaluate biases associated with models and datasets.

This repository helps you work with human-virus interaction datasets (or, more generally, protein-protein interaction datasets).

You might have encountered one of the following problems yourself:

  • There are many different human-virus interaction datasets, but they use different data formats.
  • Interactions are frequently duplicated within and across these datasets.
  • A predictor may perform well during training and validation but generalize poorly to unseen data.

To ease the work with different datasets, this repository provides Python implementations for many existing HVI datasets. They can all be converted to a standardized dataset by calling a single method, which makes them comparable and easy to analyze and merge (HVI Dataset Standards). In addition, the standardized format ensures that all contained interactions are unique. The standardized datasets can then be used for subsequent redundancy reduction and dataset splitting.
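The uniqueness guarantee can be illustrated with a small, self-contained sketch. Note that this is not the hvi_toolkit API — the class and function names below are made up for illustration only:

```python
# Illustrative sketch (not the hvi_toolkit API): deduplicating
# interactions by their UniProt-ID pair, as a standardized format
# might guarantee.
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    uniprot_human: str
    uniprot_virus: str


def standardize(interactions):
    """Drop duplicate human-virus pairs while keeping input order."""
    seen = set()
    unique = []
    for interaction in interactions:
        key = (interaction.uniprot_human, interaction.uniprot_virus)
        if key not in seen:
            seen.add(key)
            unique.append(interaction)
    return unique


raw = [
    Interaction("P04637", "Q2HRC7"),
    Interaction("P04637", "Q2HRC7"),  # duplicate from a second source
    Interaction("O15350", "Q2HRC7"),
]
assert len(standardize(raw)) == 2
```

Merging several sources then reduces to concatenating their interaction lists and running them through the same deduplication step.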

A large number of unique interactions is not sufficient on its own: the datasets must also be checked for biases. The hvi_toolkit contains automatic checks that evaluate your datasets for various known biases (Dataset Checking).

To train an interaction model, the interaction_mode of Biotrainer can be employed. Using this tool for model training ensures a standardized format for your model, which makes it easy to apply the model checks provided by the hvi_toolkit (Model checking). High-quality benchmarks are used to assess the model's capabilities on unseen data, and adversarially generated input is used to shed light on biases the model might have picked up during training (work in progress). Models not trained via Biotrainer can also be evaluated by implementing a simple interface.

HVI Dataset Standards

In order to create the best possible dataset for interaction prediction, many different data sources have to be combined. The dataset_base_classes package contains base classes that can easily be extended to add new interaction datasets. For example, if you have interactions stored in PSI-MI TAB 2.5 format, you can create a new interaction set like this:

from hvi_toolkit.dataset_base_classes import DatasetMITAB25


class DatasetNewMITAB(DatasetMITAB25):
    """
    Description of the new dataset
    """

    delimiter = "\t"
    name = "your_dataset"
    header = None

The most important class is DatasetHVIStandardized, contained in dataset_base_classes/hvi_standardized_dataset.py. All interaction datasets must provide a to_standardized_dataset method in order to convert them to the standardized format. This format makes it easy to combine and analyze the data sources.
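Conceptually, the conversion pattern looks like the following sketch. Only the method name to_standardized_dataset is taken from the text above; all other class names and signatures are illustrative assumptions, not the real hvi_toolkit classes:

```python
# Sketch of the "every dataset converts to one standardized format"
# pattern; names other than to_standardized_dataset are assumptions.
from abc import ABC, abstractmethod


class StandardizedDataset:
    """Holds unique (human, virus) UniProt-ID pairs."""

    def __init__(self, interactions):
        # Keep each pair only once, in a deterministic order
        self.interactions = sorted(set(interactions))

    def merge(self, other):
        """Merging re-applies the uniqueness guarantee."""
        return StandardizedDataset(self.interactions + other.interactions)


class DatasetBase(ABC):
    @abstractmethod
    def to_standardized_dataset(self) -> StandardizedDataset:
        ...


class TabSeparatedDataset(DatasetBase):
    """Toy source format: one tab-separated interaction per line."""

    def __init__(self, rows):
        self.rows = rows

    def to_standardized_dataset(self):
        pairs = [tuple(row.split("\t")[:2]) for row in self.rows]
        return StandardizedDataset(pairs)


a = TabSeparatedDataset(["P04637\tQ2HRC7"]).to_standardized_dataset()
b = TabSeparatedDataset(["O15350\tQ2HRC7"]).to_standardized_dataset()
assert len(a.merge(b).interactions) == 2
```

Because every source format funnels into the same standardized class, merging N heterogeneous datasets needs no format-specific code beyond each dataset's converter.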

Already implemented interaction datasets can be found in the interaction_datasets package; the raw data associated with them is provided in raw_data. The notebook hvi_datasets/merged_dataset/merge_datasets.ipynb shows how to merge all the provided datasets into a single, large dataset.

The DatasetHVIStandardized class can also be used for redundancy reduction on sequences and embeddings. Before configuring and running the hvi_datasets/redundancy_reduction/redundancy_reduction.py script, you need to have mmseqs2 installed in a separate conda environment:

conda create --name mmseqs2
conda activate mmseqs2
conda install -c conda-forge -c bioconda mmseqs2

After the redundancy reduction, the SplitsGenerator class contained in hvi_datasets/dataset_splitting/splits_generators.py can be used to create a Biotrainer-compatible interaction file.
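The idea behind such a split can be sketched with the standard library alone. The real SplitsGenerator API may differ; the function below is a hedged stand-in, not the hvi_toolkit implementation:

```python
# Illustrative, deterministic train/val/test split over interaction IDs
# (not the SplitsGenerator API).
import random


def split_interactions(ids, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically, then slice into three disjoint splits."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train_ids, val_ids, test_ids = split_interactions(range(100))
assert len(train_ids) == 80 and len(val_ids) == 10 and len(test_ids) == 10
assert not set(train_ids) & set(test_ids)
```

Fixing the seed makes the split reproducible, which matters when the resulting file is consumed by a downstream training run.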

Dataset checking

Protein-protein and human-virus interaction datasets are prone to several forms of bias. The DatasetEvaluator class from bias_eval.evaluators provides automatic tests that check your datasets for these biases.

It provides the following checks:

  • Dataset bias: Do proteins involved in interactions occur with the same frequency in the positive and the negative dataset?
  • Bias prediction baselines (using the protein frequencies in the dataset):
    • for the whole dataset
    • for provided test sets
  • Viral families: Are they uniformly distributed across the interactions?
  • Sequence length: Does it differ significantly between positive and negative interactions?

An example notebook, check_dataset.ipynb, is provided in the examples directory.
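To make the first check concrete, here is a minimal frequency-bias sketch. It is not the DatasetEvaluator implementation, just the underlying idea: if a protein occurs almost exclusively in the positive set, a model can "predict" interactions from protein identity alone, without learning anything about binding:

```python
# Illustrative dataset-bias check (not the DatasetEvaluator API):
# per protein, the share of its occurrences that are in the positive set.
from collections import Counter


def protein_frequency_bias(positives, negatives):
    """Map each protein ID to its fraction of positive occurrences."""
    pos_counts = Counter(p for pair in positives for p in pair)
    neg_counts = Counter(p for pair in negatives for p in pair)
    proteins = set(pos_counts) | set(neg_counts)
    return {p: pos_counts[p] / (pos_counts[p] + neg_counts[p])
            for p in proteins}


positives = [("P04637", "V1"), ("P04637", "V2"), ("P04637", "V3")]
negatives = [("O15350", "V1"), ("O15350", "V2")]
bias = protein_frequency_bias(positives, negatives)
assert bias["P04637"] == 1.0  # only ever positive: a strong bias signal
assert bias["O15350"] == 0.0  # only ever negative
```

An unbiased dataset would show fractions close to 0.5 for most proteins; values near 0 or 1 indicate that protein identity leaks the label.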

Model checking

Models trained on biased datasets will ultimately incorporate these biases. The ModelEvaluator class provides automatic tests to check how well your model performs. For this, benchmark datasets have to be provided. At the moment, experimentally verified interactions for Rabies Lyssavirus and the Negatome 2.0 dataset are provided as examples in the examples/check_model.ipynb notebook.
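Evaluating an arbitrary model against such a benchmark can be sketched as follows. All names here are illustrative assumptions, not the ModelEvaluator API; the point is that any model reachable through a small predict interface can be scored on held-out benchmark pairs:

```python
# Sketch of benchmarking a model through a minimal predict interface
# (names are assumptions, not the hvi_toolkit API).
from abc import ABC, abstractmethod


class InteractionPredictor(ABC):
    @abstractmethod
    def predict(self, human_id: str, virus_id: str) -> bool:
        """Return True if an interaction is predicted for the pair."""


def evaluate(model, benchmark):
    """Accuracy over (human_id, virus_id, label) benchmark triples."""
    correct = sum(model.predict(h, v) == label for h, v, label in benchmark)
    return correct / len(benchmark)


class AlwaysPositive(InteractionPredictor):
    """Trivial baseline: claims every pair interacts."""

    def predict(self, human_id, virus_id):
        return True


benchmark = [("P04637", "V1", True), ("O15350", "V2", False)]
assert evaluate(AlwaysPositive(), benchmark) == 0.5
```

Wrapping a real model only requires implementing predict, which is how a model not trained via Biotrainer could still be plugged into the same evaluation.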

The model directory contains an already trained model: the out.yml file from Biotrainer and the checkpoints for all splits are provided in model/biotrainer_output. It is used in the examples to show how the model checking works.

The model was trained with the biotrainer_config.yml configuration file, which can be run via:

# Install biotrainer
poetry add git+https://github.com/SebieF/biotrainer.git

# Run configuration file
cd lyssa_predict/model
biotrainer biotrainer_config.yml  # Biotrainer version >0.3.0 required

Rabies Lyssavirus benchmark

For the hvi_toolkit, a high-quality dataset of experimentally verified, positive interactions for Rabies Lyssavirus was compiled. It includes 53 experimentally verified interactions extracted from Zandi et al. 2021 (https://doi.org/10.52547%2Fibj.25.4.226). It can be found under hvi_datasets/raw_data/rabies_lyssavirus_zandi. The README.md file located in the same directory contains additional information on how the dataset was compiled.
