
fs-tox's Introduction

FS-Tox: A Small Molecule Toxicity Benchmark


🔎 Overview

We are building FS-Tox: a toxicity benchmark for small molecule toxicology assays. Toxicity prediction tasks differ from traditional machine learning tasks in that there are usually only a small number of training examples per toxicity assay. Here, we provide a few-shot learning dataset built from several publicly available toxicity datasets (e.g. EPA's ToxRefDB), along with an associated benchmarking pipeline. We will incorporate the individual assays from these datasets, each pairing the molecular representation of a small molecule with a binary marker of whether the compound was toxic in that assay.

πŸ—ΊοΈ Roadmap

Mid-May 2023 - benchmark SOTA models

Test the performance of the following state-of-the-art few-shot prediction methods on an existing toxicity benchmark:

  • [x] Gradient-boosted decision trees (XGBoost)
  • [x] text-embedding-ada-002 on SMILES (OpenAI)
  • [ ] Galactica 125M (Hugging Face)
  • [ ] Galactica 1.3B (Hugging Face)
  • [x] ChemGPT 19M (Hugging Face)
  • [ ] ChemGPT 1.2B (Hugging Face)
  • [ ] Uni-Mol (Docker)
  • [ ] Uni-Mol+ (Docker)
  • [ ] MoLeR (Microsoft)

Late-May 2023 - create FS-Tox benchmarking tool

Incorporate the following datasets containing results from in vivo toxicity assays:

  • [ ] ToxRefDB (subacute and chronic toxicity)
  • [ ] TDCommons, Zhu 2009 (acute toxicity)
  • [ ] MEIC (small, curated clinical toxicity)

Early-June 2023 - benchmark SOTA small molecule language models on FS-Tox

Test the following language models on the FS-Tox benchmark:

  • [ ] text-embedding-ada-002 on SMILES (OpenAI)
  • [ ] Galactica 125M (Hugging Face)
  • [ ] Galactica 1.3B (Hugging Face)
  • [ ] ChemGPT 19M (Hugging Face)
  • [ ] ChemGPT 1.2B (Hugging Face)
  • [ ] Uni-Mol (Docker)
  • [ ] Uni-Mol+ (Docker)
  • [ ] MoLeR (Microsoft)

Mid-June 2023 - extend FS-Tox with in vitro data

Incorporate in vitro assays into the FS-Tox benchmark:

  • [ ] ToxCast
  • [ ] Extended Tox21
  • [ ] NCI60 data

📂 Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         and a short `-` delimited description, e.g.
│                         `1.0-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

📚 Resources

  1. ToxRefDB version 2.0: Improved utility for predictive and retrospective toxicology analyses
  2. ChemGPT: a transformer model for generative molecular modeling

fs-tox's People

Contributors

sethhowes, niklastr

fs-tox's Issues

Add assay and feature names to parquet files

Currently, I store information about the assay name, features, and dataset name in the filename. This makes it challenging to write SQL queries with DuckDB that filter by these variables. I therefore need to add this information to each file itself, even if that means duplicating information from the filename into the parquet file as additional columns. Glob patterns over filenames should then only be used to identify parquet files that share a schema; these can be conceptualised as individual relations in a relational database.
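
A minimal sketch of the fix, assuming hypothetical filenames and metadata values (the column names are illustrative, not a fixed schema):

import duckdb
import pandas as pd

# Hypothetical example: promote the metadata currently encoded in the
# filename into explicit columns before writing the parquet file back.
path = "data/processed/acute_oral_ecfp4_toxrefdb.parquet"  # placeholder filename
df = pd.read_parquet(path)
df["assay"] = "acute_oral"
df["features"] = "ecfp4"
df["dataset"] = "toxrefdb"
df.to_parquet(path, index=False)

# Glob patterns now only gather files that share a schema; the actual
# filtering happens in SQL over the new columns.
con = duckdb.connect()
ecfp4_rows = con.execute(
    "SELECT * FROM parquet_scan('data/processed/*.parquet') "
    "WHERE features = 'ecfp4' AND dataset = 'toxrefdb'"
).fetchdf()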

Write scripts to convert raw data files to parquet files for each assay

I think one approach to represent FS-Tox is a series of parquet files. Making the data as accessible as possible (including some examples in the README eventually) is going to be important to boost reproducibility and, frankly, impact down the line.

  • corpus - one parquet file with each molecule's SMILES, canonical SMILES, and additional metadata, including where the molecule is from: an assay id (the column of a dataset) and a dataset id (i.e. clintox_moleculenet_2023)
  • representations - a directory of parquet files of the form [SMILES, feature 1, ..., feature n]; every molecule in FS-Tox is represented here, and every representation method (i.e. ECFP4, ChemGPT, ...) creates one dedicated parquet file
  • assays - each assay (aka column) in FS-Tox is a dedicated parquet file with the canonical SMILES, the label, and the assay id

Parquet files work well with PyTorch and can be queried super fast with DuckDB in a data lakehouse context.
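
As a concrete illustration, here is a minimal sketch of writing one such assay file (the assay id, column names, and assays/ directory are assumptions, not a fixed interface):

import pandas as pd

# Hypothetical assay relation: canonical SMILES, binary label, assay id.
assay_id = "chr_liver_toxrefdb_2023"  # placeholder assay id
assay_df = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
        "label": [0, 1, 0],
        "assay_id": [assay_id] * 3,
    }
)
# Assumes an assays/ directory already exists.
assay_df.to_parquet(f"assays/{assay_id}.parquet", index=False)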

Prompts:
Yes, you can certainly read multiple different Parquet files and represent them as separate tables in DuckDB. Here's an example of how you can do this:

import duckdb

con = duckdb.connect()  # Connect to DuckDB

# Create separate views for each Parquet file
con.execute("CREATE VIEW my_data1 AS SELECT * FROM parquet_scan('my_data1.parquet')")
con.execute("CREATE VIEW my_data2 AS SELECT * FROM parquet_scan('my_data2.parquet')")

# Now you can query each view separately, or join them together as needed
result1 = con.execute("SELECT * FROM my_data1 WHERE some_column = 'some_value'")
result2 = con.execute("SELECT * FROM my_data2 WHERE another_column = 'another_value'")

df1 = result1.fetchdf()  # Get the result as a pandas DataFrame
df2 = result2.fetchdf()

In this example, my_data1 and my_data2 are views representing separate tables, each created from a different Parquet file. You can run separate queries on each view, or you can join them together using standard SQL syntax if they have columns in common.

This approach gives you the flexibility to work with each Parquet file individually, while still taking advantage of DuckDB's SQL querying capabilities.

Create a beeswarm plot to visualise the AUROC across individual assays

A beeswarm plot can be used to show the AUROC for various assays. It might be useful to colour-code the individual assays, too: certain groups of assays are likely to have different characteristics based on the project in which they were generated, in vivo vs. in vitro, etc.
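
A minimal sketch with seaborn's swarmplot, assuming a hypothetical results table with one AUROC per assay and a group column for colour-coding:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical results: one row per assay with its AUROC and group.
results = pd.DataFrame(
    {
        "model": ["xgboost"] * 6,
        "auroc": [0.61, 0.72, 0.55, 0.80, 0.68, 0.74],
        "group": ["in vivo"] * 3 + ["in vitro"] * 3,
    }
)

# One point per assay; hue encodes the assay group.
ax = sns.swarmplot(data=results, x="model", y="auroc", hue="group")
ax.axhline(0.5, linestyle="--", color="grey")  # chance-level AUROC
ax.set_ylabel("AUROC")
plt.savefig("reports/figures/auroc_beeswarm.png", dpi=300)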


Do grid search on toxicity threshold

It is currently unclear what the appropriate threshold is for binarising our continuous toxicity outcomes. Do we use the median, or is the 80th percentile more appropriate? By doing a grid search over these thresholds and then re-running the pipeline, we may see that model performance peaks or plateaus at a given threshold. Peak predictive performance may correspond to a more valid classification of molecules as toxic or non-toxic.
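
A minimal sketch of the grid search, assuming synthetic outcomes for one assay and a hypothetical rerun_pipeline helper; whether high or low values count as toxic is an assumption that depends on the endpoint:

import numpy as np
import pandas as pd

# Synthetic stand-in for one assay's continuous toxicity outcomes.
rng = np.random.default_rng(0)
outcomes = pd.Series(rng.lognormal(size=200))

for pct in [50, 60, 70, 80]:  # candidate percentile thresholds
    threshold = np.percentile(outcomes, pct)
    # Assumption: values above the threshold are labelled toxic.
    labels = (outcomes > threshold).astype(int)
    # rerun_pipeline(labels) is a hypothetical hook that would retrain
    # the models and report AUROC at this threshold.
    print(f"p{pct}: threshold={threshold:.2f}, positives={labels.mean():.0%}")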

Explore results of Toxval ROC-AUC scores

Some of the ROC-AUC scores from running Toxval through the pipeline are strange (exactly 1.0 and 0.0). I need to do an exploratory analysis of why these results occurred.
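
AUROC values of exactly 1.0 or 0.0 are often a symptom of tiny test sets or single-class assays rather than genuine model behaviour. A minimal DuckDB sketch for flagging such assays, assuming a hypothetical predictions directory with assay_id and binary label columns:

import duckdb

con = duckdb.connect()

# Hypothetical schema: one row per (assay_id, molecule) with a binary label.
# Flag assays whose test sets are tiny or contain only one class, since
# those can trivially produce AUROC scores of exactly 0.0 or 1.0.
suspicious = con.execute(
    """
    SELECT assay_id,
           COUNT(*)   AS n_test,
           SUM(label) AS n_positive
    FROM parquet_scan('predictions/*.parquet')
    GROUP BY assay_id
    HAVING COUNT(*) < 10
        OR SUM(label) = 0
        OR SUM(label) = COUNT(*)
    """
).fetchdf()
print(suspicious)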

Construct working benchmark

Benchmarking procedure should consist of:

  • Multiple datasets loaded into the data lakehouse
  • Assigned to either meta-train or meta-test sets
  • Assigned to either support or query sets

We must ensure the following conditions are not broken:

  • Data from a single dataset remains entirely within either the meta-train or the meta-test set (see the split sketch below)
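
A minimal split sketch, assuming a hypothetical mapping from assay ids to source datasets; shuffling and splitting at the dataset level guarantees no dataset spans both meta-splits:

import random

# Hypothetical assay id -> source dataset mapping.
assays = {
    "chr_liver_toxrefdb": "toxrefdb",
    "sub_kidney_toxrefdb": "toxrefdb",
    "ld50_zhu2009": "tdc_zhu_2009",
    "clinical_meic": "meic",
}

datasets = sorted(set(assays.values()))
random.seed(0)
random.shuffle(datasets)
meta_test_datasets = set(datasets[: len(datasets) // 3])

# Every assay from a given dataset lands on the same side of the split.
meta_train = [a for a, d in assays.items() if d not in meta_test_datasets]
meta_test = [a for a, d in assays.items() if d in meta_test_datasets]

# Within each meta-test assay, molecules would then be split further into
# support (few-shot training) and query (evaluation) sets.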

Parallelise hyperparameter search

This is computationally intense, and the computational resources required to train models will increase as I introduce new features. I therefore need to rewrite the XGBoost script to parallelise the hyperparameter search. The search is embarrassingly parallel (SIMD-style: the same training routine runs independently on each hyperparameter setting), so it can be distributed across cores; a sketch follows.
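
A minimal sketch of the parallelisation with joblib, using synthetic data in place of the real feature matrices; each grid point trains independently, so the loop maps cleanly onto worker processes:

from itertools import product

import numpy as np
import xgboost as xgb
from joblib import Parallel, delayed
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for the real features and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 16)), rng.integers(0, 2, 80)
X_valid, y_valid = rng.normal(size=(40, 16)), rng.integers(0, 2, 40)

def fit_and_score(params):
    # Each hyperparameter setting is trained and scored independently.
    model = xgb.XGBClassifier(n_estimators=100, **params)
    model.fit(X_train, y_train)
    return params, roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

grid = [
    {"max_depth": d, "learning_rate": lr}
    for d, lr in product([3, 6, 9], [0.01, 0.1, 0.3])
]

# n_jobs=-1 uses all available cores; each worker handles one grid point.
results = Parallel(n_jobs=-1)(delayed(fit_and_score)(p) for p in grid)
best_params, best_auc = max(results, key=lambda r: r[1])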

Create data processing script that takes in CSV files and outputs the specified data format

See the parquet-file layout and DuckDB example under "Write scripts to convert raw data files to parquet files for each assay" above.

Represent FS-Tox as a set of parquet files for final publication

  1. Create a merged pandas DataFrame
  2. Create a series of parquet files from it, following the layout specified in "Write scripts to convert raw data files to parquet files for each assay" above


Explore relationship between ECFP4 fingerprint embedding size and predictive accuracy

There have been published reports - including by the openbench team - that the predictive accuracy achievable by a learner on top of Morgan fingerprints increases or decreases in proportion to the number of features generated.

I believe it would be a valuable experiment to compare the performance of the learner on top of ECFP4 embeddings across five different sizes (the size of the smallest embedding from the other models, 1024, 2048, 4096, and 8192).
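
A minimal sketch of generating the fingerprints at each candidate size with RDKit; the 512-bit entry is a placeholder for the smallest embedding size among the other models:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# ECFP4 corresponds to a Morgan fingerprint with radius 2.
for n_bits in [512, 1024, 2048, 4096, 8192]:
    fps = np.array(
        [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=n_bits) for m in mols]
    )
    # train_and_evaluate(fps, labels) is a hypothetical hook where the
    # learner would be fitted and scored at this embedding size.
    print(n_bits, fps.shape)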

Simplify pipeline

The current pipeline can be configured in many different ways. This adds complexity and makes it difficult to run a set of molecules through from start to finish; it is confusing even to me, as the person who built the pipeline. As such, I will restructure the pipeline so that its parameters are specified once at the start, rather than at each stage (see the config sketch after this checklist).

  • Specify parameters needed for start
  • Create single tool config for entire pipeline
  • Remove assay flags + associated code
  • Remove dataset flags + associated code
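
A minimal sketch of what a single up-front config could look like, assuming a dataclass; the field names are illustrative, not the final interface:

from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """All pipeline parameters, specified once at the start."""
    dataset: str                   # e.g. "toxrefdb"
    representation: str            # e.g. "ecfp4"
    model: str                     # e.g. "xgboost"
    toxicity_percentile: int = 50  # threshold for binarising outcomes

# Each stage receives the same config object instead of its own flags.
config = PipelineConfig(dataset="toxrefdb", representation="ecfp4", model="xgboost")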
