labdao / fs-tox
License: MIT License
It is currently unclear what the appropriate threshold is for binarising our continuous toxicity outcomes. Do we use the median, or is the 80th percentile more appropriate? By doing a grid search over these thresholds and then re-running the pipeline, we may see that model performance peaks, or plateaus, at a given threshold. Peak predictive performance may correspond to a more valid classification of molecules as toxic or non-toxic.
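A minimal sketch of that grid search. run_pipeline is a stand-in for the real train-and-evaluate step and the toxicity values are synthetic; only the thresholding logic is the point here:
import numpy as np
rng = np.random.default_rng(0)
continuous_tox = rng.lognormal(size=1000)   # synthetic continuous outcomes
def run_pipeline(labels):
    # placeholder for re-training the model and returning ROC-AUC
    return float(labels.mean())             # dummy metric for the sketch
for pct in (50, 60, 70, 80, 90):            # median through the 90th percentile
    cutoff = np.percentile(continuous_tox, pct)
    labels = (continuous_tox > cutoff).astype(int)  # 1 = toxic
    print(f"{pct}th percentile cutoff -> metric {run_pipeline(labels):.3f}")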
This is computationally intense, and the compute required to train models will increase as I introduce new features. I therefore need to rewrite the xgboost script to parallelise the hyperparameter search. The search can be parallelised because it is embarrassingly parallel (SIMD-like: each parameter setting runs the same code independently).
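For illustration, a sketch using scikit-learn's GridSearchCV with n_jobs=-1 to spread parameter settings across all cores; the data is synthetic and the grid is an assumption, not the script's actual search space:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 2, size=200)
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,  # evaluate parameter settings concurrently on all cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))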
There have been published reports, including by the openbench team, that the predictive accuracy achievable by a learner on top of Morgan fingerprints changes in proportion to the number of features (bits) generated.
I believe it would be a valuable experiment to compare the performance of the learner on top of ECFP4 embeddings across five different sizes (the size of the smallest embedding from the other models, 1024, 2048, 4096, and 8192).
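A sketch of how those fingerprints could be generated with RDKit (ECFP4 corresponds to Morgan with radius 2); the 512 entry is only a placeholder for the smallest-embedding size, which is not pinned down here:
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
sizes = [512, 1024, 2048, 4096, 8192]   # 512 is an assumed stand-in
mol = Chem.MolFromSmiles("CCO")         # example molecule (ethanol)
for n_bits in sizes:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # feature vector for the learner
    print(n_bits, int(arr.sum()), "bits set")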
FROM NIKLAS: I think one approach to representing FS-Tox is as a series of Parquet files. Making the data as accessible as possible (including some examples in the README eventually) is going to be important to boost reproducibility and, frankly, also impact down the line.
Parquet files work well with PyTorch and can be queried super fast with DuckDB in a data lakehouse context.
Prompts:
Yes, you can certainly read multiple different Parquet files and represent them as separate tables in DuckDB. Here's an example of how you can do this:
import duckdb
con = duckdb.connect() # Connect to DuckDB
# Create separate views for each Parquet file
con.execute("CREATE VIEW my_data1 AS SELECT * FROM parquet_scan('my_data1.parquet')")
con.execute("CREATE VIEW my_data2 AS SELECT * FROM parquet_scan('my_data2.parquet')")
# Now you can query each view separately, or join them together as needed
result1 = con.execute("SELECT * FROM my_data1 WHERE some_column = 'some_value'")
result2 = con.execute("SELECT * FROM my_data2 WHERE another_column = 'another_value'")
df1 = result1.fetchdf() # Get the result as a pandas DataFrame
df2 = result2.fetchdf()
In this example, my_data1 and my_data2 are views representing separate tables, each created from a different Parquet file. You can run separate queries on each view, or you can join them together using standard SQL syntax if they have columns in common.
This approach gives you the flexibility to work with each Parquet file individually, while still taking advantage of DuckDB's SQL querying capabilities.
The pipeline currently outputs CSV files.
The current pipeline can be configured in many different ways. This adds complexity and makes it difficult to run a set of molecules through from start to finish; it is confusing even to me, the person who built it. I will therefore restructure the pipeline so that all parameters are specified once at the start, rather than at each stage.
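A minimal sketch of that shape, assuming hypothetical stage functions and config fields (not the existing code): one config object is built at entry and every stage reads from it.
from dataclasses import dataclass
@dataclass(frozen=True)
class PipelineConfig:
    dataset: str
    feature_type: str
    n_bits: int
    binarise_percentile: int
def featurise(config):
    # each stage pulls its parameters from the shared config
    print(f"featurising {config.dataset} with {config.feature_type}/{config.n_bits}")
def train(config):
    print(f"binarising at the {config.binarise_percentile}th percentile")
config = PipelineConfig("toxval", "ecfp4", 2048, 80)
for stage in (featurise, train):
    stage(config)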
Currently, I store information about the assay name, features, and dataset name in the filename. This makes it challenging to write SQL queries in DuckDB that filter by these variables. I therefore need to add this information to each file itself, even if that means duplicating it from the filename into the Parquet file as additional columns. Glob patterns over filenames should only be used to identify Parquet files with the same schema; these can be conceptualised as individual relations in a relational database.
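For example (the file path and column names are illustrative assumptions): the metadata lives in columns, and the glob only groups same-schema files into one relation.
import duckdb
import pandas as pd
# Write the metadata into the file itself, not just the filename
df = pd.DataFrame({"smiles": ["CCO"], "label": [0]})
df["assay"] = "acute_oral"
df["features"] = "ecfp4"
df["dataset"] = "toxval"
df.to_parquet("score_acute_oral_ecfp4_toxval.parquet")
con = duckdb.connect()
# Filter on real columns rather than on the filename
result = con.execute(
    "SELECT * FROM parquet_scan('score_*.parquet') WHERE assay = 'acute_oral'"
).fetchdf()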
Benchmarking procedure should consist of:
We must ensure the following conditions are not broken:
Some of the ROC-AUC scores from running Toxval through the pipeline are strange (exact 1.0 and 0.0 scores). I need to do an exploratory analysis of why these results have occurred; scores at those extremes usually indicate a degenerate evaluation, e.g. a tiny test set or one containing only a single class.
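A hedged sketch of that audit, assuming a hypothetical scores file and columns (the synthetic rows below stand in for real pipeline output):
import duckdb
import pandas as pd
pd.DataFrame({
    "assay": ["a1", "a2", "a3"],
    "roc_auc": [0.74, 1.0, 0.0],
    "test_size": [120, 4, 6],
    "n_positives": [40, 4, 0],
}).to_parquet("scores.parquet")
con = duckdb.connect()
suspect = con.execute("""
    SELECT assay, roc_auc, test_size, n_positives
    FROM parquet_scan('scores.parquet')
    WHERE roc_auc IN (0.0, 1.0)
    ORDER BY test_size
""").fetchdf()
print(suspect)  # tiny or single-class test sets are the usual culprits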
Figure subplots have dimensions: n_features * n_datasets
e.g. in-vivo vs. in-vitro