labdao / fs-tox
License: MIT License
It is currently unclear what the appropriate threshold is for binarising our continuous toxicity outcomes. Do we use the median, or is the 80th percentile more appropriate? By doing a grid search over these thresholds and then re-running the pipeline, we may see that model performance peaks, or plateaus, at a given threshold. Peak predictive performance may correspond to a more valid classification of molecules as toxic or non-toxic.
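A minimal sketch of that grid search. run_pipeline is a stand-in for the real train-and-evaluate step and the toxicity values are synthetic; only the thresholding logic is the point here:
import numpy as np
rng = np.random.default_rng(0)
continuous_tox = rng.lognormal(size=1000)   # synthetic continuous outcomes
def run_pipeline(labels):
    # placeholder for re-training the model and returning ROC-AUC
    return float(labels.mean())             # dummy metric for the sketch
for pct in (50, 60, 70, 80, 90):            # median through the 90th percentile
    cutoff = np.percentile(continuous_tox, pct)
    labels = (continuous_tox > cutoff).astype(int)  # 1 = toxic
    print(f"{pct}th percentile cutoff -> metric {run_pipeline(labels):.3f}")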
This is computationally intense, and the compute required to train models will increase as I introduce new features. I therefore need to rewrite the xgboost script to parallelise the hyperparameter search. The search can be parallelised because it is embarrassingly parallel (SIMD-like: each parameter setting runs the same code independently).
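For illustration, a sketch using scikit-learn's GridSearchCV with n_jobs=-1 to spread parameter settings across all cores; the data is synthetic and the grid is an assumption, not the script's actual search space:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 2, size=200)
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,  # evaluate parameter settings concurrently on all cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))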
There have been published reports, including by the openbench team, that the predictive accuracy achievable by a learner on top of Morgan fingerprints changes in proportion to the number of features (bits) generated.
I believe it would be a valuable experiment to compare the performance of the learner on top of ECFP4 embeddings across five different sizes (the size of the smallest embedding from the other models, 1024, 2048, 4096, and 8192).
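A sketch of how those fingerprints could be generated with RDKit (ECFP4 corresponds to Morgan with radius 2); the 512 entry is only a placeholder for the smallest-embedding size, which is not pinned down here:
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
sizes = [512, 1024, 2048, 4096, 8192]   # 512 is an assumed stand-in
mol = Chem.MolFromSmiles("CCO")         # example molecule (ethanol)
for n_bits in sizes:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # feature vector for the learner
    print(n_bits, int(arr.sum()), "bits set")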
FROM NIKLAS: I think one approach to representing FS-Tox is as a series of Parquet files. Making the data as accessible as possible (including some examples in the README eventually) is going to be important to boost reproducibility and, frankly, also impact down the line.
Parquet files work well with PyTorch and can be queried super fast with DuckDB in a data lakehouse context.
Prompts:
Yes, you can certainly read multiple different Parquet files and represent them as separate tables in DuckDB. Here's an example of how you can do this:
import duckdb
con = duckdb.connect() # Connect to DuckDB
# Create separate views for each Parquet file
con.execute("CREATE VIEW my_data1 AS SELECT * FROM parquet_scan('my_data1.parquet')")
con.execute("CREATE VIEW my_data2 AS SELECT * FROM parquet_scan('my_data2.parquet')")
# Now you can query each view separately, or join them together as needed
result1 = con.execute("SELECT * FROM my_data1 WHERE some_column = 'some_value'")
result2 = con.execute("SELECT * FROM my_data2 WHERE another_column = 'another_value'")
df1 = result1.fetchdf() # Get the result as a pandas DataFrame
df2 = result2.fetchdf()
In this example, my_data1 and my_data2 are views representing separate tables, each created from a different Parquet file. You can run separate queries on each view, or you can join them together using standard SQL syntax if they have columns in common.
This approach gives you the flexibility to work with each Parquet file individually, while still taking advantage of DuckDB's SQL querying capabilities.
The pipeline currently outputs CSV files.
The current pipeline can be configured in many different ways. This adds complexity and makes it difficult to run a set of molecules through from start to finish; it is confusing even to me, the person who built it. I will therefore restructure the pipeline so that all parameters are specified once at the start, rather than at each stage.
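A minimal sketch of that shape, assuming hypothetical stage functions and config fields (not the existing code): one config object is built at entry and every stage reads from it.
from dataclasses import dataclass
@dataclass(frozen=True)
class PipelineConfig:
    dataset: str
    feature_type: str
    n_bits: int
    binarise_percentile: int
def featurise(config):
    # each stage pulls its parameters from the shared config
    print(f"featurising {config.dataset} with {config.feature_type}/{config.n_bits}")
def train(config):
    print(f"binarising at the {config.binarise_percentile}th percentile")
config = PipelineConfig("toxval", "ecfp4", 2048, 80)
for stage in (featurise, train):
    stage(config)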
Currently, I store information about the assay name, features, and dataset name in the filename. This makes it challenging to write SQL queries in DuckDB that filter by these variables. I therefore need to add this information to each file itself, even if that means duplicating it from the filename into the Parquet file as additional columns. Glob patterns over filenames should only be used to identify Parquet files with the same schema; these can be conceptualised as individual relations in a relational database.
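For example (the file path and column names are illustrative assumptions): the metadata lives in columns, and the glob only groups same-schema files into one relation.
import duckdb
import pandas as pd
# Write the metadata into the file itself, not just the filename
df = pd.DataFrame({"smiles": ["CCO"], "label": [0]})
df["assay"] = "acute_oral"
df["features"] = "ecfp4"
df["dataset"] = "toxval"
df.to_parquet("score_acute_oral_ecfp4_toxval.parquet")
con = duckdb.connect()
# Filter on real columns rather than on the filename
result = con.execute(
    "SELECT * FROM parquet_scan('score_*.parquet') WHERE assay = 'acute_oral'"
).fetchdf()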
Benchmarking procedure should consist of:
We must ensure the following conditions are not broken:
Some of the ROC-AUC scores from running Toxval through the pipeline are strange (exact 1.0 and 0.0 scores). I need to do an exploratory analysis of why these results have occurred; scores at those extremes usually indicate a degenerate evaluation, e.g. a tiny test set or one containing only a single class.
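A hedged sketch of that audit, assuming a hypothetical scores file and columns (the synthetic rows below stand in for real pipeline output):
import duckdb
import pandas as pd
pd.DataFrame({
    "assay": ["a1", "a2", "a3"],
    "roc_auc": [0.74, 1.0, 0.0],
    "test_size": [120, 4, 6],
    "n_positives": [40, 4, 0],
}).to_parquet("scores.parquet")
con = duckdb.connect()
suspect = con.execute("""
    SELECT assay, roc_auc, test_size, n_positives
    FROM parquet_scan('scores.parquet')
    WHERE roc_auc IN (0.0, 1.0)
    ORDER BY test_size
""").fetchdf()
print(suspect)  # tiny or single-class test sets are the usual culprits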
Figure subplots have dimensions: n_features * n_datasets
e.g. in-vivo vs. in-vitro