Coder Social home page Coder Social logo

fs-mutant's Introduction

FS-mutant: A Few-Shot Learning Dataset for Protein Mutants Mining

GitHub license Updated on 2023.11.23

Introduction

Welcome to the official repository of the FS-mutant dataset, a benchmark for few-shot learning in protein mutants mining.

Download

To download and unzip the dataset, use the following shell commands:

wget https://lianglab.sjtu.edu.cn/files/FS-Mutant-2023/fs-mutant.zip
unzip fs-mutant.zip

Dataset Structure

The dataset folder contains a curated selection of proteins. To view the structure, use:

tree dataset -L 1

dataset
├── A4_HUMAN_Seuma_2021
├── CAPSD_AAV2S_Sinai_substitutions_2021
├── DLG4_HUMAN_Faure_2021
├── F7YBW8_MESOW_Aakre_2015
├── GCN4_YEAST_Staller_induction_2018
├── GFP_AEQVI_Sarkisyan_2016
├── GRB2_HUMAN_Faure_2021
├── HIS7_YEAST_Pokusaeva_2019
├── PABP_YEAST_Melamed_2013
├── SPG1_STRSG_Olson_2014
└── YAP1_HUMAN_Araya_2012

The main directory, 'dataset', includes subdirectories for various proteins, such as:

  • A4_HUMAN_Seuma_2021
  • CAPSD_AAV2S_Sinai_substitutions_2021
  • DLG4_HUMAN_Faure_2021
  • [Other protein directories]

Each protein directory, for example, 'dataset/A4_HUMAN_Seuma_2021', contains:

  • A '.fasta file' with the protein sequence.
  • A '.scores.csv' file with scores predicted by zero-shot models (sourced from Protein Gym).
  • A 'splits' directory for training and testing data.

Example : A4_HUMAN_Seuma_2021

tree dataset/A4_HUMAN_Seuma_2021 -L 1

dataset/A4_HUMAN_Seuma_2021
├── A4_HUMAN_Seuma_2021.fasta
├── A4_HUMAN_Seuma_2021.scores.csv
└── splits

Within this directory:

  • A4_HUMAN_Seuma_2021.fasta
  • A4_HUMAN_Seuma_2021.scores.csv
  • splits/

The 'splits' directory is organized as follows:

tree dataset/A4_HUMAN_Seuma_2021/splits -L 1

dataset/A4_HUMAN_Seuma_2021/splits
├── 160_1
├── 160_2
├── 160_3
├── 160_4
├── 160_5
├── 20_1
├── 20_2
├── 20_3
├── 20_4
├── 20_5
├── 320_1
├── 320_2
├── 320_3
├── 320_4
├── 320_5
├── 40_1
├── 40_2
├── 40_3
├── 40_4
├── 40_5
├── 80_1
├── 80_2
├── 80_3
├── 80_4
└── 80_5

It contains multiple subdirectories, each named in the format N_I, where N denotes the training set size and I is the split index (e.g., 20_1, 40_2).

Inside a Split Directory

For instance, in 'dataset/A4_HUMAN_Seuma_2021/splits/20_1/':

tree dataset/A4_HUMAN_Seuma_2021/splits/20_1/ -L 1

dataset/A4_HUMAN_Seuma_2021/splits/20_1/
├── compound_multi.csv
├── cross_single.csv
├── local_single.csv
└── train.csv

This split directory contains:

  • train.csv: The training dataset.
  • Other CSV files like compound_multi.csv, cross_single.csv, local_single.csv: These are test datasets.

fs-mutant's People

Contributors

ginnm avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

aaranwang

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.