
tcc's Introduction

Talk Can Cells?

Do cells have a language? With the recent success of large language models and the vast number of curated gene pathways, we revisited this fundamental question. We trained a decoder-only transformer with 120M parameters on >250,000 human pathways, and then asked the model to complete 1000 unseen pathways with 80% masking in a zero-shot setting. Remarkably, this simple, ready-to-use model completed more than half of the pathways (660/1000) with significant overlap. Our results suggest that even relatively small transformers can capture the underlying connections among genes and understand the true nature of cell language.

Blog Post.

Downloading the dataset

I have included an R script in the data folder that downloads pathways using the hypeR package. Outdated pathways were discarded. Simple quality control was performed to look cooler.

Preparing the dataset

The pathways are in .gmt format. This is more or less what it looks like:

Pathway Name | Description | Gene Symbol
ExamplePath | This is a description of the ExamplePath pathway. It's involved in [specific process or function]. | EXMP1
AnotherPath | Description of AnotherPath, highlighting its significance in [related biological process]. | ANTH2
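For reference, a GMT file is tab-delimited with one gene set per line: the set name, a description, and then one or more gene symbols. The snippet below is a minimal sketch of a reader for that layout, not the script used in this repo. The function name `parse_gmt` and the extra symbol `EXMP2` are made up for illustration.

```python
from typing import Dict, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {pathway name: [gene symbols]}."""
    pathways: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any gene
        name, _description, *genes = fields
        pathways[name] = [g for g in genes if g]  # drop empty trailing fields
    return pathways


# Toy input mirroring the example rows above (EXMP2 is hypothetical).
example = (
    "ExamplePath\tAn example description\tEXMP1\tEXMP2\n"
    "AnotherPath\tAnother description\tANTH2\n"
)
parsed = parse_gmt(example)
print(parsed)
```

The description column is dropped here on the assumption that only the gene lists are fed to the model.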

Then, I felt like a basic filtering strategy was necessary, so in order:

  1. Harmonized all gene symbols to hg38
  2. Removed genes that occurred fewer than 5 times in the whole dataset
  3. Selected pathways with >90% mapping to human genes
  4. Split the data: 95% for training, 5% for test (10,000 sets), ~44,000 tokens
  5. Selected 1000 random sets from the test set for inference
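Steps 2, 4 and 5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the actual preprocessing code: the helper names `drop_rare_genes` and `split_train_test`, the fixed seed, and the choice to drop pathways that end up empty are all mine. Steps 1 and 3 need an external gene annotation and are not shown.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Pathways = Dict[str, List[str]]


def drop_rare_genes(pathways: Pathways, min_count: int = 5) -> Pathways:
    """Step 2: remove genes seen fewer than min_count times across all pathways."""
    counts = Counter(g for genes in pathways.values() for g in genes)
    filtered: Pathways = {}
    for name, genes in pathways.items():
        kept = [g for g in genes if counts[g] >= min_count]
        if kept:  # assumption: pathways left with no genes are dropped
            filtered[name] = kept
    return filtered


def split_train_test(
    pathways: Pathways, test_fraction: float = 0.05, seed: int = 0
) -> Tuple[Pathways, Pathways]:
    """Step 4: shuffle pathway names and hold out test_fraction of them."""
    names = sorted(pathways)
    random.Random(seed).shuffle(names)
    n_test = int(round(len(names) * test_fraction))
    test_names = set(names[:n_test])
    train = {n: pathways[n] for n in names if n not in test_names}
    test = {n: pathways[n] for n in names if n in test_names}
    return train, test


# Toy data: one gene shared by every pathway, one unique gene per pathway.
toy = {f"P{i}": ["COMMON", f"RARE{i}"] for i in range(100)}
filtered = drop_rare_genes(toy, min_count=5)
train, test = split_train_test(filtered, test_fraction=0.05, seed=0)

# Step 5: draw a random subset of the test set for inference (3 here, 1000 in the text).
inference_names = random.Random(0).sample(sorted(test), k=3)
print(len(train), len(test), inference_names)
```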

All of these are here.

Training

I basically followed @Andrej's recommendations for training karpathy/nanoGPT. Obviously, this needs improvement. I think we could train a custom model, maybe even a different architecture (GNN?), to answer follow-up questions:

  • Can we translate it to plain English? lol
  • Can we use this as a base model to impute single cell data? Special tokens bla bla...
  • BETTER OBJECTIVE FUNCTIONs please.
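On the training side, a nanoGPT-style model consumes sequences of integer token IDs, so each pathway has to be encoded first. Here is a minimal sketch, assuming one token per gene symbol and an end-of-pathway marker. The `<eos>` token and the helper names `build_vocab` and `encode` are hypothetical, and the real tokenization in this repo may differ.

```python
from typing import Dict, Iterable, List


def build_vocab(
    pathways: Dict[str, List[str]], specials: Iterable[str] = ("<eos>",)
) -> Dict[str, int]:
    """Assign an integer ID to each special token, then to each gene in first-seen order."""
    vocab: Dict[str, int] = {tok: i for i, tok in enumerate(specials)}
    for genes in pathways.values():
        for gene in genes:
            vocab.setdefault(gene, len(vocab))
    return vocab


def encode(genes: List[str], vocab: Dict[str, int], eos: str = "<eos>") -> List[int]:
    """Turn one pathway into token IDs, terminated by the (assumed) <eos> token."""
    return [vocab[g] for g in genes] + [vocab[eos]]


# Toy training set; gene symbols are used only as example strings.
train_sets = {"P1": ["TP53", "MDM2"], "P2": ["MDM2", "CDKN1A"]}
vocab = build_vocab(train_sets)
ids = encode(train_sets["P2"], vocab)
print(vocab, ids)
```

A vocabulary of ~44,000 entries would fit in 16-bit unsigned integers, which is the dtype nanoGPT's data preparation scripts write to disk.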

Test

1000 random unseen pathways from step 5! Now it's time to mask 80% of each and generate... brr. Details? Here.
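The masking and scoring step could look roughly like this. It is a sketch under assumptions, not the evaluation code: I assume genes are hidden at random, that "overlap" means generated genes that hit the hidden set, and that significance is judged with a one-sided hypergeometric test against a vocabulary of about 44,000 genes. The names `mask_pathway` and `hypergeom_pvalue` are made up, and the "generated" list below stands in for real model output.

```python
import random
from math import comb
from typing import List, Tuple


def mask_pathway(
    genes: List[str], mask_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Randomly hide mask_fraction of the genes; return (prompt, hidden)."""
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    n_hidden = int(round(len(shuffled) * mask_fraction))
    return shuffled[n_hidden:], shuffled[:n_hidden]


def hypergeom_pvalue(
    overlap: int, n_hidden: int, n_generated: int, vocab_size: int
) -> float:
    """P(X >= overlap) when n_generated genes are drawn without replacement
    from vocab_size genes, of which n_hidden count as hits."""
    total = comb(vocab_size, n_generated)
    upper = min(n_hidden, n_generated)
    tail = sum(
        comb(n_hidden, k) * comb(vocab_size - n_hidden, n_generated - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


pathway = [f"G{i}" for i in range(10)]
prompt, hidden = mask_pathway(pathway, mask_fraction=0.8)

# Stand-in for model output: three true hits plus two unrelated genes.
generated = hidden[:3] + ["X1", "X2"]
overlap = len(set(generated) & set(hidden))
p_value = hypergeom_pvalue(overlap, len(hidden), len(generated), vocab_size=44_000)
print(len(prompt), len(hidden), overlap, p_value)
```

With a vocabulary this large, even a handful of correct genes is very unlikely by chance, so the p-value for the toy case is tiny.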

Food for Thought

Isn’t it funny how 2 bits seem to be the natural evolution for LLMs? Like DNA, which consists of 4 letters (A, T, G, C) and so operates on 2 bits. To run an LLM, all you need is a weight matrix and a run file (e.g. llama2.c), just like DNA.

So, what is the run.c function for DNA?

I think we will learn the answer soon.

Questions? Ideas? Wanna Chat?

Join us.

tcc's People

Contributors

eonurk
