tobiasheol / ablang2

An antibody-specific language model focusing on non-germline (NGL) residue prediction

License: BSD 3-Clause "New" or "Revised" License

Python 73.57% Jupyter Notebook 26.43%
antibody-design language-model non-germline

ablang2's Introduction


AbLang-2

Addressing the antibody germline bias and its effect on language models for improved antibody design

DOI:10.1101/2024.02.02.578678

Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.

Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimised for predicting non-germline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is freely available (https://github.com/oxpig/AbLang2.git).

Availability and implementation: AbLang2 is a Python package available at https://github.com/oxpig/AbLang2.git.

TCRLang-Paired: The AbLang2 architecture can be initialised with model weights trained on paired TCR sequences. This model can be used on TCR sequences in the same way as AbLang2 is used on antibody sequences. The only missing functionality is the align command. The generation of sequence and residue encodings, as well as masking, works the same way. For an example, please see the notebook.


Install AbLang2

AbLang2 is freely available and can be installed with pip:

    pip install ablang2

or directly from GitHub:

    pip install -U git+https://github.com/oxpig/AbLang2.git

NB: If you want the returned output to be aligned (i.e. use the argument "align=True"), you need to manually install Pandas and a version of ANARCI in the same environment. ANARCI can also be installed using bioconda; however, this version is maintained by a third party.

    conda install -c bioconda anarci
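
Before relying on "align=True", it can help to confirm that both optional dependencies are visible from the environment in which AbLang2 is installed. The following is a minimal sketch using only the Python standard library. The helper name `missing_optional_modules` is hypothetical, and the import names `pandas` and `anarci` are assumed to match the installed packages.

```python
from importlib.util import find_spec

# Optional packages needed only when align=True is used.
# Assumption: the importable module names are "pandas" and "anarci".
OPTIONAL_MODULES = ("pandas", "anarci")


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the names of top-level modules that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_optional_modules()
    if missing:
        print("align=True is unlikely to work; missing:", ", ".join(missing))
    else:
        print("pandas and ANARCI found; align=True should be usable")
```

This only checks that the modules can be located. It does not verify that ANARCI's own external dependencies are set up.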

AbLang2 use cases

AbLang2 can be used in different ways and for a variety of use cases. The central building blocks are the tokenizer, AbRep, and AbLang.

  • Tokenizer: Converts sequences and amino acids to tokens, and vice versa
  • AbRep: Generates residue embeddings from tokens
  • AbLang: Generates amino acid likelihoods from tokens

    import torch
    import ablang2

    # Download and initialise the model
    ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')

    seq = [
        'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS', # The heavy chain (VH) needs to be the first element
        'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK' # The light chain (VL) needs to be the second element
    ]

    # Tokenize input sequences
    seqs = [f"{seq[0]}|{seq[1]}"] # Input needs to be a list, with | used to separate the VH and VL
    tokenized_seq = ablang.tokenizer(seqs, pad=True, w_extra_tkns=False, device="cpu")

    # Generate residue encodings (rescodings); no gradients are needed for inference
    with torch.no_grad():
        rescoding = ablang.AbRep(tokenized_seq).last_hidden_states

    # Generate logits/likelihoods
    with torch.no_grad():
        likelihoods = ablang.AbLang(tokenized_seq)
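
When preparing many antibodies, the "VH|VL" input strings can be built with a small helper. The sketch below is not part of the ablang2 package: `format_paired` is a hypothetical name, and the check that a chain does not already contain the separator is an added safeguard, not something the tokenizer is known to require.

```python
def format_paired(vh, vl, sep="|"):
    """Join a heavy (VH) and light (VL) chain into one 'VH|VL' string.

    Hypothetical helper; surrounding whitespace is stripped, and a chain
    that already contains the separator is rejected.
    """
    vh, vl = vh.strip(), vl.strip()
    for name, chain in (("VH", vh), ("VL", vl)):
        if sep in chain:
            raise ValueError(f"{name} must not contain the separator {sep!r}")
    return f"{vh}{sep}{vl}"


# Short toy fragments, used purely for illustration.
pairs = [("EVQLLESGG", "DIQLTQSPL")]
seqs = [format_paired(vh, vl) for vh, vl in pairs]
print(seqs)  # ['EVQLLESGG|DIQLTQSPL']
```

The resulting list has the same form as `seqs` in the example above, with the heavy chain first and the light chain second.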

We have built a wrapper for specific use cases, which can be explored via the following Jupyter notebook.

Citation

@article{Olsen2024,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2024.02.02.578678},
  year={2024}
}

ablang2's People

Contributors

algw71, fboyles, tobiasheol

ablang2's Issues

MMSeqs2 parameters

In the paper it is written that sequences were split into train/eval/test via MMSeqs2 clustering with a sequence_identity threshold of 0.95 - could you provide the full set of parameters used for clustering, or were the remaining ones left to be the default? Thanks!

What batch size should be used on an A100?

I cannot see any advice on batch size. If we want to speed up obtaining embeddings, what batch sizes would you recommend on an A100?

Also, is a dataloader used?
