Coder Social home page Coder Social logo

Comments (10)

zktuong avatar zktuong commented on August 18, 2024 3

Looks like they are correct and are treated as exceptions, but probably non-functional

For TRAJ35*01:

(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
Genecard says it's non-functional
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35

TRBJ2-2P*01 also looks like an exception and non-functional open reading frame

(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P

TRBJ2-7*02

(2) J-PHE replaced 118 replaced by Val in J-MOTIF.

from scirpy.

zktuong avatar zktuong commented on August 18, 2024 2

i just checked the IgBLAST auxiliary files that indicate where the CDR3 end is and crossed it to the fasta files and it looks like the codons are correct for those 3 genes as well. so should be alright to use the J-junction-motifs.txt

from scirpy.

grst avatar grst commented on August 18, 2024

Hi Rachel,

thanks for bringing this up. This was meant as a workaround to make it work with ir_dist (which is currently hardcoded to work on the junction sequence).

A proper solution is probably to

  • convert junction to cdr3 (which means just removing the constant amino acids)
  • provide an option in ir_dist to use the cdr3 column instead of junction_aa

For now, as a workaround, I would suggest to store CDR3 sequences in junction_aa in both query and database object.
You can trim the string with awkward array functions:

import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)

from scirpy.

grst avatar grst commented on August 18, 2024

@zktuong, just wanted to ask you to be sure

  • junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
  • cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

from scirpy.

zktuong avatar zktuong commented on August 18, 2024

junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?

yes should be correct.

cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

yes first is always C and the last should always be either F/W. Should be able to infer the F/W from the second codon from start of the J call (TTT/TTC for F or TGG for W).

from scirpy.

grst avatar grst commented on August 18, 2024

Hi @racng,

I believe I fixed this in #476, by extracting the junction_aa sequence from the full protein sequence based on the start/end coordinates provieded in IEDB. Would be great if you could give this a try!

You can install that version using

pip install git+https://github.com/scverse/scirpy@issue-469

Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb by deleting the data directory that was created.

from scirpy.

racng avatar racng commented on August 18, 2024

@grst Thank you for working on a fix for this issue!
I have copied the new function into a script and tested it.
For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.

adata = iedb(cached=True, cache_path='test/iedb.h5ad')

ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080

ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110

# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118

from scirpy.

grst avatar grst commented on August 18, 2024

Bad news! I took another look and it seems that indeed the Start/End position and and Protein sequence are only available for ~5000 receptors:

>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083

For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C and W/F.

image

So I'm afraid this would take quite some cleanup to get it right!


On the scirpy side, I could consider adding the option to use cdr3_aa sequences, instead of junction_aa, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.

from scirpy.

racng avatar racng commented on August 18, 2024

I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv generated by IMGTgeneDL run in stitchr mode to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".

I am attaching it here:
J-region-motifs.txt

from scirpy.

grst avatar grst commented on August 18, 2024

@zktuong, what do you say about this one? Is that wrong or the always existing exception to the rule in Biology?
image

(L and V are "not confident" but C is)
(this is from above J-junction-motifs.txt)

from scirpy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.