

qmarcou commented on August 26, 2024

Hi Anna,
Below are my answers to your questions, hopefully well structured; please tell me if anything is unclear:


IGoR's handling of allelic variants.

Origin of IGoR's genomic templates.

The provided genomic templates originally come from the IMGT database, with a few additional variants appended that were found while constructing the generative model on the training dataset. Because the IMGT maintainers aim for an exhaustive database, the resulting list of alleles comprises many allelic variants that have been found here and there in the population. Because IGoR does not yet ship with on-the-fly inference of the allelic variants present in a dataset, it has to rely on these IMGT variants.

Number of allelic variants.

The biology.

In fact, some studies suggest that the TCR and BCR loci are quite dynamic and that gene duplication might be common. From Kidd et al., « The inference of phased haplotypes for the immunoglobulin H chain V region gene loci by analysis of VDJ gene rearrangements », The Journal of Immunology (2012):

It is also now clear that the apparent heterozygosity that can be seen in genotypes is often a consequence of the carriage of multiple “alleles” on a single chromosome. Such duplication of Ig genes has been reported previously. By employing RFLP analysis with sequence-specific oligonucleotide probes, Sasso and colleagues (30) identified two separate loci for sequences that are now identified as IGHV3-30 and IGHV4-28, as well as for IGHV1-69 (31). They also claimed there can be multiple copies of the IGHV3-23 sequence on a single chromosome (32), and others have reported duplication of IGHV4-31 (33).

This would naturally lead to observing more than two alleles of the same gene in a single individual.

IGoR

On top of the fact that several allelic variants could be present on the same chromosome, there are other shortcomings due to the sequencing process. Alleles of the same gene may differ by a single nucleotide, while the sequencing process is both error prone (i.e. it can introduce such single nucleotide variations) and limited to a finite read length (meaning that not all nucleotides of the gene/allele are observed). As a result, it is not always possible to distinguish between two different alleles, and one can only assign posterior probabilities to the gene/allele identity. This leads to assigning non-zero probability to most alleles.
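To make this ambiguity concrete, here is a toy sketch in Python. It is not IGoR's actual alignment or error model: it assumes an ungapped read placed at a known offset, a uniform substitution error rate, and two made-up alleles (`V1*01`, `V1*02`) that differ at a single position. The function name `allele_posterior` and all sequences are hypothetical.

```python
def allele_posterior(read, start, alleles, prior, error_rate=0.01):
    """Posterior over alleles for a read aligned (ungapped) at offset `start`.

    Toy model: each base matches with probability 1 - error_rate and is
    replaced by one of the three other bases with probability error_rate / 3.
    """
    weights = {}
    for name, seq in alleles.items():
        likelihood = 1.0
        for i, base in enumerate(read):
            if seq[start + i] == base:
                likelihood *= 1.0 - error_rate
            else:
                likelihood *= error_rate / 3.0
        weights[name] = prior[name] * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Two hypothetical alleles differing only at the last nucleotide.
alleles = {
    "V1*01": "ACGTACGTAC",
    "V1*02": "ACGTACGTAT",
}
prior = {"V1*01": 0.5, "V1*02": 0.5}

# Read that does not reach the variable position: alleles are indistinguishable.
partial = allele_posterior("ACGTA", 0, alleles, prior)

# Read that covers the variable position: one allele is strongly favoured,
# but the other keeps a small non-zero probability because of sequencing errors.
covering = allele_posterior("CGTAC", 5, alleles, prior)

print(partial)
print(covering)
```

In the first case the posterior simply equals the prior, and in the second the less likely allele still retains some probability mass, which is why most alleles end up with non-zero probability.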

Restricting the number of alleles used in IGoR.

Tuning gene and allele usage to your dataset.

It has been shown that gene/allele usage frequencies are the most variable components of the recombination machinery across individuals and sequencing technologies (see IGoR's paper for a more detailed discussion).

Before performing any computation on your dataset, it may be worthwhile to first run IGoR in inference mode and relearn only the gene usage frequencies for your dataset using the --infer_only command. Of course, one should be careful about which sequences are used to re-infer those frequencies, since gene usage frequencies may be modified by selection. For instance, the choice between non-productive and productive sequences should be carefully considered.

Manually restricting the number of gene/alleles available for a dataset.

In order to restrict the number of genes/alleles to a limited list (e.g. to generate sequences with a particular VJ combination), the user can supply such a list via the -set_genomic general command:

If the set of provided genomic templates is already fully contained (same name and same sequence) in the loaded model (default, custom, last_inferred), the missing ones will be set to zero probability, keeping the ratios of the others. For instance, providing only one already known genomic template will result in a model in which the usage of that gene is 1.0 and all others are set to 0.0. When using this option and introducing new/modified genomic templates, the user will need to re-infer a model, since the genomic templates will no longer correspond to the ones contained in the reference models; the model parameters are thus automatically reset to a uniform distribution.
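The "set to zero, keep the ratios of the others" behaviour amounts to a renormalization over the retained templates. A minimal Python sketch of that arithmetic, not IGoR's own code: it assumes a flat (unconditional) gene usage table, and the function name `restrict_usage` and the usage values are made up for illustration.

```python
def restrict_usage(usage, kept):
    """Zero out templates not in `kept` and renormalize the remaining ones.

    `usage` maps template name -> probability; `kept` is the set of names
    that remain available. Ratios between kept templates are preserved.
    """
    unknown = set(kept) - set(usage)
    if unknown:
        # New or renamed templates are a different situation: the model
        # would have to be re-inferred rather than renormalized.
        raise KeyError(f"templates not in the loaded model: {sorted(unknown)}")
    total = sum(usage[name] for name in kept)
    if total == 0.0:
        raise ValueError("kept templates have zero total probability")
    return {
        name: (p / total if name in kept else 0.0)
        for name, p in usage.items()
    }


# Hypothetical usage values, for illustration only.
usage = {"TRBV1*01": 0.5, "TRBV2*01": 0.3, "TRBV3*01": 0.2}

two_kept = restrict_usage(usage, {"TRBV1*01", "TRBV2*01"})
one_kept = restrict_usage(usage, {"TRBV3*01"})

print(two_kept)  # TRBV1*01 and TRBV2*01 keep their 5:3 ratio
print(one_kept)  # the single kept template gets probability 1.0
```

In a real model, usage of one gene may be conditioned on another (for example J given V), in which case the same renormalization would have to be applied within each conditional distribution.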

Thus, supplying a FASTA file containing only the desired V and J alleles will automatically restrict the usage to these genes without the need for the user to re-infer a model, provided these genes/alleles were already contained in the initial gene list.
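Such a restricted FASTA file can be produced by filtering the original template file by name. The Python sketch below is only one possible way to do it: it assumes each FASTA header is exactly the allele name, the helper names `read_fasta` and `restrict_fasta` are hypothetical, and the sequences are placeholders rather than real gene sequences. Names and sequences are copied unchanged so that they still match the loaded model.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) pairs."""
    records, name, chunks = [], None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def restrict_fasta(lines, wanted):
    """Return FASTA text containing only the records whose header is in `wanted`."""
    records = [(n, s) for n, s in read_fasta(lines) if n in wanted]
    missing = set(wanted) - {n for n, _ in records}
    if missing:
        raise KeyError(f"alleles not found in the input: {sorted(missing)}")
    return "".join(f">{n}\n{s}\n" for n, s in records)


# Placeholder input standing in for the original template file.
example = """>TRBV1*01
ACGTACGT
ACGT
>TRBV2*01
GGGTTTAA
>TRBV3*01
CCCCAAAA
""".splitlines()

restricted = restrict_fasta(example, {"TRBV1*01", "TRBV3*01"})
print(restricted, end="")
```

With real files, the lines would come from `open(path)` on the original template file and the returned text would be written to a new file, which is then passed to -set_genomic.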


In the near future I'd like to cover these notions in a more complete wiki/manual for IGoR, so please tell me if anything in this answer remains unclear!

Best,

Quentin


obrzts commented on August 26, 2024

Thank you for the detailed answer!

