psathyrella / partis
B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
License: GNU General Public License v3.0
`scons test` runs one monolithic block of code, so a failure partway through requires rerunning the whole thing. At some point it would be nice to break this into several targets so that building is incremental.
Currently, if bppseqgen decides to mutate a base within the conserved cysteine or tryptophan, we just revert that base to germline. This of course will bias the tree, so long term we should do something different.
Note that we should also at some point disallow stop codons.
So we need to be able to specify alternative germline files in Connor's ighutil/vdjalign package. Erick volunteered, ha ha, right?
The main package is here
As it says in the readme, you install with

```
make -C clj
pip install ./python  # I would add --user
```

The way it's set up now, it always loads germline versions from files in this directory, and options such as `--j-subset` specify the file suffix, e.g. `--j-subset adaptive` tells it to load `ighj-adaptive.fasta`. One problem with this is that imgt seems to quite enjoy regularly changing both the name corresponding to a given sequence and the sequence corresponding to a given name. So ideally I could specify something like `--germline-dir`, and it would always look for `igh[vdj].fasta` in that directory.
The code that actually loads the file is here. The farthest I got was figuring out that if I add this line in `_handle()`:

```python
return open('/home/dralph/work/partis/data/' + file_name)
```

it properly looks in that directory. But I need to do it on the command line!
I can of course ask Connor as well, or do it myself, but you sounded stoked on it...
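In case it's useful, here's a rough sketch of what a `--germline-dir` option could look like. Everything here (`read_fasta`, `load_germlines`, even the option name) is a made-up illustration, not vdjalign's actual API:

```python
# Hypothetical sketch: load germline V/D/J fastas from a directory given on
# the command line, instead of hard-coded file suffixes in a fixed directory.
import argparse
import os

def read_fasta(path):
    """Minimal fasta reader: returns {name: sequence}."""
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line[1:].split()[0]
                seqs[name] = ''
            elif name is not None:
                seqs[name] += line
    return seqs

def load_germlines(germline_dir):
    """Look for igh[vdj].fasta in <germline_dir>, return {region: {name: seq}}."""
    germlines = {}
    for region in 'vdj':
        fname = os.path.join(germline_dir, 'igh%s.fasta' % region)
        germlines[region] = read_fasta(fname)
    return germlines

parser = argparse.ArgumentParser()
parser.add_argument('--germline-dir', default='data/germlines')
```

The point being that swapping imgt versions would then just mean pointing `--germline-dir` at a different directory.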
So right now we infer branch lengths to use in the simulation on the fly for each dataset/person, but we're still using some GTR parameters I got from Connor that correspond to one or the other of the people from Adaptive. I think this is fine for the moment -- I don't think anything we're doing is sensitive to exactly which base goes to which other base.
Although, note, one thing that we still want to do is account for mutations being more likely from, say, A->C than A->G. This is easy to put in the HMM, but we decided putting it in simulation would be prohibitive so shelved it.
shouldn't call GetSubSeqs and then loop over seqs again.
This is tied in with #8 but on the discrete parameter estimation side.
Specifically, say we would like to infer per-site emission probabilities for a given gene that sits in a gene family (say, V3). The idea would be to use the Dirichlet-multinomial: we take a Dirichlet prior with alpha vector chosen so that its expectation equals the aggregate frequency at the higher level (in our example, the gene family V3), and then take the posterior of the multinomial with that prior.
We have the natural hierarchy of genes contained in families, which are contained in segments (V, D, and J). If we were rad, we would treat the whole thing as a hierarchical model and do inference. But that doesn't seem so practical given that we would lose conjugacy, unless there is something about the Dirichlet-multinomial-multinomial that I don't know about (quite likely, in fact).
Hoping to get some commentary from @vnminin.
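A minimal sketch of the posterior-mean calculation under that Dirichlet-multinomial setup (the function name and the `concentration` parameter are made up; only the conjugate-update arithmetic is the point):

```python
# Sketch of the Dirichlet-multinomial shrinkage described above. The prior
# alpha vector is scaled so its mean equals the aggregate (e.g. gene-family)
# frequencies; the posterior mean then interpolates between gene-level counts
# and family-level frequencies.

def posterior_emission_probs(gene_counts, family_freqs, concentration=10.0):
    """gene_counts: per-base counts at one site for one gene, e.g. {'A': 3, ...}.
    family_freqs: aggregate per-base frequencies at the family level (sum to 1).
    concentration: total prior weight (pseudo-counts) given to the family level.
    Returns the posterior-mean per-base probabilities."""
    alpha = {b: concentration * family_freqs[b] for b in 'ACGT'}
    total = sum(gene_counts.values()) + concentration
    return {b: (gene_counts.get(b, 0) + alpha[b]) / total for b in 'ACGT'}
```

With few gene-level observations this reduces to the family frequencies; with many, the counts dominate, which is exactly the shrinkage behavior we want.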
There is lots of data, and @psathyrella has done a nice job of getting empirical frequencies of affinity-matured sequences from this data for each germline gene. One approach would be to just let those be the emission probabilities in the HMM. However, as noted in our previous paper, there is a correlation between the amount of mutation in different segments, or even more clearly, there is a correlation between the level of mutation in the various sites in a gene. Thus, I would think that it makes sense to infer the amount of "evolutionary time" that has passed in the affinity maturation process and then use that to calculate emission probabilities for the various sites. Evolutionary time could be inferred just by looking at the average divergence of read segments from their germline equivalent using his pre-processing Smith-Waterman algorithm.
@vnminin -- what think you? Overkill?
It's implemented, but a significant minority of sequences do not have a single unambiguous cdr3 length, and you can't really cluster things when some of those things have two clustering values apiece.
Following 12/17 discussion with Vladimir.
The issue is that the size of clonally related groups can impact individual-level inferences. For example, if 80% of sequences come from one clonally related group, then the other 20% are more likely to follow that group's path through the HMM.
We could look at pairwise distance plots (though those do seem to be fairly mysterious sometimes, as lots of things are getting compressed down to a single number, and multiplicity gets squared). Perhaps trees are better, as long as we don't over-interpret them.
As described in conversation with @vnminin, a nice way to do model-based clustering would be to take a product across clusters, such that the likelihood of a given cluster is the marginal probability of generating the sequences contained in it via an n-HMM.
I think this is the way to go, but can see two things that need thinking about:
Pairs of sequences that are not clonally related in general take much longer to run than pairs that are related -- partly because I sum over a larger k-space, but also maybe for other reasons. This seems like kind of a waste, and it's at least possible it's avoidable.
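For concreteness, the product-across-clusters likelihood can be sketched in log space. This is a toy stand-in: `log_cluster_marginal` abstracts the real n-HMM forward probability, which this sketch does not implement:

```python
# Toy sketch of model-based clustering via a product over clusters: the
# likelihood of a partition is the product (sum, in log space) of the
# marginal probabilities of generating each cluster's sequences jointly.

def log_partition_likelihood(partition, log_cluster_marginal):
    """partition: list of clusters, each a list of sequence ids.
    log_cluster_marginal: function mapping a cluster to the log marginal
    probability of jointly generating its sequences (the n-HMM forward prob).
    Returns the log likelihood of the whole partition."""
    return sum(log_cluster_marginal(cluster) for cluster in partition)
```

Comparing two candidate partitions then just means comparing these sums, since the clusters are assumed independent given the partition.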
hm, I thought this was already here somewhere, but I don't see it.
An incomplete list of things to which I may refer as knobs:
parameter inference with few observations: if in training data we see a gene version only a few times, we can't build a meaningful hmm from it. I have tried a number of approaches, but what I do now is: if we see a version fewer than `n_obs` times (have tried with numbers like 5, 10, 30, 50), we lump in all the alleles of that gene. If this still doesn't give us `n_obs` observations, we lump in all the genes with the same subgroup number (e.g. all the IGHV3s), and if that still doesn't, we use all the IGHVs.
This has different implications for different parameters. For instance, mutation frequencies are extremely non-homogeneous, and aren't necessarily similar even among alleles. Erosion and insertion lengths certainly depend on version as well, and aren't necessarily the same among alleles or subgroups.
We know at the moment that on my training sets of ~10,000 sequences, switching `n_obs` from 5 to 50 has little effect. But we have to view this as code that someone may naively try to train on an arbitrary number of sequences, so the decision has potential import.
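The allele-to-gene-to-subgroup-to-segment fallback might look something like this sketch (the helper names and the IMGT-style name parsing are assumptions for illustration):

```python
# Sketch of the fallback hierarchy described above: if an allele has fewer
# than n_obs observations, pool counts across its gene's alleles; then across
# its subgroup (family); then across the whole segment.

def gene_of(allele):    # 'IGHV3-23*01' -> 'IGHV3-23'
    return allele.split('*')[0]

def family_of(allele):  # 'IGHV3-23*01' -> 'IGHV3'
    return gene_of(allele).split('-')[0]

def pooled_observations(allele, counts, n_obs=30):
    """counts: {allele_name: n_observations}. Returns the smallest pool of
    alleles (allele -> gene -> family -> segment) with >= n_obs observations."""
    # group_fns go from narrowest (the allele itself) to broadest (segment)
    for group_fn in (lambda a: a, gene_of, family_of, lambda a: a[:4]):
        pool = [a for a in counts if group_fn(a) == group_fn(allele)]
        if sum(counts[a] for a in pool) >= n_obs:
            return pool
    return list(counts)  # fall back to everything we have
```

Usage: `pooled_observations('IGHV3-23*01', counts)` would return just that allele if it is well observed, or progressively larger pools otherwise.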
interpolation while estimating parameters When do we want to apply some interpolation? Think of a general histogram (could be mutation frequencies by position, or erosion frequencies for a given gene...). If we have lots of information that the distribution is really spiky, then we're pretty sure a zero-bin is really supposed to be zero. But if we only have a few observations, it's more likely that a "zero bin" isn't really zero and we should interpolate. So I have a threshold: if the nearest bins with entries on either side have more than 20 entries, I don't interpolate (this bin is probably really zero), otherwise I do.
Note that since it's really bad to have zero probability of something in an HMM that could happen (since the zero absolutely totally forbids it), I always use either an EPS value (that I made up), or the bin's upper uncertainty if it's available.
Er, if that doesn't make sense, it's because it's complicated! Or because I didn't explain it very well.
the insertion model I have lovely distributions for the typical insertion lengths for each gene version, but since this is an hmm I can't (er, I haven't figured out how to) model an arbitrary distribution, so I use a geometric distribution with the proper mean. I.e., the probability of exiting the insert state is `1/<mean_length>`.
The distributions in our current data look quite like geometric distributions, and the bias in the performance plots can probably be characterized as small, but it has to be remembered that I pulled the geometric distribution ooma.
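A quick sanity check of the geometric model: with exit probability `1/<mean_length>` from the insert state, simulated insertion lengths average out to `<mean_length>`. This is a pure-python toy, not partis code:

```python
# Simulate the HMM insert state: entering it emits one base, then at each
# step we exit with probability 1/mean_length (so lengths are geometric
# on {1, 2, ...} with mean mean_length).
import random

def sample_insert_length(mean_length, rng):
    p_exit = 1.0 / mean_length
    length = 1  # entering the insert state emits one base
    while rng.random() > p_exit:  # self-loop: stay and emit another base
        length += 1
    return length
```

Note this convention assumes at least one emitted base per visit; zero-length insertions would be modeled by bypassing the state entirely.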
How many gene matches per region? We don't sum over all possible gene versions, but only over those that score highly in the smith-waterman step. How many matches should we pass from SW to the HMM? In practice 3 v matches, 5 d matches, and 2 j matches is a good compromise.
Things to keep in mind: if SW didn't flag the true gene in the top few matches, most of the time the HMM won't like that match, either, so you don't gain that much by including it. In other words you're never going to annotate that sequence correctly so you may as well not spend too much compute time working on it. Also, when we're clustering (i.e. in pairs), if the correct match for both sequences is not in the union of their best matches, it doesn't matter -- that almost without fail means that they aren't from the same recombination event, anyway, so the total probability is going to change very little if you add a gene version that is only a good match to one of the sequences.
k-space boundaries when integrating over `k_v` and `k_d`, we of course don't actually sum from zero to infinity. I currently take the most likely point in k-space from smith-waterman (i.e. the `k_v` and `k_d` from the best v-d-j match), find the limits of k-space for the top `n_best_per_region` matches, and finally expand these bounds with a fuzz factor of 2. It seems to work great, i.e. it pretty much always includes the correct k-point, but if the correct value is ever outside my bounds, I'm screwed. Note also that compute time scales with k-area.
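A sketch of the bound computation (toy code: the real version works from SW match info, not bare `(k_v, k_d)` tuples, and the fuzz factor of 2 is the value from the text):

```python
# Compute the k-space rectangle to sum over: take the k_v/k_d ranges spanned
# by the top SW matches, then pad each side with a fuzz factor.

def k_bounds(sw_matches, fuzz=2):
    """sw_matches: list of (k_v, k_d) points from the best SW matches.
    Returns ((k_v_min, k_v_max), (k_d_min, k_d_max)), clamped at zero."""
    k_vs = [kv for kv, kd in sw_matches]
    k_ds = [kd for kv, kd in sw_matches]
    return ((max(0, min(k_vs) - fuzz), max(k_vs) + fuzz),
            (max(0, min(k_ds) - fuzz), max(k_ds) + fuzz))
```

Since compute time scales with k-area, the fuzz factor is a direct speed/robustness tradeoff: bigger fuzz means fewer missed true k-points but a larger rectangle to sum over.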
SW match parameters (note that we currently use Connor's `vdjalign` for smith-waterman). There are a number of parameters one specifies, but the three which are probably most important for us are the match score, mismatch score, and gap-open penalty. The ratio of match to mismatch score is what matters, and it should in theory be set according to the mutation rate in the sequence. In practice I'm using 5:3, which seems to work well, but I just came up with that from experimenting on our existing data. I have the gap open penalty set to 1000, which effectively forbids indels within v, d, or j regions.
which germline repertoire to use? The set of germline gene versions that we make available to SW and the HMM of course has a huge effect on performance. There are, though, a lot of possibilities as to what we choose. In the long run, we want to only use versions that we see in each person's naive repertoire. Then we have to decide what set to start from -- the consensus seems to be that a huge fraction of the versions in imgt are not real, i.e. are mutated or sequencing-error'ed versions of real ones.
Well, and the ability to run on TCRs when that flag is set.
As per talk with Raphael.
From talking to folks at Keystone, it sounds like adding mouse would also be useful. This should be just a matter of downloading the imgt source.
Well, we have that ability now, but we should put in an explicit method for users to do it. The way we do it now is to infer parameters from a large data set, and then apply the hmm to that same data set. The use case for using different data sets is if you're interested in a data set that has too few sequences for reliable parameter estimation.
Among other things, we should have:
They're certainly not badly biased, but it'd be nicer if those distributions were perfectly symmetric.
Also note, SW and HMM both bias in the same direction.
Rest of plots are here: http://stoat.fhcrc.org/sharing/dralph/partis/performance/check-new-imgt/hmm-vs-sw/plots.html (need to be on hutch network)
They're super rare at this point, but it'd be better not to have them.
In accordance with the final hmm states/topologies.
note, this should happen after #13
Erick's idea -- this allows us to take into account the fact that two clonal seqs have the same inserts
As part of this, should update/check get_emission_prob() in hmmwriter.py
Can we incorporate quality scores into partis? What sorts of scores can we expect to get from Adaptive? Stay tuned..
oh, man, this'll be kind of complicated.
By boundary errors I mean when our sum over k-space kicks a warning because the maximum is on the boundary. I auto-expand this (which does a pretty good job) but should go through and be more careful about it, since especially with forward pair/k we need to be summing over all areas of k-space with significant probability density.
This will probably mean starting again to write the errors to the hmm's csv output file.
you'll need to add an option to set the random seed in recombinator
All the hip kids are using `git subtree` these days, but `submodule` is nice in its own way.
The only thing I've thought of is to use the pair hmm as a preclustering step, and just lower/loosen the single-linkage clustering cutoff to make sure we never cluster when we shouldn't, and then pass those clusters to the k-hmm.
The trouble is that I don't know that it'll be simple to come up with a principled way to decide how much to loosen the cutoff.
Currently we just print out the n-best paths for annotation... but there's loads more information at our fingertips that we can present if people would use it. For instance, if we print out the three best paths, they're typically all the same gene matches, and just wiggling around the alignment boundaries a little bit.
Possibilities:
I.e. if you're calculating all the m-subsets for a cluster of size n (>m), only calculate the single-sequence factors once for each sequence.
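The caching idea might be sketched like this (`single_seq_logprob` is a stand-in for the real per-sequence HMM score, which I'm not implementing here):

```python
# When scoring every m-subset of an n-cluster, compute each sequence's
# single-sequence factor once and reuse it across all the subsets it
# appears in, rather than recomputing it per subset.
from functools import lru_cache
from itertools import combinations

@lru_cache(maxsize=None)
def single_seq_logprob(seq):
    # stand-in: the real thing would run the forward algorithm on seq
    return -2.0 * len(seq)

def subset_scores(cluster, m):
    """Score all m-subsets of cluster as sums of cached per-sequence factors."""
    return {subset: sum(single_seq_logprob(s) for s in subset)
            for subset in combinations(sorted(cluster), m)}
```

For a cluster of size n this turns n-choose-m single-sequence computations per sequence into one, with the subset loop only doing cheap lookups and sums.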
That is, make it easy to take a large sample, divide it into chunks, and pass each chunk to a separate clusterer, and then merge the clusters. Only running on 10k queries at the moment, and memory and cpu consumption even for hamming is killing me.
i.e. reads that have v chopped off, j chopped off, yadda yadda
I.e. run simulation with lots of wildly different BCR repertoire structures, and see how the partitioning does.
We have a per-site mutation frequency, `f` (the fraction of observed sequences that have a mutation at the site), and we want to fill in the 4x4 table of pair emission probabilities in the HMM. The simplest way to do this is to make the emission probabilities a product of two factors (one for each sequence), where each factor is `f/3` (if the sequence is not germline) and `1-f` (if the sequence is germline).
| | germline | mutated | mutated | mutated |
|---|---|---|---|---|
| germline | (1-f)(1-f) | f(1-f)/3 | f(1-f)/3 | f(1-f)/3 |
| mutated | | f*f / 9 | f*f / 9 | f*f / 9 |
| mutated | | | f*f / 9 | f*f / 9 |
| mutated | | | | f*f / 9 |
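The independent-factor table can be generated programmatically, e.g.:

```python
# Build the 4x4 pair-emission table under the independence assumption above:
# each sequence contributes (1-f) if it matches germline and f/3 otherwise,
# and the pair probability is just the product of the two factors. (This is
# the version that ignores shared mutations between clonal sequences.)

def pair_emission_table(f):
    """Returns a 4x4 table indexed by (seq_a_state, seq_b_state), where
    state 0 is germline and states 1-3 are the three mutated bases."""
    factor = lambda state: (1 - f) if state == 0 else f / 3.0
    return [[factor(a) * factor(b) for b in range(4)] for a in range(4)]
```

Since each sequence's factors sum to one over its four states, the whole table sums to one, which is a handy check.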
So all we need to implement joint emission is to fill in the entries in the matrix so they take into account that if the two sequences are mutated to the same base, they're more likely to be clonally related. Except I haven't worked out a good way to do this. All the things I've tried require assumptions about branch lengths and tree topology which are not always true, so empirically they end up not being that great.
Erick and I talked about this a few months ago. If memory serves we got as far as he was actually convinced it was non-trivial, but didn't work out how to do it.
This is quite related to #8.
```
mus data/hmm_parameters ‹regularized-params› » ls
all-probs.csv                       d_gene-probs.csv      hmms                 j_gene-dj_insertion-probs.csv  mute-freqs              vd_insertion-probs.csv     v_gene-v_5p_del-probs.csv
d_gene-d_3p_del-d_5p_del-probs.csv  dj_insertion_content.csv  j_5p_del-probs.csv   j_gene-j_3p_del-probs.csv      seq_content.csv         v_gene-probs.csv
d_gene-d_5p_del-d_3p_del-probs.csv  fv_insertion-probs.csv    jf_insertion-probs.csv  j_gene-probs.csv            vd_insertion_content.csv  v_gene-v_3p_del-probs.csv
mus data/hmm_parameters ‹regularized-params› » head vd_insertion-probs.csv
vd_insertion,count
21,78
32,4
23,46
34,3
61,1
9,1844
36,3
11,1314
38,2
mus data/hmm_parameters ‹regularized-params› » head j_gene-dj_insertion-probs.csv
dj_insertion,j_gene,count
28,IGHJ3*02,1
25,IGHJ1*01,1
11,IGHJ1*01,12
7,IGHJ6*03,222
5,IGHJ1*01,34
7,IGHJ2P*01,1
1,IGHJ6*03,466
6,IGHJ4*03,17
10,IGHJ3*01,15
```
Should probably go in partis... I just couldn't get scons to behave properly when I tried before, and gave up before I got it to work.
I think I might have some `query_strs.first` that would be better if they used `.second` as well
At the moment our emission tables are all symmetric, so I'm factorizing the drastic k-hmm changes by punting on this for the moment.
Currently we only encounter them in the imgt germlines, and just replace them with 'A'... which is clearly wrong, but also not a big deal for the time being.
Need to check whether this is bppseqgen or TreeSim misbehaving, me misbehaving, or nobody at all misbehaving.
Or rather add correct dependencies. It's kinda complicated which test should depend on which file.
A better way of saying it is that the way it's set up now, you can't determine what needs to be run just based on what code has changed. Which, you know, may be a pretty good approximation given how complicated everything is.