psathyrella / partis
B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
License: GNU General Public License v3.0
`scons test` runs one monolithic block of code, so a failure partway through requires rerunning the whole thing. At some point it would be nice to break this into several targets so that building is incremental.
Currently, if bppseqgen decides to mutate a base within the conserved cysteine or tryptophan, we just revert that base to germline. This of course will bias the tree, so long term we should do something different.
Note that we should also at some point disallow stop codons.
So we need to be able to specify alternative germline files in Connor's ighutil/vdjalign package. Erick volunteered, ha ha, right?
The main package is here
As it says in the readme, you install with

```
make -C clj
pip install ./python  # I would add --user
```

The way it's set up now, it always loads germline versions from files in this directory, and options such as `--j-subset` specify the file suffix, e.g. `--j-subset adaptive` tells it to load `ighj-adaptive.fasta`. One problem with this is that imgt seems to quite enjoy regularly changing both the name corresponding to a given sequence and the sequence corresponding to a given name. So ideally I could specify something like `--germline-dir`, and it would always look for `igh[vdj].fasta` in that directory.
The code that actually loads the file is here. The farthest I got was figuring out that if I add this line in `_handle()`:

```python
return open('/home/dralph/work/partis/data/' + file_name)
```

it properly looks in that directory. But I need to do it on the command line!
I can of course ask Connor as well, or do it myself, but you sounded stoked on it...
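In case it's useful, here's a rough sketch of what a `--germline-dir` option could look like. Everything here (`read_fasta`, `load_germlines`, even the option name) is a made-up illustration, not vdjalign's actual API:

```python
# Hypothetical sketch: load germline V/D/J fastas from a directory given on
# the command line, instead of hard-coded file suffixes in a fixed directory.
import argparse
import os

def read_fasta(path):
    """Minimal fasta reader: returns {name: sequence}."""
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line[1:].split()[0]
                seqs[name] = ''
            elif name is not None:
                seqs[name] += line
    return seqs

def load_germlines(germline_dir):
    """Look for igh[vdj].fasta in <germline_dir>, return {region: {name: seq}}."""
    germlines = {}
    for region in 'vdj':
        fname = os.path.join(germline_dir, 'igh%s.fasta' % region)
        germlines[region] = read_fasta(fname)
    return germlines

parser = argparse.ArgumentParser()
parser.add_argument('--germline-dir', default='data/germlines')
```

The point being that swapping imgt versions would then just mean pointing `--germline-dir` at a different directory.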
So right now we infer branch lengths to use in the simulation on the fly for each dataset/person, but we're still using some GTR parameters I got from Connor that correspond to one or the other of the people from Adaptive. I think this is fine for the moment -- I don't think anything we're doing is sensitive to exactly which base goes to which other base.
Although, note, one thing that we still want to do is account for mutations being more likely from, say, A->C than A->G. This is easy to put in the HMM, but we decided putting it in simulation would be prohibitive so shelved it.
shouldn't call GetSubSeqs and then loop over seqs again.
This is tied in with #8 but on the discrete parameter estimation side.
Specifically, say we would like to infer per-site emission probabilities for a given gene that sits in a gene family (say, V3). The idea would be to use the Dirichlet-multinomial: we take a Dirichlet prior with alpha vector chosen so that its expectation equals the aggregate frequency at the higher level (in our example, the gene family V3), and then take the posterior of the multinomial with that prior.
We have the natural hierarchy of genes contained in families, which are contained in segments (V, D, and J). If we were rad, we would treat the whole thing as a hierarchical model and do inference. But that doesn't seem so practical given that we would lose conjugacy, unless there is something about the Dirichlet-multinomial-multinomial that I don't know about (quite likely, in fact).
Hoping to get some commentary from @vnminin.
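A minimal sketch of the posterior-mean calculation under that Dirichlet-multinomial setup (the function name and the `concentration` parameter are made up; only the conjugate-update arithmetic is the point):

```python
# Sketch of the Dirichlet-multinomial shrinkage described above. The prior
# alpha vector is scaled so its mean equals the aggregate (e.g. gene-family)
# frequencies; the posterior mean then interpolates between gene-level counts
# and family-level frequencies.

def posterior_emission_probs(gene_counts, family_freqs, concentration=10.0):
    """gene_counts: per-base counts at one site for one gene, e.g. {'A': 3, ...}.
    family_freqs: aggregate per-base frequencies at the family level (sum to 1).
    concentration: total prior weight (pseudo-counts) given to the family level.
    Returns the posterior-mean per-base probabilities."""
    alpha = {b: concentration * family_freqs[b] for b in 'ACGT'}
    total = sum(gene_counts.values()) + concentration
    return {b: (gene_counts.get(b, 0) + alpha[b]) / total for b in 'ACGT'}
```

With few gene-level observations this reduces to the family frequencies; with many, the counts dominate, which is exactly the shrinkage behavior we want.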
There is lots of data, and @psathyrella has done a nice job of getting empirical frequencies of affinity-matured sequences from this data for each germline gene. One approach would be to just let those be the emission probabilities in the HMM. However, as noted in our previous paper, there is a correlation between the amount of mutation in different segments, or even more clearly, there is a correlation between the level of mutation in the various sites in a gene. Thus, I would think that it makes sense to infer the amount of "evolutionary time" that has passed in the affinity maturation process and then use that to calculate emission probabilities for the various sites. Evolutionary time could be inferred just by looking at the average divergence of read segments from their germline equivalent using his pre-processing Smith-Waterman algorithm.
@vnminin -- what think you? Overkill?
It's implemented, but a significant minority of sequences do not have a single unambiguous cdr3 length, and you can't really cluster things when some of those things have two clustering values apiece.
Following 12/17 discussion with Vladimir.
The issue is that the size of clonally related groups can impact individual-level inferences. For example, if 80% of sequences come from one clonally related group, then the other 20% are more likely to follow that group's path through the HMM.
We could look at pairwise distance plots (though those do seem to be fairly mysterious sometimes, as lots of things are getting compressed down to a single number, and multiplicity gets squared). Perhaps trees are better, as long as we don't over-interpret them.
As described in conversation with @vnminin, a nice way to do model-based clustering would be to take a product across clusters, such that the likelihood of a given cluster is the marginal probability of generating the sequences contained in it via an n-HMM.
I think this is the way to go, but can see two things that need thinking about:
Pairs of sequences that are not clonally related in general take much longer to run than pairs that are related -- partly because I sum over a larger k-space, but also maybe for other reasons. This seems like kind of a waste, and it's at least possible it's avoidable.
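For concreteness, the product-across-clusters likelihood can be sketched in log space. This is a toy stand-in: `log_cluster_marginal` abstracts the real n-HMM forward probability, which this sketch does not implement:

```python
# Toy sketch of model-based clustering via a product over clusters: the
# likelihood of a partition is the product (sum, in log space) of the
# marginal probabilities of generating each cluster's sequences jointly.

def log_partition_likelihood(partition, log_cluster_marginal):
    """partition: list of clusters, each a list of sequence ids.
    log_cluster_marginal: function mapping a cluster to the log marginal
    probability of jointly generating its sequences (the n-HMM forward prob).
    Returns the log likelihood of the whole partition."""
    return sum(log_cluster_marginal(cluster) for cluster in partition)
```

Comparing two candidate partitions then just means comparing these sums, since the clusters are assumed independent given the partition.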
hm, I thought this was already here somewhere, but I don't see it.
An incomplete list of things to which I may refer as knobs:
parameter inference with few observations: if in training data we see a gene version only a few times, we can't build a meaningful hmm from it. I have tried a number of approaches, but what I do now is: if we see a version fewer than `n_obs` times (have tried with numbers like 5, 10, 30, 50), we lump in all the alleles of that gene. If this still doesn't give us `n_obs` observations, we lump in all the genes with the same subgroup number (e.g. all the IGHV3s), and if that still doesn't, we use all the IGHVs.
This has different implications for different parameters. For instance, mutation frequencies are extremely non-homogeneous, and aren't necessarily similar even among alleles. Erosion and insertion lengths certainly depend on version as well, and aren't necessarily the same among alleles or subgroups.
We know at the moment that on my training sets of ~10,000 sequences, switching `n_obs` from 5 to 50 has little effect. But we have to view this as code that someone may naively try to train on an arbitrary number of sequences, so the decision has potential import.
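The allele-to-gene-to-subgroup-to-segment fallback might look something like this sketch (the helper names and the IMGT-style name parsing are assumptions for illustration):

```python
# Sketch of the fallback hierarchy described above: if an allele has fewer
# than n_obs observations, pool counts across its gene's alleles; then across
# its subgroup (family); then across the whole segment.

def gene_of(allele):    # 'IGHV3-23*01' -> 'IGHV3-23'
    return allele.split('*')[0]

def family_of(allele):  # 'IGHV3-23*01' -> 'IGHV3'
    return gene_of(allele).split('-')[0]

def pooled_observations(allele, counts, n_obs=30):
    """counts: {allele_name: n_observations}. Returns the smallest pool of
    alleles (allele -> gene -> family -> segment) with >= n_obs observations."""
    # group_fns go from narrowest (the allele itself) to broadest (segment)
    for group_fn in (lambda a: a, gene_of, family_of, lambda a: a[:4]):
        pool = [a for a in counts if group_fn(a) == group_fn(allele)]
        if sum(counts[a] for a in pool) >= n_obs:
            return pool
    return list(counts)  # fall back to everything we have
```

Usage: `pooled_observations('IGHV3-23*01', counts)` would return just that allele if it is well observed, or progressively larger pools otherwise.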
interpolation while estimating parameters When do we want to apply some interpolation? Think of a general histogram (could be mutation frequencies by position, or erosion frequencies for a given gene...). If we have lots of information that the distribution is really spiky, then we're pretty sure a zero-bin is really supposed to be zero. But if we only have a few observations, it's more likely that a "zero bin" isn't really zero and we should interpolate. So I have a threshold: if the nearest bins with entries on either side have more than 20 entries, I don't interpolate (this bin is probably really zero), otherwise I do.
Note that since it's really bad to have zero probability of something in an HMM that could happen (since the zero absolutely totally forbids it), I always use either an EPS value (that I made up), or the bin's upper uncertainty if it's available.
Er, if that doesn't make sense, it's because it's complicated! Or because I didn't explain it very well.
the insertion model I have lovely distributions for the typical insertion lengths for each gene version, but since this is an hmm I can't (er, I haven't figured out how to) model an arbitrary distribution, so I use a geometric distribution with the proper mean. I.e., the probability of exiting the insert state is `1/<mean_length>`.
The distributions in our current data look quite like geometric distributions, and the bias in the performance plots can probably be characterized as small, but it has to be remembered that I pulled the geometric distribution ooma.
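A quick sanity check of the geometric model: with exit probability `1/<mean_length>` from the insert state, simulated insertion lengths average out to `<mean_length>`. This is a pure-python toy, not partis code:

```python
# Simulate the HMM insert state: entering it emits one base, then at each
# step we exit with probability 1/mean_length (so lengths are geometric
# on {1, 2, ...} with mean mean_length).
import random

def sample_insert_length(mean_length, rng):
    p_exit = 1.0 / mean_length
    length = 1  # entering the insert state emits one base
    while rng.random() > p_exit:  # self-loop: stay and emit another base
        length += 1
    return length
```

Note this convention assumes at least one emitted base per visit; zero-length insertions would be modeled by bypassing the state entirely.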
How many gene matches per region? We don't sum over all possible gene versions, but only over those that score highly in the smith-waterman step. How many matches should we pass from SW to the HMM? In practice 3 v matches, 5 d matches, and 2 j matches is a good compromise.
Things to keep in mind: if SW didn't flag the true gene in the top few matches, most of the time the HMM won't like that match, either, so you don't gain that much by including it. In other words you're never going to annotate that sequence correctly so you may as well not spend too much compute time working on it. Also, when we're clustering (i.e. in pairs), if the correct match for both sequences is not in the union of their best matches, it doesn't matter -- that almost without fail means that they aren't from the same recombination event, anyway, so the total probability is going to change very little if you add a gene version that is only a good match to one of the sequences.
k-space boundaries when integrating over `k_v` and `k_d`, we of course don't actually sum from zero to infinity. I currently take the most likely point in k-space from smith-waterman (i.e. the `k_v` and `k_d` from the best v-d-j match), find the limits of k-space for the top `n_best_per_region` matches, and finally expand these bounds with a fuzz factor of 2. It seems to work great, i.e. it pretty much always includes the correct k-point, but if the correct value is ever outside my bounds, I'm screwed. Note also that compute time scales with k-area.
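A sketch of the bound computation (toy code: the real version works from SW match info, not bare `(k_v, k_d)` tuples, and the fuzz factor of 2 is the value from the text):

```python
# Compute the k-space rectangle to sum over: take the k_v/k_d ranges spanned
# by the top SW matches, then pad each side with a fuzz factor.

def k_bounds(sw_matches, fuzz=2):
    """sw_matches: list of (k_v, k_d) points from the best SW matches.
    Returns ((k_v_min, k_v_max), (k_d_min, k_d_max)), clamped at zero."""
    k_vs = [kv for kv, kd in sw_matches]
    k_ds = [kd for kv, kd in sw_matches]
    return ((max(0, min(k_vs) - fuzz), max(k_vs) + fuzz),
            (max(0, min(k_ds) - fuzz), max(k_ds) + fuzz))
```

Since compute time scales with k-area, the fuzz factor is a direct speed/robustness tradeoff: bigger fuzz means fewer missed true k-points but a larger rectangle to sum over.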
SW match parameters (note that we currently use Connor's `vdjalign` for smith-waterman). There are a number of parameters one specifies, but the three which are probably most important for us are the match score, mismatch score, and gap-open penalty. The ratio of match to mismatch score is what matters, and it should in theory be set according to the mutation rate in the sequence. In practice I'm using 5:3, which seems to work well, but I just came up with that from experimenting on our existing data. I have the gap open penalty set to 1000, which effectively forbids indels within v, d, or j regions.
which germline repertoire to use? The set of germline gene versions that we make available to SW and the HMM of course has a huge effect on performance. There are, though, a lot of possibilities as to what we choose. In the long run, we want to only use versions that we see in each person's naive repertoire. Then we have to decide what set to start from -- the consensus seems to be that a huge fraction of the versions in imgt are not real, i.e. are mutated or sequencing-error'ed versions of real ones.
Well, and the ability to run on TCRs when that flag is set.
As per talk with Raphael.
From talking to folks at Keystone, it sounds like adding mouse would also be useful. This should be just a matter of downloading the imgt source.
Well, we have that ability now, but we should put in an explicit method for users to do it. The way we do it now is to infer parameters from a large data set, and then apply the hmm to that same data set. The use case for using different data sets is if you're interested in a data set that has too few sequences for reliable parameter estimation.
Among other things, we should have:
They're certainly not badly biased, but it'd be nicer if those distributions were perfectly symmetric.
Also note, SW and HMM both bias in the same direction.
Rest of plots are here: http://stoat.fhcrc.org/sharing/dralph/partis/performance/check-new-imgt/hmm-vs-sw/plots.html (need to be on hutch network)
They're super rare at this point, but it'd be better not to have them.
In accordance with the final hmm states/topologies.
note, this should happen after #13
Erick's idea -- this allows us to take into account the fact that two clonal seqs have the same inserts
As part of this, should update/check get_emission_prob() in hmmwriter.py
Can we incorporate quality scores into partis? What sorts of scores can we expect to get from Adaptive? Stay tuned..
oh, man, this'll be kind of complicated.
By boundary errors I mean when our sum over k-space kicks a warning because the maximum is on the boundary. I auto-expand this (which does a pretty good job) but should go through and be more careful about it, since especially with forward pair/k we need to be summing over all areas of k-space with significant probability density.
This will probably mean starting again to write the errors to the hmm's csv output file.
you'll need to add an option to set the random seed in recombinator
All the hip kids are using `git subtree` these days, but `submodule` is nice in its own way.
The only thing I've thought of is to use the pair hmm as a preclustering step, and just lower/loosen the single-linkage clustering cutoff to make sure we never cluster when we shouldn't, and then pass those clusters to the k-hmm.
The trouble is that I don't know that it'll be simple to come up with a principled way to decide how much to loosen the cutoff.
Currently we just print out the n-best paths for annotation... but there's loads more information at our fingertips that we can present if people would use it. For instance, if we print out the three best paths, they're typically all the same gene matches, and just wiggling around the alignment boundaries a little bit.
Possibilities:
I.e. if you're calculating all the m-subsets for a cluster of size n (>m), only calculate the single-sequence factors once for each sequence.
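The caching idea might be sketched like this (`single_seq_logprob` is a stand-in for the real per-sequence HMM score, which I'm not implementing here):

```python
# When scoring every m-subset of an n-cluster, compute each sequence's
# single-sequence factor once and reuse it across all the subsets it
# appears in, rather than recomputing it per subset.
from functools import lru_cache
from itertools import combinations

@lru_cache(maxsize=None)
def single_seq_logprob(seq):
    # stand-in: the real thing would run the forward algorithm on seq
    return -2.0 * len(seq)

def subset_scores(cluster, m):
    """Score all m-subsets of cluster as sums of cached per-sequence factors."""
    return {subset: sum(single_seq_logprob(s) for s in subset)
            for subset in combinations(sorted(cluster), m)}
```

For a cluster of size n this turns n-choose-m single-sequence computations per sequence into one, with the subset loop only doing cheap lookups and sums.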
That is, make it easy to take a large sample, divide it into chunks, and pass each chunk to a separate clusterer, and then merge the clusters. Only running on 10k queries at the moment, and memory and cpu consumption even for hamming is killing me.
i.e. reads that have v chopped off, j chopped off, yadda yadda
I.e. run simulation with lots of wildly different BCR repertoire structures, and see how the partitioning does.
We have a per-site mutation frequency, `f` (the fraction of observed sequences that have a mutation at the site), and we want to fill in the 4x4 table of pair emission probabilities in the HMM. The simplest way to do this is to make the emission probabilities a product of two factors (one for each sequence), where each factor is `f/3` (if the sequence is not germline) and `1-f` (if the sequence is germline).
| | germline | mutated | mutated | mutated |
|---|---|---|---|---|
| germline | (1-f)(1-f) | f(1-f)/3 | f(1-f)/3 | f(1-f)/3 |
| mutated | | f*f / 9 | f*f / 9 | f*f / 9 |
| mutated | | | f*f / 9 | f*f / 9 |
| mutated | | | | f*f / 9 |
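The independent-factor table can be generated programmatically, e.g.:

```python
# Build the 4x4 pair-emission table under the independence assumption above:
# each sequence contributes (1-f) if it matches germline and f/3 otherwise,
# and the pair probability is just the product of the two factors. (This is
# the version that ignores shared mutations between clonal sequences.)

def pair_emission_table(f):
    """Returns a 4x4 table indexed by (seq_a_state, seq_b_state), where
    state 0 is germline and states 1-3 are the three mutated bases."""
    factor = lambda state: (1 - f) if state == 0 else f / 3.0
    return [[factor(a) * factor(b) for b in range(4)] for a in range(4)]
```

Since each sequence's factors sum to one over its four states, the whole table sums to one, which is a handy check.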
So all we need to implement joint emission is to fill in the entries in the matrix so they take into account that if the two sequences are mutated to the same base, they're more likely to be clonally related. Except I haven't worked out a good way to do this. All the things I've tried require assumptions about branch lengths and tree topology which are not always true, so empirically they end up not being that great.
Erick and I talked about this a few months ago. If memory serves we got as far as he was actually convinced it was non-trivial, but didn't work out how to do it.
This is quite related to #8.
```
mus data/hmm_parameters ‹regularized-params› » ls
all-probs.csv                       d_gene-probs.csv      hmms                 j_gene-dj_insertion-probs.csv  mute-freqs              vd_insertion-probs.csv     v_gene-v_5p_del-probs.csv
d_gene-d_3p_del-d_5p_del-probs.csv  dj_insertion_content.csv  j_5p_del-probs.csv   j_gene-j_3p_del-probs.csv      seq_content.csv         v_gene-probs.csv
d_gene-d_5p_del-d_3p_del-probs.csv  fv_insertion-probs.csv    jf_insertion-probs.csv  j_gene-probs.csv            vd_insertion_content.csv  v_gene-v_3p_del-probs.csv
mus data/hmm_parameters ‹regularized-params› » head vd_insertion-probs.csv
vd_insertion,count
21,78
32,4
23,46
34,3
61,1
9,1844
36,3
11,1314
38,2
mus data/hmm_parameters ‹regularized-params› » head j_gene-dj_insertion-probs.csv
dj_insertion,j_gene,count
28,IGHJ3*02,1
25,IGHJ1*01,1
11,IGHJ1*01,12
7,IGHJ6*03,222
5,IGHJ1*01,34
7,IGHJ2P*01,1
1,IGHJ6*03,466
6,IGHJ4*03,17
10,IGHJ3*01,15
```
Should probably go in partis... I just couldn't get scons to behave properly when I tried before, and gave up before I got it to work.
I think I might have some `query_strs.first` that would be better if they used `.second` as well
At the moment our emission tables are all symmetric, so I'm factorizing the drastic k-hmm changes by punting on this for the moment.
Currently we only encounter them in the imgt germlines, and just replace them with 'A'... which is clearly wrong, but also not a big deal for the time being.
Need to check whether this is bppseqgen or TreeSim misbehaving, me misbehaving, or nobody at all misbehaving.
Or rather add correct dependencies. It's kinda complicated which test should depend on which file.
A better way of saying it is that the way it's set up now, you can't determine what needs to be run just based on what code has changed. Which, you know, may be a pretty good approximation given how complicated everything is.