Hello,
Thanks so much for the help to get 221 samples successfully analysed! No rush to respond here, but share in case it's worth exploring further.
As a first start, I've taken two samples which come from the same individual a week apart (oropharynx). I thought it would be helpful to see how reproducible the results are, given I would not expect much change in many organisms (though, worth acknowledging the individual would have been in hospital between the first and second swabs, and potentially on antibiotics). Kraken RAs (GTDB bacterial database) show fair similarity (Pearson R^2 on log10 RAs=0.81), with most species within an order of magnitude between the two samples, across four orders of magnitude of RA.
38 bacterial species give results from both samples (plus 24 in sample 1 alone and 4 in sample 2 alone - interesting as sample 2 has near twice the bacterial depth of sample 1, though sample 2 shows greater diversity)
Looking at those 38 reported in both samples, I hoped to see merged strains (0.95) mostly shared between samples, especially dominant ones. I confess I'm not full sure what the 95% corresponds to - 95% ANI of the marker genes, or 95% ANI of the SNPs (the latter would make sense to me).
In aggregate, of 77 strain clusters only 10 are shared between samples. Each are from separate species with no other strain clusters. One comprises a single matching strain between both samples (Oribacterium sinus). The others have multiple strains within a single strain cluster shared between both samples.
Species which have a matching strain are have similar abundances in sample 1 to those without (unshared species median 0.32% vs shared species 0.45%), but higher abundances in sample 2 (0.42 vs 1.3%).
28 species have no strain clusters shared between samples.
To take an example with high species RA in both samples, Porphyromonas somerae has a separate cluster in both samples, each comprising three strains. The frequency distribution is remarkably similar in both samples (~15%, ~32%, ~52%), and it would have been pleasing if three pairs of strains were instead clustered.
The reference sequences are 170,712 bases and 1367 shared polymorphic sites are identified (1726 in sample 1 and 2305 in sample 2). This seems reasonable as it would correspond to less than 1% divergence in general.
A quick pairwise alignment and identity matrix for the shared loci (69=sample 1 and 71=sample 2):
str_1_71 str_2_71 str_3_71
str_1_69 0.8456474 0.9392831 0.7973665
str_2_69 0.8580834 0.9502560 0.7834674
str_3_69 0.8573519 0.9597659 0.7710315
It appears that strain_1_69 pairs with strain_2_71 (although the frequencies don't correspond; 14 vs 33%), however both strain_2_69 (31%) and strain_3_69 (55%) are closest to strain_2_71 (33%). None are closest to strain_3_71 (52%).
Assuming the strain profile should be stable between the two samples, I wonder if the issue could be phasing between genes or within genes, producing hybrid strains.
I've started to have a look by focusing on the 15 genes with 25 or more SNPs. The signals are a little confusing to me. Mean coverage (depth file) indicates a fairly wide range in depth: 32-57 in sample 1 (median 44) and 14-62 in sample 2 (median 30).
A0A069ZK39 strain_1_69 is dissimilar to all three strains in the second sample (72-89% identity), strain_2_69 is identical to str_3_71 (100% identity) and similar to strain_2_71 (96%). Strain_3_69 is dissimilar to all three (55-66% identity).
Many of the other genes have unusual patterns because the strains within sample 1 share identical sequence, e.g. for A0A134B4G4 all strains from the first sample are identical to strain_2_71, 84% identical to strain_1_71 and 52% identical to strain_3_71.
Do you have any ideas about what might be going on?
Thanks,
Andrew