GePMI: A statistical model for personal intestinal microbiome identification
(Generating inter-individual similarity distribution for Personal Microbiome Identification)
When you have a set of metagenomic samples, and you want to know which samples come from the same individual. Or if you want to know whether the intra-individual metagenomic samples are still similar before and after some disturbances, it is a good choice to use GePMI.The purpose of GePMI is to identity intra-individual samples through metagenomic similarity.
It is well accepted that there is a large number of variants in the human microbiomes, which results in great differences within inter-individuals under the same condition. The principle of GePMI is based on pairwise similarities between the metagenomes of any two individuals obey a Beta distribution and that a p-value derived accordingly well characterizes whether two samples are from the same individual or not. To control the false discovery rate (FDR) in multiple testing, Benjamini and Yekutieli’s method was used to transform p–values to q-values. So GePMI can help you to determine whether the two samples are from the same individual through three thresholds: similarities of input file, GePMI p-values and GePMI q-values.
If you want to start analyzing from the initial metagenomic sequence, please jump to details.
The similarity matrix is the initial input of the method. You can generate it in many ways, but we recommend using sourmash, in a word, your input should be arranged as follows:
MRA_P1E-0 | MRA_P1E-7 | MRA_P1E-90 | ...... |
---|---|---|---|
1 | 0.2435 | 0.1969 | ...... |
0.2435 | 1 | 0.2130 | ...... |
0.1969 | 0.2130 | 1 | ...... |
...... | ...... | ...... | ...... |
The first row is the name of samples, the rest part is the corresponding similarities.
python GePMI.py -i test-100-18-10000.csv -o output
GePMI.py is the main python script for judging metagenomic samples from intra or inter-individual.
result.txt
: Identification of Test sample and Target sample from one individual under threshold
The first column shows the target samples used to generate the distribution; The following columns are test samples which are significantly similar to the target samples,That is to say, GePMI judgment this(theses) test(s) and target sample come from the same individual
A glimpse of the result.txt:
Target_sample | Significant_similar_test_sample 1 | Significant_similar_test_sample 2 | ...... |
---|---|---|---|
MRA_P1E-0 | MRA_P1E-7 | MRA_P1E-90 | |
MRA_P1E-7 | MRA_P1E-0 | MRA_P1E-90 | |
MRA_P1E-90 | No matches | ||
...... |
detail.txt
: Values(MinHash similarity, GePMI p-values, GePMI q-values) of each Test sample in Target sample inter-individual similarity distribution
Please report any problems directly to the GitHub issue tracker
Also, you can send your feedback to [email protected]
Down sample to the same size
sourmash csv file
sourmash compute -k 18 -n 10000 subject-sample.fa -o subject-sample.sig
for
subject-sample
,if you know the subject name and sample number,for example:HMP_subject1-v1.fa, please name it like this
sourmash compare -k 18 --csv test-100-18-10000.csv -o test *.sig
We suggest that you name csv output file like this
test-100-18-10000.csv, -18 is KSIZES, -10000 is NUM_HASHES used in sourmash
python GePMI.py -i input.csv -p 0.001 -q 0.01 -s 0 -o outputDir -t
- -i your input sourmash csv file
Please name it as follows:
[prefix]-[base number(millions)]-[k-mer length]-[hashes used in sourmash].csv
for eaxamle:
test-100-18-10000.csv
-
-p threshold of p-value (default 0.001)
-
-q threshold of q-value (default 0.01)
-
-s threshold of similarity (default 0)
-
-o path of output dirctory (default ./output)
-
-t save p/q value matrix or not
p_values.txt : p value of each test sample in target samples's inter-individual similarity distribution
q_values.txt : q value by Benjamini & Yekutieli method (Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency[J]. Annals of Statistics, 2001, 29(4):1165--1188.)
- Figures (-t for generating)
roc-xx-xx-xx.pdf : Receiver Operating Characteristic of known sample for PMI
prc-xx-xx-xx.pdf : Precesion-Recall Curve of known sample for PMI
fdr-xx-xx-xx.pdf : False Discovery Rate Comparing of p/q values
fdr-xx-xx-xx.d.pdf: Detail of FDR plot in range of 0 - 0.01