A novel probabilistic approach to explicitly model overdispersion and sampling zeros in 16S rRNA sequencing data by considering the temporal correlation between nearby time points using Gaussian Processes
First of all thanks a lot for sharing the STAN model, it is a great help both for understanding the paper and especially for practical use of the method you proposed. I have a couple of questions about the implementation and the use of the model, will be very grateful for your help:
In Supplemental Figure 2: Synthetic data generation you have the following caption: "The noise-free relative abundances of 36 bacterial orders are illustrated with black lines. Example sets of noisy relative abundances are illustrated with blue points. Sampled noisy relative abundances are used to generate count data."
If I understand correctly the black lines โ is the average of the posterior samples from Theta_G over all iterations. And the noisy abundances come from a random iteration. For example, in this case you have 36 bacterial orders and 80 days, so if I run the model with 1 chain and 1000 iterations, the Theta_G will have a dimension of 1000x80x36. So by taking e.g. value from iteration = 100, I will get the simulated noisy abundance values shown as blue points on this picture. Is this correct?
Let's say I want to simulate the abundance values for an OTU, which I expect will have an exponential decrease of the abundance with time. In this case I can simply generate theoretical counts by using exponential function, then use them as part of the OTU_reads matrix and then select Theta_G values for this OTU from a random iteration. Is this correct?