HTSSIP has some functionality for simulating basic HTS-SIP datasets. With that said, I recommend using a more sophisticated simulation toolset such as SIPSim for applications other than simple testing of HTS-SIP data analysis functions.
HTSSIP relies heavily on the great R package coenocliner. See this tutorial for a short and simple introduction.
In this vignette, we’re going to simulate gradient fraction communities for 6 gradients, with the basic experimental setup as follows:
First, let’s load some packages including HTSSIP
.
OK, let’s set the parameters needed for community simulations. We are basically going to follow the coenocliner tutorial, but instead of a transect along an environmental gradient, we are simulating communities in each fraction of a density gradient.
# setting parameters for tests
set.seed(318) # reproduciblility
M = 6 # number of OTUs (species)
ming = 1.67 # gradient minimum...
maxg = 1.78 # ...and maximum
nfrac = 24 # number of gradient fractions
locs = seq(ming, maxg, length=nfrac) # Fraction BD's
tol = rep(0.005, M) # species tolerances
h = ceiling(rlnorm(M, meanlog=11)) # max abundances
opt = rnorm(M, mean=1.71, sd=0.008) # species optima (drawn from a normal dist.)
params = cbind(opt=opt, tol=tol, h=h) # put in a matrix
With the current parameters, we can simulate the gradient fraction communities for 1 density gradient:
## OTU.1 OTU.2 OTU.3 OTU.4 OTU.5 OTU.6 Buoyant_density Fraction
## 1 0 0 0 0 0 0 1.670000 1
## 2 0 0 0 0 0 0 1.674783 2
## 3 0 0 5 0 0 0 1.679565 3
## 4 0 0 131 0 0 0 1.684348 4
## 5 0 3 2074 4 14 22 1.689130 5
## 6 0 134 13016 131 491 727 1.693913 6
## 7 15 1891 32630 1177 6248 9066 1.698696 7
## 8 408 11003 33250 4449 31193 48158 1.703478 8
## 9 3665 25564 13439 6521 62709 100021 1.708261 9
## 10 12256 23561 2218 3901 50332 83169 1.713043 10
## 11 16510 8423 151 948 16405 28235 1.717826 11
## 12 8946 1236 3 108 2052 3800 1.722609 12
## 13 1953 73 0 3 102 199 1.727391 13
## 14 163 1 0 0 4 5 1.732174 14
## 15 5 0 0 0 0 0 1.736957 15
## 16 0 0 0 0 0 0 1.741739 16
## 17 0 0 0 0 0 0 1.746522 17
## 18 0 0 0 0 0 0 1.751304 18
## 19 0 0 0 0 0 0 1.756087 19
## 20 0 0 0 0 0 0 1.760870 20
## 21 0 0 0 0 0 0 1.765652 21
## 22 0 0 0 0 0 0 1.770435 22
## 23 0 0 0 0 0 0 1.775217 23
## 24 0 0 0 0 0 0 1.780000 24
As you can see, the abundance distribution of each OTU is approximately Gaussian, with varying optima among OTUs.
Let’s make the communities of all gradient fractions for all gradients.
If all OTUs in the 13C-treatment incorporated labeled isotope, then their abundance distributions should be shifted to ‘heavier’ buoyant densities. Let’s set the 13C-treatment gradients to have a higher mean species optima. For kicks, let’s also increase the species optima variance (representing more variable isotope incorporation percentages).
opt1 = rnorm(M, mean=1.7, sd=0.005) # species optima
params1 = cbind(opt=opt1, tol=tol, h=h) # put in a matrix
opt2 = rnorm(M, mean=1.7, sd=0.005)
params2 = cbind(opt=opt2, tol=tol, h=h)
opt3 = rnorm(M, mean=1.7, sd=0.005)
params3 = cbind(opt=opt3, tol=tol, h=h)
opt4 = rnorm(M, mean=1.72, sd=0.008)
params4 = cbind(opt=opt4, tol=tol, h=h)
opt5 = rnorm(M, mean=1.72, sd=0.008)
params5 = cbind(opt=opt5, tol=tol, h=h)
opt6 = rnorm(M, mean=1.72, sd=0.008)
params6 = cbind(opt=opt6, tol=tol, h=h)
# we need a named list of parameters for the next step. The names in the list will be used as sample IDs
params_all = list(
'12C-Con_rep1' = params1,
'12C-Con_rep2' = params2,
'12C-Con_rep3' = params3,
'13C-Glu_rep1' = params4,
'13C-Glu_rep2' = params5,
'13C-Glu_rep3' = params6
)
Additional sample metadata can be added to the resulting phyloseq
object that we are going to simulate. To add metadata, just make a
data.frame object with a Gradient
column, which will be
used for matching to the simulated samples. The Gradient
column values should match the names in the parameter list.
meta = data.frame(
'Gradient' = c('12C-Con_rep1', '12C-Con_rep2', '12C-Con_rep3',
'13C-Glu_rep1', '13C-Glu_rep2', '13C-Glu_rep3'),
'Treatment' = c(rep('12C-Con', 3), rep('13C-Glu', 3)),
'Replicate' = c(1:3, 1:3)
)
# do the names match?
all(meta$Gradient %in% names(params_all))
## [1] TRUE
OK. Let’s make the phyloseq object with the parameters that we specified above.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 6 taxa and 144 samples ]
## sample_data() Sample Data: [ 144 samples by 5 sample variables ]
How does the sample_data
table look?
## Gradient Buoyant_density Fraction Treatment
## 12C-Con_rep1_1.668185_1 12C-Con_rep1 1.668185 1 12C-Con
## 12C-Con_rep1_1.680254_2 12C-Con_rep1 1.680254 2 12C-Con
## 12C-Con_rep1_1.679431_3 12C-Con_rep1 1.679431 3 12C-Con
## 12C-Con_rep1_1.682305_4 12C-Con_rep1 1.682305 4 12C-Con
## 12C-Con_rep1_1.689408_5 12C-Con_rep1 1.689408 5 12C-Con
## 12C-Con_rep1_1.691388_6 12C-Con_rep1 1.691388 6 12C-Con
## Replicate
## 12C-Con_rep1_1.668185_1 1
## 12C-Con_rep1_1.680254_2 1
## 12C-Con_rep1_1.679431_3 1
## 12C-Con_rep1_1.682305_4 1
## 12C-Con_rep1_1.689408_5 1
## 12C-Con_rep1_1.691388_6 1
The q-SIP analysis requires qPCR measurements of gene copies for each gradient fraction. We can simulate this data for the HTS-SIP dataset phyloseq object that we just created.
The qPCR value simulation function is fairly flexible. For input, functions are provided that define how qPCR values (averages & variability) relate to buoyant density. For example, you can set the ‘peak’ of qPCR values to be located at a BD of 1.7 for unlabeled gradient samples and a ‘peak’ at a BD of 1.75 for labeled samples. For this example, let’s set the error among replicate qPCR reactions (technical replicates) to scale with the mean qPCR values for that gradient sample.
# unlabeled control qPCR values
control_mean_fun = function(x) dnorm(x, mean=1.70, sd=0.01) * 1e8
# error will scale with the mean
control_sd_fun = function(x) control_mean_fun(x) / 3
# labeled treatment values
treat_mean_fun = function(x) dnorm(x, mean=1.75, sd=0.01) * 1e8
# error will scale with the mean
treat_sd_fun = function(x) treat_mean_fun(x) / 3
OK. Let’s simulate the qPCR data.
physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
control_expr='Treatment=="12C-Con"',
control_mean_fun=control_mean_fun,
control_sd_fun=control_sd_fun,
treat_mean_fun=treat_mean_fun,
treat_sd_fun=treat_sd_fun)
physeq_rep3_qPCR %>% names
## [1] "raw" "summary"
The output is a list containing:
## Buoyant_density IS_CONTROL qPCR_tech_rep1 qPCR_tech_rep2 qPCR_tech_rep3
## 1 1.668185 TRUE 38380656 37801138 22008529
## 2 1.680254 TRUE 560887106 659743071 354651067
## 3 1.679431 TRUE 618877197 770165585 538409720
## 4 1.682305 TRUE 554753523 1190798381 1037082632
## 5 1.689408 TRUE 3312072961 1398862657 707835246
## 6 1.691388 TRUE 1515455205 3901629542 2667702680
## Sample Gradient Fraction Treatment Replicate
## 1 12C-Con_rep1_1.668185_1 12C-Con_rep1 1 12C-Con 1
## 2 12C-Con_rep1_1.680254_2 12C-Con_rep1 2 12C-Con 1
## 3 12C-Con_rep1_1.679431_3 12C-Con_rep1 3 12C-Con 1
## 4 12C-Con_rep1_1.682305_4 12C-Con_rep1 4 12C-Con 1
## 5 12C-Con_rep1_1.689408_5 12C-Con_rep1 5 12C-Con 1
## 6 12C-Con_rep1_1.691388_6 12C-Con_rep1 6 12C-Con 1
## IS_CONTROL Sample Buoyant_density qPCR_tech_rep_mean
## 1 FALSE 13C-Glu_rep1_1.671712_2 1.671712 1.953729e-04
## 2 FALSE 13C-Glu_rep1_1.671722_1 1.671722 2.710671e-04
## 3 FALSE 13C-Glu_rep1_1.680311_3 1.680311 1.368177e-01
## 4 FALSE 13C-Glu_rep1_1.683540_4 1.683540 1.039858e+00
## 5 FALSE 13C-Glu_rep1_1.688831_5 1.688831 2.230201e+01
## 6 FALSE 13C-Glu_rep1_1.696594_7 1.696594 2.143897e+03
## qPCR_tech_rep_sd Gradient Fraction Treatment Replicate
## 1 5.435725e-05 13C-Glu_rep1 2 13C-Glu 1
## 2 3.818031e-05 13C-Glu_rep1 1 13C-Glu 1
## 3 6.395073e-02 13C-Glu_rep1 3 13C-Glu 1
## 4 6.074444e-01 13C-Glu_rep1 4 13C-Glu 1
## 5 1.446576e+01 13C-Glu_rep1 5 13C-Glu 1
## 6 6.352483e+02 13C-Glu_rep1 7 13C-Glu 1
Let’s plot the data.
x_lab = bquote('Buoyant density (g '* ml^-1*')')
y_lab = '16S rRNA gene copies'
ggplot(physeq_rep3_qPCR$summary, aes(Buoyant_density, qPCR_tech_rep_mean,
ymin=qPCR_tech_rep_mean-qPCR_tech_rep_sd,
ymax=qPCR_tech_rep_mean+qPCR_tech_rep_sd,
color=IS_CONTROL, shape=Replicate)) +
geom_pointrange() +
scale_color_discrete('Unlabeled\ncontrol') +
labs(x=x_lab, y=y_lab) +
theme_bw()
With this simulation, we made the separation between labeled and unlabeled DNA pretty easy to distinguish.
OK. Just to show the flexiblity of the qPCR value simulation function, let’s try using some other functions as input.
# using the Cauchy distribution instead of normal distributions
control_mean_fun = function(x) dcauchy(x, location=1.70, scale=0.01) * 1e8
control_sd_fun = function(x) control_mean_fun(x) / 3
treat_mean_fun = function(x) dcauchy(x, location=1.75, scale=0.01) * 1e8
treat_sd_fun = function(x) treat_mean_fun(x) / 3
# simulating qPCR values
physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
control_expr='Treatment=="12C-Con"',
control_mean_fun=control_mean_fun,
control_sd_fun=control_sd_fun,
treat_mean_fun=treat_mean_fun,
treat_sd_fun=treat_sd_fun)
Now, how does the data look?
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1 ggplot2_3.5.1 tidyr_1.3.1
## [5] dplyr_1.1.4 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
## [4] rhdf5_2.51.0 Biobase_2.67.0 lattice_0.22-6
## [7] rhdf5filters_1.19.0 vctrs_0.6.5 tools_4.4.2
## [10] generics_0.1.3 biomformat_1.35.0 stats4_4.4.2
## [13] parallel_4.4.2 tibble_3.2.1 fansi_1.0.6
## [16] cluster_2.1.6 pkgconfig_2.0.3 Matrix_1.7-1
## [19] data.table_1.16.2 S4Vectors_0.45.2 lifecycle_1.0.4
## [22] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.2
## [25] stringr_1.5.1 Biostrings_2.75.1 munsell_0.5.1
## [28] coenocliner_0.2-3 codetools_0.2-20 permute_0.9-7
## [31] GenomeInfoDb_1.43.0 htmltools_0.5.8.1 sys_3.4.3
## [34] buildtools_1.0.0 sass_0.4.9 lazyeval_0.2.2
## [37] yaml_2.3.10 pillar_1.9.0 crayon_1.5.3
## [40] jquerylib_0.1.4 MASS_7.3-61 cachem_1.1.0
## [43] vegan_2.6-8 iterators_1.0.14 foreach_1.5.2
## [46] nlme_3.1-166 tidyselect_1.2.1 digest_0.6.37
## [49] stringi_1.8.4 reshape2_1.4.4 purrr_1.0.2
## [52] labeling_0.4.3 splines_4.4.2 maketools_1.3.1
## [55] ade4_1.7-22 fastmap_1.2.0 grid_4.4.2
## [58] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [61] survival_3.7-0 utf8_1.2.4 ape_5.8
## [64] withr_3.0.2 scales_1.3.0 UCSC.utils_1.3.0
## [67] XVector_0.47.0 httr_1.4.7 multtest_2.63.0
## [70] igraph_2.1.1 evaluate_1.0.1 knitr_1.49
## [73] IRanges_2.41.1 mgcv_1.9-1 rlang_1.1.4
## [76] Rcpp_1.0.13-1 glue_1.8.0 BiocGenerics_0.53.3
## [79] jsonlite_1.8.9 Rhdf5lib_1.29.0 R6_2.5.1
## [82] plyr_1.8.9 zlibbioc_1.52.0