HTS-SIP data simulation


Introduction

HTSSIP has some functionality for simulating basic HTS-SIP datasets. With that said, I recommend using a more sophisticated simulation toolset such as SIPSim for applications other than simple testing of HTS-SIP data analysis functions.

HTSSIP relies heavily on the great R package coenocliner. See this tutorial for a short and simple introduction.

Simulating a HTS-SIP dataset

In this vignette, we’re going to simulate gradient fraction communities for 6 gradients, with the basic experimental setup as follows:

  • Treatments: 13C-glucose vs 12C-control
  • Treatment replicates: 3 (each)
  • Total gradients: 6
  • Fractions per gradient: 24

First, let’s load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

OK, let’s set the parameters needed for community simulations. We are basically going to follow the coenocliner tutorial, but instead of a transect along an environmental gradient, we are simulating communities in each fraction of a density gradient.

# setting parameters for tests
set.seed(318)                              # reproduciblility
M = 6                                      # number of OTUs (species)
ming = 1.67                                # gradient minimum...
maxg = 1.78                                # ...and maximum
nfrac = 24                                 # number of gradient fractions
locs = seq(ming, maxg, length=nfrac)       # Fraction BD's
tol  = rep(0.005, M)                       # species tolerances
h    = ceiling(rlnorm(M, meanlog=11))      # max abundances

opt = rnorm(M, mean=1.71, sd=0.008)        # species optima (drawn from a normal dist.)
params = cbind(opt=opt, tol=tol, h=h)      # put in a matrix

With the current parameters, we can simulate the gradient fraction communities for 1 density gradient:

df_OTU = gradient_sim(locs, params)
df_OTU
##    OTU.1 OTU.2 OTU.3 OTU.4 OTU.5  OTU.6 Buoyant_density Fraction
## 1      0     0     0     0     0      0        1.670000        1
## 2      0     0     0     0     0      0        1.674783        2
## 3      0     0     5     0     0      0        1.679565        3
## 4      0     0   131     0     0      0        1.684348        4
## 5      0     3  2074     4    14     22        1.689130        5
## 6      0   134 13016   131   491    727        1.693913        6
## 7     15  1891 32630  1177  6248   9066        1.698696        7
## 8    408 11003 33250  4449 31193  48158        1.703478        8
## 9   3665 25564 13439  6521 62709 100021        1.708261        9
## 10 12256 23561  2218  3901 50332  83169        1.713043       10
## 11 16510  8423   151   948 16405  28235        1.717826       11
## 12  8946  1236     3   108  2052   3800        1.722609       12
## 13  1953    73     0     3   102    199        1.727391       13
## 14   163     1     0     0     4      5        1.732174       14
## 15     5     0     0     0     0      0        1.736957       15
## 16     0     0     0     0     0      0        1.741739       16
## 17     0     0     0     0     0      0        1.746522       17
## 18     0     0     0     0     0      0        1.751304       18
## 19     0     0     0     0     0      0        1.756087       19
## 20     0     0     0     0     0      0        1.760870       20
## 21     0     0     0     0     0      0        1.765652       21
## 22     0     0     0     0     0      0        1.770435       22
## 23     0     0     0     0     0      0        1.775217       23
## 24     0     0     0     0     0      0        1.780000       24

As you can see, the abundance distribution of each OTU is approximately Gaussian, with varying optima among OTUs.

Simulating the full SIP experiment

Let’s make the communities of all gradient fractions for all gradients.

If all OTUs in the 13C-treatment incorporated labeled isotope, then their abundance distributions should be shifted to ‘heavier’ buoyant densities. Let’s set the 13C-treatment gradients to have a higher mean species optima. For kicks, let’s also increase the species optima variance (representing more variable isotope incorporation percentages).

opt1 = rnorm(M, mean=1.7, sd=0.005)      # species optima
params1 = cbind(opt=opt1, tol=tol, h=h)  # put in a matrix
opt2 = rnorm(M, mean=1.7, sd=0.005)     
params2 = cbind(opt=opt2, tol=tol, h=h) 
opt3 = rnorm(M, mean=1.7, sd=0.005)     
params3 = cbind(opt=opt3, tol=tol, h=h) 
opt4 = rnorm(M, mean=1.72, sd=0.008)    
params4 = cbind(opt=opt4, tol=tol, h=h) 
opt5 = rnorm(M, mean=1.72, sd=0.008)    
params5 = cbind(opt=opt5, tol=tol, h=h) 
opt6 = rnorm(M, mean=1.72, sd=0.008)    
params6 = cbind(opt=opt6, tol=tol, h=h) 

# we need a named list of parameters for the next step. The names in the list will be used as sample IDs
params_all = list(
  '12C-Con_rep1' = params1,
  '12C-Con_rep2' = params2,
  '12C-Con_rep3' = params3,
  '13C-Glu_rep1' = params4,
  '13C-Glu_rep2' = params5,
  '13C-Glu_rep3' = params6
)

Additional sample metadata

Additional sample metadata can be added to the resulting phyloseq object that we are going to simulate. To add metadata, just make a data.frame object with a Gradient column, which will be used for matching to the simulated samples. The Gradient column values should match the names in the parameter list.

meta = data.frame(
  'Gradient' = c('12C-Con_rep1', '12C-Con_rep2', '12C-Con_rep3',
                 '13C-Glu_rep1', '13C-Glu_rep2', '13C-Glu_rep3'),
  'Treatment' = c(rep('12C-Con', 3), rep('13C-Glu', 3)),
  'Replicate' = c(1:3, 1:3)
)

# do the names match?
all(meta$Gradient %in% names(params_all))
## [1] TRUE

OK. Let’s make the phyloseq object with the parameters that we specified above.

## physeq object
physeq_rep3 = HTSSIP_sim(locs, params_all, meta=meta)
physeq_rep3
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 6 taxa and 144 samples ]
## sample_data() Sample Data:       [ 144 samples by 5 sample variables ]

How does the sample_data table look?

phyloseq::sample_data(physeq_rep3) %>% head
##                             Gradient Buoyant_density Fraction Treatment
## 12C-Con_rep1_1.668185_1 12C-Con_rep1        1.668185        1   12C-Con
## 12C-Con_rep1_1.680254_2 12C-Con_rep1        1.680254        2   12C-Con
## 12C-Con_rep1_1.679431_3 12C-Con_rep1        1.679431        3   12C-Con
## 12C-Con_rep1_1.682305_4 12C-Con_rep1        1.682305        4   12C-Con
## 12C-Con_rep1_1.689408_5 12C-Con_rep1        1.689408        5   12C-Con
## 12C-Con_rep1_1.691388_6 12C-Con_rep1        1.691388        6   12C-Con
##                         Replicate
## 12C-Con_rep1_1.668185_1         1
## 12C-Con_rep1_1.680254_2         1
## 12C-Con_rep1_1.679431_3         1
## 12C-Con_rep1_1.682305_4         1
## 12C-Con_rep1_1.689408_5         1
## 12C-Con_rep1_1.691388_6         1

Simulating qPCR data

The q-SIP analysis requires qPCR measurements of gene copies for each gradient fraction. We can simulate this data for the HTS-SIP dataset phyloseq object that we just created.

The qPCR value simulation function is fairly flexible. For input, functions are provided that define how qPCR values (averages & variability) relate to buoyant density. For example, you can set the ‘peak’ of qPCR values to be located at a BD of 1.7 for unlabeled gradient samples and a ‘peak’ at a BD of 1.75 for labeled samples. For this example, let’s set the error among replicate qPCR reactions (technical replicates) to scale with the mean qPCR values for that gradient sample.

# unlabeled control qPCR values
control_mean_fun = function(x) dnorm(x, mean=1.70, sd=0.01) * 1e8
# error will scale with the mean 
control_sd_fun = function(x) control_mean_fun(x) / 3
# labeled treatment values
treat_mean_fun = function(x) dnorm(x, mean=1.75, sd=0.01) * 1e8
# error will scale with the mean 
treat_sd_fun = function(x) treat_mean_fun(x) / 3

OK. Let’s simulate the qPCR data.

physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
                control_expr='Treatment=="12C-Con"',
                control_mean_fun=control_mean_fun,
                control_sd_fun=control_sd_fun,
                treat_mean_fun=treat_mean_fun,
                treat_sd_fun=treat_sd_fun)

physeq_rep3_qPCR %>% names
## [1] "raw"     "summary"

The output is a list containing:

  • ‘raw’ data
    • qPCR values for each PCR reaction
  • ‘summary’ data
    • qPCR mean values (& standard deviations) for each gradient fraction sample.
physeq_rep3_qPCR$raw %>% head
##   Buoyant_density IS_CONTROL qPCR_tech_rep1 qPCR_tech_rep2 qPCR_tech_rep3
## 1        1.668185       TRUE       38380656       37801138       22008529
## 2        1.680254       TRUE      560887106      659743071      354651067
## 3        1.679431       TRUE      618877197      770165585      538409720
## 4        1.682305       TRUE      554753523     1190798381     1037082632
## 5        1.689408       TRUE     3312072961     1398862657      707835246
## 6        1.691388       TRUE     1515455205     3901629542     2667702680
##                    Sample     Gradient Fraction Treatment Replicate
## 1 12C-Con_rep1_1.668185_1 12C-Con_rep1        1   12C-Con         1
## 2 12C-Con_rep1_1.680254_2 12C-Con_rep1        2   12C-Con         1
## 3 12C-Con_rep1_1.679431_3 12C-Con_rep1        3   12C-Con         1
## 4 12C-Con_rep1_1.682305_4 12C-Con_rep1        4   12C-Con         1
## 5 12C-Con_rep1_1.689408_5 12C-Con_rep1        5   12C-Con         1
## 6 12C-Con_rep1_1.691388_6 12C-Con_rep1        6   12C-Con         1
physeq_rep3_qPCR$summary %>% head
##   IS_CONTROL                  Sample Buoyant_density qPCR_tech_rep_mean
## 1      FALSE 13C-Glu_rep1_1.671712_2        1.671712       1.953729e-04
## 2      FALSE 13C-Glu_rep1_1.671722_1        1.671722       2.710671e-04
## 3      FALSE 13C-Glu_rep1_1.680311_3        1.680311       1.368177e-01
## 4      FALSE 13C-Glu_rep1_1.683540_4        1.683540       1.039858e+00
## 5      FALSE 13C-Glu_rep1_1.688831_5        1.688831       2.230201e+01
## 6      FALSE 13C-Glu_rep1_1.696594_7        1.696594       2.143897e+03
##   qPCR_tech_rep_sd     Gradient Fraction Treatment Replicate
## 1     5.435725e-05 13C-Glu_rep1        2   13C-Glu         1
## 2     3.818031e-05 13C-Glu_rep1        1   13C-Glu         1
## 3     6.395073e-02 13C-Glu_rep1        3   13C-Glu         1
## 4     6.074444e-01 13C-Glu_rep1        4   13C-Glu         1
## 5     1.446576e+01 13C-Glu_rep1        5   13C-Glu         1
## 6     6.352483e+02 13C-Glu_rep1        7   13C-Glu         1

Let’s plot the data.

x_lab = bquote('Buoyant density (g '* ml^-1*')')
y_lab = '16S rRNA gene copies'

ggplot(physeq_rep3_qPCR$summary, aes(Buoyant_density, qPCR_tech_rep_mean,
                      ymin=qPCR_tech_rep_mean-qPCR_tech_rep_sd,
                      ymax=qPCR_tech_rep_mean+qPCR_tech_rep_sd,
                      color=IS_CONTROL, shape=Replicate)) +
  geom_pointrange() +
  scale_color_discrete('Unlabeled\ncontrol') +
  labs(x=x_lab, y=y_lab) +
  theme_bw()

With this simulation, we made the separation between labeled and unlabeled DNA pretty easy to distinguish.

OK. Just to show the flexiblity of the qPCR value simulation function, let’s try using some other functions as input.

# using the Cauchy distribution instead of normal distributions
control_mean_fun = function(x) dcauchy(x, location=1.70, scale=0.01) * 1e8
control_sd_fun = function(x) control_mean_fun(x) / 3
treat_mean_fun = function(x) dcauchy(x, location=1.75, scale=0.01) * 1e8
treat_sd_fun = function(x) treat_mean_fun(x) / 3
# simulating qPCR values
physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
                control_expr='Treatment=="12C-Con"',
                control_mean_fun=control_mean_fun,
                control_sd_fun=control_sd_fun,
                treat_mean_fun=treat_mean_fun,
                treat_sd_fun=treat_sd_fun)

Now, how does the data look?

ggplot(physeq_rep3_qPCR$summary, aes(Buoyant_density, qPCR_tech_rep_mean,
                      ymin=qPCR_tech_rep_mean-qPCR_tech_rep_sd,
                      ymax=qPCR_tech_rep_mean+qPCR_tech_rep_sd,
                      color=IS_CONTROL, shape=Replicate)) +
  geom_pointrange() +
  scale_color_discrete('Unlabeled\ncontrol') +
  labs(x=x_lab, y=y_lab) +
  theme_bw()

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1    ggplot2_3.5.1   tidyr_1.3.1    
## [5] dplyr_1.1.4     rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.49               bslib_0.8.0            
##  [4] rhdf5_2.51.0            Biobase_2.67.0          lattice_0.22-6         
##  [7] rhdf5filters_1.19.0     vctrs_0.6.5             tools_4.4.2            
## [10] generics_0.1.3          biomformat_1.35.0       stats4_4.4.2           
## [13] parallel_4.4.2          tibble_3.2.1            fansi_1.0.6            
## [16] cluster_2.1.6           pkgconfig_2.0.3         Matrix_1.7-1           
## [19] data.table_1.16.2       S4Vectors_0.45.2        lifecycle_1.0.4        
## [22] GenomeInfoDbData_1.2.13 farver_2.1.2            compiler_4.4.2         
## [25] stringr_1.5.1           Biostrings_2.75.1       munsell_0.5.1          
## [28] coenocliner_0.2-3       codetools_0.2-20        permute_0.9-7          
## [31] GenomeInfoDb_1.43.0     htmltools_0.5.8.1       sys_3.4.3              
## [34] buildtools_1.0.0        sass_0.4.9              lazyeval_0.2.2         
## [37] yaml_2.3.10             pillar_1.9.0            crayon_1.5.3           
## [40] jquerylib_0.1.4         MASS_7.3-61             cachem_1.1.0           
## [43] vegan_2.6-8             iterators_1.0.14        foreach_1.5.2          
## [46] nlme_3.1-166            tidyselect_1.2.1        digest_0.6.37          
## [49] stringi_1.8.4           reshape2_1.4.4          purrr_1.0.2            
## [52] labeling_0.4.3          splines_4.4.2           maketools_1.3.1        
## [55] ade4_1.7-22             fastmap_1.2.0           grid_4.4.2             
## [58] colorspace_2.1-1        cli_3.6.3               magrittr_2.0.3         
## [61] survival_3.7-0          utf8_1.2.4              ape_5.8                
## [64] withr_3.0.2             scales_1.3.0            UCSC.utils_1.3.0       
## [67] XVector_0.47.0          httr_1.4.7              multtest_2.63.0        
## [70] igraph_2.1.1            evaluate_1.0.1          knitr_1.49             
## [73] IRanges_2.41.1          mgcv_1.9-1              rlang_1.1.4            
## [76] Rcpp_1.0.13-1           glue_1.8.0              BiocGenerics_0.53.3    
## [79] jsonlite_1.8.9          Rhdf5lib_1.29.0         R6_2.5.1               
## [82] plyr_1.8.9              zlibbioc_1.52.0