HTSSIP introduction


HTSSIP is a package for analyzing high throughput sequence (HTS) data from DNA- or RNA-based stable isotope probing (SIP) experiments.

I recommend starting with this introductory vignette, then checking out the other vignettes:

HTS-SIP introduction

The goal of HTS-SIP methods is to accurately determine which taxa (e.g., bacterial 16S rRNA OTUs) have incorporated isotopically labed substrate into biomass. By correctly identifying these ‘incorporators’, a number of fundamental questions about microbial ecology can be investigated, such as:

  • the microbial food web
    • e.g., Which bacteria contribute to each step of biomass breakdown in soil?
  • taxonomic conservatism of ecological roles
    • e.g., Is taxonomic relatedness a good predictor of functional similarity (similar niche)?
    • e.g., Do OTU groupings accurately delimit functionally coherent individuals?
  • competition and mutualism
    • e.g., Which microbes will (or will not) stabily coexist in a particular environment?
  • how microbial physiology can dictate biogeographic patterns
    • e.g., Is the biogeography distribution of cellulose specialists more patchy than carbon source generalists?
  • the prevalence of functional redundancy and community stability (resilience and resistance)
    • e.g., Do many microbes consume the same substrates? How can this affect stability of the microbial community and the ecosystem processes mediated by members of the community?

Background

Note: I’m going to just refer to DNA-SIP, but most of this also applies to RNA-SIP.

Stable Isotope Probing

In theory, if a taxon incorporates isotopically labeled substrate, then its DNA with be ‘heavier’ than if that same taxon had instead incorporated unlabeled substrate. By ‘heavier’, I mean that the DNA with have a higher buoyant density (BD) in a CsCl gradient. So, if you were to load isotopically labeled DNA into one CsCl gradient and unlabeled DNA (from the same taxon) in another CsCl gradient, then the labeled DNA with appear shifted to a heavier BD than the unlabeled DNA. The DNA should be more or less normally distributed in each gradient, so picture a Gaussian distribution that shifts from a ‘light’ BD to a ‘heavy’ BD between the control and treatment gradients. This is the basics of stable isotope probing, which is a powerful method because it can be used to detect which taxa in an entire community consumed particular substrates (e.g., cellulose).

If ‘heavy’ DNA is shifted from ‘light’ DNA, why are both a control and treatment gradient needed instead of just carefully separating out the ‘heavy’ DNA band from the ‘light’ DNA band in just one gradient? While this technique used be done in older SIP studies, the main problem is the BD of DNA is also a factor of G+C content (& some other factors). DNA with a higher G+C content has a ‘heavier’ BD. So, labeled, low G+C DNA can easily overlap with unlabeled high G+C DNA.

Let’s consider a simple DNA-SIP experiment. A basic SIP experiment could consist of incubating aliquots of soil with either 13C-glucose (treatment) or 12C-glucose (the corresponding control). The DNA is then extracted from each soil sample, and heavy DNA is separated (at least partially) from light DNA using CsCl gradient ultra centrifugation. Now, the challenge is to accurately determine the location of each taxon’s DNA in each of the 2 gradients in order to identify which taxa ‘shifted’ in the labeled treatment gradient due to isotope incorporation. Prior to the rise of high throughput sequencing (HTS) methods, this task was very difficult due to the limited throughput and taxonomic resolution of molecular fingerprinting and Sanger sequencing.

HTS-SIP

With high throughput sequencing, the taxonomic composition can be assessed at many points along a CsCl gradient by fractionating the gradient and sequencing the DNA in each fraction (note: I’m just going to focus on 16S rRNA sequencing, but fungal-ITS or metagenomic sequencing are also effective HTS-SIP methods). A large number of gradient fractions can be sequenced, which can help pinpoint the exact BD density range that each OTU occupies. For a typical HR-SIP experiment, ~20-30 fractions from each gradient are often sequenced.

Now remember, we are trying to use HTS-SIP to identify ‘BD shifts’, which indicate isotope incorporation. This is done by comparing labeled-treatment gradients to their corresponding unlabeled control gradients. Optimally, we’d be able to measure the absolute abundance of each taxon’s OTU in each gradient fraction in order to get an idea where the Gaussian distribution of that OTU’s DNA is in the gradient.

The major challenge of analyzing HTS data is that the data is compositional, in that the total number of sequences in a sample does not provide any information on the total numbers of each taxon in the sample. This is why relative abundances of OTUs are shown in most microbiome papers. Therefore, we have to try to determine the absolute abundance Gaussian distributions of each OTU in each gradient based on compositional 16S rRNA sequence data. Various HTS-SIP data analysis methods have be developed to address this challenge (e.g., HR-SIP and q-SIP). This R package was developed to easily implement on your own data, so that you can compare the results (and possible develop new methods). For more information on SIP and HTS-SIP, check out these references:

  • DNA SIP
    • Neufeld, Josh D., Jyotsna Vohra, Marc G. Dumont, Tillmann Lueders, Mike Manefield, Michael W. Friedrich, and J. Colin Murrell. 2007. “DNA Stable-Isotope Probing.” Nature Protocols 2 (4): 860–66.
    • Buckley, Daniel H., Varisa Huangyutitham, Shi-Fang Hsu, and Tyrrell A. Nelson. 2007. “Stable Isotope Probing with 15N Achieved by Disentangling the Effects of Genome G+C Content and Isotope Enrichment on DNA Density.” Applied and Environmental Microbiology 73 (10): 3189–95.
    • Youngblut, Nicholas D. and Daniel H. Buckley. 2014. “Intra-Genomic Variation in G + C Content and Its Implications for DNA Stable Isotope Probing.” Environmental Microbiology Reports 6 (6): 767–75.
  • HR-SIP
    • Pepe-Ranney, Charles, Chantal Koechli, Ruth Potrafka, Cheryl Andam, Erin Eggleston, Ferran Garcia-Pichel, and Daniel H. Buckley. 2015. “Non-Cyanobacterial Diazotrophs Mediate Dinitrogen Fixation in Biological Soil Crusts during Early Crust Formation.” The ISME Journal.
    • Pepe-Ranney, Charles, Ashley N. Campbell, Chantal N. Koechli, Sean Berthrong, and Daniel H. Buckley. 2016. “Unearthing the Ecology of Soil Microorganisms Using a High Resolution DNA-SIP Approach to Explore Cellulose and Xylose Metabolism in Soil.” Terrestrial Microbiology, 703.
  • MW-HR-SIP
    • Youngblut, Nicholas D., Samuel E. Barnett, and Daniel H. Buckley. 2018. “SIPSim: A Modeling Toolkit to Predict Accuracy and Aid Design of DNA-SIP Experiments”. Frontiers in Microbiology, 9: 570.
  • qSIP
    • Hungate, Bruce A., Rebecca L. Mau, Egbert Schwartz, J. Gregory Caporaso, Paul Dijkstra, Natasja van Gestel, Benjamin J. Koch, et al. 2015. “Quantitative Microbial Ecology Through Stable Isotope Probing.” Applied and Environmental Microbiology, August, AEM.02280–15.

HTSSIP package data

The phyloseq R package provides some great functions for importing and analyzing microbiome data in R. So instead of reinventing the wheel, HTSSIP utilizes Phyloseq objects as input. See the GitHub page or the package website for more info about the phyloseq.

The basic HTS-SIP dataset in the HTSSIP package consists of three CsCl 3 gradients, each with ~24 samples (gradient fractions). There are 2 treatment gradients, which contained DNA extracted from microcosms recieving both cellulose and glucose, but only one substrate was 13C-labeled for each treatment. For example, the 13C-cellulose (13C-Cel) treatment recieved 13C-cellulose and 12C-glucose. This experimental design had the benefit of requiring only one 12C-control that recieved 12C-cellulose and 12C-glucose.

First, let’s load some packages including HTSSIP.

library(dplyr)
library(HTSSIP)
library(phyloseq)

Now, let’s check out the HTSSIP phyloseq object that we will be using in the other vignettes.

physeq_S2D2
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 139 samples ]
## sample_data() Sample Data:       [ 139 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

The “S2D2” part of the name stands for: 2 substrates assessed at 2 time points.

Experiment overview

The dataset is from a larger SIP experiment in which labeled substrates were incubated with 15 g of soil in microcosms. The microcosms were destructively sampled at certain numbers of days from the initial substrate addition. The DNA was extracted and loaded into CsCl gradients. Importantly, each microcosm recieved the same amount of all substrates used in the experiment, but only one of the substrates was 13C labeled for each labeled treatment, while the unlabeled controls recieved all 12C unlabeled versions. This design allowed for only one 12C control microcosm per time point (instead of 1 control per labeled substrate). Approximately 24 gradient fractions from each gradient were sequenced with MiSeq 16S rRNA sequencing.

The dataset subset that we’ll be using consists of samples 6 CsCl gradients:

  1. 12C-Con, Day 3
  2. 13C-Glu, Day 3
  3. 13C-Cel, Day 3
  4. 12C-Con, Day 14
  5. 13C-Glu, Day 14
  6. 13C-Cel, Day 14

Important note: The dataset used in this example has been filtered such that all OTUs have over 300 reads. This is not a standard operation but has been done here to reduce the size of the dataset and the processing time.

Notes:

  • “12C-Con” stands for the unlabeled control
  • “13C-Glu” and “13C-Cel” stand for 13C-glucose and 13C-cellulose, respectively
  • “Day” is the number of days from substrate addition to harvest

Sample metadata

Here’s a look at (some) of the sample metadata. Note that Bouyant_density is one of the columns. This is necessary for many of the analyses in HTSSIP}

physeq_S2D2 %>% sample_data %>% .[1:4,1:10]
##                              X.Sample primer_number fwd_barcode rev_barcode
## 12C-Con.D14.R1_F17 12C-Con.D14.R1_F17            38    CGTGAGTG    TGAGTACG
## 13C-Cel.D3.R1_F21   13C-Cel.D3.R1_F21             7    GGATATCT    CGAGAGTT
## 13C-Cel.D3.R1_F20   13C-Cel.D3.R1_F20             6    CGTGAGTG    CGAGAGTT
## 13C-Cel.D14.R1_F12 13C-Cel.D14.R1_F12            57    ATCGTACG    CGAGCGAC
##                    Substrate Day Microcosm_replicate Fraction Buoyant_density
## 12C-Con.D14.R1_F17   12C-Con  14                   1       17        1.715475
## 13C-Cel.D3.R1_F21    13C-Cel   3                   1       21        1.705640
## 13C-Cel.D3.R1_F20    13C-Cel   3                   1       20        1.706186
## 13C-Cel.D14.R1_F12   13C-Cel  14                   1       12        1.735145
##                    Sample_type
## 12C-Con.D14.R1_F17     unknown
## 13C-Cel.D3.R1_F21      unknown
## 13C-Cel.D3.R1_F20      unknown
## 13C-Cel.D14.R1_F12     unknown

To check the formatting of the phyloseq object:

# Our phyloseq object should be formatted correctly (no errors)
physeq_S2D2 = physeq_format(physeq_S2D2)
physeq_S2D2 %>% sample_data %>% .[1:4,1:10]
##                              X.Sample primer_number fwd_barcode rev_barcode
## 12C-Con.D14.R1_F17 12C-Con.D14.R1_F17            38    CGTGAGTG    TGAGTACG
## 13C-Cel.D3.R1_F21   13C-Cel.D3.R1_F21             7    GGATATCT    CGAGAGTT
## 13C-Cel.D3.R1_F20   13C-Cel.D3.R1_F20             6    CGTGAGTG    CGAGAGTT
## 13C-Cel.D14.R1_F12 13C-Cel.D14.R1_F12            57    ATCGTACG    CGAGCGAC
##                    Substrate Day Microcosm_replicate Fraction Buoyant_density
## 12C-Con.D14.R1_F17   12C-Con  14                   1       17        1.715475
## 13C-Cel.D3.R1_F21    13C-Cel   3                   1       21        1.705640
## 13C-Cel.D3.R1_F20    13C-Cel   3                   1       20        1.706186
## 13C-Cel.D14.R1_F12   13C-Cel  14                   1       12        1.735145
##                    Sample_type
## 12C-Con.D14.R1_F17     unknown
## 13C-Cel.D3.R1_F21      unknown
## 13C-Cel.D3.R1_F20      unknown
## 13C-Cel.D14.R1_F12     unknown
# This phyloseq object should NOT be formatted correctly 
data(GlobalPatterns)
GlobalPatterns %>% sample_data %>% .[1:4,1:ncol(.)]
##         X.SampleID  Primer Final_Barcode Barcode_truncated_plus_T
## CL3            CL3 ILBC_01        AACGCA                   TGCGTT
## CC1            CC1 ILBC_02        AACTCG                   CGAGTT
## SV1            SV1 ILBC_03        AACTGT                   ACAGTT
## M31Fcsw    M31Fcsw ILBC_04        AAGAGA                   TCTCTT
##         Barcode_full_length SampleType
## CL3             CTAGCGTGCGT       Soil
## CC1             CATCGACGAGT       Soil
## SV1             GTACGCACAGT       Soil
## M31Fcsw         TCGACATCTCT      Feces
##                                        Description
## CL3       Calhoun South Carolina Pine soil, pH 4.9
## CC1       Cedar Creek Minnesota, grassland, pH 6.1
## SV1     Sevilleta new Mexico, desert scrub, pH 8.3
## M31Fcsw    M3, Day 1, fecal swab, whole body study
tryCatch(
  physeq_format(GlobalPatterns),
  error = function(e) e
  )
## <simpleError in physeq_format(GlobalPatterns): No "Buoyant_density" column found in metadata>

Parsing the dataset

For most of the analyses in HTSSIP, the gradient fraction samples of each labeled-treatment gradient is compared to its corresponding control. For example, the dataset that we’ve been looking at has 4 comparisons:

  1. 12C-Con_Day3 vs 13C-Glu_Day3
  2. 12C-Con_Day3 vs 13C-Cel_Day3
  3. 12C-Con_Day14 vs 13C-Glu_Day14
  4. 12C-Con_Day14 vs 13C-Cel_Day14

Note that the same 12C-control is used for both treatments because both recieved glucose & cellulose, but only 1 substrate was isotopically labeled in each.

So to subset the data with the phyloseq package, we can use the prune_samples() function to subset the dataset:

m = sample_data(physeq_S2D2)

physeq_13C.Glu_D3 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Glu' & m$Day==3), physeq_S2D2)
physeq_13C.Glu_D3
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
physeq_13C.Cel_D14 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Cel' & m$Day==14), physeq_S2D2)
physeq_13C.Cel_D14
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

Writing a bunch of redundant code to make each subset is not a good method to subset the data if you have a bunch of subsets to make, so HTSSIP includes some functions to help.

The expression used in the prune_samples() to subset the data can be generalized as:

ex = "(Substrate=='12C-Con' & Day=='${Day}') | (Substrate=='${Substrate}' & Day == '${Day}')"

Now we need the parameters for each individual prune_samples() call. Then, we can iteratively insert these parameters into the expression in order create all of the subsets. This can be accomplished with:

params = get_treatment_params(physeq_S2D2, c('Substrate', 'Day'), "Substrate != '12C-Con'")
params
##                    Substrate Day
## 13C-Cel.D3.R1_F21    13C-Cel   3
## 13C-Cel.D14.R1_F12   13C-Cel  14
## 13C-Glu.D14.R1_F19   13C-Glu  14
## 13C-Glu.D3.R1_F27    13C-Glu   3

Let’s use the expression and parameters to subset the phyloseq object into a list of phyloseq object subsets:

physeq_S2D2_l = phyloseq_subset(physeq_S2D2, params, ex)
physeq_S2D2_l
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

With a list of phyloseq objects (1 list item per treatment-control comparison), we can use lapply() or a similar function to run analyses on each subset of the dataset.

References for methods in HTSSIP

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1    ggplot2_3.5.1   tidyr_1.3.1    
## [5] dplyr_1.1.4     rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.49               bslib_0.8.0            
##  [4] rhdf5_2.51.0            Biobase_2.67.0          lattice_0.22-6         
##  [7] rhdf5filters_1.19.0     vctrs_0.6.5             tools_4.4.2            
## [10] generics_0.1.3          biomformat_1.35.0       stats4_4.4.2           
## [13] parallel_4.4.2          tibble_3.2.1            fansi_1.0.6            
## [16] cluster_2.1.6           pkgconfig_2.0.3         Matrix_1.7-1           
## [19] data.table_1.16.2       S4Vectors_0.45.2        lifecycle_1.0.4        
## [22] GenomeInfoDbData_1.2.13 farver_2.1.2            compiler_4.4.2         
## [25] stringr_1.5.1           Biostrings_2.75.1       munsell_0.5.1          
## [28] codetools_0.2-20        permute_0.9-7           GenomeInfoDb_1.43.0    
## [31] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
## [34] sass_0.4.9              lazyeval_0.2.2          yaml_2.3.10            
## [37] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
## [40] MASS_7.3-61             cachem_1.1.0            vegan_2.6-8            
## [43] iterators_1.0.14        foreach_1.5.2           nlme_3.1-166           
## [46] tidyselect_1.2.1        digest_0.6.37           stringi_1.8.4          
## [49] reshape2_1.4.4          purrr_1.0.2             labeling_0.4.3         
## [52] splines_4.4.2           maketools_1.3.1         ade4_1.7-22            
## [55] fastmap_1.2.0           grid_4.4.2              colorspace_2.1-1       
## [58] cli_3.6.3               magrittr_2.0.3          survival_3.7-0         
## [61] utf8_1.2.4              ape_5.8                 withr_3.0.2            
## [64] scales_1.3.0            UCSC.utils_1.3.0        XVector_0.47.0         
## [67] httr_1.4.7              multtest_2.63.0         igraph_2.1.1           
## [70] evaluate_1.0.1          knitr_1.49              IRanges_2.41.1         
## [73] mgcv_1.9-1              rlang_1.1.4             Rcpp_1.0.13-1          
## [76] glue_1.8.0              BiocGenerics_0.53.3     jsonlite_1.8.9         
## [79] Rhdf5lib_1.29.0         R6_2.5.1                plyr_1.8.9             
## [82] zlibbioc_1.52.0