HTSSIP is a package for analyzing high throughput sequence (HTS) data from DNA- or RNA-based stable isotope probing (SIP) experiments.
I recommend starting with this introductory vignette, then checking out the other vignettes:
The goal of HTS-SIP methods is to accurately determine which taxa (e.g., bacterial 16S rRNA OTUs) have incorporated isotopically labed substrate into biomass. By correctly identifying these ‘incorporators’, a number of fundamental questions about microbial ecology can be investigated, such as:
Note: I’m going to just refer to DNA-SIP, but most of this also applies to RNA-SIP.
In theory, if a taxon incorporates isotopically labeled substrate, then its DNA with be ‘heavier’ than if that same taxon had instead incorporated unlabeled substrate. By ‘heavier’, I mean that the DNA with have a higher buoyant density (BD) in a CsCl gradient. So, if you were to load isotopically labeled DNA into one CsCl gradient and unlabeled DNA (from the same taxon) in another CsCl gradient, then the labeled DNA with appear shifted to a heavier BD than the unlabeled DNA. The DNA should be more or less normally distributed in each gradient, so picture a Gaussian distribution that shifts from a ‘light’ BD to a ‘heavy’ BD between the control and treatment gradients. This is the basics of stable isotope probing, which is a powerful method because it can be used to detect which taxa in an entire community consumed particular substrates (e.g., cellulose).
If ‘heavy’ DNA is shifted from ‘light’ DNA, why are both a control and treatment gradient needed instead of just carefully separating out the ‘heavy’ DNA band from the ‘light’ DNA band in just one gradient? While this technique used be done in older SIP studies, the main problem is the BD of DNA is also a factor of G+C content (& some other factors). DNA with a higher G+C content has a ‘heavier’ BD. So, labeled, low G+C DNA can easily overlap with unlabeled high G+C DNA.
Let’s consider a simple DNA-SIP experiment. A basic SIP experiment could consist of incubating aliquots of soil with either 13C-glucose (treatment) or 12C-glucose (the corresponding control). The DNA is then extracted from each soil sample, and heavy DNA is separated (at least partially) from light DNA using CsCl gradient ultra centrifugation. Now, the challenge is to accurately determine the location of each taxon’s DNA in each of the 2 gradients in order to identify which taxa ‘shifted’ in the labeled treatment gradient due to isotope incorporation. Prior to the rise of high throughput sequencing (HTS) methods, this task was very difficult due to the limited throughput and taxonomic resolution of molecular fingerprinting and Sanger sequencing.
With high throughput sequencing, the taxonomic composition can be assessed at many points along a CsCl gradient by fractionating the gradient and sequencing the DNA in each fraction (note: I’m just going to focus on 16S rRNA sequencing, but fungal-ITS or metagenomic sequencing are also effective HTS-SIP methods). A large number of gradient fractions can be sequenced, which can help pinpoint the exact BD density range that each OTU occupies. For a typical HR-SIP experiment, ~20-30 fractions from each gradient are often sequenced.
Now remember, we are trying to use HTS-SIP to identify ‘BD shifts’, which indicate isotope incorporation. This is done by comparing labeled-treatment gradients to their corresponding unlabeled control gradients. Optimally, we’d be able to measure the absolute abundance of each taxon’s OTU in each gradient fraction in order to get an idea where the Gaussian distribution of that OTU’s DNA is in the gradient.
The major challenge of analyzing HTS data is that the data is compositional, in that the total number of sequences in a sample does not provide any information on the total numbers of each taxon in the sample. This is why relative abundances of OTUs are shown in most microbiome papers. Therefore, we have to try to determine the absolute abundance Gaussian distributions of each OTU in each gradient based on compositional 16S rRNA sequence data. Various HTS-SIP data analysis methods have be developed to address this challenge (e.g., HR-SIP and q-SIP). This R package was developed to easily implement on your own data, so that you can compare the results (and possible develop new methods). For more information on SIP and HTS-SIP, check out these references:
The phyloseq
R package provides some great functions for
importing and analyzing microbiome data in R. So instead of reinventing
the wheel, HTSSIP utilizes Phyloseq objects as input. See the GitHub page or the package website for more
info about the phyloseq.
The basic HTS-SIP dataset in the HTSSIP
package consists
of three CsCl 3 gradients, each with ~24 samples (gradient fractions).
There are 2 treatment gradients, which contained DNA extracted from
microcosms recieving both cellulose and glucose, but only one substrate
was 13C-labeled for each treatment. For example, the 13C-cellulose
(13C-Cel) treatment recieved 13C-cellulose and 12C-glucose. This
experimental design had the benefit of requiring only one 12C-control
that recieved 12C-cellulose and 12C-glucose.
First, let’s load some packages including HTSSIP
.
Now, let’s check out the HTSSIP phyloseq object that we will be using in the other vignettes.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 139 samples ]
## sample_data() Sample Data: [ 139 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
The “S2D2” part of the name stands for: 2 substrates assessed at 2 time points.
The dataset is from a larger SIP experiment in which labeled substrates were incubated with 15 g of soil in microcosms. The microcosms were destructively sampled at certain numbers of days from the initial substrate addition. The DNA was extracted and loaded into CsCl gradients. Importantly, each microcosm recieved the same amount of all substrates used in the experiment, but only one of the substrates was 13C labeled for each labeled treatment, while the unlabeled controls recieved all 12C unlabeled versions. This design allowed for only one 12C control microcosm per time point (instead of 1 control per labeled substrate). Approximately 24 gradient fractions from each gradient were sequenced with MiSeq 16S rRNA sequencing.
The dataset subset that we’ll be using consists of samples 6 CsCl gradients:
Important note: The dataset used in this example has been filtered such that all OTUs have over 300 reads. This is not a standard operation but has been done here to reduce the size of the dataset and the processing time.
Notes:
Here’s a look at (some) of the sample metadata. Note that
Bouyant_density
is one of the columns. This is necessary
for many of the analyses in HTSSIP}
## X.Sample primer_number fwd_barcode rev_barcode
## 12C-Con.D14.R1_F17 12C-Con.D14.R1_F17 38 CGTGAGTG TGAGTACG
## 13C-Cel.D3.R1_F21 13C-Cel.D3.R1_F21 7 GGATATCT CGAGAGTT
## 13C-Cel.D3.R1_F20 13C-Cel.D3.R1_F20 6 CGTGAGTG CGAGAGTT
## 13C-Cel.D14.R1_F12 13C-Cel.D14.R1_F12 57 ATCGTACG CGAGCGAC
## Substrate Day Microcosm_replicate Fraction Buoyant_density
## 12C-Con.D14.R1_F17 12C-Con 14 1 17 1.715475
## 13C-Cel.D3.R1_F21 13C-Cel 3 1 21 1.705640
## 13C-Cel.D3.R1_F20 13C-Cel 3 1 20 1.706186
## 13C-Cel.D14.R1_F12 13C-Cel 14 1 12 1.735145
## Sample_type
## 12C-Con.D14.R1_F17 unknown
## 13C-Cel.D3.R1_F21 unknown
## 13C-Cel.D3.R1_F20 unknown
## 13C-Cel.D14.R1_F12 unknown
To check the formatting of the phyloseq object:
# Our phyloseq object should be formatted correctly (no errors)
physeq_S2D2 = physeq_format(physeq_S2D2)
physeq_S2D2 %>% sample_data %>% .[1:4,1:10]
## X.Sample primer_number fwd_barcode rev_barcode
## 12C-Con.D14.R1_F17 12C-Con.D14.R1_F17 38 CGTGAGTG TGAGTACG
## 13C-Cel.D3.R1_F21 13C-Cel.D3.R1_F21 7 GGATATCT CGAGAGTT
## 13C-Cel.D3.R1_F20 13C-Cel.D3.R1_F20 6 CGTGAGTG CGAGAGTT
## 13C-Cel.D14.R1_F12 13C-Cel.D14.R1_F12 57 ATCGTACG CGAGCGAC
## Substrate Day Microcosm_replicate Fraction Buoyant_density
## 12C-Con.D14.R1_F17 12C-Con 14 1 17 1.715475
## 13C-Cel.D3.R1_F21 13C-Cel 3 1 21 1.705640
## 13C-Cel.D3.R1_F20 13C-Cel 3 1 20 1.706186
## 13C-Cel.D14.R1_F12 13C-Cel 14 1 12 1.735145
## Sample_type
## 12C-Con.D14.R1_F17 unknown
## 13C-Cel.D3.R1_F21 unknown
## 13C-Cel.D3.R1_F20 unknown
## 13C-Cel.D14.R1_F12 unknown
# This phyloseq object should NOT be formatted correctly
data(GlobalPatterns)
GlobalPatterns %>% sample_data %>% .[1:4,1:ncol(.)]
## X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
## CL3 CL3 ILBC_01 AACGCA TGCGTT
## CC1 CC1 ILBC_02 AACTCG CGAGTT
## SV1 SV1 ILBC_03 AACTGT ACAGTT
## M31Fcsw M31Fcsw ILBC_04 AAGAGA TCTCTT
## Barcode_full_length SampleType
## CL3 CTAGCGTGCGT Soil
## CC1 CATCGACGAGT Soil
## SV1 GTACGCACAGT Soil
## M31Fcsw TCGACATCTCT Feces
## Description
## CL3 Calhoun South Carolina Pine soil, pH 4.9
## CC1 Cedar Creek Minnesota, grassland, pH 6.1
## SV1 Sevilleta new Mexico, desert scrub, pH 8.3
## M31Fcsw M3, Day 1, fecal swab, whole body study
## <simpleError in physeq_format(GlobalPatterns): No "Buoyant_density" column found in metadata>
For most of the analyses in HTSSIP
, the gradient
fraction samples of each labeled-treatment gradient is compared to its
corresponding control. For example, the dataset that we’ve been looking
at has 4 comparisons:
Note that the same 12C-control is used for both treatments because both recieved glucose & cellulose, but only 1 substrate was isotopically labeled in each.
So to subset the data with the phyloseq
package, we can
use the prune_samples()
function to subset the dataset:
m = sample_data(physeq_S2D2)
physeq_13C.Glu_D3 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Glu' & m$Day==3), physeq_S2D2)
physeq_13C.Glu_D3
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
physeq_13C.Cel_D14 = prune_samples((m$Substrate=='12C-Con' & m$Day==3) | (m$Substrate=='13C-Cel' & m$Day==14), physeq_S2D2)
physeq_13C.Cel_D14
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
Writing a bunch of redundant code to make each subset is not a good
method to subset the data if you have a bunch of subsets to make, so
HTSSIP
includes some functions to help.
The expression used in the prune_samples()
to subset the
data can be generalized as:
Now we need the parameters for each individual
prune_samples()
call. Then, we can iteratively insert these
parameters into the expression in order create all of the subsets. This
can be accomplished with:
## Substrate Day
## 13C-Cel.D3.R1_F21 13C-Cel 3
## 13C-Cel.D14.R1_F12 13C-Cel 14
## 13C-Glu.D14.R1_F19 13C-Glu 14
## 13C-Glu.D3.R1_F27 13C-Glu 3
Let’s use the expression and parameters to subset the phyloseq object into a list of phyloseq object subsets:
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 47 samples ]
## sample_data() Sample Data: [ 47 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
With a list of phyloseq objects (1 list item per treatment-control
comparison), we can use lapply()
or a similar function to
run analyses on each subset of the dataset.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1 ggplot2_3.5.1 tidyr_1.3.1
## [5] dplyr_1.1.4 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
## [4] rhdf5_2.51.0 Biobase_2.67.0 lattice_0.22-6
## [7] rhdf5filters_1.19.0 vctrs_0.6.5 tools_4.4.2
## [10] generics_0.1.3 biomformat_1.35.0 stats4_4.4.2
## [13] parallel_4.4.2 tibble_3.2.1 fansi_1.0.6
## [16] cluster_2.1.6 pkgconfig_2.0.3 Matrix_1.7-1
## [19] data.table_1.16.2 S4Vectors_0.45.2 lifecycle_1.0.4
## [22] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.2
## [25] stringr_1.5.1 Biostrings_2.75.1 munsell_0.5.1
## [28] codetools_0.2-20 permute_0.9-7 GenomeInfoDb_1.43.0
## [31] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
## [34] sass_0.4.9 lazyeval_0.2.2 yaml_2.3.10
## [37] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
## [40] MASS_7.3-61 cachem_1.1.0 vegan_2.6-8
## [43] iterators_1.0.14 foreach_1.5.2 nlme_3.1-166
## [46] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
## [49] reshape2_1.4.4 purrr_1.0.2 labeling_0.4.3
## [52] splines_4.4.2 maketools_1.3.1 ade4_1.7-22
## [55] fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
## [58] cli_3.6.3 magrittr_2.0.3 survival_3.7-0
## [61] utf8_1.2.4 ape_5.8 withr_3.0.2
## [64] scales_1.3.0 UCSC.utils_1.3.0 XVector_0.47.0
## [67] httr_1.4.7 multtest_2.63.0 igraph_2.1.1
## [70] evaluate_1.0.1 knitr_1.49 IRanges_2.41.1
## [73] mgcv_1.9-1 rlang_1.1.4 Rcpp_1.0.13-1
## [76] glue_1.8.0 BiocGenerics_0.53.3 jsonlite_1.8.9
## [79] Rhdf5lib_1.29.0 R6_2.5.1 plyr_1.8.9
## [82] zlibbioc_1.52.0