Introduction

HR-SIP method workflow:

Remove sparsely occuring OTUs under the assumption that they are sequencing errors.
- This is assuming that DNA from and OTU is likely going to be found in multiple gradient fractions due to differences in G+C content, diffusion, etc.
- So, if an OTU is just detected in only one fraction of a gradient, we can’t tell whether it is a sequencing artifact or a real, but rare OTU.
For each OTU, use DESeq2 to detect significant enrichment in ‘heavy’ gradient fractions.
- ‘heavy’ gradients are selected ad hoc
- Multiple hypotheses are adjusted for.
- Significantly enriched OTUs are considered to have incorporated isotope (incorporators)

MW-HR-SIP method workflow:

The method is similar to HR-SIP, except instead of 1 ‘heavy’ BD window used, multiple windows are used, but the tradeoff is the higher number of multiple hypotheses that must be corrected for.
- For each OTU, the BD window with highest log2 fold change (l2fc) value is used, so the BD window can vary among OTUs.
This method has been shown to be generally more sensitive (more true positives) than HR-SIP and more specific (less false positives) than q-SIP. See the manuscript “Multiple window High Resolution Stable Isotope Probing (MW-HR-SIP)” for more information.

Dataset

First, let’s load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.

Let’s see the already parsed dataset

physeq_S2D2_l

## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

HR-SIP

Let’s set some parameters used later.

# adjusted P-value cutoff 
padj_cutoff = 0.1
# number of cores for parallel processing (increase depending on your computational hardware)
ncores = 2

One treatment-control comparison

First, we’ll just run HR-SIP on 1 treatment-control comparison. Let’s get the individual phyloseq object.

physeq = physeq_S2D2_l[[1]]
physeq

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

Let’s check that the samples belong to either a 13C-treatment or 12C-control.

physeq %>% sample_data %>% .$Substrate %>% table

## .
## 12C-Con 13C-Cel 
##      23      23

OK, we should be ready to run HR-SIP!

Note that the design parameter for HRSIP() is the experimental design parameter for calculating log2 fold change (l2fc) values with DESeq. Here, it’s used to distinguish label-treatment and unlabel-control samples.

df_l2fc = HRSIP(physeq,
                design = ~Substrate,
                padj_cutoff = padj_cutoff,
                sparsity_threshold = c(0,0.15,0.3))   # just using 3 thresholds to reduce time

## Sparsity threshold: 0 
## Density window: 1.7-1.75

## Warning: multiple methods tables found for 'union'

## Warning: multiple methods tables found for 'intersect'

## Warning: multiple methods tables found for 'setdiff'

## Sparsity threshold: 0.15 
## Density window: 1.7-1.75

## Sparsity threshold: 0.3 
## Density window: 1.7-1.75

## Sparsity threshold with the most rejected hypotheses: 0

df_l2fc %>% head(n=3)

## # A tibble: 3 × 17
##   OTU    log2FoldChange       p   padj Rank1 Rank2 Rank3 Rank4 Rank5 Rank6 Rank7
##   <chr>           <dbl>   <dbl>  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 OTU.5…         -0.133 0.745   1.00   Bact… __Pr… __De… __De… __Ni… __un… __un…
## 2 OTU.7…          1.39  0.0486  0.427  Bact… __Ac… __RB… __un… <NA>  <NA>  <NA> 
## 3 OTU.3…          1.53  0.00326 0.0975 Bact… __Ac… __DA… __un… <NA>  <NA>  <NA> 
## # ℹ 6 more variables: Rank8 <chr>, density_min <dbl>, density_max <dbl>,
## #   sparsity_threshold <dbl>, sparsity_apply <chr>, l2fc_threshold <dbl>

How many “incorporators”” (rejected hypotheses)?

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by() %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length)

## # A tibble: 1 × 1
##   n_incorp_OTUs
##           <int>
## 1            36

Let’s plot a breakdown of incorporators for each phylum.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=45, hjust=1)
    )

All treatment-control comparisons

Let’s now run HR-SIP on all treatment-control comparisons in the dataset:

# Number of comparisons
physeq_S2D2_l %>% length

## [1] 4

The function plyr::ldply() is useful (compared to lapply()) beccause it can be run in parallel and returns a data.frame object.

# Running in parallel; you may need to change the backend for your environment.
# Or you can just set .parallel=FALSE. 
doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3),  # just using 3 thresholds to reduce run time
                      .parallel=TRUE)
df_l2fc %>% head(n=3)

## eria __Proteobacteria
## 2 OTU.729      1.3853296 0.048578911 0.42708959 Bacteria  __Acidobacteria
## 3 OTU.322      1.5298155 0.003257549 0.09752411 Bacteria  __Acidobacteria
##                   Rank3                                Rank4            Rank5
## 1 __Deltaproteobacteria                  __Desulfobacterales __Nitrospinaceae
## 2                __RB25 __uncultured_Acidobacteria_bacterium             <NA>
## 3               __DA023               __uncultured_bacterium             <NA>
##          Rank6                  Rank7 Rank8 density_min density_max
## 1 __uncultured __uncultured_bacterium  <NA>         1.7        1.75
## 2         <NA>                   <NA>  <NA>         1.7        1.75
## 3         <NA>                   <NA>  <NA>         1.7        1.75
##   sparsity_threshold sparsity_apply l2fc_threshold
## 1                  0            all           0.25
## 2                  0            all           0.25
## 3                  0            all           0.25

Each specific phyloseq subset (treatment-control comparison) is delimited with the “.id” column.

df_l2fc %>% .$.id %>% unique

## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"

For clarity, let’s edit these long strings to make them more readable when plotted.

df_l2fc = df_l2fc %>%
  mutate(.id = gsub(' \\| ', '\n', .id))
df_l2fc %>% .$.id %>% unique

## [1] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')"

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: you could set one sparsity cutoff for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame

## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.

##                                                                        .id
## 1 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')
## 3   (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')
## 4   (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')
##   sparsity_threshold n_incorp_OTUs
## 1                0.3            97
## 2                0.0           189
## 3                0.0            36
## 4                0.3            65

How about a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

MW-HR-SIP

MW-HR-SIP is run very similarly to HRSIP, but it uses multiple buoyant density (BD) windows. MW-HR-SIP is performed with the HRSIP() function, but multiple BD windows are specified.

Let’s use 3 buoyant density windows (g/ml):

1.70-1.73, 1.72-1.75, 1.74-1.77

windows = data.frame(density_min=c(1.70, 1.72, 1.74), 
                     density_max=c(1.73, 1.75, 1.77))
windows

##   density_min density_max
## 1        1.70        1.73
## 2        1.72        1.75
## 3        1.74        1.77

Running HRSIP with all 3 BD windows. Let’s run this in parallel to speed things up.

You can turn off parallel processing by setting the parallel option to FALSE. Also, there’s 2 different levels that could be parallelized: either the ldply() or HRSIP(). Here, I’m running HRSIP in parallel, but it may make sense in other situations (eg., many comparisons but few density windows and/or sparsity cutoffs) to use ldply in parallel only.

doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      density_windows = windows,
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
                      .parallel = TRUE)
df_l2fc %>% head(n=3)

##   1.73
## 3         <NA>                   <NA>  <NA>         1.7        1.73
##   sparsity_threshold sparsity_apply l2fc_threshold
## 1                  0            all           0.25
## 2                  0            all           0.25
## 3                  0            all           0.25

Let’s check that we have all treatment-control comparisons.

df_l2fc %>% .$.id %>% unique

## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: one sparsity cutoff could be set for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame

## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.

##                                                                         .id
## 1 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')
## 3   (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 4   (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')
##   sparsity_threshold n_incorp_OTUs
## 1                0.3           119
## 2                0.0           245
## 3                0.0            43
## 4                0.3            85

The density windows can vary for each OTU. Let’s plot which density windows were used for the OTUs in the dataset.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(density_range = paste(density_min, density_max, sep='-')) %>% 
  group_by(.id, sparsity_threshold, density_range) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length)

## `summarise()` has grouped output by '.id', 'sparsity_threshold'. You can
## override using the `.groups` argument.

#plotting
ggplot(df_l2fc_s, aes(.id, n_incorp_OTUs, fill=density_range)) +
    geom_bar(stat='identity', position='fill') +
    labs(x='Control-treatment comparision', y='Fraction of incorporators') +
    scale_y_continuous(expand=c(0,0)) +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Different BD windows were used for different treatment-control comparisons because the amount of BD shift likely varied among taxa. For example, if a taxon incorporates 100% 13C isotope, then a very ‘heavy’ BD window may show a larger l2fc than a less ‘heavy’ BD window.

Let’s look at a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Note that the MW-HR-SIP method identifies more incorporators than the HR-SIP method (which uses just one BD window).

MW-HR-SIP detects more taxa for 2 main reasons. First, taxa vary in G+C content, so using only 1 BD window likely encompasses BD shifts for taxa of certain G+C contents (eg., ~50% G+C), but may miss other taxa with higher or lower G+C content. Second, taxa can vary in how much isotope was incorporated, which will affect where each taxon’s DNA is in the density gradient.

Session info

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1    ggplot2_3.5.1   tidyr_1.3.1    
## [5] dplyr_1.1.4     rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] ade4_1.7-22                 tidyselect_1.2.1           
##  [3] farver_2.1.2                Biostrings_2.75.1          
##  [5] fastmap_1.2.0               lazyeval_0.2.2             
##  [7] digest_0.6.37               lifecycle_1.0.4            
##  [9] cluster_2.1.6               survival_3.7-0             
## [11] magrittr_2.0.3              compiler_4.4.2             
## [13] rlang_1.1.4                 sass_0.4.9                 
## [15] tools_4.4.2                 igraph_2.1.1               
## [17] utf8_1.2.4                  yaml_2.3.10                
## [19] data.table_1.16.2           knitr_1.49                 
## [21] S4Arrays_1.7.1              labeling_0.4.3             
## [23] DelayedArray_0.33.2         plyr_1.8.9                 
## [25] abind_1.4-8                 BiocParallel_1.41.0        
## [27] withr_3.0.2                 purrr_1.0.2                
## [29] BiocGenerics_0.53.3         sys_3.4.3                  
## [31] grid_4.4.2                  stats4_4.4.2               
## [33] fansi_1.0.6                 multtest_2.63.0            
## [35] biomformat_1.35.0           colorspace_2.1-1           
## [37] Rhdf5lib_1.29.0             scales_1.3.0               
## [39] iterators_1.0.14            MASS_7.3-61                
## [41] SummarizedExperiment_1.37.0 cli_3.6.3                  
## [43] vegan_2.6-8                 crayon_1.5.3               
## [45] generics_0.1.3              httr_1.4.7                 
## [47] reshape2_1.4.4              ape_5.8                    
## [49] cachem_1.1.0                rhdf5_2.51.0               
## [51] stringr_1.5.1               zlibbioc_1.52.0            
## [53] splines_4.4.2               parallel_4.4.2             
## [55] XVector_0.47.0              matrixStats_1.4.1          
## [57] vctrs_0.6.5                 Matrix_1.7-1               
## [59] jsonlite_1.8.9              IRanges_2.41.1             
## [61] S4Vectors_0.45.2            coenocliner_0.2-3          
## [63] maketools_1.3.1             locfit_1.5-9.10            
## [65] foreach_1.5.2               jquerylib_0.1.4            
## [67] glue_1.8.0                  codetools_0.2-20           
## [69] stringi_1.8.4               gtable_0.3.6               
## [71] GenomeInfoDb_1.43.0         GenomicRanges_1.59.0       
## [73] UCSC.utils_1.3.0            munsell_0.5.1              
## [75] tibble_3.2.1                pillar_1.9.0               
## [77] htmltools_0.5.8.1           rhdf5filters_1.19.0        
## [79] GenomeInfoDbData_1.2.13     R6_2.5.1                   
## [81] doParallel_1.0.17           evaluate_1.0.1             
## [83] lattice_0.22-6              Biobase_2.67.0             
## [85] bslib_0.8.0                 Rcpp_1.0.13-1              
## [87] SparseArray_1.7.2           nlme_3.1-166               
## [89] permute_0.9-7               mgcv_1.9-1                 
## [91] DESeq2_1.47.1               xfun_0.49                  
## [93] MatrixGenerics_1.19.0       buildtools_1.0.0           
## [95] pkgconfig_2.0.3