heavy-SIP


Introduction

heavy-SIP method workflow:

Prior to the development of these HTS-SIP analysis methods, DNA- and RNA-SIP experiments that utilized Sanger or high throughput sequencing were usually analyzed with standard statistical procedures (e.g., t-tests) in order to identify incorporators. Previous work suggests that these methods generally have low sensitivity and/or high false positive rates when applied to sequence data. Here, these analysis methods will be referred to as “heavy-SIP” methods. While the work of Youngblut et al. (https://doi.org/10.3389/fmicb.2018.00570) suggests that HR-SIP analysis methods (e.g., MW-HR-SIP) should be used for processing HTS-SIP datasets, the HTSSIP R package provides heavy-SIP methods so that researchers have the option of using these methods and making their own comparisons to HR-SIP methods.

heavy-SIP is performed with the heavy_SIP() function, which consists of multiple possible tests. See ?heavy_SIP for more details. This vignette demonstrates the use of heavy_SIP().

Initialization

First, let’s load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)
# adjusted P-value cutoff 
padj_cutoff = 0.1

Unreplicated dataset

For unreplicated datasets (no experimental replicates of controls or treatments), the options for identifying incorporators are limited.

Parsing the dataset

We will be using a dataset that is already parsed. See the HTSSIP introduction vignette for a description of why dataset parsing (all treatment-control comparisons) is needed.

physeq_S2D2_l
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

One treatment-control comparison

First, we’ll just focus on 1 treatment-control comparison. Let’s get the individual phyloseq object.

physeq = physeq_S2D2_l[[1]]
physeq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

Let’s check that the samples belong to either a 13C-treatment or 12C-control.

physeq %>% sample_data %>% .$Substrate %>% table
## .
## 12C-Con 13C-Cel 
##      23      23

Since this dataset is an unreplicated comparison between treatment & control, we are just going to use the ‘binary’ method, which will call incorporators if they are present in the “heavy” gradient fractions of the treatment and not present in the “heavy” fractions of the control. Note that the “heavy” fractions are user-defined.

df_res = heavy_SIP(physeq, ex="Substrate=='12C-Con'", 
                   comparison='H-v-H', hypo_test='binary')
df_res %>% head(n=3)
##         statistic  p padj
## OTU.514         0 NA   NA
## OTU.729         0 NA   NA
## OTU.322         0 NA   NA

Since no real statistical test is performed, the “statistic” is just 0 (not an incorporator) or 1 (an incorporator). The “p” and “padj” columns are therefore “NA”.
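
The idea behind the ‘binary’ call can be sketched in base R. This is only an illustration under stated assumptions, not the heavy_SIP() implementation: the toy count matrices, the buoyant density values, and the `heavy_window` bounds are all made up for the example.

```r
# Toy data: rows are OTUs, columns are gradient fractions (hypothetical values)
bd <- c(1.69, 1.71, 1.73, 1.75)   # buoyant density of each fraction
control <- matrix(c(10, 5, 0, 0,
                     8, 6, 2, 1,
                     0, 0, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("OTU.", 1:3), NULL))
treatment <- matrix(c(9, 4, 3, 2,
                      7, 5, 3, 1,
                      0, 0, 0, 0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("OTU.", 1:3), NULL))

# Assumed, user-defined "heavy" window (buoyant density range)
heavy_window <- c(1.72, 1.76)
is_heavy <- bd >= heavy_window[1] & bd <= heavy_window[2]

# An OTU counts as "present" if it has any reads in the heavy fractions
present_control   <- rowSums(control[, is_heavy, drop = FALSE]) > 0
present_treatment <- rowSums(treatment[, is_heavy, drop = FALSE]) > 0

# Binary call: 1 = in heavy treatment fractions but not in heavy control fractions
binary_call <- as.integer(present_treatment & !present_control)
names(binary_call) <- rownames(treatment)
binary_call
```

In this toy example only OTU.1 is called an incorporator: OTU.2 is also present in the heavy control fractions, and OTU.3 is absent everywhere. Because the call depends entirely on the chosen window, a different heavy window can change the result.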

How many “incorporators”?

df_res$statistic %>% table
## .
##   0   1 
## 980  65

Replicated dataset

Experimental replicates allow us to use traditional hypothesis testing (e.g., t-tests) to determine whether OTU abundances differ significantly between treatments and controls. Note that there is a reason why more sophisticated statistical methods have been developed for assessing differentially abundant features in high throughput sequencing datasets (e.g., DESeq2, edgeR, or metagenomeSeq). The traditional methods don’t account for many challenging aspects of identifying statistically different abundances in sequence data, such as i) a high number of multiple hypotheses, ii) zero-inflation, and iii) compositional data (relative abundances; the sum-to-one constraint).

With that said, let’s try out these heavy-SIP methods on a replicated dataset, with 3 experimental replicates of the control and treatment (total gradients = 6).

physeq_rep3
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 6 taxa and 144 samples ]
## sample_data() Sample Data:       [ 144 samples by 5 sample variables ]
physeq_rep3 %>% sample_data %>% head(n=3)
##                             Gradient Buoyant_density Fraction Treatment
## 12C-Con_rep1_1.668185_1 12C-Con_rep1        1.668185        1   12C-Con
## 12C-Con_rep1_1.680254_2 12C-Con_rep1        1.680254        2   12C-Con
## 12C-Con_rep1_1.679431_3 12C-Con_rep1        1.679431        3   12C-Con
##                         Replicate
## 12C-Con_rep1_1.668185_1         1
## 12C-Con_rep1_1.680254_2         1
## 12C-Con_rep1_1.679431_3         1

t-tests

To compare “heavy” fractions in the treatment versus “heavy” fractions in the control, we will use the “H-v-H” comparison method. See ?heavy_SIP for details on other possible comparisons.

df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", 
                   comparison='H-v-H', hypo_test='t-test')
df_res %>% head(n=3)
##         statistic         p      padj
## OTU.1 -0.09563691 0.5336654 0.5641843
## OTU.2 -0.18450596 0.5641843 0.5641843
## OTU.3  1.14859160 0.2280138 0.3519149

“padj” is the p-value adjusted with the Benjamini-Hochberg method.
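
As a small worked example of the Benjamini-Hochberg adjustment, the sketch below applies base R's p.adjust() to four made-up p-values and reproduces the result by hand. The OTU names and p-values are hypothetical and are not taken from the dataset above.

```r
# Hypothetical raw p-values for four OTUs
p <- c(OTU.a = 0.01, OTU.b = 0.04, OTU.c = 0.03, OTU.d = 0.20)

# Benjamini-Hochberg adjustment from base R (stats package)
padj <- p.adjust(p, method = "BH")

# Manual version: sort from largest to smallest, scale each p-value by
# n / rank, then take a running minimum so adjusted values stay monotone
n <- length(p)
o <- order(p, decreasing = TRUE)
manual <- pmin(1, cummin(n / (n:1) * p[o]))[order(o)]

round(padj, 4)
```

The running minimum is why neighbouring OTUs can share the same adjusted value: here OTU.b and OTU.c both end up at about 0.0533, much like OTU.1 and OTU.2 share a “padj” in the t-test output.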

How many incorporators?

df_res %>%
  filter(padj < padj_cutoff) %>%
  nrow
## [1] 0

No incorporators. Obviously, the sensitivity of this method is pretty low. What’s the distribution of p-values?

df_res$p %>% summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1619  0.2180  0.2313  0.3228  0.4589  0.5642
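
To make the per-OTU testing concrete, here is a minimal base R sketch of a t-test followed by a Benjamini-Hochberg adjustment. It is not the heavy_SIP() code: the abundance values are invented, each column stands for one gradient's summarized “heavy” abundance, and the one-sided `alternative = "greater"` is an assumption (chosen because the p-values above exceed 0.5 when the statistic is negative).

```r
# Hypothetical heavy-fraction abundances: rows = OTUs, columns = replicate gradients
ctrl <- rbind(OTU.1 = c(0.10, 0.12, 0.11),
              OTU.2 = c(0.30, 0.25, 0.35))
trt  <- rbind(OTU.1 = c(0.11, 0.10, 0.13),
              OTU.2 = c(0.60, 0.55, 0.70))

# One-sided Welch t-test per OTU: is abundance greater in the treatment?
res <- t(sapply(rownames(trt), function(otu) {
  tt <- t.test(trt[otu, ], ctrl[otu, ], alternative = "greater")
  c(statistic = unname(tt$statistic), p = tt$p.value)
}))
res <- as.data.frame(res)

# Adjust for multiple testing across OTUs
res$padj <- p.adjust(res$p, method = "BH")
res
```

With only three values per group, the variance estimates are noisy, so even the clearly shifted OTU.2 needs a large difference to reach a small p-value, and the adjustment across many OTUs raises the bar further.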

Mann-Whitney U test

Does anything change when we use a nonparametric test? Here, we will use the Mann-Whitney U test (a nonparametric alternative to the t-test).

df_res = heavy_SIP(physeq_rep3, ex="Treatment=='12C-Con'", 
                   comparison='H-v-H', hypo_test='wilcox')
df_res %>% head(n=3)
##       statistic         p      padj
## OTU.1         2 0.6666667 0.6666667
## OTU.2         2 0.6666667 0.6666667
## OTU.3         4 0.1666667 0.2500000

What’s the distribution of the p-values and adjusted p-values?

df_res$p %>% summary %>% print
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1667  0.1667  0.1667  0.3333  0.5417  0.6667
df_res$padj %>% summary %>% print
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.2500  0.2500  0.3889  0.5625  0.6667

Again, no incorporators. The change in abundances must be pretty dramatic for heavy-SIP methods to identify incorporators, especially when many hypotheses are being tested.
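
One reason rank-based tests struggle here is that an exact Mann-Whitney test has a hard floor on its p-value when there are few observations per group. The sketch below is a generic base R illustration with made-up values (the helper `min_p()` is hypothetical); it does not reproduce what heavy_SIP() computes internally.

```r
# Hypothetical abundances with perfect separation between groups
control   <- c(1, 2, 3)
treatment <- c(4, 5, 6)

# Exact test (small samples, no ties): the smallest attainable p-values
p_two_sided <- wilcox.test(treatment, control)$p.value
p_greater   <- wilcox.test(treatment, control, alternative = "greater")$p.value

# Floor for a one-sided exact test with n1 versus n2 observations:
# only one of the choose(n1 + n2, n1) rank arrangements is the most extreme
min_p <- function(n1, n2) 1 / choose(n1 + n2, n1)

c(two_sided = p_two_sided, greater = p_greater, floor_3v3 = min_p(3, 3))
```

Even with perfectly separated groups, three observations per group cannot give a one-sided p-value below 1/20 = 0.05, and a two-sided p-value cannot fall below the 0.1 cutoff used in this vignette. Multiple-testing adjustment can only increase these values.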

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phyloseq_1.49.0 HTSSIP_1.4.1    ggplot2_3.5.1   tidyr_1.3.1    
## [5] dplyr_1.1.4     rmarkdown_2.28 
## 
## loaded via a namespace (and not attached):
##  [1] ade4_1.7-22                 tidyselect_1.2.1           
##  [3] farver_2.1.2                Biostrings_2.73.2          
##  [5] fastmap_1.2.0               lazyeval_0.2.2             
##  [7] digest_0.6.37               lifecycle_1.0.4            
##  [9] cluster_2.1.6               survival_3.7-0             
## [11] magrittr_2.0.3              compiler_4.4.1             
## [13] rlang_1.1.4                 sass_0.4.9                 
## [15] tools_4.4.1                 igraph_2.0.3               
## [17] utf8_1.2.4                  yaml_2.3.10                
## [19] data.table_1.16.2           knitr_1.48                 
## [21] S4Arrays_1.5.11             labeling_0.4.3             
## [23] DelayedArray_0.31.14        plyr_1.8.9                 
## [25] abind_1.4-8                 BiocParallel_1.39.0        
## [27] withr_3.0.1                 purrr_1.0.2                
## [29] BiocGenerics_0.51.3         sys_3.4.3                  
## [31] grid_4.4.1                  stats4_4.4.1               
## [33] fansi_1.0.6                 multtest_2.61.0            
## [35] biomformat_1.33.0           colorspace_2.1-1           
## [37] Rhdf5lib_1.27.0             scales_1.3.0               
## [39] iterators_1.0.14            MASS_7.3-61                
## [41] SummarizedExperiment_1.35.4 cli_3.6.3                  
## [43] vegan_2.6-8                 crayon_1.5.3               
## [45] generics_0.1.3              httr_1.4.7                 
## [47] reshape2_1.4.4              ape_5.8                    
## [49] cachem_1.1.0                rhdf5_2.49.0               
## [51] stringr_1.5.1               zlibbioc_1.51.1            
## [53] splines_4.4.1               parallel_4.4.1             
## [55] XVector_0.45.0              matrixStats_1.4.1          
## [57] vctrs_0.6.5                 Matrix_1.7-0               
## [59] jsonlite_1.8.9              IRanges_2.39.2             
## [61] S4Vectors_0.43.2            coenocliner_0.2-3          
## [63] maketools_1.3.1             locfit_1.5-9.10            
## [65] foreach_1.5.2               jquerylib_0.1.4            
## [67] glue_1.8.0                  codetools_0.2-20           
## [69] stringi_1.8.4               gtable_0.3.5               
## [71] GenomeInfoDb_1.41.2         GenomicRanges_1.57.2       
## [73] UCSC.utils_1.1.0            munsell_0.5.1              
## [75] tibble_3.2.1                pillar_1.9.0               
## [77] htmltools_0.5.8.1           rhdf5filters_1.17.0        
## [79] GenomeInfoDbData_1.2.13     R6_2.5.1                   
## [81] doParallel_1.0.17           evaluate_1.0.1             
## [83] lattice_0.22-6              Biobase_2.65.1             
## [85] highr_0.11                  bslib_0.8.0                
## [87] Rcpp_1.0.13                 SparseArray_1.5.45         
## [89] nlme_3.1-166                permute_0.9-7              
## [91] mgcv_1.9-1                  DESeq2_1.45.3              
## [93] xfun_0.48                   MatrixGenerics_1.17.0      
## [95] buildtools_1.0.0            pkgconfig_2.0.3