First, let’s load some packages including HTSSIP
.
See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.
Let’s see the already parsed dataset
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 47 samples ]
## sample_data() Sample Data: [ 47 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
Let’s set some parameters used later.
# adjusted P-value cutoff
padj_cutoff = 0.1
# number of cores for parallel processing (increase depending on your computational hardware)
ncores = 2
First, we’ll just run HR-SIP on 1 treatment-control comparison. Let’s get the individual phyloseq object.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1072 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
Let’s check that the samples belong to either a 13C-treatment or 12C-control.
## .
## 12C-Con 13C-Cel
## 23 23
OK, we should be ready to run HR-SIP!
Note that the design
parameter for HRSIP()
is the experimental design parameter for calculating log2 fold change
(l2fc) values with DESeq. Here, it’s used to distinguish label-treatment
and unlabel-control samples.
df_l2fc = HRSIP(physeq,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3)) # just using 3 thresholds to reduce time
## Sparsity threshold: 0
## Density window: 1.7-1.75
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Sparsity threshold: 0.15
## Density window: 1.7-1.75
## Sparsity threshold: 0.3
## Density window: 1.7-1.75
## Sparsity threshold with the most rejected hypotheses: 0
## # A tibble: 3 × 17
## OTU log2FoldChange p padj Rank1 Rank2 Rank3 Rank4 Rank5 Rank6 Rank7
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 OTU.5… -0.133 0.745 1.00 Bact… __Pr… __De… __De… __Ni… __un… __un…
## 2 OTU.7… 1.39 0.0486 0.427 Bact… __Ac… __RB… __un… <NA> <NA> <NA>
## 3 OTU.3… 1.53 0.00326 0.0975 Bact… __Ac… __DA… __un… <NA> <NA> <NA>
## # ℹ 6 more variables: Rank8 <chr>, density_min <dbl>, density_max <dbl>,
## # sparsity_threshold <dbl>, sparsity_apply <chr>, l2fc_threshold <dbl>
How many “incorporators”” (rejected hypotheses)?
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by() %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length)
## # A tibble: 1 × 1
## n_incorp_OTUs
## <int>
## 1 36
Let’s plot a breakdown of incorporators for each phylum.
# summarizing
df_l2fc_s = df_l2fc %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
geom_bar(stat='identity') +
labs(x='Phylum', y='Number of incorporators') +
theme_bw() +
theme(
axis.text.x = element_text(angle=45, hjust=1)
)
Let’s now run HR-SIP on all treatment-control comparisons in the dataset:
## [1] 4
The function plyr::ldply()
is useful (compared to
lapply()
) beccause it can be run in parallel and returns a
data.frame object.
# Running in parallel; you may need to change the backend for your environment.
# Or you can just set .parallel=FALSE.
doParallel::registerDoParallel(ncores)
df_l2fc = plyr::ldply(physeq_S2D2_l,
HRSIP,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
.parallel=TRUE)
df_l2fc %>% head(n=3)
## eria __Proteobacteria
## 2 OTU.729 1.3853296 0.048578911 0.42708959 Bacteria __Acidobacteria
## 3 OTU.322 1.5298155 0.003257549 0.09752411 Bacteria __Acidobacteria
## Rank3 Rank4 Rank5
## 1 __Deltaproteobacteria __Desulfobacterales __Nitrospinaceae
## 2 __RB25 __uncultured_Acidobacteria_bacterium <NA>
## 3 __DA023 __uncultured_bacterium <NA>
## Rank6 Rank7 Rank8 density_min density_max
## 1 __uncultured __uncultured_bacterium <NA> 1.7 1.75
## 2 <NA> <NA> <NA> 1.7 1.75
## 3 <NA> <NA> <NA> 1.7 1.75
## sparsity_threshold sparsity_apply l2fc_threshold
## 1 0 all 0.25
## 2 0 all 0.25
## 3 0 all 0.25
Each specific phyloseq subset (treatment-control comparison) is delimited with the “.id” column.
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"
For clarity, let’s edit these long strings to make them more readable when plotted.
## [1] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')"
How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?
Note: you could set one sparsity cutoff for all comparisons by
setting the sparsity_cutoff
to a specific value.
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by(.id, sparsity_threshold) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
as.data.frame
## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.
## .id
## 1 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')
## 3 (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')
## 4 (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')
## sparsity_threshold n_incorp_OTUs
## 1 0.3 97
## 2 0.0 189
## 3 0.0 36
## 4 0.3 65
How about a breakdown of incorporators for each phylum in each comparision.
# summarizing
df_l2fc_s = df_l2fc %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(.id, Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.
MW-HR-SIP is run very similarly to HRSIP, but it uses multiple
buoyant density (BD) windows. MW-HR-SIP is performed with the
HRSIP()
function, but multiple BD windows are
specified.
Let’s use 3 buoyant density windows (g/ml):
1.70-1.73, 1.72-1.75, 1.74-1.77
## density_min density_max
## 1 1.70 1.73
## 2 1.72 1.75
## 3 1.74 1.77
Running HRSIP with all 3 BD windows. Let’s run this in parallel to speed things up.
You can turn off parallel processing by setting the
parallel
option to FALSE
. Also, there’s 2
different levels that could be parallelized: either the
ldply()
or HRSIP()
. Here, I’m running HRSIP in
parallel, but it may make sense in other situations (eg., many
comparisons but few density windows and/or sparsity cutoffs) to use
ldply in parallel only.
doParallel::registerDoParallel(ncores)
df_l2fc = plyr::ldply(physeq_S2D2_l,
HRSIP,
density_windows = windows,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
.parallel = TRUE)
df_l2fc %>% head(n=3)
## 1.73
## 3 <NA> <NA> <NA> 1.7 1.73
## sparsity_threshold sparsity_apply l2fc_threshold
## 1 0 all 0.25
## 2 0 all 0.25
## 3 0 all 0.25
Let’s check that we have all treatment-control comparisons.
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"
How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?
Note: one sparsity cutoff could be set for all comparisons by setting
the sparsity_cutoff
to a specific value.
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by(.id, sparsity_threshold) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
as.data.frame
## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.
## .id
## 1 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 4 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')
## sparsity_threshold n_incorp_OTUs
## 1 0.3 119
## 2 0.0 245
## 3 0.0 43
## 4 0.3 85
The density windows can vary for each OTU. Let’s plot which density windows were used for the OTUs in the dataset.
# summarizing
df_l2fc_s = df_l2fc %>%
mutate(.id = gsub(' \\| ', '\n', .id)) %>%
filter(padj < padj_cutoff) %>%
mutate(density_range = paste(density_min, density_max, sep='-')) %>%
group_by(.id, sparsity_threshold, density_range) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length)
## `summarise()` has grouped output by '.id', 'sparsity_threshold'. You can
## override using the `.groups` argument.
#plotting
ggplot(df_l2fc_s, aes(.id, n_incorp_OTUs, fill=density_range)) +
geom_bar(stat='identity', position='fill') +
labs(x='Control-treatment comparision', y='Fraction of incorporators') +
scale_y_continuous(expand=c(0,0)) +
theme_bw() +
theme(
axis.text.x = element_text(angle=55, hjust=1)
)
Different BD windows were used for different treatment-control comparisons because the amount of BD shift likely varied among taxa. For example, if a taxon incorporates 100% 13C isotope, then a very ‘heavy’ BD window may show a larger l2fc than a less ‘heavy’ BD window.
Let’s look at a breakdown of incorporators for each phylum in each comparision.
# summarizing
df_l2fc_s = df_l2fc %>%
mutate(.id = gsub(' \\| ', '\n', .id)) %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(.id, Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
## `summarise()` has grouped output by '.id'. You can override using the `.groups`
## argument.
# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
geom_bar(stat='identity') +
labs(x='Phylum', y='Number of incorporators') +
facet_wrap(~ .id, scales='free') +
theme_bw() +
theme(
axis.text.x = element_text(angle=55, hjust=1)
)
Note that the MW-HR-SIP method identifies more incorporators than the HR-SIP method (which uses just one BD window).
MW-HR-SIP detects more taxa for 2 main reasons. First, taxa vary in G+C content, so using only 1 BD window likely encompasses BD shifts for taxa of certain G+C contents (eg., ~50% G+C), but may miss other taxa with higher or lower G+C content. Second, taxa can vary in how much isotope was incorporated, which will affect where each taxon’s DNA is in the density gradient.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phyloseq_1.51.0 HTSSIP_1.4.1 ggplot2_3.5.1 tidyr_1.3.1
## [5] dplyr_1.1.4 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] ade4_1.7-22 tidyselect_1.2.1
## [3] farver_2.1.2 Biostrings_2.75.1
## [5] fastmap_1.2.0 lazyeval_0.2.2
## [7] digest_0.6.37 lifecycle_1.0.4
## [9] cluster_2.1.6 survival_3.7-0
## [11] magrittr_2.0.3 compiler_4.4.2
## [13] rlang_1.1.4 sass_0.4.9
## [15] tools_4.4.2 igraph_2.1.1
## [17] utf8_1.2.4 yaml_2.3.10
## [19] data.table_1.16.2 knitr_1.49
## [21] S4Arrays_1.7.1 labeling_0.4.3
## [23] DelayedArray_0.33.2 plyr_1.8.9
## [25] abind_1.4-8 BiocParallel_1.41.0
## [27] withr_3.0.2 purrr_1.0.2
## [29] BiocGenerics_0.53.3 sys_3.4.3
## [31] grid_4.4.2 stats4_4.4.2
## [33] fansi_1.0.6 multtest_2.63.0
## [35] biomformat_1.35.0 colorspace_2.1-1
## [37] Rhdf5lib_1.29.0 scales_1.3.0
## [39] iterators_1.0.14 MASS_7.3-61
## [41] SummarizedExperiment_1.37.0 cli_3.6.3
## [43] vegan_2.6-8 crayon_1.5.3
## [45] generics_0.1.3 httr_1.4.7
## [47] reshape2_1.4.4 ape_5.8
## [49] cachem_1.1.0 rhdf5_2.51.0
## [51] stringr_1.5.1 zlibbioc_1.52.0
## [53] splines_4.4.2 parallel_4.4.2
## [55] XVector_0.47.0 matrixStats_1.4.1
## [57] vctrs_0.6.5 Matrix_1.7-1
## [59] jsonlite_1.8.9 IRanges_2.41.1
## [61] S4Vectors_0.45.2 coenocliner_0.2-3
## [63] maketools_1.3.1 locfit_1.5-9.10
## [65] foreach_1.5.2 jquerylib_0.1.4
## [67] glue_1.8.0 codetools_0.2-20
## [69] stringi_1.8.4 gtable_0.3.6
## [71] GenomeInfoDb_1.43.0 GenomicRanges_1.59.0
## [73] UCSC.utils_1.3.0 munsell_0.5.1
## [75] tibble_3.2.1 pillar_1.9.0
## [77] htmltools_0.5.8.1 rhdf5filters_1.19.0
## [79] GenomeInfoDbData_1.2.13 R6_2.5.1
## [81] doParallel_1.0.17 evaluate_1.0.1
## [83] lattice_0.22-6 Biobase_2.67.0
## [85] bslib_0.8.0 Rcpp_1.0.13-1
## [87] SparseArray_1.7.2 nlme_3.1-166
## [89] permute_0.9-7 mgcv_1.9-1
## [91] DESeq2_1.47.1 xfun_0.49
## [93] MatrixGenerics_1.19.0 buildtools_1.0.0
## [95] pkgconfig_2.0.3