Data Wrangling

Unless noted otherwise, the examples in this vignette use the GlobalPatterns dataset from phyloseq (the set_treatment_levels example uses the soil_column dataset instead).

library(phyloseq)
data(GlobalPatterns)

conglomerate_samples

Merges samples within a phyloseq-class object which match on the given criteria (treatment). Any sample_data factors that do not match will be set to NA. otu_table counts will be reassigned as the mean of all the samples that are merged together.

Use this with caution as replicate samples may be crucial to the experimental design and should be proven statistically to be similar enough to combine for downstream analysis.
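The averaging step can be pictured with a small base R sketch. It does not call conglomerate_samples itself; the toy matrix `otu`, the `sample_type` labels, and the use of rowsum() are assumptions chosen purely for illustration.

```r
# Toy OTU matrix: taxa in rows, samples in columns (hypothetical data).
otu <- matrix(
  c(10, 20, 0, 4,
     2,  6, 8, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3", "S4"))
)
# Hypothetical treatment label for each sample, in column order.
sample_type <- c("Soil", "Soil", "Feces", "Feces")

# rowsum() sums rows by group, so transpose to put samples in rows,
# then divide each group's sums by its number of samples to get means.
group_sums  <- rowsum(t(otu), group = sample_type)
group_sizes <- as.vector(table(sample_type)[rownames(group_sums)])
group_means <- t(group_sums / group_sizes)

group_means
```

Because the merged counts are means, the per-group totals are generally not whole numbers, which is consistent with the fractional sample_sums shown in the example output.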

Usage

conglomerate_samples(phyloseq_obj, treatment, subset = NULL)

Arguments

Call	Description
`phyloseq_obj`	A `phyloseq-class` object.
`treatment`	Column name as a `string`, or a `vector` of column names, in the `sample_data`.
`subset`	A factor within the `treatment`. This will remove any samples that do not contain this factor. This can be a `vector` of multiple factors to subset on.

Examples

phyloseq::sample_sums(GlobalPatterns)

##      CL3      CC1      SV1  M31Fcsw  M11Fcsw  M31Plmr  M11Plmr  F21Plmr 
##   864077  1135457   697509  1543451  2076476   718943   433894   186297 
##  M31Tong  M11Tong LMEpi24M SLEpi20M   AQC1cm   AQC4cm   AQC7cm      NP2 
##  2000402   100187  2117592  1217312  1167748  2357181  1699293   523634 
##      NP3      NP5  TRRsed1  TRRsed2  TRRsed3     TS28     TS29    Even1 
##  1478965  1652754    58688   493126   279704   937466  1211071  1216137 
##    Even2    Even3 
##   971073  1078241

conglomerated <- conglomerate_samples(GlobalPatterns, treatment = 'SampleType')
phyloseq::sample_sums(conglomerated)

##               Soil              Feces               Skin             Tongue 
##           899014.3          1442116.0           446378.0          1050294.5 
##         Freshwater Freshwater (creek)              Ocean Sediment (estuary) 
##          1667452.0          1741407.3          1218451.0           277172.7 
##               Mock 
##          1088483.7

conglomerate_taxa

A rewrite of phyloseq::tax_glom(). This iteration runs faster because it is implemented with data.table.

Usage

conglomerate_taxa(phyloseq_obj, classification, hierarchical = TRUE)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`classification`	Column name as a `string` in the `tax_table` for the factor to conglomerate by.
`hierarchical`	Whether the order of factors in the `tax_table` represents a decreasing hierarchy (`TRUE`) or the factors are independent (`FALSE`). If `FALSE`, only the factor given by `classification` is returned.

Examples

conglomerate_taxa(GlobalPatterns, classification = 'Phylum', hierarchical = TRUE)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 67 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 67 taxa by 2 taxonomic ranks ]
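As a rough picture of what conglomerating at a rank involves, the base R sketch below sums the counts of all OTUs that share a Phylum. It is not the package's data.table implementation; the toy `otu` and `tax` objects are hypothetical, and summing (as phyloseq::tax_glom() does) is assumed here.

```r
# Toy OTU matrix and taxonomy table (hypothetical data).
otu <- matrix(
  c(5, 1,
    3, 0,
    2, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2"))
)
tax <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Firmicutes", "Proteobacteria"),
  Genus   = c("Bacillus", "Clostridium", "Escherichia"),
  row.names = rownames(otu)
)

# Sum the counts of all OTUs that share the same Phylum.
phylum_counts <- rowsum(otu, group = tax$Phylum)

# With a hierarchical taxonomy, only the ranks down to the chosen one
# remain meaningful, so keep one row per Kingdom/Phylum combination.
phylum_tax <- unique(tax[, c("Kingdom", "Phylum")])

phylum_counts
phylum_tax
```

In this sketch the ranks below Phylum are dropped, which mirrors the 2 taxonomic ranks reported in the output above.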

melt_phyloseq

Converts the otu_table, tax_table, and sam_data to a 2-dimensional data.table.

Usage

melt_phyloseq(phyloseq_obj)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.

Examples

melt_phyloseq(GlobalPatterns)

## Warning in `[.data.table`(sample_data, , `:=`(Sample, NULL)): Column 'Sample'
## does not exist to remove
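The shape of a melted table can be sketched in base R. This is only an illustration of the long format, not the function's actual implementation; the toy objects and the column names `OTU`, `Sample`, and `Abundance` are assumptions.

```r
# Toy inputs (hypothetical data).
otu <- matrix(
  c(5, 1,
    3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2"), c("S1", "S2"))
)
tax <- data.frame(OTU = c("OTU1", "OTU2"),
                  Phylum = c("Firmicutes", "Proteobacteria"))
sam <- data.frame(Sample = c("S1", "S2"),
                  SampleType = c("Soil", "Feces"))

# as.table() followed by as.data.frame() yields one row per OTU/sample pair.
long <- as.data.frame(as.table(otu), stringsAsFactors = FALSE)
names(long) <- c("OTU", "Sample", "Abundance")

# Attach the taxonomy and the sample metadata to every row.
long <- merge(long, tax, by = "OTU")
long <- merge(long, sam, by = "Sample")

long
```

Each row of the result describes one taxon in one sample, which is the layout that ggplot2 and data.table grouping operations generally expect.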

merge_treatments

Combines multiple columns from the sample-data into a single column. Doing this can make it easier to subset and look at the data on multiple factors.

Usage

merge_treatments(phyloseq_obj, ...)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object. It must contain sample_data() with information about each sample.
`...`	Column names as `string`s, or a `vector` of column names, in the `sample_data` that are to be combined.

Examples

merge_treatments(GlobalPatterns, c('Final_Barcode', 'Barcode_truncated_plus_T'))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
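The effect on the sample data can be sketched with base R. The toy `sam` table, the underscore separator, and the name of the new column are assumptions for illustration; the package may name and format the merged column differently.

```r
# Toy sample data with two metadata columns (hypothetical data).
sam <- data.frame(
  Site = c("A", "A", "B"),
  Day  = c("0", "10", "0"),
  row.names = c("S1", "S2", "S3")
)

# paste() combines the values row-wise into a single label per sample.
sam$Site_Day <- paste(sam$Site, sam$Day, sep = "_")

sam
```

The table gains one column, just as the sample variables go from 7 to 8 in the output above, and the new column can then be used wherever a single `treatment` column is expected.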

set_sample_order

Arranges the phyloseq object so that the samples are listed in a given order, or sorted on metadata. This is most useful for visual inspection of the metadata, and for having the samples presented in the correct order in ggplot2 figures.

Usage

set_sample_order(phyloseq_obj, treatment)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`treatment`	Column name as a `string`, or a `vector` of column names, in the `sample_data`.

Examples

phyloseq::sample_names(GlobalPatterns)

##  [1] "CL3"      "CC1"      "SV1"      "M31Fcsw"  "M11Fcsw"  "M31Plmr" 
##  [7] "M11Plmr"  "F21Plmr"  "M31Tong"  "M11Tong"  "LMEpi24M" "SLEpi20M"
## [13] "AQC1cm"   "AQC4cm"   "AQC7cm"   "NP2"      "NP3"      "NP5"     
## [19] "TRRsed1"  "TRRsed2"  "TRRsed3"  "TS28"     "TS29"     "Even1"   
## [25] "Even2"    "Even3"

ordered_obj <- set_sample_order(GlobalPatterns, "SampleType")
phyloseq::sample_names(ordered_obj)

##  [1] "M31Fcsw"  "M11Fcsw"  "TS28"     "TS29"     "LMEpi24M" "SLEpi20M"
##  [7] "AQC1cm"   "AQC4cm"   "AQC7cm"   "Even1"    "Even2"    "Even3"   
## [13] "NP2"      "NP3"      "NP5"      "TRRsed1"  "TRRsed2"  "TRRsed3" 
## [19] "M31Plmr"  "M11Plmr"  "F21Plmr"  "CL3"      "CC1"      "SV1"     
## [25] "M31Tong"  "M11Tong"
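Sorting samples on a metadata column amounts to reordering the sample data and applying the same order to the columns of the count table. The base R sketch below uses hypothetical toy objects and is not the package's code.

```r
# Toy OTU matrix and sample data (hypothetical data).
otu <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3"))
)
sam <- data.frame(
  SampleType = c("Soil", "Feces", "Soil"),
  row.names = colnames(otu)
)

# order() returns the row indices that sort the metadata column;
# ties keep their original relative order.
new_order  <- order(sam$SampleType)
sam_sorted <- sam[new_order, , drop = FALSE]

# Apply the same sample order to the columns of the OTU matrix.
otu_sorted <- otu[, rownames(sam_sorted), drop = FALSE]

colnames(otu_sorted)
```

Samples end up grouped by treatment value in sorted order, similar to how the example output above lists samples grouped by SampleType.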

set_treatment_levels

Set the order of the levels of a factor in the sample-data. Primarily useful for easy formatting of the order that ggplot2 will display samples.

Useful for:

managing the order in which variables appear in figures

Usage

set_treatment_levels(phyloseq_obj, treatment, order)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`treatment`	Column name as a `string`, or a `vector` of column names, in the `sample_data`.
`order`	The order of the factors in the `treatment` column as a `vector` of `string`s. If set to `'numeric'`, the levels are put in ascending numerical order.

Examples

levels(soil_column@sam_data$Day)

## [1] "0"   "10"  "108" "24"  "38"  "59"  "80"

ordered_days <- set_treatment_levels(soil_column, 'Day', 'numeric')
levels(ordered_days@sam_data$Day)

## [1] "0"   "10"  "24"  "38"  "59"  "80"  "108"
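The reason 'numeric' ordering is needed is that factor levels built from character data sort alphabetically, so "108" lands before "24". The base R sketch below reproduces that behaviour on a standalone factor; the `day` vector is a stand-in for the Day column and is not taken from the package.

```r
# A factor built from character values sorts its levels alphabetically.
day <- factor(c("0", "10", "108", "24", "38", "59", "80"))
levels(day)

# Convert the levels to numbers, sort them, and convert back to strings
# so they can be used as the new level order.
numeric_levels <- as.character(sort(as.numeric(levels(day))))
day_ordered <- factor(day, levels = numeric_levels)
levels(day_ordered)
```

Only the level order changes; the value stored for each sample stays the same, so ggplot2 will draw the groups in the new order without any change to the data.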

taxa_extract

Create a new phyloseq-object containing defined taxa. Taxa names can be a substring or entire taxa name. It will match that string in all taxa levels unless a specific classification level is declared.

Useful for:

looking at specific taxa of interest

Usage

taxa_extract(phyloseq_obj, taxa_to_extract, classification = NULL)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`taxa_to_extract`	A `string`, or `vector` of taxa of interest.
`classification`	Column name as a `string` in the `tax_table` to restrict the match to. If `NULL`, the string is matched in all taxa levels.

Examples

GlobalPatterns

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

taxa_extract(GlobalPatterns, c("Cyano", "Proteo","Actinobacteria"))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 8441 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 8441 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 8441 tips and 8440 internal nodes ]
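Substring matching across taxonomic levels can be sketched in base R as follows. The helper `match_taxa()` and the toy `tax` table are hypothetical, and case-sensitive literal matching is an assumption; the package's matching rules may differ.

```r
# Toy taxonomy table (hypothetical data).
tax <- data.frame(
  Phylum = c("Cyanobacteria", "Proteobacteria", "Firmicutes"),
  Class  = c("Chloroplast", "Alphaproteobacteria", "Bacilli"),
  row.names = c("OTU1", "OTU2", "OTU3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each taxon where any pattern occurs as a
# literal substring in any of the selected taxonomy columns.
match_taxa <- function(tax_table, patterns, columns = names(tax_table)) {
  keep <- rep(FALSE, nrow(tax_table))
  for (pattern in patterns) {
    for (column in columns) {
      keep <- keep | grepl(pattern, tax_table[[column]], fixed = TRUE)
    }
  }
  keep
}

# Match in all levels, then keep the matching taxa.
keep <- match_taxa(tax, c("Cyano", "Proteo"))
extracted <- tax[keep, , drop = FALSE]

# Restricting the search to one level can change the result:
# "Bacil" occurs in the Class column but not in Phylum.
in_any_level <- match_taxa(tax, "Bacil")
in_phylum    <- match_taxa(tax, "Bacil", columns = "Phylum")
```

Negating the same logical vector (`tax[!keep, , drop = FALSE]`) gives the complementary set of taxa, which is the idea behind pruning rather than extracting.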

taxa_filter

This is a robust function that is used by nearly every other function in this package. It combines many of the subsetting processes distributed within phyloseq into a single, more user-friendly function. The function works in several steps.

Check whether treatments were specified. If so, split the phyloseq object into separate objects, one per treatment, to process.
Determine which taxa are seen in a proportion of samples > frequency within each phyloseq object (filtering out taxa seen in few samples), then merge the results back into one object.
If subset is declared, remove all treatments outside of the subset.
If drop_samples is TRUE, remove any samples that have 0 taxa observed after filtering (this is a very situational need).

If frequency is set to 0 (default), then the function removes any taxa with no abundance in any sample.
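The frequency test described above can be sketched in base R as a prevalence calculation. The toy `otu` matrix is hypothetical, and the strict `>` comparison follows the description of the steps rather than the package source.

```r
# Toy OTU matrix: taxa in rows, samples in columns (hypothetical data).
otu <- matrix(
  c(3, 1, 0, 0,
    0, 0, 0, 0,
    2, 5, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2", "S3", "S4"))
)

# Proportion of samples in which each taxon is observed at all.
prevalence <- rowMeans(otu > 0)

# frequency = 0 keeps every taxon seen in at least one sample.
keep_observed <- prevalence > 0

# frequency = 0.8 keeps only taxa seen in more than 80% of samples.
keep_common <- prevalence > 0.8

otu[keep_observed, , drop = FALSE]
otu[keep_common, , drop = FALSE]
```

Here OTU2 is never observed and is dropped even at frequency 0, while OTU1 is seen in only half of the samples and is dropped at 0.8.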

Useful for:

subsetting by sample_data factors
removing low-presence taxa
removing high-presence taxa

Usage

taxa_filter(phyloseq_obj, treatment = NULL, subset = NULL, frequency = 0, below = FALSE, drop_samples = FALSE)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`treatment`	Column name as a `string`, or a `vector` of column names, in the `sample_data`.
`subset`	A factor within the `treatment`. This will remove any samples that do not contain this factor. This can be a `vector` of multiple factors to subset on.
`frequency`	The proportion of samples the taxa is found in.
`below`	Whether `frequency` defines the minimum (`FALSE`) or maximum (`TRUE`) proportion of samples the taxa is found in.
`drop_samples`	Whether the function should remove samples that are empty after removing taxa filtered by frequency (`TRUE`).

Examples

The GlobalPatterns data has 19,216 OTUs listed in its tax_table.

GlobalPatterns

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

However, 228 of those taxa are not actually seen in any of the samples.

sum(phyloseq::taxa_sums(GlobalPatterns) == 0)

## [1] 228

taxa_filter with frequency = 0 will remove those taxa.

taxa_filter(GlobalPatterns, frequency = 0)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18988 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 18988 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 18988 tips and 18987 internal nodes ]

Say that we wanted to only look at taxa that are seen in 80% of the samples.

taxa_filter(GlobalPatterns, frequency = 0.80)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 435 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 435 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 435 tips and 434 internal nodes ]

But if we want taxa that are seen in 80% of any one treatment group:

taxa_filter(GlobalPatterns, frequency = 0.80, treatment = 'SampleType')

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 435 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 435 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 435 tips and 434 internal nodes ]

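This is expected to return a larger number of taxa, since each taxon only has to be seen in 80% of the samples of a single treatment group, that is, in fewer samples overall. The output shown above nonetheless reports the same 435 taxa as the call without a treatment.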
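Why a per-treatment threshold is looser than an overall one can be seen in a small base R sketch. The toy `otu` matrix and `treatment` labels are hypothetical, and the code only illustrates the reasoning; it does not reproduce the package's implementation.

```r
# Toy OTU matrix and treatment labels (hypothetical data).
otu <- matrix(
  c(3, 1, 0, 0,
    0, 0, 0, 0,
    2, 5, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2", "S3", "S4"))
)
treatment <- c("Soil", "Soil", "Feces", "Feces")

# Overall prevalence: proportion of all samples containing each taxon.
keep_overall <- rowMeans(otu > 0) > 0.8

# Prevalence within each treatment group (taxa in rows, groups in columns).
group_prev <- sapply(
  split(seq_along(treatment), treatment),
  function(idx) rowMeans(otu[, idx, drop = FALSE] > 0)
)

# Keep a taxon if it passes the threshold in at least one group.
keep_any_group <- apply(group_prev > 0.8, 1, any)

keep_overall
keep_any_group
```

OTU1 is present in every Soil sample but in no Feces sample, so it fails the overall test and passes the per-group one.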

taxa_prune

Create a new phyloseq-object omitting the defined taxa. Taxa names can be a substring or entire taxa name. It will match that string in all taxa levels unless a specific classification level is declared.

Useful for:

removing specific taxa that are not of interest

Usage

taxa_prune(phyloseq_obj, taxa_to_remove, classification = NULL)

Arguments

Call	Description
`phyloseq_obj`	A phyloseq-class object.
`taxa_to_remove`	A `string`, or `vector` of taxa to remove.
`classification`	Column name as a `string` in the `tax_table` to restrict the match to. If `NULL`, the string is matched in all taxa levels.

Examples

GlobalPatterns

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

taxa_prune(GlobalPatterns, c("Cyano", "Proteo","Actinobacteria"))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 17585 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 17585 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 17585 tips and 17584 internal nodes ]

Schuyler Smith
Ph.D. Bioinformatics and Computational Biology

Schuyler D. Smith

2023-03-26
