#Load your dataset into file_path
file_path <- "../data/dataset_MISO_project.xlsx"
#Write the name of your variables
otu_sheet <- "OTU_table"
tax_sheet <- "Taxonomy"
sample_sheet <- "Sample_data" Introduction
Before performing metagenomic analyses, it is important to explore and clean the dataset. This step includes:
- importing OTU, taxonomy, and metadata tables,
- creating a phyloseq object,
- checking for missing values,
- verifying sequencing depth consistency,
- exploring taxonomic composition,
- and visualizing global similarities between samples.
These exploratory analyses help detect potential issues in the dataset and provide an overview of the microbial communities before more advanced statistical analyses.
User Input
In this section, users can load their own dataset.
The workflow expects:
- an OTU abundance table,
- a taxonomy table,
- and a sample metadata table.
All three tables are stored in an Excel file and imported separately using their sheet names. You can replace the file path and sheet names below with your own dataset.
Load Data
The imported tables are converted into a `phyloseq` object. `phyloseq` is a R package for microbiome analysis that allows abundance data, taxonomy, and sample metadata to be stored together in a single structured object.
This makes it easier to perform analyses and visualize the microbial data.
otu_mat <- read_excel(file_path, sheet = otu_sheet)
tax_mat <- read_excel(file_path, sheet = tax_sheet)
samples_df <- read_excel(file_path, sheet = sample_sheet)
# Convert data frames to matrices and set row names. Phyloseq objects need to have row.names
otu_mat <- otu_mat %>%
tibble::column_to_rownames("OTU_ID")
tax_mat <- tax_mat %>%
tibble::column_to_rownames("Tax_ID")
samples_df <- samples_df %>%
tibble::column_to_rownames("Sample_ID")
#Transform into matrixes otu and tax tables (sample table can be left as data frame)
otu_mat <- as.matrix(otu_mat)
tax_mat <- as.matrix(tax_mat)
# Transform to phyloseq objects
OTU = otu_table(otu_mat, taxa_are_rows = TRUE)
TAX = tax_table(tax_mat)
samples = sample_data(samples_df)
metagenomics <- phyloseq(OTU, TAX, samples)
#sample_names(metagenomics)
#rank_names(metagenomics)
#sample_variables(metagenomics)
total = median(sample_sums(metagenomics))
# standf = function(x, t=total) round(t * (x / sum(x)))
# metagenomics = transform_sample_counts(metagenomics, standf)Statistical summary
A first summary of the dataset provides general information about:
- the number of samples,
- the number of taxa,
- sequencing depth,
- and metadata variables available for analysis.
This helps verify that the dataset was imported correctly.
summary(metagenomics) Length Class Mode
1 phyloseq S4
typeof(otu_mat)[1] "double"
We can verify that the OTU table is stored as a numeric matrix of type `double`, meaning the abundance values are represented as decimal numbers.
We also checked whether some samples contained only zero values. Such samples can be problematic because they do not contain any detected microbial signal and may affect analyses, if they do exist, we should consider ignoring those samples.
which(colSums(otu_mat != 0) == 0)named integer(0)
summary(colSums(otu_mat)) Min. 1st Qu. Median Mean 3rd Qu. Max.
101.8 199.1 199.9 198.2 200.0 200.0
No sample composed only of zeroes was found.
Missing values check
Missing values can bias analyses and lead to incorrect statistical results.
We therefore verify that the OTU table, taxonomy table, and sample metadata do not contain missing entries before continuing the analysis.
#Check for missing values
sum(is.na(otu_mat))[1] 0
sum(is.na(tax_mat))[1] 0
sum(is.na(samples_df))[1] 0
#visualize missingness
colSums(is.na(samples_df)) dataset_name bodysite disease
0 0 0
age gender country
0 0 0
sequencing_technology pubmedid
0 0
We can see that there is no missing values.
Distribution of Sequencing Depth.
Sequencing depth corresponds to the total number of reads obtained for each sample.
Large differences in sequencing depth may introduce technical bias in diversity analyses. Samples with extremely low read counts are sometimes removed because they may not accurately represent the microbial community.
sample_depth <- sample_sums(metagenomics)
summary(sample_depth) Min. 1st Qu. Median Mean 3rd Qu. Max.
50.91 99.57 99.94 99.09 100.00 100.00
ggplot(data.frame(depth = sample_depth), aes(x = depth)) +
geom_histogram(bins = 30) +
theme_minimal() +
labs(title = "Distribution of Sequencing Depth",
x = "Reads per sample",
y = "Frequency")
Sequencing depth across samples is fairly consistent, with most samples summing to approximately 100, indicating that the dataset had been normalized to relative abundances.
However, we can notice one potential outlier, a sample exhibiting a lower total (50.91), let’s try to identify it.
sample_sums(metagenomics)[sample_sums(metagenomics) < 90]Sample_2148 Sample_2280 Sample_2516 Sample_3079 Sample_3482 Sample_3493
89.41782 84.62723 86.69817 89.79545 50.90540 85.95933
Sample_3504 Sample_3519 Sample_3526 Sample_3553 Sample_3594
88.59328 84.47117 88.53511 87.91809 89.17386
Most samples show similar sequencing depth values, suggesting that the dataset was already normalized before analysis.
One sample (`Sample_3482`) exhibits a lower sequencing depth than the others. However, since the difference remains moderate, the sample was retained for the study.
Taxonomic composition
Taxonomic composition plots allow visualization of the relative abundance of microbial groups across samples.
These representations help identify dominant bacterial taxa and observe potential differences between disease groups or other metadata categories.
metagenomics_phylum <- tax_glom(metagenomics, taxrank = "Phylum")
metagenomics_phylum_per <- transform_sample_counts(metagenomics_phylum, function(x) x / sum(x))
plot_bar(metagenomics_phylum_per, fill = "Phylum") +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Taxonomic composition at Phylum level") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the phyloseq package.
Please report the issue at <https://github.com/joey711/phyloseq/issues>.

plot_bar(metagenomics_phylum_per, x = "disease", fill = "Phylum") +
geom_bar(stat = "identity", position = "stack") +
theme_minimal() +
labs(title = "Phylum composition by disease group")
To improve readability, only the 10 most abundant genera were retained in the following visualization.
metagenomics_genus <- tax_glom(metagenomics, taxrank = "Genus")
metagenomics_genus_per <- transform_sample_counts(metagenomics_genus, function(x) x / sum(x))
# keep top 10 genera to avoid clutter
top_genera <- names(sort(taxa_sums(metagenomics_genus_per), decreasing = TRUE))[1:10]
metagenomics_genus_top <- prune_taxa(top_genera, metagenomics_genus_per)
plot_bar(metagenomics_genus_top, x = "disease", fill = "Genus") +
geom_bar(stat = "identity", position = "stack") +
theme_minimal() +
labs(title = "Top 10 Genera Composition by Disease") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Beta Diversity - PCoA
Beta diversity measures differences in microbial composition between samples.
Bray-Curtis distance is a statistic used to quantify the dissimilarity in composition between samples, based on abundance data.
We then visualise sample similarities via Principal Coordinate Analysis (PCoA). Samples positioned close together have more similar microbial compositions, whereas distant samples are more dissimilar.
Firstly, we do a PCoA by gender.
Beta Diversity - PCoA by gender
This visualization allows us to explore whether microbial communities tend to cluster according to gender.
Strong separation between groups would suggest that gender may influence microbiome composition.
pcoa <- ordinate(metagenomics, method = "PCoA", distance = "bray")
plot_ordination(metagenomics, pcoa, color = "gender") +
geom_point(size = 3) +
theme_minimal() +
labs(title = "PCoA - Bray-Curtis by gender")
Overall, the plot does not show a clear separation between male and female samples.
Samples from both genders largely overlap, suggesting that gender alone may not strongly influence the overall microbial community composition in this dataset.
However, a few local clusters and dispersion differences can still be observed. Male samples appear slightly more spread out, which could indicate greater variability in microbial composition between individuals.
The first two principal coordinates explain 11.1% and 9.9% of the total variation respectively, meaning that only a portion of the microbiome variability is represented in this plot.
Let’s do a PCoA by gender and disease to furthermore evaluate the dataset. Adding disease status as a variable allows visualization of two metadata variables simultaneously and may reveal certain paterns.
Beta Diversity - PCoA by gender and disease
This PCoA visualization combines both gender and disease information to explore whether microbial composition varies between groups.
pcoagd <- ordinate(metagenomics, method = "PCoA", distance = "bray")
plot_ordination(metagenomics, pcoagd, color = "gender", shape = "disease") +
geom_point(size = 3) +
theme_minimal()
Some disease-specific clustering patterns can be observed. Samples associated with cirrhosis tend to be located more frequently in the lower part of the plot, whereas cancer samples appear more concentrated in the upper central region.
In contrast, ulcerative colitis and type 2 diabetes samples show overlap with the other groups, suggesting less distinct microbial separation.
Gender does not appear to create a strong separation within disease groups, as male and female samples remain mostly intermixed through the space.
Overall, disease status may contribute more strongly than gender to differences in microbial composition, although the overlap between groups indicates that these effects are moderate.
Distribution of samples by gender
Let’s analyze the distribution of samples by gender. Strong imbalance between groups may influence statistical interpretation and reduce the reliability of comparisons.
meta_df <- data.frame(sample_data(metagenomics))
gender_counts <- meta_df %>%
group_by(gender) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
print(gender_counts)# A tibble: 2 × 3
gender Count Percentage
<chr> <int> <dbl>
1 female 217 44.8
2 male 267 55.2
meta_df <- data.frame(sample_data(metagenomics))
gender_counts <- meta_df %>%
group_by(gender) %>%
summarise(Count = n())
ggplot(gender_counts, aes(x = gender, y = Count, fill = gender)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Number of Samples by Gender",
x = "Gender",
y = "Number of Samples")
We observe that 44.8% of the samples are female and 55.2% are male, indicating a slightly higher representation of males. However, the distribution remains relatively balanced between the two groups.
Number of samples of diseases by gender
Number of samples of diseases for females:
meta_df <- data.frame(sample_data(metagenomics))
disease_counts <- meta_df %>%
group_by(gender, disease) %>%
summarise(Count = n())`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by gender and disease.
ℹ Output is grouped by gender.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(gender, disease))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
ggplot(disease_counts %>% filter(gender == "female"),
aes(x = disease, y = Count, fill = disease)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Number of Female Samples per Disease") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Ulcerative colitis (IBD) has the highest number of samples with more than 75 counts, followed by type 2 diabetes (T2D), then cirrhosis, and finally cancer, which has fewer than 25 counts.
Number of samples of diseases for males:
ggplot(disease_counts %>% filter(gender == "male"),
aes(x = disease, y = Count, fill = disease)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Number of Male Samples per Disease") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Type 2 diabetes (T2D) has the highest number of samples with more than 100 counts, followed by type cirrhosis, then Ulcerative colitis (IBD), and finally cancer, which has around 25 counts.
We can see that there are differences in the number of samples per disease, depending on gender.
Distribution of bacterial class by gender
The taxonomic composition at the class level was compared between genders using relative abundance profiles. Only the most abundant classes (top 10) were retained to improve visualization clarity.
metagenomics_class <- tax_glom(metagenomics, taxrank = "Class")
metagenomics_class_per <- transform_sample_counts(metagenomics_class, function(x) x / sum(x))
top_classes <- names(sort(taxa_sums(metagenomics_class_per), decreasing = TRUE))[1:10]
metagenomics_class_top <- prune_taxa(top_classes, metagenomics_class_per)
plot_bar(metagenomics_class_top, x = "gender", fill = "Class") +
geom_bar(stat = "identity", position = "stack") +
theme_minimal() +
labs(title = "Bacterial TOP 10 Classes Distribution by Gender") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The same analysis but with all the bacterial classes:
plot_bar(metagenomics_class, x = "gender", fill = "Class") +
geom_bar(stat = "identity", position = "stack") +
theme_minimal() +
labs(title = "Bacterial Class Distribution by Gender") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
It seems that the bacterial distribution is approximately the same for each gender.
Overall, the data preparation step provided a clear overview of the dataset and ensured its suitability for the study. We examined key characteristics such as sample distribution across disease groups and gender, revealing a slight imbalance in both cases (more male samples than female, and varying representation across conditions).
These checks helped confirm data quality, identify potential biases, and ensure that the analyses are interpreted in the right context.