R / RShiny
This tutorial is a hands-on introduction to data analysis in R for people who are new to the language. It follows the same workflow as the Python tutorial — load, filter, slice, transform, merge, save — using the same bulk RNA-seq dataset (gene expression across eight tissues in human and mouse).
The material is adapted from the Applied data analysis for beginners module of the NYUAD Core Bioinformatics Workshops (Start Analytics 2025). Source data and notebooks are at NYUAD-Core-Bioinformatics/StartAnalytics_2025.
No prior R experience is assumed.
What you need to know first
A few terms that come up over and over:
Package — a chunk of pre-written code you can use. We use:
dplyr— verbs for data frames (filter,select,mutate,left_join) and the pipe%>%.tidyr— reshape data between wide and long form (pivot_longer).tibble— helpers likerownames_to_column.reshape2— older reshaper, providesmelt.ggplot2— the standard plotting library.gridExtra— arrange several ggplot panels in a grid (grid.arrange).pheatmap— pretty heatmaps with clustering.
data.frame — a table with named rows and named columns. The R equivalent of a
pandasDataFrame. After we read a CSV withrow.names = 1the first column of the file becomes the row labels.matrix — a 2-D grid of values that are all the same type (usually all numeric). Most of our math (sums, normalization, correlation) is done on a matrix.
The pipe
%>%— takes whatever is on the left and feeds it as the first argument to the function on the right.x %>% f(y)is the same asf(x, y). It lets you read transformations left-to-right instead of inside-out.Row and column ops —
rowSums(x)sums each row,colSums(x)sums each column. In Python you would writedf.sum(axis=1)anddf.sum(axis=0); in R the function name itself tells you the direction.The transpose trick —
t(t(x) / colSums(x))divides each column ofxby its column total. Without the twot()calls, R would try to divide row-wise and recycle the values the wrong way. (More on this in the normalization step.)
Workflow overview
Load the data into a data frame.
Filter — inspect samples and genes; spot anything empty.
Slice the table to focus on samples or genes of interest.
Transform the numbers (normalize, log-transform).
Merge human and mouse data and check correlation.
Save results to disk.
Data shape
rows — genes (e.g.
STAG2,GOSR2).columns — samples (e.g.
human_brain_4).values — raw sequencing read counts.
Before any analysis, ask yourself: Are there empty samples? Do all genes have a value? Which features are irrelevant? Are there outliers or missing data? Garbage in, garbage out.
Environment setup
Install the packages once from the R console:
install.packages(c("dplyr", "tidyr", "tibble", "reshape2",
"ggplot2", "gridExtra", "pheatmap"))
Load them at the top of each script or notebook session:
library(dplyr)
library(gridExtra)
library(reshape2)
library(ggplot2)
library(tidyr)
library(tibble)
library(pheatmap)
If you are working inside a Jupyter notebook with the R kernel, this option keeps the printed tables short so they fit on screen:
options(repr.matrix.max.rows = 10, repr.matrix.max.cols = 6)
Inputs
The example uses three gzipped CSVs (read.csv handles .gz
automatically — no need to unzip):
Download the three files from
NYUAD-Core-Bioinformatics/StartAnalytics_2025 / notebooks / data
and place them in a data/ folder next to your script or
notebook:
data/human_expression.csv.gz— raw read counts, 14,744 genes × 8 human tissue samples.data/mouse_expression.csv.gz— raw read counts, 14,744 genes × 8 mouse tissue samples.data/human_mouse_gene_map.csv.gz— a lookup table pairing each human gene symbol (STAG2) with its mouse equivalent (Stag2).
Or grab all three from the terminal:
mkdir -p data
BASE=https://raw.githubusercontent.com/NYUAD-Core-Bioinformatics/StartAnalytics_2025/main/notebooks/data
curl -L -o data/human_expression.csv.gz $BASE/human_expression.csv.gz
curl -L -o data/mouse_expression.csv.gz $BASE/mouse_expression.csv.gz
curl -L -o data/human_mouse_gene_map.csv.gz $BASE/human_mouse_gene_map.csv.gz
1. Loading data
read.csv reads a CSV file into a data frame. The argument
row.names = 1 says “use the first column of the file as the
row labels” — so gene names become the row names rather than a
plain data column.
mouse <- read.csv('data/mouse_expression.csv.gz', row.names = 1)
human <- read.csv('data/human_expression.csv.gz', row.names = 1)
gene_map <- read.csv('data/human_mouse_gene_map.csv.gz', row.names = 1)
A few quick ways to look at what just got loaded:
dim(human) # 14744 8 (rows, cols)
head(human) # first 6 rows
colnames(human) # sample names
rownames(human)[1:5] # first 5 gene names
In a notebook, just typing human and running the cell prints
the table. In a plain script use print(human).
Note
View(human) (capital V) opens the table in a
spreadsheet-style viewer — but this only works inside
RStudio. It will not do anything useful from a plain R
console, a terminal session, or most Jupyter setups.
2. Filtering
Filtering means dropping rows or columns that carry no useful information. At this stage we just inspect — the actual drop happens later if needed.
Empty samples
colSums adds up the values column-by-column, giving total
reads per sample:
colSums(human)
In this dataset every sample has millions of reads, so no empty columns to drop.
Genes with no expression
rowSums adds up across each row — the total reads per gene
across all samples:
gene_count_across_tissues <- rowSums(human)
head(gene_count_across_tissues)
summary(gene_count_across_tissues)
summary() is R’s quick descriptive-stats function (min, 1st
quartile, median, mean, 3rd quartile, max). It is the closest
analogue to pandas’ .describe().
How many genes have zero expression in every tissue?
gene_count_across_tissues == 0 returns a logical vector
(TRUE/FALSE), and summary() on a logical vector
prints a count of each value:
summary(gene_count_across_tissues == 0)
# Mode FALSE TRUE
# logical 14070 674
So 674 genes are flat zero across all tissues; 14,070 are expressed in at least one tissue.
3. Slicing
Slicing means picking a subset — particular rows, particular
columns, or both. The base-R bracket form is x[rows, cols];
leave a side blank to keep “all”.
Pick columns whose name contains a word
grep("brain", colnames(human)) returns the positions
(integers) of column names that match. Wrap it in
colnames(human)[...] to get the names themselves back:
brain_cols <- colnames(human)[grep("brain", colnames(human))]
brain_cols # "human_brain_4"
Pick rows by gene name
Same idea applied to rownames. The DLX genes are a family
of homeobox transcription factors active in the brain.
dlx_rows <- rownames(human)[grep("DLX", rownames(human))]
dlx_rows
Extract the subset of data
human[dlx_rows, brain_cols]
The order is always [rows, cols]. Forgetting the comma is a
common mistake — human[dlx_rows] would try to interpret
dlx_rows as column names and break.
4. Transform data
Raw counts are hard to interpret directly for two reasons:
A handful of highly-expressed genes dominate any plot or statistic (outliers).
Different samples were sequenced to different depths, so a higher count can reflect deeper sequencing rather than higher biological expression.
We apply two standard transformations:
Normalize by library size (counts per million, CPM) — scale each sample so column totals match, making samples comparable.
Log-transform (
log1p(x)=log(1 + x)) — compresses the long tail of highly-expressed genes.log1p(0)is0, so zero counts stay defined.
Wide vs long format
Before the first plot, one concept that trips up everyone new to
ggplot2: the difference between wide and long format.
Our expression table is in wide format — one column per sample, one row per gene. It is convenient to read and to do matrix math on:
Gene |
human_brain_4 |
human_heart_2 |
human_liver_2 |
|---|---|---|---|
STAG2 |
2513 |
1734 |
1262 |
GOSR2 |
40 |
187 |
87 |
DLX6 |
142 |
0 |
0 |
The same data in long format has one row per observation — i.e. one row per (gene, sample) pair — with the sample name moved into its own column:
Gene |
Sample |
Expression |
|---|---|---|
STAG2 |
human_brain_4 |
2513 |
STAG2 |
human_heart_2 |
1734 |
STAG2 |
human_liver_2 |
1262 |
GOSR2 |
human_brain_4 |
40 |
GOSR2 |
human_heart_2 |
187 |
GOSR2 |
human_liver_2 |
87 |
DLX6 |
human_brain_4 |
142 |
DLX6 |
human_heart_2 |
0 |
DLX6 |
human_liver_2 |
0 |
Why do we need long format? Because ggplot2’s aes() maps
columns of data to visual properties. To draw “one box per
sample, coloured by sample,” ggplot needs a single Sample
column to point x = and fill = at. A wide table has no
such column — the sample identity is hidden in the names of
the columns.
The function that converts wide → long is tidyr::pivot_longer:
long <- as.data.frame(human) %>%
rownames_to_column("Gene") %>%
pivot_longer(-Gene, # collapse everything except Gene
names_to = "Sample", # column names go here
values_to = "Expression") # values go here
-Genemeans “all columns exceptGene” — those are the ones that get stacked.names_tois the new column that will hold the old column names.values_tois the new column that will hold the old values.
You will see this exact three-line pattern many times below — whenever a matrix needs to become a plot.
A quick tour of ggplot2
With long format sorted out, here is the ggplot mental model.
ggplot2 is based on the grammar of graphics: every plot is
built up by adding layers with the + operator. A minimal
plot needs three things:
Data — a data frame in long format (one row per observation).
Aesthetics —
aes(...)maps columns of the data to visual properties (x, y, colour, fill, size, …).Geometry — a
geom_*()layer that says how to draw (geom_point,geom_boxplot,geom_violin,geom_tile, …).
A complete plot reads like a sentence:
ggplot(data, aes(x = sample, y = expression)) + # data + mapping
geom_boxplot() + # how to draw
labs(title = "...", x = "...", y = "...") + # labels
theme_bw() # overall look
Two practical notes:
Long format is required. Each row must be one observation. A wide table with one column per sample will not plot directly — reshape it first with
pivot_longer.The plus signs matter.
+belongs at the end of the previous line; if you put it at the start of the next line, R thinks the previous line was complete and runs it on its own.
Boxplot of raw counts
ggplot2 needs long format, so we pivot the wide matrix first.
rownames_to_column("Gene") moves the row labels into a
regular column so pivot_longer can leave it alone.
human_long <- as.data.frame(human) %>%
rownames_to_column("Gene") %>%
pivot_longer(-Gene, names_to = "Sample", values_to = "Expression")
boxplot_human <- ggplot(human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot(outlier.shape = 21, outlier.colour = "black",
outlier.fill = "white", color = "black") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Box Plot of expression by samples",
x = "Tissue sample", y = "Expression") +
theme(legend.position = "none")
boxplot_human
The raw plot is unreadable: a few enormous outlier genes squash the rest of the distribution onto the baseline.
Log1p transform
log1p() applied to a data frame returns a new data frame
where every value has been replaced by log(1 + value). The
original is not modified.
log1p_human <- log1p(human)
log1p_human_long <- as.data.frame(log1p_human) %>%
rownames_to_column("Gene") %>%
pivot_longer(-Gene, names_to = "Sample", values_to = "Expression")
sample_colors <- c(
"lightblue", "lightgreen", "lightpink", "lightsalmon",
"lightyellow", "powderblue", "lavender", "thistle"
)
ggplot(log1p_human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot(width = 0.5,
outlier.shape = 21, outlier.colour = "black",
outlier.fill = "white", color = "black") +
scale_fill_manual(values = sample_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Box Plot of expression by samples",
x = "Tissue Sample", y = "log1p Expression") +
theme(legend.position = "none")
Now the distributions are visible — but samples still differ in overall level because of library size. The next step fixes that.
Normalize by library size (CPM)
The normalization is one line:
normed_human <- t(t(human) / colSums(human)) * 1e6
normed_human[dlx_rows, ]
Read it from the inside out:
colSums(human)— total reads per sample, a length-8 vector.t(human)— transpose the table so samples are rows and genes are columns.t(human) / colSums(human)— now R divides each row of the transposed table by the matching element of the vector, which is what we want (per-sample division).t(...)— transpose back so genes are rows again.* 1e6— rescale to counts per million.
Without the two transposes, R would recycle the column totals across rows in the wrong direction and produce nonsense.
Compare all three views side by side. grid.arrange from
gridExtra lays out separate ggplot objects in a grid.
options(repr.plot.width = 18, repr.plot.height = 6) # notebook only
# raw counts
boxplot_human <- ggplot(human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot(width = 0.5, outlier.shape = 21,
outlier.colour = "black", outlier.fill = "white",
color = "black") +
scale_fill_manual(values = sample_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") +
labs(title = "Raw expression", x = "Tissue sample", y = "Expression")
# normalized
normed_human_long <- as.data.frame(normed_human) %>%
rownames_to_column("Gene") %>%
pivot_longer(-Gene, names_to = "Sample", values_to = "Expression")
boxplot_normed_human <- ggplot(normed_human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot(width = 0.5, outlier.shape = 21,
outlier.colour = "black", outlier.fill = "white",
color = "black") +
scale_fill_manual(values = sample_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") +
labs(title = "Normed", x = "Tissue sample", y = "Expression")
# log1p of normalized
log1p_normed_human <- log1p(normed_human)
log1p_normed_human_long <- as.data.frame(log1p_normed_human) %>%
rownames_to_column("Gene") %>%
pivot_longer(-Gene, names_to = "Sample", values_to = "Expression")
boxplot_log1p_normed_human <- ggplot(log1p_normed_human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot(width = 0.5, outlier.shape = 21,
outlier.colour = "black", outlier.fill = "white",
color = "black") +
scale_fill_manual(values = sample_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") +
labs(title = "l1p Normed", x = "Tissue sample", y = "Expression")
grid.arrange(boxplot_human, boxplot_normed_human,
boxplot_log1p_normed_human, ncol = 3)
Violin plots show the full distribution shape. geom_violin +
geom_boxplot on top of each other is a useful combination —
the violin gives density, the inner box gives the quartiles.
violinplot_human <- ggplot(human_long,
aes(x = Sample, y = Expression, fill = Sample)) +
geom_violin() +
geom_boxplot(width = 0.5, outlier.shape = 21,
outlier.colour = "black", outlier.fill = "white",
color = "black") +
scale_fill_manual(values = sample_colors) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") +
labs(title = "Raw expression (violin)",
x = "Tissue sample", y = "Expression")
# build violinplot_normed_human and violinplot_log1p_normed_human
# the same way, swapping in normed_human_long / log1p_normed_human_long
options(repr.plot.width = 18, repr.plot.height = 12)
grid.arrange(boxplot_human, boxplot_normed_human,
boxplot_log1p_normed_human,
violinplot_human, violinplot_normed_human,
violinplot_log1p_normed_human,
ncol = 3, nrow = 2)
Fold change before and after transformation
Fold change asks: how much higher (or lower) is gene X in
sample B compared to a reference sample A? The simple formula is
(B - A) / A: 1 means twice as high, -0.5 means half,
0 means equal.
Computing it on raw counts gives implausibly large values driven by sequencing depth. Computing it on log1p-normalized values gives interpretable numbers.
The pattern below is from the notebook: sapply walks over
each non-reference column, computes the fold-change against the
reference, and returns the result as a matrix.
ref <- "human_brain_4"
a <- normed_human[dlx_rows, ref]
col_subset <- colnames(normed_human)[colnames(normed_human) != ref]
normed_human_fold_changes <- sapply(col_subset, function(c) {
b <- normed_human[dlx_rows, c]
(b - a) / a
})
sapply(X, FUN)— applyFUNto each element ofXand try to simplify the result to a matrix or vector.function(c) { ... }— an anonymous function.chere is one column name fromcol_subsetat a time.
The result will contain NaN (from 0 / 0) and Inf
(from dividing by zero). Patch those up:
# 0 / 0 -> 0
normed_human_fold_changes[is.na(normed_human_fold_changes)] <- 0
# division by zero -> fall back to the raw value
inf_mask <- is.infinite(normed_human_fold_changes)
normed_human_fold_changes[inf_mask] <-
human[dlx_rows, col_subset][inf_mask]
normed_human_fold_changes <-
as.data.frame(normed_human_fold_changes, row.names = dlx_rows)
Do the same calculation for raw counts and for log1p-normalized counts:
# raw
a <- human[dlx_rows, ref]
col_subset <- colnames(human)[colnames(human) != ref]
human_fold_changes <- sapply(col_subset, function(c) {
b <- human[dlx_rows, c]
(b - a) / a
})
human_fold_changes[is.na(human_fold_changes)] <- 0
inf_mask <- is.infinite(human_fold_changes)
human_fold_changes[inf_mask] <- human[dlx_rows, col_subset][inf_mask]
# log1p of normalized
l1p_normed_human <- log1p(normed_human)
a <- l1p_normed_human[dlx_rows, ref]
col_subset <- colnames(l1p_normed_human)[colnames(l1p_normed_human) != ref]
l1p_normed_human_fold_changes <- sapply(col_subset, function(c) {
b <- l1p_normed_human[dlx_rows, c]
(b - a) / a
})
l1p_normed_human_fold_changes[is.na(l1p_normed_human_fold_changes)] <- 0
inf_mask <- is.infinite(l1p_normed_human_fold_changes)
l1p_normed_human_fold_changes[inf_mask] <-
l1p_normed_human[dlx_rows, col_subset][inf_mask]
Heatmap of fold changes
Build a heatmap with geom_tile (one rectangle per cell) and
geom_text (the number inside it). scale_fill_gradient2
makes blue → white → red with white at zero.
human_fold_changes_long <- as.data.frame(human_fold_changes) %>%
rownames_to_column("Gene") %>%
mutate(Gene = factor(Gene, levels = dlx_rows)) %>% # preserve order
pivot_longer(-Gene, names_to = "Sample", values_to = "FoldChange")
plot1 <- ggplot(human_fold_changes_long,
aes(x = Sample, y = Gene, fill = FoldChange)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white",
high = "red", midpoint = 0) +
geom_text(aes(label = sprintf("%.2f", FoldChange)), size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_discrete(limits = rev) +
labs(title = "Raw FC", x = "Sample", y = "Gene",
fill = "Fold Change") +
theme(legend.position = "none")
Repeat for normed_human_fold_changes and
l1p_normed_human_fold_changes (call them plot2 and
plot3) and lay them side by side:
grid.arrange(plot1, plot2, plot3, ncol = 3)
Raw fold-changes contain values like 19 driven entirely by
library size; the log1p-normalized panel shows interpretable
values in roughly [-1, 1].
5. Merging data
To compare expression across species, we need to align genes:
STAG2 in human and Stag2 in mouse are the same gene. The
orthologue map provides that link.
First inspect the mouse side. Note that grep("DLX", toupper(...))
matches DLX regardless of capitalization (the mouse symbols are
Dlx4 etc., not DLX4):
mouse_brain_cols <- colnames(mouse)[grep("brain", colnames(mouse))]
mouse_dlx_rows <- rownames(mouse)[grep("DLX", toupper(rownames(mouse)))]
mouse[mouse_dlx_rows, ]
Normalize the mouse table the same way:
normed_mouse <- log1p(t(t(mouse) / colSums(mouse)) * 1e6)
normed_mouse[mouse_dlx_rows, ]
Filter the gene map to just the DLX orthologue pairs and look up
those rows in normed_mouse:
gene_map[gene_map$mouse %in% mouse_dlx_rows, ]
normed_mouse[gene_map[gene_map$mouse %in% mouse_dlx_rows, ]$mouse, ]
To join the gene map onto the expression tables we need the
gene-name to be a regular column (not a row name), because
left_join matches by column. Build “expression + key
column” data frames for each species:
mouse_with_genes <- as.data.frame(normed_mouse)
mouse_with_genes$mouse <- rownames(mouse_with_genes)
normed_human_with_genes <- as.data.frame(log1p(normed_human))
normed_human_with_genes$human <- rownames(normed_human_with_genes)
Join each species onto the gene map. left_join keeps every
row from the left table and pulls in matching rows from the right
table — unmatched rows on the right are dropped.
star_normed_mouse <- gene_map %>%
left_join(mouse_with_genes, by = "mouse")
rownames(star_normed_mouse) <- rownames(gene_map)
star_normed_human <- gene_map %>%
left_join(normed_human_with_genes, by = "human")
rownames(star_normed_human) <- rownames(gene_map)
Now both tables have the same row order (one row per ortholog
pair). Join them and drop the duplicated key columns. After the
second join, the human column appears twice — left_join
disambiguates with .x / .y suffixes, which we then drop:
merged_df <- star_normed_mouse %>%
left_join(star_normed_human, by = "mouse") %>%
select(-human.x, -mouse, -human.y)
rownames(merged_df) <- rownames(star_normed_mouse)
Sample correlation
A correlation heatmap is a quick sanity check: matched tissues (human brain ↔ mouse brain) should correlate more strongly than mismatched tissues, even across species.
cor() computes pairwise Pearson correlation. The argument
use = "pairwise.complete.obs" tells it to drop NAs
per pair instead of returning NA for the whole matrix if any
NA is present.
cor_matrix <- cor(merged_df,
use = "pairwise.complete.obs",
method = "pearson")
cor_matrix
Two ways to visualize it.
Option A — pheatmap (with built-in clustering):
options(repr.plot.width = 12, repr.plot.height = 12) # notebook only
pheatmap(cor_matrix,
display_numbers = TRUE,
number_format = "%.2f",
cluster_rows = TRUE,
cluster_cols = TRUE,
clustering_distance_rows = "euclidean",
clustering_method = "complete",
fontsize_number = 5,
angle_col = 90,
main = "Sample Correlation Heatmap")
The clustered version reorders rows and columns by similarity,
making tissue clusters jump out. Setting
cluster_rows = FALSE, cluster_cols = FALSE keeps the original
order if you prefer.
Option B — ggplot tile heatmap. This needs the matrix
reshaped to long form. reshape2::melt is the classic way:
cor_long <- melt(cor_matrix)
head(cor_long)
# Var1 Var2 value
# mouse_adipose_1 mouse_adipose_1 1.000
# ...
ggplot(cor_long, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", value)), size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Correlation Heatmap",
x = "", y = "", fill = "Correlation") +
coord_fixed() +
scale_y_discrete(limits = rev)
If you prefer to stay inside the tidyverse, pivot_longer
gives the same result. You must first turn the matrix into a
data frame so it has named columns:
cor_df <- as.data.frame(cor_matrix) %>%
rownames_to_column(var = "Var1")
long_data <- pivot_longer(cor_df,
cols = -Var1,
names_to = "Var2",
values_to = "value")
ggplot(long_data, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", value)), size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Correlation Heatmap",
x = "", y = "", fill = "Correlation") +
coord_fixed() +
scale_y_discrete(limits = rev)
Cross-species correlation only
A focused version: correlate the eight mouse samples against the
eight human samples (an 8 × 8 matrix instead of 16 × 16). Pick
the species-specific columns with grep and pattern-match on
the ^mouse_ / ^human_ prefixes.
mouse_columns <- grep("^mouse_", colnames(merged_df), value = TRUE)
human_columns <- grep("^human_", colnames(merged_df), value = TRUE)
mouse_data <- merged_df[, mouse_columns]
human_data <- merged_df[, human_columns]
cor_matrix <- cor(mouse_data, human_data,
use = "pairwise.complete.obs",
method = "pearson")
cor_matrix
cor_long <- melt(cor_matrix)
ggplot(cor_long, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", value)), size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Mouse vs Human Correlation",
x = "Human", y = "Mouse", fill = "Correlation") +
coord_fixed() +
scale_y_discrete(limits = rev)
6. Saving data
Write the merged table to disk so the next step in your pipeline
can read it back. write.csv is the base-R writer:
write.csv(merged_df, "./data/human_mouse_merged_expression.csv")
Useful options:
row.names = FALSE— do not write the row names as a leading unnamed column.End the filename with
.gzand combine withgzfile(), or usereadr::write_csvwhich handles.gzdirectly.
7. Results
The expected output files generated throughout this tutorial are available in the course repository:
Next steps
The merged table produced here is the standard input for the downstream tutorials in this resource:
Bulk RNA-seq — differential expression.
Over-representation analysis — GO / KEGG over-representation on a DE gene list.
Gene set enrichment analysis — GSEA on a ranked gene list.