Python
This tutorial is a hands-on introduction to data analysis in Python for people who are new to the language. It walks through a typical workflow on a real biology dataset: bulk RNA-seq gene expression measured across eight tissues in human and mouse. By the end you will have loaded a table from disk, cleaned it up, made some plots, combined two datasets, and saved the result back out.
The material is adapted from the Applied data analysis for beginners module of the NYUAD Core Bioinformatics Workshops (Start Analytics 2025). Source data and notebooks are at NYUAD-Core-Bioinformatics/StartAnalytics_2025.
No prior Python experience is assumed. If you already know Python, skim the “What you need to know first” boxes and read the code.
What you need to know first
A few terms that come up over and over:
Library (also called a “package” or “module”) — a chunk of pre-written code you can use. We use four:
pandas— tables of data (think Excel sheets in code).numpy— fast math on numbers, arrays, and matrices.matplotlib— the standard plotting library.seaborn— nicer-looking plots built on top of matplotlib.
DataFrame — a table with named rows and named columns. This is what
pandasgives you. Each column can hold a different type (numbers, text, dates).Index — the labels on the rows of a DataFrame. In our case the index is the gene name (e.g.
STAG2); the column names are the sample names (e.g.human_brain_4).Axis —
axis=0means “go down columns” (operate on rows).axis=1means “go across rows” (operate on columns). It is the opposite of what most people guess on the first try.
Workflow overview
A typical analysis on a gene-expression table follows six steps:
Load the data into a DataFrame.
Filter out empty samples and unexpressed genes.
Slice the table to focus on samples or genes of interest.
Transform the numbers (normalize, log-transform) so plots and comparisons are meaningful.
Merge datasets (e.g. human + mouse) and check that the biology makes sense via a correlation heatmap.
Export the cleaned table to CSV for downstream work.
Data shape
The data table looks like this:
rows — genes (one per row, e.g.
STAG2,GOSR2, …).columns — samples / experiments (e.g.
human_brain_4,mouse_liver_5).values — raw “read counts”: how many sequencing reads from that sample mapped to that gene. A higher number means the gene is more strongly expressed in that sample.
Before any analysis, ask yourself: Are there empty samples? Do all genes have a value? Which features are irrelevant? Are there outliers or missing data? Garbage in, garbage out.
Environment setup
Install the libraries once from a terminal:
pip install numpy pandas matplotlib seaborn
Then, at the top of your Python script or notebook, import them.
The as np / as pd part gives each library a short nickname
so you do not have to type the full name every time.
import numpy as np # numerical math
import pandas as pd # tables (DataFrames)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # nicer plots
import warnings
warnings.filterwarnings("ignore") # hide non-critical warnings
Inputs
The example uses three CSV files (provided as gzipped CSVs in the
workshop repo). Python’s pandas reads .csv.gz natively — no
need to unzip them first.
Download the three files from
NYUAD-Core-Bioinformatics/StartAnalytics_2025 / notebooks / data
and place them in a data/ folder next to your script or
notebook:
data/human_expression.csv.gz— raw read counts, 14,744 genes × 8 human tissue samples.data/mouse_expression.csv.gz— raw read counts, 14,744 genes × 8 mouse tissue samples.data/human_mouse_gene_map.csv.gz— a lookup table pairing each human gene symbol (uppercase, e.g.STAG2) with its mouse equivalent (capitalized, e.g.Stag2).
Or grab all three from the terminal:
mkdir -p data
BASE=https://raw.githubusercontent.com/NYUAD-Core-Bioinformatics/StartAnalytics_2025/main/notebooks/data
curl -L -o data/human_expression.csv.gz $BASE/human_expression.csv.gz
curl -L -o data/mouse_expression.csv.gz $BASE/mouse_expression.csv.gz
curl -L -o data/human_mouse_gene_map.csv.gz $BASE/human_mouse_gene_map.csv.gz
1. Loading data
pd.read_csv reads a CSV file into a DataFrame. The
index_col=0 argument says “use the first column of the file as
the row labels” — so gene names become the index rather than a
plain data column.
mouse = pd.read_csv('data/mouse_expression.csv.gz', index_col=0)
human = pd.read_csv('data/human_expression.csv.gz', index_col=0)
gene_map = pd.read_csv('data/human_mouse_gene_map.csv.gz', index_col=0)
A few quick ways to look at what just got loaded:
human.shape # (rows, cols) tuple — here (14744, 8)
human.head() # first 5 rows
human.columns # the column (sample) names
human.index # the row (gene) names
Typing the name of a DataFrame on its own line in a notebook
prints a preview of it. In a regular script use print(human).
2. Filtering
Filtering means dropping rows or columns that carry no useful information.
Empty samples
.sum() adds up the values. By default it sums each column,
giving the total number of reads per sample. Any sample with a
total of 0 sequenced badly and should be removed.
human.sum() # one total per column
In this dataset every sample has millions of reads, so nothing to drop here.
Genes with no expression
To check the other direction — genes that are zero in every
sample — pass axis=1 so the sum runs across each row instead
of each column:
human.sum(axis=1).describe() # summary stats: min, max, mean, etc.
.describe() prints quick summary statistics for whatever it is
called on. Useful for sanity-checking distributions.
There are 674 genes with zero counts in every tissue. We can see
them with a boolean filter — human.sum(1) == 0 produces a
Series of True/False, and using it inside human[...] keeps
only the rows where it is True:
empty_genes = human[human.sum(1) == 0]
empty_genes.shape # (674, 8)
Drop those rows and keep only genes expressed in at least one
tissue. .gt(0) is the same as > 0 in this context:
human = human[human.sum(1).gt(0)]
human.shape # (14070, 8)
Note
human.sum(1).gt(0) and ~human.sum(1).eq(0) (read: “not
equal to zero”) do the same thing. ~ is the “not” operator
for boolean Series.
3. Slicing
Slicing means picking a piece of the table — particular rows,
particular columns, or both. Square brackets [] are the main
tool.
Pick columns whose name contains a word
The expression [c for c in human.columns if "brain" in c] is a
list comprehension. Read it left-to-right: “make a list of
c for each c in human.columns, but only when the text
brain appears in c.”
brain_cols = [c for c in human.columns if "brain" in c]
brain_cols # ['human_brain_4']
human[brain_cols]
Pick rows by gene name
Same trick on the row labels (human.index). The DLX genes
are a family of homeobox transcription factors active in the
brain.
dlx_rows = [g for g in human.index if "DLX" in g]
human.loc[dlx_rows]
human.loc[...] selects rows by their label. Plain
human[...] selects columns. They look similar but they target
different axes, which is a common point of confusion.
Pick rows and columns at once
.loc accepts both row and column lists:
subset = human.loc[dlx_rows, brain_cols]
n_rows, n_cols = subset.shape
print(f"subset has: {n_rows} rows, and {n_cols} cols")
subset
The f"..." thing is an f-string: Python plugs the value of
n_rows and n_cols into the string for you.
Note
.loc[rows, cols] selects by label.
.iloc[rows, cols] selects by integer position
(.iloc[0:5] = first five rows). Mixing them is a common
source of off-by-one bugs — pick one and stick with it per
call.
4. Transforming data
Raw counts are hard to interpret directly for two reasons:
A handful of highly-expressed genes are so much bigger than everything else that they dominate any plot or statistic (outliers).
Different samples were sequenced to different depths, so a gene can look “higher” in one sample purely because that sample was sequenced more deeply, not because of biology.
We fix both with two standard transformations:
Normalize by library size (counts per million, CPM) — scale each sample so all samples sum to the same total, making them comparable.
Log-transform (
log1p(x)=log(1 + x)) — compresses the long tail of highly-expressed genes. We uselog1pinstead of plainlogso that zero counts stay defined (log(0)is undefined;log1p(0) = 0).
A quick tour of matplotlib and seaborn
Before the first plot, a short mental model.
matplotlib is the foundational plotting library. Two ways to use it:
Pandas/seaborn shortcuts — many objects know how to plot themselves.
human.plot.box(...)orsns.violinplot(data=human)build a matplotlib figure for you and return theAxes(one panel) so you can customize it.Direct matplotlib —
fig, axes = plt.subplots(...)creates an explicit figure and one or moreAxespanels; you then passax=...to the shortcut to draw into a specific panel.
Useful pieces you will see below:
plt.title("...")/plt.xlabel(...)/plt.ylabel(...)— labels on the current panel.plt.xticks(rotation=90)— tilt tick labels so long sample names fit.plt.tight_layout()— auto-fix overlapping titles/labels when you have multiple panels.plt.show()— display the figure (only needed in plain scripts; notebooks show it automatically).
seaborn is built on top of matplotlib and adds two things we use:
Nicer defaults and palettes (
palette="pastel",cmap="coolwarm").Higher-level plot types that work directly on DataFrames:
sns.boxplot,sns.violinplot,sns.heatmap,sns.clustermap.
One practical note: when you mix pandas’s .plot.box() with
seaborn calls and plt.subplots, all three are drawing onto
the same matplotlib figure underneath. That is why a
plt.title(...) call right after a .plot.box(...) line
still affects the right plot.
Boxplot of raw counts
A boxplot summarizes each sample’s distribution: the box is the middle 50% of values, the whiskers extend outward, and outliers appear as individual dots.
colors = [
"lightblue", "lightgreen", "lightpink", "lightsalmon",
"lightyellow", "powderblue", "lavender", "thistle",
]
ax = human.plot.box(figsize=(8, 6), patch_artist=True)
for patch, color in zip(ax.artists, colors):
patch.set_facecolor(color)
plt.title("Box plot of expression by samples", fontsize=14)
plt.xlabel("Tissue sample", fontsize=12)
plt.ylabel("Expression", fontsize=12)
plt.xticks(rotation=90) # tilt sample names so they fit
plt.show()
The raw plot is unreadable: a few enormous outlier genes squash the rest of the distribution onto the baseline.
Log1p transform
np.log1p(human) returns a new DataFrame where every value has
been replaced with log(1 + value). The original human
table is not modified.
ax = np.log1p(human).plot.box(figsize=(8, 6), patch_artist=True)
for i, patch in enumerate(ax.patches[:len(human.columns)]):
patch.set_facecolor(colors[i])
patch.set_edgecolor("black")
patch.set_linewidth(1.5)
plt.title("Box plot of log1p expression by samples", fontsize=14)
plt.xlabel("Tissue sample", fontsize=12)
plt.ylabel("log1p Expression", fontsize=12)
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Now the distributions are visible. But samples still differ in overall level because of library size — that is what the next step fixes.
Normalize by library size (CPM)
Divide every value in a column by that column’s total (so each column now sums to 1), then multiply by one million for readable numbers:
normed_human = human.div(human.sum(axis=0), axis=1) * 1e6
normed_human.loc[dlx_rows]
human.sum(axis=0)gives one total per column.human.div(..., axis=1)divides each row ofhumanby those totals, column-by-column.* 1e6rescales to counts per million.
Visually compare all three views — raw, normalized, log-normalized
— side by side. plt.subplots(1, 3) makes one figure with
three panels.
fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=False)
titles = ['Raw expression', 'Normed', 'l1p Normed']
for i, (df, ax) in enumerate(
zip([human, normed_human, np.log1p(normed_human)], axes)
):
boxes = df.boxplot(
ax=ax, patch_artist=True,
return_type="dict", boxprops=dict(color="black"),
)
for patch, color in zip(boxes['boxes'], colors):
patch.set_facecolor(color)
ax.set_title(titles[i])
if i == 0:
ax.set_ylabel("Expression")
ax.set_xticklabels(df.columns, rotation=90)
plt.tight_layout()
plt.show()
Violin plots show the full distribution shape, not just the quartiles:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for i, (df, ax) in enumerate(
zip([human, normed_human, np.log1p(normed_human)], axes)
):
sns.violinplot(data=df, ax=ax, palette="pastel", inner="quartile")
ax.set_title(titles[i])
ax.set_xticklabels(df.columns, rotation=90)
plt.tight_layout()
plt.show()
Fold change before and after transformation
Fold change asks: how much higher (or lower) is gene X in
sample B compared to sample A? The simple formula is
(B - A) / A: a value of 1 means twice as high, -0.5
means half as high, 0 means equal.
Computing fold change on raw counts gives implausibly large values driven by sequencing depth. Computing it on log1p-normalized values gives interpretable numbers.
A small helper function. Defining a function avoids repeating the same lines three times:
def fold_change(df, ref, rows):
a = df.loc[rows, ref] # reference column
cols = df.columns[df.columns != ref] # everything else
fc = {c: (df.loc[rows, c] - a) / a for c in cols}
fc = pd.DataFrame(fc).fillna(0) # NaN -> 0
fc = fc.mask(np.isinf(fc), df.loc[rows, cols]) # inf -> raw val
return fc
fc_raw = fold_change(human, "human_brain_4", dlx_rows)
fc_normed = fold_change(normed_human, "human_brain_4", dlx_rows)
fc_l1p_normed = fold_change(np.log1p(normed_human),"human_brain_4", dlx_rows)
.fillna(0) replaces missing values (NaN, from 0 / 0) with
zero. .mask(condition, other) replaces values where
condition is True with values from other — here we
replace infinities (from dividing by zero) with the raw value, as
a fallback.
Heatmap comparison — colour-coded so the difference is obvious:
fig, axes = plt.subplots(1, 3, figsize=(15, 6), sharey=True)
sns.heatmap(fc_raw, annot=True, fmt=".2f",
cmap="coolwarm", ax=axes[0], cbar=False)
axes[0].set_title("Raw expression FC")
sns.heatmap(fc_normed, annot=True, fmt=".2f",
cmap="coolwarm", ax=axes[1], cbar=False)
axes[1].set_title("Normed FC")
sns.heatmap(fc_l1p_normed, annot=True, fmt=".2f",
cmap="coolwarm", ax=axes[2], cbar=True)
axes[2].set_title("l1p Normed FC")
plt.tight_layout()
plt.show()
Raw fold-changes contain values like 19 or -1 driven
entirely by library size; the log1p-normalized panel shows
interpretable values in roughly [-1, 1].
5. Merging data
To compare expression across species, we need to align genes:
STAG2 in human is the same gene as Stag2 in mouse. The
orthologue map provides that link.
gene_map.head()
# human mouse
# Stag2_STAG2 STAG2 Stag2
# Stag1_STAG1 STAG1 Stag1
# ...
Process the mouse table the same way (normalize + log1p), then
merge each species’ table onto the gene map. merge is
pandas’s version of a SQL join — it lines up rows by a key
column.
normed_mouse = np.log1p(mouse.div(mouse.sum(axis=0), axis=1) * 1e6)
star_normed_mouse = normed_mouse.merge(
gene_map, left_index=True, right_on='mouse',
)
star_normed_human = np.log1p(normed_human).merge(
gene_map, left_index=True, right_on='human',
)
left_index=True— use the left table’s row index as the join key.right_on='mouse'— match against themousecolumn on the right table.
After both merges, the two tables share the same row labels
(Stag2_STAG2, etc.). Join them and drop the duplicate
human/mouse columns introduced by the second merge
(suffixes=('', '_drop') tags the duplicates so we can find
and remove them):
merged_df = star_normed_mouse.merge(
star_normed_human,
right_index=True, left_index=True,
suffixes=('', '_drop'),
)
merged_df = merged_df.loc[
:, ~merged_df.columns.str.endswith('_drop')
]
merged_df.shape # (14744, 18)
Sample correlation
A correlation heatmap is a quick sanity check: matched tissues (human brain ↔ mouse brain) should correlate more strongly than mismatched tissues, even across species.
.corr() computes the pairwise Pearson correlation between
every pair of columns.
correlation_matrix = merged_df.loc[
:, ~merged_df.columns.isin(['human', 'mouse'])
].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix,
annot=True, cmap='coolwarm',
fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
sns.clustermap(correlation_matrix,
annot=True, cmap='coolwarm',
fmt=".2f", linewidths=0.5)
plt.show()
sns.clustermap reorders the rows and columns so similar
samples sit next to each other, making tissue clusters jump out
even when the original ordering was arbitrary.
6. Exporting data
Save the cleaned, transformed, merged table to disk so the next step in your pipeline (differential expression, enrichment analysis, …) can read it back.
merged_df.to_csv('./data/human_mouse_merged_expression.csv')
Useful options:
index=False— do not write the row labels (use this if you do not want a leading unnamed column when you re-read the file).Pass a
.csv.gzfilename andpandaswrites it gzipped for free.
7. Results
The expected output files generated throughout this tutorial are available in the course repository:
Next steps
The merged table produced here is the standard input for the downstream tutorials in this resource:
Bulk RNA-seq — differential expression.
Over-representation analysis — GO / KEGG over-representation on a DE gene list.
Gene set enrichment analysis — GSEA on a ranked gene list.