Data Visualisation for Machine Learning

Why Visualise?

A good chart answers a question instantly. In ML, visualisation helps you spot patterns, outliers, and distributions that summary statistics alone might miss.

Every chart should have:

A clear title
Labelled X and Y axes with units
A caption or legend when needed
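
The checklist above can be sketched with Matplotlib roughly as follows. The temperature readings, the label text, and the helper name plot_daily_temperature are invented for illustration; only the Matplotlib calls themselves are real.

```python
import matplotlib.pyplot as plt


def plot_daily_temperature():
    """Draw a small line chart that covers every item on the checklist."""
    # Hypothetical readings, invented purely for this example.
    hours = [0, 6, 12, 18, 24]
    indoor = [19.0, 18.5, 21.0, 22.5, 20.0]
    outdoor = [8.0, 6.5, 14.0, 12.0, 9.0]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(hours, indoor, marker="o", label="Indoor")
    ax.plot(hours, outdoor, marker="s", label="Outdoor")

    ax.set_title("Temperature Over One Day")        # clear title
    ax.set_xlabel("Time (hours since midnight)")    # axis label with units
    ax.set_ylabel("Temperature (°C)")               # axis label with units
    ax.legend()                                     # legend, since there are two series
    return fig, ax


fig, ax = plot_daily_temperature()
```

Calling plt.show() afterwards displays the figure in an interactive session.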

The Two Main Libraries

Matplotlib — the foundational plotting library. Gives you full control.

Seaborn — built on top of Matplotlib with a higher-level API. Makes statistical plots easier and more visually polished.

import matplotlib.pyplot as plt
import seaborn as sns
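
The snippets in these notes refer to a pandas DataFrame called df, but the notes do not show where it comes from. To run them, a stand-in can be generated. Every value below is synthetic, and the make_demo_frame helper is an assumption added for illustration, not part of the original material.

```python
import numpy as np
import pandas as pd


def make_demo_frame(n=200, seed=0):
    """Build a synthetic DataFrame with column names used in these notes."""
    rng = np.random.default_rng(seed)  # seeded so the output is repeatable
    feature_a = rng.uniform(0, 10, size=n)
    return pd.DataFrame({
        # Made-up pH readings centred on neutral.
        "pH": rng.normal(loc=7.0, scale=0.6, size=n),
        # Made-up land-use categories.
        "land_use": rng.choice(["forest", "farm", "urban"], size=n),
        "feature_a": feature_a,
        # feature_b follows feature_a linearly, plus noise.
        "feature_b": 2.0 * feature_a + rng.normal(0, 1.5, size=n),
    })


df = make_demo_frame()
print(df.head())
```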

Univariate Analysis — One Variable at a Time

When you're exploring a single variable, you want to understand its:

Distribution — where do values cluster?
Spread — how wide is the range?
Shape — symmetric, skewed, multi-modal?
Outliers — any extreme values?

Histogram

Shows frequency distribution. Bars represent how many data points fall in each bin.

plt.hist(df["pH"], bins=20, edgecolor="black")
plt.xlabel("pH Level")
plt.ylabel("Count")
plt.title("Distribution of Water pH Levels")
plt.show()

Interpreting: Is it symmetric? Skewed left or right? Multiple peaks?
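
The counts behind a histogram can be inspected directly with np.histogram, which may help when judging whether a peak is real or just an artefact of the bin width. The sketch below uses synthetic values as a stand-in for the pH column.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=7.0, scale=0.6, size=500)  # synthetic stand-in for df["pH"]

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=20)

print(len(counts), len(edges))  # 20 21 (there is always one more edge than bins)
print(counts.sum())             # 500 (every value lands in exactly one bin)

# Fewer bins smooth the shape; more bins show detail but also more noise.
for bins in (5, 20, 80):
    c, _ = np.histogram(values, bins=bins)
    print(f"bins={bins:>2}  tallest bar holds {c.max()} points")
```

If the number of peaks changes as the bin count changes, treat the apparent multi-modality with some caution.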

Density Plot (KDE)

A smoothed version of a histogram. Shows an estimate of the probability density rather than raw counts, so the area under the curve is 1 and the y-axis values are not frequencies.

sns.kdeplot(df["pH"], fill=True)
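
To see what the smoothing does, here is a hand-rolled Gaussian KDE in NumPy. It is a minimal sketch for intuition, not Seaborn's implementation: the function name gaussian_kde_1d, the synthetic samples, and the rule-of-thumb bandwidth are all choices made for this example.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from each grid point to each sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)


rng = np.random.default_rng(0)
samples = rng.normal(loc=7.0, scale=0.6, size=300)  # synthetic pH-like values

# Rule-of-thumb bandwidth (Scott's rule for one dimension): std * n^(-1/5).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 3, samples.max() + 3, 1000)
density = gaussian_kde_1d(samples, grid, bandwidth)

# Approximate the area under the curve with a Riemann sum; it should be close to 1.
area = float((density * (grid[1] - grid[0])).sum())
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, much like choosing more or fewer histogram bins.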

Box Plot

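Compact summary of a distribution, built around the five-number summary:

Minimum (smallest value)
Q1 (25th percentile)
Median (50th percentile)
Q3 (75th percentile)
Maximum (largest value)

By default the whiskers do not always reach the minimum and maximum. They stop at the most extreme data points that lie within 1.5 × IQR of the box, where IQR = Q3 - Q1. Anything beyond the whiskers is drawn as an individual point and flagged as a potential outlier.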

sns.boxplot(x=df["pH"])
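
The numbers a box plot draws can be reproduced by hand, which makes the 1.5 × IQR rule concrete. The ten values below are made up, and the variable names are chosen for this sketch.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)  # made-up data

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: limits on how far the whiskers are allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually as potential outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(lower_fence, upper_fence)   # -3.5 14.5
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [30.]
```

Note that the upper whisker ends at 9, the largest value inside the fence, not at the fence value of 14.5 itself.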

Violin Plot

Combines a box plot with a density plot. You see both the summary statistics and the distribution shape. Useful when you want to understand how the data is distributed, not just where the quartiles fall.

sns.violinplot(x="land_use", y="pH", data=df)
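
The sketch below shows a case where the shape matters: one group has two clusters, which a box plot would hide. It swaps in Matplotlib's own violinplot rather than the Seaborn call above, and the three groups and their values are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic pH readings for three made-up land-use groups.
groups = {
    "forest": rng.normal(6.5, 0.4, size=150),
    "farm": rng.normal(7.2, 0.7, size=150),
    # Two clusters, so this group is bimodal.
    "urban": np.concatenate([rng.normal(6.8, 0.2, 75), rng.normal(8.0, 0.2, 75)]),
}

fig, ax = plt.subplots(figsize=(7, 4))
parts = ax.violinplot(list(groups.values()), showmedians=True)

# violinplot places the violins at x = 1, 2, 3 by default.
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("Land use")
ax.set_ylabel("pH")
ax.set_title("pH by Land Use (synthetic data)")
```

The "urban" violin should show two bulges, one for each cluster, while its median line sits in the sparse region between them.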

Pie Chart & Area Plot

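Pie charts show the proportions of categorical data, and stacked area plots show how those proportions change over an ordered variable such as time. Bar charts are often more readable, because lengths are easier to compare than angles or areas.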
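
A side-by-side comparison makes the readability point easier to judge. This is a minimal sketch with made-up category counts; the variable names are chosen for the example.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Made-up categorical observations.
land_use = ["forest"] * 12 + ["farm"] * 5 + ["urban"] * 3

counts = Counter(land_use)
total = sum(counts.values())
# Convert counts to shares of the whole, largest first.
shares = {name: n / total for name, n in counts.most_common()}
print(shares)  # {'forest': 0.6, 'farm': 0.25, 'urban': 0.15}

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

ax_pie.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.0f%%")
ax_pie.set_title("Share of Samples by Land Use (pie)")

ax_bar.bar(list(shares.keys()), [100 * s for s in shares.values()])
ax_bar.set_xlabel("Land use")
ax_bar.set_ylabel("Share of samples (%)")
ax_bar.set_title("Share of Samples by Land Use (bar)")

fig.tight_layout()
```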

Understanding Skewness

Positive skew (right-skewed) — long tail to the right, bulk of data on the left. Often caused by a "floor effect" where values can't go below a minimum.
Negative skew (left-skewed) — long tail to the left. Often caused by a "ceiling effect" where values hit a maximum.
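
Skewness can also be measured numerically. The sketch below computes the third standardised moment by hand on synthetic data; the helper name sample_skewness is an assumption for this example. It also prints the mean and median, since a long tail tends to pull the mean towards it.

```python
import numpy as np


def sample_skewness(values):
    """Third standardised moment: positive for a right tail, negative for a left tail."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return float((centred**3).mean() / (centred**2).mean() ** 1.5)


rng = np.random.default_rng(7)
# Synthetic right-skewed data: values cannot go below zero (a floor).
right_skewed = rng.exponential(scale=2.0, size=1000)
# Mirror it to get a left-skewed sample with a ceiling at 10.
left_skewed = 10.0 - right_skewed

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(
        name,
        "skew:", round(sample_skewness(data), 2),
        "mean:", round(float(data.mean()), 2),
        "median:", round(float(np.median(data)), 2),
    )
```

In the right-skewed sample the mean lands above the median, and in the mirrored sample it lands below.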

Bivariate Analysis — Two Variables

Scatter Plot with Regression Line

Shows the relationship between two numerical variables. Add a best-fit line to see the trend:

sns.regplot(x="feature_a", y="feature_b", data=df)

The best-fit line has the form y = mx + c, where m is the slope and c is the intercept. Fitting that line to data is simple linear regression.
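
The slope and intercept of such a line can be recovered with a least-squares fit. This is a minimal sketch using np.polyfit on synthetic data whose true slope and intercept are known, so the estimates can be checked by eye.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: true slope 2 and intercept 1, plus a little noise.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# A degree-1 fit returns the coefficients highest power first: [m, c].
m, c = np.polyfit(x, y, deg=1)
print(f"m = {m:.2f}, c = {c:.2f}")

# Predict y for a new x using the fitted line.
x_new = 4.0
y_pred = m * x_new + c
print(f"predicted y at x = {x_new}: {y_pred:.2f}")
```

The estimates should land close to 2 and 1, with the gap coming from the added noise.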

Contour Plot / 3D Plot

For visualising relationships between three variables:

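from mpl_toolkits.mplot3d import Axes3D  # only needed on Matplotlib older than 3.2

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(df["x"], df["y"], df["z"])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()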
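
A contour plot shows the same kind of three-variable relationship on a flat page, with the third variable encoded as colour. The sketch below uses a made-up bowl-shaped surface rather than columns from df.

```python
import numpy as np
import matplotlib.pyplot as plt

# Build a regular grid of (x, y) points.
x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(x, y)

# A made-up surface: a bowl with its lowest point at the origin.
Z = X**2 + Y**2

fig, ax = plt.subplots(figsize=(6, 5))
filled = ax.contourf(X, Y, Z, levels=15, cmap="viridis")
fig.colorbar(filled, ax=ax, label="z")  # the colour scale needs a label too

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Contour Plot of z = x^2 + y^2")
```

contourf expects values on a grid, so scattered measurements would need to be interpolated onto one first.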

Exploring Combinations

The type of chart depends on the types of variables you're comparing:

Combination	Recommended Charts
Numerical × Numerical	Scatter plot, heatmap, regression plot
Categorical × Numerical	Box plot, violin plot, bar chart
Categorical × Categorical	Cross-tabulation table, stacked bar chart
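
For the categorical × categorical case, a cross-tabulation can be built with pd.crosstab. The eight observations and the column names land_use and quality below are made up for illustration.

```python
import pandas as pd

# Made-up observations of two categorical variables.
df_cat = pd.DataFrame({
    "land_use": ["forest", "forest", "farm", "farm", "farm", "urban", "urban", "forest"],
    "quality": ["good", "good", "poor", "good", "poor", "poor", "poor", "good"],
})

# Rows are land_use, columns are quality, cells are counts.
table = pd.crosstab(df_cat["land_use"], df_cat["quality"])
print(table)

# normalize="index" turns each row into proportions that sum to 1.
row_shares = pd.crosstab(df_cat["land_use"], df_cat["quality"], normalize="index")
print(row_shares.round(2))
```

Calling table.plot(kind="bar", stacked=True) on the result draws the stacked bar chart mentioned in the table.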

Multiple Plots on One Figure

You can create subplots to compare distributions side by side:

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.histplot(df["pH"], ax=axes[0])
sns.boxplot(x=df["pH"], ax=axes[1])
sns.kdeplot(df["pH"], fill=True, ax=axes[2])
plt.tight_layout()
plt.show()

The Designer's Takeaway

As a designer, you already think visually. These charts are your tools for building intuition about data before it goes into a model. The better you understand the data's shape, the better you can design around the model's strengths and limitations.