Why Visualise?
A good chart answers a question instantly. In ML, visualisation helps you spot patterns, outliers, and distributions that summary statistics alone might miss.
Every chart should have:
- A clear title
- Labelled X and Y axes with units
- A caption or legend when needed
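As a minimal sketch of that checklist, the snippet below draws one small chart and sets each element explicitly. The readings and site names are made up for illustration, and the `Agg` backend is chosen only so the code runs without a display (drop that line in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up monthly pH readings for two hypothetical sites
months = [1, 2, 3, 4, 5, 6]
site_a = [6.8, 6.9, 7.1, 7.0, 7.2, 7.3]
site_b = [7.4, 7.3, 7.5, 7.6, 7.4, 7.7]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, site_a, marker="o", label="Site A")
ax.plot(months, site_b, marker="s", label="Site B")

ax.set_title("Monthly Water pH by Site")   # clear title
ax.set_xlabel("Month (1 = January)")       # X axis label, with how to read it
ax.set_ylabel("pH (0-14 scale)")           # Y axis label, with its scale
ax.legend(title="Sampling site")           # legend, needed because there are two lines
```

With a single series the legend could be dropped and the series named in the title instead.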
The Two Main Libraries
Matplotlib — the foundational plotting library. Gives you full control.
Seaborn — built on top of Matplotlib with a higher-level API. Makes statistical plots easier and more visually polished.
import matplotlib.pyplot as plt
import seaborn as sns
Univariate Analysis: One Variable at a Time
When you're exploring a single variable, you want to understand its:
- Distribution — where do values cluster?
- Spread — how wide is the range?
- Shape — symmetric, skewed, multi-modal?
- Outliers — any extreme values?
Histogram
Shows the frequency distribution of a variable. The value range is split into bins, and each bar shows how many data points fall in that bin.
plt.hist(df["pH"], bins=20, edgecolor="black")
plt.xlabel("pH Level")
plt.ylabel("Count")
plt.title("Distribution of Water pH Levels")
plt.show()
Interpreting: Is it symmetric? Skewed left or right? Are there multiple peaks?
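To see what the bars represent, the bin counts can be computed directly. This is a rough sketch using `np.histogram` on a small made-up pH sample. The fixed `range=(6.0, 8.0)` is an assumption chosen so the bin edges land on round numbers.

```python
import numpy as np

# Made-up pH readings, purely for illustration
ph = np.array([6.1, 6.4, 6.8, 6.9, 7.0, 7.0, 7.1, 7.2, 7.3, 7.9])

# Four equal-width bins between 6.0 and 8.0 (each 0.5 wide)
counts, edges = np.histogram(ph, bins=4, range=(6.0, 8.0))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n} readings")
```

Changing `bins` changes the picture: too few bins hide peaks, too many turn the chart into noise, so it is worth trying a couple of values.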
Density Plot (KDE)
A smoothed version of a histogram. Shows the probability density rather than raw counts.
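One way to picture the smoothing: a small bell curve is placed on every data point and the curves are averaged. The sketch below does this by hand with a Gaussian kernel and a hand-picked bandwidth of 0.3. Both are assumptions for illustration; seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up pH readings, purely for illustration
ph = [6.1, 6.4, 6.8, 6.9, 7.0, 7.0, 7.1, 7.2, 7.3, 7.9]
grid = np.linspace(4.0, 10.0, 601)
density = gaussian_kde_1d(ph, grid, bandwidth=0.3)

# Unlike histogram counts, the area under a density curve is about 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under curve: {area:.3f}")
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one follows individual points more closely.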
sns.kdeplot(df["pH"], fill=True)
Box Plot
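Compact summary of a distribution built around five values:
- Lower whisker: the smallest data point at or above Q1 - 1.5 × IQR
- Q1 (25th percentile)
- Median (50th percentile)
- Q3 (75th percentile)
- Upper whisker: the largest data point at or below Q3 + 1.5 × IQR
Here IQR = Q3 - Q1 is the interquartile range. Anything beyond the whiskers is drawn as an individual point and treated as a potential outlier.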
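The numbers behind a box plot can be worked out by hand. The sketch below uses a made-up pH sample and the 1.5 × IQR rule, where IQR = Q3 - Q1. It shows that the whiskers stop at the most extreme data points inside the fences, not at the fences themselves.

```python
import numpy as np

# Made-up pH readings with one unusually low and one unusually high value
ph = np.array([6.1, 6.4, 6.8, 6.9, 7.0, 7.0, 7.1, 7.2, 7.3, 9.5])

q1, median, q3 = np.percentile(ph, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = ph[(ph >= lower_fence) & (ph <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually as potential outliers
outliers = ph[(ph < lower_fence) | (ph > upper_fence)]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```

A flagged point is a candidate for a closer look, not automatically an error. It may be a genuine extreme reading.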
sns.boxplot(x=df["pH"])
Violin Plot
Combines a box plot with a density plot, so you see both the summary statistics and the shape of the distribution. Useful when you want to understand how the data is distributed, not just where the quartiles fall.
sns.violinplot(x="land_use", y="pH", data=df)
Pie Chart & Area Plot
Both show how a whole splits into categories, though bar charts are often more readable because bar lengths are easier to compare than angles or areas.
Understanding Skewness
- Positive skew (right-skewed) — long tail to the right, bulk of data on the left. Often caused by a "floor effect" where values can't go below a minimum.
- Negative skew (left-skewed) — long tail to the left. Often caused by a "ceiling effect" where values hit a maximum.
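Skew can also be checked numerically. The sketch below computes the moment-based skewness coefficient (third central moment divided by the variance raised to the power 1.5) on a made-up sample and its mirror image. The helper name `skewness` is just for this illustration.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: positive for a right tail, negative for a left tail."""
    x = np.asarray(values, dtype=float)
    centred = x - x.mean()
    m2 = np.mean(centred**2)  # variance (second central moment)
    m3 = np.mean(centred**3)  # third central moment
    return float(m3 / m2**1.5)

right_skewed = [1, 1, 2, 2, 3, 10]        # made-up values with a long right tail
left_skewed = [-v for v in right_skewed]  # mirror image: long left tail

right_value = skewness(right_skewed)
left_value = skewness(left_skewed)
print(f"right-skewed sample: {right_value:.2f}")
print(f"left-skewed sample:  {left_value:.2f}")

# The long tail pulls the mean away from the median, towards the tail
print(np.mean(right_skewed) > np.median(right_skewed))
```

Comparing mean and median is a quick first check: in this right-skewed sample the mean sits above the median.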
Bivariate Analysis — Two Variables
Scatter Plot with Regression Line
Shows the relationship between two numerical variables. Add a best-fit line to see the trend:
sns.regplot(x="feature_a", y="feature_b", data=df)
The best-fit line follows y = mx + c, where m is the slope and c is the intercept. This is the foundation of linear regression.
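The slope m and intercept c of a best-fit line can be computed directly. This is a minimal sketch using `np.polyfit` for an ordinary least-squares fit on made-up points. It is meant to show what a straight-line fit produces, not to reproduce seaborn's internals.

```python
import numpy as np

# Made-up points that roughly follow y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Degree-1 polynomial fit returns the slope first, then the intercept
m, c = np.polyfit(x, y, deg=1)
predicted = m * x + c
residuals = y - predicted  # vertical distance of each point from the line

print(f"y = {m:.2f}x + {c:.2f}")
```

The residuals are worth plotting too: a visible pattern in them suggests the relationship is not well described by a straight line.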
Contour Plot / 3D Plot
For visualising relationships between three variables:
from mpl_toolkits.mplot3d import Axes3D  # only needed on Matplotlib older than 3.2
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(df["x"], df["y"], df["z"])
plt.show()
Exploring Combinations
The type of chart depends on the types of variables you're comparing:
| Combination | Recommended Charts |
|---|---|
| Numerical × Numerical | Scatter plot, heatmap, regression plot |
| Categorical × Numerical | Box plot, violin plot, bar chart |
| Categorical × Categorical | Cross-tabulation table, stacked bar chart |
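For the categorical × categorical row, a cross-tabulation simply counts how often each pair of categories occurs together. The sketch below uses `pd.crosstab` on a tiny made-up table. The `ph_band` column and its labels are hypothetical, introduced only for this example.

```python
import pandas as pd

# Made-up samples: one row per water sample
samples = pd.DataFrame({
    "land_use": ["farm", "farm", "forest", "urban", "urban", "urban"],
    "ph_band": ["acidic", "neutral", "neutral", "neutral", "alkaline", "alkaline"],
})

# Rows are land-use categories, columns are pH bands, cells are counts
table = pd.crosstab(samples["land_use"], samples["ph_band"])
print(table)
```

The same table can feed a stacked bar chart, for example with `table.plot(kind="bar", stacked=True)`.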
Multiple Plots on One Figure
You can create subplots to compare distributions side by side:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.histplot(df["pH"], ax=axes[0])
sns.boxplot(x=df["pH"], ax=axes[1])
sns.kdeplot(df["pH"], fill=True, ax=axes[2])
plt.tight_layout()
plt.show()
The Designer's Takeaway
As a designer, you already think visually. These charts are your tools for building intuition about data before it goes into a model. The better you understand the data's shape, the better you can design around the model's strengths and limitations.