The Golden Rule
A model assumes that whatever data you feed it is good data.
This is why understanding your dataset before modelling is critical. Garbage in, garbage out — a model trained on bad data will produce bad predictions, no matter how sophisticated the algorithm.
Regression vs Classification — Know Your Task
Before anything else, identify what kind of problem you're solving:
- Classification — sorting things into buckets (spam/not spam, cat/dog)
- Regression — predicting a number (house price, temperature)
The type of problem determines which algorithms, metrics, and visualisations you'll use.
What is EDA?
Exploratory Data Analysis is the process of examining your data to understand its structure, quality, and quirks before modelling. Think of it as the research phase of a design project — you wouldn't start wireframing without understanding the problem space.
Why Do EDA?
- Understand structure — what does the data look like?
- Spot quality issues — missing values, outliers, duplicates
- Identify feature types — categorical vs numerical
- Prepare for modelling — inform feature selection and preprocessing
The key question: What kind of data are we working with?
DataFrames — Your Canvas
A DataFrame is the standard format for tabular data. Think of it as a spreadsheet with:
- Headers — column names
- Rows — individual records
- Columns — different data types per column
In Python, you'll use Pandas to create and manipulate DataFrames, and NumPy for numerical operations.
```python
import pandas as pd
import numpy as np

df = pd.read_csv("water_samples.csv")

df.shape       # (rows, columns)
df.info()      # data types, non-null counts
df.describe()  # statistical summary
```

Describing Your Data
Count
How many entries are in each column? Missing data shows up as a lower count.
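For example, with a hypothetical toy frame containing one missing pH reading (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: one missing pH value
df = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, 7.4],
    "site": ["A", "B", "C", "D"],
})

counts = df.count()        # non-null entries per column
missing = df.isna().sum()  # missing entries per column
```

Here `counts["ph"]` is 3 while `counts["site"]` is 4, so the lower count immediately flags the missing value.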
Central Tendency — What's the "Typical" Value?
| Measure | What it tells you | Use when... |
|---|---|---|
| Mean | Average of all values | Data is symmetric, no extreme outliers |
| Median | Middle value when sorted | Data is skewed or contains outliers |
| Mode | Most frequent value | Data is categorical, frequency matters more than magnitude |
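A quick sketch of how an outlier separates the mean from the median, using made-up numbers:

```python
import pandas as pd

# Toy skewed sample: one extreme value drags the mean upward
prices = pd.Series([100, 110, 120, 130, 1000])

mean = prices.mean()      # 292.0, pulled up by the outlier
median = prices.median()  # 120.0, unaffected by the outlier

# Mode suits categorical data
colours = pd.Series(["red", "red", "blue"])
mode = colours.mode()[0]  # most frequent category
```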
Dispersion — How Spread Out is the Data?
| Measure | Definition | Note |
|---|---|---|
| Range | Max − Min | Simple but sensitive to outliers |
| Variance | Average squared distance from the mean | Weights high deviations heavily |
| Standard Deviation | Square root of variance | Same unit as the data, more interpretable |
| IQR (Interquartile Range) | Q3 − Q1 (middle 50% of data) | Robust to outliers |
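All four measures are one-liners in pandas. A sketch with a small toy series (note that pandas defaults to the *sample* variance, `ddof=1`; `ddof=0` is used here to get a round population value):

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

rng = values.max() - values.min()  # range: max minus min
var = values.var(ddof=0)           # population variance
std = values.std(ddof=0)           # same units as the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # spread of the middle 50%
```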
Shape — How is the Data Distributed?
- Skewness — is the distribution symmetric or lopsided?
- Kurtosis — how peaked or flat is the distribution?
  - Positive kurtosis = sharp peak, heavy tails
  - Negative kurtosis = flat top, light tails
- Percentiles — what value does X% of the data fall below?
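pandas exposes all three directly. A sketch with toy series (note that `.kurtosis()` returns *excess* kurtosis, where 0 is normal-like):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
skewed = pd.Series([1, 1, 1, 2, 10])

sym_skew = symmetric.skew()    # ~0 for a symmetric sample
pos_skew = skewed.skew()       # positive: long right tail
kurt = skewed.kurtosis()       # excess kurtosis
p90 = skewed.quantile(0.9)     # 90% of values fall at or below this
```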
Correlation vs Causation
This distinction trips up even experienced practitioners:
Correlation — two variables are related. A might be associated with B.
Causation — A actually causes B. This must be empirically proven through controlled experiments.
Ice cream sales correlate with drowning incidents — but ice cream doesn't cause drowning. Both increase in summer.
Cross Tabulation
A cross-tabulation table lets you explore relationships between categorical variables. It's a foundational tool in statistical modelling.
It's essentially what powers a confusion matrix — a table that shows how well a classifier's predictions match reality. We'll cover that in detail when we get to classification models.
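As a quick sketch with a toy frame (the column names mirror the water-samples example, but the values are made up):

```python
import pandas as pd

# Made-up stand-in for the water-samples data
df = pd.DataFrame({
    "land_use": ["farm", "farm", "urban", "urban", "forest"],
    "water_quality": ["poor", "good", "poor", "poor", "good"],
})

# Raw counts per (land_use, water_quality) combination
table = pd.crosstab(df["land_use"], df["water_quality"])

# Row-normalised: share of each quality level within each land use
shares = pd.crosstab(df["land_use"], df["water_quality"],
                     normalize="index")
```

`normalize="index"` is handy when the groups have very different sizes, since raw counts alone can mislead.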
```python
pd.crosstab(df["land_use"], df["water_quality"])
```

Checking Correlations
For numerical relationships:
```python
print(df.corr())
```

This produces a correlation matrix showing how each pair of numerical columns relates. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation).
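A common follow-up is scanning the matrix for strongly correlated pairs. A minimal sketch with made-up columns, assuming a threshold of 0.8:

```python
import numpy as np
import pandas as pd

# Toy numeric frame (column names are illustrative)
df = pd.DataFrame({
    "ph":       [6.8, 7.0, 7.2, 7.4],
    "nitrates": [4.0, 3.0, 2.0, 1.0],
    "flow":     [2.0, 1.0, 3.0, 2.0],
})

corr = df.corr()

# Mask the diagonal (every column correlates perfectly with itself),
# then keep pairs whose absolute correlation exceeds the threshold
mask = ~np.eye(len(corr), dtype=bool)
strong = corr.where(mask).abs().stack()
strong = strong[strong > 0.8]
```

Here `strong` picks out the `ph`/`nitrates` pair (each pair appears twice, once per ordering, since the matrix is symmetric).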
The Designer's Takeaway
Data analysis is to ML what user research is to design. You wouldn't design a product without understanding your users — don't build a model without understanding your data.