The Golden Rule
A model assumes that whatever data you feed it is good data.
This is why understanding your dataset before modelling is critical. Garbage in, garbage out — a model trained on bad data will produce bad predictions, no matter how sophisticated the algorithm.
Regression vs Classification — Know Your Task
Before anything else, identify what kind of problem you're solving:
- Classification — sorting things into buckets (spam/not spam, cat/dog)
- Regression — predicting a number (house price, temperature)
The type of problem determines which algorithms, metrics, and visualisations you'll use.
What is EDA?
Exploratory Data Analysis is the process of examining your data to understand its structure, quality, and quirks before modelling. Think of it as the research phase of a design project — you wouldn't start wireframing without understanding the problem space.
Why Do EDA?
- Understand structure — what does the data look like?
- Spot quality issues — missing values, outliers, duplicates
- Identify feature types — categorical vs numerical
- Prepare for modelling — inform feature selection and preprocessing
The key question: What kind of data are we working with?
DataFrames — Your Canvas
A DataFrame is the standard format for tabular data. Think of it as a spreadsheet with:
- Headers — column names
- Rows — individual records
- Columns — different data types per column
In Python, you'll use Pandas to create and manipulate DataFrames, and NumPy for numerical operations.
```python
import pandas as pd
import numpy as np

df = pd.read_csv("water_samples.csv")

df.shape       # (rows, columns)
df.info()      # data types, non-null counts
df.describe()  # statistical summary
```

Describing Your Data
Count
How many entries are in each column? Missing data shows up as a lower count.
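For example, with a hypothetical toy frame containing one missing pH reading (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: one missing pH value
df = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, 7.4],
    "site": ["A", "B", "C", "D"],
})

counts = df.count()        # non-null entries per column
missing = df.isna().sum()  # missing entries per column
```

Here `counts["ph"]` is 3 while `counts["site"]` is 4, so the lower count immediately flags the missing value.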
Central Tendency — What's the "Typical" Value?
| Measure | What it tells you | Use when... |
|---|---|---|
| Mean | Average of all values | Data is symmetric, no extreme outliers |
| Median | Middle value when sorted | Data is skewed or contains outliers |
| Mode | Most frequent value | Data is categorical, frequency matters more than magnitude |
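A quick sketch of how an outlier separates the mean from the median, using made-up numbers:

```python
import pandas as pd

# Toy skewed sample: one extreme value drags the mean upward
prices = pd.Series([100, 110, 120, 130, 1000])

mean = prices.mean()      # 292.0, pulled up by the outlier
median = prices.median()  # 120.0, unaffected by the outlier

# Mode suits categorical data
colours = pd.Series(["red", "red", "blue"])
mode = colours.mode()[0]  # most frequent category
```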
Dispersion — How Spread Out is the Data?
| Measure | Definition | Note |
|---|---|---|
| Range | Max − Min | Simple but sensitive to outliers |
| Variance | Average squared distance from the mean | Weights high deviations heavily |
| Standard Deviation | Square root of variance | Same unit as the data, more interpretable |
| IQR (Interquartile Range) | Q3 − Q1 (middle 50% of data) | Robust to outliers |
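All four measures are one-liners in pandas. A sketch with a small toy series (note that pandas defaults to the *sample* variance, `ddof=1`; `ddof=0` is used here to get a round population value):

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

rng = values.max() - values.min()  # range: max minus min
var = values.var(ddof=0)           # population variance
std = values.std(ddof=0)           # same units as the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # spread of the middle 50%
```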
Shape — How is the Data Distributed?
- Skewness — is the distribution symmetric or lopsided?
- Kurtosis — how peaked or flat is the distribution?
  - Positive kurtosis = sharp peak, heavy tails
  - Negative kurtosis = flat top, light tails
- Percentiles — what value does X% of the data fall below?
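pandas exposes all three directly. A sketch with toy series (note that `.kurtosis()` returns *excess* kurtosis, where 0 is normal-like):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
skewed = pd.Series([1, 1, 1, 2, 10])

sym_skew = symmetric.skew()    # ~0 for a symmetric sample
pos_skew = skewed.skew()       # positive: long right tail
kurt = skewed.kurtosis()       # excess kurtosis
p90 = skewed.quantile(0.9)     # 90% of values fall at or below this
```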
Correlation vs Causation
This distinction trips up even experienced practitioners:
Correlation — two variables are related. A might be associated with B.
Causation — A actually causes B. This must be empirically proven through controlled experiments.
Ice cream sales correlate with drowning incidents — but ice cream doesn't cause drowning. Both increase in summer.
Cross Tabulation
A cross-tabulation table lets you explore relationships between categorical variables. It's a foundational tool in statistical modelling.
It's essentially what powers a confusion matrix — a table that shows how well a classifier's predictions match reality. We'll cover that in detail when we get to classification models.
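As a quick sketch with a toy frame (the column names mirror the water-samples example, but the values are made up):

```python
import pandas as pd

# Made-up stand-in for the water-samples data
df = pd.DataFrame({
    "land_use": ["farm", "farm", "urban", "urban", "forest"],
    "water_quality": ["poor", "good", "poor", "poor", "good"],
})

# Raw counts per (land_use, water_quality) combination
table = pd.crosstab(df["land_use"], df["water_quality"])

# Row-normalised: share of each quality level within each land use
shares = pd.crosstab(df["land_use"], df["water_quality"],
                     normalize="index")
```

`normalize="index"` is handy when the groups have very different sizes, since raw counts alone can mislead.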
```python
pd.crosstab(df["land_use"], df["water_quality"])
```

Checking Correlations
For numerical relationships:
```python
print(df.corr())
```

This produces a correlation matrix showing how each pair of numerical columns relates. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation).
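A common follow-up is scanning the matrix for strongly correlated pairs. A minimal sketch with made-up columns, assuming a threshold of 0.8:

```python
import numpy as np
import pandas as pd

# Toy numeric frame (column names are illustrative)
df = pd.DataFrame({
    "ph":       [6.8, 7.0, 7.2, 7.4],
    "nitrates": [4.0, 3.0, 2.0, 1.0],
    "flow":     [2.0, 1.0, 3.0, 2.0],
})

corr = df.corr()

# Mask the diagonal (every column correlates perfectly with itself),
# then keep pairs whose absolute correlation exceeds the threshold
mask = ~np.eye(len(corr), dtype=bool)
strong = corr.where(mask).abs().stack()
strong = strong[strong > 0.8]
```

Here `strong` picks out the `ph`/`nitrates` pair (each pair appears twice, once per ordering, since the matrix is symmetric).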
The Designer's Takeaway
Data analysis is to ML what user research is to design. You wouldn't design a product without understanding your users — don't build a model without understanding your data.