Data · 24 February 2026 · 7 min read

Understanding Your Data: Exploratory Data Analysis

Before you build any model, you need to know your data inside out. Here's how EDA helps you ask the right questions.

The Golden Rule

A model assumes that whatever data you feed it is good data.

This is why understanding your dataset before modelling is critical. Garbage in, garbage out — a model trained on bad data will produce bad predictions, no matter how sophisticated the algorithm.

Regression vs Classification — Know Your Task

Before anything else, identify what kind of problem you're solving:

  • Classification — sorting things into buckets (spam/not spam, cat/dog)
  • Regression — predicting a number (house price, temperature)

The type of problem determines which algorithms, metrics, and visualisations you'll use.
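A quick way to sanity-check which task you're facing is to look at the target column itself. The sketch below is a rough heuristic only (the DataFrames and the `task_type` helper are hypothetical, not part of any library): a numerical target usually means regression, a categorical one classification.

```python
import pandas as pd

# Hypothetical examples: one categorical target, one numerical target
emails = pd.DataFrame({"length": [120, 30], "label": ["spam", "not spam"]})
houses = pd.DataFrame({"area_m2": [80, 120], "price": [250_000, 400_000]})

def task_type(target: pd.Series) -> str:
    """Rough heuristic: numeric target -> regression, otherwise classification."""
    return "regression" if pd.api.types.is_numeric_dtype(target) else "classification"

print(task_type(emails["label"]))   # classification
print(task_type(houses["price"]))   # regression
```

Edge cases exist (a numeric 0/1 label column is still classification), so treat this as a starting point, not a rule.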

What is EDA?

Exploratory Data Analysis is the process of examining your data to understand its structure, quality, and quirks before modelling. Think of it as the research phase of a design project — you wouldn't start wireframing without understanding the problem space.

Why Do EDA?

  • Understand structure — what does the data look like?
  • Spot quality issues — missing values, outliers, duplicates
  • Identify feature types — categorical vs numerical
  • Prepare for modelling — inform feature selection and preprocessing

The key question: What kind of data are we working with?

DataFrames — Your Canvas

A DataFrame is the standard format for tabular data. Think of it as a spreadsheet with:

  • Headers — column names
  • Rows — individual records
  • Columns — named fields, each with its own data type

In Python, you'll use Pandas to create and manipulate DataFrames, and NumPy for numerical operations.

import pandas as pd
import numpy as np

df = pd.read_csv("water_samples.csv")
df.shape      # (rows, columns)
df.info()     # data types, non-null counts
df.describe() # statistical summary

Describing Your Data

Count

How many entries are in each column? Missing data shows up as a lower count.
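In Pandas, you can see this gap directly: `len` gives the total number of rows, while `count` only counts non-null entries. A minimal sketch, using a hypothetical column with one missing reading:

```python
import pandas as pd
import numpy as np

# Hypothetical pH readings with one missing value
ph = pd.Series([7.1, 6.8, np.nan, 7.4])

print(len(ph))          # 4 rows in total
print(ph.count())       # 3 non-null entries -- the gap hints at missing data
print(ph.isna().sum())  # 1 missing value
```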

Central Tendency — What's the "Typical" Value?

  • Mean — the average of all values. Use when the data is symmetric with no extreme outliers.
  • Median — the middle value when sorted. Use when the data is skewed or contains outliers.
  • Mode — the most frequent value. Use when the data is categorical, or frequency matters more than magnitude.
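A single outlier makes the difference between these measures obvious. In this hypothetical sample, one extreme value drags the mean far from the "typical" value while the median and mode stay put:

```python
import pandas as pd

# Hypothetical skewed data: one extreme outlier
values = pd.Series([2, 3, 3, 4, 100])

print(values.mean())     # 22.4 -- distorted by the outlier
print(values.median())   # 3.0 -- robust middle value
print(values.mode()[0])  # 3 -- most frequent value
```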

Dispersion — How Spread Out is the Data?

  • Range — max − min. Simple, but sensitive to outliers.
  • Variance — the average squared distance from the mean. Weights large deviations heavily.
  • Standard deviation — the square root of the variance. Same unit as the data, so more interpretable.
  • IQR (interquartile range) — Q3 − Q1, the spread of the middle 50% of the data. Robust to outliers.
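All four measures are one-liners in Pandas. A small sketch on hypothetical data (note that Pandas computes the sample variance by default, dividing by n − 1 rather than n):

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 6, 8])

value_range = values.max() - values.min()            # 6
variance = values.var()                              # sample variance (ddof=1)
std = values.std()                                   # same unit as the data
iqr = values.quantile(0.75) - values.quantile(0.25)  # middle 50% spread

print(value_range, variance, std, iqr)
```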

Shape — How is the Data Distributed?

  • Skewness — is the distribution symmetric or lopsided?
  • Kurtosis — how peaked or flat is the distribution?
      ◦ Positive kurtosis = sharp peak, heavy tails
      ◦ Negative kurtosis = flat top, light tails
  • Percentiles — what value does X% of the data fall below?
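Pandas exposes all three shape measures directly. A sketch on a hypothetical right-skewed sample (note that `kurt` reports excess kurtosis, i.e. 0 for a normal distribution):

```python
import pandas as pd

# Hypothetical right-skewed sample with a long upper tail
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

print(values.skew())          # positive -> long right tail
print(values.kurt())          # excess kurtosis: positive -> sharp peak, heavy tails
print(values.quantile(0.90))  # 90% of the values fall at or below this point
```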

Correlation vs Causation

This distinction trips up even experienced practitioners:

Correlation — two variables are related. A might be associated with B.

Causation — A actually causes B. This must be empirically proven through controlled experiments.

Ice cream sales correlate with drowning incidents — but ice cream doesn't cause drowning. Both increase in summer.

Cross Tabulation

A cross-tabulation table lets you explore relationships between categorical variables. It's a foundational tool in statistical modelling.

It's essentially what powers a confusion matrix — a table that shows how well a classifier's predictions match reality. We'll cover that in detail when we get to classification models.

pd.crosstab(df["land_use"], df["water_quality"])
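To make the snippet above runnable without the CSV, here is a self-contained sketch with a small hypothetical DataFrame standing in for the water-samples data. The `normalize="index"` option converts raw counts into row-wise proportions, which is often easier to read:

```python
import pandas as pd

# Hypothetical stand-in for the water-samples data
df = pd.DataFrame({
    "land_use": ["urban", "urban", "farm", "farm", "forest"],
    "water_quality": ["poor", "good", "poor", "poor", "good"],
})

counts = pd.crosstab(df["land_use"], df["water_quality"])
print(counts)

# Row-wise proportions instead of raw counts
print(pd.crosstab(df["land_use"], df["water_quality"], normalize="index"))
```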

Checking Correlations

For numerical relationships:

print(df.corr(numeric_only=True))  # skip non-numeric columns

This produces a correlation matrix showing how each pair of numerical columns relates. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation).

The Designer's Takeaway

Data analysis is to ML what user research is to design. You wouldn't design a product without understanding your users — don't build a model without understanding your data.
