Exploratory Data Analysis

Finding Patterns, Outliers, and Stories in Your Data

Exploratory Data Analysis (EDA) is the detective phase. Before answering specific questions, you get to know your data — its patterns, quirks, and stories.

The Purpose of Exploration

Understand Before Analyzing

You can't analyze well what you don't understand. EDA builds familiarity.

Find Surprises

Data often contains unexpected patterns, outliers, or issues you wouldn't find without looking.

Generate Hypotheses

Exploration often raises new questions worth investigating.

Check Assumptions

Your analysis methods have assumptions. EDA helps verify they're met.

The EDA Process

1. Size and Shape

Start basic:

  • How many rows (observations)?
  • How many columns (variables)?
  • Time range covered?
  • Completeness?
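
These first checks can be sketched in a few lines of Python. This is a minimal illustration, assuming the data is already loaded as a list of dictionaries with None marking missing values. The sample rows, the column names, and the describe_shape helper are all hypothetical.

```python
from datetime import date

# Hypothetical sample: one dictionary per observation, None marks a missing value.
rows = [
    {"order_date": date(2024, 1, 5), "region": "North", "amount": 120.0},
    {"order_date": date(2024, 2, 11), "region": "South", "amount": None},
    {"order_date": date(2024, 3, 2), "region": None, "amount": 87.5},
    {"order_date": date(2024, 3, 28), "region": "North", "amount": 64.0},
]


def describe_shape(rows, date_column):
    """Return row count, column count, date range, and completeness per column."""
    columns = list(rows[0].keys())
    dates = [r[date_column] for r in rows if r[date_column] is not None]
    # Completeness = share of rows where the column is filled in.
    completeness = {
        col: sum(r[col] is not None for r in rows) / len(rows)
        for col in columns
    }
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "date_range": (min(dates), max(dates)),
        "completeness": completeness,
    }


shape = describe_shape(rows, "order_date")
print(shape)
```

A completeness value well below 1.0 for a column is a cue to look at that column before relying on it.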

2. Variable-by-Variable

Examine each column:

For numeric variables:

  • Range (min to max)
  • Center (mean, median)
  • Spread (standard deviation)
  • Distribution shape
  • Outliers

For categorical variables:

  • How many categories?
  • Frequency of each
  • Missing categories?
  • Rare categories?
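
The categorical checks above can be sketched as follows. The segment values, the set of expected categories, and the 10% cutoff for "rare" are assumptions made for illustration, not fixed rules.

```python
from collections import Counter

# Hypothetical categorical column and the categories we expect to see in it.
segments = ["retail"] * 12 + ["wholesale"] * 7 + ["online"] * 1
expected = {"retail", "wholesale", "online", "partner"}


def profile_categories(values, expected, rare_share=0.10):
    """Count categories, list expected ones that never appear, and flag rare ones."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        "n_categories": len(counts),
        "counts": dict(counts),
        "missing": sorted(expected - set(counts)),
        # "Rare" here means below rare_share of all rows (an assumed cutoff).
        "rare": sorted(c for c, n in counts.items() if n / total < rare_share),
    }


profile = profile_categories(segments, expected)
print(profile)
```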

3. Relationships

Look at how variables relate:

  • Correlations between numeric variables
  • Categories vs. numeric outcomes
  • Time trends
  • Group comparisons
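
As one example of relating a category to a numeric outcome, the sketch below groups values by category and compares each group's size, mean, and median. The region and amount data and the summarize_by_group helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (category, numeric outcome) pairs, e.g. region and order amount.
orders = [
    ("North", 120.0), ("North", 80.0),
    ("South", 45.0), ("South", 55.0), ("South", 50.0),
    ("West", 200.0),
]


def summarize_by_group(pairs):
    """Group numeric values by category and summarize each group."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for category, vals in groups.items()
    }


by_group = summarize_by_group(orders)
for category, stats in by_group.items():
    print(category, stats)
```

Keeping the group size next to the mean matters: a group with a single observation, like West here, says little on its own.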

4. Patterns and Anomalies

What stands out?

  • Clusters or groupings
  • Outliers
  • Unexpected patterns
  • Missing data patterns

Exploration Techniques

Summary Statistics

Quick numerical summaries:

Statistic   What It Tells You
Count       How many observations
Mean        Average value
Median      Middle value
Std Dev     How spread out
Min/Max     Range extremes
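
The statistics in the table above can be computed with Python's built-in statistics module. This is a small sketch on made-up values. The summarize helper is hypothetical, and stdev here is the sample standard deviation.

```python
from statistics import mean, median, stdev

# Hypothetical numeric column with one unusually large value.
values = [12.0, 15.0, 11.0, 14.0, 13.0, 95.0]


def summarize(values):
    """Return the basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "std_dev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(values)
print(summary)
```

In this sample the mean is roughly twice the median because a single large value pulls it upward, which is why it helps to report both.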

Frequency Tables

For categorical data:

  • Count of each category
  • Percentage breakdown
  • Cumulative percentages
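
A frequency table with all three columns can be built with collections.Counter. The status values and the frequency_table helper below are hypothetical.

```python
from collections import Counter

# Hypothetical categorical column.
statuses = ["open"] * 5 + ["closed"] * 3 + ["pending"] * 2


def frequency_table(values):
    """Return (category, count, percent, cumulative percent), most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    table = []
    running = 0
    for category, n in counts.most_common():
        running += n
        table.append((category, n, 100 * n / total, 100 * running / total))
    return table


table = frequency_table(statuses)
for category, n, pct, cum_pct in table:
    print(f"{category:<10}{n:>5}{pct:>8.1f}%{cum_pct:>8.1f}%")
```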

Cross-Tabulations

Categories against categories:

  • Product type by region
  • Customer segment by purchase frequency
  • Status by time period
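
A cross-tabulation is a count of each combination of two categorical values. The sketch below uses the product type by region example. The records and the crosstab helper are hypothetical.

```python
from collections import Counter

# Hypothetical (product type, region) pairs, one per sale.
records = [
    ("Widget", "North"), ("Widget", "South"), ("Gadget", "North"),
    ("Widget", "North"), ("Gadget", "South"), ("Gadget", "South"),
]


def crosstab(pairs):
    """Count every (row category, column category) combination."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


xtab = crosstab(records)
for product, row in xtab.items():
    print(product, row)
```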

Distributions

Understanding how values are spread:

  • Symmetric or skewed?
  • Single peak or multiple?
  • Where are most values?
  • How much variation?

Visualization in EDA

Histograms

Show the distribution of a single numeric variable. Reveals:

  • Shape of distribution
  • Outliers
  • Gaps or clusters
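
Under the hood, a histogram is a count of values per equal-width bin. The text-only sketch below shows that idea without a plotting library. The values, the bin count, and the histogram helper are assumptions for illustration.

```python
# Hypothetical numeric column: mostly small values plus one extreme value.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]


def histogram(values, n_bins):
    """Split the range into equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


edges, counts = histogram(values, n_bins=4)
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * n}")
```

The two empty bins in the middle are the kind of gap a histogram makes visible: nearly everything sits in the first bin, and one value sits far away.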

Box Plots

Summarize a distribution compactly. Shows:

  • Median
  • Quartiles
  • Outliers
  • Spread
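
The numbers behind a box plot can be computed directly. This sketch assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles, and it uses the "inclusive" quartile method from Python's statistics module. Other tools may compute quartiles slightly differently. The values and the box_plot_summary helper are hypothetical.

```python
from statistics import quantiles

# Hypothetical numeric column with one unusually large value.
values = [12.0, 15.0, 11.0, 14.0, 13.0, 95.0]


def box_plot_summary(values, k=1.5):
    """Return quartiles, IQR, and points outside the k * IQR fences."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "outliers": sorted(v for v in values if v < lower or v > upper),
    }


box = box_plot_summary(values)
print(box)
```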

Scatter Plots

Relationship between two numeric variables. Reveals:

  • Correlation
  • Outliers
  • Clusters
  • Non-linear patterns

Bar Charts

Categorical data comparisons. Shows:

  • Counts or values by category
  • Comparisons across groups

Line Charts

Data over time. Shows:

  • Trends
  • Seasonality
  • Anomalies
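
A trend is often easier to see after smoothing. One simple option, sketched below, is a moving average. The monthly figures, the window of 3, and the moving_average helper are assumptions for illustration.

```python
# Hypothetical monthly totals.
monthly = [100, 110, 105, 120, 118, 130]


def moving_average(values, window):
    """Average each value with the (window - 1) values before it."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


smoothed = moving_average(monthly, window=3)
print(smoothed)
```

The smoothed series is shorter than the original by window - 1 points, because the first few values have no full window behind them.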

What to Look For

Central Tendency

Where do values cluster? What's typical?

Spread

How much variation? Tight or wide distribution?

Shape

Symmetric? Skewed? Multiple peaks? Gaps?

Outliers

Extreme values. Real or errors?

Missing Patterns

Are missing values random or systematic?
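
One quick way to probe this is to compare the missing rate across groups. If the rate differs sharply between groups, the gaps are probably not random. The channel and income data and the missing_rate_by_group helper below are hypothetical.

```python
from collections import Counter

# Hypothetical survey rows: income is sometimes missing (None).
rows = [
    {"channel": "web", "income": 52000},
    {"channel": "web", "income": 61000},
    {"channel": "web", "income": 48000},
    {"channel": "web", "income": 75000},
    {"channel": "phone", "income": None},
    {"channel": "phone", "income": 39000},
    {"channel": "phone", "income": None},
    {"channel": "phone", "income": None},
]


def missing_rate_by_group(rows, group_col, value_col):
    """Return the share of missing values in value_col for each group."""
    totals = Counter()
    missing = Counter()
    for r in rows:
        totals[r[group_col]] += 1
        if r[value_col] is None:
            missing[r[group_col]] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(rows, "channel", "income")
print(rates)
```

In this made-up sample the gaps are concentrated in one channel, which would point to a collection problem rather than random loss.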

Relationships

What moves together? What moves opposite?
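
The Pearson correlation coefficient puts a number on this: values near +1 mean two variables move together, values near -1 mean they move opposite, and values near 0 mean no linear relationship. The sketch below computes it by hand on made-up data. The variable names and the pearson helper are hypothetical.

```python
from math import sqrt

# Hypothetical paired observations.
price = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue_per_unit = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises with price
units_sold = [10.0, 8.0, 6.0, 4.0, 2.0]        # falls as price rises


def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


together = pearson(price, revenue_per_unit)
opposite = pearson(price, units_sold)
print(round(together, 3), round(opposite, 3))
```

Pearson correlation only measures straight-line relationships, so a value near 0 does not rule out a curved pattern.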

Time Patterns

Trends, seasonality, changes over time?

Segments

Different groups behaving differently?

Red Flags

Impossible Values

Negative counts, ages of 200, dates in the future.
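
Checks like these are easy to automate. In the sketch below, the column names, the age limit of 120, and the find_impossible helper are assumptions. Real limits depend on what your data represents.

```python
from datetime import date

# Hypothetical rows to validate.
rows = [
    {"quantity": 3, "age": 34, "order_date": date(2024, 3, 1)},
    {"quantity": -2, "age": 41, "order_date": date(2024, 3, 4)},
    {"quantity": 1, "age": 200, "order_date": date(2024, 3, 9)},
    {"quantity": 5, "age": 28, "order_date": date(2031, 1, 1)},
]


def find_impossible(rows, today):
    """Return (row index, problem) pairs for values that cannot be right."""
    problems = []
    for i, r in enumerate(rows):
        if r["quantity"] < 0:
            problems.append((i, "negative quantity"))
        if not 0 <= r["age"] <= 120:  # assumed plausible age range
            problems.append((i, "implausible age"))
        if r["order_date"] > today:
            problems.append((i, "date in the future"))
    return problems


# The reference date is passed in so the check is reproducible.
problems = find_impossible(rows, today=date(2024, 6, 30))
print(problems)
```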

Suspicious Spikes

Sudden changes that don't make business sense.
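
A simple first pass is to flag any period-over-period change above a chosen size. The daily figures, the 50% threshold, and the flag_spikes helper below are assumptions for illustration. Whether a flagged change makes business sense is still a judgment call.

```python
# Hypothetical daily order counts.
daily = [200, 210, 205, 640, 215, 208]


def flag_spikes(values, threshold=0.5):
    """Return (index, relative change) where the change from the prior value exceeds threshold."""
    flagged = []
    for i in range(1, len(values)):
        previous = values[i - 1]
        if previous == 0:
            continue  # relative change is undefined when the prior value is 0
        change = (values[i] - previous) / previous
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged


spikes = flag_spikes(daily)
for index, change in spikes:
    print(f"position {index}: {change:+.0%}")
```

A single spike shows up twice, once on the way up and once on the way back down.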

Too Perfect

Data that's too clean or regular may be fabricated or flawed.

Concentrated Values

Many values exactly the same (possible default or error).
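
To check for this, measure how much of a column is taken up by its single most common value. The sample values and the top_value_share helper are hypothetical.

```python
from collections import Counter

# Hypothetical numeric column where 0 may be a default rather than a real value.
discounts = [0, 0, 0, 0, 0, 0, 37, 42, 0, 51]


def top_value_share(values):
    """Return the most common value and the share of rows that hold it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


top_value, share = top_value_share(discounts)
print(top_value, share)
```

A high share is not proof of an error, but it is worth asking whether that value was actually recorded or filled in by default.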

Correlation Where None Expected

A strong correlation between variables that should be unrelated may indicate data leakage (one column derived from another) or errors.

AI Prompt: Exploratory Analysis

Help me explore this data.

Here's a sample of my data:
[Paste data or describe structure]

The data represents: [What it is]
I'm ultimately trying to understand: [Your goal]

Please help me:
1. Summarize key statistics
2. Identify notable patterns
3. Flag potential issues or outliers
4. Suggest visualizations to create
5. Raise questions worth investigating

AI Prompt: Pattern Investigation

I noticed something interesting in my data.

The pattern: [What you observed]
Context: [What the data represents]

Help me investigate:
1. Is this pattern real or an artifact?
2. What might explain it?
3. What else should I check?
4. How can I test my hypotheses?

What's Next

Let's get more precise with descriptive statistics.

Next chapter: Descriptive statistics — summarizing data meaningfully.