Exploratory Data Analysis
Finding Patterns, Outliers, and Stories in Your Data
Exploratory Data Analysis (EDA) is the detective phase. Before answering specific questions, you get to know your data — its patterns, quirks, and stories.
The Purpose of Exploration
Understand Before Analyzing
You can't analyze data well if you don't understand it. EDA builds that familiarity.
Find Surprises
Data often contains unexpected patterns, outliers, or issues you wouldn't find without looking.
Generate Hypotheses
Exploration often raises new questions worth investigating.
Check Assumptions
Your analysis methods rest on assumptions, such as roughly symmetric distributions or independent observations. EDA helps verify they're met.
The EDA Process
1. Size and Shape
Start basic:
- How many rows (observations)?
- How many columns (variables)?
- Time range covered?
- Completeness?
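The four questions above can be answered in a few lines. This is a minimal sketch using only the Python standard library; the `rows` sample and its column names (`order_date`, `region`, `amount`) are hypothetical, and `None` is assumed to mark a missing value.

```python
from datetime import date

# Hypothetical sample: each dict is one row (observation).
rows = [
    {"order_date": date(2024, 1, 5), "region": "North", "amount": 120.0},
    {"order_date": date(2024, 1, 9), "region": "South", "amount": None},
    {"order_date": date(2024, 2, 2), "region": None, "amount": 80.5},
    {"order_date": date(2024, 3, 17), "region": "North", "amount": 95.0},
]

# Size and shape: number of rows and columns.
columns = list(rows[0].keys())
n_rows, n_cols = len(rows), len(columns)

# Time range covered by the date column.
dates = [r["order_date"] for r in rows if r["order_date"] is not None]
first_date, last_date = min(dates), max(dates)

# Completeness: share of non-missing values per column.
completeness = {
    col: sum(r[col] is not None for r in rows) / n_rows for col in columns
}

print(f"{n_rows} rows x {n_cols} columns")
print(f"Dates from {first_date} to {last_date}")
for col, share in completeness.items():
    print(f"{col}: {share:.0%} complete")
```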
2. Variable-by-Variable
Examine each column:
For numeric variables:
- Range (min to max)
- Center (mean, median)
- Spread (standard deviation)
- Distribution shape
- Outliers
For categorical variables:
- How many categories?
- Frequency of each
- Missing categories?
- Rare categories?
3. Relationships
Look at how variables relate:
- Correlations between numeric variables
- Categories vs. numeric outcomes
- Time trends
- Group comparisons
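As one illustration of comparing a category against a numeric outcome, the sketch below groups values by category and summarizes each group. The `(region, amount)` pairs are made-up sample data.

```python
from collections import defaultdict
import statistics

# Hypothetical (region, order amount) pairs.
rows = [
    ("North", 120.0), ("South", 80.0), ("North", 100.0),
    ("South", 60.0), ("South", 70.0),
]

# Collect the numeric values belonging to each category.
by_region = defaultdict(list)
for region, amount in rows:
    by_region[region].append(amount)

# Summarize each group: size, mean, and median.
group_summary = {
    region: {
        "count": len(amounts),
        "mean": statistics.mean(amounts),
        "median": statistics.median(amounts),
    }
    for region, amounts in by_region.items()
}

for region, stats in group_summary.items():
    print(region, stats)
```

Reporting the count next to each mean helps you notice when a group is too small for its average to mean much.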
4. Patterns and Anomalies
What stands out?
- Clusters or groupings
- Outliers
- Unexpected patterns
- Missing data patterns
Exploration Techniques
Summary Statistics
Quick numerical summaries:
| Statistic | What It Tells You |
|---|---|
| Count | How many observations |
| Mean | Average value (sensitive to extreme values) |
| Median | Middle value (robust to extreme values) |
| Std Dev | How spread out values are around the mean |
| Min/Max | Range extremes |
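The statistics in the table can be computed with Python's built-in `statistics` module. The sketch below uses a small made-up sample with one extreme value, so you can see the mean and median pull apart.

```python
import statistics

# Hypothetical sample of a numeric column; 95 is a deliberately extreme value.
values = [12, 15, 15, 18, 20, 22, 95]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "std_dev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean lands well above the median because a single large value drags it upward. A gap like that is often the first hint of skew or outliers.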
Frequency Tables
For categorical data:
- Count of each category
- Percentage breakdown
- Cumulative percentages
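A frequency table with counts, percentages, and cumulative percentages can be sketched with `collections.Counter`. The `segments` list is hypothetical sample data.

```python
from collections import Counter

# Hypothetical categorical column.
segments = [
    "retail", "retail", "wholesale", "online", "retail",
    "online", "wholesale", "retail", "online", "retail",
]

counts = Counter(segments)
total = sum(counts.values())

# Build rows of (category, count, percent, cumulative percent),
# ordered from most to least frequent.
table = []
cumulative = 0.0
for category, count in counts.most_common():
    percent = 100 * count / total
    cumulative += percent
    table.append((category, count, percent, cumulative))

for category, count, percent, cum in table:
    print(f"{category:<10} {count:>3} {percent:>6.1f}% {cum:>6.1f}%")
```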
Cross-Tabulations
Categories against categories:
- Product type by region
- Customer segment by purchase frequency
- Status by time period
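A cross-tabulation is a count of every combination of two categories. The sketch below builds one for product type by region from made-up records, using only the standard library.

```python
from collections import Counter

# Hypothetical records: (product type, region).
records = [
    ("widget", "North"), ("widget", "South"), ("gadget", "North"),
    ("widget", "North"), ("gadget", "South"), ("gadget", "South"),
    ("widget", "North"),
]

# Count each (row category, column category) pair.
pair_counts = Counter(records)
products = sorted({p for p, _ in records})
regions = sorted({r for _, r in records})

# Print one row per product and one column per region.
print(f"{'':<8}" + "".join(f"{r:>8}" for r in regions))
for p in products:
    cells = "".join(f"{pair_counts[(p, r)]:>8}" for r in regions)
    print(f"{p:<8}{cells}")
```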
Distributions
Understanding how values are spread:
- Symmetric or skewed?
- Single peak or multiple?
- Where are most values?
- How much variation?
Visualization in EDA
Histograms
Show the distribution of a single numeric variable. They reveal:
- Shape of distribution
- Outliers
- Gaps or clusters
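Before reaching for a plotting tool, you can get a rough feel for a histogram in plain text. This sketch bins made-up values into fixed-width bins; the bin width of 10 is an arbitrary choice for illustration.

```python
from collections import Counter

# Hypothetical numeric column.
values = [3, 7, 8, 12, 13, 14, 15, 18, 21, 22, 47]
bin_width = 10  # arbitrary choice for this example

# Map each value to the lower edge of its bin, then count per bin.
bins = Counter((v // bin_width) * bin_width for v in values)

# Walk through every bin, including empty ones, so gaps stay visible.
for lower in range(0, max(bins) + bin_width, bin_width):
    upper = lower + bin_width - 1
    print(f"{lower:>3}-{upper:<3} {'#' * bins[lower]}")
```

The empty 30-39 bin followed by a single value in 40-49 is the kind of gap and isolated point a histogram makes obvious.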
Box Plots
Summarize a distribution compactly. They show:
- Median
- Quartiles
- Outliers
- Spread
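The numbers behind a box plot can be computed directly. The sketch below finds the quartiles and then flags outliers with the 1.5 × IQR rule. That rule is a common plotting convention assumed here for illustration, not something this chapter prescribes, and the sample values are made up.

```python
import statistics

# Hypothetical numeric column.
values = [12, 15, 15, 18, 20, 22, 95]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range, the width of the "box"

# Assumed convention: points more than 1.5 * IQR beyond the quartiles
# are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")
```

A flagged value is a candidate for investigation, not automatically an error.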
Scatter Plots
Show the relationship between two numeric variables. They reveal:
- Correlation
- Outliers
- Clusters
- Non-linear patterns
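A correlation coefficient puts a number on what a scatter plot shows, but it only measures linear association. The sketch below computes Pearson's correlation by hand for two made-up cases: a straight-line relationship and a U-shaped one. The `pearson` helper and the sample lists are illustrative, not part of any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
linear = [2, 4, 6, 8, 10]   # rises in a straight line with x
u_shape = [4, 1, 0, 1, 4]   # depends strongly on x, but not linearly

r_linear = pearson(x, linear)
r_u_shape = pearson(x, u_shape)

print(f"linear:  r = {r_linear:.2f}")
print(f"u-shape: r = {r_u_shape:.2f}")
```

The U-shaped data has a correlation of zero even though the two variables are clearly related, which is why plotting the points matters and the coefficient alone is not enough.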
Bar Charts
Compare categorical data. They show:
- Counts or values by category
- Comparisons across groups
Line Charts
Show data over time. They reveal:
- Trends
- Seasonality
- Anomalies
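One simple way to separate a trend from short-term noise is a moving average. This is a minimal sketch with made-up monthly figures; the window of 3 periods is an arbitrary assumption.

```python
# Hypothetical monthly sales figures; the fifth value is an unusual spike.
monthly_sales = [100, 104, 98, 110, 180, 112, 118, 121]
window = 3  # arbitrary smoothing window for this example

# Trailing moving average: each entry averages the current and
# previous (window - 1) periods.
moving_avg = [
    sum(monthly_sales[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(monthly_sales))
]

for period, avg in enumerate(moving_avg, start=window):
    print(f"period {period}: {avg:.1f}")
```

A point that sits far from the smoothed line, like the spike here, is worth checking against what you know about the business.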
What to Look For
Central Tendency
Where do values cluster? What's typical?
Spread
How much variation? Tight or wide distribution?
Shape
Symmetric? Skewed? Multiple peaks? Gaps?
Outliers
Extreme values. Real or errors?
Missing Patterns
Are missing values random or systematic?
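One quick check is to compare the missing rate across groups. The sketch below uses hypothetical survey rows where `None` marks a missing income; the channel names and values are made up.

```python
from collections import defaultdict

# Hypothetical survey rows: (channel, income), with None for a missing value.
rows = [
    ("web", 52000), ("web", 61000), ("web", 48000), ("web", None),
    ("phone", None), ("phone", None), ("phone", 45000), ("phone", None),
]

totals = defaultdict(int)
missing = defaultdict(int)
for channel, income in rows:
    totals[channel] += 1
    if income is None:
        missing[channel] += 1

# Share of missing values within each group.
missing_rate = {channel: missing[channel] / totals[channel] for channel in totals}

for channel, rate in missing_rate.items():
    print(f"{channel}: {rate:.0%} missing")
```

If the rates differ sharply between groups, the gaps are probably systematic, and dropping incomplete rows would skew the remaining sample.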
Relationships
What moves together? What moves opposite?
Time Patterns
Trends, seasonality, changes over time?
Segments
Different groups behaving differently?
Red Flags
Impossible Values
Negative counts, ages of 200, dates in the future for events that have already happened.
Suspicious Spikes
Sudden changes that don't make business sense.
Too Perfect
Data that's too clean or regular may be fabricated or flawed.
Concentrated Values
Many values exactly the same (possible default or error).
Correlation Where None Expected
May indicate data leakage or errors.
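Several of these red flags can be checked with simple rules. The sketch below is illustrative only: the field names, the age limit of 120, and the fixed reference date are all assumptions you would replace with limits that fit your own data.

```python
from collections import Counter
from datetime import date

# Hypothetical records; the field names are illustrative.
records = [
    {"age": 34, "items": 2, "signup": date(2023, 5, 1)},
    {"age": 200, "items": 1, "signup": date(2022, 11, 12)},
    {"age": 41, "items": -3, "signup": date(2024, 2, 20)},
    {"age": 0, "items": 4, "signup": date(2999, 1, 1)},
    {"age": 0, "items": 1, "signup": date(2021, 7, 4)},
    {"age": 0, "items": 5, "signup": date(2020, 3, 9)},
]
today = date(2025, 1, 1)  # fixed reference date so the example is reproducible

# Each rule returns True when a record looks impossible.
rules = {
    "negative count": lambda r: r["items"] < 0,
    "implausible age": lambda r: not 0 <= r["age"] <= 120,  # assumed limit
    "future date": lambda r: r["signup"] > today,
}

# Row indexes that break each rule.
flags = {
    name: [i for i, r in enumerate(records) if rule(r)]
    for name, rule in rules.items()
}

# Concentrated values: share of records taken by the most common age.
top_age, top_count = Counter(r["age"] for r in records).most_common(1)[0]
top_share = top_count / len(records)

print(flags)
print(f"Most common age: {top_age} ({top_share:.0%} of records)")
```

Half the records sharing an age of 0 suggests a default filled in where the real value was unknown, rather than a genuine measurement.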
AI Prompt: Exploratory Analysis
Help me explore this data.
Here's a sample of my data:
[Paste data or describe structure]
The data represents: [What it is]
I'm ultimately trying to understand: [Your goal]
Please help me:
1. Summarize key statistics
2. Identify notable patterns
3. Flag potential issues or outliers
4. Suggest visualizations to create
5. Raise questions worth investigating
AI Prompt: Pattern Investigation
I noticed something interesting in my data.
The pattern: [What you observed]
Context: [What the data represents]
Help me investigate:
1. Is this pattern real or an artifact?
2. What might explain it?
3. What else should I check?
4. How can I test my hypotheses?
What's Next
Let's get more precise with descriptive statistics.
Next chapter: Descriptive statistics — summarizing data meaningfully.