Getting and Cleaning Data

The Unsexy Work That Makes Everything Else Possible

Data preparation is often estimated to take 60-80% of the time in an analysis project. It's tedious but essential: garbage in, garbage out.

Data Sources

Internal Data

Data your organization collects:

Transaction records
Customer databases
Website analytics
Operational systems
CRM data

External Data

Data from outside sources:

Industry reports
Census data
Market research
Purchased data sets
Public APIs

Survey Data

Data you collect specifically for analysis:

Customer surveys
Employee surveys
Market research surveys

Common Data Formats

Spreadsheets

Excel, Google Sheets. Good for smaller datasets. Easy to manipulate.

CSV (Comma-Separated Values)

Plain text, readable by nearly every tool. The most common exchange format.

Databases

Larger datasets, better for complex queries. May require SQL.

JSON/APIs

Data from web services. Increasingly common.
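
As a rough illustration of the two text formats, here is a minimal sketch that reads CSV and JSON using only Python's standard library. The sample data and column names (customer_id, country, amount) are made up for the example; in practice you would open a file or call an API instead of using inline strings.

```python
import csv
import io
import json

# Hypothetical sample data; in practice you would open a file instead.
csv_text = "customer_id,country,amount\n1,USA,19.99\n2,Canada,5.00\n"
json_text = '[{"customer_id": 1, "country": "USA", "amount": 19.99}]'

# csv.DictReader yields one dict per row, keyed by the header line.
# Every CSV value arrives as a string, so numbers need explicit conversion.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in csv_rows:
    row["amount"] = float(row["amount"])

# JSON keeps basic types (numbers, strings, lists), so no conversion is needed.
json_rows = json.loads(json_text)

print(csv_rows[0])
print(json_rows[0])
```

Note the difference in types: customer_id is still the string "1" in the CSV rows but the number 1 in the JSON rows. Mismatches like this are a common source of failed joins later on.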

Data Quality Issues

Missing Values

Cells with no data. Common and problematic.

How to handle:

Delete rows with missing values (lose data)
Fill with average/median (introduces assumptions)
Flag and analyze separately
Investigate why missing (often not random)
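
The first three options can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the records and the age column are hypothetical, and None stands in for a missing value.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 51},
    {"id": 4, "age": 29},
]

# Option 1: delete rows with missing values (loses data).
complete_rows = [r for r in rows if r["age"] is not None]

# Option 2: fill with the median of the known values, and flag what was
# filled so the imputed rows can still be analyzed separately.
known_ages = [r["age"] for r in rows if r["age"] is not None]
fill_value = median(known_ages)
filled_rows = [
    {**r,
     "age": fill_value if r["age"] is None else r["age"],
     "age_was_missing": r["age"] is None}
    for r in rows
]

print(len(complete_rows), "complete rows;", "fill value:", fill_value)
```

Keeping the age_was_missing flag costs one extra column and preserves the option to compare filled and unfilled records later.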

Duplicates

Same record appearing multiple times.

How to handle:

Identify duplicates (exact matches, near-matches)
Decide which to keep (most recent? most complete?)
Remove extras
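
One possible way to apply these steps, assuming records are identified by email address and that "most recent" is the rule for choosing which copy to keep. The field names and sample records are hypothetical.

```python
# Hypothetical records; "updated" is an ISO date string, so it sorts correctly as text.
records = [
    {"email": "ann@example.com", "name": "Ann", "updated": "2024-01-10"},
    {"email": "Ann@Example.com ", "name": "Ann Lee", "updated": "2024-03-02"},
    {"email": "bob@example.com", "name": "Bob", "updated": "2024-02-15"},
]

def dedupe_key(record):
    # Normalize case and whitespace so near-matches share one key.
    return record["email"].strip().lower()

# Keep the most recent record for each key.
latest = {}
for record in records:
    key = dedupe_key(record)
    if key not in latest or record["updated"] > latest[key]["updated"]:
        latest[key] = record

deduped = list(latest.values())
print(deduped)
```

The key function is where the real decisions live. A key that is too strict misses near-matches; one that is too loose merges different people.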

Inconsistent Formatting

Same thing represented different ways:

"USA" vs. "United States" vs. "US"
"05/06/24" vs. "June 5, 2024" (and is 05/06 day/month or month/day?)
"123-456-7890" vs. "1234567890"

How to handle:

Standardize formats
Create mapping tables
Clean before analysis
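
A minimal sketch of standardizing the three examples above. The mapping table and function names are illustrative. The date parser assumes slash dates are day/month/year, so that "05/06/24" matches "June 5, 2024"; that is an assumption you would need to confirm against the real source.

```python
import re
from datetime import datetime

# Mapping table: every known variant points to one canonical value.
COUNTRY_MAP = {"usa": "US", "united states": "US", "us": "US"}

def standardize_country(value):
    # Unknown values are returned unchanged so they can be reviewed later.
    return COUNTRY_MAP.get(value.strip().lower(), value.strip())

def standardize_phone(value):
    # Keep digits only: "123-456-7890" becomes "1234567890".
    return re.sub(r"\D", "", value)

def standardize_date(value, formats=("%d/%m/%y", "%B %d, %Y")):
    # Assumes slash dates are day/month/year; confirm this with the data owner.
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print(standardize_country("United States"))
print(standardize_phone("123-456-7890"))
print(standardize_date("05/06/24"), standardize_date("June 5, 2024"))
```

Raising an error on unrecognized dates is a deliberate choice here: a loud failure is usually safer than silently guessing a format.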

Outliers

Values far from typical. Could be:

Data entry errors
Real extreme values
Different populations mixed together

How to handle:

Investigate (is it real?)
Keep if real
Correct if error
Analyze with and without
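
To find candidates worth investigating, one common rule of thumb (not the only one) flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to made-up order amounts. It only flags values; deciding whether each one is real still takes investigation.

```python
from statistics import median, quantiles

# Hypothetical order amounts; 9500 might be a typo or a real bulk order.
amounts = [42, 38, 51, 47, 40, 45, 39, 9500, 44, 48]

# Tukey's rule: flag values more than 1.5 * IQR outside the middle 50%.
q1, _, q3 = quantiles(amounts, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in amounts if x < low or x > high]

# Analyze with and without the flagged values and compare the results.
without = [x for x in amounts if x not in outliers]
print("flagged:", outliers)
print("median with:", median(amounts), "without:", median(without))
```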

Data Entry Errors

Typos, wrong values, misplaced data.

How to handle:

Validate against expected ranges
Cross-check with other fields
Flag suspicious values
Correct where possible
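
Range checks and cross-checks are easy to automate. The following is a sketch only: the expected ranges, the field names, and the rule that total should equal quantity times unit_price are all assumptions made up for the example.

```python
# Hypothetical rules: the expected range for each numeric field.
EXPECTED_RANGES = {"age": (0, 120), "quantity": (1, 1000)}

def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Range check: flag values outside what the field can plausibly hold.
    for field, (low, high) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is not None and not low <= value <= high:
            problems.append(f"{field}={value} outside {low}-{high}")
    # Cross-check: the total should agree with quantity * unit_price.
    expected_total = record["quantity"] * record["unit_price"]
    if abs(record["total"] - expected_total) > 0.01:
        problems.append(f"total={record['total']} but expected {expected_total:.2f}")
    return problems

order = {"age": 340, "quantity": 2, "unit_price": 9.99, "total": 91.98}
print(find_problems(order))
```

The function flags rather than fixes. Whether 340 should have been 34, or 91.98 should have been 19.98, is a judgment to make with the source system or data owner.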

The Cleaning Process

1. Understand Your Data

Before cleaning, understand what you have:

What does each column represent?
What values are expected?
How was data collected?
What's the time range?

2. Assess Quality

Examine for issues:

Missing values by column
Duplicate rows
Value ranges
Format consistency
Obvious errors
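
A quick profile covers the first three checks. This sketch assumes the data is a list of dictionaries and treats both None and the empty string as missing; the rows and column names are hypothetical.

```python
from collections import Counter

# Hypothetical rows; None and "" both count as missing here.
rows = [
    {"id": 1, "country": "US", "amount": 20.0},
    {"id": 2, "country": "", "amount": 35.5},
    {"id": 2, "country": "", "amount": 35.5},
    {"id": 3, "country": "usa", "amount": None},
]
columns = ["id", "country", "amount"]

def is_missing(value):
    return value is None or value == ""

# Missing values by column.
missing = Counter(col for row in rows for col in columns if is_missing(row[col]))

# Exact duplicate rows: count how often each full row appears.
row_counts = Counter(tuple(row[col] for col in columns) for row in rows)
duplicate_rows = sum(count - 1 for count in row_counts.values())

# Value range for a numeric column, ignoring missing entries.
amounts = [row["amount"] for row in rows if not is_missing(row["amount"])]
amount_range = (min(amounts), max(amounts))

print(dict(missing), duplicate_rows, amount_range)
```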

3. Document Issues

Keep track of what you find and what you change. This matters for reproducibility.
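
One lightweight way to do this is a change log kept alongside the cleaning code. The structure below is just one possible layout; the step names, row counts, and notes are invented for the example.

```python
import csv
import io

cleaning_log = []

def log_change(step, column, rows_affected, note):
    # One entry per decision, recorded at the moment the change is made.
    cleaning_log.append(
        {"step": step, "column": column, "rows_affected": rows_affected, "note": note}
    )

# Hypothetical entries.
log_change("dedupe", "email", 12, "kept most recent record per email")
log_change("fill_missing", "age", 40, "filled with median; added flag column")

# Write the log as CSV so it can be stored next to the cleaned data.
# Swap the buffer for open("cleaning_log.csv", "w", newline="") to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["step", "column", "rows_affected", "note"])
writer.writeheader()
writer.writerows(cleaning_log)
print(buffer.getvalue())
```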

4. Clean Systematically

Address issues one type at a time:

Fix duplicates
Handle missing values
Standardize formats
Address outliers
Verify corrections

5. Validate Results

After cleaning:

Do row counts make sense?
Are column sums and averages reasonable?
Do spot-checked records look right?
Does the data pass sanity tests?
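
Some of these checks can be written down as code so they run every time the data is refreshed. A minimal sketch follows, with rules that are assumptions for the example: IDs must be unique and amounts must not be negative.

```python
def validate(raw_rows, clean_rows):
    """Return the names of any sanity checks that failed."""
    failures = []
    # Cleaning should only remove rows, and never remove all of them.
    if not 0 < len(clean_rows) <= len(raw_rows):
        failures.append("row_count")
    # Hypothetical rule: IDs must be unique after deduplication.
    ids = [r["id"] for r in clean_rows]
    if len(ids) != len(set(ids)):
        failures.append("unique_ids")
    # Hypothetical rule: amounts are never negative.
    if any(r["amount"] < 0 for r in clean_rows):
        failures.append("non_negative_amounts")
    return failures

raw = [{"id": 1, "amount": 5.0}, {"id": 1, "amount": 5.0}, {"id": 2, "amount": 3.0}]
clean = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 3.0}]
print(validate(raw, clean))
```

An empty list means every check passed. Automated checks complement spot-checking individual records by eye; they do not replace it.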

Data Transformation

Creating New Variables

Combine or transform existing data:

Age from birth date
Profit from revenue minus cost
Categories from continuous variables
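
The three examples above might look like this in Python. The customer record, the reference date, and the age band cut points are all hypothetical.

```python
from datetime import date

def age_on(birth_date, as_of):
    # Subtract one if the birthday has not yet occurred in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def age_band(age):
    # Hypothetical cut points; choose bands that suit the analysis.
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"

customer = {"birth_date": date(1990, 8, 15), "revenue": 1200.0, "cost": 850.0}
customer["age"] = age_on(customer["birth_date"], as_of=date(2024, 6, 5))
customer["age_band"] = age_band(customer["age"])
customer["profit"] = customer["revenue"] - customer["cost"]
print(customer)
```

Passing the as_of date explicitly, rather than using today's date, keeps the result reproducible when the analysis is rerun later.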

Aggregation

Summarizing detailed data:

Daily to monthly
Transactions to customer totals
Individual products to categories
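
A short sketch of the first two aggregations, assuming transactions carry an ISO-formatted date string (YYYY-MM-DD), a customer identifier, and an amount. The data is invented.

```python
from collections import defaultdict

# Hypothetical transactions with ISO dates (YYYY-MM-DD).
transactions = [
    {"date": "2024-01-03", "customer": "A", "amount": 20.0},
    {"date": "2024-01-17", "customer": "B", "amount": 35.0},
    {"date": "2024-02-02", "customer": "A", "amount": 15.0},
]

monthly_totals = defaultdict(float)
customer_totals = defaultdict(float)
for t in transactions:
    month = t["date"][:7]  # "2024-01-03" becomes "2024-01"
    monthly_totals[month] += t["amount"]
    customer_totals[t["customer"]] += t["amount"]

print(dict(monthly_totals))
print(dict(customer_totals))
```

Slicing the month out of the string only works because the dates were standardized first, which is one reason cleaning comes before transformation.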

Reshaping

Changing data structure:

Wide to long format (unpivoting)
Long to wide format (pivoting)
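
To make the two shapes concrete, here is a small sketch that converts a wide table (one column per month) to long format and back. The product and month names are made up.

```python
# Wide format: one row per product, one column per month (hypothetical data).
wide = [
    {"product": "widget", "jan": 10, "feb": 12},
    {"product": "gadget", "jan": 7, "feb": 9},
]

# Unpivot (wide to long): one row per product-month pair.
long_rows = [
    {"product": row["product"], "month": month, "units": row[month]}
    for row in wide
    for month in ("jan", "feb")
]

# Pivot (long to wide): rebuild one row per product.
pivoted = {}
for r in long_rows:
    entry = pivoted.setdefault(r["product"], {"product": r["product"]})
    entry[r["month"]] = r["units"]
wide_again = list(pivoted.values())

print(long_rows)
print(wide_again)
```

Long format is usually easier to filter and aggregate; wide format is usually easier to read in a report.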

Filtering

Selecting relevant subsets:

Time periods
Specific segments
Valid records only
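
All three kinds of filter can be combined in a single pass. In this sketch the orders, the segment names, and the definition of "valid" (an amount is present) are assumptions for illustration.

```python
from datetime import date

# Hypothetical orders.
orders = [
    {"date": date(2023, 12, 30), "segment": "retail", "amount": 40.0},
    {"date": date(2024, 2, 11), "segment": "retail", "amount": 25.0},
    {"date": date(2024, 3, 5), "segment": "wholesale", "amount": 900.0},
    {"date": date(2024, 4, 1), "segment": "retail", "amount": None},
]

start, end = date(2024, 1, 1), date(2024, 12, 31)
selected = [
    o for o in orders
    if start <= o["date"] <= end      # time period
    and o["segment"] == "retail"      # specific segment
    and o["amount"] is not None       # valid records only
]

print(len(selected), "of", len(orders), "orders kept")
```

Reporting how many rows each filter removes makes it easier to notice a condition that is stricter than intended.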

AI Prompt: Data Cleaning Help

Help me clean this data.

My data includes: [Describe columns]
Issues I've noticed: [What problems you see]
Purpose of analysis: [What you'll do with clean data]

Please suggest:
1. A cleaning strategy
2. How to handle the specific issues
3. Quality checks to perform
4. Transformations that might be useful
5. Red flags to watch for

AI Prompt: Understanding Data

Help me understand this dataset.

Here's a sample of my data:
[Paste sample rows]

Please help me understand:
1. What each column likely represents
2. Data types (numeric, categorical, date, etc.)
3. Potential quality issues you notice
4. Relationships between columns
5. Questions this data could answer

What's Next

The data is clean. Time to explore it.

Next chapter: Exploratory data analysis.