Getting and Cleaning Data

The Unsexy Work That Makes Everything Else Possible

Data preparation is commonly estimated to take 60-80% of the time in an analysis project. It's tedious but essential. Garbage in, garbage out.

Data Sources

Internal Data

Data your organization collects:

  • Transaction records
  • Customer databases
  • Website analytics
  • Operational systems
  • CRM data

External Data

Data from outside sources:

  • Industry reports
  • Census data
  • Market research
  • Purchased data sets
  • Public APIs

Survey Data

Data you collect specifically for analysis:

  • Customer surveys
  • Employee surveys
  • Market research

Common Data Formats

Spreadsheets

Excel, Google Sheets. Good for smaller datasets. Easy to manipulate.

CSV (Comma-Separated Values)

Plain text, universally compatible. The common exchange format.

Databases

Larger datasets, better for complex queries. May require SQL.

JSON/APIs

Data from web services. Increasingly common.
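Whatever the format, the first job is getting everything into one consistent shape. A minimal sketch using only Python's standard library, assuming a small made-up CSV export and a made-up JSON response (the column names and values are illustrative, not from any real system):

```python
import csv
import io
import json

# Hypothetical sample data; in practice these would come from a file or an API response.
csv_text = "customer_id,country,amount\n1,USA,19.99\n2,Canada,5.00\n"
json_text = '[{"customer_id": 3, "country": "Mexico", "amount": 12.5}]'

# csv.DictReader yields one dict per row; every value arrives as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.loads keeps the types encoded in the JSON (numbers stay numbers).
json_rows = json.loads(json_text)

# Convert the CSV strings to proper types so both sources share one shape.
for row in csv_rows:
    row["customer_id"] = int(row["customer_id"])
    row["amount"] = float(row["amount"])

all_rows = csv_rows + json_rows
print(all_rows)
```

The point to notice: CSV carries no type information, so numbers and dates need converting after loading, while JSON keeps basic types.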

Data Quality Issues

Missing Values

Cells with no data. Common, and problematic because gaps can skew averages and break calculations.

How to handle:

  • Delete rows with missing values (lose data)
  • Fill with average/median (introduces assumptions)
  • Flag and analyze separately
  • Investigate why missing (often not random)
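The options above can be sketched in a few lines. This is an illustration under assumptions: the records are made up, and None stands in for an empty cell.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 58000},
]

# Count missing values per column to see where the gaps are.
missing_counts = {
    col: sum(1 for r in rows if r[col] is None) for col in rows[0]
}

# Option 1: drop any row with a missing value (loses data).
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: fill with the column median, and keep a flag so that
# filled values can still be identified and analyzed separately.
age_median = median(r["age"] for r in rows if r["age"] is not None)
filled_rows = [
    {
        **r,
        "age": age_median if r["age"] is None else r["age"],
        "age_was_missing": r["age"] is None,
    }
    for r in rows
]

print(missing_counts)
print(len(complete_rows), "complete rows of", len(rows))
```

Keeping the flag column costs almost nothing and lets you check later whether the filled rows behave differently from the rest.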

Duplicates

Same record appearing multiple times.

How to handle:

  • Identify duplicates (exact matches, near-matches)
  • Decide which to keep (most recent? most complete?)
  • Remove extras
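One possible way to do this, assuming email is the field that identifies a customer and "most recent" is the rule for which copy to keep. Both are choices for this example only, and the records are made up.

```python
from datetime import date

# Hypothetical records; the first two are near-duplicates
# (the email differs only in case and surrounding spaces).
records = [
    {"email": "ana@example.com", "name": "Ana Ruiz", "updated": date(2024, 1, 10)},
    {"email": " Ana@Example.com ", "name": "Ana Ruiz", "updated": date(2024, 3, 2)},
    {"email": "bo@example.com", "name": "Bo Chen", "updated": date(2024, 2, 15)},
]

def dedupe_key(record):
    """Normalize the matching field so near-matches compare equal."""
    return record["email"].strip().lower()

# Keep the most recently updated record for each key.
latest = {}
for record in records:
    key = dedupe_key(record)
    if key not in latest or record["updated"] > latest[key]["updated"]:
        latest[key] = record

deduped = list(latest.values())
print(len(records) - len(deduped), "duplicate(s) removed")
```

Exact matching would have missed the second record. How aggressive the normalization should be depends on the data. Too loose, and distinct customers get merged.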

Inconsistent Formatting

Same thing represented different ways:

  • "USA" vs. "United States" vs. "US"
  • "05/06/24" vs. "June 5, 2024" (and "05/06/24" could also be read as May 6)
  • "123-456-7890" vs. "1234567890"

How to handle:

  • Standardize formats
  • Create mapping tables
  • Clean before analysis
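A sketch of standardizing the three examples above. The mapping table and the list of date formats are assumptions for illustration. In particular, the code reads "05/06/24" as day/month/year, which you would need to confirm with whoever produced the data. Month names are parsed using the system locale, assumed here to be English.

```python
import re
from datetime import datetime

# Mapping table: every known variant points at one canonical value.
COUNTRY_MAP = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
}

# Assumed input formats, tried in order (day/month/year first).
DATE_FORMATS = ["%d/%m/%y", "%B %d, %Y"]

def standardize_country(value):
    cleaned = value.strip()
    # Unknown values pass through unchanged so they can be reviewed later.
    return COUNTRY_MAP.get(cleaned.lower(), cleaned)

def standardize_phone(value):
    # Keep digits only, so "123-456-7890" and "1234567890" match.
    return re.sub(r"\D", "", value)

def standardize_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # Unparseable: flag for manual review rather than guess.

print(standardize_country("USA"), standardize_phone("123-456-7890"))
print(standardize_date("05/06/24"), standardize_date("June 5, 2024"))
```

Returning None for dates that match no known format is deliberate. A silent wrong guess is harder to find than an obvious gap.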

Outliers

Values far from typical. Could be:

  • Data entry errors
  • Real extreme values
  • Different populations mixed together

How to handle:

  • Investigate (is it real?)
  • Keep if real
  • Correct if error
  • Analyze with and without
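Before you can investigate outliers, you have to find them. One common rule of thumb, used here as an assumption rather than the only valid method, flags values more than 1.5 interquartile ranges outside the middle half of the data. The order values are made up.

```python
from statistics import mean, quantiles

# Hypothetical order values; 9999.0 could be a typo or a genuine bulk order.
order_values = [42.0, 38.5, 51.0, 47.25, 40.0, 44.5, 9999.0, 39.0]

# quantiles(n=4) returns the three quartile cut points.
q1, _, q3 = quantiles(order_values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in order_values if v < lower or v > upper]
typical = [v for v in order_values if lower <= v <= upper]

# Report the statistic both ways so the outlier's influence is visible.
print("flagged for investigation:", outliers)
print("mean with outliers:", mean(order_values))
print("mean without outliers:", mean(typical))
```

The rule only flags candidates. Whether a flagged value is an error or a real extreme still takes a human look at the record.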

Data Entry Errors

Typos, wrong values, misplaced data.

How to handle:

  • Validate against expected ranges
  • Cross-check with other fields
  • Flag suspicious values
  • Correct where possible
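Range checks and cross-checks are easy to automate. A minimal sketch, assuming an order table with the hypothetical fields shown. The allowed quantity range and the rounding tolerance are illustrative choices.

```python
from datetime import date

def find_problems(order):
    """Return a list of reasons this order looks suspicious."""
    problems = []
    # Range check: quantities outside the expected range are suspicious.
    if not 1 <= order["quantity"] <= 1000:
        problems.append("quantity out of range")
    # Cross-check: the total should equal quantity times unit price.
    expected_total = round(order["quantity"] * order["unit_price"], 2)
    if abs(order["total"] - expected_total) > 0.01:
        problems.append("total does not match quantity * unit_price")
    # Cross-check: an order cannot ship before it was placed.
    if order["shipped"] < order["ordered"]:
        problems.append("shipped before ordered")
    return problems

# Hypothetical orders; the second one contains several entry errors.
orders = [
    {"id": 1, "quantity": 3, "unit_price": 9.99, "total": 29.97,
     "ordered": date(2024, 5, 1), "shipped": date(2024, 5, 3)},
    {"id": 2, "quantity": 30000, "unit_price": 9.99, "total": 29.97,
     "ordered": date(2024, 5, 4), "shipped": date(2024, 5, 2)},
]

# Flag rather than fix: keep only the orders that have at least one problem.
flagged = {}
for order in orders:
    problems = find_problems(order)
    if problems:
        flagged[order["id"]] = problems

print(flagged)
```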

The Cleaning Process

1. Understand Your Data

Before cleaning, understand what you have:

  • What does each column represent?
  • What values are expected?
  • How was data collected?
  • What's the time range?

2. Assess Quality

Examine for issues:

  • Missing values by column
  • Duplicate rows
  • Value ranges
  • Format consistency
  • Obvious errors
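A quick profile of the data covers most of this list in one pass. The function below is a rough sketch, not a standard tool. The rows are made up, and None stands in for an empty cell.

```python
from collections import Counter

# Hypothetical rows with a duplicate, missing values, inconsistent
# formatting ("CA" vs. "ca"), and a suspicious negative amount.
rows = [
    {"id": 1, "state": "CA", "amount": 120.0},
    {"id": 2, "state": "ca", "amount": None},
    {"id": 2, "state": "ca", "amount": None},
    {"id": 3, "state": "NY", "amount": -5.0},
]

def profile(rows):
    """Summarize row count, exact duplicates, and per-column quality."""
    # Exact duplicate rows: identical values in every column.
    row_counts = Counter(tuple(sorted(r.items())) for r in rows)
    report = {
        "row_count": len(rows),
        "duplicate_rows": sum(n - 1 for n in row_counts.values()),
        "columns": {},
    }
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        numeric = [v for v in values if isinstance(v, (int, float))]
        report["columns"][col] = {
            "missing": len(rows) - len(values),
            "distinct": len(set(values)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report

report = profile(rows)
print(report)
```

Three distinct values in a state column that should hold two, or a negative minimum in an amount column, are the kind of signals this surfaces.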

3. Document Issues

Keep track of what you find and what you change. This matters for reproducibility.
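One lightweight way to do this is a log that records every change as you make it. The structure below is only a suggestion, and the entries are made-up examples.

```python
import json

cleaning_log = []

def log_step(action, rows_before, rows_after, reason):
    """Record one cleaning action so the work can be repeated and audited."""
    cleaning_log.append({
        "action": action,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "reason": reason,
    })

# Hypothetical entries; the counts and reasons are illustrative.
log_step("remove exact duplicates", 1000, 988, "same orders exported twice")
log_step("fill missing age with median", 988, 988, "blank ages, flagged in a separate column")

# Serialize the log so it can be saved next to the cleaned data.
log_json = json.dumps(cleaning_log, indent=2)
print(log_json)
```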

4. Clean Systematically

Address issues one type at a time:

  1. Fix duplicates
  2. Handle missing values
  3. Standardize formats
  4. Address outliers
  5. Verify corrections

5. Validate Results

After cleaning:

  • Do the row counts make sense?
  • Are column sums and averages reasonable?
  • Do spot-checked records look right?
  • Does the data pass basic sanity tests?
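Some of these checks can be written once and rerun after every cleaning pass. A sketch, assuming each row has an id and an amount. The specific checks and the 10% threshold are illustrative and should be replaced with rules that fit your data.

```python
def failed_checks(raw_rows, clean_rows):
    """Return the names of the sanity checks that did not pass."""
    amounts = [r["amount"] for r in clean_rows]
    checks = {
        "cleaning did not add rows": len(clean_rows) <= len(raw_rows),
        # Assumed threshold: losing more than 10% of rows deserves a second look.
        "row loss under 10%": len(clean_rows) >= 0.9 * len(raw_rows),
        "ids are unique": len({r["id"] for r in clean_rows}) == len(clean_rows),
        "no missing amounts": all(a is not None for a in amounts),
        "amounts are non-negative": all(a is not None and a >= 0 for a in amounts),
    }
    return [name for name, passed in checks.items() if not passed]

# Hypothetical data: 20 raw rows, one removed during cleaning.
raw = [{"id": i, "amount": 10.0 * i} for i in range(1, 21)]
clean = raw[:19]

print(failed_checks(raw, clean))
```

An empty list means every check passed. Automated checks do not replace spot-checking individual records by eye.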

Data Transformation

Creating New Variables

Combine or transform existing data:

  • Age from birth date
  • Profit from revenue minus cost
  • Categories from continuous variables
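The three examples above, sketched for a single made-up customer record. The field names and the age band cut points are assumptions.

```python
from datetime import date

def age_on(birth_date, as_of):
    """Age in whole years on a given date."""
    # Subtract one year if the birthday has not happened yet in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def age_band(age):
    """Turn a continuous age into a category. The cut points are arbitrary."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"

# Hypothetical customer record.
customer = {"birth_date": date(1990, 6, 15), "revenue": 1200.0, "cost": 850.0}
as_of = date(2024, 6, 14)  # a fixed reference date keeps the result reproducible

customer["age"] = age_on(customer["birth_date"], as_of)
customer["age_band"] = age_band(customer["age"])
customer["profit"] = customer["revenue"] - customer["cost"]

print(customer)
```

Computing age against a fixed reference date instead of today's date means the analysis gives the same answer when rerun next month.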

Aggregation

Summarizing detailed data:

  • Daily to monthly
  • Transactions to customer totals
  • Individual products to categories
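A minimal sketch of two of these roll-ups, assuming a made-up list of transactions with a customer, a day, and an amount.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level data.
transactions = [
    {"customer": "A", "day": date(2024, 1, 5), "amount": 20.0},
    {"customer": "B", "day": date(2024, 1, 20), "amount": 35.0},
    {"customer": "A", "day": date(2024, 2, 3), "amount": 15.0},
]

monthly_totals = defaultdict(float)
customer_totals = defaultdict(float)

for t in transactions:
    # Daily to monthly: group by the year-month part of the date.
    monthly_totals[t["day"].strftime("%Y-%m")] += t["amount"]
    # Transactions to customer totals: group by customer.
    customer_totals[t["customer"]] += t["amount"]

print(dict(monthly_totals))
print(dict(customer_totals))
```

Aggregation throws detail away, so keep the transaction-level data around in case a later question needs it.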

Reshaping

Changing data structure:

  • Wide to long format (unpivoting)
  • Long to wide format (pivoting)
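To make the terms concrete: wide data has one column per measurement period, and long data has one row per measurement. A sketch with made-up store sales, converting in both directions:

```python
# Hypothetical wide table: one row per store, one column per month.
wide = [
    {"store": "North", "jan": 100, "feb": 120},
    {"store": "South", "jan": 80, "feb": 95},
]
months = ("jan", "feb")

# Wide to long (unpivot): one row per store-month pair.
long_rows = [
    {"store": row["store"], "month": month, "sales": row[month]}
    for row in wide
    for month in months
]

# Long to wide (pivot): rebuild one row per store.
pivoted = {}
for row in long_rows:
    store_row = pivoted.setdefault(row["store"], {"store": row["store"]})
    store_row[row["month"]] = row["sales"]
wide_again = list(pivoted.values())

print(long_rows)
print(wide_again)
```

Long format is usually easier to filter and aggregate. Wide format is usually easier to read in a report.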

Filtering

Selecting relevant subsets:

  • Time periods
  • Specific segments
  • Valid records only
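All three kinds of filter can be combined in one pass. The orders, the date window, and the segment name below are made up for illustration.

```python
from datetime import date

# Hypothetical orders; None marks a missing amount.
orders = [
    {"id": 1, "day": date(2023, 12, 30), "segment": "retail", "amount": 40.0},
    {"id": 2, "day": date(2024, 1, 12), "segment": "retail", "amount": 55.0},
    {"id": 3, "day": date(2024, 2, 8), "segment": "wholesale", "amount": 900.0},
    {"id": 4, "day": date(2024, 3, 1), "segment": "retail", "amount": None},
]

start, end = date(2024, 1, 1), date(2024, 3, 31)

selected = [
    o for o in orders
    if start <= o["day"] <= end       # time period
    and o["segment"] == "retail"      # specific segment
    and o["amount"] is not None       # valid records only
]

print(len(selected), "of", len(orders), "orders kept")
```

Reporting how many rows each filter removes makes it easier to spot a filter that is silently dropping more than intended.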

AI Prompt: Data Cleaning Help

Help me clean this data.

My data includes: [Describe columns]
Issues I've noticed: [What problems you see]
Purpose of analysis: [What you'll do with clean data]

Please suggest:
1. A cleaning strategy
2. How to handle the specific issues
3. Quality checks to perform
4. Transformations that might be useful
5. Red flags to watch for

AI Prompt: Understanding Data

Help me understand this dataset.

Here's a sample of my data:
[Paste sample rows]

Please help me understand:
1. What each column likely represents
2. Data types (numeric, categorical, date, etc.)
3. Potential quality issues you notice
4. Relationships between columns
5. Questions this data could answer

What's Next

Your data is clean. Time to explore it.

Next chapter: Exploratory data analysis.