Getting and Cleaning Data
The Unsexy Work That Makes Everything Else Possible
Data preparation is commonly estimated to take 60-80% of the time in an analysis project. It's tedious but essential: garbage in, garbage out.
Data Sources
Internal Data
Data your organization collects:
- Transaction records
- Customer databases
- Website analytics
- Operational systems
- CRM data
External Data
Data from outside sources:
- Industry reports
- Census data
- Market research
- Purchased data sets
- Public APIs
Survey Data
Data you collect specifically for analysis:
- Customer surveys
- Employee surveys
- Market research
Common Data Formats
Spreadsheets
Excel, Google Sheets. Good for smaller datasets. Easy to manipulate.
CSV (Comma-Separated Values)
Plain text, universally compatible. The common exchange format.
Databases
Larger datasets, better for complex queries. May require SQL.
JSON/APIs
Data from web services. Increasingly common.
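As a rough illustration of how these formats differ once loaded, here is a minimal Python sketch using only the standard library. The column names and values are made up for the example. The point it shows: CSV carries no type information, so every value arrives as text, while JSON keeps numbers as numbers.

```python
import csv
import io
import json

# Hypothetical sample data: the same two orders in CSV and in JSON.
CSV_TEXT = "order_id,amount\n1,19.99\n2,5.00\n"
JSON_TEXT = '[{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": 5.0}]'

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# json.loads returns Python lists and dicts with numeric types preserved.
json_rows = json.loads(JSON_TEXT)

# CSV values are all strings, so convert them explicitly before analysis.
typed_rows = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
    for r in csv_rows
]

print(csv_rows[0])    # values are strings
print(typed_rows[0])  # values are int and float
```

With a real file you would pass an open file object to `csv.DictReader` instead of the in-memory text used here.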
Data Quality Issues
Missing Values
Cells with no data. Common and problematic.
How to handle:
- Delete rows with missing values (lose data)
- Fill with average/median (introduces assumptions)
- Flag and analyze separately
- Investigate why missing (often not random)
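The first three options can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the records are invented, and a missing value is represented as `None`. Note that the fill step also adds a flag column, so filled values can still be analyzed separately later.

```python
from statistics import median

# Hypothetical sample records; None marks a missing value.
rows = [
    {"customer": "A", "age": 34, "spend": 120.0},
    {"customer": "B", "age": None, "spend": 80.0},
    {"customer": "C", "age": 51, "spend": None},
    {"customer": "D", "age": 29, "spend": 60.0},
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts

def drop_missing(rows):
    """Keep only rows with no missing values (this loses data)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_median(rows, col):
    """Fill gaps in one column with its median and flag the filled rows."""
    fill = median(r[col] for r in rows if r[col] is not None)
    filled = []
    for r in rows:
        was_missing = r[col] is None
        new_row = dict(r)
        new_row[col] = fill if was_missing else r[col]
        new_row[col + "_was_missing"] = was_missing
        filled.append(new_row)
    return filled

print(missing_counts(rows))
complete = drop_missing(rows)
filled = fill_with_median(rows, "age")
```

Comparing results on `complete` and `filled` is a cheap way to see how much your handling choice affects the conclusions.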
Duplicates
Same record appearing multiple times.
How to handle:
- Identify duplicates (exact matches, near-matches)
- Decide which to keep (most recent? most complete?)
- Remove extras
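One possible way to do this in Python is sketched below. It assumes that email identifies a customer and that "most recent" is the rule for which copy to keep; both are choices made for the example, and the field names are hypothetical. Normalizing the key (trimming spaces, lowercasing) catches simple near-matches that an exact comparison would miss.

```python
from datetime import date

# Hypothetical records: the first two are the same customer entered twice.
records = [
    {"email": "ann@example.com", "name": "Ann Lee", "updated": date(2024, 1, 5)},
    {"email": "Ann@Example.com ", "name": "Ann Lee", "updated": date(2024, 3, 2)},
    {"email": "bob@example.com", "name": "Bob Roy", "updated": date(2024, 2, 1)},
]

def dedupe_keep_latest(records, key_field="email", date_field="updated"):
    """Keep one record per normalized key, preferring the most recent."""
    latest = {}
    for rec in records:
        # Normalize so "Ann@Example.com " and "ann@example.com" match.
        key = rec[key_field].strip().lower()
        if key not in latest or rec[date_field] > latest[key][date_field]:
            latest[key] = rec
    return list(latest.values())

unique = dedupe_keep_latest(records)
print(f"{len(records) - len(unique)} duplicate(s) removed")
```

If "most complete" is a better rule for your data, replace the date comparison with a count of non-empty fields.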
Inconsistent Formatting
Same thing represented different ways:
- "USA" vs. "United States" vs. "US"
- "05/06/24" vs. "June 5, 2024" (and "05/06/24" could also mean May 6)
- "123-456-7890" vs. "1234567890"
How to handle:
- Standardize formats
- Create mapping tables
- Clean before analysis
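Here is a minimal Python sketch of all three ideas, using the examples above. The mapping table, the 10-digit phone rule, and the list of accepted date formats are assumptions for illustration. In particular, the code reads "05/06/24" as day/month/year; confirm the convention with whoever produced the data before relying on it.

```python
import re
from datetime import datetime

# Mapping table: every known variant points to one standard code.
COUNTRY_MAP = {"usa": "US", "united states": "US", "us": "US"}

def standardize_country(value):
    """Look up a country variant; return the trimmed input if unknown."""
    key = value.strip().lower().replace(".", "")
    return COUNTRY_MAP.get(key, value.strip())

def standardize_phone(value):
    """Strip everything but digits; return None unless 10 digits remain."""
    digits = re.sub(r"\D", "", value)
    return digits if len(digits) == 10 else None

# Assumption: two-digit-year dates are day/month/year.
DATE_FORMATS = ["%d/%m/%y", "%B %d, %Y"]

def standardize_date(value):
    """Try each known format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

for raw in ["USA", "United States", "US"]:
    print(raw, "->", standardize_country(raw))
print(standardize_phone("123-456-7890"))
print(standardize_date("05/06/24"), standardize_date("June 5, 2024"))
```

Returning `None` for values that do not fit any known pattern keeps them visible for review instead of silently guessing.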
Outliers
Values far from typical. Could be:
- Data entry errors
- Real extreme values
- Different populations mixed together
How to handle:
- Investigate (is it real?)
- Keep if real
- Correct if error
- Analyze with and without
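A common way to find candidates for investigation is the interquartile range (IQR) rule, sketched below in Python. This is one technique among several, not the only definition of an outlier; the 1.5 multiplier is a convention and the order values are invented. The code flags values rather than deleting them, then compares the average with and without.

```python
from statistics import mean, quantiles

# Hypothetical order values; the last one looks suspicious.
order_values = [42, 45, 47, 50, 52, 55, 58, 60, 61, 950]

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: k IQRs below Q1 and above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(order_values)
outliers = [v for v in order_values if v < low or v > high]
typical = [v for v in order_values if low <= v <= high]

print("flagged for investigation:", outliers)
print("mean with outliers:", mean(order_values))
print("mean without outliers:", round(mean(typical), 2))
```

If the two averages tell different stories, report both and explain which values were excluded and why.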
Data Entry Errors
Typos, wrong values, misplaced data.
How to handle:
- Validate against expected ranges
- Cross-check with other fields
- Flag suspicious values
- Correct where possible
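The checks above can be written as small rules that return a list of problems for each record. The sketch below is an example only: the field names, the 1 to 1000 quantity range, and the one-cent tolerance are assumptions you would replace with rules that fit your own data.

```python
from datetime import date

def validate_order(order):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Range check: assumed plausible quantity is 1 to 1000.
    if not (0 < order["quantity"] <= 1000):
        problems.append("quantity out of range")
    if order["unit_price"] < 0:
        problems.append("negative unit price")
    # Cross-check: total should equal quantity * unit price within a cent.
    expected = order["quantity"] * order["unit_price"]
    if abs(order["total"] - expected) > 0.01:
        problems.append("total does not match quantity * unit_price")
    # Cross-check: an order cannot ship before it was placed.
    if order["ship_date"] < order["order_date"]:
        problems.append("shipped before ordered")
    return problems

# Hypothetical records: one clean, one with several errors.
good = {"quantity": 3, "unit_price": 19.99, "total": 59.97,
        "order_date": date(2024, 5, 1), "ship_date": date(2024, 5, 3)}
bad = {"quantity": 0, "unit_price": 19.99, "total": 59.97,
       "order_date": date(2024, 5, 4), "ship_date": date(2024, 5, 3)}

for label, order in [("good", good), ("bad", bad)]:
    print(label, validate_order(order))
```

Collecting problems instead of stopping at the first one gives you a full list to review before deciding what to correct.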
The Cleaning Process
1. Understand Your Data
Before cleaning, understand what you have:
- What does each column represent?
- What values are expected?
- How was data collected?
- What's the time range?
2. Assess Quality
Examine for issues:
- Missing values by column
- Duplicate rows
- Value ranges
- Format consistency
- Obvious errors
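A quick profile of the raw data covers most of this list in one pass. The Python sketch below uses a tiny invented CSV and treats an empty string as missing, which is an assumption about how gaps appear in your file. Listing the distinct values per column is a simple way to surface format inconsistencies and obvious errors.

```python
import csv
import io
from collections import Counter

# Hypothetical raw extract with a gap, a repeated row, and odd values.
RAW = """id,region,amount
1,East,100
2,West,
2,West,
3,east,-5
"""

def profile(rows):
    """Summarize row count, missing values, duplicates, and distinct values."""
    columns = list(rows[0].keys())
    row_counts = Counter(tuple(r.items()) for r in rows)
    return {
        "row_count": len(rows),
        "missing": {c: sum(1 for r in rows if r[c].strip() == "") for c in columns},
        "duplicate_rows": sum(n - 1 for n in row_counts.values()),
        "distinct": {c: sorted({r[c] for r in rows}) for c in columns},
    }

rows = list(csv.DictReader(io.StringIO(RAW)))
report = profile(rows)
for name, value in report.items():
    print(name, value)
```

Here the distinct values for `region` show both "East" and "east", and the `amount` column shows a negative value, both worth a closer look.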
3. Document Issues
Keep track of what you find and what you change. This matters for reproducibility.
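One lightweight way to do this is a running log with one entry per cleaning step. The sketch below is a suggestion, not a standard; the function name, fields, step descriptions, and row counts are all hypothetical.

```python
from datetime import datetime, timezone

def log_step(log, step, rows_before, rows_after, note=""):
    """Append one entry describing a cleaning step and its effect."""
    log.append({
        "when": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "step": step,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "rows_removed": rows_before - rows_after,
        "note": note,
    })

# Hypothetical entries; in practice the counts come from your data.
cleaning_log = []
log_step(cleaning_log, "remove exact duplicates", 1000, 988,
         "matched on all columns")
log_step(cleaning_log, "drop rows missing order_id", 988, 981,
         "cannot be linked to a customer")

for entry in cleaning_log:
    print(f'{entry["step"]}: {entry["rows_before"]} -> '
          f'{entry["rows_after"]} ({entry["note"]})')
```

Saving the log alongside the cleaned file lets someone else, or you in six months, retrace every change.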
4. Clean Systematically
Address issues one type at a time:
- Fix duplicates
- Handle missing values
- Standardize formats
- Address outliers
- Verify corrections
5. Validate Results
After cleaning:
- Do row counts make sense?
- Are column sums and averages reasonable?
- Do spot-checked records look right?
- Does it pass sanity tests?
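Sanity tests are easier to repeat when they are written down as named checks. The Python sketch below shows one possible set; the field names and the 10,000 upper limit are assumptions for the example, and the right checks depend on what you know about your data.

```python
def check_cleaned(before, after, key="id", amount="amount", max_amount=10_000):
    """Return named pass/fail checks comparing raw and cleaned rows."""
    return {
        "row count did not grow": len(after) <= len(before),
        "keys are unique": len({r[key] for r in after}) == len(after),
        "no missing amounts": all(r[amount] is not None for r in after),
        "amounts in expected range": all(
            r[amount] is not None and 0 <= r[amount] <= max_amount
            for r in after
        ),
    }

# Hypothetical data before and after cleaning.
raw = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 40.0},
    {"id": 3, "amount": 75.5},
]
cleaned = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 40.0},
    {"id": 3, "amount": 75.5},
]

results = check_cleaned(raw, cleaned)
failed = [name for name, ok in results.items() if not ok]
print("all checks passed" if not failed else f"failed: {failed}")
```

Automated checks do not replace spot-checking: still open a handful of records and compare them with the source by eye.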
Data Transformation
Creating New Variables
Combine or transform existing data:
- Age from birth date
- Profit from revenue minus cost
- Categories from continuous variables
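Each of these three examples is a few lines of Python. The sketch below is illustrative: the age bands are arbitrary cut points chosen for the example, and you would pick bands that match your analysis.

```python
from datetime import date

def age_on(birth_date, as_of):
    """Age in whole years on a given date."""
    years = as_of.year - birth_date.year
    # Subtract one if the birthday has not happened yet in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def profit(revenue, cost):
    """Profit is revenue minus cost."""
    return revenue - cost

def age_band(age):
    """Turn a continuous age into a category (example cut points)."""
    if age < 25:
        return "under 25"
    if age < 45:
        return "25-44"
    if age < 65:
        return "45-64"
    return "65+"

age = age_on(date(1990, 6, 15), date(2024, 6, 14))
print(age, age_band(age), profit(1200.0, 850.0))
```

Calculating age against a fixed reference date, rather than today's date, keeps results the same each time the analysis is rerun.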
Aggregation
Summarizing detailed data:
- Daily to monthly
- Transactions to customer totals
- Individual products to categories
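All three are the same operation: group records by a key and summarize each group. A minimal Python sketch, with invented transactions and a hypothetical helper named `sum_by`:

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level data.
transactions = [
    {"date": date(2024, 1, 30), "customer": "A", "amount": 200.0},
    {"date": date(2024, 1, 31), "customer": "B", "amount": 150.0},
    {"date": date(2024, 2, 1), "customer": "A", "amount": 300.0},
]

def sum_by(records, key, value):
    """Total value(record) for each distinct key(record)."""
    totals = defaultdict(float)
    for rec in records:
        totals[key(rec)] += value(rec)
    return dict(totals)

# Daily to monthly: group by year-month.
monthly = sum_by(transactions,
                 key=lambda t: t["date"].strftime("%Y-%m"),
                 value=lambda t: t["amount"])

# Transactions to customer totals: group by customer.
by_customer = sum_by(transactions,
                     key=lambda t: t["customer"],
                     value=lambda t: t["amount"])

print(monthly)
print(by_customer)
```

A useful check after aggregating: the grand total should be the same before and after.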
Reshaping
Changing data structure:
- Wide to long format
- Pivoting
- Unpivoting
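To make the terms concrete: wide format has one row per entity with a column per measurement, and long format has one row per entity per measurement. The Python sketch below converts between the two; the store and month data are invented and the function names are hypothetical.

```python
# Hypothetical wide data: one row per store, one column per month.
wide = [
    {"store": "North", "jan": 100, "feb": 120},
    {"store": "South", "jan": 90, "feb": 95},
]

def wide_to_long(rows, id_col, value_cols, var_name="month", value_name="sales"):
    """Unpivot: one output row per (id, column) pair."""
    return [
        {id_col: r[id_col], var_name: col, value_name: r[col]}
        for r in rows
        for col in value_cols
    ]

def long_to_wide(rows, id_col, var_name="month", value_name="sales"):
    """Pivot: collect each id's values back into a single row."""
    wide_rows = {}
    for r in rows:
        row = wide_rows.setdefault(r[id_col], {id_col: r[id_col]})
        row[r[var_name]] = r[value_name]
    return list(wide_rows.values())

long_rows = wide_to_long(wide, "store", ["jan", "feb"])
for r in long_rows:
    print(r)

round_trip = long_to_wide(long_rows, "store")
print(round_trip == wide)
```

Long format is usually easier to filter, group, and chart; wide format is usually easier to read in a spreadsheet.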
Filtering
Selecting relevant subsets:
- Time periods
- Specific segments
- Valid records only
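Filters are easiest to review when each condition is a small named function. The Python sketch below combines all three kinds of filter; the orders, the first-quarter date range, and the definition of a valid record are assumptions for the example.

```python
from datetime import date

# Hypothetical orders.
orders = [
    {"id": 1, "date": date(2024, 1, 15), "segment": "retail", "amount": 50.0},
    {"id": 2, "date": date(2024, 2, 20), "segment": "wholesale", "amount": 900.0},
    {"id": 3, "date": date(2024, 3, 5), "segment": "retail", "amount": None},
    {"id": 4, "date": date(2024, 4, 2), "segment": "retail", "amount": 75.0},
    {"id": 5, "date": date(2024, 3, 30), "segment": "retail", "amount": 20.0},
]

def in_period(order, start, end):
    """Time period filter (inclusive on both ends)."""
    return start <= order["date"] <= end

def in_segment(order, segment):
    """Segment filter."""
    return order["segment"] == segment

def is_valid(order):
    """Validity filter: assumed to mean a positive, non-missing amount."""
    return order["amount"] is not None and order["amount"] > 0

q1_retail = [
    o for o in orders
    if in_period(o, date(2024, 1, 1), date(2024, 3, 31))
    and in_segment(o, "retail")
    and is_valid(o)
]

print(f"kept {len(q1_retail)} of {len(orders)} orders")
print([o["id"] for o in q1_retail])
```

Reporting how many records each filter removes helps you catch a condition that is stricter than you intended.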
AI Prompt: Data Cleaning Help
Help me clean this data.
My data includes: [Describe columns]
Issues I've noticed: [What problems you see]
Purpose of analysis: [What you'll do with clean data]
Please suggest:
1. A cleaning strategy
2. How to handle the specific issues
3. Quality checks to perform
4. Transformations that might be useful
5. Red flags to watch for
AI Prompt: Understanding Data
Help me understand this dataset.
Here's a sample of my data:
[Paste sample rows]
Please help me understand:
1. What each column likely represents
2. Data types (numeric, categorical, date, etc.)
3. Potential quality issues you notice
4. Relationships between columns
5. Questions this data could answer
What's Next
Your data is clean. Time to explore it.
Next chapter: Exploratory data analysis.