data-prep

Prepare and clean data for machine learning through preprocessing and feature engineering. This skill handles missing values, outliers, normalization, and categorical encoding, and produces ML-ready train and test datasets.

smouj 0 Updated 4mo ago


Install

npx skillscat add smouj/data-prep-skill

Install via the SkillsCat registry.

SKILL.md

Data Preparator

You are an expert in data preparation and feature engineering for machine learning.

When to Use This Skill

Use when: Cleaning raw data for ML models
Use when: Creating features for prediction tasks
Use when: Preparing training and test datasets
Use when: Handling missing values and outliers
Use when: Encoding categorical variables
NOT for: Model training (use ml-train)

Work Process

1. Analysis

Explore data structure and types
Identify missing values and outliers
Analyze distributions
Check class balance
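
A minimal sketch of this analysis pass, assuming the data fits in a pandas DataFrame. The `profile` helper, the toy columns, and the `label` target name are illustrative assumptions, not part of the skill itself.

```python
import pandas as pd


def profile(df: pd.DataFrame, target: str) -> dict:
    """Summarize structure, missingness, duplicates, and class balance."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
        # describe() covers the numeric columns: mean, std, quartiles, min/max
        "numeric_summary": df.describe().to_dict(),
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy data (hypothetical): one missing age, one exact duplicate row
df = pd.DataFrame({
    "age": [25, None, 40, 40],
    "city": ["Lima", "Oslo", "Oslo", "Oslo"],
    "label": [1, 0, 0, 0],
})
summary = profile(df, target="label")
```

The returned dictionary maps directly onto the "Dataset Summary" and "Class Distribution" parts of a report.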

2. Cleaning

Handle missing values (impute, drop)
Remove duplicates
Fix data types
Handle outliers
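
The cleaning steps above could look roughly like this in pandas. The `clean` function and the `income`/`city` columns are assumptions made for illustration. For brevity the median, mode, and IQR bounds are computed on the whole frame; in a leakage-safe workflow they would be derived from the training split only.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()              # work on a copy; the raw frame stays untouched
    df = df.drop_duplicates()    # remove exact duplicate rows

    # Fix data types: unparseable strings such as "n/a" become NaN
    df["income"] = pd.to_numeric(df["income"], errors="coerce")

    # Impute: median for numerical, mode for categorical
    df["income"] = df["income"].fillna(df["income"].median())
    df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

    # Handle outliers with the IQR rule (keep values within 1.5 * IQR of the quartiles)
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep].reset_index(drop=True)


# Toy data (hypothetical): one duplicate, one bad value, one missing city, one outlier
raw = pd.DataFrame({
    "income": ["50", "50", "52", "48", "51", "n/a", "49", "900"],
    "city": ["Oslo", "Oslo", "Lima", None, "Oslo", "Lima", "Oslo", "Oslo"],
})
cleaned = clean(raw)
```

Dropping outlier rows is only one option; clipping values to the IQR bounds keeps the rows and may suit small datasets better.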

3. Engineering

Create new features
Encode categorical variables
Scale/normalize numerical features
Create interaction terms
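
One way to sketch the engineering step is a scikit-learn `ColumnTransformer` that scales numerical columns and one-hot encodes categorical ones, fitted on the training split only. The `add_features` helper, the column names, and the toy values are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy splits (hypothetical); the test row contains a city unseen in training
train = pd.DataFrame({
    "price": [10.0, 30.0, 30.0, 80.0],
    "quantity": [1, 2, 3, 4],
    "city": ["Lima", "Oslo", "Oslo", "Lima"],
})
test = pd.DataFrame({"price": [25.0], "quantity": [5], "city": ["Rome"]})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Row-wise derived features; no statistics are learned here."""
    out = df.copy()
    out["price_per_unit"] = out["price"] / out["quantity"]    # ratio feature
    out["price_x_quantity"] = out["price"] * out["quantity"]  # interaction term
    return out


numeric = ["price", "quantity", "price_per_unit", "price_x_quantity"]
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        # Unseen categories are encoded as all zeros instead of raising an error
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_train = preprocess.fit_transform(add_features(train))  # learn statistics on train
X_test = preprocess.transform(add_features(test))        # reuse them on test
```

Row-wise features such as `price_per_unit` are safe to compute before or after splitting; anything that learns statistics (scaling, encoding vocabularies) belongs inside the fitted transformer.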

4. Validation

Verify data quality
Check for data leakage
Split train/test properly
Document transformations
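
A small sketch of the validation step, assuming a binary `label` column: a seeded, stratified split, a basic leakage check on row overlap, and a JSON log of what was done. The column names and the log structure are hypothetical.

```python
import json

import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42

# Toy data (hypothetical): 20% positive class
df = pd.DataFrame({"x": range(100), "label": [1] * 20 + [0] * 80})

# Stratify so both splits keep the original class ratio
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=SEED
)

# Leakage check: no row may appear in both splits
overlap = set(train.index) & set(test.index)

# Document the transformation so the split can be reproduced later
transformations = [
    {"step": "stratified_split", "test_size": 0.2, "random_state": SEED}
]
log = json.dumps(transformations, indent=2)
```

An index overlap check only catches shared rows. Leakage through target-derived features or duplicated entities (for example, the same customer in both splits) needs dataset-specific checks.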

Golden Rules

Preserve original - Never modify raw data files
Document - Track all transformations
Reproducible - Seed all randomness
Privacy - Remove PII before processing
Split properly - Always split before feature engineering, so that imputers, encoders, and scalers are fit on the training set only
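
As a rough illustration of these rules working together, the sketch below drops PII from a copy of the raw data, seeds the split, and fits an imputing and scaling pipeline on the training rows only. The `PII_COLUMNS` list, the column names, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0                         # Reproducible: one seed for all randomness
PII_COLUMNS = ["name", "phone"]  # hypothetical PII fields

raw = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "phone": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "age": [22, np.nan, 35, 41, 29, np.nan, 52, 38],
    "income": [30e3, 42e3, np.nan, 58e3, 39e3, 45e3, 61e3, 50e3],
})

# Preserve original + Privacy: drop() returns a new frame, raw is unchanged
df = raw.drop(columns=PII_COLUMNS)

# Split properly: split first, then fit anything that learns statistics
train, test = train_test_split(df, test_size=0.25, random_state=SEED)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
features = ["age", "income"]
X_train = pipe.fit_transform(train[features])  # medians, means, stds from train
X_test = pipe.transform(test[features])        # test reuses train statistics
```

Keeping every fitted step inside one `Pipeline` object also makes it straightforward to persist the exact transformations alongside the datasets.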

Supported Tools

Task	Tool	Language
Data Manipulation	Pandas, Polars	Python
Preprocessing	Scikit-learn	Python
Feature Selection	Feature-engine	Python
Deep Learning	TensorFlow Data	Python

Output Format

## Data Preparation Report

### Dataset Summary
- **Original Rows:** 100,000
- **Original Columns:** 45
- **Missing Values:** 12.3%
- **Duplicates:** 234

### Data Quality Issues Found
| Issue | Count | Affected Columns |
|-------|-------|-------------------|
| Missing Values | 15,200 | age, income, address |
| Outliers | 2,340 | salary, price, score |
| Invalid Types | 450 | phone, date |
| Duplicates | 234 | All |

### Transformations Applied
1. ✅ Removed 234 duplicate rows
2. ✅ Imputed missing values:
   - Numerical: Median imputation (age, income)
   - Categorical: Mode imputation (city, category)
3. ✅ Removed outliers (IQR method): 2,340 rows
4. ✅ Encoded categorical variables:
   - One-hot: city, category
   - Label: status, priority
5. ✅ Scaled numerical features (StandardScaler):
   - age, salary, score, price

### Feature Engineering
| Feature | Type | Description |
|---------|------|-------------|
| age_group | Categorical | Binned age (0-18, 19-35, etc.) |
| income_percentile | Numerical | Percentile rank of income |
| has_phone | Boolean | Derived from phone field |
| price_per_unit | Numerical | price / quantity |

### Final Dataset
- **Final Rows:** 97,426
- **Final Features:** 68
- **Train Size:** 77,941 (80%)
- **Test Size:** 19,485 (20%)

### Class Distribution
| Class | Count | Percentage |
|-------|-------|------------|
| Positive | 15,234 | 15.6% |
| Negative | 82,192 | 84.4% |

## Saved Files
- train.csv (77,941 rows)
- test.csv (19,485 rows)
- preprocessing_pipeline.pkl
- feature_definitions.json

## Next Steps
- [ ] Balance classes (SMOTE or undersampling)
- [ ] Feature selection to reduce dimensionality
- [ ] Try dimensionality reduction (PCA)
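
The figures in the sample report are internally consistent, and a few lines of arithmetic can confirm that before a report is published. This sketch assumes the split sizes come from rounding 80% of the final row count; the variable names are illustrative.

```python
# Figures taken from the sample report above
original_rows = 100_000
duplicates_removed = 234
outlier_rows_removed = 2_340

final_rows = original_rows - duplicates_removed - outlier_rows_removed

# 80/20 split, rounding the training share to the nearest row
train_rows = round(final_rows * 0.80)
test_rows = final_rows - train_rows

# Class distribution percentages, rounded to one decimal place
positive, negative = 15_234, 82_192
positive_pct = round(100 * positive / final_rows, 1)
negative_pct = round(100 * negative / final_rows, 1)

print(final_rows, train_rows, test_rows, positive_pct, negative_pct)
```

Libraries may round split sizes differently (some round the test share up), so the reported counts should come from the actual split rather than from this calculation.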
