smouj

data-prep

Prepare and clean data for machine learning with feature engineering and preprocessing. This skill handles missing values, normalization, encoding, and ML-ready dataset creation.


Install

npx skillscat add smouj/data-prep-skill

Install via the SkillsCat registry.

SKILL.md

Data Preparator

You are an expert in data preparation and feature engineering for machine learning.

When to Use This Skill

  • Use when: Cleaning raw data for ML models
  • Use when: Creating features for prediction tasks
  • Use when: Preparing training and test datasets
  • Use when: Handling missing values and outliers
  • Use when: Encoding categorical variables
  • NOT for: Model training (use ml-train)

Work Process

1. Analysis

  • Explore data structure and types
  • Identify missing values and outliers
  • Analyze distributions
  • Check class balance
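The analysis checks above can be sketched with pandas. This is a minimal illustration, not a prescribed implementation: the `profile` helper, the toy columns (`age`, `city`, `label`), and the choice of Tukey's 1.5 × IQR rule for outliers are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def profile(df: pd.DataFrame, target: str) -> dict:
    """Summarize dtypes, missingness, duplicates, IQR outliers, and class balance."""
    numeric = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")

    # Tukey's rule (an assumed convention): values more than 1.5 * IQR
    # beyond the quartiles are counted as outliers.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = numeric.lt(q1 - 1.5 * iqr, axis="columns") | numeric.gt(
        q3 + 1.5 * iqr, axis="columns"
    )

    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_fraction": df.isna().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "iqr_outliers": is_outlier.sum().to_dict(),
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 29, 33, 27, 120],
    "city": ["A", "B", "B", None, "A", "A", "B", "A"],
    "label": [0, 0, 0, 1, 0, 0, 0, 1],
})
report = profile(df, target="label")
```

Here `report["iqr_outliers"]` flags the single extreme `age` value and `report["class_balance"]` exposes the 75/25 split, which is the kind of imbalance worth noting before any modeling.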

2. Cleaning

  • Handle missing values (impute, drop)
  • Remove duplicates
  • Fix data types
  • Handle outliers
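A possible shape for the cleaning step, assuming a small frame with hypothetical `age`, `city`, and `income` columns. The sketch clips outliers to the IQR fences rather than dropping rows; dropping is equally valid and is a judgment call that depends on the dataset.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    df = raw.drop_duplicates().copy()

    # Fix data types: unparseable entries become NaN instead of raising.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # Impute: median for the numeric column, mode for the categorical one.
    df["age"] = df["age"].fillna(df["age"].median())
    df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

    # Handle outliers by clipping to the 1.5 * IQR fences (one option among several).
    q1, q3 = df["income"].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    df["income"] = df["income"].clip(lower=q1 - fence, upper=q3 + fence)
    return df.reset_index(drop=True)


# Toy data, invented for illustration only.
raw = pd.DataFrame({
    "age": ["25", "31", "31", "n/a", "40"],
    "city": ["Lyon", "Oslo", "Oslo", None, "Oslo"],
    "income": [30_000.0, 32_000.0, 32_000.0, 31_000.0, 900_000.0],
})
cleaned = clean(raw)
```

Because `clean` works on a copy, `raw` still holds the duplicate row and the `"n/a"` entry afterwards, which keeps the original data available for auditing.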

3. Engineering

  • Create new features
  • Encode categorical variables
  • Scale/normalize numerical features
  • Create interaction terms
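The engineering steps can be combined in a scikit-learn `ColumnTransformer`. The sketch below is illustrative: the column names, the `add_features` helper, and the `age_x_income` interaction are assumptions, and `sparse_threshold=0` is set only so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Row-wise derived features; they use no statistics from other rows."""
    out = df.copy()
    out["price_per_unit"] = out["price"] / out["quantity"]
    out["age_x_income"] = out["age"] * out["income"]  # interaction term
    return out


numeric_cols = ["age", "income", "price_per_unit", "age_x_income"]
categorical_cols = ["city"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy training data, invented for illustration only.
train = add_features(pd.DataFrame({
    "age": [25, 40, 31, 52],
    "income": [30_000, 52_000, 41_000, 75_000],
    "price": [10.0, 24.0, 9.0, 40.0],
    "quantity": [2, 3, 1, 8],
    "city": ["Lyon", "Oslo", "Oslo", "Lyon"],
}))
X_train = preprocessor.fit_transform(train)
```

The fitted `preprocessor` can then be applied to held-out data with `preprocessor.transform(...)`, so the scaling statistics and category vocabulary come from the training rows only.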

4. Validation

  • Verify data quality
  • Check for data leakage
  • Confirm the train/test split happened before feature engineering (Golden Rule 5)
  • Document transformations
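Two of the validation checks lend themselves to a short sketch: a seeded, stratified split and a simple leakage test that looks for identical feature rows on both sides of the split. The `split_and_check` helper, the seed value, and the toy columns are assumptions for illustration; real leakage checks usually go further (for example, looking for features derived from the target).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary fixed value so the split is reproducible


def split_and_check(df: pd.DataFrame, target: str, test_size: float = 0.2):
    """Stratified split plus a count of feature rows shared by train and test."""
    train, test = train_test_split(
        df, test_size=test_size, stratify=df[target], random_state=SEED
    )

    # Identical feature rows in both partitions let the model "see" test data.
    features = [c for c in df.columns if c != target]
    shared = train[features].merge(
        test[features].drop_duplicates(), on=features, how="inner"
    )
    return train, test, len(shared)


# Toy data, invented for illustration: 100 unique rows, 20% positive class.
df = pd.DataFrame({
    "x": range(100),
    "y": [i % 7 for i in range(100)],
    "label": [1 if i % 5 == 0 else 0 for i in range(100)],
})
train, test, shared_rows = split_and_check(df, target="label")
```

A non-zero `shared_rows` would suggest that duplicates survived cleaning and straddle the split.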

Golden Rules

  1. Preserve original - Never modify raw data files
  2. Document - Track all transformations
  3. Reproducible - Seed all randomness
  4. Privacy - Remove PII before processing
  5. Split properly - Always split before feature engineering, so imputers, scalers, and encoders are fitted on training data only
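A compact sketch of how the five rules might look together in code. Everything named here is hypothetical: the `PII_COLUMNS` list, the `log` list used as a transformation record, the seed value, and the toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 0                          # rule 3: one fixed seed for all randomness
PII_COLUMNS = ["name", "email"]   # hypothetical PII fields
log = []                          # rule 2: running record of transformations

# Toy data, invented for illustration only.
raw = pd.DataFrame({
    "name": [f"user{i}" for i in range(8)],
    "email": [f"user{i}@example.com" for i in range(8)],
    "income": [30.0, 42.0, 55.0, 61.0, 38.0, 47.0, 72.0, 50.0],
})

# Rules 1 and 4: drop() returns a new frame, so `raw` itself is never modified.
df = raw.drop(columns=PII_COLUMNS)
log.append(f"dropped PII columns: {PII_COLUMNS}")

# Rule 5: split first ...
train, test = train_test_split(df, test_size=0.25, random_state=SEED)
log.append(f"split with random_state={SEED}: {len(train)} train / {len(test)} test")

# ... then fit on the training rows only and reuse the fitted scaler for test.
scaler = StandardScaler().fit(train[["income"]])
train_scaled = scaler.transform(train[["income"]])
test_scaled = scaler.transform(test[["income"]])
log.append("StandardScaler fitted on train only")
```

Fitting the scaler after the split means the test rows never contribute to the mean and variance used for scaling, which is the point of rule 5.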

Supported Tools

| Task | Tool | Language |
|------|------|----------|
| Data Manipulation | Pandas, Polars | Python |
| Preprocessing | Scikit-learn | Python |
| Feature Selection | Feature-engine | Python |
| Deep Learning | TensorFlow Data | Python |
Output Format

## Data Preparation Report

### Dataset Summary
- **Original Rows:** 100,000
- **Original Columns:** 45
- **Missing Values:** 12.3% of rows
- **Duplicates:** 234

### Data Quality Issues Found
| Issue | Count | Affected Columns |
|-------|-------|-------------------|
| Missing Values | 15,200 | age, income, address |
| Outliers | 2,340 | salary, price, score |
| Invalid Types | 450 | phone, date |
| Duplicates | 234 | All |

### Transformations Applied
1. ✅ Removed 234 duplicate rows
2. ✅ Imputed missing values:
   - Numerical: Median imputation (age, income)
   - Categorical: Mode imputation (city, category)
3. ✅ Removed outliers (IQR method): 2,340 rows
4. ✅ Encoded categorical variables:
   - One-hot: city, category
   - Ordinal (label): status, priority
5. ✅ Scaled numerical features (StandardScaler):
   - age, salary, score, price

### Feature Engineering
| Feature | Type | Description |
|---------|------|-------------|
| age_group | Categorical | Binned age (0-18, 19-35, etc.) |
| income_percentile | Numerical | Percentile rank of income |
| has_phone | Boolean | Derived from phone field |
| price_per_unit | Numerical | price / quantity |

### Final Dataset
- **Final Rows:** 97,426
- **Final Features:** 68
- **Train Size:** 77,941 (80%)
- **Test Size:** 19,485 (20%)

### Class Distribution
| Class | Count | Percentage |
|-------|-------|------------|
| Positive | 15,234 | 15.6% |
| Negative | 82,192 | 84.4% |

## Saved Files
- train.csv (77,941 rows)
- test.csv (19,485 rows)
- preprocessing_pipeline.pkl
- feature_definitions.json

## Next Steps
- [ ] Balance classes (SMOTE or undersampling)
- [ ] Feature selection to reduce dimensionality
- [ ] Try dimensionality reduction (PCA)
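
The Dataset Summary fields of the report can be computed directly from the raw frame. The sketch below is one way to do it; the `dataset_summary` helper is hypothetical, and it assumes the missing-value percentage means the share of rows with at least one missing value, since the template does not state the denominator.

```python
import pandas as pd


def dataset_summary(df: pd.DataFrame) -> str:
    """Render the 'Dataset Summary' section of the report as markdown."""
    # Assumption: percentage of rows that contain at least one missing value.
    missing_pct = df.isna().any(axis=1).mean() * 100
    lines = [
        "### Dataset Summary",
        f"- **Original Rows:** {len(df):,}",
        f"- **Original Columns:** {df.shape[1]}",
        f"- **Missing Values:** {missing_pct:.1f}%",
        f"- **Duplicates:** {int(df.duplicated().sum()):,}",
    ]
    return "\n".join(lines)


# Toy data, invented for illustration: one duplicate row, one row with a gap.
df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "city": ["A", "A", "B", "C"],
})
summary = dataset_summary(df)
```

Generating the summary from the data, rather than typing it by hand, keeps the report in step with the frame it describes.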