Dexploarer

pii-detector

Detects Personally Identifiable Information (PII) in code, logs, databases, and files for GDPR/CCPA compliance. Use when user asks to "detect PII", "find sensitive data", "scan for personal information", "check GDPR compliance", or "find SSN/credit cards".

Dexploarer 5 2 Updated 7mo ago
GitHub

Install

npx skillscat add dexploarer/claudius-skills/pii-detector

Install via the SkillsCat registry.

SKILL.md

PII Detector

Scans code, logs, databases, and configuration files for Personally Identifiable Information (PII) to ensure GDPR, CCPA, and privacy compliance.

When to Use

  • "Scan for PII in my codebase"
  • "Find sensitive data"
  • "Check for exposed personal information"
  • "Detect SSN, credit cards, emails"
  • "GDPR compliance check"
  • "Find PII in logs"

Instructions

1. Detect Project Type

# Check project structure
ls -la

# Detect language
[ -f "package.json" ] && echo "JavaScript/TypeScript"
[ -f "requirements.txt" ] && echo "Python"
[ -f "pom.xml" ] && echo "Java"
[ -f "Gemfile" ] && echo "Ruby"

# Check for logs
find . -name "*.log" -type f | head -5

2. Define PII Patterns

Common PII Types:

  1. Social Security Numbers (SSN)

    • Pattern: \b\d{3}-\d{2}-\d{4}\b
    • Example: 123-45-6789
  2. Credit Card Numbers

    • Visa: \b4\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
    • MasterCard: \b5[1-5]\d{2}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
    • Amex: \b3[47]\d{2}[\s-]?\d{6}[\s-]?\d{5}\b
    • Discover: \b6011[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
  3. Email Addresses

    • Pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b
  4. Phone Numbers

    • US: \b\d{3}[-.]?\d{3}[-.]?\d{4}\b
    • International: \+\d{1,3}[\s-]?\d{1,14}
  5. IP Addresses

    • IPv4: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
    • IPv6: ([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}
  6. Dates of Birth

    • Pattern: \b\d{2}/\d{2}/\d{4}\b or \b\d{4}-\d{2}-\d{2}\b
  7. Passport Numbers

    • US: \b[A-Z]{1,2}\d{6,9}\b
  8. Driver's License

    • Varies by state/country
  9. Bank Account Numbers

    • Pattern: \b\d{8,17}\b
  10. API Keys / Tokens

    • AWS: AKIA[0-9A-Z]{16}
    • Slack: xox[baprs]-[0-9a-zA-Z-]{10,}
    • GitHub: ghp_[0-9a-zA-Z]{36}

3. Scan Codebase

Using grep:

# Scan for SSN
grep -rn '\b\d{3}-\d{2}-\d{4}\b' . --include="*.js" --include="*.py" --include="*.java"

# Scan for credit cards
grep -rn '\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b' . --exclude-dir=node_modules

# Scan for emails
grep -rn '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b' . --include="*.log"

# Scan for phone numbers
grep -rn '\b\d{3}[-.]?\d{3}[-.]?\d{4}\b' .

# Scan for API keys
grep -rn 'AKIA[0-9A-Z]{16}' . --include="*.env*" --include="*.config*"

Exclude common false positives:

# Exclude test files, build directories
grep -rn <pattern> . \
  --exclude-dir=node_modules \
  --exclude-dir=.git \
  --exclude-dir=dist \
  --exclude-dir=build \
  --exclude-dir=vendor \
  --exclude-dir=__pycache__ \
  --exclude="*.test.js" \
  --exclude="*.spec.ts" \
  --exclude="*.min.js"

4. Create PII Detection Script

Python Script:

#!/usr/bin/env python3
import re
import os
import sys
from pathlib import Path
from typing import List, Dict, Tuple

class PIIDetector:
    def __init__(self):
        self.patterns = {
            'SSN': r'\b\d{3}-\d{2}-\d{4}\b',
            'Credit Card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
            'Email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
            'Phone (US)': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
            'IPv4': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
            'AWS Key': r'AKIA[0-9A-Z]{16}',
            'GitHub Token': r'ghp_[0-9a-zA-Z]{36}',
            'Slack Token': r'xox[baprs]-[0-9a-zA-Z-]{10,}',
            'Date of Birth': r'\b(?:0[1-9]|1[0-2])/(?:0[1-9]|[12][0-9]|3[01])/(?:19|20)\d{2}\b',
        }

        self.exclude_dirs = {
            'node_modules', '.git', 'dist', 'build', 'vendor',
            '__pycache__', '.next', 'out', 'coverage', '.venv'
        }

        self.exclude_extensions = {
            '.min.js', '.map', '.lock', '.jpg', '.png', '.gif',
            '.pdf', '.zip', '.tar', '.gz'
        }

    def should_scan_file(self, filepath: Path) -> bool:
        """Check if file should be scanned."""
        # Check excluded directories
        if any(excluded in filepath.parts for excluded in self.exclude_dirs):
            return False

        # Check excluded extensions
        if filepath.suffix in self.exclude_extensions:
            return False

        # Check file size (skip files > 10MB)
        try:
            if filepath.stat().st_size > 10 * 1024 * 1024:
                return False
        except OSError:
            return False

        return True

    def scan_file(self, filepath: Path) -> List[Dict]:
        """Scan a single file for PII."""
        findings = []

        try:
            with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                for line_num, line in enumerate(f, 1):
                    for pii_type, pattern in self.patterns.items():
                        matches = re.finditer(pattern, line)
                        for match in matches:
                            # Check for common false positives
                            if self.is_false_positive(pii_type, match.group(), line):
                                continue

                            findings.append({
                                'file': str(filepath),
                                'line': line_num,
                                'type': pii_type,
                                'value': self.mask_pii(match.group()),
                                'context': line.strip()[:100]
                            })
        except Exception as e:
            print(f"Error scanning {filepath}: {e}", file=sys.stderr)

        return findings

    def is_false_positive(self, pii_type: str, value: str, context: str) -> bool:
        """Check for common false positives."""
        # Common test data
        test_patterns = [
            '000-00-0000',
            '111-11-1111',
            '123-45-6789',
            '4111111111111111',  # Test credit card
            'test@example.com',
            'user@localhost',
            '127.0.0.1',
            '0.0.0.0',
            '192.168.',
        ]

        for pattern in test_patterns:
            if pattern in value:
                return True

        # Check if in comment
        if any(comment in context for comment in ['//', '#', '/*', '*', '<!--']):
            if 'example' in context.lower() or 'test' in context.lower():
                return True

        return False

    def mask_pii(self, value: str) -> str:
        """Mask PII value for reporting."""
        if len(value) <= 4:
            return '*' * len(value)
        return value[:2] + '*' * (len(value) - 4) + value[-2:]

    def scan_directory(self, directory: Path) -> List[Dict]:
        """Recursively scan directory for PII."""
        all_findings = []

        for filepath in directory.rglob('*'):
            if filepath.is_file() and self.should_scan_file(filepath):
                findings = self.scan_file(filepath)
                all_findings.extend(findings)

        return all_findings

    def generate_report(self, findings: List[Dict]) -> str:
        """Generate human-readable report."""
        if not findings:
            return "✅ No PII detected!"

        report = f"⚠️  Found {len(findings)} potential PII instances:\n\n"

        # Group by type
        by_type = {}
        for finding in findings:
            pii_type = finding['type']
            if pii_type not in by_type:
                by_type[pii_type] = []
            by_type[pii_type].append(finding)

        for pii_type, items in sorted(by_type.items()):
            report += f"## {pii_type} ({len(items)} found)\n\n"
            for item in items[:10]:  # Limit to 10 per type
                report += f"- {item['file']}:{item['line']}\n"
                report += f"  Value: {item['value']}\n"
                report += f"  Context: {item['context']}\n\n"

            if len(items) > 10:
                report += f"  ... and {len(items) - 10} more\n\n"

        return report

def main():
    import argparse

    parser = argparse.ArgumentParser(description='Scan for PII in codebase')
    parser.add_argument('path', nargs='?', default='.', help='Path to scan')
    parser.add_argument('--json', action='store_true', help='Output JSON')
    parser.add_argument('--exclude', nargs='+', help='Additional directories to exclude')

    args = parser.parse_args()

    detector = PIIDetector()

    if args.exclude:
        detector.exclude_dirs.update(args.exclude)

    scan_path = Path(args.path)
    findings = detector.scan_directory(scan_path)

    if args.json:
        import json
        print(json.dumps(findings, indent=2))
    else:
        print(detector.generate_report(findings))

    # Exit with error code if PII found
    sys.exit(1 if findings else 0)

if __name__ == '__main__':
    main()

Save as pii_detector.py and run:

python pii_detector.py .

JavaScript/TypeScript Version:

// pii-detector.js
const fs = require('fs');
const path = require('path');
const readline = require('readline');

const PII_PATTERNS = {
  'SSN': /\b\d{3}-\d{2}-\d{4}\b/g,
  'Credit Card': /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g,
  'Email': /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/gi,
  'Phone (US)': /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
  'IPv4': /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g,
  'AWS Key': /AKIA[0-9A-Z]{16}/g,
  'GitHub Token': /ghp_[0-9a-zA-Z]{36}/g,
};

const EXCLUDE_DIRS = new Set([
  'node_modules', '.git', 'dist', 'build', 'coverage',
  '.next', 'out', 'vendor', '__pycache__'
]);

const EXCLUDE_EXTS = new Set([
  '.min.js', '.map', '.lock', '.jpg', '.png', '.gif', '.pdf'
]);

function shouldScanFile(filePath) {
  const parts = filePath.split(path.sep);
  if (parts.some(part => EXCLUDE_DIRS.has(part))) {
    return false;
  }

  const ext = path.extname(filePath);
  if (EXCLUDE_EXTS.has(ext)) {
    return false;
  }

  return true;
}

function maskPII(value) {
  if (value.length <= 4) return '*'.repeat(value.length);
  return value.slice(0, 2) + '*'.repeat(value.length - 4) + value.slice(-2);
}

async function scanFile(filePath) {
  const findings = [];

  const fileStream = fs.createReadStream(filePath);
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  let lineNum = 0;
  for await (const line of rl) {
    lineNum++;

    for (const [type, pattern] of Object.entries(PII_PATTERNS)) {
      const matches = line.matchAll(pattern);

      for (const match of matches) {
        findings.push({
          file: filePath,
          line: lineNum,
          type,
          value: maskPII(match[0]),
          context: line.trim().slice(0, 100)
        });
      }
    }
  }

  return findings;
}

async function scanDirectory(dir) {
  const findings = [];

  async function walk(directory) {
    const files = await fs.promises.readdir(directory);

    for (const file of files) {
      const filePath = path.join(directory, file);
      const stat = await fs.promises.stat(filePath);

      if (stat.isDirectory()) {
        if (!EXCLUDE_DIRS.has(file)) {
          await walk(filePath);
        }
      } else if (shouldScanFile(filePath)) {
        const fileFindings = await scanFile(filePath);
        findings.push(...fileFindings);
      }
    }
  }

  await walk(dir);
  return findings;
}

function generateReport(findings) {
  if (findings.length === 0) {
    return '✅ No PII detected!';
  }

  let report = `⚠️  Found ${findings.length} potential PII instances:\n\n`;

  const byType = {};
  for (const finding of findings) {
    if (!byType[finding.type]) {
      byType[finding.type] = [];
    }
    byType[finding.type].push(finding);
  }

  for (const [type, items] of Object.entries(byType)) {
    report += `## ${type} (${items.length} found)\n\n`;

    for (const item of items.slice(0, 10)) {
      report += `- ${item.file}:${item.line}\n`;
      report += `  Value: ${item.value}\n`;
      report += `  Context: ${item.context}\n\n`;
    }

    if (items.length > 10) {
      report += `  ... and ${items.length - 10} more\n\n`;
    }
  }

  return report;
}

async function main() {
  const scanPath = process.argv[2] || '.';
  const findings = await scanDirectory(scanPath);

  console.log(generateReport(findings));

  process.exit(findings.length > 0 ? 1 : 0);
}

main().catch(console.error);

5. Database Scanning

SQL Query to Find PII:

-- PostgreSQL example
-- Scan for potential email columns
SELECT
    table_name,
    column_name,
    data_type
FROM information_schema.columns
WHERE column_name ILIKE '%email%'
   OR column_name ILIKE '%ssn%'
   OR column_name ILIKE '%phone%'
   OR column_name ILIKE '%address%'
ORDER BY table_name, column_name;

-- Sample data to check patterns
SELECT
    table_name,
    column_name,
    COUNT(*) as sample_count
FROM information_schema.columns
WHERE data_type IN ('character varying', 'text', 'char')
GROUP BY table_name, column_name;

6. Log File Scanning

# Scan application logs
find . -name "*.log" -type f -exec grep -l '\b\d{3}-\d{2}-\d{4}\b' {} \;

# Scan with context
grep -C 3 'SSN\|social security\|credit card' *.log

# Check for leaked credentials
grep -r 'password.*=\|api_key.*=\|secret.*=' . --include="*.log"

7. CI/CD Integration

GitHub Actions:

name: PII Detection

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  pii-scan:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Run PII Detector
        run: |
          python pii_detector.py . --json > pii-report.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: pii-report
          path: pii-report.json

      - name: Fail if PII found
        run: |
          if [ $(cat pii-report.json | jq 'length') -gt 0 ]; then
            echo "❌ PII detected! See report for details."
            exit 1
          fi

8. Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Scanning for PII..."

python pii_detector.py $(git diff --cached --name-only)

if [ $? -ne 0 ]; then
    echo "❌ PII detected in staged files!"
    echo "Please remove sensitive data before committing."
    exit 1
fi

echo "✅ No PII detected"
exit 0

9. Data Anonymization

Once PII is found, suggest anonymization:

# anonymize.py
import hashlib
import re

def anonymize_email(email):
    """Replace email with hashed version."""
    local, domain = email.split('@')
    hashed = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{hashed}@{domain}"

def anonymize_ssn(ssn):
    """Mask SSN keeping only last 4 digits."""
    return f"***-**-{ssn[-4:]}"

def anonymize_phone(phone):
    """Mask phone keeping only last 4 digits."""
    digits = re.sub(r'\D', '', phone)
    return f"***-***-{digits[-4:]}"

def anonymize_credit_card(cc):
    """Mask credit card keeping only last 4 digits."""
    return f"****-****-****-{cc[-4:]}"

# Example usage
text = "Contact John at john@email.com or call 555-123-4567"
text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
              lambda m: anonymize_email(m.group()), text)
print(text)
# Output: "Contact John at a1b2c3d4@email.com or call 555-123-4567"

10. Compliance Report

Generate compliance report:

# PII Detection Report

**Date**: 2025-11-01
**Scope**: Entire codebase
**Files Scanned**: 1,247
**Total Findings**: 23

## Summary by Type

| PII Type | Count | Risk Level |
|----------|-------|------------|
| Email    | 12    | Medium     |
| Phone    | 8     | Medium     |
| SSN      | 2     | High       |
| API Keys | 1     | Critical   |

## Critical Findings

### 1. AWS API Key in config file
- **File**: config/production.env
- **Line**: 15
- **Recommendation**: Move to environment variables or secret manager

### 2. SSN in test data
- **File**: tests/fixtures/users.json
- **Line**: 42
- **Recommendation**: Use fake data generator

## Remediation Steps

1. ✅ Remove hardcoded credentials from config files
2. ✅ Replace real PII in test data with fake data
3. ✅ Add pre-commit hooks to prevent future leaks
4. ✅ Rotate exposed API keys
5. ✅ Update .gitignore to exclude sensitive files

## Compliance Status

- [ ] GDPR Article 32 (Security of processing)
- [ ] CCPA Section 1798.150 (Data protection)
- [ ] HIPAA Security Rule (if applicable)

Best Practices

DO:

  • Scan regularly (CI/CD, pre-commit)
  • Use environment variables for secrets
  • Anonymize data in non-production
  • Implement data retention policies
  • Train team on PII handling
  • Use tools like git-secrets, truffleHog

DON'T:

  • Store PII in version control
  • Log sensitive data
  • Hardcode credentials
  • Use real PII in tests
  • Keep PII longer than needed
  • Ignore scan results

Checklist

  • PII patterns defined
  • Scanner script created
  • Codebase scanned
  • Database schema reviewed
  • Logs checked
  • CI/CD integration added
  • Pre-commit hook installed
  • Findings documented
  • Remediation plan created
  • Team trained on PII handling