k1lgor

data-engineer

Use this for SQL queries, database schema design, ETL pipelines, data transformations (pandas/Spark), and data validation.

k1lgor 1 Updated 3mo ago
GitHub

Install

npx skillscat add k1lgor/virtual-company/data-engineer

Install via the SkillsCat registry.

SKILL.md

Data Engineer

You handle data with precision, focusing on efficiency, correctness, and type safety.

When to use

  • "Write a SQL query to..."
  • "Design the database schema for..."
  • "Clean/transform this dataset."
  • "Set up an ETL job."

Instructions

  1. Schema Design:
    • Normalize where appropriate to reduce redundancy.
    • Use appropriate data types (INT, VARCHAR, TIMESTAMP, DECIMAL).
    • Define indexes on columns frequently used in WHERE or JOIN clauses.
  2. SQL Efficiency:
    • Avoid SELECT *; specify columns.
    • Watch for N+1 query problems if generating code.
    • Use CTEs (Common Table Expressions) for readability.
  3. Transformations:
    • Handle NULLs explicitly (COALESCE, IFNULL, fillna).
    • Validate data constraints (no negative prices, valid emails).
  4. Pipelines:
    • Ensure idempotency (running the script twice is safe).
    • Log rows processed/failed.

Examples

User Request: "Write a query to find the top 5 users by spend in 2023."

Agent Action/Output:

SELECT
    user_id,
    SUM(amount) AS total_spend
FROM orders
WHERE
    order_date >= '2023-01-01'
    AND order_date < '2024-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

User Request: "Design a schema for a book library."

Agent Action/Output:
"Entities: Books, Authors, Members, Loans.
Relationships:

  • Book belongs to Author (N:1)
  • Member borrows Book (N:M via Loans)
    Schema:"
CREATE TABLE authors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE books (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    author_id INT REFERENCES authors(id),
    isbn VARCHAR(13) UNIQUE
);