Write and refactor modular, readable Python code for deep learning research (PyTorch/JAX), following Python philosophy (PEP 8/20) and Karpathy-style coding guidelines (think before coding, simplicity first, surgical changes, goal-driven execution). Use for structuring training/eval code, splitting monolithic scripts into modules, designing clean APIs/configs, improving maintainability, and making experiments reproducible and HPC-friendly.
Install
Install via the SkillsCat registry:
npx skillscat add mseok/dot/modular-python-deep-learning
SKILL.md
Modular Python for Deep Learning
Follow this workflow whenever writing or refactoring deep learning research code.
0) Adopt the operating principles
- Read and apply references/karpathy_guidelines.md before making changes.
- Default to Pythonic code: explicit, readable, minimal magic. Read references/python_philosophy.md when unsure.
1) Clarify the contract (before coding)
- Restate the goal in 1–2 sentences.
- Specify the I/O contract in concrete terms:
- Data source and preprocessing assumptions
- Tensor shapes, dtypes, units, and device placement
- Metrics and success criteria
- Performance constraints (memory, speed, batch size)
- Identify “must not change” interfaces (CLI args, checkpoint format, metrics names, log schema).
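One way to make the I/O contract concrete is to write it down as data and check it at module boundaries. The following is a minimal, framework-agnostic sketch, not a prescribed API: `TensorSpec`, `check_tensor`, and the `FakeTensor` stand-in are hypothetical names, and the only assumption about the array type is that it exposes `.shape` and `.dtype`, as PyTorch tensors and JAX arrays do.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Expected shape and dtype of one tensor; None means "any size" on that axis."""

    name: str
    shape: Tuple[Optional[int], ...]
    dtype: str  # bare dtype name, e.g. "float32"


def check_tensor(x: Any, spec: TensorSpec) -> None:
    """Raise ValueError with an informative message if `x` violates `spec`."""
    shape = tuple(x.shape)
    if len(shape) != len(spec.shape):
        raise ValueError(
            f"{spec.name}: expected rank {len(spec.shape)}, got shape {shape}"
        )
    for axis, (got, want) in enumerate(zip(shape, spec.shape)):
        if want is not None and got != want:
            raise ValueError(f"{spec.name}: axis {axis} expected {want}, got {got}")
    # str(dtype) may be "torch.int64" or "int64"; compare the bare name only.
    dtype_name = str(x.dtype).split(".")[-1]
    if dtype_name != spec.dtype:
        raise ValueError(f"{spec.name}: expected dtype {spec.dtype}, got {dtype_name}")


@dataclass
class FakeTensor:
    """Stand-in for a real tensor so the sketch runs without PyTorch or JAX."""

    shape: Tuple[int, ...]
    dtype: str


# Hypothetical contract: token ids of shape (batch, 128), stored as int64.
INPUT_IDS = TensorSpec(name="input_ids", shape=(None, 128), dtype="int64")
check_tensor(FakeTensor(shape=(4, 128), dtype="torch.int64"), INPUT_IDS)  # passes
```

Keeping the spec next to the function that consumes the tensor lets a contract violation fail at the boundary rather than deep inside the model.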
2) Choose module boundaries that match the contract
- Separate pure compute from I/O and side effects.
- Keep imports side-effect free (no hidden global initialization at import time).
- Prefer a small number of obvious modules over a large web of micro-files.
- Use references/dl_modular_layout.md as the default decomposition template.
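The split between pure compute and side effects can be sketched as below. This is an illustrative example under assumptions: the function names, the metric, and the `metrics.json` file name are hypothetical and are not taken from references/dl_modular_layout.md.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Pure compute: no file access, no logging, no global state."""
    if len(preds) != len(labels):
        raise ValueError(f"length mismatch: {len(preds)} preds vs {len(labels)} labels")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty batch")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def write_metrics(metrics: Dict[str, float], path: Path) -> None:
    """I/O boundary: the only function here that touches the filesystem."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))


def main(out_dir: Path) -> None:
    """Entrypoint that wires the pure function to the I/O function."""
    metrics = {"accuracy": accuracy([1, 0, 1, 1], [1, 0, 0, 1])}
    write_metrics(metrics, out_dir / "metrics.json")


# Importing this module does nothing; work only starts when it is run as a script.
if __name__ == "__main__":
    main(Path(tempfile.mkdtemp()))  # temp dir keeps the demo out of the repo
```

Because `accuracy` has no side effects, it can be unit-tested without a filesystem, while `write_metrics` stays small enough to review by eye.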
3) Design small, testable APIs
- Prefer functions over classes until state is clearly necessary.
- Pass dependencies explicitly (model, tokenizer, config, device, RNG); avoid global singletons.
- Use @dataclass(frozen=True) configs for immutable experiment settings.
- Add type hints and shape comments/docstrings at module boundaries.
- Make failure modes explicit (raise informative exceptions early).
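The points above can be sketched together in a few lines. This is a minimal illustration, assuming hypothetical field names (`lr`, `batch_size`, `max_steps`, `seed`) and a hypothetical helper `sample_batch_indices`; it uses the standard library RNG so it runs without a deep learning framework.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class TrainConfig:
    """Immutable experiment settings; field names are illustrative."""

    lr: float = 3e-4
    batch_size: int = 32
    max_steps: int = 1000
    seed: int = 0

    def __post_init__(self) -> None:
        # Fail early with a message that names the offending field.
        if self.lr <= 0:
            raise ValueError(f"lr must be positive, got {self.lr}")
        if self.batch_size < 1:
            raise ValueError(f"batch_size must be >= 1, got {self.batch_size}")


def sample_batch_indices(
    n_examples: int, cfg: TrainConfig, rng: random.Random
) -> List[int]:
    """Pick one batch of example indices.

    The config and RNG are passed in explicitly; nothing is read from globals.
    Returns a list of length cfg.batch_size with values in [0, n_examples).
    """
    if n_examples < cfg.batch_size:
        raise ValueError(
            f"need at least {cfg.batch_size} examples, got {n_examples}"
        )
    return rng.sample(range(n_examples), cfg.batch_size)


cfg = TrainConfig(batch_size=4)
indices = sample_batch_indices(10, cfg, random.Random(cfg.seed))
sweep_cfg = replace(cfg, lr=1e-3)  # derive a variant instead of mutating in place
```

A frozen config cannot drift mid-run, and `dataclasses.replace` re-runs `__post_init__`, so derived configs are validated as well.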
4) Implement with “simplicity first”
- Make the smallest change that solves the goal.
- Avoid framework upgrades or architectural rewrites unless required.
- Do not create a “mega-utils.py” dumping ground; create domain-named modules instead.
- Prefer standard library building blocks (pathlib, logging, argparse) unless the user asked for a specific stack.
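As a rough sketch of how far the standard library goes, the entrypoint below uses only `argparse`, `logging`, and `pathlib`. The flag names and defaults are hypothetical and chosen for illustration only.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; flags and defaults here are illustrative."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative CLI).")
    parser.add_argument("--out-dir", type=Path, default=Path("runs/debug"))
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--max-steps", type=int, default=100)
    parser.add_argument(
        "--log-level", default="INFO", choices=["DEBUG", "INFO", "WARNING"]
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse arguments and configure logging; argv=None falls back to sys.argv."""
    args = build_parser().parse_args(argv)
    # Logging is configured here, not at import time, so imports stay side-effect free.
    logging.basicConfig(
        level=args.log_level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger.info("writing artifacts to %s", args.out_dir)
    return args


args = main(["--lr", "0.01", "--out-dir", "runs/exp1"])
```

Accepting `argv` as a parameter makes the entrypoint callable from tests; a script would call `main()` from an `if __name__ == "__main__":` guard.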
5) Add verification hooks (without pretending to run them)
- Add or update a CPU smoke test path (tiny batch, 1–2 steps) for shape + loss sanity.
- Add narrow unit tests for pure functions (tokenization, featurization, losses, metrics).
- Provide exact run commands and what to check in outputs.
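A smoke test path along these lines might look like the sketch below. It is framework-agnostic by assumption: `smoke_test` is a hypothetical helper that takes any step function returning a scalar loss, and `make_toy_step` is a pure-Python stand-in (one-weight linear regression with SGD) for a real PyTorch or JAX training step.

```python
import math
from typing import Callable, Iterable, List, Tuple

Batch = List[Tuple[float, float]]  # toy batch: (x, y) pairs


def smoke_test(
    step_fn: Callable[[Batch], float], batches: Iterable[Batch], max_steps: int = 2
) -> List[float]:
    """Run at most `max_steps` training steps and fail fast on a non-finite loss."""
    losses: List[float] = []
    for step, batch in enumerate(batches):
        if step >= max_steps:
            break
        loss = float(step_fn(batch))
        if not math.isfinite(loss):
            raise FloatingPointError(f"non-finite loss {loss} at step {step}")
        losses.append(loss)
    if not losses:
        raise ValueError("no batches were provided")
    return losses


def make_toy_step(lr: float = 0.1) -> Callable[[Batch], float]:
    """Stand-in for a real train step: fits y = w * x with one SGD update per call."""
    w = [0.0]  # single weight, held in a list so the closure can update it

    def step(batch: Batch) -> float:
        n = len(batch)
        loss = sum((w[0] * x - y) ** 2 for x, y in batch) / n  # mean squared error
        grad = sum(2 * (w[0] * x - y) * x for x, y in batch) / n
        w[0] -= lr * grad
        return loss

    return step


tiny_batch: Batch = [(1.0, 2.0), (2.0, 4.0)]
losses = smoke_test(make_toy_step(), [tiny_batch] * 3, max_steps=2)
assert losses[1] < losses[0], "loss should drop when refitting the same tiny batch"
```

In a real project the step function would wrap the model's forward, backward, and optimizer update on CPU, and the same two checks (finite loss, loss decreasing on a repeated batch) would apply.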
If the user mentions remote HPC/Slurm:
- Provide a Slurm script snippet and the expected artifacts (checkpoints, logs, metrics files).
- Provide a short checklist for what to verify on the cluster (first-batch time, GPU util, NaNs, determinism).
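Part of the cluster checklist can be automated with a small post-run check that the expected artifacts exist. The sketch below is one possible approach; the artifact layout in `EXPECTED_ARTIFACTS` and the helper name `missing_artifacts` are hypothetical and should be adapted to whatever the job actually writes.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical artifact layout; adjust the patterns to the project's real outputs.
EXPECTED_ARTIFACTS = ("checkpoints/*.pt", "logs/*.log", "metrics.json")


def missing_artifacts(
    run_dir: Path, patterns: Iterable[str] = EXPECTED_ARTIFACTS
) -> List[str]:
    """Return the glob patterns that match no non-empty file under `run_dir`."""
    missing: List[str] = []
    for pattern in patterns:
        matches = [
            p for p in run_dir.glob(pattern) if p.is_file() and p.stat().st_size > 0
        ]
        if not matches:
            missing.append(pattern)
    return missing
```

Calling `missing_artifacts(Path("runs/exp1"))` after the job finishes returns an empty list when every pattern is satisfied; empty files count as missing, since a zero-byte checkpoint or log often signals a job that died early.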
6) Produce a clean handoff
- Summarize module responsibilities and public entrypoints.
- Document any new CLI flags/config fields and defaults.
- Call out any intentional behavior changes and migration notes (checkpoint compatibility, metric name changes).