code-review

"Quality review for R and Python research scripts. 11-category scorecard covering reproducibility, structure, domain correctness, cross-language verification, and more. Report-only — never edits source files."

mseok 5 1 Updated 4mo ago

Install

npx skillscat add mseok/dot/code-review

Install via the SkillsCat registry.

SKILL.md

Research Code Review

Report-only skill. Never edit source files — produce CODE-REVIEW-REPORT.md only.

When to Use

Before submitting a paper (check replication package quality)
After writing analysis scripts and before sharing with coauthors
When taking over someone else's research code
As part of the Referee 2 agent's formal audit pipeline

When NOT to Use

Understanding old code — use $code-archaeology first to map out what exists
Formal verification — use the Referee 2 agent for cross-language replication
General software projects — this is for research scripts, not applications

Workflow

1. Locate scripts: Find all `.R`, `.py`, `.do`, `.jl` files in the project
2. Read each script carefully
3. Score each category (Pass / Fail / N/A)
4. Produce report: Write `CODE-REVIEW-REPORT.md` in the project directory
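
The "locate scripts" step can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `locate_scripts` and the `SKIP_DIRS` list are assumptions, and only the four extensions come from the workflow above.

```python
from pathlib import Path

# Extensions named in the workflow step above (compared in lowercase).
SCRIPT_EXTENSIONS = {".r", ".py", ".do", ".jl"}

# Hypothetical skip list: folders that usually hold environments, not research code.
SKIP_DIRS = {".git", ".venv", "renv", "node_modules", "__pycache__"}


def locate_scripts(project_root: str) -> list[Path]:
    """Return research scripts under project_root as sorted relative paths."""
    root = Path(project_root)
    found = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        # Lowercase the suffix so both `.R` and `.r` match.
        if path.suffix.lower() in SCRIPT_EXTENSIONS:
            found.append(rel)
    return sorted(found)


if __name__ == "__main__":
    for script in locate_scripts("."):
        print(script)
```

Returning relative paths keeps the later `file:line` references in the report independent of where the project is checked out.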

11 Review Categories

1. Reproducibility

| Check | Pass Criteria |
|-------|---------------|
| Random seeds | `set.seed()` / `random.seed()` / `np.random.seed()` set before any stochastic operation |
| Relative paths | No hardcoded absolute paths (e.g., `/Users/username/...` or `C:\...`) |
| Working directory | Script does not call `setwd()` / `os.chdir()`; it uses project-relative paths |
| Session info | Script prints session info at end (`sessionInfo()` / `sys.version`) or documents environment |
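
As a rough aid for the first three checks, a reviewer could scan each script's text with a few regular expressions. The sketch below is heuristic and its patterns are assumptions chosen for illustration (for example, it only recognises absolute paths that start with `/Users/`, `/home/`, or a drive letter); it does not replace reading the script.

```python
import re

# Heuristic patterns; assumptions for illustration, not an exhaustive rule set.
ABSOLUTE_PATH = re.compile(r"""["'](?:/(?:Users|home)/|[A-Za-z]:[\\/])""")
CHANGE_DIR = re.compile(r"\b(?:setwd|os\.chdir)\s*\(")
SEED_CALL = re.compile(r"\b(?:set\.seed|random\.seed|np\.random\.seed)\s*\(")


def check_reproducibility(source: str) -> list[str]:
    """Return findings for one script's text, with line numbers where available."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Drop trailing comments. Naive: this also cuts at a '#' inside a string.
        code = line.split("#", 1)[0]
        if ABSOLUTE_PATH.search(code):
            findings.append(f"line {number}: hardcoded absolute path")
        if CHANGE_DIR.search(code):
            findings.append(f"line {number}: changes the working directory")
    # Only a prompt for the reviewer: a script with no stochastic step needs no seed.
    if not SEED_CALL.search(source):
        findings.append("no random seed call found")
    return findings
```

The findings are candidates only. The report stays the reviewer's judgment, and the source files are never modified.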

2. Script Structure

| Check | Pass Criteria |
|-------|---------------|
| Header | Script begins with comment block: purpose, author, date, inputs, outputs |
| Sections | Code organised into labelled sections (comments or `# ---- Section ----`) |
| Imports at top | All `library()` / `import` statements at the top of the file |
| Reasonable length | Single script < 500 lines; longer scripts should be split |

3. Output Hygiene

| Check | Pass Criteria |
|-------|---------------|
| No print pollution | No stray `print()` / `cat()` / `message()` dumping to console |
| Outputs saved | Key results saved to files, not just printed |
| Clean console | Running the script does not produce walls of text |
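
One pattern that would pass these checks in a Python script is to write key results to a file and emit a single log line instead of printing them. The names below (`save_results`, the `analysis` logger, the JSON format) are assumptions for illustration, not conventions required by this skill.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("analysis")  # hypothetical logger name


def save_results(results: dict[str, float], output_path: Path) -> Path:
    """Write key results to a JSON file and log one short confirmation line."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(results, indent=2, sort_keys=True))
    # One log record instead of dumping the results themselves to the console.
    logger.info("Saved %d results to %s", len(results), output_path)
    return output_path
```

Because the message goes through `logging`, the person running the script decides how much reaches the console.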

4. Function Quality

| Check | Pass Criteria |
|-------|---------------|
| Documentation | Functions have comments explaining purpose, inputs, outputs |
| Naming | Function names are descriptive verbs (`estimate_ate`, not `f1`) |
| Defaults | Reasonable defaults for optional parameters |
| No side effects | Functions don't modify global state |

5. Domain Correctness

| Check | Pass Criteria |
|-------|---------------|
| Estimator matches paper | The estimator used matches what the paper claims |
| Weights | If weighted: weights sum to expected value, correct application |
| Standard errors | Clustering / HC / bootstrap matches paper specification |
| Sample restrictions | Filters match the paper's sample description |
| Variable construction | Variables constructed as described in the paper |

6. Figure Quality

| Check | Pass Criteria |
|-------|---------------|
| Dimensions specified | Figure size set explicitly (not default) |
| Transparency/resolution | Appropriate for publication (300+ DPI for raster, vector preferred) |
| Saved to file | Figures saved with `ggsave()` / `plt.savefig()`, not just displayed |
| Labels | Axes labelled, legend present where needed, title informative |
| Colour | Colourblind-friendly palette; not relying on red/green distinction |

7. Data Persistence

| Check | Pass Criteria |
|-------|---------------|
| Intermediate objects saved | Expensive computations saved (`saveRDS()` / `pickle.dump()` / `.parquet`) |
| Load before recompute | Script checks for saved objects before rerunning expensive operations |
| Output format | Final outputs in portable format (CSV or parquet, not just `.RData`) |
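
The "load before recompute" check describes a simple caching pattern. A minimal Python sketch, assuming pickle is an acceptable intermediate format and using a hypothetical helper name `load_or_compute`:

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_compute(cache_path: Path, compute: Callable[[], T]) -> T:
    """Load a pickled result if it exists; otherwise compute it and save it."""
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

This sketch never invalidates the cache, so a stale file has to be deleted by hand when inputs or code change. Pickle suits intermediate objects only; final outputs still belong in a portable format.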

8. Dependencies

| Check | Pass Criteria |
|-------|---------------|
| Declared at top | All `library()` / `import` at the start of the script |
| Versions documented | `renv.lock` / `requirements.txt` / `pyproject.toml` exists |
| No unnecessary packages | Each loaded package is actually used |
| Installation instructions | README or comment explains how to set up the environment |

9. Python-Specific

Score N/A if no Python files.

| Check | Pass Criteria |
|-------|---------------|
| Type hints | Functions have type annotations for parameters and return values |
| Docstrings | Functions have docstrings (not just comments) |
| uv usage | Uses `uv` for environment management (per project conventions) |
| f-strings | Uses f-strings, not `.format()` or `%` formatting |

10. R-Specific

Score N/A if no R files.

| Check | Pass Criteria |
|-------|---------------|
| tidyverse consistency | Doesn't mix base R and tidyverse for the same operation |
| Assignment operator | Uses `<-` not `=` for assignment |
| Boolean values | Uses `TRUE`/`FALSE`, not `T`/`F` |
| Pipe consistency | Uses one pipe style consistently (`%>%` or the native pipe) |

11. Cross-Language Verification

Score N/A if the project has no numerical results or only uses one language.

| Check | Pass Criteria |
|-------|---------------|
| Replication directory | `code/replication/` (or equivalent) exists with cross-language scripts |
| Two-language coverage | Key numerical results reproduced in a second language (e.g., R results verified in Python or vice versa) |
| Result comparison | Scripts compare outputs and report discrepancies (tolerance-based, not exact match) |
| Precision threshold | Numerical outputs compared to 6+ decimal places; discrepancies that already show up at fewer decimal places indicate real bugs |
| Documentation | README or comments explain what is being replicated and acceptable tolerance |

Why Cross-Language Replication Works

AI-assisted code tends to show different hallucination patterns in different languages. An error in a Python implementation is unlikely to appear identically in an independent R implementation (or vice versa), so a discrepancy between the two sets of results points to a bug in at least one of them. This is the core insight from Scott Cunningham's Referee 2 protocol.

How to Set Up

1. Create `code/replication/` with scripts that independently implement key numerical results in a second language
2. Write a comparison script that loads outputs from both languages and reports discrepancies at 6+ decimal places
3. Document what is being replicated, which results are covered, and the acceptable tolerance (e.g., 1e-6 for coefficients, 1e-4 for standard errors)
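
The comparison script from the second step could look roughly like the sketch below. It assumes each language has already exported its results as a mapping from result name to number (for instance via CSV or JSON), and that standard errors are identified by a `_se` suffix. That naming convention and the function name `compare_results` are assumptions; the tolerances are the example values from the third step.

```python
import math

# Example tolerances from the setup steps above.
COEF_TOL = 1e-6
SE_TOL = 1e-4


def compare_results(r_values: dict[str, float], py_values: dict[str, float]) -> list[str]:
    """Return one message per result that is missing or differs beyond tolerance."""
    problems = []
    for key in sorted(set(r_values) | set(py_values)):
        if key not in r_values or key not in py_values:
            problems.append(f"{key}: present in only one language")
            continue
        # Assumed convention: keys ending in "_se" are standard errors.
        tol = SE_TOL if key.endswith("_se") else COEF_TOL
        if not math.isclose(r_values[key], py_values[key], rel_tol=0.0, abs_tol=tol):
            diff = abs(r_values[key] - py_values[key])
            problems.append(f"{key}: |diff| = {diff:.2e} exceeds {tol:.0e}")
    return problems
```

An empty list means every shared result agrees within tolerance. Absolute tolerance is used here because the documented thresholds are stated in decimal places; a relative tolerance may suit quantities on very different scales.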

Scorecard

| # | Category | Result |
|---|----------|--------|
| 1 | Reproducibility | Pass/Fail |
| 2 | Script structure | Pass/Fail |
| 3 | Output hygiene | Pass/Fail |
| 4 | Function quality | Pass/Fail |
| 5 | Domain correctness | Pass/Fail |
| 6 | Figure quality | Pass/Fail |
| 7 | Data persistence | Pass/Fail |
| 8 | Dependencies | Pass/Fail |
| 9 | Python-specific | Pass/Fail/N/A |
| 10 | R-specific | Pass/Fail/N/A |
| 11 | Cross-language verification | Pass/Fail/N/A |

Overall: X/11 Pass (adjust denominator for N/A categories)
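
The denominator adjustment can be made explicit with a short helper. This is only a sketch of the arithmetic; the function name `overall_score` and the dictionary input are assumptions.

```python
def overall_score(results: dict[str, str]) -> str:
    """Summarise a scorecard as 'Overall: X/N Pass', excluding N/A categories."""
    allowed = {"Pass", "Fail", "N/A"}
    unknown = {value for value in results.values() if value not in allowed}
    if unknown:
        raise ValueError(f"unexpected result values: {sorted(unknown)}")
    # N/A categories are dropped from both numerator and denominator.
    scored = [value for value in results.values() if value != "N/A"]
    passed = sum(value == "Pass" for value in scored)
    return f"Overall: {passed}/{len(scored)} Pass"
```

For a Python-only project where categories 10 and 11 are N/A and two of the remaining nine fail, this gives `Overall: 7/9 Pass`.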

Quality Scoring

Apply numeric quality scoring using the shared framework and skill-specific rubric:

Framework: `../shared/quality-scoring.md` — severity tiers, thresholds, verdict rules
Rubric: `references/quality-rubric.md` — issue-to-deduction mappings for this skill

Start at 100, deduct per issue found, apply verdict. Insert the Score Block into the report after the scorecard.
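
The arithmetic of "start at 100, deduct per issue" and the rows of the Deductions table in the report format can be sketched as follows. The actual tier names, point values, and verdict thresholds live in the framework and rubric files referenced above and are not reproduced here; the `Deduction` class and `score_block_rows` function are hypothetical, and every value passed in is the reviewer's own input.

```python
from dataclasses import dataclass


@dataclass
class Deduction:
    issue: str
    tier: str      # tier name as defined by the shared framework (not defined here)
    points: int    # positive number of points to subtract, taken from the rubric
    category: str


def score_block_rows(deductions: list[Deduction], start: int = 100) -> tuple[int, list[str]]:
    """Return the final score and the Markdown rows for the Deductions table."""
    rows = [
        f"| {i} | {d.issue} | {d.tier} | -{d.points} | {d.category} |"
        for i, d in enumerate(deductions, start=1)
    ]
    total = sum(d.points for d in deductions)
    rows.append(f"| | **Total deductions** | | **-{total}** | |")
    return start - total, rows
```

The verdict is deliberately left out: it should be read off the thresholds in the shared framework once the score is known.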

Report Format

# Code Review Report

**Project:** [path]
**Date:** YYYY-MM-DD
**Scripts reviewed:** [list]
**Languages:** R / Python / Both

## Scorecard

[Table above, filled in]

## Detailed Findings

### Category 1: Reproducibility
**Result: Pass/Fail**

[Specific findings with file:line references]

### Category 2: Script Structure
...

[Continue for all 11 categories]

## Priority Fixes

1. [Most important issue — what to fix first]
2. [Second most important]
3. [Third]

## Quality Score

| Metric | Value |
|--------|-------|
| **Score** | XX / 100 |
| **Verdict** | Ship / Ship with notes / Revise / Revise (major) / Blocked |

### Deductions

| # | Issue | Tier | Deduction | Category |
|---|-------|------|-----------|----------|
| 1 | [description] | [tier] | -X | [category] |
| | **Total deductions** | | **-XX** | |

## Positive Observations

[Things done well — important for morale and learning]

Cross-References

$code-archaeology — For understanding unfamiliar code before reviewing it
Referee 2 agent — For formal cross-language replication and verification (Category 11 flags the absence; Referee 2 does the actual replication)
$proofread — For the paper that accompanies this code
