Within-7

skill-judge

"Evaluate Agent Skills against official specifications and 8-dimensional scoring framework. Use when: (1) Reviewing SKILL.md files for quality, (2) Auditing skill packages for compliance, (3) Improving existing skills with actionable feedback, (4) Validating skill design before deployment. Outputs structured evaluation report with dimension scores and improvement suggestions."

Within-7 1 Updated 4mo ago

Install

npx skillscat add within-7/minto-plugin-tools/skill-judge

Install via the SkillsCat registry.

SKILL.md

Skill Judge

Evaluate Agent Skills against official specifications and best practices using an 8-dimensional scoring framework (120 points total).

Decision Tree

Choose evaluation approach based on context:

| Evaluation Context | Primary Focus | When to Use |
|--------------------|---------------|-------------|
| Quick review (5-10 min) | Description + Knowledge Delta | Initial screening, triage |
| Full audit (20-30 min) | All 8 dimensions | Comprehensive quality assessment |
| Compliance check (5 min) | Frontmatter + Description | Format validation only |
| Improvement guidance (30+ min) | All dimensions + detailed feedback | Skill optimization |

Workflow

Step 1: Load Evaluation Framework

MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read `evaluation-guide.md` completely from start to finish. NEVER set any range limits when reading this file.

Step 2: Quick Scan (5 minutes)

Read SKILL.md completely and identify:

  • Skill type: Mindset (~50 lines), Navigation (~30 lines), Philosophy (~150 lines), Process (~200 lines), Tool (~300 lines)
  • Line count: Is it appropriate for the type?
  • Description quality: Does it have WHAT, WHEN, and keywords?
  • Knowledge delta: Any obvious "explaining basics" sections?
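The line-count check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the typical counts come from the list above, while the function names and the 50% tolerance are hypothetical choices, since the skill only asks whether the length is "appropriate for the type".

```python
# Typical line counts per skill type, taken from the Quick Scan list.
TYPICAL_LINES = {
    "Mindset": 50,
    "Navigation": 30,
    "Philosophy": 150,
    "Process": 200,
    "Tool": 300,
}


def line_count_ratio(skill_type: str, skill_md_text: str) -> float:
    """Return the actual line count divided by the typical count for the type."""
    actual = len(skill_md_text.splitlines())
    return actual / TYPICAL_LINES[skill_type]


def is_length_plausible(skill_type: str, skill_md_text: str,
                        tolerance: float = 0.5) -> bool:
    """Flag lengths far from the typical count.

    The tolerance is an assumed value for illustration; the skill itself
    does not define a numeric threshold.
    """
    ratio = line_count_ratio(skill_type, skill_md_text)
    return (1 - tolerance) <= ratio <= (1 + tolerance)
```

A failed check is only a prompt to look closer. The evaluator still has to judge whether the extra or missing lines carry real knowledge.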

Step 3: Dimension Evaluation (15-20 minutes)

Evaluate each dimension in order:

| Priority | Dimension | Points | Why This Order |
|----------|-----------|--------|----------------|
| 1 | D4: Specification Compliance (Description) | 15 | Poor description = skill never used |
| 2 | D1: Knowledge Delta | 20 | Core dimension - determines value |
| 3 | D7: Pattern Recognition | 10 | Sets expectations for structure |
| 4 | D5: Progressive Disclosure | 15 | Checks if references are used properly |
| 5 | D2: Mindset + Procedures | 15 | Evaluates thinking patterns |
| 6 | D3: Anti-Pattern Quality | 15 | Checks for NEVER lists |
| 7 | D6: Freedom Calibration | 15 | Matches freedom to task fragility |
| 8 | D8: Practical Usability | 15 | Can Agent actually use it? |

Step 4: Score Calculation

Sum all dimension scores (max 120 points), calculate the percentage (total / 120), and assign a grade from the total:

| Score Range | Grade | Interpretation |
|-------------|-------|----------------|
| 96-120 | A | Excellent - Production ready |
| 84-95 | B | Good - Minor improvements needed |
| 72-83 | C | Acceptable - Moderate improvements needed |
| 60-71 | D | Poor - Significant improvements needed |
| <60 | F | Fail - Major redesign required |
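The arithmetic in this step can be sketched as follows. The maximum points and grade bands are taken from the tables above. The function and variable names are hypothetical, and the sketch assumes whole-number dimension scores.

```python
# Maximum points per dimension, from the Step 3 table (sums to 120).
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}

# Lower bound of each grade band, highest first, from the Step 4 table.
GRADE_FLOORS = [(96, "A"), (84, "B"), (72, "C"), (60, "D")]


def letter_grade(total: int) -> str:
    """Map a total score (0-120) to a letter grade."""
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    return "F"


def summarize(scores: dict[str, int]) -> tuple[int, float, str]:
    """Return (total, percentage, grade) for one set of dimension scores."""
    for dim, max_points in MAX_POINTS.items():
        if not 0 <= scores[dim] <= max_points:
            raise ValueError(f"{dim} must be between 0 and {max_points}")
    total = sum(scores[dim] for dim in MAX_POINTS)
    percent = round(100 * total / sum(MAX_POINTS.values()), 1)
    return total, percent, letter_grade(total)
```

For example, scores of 16, 12, 10, 13, 12, 12, 8, and 12 for D1 through D8 total 95 points (79.2%). That is a B, because the A band starts at 96 points (80%).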

Step 5: Generate Report

MANDATORY - READ ENTIRE FILE: Before generating report, you MUST read `scoring-guide.md` completely.

Output a structured report in this format:

# Skill Evaluation Report

## Overview
- **Skill**: [skill-name]
- **Type**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Total Score**: [X]/120 ([X]%)
- **Grade**: [A/B/C/D/F]

## Dimension Scores

| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | [X] | 20 | [brief notes] |
| D2: Mindset + Procedures | [X] | 15 | [brief notes] |
| D3: Anti-Pattern Quality | [X] | 15 | [brief notes] |
| D4: Specification Compliance | [X] | 15 | [brief notes] |
| D5: Progressive Disclosure | [X] | 15 | [brief notes] |
| D6: Freedom Calibration | [X] | 15 | [brief notes] |
| D7: Pattern Recognition | [X] | 10 | [brief notes] |
| D8: Practical Usability | [X] | 15 | [brief notes] |

## Critical Issues (Must Fix)
1. [Issue 1]
2. [Issue 2]

## Improvement Suggestions (Should Fix)
1. [Suggestion 1]
2. [Suggestion 2]

## Strengths (Keep)
1. [Strength 1]
2. [Strength 2]
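
As an illustration only, the template above could be filled in programmatically as sketched here. The dimension labels and maximums come from the template. The function names, argument layout, and whole-number percentage rounding are assumptions, not part of the skill.

```python
# Dimension labels and maximum points, in the order used by the template.
DIMENSIONS = [
    ("D1: Knowledge Delta", 20),
    ("D2: Mindset + Procedures", 15),
    ("D3: Anti-Pattern Quality", 15),
    ("D4: Specification Compliance", 15),
    ("D5: Progressive Disclosure", 15),
    ("D6: Freedom Calibration", 15),
    ("D7: Pattern Recognition", 10),
    ("D8: Practical Usability", 15),
]


def numbered(items: list[str]) -> list[str]:
    """Format items as a 1-based numbered list."""
    return [f"{i}. {item}" for i, item in enumerate(items, start=1)]


def render_report(skill: str, skill_type: str, grade: str,
                  rows: list[tuple[int, str]], critical: list[str],
                  suggestions: list[str], strengths: list[str]) -> str:
    """Render the report; rows holds (score, notes) pairs in D1..D8 order."""
    total = sum(score for score, _ in rows)
    percent = round(100 * total / 120)
    lines = [
        "# Skill Evaluation Report", "",
        "## Overview",
        f"- **Skill**: {skill}",
        f"- **Type**: {skill_type}",
        f"- **Total Score**: {total}/120 ({percent}%)",
        f"- **Grade**: {grade}", "",
        "## Dimension Scores", "",
        "| Dimension | Score | Max | Notes |",
        "|-----------|-------|-----|-------|",
    ]
    for (label, max_points), (score, note) in zip(DIMENSIONS, rows):
        lines.append(f"| {label} | {score} | {max_points} | {note} |")
    sections = [
        ("Critical Issues (Must Fix)", critical),
        ("Improvement Suggestions (Should Fix)", suggestions),
        ("Strengths (Keep)", strengths),
    ]
    for heading, items in sections:
        lines += ["", f"## {heading}", *numbered(items)]
    return "\n".join(lines)
```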

NEVER Do When Evaluating

Scoring Mistakes

  • NEVER give high scores for "professional formatting" alone - content matters most
  • NEVER ignore token waste - every redundant paragraph = deduction
  • NEVER let length impress you - a 43-line skill can outperform a 500-line skill
  • NEVER assume all procedures are valuable - distinguish domain-specific from generic

Evaluation Mistakes

  • NEVER skip mentally testing decision trees - do they lead to correct choices?
  • NEVER forgive explaining basics with "but it provides helpful context"
  • NEVER overlook missing anti-patterns - no NEVER list = significant gap
  • NEVER undervalue the description field - poor description = skill never used

Reporting Mistakes

  • NEVER give vague feedback like "improve quality" - be specific
  • NEVER suggest changes without explaining WHY
  • NEVER provide scores without actionable improvement suggestions

Quick Reference

Knowledge Delta Red Flags (D1)

  • "What is [basic concept]" sections
  • Step-by-step tutorials for standard operations
  • Explaining how to use common libraries
  • Generic best practices ("write clean code")
  • Definitions of industry-standard terms

Knowledge Delta Green Flags (D1)

  • Decision trees for non-obvious choices
  • Trade-offs only experts know
  • Edge cases from real-world experience
  • "NEVER do X because [non-obvious reason]"
  • Domain-specific thinking frameworks

Anti-Pattern Quality (D3)

  • Score 0-3: No anti-patterns mentioned
  • Score 4-7: Generic warnings ("avoid errors")
  • Score 8-11: Specific NEVER list with some reasoning
  • Score 12-15: Expert-grade anti-patterns with WHY
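
The D3 bands above amount to a simple lookup, sketched here. The band boundaries and descriptions are from the list. The names `D3_BANDS` and `describe_d3` are hypothetical.

```python
# Score bands for D3 (Anti-Pattern Quality), as (low, high, description).
D3_BANDS = [
    (0, 3, "No anti-patterns mentioned"),
    (4, 7, 'Generic warnings ("avoid errors")'),
    (8, 11, "Specific NEVER list with some reasoning"),
    (12, 15, "Expert-grade anti-patterns with WHY"),
]


def describe_d3(score: int) -> str:
    """Return the band description for a D3 score between 0 and 15."""
    for low, high, description in D3_BANDS:
        if low <= score <= high:
            return description
    raise ValueError("D3 score must be between 0 and 15")
```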

Description Quality (D4)

  • Must answer: WHAT (functionality), WHEN (trigger scenarios), KEYWORDS (searchable terms)
  • Poor: "Handles document-related functions" (vague, no triggers, no keywords)
  • Excellent: "Comprehensive document creation, editing, and analysis. Use when Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying content, (3) Working with tracked changes"
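
A rough pre-screen for the WHAT and WHEN parts might look like the sketch below. It is a heuristic under stated assumptions, not the skill's scoring method. It assumes triggers are introduced with the phrase "Use when" (as in the excellent example), treats numbered markers like "(1)" as enumerated scenarios, and uses a hypothetical three-word minimum for the WHAT part. Keyword quality still needs human or agent judgment.

```python
import re


def description_signals(description: str) -> dict[str, bool]:
    """Return coarse signals about a skill description (heuristic only)."""
    text = description.strip()
    when_match = re.search(r"\buse when\b", text, flags=re.IGNORECASE)
    # Treat everything before the trigger clause as the WHAT statement.
    what_part = text[: when_match.start()] if when_match else text
    return {
        # Assumed threshold: at least three words describing functionality.
        "has_what": len(what_part.split()) >= 3,
        "has_when": when_match is not None,
        "has_enumerated_scenarios": len(re.findall(r"\(\d+\)", text)) >= 2,
    }
```

Applied to the two examples above, the excellent description passes all three checks. The poor one has no trigger clause and no enumerated scenarios. Passing these checks does not make a description good, so they are useful mainly for catching a missing WHEN clause.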

Pattern Recognition (D7)

| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct sub-scenarios |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on specific formats |

Freedom Calibration (D6)

  • High freedom: Creative/Design tasks (frontend-design)
  • Medium freedom: Code review, judgment-based tasks
  • Low freedom: File format operations (docx, pdf, xlsx)

Output Format

Always output the evaluation report in the structured format shown in Step 5. Include:

  • Overview with total score and grade
  • Dimension scores table with notes
  • Critical issues (must fix)
  • Improvement suggestions (should fix)
  • Strengths (keep)

Do NOT output:

  • Unstructured feedback
  • Scores without explanations
  • Generic comments without specific examples