This skill should be used when the user asks to "improve a skill", "optimize a skill", "review a skill", "audit a skill", "apply SkillsBench findings", "make a skill more effective", "refactor a skill", "fix a skill", or mentions improving, optimizing, or auditing an existing Claude Code skill based on research-backed best practices.
Resources
1Install
npx skillscat add thomasrohde/marketplace/improve-skill Install via the SkillsCat registry.
Improve Skill
Analyze and improve an existing Claude Code skill using findings from the SkillsBench research paper (arXiv:2602.12670v1), which evaluated 7,308 agent trajectories and identified what makes skills effective.
Prerequisites
- Path to an existing skill directory containing a SKILL.md file
- If no path is provided, ask the user which skill to improve
Improvement Workflow
Step 1: Load and Audit the Skill
Read the target skill's SKILL.md and all files in its references/, examples/, and scripts/ directories. Produce a structured audit against the five evaluation dimensions below.
Step 2: Score the Five Dimensions
Rate each dimension 1-5 and note specific issues:
Dimension 1 — Procedural Density
Measures whether content teaches how to do things versus what things are.
- 5: Nearly all content is step-by-step procedures with decision points
- 3: Mix of procedural and reference/factual content
- 1: Reads like API documentation or a reference manual
Count the ratio of procedural sentences ("To X, do Y", "Run Z", "Configure W by...") to factual/descriptive sentences ("X is a...", "The Y property contains...", "There are three types of..."). Target: >70% procedural in SKILL.md body.
Dimension 2 — Conciseness
Measures whether SKILL.md body stays in the optimal token range.
- 5: Body is 845-1,165 tokens (~1,200-1,800 words), detailed content in references/
- 3: Body is 1,500-2,500 tokens, some content could move to references/
- 1: Body exceeds 2,500 tokens, comprehensive documentation hurting effectiveness
Estimate token count of the SKILL.md body (excluding frontmatter). Comprehensive skills (-2.9pp) perform worse than no skill at all. Detailed (~1,165 tokens, +18.8pp) and compact (~845 tokens, +17.1pp) are optimal.
Dimension 3 — Working Examples
Measures whether the skill provides concrete, copy-pasteable code for its procedures.
- 5: Every major procedure has a complete working example
- 3: Some procedures have examples, others are abstract
- 1: No working examples, or examples are fragments/pseudocode
The research found "stepwise guidance with at least one working example" is essential. Code templates and reference implementations matter more than documentation volume.
Dimension 4 — Signal-to-Noise Ratio
Measures whether content focuses on non-obvious, domain-specific knowledge versus standard practices models already know.
- 5: All content addresses specialized knowledge or non-obvious procedures
- 3: Mix of specialized and commonly-known content
- 1: Mostly standard practices (basic patterns, common conventions)
Software engineering skills showed only +4.5pp improvement in SkillsBench because models already know standard programming. Content teaching common practices adds noise without value.
Dimension 5 — Trigger Quality
Measures whether the frontmatter description reliably activates the skill for intended use cases.
- 5: Third-person, 5+ specific trigger phrases covering all use cases
- 3: Some trigger phrases but missing scenarios
- 1: Vague description, wrong person, or missing trigger phrases
Step 3: Identify Improvements
Based on the audit scores, identify specific improvements. Prioritize dimensions scoring 1-3. Apply these research-backed principles:
Principle A — Extract to References (Conciseness)
Move factual content, detailed schemas, comprehensive lists, and API documentation from SKILL.md to references/ files. Keep only procedural steps and essential context in SKILL.md.
Before: One large SKILL.md with everything.
After: Lean SKILL.md pointing to references/detailed-guide.md, references/api-reference.md.
Principle B — Rewrite Factual as Procedural (Procedural Density)
Transform "X is a Y that does Z" into "To accomplish Z, use X by doing..."
Before: "The $() function returns a collection object with filter, add, and each methods."
After: "To query model elements, call $('element-type'). Chain .filter() to narrow results and .each() to iterate."
Principle C — Add Working Examples (Working Examples)
For each major procedure that lacks a concrete example, add a complete, copy-pasteable code block showing the procedure in context.
Principle D — Remove Common Knowledge (Signal-to-Noise)
Delete content teaching standard practices. Ask: "Would a competent developer using Claude already know this without the skill?" If yes, remove it.
Principle E — Strengthen Triggers (Trigger Quality)
Add specific phrases users would say. Use third person. Cover edge-case phrasings.
Principle F — Eliminate Conflicting Guidance
When multiple approaches exist for the same task, designate one as the recommended default. Mention alternatives only with explicit decision criteria for when to deviate.
Principle G — Verify No Negative Value Content
The research found 16/84 tasks where skills hurt performance. Review for content that:
- Contradicts correct model pretraining knowledge
- Over-specifies solutions, preventing flexible problem-solving
- Adds complexity without procedural value
Remove or restructure such content.
Step 4: Present Findings and Propose Changes
Present the audit results to the user as a summary table:
| Dimension | Score | Key Issues |
|---------------------|-------|-------------------------------|
| Procedural Density | X/5 | ... |
| Conciseness | X/5 | ... |
| Working Examples | X/5 | ... |
| Signal-to-Noise | X/5 | ... |
| Trigger Quality | X/5 | ... |Then list proposed changes grouped by principle (A-G), showing what will change and why. Wait for user approval before making changes.
Step 5: Apply Approved Changes
After user approval, apply the changes:
- Edit SKILL.md — restructure body, improve frontmatter
- Create or update
references/files for extracted content - Add working examples where needed
- Remove low-value content
- Verify all referenced files exist
Step 6: Post-Improvement Validation
After applying changes, verify:
- SKILL.md body is under 1,200 tokens (target: 845-1,165)
- Frontmatter uses third person with 5+ trigger phrases
- Body uses imperative/infinitive form throughout
- Every major procedure has a working example
- No factual-only sections remain in SKILL.md (moved to references/)
- All
references/files mentioned in SKILL.md actually exist - No content teaches standard practices models already know
- No conflicting guidance without decision criteria
Key Research Numbers
Quick reference for the most actionable SkillsBench findings:
| Finding | Value | Implication |
|---|---|---|
| Detailed skills | +18.8pp | Step-by-step with examples wins |
| Compact skills | +17.1pp | Focused essentials also effective |
| Comprehensive skills | -2.9pp | Exhaustive docs hurt performance |
| 2-3 skills per task | +18.6pp | Focused scope is optimal |
| 4+ skills per task | +5.9pp | Cognitive overhead reduces gains |
| Self-generated skills | -1.3pp | Human authoring essential |
Reference Files
For detailed research findings, anti-patterns, and evaluation criteria:
references/skillsbench-findings.md— Complete SkillsBench research summary including anti-patterns, evaluation dimensions, and leakage audit criteria