Use this skill whenever code behavior must be reasoned about without executing it — especially when code is provided and needs careful tracing. Invoke it to: trace what a function or program returns step by step, identify which exact operation causes a test failure or wrong value, determine whether two implementations are behaviorally equivalent, simulate an algorithm's execution order manually, or audit for security vulnerabilities like SQL injection or race conditions. Use any time someone asks "what does this code return?", "which step caused the failure?", "are these equivalent?", "what's the execution order?", or "is this vulnerable?" — even when the answer seems obvious without a skill.
Resources
5Install
npx skillscat add kunihiros/agentic-code-reasoning-skills Install via the SkillsCat registry.
Agentic Code Reasoning
Purpose
Reason about code behavior using structured semi-formal analysis without executing repository code.
This skill enforces a certificate-based reasoning process: you must state premises, trace concrete code paths with file:line evidence, and derive formal conclusions. You cannot skip sections or make unsupported claims.
Modes
compare— determine if two changes produce the same behaviorlocalize— find the root cause of a bugexplain— answer a code question with verified evidenceaudit-improve— review code for security, API misuse, or maintainability
Choose a mode before exploring files. If unsure, prefer explain.
Mode selection guide
| Trigger | Mode |
|---|---|
| "Are these two patches/implementations equivalent?" | compare |
| "Where is the bug?" / failing test / error report | localize |
| "What does this code do?" / "Why does X happen?" | explain |
| "Is this code secure?" / "Review for issues" | audit-improve |
Core Method
Apply this process in every mode. Complete each section in order. Do not write a later section before completing earlier ones.
When a certificate template exists for your selected mode (see the mode sections below), use that template as your primary guide — it is the concrete implementation of Steps 1–6 for that mode.
Step 1: Task and constraints
Write a short task statement and list constraints (e.g., no repository execution, static inspection only, file:line evidence required).
Step 2: Numbered premises
Before concluding anything, write numbered premises grounded in known facts.
P1: [fact about the task, inputs, or expected behavior]
P2: [fact about relevant files, tests, or specifications]
P3: ...Do not treat guesses as premises. Every later claim must reference a premise by number.
Step 3: Hypothesis-driven exploration
Before opening any file, write:
HYPOTHESIS H[N]: [what you expect to find and why]
EVIDENCE: [what supports this hypothesis — cite premises or prior observations]
CONFIDENCE: high / medium / lowAfter reading, record:
OBSERVATIONS from [filename]:
O[N]: [finding with file:line]
O[N]: [another finding with file:line]
HYPOTHESIS UPDATE:
H[M]: CONFIRMED / REFUTED / REFINED — [explanation]
UNRESOLVED:
- [remaining questions]
NEXT ACTION RATIONALE: [why the next file or step is justified]Steps 3 and 4 work together: Step 3 is your real-time exploration journal. Step 4 is the accumulated function-behavior record you build during Step 3 — add a row to Step 4 each time you read a function definition in Step 3. Do not reconstruct the table from memory after the fact.
Step 4: Interprocedural tracing
Update this table in real time during Step 3 — add each row the moment you read a function definition. Do not write this table all at once from memory.
For every function or method encountered on a relevant code path, record:
| Function/Method | File:Line | Behavior (VERIFIED) |
|---|---|---|
| [name] | [file:N] | [actual behavior after reading the definition] |
Rules:
- Read the actual definition. Do not infer behavior from the name.
- Mark the Behavior column VERIFIED only after reading the source.
- If source is unavailable (third-party library), mark UNVERIFIED and note the assumption. Search for type signatures, documentation, or test usage as secondary evidence. Optionally probe language behavior with an independent script.
- Trace through conditionals, mapping tables, and configuration — not just the happy path.
- For exception handling inside loops or multi-branch control flows: after recording the inferred behavior, ask "if this trace were wrong, what concrete input would produce different behavior?" Trace that input through the code before finalizing the row.
Step 5: Refutation check (required)
This step is mandatory, not optional.
Scope: Apply counterfactual reasoning not only at the final conclusion, but at every key intermediate claim — especially:
- "No test exercises this difference" — before asserting this, describe what such a test would look like and show you searched for exactly that pattern.
- "This behavior is X" for a non-trivial control flow — before asserting this, ask what evidence would exist if the behavior were not X.
- "These test outcomes are identical/different" — before asserting this, state what evidence would refute it.
For compare and audit-improve:
COUNTEREXAMPLE CHECK:
If my conclusion were false, what evidence should exist?
- Searched for: [what]
- Found: [what — cite file:line]
- Result: REFUTED / NOT FOUNDFor explain and localize:
ALTERNATIVE HYPOTHESIS CHECK:
If the opposite answer were true, what evidence would exist?
- Searched for: [what]
- Found: [what — cite file:line]
- Conclusion: REFUTED / SUPPORTEDStep 5.5: Pre-conclusion self-check (required)
Before writing the formal conclusion, check each item below. If any answer is NO, fix it before Step 6.
- Every PASS/FAIL or EQUIVALENT/NOT_EQUIVALENT claim traces to a specific
file:line— not inferred from function names. - Every function in the trace table is marked VERIFIED, or explicitly UNVERIFIED with a stated assumption that does not alter the conclusion.
- The Step 5 refutation or alternative-hypothesis check involved at least one actual file search or code inspection — not reasoning alone.
- The conclusion I am about to write asserts nothing beyond what the traced evidence supports.
Step 6: Formal conclusion
Write a conclusion that:
- References specific numbered premises and claims (e.g., "By P1 and C2…")
- States what was established
- States what remains uncertain or unverified
- Assigns a confidence level: HIGH / MEDIUM / LOW
Compare
Goal: determine whether two changes produce the same relevant behavior.
This template implements Steps 1–6 of the Core Method for compare mode.
Certificate template
Complete every section. Do not skip to FORMAL CONCLUSION without completing ANALYSIS.
DEFINITIONS:
D1: Two changes are EQUIVALENT MODULO TESTS iff executing the relevant
test suite produces identical pass/fail outcomes for both.
D2: The relevant tests are:
(a) Fail-to-pass tests: tests that fail on the unpatched code and are
expected to pass after the fix — always relevant.
(b) Pass-to-pass tests: tests that already pass before the fix — relevant
only if the changed code lies in their call path.
To identify them: search for tests referencing the changed function, class,
or variable. If the test suite is not provided, state this as a constraint
in P[N] and restrict the scope of D1 accordingly.
PREMISES:
P1: Change A modifies [file(s)] by [specific description]
P2: Change B modifies [file(s)] by [specific description]
P3: The fail-to-pass tests check [specific behavior]
P4: The pass-to-pass tests check [specific behavior, if relevant]
ANALYSIS OF TEST BEHAVIOR:
For each relevant test:
Test: [name]
Claim C[N].1: With Change A, this test will [PASS/FAIL]
because [trace through code — cite file:line]
Claim C[N].2: With Change B, this test will [PASS/FAIL]
because [trace through code — cite file:line]
Comparison: SAME / DIFFERENT outcome
For pass-to-pass tests (if changes could affect them differently):
Test: [name]
Claim C[N].1: With Change A, behavior is [description]
Claim C[N].2: With Change B, behavior is [description]
Comparison: SAME / DIFFERENT outcome
EDGE CASES RELEVANT TO EXISTING TESTS:
(Only analyze edge cases that the ACTUAL tests exercise)
E[N]: [edge case]
- Change A behavior: [specific output/behavior]
- Change B behavior: [specific output/behavior]
- Test outcome same: YES / NO
COUNTEREXAMPLE (required if claiming NOT EQUIVALENT):
Test [name] will [PASS/FAIL] with Change A because [reason]
Test [name] will [FAIL/PASS] with Change B because [reason]
Therefore changes produce DIFFERENT test outcomes.
NO COUNTEREXAMPLE EXISTS (required if claiming EQUIVALENT):
If NOT EQUIVALENT were true, a counterexample would look like:
[describe concretely: what test, what input, what diverging behavior]
I searched for exactly that pattern:
Searched for: [specific pattern — test name, code path, or input type]
Found: [result — cite file:line, or NONE FOUND with search details]
Conclusion: no counterexample exists because [brief reason]
FORMAL CONCLUSION:
By Definition D1:
- Test outcomes with Change A: [PASS/FAIL for each test]
- Test outcomes with Change B: [PASS/FAIL for each test]
- Since outcomes are [IDENTICAL/DIFFERENT], changes are
[EQUIVALENT/NOT EQUIVALENT] modulo the existing tests.
ANSWER: [YES equivalent / NO not equivalent]
CONFIDENCE: [HIGH / MEDIUM / LOW]Compare checklist
- Identify changed files for both sides
- Identify fail-to-pass AND pass-to-pass tests
- For each function called in changed code, read its definition and record in the interprocedural trace table (Step 4)
- Trace each test through both changes separately before comparing
- When a semantic difference is found, trace at least one relevant test through the differing path before concluding it has no impact
- Provide a counterexample (if different) or justify no counterexample exists (if equivalent)
Localize
Goal: identify the root cause of a bug, not just the crash site.
Certificate template
Complete phases in order. Each phase depends on the previous one.
PHASE 1: TEST / SYMPTOM SEMANTICS
What does the failing test or bug report describe?
State as formal premises:
PREMISE T1: The test calls [X.method(args)] and expects [behavior]
PREMISE T2: The test asserts [condition]
PREMISE T3: The observed failure is [error type / wrong output / hang]
...
PHASE 2: CODE PATH TRACING
Trace the execution path from the test entry point into production code.
For each significant method call, record:
| # | METHOD | LOCATION | BEHAVIOR | RELEVANT |
|---|--------|----------|----------|----------|
| 1 | ClassName.method(params) | file:line | [verified behavior] | [why it matters to PREMISE T[N]] |
| 2 | ... | ... | ... | ... |
Build the call sequence: test → method1 → method2 → ...
PHASE 3: DIVERGENCE ANALYSIS
For each code path traced, identify where the implementation diverges
from the test's expectations. State as formal claims:
CLAIM D1: At [file:line], [code] produces [behavior]
which contradicts PREMISE T[N] because [reason]
CLAIM D2: ...
Each claim MUST reference a specific PREMISE and a specific code location.
PHASE 4: RANKED PREDICTIONS
Based on divergence claims, produce ranked predictions:
Rank 1 ([confidence]): [file:line range] — [description]
Supporting claim(s): D[N]
Root cause / symptom: [which one]
Rank 2 ([confidence]): ...Exploration protocol
Use the hypothesis-driven format from Step 3 during exploration. Number hypotheses H1, H2… and observations O1, O2… for traceability.
Localize checklist
- State what the failing behavior expects (Phase 1)
- Trace from entry point toward production code with per-method records (Phase 2)
- Every divergence claim must reference a specific premise (Phase 3)
- Rank candidates and cite supporting claims (Phase 4)
- Distinguish symptom site from root cause — if the crash site differs from the origin of incorrect state, investigate upstream
- Check for indirection: is the bug in a class not directly called by the test?
Explain
Goal: answer a code question with verified semantic evidence.
Certificate template
Complete every section. Do not write FINAL ANSWER before ALTERNATIVE HYPOTHESIS CHECK.
QUESTION: [restate the question]
FUNCTION TRACE TABLE:
| Function/Method | File:Line | Parameter Types | Return Type | Behavior (VERIFIED) |
|-----------------|-----------|-----------------|-------------|---------------------|
| [function1] | [file:N] | [param types] | [ret type] | [ACTUAL behavior] |
| [function2] | [file:N] | [param types] | [ret type] | [ACTUAL behavior] |
DATA FLOW ANALYSIS:
Variable: [key variable name]
- Created at: [file:line]
- Modified at: [file:line(s), or NEVER MODIFIED]
- Used at: [file:line(s)]
(Repeat for each key variable)
SEMANTIC PROPERTIES:
Property 1: [e.g., "map is immutable after initialization"]
- Evidence: [specific file:line]
Property 2: ...
- Evidence: [specific file:line]
ALTERNATIVE HYPOTHESIS CHECK:
If the opposite answer were true, what evidence would exist?
- Searched for: [what you looked for]
- Found: [what you found — cite file:line]
- Conclusion: REFUTED / SUPPORTED
FINAL ANSWER:
[answer with explicit evidence citations]
CONFIDENCE: [HIGH / MEDIUM / LOW]Explain checklist
- Read actual definitions — do not infer behavior from names
- Fill every row in the function trace table with VERIFIED behavior
- Track key variables from creation through modification to usage
- Identify semantic properties with per-property file:line evidence
- Check the opposite answer before finalizing
- After identifying an edge case, verify whether downstream code already handles it before reporting it as a finding
- State uncertainty when downstream behavior is not fully verified
Audit-Improve
Goal: inspect code for risks or improvement opportunities, grounded in traced evidence.
Sub-modes
security-audit— injection, auth bypass, path traversal, secrets, unsafe defaultsrefactor-review— oversized units, duplication, mixed responsibilities, fragile flowcode-smell-check— hidden coupling, dead branches, poor naming, hard-to-test designapi-misuse-check— incorrect API usage, wrong assumptions about library semantics
Sub-mode focus
| Sub-mode | Primary question | Key requirement |
|---|---|---|
security-audit |
Is this unsafe operation reachable? | Verify a concrete call path for every confirmed finding |
refactor-review |
What is the safest minimal change? | Always propose the smallest effective refactoring first |
code-smell-check |
Is there concrete coupling or testability harm? | Trace coupling to a specific dependency — do not flag patterns without evidence |
api-misuse-check |
Does the usage violate the documented contract? | Read the API definition or documentation before claiming misuse |
Certificate template
REVIEW TARGET: [file(s) / module / component]
AUDIT SCOPE: [which sub-mode(s) and what property is being checked]
PREMISES:
P1: [fact about the code's purpose or expected security properties]
P2: [fact about the API contract or framework requirements]
...
FINDINGS:
For each finding:
Finding F[N]: [title]
Category: security / refactor / smell / api-misuse
Status: CONFIRMED / PLAUSIBLE (needs more evidence)
Location: [file:line range]
Trace: [code path that leads to this issue — cite file:line at each step]
Impact: [what can go wrong and under what conditions]
Evidence: [specific file:line proof]
COUNTEREXAMPLE CHECK:
For each confirmed finding, did you verify it is reachable?
F[N]: Reachable via [call path] — YES / UNVERIFIED
RECOMMENDATIONS:
R[N] (for F[N]): [specific fix or mitigation]
Risk of change: [what could break]
Minimal safe change: [smallest effective fix]
UNVERIFIED CONCERNS:
- [issues that need more context or are speculative]
CONFIDENCE: [HIGH / MEDIUM / LOW]Audit-Improve checklist
- Define the review target and scope clearly
- State the risk or quality property being checked as a premise
- Trace the relevant code path — do not flag isolated lines without context
- Separate CONFIRMED from PLAUSIBLE findings
- For each confirmed finding, verify it is reachable via a concrete call path
- For refactoring, propose the safest minimal change first
- Do not report speculative security issues as confirmed vulnerabilities
- For API misuse, read the actual API definition or documentation before claiming misuse
Guardrails
From the paper's error analysis
- Do not assume behavior from names. Read the actual function definition. The canonical failure: assuming Python's builtin
format()when a module-level function with different semantics shadows it. - Do not claim test outcomes without tracing. Trace each test through the relevant code path before asserting PASS or FAIL.
- Do not confuse symptom with root cause. A crash site (e.g., StackOverflowError in a recursive method) may not be the origin of incorrect state. Trace upstream to find where the bad state was created.
- Do not dismiss subtle differences. If you find a semantic difference between compared items, trace at least one relevant test through the differing code path before concluding the difference has no impact.
- Do not trust incomplete chains. After building a reasoning chain, verify that downstream code does not already handle the edge case or condition you identified. Confident-but-wrong answers often come from thorough-but-incomplete analysis.
- Handle unavailable source explicitly. When a function's source is not in the repository (third-party library), mark it UNVERIFIED in trace tables. Search for type signatures, documentation, or test usage as secondary evidence. Do not guess behavior from the function name.
General
- Do not treat style preferences as findings unless they affect maintainability or correctness.
- Do not hide uncertainty — state what is unverified.
- Do not skip the refutation check. It is mandatory in every mode.
Minimal Response Contract
Every response using this skill must include:
| Element | Required in |
|---|---|
| Selected mode | All |
| Numbered premises | All |
| Interprocedural trace table | All (when functions are on the code path) |
| Per-item analysis (per-test, per-method, or per-function) | compare, localize, explain |
| Refutation / alternative-hypothesis check | All |
| Formal conclusion with premise/claim references | All |
| Confidence level | All |