whitespectre

eval-session-scorecard

Evaluates an entire multi-turn conversation (session) using the 7 core dimensions, returning strict JSON session-level aggregates plus per-turn scorecards.

whitespectre 0 Updated 3mo ago

Install

npx skillscat add whitespectre/ai-assistant-evals/eval-session-scorecard

Install via the SkillsCat registry.

SKILL.md

Eval Session Scorecard

Use this skill to evaluate a full conversation (multiple user/assistant turns) for continuous monitoring at the session level.

Inputs

Require:

  • A conversation transcript containing multiple user/assistant turns.
  • The transcript must clearly label turns as "User:" and "Assistant:".
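
As an illustration of the labeling requirement, here is a minimal Python sketch that splits a labeled transcript into turns and checks that both labels are present. The names `split_turns` and `has_required_labels` are hypothetical and not part of the skill. Treating unlabeled lines as continuations of the current turn is also an assumption.

```python
import re
from typing import List, Tuple

# A line that opens a new turn, e.g. "User: hello" or "Assistant: hi".
TURN_LABEL = re.compile(r"^(User|Assistant):\s*(.*)$")


def split_turns(transcript: str) -> List[Tuple[str, str]]:
    """Split a labeled transcript into (speaker, text) pairs.

    Assumption: a line without a label continues the current turn.
    """
    turns: List[Tuple[str, List[str]]] = []
    for line in transcript.splitlines():
        stripped = line.strip()
        match = TURN_LABEL.match(stripped)
        if match:
            turns.append((match.group(1), [match.group(2)]))
        elif turns:
            turns[-1][1].append(stripped)
    return [(speaker, "\n".join(parts).strip()) for speaker, parts in turns]


def has_required_labels(transcript: str) -> bool:
    """True when the transcript contains both "User:" and "Assistant:" turns."""
    speakers = {speaker for speaker, _ in split_turns(transcript)}
    return speakers == {"User", "Assistant"}
```

A transcript with no recognizable labels yields an empty turn list, so `has_required_labels` returns `False` and the caller can reject the input before scoring.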

Workflow

  1. Parse the transcript into assistant turns and their immediate context (the preceding user turn and any relevant prior context).
  2. For each assistant turn, run eval-core-scorecard on that assistant turn using:
    • User request/context: the preceding user message (plus brief prior context if needed).
    • Assistant response: the assistant message for that turn.
  3. Collect per-turn outputs (each is the JSON object returned by eval-core-scorecard).
  4. Compute session-level aggregates:
    • session_average_score: mean of the per-turn average_score values (may be a float).
    • dimension_averages: for each of the 7 dimensions, mean of that dimension’s score across turns.
    • lowest_scoring_turns: the 3 assistant turns with the lowest average_score (include the turn index and average_score for each).
  5. Return a single strict JSON object.
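
Steps 1 and 4 can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the skill's implementation:

  • The function names and the (speaker, text) input shape are hypothetical.
  • The per-turn scorecards are assumed to already exist in the turn_scorecards shape from the Output Contract.
  • Ties are broken by the earlier turn index.
  • Sessions with fewer than 3 assistant turns return fewer entries.

The last two behaviors are choices the skill does not specify.

```python
from statistics import mean
from typing import Dict, List, Tuple

DIMENSIONS = [
    "clarity", "relevance", "accuracy", "tone_empathy",
    "guidance_actionability", "conversation_flow", "boundary_adherence",
]


def pair_assistant_turns(turns: List[Tuple[str, str]]) -> List[Dict]:
    """Step 1: attach each assistant message to the preceding user message."""
    paired: List[Dict] = []
    last_user = ""
    for speaker, text in turns:
        if speaker == "User":
            last_user = text
        elif speaker == "Assistant":
            paired.append({
                "assistant_turn_index": len(paired) + 1,  # 1-based, per Hard Rules
                "user_message": last_user,
                "assistant_message": text,
            })
    return paired


def aggregate_session(turn_scorecards: List[Dict]) -> Dict:
    """Step 4: compute session-level aggregates from per-turn scorecards.

    Assumes at least one assistant turn; mean() raises on an empty list.
    """
    averages = [t["scorecard"]["average_score"] for t in turn_scorecards]
    dimension_averages = {}
    for dim in DIMENSIONS:
        scores = [
            r["score"]
            for t in turn_scorecards
            for r in t["scorecard"]["results"]
            if r["dimension"] == dim
        ]
        dimension_averages[dim] = mean(scores)
    # Assumption: ties on average_score go to the earlier turn.
    ranked = sorted(
        turn_scorecards,
        key=lambda t: (t["scorecard"]["average_score"], t["assistant_turn_index"]),
    )
    lowest = [
        {
            "assistant_turn_index": t["assistant_turn_index"],
            "average_score": t["scorecard"]["average_score"],
        }
        for t in ranked[:3]
    ]
    return {
        "session_average_score": mean(averages),
        "dimension_averages": dimension_averages,
        "lowest_scoring_turns": lowest,
    }
```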

Output Contract

Return JSON only. Do not include markdown, backticks, prose, or extra keys.

Use exactly this schema:

{
  "dimension": "session_scorecard",
  "assistant_turn_count": 0,
  "turn_count": 0,
  "session_average_score": 0,
  "dimension_averages": {
    "clarity": 0,
    "relevance": 0,
    "accuracy": 0,
    "tone_empathy": 0,
    "guidance_actionability": 0,
    "conversation_flow": 0,
    "boundary_adherence": 0
  },
  "lowest_scoring_turns": [
    { "assistant_turn_index": 1, "average_score": 0 },
    { "assistant_turn_index": 1, "average_score": 0 },
    { "assistant_turn_index": 1, "average_score": 0 }
  ],
  "turn_scorecards": [
    {
      "assistant_turn_index": 1,
      "user_message": "...",
      "assistant_message": "...",
      "scorecard": {
        "dimension": "core_scorecard",
        "average_score": 0,
        "results": [
          { "dimension": "clarity", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "relevance", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "accuracy", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "tone_empathy", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "guidance_actionability", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "conversation_flow", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] },
          { "dimension": "boundary_adherence", "score": 1, "rationale": "...", "improvement_suggestions": ["..."] }
        ]
      }
    }
  ]
}

Hard Rules

  • dimension must always equal "session_scorecard".
  • Output must be valid JSON and include all keys exactly as shown.
  • turn_scorecards must include one entry per assistant turn found in the transcript.
  • assistant_turn_index starts at 1 and increments by 1 for each assistant turn in the transcript.
  • Do not include step-by-step reasoning.
  • Never output text outside the JSON object.
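
A downstream consumer may want to check these rules mechanically. The Python sketch below is one possible validator, written as an assumption about how such a check could look. `check_session_output` is a hypothetical name, and the sketch covers only the rules that can be verified from the output alone: valid JSON with nothing around it, the fixed dimension value, the exact top-level keys, and 1-based consecutive turn indices.

```python
import json
from typing import List

# Top-level keys taken from the schema in the Output Contract.
SESSION_KEYS = {
    "dimension", "assistant_turn_count", "turn_count", "session_average_score",
    "dimension_averages", "lowest_scoring_turns", "turn_scorecards",
}


def check_session_output(raw: str) -> List[str]:
    """Return a list of rule violations; an empty list means no problem found."""
    try:
        # json.loads rejects markdown fences and any prose around the object.
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems: List[str] = []
    if data.get("dimension") != "session_scorecard":
        problems.append('dimension must equal "session_scorecard"')
    if set(data) != SESSION_KEYS:
        problems.append("top-level keys do not match the schema exactly")

    scorecards = data.get("turn_scorecards", [])
    if not isinstance(scorecards, list):
        problems.append("turn_scorecards must be a list")
        return problems
    indices = [
        t.get("assistant_turn_index") if isinstance(t, dict) else None
        for t in scorecards
    ]
    if indices != list(range(1, len(indices) + 1)):
        problems.append("assistant_turn_index must start at 1 and increment by 1")
    return problems
```

The check cannot confirm that every assistant turn in the transcript has an entry, because that requires the transcript itself. It only confirms that the entries present are numbered consistently.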