transcript-pipeline

This skill should be used when the user asks to "process this transcript", "convert lecture to notes", "run transcript pipeline", "generate class tutorial from Zoom captions", "validate transcript coverage", or "enrich class resources" (Notion/Canva/Drive links) for bootcamp notes.

PrakharMNNIT 1 Updated 5mo ago

Install

npx skillscat add prakharmnnit/skills-and-personas/transcript-pipeline

Install via the SkillsCat registry.

SKILL.md

Transcript Pipeline Skill

Run a deterministic, auditable transcript-to-tutorial workflow with optional resource enrichment.

Purpose

Use this skill to convert raw class captions into high-quality study notes while preserving accountability through ledger and validation artifacts.

Use scripts for deterministic work. Use chat/stage prompts for language-heavy transformation.

Core Contract

Keep stage order: ingest -> refine -> synthesize -> enhance -> validate -> publish.
Run deterministic gates with scripts, never with LLM self-certification.
Preserve traceability in .pipeline/* artifacts.
Keep learner-facing notes readable and sanitized.
Treat validation status as PASS/FAIL source of truth.

Scripts

Use these scripts from scripts/:

ingest_zoom_captions.py - deterministic ingestion and segment ledger creation
run_chat_pipeline.py - guided orchestration for stage handoffs and validation
validate_coverage.py - hard-gate coverage validation
publish_tutorial_notes.py - learner-facing file naming and sanitization
merge_chunks.py - merge chunk outputs for large transcripts
run_colab_notebook_pipeline.py - AI/ML Colab appendix and code explainer pipeline
update_ai_notes_with_resources_and_colab.py - AI/ML notes enrichment utility
resource_enrichment.py - authenticated enrichment for Notion/Canva/Drive resources

Stage Workflow

Stage 0: Ingest (Deterministic)

Run:

python scripts/ingest_zoom_captions.py "<transcript_or_session_path>"

Required outputs:

.pipeline/segment_ledger.jsonl
.pipeline/segment_manifest.jsonl
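A quick sanity check on the two Stage 0 outputs can be sketched as follows. This is a minimal sketch, not part of the skill's scripts: the helper names `load_jsonl` and `check_ingest_outputs` are hypothetical, and treating an empty file as an error is an assumption. It reads each file as JSON Lines without assuming any particular record fields.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return a list of parsed JSON objects, one per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return records


def check_ingest_outputs(pipeline_dir=".pipeline"):
    """Map each Stage 0 output path to its record count.

    Raises if a file is missing or empty (the "empty is an error"
    rule is an assumption, not something the skill documents).
    """
    counts = {}
    for name in ("segment_ledger.jsonl", "segment_manifest.jsonl"):
        path = Path(pipeline_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing Stage 0 output: {path}")
        records = load_jsonl(path)
        if not records:
            raise ValueError(f"empty Stage 0 output: {path}")
        counts[str(path)] = len(records)
    return counts
```

Calling `check_ingest_outputs(".pipeline")` after ingestion returns the record count per file, or raises with the offending path.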

Stage 1: Refine (Chat Stage)

Load references/stage1-refine.md.

Produce:

.pipeline/refined_transcript.md
.pipeline/topic_inventory.json
.pipeline/corrections_log.csv
.pipeline/uncertainty_report.json

Stage 2: Synthesize (Chat Stage)

Load references/stage2-synthesize.md.

Produce:

.pipeline/structured_notes.md
.pipeline/coverage_matrix.json

Stage 3: Enhance (Chat Stage)

Load:

references/stage3-enhance.md
references/tutorial-tech-bar-raiser.md

Produce:

.pipeline/enhanced_notes.md
final_notes.md
bootcamp_index.md

Stage 4: Validate (Deterministic)

Run:

python scripts/validate_coverage.py --pipeline-dir .pipeline

Validation guidance: references/stage4-validate.md.

Hard gates:

Segment coverage accountability
Uncertainty retention
No orphan claims
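To make two of the gate names concrete, here is one plausible reading of "segment coverage accountability" and "no orphan claims", sketched over plain Python collections. Everything here is an assumption for illustration: the function names, the idea that segments and claims carry string IDs, and the notion of an explicit exclusion list are hypothetical and not taken from `validate_coverage.py`, which remains the source of truth. Uncertainty retention is not sketched.

```python
def find_unaccounted_segments(ledger_ids, covered_ids, excluded_ids=()):
    """Return ledger segment IDs that are neither covered nor explicitly
    excluded, preserving ledger order. (Hypothetical schema.)"""
    accounted = set(covered_ids) | set(excluded_ids)
    return [seg for seg in ledger_ids if seg not in accounted]


def find_orphan_claims(claims_to_segments, ledger_ids):
    """Return claims that cite no segment, or cite a segment that is
    absent from the ledger. (Hypothetical schema.)"""
    known = set(ledger_ids)
    orphans = []
    for claim, cited in claims_to_segments.items():
        if not cited or any(seg not in known for seg in cited):
            orphans.append(claim)
    return orphans


def gate_status(unaccounted, orphans):
    """PASS only when both lists are empty; otherwise FAIL."""
    return "PASS" if not unaccounted and not orphans else "FAIL"
```

For example, with a ledger of `["s1", "s2", "s3"]`, coverage of `["s1"]`, and an exclusion of `["s3"]`, segment `s2` is reported as unaccounted and the gate fails.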

Stage 5: Publish

Run:

python scripts/publish_tutorial_notes.py --root "<sessions_root>" --session-dir "<session_dir>"

Result:

Published tutorial filename in canonical format
Learner-safe note without noisy source tags
Updated course index links
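The Required Outputs Checklist gives the canonical filename pattern as `<Domain> Class <NN> [DD-MM-YYYY] - <Topic>.md`. A minimal sketch of building such a name follows. The function name `tutorial_filename` is hypothetical, and reading `<NN>` as a zero-padded two-digit class number is an assumption; `publish_tutorial_notes.py` may apply additional sanitization not shown here.

```python
from datetime import date


def tutorial_filename(domain, class_number, class_date, topic):
    """Build '<Domain> Class <NN> [DD-MM-YYYY] - <Topic>.md'.

    Assumes <NN> is a zero-padded two-digit number.
    """
    return (
        f"{domain} Class {class_number:02d} "
        f"[{class_date:%d-%m-%Y}] - {topic}.md"
    )


# Illustrative values only.
print(tutorial_filename("AI", 7, date(2025, 3, 14), "Gradient Descent"))
# prints: AI Class 07 [14-03-2025] - Gradient Descent.md
```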

One-Command Guided Mode

Use the guided runner for chat-window workflows:

python scripts/run_chat_pipeline.py run "<transcript_or_session_path>" --deep-pass

This enforces the required stage handoffs and, with --deep-pass, the deep quality gates.

Optional Resource Enrichment Stage

Run when class notes include external links (Notion/Canva/Drive):

python scripts/resource_enrichment.py --all-sessions

Single session:

python scripts/resource_enrichment.py --session-dir "<session_dir>"

Auth options:

Notion: NOTION_TOKEN_V2, NOTION_ACTIVE_USER
Canva: RESOURCE_PLAYWRIGHT_STORAGE_STATE

Reference: references/resource-enrichment-authenticated-flow.md.
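Before running enrichment, it can help to see which of the auth variables listed above are set. The sketch below only inspects the environment; the helper name `missing_auth_vars` is hypothetical, and whether `resource_enrichment.py` requires every variable (or falls back when one is absent) is not stated here, so treat the result as informational.

```python
import os

# Variable names as listed under "Auth options".
AUTH_ENV_VARS = {
    "Notion": ("NOTION_TOKEN_V2", "NOTION_ACTIVE_USER"),
    "Canva": ("RESOURCE_PLAYWRIGHT_STORAGE_STATE",),
}


def missing_auth_vars(environ=None):
    """Map each provider to its auth variables that are unset or empty."""
    env = os.environ if environ is None else environ
    missing = {}
    for provider, names in AUTH_ENV_VARS.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[provider] = absent
    return missing


if __name__ == "__main__":
    for provider, names in missing_auth_vars().items():
        print(f"{provider}: not set -> {', '.join(names)}")
```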

Optional AI/ML Colab Enrichment

Run for Colab-backed AI/ML classes:

python scripts/run_colab_notebook_pipeline.py

Reference: references/colab-notebook-explainer-pipeline.md.

Large Transcript Handling

If the input is too large to process comfortably in a single context window:

Run Stage 1 by chunks.
Merge chunk artifacts:

python scripts/merge_chunks.py --chunk-dirs "<chunkA/.pipeline>" "<chunkB/.pipeline>" --output-dir "<session/.pipeline>"

Continue Stage 2 onward on merged artifacts.

Required Outputs Checklist

Learner-facing:

final_notes.md
<Domain> Class <NN> [DD-MM-YYYY] - <Topic>.md
bootcamp_index.md

Pipeline/audit:

.pipeline/segment_ledger.jsonl
.pipeline/segment_manifest.jsonl
.pipeline/refined_transcript.md
.pipeline/topic_inventory.json
.pipeline/corrections_log.csv
.pipeline/uncertainty_report.json
.pipeline/structured_notes.md
.pipeline/coverage_matrix.json
.pipeline/enhanced_notes.md
.pipeline/validation_report.md
.pipeline/exceptions.json (only when validation fails)

Quality gates:

.pipeline/deep_pass_report.md (when --deep-pass)
.pipeline/deep_pass_exceptions.json (when --deep-pass)

Resource enrichment (optional):

.resources/resource_enrichment_report.json
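The checklist above can be turned into a small presence check. This is a sketch under stated assumptions: the function name `missing_outputs` is hypothetical, and it assumes `final_notes.md`, `bootcamp_index.md`, and the `.pipeline` directory all sit directly under the session directory. Conditional outputs (`exceptions.json`, the published tutorial file, and the resource enrichment report) are left out.

```python
from pathlib import Path

LEARNER_FACING = ("final_notes.md", "bootcamp_index.md")
PIPELINE_ARTIFACTS = (
    "segment_ledger.jsonl",
    "segment_manifest.jsonl",
    "refined_transcript.md",
    "topic_inventory.json",
    "corrections_log.csv",
    "uncertainty_report.json",
    "structured_notes.md",
    "coverage_matrix.json",
    "enhanced_notes.md",
    "validation_report.md",
)
DEEP_PASS_ARTIFACTS = ("deep_pass_report.md", "deep_pass_exceptions.json")


def missing_outputs(session_dir, deep_pass=False):
    """Return the paths of expected outputs that do not exist.

    The directory layout is an assumption; adjust it to match the session.
    """
    session = Path(session_dir)
    expected = [session / name for name in LEARNER_FACING]
    expected += [session / ".pipeline" / name for name in PIPELINE_ARTIFACTS]
    if deep_pass:
        expected += [session / ".pipeline" / name for name in DEEP_PASS_ARTIFACTS]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means every expected file is present; otherwise each entry is a path that can be reported verbatim.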

Execution Rules

Fail fast on missing required artifacts.
Report missing outputs explicitly by file path.
Retry only from the earliest failing stage.
Keep resource extraction status explicit (success/fallback/blocked).
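The retry rule can be sketched with the stage order from the Core Contract. The helper name `stages_to_rerun` is hypothetical, and reading "retry from the earliest failing stage" as "rerun that stage and every stage after it" is an assumption about the intended behavior.

```python
STAGE_ORDER = ("ingest", "refine", "synthesize", "enhance", "validate", "publish")


def stages_to_rerun(failed_stages):
    """Return the earliest failing stage and all later stages, in order."""
    failed = list(failed_stages)
    unknown = set(failed) - set(STAGE_ORDER)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    if not failed:
        return []
    first = min(STAGE_ORDER.index(stage) for stage in failed)
    return list(STAGE_ORDER[first:])


print(stages_to_rerun(["validate", "synthesize"]))
# prints: ['synthesize', 'enhance', 'validate', 'publish']
```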
