mseok

validate-bib

"Cross-reference \\cite{} keys against .bib files or embedded \\bibitem entries. Finds missing, unused, and typo'd citation keys. Deep verification mode spawns parallel agents for DOI/metadata validation at scale. Read-only in standard mode."

mseok 5 1 Updated 3mo ago

Resources

1
GitHub

Install

npx skillscat add mseok/dot/validate-bib

Install via the SkillsCat registry.

SKILL.md

Bibliography Validation

Read-only skill. Never edit source files — produce a categorised report only.

Citation key rule: the user's existing keys always take precedence. They come from Paperpile (his reference management system) and are canonical. When suggesting replacements (typo corrections, preprint upgrades, metadata fixes), always keep the user's key and update the .bib entry metadata around it — never suggest renaming a key to match some "standard" format.

When to Use

  • Before compiling a final version of a paper
  • After adding new citations to check nothing was missed
  • When biber/bibtex reports undefined citations
  • As part of a pre-submission checklist (pair with $proofread)

When NOT to Use

  • Finding new references — use $literature for discovery
  • Building a bibliography from scratch — use $literature with .bib generation
  • General proofreading — use $proofread (which also flags citation format issues)

Phase 0: Session Log (Suggested)

Bibliography validation with preprint staleness checks can be context-heavy (OpenAlex lookups, web searches for published versions). Before starting, suggest running $session-log to capture prior work as a recovery checkpoint. If the user declines, proceed without it.

Convention

Default bibliography file is paperpile.bib — this is the standard across all projects (per the $latex skill convention). However, the skill also supports:

  • Any .bib file found in the same directory as the .tex files being audited
  • Embedded bibliographies using \begin{thebibliography} / \bibitem{key} blocks
  • Both external and embedded simultaneously (rare but possible)

Bibliography Detection

At the start of validation, detect which bibliography method the project uses:

1. External .bib file (standard)

Look for .bib files in the project directory. Priority order:

  1. paperpile.bib (preferred — standard naming convention across all projects)
  2. Any other .bib file in the same directory as the .tex files

If multiple .bib files are found, validate all of them and produce a combined report. Note which file each issue belongs to. If paperpile.bib exists alongside other .bib files, flag the extras as a potential cleanup opportunity (the project may have migrated from a different naming convention).

Full validation applies: cross-reference checks and quality checks.

2. Embedded \begin{thebibliography} / \bibitem{key}

Some LaTeX documents define references inline rather than using an external .bib file. Detect by scanning .tex files for \begin{thebibliography}.

Extract keys from \bibitem entries:

  • \bibitem{key} — standard form, key is the argument in braces
  • \bibitem[label]{key} — optional label form (e.g., \bibitem[Smith et al., 2020]{smith2020}), key is in the second set of braces

Only cross-reference checks apply (missing keys, unused keys, typos). Quality checks (required fields, year, author formatting) are skipped because embedded bibliographies don't have structured metadata.

3. Both (rare)

If a project has both a .bib file and \begin{thebibliography} blocks, validate both:

  • Run full validation on the .bib file
  • Run cross-reference checks on \bibitem entries
  • Merge both key sets when checking for missing citations

Workflow

  1. Find files: Locate all .tex files in the project
  2. Detect bibliography type: Check for .bib files and/or \begin{thebibliography} blocks
  3. Extract citation keys from .tex: Scan for all citation commands
  4. Extract entry keys from bibliography source(s):
    • External: Parse all @type{key, entries from .bib file(s)
    • Embedded: Parse all \bibitem{key} and \bibitem[label]{key} entries
  5. Cross-reference: Compare the two sets
  6. Quality checks: Validate .bib entry completeness (external only)
  7. Produce report: Write results to stdout (or save if requested)

Citation Commands to Scan

Scan .tex files for all of these patterns:

Command Example
\cite{key} Basic citation
\citet{key} Textual: Author (Year)
\citep{key} Parenthetical: (Author, Year)
\textcite{key} biblatex textual
\autocite{key} biblatex auto
\parencite{key} biblatex parenthetical
\citeauthor{key} Author name only
\citeyear{key} Year only
\nocite{key} Include in bibliography without in-text citation

Also handle multi-key citations: \citep{key1, key2, key3}

Cross-Reference Checks

Critical: Missing Entries

Citation keys used in .tex but not defined in the bibliography source (.bib file or \bibitem entries).

These will cause compilation errors.

Warning: Unused Entries

Keys defined in the bibliography source but never cited in any .tex file.

Not errors, but may indicate:

  • Forgotten citations (should they be \nocite?)
  • Leftover entries from earlier drafts
  • Entries intended for a different paper

Warning: Possible Typos (Fuzzy Match)

For each missing key, check if a similar key exists in the bibliography using edit distance:

  • Edit distance = 1: Very likely a typo
  • Edit distance = 2: Possibly a typo
  • Flag these with the suggested correction

Common typo patterns:

  • Year off by one: smith2020 vs smith2021
  • Missing/extra letter: santanna vs sant'anna vs santana
  • Underscore vs camelCase: smith_jones vs smithjones

Quality Checks on .bib Entries

These checks apply only to external .bib files. Embedded bibliographies lack structured metadata, so quality checks are skipped for them.

Required Fields by Entry Type

Entry Type Required Fields
@article author, title, journal, year
@book author/editor, title, publisher, year
@incollection author, title, booktitle, publisher, year
@inproceedings author, title, booktitle, year
@techreport author, title, institution, year
@unpublished author, title, note, year
@phdthesis author, title, school, year

Year Reasonableness

  • Flag entries with year < 1900 or year > current year + 1
  • Flag entries with no year at all

Author Formatting

  • Check for inconsistent author formats within the file
  • Flag entries where author field contains "and others" or "et al." — this is never valid in BibTeX. All authors must be listed explicitly. Severity: Warning.
  • Flag entries with organisation names that might need {{braces}} to prevent splitting

DOI Resolution (optional — triggered by --verify-dois flag or when issues are suspected)

Preferred method: biblio MCP scholarly_verify_dois. Collect all DOIs from the .bib file and call scholarly_verify_dois (up to 50 per call). This batch-verifies each DOI against all enabled sources (OpenAlex, Scopus, WoS). Results:

  • VERIFIED (2+ sources confirm) — DOI is valid, metadata can be trusted
  • SINGLE_SOURCE (1 source only) — DOI exists but warrants a manual spot-check
  • NOT_FOUND — DOI not found in any source; resolve manually via WebFetch

Fallback for NOT_FOUND DOIs: Resolve via https://doi.org/[DOI] and confirm the returned metadata matches the entry:

  1. Title match: Does the DOI landing page title match the .bib title?
  2. Author match: Does the first author on the landing page match the .bib first author?
  3. Journal match: Does the venue match?

Flag mismatches as:

  • Warning: DOI mismatch — DOI resolves to a different paper than claimed. This usually means the DOI is wrong (adjacent DOI in the same journal volume) or the authors are wrong (conflation of researchers in the same subfield).

This check catches:

  • Wrong DOIs (e.g., off-by-one in the DOI suffix)
  • Author conflation (real researchers incorrectly attributed to a paper)
  • Metadata copied from secondary sources without verification

For manual WebFetch resolution, process in batches of 5 to avoid rate limiting. Only flag confirmed mismatches — if the DOI cannot be resolved (404, timeout), note it as "unresolvable" at Info level.

Preprint Staleness Check

For every entry that looks like a preprint, check whether a peer-reviewed version has since been published. Full detection signals, lookup protocol, and classification: `references/preprint-check.md`

Severity Levels

Level Meaning
Critical Missing entry for a cited key — will cause compilation error
Warning Unused entry, possible typo, missing required field
Info Year oddity, formatting suggestion, bibliography type note

Bibliography Output

After validation, offer these actions if applicable:

  • Embedded bibliography → offer to create paperpile.bib: If the project uses \begin{thebibliography}, offer to extract the references into a proper paperpile.bib file (one @misc entry per \bibitem, with the full text as a note field). The author can then enrich the entries with proper metadata.
  • Non-standard .bib name → offer to rename: If the existing .bib file is not named paperpile.bib, offer to rename it to paperpile.bib and update the \bibliography{} command in the .tex file.

These are offers only — do not make changes without explicit confirmation.

Report Format

Full report template with all sections: `references/report-template.md`

Sections: Summary table → Critical (missing entries) → Warning (typos, unused, missing fields, DOI mismatches, stale preprints) → Info (year issues) → Limitations (for embedded bibliographies).

Optional: Metadata Verification via Biblio MCP

When missing entries or suspicious metadata are flagged, use the biblio MCP tools:

  • scholarly_search — search by title to find the correct entry across OpenAlex + Scopus + WoS
  • scholarly_verify_dois — batch-verify DOIs across all sources (preferred over manual DOI resolution)
  • openalex_lookup_doi — look up full metadata for a specific DOI

For direct API fallback guidance (when MCP is unavailable): `references/openalex-verification.md`

Deep Verification Mode (Parallel, Disk-Based)

Triggered by: --deep-verify flag, or when the .bib has 40+ entries, or when the user says "deep verify" / "verify all references".

This mode spawns parallel sub-agents that each verify a batch of entries and write structured results to disk — bypassing context window limits entirely.

Architecture

You (orchestrator)
├── Read .bib file, extract all entries
├── Create verification_results/ directory in project root
├── Batch entries into groups of 5
├── Spawn parallel agents (max 5 concurrent), each:
│   ├── For each entry in batch:
│   │   ├── Verify DOI resolves (via biblio MCP scholarly_verify_dois)
│   │   ├── Check title matches DOI metadata
│   │   ├── Check author consistency
│   │   ├── Check year correctness
│   │   └── Check for published version if preprint
│   └── Write results to verification_results/batch_N.json
├── Wait for all agents to complete
├── Read all batch JSON files
├── Merge into verification_results/full_report.json
└── Generate markdown summary highlighting entries needing attention

Batch JSON Format

Each agent writes a file verification_results/batch_N.json:

[
  {
    "cite_key": "Author2020-ab",
    "doi_valid": true,
    "title_match": true,
    "author_match": true,
    "year_correct": true,
    "preprint_status": "published_version_exists",
    "published_doi": "10.1234/...",
    "issues": [],
    "suggested_fixes": {}
  }
]

Agent Prompt Template

Each sub-agent receives:

  • The batch of .bib entries (raw text)
  • The batch number
  • The output path: verification_results/batch_N.json
  • Instructions to use biblio MCP tools for verification
  • Instruction to write results to disk only — never return large payloads

Assembly & Report

After all agents complete:

  1. Read all verification_results/batch_*.json files
  2. Merge into verification_results/full_report.json
  3. Generate verification_results/summary.md with:
    • Total entries verified
    • Entries with issues (grouped by issue type)
    • Suggested fixes
    • Entries needing manual attention
  4. Print the summary to the user

Cleanup

After the report is delivered, offer to delete verification_results/ or keep it for reference.

Cross-References

  • $proofread — For overall paper quality including citation format
  • $literature — For finding and adding new references (includes full OpenAlex workflows)
  • $latex — For compilation with reference checking
  • your writing workflow — After drafting sections with citations, run $validate-bib to catch missing/typo'd keys before compilation