covalent-probe-discovery

"Systematic literature-mining pipeline for discovering novel covalent reactive handles capable of forming stable bonds with specified biological side-chain functionalities (including cysteine, lysine, serine, histidine, tyrosine, threonine, aspartate, glutamate, and methionine) under intended probe-use conditions. Use when a user asks to survey recent chemistry literature for novel covalent reaction partners targeting a specific residue or functional group, search for side-chain-reactive covalent motifs, design new reactive handles for chemical probes or covalent ligands, or produce an academic report of novel candidates with validated SMILES, rendered structures, and primary-literature citations."

mr-johnee 0 Updated 1mo ago

Resources

GitHub

Install

npx skillscat add mr-johnee/pclaw

Install via the SkillsCat registry.

SKILL.md

Covalent Probe Discovery

A five-step pipeline that turns a target residue or side-chain functionality
into a short, paper-style Markdown report of novel covalent reactive handles,
each accompanied by a validated SMILES, an RDKit-rendered structure, and
citations to the primary literature.

The pipeline is designed to be reproducible and auditable: every artifact
produced by a run is retained, every candidate in the final report is backed
by a locator in the source PDF, and the known-warhead exclusion list is
versioned alongside the skill itself.

Workflow Overview

Step 1  Generate keywords (keyword-generation prompt)   keywords.txt
Step 2  PubMed search                                   pubmed_results.json
Step 3  Relevance scoring (relevance-scoring prompt)    shortlist.json
Step 4  Attempt OA PDF acquisition                      pdfs/, download_log.json
Step 5  Extract, validate, render, compose report       candidates.enriched.json,
                                                        images/*.png, report.md

Steps 1, 3, 5a, and 5c are LLM-driven and follow the prompts in
references/. Steps 2, 4, and 5b are deterministic scripts under
scripts/.

Artifacts produced per run

The run directory is chosen by the caller (not the skill). Inside it the
pipeline produces:

File	Produced by	Audience
`keywords.txt`	Step 1	reproducibility
`pubmed_results.json`	Step 2	audit
`shortlist.json`	Step 3	audit
`pdfs/<doi>/paper.pdf`	Step 4	evidence source when available
`pdfs/download_log.json`	Step 4	acquisition audit
`candidates.json`	Step 5a	structure-extraction prompt output
`candidates.enriched.json`	Step 5b	validated candidate record
`images/candidate_NNN.png`	Step 5b	figures for the report
`candidate_validation.json`	Step 5b audit	JSON Schema / evidence audit
`report.md`	Step 5c	the deliverable

report.md reads like a short paper. Confidence scores, evidence quotes,
per-claim locators, structure-source locators, and manual-review status live
in candidates.enriched.json; they do not appear as database fields in the
human report.

Setup

Install this repository as a skill folder first (for example under
~/.openclaw-autoclaw/skills/pClaw/ for OpenClaw / AutoClaw). The installed
folder must contain this SKILL.md, plus the bundled scripts/,
references/, and data/ directories. The folder name may be pClaw; the
registered skill name is the name field at the top of this file:
covalent-probe-discovery.

Python dependencies

The pipeline has three dependency tiers:

Step 2 (scripts/pubmed_search.py) uses only the Python standard library.
Step 4 (scripts/download_pdf.py) requires httpx and network access to
Europe PMC, PMC's OA service, and the Unpaywall API.
Step 5b (scripts/smiles_to_image.py) requires rdkit.

Install with pip:

python3 -m pip install -r requirements.txt

If pip cannot install rdkit on your platform (common on Apple Silicon
outside conda), use a conda-compatible environment:

conda create -n covalent-probe python=3.12 -c conda-forge rdkit httpx
conda activate covalent-probe

Required configuration

The Unpaywall API requires a contact email. Set it before running Step 4:

export CHEM_PDF_UNPAYWALL_EMAIL="you@example.com"

Keep real local values in your shell profile or a private .env file. Do not
commit real emails, API keys, or tokens. See env.example.sh
for the full list of optional environment variables. The skill does not
auto-load .env files.

Non-package prerequisites

Network access to Europe PMC, PMC's OA service, and the Unpaywall API.
The bundled data/known_warheads.json exclusion
list (shipped with this skill; override with COVALENT_PROBE_KNOWN_WARHEADS
or --known-warheads on run_pipeline.py if you maintain a local copy).

No extra command-line tools are required — the pipeline does not call
pdftotext, poppler, ghostscript, openbabel, java, or node.

Running the pipeline

All examples below assume the current working directory is the run directory
and $SKILL_DIR points to the skill root:

export SKILL_DIR="/absolute/path/to/pClaw"

An orchestrator for the deterministic steps (2, 4, 5b) is provided:

python3 "$SKILL_DIR/scripts/run_pipeline.py" --keywords keywords.txt

See python3 "$SKILL_DIR/scripts/run_pipeline.py" --help for options. The
sections below document each step individually.

Step 0 — Confirm survey scope with the caller

Before generating keywords, ask the caller two scope questions and record the
answers verbatim in the run header of the final report:

Publication-year window. "Should I restrict the survey to a publication
year range (e.g. 2015–2025)?" For reproducible academic surveys a hard
window is strongly recommended — pick an inclusive [min_year, max_year]
pair the caller can quote in their Methods section. Default if the caller
declines: no year restriction.
Journal whitelist (optional). "Should I restrict the search to specific
journals by abbreviation (e.g. J. Am. Chem. Soc., Angew. Chem. Int.
Ed., Nat. Chem. Biol.)?" Default: no whitelist. Warn the caller that a
journal whitelist hides good chemistry in less obvious venues, so it
should be used only when the survey is explicitly scoped to a venue
subset.

Pass both answers through to Step 2 via the --min-year / --max-year /
--journal flags so the filters apply at the PubMed query layer rather than
post-hoc, and so they appear in pubmed_results.json's filters block for
audit.

Step 1 — Generate search keywords

Read references/prompt-keyword-generation.md.
Produce 24–36 keywords with five-layer coverage and write them to
keywords.txt (one per line). Keywords are chemistry-led and do not
encode year or journal — those are applied at the search layer in Step 2.

Step 2 — PubMed search

python3 "$SKILL_DIR/scripts/pubmed_search.py" \
    --query-file keywords.txt \
    --retmax 20 \
    --sort date \
    --max-pages 50 \
    --min-year 2015 --max-year 2025 \
    --pretty > pubmed_results.json

Add a journal whitelist only if the caller asked for one in Step 0:

    --journal "J Am Chem Soc" \
    --journal "Angew Chem Int Ed Engl" \
    --journal "Nat Chem Biol"

Behaviour:

Reviews, systematic reviews, and meta-analyses are excluded at the PubMed
query layer by default. Pass --include-reviews to override.
--min-year / --max-year apply an inclusive PubMed [PDAT] window. Both
are optional; omit a bound to leave that side open. Strongly recommended
for reproducibility.
--journal (repeatable) restricts to one or more journals by ISO
abbreviation via [TA]. Off by default. Use the abbreviation PubMed itself
emits (matches the journal field in articles).
Articles whose parseable page range exceeds --max-pages (default 50) are
dropped and listed under dropped_by_page_count in the output.
Articles without a parseable page range pass through — the PDF-level page
check in Step 5a catches any remaining outliers.
All applied filters are echoed back in the output's filters block so the
run is reproducible from pubmed_results.json alone.
A 0.12 s delay separates E-utilities calls. Bump it to ≥ 1.0 s inside the
script if NCBI starts returning HTTP 429.

Step 3 — Relevance scoring

Read references/prompt-relevance-scoring.md.
Score each article 1–10 on title + abstract; keep those at ≥ 7. Write the
surviving entries (with doi, pmid, title at minimum) to shortlist.json.

For large result sets, score in independent batches of 20–50 articles. These
batches may be run in parallel or through a provider batch-inference API
because the relevance-scoring prompt scores each article independently. Merge
the batch outputs only after every chunk returns, then apply the score
threshold.

Step 4 — Attempt OA PDF acquisition

python3 "$SKILL_DIR/scripts/download_pdf.py" \
    --shortlist shortlist.json \
    --output-dir pdfs \
    --workers 1 \
    --pretty

One subfolder per DOI under pdfs/, containing paper.pdf only when a
compliant OA route actually yields a valid PDF.
The downloader tries, in order: Europe PMC direct PDF for OA PMC records,
PMC's official OA service / OA package route for OA-subset PMC records, and
then other Unpaywall-reported OA PDF locations.
pdfs/download_log.json records every attempt with status
(ok / failed / skipped_existing). Cite this file from Appendix B of
the final report.
Reruns are cheap: existing paper.pdf files are skipped unless
--redownload is passed.
--workers N enables modest concurrent download attempts for large
shortlists. Keep N small (for example 2–4) because Europe PMC, PMC, and
Unpaywall are public services; the default remains sequential.
status: failed means none of the compliant strategies produced a valid
downloadable PDF. This includes PMC cases where a paper is visible on a PMC
webpage but is not in the official PMC OA subset.
This step is an attempted acquisition, not a guarantee. Some OA
locations still fail because the returned content is an HTML interstitial,
an access-denied page, or another non-PDF response.

Step 5 — Extract, validate, render, and compose the report

5a.0. Choose an extraction budget (interactive)

Step 5a is the most expensive stage in the pipeline: each PDF drives a full
structure-extraction run. Not every shortlisted paper is downloadable, so the
real extraction universe is the intersection of the shortlist with the
successfully acquired PDFs — a number that is only known at runtime.

Before invoking the structure-extraction prompt, rank that intersection by
relevance score and confirm a budget with the user:

python3 "$SKILL_DIR/scripts/rank_downloaded.py" \
    --shortlist shortlist.json \
    --download-log pdfs/download_log.json \
    --pubmed-results pubmed_results.json \
    --pretty > ranked_downloaded.json

The script joins the three files on DOI, keeps only papers whose
download_log.json status is ok or skipped_existing, and sorts
descending by relevance score with publication date as the tie-breaker
(newer first — newer structures are more likely to be novel relative to
the known-warhead database).

rank_downloaded.py writes a machine-readable ranking to stdout and a
human-readable summary to stderr that looks like:

Step 4 → Step 5a bridge
Shortlisted (relevance-scoring):  126
Successfully downloaded:  46
Ranked candidates:        46
Score distribution of downloaded & ranked pool:
     score 10: 4
     score  9: 8
     score  8: 12
     score  7: 22

Present these numbers to the user and ask how many PDFs to extract. A
reasonable default is the top 10 by score, but this is a decision the
caller should make with full visibility of the actual download yield —
do not silently default to "extract everything" when the downloaded
pool is large, and do not silently extract fewer than the user expects
when it is small.

Once the budget K is confirmed, pass --top-k K to
rank_downloaded.py to produce the final target list, then feed only
those K PDFs into the structure-extraction loop below.

Structure extraction itself can also be batched or parallelized, but keep one
logical model job per PDF. This avoids cross-paper locator mixups and lets you
retry only failed PDFs. If the model provider supports batch inference, submit
the top-K PDF jobs as independent requests and concatenate the returned JSON
arrays only after each one passes JSON Schema validation.

5a. Extract candidates from acquired PDFs only

Read references/prompt-structure-extraction.md.

Before invoking the structure-extraction prompt, load the known-warhead
exclusion list for the target residue:

import json
from pathlib import Path

known_path = Path(os.environ.get(
    "COVALENT_PROBE_KNOWN_WARHEADS",
    f"{os.environ['SKILL_DIR']}/data/known_warheads.json",
))
with known_path.open() as f:
    known = json.load(f)
excluded_classes = known[target_residue]  # e.g. known["Cysteine"]

Pass excluded_classes to the structure-extraction prompt as part of the
extraction context so filtering happens before the model emits candidates,
not after. The prompt treats that list as known and produces only novel
structures.

For each paper:

Use pdfs/<doi>/paper.pdf only when it exists.
Run the structure-extraction prompt with target_residue,
excluded_classes, the paper's metadata, and the PDF path.
Require it to use native JSON Schema / structured-output mode with
references/candidate.schema.json when
the host model API supports it.
Require it to emit structure_source, manual_review_status, and
mechanistic_family_normalized for every candidate.
Append the resulting JSON objects to candidates.json.

Rules:

Skip any paper whose acquired PDF is > 50 pages.
If no valid PDF was acquired, skip that paper for extraction.
Record acquisition failures from pdfs/download_log.json in Appendix B of
the final report rather than attempting text-only extraction.

5b. RDKit validation and rendering

python3 "$SKILL_DIR/scripts/smiles_to_image.py" \
    --input candidates.json \
    --output candidates.enriched.json \
    --images-dir images

The script:

parses every warhead_smiles with RDKit;
computes canonical_smiles;
renders PNGs to images/candidate_NNN.png;
fills missing mechanistic_family_normalized values by conservative text
normalization;
sets new candidates to manual_review_status: model_extracted when the
structure-extraction prompt omitted the field;
marks exact duplicate structures with duplicate_of_id;
sets validation_status to ok, invalid_smiles, or missing_smiles.

No cross-paper deduplication happens here — that is the report-composition
prompt's job.

Then validate the extraction contract locally:

python3 "$SKILL_DIR/scripts/validate_candidates.py" \
    --input candidates.enriched.json \
    --schema "$SKILL_DIR/references/candidate.schema.json" \
    --known-warheads "$SKILL_DIR/data/known_warheads.json" \
    --target-residue Cysteine \
    --pretty > candidate_validation.json

Use model-side JSON Schema / structured output when available, then run this
local validation anyway. The deterministic script checks the saved file
against references/candidate.schema.json and then applies workflow checks for
evidence locators, score ranges, structure locators, duplicate structures,
known-class overlaps, and candidates that remain model_extracted rather than
human_confirmed.

5c. Compose the academic report

Read references/prompt-report-composition.md.
Generate report.md from candidates.enriched.json together with the run
artifacts that carry search and acquisition metadata (keywords.txt,
pubmed_results.json, shortlist.json, and pdfs/download_log.json). The
prompt spells out the exact skeleton, ACS citation style, figure embedding
format, and what must stay in the human report versus the JSON.

Key rules recapped:

Running prose, not bullet lists of fields.
Superscript ACS citations ¹ inline, numbered references at the end.
Confidence scores do NOT appear in the report body. They only influence
where a candidate appears (body vs Appendix C) and how hedged the language
is.
Candidates with identical canonical_smiles across papers are merged in
prose; every supporting paper is cited.

Save the final report to report.md in the run directory.

Bundled scripts

scripts/pubmed_search.py — PubMed E-utilities
search with review exclusion and page-count filtering.
scripts/download_pdf.py — batch OA PDF
acquisition attempt through Europe PMC, PMC OA service, and Unpaywall, with
a manifest and skip-existing behaviour.
scripts/smiles_to_image.py — RDKit validation,
canonical SMILES, and PNG rendering.
scripts/validate_candidates.py — workflow
validation plus JSON Schema validation for evidence locators,
structure-source locators, manual review status, score ranges, and duplicate
flags.
scripts/rank_downloaded.py — joins shortlist
with download_log.json, ranks downloaded papers by relevance score with
publication-date tie-breaking, and prints a Step 4 → 5a bridge summary for
the extraction-budget dialog.
scripts/run_pipeline.py — orchestrator for the
deterministic steps (2, 4, 5b).

Core PDF-download modules live in chemdeep_pdf/ and are
self-contained: they prefer Europe PMC and PMC's official OA routes, then
fall back to other Unpaywall-reported OA locations.

Known warhead database

Default location: data/known_warheads.json,
shipped with the skill. Override by setting
COVALENT_PROBE_KNOWN_WARHEADS=/path/to/your.json or by passing
--known-warheads to run_pipeline.py.

Keyed by target residue; each value is a list of class names used as the
exclusion context passed to the structure-extraction prompt. The initial class
vocabulary is derived
from PDBcov / CovBinderInPDB and carried with local provenance metadata in the
JSON file. Current coverage:
Cysteine (69), Histidine (26), Serine (44), Lysine (24),
Aspartic Acid (10), Glutamic Acid (14), Tyrosine (7),
Threonine (15), Methionine (1).

Contributions to the exclusion list are welcome — open a pull request with
a short literature-cited rationale for any added class.

Default parameters

Reaction preference: stable covalent linkage; reversible vs irreversible
is interpreted from the user's request and evidence in the paper.
Time range: none by default; ask the caller in Step 0 and apply via
--min-year / --max-year for reproducibility.
Journal whitelist: none by default; ask the caller in Step 0. If the
caller declines, leave it unset to avoid hiding good chemistry in less
obvious venues. Apply via repeated --journal flags using PubMed ISO
abbreviations.
Literature scope: chemistry-first primary literature, including
synthetic, medicinal-chemistry, and chemical-biology methodology papers
where transferable reaction chemistry is reported.
Review filter: on (override with --include-reviews).
Max page count: 50 (--max-pages 0 disables).
Shortlist threshold: relevance score ≥ 7.
Relevance-scoring batch size: 20–50 articles per independent LLM call
when the host supports parallel or batch inference.
Extraction budget default: ask the user; suggest top 10 downloaded PDFs
by relevance score when the pool is large.
Novelty filter: the structure-extraction prompt respects
known_warheads.json[target_residue].

Operational notes

NCBI E-utilities enforce aggressive rate limits. A 0.12 s default delay is
usually enough; bump to ≥ 1.0 s after any HTTP 429.
chemdeep_pdf/downloader.py treats Europe PMC and PMC's OA service as the
preferred compliant routes for PMC-backed records, and only then falls back
to other Unpaywall-reported OA locations. Step 4 is still a retrieval
attempt, not a guaranteed download. If none of those routes yields a valid
PDF, the paper shows up in download_log.json with status: failed and
should end up in Appendix B rather than silently disappearing.
Structure drawings transcribed from PDF figures into SMILES are error-prone
on stereochemistry and fused-ring systems. smiles_to_image.py catches
syntax errors; validate_candidates.py checks evidence and structure
locators; low-confidence or non-human-confirmed transcriptions are triaged
for manual review rather than treated as publication-ready.

covalent-probe-discovery

Resources

Install

Covalent Probe Discovery

Workflow Overview

Artifacts produced per run

Setup

Python dependencies

Required configuration

Non-package prerequisites

Running the pipeline

Step 0 — Confirm survey scope with the caller

Step 1 — Generate search keywords

Step 2 — PubMed search

Step 3 — Relevance scoring

Step 4 — Attempt OA PDF acquisition

Step 5 — Extract, validate, render, and compose the report

5a.0. Choose an extraction budget (interactive)

5a. Extract candidates from acquired PDFs only

5b. RDKit validation and rendering

5c. Compose the academic report

Bundled scripts

Known warhead database

Default parameters

Operational notes

Categories

Install

Recommended Skills