keyword-research-harvest

Use when the user wants a reusable local workflow to search scholarly APIs for any topic keywords, build a candidate table, download accessible PDFs or HTML/XML full texts, run a second-pass HTML-to-PDF attempt, and deduplicate the downloaded files. Best for keyword-driven research-article harvesting that should be reproducible and saved into a project folder.

zhongzhx 62 3 Updated 1mo ago

Install

npx skillscat add zhongzhx/literature-harvest

Install via the SkillsCat registry.

SKILL.md

Keyword Research Harvest

Use this skill when the user gives a topic or keyword set and wants a local literature-harvesting workflow, not a one-off manual search.

What this skill does

Bundles its own literature_harvest/scripts stack, including search_pubmed.py, search_europepmc.py, search_crossref.py, search_openalex.py, merge_and_deduplicate.py, download_fulltexts.py, and harvest_utils.py.
Searches PubMed/PMC, Europe PMC, Crossref, and OpenAlex through the bundled pipeline.
Builds the candidate table first, without deduplication.
Downloads all legally accessible files to a new run folder.
Saves HTML/XML when PDF is not directly available.
Runs a second pass to chase PDF links from saved HTML pages.
Produces a deduplicated file folder and manifest after downloading.
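The second pass described above can be pictured with a minimal sketch. This is not the bundled implementation: `PdfLinkParser` and `find_pdf_links` are hypothetical names, and the heuristics (the `citation_pdf_url` meta tag that many publisher pages expose, plus anchors whose path ends in `.pdf`) are assumptions about what such a pass might look for.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect candidate PDF URLs from one saved HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Assumption: the page advertises its PDF through this meta tag.
        if tag == "meta" and attr.get("name") == "citation_pdf_url":
            self._add(attr.get("content"))
        # Fallback: anchors whose target path (query string ignored) ends in .pdf.
        elif tag == "a":
            href = attr.get("href") or ""
            if href.split("?", 1)[0].lower().endswith(".pdf"):
                self._add(href)

    def _add(self, url):
        if url:
            absolute = urljoin(self.base_url, url)
            if absolute not in self.candidates:
                self.candidates.append(absolute)


def find_pdf_links(html_text, base_url):
    """Return absolute candidate PDF URLs in document order, without repeats."""
    parser = PdfLinkParser(base_url)
    parser.feed(html_text)
    return parser.candidates
```

Each candidate still has to be fetched and checked before it is counted as a PDF, since a link ending in `.pdf` can return an HTML landing page.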

Dependency model

This skill is self-contained. The target project does not need to already contain:

literature_harvest/
search_pubmed.py
download_fulltexts.py

The bundled copies live under:

literature_harvest/scripts/

Files in this skill

scripts/run_keyword_harvest_no_dedup.py
Use this to launch a new broad keyword harvest run into a new run folder.
scripts/continue_download_and_dedup.py
Use this to resume pending downloads, attempt the HTML-to-PDF second pass, and build a deduplicated download set.
literature_harvest/scripts/
Bundled search and download dependencies copied from a working local harvest stack so the skill is portable.
references/config_template.json
Copy and edit this for the topic-specific query set and filtering terms.
references/prompt_template.md
Reusable prompt for another AI/agent.
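To make "deduplicated download set" concrete, here is a small sketch of one possible approach: grouping files by SHA-256 content hash and keeping the first file in each group. The source does not state which criteria `continue_download_and_dedup.py` actually uses (it may well compare DOIs or titles instead), so `sha256_of` and `plan_dedup` are hypothetical helpers and the hashing rule is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_dedup(download_dir):
    """Return (keep, duplicates); the first file seen per content hash is kept.

    Nothing is deleted here: duplicates are only reported as
    (duplicate_path, kept_path) pairs so a manifest can record them.
    """
    keep = {}
    duplicates = []
    for path in sorted(Path(download_dir).rglob("*")):
        if not path.is_file():
            continue
        key = sha256_of(path)
        if key in keep:
            duplicates.append((path, keep[key]))
        else:
            keep[key] = path
    return list(keep.values()), duplicates
```

Returning a plan instead of deleting files keeps the original downloads intact, which fits the idea of producing a separate deduplicated folder and manifest.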

Recommended workflow

Choose any output parent folder where the new harvest run folder should be created.
Copy references/config_template.json and edit it for the user's topic keywords.
Run scripts/run_keyword_harvest_no_dedup.py with:
- --output-root
- --config
- --run-name
If the job is large, rerun with the same --run-name and add --skip-search to avoid repeating the API searches.
Run scripts/continue_download_and_dedup.py --run-root <run-folder> --retry-failed.
Report:
- candidate count
- downloaded file count
- true PDF count
- HTML/XML count
- remaining pending downloads
- deduplicated keep count
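The two script invocations in the workflow above can be assembled as argument lists, sketched below. Only the script paths and flags come from this document; the helper names `harvest_command` and `continue_command` and every path in the usage example are hypothetical.

```python
import sys


def harvest_command(output_root, config, run_name, skip_search=False):
    """Build the argv for the first-pass harvest script."""
    cmd = [
        sys.executable, "scripts/run_keyword_harvest_no_dedup.py",
        "--output-root", str(output_root),
        "--config", str(config),
        "--run-name", run_name,
    ]
    if skip_search:
        # Reuse an earlier search by keeping the same --run-name.
        cmd.append("--skip-search")
    return cmd


def continue_command(run_root):
    """Build the argv for the resume, second-pass, and dedup script."""
    return [
        sys.executable, "scripts/continue_download_and_dedup.py",
        "--run-root", str(run_root),
        "--retry-failed",
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the commands instead of running them.
    print(harvest_command("harvest_runs", "my_topic.json", "my_topic_run"))
    print(continue_command("path/to/run-folder"))
```

Passing one of these lists to `subprocess.run(cmd, check=True)` from the skill's root directory would execute it and raise on a non-zero exit code. The run folder is passed in explicitly because this document does not specify how its path is derived from `--output-root` and `--run-name`.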

Constraints

Do not silently drop failed downloads.
Keep metadata even when download fails.
Prefer original research when the config asks for it, but do not hard-code one domain such as Aspergillus.
Do not claim HTML/XML is PDF.
Treat second-pass cluster-like or support-like annotations as auxiliary only; downloading remains the primary task.
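One way to honor the "do not claim HTML/XML is PDF" constraint when reporting counts is to classify files by their leading bytes and not by their extension. The sketch below is illustrative only: `classify_download` and `count_by_type` are hypothetical helpers, and treating any file that starts with `<` as HTML/XML is a simplifying assumption.

```python
from collections import Counter
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def classify_download(path):
    """Label a file 'pdf', 'html_xml', or 'other' from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    head = head.lstrip()
    # PDF files begin with the "%PDF-" header.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Assumption: markup (HTML, XML, JATS, ...) begins with an angle bracket.
    if head.startswith(b"<"):
        return "html_xml"
    return "other"


def count_by_type(download_dir):
    """Count files under download_dir by detected type, ignoring extensions."""
    return Counter(
        classify_download(path)
        for path in Path(download_dir).rglob("*")
        if path.is_file()
    )
```

With this check, a landing page saved under a `.pdf` name is counted as HTML/XML, so the "true PDF count" in the report reflects file contents.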

When to read references

Read references/config_template.json before preparing a new run config.
Read references/prompt_template.md when the user wants to hand this skill to another AI/agent.
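Preparing a run config from the template could look like the sketch below. The structure of `references/config_template.json` is not shown here, so the code assumes no field names: it only copies the file and parses it so the keys can be inspected before editing. `prepare_config` and the target path are hypothetical.

```python
import json
import shutil
from pathlib import Path


def prepare_config(template_path, target_path):
    """Copy the template to a new file (never overwriting) and return its parsed JSON."""
    target = Path(target_path)
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template_path, target)
    with open(target, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Hypothetical target path; run from the skill's root directory.
    config = prepare_config("references/config_template.json", "configs/my_topic.json")
    print(type(config).__name__, "loaded; edit configs/my_topic.json for the topic")
```

Refusing to overwrite an existing target protects a config that has already been edited for an earlier run.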
