Use when the user wants a reusable local workflow to search scholarly APIs for any topic keywords, build a candidate table, download accessible PDFs or HTML/XML full texts, run a second-pass HTML-to-PDF attempt, and deduplicate the downloaded files. Best for keyword-driven research-article harvesting that should be reproducible and saved into a project folder.
Install
npx skillscat add zhongzhx/literature-harvest
Install via the SkillsCat registry.
Keyword Research Harvest
Use this skill when the user gives a topic or keyword set and wants a local literature-harvesting workflow, not a one-off manual search.
What this skill does
- Bundles its own literature_harvest/scripts stack, including search_pubmed.py, search_europepmc.py, search_crossref.py, search_openalex.py, merge_and_deduplicate.py, download_fulltexts.py, and harvest_utils.py.
- Searches PubMed/PMC, Europe PMC, Crossref, and OpenAlex through the bundled pipeline.
- Builds a no-dedup candidate table first.
- Downloads all legally accessible files to a new run folder.
- Saves HTML/XML when PDF is not directly available.
- Runs a second pass to chase PDF links from saved HTML pages.
- Produces a deduplicated file folder and manifest after downloading.
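The second pass over saved HTML pages could look roughly like the sketch below. It is not the bundled implementation. It assumes the saved page advertises its PDF through a citation_pdf_url meta tag (a convention many publisher pages follow) or through a plain link ending in .pdf, and the names PdfLinkFinder and find_pdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collects candidate PDF URLs from a saved HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        href = attrs.get("href") or ""
        if tag == "meta" and name == "citation_pdf_url":
            self._add(attrs.get("content"))
        elif tag == "a" and href.lower().split("?")[0].endswith(".pdf"):
            self._add(href)

    def _add(self, url):
        # Resolve relative links against the page URL and skip repeats.
        if url:
            absolute = urljoin(self.base_url, url)
            if absolute not in self.links:
                self.links.append(absolute)


def find_pdf_links(html_text, base_url):
    """Return candidate PDF URLs in the order they appear in the page."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html_text)
    return finder.links
```

Each returned URL is only a candidate. The response still has to be checked before the file is counted as a PDF.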
Dependency model
This skill is self-contained. The target project does not need to already contain:
- literature_harvest/search_pubmed.py
- download_fulltexts.py
The bundled copies live under:
literature_harvest/scripts/
Files in this skill
- scripts/run_keyword_harvest_no_dedup.py: Use this to launch a new broad keyword harvest run into a new run folder.
- scripts/continue_download_and_dedup.py: Use this to resume pending downloads, try the HTML-to-PDF second pass, and build a deduplicated download set.
- literature_harvest/scripts/: Bundled search and download dependencies copied from a working local harvest stack so the skill is portable.
- references/config_template.json: Copy and edit this for the topic-specific query set and filtering terms.
- references/prompt_template.md: Reusable prompt for another AI/agent.
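To illustrate what building a deduplicated download set can mean, here is a minimal sketch that groups files by content hash and keeps the first file in each group. The source does not say which criteria continue_download_and_dedup.py actually uses, so byte-identical matching is an assumption, and sha256_of and plan_dedup are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_dedup(download_dir):
    """Return (keep, duplicates); the first file per content hash is kept.

    Assumption: two files are duplicates only when their bytes are identical.
    """
    keep = {}
    duplicates = []
    for path in sorted(Path(download_dir).iterdir()):
        if not path.is_file():
            continue
        key = sha256_of(path)
        if key in keep:
            # Record the duplicate together with the file it duplicates.
            duplicates.append((path, keep[key]))
        else:
            keep[key] = path
    return list(keep.values()), duplicates
```

The function only plans the result and deletes nothing, so every downloaded file stays available until the manifest has been written.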
Recommended workflow
- Choose any output parent folder where the new harvest run folder should be created.
- Copy references/config_template.json and edit it for the user's topic keywords.
- Run scripts/run_keyword_harvest_no_dedup.py with --output-root, --config, and --run-name.
- If the job is large, rerun with the same --run-name and --skip-search to avoid repeating API search.
- Run scripts/continue_download_and_dedup.py --run-root <run-folder> --retry-failed.
- Report:
- candidate count
- downloaded file count
- true PDF count
- HTML/XML count
- remaining pending
- deduplicated keep count
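The two script invocations in this workflow can be assembled as below. The script paths and flag names come from the steps above. Everything else is an assumption: that the scripts are launched with the current Python interpreter, that --skip-search takes no value, and the helper names search_command and continue_command. The example argument values are placeholders.

```python
import shlex
import sys


def search_command(output_root, config, run_name, skip_search=False):
    """Build the argv for the search-and-download run."""
    cmd = [
        sys.executable, "scripts/run_keyword_harvest_no_dedup.py",
        "--output-root", output_root,
        "--config", config,
        "--run-name", run_name,
    ]
    if skip_search:
        # Assumption: --skip-search is a bare flag with no value.
        cmd.append("--skip-search")
    return cmd


def continue_command(run_root):
    """Build the argv for the resume, second-pass, and dedup run."""
    return [
        sys.executable, "scripts/continue_download_and_dedup.py",
        "--run-root", run_root,
        "--retry-failed",
    ]


if __name__ == "__main__":
    # Placeholder values; print the commands instead of running them.
    print(shlex.join(search_command("harvests", "my_config.json", "run01")))
    print(shlex.join(continue_command("path/to/run-folder")))
```

Passing either list to subprocess.run(cmd, check=True) would execute it. Printing first makes it easy to record the exact commands alongside the run folder.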
Constraints
- Do not silently drop failed downloads.
- Keep metadata even when download fails.
- Prefer original research when the config asks for it, but do not hard-code one domain such as Aspergillus.
- Do not claim HTML/XML is PDF.
- Treat second-pass cluster-like or support-like annotations as auxiliary only; downloading remains the primary task.
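Two of these constraints (keep metadata when a download fails, and never label HTML/XML as PDF) can be enforced mechanically. The sketch below labels a payload by its leading bytes rather than by URL or file extension, and always returns a manifest row. The function names and the status, file_type, and error fields are hypothetical, not the bundled manifest schema.

```python
def classify_payload(data):
    """Label downloaded bytes by content, not by URL or file extension."""
    head = data[:1024]
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # drop a UTF-8 byte order mark
    head = head.lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    if lowered.startswith(b"<"):
        return "xml"
    return "unknown"


def record_result(candidate, data=None, error=None):
    """Return a manifest row that keeps the metadata even if the download failed."""
    row = dict(candidate)
    if data is None:
        row.update(status="failed", file_type=None, error=error or "no data")
    else:
        row.update(status="downloaded", file_type=classify_payload(data), error=None)
    return row
```

A failed download therefore still yields a row with its original metadata, so it can be retried later and counted under remaining pending.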
When to read references
- Read references/config_template.json before preparing a new run config.
- Read references/prompt_template.md when the user wants to hand this skill to another AI/agent.