Full project context for Rhemata — Alex's AI-powered theological research tool for charismatic Christians. Read this skill at the start of every Rhemata work session before doing anything else. Trigger whenever Alex mentions Rhemata, the theological research app, the RAG project, or any of its components (ingestion, chat, citations, frontend, backend).
Resources
7Install
npx skillscat add alxwhitley/rhemata Install via the SkillsCat registry.
Rhemata — Project Skill
What It Is
Rhemata (ῥήματα) is an AI-powered theological research tool targeting charismatic and Spirit-filled Christians. Users ask natural language questions and receive answers drawn from a curated library of theological documents, with inline citations pointing back to the source.
The primary product model is Magisterium AI. The primary UX model is Perplexity — centered chat input, inline citations, clickable source panel.
Who It's For
Charismatic and Spirit-filled Christians who want to research theology from within their tradition. The content library is built from documents Alex personally owns and has rights to — sermon outlines, theology papers, and similar material. New Wine Magazine extraction and ingestion pipeline is now operational (4 articles ingested from issue 03-1973, full 300-issue batch pending).
Repo & Git
- Git repo initialized and pushed to
alxwhitley/rhemataon GitHub .gitignorecovers.env,.env.local,__pycache__,.venv,node_modules,.next,.DS_Store
Monorepo Structure
repo/
├── frontend/ # Next.js 16 app (Vercel)
├── backend/
│ ├── app/ # FastAPI Python package
│ │ ├── main.py
│ │ ├── auth.py
│ │ ├── routers/
│ │ │ ├── chat.py # /chat endpoint — retrieval + LLM
│ │ │ ├── search.py # /search + /search/documents endpoints
│ │ │ ├── document.py # /document/{id} + /document/{id}/article
│ │ │ ├── study.py # /study/verse + /study/corpus + /study/lexicon + /study/excerpt endpoints
│ │ │ └── ingest.py # /ingest endpoint
│ │ ├── services/
│ │ ├── db/
│ │ ├── system_prompt.txt
│ │ └── theological_guardrails.txt
│ ├── requirements.txt # pinned via pip freeze
│ ├── railway.toml
│ └── nixpacks.toml # locks Python 3.9
├── sources/
│ ├── youtube/ # YouTube transcript pipeline
│ │ ├── raw/ # Freshly scraped transcripts
│ │ ├── cleaned/ # Groq-cleaned, ready for ingest
│ │ ├── ingested/ # Already in Supabase
│ │ └── youtube_tracker.xlsx
│ ├── magazine/ # New Wine Magazine pipeline
│ │ ├── 01_to_extract/ # Drop PDFs here (~198 issues)
│ │ ├── 02_extracted/ # Per-issue .md articles + raw_text.txt
│ │ ├── 03_approved/ # Reviewed and approved for ingest
│ │ ├── 04_ingested/ # Completed issues
│ │ ├── 05_archived/ # Original PDFs after extraction
│ │ └── rhemata_tracker.xlsx
│ └── documents/ # Non-copyrighted docs (sermons, papers)
│ └── ingested/ # Already in Supabase
├── scripts/ # All pipeline scripts
│ ├── scrape_youtube.py # YouTube transcript scraper (yt-dlp + Supabase dedupe, raw only — no cleaning)
│ ├── youtube_pipeline.sh # Full YouTube pipeline: scrape → clean → ingest
│ ├── whisper_transcribe.py # Whisper medium + Groq clean (batch from no_captions/ or single URL)
│ ├── clean_transcripts.py # Clean raw transcripts via Groq Llama 3.3 70B
│ ├── fix_article_json.py # One-off migration: fix raw JSON chunks in Supabase (run 2026-04-17, 30 fixed)
│ ├── extract_magazine.py # 3-pass Gemini/Groq extraction pipeline
│ ├── ingest_magazine.py # Supabase ingestion from .md files with frontmatter
│ ├── ingest.py # Standalone PDF/docx/txt ingestion with auto-tagging; moves YouTube transcripts to ingested/ on success
│ ├── tag_existing_articles.py # Backfill topic_tags on existing articles
│ └── tag_sermons_transcripts.py # Backfill topic_tags on sermons/transcripts/papers
├── scrape_preceptaustin.py # Precept Austin word study scraper (page caching, multi-strategy anchor matching)
├── ingest_preceptaustin.py # Precept Austin word study ingestion (psycopg2 chunks, OpenAI embeddings)
├── ingest_lexicon.py # STEPBible lexicon ingestion (TBESG, TBESH, TFLSJ)
├── ingest_bible.py # WEB Bible VPL ingestion into verses table (psycopg2)
├── migrations/ # SQL migrations (run in Supabase SQL Editor)
├── taxonomy.md # 100-tag topic taxonomy (8 categories)
├── CLAUDE.md # Claude Code context
└── SKILL.md # Full project skill context- All imports use
from app.x import y(absolute, not relative) requirements.txtpinned to exact versions, includestiktoken- Railway start command:
uvicorn app.main:app --host 0.0.0.0 --port $PORT
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 16 (React 19), Tailwind CSS 4, deployed to Vercel |
| Backend | Python 3.9 / FastAPI, deployed to Railway |
| Database | Supabase (PostgreSQL + pgvector) |
| Embeddings | OpenAI text-embedding-3-small (1536 dims) |
| Answer Generation LLM | Anthropic Claude Sonnet 4.5 (claude-sonnet-4-5) via anthropic SDK |
| Query Expansion / Metadata / Tagging / Transcript Cleaning LLM | Groq Llama 3.3 70B (llama-3.3-70b-versatile) |
| Vision / OCR (magazine extraction) | Gemini 2.5 Flash (gemini-2.5-flash) via google-genai SDK |
| Reranking | Cohere rerank-v3.5 (cohere SDK) — narrows top 10 RRF → top 5 |
| Retrieval | Hybrid search: pgvector + PostgreSQL FTS, fused via RRF |
| Markdown rendering | react-markdown + @tailwindcss/typography |
Removed: GPT-4o Vision (replaced by Gemini 2.5 Flash). Groq for answer generation (replaced by Anthropic Claude Sonnet 4.5, April 2026).
Architecture
Frontend → Backend → Supabase → LLM
- User types a query in the chat interface
- Frontend POSTs to
/chaton the FastAPI backend (field:question, plusanon_idfor guests) - Backend expands query into 3 semantic variants via Groq Llama 3.3 70B (
expand_query()) - For each variant: pgvector cosine similarity (top 40) + PostgreSQL full-text search (top 30)
- Results fused via Reciprocal Rank Fusion (RRF_K=60), deduplicated, document-level collapse (max 2 chunks per doc), top 10 selected
- Cohere rerank-v3.5 narrows top 10 → top 5 by relevance (graceful fallback to top 10 if COHERE_API_KEY unset)
- Neighbor chunk expansion (±1 chunk_index, cap at 12 total)
- Backend assembles prompt: system instructions + theological guardrails + retrieved chunks (tagged with
source_kindandcitation_mode) + query - Anthropic Claude Sonnet 4.5 generates a response, streamed back via SSE with
<answer>tag extraction. Runtime-appended faithfulness instruction preserves source document views without editorializing. Theological guardrails (theological_guardrails.txt) appended to system prompt enforce non-negotiable framings (e.g., Holy Spirit personhood). - Frontend renders response with inline citation tags
Database Schema
documents table — one row per source document
id(uuid),title(text),author(text)source_name(text),source_type(text),source_kind(text)citation_mode(text) —'citable'|'silent_context'is_copyrighted(boolean, default false)year(int),issue(text),url(text, nullable)topic_tags(text[]) — assigned from taxonomybible_references(text[], default'{}') — canonical refs like"Romans 8:28"; GIN indexedfts_weighted(tsvector) — weighted FTS on title (A), author (A), source_name (B), bible_references (C, colons stripped)content_summary(text) — first chunk content for displaycreated_at(timestamptz)
chunks table — one row per text chunk
id(uuid),document_id(FK → documents)content(text),embedding(vector(1536))chunk_index(int),created_at(timestamptz)
verses table — Bible verse text (WEB translation)
id(uuid),verse_id(text, unique — format:SBL.CHAPTER.VERSE, e.g.JHN.3.16)book(text — full name),book_num(int — 1-66),chapter(int),verse(int)text(text),translation(text, default'WEB')created_at(timestamptz)- Indexes on
verse_idand(book, chapter, verse)
saved_words table — user's saved Greek words for Study mode
id(uuid),user_id(uuid, FK → auth.users, cascade delete)strongs_number(text),greek_word(text),transliteration(text),english_gloss(text, nullable)created_at(timestamptz)- Unique constraint on
(user_id, strongs_number) - RLS enabled: users can only manage their own rows (
auth.uid() = user_id)
guest_sessions table — server-side guest query tracking
id(uuid),anon_id(text, unique)query_count(int, default 0)created_at/last_seen(timestamptz)
conversations table — saved chat history for authenticated users
id(uuid),user_id(uuid, FK → auth.users),title(text),created_at
messages table — individual messages within conversations
id(uuid),conversation_id(FK → conversations)role(text: 'user' | 'assistant'),content(text),created_at
Key Decisions Already Made
- HNSW indexing over ivfflat for pgvector;
match_chunkssetshnsw.ef_search=200 - Page-level citations — not chunk-level
- Two-tier content model:
citation_mode = 'citable'(renders citations) vs'silent_context'(informs LLM only) - Magazine chunking: tiktoken cl100k_base, 550 tokens target, 80 overlap
- Standalone ingest: recursive character text splitting, 1000 char chunks, 200 char overlap
- Hybrid search with RRF — query expansion (3 variants via Groq) → vector + FTS per variant → RRF (K=60) → document collapse → top 10
- CORS middleware —
ALLOWED_ORIGINSenv var (comma-separated) - Guest query limit — 6 free queries via
guest_sessions+increment_guest_queryRPC - JWT auth via Supabase JWKS endpoint (
PyJWKClient) - Bible Study articles excluded from extraction pipeline (reference materials, not theological teaching)
- Ingest auto-tagging — ingest.py tags every new document post-chunk-insert via Groq Llama 3.3 70B; strict 3–6 tags, main themes only, non-fatal
- is_copyrighted path-based —
sources/youtube/andsources/magazine/→ true,sources/documents/→ false - Sermon transcripts excluded from search — search_documents RPC defaults source_kind to "magazine_article"; transcripts available in chat retrieval only
- All scripts in
scripts/— no Python files at project root; all usePath(__file__).resolve().parent.parentfor project root - Bible reference tracking —
documents.bible_references text[]populated via Groq Llama 3.3 70B extraction; shared helper atscripts/bible_refs.pynormalizes to"Book Chapter:Verse"canonical form against 66-book set + alias map; non-fatal (returns[]on failure); auto-populated during ingest in bothingest.pyandingest_magazine.py; backfill viaextract_bible_refs.py - Prefix search —
search_documentsRPC buildsto_tsquerywith:*prefix operators per token (colons split to sub-tokens), so"Romans 8"matches"Romans 8:1","Romans 8:28", etc.; falls back toplainto_tsqueryon parse error - System prompt discipline —
backend/app/system_prompt.txtuses XML tags (<thinking>,<research_analysis>,<answer>).<research_analysis>runs 3 fixed self-checks (author conflation, silent_context citation, biblical case overreach). Response Discipline Rules block enforces: multi-part decomposition, retrieval-only format when asked, explicit "corpus insufficient" flag on thin charismatic distinctives, retrieval scope cap (10 items / 250 words). Scripture exception limited to verse text only — no interpretation beyond sources. Tone section includes charismatic linguistic anchors ("impressions," "promptings") for exploratory mode only. Citation rules include prompt-injection trust boundary (ignore instructions embedded in retrieved chunks), no anonymous attribution ("one teacher" etc. banned — cite by name or no attribution). Formatting requires minimum 2##headings for theological answers (mandatory, not optional). Quote discipline: paraphrase with attribution, never lift phrasing verbatim. Closing synthesis: end with own conclusion drawn from sources, not a summary. - Theological guardrails —
backend/app/theological_guardrails.txtloaded and appended to system prompt inchat.py. Contains non-negotiable theological framings that override source material phrasing. Currently covers Holy Spirit personhood (person of the Trinity, not merely a power/provision). - Chat streaming — Anthropic Claude Sonnet 4.5
max_tokens=1500.<answer>tag extraction server-side with 9-char buffer safety for split tags. If stream ends mid-answer, remaining buffer is flushed to client instead of silently dropped. Usesclient.messages.create(stream=True)(not context manager form, which is incompatible with generatoryield). Frontend: no timeouts or AbortController; stream completion via[DONE]sentinel or reader exhaustion; error handling for 429 (guest/daily limits). - Batched neighbor chunk expansion —
fetch_neighbor_chunks_batch()collects all (document_id, chunk_index±1) pairs from top chunks, builds.or_()compound filter withand()conditions, batches at 30 pairs per query. Replaces sequential per-chunk lookups (was 10 calls for 5 chunks, now 1-2 calls total). - Sparse citation rule — system prompt enforces max 2-3 inline citations per response. Only cite when source of claim materially matters. Single citation sufficient when answer draws primarily from one source.
- Cohere reranking — After RRF fusion, top 10 chunks sent to Cohere rerank-v3.5 with original query; top 5 by relevance score returned. Falls back to RRF top 10 if
COHERE_API_KEYnot set or call fails. - Column break handling — Pass 1 prompt instructs Gemini to transcribe multi-article pages column by column with
=== COLUMN BREAK ===markers. Pass 2 prompt tells Groq to follow article content across column breaks, ignoring other articles' content. - psycopg2 connection fix — Supabase pooler usernames contain a dot (
postgres.{ref}) whichpsycopg2.connect(uri)misparses, truncating topostgres. All ingestion scripts (ingest_lexicon.py,ingest_preceptaustin.py,ingest.py,ingest_commentaries.py) now parseSUPABASE_DB_URLwithurlparseand pass explicit keyword args (host,port,user,password,dbname).
Content Rules
- Only ingest documents that Alex personally owns or has rights to
- New Wine Magazine pipeline is operational —
is_copyrighted=true, controlled byINCLUDE_COPYRIGHTEDenv var INCLUDE_COPYRIGHTED=truein local.envand defaults true inchat.py- Current non-magazine documents are single-column — no multi-column OCR handling needed
Magazine Extraction Pipeline (3-pass)
Input: PDF in sources/magazine/01_to_extract/
Output: Per-article .md files in sources/magazine/02_extracted/{issue_stem}/
Pass 1: Vision Extraction (Gemini 2.5 Flash)
- Converts PDF pages to PIL images at 200 DPI via
pdf2image - Processes in 5-page batches to avoid output truncation
- Each batch gets explicit page numbering instructions (
=== PAGE N ===) - Outputs
raw_text.txtwith full issue transcription
Pass 2: Article Segmentation (Groq Llama 3.3 70B)
- Step 2a: Extracts TOC from pages 2-3, sends full text to Groq for metadata index (JSON array of title/author/page_start/page_end)
- Step 2b: For each article, extracts page range text and sends to Groq for body extraction + topic tagging
- Returns JSON:
{"topic_tags": [...], "body": "..."} - Tags validated against
VALID_TAGSset — invalid tags removed - Outputs individual
.mdfiles with frontmatter metadata
Pass 3: QA Inspection (Groq Llama 3.3 70B)
- Checks each article for: truncation, duplicates, mismatch, word count (min 200), OCR errors
- Returns JSON:
{"status": "PASS"|"WARN"|"FLAG", "issues": [...], "confidence": 0.0-1.0} - FLAG articles moved to
flagged/subfolder - WARN articles get
<!-- QA WARNINGS -->comment prepended - Outputs
qa_report.json
Article Format
Each article saved as .md with frontmatter:
---
TITLE: Article Title
AUTHOR: Author Name
ISSUE: 03-1973
DATE: March 1973
PAGE_START: 4
PAGE_END: 10
SOURCE_TYPE: magazine_article
TOPIC_TAGS: Fivefold Ministry, Prophetic Ministry, Biblical Leadership
---
# Article Title
*by Author Name*
Body text formatted as markdown...Exclusions
- Bible Study, Bible Lesson, Study Guide articles excluded from extraction
- Letters to editor, order forms, subscription info, staff boxes, ads excluded
- Cover/back cover, full-page illustrations, advertisement pages skipped in Pass 1
YouTube Pipeline
- Scrape:
python3 scripts/scrape_youtube.py— scrapes transcripts via yt-dlp from channels in youtube_tracker.xlsx, dedupes against Supabase, saves raw transcripts tosources/youtube/raw/(max 10 per run). Videos with no captions or low-quality transcripts (< 400 words) write metadata stubs tosources/youtube/no_captions/for Whisper processing. - Clean:
python3 scripts/clean_transcripts.py— cleans via Groq Llama 3.3 70B, moves tosources/youtube/cleaned/ - Whisper:
python3 scripts/whisper_transcribe.py— batch-processes stubs inno_captions/, downloads audio via yt-dlp, transcribes with Whisper medium, cleans via Groq, outputs tocleaned/. Also supports single-URL mode with--url --title --speaker --channel. - Ingest:
python3 scripts/ingest.py— ingests cleaned transcripts into Supabase with auto-tagging. Moves successfully ingested files fromcleaned/toingested/viashutil.move.
Convenience script: ./scripts/youtube_pipeline.sh runs all 4 steps in sequence (set -euo pipefail — stops on failure). Shell alias: rh-youtube (in ~/.zshrc).
Transcript files include metadata headers (TITLE, SPEAKER, URL, SOURCE_TYPE) parsed by ingest.py.
Topic Tagging
taxonomy.mdin project root contains 100-tag taxonomy across 8 categories:- Holy Spirit & Spiritual Gifts (14 tags)
- Charismatic Experience (14 tags)
- Prayer & Spiritual Warfare (14 tags)
- Healing & Wholeness (14 tags)
- Christian Leadership (15 tags)
- Christian Growth & Discipleship (14 tags)
- Kingdom & Theology (14 tags)
- Family & Relationships (13 tags)
- Tags assigned during Pass 2 extraction (5-8 per article)
- Strict rules: only assign if article directly teaches on topic for at least one paragraph
- Validated against
VALID_TAGSset in bothextract_magazine.pyandtag_existing_articles.py - Invalid/invented tags automatically removed
tag_existing_articles.pyfor backfilling existing magazine articles (retries if < 3 valid tags)tag_sermons_transcripts.pyfor backfilling non-magazine documents (3-6 tags, retries if < 2 valid)
Search Feature
- GET /search/documents — document-level FTS via
search_documentsRPC function- Parameters:
q,author,source_kind,include_copyrighted - Returns: id, title, author, issue, year, highlighted_snippet, rank
fts_weightedcolumn includes title (A), author (A), source_name (B), bible_references (C, colons stripped)- Prefix tsquery builder — tokenizes query, strips non-alphanumerics, appends
:*to each token, AND-joins;"Romans 8"matches"Romans 8:1","Romans 8:28", etc. Falls back toplainto_tsqueryon parse error. ts_headlinegenerates keyword-highlighted snippets from best-matching chunk- Markdown/metadata stripped from snippets via nested
regexp_replace - Fallback to first 200 chars if no FTS match in chunk content
source_kinddefaults to "magazine_article" — excludes sermon_transcript from search results
- Parameters:
- GET /document/{id}/article — reassembles full article from chunks
- Strips per-chunk metadata headers, trims overlap, strips markdown bold/italic
- Cleans author (truncates at parenthesis)
- GET /search/documents/browse — lists all documents of a source_kind, ordered by year/issue DESC
- Parameters:
source_kind,include_copyrighted - Returns same shape as search_documents (id, title, author, issue, year, topic_tags, highlighted_snippet=null, rank=0)
- Both
/search/documentsand/search/documents/browsereturntopic_tags(secondary lookup on doc IDs for search; direct select for browse)
- Parameters:
- Search page at /search — sidebar, search bar, result cards, article reader
- Browse listing on initial load (all magazine articles, before any search)
hasSearchedstate flag distinguishes "no search yet" (show browse) vs "searched with no results" (show empty state)- Result cards show author-only metadata (no date/year/issue)
- Topic tag pills on cards: rounded,
#d4b96agold text onrgba(212, 185, 106, 0.12)background ReactMarkdownrenders article body in reader view (title/byline stripped to avoid duplication)dangerouslySetInnerHTMLrenders<mark>highlighted snippets in result cardsmarkstyled with gold color (#d4b96a), transparent background, font-weight 600
Scripts
| Script | Purpose |
|---|---|
scripts/extract_magazine.py |
3-pass Gemini/Groq extraction pipeline (Vision → Segmentation → QA). Supports --max-issues N and --time-limit. Continuation resolver (BFS, depth 5) handles "continued on page N" markers. PDFs archived into 02_extracted/{issue_stem}/ after extraction. Empty Gemini batches log warning + substitute "" (non-fatal). |
scripts/ingest_magazine.py |
Ingest approved .md articles from sources/magazine/03_approved/ into Supabase. Auto-populates bible_references. Archives PDFs to 05_archived/ on success. |
scripts/ingest.py |
Standalone PDF/docx/txt ingestion with auto-tagging (3–6 tags, Groq, non-fatal). Auto-populates bible_references. Skip reason tracking: ingest_file() returns (status, reason) tuples; main() prints grouped summary table of all skipped/failed files with reasons at end of run. Uses psycopg2 direct query for already_ingested() (same DB_PARAMS pattern as other scripts). |
scripts/bible_refs.py |
Shared Bible reference extractor (Groq Llama 3.3 70B). extract_bible_references(content) -> List[str]. Segments at ~12k chars, normalizes against 66-book canonical set + alias map, dedupes. Non-fatal (returns []). |
extract_bible_refs.py (project root) |
Backfill bible_references on all documents. Flags: --dry-run, --force (re-process docs that already have refs). |
scripts/tag_existing_articles.py |
Backfill topic_tags on existing magazine articles via Groq |
scripts/tag_sermons_transcripts.py |
Backfill topic_tags on existing sermon/transcript/paper documents via Groq |
scripts/youtube_pipeline.sh |
Full YouTube pipeline convenience script: scrape → clean → whisper → ingest. Shell alias: rh-youtube. |
scripts/scrape_youtube.py |
YouTube transcript scraper (yt-dlp, Supabase dedupe, max 10 per run). Writes no_captions stubs for videos without captions or with < 400 words. |
scripts/whisper_transcribe.py |
Whisper medium transcription + Groq cleaning. Batch mode processes no_captions/ stubs; single-URL mode via CLI args. |
scripts/clean_transcripts.py |
Clean raw transcripts via Groq Llama 3.3 70B, move to cleaned/ |
scripts/generate_excerpts.py |
Batch-generate edited word study articles from Precept Austin raw chunks. Concatenates all chunks per document, sends to Anthropic Claude for editing into clean articles, writes to excerpts table (excerpt_type = 'word_study_article'). Flags: --test, --test-quality, `--model sonnet |
scripts/ingest_commentaries.py |
Ingests HistoricalChristianFaith commentaries from SQLite DB. Groups by father_name, one document per father, chunks with tiktoken, embeds with OpenAI, inserts via psycopg2. Single-transaction pattern (connect_with_retry() + ingest_father()). Theological tagging (Reformed, Cessationist, Charismatic-Friendly, Desert Fathers, Patristic). Flags: --dry-run, --father "Name", --filter-charismatic. |
scripts/fix_article_json.py |
One-off migration: fixed 30 chunks with raw JSON content in Supabase (run 2026-04-17). |
Deleted: merge_articles.py (replaced by Pass 2 per-article segmentation)
Root-Level Ingestion Scripts
| Script | Purpose |
|---|---|
scrape_preceptaustin.py |
Scrapes Precept Austin Greek word studies. Page caching to sources/precept_austin/page_cache/, randomized sleep (2-5s), 4-strategy anchor matching (exact → case-insensitive → partial → reverse word), quality filters (<100 words, nav bleed, fragmented). --fetch runs full pipeline, --test limits to 10 entries. Outputs to sources/precept_austin/raw/ + index.json. |
ingest_preceptaustin.py |
Ingests Precept Austin word studies into Supabase. Chunks via psycopg2 execute_values with ::vector cast. Documents via Supabase client. Skip logic checks excerpts table for existing word_study_article excerpt (not just document existence). Reuses existing doc_id when document exists but has no excerpt. Uses backend/app/services/chunker.py (550 tokens, 80 overlap). |
ingest_lexicon.py |
Ingests STEPBible lexicon files (TBESG, TBESH, TFLSJ). One chunk per lexical entry. tiktoken truncation at 8000 tokens for embedding. Resume-safe (tracks existing chunk counts). CLI flags: --lexicon TBESG|TBESH|TFLSJ, --delete (removes existing data first), --sample N. Brief mode for TBESG: stores only gloss + bolded sub-meanings (no full Abbott-Smith HTML). Chunk inserts via psycopg2 execute_values with ON CONFLICT DO NOTHING. |
ingest_bible.py |
Parses WEB VPL file, maps 66 canonical books (VPL→SBL abbreviations), inserts into verses table via psycopg2. Batch size 1000, ON CONFLICT DO NOTHING. --test limits to 100 verses. Skips deuterocanonical books. |
Note: ingest_commentaries.py is now in scripts/ (see Scripts table above).
Data sources:
sources/precept_austin/— word study.txtfiles +index.json+page_cache/(gitignored)sources/lexicon/— STEPBible TSV files (TBESG, TBESH, TFLSJ)sources/bible/eng-web_vpl.txt— World English Bible verse-per-line file
Corpus (as of 2026-05-23)
- 2,617 documents total, 124,346 chunks
- By source_kind: 1,779 word_study, 557 sermon_transcript, 186 commentary, 58 unknown, 33 magazine_article, 4 lexicon
- By source_type: 1,783 background, 557 sermon, 186 commentary, 49 book, 33 magazine_article, 5 paper, 4 other
- Copyrighted: 1,862 | Non-copyrighted: 755
- Lexicons: TBESG 11,034 chunks (complete), TBESH 10,258 chunks (complete), TFLSJ 15,767 chunks across 2 docs (complete)
- Verses: 31,098 rows (WEB, 66-book Protestant canon, complete)
- Excerpts: 1,713 of 1,779 word_study docs have generated articles (96%, 66 remaining)
- All 38 original backfilled docs have
bible_referencespopulated (2026-04-10)
UX Model
- Centered chat input as primary interaction
- Perplexity-style inline citations rendered as gold-highlighted tags
- Clicking a citation opens a source panel with document title, author, and page content
- Sidebar: Shared across all routes. "Rhemata" wordmark, gold "New Chat" CTA (
#b49238), nav items (Chat/Discover/Study), conditional content (Recents on chat, Saved Words on study). Hover viaonMouseEnter/onMouseLeaveinline handlers (#262624). - Search page at
/searchwith keyword search, browse-all default listing, result cards with topic tag pills, and full article reader - Auth flow: login modal triggered by AuthButton, sidebar sign-in link, or guest limit reached
- Guest users get 6 free queries before prompted to sign up
Brand
- Name: Rhemata
- Fonts: Lora (headings/logo), Inter (UI/body)
- Dark theme — near-black backgrounds, warm neutrals
- Gold accents —
#d4b96a(citations/highlights/tag pills),rgba(212, 185, 106, 0.12)(tag pill backgrounds) - Voice: Scholarly but accessible. Conviction, not performance. Serves the researcher, not the spectacle.
Deployment
| Target | Status | Notes |
|---|---|---|
| Railway (backend) | Live | Root dir: backend/, Python 3.9 via nixpacks.toml |
| Vercel (frontend) | Live | Root dir: frontend/ |
| Supabase | Live | PostgreSQL + pgvector |
Backend env vars (Railway)
SUPABASE_URL,SUPABASE_SERVICE_KEYOPENAI_API_KEY,GROQ_API_KEY,ANTHROPIC_API_KEY,COHERE_API_KEYSUPABASE_JWT_JWKS_URLALLOWED_ORIGINSINCLUDE_COPYRIGHTED
Frontend env vars (Vercel)
NEXT_PUBLIC_API_URLNEXT_PUBLIC_SUPABASE_URL,NEXT_PUBLIC_SUPABASE_ANON_KEY
Environment Variables (local — backend/app/.env)
GROQ_API_KEYOPENAI_API_KEYANTHROPIC_API_KEY— Claude Sonnet 4.5 for answer generationCOHERE_API_KEY— Cohere rerank-v3.5 for retrieval rerankingGOOGLE_API_KEY— Gemini 2.5 Flash for magazine extractionSUPABASE_URLSUPABASE_SERVICE_KEYSUPABASE_JWT_JWKS_URLINCLUDE_COPYRIGHTED—true/false(defaulttruein chat.py,falsein search.py)ALLOWED_ORIGINSSUPABASE_DB_URL— direct PostgreSQL connection string for psycopg2 (bypasses PostgREST timeouts). Used byingest_bible.py,ingest_preceptaustin.py,ingest_lexicon.py(psycopg2 variant).
Study Mode (frontend)
- Route:
/study— Study Mode page with tab-driven layout - Layout: Shared sidebar (w-64) with Saved Words list | Left panel (380px fixed: search, verse card, chapter view) | Right panel (flex:1: tab content)
- Sidebar: Shared
sidebar.tsxacross Chat and Study. UsesusePathname()for active route. Chat shows Recents; Study shows Saved Words. Nav items: Chat (MessageSquare), Discover (Compass), Study (BookOpen). Gold "New Chat" CTA (#b49238). Hover standard:onMouseEnterbg#262624,onMouseLeavetransparent. - Saved words:
saved_wordstable in Supabase (migration 017). RLS policy scoped toauth.uid(). Toggle save/unsave from definition panel bookmark icon. Sidebar shows English gloss + Strong's number, transliteration below. - Verse lookup: Direct Supabase query from frontend (
versestable, keyed byverse_id). Client-sideparseRef()with full 66-bookBOOK_MAP+ABBREV_TO_NAME. - Left panel: Search bar, verse card, "View Chapter" button, chapter view. No interlinear or definition content.
- Chapter view: Flowing text with inline
<sup>verse numbers. Verses queried via.like("verse_id", "JHN.1.%")pattern. Active verse highlighted with#2f2f2cbackground. Clicking a verse in chapter view loads it as the active verse. - Right panel tabs:
CorpusTabtype:"commentaries" | "word_study" | "jewish". Tab state lifted to parentStudyPagefor cross-panel coordination.- Commentary tab: Corpus results from
GET /study/corpus(commentaries, sermons, magazine articles). - Word Study tab: Interlinear word blocks → definition panel → Precept Austin excerpt → "From the Library" corpus results. Interlinear fetch gated by
corpusTab === "word_study"(not fetched on every verse change). Auto-selects first interlinear word when tokens load. - Jewish Perspective tab: Inline generate flow (no modal). Empty state with "Generate Jewish Perspective" button → disclaimer text → "Confirm & Generate" button → spinner → cached result.
jpCacheCheckedstate tracks whether cache has been checked.
- Commentary tab: Corpus results from
- Excerpt panel:
GET /study/excerpt?strongs=G####endpoint returns Precept Austin word study article for selected Strong's number. Triesexcerptstable first, falls back to concatenated chunks. - Corpus panel ("From the Library"): Backend
GET /study/corpus?verse=...&transliteration=...endpoint. Embeds query via OpenAI, runsmatch_chunksRPC (match_count=20, include_copyrighted=true), filters tocitation_mode='citable'+source_kind IN ('sermon_transcript', 'magazine_article', 'commentary', 'word_study'), dedupes by document, returns top 5. Frontend shows skeleton loader, empty state, or real results. - Definition panel: Fetches lexicon data from
GET /study/lexicon?strongs=G####endpoint. Displays gloss inline with transliteration/Strong's number, plus parsed definition and usage from TBESG chunks. - Backend router:
backend/app/routers/study.py—GET /study/verse(parses ref, queries verses table),GET /study/corpus(semantic search),GET /study/lexicon(TBESG lexicon lookup),GET /study/excerpt(word study excerpt by Strong's number),GET /study/interlinear(interlinear words by verse_id),GET /study/commentary(commentary semantic search),GET /study/wordsearch+GET /study/wordstudy/{document_id}(word study search and detail). Registered inmain.pywithprefix="/study". - Interlinear blocks: Inline styles for spacing/sizing (not Tailwind — classes weren't taking effect). Hover color
#2f2f2c(since resting bg is#262624). No bookmark icons on word blocks. Now rendered in Word Study tab of right panel (not left panel). - Interlinear data: Live from
interlinear_wordstable (142,096 rows). BackendGET /study/interlinear?verse_id=JHN.1.1returns word data ordered by word_position.
Remaining / Known Issues
- Full 300-issue batch not yet run — only 4 articles ingested from issue 03-1973
- Migration 012 not yet run — needs to be applied in Supabase SQL Editor (Migration 013 for
bible_referencesis applied as of 2026-04-10) sources/youtube/youtube_tracker.xlsxstill tracked — needsgit rm --cached sources/youtube/youtube_tracker.xlsxto finish the earliersources/cleanup. Shows up as modified on every commit.- Issue_03-1973 cleanup A/B/C options never resolved in the continuation-resolver session — was left in
02_extracted/in an uncertain state. - scrape_youtube.py dead Haiku code — removed (2026-04-15)
- content_summary not auto-populated on new article inserts (trigger only updates fts_weighted, not content_summary)
- Tagging retry logic sometimes needs improvement for complex articles
- Guest query limit: increment_guest_query() SQL function needs migration file
- RLS policies needed: DONE. RLS enabled on both conversations and messages tables
- INCLUDE_COPYRIGHTED not confirmed on Railway: check dashboard
- poppler no longer required — pdf2image replaced by PyMuPDF (fitz) in extract_magazine.py
- Bible ref extraction occasionally produces malformed JSON from Groq on edge-case batches (~1 in 38 docs in backfill run). Helper handles gracefully by dropping that segment and continuing; other segments in the same doc still succeed.
- System prompt and chat.py changes deployed (2026-04-15) — pushed to main; Railway/Vercel should auto-deploy.
- Anthropic + Cohere rerank deployed (2026-04-17) — answer gen switched to Claude Sonnet 4.5, Cohere rerank-v3.5 added. Pushed to main.
- Article reader date display: issue date (month/year) added to frontend but not yet visually confirmed in browser.
- console.log left in handleCardClick for debugging; remove after confirming.
- 30 malformed JSON chunks fixed (2026-04-17): fix_article_json.py migration ran successfully; content_summary refreshed on all 30 affected documents.
- Shell aliases expanded: 10 rh-* aliases in ~/.zshrc covering all pipeline scripts.
- Proposed but unapplied system prompt changes (2026-04-29 session): example response section, retrieval formatting (bullets→headings+prose), softer Holy Spirit guardrails revision, Niagara Falls metaphor ban, "Go Deeper" follow-up questions, NIV translation preference. All shown as diffs but not confirmed by user.
- Migration 016: DONE. verses table created, 31,098 rows ingested via ingest_bible.py
- scrape_preceptaustin.py hardened version not yet re-run: page cache from first run will speed up re-run; first run had 2.6% success rate before DOM fix
- ingest_preceptaustin.py not yet run: depends on successful scrape completion
- TBESG re-ingestion: DONE. 11,034 chunks ingested in brief mode (gloss + bold sub-meanings)
- Debug logging in study.py lexicon endpoint: verbose print statements added for TBESG debugging. Remove after confirming lexicon endpoint works correctly with re-ingested TBESG data.
- Study Mode interlinear data is placeholder only: DONE. Live from interlinear_words table (142,096 rows), fetched via GET /study/interlinear endpoint.
- Migration 017 (saved_words): DONE. saved_words table exists with RLS enabled
- 10 YouTube transcripts ingested (2026-05-22): side effect of running skip tracking test against youtube/ingested/. These had different MD5 hashes than stored (likely re-cleaned). 1 duplicate ("Your Calling Is Holy" by Derek Prince) was found and the older copy deleted.
- mode-toggle.tsx is orphaned: frontend/components/rhemata/mode-toggle.tsx exists but is no longer imported anywhere. Can be deleted.
- Word study excerpt generation 96% complete: 1,713 of 1,779 word_study documents have excerpts generated (excerpts table, excerpt_type = 'word_study_article'). 66 remaining. Script: scripts/generate_excerpts.py.
- Commentary ingestion not yet run: scripts/ingest_commentaries.py rewritten and pushed, needs to be executed (325 fathers, 82,567 rows from SQLite DB at /tmp/commentaries-db/data.out).
- Hebrew word study pipeline ready: scrape_preceptaustin.py --language hebrew --fetch and ingest_preceptaustin.py --language hebrew ready to run.
- Interlinear deployment verification needed: confirm GET /study/interlinear works on live Railway (tested locally only).
- SUPABASE_DB_URL psycopg2 connection refused: FIXED. Dotted username was being truncated by psycopg2 URI parsing. All scripts now use urlparse + explicit keyword args.
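The SUPABASE_DB_URL fix noted above (urlparse + explicit keyword args) might look something like the sketch below. The helper name db_kwargs_from_url and the example URL are hypothetical; the scripts in this repo may structure it differently. The returned keys (host, port, user, password, dbname) are standard psycopg2.connect keyword arguments.

```python
from typing import Any, Dict
from urllib.parse import unquote, urlparse


def db_kwargs_from_url(db_url: str) -> Dict[str, Any]:
    """Split a Postgres URL into explicit connection keyword arguments.

    Hypothetical helper: parsing the URL ourselves keeps a dotted username
    (e.g. "postgres.projectref") intact instead of relying on URI parsing
    inside the driver.
    """
    parsed = urlparse(db_url)
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # default Postgres port if none given
        "user": unquote(parsed.username or ""),
        "password": unquote(parsed.password or ""),  # decode %-escapes
        "dbname": parsed.path.lstrip("/"),
    }


# Usage (not executed here; requires psycopg2 and a real database):
#   import os, psycopg2
#   conn = psycopg2.connect(**db_kwargs_from_url(os.environ["SUPABASE_DB_URL"]))
```

Passing the pieces as keyword arguments also sidesteps URL-escaping problems when the password contains characters such as @ or /.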
How to Work on This Project
- Code changes always go to Claude Code in terminal — do not write or edit code in chat unless the change is trivial (1-2 lines)
- Alex works fast — short messages, quick pivots, direct feedback
- Surface risks and blockers before building, not after
- When Alex references a component, check the actual file before assuming structure
- Python 3.9 constraint: use Optional[str] not str | None
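As a small illustration of the Python 3.9 constraint: the str | None union syntax was added in Python 3.10, and on 3.9 an annotation written that way raises TypeError as soon as it is evaluated, so Optional[str] is the safe spelling. The function below is a hypothetical example, not a helper from this repo.

```python
from typing import Optional


def normalize_strongs(strongs: Optional[str] = None) -> Optional[str]:
    """Return an upper-cased Strong's number such as 'G3056', or None if absent.

    Hypothetical helper used only to show the Optional[str] annotation style.
    """
    if strongs is None or not strongs.strip():
        return None
    return strongs.strip().upper()
```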