rhemata

Full project context for Rhemata — Alex's AI-powered theological research tool for charismatic Christians. Read this skill at the start of every Rhemata work session before doing anything else. Trigger whenever Alex mentions Rhemata, the theological research app, the RAG project, or any of its components (ingestion, chat, citations, frontend, backend).

alxwhitley 0 Updated 1mo ago

Resources

GitHub

Install

npx skillscat add alxwhitley/rhemata

Install via the SkillsCat registry.

SKILL.md

Rhemata — Project Skill

What It Is

Rhemata (ῥήματα) is an AI-powered theological research tool targeting charismatic and Spirit-filled Christians. Users ask natural language questions and receive answers drawn from a curated library of theological documents, with inline citations pointing back to the source.

The primary product model is Magisterium AI. The primary UX model is Perplexity — centered chat input, inline citations, clickable source panel.

Who It's For

Charismatic and Spirit-filled Christians who want to research theology from within their tradition. The content library is built from documents Alex personally owns and has rights to — sermon outlines, theology papers, and similar material. New Wine Magazine extraction and ingestion pipeline is now operational (4 articles ingested from issue 03-1973, full 300-issue batch pending).

Repo & Git

Git repo initialized and pushed to alxwhitley/rhemata on GitHub
.gitignore covers .env, .env.local, __pycache__, .venv, node_modules, .next, .DS_Store

Monorepo Structure

repo/
├── frontend/          # Next.js 16 app (Vercel)
├── backend/
│   ├── app/           # FastAPI Python package
│   │   ├── main.py
│   │   ├── auth.py
│   │   ├── routers/
│   │   │   ├── chat.py       # /chat endpoint — retrieval + LLM
│   │   │   ├── search.py     # /search + /search/documents endpoints
│   │   │   ├── document.py   # /document/{id} + /document/{id}/article
│   │   │   ├── study.py      # /study/verse + /study/corpus + /study/lexicon + /study/excerpt endpoints
│   │   │   └── ingest.py     # /ingest endpoint
│   │   ├── services/
│   │   ├── db/
│   │   ├── system_prompt.txt
│   │   └── theological_guardrails.txt
│   ├── requirements.txt   # pinned via pip freeze
│   ├── railway.toml
│   └── nixpacks.toml      # locks Python 3.9
├── sources/
│   ├── youtube/               # YouTube transcript pipeline
│   │   ├── raw/               # Freshly scraped transcripts
│   │   ├── cleaned/           # Groq-cleaned, ready for ingest
│   │   ├── ingested/          # Already in Supabase
│   │   └── youtube_tracker.xlsx
│   ├── magazine/              # New Wine Magazine pipeline
│   │   ├── 01_to_extract/     # Drop PDFs here (~198 issues)
│   │   ├── 02_extracted/      # Per-issue .md articles + raw_text.txt
│   │   ├── 03_approved/       # Reviewed and approved for ingest
│   │   ├── 04_ingested/       # Completed issues
│   │   ├── 05_archived/       # Original PDFs after extraction
│   │   └── rhemata_tracker.xlsx
│   └── documents/             # Non-copyrighted docs (sermons, papers)
│       └── ingested/          # Already in Supabase
├── scripts/                   # All pipeline scripts
│   ├── scrape_youtube.py      # YouTube transcript scraper (yt-dlp + Supabase dedupe, raw only — no cleaning)
│   ├── youtube_pipeline.sh    # Full YouTube pipeline: scrape → clean → ingest
│   ├── whisper_transcribe.py   # Whisper medium + Groq clean (batch from no_captions/ or single URL)
│   ├── clean_transcripts.py   # Clean raw transcripts via Groq Llama 3.3 70B
│   ├── fix_article_json.py    # One-off migration: fix raw JSON chunks in Supabase (run 2026-04-17, 30 fixed)
│   ├── extract_magazine.py    # 3-pass Gemini/Groq extraction pipeline
│   ├── ingest_magazine.py     # Supabase ingestion from .md files with frontmatter
│   ├── ingest.py              # Standalone PDF/docx/txt ingestion with auto-tagging; moves YouTube transcripts to ingested/ on success
│   ├── tag_existing_articles.py  # Backfill topic_tags on existing articles
│   └── tag_sermons_transcripts.py  # Backfill topic_tags on sermons/transcripts/papers
├── scrape_preceptaustin.py # Precept Austin word study scraper (page caching, multi-strategy anchor matching)
├── ingest_preceptaustin.py # Precept Austin word study ingestion (psycopg2 chunks, OpenAI embeddings)
├── ingest_lexicon.py      # STEPBible lexicon ingestion (TBESG, TBESH, TFLSJ)
├── ingest_bible.py        # WEB Bible VPL ingestion into verses table (psycopg2)
├── migrations/            # SQL migrations (run in Supabase SQL Editor)
├── taxonomy.md            # 100-tag topic taxonomy (8 categories)
├── CLAUDE.md              # Claude Code context
└── SKILL.md               # Full project skill context

All imports use from app.x import y (absolute, not relative)
requirements.txt pinned to exact versions, includes tiktoken
Railway start command: uvicorn app.main:app --host 0.0.0.0 --port $PORT

Tech Stack

Layer	Technology
Frontend	Next.js 16 (React 19), Tailwind CSS 4, deployed to Vercel
Backend	Python 3.9 / FastAPI, deployed to Railway
Database	Supabase (PostgreSQL + pgvector)
Embeddings	OpenAI `text-embedding-3-small` (1536 dims)
Answer Generation LLM	Anthropic Claude Sonnet 4.5 (`claude-sonnet-4-5`) via `anthropic` SDK
Query Expansion / Metadata / Tagging / Transcript Cleaning LLM	Groq Llama 3.3 70B (`llama-3.3-70b-versatile`)
Vision / OCR (magazine extraction)	Gemini 2.5 Flash (`gemini-2.5-flash`) via `google-genai` SDK
Reranking	Cohere rerank-v3.5 (`cohere` SDK) — narrows top 10 RRF → top 5
Retrieval	Hybrid search: pgvector + PostgreSQL FTS, fused via RRF
Markdown rendering	`react-markdown` + `@tailwindcss/typography`

Removed: GPT-4o Vision (replaced by Gemini 2.5 Flash). Groq for answer generation (replaced by Anthropic Claude Sonnet 4.5, April 2026).

Architecture

Frontend → Backend → Supabase → LLM

User types a query in the chat interface
Frontend POSTs to /chat on the FastAPI backend (field: question, plus anon_id for guests)
Backend expands query into 3 semantic variants via Groq Llama 3.3 70B (expand_query())
For each variant: pgvector cosine similarity (top 40) + PostgreSQL full-text search (top 30)
Results fused via Reciprocal Rank Fusion (RRF_K=60), deduplicated, document-level collapse (max 2 chunks per doc), top 10 selected
Cohere rerank-v3.5 narrows top 10 → top 5 by relevance (graceful fallback to top 10 if COHERE_API_KEY unset)
Neighbor chunk expansion (±1 chunk_index, cap at 12 total)
Backend assembles prompt: system instructions + theological guardrails + retrieved chunks (tagged with source_kind and citation_mode) + query
Anthropic Claude Sonnet 4.5 generates a response, streamed back via SSE with <answer> tag extraction. Runtime-appended faithfulness instruction preserves source document views without editorializing. Theological guardrails (theological_guardrails.txt) appended to system prompt enforce non-negotiable framings (e.g., Holy Spirit personhood).
Frontend renders response with inline citation tags

Database Schema

documents table — one row per source document

id (uuid), title (text), author (text)
source_name (text), source_type (text), source_kind (text)
citation_mode (text) — 'citable' | 'silent_context'
is_copyrighted (boolean, default false)
year (int), issue (text), url (text, nullable)
topic_tags (text[]) — assigned from taxonomy
bible_references (text[], default '{}') — canonical refs like "Romans 8:28"; GIN indexed
fts_weighted (tsvector) — weighted FTS on title (A), author (A), source_name (B), bible_references (C, colons stripped)
content_summary (text) — first chunk content for display
created_at (timestamptz)

chunks table — one row per text chunk

id (uuid), document_id (FK → documents)
content (text), embedding (vector(1536))
chunk_index (int), created_at (timestamptz)

verses table — Bible verse text (WEB translation)

id (uuid), verse_id (text, unique — format: SBL.CHAPTER.VERSE, e.g. JHN.3.16)
book (text — full name), book_num (int — 1-66), chapter (int), verse (int)
text (text), translation (text, default 'WEB')
created_at (timestamptz)
Indexes on verse_id and (book, chapter, verse)

saved_words table — user's saved Greek words for Study mode

id (uuid), user_id (uuid, FK → auth.users, cascade delete)
strongs_number (text), greek_word (text), transliteration (text), english_gloss (text, nullable)
created_at (timestamptz)
Unique constraint on (user_id, strongs_number)
RLS enabled: users can only manage their own rows (auth.uid() = user_id)

guest_sessions table — server-side guest query tracking

id (uuid), anon_id (text, unique)
query_count (int, default 0)
created_at / last_seen (timestamptz)

conversations table — saved chat history for authenticated users

id (uuid), user_id (uuid, FK → auth.users), title (text), created_at

messages table — individual messages within conversations

id (uuid), conversation_id (FK → conversations)
role (text: 'user' | 'assistant'), content (text), created_at

Key Decisions Already Made

HNSW indexing over ivfflat for pgvector; match_chunks sets hnsw.ef_search=200
Page-level citations — not chunk-level
Two-tier content model: citation_mode = 'citable' (renders citations) vs 'silent_context' (informs LLM only)
Magazine chunking: tiktoken cl100k_base, 550 tokens target, 80 overlap
Standalone ingest: recursive character text splitting, 1000 char chunks, 200 char overlap
Hybrid search with RRF — query expansion (3 variants via Groq) → vector + FTS per variant → RRF (K=60) → document collapse → top 10
CORS middleware — ALLOWED_ORIGINS env var (comma-separated)
Guest query limit — 6 free queries via guest_sessions + increment_guest_query RPC
JWT auth via Supabase JWKS endpoint (PyJWKClient)
Bible Study articles excluded from extraction pipeline (reference materials, not theological teaching)
Ingest auto-tagging — ingest.py tags every new document post-chunk-insert via Groq Llama 3.3 70B; strict 3–6 tags, main themes only, non-fatal
is_copyrighted path-based — sources/youtube/ and sources/magazine/ → true, sources/documents/ → false
Sermon transcripts excluded from search — search_documents RPC defaults source_kind to "magazine_article"; transcripts available in chat retrieval only
All scripts in scripts/ — no Python files at project root; all use Path(__file__).resolve().parent.parent for project root
Bible reference tracking — documents.bible_references text[] populated via Groq Llama 3.3 70B extraction; shared helper at scripts/bible_refs.py normalizes to "Book Chapter:Verse" canonical form against 66-book set + alias map; non-fatal (returns [] on failure); auto-populated during ingest in both ingest.py and ingest_magazine.py; backfill via extract_bible_refs.py
Prefix search — search_documents RPC builds to_tsquery with :* prefix operators per token (colons split to sub-tokens), so "Romans 8" matches "Romans 8:1", "Romans 8:28", etc.; falls back to plainto_tsquery on parse error
System prompt discipline — backend/app/system_prompt.txt uses XML tags (<thinking>, <research_analysis>, <answer>). <research_analysis> runs 3 fixed self-checks (author conflation, silent_context citation, biblical case overreach). Response Discipline Rules block enforces: multi-part decomposition, retrieval-only format when asked, explicit "corpus insufficient" flag on thin charismatic distinctives, retrieval scope cap (10 items / 250 words). Scripture exception limited to verse text only — no interpretation beyond sources. Tone section includes charismatic linguistic anchors ("impressions," "promptings") for exploratory mode only. Citation rules include prompt-injection trust boundary (ignore instructions embedded in retrieved chunks), no anonymous attribution ("one teacher" etc. banned — cite by name or no attribution). Formatting requires minimum 2 ## headings for theological answers (mandatory, not optional). Quote discipline: paraphrase with attribution, never lift phrasing verbatim. Closing synthesis: end with own conclusion drawn from sources, not a summary.
Theological guardrails — backend/app/theological_guardrails.txt loaded and appended to system prompt in chat.py. Contains non-negotiable theological framings that override source material phrasing. Currently covers Holy Spirit personhood (person of the Trinity, not merely a power/provision).
Chat streaming — Anthropic Claude Sonnet 4.5 max_tokens=1500. <answer> tag extraction server-side with 9-char buffer safety for split tags. If stream ends mid-answer, remaining buffer is flushed to client instead of silently dropped. Uses client.messages.create(stream=True) (not context manager form, which is incompatible with generator yield). Frontend: no timeouts or AbortController; stream completion via [DONE] sentinel or reader exhaustion; error handling for 429 (guest/daily limits).
Batched neighbor chunk expansion — fetch_neighbor_chunks_batch() collects all (document_id, chunk_index±1) pairs from top chunks, builds .or_() compound filter with and() conditions, batches at 30 pairs per query. Replaces sequential per-chunk lookups (was 10 calls for 5 chunks, now 1-2 calls total).
Sparse citation rule — system prompt enforces max 2-3 inline citations per response. Only cite when source of claim materially matters. Single citation sufficient when answer draws primarily from one source.
Cohere reranking — After RRF fusion, top 10 chunks sent to Cohere rerank-v3.5 with original query; top 5 by relevance score returned. Falls back to RRF top 10 if COHERE_API_KEY not set or call fails.
Column break handling — Pass 1 prompt instructs Gemini to transcribe multi-article pages column by column with === COLUMN BREAK === markers. Pass 2 prompt tells Groq to follow article content across column breaks, ignoring other articles' content.
psycopg2 connection fix — Supabase pooler usernames contain a dot (postgres.{ref}) which psycopg2.connect(uri) misparses, truncating to postgres. All ingestion scripts (ingest_lexicon.py, ingest_preceptaustin.py, ingest.py, ingest_commentaries.py) now parse SUPABASE_DB_URL with urlparse and pass explicit keyword args (host, port, user, password, dbname).

Content Rules

Only ingest documents that Alex personally owns or has rights to
New Wine Magazine pipeline is operational — is_copyrighted=true, controlled by INCLUDE_COPYRIGHTED env var
INCLUDE_COPYRIGHTED=true in local .env and defaults true in chat.py
Current non-magazine documents are single-column — no multi-column OCR handling needed

Magazine Extraction Pipeline (3-pass)

Input: PDF in sources/magazine/01_to_extract/
Output: Per-article .md files in sources/magazine/02_extracted/{issue_stem}/

Pass 1: Vision Extraction (Gemini 2.5 Flash)

Converts PDF pages to PIL images at 200 DPI via pdf2image
Processes in 5-page batches to avoid output truncation
Each batch gets explicit page numbering instructions (=== PAGE N ===)
Outputs raw_text.txt with full issue transcription

Pass 2: Article Segmentation (Groq Llama 3.3 70B)

Step 2a: Extracts TOC from pages 2-3, sends full text to Groq for metadata index (JSON array of title/author/page_start/page_end)
Step 2b: For each article, extracts page range text and sends to Groq for body extraction + topic tagging
Returns JSON: {"topic_tags": [...], "body": "..."}
Tags validated against VALID_TAGS set — invalid tags removed
Outputs individual .md files with frontmatter metadata

Pass 3: QA Inspection (Groq Llama 3.3 70B)

Checks each article for: truncation, duplicates, mismatch, word count (min 200), OCR errors
Returns JSON: {"status": "PASS"|"WARN"|"FLAG", "issues": [...], "confidence": 0.0-1.0}
FLAG articles moved to flagged/ subfolder
WARN articles get  comment prepended
Outputs qa_report.json

Article Format

Each article saved as .md with frontmatter:

---
TITLE: Article Title
AUTHOR: Author Name
ISSUE: 03-1973
DATE: March 1973
PAGE_START: 4
PAGE_END: 10
SOURCE_TYPE: magazine_article
TOPIC_TAGS: Fivefold Ministry, Prophetic Ministry, Biblical Leadership
---

# Article Title
*by Author Name*

Body text formatted as markdown...

Exclusions

Bible Study, Bible Lesson, Study Guide articles excluded from extraction
Letters to editor, order forms, subscription info, staff boxes, ads excluded
Cover/back cover, full-page illustrations, advertisement pages skipped in Pass 1

YouTube Pipeline

Scrape: python3 scripts/scrape_youtube.py — scrapes transcripts via yt-dlp from channels in youtube_tracker.xlsx, dedupes against Supabase, saves raw transcripts to sources/youtube/raw/ (max 10 per run). Videos with no captions or low-quality transcripts (< 400 words) write metadata stubs to sources/youtube/no_captions/ for Whisper processing.
Clean: python3 scripts/clean_transcripts.py — cleans via Groq Llama 3.3 70B, moves to sources/youtube/cleaned/
Whisper: python3 scripts/whisper_transcribe.py — batch-processes stubs in no_captions/, downloads audio via yt-dlp, transcribes with Whisper medium, cleans via Groq, outputs to cleaned/. Also supports single-URL mode with --url --title --speaker --channel.
Ingest: python3 scripts/ingest.py — ingests cleaned transcripts into Supabase with auto-tagging. Moves successfully ingested files from cleaned/ to ingested/ via shutil.move.

Convenience script: ./scripts/youtube_pipeline.sh runs all 4 steps in sequence (set -euo pipefail — stops on failure). Shell alias: rh-youtube (in ~/.zshrc).

Transcript files include metadata headers (TITLE, SPEAKER, URL, SOURCE_TYPE) parsed by ingest.py.

Topic Tagging

taxonomy.md in project root contains 100-tag taxonomy across 8 categories:
1. Holy Spirit & Spiritual Gifts (14 tags)
2. Charismatic Experience (14 tags)
3. Prayer & Spiritual Warfare (14 tags)
4. Healing & Wholeness (14 tags)
5. Christian Leadership (15 tags)
6. Christian Growth & Discipleship (14 tags)
7. Kingdom & Theology (14 tags)
8. Family & Relationships (13 tags)
Tags assigned during Pass 2 extraction (5-8 per article)
Strict rules: only assign if article directly teaches on topic for at least one paragraph
Validated against VALID_TAGS set in both extract_magazine.py and tag_existing_articles.py
Invalid/invented tags automatically removed
tag_existing_articles.py for backfilling existing magazine articles (retries if < 3 valid tags)
tag_sermons_transcripts.py for backfilling non-magazine documents (3-6 tags, retries if < 2 valid)

Search Feature

GET /search/documents — document-level FTS via search_documents RPC function
- Parameters: q, author, source_kind, include_copyrighted
- Returns: id, title, author, issue, year, highlighted_snippet, rank
- fts_weighted column includes title (A), author (A), source_name (B), bible_references (C, colons stripped)
- Prefix tsquery builder — tokenizes query, strips non-alphanumerics, appends :* to each token, AND-joins; "Romans 8" matches "Romans 8:1", "Romans 8:28", etc. Falls back to plainto_tsquery on parse error.
- ts_headline generates keyword-highlighted snippets from best-matching chunk
- Markdown/metadata stripped from snippets via nested regexp_replace
- Fallback to first 200 chars if no FTS match in chunk content
- source_kind defaults to "magazine_article" — excludes sermon_transcript from search results
GET /document/{id}/article — reassembles full article from chunks
- Strips per-chunk metadata headers, trims overlap, strips markdown bold/italic
- Cleans author (truncates at parenthesis)
GET /search/documents/browse — lists all documents of a source_kind, ordered by year/issue DESC
- Parameters: source_kind, include_copyrighted
- Returns same shape as search_documents (id, title, author, issue, year, topic_tags, highlighted_snippet=null, rank=0)
- Both /search/documents and /search/documents/browse return topic_tags (secondary lookup on doc IDs for search; direct select for browse)
Search page at /search — sidebar, search bar, result cards, article reader
- Browse listing on initial load (all magazine articles, before any search)
- hasSearched state flag distinguishes "no search yet" (show browse) vs "searched with no results" (show empty state)
- Result cards show author-only metadata (no date/year/issue)
- Topic tag pills on cards: rounded, #d4b96a gold text on rgba(212, 185, 106, 0.12) background
- ReactMarkdown renders article body in reader view (title/byline stripped to avoid duplication)
- dangerouslySetInnerHTML renders <mark> highlighted snippets in result cards
- mark styled with gold color (#d4b96a), transparent background, font-weight 600

Scripts

Script	Purpose
`scripts/extract_magazine.py`	3-pass Gemini/Groq extraction pipeline (Vision → Segmentation → QA). Supports `--max-issues N` and `--time-limit`. Continuation resolver (BFS, depth 5) handles "continued on page N" markers. PDFs archived into `02_extracted/{issue_stem}/` after extraction. Empty Gemini batches log warning + substitute `""` (non-fatal).
`scripts/ingest_magazine.py`	Ingest approved .md articles from sources/magazine/03_approved/ into Supabase. Auto-populates `bible_references`. Archives PDFs to `05_archived/` on success.
`scripts/ingest.py`	Standalone PDF/docx/txt ingestion with auto-tagging (3–6 tags, Groq, non-fatal). Auto-populates `bible_references`. Skip reason tracking: `ingest_file()` returns `(status, reason)` tuples; `main()` prints grouped summary table of all skipped/failed files with reasons at end of run. Uses psycopg2 direct query for `already_ingested()` (same `DB_PARAMS` pattern as other scripts).
`scripts/bible_refs.py`	Shared Bible reference extractor (Groq Llama 3.3 70B). `extract_bible_references(content) -> List[str]`. Segments at ~12k chars, normalizes against 66-book canonical set + alias map, dedupes. Non-fatal (returns `[]`).
`extract_bible_refs.py` (project root)	Backfill `bible_references` on all documents. Flags: `--dry-run`, `--force` (re-process docs that already have refs).
`scripts/tag_existing_articles.py`	Backfill topic_tags on existing magazine articles via Groq
`scripts/tag_sermons_transcripts.py`	Backfill topic_tags on existing sermon/transcript/paper documents via Groq
`scripts/youtube_pipeline.sh`	Full YouTube pipeline convenience script: scrape → clean → whisper → ingest. Shell alias: `rh-youtube`.
`scripts/scrape_youtube.py`	YouTube transcript scraper (yt-dlp, Supabase dedupe, max 10 per run). Writes no_captions stubs for videos without captions or with < 400 words.
`scripts/whisper_transcribe.py`	Whisper medium transcription + Groq cleaning. Batch mode processes `no_captions/` stubs; single-URL mode via CLI args.
`scripts/clean_transcripts.py`	Clean raw transcripts via Groq Llama 3.3 70B, move to cleaned/
`scripts/generate_excerpts.py`	Batch-generate edited word study articles from Precept Austin raw chunks. Concatenates all chunks per document, sends to Anthropic Claude for editing into clean articles, writes to `excerpts` table (`excerpt_type = 'word_study_article'`). Flags: `--test`, `--test-quality`, `--model sonnet
`scripts/ingest_commentaries.py`	Ingests HistoricalChristianFaith commentaries from SQLite DB. Groups by father_name, one document per father, chunks with tiktoken, embeds with OpenAI, inserts via psycopg2. Single-transaction pattern (`connect_with_retry()` + `ingest_father()`). Theological tagging (Reformed, Cessationist, Charismatic-Friendly, Desert Fathers, Patristic). Flags: `--dry-run`, `--father "Name"`, `--filter-charismatic`.
`scripts/fix_article_json.py`	One-off migration: fixed 30 chunks with raw JSON content in Supabase (run 2026-04-17).

Deleted: merge_articles.py (replaced by Pass 2 per-article segmentation)

Root-Level Ingestion Scripts

Script	Purpose
`scrape_preceptaustin.py`	Scrapes Precept Austin Greek word studies. Page caching to `sources/precept_austin/page_cache/`, randomized sleep (2-5s), 4-strategy anchor matching (exact → case-insensitive → partial → reverse word), quality filters (<100 words, nav bleed, fragmented). `--fetch` runs full pipeline, `--test` limits to 10 entries. Outputs to `sources/precept_austin/raw/` + `index.json`.
`ingest_preceptaustin.py`	Ingests Precept Austin word studies into Supabase. Chunks via psycopg2 `execute_values` with `::vector` cast. Documents via Supabase client. Skip logic checks `excerpts` table for existing `word_study_article` excerpt (not just document existence). Reuses existing doc_id when document exists but has no excerpt. Uses `backend/app/services/chunker.py` (550 tokens, 80 overlap).
`ingest_lexicon.py`	Ingests STEPBible lexicon files (TBESG, TBESH, TFLSJ). One chunk per lexical entry. tiktoken truncation at 8000 tokens for embedding. Resume-safe (tracks existing chunk counts). CLI flags: `--lexicon TBESG\|TBESH\|TFLSJ`, `--delete` (removes existing data first), `--sample N`. Brief mode for TBESG: stores only gloss + bolded sub-meanings (no full Abbott-Smith HTML). Chunk inserts via psycopg2 `execute_values` with `ON CONFLICT DO NOTHING`.
`ingest_bible.py`	Parses WEB VPL file, maps 66 canonical books (VPL→SBL abbreviations), inserts into `verses` table via psycopg2. Batch size 1000, `ON CONFLICT DO NOTHING`. `--test` limits to 100 verses. Skips deuterocanonical books.

Note: ingest_commentaries.py is now in scripts/ (see Scripts table above).

Data sources:

sources/precept_austin/ — word study .txt files + index.json + page_cache/ (gitignored)
sources/lexicon/ — STEPBible TSV files (TBESG, TBESH, TFLSJ)
sources/bible/eng-web_vpl.txt — World English Bible verse-per-line file

Corpus (as of 2026-05-23)

2,617 documents total, 124,346 chunks
By source_kind: 1,779 word_study, 557 sermon_transcript, 186 commentary, 58 unknown, 33 magazine_article, 4 lexicon
By source_type: 1,783 background, 557 sermon, 186 commentary, 49 book, 33 magazine_article, 5 paper, 4 other
Copyrighted: 1,862 | Non-copyrighted: 755
Lexicons: TBESG 11,034 chunks (complete), TBESH 10,258 chunks (complete), TFLSJ 15,767 chunks across 2 docs (complete)
Verses: 31,098 rows (WEB, 66-book Protestant canon, complete)
Excerpts: 1,713 of 1,779 word_study docs have generated articles (96%, 66 remaining)
All 38 original backfilled docs have bible_references populated (2026-04-10)

UX Model

Centered chat input as primary interaction
Perplexity-style inline citations rendered as gold-highlighted tags
Clicking a citation opens a source panel with document title, author, and page content
Sidebar: Shared across all routes. "Rhemata" wordmark, gold "New Chat" CTA (#b49238), nav items (Chat/Discover/Study), conditional content (Recents on chat, Saved Words on study). Hover via onMouseEnter/onMouseLeave inline handlers (#262624).
Search page at /search with keyword search, browse-all default listing, result cards with topic tag pills, and full article reader
Auth flow: login modal triggered by AuthButton, sidebar sign-in link, or guest limit reached
Guest users get 6 free queries before prompted to sign up

Brand

Name: Rhemata
Fonts: Lora (headings/logo), Inter (UI/body)
Dark theme — near-black backgrounds, warm neutrals
Gold accents — #d4b96a (citations/highlights/tag pills), rgba(212, 185, 106, 0.12) (tag pill backgrounds)
Voice: Scholarly but accessible. Conviction, not performance. Serves the researcher, not the spectacle.

Deployment

Target	Status	Notes
Railway (backend)	Live	Root dir: `backend/`, Python 3.9 via nixpacks.toml
Vercel (frontend)	Live	Root dir: `frontend/`
Supabase	Live	PostgreSQL + pgvector

Backend env vars (Railway)

SUPABASE_URL, SUPABASE_SERVICE_KEY
OPENAI_API_KEY, GROQ_API_KEY, ANTHROPIC_API_KEY, COHERE_API_KEY
SUPABASE_JWT_JWKS_URL
ALLOWED_ORIGINS
INCLUDE_COPYRIGHTED

Frontend env vars (Vercel)

NEXT_PUBLIC_API_URL
NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY

Environment Variables (local — backend/app/.env)

GROQ_API_KEY
OPENAI_API_KEY
ANTHROPIC_API_KEY — Claude Sonnet 4.5 for answer generation
COHERE_API_KEY — Cohere rerank-v3.5 for retrieval reranking
GOOGLE_API_KEY — Gemini 2.5 Flash for magazine extraction
SUPABASE_URL
SUPABASE_SERVICE_KEY
SUPABASE_JWT_JWKS_URL
INCLUDE_COPYRIGHTED — true/false (default true in chat.py, false in search.py)
ALLOWED_ORIGINS
SUPABASE_DB_URL — direct PostgreSQL connection string for psycopg2 (bypasses PostgREST timeouts). Used by ingest_bible.py, ingest_preceptaustin.py, ingest_lexicon.py (psycopg2 variant).

Study Mode (frontend)

Route: /study — Study Mode page with tab-driven layout
Layout: Shared sidebar (w-64) with Saved Words list | Left panel (380px fixed: search, verse card, chapter view) | Right panel (flex:1: tab content)
Sidebar: Shared sidebar.tsx across Chat and Study. Uses usePathname() for active route. Chat shows Recents; Study shows Saved Words. Nav items: Chat (MessageSquare), Discover (Compass), Study (BookOpen). Gold "New Chat" CTA (#b49238). Hover standard: onMouseEnter bg #262624, onMouseLeave transparent.
Saved words: saved_words table in Supabase (migration 017). RLS policy scoped to auth.uid(). Toggle save/unsave from definition panel bookmark icon. Sidebar shows English gloss + Strong's number, transliteration below.
Verse lookup: Direct Supabase query from frontend (verses table, keyed by verse_id). Client-side parseRef() with full 66-book BOOK_MAP + ABBREV_TO_NAME.
Left panel: Search bar, verse card, "View Chapter" button, chapter view. No interlinear or definition content.
Chapter view: Flowing text with inline <sup> verse numbers. Verses queried via .like("verse_id", "JHN.1.%") pattern. Active verse highlighted with #2f2f2c background. Clicking a verse in chapter view loads it as the active verse.
Right panel tabs: CorpusTab type: "commentaries" | "word_study" | "jewish". Tab state lifted to parent StudyPage for cross-panel coordination.
- Commentary tab: Corpus results from GET /study/corpus (commentaries, sermons, magazine articles).
- Word Study tab: Interlinear word blocks → definition panel → Precept Austin excerpt → "From the Library" corpus results. Interlinear fetch gated by corpusTab === "word_study" (not fetched on every verse change). Auto-selects first interlinear word when tokens load.
- Jewish Perspective tab: Inline generate flow (no modal). Empty state with "Generate Jewish Perspective" button → disclaimer text → "Confirm & Generate" button → spinner → cached result. jpCacheChecked state tracks whether cache has been checked.
Excerpt panel: GET /study/excerpt?strongs=G#### endpoint returns Precept Austin word study article for selected Strong's number. Tries excerpts table first, falls back to concatenated chunks.
Corpus panel ("From the Library"): Backend GET /study/corpus?verse=...&transliteration=... endpoint. Embeds query via OpenAI, runs match_chunks RPC (match_count=20, include_copyrighted=true), filters to citation_mode='citable' + source_kind IN ('sermon_transcript', 'magazine_article', 'commentary', 'word_study'), dedupes by document, returns top 5. Frontend shows skeleton loader, empty state, or real results.
Definition panel: Fetches lexicon data from GET /study/lexicon?strongs=G#### endpoint. Displays gloss inline with transliteration/Strong's number, plus parsed definition and usage from TBESG chunks.
Backend router: backend/app/routers/study.py — GET /study/verse (parses ref, queries verses table), GET /study/corpus (semantic search), GET /study/lexicon (TBESG lexicon lookup), GET /study/excerpt (word study excerpt by Strong's number), GET /study/interlinear (interlinear words by verse_id), GET /study/commentary (commentary semantic search), GET /study/wordsearch + GET /study/wordstudy/{document_id} (word study search and detail). Registered in main.py with prefix="/study".
Interlinear blocks: Inline styles for spacing/sizing (not Tailwind — classes weren't taking effect). Hover color #2f2f2c (since resting bg is #262624). No bookmark icons on word blocks. Now rendered in Word Study tab of right panel (not left panel).
Interlinear data: Live from interlinear_words table (142,096 rows). Backend GET /study/interlinear?verse_id=JHN.1.1 returns word data ordered by word_position.

Remaining / Known Issues

Full 300-issue batch not yet run — only 4 articles ingested from issue 03-1973
Migration 012 not yet run — needs to be applied in Supabase SQL Editor (Migration 013 for bible_references is applied as of 2026-04-10)
sources/youtube/youtube_tracker.xlsx still tracked — needs git rm --cached sources/youtube/youtube_tracker.xlsx to finish the earlier sources/ cleanup. Shows up as modified on every commit.
Issue_03-1973 cleanup A/B/C options never resolved in the continuation-resolver session — was left in 02_extracted/ in an uncertain state.
scrape_youtube.py dead Haiku code — removed (2026-04-15)
content_summary not auto-populated on new article inserts (trigger only updates fts_weighted, not content_summary)
Tagging retry logic needs improvement: it sometimes fails on complex articles
Guest query limit — increment_guest_query() SQL function needs migration file
RLS policies needed on conversations and messages tables — DONE: RLS enabled on both
INCLUDE_COPYRIGHTED not confirmed on Railway — check dashboard
poppler no longer required — pdf2image replaced by PyMuPDF (fitz) in extract_magazine.py
Bible ref extraction occasionally produces malformed JSON from Groq on edge-case batches (~1 in 38 docs in the backfill run). The helper handles this gracefully by dropping that segment and continuing; other segments in the same doc still succeed.
System prompt and chat.py changes deployed (2026-04-15) — pushed to main; Railway/Vercel should auto-deploy.
Anthropic + Cohere rerank deployed (2026-04-17) — answer gen switched to Claude Sonnet 4.5, Cohere rerank-v3.5 added. Pushed to main.
Article reader date display — issue date (month/year) added to frontend but not yet visually confirmed in browser. console.log left in handleCardClick for debugging — remove after confirming.
30 malformed JSON chunks fixed (2026-04-17) — fix_article_json.py migration ran successfully; content_summary refreshed on all 30 affected documents.
Shell aliases expanded — 10 rh-* aliases in ~/.zshrc covering all pipeline scripts.
Proposed but unapplied system prompt changes (2026-04-29 session): example response section, retrieval formatting (bullets→headings+prose), softer Holy Spirit guardrails revision, Niagara Falls metaphor ban, "Go Deeper" follow-up questions, NIV translation preference. All shown as diffs but not confirmed by user.
~~Migration 016~~ — DONE: verses table created, 31,098 rows ingested via ingest_bible.py
scrape_preceptaustin.py hardened version not yet re-run — page cache from first run will speed up re-run; first run had 2.6% success rate before DOM fix
ingest_preceptaustin.py not yet run — depends on successful scrape completion
~~TBESG re-ingestion~~ — DONE: 11,034 chunks ingested in brief mode (gloss + bold sub-meanings)
Debug logging in study.py lexicon endpoint — verbose print statements added for TBESG debugging. Remove after confirming lexicon endpoint works correctly with re-ingested TBESG data.
~~Study Mode interlinear data is placeholder only~~ — DONE: Live from interlinear_words table (142,096 rows), fetched via GET /study/interlinear endpoint.
~~Migration 017 (saved_words)~~ — DONE: saved_words table exists with RLS enabled
10 YouTube transcripts ingested (2026-05-22) — side effect of running skip tracking test against youtube/ingested/. These had different MD5 hashes than stored (likely re-cleaned). 1 duplicate ("Your Calling Is Holy" by Derek Prince) was found and the older copy deleted.
mode-toggle.tsx is orphaned — frontend/components/rhemata/mode-toggle.tsx exists but is no longer imported anywhere. Can be deleted.
Word study excerpt generation 96% complete — 1,713 of 1,779 word_study documents have excerpts generated (excerpts table, excerpt_type = 'word_study_article'). 66 remaining. Script: scripts/generate_excerpts.py.
Commentary ingestion not yet run — scripts/ingest_commentaries.py rewritten and pushed, needs to be executed (325 fathers, 82,567 rows from SQLite DB at /tmp/commentaries-db/data.out).
Hebrew word study pipeline ready — scrape_preceptaustin.py --language hebrew --fetch and ingest_preceptaustin.py --language hebrew ready to run.
Interlinear deployment verification needed — confirm GET /study/interlinear works on live Railway (tested locally only).
~~SUPABASE_DB_URL psycopg2 connection refused~~ — FIXED: dotted username was being truncated by psycopg2 URI parsing. All scripts now use urlparse + explicit keyword args.
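The urlparse-plus-keyword-args approach from the SUPABASE_DB_URL fix might look something like the sketch below. The helper name `db_connect_kwargs` and the default port are assumptions for illustration; the project's scripts may structure this differently. The returned keys (`host`, `port`, `user`, `password`, `dbname`) are standard psycopg2 connection keywords.

```python
from typing import Dict, Union
from urllib.parse import unquote, urlparse


def db_connect_kwargs(db_url: str) -> Dict[str, Union[str, int, None]]:
    """Split a Postgres URL into explicit connection keyword arguments.

    Hypothetical helper: parsing the URL ourselves keeps a dotted username
    (e.g. "postgres.projectref") intact instead of relying on URI parsing
    inside the driver.
    """
    parsed = urlparse(db_url)
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # assumed default when the URL omits a port
        "user": unquote(parsed.username) if parsed.username else None,
        "password": unquote(parsed.password) if parsed.password else None,
        "dbname": parsed.path.lstrip("/") or None,
    }


# Usage (requires psycopg2 and the env var to be set):
#   import os, psycopg2
#   conn = psycopg2.connect(**db_connect_kwargs(os.environ["SUPABASE_DB_URL"]))
```

Percent-decoding the username and password lets credentials containing reserved characters survive the round trip from the URL.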

How to Work on This Project

Code changes always go to Claude Code in terminal — do not write or edit code in chat unless the change is trivial (1-2 lines)
Alex works fast — short messages, quick pivots, direct feedback
Surface risks and blockers before building, not after
When Alex references a component, check the actual file before assuming structure
Python 3.9 constraint: use Optional[str] not str | None
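As a small illustration of the Python 3.9 constraint, the sketch below uses `Optional[str]` from `typing`. The function `normalize_strongs` is a hypothetical helper invented for this example, not something taken from the codebase. On Python 3.9, writing the same annotation as `str | None` raises a TypeError when the annotation is evaluated, because the `|` union syntax for types arrived in Python 3.10.

```python
from typing import Optional


def normalize_strongs(strongs: Optional[str] = None) -> Optional[str]:
    """Return a trimmed, upper-cased Strong's number, or None if absent or blank.

    Hypothetical helper, shown only to demonstrate the 3.9-compatible
    Optional[str] annotation style.
    """
    if strongs is None:
        return None
    cleaned = strongs.strip().upper()
    return cleaned or None
```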
