levy-n

rag-retrieval

Implements RAG (Retrieval-Augmented Generation) pipelines. Covers embedding APIs (OpenAI, Gemini, Sentence-Transformers), vector stores (FAISS, ChromaDB, Pinecone), RAG variants (Query Rewrite, Conversational, Multi-hop), and evaluation (RAGAS, Faithfulness). Use when building knowledge bases, semantic search, chatbots with documents, or when user mentions 'RAG', 'embeddings', 'vector store', 'FAISS', 'ChromaDB', 'similarity search', 'retrieval', 'chunking', 'hallucination reduction', 'semantic search', or 'knowledge base'.

levy-n 10 1 Updated 4mo ago

Install

npx skillscat add levy-n/claude-useful-skills/rag-retrieval

Install via the SkillsCat registry.

SKILL.md

RAG & Retrieval - Semantic Search & Knowledge Augmentation

RAG, Embeddings, Vector Stores, and Semantic Search.

Quick Start - Simple RAG Pipeline

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

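# 1. Chunk documents
# `documents` is a list of LangChain Document objects loaded beforehand (e.g. with a document loader)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)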

# 2. Create embeddings and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# 3. Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# 4. Query (calling the chain object directly is deprecated in recent LangChain releases)
result = qa_chain.invoke({"query": "What is machine learning?"})
print(result["result"])

When This Skill Activates

Use this skill when:

  • Building knowledge bases or Q&A systems
  • Implementing semantic search
  • Reducing LLM hallucinations with grounding
  • Creating chatbots with document context
  • Working with embeddings and vector databases
  • Evaluating RAG system quality

Core Patterns

Pattern 1: RAG Architecture

┌─────────────────────────────────────────────────────────────┐
│                    INDEXING STAGE (Offline)                  │
├─────────────────────────────────────────────────────────────┤
│   Parse  →  Chunk  →  Embed  →  Store                       │
│   (PDF)     (500t)    (384d)    (FAISS/Chroma)             │
├─────────────────────────────────────────────────────────────┤
│                    RUNTIME STAGE (Online)                    │
├─────────────────────────────────────────────────────────────┤
│   Query  →  Embed  →  Retrieve  →  Inject  →  Generate     │
│   (user)    (384d)    (top-k)      (prompt)   (LLM)        │
└─────────────────────────────────────────────────────────────┘

Pattern 2: Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Fixed-size chunks (chunk_size counts characters by default, not tokens)
splitter_fixed = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)

# Overlapping chunks (RECOMMENDED)
splitter_overlap = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100  # 20% overlap preserves context
)

# Semantic chunking by headers
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter_semantic = MarkdownHeaderTextSplitter(headers_to_split_on=headers)

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed | Simple, predictable | May break mid-sentence | Homogeneous docs |
| Overlapping | Preserves context | More chunks, storage | General use |
| Semantic | Respects structure | Needs structured input | Markdown, HTML |
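
To make the effect of chunk_overlap concrete, here is a minimal sketch that assumes plain character windows. RecursiveCharacterTextSplitter also tries to split on separators such as paragraph and line breaks, so its real boundaries will differ. The function name chunk_text is hypothetical and not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    print(chunk_text(sample, chunk_size=4, chunk_overlap=0))
    print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
```

With overlap, text that falls on a boundary appears in two neighbouring chunks, which is why the overlapping strategy produces more chunks for the same input.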

Pattern 3: Embedding Models

# OpenAI Embeddings
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world"
)
embedding = response.data[0].embedding  # 1536 dimensions

# Sentence Transformers (Local, Free)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Hello world")  # 384 dimensions

# HuggingFace via LangChain
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vector = embeddings.embed_query("Hello world")  # 768 dimensions

| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | $ |
| text-embedding-3-large | 3072 | Medium | Best | $$ |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Free |
| bge-base-en-v1.5 | 768 | Fast | Very Good | Free |
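
Dimension count also drives storage. The sketch below estimates the raw size of the vectors alone, assuming float32 storage (4 bytes per value) and ignoring index overhead, metadata, and any compression. The helper name index_size_bytes is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for num_vectors float32 vectors of the given dimensionality."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


# Dimensions taken from the table above
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "bge-base-en-v1.5": 768,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

for name, dims in MODEL_DIMENSIONS.items():
    gigabytes = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gigabytes:.2f} GB per million vectors")
```

Under these assumptions, a million 384-dimensional vectors take about 1.5 GB, and each doubling of the dimension doubles that figure.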

Pattern 4: FAISS Vector Store

import faiss
import numpy as np

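# Create index
dimension = 384  # must match the embedding model (384 for all-MiniLM-L6-v2)
index = faiss.IndexFlatL2(dimension)  # exact search; reports squared L2 distances

# Add vectors
# `embeddings` here is a list of document vectors (e.g. from model.encode(texts))
vectors = np.array(embeddings, dtype=np.float32)
index.add(vectors)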

# Search
query_vector = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vector, k=5)

# Get results
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"Rank {i+1}: Document {idx}, Distance: {dist:.4f}")

# Save/Load
faiss.write_index(index, "index.faiss")
index = faiss.read_index("index.faiss")
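
For intuition, a flat L2 index performs an exhaustive comparison of the query against every stored vector. The following pure-Python sketch mirrors that behaviour under the assumption that distances are reported as squared Euclidean distances. It is an illustration only, not FAISS code, and the function names are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors: list[list[float]], query: list[float], k: int):
    """Brute-force nearest-neighbour search; returns (distances, indices) for the k closest vectors."""
    scored = [(squared_l2(vec, query), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first; ties fall back to the lower index
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    distances, indices = flat_l2_search(stored, [0.0, 0.0], k=2)
    print(distances, indices)
```

Exhaustive search is exact but scales linearly with the number of vectors, which is the usual reason to move to an approximate index for large collections.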

Pattern 5: ChromaDB Vector Store

import chromadb
from chromadb.utils import embedding_functions

# Create client
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with embeddings
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# get_or_create_collection avoids an error when the script is re-run against the same path
collection = client.get_or_create_collection(
    name="my_documents",
    embedding_function=embedding_fn
)

# Add documents
collection.add(
    documents=["Document 1 text", "Document 2 text"],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
    where={"source": "file1.pdf"}  # Metadata filtering
)

Pattern 6: Hybrid Search (BM25 + Semantic)

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma

# BM25 (keyword-based; requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector retriever (semantic)
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # Equal weights
)

# Use (invoke replaces the deprecated get_relevant_documents in recent LangChain releases)
docs = ensemble_retriever.invoke("my query")

| Search Type | Strengths | Weaknesses |
|---|---|---|
| BM25 | Exact keywords, names, IDs | Misses paraphrases |
| Semantic | Meaning, synonyms | May miss exact terms |
| Hybrid | Best of both | More complex |
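
One common way to merge a keyword ranking with a semantic ranking is weighted Reciprocal Rank Fusion (RRF). The sketch below assumes that fusion rule with the conventional constant c=60. It is an illustration of the idea rather than LangChain's implementation, and weighted_rrf is a hypothetical name.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]      # keyword results, best first
    vector_ranking = ["b", "d", "a"]    # semantic results, best first
    print(weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5]))
```

Because RRF uses ranks rather than raw scores, it avoids having to normalize BM25 scores and vector distances onto a shared scale.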

Pattern 7: Re-ranking with Cross-Encoder

from sentence_transformers import CrossEncoder

# Bi-Encoder: Fast initial retrieval (top-20)
# Cross-Encoder: Accurate re-ranking (top-5)

# Initial retrieval
initial_results = vectorstore.similarity_search(query, k=20)

# Re-rank with cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, doc.page_content] for doc in initial_results]
scores = reranker.predict(pairs)

# Sort by score only (a key avoids comparing Document objects when scores tie)
reranked = sorted(zip(scores, initial_results), key=lambda pair: pair[0], reverse=True)
top_5 = [doc for score, doc in reranked[:5]]

Pattern 8: RAG Evaluation (RAGAS)

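from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Prepare evaluation data (one entry per question; contexts holds the retrieved chunks)
eval_data = {
    "question": ["What is ML?", "What is DL?"],
    "answer": ["ML is...", "DL is..."],
    "contexts": [["ML context 1", "ML context 2"], ["DL context"]],
    "ground_truth": ["ML ground truth", "DL ground truth"]
}

# evaluate() expects a Hugging Face Dataset rather than a plain dict
# (column names follow the ragas 0.1-style schema; newer releases may differ)
dataset = Dataset.from_dict(eval_data)

# Evaluate (the metrics call an LLM, so an API key must be configured)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)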

| Metric | Measures | Low Score Means |
|---|---|---|
| Faithfulness | Is answer grounded in context? | Hallucination |
| Answer Relevancy | Does answer address question? | Off-topic response |
| Context Precision | Is retrieved context relevant? | Bad retrieval |
| Context Recall | Did we retrieve all needed info? | Missing sources |
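
The two retrieval metrics can be approximated without an LLM when relevance labels are available. The sketch below is a simplified, label-based proxy and not how RAGAS computes its scores (RAGAS relies on LLM judgments). It assumes each chunk has an id and that the set of relevant ids is known. The function names are hypothetical.

```python
def simple_context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def simple_context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of relevant chunks that were actually retrieved."""
    if not relevant_ids:
        return 0.0  # undefined without labels; reported as 0.0 in this sketch
    retrieved = set(retrieved_ids)
    found = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved)
    return found / len(relevant_ids)


if __name__ == "__main__":
    retrieved = ["c1", "c2", "c3", "c4"]
    relevant = {"c1", "c4", "c9"}
    print(simple_context_precision(retrieved, relevant))
    print(simple_context_recall(retrieved, relevant))
```

Low precision here means the retriever returns noise, while low recall means needed chunks never reach the prompt. These are the same failure modes the table attributes to the corresponding RAGAS metrics.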

Reference Navigation

For detailed content, see:

  • Embeddings Guide: reference/embeddings_guide.md - OpenAI, Gemini, Sentence-Transformers
  • Vector Stores: reference/vector_stores.md - FAISS, ChromaDB, Pinecone
  • RAG Architectures: reference/rag_architectures.md - Variants, Memory systems
  • RAG Memory: reference/rag_memory.md - Conversational, Multi-turn
  • RAG Evaluation: reference/rag_evaluation.md - RAGAS, LLM-as-Judge
  • Hybrid Search: reference/hybrid_search.md - BM25 + Semantic

Common Mistakes to Avoid

1. Chunks Too Small

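# WRONG: Information gets fragmented
# (chunk_overlap is set explicitly: the default of 200 exceeds chunk_size=100 and raises ValueError)
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)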

# CORRECT: Reasonable chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

2. No Overlap Between Chunks

# WRONG: Context lost at boundaries
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

# CORRECT: Overlap preserves context
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

3. Mixing Embedding Models

# WRONG: Different models produce incompatible embeddings
index_embeddings = model_a.encode(documents)
query_embedding = model_b.encode(query)  # Different model!

# CORRECT: Same model for indexing and querying
embeddings_model = SentenceTransformer("all-MiniLM-L6-v2")
index_embeddings = embeddings_model.encode(documents)
query_embedding = embeddings_model.encode(query)

4. Not Validating Retrieval Before LLM

# WRONG: Debugging LLM when retrieval is the problem
# Always check what's being retrieved first!

# CORRECT: Validate retrieval separately
docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents
for i, doc in enumerate(docs):
    print(f"Doc {i}: {doc.page_content[:200]}")
# Then check if these are the right documents

5. Ignoring Metadata Filtering

# WRONG: Retrieve from all documents
results = vectorstore.similarity_search(query, k=5)

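# CORRECT: Filter by metadata when relevant
# (filter syntax is store-specific; Chroma needs an explicit $and for multiple conditions)
results = vectorstore.similarity_search(
    query, k=5,
    filter={"$and": [{"document_type": "policy"}, {"year": 2024}]}
)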
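
Conceptually, a metadata filter narrows the candidate set and the similarity ranking then runs only on what remains. The sketch below assumes exact-match filtering applied before ranking. Real vector stores may filter before or after the nearest-neighbour step. The record layout and the name filtered_search are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_search(records: list[dict], query_vector: list[float],
                    metadata_filter: dict, k: int = 5) -> list[dict]:
    """Keep records whose metadata matches every filter key, then rank them by distance."""
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda record: squared_l2(record["vector"], query_vector))
    return candidates[:k]


if __name__ == "__main__":
    records = [
        {"id": "a", "vector": [0.0, 0.0], "metadata": {"document_type": "policy", "year": 2024}},
        {"id": "b", "vector": [0.1, 0.0], "metadata": {"document_type": "memo", "year": 2024}},
        {"id": "c", "vector": [1.0, 1.0], "metadata": {"document_type": "policy", "year": 2024}},
        {"id": "d", "vector": [0.0, 0.1], "metadata": {"document_type": "policy", "year": 2023}},
    ]
    hits = filtered_search(records, [0.0, 0.0], {"document_type": "policy", "year": 2024})
    print([hit["id"] for hit in hits])
```

Note that records "b" and "d" are closer to the query than "c" but are excluded because they fail the filter. Surfacing the in-scope document is the point of filtering.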

Teaching Mode

When explaining RAG:

RAG Intuition

Without RAG:
"What's our refund policy?" → LLM guesses (may hallucinate!)

With RAG:
"What's our refund policy?"
  → Search company documents
  → Find: "Refunds within 30 days with receipt"
  → LLM answers: "Our refund policy allows returns within 30 days..."

RAG = "Open book exam" for LLMs

Embedding Space Visual

Similar documents are CLOSE in embedding space:

    ● "machine learning tutorial"
        ↘
          ● "ML course content"
            ↘
              ● "deep learning basics"


    ● "cooking recipes"
    (far away - different topic)
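
Closeness in embedding space is usually measured with cosine similarity. The sketch below uses small hand-made 3-dimensional vectors as stand-ins. They are hypothetical values chosen for illustration, not outputs of any embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treated as 0.0 here
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
ml_tutorial = [0.9, 0.8, 0.1]   # "machine learning tutorial"
ml_course = [0.8, 0.9, 0.2]     # "ML course content"
cooking = [0.1, 0.0, 0.9]       # "cooking recipes"

if __name__ == "__main__":
    print(f"tutorial vs course:  {cosine_similarity(ml_tutorial, ml_course):.3f}")
    print(f"tutorial vs cooking: {cosine_similarity(ml_tutorial, cooking):.3f}")
```

The two ML vectors score close to 1, while the cooking vector scores far lower against either of them, matching the picture above.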

Retrieval Pipeline

Query: "How do I train a neural network?"
            ↓
       [Embed Query]
            ↓
       [Search Vector DB]
            ↓
    Top 3 relevant chunks:
    1. "Training neural networks involves..."
    2. "Use backpropagation to update..."
    3. "Choose optimizer like Adam..."
            ↓
       [Inject into prompt]
            ↓
    "Using the following context: [chunks]
     Answer: How do I train a neural network?"
            ↓
       [LLM generates grounded answer]
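
The "inject into prompt" step can be as simple as string assembly. The sketch below follows the prompt wording from the diagram above. The numbering of chunks and the function name build_rag_prompt are assumptions made for illustration.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved chunks and the user question."""
    context = "\n".join(f"{i}. {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Using the following context:\n"
        f"{context}\n\n"
        f"Answer: {question}"
    )


if __name__ == "__main__":
    retrieved_chunks = [
        "Training neural networks involves...",
        "Use backpropagation to update...",
        "Choose optimizer like Adam...",
    ]
    print(build_rag_prompt("How do I train a neural network?", retrieved_chunks))
```

Printing the assembled prompt before sending it to the LLM is also a cheap way to validate retrieval, in line with mistake 4 above.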