Implements RAG (Retrieval-Augmented Generation) pipelines. Covers embedding APIs (OpenAI, Gemini, Sentence-Transformers), vector stores (FAISS, ChromaDB, Pinecone), RAG variants (Query Rewrite, Conversational, Multi-hop), and evaluation (RAGAS, Faithfulness). Use when building knowledge bases, semantic search, chatbots with documents, or when user mentions 'RAG', 'embeddings', 'vector store', 'FAISS', 'ChromaDB', 'similarity search', 'retrieval', 'chunking', 'hallucination reduction', 'semantic search', or 'knowledge base'.
Install
npx skillscat add levy-n/claude-useful-skills/rag-retrieval
Install via the SkillsCat registry.
SKILL.md
RAG & Retrieval - Semantic Search & Knowledge Augmentation
RAG, Embeddings, Vector Stores, and Semantic Search.
Quick Start - Simple RAG Pipeline
# Note: recent LangChain releases moved these imports to langchain_community,
# langchain_text_splitters, and langchain_openai; adjust to your installed version.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# `documents` is assumed to be a list of LangChain Document objects (e.g. from a loader)
# 1. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)
# 2. Create embeddings and store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")
# 3. Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
# 4. Query (newer LangChain releases prefer qa_chain.invoke({...}))
result = qa_chain({"query": "What is machine learning?"})
print(result["result"])

When This Skill Activates
Use this skill when:
- Building knowledge bases or Q&A systems
- Implementing semantic search
- Reducing LLM hallucinations with grounding
- Creating chatbots with document context
- Working with embeddings and vector databases
- Evaluating RAG system quality
Core Patterns
Pattern 1: RAG Architecture
┌─────────────────────────────────────────────────────────────┐
│ INDEXING STAGE (Offline) │
├─────────────────────────────────────────────────────────────┤
│ Parse → Chunk → Embed → Store │
│ (PDF) (500t) (384d) (FAISS/Chroma) │
├─────────────────────────────────────────────────────────────┤
│ RUNTIME STAGE (Online) │
├─────────────────────────────────────────────────────────────┤
│ Query → Embed → Retrieve → Inject → Generate │
│ (user) (384d) (top-k) (prompt) (LLM) │
└─────────────────────────────────────────────────────────────┘

Pattern 2: Chunking Strategies
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in characters by default
# Fixed-size chunks
splitter_fixed = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)
# Overlapping chunks (RECOMMENDED)
splitter_overlap = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,  # 20% overlap preserves context
)
# Semantic chunking by headers
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter_semantic = MarkdownHeaderTextSplitter(headers_to_split_on=headers)

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed | Simple, predictable | May break mid-sentence | Homogeneous docs |
| Overlapping | Preserves context | More chunks, storage | General use |
| Semantic | Respects structure | Needs structured input | Markdown, HTML |
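
To make the effect of overlap concrete, here is a minimal pure-Python sketch of a sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `chunk_text` and the sample string are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows; consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
fixed = chunk_text(sample, chunk_size=10, chunk_overlap=0)
overlapping = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(fixed)        # three chunks, no shared characters
print(overlapping)  # four chunks, each sharing 4 characters with its neighbor
```

With overlap, the tail of one chunk reappears at the head of the next, so a sentence cut at a boundary is still fully present in at least one chunk, at the cost of more chunks to embed and store.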
Pattern 3: Embedding Models
# OpenAI Embeddings
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
embedding = response.data[0].embedding  # 1536 dimensions
# Sentence Transformers (Local, Free)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Hello world")  # 384 dimensions
# HuggingFace via LangChain
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vector = embeddings.embed_query("Hello world")  # 768 dimensions

| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | $ |
| text-embedding-3-large | 3072 | Medium | Best | $$ |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Free |
| bge-base-en-v1.5 | 768 | Fast | Very Good | Free |
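
Whichever model produces the vectors, they are usually compared with cosine similarity. The sketch below shows that computation in plain Python; the three-dimensional vectors are made-up stand-ins for real embeddings (which have 384 to 3072 dimensions per the table above), and `cosine_similarity` is a helper written for this example, not a library call.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # Vectors from different models often differ in dimension and cannot be compared
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy vectors with hypothetical values, standing in for real embeddings
ml_tutorial = [0.9, 0.1, 0.0]
ml_course = [0.8, 0.2, 0.1]
cooking = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(ml_tutorial, ml_course)
sim_unrelated = cosine_similarity(ml_tutorial, cooking)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The dimension check also illustrates why the index and the query must use the same embedding model.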
Pattern 4: FAISS Vector Store
import faiss
import numpy as np
# Assumes `embeddings` is a list of 384-dim document vectors and
# `query_embedding` is a single 384-dim vector from the same model.
# Create index
dimension = 384
index = faiss.IndexFlatL2(dimension)  # L2 distance
# Add vectors
vectors = np.array(embeddings, dtype=np.float32)
index.add(vectors)
# Search
query_vector = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vector, k=5)
# Get results (smaller distance = more similar)
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"Rank {i+1}: Document {idx}, Distance: {dist:.4f}")
# Save/Load
faiss.write_index(index, "index.faiss")
index = faiss.read_index("index.faiss")

Pattern 5: ChromaDB Vector Store
import chromadb
from chromadb.utils import embedding_functions
# Create client
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection with embeddings
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
# get_or_create_collection avoids an error when the collection already exists on re-runs
collection = client.get_or_create_collection(
    name="my_documents",
    embedding_function=embedding_fn,
)
# Add documents
collection.add(
    documents=["Document 1 text", "Document 2 text"],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}],
    ids=["doc1", "doc2"],
)
# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
    where={"source": "file1.pdf"},  # Metadata filtering
)

Pattern 6: Hybrid Search (BM25 + Semantic)
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma
# Assumes `documents` is a list of LangChain Document objects and
# `embeddings` is an embeddings object (e.g. HuggingFaceEmbeddings).
# BM25 (keyword-based; requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5
# Vector retriever (semantic)
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],  # Equal weights
)
# Use (newer LangChain releases prefer ensemble_retriever.invoke("my query"))
docs = ensemble_retriever.get_relevant_documents("my query")

| Search Type | Strengths | Weaknesses |
|---|---|---|
| BM25 | Exact keywords, names, IDs | Misses paraphrases |
| Semantic | Meaning, synonyms | May miss exact terms |
| Hybrid | Best of both | More complex |
Pattern 7: Re-ranking with Cross-Encoder
from sentence_transformers import CrossEncoder
# Bi-Encoder: Fast initial retrieval (top-20)
# Cross-Encoder: Accurate re-ranking (top-5)
# Assumes `vectorstore` is an existing LangChain vector store and `query` is a string.
# Initial retrieval
initial_results = vectorstore.similarity_search(query, k=20)
# Re-rank with cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, doc.page_content] for doc in initial_results]
scores = reranker.predict(pairs)
# Sort by score only; without a key, tied scores would fall back to comparing Document objects
reranked = sorted(zip(scores, initial_results), key=lambda pair: pair[0], reverse=True)
top_5 = [doc for score, doc in reranked[:5]]

Pattern 8: RAG Evaluation (RAGAS)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
# Prepare evaluation data
# (column names and the expected dataset type vary across ragas versions)
eval_data = {
    "question": ["What is ML?", "What is DL?"],
    "answer": ["ML is...", "DL is..."],
    "contexts": [["ML context 1", "ML context 2"], ["DL context"]],
    "ground_truth": ["ML ground truth", "DL ground truth"],
}
# evaluate() expects a dataset object rather than a plain dict
dataset = Dataset.from_dict(eval_data)
# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)

| Metric | Measures | Low Score Means |
|---|---|---|
| Faithfulness | Is answer grounded in context? | Hallucination |
| Answer Relevancy | Does answer address question? | Off-topic response |
| Context Precision | Is retrieved context relevant? | Bad retrieval |
| Context Recall | Did we retrieve all needed info? | Missing sources |
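
The RAGAS metrics above rely on an LLM as judge. When you have hand-labeled relevant chunk IDs for a few questions, a cheaper retrieval-only sanity check is possible. The sketch below computes precision@k and recall@k in plain Python; it is a simplified analogue of Context Precision and Context Recall, not the RAGAS implementation, and the chunk IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical retriever output and hand-labeled ground truth for one question
retrieved_ids = ["c1", "c7", "c3", "c9"]
relevant_ids = {"c1", "c3", "c5"}

precision = precision_at_k(retrieved_ids, relevant_ids, k=3)
recall = recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f}, recall@3={recall:.2f}")
```

Low precision suggests noisy retrieval; low recall suggests needed chunks never reach the prompt, mirroring the last two rows of the table.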
Reference Navigation
For detailed content, see:
- Embeddings Guide: reference/embeddings_guide.md - OpenAI, Gemini, Sentence-Transformers
- Vector Stores: reference/vector_stores.md - FAISS, ChromaDB, Pinecone
- RAG Architectures: reference/rag_architectures.md - Variants, Memory systems
- RAG Memory: reference/rag_memory.md - Conversational, Multi-turn
- RAG Evaluation: reference/rag_evaluation.md - RAGAS, LLM-as-Judge
- Hybrid Search: reference/hybrid_search.md - BM25 + Semantic
Common Mistakes to Avoid
1. Chunks Too Small
# WRONG: Information gets fragmented
splitter = RecursiveCharacterTextSplitter(chunk_size=100)
# CORRECT: Reasonable chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

2. No Overlap Between Chunks
# WRONG: Context lost at boundaries
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
# CORRECT: Overlap preserves context
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

3. Mixing Embedding Models
# WRONG: Different models produce incompatible embeddings
index_embeddings = model_a.encode(documents)
query_embedding = model_b.encode(query) # Different model!
# CORRECT: Same model for indexing and querying
embeddings_model = SentenceTransformer("all-MiniLM-L6-v2")
index_embeddings = embeddings_model.encode(documents)
query_embedding = embeddings_model.encode(query)

4. Not Validating Retrieval Before LLM
# WRONG: Debugging LLM when retrieval is the problem
# Always check what's being retrieved first!
# CORRECT: Validate retrieval separately
docs = retriever.get_relevant_documents(query)
for i, doc in enumerate(docs):
    print(f"Doc {i}: {doc.page_content[:200]}")
# Then check if these are the right documents

5. Ignoring Metadata Filtering
# WRONG: Retrieve from all documents
results = vectorstore.similarity_search(query, k=5)
# CORRECT: Filter by metadata when relevant
# Filter syntax depends on the vector store; check its docs for combining conditions
results = vectorstore.similarity_search(
    query, k=5,
    filter={"document_type": "policy", "year": 2024},
)

Teaching Mode
When explaining RAG:
RAG Intuition
Without RAG:
"What's our refund policy?" → LLM guesses (may hallucinate!)
With RAG:
"What's our refund policy?"
→ Search company documents
→ Find: "Refunds within 30 days with receipt"
→ LLM answers: "Our refund policy allows returns within 30 days..."
RAG = "Open book exam" for LLMs

Embedding Space Visual
Similar documents are CLOSE in embedding space:
● "machine learning tutorial"
↘
● "ML course content"
↘
● "deep learning basics"
● "cooking recipes"
(far away - different topic)

Retrieval Pipeline
Query: "How do I train a neural network?"
↓
[Embed Query]
↓
[Search Vector DB]
↓
Top 3 relevant chunks:
1. "Training neural networks involves..."
2. "Use backpropagation to update..."
3. "Choose optimizer like Adam..."
↓
[Inject into prompt]
↓
"Using the following context: [chunks]
Answer: How do I train a neural network?"
↓
[LLM generates grounded answer]
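
The "[Inject into prompt]" step above can be sketched as a small helper that formats retrieved chunks into the prompt text shown in the diagram. This is one possible layout, not a prescribed template; `build_rag_prompt`, the `max_chunks` parameter, and the numbered `[1]`, `[2]` markers are assumptions added for illustration.

```python
def build_rag_prompt(question: str, chunks: list[str], max_chunks: int = 3) -> str:
    """Format the top retrieved chunks and the user question into a single prompt."""
    selected = chunks[:max_chunks]  # keep only the top-k chunks, in rank order
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1))
    return (
        "Using the following context:\n"
        f"{context}\n\n"
        f"Answer: {question}"
    )


# Chunk texts copied from the diagram above, plus one extra chunk that should be cut off
retrieved_chunks = [
    "Training neural networks involves...",
    "Use backpropagation to update...",
    "Choose optimizer like Adam...",
    "An unrelated fourth chunk",
]
prompt = build_rag_prompt("How do I train a neural network?", retrieved_chunks)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which chunk supports each claim, and capping the count keeps the prompt within the model's context budget.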