Guide to using document loader integrations in LangChain for PDFs, web pages, text files, and APIs
Install
Install via the SkillsCat registry:
npx skillscat add christian-bromann/langchain-skills/skills-langchain-integrations-document-loaders-python
langchain-document-loaders (Python)
Overview
Document loaders extract data from various sources and formats into LangChain's standardized Document format. They're essential for building RAG systems, as they convert raw data into Document objects (text plus metadata) that can then be split, embedded, and indexed.
Key Concepts
- Document: Object with page_content (text) and metadata (source info, page numbers, etc.)
- Loaders: Classes that extract content from specific sources/formats
- Metadata: Contextual information preserved during loading
- Lazy Loading: Stream documents without loading everything into memory
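To make these concepts concrete, here is a minimal standard-library sketch. The Document dataclass below is a stand-in that mirrors only the two fields named above, not LangChain's actual class, and lazy_load_lines and load_lines are hypothetical helpers. They show how a lazy loader streams documents one at a time while an eager load simply collects everything the lazy version yields.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_load_lines(lines: Iterable[str], source: str) -> Iterator[Document]:
    """Yield one Document per non-empty line without building a list first."""
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if text:
            yield Document(text, {"source": source, "line": line_number})


def load_lines(lines: Iterable[str], source: str) -> List[Document]:
    """Eager variant: materialize everything the lazy loader yields."""
    return list(lazy_load_lines(lines, source))


docs = load_lines(["first\n", "\n", "third\n"], source="in-memory")
print(len(docs), docs[-1].metadata)
```

The metadata keys used here (source, line) are illustrative; real loaders decide their own keys, such as the page number for PDFs.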
Loader Selection Decision Table
| Loader Type | Best For | Package | Key Features |
|---|---|---|---|
| PyPDFLoader | PDF files | langchain-community | Page-by-page extraction |
| WebBaseLoader | Web pages | langchain-community | HTML parsing with BeautifulSoup |
| TextLoader | Plain text files | langchain-community | Simple text files |
| JSONLoader | JSON files/APIs | langchain-community | Extract specific JSON fields |
| CSVLoader | CSV files | langchain-community | Tabular data |
| DirectoryLoader | Multiple files | langchain-community | Bulk loading from directories |
| UnstructuredLoader | Various formats | langchain-community | PDFs, DOCXs, PPTs, images |
When to Choose Each Loader
Choose PyPDFLoader if:
- You're processing standard PDF documents
- You need page number metadata
- PDFs contain extractable text
Choose WebBaseLoader if:
- You're scraping web pages
- You need to parse HTML content
- You want to filter by CSS selectors
Choose UnstructuredLoader if:
- You have mixed document types
- You need OCR for scanned documents
- You want sophisticated parsing
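The selection rules above can be sketched as a small lookup. This is only an illustration of the decision table: suggest_loader and SUFFIX_TO_LOADER are hypothetical names, the function returns loader names as strings rather than importing any classes, and falling back to UnstructuredLoader for unknown extensions is an assumption based on its "various formats" row.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File extension -> loader name, following the decision table above.
SUFFIX_TO_LOADER = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
    ".json": "JSONLoader",
    ".csv": "CSVLoader",
}


def suggest_loader(target: str, needs_ocr: bool = False) -> str:
    """Return the name of a loader that fits a URL or file path."""
    # Web pages go to the HTML-parsing loader.
    if urlparse(target).scheme in ("http", "https"):
        return "WebBaseLoader"
    # Scanned documents need OCR, which only the Unstructured route covers here.
    if needs_ocr:
        return "UnstructuredLoader"
    suffix = PurePosixPath(target).suffix.lower()
    # Unknown or mixed formats fall back to the general-purpose loader.
    return SUFFIX_TO_LOADER.get(suffix, "UnstructuredLoader")


print(suggest_loader("path/to/document.pdf"))
print(suggest_loader("scanned.pdf", needs_ocr=True))
```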
Code Examples
PDF Loader
from langchain_community.document_loaders import PyPDFLoader
# Load PDF file
loader = PyPDFLoader("path/to/document.pdf")
docs = loader.load()
print(f"Loaded {len(docs)} pages")
for i, doc in enumerate(docs):
    print(f"Page {i + 1}:", doc.metadata)
    print(doc.page_content[:100])
# Each page is a separate document
# metadata includes: source, page number
# Lazy loading for large PDFs
loader = PyPDFLoader("large-file.pdf")
for doc in loader.lazy_load():
    print(f"Processing page {doc.metadata['page']}")
Web Scraping
from langchain_community.document_loaders import WebBaseLoader
import bs4  # Needed for the SoupStrainer filter below
# Load single URL
loader = WebBaseLoader("https://docs.langchain.com")
docs = loader.load()
print(docs[0].page_content)
print(docs[0].metadata) # {'source': url, ...}
# With custom BeautifulSoup parsing
loader = WebBaseLoader(
    "https://news.ycombinator.com",
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_=("storylink", "subtext"))
    }
)
# Multiple URLs
loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2",
])
docs = loader.load()
Text File Loader
from langchain_community.document_loaders import TextLoader
loader = TextLoader("path/to/file.txt")
docs = loader.load()
# Returns single document with entire file content
print(docs[0].page_content)
print(docs[0].metadata["source"]) # File path
# With specific encoding
loader = TextLoader("file.txt", encoding="utf-8")
JSON Loader
from langchain_community.document_loaders import JSONLoader
# Load JSON with specific field extraction (JSONLoader requires the jq package)
loader = JSONLoader(
    file_path="path/to/data.json",
    jq_schema=".texts[].content", # jq syntax to extract fields
    text_content=False
)
docs = loader.load()
# Example JSON: {"texts": [{"content": "...", "id": 1}]}
# Each matching field becomes a document
# With metadata function
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["id"] = record.get("id")
    metadata["category"] = record.get("category")
    return metadata
loader = JSONLoader(
    file_path="data.json",
    jq_schema=".items[]",
    content_key="text",
    metadata_func=metadata_func
)
CSV Loader
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(
    file_path="path/to/data.csv",
    source_column="source", # Column whose value becomes metadata["source"]
)
docs = loader.load()
# Each row becomes a document
# page_content holds the row as "column: value" lines; metadata holds the source and row number
Directory Loader
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Load all text files from directory
loader = DirectoryLoader(
    "path/to/documents",
    glob="**/*.txt", # Pattern for files to load
    loader_cls=TextLoader
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
# With multiple file types
from langchain_community.document_loaders import PyPDFLoader
# Custom loader for different file types
def get_loader(file_path):
    if file_path.endswith(".pdf"):
        return PyPDFLoader(file_path)
    elif file_path.endswith(".txt"):
        return TextLoader(file_path)
Unstructured Loader (Advanced)
from langchain_community.document_loaders import UnstructuredFileLoader
# Handles PDFs, DOCXs, PPTs, images, etc.
loader = UnstructuredFileLoader("path/to/document.docx")
docs = loader.load()
# With OCR for scanned documents
loader = UnstructuredFileLoader(
    "scanned.pdf",
    strategy="ocr_only", # Use OCR
    languages=["eng"]
)
# UnstructuredURLLoader for web pages
from langchain_community.document_loaders import UnstructuredURLLoader
loader = UnstructuredURLLoader(urls=["https://example.com"])
docs = loader.load()
S3 Loader (Cloud Storage)
from langchain_community.document_loaders import S3FileLoader
loader = S3FileLoader(
    bucket="my-bucket",
    key="documents/file.pdf"
)
docs = loader.load()
# S3 Directory Loader
from langchain_community.document_loaders import S3DirectoryLoader
loader = S3DirectoryLoader(
    bucket="my-bucket",
    prefix="documents/"
)
docs = loader.load()
Custom Metadata Example
from langchain_community.document_loaders import TextLoader
from datetime import datetime
loader = TextLoader("document.txt")
docs = loader.load()
# Enrich with custom metadata
for doc in docs:
    doc.metadata["loaded_at"] = datetime.now().isoformat()
    doc.metadata["category"] = "research"
Lazy Loading (Memory Efficient)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("large-file.pdf")
# Stream documents one at a time
for doc in loader.lazy_load():
    print(f"Processing page {doc.metadata.get('page', 0)}")
    # Process without loading all pages into memory
Boundaries
What Agents CAN Do
✅ Load from various sources
- PDF, text, CSV, JSON, DOCX, PPTX files
- Web pages and URLs
- Cloud storage (S3, GCS, Azure)
- APIs and databases
✅ Extract with metadata
- Preserve source information
- Add custom metadata
- Track page numbers, URLs, timestamps
✅ Process efficiently
- Use lazy loading for large files
- Batch process directories
- Stream data
✅ Customize extraction
- Use jq for JSON extraction
- BeautifulSoup for HTML parsing
- OCR for scanned documents
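For readers unfamiliar with jq, the following plain-Python sketch approximates what a schema like .items[] combined with a content key and a metadata function selects. It does not call jq or JSONLoader: extract_records is a hypothetical helper, the output is a list of plain dicts rather than Document objects, and the seq_num key is an assumption about how records might be numbered.

```python
import json
from typing import Dict, List


def extract_records(raw_json: str, content_key: str = "text") -> List[Dict]:
    """Approximate `.items[]` + content_key: one record per array element."""
    data = json.loads(raw_json)
    records = []
    for seq_num, item in enumerate(data["items"], start=1):
        # Mirror a metadata function that copies selected fields.
        metadata = {
            "seq_num": seq_num,
            "id": item.get("id"),
            "category": item.get("category"),
        }
        records.append({"page_content": item[content_key], "metadata": metadata})
    return records


raw = '{"items": [{"text": "hello", "id": 1, "category": "a"}, {"text": "bye", "id": 2}]}'
for record in extract_records(raw):
    print(record)
```

Fields missing from an item (category in the second record) come back as None because dict.get is used, which matches the record.get calls in a typical metadata function.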
What Agents CANNOT Do
❌ Extract from encrypted/protected files
- Cannot bypass password-protected PDFs
- Cannot access auth-required sites without credentials
❌ Process all formats automatically
- Scanned PDFs need OCR
- Proprietary formats need specific loaders
❌ Bypass rate limits
- Must respect website rate limiting
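One way to respect rate limits when loading many URLs is to space out the requests yourself. The sketch below is a generic standard-library throttle, not a LangChain feature: throttled is a hypothetical helper, and the clock and sleep parameters exist only so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    urls: Iterable[str],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs so that consecutive yields are at least min_interval seconds apart."""
    last = None
    for url in urls:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url


# Usage: create one loader per URL inside the loop so requests are spaced out.
for url in throttled(["https://example.com/page1", "https://example.com/page2"], min_interval=0.01):
    print("would load", url)
```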
Gotchas
1. Import from Correct Package
# ❌ OLD: Using langchain imports
from langchain.document_loaders import PyPDFLoader # Deprecated!
# ✅ NEW: Use community package
from langchain_community.document_loaders import PyPDFLoader
Fix: Use the langchain-community package.
2. PyPDF vs Unstructured
# ❌ PyPDF may not work for complex PDFs
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("complex.pdf")
docs = loader.load() # Poor extraction!
# ✅ Use Unstructured for complex PDFs
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("complex.pdf")
docs = loader.load() # Better extraction
Fix: Use UnstructuredPDFLoader for complex layouts.
3. Web Scraping Dependencies
# ❌ Missing dependencies
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://example.com")
# ImportError: bs4 not found!
# ✅ Install required packages
# pip install beautifulsoup4 lxml
Fix: Install beautifulsoup4 and lxml.
4. Unstructured API Keys
# ❌ Unstructured may need API key for advanced features
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("file.pdf", strategy="ocr_only")
# May fail without API key!
# ✅ Set API key or install dependencies
# pip install unstructured[local-inference]
# Or set UNSTRUCTURED_API_KEY environment variable
Fix: Install local dependencies or use API key.
5. Encoding Issues
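# ❌ Without an explicit encoding, the platform default is used and may not match the file
from langchain_community.document_loaders import TextLoader
loader = TextLoader("file.txt")
docs = loader.load() # May fail with a decoding error!
# ✅ Specify encoding
loader = TextLoader("file.txt", encoding="utf-8")
docs = loader.load()
Fix: Specify correct encoding for text files.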
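When the encoding of a file is not known in advance, one option is to try a short list of candidates before handing the text on. The sketch below is a standard-library illustration only: decode_with_fallback is a hypothetical helper, and the candidate list (utf-8, then cp1252) is an assumption that should be adapted to your data. TextLoader also accepts an autodetect_encoding flag, which may be simpler if the extra detection dependency is acceptable.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {list(encodings)} could decode the input")


# Bytes written by a legacy Windows tool: invalid as UTF-8, valid as cp1252.
legacy_bytes = "café".encode("cp1252")
text, used = decode_with_fallback(legacy_bytes)
print(text, used)
```

In practice the bytes would come from something like Path("file.txt").read_bytes(), and the detected encoding can then be passed to TextLoader via its encoding argument.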
6. S3 Credentials
# ❌ Missing AWS credentials
from langchain_community.document_loaders import S3FileLoader
loader = S3FileLoader("bucket", "key")
docs = loader.load() # Credential error!
# ✅ Configure AWS credentials
# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
# Or use AWS CLI: aws configure
Fix: Configure AWS credentials properly.
Links and Resources
Package Installation
# Community loaders
pip install langchain-community
# PDF support
pip install pypdf
# Web scraping
pip install beautifulsoup4 lxml
# Unstructured (advanced)
pip install unstructured
# or with local inference
pip install "unstructured[local-inference]"
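Before running a loader, it can help to check which of the packages above are actually importable. The following is a minimal sketch using only the standard library; missing_packages and OPTIONAL_DEPENDENCIES are hypothetical names, and the mapping from import name to pip name simply restates the install commands listed above.

```python
from importlib.util import find_spec
from typing import Dict, List

# Import name -> pip package name, taken from the install commands above.
OPTIONAL_DEPENDENCIES = {
    "langchain_community": "langchain-community",
    "pypdf": "pypdf",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "unstructured": "unstructured",
}


def missing_packages(dependencies: Dict[str, str] = OPTIONAL_DEPENDENCIES) -> List[str]:
    """Return pip names for every top-level module that cannot be found."""
    return [
        pip_name
        for module, pip_name in dependencies.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```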