Work with multimodal inputs/outputs in LangChain - includes images, audio, video, content blocks, and vision capabilities
Install
Install via the SkillsCat registry:
npx skillscat add christian-bromann/langchain-skills/langchain-multimodal
langchain-multimodal (JavaScript/TypeScript)
Overview
Multimodal support lets you work with images, audio, video, and other non-text data. Models with multimodal capabilities can process and generate content across these different formats.
Key Concepts:
- Content Blocks: Structured representation of multimodal data
- Vision: Image understanding with vision-capable models such as GPT-4.1, Claude, and Gemini
- Audio/Video: Emerging support in newer models
- Standard Format: Cross-provider content block structure
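One way to picture the cross-provider structure is as a discriminated union keyed on `type`. The sketch below is a simplified model of the block shapes used in this document, not LangChain's own type definitions; the names `TextBlock`, `ImageBlock`, `FileBlock`, `AudioBlock`, and `describeBlock` are hypothetical.

```typescript
// Simplified, hypothetical model of the content block shapes used in this document.
// These are NOT LangChain's own type definitions.
type TextBlock = { type: "text"; text: string };
type ImageBlock = {
  type: "image";
  url?: string;
  data?: string;
  mimeType?: string;
  fileId?: string;
};
type FileBlock = { type: "file"; data: string; mimeType: string };
type AudioBlock = { type: "audio"; data: string; mimeType: string };
type Block = TextBlock | ImageBlock | FileBlock | AudioBlock;

// Narrowing on `type` gives each branch access to the fields of that block kind.
function describeBlock(block: Block): string {
  switch (block.type) {
    case "text":
      return `text (${block.text.length} chars)`;
    case "image":
      if (block.url) return `image from URL ${block.url}`;
      if (block.fileId) return `image from file reference ${block.fileId}`;
      return `inline image (${block.mimeType ?? "unknown MIME type"})`;
    case "file":
    case "audio":
      return `inline ${block.type} (${block.mimeType})`;
  }
}

const exampleBlocks: Block[] = [
  { type: "text", text: "What's in this image?" },
  { type: "image", url: "https://example.com/photo.jpg" },
];
console.log(exampleBlocks.map(describeBlock));
```

Because every block carries a `type` tag, code that consumes model output can branch on it without knowing which provider produced the block.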
Decision Tables
Model Selection for Multimodal
| Task | Recommended Model | Why |
|---|---|---|
| Image understanding | GPT-4.1, Claude Sonnet, Gemini | Strong vision capabilities |
| Image generation | DALL-E (via OpenAI) | Specialized for generation |
| Document analysis (PDF) | Claude, GPT-4.1 | Handle complex layouts |
| Audio transcription | Whisper (OpenAI) | Specialized for audio |
Input Methods
| Method | When to Use | Example |
|---|---|---|
| URL | Public images | { type: "image", url: "https://..." } |
| Base64 | Private/local images | { type: "image", data: "base64..." } |
| File reference | Provider file APIs | { type: "image", fileId: "..." } |
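The three input methods in the table can be folded into one helper that picks the block shape from the kind of source. This is a minimal sketch under stated assumptions: `ImageSource` and `toImageBlock` are hypothetical names, and the returned object only mirrors the Example column above rather than any library type.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical description of where an image comes from; not a LangChain type.
type ImageSource =
  | { kind: "url"; url: string }
  | { kind: "bytes"; bytes: Uint8Array; mimeType: string }
  | { kind: "fileId"; fileId: string };

// Block shape mirroring the Example column of the table above.
type ImageBlock = {
  type: "image";
  url?: string;
  data?: string;
  mimeType?: string;
  fileId?: string;
};

function toImageBlock(source: ImageSource): ImageBlock {
  switch (source.kind) {
    case "url":
      // Public image: pass the URL through unchanged.
      return { type: "image", url: source.url };
    case "bytes":
      // Private or local image: base64-encode the bytes and keep the MIME type.
      return {
        type: "image",
        data: Buffer.from(source.bytes).toString("base64"),
        mimeType: source.mimeType,
      };
    case "fileId":
      // Image already uploaded through a provider file API.
      return { type: "image", fileId: source.fileId };
  }
}

console.log(toImageBlock({ kind: "url", url: "https://example.com/photo.jpg" }));
```

The result can be placed directly in a `contentBlocks` array next to a text block.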
Code Examples
Basic Image Input (URL)
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";
const model = new ChatOpenAI({ model: "gpt-4.1" });
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "What's in this image?" },
    {
      type: "image",
      url: "https://example.com/photo.jpg",
    },
  ],
});
const response = await model.invoke([message]);
console.log(response.content);
Base64 Image Input
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";
import fs from "fs";
const model = new ChatOpenAI({ model: "gpt-4.1" });
// Read image and convert to base64
const imageBuffer = fs.readFileSync("./photo.jpg");
const base64Image = imageBuffer.toString("base64");
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Describe this image in detail" },
    {
      type: "image",
      data: base64Image,
      mimeType: "image/jpeg",
    },
  ],
});
const response = await model.invoke([message]);
Multiple Images
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";
const model = new ChatOpenAI({ model: "gpt-4.1" });
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Compare these two images" },
    { type: "image", url: "https://example.com/image1.jpg" },
    { type: "image", url: "https://example.com/image2.jpg" },
  ],
});
const response = await model.invoke([message]);
PDF Document Analysis
import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "langchain";
import fs from "fs";
const model = new ChatAnthropic({ model: "claude-sonnet-4-5-20250929" });
const pdfBuffer = fs.readFileSync("./document.pdf");
const base64Pdf = pdfBuffer.toString("base64");
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Summarize this PDF document" },
    {
      type: "file",
      data: base64Pdf,
      mimeType: "application/pdf",
    },
  ],
});
const response = await model.invoke([message]);
Audio Input (Emerging)
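// Audio support is emerging and model-specific; check that your model accepts audio input
import { HumanMessage } from "langchain";
import fs from "fs";
// Read the audio file and convert it to base64 (the path is illustrative)
const base64Audio = fs.readFileSync("./recording.mp3").toString("base64");
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Transcribe this audio" },
    {
      type: "audio",
      data: base64Audio,
      mimeType: "audio/mpeg",
    },
  ],
});
Accessing Multimodal Output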
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({ model: "gpt-4.1" });
const response = await model.invoke("Create an image of a sunset");
// Access content blocks; image blocks appear only if the model and provider return image output
for (const block of response.contentBlocks) {
  if (block.type === "text") {
    console.log("Text:", block.text);
  } else if (block.type === "image") {
    // An image block carries either a URL or inline data, so check each before logging
    if (block.url) console.log("Image URL:", block.url);
    if (typeof block.data === "string") {
      console.log("Image data:", block.data.substring(0, 50) + "...");
    }
  }
}
Vision with Claude
import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "langchain";
const model = new ChatAnthropic({
  model: "claude-sonnet-4-5-20250929",
});
const message = new HumanMessage({
  contentBlocks: [
    {
      type: "image",
      url: "https://example.com/chart.png",
    },
    {
      type: "text",
      text: "Extract all data points from this chart and format as a table",
    },
  ],
});
const response = await model.invoke([message]);
Vision with Gemini
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { HumanMessage } from "langchain";
const model = new ChatGoogleGenerativeAI({
  model: "gemini-2.5-flash",
});
const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "What objects are in this image?" },
    { type: "image", url: "https://example.com/scene.jpg" },
  ],
});
const response = await model.invoke([message]);
Boundaries
What You CAN Do
✅ Image URLs: Public images via HTTPS
✅ Base64 images: Local or private images
✅ Multiple images: Compare, analyze together
✅ PDF documents: Text extraction, analysis
✅ Cross-provider format: Standard content blocks
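Base64 images and PDFs both need a MIME type alongside the encoded data. A minimal sketch of inferring it from the file extension follows; `MIME_BY_EXTENSION` and `mimeTypeFor` are hypothetical names, and the lookup covers only a handful of common formats rather than everything a provider might accept.

```typescript
import { extname } from "node:path";

// Small illustrative lookup; extend it for the formats your provider accepts.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
  ".pdf": "application/pdf",
  ".mp3": "audio/mpeg",
};

// Hypothetical helper: returns the MIME type for a path, or throws if the
// extension is not in the lookup, so a missing type surfaces before the request.
function mimeTypeFor(filePath: string): string {
  const ext = extname(filePath).toLowerCase();
  const mimeType = MIME_BY_EXTENSION[ext];
  if (!mimeType) {
    throw new Error(`No MIME type known for "${ext}" (${filePath})`);
  }
  return mimeType;
}

console.log(mimeTypeFor("./photo.jpg")); // image/jpeg
console.log(mimeTypeFor("./document.pdf")); // application/pdf
```

Throwing on unknown extensions is a deliberate choice here: guessing a type would hide the problem until the provider rejects the block.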
What You CANNOT Do (Yet)
❌ Image generation with any model: Limited to specific models
❌ Video understanding: Emerging, limited support
❌ Audio with any model: Support is model-specific
❌ Modify images: Models analyze images but don't edit them
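Since support varies by model, it can help to check a message's block types before sending it. The sketch below assumes a hand-maintained capability table: `SUPPORTED_INPUTS` and `unsupportedBlockTypes` are hypothetical, and the two entries reflect only what this document states about those models, not provider data.

```typescript
type BlockType = "text" | "image" | "audio" | "video" | "file";

// Illustrative capability table, NOT provider data. The entries reflect only what
// this document states: gpt-3.5-turbo is text-only; gpt-4.1 handles images and PDFs.
const SUPPORTED_INPUTS: Record<string, ReadonlySet<BlockType>> = {
  "gpt-3.5-turbo": new Set<BlockType>(["text"]),
  "gpt-4.1": new Set<BlockType>(["text", "image", "file"]),
};

// Returns the block types in `blocks` that `model` is not known to accept.
// Unknown models are treated as text-only, which is a conservative assumption.
function unsupportedBlockTypes(
  model: string,
  blocks: { type: BlockType }[],
): BlockType[] {
  const supported = SUPPORTED_INPUTS[model] ?? new Set<BlockType>(["text"]);
  const missing = new Set<BlockType>();
  for (const block of blocks) {
    if (!supported.has(block.type)) missing.add(block.type);
  }
  return Array.from(missing);
}

const sampleBlocks: { type: BlockType }[] = [{ type: "text" }, { type: "image" }];
console.log(unsupportedBlockTypes("gpt-3.5-turbo", sampleBlocks)); // [ 'image' ]
console.log(unsupportedBlockTypes("gpt-4.1", sampleBlocks)); // []
```

A non-empty result is a cue to switch models or drop the unsupported blocks before calling `invoke`.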
Gotchas
1. Model Doesn't Support Multimodal
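import { ChatOpenAI } from "@langchain/openai";
// imageMessage is a HumanMessage containing an image block, as in the examples above
// ❌ Problem: Using text-only model
const textOnlyModel = new ChatOpenAI({ model: "gpt-3.5-turbo" });
await textOnlyModel.invoke([imageMessage]); // Error!
// ✅ Solution: Use vision-capable model
const visionModel = new ChatOpenAI({ model: "gpt-4.1" });
await visionModel.invoke([imageMessage]);
2. Wrong Content Block Format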
// ❌ Problem: Old format
const legacyMessage = new HumanMessage({
  content: [
    { type: "image_url", image_url: { url: "..." } }, // OpenAI-specific
  ],
});
// ✅ Solution: Use standard content blocks
const message = new HumanMessage({
  contentBlocks: [
    { type: "image", url: "..." }, // Cross-provider
  ],
});
3. Missing MIME Type for Base64
// ❌ Problem: No MIME type
{ type: "image", data: base64Data } // May fail
// ✅ Solution: Always include MIME type
{ type: "image", data: base64Data, mimeType: "image/jpeg" }
4. Image Too Large
import fs from "fs";
import sharp from "sharp";
// ❌ Problem: Image exceeds size limit
const hugeImage = fs.readFileSync("./10mb_image.jpg");
// Model may reject
// ✅ Solution: Resize or compress images first
const resized = await sharp("./10mb_image.jpg")
  .resize(1024, 1024, { fit: "inside" })
  .jpeg({ quality: 80 })
  .toBuffer();
// Encode the smaller buffer for use as the `data` field of an image block
const base64Image = resized.toString("base64");
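When checking size, keep in mind that base64 encoding inflates the payload: every 3 input bytes become 4 output characters. The sketch below estimates the encoded size before sending. `base64Length`, `fitsLimit`, and the 5 MB `MAX_ENCODED_BYTES` value are assumptions for illustration, not a documented provider limit.

```typescript
// Base64 encodes every 3 input bytes as 4 output characters (with padding),
// so the encoded payload is roughly a third larger than the raw file.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Hypothetical limit chosen for illustration; look up your provider's actual limit.
const MAX_ENCODED_BYTES = 5 * 1024 * 1024;

// True when the base64-encoded form of a file of `byteLength` bytes fits the limit.
function fitsLimit(
  byteLength: number,
  maxEncodedBytes: number = MAX_ENCODED_BYTES,
): boolean {
  return base64Length(byteLength) <= maxEncodedBytes;
}

console.log(base64Length(3)); // 4
console.log(fitsLimit(10 * 1024 * 1024)); // false
console.log(fitsLimit(1024 * 1024)); // true
```

Running a check like this on the buffer length before encoding makes it clear when resizing or compression is needed.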