langchain-multimodal

Work with multimodal inputs and outputs in LangChain: images, audio, video, content blocks, and vision capabilities

christian-bromann 3 1 Updated 5mo ago

Install

npx skillscat add christian-bromann/langchain-skills/langchain-multimodal

Install via the SkillsCat registry.

SKILL.md

langchain-multimodal (JavaScript/TypeScript)

Overview

Multimodal support lets you work with images, audio, video, and other non-text data. Models with multimodal capabilities can process and generate content across these different formats.

Key Concepts:

Content Blocks: Structured representation of multimodal data
Vision: Image understanding with GPT-4.1, Claude, Gemini
Audio/Video: Emerging support in newer models
Standard Format: Cross-provider content block structure

Decision Tables

Model Selection for Multimodal

Task	Recommended Model	Why
Image understanding	GPT-4.1, Claude Sonnet, Gemini	Strong vision capabilities
Image generation	DALL-E (via OpenAI)	Specialized for generation
Document analysis (PDF)	Claude, GPT-4.1	Handle complex layouts
Audio transcription	Whisper (OpenAI)	Specialized for audio

Input Methods

Method	When to Use	Example
URL	Public images	`{ type: "image", url: "https://..." }`
Base64	Private/local images	`{ type: "image", data: "base64..." }`
File reference	Provider file APIs	`{ type: "image", fileId: "..." }`
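
The three input methods in the table differ only in which field carries the image. A minimal sketch of a helper that builds the matching block is shown below; `ImageSource`, `ImageBlock`, and `toImageBlock` are hypothetical names introduced for illustration and are plain local types, not LangChain exports.

```typescript
// Hypothetical local types that mirror the three block shapes in the table.
type ImageSource =
  | { kind: "url"; url: string }
  | { kind: "base64"; data: string; mimeType: string }
  | { kind: "file"; fileId: string };

type ImageBlock =
  | { type: "image"; url: string }
  | { type: "image"; data: string; mimeType: string }
  | { type: "image"; fileId: string };

// Map a source description to the content block shape for that input method.
function toImageBlock(source: ImageSource): ImageBlock {
  switch (source.kind) {
    case "url":
      return { type: "image", url: source.url };
    case "base64":
      return { type: "image", data: source.data, mimeType: source.mimeType };
    case "file":
      return { type: "image", fileId: source.fileId };
  }
}
```

The returned object can be placed directly in a `contentBlocks` array next to a text block.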

Code Examples

Basic Image Input (URL)

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";

const model = new ChatOpenAI({ model: "gpt-4.1" });

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "What's in this image?" },
    { 
      type: "image",
      url: "https://example.com/photo.jpg",
    },
  ],
});

const response = await model.invoke([message]);
console.log(response.content);

Base64 Image Input

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";
import fs from "fs";

const model = new ChatOpenAI({ model: "gpt-4.1" });

// Read image and convert to base64
const imageBuffer = fs.readFileSync("./photo.jpg");
const base64Image = imageBuffer.toString("base64");

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Describe this image in detail" },
    {
      type: "image",
      data: base64Image,
      mimeType: "image/jpeg",
    },
  ],
});

const response = await model.invoke([message]);

Multiple Images

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "langchain";

const model = new ChatOpenAI({ model: "gpt-4.1" });

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Compare these two images" },
    { type: "image", url: "https://example.com/image1.jpg" },
    { type: "image", url: "https://example.com/image2.jpg" },
  ],
});

const response = await model.invoke([message]);
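
When the number of images is not fixed, the block list can be assembled from an array. The sketch below assumes URL-based images only; `buildImagePrompt` and the two block types are hypothetical local definitions, not part of LangChain.

```typescript
// Plain local types for illustration; not LangChain's exported types.
type TextBlock = { type: "text"; text: string };
type UrlImageBlock = { type: "image"; url: string };

// Build one text block followed by one image block per URL.
function buildImagePrompt(
  prompt: string,
  imageUrls: string[],
): Array<TextBlock | UrlImageBlock> {
  if (imageUrls.length === 0) {
    throw new Error("At least one image URL is required");
  }
  const images = imageUrls.map(
    (url): UrlImageBlock => ({ type: "image", url }),
  );
  return [{ type: "text", text: prompt }, ...images];
}
```

The result can be passed as `contentBlocks` when constructing the `HumanMessage`.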

PDF Document Analysis

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "langchain";
import fs from "fs";

const model = new ChatAnthropic({ model: "claude-sonnet-4-5-20250929" });

const pdfBuffer = fs.readFileSync("./document.pdf");
const base64Pdf = pdfBuffer.toString("base64");

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Summarize this PDF document" },
    {
      type: "file",
      data: base64Pdf,
      mimeType: "application/pdf",
    },
  ],
});

const response = await model.invoke([message]);

Audio Input (Emerging)

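import { HumanMessage } from "langchain";
import fs from "fs";

// Audio input is emerging and model-specific; "./recording.mp3" is a placeholder path
const base64Audio = fs.readFileSync("./recording.mp3").toString("base64");

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "Transcribe this audio" },
    {
      type: "audio",
      data: base64Audio,
      mimeType: "audio/mpeg",
    },
  ],
});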

Accessing Multimodal Output

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({ model: "gpt-4.1" });

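// Image blocks appear only when the model and its configuration support image output
const response = await model.invoke("Create an image of a sunset");

// Access content blocks
for (const block of response.contentBlocks) {
  if (block.type === "text") {
    console.log("Text:", block.text);
  } else if (block.type === "image") {
    // An image block carries a URL or base64 data; log whichever is present
    if (block.url) console.log("Image URL:", block.url);
    if (typeof block.data === "string") {
      console.log("Image data:", `${block.data.substring(0, 50)}...`);
    }
  }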
}
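
The loop above can be factored into a small pure function that separates text from image blocks. This is a minimal sketch: `LooseBlock` and `splitBlocks` are hypothetical names, and the interface is a simplified stand-in for the real content block types.

```typescript
// Simplified stand-in for a content block; the real types carry more fields.
interface LooseBlock {
  type: string;
  text?: string;
  url?: string;
  data?: string;
}

// Concatenate all text blocks and collect image blocks separately.
function splitBlocks(blocks: LooseBlock[]): { text: string; images: LooseBlock[] } {
  const text = blocks
    .filter((b) => b.type === "text" && typeof b.text === "string")
    .map((b) => b.text)
    .join("");
  const images = blocks.filter((b) => b.type === "image");
  return { text, images };
}
```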

Vision with Claude

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "langchain";

const model = new ChatAnthropic({ 
  model: "claude-sonnet-4-5-20250929",
});

const message = new HumanMessage({
  contentBlocks: [
    {
      type: "image",
      url: "https://example.com/chart.png",
    },
    {
      type: "text",
      text: "Extract all data points from this chart and format as a table",
    },
  ],
});

const response = await model.invoke([message]);

Vision with Gemini

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { HumanMessage } from "langchain";

const model = new ChatGoogleGenerativeAI({
  model: "gemini-2.5-flash",
});

const message = new HumanMessage({
  contentBlocks: [
    { type: "text", text: "What objects are in this image?" },
    { type: "image", url: "https://example.com/scene.jpg" },
  ],
});

const response = await model.invoke([message]);

Boundaries

What You CAN Do

✅ Image URLs: Public images via HTTPS
✅ Base64 images: Local or private images
✅ Multiple images: Compare, analyze together
✅ PDF documents: Text extraction, analysis
✅ Cross-provider format: Standard content blocks

What You CANNOT Do (Yet)

❌ Image generation in all models: Limited to specific models
❌ Video understanding: Emerging, limited support
❌ Audio in all models: Model-specific
❌ Modify images: Models analyze, don't edit

Gotchas

1. Model Doesn't Support Multimodal

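import { ChatOpenAI } from "@langchain/openai";

// imageMessage: a HumanMessage that contains an image block

// ❌ Problem: Using text-only model
const textOnlyModel = new ChatOpenAI({ model: "gpt-3.5-turbo" });
await textOnlyModel.invoke([imageMessage]);  // Error!

// ✅ Solution: Use vision-capable model
const visionModel = new ChatOpenAI({ model: "gpt-4.1" });
await visionModel.invoke([imageMessage]);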

2. Wrong Content Block Format

// ❌ Problem: Old, provider-specific format
const legacyMessage = new HumanMessage({
  content: [
    { type: "image_url", image_url: { url: "..." } }  // OpenAI-specific
  ]
});

// ✅ Solution: Use standard content blocks
const message = new HumanMessage({
  contentBlocks: [
    { type: "image", url: "..." }  // Cross-provider
  ]
});

3. Missing MIME Type for Base64

// ❌ Problem: No MIME type
{ type: "image", data: base64Data }  // May fail

// ✅ Solution: Always include MIME type
{ type: "image", data: base64Data, mimeType: "image/jpeg" }
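
One way to avoid forgetting the MIME type is to derive it from the file extension. The sketch below is an assumption-laden helper, not a LangChain API: `inferImageMimeType` and the extension table are hypothetical, and the table covers only a few common formats.

```typescript
// Hypothetical lookup table; extend it with the formats your provider accepts.
const IMAGE_MIME_TYPES: Record<string, string> = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
  gif: "image/gif",
  webp: "image/webp",
};

// Return the MIME type for a file path, or throw for unknown extensions.
function inferImageMimeType(filePath: string): string {
  const ext = filePath.split(".").pop()?.toLowerCase() ?? "";
  const mimeType = IMAGE_MIME_TYPES[ext];
  if (!mimeType) {
    throw new Error(`Unsupported image extension: "${ext}"`);
  }
  return mimeType;
}
```

Throwing on unknown extensions surfaces the problem before the request is sent, rather than relying on the provider to reject it.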

4. Image Too Large

import fs from "fs";
import sharp from "sharp";

// ❌ Problem: Image exceeds size limit
const hugeImage = fs.readFileSync("./10mb_image.jpg");
// Model may reject

// ✅ Solution: Resize or compress images first

const resized = await sharp("./10mb_image.jpg")
  .resize(1024, 1024, { fit: "inside" })
  .jpeg({ quality: 80 })
  .toBuffer();
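
Base64 encoding inflates a payload by roughly a third, so it can help to check the encoded size before sending. The following is a sketch under stated assumptions: the 5 MB value of `MAX_IMAGE_BYTES` is a hypothetical placeholder, and whether a provider measures the raw or the encoded size varies, so check the provider's documentation.

```typescript
// Hypothetical limit; replace with your provider's documented value.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Length of the padded base64 string for a given number of raw bytes.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// True when the base64-encoded payload stays within the limit.
function fitsWithinLimit(
  byteLength: number,
  maxBytes: number = MAX_IMAGE_BYTES,
): boolean {
  return base64Length(byteLength) <= maxBytes;
}
```

Calling `fitsWithinLimit(resized.length)` on the resized buffer would indicate whether further compression is needed.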
