aws-fsx

"FSx for Lustre — performance tuning, striping, S3 data repositories, EKS integration. Use when configuring high-performance storage for ML on EKS. NOT for EBS or EFS."

tylertitsworth 3 1 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add tylertitsworth/skills/aws-fsx

Install via the SkillsCat registry.

SKILL.md

AWS FSx for Lustre

Architecture

An FSx for Lustre filesystem consists of:

Metadata Targets (MDTs): Store file metadata — names, timestamps, permissions, directory structure, file layouts. Hosted on metadata servers (MDS).
Object Storage Targets (OSTs): Store actual file data. Each OST is backed by a disk (SSD or HDD). Files are striped across OSTs for parallel I/O.
File servers: In-memory cache layer in front of OSTs. Hot data is served from cache (network-limited), cold data from disk.

Client (pod) ─── Lustre client ─── File servers ─── OSTs (data)
                                      │                └── SSD or HDD disks
                                      └── MDT (metadata)

Every client mounts the full filesystem and communicates directly with the file servers hosting the relevant OSTs — no single server bottleneck.

Read/Write Paths

Cached read: Client → file server in-memory/SSD cache → network-limited throughput
Uncached read: Client → file server → disk → limited by lower of network and disk throughput
Write: Client → file server → disk → limited by lower of network and disk throughput
S3 lazy load: First access to an S3-backed file triggers a fetch from S3 → stored on OSTs → subsequent reads from OSTs

Client Throughput Limits

Client Network Interface	Max Throughput per Client
Standard ENA	100 Gbps (5 Gbps per OST)
EFA	700 Gbps
EFA with GPUDirect Storage (GDS)	1200 Gbps

For filesystems with >10 GB/s throughput capacity, AWS recommends EFA-enabled clients. GDS allows GPU memory to read/write directly to Lustre storage without CPU copies.

Deployment Types

Type	Durability	Throughput	Min Size	Use Case
`SCRATCH_1`	None	200 MB/s per TiB (burst)	1.2 TiB	Ephemeral training data
`SCRATCH_2`	None	200 MB/s per TiB (burst)	1.2 TiB	Better networking than SCRATCH_1
`PERSISTENT_1`	In-AZ replication	50–200 MB/s per TiB	1.2 TiB	Longer-lived workloads
`PERSISTENT_2`	In-AZ replication	125–1000 MB/s per TiB	1.2 TiB	Production, highest throughput

SCRATCH filesystems have no data replication — data is lost if hardware fails. Ideal for training jobs where data is re-derivable from S3.

PERSISTENT_2 supports perUnitStorageThroughput of 125, 250, 500, or 1000 MB/s per TiB. 1000 MB/s requires SSD storage.

Storage sizing: Minimum 1.2 TiB, then increments of 2.4 TiB. Throughput scales linearly with size.

Striping

Lustre splits files into chunks distributed across multiple OSTs. This is the primary performance lever for large file I/O.

Default Progressive File Layout (PFL)

Filesystems created after August 2023 use a 4-component PFL:

File Size	Stripe Count	Effect
≤ 100 MiB	1	Single OST, no overhead
100 MiB – 10 GiB	8	Parallel I/O across 8 OSTs
10 GiB – 100 GiB	16	Higher parallelism
> 100 GiB	32	Maximum parallelism

Custom Striping

# Set stripe count on a directory (applies to new files)
lfs setstripe -c 16 /mount/training-data/

# Set PFL for a directory
lfs setstripe -E 100M -c 1 -E 10G -c 8 -E 100G -c 16 -E -1 -c 32 /mount/data/

# Check file layout
lfs getstripe /mount/data/large-file.bin

# Migrate existing file to new layout
lfs migrate -c 32 /mount/data/existing-file.bin

# View OST usage
lfs df -h /mount/

Striping Guidelines

Large files (>1 GiB): Higher stripe count improves throughput. Stripe across many OSTs.
Small files (<100 MiB): Stripe count of 1. Higher counts add metadata overhead (network round-trip per OST in layout).
Stripe count -1: Stripe across all OSTs. Use for largest files.
Stripe size: Default 1 MiB. Rarely needs changing.
Don't set high stripe counts on directories with many small files — metadata overhead degrades performance.

What Striping Can't Fix

Metadata-heavy workloads (millions of tiny files, ls on huge directories): Limited by MDT IOPS, not striping.
Single-threaded sequential reads: Limited by single OST throughput. Application must use parallel I/O.
Random small I/O: Lustre is optimized for large sequential I/O. Small random reads/writes are limited by latency.

Metadata IOPS

Metadata operations (file create, open, close, delete, directory operations) are limited by MDT performance.

PERSISTENT_2: User-Provisioned Metadata IOPS

Operation	IOPS per Provisioned Unit
File create, open, close	2
File delete	1
Directory create, rename	0.1
Directory delete	0.2

Valid provisioned values: 1500, 3000, 6000, 12000, and multiples of 12000 up to 192000.

SSD: Automatic Mode

Storage Capacity	Included Metadata IOPS
1.2 TiB	1500
2.4 TiB	3000
4.8–9.6 TiB	6000
12–45.6 TiB	12000
≥48 TiB	12000 per 24 TiB

ML Workload Implications

Training data loading: Mostly sequential reads of large files → limited by OST throughput, not metadata. Striping helps.
Checkpoint saving: Large sequential writes → striping helps. But initial file creation hits MDT.
Preprocessing with many small files: Can bottleneck on metadata IOPS. Consider pre-aggregating into fewer large files (TFRecord, WebDataset, etc.).

Data Repository Associations (DRA)

DRAs are the mechanism for linking FSx for Lustre to S3 buckets. Each DRA maps a filesystem path to an S3 prefix, creating a bidirectional data bridge.

Key Properties

Maximum 8 DRAs per filesystem
Each DRA maps one filesystem path to one S3 prefix (1:1)
Paths cannot overlap (e.g., /ns1/ and /ns1/subdir/ cannot coexist)
/ as filesystem path is only allowed for the first DRA
Only one DRA request processed at a time (others queue)
Not available on Scratch 1 or Lustre 2.10 filesystems

How DRAs Work

Metadata import: S3 object metadata (name, size, timestamps) is loaded into the MDT. File data is not copied — it's lazy-loaded on first access.
First read: Triggers an HSM (Hierarchical Storage Management) restore from S3 → data fetched and cached on OSTs.
Subsequent reads: Served from OSTs (no S3 latency).
Auto-import: S3 changes automatically reflected in Lustre metadata.
Auto-export: Filesystem changes automatically written back to S3 (asynchronous).

DRA Configuration

Setting	Options	Default (Console)	Default (CLI/API)
Import policy	None, New, Changed, Deleted (any combination)	New + Changed + Deleted	Disabled
Export policy	None, New, Changed, Deleted (any combination)	New + Changed + Deleted	Disabled
Import metadata on create	Yes / No	Yes	No

Auto-import policies:

Policy	Effect
None	No auto-import. Use import data repository tasks manually.
New	Import metadata for new S3 objects
Changed	Import metadata for modified S3 objects
Deleted	Remove metadata for deleted S3 objects

Auto-export: Exports regular files, symlinks, and directories. Does not export special files (FIFO, block, character, socket).

Conflict handling: If the same file is modified in both the filesystem and S3 simultaneously, there is no automatic conflict resolution. Application-level coordination is required.

Import and Export Data Repository Tasks

In addition to automatic import/export, you can run on-demand tasks:

Import task: Loads metadata for new/changed files from S3 into the filesystem
Export task: Exports file data and metadata to S3

Note: Auto-export and export tasks cannot be used simultaneously on the same filesystem. Auto-import and import tasks can be used simultaneously.

Cross-Region and Cross-Account

Feature	Cross-Region	Cross-Account
Auto-import	❌ Same region only	✅ Supported
Auto-export	✅ Supported	✅ Supported

Pre-Warming Data

Lazy loading means first-epoch training has S3 latency on first access. Pre-warm with:

# Restore specific files from S3 archive to OSTs
lfs hsm_restore /mount/data/file1 /mount/data/file2

# Bulk restore a directory
nohup find /mount/data/ -type f -print0 | xargs -0 -n 1 lfs hsm_restore &

ImportedFileChunkSize

Controls how S3-imported files are striped (default: 1 GiB). Files larger than this value are automatically striped across ceil(FileSize / ChunkSize) + 1 OSTs.

Intelligent-Tiering Storage Class

An alternative to SSD/HDD storage classes. Automatically tiers data across three access tiers:

Tier	Latency	Cost	When Data Moves Here
Frequent Access	Sub-millisecond (SSD)	Highest	Recently accessed data
Infrequent Access	Low (HDD)	Lower	Data not accessed recently
Archive	Higher (retrieval delay)	Lowest	Rarely accessed data

Optional SSD read cache for frequently accessed data
No minimum storage capacity — pay only for data stored
Starting at <$0.005/GB-month
Not available with Scratch deployments
Metadata IOPS: Only 6000 or 12000 (user-provisioned mode only, no automatic)

ML use case: Good for long-lived datasets where only a subset is actively used. Training data accessed during current epoch stays in Frequent Access tier; older experiment data tiers down automatically.

Compression

LZ4 transparent compression (dataCompressionType: LZ4) reduces storage costs and can improve throughput for compressible data. Applied to new files written after enabling.

Backups

Automatic backups: Daily, stored in S3 (11 9's durability). Retention 0-90 days.
Manual backups: On-demand via console/CLI/API.
Cross-region/cross-account backup copy: For disaster recovery and compliance.
Restore: Creates a new filesystem from backup.
Incremental: Only changed data since last backup.
Persistent filesystems only — Scratch filesystems do not support backups.

Backups are managed via AWS Backup for policy-based scheduling.

Storage Quotas

User-level and group-level quotas to control storage consumption:

# Set quota for a user (soft limit, hard limit, grace period)
lfs setquota -u username --block-softlimit 100G --block-hardlimit 120G /mount/

# Set quota for a group
lfs setquota -g groupname --block-softlimit 1T --block-hardlimit 1.2T /mount/

# Check quota usage
lfs quota -u username /mount/
lfs quota -g groupname /mount/

Useful for multi-team environments sharing a filesystem to prevent any team from exhausting storage.

Encryption

At rest: All filesystems encrypted. AWS managed key (default) or customer-managed KMS key.
In transit: Encryption of data in transit between clients and file servers. Available in select regions. Enabled automatically for supported client kernel versions.

GPUDirect Storage (GDS)

For EFA-enabled filesystems with NVIDIA GPUs, GDS enables direct data transfer between GPU memory and FSx for Lustre storage, bypassing CPU and system memory entirely.

Throughput: Up to 1200 Gbps per client (vs 700 Gbps with EFA alone)
Requirements: EFA-enabled filesystem, EFA-enabled GPU instance, NVIDIA GDS driver
Eliminates the CPU copy bottleneck for checkpoint writes and data loading

All Features Summary

Supported

POSIX filesystem semantics (with caveats below)
ReadWriteMany — multiple pods across nodes simultaneously
Data Repository Associations (DRA) — up to 8 S3 links per filesystem
Auto-import and auto-export with S3
Import/export data repository tasks (on-demand)
Progressive file layouts (PFL) with automatic striping
lfs CLI for stripe management, HSM operations, quotas
Transparent LZ4 compression
Encryption at rest (AWS managed or customer KMS keys)
Encryption in transit (select regions)
EFA (700 Gbps) and GPUDirect Storage (1200 Gbps) per-client throughput
Intelligent-Tiering storage class (automatic cost optimization)
Automatic and manual backups (persistent filesystems)
Cross-region and cross-account backup copy
Storage quotas (user and group level)
Lustre client on Amazon Linux 2, AL2023, Ubuntu, RHEL, CentOS, SUSE
EKS CSI driver (dynamic and static provisioning)
AWS Backup integration
CloudWatch metrics (throughput, IOPS, metadata ops, capacity)
Storage capacity scaling (increase online)

Not Supported / Limitations

Limitation	Detail
Single AZ	FSx for Lustre filesystems exist in one subnet/AZ. No multi-AZ replication.
No NFS/SMB	Lustre protocol only. Requires Lustre client (kernel module).
No online resize down	Can increase capacity, cannot shrink.
No snapshots	No built-in snapshot capability (unlike EBS or FSx ONTAP).
Minimum size	1.2 TiB minimum. Can't create small filesystems.
S3 lazy load latency	First access to uncached S3-backed files has S3 latency.
Metadata IOPS cap	Directory operations are slow relative to data I/O. Millions of tiny files suffer.
Client compatibility	Requires specific kernel versions with Lustre client module.
Hard link limit	Lustre has lower hard link limits than ext4/XFS.
No POSIX ACLs	Only basic Unix permissions (uid/gid/mode).

EKS CSI Driver

The fsx.csi.aws.com CSI driver enables dynamic and static provisioning. Install as an EKS add-on (requires Pod Identity agent). Key StorageClass parameters:

Parameter	Values	Effect
`subnetId`	subnet ID	Required. Subnet for filesystem ENI
`securityGroupIds`	SG IDs	Security groups (must allow TCP 988 inbound)
`deploymentType`	`SCRATCH_1`, `SCRATCH_2`, `PERSISTENT_1`, `PERSISTENT_2`	Durability/performance tier
`perUnitStorageThroughput`	`125`–`1000`	MB/s per TiB (PERSISTENT only)
`dataCompressionType`	`NONE`, `LZ4`	Transparent compression
`s3ImportPath`	`s3://bucket/prefix`	S3 data repository source
`autoImportPolicy`	`NONE`, `NEW`, `NEW_CHANGED`, `NEW_CHANGED_DELETED`	Auto-import from S3

See references/troubleshooting.md for common issues.

References

`troubleshooting.md` — Diagnostic commands and common FSx for Lustre issues

Cross-References

aws-efa — EFA networking for maximum client throughput (700 Gbps+)
ray-train — Distributed training jobs consuming FSx-backed PVCs
kueue — Queue training jobs that mount FSx volumes
megatron-lm — Large-scale training with shared checkpoint storage
minio — S3-compatible alternative for smaller-scale storage
longhorn — Alternative distributed storage for non-Lustre workloads
kubeflow-trainer — FSx-backed PVCs for training jobs
gpu-operator — GPU nodes consuming FSx-mounted training data
deepspeed — DeepSpeed checkpoint storage on FSx

Reference

FSx for Lustre docs
FSx CSI driver
FSx pricing
references/troubleshooting.md — mount issues, performance, data repository tasks
assets/storageclass.yaml — FSx Lustre StorageClass with S3 data repository, LZ4 compression, and PVC example

aws-fsx

Resources

Install

AWS FSx for Lustre

Architecture

Read/Write Paths

Client Throughput Limits

Deployment Types

Striping

Default Progressive File Layout (PFL)

Custom Striping

Striping Guidelines

What Striping Can't Fix

Metadata IOPS

PERSISTENT_2: User-Provisioned Metadata IOPS

SSD: Automatic Mode

ML Workload Implications

Data Repository Associations (DRA)

Key Properties

How DRAs Work

DRA Configuration

Import and Export Data Repository Tasks

Cross-Region and Cross-Account

Pre-Warming Data

ImportedFileChunkSize

Intelligent-Tiering Storage Class

Compression

Backups

Storage Quotas

Encryption

GPUDirect Storage (GDS)

All Features Summary

Supported

Not Supported / Limitations

EKS CSI Driver

References

Cross-References

Reference

Categories

Install

Recommended Skills