IBM VPC File Pool CSI Driver — Build Skill

neil1taylor 0 Updated 5mo ago

Install

npx skillscat add neil1taylor/ibm-vpc-file-pool-csi

Install via the SkillsCat registry.

Overview

You are building ibm-vpc-file-pool-csi, a Kubernetes CSI driver for IBM Cloud VPC that provisions multiple PVCs as subdirectories within shared VPC file shares. This is fundamentally different from IBM's stock ibm-vpc-file-csi-driver, which creates one VPC file share per PVC.

The analogy: Think VMware — one NFS datastore holds many VMDKs. Here, one large VPC file share holds many PVC subdirectories.

Before You Start Any Task

Read this file completely.
Read the reference docs in this directory based on what you're working on:
- ARCHITECTURE.md — system design, component diagram, data flow
- CRD-SPEC.md — FileSharePool and SubVolume CRD definitions
- CSI-INTERFACE.md — CSI gRPC method implementations with pool-aware logic
- IBM-VPC-API.md — IBM Cloud VPC file share API usage and client wrapper
- CODING-GUIDELINES.md — Go conventions, error handling, testing patterns
- TESTING.md — testing strategy, fakes/mocks, test cases, coverage targets
- API-KEY-SETUP.md — IBM Cloud API key creation, IAM permissions, rotation, security
- INSTALL.md — build, deploy, Helm chart, verification steps
- USER-GUIDE.md — end-user guide for pools, StorageClasses, PVCs, monitoring
Check the existing codebase before writing new code. Don't duplicate what exists.

Project Structure

ibm-vpc-file-pool-csi/
├── cmd/
│   └── main.go                        # Entrypoint: parse flags, start gRPC server
├── pkg/
│   ├── driver/
│   │   ├── driver.go                  # Driver struct, gRPC server lifecycle
│   │   ├── identity.go                # CSI Identity service (GetPluginInfo, Probe)
│   │   ├── controller.go              # CSI Controller service (CreateVolume, DeleteVolume, etc.)
│   │   ├── controller_test.go         # Controller unit tests
│   │   ├── node.go                    # CSI Node service (NodePublishVolume, mount/bind)
│   │   └── node_test.go              # Node unit tests
│   ├── pool/
│   │   ├── manager.go                 # Pool manager: allocation, share selection, capacity tracking
│   │   ├── manager_test.go            # Unit tests for pool manager (87 tests)
│   │   ├── share.go                   # VPC file share lifecycle wrapper
│   │   ├── subvolume.go               # Subdirectory operations (mkdir, rm, quota)
│   │   ├── nfs.go                     # NFS operations interface
│   │   ├── reconciler.go             # Controller-runtime reconciler for FileSharePool
│   │   ├── reconciler_test.go        # Reconciler tests (26 tests)
│   │   ├── clone_worker.go           # Async clone operation handler
│   │   ├── clone_worker_test.go      # Clone worker tests (12 tests)
│   │   ├── replication_controller.go # Cross-region replication controller
│   │   └── replication_controller_test.go # Replication tests (23 tests)
│   ├── ibmcloud/
│   │   ├── client.go                  # IBM VPC client interface
│   │   ├── helpers.go                 # VPC API helper functions
│   │   ├── vpc_client.go             # IBM VPC SDK wrapper (file share CRUD)
│   │   ├── vpc_client_test.go        # Unit tests with mocked VPC API
│   │   └── fake/
│   │       ├── fake_client.go         # Fake client for testing without IBM Cloud
│   │       └── fake_client_test.go
│   ├── k8s/
│   │   ├── client.go                  # Kubernetes client interface for CRDs
│   │   ├── real_client.go            # Real Kubernetes client implementation
│   │   └── real_client_test.go
│   ├── metrics/
│   │   ├── metrics.go                 # Prometheus metric definitions
│   │   └── metrics_test.go
│   ├── migrate/
│   │   ├── executor.go               # Migration execution logic
│   │   ├── executor_test.go
│   │   ├── planner.go                # Migration planning
│   │   ├── planner_test.go
│   │   ├── pod.go                    # Pod management for migrations
│   │   └── pod_test.go
│   └── util/
│       ├── mount.go                   # NFS mount helpers, mount cache
│       ├── mount_test.go
│       ├── path.go                    # Path validation, directory traversal prevention
│       ├── path_test.go
│       ├── volume_id.go              # Volume ID parsing utilities
│       └── volume_id_test.go
├── api/
│   └── v1alpha1/
│       ├── doc.go                     # Package documentation
│       ├── groupversion_info.go       # GV registration
│       ├── filesharepool_types.go    # FileSharePool CRD Go types
│       ├── filesharepool_types_test.go
│       ├── subvolume_types.go         # SubVolume CRD Go types
│       ├── snapshot_types.go          # Snapshot CRD Go types
│       ├── volumegroupsnapshot_types.go # VolumeGroupSnapshot CRD Go types
│       ├── replicationpolicy_types.go # ReplicationPolicy CRD Go types
│       └── zz_generated.deepcopy.go   # Generated by controller-gen
├── config/
│   ├── crd/
│   │   ├── storage.ibmcloud.io_filesharepools.yaml
│   │   ├── storage.ibmcloud.io_subvolumes.yaml
│   │   ├── storage.ibmcloud.io_snapshots.yaml
│   │   ├── storage.ibmcloud.io_volumegroupsnapshots.yaml
│   │   └── storage.ibmcloud.io_replicationpolicies.yaml
│   ├── rbac/
│   │   ├── clusterrole.yaml
│   │   └── serviceaccount.yaml
│   └── deploy/
│       ├── controller.yaml            # Controller Deployment
│       ├── node.yaml                  # Node DaemonSet
│       ├── csidriver.yaml             # CSIDriver object
│       └── storageclass.yaml          # Example StorageClasses
├── charts/
│   └── ibm-vpc-file-pool-csi/
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
├── hack/
│   ├── update-codegen.sh             # CRD code generation
│   └── verify-codegen.sh
├── test/
│   ├── e2e/                          # End-to-end tests (require cluster)
│   └── integration/                  # Integration tests (in-memory fakes, no NFS server)
│       ├── capacity_management_test.go
│       ├── clone_lifecycle_test.go
│       ├── clone_worker_test.go
│       ├── concurrent_allocation_test.go
│       ├── error_recovery_test.go
│       ├── group_snapshot_test.go
│       ├── helpers_test.go
│       ├── pool_lifecycle_test.go
│       └── snapshot_lifecycle_test.go
├── Dockerfile
├── Makefile
├── go.mod
└── go.sum

Key Design Principles

1. One Share, Many PVCs

Every CreateVolume call picks an existing VPC file share from a pool and records a SubVolume CR — it does NOT create a new VPC file share. New shares are only created by the pool manager when capacity runs low.

2. State Lives in CRDs

All state (which PVC is on which share, capacity allocations, pool membership) is stored in Kubernetes CRDs (FileSharePool and SubVolume). No external database, no local files on the controller pod.

3. Node Mounts Are Cached

Each worker node mounts a VPC file share at most once. Individual PVCs are bind-mounted from subdirectories of that single NFS mount. This minimizes NFS connections.

4. Fail Safe, Not Fast

If the pool manager can't find a share with enough room and can't create a new one, CreateVolume should return a retriable gRPC error — never silently overcommit.
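
A minimal sketch of the no-overcommit rule, assuming a simplified in-memory view of a pool. The `Share` type, `Allocate` function, and `ErrNoCapacity` sentinel are illustrative names, not the project's actual API; a real CSI handler would translate the sentinel into a retriable gRPC status.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoCapacity is a hypothetical sentinel. A CSI handler would map it to a
// retriable gRPC status instead of provisioning anyway.
var ErrNoCapacity = errors.New("no share has enough free capacity")

// Share is an illustrative, simplified view of one VPC file share in a pool.
type Share struct {
	ID          string
	SizeGB      int64
	AllocatedGB int64
}

// FreeGB reports capacity not yet promised to any SubVolume.
func (s Share) FreeGB() int64 { return s.SizeGB - s.AllocatedGB }

// Allocate reserves sizeGB on the first share that can hold it.
// It never overcommits: if nothing fits, it returns ErrNoCapacity.
func Allocate(shares []Share, sizeGB int64) (string, error) {
	if sizeGB <= 0 {
		return "", fmt.Errorf("invalid size: %d GB", sizeGB)
	}
	for i := range shares {
		if shares[i].FreeGB() >= sizeGB {
			shares[i].AllocatedGB += sizeGB
			return shares[i].ID, nil
		}
	}
	return "", ErrNoCapacity
}

func main() {
	shares := []Share{
		{ID: "share-a", SizeGB: 100, AllocatedGB: 95},
		{ID: "share-b", SizeGB: 100, AllocatedGB: 40},
	}
	// share-a has only 5 GB free, so the request lands on share-b.
	id, err := Allocate(shares, 50)
	fmt.Println(id, err)
	// A second 50 GB request no longer fits anywhere and is refused.
	_, err = Allocate(shares, 50)
	fmt.Println(errors.Is(err, ErrNoCapacity))
}
```

First-fit is used here only for brevity; the selection policy is an assumption and can be swapped without changing the refusal behavior.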

5. IBM VPC API Calls Are Expensive

API calls to create/expand shares take 30-90 seconds. The hot path (CreateVolume for a PVC) should almost never need one. API calls belong in the pool manager's background reconciliation loop, not in the CSI gRPC handlers.

Build & Development Commands

# Build
make build                    # Build the binary
make docker-build             # Build container image
make generate                 # Run controller-gen for CRD types

# Test
make test                     # Unit tests
make test-coverage            # Unit tests with coverage
make lint                     # golangci-lint

# Deploy
make install-crds             # Apply CRDs to cluster
make deploy                   # Deploy controller + node agent
make helm-install             # Install via Helm chart

# Development
make run-local                # Run controller locally against a cluster (dry-run mode)
make test-e2e                 # E2E tests (requires live cluster, //go:build e2e tag)

Task-Specific Guidance

When implementing CRD types

Read CRD-SPEC.md first
Use api/v1alpha1/ directory
Include validation markers (+kubebuilder:validation:*)
Always add a Status subresource
Run make generate after changing types

When implementing CSI Controller methods

Read CSI-INTERFACE.md first
The controller MUST be idempotent — if called twice with the same volume name, return the same result
CreateVolume: call pool manager → create SubVolume CR → return (no mkdir; subdirectory creation is deferred to NodePublishVolume)
DeleteVolume: update pool tracking → delete SubVolume CR (no subdir removal; nfsOps is nil in controller mode)
Never call IBM VPC API directly from CSI handlers — go through pool manager
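
The idempotency requirement above can be sketched with an in-memory stand-in for the SubVolume CRs. The `Store` type and the `allocate` callback are hypothetical; the incompatible-size check reflects the usual CSI convention of rejecting a repeated name with different parameters, and is an assumption about how this driver handles that case.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrIncompatible is a hypothetical error for a repeated name with a new size.
var ErrIncompatible = errors.New("volume exists with a different size")

// SubVolume is a simplified record of one PVC's placement.
type SubVolume struct {
	Name    string
	ShareID string
	SizeGB  int64
}

// Store stands in for the SubVolume CRs; the real driver would use the
// Kubernetes API instead of a map.
type Store struct {
	mu     sync.Mutex
	byName map[string]SubVolume
}

func NewStore() *Store { return &Store{byName: make(map[string]SubVolume)} }

// CreateVolume returns the existing record when name was seen before and
// calls allocate only for names it has not recorded yet.
func (s *Store) CreateVolume(name string, sizeGB int64,
	allocate func(sizeGB int64) (string, error)) (SubVolume, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.byName[name]; ok {
		if existing.SizeGB != sizeGB {
			return SubVolume{}, ErrIncompatible
		}
		return existing, nil // repeat call: same result, no new allocation
	}
	shareID, err := allocate(sizeGB)
	if err != nil {
		return SubVolume{}, fmt.Errorf("allocate %s: %w", name, err)
	}
	sv := SubVolume{Name: name, ShareID: shareID, SizeGB: sizeGB}
	s.byName[name] = sv
	return sv, nil
}

func main() {
	store := NewStore()
	calls := 0
	allocate := func(int64) (string, error) { calls++; return "share-a", nil }

	first, _ := store.CreateVolume("pvc-123", 10, allocate)
	second, _ := store.CreateVolume("pvc-123", 10, allocate)
	fmt.Println(first == second, calls)
}
```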

When implementing CSI Node methods

Read CSI-INTERFACE.md (Node section) first
NodeStageVolume: mount the whole NFS share if not already mounted
NodePublishVolume: create the subdirectory if it does not exist (with uid/gid/permissions from VolumeContext), then bind-mount it into the pod path
Track active mounts in memory with a sync.RWMutex-protected map
Always validate mount paths to prevent directory traversal
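
One way to track active mounts with a sync.RWMutex-protected map is a small reference-counted cache, sketched below. `MountCache`, `Acquire`, and `Release` are hypothetical names, and the `mountFn` callback stands in for the real NFS mount call. Holding the lock while mounting is a simplification that serializes all mounts on the node.

```go
package main

import (
	"fmt"
	"sync"
)

// mountEntry records where a share is staged and how many volumes use it.
type mountEntry struct {
	path string
	refs int
}

// MountCache tracks which shares are already NFS-mounted on this node.
type MountCache struct {
	mu     sync.RWMutex
	mounts map[string]*mountEntry // keyed by share ID
}

func NewMountCache() *MountCache {
	return &MountCache{mounts: make(map[string]*mountEntry)}
}

// Acquire returns the staging path for shareID. mountFn runs only when the
// share is not mounted yet; later callers just take another reference.
func (c *MountCache) Acquire(shareID, path string, mountFn func() error) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.mounts[shareID]; ok {
		e.refs++
		return e.path, nil
	}
	if err := mountFn(); err != nil {
		return "", fmt.Errorf("mount share %s: %w", shareID, err)
	}
	c.mounts[shareID] = &mountEntry{path: path, refs: 1}
	return path, nil
}

// Release drops one reference. It returns true when the last reference is
// gone, meaning the caller should unmount the share.
func (c *MountCache) Release(shareID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.mounts[shareID]
	if !ok {
		return false
	}
	e.refs--
	if e.refs > 0 {
		return false
	}
	delete(c.mounts, shareID)
	return true
}

// IsMounted is a read-only check, so it takes only the read lock.
func (c *MountCache) IsMounted(shareID string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.mounts[shareID]
	return ok
}

func main() {
	cache := NewMountCache()
	mountCalls := 0
	mount := func() error { mountCalls++; return nil }

	cache.Acquire("share-1", "/staging/share-1", mount)
	cache.Acquire("share-1", "/staging/share-1", mount)
	fmt.Println("mount calls:", mountCalls)
}
```

An in-memory cache is lost when the node plugin restarts, so a real implementation would likely need to rebuild it from the host's mount table; that step is outside this sketch.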

When implementing the Pool Manager

Read ARCHITECTURE.md (Pool Manager section) first
This is the brain of the system
It runs as a controller-runtime reconciler watching FileSharePool CRs
It also exposes a synchronous Allocate(ctx, poolName, sizeGB) method for the CSI controller
Uses optimistic locking on CRD status updates to prevent races
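
The optimistic-locking pattern mentioned above can be illustrated without a cluster. In the sketch below, `fakeAPI` is a hypothetical stand-in that rejects writes carrying a stale version, which mimics how the Kubernetes API server treats resourceVersion; `reserve` re-reads and retries on conflict. Against a real cluster, client-go's `retry.RetryOnConflict` helper serves the same purpose.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrConflict = errors.New("conflict: object was modified")

// PoolStatus is a simplified status with a version counter.
type PoolStatus struct {
	ResourceVersion int64
	AllocatedGB     int64
}

// fakeAPI mimics compare-and-swap on resourceVersion (illustrative only).
type fakeAPI struct {
	mu  sync.Mutex
	cur PoolStatus
}

func (a *fakeAPI) Get() PoolStatus {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.cur
}

// Update succeeds only if the caller read the latest version.
func (a *fakeAPI) Update(s PoolStatus) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if s.ResourceVersion != a.cur.ResourceVersion {
		return ErrConflict
	}
	s.ResourceVersion++
	a.cur = s
	return nil
}

// reserve adds gb to the pool's allocation, re-reading on every conflict.
func reserve(api *fakeAPI, gb int64, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		st := api.Get()
		st.AllocatedGB += gb
		err := api.Update(st)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, ErrConflict)
}

// reserveConcurrently runs n writers at once and returns how many failed.
func reserveConcurrently(api *fakeAPI, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each writer can lose at most once per competing writer.
			if err := reserve(api, 1, n+1); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failed
}

func main() {
	api := &fakeAPI{}
	failed := reserveConcurrently(api, 50)
	fmt.Println(failed, api.Get().AllocatedGB)
}
```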

When implementing the IBM Cloud VPC client

Read IBM-VPC-API.md first
Always use the vpc-go-sdk — never raw HTTP
All API calls must have context-based timeouts (2 minutes max)
Always implement a fake client for testing
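
A sketch of the timeout and fake-client guidance above, using only the standard library. The `ShareClient` interface, `FakeClient`, and `createShare` are hypothetical and much smaller than the real client in pkg/ibmcloud; the vpc-go-sdk is deliberately not called here, since its exact signatures are not shown in this document.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ShareClient is an illustrative slice of a VPC client interface.
type ShareClient interface {
	CreateShare(ctx context.Context, name string, sizeGB int64) (string, error)
}

// FakeClient simulates API latency without contacting IBM Cloud.
type FakeClient struct {
	Delay time.Duration
}

func (f *FakeClient) CreateShare(ctx context.Context, name string, sizeGB int64) (string, error) {
	select {
	case <-time.After(f.Delay):
		return "fake-" + name, nil
	case <-ctx.Done():
		return "", ctx.Err() // honor the caller's deadline
	}
}

// maxAPITimeout reflects the 2-minute ceiling stated above.
const maxAPITimeout = 2 * time.Minute

// createShare bounds the call with a context deadline, capped at the ceiling.
func createShare(parent context.Context, c ShareClient, name string,
	sizeGB int64, timeout time.Duration) (string, error) {
	if timeout <= 0 || timeout > maxAPITimeout {
		timeout = maxAPITimeout
	}
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	id, err := c.CreateShare(ctx, name, sizeGB)
	if err != nil {
		return "", fmt.Errorf("create share %q: %w", name, err)
	}
	return id, nil
}

// tryCreate is a small demo helper: one call against a fake with the given latency.
func tryCreate(delay, timeout time.Duration) (string, error) {
	return createShare(context.Background(), &FakeClient{Delay: delay}, "pool-share", 100, timeout)
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

func main() {
	_, err := tryCreate(200*time.Millisecond, 20*time.Millisecond)
	fmt.Println("slow call timed out:", timedOut(err))

	id, err := tryCreate(0, time.Second)
	fmt.Println(id, err)
}
```

Because callers depend only on the interface, tests can inject the fake with any latency and exercise the timeout path deterministically.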

When writing tests

Read CODING-GUIDELINES.md (Testing section) first
Unit tests go next to the code (_test.go)
Use table-driven tests
Mock IBM Cloud API with fake client
Mock Kubernetes API with fake.NewClientBuilder()
The pool manager needs thorough tests: allocation under pressure, concurrent requests, share exhaustion

When writing Kubernetes manifests

Controller runs as a Deployment (1-2 replicas, leader election)
Node agent runs as a DaemonSet (needs hostNetwork: false, hostPID: true — required for nsenter mount wrapper to access host mount namespace for NFS mounts)
The hostPID: true setting is what lets the nsenter mount wrapper enter the host mount namespace; without it the NFS mounts would land in the container's own namespace
Node agent needs /var/lib/kubelet mounted for bind-mounts
RBAC must cover: FileSharePool, SubVolume (get/list/watch/create/update/patch), PVs, PVCs, Secrets, ConfigMaps, Events, CSINode, CSIDriver

When implementing snapshots (Phase 4a)

Read CSI-INTERFACE.md (Snapshot section) and VOLUME-GROUP-SNAPSHOTS.md
Snapshots are directory copies under /pvcs/.snapshots/{snap-name}/ using NFSOperations.CopyDir
The Snapshot CRD tracks each snapshot; the pool manager creates and deletes them
Restoring from a snapshot uses RestoreSnapshot, which creates a new SubVolume from the snapshot data
All snapshot operations are synchronous (unlike clones)

When implementing volume cloning (Phase 4b)

Read VOLUME-CLONING.md first
Clones use PoolManager.CloneVolume() — sync for small volumes, async for large
The clone worker (pkg/pool/clone_worker.go) handles async clones in the background
NodePublishVolume gates pod access on cloneStatus=Complete — pods wait until clone finishes
Share selection prefers the source share when it has capacity (same-share clone is faster)

When implementing group snapshots (Phase 4c)

Read VOLUME-GROUP-SNAPSHOTS.md first
Group snapshots reuse the Phase 4a single-snapshot infrastructure
The CSI controller handles hook orchestration (pre/post quiesce hooks)
The pool manager handles only the data plane (creating/deleting snapshot directories)
Failure policy: Abort rolls back all completed snapshots; Continue marks as PartialFailure
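
The two failure policies can be sketched as a loop over the group's volumes. Everything below is illustrative: `snapshotGroup`, the `snap` and `remove` callbacks, and the "Ready" and "Failed" phase names are assumptions; only Abort, Continue, and PartialFailure come from the guidance above.

```go
package main

import (
	"errors"
	"fmt"
)

type FailurePolicy string

const (
	Abort    FailurePolicy = "Abort"
	Continue FailurePolicy = "Continue"
)

// errSnapshot is a placeholder failure used by the demo.
var errSnapshot = errors.New("snapshot failed")

// GroupResult summarizes one group snapshot attempt.
type GroupResult struct {
	Phase     string // "Ready", "PartialFailure", or "Failed" (hypothetical names except PartialFailure)
	Completed []string
	Failed    []string
}

// snapshotGroup snapshots each volume in order. Under Abort, the first
// failure rolls back every completed snapshot; under Continue, failures are
// recorded and the remaining volumes are still attempted.
func snapshotGroup(volumes []string, policy FailurePolicy,
	snap func(volume string) error, remove func(volume string)) GroupResult {
	var res GroupResult
	for _, v := range volumes {
		if err := snap(v); err == nil {
			res.Completed = append(res.Completed, v)
			continue
		}
		res.Failed = append(res.Failed, v)
		if policy == Abort {
			// Roll back in reverse order of creation.
			for i := len(res.Completed) - 1; i >= 0; i-- {
				remove(res.Completed[i])
			}
			res.Completed = nil
			res.Phase = "Failed"
			return res
		}
	}
	if len(res.Failed) > 0 {
		res.Phase = "PartialFailure"
	} else {
		res.Phase = "Ready"
	}
	return res
}

func main() {
	volumes := []string{"vol-a", "vol-b", "vol-c"}
	snap := func(v string) error {
		if v == "vol-b" {
			return errSnapshot
		}
		return nil
	}
	remove := func(v string) { fmt.Println("rolled back", v) }

	fmt.Println(snapshotGroup(volumes, Abort, snap, remove).Phase)
	fmt.Println(snapshotGroup(volumes, Continue, snap, remove).Phase)
}
```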

When implementing replication (Phase 4d)

Read CROSS-REGION-DR.md first
The replication controller (pkg/pool/replication_controller.go) runs as a separate reconciler
It copies SubVolume data between pools using CopyDir, not rsync (simplified from design doc)
Uses time.Duration schedule intervals, not cron expressions (simplified from design doc)
Destination is specified via DestinationNFSServer IP, not a pool reference (simplified from design doc)

When implementing migration (pkg/migrate/)

The planner analyzes existing stock IBM CSI PVCs and generates a migration plan
The executor creates SubVolume CRs, spawns data-copy pods, and rebinds PVCs
All migration operations are idempotent and can be resumed after failure

Common Mistakes to Avoid

Do NOT create a VPC file share in CreateVolume — that's the pool manager's job during reconciliation.
Do NOT store state in ConfigMaps — use the CRDs. ConfigMaps have size limits and no status subresource.
Do NOT assume subdirectory quotas are enforced — NFS doesn't enforce per-directory quotas natively. Track allocations in SubVolume CRs and report via metrics. Hard enforcement is a future feature.
Do NOT mount NFS shares with the hard mount option in production. Use soft,timeo=600,retrans=3 so pods don't hang indefinitely on NFS failures (timeo is measured in tenths of a second, so 600 means 60 seconds per attempt).
Do NOT use os.RemoveAll for PVC cleanup without checking the path — validate that the path is within the expected share mount point. Directory traversal bugs in a CSI driver are catastrophic.
Do NOT skip leader election for the controller — two controllers allocating simultaneously will corrupt pool state.
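
For the os.RemoveAll warning above, a containment check might look like the sketch below. `resolveSubdir` is a hypothetical helper, not necessarily what pkg/util/path.go does. It rejects any path that escapes the share mount point or resolves to the mount point itself. It is purely lexical and does not resolve symlinks, which a production check would also need to consider.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveSubdir joins sub onto mountPoint and returns the result only if it
// stays strictly inside mountPoint. Callers would pass the returned path,
// never the raw input, to os.RemoveAll.
func resolveSubdir(mountPoint, sub string) (string, error) {
	root := filepath.Clean(mountPoint)
	if !filepath.IsAbs(root) {
		return "", fmt.Errorf("mount point %q must be absolute", mountPoint)
	}
	target := filepath.Join(root, sub) // Join also cleans ".." segments
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return "", fmt.Errorf("resolve %q: %w", sub, err)
	}
	escapes := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	if rel == "." || escapes {
		return "", fmt.Errorf("path %q is not strictly inside %q", sub, root)
	}
	return target, nil
}

func main() {
	for _, sub := range []string{"pvcs/pvc-123", "../etc", "pvcs/../../x", ""} {
		p, err := resolveSubdir("/mnt/share-1", sub)
		fmt.Printf("%q -> %q, err=%v\n", sub, p, err)
	}
}
```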
