wide-column

Use this skill when the user asks for wide-column reporting schemas, denormalized columnar marts, one-table analytics models, or join-minimized read models from MD-DDL.

Semprini 1 Updated 4mo ago

GitHub

Install

npx skillscat add semprini/md-ddl/wide-column

Install via the SkillsCat registry.

SKILL.md

Skill: Wide Column Schema Design

Covers translation of MD-DDL conceptual and logical models into denormalized, join-minimized columnar schemas designed for reporting and analytics consumption.

This skill is for the pattern where consumers want a single primary table at a stable grain (usually a "thing that happens") with selected descriptive context from related "things that are".

Load First

../../agent-ontology/skills/entity-modelling/SKILL.md
../../agent-ontology/skills/entity-modelling/conceptual-to-physical-realisation.md
../../agent-ontology/skills/relationship-events/SKILL.md
../dimensional/SKILL.md
../normalized/SKILL.md
../../agent-ontology/skills/standards-alignment/SKILL.md (mandatory in recognized industry domains)

Core Mental Model

Use two buckets to drive modeling choices:

Concept bucket	Typical MD-DDL signals	Typical physical role
Things that are	`existence: independent`, `mutability: slowly_changing` or `reference`	Descriptive context entities (customer, account, product)
Things that happen	`existence: dependent`, `mutability: append_only` or `immutable`, event-heavy semantics	Driving grain for wide table (transaction, claim event, payment)

Default strategy:

Pick a driving "thing that happens" grain.
Pull in selected attributes from related "things that are" where relationship paths are stable and cardinality-safe.
Optimize for consumer queries, not for update minimality.

Research-Backed Practical Guidance

This skill follows common modern warehousing guidance:

Denormalization is usually a read-performance optimization that moves joins from query time to build time.
Keep entity grain explicit: a wide table should still represent one core concept per row.
Avoid flattening high-churn and high-cardinality relationships blindly.
For one-to-many details, prefer selective rollups, capped arrays, or nested structures over row explosion.
In platforms that support nested/repeated types (for example BigQuery), nested structs and arrays can reduce join and shuffle costs while preserving hierarchy.

Practical source influences:

dbt marts guidance (wide, denormalized, entity-grained output models)
ClickHouse denormalization guidance (join-at-ingest trade-off, avoid high-cardinality flattening)
BigQuery nested/repeated best practices (hierarchical denormalization without full flattening)

Join Admissibility Rules

Use this matrix before denormalizing a relationship into a wide table.

Relationship shape from driving row	Include strategy	Default decision
1:1	Inline columns	Include
N:1 from driving row to parent	Inline selected parent columns	Include
1:N from driving row to child	Aggregate child info or nested array/struct	Include selectively
M:N	Bridge or separate child mart; optional aggregate summary only	Do not inline raw child rows
High-cardinality 1:N (hundreds/thousands per row)	Keep separate; add summary stats only	Avoid inline expansion

Important: If inclusion changes row grain, stop and redesign. Wide does not mean grain drift.

Realization Procedure

1) Pick and lock grain

Define a grain statement in plain language:

"One row per Transaction"
"One row per Claim Event"
"One row per Payment Instruction"

If the grain is not explicit, do not generate output.

2) Map relationship paths

From the driving entity, enumerate all immediate and second-hop relationships. Mark each path with:

cardinality
mutability/churn risk
business value to reporting consumers

3) Select projection tiers

Bring attributes into the wide table by priority:

Tier	Include guidance
T1 Core keys and timestamps	Always include identifiers and event time
T2 High-use descriptive context	Include stable attributes used in filtering/grouping
T3 Operational metadata	Include if materially used in reporting operations
T4 Rare or volatile details	Exclude from wide table; expose in companion table or nested field

4) Handle one-to-many safely

When the driving row has 1:N children, choose exactly one pattern:

Pattern	When to use
Aggregate columns	Need counts/sums/latest flags only
Nested array/struct	Platform supports nested types and consumers can handle them
Companion child wide table	Child details are frequently queried as first-class rows

Do not duplicate driving rows just to materialize child collections unless the user explicitly wants a row-per-child output.

5) Apply temporal correctness

For slowly changing "things that are":

Join "as of" the event timestamp from the driving row.
Snapshot attributes into the wide row to make reports historically reproducible.
Document late-arriving update policy (restate vs forward-fix).

6) Define refresh policy

Denormalized wide outputs need explicit rebuild strategy:

incremental append for immutable event grains
targeted backfill for changed dimensions
full refresh window for high-impact historical corrections

MD-DDL-Specific Mapping Guidance

Existence and Mutability

MD-DDL signal	Wide-column implication
`existence: dependent` + `append_only`/`immutable`	Strong candidate driving grain
`existence: independent` + `reference`	Good candidate for inline descriptive enrichment
`mutability: slowly_changing`	Use as-of joins and snapshot columns
`mutability: frequently_changing`	Include cautiously; consider volatile fields separately
`existence: associative`	Usually preserve as separate structure unless only summary is needed

Relationships and granularity

granularity: atomic is generally easiest to inline as context columns.
granularity: period often needs validity-window joins and effective date columns.
granularity: group often maps better to aggregated summaries than direct flattening.

Events

Events are strong drivers for wide tables because they naturally encode "things that happen".

Include event-level attributes plus selected actor/entity context fields needed for common reporting questions.

Example Pattern (Financial Transactions)

Goal: one row per transaction with account and customer context pre-joined.

Input semantics:

Transaction is the driving "thing that happens".
Account and Customer are "things that are".
Typical relationships: Customer 1:N Account, Account 1:N Transaction.

Recommended wide projection:

Column group	Example columns
Transaction core	transaction_id, transaction_timestamp, amount, currency_code, channel
Account context	account_id, account_type, account_status, opened_date
Customer context	customer_id, customer_segment, risk_rating, residency_country
Operational flags	is_high_risk, sanctions_hit_flag, aml_case_id

SQL-shape example (platform-agnostic):

SELECT
  t.transaction_id,
  t.transaction_timestamp,
  t.amount,
  t.currency_code,
  t.channel,
  a.account_id,
  a.account_type,
  a.account_status,
  a.opened_date,
  c.customer_id,
  c.customer_segment,
  c.risk_rating,
  c.residency_country,
  t.is_high_risk,
  t.sanctions_hit_flag,
  t.aml_case_id
FROM fact_transaction t
LEFT JOIN dim_account a
  ON t.account_id = a.account_id
  AND t.transaction_timestamp >= a.valid_from
  AND (a.valid_to IS NULL OR t.transaction_timestamp < a.valid_to)
LEFT JOIN dim_customer c
  ON a.customer_id = c.customer_id
  AND t.transaction_timestamp >= c.valid_from
  AND (c.valid_to IS NULL OR t.transaction_timestamp < c.valid_to);

This pattern keeps transaction grain while reducing common downstream joins.

Platform Notes for Columnar Engines

Engine capability	Recommended shape
Flat-column optimized warehouses	Wide flat columns with controlled duplication
Nested-capable engines	Use STRUCT/ARRAY for bounded 1:N child context
High-ingest analytical engines	Prefer build-time joins and periodic refresh for stable dimensions

When nested structures are used, provide both:

base wide columns for common BI filtering
nested details for advanced drill-through

Output Contract

When producing a wide-column schema, always include:

Grain statement
Attribute inclusion map (source entity attribute -> output column)
Join path map with cardinality and temporal join logic
Volatility notes (fields intentionally excluded due to churn)
DDL or schema contract for requested platform
Build strategy (incremental/full refresh/restate policy)
Consumer notes (expected query patterns and known trade-offs)
Open questions and assumptions

Quality Checklist

Before finalizing:

Grain is explicit and unchanged by joins
Every included relationship passed admissibility rules
M:N relationships are not naively flattened
High-cardinality child sets are aggregated or separately modeled
SCD/as-of logic is defined for slowly changing context
Included columns reflect real reporting use cases
Rebuild and correction policy is specified
Governance-sensitive fields (PII/classification) are clearly tagged

wide-column

Install

Skill: Wide Column Schema Design

Load First

Core Mental Model

Research-Backed Practical Guidance

Join Admissibility Rules

Realization Procedure

1) Pick and lock grain

2) Map relationship paths

3) Select projection tiers

4) Handle one-to-many safely

5) Apply temporal correctness

6) Define refresh policy

MD-DDL-Specific Mapping Guidance

Existence and Mutability

Relationships and granularity

Events

Example Pattern (Financial Transactions)

Platform Notes for Columnar Engines

Output Contract

Quality Checklist

Categories

Install

Recommended Skills