Use this skill when the user asks for wide-column reporting schemas, denormalized columnar marts, one-table analytics models, or join-minimized read models from MD-DDL.
Install
npx skillscat add semprini/md-ddl/wide-column Install via the SkillsCat registry.
Skill: Wide Column Schema Design
Covers translation of MD-DDL conceptual and logical models into denormalized, join-minimized columnar schemas designed for reporting and analytics consumption.
This skill is for the pattern where consumers want a single primary table at a stable grain (usually a "thing that happens") with selected descriptive context from related "things that are".
Load First
../../agent-ontology/skills/entity-modelling/SKILL.md../../agent-ontology/skills/entity-modelling/conceptual-to-physical-realisation.md../../agent-ontology/skills/relationship-events/SKILL.md../dimensional/SKILL.md../normalized/SKILL.md../../agent-ontology/skills/standards-alignment/SKILL.md(mandatory in recognized industry domains)
Core Mental Model
Use two buckets to drive modeling choices:
| Concept bucket | Typical MD-DDL signals | Typical physical role |
|---|---|---|
| Things that are | existence: independent, mutability: slowly_changing or reference |
Descriptive context entities (customer, account, product) |
| Things that happen | existence: dependent, mutability: append_only or immutable, event-heavy semantics |
Driving grain for wide table (transaction, claim event, payment) |
Default strategy:
- Pick a driving "thing that happens" grain.
- Pull in selected attributes from related "things that are" where relationship paths are stable and cardinality-safe.
- Optimize for consumer queries, not for update minimality.
Research-Backed Practical Guidance
This skill follows common modern warehousing guidance:
- Denormalization is usually a read-performance optimization that moves joins from query time to build time.
- Keep entity grain explicit: a wide table should still represent one core concept per row.
- Avoid flattening high-churn and high-cardinality relationships blindly.
- For one-to-many details, prefer selective rollups, capped arrays, or nested structures over row explosion.
- In platforms that support nested/repeated types (for example BigQuery), nested structs and arrays can reduce join and shuffle costs while preserving hierarchy.
Practical source influences:
- dbt marts guidance (wide, denormalized, entity-grained output models)
- ClickHouse denormalization guidance (join-at-ingest trade-off, avoid high-cardinality flattening)
- BigQuery nested/repeated best practices (hierarchical denormalization without full flattening)
Join Admissibility Rules
Use this matrix before denormalizing a relationship into a wide table.
| Relationship shape from driving row | Include strategy | Default decision |
|---|---|---|
| 1:1 | Inline columns | Include |
| N:1 from driving row to parent | Inline selected parent columns | Include |
| 1:N from driving row to child | Aggregate child info or nested array/struct | Include selectively |
| M:N | Bridge or separate child mart; optional aggregate summary only | Do not inline raw child rows |
| High-cardinality 1:N (hundreds/thousands per row) | Keep separate; add summary stats only | Avoid inline expansion |
Important: If inclusion changes row grain, stop and redesign. Wide does not mean grain drift.
Realization Procedure
1) Pick and lock grain
Define a grain statement in plain language:
- "One row per Transaction"
- "One row per Claim Event"
- "One row per Payment Instruction"
If the grain is not explicit, do not generate output.
2) Map relationship paths
From the driving entity, enumerate all immediate and second-hop relationships. Mark each path with:
- cardinality
- mutability/churn risk
- business value to reporting consumers
3) Select projection tiers
Bring attributes into the wide table by priority:
| Tier | Include guidance |
|---|---|
| T1 Core keys and timestamps | Always include identifiers and event time |
| T2 High-use descriptive context | Include stable attributes used in filtering/grouping |
| T3 Operational metadata | Include if materially used in reporting operations |
| T4 Rare or volatile details | Exclude from wide table; expose in companion table or nested field |
4) Handle one-to-many safely
When the driving row has 1:N children, choose exactly one pattern:
| Pattern | When to use |
|---|---|
| Aggregate columns | Need counts/sums/latest flags only |
| Nested array/struct | Platform supports nested types and consumers can handle them |
| Companion child wide table | Child details are frequently queried as first-class rows |
Do not duplicate driving rows just to materialize child collections unless the user explicitly wants a row-per-child output.
5) Apply temporal correctness
For slowly changing "things that are":
- Join "as of" the event timestamp from the driving row.
- Snapshot attributes into the wide row to make reports historically reproducible.
- Document late-arriving update policy (restate vs forward-fix).
6) Define refresh policy
Denormalized wide outputs need explicit rebuild strategy:
- incremental append for immutable event grains
- targeted backfill for changed dimensions
- full refresh window for high-impact historical corrections
MD-DDL-Specific Mapping Guidance
Existence and Mutability
| MD-DDL signal | Wide-column implication |
|---|---|
existence: dependent + append_only/immutable |
Strong candidate driving grain |
existence: independent + reference |
Good candidate for inline descriptive enrichment |
mutability: slowly_changing |
Use as-of joins and snapshot columns |
mutability: frequently_changing |
Include cautiously; consider volatile fields separately |
existence: associative |
Usually preserve as separate structure unless only summary is needed |
Relationships and granularity
granularity: atomicis generally easiest to inline as context columns.granularity: periodoften needs validity-window joins and effective date columns.granularity: groupoften maps better to aggregated summaries than direct flattening.
Events
Events are strong drivers for wide tables because they naturally encode "things that happen".
Include event-level attributes plus selected actor/entity context fields needed for common reporting questions.
Example Pattern (Financial Transactions)
Goal: one row per transaction with account and customer context pre-joined.
Input semantics:
- Transaction is the driving "thing that happens".
- Account and Customer are "things that are".
- Typical relationships: Customer 1:N Account, Account 1:N Transaction.
Recommended wide projection:
| Column group | Example columns |
|---|---|
| Transaction core | transaction_id, transaction_timestamp, amount, currency_code, channel |
| Account context | account_id, account_type, account_status, opened_date |
| Customer context | customer_id, customer_segment, risk_rating, residency_country |
| Operational flags | is_high_risk, sanctions_hit_flag, aml_case_id |
SQL-shape example (platform-agnostic):
SELECT
t.transaction_id,
t.transaction_timestamp,
t.amount,
t.currency_code,
t.channel,
a.account_id,
a.account_type,
a.account_status,
a.opened_date,
c.customer_id,
c.customer_segment,
c.risk_rating,
c.residency_country,
t.is_high_risk,
t.sanctions_hit_flag,
t.aml_case_id
FROM fact_transaction t
LEFT JOIN dim_account a
ON t.account_id = a.account_id
AND t.transaction_timestamp >= a.valid_from
AND (a.valid_to IS NULL OR t.transaction_timestamp < a.valid_to)
LEFT JOIN dim_customer c
ON a.customer_id = c.customer_id
AND t.transaction_timestamp >= c.valid_from
AND (c.valid_to IS NULL OR t.transaction_timestamp < c.valid_to);This pattern keeps transaction grain while reducing common downstream joins.
Platform Notes for Columnar Engines
| Engine capability | Recommended shape |
|---|---|
| Flat-column optimized warehouses | Wide flat columns with controlled duplication |
| Nested-capable engines | Use STRUCT/ARRAY for bounded 1:N child context |
| High-ingest analytical engines | Prefer build-time joins and periodic refresh for stable dimensions |
When nested structures are used, provide both:
- base wide columns for common BI filtering
- nested details for advanced drill-through
Output Contract
When producing a wide-column schema, always include:
- Grain statement
- Attribute inclusion map (source entity attribute -> output column)
- Join path map with cardinality and temporal join logic
- Volatility notes (fields intentionally excluded due to churn)
- DDL or schema contract for requested platform
- Build strategy (incremental/full refresh/restate policy)
- Consumer notes (expected query patterns and known trade-offs)
- Open questions and assumptions
Quality Checklist
Before finalizing:
- Grain is explicit and unchanged by joins
- Every included relationship passed admissibility rules
- M:N relationships are not naively flattened
- High-cardinality child sets are aggregated or separately modeled
- SCD/as-of logic is defined for slowly changing context
- Included columns reflect real reporting use cases
- Rebuild and correction policy is specified
- Governance-sensitive fields (PII/classification) are clearly tagged