Data Products & AI Blueprints

A data product is a self-contained YAML blueprint that fully describes a data model — its schema, semantics, governance, and AI instructions — in a single portable file. Data products follow the Open Semantic Interchange (OSI) specification, so they work with any OSI-compliant tool.

Most teams scatter metadata across wikis, spreadsheets, dbt YAML, BI tool definitions, and tribal knowledge. A data product consolidates everything an AI agent (or a new team member) needs into one file:

  • Schema — tables, columns, types, expressions
  • Semantics — descriptions, business terms, synonyms
  • Governance — ownership, tier status, trust level, PII classification
  • Rules — aggregation logic, required filters, valid ranges
  • Relationships — how datasets join, cardinality
  • AI context — golden queries, guardrails, usage instructions

The AI Blueprint is the Gold-tier deliverable — a single, portable YAML file that captures the complete semantic specification for a data product. Export it with context blueprint, serve it via MCP at context://data-product/{name}, or share it with any OSI-compliant tool.

A data product YAML follows the OSI semantic_model structure, extended with ContextKit governance fields. Here’s a complete Gold-tier example using a generic SaaS analytics domain:

saas_analytics.data-product.osi.yaml
# AI Blueprint — Open Semantic Interchange (OSI) v1.0
osi_version: "1.0"
semantic_model:
  name: saas_analytics
  description: >
    SaaS platform analytics covering subscriptions, usage events,
    and revenue. Designed for product analytics, finance reporting,
    and AI-assisted query generation.
  # ── Governance ──────────────────────────────────────────────
  owner: data-platform-team
  tier: gold
  trust_status: verified
  grain: One row per subscription event (create, upgrade, cancel, renew)
  tags:
    - saas
    - revenue
    - product-analytics
  # ── Glossary ────────────────────────────────────────────────
  glossary:
    - term: Monthly Recurring Revenue
      abbreviation: MRR
      definition: >
        Sum of all active subscription amounts normalized to a monthly
        cadence. Excludes one-time charges, overages, and credits.
      related_fields:
        - subscriptions.mrr_amount
        - subscriptions.plan_price
    - term: Churn Rate
      abbreviation: null
      definition: >
        Percentage of subscribers who cancel within a given period,
        measured as cancelled_count / start_of_period_count.
      related_fields:
        - subscriptions.status
        - subscriptions.cancelled_at
    - term: Active User
      definition: >
        A user who triggered at least one meaningful event (not
        page_view or heartbeat) in the trailing 28-day window.
      related_fields:
        - usage_events.event_type
        - usage_events.user_id
  # ── Datasets ────────────────────────────────────────────────
  datasets:
    - name: subscriptions
      description: One row per subscription lifecycle event
      schema: analytics
      table: fct_subscriptions
      grain: subscription_id + event_timestamp
      owner: revenue-team
      trust_status: verified
      fields:
        - name: subscription_id
          expression: subscription_id
          description: Unique subscription identifier
          semantic_role: identifier
          pii: false
        - name: customer_id
          expression: customer_id
          description: Foreign key to the customers dimension
          semantic_role: identifier
          pii: false
        - name: plan_name
          expression: plan_name
          description: Subscription plan tier (free, starter, pro, enterprise)
          dimension: true
          sample_values:
            - free
            - starter
            - pro
            - enterprise
        - name: mrr_amount
          expression: mrr_amount
          description: Monthly recurring revenue for this subscription in USD
          semantic_role: measure
          aggregation: sum
          unit: USD
          ai_context:
            instructions: >
              Always aggregate with SUM. Never use AVG on mrr_amount
              directly; use avg_mrr_per_customer metric instead.
              Excludes one-time setup fees.
            synonyms:
              - MRR
              - monthly revenue
              - recurring revenue
            examples:
              - "What is total MRR?"
              - "Show MRR by plan tier"
              - "MRR trend over the last 12 months"
        - name: status
          expression: status
          description: Current subscription status
          dimension: true
          sample_values:
            - active
            - cancelled
            - paused
            - trial
          ai_context:
            instructions: >
              Most revenue queries should filter WHERE status = 'active'.
              Include 'trial' only when specifically analyzing trial conversions.
            guardrails:
              - rule: required_filter
                condition: "status = 'active'"
                applies_to: revenue queries
        - name: event_timestamp
          expression: event_timestamp
          description: When this subscription event occurred
          semantic_role: time
          time_grain: millisecond
        - name: cancelled_at
          expression: cancelled_at
          description: Timestamp when subscription was cancelled (null if active)
          semantic_role: time
      # ── Business Rules ──────────────────────────────────────
      business_rules:
        - name: revenue_requires_active
          description: Revenue calculations must filter to active subscriptions
          rule: "WHERE status = 'active'"
          severity: error
        - name: no_future_events
          description: Event timestamps must not be in the future
          rule: "event_timestamp <= CURRENT_TIMESTAMP"
          severity: warning
    - name: usage_events
      description: Product usage telemetry, one row per event
      schema: analytics
      table: fct_usage_events
      grain: event_id
      owner: product-analytics-team
      trust_status: verified
      fields:
        - name: event_id
          expression: event_id
          description: Unique event identifier
          semantic_role: identifier
        - name: user_id
          expression: user_id
          description: The user who triggered this event
          semantic_role: identifier
          pii: true
          pii_classification: direct_identifier
        - name: event_type
          expression: event_type
          description: Category of the usage event
          dimension: true
          sample_values:
            - feature_used
            - api_call
            - export
            - invite_sent
            - report_generated
        - name: event_timestamp
          expression: event_timestamp
          description: When the event occurred
          semantic_role: time
          time_grain: millisecond
        - name: session_duration_sec
          expression: session_duration_sec
          description: Duration of the session in seconds
          semantic_role: measure
          aggregation: avg
          unit: seconds
    - name: customers
      description: Customer dimension with firmographic attributes
      schema: analytics
      table: dim_customers
      grain: customer_id
      owner: revenue-team
      trust_status: verified
      fields:
        - name: customer_id
          expression: customer_id
          description: Unique customer identifier
          semantic_role: identifier
        - name: company_name
          expression: company_name
          description: Customer's company name
          pii: true
          pii_classification: business_name
          label: true
        - name: industry
          expression: industry
          description: Industry vertical of the customer
          dimension: true
          sample_values:
            - technology
            - healthcare
            - finance
            - retail
            - education
        - name: signup_date
          expression: signup_date
          description: Date the customer first signed up
          semantic_role: time
          time_grain: day
        - name: region
          expression: region
          description: Geographic region
          dimension: true
          sample_values:
            - north_america
            - europe
            - apac
            - latam
  # ── Relationships ───────────────────────────────────────────
  relationships:
    - name: subscriptions_to_customers
      from:
        dataset: subscriptions
        columns:
          - customer_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one
    - name: usage_to_customers
      from:
        dataset: usage_events
        columns:
          - user_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one
  # ── Metrics ─────────────────────────────────────────────────
  metrics:
    - name: total_mrr
      expression: SUM(mrr_amount)
      description: Total monthly recurring revenue across all active subscriptions
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          This is the primary revenue metric. Always filter to active
          subscriptions. For historical MRR, group by month using
          event_timestamp.
        synonyms:
          - MRR
          - monthly recurring revenue
          - recurring revenue
        examples:
          - "What is our current MRR?"
          - "Show MRR growth month over month"
    - name: avg_mrr_per_customer
      expression: SUM(mrr_amount) / NULLIF(COUNT(DISTINCT customer_id), 0)
      description: Average MRR per active customer
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          Use this instead of manually averaging mrr_amount.
          Provides per-customer average, not per-subscription.
        synonyms:
          - ARPU
          - average revenue per user
    - name: monthly_active_users
      expression: COUNT(DISTINCT user_id)
      description: Distinct users with at least one non-trivial event in 28 days
      dataset: usage_events
      filters:
        - "event_type NOT IN ('page_view', 'heartbeat')"
        - "event_timestamp >= CURRENT_DATE - INTERVAL '28 days'"
      ai_context:
        synonyms:
          - MAU
          - active users
  # ── Golden Queries ──────────────────────────────────────────
  golden_queries:
    - name: mrr_by_plan
      description: MRR broken down by subscription plan
      sql: |
        SELECT plan_name, SUM(mrr_amount) AS mrr
        FROM analytics.fct_subscriptions
        WHERE status = 'active'
        GROUP BY plan_name
        ORDER BY mrr DESC
      verified: true
    - name: cancellations_by_month
      description: >
        Monthly cancellations over the last 12 months, with each month's
        share of all cancellations in that window. This is not the glossary
        Churn Rate, which divides by the start-of-period subscriber count.
      sql: |
        SELECT
          DATE_TRUNC('month', cancelled_at) AS churn_month,
          COUNT(*) AS cancelled,
          COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS pct_of_cancellations
        FROM analytics.fct_subscriptions
        WHERE cancelled_at IS NOT NULL
          AND cancelled_at >= CURRENT_DATE - INTERVAL '12 months'
        GROUP BY 1
        ORDER BY 1
      verified: true
    - name: usage_by_event_type
      description: Event volume by type over the last 30 days
      sql: |
        SELECT event_type, COUNT(*) AS event_count
        FROM analytics.fct_usage_events
        WHERE event_timestamp >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY event_type
        ORDER BY event_count DESC
      verified: true

Every data product starts with osi_version and a semantic_model block. The model name, description, owner, and tier establish identity and governance at the top level.

The glossary lists business terms with formal definitions. Each term's related_fields list connects it to actual columns, so AI agents can map natural language (“What’s our MRR?”) to the correct field (subscriptions.mrr_amount).
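
As a rough illustration of that mapping, the sketch below resolves a question against glossary terms and abbreviations. The `GLOSSARY` list copies two entries from the example above; `resolve_term` is a hypothetical helper written for this page, not a ContextKit function, and real agents would likely match far more loosely.

```python
# Hypothetical sketch: two glossary entries copied from the example blueprint.
GLOSSARY = [
    {
        "term": "Monthly Recurring Revenue",
        "abbreviation": "MRR",
        "related_fields": ["subscriptions.mrr_amount", "subscriptions.plan_price"],
    },
    {
        "term": "Churn Rate",
        "abbreviation": None,
        "related_fields": ["subscriptions.status", "subscriptions.cancelled_at"],
    },
]


def resolve_term(question: str, glossary: list) -> list:
    """Return related fields for every glossary term mentioned in the question."""
    text = question.lower()
    words = set(text.replace("?", " ").replace("'", " ").split())
    fields = []
    for entry in glossary:
        abbreviation = entry.get("abbreviation")
        # Match the full term as a phrase, or the abbreviation as a whole word.
        if entry["term"].lower() in text or (
            abbreviation and abbreviation.lower() in words
        ):
            fields.extend(entry["related_fields"])
    return fields


print(resolve_term("What's our MRR?", GLOSSARY))
```

Matching the abbreviation as a whole word rather than a substring avoids false hits on short abbreviations that happen to appear inside longer words.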

Each dataset maps to a physical table. Fields declare:

| Property | Purpose |
| --- | --- |
| semantic_role | identifier, measure, time, or dimension; tells AI how to use the field |
| aggregation | Default aggregation function (sum, avg, count, etc.) |
| ai_context | Instructions, synonyms, examples, and guardrails for AI agents |
| pii / pii_classification | Security classification for sensitive data |
| sample_values | Representative values to help AI understand the domain |
| business_rules | Constraints that must hold for correct queries |
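
To make the role of semantic_role and aggregation concrete, here is a minimal sketch of how a consumer might turn a field definition into a SELECT expression. The function name `default_select_expression` and the fallback to sum are assumptions for illustration; the text does not describe how ContextKit or any agent does this internally.

```python
def default_select_expression(field: dict) -> str:
    """Build a SELECT expression from a field's semantic_role and aggregation."""
    expression = field["expression"]
    if field.get("semantic_role") == "measure":
        # Measures use their declared default; falling back to sum is an assumption.
        aggregation = field.get("aggregation", "sum").upper()
        return f"{aggregation}({expression}) AS {field['name']}"
    # Identifiers, time fields, and dimensions are selected as-is.
    return expression


# Field definitions copied from the example blueprint.
mrr = {
    "name": "mrr_amount",
    "expression": "mrr_amount",
    "semantic_role": "measure",
    "aggregation": "sum",
}
plan = {"name": "plan_name", "expression": "plan_name", "dimension": True}

print(default_select_expression(mrr))
print(default_select_expression(plan))
```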

Relationships are explicit join definitions with cardinality. Without them, AI agents have to guess how tables connect, which is a common source of wrong queries.
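
The sketch below shows one way a relationship entry could be rendered as a JOIN clause instead of being guessed. `join_clause` and the `TABLES` lookup are hypothetical names; the relationship dict mirrors subscriptions_to_customers from the example above.

```python
def join_clause(relationship: dict, tables: dict) -> str:
    """Render one relationship entry as a SQL JOIN clause."""
    source, target = relationship["from"], relationship["to"]
    if len(source["columns"]) != len(target["columns"]):
        raise ValueError("from/to column lists must have the same length")
    left, right = tables[source["dataset"]], tables[target["dataset"]]
    # Pair the columns positionally: first with first, second with second.
    conditions = " AND ".join(
        f"{left}.{a} = {right}.{b}"
        for a, b in zip(source["columns"], target["columns"])
    )
    return f"JOIN {right} ON {conditions}"


# Dataset names mapped to physical tables, as in the example blueprint.
TABLES = {
    "subscriptions": "analytics.fct_subscriptions",
    "customers": "analytics.dim_customers",
}
relationship = {
    "name": "subscriptions_to_customers",
    "from": {"dataset": "subscriptions", "columns": ["customer_id"]},
    "to": {"dataset": "customers", "columns": ["customer_id"]},
    "cardinality": "many_to_one",
}

print(join_clause(relationship, TABLES))
```

The cardinality value is not used here; a fuller implementation might use it to warn when a join would fan out rows under an aggregate.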

Metrics are named, reusable calculations with built-in filters. They keep AI agents from reinventing aggregation logic incorrectly.
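
A minimal sketch of how a metric definition could be expanded into SQL follows, so that the built-in filters always travel with the expression. `render_metric` is a hypothetical helper, and the `tables` mapping from dataset name to physical table is an assumption about how a consumer would resolve the dataset key.

```python
def render_metric(metric: dict, tables: dict) -> str:
    """Expand a metric definition into a SELECT statement with its filters."""
    sql = (
        f"SELECT {metric['expression']} AS {metric['name']}\n"
        f"FROM {tables[metric['dataset']]}"
    )
    filters = metric.get("filters", [])
    if filters:
        # Every built-in filter is ANDed into the WHERE clause.
        sql += "\nWHERE " + "\n  AND ".join(filters)
    return sql


# Metric copied from the example blueprint.
total_mrr = {
    "name": "total_mrr",
    "expression": "SUM(mrr_amount)",
    "dataset": "subscriptions",
    "filters": ["status = 'active'"],
}

print(render_metric(total_mrr, {"subscriptions": "analytics.fct_subscriptions"}))
```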

Golden queries are verified SQL examples that serve as templates. AI agents can adapt them for similar questions, which reduces hallucination.
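
One simple way an agent might pick a template is by word overlap between the question and each query's name and description. The sketch below assumes that approach purely for illustration; `best_golden_query` is a hypothetical helper, and a production agent would more likely use embeddings or an LLM for retrieval.

```python
import re


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens; underscores also split words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_golden_query(question: str, golden_queries: list):
    """Return the verified query sharing the most words with the question."""
    asked = tokens(question)

    def score(query: dict) -> int:
        return len(asked & tokens(f"{query['name']} {query['description']}"))

    verified = [q for q in golden_queries if q.get("verified")]
    best = max(verified, key=score, default=None)
    # Treat zero overlap as "no suitable template".
    return best if best is not None and score(best) > 0 else None


# Names and descriptions copied from the example blueprint (SQL omitted).
GOLDEN_QUERIES = [
    {
        "name": "mrr_by_plan",
        "description": "MRR broken down by subscription plan",
        "verified": True,
    },
    {
        "name": "usage_by_event_type",
        "description": "Event volume by type over the last 30 days",
        "verified": True,
    },
]

match = best_golden_query("Show MRR by plan", GOLDEN_QUERIES)
print(match["name"] if match else None)
```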

# Scaffold a new data product
context new sales-analytics --source warehouse

# Introspect the database, enrich the metadata toward Gold, and build
context introspect --db postgres://localhost/myapp
context enrich --target gold --apply
context build

# Export the AI Blueprint
context blueprint sales-analytics
# → blueprints/sales-analytics.data-product.osi.yaml

# Serve the data product over MCP
context serve --stdio

The MCP server exposes your data product as structured resources. AI agents receive the full semantic model — descriptions, rules, relationships, metrics — alongside the raw schema.

Because data products follow OSI, they’re portable:

# Copy your data product to another project
cp -r context/ /path/to/other-project/context/
# Or serve from a shared location
context serve --context-dir /shared/data-products/saas-analytics/context

The tier system measures how complete your data product is:

| Tier | What’s included | How to achieve |
| --- | --- | --- |
| Bronze | Schema, descriptions, ownership | context introspect |
| Silver | + glossary, trust status, lineage, sample values | context enrich --target silver |
| Gold | + semantic roles, aggregation rules, guardrails, golden queries | Human curation + context enrich --target gold |

A Gold-tier data product gives AI agents everything they need to generate correct SQL without guessing.
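
The tier table can be read as a cumulative checklist. The sketch below is one possible interpretation of it over an already-parsed model, not ContextKit's actual scoring logic: `estimate_tier` is a hypothetical function, lineage is not checked because the example has no lineage field, and "present on at least one field" is an assumed threshold.

```python
from typing import Optional


def estimate_tier(model: dict) -> Optional[str]:
    """Guess the highest tier whose listed ingredients are all present."""
    fields = [f for d in model.get("datasets", []) for f in d.get("fields", [])]

    # Bronze: schema, descriptions, ownership.
    bronze = (
        bool(fields)
        and all(f.get("description") for f in fields)
        and bool(model.get("owner"))
    )
    # Silver adds glossary, trust status, and sample values (lineage not checked).
    silver = (
        bronze
        and bool(model.get("glossary"))
        and bool(model.get("trust_status"))
        and any(f.get("sample_values") for f in fields)
    )
    # Gold adds semantic roles, aggregation rules, guardrails, golden queries.
    gold = (
        silver
        and any(f.get("semantic_role") for f in fields)
        and any(f.get("aggregation") for f in fields)
        and any(f.get("ai_context", {}).get("guardrails") for f in fields)
        and bool(model.get("golden_queries"))
    )
    if gold:
        return "gold"
    if silver:
        return "silver"
    if bronze:
        return "bronze"
    return None


model = {
    "owner": "data-platform-team",
    "datasets": [
        {
            "name": "subscriptions",
            "fields": [
                {"name": "plan_name", "description": "Subscription plan tier"}
            ],
        }
    ],
}
print(estimate_tier(model))
```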

Data product YAMLs are strict supersets of OSI v1.0. The core semantic_model, datasets, fields, relationships, and metrics blocks are fully OSI-compliant. ContextKit-specific extensions (business_rules, golden_queries, glossary, governance fields) are stored in companion sections that OSI-compliant tools can safely ignore.

This means you can:

  • Import OSI files from other tools and enrich them with ContextKit
  • Export your ContextKit metadata as standard OSI for use elsewhere
  • Mix ContextKit-governed models with plain OSI models in the same project
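
To illustrate the export direction, here is a minimal sketch that drops ContextKit-specific keys from a parsed data product and leaves the rest untouched. The `EXTENSION_KEYS` set is an assumption: it uses the extension names mentioned above and guesses that "governance fields" means owner, tier, and trust_status. `strip_extensions` is a hypothetical helper, not the ContextKit exporter.

```python
# Assumed set of ContextKit-specific keys; the exact list is not given in the text.
EXTENSION_KEYS = {
    "business_rules",
    "golden_queries",
    "glossary",
    "owner",
    "tier",
    "trust_status",
}


def strip_extensions(node):
    """Recursively copy a parsed model, omitting extension keys at every level."""
    if isinstance(node, dict):
        return {
            key: strip_extensions(value)
            for key, value in node.items()
            if key not in EXTENSION_KEYS
        }
    if isinstance(node, list):
        return [strip_extensions(item) for item in node]
    return node


data_product = {
    "osi_version": "1.0",
    "semantic_model": {
        "name": "saas_analytics",
        "tier": "gold",
        "glossary": [{"term": "Monthly Recurring Revenue"}],
        "datasets": [
            {
                "name": "subscriptions",
                "owner": "revenue-team",
                "fields": [{"name": "mrr_amount", "expression": "mrr_amount"}],
                "business_rules": [{"name": "revenue_requires_active"}],
            }
        ],
    },
}

print(strip_extensions(data_product))
```

The function returns a new structure rather than mutating its input, so the enriched model stays available for ContextKit-aware consumers.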