Data Products & AI Blueprints
A data product is a self-contained YAML blueprint that fully describes a data model — its schema, semantics, governance, and AI instructions — in a single portable file. Data products follow the Open Semantic Interchange (OSI) specification, so they work with any OSI-compliant tool.
Why data products?
Most teams scatter metadata across wikis, spreadsheets, dbt YAML, BI tool definitions, and tribal knowledge. A data product consolidates everything an AI agent (or a new team member) needs into one file:
- Schema — tables, columns, types, expressions
- Semantics — descriptions, business terms, synonyms
- Governance — ownership, tier status, trust level, PII classification
- Rules — aggregation logic, required filters, valid ranges
- Relationships — how datasets join, cardinality
- AI context — golden queries, guardrails, usage instructions
AI Blueprint
The AI Blueprint is the Gold-tier deliverable: a single, portable YAML file that captures the complete semantic specification for a data product. Export it with context blueprint, serve it via MCP at context://data-product/{name}, or share it with any OSI-compliant tool.
Blueprint format
A data product YAML follows the OSI semantic_model structure, extended with ContextKit governance fields. Here’s a complete Gold-tier example using a generic SaaS analytics domain:
```yaml
# AI Blueprint: Open Semantic Interchange (OSI) v1.0
osi_version: "1.0"

semantic_model:
  name: saas_analytics
  description: >
    SaaS platform analytics covering subscriptions, usage events,
    and revenue. Designed for product analytics, finance reporting,
    and AI-assisted query generation.

  # ── Governance ──────────────────────────────────────────────
  owner: data-platform-team
  tier: gold
  trust_status: verified
  grain: One row per subscription event (create, upgrade, cancel, renew)
  tags:
    - saas
    - revenue
    - product-analytics

  # ── Glossary ────────────────────────────────────────────────
  glossary:
    - term: Monthly Recurring Revenue
      abbreviation: MRR
      definition: >
        Sum of all active subscription amounts normalized to a monthly
        cadence. Excludes one-time charges, overages, and credits.
      related_fields:
        - subscriptions.mrr_amount
        - subscriptions.plan_price

    - term: Churn Rate
      abbreviation: null
      definition: >
        Percentage of subscribers who cancel within a given period,
        measured as cancelled_count / start_of_period_count.
      related_fields:
        - subscriptions.status
        - subscriptions.cancelled_at

    - term: Active User
      definition: >
        A user who triggered at least one meaningful event
        (not page_view or heartbeat) in the trailing 28-day window.
      related_fields:
        - usage_events.event_type
        - usage_events.user_id

  # ── Datasets ────────────────────────────────────────────────
  datasets:
    - name: subscriptions
      description: One row per subscription lifecycle event
      schema: analytics
      table: fct_subscriptions
      grain: subscription_id + event_timestamp
      owner: revenue-team
      trust_status: verified

      fields:
        - name: subscription_id
          expression: subscription_id
          description: Unique subscription identifier
          semantic_role: identifier
          pii: false

        - name: customer_id
          expression: customer_id
          description: Foreign key to the customers dimension
          semantic_role: identifier
          pii: false

        - name: plan_name
          expression: plan_name
          description: Subscription plan tier (free, starter, pro, enterprise)
          dimension: true
          sample_values:
            - free
            - starter
            - pro
            - enterprise

        - name: mrr_amount
          expression: mrr_amount
          description: Monthly recurring revenue for this subscription in USD
          semantic_role: measure
          aggregation: sum
          unit: USD
          ai_context:
            instructions: >
              Always aggregate with SUM. Never use AVG on mrr_amount
              directly; use avg_mrr_per_customer metric instead.
              Excludes one-time setup fees.
            synonyms:
              - MRR
              - monthly revenue
              - recurring revenue
            examples:
              - "What is total MRR?"
              - "Show MRR by plan tier"
              - "MRR trend over the last 12 months"

        - name: status
          expression: status
          description: Current subscription status
          dimension: true
          sample_values:
            - active
            - cancelled
            - paused
            - trial
          ai_context:
            instructions: >
              Most revenue queries should filter WHERE status = 'active'.
              Include 'trial' only when specifically analyzing trial
              conversions.
            guardrails:
              - rule: required_filter
                condition: "status = 'active'"
                applies_to: revenue queries

        - name: event_timestamp
          expression: event_timestamp
          description: When this subscription event occurred
          semantic_role: time
          time_grain: millisecond

        - name: cancelled_at
          expression: cancelled_at
          description: Timestamp when subscription was cancelled (null if active)
          semantic_role: time

      # ── Business Rules ──────────────────────────────────────
      business_rules:
        - name: revenue_requires_active
          description: Revenue calculations must filter to active subscriptions
          rule: "WHERE status = 'active'"
          severity: error

        - name: no_future_events
          description: Event timestamps must not be in the future
          rule: "event_timestamp <= CURRENT_TIMESTAMP"
          severity: warning

    - name: usage_events
      description: Product usage telemetry, one row per event
      schema: analytics
      table: fct_usage_events
      grain: event_id
      owner: product-analytics-team
      trust_status: verified

      fields:
        - name: event_id
          expression: event_id
          description: Unique event identifier
          semantic_role: identifier

        - name: user_id
          expression: user_id
          description: The user who triggered this event
          semantic_role: identifier
          pii: true
          pii_classification: direct_identifier

        - name: event_type
          expression: event_type
          description: Category of the usage event
          dimension: true
          sample_values:
            - feature_used
            - api_call
            - export
            - invite_sent
            - report_generated

        - name: event_timestamp
          expression: event_timestamp
          description: When the event occurred
          semantic_role: time
          time_grain: millisecond

        - name: session_duration_sec
          expression: session_duration_sec
          description: Duration of the session in seconds
          semantic_role: measure
          aggregation: avg
          unit: seconds

    - name: customers
      description: Customer dimension with firmographic attributes
      schema: analytics
      table: dim_customers
      grain: customer_id
      owner: revenue-team
      trust_status: verified

      fields:
        - name: customer_id
          expression: customer_id
          description: Unique customer identifier
          semantic_role: identifier

        - name: company_name
          expression: company_name
          description: Customer's company name
          pii: true
          pii_classification: business_name
          label: true

        - name: industry
          expression: industry
          description: Industry vertical of the customer
          dimension: true
          sample_values:
            - technology
            - healthcare
            - finance
            - retail
            - education

        - name: signup_date
          expression: signup_date
          description: Date the customer first signed up
          semantic_role: time
          time_grain: day

        - name: region
          expression: region
          description: Geographic region
          dimension: true
          sample_values:
            - north_america
            - europe
            - apac
            - latam

  # ── Relationships ───────────────────────────────────────────
  relationships:
    - name: subscriptions_to_customers
      from:
        dataset: subscriptions
        columns:
          - customer_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one

    - name: usage_to_customers
      from:
        dataset: usage_events
        columns:
          - user_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one

  # ── Metrics ─────────────────────────────────────────────────
  metrics:
    - name: total_mrr
      expression: SUM(mrr_amount)
      description: Total monthly recurring revenue across all active subscriptions
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          This is the primary revenue metric. Always filter to active
          subscriptions. For historical MRR, group by month using
          event_timestamp.
        synonyms:
          - MRR
          - monthly recurring revenue
          - recurring revenue
        examples:
          - "What is our current MRR?"
          - "Show MRR growth month over month"

    - name: avg_mrr_per_customer
      expression: AVG(mrr_amount)
      description: Average MRR per active customer
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          Use this instead of manually averaging mrr_amount.
          Provides per-customer average, not per-subscription.
        synonyms:
          - ARPU
          - average revenue per user

    - name: monthly_active_users
      expression: COUNT(DISTINCT user_id)
      description: Distinct users with at least one non-trivial event in 28 days
      dataset: usage_events
      filters:
        - "event_type NOT IN ('page_view', 'heartbeat')"
        - "event_timestamp >= CURRENT_DATE - INTERVAL '28 days'"
      ai_context:
        synonyms:
          - MAU
          - active users

  # ── Golden Queries ──────────────────────────────────────────
  golden_queries:
    - name: mrr_by_plan
      description: MRR broken down by subscription plan
      sql: |
        SELECT plan_name, SUM(mrr_amount) AS mrr
        FROM analytics.fct_subscriptions
        WHERE status = 'active'
        GROUP BY plan_name
        ORDER BY mrr DESC
      verified: true

    - name: churn_rate_monthly
      description: Monthly churn rate over the last 12 months
      sql: |
        SELECT
          DATE_TRUNC('month', cancelled_at) AS churn_month,
          COUNT(*) AS cancelled,
          COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS churn_pct
        FROM analytics.fct_subscriptions
        WHERE cancelled_at IS NOT NULL
          AND cancelled_at >= CURRENT_DATE - INTERVAL '12 months'
        GROUP BY 1
        ORDER BY 1
      verified: true

    - name: usage_by_event_type
      description: Event volume by type over the last 30 days
      sql: |
        SELECT event_type, COUNT(*) AS event_count
        FROM analytics.fct_usage_events
        WHERE event_timestamp >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY event_type
        ORDER BY event_count DESC
      verified: true
```

Anatomy of a data product
Header
Every data product starts with osi_version and a semantic_model block. The model name, description, owner, and tier establish identity and governance at the top level.
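A quick way to sanity-check the header is to look for these keys in the parsed YAML. The sketch below assumes the blueprint has already been loaded into a Python dict (for example with a YAML parser) and treats name, description, owner, and tier as required. That requirement is an assumption made for illustration, and missing_header_keys is a hypothetical helper, not part of ContextKit.

```python
# Keys this sketch expects under semantic_model (an assumption, not an OSI rule).
REQUIRED_MODEL_KEYS = ("name", "description", "owner", "tier")


def missing_header_keys(doc: dict) -> list[str]:
    """Return the header keys that are absent from a parsed blueprint."""
    missing = []
    if "osi_version" not in doc:
        missing.append("osi_version")
    model = doc.get("semantic_model")
    if not isinstance(model, dict):
        missing.append("semantic_model")
        return missing
    missing.extend(
        f"semantic_model.{key}" for key in REQUIRED_MODEL_KEYS if key not in model
    )
    return missing


blueprint = {
    "osi_version": "1.0",
    "semantic_model": {
        "name": "saas_analytics",
        "description": "SaaS platform analytics",
        "tier": "gold",
    },
}
print(missing_header_keys(blueprint))  # ['semantic_model.owner']
```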
Glossary
Business terms with formal definitions. The related_fields list connects terms to actual columns, so AI agents can map natural language (“What’s our MRR?”) to the correct field (subscriptions.mrr_amount).
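To make that mapping concrete, here is a minimal sketch of a term lookup over glossary entries shaped like the ones in the example blueprint. The matching strategy (substring match on the term, whole-word match on the abbreviation) and the function name fields_for_question are illustrative assumptions. An actual agent would more likely rely on the language model itself or on embeddings.

```python
import re

# Two entries copied from the example blueprint's glossary.
glossary = [
    {
        "term": "Monthly Recurring Revenue",
        "abbreviation": "MRR",
        "related_fields": ["subscriptions.mrr_amount", "subscriptions.plan_price"],
    },
    {
        "term": "Churn Rate",
        "abbreviation": None,
        "related_fields": ["subscriptions.status", "subscriptions.cancelled_at"],
    },
]


def fields_for_question(question: str, glossary: list[dict]) -> list[str]:
    """Collect related_fields for every glossary entry mentioned in the question."""
    text = question.lower()
    words = set(re.findall(r"[a-z0-9_]+", text))
    matches = []
    for entry in glossary:
        abbreviation = entry.get("abbreviation")
        term_hit = entry["term"].lower() in text
        abbreviation_hit = abbreviation is not None and abbreviation.lower() in words
        if term_hit or abbreviation_hit:
            matches.extend(entry["related_fields"])
    return matches


print(fields_for_question("What's our MRR?", glossary))
# ['subscriptions.mrr_amount', 'subscriptions.plan_price']
```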
Datasets
Each dataset maps to a physical table. Datasets and their fields declare:
| Property | Purpose |
|---|---|
| semantic_role | identifier, measure, time, or dimension; tells AI how to use the field |
| aggregation | Default aggregation function (sum, avg, count, etc.) |
| ai_context | Instructions, synonyms, examples, and guardrails for AI agents |
| pii / pii_classification | Security classification for sensitive data |
| sample_values | Representative values to help AI understand the domain |
| business_rules | Constraints that must hold for correct queries |
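As an illustration of how a consumer might act on these properties, the sketch below derives default aggregate expressions for measures and lists the fields flagged as PII. The field dicts mirror the example blueprint. The helper names measure_expressions and pii_field_names are hypothetical and only show one possible reading of the metadata.

```python
# A handful of field definitions in the shape used by the example blueprint.
fields = [
    {"name": "mrr_amount", "semantic_role": "measure", "aggregation": "sum", "pii": False},
    {"name": "session_duration_sec", "semantic_role": "measure", "aggregation": "avg"},
    {"name": "user_id", "semantic_role": "identifier", "pii": True},
    {"name": "event_type", "dimension": True},
]


def measure_expressions(fields: list[dict]) -> list[str]:
    """Render each measure with its declared default aggregation."""
    return [
        f"{field['aggregation'].upper()}({field['name']}) AS {field['name']}"
        for field in fields
        if field.get("semantic_role") == "measure" and "aggregation" in field
    ]


def pii_field_names(fields: list[dict]) -> list[str]:
    """Names of fields marked pii: true (a missing flag is treated as not PII)."""
    return [field["name"] for field in fields if field.get("pii", False)]


print(measure_expressions(fields))
# ['SUM(mrr_amount) AS mrr_amount', 'AVG(session_duration_sec) AS session_duration_sec']
print(pii_field_names(fields))
# ['user_id']
```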
Relationships
Explicit join definitions with cardinality. Without these, AI agents guess how tables connect, a common source of wrong queries.
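A relationship entry carries enough information to produce a join mechanically. The following sketch turns one into a SQL JOIN clause. The join_clause helper and the tables lookup (dataset name to schema-qualified table) are assumptions introduced for the example, and the dataset name is reused as the table alias.

```python
# Relationship copied from the example blueprint.
relationship = {
    "name": "subscriptions_to_customers",
    "from": {"dataset": "subscriptions", "columns": ["customer_id"]},
    "to": {"dataset": "customers", "columns": ["customer_id"]},
    "cardinality": "many_to_one",
}

# Hypothetical lookup from dataset name to physical table.
tables = {
    "subscriptions": "analytics.fct_subscriptions",
    "customers": "analytics.dim_customers",
}


def join_clause(relationship: dict, tables: dict[str, str]) -> str:
    """Render a JOIN from the 'from' dataset to the 'to' dataset."""
    left = relationship["from"]
    right = relationship["to"]
    conditions = " AND ".join(
        f"{left['dataset']}.{left_col} = {right['dataset']}.{right_col}"
        for left_col, right_col in zip(left["columns"], right["columns"])
    )
    return f"JOIN {tables[right['dataset']]} AS {right['dataset']} ON {conditions}"


print(join_clause(relationship, tables))
# JOIN analytics.dim_customers AS customers ON subscriptions.customer_id = customers.customer_id
```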
Metrics
Named, reusable calculations with built-in filters. Metrics prevent AI agents from reinventing aggregation logic incorrectly.
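Because a metric bundles its expression with its filters, a consumer can expand it into SQL without re-deriving either. The sketch below shows one way to do that for the total_mrr metric from the example. metric_sql and the tables lookup are hypothetical, and the metric's filters are combined with AND, which is an assumption about how multiple filters are meant to compose.

```python
# Metric copied from the example blueprint (ai_context omitted for brevity).
total_mrr = {
    "name": "total_mrr",
    "expression": "SUM(mrr_amount)",
    "dataset": "subscriptions",
    "filters": ["status = 'active'"],
}

# Hypothetical lookup from dataset name to physical table.
tables = {
    "subscriptions": "analytics.fct_subscriptions",
    "usage_events": "analytics.fct_usage_events",
}


def metric_sql(metric: dict, tables: dict[str, str]) -> str:
    """Expand a metric definition into a single-value SELECT statement."""
    sql = (
        f"SELECT {metric['expression']} AS {metric['name']} "
        f"FROM {tables[metric['dataset']]}"
    )
    filters = metric.get("filters", [])
    if filters:
        # Parenthesize each filter so OR conditions inside one cannot leak out.
        sql += " WHERE " + " AND ".join(f"({condition})" for condition in filters)
    return sql


print(metric_sql(total_mrr, tables))
# SELECT SUM(mrr_amount) AS total_mrr FROM analytics.fct_subscriptions WHERE (status = 'active')
```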
Golden queries
Verified SQL examples that serve as templates. AI agents can adapt these for similar questions, reducing hallucination.
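One way an agent could choose a template is to pick the verified golden query whose name and description share the most words with the question. The sketch below uses plain token overlap as a stand-in for whatever retrieval method a real agent applies. closest_golden_query is a hypothetical helper, and the sql bodies are left out to keep the example short.

```python
import re

# Names and descriptions copied from the example blueprint.
golden_queries = [
    {"name": "mrr_by_plan", "description": "MRR broken down by subscription plan", "verified": True},
    {"name": "churn_rate_monthly", "description": "Monthly churn rate over the last 12 months", "verified": True},
    {"name": "usage_by_event_type", "description": "Event volume by type over the last 30 days", "verified": True},
]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def closest_golden_query(question: str, golden_queries: list[dict]):
    """Return the verified query with the largest token overlap, or None."""
    asked = tokens(question)
    best, best_score = None, 0
    for query in golden_queries:
        if not query.get("verified"):
            continue  # only verified queries are trusted as templates
        score = len(asked & tokens(query["name"] + " " + query["description"]))
        if score > best_score:
            best, best_score = query, score
    return best


match = closest_golden_query("Show MRR by plan for enterprise customers", golden_queries)
print(match["name"])  # mrr_by_plan
```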
Using data products
Create a new data product

```bash
context new sales-analytics --source warehouse
```

Generate from a database

```bash
context introspect --db postgres://localhost/myapp
context enrich --target gold --apply
context build
```

Export the AI Blueprint

```bash
context blueprint sales-analytics
# → blueprints/sales-analytics.data-product.osi.yaml
```

Serve to AI agents via MCP

```bash
context serve --stdio
```

The MCP server exposes your data product as structured resources. AI agents receive the full semantic model (descriptions, rules, relationships, metrics) alongside the raw schema.
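For a sense of what a client sends to fetch that resource, the sketch below builds an MCP resources/read request as a JSON-RPC 2.0 message, using the context://data-product/{name} URI template mentioned earlier on this page. Whether the server keys resources by the model name (saas_analytics here) is an assumption, and read_resource_request is a hypothetical helper. A real client would normally go through an MCP client library and complete the initialization handshake first.

```python
import json


def read_resource_request(name: str, request_id: int = 1) -> str:
    """Serialize an MCP resources/read request for one data product."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        # URI template taken from this page; the name value is an assumption.
        "params": {"uri": f"context://data-product/{name}"},
    }
    return json.dumps(message)


print(read_resource_request("saas_analytics"))
```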
Move between environments
Because data products follow OSI, they’re portable:

```bash
# Copy your data product to another project
cp -r context/ /path/to/other-project/context/

# Or serve from a shared location
context serve --context-dir /shared/data-products/saas-analytics/context
```

Data products and the tier system
The tier system measures how complete your data product is:
| Tier | What’s included | How to achieve |
|---|---|---|
| Bronze | Schema, descriptions, ownership | context introspect |
| Silver | + glossary, trust status, lineage, sample values | context enrich --target silver |
| Gold | + semantic roles, aggregation rules, guardrails, golden queries | Human curation + context enrich --target gold |
A Gold-tier data product gives AI agents everything they need to generate correct SQL without guessing.
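The cumulative nature of the tiers can be sketched as a series of presence checks on a parsed semantic_model. This is not ContextKit's actual scoring: estimate_tier is a hypothetical helper, it checks only whether keys exist, and it leaves out lineage and guardrails entirely, so treat it as a rough illustration of the table above.

```python
def any_field_has(model: dict, key: str) -> bool:
    """True if at least one field in any dataset declares the given key."""
    return any(
        key in field
        for dataset in model.get("datasets", [])
        for field in dataset.get("fields", [])
    )


def estimate_tier(model: dict):
    """Rough tier estimate: each tier requires everything in the tier below it."""
    bronze = bool(model.get("datasets")) and "description" in model and "owner" in model
    silver = (
        bronze
        and bool(model.get("glossary"))
        and "trust_status" in model
        and any_field_has(model, "sample_values")
    )
    gold = (
        silver
        and any_field_has(model, "semantic_role")
        and any_field_has(model, "aggregation")
        and bool(model.get("golden_queries"))
    )
    if gold:
        return "gold"
    if silver:
        return "silver"
    if bronze:
        return "bronze"
    return None


introspected = {
    "description": "SaaS platform analytics",
    "owner": "data-platform-team",
    "datasets": [{"name": "subscriptions", "fields": [{"name": "subscription_id"}]}],
}
print(estimate_tier(introspected))  # bronze
```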
OSI compatibility
Data product YAMLs are strict supersets of OSI v1.0. The core semantic_model, datasets, fields, relationships, and metrics blocks are fully OSI-compliant. ContextKit-specific extensions (business_rules, golden_queries, glossary, governance fields) are stored in companion sections that OSI-compliant tools can safely ignore.
This means you can:
- Import OSI files from other tools and enrich them with ContextKit
- Export your ContextKit metadata as standard OSI for use elsewhere
- Mix ContextKit-governed models with plain OSI models in the same project
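Since OSI-compliant tools can ignore the extensions, stripping them is optional, but seeing it done makes the superset relationship concrete. The sketch below removes extension keys from a parsed blueprint. The page names business_rules, golden_queries, glossary, and governance fields as extensions. The exact key sets used here, in particular which keys count as governance fields, are assumptions, and to_plain_osi is a hypothetical helper.

```python
import copy

# Assumed extension keys at the model and dataset level (illustrative only).
MODEL_EXTENSIONS = {"glossary", "golden_queries", "owner", "tier", "trust_status", "grain", "tags"}
DATASET_EXTENSIONS = {"business_rules", "owner", "trust_status", "grain"}


def to_plain_osi(doc: dict) -> dict:
    """Return a copy of the blueprint without the assumed extension keys."""
    plain = copy.deepcopy(doc)  # leave the caller's blueprint untouched
    model = plain["semantic_model"]
    for key in MODEL_EXTENSIONS:
        model.pop(key, None)
    for dataset in model.get("datasets", []):
        for key in DATASET_EXTENSIONS:
            dataset.pop(key, None)
    return plain


blueprint = {
    "osi_version": "1.0",
    "semantic_model": {
        "name": "saas_analytics",
        "owner": "data-platform-team",
        "glossary": [{"term": "Churn Rate"}],
        "datasets": [
            {
                "name": "subscriptions",
                "owner": "revenue-team",
                "business_rules": [{"name": "no_future_events"}],
                "fields": [{"name": "subscription_id"}],
            }
        ],
    },
}
plain = to_plain_osi(blueprint)
print(sorted(plain["semantic_model"]))  # ['datasets', 'name']
```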