Data Products & AI Blueprints

A data product is a self-contained YAML blueprint that fully describes a data model — its schema, semantics, governance, and AI instructions — in a single portable file. Data products follow the Open Semantic Interchange (OSI) specification, so they work with any OSI-compliant tool.

Why data products?

Most teams scatter metadata across wikis, spreadsheets, dbt YAML, BI tool definitions, and tribal knowledge. A data product consolidates everything an AI agent (or a new team member) needs into one file:

Schema — tables, columns, types, expressions
Semantics — descriptions, business terms, synonyms
Governance — ownership, tier status, trust level, PII classification
Rules — aggregation logic, required filters, valid ranges
Relationships — how datasets join, cardinality
AI context — golden queries, guardrails, usage instructions

AI Blueprint

The AI Blueprint is the Gold-tier deliverable: a single, portable YAML file that captures the complete semantic specification for a data product. Export it with `context blueprint`, serve it via MCP at `context://data-product/{name}`, or share it with any OSI-compliant tool.

Blueprint format

A data product YAML follows the OSI semantic_model structure, extended with ContextKit governance fields. Here’s a complete Gold-tier example using a generic SaaS analytics domain:

# AI Blueprint — Open Semantic Interchange (OSI) v1.0

osi_version: "1.0"

semantic_model:
  name: saas_analytics
  description: >
    SaaS platform analytics covering subscriptions, usage events,
    and revenue. Designed for product analytics, finance reporting,
    and AI-assisted query generation.

  # ── Governance ──────────────────────────────────────────────
  owner: data-platform-team
  tier: gold
  trust_status: verified
  grain: One row per subscription event (create, upgrade, cancel, renew)
  tags:
    - saas
    - revenue
    - product-analytics

  # ── Glossary ────────────────────────────────────────────────
  glossary:
    - term: Monthly Recurring Revenue
      abbreviation: MRR
      definition: >
        Sum of all active subscription amounts normalized to a monthly
        cadence. Excludes one-time charges, overages, and credits.
      related_fields:
        - subscriptions.mrr_amount
        - subscriptions.plan_price

    - term: Churn Rate
      abbreviation: null
      definition: >
        Percentage of subscribers who cancel within a given period,
        measured as cancelled_count / start_of_period_count.
      related_fields:
        - subscriptions.status
        - subscriptions.cancelled_at

    - term: Active User
      definition: >
        A user who triggered at least one meaningful event (not
        page_view or heartbeat) in the trailing 28-day window.
      related_fields:
        - usage_events.event_type
        - usage_events.user_id

  # ── Datasets ────────────────────────────────────────────────
  datasets:
    - name: subscriptions
      description: One row per subscription lifecycle event
      schema: analytics
      table: fct_subscriptions
      grain: subscription_id + event_timestamp
      owner: revenue-team
      trust_status: verified

      fields:
        - name: subscription_id
          expression: subscription_id
          description: Unique subscription identifier
          semantic_role: identifier
          pii: false

        - name: customer_id
          expression: customer_id
          description: Foreign key to the customers dimension
          semantic_role: identifier
          pii: false

        - name: plan_name
          expression: plan_name
          description: Subscription plan tier (free, starter, pro, enterprise)
          dimension: true
          sample_values:
            - free
            - starter
            - pro
            - enterprise

        - name: mrr_amount
          expression: mrr_amount
          description: Monthly recurring revenue for this subscription in USD
          semantic_role: measure
          aggregation: sum
          unit: USD
          ai_context:
            instructions: >
              Always aggregate with SUM. Never use AVG on mrr_amount
              directly — use avg_mrr_per_customer metric instead.
              Excludes one-time setup fees.
            synonyms:
              - MRR
              - monthly revenue
              - recurring revenue
            examples:
              - "What is total MRR?"
              - "Show MRR by plan tier"
              - "MRR trend over the last 12 months"

        - name: status
          expression: status
          description: Current subscription status
          dimension: true
          sample_values:
            - active
            - cancelled
            - paused
            - trial
          ai_context:
            instructions: >
              Revenue queries must filter WHERE status = 'active'.
              Include 'trial' only when specifically analyzing trial conversions.
            guardrails:
              - rule: required_filter
                condition: "status = 'active'"
                applies_to: revenue queries

        - name: event_timestamp
          expression: event_timestamp
          description: When this subscription event occurred
          semantic_role: time
          time_grain: millisecond

        - name: cancelled_at
          expression: cancelled_at
          description: Timestamp when subscription was cancelled (null if active)
          semantic_role: time

      # ── Business Rules ──────────────────────────────────────
      business_rules:
        - name: revenue_requires_active
          description: Revenue calculations must filter to active subscriptions
          rule: "WHERE status = 'active'"
          severity: error

        - name: no_future_events
          description: Event timestamps must not be in the future
          rule: "event_timestamp <= CURRENT_TIMESTAMP"
          severity: warning

    - name: usage_events
      description: Product usage telemetry — one row per event
      schema: analytics
      table: fct_usage_events
      grain: event_id
      owner: product-analytics-team
      trust_status: verified

      fields:
        - name: event_id
          expression: event_id
          description: Unique event identifier
          semantic_role: identifier

        - name: user_id
          expression: user_id
          description: The user who triggered this event
          semantic_role: identifier
          pii: true
          pii_classification: direct_identifier

        - name: event_type
          expression: event_type
          description: Category of the usage event
          dimension: true
          sample_values:
            - feature_used
            - api_call
            - export
            - invite_sent
            - report_generated

        - name: event_timestamp
          expression: event_timestamp
          description: When the event occurred
          semantic_role: time
          time_grain: millisecond

        - name: session_duration_sec
          expression: session_duration_sec
          description: Duration of the session in seconds
          semantic_role: measure
          aggregation: avg
          unit: seconds

    - name: customers
      description: Customer dimension with firmographic attributes
      schema: analytics
      table: dim_customers
      grain: customer_id
      owner: revenue-team
      trust_status: verified

      fields:
        - name: customer_id
          expression: customer_id
          description: Unique customer identifier
          semantic_role: identifier

        - name: company_name
          expression: company_name
          description: Customer's company name
          pii: true
          pii_classification: business_name
          label: true

        - name: industry
          expression: industry
          description: Industry vertical of the customer
          dimension: true
          sample_values:
            - technology
            - healthcare
            - finance
            - retail
            - education

        - name: signup_date
          expression: signup_date
          description: Date the customer first signed up
          semantic_role: time
          time_grain: day

        - name: region
          expression: region
          description: Geographic region
          dimension: true
          sample_values:
            - north_america
            - europe
            - apac
            - latam

  # ── Relationships ───────────────────────────────────────────
  relationships:
    - name: subscriptions_to_customers
      from:
        dataset: subscriptions
        columns:
          - customer_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one

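    # Assumes each user_id value is also a customers.customer_id; if users
    # and customers are separate entities, join on a customer_id column.
    - name: usage_to_customers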
      from:
        dataset: usage_events
        columns:
          - user_id
      to:
        dataset: customers
        columns:
          - customer_id
      cardinality: many_to_one

  # ── Metrics ─────────────────────────────────────────────────
  metrics:
    - name: total_mrr
      expression: SUM(mrr_amount)
      description: Total monthly recurring revenue across all active subscriptions
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          This is the primary revenue metric. Always filter to active
          subscriptions. For historical MRR, group by month using
          event_timestamp.
        synonyms:
          - MRR
          - monthly recurring revenue
          - recurring revenue
        examples:
          - "What is our current MRR?"
          - "Show MRR growth month over month"

    - name: avg_mrr_per_customer
      expression: SUM(mrr_amount) / NULLIF(COUNT(DISTINCT customer_id), 0)
      description: Average MRR per active customer
      dataset: subscriptions
      filters:
        - "status = 'active'"
      ai_context:
        instructions: >
          Use this instead of manually averaging mrr_amount.
          Provides per-customer average, not per-subscription.
        synonyms:
          - ARPU
          - average revenue per user

    - name: monthly_active_users
      expression: COUNT(DISTINCT user_id)
      description: Distinct users with at least one non-trivial event in 28 days
      dataset: usage_events
      filters:
        - "event_type NOT IN ('page_view', 'heartbeat')"
        - "event_timestamp >= CURRENT_DATE - INTERVAL '28 days'"
      ai_context:
        synonyms:
          - MAU
          - active users

  # ── Golden Queries ──────────────────────────────────────────
  golden_queries:
    - name: mrr_by_plan
      description: MRR broken down by subscription plan
      sql: |
        SELECT plan_name, SUM(mrr_amount) AS mrr
        FROM analytics.fct_subscriptions
        WHERE status = 'active'
        GROUP BY plan_name
        ORDER BY mrr DESC
      verified: true

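    - name: churn_rate_monthly
      description: >
        Cancellations per month over the last 12 months, with each
        month's share of all cancellations in that window. This is not
        the glossary Churn Rate, which divides by the subscriber count
        at the start of the period.
      sql: |
        SELECT
          DATE_TRUNC('month', cancelled_at) AS churn_month,
          COUNT(*) AS cancelled,
          COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS share_of_cancellations_pct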
        FROM analytics.fct_subscriptions
        WHERE cancelled_at IS NOT NULL
          AND cancelled_at >= CURRENT_DATE - INTERVAL '12 months'
        GROUP BY 1
        ORDER BY 1
      verified: true

    - name: usage_by_event_type
      description: Event volume by type over the last 30 days
      sql: |
        SELECT event_type, COUNT(*) AS event_count
        FROM analytics.fct_usage_events
        WHERE event_timestamp >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY event_type
        ORDER BY event_count DESC
      verified: true

Anatomy of a data product

Every data product starts with `osi_version` and a `semantic_model` block. The model `name`, `description`, `owner`, and `tier` establish identity and governance at the top level.

Glossary

Business terms with formal definitions. The `related_fields` list connects terms to actual columns, so AI agents can map natural language (“What’s our MRR?”) to the correct field (`subscriptions.mrr_amount`).
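
As a rough illustration of that mapping, the sketch below builds a lookup from glossary terms and abbreviations to their `related_fields`. The dictionaries mirror the example blueprint, but `build_term_index` and `fields_for_question` are hypothetical helper names, not part of ContextKit or OSI, and a real agent would likely use something more robust than whole-word matching.

```python
import re

# Hypothetical glossary entries, shaped like the `glossary` block in the example.
GLOSSARY = [
    {
        "term": "Monthly Recurring Revenue",
        "abbreviation": "MRR",
        "related_fields": ["subscriptions.mrr_amount", "subscriptions.plan_price"],
    },
    {
        "term": "Churn Rate",
        "abbreviation": None,
        "related_fields": ["subscriptions.status", "subscriptions.cancelled_at"],
    },
]


def build_term_index(glossary):
    """Map each lowercased term and abbreviation to its related fields."""
    index = {}
    for entry in glossary:
        for key in (entry.get("term"), entry.get("abbreviation")):
            if key:  # skip missing or null abbreviations
                index[key.lower()] = entry["related_fields"]
    return index


def fields_for_question(question, index):
    """Return related fields for every glossary key found as whole words."""
    text = question.lower()
    matches = []
    for key, fields in index.items():
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            for field in fields:
                if field not in matches:
                    matches.append(field)
    return matches


index = build_term_index(GLOSSARY)
print(fields_for_question("What's our MRR?", index))
```

Matching on both the full term and the abbreviation is why the glossary carries an `abbreviation` key: users rarely type the formal name.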

Datasets

Each dataset maps to a physical table. Fields declare:

| Property | Purpose |
| --- | --- |
| `semantic_role` | `identifier`, `measure`, `time`, or `dimension`; tells AI how to use the field |
| `aggregation` | Default aggregation function (`sum`, `avg`, `count`, etc.) |
| `ai_context` | Instructions, synonyms, examples, and guardrails for AI agents |
| `pii` / `pii_classification` | Security classification for sensitive data |
| `sample_values` | Representative values to help AI understand the domain |
| `business_rules` | Constraints that must hold for correct queries (declared at the dataset level in the example above) |
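
To make these properties concrete, here is a minimal sketch of how a consumer might act on them: measures are wrapped in their default aggregation, and fields flagged `pii` are collected for masking. The field dictionaries mirror the example blueprint; `select_expression` and `pii_fields` are hypothetical helpers, not ContextKit functions.

```python
# Hypothetical fields, shaped like entries under `datasets[].fields`.
FIELDS = [
    {"name": "mrr_amount", "semantic_role": "measure", "aggregation": "sum"},
    {"name": "plan_name", "dimension": True},
    {"name": "user_id", "semantic_role": "identifier", "pii": True},
    {"name": "event_timestamp", "semantic_role": "time"},
]


def select_expression(field):
    """Render a field for a SELECT list, aggregating measures by default."""
    if field.get("semantic_role") == "measure" and field.get("aggregation"):
        func = field["aggregation"].upper()
        return f"{func}({field['name']}) AS {field['name']}"
    return field["name"]


def pii_fields(fields):
    """List the names of fields marked as PII."""
    return [f["name"] for f in fields if f.get("pii")]


print([select_expression(f) for f in FIELDS])
print(pii_fields(FIELDS))
```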

Relationships

Explicit join definitions with cardinality. Without these, AI agents guess how tables connect — a common source of wrong queries.
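
A minimal sketch of how a relationship entry could be turned into a SQL `JOIN` clause follows. It assumes the left-hand table is already aliased by its dataset name in the `FROM` clause. `join_clause` and the `TABLES` mapping are hypothetical, introduced only for illustration.

```python
# Relationship shaped like an entry under `relationships` in the example.
RELATIONSHIP = {
    "name": "subscriptions_to_customers",
    "from": {"dataset": "subscriptions", "columns": ["customer_id"]},
    "to": {"dataset": "customers", "columns": ["customer_id"]},
    "cardinality": "many_to_one",
}

# Hypothetical lookup from dataset name to physical table (schema.table).
TABLES = {
    "subscriptions": "analytics.fct_subscriptions",
    "customers": "analytics.dim_customers",
}


def join_clause(rel, tables):
    """Build a JOIN clause pairing the from/to columns positionally."""
    left, right = rel["from"], rel["to"]
    if len(left["columns"]) != len(right["columns"]):
        raise ValueError(f"column count mismatch in {rel['name']}")
    conditions = " AND ".join(
        f"{left['dataset']}.{lc} = {right['dataset']}.{rc}"
        for lc, rc in zip(left["columns"], right["columns"])
    )
    return f"JOIN {tables[right['dataset']]} AS {right['dataset']} ON {conditions}"


print(join_clause(RELATIONSHIP, TABLES))
```

Because the join keys come from the blueprint rather than from column-name guessing, composite keys work the same way: add more entries to both `columns` lists.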

Metrics

Named, reusable calculations with built-in filters. Metrics prevent AI agents from reinventing aggregation logic incorrectly.
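
The sketch below shows one way a metric definition could be expanded into SQL so that its built-in filters are always applied. The metric dictionary mirrors `total_mrr` from the example; `render_metric` and `TABLES` are hypothetical names, and the output is a plain string with no dialect handling.

```python
# Metric shaped like an entry under `metrics` in the example blueprint.
METRIC = {
    "name": "total_mrr",
    "expression": "SUM(mrr_amount)",
    "dataset": "subscriptions",
    "filters": ["status = 'active'"],
}

# Hypothetical lookup from dataset name to physical table.
TABLES = {"subscriptions": "analytics.fct_subscriptions"}


def render_metric(metric, tables, group_by=None):
    """Render a metric as a SELECT statement, always applying its filters."""
    group_by = list(group_by or [])
    select_cols = group_by + [f"{metric['expression']} AS {metric['name']}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {tables[metric['dataset']]}"
    if metric.get("filters"):
        sql += " WHERE " + " AND ".join(f"({f})" for f in metric["filters"])
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


print(render_metric(METRIC, TABLES, group_by=["plan_name"]))
```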

Golden queries

Verified SQL examples that serve as templates. AI agents can adapt these for similar questions, reducing hallucination.
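
How an agent selects a template is not specified here. As one possible approach, the sketch below picks the golden query whose name and description share the most words with the question. `closest_golden_query` is a hypothetical helper, and word overlap is a deliberately naive stand-in for whatever retrieval an agent actually uses.

```python
import re

# Golden queries shaped like entries under `golden_queries` (SQL omitted).
GOLDEN_QUERIES = [
    {"name": "mrr_by_plan", "description": "MRR broken down by subscription plan"},
    {
        "name": "usage_by_event_type",
        "description": "Event volume by type over the last 30 days",
    },
]


def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def closest_golden_query(question, queries):
    """Return the query sharing the most words with the question, or None."""
    question_tokens = tokens(question)
    best, best_score = None, 0
    for query in queries:
        candidate = query["name"].replace("_", " ") + " " + query["description"]
        score = len(question_tokens & tokens(candidate))
        if score > best_score:
            best, best_score = query, score
    return best


match = closest_golden_query("Show MRR by plan tier", GOLDEN_QUERIES)
print(match["name"])  # → mrr_by_plan
```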

Using data products

Create a new data product

context new sales-analytics --source warehouse

Generate from a database

context introspect --db postgres://localhost/myapp
context enrich --target gold --apply
context build

Export the AI Blueprint

context blueprint sales-analytics
# → blueprints/sales-analytics.data-product.osi.yaml

Serve to AI agents via MCP

context serve --stdio

The MCP server exposes your data product as structured resources. AI agents receive the full semantic model — descriptions, rules, relationships, metrics — alongside the raw schema.

Move between environments

Because data products follow OSI, they’re portable:

# Copy your data product to another project
cp -r context/ /path/to/other-project/context/

# Or serve from a shared location
context serve --context-dir /shared/data-products/saas-analytics/context

Data products and the tier system

The tier system measures how complete your data product is:

| Tier | What’s included | How to achieve |
| --- | --- | --- |
| Bronze | Schema, descriptions, ownership | `context introspect` |
| Silver | + glossary, trust status, lineage, sample values | `context enrich --target silver` |
| Gold | + semantic roles, aggregation rules, guardrails, golden queries | Human curation + `context enrich --target gold` |

A Gold-tier data product gives AI agents everything they need to generate correct SQL without guessing.
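
The exact checks behind each tier are not spelled out here, so the following is only a sketch under simplified, assumed criteria: Bronze needs an owner and described fields, Silver adds a glossary and trust status, and Gold adds golden queries and at least one semantic role. Lineage, sample values, and guardrails are ignored for brevity. `infer_tier` is a hypothetical function, not the ContextKit implementation.

```python
def infer_tier(model):
    """Return 'gold', 'silver', 'bronze', or None under assumed criteria."""
    fields = [f for d in model.get("datasets", []) for f in d.get("fields", [])]

    has_bronze = (
        bool(fields)
        and bool(model.get("owner"))
        and all(f.get("description") for f in fields)
    )
    has_silver = (
        has_bronze and bool(model.get("glossary")) and bool(model.get("trust_status"))
    )
    has_gold = (
        has_silver
        and bool(model.get("golden_queries"))
        and any(f.get("semantic_role") for f in fields)
    )
    if has_gold:
        return "gold"
    if has_silver:
        return "silver"
    if has_bronze:
        return "bronze"
    return None


# Hypothetical models, each adding the sections the next tier requires.
BRONZE = {
    "owner": "data-platform-team",
    "datasets": [
        {
            "name": "subscriptions",
            "fields": [{"name": "mrr_amount", "description": "Monthly recurring revenue"}],
        }
    ],
}
SILVER = {**BRONZE, "glossary": [{"term": "MRR"}], "trust_status": "verified"}
GOLD = {
    **SILVER,
    "golden_queries": [{"name": "mrr_by_plan"}],
    "datasets": [
        {
            "name": "subscriptions",
            "fields": [
                {
                    "name": "mrr_amount",
                    "description": "Monthly recurring revenue",
                    "semantic_role": "measure",
                }
            ],
        }
    ],
}

print(infer_tier(BRONZE), infer_tier(SILVER), infer_tier(GOLD))
```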

OSI compatibility

Data product YAMLs are strict supersets of OSI v1.0. The core `semantic_model`, `datasets`, `fields`, `relationships`, and `metrics` blocks are fully OSI-compliant. ContextKit-specific extensions (`business_rules`, `golden_queries`, `glossary`, governance fields) are stored in companion sections that OSI-compliant tools can safely ignore.

This means you can:

Import OSI files from other tools and enrich them with ContextKit
Export your ContextKit metadata as standard OSI for use elsewhere
Mix ContextKit-governed models with plain OSI models in the same project
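
To illustrate what exporting to plain OSI might involve, the sketch below removes extension keys from a parsed blueprint. The key set is an assumption: it takes the extensions named above and guesses that "governance fields" means `owner`, `tier`, and `trust_status`. `strip_extensions` is a hypothetical helper, not a ContextKit command.

```python
# Assumed extension keys; adjust to match the real ContextKit field list.
EXTENSION_KEYS = frozenset(
    {"glossary", "golden_queries", "business_rules", "owner", "tier", "trust_status"}
)


def strip_extensions(node, extension_keys=EXTENSION_KEYS):
    """Return a copy of a parsed blueprint with extension keys removed."""
    if isinstance(node, dict):
        return {
            key: strip_extensions(value, extension_keys)
            for key, value in node.items()
            if key not in extension_keys
        }
    if isinstance(node, list):
        return [strip_extensions(item, extension_keys) for item in node]
    return node


# A small parsed blueprint, shaped like the example (most content omitted).
blueprint = {
    "osi_version": "1.0",
    "semantic_model": {
        "name": "saas_analytics",
        "owner": "data-platform-team",
        "glossary": [{"term": "Churn Rate"}],
        "datasets": [
            {
                "name": "subscriptions",
                "trust_status": "verified",
                "business_rules": [{"name": "no_future_events"}],
                "fields": [{"name": "mrr_amount", "expression": "mrr_amount"}],
            }
        ],
    },
}

print(strip_extensions(blueprint))
```

The function builds new dictionaries and lists rather than mutating its input, so the enriched blueprint stays available alongside the stripped copy.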