Best Practices

The Data Stack Setup Checklist: 30 Steps to a Production-Ready Analytics Platform

Dr. Jody-Ann Jones
December 10, 2025
18 min read

A comprehensive checklist for building a modern data stack with open-source tools. Covers infrastructure, data modeling, transformation, visualization, and deployment.

Data Engineering
dbt
Business Intelligence
Data Stack
Checklist
DevOps

Building a data stack from scratch is complex. Use this checklist to ensure you don't miss critical steps—whether you're using Supabase + dbt + Metabase (like my E-Commerce Analytics project) or any modern data stack combination.

Phase 1: Infrastructure Setup (8 items)

1.1 Database Selection & Configuration

  • Choose your data warehouse based on scale and budget:

    • Supabase/PostgreSQL: <100GB, $0-50/month
    • BigQuery: 100GB-10TB, pay-per-query
    • Snowflake: 10TB+, enterprise features
    • Redshift: AWS ecosystem, predictable pricing
  • Configure database security:

    • Strong passwords (16+ characters, no defaults)
    • SSL/TLS encryption enabled
    • IP allowlisting for production
    • Separate users for different applications
  • Set up connection pooling if using PostgreSQL:

    • PgBouncer or Supabase's built-in pooler
    • Transaction mode for short-lived connections
    • Session mode for long-running transformations
  • Enable query logging and monitoring:

    • pg_stat_statements extension for PostgreSQL
    • Query duration thresholds for alerting
    • Slow query log analysis scheduled
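
A minimal sketch of the monitoring items above for PostgreSQL/Supabase (the extension ships enabled on Supabase; on self-hosted PostgreSQL it must also be listed in shared_preload_libraries, and the 10-row cutoff is just an example):

    -- Enable query statistics collection
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Ten slowest statements by average execution time (column names per PostgreSQL 13+)
    SELECT
        query,
        calls,
        mean_exec_time  AS avg_ms,
        total_exec_time AS total_ms
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;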

1.2 Containerization & Orchestration

  • Create Docker Compose configuration:

    version: '3.8'
    services:
      database:
        image: postgres:15
        healthcheck:
          test: ["CMD-SHELL", "pg_isready"]
          interval: 10s
          timeout: 5s
          retries: 5
      transformation:
        build: ./dbt
        depends_on:
          database:
            condition: service_healthy
      visualization:
        image: metabase/metabase
        depends_on:
          - database
    
  • Configure persistent volumes (see the compose fragment after this list):

    • Database data directory
    • Metabase application database
    • Log files for debugging
  • Set up environment variable management:

    • .env file for local development
    • .env.example with dummy values in Git
    • Secrets manager for production (Vault, AWS Secrets Manager)
  • Document deployment commands:

    • docker compose up -d — Start all services
    • docker compose logs -f — View logs
    • docker compose down -v — Full cleanup
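
A hedged fragment to merge into the Compose file above, covering the volume and environment-variable items (volume names and the MB_DB_FILE path follow common conventions and can be adjusted):

    services:
      database:
        image: postgres:15
        env_file: .env                            # POSTGRES_PASSWORD etc. stay out of Git
        volumes:
          - pg_data:/var/lib/postgresql/data      # database data directory
      visualization:
        image: metabase/metabase
        environment:
          MB_DB_FILE: /metabase-data/metabase.db  # keep the H2 application database on the volume
        volumes:
          - metabase_data:/metabase-data          # Metabase application database

    volumes:
      pg_data:
      metabase_data: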

Phase 2: Data Modeling (6 items)

2.1 Source System Analysis

  • Inventory all data sources:
    | Source    | Type       | Refresh   | Volume   | Owner      |
    |-----------|------------|-----------|----------|------------|
    | Orders    | PostgreSQL | Real-time | 100K/day | E-commerce |
    | Customers | API        | Daily     | 10K      | CRM        |
    | Products  | CSV        | Weekly    | 5K       | Catalog    |
  • Document source schemas:
    • Column names and data types
    • Primary keys and foreign keys
    • Known data quality issues
    • Business definitions for ambiguous fields

2.2 Dimensional Model Design

  • Identify fact tables (transactional events):

    • fact_orders — Order transactions
    • fact_order_items — Line items
    • fact_payments — Payment events
    • fact_reviews — Customer feedback
  • Identify dimension tables (descriptive attributes):

    • dim_customers — Customer demographics
    • dim_products — Product catalog
    • dim_sellers — Seller profiles
    • dim_dates — Calendar dimension
    • dim_geography — Location hierarchy
  • Define grain for each fact table:

    • fact_orders → One row per order
    • fact_order_items → One row per order line
    • fact_payments → One row per payment installment
  • Document key business metrics:

    | Metric | Definition                | Calculation                      |
    |--------|---------------------------|----------------------------------|
    | GMV    | Gross Merchandise Value   | SUM(order_item.price)            |
    | AOV    | Average Order Value       | GMV / COUNT(DISTINCT orders)     |
    | LTV    | Customer Lifetime Value   | SUM(customer_orders.total_spent) |
    | CAC    | Customer Acquisition Cost | Marketing Spend / New Customers  |
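
As a quick check on the first two definitions, GMV and AOV can be computed directly from the fact tables above (the price and purchased_at columns are assumptions about your schema):

    -- Monthly GMV and AOV per the definitions above
    SELECT
        DATE_TRUNC('month', o.purchased_at)        AS order_month,
        SUM(oi.price)                              AS gmv,
        SUM(oi.price) / COUNT(DISTINCT o.order_id) AS aov
    FROM fact_order_items oi
    JOIN fact_orders o ON o.order_id = oi.order_id
    GROUP BY 1
    ORDER BY 1;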

Phase 3: Transformation Layer (8 items)

3.1 dbt Project Structure

  • Initialize dbt project:

    dbt init my_project
    
  • Configure dbt_project.yml:

    name: 'ecommerce_analytics'
    version: '1.0.0'
    config-version: 2
    profile: 'ecommerce_analytics'
    
    model-paths: ["models"]
    test-paths: ["tests"]
    seed-paths: ["seeds"]
    macro-paths: ["macros"]
    
    models:
      ecommerce_analytics:
        staging:
          +materialized: view
          +schema: staging
        intermediate:
          +materialized: ephemeral
        marts:
          +materialized: table
          +schema: marts
    
  • Create folder structure:

    models/
    ├── staging/
    │   ├── _sources.yml
    │   ├── stg_orders.sql
    │   └── stg_customers.sql
    ├── intermediate/
    │   ├── int_customer_orders.sql
    │   └── int_product_performance.sql
    └── marts/
        ├── core/
        │   ├── dim_customers.sql
        │   └── fact_orders.sql
        └── marketing/
            └── customer_segments.sql
    
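dbt also needs a connection profile, which lives outside the repo (by default in ~/.dbt/profiles.yml). A minimal sketch for the PostgreSQL/Supabase setup used here, with a ci target for the pipeline in Phase 6; the environment variable names and schema names are placeholders:

    # ~/.dbt/profiles.yml (top-level key must match the profile in dbt_project.yml)
    ecommerce_analytics:
      target: dev
      outputs:
        dev:
          type: postgres
          host: "{{ env_var('DBT_HOST') }}"
          user: "{{ env_var('DBT_USER') }}"
          password: "{{ env_var('DBT_PASSWORD') }}"
          port: 5432
          dbname: analytics
          schema: dbt_dev
          threads: 4
        ci:
          type: postgres
          host: "{{ env_var('DBT_HOST') }}"
          user: "{{ env_var('DBT_USER') }}"
          password: "{{ env_var('DBT_PASSWORD') }}"
          port: 5432
          dbname: analytics
          schema: dbt_ci
          threads: 4
        # add a prod target the same way for the scheduled runs in Phase 6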

3.2 Staging Layer Implementation

  • Create staging models for each source:

    -- models/staging/stg_orders.sql
    WITH source AS (
        SELECT * FROM {{ source('raw', 'orders') }}
    ),
    
    renamed AS (
        SELECT
            order_id,
            customer_id,
            order_status AS status,
            order_purchase_timestamp::timestamp AS purchased_at,
            order_delivered_customer_date::timestamp AS delivered_at
        FROM source
    )
    
    SELECT * FROM renamed
    
  • Define sources in _sources.yml:

    sources:
      - name: raw
        database: analytics
        schema: public
        tables:
          - name: orders
            description: "Raw orders from e-commerce platform"
            columns:
              - name: order_id
                description: "Unique order identifier"
                tests:
                  - unique
                  - not_null
    

3.3 Business Logic Layer

  • Create intermediate models for complex joins (a downstream mart example follows this subsection):

    -- models/intermediate/int_customer_orders.sql
    SELECT
        customer_id,
        COUNT(DISTINCT order_id) AS total_orders,
        SUM(order_total) AS lifetime_value,
        MIN(purchased_at) AS first_order_at,
        MAX(purchased_at) AS last_order_at,
        DATE_PART('day', MAX(purchased_at) - MIN(purchased_at)) AS customer_tenure_days  -- PostgreSQL; use DATEDIFF on Snowflake/Redshift
    FROM {{ ref('stg_orders') }}
    GROUP BY customer_id
    
  • Implement data quality tests:

    models:
      - name: int_customer_orders
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: total_orders
            tests:
              - not_null
              - dbt_utils.accepted_range:
                  min_value: 1
          - name: lifetime_value
            tests:
              - not_null
              - dbt_utils.accepted_range:
                  min_value: 0
    
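Building on int_customer_orders, here is a hedged sketch of the marts/marketing/customer_segments.sql model from the folder structure in 3.1, using the segment labels documented in Phase 5; the recency and value thresholds are arbitrary examples, not business rules:

    -- models/marts/marketing/customer_segments.sql
    WITH customer_orders AS (
        SELECT * FROM {{ ref('int_customer_orders') }}
    )

    SELECT
        customer_id,
        total_orders,
        lifetime_value,
        last_order_at,
        CASE
            WHEN last_order_at < CURRENT_DATE - INTERVAL '365 days' THEN 'Churned'
            WHEN last_order_at < CURRENT_DATE - INTERVAL '180 days' THEN 'At-Risk'
            WHEN total_orders >= 5 AND lifetime_value >= 1000 THEN 'VIP'
            ELSE 'Regular'
        END AS customer_segment
    FROM customer_orders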

Phase 4: Visualization Layer (4 items)

4.1 Metabase Configuration

  • Set up Metabase (a configuration sketch follows this list):

    • Create admin account with strong password
    • Connect to data warehouse
    • Configure email settings for alerts
    • Set up SSO if using enterprise
  • Create curated collections:

    Collections/
    ├── Executive Dashboards/
    │   ├── Daily KPIs
    │   └── Monthly Business Review
    ├── Sales & Marketing/
    │   ├── Customer Acquisition
    │   └── Campaign Performance
    └── Operations/
        ├── Fulfillment Metrics
        └── Inventory Levels
    
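For the setup item above, Metabase itself is configured through MB_* environment variables; a hedged compose-fragment sketch that stores Metabase's application database in PostgreSQL rather than the H2 file used in the Phase 1 sketch (the metabase_app database and credentials are placeholders you must create yourself):

    visualization:
      image: metabase/metabase
      ports:
        - "3000:3000"
      environment:
        MB_DB_TYPE: postgres
        MB_DB_HOST: database
        MB_DB_PORT: 5432
        MB_DB_DBNAME: metabase_app        # separate from the analytics data
        MB_DB_USER: metabase
        MB_DB_PASS: ${METABASE_DB_PASSWORD}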

4.2 Dashboard Design

  • Build core dashboards:

Executive Dashboard:

  • Total Revenue (with period comparison)
  • Order Count
  • Average Order Value
  • Customer Count (new vs returning)
  • Top Products table
  • Geographic map

Customer Analytics Dashboard:

  • Customer segments pie chart
  • LTV distribution histogram
  • Cohort retention heatmap
  • RFM segmentation matrix
  • Acquisition trend line

Product Analytics Dashboard:

  • Revenue by category treemap
  • Top 20 products leaderboard
  • Price vs volume scatter
  • Category performance bars
  • Seasonal trends

  • Configure dashboard filters:

    • Date range (preset options + custom)
    • Product category
    • Customer segment
    • Geography (country/state/city)

Phase 5: Data Quality & Governance (4 items)

5.1 Automated Testing

  • Implement dbt tests:

    • Uniqueness: Primary key columns
    • Not null: Required fields
    • Accepted values: Enum fields
    • Relationships: Foreign key integrity
    • Custom SQL tests for business rules (see the sketch after this list)
  • Add source freshness checks:

    sources:
      - name: raw
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        tables:
          - name: orders
            loaded_at_field: _loaded_at
    
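A hedged sketch of a singular (custom SQL) test: any SELECT saved under the tests/ path fails the build when it returns rows; the business rule here is only an example:

    -- tests/assert_no_delivery_before_purchase.sql
    -- Fails if an order is recorded as delivered before it was purchased
    SELECT
        order_id,
        purchased_at,
        delivered_at
    FROM {{ ref('stg_orders') }}
    WHERE delivered_at IS NOT NULL
      AND delivered_at < purchased_at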

5.2 Documentation

  • Document all models:

    models:
      - name: dim_customers
        description: "Customer dimension with demographics and segments"
        columns:
          - name: customer_id
            description: "Unique customer identifier (UUID)"
          - name: customer_segment
            description: "RFM-based segment: VIP, Regular, At-Risk, Churned"
    
  • Generate and publish docs:

    dbt docs generate
    dbt docs serve --port 8080
    

Phase 6: Deployment & Operations (6 items)

6.1 CI/CD Pipeline

  • Create GitHub Actions workflow:

    name: dbt CI
    
    on:
      pull_request:
        branches: [main]
    
    jobs:
      dbt-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.10'
          - run: pip install dbt-postgres
          - run: dbt deps
          - run: dbt build --target ci
    
  • Set up branch protection:

    • Require PR reviews before merge
    • Require CI checks to pass
    • No direct pushes to main

6.2 Scheduling & Orchestration

  • Configure refresh schedule:
    | Job                     | Schedule         | Dependencies     |
    |-------------------------|------------------|------------------|
    | Source freshness check  | Every hour       | None             |
    | Staging models          | Every 4 hours    | Source freshness |
    | Mart models             | Daily 6am UTC    | Staging complete |
    | Dashboard cache refresh | Daily 6:30am UTC | Mart complete    |
  • Set up Airflow DAG (if needed):
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datetime import datetime
    
    with DAG('dbt_daily', start_date=datetime(2024, 1, 1), schedule='0 6 * * *'):
        dbt_run = BashOperator(
            task_id='dbt_run',
            bash_command='cd /dbt && dbt run --target prod'
        )
        dbt_test = BashOperator(
            task_id='dbt_test',
            bash_command='cd /dbt && dbt test --target prod'
        )
        dbt_run >> dbt_test
    
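If a full Airflow deployment is more than you need, the same daily run can be approximated with the GitHub Actions orchestration recommended for startups below. A hedged sketch, with secret names as placeholders and assuming a profiles.yml with a prod target is available to the runner (for example committed to the repo with env_var() lookups):

    name: dbt daily run

    on:
      schedule:
        - cron: '0 6 * * *'   # daily at 06:00 UTC, matching the schedule table above

    jobs:
      dbt-run:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.10'
          - run: pip install dbt-postgres
          - run: dbt deps
          - run: dbt build --target prod
            env:
              DBT_HOST: ${{ secrets.DBT_HOST }}
              DBT_USER: ${{ secrets.DBT_USER }}
              DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}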

6.3 Monitoring & Alerting

  • Set up alerting:

    • dbt test failures → Slack/email (see the sketch after this list)
    • Source freshness violations → PagerDuty
    • Dashboard query timeouts → Ops channel
    • Database storage thresholds → Infrastructure team
  • Create runbooks:

    • How to rerun failed dbt models
    • How to backfill historical data
    • How to add new data sources
    • How to debug slow dashboards
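
A hedged sketch of the first alert route: a failure-only step appended to either GitHub Actions workflow above, where SLACK_WEBHOOK_URL is a placeholder secret holding a Slack incoming-webhook URL:

    - name: Notify Slack on dbt failure
      if: failure()
      run: |
        curl -s -X POST "$SLACK_WEBHOOK_URL" \
          -H 'Content-Type: application/json' \
          -d '{"text": "dbt build failed in ${{ github.repository }}"}'
      env:
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}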

Quick Reference: Technology Recommendations

For Startups (<$5M revenue, <10 employees)

| Component      | Recommendation       | Cost     |
|----------------|----------------------|----------|
| Warehouse      | Supabase Free Tier   | $0       |
| Transformation | dbt Core             | $0       |
| Visualization  | Metabase Open Source | $0       |
| Orchestration  | GitHub Actions       | $0       |
| Total          |                      | $0/month |

For SMEs ($5M-50M revenue, 10-100 employees)

| Component      | Recommendation                 | Cost            |
|----------------|--------------------------------|-----------------|
| Warehouse      | Supabase Pro or BigQuery       | $25-200/month   |
| Transformation | dbt Core + dbt Cloud Developer | $0-100/month    |
| Visualization  | Metabase Pro                   | $85/user/month  |
| Orchestration  | Airflow on Cloud Run           | $50-200/month   |
| Total          |                                | $200-1000/month |

For Growth Companies ($50M+ revenue)

| Component      | Recommendation               | Cost            |
|----------------|------------------------------|-----------------|
| Warehouse      | Snowflake or Databricks      | $1000+/month    |
| Transformation | dbt Cloud Team               | $100/seat/month |
| Visualization  | Looker or Tableau            | $70+/user/month |
| Orchestration  | Airflow (managed) or Dagster | $500+/month     |
| Total          |                              | $3000+/month    |

Troubleshooting Common Issues

Database Connection Failures

# Check if database is running
docker compose ps

# View database logs
docker compose logs database

# Test connection manually
psql -h localhost -U postgres -d analytics

dbt Model Errors

# Debug specific model
dbt run --select model_name --target dev

# View compiled SQL
cat target/compiled/project/models/model_name.sql

# Inspect the model graph (dbt fails at parse time if models form a cycle)
dbt ls --resource-type model --output json | jq

Metabase Query Timeouts

  1. Check query execution plan in database
  2. Add appropriate indexes (see the sketch below)
  3. Consider materializing slow models as tables
  4. Increase Metabase query timeout setting
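
A hedged example of steps 1 and 2 on PostgreSQL; the table and column come from the models above, but the right index depends on what the query plan actually shows:

-- Inspect how a slow dashboard query is executed
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM fact_orders
WHERE purchased_at >= CURRENT_DATE - INTERVAL '30 days';

-- If the plan shows a sequential scan on the date filter, an index usually helps
CREATE INDEX IF NOT EXISTS idx_fact_orders_purchased_at
    ON fact_orders (purchased_at);

Because dbt rebuilds table materializations on every run, in practice the CREATE INDEX usually belongs in the model's post-hook rather than being run by hand.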

Check out my E-Commerce Analytics Platform case study to see this checklist in action.

Need help implementing your data stack? Let's talk about building a solution tailored to your business.
