Building a data stack from scratch is complex. Use this checklist to ensure you don't miss critical steps—whether you're using Supabase + dbt + Metabase (like my E-Commerce Analytics project) or any modern data stack combination.
Phase 1: Infrastructure Setup (8 items)
1.1 Database Selection & Configuration
- Choose your data warehouse based on scale and budget:
- Supabase/PostgreSQL: <100GB, $0-50/month
- BigQuery: 100GB-10TB, pay-per-query
- Snowflake: 10TB+, enterprise features
- Redshift: AWS ecosystem, predictable pricing
- Configure database security:
- Strong passwords (16+ characters, no defaults)
- SSL/TLS encryption enabled
- IP allowlisting for production
- Separate users for different applications
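For the last item, here is a minimal PostgreSQL sketch of what separate application users can look like. The role names, database name, schemas, and passwords are illustrative assumptions, not part of the project:

```sql
-- Read-only role for the BI tool (e.g., Metabase); names are assumptions.
CREATE ROLE metabase_reader LOGIN PASSWORD 'use-a-long-random-password';
GRANT CONNECT ON DATABASE analytics TO metabase_reader;
GRANT USAGE ON SCHEMA marts TO metabase_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO metabase_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA marts GRANT SELECT ON TABLES TO metabase_reader;

-- Read/write role for the transformation tool (e.g., dbt).
CREATE ROLE dbt_writer LOGIN PASSWORD 'use-a-different-long-random-password';
GRANT CONNECT ON DATABASE analytics TO dbt_writer;
GRANT USAGE, CREATE ON SCHEMA staging, marts TO dbt_writer;
```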
- Set up connection pooling if using PostgreSQL:
- PgBouncer or Supabase's built-in pooler
- Transaction mode for short-lived connections
- Session mode for long-running transformations
- Enable query logging and monitoring:
- `pg_stat_statements` extension for PostgreSQL
- Query duration thresholds for alerting
- Slow query log analysis scheduled
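As a starting point for the monitoring items above, a hedged PostgreSQL sketch; it assumes `pg_stat_statements` is loaded via `shared_preload_libraries` (Supabase typically enables it by default) and uses PostgreSQL 13+ column names:

```sql
-- Enable the extension once per database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Ten slowest statements by average execution time.
SELECT
    calls,
    ROUND(mean_exec_time::numeric, 2)  AS avg_ms,
    ROUND(total_exec_time::numeric, 2) AS total_ms,
    LEFT(query, 120)                   AS query_preview
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```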
1.2 Containerization & Orchestration
- Create Docker Compose configuration:

```yaml
version: '3.8'
services:
  database:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 10s
      timeout: 5s
      retries: 5
  transformation:
    build: ./dbt
    depends_on:
      database:
        condition: service_healthy
  visualization:
    image: metabase/metabase
    depends_on:
      - database
```

- Configure persistent volumes:
- Database data directory
- Metabase application database
- Log files for debugging
- Set up environment variable management:
- `.env` file for local development
- `.env.example` with dummy values in Git
- Secrets manager for production (Vault, AWS Secrets Manager)
- Document deployment commands:
- `docker compose up -d` — Start all services
- `docker compose logs -f` — View logs
- `docker compose down -v` — Full cleanup
Phase 2: Data Modeling (6 items)
2.1 Source System Analysis
- Inventory all data sources:
| Source | Type | Refresh | Volume | Owner |
|---|---|---|---|---|
| Orders | PostgreSQL | Real-time | 100K/day | E-commerce |
| Customers | API | Daily | 10K | CRM |
| Products | CSV | Weekly | 5K | Catalog |
- Document source schemas:
- Column names and data types
- Primary keys and foreign keys
- Known data quality issues
- Business definitions for ambiguous fields
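To avoid documenting source schemas by hand, a small sketch that pulls column names, types, and nullability from PostgreSQL's catalog; the `public` schema is an assumption, and most other warehouses expose a similar `information_schema`:

```sql
-- Dump column names, types, and nullability for every table in a schema.
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'          -- assumed raw-data schema
ORDER BY table_name, ordinal_position;
```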
2.2 Dimensional Model Design
- Identify fact tables (transactional events):
- `fact_orders` — Order transactions
- `fact_order_items` — Line items
- `fact_payments` — Payment events
- `fact_reviews` — Customer feedback
- Identify dimension tables (descriptive attributes):
- `dim_customers` — Customer demographics
- `dim_products` — Product catalog
- `dim_sellers` — Seller profiles
- `dim_dates` — Calendar dimension
- `dim_geography` — Location hierarchy
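Of these, `dim_dates` is usually generated rather than loaded. A sketch using PostgreSQL's `generate_series` (dbt users often reach for `dbt_utils.date_spine` instead); the date range is an assumption:

```sql
-- Calendar dimension covering 2016 through 2026 (adjust the range to your data).
SELECT
    d::date                          AS date_day,
    EXTRACT(YEAR FROM d)::int        AS year,
    EXTRACT(MONTH FROM d)::int       AS month,
    EXTRACT(ISODOW FROM d)::int      AS day_of_week,
    EXTRACT(ISODOW FROM d) IN (6, 7) AS is_weekend
FROM generate_series('2016-01-01'::date, '2026-12-31'::date, INTERVAL '1 day') AS d;
```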
- Define grain for each fact table:
- `fact_orders` → One row per order
- `fact_order_items` → One row per order line
- `fact_payments` → One row per payment installment
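A quick way to verify a declared grain is a duplicate check on the grain key; this sketch assumes the `fact_orders` table and its one-row-per-order grain from the list above:

```sql
-- Returns zero rows when fact_orders really is one row per order.
SELECT
    order_id,
    COUNT(*) AS row_count
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```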
- Document key business metrics:
| Metric | Definition | Calculation |
|---|---|---|
| GMV | Gross Merchandise Value | SUM(order_item.price) |
| AOV | Average Order Value | GMV / COUNT(DISTINCT orders) |
| LTV | Customer Lifetime Value | SUM(customer_orders.total_spent) |
| CAC | Customer Acquisition Cost | Marketing Spend / New Customers |
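To make the GMV and AOV definitions above concrete, a hedged SQL sketch against the mart tables; column names such as `price` and `purchased_at` are assumptions about the final schema:

```sql
-- GMV and AOV for the last 30 days, assuming fact_order_items(order_id, price)
-- and fact_orders(order_id, purchased_at).
SELECT
    SUM(oi.price)                              AS gmv,
    COUNT(DISTINCT o.order_id)                 AS orders,
    SUM(oi.price) / COUNT(DISTINCT o.order_id) AS aov
FROM fact_order_items AS oi
JOIN fact_orders AS o ON o.order_id = oi.order_id
WHERE o.purchased_at >= CURRENT_DATE - INTERVAL '30 days';
```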
Phase 3: Transformation Layer (8 items)
3.1 dbt Project Structure
- Initialize dbt project:

```bash
dbt init my_project
```

- Configure `dbt_project.yml`:

```yaml
name: 'ecommerce_analytics'
version: '1.0.0'
config-version: 2

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

models:
  ecommerce_analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: marts
```

- Create folder structure:

```
models/
├── staging/
│   ├── _sources.yml
│   ├── stg_orders.sql
│   └── stg_customers.sql
├── intermediate/
│   ├── int_customer_orders.sql
│   └── int_product_performance.sql
└── marts/
    ├── core/
    │   ├── dim_customers.sql
    │   └── fact_orders.sql
    └── marketing/
        └── customer_segments.sql
```
3.2 Staging Layer Implementation
- Create staging models for each source:

```sql
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        order_id,
        customer_id,
        order_status AS status,
        order_purchase_timestamp::timestamp AS purchased_at,
        order_delivered_customer_date::timestamp AS delivered_at
    FROM source
)

SELECT * FROM renamed
```

- Define sources in `_sources.yml`:

```yaml
sources:
  - name: raw
    database: analytics
    schema: public
    tables:
      - name: orders
        description: "Raw orders from e-commerce platform"
        columns:
          - name: order_id
            description: "Unique order identifier"
            tests:
              - unique
              - not_null
```
3.3 Business Logic Layer
- Create intermediate models for complex joins:

```sql
-- models/intermediate/int_customer_orders.sql
SELECT
    customer_id,
    COUNT(DISTINCT order_id) AS total_orders,
    SUM(order_total) AS lifetime_value,
    MIN(purchased_at) AS first_order_at,
    MAX(purchased_at) AS last_order_at,
    DATEDIFF('day', MIN(purchased_at), MAX(purchased_at)) AS customer_tenure_days
FROM {{ ref('stg_orders') }}
GROUP BY customer_id
```

- Implement data quality tests:

```yaml
models:
  - name: int_customer_orders
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: total_orders
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: lifetime_value
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
```
Phase 4: Visualization Layer (4 items)
4.1 Metabase Configuration
- Set up Metabase:
- Create admin account with strong password
- Connect to data warehouse
- Configure email settings for alerts
- Set up SSO if using enterprise
- Create curated collections:

```
Collections/
├── Executive Dashboards/
│   ├── Daily KPIs
│   └── Monthly Business Review
├── Sales & Marketing/
│   ├── Customer Acquisition
│   └── Campaign Performance
└── Operations/
    ├── Fulfillment Metrics
    └── Inventory Levels
```
4.2 Dashboard Design
- Build core dashboards:
Executive Dashboard:
- Total Revenue (with period comparison)
- Order Count
- Average Order Value
- Customer Count (new vs returning)
- Top Products table
- Geographic map
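For the "Total Revenue (with period comparison)" card above, one possible underlying query, written against the assumed `fact_orders`/`fact_order_items` schema. Metabase can also compute period comparisons natively, so treat this as a sketch:

```sql
-- Monthly revenue with previous-month comparison.
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', o.purchased_at) AS month,
        SUM(oi.price)                       AS revenue
    FROM fact_orders AS o
    JOIN fact_order_items AS oi ON oi.order_id = o.order_id
    GROUP BY 1
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
          / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 1) AS pct_change
FROM monthly
ORDER BY month;
```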
Customer Analytics Dashboard:
- Customer segments pie chart
- LTV distribution histogram
- Cohort retention heatmap
- RFM segmentation matrix
- Acquisition trend line
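The RFM segmentation matrix above usually starts from a query like this sketch; it assumes `fact_orders` carries `customer_id`, `purchased_at`, and `order_total`, and quartile scoring is just one common choice:

```sql
-- Quartile-based RFM scores per customer (1 = best quartile on each dimension).
WITH customer_rfm AS (
    SELECT
        customer_id,
        MAX(purchased_at)        AS last_order_at,
        COUNT(DISTINCT order_id) AS frequency,
        SUM(order_total)         AS monetary
    FROM fact_orders
    GROUP BY customer_id
)
SELECT
    customer_id,
    NTILE(4) OVER (ORDER BY last_order_at DESC) AS recency_score,
    NTILE(4) OVER (ORDER BY frequency DESC)     AS frequency_score,
    NTILE(4) OVER (ORDER BY monetary DESC)      AS monetary_score
FROM customer_rfm;
```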
Product Analytics Dashboard:
- Revenue by category treemap
- Top 20 products leaderboard
- Price vs volume scatter
- Category performance bars
- Seasonal trends
- Configure dashboard filters:
- Date range (preset options + custom)
- Product category
- Customer segment
- Geography (country/state/city)
Phase 5: Data Quality & Governance (4 items)
5.1 Automated Testing
- Implement dbt tests:
- Uniqueness: Primary key columns
- Not null: Required fields
- Accepted values: Enum fields
- Relationships: Foreign key integrity
- Custom SQL tests for business rules
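For the last item, custom SQL tests in dbt are queries that return the rows violating a rule; any returned row counts as a failure. A hedged example of a singular test, assuming the `stg_orders` columns from the staging layer above:

```sql
-- tests/assert_delivery_after_purchase.sql
-- Flags orders whose delivery timestamp precedes the purchase timestamp.
SELECT
    order_id,
    purchased_at,
    delivered_at
FROM {{ ref('stg_orders') }}
WHERE delivered_at IS NOT NULL
  AND delivered_at < purchased_at
```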
- Add source freshness checks:

```yaml
sources:
  - name: raw
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        loaded_at_field: _loaded_at
```
5.2 Documentation
- Document all models:

```yaml
models:
  - name: dim_customers
    description: "Customer dimension with demographics and segments"
    columns:
      - name: customer_id
        description: "Unique customer identifier (UUID)"
      - name: customer_segment
        description: "RFM-based segment: VIP, Regular, At-Risk, Churned"
```

- Generate and publish docs:

```bash
dbt docs generate
dbt docs serve --port 8080
```
Phase 6: Deployment & Operations (6 items)
6.1 CI/CD Pipeline
- Create GitHub Actions workflow:

```yaml
name: dbt CI
on:
  pull_request:
    branches: [main]
jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install dbt-postgres
      - run: dbt deps
      - run: dbt build --target ci
```

- Set up branch protection:
- Require PR reviews before merge
- Require CI checks to pass
- No direct pushes to main
6.2 Scheduling & Orchestration
- Configure refresh schedule:
| Job | Schedule | Dependencies |
|---|---|---|
| Source freshness check | Every hour | None |
| Staging models | Every 4 hours | Source freshness |
| Mart models | Daily 6am UTC | Staging complete |
| Dashboard cache refresh | Daily 6:30am UTC | Mart complete |
- Set up Airflow DAG (if needed):
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG('dbt_daily', start_date=datetime(2024, 1, 1), schedule='0 6 * * *'):
    dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='cd /dbt && dbt run --target prod'
    )
    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='cd /dbt && dbt test --target prod'
    )
    dbt_run >> dbt_test
```
6.3 Monitoring & Alerting
- Set up alerting:
- dbt test failures → Slack/email
- Source freshness violations → PagerDuty
- Dashboard query timeouts → Ops channel
- Database storage thresholds → Infrastructure team
- Create runbooks:
- How to rerun failed dbt models
- How to backfill historical data
- How to add new data sources
- How to debug slow dashboards
Quick Reference: Technology Recommendations
For Startups (<$5M revenue, <10 employees)
| Component | Recommendation | Cost |
|---|---|---|
| Warehouse | Supabase Free Tier | $0 |
| Transformation | dbt Core | $0 |
| Visualization | Metabase Open Source | $0 |
| Orchestration | GitHub Actions | $0 |
| Total | | $0/month |
For SMEs ($5M-50M revenue, 10-100 employees)
| Component | Recommendation | Cost |
|---|---|---|
| Warehouse | Supabase Pro or BigQuery | $25-200/month |
| Transformation | dbt Core + dbt Cloud Developer | $0-100/month |
| Visualization | Metabase Pro | $85/user/month |
| Orchestration | Airflow on Cloud Run | $50-200/month |
| Total | | $200-1000/month |
For Growth Companies ($50M+ revenue)
| Component | Recommendation | Cost |
|---|---|---|
| Warehouse | Snowflake or Databricks | $1000+/month |
| Transformation | dbt Cloud Team | $100/seat/month |
| Visualization | Looker or Tableau | $70+/user/month |
| Orchestration | Airflow (managed) or Dagster | $500+/month |
| Total | | $3000+/month |
Troubleshooting Common Issues
Database Connection Failures
```bash
# Check if database is running
docker compose ps

# View database logs
docker compose logs database

# Test connection manually
psql -h localhost -U postgres -d analytics
```
dbt Model Errors
```bash
# Debug specific model
dbt run --select model_name --target dev

# View compiled SQL
cat target/compiled/project/models/model_name.sql

# Check for circular dependencies
dbt ls --resource-type model --output json | jq
```
Metabase Query Timeouts
- Check query execution plan in database
- Add appropriate indexes
- Consider materializing slow models as tables
- Increase Metabase query timeout setting
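For the indexing and materialization items above, a hedged PostgreSQL sketch; table, column, and index names are illustrative:

```sql
-- Inspect how the slow dashboard query is executed.
EXPLAIN ANALYZE
SELECT * FROM fact_orders
WHERE purchased_at >= CURRENT_DATE - INTERVAL '90 days';

-- Index the columns the dashboard filters on most (names are assumptions).
CREATE INDEX IF NOT EXISTS idx_fact_orders_purchased_at
    ON fact_orders (purchased_at);
```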
Check out my E-Commerce Analytics Platform case study to see this checklist in action.
Need help implementing your data stack? Let's talk about building a solution tailored to your business.