User Guide

Introduction

This guide provides comprehensive instructions for using conversionflow-aggregate to analyse customer journeys and optimise marketing budget allocation.

Understanding the Two-Stage Pipeline

conversionflow-aggregate operates as a two-stage analytical system:

Stage 1: Bayesian Parameter Estimation

Duration: Approximately 7 minutes
Purpose: Fits probabilistic models to your customer journey data

# Stage 1 only - parameter estimation
python scripts/run_full_italy_pipeline.py --stage1-only

What happens:

  • Loads raw customer event data

  • Fits Bayesian network models using MCMC sampling

  • Generates parameter exports with confidence intervals

  • Produces model diagnostics and validation metrics

Indicators of success:

  • PyMC progress bars showing MCMC sampling

  • Convergence diagnostics (R-hat < 1.1)

  • Model validation reports

  • Parameter export JSON files

Stage 2: Genetic Algorithm Optimisation

Duration: 3-4 seconds
Purpose: Optimises budget allocation using pre-computed parameters

# Stage 2 only - optimisation
python scripts/run_full_italy_pipeline.py --stage2-only

What happens:

  • Loads MCMC parameter exports

  • Runs genetic algorithm optimisation

  • Applies conservative attribution methodology

  • Generates executive reports and visualisations
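The genetic-algorithm step above can be illustrated with a minimal sketch. This is not the production implementation: the touchpoint names, the diminishing-returns response curve, and the GA settings below are hypothetical stand-ins; in the real pipeline the fitness function is driven by the MCMC parameter exports.

```python
import random

random.seed(42)

BUDGET = 2_500_000
TOUCHPOINTS = ["car_config", "finance_calc", "test_drive"]  # hypothetical
# Hypothetical diminishing-returns response: conversions ~ k * sqrt(spend)
RESPONSE_K = {"car_config": 0.9, "finance_calc": 0.7, "test_drive": 0.8}

def fitness(alloc):
    return sum(RESPONSE_K[t] * alloc[i] ** 0.5 for i, t in enumerate(TOUCHPOINTS))

def normalise(alloc):
    """Rescale an allocation so it exactly satisfies the budget constraint."""
    total = sum(alloc)
    return [a / total * BUDGET for a in alloc]

def run_ga(pop_size=50, generations=100):
    pop = [normalise([random.random() for _ in TOUCHPOINTS]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover: average
            i = random.randrange(len(child))
            child[i] *= random.uniform(0.8, 1.2)          # mutation: jitter one gene
            children.append(normalise(child))
        pop = survivors + children
    return max(pop, key=fitness)

best = run_ga()  # best allocation found; always sums to BUDGET
```

Note how every candidate is renormalised after crossover and mutation, so the budget constraint holds throughout the search rather than being checked only at the end.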

Full Pipeline Execution

Complete End-to-End Analysis

# Recommended: Both stages together
python scripts/run_full_italy_pipeline.py

This is the preferred method for new analyses as it ensures parameter estimates are fresh and aligned with your current data.

Quick Testing Mode

# Fast validation using mock data
./run_pipeline.sh --mode=test

Use this for system validation and testing changes without processing real data.

Production Optimisation Mode

# Fast optimisation using existing parameters
./run_pipeline.sh --mode=italy

Use this when you have recent parameter estimates and only need updated budget allocations.

Working with Custom Data

Data Requirements

Your customer journey data should include:

  • Event timestamps in ISO format

  • Customer identifiers for journey tracking

  • Touchpoint identifiers for channel attribution

  • Conversion outcomes (purchases, leads, etc.)

Data Format Example

customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-16T14:22:00,email_click,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
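A data file in this format can be sanity-checked before running the pipeline. The sketch below uses only the Python standard library and the column names from the example above; it is a pre-flight check, not part of the pipeline itself.

```python
import csv
import io
from datetime import datetime

REQUIRED = {"customer_id", "timestamp", "touchpoint", "conversion", "value"}

def validate_journeys(csv_text):
    """Check required columns and ISO timestamps; return parsed rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        datetime.fromisoformat(row["timestamp"])  # raises on a non-ISO timestamp
        row["conversion"] = int(row["conversion"])
        rows.append(row)
    return rows

sample = """customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
"""
rows = validate_journeys(sample)
```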

Configuration for Custom Data

Create a custom configuration file:

# configs/custom_analysis.yaml
data:
  source: "data/custom/customer_journeys.csv"
  date_column: "timestamp"
  customer_column: "customer_id"
  touchpoint_column: "touchpoint"
  conversion_column: "conversion"

model:
  mcmc_samples: 2000
  mcmc_tune: 1000
  chains: 4

optimization:
  population_size: 100
  generations: 200
  budget_total: 1000000

Running Custom Analysis

# Using custom configuration
python scripts/run_full_italy_pipeline.py --config configs/custom_analysis.yaml

Budget Allocation Scenarios

Standard Allocation

# Default budget (£2.5M)
python scripts/run_full_italy_pipeline.py

Custom Budget Scenarios

# Growth scenario - 10% budget increase
python scripts/run_full_italy_pipeline.py --budget 2750000

# Reduced budget scenario
python scripts/run_full_italy_pipeline.py --budget 2000000

# High investment scenario
python scripts/run_full_italy_pipeline.py --budget 5000000

Understanding Results

Executive Summary Output

The system generates business-ready reports including:

Budget Allocation Table:

Touchpoint               Allocation    Percentage
Car Configuration        £420,000      16.8%
Finance Calculator       £380,000      15.2%
Test Drive Requests      £350,000      14.0%
...

Performance Metrics:

  • Raw Improvement: Statistical optimisation potential

  • Business Claim: Conservative attribution-adjusted improvement

  • Confidence Intervals: Statistical uncertainty bounds

Technical Diagnostics

For technical validation, review:

MCMC Diagnostics:

  • R-hat values: Should be < 1.1 for convergence

  • Effective Sample Size (ESS): Should be > 400 per chain

  • ELPD-LOO: Model comparison metric
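The R-hat and ESS thresholds above can be applied programmatically when reviewing exported diagnostics. A minimal sketch, assuming diagnostics are available as a mapping of parameter name to (R-hat, total ESS); the parameter names and values below are hypothetical.

```python
RHAT_MAX = 1.1
ESS_MIN = 400  # per chain

def check_convergence(diagnostics, n_chains):
    """diagnostics maps parameter name -> (r_hat, total_ess). Returns failures."""
    failures = []
    for name, (r_hat, ess) in diagnostics.items():
        if r_hat >= RHAT_MAX:
            failures.append(f"{name}: R-hat {r_hat} >= {RHAT_MAX}")
        if ess / n_chains <= ESS_MIN:
            failures.append(f"{name}: ESS/chain {ess / n_chains:.0f} <= {ESS_MIN}")
    return failures

# Hypothetical diagnostic values for illustration
diag = {"beta_email": (1.01, 3200), "beta_dealer": (1.25, 900)}
issues = check_convergence(diag, n_chains=4)
# beta_email passes both checks; beta_dealer fails both
```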

Optimisation Metrics:

  • Population diversity: Genetic algorithm health

  • Convergence rate: Solution stability

  • Constraint satisfaction: Business rule compliance

Attribution Methodology

conversionflow-aggregate implements a Data-Grounded Attribution methodology to ensure business credibility and analytical integrity.

The Principle

The core principle is that all financial projections must be directly tied to the scope of the data being analysed. In the luxury automotive market, digital data typically accounts for only a small fraction (~5%) of total sales.


How It Works

  1. Scoped Analysis: The entire pipeline, from model fitting to optimisation, operates exclusively on the tracked digital journey data.

  2. Scoped Projections: Business impact calculations (e.g., “Expected additional revenue”) are based on the ~5% of sales that are attributable to these digital journeys, not the total sales volume.
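The scoping arithmetic can be made concrete with a short worked example. All of the figures below are hypothetical except the ~5% digital share, which comes from the methodology above.

```python
total_annual_sales = 40_000   # hypothetical total market volume (units)
digital_share = 0.05          # ~5% of sales tracked digitally (per methodology)
avg_order_value = 35_000      # hypothetical average order value, GBP
uplift = 0.08                 # hypothetical raw optimisation improvement

# Scope the projection to the digitally attributable slice only
attributable_sales = total_annual_sales * digital_share      # ~2,000 units
scoped_revenue_gain = attributable_sales * uplift * avg_order_value

# Correct claim:   gain on the ~5% attributable slice (scoped_revenue_gain)
# Incorrect claim: total_annual_sales * uplift * avg_order_value,
#                  which would assert influence over offline sales
```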

Why This Matters

This approach provides realistic, defensible projections of the value generated by optimising digital marketing spend, and it maintains stakeholder trust by avoiding unsupported claims about influencing total offline sales.

Advanced Usage

Model Selection

# Standard Poisson model (recommended)
python scripts/run_full_italy_pipeline.py

# Hurdle model (for zero-inflated data - slower)
python scripts/run_full_italy_pipeline.py --use-hurdle

Recommendation: Use the standard model unless your data shows severe zero-inflation. The hurdle model takes significantly longer to converge (1+ hours vs. ~7 minutes).

Parallel Processing

# Increase MCMC chains for faster sampling (requires more CPU cores)
python scripts/run_full_italy_pipeline.py --chains 8

Output Customisation

# Specify output directory
python scripts/run_full_italy_pipeline.py --output results/custom_analysis/

# Control output formats
python scripts/run_full_italy_pipeline.py --formats csv,json,html

Performance Optimisation

For Large Datasets (1M+ events)

# In configuration file
model:
  sample_fraction: 0.5  # Use 50% of data for faster processing
  mcmc_samples: 1000    # Reduce samples if needed

For Resource-Constrained Systems

optimization:
  population_size: 50   # Smaller population for faster GA
  generations: 100      # Fewer generations

Quality Assurance

Validation Checks

The system automatically validates:

  • Data quality and completeness

  • Model convergence and diagnostics

  • Budget allocation constraints

  • Attribution methodology compliance

Manual Verification

# Run comprehensive validation suite
python scripts/test_italy_optimization.py

Troubleshooting Checklist

  1. PyMC progress bars visible: Confirms Bayesian fitting is running

  2. R-hat < 1.1: Confirms MCMC convergence

  3. Budget sums correctly: Confirms optimisation constraints

  4. Business impact is correctly scoped: Confirms projections are based on attributable sales
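Item 3 in the checklist can be verified with a few lines of code against the allocation report. The touchpoint figures below mirror the example table earlier in this guide and are illustrative only; the "other_channels" entry is a hypothetical placeholder for the remaining touchpoints.

```python
# Illustrative allocation mirroring the example report table
allocation = {
    "car_configuration": 420_000,
    "finance_calculator": 380_000,
    "test_drive_requests": 350_000,
    "other_channels": 1_350_000,   # hypothetical remainder
}
budget_total = 2_500_000

total = sum(allocation.values())
assert total == budget_total, f"allocation total {total} != budget {budget_total}"

# Percentages should match the report and sum to 100%
shares = {k: v / budget_total for k, v in allocation.items()}
assert abs(sum(shares.values()) - 1.0) < 1e-9
```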

Best Practices

Data Preparation

  • Ensure consistent timestamp formatting

  • Remove duplicate or invalid records

  • Validate customer journey completeness

  • Check for data leakage or look-ahead bias
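The first three preparation steps can be sketched in a short stdlib-only helper. The field names follow the CSV example earlier in this guide; the sample events are hypothetical.

```python
from datetime import datetime

def prepare(rows):
    """Drop exact duplicates, reject malformed timestamps, sort each journey."""
    seen, clean = set(), []
    for row in rows:
        key = (row["customer_id"], row["timestamp"], row["touchpoint"])
        if key in seen:
            continue  # duplicate record
        try:
            ts = datetime.fromisoformat(row["timestamp"])
        except ValueError:
            continue  # invalid timestamp
        seen.add(key)
        clean.append({**row, "ts": ts})
    # Chronological order within each customer journey
    clean.sort(key=lambda r: (r["customer_id"], r["ts"]))
    return clean

events = [
    {"customer_id": "CUST_001", "timestamp": "2024-01-16T14:22:00", "touchpoint": "email_click"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00", "touchpoint": "website_visit"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00", "touchpoint": "website_visit"},  # duplicate
    {"customer_id": "CUST_001", "timestamp": "not-a-date", "touchpoint": "sms"},  # invalid
]
journey = prepare(events)  # duplicate and invalid rows dropped, rest sorted
```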

Model Validation

  • Review convergence diagnostics carefully

  • Compare multiple model runs for consistency

  • Validate results against business intuition

  • Test with different budget scenarios

Business Communication

  • Clearly state that projections are based on digitally attributable sales.

  • Document the methodology’s scope and limitations transparently.

  • Provide confidence intervals for all estimates.

  • Explain all assumptions clearly.

For additional support, consult the Configuration Guide and Troubleshooting Documentation.