User Guide

Introduction

This guide provides comprehensive instructions for using conversionflow-aggregate to analyse customer journeys and optimise marketing budget allocation.

Understanding the Two-Stage Pipeline

conversionflow-aggregate operates as a two-stage analytical system:

Stage 1: Bayesian Parameter Estimation

Duration: Approximately 7 minutes
Purpose: Fits probabilistic models to your customer journey data

# Stage 1 only - parameter estimation
python scripts/run_full_italy_pipeline.py --stage1-only

What happens:

  • Loads raw customer event data

  • Fits Bayesian network models using MCMC sampling

  • Generates parameter exports with confidence intervals

  • Produces model diagnostics and validation metrics

Indicators of success:

  • PyMC progress bars showing MCMC sampling

  • Convergence diagnostics (R-hat < 1.1)

  • Model validation reports

  • Parameter export JSON files

Stage 2: Genetic Algorithm Optimisation

Duration: 3-4 seconds
Purpose: Optimises budget allocation using pre-computed parameters

# Stage 2 only - optimisation
python scripts/run_full_italy_pipeline.py --stage2-only

What happens:

  • Loads MCMC parameter exports

  • Runs genetic algorithm optimisation

  • Applies conservative attribution methodology

  • Generates executive reports and visualisations
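The genetic-algorithm step above can be illustrated with a minimal sketch. This is not the production implementation: the touchpoint names, the diminishing-returns response curve, and the GA settings below are hypothetical stand-ins; in the real pipeline the fitness function is driven by the MCMC parameter exports.

```python
import random

random.seed(42)

BUDGET = 2_500_000
TOUCHPOINTS = ["car_config", "finance_calc", "test_drive"]  # hypothetical
# Hypothetical diminishing-returns response: conversions ~ k * sqrt(spend)
RESPONSE_K = {"car_config": 0.9, "finance_calc": 0.7, "test_drive": 0.8}

def fitness(alloc):
    return sum(RESPONSE_K[t] * alloc[i] ** 0.5 for i, t in enumerate(TOUCHPOINTS))

def normalise(alloc):
    """Rescale an allocation so it exactly satisfies the budget constraint."""
    total = sum(alloc)
    return [a / total * BUDGET for a in alloc]

def run_ga(pop_size=50, generations=100):
    pop = [normalise([random.random() for _ in TOUCHPOINTS]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover: average
            i = random.randrange(len(child))
            child[i] *= random.uniform(0.8, 1.2)          # mutation: jitter one gene
            children.append(normalise(child))
        pop = survivors + children
    return max(pop, key=fitness)

best = run_ga()  # best allocation found; always sums to BUDGET
```

Note how every candidate is renormalised after crossover and mutation, so the budget constraint holds throughout the search rather than being checked only at the end.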

Full Pipeline Execution

Complete End-to-End Analysis

# Recommended: Both stages together
python scripts/run_full_italy_pipeline.py

This is the preferred method for new analyses as it ensures parameter estimates are fresh and aligned with your current data.

Quick Testing Mode

# Fast validation using mock data
./run_pipeline.sh --mode=test

Use this for system validation and testing changes without processing real data.

Production Optimisation Mode

# Fast optimisation using existing parameters
./run_pipeline.sh --mode=italy

Use this when you have recent parameter estimates and only need updated budget allocations.

Working with Custom Data

Data Requirements

Your customer journey data should include:

  • Event timestamps in ISO format

  • Customer identifiers for journey tracking

  • Touchpoint identifiers for channel attribution

  • Conversion outcomes (purchases, leads, etc.)

Data Format Example

customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-16T14:22:00,email_click,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
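A data file in this format can be sanity-checked before running the pipeline. The sketch below uses only the Python standard library and the column names from the example above; it is a pre-flight check, not part of the pipeline itself.

```python
import csv
import io
from datetime import datetime

REQUIRED = {"customer_id", "timestamp", "touchpoint", "conversion", "value"}

def validate_journeys(csv_text):
    """Check required columns and ISO timestamps; return parsed rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        datetime.fromisoformat(row["timestamp"])  # raises on a non-ISO timestamp
        row["conversion"] = int(row["conversion"])
        rows.append(row)
    return rows

sample = """customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
"""
rows = validate_journeys(sample)
```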

Configuration for Custom Data

Create a custom configuration file:

# configs/custom_analysis.yaml
data:
  source: "data/custom/customer_journeys.csv"
  date_column: "timestamp"
  customer_column: "customer_id"
  touchpoint_column: "touchpoint"
  conversion_column: "conversion"

model:
  mcmc_samples: 2000
  mcmc_tune: 1000
  chains: 4

optimization:
  population_size: 100
  generations: 200
  budget_total: 1000000

Running Custom Analysis

# Using custom configuration
python scripts/run_full_italy_pipeline.py --config configs/custom_analysis.yaml

Budget Allocation Scenarios

Standard Allocation

# Default budget (£2.5M)
python scripts/run_full_italy_pipeline.py

Custom Budget Scenarios

# Growth scenario - 10% budget increase
python scripts/run_full_italy_pipeline.py --budget 2750000

# Reduced budget scenario
python scripts/run_full_italy_pipeline.py --budget 2000000

# High investment scenario
python scripts/run_full_italy_pipeline.py --budget 5000000

Understanding Results

Executive Summary Output

The system generates business-ready reports including:

Budget Allocation Table:

Touchpoint               Allocation    Percentage
Car Configuration        £420,000      16.8%
Finance Calculator       £380,000      15.2%
Test Drive Requests      £350,000      14.0%
...

Performance Metrics:

  • Raw Improvement: Statistical optimisation potential

  • Business Claim: Conservative attribution-adjusted improvement

  • Confidence Intervals: Statistical uncertainty bounds

Technical Diagnostics

For technical validation, review:

MCMC Diagnostics:

  • R-hat values: Should be < 1.1 for convergence

  • Effective Sample Size (ESS): Should be > 400 per chain

  • ELPD-LOO: Model comparison metric
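The R-hat and ESS thresholds above can be applied programmatically when reviewing exported diagnostics. A minimal sketch, assuming diagnostics are available as a mapping of parameter name to (R-hat, total ESS); the parameter names and values below are hypothetical.

```python
RHAT_MAX = 1.1
ESS_MIN = 400  # per chain

def check_convergence(diagnostics, n_chains):
    """diagnostics maps parameter name -> (r_hat, total_ess). Returns failures."""
    failures = []
    for name, (r_hat, ess) in diagnostics.items():
        if r_hat >= RHAT_MAX:
            failures.append(f"{name}: R-hat {r_hat} >= {RHAT_MAX}")
        if ess / n_chains <= ESS_MIN:
            failures.append(f"{name}: ESS/chain {ess / n_chains:.0f} <= {ESS_MIN}")
    return failures

# Hypothetical diagnostic values for illustration
diag = {"beta_email": (1.01, 3200), "beta_dealer": (1.25, 900)}
issues = check_convergence(diag, n_chains=4)
# beta_email passes both checks; beta_dealer fails both
```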

Optimisation Metrics:

  • Population diversity: Genetic algorithm health

  • Convergence rate: Solution stability

  • Constraint satisfaction: Business rule compliance

Attribution Methodology

conversionflow-aggregate implements a Data-Grounded Attribution methodology to ensure business credibility and analytical integrity.

The Principle

The core principle is that all financial projections must be directly tied to the scope of the data being analysed. In the luxury automotive market, digital data typically accounts for only a small fraction (~5%) of total sales.


How It Works

  1. Scoped Analysis: The entire pipeline, from model fitting to optimisation, operates exclusively on the tracked digital journey data.

  2. Scoped Projections: Business impact calculations (e.g., “Expected additional revenue”) are based on the ~5% of sales that are attributable to these digital journeys, not the total sales volume.
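The scoping arithmetic can be made concrete with a short worked example. All of the figures below are hypothetical except the ~5% digital share, which comes from the methodology above.

```python
total_annual_sales = 40_000   # hypothetical total market volume (units)
digital_share = 0.05          # ~5% of sales tracked digitally (per methodology)
avg_order_value = 35_000      # hypothetical average order value, GBP
uplift = 0.08                 # hypothetical raw optimisation improvement

# Scope the projection to the digitally attributable slice only
attributable_sales = total_annual_sales * digital_share      # ~2,000 units
scoped_revenue_gain = attributable_sales * uplift * avg_order_value

# Correct claim:   gain on the ~5% attributable slice (scoped_revenue_gain)
# Incorrect claim: total_annual_sales * uplift * avg_order_value,
#                  which would assert influence over offline sales
```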

Why This Matters

This approach provides realistic, defensible projections of the value generated by optimising digital marketing spend, and it maintains stakeholder trust by avoiding unsupported claims about influencing total offline sales.

Advanced Usage

Model Selection

# Standard Poisson model (recommended)
python scripts/run_full_italy_pipeline.py

# Hurdle model (for zero-inflated data - slower)
python scripts/run_full_italy_pipeline.py --use-hurdle

Recommendation: Use the standard model unless your data shows severe zero-inflation. The hurdle model takes significantly longer to converge (1+ hours vs. ~7 minutes).

Parallel Processing

# Increase MCMC chains for faster sampling (requires more CPU cores)
python scripts/run_full_italy_pipeline.py --chains 8

Output Customisation

# Specify output directory
python scripts/run_full_italy_pipeline.py --output results/custom_analysis/

# Control output formats
python scripts/run_full_italy_pipeline.py --formats csv,json,html

Performance Optimisation

For Large Datasets (1M+ events)

# In configuration file
model:
  sample_fraction: 0.5  # Use 50% of data for faster processing
  mcmc_samples: 1000    # Reduce samples if needed

For Resource-Constrained Systems

optimization:
  population_size: 50   # Smaller population for faster GA
  generations: 100      # Fewer generations

Quality Assurance

Validation Checks

The system automatically validates:

  • Data quality and completeness

  • Model convergence and diagnostics

  • Budget allocation constraints

  • Attribution methodology compliance

Manual Verification

# Run comprehensive validation suite
python scripts/test_italy_optimization.py

Troubleshooting Checklist

  1. PyMC progress bars visible: Confirms Bayesian fitting is running

  2. R-hat < 1.1: Confirms MCMC convergence

  3. Budget sums correctly: Confirms optimisation constraints

  4. Business impact is correctly scoped: Confirms projections are based on attributable sales
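Item 3 in the checklist can be verified with a few lines of code against the allocation report. The touchpoint figures below mirror the example table earlier in this guide and are illustrative only; the "other_channels" entry is a hypothetical placeholder for the remaining touchpoints.

```python
# Illustrative allocation mirroring the example report table
allocation = {
    "car_configuration": 420_000,
    "finance_calculator": 380_000,
    "test_drive_requests": 350_000,
    "other_channels": 1_350_000,   # hypothetical remainder
}
budget_total = 2_500_000

total = sum(allocation.values())
assert total == budget_total, f"allocation total {total} != budget {budget_total}"

# Percentages should match the report and sum to 100%
shares = {k: v / budget_total for k, v in allocation.items()}
assert abs(sum(shares.values()) - 1.0) < 1e-9
```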

Best Practices

Data Preparation

  • Ensure consistent timestamp formatting

  • Remove duplicate or invalid records

  • Validate customer journey completeness

  • Check for data leakage or look-ahead bias
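The first three preparation steps can be sketched in a short stdlib-only helper. The field names follow the CSV example earlier in this guide; the sample events are hypothetical.

```python
from datetime import datetime

def prepare(rows):
    """Drop exact duplicates, reject malformed timestamps, sort each journey."""
    seen, clean = set(), []
    for row in rows:
        key = (row["customer_id"], row["timestamp"], row["touchpoint"])
        if key in seen:
            continue  # duplicate record
        try:
            ts = datetime.fromisoformat(row["timestamp"])
        except ValueError:
            continue  # invalid timestamp
        seen.add(key)
        clean.append({**row, "ts": ts})
    # Chronological order within each customer journey
    clean.sort(key=lambda r: (r["customer_id"], r["ts"]))
    return clean

events = [
    {"customer_id": "CUST_001", "timestamp": "2024-01-16T14:22:00", "touchpoint": "email_click"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00", "touchpoint": "website_visit"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00", "touchpoint": "website_visit"},  # duplicate
    {"customer_id": "CUST_001", "timestamp": "not-a-date", "touchpoint": "sms"},  # invalid
]
journey = prepare(events)  # duplicate and invalid rows dropped, rest sorted
```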

Model Validation

  • Review convergence diagnostics carefully

  • Compare multiple model runs for consistency

  • Validate results against business intuition

  • Test with different budget scenarios

Business Communication

  • Clearly state that projections are based on digitally attributable sales.

  • Document the methodology’s scope and limitations transparently.

  • Provide confidence intervals for all estimates.

  • Explain all assumptions clearly.

For additional support, consult the Configuration Guide and Troubleshooting Documentation.