# User Guide

## Introduction
This guide provides comprehensive instructions for using conversionflow-aggregate to analyse customer journeys and optimise marketing budget allocation.
## Understanding the Two-Stage Pipeline

conversionflow-aggregate operates as a two-stage analytical system:
### Stage 1: Bayesian Parameter Estimation

- **Duration:** approximately 7 minutes
- **Purpose:** fits probabilistic models to your customer journey data

```bash
# Stage 1 only - parameter estimation
python scripts/run_full_italy_pipeline.py --stage1-only
```
What happens:

- Loads raw customer event data
- Fits Bayesian network models using MCMC sampling
- Generates parameter exports with confidence intervals
- Produces model diagnostics and validation metrics
Indicators of success:

- PyMC progress bars showing MCMC sampling
- Convergence diagnostics (R-hat < 1.1)
- Model validation reports
- Parameter export JSON files
### Stage 2: Genetic Algorithm Optimisation

- **Duration:** 3-4 seconds
- **Purpose:** optimises budget allocation using pre-computed parameters

```bash
# Stage 2 only - optimisation
python scripts/run_full_italy_pipeline.py --stage2-only
```
What happens:

- Loads MCMC parameter exports
- Runs genetic algorithm optimisation
- Applies conservative attribution methodology
- Generates executive reports and visualisations
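The Stage 2 loop can be sketched as a budget-constrained genetic algorithm. Everything below is an illustrative stand-in, not the pipeline's actual code: the real optimiser scores candidates against the MCMC parameter exports, whereas this sketch uses a toy diminishing-returns response.

```python
import random

def normalise(alloc, budget):
    """Rescale an allocation so it exactly spends the total budget."""
    total = sum(alloc)
    return [a * budget / total for a in alloc]

def fitness(alloc):
    """Toy diminishing-returns response; the real pipeline scores
    allocations against the fitted Bayesian parameters instead."""
    return sum(a ** 0.5 for a in alloc)

def optimise(budget, channels, pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    # Start from random allocations, each normalised to the budget
    pop = [normalise([rng.random() for _ in range(channels)], budget)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]          # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)       # crossover: average two parents
            child = [(x + y) / 2 * rng.uniform(0.9, 1.1)  # small mutation
                     for x, y in zip(a, b)]
            children.append(normalise(child, budget))
        pop = elite + children
    return max(pop, key=fitness)

best = optimise(budget=1_000_000, channels=5)
```

Because every candidate is re-normalised after crossover and mutation, the budget constraint holds by construction throughout the search.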
## Full Pipeline Execution

### Complete End-to-End Analysis

```bash
# Recommended: both stages together
python scripts/run_full_italy_pipeline.py
```
This is the preferred method for new analyses as it ensures parameter estimates are fresh and aligned with your current data.
### Quick Testing Mode

```bash
# Fast validation using mock data
./run_pipeline.sh --mode=test
```
Use this for system validation and testing changes without processing real data.
### Production Optimisation Mode

```bash
# Fast optimisation using existing parameters
./run_pipeline.sh --mode=italy
```
Use this when you have recent parameter estimates and only need updated budget allocations.
## Working with Custom Data

### Data Requirements

Your customer journey data should include:

- Event timestamps in ISO format
- Customer identifiers for journey tracking
- Touchpoint identifiers for channel attribution
- Conversion outcomes (purchases, leads, etc.)
### Data Format Example

```csv
customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-16T14:22:00,email_click,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
```
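A quick way to sanity-check a journeys file before running the pipeline is to parse it and verify the expected columns and value conventions. This is a standalone sketch, not part of the pipeline's validation suite; the column names follow the format above.

```python
import csv
import io
from datetime import datetime

REQUIRED = ["customer_id", "timestamp", "touchpoint", "conversion", "value"]

def validate_journeys(text):
    """Parse a journeys CSV and check the columns the format example expects."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        datetime.fromisoformat(row["timestamp"])  # raises if not ISO format
        if row["conversion"] not in ("0", "1"):
            raise ValueError(f"bad conversion flag: {row}")
    return rows

sample = """customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
"""
rows = validate_journeys(sample)
```

Note that the `value` field is empty for non-converting events, matching the example rows above.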
### Configuration for Custom Data

Create a custom configuration file:

```yaml
# configs/custom_analysis.yaml
data:
  source: "data/custom/customer_journeys.csv"
  date_column: "timestamp"
  customer_column: "customer_id"
  touchpoint_column: "touchpoint"
  conversion_column: "conversion"

model:
  mcmc_samples: 2000
  mcmc_tune: 1000
  chains: 4

optimization:
  population_size: 100
  generations: 200
  budget_total: 1000000
```
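A custom file typically only needs the keys you change, with the rest falling back to defaults. How the pipeline merges configuration is not specified here; this is a generic sketch of the usual deep-merge pattern (in practice the YAML file would be read with a parser such as PyYAML's `safe_load`).

```python
def deep_merge(defaults, overrides):
    """Recursively overlay user config onto defaults, so untouched
    keys keep their default values."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "model": {"mcmc_samples": 2000, "mcmc_tune": 1000, "chains": 4},
    "optimization": {"population_size": 100, "generations": 200},
}
custom = {"model": {"chains": 8}}  # override one key only
config = deep_merge(defaults, custom)
```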
### Running Custom Analysis

```bash
# Using custom configuration
python scripts/run_full_italy_pipeline.py --config configs/custom_analysis.yaml
```
## Budget Allocation Scenarios

### Standard Allocation

```bash
# Default budget (£2.5M)
python scripts/run_full_italy_pipeline.py
```
### Custom Budget Scenarios

```bash
# Growth scenario - 10% budget increase
python scripts/run_full_italy_pipeline.py --budget 2750000

# Reduced budget scenario
python scripts/run_full_italy_pipeline.py --budget 2000000

# High investment scenario
python scripts/run_full_italy_pipeline.py --budget 5000000
```
## Understanding Results

### Executive Summary Output

The system generates business-ready reports including:

**Budget Allocation Table:**

| Touchpoint | Allocation | Percentage |
|---|---|---|
| Car Configuration | £420,000 | 16.8% |
| Finance Calculator | £380,000 | 15.2% |
| Test Drive Requests | £350,000 | 14.0% |
| ... | ... | ... |
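The Percentage column is simply each touchpoint's share of the scenario budget (here the £2.5M default), which makes the table easy to cross-check:

```python
allocations = {  # figures from the table above
    "Car Configuration": 420_000,
    "Finance Calculator": 380_000,
    "Test Drive Requests": 350_000,
}
budget = 2_500_000  # default budget from the standard scenario

# Each share is allocation / total budget, as a percentage
shares = {name: round(100 * amount / budget, 1)
          for name, amount in allocations.items()}
# e.g. Car Configuration: 420,000 / 2,500,000 = 16.8%
```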
**Performance Metrics:**

- **Raw Improvement:** statistical optimisation potential
- **Business Claim:** conservative attribution-adjusted improvement
- **Confidence Intervals:** statistical uncertainty bounds
### Technical Diagnostics

For technical validation, review:

**MCMC Diagnostics:**

- **R-hat values:** should be < 1.1 for convergence
- **Effective Sample Size (ESS):** should be > 400 per chain
- **ELPD-LOO:** model comparison metric

**Optimisation Metrics:**

- **Population diversity:** genetic algorithm health
- **Convergence rate:** solution stability
- **Constraint satisfaction:** business rule compliance
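The two headline MCMC thresholds are easy to gate on programmatically. The dictionary below is illustrative, not real pipeline output; in practice the per-parameter `r_hat` and ESS values would come from the parameter export or a diagnostics summary (e.g. ArviZ's `az.summary`).

```python
def convergence_failures(summary, rhat_max=1.1, ess_min=400):
    """Return the parameters that violate the convergence thresholds
    (R-hat must be below rhat_max, ESS above ess_min per chain)."""
    return {name: stats for name, stats in summary.items()
            if stats["r_hat"] >= rhat_max or stats["ess"] <= ess_min}

summary = {  # illustrative values, not real pipeline output
    "beta_email": {"r_hat": 1.01, "ess": 1850},
    "beta_dealer": {"r_hat": 1.25, "ess": 120},
}
bad = convergence_failures(summary)
# bad flags only the parameter that failed both thresholds
```

If this returns anything, re-run Stage 1 (more tuning steps or samples) before trusting the Stage 2 allocations.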
## Attribution Methodology

conversionflow-aggregate implements a Data-Grounded Attribution methodology to ensure business credibility and analytical integrity.

### The Principle

All financial projections must be tied directly to the scope of the data being analysed. In the luxury automotive market, digital data typically accounts for only a small fraction (~5%) of total sales.

### How It Works

1. **Scoped analysis:** the entire pipeline, from model fitting to optimisation, operates exclusively on the tracked digital journey data.
2. **Scoped projections:** business impact calculations (e.g. "Expected additional revenue") are based on the ~5% of sales attributable to these digital journeys, not the total sales volume.
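In numbers, the scoping works like this (all figures below are illustrative, not real results):

```python
total_annual_sales = 10_000   # all sales, online and offline (illustrative)
digital_share = 0.05          # ~5% of sales have tracked digital journeys
avg_order_value = 35_000      # illustrative value per conversion
raw_improvement = 0.12        # optimiser's uplift on digital conversions

# Only the digitally attributable slice enters the projection
attributable_sales = total_annual_sales * digital_share  # 500 sales in scope
expected_additional_revenue = (
    attributable_sales * avg_order_value * raw_improvement
)
# The claim is based on the 500 in-scope sales, never the full 10,000.
```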
### Why This Matters

This approach yields realistic, defensible projections of the value generated by optimising digital marketing spend, and it maintains stakeholder trust by avoiding unsupported claims about influencing total offline sales.
## Advanced Usage

### Model Selection

```bash
# Standard Poisson model (recommended)
python scripts/run_full_italy_pipeline.py

# Hurdle model (for zero-inflated data - slower)
python scripts/run_full_italy_pipeline.py --use-hurdle
```
**Recommendation:** use the standard model unless your data has severe zero-inflation. The hurdle model takes significantly longer to converge (1+ hours versus roughly 7 minutes).
### Parallel Processing

```bash
# Increase MCMC chains for faster sampling (requires more CPU cores)
python scripts/run_full_italy_pipeline.py --chains 8
```
### Output Customisation

```bash
# Specify output directory
python scripts/run_full_italy_pipeline.py --output results/custom_analysis/

# Control output formats
python scripts/run_full_italy_pipeline.py --formats csv,json,html
```
## Performance Optimisation

### For Large Datasets (1M+ events)

```yaml
# In configuration file
model:
  sample_fraction: 0.5  # Use 50% of data for faster processing
  mcmc_samples: 1000    # Reduce samples if needed
```
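One caution when subsampling journey data: dropping random rows can truncate journeys mid-stream. A safer convention, sketched below under the assumption that the pipeline samples at the customer level (its actual sampling logic is not documented here), is to keep or drop whole customers:

```python
import random

def subsample_journeys(events, fraction, seed=0):
    """Keep a random fraction of customers, retaining every event for
    each kept customer so no journey is truncated mid-stream."""
    customers = sorted({e["customer_id"] for e in events})
    rng = random.Random(seed)
    keep = set(rng.sample(customers, max(1, int(len(customers) * fraction))))
    return [e for e in events if e["customer_id"] in keep]

# 100 customers with 3 events each (synthetic illustration)
events = [{"customer_id": f"CUST_{i:03d}", "touchpoint": "website_visit"}
          for i in range(100) for _ in range(3)]
half = subsample_journeys(events, 0.5)  # 50 customers, all their events
```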
### For Resource-Constrained Systems

```yaml
optimization:
  population_size: 50  # Smaller population for faster GA
  generations: 100     # Fewer generations
```
## Quality Assurance

### Validation Checks

The system automatically validates:

- Data quality and completeness
- Model convergence and diagnostics
- Budget allocation constraints
- Attribution methodology compliance

### Manual Verification

```bash
# Run comprehensive validation suite
python scripts/test_italy_optimization.py
```
### Troubleshooting Checklist

- **PyMC progress bars visible:** confirms Bayesian fitting is running
- **R-hat < 1.1:** confirms MCMC convergence
- **Budget sums correctly:** confirms optimisation constraints
- **Business impact is correctly scoped:** confirms projections are based on attributable sales
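The last two checks lend themselves to automation against the optimiser's output. A sketch with illustrative figures (the function names and numbers are this guide's, not the pipeline's API):

```python
def check_budget(allocations, total, tolerance=1.0):
    """Allocations must sum to the configured budget (within tolerance)."""
    return abs(sum(allocations.values()) - total) <= tolerance

def check_scoping(claimed_sales, total_sales, digital_share=0.05):
    """Business impact must be based only on the ~5% of sales
    attributable to tracked digital journeys."""
    return claimed_sales <= total_sales * digital_share

# Illustrative output to check
allocations = {"email": 1_200_000, "search": 800_000, "social": 500_000}
assert check_budget(allocations, 2_500_000)
assert check_scoping(claimed_sales=480, total_sales=10_000)
```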
## Best Practices

### Data Preparation

- Ensure consistent timestamp formatting
- Remove duplicate or invalid records
- Validate customer journey completeness
- Check for data leakage or look-ahead bias
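The first two preparation steps can be scripted. A sketch that drops exact duplicates and orders each journey chronologically, using the field names from the data format example (ISO timestamps sort correctly as strings):

```python
from operator import itemgetter

def clean_events(events):
    """Drop exact duplicate events and order each customer's
    journey chronologically."""
    unique = {(e["customer_id"], e["timestamp"], e["touchpoint"]): e
              for e in events}.values()
    return sorted(unique, key=itemgetter("customer_id", "timestamp"))

events = [
    {"customer_id": "CUST_001", "timestamp": "2024-01-16T14:22:00",
     "touchpoint": "email_click"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00",
     "touchpoint": "website_visit"},
    {"customer_id": "CUST_001", "timestamp": "2024-01-15T10:30:00",
     "touchpoint": "website_visit"},  # exact duplicate
]
cleaned = clean_events(events)
```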
### Model Validation

- Review convergence diagnostics carefully
- Compare multiple model runs for consistency
- Validate results against business intuition
- Test with different budget scenarios

### Business Communication

- Clearly state that projections are based on digitally attributable sales
- Document the methodology's scope and limitations transparently
- Provide confidence intervals for all estimates
- Explain all assumptions clearly
For additional support, consult the Configuration Guide and Troubleshooting Documentation.