# User Guide

## Introduction

This guide provides comprehensive instructions for using conversionflow-aggregate to analyse customer journeys and optimise marketing budget allocation.

## Understanding the Two-Stage Pipeline

conversionflow-aggregate operates as a two-stage analytical system:

### Stage 1: Bayesian Parameter Estimation

**Duration:** Approximately 7 minutes
**Purpose:** Fits probabilistic models to your customer journey data

```bash
# Stage 1 only - parameter estimation
python scripts/run_full_italy_pipeline.py --stage1-only
```

**What happens:**

- Loads raw customer event data
- Fits Bayesian network models using MCMC sampling
- Generates parameter exports with confidence intervals
- Produces model diagnostics and validation metrics

**Indicators of success:**

- PyMC progress bars showing MCMC sampling
- Convergence diagnostics (R-hat < 1.1)
- Model validation reports
- Parameter export JSON files

### Stage 2: Genetic Algorithm Optimisation

**Duration:** 3-4 seconds
**Purpose:** Optimises budget allocation using pre-computed parameters

```bash
# Stage 2 only - optimisation
python scripts/run_full_italy_pipeline.py --stage2-only
```

**What happens:**

- Loads MCMC parameter exports
- Runs genetic algorithm optimisation
- Applies conservative attribution methodology
- Generates executive reports and visualisations

## Full Pipeline Execution

### Complete End-to-End Analysis

```bash
# Recommended: Both stages together
python scripts/run_full_italy_pipeline.py
```

This is the preferred method for new analyses, as it ensures parameter estimates are fresh and aligned with your current data.

### Quick Testing Mode

```bash
# Fast validation using mock data
./run_pipeline.sh --mode=test
```

Use this for system validation and for testing changes without processing real data.
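The Stage 2 genetic algorithm described above can be illustrated with a plain-Python toy. This is a sketch only, not the pipeline's actual implementation: the diminishing-returns response function, the operator choices (truncation selection, averaging crossover, multiplicative mutation), and all names are illustrative assumptions.

```python
import random

def fitness(alloc, response):
    # Toy diminishing-returns response: sum of coefficient * sqrt(spend)
    return sum(c * s ** 0.5 for c, s in zip(response, alloc))

def normalise(alloc, budget):
    # Enforce the budget constraint by rescaling spend to sum to `budget`
    total = sum(alloc)
    return [budget * a / total for a in alloc]

def optimise_budget(response, budget, pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    n = len(response)
    # Initial population: random feasible allocations
    pop = [normalise([rng.random() for _ in range(n)], budget)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, response), reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover
            i = rng.randrange(n)
            child[i] *= 1 + rng.uniform(-0.1, 0.1)        # mutation
            children.append(normalise(child, budget))
        pop = parents + children
    return max(pop, key=lambda a: fitness(a, response))

# Three hypothetical touchpoints with decreasing responsiveness
best = optimise_budget([3.0, 2.0, 1.0], budget=2_500_000)
```

Because every candidate is re-normalised, the budget constraint holds throughout, and the GA steers spend toward the more responsive touchpoints.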
### Production Optimisation Mode

```bash
# Fast optimisation using existing parameters
./run_pipeline.sh --mode=italy
```

Use this when you have recent parameter estimates and only need updated budget allocations.

## Working with Custom Data

### Data Requirements

Your customer journey data should include:

- **Event timestamps** in ISO format
- **Customer identifiers** for journey tracking
- **Touchpoint identifiers** for channel attribution
- **Conversion outcomes** (purchases, leads, etc.)

### Data Format Example

```csv
customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-16T14:22:00,email_click,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
```

### Configuration for Custom Data

Create a custom configuration file:

```yaml
# configs/custom_analysis.yaml
data:
  source: "data/custom/customer_journeys.csv"
  date_column: "timestamp"
  customer_column: "customer_id"
  touchpoint_column: "touchpoint"
  conversion_column: "conversion"

model:
  mcmc_samples: 2000
  mcmc_tune: 1000
  chains: 4

optimization:
  population_size: 100
  generations: 200
  budget_total: 1000000
```

### Running Custom Analysis

```bash
# Using custom configuration
python scripts/run_full_italy_pipeline.py --config configs/custom_analysis.yaml
```

## Budget Allocation Scenarios

### Standard Allocation

```bash
# Default budget (£2.5M)
python scripts/run_full_italy_pipeline.py
```

### Custom Budget Scenarios

```bash
# Growth scenario - 10% budget increase
python scripts/run_full_italy_pipeline.py --budget 2750000

# Reduced budget scenario
python scripts/run_full_italy_pipeline.py --budget 2000000

# High investment scenario
python scripts/run_full_italy_pipeline.py --budget 5000000
```

## Understanding Results

### Executive Summary Output

The system generates business-ready reports including:

**Budget Allocation Table:**

```
Touchpoint             Allocation    Percentage
Car Configuration      £420,000      16.8%
Finance Calculator     £380,000      15.2%
Test Drive Requests    £350,000      14.0%
...
```

**Performance Metrics:**

- **Raw Improvement:** Statistical optimisation potential
- **Business Claim:** Conservative attribution-adjusted improvement
- **Confidence Intervals:** Statistical uncertainty bounds

### Technical Diagnostics

For technical validation, review:

**MCMC Diagnostics:**

- **R-hat values:** Should be < 1.1 for convergence
- **Effective Sample Size (ESS):** Should be > 400 per chain
- **ELPD-LOO:** Model comparison metric

**Optimisation Metrics:**

- **Population diversity:** Genetic algorithm health
- **Convergence rate:** Solution stability
- **Constraint satisfaction:** Business rule compliance

## Attribution Methodology

`conversionflow-aggregate` implements a **Data-Grounded Attribution** methodology to ensure business credibility and analytical integrity.

### The Principle

All financial projections must be tied directly to the scope of the data being analysed. In the luxury automotive market, digital data typically accounts for only a small fraction (~5%) of total sales.

### How It Works

1. **Scoped Analysis:** The entire pipeline, from model fitting to optimisation, operates exclusively on the tracked digital journey data.
2. **Scoped Projections:** Business impact calculations (e.g. "expected additional revenue") are based on the ~5% of sales attributable to these digital journeys, not on total sales volume.

### Why This Matters

This approach yields realistic, defensible projections of the value generated by optimising digital marketing spend, and it maintains stakeholder trust by avoiding unsupported claims about influencing total offline sales.

## Advanced Usage

### Model Selection

```bash
# Standard Poisson model (recommended)
python scripts/run_full_italy_pipeline.py

# Hurdle model (for zero-inflated data - slower)
python scripts/run_full_italy_pipeline.py --use-hurdle
```

**Recommendation:** Use the standard model unless your data has severe zero-inflation issues.
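The scoping rule behind Data-Grounded Attribution, described earlier, comes down to a single multiplication: project modelled uplift only over the digitally attributable slice of revenue. The sketch below is illustrative; the function name and all figures are hypothetical, not values from the pipeline.

```python
def scoped_revenue_claim(total_sales_revenue, digital_share, modelled_uplift):
    """Project uplift only over the digitally attributable slice of revenue.

    total_sales_revenue: total revenue across all channels
    digital_share: fraction of sales visible in tracked digital journeys (~0.05)
    modelled_uplift: relative improvement predicted by the optimiser
    """
    attributable = total_sales_revenue * digital_share
    return attributable * modelled_uplift

# Illustrative figures only: £400M total revenue, ~5% digitally tracked,
# and a modelled 12% uplift on the tracked slice.
claim = scoped_revenue_claim(400_000_000, 0.05, 0.12)
# ~£2.4M scoped claim, versus ~£48M if the uplift were wrongly
# applied to total revenue - the unsupported claim this methodology avoids.
```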
The hurdle model requires significantly longer convergence time (1+ hours vs 7 minutes).

### Parallel Processing

```bash
# Increase MCMC chains for faster sampling (requires more CPU cores)
python scripts/run_full_italy_pipeline.py --chains 8
```

### Output Customisation

```bash
# Specify output directory
python scripts/run_full_italy_pipeline.py --output results/custom_analysis/

# Control output formats
python scripts/run_full_italy_pipeline.py --formats csv,json,html
```

## Performance Optimisation

### For Large Datasets (1M+ events)

```yaml
# In configuration file
model:
  sample_fraction: 0.5   # Use 50% of data for faster processing
  mcmc_samples: 1000     # Reduce samples if needed
```

### For Resource-Constrained Systems

```yaml
optimization:
  population_size: 50   # Smaller population for faster GA
  generations: 100      # Fewer generations
```

## Quality Assurance

### Validation Checks

The system automatically validates:

- Data quality and completeness
- Model convergence and diagnostics
- Budget allocation constraints
- Attribution methodology compliance

### Manual Verification

```bash
# Run comprehensive validation suite
python scripts/test_italy_optimization.py
```

### Troubleshooting Checklist

1. **PyMC progress bars visible:** Confirms Bayesian fitting is running
2. **R-hat < 1.1:** Confirms MCMC convergence
3. **Budget sums correctly:** Confirms optimisation constraints are satisfied
4. **Business impact is correctly scoped:** Confirms projections are based on attributable sales

## Best Practices

### Data Preparation

- Ensure consistent timestamp formatting
- Remove duplicate or invalid records
- Validate customer journey completeness
- Check for data leakage or look-ahead bias

### Model Validation

- Review convergence diagnostics carefully
- Compare multiple model runs for consistency
- Validate results against business intuition
- Test with different budget scenarios

### Business Communication

- Clearly state that projections are based on digitally attributable sales.
- Document the methodology's scope and limitations transparently.
- Provide confidence intervals for all estimates.
- Explain all assumptions clearly.

For additional support, consult the [Configuration Guide](configuration.md) and the [Troubleshooting Documentation](troubleshooting.md).
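The troubleshooting checklist above can also be mirrored as programmatic checks in your own scripts. The thresholds (R-hat < 1.1, ESS > 400, budget constraint) come from this guide; the function and parameter names are hypothetical, not part of the pipeline's API.

```python
def check_run(rhat, ess, allocation, budget, tol=1e-6):
    """Return a list of checklist failures; an empty list means all clear.

    rhat / ess: per-parameter MCMC diagnostics, keyed by parameter name
    allocation: optimised spend per touchpoint
    budget: total budget the allocation must sum to
    """
    problems = []
    if max(rhat.values()) >= 1.1:
        problems.append("MCMC not converged (R-hat >= 1.1)")
    if min(ess.values()) <= 400:
        problems.append("effective sample size too low (ESS <= 400)")
    if abs(sum(allocation.values()) - budget) > tol * budget:
        problems.append("allocation does not sum to the budget")
    return problems

# Hypothetical diagnostics from a healthy run
problems = check_run(
    rhat={"beta_email": 1.01, "beta_dealer": 1.02},
    ess={"beta_email": 1200, "beta_dealer": 950},
    allocation={"email": 1_000_000, "dealer": 1_500_000},
    budget=2_500_000,
)
```

A non-empty result pinpoints which checklist item failed, which is easier to act on than eyeballing the diagnostics output.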