# Configuration Reference

## Overview

conversionflow-aggregate uses YAML configuration files to control all aspects of data processing, model fitting, and optimisation. This reference documents all available configuration options.

## Configuration File Hierarchy

The system uses a hierarchical configuration approach:

1. **Default Configuration:** `configs/default.yaml`
2. **Environment-Specific:** `configs/environments/{environment}.yaml`
3. **Model-Specific:** `configs/italy_dag_config.yaml`
4. **User Configuration:** Custom files passed via `--config`

## Core Configuration Structure

### Data Configuration

```yaml
data:
  # Data source specification
  source: "data/processed/italy_analytics.duckdb"
  source_type: "duckdb"  # Options: duckdb, csv, excel, postgresql

  # Column mapping for CSV sources
  customer_column: "customer_id"
  timestamp_column: "timestamp"
  touchpoint_column: "touchpoint_name"
  conversion_column: "conversion_flag"
  value_column: "conversion_value"

  # Data processing options
  sample_fraction: 1.0  # Use full dataset (0.1 = 10% sample)
  date_range:
    start: "2023-01-01"  # Optional: filter date range
    end: "2024-12-31"

  # Data validation
  min_journey_length: 1   # Minimum touchpoints per customer
  max_journey_length: 20  # Maximum touchpoints per customer
  remove_outliers: true   # Remove statistical outliers
```

### Model Configuration

```yaml
model:
  # Model type selection
  model_type: "standard"  # Options: standard, hurdle

  # MCMC sampling parameters
  mcmc_samples: 2000  # Number of posterior samples
  mcmc_tune: 1000     # Number of tuning samples
  chains: 4           # Number of parallel chains
  cores: 4            # CPU cores to use

  # Convergence criteria
  target_accept: 0.8  # Target acceptance rate
  max_treedepth: 10   # Maximum tree depth

  # Prior distributions
  priors:
    beta0_mean: -2.0  # Baseline intercept prior mean
    beta0_std: 1.0    # Baseline intercept prior standard deviation
    beta1_mean: 1.0   # Budget coefficient prior mean
    beta1_std: 0.5    # Budget coefficient prior standard deviation

  # Model validation
  validation:
    loo_validation: true        # Perform leave-one-out validation
    posterior_predictive: true  # Generate posterior predictive checks
    convergence_checks: true    # Strict convergence validation
```

### Optimisation Configuration

```yaml
optimization:
  # Genetic algorithm parameters
  population_size: 100  # GA population size
  generations: 200      # Maximum generations
  elite_fraction: 0.1   # Fraction of elite individuals preserved

  # Genetic operators
  crossover_rate: 0.8  # Crossover probability
  mutation_rate: 0.1   # Mutation probability
  tournament_size: 5   # Tournament selection size

  # Budget constraints
  budget_total: 2500000    # Total budget in pounds
  min_allocation: 10000    # Minimum allocation per touchpoint
  max_allocation_pct: 0.3  # Maximum 30% to any single touchpoint

  # Convergence criteria
  convergence_generations: 50   # Generations without improvement to stop
  convergence_threshold: 0.001  # Fitness improvement threshold

  # Business constraints
  business_constraints:
    dealer_visit_min: 100000  # Minimum dealer visit budget
    digital_max_pct: 0.6      # Maximum 60% to digital channels
```

### Attribution Configuration

```yaml
attribution:
  # Data-Grounded Attribution settings
  methodology: "data_grounded"

  # Business communication
  business_rationale: "Projections are scoped to digitally-attributable sales for realism."
  confidence_intervals: true  # Include confidence bounds in reporting

  # The following settings are deprecated and no longer used:
  # attribution_ceiling: 0.10
  # ceiling_enforcement: "reporting_only"
```

### Output Configuration

```yaml
output:
  # Output directory
  base_dir: "results/italy/"
  create_timestamp_dir: true  # Create timestamped subdirectories

  # Output formats
  formats: ["csv", "json", "html", "png"]

  # Reporting options
  executive_summary: true   # Generate executive summary
  technical_appendix: true  # Include technical details
  visualisations: true      # Generate charts and diagrams

  # File options
  compression: false  # Compress large output files
  precision: 4        # Decimal precision for numerical outputs
```

## Environment-Specific Configurations

### Development Environment

```yaml
# configs/environments/development.yaml
model:
  mcmc_samples: 500  # Faster sampling for development
  mcmc_tune: 200
  chains: 2

optimization:
  population_size: 20  # Smaller population for speed
  generations: 50

logging:
  level: "DEBUG"  # Verbose logging
  console_output: true
```

### Production Environment

```yaml
# configs/environments/production.yaml
model:
  mcmc_samples: 4000  # Higher quality sampling
  mcmc_tune: 2000
  chains: 8

optimization:
  population_size: 200  # Larger population for quality
  generations: 500

logging:
  level: "INFO"
  file_output: true
  console_output: false

validation:
  strict_validation: true  # Comprehensive checks
  quality_gates: true      # Fail if quality thresholds not met
```

### Staging Environment

```yaml
# configs/environments/staging.yaml
model:
  mcmc_samples: 2000  # Production-like sampling
  mcmc_tune: 1000
  chains: 4

validation:
  moderate_validation: true
  benchmark_comparisons: true
```

## Model Architecture Configuration

### Customer Journey DAG

```yaml
# configs/italy_dag_config.yaml
nodes:
  - session_start
  - website_visit
  - download
  - car_configuration
  - finance_calculation
  - dealer_search
  - test_drive_request
  - offer_request
  - contact_request
  - purchase_outcome

edges:
  - [session_start, website_visit]
  - [website_visit, download]
  - [download, car_configuration]
  - [car_configuration, finance_calculation]
  - [finance_calculation, offer_request]
  - [dealer_search, test_drive_request]
  - [test_drive_request, purchase_outcome]
  - [offer_request, purchase_outcome]

# Node properties
node_properties:
  session_start:
    type: "entry_point"
    budget_eligible: false
  purchase_outcome:
    type: "conversion"
    weight: 10.0
  test_drive_request:
    type: "high_intent"
    weight: 5.0
  website_visit:
    type: "awareness"
    weight: 1.0
```

## Logging Configuration

```yaml
logging:
  # Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: "INFO"

  # Output destinations
  console_output: true
  file_output: true
  log_file: "logs/conversionflow.log"

  # Log formatting
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"

  # Component-specific logging
  loggers:
    "conversionflow.models.bayesian":
      level: "WARNING"  # Reduce PyMC verbosity
    "conversionflow.optimization":
      level: "INFO"
    "matplotlib":
      level: "ERROR"  # Suppress matplotlib warnings
```

## Advanced Configuration Options

### Performance Tuning

```yaml
performance:
  # Memory management
  memory_limit_gb: 16  # Maximum memory usage
  chunk_size: 10000    # Data processing chunk size

  # Parallel processing
  n_jobs: -1                  # Use all available cores
  backend: "multiprocessing"  # Options: threading, multiprocessing

  # Caching
  enable_cache: true
  cache_dir: "data/cache/"
  cache_expiry_days: 30
```

### Numerical Stability

```yaml
numerical:
  # Precision settings
  float_precision: "float64"
  convergence_tolerance: 1e-6

  # Stability enhancements
  add_jitter: true  # Add small noise for numerical stability
  jitter_scale: 1e-8

  # Boundary handling
  clip_probabilities: true  # Ensure probabilities in [0,1]
  min_probability: 1e-10
  max_probability: 0.999999
```

### Validation Configuration

```yaml
validation:
  # Model validation
  convergence_checks:
    rhat_threshold: 1.1  # R-hat convergence criterion
    ess_threshold: 400   # Effective sample size minimum

  # Data validation
  data_quality:
    completeness_threshold: 0.95  # Minimum data completeness
    consistency_checks: true      # Cross-validation checks

  # Business validation
  business_rules:
    budget_sum_tolerance: 1.0        # Budget allocation tolerance (£)
    allocation_reasonableness: true  # Check for unrealistic allocations
```

## Configuration Best Practices

### Development Workflow

```yaml
# Use fast settings for iterative development
model:
  mcmc_samples: 500
  chains: 2

optimization:
  population_size: 20
  generations: 50
```

### Production Deployment

```yaml
# Use high-quality settings for final results
model:
  mcmc_samples: 4000
  chains: 8

optimization:
  population_size: 200
  generations: 500

validation:
  strict_validation: true
```

### Memory-Constrained Environments

```yaml
# Reduce memory usage for limited systems
data:
  sample_fraction: 0.5

model:
  mcmc_samples: 1000
  chains: 2

optimization:
  population_size: 50

performance:
  memory_limit_gb: 8
  chunk_size: 5000
```

## Configuration Validation

The system automatically validates configuration files for:

- Required field presence
- Data type consistency
- Value range constraints
- Business rule compliance
- Cross-field dependencies

Invalid configurations will generate clear error messages with suggested corrections.
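
The kinds of checks listed above can be illustrated with a small Python sketch. This is not the project's validator, whose implementation is not shown in this reference: `get_path` and `validate_config` are hypothetical names, the set of required fields is an assumption, and only a few rules are covered. It operates on an already-parsed configuration (a nested `dict`) and collects every problem rather than stopping at the first.

```python
from typing import Any


def get_path(config: dict[str, Any], dotted: str) -> Any:
    """Return the value at a dotted path such as 'model.chains', or None."""
    node: Any = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_config(config: dict[str, Any]) -> list[str]:
    """Hypothetical validator: returns a list of human-readable problems."""
    errors: list[str] = []

    # Required field presence (which fields are required is an assumption)
    for path in ("data.source", "model.model_type", "optimization.budget_total"):
        if get_path(config, path) is None:
            errors.append(f"{path}: required field is missing")

    # Data type consistency (bool is rejected because it subclasses int)
    for path in ("model.mcmc_samples", "model.chains"):
        value = get_path(config, path)
        if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
            errors.append(f"{path}: expected an integer, got {type(value).__name__}")

    # Value range constraints
    fraction = get_path(config, "data.sample_fraction")
    if isinstance(fraction, (int, float)) and not 0 < fraction <= 1:
        errors.append("data.sample_fraction: must be in (0, 1]")

    # Cross-field dependencies
    shortest = get_path(config, "data.min_journey_length")
    longest = get_path(config, "data.max_journey_length")
    if isinstance(shortest, int) and isinstance(longest, int) and shortest > longest:
        errors.append("data.min_journey_length: must not exceed data.max_journey_length")

    return errors


# Example: a configuration with several deliberate mistakes
broken = {
    "data": {
        "source": "data/processed/italy_analytics.duckdb",
        "sample_fraction": 1.5,
        "min_journey_length": 5,
        "max_journey_length": 2,
    },
    "model": {"model_type": "standard", "chains": "4"},
    "optimization": {},
}
problems = validate_config(broken)
for problem in problems:
    print(problem)
```

Returning a list of messages keyed by dotted path makes it possible to report all problems in one pass, which is one way to produce the kind of actionable error output described above.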
## Custom Configuration Examples

### High-Frequency Data Analysis

```yaml
# For datasets with very frequent customer interactions
data:
  sample_fraction: 0.3   # Sample to manageable size
  min_journey_length: 3  # Require meaningful journeys

model:
  mcmc_samples: 3000  # More samples for complex patterns

optimization:
  population_size: 150  # Larger population for exploration
```

### B2B Customer Journeys

```yaml
# Configuration for longer B2B sales cycles
data:
  max_journey_length: 50  # Allow longer customer journeys
  date_range:
    start: "2022-01-01"  # Longer historical period

model:
  priors:
    beta0_mean: -3.0  # Lower baseline conversion rates
```

For additional configuration examples and troubleshooting, see the [User Guide](user-guide.md) and [Troubleshooting Documentation](troubleshooting.md).
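
The custom examples in this section list only the keys they change, which suggests they are layered over the other files in the configuration hierarchy. The sketch below shows one way such layering could work, assuming a recursive (deep) merge in which later layers take precedence. This reference does not state the actual merge rule or precedence order, so both are assumptions, and `deep_merge` and `resolve_config` are hypothetical helper names. The configurations are shown as already-parsed Python dictionaries.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which `override` wins; nested mappings merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing wholesale
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Apply layers in order; later layers override earlier ones (assumed precedence)."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# A subset of the defaults documented in this reference
defaults = {
    "data": {"max_journey_length": 20, "sample_fraction": 1.0},
    "model": {
        "mcmc_samples": 2000,
        "priors": {"beta0_mean": -2.0, "beta0_std": 1.0},
    },
}

# The B2B override: only the keys that differ from the defaults
b2b = {
    "data": {"max_journey_length": 50},
    "model": {"priors": {"beta0_mean": -3.0}},
}

config = resolve_config(defaults, b2b)
print(config["model"]["priors"])
```

Under this assumption, the B2B file changes `beta0_mean` while `beta0_std` keeps its default value. A shallow merge would instead replace the whole `priors` mapping and drop the unspecified keys, so it is worth confirming which behaviour the loader actually implements.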