Configuration Reference
Overview
conversionflow-aggregate uses YAML configuration files to control all aspects of data processing, model fitting, and optimisation. This reference documents all available configuration options.
Configuration File Hierarchy
The system uses a hierarchical configuration approach:
Default Configuration: configs/default.yaml
Environment-Specific: configs/environments/{environment}.yaml
Model-Specific: configs/italy_dag_config.yaml
User Configuration: Custom files passed via --config
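This reference does not spell out how the layers are combined, so the following is only a minimal sketch. It assumes that later layers (environment, model, user) override earlier ones key by key. The helper names `deep_merge` and `resolve_config` are hypothetical and are not part of conversionflow-aggregate.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` replace or extend `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


def resolve_config(*layers: dict) -> dict:
    """Merge layers left to right; later layers take precedence (assumed rule)."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Illustrative layers built from values shown in this reference.
default = {"model": {"mcmc_samples": 2000, "mcmc_tune": 1000, "chains": 4, "cores": 4}}
development = {"model": {"mcmc_samples": 500, "mcmc_tune": 200, "chains": 2}}
user = {"model": {"chains": 1}}  # hypothetical file passed via --config

config = resolve_config(default, development, user)
print(config["model"])
```

Under this assumed rule, keys that a later layer does not mention (here `cores`) keep their default value.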
Core Configuration Structure
Data Configuration
data:
  # Data source specification
  source: "data/processed/italy_analytics.duckdb"
  source_type: "duckdb" # Options: duckdb, csv, excel, postgresql
  # Column mapping for CSV sources
  customer_column: "customer_id"
  timestamp_column: "timestamp"
  touchpoint_column: "touchpoint_name"
  conversion_column: "conversion_flag"
  value_column: "conversion_value"
  # Data processing options
  sample_fraction: 1.0 # Use full dataset (0.1 = 10% sample)
  date_range:
    start: "2023-01-01" # Optional: filter date range
    end: "2024-12-31"
  # Data validation
  min_journey_length: 1 # Minimum touchpoints per customer
  max_journey_length: 20 # Maximum touchpoints per customer
  remove_outliers: true # Remove statistical outliers
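The date range and journey length options can be pictured as a filter over customer event records. The sketch below is illustrative only: the function name `filter_journeys` is hypothetical, the date bounds are assumed to be inclusive, and it is assumed that journey length is counted after the date filter is applied.

```python
from collections import defaultdict
from datetime import date


def filter_journeys(events, min_len=1, max_len=20, start=None, end=None):
    """Group events by customer and keep journeys within the length bounds.

    `events` is an iterable of dicts with "customer_id" and "timestamp"
    keys, where the timestamp begins with an ISO date (YYYY-MM-DD).
    """
    start_day = date.fromisoformat(start) if start else None
    end_day = date.fromisoformat(end) if end else None

    journeys = defaultdict(list)
    for event in events:
        day = date.fromisoformat(event["timestamp"][:10])
        # Inclusive date bounds (assumption).
        if start_day and day < start_day:
            continue
        if end_day and day > end_day:
            continue
        journeys[event["customer_id"]].append(event)

    # Keep only customers whose journey length is within [min_len, max_len].
    return {
        customer: sorted(evts, key=lambda e: e["timestamp"])
        for customer, evts in journeys.items()
        if min_len <= len(evts) <= max_len
    }


events = [
    {"customer_id": "a", "timestamp": "2023-03-01 10:00:00"},
    {"customer_id": "a", "timestamp": "2023-03-02 09:30:00"},
    {"customer_id": "b", "timestamp": "2022-12-31 23:59:00"},  # outside date_range
    {"customer_id": "c", "timestamp": "2024-01-05 12:00:00"},  # journey too short
]
kept = filter_journeys(events, min_len=2, max_len=20,
                       start="2023-01-01", end="2024-12-31")
print(sorted(kept))
```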
Model Configuration
model:
  # Model type selection
  model_type: "standard" # Options: standard, hurdle
  # MCMC sampling parameters
  mcmc_samples: 2000 # Number of posterior samples
  mcmc_tune: 1000 # Number of tuning samples
  chains: 4 # Number of parallel chains
  cores: 4 # CPU cores to use
  # Convergence criteria
  target_accept: 0.8 # Target acceptance rate
  max_treedepth: 10 # Maximum tree depth
  # Prior distributions
  priors:
    beta0_mean: -2.0 # Baseline intercept prior mean
    beta0_std: 1.0 # Baseline intercept prior standard deviation
    beta1_mean: 1.0 # Budget coefficient prior mean
    beta1_std: 0.5 # Budget coefficient prior standard deviation
  # Model validation
  validation:
    loo_validation: true # Perform leave-one-out validation
    posterior_predictive: true # Generate posterior predictive checks
    convergence_checks: true # Strict convergence validation
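To get a feel for what the intercept prior implies, it can help to convert it to a probability. The sketch below assumes the intercept lives on the logit scale, which this reference does not state; if the model uses a different link function the numbers do not apply. The function name `inv_logit` is illustrative.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


beta0_mean, beta0_std = -2.0, 1.0  # values from the priors block above

centre = inv_logit(beta0_mean)
low = inv_logit(beta0_mean - beta0_std)
high = inv_logit(beta0_mean + beta0_std)

# Prior mean and the range covered by one standard deviation either side.
print(f"baseline probability at prior mean: {centre:.3f}")
print(f"one-sd range: {low:.3f} to {high:.3f}")
```

Under that assumption, a prior mean of -2.0 corresponds to a baseline conversion probability of roughly 0.12, and one standard deviation either side spans roughly 0.05 to 0.27.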
Optimisation Configuration
optimization:
  # Genetic algorithm parameters
  population_size: 100 # GA population size
  generations: 200 # Maximum generations
  elite_fraction: 0.1 # Fraction of elite individuals preserved
  # Genetic operators
  crossover_rate: 0.8 # Crossover probability
  mutation_rate: 0.1 # Mutation probability
  tournament_size: 5 # Tournament selection size
  # Budget constraints
  budget_total: 2500000 # Total budget in pounds
  min_allocation: 10000 # Minimum allocation per touchpoint
  max_allocation_pct: 0.3 # Maximum 30% to any single touchpoint
  # Convergence criteria
  convergence_generations: 50 # Generations without improvement to stop
  convergence_threshold: 0.001 # Fitness improvement threshold
  # Business constraints
  business_constraints:
    dealer_visit_min: 100000 # Minimum dealer visit budget
    digital_max_pct: 0.6 # Maximum 60% to digital channels
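The budget and business constraints above can be read as a feasibility check on a candidate allocation. The following is a rough sketch of such a check, not the optimiser's actual implementation. The function name `check_allocation`, the choice of which touchpoints count as digital, and the mapping of `dealer_visit_min` onto a particular touchpoint are all assumptions made for illustration.

```python
def check_allocation(allocation, cfg, digital_channels=(), dealer_channel=None,
                     tolerance=1.0):
    """Return a list of violated constraints for a {touchpoint: budget} mapping."""
    errors = []
    total = cfg["budget_total"]

    # The allocation should spend the whole budget, within a small tolerance.
    if abs(sum(allocation.values()) - total) > tolerance:
        errors.append("allocations do not sum to budget_total")

    # Per-touchpoint floor and cap.
    cap = cfg["max_allocation_pct"] * total
    for name, amount in allocation.items():
        if amount < cfg["min_allocation"]:
            errors.append(f"{name}: below min_allocation")
        if amount > cap:
            errors.append(f"{name}: above max_allocation_pct cap")

    # Business constraints (channel groupings are hypothetical inputs).
    rules = cfg.get("business_constraints", {})
    if dealer_channel is not None:
        if allocation.get(dealer_channel, 0) < rules.get("dealer_visit_min", 0):
            errors.append(f"{dealer_channel}: below dealer_visit_min")
    digital_total = sum(allocation.get(c, 0) for c in digital_channels)
    if digital_total > rules.get("digital_max_pct", 1.0) * total:
        errors.append("digital channels exceed digital_max_pct")
    return errors


cfg = {
    "budget_total": 2500000,
    "min_allocation": 10000,
    "max_allocation_pct": 0.3,
    "business_constraints": {"dealer_visit_min": 100000, "digital_max_pct": 0.6},
}
digital = ("website_visit", "car_configuration", "finance_calculation")

feasible = {
    "website_visit": 600000, "car_configuration": 500000,
    "finance_calculation": 300000, "dealer_search": 600000,
    "test_drive_request": 500000,
}
infeasible = dict(feasible, website_visit=1000000, dealer_search=695000,
                  test_drive_request=5000)

ok_errors = check_allocation(feasible, cfg, digital, dealer_channel="dealer_search")
bad_errors = check_allocation(infeasible, cfg, digital, dealer_channel="dealer_search")
print(ok_errors)
print(bad_errors)
```

One consequence of `max_allocation_pct: 0.3` is that at least four touchpoints must receive budget, since three touchpoints at the cap cover only 90% of the total.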
Attribution Configuration
attribution:
  # Data-Grounded Attribution settings
  methodology: "data_grounded"
  # Business communication
  business_rationale: "Projections are scoped to digitally-attributable sales for realism."
  confidence_intervals: true # Include confidence bounds in reporting
  # The following settings are deprecated and no longer used:
  # attribution_ceiling: 0.10
  # ceiling_enforcement: "reporting_only"
Output Configuration
output:
  # Output directory
  base_dir: "results/italy/"
  create_timestamp_dir: true # Create timestamped subdirectories
  # Output formats
  formats: ["csv", "json", "html", "png"]
  # Reporting options
  executive_summary: true # Generate executive summary
  technical_appendix: true # Include technical details
  visualisations: true # Generate charts and diagrams
  # File options
  compression: false # Compress large output files
  precision: 4 # Decimal precision for numerical outputs
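As a rough illustration of how `base_dir`, `create_timestamp_dir`, and `precision` might interact, the sketch below builds an output path and rounds a value. The timestamp format and the function name `resolve_output_dir` are assumptions; the tool's actual directory naming is not documented here.

```python
from datetime import datetime
from pathlib import Path


def resolve_output_dir(base_dir, create_timestamp_dir=True, now=None):
    """Return the directory results would be written to (path only, nothing is created)."""
    base = Path(base_dir)
    if not create_timestamp_dir:
        return base
    # Assumed naming scheme: YYYYMMDD_HHMMSS.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return base / stamp


run_dir = resolve_output_dir("results/italy/", now=datetime(2024, 1, 2, 3, 4, 5))
print(run_dir.as_posix())

# precision: 4 would limit numerical outputs to four decimal places.
rounded = round(0.123456, 4)
print(rounded)
```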
Environment-Specific Configurations
Development Environment
# configs/environments/development.yaml
model:
  mcmc_samples: 500 # Faster sampling for development
  mcmc_tune: 200
  chains: 2
optimization:
  population_size: 20 # Smaller population for speed
  generations: 50
logging:
  level: "DEBUG" # Verbose logging
  console_output: true
Production Environment
# configs/environments/production.yaml
model:
  mcmc_samples: 4000 # Higher quality sampling
  mcmc_tune: 2000
  chains: 8
optimization:
  population_size: 200 # Larger population for quality
  generations: 500
logging:
  level: "INFO"
  file_output: true
  console_output: false
validation:
  strict_validation: true # Comprehensive checks
  quality_gates: true # Fail if quality thresholds not met
Staging Environment
# configs/environments/staging.yaml
model:
  mcmc_samples: 2000 # Production-like sampling
  mcmc_tune: 1000
  chains: 4
validation:
  moderate_validation: true
  benchmark_comparisons: true
Model Architecture Configuration
Customer Journey DAG
# configs/italy_dag_config.yaml
nodes:
  - session_start
  - website_visit
  - download
  - car_configuration
  - finance_calculation
  - dealer_search
  - test_drive_request
  - offer_request
  - contact_request
  - purchase_outcome
edges:
  - [session_start, website_visit]
  - [website_visit, download]
  - [download, car_configuration]
  - [car_configuration, finance_calculation]
  - [finance_calculation, offer_request]
  - [dealer_search, test_drive_request]
  - [test_drive_request, purchase_outcome]
  - [offer_request, purchase_outcome]
# Node properties
node_properties:
  session_start:
    type: "entry_point"
    budget_eligible: false
  purchase_outcome:
    type: "conversion"
    weight: 10.0
  test_drive_request:
    type: "high_intent"
    weight: 5.0
  website_visit:
    type: "awareness"
    weight: 1.0
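Before fitting, it can be worth checking that the node and edge lists really describe a directed acyclic graph. The sketch below uses Python's standard-library `graphlib` to do so; the function name `validate_dag` is hypothetical and the check is not necessarily what conversionflow-aggregate performs internally.

```python
from graphlib import CycleError, TopologicalSorter

nodes = [
    "session_start", "website_visit", "download", "car_configuration",
    "finance_calculation", "dealer_search", "test_drive_request",
    "offer_request", "contact_request", "purchase_outcome",
]
edges = [
    ("session_start", "website_visit"), ("website_visit", "download"),
    ("download", "car_configuration"), ("car_configuration", "finance_calculation"),
    ("finance_calculation", "offer_request"), ("dealer_search", "test_drive_request"),
    ("test_drive_request", "purchase_outcome"), ("offer_request", "purchase_outcome"),
]


def validate_dag(nodes, edges):
    """Return (topological order, isolated nodes); raise ValueError on bad input."""
    known = set(nodes)
    connected = {name for edge in edges for name in edge}
    unknown = sorted(connected - known)
    if unknown:
        raise ValueError(f"edges reference undeclared nodes: {unknown}")

    # TopologicalSorter maps each node to the set of its predecessors.
    sorter = TopologicalSorter({node: set() for node in nodes})
    for parent, child in edges:
        sorter.add(child, parent)
    try:
        order = list(sorter.static_order())
    except CycleError as exc:
        raise ValueError(f"graph contains a cycle: {exc.args[1]}") from exc

    # Nodes that appear in no edge at all.
    isolated = [node for node in nodes if node not in connected]
    return order, isolated


order, isolated = validate_dag(nodes, edges)
print(order)
print(isolated)
```

Run against the lists above, the check reports `contact_request` as isolated, since it is declared as a node but appears in no edge. Whether that is intentional is worth confirming.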
Logging Configuration
logging:
  # Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: "INFO"
  # Output destinations
  console_output: true
  file_output: true
  log_file: "logs/conversionflow.log"
  # Log formatting
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"
  # Component-specific logging
  loggers:
    "conversionflow.models.bayesian":
      level: "WARNING" # Reduce PyMC verbosity
    "conversionflow.optimization":
      level: "INFO"
    "matplotlib":
      level: "ERROR" # Suppress matplotlib warnings
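The format strings and level names above follow Python's standard `logging` conventions, so one plausible way to apply this block is to translate it into a `logging.config.dictConfig` dictionary. The sketch below shows that translation; the function name `build_logging_dict` is hypothetical, and the tool may configure logging differently.

```python
import logging
import logging.config


def build_logging_dict(cfg):
    """Translate the `logging:` block into a dictConfig-compatible dictionary."""
    handlers = {}
    if cfg.get("console_output"):
        handlers["console"] = {"class": "logging.StreamHandler",
                               "formatter": "default"}
    if cfg.get("file_output"):
        # FileHandler needs the parent directory of log_file to exist.
        handlers["file"] = {"class": "logging.FileHandler",
                            "filename": cfg["log_file"],
                            "formatter": "default"}
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {"default": {"format": cfg["format"],
                                   "datefmt": cfg["date_format"]}},
        "handlers": handlers,
        "root": {"level": cfg["level"], "handlers": list(handlers)},
        # Component-specific levels override the root level for those loggers.
        "loggers": {name: {"level": opts["level"]}
                    for name, opts in cfg.get("loggers", {}).items()},
    }


logging_cfg = {
    "level": "INFO",
    "console_output": True,
    "file_output": False,  # disabled here so the example writes no files
    "log_file": "logs/conversionflow.log",
    "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    "date_format": "%Y-%m-%d %H:%M:%S",
    "loggers": {
        "conversionflow.models.bayesian": {"level": "WARNING"},
        "conversionflow.optimization": {"level": "INFO"},
        "matplotlib": {"level": "ERROR"},
    },
}

logging.config.dictConfig(build_logging_dict(logging_cfg))
logging.getLogger("conversionflow.optimization").info("logging configured")
```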
Advanced Configuration Options
Performance Tuning
performance:
  # Memory management
  memory_limit_gb: 16 # Maximum memory usage
  chunk_size: 10000 # Data processing chunk size
  # Parallel processing
  n_jobs: -1 # Use all available cores
  backend: "multiprocessing" # Options: threading, multiprocessing
  # Caching
  enable_cache: true
  cache_dir: "data/cache/"
  cache_expiry_days: 30
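To illustrate what `chunk_size` and `n_jobs: -1` typically mean in practice, here is a small sketch. It assumes chunking is applied row-wise; the helper names `iter_chunks` and `resolve_n_jobs` are hypothetical.

```python
import os
from itertools import islice


def iter_chunks(rows, chunk_size=10000):
    """Yield successive lists of at most `chunk_size` rows from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def resolve_n_jobs(n_jobs):
    """Interpret -1 as 'use all available cores'; otherwise use at least one worker."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    return max(1, n_jobs)


chunk_sizes = [len(chunk) for chunk in iter_chunks(range(25000), chunk_size=10000)]
print(chunk_sizes)
print(resolve_n_jobs(-1))
```

Smaller chunks lower peak memory at the cost of more iterations, which is why the memory-constrained example later in this reference halves `chunk_size`.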
Numerical Stability
numerical:
  # Precision settings
  float_precision: "float64"
  convergence_tolerance: 1e-6
  # Stability enhancements
  add_jitter: true # Add small noise for numerical stability
  jitter_scale: 1e-8
  # Boundary handling
  clip_probabilities: true # Clamp probabilities to [min_probability, max_probability]
  min_probability: 1e-10
  max_probability: 0.999999
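Clipping matters because probabilities of exactly 0 or 1 make logarithms and log-odds undefined. The sketch below shows the idea with the bounds from the block above; the function names are illustrative and not taken from the codebase.

```python
import math


def clip_probability(p, lo=1e-10, hi=0.999999):
    """Clamp p into [lo, hi] so downstream log computations stay finite."""
    return min(max(p, lo), hi)


def safe_log_odds(p, lo=1e-10, hi=0.999999):
    """Log-odds of p after clipping; finite even for p = 0 or p = 1."""
    p = clip_probability(p, lo, hi)
    return math.log(p / (1.0 - p))


for raw in (0.0, 0.25, 1.0):
    print(raw, clip_probability(raw), safe_log_odds(raw))
```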
Validation Configuration
validation:
  # Model validation
  convergence_checks:
    rhat_threshold: 1.1 # R-hat convergence criterion
    ess_threshold: 400 # Effective sample size minimum
  # Data validation
  data_quality:
    completeness_threshold: 0.95 # Minimum data completeness
    consistency_checks: true # Cross-validation checks
  # Business validation
  business_rules:
    budget_sum_tolerance: 1.0 # Budget allocation tolerance (£)
    allocation_reasonableness: true # Check for unrealistic allocations
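The convergence thresholds can be read as a per-parameter pass/fail rule: a parameter is flagged when its R-hat exceeds `rhat_threshold` or its effective sample size falls below `ess_threshold`. A minimal sketch follows. The function name `convergence_failures` is hypothetical and the diagnostic values are invented purely for illustration.

```python
def convergence_failures(diagnostics, rhat_threshold=1.1, ess_threshold=400):
    """Return {parameter: [reasons]} for parameters that fail either criterion."""
    failures = {}
    for name, stats in diagnostics.items():
        reasons = []
        if stats["rhat"] > rhat_threshold:
            reasons.append(f"rhat {stats['rhat']} > {rhat_threshold}")
        if stats["ess"] < ess_threshold:
            reasons.append(f"ess {stats['ess']} < {ess_threshold}")
        if reasons:
            failures[name] = reasons
    return failures


# Made-up diagnostics for two parameters (not real model output).
diagnostics = {
    "beta0": {"rhat": 1.00, "ess": 1850},
    "beta1": {"rhat": 1.15, "ess": 320},
}
failures = convergence_failures(diagnostics)
print(failures)
```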
Configuration Best Practices
Development Workflow
# Use fast settings for iterative development
model:
  mcmc_samples: 500
  chains: 2
optimization:
  population_size: 20
  generations: 50
Production Deployment
# Use high-quality settings for final results
model:
  mcmc_samples: 4000
  chains: 8
optimization:
  population_size: 200
  generations: 500
validation:
  strict_validation: true
Memory-Constrained Environments
# Reduce memory usage for limited systems
data:
  sample_fraction: 0.5
model:
  mcmc_samples: 1000
  chains: 2
optimization:
  population_size: 50
performance:
  memory_limit_gb: 8
  chunk_size: 5000
Configuration Validation
The system automatically validates configuration files for:
Required field presence
Data type consistency
Value range constraints
Business rule compliance
Cross-field dependencies
Invalid configurations will generate clear error messages with suggested corrections.
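The kinds of checks listed above could look roughly like the sketch below. It is not the tool's validator: the function name `validate_config`, the choice of which fields are required, and the exact ranges are assumptions chosen to give one example per category.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return human-readable error messages; an empty list means no problems found."""
    errors = []
    data = cfg.get("data", {})
    model = cfg.get("model", {})
    opt = cfg.get("optimization", {})

    # Required field presence (assumed: a data source must be given).
    if not data.get("source"):
        errors.append("data.source is required")

    # Data type consistency.
    chains = model.get("chains", 4)
    if not isinstance(chains, int) or isinstance(chains, bool):
        errors.append("model.chains must be an integer, e.g. chains: 4")

    # Value range constraints.
    fraction = data.get("sample_fraction", 1.0)
    if not 0 < fraction <= 1:
        errors.append("data.sample_fraction must be in (0, 1]")

    # Cross-field dependencies.
    if data.get("min_journey_length", 1) > data.get("max_journey_length", 20):
        errors.append("data.min_journey_length exceeds data.max_journey_length")

    # Business rule compliance: the per-touchpoint floor cannot exceed the cap.
    total = opt.get("budget_total")
    if total is not None:
        cap = opt.get("max_allocation_pct", 1.0) * total
        if opt.get("min_allocation", 0) > cap:
            errors.append("optimization.min_allocation exceeds the max_allocation_pct cap")
    return errors


good = {
    "data": {"source": "data/processed/italy_analytics.duckdb", "sample_fraction": 1.0,
             "min_journey_length": 1, "max_journey_length": 20},
    "model": {"chains": 4},
    "optimization": {"budget_total": 2500000, "min_allocation": 10000,
                     "max_allocation_pct": 0.3},
}
bad = {
    "data": {"sample_fraction": 1.5, "min_journey_length": 5, "max_journey_length": 3},
    "model": {"chains": "four"},
}
good_errors = validate_config(good)
bad_errors = validate_config(bad)
print(good_errors)
print(bad_errors)
```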
Custom Configuration Examples
High-Frequency Data Analysis
# For datasets with very frequent customer interactions
data:
  sample_fraction: 0.3 # Sample to manageable size
  min_journey_length: 3 # Require meaningful journeys
model:
  mcmc_samples: 3000 # More samples for complex patterns
optimization:
  population_size: 150 # Larger population for exploration
B2B Customer Journeys
# Configuration for longer B2B sales cycles
data:
  max_journey_length: 50 # Allow longer customer journeys
  date_range:
    start: "2022-01-01" # Longer historical period
model:
  priors:
    beta0_mean: -3.0 # Lower baseline conversion rates
For additional configuration examples and troubleshooting, see the User Guide and Troubleshooting Documentation.