# Architecture Overview

## System Design Philosophy

conversionflow-aggregate is designed around the principle of **separation of concerns** between statistical optimisation and business attribution. This architecture enables maximum mathematical efficiency while maintaining credible business communication.

## High-Level Architecture

The system implements a **two-stage pipeline architecture** with a clear separation between parameter estimation and optimisation:

```
Raw Data → [Stage 1: Bayesian Estimation] → Parameters → [Stage 2: GA Optimisation] → Results
```

### Stage 1: Bayesian Parameter Estimation

- **Duration:** ~7 minutes
- **Technology:** PyMC (MCMC sampling)
- **Output:** Probabilistic model parameters with uncertainty quantification

### Stage 2: Genetic Algorithm Optimisation

- **Duration:** ~3-4 seconds
- **Technology:** Custom genetic algorithm implementation
- **Output:** Optimal budget allocations with conservative attribution reporting

## Detailed System Components

### Data Layer

#### Data Processing Pipeline

```
Raw Events → Validation → Transformation → Customer Journey Construction → Model Input
```

**Components:**

- **Data Loaders:** Multi-format ingestion (CSV, DuckDB, Excel, PostgreSQL)
- **Validators:** Data quality and consistency checking
- **Transformers:** Event aggregation and journey reconstruction
- **Cache Manager:** Intelligent caching for large dataset processing

**Key Files:**

- `src/conversionflow/data/italy_loader.py` - Italy-specific data processing
- `src/conversionflow/data/loaders.py` - Generic data loading framework
- `src/conversionflow/data/validators.py` - Data quality validation
- `src/conversionflow/core/cache.py` - Caching system

### Bayesian Modelling Layer

#### Probabilistic Network Architecture

The system models customer journeys as **Bayesian networks** where:

- **Nodes** represent marketing touchpoints
- **Edges** represent transition probabilities
- **Parameters** represent touchpoint effectiveness and interdependencies

#### MCMC Implementation

```python
import pymc as pm

# Conceptual model structure; prior_mean, prior_std, expected_conversions,
# and data stand in for values derived from the processed customer journeys
with pm.Model() as model:
    # Baseline touchpoint effectiveness
    beta0 = pm.Normal("beta0", mu=prior_mean, sigma=prior_std)

    # Budget sensitivity coefficients
    beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)

    # Parent touchpoint influences
    parent_effects = pm.Normal("parent_effects", mu=0, sigma=0.3)

    # Likelihood function
    likelihood = pm.Poisson("obs", mu=expected_conversions, observed=data)
```

**Key Features:**

- **Convergence Diagnostics:** Automatic R-hat and ESS monitoring
- **Model Comparison:** LOO-CV for model selection
- **Uncertainty Quantification:** Full posterior distributions
- **Numerical Stability:** Robust parameter estimation

**Key Files:**

- `src/conversionflow/models/bayesian.py` - Standard Poisson model
- `src/conversionflow/models/bayesian_hurdle.py` - Hurdle model for zero-inflation
- `src/conversionflow/models/parameter_export.py` - Parameter serialisation
- `src/conversionflow/core/numerical_stability.py` - Numerical robustness

### Optimisation Layer

#### Genetic Algorithm Implementation

The optimisation engine uses a **multi-objective genetic algorithm** designed specifically for marketing budget allocation:

```python
class ItalyGeneticOptimizer:
    def __init__(self, model_params, total_budget, constraints):
        self.model_params = model_params
        self.total_budget = total_budget
        self.constraints = constraints
        self.population_size = 100
        self.generations = 200
        self.elite_fraction = 0.1

    def fitness(self, individual):
        # Calculate expected conversions given the budget allocation:
        # each touchpoint's conversion probability scaled by its share
        return sum(self.model_params[touchpoint] * allocation
                   for touchpoint, allocation in individual.items())

    def optimize(self):
        # Standard GA loop with elitism and tournament selection
        pass
```

**Genetic Operators:**

- **Selection:** Tournament selection with configurable size
- **Crossover:** Uniform crossover preserving budget constraints
- **Mutation:** Gaussian mutation with boundary repair
- **Elitism:** Top performers preserved across generations

**Constraint Handling:**

- **Budget Conservation:** Allocations sum exactly to the total budget
- **Business Bounds:** Minimum/maximum allocations per channel
- **Operational Constraints:** Real-world business rules

**Key Files:**

- `src/conversionflow/optimization/italy_optimizer.py` - GA implementation
- `src/conversionflow/optimization/real_parameter_loader.py` - Parameter conversion utilities

### Attribution Layer

#### Data-Grounded Attribution

The system's architecture is built on the principle of **data-grounded attribution**, ensuring that all financial projections are directly and defensibly tied to the scope of the data being analysed.

**Methodology:**

1. **Scoped Modelling:** The Bayesian model is built exclusively on tracked digital journey data, which accounts for a fraction (~5%) of total sales.
2. **Scoped Optimisation:** The genetic algorithm optimises the marketing budget based on conversion probabilities learned *only* from this digital data.
3. **Scoped Reporting:** All business impact calculations and financial projections are consequently based on the portion of sales that can be reasonably attributed to these digital journeys.

This approach provides:

- **Analytical Integrity:** It avoids unsupported claims by never extrapolating results from a small digital dataset to the entire offline sales volume.
- **Business Credibility:** It delivers realistic, defensible projections of the value generated by optimising digital marketing spend.
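The scoped-reporting step can be sketched as a simple calculation. This is a minimal illustration, not the project's API: the function name and parameters are hypothetical, and the 12% uplift in the example is invented; only the ~5% tracked digital share comes from the methodology above.

```python
def scoped_projection(total_sales: float,
                      tracked_share: float,
                      modelled_uplift: float) -> float:
    """Project incremental value only over the tracked digital scope.

    total_sales:     total sales volume across all channels
    tracked_share:   fraction of sales covered by tracked digital journeys
    modelled_uplift: relative conversion improvement from the optimiser
    """
    digital_scope = total_sales * tracked_share
    # The uplift is applied to the digital scope only, never to the
    # full (mostly offline) sales volume
    return digital_scope * modelled_uplift


# Example: with 10M total sales and a 5% tracked share, a hypothetical
# 12% uplift is applied only to the 0.5M digital scope
projection = scoped_projection(10_000_000, 0.05, 0.12)
```

The point of the sketch is the scoping itself: the `tracked_share` factor keeps projections inside the data the model actually saw.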
### Visualisation and Reporting Layer

#### Multi-Format Output Generation

The system generates comprehensive reporting across multiple formats:

**Executive Reporting:**

- Budget allocation tables
- Performance improvement summaries
- Implementation recommendations
- Conservative attribution methodology explanation

**Technical Documentation:**

- MCMC diagnostics and convergence metrics
- Model validation and comparison statistics
- Genetic algorithm convergence analysis
- Sensitivity analysis results

**Visualisation Suite:**

- Customer journey flow diagrams (Mermaid)
- Budget allocation charts
- Performance trend analysis
- Attribution ceiling explanation graphics

**Key Files:**

- `src/conversionflow/visualization/charts.py` - Chart generation
- `src/conversionflow/visualization/dag_mermaid.py` - Journey flow diagrams
- `src/conversionflow/visualization/csv_exports.py` - Structured data exports
- `src/conversionflow/core/console.py` - Professional console output

## Core Infrastructure

### Configuration Management

**Hierarchical YAML Configuration:**

- Default system settings
- Environment-specific overrides
- Model architecture definitions
- User customisation layer

### Logging and Monitoring

**Comprehensive Observability:**

- Structured logging with configurable levels
- Performance profiling and metrics collection
- MCMC convergence monitoring
- Business rule validation tracking

### Caching System

**Intelligent Performance Optimisation:**

- Content-addressable caching
- Automatic cache invalidation
- Large dataset chunking
- Memory-efficient processing

### Numerical Stability

**Robust Mathematical Implementation:**

- Automatic gradient clipping
- Numerical precision management
- Boundary condition handling
- Convergence monitoring

**Key Infrastructure Files:**

- `src/conversionflow/core/config.py` - Configuration management
- `src/conversionflow/core/logging_config.py` - Logging framework
- `src/conversionflow/core/cache.py` - Caching system
- `src/conversionflow/core/profiler.py` - Performance monitoring

## Performance Considerations

### Computational Complexity

- **MCMC Sampling:** O(samples × chains × model_complexity)
- **Genetic Algorithm:** O(generations × population_size × touchpoints)
- **Parameter Conversion:** O(touchpoints × parameters)

### Memory Management

- **Streaming Data Processing:** Chunk-based processing for large datasets
- **MCMC Memory:** Configurable thinning and sample storage
- **Result Caching:** Automatic cleanup of intermediate results

### Scalability Bottlenecks

- **MCMC Convergence:** Dominant time factor (~7 minutes)
- **Data Loading:** I/O bound for very large datasets
- **Visualisation:** Memory intensive for complex diagrams

## Error Handling and Resilience

### Fault Tolerance

- **MCMC Convergence Failures:** Automatic retry with adjusted parameters
- **Data Quality Issues:** Graceful degradation with warnings
- **Resource Constraints:** Automatic configuration adjustment

### Validation Framework

- **Data Validation:** Comprehensive input checking
- **Model Validation:** Convergence and quality diagnostics
- **Result Validation:** Business rule compliance checking

### Recovery Mechanisms

- **Checkpoint System:** Resumable long-running computations
- **Configuration Validation:** Early error detection
- **Graceful Degradation:** Reduced functionality rather than failures

This architecture provides a robust, scalable foundation for sophisticated marketing attribution analysis while maintaining a clear separation between statistical optimisation and business communication concerns.
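As a closing illustration, the automatic-retry behaviour described under Fault Tolerance can be sketched as follows. This is a minimal sketch, not the project's implementation: `sample_fn`, the settings dictionary, and the escalation schedule are hypothetical; only the R-hat-based retry idea comes from the text above.

```python
def sample_with_retries(sample_fn, max_retries=3, rhat_threshold=1.01):
    """Rerun MCMC sampling with progressively safer settings until the
    worst R-hat diagnostic indicates convergence.

    sample_fn: callable accepting draws/target_accept keyword arguments
               and returning a dict with a "max_rhat" entry (hypothetical
               interface for this sketch).
    """
    settings = {"draws": 1000, "target_accept": 0.8}
    for attempt in range(max_retries):
        result = sample_fn(**settings)
        if result["max_rhat"] <= rhat_threshold:
            return result
        # Adjust parameters for the retry: smaller leapfrog steps
        # (higher target acceptance) and more draws
        settings["target_accept"] = min(0.99, settings["target_accept"] + 0.05)
        settings["draws"] *= 2
    raise RuntimeError(f"MCMC failed to converge after {max_retries} attempts")
```

The same escalation pattern (tighten `target_accept`, increase draws) is a common remedy for divergent or poorly mixed chains, which is why it fits the "automatic retry with adjusted parameters" behaviour described above.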