Architecture Overview

System Design Philosophy

conversionflow-aggregate is designed around a strict separation of concerns between statistical optimisation and business attribution. This architecture lets the statistical layers pursue mathematical efficiency while the reporting layer maintains credible, defensible business communication.

High-Level Architecture

The system implements a two-stage pipeline architecture with clear separation between parameter estimation and optimisation:

Raw Data → [Stage 1: Bayesian Estimation] → Parameters → [Stage 2: GA Optimisation] → Results

Stage 1: Bayesian Parameter Estimation

Duration: ~7 minutes
Technology: PyMC (MCMC sampling)
Output: Probabilistic model parameters with uncertainty quantification

Stage 2: Genetic Algorithm Optimisation

Duration: ~3-4 seconds
Technology: Custom genetic algorithm implementation
Output: Optimal budget allocations with conservative attribution reporting
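The hand-off between the two stages can be sketched as follows. The function and field names are illustrative stand-ins, not the actual conversionflow-aggregate API, and the stage bodies are trivial placeholders for MCMC sampling and the GA:

```python
from dataclasses import dataclass

@dataclass
class ModelParameters:
    conversion_rates: dict  # touchpoint -> estimated conversion probability

def run_stage1_estimation(raw_events):
    """Stand-in for the ~7-minute Bayesian stage: here, simple point estimates."""
    counts, conversions = {}, {}
    for event in raw_events:
        tp = event["touchpoint"]
        counts[tp] = counts.get(tp, 0) + 1
        conversions[tp] = conversions.get(tp, 0) + event["converted"]
    return ModelParameters({tp: conversions[tp] / counts[tp] for tp in counts})

def run_stage2_optimisation(params, total_budget):
    """Stand-in for the ~3-4 s GA stage: here, a trivial greedy allocation."""
    best = max(params.conversion_rates, key=params.conversion_rates.get)
    allocation = {tp: 0.0 for tp in params.conversion_rates}
    allocation[best] = total_budget
    return allocation

events = [
    {"touchpoint": "search", "converted": 1},
    {"touchpoint": "search", "converted": 0},
    {"touchpoint": "display", "converted": 0},
]
params = run_stage1_estimation(events)
allocation = run_stage2_optimisation(params, total_budget=1000.0)
```

The key architectural point is the narrow interface: Stage 2 consumes only the exported parameters, never the raw data.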

Detailed System Components

Data Layer

Data Processing Pipeline

Raw Events → Validation → Transformation → Customer Journey Construction → Model Input

Components:

  • Data Loaders: Multi-format ingestion (CSV, DuckDB, Excel, PostgreSQL)

  • Validators: Data quality and consistency checking

  • Transformers: Event aggregation and journey reconstruction

  • Cache Manager: Intelligent caching for large dataset processing
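A minimal sketch of the validation step; the field names and rules here are assumptions for illustration, not the actual validators.py behaviour:

```python
REQUIRED_FIELDS = {"customer_id", "touchpoint", "timestamp"}

def validate_events(events):
    """Return (clean_events, warnings): drop malformed rows, flag issues."""
    clean, warnings = [], []
    for i, event in enumerate(events):
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            warnings.append(f"row {i}: missing fields {sorted(missing)}")
            continue
        clean.append(event)
    # Journey reconstruction needs per-customer chronological order
    clean.sort(key=lambda e: (e["customer_id"], e["timestamp"]))
    return clean, warnings

events = [
    {"customer_id": "c1", "touchpoint": "email", "timestamp": 2},
    {"customer_id": "c1", "touchpoint": "search", "timestamp": 1},
    {"customer_id": "c2", "touchpoint": "display"},  # malformed: no timestamp
]
clean, warnings = validate_events(events)
```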

Key Files:

  • src/conversionflow/data/italy_loader.py - Italy-specific data processing

  • src/conversionflow/data/loaders.py - Generic data loading framework

  • src/conversionflow/data/validators.py - Data quality validation

  • src/conversionflow/core/cache.py - Caching system

Bayesian Modelling Layer

Probabilistic Network Architecture

The system models customer journeys as Bayesian networks where:

  • Nodes represent marketing touchpoints

  • Edges represent transition probabilities

  • Parameters represent touchpoint effectiveness and interdependencies
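The node/edge structure above can be illustrated with a minimal transition map; the touchpoint names and probabilities are invented for illustration:

```python
# Nodes are touchpoints; edge weights are transition probabilities.
transitions = {
    "search":  {"display": 0.3, "email": 0.2, "convert": 0.1},
    "display": {"email": 0.4, "convert": 0.05},
    "email":   {"convert": 0.2},
}

def path_probability(path, transitions):
    """Probability of one specific journey path under the transition model."""
    prob = 1.0
    for src, dst in zip(path, path[1:]):
        prob *= transitions.get(src, {}).get(dst, 0.0)
    return prob

p = path_probability(["search", "display", "convert"], transitions)
```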

MCMC Implementation

# Conceptual model structure. `prior_mean`, `prior_std`, `budget`,
# `parent_signal`, and `observed_conversions` are supplied by the data layer.
import pymc as pm

with pm.Model() as model:
    # Baseline touchpoint effectiveness (log scale)
    beta0 = pm.Normal("beta0", mu=prior_mean, sigma=prior_std)

    # Budget sensitivity coefficient
    beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)

    # Parent touchpoint influences
    parent_effects = pm.Normal("parent_effects", mu=0.0, sigma=0.3)

    # Expected conversions as a log-linear function of spend and parent signal
    expected_conversions = pm.math.exp(
        beta0 + beta1 * budget + parent_effects * parent_signal
    )

    # Poisson likelihood over observed conversion counts
    likelihood = pm.Poisson(
        "obs", mu=expected_conversions, observed=observed_conversions
    )

Key Features:

  • Convergence Diagnostics: Automatic R-hat and ESS monitoring

  • Model Comparison: LOO-CV for model selection

  • Uncertainty Quantification: Full posterior distributions

  • Numerical Stability: Robust parameter estimation
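A minimal sketch of the R-hat check named above, using the basic (non-split) Gelman-Rubin statistic. The real system relies on PyMC/ArviZ diagnostics, so this is purely illustrative:

```python
import random
import statistics

def rhat(chains):
    """Basic (non-split) Gelman-Rubin R-hat for one parameter.
    chains: list of per-chain draw lists."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    between = n * statistics.variance(chain_means)                     # B
    within = statistics.fmean(statistics.variance(c) for c in chains)  # W
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5

rng = random.Random(0)
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]    # well mixed
bad = [[x + offset for x in chain] for offset, chain in enumerate(good)]  # stuck chains

r_good = rhat(good)   # near 1.0: chains agree
r_bad = rhat(bad)     # well above 1.0: chains have not converged
```

Values near 1.0 indicate the chains agree; in practice a threshold like 1.01-1.05 triggers the automatic retry described under Error Handling.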

Key Files:

  • src/conversionflow/models/bayesian.py - Standard Poisson model

  • src/conversionflow/models/bayesian_hurdle.py - Hurdle model for zero-inflation

  • src/conversionflow/models/parameter_export.py - Parameter serialisation

  • src/conversionflow/core/numerical_stability.py - Numerical robustness

Optimisation Layer

Genetic Algorithm Implementation

The optimisation engine uses a multi-objective genetic algorithm designed specifically for marketing budget allocation:

class ItalyGeneticOptimizer:
    """Conceptual sketch; the full implementation lives in
    src/conversionflow/optimization/italy_optimizer.py."""

    def __init__(self, model_params, total_budget, constraints):
        self.model_params = model_params    # touchpoint -> conversion probability
        self.total_budget = total_budget
        self.constraints = constraints
        self.population_size = 100
        self.generations = 200
        self.elite_fraction = 0.1

    def fitness(self, individual):
        # Expected conversions for a candidate budget allocation
        # (individual: touchpoint -> allocated spend)
        return sum(self.model_params[touchpoint] * allocation
                   for touchpoint, allocation in individual.items())

    def optimize(self):
        # Standard GA loop with elitism and tournament selection
        pass

Genetic Operators:

  • Selection: Tournament selection with configurable size

  • Crossover: Uniform crossover preserving budget constraints

  • Mutation: Gaussian mutation with boundary repair

  • Elitism: Top performers preserved across generations
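The selection and crossover operators above can be sketched as follows; the dict-of-allocations representation and parameter defaults are assumptions, not the actual italy_optimizer.py code:

```python
import random

def tournament_select(population, fitnesses, k=3, rng=random):
    """Tournament selection: best of k randomly sampled individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def uniform_crossover(parent_a, parent_b, total_budget, rng=random):
    """Per-channel coin flip, then rescale so the child spends the full budget."""
    child = {ch: (parent_a if rng.random() < 0.5 else parent_b)[ch]
             for ch in parent_a}
    scale = total_budget / sum(child.values())
    return {ch: v * scale for ch, v in child.items()}

rng = random.Random(42)
pa = {"search": 600.0, "display": 400.0}
pb = {"search": 200.0, "display": 800.0}
parent = tournament_select([pa, pb], [1.0, 2.0], k=2, rng=rng)
child = uniform_crossover(pa, pb, total_budget=1000.0, rng=rng)
```

The rescaling step is what makes this crossover budget-preserving: any mix of parent genes is projected back onto the constraint surface.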

Constraint Handling:

  • Budget Conservation: Allocation sums exactly equal total budget

  • Business Bounds: Minimum/maximum allocations per channel

  • Operational Constraints: Real-world business rules
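One way to enforce budget conservation and per-channel bounds together is a repair step after each genetic operation. This single-pass sketch (which assumes some headroom or slack remains after clipping) is illustrative, not the production constraint handler; the bound values are invented:

```python
def repair(allocation, total_budget, bounds):
    """Clip to per-channel (min, max) bounds, then spread the residual
    over remaining headroom (or slack) so the total matches the budget."""
    alloc = {ch: min(max(v, bounds[ch][0]), bounds[ch][1])
             for ch, v in allocation.items()}
    residual = total_budget - sum(alloc.values())
    if residual > 0:
        headroom = {ch: bounds[ch][1] - alloc[ch] for ch in alloc}
        total = sum(headroom.values())
        alloc = {ch: v + residual * headroom[ch] / total for ch, v in alloc.items()}
    elif residual < 0:
        slack = {ch: alloc[ch] - bounds[ch][0] for ch in alloc}
        total = sum(slack.values())
        alloc = {ch: v + residual * slack[ch] / total for ch, v in alloc.items()}
    return alloc

bounds = {"search": (100.0, 700.0), "display": (100.0, 700.0), "email": (50.0, 300.0)}
raw = {"search": 900.0, "display": 50.0, "email": 0.0}
repaired = repair(raw, total_budget=1000.0, bounds=bounds)
```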

Key Files:

  • src/conversionflow/optimization/italy_optimizer.py - GA implementation

  • src/conversionflow/optimization/real_parameter_loader.py - Parameter conversion utilities

Attribution Layer

Data-Grounded Attribution

The system’s architecture is built on the principle of data-grounded attribution. This ensures that all financial projections are directly and defensibly tied to the scope of the data being analysed.

Methodology:

  1. Scoped Modelling: The Bayesian model is built exclusively on tracked digital journey data, which accounts for a fraction (~5%) of total sales.

  2. Scoped Optimisation: The genetic algorithm optimises the marketing budget based on the conversion probabilities learned only from this digital data.

  3. Scoped Reporting: All business impact calculations and financial projections are consequently based on the portion of sales that can be reasonably attributed to these digital journeys.

This approach provides:

  • Analytical Integrity: It avoids making unsupported claims by extrapolating results from a small digital dataset to the entire offline sales volume.

  • Business Credibility: It delivers realistic and defensible projections of the value generated by optimising digital marketing spend.
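As a worked example of scoped versus unscoped reporting, using invented revenue figures and the ~5% digital share from the methodology above:

```python
total_annual_sales = 10_000_000.0  # all channels; illustrative figure
digital_share = 0.05               # fraction covered by tracked digital journeys
modelled_uplift = 0.20             # illustrative GA-projected improvement

digital_sales = total_annual_sales * digital_share
scoped_projection = digital_sales * modelled_uplift          # the defensible claim
unscoped_projection = total_annual_sales * modelled_uplift   # the overclaim avoided
```

The twenty-fold gap between the two projections is exactly the unsupported extrapolation that scoped reporting rules out.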

Visualisation and Reporting Layer

Multi-Format Output Generation

The system generates comprehensive reporting across multiple formats:

Executive Reporting:

  • Budget allocation tables

  • Performance improvement summaries

  • Implementation recommendations

  • Conservative attribution methodology explanation

Technical Documentation:

  • MCMC diagnostics and convergence metrics

  • Model validation and comparison statistics

  • Genetic algorithm convergence analysis

  • Sensitivity analysis results

Visualisation Suite:

  • Customer journey flow diagrams (Mermaid)

  • Budget allocation charts

  • Performance trend analysis

  • Attribution ceiling explanation graphics

Key Files:

  • src/conversionflow/visualization/charts.py - Chart generation

  • src/conversionflow/visualization/dag_mermaid.py - Journey flow diagrams

  • src/conversionflow/visualization/csv_exports.py - Structured data exports

  • src/conversionflow/core/console.py - Professional console output

Core Infrastructure

Configuration Management

Hierarchical YAML Configuration:

  • Default system settings

  • Environment-specific overrides

  • Model architecture definitions

  • User customisation layer
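The hierarchical override can be sketched as a recursive merge where later layers win key-by-key. In practice each layer would be parsed from YAML; plain dicts stand in here, and the setting names are illustrative:

```python
def deep_merge(base, override):
    """Return base with override applied; nested sections merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"mcmc": {"draws": 2000, "chains": 4}, "log_level": "INFO"}
env_override = {"mcmc": {"draws": 500}, "log_level": "DEBUG"}
config = deep_merge(defaults, env_override)
```

Untouched keys (here, `chains`) survive from the lower layer, so each layer only needs to state what it changes.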

Logging and Monitoring

Comprehensive Observability:

  • Structured logging with configurable levels

  • Performance profiling and metrics collection

  • MCMC convergence monitoring

  • Business rule validation tracking

Caching System

Intelligent Performance Optimisation:

  • Content-addressable caching

  • Automatic cache invalidation

  • Large dataset chunking

  • Memory-efficient processing
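Content-addressable caching can be sketched as hashing the input content together with the processing parameters, so any change to either automatically invalidates the entry. This is illustrative, not the actual cache.py implementation:

```python
import hashlib
import json

def cache_key(content: bytes, params: dict) -> str:
    """Key derived from what is processed and how; sort_keys makes it stable."""
    h = hashlib.sha256()
    h.update(content)
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()

key_a = cache_key(b"raw,events,csv", {"chunk_size": 10_000})
key_b = cache_key(b"raw,events,csv", {"chunk_size": 10_000})  # same inputs
key_c = cache_key(b"raw,events,csv", {"chunk_size": 20_000})  # changed params
```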

Numerical Stability

Robust Mathematical Implementation:

  • Automatic gradient clipping

  • Numerical precision management

  • Boundary condition handling

  • Convergence monitoring

Key Infrastructure Files:

  • src/conversionflow/core/config.py - Configuration management

  • src/conversionflow/core/logging_config.py - Logging framework

  • src/conversionflow/core/cache.py - Caching system

  • src/conversionflow/core/profiler.py - Performance monitoring

Performance Considerations

Computational Complexity

  • MCMC Sampling: O(samples × chains × model_complexity)

  • Genetic Algorithm: O(generations × population_size × touchpoints)

  • Parameter Conversion: O(touchpoints × parameters)

Memory Management

  • Streaming Data Processing: Chunk-based processing for large datasets

  • MCMC Memory: Configurable thinning and sample storage

  • Result Caching: Automatic cleanup of intermediate results

Scalability Bottlenecks

  • MCMC Convergence: Dominant time factor (~7 minutes)

  • Data Loading: I/O bound for very large datasets

  • Visualisation: Memory intensive for complex diagrams

Error Handling and Resilience

Fault Tolerance

  • MCMC Convergence Failures: Automatic retry with adjusted parameters

  • Data Quality Issues: Graceful degradation with warnings

  • Resource Constraints: Automatic configuration adjustment

Validation Framework

  • Data Validation: Comprehensive input checking

  • Model Validation: Convergence and quality diagnostics

  • Result Validation: Business rule compliance checking

Recovery Mechanisms

  • Checkpoint System: Resumable long-running computations

  • Configuration Validation: Early error detection

  • Graceful Degradation: Reduced functionality rather than failures
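A checkpoint system for resumable long-running computations can be sketched as periodic state serialisation; the file format and field names here are invented for illustration:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    """Resume from disk if a checkpoint exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"generation": 0, "best_fitness": float("-inf")}

path = os.path.join(tempfile.mkdtemp(), "ga_checkpoint.json")
state = load_checkpoint(path)  # fresh start: no file yet
for gen in range(state["generation"], 5):
    state = {"generation": gen + 1, "best_fitness": float(gen)}
    save_checkpoint(path, state)  # resumable after every generation

resumed = load_checkpoint(path)
```

A crash mid-run would restart the loop from the last saved generation rather than from zero.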

This architecture provides a robust, scalable foundation for sophisticated marketing attribution analysis while maintaining clear separation between statistical optimisation and business communication concerns.