Architecture Overview
System Design Philosophy
conversionflow-aggregate is designed around a separation of concerns between statistical optimisation and business attribution: the statistical machinery is free to pursue mathematical efficiency, while the reporting layer keeps business communication credible and defensible.
High-Level Architecture
The system implements a two-stage pipeline architecture with clear separation between parameter estimation and optimisation:
Raw Data → [Stage 1: Bayesian Estimation] → Parameters → [Stage 2: GA Optimisation] → Results
Stage 1: Bayesian Parameter Estimation
Duration: ~7 minutes
Technology: PyMC (MCMC sampling)
Output: Probabilistic model parameters with uncertainty quantification
Stage 2: Genetic Algorithm Optimisation
Duration: ~3-4 seconds
Technology: Custom genetic algorithm implementation
Output: Optimal budget allocations with conservative attribution reporting
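The hand-off between the two stages can be sketched as follows. The function names and stubbed values are illustrative, not the actual API; the architectural point is that Stage 2 consumes only the exported parameters, never the raw data:

```python
# Hypothetical sketch of the two-stage hand-off (names are illustrative).

def stage1_estimate(raw_events):
    # In reality: ~7 minutes of MCMC sampling via PyMC.
    # Here: stubbed posterior means per touchpoint.
    return {"search": {"beta0": 0.8, "beta1": 1.2},
            "social": {"beta0": 0.5, "beta1": 0.9}}

def stage2_optimize(params, total_budget):
    # In reality: a genetic algorithm (~seconds). Here: a naive
    # proportional split by budget sensitivity (beta1).
    total_beta1 = sum(p["beta1"] for p in params.values())
    return {ch: total_budget * p["beta1"] / total_beta1
            for ch, p in params.items()}

params = stage1_estimate(raw_events=[])
allocation = stage2_optimize(params, total_budget=10_000)
```

Because Stage 2 only ever sees the serialised parameters, the expensive estimation step can be cached and re-optimised against many budget scenarios.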
Detailed System Components
Data Layer
Data Processing Pipeline
Raw Events → Validation → Transformation → Customer Journey Construction → Model Input
Components:
Data Loaders: Multi-format ingestion (CSV, DuckDB, Excel, PostgreSQL)
Validators: Data quality and consistency checking
Transformers: Event aggregation and journey reconstruction
Cache Manager: Intelligent caching for large dataset processing
Key Files:
src/conversionflow/data/italy_loader.py - Italy-specific data processing
src/conversionflow/data/loaders.py - Generic data loading framework
src/conversionflow/data/validators.py - Data quality validation
src/conversionflow/core/cache.py - Caching system
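A validator in this layer boils down to checks of the following shape. This is an illustrative sketch in the spirit of validators.py; the actual checks and field names in the codebase may differ:

```python
# Illustrative data-quality check (field names are assumptions, not the
# actual schema used by src/conversionflow/data/validators.py).

def validate_events(events, required_fields=("customer_id", "touchpoint", "timestamp")):
    """Return a list of human-readable data-quality issues (empty = clean)."""
    issues = []
    for i, event in enumerate(events):
        missing = [f for f in required_fields if event.get(f) in (None, "")]
        if missing:
            issues.append(f"event {i}: missing {', '.join(missing)}")
    return issues

clean = [{"customer_id": "c1", "touchpoint": "search", "timestamp": "2024-01-01"}]
dirty = [{"customer_id": "c1", "touchpoint": None, "timestamp": "2024-01-01"}]
```

Returning a list of issues rather than raising lets the pipeline degrade gracefully with warnings, as described under Error Handling below.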
Bayesian Modelling Layer
Probabilistic Network Architecture
The system models customer journeys as Bayesian networks where:
Nodes represent marketing touchpoints
Edges represent transition probabilities
Parameters represent touchpoint effectiveness and interdependencies
MCMC Implementation
```python
import pymc as pm

# Conceptual model structure; `budget` and `data` stand in for the
# prepared model inputs, and the priors and link shown are schematic.
with pm.Model() as model:
    # Baseline touchpoint effectiveness
    beta0 = pm.Normal("beta0", mu=prior_mean, sigma=prior_std)
    # Budget sensitivity coefficients
    beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)
    # Parent touchpoint influences
    parent_effects = pm.Normal("parent_effects", mu=0.0, sigma=0.3)
    # Expected conversions link budget and parent effects to the Poisson rate
    expected_conversions = pm.math.exp(beta0 + beta1 * budget + parent_effects)
    # Likelihood over observed conversion counts
    likelihood = pm.Poisson("obs", mu=expected_conversions, observed=data)
```
Key Features:
Convergence Diagnostics: Automatic R-hat and ESS monitoring
Model Comparison: LOO-CV for model selection
Uncertainty Quantification: Full posterior distributions
Numerical Stability: Robust parameter estimation
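To make the R-hat convergence check concrete, here is the diagnostic in miniature, hand-rolled in pure Python for two toy chains. The pipeline itself presumably relies on PyMC's built-in diagnostics rather than anything like this:

```python
# Minimal Gelman-Rubin R-hat illustration: well-mixed chains score near
# 1.0; chains stuck in different regions inflate the statistic.
from statistics import mean, variance

def r_hat(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)           # W: within-chain variance
    between = n * variance([mean(c) for c in chains])    # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n        # pooled variance estimate
    return (var_plus / within) ** 0.5

mixed = [[0.1, -0.2, 0.05, 0.0], [0.0, 0.15, -0.1, 0.05]]   # overlapping chains
stuck = [[0.0, 0.1, 0.0, 0.1], [5.0, 5.1, 5.0, 5.1]]        # separated chains
```

A common rule of thumb is to flag any parameter with R-hat above ~1.01 to 1.1 for investigation or automatic retry.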
Key Files:
src/conversionflow/models/bayesian.py - Standard Poisson model
src/conversionflow/models/bayesian_hurdle.py - Hurdle model for zero-inflation
src/conversionflow/models/parameter_export.py - Parameter serialisation
src/conversionflow/core/numerical_stability.py - Numerical robustness
Optimisation Layer
Genetic Algorithm Implementation
The optimisation engine uses a multi-objective genetic algorithm designed specifically for marketing budget allocation:
```python
class ItalyGeneticOptimizer:
    def __init__(self, model_params, total_budget, constraints):
        self.model_params = model_params      # per-touchpoint conversion probabilities
        self.total_budget = total_budget
        self.constraints = constraints
        self.population_size = 100
        self.generations = 200
        self.elite_fraction = 0.1

    def fitness(self, individual):
        # Expected conversions for a candidate {touchpoint: allocation} map
        return sum(self.model_params[touchpoint] * allocation
                   for touchpoint, allocation in individual.items())

    def optimize(self):
        # Standard GA loop with elitism and tournament selection
        ...
```
Genetic Operators:
Selection: Tournament selection with configurable size
Crossover: Uniform crossover preserving budget constraints
Mutation: Gaussian mutation with boundary repair
Elitism: Top performers preserved across generations
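The selection operator above can be sketched in a few lines. This is a minimal illustration with the configurable tournament size exposed as `k`; the names are not taken from the actual implementation:

```python
# Tournament selection: sample k individuals at random, keep the fittest.
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick the fittest of k randomly sampled individuals."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

population = [{"search": 60.0, "social": 40.0},
              {"search": 30.0, "social": 70.0},
              {"search": 50.0, "social": 50.0}]
# With k equal to the population size, selection is deterministic.
best = tournament_select(population, fitness=lambda ind: ind["search"], k=3)
```

Larger `k` increases selection pressure; smaller `k` preserves diversity.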
Constraint Handling:
Budget Conservation: Allocation sums exactly equal total budget
Business Bounds: Minimum/maximum allocations per channel
Operational Constraints: Real-world business rules
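One common way to enforce the first two constraints after crossover or mutation is a repair step: clamp each channel to its business bounds, then rescale so the allocation sums exactly to the total budget. The channel names and bounds below are illustrative, and real implementations typically iterate when rescaling pushes a channel back outside its bounds:

```python
# Illustrative constraint-repair operator for a {channel: allocation} map.

def repair(allocation, total_budget, bounds):
    # Step 1: clamp to per-channel business bounds (min, max)
    clamped = {ch: min(max(a, bounds[ch][0]), bounds[ch][1])
               for ch, a in allocation.items()}
    # Step 2: rescale so the allocation conserves the total budget
    scale = total_budget / sum(clamped.values())
    return {ch: a * scale for ch, a in clamped.items()}

bounds = {"search": (10.0, 80.0), "social": (10.0, 80.0)}
fixed = repair({"search": 90.0, "social": 20.0}, total_budget=100.0, bounds=bounds)
```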
Key Files:
src/conversionflow/optimization/italy_optimizer.py - GA implementation
src/conversionflow/optimization/real_parameter_loader.py - Parameter conversion utilities
Attribution Layer
Data-Grounded Attribution
The system’s architecture is built on the principle of data-grounded attribution. This ensures that all financial projections are directly and defensibly tied to the scope of the data being analysed.
Methodology:
Scoped Modelling: The Bayesian model is built exclusively on tracked digital journey data, which accounts for a fraction (~5%) of total sales.
Scoped Optimisation: The genetic algorithm optimises the marketing budget based on the conversion probabilities learned only from this digital data.
Scoped Reporting: All business impact calculations and financial projections are consequently based on the portion of sales that can be reasonably attributed to these digital journeys.
This approach provides:
Analytical Integrity: It avoids making unsupported claims by extrapolating results from a small digital dataset to the entire offline sales volume.
Business Credibility: It delivers realistic and defensible projections of the value generated by optimising digital marketing spend.
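Scoped reporting in numbers (all figures below are hypothetical): a projected lift is applied only to the digitally tracked slice of sales, never extrapolated to total revenue:

```python
# Data-grounded attribution: the projection is scoped to tracked sales.

def scoped_projection(total_sales, tracked_share, projected_lift):
    tracked_sales = total_sales * tracked_share
    return tracked_sales * projected_lift

# A 10% optimisation lift on €2M of sales where only ~5% is digitally
# tracked is reported as a ~€10k impact, not €200k.
impact = scoped_projection(total_sales=2_000_000, tracked_share=0.05,
                           projected_lift=0.10)
```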
Visualisation and Reporting Layer
Multi-Format Output Generation
The system generates comprehensive reporting across multiple formats:
Executive Reporting:
Budget allocation tables
Performance improvement summaries
Implementation recommendations
Conservative attribution methodology explanation
Technical Documentation:
MCMC diagnostics and convergence metrics
Model validation and comparison statistics
Genetic algorithm convergence analysis
Sensitivity analysis results
Visualisation Suite:
Customer journey flow diagrams (Mermaid)
Budget allocation charts
Performance trend analysis
Attribution ceiling explanation graphics
Key Files:
src/conversionflow/visualization/charts.py - Chart generation
src/conversionflow/visualization/dag_mermaid.py - Journey flow diagrams
src/conversionflow/visualization/csv_exports.py - Structured data exports
src/conversionflow/core/console.py - Professional console output
Core Infrastructure
Configuration Management
Hierarchical YAML Configuration:
Default system settings
Environment-specific overrides
Model architecture definitions
User customisation layer
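Hierarchical override resolution (defaults < environment < user) amounts to a recursive dictionary merge. The keys shown are illustrative, not the actual configuration schema:

```python
# Sketch of hierarchical config resolution: later layers win per key,
# nested sections merge rather than replace wholesale.

def deep_merge(base, override):
    """Return base with override applied recursively; override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"mcmc": {"draws": 2000, "chains": 4}, "log_level": "INFO"}
user = {"mcmc": {"draws": 500}, "log_level": "DEBUG"}
config = deep_merge(defaults, user)
```

Note that overriding `mcmc.draws` leaves `mcmc.chains` at its default, which is the behaviour a layered YAML configuration needs.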
Logging and Monitoring
Comprehensive Observability:
Structured logging with configurable levels
Performance profiling and metrics collection
MCMC convergence monitoring
Business rule validation tracking
Caching System
Intelligent Performance Optimisation:
Content-addressable caching
Automatic cache invalidation
Large dataset chunking
Memory-efficient processing
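Content-addressable caching in miniature: the cache key is a hash of the input content itself, so any change to the data automatically produces a different key and invalidates the old entry. The actual implementation in cache.py may differ:

```python
# Content-addressable cache keys: identical content -> identical key,
# any byte-level change -> a different key.
import hashlib

def cache_key(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

key_a = cache_key(b"customer_id,touchpoint\nc1,search\n")
key_b = cache_key(b"customer_id,touchpoint\nc1,social\n")
```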
Numerical Stability
Robust Mathematical Implementation:
Automatic gradient clipping
Numerical precision management
Boundary condition handling
Convergence monitoring
Key Infrastructure Files:
src/conversionflow/core/config.py - Configuration management
src/conversionflow/core/logging_config.py - Logging framework
src/conversionflow/core/cache.py - Caching system
src/conversionflow/core/profiler.py - Performance monitoring
Performance Considerations
Computational Complexity
MCMC Sampling: O(samples × chains × model_complexity)
Genetic Algorithm: O(generations × population_size × touchpoints)
Parameter Conversion: O(touchpoints × parameters)
Memory Management
Streaming Data Processing: Chunk-based processing for large datasets
MCMC Memory: Configurable thinning and sample storage
Result Caching: Automatic cleanup of intermediate results
Scalability Bottlenecks
MCMC Convergence: Dominant time factor (~7 minutes)
Data Loading: I/O bound for very large datasets
Visualisation: Memory intensive for complex diagrams
Error Handling and Resilience
Fault Tolerance
MCMC Convergence Failures: Automatic retry with adjusted parameters
Data Quality Issues: Graceful degradation with warnings
Resource Constraints: Automatic configuration adjustment
Validation Framework
Data Validation: Comprehensive input checking
Model Validation: Convergence and quality diagnostics
Result Validation: Business rule compliance checking
Recovery Mechanisms
Checkpoint System: Resumable long-running computations
Configuration Validation: Early error detection
Graceful Degradation: Reduced functionality rather than failures
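A checkpoint system for resumable runs reduces to persisting enough state to pick up where a long MCMC or GA run left off. This sketch uses JSON for clarity; the file name and state fields are illustrative, not the project's actual checkpoint format:

```python
# Illustrative checkpoint save/resume for a long-running computation.
import json, os, tempfile

def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path, default):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default

path = os.path.join(tempfile.mkdtemp(), "run.ckpt")
save_checkpoint(path, {"generation": 120, "best_fitness": 0.42})
resumed = load_checkpoint(path, default={"generation": 0})
```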
This architecture provides a robust, scalable foundation for sophisticated marketing attribution analysis while maintaining clear separation between statistical optimisation and business communication concerns.