Architecture Overview
System Design Philosophy
conversionflow-aggregate is designed around a separation of concerns between statistical optimisation and business attribution: the statistical machinery is free to pursue mathematical efficiency, while the reporting layer keeps business communication credible and defensible.
High-Level Architecture
The system implements a two-stage pipeline architecture with clear separation between parameter estimation and optimisation:
Raw Data → [Stage 1: Bayesian Estimation] → Parameters → [Stage 2: GA Optimisation] → Results
Stage 1: Bayesian Parameter Estimation
Duration: ~7 minutes
Technology: PyMC (MCMC sampling)
Output: Probabilistic model parameters with uncertainty quantification
Stage 2: Genetic Algorithm Optimisation
Duration: ~3-4 seconds
Technology: Custom genetic algorithm implementation
Output: Optimal budget allocations with conservative attribution reporting
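The hand-off between the two stages can be sketched as follows. The function names and stubbed values are illustrative, not the actual API; the architectural point is that Stage 2 consumes only the exported parameters, never the raw data:

```python
# Hypothetical sketch of the two-stage hand-off (names are illustrative).

def stage1_estimate(raw_events):
    # In reality: ~7 minutes of MCMC sampling via PyMC.
    # Here: stubbed posterior means per touchpoint.
    return {"search": {"beta0": 0.8, "beta1": 1.2},
            "social": {"beta0": 0.5, "beta1": 0.9}}

def stage2_optimize(params, total_budget):
    # In reality: a genetic algorithm (~seconds). Here: a naive
    # proportional split by budget sensitivity (beta1).
    total_beta1 = sum(p["beta1"] for p in params.values())
    return {ch: total_budget * p["beta1"] / total_beta1
            for ch, p in params.items()}

params = stage1_estimate(raw_events=[])
allocation = stage2_optimize(params, total_budget=10_000)
```

Because Stage 2 only ever sees the serialised parameters, the expensive estimation step can be cached and re-optimised against many budget scenarios.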
Detailed System Components
Data Layer
Data Processing Pipeline
Raw Events → Validation → Transformation → Customer Journey Construction → Model Input
Components:
Data Loaders: Multi-format ingestion (CSV, DuckDB, Excel, PostgreSQL)
Validators: Data quality and consistency checking
Transformers: Event aggregation and journey reconstruction
Cache Manager: Intelligent caching for large dataset processing
Key Files:
src/conversionflow/data/italy_loader.py - Italy-specific data processing
src/conversionflow/data/loaders.py - Generic data loading framework
src/conversionflow/data/validators.py - Data quality validation
src/conversionflow/core/cache.py - Caching system
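A validator in this layer boils down to checks of the following shape. This is an illustrative sketch in the spirit of validators.py; the actual checks and field names in the codebase may differ:

```python
# Illustrative data-quality check (field names are assumptions, not the
# actual schema used by src/conversionflow/data/validators.py).

def validate_events(events, required_fields=("customer_id", "touchpoint", "timestamp")):
    """Return a list of human-readable data-quality issues (empty = clean)."""
    issues = []
    for i, event in enumerate(events):
        missing = [f for f in required_fields if event.get(f) in (None, "")]
        if missing:
            issues.append(f"event {i}: missing {', '.join(missing)}")
    return issues

clean = [{"customer_id": "c1", "touchpoint": "search", "timestamp": "2024-01-01"}]
dirty = [{"customer_id": "c1", "touchpoint": None, "timestamp": "2024-01-01"}]
```

Returning a list of issues rather than raising lets the pipeline degrade gracefully with warnings, as described under Error Handling below.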
Bayesian Modelling Layer
Probabilistic Network Architecture
The system models customer journeys as Bayesian networks where:
Nodes represent marketing touchpoints
Edges represent transition probabilities
Parameters represent touchpoint effectiveness and interdependencies
MCMC Implementation
```python
import pymc as pm

# Conceptual model structure; `budget` and `data` stand in for the
# prepared model inputs, and the priors and link shown are schematic.
with pm.Model() as model:
    # Baseline touchpoint effectiveness
    beta0 = pm.Normal("beta0", mu=prior_mean, sigma=prior_std)
    # Budget sensitivity coefficients
    beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)
    # Parent touchpoint influences
    parent_effects = pm.Normal("parent_effects", mu=0.0, sigma=0.3)
    # Expected conversions link budget and parent effects to the Poisson rate
    expected_conversions = pm.math.exp(beta0 + beta1 * budget + parent_effects)
    # Likelihood over observed conversion counts
    likelihood = pm.Poisson("obs", mu=expected_conversions, observed=data)
```
Key Features:
Convergence Diagnostics: Automatic R-hat and ESS monitoring
Model Comparison: LOO-CV for model selection
Uncertainty Quantification: Full posterior distributions
Numerical Stability: Robust parameter estimation
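To make the R-hat convergence check concrete, here is the diagnostic in miniature, hand-rolled in pure Python for two toy chains. The pipeline itself presumably relies on PyMC's built-in diagnostics rather than anything like this:

```python
# Minimal Gelman-Rubin R-hat illustration: well-mixed chains score near
# 1.0; chains stuck in different regions inflate the statistic.
from statistics import mean, variance

def r_hat(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)           # W: within-chain variance
    between = n * variance([mean(c) for c in chains])    # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n        # pooled variance estimate
    return (var_plus / within) ** 0.5

mixed = [[0.1, -0.2, 0.05, 0.0], [0.0, 0.15, -0.1, 0.05]]   # overlapping chains
stuck = [[0.0, 0.1, 0.0, 0.1], [5.0, 5.1, 5.0, 5.1]]        # separated chains
```

A common rule of thumb is to flag any parameter with R-hat above ~1.01 to 1.1 for investigation or automatic retry.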
Key Files:
src/conversionflow/models/bayesian.py - Standard Poisson model
src/conversionflow/models/bayesian_hurdle.py - Hurdle model for zero-inflation
src/conversionflow/models/parameter_export.py - Parameter serialisation
src/conversionflow/core/numerical_stability.py - Numerical robustness
Optimisation Layer
Genetic Algorithm Implementation
The optimisation engine uses a multi-objective genetic algorithm designed specifically for marketing budget allocation:
```python
class ItalyGeneticOptimizer:
    def __init__(self, model_params, total_budget, constraints):
        self.model_params = model_params      # per-touchpoint conversion probabilities
        self.total_budget = total_budget
        self.constraints = constraints
        self.population_size = 100
        self.generations = 200
        self.elite_fraction = 0.1

    def fitness(self, individual):
        # Expected conversions for a candidate {touchpoint: allocation} map
        return sum(self.model_params[touchpoint] * allocation
                   for touchpoint, allocation in individual.items())

    def optimize(self):
        # Standard GA loop with elitism and tournament selection
        ...
```
Genetic Operators:
Selection: Tournament selection with configurable size
Crossover: Uniform crossover preserving budget constraints
Mutation: Gaussian mutation with boundary repair
Elitism: Top performers preserved across generations
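The selection operator above can be sketched in a few lines. This is a minimal illustration with the configurable tournament size exposed as `k`; the names are not taken from the actual implementation:

```python
# Tournament selection: sample k individuals at random, keep the fittest.
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick the fittest of k randomly sampled individuals."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

population = [{"search": 60.0, "social": 40.0},
              {"search": 30.0, "social": 70.0},
              {"search": 50.0, "social": 50.0}]
# With k equal to the population size, selection is deterministic.
best = tournament_select(population, fitness=lambda ind: ind["search"], k=3)
```

Larger `k` increases selection pressure; smaller `k` preserves diversity.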
Constraint Handling:
Budget Conservation: Allocation sums exactly equal total budget
Business Bounds: Minimum/maximum allocations per channel
Operational Constraints: Real-world business rules
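One common way to enforce the first two constraints after crossover or mutation is a repair step: clamp each channel to its business bounds, then rescale so the allocation sums exactly to the total budget. The channel names and bounds below are illustrative, and real implementations typically iterate when rescaling pushes a channel back outside its bounds:

```python
# Illustrative constraint-repair operator for a {channel: allocation} map.

def repair(allocation, total_budget, bounds):
    # Step 1: clamp to per-channel business bounds (min, max)
    clamped = {ch: min(max(a, bounds[ch][0]), bounds[ch][1])
               for ch, a in allocation.items()}
    # Step 2: rescale so the allocation conserves the total budget
    scale = total_budget / sum(clamped.values())
    return {ch: a * scale for ch, a in clamped.items()}

bounds = {"search": (10.0, 80.0), "social": (10.0, 80.0)}
fixed = repair({"search": 90.0, "social": 20.0}, total_budget=100.0, bounds=bounds)
```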
Key Files:
src/conversionflow/optimization/italy_optimizer.py - GA implementation
src/conversionflow/optimization/real_parameter_loader.py - Parameter conversion utilities
Attribution Layer
Data-Grounded Attribution
The system’s architecture is built on the principle of data-grounded attribution. This ensures that all financial projections are directly and defensibly tied to the scope of the data being analysed.
Methodology:
Scoped Modelling: The Bayesian model is built exclusively on tracked digital journey data, which accounts for a fraction (~5%) of total sales.
Scoped Optimisation: The genetic algorithm optimises the marketing budget based on the conversion probabilities learned only from this digital data.
Scoped Reporting: All business impact calculations and financial projections are consequently based on the portion of sales that can be reasonably attributed to these digital journeys.
This approach provides:
Analytical Integrity: It avoids making unsupported claims by extrapolating results from a small digital dataset to the entire offline sales volume.
Business Credibility: It delivers realistic and defensible projections of the value generated by optimising digital marketing spend.
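Scoped reporting in numbers (all figures below are hypothetical): a projected lift is applied only to the digitally tracked slice of sales, never extrapolated to total revenue:

```python
# Data-grounded attribution: the projection is scoped to tracked sales.

def scoped_projection(total_sales, tracked_share, projected_lift):
    tracked_sales = total_sales * tracked_share
    return tracked_sales * projected_lift

# A 10% optimisation lift on €2M of sales where only ~5% is digitally
# tracked is reported as a ~€10k impact, not €200k.
impact = scoped_projection(total_sales=2_000_000, tracked_share=0.05,
                           projected_lift=0.10)
```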
Visualisation and Reporting Layer
Multi-Format Output Generation
The system generates comprehensive reporting across multiple formats:
Executive Reporting:
Budget allocation tables
Performance improvement summaries
Implementation recommendations
Conservative attribution methodology explanation
Technical Documentation:
MCMC diagnostics and convergence metrics
Model validation and comparison statistics
Genetic algorithm convergence analysis
Sensitivity analysis results
Visualisation Suite:
Customer journey flow diagrams (Mermaid)
Budget allocation charts
Performance trend analysis
Attribution ceiling explanation graphics
Key Files:
src/conversionflow/visualization/charts.py - Chart generation
src/conversionflow/visualization/dag_mermaid.py - Journey flow diagrams
src/conversionflow/visualization/csv_exports.py - Structured data exports
src/conversionflow/core/console.py - Professional console output
Core Infrastructure
Configuration Management
Hierarchical YAML Configuration:
Default system settings
Environment-specific overrides
Model architecture definitions
User customisation layer
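Hierarchical override resolution (defaults < environment < user) amounts to a recursive dictionary merge. The keys shown are illustrative, not the actual configuration schema:

```python
# Sketch of hierarchical config resolution: later layers win per key,
# nested sections merge rather than replace wholesale.

def deep_merge(base, override):
    """Return base with override applied recursively; override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"mcmc": {"draws": 2000, "chains": 4}, "log_level": "INFO"}
user = {"mcmc": {"draws": 500}, "log_level": "DEBUG"}
config = deep_merge(defaults, user)
```

Note that overriding `mcmc.draws` leaves `mcmc.chains` at its default, which is the behaviour a layered YAML configuration needs.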
Logging and Monitoring
Comprehensive Observability:
Structured logging with configurable levels
Performance profiling and metrics collection
MCMC convergence monitoring
Business rule validation tracking
Caching System
Intelligent Performance Optimisation:
Content-addressable caching
Automatic cache invalidation
Large dataset chunking
Memory-efficient processing
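Content-addressable caching in miniature: the cache key is a hash of the input content itself, so any change to the data automatically produces a different key and invalidates the old entry. The actual implementation in cache.py may differ:

```python
# Content-addressable cache keys: identical content -> identical key,
# any byte-level change -> a different key.
import hashlib

def cache_key(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

key_a = cache_key(b"customer_id,touchpoint\nc1,search\n")
key_b = cache_key(b"customer_id,touchpoint\nc1,social\n")
```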
Numerical Stability
Robust Mathematical Implementation:
Automatic gradient clipping
Numerical precision management
Boundary condition handling
Convergence monitoring
Key Infrastructure Files:
src/conversionflow/core/config.py - Configuration management
src/conversionflow/core/logging_config.py - Logging framework
src/conversionflow/core/cache.py - Caching system
src/conversionflow/core/profiler.py - Performance monitoring
Performance Considerations
Computational Complexity
MCMC Sampling: O(samples × chains × model_complexity)
Genetic Algorithm: O(generations × population_size × touchpoints)
Parameter Conversion: O(touchpoints × parameters)
Memory Management
Streaming Data Processing: Chunk-based processing for large datasets
MCMC Memory: Configurable thinning and sample storage
Result Caching: Automatic cleanup of intermediate results
Scalability Bottlenecks
MCMC Convergence: Dominant time factor (~7 minutes)
Data Loading: I/O bound for very large datasets
Visualisation: Memory intensive for complex diagrams
Error Handling and Resilience
Fault Tolerance
MCMC Convergence Failures: Automatic retry with adjusted parameters
Data Quality Issues: Graceful degradation with warnings
Resource Constraints: Automatic configuration adjustment
Validation Framework
Data Validation: Comprehensive input checking
Model Validation: Convergence and quality diagnostics
Result Validation: Business rule compliance checking
Recovery Mechanisms
Checkpoint System: Resumable long-running computations
Configuration Validation: Early error detection
Graceful Degradation: Reduced functionality rather than failures
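A checkpoint system for resumable runs reduces to persisting enough state to pick up where a long MCMC or GA run left off. This sketch uses JSON for clarity; the file name and state fields are illustrative, not the project's actual checkpoint format:

```python
# Illustrative checkpoint save/resume for a long-running computation.
import json, os, tempfile

def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path, default):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default

path = os.path.join(tempfile.mkdtemp(), "run.ckpt")
save_checkpoint(path, {"generation": 120, "best_fitness": 0.42})
resumed = load_checkpoint(path, default={"generation": 0})
```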
This architecture provides a robust, scalable foundation for sophisticated marketing attribution analysis while maintaining clear separation between statistical optimisation and business communication concerns.