# Architecture Overview

## System Design Philosophy

conversionflow-aggregate is designed around the principle of **separation of concerns** between statistical optimisation and business attribution. This architecture enables maximum mathematical efficiency while maintaining credible business communication.

## High-Level Architecture

The system implements a **two-stage pipeline architecture** with a clear separation between parameter estimation and optimisation:

```
Raw Data → [Stage 1: Bayesian Estimation] → Parameters → [Stage 2: GA Optimisation] → Results
```

### Stage 1: Bayesian Parameter Estimation

- **Duration:** ~7 minutes
- **Technology:** PyMC (MCMC sampling)
- **Output:** Probabilistic model parameters with uncertainty quantification

### Stage 2: Genetic Algorithm Optimisation

- **Duration:** ~3-4 seconds
- **Technology:** Custom genetic algorithm implementation
- **Output:** Optimal budget allocations with conservative attribution reporting

## Detailed System Components

### Data Layer

#### Data Processing Pipeline

```
Raw Events → Validation → Transformation → Customer Journey Construction → Model Input
```

**Components:**

- **Data Loaders:** Multi-format ingestion (CSV, DuckDB, Excel, PostgreSQL)
- **Validators:** Data quality and consistency checking
- **Transformers:** Event aggregation and journey reconstruction
- **Cache Manager:** Intelligent caching for large dataset processing

**Key Files:**

- `src/conversionflow/data/italy_loader.py` - Italy-specific data processing
- `src/conversionflow/data/loaders.py` - Generic data loading framework
- `src/conversionflow/data/validators.py` - Data quality validation
- `src/conversionflow/core/cache.py` - Caching system

### Bayesian Modelling Layer

#### Probabilistic Network Architecture

The system models customer journeys as **Bayesian networks** where:

- **Nodes** represent marketing touchpoints
- **Edges** represent transition probabilities
- **Parameters** represent touchpoint effectiveness and interdependencies

#### MCMC Implementation

```python
import pymc as pm

# Conceptual model structure; prior_mean, prior_std, expected_conversions,
# and data stand in for values derived from the processed customer journeys
with pm.Model() as model:
    # Baseline touchpoint effectiveness
    beta0 = pm.Normal("beta0", mu=prior_mean, sigma=prior_std)

    # Budget sensitivity coefficients
    beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)

    # Parent touchpoint influences
    parent_effects = pm.Normal("parent_effects", mu=0, sigma=0.3)

    # Likelihood function
    likelihood = pm.Poisson("obs", mu=expected_conversions, observed=data)
```

**Key Features:**

- **Convergence Diagnostics:** Automatic R-hat and ESS monitoring
- **Model Comparison:** LOO-CV for model selection
- **Uncertainty Quantification:** Full posterior distributions
- **Numerical Stability:** Robust parameter estimation

**Key Files:**

- `src/conversionflow/models/bayesian.py` - Standard Poisson model
- `src/conversionflow/models/bayesian_hurdle.py` - Hurdle model for zero-inflation
- `src/conversionflow/models/parameter_export.py` - Parameter serialisation
- `src/conversionflow/core/numerical_stability.py` - Numerical robustness

### Optimisation Layer

#### Genetic Algorithm Implementation

The optimisation engine uses a **multi-objective genetic algorithm** designed specifically for marketing budget allocation:

```python
class ItalyGeneticOptimizer:
    def __init__(self, model_params, total_budget, constraints):
        self.model_params = model_params
        self.total_budget = total_budget
        self.constraints = constraints
        self.population_size = 100
        self.generations = 200
        self.elite_fraction = 0.1

    def fitness(self, individual):
        # Calculate expected conversions given the budget allocation:
        # each touchpoint's conversion probability scaled by its share
        return sum(self.model_params[touchpoint] * allocation
                   for touchpoint, allocation in individual.items())

    def optimize(self):
        # Standard GA loop with elitism and tournament selection
        pass
```

**Genetic Operators:**

- **Selection:** Tournament selection with configurable size
- **Crossover:** Uniform crossover preserving budget constraints
- **Mutation:** Gaussian mutation with boundary repair
- **Elitism:** Top performers preserved across generations

**Constraint Handling:**

- **Budget Conservation:** Allocations sum exactly to the total budget
- **Business Bounds:** Minimum/maximum allocations per channel
- **Operational Constraints:** Real-world business rules

**Key Files:**

- `src/conversionflow/optimization/italy_optimizer.py` - GA implementation
- `src/conversionflow/optimization/real_parameter_loader.py` - Parameter conversion utilities

### Attribution Layer

#### Data-Grounded Attribution

The system's architecture is built on the principle of **data-grounded attribution**, ensuring that all financial projections are directly and defensibly tied to the scope of the data being analysed.

**Methodology:**

1. **Scoped Modelling:** The Bayesian model is built exclusively on tracked digital journey data, which accounts for a fraction (~5%) of total sales.
2. **Scoped Optimisation:** The genetic algorithm optimises the marketing budget based on conversion probabilities learned *only* from this digital data.
3. **Scoped Reporting:** All business impact calculations and financial projections are consequently based on the portion of sales that can be reasonably attributed to these digital journeys.

This approach provides:

- **Analytical Integrity:** It avoids unsupported claims by never extrapolating results from a small digital dataset to the entire offline sales volume.
- **Business Credibility:** It delivers realistic, defensible projections of the value generated by optimising digital marketing spend.
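The scoped-reporting step can be sketched as a simple calculation. This is a minimal illustration, not the project's API: the function name and parameters are hypothetical, and the 12% uplift in the example is invented; only the ~5% tracked digital share comes from the methodology above.

```python
def scoped_projection(total_sales: float,
                      tracked_share: float,
                      modelled_uplift: float) -> float:
    """Project incremental value only over the tracked digital scope.

    total_sales:     total sales volume across all channels
    tracked_share:   fraction of sales covered by tracked digital journeys
    modelled_uplift: relative conversion improvement from the optimiser
    """
    digital_scope = total_sales * tracked_share
    # The uplift is applied to the digital scope only, never to the
    # full (mostly offline) sales volume
    return digital_scope * modelled_uplift


# Example: with 10M total sales and a 5% tracked share, a hypothetical
# 12% uplift is applied only to the 0.5M digital scope
projection = scoped_projection(10_000_000, 0.05, 0.12)
```

The point of the sketch is the scoping itself: the `tracked_share` factor keeps projections inside the data the model actually saw.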
### Visualisation and Reporting Layer

#### Multi-Format Output Generation

The system generates comprehensive reporting across multiple formats:

**Executive Reporting:**

- Budget allocation tables
- Performance improvement summaries
- Implementation recommendations
- Conservative attribution methodology explanation

**Technical Documentation:**

- MCMC diagnostics and convergence metrics
- Model validation and comparison statistics
- Genetic algorithm convergence analysis
- Sensitivity analysis results

**Visualisation Suite:**

- Customer journey flow diagrams (Mermaid)
- Budget allocation charts
- Performance trend analysis
- Attribution ceiling explanation graphics

**Key Files:**

- `src/conversionflow/visualization/charts.py` - Chart generation
- `src/conversionflow/visualization/dag_mermaid.py` - Journey flow diagrams
- `src/conversionflow/visualization/csv_exports.py` - Structured data exports
- `src/conversionflow/core/console.py` - Professional console output

## Core Infrastructure

### Configuration Management

**Hierarchical YAML Configuration:**

- Default system settings
- Environment-specific overrides
- Model architecture definitions
- User customisation layer

### Logging and Monitoring

**Comprehensive Observability:**

- Structured logging with configurable levels
- Performance profiling and metrics collection
- MCMC convergence monitoring
- Business rule validation tracking

### Caching System

**Intelligent Performance Optimisation:**

- Content-addressable caching
- Automatic cache invalidation
- Large dataset chunking
- Memory-efficient processing

### Numerical Stability

**Robust Mathematical Implementation:**

- Automatic gradient clipping
- Numerical precision management
- Boundary condition handling
- Convergence monitoring

**Key Infrastructure Files:**

- `src/conversionflow/core/config.py` - Configuration management
- `src/conversionflow/core/logging_config.py` - Logging framework
- `src/conversionflow/core/cache.py` - Caching system
- `src/conversionflow/core/profiler.py` - Performance monitoring

## Performance Considerations

### Computational Complexity

- **MCMC Sampling:** O(samples × chains × model_complexity)
- **Genetic Algorithm:** O(generations × population_size × touchpoints)
- **Parameter Conversion:** O(touchpoints × parameters)

### Memory Management

- **Streaming Data Processing:** Chunk-based processing for large datasets
- **MCMC Memory:** Configurable thinning and sample storage
- **Result Caching:** Automatic cleanup of intermediate results

### Scalability Bottlenecks

- **MCMC Convergence:** Dominant time factor (~7 minutes)
- **Data Loading:** I/O bound for very large datasets
- **Visualisation:** Memory intensive for complex diagrams

## Error Handling and Resilience

### Fault Tolerance

- **MCMC Convergence Failures:** Automatic retry with adjusted parameters
- **Data Quality Issues:** Graceful degradation with warnings
- **Resource Constraints:** Automatic configuration adjustment

### Validation Framework

- **Data Validation:** Comprehensive input checking
- **Model Validation:** Convergence and quality diagnostics
- **Result Validation:** Business rule compliance checking

### Recovery Mechanisms

- **Checkpoint System:** Resumable long-running computations
- **Configuration Validation:** Early error detection
- **Graceful Degradation:** Reduced functionality rather than failures

This architecture provides a robust, scalable foundation for sophisticated marketing attribution analysis while maintaining a clear separation between statistical optimisation and business communication concerns.
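As a closing illustration, the automatic-retry behaviour described under Fault Tolerance can be sketched as follows. This is a minimal sketch, not the project's implementation: `sample_fn`, the settings dictionary, and the escalation schedule are hypothetical; only the R-hat-based retry idea comes from the text above.

```python
def sample_with_retries(sample_fn, max_retries=3, rhat_threshold=1.01):
    """Rerun MCMC sampling with progressively safer settings until the
    worst R-hat diagnostic indicates convergence.

    sample_fn: callable accepting draws/target_accept keyword arguments
               and returning a dict with a "max_rhat" entry (hypothetical
               interface for this sketch).
    """
    settings = {"draws": 1000, "target_accept": 0.8}
    for attempt in range(max_retries):
        result = sample_fn(**settings)
        if result["max_rhat"] <= rhat_threshold:
            return result
        # Adjust parameters for the retry: smaller leapfrog steps
        # (higher target acceptance) and more draws
        settings["target_accept"] = min(0.99, settings["target_accept"] + 0.05)
        settings["draws"] *= 2
    raise RuntimeError(f"MCMC failed to converge after {max_retries} attempts")
```

The same escalation pattern (tighten `target_accept`, increase draws) is a common remedy for divergent or poorly mixed chains, which is why it fits the "automatic retry with adjusted parameters" behaviour described above.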