# Mathematical Methodology

## Overview

This document provides the complete mathematical specification of the `conversionflow-aggregate` two-stage pipeline. The methodology implements a **hierarchical Bayesian-optimisation framework** with conservative attribution reporting for marketing budget allocation.

## Stage 1: Bayesian Parameter Estimation

### 1.1 Problem Formulation

Let $\mathbf{Y} = \{Y_{ij}\}$ denote the observed count data, where:

- $i \in \{1, 2, \ldots, T\}$ indexes time periods (days)
- $j \in \{1, 2, \ldots, J\}$ indexes marketing touchpoints
- $Y_{ij} \in \mathbb{N}_0$ is the count of events for touchpoint $j$ on day $i$

The customer journey is modelled as a **directed acyclic graph (DAG)** $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where:

- $\mathcal{V} = \{v_1, v_2, \ldots, v_J\}$ represents touchpoints
- $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ represents causal relationships
- $\text{pa}(j) = \{k : (v_k, v_j) \in \mathcal{E}\}$ denotes the parent nodes of touchpoint $j$

### 1.2 Standard Poisson Model

#### Likelihood Specification

For each touchpoint $j$ and time period $i$:

$$Y_{ij} \sim \text{Poisson}(\lambda_{ij})$$

where the rate parameter follows a **log-linear specification**:

$$\log(\lambda_{ij}) = \alpha_j + \beta_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma_{kj} \log(1 + Y_{ik}) + \boldsymbol{\delta}_j^T \mathbf{w}_i$$

**Parameter Interpretation:**

- $\alpha_j$: baseline log-rate for touchpoint $j$
- $\beta_{1j}$: budget sensitivity coefficient (diminishing returns via the logarithm)
- $B_j$: budget allocation to touchpoint $j$
- $\kappa > 0$: budget scaling factor (default: 1000)
- $\gamma_{kj}$: influence coefficient from parent touchpoint $k$ to $j$
- $\boldsymbol{\delta}_j$: time-varying effect coefficients
- $\mathbf{w}_i$: time covariate vector (e.g., day-of-week indicators)

#### Prior Specifications

**Baseline Effects:**
$$\alpha_j \sim \mathcal{N}(\mu_{\alpha,j}, \sigma_{\alpha,j}^2)$$

**Budget Sensitivity:**
$$\beta_{1j} \sim \mathcal{N}(\mu_{\beta,j}, \sigma_{\beta,j}^2)$$

**Parent Influences:**
$$\gamma_{kj} \sim \mathcal{N}(\mu_{\gamma,kj}, \sigma_{\gamma,kj}^2) \quad \forall k \in \text{pa}(j)$$

**Time Effects:**
$$\boldsymbol{\delta}_j \sim \mathcal{N}(\mathbf{0}, \sigma_{\delta}^2 \mathbf{I})$$

**Default Hyperparameters:**

- $\mu_{\alpha,j} = 3.0, \sigma_{\alpha,j} = 1.5$ (baseline intercepts)
- $\mu_{\beta,j} = 1.0, \sigma_{\beta,j} = 0.5$ (budget sensitivity)
- $\mu_{\gamma,kj} = 0.0, \sigma_{\gamma,kj} = 1.0$ (parent effects)
- $\sigma_{\delta} = 1.5$ (time effects)
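To make the likelihood concrete, the following is a minimal PyMC sketch for a single touchpoint with observed parent counts and the default priors above. The function name, argument names, and data shapes are illustrative assumptions, not the pipeline's actual API.

```python
import numpy as np
import pymc as pm

# Minimal single-touchpoint sketch (illustrative names and shapes).
# y_j: (T,) observed counts; y_parents: (T, P) parent counts;
# W: (T, K) time covariates; budget_j, kappa: scalars.
def build_touchpoint_model(y_j, y_parents, W, budget_j, kappa=1000.0):
    with pm.Model() as model:
        alpha = pm.Normal("alpha", mu=3.0, sigma=1.5)           # baseline log-rate
        beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)           # budget sensitivity
        gamma = pm.Normal("gamma", mu=0.0, sigma=1.0,
                          shape=y_parents.shape[1])             # parent influences
        delta = pm.Normal("delta", mu=0.0, sigma=1.5,
                          shape=W.shape[1])                     # time effects

        log_rate = (alpha
                    + beta1 * np.log1p(budget_j / kappa)        # diminishing returns
                    + pm.math.dot(np.log1p(y_parents), gamma)   # parent terms
                    + pm.math.dot(W, delta))                    # day-of-week etc.

        pm.Poisson("y", mu=pm.math.exp(log_rate), observed=y_j)
    return model
```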
### 1.3 Hurdle Model (Zero-Inflated Poisson)

For count data with excess zeros, we employ a **two-stage hurdle model**.

#### Stage 1: Hurdle Component (Bernoulli Process)

$$H_{ij} \sim \text{Bernoulli}(\pi_{ij})$$

$$\text{logit}(\pi_{ij}) = \alpha^{(h)}_j + \sum_{k \in \text{pa}(j)} \gamma^{(h)}_{kj} \mathbb{I}(Y_{ik} > 0) + (\boldsymbol{\delta}^{(h)}_j)^T \mathbf{w}_i$$

where $\mathbb{I}(\cdot)$ is the indicator function and $\pi_{ij}$ is the probability of any activity occurring.

#### Stage 2: Count Component (Truncated Poisson)

$$Y_{ij} \mid H_{ij} = 1 \sim \text{TruncatedPoisson}(\mu_{ij}, \text{lower}=1)$$

$$\log(\mu_{ij}) = \alpha^{(c)}_j + \beta^{(c)}_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma^{(c)}_{kj} \log(1 + Y_{ik}) + (\boldsymbol{\delta}^{(c)}_j)^T \mathbf{w}_i$$

#### Combined Likelihood

The complete-data likelihood becomes:

$$Y_{ij} \sim \text{ZeroInflatedPoisson}(\psi_{ij}, \mu_{ij})$$

where:

- $\psi_{ij} = 1 - \pi_{ij}$ (excess-zero probability)
- $\mu_{ij}$ is the Poisson rate when active

#### Hurdle Model Priors

**Hurdle Component:**
$$\alpha^{(h)}_j \sim \mathcal{N}(0, 1.5^2), \quad \gamma^{(h)}_{kj} \sim \mathcal{N}(0, 1^2)$$

**Count Component:**
$$\alpha^{(c)}_j \sim \mathcal{N}(2, 1.5^2), \quad \gamma^{(c)}_{kj} \sim \text{HalfCauchy}(5)$$

### 1.4 Posterior Inference

#### MCMC Sampling

Posterior inference uses **Hamiltonian Monte Carlo (HMC)** via PyMC.

**Sampling Configuration:**

- Draws: $S = 2000$ (production: 4000)
- Tuning steps: 1000 (production: 2000)
- Chains: $C = 4$ (production: 8)
- Target acceptance rate: $\rho = 0.9$ (production: 0.95)
- Maximum tree depth: $d_{\max} = 15$

#### Convergence Diagnostics

**R-hat Statistic:**

$$\hat{R} = \sqrt{\frac{\hat{V}^+}{\hat{W}}}$$

where $\hat{V}^+$ is the pooled posterior variance estimate and $\hat{W}$ is the within-chain variance.

**Convergence Criterion:** $\hat{R} < 1.1$ for all parameters.

**Effective Sample Size:**

$$\text{ESS} = \frac{CS}{1 + 2\sum_{t=1}^{\infty} \rho_t}$$

where $\rho_t$ is the lag-$t$ autocorrelation (the sum is truncated in practice).

**Quality Criterion:** $\text{ESS} > 400$ for all parameters.

#### Model Comparison

**Leave-One-Out Cross-Validation (LOO-CV):**

$$\text{ELPD}_{\text{LOO}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i})$$

where $p(y_i \mid y_{-i})$ is the leave-one-out predictive density, approximated using Pareto-smoothed importance sampling.
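A minimal sketch of how these diagnostics might be computed with PyMC and ArviZ, assuming `model` is built as in the earlier single-touchpoint sketch; the thresholds follow the criteria above, and the `idata_kwargs` flag requests the pointwise log-likelihoods that LOO-CV needs.

```python
import arviz as az
import pymc as pm

# `model` is assumed to come from build_touchpoint_model(...) above.
with model:
    idata = pm.sample(
        draws=2000, tune=1000, chains=4,
        target_accept=0.9,
        idata_kwargs={"log_likelihood": True},  # needed for az.loo
    )

# Convergence and quality checks against the stated criteria.
summary = az.summary(idata)
assert summary["r_hat"].max() < 1.1, "R-hat criterion violated"
assert summary["ess_bulk"].min() > 400, "ESS criterion violated"

# PSIS-LOO estimate of the expected log pointwise predictive density.
loo = az.loo(idata)
print(loo.elpd_loo)
```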
### 1.5 Parameter Export

The posterior samples are summarised into point estimates with uncertainty quantification. For each parameter $\theta$, we compute:

- **Point Estimate:** $\hat{\theta} = \mathbb{E}[\theta \mid \mathbf{Y}]$ (posterior mean)
- **Uncertainty:** $\text{SD}(\theta) = \sqrt{\text{Var}[\theta \mid \mathbf{Y}]}$ (posterior standard deviation)
- **Credible Intervals:** $(\theta_{\alpha/2}, \theta_{1-\alpha/2})$ with $\alpha = 0.05$

**Export Format** (schematic; Greek symbols stand in for numeric values):

```json
{
  "parameters": {
    "touchpoint_j": {
      "beta0": {"mean": α̂_j, "std": SD(α_j)},
      "beta1": {"mean": β̂_{1j}, "std": SD(β_{1j})},
      "parents": ["touchpoint_k", ...],
      "parent_coeffs": [
        {"mean": γ̂_{kj}, "std": SD(γ_{kj})},
        ...
      ],
      "alpha": α̂_j  // Conversion value weight
    }
  },
  "diagnostics": {
    "elpd_loo": ELPD_LOO,
    "rhat_max": max(R̂),
    "ess_min": min(ESS)
  }
}
```

## Stage 2: Genetic Algorithm Optimisation

### 2.1 Problem Formulation

**Decision Variables:** Let $\mathbf{b} = (b_1, b_2, \ldots, b_J)^T$, where $b_j \geq 0$ is the budget allocation to touchpoint $j$.

**Budget Constraint:**
$$\sum_{j=1}^{J} b_j = B_{\text{total}}$$

**Box Constraints:**
$$b_{\text{min},j} \leq b_j \leq b_{\text{max},j} \quad \forall j$$

### 2.2 Objective Function

#### Expected Conversion Calculation

For a given budget allocation $\mathbf{b}$, the expected conversion probability for touchpoint $j$ is:

$$p_j(\mathbf{b}) = \sigma\left(\hat{\alpha}_j + \hat{\beta}_{1j} \log\left(1 + \frac{b_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \hat{\gamma}_{kj} \, p_k(\mathbf{b})\right)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. For overflow protection, its argument is clipped before evaluation:

$$\sigma_{\text{clip}}(z) = \sigma(\max(-500, \min(500, z)))$$

Because $\mathcal{G}$ is acyclic, the $p_j(\mathbf{b})$ are well defined and can be evaluated in a single pass over touchpoints in topological order.

#### Fitness Function

The optimisation objective maximises expected total conversion value:

$$f(\mathbf{b}) = \sum_{j=1}^{J} \alpha_j \cdot p_j(\mathbf{b}) - \Phi(\mathbf{b})$$

where:

- $\alpha_j$ is the conversion value weight for touchpoint $j$
- $\Phi(\mathbf{b})$ collects penalty terms for constraint violations

#### Penalty Function

$$\Phi(\mathbf{b}) = \lambda_{\text{min}} \sum_{j=1}^{J} \max(0, b_{\text{min},j} - b_j) + \lambda_{\text{business}} \Psi_{\text{business}}(\mathbf{b})$$

where:

- $\lambda_{\text{min}} > 0$ penalises under-budgeted touchpoints
- $\Psi_{\text{business}}(\mathbf{b})$ enforces business-specific constraints
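The following sketch evaluates $p_j(\mathbf{b})$ in topological order and assembles the penalised fitness. The dictionary layout mirrors the export schema of section 1.5; the function and argument names are illustrative assumptions, and only the minimum-allocation term of $\Phi(\mathbf{b})$ is shown.

```python
import numpy as np

def sigmoid(z):
    """Overflow-protected sigmoid: clip the argument to [-500, 500]."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def fitness(b, params, topo_order, kappa=1000.0, b_min=None, lam_min=1000.0):
    """Penalised fitness f(b); `params` mirrors the export schema of 1.5."""
    p = {}
    for j in topo_order:  # DAG: parents always precede children
        tp = params[j]
        z = tp["beta0"]["mean"] + tp["beta1"]["mean"] * np.log1p(b[j] / kappa)
        z += sum(coef["mean"] * p[k]
                 for k, coef in zip(tp["parents"], tp["parent_coeffs"]))
        p[j] = sigmoid(z)

    value = sum(params[j]["alpha"] * p[j] for j in topo_order)
    penalty = 0.0
    if b_min is not None:  # under-budget term of Phi(b)
        penalty = lam_min * sum(max(0.0, b_min[j] - b[j]) for j in topo_order)
    return value - penalty
```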
### 2.3 Genetic Algorithm Specification

#### Population Representation

Each individual $\mathbf{x}^{(i)} \in \mathbb{R}^J$ represents a budget allocation satisfying:

$$\mathbf{x}^{(i)} \in \mathcal{F} = \left\{\mathbf{b} \in \mathbb{R}_+^J : \sum_{j=1}^{J} b_j = B_{\text{total}}, \; b_{\text{min},j} \leq b_j \leq b_{\text{max},j}\right\}$$

#### Initialisation

**Importance-Based Sampling:** Initial population members are generated as:

$$b_j^{(0)} = \frac{w_j}{\sum_{k=1}^{J} w_k} B_{\text{total}} + \epsilon_j$$

where:

- $w_j$ is the importance weight for touchpoint $j$
- $\epsilon_j \sim \mathcal{N}(0, \sigma_{\text{init}}^2)$ adds diversity
- the result is projected onto $\mathcal{F}$ via constraint enforcement

#### Selection Operator

**Tournament Selection:** For tournament size $k$, the parent is the fittest member of a random subset:

$$\mathbf{x}^{\text{parent}} = \arg\max_{\mathbf{x} \in \mathcal{T}} f(\mathbf{x})$$

where $\mathcal{T}$ is a random subset of size $k$ drawn from the current population.

#### Crossover Operator

**Arithmetic (Blend) Crossover with Constraint Repair:** For parents $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}$, generate offspring:

$$\mathbf{x}^{\text{child}} = \alpha \mathbf{x}^{(1)} + (1-\alpha) \mathbf{x}^{(2)}$$

where $\alpha \sim \text{Uniform}(0, 1)$.

**Constraint Repair:** Apply the projection $\Pi_{\mathcal{F}}(\mathbf{x}^{\text{child}}) \in \mathcal{F}$ via:

1. **Bound Enforcement:** $\tilde{b}_j = \max(b_{\text{min},j}, \min(b_{\text{max},j}, b_j))$
2. **Budget Normalisation:** $b_j^* = \tilde{b}_j \cdot \frac{B_{\text{total}}}{\sum_{k=1}^{J} \tilde{b}_k}$
3. **Iterative Adjustment:** If constraints remain violated, apply iterative rebalancing

#### Mutation Operator

**Budget Reallocation Mutation:** With probability $p_m$, apply:

$$b_j^{\text{new}} = b_j + \Delta_j$$

where $\sum_{j=1}^{J} \Delta_j = 0$ (budget conservation) and $\Delta_j$ follows a budget transfer scheme:

1. **Transfer Selection:** Choose donor-recipient pairs with probability proportional to current allocations
2. **Transfer Amount:** $|\Delta_j| \sim \text{Uniform}(0.05\, b_j, 0.3\, b_j)$
3. **Constraint Repair:** Apply $\Pi_{\mathcal{F}}(\cdot)$ (see the sketch after this section)

#### Evolutionary Parameters

**Standard Configuration:**

- Population size: $N = 100$
- Generations: $G = 200$
- Tournament size: $k = 5$
- Crossover rate: $p_c = 0.8$
- Mutation rate: $p_m = 0.15$
- Elite fraction: $\eta = 0.1$
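A minimal sketch of the projection $\Pi_{\mathcal{F}}$ used by both crossover and mutation, assuming NumPy arrays for allocations and bounds. The rebalancing rule in step 3, along with the iteration cap and tolerance, is one plausible reading of "iterative adjustment", not necessarily the pipeline's implementation.

```python
import numpy as np

def project_onto_feasible(b, b_min, b_max, B_total, max_iter=100, tol=1e-9):
    """Project an allocation onto F: bounds plus total-budget equality."""
    b = np.asarray(b, dtype=float)
    for _ in range(max_iter):
        b = np.clip(b, b_min, b_max)          # 1. bound enforcement
        b *= B_total / b.sum()                # 2. budget normalisation
        within = (b >= b_min - tol) & (b <= b_max + tol)
        if within.all() and abs(b.sum() - B_total) < tol:
            return b                          # feasible: done
        # 3. iterative adjustment: re-clip, then redistribute the budget
        #    residual over components that still have slack.
        clipped = np.clip(b, b_min, b_max)
        residual = B_total - clipped.sum()
        slack = (b_max - clipped) if residual > 0 else (clipped - b_min)
        if slack.sum() > 0:
            clipped += residual * slack / slack.sum()
        b = clipped
    return np.clip(b, b_min, b_max)           # best effort after max_iter
```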
### 2.4 Convergence Criteria

**Fitness-Based Stopping:** The algorithm terminates when:

$$\frac{f_{\max}^{(g)} - f_{\max}^{(g-h)}}{|f_{\max}^{(g-h)}|} < \epsilon_{\text{conv}}$$

holds over the patience window, where:

- $f_{\max}^{(g)}$ is the best fitness in generation $g$
- $h = 50$ (patience parameter)
- $\epsilon_{\text{conv}} = 0.001$ (convergence threshold)

## Stage Interface: Parameter Conversion

### 3.1 Bayesian to GA Parameter Mapping

The Stage 1 posterior estimates are converted to Stage 2 optimisation parameters.

**Direct Mapping:**

- $\hat{\alpha}_j \leftarrow \mathbb{E}[\alpha_j \mid \mathbf{Y}]$ (baseline effects)
- $\hat{\beta}_{1j} \leftarrow \mathbb{E}[\beta_{1j} \mid \mathbf{Y}]$ (budget sensitivities)
- $\hat{\gamma}_{kj} \leftarrow \mathbb{E}[\gamma_{kj} \mid \mathbf{Y}]$ (parent influences)

**Uncertainty Propagation:** For robust optimisation, parameter uncertainty can be incorporated by:

1. **Monte Carlo Sampling:** Draw $\{\theta^{(s)}\}_{s=1}^{S}$ from the posterior
2. **Stochastic Fitness:** $f(\mathbf{b}) = \frac{1}{S} \sum_{s=1}^{S} f(\mathbf{b}; \theta^{(s)})$

### 3.2 Constraint Specification

**Business Constraints:**

- Minimum allocation: $b_{\text{min},j} = \max(10{,}000,\; 0.001 \cdot B_{\text{total}})$
- Maximum allocation: $b_{\text{max},j} = 0.95 \cdot B_{\text{total}}$
- Category limits: $\sum_{j \in \mathcal{C}_k} b_j \leq \beta_k B_{\text{total}}$ for channel categories $\mathcal{C}_k$

## Data-Grounded Attribution Framework

### 4.1 Principle of Scoped Projections

A core principle of the `conversionflow-aggregate` methodology is that all financial projections must be directly and defensibly tied to the scope of the data being analysed. This ensures analytical integrity and produces credible, realistic business insights.

### 4.2 The Digital Attribution Challenge

In many real-world scenarios, particularly in markets like luxury automotive, the available digital data (e.g., website interactions, ad clicks) captures only a small fraction of the total customer journey. For the Italy market analysis, this is a critical consideration:

- **Digital Data Scope:** The model is built using data from digital touchpoints.
- **Sales Data Scope:** This digital data is linked to only **~5% of total vehicle sales**. The remaining 95% of sales occur through offline channels (e.g., dealer relationships, walk-ins) that are not present in the dataset.

### 4.3 Methodological Solution

To avoid making unsupported claims, the methodology strictly aligns the scope of the analysis with the scope of the data:

1. **Model Scope:** The Bayesian network is built exclusively on the tracked digital journey data. It learns the conversion probabilities *within this digital ecosystem*.
2. **Optimisation Scope:** The genetic algorithm optimises the marketing budget based on the conversion probabilities learned from the digital-only data. Its goal is to maximise conversions *within the population of digitally engaged users*.
3. **Business Impact Scope:** Consequently, all financial projections, such as the "expected additional revenue" from the uncertainty analysis, are calculated based on the **portion of sales that can be reasonably attributed to these digital journeys**.

**Example Calculation:**

- **Total Annual Sales:** 5,067 units
- **Digitally Attributable Sales (Analysis Scope):** $5067 \times 0.05 \approx 253$ units
- **Optimisation Result:** A 5.65% improvement in digital conversion efficiency
- **Business Impact Calculation:** The 5.65% improvement is applied to the revenue from **~253 cars**, not the total 5,067 cars

This approach ensures that the system provides a realistic estimate of the value generated by optimising the digital marketing spend, rather than making speculative claims about its impact on the entire sales landscape.

## Mathematical Assumptions

### 5.1 Key Modelling Assumptions

1. **DAG Structure:** Customer journeys follow a directed acyclic graph with no cycles
2. **Poisson Counts:** Event counts are Poisson-distributed conditional on rate parameters
3. **Log-Linear Effects:** Budget and parent influences enter log-linearly
4. **Diminishing Returns:** Budget effects follow the $\log(1 + b/\kappa)$ form
5. **Independence:** Counts are conditionally independent given parameters and structure
6. **Stationarity:** Parameters are constant within the modelling period
7. **Additive Effects:** Parent influences combine additively in the log-rate

### 5.2 Convergence Guarantees

**MCMC Convergence:** Under regularity conditions (log-concave posteriors, bounded parameter spaces), HMC converges to the target posterior distribution.

**GA Convergence:** The genetic algorithm converges to a local optimum with probability 1 under:

- positive mutation rates
- elite preservation
- a finite feasible region

**Global Optimality:** No guarantee of a global optimum is possible due to the non-convex objective function. Multiple runs with different random seeds are recommended for robustness.

## Implementation Notes

### 6.1 Numerical Stability

**Overflow Protection** (a sketch of these guards appears at the end of this document):

- Sigmoid arguments clipped to $[-500, 500]$
- Log-sum-exp tricks for stable probability calculations
- Regularisation terms for near-singular matrices

**Constraint Handling:**

- Iterative projection algorithms for budget conservation
- Feasibility restoration via quadratic programming
- Numerical tolerance: $\epsilon_{\text{tol}} = 10^{-9}$

### 6.2 Computational Complexity

**Stage 1 (MCMC):** $\mathcal{O}(S \cdot C \cdot J^2 \cdot T)$, where $S$ is the number of samples, $C$ the number of chains, $J$ the number of touchpoints, and $T$ the number of time periods.

**Stage 2 (GA):** $\mathcal{O}(G \cdot N \cdot J^2)$, where $G$ is the number of generations and $N$ the population size.

**Total Pipeline:** Dominated by MCMC sampling (typically ~7 minutes vs ~3 seconds for the GA).
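As referenced in section 6.1, a minimal sketch of the numerical-stability guards; the function names are illustrative assumptions, and each helper corresponds to one bullet under Overflow Protection.

```python
import numpy as np
from scipy.special import logsumexp

def stable_sigmoid(z):
    """Sigmoid with the argument clipped to [-500, 500] to avoid overflow."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def normalise_log_weights(log_w):
    """Convert unnormalised log-weights to probabilities via log-sum-exp."""
    return np.exp(log_w - logsumexp(log_w))

def regularise(cov, eps=1e-9):
    """Add a small ridge to a near-singular covariance matrix."""
    return cov + eps * np.eye(cov.shape[0])
```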