# Mathematical Methodology

## Overview

This document provides the complete mathematical specification of the `conversionflow-aggregate` two-stage pipeline. The methodology implements a **hierarchical Bayesian-optimisation framework** with conservative attribution reporting for marketing budget allocation.

## Stage 1: Bayesian Parameter Estimation

### 1.1 Problem Formulation

Let $\mathbf{Y} = \{Y_{ij}\}$ denote the observed count data, where:

- $i \in \{1, 2, \ldots, T\}$ indexes time periods (days)
- $j \in \{1, 2, \ldots, J\}$ indexes marketing touchpoints
- $Y_{ij} \in \mathbb{N}_0$ is the count of events for touchpoint $j$ on day $i$

The customer journey is modelled as a **directed acyclic graph (DAG)** $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where:

- $\mathcal{V} = \{v_1, v_2, \ldots, v_J\}$ represents touchpoints
- $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ represents causal relationships
- $\text{pa}(j) = \{k : (v_k, v_j) \in \mathcal{E}\}$ denotes the parent nodes of touchpoint $j$

### 1.2 Standard Poisson Model

#### Likelihood Specification

For each touchpoint $j$ and time period $i$:

$$Y_{ij} \sim \text{Poisson}(\lambda_{ij})$$

where the rate parameter follows a **log-linear specification**:

$$\log(\lambda_{ij}) = \alpha_j + \beta_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma_{kj} \log(1 + Y_{ik}) + \boldsymbol{\delta}_j^T \mathbf{w}_i$$

**Parameter Interpretation:**

- $\alpha_j$: baseline log-rate for touchpoint $j$
- $\beta_{1j}$: budget sensitivity coefficient (diminishing returns via the logarithm)
- $B_j$: budget allocation to touchpoint $j$
- $\kappa > 0$: budget scaling factor (default: 1000)
- $\gamma_{kj}$: influence coefficient from parent touchpoint $k$ to $j$
- $\boldsymbol{\delta}_j$: time-varying effect coefficients
- $\mathbf{w}_i$: time covariate vector (e.g., day-of-week indicators)

#### Prior Specifications

**Baseline Effects:**
$$\alpha_j \sim \mathcal{N}(\mu_{\alpha,j}, \sigma_{\alpha,j}^2)$$

**Budget Sensitivity:**
$$\beta_{1j} \sim \mathcal{N}(\mu_{\beta,j}, \sigma_{\beta,j}^2)$$

**Parent Influences:**
$$\gamma_{kj} \sim \mathcal{N}(\mu_{\gamma,kj}, \sigma_{\gamma,kj}^2) \quad \forall k \in \text{pa}(j)$$

**Time Effects:**
$$\boldsymbol{\delta}_j \sim \mathcal{N}(\mathbf{0}, \sigma_{\delta}^2 \mathbf{I})$$

**Default Hyperparameters:**

- $\mu_{\alpha,j} = 3.0, \sigma_{\alpha,j} = 1.5$ (baseline intercepts)
- $\mu_{\beta,j} = 1.0, \sigma_{\beta,j} = 0.5$ (budget sensitivity)
- $\mu_{\gamma,kj} = 0.0, \sigma_{\gamma,kj} = 1.0$ (parent effects)
- $\sigma_{\delta} = 1.5$ (time effects)
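To make the likelihood concrete, the following is a minimal PyMC sketch for a single touchpoint with observed parent counts and the default priors above. The function name, argument names, and data shapes are illustrative assumptions, not the pipeline's actual API.

```python
import numpy as np
import pymc as pm

# Minimal single-touchpoint sketch (illustrative names and shapes).
# y_j: (T,) observed counts; y_parents: (T, P) parent counts;
# W: (T, K) time covariates; budget_j, kappa: scalars.
def build_touchpoint_model(y_j, y_parents, W, budget_j, kappa=1000.0):
    with pm.Model() as model:
        alpha = pm.Normal("alpha", mu=3.0, sigma=1.5)           # baseline log-rate
        beta1 = pm.Normal("beta1", mu=1.0, sigma=0.5)           # budget sensitivity
        gamma = pm.Normal("gamma", mu=0.0, sigma=1.0,
                          shape=y_parents.shape[1])             # parent influences
        delta = pm.Normal("delta", mu=0.0, sigma=1.5,
                          shape=W.shape[1])                     # time effects

        log_rate = (alpha
                    + beta1 * np.log1p(budget_j / kappa)        # diminishing returns
                    + pm.math.dot(np.log1p(y_parents), gamma)   # parent terms
                    + pm.math.dot(W, delta))                    # day-of-week etc.

        pm.Poisson("y", mu=pm.math.exp(log_rate), observed=y_j)
    return model
```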
### 1.3 Hurdle Model (Zero-Inflated Poisson)

For count data with excess zeros, we employ a **two-stage hurdle model**.

#### Stage 1: Hurdle Component (Bernoulli Process)

$$H_{ij} \sim \text{Bernoulli}(\pi_{ij})$$

$$\text{logit}(\pi_{ij}) = \alpha^{(h)}_j + \sum_{k \in \text{pa}(j)} \gamma^{(h)}_{kj} \mathbb{I}(Y_{ik} > 0) + (\boldsymbol{\delta}^{(h)}_j)^T \mathbf{w}_i$$

where $\mathbb{I}(\cdot)$ is the indicator function and $\pi_{ij}$ is the probability of any activity occurring.

#### Stage 2: Count Component (Truncated Poisson)

$$Y_{ij} \mid H_{ij} = 1 \sim \text{TruncatedPoisson}(\mu_{ij}, \text{lower}=1)$$

$$\log(\mu_{ij}) = \alpha^{(c)}_j + \beta^{(c)}_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma^{(c)}_{kj} \log(1 + Y_{ik}) + (\boldsymbol{\delta}^{(c)}_j)^T \mathbf{w}_i$$

#### Combined Likelihood

The complete-data likelihood becomes:

$$Y_{ij} \sim \text{ZeroInflatedPoisson}(\psi_{ij}, \mu_{ij})$$

where:

- $\psi_{ij} = 1 - \pi_{ij}$ (excess-zero probability)
- $\mu_{ij}$ is the Poisson rate when active

#### Hurdle Model Priors

**Hurdle Component:**
$$\alpha^{(h)}_j \sim \mathcal{N}(0, 1.5^2), \quad \gamma^{(h)}_{kj} \sim \mathcal{N}(0, 1^2)$$

**Count Component:**
$$\alpha^{(c)}_j \sim \mathcal{N}(2, 1.5^2), \quad \gamma^{(c)}_{kj} \sim \text{HalfCauchy}(5)$$

### 1.4 Posterior Inference

#### MCMC Sampling

Posterior inference uses **Hamiltonian Monte Carlo (HMC)** via PyMC.

**Sampling Configuration:**

- Draws: $S = 2000$ (production: 4000)
- Tuning steps: 1000 (production: 2000)
- Chains: $C = 4$ (production: 8)
- Target acceptance rate: $\rho = 0.9$ (production: 0.95)
- Maximum tree depth: $d_{\max} = 15$

#### Convergence Diagnostics

**R-hat Statistic:**

$$\hat{R} = \sqrt{\frac{\hat{V}^+}{\hat{W}}}$$

where $\hat{V}^+$ is the pooled posterior variance estimate and $\hat{W}$ is the within-chain variance.

**Convergence Criterion:** $\hat{R} < 1.1$ for all parameters.

**Effective Sample Size:**

$$\text{ESS} = \frac{CS}{1 + 2\sum_{t=1}^{\infty} \rho_t}$$

where $\rho_t$ is the lag-$t$ autocorrelation (the sum is truncated in practice).

**Quality Criterion:** $\text{ESS} > 400$ for all parameters.

#### Model Comparison

**Leave-One-Out Cross-Validation (LOO-CV):**

$$\text{ELPD}_{\text{LOO}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i})$$

where $p(y_i \mid y_{-i})$ is the leave-one-out predictive density, approximated using Pareto-smoothed importance sampling.
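A minimal sketch of how these diagnostics might be computed with PyMC and ArviZ, assuming `model` is built as in the earlier single-touchpoint sketch; the thresholds follow the criteria above, and the `idata_kwargs` flag requests the pointwise log-likelihoods that LOO-CV needs.

```python
import arviz as az
import pymc as pm

# `model` is assumed to come from build_touchpoint_model(...) above.
with model:
    idata = pm.sample(
        draws=2000, tune=1000, chains=4,
        target_accept=0.9,
        idata_kwargs={"log_likelihood": True},  # needed for az.loo
    )

# Convergence and quality checks against the stated criteria.
summary = az.summary(idata)
assert summary["r_hat"].max() < 1.1, "R-hat criterion violated"
assert summary["ess_bulk"].min() > 400, "ESS criterion violated"

# PSIS-LOO estimate of the expected log pointwise predictive density.
loo = az.loo(idata)
print(loo.elpd_loo)
```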
### 1.5 Parameter Export

The posterior samples are summarised into point estimates with uncertainty quantification. For each parameter $\theta$, we compute:

- **Point Estimate:** $\hat{\theta} = \mathbb{E}[\theta \mid \mathbf{Y}]$ (posterior mean)
- **Uncertainty:** $\text{SD}(\theta) = \sqrt{\text{Var}[\theta \mid \mathbf{Y}]}$ (posterior standard deviation)
- **Credible Intervals:** $(\theta_{\alpha/2}, \theta_{1-\alpha/2})$ with $\alpha = 0.05$

**Export Format** (schematic; Greek symbols stand in for numeric values):

```json
{
  "parameters": {
    "touchpoint_j": {
      "beta0": {"mean": α̂_j, "std": SD(α_j)},
      "beta1": {"mean": β̂_{1j}, "std": SD(β_{1j})},
      "parents": ["touchpoint_k", ...],
      "parent_coeffs": [
        {"mean": γ̂_{kj}, "std": SD(γ_{kj})},
        ...
      ],
      "alpha": α̂_j  // Conversion value weight
    }
  },
  "diagnostics": {
    "elpd_loo": ELPD_LOO,
    "rhat_max": max(R̂),
    "ess_min": min(ESS)
  }
}
```

## Stage 2: Genetic Algorithm Optimisation

### 2.1 Problem Formulation

**Decision Variables:** Let $\mathbf{b} = (b_1, b_2, \ldots, b_J)^T$, where $b_j \geq 0$ is the budget allocation to touchpoint $j$.

**Budget Constraint:**
$$\sum_{j=1}^{J} b_j = B_{\text{total}}$$

**Box Constraints:**
$$b_{\text{min},j} \leq b_j \leq b_{\text{max},j} \quad \forall j$$

### 2.2 Objective Function

#### Expected Conversion Calculation

For a given budget allocation $\mathbf{b}$, the expected conversion probability for touchpoint $j$ is:

$$p_j(\mathbf{b}) = \sigma\left(\hat{\alpha}_j + \hat{\beta}_{1j} \log\left(1 + \frac{b_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \hat{\gamma}_{kj} \, p_k(\mathbf{b})\right)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. For overflow protection, its argument is clipped before evaluation:

$$\sigma_{\text{clip}}(z) = \sigma(\max(-500, \min(500, z)))$$

Because $\mathcal{G}$ is acyclic, the $p_j(\mathbf{b})$ are well defined and can be evaluated in a single pass over touchpoints in topological order.

#### Fitness Function

The optimisation objective maximises expected total conversion value:

$$f(\mathbf{b}) = \sum_{j=1}^{J} \alpha_j \cdot p_j(\mathbf{b}) - \Phi(\mathbf{b})$$

where:

- $\alpha_j$ is the conversion value weight for touchpoint $j$
- $\Phi(\mathbf{b})$ collects penalty terms for constraint violations

#### Penalty Function

$$\Phi(\mathbf{b}) = \lambda_{\text{min}} \sum_{j=1}^{J} \max(0, b_{\text{min},j} - b_j) + \lambda_{\text{business}} \Psi_{\text{business}}(\mathbf{b})$$

where:

- $\lambda_{\text{min}} > 0$ penalises under-budgeted touchpoints
- $\Psi_{\text{business}}(\mathbf{b})$ enforces business-specific constraints
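The following sketch evaluates $p_j(\mathbf{b})$ in topological order and assembles the penalised fitness. The dictionary layout mirrors the export schema of section 1.5; the function and argument names are illustrative assumptions, and only the minimum-allocation term of $\Phi(\mathbf{b})$ is shown.

```python
import numpy as np

def sigmoid(z):
    """Overflow-protected sigmoid: clip the argument to [-500, 500]."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def fitness(b, params, topo_order, kappa=1000.0, b_min=None, lam_min=1000.0):
    """Penalised fitness f(b); `params` mirrors the export schema of 1.5."""
    p = {}
    for j in topo_order:  # DAG: parents always precede children
        tp = params[j]
        z = tp["beta0"]["mean"] + tp["beta1"]["mean"] * np.log1p(b[j] / kappa)
        z += sum(coef["mean"] * p[k]
                 for k, coef in zip(tp["parents"], tp["parent_coeffs"]))
        p[j] = sigmoid(z)

    value = sum(params[j]["alpha"] * p[j] for j in topo_order)
    penalty = 0.0
    if b_min is not None:  # under-budget term of Phi(b)
        penalty = lam_min * sum(max(0.0, b_min[j] - b[j]) for j in topo_order)
    return value - penalty
```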
### 2.3 Genetic Algorithm Specification

#### Population Representation

Each individual $\mathbf{x}^{(i)} \in \mathbb{R}^J$ represents a budget allocation satisfying:

$$\mathbf{x}^{(i)} \in \mathcal{F} = \left\{\mathbf{b} \in \mathbb{R}_+^J : \sum_{j=1}^{J} b_j = B_{\text{total}}, \; b_{\text{min},j} \leq b_j \leq b_{\text{max},j}\right\}$$

#### Initialisation

**Importance-Based Sampling:** Initial population members are generated as:

$$b_j^{(0)} = \frac{w_j}{\sum_{k=1}^{J} w_k} B_{\text{total}} + \epsilon_j$$

where:

- $w_j$ is the importance weight for touchpoint $j$
- $\epsilon_j \sim \mathcal{N}(0, \sigma_{\text{init}}^2)$ adds diversity
- the result is projected onto $\mathcal{F}$ via constraint enforcement

#### Selection Operator

**Tournament Selection:** For tournament size $k$, the parent is the fittest member of a random subset:

$$\mathbf{x}^{\text{parent}} = \arg\max_{\mathbf{x} \in \mathcal{T}} f(\mathbf{x})$$

where $\mathcal{T}$ is a random subset of size $k$ drawn from the current population.

#### Crossover Operator

**Arithmetic (Blend) Crossover with Constraint Repair:** For parents $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}$, generate offspring:

$$\mathbf{x}^{\text{child}} = \alpha \mathbf{x}^{(1)} + (1-\alpha) \mathbf{x}^{(2)}$$

where $\alpha \sim \text{Uniform}(0, 1)$.

**Constraint Repair:** Apply the projection $\Pi_{\mathcal{F}}(\mathbf{x}^{\text{child}}) \in \mathcal{F}$ via:

1. **Bound Enforcement:** $\tilde{b}_j = \max(b_{\text{min},j}, \min(b_{\text{max},j}, b_j))$
2. **Budget Normalisation:** $b_j^* = \tilde{b}_j \cdot \frac{B_{\text{total}}}{\sum_{k=1}^{J} \tilde{b}_k}$
3. **Iterative Adjustment:** If constraints remain violated, apply iterative rebalancing

#### Mutation Operator

**Budget Reallocation Mutation:** With probability $p_m$, apply:

$$b_j^{\text{new}} = b_j + \Delta_j$$

where $\sum_{j=1}^{J} \Delta_j = 0$ (budget conservation) and $\Delta_j$ follows a budget transfer scheme:

1. **Transfer Selection:** Choose donor-recipient pairs with probability proportional to current allocations
2. **Transfer Amount:** $|\Delta_j| \sim \text{Uniform}(0.05\, b_j, 0.3\, b_j)$
3. **Constraint Repair:** Apply $\Pi_{\mathcal{F}}(\cdot)$ (see the sketch after this section)

#### Evolutionary Parameters

**Standard Configuration:**

- Population size: $N = 100$
- Generations: $G = 200$
- Tournament size: $k = 5$
- Crossover rate: $p_c = 0.8$
- Mutation rate: $p_m = 0.15$
- Elite fraction: $\eta = 0.1$
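A minimal sketch of the projection $\Pi_{\mathcal{F}}$ used by both crossover and mutation, assuming NumPy arrays for allocations and bounds. The rebalancing rule in step 3, along with the iteration cap and tolerance, is one plausible reading of "iterative adjustment", not necessarily the pipeline's implementation.

```python
import numpy as np

def project_onto_feasible(b, b_min, b_max, B_total, max_iter=100, tol=1e-9):
    """Project an allocation onto F: bounds plus total-budget equality."""
    b = np.asarray(b, dtype=float)
    for _ in range(max_iter):
        b = np.clip(b, b_min, b_max)          # 1. bound enforcement
        b *= B_total / b.sum()                # 2. budget normalisation
        within = (b >= b_min - tol) & (b <= b_max + tol)
        if within.all() and abs(b.sum() - B_total) < tol:
            return b                          # feasible: done
        # 3. iterative adjustment: re-clip, then redistribute the budget
        #    residual over components that still have slack.
        clipped = np.clip(b, b_min, b_max)
        residual = B_total - clipped.sum()
        slack = (b_max - clipped) if residual > 0 else (clipped - b_min)
        if slack.sum() > 0:
            clipped += residual * slack / slack.sum()
        b = clipped
    return np.clip(b, b_min, b_max)           # best effort after max_iter
```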
### 2.4 Convergence Criteria

**Fitness-Based Stopping:** The algorithm terminates when:

$$\frac{f_{\max}^{(g)} - f_{\max}^{(g-h)}}{|f_{\max}^{(g-h)}|} < \epsilon_{\text{conv}}$$

holds over the patience window, where:

- $f_{\max}^{(g)}$ is the best fitness in generation $g$
- $h = 50$ (patience parameter)
- $\epsilon_{\text{conv}} = 0.001$ (convergence threshold)

## Stage Interface: Parameter Conversion

### 3.1 Bayesian to GA Parameter Mapping

The Stage 1 posterior estimates are converted to Stage 2 optimisation parameters.

**Direct Mapping:**

- $\hat{\alpha}_j \leftarrow \mathbb{E}[\alpha_j \mid \mathbf{Y}]$ (baseline effects)
- $\hat{\beta}_{1j} \leftarrow \mathbb{E}[\beta_{1j} \mid \mathbf{Y}]$ (budget sensitivities)
- $\hat{\gamma}_{kj} \leftarrow \mathbb{E}[\gamma_{kj} \mid \mathbf{Y}]$ (parent influences)

**Uncertainty Propagation:** For robust optimisation, parameter uncertainty can be incorporated by:

1. **Monte Carlo Sampling:** Draw $\{\theta^{(s)}\}_{s=1}^{S}$ from the posterior
2. **Stochastic Fitness:** $f(\mathbf{b}) = \frac{1}{S} \sum_{s=1}^{S} f(\mathbf{b}; \theta^{(s)})$

### 3.2 Constraint Specification

**Business Constraints:**

- Minimum allocation: $b_{\text{min},j} = \max(10{,}000,\; 0.001 \cdot B_{\text{total}})$
- Maximum allocation: $b_{\text{max},j} = 0.95 \cdot B_{\text{total}}$
- Category limits: $\sum_{j \in \mathcal{C}_k} b_j \leq \beta_k B_{\text{total}}$ for channel categories $\mathcal{C}_k$

## Data-Grounded Attribution Framework

### 4.1 Principle of Scoped Projections

A core principle of the `conversionflow-aggregate` methodology is that all financial projections must be directly and defensibly tied to the scope of the data being analysed. This ensures analytical integrity and produces credible, realistic business insights.

### 4.2 The Digital Attribution Challenge

In many real-world scenarios, particularly in markets like luxury automotive, the available digital data (e.g., website interactions, ad clicks) captures only a small fraction of the total customer journey. For the Italy market analysis, this is a critical consideration:

- **Digital Data Scope:** The model is built using data from digital touchpoints.
- **Sales Data Scope:** This digital data is linked to only **~5% of total vehicle sales**. The remaining 95% of sales occur through offline channels (e.g., dealer relationships, walk-ins) that are not present in the dataset.

### 4.3 Methodological Solution

To avoid making unsupported claims, the methodology strictly aligns the scope of the analysis with the scope of the data:

1. **Model Scope:** The Bayesian network is built exclusively on the tracked digital journey data. It learns the conversion probabilities *within this digital ecosystem*.
2. **Optimisation Scope:** The genetic algorithm optimises the marketing budget based on the conversion probabilities learned from the digital-only data. Its goal is to maximise conversions *within the population of digitally engaged users*.
3. **Business Impact Scope:** Consequently, all financial projections, such as the "expected additional revenue" from the uncertainty analysis, are calculated based on the **portion of sales that can be reasonably attributed to these digital journeys**.

**Example Calculation:**

- **Total Annual Sales:** 5,067 units
- **Digitally Attributable Sales (Analysis Scope):** $5067 \times 0.05 \approx 253$ units
- **Optimisation Result:** A 5.65% improvement in digital conversion efficiency
- **Business Impact Calculation:** The 5.65% improvement is applied to the revenue from **~253 cars**, not the total 5,067 cars

This approach ensures that the system provides a realistic estimate of the value generated by optimising the digital marketing spend, rather than making speculative claims about its impact on the entire sales landscape.

## Mathematical Assumptions

### 5.1 Key Modelling Assumptions

1. **DAG Structure:** Customer journeys follow a directed acyclic graph with no cycles
2. **Poisson Counts:** Event counts are Poisson-distributed conditional on rate parameters
3. **Log-Linear Effects:** Budget and parent influences enter log-linearly
4. **Diminishing Returns:** Budget effects follow the $\log(1 + b/\kappa)$ form
5. **Independence:** Counts are conditionally independent given parameters and structure
6. **Stationarity:** Parameters are constant within the modelling period
7. **Additive Effects:** Parent influences combine additively in the log-rate

### 5.2 Convergence Guarantees

**MCMC Convergence:** Under regularity conditions (log-concave posteriors, bounded parameter spaces), HMC converges to the target posterior distribution.

**GA Convergence:** The genetic algorithm converges to a local optimum with probability 1 under:

- positive mutation rates
- elite preservation
- a finite feasible region

**Global Optimality:** No guarantee of a global optimum is possible due to the non-convex objective function. Multiple runs with different random seeds are recommended for robustness.

## Implementation Notes

### 6.1 Numerical Stability

**Overflow Protection** (a sketch of these guards appears at the end of this document):

- Sigmoid arguments clipped to $[-500, 500]$
- Log-sum-exp tricks for stable probability calculations
- Regularisation terms for near-singular matrices

**Constraint Handling:**

- Iterative projection algorithms for budget conservation
- Feasibility restoration via quadratic programming
- Numerical tolerance: $\epsilon_{\text{tol}} = 10^{-9}$

### 6.2 Computational Complexity

**Stage 1 (MCMC):** $\mathcal{O}(S \cdot C \cdot J^2 \cdot T)$, where $S$ is the number of samples, $C$ the number of chains, $J$ the number of touchpoints, and $T$ the number of time periods.

**Stage 2 (GA):** $\mathcal{O}(G \cdot N \cdot J^2)$, where $G$ is the number of generations and $N$ the population size.

**Total Pipeline:** Dominated by MCMC sampling (typically ~7 minutes vs ~3 seconds for the GA).
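As referenced in section 6.1, a minimal sketch of the numerical-stability guards; the function names are illustrative assumptions, and each helper corresponds to one bullet under Overflow Protection.

```python
import numpy as np
from scipy.special import logsumexp

def stable_sigmoid(z):
    """Sigmoid with the argument clipped to [-500, 500] to avoid overflow."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def normalise_log_weights(log_w):
    """Convert unnormalised log-weights to probabilities via log-sum-exp."""
    return np.exp(log_w - logsumexp(log_w))

def regularise(cov, eps=1e-9):
    """Add a small ridge to a near-singular covariance matrix."""
    return cov + eps * np.eye(cov.shape[0])
```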