Mathematical Methodology

Overview

This document provides the complete mathematical specification of the conversionflow-aggregate two-stage pipeline. The methodology pairs hierarchical Bayesian parameter estimation with genetic-algorithm budget optimisation, together with conservative, data-scoped attribution reporting for marketing budget allocation.

Stage 1: Bayesian Parameter Estimation

1.1 Problem Formulation

Let \(\mathbf{Y} = \{Y_{ij}\}\) denote the observed count data where:

  • \(i \in \{1, 2, \ldots, T\}\) indexes time periods (days)

  • \(j \in \{1, 2, \ldots, J\}\) indexes marketing touchpoints

  • \(Y_{ij} \in \mathbb{N}_0\) represents the count of events for touchpoint \(j\) on day \(i\)

The customer journey is modelled as a directed acyclic graph (DAG) \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\) where:

  • \(\mathcal{V} = \{v_1, v_2, \ldots, v_J\}\) represents touchpoints

  • \(\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}\) represents causal relationships

  • \(\text{pa}(j) = \{k : (v_k, v_j) \in \mathcal{E}\}\) denotes parent nodes of touchpoint \(j\)
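As a minimal sketch, the graph objects above can be represented directly in code; the touchpoint names and edges below are invented for illustration:

```python
# Hypothetical illustration of G = (V, E): pa(j) and an acyclicity check.
def pa(j, edges):
    """Parent set pa(j) = {k : (k, j) in E}."""
    return {k for (k, v) in edges if v == j}

def is_acyclic(vertices, edges):
    """Kahn's algorithm: the graph is a DAG iff every vertex can be
    removed in topological order."""
    indegree = {v: 0 for v in vertices}
    for (_, v) in edges:
        indegree[v] += 1
    queue = [v for v in vertices if indegree[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for (a, b) in edges:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    queue.append(b)
    return seen == len(vertices)

vertices = ["display", "search", "site_visit", "lead"]
edges = {("display", "site_visit"), ("search", "site_visit"),
         ("site_visit", "lead")}
```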

1.2 Standard Poisson Model

Likelihood Specification

For each touchpoint \(j\) and time period \(i\):

\[Y_{ij} \sim \text{Poisson}(\lambda_{ij})\]

where the rate parameter follows a log-linear specification:

\[\log(\lambda_{ij}) = \alpha_j + \beta_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma_{kj} \log(1 + Y_{ik}) + \delta_j^{T} \mathbf{w}_i\]

Parameter Interpretation:

  • \(\alpha_j\): Baseline log-rate for touchpoint \(j\)

  • \(\beta_{1j}\): Budget sensitivity coefficient (diminishing returns via logarithm)

  • \(B_j\): Budget allocation to touchpoint \(j\)

  • \(\kappa > 0\): Budget scaling factor (default: 1000)

  • \(\gamma_{kj}\): Influence coefficient from parent touchpoint \(k\) to \(j\)

  • \(\delta_j\): Time-varying effect coefficients

  • \(\mathbf{w}_i\): Time covariate vector (e.g., day-of-week indicators)
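The log-linear predictor above can be evaluated as in the following sketch (not the production implementation); all parameter values are illustrative:

```python
import numpy as np

# Sketch of the Section 1.2 log-rate for one touchpoint j on day i.
def log_rate(alpha_j, beta1_j, B_j, kappa, gamma_kj, Y_parents, delta_j, w_i):
    """log(lambda_ij) = alpha_j + beta1_j*log(1 + B_j/kappa)
                        + sum_k gamma_kj*log(1 + Y_ik) + delta_j . w_i"""
    return (alpha_j
            + beta1_j * np.log1p(B_j / kappa)
            + np.dot(gamma_kj, np.log1p(Y_parents))
            + np.dot(delta_j, w_i))

# One parent with 4 events yesterday, budget equal to kappa, a flat week.
lam = np.exp(log_rate(alpha_j=3.0, beta1_j=1.0, B_j=1000.0, kappa=1000.0,
                      gamma_kj=np.array([0.2]), Y_parents=np.array([4]),
                      delta_j=np.zeros(7), w_i=np.eye(7)[0]))
```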

Prior Specifications

Baseline Effects: \(\alpha_j \sim \mathcal{N}(\mu_{\alpha,j}, \sigma_{\alpha,j}^2)\)

Budget Sensitivity: \(\beta_{1j} \sim \mathcal{N}(\mu_{\beta,j}, \sigma_{\beta,j}^2)\)

Parent Influences: \(\gamma_{kj} \sim \mathcal{N}(\mu_{\gamma,kj}, \sigma_{\gamma,kj}^2) \quad \forall k \in \text{pa}(j)\)

Time Effects: \(\delta_j \sim \mathcal{N}(\mathbf{0}, \sigma_{\delta}^2 \mathbf{I})\)

Default Hyperparameters:

  • \(\mu_{\alpha,j} = 3.0, \sigma_{\alpha,j} = 1.5\) (baseline intercepts)

  • \(\mu_{\beta,j} = 1.0, \sigma_{\beta,j} = 0.5\) (budget sensitivity)

  • \(\mu_{\gamma,kj} = 0.0, \sigma_{\gamma,kj} = 1.0\) (parent effects)

  • \(\sigma_{\delta} = 1.5\) (time effects)

1.3 Hurdle Model (Zero-Inflated Poisson)

For count data with excess zeros, we employ a two-part hurdle model:

Part 1: Hurdle Component (Bernoulli Process)

\[H_{ij} \sim \text{Bernoulli}(\pi_{ij})\]
\[\text{logit}(\pi_{ij}) = \alpha^{(h)}_j + \sum_{k \in \text{pa}(j)} \gamma^{(h)}_{kj} \mathbb{I}(Y_{ik} > 0) + \left(\delta^{(h)}_j\right)^{T} \mathbf{w}_i\]

where \(\mathbb{I}(\cdot)\) is the indicator function and \(\pi_{ij}\) represents the probability of any activity occurring.

Part 2: Count Component (Truncated Poisson)

\[Y_{ij} | H_{ij} = 1 \sim \text{TruncatedPoisson}(\mu_{ij}, \text{lower}=1)\]
\[\log(\mu_{ij}) = \alpha^{(c)}_j + \beta^{(c)}_{1j} \log\left(1 + \frac{B_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \gamma^{(c)}_{kj} \log(1 + Y_{ik}) + \left(\delta^{(c)}_j\right)^{T} \mathbf{w}_i\]

Combined Likelihood

The complete-data likelihood combines the gate and the truncated count:

\[p(Y_{ij} = y \mid \pi_{ij}, \mu_{ij}) = \begin{cases} 1 - \pi_{ij}, & y = 0 \\ \pi_{ij} \, \dfrac{\mu_{ij}^{y} e^{-\mu_{ij}}}{y! \left(1 - e^{-\mu_{ij}}\right)}, & y \geq 1 \end{cases}\]

In implementation this is expressed through a zero-inflated Poisson parameterisation:

\[Y_{ij} \sim \text{ZeroInflatedPoisson}(\psi_{ij}, \mu_{ij})\]

where:

  • \(\psi_{ij} = 1 - \pi_{ij}\) (excess-zero probability)

  • \(\mu_{ij}\) is the Poisson rate when active

Strictly, a hurdle model attributes every zero to the gate, whereas a zero-inflated Poisson also allows zeros from the count component; the truncated form above is the exact hurdle likelihood, and the zero-inflated parameterisation serves as its practical stand-in.
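A minimal sketch of this likelihood for a single observation, with placeholder values for \(\pi\) and \(\mu\): zeros come only from the Bernoulli gate, positive counts from the zero-truncated Poisson.

```python
import math

# Hurdle log-pmf sketch for one observation (pi, mu are placeholders).
def hurdle_logpmf(y, pi, mu):
    if y == 0:
        return math.log(1.0 - pi)          # gate closed: no activity
    # zero-truncated Poisson: Poisson log-pmf minus log P(Y >= 1)
    log_pois = y * math.log(mu) - mu - math.lgamma(y + 1)
    return math.log(pi) + log_pois - math.log1p(-math.exp(-mu))

# sanity check: probabilities over y = 0..50 should sum to ~1
total = math.exp(hurdle_logpmf(0, 0.7, 2.5)) + sum(
    math.exp(hurdle_logpmf(y, 0.7, 2.5)) for y in range(1, 51))
```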

Hurdle Model Priors

Hurdle Component: \(\alpha^{(h)}_j \sim \mathcal{N}(0, 1.5^2), \quad \gamma^{(h)}_{kj} \sim \mathcal{N}(0, 1^2)\)

Count Component: \(\alpha^{(c)}_j \sim \mathcal{N}(2, 1.5^2), \quad \gamma^{(c)}_{kj} \sim \text{HalfCauchy}(5)\)

1.4 Posterior Inference

MCMC Sampling

Posterior inference uses Hamiltonian Monte Carlo (HMC) via PyMC:

Sampling Configuration:

  • Draws: \(S = 2000\) (production: 4000)

  • Tuning: \(T = 1000\) (production: 2000)

  • Chains: \(C = 4\) (production: 8)

  • Target acceptance rate: \(\rho = 0.9\) (production: 0.95)

  • Maximum tree depth: \(d_{\max} = 15\)

Convergence Diagnostics

R-hat Statistic: \(\hat{R} = \sqrt{\frac{\hat{V}^+}{W}}\)

where \(\hat{V}^+\) is the pooled posterior variance estimate (a weighted combination of within- and between-chain variability) and \(W\) is the mean within-chain variance.

Convergence Criterion: \(\hat{R} < 1.1\) for all parameters.

Effective Sample Size: \(\text{ESS} = \frac{CS}{1 + 2\sum_{t=1}^{\infty} \rho_t}\)

where \(\rho_t\) is the lag-\(t\) autocorrelation; in practice the sum is truncated once the autocorrelation estimates become negligible.

Quality Criterion: \(\text{ESS} > 400\) for all parameters.
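The R-hat diagnostic can be sketched as below. This is the basic Gelman-Rubin form without the split-chain and rank-normalisation refinements that library implementations such as ArviZ apply, and the chains are simulated:

```python
import numpy as np

# Basic R-hat for C chains of S draws each (array of shape chains x draws).
def rhat(chains):
    C, S = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = S * chain_means.var(ddof=1)         # between-chain variance
    V_plus = (S - 1) / S * W + B / S        # pooled variance estimate
    return np.sqrt(V_plus / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 2000))                 # four well-mixed chains
bad = good + np.array([[0.], [0.], [0.], [5.]])   # one chain shifted away
```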

Model Comparison

Leave-One-Out Cross-Validation (LOO-CV): \(\text{ELPD}_{\text{LOO}} = \sum_{i=1}^{n} \log p(y_i | y_{-i})\)

where \(p(y_i | y_{-i})\) is the leave-one-out predictive density approximated using Pareto-smoothed importance sampling.

1.5 Parameter Export

The posterior samples are summarised into point estimates and uncertainty quantification:

For each parameter \(\theta\), we compute:

  • Point Estimate: \(\hat{\theta} = \mathbb{E}[\theta | \mathbf{Y}]\) (posterior mean)

  • Uncertainty: \(\text{SD}(\theta) = \sqrt{\text{Var}[\theta | \mathbf{Y}]}\) (posterior standard deviation)

  • Credible Intervals: \((\theta_{\alpha/2}, \theta_{1-\alpha/2})\) where \(\alpha = 0.05\)

Export Format:

{
  "parameters": {
    "touchpoint_j": {
      "beta0": {"mean": α̂_j, "std": SD(α_j)},
      "beta1": {"mean": β̂_{1j}, "std": SD(β_{1j})},
      "parents": ["touchpoint_k", ...],
      "parent_coeffs": [
        {"mean": γ̂_{kj}, "std": SD(γ_{kj})}, ...
      ],
      "alpha": α̂_j  // Conversion value weight
    }
  },
  "diagnostics": {
    "elpd_loo": ELPD_LOO,
    "rhat_max": max(R̂),
    "ess_min": min(ESS)
  }
}
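A sketch of how posterior draws might be summarised into this export format; the draws, touchpoint name, and values below are simulated, not real model output:

```python
import json
import numpy as np

# Summarise a vector of posterior draws into the {mean, std} export form.
def summarise(draws):
    return {"mean": float(np.mean(draws)), "std": float(np.std(draws, ddof=1))}

rng = np.random.default_rng(1)
posterior = {"beta0": rng.normal(3.0, 0.1, 4000),   # simulated alpha_j draws
             "beta1": rng.normal(1.0, 0.05, 4000)}  # simulated beta_1j draws
export = {"parameters": {"touchpoint_j": {
    "beta0": summarise(posterior["beta0"]),
    "beta1": summarise(posterior["beta1"]),
    "parents": [],
    "parent_coeffs": []}}}
payload = json.dumps(export)   # serialisable: comments are not valid JSON
```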

Stage 2: Genetic Algorithm Optimisation

2.1 Problem Formulation

Decision Variables: Let \(\mathbf{b} = (b_1, b_2, \ldots, b_J)^T\) where \(b_j \geq 0\) represents the budget allocation to touchpoint \(j\).

Budget Constraint: \(\sum_{j=1}^{J} b_j = B_{\text{total}}\)

Box Constraints: \(b_{\text{min},j} \leq b_j \leq b_{\text{max},j} \quad \forall j\)

2.2 Objective Function

Expected Conversion Calculation

For a given budget allocation \(\mathbf{b}\), the expected conversion probability for touchpoint \(j\) is:

\[p_j(\mathbf{b}) = \sigma\left(\hat{\alpha}_j + \hat{\beta}_{1j} \log\left(1 + \frac{b_j}{\kappa}\right) + \sum_{k \in \text{pa}(j)} \hat{\gamma}_{kj} p_k(\mathbf{b})\right)\]

where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function. For numerical stability the argument is clipped before evaluation: \(\sigma(\max(-500, \min(500, z)))\).
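A sketch of the recursive evaluation, assuming touchpoints are visited in topological order so that parent probabilities are already available; the two-node chain and its parameters are invented:

```python
import math

def sigmoid(z):
    z = max(-500.0, min(500.0, z))   # overflow protection
    return 1.0 / (1.0 + math.exp(-z))

def conversion_probs(order, params, parents, b, kappa=1000.0):
    """Evaluate p_j(b) in topological order; params[j] = (alpha, beta1,
    parent coefficients aligned with parents[j])."""
    p = {}
    for j in order:
        a, beta1, gammas = params[j]
        z = a + beta1 * math.log1p(b[j] / kappa)
        z += sum(g * p[k] for g, k in zip(gammas, parents[j]))
        p[j] = sigmoid(z)
    return p

params = {"search": (-1.0, 0.8, []), "lead": (-2.0, 0.5, [1.5])}
parents = {"search": [], "lead": ["search"]}
p = conversion_probs(["search", "lead"], params, parents,
                     {"search": 40_000.0, "lead": 10_000.0})
```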

Fitness Function

The optimisation objective maximises the expected total conversion value:

\[f(\mathbf{b}) = \sum_{j=1}^{J} \alpha_j \cdot p_j(\mathbf{b}) - \Phi(\mathbf{b})\]

where:

  • \(\alpha_j\) is the conversion value weight for touchpoint \(j\)

  • \(\Phi(\mathbf{b})\) represents penalty terms for constraint violations

Penalty Function

\[\Phi(\mathbf{b}) = \lambda_{\text{min}} \sum_{j=1}^{J} \max(0, b_{\text{min},j} - b_j) + \lambda_{\text{business}} \Psi_{\text{business}}(\mathbf{b})\]

where:

  • \(\lambda_{\text{min}} > 0\) penalises under-budgeted touchpoints

  • \(\Psi_{\text{business}}(\mathbf{b})\) enforces business-specific constraints
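Under these definitions the penalised fitness might be sketched as follows; the conversion values and penalty weight are illustrative, and the business-constraint term \(\Psi_{\text{business}}\) is omitted for brevity:

```python
# Sketch of f(b) = sum_j alpha_j * p_j(b) - Phi(b), with Phi reduced to
# the under-budget penalty only (Psi_business omitted).
def fitness(p, alpha, b, b_min, lam_min=1000.0):
    value = sum(alpha[j] * p[j] for j in p)
    penalty = lam_min * sum(max(0.0, b_min[j] - b[j]) for j in b)
    return value - penalty

# Toy example: one touchpoint allocated 5 units below its minimum.
score = fitness(p={"search": 0.5}, alpha={"search": 100.0},
                b={"search": 5.0}, b_min={"search": 10.0})
```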

2.3 Genetic Algorithm Specification

Population Representation

Each individual \(\mathbf{x}^{(i)} \in \mathbb{R}^J\) represents a budget allocation satisfying: \(\mathbf{x}^{(i)} \in \mathcal{F} = \left\{\mathbf{b} \in \mathbb{R}_+^J : \sum_{j=1}^{J} b_j = B_{\text{total}}, \, b_{\text{min},j} \leq b_j \leq b_{\text{max},j}\right\}\)

Initialization

Importance-Based Sampling: Initial population members are generated as:

\[b_j^{(0)} = \frac{w_j}{\sum_{k=1}^{J} w_k} B_{\text{total}} + \epsilon_j\]

where:

  • \(w_j\) is the importance weight for touchpoint \(j\)

  • \(\epsilon_j \sim \mathcal{N}(0, \sigma_{\text{init}}^2)\) adds diversity

  • The result is projected onto \(\mathcal{F}\) via constraint enforcement

Selection Operator

Tournament Selection: Since the objective is maximised, for tournament size \(k\) the parent is selected as: \(\mathbf{x}^{\text{parent}} = \arg\max_{\mathbf{x} \in \mathcal{T}} f(\mathbf{x})\)

where \(\mathcal{T}\) is a random subset of size \(k\) from the current population.
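A sketch of tournament selection for the maximisation objective; the population and fitness values are toy data:

```python
import random

# Pick k random individuals and return the fittest (maximisation).
def tournament_select(population, fitness_values, k, rng):
    idx = rng.sample(range(len(population)), k)
    best = max(idx, key=lambda i: fitness_values[i])
    return population[best]

rng = random.Random(0)
pop = [[100.0, 900.0], [500.0, 500.0], [900.0, 100.0]]   # toy allocations
fits = [1.0, 3.0, 2.0]                                   # toy fitnesses
winner = tournament_select(pop, fits, k=3, rng=rng)      # full tournament
```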

Crossover Operator

Arithmetic (Blend) Crossover with Constraint Repair: For parents \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\), generate offspring as a convex combination:

\[\mathbf{x}^{\text{child}} = \alpha \mathbf{x}^{(1)} + (1-\alpha) \mathbf{x}^{(2)}\]

where \(\alpha \sim \text{Uniform}(0, 1)\).

Constraint Repair: Apply projection \(\Pi_{\mathcal{F}}(\mathbf{x}^{\text{child}}) \in \mathcal{F}\) via:

  1. Bound Enforcement: \(\tilde{b}_j = \max(b_{\text{min},j}, \min(b_{\text{max},j}, b_j))\)

  2. Budget Normalisation: \(b_j^* = \tilde{b}_j \cdot \frac{B_{\text{total}}}{\sum_{k=1}^{J} \tilde{b}_k}\)

  3. Iterative Adjustment: If constraints remain violated, apply iterative rebalancing
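The three repair steps can be sketched as an iterative clip-and-rescale projection; the iteration cap and tolerance below are illustrative, and the production repair scheme may differ:

```python
# Projection sketch: clip to [b_min, b_max], rescale to the total budget,
# and repeat until the bounds hold within tolerance.
def repair(b, b_min, b_max, total, iters=200, tol=1e-9):
    b = list(b)
    n = len(b)
    for _ in range(iters):
        b = [min(b_max[j], max(b_min[j], b[j])) for j in range(n)]  # bounds
        s = sum(b)
        b = [x * total / s for x in b]                    # budget conservation
        if all(b_min[j] - tol <= b[j] <= b_max[j] + tol for j in range(n)):
            break
    return b
```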

Mutation Operator

Budget Reallocation Mutation: With probability \(p_m\), apply:

\[b_j^{\text{new}} = b_j + \Delta_j\]

where \(\sum_{j=1}^{J} \Delta_j = 0\) (budget conservation) and \(\Delta_j\) follows a budget transfer scheme:

  1. Transfer Selection: Choose donor-recipient pairs with probability proportional to current allocations

  2. Transfer Amount: \(|\Delta_j| \sim \text{Uniform}(0.05 b_j, 0.3 b_j)\)

  3. Constraint Repair: Apply \(\Pi_{\mathcal{F}}(\cdot)\)
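A sketch of the transfer mutation; for simplicity the donor and recipient are chosen uniformly rather than proportionally to allocation, and the final bound repair is omitted:

```python
import random

# Move 5-30% of one touchpoint's budget to another, so sum(Delta) = 0.
def mutate(b, rng):
    b = list(b)
    donor, recipient = rng.sample(range(len(b)), 2)
    amount = rng.uniform(0.05, 0.3) * b[donor]
    b[donor] -= amount
    b[recipient] += amount
    return b

rng = random.Random(42)
child = mutate([400.0, 300.0, 300.0], rng)
```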

Evolutionary Parameters

Standard Configuration:

  • Population size: \(N = 100\)

  • Generations: \(G = 200\)

  • Tournament size: \(k = 5\)

  • Crossover rate: \(p_c = 0.8\)

  • Mutation rate: \(p_m = 0.15\)

  • Elite fraction: \(\eta = 0.1\)

2.4 Convergence Criteria

Fitness-Based Stopping: The algorithm terminates when: \(\frac{f_{\max}^{(g)} - f_{\max}^{(g-h)}}{|f_{\max}^{(g-h)}|} < \epsilon_{\text{conv}}\)

for \(h\) consecutive generations, where:

  • \(f_{\max}^{(g)}\) is the best fitness in generation \(g\)

  • \(h = 50\) (patience parameter)

  • \(\epsilon_{\text{conv}} = 0.001\) (convergence threshold)
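The stopping rule can be sketched as a check on the best-fitness history; this simplified version tests a single window of length \(h\) rather than \(h\) consecutive generations:

```python
# Stop once the relative improvement over the last h generations is
# below eps (a simplification of the consecutive-generation rule).
def has_converged(best_history, h=50, eps=0.001):
    if len(best_history) <= h:
        return False
    prev, curr = best_history[-h - 1], best_history[-1]
    return abs(curr - prev) / max(abs(prev), 1e-12) < eps

flat = [10.0] * 60                        # no improvement for 50+ generations
rising = [float(g) for g in range(60)]    # still improving steadily
```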

Stage Interface: Parameter Conversion

3.1 Bayesian to GA Parameter Mapping

The Stage 1 posterior estimates are converted to Stage 2 optimisation parameters:

Direct Mapping:

  • \(\hat{\alpha}_j \leftarrow \mathbb{E}[\alpha_j | \mathbf{Y}]\) (baseline effects)

  • \(\hat{\beta}_{1j} \leftarrow \mathbb{E}[\beta_{1j} | \mathbf{Y}]\) (budget sensitivities)

  • \(\hat{\gamma}_{kj} \leftarrow \mathbb{E}[\gamma_{kj} | \mathbf{Y}]\) (parent influences)

Uncertainty Propagation: For robust optimisation, parameter uncertainty can be incorporated by:

  1. Monte Carlo Sampling: Draw \(\{\theta^{(s)}\}_{s=1}^{S}\) from posterior

  2. Stochastic Fitness: \(f(\mathbf{b}) = \frac{1}{S} \sum_{s=1}^{S} f(\mathbf{b}; \theta^{(s)})\)

3.2 Constraint Specification

Business Constraints:

  • Minimum allocation: \(b_{\text{min},j} = \max(10{,}000, \; 0.001 \cdot B_{\text{total}})\)

  • Maximum allocation: \(b_{\text{max},j} = 0.95 \cdot B_{\text{total}}\)

  • Category limits: \(\sum_{j \in \mathcal{C}_k} b_j \leq \beta_k B_{\text{total}}\) for channel categories \(\mathcal{C}_k\)

Data-Grounded Attribution Framework

4.1 Principle of Scoped Projections

A core principle of the conversionflow-aggregate methodology is that all financial projections must be directly and defensibly tied to the scope of the data being analysed. This ensures analytical integrity and provides credible, realistic business insights.

4.2 The Digital Attribution Challenge

In many real-world scenarios, particularly in markets like luxury automotive, the available digital data (e.g., website interactions, ad clicks) only captures a small fraction of the total customer journey. For the Italy market analysis, this is a critical consideration:

  • Digital Data Scope: The model is built using data from digital touchpoints.

  • Sales Data Scope: This digital data is linked to only ~5% of total vehicle sales. The remaining 95% of sales occur through offline channels (e.g., dealer relationships, walk-ins) that are not present in the dataset.

4.3 Methodological Solution

To avoid making unsupported claims, our methodology strictly aligns the scope of the analysis with the scope of the data:

  1. Model Scope: The Bayesian network is built exclusively on the tracked digital journey data. It learns the conversion probabilities within this digital ecosystem.

  2. Optimisation Scope: The genetic algorithm optimises the marketing budget based on the conversion probabilities learned from the digital-only data. Its goal is to maximise conversions within the population of digitally engaged users.

  3. Business Impact Scope: Consequently, all financial projections, such as the “Expected additional revenue” from the uncertainty analysis, are calculated based on the portion of sales that can be reasonably attributed to these digital journeys.

Example Calculation:

  • Total Annual Sales: 5,067 units

  • Digitally Attributable Sales (Analysis Scope): 5,067 × 0.05 ≈ 253 units

  • Optimization Result: A 5.65% improvement in digital conversion efficiency.

  • Business Impact Calculation: The 5.65% improvement is applied to the revenue from ~253 cars, not the total 5,067 cars.
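The scoped calculation above, made explicit in code; the revenue-per-unit figure is a placeholder, not a real value:

```python
# Scoped vs naive uplift arithmetic from the example above.
total_sales = 5067
digital_share = 0.05          # ~5% of sales linked to digital journeys
improvement = 0.0565          # 5.65% conversion-efficiency improvement
revenue_per_unit = 100_000.0  # placeholder value, illustrative only

digital_units = total_sales * digital_share                 # ~253 units
scoped_uplift = digital_units * improvement * revenue_per_unit
naive_uplift = total_sales * improvement * revenue_per_unit  # overclaims
```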

This approach ensures that the system provides a realistic estimate of the value generated by optimising the digital marketing spend, rather than making speculative claims about its impact on the entire sales landscape.

Mathematical Assumptions

5.1 Key Modelling Assumptions

  1. DAG Structure: Customer journeys follow a directed acyclic graph with no cycles

  2. Poisson Counts: Event counts are Poisson-distributed conditional on rate parameters

  3. Log-Linear Effects: Budget and parent influences enter log-linearly

  4. Diminishing Returns: Budget effects follow \(\log(1 + b/\kappa)\) form

  5. Independence: Conditional independence of counts given parameters and structure

  6. Stationarity: Parameters are constant within the modelling period

  7. Additive Effects: Parent influences combine additively in log-rate

5.2 Convergence Guarantees

MCMC Convergence: Under standard regularity conditions (e.g., a continuously differentiable log-posterior density and geometric ergodicity of the chain), HMC converges to the target posterior distribution.

GA Convergence: The genetic algorithm converges to a local optimum with probability 1 under:

  • Positive mutation rates

  • Elite preservation

  • Finite feasible region

Global Optimality: No guarantee of global optimum due to non-convex objective function. Multiple runs with different random seeds recommended for robustness.

Implementation Notes

6.1 Numerical Stability

Overflow Protection:

  • Sigmoid function clipped to \([-500, 500]\)

  • Log-sum-exp tricks for stable probability calculations

  • Regularization terms for near-singular matrices

Constraint Handling:

  • Iterative projection algorithms for budget conservation

  • Feasibility restoration via quadratic programming

  • Numerical tolerance: \(\epsilon_{\text{tol}} = 10^{-9}\)

6.2 Computational Complexity

Stage 1 (MCMC): \(\mathcal{O}(S \cdot C \cdot J^2 \cdot T)\) where \(S\) is samples, \(C\) is chains, \(J\) is touchpoints, \(T\) is time periods

Stage 2 (GA): \(\mathcal{O}(G \cdot N \cdot J^2)\) where \(G\) is generations, \(N\) is population size

Total Pipeline: Dominated by MCMC sampling (typically ~7 minutes vs ~3 seconds for GA)