# User Guide

## Introduction

This guide provides comprehensive instructions for using conversionflow-aggregate to analyse customer journeys and optimise marketing budget allocation.

## Understanding the Two-Stage Pipeline

conversionflow-aggregate operates as a two-stage analytical system:

### Stage 1: Bayesian Parameter Estimation

**Duration:** Approximately 7 minutes
**Purpose:** Fits probabilistic models to your customer journey data

```bash
# Stage 1 only - parameter estimation
python scripts/run_full_italy_pipeline.py --stage1-only
```

**What happens:**

- Loads raw customer event data
- Fits Bayesian network models using MCMC sampling
- Generates parameter exports with confidence intervals
- Produces model diagnostics and validation metrics

**Indicators of success:**

- PyMC progress bars showing MCMC sampling
- Convergence diagnostics (R-hat < 1.1)
- Model validation reports
- Parameter export JSON files

### Stage 2: Genetic Algorithm Optimisation

**Duration:** 3-4 seconds
**Purpose:** Optimises budget allocation using pre-computed parameters

```bash
# Stage 2 only - optimisation
python scripts/run_full_italy_pipeline.py --stage2-only
```

**What happens:**

- Loads MCMC parameter exports
- Runs genetic algorithm optimisation
- Applies conservative attribution methodology
- Generates executive reports and visualisations

## Full Pipeline Execution

### Complete End-to-End Analysis

```bash
# Recommended: Both stages together
python scripts/run_full_italy_pipeline.py
```

This is the preferred method for new analyses, as it ensures parameter estimates are fresh and aligned with your current data.

### Quick Testing Mode

```bash
# Fast validation using mock data
./run_pipeline.sh --mode=test
```

Use this for system validation and for testing changes without processing real data.
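The Stage 2 genetic algorithm described above can be illustrated with a plain-Python toy. This is a sketch only, not the pipeline's actual implementation: the diminishing-returns response function, the operator choices (truncation selection, averaging crossover, multiplicative mutation), and all names are illustrative assumptions.

```python
import random

def fitness(alloc, response):
    # Toy diminishing-returns response: sum of coefficient * sqrt(spend)
    return sum(c * s ** 0.5 for c, s in zip(response, alloc))

def normalise(alloc, budget):
    # Enforce the budget constraint by rescaling spend to sum to `budget`
    total = sum(alloc)
    return [budget * a / total for a in alloc]

def optimise_budget(response, budget, pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    n = len(response)
    # Initial population: random feasible allocations
    pop = [normalise([rng.random() for _ in range(n)], budget)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, response), reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # crossover
            i = rng.randrange(n)
            child[i] *= 1 + rng.uniform(-0.1, 0.1)        # mutation
            children.append(normalise(child, budget))
        pop = parents + children
    return max(pop, key=lambda a: fitness(a, response))

# Three hypothetical touchpoints with decreasing responsiveness
best = optimise_budget([3.0, 2.0, 1.0], budget=2_500_000)
```

Because every candidate is re-normalised, the budget constraint holds throughout, and the GA steers spend toward the more responsive touchpoints.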
### Production Optimisation Mode

```bash
# Fast optimisation using existing parameters
./run_pipeline.sh --mode=italy
```

Use this when you have recent parameter estimates and only need updated budget allocations.

## Working with Custom Data

### Data Requirements

Your customer journey data should include:

- **Event timestamps** in ISO format
- **Customer identifiers** for journey tracking
- **Touchpoint identifiers** for channel attribution
- **Conversion outcomes** (purchases, leads, etc.)

### Data Format Example

```csv
customer_id,timestamp,touchpoint,conversion,value
CUST_001,2024-01-15T10:30:00,website_visit,0,
CUST_001,2024-01-16T14:22:00,email_click,0,
CUST_001,2024-01-20T09:45:00,dealer_visit,1,35000
```

### Configuration for Custom Data

Create a custom configuration file:

```yaml
# configs/custom_analysis.yaml
data:
  source: "data/custom/customer_journeys.csv"
  date_column: "timestamp"
  customer_column: "customer_id"
  touchpoint_column: "touchpoint"
  conversion_column: "conversion"

model:
  mcmc_samples: 2000
  mcmc_tune: 1000
  chains: 4

optimization:
  population_size: 100
  generations: 200
  budget_total: 1000000
```

### Running Custom Analysis

```bash
# Using custom configuration
python scripts/run_full_italy_pipeline.py --config configs/custom_analysis.yaml
```

## Budget Allocation Scenarios

### Standard Allocation

```bash
# Default budget (£2.5M)
python scripts/run_full_italy_pipeline.py
```

### Custom Budget Scenarios

```bash
# Growth scenario - 10% budget increase
python scripts/run_full_italy_pipeline.py --budget 2750000

# Reduced budget scenario
python scripts/run_full_italy_pipeline.py --budget 2000000

# High investment scenario
python scripts/run_full_italy_pipeline.py --budget 5000000
```

## Understanding Results

### Executive Summary Output

The system generates business-ready reports including:

**Budget Allocation Table:**

```
Touchpoint             Allocation    Percentage
Car Configuration      £420,000      16.8%
Finance Calculator     £380,000      15.2%
Test Drive Requests    £350,000      14.0%
...
```

**Performance Metrics:**

- **Raw Improvement:** Statistical optimisation potential
- **Business Claim:** Conservative attribution-adjusted improvement
- **Confidence Intervals:** Statistical uncertainty bounds

### Technical Diagnostics

For technical validation, review:

**MCMC Diagnostics:**

- **R-hat values:** Should be < 1.1 for convergence
- **Effective Sample Size (ESS):** Should be > 400 per chain
- **ELPD-LOO:** Model comparison metric

**Optimisation Metrics:**

- **Population diversity:** Genetic algorithm health
- **Convergence rate:** Solution stability
- **Constraint satisfaction:** Business rule compliance

## Attribution Methodology

`conversionflow-aggregate` implements a **Data-Grounded Attribution** methodology to ensure business credibility and analytical integrity.

### The Principle

All financial projections must be tied directly to the scope of the data being analysed. In the luxury automotive market, digital data typically accounts for only a small fraction (~5%) of total sales.

### How It Works

1. **Scoped Analysis:** The entire pipeline, from model fitting to optimisation, operates exclusively on the tracked digital journey data.
2. **Scoped Projections:** Business impact calculations (e.g. "expected additional revenue") are based on the ~5% of sales attributable to these digital journeys, not on total sales volume.

### Why This Matters

This approach yields realistic, defensible projections of the value generated by optimising digital marketing spend, and it maintains stakeholder trust by avoiding unsupported claims about influencing total offline sales.

## Advanced Usage

### Model Selection

```bash
# Standard Poisson model (recommended)
python scripts/run_full_italy_pipeline.py

# Hurdle model (for zero-inflated data - slower)
python scripts/run_full_italy_pipeline.py --use-hurdle
```

**Recommendation:** Use the standard model unless your data has severe zero-inflation issues.
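The scoping rule behind Data-Grounded Attribution, described earlier, comes down to a single multiplication: project modelled uplift only over the digitally attributable slice of revenue. The sketch below is illustrative; the function name and all figures are hypothetical, not values from the pipeline.

```python
def scoped_revenue_claim(total_sales_revenue, digital_share, modelled_uplift):
    """Project uplift only over the digitally attributable slice of revenue.

    total_sales_revenue: total revenue across all channels
    digital_share: fraction of sales visible in tracked digital journeys (~0.05)
    modelled_uplift: relative improvement predicted by the optimiser
    """
    attributable = total_sales_revenue * digital_share
    return attributable * modelled_uplift

# Illustrative figures only: £400M total revenue, ~5% digitally tracked,
# and a modelled 12% uplift on the tracked slice.
claim = scoped_revenue_claim(400_000_000, 0.05, 0.12)
# ~£2.4M scoped claim, versus ~£48M if the uplift were wrongly
# applied to total revenue - the unsupported claim this methodology avoids.
```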
The hurdle model requires significantly longer convergence time (1+ hours vs 7 minutes).

### Parallel Processing

```bash
# Increase MCMC chains for faster sampling (requires more CPU cores)
python scripts/run_full_italy_pipeline.py --chains 8
```

### Output Customisation

```bash
# Specify output directory
python scripts/run_full_italy_pipeline.py --output results/custom_analysis/

# Control output formats
python scripts/run_full_italy_pipeline.py --formats csv,json,html
```

## Performance Optimisation

### For Large Datasets (1M+ events)

```yaml
# In configuration file
model:
  sample_fraction: 0.5   # Use 50% of data for faster processing
  mcmc_samples: 1000     # Reduce samples if needed
```

### For Resource-Constrained Systems

```yaml
optimization:
  population_size: 50   # Smaller population for faster GA
  generations: 100      # Fewer generations
```

## Quality Assurance

### Validation Checks

The system automatically validates:

- Data quality and completeness
- Model convergence and diagnostics
- Budget allocation constraints
- Attribution methodology compliance

### Manual Verification

```bash
# Run comprehensive validation suite
python scripts/test_italy_optimization.py
```

### Troubleshooting Checklist

1. **PyMC progress bars visible:** Confirms Bayesian fitting is running
2. **R-hat < 1.1:** Confirms MCMC convergence
3. **Budget sums correctly:** Confirms optimisation constraints are satisfied
4. **Business impact is correctly scoped:** Confirms projections are based on attributable sales

## Best Practices

### Data Preparation

- Ensure consistent timestamp formatting
- Remove duplicate or invalid records
- Validate customer journey completeness
- Check for data leakage or look-ahead bias

### Model Validation

- Review convergence diagnostics carefully
- Compare multiple model runs for consistency
- Validate results against business intuition
- Test with different budget scenarios

### Business Communication

- Clearly state that projections are based on digitally attributable sales.
- Document the methodology's scope and limitations transparently.
- Provide confidence intervals for all estimates.
- Explain all assumptions clearly.

For additional support, consult the [Configuration Guide](configuration.md) and the [Troubleshooting Documentation](troubleshooting.md).
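The troubleshooting checklist above can also be mirrored as programmatic checks in your own scripts. The thresholds (R-hat < 1.1, ESS > 400, budget constraint) come from this guide; the function and parameter names are hypothetical, not part of the pipeline's API.

```python
def check_run(rhat, ess, allocation, budget, tol=1e-6):
    """Return a list of checklist failures; an empty list means all clear.

    rhat / ess: per-parameter MCMC diagnostics, keyed by parameter name
    allocation: optimised spend per touchpoint
    budget: total budget the allocation must sum to
    """
    problems = []
    if max(rhat.values()) >= 1.1:
        problems.append("MCMC not converged (R-hat >= 1.1)")
    if min(ess.values()) <= 400:
        problems.append("effective sample size too low (ESS <= 400)")
    if abs(sum(allocation.values()) - budget) > tol * budget:
        problems.append("allocation does not sum to the budget")
    return problems

# Hypothetical diagnostics from a healthy run
problems = check_run(
    rhat={"beta_email": 1.01, "beta_dealer": 1.02},
    ess={"beta_email": 1200, "beta_dealer": 950},
    allocation={"email": 1_000_000, "dealer": 1_500_000},
    budget=2_500_000,
)
```

A non-empty result pinpoints which checklist item failed, which is easier to act on than eyeballing the diagnostics output.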