Driver Analysis, Bayes vs Shapley


Stat & More offers various statistical methods depending on your study objectives and data: the Bayesian network approach and Shapley Value decomposition are two different tools for identifying your driver factors.





1. Introduction


In social sciences and marketing, understanding the “drivers” behind a behavior or decision goes far beyond simple correlation. It sometimes requires integrating causality, precisely quantifying each variable’s influence, and adapting the tool’s complexity to your operational and budgetary constraints.

Stat & More has strong expertise in several driver analysis methods, including the two approaches detailed below:

  • Bayesian Networks (BN): used to explore causal (direct and indirect) relationships between variables, with the ability to run simulations.
  • Shapley Value (SV) method, also known as “Average Over Orderings” (AOO): estimates each predictor’s contribution by averaging its marginal contribution over all possible subsets of the n explanatory variables, based on cooperative game theory concepts [1].



2. Theoretical Foundations

2.1. Bayes’ Theorem: the foundation of Bayesian modeling


Bayes’ theorem is the foundation of any Bayesian approach. It estimates the probability of an event by combining:

  • Prior knowledge (the “a priori” information),
  • Observed information (the data we are processing),
  • The likelihood of the observations under each hypothesis, which yields the updated (“a posteriori”) probability.

Bayesian inference is a statistical method for computing probabilities of hypothetical causes from known event observations. It primarily relies on Thomas Bayes’ theorem.

Thomas Bayes’ theorem relates the conditional probabilities of two events. In its 1763 formulation, it is stated as follows:

P(A|B) = P(B|A) × P(A) / P(B)

Given that P(B) ≠ 0, where:

  • A and B are two events,
  • P(A): prior probability of event A,
  • P(B): probability of event B (evidence),
  • P(A ∣ B): conditional probability that A occurs given B,
  • P(B ∣ A): conditional probability that B occurs given A.
Source: Wikipedia Bayes' Theorem

Concrete example:
Suppose you want to estimate the probability that a customer recommends your service (A), given that they received specific information (B):

  • P(A) prior probability: Probability a customer recommends before any information,
  • P(B) evidence: Probability they received an informational message,
  • P(B|A) likelihood: Probability they received the information given that they recommend the service,
  • P(A|B) posterior: Probability they recommend knowing they received the information.
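
To make this concrete, here is a minimal Python sketch applying the formula; the probabilities below are illustrative assumptions, not measured survey figures:

```python
# Bayes' theorem applied to the recommendation example.
# All numbers below are illustrative assumptions, not real survey data.
p_a = 0.30          # P(A): prior probability that a customer recommends the service
p_b_given_a = 0.80  # P(B|A): probability of having received the information, given a recommender
p_b = 0.50          # P(B): overall probability of having received the information

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.2f}")  # 0.48: the belief is updated upward once B is observed
```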

This formula lets us update our beliefs dynamically within a probabilistic model, something classical statistics does not offer natively:

  • Dynamic integration of knowledge
  • Natural handling of uncertainty
  • Adaptability to new or missing variables



Why does the Bayesian approach outperform classical tools?


Feature                          | Classical Statistics | Bayesian Approach
Incorporation of prior knowledge | No                   | Yes
Update as data evolve            | No                   | Yes
Uncertainty                      | Approximated         | Fully managed
Scenario simulation              | Difficult            | Simple and natural



2.2. Bayesian Networks: graphical modeling of causality


Definition

A Bayesian network is a directed acyclic graph G = (V, E) where each node is a random variable, and each edge encodes a conditional dependency. The joint distribution then factorizes as:

P(X1, …, Xn) = ∏ i=1..n P(Xi | Parents(Xi))

  • G represents the full directed acyclic graph, defining the network structure.
  • V is the set of nodes (variables),
  • E are the edges (conditional dependencies) between nodes.
[Diagram: Tips → Information → Recommendation]

In this diagram example, “Tips,” “Information,” and “Recommendation” are nodes; the arrows represent directed edges.

Interpretation:
In this network, “Tips” influences “Information,” which in turn influences “Recommendation.”
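
As an illustration, the joint distribution of this small chain factorizes into local conditional probabilities. The sketch below uses the P(Information | Tips) values from the CPT shown in the parameter-estimation step further down; the other probabilities are illustrative assumptions:

```python
# Factorization P(Tips, Information, Recommendation)
#   = P(Tips) * P(Information | Tips) * P(Recommendation | Information)
# P(Information | Tips) uses the CPT shown in the parameter-estimation step below;
# the other probabilities are illustrative assumptions.
p_tips = {"Yes": 0.40, "No": 0.60}
p_info_given_tips = {"Yes": {"Yes": 0.83, "No": 0.17}, "No": {"Yes": 0.43, "No": 0.57}}
p_reco_given_info = {"Yes": {"Yes": 0.70, "No": 0.30}, "No": {"Yes": 0.20, "No": 0.80}}

def joint(tips: str, info: str, reco: str) -> float:
    """Joint probability of one configuration, as the product of the local CPTs."""
    return p_tips[tips] * p_info_given_tips[tips][info] * p_reco_given_info[info][reco]

print(f"{joint('Yes', 'Yes', 'Yes'):.3f}")  # P(Tips=Yes, Information=Yes, Recommendation=Yes)
```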


How is a Bayesian Network built?


1. Automatic structure discovery

The main approaches for discovering acyclic networks, particularly Bayesian networks (directed acyclic graphs), are as follows:

  • Constraint-based approaches:
    These rely on testing conditional independencies between variables based on the data. The goal is to find a network structure consistent with observed dependencies and independencies. These methods depend on statistical tests for conditional independence to guide graph construction. Common algorithms include:

    • Peter-Clark (PC) algorithm
    • Spirtes, Glymour, Scheines (SGS) algorithm


  • Score-based approaches:
    In these methods, each candidate network is evaluated with a score measuring how well its structure fits the observed data (in terms of dependencies and independencies). The goal is to maximize this score, which often requires heuristic searches in the space of possible graphs. Common examples include:

    • Greedy Search algorithm
    • K2 algorithm
    • Algorithms based on sampling or simulated annealing search


  • Hybrid approaches:
    These combine the two previous families of methods: independence constraints are used to reduce the search space, and a scoring function is used to select the best structure. Examples include the Max-Min Hill-Climbing (MMHC) algorithm.

All these approaches must also ensure graph acyclicity, which makes the discovery process computationally complex. Discovering Bayesian networks is thus a combinatorial problem that often relies on heuristics or optimization algorithms.

In summary, acyclic network discovery is based on:

  • Testing conditional independences to guide the structure,
  • Evaluating and optimizing a scoring function to select the best possible structure,
  • Combining both methods for better efficiency.

These algorithms aim to reconstruct the causal (acyclic) graph structure from data while maintaining acyclicity and maximizing the model–data fit.
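
As an illustration, a score-based search can be run in Python with the pgmpy library (assuming it is installed; exact argument names may differ slightly between versions, and the data frame below is purely hypothetical):

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

# Hypothetical discrete survey data; in practice, load your own observations.
df = pd.DataFrame({
    "Tips":           ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
    "Information":    ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "Recommendation": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
})

# Greedy hill-climbing search over candidate DAGs, scored with BIC.
search = HillClimbSearch(df)
best_structure = search.estimate(scoring_method=BicScore(df))
print(best_structure.edges())  # directed edges of the selected acyclic graph
```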


2. Parameter estimation

Parameter estimation in a Bayesian network is generally performed after learning the structure (graph). This consists of filling, for each node, a Conditional Probability Table (CPT).

The CPT expresses the probability of a variable (node) taking a given value conditioned on the different combinations of its parent node values in the graph.

For example, for the variable “Information” with one parent variable “Tips,” which can be Yes or No, the conditional probability table is as follows:

           | Information = Yes | Information = No
Tips = Yes | 0.83              | 0.17
Tips = No  | 0.43              | 0.57

This table means:

  • The probability that “Information” = Yes given “Tips” = Yes is 0.83,
  • The probability that “Information” = Yes given “Tips” = No is 0.43,
  • And so on for all possible values.

Parameter estimation in the CPT is based on observed data: each conditional probability is computed as the relative frequency of the variable’s values within each combination of its parent variables’ values (maximum likelihood estimation).
When data are incomplete or partially observed, techniques such as the Expectation-Maximization (EM)[2] algorithm are used to estimate probabilities.

Thus, the conditional probability tables (CPTs) locally store each node’s conditional probability distribution, enabling the Bayesian network to model the global joint distribution as the product of all local conditional tables.
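
A minimal sketch of this maximum-likelihood estimation of a CPT from observed frequencies (the data frame is hypothetical):

```python
import pandas as pd

# Hypothetical observations; each row is one respondent.
df = pd.DataFrame({
    "Tips":        ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
    "Information": ["Yes", "Yes", "No",  "Yes", "No", "No", "Yes", "No"],
})

# Maximum-likelihood CPT: relative frequency of Information within each level of its parent Tips.
cpt = (
    df.groupby("Tips")["Information"]
      .value_counts(normalize=True)
      .unstack(fill_value=0.0)
)
print(cpt)  # one row per value of Tips, one column per value of Information
```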


3. Model validation

Model validation in Bayesian networks aims to verify whether the learned structure and parameters describe the observed data as accurately as possible while avoiding overfitting.
Among the various criteria used for this purpose, the Bayesian Information Criterion (BIC) score is widely applied.

Definition of BIC score:

The BIC (Bayesian Information Criterion) score is a measure used to compare statistical models by considering both:

  • The quality of fit to the data (via the model likelihood),
  • The complexity of the model (number of parameters).

The formula for the BIC score is:

BIC = ln(N) × k − 2 × ln(L̂)

where:

  • N: total number of observations (data samples),
  • k: number of parameters (model complexity),
  • L̂: maximum likelihood of the model (fit to data).

Why use the BIC score to validate a Bayesian network?

  • It prevents selecting an overly complex model that fits training data perfectly but generalizes poorly to new data.
  • It explicitly accounts for sample size and number of parameters, allowing a fair comparison between models of different complexity.

A model with a lower BIC score is preferred since it indicates an optimal trade-off between predictive performance and simplicity.
The BIC score heavily penalizes overly complex models, promoting simpler, more generalizable structures when likelihood differences are not significant.

Practical application for Bayesian networks:

  1. Structure discovery: Generate multiple candidate graphs with various causal links.
  2. Parameter estimation: Compute conditional probabilities for each node in each candidate network, using EM if required.
  3. BIC score computation: Calculate the BIC score for each candidate structure.
  4. Model selection: Choose the network with the lowest BIC score as the optimal model.

In summary:

  • The BIC score objectively evaluates the quality of a learned Bayesian network by balancing data fit and model simplicity.
  • It is essential for model validation, helping retain the most robust and predictive structure without sacrificing generalization capacity.
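
As an illustration, here is a minimal sketch of the comparison step using the BIC formula above (the likelihoods and parameter counts are made-up):

```python
import math

def bic_score(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = ln(N) * k - 2 * ln(L_hat); the lower, the better the trade-off."""
    return math.log(n_obs) * n_params - 2.0 * log_likelihood

# Two hypothetical candidate structures learned on the same 500 observations.
simple_model = bic_score(log_likelihood=-1210.5, n_params=7, n_obs=500)
complex_model = bic_score(log_likelihood=-1205.0, n_params=15, n_obs=500)
print(simple_model, complex_model)  # the simpler structure wins despite its slightly lower likelihood
```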



2.3. Shapley Value Decomposition: fair attribution of each variable’s importance


With three explanatory variables, for example “Tips,” “Information,” and “Pleasant Relationship,” impacting the target variable “Recommendation,” the Shapley Value decomposition works as follows:

The Shapley Value assigns to each explanatory variable an average contribution to the prediction (or objective function value) by considering all possible combinations of variables. For 3 variables, all possible coalitions (subsets) of these variables are considered.

Let N = {Tips, Information, Pleasant Relationship}

For a variable j among these 3, the Shapley Value ϕj is computed by the formula:

ϕj = Σ_{S ⊆ N \ {j}} [ |S|! × (|N| − |S| − 1)! / |N|! ] × [ v(S ∪ {j}) − v(S) ]

where:

  • N is the set of explanatory variables,
  • S is a subset of explanatory variables not containing j,
  • v(S) is the value of the objective function (for example, the model prediction by linear regression) when only variables in S are considered,
  • v(S ∪ {j}) − v(S) is the marginal contribution of variable j when added to subset S,
  • The factorial coefficients are weights based on the size of coalitions, ensuring each permutation is fairly accounted for.

For 3 variables, there are 8 possible subsets S (coalitions), including the empty set, on which the value function v is evaluated:

  • S = No variable (average model),
  • S = Tips,
  • S = Information,
  • S = Pleasant Relationship,
  • S = Tips, Pleasant Relationship,
  • S = Tips, Information,
  • S = Information, Pleasant Relationship,
  • S = Tips, Information, Pleasant Relationship.

For each variable j, its marginal contribution is computed for these subsets. For example, for “Tips,” marginal contributions are computed when:

  • S = No variable (average model),
  • S = Information,
  • S = Pleasant Relationship,
  • S = Information, Pleasant Relationship,

Each marginal contribution is weighted according to the size of the subset S, using the factorial coefficient above. The weighted sum gives the Shapley Value for the variable “Tips.”

In practice, the function v(S) is evaluated based on a predictive model or an information metric measuring the effect of variables in S on the recommendation.

The sum of individual contributions calculated with the Shapley Value for each of the 3 explanatory variables (“Tips,” “Information,” “Pleasant Relationship”) exactly equals the difference between:

  • The model prediction when all explanatory variables are included,
  • And a baseline reference prediction, usually the mean prediction of the model with no explanatory variables or with the variables set to neutral values.

In other words, the Shapley Value decomposes the overall prediction into exact parts associated with each variable, and the sum of these parts perfectly reconstructs the gap between full prediction and baseline. This additivity property is a fundamental guarantee of cooperative game theory, ensuring a consistent and complete interpretation of the contributions of each variable to the prediction.

This decomposition clearly explains how much each variable “Tips,” “Information,” “Pleasant Relationship” contributes to “Recommendation,” accounting for possible interactions between variables.

In summary, the Shapley Value computation decomposes each variable’s contribution to the total value by averaging its marginal contributions over all possible combinations of explanatory variables, offering a fair and comprehensive measure of importance.
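
A minimal sketch of this exact computation, assuming v(S) is, for example, the R² of a model of “Recommendation” restricted to the variables in S (the R² values below are illustrative assumptions, not real results):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: weighted average of marginal contributions over all coalitions."""
    n = len(players)
    phi = {}
    for j in players:
        others = [p for p in players if p != j]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (v(s | {j}) - v(s))
        phi[j] = total
    return phi

# Illustrative value function: v(S) = R^2 of a model using only the variables in S (assumed values).
r2 = {
    frozenset(): 0.00,
    frozenset({"Tips"}): 0.30,
    frozenset({"Information"}): 0.38,
    frozenset({"Pleasant Relationship"}): 0.20,
    frozenset({"Tips", "Information"}): 0.55,
    frozenset({"Tips", "Pleasant Relationship"}): 0.42,
    frozenset({"Information", "Pleasant Relationship"}): 0.50,
    frozenset({"Tips", "Information", "Pleasant Relationship"}): 0.65,
}
phi = shapley_values(["Tips", "Information", "Pleasant Relationship"], lambda s: r2[frozenset(s)])
print(phi)  # the three values sum exactly to v(N) - v(empty set) = 0.65
```

Because of the additivity property, these values can be renormalized to percentages to produce a ranking like the one in the practical example below.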

Practical example:
Revisiting the analysis on “Recommendation,” the Shapley Values sorted in descending order are:

  • Information: ϕ = 0.42
  • Tips: ϕ = 0.34
  • Pleasant Relationship: ϕ = 0.24
[Chart: Shapley Value decomposition of “Recommendation” across Tips, Information, and Pleasant Relationship]

Interpretation:
The variable “Information” is the main driver in this model with a weight of 42%, followed by the variable “Tips” at 34%, and finally the variable “Pleasant Relationship” with 24%.

Digression on the impact of the sign of coefficients in prediction models:

  • The Shapley Value computes the average marginal contribution of a variable across all possible permutations of inclusion/exclusion of variables. This contribution can be positive or negative depending on whether the variable increases or decreases the model output in different coalitions.

  • A negative regression coefficient means that, all else equal, the variable reduces the predicted value. In the Shapley Value decomposition, this manifests as a negative contribution to the total sum, reflecting an “inhibitory” influence.

  • Therefore, it is normal and expected that some variables have negative Shapley Values, especially if their regression coefficients are negative, since Shapley Values capture both positive and negative contributions.

  • To interpret these values, one can look at the absolute contribution magnitude as well as its sign, which indicates the direction of effect. For example, a large negative Shapley Value means the variable strongly decreases the prediction.

  • If a detailed analysis is desired, Shapley Values can also be decomposed into positive and negative average contributions separately, or marginal contributions examined in different coalitions to understand when the variable acts as a brake or as a lever.

In summary:

Negative regression coefficients are naturally accounted for in Shapley Value decomposition via negative marginal contributions in certain coalitions, and these negative values should be interpreted as inhibitory influences within the model.
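
Reusing the shapley_values sketch above, a two-variable toy illustration (with a hypothetical linear model) shows how a negative coefficient translates into a negative Shapley value:

```python
# Hypothetical linear model: prediction = 0.5 + 0.3 * x1 - 0.2 * x2, with binary indicators.
# v(S) is the prediction when the variables in S are switched on and the others are set to 0.
def v(s):
    return 0.5 + (0.3 if "x1" in s else 0.0) - (0.2 if "x2" in s else 0.0)

print(shapley_values(["x1", "x2"], v))  # {'x1': 0.3, 'x2': -0.2}: x2 acts as a brake
```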




3. Criteria for Choosing Between a Bayesian Network and Shapley Value Decomposition, and Budget Constraints


The choice between a Bayesian network and a Shapley Value decomposition in a driver analysis context depends on the objectives, data type, and nature of the relationships to be modeled.

1. Why choose a Bayesian network?

  • A Bayesian network explicitly models causal dependencies between variables as a directed acyclic graph. It is suited to understand conditional relationships and direct and indirect influences between variables.
  • Useful when the goal is to analyze the overall system, its interactions, and to infer conditional probabilities—for example, to perform diagnostics or cause-effect scenario analysis.
  • It allows integrating both prior knowledge (expertise) and observed data.
  • Fits prospective analyses enabling the simulation of different states and their impacts.

2. Why choose Shapley Value decomposition?

  • Shapley decomposition is an interpretability method aimed at explaining the precise marginal contribution of each variable to a prediction or a particular outcome.
  • Very useful for targeted driver analysis focusing on individual attribute effects in predictive models, especially where direct impact is not trivial to extract in complex models.
  • Provides additive contribution measures even when variables interact, supporting the decision-making process for prioritizing action on drivers.
  • Model-agnostic approach, applicable to any predictive model provided the number of explanatory variables remains small (< 20), since the number of coalitions to evaluate grows exponentially.

3. How to choose?

  • Choose a Bayesian network when understanding causal relationships and dependencies between variables is crucial, and you want to explicitly model these dependencies in a probabilistic framework.
  • Choose the Shapley Value when the main objective is to decompose the overall effect of a predictive model into precise per-variable contributions, to interpret and prioritize drivers based on their individual impact.
  • In some cases, both approaches can be complementary: Bayesian networks for global causal modeling, and Shapley Value for local explanation of predictions and decisions.

In summary, the Bayesian network is more oriented towards global causal diagnosis, while the Shapley decomposition offers a fine and fair analysis of individual variable contributions within a predictive framework, ideal for driver analysis.


To help you position yourself, here is a summary comparative table of the two approaches.
Criterion       | Bayesian Networks                     | Shapley Value Decomposition
Foundation      | Bayes' theorem, probabilistic graph   | Cooperative game theory
Interpretation  | Causal                                | Additive, descriptive
Robustness      | Strong, even on imperfect data        | Sensitive to multicollinearity
Extensibility   | Good with modern algorithms/software  | Limited (number of variables < 20)
Simulation      | Yes                                   | No
Fairness audit  | Medium                                | Excellent
Budget          | Varies with complexity                | Limited if the number of variables is small



4. Conclusion

The comparison between Bayesian networks and Shapley Value decomposition shows that each method addresses distinct needs:

  • Bayesian networks illuminate causal relationships and system dynamics,
  • While Shapley Value decomposition provides an additive, clear, and fair attribution of the importance of each variable within a predictive model.

Other driver analysis approaches exist, such as attribute effects, partial correlations, or variance analysis.

Stat & More offers you deep expertise and advanced data analysis solutions, as well as automation of your decision-making analyses.
Are you a market research company, an advertiser, or simply looking for answers to your questions? More generally, do you need to leverage your data and maximize the informative potential contained within it?
Benefit from personalized support, rich and accessible deliverables, and position yourself as a leader in your expertise. So don’t hesitate, send an email to Stat & More for tailored support to get the best out of your data.

#StatAndMore #BayesianNetworks #ShapleyValue #DriverAnalysis #Causality #DecisionScience #SocialResearch #Consulting #CustomAnalysis




[1] Cooperative game theory is a branch of game theory that studies situations where players can collaborate, form coalitions, and commit to collective strategies to maximize a shared gain rather than individually maximizing their own benefits. The central idea is that participants can negotiate and agree on how to share the benefits from this cooperation.
Source: Wikipedia Cooperative Game Theory

[2] The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood parameters of probabilistic models that depend on latent variables which are unobserved. It was proposed by Dempster et al. in 1977. Many variants have since been developed forming an entire class of algorithms.
Source: Wikipedia Expectation-Maximization




REFERENCES:

1. Wikipedia. Thomas Bayes. https://en.wikipedia.org/wiki/Thomas_Bayes

2. Wikipedia. Bayesian Information Criterion. https://en.wikipedia.org/wiki/Bayesian_information_criterion

3. Algorithms for discovery of acyclic Bayesian networks [1] https://arxiv.org/pdf/1502.02454

4. Algorithms for discovery of acyclic Bayesian networks [2] https://theses.hal.science/tel-00485862/PDF/HDR-Part1.pdf

5. Algorithms for discovery of acyclic Bayesian networks [3] https://www.rfai.lifat.univ-tours.fr/PublicData/PhD/a.delaplace.thesis.pdf

6. Bayesian Networks and Shapley Values https://hal.sorbonne-universite.fr/hal-03417323v1/document