User Guide#

This guide provides comprehensive documentation for scikit-bayes estimators, including theoretical background and practical usage examples.

Introduction #

What are Bayesian Network Classifiers?#

Bayesian Network Classifiers (BNCs) compute class probabilities using Bayes’ theorem. The most common is Naive Bayes, which assumes features are conditionally independent given the class:

\[P(y|\mathbf{x}) \propto P(y) \prod_{i=1}^{n} P(x_i|y)\]

This assumption is rarely true, but Naive Bayes scales well and works as a baseline. When feature dependencies are strong (e.g., in a parity problem like XOR), it produces inaccurate probabilities.
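
As a rough illustration of the formula above, the sketch below estimates P(y) and P(x_i|y) from counts on discrete features and multiplies them together. The function names and the toy data are invented for this example, and the code is not how scikit-bayes or scikit-learn implement Naive Bayes internally.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(X, y, alpha=1.0):
    """Collect the counts needed for P(y) and P(x_i | y) on discrete features."""
    n_features = len(X[0])
    # Distinct values per feature, used in the Laplace smoothing denominator.
    values = [sorted({row[i] for row in X}) for i in range(n_features)]
    class_counts = Counter(y)
    # counts[i][c][v] = number of samples of class c whose feature i equals v
    counts = [defaultdict(Counter) for _ in range(n_features)]
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            counts[i][label][v] += 1
    return sorted(class_counts), values, class_counts, counts, alpha


def predict_proba(model, x):
    """Return normalized P(y | x) using P(y) * prod_i P(x_i | y)."""
    classes, values, class_counts, counts, alpha = model
    n_samples = sum(class_counts.values())
    scores = []
    for c in classes:
        score = class_counts[c] / n_samples  # prior P(y)
        for i, v in enumerate(x):
            # Smoothed estimate of P(x_i = v | y = c)
            score *= (counts[i][c][v] + alpha) / (
                class_counts[c] + alpha * len(values[i])
            )
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


# Toy data: two binary features, two classes (illustrative only).
X = [(0, 0), (0, 1), (1, 1), (1, 1)]
y = [0, 0, 1, 1]
model = fit_naive_bayes(X, y)
proba = predict_proba(model, (1, 1))
print(proba)
```

For real data, working in log space avoids numerical underflow when many features are multiplied.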

Why scikit-bayes?#

While scikit-learn includes standard Naive Bayes classes, they require some manual workarounds for complex datasets:

Heterogeneous data: You cannot pass a matrix with Gaussian, Categorical, and Bernoulli features directly into a single estimator without using a ColumnTransformer to split them.
Feature dependencies: There are no out-of-the-box implementations of n-dependence estimators like AODE or A2DE.
Hybrid optimization: Scikit-learn does not implement discriminatively-weighted Bayesian classifiers like ALR, which use generative models as preconditioners.

scikit-bayes implements these architectures while passing standard scikit-learn estimator checks.

MixedNB: Mixed Data Naive Bayes #

skbn.MixedNB handles datasets with heterogeneous feature types by internally combining scikit-learn’s specialized Naive Bayes estimators.

The Problem #

Consider a dataset with:

Feature 0: Age (continuous) → Gaussian
Feature 1: Gender (binary) → Bernoulli
Feature 2: Education Level (0, 1, 2, 3) → Categorical

With scikit-learn alone, you would need to:

Split features by type
Fit separate NB classifiers
Combine probabilities manually

MixedNB does this automatically.
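
To make the "combine probabilities manually" step concrete, here is a minimal standard-library sketch of the idea: each feature contributes a log-likelihood from the distribution that matches its type, and the contributions are added to the log prior. The helper names (`fit_mixed`, `predict_mixed`) are hypothetical, the Bernoulli feature is treated as a two-valued categorical, and the variance floor is a simplification. MixedNB's internals may differ.

```python
import math
from collections import Counter


def gaussian_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def fit_mixed(X, y, gaussian_cols, discrete_cols, alpha=1.0, var_floor=1e-9):
    """Fit per-class Gaussian parameters and smoothed discrete counts."""
    model = {"classes": sorted(set(y)), "prior": {}, "gauss": {}, "disc": {}}
    model["cols"] = (gaussian_cols, discrete_cols, alpha)
    for c in model["classes"]:
        rows = [r for r, label in zip(X, y) if label == c]
        model["prior"][c] = len(rows) / len(X)
        for j in gaussian_cols:
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Population variance plus a small floor (assumed simplification).
            var = sum((v - mean) ** 2 for v in col) / len(col) + var_floor
            model["gauss"][c, j] = (mean, var)
        for j in discrete_cols:
            n_values = len({r[j] for r in X})
            model["disc"][c, j] = (Counter(r[j] for r in rows), len(rows), n_values)
    return model


def predict_mixed(model, x):
    """Add per-type log-likelihoods to the log prior, then normalize."""
    gaussian_cols, discrete_cols, alpha = model["cols"]
    log_joint = []
    for c in model["classes"]:
        total = math.log(model["prior"][c])
        for j in gaussian_cols:
            total += gaussian_logpdf(x[j], *model["gauss"][c, j])
        for j in discrete_cols:
            counts, n_c, n_values = model["disc"][c, j]
            total += math.log((counts[x[j]] + alpha) / (n_c + alpha * n_values))
        log_joint.append(total)
    peak = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_joint]
    return [w / sum(weights) for w in weights]


# Columns: [age-like continuous, binary, categorical] (toy values).
X = [[1.5, 0, 0], [2.3, 1, 1], [0.8, 1, 2], [1.1, 0, 0], [3.2, 1, 1], [-0.5, 0, 2]]
y = [0, 1, 1, 0, 1, 0]
model = fit_mixed(X, y, gaussian_cols=[0], discrete_cols=[1, 2])
proba = predict_mixed(model, [1.0, 1, 1])
print(proba)
```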

Feature Type Detection #

MixedNB auto-detects feature types during fit():

Gaussian: Float features with non-integer values
Bernoulli: Features with exactly 2 unique values
Categorical: Integer features with >2 unique values
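
The three rules above can be sketched as a small heuristic. This is an illustrative reading of the rules, not the library's detection code: the function name is hypothetical, and checking the two-value rule first is an assumption about precedence.

```python
def guess_feature_type(column):
    """Guess a feature type from its observed values (illustrative heuristic)."""
    unique = set(column)
    # Assumption: the two-value rule takes precedence over the other two.
    if len(unique) == 2:
        return "bernoulli"
    # All values are whole numbers: treat as categorical codes.
    if all(float(v).is_integer() for v in unique):
        return "categorical"
    # At least one non-integer value: treat as continuous.
    return "gaussian"


columns = {
    "age": [1.5, 2.3, 0.8, 1.1, 3.2, -0.5],
    "gender": [0, 1, 1, 0, 1, 0],
    "education": [0, 1, 2, 0, 1, 2],
}
detected = {name: guess_feature_type(col) for name, col in columns.items()}
print(detected)
```

A heuristic like this can misread a continuous feature that happens to hold only whole numbers, which is one reason to pass the feature types explicitly.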

You can also specify types manually:

from skbn import MixedNB

# Force features 2 and 3 to be categorical
clf = MixedNB(categorical_features=[2, 3])

# Force feature 1 to be Bernoulli
clf = MixedNB(bernoulli_features=[1])

Usage Example #

import numpy as np
from skbn import MixedNB

# Features: [Gaussian, Bernoulli, Categorical]
X = np.array([
    [1.5, 0, 0],
    [2.3, 1, 1],
    [0.8, 1, 2],
    [1.1, 0, 0],
    [3.2, 1, 1],
    [-0.5, 0, 2]
])
y = np.array([0, 1, 1, 0, 1, 0])

clf = MixedNB(alpha=1.0)
clf.fit(X, y)

# Inspect detected types
print(clf.feature_types_)
# {'gaussian': [0], 'bernoulli': [1], 'categorical': [2]}

# Predict
print(clf.predict([[1.0, 1, 1]]))  # [1]
print(clf.predict_proba([[1.0, 1, 1]]))  # [[0.xx, 0.xx]]

Parameters #

alpha: Smoothing parameter (Laplace smoothing) for Categorical/Bernoulli. Default: 1.0
var_smoothing: Variance smoothing for Gaussian features. Default: 1e-9
categorical_features: List of indices to treat as categorical
bernoulli_features: List of indices to treat as Bernoulli

AnDE Family: Relaxing Independence #

The Averaged n-Dependence Estimators (AnDE) family relaxes the independence assumption by conditioning on “super-parent” features.

The Independence Problem #

Consider the XOR problem:

X1	X2	Y
-1	-1	0
-1	+1	1
+1	-1	1
+1	+1	0

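Looking at X1 alone: P(X1|Y=0) = P(X1|Y=1), because each class contains one -1 and one +1. The same holds for X2 alone: P(X2|Y=0) = P(X2|Y=1).

Naive Bayes cannot learn this. Every per-feature likelihood is identical across the two classes, so the posterior is 50/50 for every input and accuracy stays at ~50% (random guessing).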

AnDE solves this by modeling P(X2 | Y, X1) instead of just P(X2 | Y).
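
The difference can be checked by hand on the four XOR rows. The sketch below uses raw (unsmoothed) counts and hypothetical helper names. It compares the Naive Bayes factorization with a one-dependence factorization that uses X1 as the parent of X2.

```python
# Rows are (X1, X2, Y), taken from the XOR table above.
DATA = [(-1, -1, 0), (-1, +1, 1), (+1, -1, 1), (+1, +1, 0)]


def cond_prob(target_idx, target_val, given):
    """P(column target_idx == target_val | columns in `given` match), from counts."""
    matching = [r for r in DATA if all(r[i] == v for i, v in given.items())]
    hits = [r for r in matching if r[target_idx] == target_val]
    return len(hits) / len(matching)


def normalize(scores):
    total = sum(scores)
    return [s / total for s in scores]


def naive_bayes(x1, x2):
    """P(y) * P(x1 | y) * P(x2 | y) for y in {0, 1}."""
    return normalize([
        cond_prob(2, y, {}) * cond_prob(0, x1, {2: y}) * cond_prob(1, x2, {2: y})
        for y in (0, 1)
    ])


def one_dependence(x1, x2):
    """P(y) * P(x1 | y) * P(x2 | y, x1): X2 is also conditioned on X1."""
    return normalize([
        cond_prob(2, y, {}) * cond_prob(0, x1, {2: y})
        * cond_prob(1, x2, {2: y, 0: x1})
        for y in (0, 1)
    ])


for x1, x2, y in DATA:
    print((x1, x2), "true:", y, "NB:", naive_bayes(x1, x2),
          "1-dep:", one_dependence(x1, x2))
```

Naive Bayes returns equal probabilities for both classes on every row, while the one-dependence model puts all its mass on the correct class.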

The Super-Parent Strategy #

An SPnDE (Super-Parent n-Dependence Estimator) conditions all child features on the class and n parent features:

\[P(y, \mathbf{x}) = P(Y^*) \prod_{i \in \text{children}} P(x_i | Y^*)\]

Where \(Y^* = (y, x_{p_1}, x_{p_2}, \ldots, x_{p_n})\) is the “augmented super-class”.

AnDE averages the SPnDE estimates over every possible choice of n super-parent features.
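
A short sketch of the bookkeeping may help: one sub-model exists per set of n parent features, and each sub-model sees the class label fused with the parent values. The helper names are hypothetical, and the features are assumed to be discrete already (continuous parents would need to be binned first).

```python
from itertools import combinations


def super_parent_sets(n_features, n_dependence):
    """All sets of n parent feature indices, one per SPnDE sub-model."""
    return list(combinations(range(n_features), n_dependence))


def augmented_class(x, y, parents):
    """Build Y* = (y, x_p1, ..., x_pn) for one sample and one parent set."""
    return (y,) + tuple(x[p] for p in parents)


def child_features(n_features, parents):
    """Features that are conditioned on Y* (everything that is not a parent)."""
    return [i for i in range(n_features) if i not in parents]


x, y = (2, 0, 1), 1  # one sample with three discrete features
for parents in super_parent_sets(len(x), 1):
    print(parents, augmented_class(x, y, parents), child_features(len(x), parents))
```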

AnDE (Arithmetic Mean)#

skbn.AnDE is the standard generative model described by Webb et al. [1].

\[P(y|\mathbf{x}) \propto \sum_{m} P_m(y, \mathbf{x})\]

Key Parameters:

n_dependence: Order of dependence
- n=0: Equivalent to Naive Bayes (MixedNB)
- n=1: AODE (Averaged One-Dependence Estimators)
- n=2: A2DE (common choice for higher accuracy)
n_bins: Discretization bins for continuous super-parents
strategy: Discretization strategy (‘uniform’, ‘quantile’, ‘kmeans’)

Example:

from skbn import AnDE

# AODE (n=1)
clf = AnDE(n_dependence=1, n_bins=5)
clf.fit(X, y)

# A2DE (n=2) - higher accuracy, more computation
clf = AnDE(n_dependence=2, n_bins=5)

AnJE (Geometric Mean)#

skbn.AnJE aggregates using the geometric mean, written here as a plain product of probabilities (omitting the root does not change which class scores highest):

\[P(y|\mathbf{x}) \propto \prod_{m} P_m(y, \mathbf{x})\]

This is equivalent to summing log-probabilities and serves as the basis for the convex ALR optimization.

Usage is identical to AnDE.
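
The two aggregation rules can be compared on made-up sub-model outputs. In this sketch, `joint[m][c]` stands for P_m(y=c, x). The numbers are invented and the function names are hypothetical. It also assumes every probability is strictly positive, which smoothing would normally guarantee.

```python
import math


def ande_posterior(joint):
    """Arithmetic aggregation: sum the joint probabilities over sub-models."""
    n_classes = len(joint[0])
    scores = [sum(model[c] for model in joint) for c in range(n_classes)]
    total = sum(scores)
    return [s / total for s in scores]


def anje_posterior(joint):
    """Geometric aggregation: sum log-probabilities, then normalize."""
    n_classes = len(joint[0])
    log_scores = [
        sum(math.log(model[c]) for model in joint) for c in range(n_classes)
    ]
    peak = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two sub-models, two classes (illustrative numbers only).
joint = [
    [0.30, 0.10],   # sub-model 0 favours class 0
    [0.001, 0.02],  # sub-model 1 gives class 0 a very small probability
]
print("arithmetic:", ande_posterior(joint))
print("geometric: ", anje_posterior(joint))
```

On these numbers the arithmetic rule picks class 0 and the geometric rule picks class 1. A single sub-model that assigns a very small probability acts almost like a veto under the product.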

ALR: Accelerated Logistic Regression #

skbn.ALR is a hybrid generative-discriminative classifier [4].

It starts with the AnJE generative model and learns discriminative weights to optimize classification performance:

\[P(y|\mathbf{x}) \propto \exp\left(\sum_{m} w_m \cdot \log P_m(y, \mathbf{x})\right)\]
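
As a sketch of the formula above with one weight per sub-model, the function below evaluates the weighted log-linear combination for given weights. It does not show how the weights are learned. The name `alr_posterior` and the numbers are assumptions made for illustration.

```python
import math


def alr_posterior(joint, weights):
    """P(y | x) proportional to exp(sum_m w_m * log P_m(y, x)).

    joint[m][c] holds P_m(y=c, x); weights[m] holds w_m.
    """
    n_classes = len(joint[0])
    log_scores = [
        sum(w * math.log(model[c]) for w, model in zip(weights, joint))
        for c in range(n_classes)
    ]
    peak = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


joint = [[0.30, 0.10], [0.001, 0.02]]  # invented sub-model outputs

# All weights equal to 1: the plain product of the sub-models.
print(alr_posterior(joint, [1.0, 1.0]))

# Weight 0 on sub-model 1: it is ignored, only sub-model 0 contributes.
print(alr_posterior(joint, [1.0, 0.0]))
```

Setting every weight to 1 recovers the unweighted product, so the generative model is a natural starting point for the optimization.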

Weight Granularity Levels:

ALR supports 4 levels of parameter granularity:

Level	Description	# Parameters	Best For
1	Per Model	M	Small datasets
2	Per Parent Value	M × V	Large datasets
3	Per Class	M × C	Multi-class
4	Per Value × Class	M × V × C	Very large datasets

Where M = number of models, V = parent value combinations, C = classes.
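
The parameter counts in the table can be reproduced with a few lines. This sketch treats V as the same for every sub-model, which is a simplification, and `n_weights` is a hypothetical helper rather than part of the library.

```python
def n_weights(level, n_models, n_parent_values, n_classes):
    """Number of discriminative weights for each granularity level in the table."""
    counts = {
        1: n_models,                                # per model
        2: n_models * n_parent_values,              # per parent value
        3: n_models * n_classes,                    # per class
        4: n_models * n_parent_values * n_classes,  # per value and class
    }
    if level not in counts:
        raise ValueError("level must be 1, 2, 3 or 4")
    return counts[level]


# Example: 10 sub-models, 5 parent values each, 3 classes.
for level in (1, 2, 3, 4):
    print(level, n_weights(level, n_models=10, n_parent_values=5, n_classes=3))
```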

Example:

from skbn import ALR

# Level 1: Simple, low variance
clf = ALR(n_dependence=1, weight_level=1, l2_reg=1e-3)

# Level 3: Per-class weights (good for multi-class)
clf = ALR(n_dependence=1, weight_level=3, l2_reg=1e-4)

clf.fit(X, y)

WeightedAnDE #

skbn.WeightedAnDE applies discriminative weighting to the standard AnDE (arithmetic mean) model [3]. Unlike ALR, the optimization is non-convex.

from skbn import WeightedAnDE

clf = WeightedAnDE(n_dependence=1, weight_level=1)
clf.fit(X, y)

Parameter Tuning Guide #

Choosing n_dependence #

n=1 (AODE): Good default. Captures pairwise interactions.
n=2 (A2DE): Often better accuracy, but the number of sub-models grows quadratically with the number of features (one per feature pair). Use for <50 features.
n≥3: Rarely needed. Computational cost grows combinatorially.
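
The growth is easy to quantify: the number of sub-models is the number of ways to choose n super-parents from the available features. The sketch below uses a hypothetical helper name and counts only parent sets; it ignores the cost of the probability tables inside each sub-model.

```python
from math import comb


def n_sub_models(n_features, n_dependence):
    """Number of ways to choose n_dependence super-parents from n_features."""
    return comb(n_features, n_dependence)


for n in (1, 2, 3):
    print(f"n={n}: {n_sub_models(50, n)} sub-models for 50 features")
```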

Discretization Strategy #

For continuous super-parents:

‘quantile’ (default): Equal-frequency bins. Robust to outliers.
‘uniform’: Equal-width bins. Good for uniform distributions.
‘kmeans’: Data-driven bins. Best for multi-modal distributions.

n_bins is typically set between 3 and 10. More bins give finer resolution but leave fewer samples per bin, which makes the conditional probability estimates noisier.
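
To see why equal-frequency bins cope better with outliers, the following standard-library sketch compares bin occupancy for equal-width and equal-frequency cut points on a small sample with one extreme value. The helper names are hypothetical, and this is not the discretization code used by the library.

```python
import bisect
import statistics


def uniform_edges(values, n_bins):
    """Interior cut points for equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Interior cut points that split the sample into equal-frequency bins."""
    return statistics.quantiles(values, n=n_bins)


def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one outlier at 100
uniform = bin_counts(values, uniform_edges(values, 4))
quantile = bin_counts(values, quantile_edges(values, 4))
print("uniform: ", uniform)
print("quantile:", quantile)
```

With equal-width bins the outlier stretches the range so that almost all samples share one bin, while the equal-frequency cut points keep the bins roughly balanced.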

Regularization in Hybrid Models #

l2_reg controls regularization strength:

Small datasets: Use higher values (1e-2 to 1e-1) to prevent overfitting
Large datasets: Use lower values (1e-4 to 1e-3) for more flexibility

Computational Considerations #

Use n_jobs=-1 to parallelize fitting of the SPnDE sub-models
Higher weight_level values add weights (from M at level 1 up to M × V × C at level 4), which increases optimization time
A2DE with n_features=50 creates C(50, 2) = 1,225 sub-models