User Guide

This guide provides comprehensive documentation for scikit-bayes estimators, including theoretical background and practical usage examples.

Introduction

What are Bayesian Network Classifiers?

Bayesian Network Classifiers (BNCs) are probabilistic classifiers based on Bayes’ theorem. The simplest and most famous is Naive Bayes, which assumes all features are conditionally independent given the class:

\[P(y|\mathbf{x}) \propto P(y) \prod_{i=1}^{n} P(x_i|y)\]

While this assumption rarely holds in practice, Naive Bayes is surprisingly effective and computationally efficient. However, when feature dependencies are strong (e.g., the XOR problem), it fails.
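The factorization above can be evaluated by hand. The following is a minimal sketch with made-up probabilities (the priors and likelihoods are illustrative assumptions, not estimates from any dataset); it works in log space and normalizes at the end.

```python
import math

# Hypothetical, hand-picked probabilities for two classes (0 and 1)
# and two binary features; not estimated from any real data.
prior = {0: 0.6, 1: 0.4}
likelihood = {
    0: [0.1, 0.5],  # P(x_1 = 1 | y = 0), P(x_2 = 1 | y = 0)
    1: [0.8, 0.3],  # P(x_1 = 1 | y = 1), P(x_2 = 1 | y = 1)
}


def naive_bayes_posterior(x):
    """Return P(y | x) for a binary feature vector x."""
    log_joint = {}
    for y, p_y in prior.items():
        # log P(y) + sum_i log P(x_i | y)
        total = math.log(p_y)
        for x_i, p_i in zip(x, likelihood[y]):
            total += math.log(p_i if x_i == 1 else 1.0 - p_i)
        log_joint[y] = total
    # Normalize so the class probabilities sum to one.
    norm = sum(math.exp(v) for v in log_joint.values())
    return {y: math.exp(v) / norm for y, v in log_joint.items()}


posterior = naive_bayes_posterior([1, 0])
print(posterior)
```

Here class 1 wins because 0.4 × 0.8 × 0.7 = 0.224 outweighs 0.6 × 0.1 × 0.5 = 0.03.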

Why scikit-bayes?

scikit-learn provides excellent Naive Bayes implementations, but has limitations:

  1. No native mixed data support: You cannot directly combine Gaussian, Categorical, and Bernoulli features in one model.

  2. No dependency modeling: No implementations of AODE, A2DE, or other n-dependence estimators that relax the independence assumption.

  3. No hybrid models: No discriminatively-trained Bayesian classifiers like ALR (Accelerated Logistic Regression).

scikit-bayes fills these gaps with fully scikit-learn compatible estimators.

MixedNB: Mixed Data Naive Bayes

skbn.MixedNB handles datasets with heterogeneous feature types by internally combining scikit-learn’s specialized Naive Bayes estimators.

The Problem

Consider a dataset with:

  • Feature 0: Age (continuous) → Gaussian

  • Feature 1: Gender (binary) → Bernoulli

  • Feature 2: Education Level (0, 1, 2, 3) → Categorical

With sklearn, you’d need to:

  1. Split features by type

  2. Fit separate NB classifiers

  3. Combine probabilities manually

MixedNB does this automatically.
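For comparison, the three manual steps might look roughly like this with plain scikit-learn. This is a sketch, not MixedNB's actual implementation: the helper names `split_by_type` and `combined_predict_proba` are made up for illustration, and the column layout (0 = Gaussian, 1 = Bernoulli, 2 = Categorical) is assumed.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

# Columns: [Gaussian, Bernoulli, Categorical]
X = np.array([
    [1.5, 0, 0],
    [2.3, 1, 1],
    [0.8, 1, 2],
    [1.1, 0, 0],
    [3.2, 1, 1],
    [-0.5, 0, 2],
])
y = np.array([0, 1, 1, 0, 1, 0])


def split_by_type(X):
    """Step 1: split columns into Gaussian, Bernoulli and Categorical parts."""
    X = np.asarray(X, dtype=float)
    return X[:, [0]], X[:, [1]], X[:, [2]].astype(int)


# Step 2: fit one scikit-learn estimator per feature type.
estimators = [GaussianNB(), BernoulliNB(alpha=1.0), CategoricalNB(alpha=1.0)]
for est, X_part in zip(estimators, split_by_type(X)):
    est.fit(X_part, y)

# Empirical class prior; by default all three estimators learn this same prior.
log_prior = np.log(np.bincount(y) / len(y))


def combined_predict_proba(X_new):
    """Step 3: combine the per-type posteriors into a single posterior."""
    parts = split_by_type(X_new)
    # Each posterior is proportional to P(y) * P(x_part | y), so summing the
    # three log-posteriors counts the prior three times; remove two copies.
    log_post = sum(est.predict_log_proba(p) for est, p in zip(estimators, parts))
    log_post = log_post - (len(estimators) - 1) * log_prior
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    proba = np.exp(log_post)
    return proba / proba.sum(axis=1, keepdims=True)


proba = combined_predict_proba([[1.0, 1, 1]])
print(proba)
```

The prior correction in step 3 is the part that is easy to get wrong when combining estimators by hand.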

Feature Type Detection

MixedNB auto-detects feature types during fit():

  • Gaussian: Float features with non-integer values

  • Bernoulli: Features with exactly 2 unique values

  • Categorical: Integer features with >2 unique values
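The rules above can be sketched as a small heuristic. The function `infer_feature_type` is hypothetical and is not part of skbn; in particular, the order in which the rules are checked (two unique values first, then integer-valued) is an assumption, since the rules overlap for some columns.

```python
def infer_feature_type(values):
    """Classify one feature column as 'bernoulli', 'categorical' or 'gaussian'.

    Hypothetical helper illustrating the detection rules; the precedence of
    the checks is an assumption, not documented skbn behavior.
    """
    unique = set(values)
    if len(unique) == 2:
        return "bernoulli"
    if all(float(v).is_integer() for v in unique):
        return "categorical"
    return "gaussian"


columns = {
    "age": [1.5, 2.3, 0.8, 1.1, 3.2, -0.5],
    "gender": [0, 1, 1, 0, 1, 0],
    "education": [0, 1, 2, 0, 1, 2],
}
detected = {name: infer_feature_type(col) for name, col in columns.items()}
print(detected)
```

Columns that a heuristic like this would misread, such as an integer-coded continuous measurement, are the case for overriding the detected type.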

You can also specify types manually:

from skbn import MixedNB

# Force features 2 and 3 to be categorical
clf = MixedNB(categorical_features=[2, 3])

# Force feature 1 to be Bernoulli
clf = MixedNB(bernoulli_features=[1])

Usage Example

import numpy as np
from skbn import MixedNB

# Features: [Gaussian, Bernoulli, Categorical]
X = np.array([
    [1.5, 0, 0],
    [2.3, 1, 1],
    [0.8, 1, 2],
    [1.1, 0, 0],
    [3.2, 1, 1],
    [-0.5, 0, 2]
])
y = np.array([0, 1, 1, 0, 1, 0])

clf = MixedNB(alpha=1.0)
clf.fit(X, y)

# Inspect detected types
print(clf.feature_types_)
# {'gaussian': [0], 'bernoulli': [1], 'categorical': [2]}

# Predict
print(clf.predict([[1.0, 1, 1]]))  # [1]
print(clf.predict_proba([[1.0, 1, 1]]))  # [[0.xx, 0.xx]]

Parameters

  • alpha: Smoothing parameter (Laplace smoothing) for Categorical/Bernoulli. Default: 1.0

  • var_smoothing: Variance smoothing for Gaussian features. Default: 1e-9

  • categorical_features: List of indices to treat as categorical

  • bernoulli_features: List of indices to treat as Bernoulli
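To make the role of alpha concrete, here is a sketch of standard additive (Laplace) smoothing, the formula used by scikit-learn's CategoricalNB. That MixedNB applies exactly this formula is an assumption based on its use of the scikit-learn estimators; the function name is made up for illustration.

```python
def smoothed_probability(count, class_total, n_categories, alpha=1.0):
    """Additive smoothing: (count + alpha) / (class_total + alpha * n_categories)."""
    return (count + alpha) / (class_total + alpha * n_categories)


# Hypothetical counts of three categories among the 3 samples of one class.
counts = [2, 1, 0]
class_total = sum(counts)

smoothed = [smoothed_probability(c, class_total, len(counts)) for c in counts]
unsmoothed = [smoothed_probability(c, class_total, len(counts), alpha=0.0) for c in counts]
print(smoothed)
print(unsmoothed)
```

With alpha=1.0 the unseen category gets probability 1/6 instead of 0, so a single unseen value can no longer zero out the whole product of likelihoods.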

AnDE Family: Relaxing Independence

The Averaged n-Dependence Estimators (AnDE) family relaxes the independence assumption by conditioning on “super-parent” features.

The Independence Problem

Consider the XOR problem:

X1    X2    Y
-1    -1    0
-1    +1    1
+1    -1    1
+1    +1    0

Looking at X1 alone: P(X1|Y=0) = P(X1|Y=1), because X1 takes the values -1 and +1 equally often in both classes. The same holds for X2: P(X2|Y=0) = P(X2|Y=1).

Each feature on its own therefore carries no information about the class, so Naive Bayes cannot learn this problem. It achieves ~50% accuracy (random guessing).

AnDE solves this by modeling P(X2 | Y, X1) instead of just P(X2 | Y).
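The difference between the two conditionals can be checked by counting over the four XOR rows. This is a small illustrative sketch; the helper `cond_dist` is made up for the example.

```python
from collections import Counter

# The four XOR rows as (x1, x2, y).
rows = [(-1, -1, 0), (-1, +1, 1), (+1, -1, 1), (+1, +1, 0)]


def cond_dist(rows, target, given):
    """Empirical P(target | given) from raw counts.

    `target` and `given` map a row to a hashable key; the result is keyed
    by (given_key, target_key).
    """
    joint = Counter((given(r), target(r)) for r in rows)
    marginal = Counter(given(r) for r in rows)
    return {key: n / marginal[key[0]] for key, n in joint.items()}


# P(X2 | Y): what Naive Bayes models.
p_x2_given_y = cond_dist(rows, target=lambda r: r[1], given=lambda r: r[2])

# P(X2 | Y, X1): what a one-dependence estimator with parent X1 models.
p_x2_given_y_x1 = cond_dist(rows, target=lambda r: r[1], given=lambda r: (r[2], r[0]))

print(p_x2_given_y)
print(p_x2_given_y_x1)
```

Every entry of P(X2 | Y) is 0.5, while every observed entry of P(X2 | Y, X1) is 1.0: once X1 is known, the class fully determines X2.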

The Super-Parent Strategy

An SPnDE (Super-Parent n-Dependence Estimator) conditions all child features on the class and n parent features:

\[P(y, \mathbf{x}) = P(Y^*) \prod_{i \in \text{children}} P(x_i | Y^*)\]

Where \(Y^* = (y, x_{p1}, x_{p2}, \ldots, x_{pn})\) is the “augmented super-class”.

AnDE averages over all possible parent combinations.

AnDE (Arithmetic Mean)

skbn.AnDE is the standard generative model described by Webb et al. [1].

\[P(y|\mathbf{x}) \propto \sum_{m} P_m(y, \mathbf{x})\]
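For discrete features and n=1, the averaging can be written out in a few lines. The following is a from-scratch sketch of AODE with Laplace smoothing, not skbn's implementation; the function names and the exact smoothing denominators are assumptions chosen for illustration.

```python
from collections import Counter


def fit_aode(X, y):
    """Collect the counts needed by every one-dependence sub-model."""
    n_features = len(X[0])
    model = {
        "n_samples": len(X),
        "classes": sorted(set(y)),
        "n_values": [len({row[i] for row in X}) for i in range(n_features)],
        "parent": Counter(),  # (p, x_p, y) -> count
        "pair": Counter(),    # (p, x_p, i, x_i, y) -> count
    }
    for row, label in zip(X, y):
        for p in range(n_features):
            model["parent"][(p, row[p], label)] += 1
            for i in range(n_features):
                if i != p:
                    model["pair"][(p, row[p], i, row[i], label)] += 1
    return model


def aode_posterior(model, x, alpha=1.0):
    """Average P_m(y, x) over all super-parents m, then normalize over y."""
    n_features = len(x)
    n_classes = len(model["classes"])
    scores = {}
    for c in model["classes"]:
        total = 0.0
        for p in range(n_features):
            n_parent = model["parent"][(p, x[p], c)]
            # Smoothed P(y, x_p), the probability of the augmented super-class.
            joint = (n_parent + alpha) / (
                model["n_samples"] + alpha * n_classes * model["n_values"][p]
            )
            for i in range(n_features):
                if i != p:
                    n_pair = model["pair"][(p, x[p], i, x[i], c)]
                    # Smoothed P(x_i | y, x_p).
                    joint *= (n_pair + alpha) / (n_parent + alpha * model["n_values"][i])
            total += joint
        scores[c] = total / n_features
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# XOR data: Naive Bayes cannot separate it, a one-dependence model can.
X = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]
y = [0, 1, 1, 0]
model = fit_aode(X, y)
predictions = [max(post, key=post.get) for post in (aode_posterior(model, x) for x in X)]
print(predictions)
```

Even with only four training rows and alpha=1.0, the averaged model assigns each XOR input to the correct class.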

Key Parameters:

  • n_dependence: Order of dependence

    • n=0: Equivalent to Naive Bayes (MixedNB)

    • n=1: AODE (Averaged One-Dependence Estimators)

    • n=2: A2DE (common choice for higher accuracy)

  • n_bins: Discretization bins for continuous super-parents

  • strategy: Discretization strategy (‘uniform’, ‘quantile’, ‘kmeans’)

Example:

from skbn import AnDE

# X, y: training data, e.g. the arrays from the MixedNB usage example

# AODE (n=1)
clf = AnDE(n_dependence=1, n_bins=5)
clf.fit(X, y)

# A2DE (n=2) - higher accuracy, more computation
clf = AnDE(n_dependence=2, n_bins=5)
clf.fit(X, y)

AnJE (Geometric Mean)

skbn.AnJE aggregates using the geometric mean (product of probabilities):

\[P(y|\mathbf{x}) \propto \prod_{m} P_m(y, \mathbf{x})\]

This is equivalent to summing log-probabilities and serves as the basis for the convex ALR optimization.
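The two aggregation rules differ only in how the per-model log-probabilities are combined: a log-sum-exp for the arithmetic mean, a plain sum for the geometric mean. A small sketch with made-up sub-model outputs (the numbers are illustrative assumptions):

```python
import math

# Hypothetical log P_m(y, x) from three sub-models, for classes 0 and 1.
log_joint = {
    0: [math.log(0.20), math.log(0.05), math.log(0.10)],
    1: [math.log(0.10), math.log(0.10), math.log(0.10)],
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normalize(log_scores):
    """Turn unnormalized log-scores into probabilities over classes."""
    z = logsumexp(list(log_scores.values()))
    return {y: math.exp(s - z) for y, s in log_scores.items()}


# AnDE-style: sum of probabilities (arithmetic mean up to a constant).
ande = normalize({y: logsumexp(v) for y, v in log_joint.items()})

# AnJE-style: product of probabilities, i.e. sum of log-probabilities.
anje = normalize({y: sum(v) for y, v in log_joint.items()})

print(ande)
print(anje)
```

In this example the arithmetic mean favors class 0 (0.35 vs 0.30), while the geometric mean is a tie (both products equal 0.001): the product penalizes the one sub-model that assigns class 0 a low probability.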

Usage is identical to AnDE.

ALR: Accelerated Logistic Regression

skbn.ALR is a hybrid generative-discriminative classifier [2].

It starts with the AnJE generative model and learns discriminative weights to optimize classification performance:

\[P(y|\mathbf{x}) \propto \exp\left(\sum_{m} w_m \cdot \log P_m(y, \mathbf{x})\right)\]
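The weighted formula can be evaluated directly once the sub-model log-probabilities are known. This sketch assumes one weight per sub-model (weight level 1) and uses made-up numbers; `alr_posterior` is a hypothetical helper, and learning the weights is not shown.

```python
import math

# Hypothetical log P_m(y, x) from three sub-models, for classes 0 and 1.
log_joint = {
    0: [math.log(0.20), math.log(0.05), math.log(0.10)],
    1: [math.log(0.10), math.log(0.10), math.log(0.10)],
}


def alr_posterior(log_joint, weights):
    """P(y | x) proportional to exp(sum_m w_m * log P_m(y, x))."""
    scores = {
        y: sum(w * lp for w, lp in zip(weights, lps))
        for y, lps in log_joint.items()
    }
    # Softmax over classes, shifted by the maximum for numerical stability.
    m = max(scores.values())
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}


# All weights equal to 1 recovers the AnJE geometric-mean aggregation.
uniform = alr_posterior(log_joint, [1.0, 1.0, 1.0])

# Up-weighting the first sub-model and down-weighting the second.
weighted = alr_posterior(log_joint, [2.0, 0.5, 1.0])

print(uniform)
print(weighted)
```

With unit weights the two classes tie; the second set of weights, which trusts the first sub-model more, moves the posterior for class 0 to about 0.74.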

Weight Granularity Levels:

ALR supports 4 levels of parameter granularity:

Level   Description         # Parameters   Best For
1       Per Model           M              Small datasets
2       Per Parent Value    M × V          Large datasets
3       Per Class           M × C          Multi-class
4       Per Value × Class   M × V × C      Very large datasets

Where M = number of models, V = parent value combinations, C = classes.
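As a rough guide to how quickly the weight count grows, the table can be turned into a small calculator. The function `n_weights` is hypothetical, and it assumes every sub-model has the same number V of parent value combinations, which need not hold in practice.

```python
def n_weights(level, n_models, n_parent_values, n_classes):
    """Number of weights per granularity level, following the table above."""
    counts = {
        1: n_models,
        2: n_models * n_parent_values,
        3: n_models * n_classes,
        4: n_models * n_parent_values * n_classes,
    }
    return counts[level]


# Example: M = 10 sub-models, V = 5 parent values each, C = 3 classes.
sizes = {level: n_weights(level, 10, 5, 3) for level in (1, 2, 3, 4)}
print(sizes)
```

Each level multiplies the number of weights to fit, which is why the finer levels are better suited to larger datasets.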

Example:

from skbn import ALR

# Level 1: Simple, low variance
clf = ALR(n_dependence=1, weight_level=1, l2_reg=1e-3)

# Level 3: Per-class weights (good for multi-class)
clf = ALR(n_dependence=1, weight_level=3, l2_reg=1e-4)

# X, y: training data, e.g. the arrays from the MixedNB usage example
clf.fit(X, y)

WeightedAnDE

skbn.WeightedAnDE applies discriminative weighting to the standard AnDE (arithmetic mean) model. Unlike in ALR, the resulting optimization problem is non-convex, so the optimizer may converge to a local optimum.

from skbn import WeightedAnDE

# X, y: training data, e.g. the arrays from the MixedNB usage example
clf = WeightedAnDE(n_dependence=1, weight_level=1)
clf.fit(X, y)

Parameter Tuning Guide

Choosing n_dependence

  • n=1 (AODE): Good default. Captures pairwise interactions.

  • n=2 (A2DE): Better accuracy, but the number of sub-models grows quadratically with the number of features: d(d-1)/2 for d features. Use for <50 features.

  • n≥3: Rarely needed. Computational cost grows combinatorially.
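The cost of each setting follows from counting super-parent sets: with d features and dependence order n there is one sub-model per n-element subset of features. A short sketch (the helper name `n_submodels` is made up for illustration):

```python
import math
from itertools import combinations


def n_submodels(n_features, n_dependence):
    """Number of distinct super-parent sets: C(n_features, n_dependence)."""
    return math.comb(n_features, n_dependence)


# Growth for a dataset with 50 features.
counts = {n: n_submodels(50, n) for n in (0, 1, 2, 3)}
print(counts)

# The actual parent sets for a tiny case: 4 features, n=2.
parent_sets = list(combinations(range(4), 2))
print(parent_sets)
```

For 50 features this gives 1 model at n=0, 50 at n=1, 1,225 at n=2 and 19,600 at n=3.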

Discretization Strategy

For continuous super-parents:

  • ‘quantile’ (default): Equal-frequency bins. Robust to outliers.

  • ‘uniform’: Equal-width bins. Good for uniform distributions.

  • ‘kmeans’: Data-driven bins. Best for multi-modal distributions.

n_bins is typically between 3 and 10. More bins give finer resolution but leave fewer samples per bin, which makes the probability estimates noisier.
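The effect of an outlier on the two simplest strategies can be seen with a standard-library sketch. The helpers below are illustrative stand-ins, not skbn's discretizer, and the exact bin-edge conventions of the library may differ.

```python
from bisect import bisect_right
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical feature with one outlier
n_bins = 3


def uniform_cuts(values, n_bins):
    """Interior cut points of equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def quantile_cuts(values, n_bins):
    """Interior cut points of (approximately) equal-frequency bins."""
    return quantiles(values, n=n_bins, method="inclusive")


def bin_counts(values, cuts):
    """Count how many values fall into each bin defined by the cut points."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts


uniform_counts = bin_counts(data, uniform_cuts(data, n_bins))
quantile_counts = bin_counts(data, quantile_cuts(data, n_bins))
print(uniform_counts)
print(quantile_counts)
```

Equal-width bins put nine of the ten values into the first bin and leave the middle bin empty, while equal-frequency bins stay balanced.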

Regularization in Hybrid Models

l2_reg controls regularization strength:

  • Small datasets: Use higher values (1e-2 to 1e-1) to prevent overfitting

  • Large datasets: Use lower values (1e-4 to 1e-3) for more flexibility

Computational Considerations

  • Use n_jobs=-1 to parallelize SPODE fitting

  • Higher weight_level means more weights to optimize (from M at level 1 up to M × V × C at level 4), which increases optimization time

  • A2DE with n_features=50 creates C(50, 2) = 1,225 sub-models

References