.. title:: User guide: contents

.. _user_guide:

==========
User Guide
==========

This guide provides comprehensive documentation for scikit-bayes estimators,
including theoretical background and practical usage examples.

.. contents:: Table of Contents
   :local:
   :depth: 2

Introduction
============

What are Bayesian Network Classifiers?
--------------------------------------

Bayesian Network Classifiers (BNCs) are probabilistic classifiers based on
Bayes' theorem. The simplest and most famous is **Naive Bayes**, which
assumes all features are conditionally independent given the class:

.. math::

   P(y|\mathbf{x}) \propto P(y) \prod_{i=1}^{n} P(x_i|y)

While this assumption rarely holds in practice, Naive Bayes is surprisingly
effective and computationally efficient. However, when feature dependencies
are strong (e.g., the XOR problem), it fails.

Why scikit-bayes?
-----------------

scikit-learn provides excellent Naive Bayes implementations, but it has
limitations:

1. **No native mixed data support**: You cannot directly combine Gaussian,
   Categorical, and Bernoulli features in one model.
2. **No dependency modeling**: No implementations of AODE, A2DE, or other
   n-dependence estimators that relax the independence assumption.
3. **No hybrid models**: No discriminatively trained Bayesian classifiers
   like ALR (Accelerated Logistic Regression).

scikit-bayes fills these gaps with **fully scikit-learn compatible**
estimators.

.. _mixed_naive_bayes:

MixedNB: Mixed Data Naive Bayes
===============================

:class:`skbn.MixedNB` handles datasets with heterogeneous feature types by
internally combining scikit-learn's specialized Naive Bayes estimators.

The Problem
-----------

Consider a dataset with:

* **Feature 0**: Age (continuous) → Gaussian
* **Feature 1**: Gender (binary) → Bernoulli
* **Feature 2**: Education Level (0, 1, 2, 3) → Categorical

With scikit-learn alone, you'd need to:

1. Split features by type
2. Fit separate NB classifiers
3. Combine probabilities manually

MixedNB does this automatically.
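For concreteness, here is a minimal sketch of that manual workaround built
from plain scikit-learn estimators, on the same toy data as the usage
example below. The prior bookkeeping in step 3 is the part that is easy to
get wrong:

.. code-block:: python

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

    # Toy layout: [Gaussian, Bernoulli, Categorical]
    X = np.array([
        [1.5, 0, 0],
        [2.3, 1, 1],
        [0.8, 1, 2],
        [1.1, 0, 0],
        [3.2, 1, 1],
        [-0.5, 0, 2],
    ])
    y = np.array([0, 1, 1, 0, 1, 0])

    # 1. Split features by type
    X_gauss = X[:, [0]]
    X_bern = X[:, [1]]
    X_cat = X[:, [2]].astype(int)

    # 2. Fit one specialized NB estimator per feature block
    gnb = GaussianNB().fit(X_gauss, y)
    bnb = BernoulliNB().fit(X_bern, y)
    cnb = CategoricalNB().fit(X_cat, y)

    # 3. Combine probabilities manually. Each per-block posterior already
    #    includes the class prior, so subtract it twice to count it once.
    log_post = (
        gnb.predict_log_proba(X_gauss)
        + bnb.predict_log_proba(X_bern)
        + cnb.predict_log_proba(X_cat)
        - 2 * np.log(gnb.class_prior_)
    )
    y_pred = gnb.classes_[np.argmax(log_post, axis=1)]

MixedNB performs a comparable combination internally, so the splitting,
separate fits, and prior arithmetic above collapse into a single
``fit``/``predict`` pair.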
Feature Type Detection
----------------------

MixedNB auto-detects feature types during ``fit()``:

* **Gaussian**: Float features with non-integer values
* **Bernoulli**: Features with exactly 2 unique values
* **Categorical**: Integer features with >2 unique values

You can also specify types manually:

.. code-block:: python

    from skbn import MixedNB

    # Force features 2 and 3 to be categorical
    clf = MixedNB(categorical_features=[2, 3])

    # Force feature 1 to be Bernoulli
    clf = MixedNB(bernoulli_features=[1])

Usage Example
-------------

.. code-block:: python

    import numpy as np
    from skbn import MixedNB

    # Features: [Gaussian, Bernoulli, Categorical]
    X = np.array([
        [1.5, 0, 0],
        [2.3, 1, 1],
        [0.8, 1, 2],
        [1.1, 0, 0],
        [3.2, 1, 1],
        [-0.5, 0, 2]
    ])
    y = np.array([0, 1, 1, 0, 1, 0])

    clf = MixedNB(alpha=1.0)
    clf.fit(X, y)

    # Inspect detected types
    print(clf.feature_types_)
    # {'gaussian': [0], 'bernoulli': [1], 'categorical': [2]}

    # Predict
    print(clf.predict([[1.0, 1, 1]]))        # [1]
    print(clf.predict_proba([[1.0, 1, 1]]))  # [[0.xx, 0.xx]]

Parameters
----------

* ``alpha``: Smoothing parameter (Laplace smoothing) for Categorical and
  Bernoulli features. Default: 1.0
* ``var_smoothing``: Variance smoothing for Gaussian features.
  Default: 1e-9
* ``categorical_features``: List of indices to treat as categorical
* ``bernoulli_features``: List of indices to treat as Bernoulli

AnDE Family: Relaxing Independence
==================================

The **Averaged n-Dependence Estimators (AnDE)** family relaxes the
independence assumption by conditioning on "super-parent" features.

The Independence Problem
------------------------

Consider the XOR problem:

.. list-table::
   :header-rows: 1

   * - X1
     - X2
     - Y
   * - -1
     - -1
     - 0
   * - -1
     - +1
     - 1
   * - +1
     - -1
     - 1
   * - +1
     - +1
     - 0

Looking at X1 alone: P(X1|Y=0) = P(X1|Y=1) (symmetric distributions).
Looking at X2 alone: P(X2|Y=0) = P(X2|Y=1) (symmetric distributions).

**Naive Bayes cannot learn this.** It achieves ~50% accuracy (random
guessing). AnDE solves this by modeling **P(X2 | Y, X1)** instead of just
P(X2 | Y).
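This failure mode is easy to reproduce. The sketch below assumes the
:class:`skbn.AnDE` interface described in the next section; exact scores
depend on the implementation, but the gap between the two models is the
point:

.. code-block:: python

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    from skbn import AnDE

    # 50 copies of the XOR truth table above
    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]] * 50, dtype=float)
    y = np.array([0, 1, 1, 0] * 50)

    # Naive Bayes: each feature's class-conditional distributions are
    # identical across classes, so it can only guess
    print(GaussianNB().fit(X, y).score(X, y))  # ~0.5

    # AODE models P(X2 | Y, X1), which makes XOR learnable
    print(AnDE(n_dependence=1, n_bins=2).fit(X, y).score(X, y))  # ~1.0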
The Super-Parent Strategy
-------------------------

An SPnDE (Super-Parent n-Dependence Estimator) conditions all child
features on the class **and** n parent features:

.. math::

   P(y, \mathbf{x}) = P(Y^*) \prod_{i \in \text{children}} P(x_i | Y^*)

where :math:`Y^* = (y, x_{p_1}, x_{p_2}, \ldots, x_{p_n})` is the
"augmented super-class". AnDE averages over all possible parent
combinations.

AnDE (Arithmetic Mean)
----------------------

:class:`skbn.AnDE` is the standard generative model described by
Webb et al. [1]_.

.. math::

   P(y|\mathbf{x}) \propto \sum_{m} P_m(y, \mathbf{x})

**Key Parameters:**

* ``n_dependence``: Order of dependence

  - n=0: Equivalent to Naive Bayes (MixedNB)
  - n=1: AODE (Averaged One-Dependence Estimators)
  - n=2: A2DE (common choice for higher accuracy)

* ``n_bins``: Discretization bins for continuous super-parents
* ``strategy``: Discretization strategy ('uniform', 'quantile', 'kmeans')

**Example:**

.. code-block:: python

    from skbn import AnDE

    # AODE (n=1)
    clf = AnDE(n_dependence=1, n_bins=5)
    clf.fit(X, y)

    # A2DE (n=2) - higher accuracy, more computation
    clf = AnDE(n_dependence=2, n_bins=5)

AnJE (Geometric Mean)
---------------------

:class:`skbn.AnJE` aggregates using the geometric mean (a product of
probabilities):

.. math::

   P(y|\mathbf{x}) \propto \prod_{m} P_m(y, \mathbf{x})

This is equivalent to summing log-probabilities and serves as the basis
for the convex ALR optimization. **Usage is identical to AnDE.**

ALR: Accelerated Logistic Regression
------------------------------------

:class:`skbn.ALR` is a **hybrid generative-discriminative** classifier
[3]_. It starts with the AnJE generative model and learns discriminative
weights to optimize classification performance:

.. math::

   P(y|\mathbf{x}) \propto \exp\left(\sum_{m} w_m \cdot \log P_m(y, \mathbf{x})\right)

**Weight Granularity Levels:**

ALR supports 4 levels of parameter granularity:

.. list-table::
   :header-rows: 1
   :widths: 10 30 30 30

   * - Level
     - Description
     - # Parameters
     - Best For
   * - 1
     - Per Model
     - M
     - Small datasets
   * - 2
     - Per Parent Value
     - M × V
     - Large datasets
   * - 3
     - Per Class
     - M × C
     - Multi-class
   * - 4
     - Per Value × Class
     - M × V × C
     - Very large datasets

where M = number of models, V = parent value combinations, and C = classes.

**Example:**

.. code-block:: python

    from skbn import ALR

    # Level 1: Simple, low variance
    clf = ALR(n_dependence=1, weight_level=1, l2_reg=1e-3)

    # Level 3: Per-class weights (good for multi-class)
    clf = ALR(n_dependence=1, weight_level=3, l2_reg=1e-4)
    clf.fit(X, y)

WeightedAnDE
------------

:class:`skbn.WeightedAnDE` applies discriminative weighting to the
standard AnDE (arithmetic mean) model. Unlike ALR, the optimization is
**non-convex**.

.. code-block:: python

    from skbn import WeightedAnDE

    clf = WeightedAnDE(n_dependence=1, weight_level=1)
    clf.fit(X, y)

Parameter Tuning Guide
======================

Choosing n_dependence
---------------------

* **n=1 (AODE)**: Good default. Captures pairwise interactions.
* **n=2 (A2DE)**: Better accuracy, but the number of sub-models grows
  quadratically with the number of features. Use for <50 features.
* **n≥3**: Rarely needed. Computational cost grows combinatorially.

Discretization Strategy
-----------------------

For continuous super-parents:

* **'quantile'** (default): Equal-frequency bins. Robust to outliers.
* **'uniform'**: Equal-width bins. Good for uniform distributions.
* **'kmeans'**: Data-driven bins. Best for multi-modal distributions.

``n_bins`` is typically 3-10; more bins give finer resolution but fewer
samples per bin.

Regularization in Hybrid Models
-------------------------------

``l2_reg`` controls regularization strength:

* **Small datasets**: Use higher values (1e-2 to 1e-1) to prevent
  overfitting
* **Large datasets**: Use lower values (1e-4 to 1e-3) for more flexibility

Computational Considerations
----------------------------

* Use ``n_jobs=-1`` to parallelize SPODE fitting
* Higher ``weight_level`` settings multiply the number of learned weights
  (see the granularity table above) and therefore increase optimization time
* A2DE with ``n_features=50`` creates ~1,225 sub-models (one per feature
  pair)

References
==========

.. [1] Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so naive Bayes:
   Aggregating one-dependence estimators. *Machine Learning*, 58(1), 5-24.

.. [2] Flores, M. J., Gámez, J. A., Martínez, A. M., & Puerta, J. M. (2009).
   GAODE and HAODE: Two proposals based on AODE to deal with continuous
   variables. *ICML '09*, 313-320.

.. [3] Zaidi, N. A., Webb, G. I., Carman, M. J., & Petitjean, F. (2017).
   Efficient parameter learning of Bayesian network classifiers.
   *Machine Learning*, 106(9-10), 1289-1329.