.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/04_plot_mixednb_vs_pipeline.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_04_plot_mixednb_vs_pipeline.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_04_plot_mixednb_vs_pipeline.py:


=====================================================
Handling Mixed Data Types with MixedNB
=====================================================

This example compares three strategies for handling a dataset that mixes
continuous (Gaussian) and discrete (categorical) features.

**The Scenario (Generative Data):**

We generate data from two natural clusters (Classes 0 and 1):

* **Continuous Feature:** Two overlapping Gaussian distributions; precision
  on this axis is what separates the classes.
* **Categorical Feature:** Different category probabilities per class.
  Class 0 prefers Cat '0', Class 1 prefers Cat '2'.

**The Competitors:**

1. **MixedNB (Native):** Models the Gaussian feature as Gaussian and the
   categorical feature as categorical, matching the data-generating process.
   (Acc: ~0.920)
2. **Pipeline (OHE + GNB):** Treats one-hot category indicators as Gaussian
   variables. (Acc: ~0.893)
3. **Pipeline (Discretizer + CatNB):** Bins the continuous feature, losing
   precision. (Acc: ~0.913)

**Result:** MixedNB produces the smoothest probability landscape and the best
log-loss, though the standard pipelines perform reasonably well on this simple
problem.

.. GENERATED FROM PYTHON SOURCE LINES 31-203



.. image-sg:: /auto_examples/images/sphx_glr_04_plot_mixednb_vs_pipeline_001.png
   :alt: MixedNB vs. Scikit-Learn Workarounds: Quality of Probability Landscape, 1. MixedNB (Native) Correct Assumptions Acc: 0.920 | Log Loss: 0.215, 2. Pipeline (OHE + GNB) Flawed: Cats are Gaussians Acc: 0.893 | Log Loss: 0.316, 3. Pipeline (Binned + CatNB) Flawed: Loss of Precision Acc: 0.913 | Log Loss: 0.223
   :srcset: /auto_examples/images/sphx_glr_04_plot_mixednb_vs_pipeline_001.png
   :class: sphx-glr-single-img


.. code-block:: Python

    # Author: The scikit-bayes Developers
    # SPDX-License-Identifier: BSD-3-Clause

    import warnings

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import CategoricalNB, GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

    from skbn.mixed_nb import MixedNB

    # Suppress warnings
    warnings.filterwarnings("ignore")

    # --- 1. Generate Probabilistic Mixed Data ---
    np.random.seed(42)
    n_samples = 1000

    # Class 0: Centered at X=-1, Prefer Cat 0
    n0 = n_samples // 2
    X_cont_0 = np.random.normal(-1.0, 1.0, size=n0)
    # Probabilities for Cat 0, 1, 2: [0.7, 0.2, 0.1]
    X_cat_0 = np.random.choice([0, 1, 2], size=n0, p=[0.7, 0.2, 0.1])

    # Class 1: Centered at X=1, Prefer Cat 2
    n1 = n_samples - n0
    X_cont_1 = np.random.normal(1.0, 1.0, size=n1)
    # Probabilities for Cat 0, 1, 2: [0.1, 0.2, 0.7]
    X_cat_1 = np.random.choice([0, 1, 2], size=n1, p=[0.1, 0.2, 0.7])

    # Combine
    X_cont = np.concatenate([X_cont_0, X_cont_1])
    X_cat = np.concatenate([X_cat_0, X_cat_1])
    y = np.concatenate([np.zeros(n0), np.ones(n1)]).astype(int)

    # Stack: [Continuous, Categorical]
    X = np.column_stack([X_cont, X_cat])

    # Split for valid metric calculation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # --- 2. Define Models ---
    # Model 1: MixedNB
    mnb = MixedNB()
    mnb.fit(X_train, y_train)

    # Model 2: OHE + GaussianNB
    pipe_ohe = make_pipeline(
        ColumnTransformer(
            [("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False), [1])],
            remainder="passthrough",
        ),
        GaussianNB(),
    )
    pipe_ohe.fit(X_train, y_train)

    # Model 3: Discretizer + CategoricalNB
    pipe_kbins = make_pipeline(
        ColumnTransformer(
            [
                (
                    "discretizer",
                    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
                    [0],
                )
            ],
            remainder="passthrough",
        ),
        CategoricalNB(),
    )
    pipe_kbins.fit(X_train, y_train)

    models = [mnb, pipe_ohe, pipe_kbins]
    titles = [
        "1. MixedNB (Native)\nCorrect Assumptions",
        "2. Pipeline (OHE + GNB)\nFlawed: Cats are Gaussians",
        "3. Pipeline (Binned + CatNB)\nFlawed: Loss of Precision",
    ]

    # --- 3. Visualization ---
    fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

    # Grid for plotting
    h = 0.05
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = -0.5, 2.5  # Categories 0, 1, 2
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, 1))

    # Flatten for prediction
    grid_X = np.c_[xx.ravel(), yy.ravel()]

    for ax, model, title in zip(axes, models, titles):
        # Predict
        Z = model.predict_proba(grid_X)[:, 1]
        Z = Z.reshape(xx.shape)

        # Metrics on Test Set
        acc = accuracy_score(y_test, model.predict(X_test))
        ll = log_loss(y_test, model.predict_proba(X_test))

        # Plot Heatmap - VIRIDIS
        # 0.0 = Purple (Class 0 zone), 1.0 = Yellow (Class 1 zone)
        ax.imshow(
            Z,
            extent=(x_min, x_max, y_min, y_max),
            origin="lower",
            cmap="viridis",
            vmin=0,
            vmax=1,
            aspect="auto",
            alpha=0.8,
        )

        # Overlay real data points (with jitter on Y)
        X_plot = X_test.copy()
        y_jit = X_plot[:, 1] + np.random.uniform(-0.2, 0.2, size=len(X_plot))

        # Visual Coherence with Viridis:
        # Class 0 -> Indigo (Matches background purple)
        # Class 1 -> Gold (Matches background yellow)
        # Edgecolors='w' ensures visibility even on matching backgrounds
        ax.scatter(
            X_plot[y_test == 0, 0],
            y_jit[y_test == 0],
            c="indigo",
            marker="o",
            s=30,
            alpha=0.9,
            edgecolors="w",
            linewidth=0.8,
            label="Class 0",
        )
        ax.scatter(
            X_plot[y_test == 1, 0],
            y_jit[y_test == 1],
            c="gold",
            marker="^",
            s=30,
            alpha=0.9,
            edgecolors="k",
            linewidth=0.5,
            label="Class 1",
        )  # Black edge for yellow points for better contrast

        # Title & Metrics
        ax.set_title(f"{title}\nAcc: {acc:.3f} | Log Loss: {ll:.3f}", fontsize=12)
        ax.set_xlabel("Continuous Feature")
        ax.set_yticks([0, 1, 2])

    axes[0].set_ylabel("Categorical Feature")

    # Clean legend
    handles, labels = axes[0].get_legend_handles_labels()
    fig.legend(handles, labels, loc="lower center", ncol=2, bbox_to_anchor=(0.5, 0.02))

    fig.suptitle(
        "MixedNB vs. Scikit-Learn Workarounds: Quality of Probability Landscape",
        fontsize=16,
    )
    plt.tight_layout()
    plt.subplots_adjust(top=0.80, bottom=0.15)
    plt.show()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.284 seconds)


.. _sphx_glr_download_auto_examples_04_plot_mixednb_vs_pipeline.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 04_plot_mixednb_vs_pipeline.ipynb <04_plot_mixednb_vs_pipeline.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 04_plot_mixednb_vs_pipeline.py <04_plot_mixednb_vs_pipeline.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 04_plot_mixednb_vs_pipeline.zip <04_plot_mixednb_vs_pipeline.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
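As a closing aside, the advantage MixedNB shows above follows directly from the
naive Bayes factorization: each class score is simply the log prior plus a
Gaussian log-density for the continuous feature plus a categorical
log-probability for the discrete one. The sketch below is a minimal plain-NumPy
illustration of that sum using the true generative parameters from this example
(means of -1 and +1, unit variance, the category tables above); the helper
names `log_joint` and `predict` are made up here and are not part of the
`skbn` API.

```python
import numpy as np

# True generative parameters from the example above.
means = np.array([-1.0, 1.0])            # Gaussian mean per class
variances = np.array([1.0, 1.0])         # Gaussian variance per class
cat_probs = np.array([[0.7, 0.2, 0.1],   # P(category | class 0)
                      [0.1, 0.2, 0.7]])  # P(category | class 1)
log_prior = np.log([0.5, 0.5])           # balanced classes


def log_joint(x_cont, x_cat):
    """Per-class log P(class) + log P(x_cont | class) + log P(x_cat | class)."""
    log_gauss = (
        -0.5 * np.log(2 * np.pi * variances)
        - (x_cont - means) ** 2 / (2 * variances)
    )
    log_cat = np.log(cat_probs[:, x_cat])
    return log_prior + log_gauss + log_cat


def predict(x_cont, x_cat):
    """Bayes-optimal class under the true mixed generative model."""
    return int(np.argmax(log_joint(x_cont, x_cat)))
```

Note how the two evidence types trade off on the log scale: a point at
``x_cont = 0.3`` sits slightly closer to the Class 1 mean, yet observing
Cat '0' (seven times likelier under Class 0) pulls the decision back to
Class 0. Preserving exactly this trade-off is what the one-hot and binning
pipelines distort.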