Probability: Conditional Independence Properties
In this article, we will prove that conditional independence does not hold for one-hot-encoded data. This has implications for the naive Bayes classifier, since that algorithm relies on the conditional independence assumption. It means we should avoid one-hot-encoding categorical data when using the naive Bayes classifier.
Before we do this, we go over the definition of conditional independence in the background section, as well as some notation. From there, we will prove the five properties of conditional independence in section 1. In section 2, we will prove that conditional independence does not hold for one-hot-encoded data.
Section: Background
Definition 1: Conditional Independence:
P(X, Y| Z) = P(X|Z)P(Y|Z)
There is another equivalent definition of conditional independence.
Definition 2: Conditional independence
P(X|Y, Z) = P(X|Z)
Theorem 1: Equivalence of Conditional Independence Definitions
P(X, Y|Z) = P(X|Z)P(Y|Z) if and only if P(X|Y, Z) = P(X|Z)
The proof of this theorem is short and can be found on Wikipedia; I have shamelessly reproduced it here. If P(X, Y|Z) = P(X|Z)P(Y|Z), then P(X|Y, Z) = P(X, Y|Z)/P(Y|Z) = P(X|Z)P(Y|Z)/P(Y|Z) = P(X|Z). Conversely, if P(X|Y, Z) = P(X|Z), then P(X, Y|Z) = P(X|Y, Z)P(Y|Z) = P(X|Z)P(Y|Z).
Credits: Wikipedia: Conditional Independence
Important Note:
Sometimes you will see the following notation:
X ⊥⊥ Y | Z
This means that the random variable X is conditionally independent of Y given Z. In other words, P(X, Y|Z) = P(X|Z)P(Y|Z) and, equivalently, P(X|Y, Z) = P(X|Z).
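To make Theorem 1 concrete, here is a minimal numeric sanity check. It builds a toy joint distribution that satisfies conditional independence by construction and verifies that both definitions hold; all probability tables below are made-up values for illustration, not from the article.

```python
import numpy as np

# Toy joint distribution P(X, Y, Z) over binary variables, built so that
# X and Y are conditionally independent given Z:
#   P(x, y, z) = P(z) * P(x|z) * P(y|z)
p_z = np.array([0.4, 0.6])            # P(Z)
p_x_given_z = np.array([[0.2, 0.8],   # P(X|Z=0)
                        [0.7, 0.3]])  # P(X|Z=1)
p_y_given_z = np.array([[0.5, 0.5],
                        [0.1, 0.9]])

joint = np.zeros((2, 2, 2))  # indexed as joint[x, y, z]
for x in range(2):
    for y in range(2):
        for z in range(2):
            joint[x, y, z] = p_z[z] * p_x_given_z[z, x] * p_y_given_z[z, y]

for z in range(2):
    p_xy_z = joint[:, :, z] / joint[:, :, z].sum()  # P(X, Y | Z=z)
    p_x_z = p_xy_z.sum(axis=1)                      # P(X | Z=z)
    p_y_z = p_xy_z.sum(axis=0)                      # P(Y | Z=z)
    # Definition 1: P(X, Y|Z) = P(X|Z)P(Y|Z)
    assert np.allclose(p_xy_z, np.outer(p_x_z, p_y_z))
    # Definition 2: P(X|Y, Z) = P(X|Z) for every value of Y
    p_x_given_yz = p_xy_z / p_xy_z.sum(axis=0, keepdims=True)
    assert np.allclose(p_x_given_yz, p_x_z[:, None])
print("Both definitions hold on this joint distribution.")
```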
Section 1: Properties of Conditional Independence
The Five Properties of Conditional Independence
We will now state the five properties of conditional independence. As a minor note, the properties we present were formerly known as the graphoid axioms, but it was later realised that the axioms could be proven. The "graph" part of the word "graphoid" comes from the properties holding for Bayesian networks, which are graphs: Bayesian networks represent conditional independence between variables using a graph.
In fact, the conditional independence assumption used in the naive Bayes classifier can be represented as a Bayesian network.
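To illustrate how naive Bayes uses this assumption, here is a minimal sketch of the classifier's factorisation, P(z | x1, x2) being proportional to P(z)P(x1|z)P(x2|z). The two-feature setup and all probability tables are invented for illustration and are not any particular library's API.

```python
import numpy as np

p_z = np.array([0.5, 0.5])            # class prior P(Z)
p_x1_given_z = np.array([[0.9, 0.1],  # P(X1|Z=0)
                         [0.2, 0.8]]) # P(X1|Z=1)
p_x2_given_z = np.array([[0.6, 0.4],
                         [0.3, 0.7]])

def posterior(x1, x2):
    """Class posterior under the naive Bayes factorisation:
    P(z | x1, x2) is proportional to P(z) * P(x1|z) * P(x2|z)."""
    unnorm = p_z * p_x1_given_z[:, x1] * p_x2_given_z[:, x2]
    return unnorm / unnorm.sum()

print(posterior(0, 1))  # P(Z | X1=0, X2=1)
```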
Important Note:
With the exception of the intersection property, each property of conditional independence can be written in a modified form. The modified properties have an extra conditioning variable Z added to each expression, in contrast to the non-modified properties.
The proofs of all the non-modified properties of conditional independence can be found on Wikipedia. In this article we will prove all the modified properties of conditional independence.
Symmetry Property
1.1: Modified symmetry property: if P(X|Y, Z) = P(X|Z) then P(Y|X, Z) = P(Y|Z)
We will prove the modified property.
Proof
We will first prove a lemma.
Lemma 1:
X is conditionally independent of Y given Z, i.e. P(X|Y, Z) = P(X|Z),
if and only if P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z)
Proof of Lemma:
We will first prove if X is conditionally independent of Y given Z then P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z).
P(X|Y, Z) = P(X, Y, Z)/P(Y, Z) = P(X, Z)/P(Z) = P(X|Z) (eq 1)
Note that from the second equality in (eq 1), we have:
P(X, Y, Z)/P(Y, Z) = P(X, Z)/P(Z)
Multiplying both sides by P(Y, Z) we get:
P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z) (eq 2)
We now will prove if P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z) then X is conditionally independent of Y given Z.
P(X|Y, Z) = P(X, Y, Z)/P(Y, Z) = P(X, Z)P(Y, Z)/[P(Y, Z)P(Z)] = P(X, Z)/P(Z) = P(X|Z) as required.
This completes the proof of Lemma 1.
Therefore:
P(Y|X, Z)
= P(X, Y, Z)/P(X, Z) (from the definition of conditional probability)
= P(X, Z)P(Y, Z)/[P(X, Z)P(Z)] (using Lemma 1)
= P(Y, Z)/P(Z)
= P(Y|Z)
as required.
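As a sanity check of both Lemma 1 and the modified symmetry property, the following sketch generates a random joint distribution that satisfies the if-condition by construction and verifies the two identities numerically. The array shapes, names, and seed are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(*shape):
    """Random probability table, normalised over the last axis."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

p_z = random_dist(3)             # P(Z), three values of Z
p_x_given_z = random_dist(3, 2)  # P(X|Z), indexed [z, x]
p_y_given_z = random_dist(3, 4)  # P(Y|Z), indexed [z, y]

# joint[x, y, z] = P(z) P(x|z) P(y|z), so P(X|Y, Z) = P(X|Z) holds
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

p_xz = joint.sum(axis=1)         # P(X, Z)
p_yz = joint.sum(axis=0)         # P(Y, Z)
pz = joint.sum(axis=(0, 1))      # P(Z)

# Lemma 1: P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z)
assert np.allclose(joint, p_xz[:, None, :] * p_yz[None, :, :] / pz)

# Modified symmetry: P(Y|X, Z) = P(Y|Z)
p_y_given_xz = joint / p_xz[:, None, :]
assert np.allclose(p_y_given_xz, (p_yz / pz)[None, :, :])
print("Lemma 1 and the modified symmetry property both check out.")
```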
Decomposition Property
2.1: Modified Decomposition Property: if P(X|Y, W, Z) = P(X|Z) then P(X|Y, Z) = P(X|Z)
Proof of the Modified Decomposition Property:
From the if condition we know that
P(X|Y, W, Z) = P(X, Y, W, Z)/P(Y, W, Z) = P(X, Z)/P(Z)
=> P(X, Y, W, Z) = P(Y, W, Z)P(X, Z)/P(Z)
Summing both sides over all values w of W:
Σ_w P(X, Y, W=w, Z) = [P(X, Z)/P(Z)] Σ_w P(Y, W=w, Z)
=> P(X, Y, Z) = P(X, Z)P(Y, Z)/P(Z)
From Lemma 1 we can instantly see that P(X|Y, Z) = P(X|Z), as required.
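The following sketch mirrors this proof numerically: it constructs a joint distribution satisfying the if-condition, sums out W exactly as above, and checks the conclusion. Variable names, shapes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

p_z = rng.random(2); p_z /= p_z.sum()                  # P(Z)
p_x_given_z = rng.random((2, 3))
p_x_given_z /= p_x_given_z.sum(axis=1, keepdims=True)  # P(X|Z), [z, x]
p_yw_given_z = rng.random((2, 4, 5))
p_yw_given_z /= p_yw_given_z.sum(axis=(1, 2), keepdims=True)  # P(Y, W|Z)

# joint[x, y, w, z] = P(z) P(x|z) P(y, w|z), so P(X|Y, W, Z) = P(X|Z)
joint = np.einsum('z,zx,zyw->xywz', p_z, p_x_given_z, p_yw_given_z)

p_xyz = joint.sum(axis=2)        # sum over w: P(X, Y, Z)
p_yz = p_xyz.sum(axis=0)         # P(Y, Z)
p_x_given_yz = p_xyz / p_yz      # P(X|Y, Z), broadcasting over x

# Conclusion: P(X|Y, Z) = P(X|Z), compared with [x, z] indexing
assert np.allclose(p_x_given_yz, p_x_given_z.T[:, None, :])
print("Modified decomposition property verified numerically.")
```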
Important note: The remaining properties can be proven in the same way as in the Wikipedia article. For completeness, we will go through the proofs.
Weak Union Property
3.1: Modified Weak Union Property: if P(X|Y, W, Z) = P(X|Z) then P(X|Y, W, Z) = P(X|W, Z)
Proof of the Modified Weak Union Property:
We know from the if-condition that
P(X|Y, W, Z) = P(X|Z)
From the decomposition property (applied with the roles of Y and W exchanged), we know that
P(X|W, Z) = P(X|Z)
Therefore:
P(X|Y, W, Z) = P(X|Z) = P(X|W, Z)
Hence we have proved the property as required.
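Again, a numeric sanity check of the property under the same kind of constructed joint distribution; names, shapes, and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

p_z = rng.random(2); p_z /= p_z.sum()
p_x_given_z = rng.random((2, 3))
p_x_given_z /= p_x_given_z.sum(axis=1, keepdims=True)
p_yw_given_z = rng.random((2, 4, 5))
p_yw_given_z /= p_yw_given_z.sum(axis=(1, 2), keepdims=True)

# joint[x, y, w, z] with P(X|Y, W, Z) = P(X|Z) by construction
joint = np.einsum('z,zx,zyw->xywz', p_z, p_x_given_z, p_yw_given_z)

p_ywz = joint.sum(axis=0)        # P(Y, W, Z)
p_x_given_ywz = joint / p_ywz    # P(X|Y, W, Z)
p_xwz = joint.sum(axis=1)        # P(X, W, Z)
p_wz = p_xwz.sum(axis=0)         # P(W, Z)
p_x_given_wz = p_xwz / p_wz      # P(X|W, Z)

# Weak union: P(X|Y, W, Z) = P(X|W, Z) for every value of Y
assert np.allclose(p_x_given_ywz, p_x_given_wz[:, None, :, :])
print("Modified weak union property verified numerically.")
```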
Contraction
4.1: Modified Contraction Property: if P(X|W, Y, Z) = P(X|Y, Z) and P(X|Y, Z) = P(X|Z) then P(X|W, Y, Z) = P(X|Z)
Proof of the Modified Contraction Property:
From the two if-conditions we have
P(X|W, Y, Z) = P(X|Y, Z)
P(X|Y, Z) = P(X|Z)
Therefore
P(X|W, Y, Z) = P(X|Z)
as required.
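As before, a numeric sketch of the contraction property; the factorisation used to make both if-conditions hold by construction is one illustrative choice among many, and all names and shapes are my own.

```python
import numpy as np

rng = np.random.default_rng(3)

p_z = rng.random(2); p_z /= p_z.sum()
p_x_given_z = rng.random((2, 3))
p_x_given_z /= p_x_given_z.sum(axis=1, keepdims=True)
p_y_given_z = rng.random((2, 4))
p_y_given_z /= p_y_given_z.sum(axis=1, keepdims=True)
p_w_given_yz = rng.random((2, 4, 5))
p_w_given_yz /= p_w_given_yz.sum(axis=2, keepdims=True)

# joint[x, y, w, z] = P(z) P(x|z) P(y|z) P(w|y, z): this satisfies both
# P(X|W, Y, Z) = P(X|Y, Z) and P(X|Y, Z) = P(X|Z)
joint = np.einsum('z,zx,zy,zyw->xywz', p_z, p_x_given_z, p_y_given_z,
                  p_w_given_yz)

p_ywz = joint.sum(axis=0)        # P(Y, W, Z)
p_x_given_ywz = joint / p_ywz    # P(X|W, Y, Z)

# Conclusion: P(X|W, Y, Z) = P(X|Z) for all values of Y and W
assert np.allclose(p_x_given_ywz, p_x_given_z.T[:, None, None, :])
print("Modified contraction property verified numerically.")
```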
Intersection
There is no modified version of the intersection property.
Section 2: One-hot-encoded Data and Conditional Independence
Theorem: Conditional independence does not hold for one-hot-encoded data.
Proof:
Suppose X is a categorical variable. Without loss of generality, we will assume the random variable X can take two values, 0 or 1. We will let Z be the class variable and also assume it can take two values, 0 or 1. We introduce a class variable Z so that we have a similar situation to the one we face when applying the naive Bayes classifier.
Performing one-hot-encoding on column X will result in two columns, X0 and X1, where X0 = 1 exactly when X = 0 and X1 = 1 exactly when X = 1.
From how one-hot-encoding is defined, we know that P(X0=1|X1=0, Z=1) = 1: if X1 = 0 then X did not take the value 1, so X must have taken the value 0, which forces X0 = 1. But P(X0=1|Z=1) is simply the probability that X = 0 given Z = 1, which can be any value in [0, 1] depending on the underlying distribution.
This means that in general P(X0=1|X1=0, Z=1) != P(X0=1|Z=1). Therefore conditional independence does not hold for one-hot-encoded data.
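Finally, an empirical illustration of the theorem: we sample a binary X together with a class label Z, one-hot encode X, and estimate both probabilities from the sample. The sample size and the class-conditional probabilities are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 100_000
z = rng.integers(0, 2, size=n)
# Made-up class-conditional probabilities: P(X=1|Z=0)=0.3, P(X=1|Z=1)=0.7
x = (rng.random(n) < np.where(z == 1, 0.7, 0.3)).astype(int)

# One-hot encoding: X0 indicates X==0, X1 indicates X==1
x0 = (x == 0).astype(int)
x1 = (x == 1).astype(int)

mask_z1 = z == 1
p_x0_given_z1 = x0[mask_z1].mean()               # estimate of P(X0=1|Z=1)
mask = (x1 == 0) & mask_z1
p_x0_given_x1_z1 = x0[mask].mean()               # estimate of P(X0=1|X1=0, Z=1)

print(f"P(X0=1 | X1=0, Z=1) = {p_x0_given_x1_z1:.3f}")  # exactly 1.0
print(f"P(X0=1 | Z=1)       = {p_x0_given_z1:.3f}")     # around 0.3
```

Running this prints P(X0=1|X1=0, Z=1) = 1.000 exactly, while P(X0=1|Z=1) comes out near 0.3, so the two conditional probabilities differ just as the theorem claims.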