Appendix A: Some Probability and Statistics

A.1 Basic Definitions

  • (i) We denote the underlying probability space by (Ω, ℱ, P); here, Ω is a set, ℱ a σ-algebra on Ω, and P a probability measure on (Ω, ℱ).

  • (ii) We use capital letters for real-valued random variables. For example, X: (Ω, ℱ) → (ℝ, ℬ_ℝ) is a function that is measurable with respect to the Borel σ-algebra ℬ_ℝ. Random vectors are measurable functions X: (Ω, ℱ) → (ℝ^d, ℬ_{ℝ^d}). We call X non-degenerate if there is no value c ∈ ℝ^d such that P(X = c) = 1. For an introduction to measure theory, see, for example, Dudley [2002].

  • (iii) We usually denote vectors with bold letters. In a slight abuse of notation, we consider sets of variables B ⊆ X as a single multivariate variable.

  • (iv) P_X is the distribution of the d-dimensional random vector X, that is, a probability measure on (ℝ^d, ℬ_{ℝ^d}).

  • (v) We write x ↦ p_X(x) or simply x ↦ p(x) for the density, that is, the Radon-Nikodym derivative of P_X with respect to a product measure. We (sometimes implicitly) assume its existence or continuity.

  • (vi) We call X independent of Y and write X ⊥⊥ Y if and only if $$p(x,y) = p(x)p(y) \tag{A.1}$$ for all x,y. Otherwise, X and Y are dependent, and we write X ⊥̸⊥ Y.

  • (vii) We call X₁, ..., X_d jointly (or mutually) independent if and only if $$p(x₁, ..., x_d) = p(x₁) · ... · p(x_d) \tag{A.2}$$ for all x₁, ..., x_d. If X₁, ..., X_d are jointly independent, then any pair X_i and X_j with i ≠ j are independent, too. The converse does not hold in general: pairwise independence does not imply joint independence.

  • (viii) We call X independent of Y conditional on Z and write X ⊥⊥ Y | Z if and only if $$p(x,y|z) = p(x|z)p(y|z) \tag{A.3}$$ for all x,y,z such that p(z) > 0. Otherwise, X and Y are dependent conditional on Z and we write X ⊥̸⊥ Y | Z.

  • (ix) Conditional independence relations obey the following important rules [e.g., Pearl, 2009, Section 1.1.5]:

    • Symmetry: X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z
    • Decomposition: X ⊥⊥ Y,W | Z ⇒ X ⊥⊥ Y | Z
    • Weak union: X ⊥⊥ Y,W | Z ⇒ X ⊥⊥ Y | W,Z
    • Contraction: X ⊥⊥ Y | Z and X ⊥⊥ W | Y,Z ⇒ X ⊥⊥ Y,W | Z
    • Intersection: X ⊥⊥ Y | W,Z and X ⊥⊥ W | Y,Z ⇒ X ⊥⊥ Y,W | Z

    The existence of a strictly positive density suffices for the intersection property to hold. Necessary and sufficient conditions for the discrete case are provided by Drton et al. [2009b, Exercise 6.6] and by Fink [2011]. Peters [2014] covers the continuous case.

  • (x) The variance of a random variable X is defined as $$\text{var}[X] := \mathbb{E}[(X - \mathbb{E}[X])²] = \mathbb{E}[X²] - \mathbb{E}[X]²$$ if 𝔼[X²] < ∞.

  • (xi) We call X and Y uncorrelated if 𝔼[X²], 𝔼[Y²] < ∞ and $$\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y],$$ that is, if $$\rho_{X,Y} := \frac{\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]}{\sqrt{\text{var}[X]\,\text{var}[Y]}} = 0$$ (provided both variances are strictly positive).

  • (xii) If X and Y are independent, then they are uncorrelated. The converse does not hold in general; see the numerical sketch after this list, which also illustrates the remark on pairwise independence in (vii).

  • (xiii) We say that X and Y are partially uncorrelated given Z if $$\mathbb{E}[XY|Z] = \mathbb{E}[X|Z]\mathbb{E}[Y|Z]$$ almost surely with respect to P_Z.
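
The cautionary remarks in (vii) and (xii) can be checked numerically. The following is a minimal NumPy sketch; the XOR construction and the choice Y = X² are standard illustrative examples, not part of the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# (xii) Uncorrelated but dependent: X ~ N(0, 1) and Y = X^2.
x = rng.standard_normal(n)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])           # close to 0: X and Y are uncorrelated
print(np.corrcoef(np.abs(x), y)[0, 1])   # far from 0: |X| and Y correlate, so X and Y are dependent

# (vii) Pairwise but not jointly independent: X3 = X1 XOR X2.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
x3 = x1 ^ x2
# Each pair is independent, e.g. P(X1 = 1, X3 = 1) ≈ P(X1 = 1) P(X3 = 1) = 1/4 ...
print(np.mean((x1 == 1) & (x3 == 1)), np.mean(x1 == 1) * np.mean(x3 == 1))
# ... but P(X1 = X2 = X3 = 1) = 0, whereas joint independence would require 1/8.
print(np.mean((x1 == 1) & (x2 == 1) & (x3 == 1)))
```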

A.2 Independence and Conditional Independence Testing

Testing for independence and conditional independence is a fundamental task in causal discovery algorithms. We briefly review some key approaches.

Independence Testing

For testing whether two random variables X and Y are independent, several approaches are available:

  • Pearson correlation test: For continuous variables, tests whether the linear correlation coefficient is significantly different from zero. Only detects linear dependences.

  • Chi-square test: For discrete variables, tests independence using the chi-square statistic comparing observed and expected frequencies under independence.

  • Kernel-based tests: Methods like the Hilbert-Schmidt Independence Criterion (HSIC) can detect non-linear dependences by embedding variables into reproducing kernel Hilbert spaces; see the sketch after this list.

  • Mutual information-based tests: Estimate the mutual information I(X;Y) and test whether it is significantly greater than zero.
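
To illustrate the kernel-based approach, the following is a minimal sketch of a permutation test built on the (biased) HSIC statistic with Gaussian kernels. The median-heuristic bandwidth, the number of permutations, and the example data are illustrative assumptions; mature implementations should be preferred in practice.

```python
import numpy as np

def gaussian_gram(z, bandwidth):
    """Gram matrix of the Gaussian kernel k(z, z') = exp(-||z - z'||^2 / (2 bw^2))."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def median_bandwidth(z):
    """Median heuristic: median of the nonzero pairwise distances (an illustrative default)."""
    d = np.sqrt(np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1))
    return np.median(d[d > 0])

def hsic(x, y):
    """Biased HSIC estimate trace(K H L H) / n^2 with centering matrix H."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    k = gaussian_gram(x, median_bandwidth(x))
    l = gaussian_gram(y, median_bandwidth(y))
    return np.trace(k @ h @ l @ h) / n ** 2

def hsic_permutation_test(x, y, n_perm=500, rng=None):
    """p-value from permuting the rows of y, which emulates the null X independent of Y."""
    rng = np.random.default_rng() if rng is None else rng
    stat = hsic(x, y)
    perm_stats = np.array([hsic(x, y[rng.permutation(len(y))]) for _ in range(n_perm)])
    return stat, (1 + np.sum(perm_stats >= stat)) / (1 + n_perm)

# Example: a nonlinear dependence that the Pearson correlation test misses.
rng = np.random.default_rng(0)
x = rng.standard_normal((300, 1))
y = x ** 2 + 0.1 * rng.standard_normal((300, 1))
print(hsic_permutation_test(x, y, rng=rng))  # small p-value: reject independence
```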

Conditional Independence Testing

Testing X ⊥⊥ Y | Z is generally more challenging:

  • Partial correlation: For jointly Gaussian variables, X ⊥⊥ Y | Z holds if and only if the partial correlation coefficient of X and Y given Z is zero; see the sketch after this list.

  • Conditional mutual information: Test whether I(X;Y|Z) = 0.

  • Kernel conditional independence test (KCIT): Extension of kernel methods to the conditional setting.
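
A minimal sketch of the Gaussian partial-correlation test mentioned above: regress X and Y linearly on Z, correlate the residuals, and assess significance with Fisher's z-transform. The data-generating example and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def partial_corr_test(x, y, z):
    """Test X indep. Y given Z under joint Gaussianity via residual correlation + Fisher's z."""
    n = len(x)
    zmat = np.column_stack([np.ones(n), z])                    # intercept + conditioning set
    rx = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]    # residual of X given Z
    ry = y - zmat @ np.linalg.lstsq(zmat, y, rcond=None)[0]    # residual of Y given Z
    r = np.corrcoef(rx, ry)[0, 1]                              # partial correlation estimate
    fisher_z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(n - z.shape[1] - 3) * abs(fisher_z)
    return r, 2 * (1 - norm.cdf(stat))                         # p-value under H0: partial corr = 0

# Example: X <- Z -> Y, so X and Y are marginally dependent but X indep. Y given Z.
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 1))
x = 2 * z[:, 0] + rng.standard_normal(500)
y = -z[:, 0] + rng.standard_normal(500)
print(partial_corr_test(x, y, z))   # partial correlation near 0, large p-value
```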

Multiple Testing Considerations

When performing many independence tests simultaneously (as in structure learning algorithms), multiple testing corrections are essential:

  • Bonferroni correction: Conservative approach controlling familywise error rate.
  • False Discovery Rate (FDR) control: Less conservative, controls the expected proportion of false discoveries; see the sketch after this list.
  • Sequential testing procedures: Adaptive methods that can be more powerful than fixed corrections.
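
A minimal NumPy sketch of the first two corrections; the p-values in the example are made up for illustration.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m; controls the familywise error rate."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the FDR at level alpha
    (valid under independence or positive dependence of the p-values)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest index meeting its threshold
        reject[order[: k + 1]] = True         # reject all hypotheses up to that index
    return reject

# Example: a mix of small p-values (likely true effects) and larger ones (likely nulls).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.51, 0.74, 0.97]
print(bonferroni(pvals))           # conservative: one rejection
print(benjamini_hochberg(pvals))   # less conservative: two rejections
```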

A.3 Capacity of Function Classes

The capacity of a function class is crucial for understanding the learnability and generalization properties of statistical learning algorithms. Here we review key concepts.

VC Dimension

The Vapnik-Chervonenkis (VC) dimension measures the capacity of a class ℱ of {−1, +1}-valued functions.

Definition A.1 (Shattering) A set of points {x₁, ..., x_n} is shattered by a function class ℱ if for every subset S ⊆ {1, ..., n}, there exists a function f ∈ ℱ such that f(x_i) = 1 if i ∈ S and f(x_i) = -1 if i ∉ S.

Definition A.2 (VC dimension) The VC dimension of a function class ℱ is the size of the largest set that can be shattered by ℱ (and ∞ if arbitrarily large sets can be shattered).
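
To make Definitions A.1 and A.2 concrete, the following sketch brute-forces the shattering condition for a simple illustrative class, one-dimensional threshold classifiers f_{t,s}(x) = s · sign(x − t); this class and the point sets below are not taken from the text above.

```python
import itertools
import numpy as np

def threshold_predictions(points, t, s):
    """f_{t,s}(x) = s * sign(x - t); the boundary case x == t is assigned label -s."""
    return np.where(points > t, s, -s)

def can_shatter(points):
    """Check whether 1-D threshold classifiers realize every {-1, +1} labeling of `points`."""
    points = np.sort(np.asarray(points, dtype=float))
    # Every distinct behavior of x -> sign(x - t) on `points` is realized by a threshold
    # below all points, between two consecutive points, or above all points.
    candidates = np.concatenate(([points[0] - 1.0],
                                 (points[:-1] + points[1:]) / 2,
                                 [points[-1] + 1.0]))
    achievable = {tuple(int(v) for v in threshold_predictions(points, t, s))
                  for t in candidates for s in (+1, -1)}
    all_labelings = set(itertools.product((-1, +1), repeat=len(points)))
    return all_labelings <= achievable

print(can_shatter([0.3, 1.7]))        # True: this two-point set is shattered
print(can_shatter([0.3, 1.7, 2.5]))   # False: the labeling (+1, -1, +1) is not realizable
```

Since every distinct behavior of a threshold on a finite point set is realized by one of the enumerated candidate thresholds, the check is exact; the same argument applies to any three points, so this class has VC dimension 2.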

Learning Theory Results

Fundamental Theorem of Statistical Learning: A class ℱ of binary-valued functions is (PAC-)learnable with the 0-1 loss if and only if its VC dimension is finite.

For a function class with VC dimension d, the following generalization bound holds with probability at least 1 − δ, uniformly over f ∈ ℱ:

$$R[f] ≤ R_{\text{emp}}[f] + O\left(\sqrt{\frac{d \log(n) + \log(1/δ)}{n}}\right)$$

where R[f] is the true risk, R_emp[f] is the empirical risk, and n is the sample size.
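
Ignoring the unspecified constant hidden in the O(·), the capacity term can be tabulated to see how the bound tightens with the sample size; the values of d, n, and δ below are illustrative.

```python
import numpy as np

def capacity_term(d, n, delta):
    """sqrt((d log n + log(1/delta)) / n): the complexity term of the bound, constant omitted."""
    return np.sqrt((d * np.log(n) + np.log(1.0 / delta)) / n)

# Illustrative values: d = 11 (linear classifiers in R^10, see the examples below), delta = 0.05.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(capacity_term(d=11, n=n, delta=0.05), 3))
```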

Examples

  • Linear classifiers in ℝ^d: VC dimension is d + 1
  • Polynomial classifiers of degree k in ℝ^d: VC dimension is O(d^k)
  • Neural networks: Depends on the architecture and the activation function; for networks with threshold or piecewise-linear (e.g., ReLU) activations, the VC dimension grows polynomially (not exponentially) in the number of weights, roughly as the number of weights times the depth up to logarithmic factors

Implications for Causal Learning

In causal structure learning, we often work with function classes (like additive noise models) that have restricted capacity. This is essential because:

  1. Identifiability: Unrestricted function classes can fit any distribution, making causal direction unidentifiable
  2. Generalization: Finite capacity ensures learning algorithms can generalize beyond the training data
  3. Computational tractability: Restricted classes often admit efficient optimization algorithms

The trade-off between expressiveness and learnability is central to designing effective causal discovery methods.