Appendix C: Proofs

This appendix contains formal proofs of key theorems and propositions presented throughout the book. The proofs are organized by chapter and provide rigorous mathematical foundations for the main results in causal inference theory.

C.1 Proof of Theorem 4.2

Theorem 4.2 (Identifiability of Linear Non-Gaussian Models) Consider the linear structural causal model:

  • X := N_X
  • Y := αX + N_Y

where α ≠ 0 and N_X, N_Y are independent with at least one of them non-Gaussian. Then the causal direction X → Y can be identified from the joint distribution P_{X,Y}.

Proof: The key insight is that linear mixtures of independent non-Gaussian components can be uniquely unmixed up to scaling and permutation (Independent Component Analysis).

  1. Setup: Suppose the true model is X → Y. Then we have:

    • X = N_X
    • Y = αX + N_Y where N_X ⊥⊥ N_Y and at least one noise is non-Gaussian.
  2. Alternative direction: Suppose the same joint distribution also admitted a backward model Y → X:

    • Y = Ñ_Y
    • X = βY + Ñ_X, where Ñ_X ⊥⊥ Y.

    Under the forward model, both Y and the backward residual can be written as linear combinations of the independent sources N_X and N_Y:

    Y = αN_X + N_Y and Ñ_X = X − βY = (1 − βα)N_X − βN_Y.

  3. Darmois–Skitovich theorem: If two linear combinations of independent random variables are independent, then every source that appears with a nonzero coefficient in both combinations must be Gaussian. This is the same result that underlies the identifiability of ICA.

  4. Applying the theorem: The backward model requires Ñ_X ⊥⊥ Y. Since α ≠ 0, if β ≠ 0 and βα ≠ 1, then N_X and N_Y each appear with nonzero coefficients in both combinations, so both noises would have to be Gaussian.

  5. Contradiction: The remaining cases fail directly: β = 0 would require N_X ⊥⊥ αN_X + N_Y, and βα = 1 would require N_Y ⊥⊥ αN_X + N_Y, both impossible for non-degenerate noise. Hence a valid backward model would force both noises to be Gaussian, contradicting the assumption that at least one of them is non-Gaussian.

Therefore, the direction X → Y is uniquely identifiable. □
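
A minimal numerical sketch of this argument in Python (the simulation parameters, the regression helper, and the crude dependence score are illustrative assumptions, not part of the text): fitting the linear model in the causal direction yields a residual that is independent of the regressor, while the anti-causal fit does not, provided the noise is non-Gaussian.

    # Sketch: compare residual-regressor dependence in both directions for a
    # linear model with uniform (non-Gaussian) noise. Illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    n_x = rng.uniform(-1, 1, n)           # non-Gaussian noise N_X
    n_y = rng.uniform(-1, 1, n)           # non-Gaussian noise N_Y
    x = n_x                               # X := N_X
    y = 1.0 * x + n_y                     # Y := alpha * X + N_Y  (alpha = 1)

    def ols_residual(target, regressor):
        """Residual of an ordinary least-squares fit with intercept."""
        slope, intercept = np.polyfit(regressor, target, 1)
        return target - (slope * regressor + intercept)

    def dependence_score(a, b):
        """Crude dependence proxy: |corr| of the centered squares. In practice
        one would use a proper independence test such as HSIC."""
        return abs(np.corrcoef((a - a.mean()) ** 2, (b - b.mean()) ** 2)[0, 1])

    forward = dependence_score(ols_residual(y, x), x)   # causal direction: ~0
    backward = dependence_score(ols_residual(x, y), y)  # anti-causal: clearly > 0
    print(f"forward  (X -> Y): {forward:.3f}")
    print(f"backward (Y -> X): {backward:.3f}")

With Gaussian noise in place of the uniform noise, both scores would be close to zero, which is exactly the non-identifiable case excluded by the theorem.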

C.2 Proof of Proposition 6.3

Proposition 6.3 (Existence of Entailed Distribution) Every acyclic structural causal model entails a unique joint distribution over the observed variables.

Proof: Let C be an SCM with variables X₁, ..., X_d and structural assignments: X_j := f_j(PA_j, N_j) for j = 1, ..., d

where PA_j are the parents of X_j and N₁, ..., N_d are jointly independent noises.

  1. Acyclicity ensures well-definedness: Since the graph is acyclic, there exists a topological ordering σ of {1, ..., d} such that X_{σ(i)} ∈ PA_{σ(j)} implies i < j.

  2. Sequential construction: We can construct the joint distribution by:

    • First sampling all noise variables N₁, ..., N_d independently
    • Then computing variables in topological order:
      • X_{σ(1)} := f_{σ(1)}(∅, N_{σ(1)}) (no parents for the first variable)
      • X_{σ(2)} := f_{σ(2)}(PA_{σ(2)}, N_{σ(2)}) where PA_{σ(2)} ⊆ {X_{σ(1)}}
      • And so on...
  3. Measurability: By recursive substitution, each X_{σ(j)} is a measurable function of the noise variables N₁, ..., N_d alone, so the joint distribution of (X₁, ..., X_d) is well-defined as the pushforward of the noise distribution under this map.

  4. Uniqueness: The map from noise realizations to (X₁, ..., X_d) is fixed by the structural assignments and does not depend on the chosen topological ordering, so the entailed distribution is unique.

  5. Factorization: The joint density factorizes as: p(x₁, ..., x_d) = ∏_{j=1}^d p(x_j | pa_j)

    where each factor corresponds to the conditional induced by the structural assignment. □
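
A short Python sketch of the sequential construction (the three-variable SCM below is an arbitrary illustrative example, not one from the text): noise variables are drawn jointly independently, then each variable is computed from its parents in topological order, which is exactly the pushforward map used in the proof.

    # Ancestral sampling from a small SCM X1 -> X2 -> X3 (illustrative example).
    import numpy as np

    rng = np.random.default_rng(1)

    def sample_entailed_distribution(n):
        # Step 1: sample all noise variables independently.
        n1, n2, n3 = rng.normal(size=(3, n))
        # Step 2: compute the variables in a topological order of the graph.
        x1 = n1                       # X1 := f1(N1)          (no parents)
        x2 = 2.0 * x1 + n2            # X2 := f2(X1, N2)
        x3 = np.tanh(x2) + 0.5 * n3   # X3 := f3(X2, N3)
        return np.column_stack([x1, x2, x3])

    samples = sample_entailed_distribution(10_000)   # draws from the entailed P
    print(samples.mean(axis=0))                      # empirical means, for example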

C.3 Proof of Remark 6.6

Remark 6.6 In linear Gaussian models, the Markov equivalence class can contain exponentially many DAGs.

Proof sketch: Two DAGs are Markov equivalent if and only if they:

  1. Have the same skeleton (undirected version)
  2. Have the same v-structures (colliders A → B ← C with A and C non-adjacent)

Equivalently, they entail exactly the same conditional independence relations.

Consider a fully connected DAG over X₁, ..., X_d with linear Gaussian structural equations. Since every pair of variables is adjacent, no orientation of the complete skeleton can contain a v-structure, so all acyclic orientations of this skeleton are Markov equivalent. Each acyclic orientation corresponds to an ordering of the variables, so the equivalence class contains d! DAGs, which grows faster than exponentially in d. (By contrast, the chain X₁ → X₂ → ... → X_d has a much smaller class: its equivalence class consists of the orientations of the path skeleton that create no collider, of which there are exactly d.) □
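
The counting can be checked by brute force for small d. The helper below (an assumed illustration, not code from the text) enumerates all orientations of a skeleton and keeps those that are acyclic and collider-free; for a reference DAG without v-structures, such as the chain or the complete DAG, this count is exactly the size of its Markov equivalence class.

    # Brute-force count of acyclic, collider-free orientations of a skeleton.
    from itertools import combinations, product

    def is_acyclic(nodes, edges):
        """Check acyclicity by repeatedly removing nodes with no remaining parents."""
        parents = {v: {a for a, b in edges if b == v} for v in nodes}
        remaining = set(nodes)
        while remaining:
            roots = [v for v in remaining if not (parents[v] & remaining)]
            if not roots:
                return False
            remaining -= set(roots)
        return True

    def has_v_structure(nodes, edges, skeleton):
        """A v-structure is a -> c <- b with a and b non-adjacent in the skeleton."""
        adjacent = {frozenset(e) for e in skeleton}
        parents = {v: {a for a, b in edges if b == v} for v in nodes}
        return any(
            frozenset((a, b)) not in adjacent
            for v in nodes
            for a, b in combinations(parents[v], 2)
        )

    def equivalence_class_size(nodes, skeleton):
        count = 0
        for flips in product([False, True], repeat=len(skeleton)):
            edges = [(b, a) if flip else (a, b) for (a, b), flip in zip(skeleton, flips)]
            if is_acyclic(nodes, edges) and not has_v_structure(nodes, edges, skeleton):
                count += 1
        return count

    d = 4
    chain = [(i, i + 1) for i in range(d - 1)]
    complete = list(combinations(range(d), 2))
    print(equivalence_class_size(range(d), chain))     # 4  (= d)
    print(equivalence_class_size(range(d), complete))  # 24 (= d!)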

C.4 Proof of Proposition 6.13

Proposition 6.13 (d-separation implies conditional independence) In a directed acyclic graph G, if sets X and Y are d-separated by Z, then X ⊥⊥ Y | Z in every distribution that is Markov to G.

Proof: The proof uses the factorization property of Markov distributions.

  1. Markov factorization: Any distribution Markov to G factorizes as: p(x) = ∏_{X_i ∈ V} p(x_i | pa_i)

  2. d-separation definition: X and Y are d-separated by Z if every path from X to Y is blocked by Z, where blocking means:

    • The path contains a chain A → B → C or fork A ← B → C with B ∈ Z, or
    • The path contains a collider A → B ← C with B ∉ Z and no descendant of B in Z
  3. Path analysis: Any dependence between X and Y must flow through connecting paths. If all paths are blocked by Z, then conditioning on Z eliminates all dependence.

  4. Factorization argument: Restricting the Markov factorization to the relevant ancestral set and fixing z, the factors can be grouped into those involving only variables in X ∪ Z and those involving only variables in Y ∪ Z; a factor coupling both groups would correspond to a path not blocked by Z. Hence there exist functions g and h such that

    p(x, y | z) ∝ g(x, z) · h(y, z)

  5. Independence conclusion: This factorization implies: p(x,y|z) = p(x|z)p(y|z)

    Therefore X ⊥⊥ Y | Z. □
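
A numerical illustration of the proposition (the fork and its coefficients below are an assumed example): in the graph X ← Z → Y, the set {Z} d-separates X and Y, and in a linear Gaussian distribution Markov to this graph the implied conditional independence appears as a vanishing partial correlation.

    # Fork X <- Z -> Y: X and Y are dependent, but independent given Z.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    z = rng.normal(size=n)
    x = 1.5 * z + rng.normal(size=n)
    y = -0.8 * z + rng.normal(size=n)

    def partial_corr(a, b, c):
        """Correlation of a and b after linearly regressing out c."""
        res_a = a - np.polyval(np.polyfit(c, a, 1), c)
        res_b = b - np.polyval(np.polyfit(c, b, 1), c)
        return np.corrcoef(res_a, res_b)[0, 1]

    print(f"corr(X, Y)        = {np.corrcoef(x, y)[0, 1]:+.3f}")  # clearly nonzero
    print(f"parcorr(X, Y | Z) = {partial_corr(x, y, z):+.3f}")    # approximately 0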

C.5 Proof of Proposition 6.14

Proposition 6.14 (Conditional independence implies d-separation under faithfulness) Under the faithfulness assumption, if X ⊥⊥ Y | Z in the distribution, then X and Y are d-separated by Z in the graph.

Proof: This is the converse of Proposition 6.13 and requires the faithfulness assumption.

  1. Faithfulness assumption: The distribution contains no conditional independence relations other than those implied by d-separation in the graph.

  2. Contrapositive: We prove the contrapositive: if X and Y are not d-separated by Z, then X ⊥̸⊥ Y | Z.

  3. Open path exists: If X and Y are not d-separated by Z, then there exists at least one path from X to Y that is not blocked by Z.

  4. No entailed independence: Because an unblocked path exists, d-separation does not entail the relation X ⊥⊥ Y | Z; the graph permits dependence between X and Y given Z.

  5. Faithfulness application: Faithfulness states that the distribution contains no conditional independences beyond those entailed by d-separation. Since X ⊥⊥ Y | Z is not entailed by the graph, it cannot hold in the distribution, so X ⊥̸⊥ Y | Z. □
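
The faithfulness assumption in step 5 is essential; a standard way to see this is a path-cancellation example (the coefficients below are an assumed illustration, not from the text). Here X and Z are not d-separated by the empty set, yet the direct and indirect effects cancel exactly, producing an "extra" independence that the graph does not entail, i.e. an unfaithful distribution.

    # Unfaithful linear Gaussian SCM: X -> Y -> Z and X -> Z with cancelling paths.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 500_000
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)              # X -> Y
    z = 1.0 * y - 2.0 * x + rng.normal(size=n)    # Y -> Z and X -> Z; 1.0*2.0 - 2.0 = 0

    print(f"corr(X, Z) = {np.corrcoef(x, z)[0, 1]:+.4f}")  # ~0 despite open paths

Under faithfulness such exact cancellations are excluded, which is what makes step 5 go through.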

C.6 Proof of Proposition 6.36

Proposition 6.36 (Causal effect identification) If a set Z satisfies the back-door criterion relative to (X, Y), then the causal effect P(Y | do(X = x)) is identifiable from observational data by adjusting for Z.

Proof: The back-door criterion provides sufficient conditions for identifying P(Y | do(X = x)).

  1. Back-door criterion: A set Z satisfies the back-door criterion relative to (X,Y) if:

    • Z contains no descendants of X
    • Z blocks all back-door paths from X to Y
  2. Adjustment formula: Under these conditions: P(Y = y | do(X = x)) = ∑_z P(Y = y | X = x, Z = z)P(Z = z)

  3. Intuition: Z blocks all confounding paths while preserving causal paths from X to Y.

  4. Formal justification: The intervention do(X = x) removes all incoming edges to X. The remaining graph has:

    • All causal paths from X to Y intact
    • All back-door paths blocked by Z
  5. Identification: Within each stratum Z = z, conditioning coincides with intervening: P(Y = y | X = x, Z = z) = P(Y = y | do(X = x), Z = z). Averaging over the marginal P(Z = z), which is unaffected by the intervention because Z contains no descendants of X, yields the adjustment formula in step 2. □
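
A small discrete Python sketch of the adjustment formula (the probabilities and variable names form an assumed toy model, not one from the text): with a confounder Z of X and Y, naive conditioning is biased, while back-door adjustment over Z recovers the interventional quantity.

    # Back-door adjustment in a toy binary model with confounder Z.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 500_000

    # Ground-truth SCM: Z -> X, Z -> Y, X -> Y (all binary).
    z = rng.random(n) < 0.4
    x = rng.random(n) < np.where(z, 0.8, 0.2)
    y = rng.random(n) < np.where(x, 0.7, 0.3) + np.where(z, 0.1, 0.0)

    # Naive conditioning P(Y=1 | X=1): confounded by Z.
    naive = y[x].mean()

    # Adjustment formula: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z).
    adjusted = sum(y[x & (z == v)].mean() * (z == v).mean() for v in (False, True))

    # Truth from the mechanism: P(Y=1 | do(X=1)) = 0.7 + 0.1 * P(Z=1) = 0.74.
    print(f"naive    P(Y=1 | X=1)     = {naive:.3f}")     # biased upward (≈ 0.77)
    print(f"adjusted P(Y=1 | do(X=1)) = {adjusted:.3f}")  # ≈ 0.74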

[Content continues with additional proofs for Propositions 6.48, 6.49, 7.1, 7.4, 8.1, 8.2, 9.3, and Theorems 10.3, 10.4...]

Summary

This appendix provides rigorous mathematical foundations for the key results in causal inference theory. The proofs demonstrate how graph-theoretic concepts, probability theory, and statistical independence combine to enable causal reasoning from observational and interventional data.

Key proof techniques include:

  • Graph-theoretic arguments: Using d-separation and path analysis
  • Probabilistic reasoning: Exploiting independence and factorization properties
  • Information-theoretic methods: Applying mutual information and entropy
  • Functional analysis: Using properties of function spaces and operators
  • Algebraic methods: Leveraging matrix properties and linear algebra

These mathematical tools form the theoretical backbone that makes principled causal inference possible.