Learning Multivariate Causal Models
As in Chapter 4, we now turn to the problem of learning causal models. We first discuss different assumptions under which (parts of) the graph structure can be recovered from the joint distribution in Section 7.1 ('structure identifiability'). Some of these results carry over from the bivariate setting discussed earlier. As in the bivariate case, there is no complete characterization of identifiability assumptions, and future research may reveal promising alternatives. In Section 7.2, we then introduce methods and algorithms, such as independence-based and score-based methods, that estimate the graph from a finite data set ('structure identification').
As in the bivariate setting, we are again facing the problem that the class of SCMs is too flexible. Given a distribution P_X over random variables X = (X₁, ..., X_d), can different SCMs entail this distribution? This question is answered by the following proposition: indeed, usually for many different graph structures, there is an SCM that induces the distribution P_X.
Proposition 7.1 (Non-uniqueness of graph structures) Consider a random vector X = (X₁, ..., X_d) with distribution P_X that has a density with respect to Lebesgue measure and assume it is Markovian with respect to G. Then there exists an SCM C = (S, P_N) with graph G that entails the distribution P_X.
Proof. See Appendix C.9. □
In particular, given any complete DAG, we can find a corresponding SCM that entails the distribution at hand. As in the bivariate case, it is therefore apparent that we require further assumptions to obtain identifiability results. The following section discusses some of those assumptions.
Structure Identifiability
Faithfulness
If the distribution P_X is Markovian and faithful with respect to the underlying DAG G₀, we have a one-to-one correspondence between d-separation statements in the graph G₀ and the corresponding conditional independence statements in the distribution. All graphs outside the correct Markov equivalence class of G₀ can therefore be rejected because they impose a set of d-separations that does not equal the set of conditional independences in P_X. Since both the Markov condition and faithfulness put restrictions only on the conditional independences in the joint distribution, it is also clear that we are not able to distinguish between two Markov equivalent graphs, that is, between two graphs that entail exactly the same set of conditional independences (see for example Figure 6.4 on page 103). Summarizing, under the Markov condition and faithfulness, the Markov equivalence class of G₀, represented by CPDAG(G₀), is identifiable from P_X [e.g., Spirtes et al., 2000].
Lemma 7.2 (Identifiability of Markov equivalence class) Assume that P_X is Markovian and faithful with respect to G₀. Then, for each graph G ∈ CPDAG(G₀), we find an SCM that entails the distribution P_X. Furthermore, there is no graph G with G ∉ CPDAG(G₀), such that P_X is Markovian and faithful with respect to G.
Proof. The first statement is a direct implication of Proposition 7.1, and the second statement follows from the definition of Markov equivalence (Definition 6.24). □
Independence-based methods (also called constraint-based methods) assume that the distribution is Markovian and faithful with respect to the underlying graph and then estimate the correct Markov equivalence class; see Section 7.2.1.
We have seen in Example 6.42 that for Gaussian distributions the causal effect can be summarized by a single number (6.20). If instead of the correct graph, we only know the Markov equivalence class of that graph, this quantity is not identifiable anymore. It is possible, however, to provide bounds [Maathuis et al., 2009].
Additive Noise Models
Proposition 7.1 shows that a given distribution could have been entailed from several SCMs with different graphs. For many of these graph structures, however, the functions f_j appearing in the structural assignments are rather complicated. It turns out that we obtain non-trivial identifiability results if we do not allow for arbitrarily complex functions, that is, if we restrict the function class. As we have already seen in Chapter 4, we will assume in the following Sections 7.1.4 and 7.1.5 that the noise acts in an additive way.
Definition 7.3 (ANMs) We call an SCM C an ANM if the structural assignments are of the form
$$X_j := f_j(PA_j) + N_j, \quad j = 1, ..., d, \quad (7.1)$$
that is, if the noise is additive. For simplicity, we further assume that the functions f_j are differentiable and the noise variables N_j have a strictly positive density.
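A minimal sketch of sampling from an ANM of the form (7.1) may help fix ideas. The graph Z → X, Z → Y with X → Y and the particular functions below are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from an ANM X_j := f_j(PA_j) + N_j (Eq. 7.1), following a
# topological order of the (assumed) graph Z -> X, Z -> Y, X -> Y.
n = 1000
N_Z, N_X, N_Y = rng.normal(size=(3, n))

Z = N_Z                      # source node: no parents
X = np.tanh(Z) + N_X         # f_X(z) = tanh(z), additive noise
Y = Z**2 + 0.5 * X + N_Y     # f_Y(z, x) = z^2 + 0.5 x, additive noise

samples = np.column_stack([Z, X, Y])
```

Any SCM induces its distribution in exactly this way: assign source nodes first, then every node whose parents have already been assigned.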
Some of the following identifiability results assume causal minimality (Definition 6.33). For ANMs, this means that each function f_j is not constant in any of its arguments. Intuitively, the function should really 'depend' on its arguments. The proof of the following proposition is provided in Appendix C.10.
Proposition 7.4 (Causal minimality and ANMs) Consider a distribution induced by a model (7.1) and assume that the functions f_j are not constant in any of their arguments; that is, for all j and i ∈ PA_j there is some value pa_{j,-i} of the variables PA_j \ {i} and some x_i ≠ x'_i such that
$$f_j(pa_{j,-i}, x_i) ≠ f_j(pa_{j,-i}, x'_i).$$
Then the joint distribution satisfies causal minimality with respect to the corresponding graph. Conversely, if there are nodes j and i such that for all pa_{j,-i} the function f_j(pa_{j,-i}, ·) is constant, causal minimality is violated.
We have argued in Remark 6.6 that we can restrict ourselves to functions that are not constant in one of their arguments; see Proposition 6.49. We have now seen that for ANMs with fully supported noise, this restriction implies causal minimality.
Given the restricted class of SCMs described in (7.1), do we obtain full structure identifiability? Again, the answer is negative. Theorem 4.2 and Problem 7.13 show that if the distribution is induced by a linear Gaussian SCM, for example, we cannot necessarily recover the correct graph. It turns out, however, that this case is exceptional in the following sense. For almost all other combinations of functions and distributions, we obtain identifiability. All the nonidentifiable cases have been characterized [Zhang and Hyvärinen, 2009, Peters et al., 2014]. Another non-identifiable example different from the linear Gaussian case is shown in the right plot in Figure 4.2. Its details can be found in Peters et al. [2014, Example 25]. Table 7.1 shows some of the known identifiability results.
Table 7.1: Summary of some known identifiability results for Gaussian noise. Identifiability results for non-Gaussian noise are available, too, but they are more technical.
| Type of structural assignment | Mathematical form | Condition on funct. | DAG identif. | See |
|---|---|---|---|---|
| (General) SCM: | X_j := f_j(X_{PA_j}, N_j) | - | ✗ | Prop. 7.1 |
| ANM: | X_j := f_j(X_{PA_j}) + N_j | nonlinear | ✓ | Thm. 7.7(i) |
| CAM: | X_j := ∑_{k ∈ PA_j} f_{jk}(X_k) + N_j | nonlinear | ✓ | Thm. 7.7(ii) |
| Linear Gaussian: | X_j := ∑_{k ∈ PA_j} b_{jk}X_k + N_j | linear | ✗ | Problem 7.13 |
| Lin. G., eq. error var.: | X_j := ∑_{k ∈ PA_j} b_{jk}X_k + N_j | linear | ✓ | Prop. 7.5 |
Let us mention again that there are several extensions to the framework of ANMs. For example, Zhang and Hyvärinen [2009] allow for a post-nonlinear transformation of the variables and Peters et al. [2011a] consider ANMs for discrete variables.
In general, nonlinear ANMs are not closed under marginalization. That is, if P_{X,Y,Z} allows for ANMs from X to Y and from Y to Z, P_{X,Z} does not necessarily allow for an ANM from X to Z. This may restrict the applicability of ANMs in practice, since one may not observe intermediate variables on a causal path. For experiments in physics, one could argue that every influence is propagated via infinitely many intermediate variables. Thus, there is no absolute notion of direct or indirect effect (instead, it must always be relative to the observed set). In this sense, ANMs can only be taken as good approximations.
In the following three subsections, we will look at three specific identifiable examples in more detail: the linear Gaussian case with equal error variances (Section 7.1.3), the linear non-Gaussian case (Section 7.1.4), and the nonlinear Gaussian case (Section 7.1.5). Although more general results are available [Peters et al., 2014], we concentrate on these three examples because for them precise conditions can be stated easily. We omit proofs and concentrate on the statements. Most of the proofs can be based on the techniques developed in Peters et al. [2011b]. They allow many of the bivariate identifiability results that we developed in Chapter 4 to carry over to the multivariate setting.
Linear Gaussian Models with Equal Error Variances
There is another deviation from linear Gaussian SCMs that makes the graph identifiable: restricting the noise variables to have the same variance is sufficient to recover the graph structure. The proof can be found in Peters and Bühlmann [2014].
Proposition 7.5 (Identifiability with equal error variances) Consider an SCM with graph G₀ and assignments
$$X_j := \sum_{k \in PA_{G₀}(j)} b_{jk}X_k + N_j, \quad j = 1, ..., d,$$
where all N_j are i.i.d. and follow a Gaussian distribution. In particular, the noise variance σ² does not depend on j. Additionally, for each j ∈ {1, ..., d} we require b_{jk} ≠ 0 for all k ∈ PA_{G₀}(j). Then, the graph G₀ is identifiable from the joint distribution.
For estimating the coefficients b_{jk} (and therefore the graph structure), Peters and Bühlmann [2014] propose to use a penalized maximum likelihood score based on the Bayesian information criterion (BIC) (see also Section 7.2.2) and a greedy search algorithm in the space of DAGs. Rescaling the variables changes the variance of the error terms. Therefore, in many applications, the model in Proposition 7.5 cannot be sensibly applied. The BIC, however, allows us to compare the method's score with the score of a linear Gaussian SCM that uses more parameters and does not make the assumption of equal error variances.
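The identifiability statement of Proposition 7.5 can be illustrated on a single pair of variables. The following sketch (an assumed simplification, not the authors' implementation) compares the maximized Gaussian log-likelihood of both causal orderings under the equal-error-variance constraint; the true ordering should score higher:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate linear Gaussian SCM with EQUAL error variances:
# X := N_X, Y := X + N_Y, Var(N_X) = Var(N_Y) = 1.
n = 5000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)

def equal_var_loglik(cause, effect):
    """Maximized Gaussian log-likelihood of (cause -> effect) with a shared noise variance."""
    c = cause - cause.mean()
    e = effect - effect.mean()
    b = (c @ e) / (c @ c)                      # OLS slope for effect ~ cause
    rss = np.sum((e - b * c) ** 2)             # residual sum of squares
    sigma2 = (np.sum(c ** 2) + rss) / (2 * n)  # pooled MLE of the shared variance
    return -n * np.log(sigma2)                 # log-likelihood up to an additive constant

score_fwd = equal_var_loglik(X, Y)   # correct direction X -> Y
score_bwd = equal_var_loglik(Y, X)   # wrong direction Y -> X
```

In the anticausal direction, the fitted noise variances cannot both equal the true σ², so the pooled estimate is inflated and the score drops.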
Linear Non-Gaussian Acyclic Models
Shimizu et al. [2006] prove the following statement using independent component analysis (ICA) [Comon, 1994, Theorem 11], which itself is proved using the Darmois-Skitovič theorem.
Theorem 7.6 (Identifiability of LiNGAMs) Consider an SCM with graph G₀ and assignments
$$X_j := \sum_{k \in PA_{G₀}(j)} b_{jk}X_k + N_j, \quad j = 1, ..., d, \quad (7.2)$$
where all N_j are jointly independent and non-Gaussian distributed with strictly positive density. Additionally, for each j ∈ {1, ..., d}, we require b_{jk} ≠ 0 for all k ∈ PA_{G₀}(j). Then, the graph G₀ is identifiable from the joint distribution.
The authors call this model a LiNGAM. As mentioned in Section 4.1.3, there is an alternative proof for Theorem 7.6: Theorem 28 in Peters et al. [2014] extends bivariate identifiability results such as Theorem 4.2 to the multivariate case. This trick is also used for nonlinear additive models (by extending Theorem 4.5).
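The bivariate intuition behind Theorem 7.6 can be sketched numerically (this is an assumed illustration of the residual-independence idea, not the ICA-based algorithm of Shimizu et al.): in the causal direction the regression residual is independent of the regressor, while in the anticausal direction it is not. As a crude, assumed dependence measure we use the correlation of squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear non-Gaussian pair: X := N_X, Y := X + N_Y with uniform noise.
n = 5000
X = rng.uniform(-1, 1, size=n)
Y = X + rng.uniform(-1, 1, size=n)

def dep_after_regression(cause, effect):
    """Regress effect on cause (OLS) and measure residual 'dependence' crudely."""
    c = cause - cause.mean()
    e = effect - effect.mean()
    resid = e - (c @ e) / (c @ c) * c
    # correlation between squared regressor and squared residual
    return abs(np.corrcoef(c ** 2, resid ** 2)[0, 1])

dep_fwd = dep_after_regression(X, Y)   # X -> Y: residual ~ independent of X
dep_bwd = dep_after_regression(Y, X)   # Y -> X: residual depends on Y
```

By the Darmois-Skitovič theorem, the backward residual can only be independent of the regressor if the noise is Gaussian, which is excluded here.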
Nonlinear Gaussian Additive Noise Models
We have seen that the graph structure of an ANM becomes identifiable if the assignments are linear and the noise variables are non-Gaussian. Alternatively, we can also exploit nonlinearity. The result is easiest to state with Gaussian noise:
Theorem 7.7 (Identifiability of nonlinear Gaussian ANMs)
(i) Let P_X = P_{X₁,...,X_d} be induced by an SCM with
$$X_j := f_j(PA_j) + N_j,$$
with normally distributed noise variables N_j ~ N(0, σ²_j) and three times differentiable functions f_j that are not linear in any component in the following sense. Denote the parents PA_j of X_j by X_{k₁}, ..., X_{k_ℓ}; then the function f_j(x_{k₁}, ..., x_{k_{a-1}}, ·, x_{k_{a+1}}, ..., x_{k_ℓ}) is assumed to be nonlinear for all a and some x_{k₁}, ..., x_{k_{a-1}}, x_{k_{a+1}}, ..., x_{k_ℓ} ∈ ℝ^{ℓ-1}.
(ii) As a special case, let P_X = P_{X₁,...,X_d} be induced by an SCM with
$$X_j := \sum_{k \in PA_j} f_{j,k}(X_k) + N_j, \quad (7.3)$$
with normally distributed noise variables N_j ~ N(0, σ²_j) and three times differentiable, nonlinear functions f_{j,k}. This model is known as a causal additive model (CAM).
In both cases (i) and (ii), we can identify the corresponding graph G₀ from the distribution P_X. The statements remain true if the noise distributions for source nodes, that is, nodes without parents, are allowed to have a non-Gaussian density with full support on the real line ℝ (the proof remains identical).
The proof can be found in Peters et al. [2014, Corollary 31].
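A RESIT-style sketch of how nonlinearity can be exploited (an assumed setup, not the authors' code): fit a flexible regression in both directions and check which direction leaves residuals roughly independent of the regressor. Here we use a polynomial fit and, as an assumed proxy for an independence test, the correlation of squares, which picks up heteroscedasticity in the backward direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonlinear Gaussian ANM: X := N_X, Y := X^3 + X + N_Y.
n = 3000
X = rng.normal(size=n)
Y = X ** 3 + X + rng.normal(size=n)

def residual_dependence(cause, effect, degree=5):
    """Polynomial regression of effect on (standardized) cause; crude residual dependence."""
    c = (cause - cause.mean()) / cause.std()
    resid = effect - np.polyval(np.polyfit(c, effect, degree), c)
    r = resid - resid.mean()
    return abs(np.corrcoef(c ** 2, r ** 2)[0, 1])  # heteroscedasticity check

dep_fwd = residual_dependence(X, Y)   # true direction: residuals ≈ noise
dep_bwd = residual_dependence(Y, X)   # backward: residual variance varies with Y
```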
Observational and Experimental Data
We have already seen in Section 6.3 that knowing causal relations can help improve predictions when the underlying distribution changes. We will now turn this idea around and show how observing the system in different environments can be used to learn causal relations. We therefore turn to the following setup, in which we observe data from different environments e ∈ E. The corresponding model reads
$$X^e = (X^e_1, ..., X^e_d) ∼ P^e,$$
where each variable X^e_j denotes the same (physical) quantity, measured in environment e ∈ E. We will talk about a variable X_j in different environments, which is a slight abuse of notation.
Known Intervention Targets A first type of method assumes that the different environments stem from different interventional settings. In the case that the intervention targets I^e ⊆ {1, ..., d} are known, several methods have been proposed. Tian and Pearl [2001] and Hauser and Bühlmann [2012], for example, assume faithfulness and consider mechanism changes and stochastic interventions, respectively. They define and characterize the interventional equivalence classes of graphs: that is, the class of graphs that can explain the given distributions. For mechanism changes, for example, we can include an intervention node into the model whose children are the variables that are intervened on. This way we increase the number of v-structures and two graphs become intervention equivalent (with respect to the given distributions) if they have the same skeletons and v-structures, and the nodes that are intervened on have the same parents [cf. Tian and Pearl, 2001, Theorem 2]. Eberhardt et al. [2010] allow for hard and stochastic interventions even in the presence of cycles.
Hyttinen et al. [2012] analyze conditions on the interventions under which the graph becomes identifiable. Eberhardt et al. [2005] and Hauser and Bühlmann [2014] investigate how many intervention experiments are necessary in the worst case to identify the graph.
Different Environments Let us now turn to a slightly different setting, in which we do not try to learn the whole causal structure. Instead, we consider a target variable Y with a set of d predictors X and try to learn which of the predictors are the causal parents of Y. Both X and Y are observed in different environments e ∈ E (which could be intervention settings with unknown targets). That is, we have
$$(X^e, Y^e) ∼ P^e$$
for e ∈ E. The key assumption is the existence of an unknown set PA_Y ⊆ {1, ..., d} (one may think of the direct causes of Y) such that the conditional Y given PA_Y is invariant over all environments, that is, for all e, f ∈ E we have
$$P^e_{Y^e|PA^e_Y} = P^f_{Y^f|PA^f_Y}.$$
This assumption is satisfied if the distributions are induced by an underlying SCM and the different environments correspond to different intervention distributions, for which Y has not been intervened on [Peters et al., 2016] (see Code Snippet 7.11 for an example). Having said that, the setting is more general and the environments do not need to correspond to interventions; one does not even require an underlying SCM. One can consider the collection S of all sets S ⊆ {1, ..., d} of variables that lead to 'invariant prediction,' that is, for all e, f ∈ E and for all S ∈ S, we have
$$P^e_{Y^e|S^e} = P^f_{Y^f|S^f}. \quad (7.4)$$
Here, Y^e|S^e is shorthand notation for Y^e|X^e_S. It is not difficult to see (Problem 7.15) that the variables appearing in all those sets S ∈ S must be direct causes of Y:
$$\bigcap_{S \in S} S ⊆ PA_Y, \quad (7.5)$$
where we define the intersection over an empty index set as the empty set. Peters et al. [2016] consider the left-hand side of (7.5) as an estimate for PA_Y. Equation (7.5) then guarantees that any variable contained in the output of this method is indeed in PA_Y. In the special case of SCMs and interventions, there are sufficient conditions [Peters et al., 2016] under which PA_Y becomes identifiable; in other words, (7.5) becomes an equality. Interestingly, the method we present in Section 7.2.5 detects whether the data come from such an identifiable case; it does not need to assume it.
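A toy sketch of the invariance idea (an assumed, heavily simplified test, not the procedure of Peters et al. [2016]): accept every predictor set S whose pooled-regression residuals look identically distributed across two environments, then intersect the accepted sets as in (7.5). The data-generating model below (an intervention shifting X1, a target Y, and a descendant X2) is an illustrative assumption:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Two environments; the intervention shifts the mean of X1 in environment 1.
n = 2000
env = np.repeat([0, 1], n)
X1 = np.where(env == 0, rng.normal(0, 1, 2 * n), rng.normal(2, 1, 2 * n))
Y = 2 * X1 + rng.normal(size=2 * n)      # invariant mechanism Y | X1
X2 = Y + rng.normal(size=2 * n)          # descendant of Y
X = np.column_stack([X1, X2])

def residuals(S):
    """Pooled OLS residuals of Y on X_S (intercept only if S is empty)."""
    A = np.column_stack([np.ones(2 * n)] + [X[:, j] for j in S])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return Y - A @ beta

def looks_invariant(S):
    r = residuals(S)
    r0, r1 = r[env == 0], r[env == 1]
    t = abs(r0.mean() - r1.mean()) / np.sqrt(r0.var() / n + r1.var() / n)
    return t < 4 and 0.8 < r0.var() / r1.var() < 1.25   # crude fixed thresholds

accepted = [set(S) for k in range(3)
            for S in itertools.combinations(range(2), k) if looks_invariant(S)]
parents_hat = set.intersection(*accepted) if accepted else set()
```

Here {X1} is accepted (the mechanism for Y given X1 is invariant), the empty set and {X2} are rejected, and the intersection of all accepted sets recovers the parent X1, consistent with (7.5).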
Tian and Pearl [2001] also address the question of identifiability with unknown intervention targets. They do not specify a target variable and focus on changes in marginal distributions rather than conditionals.
Methods for Structure Identification
We have seen several assumptions that lead to (partial) identifiability of the causal structure. The purpose of this section is to show how these assumptions can be exploited to provide estimators of the underlying graph from a finite amount of data (see Figure 7.1 for two examples). We provide an overview of methods and try to focus on their ideas. There is a large pool of methods, and we believe that future research needs to show which of these methods will prove to be most useful in practice. We nevertheless try to highlight some of the methods' potential problems and most crucial assumptions. Although some papers study the consistency of the presented methodology, we omit most of those results and present ideas only. Subtleties of algorithmic implementation will not be discussed either, and we would like to refer the interested reader to the references we provide. Kalisch et al. [2012] maintain the software package pcalg for R [R Core Team, 2016] that contains code not only for the PC (for the inventors Peter Spirtes and Clark Glymour) algorithm (see Section 7.2.1), but also for many of the described methods.
Before providing more details about the existing methodology, we would like to add two comments first: (1) While there are several simulation studies available, a topic that receives little attention is the question of a loss function. Given the true underlying causal structure, how 'good' is an estimated causal graph? In practice, one often uses variants of the structural Hamming distance [Acid and de Campos, 2003, Tsamardinos et al., 2006], which counts the number of misspecified edges. As an alternative, Peters and Bühlmann [2015] suggest evaluating the graph based on its ability to predict intervention distributions. (2) Some of the methods that we present assume that the structural assignments (6.1) and the corresponding functions f_j in particular are simple. Often, those methods do provide estimates not only for the causal structure but also for the corresponding assignments, which can usually be used to compute residuals, too. In principle, and under this model, we can then test the strong assumption of mutually independent noise variables (Definition 3.1), for example, by applying a mutual independence test [e.g., Pfister et al., 2017]; see Section 4.2.1 for statistical subtleties of such a procedure.
Independence-Based Methods
Independence-based methods such as the inductive causation (IC) algorithm, the SGS (for the inventors Spirtes, Glymour, and Scheines) algorithm, and the PC algorithm assume that the distribution is faithful to the underlying DAG. This renders the Markov equivalence class, that is, the corresponding CPDAG, identifiable (see Section 7.1.1). There is a one-to-one correspondence between d-separations in the graph and conditional independences in P_X. Any query of a d-separation statement can therefore be answered by checking the corresponding conditional independence test. We first assume that an oracle provides us with the correct answers to the conditional independence questions and discuss some finite sample issues in the paragraph 'Conditional Independence Tests.'
Figure 7.1: The figure summarizes two approaches for the identification of causal structures. Independence-based methods (top) test for conditional independences in the data; these properties are related to the graph structure by the Markov condition and faithfulness. Often, the graph is not uniquely identifiable; the method may therefore output different graphs G and G'. Alternatively, one may restrict the model class and fit the SCM directly (bottom).
Estimation of Skeleton Most independence-based methods first estimate the skeleton, that is, the undirected edges, and orient as many edges as possible afterward. For the skeleton search, the following lemma is useful to know [see Verma and Pearl, 1991, Lemma 1].
Lemma 7.8 The following two statements hold.
(i) Two nodes X, Y in a DAG (V, E) are adjacent if and only if they cannot be d-separated by any subset S ⊆ V \ {X, Y}.
(ii) If two nodes X, Y in a DAG (V, E) are not adjacent, then they are d-separated by either PA_X or PA_Y.
Using Lemma 7.8(i), we have that if two variables are always dependent, no matter what other variables one conditions on, these two variables must be adjacent. This result is used in the IC algorithm [Pearl, 2009] and in the SGS algorithm [Spirtes et al., 2000]. For each pair of nodes (X, Y), these methods search through all possible subsets A ⊆ V \ {X, Y} of variables neither containing X nor Y and check whether X and Y are d-separated given A. After all those tests, X and Y are adjacent if and only if no set A was found that d-separates X and Y.
Searching through all possible subsets A does not seem optimal, especially if the graph is sparse. The PC algorithm [Spirtes et al., 2000] starts with a fully connected undirected graph and step-by-step increases the size of the conditioning set A, starting with |A| = 0. At iteration k, it considers sets A of size |A| = k, using the following neat trick: to test whether X and Y can be d-separated, one only has to go through sets A that are subsets either of the neighbors of X or of the neighbors of Y; this idea is based on Lemma 7.8(ii) and clearly improves the computation time, especially for sparse graphs.
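The skeleton phase just described can be sketched as follows, assuming an independence oracle; this is a simplified illustration (hard-coded for the toy chain 0 → 1 → 2, whose only conditional independence is 0 ⊥⊥ 2 | {1}), not a full PC implementation such as the one in pcalg:

```python
import itertools

# Oracle for the chain 0 -> 1 -> 2: the only independence is 0 ⟂ 2 | {1}.
def oracle_indep(x, y, S):
    return {x, y} == {0, 2} and S == {1}

def pc_skeleton(nodes, indep):
    adj = {v: set(nodes) - {v} for v in nodes}   # start fully connected
    k = 0
    while any(len(adj[v]) - 1 >= k for v in nodes):
        for x in nodes:
            for y in list(adj[x]):
                # condition only on subsets of the neighbors of x (Lemma 7.8(ii))
                for S in itertools.combinations(adj[x] - {y}, k):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        k += 1       # increase the size of the conditioning set
    return {frozenset((x, y)) for x in nodes for y in adj[x]}

skeleton = pc_skeleton([0, 1, 2], oracle_indep)
```

For the chain, the edge 0-2 is removed at conditioning-set size k = 1, leaving the correct skeleton 0-1-2.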
Orientation of Edges Lemma 6.25 suggests that we should be able to orient the immoralities (or v-structures) in the graph. If two nodes are not directly connected in the obtained skeleton, there is a set that d-separates these nodes. Suppose that the skeleton contains the structure X-Z-Y with no direct edge between X and Y; further, let A be a set that d-separates X and Y. The structure X-Z-Y is an immorality and can therefore be oriented as X → Z ← Y if and only if Z ∉ A. After the orientation of immoralities, we may be able to orient some further edges in order to avoid cycles, for example. There is a set of such orientation rules that has been shown to be complete and is known as Meek's orientation rules [Meek, 1995].
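The immorality orientation step can be sketched in the same toy style. In practice, the separating sets (`sepset`) are recorded during the skeleton search; the values below are assumed for the collider 0 → 1 ← 2, where 0 and 2 are d-separated by the empty set:

```python
import itertools

# Skeleton 0 - 1 - 2 with no edge between 0 and 2, separated by the empty set.
skeleton = {frozenset((0, 1)), frozenset((1, 2))}
sepset = {frozenset((0, 2)): set()}

def orient_immoralities(nodes, skeleton, sepset):
    arrows = set()                     # directed edges as (tail, head) pairs
    for x, y in itertools.combinations(nodes, 2):
        if frozenset((x, y)) in skeleton:
            continue                   # x and y adjacent: the triple is shielded
        for z in nodes:
            # unshielded triple x - z - y: orient x -> z <- y iff z not in sepset
            if (frozenset((x, z)) in skeleton and frozenset((z, y)) in skeleton
                    and z not in sepset[frozenset((x, y))]):
                arrows |= {(x, z), (y, z)}
    return arrows

arrows = orient_immoralities([0, 1, 2], skeleton, sepset)
```

Since 1 does not appear in the separating set of (0, 2), the triple is oriented as the collider 0 → 1 ← 2; Meek's rules would then propagate further orientations where applicable.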
Satisfiability Methods An alternative to the graphical approach just described is to formulate causal learning as a satisfiability (SAT) problem [Triantafillou et al., 2010]. First, one formulates graphical relations as Boolean variables, such as A := 'There is a direct edge from X to Y.' The non-trivial part is then to translate the independence statements (we still assume that they are provided by an independence oracle), as d-separation statements into 'formulas' that involve Boolean variables and the operators 'and' and 'or.' The SAT question then asks whether we can assign a value 'true' or 'false' to each of the Boolean variables to make the overall formula true. SAT solvers not only check whether this is the case but also provide us with the information as to whether in all of the assignments that make the overall formula true, certain variables are always assigned to the same value. For example, the d-separation statements may be satisfied by different graph structures that correspond to different assignments, but if in all such assignments the Boolean variable A from above takes the value 'true,' we can infer that in the underlying graph, X must be a parent of Y. Even though the Boolean SAT problem is known to be nondeterministic polynomial time (NP)-complete [Cook, 1971, Levin, 1973], that is, it is NP and NP-hard, there are heuristic algorithms that can solve instances of large problems, involving millions of variables. SAT methods in causal learning allow us to query specific statements as an ancestral relation rather than estimating the full graph. They let us incorporate different kinds of prior knowledge and furthermore, we can put weights on the independence constraints if we believe that some of the (statistical) findings contradict each other. These approaches have been extended to cycles, latent variables, and overlapping data sets [Hyttinen et al., 2013, Triantafillou and Tsamardinos, 2015].
Conditional Independence Tests In the three preceding paragraphs we have assumed the existence of an independence oracle that tells us whether a specific (conditional) independence is or is not present in the distribution. In practice, however, we have to infer this statement from a finite amount of data. This comes with two major challenges: (1) All causal discovery methods that are based on conditional independence tests draw conclusions both from dependences and independences. In practice, however, one most often uses statistical significance tests, which are inherently asymmetric. One therefore usually forgets about the original meaning of the significance level and treats it as a tuning parameter. Furthermore, due to finite samples, the testing results might even contradict each other in the sense that there is no graph structure that encodes the exact set of inferred conditional independences. (2) Although there is some recent work on kernel-based tests [Fukumizu et al., 2008, Tillman et al., 2009, Zhang et al., 2011], nonparametric conditional independence tests are difficult to perform with a finite amount of data. One therefore often restricts oneself to a subclass of possible dependences, some of which we now briefly review.
If the variables are assumed to follow a Gaussian distribution, we can test for vanishing partial correlation (see Appendices A.1 and A.2). Under faithfulness, the Markov equivalence class of the underlying DAG becomes identifiable (Lemma 7.2) and indeed, in the Gaussian setting, the PC algorithm with a test for vanishing partial correlation provides a consistent estimator for the correct CPDAG [Kalisch and Bühlmann, 2007]. Additionally assuming a condition called strong faithfulness [Zhang and Spirtes, 2003, Uhler et al., 2013] even yields uniform consistency [Kalisch and Bühlmann, 2007]; see also the discussion in Robins et al. [2003].
Non-parametric conditional independence testing is a difficult problem in theory and practice. For non-Gaussian distributions, vanishing partial correlation is neither necessary nor sufficient for conditional independence, as shown by the following example.
Example 7.9 (Conditional independence and partial correlation)
(i) If the distribution P_{X,Y,Z} is entailed by the SCM
$$Z := N_Z, \quad X := Z^2 + N_X, \quad Y := Z^2 + N_Y,$$
where N_X, N_Y, N_Z are i.i.d. ∼ N(0,1), it satisfies
$$X ⊥⊥ Y | Z \text{ and } r_{X,Y|Z} = 0.$$
The partial correlation coefficient r_{X,Y|Z} equals the correlation of X - aZ and Y - bZ where a and b are the regression coefficients when regressing X and Y on Z, respectively. In this example, a = b = 0 because X and Y do not correlate with Z.
(ii) The distribution P_{X,Y,Z} entailed by the SCM
$$Z := N_Z, \quad X := Z + N_X, \quad Y := Z + N_Y,$$
where (N_X, N_Y) ⊥⊥ N_Z and (N_X, N_Y) are uncorrelated but not independent, satisfies
$$X \not⊥⊥ Y | Z \text{ and } r_{X,Y|Z} = 0$$
since here, r_{X,Y|Z} is the correlation between N_X and N_Y.
Therefore, vanishing partial correlation does not imply and is not implied by conditional independence.
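Case (ii) of Example 7.9 is easy to verify numerically. The sketch below uses an assumed concrete choice of uncorrelated-but-dependent noise, N_X = U and N_Y = U² − 1 for standard normal U: the partial correlation vanishes although X and Y are conditionally dependent given Z:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise terms that are uncorrelated but dependent: U and U^2 - 1.
n = 5000
U = rng.normal(size=n)
N_X, N_Y = U, U**2 - 1
Z = rng.normal(size=n)
X, Y = Z + N_X, Z + N_Y

def ols_resid(target, regressor):
    """Residual of an OLS regression of target on regressor (with intercept)."""
    r = regressor - regressor.mean()
    t = target - target.mean()
    return t - (r @ t) / (r @ r) * r

rX, rY = ols_resid(X, Z), ols_resid(Y, Z)
partial_corr = np.corrcoef(rX, rY)[0, 1]          # close to zero
nonlinear_dep = np.corrcoef(rX ** 2, rY)[0, 1]    # clearly nonzero
```

The residuals are (approximately) U and U² − 1, which are uncorrelated; squaring the first residual exposes the dependence.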
The following procedure for testing whether X and Y are conditionally independent given Z provides a natural nonlinear extension of partial correlation [e.g., Ramsey, 2014]: (1) (nonlinearly) regress X on Z and test whether the residuals are independent of Y; (2) (nonlinearly) regress Y on Z and test whether the residuals are independent of X; (3) if one of those two independences holds, conclude that X ⊥⊥ Y | Z. This seems to be the correct test in the case of ANMs; see Section 7.1.2. For three variables, for example, we have the following result.
Proposition 7.10 Consider a distribution P_{X,Y,Z} induced by an ANM (Definition 7.3) with all variables having strictly positive densities. If X and Y are d-separated given Z, then the procedure just described outputs the corresponding conditional independence in the sense that either X - E[X|Z] is independent of Y or Y - E[Y|Z] is independent of X.
Proof. Assume that X := h(Z) + N_X and Y := f(Z) + N_Y, with Z, N_X, and N_Y being mutually independent. Then, X - E[X|Z] = N_X is independent of Y. The statement follows analogously for the other possible structures, for example, X → Z → Y or X ← Z ← Y. □
The proposition shows that (in a population sense) the test described is appropriate for ANMs with three variables. Considering four variables X, Y, Z, W, however, may already lead to problems. Clearly, the graphs X ← Z → W → Y and X → Z → W → Y are Markov equivalent. But while the test outputs X ⊥⊥ Y | Z for the first graph, there is no such guarantee for the second graph. Thus, restricting the dependence models between random variables in order to construct feasible conditional independence tests leads to an asymmetric treatment of graphs within a Markov equivalence class. This effect may be the same for many other types of methods for conditional independence testing. This asymmetry does not necessarily need to be a drawback since, as we have seen, restricted function classes may lead to identifiability within the Markov equivalence class (see Section 7.1). It certainly requires consideration, though.
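The two-step residual test can be sketched as follows, with assumed ingredients: a polynomial regression in place of a general nonparametric one, and a crude correlation-based surrogate for a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(0)

def crude_indep(u, v, thresh=0.1):
    """Crude (assumed) independence check via correlations of values and squares."""
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    stats = [np.corrcoef(a, b)[0, 1] for a in (u, u ** 2) for b in (v, v ** 2)]
    return max(abs(s) for s in stats) < thresh

def ci_test(x, y, z, degree=4):
    """Return True iff X ⟂ Y | Z according to the two-step residual procedure."""
    rx = x - np.polyval(np.polyfit(z, x, degree), z)   # step (1)
    ry = y - np.polyval(np.polyfit(z, y, degree), z)   # step (2)
    return crude_indep(rx, y) or crude_indep(ry, x)    # step (3)

# Common-cause ANM: X ⟂ Y | Z holds; W depends on X even given Z.
n = 4000
Z = rng.normal(size=n)
X = Z ** 2 + rng.normal(size=n)
Y = Z + rng.normal(size=n)
W = X + rng.normal(size=n)
# ci_test(X, Y, Z) reports conditional independence; ci_test(X, W, Z) does not.
```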
Score-Based Methods
In the preceding section we have directly used the independence statements to infer the graph. Alternatively, we can test different graph structures in their ability to fit the data. The rationale is that graph structures encoding the wrong conditional independences, for example, will yield bad model fits. Although the roots for score-based methods for causal learning may date back even further, we mainly refer to Geiger and Heckerman [1994a], Heckerman et al. [1999], Chickering [2002], and references therein. The Max-Min Hill-Climbing algorithm [Tsamardinos et al., 2006] combines score-based and independence-based techniques.
Best Scoring Graph Given data D = (X₁, ..., X_n) from a vector X of variables, that is, a sample containing n i.i.d. observations, the idea is to assign a score S(D, G) to each graph G and search over the space of DAGs to find the graph with the highest score:
$$\hat{G} := \arg\max_{G \text{ DAG over } X} S(D, G). \quad (7.6)$$
There are several possibilities to define such a scoring function S. Often a parametric model is assumed (e.g., linear Gaussian equations or multinomial distributions), which introduces a set of parameters θ ∈ Θ.
(Penalized) Likelihood For each graph we may consider the maximum likelihood estimator θ̂ for θ and then define a score function by the BIC
$$S(D, G) = \log p(D|\hat{θ}, G) - \frac{\text{# parameters}}{2} \log n, \quad (7.7)$$
where log p(D|θ̂, G) is the log likelihood and n is the sample size. Estimators that output the graph with the largest (penalized) likelihood are often consistent. This follows from the consistency of BIC [Haughton, 1988], and identifiability of the model class. To guarantee rates of convergence, however, one usually relies on a 'degree of identifiability' [e.g., Bühlmann et al., 2014]. In practice, finding the best scoring graph among all possible graphs may not be feasible and search techniques over the space of graphs are required (e.g., see the paragraph 'Greedy Search Techniques'). Regularization different from BIC is possible, too. Roos et al. [2008] base their score on the minimum description length principle [Grünwald, 2007], for example. Using work by Haughton [1988], Chickering [2002] discusses how the BIC approach relates to a Bayesian formulation that we discuss next.
Bayesian Scoring Functions We define priors p_pr(G) and p_pr(θ) over DAGs and parameters, respectively, and consider the log posterior as a score function (note that p(D) is constant over all DAGs):
$$S(D, G) := \log p(G|D) = \log p_{pr}(G) + \log p(D|G) - \log p(D),$$
where p(D|G) is the marginal likelihood
$$p(D|G) = \int_{θ ∈ Θ} p(D|G, θ) p_pr(θ|G) dθ.$$
Here, the resulting estimator Ĝ from Equation (7.6) is the mode of the posterior distribution, which is usually called a maximum a posteriori (MAP) estimator. Alternatively, one may output the full posterior distribution over DAGs, and, in principle, even more detailed information is available. For instance, one can average over all graphs to get a posterior probability of the existence of a specific edge.
As an example, consider random variables that take only finitely many values. For a given structure G, one may then assume that for each parent configuration the probability distribution of a random variable X_j follows a multinomial distribution. If we put a Dirichlet prior on its parameters (together with some further conditions on parameter independence and modularity), this leads to the Bayesian Dirichlet (BD) score [Geiger and Heckerman, 1994b].
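A minimal sketch of such a Dirichlet-multinomial marginal likelihood follows, in Python for illustration; the BDeu-style hyperparameters (the equivalent sample size `ess` spread uniformly over parent configurations and states) and all function names are our own choices, not a reference implementation of the BD score.

```python
import numpy as np
from math import lgamma
from itertools import product

def bd_node_score(data, j, parents, arities, ess=1.0):
    """Log marginal likelihood of node j under a multinomial model with a
    Dirichlet prior; pseudo-counts follow the BDeu convention ess / (q_j * r_j)."""
    r_j = arities[j]
    q_j = int(np.prod([arities[p] for p in parents])) if parents else 1
    a_cell = ess / (q_j * r_j)  # prior pseudo-count per (config, state) cell
    a_cfg = ess / q_j           # prior pseudo-count per parent configuration
    score = 0.0
    for config in product(*[range(arities[p]) for p in parents]):
        mask = np.ones(len(data), dtype=bool)
        for p, v in zip(parents, config):
            mask &= data[:, p] == v
        counts = np.bincount(data[mask, j], minlength=r_j)
        score += lgamma(a_cfg) - lgamma(a_cfg + counts.sum())
        score += sum(lgamma(a_cell + c) - lgamma(a_cell) for c in counts)
    return score
```

The score of a graph is then the sum of these local terms (plus the log prior over graphs); the closed form arises from Dirichlet-multinomial conjugacy.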
In the case of parametric models, we call two graphs G₁ and G₂ distribution equivalent if for each parameter θ₁ there is a corresponding parameter θ₂ such that the distribution obtained from G₁ in combination with θ₁ is the same as the distribution obtained from graph G₂ with θ₂, and vice versa. It can be shown (see Problem 7.12) that in the linear Gaussian case, for example, two graphs are distribution equivalent if and only if they are Markov equivalent. It has therefore been argued that p(D|G₁) and p(D|G₂) should be the same for Markov equivalent graphs G₁ and G₂. The BD score can be adapted to satisfy this property; the adapted score is usually referred to as the Bayesian Dirichlet equivalence (BDe) score [Geiger and Heckerman, 1994b]. Buntine [1991] proposes a specific version of this score with even fewer hyperparameters.
Greedy Search Techniques The search space of all DAGs grows super-exponentially in the number of variables [e.g., Chickering, 2002]: the numbers of DAGs over 2, 3, 4, and 10 variables are 3, 25, 543, and 4175098976430598143, respectively (see Table B.1). Computing a solution to Equation (7.6) by searching over all graphs is therefore often infeasible. Instead, greedy search algorithms can be applied to solve (7.6). At each step, there is a candidate graph and a set of neighboring graphs; one computes the score of each neighbor and takes the best-scoring one as the new candidate. If no neighbor obtains a better score, the search procedure terminates (without knowing whether it has found only a local optimum). One therefore has to define a neighborhood relation; starting from a graph G, we may, for example, define as neighbors of G all graphs that can be obtained by removing, adding, or reversing one edge.
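The following Python sketch illustrates this hill-climbing scheme with the add/remove/reverse neighborhood and a decomposable BIC score for linear Gaussian models. It is a toy implementation under our own conventions, not an optimized one; in particular, it rescores graphs from scratch instead of updating only the changed local scores.

```python
import numpy as np

def is_acyclic(adj):
    """Kahn's algorithm; adj[i, j] = 1 encodes an edge i -> j."""
    adj = adj.copy()
    indeg = adj.sum(axis=0)
    stack = [v for v in range(len(adj)) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in np.flatnonzero(adj[v]):
            adj[v, w] = 0
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == len(adj)

def neighbors(adj):
    """All DAGs reachable by removing, adding, or reversing one edge."""
    d = len(adj)
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            if adj[i, j]:                     # remove i -> j (always acyclic)
                nb = adj.copy(); nb[i, j] = 0
                yield nb
                rv = nb.copy(); rv[j, i] = 1  # reverse to j -> i
                if is_acyclic(rv):
                    yield rv
            elif not adj[j, i]:               # add i -> j
                nb = adj.copy(); nb[i, j] = 1
                if is_acyclic(nb):
                    yield nb

def make_bic(data):
    """Decomposable BIC score for a linear Gaussian model, cf. (7.7)."""
    n, d = data.shape
    def score(adj):
        total = 0.0
        for j in range(d):
            pa = list(np.flatnonzero(adj[:, j]))
            X = np.column_stack([data[:, pa], np.ones(n)]) if pa else np.ones((n, 1))
            beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
            sigma2 = np.mean((data[:, j] - X @ beta) ** 2)
            total += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
            total -= 0.5 * (len(pa) + 2) * np.log(n)
        return total
    return score

def greedy_search(data):
    """Hill climbing: move to the best-scoring neighbor until no move improves."""
    score = make_bic(data)
    adj = np.zeros((data.shape[1], data.shape[1]), dtype=int)
    cur = score(adj)
    while True:
        cand = max(neighbors(adj), key=score, default=None)
        if cand is None or score(cand) <= cur:
            return adj, cur  # a local optimum, not necessarily the global one
        adj, cur = cand, score(cand)
```

On data from a linear Gaussian chain, this procedure typically recovers a member of the correct Markov equivalence class.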
In the case of a linear Gaussian SCM, one cannot distinguish between Markov equivalent graphs. It then turns out to be beneficial to change the search space from DAGs to Markov equivalence classes. The greedy equivalence search (GES) [Chickering, 2002] optimizes the BIC criterion (7.7) and starts with the empty graph. It consists of two phases: in the first phase, edges are added until a local maximum is reached; in the second phase, edges are removed until a local maximum is reached, and the resulting graph is the output of the algorithm.
Exact Methods In general, finding the optimal scoring DAG is NP-hard [Chickering, 1996], but there is nevertheless a lot of interesting research on scaling up exact methods. Here, 'exact' means that they aim at finding (one of) the best scoring graphs for a given finite data set. Greedy search techniques, by contrast, are heuristic and come with guarantees, if any, only in the limit of infinite data.
One line of research is based on dynamic programming [Silander and Myllymäki, 2006, Koivisto and Sood, 2004, Koivisto, 2006]. These approaches exploit the decomposability of many scores that are used in practice: due to the Markov factorization, we have for D = (X₁, ..., X_n) that
$$\log p(D|\hat{θ}, G) = \sum_{j=1}^d \sum_{i=1}^n \log p(X_i^{(j)}|X_i^{PA_G(j)}, \hat{θ}),$$
which is a sum of d 'local' scores. Methods based on dynamic programming exploit this decomposability and, despite their exponential complexity, can find the best scoring graph for 30 or more variables, even if one does not restrict the number of parents. This is a remarkable result given the enormous number of different DAGs over this number of variables (see Table B.1).
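A toy version of this dynamic program is sketched below, in Python for illustration. It exploits the fact that every DAG over a node set S has a sink j whose parents lie in S minus {j}; our simple implementation also enumerates all parent sets and is therefore feasible only for very small d, whereas the scalable variants cited above are considerably more refined.

```python
import numpy as np
from functools import lru_cache
from itertools import combinations

def exact_search(data):
    """Exact structure search by dynamic programming over node subsets,
    in the spirit of Silander and Myllymaki: the best DAG over S equals
    max over sinks j of best(S \\ {j}) plus j's best local score."""
    n, d = data.shape

    def local_score(j, pa):  # BIC of regressing node j on parent set pa
        pa = list(pa)
        X = np.column_stack([data[:, pa], np.ones(n)]) if pa else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        sigma2 = np.mean((data[:, j] - X @ beta) ** 2)
        return (-0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
                - 0.5 * (len(pa) + 2) * np.log(n))

    @lru_cache(maxsize=None)
    def best_parents(j, cand):  # best-scoring parent set for j inside cand
        best = (local_score(j, ()), ())
        for k in range(1, len(cand) + 1):
            for pa in combinations(sorted(cand), k):
                s = local_score(j, pa)
                if s > best[0]:
                    best = (s, pa)
        return best

    @lru_cache(maxsize=None)
    def best_dag(S):
        if not S:
            return 0.0, {}
        best = None
        for j in sorted(S):  # try every node of S as a sink
            s_rest, g_rest = best_dag(S - {j})
            s_j, pa = best_parents(j, S - {j})
            if best is None or s_rest + s_j > best[0]:
                best = (s_rest + s_j, {**g_rest, j: list(pa)})
        return best

    return best_dag(frozenset(range(d)))
```

The memoization over subsets is what brings the complexity down from super-exponential (all DAGs) to exponential (all subsets).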
The integer linear programming (ILP) framework assumes not only decomposability but also that the scoring function assigns the same score to Markov equivalent graphs. The idea is to represent graphical structures as vectors such that the scoring function becomes an affine function of this vector representation. Studený and Haws [2014] describe how Hemmecke et al. [2012] base their representation on characteristic imsets, while Jaakkola et al. [2010] and Cussens [2011] instead use (exponentially long) zero-one codes that indicate parent-child relationships between nodes and reduce the search space exploiting work by De Campos and Ji [2011]. The resulting ILP problem is still NP-hard, but one may now apply off-the-shelf ILP solvers. Restricting the number of parents leads to further advances; in 'pedigree learning', for example, each node has at most two parents [Sheehan et al., 2014].
Additive Noise Models
ANMs can be learned with score-based methods that are combined with a greedy search technique. Such an approach has been proposed for linear Gaussian models with equal error variances (Section 7.1.3) and for nonlinear Gaussian ANMs (Section 7.1.5) [see Peters and Bühlmann, 2014, Bühlmann et al., 2014]. In the nonlinear Gaussian case, for example, we can proceed analogously to the bivariate case (see Equations (4.18) and (4.19)). For a given graph structure G, we regress each variable on its parents and obtain the score
$$\log p(D|G) = \sum_{j=1}^d -\log \widehat{\text{var}}[R_j];$$
here, var̂[R_j] is the empirical variance of the residuals R_j obtained from the regression of variable X_j on its parents. Intuitively, the better the model fits the data, the smaller the variance of the residuals and thus the larger our score. Formally, the procedure is an instance of maximum likelihood estimation and can be shown to be consistent [Bühlmann et al., 2014]. Computationally, we can again exploit the property that the score decomposes over the different nodes: when computing the score for a neighboring graph that changes the parent set of only one variable, we need to update only the corresponding summand. If the noise cannot be assumed to be Gaussian, one can, for example, estimate the noise distribution [Nowzohour and Bühlmann, 2016] and obtain an entropy-like score.
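A sketch of this residual-variance score follows, in Python for illustration. The kernel ridge regression with an RBF kernel is our own stand-in for the nonparametric regression, and the bandwidth and regularization constants are ad hoc choices.

```python
import numpy as np

def rbf_residuals(X, y, gamma=0.5, lam=1e-2):
    """Residuals of a kernel ridge regression of y on X (RBF kernel)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return y - K @ alpha

def anm_score(data, graph):
    """Sum over nodes of -log empirical residual variance;
    source nodes contribute -log of their raw variance."""
    score = 0.0
    for j, parents in graph.items():
        y = data[:, j]
        if parents:
            r = rbf_residuals(data[:, list(parents)], y)
        else:
            r = y - y.mean()
        score += -np.log(np.mean(r ** 2))
    return score
```

On data from a nonlinear ANM, the causal graph typically fits with much smaller residual variance than alternative structures, which is exactly what the score rewards.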
Alternatively, one can estimate the structure in an iterative way using independence tests. Mooij et al. [2009] and Peters et al. [2014] propose a regression with subsequent independence test (RESIT). The method is based on the property that the noise variables are independent of all preceding variables. For linear non-Gaussian models (Section 7.1.4), Shimizu et al. [2006] provide a practical method based on ICA [Comon, 1994, Hyvärinen et al., 2001] that can be applied to a finite amount of data. Later, an improved version of this method has been proposed in Shimizu et al. [2011].
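The sink-finding phase of RESIT can be sketched as follows, in Python for illustration. As a simplification of the actual procedure, we use a biased HSIC statistic summed over the remaining variables as the dependence measure, reuse an ad hoc kernel ridge regression, and omit the second, pruning phase.

```python
import numpy as np

def rbf_residuals(X, y, gamma=0.5, lam=1e-2):
    """Residuals of a kernel ridge regression of y on X (RBF kernel)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return y - K @ alpha

def hsic(a, b, gamma=0.5):
    """Biased HSIC statistic with RBF kernels, used as a dependence measure."""
    n = len(a)
    K = np.exp(-gamma * (a[:, None] - a[None, :]) ** 2)
    L = np.exp(-gamma * (b[:, None] - b[None, :]) ** 2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

def resit_order(data):
    """Phase 1 of RESIT: repeatedly remove the node whose regression
    residuals look most independent of the remaining variables."""
    data = (data - data.mean(0)) / data.std(0)
    remaining = list(range(data.shape[1]))
    order = []
    while len(remaining) > 1:
        dep = {}
        for j in remaining:
            others = [k for k in remaining if k != j]
            r = rbf_residuals(data[:, others], data[:, j])
            dep[j] = sum(hsic(r, data[:, k]) for k in others)
        sink = min(dep, key=dep.get)
        order.insert(0, sink)  # detected sinks go to the end of the causal order
        remaining.remove(sink)
    return remaining + order
```

Only for the true sink is the residual (approximately) the independent noise variable, which is why minimizing the dependence measure tends to identify it.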
Known Causal Ordering
It is often difficult to find the causal ordering (see Appendix B) of the underlying causal model. Given the causal ordering, however, estimating the graph reduces to 'classical' variable selection. Assume, for example, that
$$X := N_X, \quad Y := f(X, N_Y), \quad Z := g(X, Y, N_Z)$$
with unknown f, g, N_X, N_Y, N_Z. Deciding whether f depends on X, and g depends on X and/or Y (see the assumption of structural minimality in Remark 6.6) is then a well-studied significance problem in 'traditional' statistics. Standard methods can be used, especially if further structural assumptions are made, such as linearity [e.g., Hastie et al., 2009, Bühlmann and van de Geer, 2011]. This observation has been made before [e.g., Teyssier and Koller, 2005, Shojaie and Michailidis, 2010] and it has been suggested that instead of searching over the space of directed acyclic graphs, it might be beneficial to search over the causal order first and then perform variable selection [e.g., Teyssier and Koller, 2005, Bühlmann et al., 2014].
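As an illustration of this reduction to variable selection, the following Python sketch selects each node's parents among its predecessors in a given causal ordering via ordinary-least-squares t-statistics. The hard cutoff `tcrit` is a deliberately crude choice of ours; in practice one would use proper significance tests or penalized regression such as the lasso.

```python
import numpy as np

def select_parents(data, order, tcrit=4.0):
    """Given a causal ordering, keep as parents of each node those
    predecessors whose OLS t-statistic exceeds tcrit in absolute value."""
    n, _ = data.shape
    graph = {order[0]: []}  # the first node in the ordering has no parents
    for pos in range(1, len(order)):
        j, preds = order[pos], order[:pos]
        X = np.column_stack([data[:, preds], np.ones(n)])
        beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid = data[:, j] - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        tvals = beta[:-1] / se[:-1]
        graph[j] = [p for p, t in zip(preds, tvals) if abs(t) > tcrit]
    return graph
```

On data from a linear chain with the correct ordering supplied, the spurious predecessor of the last node is dropped while the true parent is kept.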
Observational and Experimental Data
Section 7.1.6 describes how causal structures may become identifiable when we observe the system under different conditions ('environments'). We now discuss how these results can be exploited in practice, that is, given only finitely many data points. Let us therefore assume that we obtain one sample X^e_{n_e} for each environment e ∈ E; that is, for each environment, we observe n_e i.i.d. data points.
Known Intervention Targets Here, each setting corresponds to an interventional experiment, and we have additional knowledge of the intervention targets I^e ⊆ {1, ..., p}. Cooper and Yoo [1999] incorporate the intervention effects as mechanism changes into a Bayesian framework. For perfect interventions, Hauser and Bühlmann [2015] consider linear Gaussian SCMs and propose a greedy interventional equivalence search (GIES), a modified version of the GES algorithm that we briefly described in Section 7.2.2.
Sometimes, one is not able to measure all variables in each experiment (this can even be the case when all experiments are observational) but nevertheless wants to combine the information from the available data; this problem has been addressed by SAT-based approaches [see, e.g., Triantafillou and Tsamardinos, 2015, Tillman and Eberhardt, 2014, and references therein].
Unknown Intervention Targets Eaton and Murphy [2007] do not assume that the targets of the different interventions are known. Instead, they introduce for each environment e ∈ E an intervention node I_e with no incoming edges (see 'Intervention Variables' on page 95); for each data point only one intervention node is active. Then, they apply standard methods to the enlarged model with d + |E| variables, subject to the constraint that intervention nodes do not have any parents.
Tian and Pearl [2001] propose to test whether the marginal distributions change across the different settings and to use this information to infer parts of the graph structure; they even combine this approach with an independence-based method.
Different Environments In Section 7.1.6, we also considered the problem of estimating the causal parents of a target variable Y among a set X of d predictors. To this end, we defined S as the collection of all sets S ⊆ {1, ..., d} that satisfy invariant prediction, that is, for which P^e_{Y^e|S^e} remains invariant over all environments e ∈ E; see (7.4). In practice, we can test the hypothesis of invariant prediction at level α and collect all sets S that pass the test in an estimate Ŝ of S. Because the true set of parents PA_Y belongs to Ŝ with high probability (1 - α), we obtain the coverage statement
$$\bigcap_{S \in \hat{S}} S ⊆ PA_Y \quad (7.8)$$
with high probability (1 - α). The left-hand side of (7.8) is the output of a method called 'invariant causal prediction' [Peters et al., 2016]. Code Snippet 7.11 shows an example in which the environments correspond to different interventions (this is not required by the method). To obtain correct coverage in the sense of (7.8), one only needs to model the conditional distribution of Y given PA_Y; in particular, one does not assume anything about the distribution of the d predictors X. This differs from the method proposed by Eaton and Murphy [2007] (see the paragraph 'Unknown Intervention Targets'), which additionally tries to estimate the full causal structure.
Code Snippet 7.11 The following code shows an example of a causal system in two environments. In the true underlying structure, X₁ and X₂ cause Y, which in turn causes X₃. In a linear model fitted on the pooled data (the call to lm below), all variables X₁, X₂, and X₃ are highly significant, since all of them are good predictors of Y. Such a model is not invariant, however: in the two environments, regressing Y on X₁, X₂, and X₃ yields coefficients (-0.15, 1.09, -0.39) and (-0.32, 1.62, -0.54), respectively. The method of invariant causal prediction outputs only the causal parents of Y, that is, X₁ and X₂. In this example, {1, 2} is the only set yielding an invariant model, that is, Ŝ = {{1, 2}}.
library(InvariantCausalPrediction)
#
# generate data from two environments
env <- c(rep(1,400),rep(2,700))
n <- length(env)
set.seed(1)
X1 <- rnorm(n)
X2 <- 1*X1 + c(rep(0.1,400), rep(1.0,700))*rnorm(n)
Y <- -0.7*X1 + 0.6*X2 + 0.1*rnorm(n)
X3 <- c(rep(-2,400),rep(-1,700))*Y + 2.5*X2 + 0.1*rnorm(n)
#
summary(lm(Y~-1+X1+X2+X3))
# Coefficients:
# ----Estimate Std.Error t.val. Pr(>|t|)
# X1 -0.396212 0.008667 -45.71 <2e-16 ***
# X2 +1.381497 0.021377 +64.63 <2e-16 ***
# X3 -0.410647 0.011152 -36.82 <2e-16 ***
#
ICP(cbind(X1,X2,X3),Y,env)
#lower bd upper bd p-value
# X1 -0.71 -0.68 3.7e-06 ***
# X2 +0.59 +0.61 0.0092 **
# X3 -0.00 +0.00 0.2972
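The R package above implements a proper invariance test. As a language-neutral sketch, the following Python code replaces it with a crude comparison of residual means and variances across environments; the thresholds `zcrit` and `vcrit` are ad hoc choices of ours, not the test of Peters et al. [2016], and the function returns the intersection (7.8) over all accepted sets.

```python
import numpy as np
from itertools import chain, combinations

def icp_sketch(X, y, env, zcrit=3.0, vcrit=2.0):
    """Accept a set S if the residuals of the pooled regression on S have
    matching means (z-statistic below zcrit) and variances (ratio below vcrit)
    in every environment; return the intersection of all accepted sets."""
    n, d = X.shape
    accepted = []
    for S in chain.from_iterable(combinations(range(d), k) for k in range(d + 1)):
        Z = np.column_stack([X[:, list(S)], np.ones(n)]) if S else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        ok = True
        for e in np.unique(env):
            ri, ro = r[env == e], r[env != e]
            z = abs(ri.mean() - ro.mean()) / np.sqrt(ri.var() / len(ri) + ro.var() / len(ro))
            v = max(ri.var(), ro.var()) / min(ri.var(), ro.var())
            ok = ok and z < zcrit and v < vcrit
        if ok:
            accepted.append(set(S))
    return set.intersection(*accepted) if accepted else set()
```

On data generated as in the snippet above, the returned intersection is a subset of the true parents {X₁, X₂}, illustrating the coverage property (7.8).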
Problems
Problem 7.12 (Gaussian SCMs) Prove that for linear Gaussian SCMs, two graphs G₁ and G₂ are distribution equivalent if and only if they are Markov equivalent. Here, we allow for zero coefficients.
Problem 7.13 (Gaussian SCMs) Consider a distribution P_X of X = (X₁, ..., X_d) with density p induced from a linear Gaussian SCM C. Prove that for any DAG G such that P_X is Markovian with respect to G, there is a corresponding linear Gaussian SCM C_G entailing P_X.
Problem 7.14 (ANMs) Prove that ANMs over X = (X₁, ..., X_d) with differentiable functions f_j and noise variables that have a strictly positive density entail a distribution over X that has a strictly positive density, too (see Definition 7.3).
Problem 7.15 (Invariant causal prediction) Prove Equation (7.5).