Cause-Effect Models
The present chapter formalizes some basic concepts of causality for the case where the causal models contain only two variables. Assuming that these two variables are non-trivially related and that their dependence is not solely due to a common cause, this constitutes a cause-effect model. We briefly introduce SCMs, interventions, and counterfactuals. All of these concepts are defined again in the context of multivariate causal models (Chapter 6), and we hope that encountering them for two variables first makes the ideas more easily accessible.
3.1 Structural Causal Models
SCMs constitute an important tool to relate causal and probabilistic statements.
Definition 3.1 (Structural causal models) An SCM C with graph C → E consists of two assignments
$$C := N_C, \tag{3.1}$$ $$E := f_E(C, N_E), \tag{3.2}$$
where $N_E \perp\!\!\!\perp N_C$, that is, $N_E$ is independent of $N_C$.
In this model, we call the random variables C the cause and E the effect. Furthermore, we call C a direct cause of E, and we refer to C → E as a causal graph. This terminology will hopefully coincide with the reader's intuition once we talk about interventions, as in Example 3.2.
If we are given both the function $f_E$ and the noise distributions $P_{N_C}$ and $P_{N_E}$, we can sample data from such a model in the following way: we sample noise values $N_C$ and $N_E$ and then evaluate (3.1) followed by (3.2). The SCM thus entails a joint distribution $P_{C,E}$ over C and E (for a formal proof, see Proposition 6.3).
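As an illustration, the following minimal R sketch implements this two-step sampling scheme; the function name sample_scm and its arguments are illustrative choices, not part of the definition.
sample_scm <- function(n, f_E, rN_C = rnorm, rN_E = rnorm) {
  C <- rN_C(n)            # evaluate (3.1): C := N_C
  E <- f_E(C, rN_E(n))    # evaluate (3.2): E := f_E(C, N_E)
  data.frame(C = C, E = E)
}
# e.g., a linear f_E as in Example 3.2 below
head(sample_scm(300, function(c, n) 4*c + n))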
3.2 Interventions
As discussed in Section 1.4.2, we are often interested in the system's behavior under an intervention. The intervened system induces another distribution, which usually differs from the observational distribution. If any type of intervention can lead to an arbitrary change of the system, these two distributions become unrelated, and instead of studying the two systems jointly, we may consider them as two separate systems. This motivates the idea that after an intervention, only parts of the data-generating process change. For example, we may be interested in a situation in which variable E is set to the value 4 (irrespective of the value of C) without changing the mechanism (3.1) that generates C. That is, we replace the assignment (3.2) by E := 4. This is called a (hard) intervention and is denoted by do(E := 4). The modified SCM, where (3.2) is replaced, entails a distribution over C that we denote by $P(C; do(E := 4))$ or $P(C|C; do(E := 4))$, where the latter makes explicit that the SCM C was our starting point. The corresponding density is denoted by $c \mapsto p(c; do(E := 4))$ or, in slight abuse of notation, $p(c; do(E := 4))$. However, manipulations can be much more general. For example, the intervention do(E := $g_E(C) + \tilde{N}_E$) keeps a functional dependence on C but changes the noise distribution; this is an example of a soft intervention. In general, we can replace either of the two assignments.
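As a small preview (using the linear assignment of Example 3.2 below; the choices $g_E(c) = c/2$ and $\text{var}[\tilde{N}_E] = 4$ in the soft intervention are arbitrary), both kinds of intervention amount to swapping out the assignment for E before sampling, while the assignment for C stays untouched:
set.seed(1)
C <- rnorm(300)                 # C := N_C; assignment (3.1) is never replaced
E_obs  <- 4*C + rnorm(300)      # observational assignment for E
E_hard <- rep(4, 300)           # hard intervention do(E := 4)
E_soft <- 0.5*C + 2*rnorm(300)  # soft intervention do(E := g_E(C) + N~_E)
c(mean(C), var(C))              # the distribution of C is the same in all three cases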
The following example motivates the names 'cause' and 'effect':
Example 3.2 (Cause-effect interventions) Suppose that the distribution $P_{C,E}$ is entailed by an SCM C
$$C := N_C, \quad E := 4 \cdot C + N_E, \tag{3.3}$$
with $N_C, N_E$ iid $\sim \mathcal{N}(0,1)$ and graph C → E. Then,
- $P_E = \mathcal{N}(0,17)$
- $P(E|C; do(C := 2)) = P(E|C = 2) = \mathcal{N}(8,1)$
- $P(E|C; do(C := 3)) = P(E|C = 3) = \mathcal{N}(12,1)$
Intervening on C changes the distribution of E. On the other hand,
$$P(C|C; do(E := 2)) = \mathcal{N}(0,1) = P_C = P(C|C; do(E := 314159265)) \neq P(C|E = 2). \tag{3.4}$$
No matter how strongly we intervene on E, the distribution of C remains what it was before. This model behavior corresponds well to our intuition of C 'causing' E: for example, no matter how much we whiten someone's teeth, this will not have any effect on this person's smoking habits. (Importantly, the conditional distribution of C given E = 2 is different from the distribution of C after intervening and setting E to 2.)
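To make this parenthetical explicit, standard Gaussian conditioning in (3.3), with $\text{cov}[C,E] = 4$ and $\text{var}[E] = 17$, gives
$$P(C|E = 2) = \mathcal{N}\left(\tfrac{8}{17}, \tfrac{1}{17}\right),$$
which indeed differs from $P(C|C; do(E := 2)) = P_C = \mathcal{N}(0,1)$.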
The asymmetry between cause and effect can also be formulated as an independence statement. When we replace the assignment for E in (3.3) with E := $\tilde{N}_E$ (think of randomizing E), we break the dependence between C and E. In
$$P(C,E|C; do(E := \tilde{N_E})),$$
we find $C \perp\!\!\!\perp E$. This independence does not hold when randomizing C. As long as $\text{var}[\tilde{N}_C] \neq 0$, we find $C \not\perp\!\!\!\perp E$ in
$$P(C,E|C; do(C := \tilde{N_C}));$$
the correlation between C and E remains non-zero.
Code Snippet 3.3 The code samples from the SCM described in Example 3.2.
set.seed(1)
# generates a sample from the distribution entailed by the SCM
C <- rnorm(300)
E <- 4*C + rnorm(300)
c(mean(E), var(E))
# [1] 0.1236532 16.1386767
# generates a sample from the intervention distribution do(C:=2);
# this changes the distribution of E
C <- rep(2,300)
E <- 4*C + rnorm(300)
c(mean(E), var(E))
# [1] 7.936917 1.187035
# generates a sample from the intervention distribution do(E:=N~);
# this breaks the dependence between C and E
C <- rnorm(300)
E <- rnorm(300)
cor.test(C,E)$p.value
# [1] 0.2114492
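Conversely, and continuing the snippet (the scaling of $\tilde{N}_C$ below is an arbitrary choice), randomizing C does not break the dependence:
# generates a sample from the intervention distribution do(C:=N~_C);
# the dependence between C and E remains
C <- 2*rnorm(300)
E <- 4*C + rnorm(300)
cor.test(C,E)$p.value
# (essentially zero: independence is clearly rejected)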
3.3 Counterfactuals
Another possible modification of an SCM changes all of its noise distributions. Such a change can be induced by observations and allows us to answer counterfactual questions. To illustrate this, imagine the following hypothetical scenario:
Example 3.4 (Eye disease) There exists a rather effective treatment for an eye disease. For 99% of all patients, the treatment works and the patient gets cured (B = 0); if untreated, these patients turn blind within a day (B = 1). For the remaining 1%, the treatment has the opposite effect and they turn blind (B = 1) within a day. If untreated, they regain normal vision (B = 0).
Which category a patient belongs to is controlled by a rare condition ($N_B = 1$) that is unknown to the doctor; the doctor's decision whether to administer the treatment (T = 1) is thus independent of $N_B$. We model this decision as a noise variable $N_T$.
Assume the underlying SCM
$$T := N_T, \quad B := T \cdot N_B + (1-T) \cdot (1-N_B) \tag{3.5}$$
with Bernoulli distributed $N_B \sim \text{Ber}(0.01)$; note that the corresponding causal graph is T → B.
Now imagine a specific patient with poor eyesight who comes to the hospital and goes blind (B = 1) after the doctor administers the treatment (T = 1). We can now ask the counterfactual question 'What would have happened had the doctor administered treatment T = 0?' Perhaps surprisingly, this question can be answered. Together with (3.5), the observation B = T = 1 implies that for the given patient we had $N_B = 1$. This, in turn, lets us calculate the effect of do(T := 0).
To this end, we first condition on our observation to update the distribution over the noise variables. As we have seen, conditioned on B = T = 1, the distributions of $N_B$ and $N_T$ both collapse to a point mass on 1, that is, to $\delta_1$. This leads to a modified SCM:
$$T := 1, \quad B := 1 \cdot 1 + (1-1) \cdot (1-1) = 1.$$
We now intervene on this modified SCM by setting T := 0. This gives us
$$T := 0, \quad B := 0 \cdot 1 + (1-0) \cdot (1-1) = 0.$$
So the counterfactual outcome is B = 0. In other words, the patient would not have gone blind had he not received the treatment.
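Written as code, the computation consists of three steps, often called abduction, action, and prediction (this naming is standard usage, not introduced above); a minimal R sketch:
# step 1 (abduction): B = T = 1 together with (3.5) implies N_T = N_B = 1
N_T <- 1
N_B <- 1
# step 2 (action): replace the assignment for T by the intervention do(T := 0)
T <- 0
# step 3 (prediction): evaluate the assignment (3.5) for B
B <- T*N_B + (1-T)*(1-N_B)
B
# [1] 0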
The preceding example illustrates several features:
- Counterfactual reasoning requires individual-level information: We needed to know the specific values of the noise variables for this particular patient.
- Observations help infer noise values: The observed outcome (B = 1, T = 1) allowed us to determine that $N_B = 1$ for this specific case.
- Counterfactuals differ from conditionals: The counterfactual question 'What if T had been 0?' gives a different answer than the conditional P(B = 1|T = 0): the counterfactual probability of blindness is 0, while the conditional probability is 0.99.
This framework allows us to answer "what if" questions about individual cases, which is crucial for personalized decision making and understanding individual causal effects.
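As a complementary simulation of the last point (the treatment probability $P(N_T = 1) = 0.5$ below is an assumption; the example does not specify the distribution of $N_T$), the conditional P(B = 1|T = 0) can be approximated from (3.5):
set.seed(1)
N_T <- rbinom(10000, 1, 0.5)   # assumed treatment probability of 0.5
N_B <- rbinom(10000, 1, 0.01)  # the rare condition, N_B ~ Ber(0.01)
T <- N_T
B <- T*N_B + (1-T)*(1-N_B)
mean(B[T == 0])
# close to 0.99, in contrast to the counterfactual outcome B = 0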