Connections to Machine Learning, II
As argued in Chapter 5, the causal structure that underlies a statistical model can have strong implications for machine learning tasks such as semi-supervised learning or domain adaptation. We now revisit this general topic, focusing on the multivariate case. We begin with a method that uses machine learning to model systematic errors for a given causal structure, followed by some thoughts on reinforcement learning (with an application in computational advertising), and finally we comment on the topic of domain adaptation.
Half-Sibling Regression
This method exploits a given causal structure (see Figure 8.1) to reduce systematic noise in a prediction task. The goal is to reconstruct the unobserved signal Q. Schölkopf et al. [2015] suggest that we can denoise the signal Y by removing all information that can be explained by other measurements X that have been corrupted with the same source of noise. Here, X are measurements of some signals R that are independent of Q. Intuitively, everything in Y that can be explained by X must be due to the systematic noise N and should therefore be removed. More precisely, we consider
$$\hat{Q} := Y - E[Y|X]$$
as an estimate for Q. Here, E[Y|X] is the regression of Y on its half-siblings X (note that X and Y share the parent N; see Figure 8.1).
One can show that for any random variables Q, X, Y that satisfy Q ⊥⊥ X, we have
[Schölkopf et al., 2016, Proposition 1]:
$$E[(Q - E[Q] - \hat{Q})^2] ≤ E[(Q - E[Q] - (Y - E[Y]))^2],$$
that is, the method is never worse than taking the measurement Y. If, moreover, the systematic noise acts in an additive manner, that is, Y = Q + f(N) for some (unknown) function f, we have [Schölkopf et al., 2016, Proposition 3]:
$$E[(Q - E[Q] - \hat{Q})^2] = E[\text{var}[f(N)|X]]. \quad (8.1)$$
If the additive noise is a function of X, that is, f(N) = γ(X) for some (unknown) function γ, then the right-hand side of (8.1) vanishes and hence $\hat{Q}$ recovers Q up to an additive shift; see Schölkopf et al. [2016] for other sufficient conditions.
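The construction is easy to write down in code. The following Python sketch is our own illustration on simulated data, with a ridge regression standing in for the conditional expectation E[Y|X]; the variable names, the noise model, and the choice of regressor are assumptions made purely for the example.

```python
# A minimal sketch of half-sibling regression (our own toy simulation; a ridge
# regression stands in for E[Y|X]).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 2000

# Ground truth (unobserved in practice): signal Q and systematic noise N.
Q = np.sin(np.linspace(0, 20, n))        # signal of interest
N = rng.normal(size=(n, 1))              # shared systematic noise
R = rng.normal(size=(n, 5))              # signals independent of Q

X = R + 0.8 * N                          # half-siblings: corrupted by the same N
Y = Q + 1.5 * N[:, 0]                    # measurement: Q plus additive f(N)

# Q_hat := Y - E[Y | X], with the conditional expectation replaced by a
# regression estimate of Y from X.
reg = Ridge(alpha=1.0).fit(X, Y)
Q_hat = Y - reg.predict(X)

# Up to an additive shift, Q_hat should track Q much better than Y does.
print("MSE of Y     vs Q:", np.mean((Y - Y.mean() - (Q - Q.mean())) ** 2))
print("MSE of Q_hat vs Q:", np.mean((Q_hat - Q_hat.mean() - (Q - Q.mean())) ** 2))
```

In practice, the quality of the reconstruction depends on how well the chosen regression approximates E[Y|X]; any regression method can be substituted for the ridge regression used here.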
As an example, consider the search for exoplanets. The Kepler space observatory, launched in 2009, observed a small fraction of the Milky Way during its search for exoplanets, monitoring the brightness of approximately 150,000 stars. Those stars that are surrounded by a planet with a suitable orbit to allow for partial occlusions of the star will exhibit light curves that show a periodic decrease of light intensity; see Figure 8.2. These measurements are corrupted with systematic noise that is due to the telescope and that makes the signal from possible planets hard to detect.
Fortunately, the telescope measures many stars at the same time. These stars can be assumed to be causally and therefore statistically independent since they are light-years apart from each other. Thus, the causal structure depicted in Figure 8.1 fits very well to this problem and we may apply the half-sibling regression. This simple method performs surprisingly well [Schölkopf et al., 2015].
Related approaches have been used in other application fields without reference to causal modeling [Gagnon-Bartsch and Speed, 2012, Jacob et al., 2016]. Considering the causal structure of the problem (Figure 8.1) immediately suggests the proposed methodology and leads to theoretical arguments justifying the approach.
Causal Inference and Episodic Reinforcement Learning
We now describe a class of problems in reinforcement learning from a causal perspective. Roughly speaking, in reinforcement learning, an agent is embedded in a world and chooses among a set of different actions. Depending on the current state of the world, these actions yield some reward and change the state of the world. The goal of the agent is to maximize the expected cumulative reward (see Section 8.2.2 for more details). We first introduce the concept of inverse probability weighting, which has been applied in different contexts throughout machine learning and statistics, and then relate it to episodic reinforcement learning. Drawing this connection is a first small step toward relating causality and reinforcement learning. The causal point of view enables us to exploit conditional independences that follow directly from the causal structure. We briefly mention two applications - blackjack and the placement of advertisements - and show how they benefit from causal knowledge. The causal formulation leads to these improvements of methodology very naturally, but it is certainly possible to formulate these problems and corresponding algorithms without causal language. This section does not prove that reinforcement learning benefits from causality. Instead, we regard it as a step toward establishing a formal link between these two fields that may lead to fruitful research in the future [see also Bareinboim et al., 2015, for example]. More concretely, we believe that causality could play a role when transferring knowledge between different tasks in reinforcement learning (e.g., when progressing to the next level in a computer game or when changing the opponent in table tennis); however, we are not aware of any such result.
Inverse Probability Weighting
Inverse probability weighting is a well-known technique that is used to estimate properties of a distribution from a sample that follows a different distribution. It therefore naturally relates to causal inference. Consider the kidney stone example (Example 6.37). We defined the binary variables size S, treatment T, and recovery R, and after obtaining observational data, we were interested in the expected recovery rate $\tilde{E}[R]$ in a hypothetical study in which everyone received treatment A, that is, under a different distribution. Formally, consider an SCM C entailing the distribution $P^C_X$ over variables $X = (X_1, \ldots, X_d)$. We have argued that one often observes a sample from the observational distribution $P^C_X$, but one is interested in some intervention distribution $P^{\tilde{C}}_X$. Here, the new SCM $\tilde{C}$ is constructed from the original C by intervening on a node $X_k$, say,
$$\text{do}(X_k := \tilde{f}(X_{\tilde{\text{PA}}_k}, \tilde{N}_k));$$
see Section 6.3. In particular, we might want to estimate a certain property
$$\tilde{E}[\ell(X)] := E_{P^{\tilde{C}}_X}[\ell(X)]$$
of the new distribution $P^{\tilde{C}}_X$ (in the kidney stone example, this is $\tilde{E}[R]$). If densities exist, we have seen in Section 6.3 that the densities of C and $\tilde{C}$ factorize in a similar way:
$$p(x_1, \ldots, x_d) := p^C(x_1, \ldots, x_d) = \prod_{j=1}^d p^C(x_j | x_{\text{pa}(j)}) \text{ and}$$
$$\tilde{p}(x_1, \ldots, x_d) := p^{\tilde{C}}(x_1, \ldots, x_d) = \prod_{j \neq k} p^C(x_j | x_{\text{pa}(j)}) \tilde{p}(x_k | x_{\tilde{\text{pa}}_k}).$$
The factorizations agree except for the term of the intervened variable. We therefore have
$$\chi := \tilde{E}[\ell(X)] = \int \ell(x) \tilde{p}(x) dx = \int \ell(x) \frac{\tilde{p}(x)}{p(x)} p(x) dx = \int \ell(x) \frac{\tilde{p}(x_k | x_{\tilde{\text{pa}}_k})}{p(x_k | x_{\text{pa}_k})} p(x) dx.$$
(For simplicity, we assume throughout the whole section that the densities are strictly positive.) Given a sample $X_1, \ldots, X_n$ drawn from the distribution $P^C_X$, we can thus construct an estimator
$$\hat{\chi}_n := \frac{1}{n} \sum_{i=1}^n \ell(X_i) \frac{\tilde{p}(X_{i,k} | X_{i,\tilde{\text{pa}}_k})}{p(X_{i,k} | X_{i,\text{pa}_k})} = \frac{1}{n} \sum_{i=1}^n \ell(X_i) w_i \quad (8.2)$$
for $\chi = \tilde{E}[\ell(X)]$ by reweighting the observations; here, the weights $w_i$ are defined as the ratio of the conditional densities. The data points that have a high likelihood under $P^{\tilde{C}}_X$ (they 'could have been drawn' from the new distribution of interest) receive a large weight and contribute more to the estimate $\hat{\chi}_n$ than those with a small weight. This kind of estimator appears in the following three situations, for example; a small numerical sketch of case (ii) is given after the list.
- (i) Suppose that $X = (Y, Z)$ contains only a target variable Y and a causal covariate Z, that is, Z → Y. Let us consider an intervention in Z and the function $\ell(X) = \ell((Z, Y)) = Y$. Then, the estimator (8.2) reduces to
$$\hat{\chi}_n := \frac{1}{n} \sum_{i=1}^n Y_i \frac{\tilde{p}(Z_i)}{p(Z_i)}, \quad (8.3)$$
which is known as the Horvitz-Thompson estimator [Horvitz and Thompson, 1952]. This setting corresponds to the assumption of covariate shift [e.g., Shimodaira, 2000, Quionero-Candela et al., 2009, Ben-David et al., 2010]; see also Sections 5.2 and 8.3. The estimator (8.3) is an example of a weighted likelihood estimator.
- (ii) For X = Z, we may estimate the expectation $\tilde{E}[\ell(Z)]$ under $\tilde{p}$ using data sampled from p. Thus, Equation (8.2) reduces to
$$\hat{\chi}_n := \frac{1}{n} \sum_{i=1}^n \ell(Z_i) \frac{\tilde{p}(Z_i)}{p(Z_i)},$$
a formula that is known as importance sampling [e.g., MacKay, 2002, Chapter 29.2]. The formula can be adapted if p and $\tilde{p}$ are known only up to constants.
- (iii) We will make use of Equation (8.2) in the context of episodic reinforcement learning. We describe this application in a bit more detail next.
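To make case (ii) concrete, here is a minimal Python sketch, a toy example of ours and not taken from the text: we estimate an expectation under a shifted Gaussian $\tilde{p}$ from samples of a standard Gaussian p by reweighting with the density ratio.

```python
# Importance sampling (case (ii)): estimate an expectation under p_tilde using
# draws from p. The densities and the test function are toy choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50_000

# Source distribution p = N(0, 1), target distribution p_tilde = N(1, 1).
Z = rng.normal(loc=0.0, scale=1.0, size=n)

ell = lambda z: z                                   # property of interest: the mean
w = norm.pdf(Z, loc=1.0) / norm.pdf(Z, loc=0.0)     # importance weights p_tilde/p

chi_hat = np.mean(ell(Z) * w)
print("importance-sampling estimate:", chi_hat)     # true value under N(1, 1): 1.0
```

If p and $\tilde{p}$ are known only up to constants, one would instead divide by the sum of the weights (self-normalized importance sampling), as mentioned above.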
Episodic Reinforcement Learning
Reinforcement learning [e.g., Sutton and Barto, 2015] models the behavior of agents taking actions in a world. Depending on the current state $S_t$ of the world and the action $A_t$, the state of the world changes according to, for example, a Markov decision process [Bellman, 1957]; that is, the probability $P(S_{t+1} = s)$ of entering a new state s depends only on the current state $S_t$ and action $A_t$. Furthermore, the agent will receive some reward $R_{t+1}$ that depends on $S_t$, $A_t$, and $S_{t+1}$; the sum over all rewards is sometimes called the return, which we write as $Y := \sum_t R_t$. The way the return Y depends on states and actions is unknown to the agent, who tries to improve his strategy $(a, s) \mapsto p(a|s) := P(A_t = a | S_t = s)$, that is, the conditional distribution of the action he chooses given the observable part of the state of the world. In episodic reinforcement learning, the state is reset after a finite number of actions (see Figure 8.3). In Section 8.2.3, we consider the example of blackjack. In the example of Figure 8.3, the player makes K = 3 decisions, after which the cards are reshuffled. Then, a new episode starts.
Suppose that we play n games under a certain strategy $(a, s) \mapsto p(a|s)$, and each game is an episode. This function p does not depend on the number of 'moves' we have played so far but just on the value of the state. As long as this strategy assigns a positive probability to every action, Equation (8.2) allows us to estimate the performance of a different strategy $(a, s) \mapsto \tilde{p}(a|s)$:
$$\hat{\chi}_{n,\text{ERL}} := \frac{1}{n} \sum_{i=1}^n Y_i \frac{\prod_{j=1}^K \tilde{p}(A_{i,j} | S_{i,j})}{\prod_{j=1}^K p(A_{i,j} | S_{i,j})}. \quad (8.4)$$
This can be seen as a Monte Carlo method for off-policy evaluation [Sutton and Barto, 2015, Chapter 5.5]. In practice, the estimator (8.4) often has large variance; in continuous settings the variance may even be infinite. It has been suggested to reweight [Sutton and Barto, 2015] or to disregard the (five) largest weights [Bottou et al., 2013] to trade off variance for bias. Bottou et al. [2013] additionally compute confidence intervals and gradients in the case of parametrized densities. The latter are important if one wants to search for optimal strategies.
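The estimator (8.4) itself is straightforward to implement. The Python sketch below is a hypothetical illustration; the episode format and the two policies are placeholders, and in practice the behavior probabilities $p(a|s)$ would have to be logged together with the episodes.

```python
# A sketch of the off-policy estimator (8.4). Episode format and policies are
# hypothetical placeholders.
import numpy as np

def off_policy_value(episodes, p_behavior, p_target):
    """Estimate the expected return of p_target from episodes played under p_behavior.

    episodes: list of (states, actions, ret) with states/actions of length K.
    p_behavior, p_target: functions (a, s) -> probability of action a in state s.
    """
    weighted_returns = []
    for states, actions, ret in episodes:
        w = 1.0
        for s, a in zip(states, actions):
            w *= p_target(a, s) / p_behavior(a, s)   # product of per-step ratios
        weighted_returns.append(ret * w)
    return np.mean(weighted_returns)

# Toy usage with two actions; all numbers are made up.
p_b = lambda a, s: 0.5                               # uniform behavior policy
p_t = lambda a, s: 0.8 if a == "hit" else 0.2        # target policy to evaluate
episodes = [(["s0", "s1"], ["hit", "stand"], 1.0),
            (["s0", "s1"], ["stand", "stand"], -1.0)]
print(off_policy_value(episodes, p_b, p_t))
```

The variance-reduction techniques mentioned above (reweighting or discarding the largest weights) would operate on exactly these per-episode weight products before averaging.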
We now briefly discuss two examples, in which exploiting the causal structure leads to an improved statistical performance of the learning procedure. We regard them as interesting examples that shed some light on the relationship between reinforcement learning and causality.
State Simplification in Blackjack
The methodology proposed in Section 8.2.2 can be used to learn how to play blackjack (a card game). We pretend that a player enters a casino and starts playing blackjack knowing neither the objective of the game nor the optimal strategy; instead, he applies a random strategy. At each point in the game, the player is asked which of the legal actions he wants to take, and after the game has finished the dealer reveals how much money the player won or lost. After a while the player may update his strategy toward decisions that proved to be successful and continue playing. From a mathematical point of view, blackjack is solved. The optimal strategy (for infinitely many decks) was discovered by Baldwin et al. [1956] and leads to an expectation of $E[Y] \approx -0.006€$ for a player betting 1€.
How does causality come into play? We have assumed that the player is unaware of the precise rules of blackjack; maybe he knows, however, that the win or loss is determined only by the values of the cards and not their suits; that is, the rules do not distinguish between a queen of clubs and a queen of hearts. The player can then immediately conclude that the optimal strategy does not depend on the suit. This comes with an obvious advantage when searching for the optimal strategy: the size of the relevant state space, and therefore the space of possible strategies, shrinks significantly. Figure 8.4 depicts this argument: the variables $S_t$ contain all information, whereas the variables $F_t$ do not contain suits. For example,
$$S_3 = (\text{Player: } ♥K, ♠5, ♦4; \text{Dealer: } ♦K)$$ $$F_3 = (\text{Player: } K, 5, 4; \text{Dealer: } K)$$
Since the final result Y depends only on $(F_1, \ldots, F_4)$ and not on the 'full state' $(S_1, \ldots, S_4)$, the actions may be chosen to depend on the F variables. Similarly, one may exploit that the order of the cards does not matter either. More formally, we have the following result:
Proposition 8.1 (State simplification) Suppose that we are interested in the return $Y := \sum_j R_j$, and all variables are discrete. Assume that there is a function f such that for all j and for $F_j := f(S_j)$, we have
$$R_j ⊥⊥ S_j | F_j, A_j, \quad (8.5)$$
and the full states do not matter for the change of states in the following sense: for all $s_j$ and for all $s_{j-1}, s'_{j-1}$ with $f(s_{j-1}) = f(s'_{j-1})$
$$p(f(s_j) | s_{j-1}) = p(f(s_j) | s'_{j-1}). \quad (8.6)$$
Then an optimal strategy $(a, s) \mapsto p^{\text{opt}}(a|s)$ can be chosen to depend on $S_j$ only through $F_j$. More precisely, there exists
$$p^{\text{opt}} \in \arg\max_p E[Y],$$
such that
$$p^{\text{opt}}(a_j | s_{j-1}) = p^{\text{opt}}(a_j | s'_{j-1}) \quad \forall s_{j-1}, s'_{j-1} : f(s_{j-1}) = f(s'_{j-1}).$$
This result is particularly helpful if $F_j$ takes fewer values than $S_j$. The proof is provided in Appendix C.11. In the blackjack example, Equation (8.6) states that the probability of drawing another king depends only on the values of the cards drawn before (the number of kings in particular), not their suits.
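For illustration, the following Python sketch implements one possible choice of the map f from Proposition 8.1 for blackjack-like states: it discards suits (and card order) and shows how much this shrinks the state space. The state encoding is our own and purely illustrative.

```python
# A sketch of a state simplification f(S) = F for blackjack-like states:
# drop suits and card order. The state encoding is illustrative only.
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["♠", "♥", "♦", "♣"]

def simplify(state):
    """Map a full state (tuples of (rank, suit) cards) to value-only information."""
    player_cards, dealer_cards = state
    f_player = tuple(sorted(rank for rank, _ in player_cards))
    f_dealer = tuple(sorted(rank for rank, _ in dealer_cards))
    return f_player, f_dealer

S3 = ((("K", "♥"), ("5", "♠"), ("4", "♦")), (("K", "♦"),))
print(simplify(S3))   # (('4', '5', 'K'), ('K',)) -- suits and order discarded

# Reduction for two-card player hands alone (ignoring that a physical deck
# cannot contain the same card twice):
full = len(set(product([(r, s) for r in RANKS for s in SUITS], repeat=2)))
reduced = len({tuple(sorted(p)) for p in product(RANKS, repeat=2)})
print(full, "->", reduced)   # 2704 -> 91 distinct two-card configurations
```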
Improved Weighting in Advertisement Placement
A related argument is used by Bottou et al. [2013] for the optimal placement of advertisements. Consider the following simplified description of the system. A company, which we will refer to as the publisher, runs a search engine and may want to display advertisements in the space above the search results, the mainline. Only if a user clicks on an ad does the publisher receive money from the corresponding company. Before displaying the ads, the publisher sets the mainline reserve A, a real-valued parameter that determines how many ads are shown in the mainline. In most systems, the number of mainline ads F varies between 0 and 4, that is, $F \in \{0, 1, 2, 3, 4\}$. The mainline reserve A usually depends on many variables (e.g., search query, date and time of the query, location) that we call the state S. If the search query indicates that the user intends to buy new shoes, for example, one may want to show more ads compared to when a user is looking for the time of the next service at church. We can model the system as episodic reinforcement learning with episodes of length 1. The return Y equals the number of clicks per episode; its value is either 0 or 1. The question of how to choose an optimal mainline reserve A then corresponds to finding the optimal strategy $(a, s) \mapsto p^{\text{opt}}(a|s)$. Figure 8.5 shows a picture of the simplified problem. The state S contains information about the user that is available to the publisher. The hidden variable H contains unknown user information (e.g., his intention), the action A is the mainline reserve, and Y indicates whether or not a person clicks on one of the ads. Finally, F is the discrete variable indicating how many ads are shown. Evaluating new strategies $(a, s) \mapsto \tilde{p}(a|s)$ corresponds to applying Equation (8.4):
$$\hat{\chi}_{n,\text{ERL}} := \frac{1}{n} \sum_{i=1}^n Y_i \frac{\tilde{p}(A_i | S_i)}{p(A_i | S_i)}.$$
(Since each episode has length K = 1, we write $A_i$ and $S_i$ rather than $A_{i,1}$ and $S_{i,1}$ for notational convenience.) We can now benefit from the following key insight. Whether a person clicks on an ad depends on the mainline reserve A but only via the value of F. The user never sees the real-valued parameter A. This is a somewhat trivial observation when we think about the causal structure of the system (see Figure 8.5). Exploiting this fact, however, we can use a different estimator
$$\frac{1}{n} \sum_{i=1}^n Y_i \frac{\tilde{p}(F_i | S_i)}{p(F_i | S_i)};$$
see Proposition 8.2. And since F is a discrete variable taking values between 0 and 4, say, this usually leads to weights that are much better behaved. In practice, the modification may reduce the size of confidence intervals considerably [Bottou et al., 2013, Section 5.1]. As in Section 8.1, we can exploit our knowledge of the causal structure to improve statistical performance. More formally, the procedure is justified by the following proposition:
Proposition 8.2 (Improved weighting) Suppose there is a density p over $X = (A, F, H, S, Y)$ that is entailed by an SCM C with the graph shown in Figure 8.5. Assume further that the density $\tilde{p}$ is entailed by an SCM $\tilde{C}$ that corresponds to an intervention in A of the form $\text{do}(A := \tilde{f}(S, \tilde{N}_A))$ and satisfies $\tilde{p}(f|s) = 0$ if $p(f|s) = 0$ and $\tilde{p}(a|s) = 0$ if $p(a|s) = 0$. We then have
$$\tilde{E}[Y] = \int y \frac{\tilde{p}(a|s)}{p(a|s)} p(x) dx = \int y \frac{\tilde{p}(f|s)}{p(f|s)} p(x) dx.$$
The proof can be found in Appendix C.12. In general, the condition of the nonvanishing densities is indeed necessary: if there is a set of a and s values (with non-vanishing Lebesgue measure) that belong to the support of $\tilde{p}$ and contribute to the expectation of Y, there must be a non-vanishing probability under p to sample data in this area.
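A small simulation makes the benefit of Proposition 8.2 tangible. The Python sketch below uses a toy model of Figure 8.5 that we made up for illustration (a Gaussian mainline reserve, three ad slots, and a simple click probability); it computes both estimators from the same logged data and compares the variances of the two kinds of weights.

```python
# A simulation sketch of Proposition 8.2 with a toy model of Figure 8.5:
# S is a binary context, A a Gaussian mainline reserve, F the number of ads
# obtained by binning A, and Y a click indicator. All distributions are made up.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 100_000
edges = np.array([-np.inf, 0.0, 1.0, np.inf])   # binning A into F in {0, 1, 2}

mu_p = np.array([0.0, 0.5])    # behavior policy: A | S=s ~ N(mu_p[s], 1)
mu_q = np.array([1.0, 1.5])    # target policy:   A | S=s ~ N(mu_q[s], 1)

S = rng.integers(0, 2, size=n)
A = rng.normal(mu_p[S], 1.0)
F = np.digitize(A, edges[1:-1])
Y = rng.binomial(1, 0.1 + 0.15 * F + 0.05 * S)   # P(Y=1 | F, S), made up

def p_f_given_s(f, s, mu):
    """Bin probabilities p(f | s), obtained by integrating the Gaussian over the bin."""
    return norm.cdf(edges[f + 1] - mu[s]) - norm.cdf(edges[f] - mu[s])

w_a = norm.pdf(A, loc=mu_q[S]) / norm.pdf(A, loc=mu_p[S])   # weights based on p(a|s)
w_f = p_f_given_s(F, S, mu_q) / p_f_given_s(F, S, mu_p)     # weights based on p(f|s)

print("estimate via p(a|s):", np.mean(Y * w_a), " weight variance:", np.var(w_a))
print("estimate via p(f|s):", np.mean(Y * w_f), " weight variance:", np.var(w_f))
```

Both estimates target the same interventional quantity $\tilde{E}[Y]$, but the F-based weights take only a few values per state and are less heavy-tailed; in this toy example the gain is modest, whereas in real systems, where $p(a|s)$ can become very small, the reduction in variance is far more pronounced [Bottou et al., 2013].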
Domain Adaptation
Domain adaptation is another machine learning problem that is naturally related to causality [Schölkopf et al., 2012]. Here, we will relate domain adaptation to what we called invariant prediction in 'Different Environments' in Section 7.2.5. We do not claim that this connection, in its current form, yields major improvements, but we believe that it could prove to be useful for developing a novel methodology in domain adaptation.
Let us assume that we obtain data from a target variable $Y^e$ and d possible predictors $X^e = (X^e_1, \ldots, X^e_d)$ in different domains $e \in E = \{1, \ldots, D\}$ and that we are interested in predicting Y. Adapting to widely used notation, we use the terms 'domain' or 'task.' Table 8.1 describes a taxonomy of three problems in domain adaptation that we consider here.
| Method | Training data | Test domain |
|---|---|---|
| Domain generalization | $(X^1, Y^1), \ldots, (X^D, Y^D)$ | $T := D + 1$ |
| Multi-task learning | $(X^1, Y^1), \ldots, (X^D, Y^D)$ | $T \in \{1, \ldots, D\}$ |
| Asymmetric multi-task learning | $(X^1, Y^1), \ldots, (X^D, Y^D)$ | $T := D$ |
Our main assumption is that there exists a set $S^* \subseteq \{1, \ldots, d\}$ such that the conditional $Y^e | X^e_{S^*}$ is the same for all domains $e \in E$, including the test domain, that is, for all $e, f \in E$ and for all $x_{S^*}$
$$Y^e | X^e_{S^*} = x_{S^*} \text{ and } Y^f | X^f_{S^*} = x_{S^*} \text{ have the same distribution}. \quad (8.7)$$
In Sections 7.1.6 and 7.2.5 we have considered a similar setup, where we used the term 'environments' rather than 'domains' and called the property (8.7) 'invariant prediction.' We have argued that if there is an underlying SCM and if the environments correspond to interventions on nodes other than the target Y, property (8.7) is satisfied for $S^* = \text{PA}_Y$ (cf. also our discussion of Simon's invariance criterion in Section 2.2). Property (8.7) may also hold, however, for sets other than the causal parents. Since our goal is prediction, we are most interested in sets $S^*$ that satisfy (8.7) and additionally predict Y as accurately as possible. Let us for now assume that we are given such a set $S^*$ (we will return to this issue later) and point out how the assumption (8.7) relates to domain adaptation.
In settings of covariate shift [e.g., Shimodaira, 2000, Quionero-Candela et al., 2009, Ben-David et al., 2010], one usually assumes that the conditional $Y^e | X^e = x$ remains invariant over all tasks e. Assumption (8.7) means that covariate shift holds for some subset $S^*$ of the variables and thus constitutes a generalization of the covariate shift assumption.
For domain generalization, and if the set $S^*$ is known, we can then apply traditional methods for covariate shift for this subset $S^*$. For example, if the supports of the data in input space are overlapping (or the system is linear), we may use the estimator $f_{S^*}(X^T_{S^*})$ with $f_{S^*}(x) := E[Y^1 | X^1_{S^*} = x]$ in test domain T. One can prove that this approach is optimal in an adversarial setting, where the distributions in the test domain may be arbitrarily different from the training domains, except for the conditional distribution (8.7) that we require to remain invariant [Rojas-Carulla et al., 2016, Theorem 1]. In multi-task learning, it is less obvious how to exploit the knowledge of such a set $S^*$. In practice, one needs to combine information gained from pooling the tasks and regressing Y on $S^*$ with knowledge obtained from considering the test task separately [Rojas-Carulla et al., 2016].
If the set $S^*$ is unknown, we again propose to search for sets S that satisfy (8.7) over available domains. When learning the causal predictors, one prefers to stay conservative, and the method of invariant causal prediction [Peters et al., 2016] therefore outputs the intersection of all sets S satisfying (8.7); see Equation (7.5). Here, we are interested in prediction instead. Among all sets that lead to invariant prediction, one may therefore choose the set S that leads to the best predictive performance, which is usually one of the larger of those sets. The same applies if there are different known sets S that all satisfy (8.7). If the data are generated by an SCM and the domains correspond to different interventions, the set S with the best predictive power that satisfies (8.7) can, in the limit of infinite data, be shown to be a subset of the Markov blanket of Y (see Problem 8.5).
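The following Python sketch outlines one simple (and by no means unique) way to implement the search just described: among all subsets whose regression residuals look invariant across the training domains, according to crude mean and variance tests, it returns the most predictive one. The linear model, the tests, and the function name are our own choices for illustration.

```python
# A minimal sketch (our own simplification) of the subset search described
# above: among all subsets S whose linear-regression residuals look invariant
# across training domains, return the most predictive one.
from itertools import chain, combinations

import numpy as np
from scipy.stats import f_oneway, levene
from sklearn.linear_model import LinearRegression

def invariant_predictive_subset(X, y, domains, alpha=0.05):
    """X: (n, d) predictors, y: (n,) target, domains: (n,) domain labels."""
    d = X.shape[1]
    labels = np.unique(domains)
    best_subset, best_mse = None, np.inf
    for S in chain.from_iterable(combinations(range(d), k) for k in range(1, d + 1)):
        cols = list(S)
        reg = LinearRegression().fit(X[:, cols], y)
        res = y - reg.predict(X[:, cols])
        groups = [res[domains == e] for e in labels]
        # Crude invariance check: residual means and variances agree across domains.
        if f_oneway(*groups).pvalue > alpha and levene(*groups).pvalue > alpha:
            mse = np.mean(res ** 2)
            if mse < best_mse:
                best_subset, best_mse = S, mse
    return best_subset  # None if no subset passes the invariance check
```

In a domain generalization setting, one would then predict in the test domain with the regression of Y on the selected subset, pooled over the training domains.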
Problems
Problem 8.3 (Half-sibling regression) Consider the DAG in Figure 8.1. The fact that X provides additional information about Q on top of the one provided by Y follows from causal faithfulness. Why?
Problem 8.4 (Inverse probability weighting) Consider an SCM C of the form
$$Z := N_Z$$ $$Y := Z^2 + N_Y,$$
with $N_Y, N_Z$ i.i.d. $\sim N(0, 1)$ and an intervened version $\tilde{C}$ with
$$\text{do}(Z := \tilde{N}_Z),$$
where $\tilde{N}_Z \sim N(2, 1)$.
- a) (optional) Compute $E[Y] := E_{P^C}[Y]$ and $\tilde{E}[Y] := E_{P^{\tilde{C}}}[Y]$.
- b) Draw n = 200 i.i.d. data points from the SCM C and implement the estimator (8.3) for estimating $\tilde{E}[Y]$.
- c) Compute the estimate in b) and the empirical variance of the weights appearing in (8.3) for increasing sample size n between n = 5 and n = 50,000. What do you conclude?
Problem 8.5 (Invariant predictors) We want to justify the last sentence in Section 8.3. Consider a DAG over variables Y, E, and $X_1, \ldots, X_d$, in which E (for 'environment') is not a parent of Y and does not have any parents itself. Denote the Markov blanket of Y by M. Prove that for any set $S \subseteq \{X_1, \ldots, X_d\}$ with
$$Y ⊥⊥ E | S$$
there is another set $S_{\text{new}} \subseteq M$ such that
$$Y ⊥⊥ E | S_{\text{new}} \text{ and } Y ⊥⊥ (S \setminus S_{\text{new}}) | S_{\text{new}}.$$