Chapter 6
REMEMBERING REWARDING FUTURES
Many studies targeting the hippocampus have been carried out. As of April 2022, typing the word 'hippocampus' into PubMed (a search engine for biological and medical publications) generates over 170,000 publications. These studies have contributed greatly to our understanding of the hippocampus. We now know a great deal about its anatomy, biochemistry, physiology, development, function, and relevance to neurological and mental diseases. I selectively introduced a few of these studies in the previous chapters that I think are critical for understanding the neural circuit operations underlying the memory and imagination functions of the hippocampus. To summarize, the hippocampus is involved not only in remembering past experiences but also in imagining future events; sharp-wave ripples and hippocampal replays take place in the hippocampus during sleep and resting states; CA3, but not CA1, has massive recurrent projections that enable self-excitation and sequential firing; CA3 generates sharp-wave ripples and CA1 represents value signals. We will examine a synthesis of these findings, a simulation-selection model, in this chapter.
SIMULATION-SELECTION MODEL
The main idea of the model is simple: during rest and sleep, CA3 generates diverse event sequences based on its massive recurrent projections (simulation), and CA1 preferentially reinforces high-value sequences based on its value-dependent neural activity (selection). This way, neural activity sequences representing high-value events and actions are preferentially reinforced, so they are more likely to be chosen under similar circumstances in the future, allowing us to make better choices.
Together with my colleagues, I proposed this model in 2018. We proposed that the hippocampus simulates and reinforces high-value events and actions in preparation for the future rather than merely remembering what happened in the past (the idea of a hippocampal role in future planning has been proposed by numerous scientists; see our paper and the references therein). We also put forward that this function of the hippocampus is implemented in the CA3-CA1 network. This may look inefficient at first glance because not just one but two neural networks are needed to prepare for optimal choices. However, two networks separately performing simulation (CA3) and selection (CA1) enable the generation and evaluation of a wide variety of events and actions, which would be useful for preparing in advance for diverse future circumstances. If we relied on only one network for both simulation and selection, the diversity of simulated events or behaviors would be markedly reduced. We elaborated in our paper why we consider the hippocampus a simulation-selection device and how the process of simulation-selection might be implemented in the CA3-CA1 neural circuits. Below, I summarize the key arguments for the model.
CA3 AS A SIMULATOR
Why do we consider CA3 a simulator? The hippocampus plays a role in imagination (see chapter 1). It also generates place-cell firing sequences that correspond to unexperienced spatial trajectories during rest and sleep (see chapter 3). These findings indicate that the hippocampus generates novel activity sequences. Put differently, it simulates unexperienced event sequences.
Where then in the hippocampus are simulated sequences generated? Most scientists would point to CA3 as the source because CA3 neurons are connected by massive recurrent projections (see fig. 4.2). CA1, in contrast, has only weak, longitudinally directed recurrent projections. Because CA3 neurons are heavily interconnected, activation of some CA3 neurons will likely activate others (self-excitation). Propagation of such sequential activation fits well with the sequential place-cell firing that occurs with a sharp-wave ripple during sleep and resting states (i.e., replays). As mentioned in chapter 4, sharp-wave ripples are initiated in CA3 and propagate to CA1. Together, these findings consistently indicate that CA3 is the source of simulated sequences.
How then does CA3 generate novel activity sequences during sharp-wave ripples? Why doesn't it simply repeat the firing sequences that occurred during past active states? Scientists believe that CA3 stores memories of experienced events, such as navigation trajectories, by changing connection strengths among CA3 neurons (see fig. 4.3). However, the following factors would act against repeating, during resting or sleep states, exactly the same activity sequences as during active navigation. First, synaptic communication between individual brain cells is unreliable because of the probabilistic release of neurotransmitters. A message from one neuron is transmitted to another neuron only probabilistically. Second, the brain state is likely to differ drastically between active navigation and passive resting states. In rats, theta-frequency rhythmic oscillations are dominant during active navigation, but slow oscillations and sharp waves are dominant during passive states (see fig. 3.1). Third, inhibitory neuronal activity is lower during passive than during active states. Thus, the CA3 neural network appears to be more loosely controlled during passive states. Fourth, incoming sensory inputs may differ drastically between active navigation and resting or sleep states. Finally, CA3 is a network interconnected with many individually weak synapses rather than a few strong ones. An unusual feature of the CA3 network is the sheer number of recurrent projections. As we examined in chapter 4, each excitatory CA3 neuron receives synaptic inputs from about twelve thousand other excitatory CA3 neurons, which comprise about 75 percent of all synaptic inputs it receives (see fig. 4.2). However, physiological studies have shown that recurrent projection synapses are individually weak, which would be disadvantageous for the generation of high-fidelity activity sequences.
To summarize, because CA3 is a network interconnected with many weak synapses, because its network state and incoming sensory inputs differ greatly between active and passive states, and because inhibitory regulation is weak during passive states, it would be difficult for CA3 to repeat, during inactive states, the same firing sequences as during active navigation. Consequently, CA3-generated replays will consist of not only experienced but also unexperienced sequences. In this respect, randomness may be a critical functional element of the CA3 network. It would allow the network to generate a wide variety of unexperienced (novel) sequences and to function as a simulator rather than a high-fidelity memory device.
CA1 AS A VALUE-DEPENDENT SELECTOR
What will happen to replays that are generated in the CA3? Some researchers have proposed that hippocampal replays will be evaluated in brain structures such as the ventral striatum and orbitofrontal cortex, which are well known to process value-related signals. This proposal is in line with the long-standing view that the hippocampus mainly processes spatial and cognitive information rather than value-related information. However, our results indicate that CA1 is a value specialist. A corollary of strong value-related CA1 neural activity is that CA3-originated neural signals will be processed differently in CA1 according to their associated values. In other words, CA1 will filter CA3-generated replays according to their associated values. For example, CA1 may preferentially pass high-value replays, such as those corresponding to spatial trajectories leading to a rewarding location, while filtering out low-value replays, such as those corresponding to spatial trajectories leading to an unrewarding location. Of course, 'selection' and 'filtering out' here by no means indicate an all-or-none process. CA3 replays will be more likely to pass through CA1 as their associated values increase. CA3 presumably generates a huge number of replays during resting and sleep states. We propose that CA1 processes these CA3-generated replays in proportion to their associated values so that high-value replays are preferentially selected and reinforced.
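For readers who prefer to see the idea in code, here is a minimal toy sketch in Python. Everything in it (the set of locations, the value table, the gating rule) is an arbitrary illustration I am using to convey the logic of simulation-selection, not a claim about how the CA3-CA1 circuit is actually implemented.

```python
import random

# Toy sketch of simulation-selection (illustrative assumptions throughout).
PLACES = ["A", "B", "C", "D", "reward"]                    # hypothetical locations
VALUE = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.2, "reward": 1.0}

def simulate_ca3_sequence(length=4):
    """CA3 as a noisy simulator: stochastic recurrent activity produces
    sequences that need not match any experienced trajectory."""
    return [random.choice(PLACES) for _ in range(length)]

def ca1_gate(sequence):
    """CA1 as a value-dependent filter: a sequence passes with a probability
    that grows with its associated value (graded, not all-or-none)."""
    seq_value = sum(VALUE[p] for p in sequence) / len(sequence)
    return random.random() < seq_value, seq_value

reinforced = []
for _ in range(1000):                                      # many replays during rest/sleep
    seq = simulate_ca3_sequence()
    passed, v = ca1_gate(seq)
    if passed:
        reinforced.append((seq, v))                        # high-value sequences are kept

print(f"{len(reinforced)} of 1000 simulated sequences were reinforced")
```

Running the sketch shows that sequences passing through the gate are, on average, the higher-value ones, which is the essence of the selection step.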
To understand how exactly the process of simulation-selection operates in the CA3-CA1 neural circuit, we need to compare how CA3 and CA1 replays are affected by their associated values. Few studies have explored this issue, but findings so far are consistent with the simulation-selection model. For example, CA1 place cells with their firing fields near a rewarding location are preferentially reactivated during sharp-wave ripples compared to those with their firing fields far from a rewarding location. In contrast, CA3 place cells do not show such reward-dependent activation during sharp-wave ripples. No other studies have compared the reward or value dependence of CA3 and CA1 replays so far. Nevertheless, numerous studies have repeatedly shown that reward facilitates CA1 replays in rats. In humans, the imagination of episodic future events is enhanced by reward, and hippocampal activity patterns for high-reward contexts are preferentially reactivated during post-learning rest. These results are well in line with the proposal that CA3 generates replays independent of their values while CA1 preferentially processes high-value replays. The functional consequence of this operation is clear. The selection of high-value sequences will strengthen neural representations for those sequences, which can guide optimal choices in the future.
DENTATE GYRUS
We have examined key concepts of the simulation-selection model. More of its details, especially those related to the neurobiological implementation of the simulation-selection process, can be found in the paper I published with my colleagues in 2018. We focused on CA3 and CA1, leaving out the dentate gyrus, which is another component of the hippocampal trisynaptic circuit. What does the dentate gyrus do in hippocampal functioning? And how is its function related to the proposed simulation-selection process of the CA3-CA1 network?
Currently, pattern separation is the most popular theory for the role of the dentate gyrus. This idea is related to David Marr's theory that CA3 stores associative memory (fig. 4.3). The main thrust is that the dentate gyrus separates similar input patterns into distinct patterns so that CA3 can store many patterns (memories) with minimal interference. However, it is unclear whether this idea can be applied to memories for sequences rather than static patterns.
There are other reasons to doubt that pattern separation is its major function. I have proposed, together with my long-term colleague, Jong Won Lee, that the primary function of the dentate gyrus is to bind together diverse sensory signals and, by doing so, form 'spatial context.' Appendix 1 also briefly discusses this matter. To put it simply, we think that the trisynaptic circuit of the hippocampus performs what we call binding-simulation-selection. The dentate gyrus allows us to recognize where we are (our spatial context) by binding together diverse sensory signals, and CA3 and CA1 together perform simulation-selection to reinforce high-value sequences in each spatial context.
IMPLICATIONS OF THE MODEL
The simulation-selection model is a theory that awaits empirical verification. Nevertheless, it coherently explains findings that cannot be readily accounted for by conventional theories. For example, the model explains why the hippocampus is involved not only in memory but also in imagination, why memory is prone to falsification, why the hippocampus represents value, why the hippocampus needs CA1 in addition to CA3, and why place cell characteristics are similar across CA3 and CA1, in terms of a simple scheme of simulation-selection. In addition, the model provides new perspectives on neural processes underlying goal-directed behavior and memory consolidation.
First, the model explains two core processes of goal-directed spatial navigation, namely spatial and value representations, with a single neural mechanism. It is generally assumed that goal-directed spatial navigation is supported by spatial information represented in the hippocampus and value information represented elsewhere in the brain. However, both spatial and value information are represented in the hippocampus in the simulation-selection model; therefore, goal-directed spatial navigation can be explained by a simple process of simulation-selection within the hippocampus. There is no need to assume two separate neural systems dedicated to navigation and value processing.
Second, the model provides a new perspective on memory consolidation. We examined issues and debates on memory consolidation in chapter 1. It is still unclear why and how initially formed memories are consolidated over time to become permanent memories. The simulation-selection model posits that memory consolidation is a process of finding optimal strategies based on past experiences rather than strengthening incidental memories. This view is radically different from conventional theories on memory consolidation.
DYNA
Memory consolidation as a process of actively selecting and reinforcing valuable options for the future is surprisingly akin to a well-known machine learning algorithm. As mentioned in chapter 5, reinforcement learning is a branch of artificial intelligence that aims to find optimal action plans in a dynamic and uncertain environment. An agent selects actions based on value functions and updates value functions based on the consequences of actions. This iterative process allows an agent to keep track of true value functions and make adaptive choices.
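In code, this trial-and-error loop can be written as a simple tabular value update. The sketch below is a generic Q-learning-style fragment, not a specific algorithm discussed in this book; the epsilon-greedy action rule and all parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

# Minimal tabular sketch of trial-and-error value learning.
# Learning rate, discount, and exploration parameters are arbitrary choices.
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)   # estimated value of each (state, action) pair

def choose_action(state, actions):
    """Select an action based on current value estimates (epsilon-greedy)."""
    if random.random() < epsilon:
        return random.choice(actions)                 # occasional exploration
    return max(actions, key=lambda a: Q[(state, a)])

def update_value(state, action, reward, next_state, actions):
    """Update the value estimate from the observed consequence of the action."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

Each real interaction nudges the value table toward the true values, which is exactly why learning by interaction alone can be so slow.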
One drawback of such a trial-and-error approach, however, is inefficiency. It often requires an enormous number of trials to approximate true value functions. This is particularly problematic when a long sequence of actions is needed to reach the final goal. It would be difficult to know whether selecting a particular action (X) in a situation (Y) is of high value or low value if the consequence of choosing that action in that situation is revealed only after a long sequence of actions. This explains why reinforcement learning algorithms in general have trouble mastering the video game Montezuma's Revenge, in which the character Panama Joe must go through many steps before getting to the destination, the Treasure Chamber, in Montezuma's pyramid.
One proposed solution to overcome this difficulty is to perform simulations during the offline state to supplement trial-and-error value learning. Imagine a robot vacuum cleaner trying to master an effective way to clean a room filled with furniture. Finding the best strategy to clean the room may take a long time if the arrangement of the furniture is complex because there would be an immense number of possible trajectories for covering the entire floor. If the robot vacuum cleaner relies solely on its actual cleaning experiences, it may take months, or even years, to figure out the best strategy for the room.
One way to solve this problem is to learn value functions by simulating possible trajectories. This is the core idea of the Dyna algorithm Richard Sutton proposed in 1991. The algorithm learns value functions in two steps: it first learns value functions while interacting with the environment (trial-and-error learning), and it then learns value functions by simulating actions and assessing their outcomes (offline learning). In our example, the robot vacuum cleaner learns value functions for various spatial trajectories first by actual cleaning and then by simulation without actual movement. This can greatly increase the rate of learning because the robot can evaluate an enormous number of trials without actually performing the cleaning process.
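The two-step logic of Dyna can be sketched compactly. The fragment below follows the well-known Dyna-Q variant: each real interaction updates both a value table and a learned model of the environment, and additional updates are then made from simulated experiences sampled from that model. The environment interface (env.step), the action set, and all parameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon, n_planning = 0.1, 0.95, 0.1, 20
Q = defaultdict(float)   # estimated value of each (state, action) pair
model = {}               # learned model: (state, action) -> (reward, next_state)

def greedy(state, actions):
    """Epsilon-greedy action selection from current value estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s2, actions):
    """One value update from an experience, real or simulated."""
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

def dyna_step(env, state, actions):
    # (1) Trial-and-error learning from one real interaction with the environment.
    action = greedy(state, actions)
    reward, next_state = env.step(state, action)   # assumed environment interface
    q_update(state, action, reward, next_state, actions)
    model[(state, action)] = (reward, next_state)  # remember what happened

    # (2) Offline learning: replay simulated experiences sampled from the model.
    for _ in range(n_planning):
        s, a = random.choice(list(model))
        r, s2 = model[(s, a)]
        q_update(s, a, r, s2, actions)
    return next_state
```

In the robot vacuum example, step (1) corresponds to actual cleaning runs, and step (2) corresponds to rehearsing trajectories from the learned model of the room without moving at all.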
The similarity between the simulation-selection model and the Dyna algorithm is remarkable. Both increase the rate of value learning by simulation based on limited experiences. The process of representing accurate value functions in an uncertain environment may take a long time if we rely only on trial-and-error learning. Of course, we can eventually learn accurate value functions if an environment is stable. However, your competitors, such as potential predators, are not nice enough to wait for you until you represent value functions accurately. It's a jungle out there. Moreover, environments often change dynamically. With slow learning, you may never make optimal choices in a dynamic environment because the environment (and hence true value functions) may change before you master them. Let's assume that it takes one full year for the robot to master the best cleaning strategy for a room by trial and error. Let's also assume that the room's residents and furniture arrangement change every three months. If so, the robot will never be able to clean the room most efficiently. This problem can be solved by using simulation-selection to accelerate learning.