Generative Feedback Explains Distinct Brain Activity Codes for Seen and Mental Images
Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, Thomas Naselaris
A Deep Generative Network Exhibits Distinct Codes for Seen and Mental Images
We considered activity patterns in a deep generative network specified by a hierarchy of processing levels with feedforward and feedback connections. The lowest level (level $0$) of the network analogized the retina. Levels above analogized functionally distinct visual areas (e.g., V1, V2, V3, and so on; see nodes in Figure 1A). We modeled vision and mental imagery as distinct input configurations of this network and used activity patterns in the network to derive qualitative predictions about activity patterns in the brain.
During vision, the retina is activated by a visual stimulus $s$. Let $r_\ell$ be an activity pattern at a level $\ell$ of the network, and let $\mu_\ell$ be the expected activity pattern at that level. We modeled vision in the generative network by clamping the activity pattern at the lowest level to the visual stimulus, $r_0 = s$, and write $\mu_\ell^{\mathrm{vis}}$ for the resulting expected activity ("vis" denotes vision). Because, in this case, $s$ is the only source of variation, the resulting expected activity pattern at level $\ell$ is specified by a forward transformation, $\mu_\ell^{\mathrm{vis}} = f_\ell^0(s)$, that maps from level $0$ (superscript) to level $\ell$ (subscript). (Note that we refer to $f_\ell^0$ as a forward transform only because it yields an expected activity pattern $\mu_\ell^{\mathrm{vis}}$ as a function of the stimulus $s$. The brain is of course dynamic, and we interpret $\mu_\ell^{\mathrm{vis}}$ as the expected activity pattern at steady state and $f_\ell^0$ as incorporating the effects of both feedforward and feedback connections between levels in the network. See Method S1.1.4 for an explicit analysis.)
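To make the forward transformation concrete, the following is a minimal numpy sketch of a linear hierarchy in which the expected activity at each level during vision is obtained by composing the between-level maps applied to the clamped stimulus. The level sizes and weight matrices here are arbitrary illustrative assumptions, not the trained network described below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical level sizes, from level 0 ("retina") up to a small abstract top level.
sizes = [64, 32, 16, 8]

# One linear map per pair of adjacent levels; scaling keeps activations O(1).
W = [rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
     for m, n in zip(sizes[:-1], sizes[1:])]

def forward(s, level):
    """Expected visual activity mu_level^vis = f_level^0(s): compose the
    linear maps upward from the clamped lowest level (r_0 = s)."""
    mu = s
    for l in range(level):
        mu = W[l] @ mu
    return mu

s = rng.normal(size=sizes[0])               # stimulus clamped at the lowest level
mu_vis = [forward(s, l) for l in range(len(sizes))]
print([m.shape for m in mu_vis])            # one expected pattern per level
```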
During imagery experiments, the retinal stimulus is uninformative (e.g., a blank screen). We therefore modeled the visual input during imagery by clamping activity at the sensor level to a null pattern, $r_0 = \mathbf{0}$. We assumed that subjects imagine by reinstating in a high-level brain area the expected activity pattern that would have been evoked by seeing the imagined stimulus. We modeled this mechanism by clamping the activity in the network at a level $h$, higher than $\ell$, to its expected activity pattern during vision, $r_h = \mu_h^{\mathrm{vis}}$, and write $\mu_\ell^{\mathrm{img}}$ for the resulting expected activity ("img" denotes imagery). Now, the clamped activity pattern in level $h$ is the only source of variation, so the resulting expected activity pattern at level $\ell$ is specified by a feedback transformation, $\mu_\ell^{\mathrm{img}} = g_\ell^h(\mu_h^{\mathrm{vis}})$, that maps activity at level $h$ (superscript) to activity at level $\ell$ (subscript). This feedback transformation is the operation in our model that most closely resembles "vision in reverse."
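In the same linear setting, the feedback transformation $g_\ell^h$ is also linear. A minimal sketch, assuming purely for illustration that each downward step is the pseudoinverse of the corresponding upward map (a trained generative model would instead use its learned top-down weights):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [64, 32, 16, 8]
W = [rng.normal(scale=1.0 / np.sqrt(m), size=(n, m))
     for m, n in zip(sizes[:-1], sizes[1:])]

def feedback(mu_h, h, level):
    """Expected imagery activity mu_level^img = g_level^h(mu_h): map the
    clamped pattern at level h back down to `level`. Each downward step is
    the pseudoinverse of the upward map (an illustrative choice only)."""
    mu = mu_h
    for l in range(h - 1, level - 1, -1):
        mu = np.linalg.pinv(W[l]) @ mu
    return mu

# Clamp level h = 3 to the expected visual pattern it would have had for s.
s = rng.normal(size=sizes[0])
mu_h_vis = s
for l in range(3):
    mu_h_vis = W[l] @ mu_h_vis

mu1_img = feedback(mu_h_vis, h=3, level=1)
print(mu1_img.shape)                        # expected imagery pattern at level 1
```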
Because of the hierarchical structure of the network, the imagery activity pattern $\mu_\ell^{\mathrm{img}}$ can be re-written as an explicit function of the stimulus $s$:

$$\mu_\ell^{\mathrm{img}} = g_\ell^h\!\left(f_h^0(s)\right) = \left(g_\ell^h \circ f_h^\ell\right)\!\left(\mu_\ell^{\mathrm{vis}}\right) = d_\ell^h\!\left(\mu_\ell^{\mathrm{vis}}\right) \quad \text{(Equation 1)}$$

This expression reveals how the expected imagery activity pattern $\mu_\ell^{\mathrm{img}}$ differs from the expected visual activity pattern $\mu_\ell^{\mathrm{vis}}$. The difference is specified by the distortion $d_\ell^h = g_\ell^h \circ f_h^\ell$, which can be construed as an "echo," because $f_h^\ell$ maps the expected visual activity pattern at level $\ell$ into the abstract representation at the clamped level $h$, and $g_\ell^h$ maps it back to level $\ell$. The encoding of $s$ during imagery at level $\ell$ is thus likely to be limited by the encoding of $s$ at the clamped level $h$.
For practical applications, high-level representations in deep generative networks are often optimized to give near-lossless stimulus reconstruction. However, in high-level visual areas, brain activity can be quite invariant to many aspects of stimulus variation—an inevitable consequence of forming abstract representations that are useful for cognition. We implemented a generative network [Variational learning in nonlinear gaussian belief networks.] as a model of natural scenes, with linear connections between processing levels, in order to obtain exact solutions for the mean activity patterns during imagery and vision. Neural units with complex-cell-like responses were modeled by combining pairs of network units under a sum-of-squares operation (see Method S1.1.7 for details). Although processing in the human visual system undoubtedly involves more complex nonlinearities, this simple model was sufficient to capture the basic tuning properties we analyzed in our subsequent human neuroimaging experiment.
We first trained the generative network to optimize the log-likelihood of natural scenes in a large image database. A unique and essential aspect of our approach is that the training objective also constrained the network to exhibit brain-like responses to visual stimulation. Units at higher levels of the network were encouraged to exhibit lower spatial frequency preference during vision than units at lower levels [Visual field maps, population receptive field sizes, and visual field coverage in the human MT+ complex.]. We emphasize that these constraints were placed on activity generated by the network during vision only. No explicit constraints were placed upon the activity generated by the network during imagery, and imagery activity patterns played no role in training the model.
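Schematically, the training objective can be thought of as a log-likelihood term plus a penalty that pushes each level's preferred spatial frequencies toward targets that decrease up the hierarchy. The sketch below shows only the penalty term; its exact form, the targets, and the use of composed forward weights as effective receptive fields are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def preferred_freq(rf):
    """Peak of the amplitude spectrum of a 1D effective receptive field."""
    amp = np.abs(np.fft.rfft(rf))
    return np.argmax(amp[1:]) + 1          # skip the DC component

def sf_penalty(W_list, targets):
    """Penalize each level's mean preferred frequency for deviating from a
    target that decreases with level (the brain-like constraint, schematically)."""
    penalty, A = 0.0, None
    for W, target in zip(W_list, targets):
        A = W if A is None else W @ A       # rows of A: effective RFs f_level^0
        prefs = [preferred_freq(row) for row in A]
        penalty += (np.mean(prefs) - target) ** 2
    return penalty

rng = np.random.default_rng(0)
sizes = [64, 32, 16]
W_list = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
targets = [8, 4]                            # hypothetical: coarser tuning higher up

# Schematic total loss: negative log-likelihood + lambda * sf_penalty(W_list, targets)
print(sf_penalty(W_list, targets))
```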
Once training was complete, we generated a large corpus of visual and imagery activity patterns in response to a new set of natural scenes. From these activity patterns, we first estimated signal amplitude, noise, and their ratio (SNR) during vision and imagery.
Signal amplitude during imagery was attenuated relative to signal amplitude during vision at levels below the clamped level (Figure 1B, top left), and the amount of attenuation depended upon distance from the clamped level. Attenuation of signal amplitude was a direct and obvious consequence of clamping the lowest level to the null pattern $\mathbf{0}$ during imagery.
Interestingly, noise was reduced during imagery at all levels (Figure 1B, bottom left). Reduction of noise is caused by clamping two levels during imagery (i.e., the lowest level plus one higher level) instead of just one (i.e., the lowest level alone). This additional clamping reduces the number of random variables in the network and therefore reduces noise.
SNR during imagery was attenuated relative to SNR during vision at levels below the clamped level (Figure 1B, right). As with signal amplitude, the amount of attenuation depended upon distance from the clamped level. Attenuation of SNR occurred because the fixed reduction in noise during imagery at all levels was small relative to the attenuation of signal amplitude in lower levels.
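These quantities can be estimated with a standard repeat-based decomposition: signal amplitude as the variance of repeat-averaged responses across stimuli, and noise as the residual variance across repeats. The sketch below applies this generic estimator to simulated activity patterns; it is not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_rep, n_units = 100, 10, 32

# Simulated activity: a stimulus-driven pattern plus trial-to-trial noise.
driven = rng.normal(size=(n_stim, 1, n_units))
resp = driven + 0.5 * rng.normal(size=(n_stim, n_rep, n_units))

mean_resp = resp.mean(axis=1)                 # average over repeats
signal_amp = mean_resp.var(axis=0).mean()     # variance across stimuli
noise = resp.var(axis=1, ddof=1).mean()       # variance across repeats
print(signal_amp, noise, signal_amp / noise)  # signal, noise, and SNR
```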
We then analyzed visual and imagery responses to estimate distinct spatial frequency tuning functions and receptive fields for each unit in the generative network. As anticipated, tuning during imagery was very different from tuning during vision. Spatial frequency preference was reduced, relative to vision, for units below the clamped level (Figures 1C and 1D, top). Receptive field sizes were larger, and receptive field centers were shifted toward the fovea (Figures 1C and 1D, bottom). Thus, tuning to imagined features at lower levels more closely resembled tuning to seen features at the clamped level. Like SNR, the size of these changes increased with distance from the clamped level, so that the effect of the distortion $d_\ell^h$ was strongest in the level furthest below the clamped one. Consequently, by clamping higher in the hierarchy, the effects became stronger and more widespread; by clamping lower, the effects became weaker and were restricted to the lowest processing level (Figures 1B and 1D).
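Tuning functions of this kind can be read out by probing model units with parametric stimuli. The sketch below estimates a unit's spatial frequency tuning by sweeping grating frequency and averaging energy responses over phase; the probe stimuli and the example Gabor-like unit are illustrative, not the paper's procedure.

```python
import numpy as np

n = 64
x = np.arange(n)

# Example unit: a Gabor-like filter (carrier at 6 cycles/image, Gaussian envelope).
w = np.cos(2 * np.pi * 6 * x / n) * np.exp(-0.5 * ((x - 32) / 8) ** 2)

def tuning_curve(unit, freqs, n_phases=8):
    """Mean energy response to gratings at each frequency, averaged over phase."""
    curve = []
    for f in freqs:
        phases = np.linspace(0, 2 * np.pi, n_phases, endpoint=False)
        resp = [(unit @ np.cos(2 * np.pi * f * x / n + p)) ** 2 for p in phases]
        curve.append(np.mean(resp))
    return np.array(curve)

freqs = np.arange(1, 17)
curve = tuning_curve(w, freqs)
print(freqs[np.argmax(curve)])     # preferred spatial frequency, ~6 cycles/image
```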
Imagery-Encoding Models Link Brain Activity to Imagined Stimuli
To test these predictions, we developed a method for inferring how mental images are encoded in human brain activity patterns. In visual neuroscience, the encoding of stimuli in brain activity patterns is revealed by visual-encoding models [Bayesian reconstruction of natural images from human brain activity.]. Formally, the visual-encoding model is specified by a forward transformation from a stimulus to an expected visual activity pattern (e.g., $\mu_\ell^{\mathrm{vis}} = f_\ell^0(s)$). The transformation determines the visual features (e.g., spatial frequency) that are extracted from the stimulus and encoded into the expected activity pattern. It also describes spatial sensitivity to features in the form of a visual receptive field.
Formulating an imagery-encoding model that would reveal tuning and spatial sensitivity to imagined features has been a formidable conceptual challenge because imagery activity is not driven by a measurable stimulus. However, an important implication of Equation 1 is that, if visual activity is reinstated somewhere in the network, then the imagery activity pattern $\mu_\ell^{\mathrm{img}}$ can be expressed as an encoding of the stimulus $s$, even though $s$ is not seen during imagery. Thus, if a set of known stimuli are imagined, it should be possible to use those stimuli to construct imagery-encoding models. Once constructed, these imagery-encoding models could be compared to visual-encoding models to test for differences in the encoding of seen and imagined stimuli. This comparison should reveal the distortion effect of $d_\ell^h$.
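The logic of the comparison can be sketched in a few lines: fit one regression from known-stimulus features to responses measured during viewing, fit another to responses measured while the same stimuli are imagined, and compare the fitted weights. The simulation below builds in an echo-like loss of fine features during imagery; the feature space, the ridge estimator, and the simulated voxel are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_feat = 200, 20
X = rng.normal(size=(n_stim, n_feat))       # known stimuli as feature vectors

def ridge(X, y, lam=1.0):
    """Ridge-regression weights mapping stimulus features to responses."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated voxel: imagery preserves coarse features and drops the rest.
w_true = rng.normal(size=n_feat)
y_vis = X @ w_true + 0.1 * rng.normal(size=n_stim)
w_img_true = np.where(np.arange(n_feat) < 8, w_true, 0.0)   # echo-like loss
y_img = X @ w_img_true + 0.1 * rng.normal(size=n_stim)

w_vis, w_img = ridge(X, y_vis), ridge(X, y_img)
print(np.corrcoef(w_vis, w_img)[0, 1])      # tuning change from vision to imagery
```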
To construct imagery-encoding models, we used 7-Tesla fMRI to measure whole-brain blood-oxygen-level-dependent (BOLD) activity as human subjects viewed, and then in separate scans imagined, 64 pictures they had memorized prior to the experiment (Figure 2A). Pictures depicted a range of natural objects, artifacts, humans, and animals (Figure 2B). On each trial, a small 6-letter cue at the center of the display indicated the picture to be displayed (imagined). The color of the cue indicated the position of the seen (imagined) picture. All pictures were displayed (imagined) at one of 8 possible positions (Figure 2C). The 8 positions were delineated with colored brackets that remained visible at all times throughout both visual and imagery scans. Thus, subjects saw (imagined) 64 pictures at 8 positions each, for a total of 512 distinct seen (imagined) stimuli across the viewing (imagery) scans. Viewing and imagery scans alternated during each experimental session. Subjects were instructed to maintain fixation on the small central cue or dummy cue at all times.
We estimated distinct voxelwise visual- and imagery-encoding models [The feature-weighted receptive field: an interpretable encoding model for complex feature spaces.] from activity measured during viewing and imagery sessions, respectively (Figures 3 and S2). The visual-encoding model for each voxel specified tuning to seen spatial frequency and a receptive field in visual space. The imagery-encoding model for each voxel specified tuning to imagined spatial frequency and a receptive field in imagined space.