Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, Thomas Naselaris
A Deep Generative Network Exhibits Distinct Codes for Seen and Mental Images
We modeled the visual system as a deep generative network specified by a hierarchy of processing levels with feedforward and feedback connections. The lowest level (level $0$) of the network analogized the retina. Levels above it analogized functionally distinct visual areas (e.g., V1, V2, V3, and so on; see nodes in Figure 1A). We modeled vision and mental imagery as distinct input configurations of this network and used activity patterns in the network to derive qualitative predictions about activity patterns in the brain.

Figure 1. A Deep Generative Network Exhibits Distinct Codes for Seen and Mental Images
(A) The visual system as a deep generative network specified by a hierarchy of processing levels (circles; $r_\ell$ denotes an activity pattern) and feedforward and feedback connections (gray arrows). During vision, the expected visual activity pattern at a processing level $\ell$, say $\bar{r}_\ell^{\mathrm{vis}}$, is determined by a transformation $T_\ell^0$ (long blue arrow) of activity (denoted $s$ for stimulus) at the sensor level $0$ (eye). $T_\ell^0$ is equivalent to a composition of transformations (shorter blue arrows) of activity patterns between intervening levels. During imagery, $s = 0$, but at least one processing level $k$ is clamped to its expected visual activity pattern ($\bar{r}_k^{\mathrm{vis}}$ in this example; red box). Expected imagery activity patterns beneath the clamped level (e.g., $\bar{r}_\ell^{\mathrm{img}}$) differ from their visual activity patterns by a transformation, $E_\ell^k = T_\ell^k \circ T_k^\ell$, from the current to the clamped level ($T_k^\ell$, shortest blue arrow) and from the clamped level back ($T_\ell^k$, orange arrow).
(B–D) In silico experiments on a deep generative network illustrate the predicted effects of the distortion $E_\ell^k$ on brain activity patterns during mental imagery.
(B) Signal (top), noise (bottom), and signal to noise (right) during imagery relative to vision at each processing level (x axis). Changes are expressed on a power-of-2 logarithmic scale. The strength of the effect depends upon the processing level that is clamped during imagery (curves illustrate three different clamping levels; see legend).
(C) Top row shows population tuning to spatial frequency (x axis) for vision (blue) and imagery (orange) for units at each level in the network when the top level of the network is clamped (dashed box). Bottom two rows show receptive fields (RFs) for individual units at each level of the network (circle radius is one SD of the Gaussian RF; circle color scales with radius) for vision (middle) and imagery (bottom).
(D) For each level of clamping, units below the clamped level exhibit lower spatial frequency preference (top), larger RFs (bottom), and lower eccentricity (right) during imagery relative to vision.
Let $r_\ell$ be an activity pattern at a level $\ell$ and $\bar{r}_\ell$ be the expected activity pattern at that level. We modeled vision in the generative network by clamping the activity pattern at the lowest level to the visual stimulus, $r_0^{\mathrm{vis}} = s$ ("vis" denotes vision). Because, in this case, $s$ is the only source of variation, the resulting expected activity pattern at level $\ell$ is specified by a forward transformation $\bar{r}_\ell^{\mathrm{vis}} = T_\ell^0(s)$ that maps from level $0$ (superscript) to level $\ell$ (subscript). (Note that we refer to $T_\ell^0$ as a forward transform only because it yields an expected activity pattern $\bar{r}_\ell^{\mathrm{vis}}$ as a function of the stimulus $s$. The brain is of course dynamic, and we interpret $\bar{r}_\ell^{\mathrm{vis}}$ as the expected activity pattern at steady state and $T_\ell^0$ as incorporating the effects of both feedforward and feedback connections between levels in the network. See Method S1.1.4 for explicit analysis.)
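To make the notation concrete, the following is a minimal NumPy sketch of a linear generative hierarchy in which vision is modeled by clamping the sensor level to the stimulus $s$ and the forward transformation $T_\ell^0$ is realized as a product of per-level weight matrices. The class name `LinearGenerativeNet`, the level dimensions, and the random weights are illustrative assumptions, not the network described in Method S1.

```python
import numpy as np

class LinearGenerativeNet:
    """Illustrative linear hierarchy: level 0 is the sensor (retina),
    higher levels analogize visual areas. W[i] maps level i to level i + 1."""

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        # Feedforward weight matrices between adjacent levels (assumed linear).
        self.W = [rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
                  for d_in, d_out in zip(dims[:-1], dims[1:])]

    def T(self, level_to, level_from, r):
        """Forward transform T_{level_to}^{level_from}: compose the per-level
        maps from level_from up to level_to (level_from < level_to)."""
        for i in range(level_from, level_to):
            r = self.W[i] @ r
        return r

    def expected_vision(self, s, level):
        """Expected visual activity at `level` when level 0 is clamped to s:
        r_bar_level^vis = T_level^0(s)."""
        return self.T(level, 0, s)

# Example: a 4-level network ("retina" plus three higher levels).
net = LinearGenerativeNet(dims=[256, 128, 64, 32])
s = np.random.default_rng(1).normal(size=256)   # a stimulus at the sensor level
r2_vis = net.expected_vision(s, level=2)        # expected visual pattern at level 2
```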
During imagery experiments, the retinal stimulus is uninformative (e.g., a blank screen). We therefore modeled the visual input during imagery by clamping activity at the sensor level to a null pattern, $r_0^{\mathrm{img}} = 0$. We assumed that subjects imagine by reinstating in a high-level brain area the expected activity pattern that would have been evoked by seeing the imagined stimulus. We modeled this mechanism by clamping the activity in the network at a level higher than $\ell$, say $k$, to its expected activity pattern during vision, $r_k^{\mathrm{img}} = \bar{r}_k^{\mathrm{vis}}$ ("img" denotes imagery). Now, the clamped activity pattern in level $k$ is the only source of variation, so the resulting expected activity pattern at level $\ell$ is specified by a feedback transformation $\bar{r}_\ell^{\mathrm{img}} = T_\ell^k(\bar{r}_k^{\mathrm{vis}})$ that maps activity at level $k$ (superscript) to activity at level $\ell$ (subscript). This feedback transformation is the operation in our model that most closely resembles "vision in reverse." $\bar{r}_\ell^{\mathrm{img}}$ can be re-written as an explicit function of the stimulus $s$ (see Method S1.1.2 for derivation):

$$\bar{r}_\ell^{\mathrm{img}} \;=\; T_\ell^k\!\left(\bar{r}_k^{\mathrm{vis}}\right) \;=\; T_\ell^k \circ T_k^\ell\!\left(T_\ell^0(s)\right) \;=\; E_\ell^k\!\left(\bar{r}_\ell^{\mathrm{vis}}\right), \qquad E_\ell^k \equiv T_\ell^k \circ T_k^\ell. \qquad \text{(Equation 1)}$$
This expression reveals how the expected imagery activity pattern $\bar{r}_\ell^{\mathrm{img}}$ differs from the expected visual activity pattern $\bar{r}_\ell^{\mathrm{vis}}$. The difference is specified by the distortion $E_\ell^k$, which can be construed as an "echo," because $T_k^\ell$ maps the expected visual activity pattern at level $\ell$ into the abstract representation at the clamped level $k$ and then $T_\ell^k$ maps it back to level $\ell$. The encoding of $s$ during imagery at level $\ell$ is thus likely to be limited by the encoding of $s$ at level $k$ during vision.
Higher levels of the visual hierarchy encode low-level stimulus detail with less precision than lower levels during vision. Thus, a predicted effect of $E_\ell^k$ was that low-level visual areas should encode variation in mental images with less precision than seen images.
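Continuing the sketch above (and reusing `net` and `s` from it), imagery can be simulated by clamping a level $k$ to its expected visual pattern and mapping it back down. The feedback transform below is taken to be the Moore-Penrose pseudoinverse of the composed forward weights; that choice, like the function names, is an illustrative assumption rather than the solution derived in Method S1.1.2, but it reproduces the structure of Equation 1.

```python
import numpy as np

def forward_matrix(net, level_to, level_from):
    """Matrix form of the forward transform T_{level_to}^{level_from}
    (level_from < level_to) in the linear sketch above."""
    F = np.eye(net.W[level_from].shape[1])
    for i in range(level_from, level_to):
        F = net.W[i] @ F
    return F

def feedback_matrix(net, level_to, level_from):
    """Illustrative feedback transform T_{level_to}^{level_from} for level_from > level_to:
    here the pseudoinverse of the composed forward weights (an assumption)."""
    return np.linalg.pinv(forward_matrix(net, level_from, level_to))

def expected_imagery(net, s, level, clamp_level):
    """Expected imagery activity at `level` when `clamp_level` (> level) is clamped
    to its expected visual pattern and the sensor level is clamped to a null pattern."""
    r_clamp_vis = net.expected_vision(s, clamp_level)          # r_bar_k^vis
    return feedback_matrix(net, level, clamp_level) @ r_clamp_vis

# Equation 1 in matrix form: the imagery pattern is the visual pattern passed
# through the "echo" E_level^k = T_level^k o T_k^level.
r2_vis = net.expected_vision(s, level=2)
r2_img = expected_imagery(net, s, level=2, clamp_level=3)
E_2_3 = feedback_matrix(net, 2, 3) @ forward_matrix(net, 3, 2)
assert np.allclose(r2_img, E_2_3 @ r2_vis)
```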
To illustrate these predicted effects, we trained a deep generative network on natural scenes. We implemented a network with linear connections between processing levels in order to obtain exact solutions for the mean activity patterns during imagery and vision. Neural units with complex-like responses were modeled by combining pairs of network units under a sum-of-squares operation (see Method S1.1.7 for details). Although processing in the human visual system undoubtedly involves more complex nonlinearities, this simple model was sufficient to capture the basic tuning properties we analyzed in our subsequent human neuroimaging experiment.
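As an illustration of the sum-of-squares construction (a sketch using our own choice of a quadrature Gabor pair, not the procedure in Method S1.1.7), a phase-invariant, "complex-like" response can be formed from a pair of linear unit responses:

```python
import numpy as np

def complex_like_response(stimulus, filt_even, filt_odd):
    """Combine a pair of linear unit responses under a sum of squares,
    yielding a phase-invariant ("complex-like") response."""
    r_even = filt_even.ravel() @ stimulus.ravel()
    r_odd = filt_odd.ravel() @ stimulus.ravel()
    return r_even ** 2 + r_odd ** 2

def gabor(size, freq, theta, phase, sigma):
    """Gabor filter: oriented sinusoidal carrier under a Gaussian envelope."""
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)
    carrier = np.cos(2 * np.pi * freq * xr + phase)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return carrier * envelope

# Illustrative quadrature pair: even and odd Gabors at one frequency and orientation.
size = 32
f_even = gabor(size, freq=0.1, theta=0.0, phase=0.0, sigma=6.0)
f_odd = gabor(size, freq=0.1, theta=0.0, phase=np.pi / 2, sigma=6.0)
img = np.random.default_rng(0).normal(size=(size, size))
resp = complex_like_response(img, f_even, f_odd)
```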
Training constrained units at higher levels of the network to exhibit lower preferred spatial frequencies, a steeper receptive field size-eccentricity relation, and greater foveal coverage during vision than units at lower levels. We emphasize that these constraints were placed on activity generated by the network during vision only. No explicit constraints were placed upon the activity generated by the network during imagery, and imagery activity patterns played no role in training the model.
Once training was complete, we generated a large corpus of visual and imagery activity patterns in response to a new set of natural scenes. From these activity patterns, we first estimated signal amplitude, noise, and their ratio (SNR) during vision and imagery.
Relative to vision, signal amplitude and SNR were reduced at levels below the clamped level during imagery. This reduction was strongest in the level furthest below the clamped one. Consequently, by clamping higher in the hierarchy, the effects became stronger and more widespread; by clamping lower, the effects became weaker and were restricted to the lowest processing level (Figures 1B and 1D).
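The comparison shown in Figure 1B can be emulated with the toy network sketched above (reusing `net`, `expected_vision`, and `expected_imagery`); the stand-in stimuli, the i.i.d. Gaussian noise model, and the chosen levels below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
stimuli = rng.normal(size=(200, 256))        # stand-in for a new set of natural scenes
level, clamp_level, noise_sd = 1, 3, 0.5     # illustrative choices

def signal_and_noise(patterns, noise_sd):
    """Treat the expected pattern as signal and add i.i.d. Gaussian measurement noise."""
    noisy = patterns + rng.normal(scale=noise_sd, size=patterns.shape)
    signal_power = np.mean(patterns ** 2)
    noise_power = np.mean((noisy - patterns) ** 2)
    return signal_power, noise_power

vis = np.stack([net.expected_vision(s, level) for s in stimuli])
img = np.stack([expected_imagery(net, s, level, clamp_level) for s in stimuli])

sig_vis, noise_vis = signal_and_noise(vis, noise_sd)
sig_img, noise_img = signal_and_noise(img, noise_sd)

# Imagery-relative-to-vision changes on a power-of-2 logarithmic scale (as in Figure 1B).
print("delta signal (log2):", np.log2(sig_img / sig_vis))
print("delta SNR    (log2):", np.log2((sig_img / noise_img) / (sig_vis / noise_vis)))
```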
Imagery-Encoding Models Link Brain Activity to Imagined Stimuli
Visual-encoding models have previously been used to link measured brain activity to seen stimuli. Formally, the visual-encoding model is specified by a forward transformation from a stimulus to an expected visual activity pattern (e.g., $\bar{r}_\ell^{\mathrm{vis}} = T_\ell^0(s)$). The transformation determines the visual features (e.g., spatial frequency) that are extracted from the stimulus and encoded into the expected activity pattern. It also describes spatial sensitivity to features in the form of a visual receptive field.
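As a concrete, simplified instance of such a forward transformation, the sketch below defines a voxelwise encoding model in which the stimulus is filtered at several spatial frequencies and the resulting energies are pooled by a Gaussian receptive field. The frequency bank, the Gabor filters, and the pooling rule are illustrative assumptions rather than the feature model used in the paper.

```python
import numpy as np

def gaussian_rf(size, x0, y0, sigma):
    """2D Gaussian receptive field over the stimulus grid, normalized to sum to 1."""
    y, x = np.mgrid[0:size, 0:size]
    rf = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
    return rf / rf.sum()

def sf_energy_maps(stim, freqs, n_orient=4):
    """Local spatial frequency energy: quadrature Gabor pairs pooled over orientation."""
    size = stim.shape[0]
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    maps = []
    for f in freqs:
        energy = np.zeros_like(stim, dtype=float)
        for theta in np.linspace(0, np.pi, n_orient, endpoint=False):
            xr = x * np.cos(theta) + y * np.sin(theta)
            env = np.exp(-(x ** 2 + y ** 2) / (2 * (0.5 / f) ** 2))
            even = env * np.cos(2 * np.pi * f * xr)
            odd = env * np.sin(2 * np.pi * f * xr)
            # FFT-based convolution keeps the example short; energy = even^2 + odd^2.
            E = np.fft.ifft2(np.fft.fft2(stim) * np.fft.fft2(np.fft.ifftshift(even))).real
            O = np.fft.ifft2(np.fft.fft2(stim) * np.fft.fft2(np.fft.ifftshift(odd))).real
            energy += E ** 2 + O ** 2
        maps.append(energy)
    return np.stack(maps)                         # (n_freqs, size, size)

def encoding_model_response(stim, rf, sf_weights, freqs):
    """Predicted voxel response: RF-pooled spatial frequency energy,
    weighted by the voxel's spatial frequency tuning curve."""
    maps = sf_energy_maps(stim, freqs)
    pooled = (maps * rf[None]).sum(axis=(1, 2))   # energy per frequency within the RF
    return float(sf_weights @ pooled)

size = 64
freqs = [0.05, 0.1, 0.2]                          # cycles per pixel (illustrative)
rf = gaussian_rf(size, x0=40, y0=24, sigma=6.0)
sf_weights = np.array([0.2, 1.0, 0.4])            # this voxel prefers mid frequencies
stim = np.random.default_rng(3).normal(size=(size, size))
pred = encoding_model_response(stim, rf, sf_weights, freqs)
```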
Equation 1 shows that the expected imagery activity pattern $\bar{r}_\ell^{\mathrm{img}}$ can be expressed as an encoding of the stimulus $s$, even though $s$ is not seen during imagery. Thus, if a set of known stimuli are imagined, it should be possible to use those stimuli to construct imagery-encoding models. Once constructed, these imagery-encoding models could be compared to visual-encoding models to test for differences in the encoding of seen and imagined stimuli. This comparison should reveal the distortion effect of $E_\ell^k$.

Figure 2. Stimulus Presentation and Timing
(A) Top: the stimulus displayed on the viewing screen during vision runs. Second from top: enlargements of the cues visible during both vision and imagery runs are shown. Third from top: the display during imagery runs is shown. Bottom: timing of stimulus on- and offset and the interstimulus interval is shown.
(B) All 64 individual object pictures viewed and imagined during the experiment.
(C) An object picture displayed in each of the 8 positions bounded by the framing brackets.
(D) A superposition of all 64 object pictures showing the visual field coverage of the stimuli.
For each voxel, we estimated a visual-encoding model and an imagery-encoding model from activity measured during viewing and imagery sessions, respectively (Figures 3 and S2). The visual-encoding model for each voxel specified tuning to seen spatial frequency and a receptive field in visual space. The imagery-encoding model for each voxel specified tuning to imagined spatial frequency and a receptive field in imagined space.
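A minimal sketch of how such per-voxel models might be estimated and compared is given below; the grid search, the candidate parameter grids, and the stand-in responses are illustrative assumptions and not the estimation procedure summarized in Figure 3. It reuses `gaussian_rf`, `sf_energy_maps`, and `freqs` from the previous sketch.

```python
import numpy as np
from itertools import product

def fit_voxel_model(stims, responses, size, freqs):
    """Grid search over RF center/size and spatial frequency preference, keeping
    the candidate whose predicted responses correlate best with the data."""
    # Precompute spatial frequency energy maps once per stimulus.
    maps = np.stack([sf_energy_maps(s, freqs) for s in stims])   # (n_stim, n_freq, size, size)
    centers, sigmas = np.linspace(12, size - 12, 3), [4.0, 8.0, 16.0]
    best_r, best_params = -np.inf, None
    for x0, y0, sigma, pref in product(centers, centers, sigmas, range(len(freqs))):
        rf = gaussian_rf(size, x0, y0, sigma)
        pred = (maps[:, pref] * rf[None]).sum(axis=(1, 2))       # RF-pooled energy at preferred SF
        r = np.corrcoef(pred, responses)[0, 1]
        if r > best_r:
            best_r, best_params = r, dict(x0=x0, y0=y0, sigma=sigma, pref_freq=freqs[pref])
    return best_params

# Fit one voxel's visual- and imagery-encoding models from (stand-in) measured
# responses, then compare the fitted tuning parameters.
size = 64
stims = [np.random.default_rng(k).normal(size=(size, size)) for k in range(12)]
resp_vis = np.random.default_rng(10).normal(size=12)    # stand-in for vision-session data
resp_img = np.random.default_rng(11).normal(size=12)    # stand-in for imagery-session data
vis_fit = fit_voxel_model(stims, resp_vis, size, freqs)
img_fit = fit_voxel_model(stims, resp_img, size, freqs)
print("RF size (vision, imagery):", vis_fit["sigma"], img_fit["sigma"])
print("Preferred SF (vision, imagery):", vis_fit["pref_freq"], img_fit["pref_freq"])
```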

Figure 3. Data and Procedures for Estimating Visual- and Imagery-Encoding Models