Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, Thomas Naselaris

 A Deep Generative Network Exhibits Distinct Codes for Seen and Mental Images

We considered activity patterns in a deep generative network specified by a hierarchy of $L$ processing levels with feedforward and feedback connections. The lowest level (level $0$) of the network analogized the retina. Levels above analogized functionally distinct visual areas (e.g., V1, V2, V3, and so on; see nodes in Figure 1A). We modeled vision and mental imagery as distinct input configurations of this network and used activity patterns in the network to derive qualitative predictions about activity patterns in the brain.
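As a concrete frame of reference for what follows (an illustrative sketch only, not the authors' implementation), the hierarchy can be represented as an ordered list of levels, with level 0 standing in for the retina and each higher level for a visual area; the names and sizes below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Level:
    """One processing level of a toy hierarchy (names and sizes are illustrative)."""
    index: int      # 0 = sensor ("retina"), 1..L = visual areas
    name: str
    n_units: int    # number of units whose joint state is the activity pattern r_l

# A small, hypothetical hierarchy; the paper's network and sizes may differ.
hierarchy = [
    Level(0, "retina", 64),
    Level(1, "V1-like", 32),
    Level(2, "V2-like", 16),
    Level(3, "high-level", 8),
]

# Feedforward and feedback connections link each level to the one above/below it.
feedforward = [(lvl.index, lvl.index + 1) for lvl in hierarchy[:-1]]
feedback = [(j, i) for (i, j) in feedforward]
print(feedforward, feedback)
```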


Figure 1. A Deep Generative Network Exhibits Distinct Codes for Seen and Mental Images

(A) The visual system as a deep generative network specified by a hierarchy of processing levels (circles; $r_l$ denotes an activity pattern) and feedforward and feedback connections (gray arrows). During vision, the expected visual activity pattern at a processing level, say $l+d$, is determined by a transformation $T_{l+d}^{0}$ (long blue arrow) of activity (denoted $s$ for stimulus) at the sensor level $0$ (eye). $T_{l+d}^{0}$ is equivalent to a composition of transformations (shorter blue arrows) of activity patterns between intervening levels. During imagery, $s = 0$, but at least one processing level is clamped to its expected visual activity pattern ($r_{l+d} = \mu_{l+d}^{\mathrm{vis}}$ in this example; red box). Expected imagery activity patterns beneath the clamped level (e.g., $\mu_l^{\mathrm{img}}$) differ from their visual activity patterns by a transformation, $\Omega$, from the current level to the clamped level ($T_{l+d}^{l}$, shortest blue arrow) and from the clamped level back ($\bar{T}_l^{l+d}$, orange arrow).

(B–D) In silico experiments on a deep generative network illustrate the predicted effects of $\Omega$ on brain activity patterns during mental imagery.

(B) Signal (top), noise (bottom), and signal-to-noise ratio (right) during imagery relative to vision at each processing level (x axis). Changes are expressed on a base-2 logarithmic scale. The strength of the effect depends on the processing level that is clamped during imagery (curves illustrate three different clamping levels; see legend).

(C) Top row shows population tuning to spatial frequency ($\log_2[\mathrm{cyc/stim}]$, x axis) for vision (blue) and imagery (orange) for units at each level in the network when the top level of the network is clamped (dashed box). Bottom two rows show receptive fields (RFs) for individual units at each level of the network (circle radius is one SD of the Gaussian RF; circle color scales with radius) for vision (middle) and imagery (bottom).

(D) For each level of clamping, units below the clamped level exhibit lower spatial frequency preference (top), larger RFs (bottom), and lower eccentricity (right) during imagery relative to vision.

During vision, the retina is activated by a visual stimulus $s$. Let $r_k$ be an activity pattern at a level $k$ and $\mu_k$ be the expected activity pattern at that level. We modeled vision in the generative network by clamping the activity pattern at the lowest level to the visual stimulus, $r_0^{\mathrm{vis}} = s$ ("vis" denotes vision). Because, in this case, $s$ is the only source of variation, the resulting expected activity pattern at level $l > 0$ is specified by a forward transformation $\mu_l^{\mathrm{vis}} = T_l^{0}[s]$ that maps from level $0$ (superscript) to level $l$ (subscript). (Note that we refer to $T_l^{0}$ as a forward transform only because it yields an expected activity pattern $\mu_l$ as a function of the stimulus $s$. The brain is of course dynamic, and we interpret $\mu_l$ as the expected activity pattern at steady state and $T_l^{0}$ as incorporating the effects of both feedforward and feedback connections between levels in the network. See Method S1.1.4 for explicit analysis.)
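To make the forward mapping concrete, here is a minimal NumPy sketch (not the network used in the paper) that treats each adjacent-level transformation as a random linear map and builds $T_l^{0}$ by composition; the level sizes and matrices are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy: level 0 (the sensor, or "retina") is largest and higher levels
# shrink, mimicking increasingly abstract representations. Sizes are arbitrary.
sizes = [64, 32, 16, 8]                      # units at levels 0, 1, 2, 3

# Hypothetical linear forward maps between adjacent levels:
# T_up[k] sends an activity pattern at level k to level k + 1.
T_up = [rng.standard_normal((sizes[k + 1], sizes[k])) / np.sqrt(sizes[k])
        for k in range(len(sizes) - 1)]

def T_l_0(s, l):
    """Forward transform T_l^0: compose the adjacent-level maps from level 0 up to l."""
    r = s
    for k in range(l):
        r = T_up[k] @ r
    return r

# Vision: clamp level 0 to the stimulus, r_0^vis = s, and propagate upward.
s = rng.standard_normal(sizes[0])                   # a stand-in stimulus
mu_vis = [T_l_0(s, l) for l in range(len(sizes))]   # mu_l^vis for l = 0..3
print([m.shape for m in mu_vis])                    # [(64,), (32,), (16,), (8,)]
```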

During imagery experiments, the retinal stimulus is uninformative (e.g., a blank screen). We therefore modeled the visual input during imagery by clamping activity at the sensor level to a null pattern, $r_0^{\mathrm{img}} = 0$. We assumed that subjects imagine by reinstating in a high-level brain area the expected activity pattern that would have been evoked by seeing the imagined stimulus. We modeled this mechanism by clamping the activity in the network at a level higher than $l$, say $l+d$, to its expected activity pattern during vision, $r_{l+d}^{\mathrm{img}} = \mu_{l+d}^{\mathrm{vis}}$ ("img" denotes imagery). Now the clamped activity pattern at level $l+d$ is the only source of variation, so the resulting expected activity pattern at level $l$ is specified by a feedback transformation $\mu_l^{\mathrm{img}} = \bar{T}_l^{l+d}[\mu_{l+d}^{\mathrm{vis}}]$ that maps activity at level $l+d$ (superscript) to activity at level $l$ (subscript). This feedback transformation is the operation in our model that most closely resembles "vision in reverse."
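A sketch of this imagery configuration, continuing the toy linear hierarchy above; purely as an assumption for illustration, each feedback map $\bar{T}$ is implemented here as the pseudoinverse of the corresponding forward map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy linear hierarchy as in the previous sketch (illustrative sizes).
sizes = [64, 32, 16, 8]
T_up = [rng.standard_normal((sizes[k + 1], sizes[k])) / np.sqrt(sizes[k])
        for k in range(len(sizes) - 1)]

def forward(s, l):
    """T_l^0: forward transform from the sensor level 0 up to level l."""
    r = s
    for k in range(l):
        r = T_up[k] @ r
    return r

# Assumption for illustration only: implement each feedback map T-bar as the
# pseudoinverse (least-squares decoder) of the corresponding forward map.
T_down = [np.linalg.pinv(W) for W in T_up]

def feedback(r_high, l_high, l_low):
    """T-bar_{l_low}^{l_high}: map a clamped pattern back down to level l_low."""
    r = r_high
    for k in range(l_high - 1, l_low - 1, -1):
        r = T_down[k] @ r
    return r

# Imagery: the sensor level is clamped to a null pattern (r_0^img = 0), while
# level l+d is clamped to the pattern it would carry during vision.
s = rng.standard_normal(sizes[0])              # the imagined (not presented) stimulus
l, d = 1, 2
mu_vis_clamped = forward(s, l + d)             # mu_{l+d}^vis
mu_img_l = feedback(mu_vis_clamped, l + d, l)  # mu_l^img = T-bar_l^{l+d}[mu_{l+d}^vis]
print(mu_img_l.shape)                          # (32,): predicted activity at level l
```

The pseudoinverse is only a stand-in for "vision in reverse"; in the model itself, the feedback transformation is whatever mapping the generative network defines from the clamped level back down to level $l$.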

Because of the hierarchical structure of the network, the imagery activity pattern $\mu_l^{\mathrm{img}}$ can be re-written as an explicit function of the stimulus $s$ (see Method S1.1.2 for derivation):

$$\mu_l^{\mathrm{img}} \;=\; \underbrace{\bar{T}_l^{l+d}\, T_{l+d}^{l}}_{\Omega_{l,l+d}}\; T_l^{0}[s]. \qquad \text{(Equation 1)}$$
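A short derivation sketch, using only the facts already stated: $\mu_{l+d}^{\mathrm{vis}} = T_{l+d}^{0}[s]$, $\mu_l^{\mathrm{img}} = \bar{T}_l^{l+d}[\mu_{l+d}^{\mathrm{vis}}]$, and the compositionality $T_{l+d}^{0} = T_{l+d}^{l} \circ T_l^{0}$ noted in the Figure 1 legend:

```latex
\begin{aligned}
\mu_l^{\mathrm{img}}
  &= \bar{T}_l^{l+d}\big[\mu_{l+d}^{\mathrm{vis}}\big]
     && \text{(imagery: feedback from the clamped level)} \\
  &= \bar{T}_l^{l+d}\big[T_{l+d}^{0}[s]\big]
     && \text{(since } \mu_{l+d}^{\mathrm{vis}} = T_{l+d}^{0}[s] \text{)} \\
  &= \bar{T}_l^{l+d}\, T_{l+d}^{l}\, T_l^{0}[s]
     && \text{(split the forward transform at level } l \text{)} \\
  &= \Omega_{l,l+d}\big[\mu_l^{\mathrm{vis}}\big]
     && \text{(} \Omega_{l,l+d} \equiv \bar{T}_l^{l+d}\, T_{l+d}^{l},\ \mu_l^{\mathrm{vis}} = T_l^{0}[s] \text{)}
\end{aligned}
```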

This expression reveals how the expected imagery activity pattern $\mu_l^{\mathrm{img}}$ differs from the expected visual activity pattern $\mu_l^{\mathrm{vis}} = T_l^{0}[s]$. The difference is specified by the distortion $\Omega_{l,l+d}$, which can be construed as an "echo," because $T_{l+d}^{l}$ maps the expected visual activity pattern at level $l$ into the abstract representation at the clamped level $l+d$ and then $\bar{T}_l^{l+d}$ maps it back to level $l$. The encoding of $s$ during imagery at level $l$ is thus likely to be limited by the encoding of $s$ at level $l+d$ during vision.
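This encoding bottleneck can be made concrete in the toy linear hierarchy sketched earlier (again an illustrative assumption, not the paper's network): when the clamped level has fewer units than level $l$, the distortion $\Omega_{l,l+d}$ is rank-limited by the clamped level, so the imagery pattern at level $l$ cannot carry more stimulus information than level $l+d$ does during vision.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy linear hierarchy and pseudoinverse feedback maps as in the sketches above.
sizes = [64, 32, 16, 8]
T_up = [rng.standard_normal((sizes[k + 1], sizes[k])) / np.sqrt(sizes[k])
        for k in range(len(sizes) - 1)]         # T_up[k]: level k -> level k+1
T_down = [np.linalg.pinv(W) for W in T_up]      # assumed feedback maps (pseudoinverses)

def compose(mats):
    """Compose a list of linear maps, applying them in list order (first map first)."""
    out = np.eye(mats[0].shape[1])
    for M in mats:
        out = M @ out
    return out

l, d = 1, 2
T_l_0       = compose(T_up[:l])                         # level 0 -> level l
T_lpd_l     = compose(T_up[l:l + d])                    # level l -> level l+d
T_bar_l_lpd = compose(list(reversed(T_down[l:l + d])))  # level l+d -> level l

# The "echo": map up to the clamped level and back again.
Omega = T_bar_l_lpd @ T_lpd_l                           # Omega_{l,l+d}

s = rng.standard_normal(sizes[0])
mu_vis_l = T_l_0 @ s                                    # mu_l^vis = T_l^0[s]
mu_img_l = Omega @ mu_vis_l                             # Equation 1: mu_l^img

# Omega's rank is capped by the width of the clamped level, so the imagery code
# at level l can encode no more about s than level l+d encodes during vision.
print(np.linalg.matrix_rank(Omega), "<=", sizes[l + d])
```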

For practical applications, high-level representations in deep generative networks are often optimized to give near-lossless stimulus reconstruction. However, in high-level visual areas, brain activity can be quite invariant to many aspects of stimulus variation, an inevitable consequence of forming abstract representations that are useful for cognition [Selectivity and tolerance ("invariance") both increase as visual information propagates from cortical area V4 to IT].