LETTER
Communicated by Misha Tsodyks

Mean-Driven and Fluctuation-Driven Persistent Activity in Recurrent Networks

Alfonso Renart∗
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain, and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Rubén Moreno-Bote
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

Xiao-Jing Wang
Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Néstor Parga
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

Spike trains from cortical neurons show a high degree of irregularity, with coefficients of variation (CV) of their interspike interval (ISI) distribution close to or higher than one. It has been suggested that this irregularity might be a reflection of a particular dynamical state of the local cortical circuit in which excitation and inhibition balance each other. In this “balanced” state, the mean current to the neurons is below threshold, and firing is driven by current fluctuations, resulting in irregular Poisson-like spike trains. Recent data show that the degree of irregularity in neuronal spike trains recorded during the delay period of working memory experiments is the same for both low-activity states of a few Hz and for elevated, persistent activity states of a few tens of Hz. Since the difference between these persistent activity states cannot be due to external factors coming from sensory inputs, this suggests that the underlying network dynamics might support coexisting balanced states at different firing rates. We use mean-field techniques to study the possible existence of multiple balanced steady states in recurrent networks of current-based leaky integrate-and-fire (LIF) neurons. To assess the degree of balance of a steady state, we extend existing mean-field theories so that not only the firing rate, but also the coefficient of variation of the interspike interval distribution of the neurons, are determined self-consistently. Depending on the connectivity parameters of the network, we find bistable solutions of different types. If the local recurrent connectivity is mainly excitatory, the two stable steady states differ mainly in the mean current to the neurons. In this case, the mean drive in the elevated persistent activity state is suprathreshold and typically characterized by low spiking irregularity. If the local recurrent excitatory and inhibitory drives are both large and nearly balanced, or even dominated by inhibition, two stable states coexist, both with subthreshold current drive. In this case, the spiking variability in both the resting state and the mnemonic persistent state is large, but the balance condition implies parameter fine-tuning. Since the degree of required fine-tuning increases with network size and, on the other hand, the size of the fluctuations in the afferent current to the cells increases for small networks, overall we find that fluctuation-driven persistent activity in the very simplified type of models we analyze is not a robust phenomenon. Possible implications of considering more realistic models are discussed.

∗ Current address: Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102 USA.

Neural Computation 19, 1–46 (2007)

© 2006 Massachusetts Institute of Technology

1 Introduction

The spike trains of cortical neurons recorded in vivo are irregular and consistent, to a first approximation, with a Poisson process, possessing a roughly exponential interspike interval (ISI) distribution (except at very short intervals) and a coefficient of variation (CV) of the ISI close to one (Softky & Koch, 1993). The possible implications of this fact for the basic principles of cortical organization have motivated a large number of studies during the past 10 years (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998; Zador & Stevens, 1998; Harsch & Robinson, 2000). There is an apparent inconsistency between the irregular spiking of a cortical neuron and its working as an integrator of a very large number of inputs over a relatively long time constant, of the order of 10 to 20 ms. An important idea analyzed by some of these studies was that a way out of this inconsistency was to have similar amounts of excitatory and inhibitory drive. In this way, the mean drive to the cell was subthreshold, and spikes were the result of fluctuations, which occur irregularly, thus leading to a high CV (Gerstein & Mandelbrot, 1964). Although the implications of this result were first studied in a feedforward architecture (Shadlen & Newsome, 1994), it was soon discovered that a state

Bistability in Balanced Recurrent Networks


in which excitation and inhibition balance each other, resulting in irregular spiking, was a robust dynamical attractor in recurrent networks (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998); that is, under very general conditions, a recurrent network settles down into a state of this sort. Although the original studies characterizing quantitatively the degree of spiking irregularity in the cortex were done using data from sensory cortices, it has since been shown that neurons in higher-order associative areas like the prefrontal cortex (PFC) also spike irregularly (Shinomoto, Sakai, & Funahashi, 1999; Compte et al., 2003) (see Figure 1). This is interesting because it is well known that cells in the PFC (Fuster & Alexander, 1971; Funahashi, Bruce, & Goldman-Rakic, 1989; Miller, Erickson, & Desimone, 1996; Romo, Brody, Hernández, & Lemus, 1999), as well as those in other associative cortices like the inferotemporal (Miyashita & Chang, 1988) or posterior parietal cortex (Gnadt & Andersen, 1988; Chafee & Goldman-Rakic, 1998), show activity patterns that are selective to stimuli no longer present to the animal and are thus being held in working memory. The activity of these neurons seems to be able to switch, on presentation of an appropriate brief, transient input, from a basal spontaneous activity level to a higher activity state. When the dimensionality of the stimulus to be remembered is low (e.g., the position of an LED on a computer screen or the frequency of a vibrotactile stimulus), the mnemonic activity during the delay period when the stimulus is absent seems to be graded (Funahashi et al., 1989; Romo et al., 1999), whereas when the dimensionality of the stimulus is high (e.g., a complex image), the single neurons seem to choose from a small number of discrete activity states (Miyashita & Chang, 1988; Miller et al., 1996). This last coding scheme is referred to as object working memory.
Since there is no explicit sensory input present during the delay period in a working memory task, the neuronal activity must be a result of the dynamics of the relevant neural circuit. There is a long tradition of modeling studies that have described delay-period activity as a reflection of dynamical attractors in multistable (usually bistable) networks presumed to represent the local cortical environment of the neurons recorded in the neurophysiological experiments (Hopfield, 1982; Amit & Tsodyks, 1991; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Amit & Brunel, 1997b; Brunel, 2000a; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Brunel & Wang, 2001; Hansel & Mato, 2001; Cai, Tao, Shelly, & McLaughlin, 2004). Originally inspired by models and techniques from the statistical mechanics of disordered systems, network models of persistent activity have progressively become more faithful to the biological circuits that they seek to describe. The landmark study (Amit & Brunel, 1997b) provided an extended mean-field description of the activity of a recurrent network of spiking current-based leaky integrate-and-fire (LIF) neurons. One of its main achievements was to use the theory of diffusion processes to provide an intuitive, compact

[Figure 1 image: two rows of three histograms showing the number of cells versus CV (top row) and versus CV2 (bottom row) for the FIXATION, PREFERRED, and NONPREFERRED conditions. Each row includes an inset comparing the F, PR, and NP conditions (p = 8.5e−15 for CV; p = 1.6e−23 for CV2).]

Figure 1: CV of the ISI of neurons in monkey prefrontal cortex during a spatial working memory task. The monkey made saccades to remembered locations on a computer screen after a delay period of a few seconds. On each trial, a dot of light (cue stimulus) was briefly shown in one of eight to-be-remembered locations, equidistant from the fixation point but at different angles. After the delay period, starting with the disappearance of the cue stimulus and terminating with the disappearance of the fixation point, the monkey made a saccade to the remembered location. Top and bottom rows correspond, respectively, to the CV and CV2 (CV calculated using only consecutive ISIs to try to compensate for possible slow nonstationarities in the neurons' instantaneous frequency) computed from spike trains of prefrontal cortical neurons recorded from monkeys performing an oculomotor spatial working memory task. Results shown correspond to analysis of the activity during the delay period of the task. The spike trains are irregular (CV ∼ 1), and to a similar extent, both when the data correspond to trials in which the preferred (PR; middle column) positional cue for the cell was held in working memory (higher firing rate during the delay period) and when they correspond to trials with the nonpreferred (NP; right column) positional cue for the particular neuron (lower firing rate during the delay period). See Compte et al. (2003) for details. Adapted with permission from Compte et al. (2003).

description of the spontaneous, low-rate, basal activity state of cortical cells in terms of self-consistency equations that included information about both the mean and the fluctuations of the afferent current to the cell. The theory proposed was both simple and accurate, and matched well the properties of simulated LIF networks. The spontaneous activity state in Amit and Brunel (1997b) is effectively the balanced state described above, in which the recurrent connectivity is


dominated by inhibition and firing is due to the occurrence of positive fluctuations in the drive to the neurons. However, in Amit and Brunel (1997b), this same model was used to describe the coexistence of the spontaneous activity state with a persistent activity state with a physiologically plausible firing rate that would correspond to the spiking observed during the delay period in object working memory tasks, such as seen in, for example, Miyashita and Chang (1988). Although the model, with its large number of subsequent improvements, has been successful in providing a fairly accurate description of simulated spiking networks, no effort has yet been made to study systematically the relationship between multistability and the irregularity of the spike trains, especially in the elevated activity state. As we will show below, the qualitative organization of the connectivity in the recurrent network not only determines the existence of a fluctuation-driven balanced spontaneous activity state in the network, but also the existence of bistability in the network, and whether the elevated activity states are fluctuation driven. In order to perform a systematic analysis of the types of persistent activity that can be obtained in a network of current-based LIF neurons, two steps are important. First, we believe that the scaling of the synaptic connections with the number of afferent synapses per neuron should be made explicit. This approach was taken in the studies of the balanced state (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996), but is not present in the Amit and Brunel (1997b) framework. As we shall see, when the scaling is made explicit and the network is studied in the limit of a large number of connections per cell, the difference between the behavior of alternative circuit organizations (or architectures) becomes qualitative. Second, it would be desirable to be able to check for the spike train irregularity within the theory. 
In Amit and Brunel (1997b), spiking was assumed to be Poisson and, hence, to have a CV equal to 1. Poisson spike trains are completely characterized by a single number, the instantaneous firing probability, so there is nothing more to say about the spike train once its firing rate has been given. A general self-consistent description of the higher-order moments of spiking in a recurrent network of LIF neurons is extremely difficult, as the calculation of the moments of the ISI distribution becomes prohibitively complicated when the input current to a particular cell contains temporal correlations (although see Moreno-Bote & Parga, 2006). However, based on our study of the input-output properties of the LIF neuron under the influence of correlated inputs (Moreno, de la Rocha, Renart, & Parga, 2002), we have constructed a self-consistent description for the first two moments of the current to the neurons in the network, which relaxes the Poisson assumption and which we expect to be valid if the temporal correlations in the spike trains in the network are sufficiently short. Some of the results presented here have already been published in abstract form.


2 Methods

We consider a network of current-based leaky integrate-and-fire neurons. The voltage difference across each neuron’s membrane evolves in time according to the following equation,

$$\frac{dV(t)}{dt} = -\frac{V(t)}{\tau_m} + I(t),$$

with voltages being measured relative to the leak potential of the neuron. When the depolarization reaches a threshold voltage that we set at $V_{th} = 20$ mV, a spike is emitted, and the voltage is clamped at a reset potential $V_r = 10$ mV during a refractory period $\tau_{ref} = 2$ ms, after which the voltage continues to integrate the input current. The membrane time constant is $\tau_m = 10$ ms. When the neuron is inserted in a network, $I(t)$ represents the total synaptic current, which is assumed to be a linear sum of the contributions from each individual presynaptic cell. We consider the simplest description of the synaptic interaction between the pre- and postsynaptic neurons, according to which each presynaptic action potential provokes an instantaneous “kick” in the depolarization of the postsynaptic cell. The network is composed of $N_E$ excitatory and $N_I$ inhibitory cells randomly connected so that each cell receives $C_E$ excitatory and $C_I$ inhibitory contacts, each with an efficacy (“kick” size) $J_{E_j}$ and $J_{I_k}$, respectively ($j = 1, \ldots, C_E$; $k = 1, \ldots, C_I$). The total afferent current into the cell can be represented as

$$I(t) = \sum_{j=1}^{C_E} J_{E_j}\, s_j(t) - \sum_{k=1}^{C_I} J_{I_k}\, s_k(t),$$

where $s_{j(k)}(t)$ represents the spike train from the $j$th excitatory ($k$th inhibitory) neuron. Since according to this description, the effect of a presynaptic spike on the voltage of the postsynaptic neuron is instantaneous, $s(t)$ is a collection of Dirac delta functions, that is, $s(t) \equiv \sum_j \delta(t - t_j)$, where $t_j$ are the spike times.

2.1 Mean-Field Description. Spike trains in the model are stochastic, with an instantaneous firing rate (i.e., a probability of measuring a spike in $(t, t + dt)$ per unit time) denoted by $\nu(t) = \nu$. The second-order statistics of the process is characterized by its connected two-point correlation function $C(t, t')$, giving the joint probability density (above chance) that two spikes happen at $(t, t + dt)$ and at $(t', t' + dt)$, that is, $C(t, t') \equiv \langle s(t)s(t')\rangle - \langle s(t)\rangle\langle s(t')\rangle$. Stochastic spiking in network models is usually assumed to follow Poisson statistics, which is both a fairly good approximation to what is commonly observed experimentally (see, e.g.,


Softky & Koch, 1993; Compte et al., 2003) and convenient technically since Poisson spike trains lack any temporal correlations. For Poisson spike trains, $C(t, t') = \nu\,\delta(t - t')$, where $\nu$ is the instantaneous firing probability. We have previously analyzed the effect of temporal correlations in the afferents to a LIF neuron on its firing rate (Moreno et al., 2002). Temporal correlations measured in vivo are often well fitted by an exponential (Bair, Zohary, & Newsome, 2001). We considered exponential correlations of the form

$$C(t, t') = \nu\left[\delta(t' - t) + (F_\infty - 1)\,\frac{e^{-|t' - t|/\tau_c}}{2\tau_c}\right], \tag{2.1}$$

where $F_\infty$ is the Fano factor of the spike train for infinitely long time windows. The Fano factor in a window of length $T$ is defined as the ratio between the variance and the mean of the spike count in the window. It is illustrative to calculate it for our process,

$$F_T \equiv \frac{\sigma^2_{N(T)}}{\langle N(T)\rangle},$$

where $N(T)$ is the (stochastic) spike count in a window of length $T$,

$$N(T) = \int_0^T dt\, s(t),$$

so that $\langle N(T)\rangle = \nu T$, and the spike count variance is given by

$$\sigma^2_{N(T)} \equiv \int_0^T\!\!\int_0^T dt\, dt'\, C(t, t') = \nu T + \nu (F_\infty - 1)\left[T - \tau_c\left(1 - e^{-T/\tau_c}\right)\right].$$

When the time window is long compared to the correlation time constant, that is, $T \gg \tau_c$, then $\sigma^2_{N(T)} \sim F_\infty \nu T$; hence, our use of the factor $F_\infty$ in the definition of the correlation function. An interesting point to note is that for time windows that are long compared to the correlation time constant, the variance of the spike count is linear in time, which is a signature of independence across time, that is, independent variances add up (for the Poisson process, $(\sigma^2_{N(T)})_{Poisson} = \nu T$, so that $(F_T)_{Poisson} = 1$). If the characteristic time of the postsynaptic cell integrating this stochastic current (its membrane time constant) is very long compared with $\tau_c$, we expect that the main effect of the deviation from Poisson of the input spike trains will be on the amplitude of the current variance, with the parameter $\tau_c$ playing only a marginal role, as it does not appear in $\sigma^2_{N(T)}$ when $T \gg \tau_c$. As we show


below, a rigorous analysis of the effect of correlations on the mean firing rate of a LIF neuron confirms this intuitive picture. The postsynaptic cell receives many inputs. Recall that the total current is given by $I(t) = J \sum_{j=1}^{C} s_j(t)$ (for simplicity, we consider in this discussion that the cell receives $C$ inputs from a single, large, homogeneous population composed of $N$ neurons). Thus, the mean and correlation function of the total afferent current to a given cell are

$$\langle I(t)\rangle = C J \nu$$

$$\begin{aligned}
C_I(t, t') &= \langle I(t')I(t)\rangle - \langle I(t)\rangle\langle I(t')\rangle \\
&= J^2 \sum_{ij}\left[\langle s_i(t)s_j(t')\rangle - \langle s_i(t)\rangle\langle s_j(t')\rangle\right] \\
&= C J^2\, C(t, t') + C(C - 1) J^2\, C_{cc}(t, t'),
\end{aligned}$$

where $C(t, t')$ is the (auto)correlation function in equation 2.1 and $C_{cc}(t, t')$ is the cross-correlogram between any two given cells of the pool of presynaptic inputs (which we have assumed to be the same for all pairs). We restrict our analysis to very sparse random networks (networks with $C \ll N$), so that the fraction of synaptic inputs shared by any two given neurons can be assumed to be negligible. In this case, the cross-correlation between the spike trains of the two cells $C_{cc}(t, t')$ will be zero. This approximation simplifies the analysis of the network behavior significantly and allows for a self-consistent solution for the network’s steady states. Thus, the temporal structure of the total current to the cell is described by

$$C_I(t, t') = \sigma_0^2\left[\delta(t - t') + \frac{\alpha}{2\tau_c}\, e^{-|t - t'|/\tau_c}\right] \tag{2.2}$$

with $\sigma_0^2 = C J^2 \nu$ and $\alpha = F_\infty - 1$.
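The dependence of the Fano factor on the counting window, which motivates the definition α = F∞ − 1, can be checked numerically. The following is a minimal sketch that simply evaluates the closed-form spike count variance given earlier in this section, divided by the mean count νT; the function name and the parameter values (τc = 5 ms, F∞ = 2) are illustrative assumptions, not values taken from the letter.

```python
import math


def fano_factor(T, tau_c, f_inf):
    """Fano factor F_T of the exponentially correlated spike train.

    Uses sigma^2_N(T) = nu*T + nu*(F_inf - 1)*[T - tau_c*(1 - exp(-T/tau_c))]
    divided by the mean count nu*T, so the rate nu cancels out.
    """
    # 1 - exp(-T / tau_c), computed with expm1 to stay accurate for T << tau_c
    saturation = -math.expm1(-T / tau_c)
    return 1.0 + (f_inf - 1.0) * (1.0 - tau_c * saturation / T)


if __name__ == "__main__":
    tau_c = 0.005  # hypothetical correlation time constant (s)
    for T in (1e-4, 1e-3, 1e-2, 1e-1, 1.0):
        print(f"T = {T:g} s   F_T = {fano_factor(T, tau_c, 2.0):.4f}")
```

For windows much shorter than τc the result approaches the Poisson value of 1, and for windows much longer than τc it approaches F∞, which is the regime relevant when the membrane time constant is long compared with τc.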

We have previously calculated the output firing rate of an LIF neuron subject to an exponentially correlated input (Moreno et al., 2002). The calculation is done using the diffusion approximation (Ricciardi, 1977) in which the discontinuous voltage trajectories are approximated by those obtained from an equivalent diffusion process. The diffusion approximation is expected to give accurate results when the overall rate of the input process is high, with the amplitude of each individual event being very small (Ricciardi, 1977). For small but finite τc , the analytic calculation of the firing rate of the cell can be done only when the deviation of the input current from a white noise


process is small, that is, it has to be done assuming that $k \equiv \sqrt{\tau_c/\tau_m} \ll 1$ and that $\alpha \ll 1$. More specifically, we found that if $k = 0$, then the firing rate can be calculated for arbitrary values of $\alpha$, but if $k$ is small but finite, then an expression can be found for the case when both $k$ and $\alpha$ are small (see Moreno et al., 2002, for details). If $k = 0$, then the result one obtains is that the firing rate of the neuron is given by the same expression that one finds for the case of a white noise input, but with an effective variance that takes into account the change in amplitude of the fluctuations due to the non-Poisson nature of the inputs. The effective variance is equal to

$$\sigma^2_{eff} = \sigma_0^2 (1 + \alpha) = C J^2 \nu F_\infty,$$

which is exactly the slope of the linear increase with the size of the time window $T$ of the variance in the spike count $N_I(T)$ of the total input current. This result can be understood in terms of the Fano factor calculation outlined above. Assuming $k = 0$ is equivalent to assuming an infinitely long time window for the calculation of the Fano factor, and in those conditions we also saw that the only effect of the temporal correlations is to renormalize the variance of the spike count with respect to the Poisson case. In order to set up a self-consistent scenario, we have to close the loop by calculating a measure of the variability of the postsynaptic cell and relating it to the same property in the spike trains of its inputs. To do this, we note that if the spike trains in the model can be described as renewal processes, these processes have a property that relates their spike count variability and their ISI variability: $F_\infty = CV^2$ if a point process is renewal (Cox, 1962). Renewal point processes are characterized by having independent ISIs, which are not necessarily exponentially distributed. Since we are assuming that the temporal correlations in the spike trains are short anyway, and the firing rates of the cells in the persistent activity states that we are interested in are not very high, we expect the renewal assumption to be appropriate. The final step is to make sure that the result for the firing rate (the inverse of the mean ISI) in terms of the effective variance also holds for higher moments of the postsynaptic ISI, not only for the first, and this is indeed the case (Renart, 2000); that is, the CV of the ISI when $k = 0$ is given by the same expression as when the input is a white noise process, but with a renormalized variance equal to $\sigma^2_{eff}$. Thus, under the assumptions described above, there is a way of computing the output rate and CV of an LIF neuron solely in terms of the rate and CV of its presynaptic inputs.
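The renewal property F∞ = CV² used above can be illustrated with a small Monte Carlo sketch. The gamma-distributed ISIs, the function name, and all parameter values below are assumptions chosen only for illustration (a gamma process with shape parameter 4 has CV² = 1/4); they are not part of the model in the letter.

```python
import random


def fano_and_cv2(shape, rate, window, n_windows, seed=0):
    """Simulate a gamma renewal spike train.

    Returns (Fano factor of spike counts in long windows, CV^2 of the ISIs).
    """
    rng = random.Random(seed)
    scale = 1.0 / (shape * rate)       # mean ISI = shape * scale = 1 / rate
    counts = [0] * n_windows
    isis = []
    t, t_end = 0.0, window * n_windows
    while True:
        isi = rng.gammavariate(shape, scale)   # independent ISIs: renewal process
        t += isi
        if t >= t_end:
            break
        isis.append(isi)
        counts[min(int(t // window), n_windows - 1)] += 1

    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    return var(counts) / mean(counts), var(isis) / mean(isis) ** 2


if __name__ == "__main__":
    # Windows of 10 s at 10 Hz hold about 100 spikes, so T is long compared
    # with the mean ISI and the count Fano factor should be close to CV^2.
    fano, cv2 = fano_and_cv2(shape=4.0, rate=10.0, window=10.0, n_windows=400)
    print(f"Fano factor = {fano:.3f}, CV^2 = {cv2:.3f}")
```

With these settings both estimates should land near 0.25, up to sampling noise in the count variance.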
In the steady states, both input and output


firing rate and CV will be the same, and this provides a pair of equations that determine these quantities self-consistently. In the remainder of the letter, we thus use the common expressions for the mean and CV of the first passage time of the Ornstein-Uhlenbeck (OU) process,

$$\nu^{-1} = \tau_{ref} + \tau_m \sqrt{\pi} \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2}\left[1 + \mathrm{erf}(x)\right] \tag{2.3}$$

$$CV^2 = 2\pi \nu^2 \tau_m^2 \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2}\left[1 + \mathrm{erf}(y)\right]^2, \tag{2.4}$$

where $\mu_V$ and $\sigma_V^2$ are the mean and variance of the depolarization of the postsynaptic cell (in the absence of threshold; Ricciardi, 1977). In a stationary situation, they are related to the mean $\mu$ and variance $\sigma^2$ of the afferent current to the cell by

$$\mu_V = \tau_m \mu; \qquad \sigma_V^2 = \frac{1}{2}\tau_m \sigma^2.$$

Following the arguments above, the mean and (effective) variance of the white noise input current to the cells are given by

$$\mu = C J \nu, \qquad \sigma^2 \equiv \sigma^2_{eff} = C J^2 \nu\, CV^2.$$

Finally, it is easy to show that if the presynaptic afferents to the cell come from a set of different statistically homogeneous subpopulations, the previous expressions generalize readily to

$$\mu_i = \sum_j C_{ij} J_{ij} \nu_j \tag{2.5}$$

$$\sigma_i^2 \equiv \sigma^2_{i,eff} = \sum_j C_{ij} J_{ij}^2 \nu_j\, CV_j^2 \tag{2.6}$$

as long as the timescales of the correlations in the spike trains of the neurons in the different subpopulations are all of the same order. Inhibitory subpopulations are characterized by negative connection strengths.

2.2 Dynamics. A detailed characterization of the dynamics of the activity of the network is beyond the scope of this work. Since our main interest is the steady states of the network, we use a simple, effective dynamical


scheme that is consistent with the self-consistent equations that determine the steady states. In particular, we use the subthreshold dynamics of the first two moments of the depolarization in terms of the first two moments of the current (Ricciardi, 1977; Gillespie, 1992):

$$\frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2. \tag{2.7}$$

In using these equations, our assumption is that the activity of the population follows instantaneously the distribution of the depolarization. Thus, at every point in time, we use expressions 2.5 and 2.6 for $\mu$ and $\sigma^2$ appearing in the right-hand side of equations 2.7, which depend on the rate $\nu(\mu_V, \sigma_V)$ and $CV(\mu_V, \sigma_V)$ as given in equations 2.3 and 2.4. The only dynamical variables are therefore $\mu_V$ and $\sigma_V^2$ (Amit & Brunel, 1997b; Mascaro & Amit, 1999).

2.3 Numerical Analysis of the Analytic Results. The phase plane analysis of the reduced network was done using both custom-made C++ code and the program XPPaut. The integrals in equations 2.3 and 2.4 were calculated analytically for very large and very small values of the limits of integration (using asymptotic expressions for the error function; Abramowitz & Stegun, 1970) and numerically for values of the integration limits of order one. The corresponding C++ code was incorporated into XPPaut through the use of dynamically linked libraries for phase plane analysis. Some of the cusp diagrams were calculated without the use of XPPaut by the direct approach of looking for values of the parameters at which the number of fixed points changed abruptly.
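As an illustration of the numerical part of this procedure, the sketch below evaluates equations 2.3 and 2.4 by plain trapezoid quadrature in Python. It is not the C++ code used in the letter, and it covers only integration limits of order one, without the asymptotic expressions used for extreme limits. The function names, the grid size, and the finite cutoff replacing the lower limit −∞ are assumptions made for this sketch. The prefactor of equation 2.4 is written as 2π(ντm)², which keeps the CV dimensionless; 1 + erf(x) is evaluated as erfc(−x) to avoid cancellation at negative arguments.

```python
import math


def _trapz(f, lo, hi, n):
    """Composite trapezoid rule for f on [lo, hi] with n panels."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


def lif_rate_cv(mu_v, sigma_v, tau_m=0.010, tau_ref=0.002,
                v_th=20.0, v_res=10.0, n=2000):
    """Return (rate in Hz, CV) for voltages in mV and times in seconds."""
    a = (v_res - mu_v) / (sigma_v * math.sqrt(2.0))   # lower integration limit
    b = (v_th - mu_v) / (sigma_v * math.sqrt(2.0))    # upper integration limit

    # Integrand of eq. 2.3: exp(x^2) [1 + erf(x)] = exp(x^2) erfc(-x)
    def g(x):
        return math.exp(x * x) * math.erfc(-x)

    rate = 1.0 / (tau_ref + tau_m * math.sqrt(math.pi) * _trapz(g, a, b, n))

    # Inner integrand of eq. 2.4; it decays like exp(-y^2) / (pi y^2) for y << 0
    def q(y):
        return math.exp(y * y) * math.erfc(-y) ** 2

    lo = min(a, 0.0) - 8.0            # assumed finite stand-in for -infinity
    inner = _trapz(q, lo, a, n)       # inner integral up to the lower limit a
    h = (b - a) / n
    outer = 0.0
    x, q_prev, f_prev = a, q(a), math.exp(a * a) * inner
    for _ in range(n):
        x_next = x + h
        q_next = q(x_next)
        inner += 0.5 * h * (q_prev + q_next)          # extend inner integral
        f_next = math.exp(x_next * x_next) * inner
        outer += 0.5 * h * (f_prev + f_next)          # accumulate outer integral
        x, q_prev, f_prev = x_next, q_next, f_next

    cv = math.sqrt(2.0 * math.pi * (rate * tau_m) ** 2 * outer)
    return rate, cv


if __name__ == "__main__":
    for mu_v, sigma_v in ((10.0, 5.0), (30.0, 2.0)):
        rate, cv = lif_rate_cv(mu_v, sigma_v)
        print(f"mu_V = {mu_v} mV, sigma_V = {sigma_v} mV: "
              f"rate = {rate:.1f} Hz, CV = {cv:.2f}")
```

The two example calls contrast a fluctuation-driven case (mean depolarization below threshold) with a mean-driven case (mean depolarization above threshold); the second should give a higher rate and a markedly lower CV. Because exp(x²) overflows for |x| above roughly 26, the sketch is only meant for limits of moderate size.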

2.4 Numerical Simulations. We simulated an identical network to the one used in the mean-field description (see the captions of Figures 12 and 13 for parameters). In the simulation, on every time step dt = 50 µs, it is checked which neurons in the network receive any spikes. The membrane potential of cells that do not receive spikes is integrated analytically. The membrane potential of cells that receive spikes is integrated analytically within that dt, taking into account the synaptic postsynaptic potentials (PSPs) but assuming that there is no threshold. Only at the end of the time step is it checked whether the membrane potential is above threshold. If this is the case, the neuron is said to have produced a spike. This procedure effectively introduces a (very short but nonzero) synaptic integration time constant. Emitted spikes are fed back into the network using a system of queues to account for the synaptic delays (Mattia & Del Giudice, 2000).
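The time-stepped scheme described above can be sketched for a single neuron as follows. This is a simplified illustration, not the network simulation used in the letter: the presynaptic spikes are replaced by independent Poisson counts per time step, all kicks arriving within a step are applied at its start, synaptic delays and queues are omitted, and the input parameters (numbers of inputs, rates, and efficacies) are hypothetical values chosen so that the mean depolarization stays slightly below threshold.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count with mean lam (Knuth's method, for small lam)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_lif(t_sim=2.0, dt=50e-6, tau_m=0.010, tau_ref=0.002,
                 v_th=20.0, v_res=10.0,
                 c_e=1000, nu_e=14.0, j_e=0.2,   # hypothetical excitatory input
                 c_i=250, nu_i=10.0, j_i=0.4,    # hypothetical inhibitory input
                 seed=1):
    """Return (firing rate in Hz, CV of the ISIs) of one LIF neuron."""
    rng = random.Random(seed)
    decay = math.exp(-dt / tau_m)      # exact leak over one time step
    v, refractory_until, spikes = v_res, 0.0, []
    for step in range(int(round(t_sim / dt))):
        t = step * dt
        if t < refractory_until:
            continue                   # voltage clamped at reset, inputs ignored
        # Net kick from all presynaptic spikes arriving during this step
        kick = (j_e * poisson(rng, c_e * nu_e * dt)
                - j_i * poisson(rng, c_i * nu_i * dt))
        v = (v + kick) * decay         # integrate without a threshold
        if v >= v_th:                  # threshold checked only at end of step
            spikes.append(t + dt)
            v = v_res
            refractory_until = t + dt + tau_ref
    isis = [t1 - t0 for t0, t1 in zip(spikes, spikes[1:])]
    rate = len(spikes) / t_sim
    if len(isis) < 2:
        return rate, float("nan")
    mean_isi = sum(isis) / len(isis)
    var_isi = sum((x - mean_isi) ** 2 for x in isis) / (len(isis) - 1)
    return rate, math.sqrt(var_isi) / mean_isi


if __name__ == "__main__":
    rate, cv = simulate_lif()
    print(f"rate = {rate:.1f} Hz, CV = {cv:.2f}")
```

Checking the threshold only at the end of each step is what introduces the very short effective synaptic integration time mentioned in the text.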


3 Results

3.1 Network Architecture and Scaling Considerations. We first study the issue of how the architecture of the network determines the qualitative nature of the steady-state solutions for the firing rate of the cells in the network. In particular, we are interested in understanding under which conditions there are multiple steady-state solutions (bistability or, in general, multistability) in networks in which cells will fire with a high degree of irregularity. We consider a network for object working memory composed of a number of nonoverlapping subpopulations, or columns, defined by selectivity with respect to a given external stimulus. Each subpopulation contains both excitatory and inhibitory cells. The synaptic properties of neurons within the same column are assumed to be statistically identical. Thus, the column is, in our network, the minimal signaling unit at the average level, that is, all the neurons within a column behave identically on average. As will be shown below, the type of bistability realizable for the column depends critically on its size (more specifically, on the number of connections that a given cell receives from within its own column). Different architectures will thus be considered in which the total number of afferent connections per cell C is constant (and large) but the number of columns in the network varies, effectively varying the number of afferent connections from a given column to a cell. A multicolumnar architecture of this sort is inspired by the anatomical organization of the PFC, in which columnarly organized putative excitatory cells and interneurons show similar response profiles during working memory tasks (Rao, Williams, & Goldman-Rakic, 1999). As noted in section 1, many of the properties of the network can be inferred from a scaling analysis.
In the limit in which the connectivity is very sparse, so that correlations between the spike trains of different cells can be neglected (see section 2), the relevant parameter is the number of afferent connections per neuron C. We will investigate the behavior of the network in the limit C → ∞ (the “extensive” limit) since, in this case, the different scenarios become qualitatively different. Of course, physical neurons receive a finite number of connections, but the rationale is that the physical solution can be considered a small perturbation to the solution found in the C = ∞ case, which is much easier to characterize. One should keep in mind that even if C becomes very large, we still need to impose the sparseness condition for our analysis to be valid, which implies that it should always hold that N ≫ C. When considering current-based scenarios in the extensive limit, one is forced to normalize the connection strengths (the size of the unitary PSPs, which we denote generally by J) by (some power of) the number of connections per cell C, in order to keep the total afferent current within the (presynaptic) dynamic range of the neuron (whose order of magnitude is given by the distance between reset and threshold). As we show below,


different scaling schemes of $J$ with $C$ lead to different relative magnitudes of the mean and fluctuations of the afferent current into the cells in the extensive limit, and this in turn determines the type of steady-state solutions for the network. We thus proceed to analyze the expressions for the mean and variance of the afferent current (see equations 2.5 and 2.6) under different scaling assumptions. We consider multicolumnar networks in which the $C$ afferent connections to a given cell come from $N_c$ different “columns” (each contributing $C_c$ connections, so that $C = N_c C_c$). Each column is composed of an excitatory and an inhibitory subpopulation. The multicolumnar structure of the network is specified by the following scaling relationships,

$$N_c \equiv n_c C^{\alpha}, \qquad C_c \equiv n_c^{-1} C^{1-\alpha},$$

with $0 \le \alpha \le 1$ and $n_c$ order one, that is, independent of $C$. The case $\alpha = 0$ corresponds to a finite number $n_c$ of columns, each contributing a number of connections of order $C$. The case $\alpha = 1$ corresponds to an extensive number of columns, each contributing a number of connections of order one, that is, a fixed number as the total number of connections $C$ grows. Although connection strengths between the different subpopulations can all be different, we assume that they can be classified into two types according to their scaling with $C$: those between cells within the same column, of strength $J_w$, and those between cells belonging to different columns, of strength $J_b$ (the scaling is assumed to be the same for excitatory and for inhibitory connections). We define

$$J_w \equiv j_w C^{-\alpha_w}, \qquad J_b \equiv j_b C^{-\alpha_b},$$

where $\alpha_w, \alpha_b > 0$ and the $j$’s are all order one. In these conditions, the afferent current to the excitatory or inhibitory cells (it does not matter which, for this analysis) from their own subpopulation is characterized by

$$\mu_{in} = C_c\left[J_{Ew}\nu_{E_{in}} - J_{Iw}\nu_{I_{in}}\right] = C^{1-\alpha-\alpha_w}\left[j_{Ew}\nu_{E_{in}} - j_{Iw}\nu_{I_{in}}\right] n_c^{-1} \equiv C^{1-\alpha-\alpha_w} f_{\mu_{in}}$$

$$\sigma_{in}^2 = C_c\left[J_{Ew}^2\nu_{E_{in}} CV_{E_{in}}^2 + J_{Iw}^2\nu_{I_{in}} CV_{I_{in}}^2\right] = C^{1-\alpha-2\alpha_w}\left[j_{Ew}^2\nu_{E_{in}} CV_{E_{in}}^2 + j_{Iw}^2\nu_{I_{in}} CV_{I_{in}}^2\right] n_c^{-1} \equiv C^{1-\alpha-2\alpha_w} f_{\sigma_{in}},$$

where the $f$’s are linear combinations of rates and CVs weighted by factors


of order one. We proceed by assuming that all other columns are in the same state $\nu_{E,I}^{out}$, $\mathrm{CV}_{E,I}^{out}$, so that the current to the cells in the column under focus from the rest of the network is characterized by

$$\mu_{out} = (N_c - 1) C_c \left[ J_E^{b} \nu_E^{out} - J_I^{b} \nu_I^{out} \right] = C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ j_E^{b} \nu_E^{out} - j_I^{b} \nu_I^{out} \right] \equiv C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\mu}^{out}$$

$$\sigma_{out}^2 = (N_c - 1) C_c \left[ (J_E^{b})^2 \nu_E^{out} (\mathrm{CV}_E^{out})^2 + (J_I^{b})^2 \nu_I^{out} (\mathrm{CV}_I^{out})^2 \right] = C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ (j_E^{b})^2 \nu_E^{out} (\mathrm{CV}_E^{out})^2 + (j_I^{b})^2 \nu_I^{out} (\mathrm{CV}_I^{out})^2 \right] \equiv C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\sigma}^{out}.$$

In the extensive limit, the terms $(1 - n_c^{-1} C^{-\alpha})$ become equal to one if α > 0 and are of order one if α = 0, in which case they can be included as an extra multiplicative factor in the f terms. We thus omit them from now on. In addition to their recurrent inputs, cells receive a similar number of external excitatory inputs as well, but since we are interested in the generation of irregularity by the network, we will assume this external drive to be deterministic, that is, characterized by

$$\mu_{ext} = C J_{ext} \nu_{ext} = C^{1-\alpha_{ext}} j_{ext} \nu_{ext} \equiv C^{1-\alpha_{ext}} f_{ext},$$

with $J_{ext} = j_{ext} C^{-\alpha_{ext}}$ and αext > 0. The scaling with C of the different components of the total afferent current is thus given by

$$\mu_{in} = C^{1-\alpha-\alpha_w} f_{\mu}^{in}, \qquad \sigma_{in}^2 = C^{1-\alpha-2\alpha_w} f_{\sigma}^{in},$$

$$\mu_{out} = C^{1-\alpha_b} f_{\mu}^{out}, \qquad \sigma_{out}^2 = C^{1-2\alpha_b} f_{\sigma}^{out},$$

$$\mu_{ext} = C^{1-\alpha_{ext}} f_{ext}.$$

If α, αb, αw are such that the variances vanish as C → ∞, the corresponding networks will consist of regularly spiking neurons. Since we are interested in irregular spiking, we therefore look for solutions in which $\sigma_{in}^2$, $\sigma_{out}^2$, or both remain order one in the C → ∞ limit. There are several ways to achieve this.

3.1.1 Scenario 1: Homogeneous Balanced Network. This case is associated with the choice α = 0. In this case, the size of the columns is of the same order as the size of the whole network (i.e., the number of columns, nc, is order one), in which case the in and out quantities become equivalent. A finite variance is achieved by setting αw = αb = 1/2, that is, J ∝ 1/√C. This scenario is equivalent to the network studied originally in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998). In such


a network, the mean input from the recurrent network grows as the square root of the number of inputs,

$$\mu_{in} + \mu_{out} = \sqrt{C} \left( f_{\mu}^{in} + f_{\mu}^{out} \right).$$

This quantity can be positive or negative depending on the excitation-inhibition balance in the network. The overall mean input into the neurons is obtained by adding the external input: µ = µin + µout + µext. In order not to saturate the dynamic range of the cell in the extensive limit, the overall mean current into the neurons should remain of order one as C → ∞. Hence, we need

$$\mu = \sqrt{C} \left[ f_{\mu}^{in} + f_{\mu}^{out} + f_{ext} C^{\frac{1}{2} - \alpha_{ext}} \right] \sim O(1), \tag{3.1}$$

which is possible only if the term in square brackets vanishes as 1/√C. If the synapses from the external inputs vanish like 1/C (αext = 1), then the external input has a negligible contribution to the overall mean input to the cells. In this case, since both f µin and f µout are linear combinations of the firing rates of the neurons inside and outside the column under focus, equation 3.1 effectively becomes, in the extensive limit, a set of linear homogeneous equations for the activity of the different columns (note that although we have, for brevity, written only one, there are four equations like equation 3.1, for the excitatory and inhibitory subpopulations inside and outside the column under focus). Thus, unless the matrix of coefficients of the firing rates in f µin and f µout for the excitatory and inhibitory subpopulations is rank deficient, the only solution of equation 3.1 in the extensive limit is given by a silent, zero-rate network (van Vreeswijk & Sompolinsky, 1998). On the other hand, if αext = 1/2, the linear system defined by equations 3.1 in the extensive limit is not homogeneous anymore. Hence, in the general case (except for degeneracies), if αext = 1/2, there is a single self-consistent solution for the firing rates in the network, in which the activity in each subpopulation is proportional to the external drive νext (van Vreeswijk & Sompolinsky, 1998). This highlights the importance of a powerful external excitatory drive. When αext = 1/2, the external drive by itself would drive the cells to saturation if the recurrent connections were inactivated. In the presence of the recurrent connections, the activities in the excitatory and inhibitory subpopulations adjust themselves to compensate for this massive external drive. The firing rates in the self-consistent solution correspond to the only way in which this compensation can occur for all the subpopulations at the same time.
It follows that inasmuch as the different inputs to the neuron combine linearly, unless the connectivity matrix is degenerate, which requires some kind of fine-tuning mechanism, bistability in a large, homogeneous balanced network is not a robust property.
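The uniqueness argument for scenario 1 can be illustrated with a minimal sketch that solves the leading-order balance conditions for a single excitatory-inhibitory pair with αext = 1/2. The function name `balanced_rates` and all coupling values are hypothetical choices for illustration, not parameters taken from the text.

```python
def balanced_rates(j_ee, j_ei, j_ie, j_ii, j_e_ext, j_i_ext, nu_ext):
    """Solve the two leading-order balance conditions
        j_ee*nu_e - j_ei*nu_i + j_e_ext*nu_ext = 0
        j_ie*nu_e - j_ii*nu_i + j_i_ext*nu_ext = 0
    for the rates (nu_e, nu_i) with Cramer's rule."""
    # Determinant of the coefficient matrix [[j_ee, -j_ei], [j_ie, -j_ii]].
    det = j_ei * j_ie - j_ee * j_ii
    if abs(det) < 1e-12:
        raise ValueError("degenerate couplings: the balanced state is not unique")
    # Right-hand side after moving the external drive across the equals sign.
    b_e = -j_e_ext * nu_ext
    b_i = -j_i_ext * nu_ext
    nu_e = (j_ei * b_i - j_ii * b_e) / det
    nu_i = (j_ee * b_i - j_ie * b_e) / det
    return nu_e, nu_i


# Illustrative (assumed) couplings; only the external rate is varied.
couplings = dict(j_ee=1.0, j_ei=2.0, j_ie=1.0, j_ii=1.8, j_e_ext=1.0, j_i_ext=0.8)
for nu_ext in (0.0, 10.0, 20.0):
    print(nu_ext, balanced_rates(nu_ext=nu_ext, **couplings))
```

With these assumed couplings the solution is a single pair of rates that scales linearly with the external rate, and it collapses to the silent state when the external drive is removed, which is why a non-degenerate linear balance condition leaves no room for two coexisting states.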


3.1.2 Scenario 2: Homogeneous Multicolumnar Balanced Network. Since the linearity condition that results in a unique solution follows from the mean current to the cells from within the column growing with C, we impose that µin ∼ O(1), that is, 1 − α − αw = 0. If this is the case, the variance coming from within the column goes as $C^{\alpha-1}$. We consider first the case α < 1. In these conditions, the variance coming from the activity inside the column vanishes for large C. To keep the total variance finite, we set αb = 1/2. If we also choose α = 1/2, then αw = αb = 1/2, so the network is homogeneous in that all the connection strengths scale similarly with C regardless of whether they connect cells belonging to the same or different columns. Since α = 1/2, there are many columns in the network, and the number of connections coming from inside a particular column is a very small fraction (which decreases like 1/√C) of the total number of afferent connections to the cell. The fact that αb = 1/2 implies that the mean current coming from outside the column will still grow like √C. Thus, in order for the cells not to saturate, the excitation and the inhibition from outside the column have to balance precisely; the steady state of the rest of the network becomes, again, a unique, linear function of the external input to the cells (where again we choose αext = 1/2 to avoid the quiescent state). However, now the mean current coming from inside the column is independent of C, so the steady-state activity inside the column is not determined by a set of linear equations. Instead, it should be determined self-consistently using the nonlinear transfer function in equation 2.3, which, in principle, permits bistability. This scenario is, in fact, equivalent to the one studied in Brunel (2000b), where a systematic analysis of the conditions in which bistability in a network like this can exist has been performed.
(Although no explicit scaling of the synaptic connection strengths with C was assumed in Brunel, 2000b, the essential fact that the total variance to the cells in the subpopulation that supports bistability is constant is considered in that article.) As will be shown in detail below, the fact that the potential multiple steady-state solutions in this scenario differ only in the mean current to the cells, not in their variance (which is fixed by the balance condition on the rates outside the column), leads necessarily to a lower (in general, significantly lower) CV in the activity in the cells in the elevated persistent activity state. Therefore, in a network with J ∝ 1/√C scaling, bistability is possible in small subsets of neurons comprising a fraction ∝ 1/√C of the total number of connections per cell, but the elevated persistent activity states are characterized by a change in the mean drive to the cells at constant variance, and, as we show below, this leads to a significant decrease in the spiking irregularity in the elevated persistent activity states.

3.1.3 Scenario 3: Heterogeneous Multicolumnar Network. In order for the CV in the elevated persistent activity state to remain close to one, the variance of the afferent current to the cells inside the column should depend


on their own activity. Thus, in addition to the condition 1 − α − αw = 0 necessary for bistability, we have to impose that $\sigma_{in}^2 \propto C^{\alpha-1}$ be independent of C, that is, α = 1, which implies αw = 0. In these conditions, the extensive number of connections per cell come from an extensive number of columns, with the number of connections from each column remaining a finite number. The αw = 0 condition reflects the fact that since cells receive only a finite number of intracolumnar connections, the strength of these connections does not need to scale in any specific way with C. As for the activity outside the network, one could now, in principle, choose either J b ∝ 1/C or J b ∝ 1/√C (corresponding to αb = 1, 1/2, respectively), since there is already a finite amount of variance coming from within the column. In the first case, the rest of the network contributes only a noiseless deterministic current whose exact amount has to be determined self-consistently, and in the second it contributes both to the total mean and variance of the afferent current to the neurons in the column. In this last case, as in the previous two scenarios, the J b ∝ 1/√C scaling results in the need for balance between the total excitation and inhibition outside the network, which (again, if αext = 1/2) leads to a unique solution for the activity of the rest of the population linear in the external drive to these neurons. In this scenario, the network is heterogeneous since the strength of the connections from neurons within the same column is larger than those from neurons in other columns. Since the rate and CV of the cells inside the column have to be determined self-consistently in this case, we proceed to do a systematic quantitative analysis of this scenario in the next section. From the scaling considerations described in this section, it is already clear, though, that a potential bistable solution with high CV is possible only in a small network.
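The three scenarios can be compared side by side by tabulating the power of C carried by each component of the afferent current. The following is a small sketch under the scaling relations stated in this section; the function name `current_exponents` and the dictionary keys are hypothetical labels, and αext = 1/2 is assumed in every scenario.

```python
from fractions import Fraction


def current_exponents(alpha, alpha_w, alpha_b, alpha_ext):
    """Return the exponent of C for each component of the afferent current."""
    return {
        "mu_in": 1 - alpha - alpha_w,       # mean from inside the column
        "var_in": 1 - alpha - 2 * alpha_w,  # variance from inside the column
        "mu_out": 1 - alpha_b,              # mean from the other columns
        "var_out": 1 - 2 * alpha_b,         # variance from the other columns
        "mu_ext": 1 - alpha_ext,            # deterministic external drive
    }


half = Fraction(1, 2)
# (alpha, alpha_w, alpha_b, alpha_ext) for each scenario described above.
SCENARIOS = {
    "1: homogeneous balanced": (0, half, half, half),
    "2: homogeneous multicolumnar": (half, half, half, half),
    "3: heterogeneous multicolumnar": (1, 0, half, half),
}

for name, params in SCENARIOS.items():
    exponents = current_exponents(*params)
    print(name, {key: str(value) for key, value in exponents.items()})
```

An exponent of zero marks a component that stays of order one as C grows, a negative exponent marks one that vanishes, and a positive exponent marks one that must be cancelled by balance. Only scenario 3 keeps both the mean and the variance from inside the column at order one.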

3.2 Types of Bistable Solutions in a Reduced Network. In this section, we consider the network described in scenario 3 in the previous section, with the choice αb = 1/2, and analyze the types of steady-state solutions for the activity in a particular column of finite size. The rest of the network is in a balanced state, and its activity is completely decoupled from the activity of the column, which is too small in size to make a difference in the overall input to the rest of the cells. For our present purposes, all that matters about the afferents outside the column (from both the rest of the network and the external ones) is that they provide a finite net input to the cells in the column. We denote the mean and variance of that fixed external current by $\mu^{\mathrm{ext}}_{E,I}/\tau_m$ and $2(\sigma^{\mathrm{ext}}_{E,I})^2/\tau_m$, where the factors with the membrane time constant τm have been included so that $\mu^{\mathrm{ext}}_{E,I}$ and $(\sigma^{\mathrm{ext}}_{E,I})^2$ represent the contribution to the mean and variance of the postsynaptic depolarization (in the absence of threshold) in the steady states arising from outside the column.


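The net input to the excitatory and inhibitory populations in the column under focus is thus characterized by

$$\mu_E = c_{EE} j_{EE} \nu_E - c_{EI} j_{EI} \nu_I + \frac{\mu^{\mathrm{ext}}_E}{\tau_m}$$

$$\mu_I = c_{IE} j_{IE} \nu_E - c_{II} j_{II} \nu_I + \frac{\mu^{\mathrm{ext}}_I}{\tau_m}$$

$$\sigma_E^2 = c_{EE} j_{EE}^2 \nu_E \mathrm{CV}_E^2 + c_{EI} j_{EI}^2 \nu_I \mathrm{CV}_I^2 + \frac{(\sigma^{\mathrm{ext}}_E)^2}{\tau_m/2}$$

$$\sigma_I^2 = c_{IE} j_{IE}^2 \nu_E \mathrm{CV}_E^2 + c_{II} j_{II}^2 \nu_I \mathrm{CV}_I^2 + \frac{(\sigma^{\mathrm{ext}}_I)^2}{\tau_m/2},$$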

where the number-of-connections and connection-strength parameters c and j are all order one. We proceed by simplifying this scheme further in order to reduce the dimensionality of the system from four to two, which will allow a systematic exploration of the effect of all the parameters on the type of steady-state solutions of the network. In particular, we make the inputs to the excitatory and inhibitory populations identical,

$$c_{IE} = c_{EE} \equiv c_E, \qquad c_{II} = c_{EI} \equiv c_I, \qquad j_{IE} = j_{EE} \equiv j_E, \qquad j_{II} = j_{EI} \equiv j_I,$$

$$\mu^{\mathrm{ext}}_E = \mu^{\mathrm{ext}}_I \equiv \mu^{\mathrm{ext}}, \qquad (\sigma^{\mathrm{ext}}_E)^2 = (\sigma^{\mathrm{ext}}_I)^2 \equiv (\sigma^{\mathrm{ext}})^2,$$

so that the whole column becomes statistically identical: ν E = ν I ≡ ν and CV E = CV I ≡ CV. For simplicity, we also assume that the number of excitatory and inhibitory inputs is the same: c E = c I = c. Thus, we are left with a system with four parameters,

$$c_\mu = c\,(j_E - j_I); \qquad c_\sigma = \sqrt{c\,(j_E^2 + j_I^2)}; \qquad \mu^{\mathrm{ext}}; \qquad \sigma^{\mathrm{ext}}, \tag{3.2}$$

all with units of mV, and two dynamical variables (from equation 2.7)

$$\frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2,$$

where

$$\mu = c_\mu \nu + \frac{\mu^{\mathrm{ext}}}{\tau_m}; \qquad \sigma^2 = c_\sigma^2\, \nu\, \mathrm{CV}^2 + \frac{(\sigma^{\mathrm{ext}})^2}{\tau_m/2} \tag{3.3}$$

[Figure 2 panels, left to right: firing rate (Hz), CV, and firing rate × CV² (Hz), each plotted over the (µ, σ) plane, both in mV.]

Figure 2: Mean firing rate ν (left), CV (middle), and product νCV2 (right) of the LIF neuron as a function of the mean and standard deviation of the depolarization. Parameters: Vth = 20 mV, Vres = 10 mV, τm = 10 ms, and τref = 2 ms.

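and

$$\nu(\mu_V, \sigma_V)^{-1} = \tau_{ref} + \tau_m \sqrt{\pi} \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \left[ 1 + \mathrm{erf}(x) \right] \tag{3.4}$$

$$\mathrm{CV}^2(\mu_V, \sigma_V) = 2\pi \nu^2 \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2} \left[ 1 + \mathrm{erf}(y) \right]^2 \tag{3.5}$$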
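As an illustration of how the transfer function in equation 3.4 can be evaluated numerically, the sketch below integrates it with Simpson's rule using only the Python standard library. The function names, the switch to an asymptotic expansion below x = −5, the cutoff at an upper limit of 25, and the grid size are all assumptions of this sketch. The default parameters are those quoted in the caption of Figure 2, with times converted to seconds. Equation 3.5 is not implemented here.

```python
import math


def scaled_erfc(x):
    """Return exp(x**2) * (1 + erf(x)), avoiding overflow for very negative x."""
    if x < -5.0:
        # Asymptotic series of exp(z^2) * erfc(z) for z = -x large and positive.
        inv2 = 1.0 / (x * x)
        series = 1.0 - 0.5 * inv2 + 0.75 * inv2 ** 2 - 1.875 * inv2 ** 3
        return series / (math.sqrt(math.pi) * -x)
    return math.exp(x * x) * math.erfc(-x)


def lif_rate(mu_v, sigma_v, v_th=20.0, v_res=10.0,
             tau_m=0.010, tau_ref=0.002, n=2000):
    """Firing rate in Hz from equation 3.4 (voltages in mV, times in s)."""
    lower = (v_res - mu_v) / (sigma_v * math.sqrt(2.0))
    upper = (v_th - mu_v) / (sigma_v * math.sqrt(2.0))
    if upper > 25.0:
        # exp(x^2) would overflow; the rate is numerically zero in this regime.
        return 0.0
    # Composite Simpson's rule on n (even) subintervals.
    h = (upper - lower) / n
    total = scaled_erfc(lower) + scaled_erfc(upper)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * scaled_erfc(lower + k * h)
    integral = total * h / 3.0
    return 1.0 / (tau_ref + tau_m * math.sqrt(math.pi) * integral)


# Rate along a cut at fixed sigma_V = 5 mV, from subthreshold to threshold.
for mu_v in (10.0, 15.0, 20.0):
    print(mu_v, lif_rate(mu_v, 5.0))
```

A convenient sanity check is the low-noise suprathreshold limit, where the rate should approach the deterministic value 1/(τref + τm ln[(µV − Vres)/(µV − Vth)]).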

The parameters c µ and c σ2 measure the strength of the feedback that the activity in the column produces on the mean and variance of the current to the cells. c µ can be less than, equal to, or greater than zero. A value larger (less) than zero implies that the activity in the column has a net excitatory (inhibitory) effect on the neurons. In general, we assume the positive parameter c σ2 to be independent of c µ (implying the recurrent feedback on the mean and on the variance can be manipulated independently). Note, however, that since j I /j E > 0, c σ2 cannot be arbitrarily small, that is, $c_\sigma^2 > c_\mu^2/c$. Equations 3.4 and 3.5 are plotted as a function of µV and σV in Figure 2.

3.2.1 Mean- and Fluctuation-Driven Bistability in the Reduced System. The nullclines for the two equations 3.3 are given by

$$\nu(\mu_V, \sigma_V) = \frac{\mu_V - \mu^{\mathrm{ext}}}{\tau_m c_\mu}; \qquad \nu(\mu_V, \sigma_V)\, \mathrm{CV}^2(\mu_V, \sigma_V) = \frac{\sigma_V^2 - (\sigma^{\mathrm{ext}})^2}{(\tau_m/2)\, c_\sigma^2}.$$
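For completeness, the nullcline conditions follow in one step from equations 3.3 by setting each time derivative to zero. The short derivation below only rearranges expressions already given in the text.

```latex
\begin{align*}
0 = -\frac{\mu_V}{\tau_m} + c_\mu \nu + \frac{\mu^{\mathrm{ext}}}{\tau_m}
  &\;\Longrightarrow\;
  \nu = \frac{\mu_V - \mu^{\mathrm{ext}}}{\tau_m c_\mu}, \\
0 = -\frac{\sigma_V^2}{\tau_m/2} + c_\sigma^2\, \nu\, \mathrm{CV}^2
    + \frac{(\sigma^{\mathrm{ext}})^2}{\tau_m/2}
  &\;\Longrightarrow\;
  \nu\, \mathrm{CV}^2 = \frac{\sigma_V^2 - (\sigma^{\mathrm{ext}})^2}{(\tau_m/2)\, c_\sigma^2}.
\end{align*}
```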

The values of (µV , σV ) that satisfy these equations are shown in Figure 3 for several values of the parameters c µ , c σ . The nullclines for the mean (see Figure 3, left) are the projection on the (µV , σV ) plane of the intersection of the surface in Figure 2, left, with a plane parallel to the σV axis, shifted by µext and tilted (i.e., with slope) at a rate 1/(τm c µ ). Since the mean firing

[Figure 3 panels: nullclines for the mean (left) and for the standard deviation (right) in the (µ, σ) plane, both in mV. Left: curves for c µ < 0, c µ = 0, and c µ > 0, with µext = 15 mV. Right: curves for c σ = 0 and c σ > 0, with σ ext = 2 mV.]

Figure 3: Nullclines of the equation for the mean µV (left) and standard deviation σV (right). The nullcline for the mean depends on only c µ , and the one for the standard deviation depends on only c σ .

rate as a function of the average depolarization changes curvature (it has an inflection point near threshold, where firing changes from being driven by fluctuations to being driven by the mean), the nullcline for the mean has a “knee” when the net feedback c µ is large enough and excitatory. Similarly, the nullclines for the standard deviation of the depolarization (see Figure 3, right) are the projection on the (µV , σV ) plane of the intersection of the surface in Figure 2, right, with a parabolic surface parallel to the µV axis, shifted by (σ ext )2 and with a curvature 2/τm c σ2 . Again, this curve can display a “knee” for high enough values of the net strength of the feedback onto the variance c σ2 . The fixed points of the system are given by the points of intersection of the two nullclines. We now show, through two particular examples, the main result of this letter: depending on the degree of balance between excitation and inhibition, two types of bistability can exist: mean driven and fluctuation driven.

Mean-Driven Bistability. Figure 4 shows a particular example of the type of bistability obtained for low-moderate values of c σ and moderate-high values of c µ . Figure 4a shows the time evolution of the firing rate (bottom) and CV (top) in the network when the external drive to the cells is transiently elevated. In response to the transient input, the network switches from a low-rate, high-CV basal state to an elevated activity state. For this particular type of bistability, the CV in this state is low. The nullclines for this example are shown in Figure 4b. The σV nullcline is essentially parallel to the µV axis, and it intercepts the µV nullcline (which has a pronounced “knee”) at three points: one below (stable), one around (unstable), and one above

[Figure 4 panels: (a) CV (top) and firing rate in Hz (bottom) versus time (s). (b) Nullclines for µ and σ in the (µ, σ) plane, both in mV, with fixed points labeled Rate = 1.42 Hz, CV = 0.94 and Rate = 35.5 Hz, CV = 0.24.]

Figure 4: Example of mean-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the mean of the external drive to the neurons was elevated from 18 mV to 19 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is low. (b) Nullclines for this example. The two stable fixed points differ primarily in the mean current that the cells are receiving, with an essentially constant variance. Hence, the CV in the elevated persistent-activity fixed point is low. Parameters: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Dotted line: neuronal threshold.

(stable) threshold. The stable fixed point below threshold corresponds to the state of the system before the external input is transiently elevated in Figure 4a. It is typically characterized by a low rate and a relatively high CV, as subthreshold spiking is fluctuation driven and thus irregular. However, since the CV drops fast for µV > Vth at the values of σV at which the µV


nullcline bends (see Figure 2, middle), the CV in the suprathreshold fixed point is typically low (but see section 4). Fluctuations in the current to the cells play little role in determining the spike times of the cells in this elevated persistent activity state. Qualitatively, this is the type of excitation-driven bistability that has been thoroughly analyzed over the past few years. It is expected to be present in networks in which small populations of excitatory cells can be bistable in the presence of global inhibition by virtue of selective synaptic potentiation. It is also expected to be present in larger populations if the fluctuations arising from the recurrent connections are weak compared to those coming from external afferents, for instance, due to synaptic filtering (Wang, 1999).

Fluctuation-Driven Bistability. If the connectivity is such that the mean drive to the neurons is only weakly dependent on the activity, that is, c µ is small, but at the same time the activity has a strong effect on the variance, that is, c σ is large, the system can also have two stable fixed points, as shown in the example in Figure 5 (same format as in the previous figure). In this situation, however, the two fixed points are subthreshold, and they differ primarily in the variance of the current to the cells. Hence, spiking in both fixed points is fluctuation driven, and the CV is high in both of them; in particular, it is slightly higher in the elevated activity fixed point (see Figure 5a). This type of bistability can be realized only if there is a precise balance between the net excitatory and inhibitory drive to the cells. Since c σ must be large in order for the σV nullcline to display a “knee,” both the net excitation and inhibition should be large, and in these conditions, a small c µ can be achieved only if the balance between the two is precise.
This suggests that this regime will be sensitive to changes in the parameters determining the connectivity; that is, it will require fine-tuning, a conclusion that is supported by the analysis below. Mean- and fluctuation-driven bistability are not discrete phenomena. Depending on the values of the parameters, the elevated activity fixed point can rotate in the (µV , σV ) plane, spanning intermediate values from those shown in the examples in Figures 4 and 5. We thus now proceed to a systematic characterization of all possible behaviors of the reduced system as a function of its four parameters.

3.2.2 Effect of the External Current. Since c σ2 > 0, the σV nullcline always bends upward (see Figure 3), that is, the values of σV in the nullcline are always larger than σ ext . Assuming for simplicity that c σ can be arbitrarily low, this implies that no bistability can exist unless the external variance is low enough. In particular, for every value of the external mean µext , there is a critical value $\sigma^{\mathrm{ext}}_{c1}$ defined as the value of σV at which the first two derivatives of the µV nullcline vanish (see Figure 6, middle). For $\sigma^{\mathrm{ext}} > \sigma^{\mathrm{ext}}_{c1}$, the two nullclines can cross only once, and therefore no bistability of any

[Figure 5 panels: (a) CV (top) and firing rate in Hz (bottom) versus time (s). (b) Nullclines for µ and σ in the (µ, σ) plane, both in mV, with fixed points labeled Rate = 2.4 Hz, CV = 1.02 and Rate = 59.3 Hz, CV = 1.17.]

Figure 5: Example of fluctuation-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the standard deviation of the external drive to the neurons was elevated from 5 mV to 7 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is slightly higher than in the basal state. (b) Nullclines for this example. The two stable fixed points differ primarily in the variance of the current that the cells are receiving, with little change in the mean. Hence, the CV in the elevated persistent-activity fixed point is slightly higher than in the low-activity fixed point; that is, the CV increases with the rate. Parameters: µext = 5 mV, σ ext = 5 mV, c µ = 5 mV, c σ = 20.2 mV. Dotted line: Neuronal threshold.

kind is possible in the reduced system (see Figure 6, left). For values of σ ext only slightly lower than this critical value, the jump between the low- and the high-activity stable fixed points in the (µV , σV ) plane is approximately horizontal, so the type of bistability obtained is mean driven. For lower

[Figure 6 panels, left to right: µ and σ nullclines in the (µ, σ) plane, both in mV, for σ ext > σ ext c1, σ ext = σ ext c1, and σ ext = σ ext c2.]

Figure 6: The external variance determines the existence and types of bistability possible in the network. For $\sigma^{\mathrm{ext}} > \sigma^{\mathrm{ext}}_{c1}$ (left), no bistability is possible. $\sigma^{\mathrm{ext}} = \sigma^{\mathrm{ext}}_{c1}$ marks the onset of bistability (middle). At $\sigma^{\mathrm{ext}} = \sigma^{\mathrm{ext}}_{c2}$ bistability becomes possible in a perfectly balanced network (a network with c µ = 0) (right).

values of the external variance, a point is eventually reached at which bistability becomes possible in a perfectly balanced network. Again, for each µext , one can define a second critical value $\sigma^{\mathrm{ext}}_{c2}$ in the following way: $\sigma^{\mathrm{ext}}_{c2}$ is the value of σ ext at which the point where the derivative at the inflection point of the σV nullcline is infinite occurs at a value of µV equal to µext (see Figure 6, right). For values of $\sigma^{\mathrm{ext}} < \sigma^{\mathrm{ext}}_{c2}$, bistability is possible in networks in which the net recurrent feedback is inhibitory. Since both critical values of σ ext are functions of the external mean, they define curves in the (µext , σ ext ) plane. These curves are plotted in Figure 7. Both are decreasing functions of µext and meet at threshold, implying that bistability in the reduced network is possible only for subthreshold mean external inputs (see section 4).

3.2.3 Phase Diagrams of the Reduced Network. For each point in the (µext , σ ext ) plane, the external current is completely characterized, and the only two parameters left to be specified are c µ , c σ . In particular, in the regions where bistability is possible, it will exist for only appropriate values of c µ and c σ . The two insets in Figure 7 show phase diagrams in the (c µ , c σ ) plane showing the regions of bistability in two representative points: one in which $\sigma^{\mathrm{ext}}_{c2} < \sigma^{\mathrm{ext}} < \sigma^{\mathrm{ext}}_{c1}$, in which bistability is possible only in excitation-dominated networks (top-right inset), and one in which $\sigma^{\mathrm{ext}} < \sigma^{\mathrm{ext}}_{c2}$, in which bistability is possible in both excitation- and in inhibition-dominated networks (bottom-left inset). In this latter case, the region enclosed by the curve in which bistability can exist stretches to the left, including the region with c µ < 0.
We have further characterized the nature of the fixed-point solutions in these two cases by plotting the rate and CV on each point in the (c µ , c σ ) plane on which bistability is possible, as well as the ratio between the rate and CV in the high- versus low-activity states. Instead of showing this in the

[Figure 7: main panel shows the critical curves σ ext c1 and σ ext c2 in the (µext , σ ext ) plane, both in mV, with the region above σ ext c1 labeled "No Bistability". Two insets show bistable regions in the (c µ , c σ ) plane, both in mV.]

Figure 7: Phase diagram with the effect of the mean and variance of the external current on the existence and types of bistability in the network. The two insets represent the regions of bistability in the (c µ , c σ ) plane at the corresponding points in the (µext , σ ext ) plane. Fluctuation-driven bistability is possible only near and below the lower critical line $\sigma^{\mathrm{ext}} = \sigma^{\mathrm{ext}}_{c2}(\mu^{\mathrm{ext}})$. Top-right and bottom-left insets correspond to µext = 10 mV; σ ext = 4 mV and µext = 10 mV; σ ext = 3 mV, respectively.

(c µ , c σ ) plane, we have inverted equations 3.2 to show (assuming a constant c = 100) the results as a function of the unitary EPSP j E and of the ratio of the unitary inhibitory to excitatory PSPs j I /j E , which measures the degree of balance in the network, that is, j I /j E = 1 implies a perfect balance:

$$j_E = \frac{c_\mu + \sqrt{2 c\, c_\sigma^2 - c_\mu^2}}{2c}; \qquad j_I/j_E = \frac{\sqrt{2 c\, c_\sigma^2 - c_\mu^2} - c_\mu}{\sqrt{2 c\, c_\sigma^2 - c_\mu^2} + c_\mu}. \tag{3.6}$$
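The change of variables between equations 3.2 and 3.6 can be checked with a short round trip. This is a minimal sketch; the function names `to_feedback` and `to_synaptic` are hypothetical, and the numerical values are chosen only for illustration.

```python
import math


def to_feedback(j_e, j_i, c):
    """Equation 3.2: map synaptic strengths (mV) to (c_mu, c_sigma) in mV."""
    c_mu = c * (j_e - j_i)
    c_sigma = math.sqrt(c * (j_e ** 2 + j_i ** 2))
    return c_mu, c_sigma


def to_synaptic(c_mu, c_sigma, c):
    """Equation 3.6: map (c_mu, c_sigma) back to (j_e, j_i / j_e)."""
    disc = 2.0 * c * c_sigma ** 2 - c_mu ** 2
    # Positive j_e and j_i require c_sigma**2 > c_mu**2 / c.
    if disc <= c_mu ** 2:
        raise ValueError("need c_sigma**2 > c_mu**2 / c")
    root = math.sqrt(disc)
    j_e = (c_mu + root) / (2.0 * c)
    ratio = (root - c_mu) / (root + c_mu)
    return j_e, ratio


# Round trip with illustrative values: j_E = 0.5 mV, j_I = 0.4 mV, c = 100.
c_mu, c_sigma = to_feedback(0.5, 0.4, 100)
print(c_mu, c_sigma)
print(to_synaptic(c_mu, c_sigma, 100))
```

The guard in `to_synaptic` encodes the constraint that c σ2 must exceed c µ2 /c, so parameter pairs that would require a non-positive synaptic strength are rejected rather than silently converted.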

In Figure 8 we show the results for the case where the external variance is higher than $\sigma^{\mathrm{ext}}_{c2}$, so that bistability is possible only if the net recurrent connectivity is excitatory. Overall, the shape of the bistable region in this space is a diagonal to the right. This means that closer to the balanced region, the net excitatory drive (proportional to j E ) has to be higher in order for bistable solutions to exist. The low-activity fixed point (left column) is subthreshold, and thus spiking is fluctuation driven, characterized by a high CV. In this case, the high-activity fixed point is suprathreshold, so the CV in this fixed point is in general small (see the bottom right). Of course,

[Figure 8 panels in the ( j E , j I /j E ) plane, with j E in mV: rate in the low-activity state (Hz, top left), ratio of rates between high- and low-activity states (top right), CV in the low-activity state (bottom left), and ratio of CVs between high- and low-activity states (bottom right). Inset: (c µ , c σ ) plane.]

Figure 8: Bistability in an excitation-dominated network in the ( j E , j I /j E ) plane. (Top and bottom left) Firing rate and CV in the low-activity fixed point. (Top and bottom right) Ratio of the firing rate and CV between the high- and low-activity fixed points. (Inset) same as bottom-right panel in the (c µ , c σ ) plane. Parameters: c = 100, µext = 10 mV, and σ ext = 4 mV.

very close to the cusp, at the onset of bistability, the CV (and rate) in both fixed points is similar. In Figure 9 the same information is shown for the case where the external variance is lower than $\sigma^{\mathrm{ext}}_{c2}$ so that bistability is possible when the recurrent connectivity is dominated by excitation or inhibition. In this case, the region where the CV in the high- and low-activity fixed points is similar is larger, corresponding to situations in which j I /j E ∼ 1, that is, excitation and inhibition in the network are roughly balanced. Only in this region is the firing rate in the high-activity state not too high, ≲ 100 Hz. In this regime, when excitation dominates, the rate in the high-activity state becomes very high. Note also that the transition between the relatively low-rate, high-CV regime and the very high-rate, low-CV regime at j I /j E ∼ 0.9 is relatively abrupt. Finally, we can use the relations 3.6 to study quantitatively the effect of the number of connections c on the regions of bistability, something we

[Figure 9 panels in the ( j E , j I /j E ) plane, with j E in mV: rate in the low-activity state (Hz, top left), ratio of rates between high- and low-activity states (top right), CV in the low-activity state (bottom left), and ratio of CVs between high- and low-activity states (bottom right). Inset: (c µ , c σ ) plane.]

Figure 9: Mean and fluctuation-driven bistability in the ( j E , j I /j E ) plane. Panels as in Figure 8. (Inset) Portion of the bistability region with fluctuation-driven fixed points in the (c µ , c σ ) plane. When the network is approximately balanced, that is, j I /j E ∼ 1, the CV in the high-activity state is high. Parameters: c = 100, µext = 10 mV, and σ ext = 3 mV.

did in section 3.1 at a qualitative level based on scaling considerations. Figure 10 shows the effect of increasing the number of afferent connections per cell on the shape of the region of bistability for σ ext = 4 mV (left) and for σ ext = 3 mV (right). The results are clearer when shown in the plane ( j E c, j I /j E ), where j E c represents the net mean excitatory drive to the neurons. The range of values of the net excitatory drive in which bistability is allowed in the excitation-dominated regime, where j I /j E < 1, does not depend very strongly on c. However, for both σ ext = 4 mV and σ ext = 3 mV, when inhibition and excitation become more balanced, a higher net excitatory drive is needed. In particular, when σ ext = 3 mV, the bistable region always includes the balanced network, j I /j E = 1, but the range of values of j I /j E ∼ 1 where bistability is possible (being in this case fluctuation driven) shrinks considerably. Thus, as noted in section 3.1, bistability in a large, balanced network requires a precise balance of excitation and inhibition.
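The shrinking of the balanced bistable range with c can be made concrete from relations 3.6. For a fixed point (c µ , c σ ) of the reduced system, the imbalance 1 − j I /j E equals 2c µ /(√(2c c σ2 − c µ2 ) + c µ ), which for large c behaves like √(2/c) c µ /c σ . The sketch below evaluates this for the three values of c used in Figure 10. The function name `imbalance` is hypothetical, and the values c µ = 0.5 mV and c σ = 19 mV are an illustrative fluctuation-driven choice.

```python
import math


def imbalance(c_mu, c_sigma, c):
    """Return 1 - j_I/j_E implied by equation 3.6 for given (c_mu, c_sigma, c)."""
    root = math.sqrt(2.0 * c * c_sigma ** 2 - c_mu ** 2)
    # 1 - (root - c_mu)/(root + c_mu) simplifies to 2*c_mu/(root + c_mu).
    return 2.0 * c_mu / (root + c_mu)


C_MU, C_SIGMA = 0.5, 19.0  # mV, illustrative fluctuation-driven point
for c in (100, 1000, 10000):
    exact = imbalance(C_MU, C_SIGMA, c)
    large_c = math.sqrt(2.0 / c) * C_MU / C_SIGMA  # leading-order estimate
    print(c, exact, large_c)
```

Because the imbalance that corresponds to a fixed (c µ , c σ ) falls off like 1/√c, a bistable region of fixed size in the (c µ , c σ ) plane maps onto an ever narrower band around j I /j E = 1 as the number of connections grows.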

[Figure 10 panels: bistable regions in the ( j E c, j I /j E ) plane, with j E c in mV, for c = 100, c = 1000, and c = 10000. Left: σ ext = 4 mV. Right: σ ext = 3 mV.]

Figure 10: Effect of number of connections c on the phase diagram in the ( j E c, j I /j E ) plane for the case where µext = 10 mV and σ ext = 4 mV (left) and σ ext = 3 mV (right). Note that in the right panel, j I /j E = 1 is always in the region of bistability, but the range of values of j I /j E ∼ 1 in the bistable region decreases significantly with c.

The precise balance between excitation and inhibition required to obtain solutions with fluctuation-driven bistability is also evident when one analyzes the effect of changing the net input to the excitatory and inhibitory subpopulations within a column. In this case, the mean firing rate and CV of the excitatory and inhibitory populations become different. We have chosen to study the effect of different ratios of excitation and inhibition to the excitatory and inhibitory populations. In particular, defining

\[
\gamma_E \equiv \frac{j_{EI}}{j_{EE}} \quad \text{and} \quad \gamma_I \equiv \frac{j_{II}}{j_{IE}},
\]

we have considered the effect of having γ E ≠ γ I while still considering that the excitatory connections to the excitatory and inhibitory populations are equal, that is, jEE = jIE ≡ j E . To proceed with the analysis, we started by specifying a point in the parameter space of the symmetric network, in which the excitatory and inhibitory subpopulations were identical, by choosing a value for (µext , σ ext , c µ , c σ ). Then, fixing c = 200, we used the relationships 3.6 to solve for j E and γ and, defining γ E ≡ γ , we found which values of γ I resulted in bistable solutions. Correspondingly, when γ I = γ E , the two subpopulations within the column become identical again. We performed this analysis for two initial sets of parameters of the symmetric network: one corresponding to mean-driven and the other to fluctuation-driven bistability. The results of this analysis are shown in Figure 11. The type of bistability does not change qualitatively depending on the value of γ I /γ E in the mean-driven case (left column). For the right

[Figure 11 panels: mean-driven (left column) and fluctuation-driven (right column). Rows: Firing Rate (Hz) and CV, each plotted against γ I /γ E . Curves: low activity and high activity; the bistable region is marked in each panel.]

Figure 11: Effect of different levels of input to the excitatory and inhibitory subpopulations. The ratio between the inhibitory and excitatory connection strengths γ was allowed to be different for each subpopulation. Left and right columns correspond to mean- and fluctuation-driven bistability in the corresponding situation for the symmetric network. The network is bistable for values of γ I /γ E within the dashed lines. Parameters on the left column: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Parameters on the right column: µext = 10 mV, σ ext = 3 mV, c µ = 0.5 mV, c σ = 19 mV.
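The step of solving for j E and γ from a chosen (c µ , c σ ) at fixed c can be sketched numerically. The relationships 3.6 are not reproduced in this part of the text, so the sketch below assumes they take the form c µ = c ( j E − j I ) and c σ² = c ( j E² + j I²) with γ = j I /j E , which is the form that reproduces the values quoted in the captions of Figures 12 and 13. The function names are illustrative only.

```python
import math

def recurrent_stats(j_e, gamma, c):
    """Assumed form of relationships 3.6: return (c_mu, c_sigma) in mV."""
    j_i = gamma * j_e
    c_mu = c * (j_e - j_i)
    c_sigma = math.sqrt(c * (j_e ** 2 + j_i ** 2))
    return c_mu, c_sigma

def solve_je_gamma(c_mu, c_sigma, c):
    """Invert the assumed relationships for (j_e, gamma)."""
    # a = (1 - gamma)^2 / (1 + gamma^2), which lies in [0, 1) for gamma > 0
    a = c_mu ** 2 / (c * c_sigma ** 2)
    if not 0.0 <= a < 1.0:
        raise ValueError("no solution with 0 < gamma < infinity")
    b = 1.0 - a
    root = math.sqrt(1.0 - b * b)
    # The two roots are gamma and 1/gamma; the sign of c_mu selects the branch:
    # net excitation (c_mu > 0) means gamma < 1, net inhibition means gamma > 1.
    gamma = (1.0 - root) / b if c_mu >= 0 else (1.0 + root) / b
    j_e = c_sigma / math.sqrt(c * (1.0 + gamma ** 2))
    return j_e, gamma

if __name__ == "__main__":
    # Starting points quoted in the caption of Figure 11, with c = 200.
    for label, c_mu, c_sigma in [("mean-driven", 7.2, 1.0),
                                 ("fluctuation-driven", 0.5, 19.0)]:
        j_e, gamma = solve_je_gamma(c_mu, c_sigma, 200)
        print(label, "j_E =", round(j_e, 4), "mV, gamma =", round(gamma, 4))
```

Under these assumptions, the fluctuation-driven starting point sits very close to γ = 1, which is consistent with the narrow range of γ I /γ E shown in the right column of Figure 11.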

column, however, the original fluctuation-driven regime is quickly abolished as γ I /γ E increases, leading to very high activity and low CV in the high-activity fixed point. Note that the size of the bistable region is also much smaller in this case.

3.3 Numerical Simulations. We conducted numerical simulations of our network to investigate whether the two types of bistable states that the mean-field analysis predicts, the mean-driven and the fluctuation-driven regimes, can be realized. In addition to the approximations that we are forced to make in order to be able to construct the mean-field theory itself, the more qualitative and robust result that fluctuation-driven bistable points require large fluctuations in relatively small networks with relatively large


synapses leads to the question of whether potential fixed points in this very noisy network will indeed be stable. We found that under certain conditions, both types of bistable points can be realized in numerical simulations.

In Figure 12 we show an example of a network supporting bistability in the mean-driven regime. In this network, the recurrent connections are dominated by excitation, and the mean of the external drive leaves the neurons very close to threshold, with the fluctuations of this external drive being small. As expected, the irregularity in the elevated-activity fixed point is low, with a CV ∼ 0.2. The mean µV in this fixed point is above threshold. The mean-field theory predicts for the same network a rate of 46.7 Hz and a CV of 0.21.

An example of another network showing bistability, this time in the fluctuation-driven regime, is shown in Figure 13. In this network, the recurrent connections are dominated by inhibition, and the external drive leaves the membrane potential relatively far from threshold on average but has large fluctuations. The spike trains in the elevated activity state are quite irregular, partly, but not only, due to large temporal fluctuations in the instantaneous network activity. Taking into account only consecutive ISIs, the temporal irregularity is still large: CV2 ∼ 0.8. The mean-field prediction for these parameters gives a rate of 91.5 Hz and a CV of 1.6.

In order for the elevated activity states to be stable, in both the mean-driven and, especially, the fluctuation-driven regimes, we needed to use a wide distribution of synaptic delays in the recurrent connections: between 1 and 10 ms for Figure 12 and between 1 and 50 ms for Figure 13. Narrow distributions of delays lead to oscillations, which destabilize the stationary activity states. The emergence and properties of these oscillations in a network similar to the one we study here have been described in Brunel (2000a). 
Although such long synaptic delays are not expected to be found in connections between neighboring local cortical neurons, our network is extremely simple and lacks many elements of biological realism that would work in the same direction as the wide distributions of delays. Among these are slow and saturating synaptic interactions (NMDA-mediated excitation; Wang, 1999) and heterogeneity in cellular and synaptic properties.

The large and slow temporal fluctuations in the instantaneous rate in Figure 13 are due to the large fluctuations in the nearly balanced external and recurrent drive to the cells and the wide distribution of synaptic delays. These fluctuations lead to high trial-to-trial variability in the activity of the network, as shown in Figure 14. In this figure, we show nine trials with the same parameters as in Figure 13, differing only in the seed of the random number generator. In each panel, the mean instantaneous activity across all nine trials (the thick line) is shown along with the activity in the individual trial. Sometimes the large fluctuations lead to the activity returning to the basal spontaneous state. Other times they provoke long-lasting periods of elevated firing (above average). Nevertheless, on a large fraction of the trials, a memory of the stimulus persists for several seconds.
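The distinction between the CV and CV2 used above can be illustrated with a small sketch. It assumes that CV2 is the usual local measure, the average of 2|ISI(n+1) − ISI(n)| / (ISI(n+1) + ISI(n)) over consecutive ISI pairs, which may differ in detail from the definition given earlier in the letter. The toy ISI sequence is hypothetical: locally regular firing whose rate changes slowly, which inflates the CV but leaves CV2 small.

```python
import statistics

def cv(isis):
    """Coefficient of variation of the ISI distribution (std / mean)."""
    return statistics.pstdev(isis) / statistics.fmean(isis)

def cv2(isis):
    """Local irregularity: mean of 2|b - a| / (a + b) over consecutive ISIs."""
    terms = [2.0 * abs(b - a) / (a + b) for a, b in zip(isis[:-1], isis[1:])]
    return statistics.fmean(terms)

if __name__ == "__main__":
    # Hypothetical train: 100 ISIs of 10 ms followed by 100 ISIs of 50 ms.
    slow_rate_change = [10.0] * 100 + [50.0] * 100
    print("CV  =", round(cv(slow_rate_change), 3))
    print("CV2 =", round(cv2(slow_rate_change), 4))
```

A CV2 that stays large, as in Figure 13, therefore indicates irregularity beyond what slow fluctuations of the network rate alone would produce.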

[Figure 12 panels: Rate (Hz) versus Time (ms); lower panels: histograms of Rate (Hz), CV (annotated value 0.2), and CV2 (annotated value 0.21).]

Figure 12: Numerical simulations of a bistable network in the mean-driven regime. The rate of the external afferents was raised between 200 and 300 ms (vertical bars). (Top) Raster display of the activity of 200 neurons in the network. (Middle) Instantaneous network activity (temporal bin of 10 ms). The dashed line represents the average network activity during the delay period, 53.4 Hz. (Lower panels) Distribution across cells of the rate (left), CV (middle), and CV2 (right) during the delay period. The fact that the CV and CV2 are very similar reflects the stationarity of the instantaneous activity. Single-cell parameters as in the caption to Figure 2. The network consists of two populations, one excitatory and one inhibitory (1000 neurons each), connected at random with probability 0.1. Delays are uniformly distributed between 1 and 10 ms. External spikes are all excitatory, with PSP size 0.09 mV. The external rate is 19.25 kHz. This leads to µext = 17.325 mV and σ ext = 0.883 mV. Recurrent EPSPs and IPSPs are 0.138 mV and −0.05 mV, respectively, leading to c µ = 8.8 mV and c σ = 1.468 mV.

[Figure 13 panels: Rate (Hz) versus Time (ms); lower panels: histograms of Rate (Hz), CV (annotated value 1.27), and CV2 (annotated value 0.81).]

Figure 13: Numerical simulations of a bistable network in the fluctuation-driven regime. Panels as in Figure 12. Parameters as in Figure 12 except the distribution of delays, which is uniform between 1 and 50 ms. External spikes are excitatory, with PSP size 1.85 mV and rate 0.78 kHz, and inhibitory, with PSP size −1.85 mV and rate 0.5 kHz. This leads to µext = 5.18 mV and σ ext = 4.68 mV. Recurrent EPSPs and IPSPs are 1.85 mV and −1.98 mV, respectively, leading to c µ = −13 mV and c σ = 27.1 mV.
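The input statistics quoted in the captions of Figures 12 and 13 can be checked with a short calculation. The sketch below rests on assumptions that are not stated in this part of the text: a membrane time constant of 10 ms (the single-cell parameters are given in the caption to Figure 2, not reproduced here), µ = τ Σ J ν and σ² = (τ/2) Σ J² ν for the external drive, and c = 1000 × 0.1 = 100 recurrent inputs from each population with c µ = c ( j E + j I ) and c σ² = c ( j E² + j I²), where j I is negative. These forms were chosen because they reproduce the quoted numbers; the function names are illustrative.

```python
import math

TAU_M = 0.010  # s; assumed membrane time constant (inferred, see lead-in)

def external_stats(inputs, tau=TAU_M):
    """inputs: list of (PSP size in mV, rate in Hz). Returns (mu, sigma) in mV."""
    mu = tau * sum(j * rate for j, rate in inputs)
    sigma = math.sqrt(0.5 * tau * sum(j * j * rate for j, rate in inputs))
    return mu, sigma

def recurrent_coeffs(j_e, j_i, c):
    """j_e > 0 and j_i < 0 in mV; c inputs from each population."""
    c_mu = c * (j_e + j_i)
    c_sigma = math.sqrt(c * (j_e ** 2 + j_i ** 2))
    return c_mu, c_sigma

if __name__ == "__main__":
    c = int(1000 * 0.1)  # connections per cell from each population
    # Figure 12 (mean-driven): about (17.325, 0.883) and (8.8, 1.468) mV
    print(external_stats([(0.09, 19250.0)]), recurrent_coeffs(0.138, -0.05, c))
    # Figure 13 (fluctuation-driven): about (5.18, 4.68) and (-13, 27.1) mV
    print(external_stats([(1.85, 780.0), (-1.85, 500.0)]),
          recurrent_coeffs(1.85, -1.98, c))
```

The comparison makes the contrast between the two regimes explicit: in Figure 12 the recurrent mean feedback c µ dominates c σ , whereas in Figure 13 c µ is negative and c σ is large.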

[Figure 14 panels: nine panels (three rows of three), each showing Firing Rate (Hz) versus Time (s).]

Figure 14: Trial-to-trial variability in the fluctuation-driven regime. Each panel is a different repetition of the same trial, in a network identical to the one described in Figure 13. The thick line represents the average across all nine trials, and the thin line is the instantaneous network activity in the given trial. Vertical bars mark the time during which the rate of the external inputs is elevated.

In the mean-driven regime, the trial-to-trial variability is very low (not shown). We conclude that despite quantitative differences in the rate and CV between the mean-field theory and the simulations, it is possible, albeit difficult, to find both mean-driven and fluctuation-driven bistability in small networks of LIF neurons.

4 Discussion

In this letter, we have aimed at an understanding of the different ways in which a simple network of current-based LIF neurons can be organized in order to support bistability, the coexistence of two steady-state solutions for the activity of the network that can be selected by transient external stimulation. We have shown that in addition to the well-known case in which strong excitatory feedback can lead to bistability, bistability can also be obtained when the recurrent connectivity is nearly balanced, or even when its net effect is inhibitory, provided that an increase in the activity in


the network provides a large enough increase in the size of the fluctuations of the current afferent to the cells. When bistability is obtained in this fashion, the CV in both steady states is close to one, as found experimentally (see Figure 1; Compte et al., 2003). We have done a systematic analysis at the mean-field level (and a partial one through numerical simulations) of a reduced network where the activity in the excitatory and the inhibitory subpopulations was equal by construction (implying balance at the level of the output activity) and studied which types of bistable solutions are obtained depending on the level of balance in the currents (the parameter c µ ), that is, balance at the level of the inputs. This simple model allows for a complete understanding of the role of the different model parameters. The first phenomenon, which we have termed mean-driven bistability, can essentially be traced back to the shape of the curve relating the mean firing rate of the cell to the average current it is receiving (at a constant noise level; Brunel, 2000b), that is, the f − I curve. In order for bistability to exist, this curve should be a sigmoid, for which it is enough that the neurons possess a threshold and a refractory period. If, in addition, the low-activity fixed point is to have a nonzero activity (consistent with the fact that cells in the cortex fire spontaneously), then the neuron should display nonzero activity for subthreshold mean currents. This can be achieved if the current is noisy, where the noise is due to the spiking irregularity of the inputs to the cell. When this type of bistability is considered in a network of LIF neurons, the mean current to the cells in the high-activity fixed point is above threshold. Under general assumptions, this leads invariably to fairly regular spiking in this high-activity fixed point. 
Of course, tuning the parameters of the current in such a way that the mean current in the high-activity fixed point is only very marginally suprathreshold will result in only a small decrease of the CV with respect to the low-activity fixed point (e.g., Figure 2 in Brunel & Wang, 2001). On the other hand, in this scenario, it is relatively easy (it does not take much tuning) for the firing rate in the elevated persistent activity state not to be very much higher than that in the low-activity state, for example, below 100 Hz (see Figure 8). When the recurrent connectivity is balanced, bistable solutions can exist in which both fixed points are subthreshold, so that spiking in both fixed points is fluctuation driven and thus fairly irregular. This can be the case if the fluctuations in the depolarization due to current from outside the column are low enough (see Figure 6). However, in order for these solutions to exist, first, the overall inputs to the excitatory and the inhibitory subpopulations should be close enough (ensuring balance at the level of the firing activity in the network); second, both of these inputs, the one to the excitatory and the one to the inhibitory subpopulation, should themselves also be balanced (be composed of similar amounts of excitation and inhibition); and third, both the net excitatory drive and the inhibitory drive to the cells should be large. This third condition, if the first two are satisfied, results in a high, effective fluctuation feedback gain: an increase in the activity of


the cells results in a large increase in the size of the fluctuations in the afferent current to the neurons (a large value of the parameter c σ ). However, it also implies that the excitation-inhibition balance condition will be quite stringent; it will require tuning, especially when the network is large. In addition, if this balance is slightly broken, since both excitation and inhibition are large, the corresponding firing rate in the elevated persistent activity state becomes very large, for example, significantly higher than 100 Hz (see Figure 9). In fact, based just on scaling considerations (see section 3.1), one can conclude that this type of bistability can be present only (unless one allows for perfect tuning) in small networks. If the network is large, the excitation-inhibition balance condition has, in general, a single (albeit very robust) solution (van Vreeswijk & Sompolinsky, 1996, 1998). It is intriguing, however, that several lines of evidence in fact suggest a fairly precise balance of local excitation and inhibition in cortical circuits, at both the output level (Rao et al., 1999) and the input level (Anderson, Carandini, & Ferster, 2000; Shu, Hasenstaub, & McCormick, 2003; Marino et al., 2005).

4.1 Limitations of the Present Approach. Most of the results we have presented are based on an exhaustive analysis of the stationary states of a mean-field description of a simple network of LIF neurons. Several limitations of our approach should be noted. First, in order to be able to go beyond the Poisson assumption, we have had to make a number of approximations (discussed in section 2) that are expected to be valid only in limited regions of the large parameter space. Second, we have focused only on the stationary fixed points of the system, neglecting an examination of any oscillatory solutions. Oscillations in networks of LIF neurons in the high-noise regime have been extensively studied by Brunel and collaborators (see, e.g., Brunel & Hakim, 1999; Brunel, 2000a; Brunel & Wang, 2003). Third, in order to be able to provide an analytical description, we have considered a very simplified network lacking many aspects of biological realism known to affect network dynamics, most importantly, a more realistic description of synaptic dynamics (Wang, 1999; Brunel & Wang, 2001). Finally, the use of a mean-field description based on the diffusion approximation to study small networks with big synaptic PSPs might lead to problems, since the diffusion approximation assumes these PSPs are (infinitely) small. Large PSPs might lead to fluctuations that are too strong, which would destabilize the analytically predicted fixed points.

In order to check that the main qualitative conclusion of this study was not an artifact due to the mean-field approach, we simulated the network of LIF neurons used in the mean-field description, adding synaptic delays. Provided the distribution of delays was wide, we observed both types of bistable solutions. However, as expected, the fluctuation-driven persistent activity states show large temporal fluctuations that sometimes are enough to destabilize them.


The evidence we have provided is suggestive, but given the limitations listed above, it is not conclusive. Addressing the limitations of this work will involve using recently developed analytical techniques (Moreno-Bote & Parga, 2006), along with a systematic exploration of the behavior of more realistic recurrent networks through numerical simulations.

4.2 Self-Consistent Second-Order Statistics. The extended mean-field theory we have used builds on the landmark study by Amit and Brunel (1997b), which provided a general-purpose theory for the study of the different types of steady states in recurrent networks of spiking neurons in the presence of noise, while leaving room for different degrees of biophysical realism. Our contribution has been to try to go beyond the Poisson assumption in order to allow a self-consistent solution to the second-order statistics of the spike trains in the network. If the spike trains are assumed to be Poisson, there is only one parameter to be determined self-consistently: the firing rate. Under the approximations made in this letter, the statistics are characterized by two parameters, the firing rate and the CV, which provides information about the degree of irregularity of the spiking activity. In order to go beyond the Poisson assumption, we have assumed the spike trains in the network can be described as renewal processes with a very short correlation time. In these conditions, for time windows large compared to this correlation time, the Fano factor of the process is constant, but instead of being one, as for a Poisson process, it is equal to CV² (the square of the CV). This motivates our strategy of neglecting the temporal aspect of the deviation from Poisson, which is extremely complicated to deal with analytically, and keeping only its effect on the amplitude of the correlations. 
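The statement that the Fano factor of a renewal process approaches CV² for long counting windows can be checked numerically. The sketch below is an illustration only and is not part of the analysis in this letter: it uses a gamma renewal process (an arbitrary choice, with CV² = 1/shape), and the window length, number of windows, and seed are hypothetical values.

```python
import random

def mean_var(xs):
    """Population mean and variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

def fano_vs_cv2(shape, mean_isi=1.0, window=100.0, n_windows=2000, seed=0):
    """Simulate a gamma renewal process.

    Returns (Fano factor of spike counts per window, squared CV of the ISIs).
    """
    rng = random.Random(seed)
    scale = mean_isi / shape            # gamma scale giving the requested mean ISI
    t_end = window * n_windows
    counts = [0] * n_windows
    isis = []
    t = 0.0
    while True:
        isi = rng.gammavariate(shape, scale)
        t += isi
        if t >= t_end:
            break
        isis.append(isi)
        counts[min(int(t / window), n_windows - 1)] += 1
    count_mean, count_var = mean_var(counts)
    isi_mean, isi_var = mean_var(isis)
    return count_var / count_mean, isi_var / isi_mean ** 2

if __name__ == "__main__":
    for shape in (1.0, 4.0):            # shape 1 is a Poisson process
        fano, cv_sq = fano_vs_cv2(shape)
        print("shape", shape, "Fano", round(fano, 3), "CV^2", round(cv_sq, 3))
```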
We have done this by using the expressions for the rate and CV of the first passage time of the OU process with a renormalized variance that takes into account the CV of the inputs. If the time constant of the autocorrelation of the process is exactly zero, this approximation becomes exact (Moreno et al., 2002), so we have assumed it will still be qualitatively valid if the correlation time constant is small. In this way, we have been able to solve for the CV of the neurons in the steady states self-consistently.

It has to be stressed that the fact that the individual inputs to a neuron are considered independent does not imply that the overall input process, made of the sum of each individual component, is Poisson. Informally, in order for the superposition to converge to a (homogeneous) Poisson process of rate λ, two conditions have to be met: given any set S on the time axis (say, any time interval), calling $N_i^{1}$ the probability of observing one spike in S from process i, and $N_i^{\geq 2}$ the probability of observing two or more spikes in S from process i, then the superposition of the i = 1, . . . , N processes will converge to a Poisson process if $\lim_{N \to \infty} \sum_{i=1}^{N} N_i^{1} = \lambda S$ (with $\max_i \{N_i^{1}\} \to 0$ as $N \to \infty$) and if $\lim_{N \to \infty} \sum_{i=1}^{N} N_i^{\geq 2} = 0$ (see, e.g., Daley & Vere-Jones, 1988). The autocorrelated renewal processes that we consider in this letter do


not meet the second condition, which can also be seen in the fact that the superposition process has an autocorrelation given by equation 2.2, not by a Dirac delta function, as would be the case for a Poisson process. Despite this, it might be the case that if instead of any set S, one considers only a given time window T, both conditions could approximately be met in T, and we could say that the superposition process is locally Poisson in T. Whether this locally Poisson train will have the same effect on the postsynaptic cell as a real Poisson train of the same rate depends on a number of factors and has been studied in detail in Moreno et al. (2002) for the case of exponential autocorrelations. Other types of autocorrelation structures, for instance, regular spike trains, could lead to different results. This is an open problem.

4.3 Current-Based versus Conductance-Based Descriptions. We have analyzed a network of current-based LIF neurons. The motivations for this choice are that current-based LIF neurons are simpler to analyze, especially in the presence of noise, than conductance-based LIF neurons and also that there were a number of unresolved issues raised in the current-based framework that we have made an attempt to clarify. In particular, we were interested in understanding whether the framework of Amit and Brunel (1997b) could be used to produce bistable solutions in balanced networks like those studied in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998) outside the context of multistability in recurrent networks. An important issue has been how the scaling of the connection strengths with the number of afferents constrains the types of bistability attainable in large networks, when the number of afferents per cell tends to infinity. 
This analysis shows that large, homogeneous networks using the J ∼ 1/√C scaling needed to retain a significant amount of fluctuation at large C do not support bistability in a robust manner, a result already implicitly present in van Vreeswijk and Sompolinsky (1996, 1998). Reasonably robust bistability in homogeneous balanced networks requires that they be small.

Does one expect these conclusions to hold qualitatively if one considers the more realistic case of conductance-based synaptic inputs to the cells? The answer to this question is uncertain. In particular, scaling relationships between J and C, absolutely unavoidable in current-based scenarios to keep a finite input to the cells in the extensive C limit, are not necessary when synaptic inputs are assumed to induce a transient change in conductance. In the presence of conductances, the steady-state voltage is automatically independent of C for large C, regardless of the value of the unitary synaptic conductances. In fact, assuming a simple model for a cell having only leak, excitatory, and inhibitory synaptic conductances, the steady-state voltage in the absence of threshold is given by

\[
V_{ss} = \frac{g_L}{g_{Tot}} V_L + \frac{g_E}{g_{Tot}} V_E + \frac{g_I}{g_{Tot}} V_I ,
\]


where VL , VE , VI are the reversal potentials of the respective currents; g L , g E , g I are the total leak, excitatory, and inhibitory conductances in the cell; and gTot = g L + g E + g I . (This expression ignores the effect of temporal fluctuations in the conductances, but it is a good approximation, since the mean conductance, being by definition positive, is expected to be much larger than its fluctuations.) The steady-state voltage is just the average of the different reversal potentials, each weighted by the relative amount of conductance that the respective current is carrying. Of course, each of the total synaptic conductances is proportional to the number of inputs, but since the steady-state voltage is just a weighted average, it does not diverge even if C tends to infinity. It might seem that in fact, the infinite-C limit leads to an ill-defined model, as the membrane time constant vanishes in this case as 1/C. If Cm is the membrane capacitance, then

\[
\tau_m = \frac{C_m}{g_{Tot}} \sim \frac{1}{C} ,
\]

assuming that g E,I ∼ C. We believe, however, that this is an artifact due to an incorrect way of defining the model in the large C limit. It implicitly assumes that the number of synaptic inputs grows at a constant membrane area, thus increasing indefinitely the local channel density. A more appropriate way of taking the large C limit is to fix the relative densities of the different channels per unit area and then assume the area becomes large. In this case, both Cm and the total leak conductance of the cell (proportional to the number of leak channels) will grow with the total area. This way of taking the limit respects the well-known decrease in membrane time constant as the synaptic input to the cell grows, while retaining a well-defined, nonzero membrane time constant in the extensive C limit (in this case, the range of values that τm can take is determined by the local differences in channel density, which is independent of the total channel number).

A crucial difference with the current-based cell is the behavior of the variance of the depolarization in the large C limit. A simple estimate of this variance can be obtained by ignoring threshold and considering only the low-pass filtering effect of the membrane (with time constant τm ) on a gaussian noise current of variance $\sigma_I^2$ and time constant τs . It is straightforward to calculate the variance of the depolarization in these conditions, resulting in

\[
\sigma_V^2 = \frac{\sigma_I^2}{g_{Tot}^2} \, \frac{\tau_s}{\tau_s + \tau_m} .
\]


If the inputs are independent, both the variance of the current and the total conductance of the cell are proportional to C, which implies that $\sigma_V^2 \sim 1/C$ for large C. Therefore, the statistics of the depolarization in conductance-based and current-based neurons show a very different dependence on the number of inputs to the cell. In particular, it is unclear whether the main organizational principle behind the balanced state in the current-based framework, that is, the J ∼ 1/√C scaling that is needed to retain a finite variance in the C → ∞ limit and that leads to the set of linear equations that specify the single solution for the activity in the balanced network, is relevant in a conductance-based framework. A rigorous study of this problem is beyond the scope of this work, but is one of the outstanding challenges for understanding the basic principles of cortical organization.
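The scaling argument of this section can be made concrete with a small numerical sketch. All parameter values below (per-input conductances, capacitance, current variance, reversal potentials, synaptic time constant) are hypothetical and in arbitrary but mutually consistent units; they are not taken from this letter. The sketch takes the large C limit at fixed channel densities, so that the leak conductance, the synaptic conductances, the capacitance, and the current variance all grow in proportion to C.

```python
def steady_state_voltage(g_l, g_e, g_i, v_l=-70.0, v_e=0.0, v_i=-80.0):
    """Conductance-weighted average of the reversal potentials (mV)."""
    g_tot = g_l + g_e + g_i
    return (g_l * v_l + g_e * v_e + g_i * v_i) / g_tot

def voltage_variance(var_i, g_tot, tau_s, tau_m):
    """Variance of the depolarization for a low-pass-filtered noise current."""
    return var_i / g_tot ** 2 * tau_s / (tau_s + tau_m)

def cell_statistics(c, g_l1=0.05, g_e1=0.01, g_i1=0.02,
                    c_m1=1.0, var_i1=1.0, tau_s=5.0):
    """Scale every per-input quantity (suffix 1, hypothetical values) by c.

    Returns (steady-state voltage, membrane time constant, voltage variance).
    """
    g_l, g_e, g_i = c * g_l1, c * g_e1, c * g_i1
    g_tot = g_l + g_e + g_i
    tau_m = c * c_m1 / g_tot          # stays finite: capacitance grows with c too
    v_ss = steady_state_voltage(g_l, g_e, g_i)
    var_v = voltage_variance(c * var_i1, g_tot, tau_s, tau_m)
    return v_ss, tau_m, var_v

if __name__ == "__main__":
    for c in (100, 1000, 10000):
        v_ss, tau_m, var_v = cell_statistics(c)
        print(c, round(v_ss, 2), round(tau_m, 2), var_v)
```

In this sketch the steady-state voltage and the membrane time constant do not change with c, while the variance of the depolarization falls as 1/c, as stated in the text for independent inputs.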

4.4 Correlations and Synaptic Time Constants. Our mean-field description assumes that the network is in the extreme sparse limit, N ≫ C, in which the fraction of common input shared by two neurons is negligible, leading to vanishing correlations between the afferent currents to different cells in the large C, large N limit. This is a crucial assumption, since it causes the variance of the depolarization in the network to be the sum of the variances of the individual spike trains, that is, proportional to C. If the correlation coefficient is finite as C → ∞, the variance is proportional to C² (see, e.g., Moreno et al., 2002). In that case, in a current-based network, J ∼ 1/C scaling would lead to a nonvanishing variance in the large C limit without a stringent balance condition, and in a conductance-based network, it would lead to a C-independent variance for large C. This suggests that correlations between the cells in the recurrent network should have a large effect on both their input-output properties (Zohary, Shadlen, & Newsome, 1994; Salinas & Sejnowski, 2000; Moreno et al., 2002) and the network dynamics. The issue is, however, not straightforward, as simulations of irregular spiking networks with realistic connectivity parameters, which do show weak but significant cross-correlations between neurons (Amit & Brunel, 1997a), seem to be well described by the mean-field theory in which correlations are neglected (Amit & Brunel, 1997a; Brunel & Wang, 2001). Noise correlations measured experimentally are small but significant, with normalized correlation coefficients in the range of a few percent to a few tenths (for a review, see, e.g., Salinas & Sejnowski, 2001). It would thus be desirable to be able to extend the current mean-field theory to incorporate the effect of cross-correlations and to understand under which conditions their effect is important. The first steps in this direction have already been taken (Moreno et al., 2002; Moreno-Bote & Parga, 2006). 
The arguments of the previous section suggest that a correct treatment of correlations might be especially important in large networks of conductance-based neurons.
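The dependence of the input variance on C with and without correlations can be illustrated with the elementary formula for the variance of a sum of C equally weighted inputs. The sketch below assumes, for illustration only, that all inputs have the same variance and the same pairwise correlation coefficient rho; the function name and the numerical values are hypothetical.

```python
def summed_input_variance(c, rho, j=1.0, var_single=1.0):
    """Variance of J times the sum of c inputs with uniform pairwise correlation rho.

    Var = J^2 * var_single * (c + c * (c - 1) * rho)
    """
    return j * j * var_single * (c + c * (c - 1) * rho)

if __name__ == "__main__":
    for c in (100, 1000, 10000):
        independent = summed_input_variance(c, rho=0.0)            # grows like c
        correlated = summed_input_variance(c, rho=0.1)             # grows like c^2
        scaled = summed_input_variance(c, rho=0.1, j=1.0 / c)      # tends to rho
        print(c, independent, correlated, round(scaled, 5))
```

With J ∼ 1/C and a finite correlation coefficient, the variance tends to rho times the single-input variance rather than to zero, which is the nonvanishing variance referred to above.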


An issue of similar importance for the understanding of the high irregularity of cortical spike trains is the relationship between the time constant (or time constants) of the inputs to the neuron and its (effective) membrane time constant. Indeed, the need for a balance between excitation and inhibition in order to have a high output spiking variability when receiving many irregular inputs exists only if the membrane time constant is relatively long, in particular, long enough that if the afferents are all excitatory, the input fluctuations are averaged out. If the membrane time constant is very short, large, short-lived fluctuations are needed to make the postsynaptic cell fire, and these occur only irregularly, even if all the afferents are excitatory (see, e.g., Figure 1 in Shadlen & Newsome, 1994). These considerations seem relevant since cortical cells receive large numbers of inputs that have spontaneous activity, thus putting the cell into a high-conductance state (see, e.g., Destexhe, Rudolph, & Pare, 2003) in which its effective membrane time constant is short, on the order of only a few milliseconds (Bernander, Douglas, Martin, & Koch, 1991; Softky, 1994). It has also been recognized that when the synaptic time constant is large compared to the membrane time constant, spiking in the subthreshold regime becomes very irregular, and in particular, the distribution of firing rates becomes bimodal. Qualitatively, in these conditions, the depolarization follows the current instead of integrating it. Relative to the timescale of the membrane, fluctuations are long-lived, and this separates two different timescales for spiking (which result in bimodality of the firing-rate distribution) depending on whether the size of a fluctuation is such that the total current is subthreshold (i.e., no spiking, leading to a large peak of the firing-rate histogram at zero) or suprathreshold (leading to a nonzero peak in the firing-rate distribution) (Moreno-Bote & Parga, 2005). 
In these conditions, neurons seem “bursty,” and the CV of the ISI is high. Interestingly, recent evidence confirms this bimodality of the firing-rate distribution in spiking activity recorded in vivo in the visual cortex (Carandini, 2004). Increases in the synaptic-to-membrane time constant ratio leading to more irregular spiking can be due to a number of factors: a very short membrane time constant if the neuron is in a high-conductance state, relatively long excitatory synaptic drive if there is a substantial NMDA component in the EPSPs, or even long-lived dendrosomatic current sources, for instance, due to the existence of “calcium spikes” generated in the dendrites. There is evidence that irregular current applied to the dendrites of pyramidal cells results in higher CVs than the same current applied to the soma (Larkum, Senn, & Luscher, 2004).

4.5 Parameter Fine-Tuning. In order for both stable firing rate states of the networks we have studied to display significant spiking irregularity, the afferent current to the cells in both states needs to be subthreshold. We have shown that this requires a significant amount of parameter fine-tuning, especially when the number of connections per neuron is large. Parameter

Bistability in Balanced Recurrent Networks

41

fine-tuning is a problem, since biological networks are heterogeneous and cellular and synaptic properties change in time. Regarding this issue, though, some considerations are in order. First, the model we have considered is extremely simple, especially at the single-neuron level. We have already pointed out possible consequences of considering properties such as longer synaptic time constants or some degree of correlations between the spiking activity of different neurons. Another biophysical property that we expect to have a large impact is short-term synaptic plasticity. In the presence of depressing synapses, the postsynaptic current is no longer linear in the presynaptic firing rate, thus acting as an activity-dependent gain control mechanism (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). It remains to be explored to what extent balanced bistability in networks of neurons exhibiting these properties becomes a more robust phenomenon. Second, synaptic weights (as well as intrinsic properties; Desai, Rutherford, & Turrigiano, 1999) can adapt in an activity-dependent manner to keep the overall activity in a recurrent network within an appropriate operational range (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998). Delicate computational tasks, which seem to require fine-tuning, can be rendered robust through the use of these types of activity-dependent homeostatic rules (Renart, Song, & Wang, 2003). It will be interesting to study whether homeostatic plasticity (Turrigiano, 1999) can be used to relax some of the fine-tuning constraints described in this letter.

4.6 Multicolumnar Networks and Hierarchical Organization. The fact that bistability is not a robust property of large, homogeneous balanced networks suggests that the functional units of working memory could correspond to small subpopulations (Rao et al., 1999).
In addition, we have shown that bistability in a small, reduced network is possible only for subthreshold external inputs (see section 3.2.2). At the same time, it is known that a nonzero activity balanced state requires a very large (suprathreshold) excitatory drive (see section 3.1 and van Vreeswijk & Sompolinsky, 1998). This seems to point to a hierarchical organization: large networks receive massive excitation from long-distance projections, and this external excitation sets up a balanced state in the network. Globally, the activity in the large, balanced network follows the external input linearly. This large, balanced network then provides an already balanced (subthreshold) input to smaller subcomponents, which, in these conditions (in particular, if the variance of this subthreshold input is small enough; see figure 7), can display more complex nonlinear behavior such as bistability. From the point of view of the smaller subnetworks, the balanced subthreshold input can be considered external, since the size of this network is too small to make a difference in the global activity of the larger network (despite being recurrently connected, the activities in the large and small networks effectively

decouple). In the cortex, the larger balanced network could correspond to a whole cortical column. Indeed, long-range projections between columns are mostly excitatory (see, e.g., Douglas & Martin, 2004). Within a column, the smaller networks that interact through both excitation and inhibition could anatomically correspond to microcolumns (Rao et al., 1999) or, more generally, define functional assemblies (Hebb, 1949). 5 Summary General principles of cortical organization (large numbers of active synaptic inputs per neuron) and function (irregular spiking statistics) put strong constraints on working memory models of spiking neurons. We have provided evidence that a network of current-based LIF neurons can exhibit bistability with the high persistent activity driven by either the mean or the fluctuations in the input to the cells. The fluctuation-driven bistability regime requires a strict excitation-inhibition balance that needs parameter tuning. It remains a challenge in future research to analyze systematically what the conditions are under which nonlinear phenomena such as bistability can exist robustly in large networks of more biophysically plausible conductance-based and correlated spiking neurons. It is also conceivable that additional biological mechanisms, such as homeostatic regulation, are important for solving the fine-tuning problem and ensuring a desired excitation-inhibition balance in cortical circuits. Progress in this direction will provide insight into the microcircuit mechanisms of working memory, such as found in the prefrontal cortex. Acknowledgments We are indebted to Jaime de la Rocha for providing the code for the numerical simulations and to Albert Compte for providing the data for Figure 1. A.R. thanks N. Brunel for pointing out previous related work, and A. Amarasingham for discussions on point processes. Support was provided by the National Institute of Mental Health (MH62349, DA016455), the A. P. 
Sloan Foundation and the Swartz Foundation, and the Spanish Ministry of Education and Science (BFM 2003-06242).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224. Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover. Amit, D. J., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.

Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252. Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates II: Low-rate retrieval in symmetric networks. Network, 2, 275–294. Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiol., 84, 909–926. Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: Time scales and relationship to behavior. J. Neurosci., 21, 1676–1697. Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848. Bernander, O., Douglas, R. J., Martin, K. A., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573. Brunel, N. (2000a). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208. Brunel, N. (2000b). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280. Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671. Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85. Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? I. Synaptic dynamics and excitation-inhibition balance. J. Neurophysiol., 90, 415–430. Cai, D., Tao, L., Shelley, M., & McLaughlin, D. W. (2004).
An effective kinetic representation of fluctuation-driven networks with application to simple and complex cells in visual cortex. Proc. Natl. Acad. Sci., 101, 7757–7762. Carandini, M. (2004). Amplification of trial-to-trial response variability by neurons in visual cortex. PLOS Biol., 2, E264. Chafee, M. V., & Goldman-Rakic, P. S. (1998). Matching patterns of activity in primate prefrontal area 8a and parietal area 7ip neurons during a spatial working memory task. J. Neurophysiol., 79, 2919–2940. Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923. Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454. Cox, D. R. (1962). Renewal theory. New York: Wiley. Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer.

Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2, 515–520. Destexhe, A., Rudolph, M., & Pare, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci., 4, 739–751. Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Ann. Rev. Neurosci., 27, 419–451. Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiol., 61, 331–349. Fuster, J. M., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654. Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68. Gillespie, D. T. (1992). Markov processes: An introduction for physical scientists. Orlando, FL: Academic Press. Gnadt, J. W., & Andersen, R. A. (1988). Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res., 70, 216–220. Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178. Harsch, A., & Robinson, H. P. (2000). Postsynaptic variability of firing in rat cortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192. Hebb, D. O. (1949). Organization of behavior. New York: Wiley. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558. Larkum, M. E., Senn, W., & Luscher, H. M. (2004). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cereb. Cortex, 14, 1059–1070. Marino, J., Schummers, J., Lyon, D. C., Schwabe, L., Beck, O., Wiesing, P., & Obermayer, K. (2005). Invariant computations in local cortical networks with balanced excitation and inhibition. Nat. Neurosci., 8, 194–201. Mascaro, M., & Amit, D. J. (1999). Effective neural response function for collective population states. Network, 10, 351–373. Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12, 2305–2329. Miller, E. K., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neurosci., 16, 5154–5167. Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70. Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89, 288101. Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Phys. Rev. Lett., 94, 088103. Moreno-Bote, R., & Parga, N. (2006). Auto- and cross-correlograms for the spike response of lif neurons with slow synapses. Phys. Rev. Lett., 96, 028101.

Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916. Renart, A. (2000). Multi-modular memory systems. Unpublished doctoral dissertation, Universidad Autónoma de Madrid. Renart, A., Song, P., & Wang, X.-J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485. Ricciardi, L. M. (1977). Diffusion processes and related topics on biology. Berlin: Springer-Verlag. Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474. Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20, 6193–6209. Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiol., 4, 569–579. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896. Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Comput., 11, 935–951. Shu, Y., Hasenstaub, A., & McCormick, D. A. (2003). Turning on and off recurrent balanced cortical activity. Nature, 432, 288–293. Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723. Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124. Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neurosci., 22, 221–227. Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371. Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.

Zador, A. M., & Stevens, C. F. (1998). Input synchrony and the irregular firing of cortical neurons. Nat. Neurosci., 1, 210–217. Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received August 22, 2005; accepted May 17, 2006.

LETTER

Communicated by Alain Destexhe

Exact Subthreshold Integration with Continuous Spike Times in Discrete-Time Neural Network Simulations Abigail Morrison [email protected] Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany

Sirko Straube [email protected] Computational Neurophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany

Hans Ekkehard Plesser [email protected] Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1432 Ås, Norway

Markus Diesmann [email protected] Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany

Very large networks of spiking neurons can be simulated efficiently in parallel under the constraint that spike times are bound to an equidistant time grid. Within this scheme, the subthreshold dynamics of a wide class of integrate-and-fire-type neuron models can be integrated exactly from one grid point to the next. However, the loss in accuracy caused by restricting spike times to the grid can have undesirable consequences, which has led to interest in interpolating spike times between the grid points to retrieve an adequate representation of network dynamics. We demonstrate that the exact integration scheme can be combined naturally with off-grid spike events found by interpolation. We show that by exploiting the existence of a minimal synaptic propagation delay, the need for a central event queue is removed, so that the precision of event-driven simulation on the level of single neurons is combined with the efficiency of time-driven global scheduling. Further, for neuron models with linear subthreshold dynamics, even local event queuing can be avoided, resulting in much greater efficiency on the single-neuron level. These ideas are exemplified by two implementations of a widely used neuron model. We present a measure for the efficiency of network simulations in terms of their integration error and show that for a wide range of input spike rates, the novel techniques we present are both more accurate and faster than standard techniques.

Neural Computation 19, 47–79 (2007) © 2006 Massachusetts Institute of Technology

1 Introduction

A major problem in the simulation of cortical neural networks has been their high connectivity. With each neuron receiving input from on the order of $10^4$ other neurons, simulations are demanding in terms of memory as well as simulation time requirements. Techniques to study such highly connected systems of $10^5$ and more neurons, using distributed computing, are now available (see, e.g., Hammarlund & Ekeberg, 1998; Harris et al., 2003; Morrison, Mehring, Geisel, Aertsen, & Diesmann, 2005). The question remains as to whether a time-driven or event-driven simulation algorithm (Fujimoto, 2000; Zeigler, Praehofer, & Kim, 2000; Sloot, Kaandorp, Hoekstra, & Overeinder, 1999; Ferscha, 1996) should be used. At first glance, the choice of event-driven algorithms seems natural, because a neuron can be described as emitting and absorbing point events (spikes) in continuous time. In fact, for neuron models with linear subthreshold dynamics and postsynaptic potentials without rise time, highly efficient algorithms exist (see, e.g., Mattia & Del Giudice, 2000). These exploit the fact that threshold crossings can occur only at the impact times of excitatory events. If more general types of neuron models are considered, the global algorithmic framework becomes much more complicated. For example, each neuron may be required to “look ahead” to determine when it will fire in the absence of new events. The global algorithm then either updates the neuron with the shortest latency or delivers the event with the most imminent arrival time (whichever is shorter) and revises the latency calculations for the neurons receiving the event. (See Marian, Reilly, & Mackey, 2002; Makino, 2003; Rochel & Martinez, 2003; Lytton & Hines, 2005; and Brette, 2006, for refined versions of such an algorithm.)
This decision process clearly comes at a cost and becomes unwieldy for networks of high connectivity: if each neuron is receiving input spikes at a conservative average rate of 1 Hz from each of $10^4$ synapses, it needs to process a spike every 0.1 ms, and this limits the characteristic integration step size. Therefore, time-driven algorithms have been found useful for the simulation of large, highly connected networks. Here, each neuron is updated on an equidistant time grid, and the emission and absorption times of spikes are restricted to the grid (see section 3). The temporal spacing of the grid is called the computation time step h. Consider a network of $10^5$ neurons as described above. At a computation time step of 0.1 ms, a time-driven algorithm carries out $10^9$ neuron updates per second, the same number as required for the event-driven algorithm. In this situation, the time-driven scheme is necessarily faster than the event-driven scheme because the costs of the actual updates are the same and there is no overhead caused by the scheduling of events.
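The arithmetic behind this comparison can be checked directly. The following Python sketch uses only the figures quoted in the text (a network of ten to the fifth neurons, ten to the fourth synapses per neuron, 1 Hz per synapse, a 0.1 ms step); the variable names are illustrative and are not taken from any simulator, and "per second" refers to one second of simulated time.

```python
# Figures quoted in the text; variable names are illustrative only.
n_neurons = 10**5
synapses_per_neuron = 10**4
rate_per_synapse_hz = 1.0
h_ms = 0.1  # computation time step

# Total input rate seen by one neuron and the mean interval between inputs.
input_rate_hz = synapses_per_neuron * rate_per_synapse_hz
mean_interval_ms = 1000.0 / input_rate_hz

# Event-driven: one update per incoming spike, for every neuron.
event_driven_updates = n_neurons * input_rate_hz

# Time-driven: one update per grid step, for every neuron.
time_driven_updates = n_neurons * (1000.0 / h_ms)

print(f"mean interval between input spikes: {mean_interval_ms} ms")
print(f"event-driven updates per simulated second: {event_driven_updates:.3g}")
print(f"time-driven updates per simulated second:  {time_driven_updates:.3g}")
```

At these values the two update counts coincide, which is why the comparison in the text turns on scheduling overhead rather than on the number of updates.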

However, in this letter, we criticize this view and argue that in order to arrive at a relevant measure of efficiency, simulation time should be analyzed as a function of the integration error rather than the update interval. Whether a time-driven or event-driven scheme yields a better performance from this perspective depends on the required accuracy of the simulation and the network spike rate, and is not immediately apparent from considerations of complexity. In the time-driven framework, Rotter and Diesmann (1999) showed that for a wide class of neuron models, the linearity of the subthreshold dynamics can be exploited to integrate the neuron state exactly from one grid point to the next by performing a single matrix-vector multiplication. Here, the computation time step simultaneously determines the accuracy with which incoming spikes influence the subthreshold dynamics and the timescale at which threshold crossings can be detected. However, Hansel, Mato, Meunier, and Neltner (1998) showed that forcing spikes onto the grid can significantly distort the synchronization dynamics of certain networks. Reducing the computation step ameliorates the problem only slowly, as the integration error declines linearly with h (see section 8.4.1). The problem was solved (Hansel et al., 1998; Shelley & Tao, 2001) by interpolating the membrane potential between grid points to give a better approximation of the time of threshold crossing and evaluating the effect of incoming spikes on the neuronal state in continuous time. In this work, we demonstrate that the techniques developed for the exact integration of the subthreshold dynamics (Rotter & Diesmann, 1999) and for the interpolation of spike times (Hansel et al., 1998; Shelley & Tao, 2001) can be successfully combined.
By requiring that the minimal synaptic propagation delay be at least as large as the computation time step, all events can be queued at their target neurons rather than relying on a central event queue to maintain causality in the network. This reduces the complexity of the global scheduling algorithm—that is, deciding which neuron should be updated, and how far—to the simple time-driven case, whereby each neuron in turn is advanced in time by a fixed amount. Therefore, the global overhead costs are no more than in a traditional discrete-time simulation, and yet on the level of the individual neuron, spikes can be processed and emitted in continuous time with the accuracy of an event-driven algorithm. This approach represents a hybridization of traditional time-driven and event-driven algorithms: the scheme is time driven on the global level to advance the system time but event driven on the level of the individual neurons. The exact integration method is predicated on the linearity of the subthreshold dynamics. We show that this property can be further exploited, as the order of incoming events is not relevant for calculating the neuron state. This completely removes the need for storing and sorting individual events, and therefore also for dynamic data structures, while maintaining the high precision of the event-driven approach.
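Why linearity makes the order of incoming events irrelevant can be seen in a deliberately simplified setting. The sketch below is an illustration under stated assumptions, not the implementation described in this letter: it replaces the α-function synapse by instantaneous jumps of a single leaky state variable, ignores threshold crossings, and uses hypothetical names (`sequential`, `accumulated`, `TAU`, `H`).

```python
import math
import random

TAU = 10.0  # ms, decay constant of the illustrative linear state (assumption)
H = 1.0     # ms, computation time step (assumption)


def sequential(v0, events):
    """Event-driven style: sort events by offset and hop from one to the next."""
    v, t = v0, 0.0
    for delta, w in sorted(events):
        v = v * math.exp(-(delta - t) / TAU) + w  # decay to the event, then jump
        t = delta
    return v * math.exp(-(H - t) / TAU)           # decay to the end of the step


def accumulated(v0, events):
    """Order-free: propagate each contribution to t + h on its own and sum."""
    total = v0 * math.exp(-H / TAU)
    for delta, w in events:                       # any order, no sorting, no queue
        total += w * math.exp(-(H - delta) / TAU)
    return total


random.seed(0)
events = [(random.uniform(0.0, H), random.uniform(-1.0, 1.0)) for _ in range(20)]
shuffled = events[:]
random.shuffle(shuffled)

print(sequential(0.5, events), accumulated(0.5, shuffled))
```

Both functions return the same end-of-step state up to rounding, because each event contributes an independent, additive term. Only the second variant dispenses with sorting and per-event storage.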

We illustrate these ideas by comparing three implementations of the same widely used integrate-and-fire neuron model. As in previous work, the scaling behavior of the integration error is considered a function of the computational resolution. However, in contrast to these works, we also analyze the run time and memory consumption of a large neuronal network model as a function of integration error, thus defining a measure of efficiency that can be applied to any competing model implementations. This analysis reveals that the novel scheme of embedding continuous-time implementations in a discrete-time framework can in many cases result in simulations that are both more accurate and faster than a given purely discrete-time simulation. This is possible because the new scheme achieves the same accuracy at larger computation time steps. Depending on the rate of events to be processed by the neuron, the gain in simulation speed due to an increased step size can more than compensate for the increased complexity of processing continuous-time input events. The scheme presented is well suited for distributed computing. In section 2, we describe the neuron model used as an example in the remainder of the article, and then we review the techniques for integrating the dynamics of a neural network in discrete time steps in section 3. Subsequently, in section 4, we present two implementations solving the single-neuron dynamics between grid points but handling the incoming and emitted spikes in continuous time. The performance of these implementations with respect to integration error, run time, and memory requirements is analyzed in section 5. We show that the choice of which implementation should be used for a given problem depends on a trade-off between these factors. The concepts of time-driven and event-driven simulation of large neural networks are discussed in section 6 in the light of our findings.
The numerical techniques underlying the reported results are given in the appendix. The conceptual and algorithmic work described here is a module in our long-term collaborative project to provide the technology for neural systems simulations (Diesmann & Gewaltig, 2002). Preliminary results have been presented in abstract form (Morrison, Hake, Straube, Plesser, & Diesmann, 2005).

2 Example Neuron Model

Although the methods in this letter can be applied to any neuron model reducible to a system of linear differential equations, for clarity, we compare various implementations of one particular physical model: a current-based integrate-and-fire neuron with postsynaptic currents represented as α-functions. The dynamics of the membrane potential $V$ is

$$\dot{V} = -\frac{V}{\tau_m} + \frac{1}{C}\, I,$$

where $\tau_m$ is the membrane time constant, $C$ is the capacitance of the membrane, and $I$ is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The time course of the synaptic current $\iota$ due to one incoming spike is

$$\iota(t) = \hat{\iota}\, \frac{e}{\tau_\alpha}\, t\, e^{-t/\tau_\alpha},$$

where $\hat{\iota}$ is the peak value of the current and $\tau_\alpha$ is the rise time. When the membrane potential reaches a given threshold value $\Theta$, the membrane potential is clamped to zero for an absolute refractory period $\tau_r$. The values for these parameters used in this article are $\tau_m$, 10 ms; $C$, 250 pF; $\Theta$, 20 mV; $\tau_r$, 2 ms; $\hat{\iota}$, 103.4 pA; and $\tau_\alpha$, 0.1 ms.

3 Exact Integration of Subthreshold Dynamics in a Discrete Time Simulation

The dynamics of the neuron model described in section 2 is linear and can therefore be reformulated to give a particularly efficient implementation for a discrete-time simulation (Rotter & Diesmann, 1999). We refer to this traditional, discrete-time approach as the grid-constrained implementation. Making the substitutions

$$y^1 = \frac{d}{dt}\iota + \frac{1}{\tau_\alpha}\iota, \qquad y^2 = \iota, \qquad y^3 = V,$$

where $y^i$ is the $i$th component of the state vector $y$ (superscripts index components, not powers), we arrive at the following system of linear equations:

$$\dot{y} = A y = \begin{bmatrix} -\frac{1}{\tau_\alpha} & 0 & 0 \\ 1 & -\frac{1}{\tau_\alpha} & 0 \\ 0 & \frac{1}{C} & -\frac{1}{\tau_m} \end{bmatrix} y, \qquad y(0) = \begin{bmatrix} \hat{\iota}\,\frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{bmatrix},$$

where $y(0)$ is the initial condition for a postsynaptic potential originating at time $t = 0$. The exact solution of this system is given by $y(t) = P(t)\,y(0)$, where $P(t) = e^{At}$ denotes the matrix exponential of $At$, which is an exact mathematical expression (see, e.g., Golub & van Loan, 1996). For a fixed time step $h$, the state of the system can be propagated from one grid position to the next by $y_{t+h} = P(h)\,y_t$. This is an efficient method because $P(h)$ is constant and has to be calculated only once at the beginning of a simulation. Moreover, $P(h)$ can be obtained in closed form (Diesmann, Gewaltig, Rotter, & Aertsen, 2001), for example, using symbolic algebra software such as Maple (Heck, 2003) or Mathematica (Wolfram, 2003), and can therefore be evaluated by simple expressions in the implementation language (see also section A.3). The complete update for the neuron state may be written as

$$y_{t+h} = P(h)\,y_t + x_{t+h}, \tag{3.1}$$

assuming incoming spikes are constrained to the grid, as the linearity of the system permits the initial conditions for all spikes arriving at a given grid point to be lumped together into one term, $x_{t+h}$:

$$x_{t+h} = \begin{bmatrix} \frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{bmatrix} \sum_{k \in S_{t+h}} \hat{\iota}_k. \tag{3.2}$$

Here $S_{t+h}$ is the set of indices $k \in \{1, \ldots, K\}$ of synapses that deliver a spike to the neuron at time $t + h$, and $\hat{\iota}_k$ represents the “weight” of synapse $k$. Note that the $\hat{\iota}_k$ may be arbitrarily signed and may also vary over the course of the simulation. The new neuron state $y_{t+h}$ is the exact solution to the subthreshold neural dynamics at time $t + h$, including all events that arrive at time $t + h$. This assumes that the neuron does not itself produce a spike in the interval $(t, t + h]$. If the membrane potential $y^3_{t+h}$ (the third component of the state vector) exceeds the threshold value $\Theta$, the neuron communicates a spike event to the network with a time stamp of $t + h$; the membrane potential is subsequently clamped to 0 in the interval $[t + h, t + h + \tau_r]$ (see Figure 1A). The earliest grid point at which a neuron could produce its next spike is therefore $t + 2h + \tau_r$. Note that for a grid-constrained implementation, $\tau_r$ must be an integer multiple of $h$, because the membrane potential is evaluated only at grid points, and we define it to be nonzero.

3.1 Computational Requirements. In order to preserve causality, it is necessary that there is a minimal synaptic delay of $h$. Otherwise, if a neuron spiked at time $t + h$ and its synapses had a propagation delay of 0, then this event would seem to arrive at some of its targets at $t + h$ and at some of them at $t + 2h$, depending on the order in which the neurons are updated. In practice, simulations are generally performed with synaptic delays that are greater than the time step $h$, and so some technique must be used to store events that have already been produced by a neuron but are not due to arrive at their targets for several time steps. In a grid-constrained simulation, only delays that are an integer multiple of $h$ can be considered because incoming spikes can be handled only at grid points. Consequently, pending events can be stored in a data structure analogous to a looped tape device (see Morrison, Mehring, et al., 2005).
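The grid-constrained update can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the closed-form entries of the propagator were derived by hand for this sketch from the matrix A of section 3 (they require the rise time and the membrane time constant to differ), and `propagator`, `RingBuffer`, `grid_step`, and `matvec` are hypothetical names. The ring buffer is a simple stand-in for the looped tape device, holding one summed weight per grid step.

```python
import math

# Parameter values quoted in section 2 (units: ms, pF, mV, pA).
TAU_M, C_M, THETA, TAU_R = 10.0, 250.0, 20.0, 2.0
TAU_A, I_HAT = 0.1, 103.4


def propagator(h):
    """Closed-form P(h) = exp(A h) for the lower-triangular A of section 3.

    Derived by hand for this sketch; assumes tau_alpha != tau_m.
    """
    a, b = 1.0 / TAU_A, 1.0 / TAU_M
    c = b - a
    ea, eb = math.exp(-a * h), math.exp(-b * h)
    p31 = (h * ea / c - (ea - eb) / c ** 2) / C_M
    p32 = (ea - eb) / (c * C_M)
    return [[ea, 0.0, 0.0],
            [h * ea, ea, 0.0],
            [p31, p32, eb]]


def matvec(P, y):
    """Plain 3x3 matrix-vector product."""
    return [sum(P[i][j] * y[j] for j in range(3)) for i in range(3)]


class RingBuffer:
    """Hypothetical stand-in for the looped tape device."""

    def __init__(self, max_delay_steps):
        self.slots = [0.0] * (max_delay_steps + 1)
        self.head = 0  # segment read during the most recent update

    def add(self, weight, delay_steps):
        """Deposit a weight delay_steps (>= 1) segments ahead of the head."""
        self.slots[(self.head + delay_steps) % len(self.slots)] += weight

    def read_next(self):
        """Advance the head by one segment; return and clear its content."""
        self.head = (self.head + 1) % len(self.slots)
        value, self.slots[self.head] = self.slots[self.head], 0.0
        return value


def grid_step(y, refr_left, P, buf, refr_steps):
    """Advance one neuron over (t, t+h]; returns (y, refr_left, spiked)."""
    y = matvec(P, y)                            # y_{t+h} = P(h) y_t
    y[0] += (math.e / TAU_A) * buf.read_next()  # + x_{t+h}, equation 3.2
    if refr_left > 0:                           # still refractory: clamp V
        y[2] = 0.0
        return y, refr_left - 1, False
    if y[2] >= THETA:                           # threshold reached at t + h
        y[2] = 0.0
        return y, refr_steps, True
    return y, 0, False


h = 0.1                                         # ms
P = propagator(h)                               # computed once
buf = RingBuffer(max_delay_steps=15)
buf.add(I_HAT, delay_steps=10)                  # one input spike, 1 ms delay
y, refr, v_trace = [0.0, 0.0, 0.0], 0, []
for _ in range(100):
    y, refr, spiked = grid_step(y, refr, P, buf, round(TAU_R / h))
    v_trace.append(y[2])
```

The example run delivers a single input spike and records the resulting postsynaptic potential in `v_trace`. A library routine for the matrix exponential could replace the hand-derived `propagator`, at the cost of an external dependency.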
If a neuron emits a spike at time t that has a

Continuous Spike Times in Exact Discrete-Time Simulations

53

AV Θ

0

B

t

t+h

t+2h

time

t+h

t+2h

time

V

Θ

0 t

t δ

= t Θ− t

Θ

τr

Figure 1: Spike generation and refractory periods for a grid-constrained (A) and a continuous-time implementation (B). The spike threshold is indicated by the dashed horizontal line. The solid black curve shows the membrane potential time course for a neuron subject to a suprathreshold constant input current leading to a threshold crossing in the interval (t, t + h]. The gray vertical lines indicate the discrete time grid with spacing h. The refractory period τr in this example is set to its minimal value h. Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. (A) In the grid-constrained implementation, the spike is emitted at t + h. During the refractory period τr , the membrane potential is clamped to zero. (B) In the continuous-time implementation, the threshold crossing is found by interpolation (here linear: black dashed line) at time t . The spike is emitted with the time stamp t + h and with an offset with respect to t of δ = t − t. The neuron is refractory from t until t + τr , during which period the membrane potential is clamped to zero. At grid point t + 2h, a finite membrane potential can be observed.


A. Morrison, S. Straube, H. Plesser, and M. Diesmann

delay of d, the simulation algorithm waits until all neurons have completed their updates for the integration step (t − h, t] and then delivers the event to its target(s). The event is placed in the tape device of the target neuron d/h segments on from the current reading position. It will then become visible to the target neuron at the grid point t + d, when the neuron is performing the update for the integration step (t + d − h, t + d]. Recalling that the initial conditions for all events arriving at a given time may be lumped together (see equation 3.2) and that two of the three components of the initial conditions vector are zero, the segments of the looped tape device can be very simple. Each segment contains just one value, which is incremented by the weight of every spike event delivered there. In other words, when the neuron performs the integration step (t, t + h], the segment visible to the reading head contains the first component of $x_{t+h}$ up to the scaling factor of $e/\tau_\alpha$. In terms of memory, the looped tape device needs as many segments as computation steps are required to cover the maximum synaptic delay, plus an additional segment to represent the events arriving at t + h. Therefore, performing a given simulation with a smaller time step will require more memory. The model described in section 2 has only a single synaptic dynamics and so requires only one tape device; models using multiple types of synaptic dynamics can be implemented in this framework by providing them with the corresponding number of tape devices.

4 Continuous-Time Implementation of Exact Integration

4.1 Canonical Implementation. The most obvious way to reconcile exact integration and precise spike timing within a discrete-time simulation is to store the precise times of incoming events. In order to represent this information, an offset must be assigned to each spike event in addition to the time stamp.
This offset is measured from the beginning of the interval in which the spike was produced: a spike generated at time t + δ receives a time stamp of t + h and an offset of δ (see Figure 1B). Given a sorted list of event offsets $\{\delta_1, \delta_2, \ldots, \delta_n\}$ with $\delta_i \le h$, which become visible to a neuron in the step (t, t + h], exact integration of the subthreshold dynamics can be performed from the beginning of the time step to the beginning of the list:

$$y_{t+\delta_1} = P(\delta_1)\, y_t + x_{\delta_1};$$

then along the list:

$$\begin{aligned}
y_{t+\delta_2} &= P(\delta_2 - \delta_1)\, y_{t+\delta_1} + x_{\delta_2} \\
&\;\;\vdots \\
y_{t+\delta_n} &= P(\delta_n - \delta_{n-1})\, y_{t+\delta_{n-1}} + x_{\delta_n};
\end{aligned}$$


and finally from the end of the list to the end of the time step:

$$y_{t+h} = P(h - \delta_n)\, y_{t+\delta_n}.$$

The final term $y_{t+h}$ is the exact solution for the neuron dynamics at time t + h. This sequence is illustrated in Figure 2B. This is assuming that the neuron does not produce a spike or emerge from its refractory period during this interval. These special cases are described in more detail below.

4.1.1 Spike Generation. In the grid-constrained implementation, the neuron state is inspected at the end of each time step to see if it meets its spiking criteria. In the case of the neuron model described in section 3, the criterion is $y^3 \ge \Theta$, where Θ is the threshold. In this implementation, the neuron state can be inspected after every step of the process described in section 4.1. If $y^3_{t+\delta_i} < \Theta$ and $y^3_{t+\delta_{i+1}} \ge \Theta$, then the membrane potential of the neuron reached threshold between $t + \delta_i$ and $t + \delta_{i+1}$. As the dynamics of this model is not invertible, the time $t_\Theta$ of this threshold passing can be determined only by interpolating the membrane potential in the interval $(t + \delta_i, t + \delta_{i+1}]$. For this article, linear, quadratic, and cubic interpolation schemes were investigated. After the threshold crossing, the neuron is refractory for the duration of its refractory period $\tau_r$. The membrane potential $y^3$ is set to zero and need not be calculated during this period, although the other components of the neuron state continue to be updated as in section 4.1. At the end of the time interval, an event is dispatched with a discrete-time stamp of t + h and an offset of $\delta = t_\Theta - t$ (see Figure 1B).

4.1.2 Emerging from the Refractory Period. The neuron emerges from its refractory period in the time step defined by $t < t_\Theta + \tau_r \le t + h$ (see Figure 1B). In contrast to the grid-constrained implementation, $\tau_r$ does not have to be an integer multiple of h. For a continuous time $\tau_r$, a grid position t can always be found such that $t_\Theta + \tau_r$ comes within (t, t + h]. However, the implementation is simpler when $\tau_r$ is a nonzero integer multiple of h.
To calculate the neuron state at time t + h exactly, the interval is divided into two subintervals: $(t, t_\Theta + \tau_r]$ and $(t_\Theta + \tau_r, t + h]$. In the first period, the neuron is still refractory, so when performing the exact integration along the incoming events as in section 4.1, $y^3$ is not calculated and remains at zero. At the end of this period, the neuron state $y_{t_\Theta + \tau_r}$ is the exact solution for the dynamics at the end of the refractory period. In the second period, the neuron is no longer refractory, and so the exact integration can be performed as usual (including the calculation of $y^3$). The neuron state $y_{t+h}$ is therefore an exact solution to the neuron dynamics at time t + h.
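The canonical update of section 4.1 can be sketched as follows for the subthreshold case. This is a minimal illustration, not the authors' code: the parameter values are placeholders, the state is assumed to be ordered as (y1, y2, V) with an alpha-shaped synaptic current whose initial condition is weight × e/τα in the first component, and the closed-form propagator is derived here for that assumed system. Refractoriness is left out, and only the linear variant of the threshold interpolation is shown.

```python
import math

# Illustrative parameter values (assumptions, not taken from the text).
TAU_M = 10.0   # membrane time constant [ms]
TAU_A = 0.5    # synaptic time constant [ms]
C_M = 250.0    # membrane capacitance [pF]


def propagate(y, dt):
    """Exact solution y(t + dt) = P(dt) y(t) of the linear subthreshold system."""
    y1, y2, v = y
    ea = math.exp(-dt / TAU_A)
    em = math.exp(-dt / TAU_M)
    beta = 1.0 / TAU_A - 1.0 / TAU_M
    p32 = (em - ea) / (C_M * beta)
    p31 = (em - ea * (1.0 + beta * dt)) / (C_M * beta ** 2)
    return (ea * y1,
            dt * ea * y1 + ea * y2,
            p31 * y1 + p32 * y2 + em * v)


def canonical_step(y, events, h):
    """Advance y over one step of length h along the sorted event offsets.

    events: list of (offset, weight) pairs with 0 < offset <= h, in any order.
    """
    t_local = 0.0
    for offset, weight in sorted(events):
        y = propagate(y, offset - t_local)                  # up to the event
        y = (y[0] + weight * math.e / TAU_A, y[1], y[2])    # add its initial condition
        t_local = offset
    return propagate(y, h - t_local)                        # on to the end of the step


def crossing_time_linear(t0, v0, t1, v1, theta):
    """Linear estimate of the threshold crossing between two supporting points."""
    return t0 + (theta - v0) * (t1 - t0) / (v1 - v0)


y_end = canonical_step((0.0, 0.0, 0.0), [(0.3, 100.0), (0.1, -50.0)], h=0.5)
print(y_end)
```

In a full implementation the membrane potential would be compared with the threshold after each `propagate` call, and `crossing_time_linear` (or a higher-order variant) would be applied to the two supporting points that bracket the crossing.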


[Figure 2, panels A, B, and C: membrane potential versus time on the grid t, t + h, t + 2h.]

Figure 2: Handling of incoming spikes in the grid-constrained (A), the canonical (B), and the prescient (C) implementations. In each panel, the solid curve represents the excursion of the membrane potential in response to two incoming spikes (gray vertical lines). Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. The gray horizontal arrows beneath each panel indicate the propagation steps performed during the time step (t, t + h]. Dashed arrows in A indicate that in the grid-constrained implementation, input spikes are effectively shifted to the next point on the grid. The observable membrane potentials following spike impact are identical and exact in B and C but differ from those in A.


4.1.3 Computational Requirements. As in the grid-constrained implementation, a minimum spike propagation delay of h is necessary in order to ensure that all events due at the neuron between t and t + h have arrived by the time the neuron performs its update for that interval. The simple looped event buffer described in section 3.1 must be extended to store the weights and offsets for incoming events separately. As the number of events arriving in an interval cannot be known ahead of time, this structure must be capable of dynamic memory allocation, which reduces its cache effectiveness. However, information about the minimum propagation delay in the network can be utilized to streamline the data structure so that its size does not depend on h, which compensates to some extent for the use of dynamic memory. Finally, as the events may arrive in any order, the buffer must also be capable of sorting the events with respect to increasing offset, which for a general-purpose sorting algorithm has complexity O(n log n). An alternative representation of the timing data is a priority queue; in practice, this was not quite as efficient as the looped tape device. In contrast to the grid-constrained implementation, delays can now be represented in continuous time. A spike arrival time t + δ + d, where t + δ is a continuous spike generation time and d is a continuous time delay, can always be decomposed into a discrete grid point t + d and a continuous offset δ. However, for notational and implementational convenience (see section 6), we assume d to be a nonzero integer multiple of h.

4.2 Prescient Implementation. In the implementation described in section 4.1, the neuron state at the end of a time step is calculated by integrating along a sorted list of events. However, as the subthreshold dynamics is linear, it is not dependent on the order of events.
In this section, an implementation is presented that exploits this fact to reduce the computational complexity and dynamic memory requirements of the canonical implementation.

4.2.1 Receiving an Event. Consider a spike event generated during the step (t − h, t] with offset δ and transmitted with delay d. This spike will be processed during the update step (t + d − h, t + d], its effect being observable for the first time at t + d. Since the correct spike arrival time is t + d − h + δ, when the algorithm delivers the spike to the neuron we evolve the effect of the spike input from the arrival time to the end of the interval at t + d using

$$\tilde{y}_{t+d} = P(h - \delta)\, x.$$

Therefore, instead of storing the entire event as for the canonical implementation, the three components of its effect on the neuron state can be stored in an event buffer at the position d instead. Due to the linearity of the system,


these components can be summed for all events due to arrive at the neuron in a given time step, regardless of the order in which the algorithm delivers them to the neuron or the order in which they are to become visible to the neuron. As we exploit the fact that the effect of an event on the neuron can be calculated before the event becomes visible to the neuron, we call this the prescient implementation.

4.2.2 Calculating the Subthreshold Dynamics. At the beginning of each time interval (t, t + h], the total effect of all events arriving within that step on the three components of the neuron state at time t + h is already stored in the event buffer. Calculating the neuron state at the end of the time step is therefore simple:

$$y_{t+h} = P(h)\, y_t + \tilde{y}_{t+h}.$$

The new neuron state $y_{t+h}$ is the exact solution to the neuron dynamics at time t + h, including all events that arrived within the step (t, t + h]. This is depicted in Figure 2C. As with the canonical implementation, there are two special cases that need to be treated with more care: a time step in which a spike is generated and one in which the neuron emerges from its refractory period.

4.2.3 Spike Generation. The process that generates a spike is very similar to that for the canonical implementation described in section 4.1.1. In this case, as the timing of the incoming events is no longer known, the neuron state can be inspected only at the end of the time step rather than at each incoming event, and so the length of the interpolation interval is h rather than the interspike interval of incoming events.

4.2.4 Emerging from the Refractory Period. As for the canonical implementation, the time step in which the neuron emerges from its refractory period is divided into two subintervals: $(t, t_\Theta + \tau_r]$ and $(t_\Theta + \tau_r, t + h]$. Setting $t_{em} = t_\Theta + \tau_r - t$, the neuron state at the end of the refractory period can be calculated as follows:

$$\begin{aligned}
y_{t+t_{em}} &= P(t_{em})\, y_t \\
y^3_{t+t_{em}} &\leftarrow 0.
\end{aligned}$$

Having emerged from the refractory period, the membrane potential is no longer clamped to zero and can develop normally during the remainder of the time step (see Figure 1B):

$$y_{t+h} = P(h - t_{em})\, y_{t+t_{em}} + \tilde{y}_{t+h}.$$
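Setting the refractory case aside, the basic prescient bookkeeping of sections 4.2.1 and 4.2.2 can be sketched as below and checked against event-by-event integration. This is an illustration under assumptions: the parameter values are placeholders, the function names are hypothetical, and the state is assumed to be ordered as (y1, y2, V) with an initial condition of weight × e/τα in the first component.

```python
import math

# Illustrative parameter values (assumptions, not taken from the text).
TAU_M, TAU_A, C_M = 10.0, 0.5, 250.0  # [ms], [ms], [pF]


def propagate(y, dt):
    """Exact solution y(t + dt) = P(dt) y(t) of the linear subthreshold system."""
    y1, y2, v = y
    ea = math.exp(-dt / TAU_A)
    em = math.exp(-dt / TAU_M)
    beta = 1.0 / TAU_A - 1.0 / TAU_M
    p32 = (em - ea) / (C_M * beta)
    p31 = (em - ea * (1.0 + beta * dt)) / (C_M * beta ** 2)
    return (ea * y1, dt * ea * y1 + ea * y2, p31 * y1 + p32 * y2 + em * v)


def add(a, b):
    """Component-wise sum of two state vectors."""
    return tuple(p + q for p, q in zip(a, b))


def deliver(y_tilde, offset, weight, h):
    """Fold one event into the buffered effect at the end of its arrival step."""
    x = (weight * math.e / TAU_A, 0.0, 0.0)
    return add(y_tilde, propagate(x, h - offset))


def prescient_step(y, y_tilde, h):
    """y(t + h) = P(h) y(t) + y_tilde, with no knowledge of event order."""
    return add(propagate(y, h), y_tilde)


def sequential_step(y, events, h):
    """Reference: integrate along the sorted events as in section 4.1."""
    t_local = 0.0
    for offset, weight in sorted(events):
        y = propagate(y, offset - t_local)
        y = (y[0] + weight * math.e / TAU_A, y[1], y[2])
        t_local = offset
    return propagate(y, h - t_local)


h = 0.5
y0 = (1.0, 2.0, 5.0)
events = [(0.3, 100.0), (0.1, -50.0), (0.45, 20.0)]  # delivered unsorted

y_tilde = (0.0, 0.0, 0.0)
for offset, weight in events:
    y_tilde = deliver(y_tilde, offset, weight, h)

y_prescient = prescient_step(y0, y_tilde, h)
y_sequential = sequential_step(y0, events, h)
print(max(abs(p - q) for p, q in zip(y_prescient, y_sequential)))
```

The two results agree up to floating-point rounding because the dynamics are linear; the buffer holds three numbers per step however many events arrive, and no sorting is needed.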


However, this overestimates the effect of the events arriving in the time interval (t, t + h] on the membrane potential. The summation of the components of these events was predicated on the assumption of linear dynamics, but as the membrane potential is clamped to zero until $t + t_{em}$, this assumption does not hold. Any events arriving at the neuron before its emergence from its refractory period should have no effect on its membrane potential before this point, yet adding the full value of the third component of $\tilde{y}_{t+h}$ assumes that they do. As a corrective measure, the effect of the new events on the membrane potential can be considered to be linear within the small interval (t, t + h], and the membrane potential can be adjusted accordingly:

$$y^3_{t+h} \leftarrow y^3_{t+h} - \gamma\, \tilde{y}^3_{t+h},$$

with $\gamma = t_{em}/h$.

4.2.5 Computational Requirements. As in the grid-constrained and canonical implementations, a minimum spike propagation delay of h is required to preserve causality. The looped-tape device described in section 3.1 needs to be able to store the three components of the neuron state rather than just the weight of the incoming events. Alternatively, three event buffers can be used, capable of storing one component each. Unlike the buffer devices for the canonical implementation, they need to store only one value per time step rather than one for each incoming spike, so there is no time overhead for sorting the values. Moreover, they do not require dynamic memory allocation and so are more cache effective.

5 Performance

In order to compare error scaling for the different implementations and interpolation orders, a simple single-neuron simulation was chosen. As the system is deterministic and nonchaotic, reducing the computation time step h causes the simulation results to converge to the exact solution, so error measures can be well defined. To investigate the costs incurred by simulating at finer resolutions or using computationally more expensive off-grid neuron implementations, a network simulation was chosen. This is fairer than a single-neuron simulation, as the run-time penalties of applications requiring more memory will come into play only if the application is large enough not to fit easily into the processor’s cache memory. Furthermore, it is only when performing network simulations that the bite of long simulation times per neuron is really felt.

5.1 Single-Neuron Simulations. Each experiment consisted of 40 trials of 500 ms each, during which a neuron of the type described in section 2 was stimulated with a constant excitatory current of 412 pA and unique

[Figure 3, panels A and B: Error [mV] versus Time step h [ms], both axes logarithmic.]

Figure 3: Scaling of error in membrane potential as a function of the computational resolution in double logarithmic representation. (A) Canonical implementation. (B) Prescient implementation. No interpolation, circles; linear interpolation, plus signs; quadratic interpolation, diamonds; cubic interpolation, multiplication signs. In both cases, the triangles show the behavior of the grid-constrained implementation, and the gray lines indicate the slopes expected for scaling of orders first to fourth with an arbitrary intercept of the vertical axis.

realizations of an excitatory Poissonian spike train of $1.3 \times 10^4$ Hz and an inhibitory Poissonian spike train of $3 \times 10^3$ Hz. The spike times of the Poissonian input trains were represented in continuous time. Parameters are as in section 2, but the peak value of the current resulting from an inhibitory spike was a factor of 6.25 greater than that of an excitatory spike to ensure a balance between excitation and inhibition. The output firing rate was 12.7 Hz. The experiment was repeated for each implementation with each interpolation order over a wide range of computational resolutions. As the membrane potential and spike times cannot be calculated analytically for this protocol, the canonical implementation with cubic interpolation at the finest resolution ($2^{-13}$ ms ≈ 0.12 µs) was defined to be the reference simulation for each realization of the input spike train. As a measure of the error in calculating the membrane potential, the deviation of the actual membrane potential from the reference membrane potential was sampled every millisecond for all the trials. In Figure 3, the median of these deviations is plotted as a function of the computational resolution in double logarithmic representation. In both the canonical implementation (see Figure 3A) and the prescient implementation (see Figure 3B), the same scaling behavior can be seen: for an interpolation order of n, the error in membrane potential scales with order n + 1 (see section A.4). The error has a lower bound at $10^{-14}$, which can be seen for very fine resolutions using cubic interpolation. This represents the greatest

[Figure 4, panels A and B: Error [ms] versus Time step h [ms], both axes logarithmic.]

Figure 4: Scaling of error in spike times as a function of the computational resolution. (A) Canonical implementation. (B) Prescient implementation. Symbols and lines as in Figure 3.

numerical precision possible for this physical quantity using the standard representation of floating-point numbers (see section A.1). Interestingly, the error for the canonical implementation also saturates at coarse resolutions. This is because interpolation is performed between incoming events rather than across the whole time step, as in the case of the prescient implementation. Consequently, the effective computational resolution cannot be any coarser than the average interspike interval of the incoming spike train (in this case, 1/16 ms), and this determines the maximum error of the canonical implementation. Note that the error of the grid-constrained implementation scales in the same way as that of the canonical and prescient implementations with no interpolation. However, due to the fact that incoming spikes are forced to the grid, the absolute error is greater for this implementation. The accuracy of a simulation is not determined by the membrane potential alone; the precision of spike times is of at least as much relevance. The median of the differences between the actual and the reference spike times is shown in Figure 4. As with the error in membrane potential, using an interpolation order of n results in a scaling of order n + 1, and the error has a lower bound that is exhibited at very fine resolution when using cubic interpolation. Furthermore, a similar upper bound on the error is observed for the canonical implementation at coarse resolutions. However, in this case, the grid-constrained implementation exhibits not only the same scaling, but a similar absolute error to the continuous-time implementations with no interpolation. 
Recalling that all the simulations receive identical continuous-time Poisson spike trains, the only difference remaining between the grid-constrained implementation and the continuous-time implementations without interpolation is that the former ignores the offset of the incoming spikes and treats them as if they were produced on the


grid, whereas the latter process the incoming spikes precisely. This reveals that handling incoming spikes precisely confers no real advantage if outgoing spikes are generated without interpolation and thereby forced to the grid. One might therefore conclude that the precise handling of incoming spikes is unnecessary and that the single-neuron integration error could be significantly improved just by performing an appropriate interpolation to generate spikes, while treating incoming spikes as if they were produced on the grid. In fact, this is not the case. If incoming spikes are treated as if on the grid, the error in the membrane potential decreases only linearly with h, thus limiting the accuracy of higher-order methods in determining threshold crossings. This is corroborated by simulations (data not shown). Substantial improvement in the accuracy of single-neuron simulations requires both techniques: precise handling of incoming spikes and interpolation of outgoing spikes.

5.2 Network Simulations. In order to determine the efficiency of the various implementations, a balanced recurrent network was adapted from Brunel (2000). The network contained 10,240 excitatory and 2560 inhibitory neurons and had a connection probability of 0.1, resulting in a total of $15.6 \times 10^6$ synapses. The inhibitory synapses were a factor of 6.25 stronger than the excitatory synapses, and each neuron received a constant excitatory current of 412 pA as its sole external input. Membrane potentials were initialized to values chosen from a uniform random distribution over [−Θ/2, 0.99Θ]. In this configuration, the network fires at approximately 12.7 Hz in the asynchronous irregular regime, which recreates the input statistics used in the single-neuron simulations. The synaptic delay was 1 ms, and the network was simulated for 1 biological second. The simulation time and memory requirements for the network simulation described above are shown in Figure 5.
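As a back-of-the-envelope check that the network recreates the single-neuron input statistics, the mean input rates per neuron can be estimated from the numbers quoted above. The estimate assumes that every neuron receives input from a fraction 0.1 of each population and that all neurons fire at the mean rate of 12.7 Hz; the symbols $\nu_{\mathrm{exc}}$ and $\nu_{\mathrm{inh}}$ are introduced here only for this check.

```latex
\begin{align*}
\nu_{\mathrm{exc}} &\approx 0.1 \times 10{,}240 \times 12.7\,\mathrm{Hz} \approx 1.3 \times 10^{4}\,\mathrm{Hz},\\
\nu_{\mathrm{inh}} &\approx 0.1 \times 2560 \times 12.7\,\mathrm{Hz} \approx 3.3 \times 10^{3}\,\mathrm{Hz},\\
\nu_{\mathrm{exc}} + \nu_{\mathrm{inh}} &\approx 1.6 \times 10^{4}\,\mathrm{Hz}.
\end{align*}
```

These values are close to the excitatory and inhibitory Poissonian rates used in section 5.1, and the total corresponds to a mean interval between incoming spikes of roughly 1/16 ms.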
For the simulation time (see Figure 5A), it can be seen that at coarse resolutions, the grid-constrained implementation is significantly faster than the prescient implementation, which in turn is faster than the canonical implementation. This is due to the fact that the cost of processing spikes is essentially independent of the computational resolution and manifests as an implementation-dependent constant contribution to the simulation time, which is particularly dominant at coarse resolutions. The difference in speed between the canonical and prescient implementations results from the use of dynamic as opposed to static data structures, and to a lesser extent from the cost of sorting incoming spikes in the canonical implementation. As the computation time step decreases, the simulation times converge, because the cost of updating the neuron dynamics in the absence of events, which is the same for all implementations, is inversely proportional to the resolution and so manifests as a scaling with exponent −1 at small computation time steps (see Figure 5A). It is clear that in general, at the same computation time step, the continuous-time implementations must be slower than the grid-constrained

[Figure 5, panels A and B: (A) Simulation time [s] and (B) Memory [GB] versus Time step h [ms].]

Figure 5: Simulation time (A) and memory requirements (B) for a network simulation as functions of computational resolution in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Other interpolation orders for the canonical and the prescient implementations result in practically identical curves and are therefore not shown. For details of the simulation, see the text.

implementation, as the former perform a propagation for every incoming spike and the latter does not. The increased costs concomitant with higher interpolation orders proved to be negligible in a network simulation. An increase in memory requirements can be observed for all implementations (see Figure 5B) as the resolutions become finer. Although all implementations require much the same amount of memory at coarser resolutions, for finer resolutions, the canonical implementation requires the least memory, followed by the grid-constrained implementation, and the prescient implementation requires the most. It is clear that for a wide range of resolutions, the memory required by the rest of the network, specifically the synapses, dominates the total memory requirements (Morrison, Mehring, et al., 2005). As the resolution becomes finer, the memory required for the input buffers for the neurons plays a greater role. The spike buffer for the canonical implementation is independent of the resolution (see section 4.1.3), and so it might seem that it should not require more memory at finer resolution. However, all implementations tested here also have a buffer for a piece-wise constant input current. In addition to this, the grid-constrained implementation has one buffer for the weights of incoming spikes, and the prescient implementation has three buffers—one for each component of the state vector. All of these buffers require memory in inverse proportion to the resolution, thus explaining the ordering of the curves. Generally a smaller application is more cache effective than a larger one, and this may explain why the canonical implementation exhibits slightly lower


simulation times than the other implementations at very fine resolutions (see Figure 5A).

5.3 Conjunction of Integration Error and Run-Time Costs. The considerations in section 5.2 of how the simulation time and memory requirements increase with finer resolutions are of limited practical relevance to a scientist with a particular problem to investigate. More interesting in this case is how much precision bang you get for your simulation time buck. Unlike the single neuron, however, the network described above is a chaotic system (Brunel, 2000). Any deviation at all between simulations will lead, in a short time, to totally different results on the microscopic level, such as the evoked spike patterns. Such a deviation can even be caused by differences in the tiny round-off errors that occur if floating-point numbers are summed in a different order. Because of this, these simulations do not converge on the microscopic level as the single-neuron simulations do, and for that reason the so-called accuracy of a simulation cannot be taken at face value. We therefore relate the cost of network simulations to the accuracy of single-neuron simulations with comparable input statistics. In Figure 6, the simulation time and memory requirements data from Figure 5 are combined with the accuracy of the corresponding single-neuron simulations shown in Figure 4, thus eliminating the computational resolution as a parameter. Figure 6A shows the simulation time as a function of spike time error for the three implementations. This graph can be read in two directions: horizontally and vertically. By reading the graph horizontally, we can determine which implementation will give the best accuracy for a given affordable simulation time. Reading the graph vertically allows us to determine which implementation will result in the shortest simulation time for a given acceptable accuracy.
Concentrating on the latter interpretation, it can be seen from the intersection of the lines corresponding to the prescient and grid-constrained implementations (vertical dashed line in Figure 6A) that if an error greater than $2.3 \times 10^{-2}$ ms is acceptable, the grid-constrained implementation is faster. For better accuracy than this, the prescient implementation is more effective. If an appropriate time step is chosen, the prescient implementation can simulate more accurately and more quickly than a given grid-constrained simulation in this regime. Only for very high accuracy can a lower simulation time be achieved using the canonical implementation. Similarly, Figure 6B, which shows the memory requirements as a function of spike-time error in double logarithmic representation, can be read in both directions. This shows qualitatively the same relationship as in Figure 6A, but the point at which one would switch from the prescient implementation to the canonical implementation in order to conserve memory occurs for larger errors. The flatness of the curves for the continuous-time implementations shows that it is possible to increase the accuracy of a simulation considerably without having to worry about the memory requirements.

[Figure 6, panels A to D: (A) Simulation time [s] versus Error [ms]; (B) Memory [GB] versus Error [ms]; (C) Eqv. Error [ms] versus Input rate [kHz], with regions labeled "prescient faster" and "grid-constrained faster"; (D) Simulation time [s] versus Input rate [kHz].]

Figure 6: Analysis of simulation time and memory requirements for a network simulation as functions of spike-time error for the single-neuron simulation and input spike rate. (A) Simulation time as a function of spike-time error in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Data combined from Figure 4 and Figure 5. For errors smaller than $2.3 \times 10^{-2}$ ms (vertical dashed line), a continuous-time implementation with an appropriately chosen computation step size gives better performance. (B) Memory consumption as a function of spike-time error in double logarithmic representation. Symbols as in A. (C) Error in spike time for which the prescient and grid-constrained implementations require the same simulation time (for different appropriate choices of h) as a function of input spike rate. The unfilled circle indicates the equivalence error for a network rate of 12.7 Hz (input rate 16 kHz), that is, the intersection marked by the vertical dashed line in A. The gray line is a linear fit to the data (slope −0.95). (D) Simulation time as a function of input spike rate for h = 0.125 ms, symbols as in A. The gray lines are linear fits to the data (slopes 1.3, 0.8 and 0.3 s per kHz from top to bottom).


In general, the point at which the prescient implementation at an appropriate computational resolution can produce a faster and more accurate simulation than a grid-constrained simulation will depend on the rate of events a neuron has to handle. To investigate this relationship, the network described in section 5.2 was simulated with different input currents and inhibitory synapse strengths to generate a range of different firing rates in the asynchronous irregular regime. The single-neuron integration error at which the prescient and the grid-constrained implementations require the same simulation time is shown in Figure 6C as a function of the input spike rate. This equivalence error depends linearly on the input spike rate, demonstrating that in the parameter space of useful accuracies and realistic input rates, there is a wide regime where the prescient implementation is faster than the grid-constrained implementation. Underlying the benign nature of the comparative effectiveness analyzed in Figure 6C is the dependence of simulation time on the rate of events. For all implementations, the simulation time increases practically linearly with the input spike rate (see Figure 6D), albeit with different slopes.

5.4 Artificial Synchrony. The previous section has shown that in many situations, continuous-time implementations achieve a desired single-neuron integration error more effectively than the grid-constrained implementation. However, continuous-time implementations have an advantage compared to the grid-constrained implementation beyond the single-neuron integration error. In a network simulation carried out with the grid-constrained implementation, the spikes of all neurons are aligned to the temporal grid defined by the computation time step. This causes artificial synchronization between neurons that may distort measures of synchronization and correlation on the network level. To demonstrate this effect, Hansel et al.
(1998) investigated a small network of N integrate-and-fire neurons with excitatory all-to-all coupling. Here, we extend their analysis to the three implementations under study and provide a comparison of single-neuron and network-level integration error. In contrast to their study, our network is constructed from the model neuron introduced in section 2. The time constant of the synaptic current $\tau_\alpha$ is adjusted to the rise time of the synaptic current in the original model, which was described by a β-function. The synaptic delay d and the absolute refractory period $\tau_r$ are set to the maximum computation time step h investigated in this section. Note that this choice of d and $\tau_r$ means that these parameters can be represented at all computational resolutions, thus ensuring that all simulations using the grid-constrained implementation are solving the same dynamical system. Figure 7A illustrates the synchrony in the network as a function of synaptic strength. Following Hansel et al. (1998), synchrony is defined as the variance of the population-averaged membrane potential normalized by the population-averaged variance of the membrane

[Figure 7 (plots not reproduced). Panel A: synchrony versus synaptic strength. Panel B: synchronization error [%] versus time step h [ms]. Panel C: synchronization error [%] versus error [ms].]

Figure 7: Synchronization error in a network simulation. (A) Network synchrony (see equation 3.3) as a function of synaptic strength in a fully connected network of N = 128 neurons ($\tau_\alpha = (3/2) \ln 3$ ms, synaptic delay $d = \tau_r = 0.25$ ms) with excitatory coupling (cf. Hansel et al., 1998). Neurons are driven by a suprathreshold DC $I_0 = 575$ pA, no further external input. The initial membrane potential is $V_i(0) = \frac{\tau_m}{C} I_0 \left[1 - \exp\left(-\gamma \frac{i-1}{N} \frac{T}{\tau_m}\right)\right]$, where $i \in 1, \ldots, N$ is the neuron index, T the period in the absence of coupling, and γ = 0.5 controls the initial coherence. The simulation time is 10 s, and V is recorded in intervals of 1 ms between 5 s and 10 s. Synaptic strength is expressed as the amplitude of a postsynaptic current relative to the rheobase current $I_\infty = (C/\tau_m)\theta = 500$ pA and multiplied by the number of neurons N. Other parameters as in section 2. Canonical implementation as reference ($h = 2^{-10}$ ms, gray curve) and prescient implementation ($h = 2^{-2}$ ms, circles), both with cubic interpolation; grid-constrained implementation ($h = 2^{-5}$ ms, triangles). (B) Synchronization error as a function of the computation time step in double logarithmic representation: grid-constrained implementation, triangles; prescient implementation, circles; canonical implementation, plus signs. (C) Synchronization error as a function of the single-neuron integration error for the grid-constrained and the prescient implementation, same representation as in B. The gray lines are linear fits to the data.


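potential time course,

$$S = \frac{\left\langle \langle V_i(t) \rangle_i^2 \right\rangle_t - \left\langle \langle V_i(t) \rangle_i \right\rangle_t^2}{\left\langle \left\langle V_i(t)^2 \right\rangle_t - \langle V_i(t) \rangle_t^2 \right\rangle_i}, \tag{5.1}$$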

where $\langle \cdot \rangle_i$ indicates averaging over the N neurons and $\langle \cdot \rangle_t$ indicates averaging over time. This is a measure of coherence in the limit of large N with S = 1 for full synchronization and S = 0 for the asynchronous state. The grid-constrained implementation exhibits a considerable error in synchrony, which vanishes in approaching the asynchronous regime. The prescient implementation accurately preserves network synchrony even with a significantly larger computation time step. The error in synchrony is quantified in Figure 7B as the root mean square of the relative deviation of S with respect to the reference solution, estimated over the range of synaptic strength investigated. Note that this includes the asynchronous regime where errors are small in general. The prescient implementation is easily an order of magnitude more accurate than the grid-constrained implementation and is itself outperformed by the canonical implementation. In addition, for the continuous-time implementations, the error in synchrony drops more rapidly with decreasing computational time step h than for the grid-constrained implementation. However, at the same h, different integration methods exhibit a different integration error for the single-neuron dynamics (see Figure 4). Therefore, to accentuate network effects, continuous-time and grid-constrained implementations should be compared at the same single-neuron integration error. To this end, we proceed as follows: the network spike rate is approximately 80 Hz, corresponding to an input spike rate of some 10 kHz. In Figure 7C, the error in network synchrony at a given computational time step h is plotted as a function of the spike time error of a single neuron driven with an input spike rate of approximately 10 kHz, simulated at the same h. For single-neuron errors of $10^{-2}$ ms and above, the grid-constrained implementation results in considerable errors in network synchrony.
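As a minimal sketch of how the synchrony measure S of equation 5.1 can be evaluated from recorded membrane potentials, the following Python function computes the ratio of the two variances from a table of samples. The function name `synchrony`, the data layout (one row per sample time, one column per neuron), and the sinusoidal test traces are illustrative assumptions and are not taken from the simulations reported here.

```python
import math


def synchrony(v):
    """Synchrony S for samples v[t][i] (sample time t, neuron i)."""
    n_t, n = len(v), len(v[0])
    # Variance over time of the population-averaged potential (numerator).
    pop = [sum(row) / n for row in v]
    pop_mean = sum(pop) / n_t
    var_pop = sum(p * p for p in pop) / n_t - pop_mean ** 2
    # Variance over time of each neuron, averaged over neurons (denominator).
    var_single = 0.0
    for i in range(n):
        mean_i = sum(row[i] for row in v) / n_t
        mean_sq_i = sum(row[i] ** 2 for row in v) / n_t
        var_single += (mean_sq_i - mean_i ** 2) / n
    return var_pop / var_single


# Hypothetical test traces: sinusoids sampled over 10 full periods.
N, STEPS, PERIOD = 8, 1000, 100
in_phase = [[math.sin(2 * math.pi * t / PERIOD)] * N for t in range(STEPS)]
spread = [[math.sin(2 * math.pi * (t / PERIOD + i / N)) for i in range(N)]
          for t in range(STEPS)]

s_sync = synchrony(in_phase)   # identical traces: S close to 1
s_async = synchrony(spread)    # evenly spread phases: S close to 0
print(s_sync, s_async)
```

The two test cases correspond to the limits quoted in the text: identical traces give a value close to 1, and traces whose phases are spread evenly over the population give a value close to 0.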
Spike timing errors of $10^{-2}$ ms and below are required for the grid-constrained and the prescient implementations to achieve a synchronization error in the 1% range or better. Interestingly, the grid-constrained implementation exhibits a larger synchronization error than the prescient implementation for identical single-neuron integration errors.

6 Discussion

We have shown that exact integration techniques are compatible with continuous-time handling of spike events within a discrete-time simulation. This combination of techniques achieves arbitrarily high accuracy (up to machine precision) without incurring any extra management costs in the global algorithm, such as a central event queue or looking ahead to see


which neuron will fire next. This is particularly important for the study of large networks with frequent events, as the cost of managing events can become prohibitive (Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003). We introduced a canonical implementation that illustrates the principles of combining these techniques and a prescient implementation that further exploits the linearity of the subthreshold dynamics. The latter implementation simplifies the neuron update algorithm and requires only static data structures and no queuing, leading to a better time and accuracy performance than the canonical implementation. We compared interpolating polynomials of orders 1 to 3 and discovered that the increased numerical complexity of the higher-order interpolations was not reflected in the run time, which is dominated by other factors. Furthermore, it was shown that the highest-order interpolation performed stably. This suggests that the highest-order interpolation should be used, as the greater accuracy is obtained at negligible cost. We have investigated the nature of the trade-off between accuracy and simulation time/memory and demonstrated that for a large range of input spike rates, it is possible to find a combination of continuous-time implementation and computation time step that fulfills a given maximum error requirement both more accurately and faster than a grid-constrained simulation. This measure of efficiency is based on truly large-scale networks (12,800 neurons, 15.6 million synapses). The techniques described here have several possible extensions. First, the canonical implementation places no constraints on the neuron model used beyond the physically plausible requirement that the membrane potential is thrice continuously differentiable. It may therefore be used to implement essentially any kind of neuronal dynamics, including neurons with conductance-based synapses. 
The prescient implementation further requires that the neuron’s dynamics is linear; it may thus be used for a wide range of model neurons with current-based synapses. The neuron model we implemented does not have invertible dynamics, and so the determination of its spike time necessarily involves approximation. For some neuron models, it is possible to determine the precise spike time without recourse to approximation, such as the Lapicque model (Tuckwell, 1988) and its descendants (Mirollo & Strogatz, 1990). Such models can, of course, also be implemented in this framework, but they would have only the same precision as a classical event-based simulation if they were canonically implemented (obviously without interpolation); a prescient implementation would be able to represent the subthreshold dynamics exactly on the grid but would entail the use of approximative methods to determine the spiketimes. Although we investigated polynomial interpolation, other methods of spike time determination such as Newton’s method can be implemented with no change to the conceptual framework. Second, most constraints imposed in terms of the computational time step h may be relaxed. As indicated in section 4.1.3, the restriction of


delays to nonzero integer multiples of h can be relaxed to any floating-point number ≥ h. When a neuron spikes, the offset of the spike could then be combined on the fly with the delay to create an integer component and a revised offset, thus allowing the spike to be delivered correctly by the global discrete-time algorithm, and processed correctly by the continuous-time neuron model. This relaxation would come at the memory cost of having to store delays as floating-point numbers rather than integers and the computational cost of having to perform the on-the-fly delay decomposition. Furthermore, h is currently a parameter of the entire network, so it is the same for all neurons in a given simulation. Given that the minimum propagation delay already defines natural synchronization points to preserve causality in the simulation, it would be possible to allow h to be chosen individually for each neuron in the network, or even to use variable time steps, while still maintaining consistency with the global discrete-time algorithm. Finally, the techniques are compatible with the distributed computing techniques described in Morrison, Mehring, et al. (2005), requiring only that the spike-time offsets are communicated in addition to the indices of the spiking neurons. This increases the bulk but not the frequency of communication, as it is still sufficient to communicate in intervals determined by the minimum propagation delay. A similar minimum delay principle is used by Lytton and Hines (2005), again suggesting a convergence of time-driven and event-driven approaches. When investigating a particular system, it is worthwhile considering what accuracy is necessary. For the networks described in section 5.2, it would be pointless to simulate with a very small time step, as they are chaotic. As long as the relevant macroscopic measures are preserved, any time step is as good or as bad as any other.
However, a good rule of thumb is that it should be possible to discriminate spike times an order of magnitude more accurately than the characteristic timescales of the macroscopic phenomena to be observed, such as the temporal structure of cross-correlations. Note that even if the characteristics of the system to be investigated suggest that a grid-constrained implementation is optimal, the availability of equivalent continuous-time implementations is still advantageous: should the suspicion arise that an observed phenomenon is an artifact of the grid constraints, they can be used to test this without altering any other part of the simulation. For much the same reason, it is very useful to be able to modify the time step without having to adjust the rest of the simulation. The network studied in section 5.4 illustrates two more important points. First, it demonstrates that exceedingly small single-neuron integration errors may be required to accurately capture network synchronization. Second, it is clear from Figure 7C that continuous-time implementations are better at rendering macroscopic measures such as synchrony correctly: in conditions where the grid-constrained and prescient implementations


achieve the same single-neuron spike-timing error, the prescient implementation yields a significantly smaller error in network synchrony. Accuracy cannot be improved on indefinitely: Figures 3 and 4 show that the errors in both membrane potential and spike timing saturate at about $10^{-14}$ mV and ms, respectively, for a time step of around $10^{-3}$ ms. The saturation accuracy is close to the maximal precision of the standard floating-point representation of the computer, and so $10^{-3}$ ms represents a lower bound on the range of useful h. An upper bound is determined by the physical properties of the system. First, h may not be larger than the minimum propagation delay in the network or the refractory period of the neurons. Second, using a large h increases the danger that a spike is missed. This can occur if the true trajectory of the membrane potential passes through the threshold within a step but is subthreshold again by the end of the step. This is less of an issue for the canonical implementation, as the check for a suprathreshold membrane potential is not just performed at the end of every step but also at the arrival of every incoming event (see section 4.1.1).

There is a common perception that event-driven algorithms are exact and time-driven algorithms are approximate. However, both parts of this perception are generally false. With respect to the first part, event-driven algorithms are not by the nature of the algorithm more exact than time-driven algorithms. It depends on the dynamics of the neuron model whether an event-driven algorithm can find an exact solution, just as it does for time-driven algorithms. For a restricted class of models, the spike times can be calculated exactly through inversion of the dynamics. For other models, approximate methods to determine the spike times need to be employed. With respect to the second part, time-driven algorithms are not necessarily approximate.
A discrete-time algorithm does not imply that spike times have to be constrained onto the grid, as shown by Hansel et al. (1998) and Shelley and Tao (2001). Moreover, the subthreshold dynamics for a large class of neuron models can be integrated exactly (Rotter & Diesmann, 1999). Here we combine these insights to show that the degree of approximation in a simulation is not determined by whether an event-driven or a time-driven algorithm is used but by the dynamics of the neuron model. A further question is whether the terms time-driven and event-driven should even be used in this mutually exclusive way. In our algorithm, neuron implementations treating incoming and outgoing spikes in continuous time are seamlessly integrated into a global discrete-time algorithm. Should this therefore be considered a time-driven or an event-driven algorithm? We believe that this combination of techniques represents a hybrid algorithm that is globally time driven but locally event driven. Similarly, when designing a distributed simulation algorithm (Morrison, Mehring, et al., 2005), it was shown that a time-driven neuron updating algorithm can be successfully combined with event-driven synapse updating, again suggesting that no dogmatic distinction between the two approaches need


be made. However, although we were able to demonstrate the potential advantages of a hybrid algorithm, these findings do not in principle rule out the existence of pure event-driven or time-driven algorithms with identical universality and better performance for a given set of parameters than the schemes presented here. In closing, we express our hope that this work can help defuse the at times heated debate between advocates of event-driven and time-driven algorithms for the simulation of neuronal networks.

Appendix: Numerical Techniques

In this appendix, we present the numerical techniques employed to achieve the reported accuracy.

A.1 Accuracy of Floating-Point Representation of Physical Quantities. The double representation of floating-point numbers (Press, Teukolsky, Vetterling, & Flannery, 1992) used by standard computer hardware limits the accuracy with which physical quantities can be stored. The machine precision $\varepsilon$ is the smallest number for which the double representation of $1 + \varepsilon$ is different from the representation of 1. Consequently, the absolute error $\sigma_x$ of a quantity x is limited by the magnitude of x, $\sigma_x \approx 2^{\lfloor \log_2 x \rfloor} \cdot \varepsilon$. In double representation, we have $\varepsilon = 2^{-52} \approx 2.22 \cdot 10^{-16}$. Membrane potential values y are on the order of 20 mV; the lower limit of the integration error in the membrane potential is therefore on the order of $5 \cdot 10^{-15}$ mV. According to the rules of error propagation, the error in the time of threshold crossing $\sigma_\delta$ depends on the error in membrane potential as $\sigma_\delta = |1/\dot{y}|\,\sigma_y$. Typical values of the derivative $|\dot{y}|$ of the membrane potential are on the order of 1 mV per ms (single-neuron simulations, data not shown), from which we obtain $5 \cdot 10^{-15}$ ms as a lower bound for the error in spike timing. Therefore, the observed integration errors at which the simulations saturate are close to the limits imposed by the double representation for both physical quantities.

A.2 Representation of Spike Times.
We have seen in section A.1 that the absolute error σx depends on the magnitude of the quantity x. As a consequence, the error of spike times recorded in double representation increases with simulation time. An additional error is introduced if the computation time step h cannot be represented as a double (e.g., 0.1 ms).
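The representational limits described in sections A.1 and A.2 can be inspected directly. The following Python sketch is illustrative only: the helper name `resolution` is hypothetical, and the values 20 mV and 10 s are taken from the text and from the simulation time quoted for Figure 7. It prints the machine precision, the spacing of doubles at those magnitudes, and whether two candidate time steps are exactly representable.

```python
import math
import sys
from fractions import Fraction

EPS = sys.float_info.epsilon  # machine precision of a double, 2**-52


def resolution(x):
    """Spacing of doubles near x > 0, i.e. 2**floor(log2(x)) * EPS."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(EPS, exponent - 1)


print(EPS)                 # about 2.22e-16
print(resolution(20.0))    # membrane potential of 20 mV: about 3.6e-15 mV
print(resolution(1.0e4))   # spike time 10 s = 1e4 ms into a run: about 1.8e-12 ms

# A step of 0.1 ms is not exactly representable as a double; 2**-3 ms is.
print(Fraction(0.1) == Fraction(1, 10))       # False
print(Fraction(2.0 ** -3) == Fraction(1, 8))  # True
```

The third printed value shows that an absolute spike time stored as a single double late in a 10 s run is resolved roughly three orders of magnitude more coarsely than a quantity of order 20.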


Therefore, we record spike times as a pair of two values {t + h, δ}. The first one is an integral number in units of h represented as a long int specifying the computation step in which the spike was emitted. The second one is the offset of the spike time in units of ms represented as a double. If h is a power of 2 in units of ms, both values can be represented as doubles without loss of accuracy.

A.3 Evaluating the Update Equation. The implementation of the update equation 3.1 in the target programming language requires attention to numerical detail if utmost precision is desired. We were able to reduce membrane potential errors in a nonspiking simulation from some $10^{-12}$ mV to about $10^{-15}$ mV by careful rearrangement of terms. Although details may depend significantly on processor architecture and compiler optimization strategies, we will briefly recount the implementation we found optimal. The matrix-vector multiplication in equation 3.1 describes updates of the form

$$y \leftarrow (1 - e^{-h/\tau})\,a + e^{-h/\tau}\,y,$$

where a and y are of order unity, while $h \ll \tau$ so that $e^{-h/\tau} \approx 1$. For a time step of $h = 2^{-12}$ ms and a time constant of τ = 10 ms, one has $\gamma = 1 - e^{-h/\tau} \approx 10^{-5}$. The quantity γ can be computed accurately for small values of h/τ using the function expm1(x) provided by current numeric libraries (C standard library; see also Galassi et al., 2001). Using double resolution, γ will have some 15 significant digits, spanning roughly from $10^{-5}$ to $10^{-20}$, and all of these digits are nontrivial. The exponential $e^{-h/\tau}$ may be computed to 15 significant digits using exp(x); the first five of these will be trivial nines, though, leaving just 10 nontrivial digits, which furthermore span down to only $10^{-15}$. We thus rewrite the equation above entirely in terms of γ,

$$y \leftarrow \gamma a + (1 - \gamma) y,$$

and finally as

$$y \leftarrow \gamma (a - y) + y.$$

In our experience, this final form yields the most accurate results, as the full precision of γ is retained as long as possible. Note that computing the 1 − γ term above discards the five least significant digits of γ. When several terms need to be added, they should be organized according to their expected magnitude starting with the smallest components.

A.4 Polynomial Interpolation of the Membrane Potential. In order to approximate the time of threshold crossing, the membrane potential $y_t^3$


known at grid points t with spacing h can be interpolated with polynomials of different order. For the purpose of this section, we drop the index specifying the membrane potential as the third component of the state vector. Without loss of generality, we assume that the threshold crossing occurs at time $\delta^*$ in the interval (0, h]. The corresponding values of the membrane potential are denoted $y_0 < \Theta$ and $y_h \geq \Theta$, respectively. The threshold crossings are found using the explicit formulas for the roots of polynomials of order n = 1, 2, and 3 (Weisstein, 1999). In order to constrain the polynomials, we exploit the fact that the derivative of the membrane potential can be easily obtained from the state vector at both sides of the interval. For the grid-constrained (n = 0) simulation and linear (n = 1) interpolation, we demonstrate why the error in spike timing decreases with $h^{n+1}$.

A.4.1 Grid-Constrained Simulation. In the variables defined above, the approximate time of threshold crossing δ equals the computation time step h; the spike is reported to occur at the right border of the interval (0, h]. Assuming the membrane potential y(t) to be exact, the error in membrane potential with respect to the value at the exact point of threshold crossing is $\epsilon = y_h - y(\delta^*)$. Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, $y(t) = y_0 + \dot{y}_0 t + O(t^2)$. Considering terms up to first order, we obtain

$$\epsilon = \{y_0 + \dot{y}_0 h\} - \{y_0 + \dot{y}_0 \delta^*\} = \dot{y}_0 (h - \delta^*).$$

Hence, $\epsilon$ reaches its maximum amplitude at $\delta^* = 0$, and we can write $|\epsilon| \leq |\dot{y}_0| h$. The error in spike timing is

$$\sigma = h - \delta^* = \frac{1}{\dot{y}_0}\,\epsilon$$

and $|\sigma| \leq h$.


A.4.2 Linear Interpolation. A polynomial of order 1 is naturally constrained by the values of the membrane potential ($y_0$ and $y_h$) at both ends of the interval. With $y_t = a t + b$, the set of equations specifying the coefficients of the polynomial (a and b) is

$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ h & 1 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \end{pmatrix} = \begin{pmatrix} -h^{-1} & h^{-1} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \end{pmatrix} = \begin{pmatrix} (y_h - y_0) h^{-1} \\ y_0 \end{pmatrix}.$$

Thus, in normalized form, we need to solve

$$0 = \frac{y_h - y_0}{h}\,\delta + (y_0 - \Theta) \tag{A.1}$$

for δ to find the approximate time of threshold crossing. At the exact point of threshold crossing $\delta^*$, the error in membrane potential is

$$\epsilon = \frac{y_h - y_0}{h}\,\delta^* + y_0 - y(\delta^*). \tag{A.2}$$

Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, $y(t) = y_0 + \dot{y}_0 t + \frac{1}{2} \ddot{y}_0 t^2 + O(t^3)$. Considering terms up to second order, we obtain

$$\epsilon = \left\{\dot{y}_0 \delta^* + \tfrac{1}{2} \ddot{y}_0 h \delta^*\right\} + y_0 - \left\{y_0 + \dot{y}_0 \delta^* + \tfrac{1}{2} \ddot{y}_0 \delta^{*2}\right\} = \tfrac{1}{2} \ddot{y}_0 \left(\delta^* h - \delta^{*2}\right).$$

The time of threshold crossing is bounded by the interval (0, h]. Hence, $\epsilon$ reaches its maximum amplitude at $\delta^* = \frac{1}{2} h$, and we can write

$$|\epsilon| \leq \frac{1}{8}\,|\ddot{y}_0|\,h^2.$$


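Noting that $y(\delta^*) = \Theta$ in equation A.2, we have

$$\epsilon = \frac{y_h - y_0}{h}\,\delta^* + (y_0 - \Theta), \tag{A.3}$$

and subtracting equation A.1 from A.3, we obtain

$$\epsilon = \frac{y_h - y_0}{h}\,(\delta^* - \delta).$$

Thus, the error in spike timing is

$$\sigma = \delta^* - \delta = \frac{h}{y_h - y_0}\,\epsilon.$$

With the help of the expansion

$$\frac{h}{y_h - y_0} = \frac{1}{\dot{y}_0} - \frac{1}{2} \frac{\ddot{y}_0}{\dot{y}_0^2}\,h,$$

we arrive at

$$|\sigma| \leq \frac{1}{8} \left|\frac{\ddot{y}_0}{\dot{y}_0}\right| h^2.$$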
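The two scaling results, a timing error of order h for the grid-constrained scheme and of order h² for linear interpolation, can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical passive membrane charging toward 20 mV with τ = 10 ms and a threshold of 15 mV, for which the exact crossing time is known in closed form and $|\ddot{y}/\dot{y}| = 1/\tau$ at every point. Grid values are evaluated analytically, standing in for an exact subthreshold integration.

```python
import math

TAU = 10.0     # membrane time constant [ms] (hypothetical)
A = 20.0       # asymptotic potential [mV] (hypothetical)
THETA = 15.0   # threshold [mV] (hypothetical)


def y(t):
    """Membrane potential of a passive membrane charging toward A."""
    return A * (1.0 - math.exp(-t / TAU))


T_EXACT = -TAU * math.log(1.0 - THETA / A)  # exact threshold crossing [ms]


def spike_times(h):
    """Return (grid-constrained, linearly interpolated) spike time for step h."""
    k = 0
    while y((k + 1) * h) < THETA:   # find the step with y0 < THETA <= yh
        k += 1
    y0, yh = y(k * h), y((k + 1) * h)
    t_grid = (k + 1) * h                   # spike reported at the right border
    delta = h * (THETA - y0) / (yh - y0)   # root of the linear interpolant
    return t_grid, k * h + delta


for p in range(1, 7):
    h = 2.0 ** -p
    t_grid, t_lin = spike_times(h)
    bound = h * h / (8.0 * TAU)   # leading-order bound for linear interpolation
    print(h, t_grid - T_EXACT, abs(t_lin - T_EXACT), bound)
```

The grid-constrained error depends on where the crossing happens to fall within the step and is only bounded by h, whereas the interpolated error stays close to or below the leading-order bound and shrinks roughly fourfold per halving of h.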

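A.4.3 Quadratic Interpolation. Using a polynomial of order 2, we can add an additional constraint to the interpolating function. We decide for the derivative of the membrane potential at the left border of the interval, $\dot{y}_0$. With $y_t = a t^2 + b t + c$, we have

$$\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ h^2 & h & 1 \\ 0 & 1 & 0 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \end{pmatrix} = \begin{pmatrix} -h^{-2} & h^{-2} & -h^{-1} \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \end{pmatrix} = \begin{pmatrix} (y_h - y_0) h^{-2} - \dot{y}_0 h^{-1} \\ \dot{y}_0 \\ y_0 \end{pmatrix}.$$

Thus, in normalized form, we need to solve

$$0 = \delta^2 + \frac{\dot{y}_0 h^2}{y_h - y_0 - \dot{y}_0 h}\,\delta + \frac{(y_0 - \Theta) h^2}{y_h - y_0 - \dot{y}_0 h}.$$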
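A minimal Python sketch of this step follows. The function name `quadratic_crossing` and the test values are hypothetical, and the code solves the unnormalized form a δ² + b δ + (c − Θ) = 0 so that the degenerate case a = 0 can be handled separately. It assumes the membrane potential is below threshold at the left border and at or above it at the right border, and it does not guard against rounding effects at the borders of the interval.

```python
import math


def quadratic_crossing(y0, yh, dy0, theta, h):
    """Approximate threshold crossing in (0, h] from y0, yh and the slope dy0."""
    a = (yh - y0) / h ** 2 - dy0 / h
    b = dy0
    c = y0 - theta
    if a == 0.0:
        return -c / b  # the interpolant degenerates to a straight line
    root = math.sqrt(b * b - 4.0 * a * c)
    candidates = [(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)]
    # Keep the solution inside the interval; the other one lies outside.
    return min(r for r in candidates if 0.0 < r <= h)


# Hypothetical check with an exactly quadratic trajectory y(t) = t^2 + t on (0, 1]:
# y(0) = 0, y(1) = 2, y'(0) = 1, and y(0.5) = 0.75.
delta = quadratic_crossing(0.0, 2.0, 1.0, 0.75, 1.0)
print(delta)
```

In the check, the second solution of the quadratic is −1.5, which lies outside the interval and is discarded.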


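The solution can be obtained by the quadratic formula. The GSL (Galassi et al., 2001) implements an appropriate solver. Generally there are two real solutions: the desired one inside the interval (0, h] and one outside.

A.4.4 Cubic Interpolation. A polynomial of order 3 enables us to constrain the interpolation further by the derivative of the membrane potential at the right border of the interval, $\dot{y}_h$. With $y_t = a t^3 + b t^2 + c t + d$, we have

$$\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 1 \\ h^3 & h^2 & h & 1 \\ 0 & 0 & 1 & 0 \\ 3h^2 & 2h & 1 & 0 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \\ \dot{y}_h \end{pmatrix} = \begin{pmatrix} 2h^{-3} & -2h^{-3} & h^{-2} & h^{-2} \\ -3h^{-2} & 3h^{-2} & -2h^{-1} & -h^{-1} \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \\ \dot{y}_h \end{pmatrix} = \begin{pmatrix} 2(y_0 - y_h) h^{-3} + (\dot{y}_0 + \dot{y}_h) h^{-2} \\ 3(y_h - y_0) h^{-2} - (2\dot{y}_0 + \dot{y}_h) h^{-1} \\ \dot{y}_0 \\ y_0 \end{pmatrix}.$$

Thus, in normalized form, we need to solve

$$0 = \delta^3 + \frac{3(y_h - y_0) h - (2\dot{y}_0 + \dot{y}_h) h^2}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h}\,\delta^2 + \frac{\dot{y}_0 h^3}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h}\,\delta + \frac{(y_0 - \Theta) h^3}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h}.$$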
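The coefficient formulas above can be checked with a short Python sketch. The names `cubic_coefficients` and `cubic_crossing` are hypothetical, and the root is located by plain bisection rather than by a closed-form expression. That substitution is made here only to keep the example short, and it assumes that the interpolant crosses the threshold exactly once in (0, h].

```python
def cubic_coefficients(y0, yh, dy0, dyh, h):
    """Coefficients (a, b, c, d) of y_t = a t^3 + b t^2 + c t + d."""
    a = 2.0 * (y0 - yh) / h ** 3 + (dy0 + dyh) / h ** 2
    b = 3.0 * (yh - y0) / h ** 2 - (2.0 * dy0 + dyh) / h
    return a, b, dy0, y0


def cubic_crossing(y0, yh, dy0, dyh, theta, h, iterations=60):
    """Approximate threshold crossing in (0, h] by bisection.

    Assumes y0 < theta <= yh and a single crossing inside the interval.
    """
    a, b, c, d = cubic_coefficients(y0, yh, dy0, dyh, h)
    lo, hi = 0.0, h
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if ((a * mid + b) * mid + c) * mid + d < theta:
            lo = mid   # interpolant still below threshold at mid
        else:
            hi = mid   # interpolant at or above threshold at mid
    return hi


# Hypothetical check with an exactly cubic trajectory y(t) = t^3 - t^2 + t on (0, 1]:
# y(0) = 0, y(1) = 1, y'(0) = 1, y'(1) = 2, and y(0.5) = 0.375.
coeffs = cubic_coefficients(0.0, 1.0, 1.0, 2.0, 1.0)
delta = cubic_crossing(0.0, 1.0, 1.0, 2.0, 0.375, 1.0)
print(coeffs, delta)
```

Because the test trajectory is itself a cubic, the interpolant reproduces it exactly, so the recovered coefficients and crossing time can be compared with the known values.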

The solution can be found by the cubic formula. There is at least one real solution in the interval (0, h]. It is convenient to choose a substitution that avoids the intermediate occurrence of complex quantities (e.g., Weisstein, 1999). The GSL (Galassi et al., 2001) implements an appropriate solver. If the interval contains more than one solution, the time of threshold crossing is defined by the left-most root.

Acknowledgments

The new address of A. M. and M. D. is Computational Neuroscience Group, RIKEN Brain Science Institute, Wako, Japan. The new address of S.S. is Human-Neurobiologie, University of Bremen, Germany. We acknowledge Johan Hake for the interesting discussions that started the project.


As always, Stefan Rotter and the members of the NEST collaboration (in particular Marc-Oliver Gewaltig) were very helpful. We thank Jochen Eppler for assisting with the C++ coding. We acknowledge the two anonymous referees whose stimulating questions helped us to improve the letter. This work was partially funded by DAAD 313-PPP-N4-lk, DIP F1.2, BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg, and EU Grant 15879 (FACETS). As of the date this letter is published, the NEST initiative (www.nest-initiative.org) makes available the different implementations of the neuron model used as an example in this article, in source code form under its public license. All simulations were carried out using the parallel computing facilities of the Norwegian University of Life Sciences at Ås.

References

Brette, R. (2006). Exact simulation of integrate-and-fire models with synaptic conductances. Neural Comput., 18, 2004–2027.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis 2001 (pp. 43–70). Göttingen: Gesellschaft für wissenschaftliche Datenverarbeitung.
Diesmann, M., Gewaltig, M.-O., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565–571.
Ferscha, A. (1996). Parallel and distributed simulation of discrete event systems. In A. Y. Zomaya (Ed.), Parallel and distributed computing handbook (pp. 1003–1041). New York: McGraw-Hill.
Fujimoto, R. M. (2000). Parallel and distributed simulation systems. New York: Wiley.
Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., & Rossi, F. (2001). Gnu scientific library: Reference manual. Bristol: Network Theory Limited.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hammarlund, P., & Ekeberg, O. (1998). Large neural network simulations on multiple hardware platforms. J. Comput. Neurosci., 5(4), 443–459.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483.
Harris, J., Baurick, J., Frye, J., King, J., Ballew, M., Goodman, P., & Drewes, R. (2003). A novel parallel hardware and software solution for a large-scale biologically realistic cortical simulation (Tech. Rep.). Las Vegas: University of Nevada.
Heck, A. (2003). Introduction to Maple (3rd ed.). Berlin: Springer-Verlag.
Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17, 903–921.
Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Comput. and Applic., 11, 210–223.


Marian, I., Reilly, R. G., & Mackey, D. (2002). Efficient event-driven simulation of spiking neural networks. In Proceedings of the 3. WSES International Conference on Neural Networks and Applications. Interlaken, Switzerland.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305–2329.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662.
Morrison, A., Hake, J., Straube, S., Plesser, H. E., & Diesmann, M. (2005). Precise spike timing with exact subthreshold integration in discrete time network simulations. Proceedings of the 30th Göttingen Neurobiology Conference. Neuroforum, 1(Suppl.), 205B.
Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830.
Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In ESANN’2003 Proceedings - European Symposium on Artificial Neural Networks (pp. 295–300). Bruges, Belgium: d-side Publications.
Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol. Cybern., 81(5/6), 381–402.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119.
Sloot, A., Kaandorp, J. A., Hoekstra, G., & Overeinder, B. J. (1999). Distributed simulation with cellular automata: Architecture and applications. In J. Pavelka, G. Tel, & M. Bartosek (Eds.), SOFSEM’99, LNCS (pp. 203–248). Berlin: Springer-Verlag.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Weisstein, E. W. (1999). CRC concise encyclopedia of mathematics. Boca Raton, FL: CRC Press.
Wolfram, S. (2003). The mathematica book (5th ed.). Champaign, IL: Wolfram Media Incorporated.
Zeigler, B. P., Praehofer, H., & Kim, T. G. (2000). Theory of modeling and simulation: Integrating discrete event and continuous complex dynamic systems (2nd ed.). Amsterdam: Academic Press.

Received August 12, 2005; accepted May 25, 2006.

LETTER

Communicated by Daniel Amit

The Road to Chaos by Time-Asymmetric Hebbian Learning in Recurrent Neural Networks

Colin Molter [email protected] Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan, and Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

Utku Salihoglu [email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

Hugues Bersini [email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

This letter aims at studying the impact of iterative Hebbian learning algorithms on the recurrent neural network’s underlying dynamics. First, an iterative supervised learning algorithm is discussed. An essential improvement of this algorithm consists of indexing the attractor information items by means of external stimuli rather than by using only initial conditions, as Hopfield originally proposed. Modifying the stimuli mainly results in a change of the entire internal dynamics, leading to an enlargement of the set of attractors and potential memory bags. The impact of the learning on the network’s dynamics is the following: the more information to be stored as limit cycle attractors of the neural network, the more chaos prevails as the background dynamical regime of the network. In fact, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. Next, we introduce a new form of supervised learning that is more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is no longer specified a priori. Based on its spontaneous dynamics, the network decides “on its own” the dynamical patterns to be associated with the stimuli. Compared with classical supervised learning, huge enhancements in storing capacity and computational cost have been observed. Moreover, this new form of supervised learning, by being more “respectful” of the network intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe the traces of the learned attractors in the chaotic regime. This complex but still very informative regime is referred to as “frustrated chaos.”

Neural Computation 19, 80–110 (2007) © 2006 Massachusetts Institute of Technology

1 Introduction

Synaptic plasticity is now widely accepted as a basic mechanism underlying learning and memory. There is experimental evidence that neuronal activity can affect synaptic strength through both long-term potentiation and long-term depression (Bliss & Lomo, 1973). Inspired by or forecasting this biological fact, a large number of learning “rules,” specifying how activity and training experience change synaptic efficacies, have been proposed (Hebb, 1949; Sejnowski, 1977). Such learning rules have been essential for the construction of most models of associative memory (among others, Amari, 1977; Hopfield, 1982; Amari & Maginu, 1988; Amit, 1995; Brunel, Carusi, & Fusi, 1997; Fusi, 2002; Amit & Mongillo, 2003). In such models, the neural network maps the structure of information contained in the external or internal environment into embedded attractors. Since the precursor works of Amari, Grossberg, and Hopfield (Amari, 1972; Hopfield, 1982; Grossberg, 1992), the privileged regime to code information has been fixed-point attractors. Many theoretical and experimental works have shown and discussed the limited storing capacity of these attractor networks (Amit, Gutfreund, & Sompolinsky, 1987; Gardner, 1987; Gardner & Derrida, 1989; Amit & Fusi, 1994; and Domany, van Hemmen, & Schulten, 1995, for a review). However, many neurophysiological reports (Nicolis & Tsuda, 1985; Skarda & Freeman, 1987; Babloyantz & Lourenço, 1994; Rodriguez et al., 1999; and Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003) tend to indicate that brain dynamics is much more complex than fixed points and is more faithfully characterized by cyclic and weak chaotic regimes.
In line with these results, in this article, we propose to map stimuli to spatiotemporal limit cycle attractors of the network’s dynamics. A learned stimulus is no longer expected to stabilize the network into a steady state (which could in some cases correspond to a minimum of a Lyapunov function). Instead, the stimulus is expected to drive the network into a specific spatiotemporal cyclic trajectory. This cyclic trajectory is still considered an attractor since content addressability is expected: before presentation of the stimulus, the network could follow another trajectory, and the stimulus could be corrupted with noise. By relying on spatiotemporal cyclic attractors, the famous theoretical results on the limited capacity of Hopfield network no longer apply. In fact, the extension of encoding attractors to cycles potentially boosts this storing capacity. Suppose a network is composed of two neurons that can have only two values: −1 and +1. Without paying attention to noise and generalization, only four fixed-point attractors can be exploited, whereas by adding cycles, this number increases. For instance,


cycles of length two are: (+1, +1)(+1, −1), (+1, +1)(−1, +1), (+1, +1)(−1, −1), (+1, −1)(−1, +1), (+1, −1)(−1, −1), and (−1, +1)(−1, −1). In a given network, with a fixed topology and parameterization, the number of cyclic attractors is obviously inferior to the number of fixed points (a cycle iterates through a succession of unstable equilibria). An indexing of the memorized attractors restricted to the initial conditions, as classically done in Hopfield networks, would not allow a full exploitation of all of these potential cyclic attractors. This is the reason that in the experimental framework presented here, the indexing is done instead by means of an added external input layer that continuously feeds the network with different external stimuli. Each external stimulus modifies the parameterization of the network and thus the possible dynamical regimes and the set of potential attractors. The goal of this letter is not to calculate the “maximum storage capacity” of these “cyclic attractors networks,”1 either theoretically2 or experimentally. Rather, we intend to discuss the potential dynamical regimes (oscillations and chaos) that allow this new form of information storage. The experimental results we present show how the theoretical limitation of fixed-point attractor networks can easily be overcome by adding the input layer and exploiting the cyclic attractors. By studying small fully connected networks, we have shown previously (Molter & Bersini, 2003) how a randomly generated synaptic matrix allows the exploitation of a huge number of cyclic attractors for storing information. In this letter, in line with our previous results (Molter, Salihoglu, & Bersini, 2005a, 2005b), a time-asymmetric Hebbian rule is proposed to encode the information. This rule is related to experimental observations showing an asymmetrical time window of synaptic plasticity in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).
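The two-neuron counting argument above can be checked by brute force. The following is a minimal sketch, assuming that a cycle of length two is identified with an unordered pair of distinct states; the function names `binary_states` and `period_two_cycles` are illustrative and do not come from the letter.

```python
from itertools import combinations, product

def binary_states(n_neurons):
    """All states of n_neurons units, each taking the value +1 or -1."""
    return list(product((+1, -1), repeat=n_neurons))

def period_two_cycles(n_neurons):
    """Unordered pairs of distinct states, i.e., candidate cycles of length two."""
    return list(combinations(binary_states(n_neurons), 2))

states = binary_states(2)
cycles = period_two_cycles(2)
print(len(states))  # number of candidate fixed points
print(len(cycles))  # number of candidate cycles of length two
for first, second in cycles:
    print(first, second)
```

For two neurons this lists four states and six pairs, in the same order as the enumeration in the text. Whether a given pair is actually an attractor depends on the weights and the stimulus, which this count ignores.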
The information stored in the network consists of a set of data pairs. Each datum is composed of an external stimulus and a series of patterns through which the network is expected to iterate when the stimulus feeds the network (this series corresponds to the limit cycle attractor). Traditionally, the information to be stored is either installed by means of a supervised mechanism (e.g., Hopfield, 1982) or discovered on the spot by an unsupervised version, revealing some statistical regularities in the data presented to the

1 To paraphrase Gardner’s article “Maximum Storage Capacity in Neural Networks” (1987).
2 The fact that we are not working at the thermodynamic limit, as in population models, would render such analysis very difficult.


net (e.g. Amit & Brunel, 1994). In this letter, two different forms of learning are studied and compared. In the first, the information to be learned (the external stimulus and the limit cycle attractor) is prespecified and installed as a result of a classical supervised algorithm. However, such supervised learning has always raised serious problems at a biological level, due to its top-down nature, and a cognitive level. Who would take responsibility to look inside the brain of the learner, that is, to decide the information to associate with the external stimulus and exert supervision during the learning task? To answer the last question, in the second form of learning proposed here, the semantics of the attractors to be associated with the feeding stimulus is left unprescribed: the network is taught to associate external stimuli with original attractors, not specified a priori. This perspective remains in line with the very old philosophical conviction of constructivism, which has been modernized in neural network terms by several authors (among others, Varela, Thompson, & Rosch, 1991; Erdi, 1996; Tsuda, 2001). One operational form has achieved great popularity as a neural net implementation of statistical clustering algorithms (Kohonen, 1982; Grossberg, 1992). To differentiate between the two learning procedures, the first one is called out-supervised, since the information is fully specified from outside. In contrast, the second one is called in-supervised, since the learning maps stimuli to cyclic attractors “derived” from the ones spontaneously proposed by the network.3 Quite naturally, we show that this in-supervised learning leads to increased storing capacity. The aim of this letter is not only to show how stimuli could be mapped to limit cycle attractors of the network’s dynamics. It also focuses on the network’s background dynamics: the dynamics observed when unlearned stimuli feed the network. 
More precisely, in line with theoretical investigations of the onset and the nature of chaotic dynamics in deterministic dynamical systems (Eckmann & Ruelle, 1985; Sompolinsky, Crisanti, & Sommers, 1988; van Vreeswijk & Sompolinsky, 1996; Hansel & Sompolinsky, 1996), this letter computes and analyzes the background presence of chaotic dynamics. The presence of chaos in recurrent neural networks (RNNs) and the benefit gained by its presence are still open questions. However, since the seminal paper by Skarda and Freeman (1987) dedicated to chaos in the rabbit brain, many authors have shared the idea that chaos is the ideal regime to store and efficiently retrieve information in neural networks (Freeman, 2002; Guillot & Dauce, 2002; Pasemann, 2002; Kaneko & Tsuda, 2003). Chaos, although very simply produced,

3 This algorithm is still supervised in the sense that the mapped cyclic attractors are only “derived” from the spontaneous network dynamics. External supervision is still partly needed, which raises questions about biological plausibility. However, even if the proposed in-supervised algorithm is improved or if it ends as part of a more biologically plausible learning procedure, we think that the dynamical results obtained here will remain qualitatively valid.


inherently possesses an infinite number of cyclic regimes that can be exploited for coding information. Moreover, it randomly wanders around these unstable regimes in a spontaneous way, thus rapidly proposing alternative responses to external stimuli and able to switch easily from one of these potential attractors to another in response to any coming stimulus. This article maintains this line of thinking by forcing the coding of information in robust cyclic attractors and experimentally showing that the more information is to be stored, the more chaos appears as a regime in the back, erratically itinerating, that is, moving from place to place, among brief appearances of these attractors. Chaos appears to be the consequence of the learning, not the cause. However, it appears as a helpful consequence that widens the net’s encoding capacity and diminishes the existence of spurious attractors. By comparing the nature of the chaotic dynamics obtained from the two learning procedures, we show that more structure in the background chaotic regimes is obtained when the network “chooses” by itself the limit cycle attractors to encode the stimuli (the in-supervised learning). The nature of these chaotic regimes can be related to the classic intermittent type of chaos (Pomeau & Manneville, 1980) and to its extension to biological networks, originally called frustrated chaos (Bersini, 1998), in reminiscence of the frustration phenomenon occurring in both recurrent neural nets and spin glasses. This chaos is related to chaotic itinerancy (Kaneko, 1992; Tsuda, 1992), which has been suggested to be of biological significance (Tsuda, 2001). The plan of the letter is as follows. Section 2 describes the model as well as the learning task. Section 3 describes the out-supervised learning procedure, where stimuli are encoded in predefined limit cycle attractors. 
Section 4 describes the in-supervised learning procedure, where stimuli are encoded in the limit cycle attractors derived from those spontaneously proposed by the network. Section 5 computes and compares networks’ encoding capacity, as well as the content addressability of encoded information. Section 6 compares and discusses the proportion and the nature of the observed chaotic dynamics when the network is presented with unlearned stimuli.

2 Model and Learning Task Descriptions

This section describes the model used in our simulations as well as the learning task.

2.1 The Model. The network is fully connected. Each neuron’s activation is a function of other neurons’ impact and an external stimulus. The neurons’ activation f is continuous and updated synchronously by discrete time step. The mathematical description of such a network is a classic


one. The activation value of a neuron xi at a discrete time step n + 1 is

$$x_i(n+1) = f\big(g\,\mathrm{net}_i(n)\big), \qquad \mathrm{net}_i(n) = \sum_{j=1}^{N} w_{ij}\, x_j(n) + \sum_{s=1}^{M} w_{is}\, \iota_s, \tag{2.1}$$

where N is the number of neurons, M is the number of units composing the stimulus, g is the slope parameter, wij is the weight between the neurons j and i, wis is the weight between the external stimulus’ unit s and the neuron i, and ιs is the unit s of the external stimulus. The saturating activation function f is taken continuous (here tanh) to ease the study of the networks’ dynamical properties. The network’s size, N, and the stimulus’s size, M, have been set to 25 in this article for legibility. Of course, this size has an impact on both the encoding capacity and the background dynamics. This impact will not be discussed. Another impact not discussed here is the value of the slope parameter g, set to 3 in the following.4 The main purpose of this article is to analyze how the learning of stimuli in spatiotemporal attractors affects the network’s background dynamics. When storing information in fixed-point attractors, the temporal update rule can indifferently be asynchronous or synchronous. This is no longer the case when storing information in cycle attractors, for which the updating must necessarily be synchronous: it is a global activity due to one pattern that generates the next one. To compare the network’s continuous internal states with bit patterns, a filter layer quantizing the internal states, based on the sign function, is added (Omlin, 2001). It defines the output vector o:

$$o_i = -1 \iff x_i < 0, \qquad o_i = 1 \iff x_i \ge 0, \tag{2.2}$$

where xi is the internal state of the neuron i and oi is its associated output (i.e., its visible value). This filter layer makes it possible to perform symbolic investigations on the dynamical attractors. Figure 1 represents a period 2 sequence unfolding in a network of four neurons. The persistent external stimulus feeding the network appears in Figure 1A. Given that the internal state of neurons is continuous, the internal states (see Figure 1B) are filtered (see Figure 1C) to enable the comparison with the stored data.

4 The impact of this parameter has been discussed in Dauce, Quoy, Cessac, Doyon, & Samuelides (1998). They have demonstrated how the slope parameter can be used as a route to chaos.


Figure 1: (A) A fully recurrent neural network (N = 4) fed by a persistent external stimulus. (B) Three shots of the network’s states. Each represents the internal state of the network at a successive time step. (C) After filtering, a cycle of period 2 can be seen.
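The update of equation 2.1 and the sign filter of equation 2.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the gaussian weight initialization, the seed, and the small sizes (N = M = 4, as in Figure 1, rather than 25) are arbitrary choices made here, and the names `step` and `output` are hypothetical.

```python
import math
import random

def step(x, w, w_stim, stimulus, g=3.0):
    """One synchronous update: x_i(n+1) = tanh(g * net_i(n)) (equation 2.1)."""
    new_x = []
    for i in range(len(x)):
        net = sum(w[i][j] * x[j] for j in range(len(x)))
        net += sum(w_stim[i][s] * stimulus[s] for s in range(len(stimulus)))
        new_x.append(math.tanh(g * net))
    return new_x

def output(x):
    """Sign filter of equation 2.2: -1 if x_i < 0, +1 otherwise."""
    return [-1 if xi < 0 else 1 for xi in x]

rng = random.Random(0)   # fixed seed, an arbitrary choice for reproducibility
N, M = 4, 4              # illustrative sizes; the letter uses N = M = 25
w = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(N)]       # assumed init
w_stim = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(N)]  # assumed init
stimulus = [rng.choice((-1, 1)) for _ in range(M)]
x = [rng.uniform(-1, 1) for _ in range(N)]
for n in range(5):
    x = step(x, w, w_stim, stimulus)
    print(output(x))     # filtered state, comparable with stored bit patterns
```

All neurons are updated from the same previous state, which is the synchronous scheme the text requires for cycle attractors.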

2.2 The Learning Task. Two different forms of supervised learning are proposed in this article. However, both consist in storing a set of q external stimuli in spatiotemporal cycles of the network’s internal dynamics. The data set is written as

$$\mathcal{D} = \{D_1, \ldots, D_q\}, \tag{2.3}$$

where each datum Dµ is defined by a pair composed of a pattern χ µ corresponding to the external stimulus feeding the network and a sequence of patterns ς µ,i , i = 1, . . . , lµ to store in a dynamical attractor:

$$D_\mu = \big(\chi^{\mu}, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \qquad \mu = 1, \ldots, q, \tag{2.4}$$

where lµ is the period of the sequence µ and may vary from one datum to another. Each pattern µ is defined by assigning digital values to all neurons:

$$\chi^{\mu} = \{\chi_i^{\mu},\ i = 1, \ldots, M\} \quad \text{with} \quad \chi_i^{\mu} \in \{-1, 1\},$$
$$\varsigma^{\mu,k} = \{\varsigma_i^{\mu,k},\ i = 1, \ldots, N\} \quad \text{with} \quad \varsigma_i^{\mu,k} \in \{-1, 1\}. \tag{2.5}$$
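The data set of equations 2.3 to 2.5 maps naturally onto a small data structure. The sketch below is one possible representation, not the authors’ code: the class name `Datum`, the helper names, and the use of uniformly random patterns are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List

Pattern = List[int]  # every entry is -1 or +1

@dataclass
class Datum:
    stimulus: Pattern     # chi^mu, of length M
    cycle: List[Pattern]  # (sigma^{mu,1}, ..., sigma^{mu,l_mu}), each of length N

def random_pattern(size, rng):
    """A pattern with independent, equiprobable -1/+1 entries."""
    return [rng.choice((-1, 1)) for _ in range(size)]

def random_data_set(q, n_neurons, n_stim, period, rng):
    """q data items, each pairing a random stimulus with a random cycle."""
    return [
        Datum(random_pattern(n_stim, rng),
              [random_pattern(n_neurons, rng) for _ in range(period)])
        for _ in range(q)
    ]

rng = random.Random(1)  # arbitrary seed
data = random_data_set(q=3, n_neurons=25, n_stim=25, period=4, rng=rng)
print(len(data), len(data[0].stimulus), len(data[0].cycle))
```

Here every cycle has the same period for simplicity, whereas equation 2.4 allows lµ to vary from one datum to another.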


3 The Out-Supervised Learning Algorithm

This first learning task is straightforward and consists of storing a well-defined data set. It means that each datum stored in the network is fully specified a priori: each external stimulus must be associated with a prespecified limit cycle attractor of the network’s dynamics. By suppressing the external stimulus and setting all the sequences’ periods lµ to 1, this task is reduced to the classical learning task originally proposed by Hopfield: storing patterns in fixed-point attractors of the underlying RNN’s dynamics. The learning task described above thus generalizes the one proposed by Hopfield. For ease of reading, when patterns are stored in fixed-point attractors, they are denoted by ξ µ .

3.1 Introduction: Hopfield’s Autoassociative Model. In the basic Hopfield model (Hopfield, 1982), all connections need to be symmetric, no autoconnection can exist, and the update rule must be asynchronous. Hopfield has proven that these constraints are sufficient to define a Lyapunov function H for the system:5

$$H = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, x_i\, x_j. \tag{3.1}$$

Each state variation produced by the system’s equation entails a nonpositive variation of H: ΔH ≤ 0. The existence of such a decreasing function ensures convergence to fixed-point attractors. Each local minimum of the Lyapunov function represents one fixed point of the dynamics. These local minima can be used to store patterns. This kind of network is akin to a content-addressable memory since any stored item will be retrieved when the network dynamics is initiated with a vector of activation values sufficiently overlapping the stored pattern.6 In such a case, the network dynamics is initiated in the desired item’s basin of attraction, spontaneously driving the network dynamics to converge to this specific item. The set of patterns can be stored in the network by using the following Hebbian learning rule, which obviously respects the constraints of the Hopfield model (symmetric connections and no autoconnection):

$$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^{\mu}\, \xi_j^{\mu}, \qquad w_{ii} = 0. \tag{3.2}$$

5 A Lyapunov function is a lower-bounded function that is nonincreasing in time along the trajectories of the system.
6 This remains valid only when learning a few uncorrelated patterns. In other cases, the network may converge to an arbitrary fixed-point attractor.
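The Hebbian rule of equation 3.2 and the Lyapunov function of equation 3.1 can be illustrated numerically. The following is a minimal sketch assuming binary ±1 states, a sign update rule with ties resolved to +1, and asynchronous updates in random order; these conventions, the seed, and the function names are choices made here for illustration.

```python
import random

def hebbian_weights(patterns):
    """Equation 3.2: w_ij = (1/N) * sum_mu xi_i^mu xi_j^mu, with w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def energy(w, x):
    """Equation 3.1: H = -1/2 * sum_ij w_ij x_i x_j."""
    n = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def async_sweep(w, x, rng):
    """Update every neuron once, one at a time, in random order."""
    x = list(x)
    order = list(range(len(x)))
    rng.shuffle(order)
    for i in order:
        field = sum(w[i][j] * x[j] for j in range(len(x)))
        x[i] = 1 if field >= 0 else -1
    return x

rng = random.Random(0)  # arbitrary seed
n = 25
patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(2)]
w = hebbian_weights(patterns)
x = [rng.choice((-1, 1)) for _ in range(n)]
energies = [energy(w, x)]
for _ in range(5):
    x = async_sweep(w, x, rng)
    energies.append(energy(w, x))
print(energies)  # the sequence should never increase
```

With symmetric weights and a zero diagonal, each single-neuron update changes H by −Δx_i times the local field, which is never positive under the sign rule, so the recorded energies are nonincreasing.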


However, this kind of rule leads to drastic storage limitations. An in-depth analysis of the Hopfield model’s storing capacity has been done by Amit et al. (1987) by relying on a mean-field approach and on replica methods originally developed for spin-glass models. Their theoretical results show that these types of networks, when coupled with this learning rule, are unlikely to store more than 0.14 N uncorrelated random patterns.

3.2 Iterative Version of the Hebbian Learning Rule

3.2.1 Learning Fixed Points. A better way of storing patterns is given by an iterative version of the Hebbian rule (Gardner, 1987; for a detailed description of this algorithm, see van Hemmen & Kuhn, 1995, and Forrest & Wallace, 1995). The principle of this algorithm is as follows: at each learning iteration, the stability of every nominal pattern ξ µ is tested. Whenever one pattern has not yet reached stability, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all the synaptic connections impinging on it,

$$w_{ij} \rightarrow w_{ij} + \varepsilon_s\, \xi_i^{\mu}\, \xi_j^{\mu}, \tag{3.3}$$

where εs defines the learning rate. All patterns to be learned are repeatedly tested for stability, and once all are stable, the learning is complete. This learning algorithm is incremental since the learning of new information can be done while preserving all information that has already been learned. It has been proved (Gardner, 1987) that by using this procedure, the capacity can be increased up to 2N uncorrelated random patterns. In our model, stored cycles are indexed by the use of external stimuli. These external stimuli are responsible for a modification of the underlying network’s internal dynamics and, consequently, for an increase in the number of potential attractors, as well as in the size of their basins of attraction. The connection weights between the external stimuli and the neurons are learned by adopting the same approach as given in equation 3.3. When one pattern is not yet stable, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all of the synaptic connections impinging on it (see equation 3.3), including connections coming from the external stimulus:

$$w_{ik} \rightarrow w_{ik} + \varepsilon_b\, \chi_k^{\mu}\, \xi_i^{\mu}, \tag{3.4}$$

where εb defines the learning rate applied on the external stimulus’s connections and which may differ from εs . In order to not only store the patterns but also ensure sufficient content addressability, we must try to “excavate” the basins of attraction. Two approaches are commonly proposed in the literature. The first aims at getting


the alignment of the spin of the neuron (+1 or −1) together with its local field to be not just positive (the requirement to ensure stability) but greater than a given minimum bound. The second approach attempts explicitly to enlarge the domains of attraction around each nominal pattern. To do so, the network is trained to associate noisy versions of each nominal pattern with the desired pattern, following a given number of iterations expected to be sufficient for convergence. This second approach is the one adopted in this article. Two noise parameters are introduced to tune noise during the learning phase: the noise imposed on the internal states lns and the noise imposed on the external stimulus lnb .7

3.2.2 Learning Cycles. The learning rule defined in equation 3.3 naturally leads to asymmetrical weight values. It is no longer possible to define a Lyapunov function for this system, the main consequence being to include cycles in the set of “memory bags.” As for fixed points, the network can be trained to converge to such limit cycle attractors by modifying equations 3.3 and 3.4. This time, the weights wij and wis are modified according to the expected value of neuron i at time t + 1 and the expected value of neuron j and of the external stimulus at time t:

$$w_{ij} \rightarrow w_{ij} + \varepsilon_s\, \varsigma_i^{\mu,\nu+1}\, \varsigma_j^{\mu,\nu}, \qquad w_{is} \rightarrow w_{is} + \varepsilon_b\, \varsigma_i^{\mu,\nu+1}\, \chi_s^{\mu}, \tag{3.5}$$

where εs and εb , respectively, define the learning rate and the stimulus learning rate. This time-asymmetric Hebbian rule can be related to the asymmetric time window of synaptic plasticity observed in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).

3.2.3 Adaptation of the Algorithm to Continuous Activation Functions. When working with continuous state neurons, we are no longer working with cyclic attractors but with limit cycle attractors. As a consequence, the

7 A noisy pattern $\xi^{\mu}_{l_{ns}}$ is obtained from a pattern to learn $\xi^{\mu}$ by choosing a set of lns items, randomly chosen among all the initial pattern’s items, and by switching their sign. Thus, $d_H(l_{ns})$ defines the Hamming distance between the two patterns:

$$d_H(l_{ns}) = \sum_{i=1}^{N} d_i, \qquad \text{where} \quad d_i = \begin{cases} 0 & \text{if } \xi_i^{\mu}\, \xi_{i,l_{ns}}^{\mu} = 1 \quad \text{(items equal)} \\ 1 & \text{if } \xi_i^{\mu}\, \xi_{i,l_{ns}}^{\mu} = -1 \quad \text{(items different).} \end{cases}$$

In this article, the Hamming distance is normalized to range in [0, 100].


algorithm needs to be adapted in order to prevent the learned data from vanishing after a few iterations. A one-step iteration does not guarantee long-term stability of the internal states since observations are performed using a filter layer. The adaptation consists of waiting a certain number of cycles before testing the correctness of the obtained attractor. The halting test for discrete neurons is given by the following equation,

$$\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \rightarrow \big(x(1) = \varsigma^{\mu,\nu+1}\big) \ \Rightarrow\ \text{stop}; \tag{3.6}$$

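for continuous neurons, it becomes

$$\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \rightarrow \big(o(1) = o(l_\mu + 1) = \ldots = o(T\, l_\mu + 1) = \varsigma^{\mu,\nu+1}\big) \ \Rightarrow\ \text{stop}, \tag{3.7}$$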

where T is a further parameter of our algorithm (set to 10 in all the experiments) and o is the output filtered vector defined in equation 2.2.

4 In-Supervised Learning Algorithm

As shown in section 5, the encoding capacities of networks learning in the out-supervised way described above are fairly good. However, these results are disappointing compared to the potential capacity observed in random networks (Molter & Bersini, 2003). Moreover, section 6 shows how learning too many cycle attractors in an out-supervised way leads to a blackout catastrophe similar to the one observed in fixed-point attractor networks (Amit, 1989). Here, the network’s background regime becomes fully chaotic and similar to white noise. Learning prespecified data appears to be too constraining for the network. This section introduces an in-supervised learning algorithm, more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is not specified a priori but is obtained following an internal mechanism. In other words, the information is generated through the learning procedure that assigns a meaning to each external stimulus. There is an important tradition of less supervised learning in neural nets since the seminal work of Kohonen and Grossberg. This tradition resonates with writings in cognitive psychology and constructivist philosophy (among others, Piaget, 1963; Varela et al., 1991; Erdi, 1996; and Tsuda, 2001). The algorithm presented now can be seen as a dynamical extension in the spirit of this preliminary work where the coding scheme relies on cycles instead of single neurons.

4.1 Description of the Learning Task. The main characteristic of this new algorithm lies in the nature of the learned information: only the external stimuli are known before learning. The limit cycle attractor associated


with an external stimulus is identified through the learning procedure: the procedure enforces a mapping between each stimulus of the data set and a limit cycle attractor of the network’s inner dynamics, whatever it is. Hence, the aim of the learning procedure is twofold: first, it proposes a dynamical way to code the information (i.e., to associate a meaning with the external stimuli), and then it learns it (through a classical supervised procedure). Before mapping, the data set is defined by Dbm (bm standing for “before mapping”):

$$\mathcal{D}_{bm} = \{D_{bm}^{1}, \ldots, D_{bm}^{q}\}, \qquad D_{bm}^{\mu} = \chi^{\mu}, \qquad \mu = 1, \ldots, q. \tag{4.1}$$

After mapping, the data set becomes

$$\mathcal{D}_{am} = \{D_{am}^{1}, \ldots, D_{am}^{q}\}, \qquad D_{am}^{\mu} = \big(\chi^{\mu}, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \qquad \mu = 1, \ldots, q, \tag{4.2}$$

where lµ is the period of the learned cycle.

4.2 Description of the Algorithm. Inputs of this algorithm are a data set Dbm to learn (see equation 4.1) and a range [mincs , maxcs ] that defines the bounds of the accepted periods of the limit cycle attractors coding the information. This algorithm can be broken down into three phases that are constantly iterated until convergence:

1. Remapping stimuli into spatiotemporal cyclic attractors. During this phase, the network is presented with an external stimulus that drives it into a temporal attractor outputµ (which can be chaotic). Since the idea is to constrain the network as little as possible, a meaning is assigned to the stimulus by associating it with a close cyclic version of the attractor outputµ , called cycleµ , an original8 attractor respecting the periodic bounds [mincs , maxcs ]. This step is iterated for all the stimuli of the data set.
2. Learning the information. Once a new attractor cycleµ has been proposed for each stimulus, it is tentatively learned by means of a supervised procedure. However, to avoid constraining the network too much, only a limited number of iterations are performed, even if no convergence has been reached.
3. End test. If all stimuli are successfully associated with different cyclic attractors, the in-supervised learning stops; otherwise, the whole process is repeated.

8 Original means that each pattern composing the limit cycle attractor must be different from all other patterns of all cycleµ .


Table 1: Pseudocode of the In-Supervised Algorithm.

A/ Remapping stimuli to spatiotemporal cyclic attractors
1. ∀ data Dµbm to learn, µ = 1, . . . , q:
   a. Stimulation of the network
      i. The stimulus is initialized with χ µ .
      ii. The states are initialized with ς µ,1 , obtained from the previous iteration (or random at first).
      iii. To skip the transient, the network is simulated for some steps.
      iv. The states ς µ,i crossed by the network’s dynamics are stored in outputµ .
   b. Proposal of an attractor code
      i. If outputµ is not a cycle of period less than or equal to maxcs ⇒ compression process (see Table 2): outputµ → cycleµ .
      ii. If outputµ is not a cycle of period greater than or equal to mincs ⇒ extension process (see Table 2): outputµ → cycleµ .
      iii. If a pattern contained in cycleµ is too correlated with any other pattern, this pattern is slightly modified to make it “original.”
2. The data set Dtemp is created, where Dµtemp = (χ µ , cycleµ ).

B/ Learning the information
3. Using an iterative supervised learning algorithm, the data set Dtemp is tentatively learned for a limited number of time steps.

C/ End test
4. If, ∀ data Dµbm , the network iterates through valid limit cycle attractors ⇒ finished; else go to 1.

The pseudocode of this in-supervised learning algorithm is described in Table 1. The algorithm presented here learns to map external stimuli into the network’s cyclic attractors in a very unconstrained way. Because these attractors are only derived from the ones spontaneously proposed by the network, a form of supervision is still needed (how to create an original cycle and how to know if the proposed cycle is original). However, recent neurophysiological observations have shown that in some cases, synaptic plasticity is effectively guided by a supervised procedure (Gutfreund, Zheng, & Knudsen, 2002; Franosch, Lingenheil, & van Hemmen, 2005). The supervised procedure proposed here to create and test original cycles (see Table 2) has no real biological grounding. This part of the algorithm might be improved in the future to reinforce its biological plausibility. Nevertheless, we think that whatever supervised procedure is chosen, the part of the study concerned with the capacity of the network and the background chaos remains valid.
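Phase B of Table 1 relies on the iterative, time-asymmetric rule of equation 3.5 applied for a bounded number of passes. The sketch below illustrates that idea in a simplified form and should not be read as the authors’ implementation: it tests only the one-step transition on bit patterns (in the spirit of equation 3.6), it ignores the noise parameters lns and lnb, and the names `next_output` and `learn_cycles`, the learning rates, and the toy data are assumptions.

```python
import math

def next_output(w, w_stim, state, stimulus, g=3.0):
    """Filtered one-step successor of a bit pattern (equations 2.1 and 2.2)."""
    out = []
    for i in range(len(state)):
        net = sum(w[i][j] * state[j] for j in range(len(state)))
        net += sum(w_stim[i][s] * stimulus[s] for s in range(len(stimulus)))
        out.append(-1 if math.tanh(g * net) < 0 else 1)
    return out

def learn_cycles(data, w, w_stim, eps_s=0.01, eps_b=0.01, max_iter=100):
    """Reinforce, with equation 3.5, every neuron whose next output is wrong.

    data is a list of (stimulus, cycle) pairs; w and w_stim are modified in
    place. Returns True if all transitions are correct within max_iter passes.
    """
    for _ in range(max_iter):
        errors = 0
        for stimulus, cycle in data:
            for v, current in enumerate(cycle):
                target = cycle[(v + 1) % len(cycle)]
                produced = next_output(w, w_stim, current, stimulus)
                for i in range(len(target)):
                    if produced[i] != target[i]:
                        errors += 1
                        for j in range(len(current)):
                            w[i][j] += eps_s * target[i] * current[j]
                        for s in range(len(stimulus)):
                            w_stim[i][s] += eps_b * target[i] * stimulus[s]
        if errors == 0:
            return True
    return False  # bounded effort: the caller may remap and try again

# Toy example: one stimulus mapped to a period-2 cycle of a two-neuron network.
stimulus = [1]
cycle = [[1, -1], [-1, 1]]
w = [[0.0, 0.0], [0.0, 0.0]]
w_stim = [[0.0], [0.0]]
converged = learn_cycles([(stimulus, cycle)], w, w_stim, eps_s=0.1, eps_b=0.1)
print(converged)
```

Returning False after `max_iter` passes, rather than looping until success, mirrors the “limited number of iterations” of phase B. For continuous neurons, the stricter repeated test of equation 3.7 would be needed, since a correct one-step transition does not guarantee that the limit cycle persists.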


Table 2: Routines Used When Creating an Attractor Code.

Compression process
Since outputµ = ς µ,1 , . . . , ς µ,maxcs , . . . is a cycle of period greater than maxcs (it could even be chaotic):
⇒ cycleµ is generated by truncating outputµ : cycleµ = ς µ,1 , . . . , ς µ,ps (the compression period ps ∈ [mincs , maxcs ] is another parameter).

Extension process
Since outputµ = ς µ,1 , . . . , ς µ,q with q < mincs :
⇒ cycleµ is generated by duplicating outputµ : cycleµ = ς µ,1 , . . . , ς µ,q , ς µ,1 , . . . such that size(cycleµ ) = pe , where pe ∈ [mincs , maxcs ] is the extension period, another parameter, randomly given or fixed.
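The two routines of Table 2 amount to truncating or repeating a recorded trajectory. A minimal sketch follows, assuming that `output` holds one period of the observed dynamics (or a long transient-free record when the regime is aperiodic); the function names are hypothetical, and the later step that makes duplicated patterns “original” is not covered.

```python
def compress(output, ps):
    """Truncate a too-long (possibly chaotic) trajectory to its first ps patterns."""
    return [list(p) for p in output[:ps]]

def extend(output, pe):
    """Repeat a too-short cycle until the proposal contains pe patterns."""
    return [list(output[k % len(output)]) for k in range(pe)]

def propose_cycle(output, min_cs, max_cs, ps, pe):
    """Return a candidate cycle whose length lies in [min_cs, max_cs]."""
    if len(output) > max_cs:
        return compress(output, ps)
    if len(output) < min_cs:
        return extend(output, pe)
    return [list(p) for p in output]

short = [[1, -1], [-1, 1]]
print(propose_cycle(short, min_cs=3, max_cs=5, ps=4, pe=5))
```

Because extension duplicates patterns, the proposed cycle contains repeated states; in Table 1 such patterns would then be slightly modified so that each one is original.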

5 Performance Regarding the Encoding Capacity

Encoding capacities of networks trained with the out-supervised algorithm and of networks trained with the in-supervised algorithm are compared in Figure 2. Here, to ease the comparison, we have forced the in-supervised algorithm to learn mappings to cycles of fixed period. However, because this algorithm enables mapping stimuli to limit cycle attractors of variable period (between mincs and maxcs ), better performance could be achieved. The number of iterations required to learn the specified data set is plotted as a function of the size of this data set.9 We have seen in section 3.2 that it is possible to enforce greater content addressability by training the network to associate noisy versions of each nominal pattern with the desired limit cycle attractors. Here, the two parameters enforcing the content addressability (lns and lnb ) have been set to 0 in order to display the maximum encoding capacities. As expected, the in-supervised learning procedure, by letting the network decide how to map the stimuli, outperforms its out-supervised counterpart where all the mappings are specified before the learning.10 Robustness to noise is compared in Figure 3. Content addressability has been enforced as much as possible by means of the learning-with-noise procedure. By computing the normalized Hamming distance between the obtained attractor and the initially learned attractor, robustness to noise

9 Each iteration represents 100 weight modifications defined in equation 3.5.
10 The mapping of stimuli to fixed-point attractors is not compared, since to obtain relevant results, we have to enforce content addressability by learning with noise. See also section 6.1.1.

C. Molter, U. Salihoglu, and H. Bersini

[Figure 2 plot. x-axis: number of stimuli mapped to cycles of period 3, 5, and 10; y-axis: number of iterations.]

Figure 2: Encoding capacities’ comparison between out-supervised learning (filled circles) and in-supervised learning (squares). The number of iterations required to learn the specified data set is plotted according to data set size. Results obtained from three different types of data sets are represented: data sets composed of stimuli associated with cycles of period 3, 5, and 10. Each value has been obtained from statistics of 100 different data sets.

indicates how well the dynamics is able to retrieve the original association when both the external stimulus and the network’s internal state are perturbed by noise. The two plots in Figure 3 show the normalized Hamming distance between the stored sequence and the recovered sequence as a function of the noise injected in the external stimulus and in the initial states. This noise is quantified by computing the initial overlaps $m_b^0$ and $m_s^0$.¹¹ The noise injected in the network’s internal state is measured by computing the smallest overlap between the internal state and every pattern composing the limit cycle attractor associated with the external stimulus.

In-supervised learning, compared to out-supervised learning, considerably improves robustness. For instance, when learning six period-4 cycles, in the in-supervised case (see Figure 3B), stored sequences are very robust to noise, while in the out-supervised case (see Figure 3A), content addressability is no longer observed: adding a tiny amount of noise drives the correlation between the stored sequence and the recovered one to zero (the normalized Hamming distance is equal to 50%). These figures also show that in the in-supervised case, the external stimulus plays a stronger role in indexing the stored data.

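11 The overlap $m^{\mu}$ between two patterns is given by

$$m^{\mu} = \frac{1}{N} \sum_{i=1}^{N} \varsigma_i^{\mu, id_{cyc}} \, \varsigma_{i,\mathrm{noisy}}^{\mu, id_{cyc}}.$$

Two patterns that match perfectly have an overlap equal to 1; it equals −1 in the opposite case. For patterns with components ±1, the overlap $m$ and the Hamming distance $d_h$ are related by $d_h = N \frac{1 - m}{2}$, so that a perfect match ($m = 1$) gives $d_h = 0$ and a zero overlap gives a normalized Hamming distance of 50%.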
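The overlap and its link to the Hamming distance can be checked numerically. The sketch below assumes patterns whose components are ±1; the function names and the two example patterns are illustrative and do not come from the letter. Under that assumption, the Hamming distance equals N(1 − m)/2.

```python
def overlap(a, b):
    """Overlap m = (1/N) * sum_i a_i * b_i between two +/-1 patterns."""
    if len(a) != len(b) or not a:
        raise ValueError("patterns must be non-empty and of equal length")
    return sum(x * y for x, y in zip(a, b)) / len(a)


def hamming_distance(a, b):
    """Number of components on which the two patterns differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative patterns: the noisy copy differs on 2 of the 8 components.
stored = [1, -1, 1, 1, -1, 1, -1, -1]
noisy = [1, -1, -1, 1, -1, 1, 1, -1]

m = overlap(stored, noisy)
d_h = hamming_distance(stored, noisy)
# For +/-1 patterns, d_h equals N * (1 - m) / 2.
assert d_h == len(stored) * (1 - m) / 2
```

Dividing `d_h` by the pattern size gives the normalized Hamming distance used in Figure 3.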

Road to Chaos in Recurrent Neural Networks


Figure 3: Content addressability obtained for the learned networks. The normalized Hamming distance between the expected sequence and the obtained sequence is plotted as a function of the noise injected in both the initial states ($m_s^0$) and the external stimulus ($m_b^0$). Results for the out-supervised Hebbian algorithm (A) are compared with the ones obtained for the in-supervised Hebbian algorithm (B).

One can say that the in-supervised learning mechanism implicitly supplies the network with an important robustness to noise. This could be explained by the following considerations. First, the coding attractors are derived from the ones spontaneously proposed by the network. Second, they need to have large and stable basins of attraction in order to resist the process of trial, error, and adaptation that characterizes this iterative remapping procedure.

6 Dynamical Analysis

Tests performed here aim at analyzing the so-called background or spontaneous dynamics obtained when the network is presented with external stimuli other than the learned ones. In the first two sections, analyses are performed to quantify the presence and proportion of chaotic dynamics. Section 6.1 uses two measures: the network’s mean Lyapunov exponent and the probability of having chaotic dynamics. Section 6.2 uses symbolic analyses¹² to quantify the proportion of the different types of symbolic attractors found. It shows how chaotic dynamics help to prevent the proliferation of spurious data. Qualitative analyses of the nature of the chaotic dynamics obtained are performed in the last section (section 6.3) by means of classical tools such as return maps,

12 Symbolic analyses are performed by analyzing the output of the filter layer instead of directly analyzing the network’s internal state.


power spectra, and Lyapunov spectra. Furthermore, an innovative measure is developed to assess the presence of frustrated chaotic dynamics.

6.1 Quantitative Dynamical Analysis: The Background Regime. Quantitative analyses are performed here using two kinds of measures: the mean Lyapunov exponent and the probability of having chaotic dynamics. Both measures come from statistics on 100 learned networks. For each learned network, dynamics obtained from randomly chosen external stimuli and initial states have been tested (1000 different configurations). These tests aim at analyzing the so-called background or spontaneous dynamics obtained by stimulating the network with external stimuli and initial states different from the learned ones.

The first Lyapunov exponent is computed empirically by following the evolution of a tiny perturbation (renormalized at each time step) applied to an attractor state (for more details, see Wolf, Swift, Swinney, & Vastano, 1984; Albers, Sprott, & Dechert, 1998). This exponent indicates how fast the system’s history is lost. While stable dynamics have negative Lyapunov exponents, Lyapunov exponents greater than 0 are the signature of chaotic dynamics, and when the largest Lyapunov exponent is very high, the system’s dynamics may be seen as equivalent to a turbulent state. Here, to distinguish chaotic dynamics from quasi-periodic regimes, dynamics is said to be chaotic if the Lyapunov exponent is greater than a given value slightly above zero (in practice, 0.01).

Obtained results are made more meaningful by comparing the global dynamics of learned networks (indexed with L) with the global dynamics of networks obtained randomly (without learning).
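The empirical estimate of the first Lyapunov exponent described above (a tiny perturbation followed along the trajectory and renormalized at each time step) can be sketched for any discrete-time map. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function `largest_lyapunov`, its parameters, and the logistic map used as a stand-in system are all hypothetical. Only the threshold 0.01 comes from the text.

```python
import math


def largest_lyapunov(step, x0, n_transient=1000, n_steps=20000, d0=1e-8):
    """Estimate the largest Lyapunov exponent of the map x -> step(x).

    step : function taking and returning a list of floats (one time step)
    d0   : size to which the perturbation is renormalized at each step
    """
    x = list(x0)
    for _ in range(n_transient):      # let the trajectory reach its attractor
        x = step(x)
    y = list(x)
    y[0] += d0                        # tiny perturbation of the attractor state
    total, used = 0.0, 0
    for _ in range(n_steps):
        x, y = step(x), step(y)
        diff = [yi - xi for xi, yi in zip(x, y)]
        d = math.sqrt(sum(c * c for c in diff))
        if d == 0.0:                  # trajectories merged numerically: re-seed
            y = list(x)
            y[0] += d0
            continue
        total += math.log(d / d0)     # local expansion rate over this step
        used += 1
        y = [xi + c * d0 / d for xi, c in zip(x, diff)]   # renormalize to d0
    return total / used


def logistic(state):
    """Stand-in chaotic system (not the network of this letter)."""
    return [4.0 * state[0] * (1.0 - state[0])]


lam = largest_lyapunov(logistic, [0.3])
is_chaotic = lam > 0.01               # threshold quoted in the text
```

Averaging `lam` over many networks and configurations would give the mean Lyapunov exponent, and the fraction of runs with `is_chaotic` true would give the probability of chaos.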
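However, to constrain random networks to behave as a kind of “surrogate network” (indexed with S), they must have the same mean µ and the same standard deviation σ for their weight distributions (neuron-neuron and stimulus-neuron):

$$\begin{aligned}
\mu(w_{ij}^{L}) &= \mu(w_{ij}^{S}), & \mu(w_{bi}^{L}) &= \mu(w_{bi}^{S}), & \mu(w_{ii}^{L}) &= \mu(w_{ii}^{S}), \\
\sigma(w_{ij}^{L}) &= \sigma(w_{ij}^{S}), & \sigma(w_{bi}^{L}) &= \sigma(w_{bi}^{S}), & \sigma(w_{ii}^{L}) &= \sigma(w_{ii}^{S}).
\end{aligned} \tag{6.1}$$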
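One possible way to build a surrogate network satisfying equation 6.1 is sketched below. It assumes gaussian weights and matches the mean and standard deviation exactly by standardizing each random sample before rescaling; this exact-matching step, the function names, and the flat list used for the stimulus-neuron weights are assumptions made for illustration, since the letter does not specify how the surrogates were drawn.

```python
import random
import statistics


def matched_gaussian(values, rng):
    """Gaussian sample with the same mean and standard deviation as values."""
    if len(values) < 2:
        raise ValueError("need at least two weights to match a deviation")
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    draws = [rng.gauss(0.0, 1.0) for _ in values]
    m = statistics.fmean(draws)
    s = statistics.pstdev(draws)
    # Standardize the draws, then rescale to the target mean and deviation.
    return [mu + sigma * (d - m) / s for d in draws]


def surrogate_network(w, w_b, seed=0):
    """Return surrogate (w_s, w_b_s) for learned weights w (N x N) and w_b.

    Off-diagonal weights w_ij, self-connections w_ii, and stimulus-neuron
    weights w_bi are matched separately, as in equation 6.1.
    """
    rng = random.Random(seed)
    n = len(w)
    off = matched_gaussian(
        [w[i][j] for i in range(n) for j in range(n) if i != j], rng)
    diag = matched_gaussian([w[i][i] for i in range(n)], rng)
    off_iter = iter(off)
    w_s = [[diag[i] if i == j else next(off_iter) for j in range(n)]
           for i in range(n)]
    w_b_s = matched_gaussian(list(w_b), rng)
    return w_s, w_b_s
```

Drawing the weights directly from a gaussian with the learned mean and deviation, without the standardization step, would match equation 6.1 only approximately for small networks.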

These surrogate random networks enable the measurement of the weight distribution’s impact. A gaussian distribution has been chosen for the random weights.

6.1.1 Network Stabilization Through Hebbian Learning of Static Patterns. Figure 4A shows the mean Lyapunov exponents and probabilities of having chaos for networks with learned data encoded in fixed-point attractors only. To prevent the weight matrix from converging to the identity matrix (if ∀i, j : $w_{ii} > 0$ and $w_{ij} \approx 0$, all patterns are learned but without robustness), noise has been added during the training period. We can observe that:


Figure 4: Mean Lyapunov exponents and probability of chaos in RNNs as a function of the learning sets’ size. Out-supervised learned networks (filled circles) are compared with in-supervised learned networks (squares). In both cases, comparisons are performed with their surrogate random networks (in gray). Since these results come from statistics, mean and standard deviation are plotted. (A) Network stabilization can be observed after Hebbian learning of static patterns. (B) Chaotic dynamics appear after mapping stimuli to period 3 cycles.

• The encoding capacity (with content addressability) of in-supervised learned networks is greater than the maximum encoding capacities (without content addressability) obtained from theoretical results (Gardner, 1987). The explanation lies in the presence of the input layer, which modifies the network’s internal dynamics and enables other attractors to appear.
• Mean Lyapunov exponents of learned networks are always negative. When the learning tasks are made more complex, networks are more and more constrained. The mean Lyapunov exponent increases but still remains below a negative upper bound.
• It is nearly impossible to find chaotic dynamics for spontaneous regimes in learned networks, even after intensive learning of static patterns.
• Surrogate random networks show very different results, clearly indicating that the learned weight distribution is anything but random.
• The same trends are observed with both the out-supervised algorithm and the in-supervised algorithm.

The iterative Hebbian learning algorithm described here is likely to keep all connections approximately symmetric, preserving the stability of learned networks, while this is no longer the case for random networks.

6.1.2 Hebbian Learning of Cycles: A New Road to Chaos. If learning data in fixed-point attractors stabilizes the network, learning sequences in limit cycle attractors leads to diametrically opposed results. Figure 4B compares the presence of chaotic dynamics in 25-neuron networks trained on data sets of period 3 cycles of different sizes. We can observe that:

• Networks trained with the out-supervised algorithm become more and more chaotic as the learning task is intensified. In the end, the probability of falling into chaotic dynamics is equal to one, with high mean Lyapunov exponents. This kind of dynamics is ergodic and is reminiscent of the “blackout catastrophe” observed in fixed-point attractor networks (Amit, 1989; Fusi, 2002). The result is the same (the network becomes unable to retrieve any of the memorized patterns), but here the blackout arises through a progressive increase of the chaotic dynamics.
• The same trend shows up in the in-supervised case; however, even after intensive periods of learning, the networks do not become fully chaotic. The dynamics never turns into an ergodic one: chaotic attractors and limit cycle attractors coexist.
• Compared to surrogate random networks, in both cases learning contributes to structuring the dynamics (at least, learned networks have to remember the learned data!). However, when the learning task is intensified, the differences between the random and the learned networks tend to vanish in the out-supervised case.

The absence of full chaos in in-supervised learned networks is explained by the fact that this algorithm is based on a process of trial, error, and adaptations, which provides robustness and prevents full chaos. By increasing the data set to learn, learning takes more and more time, but at the same time, the number of remappings increases and forces large basins of stable dynamics. The network is more and more constrained, and complex but not fully chaotic.


Figure 5: Road to chaos observed when learning four cycles of period 7 in a recurrent network of 25 neurons through an iterative supervised Hebbian algorithm (the number of learning iterations is indicated on the x-axis). The network is initialized randomly, and following a given transient, the average state of the network is plotted on the y-axis. The mean Lyapunov exponent demonstrates the growing presence of chaotic dynamics.

Obtaining chaos by learning an increasing number of cycles based on a time-asymmetric Hebbian mechanism can be seen as a new road to chaos. This road to chaos is illustrated in Figure 5. In contrast with the classical roads shaped by the gradual variation of a control parameter, this new road relies on an external mechanism simultaneously modifying a set of parameters to fulfill an encoding task. When learning cycles, the network is prevented from stabilizing in fixed-point attractors. The more cycles there are to learn, the more the network is externally constrained and the more the regime turns out to be spontaneously chaotic.

6.2 Quantitative Dynamical Analysis: Symbolic Analyses. After learning, when the network is presented with a learned stimulus while its initial state is correctly chosen, the expected spatiotemporal attractor is observed. Section 5 showed that noise tolerance is expected: the external stimulus or the network’s internal state can be slightly modified without affecting the result. However, in some cases, the network fails to “understand” the stimulus, and an unexpected symbolic attractor appears in the output. The question is whether this attractor should be considered as spurious data. The intuitive idea is that if the attractor is chaotic or if its period is different from that of the learned data, it is easy to recognize it at a glance, and thus to discard it. This becomes more difficult if the observed attractor’s period is the same as that of the learned attractors. In this case, it is in fact impossible to know whether this information is relevant without comparing it with all the learned data. As a consequence, we define as spurious data any attractor that has the same period as the learned data but is still different from all of them.
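The definition of spurious data given above can be turned into a small classification routine. The sketch below is illustrative only: it assumes that a symbolic attractor is represented as a list of ±1 patterns (one per time step), that a chaotic output is flagged as `None` by some earlier test, and that two cycles are compared up to a phase shift. The names `cycle_distance` and `classify` and the tolerance parameter `tol` are hypothetical.

```python
def cycle_distance(a, b):
    """Normalized Hamming distance between two cycles of equal period,
    minimized over all phase shifts of the second cycle."""
    p, n = len(a), len(a[0])
    best = 1.0
    for shift in range(p):
        mismatches = sum(
            x != y
            for t in range(p)
            for x, y in zip(a[t], b[(t + shift) % p])
        )
        best = min(best, mismatches / (p * n))
    return best


def classify(attractor, learned, tol=0.0):
    """Label an observed symbolic attractor.

    attractor : list of patterns (tuples of +/-1), or None if chaotic
    learned   : list of learned cycles in the same representation
    tol       : largest distance still counted as a learned attractor
    """
    if attractor is None:
        return "chaotic"
    same_period = [c for c in learned if len(c) == len(attractor)]
    if not same_period:
        return "out-of-range"         # period differs from every learned one
    d = min(cycle_distance(attractor, c) for c in same_period)
    return "learned" if d <= tol else "spurious"
```

Setting `tol` to 0.1 would correspond to counting attractors within 10% of a learned cycle as correct, a choice that is an assumption here rather than a prescription of the letter.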


Figure 6: Proportion of the different symbolic attractors obtained during the spontaneous activity of artificial neural networks learned using, respectively, the out-supervised algorithm (A) and the in-supervised algorithm (B). Two types of mappings are analyzed: stimuli mapped to fixed-point attractors and stimuli mapped to spatiotemporal attractors of period 4.

As a result, two classification schemes are used to differentiate the symbolic attractors obtained. The first scheme is based on periods. This criterion enables distinguishing among chaotic attractors, periodic attractors whose periods differ from the learned ones (named out-of-range attractors), and periodic attractors having the same period as the learned ones. The aim of the second classification scheme is to differentiate these attractors, based on the normalized Hamming distance between them and the closest learned data (i.e., the attractors at a distance less than 10%, less than 20%, and so on). Figure 6 shows the proportion of the different types of attractors found as the size of the learned data set is increased. Results obtained with the in-supervised and the out-supervised algorithm while mapping stimuli in spatiotemporal attractors of various periods are compared. For each data set size, statistics are obtained from 100 different learned networks, and each time, 1000 symbolic attractors obtained from random stimuli and internal states have been classified. When stimuli are mapped into fixed-point attractors, the proportion of chaotic attractors and of out-of-range attractors falls rapidly to zero. In contrast, the number of spurious data increases drastically. In fact both


Figure 7: Proportions of the different types of attractors observed in in-supervised learned networks while the noise injected in a previously learned stimulus is varied. Networks’ initial states were randomly initialized. (A) Thirty stimuli are mapped to fixed-point attractors. (B) Thirty stimuli are mapped to period 4 cycles. One hundred learned networks have been tested, each with 1000 configurations.

learning procedures tend to stabilize the network by enforcing symmetric weights and positive auto-connections.¹³

When stimuli are mapped to cyclic attractors, both learning procedures lead to an increasing number of chaotic attractors. The more you learn, the more the spontaneous regime of the net tends to be chaotic. Still, we have to differentiate the two learning procedures. Out-supervised learning, due to its very constraining nature, drives the network into a fully chaotic state that prevents the network from learning more than 13 period 4 cycles. The less constraining and more natural in-supervised learning task leads to different behavior. This time, the network does not become fully chaotic, and the storage capacity is enhanced by a factor as large as 4. Unfortunately, the number of spurious data also increases, and noticeable proportions of spurious data are visible when the network is fed with random stimuli.

The aim of Figure 7 is to analyze the proportion of the different types of attractors observed when the external stimulus is progressively shifted from a learned stimulus to a random one. Again, the network’s initial state is set completely at random. Two types of in-supervised mappings are compared: stimuli mapped to fixed-point attractors and stimuli mapped to attractors of period 4. When noise-free learned stimuli are presented to the network, striking results appear. For fixed-point learned networks, in more than 80% of the observations, we are facing spurious data. Indeed, the distance between the observed attractors and the expected one is larger than 10%, and they have the same period (period 1). By contrast, for period 4 learned networks, the probability of recovering the perfect cycle increases to 56%. Moreover, 62% of the obtained cycles are at a distance less than 10% of the expected ones, and the amount of chaotic and out-of-range attractors is, respectively, equal to 23% and 8%. Space left for spurious data becomes less than 5%. Thus, if we imagine a procedure where the network’s states are slightly modified in case a chaotic trajectory is encountered, until a cyclic attractor is found, we are nearly certain of obtaining the correct mapping.

Because of the pervading number of spurious data in fixed-point attractors’ learned networks, it becomes difficult to imagine these networks as working memories. By contrast, the presence of chaotic attractors helps to prevent the proliferation of spurious data in cyclic attractors’ learned networks, while good storage capacity is possible by relying on in-supervised mappings.

13 If no noise is injected during the learning procedure, the network converges to the identity matrix.

6.3 Qualitative Dynamical Analysis: Nature of Chaos. The preceding section has given quantitative analyses of the presence of chaotic dynamics in learned networks. This section aims at introducing a more qualitative description of the different types of chaotic regimes encountered in these networks. Numerous techniques exist to characterize chaotic dynamics (Eckmann & Ruelle, 1985). In the first part of this section, three well-known tools are used: chaotic dynamics are characterized by means of return maps, power spectra, and analysis of their Lyapunov spectrum. The computation of the Lyapunov spectrum is performed through Gram-Schmidt reorthogonalization of the evolved system’s Jacobian matrix (which is estimated at each time step from the system’s equations). This method is detailed in Wolf et al. (1984). In the second part of this section, to have a better understanding of the notion of frustrated chaos, a new type of measure is proposed, indicating how chaotic dynamics is built on nearby limit cycle attractors.

6.3.1 Preliminary Analyses.
From return maps and power spectra analysis, three types of chaotic regimes have been identified in these networks. Figure 8 shows the power spectra and the return maps of these three generic types. They are labeled here white noise, deep chaos, and informative chaos. For each of them, a return map has been plotted for one particular neuron and the network’s mean signal. We observe that:

• White noise. The power spectrum of this type of chaos (see Figure 8A) shows a total lack of structure; all the frequencies are represented nearly equally. The associated return map is completely filled, with a bigger density of points at the edges, indicating the presence of saturation. No useful information can be obtained from such chaos.
• Deep chaos. The power spectrum of this type of chaos shows more structure, but is still very similar to white noise (see Figure 8B). However, the return map is very similar to the previous one and does not seem to provide any useful information.
• Informative chaos. The power spectrum looks very informative (see Figure 8C). Different peaks show up, indicating the presence of nearby limit cycles. The associated return map shows more structure, but still with the presence of saturation.

Figure 8: Return maps of the network’s mean signal (upper figures), return maps of a particular neuron (center figures), and power spectra of the network’s mean signal (lower figures) for chaotic regimes encountered in random (A) and learned (B, C) networks. The Lyapunov exponent of the corresponding dynamics is given.
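The two diagnostics used in the list above, return maps and power spectra, can be sketched for any scalar time series such as the network’s mean signal. The code below is a minimal standard-library illustration, not the authors' tooling: the function names are hypothetical, the spectrum is a plain discrete Fourier transform of the mean-removed signal, and the sinusoidal test signal merely stands in for recorded network activity.

```python
import cmath
import math


def return_map(signal):
    """Pairs (x(t), x(t+1)); plotting them gives the return map."""
    return list(zip(signal[:-1], signal[1:]))


def power_spectrum(signal):
    """Power at frequencies k/n cycles per step, for k = 0 .. n//2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]      # remove the constant component
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Stand-in signal: a pure oscillation with 8 cycles over 128 time steps.
signal = [math.sin(2 * math.pi * 8 * t / 128) for t in range(128)]
spectrum = power_spectrum(signal)
peak = max(range(len(spectrum)), key=lambda k: spectrum[k])
points = return_map(signal)
```

A spectrum with a few sharp peaks, as for this stand-in signal, would correspond to the informative case, whereas a nearly flat spectrum would correspond to the white-noise case.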

The most relevant result is the possibility of predicting the type of chaos preferentially encountered by knowing the learning procedure used, as well as the size of the data set to be learned. Chaotic dynamics encountered in surrogate random networks are nearly always similar to white noise. This explains the large mean Lyapunov exponent obtained for these networks.


Figure 9: The first 10 Lyapunov exponents. (A) After out-supervised learning of 10 period 4 cycles. (B) After in-supervised learning of 20 period 4 cycles. Each time, 10 different learning sets have been tested. For each obtained network, 100 chaotic dynamics have been studied.

When learning by means of the out-supervised procedure, different types of chaotic regimes appear depending on the data set’s size. For small data sets, informative chaos and deep chaos coexist. As the data set’s size increases, chaotic regimes go from informative chaos to an almost white-noise one. Learning too much information in an out-supervised way leads to networks showing very uninformative deep chaos similar to white noise. In other words, having too many competing limit cycle attractors forces the network’s dynamics to become almost random, losing its structure and behaving as white noise. All information about hypothetical nearby limit cycle attractors is lost. No more content addressability can be obtained.

When learning by means of the in-supervised procedure, chaotic dynamics are nearly always of the third type: very informative chaos shows up. By not predefining the internal representations of the network, this type of learning preserves more structure in the chaotic dynamics. The dynamical structure of the informative chaos (Figure 8C) reveals the existence of nearby competing attractors. It shows the phenomenon of frustration obtained when the network hesitates between two (or more) nearby cycles, passing from one to another. To provide a better understanding of this type of chaos, Figure 9 compares Lyapunov spectra obtained from deep chaos in out-supervised learned networks and from informative chaos in in-supervised learned networks. From this figure, deep chaos can be related to “hyperchaos” (Rössler, 1983). In hyperchaos, the presence of more than one positive Lyapunov exponent is expected, and these exponents are expected to be high. By contrast, Lyapunov spectra obtained in informative chaos after in-supervised learning are characteristic of chaotic itinerancies (Kaneko & Tsuda, 2003).
In this type of chaos, the dynamics is attracted to learned memories, which is indicated by negative Lyapunov exponents, while at the same time it escapes from them, which is indicated


Figure 10: Probability of the presence of the nearby limit cycle attractors in chaotic dynamics (x-axis). By slowly shifting the external stimulus from a stimulus previously learned (region a) to another stimulus learned (region b), the network’s dynamics goes from the limit cycle attractor associated with the former stimulus to the limit cycle attractor associated with the latter stimulus. The probabilities of the presence of cycle a and cycle b are plotted (black). The Lyapunov exponent of the obtained dynamics is also plotted (gray). (A) After out-supervised learning, hyperchaos is observed in between the learned attractors. (B) The same conditions lead to frustrated chaos after in-supervised learning.

by the presence of at least one positive Lyapunov exponent. This positive Lyapunov exponent must be only slightly positive in order not to completely erase the system’s history and thus keep traces of the learned memories. This regime of frustration is increased by some modes of neutral stability indicated by the presence of many exponents whose values are close to zero.

6.3.2 The Frustrated Chaos. The main difference between the frustrated chaos (Bersini & Sener, 2002) and similar forms of chaos well known in the literature, like intermittency chaos (Pomeau & Manneville, 1980) and chaotic itinerancy (Ikeda, Matsumoto, & Otsuka, 1989; Kaneko, 1992; Tsuda, 1992), lies in the way it is obtained and in the possible characterization of the dynamics in terms of the encoded attractors. All of those very structured chaotic regimes are characterized by strong cyclic components among which the dynamics randomly itinerates; the key difference lies in the transparency and the exploitation of those cycles. In the frustrated chaos, those cycles are the basic stages of the road to chaos. It is by forcing those cycles in the network (by tuning the connection parameters) that the chaos finally shows up. One way of forcing these cycles is the time-asymmetric Hebbian learning adopted in this letter. Frustrated chaos is a dynamical regime that appears in a network when the global structure is such that local connectivity patterns responsible for stable and meaningful oscillatory behaviors are intertwined, leading to mutually competing attractors and unpredictable itinerancy among brief appearances of these attractors.


To have a better understanding of this type of chaos, Figure 10 compares hyperchaos with frustrated chaos through the probability of presence of the nearby limit cycle attractors in chaotic dynamics. In the figure on the left, the network has learned to associate two data with limit cycle attractors of period 10 by using the out-supervised Hebbian algorithm. This algorithm strongly constrains the network, and as a consequence, chaotic dynamics appear very uninformative: when the external stimulus is shifted from one attractor to the other, strong chaos shows up in between (indicated by the Lyapunov exponent), where any information concerning these two limit cycle attractors is lost. In contrast, when mapping four stimuli to period 4 cycles using the in-supervised algorithm (Figure 10, right), and driving the dynamics by shifting the external stimulus from one attractor to another one, the chaos encountered on the road appears much more structured: small Lyapunov exponents and a strong presence of the nearby limit cycles are easy to observe, shifting progressively from one attractor to the other one.

7 Conclusion

This letter studies the possibility of encoding information in spatiotemporal cyclic attractors of a network’s internal state. Two versions of a time-asymmetric Hebbian learning algorithm are proposed. First is a classical out-supervised version, where the cyclic attractors are predefined and need to be replicated by the internal dynamics of the network. Second is an in-supervised version, where the cyclic attractors to be associated with the external stimuli are left unprescribed and derived from the ones spontaneously proposed by the network. First, experimental results show that the encoding performances obtained in the in-supervised case are greatly enhanced compared to the ones obtained in the out-supervised case. This is intuitively understandable since the in-supervised learning is much less constraining for the network.
Second, experimental results aim at analyzing the background dynamical regime of the network: the dynamics observed when unlearned stimuli are presented to the network. It is empirically shown that the more information the network has to store in its attractors, the more its spontaneous dynamical regime tends to be chaotic. Chaos is in fact the biggest pool of potential cyclic attractors. Asymmetric Hebbian learning can be seen as an alternative road to chaos. Adopting the out-supervised learning and increasing the amount of information to store, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. The reason lies in the constraints imposed by this learning process, which becomes harder and harder to satisfy by the network. In contrast, in-supervised learning, by being more “respectful” of the network intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe in the chaotic regime the traces of the learned attractors. Following


our previous considerations, we call this complex but still very structured and informative regime frustrated chaos. This behavior can be related to experimental findings where ongoing cortical activity has been shown to encompass a set of dynamically switching cortical states that correspond to stimulus-evoked activity (Kenet et al., 2003).

Symbolic investigations have been performed on the spatiotemporal attractors obtained when the network is in a random state and presented with random or noisy stimuli. Different types of attractors are observed in the output. Spurious data have been defined as attractors having the same period as the learned data but still different from all of them. This follows the intuitive idea that the other kinds of attractors (like chaotic attractors) are easily recognizable at a glance and unlikely to be confused with learned data. In the case of spurious data, it is impossible to know whether the observed attractors bear useful information without comparing them with all the learned data.

Networks where the information is coded in fixed-point attractors are easily corrupted with spurious data, which makes their exploitation very delicate. By contrast, when the information is stored in cyclic attractors using in-supervised learning, chaotic attractors appear as the background regime of the net. They are likely to play a beneficial role by preventing the proliferation of spurious data and helping the recovery of the correct mapping: if a previously learned stimulus is presented to the network, either the network will find the correct mapping, or it will iterate through a chaotic trajectory.

While not as easy to interpret and “engineerize” as its classical supervised counterpart, in-supervised learning makes more sense when adopting both a biological and a cognitive perspective. Biological systems propose their own way to treat external impact by slightly perturbing their inner working. Here the information is still meaningful to the extent that the external impact is associated in an unequivocal way with an attractor. As the famous Colombian neurophysiologist Rodolfo Llinas (2001) likes to say: “A person’s waking life is a dream modulated by the senses.”

References Albers, D., Sprott, J., & Dechert, W. (1998). Routes to chaos in neural networks with random weights. International Journal of Bifurcation and Chaos, 8, 1463–1478. Amari, S. (1972). Learning pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 21, 1197–1206. Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185. Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63–73. Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.


C. Molter, U. Salihoglu, and H. Bersini

Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–657. Amit, D., & Brunel, N. (1994). Learning internal representations in an attractor neural network with analogue neurons. Network: Computation in Neural Systems, 6, 359–388. Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982. Amit, D., Gutfreund, G., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67. Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596. Babloyantz, A., & Lourenço, C. (1994). Computation with chaos: A paradigm for cortical activity. Proceedings of the National Academy of Sciences, 91, 9027–9031. Bersini, H. (1998). The frustrated and compositional nature of chaos in small Hopfield networks. Neural Networks, 11, 1017–1025. Bersini, H., & Sener, P. (2002). The connections between the frustrated chaos and the intermittency chaos in small Hopfield networks. Neural Networks, 15, 1197–1204. Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796. Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232, 331–356. Brunel, N., Carusi, F., & Fusi, S. (1997). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123–152. Dauce, E., Quoy, M., Cessac, B., Doyon, B., & Samuelides, M. (1998). Self-organization and dynamics reduction in recurrent networks: Stimulus presentation and learning. Neural Networks, 11, 521–533. Domany, E., van Hemmen, J., & Schulten, K. (Eds.). (1995). Models of neural networks (2nd ed.). Berlin: Springer.
Eckmann, J., & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57(3), 617–656. Erdi, P. (1996). The brain as a hermeneutic device. Biosystems, 38, 179–189. Forrest, B., & Wallace, D. (1995). Models of neural networks (2nd ed.). Berlin: Springer. Franosch, J.-M., Lingenheil, M., & van Hemmen, J. (2005). How a frog can learn what is where in the dark. Physical Review Letters, 95, 1–4. Freeman, W. (2002). Biocomputing. Norwell, MA: Kluwer. Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics, 87, 459–470. Gardner, E. (1987). Maximum storage capacity in neural networks. Europhysics Letters, 4, 481–485. Gardner, E., & Derrida, B. (1989). Three unfinished works on the optimal storage capacity of networks. J. Physics A: Math. Gen., 22, 1983–1994. Grossberg, S. (1992). Neural networks and natural intelligence. Cambridge, MA: MIT Press. Guillot, A., & Dauce, E. (Eds.). (2002). Approche dynamique de la cognition artificielle. Paris: Hermès Science.

Road to Chaos in Recurrent Neural Networks


Gutfreund, Y., Zheng, W., & Knudsen, E. I. (2002). Gated visual input to the central auditory system. Science, 297, 1556–1559. Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computation Neuroscience, 3, 7–34. Hebb, D. (1949). The organization of behavior. New York: Wiley-Interscience. Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558. Ikeda, K., Matsumoto, K., & Otsuka, K. (1989). Maxwell-Bloch turbulence. Progress of Theoretical Physics, 99(Suppl.), 295–324. Kaneko, K. (1992). Pattern dynamics in spatiotemporal chaos. Physica D, 34, 1–41. Kaneko, K., & Tsuda, I. (2003). Chaotic itinerancy. Chaos: Focus Issue on Chaotic Itinerancy, 13(3), 926–936. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425, 954–956. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Levy, W., & Steward, O. (1983). Temporal contiguity requirements for long term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797. Llinás, R. (2001). I of the vortex: From neurons to self. Cambridge, MA: MIT Press. Molter, C., & Bersini, H. (2003). How chaos in small Hopfield networks makes sense of the world. In Proceedings of the International Joint Conference on Neural Networks conference. Piscataway, NJ: IEEE Press. Molter, C., Salihoglu, U., & Bersini, H. (2005a). Introduction of an Hebbian unsupervised learning algorithm to boost the encoding capacity of Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Montreal. Los Alamitos, CA: IEEE Computer Society Press. Molter, C., Salihoglu, U., & Bersini, H. (2005b). Learning cycles brings chaos in continuous Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). Los Alamitos, CA: IEEE Computer Society Press. Nicolis, J., & Tsuda, I. (1985). Chaotic dynamics of information processing: The “magic number seven plus-minus two” revisited. Bulletin of Mathematical Biology, 47, 343–65. Omlin, C. (2001). Understanding and explaining DRN behavior. In K. Kremer (Ed.), A field guide to dynamical recurrent networks. Piscataway, NJ: IEEE Press. Pasemann, F. (2002). Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, 13(2), 195–216. Piaget, J. (1963). The psychology of intelligence. New York: Routledge. Pomeau, Y., & Manneville, P. (1980). Intermittent transitions to turbulence in dissipative dynamical systems. Comm. Math. Phys., 74, 189–197. Rodriguez, E., George, N., Lachaux, J., Renault, B., Martinerie, J., Reunault, B., & Varela, F. (1999). Perception’s shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.


Rössler, O. E. (1983). The chaotic hierarchy. Zeitschrift für Naturforschung, 38a, 788–801. Sejnowski, T. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321. Skarda, C., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195. Sompolinsky, H., Crisanti, A., & Sommers, H. (1988). Chaos in random neural networks. Physical Review Letters, 61, 258–262. Tsuda, I. (1992). Dynamic link of memory-chaotic memory map in nonequilibrium neural networks. Neural Networks, 5, 313–326. Tsuda, I. (2001). Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioral and Brain Sciences, 24, 793–847. van Hemmen, J., & Kuhn, R. (1995). Models of neural networks (2nd ed.). Berlin: Springer. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. Varela, F., Thompson, E., & Rosch, E. (1991). The embodied mind: Cognitive science and human experience. Cambridge, MA: MIT Press. Wolf, A., Swift, J., Swinney, H., & Vastano, J. (1984). Determining Lyapunov exponents from a time series. Physica, D16, 285–317.

Received April 6, 2005; accepted May 30, 2006.

LETTER

Communicated by Herbert Jaeger

Analysis and Design of Echo State Networks

Mustafa C. Ozturk [email protected]

Dongming Xu [email protected]

José C. Príncipe [email protected] Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

Neural Computation 19, 111–138 (2007) © 2006 Massachusetts Institute of Technology

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliveira, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings (Principe, 2001).

Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Manolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995).

Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) and contain information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter.

The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by $\|\cdot\|$) of the reservoir’s weight matrix ($\|W\| < 1$). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters


relies on the selection of spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure for the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics.

In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time on output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).
This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the


time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold. The design principle specifies that one should consider independently the correlation among the basis and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system’s degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal basis. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius integrating the information from the input-output joint space in the ESN bases. 
For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input unit at time n is $u(n) = [u_1(n), u_2(n), \ldots, u_M(n)]^T$, of internal units are $x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$, and of output units are $y(n) = [y_1(n), y_2(n), \ldots, y_L(n)]^T$. The connection weights are given in an N × M weight matrix $W^{in} = (w^{in}_{ij})$ for connections between the input and the internal PEs, in an N × N matrix $W = (w_{ij})$ for connections between the internal PEs, in an L × N matrix $W^{out} = (w^{out}_{ij})$ for connections from PEs to the output units, and in an N × L matrix $W^{back} = (w^{back}_{ij})$ for the connections that project back from the output to the internal PEs (Jaeger, 2001).

[Figure 1 appears here: block diagram in which the input layer u(n) feeds the dynamical reservoir x(n) through $W^{in}$, the reservoir is recurrently connected through W, the readout $W^{out}$ produces y(n), and $W^{back}$ feeds the output back to the reservoir.]

Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-weight ($\|W\| < 1$) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.

The activation of the internal PEs (echo state) is updated according to

\[
x(n + 1) = f\left(W^{in} u(n + 1) + W x(n) + W^{back} y(n)\right), \tag{2.1}
\]

where $f = (f_1, f_2, \ldots, f_N)$ are the internal PEs’ activation functions. Here, all $f_i$’s are hyperbolic tangent functions, $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. The output from the readout network is computed according to

\[
y(n + 1) = f^{out}\left(W^{out} x(n + 1)\right), \tag{2.2}
\]

where $f^{out} = (f^{out}_1, f^{out}_2, \ldots, f^{out}_L)$ are the output unit’s nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear so $f^{out}$ is identity. ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine


interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued function of a real-valued vector $u(t) = [u_1(t), u_2(t), \ldots, u_M(t)]^T$. In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions $\varphi_i(t)$, called the basis functionals, such that its approximant, $\hat{h}(u(t))$, is given by

\[
\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(t).
\]

Here, $a_i$’s are the projections of h(u(t)) onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes for a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single hidden layer TDNN with N PEs and a linear output. The hidden-layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,

\[
\varphi_i(u(t)) = g\Big(\sum_{j} b_{ij} u_j(t)\Big).
\]

$b_{ij}$’s are the input layer weights, and g is the PE nonlinearity. The approximation produced by the TDNN is then

\[
\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(u(t)), \tag{2.3}
\]

where $a_i$’s are the weights of the output layer. Notice that the $b_{ij}$’s adapt the bases and the $a_i$’s adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually,


since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the $\varphi_i(u(t))$’s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article.

The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through $W^{in}$, $W$, and $W^{back}$. Notice, however, that none of these weight matrices is adapted, that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer for a brute force approach of creating fixed diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation ($y(n + 1) = W^{out} x(n + 1)$) has the same form as equation 2.3, where the $\varphi_i$’s and $a_i$’s are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses.
They proposed that “the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands”.

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and $W^{back}$ is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1.

[Figure 2 appears here: MSE (logarithmic axis) plotted against the index of the realization, 0 to 50.]

Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from $5.9 \times 10^{-9}$ to $8.9 \times 10^{-5}$. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

The next step is to train the linear readout. One method to determine the optimal output weight matrix, $W^{out}$, in the mean square error (MSE) sense (where MSE is defined by $O = \frac{1}{2}(d - y)^T (d - y)$) is to use the Wiener solution given by Haykin (2001):

\[
W^{out} = E[x x^T]^{-1} E[x d] \cong \left(\frac{1}{N} \sum_{n} x(n) x(n)^T\right)^{-1} \left(\frac{1}{N} \sum_{n} x(n) d(n)\right). \tag{2.4}
\]
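The experiment and the sample form of equation 2.4 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names (`make_reservoir`, `run_esn`, `wiener_readout`) are hypothetical, the initial state is assumed to be zero, the error is reported as a mean over time steps rather than the sum in $O$, and a least-squares solve is swapped in for the explicit matrix inverse because the sample correlation matrix can be poorly conditioned. In equation 2.4 the symbol N counts time samples, which the code takes from the length of the state record.

```python
import numpy as np

def make_reservoir(n_units, rng, value=0.4, p_nonzero=0.05):
    """Sparse W with entries +value, -value, 0 (probabilities p/2, p/2, 1 - p)."""
    half = p_nonzero / 2.0
    return rng.choice([value, -value, 0.0], size=(n_units, n_units),
                      p=[half, half, 1.0 - p_nonzero])

def run_esn(W, w_in, u):
    """Echo states of equation 2.1 for a scalar input, with W_back = 0."""
    x = np.zeros(W.shape[0])  # zero initial state (an assumption)
    states = np.empty((len(u), W.shape[0]))
    for n, u_n in enumerate(u):
        x = np.tanh(w_in * u_n + W @ x)
        states[n] = x
    return states

def wiener_readout(states, d):
    """Sample estimate of equation 2.4: solve R w = p for the readout weights."""
    R = states.T @ states / len(states)  # estimate of E[x x^T]
    p = states.T @ d / len(states)       # estimate of E[x d]
    # lstsq stands in for the explicit inverse because R can be ill conditioned.
    return np.linalg.lstsq(R, p, rcond=None)[0]

rng = np.random.default_rng(0)
n = np.arange(300)
u = np.sin(2 * np.pi * n / (10 * np.pi))  # input signal from the text
d = u ** 7                                # desired response

mses = []
for _ in range(5):  # a handful of realizations; the text uses 50
    W = make_reservoir(100, rng)
    w_in = rng.choice([-1.0, 1.0], size=100)
    X = run_esn(W, w_in, u)
    w_out = wiener_readout(X, d)
    mses.append(float(np.mean((d - X @ w_out) ** 2)))
print("MSE per realization:", mses)
```

The exact error values depend on the random seed and on how the linear system is solved, so they should not be expected to match the figures reported in the text.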

Here, E[.] denotes the expected value operator, and d denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is $5.9 \times 10^{-9}$, whereas the maximum MSE is $8.9 \times 10^{-5}$. This experiment


demonstrates that a design strategy that is based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by $x(n + 1) = f(W^{in} u(n + 1) + W x(n))$. Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by

\[
J(n + 1) =
\begin{bmatrix}
\dot{f}(net_1(n)) w_{11} & \dot{f}(net_1(n)) w_{12} & \cdots & \dot{f}(net_1(n)) w_{1N} \\
\dot{f}(net_2(n)) w_{21} & \dot{f}(net_2(n)) w_{22} & \cdots & \dot{f}(net_2(n)) w_{2N} \\
\cdots & \cdots & \cdots & \cdots \\
\dot{f}(net_N(n)) w_{N1} & \dot{f}(net_N(n)) w_{N2} & \cdots & \dot{f}(net_N(n)) w_{NN}
\end{bmatrix}
=
\begin{bmatrix}
\dot{f}(net_1(n)) & 0 & \cdots & 0 \\
0 & \dot{f}(net_2(n)) & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots \\
0 & 0 & \cdots & \dot{f}(net_N(n))
\end{bmatrix}
\cdot W = F(n) \cdot W. \tag{2.5}
\]

Here, $net_i(n)$ is the ith entry of the vector $(W^{in} u(n + 1) + W x(n))$, and $w_{ij}$ denotes the (i, j)th entry of W. The poles of the linearized system at time n + 1 are given by the eigenvalues of the Jacobian matrix J(n + 1).¹ As the amplitude of each PE changes, the local slope changes, and so the poles of

¹ The transfer function of a linear system $x(n + 1) = A x(n) + B u(n)$ is $\frac{X(z)}{U(z)} = (zI - A)^{-1} B = \frac{\mathrm{Adjoint}(zI - A)}{\det(zI - A)} B$. The poles of the transfer function can be obtained by solving $\det(zI - A) = 0$. The solution corresponds to the eigenvalues of A.


the linearized system are time varying, although the parameters of ESN are fixed. In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4 and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input. A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. 
The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.² The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschläger (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

3 Average State Entropy as a Measure of the Richness of ESN Reservoir

Previous research was aware of the influence of diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs. Several metrics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,

² Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, where $W = PDP^{-1}$, P is the eigenvector matrix and D is the diagonal matrix of eigenvalues ($D_{ii}$) of W. Since F(n) and D are diagonal, $J(n + 1) = F(n)W = F(n)(PDP^{-1}) = P(F(n)D)P^{-1}$ is the eigendecomposition of J(n + 1). Here, each entry of F(n)D, $f'(net_i(n)) D_{ii}$, is an eigenvalue of J. Therefore, $|f'(net_i(n)) D_{ii}| \le |D_{ii}|$ since $f'(net_i) \le f'(0)$.
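The effect of a constant bias on the linearized dynamics can be checked numerically in a simple special case. The sketch below is an illustration, not the authors' code: the helper names `jacobian` and `spectral_radius` are hypothetical, and the linearization point is assumed to be the zero state with the bias applied through ±1 input weights. Under that assumption every PE sees the same magnitude of net input, so F(n) is a multiple of the identity and the spectral radius of J is exactly $(1 - \tanh^2 b)$ times that of W. Away from this special case the slopes differ across PEs, and the eigenvalues of F(n)W have to be computed numerically.

```python
import numpy as np

def jacobian(W, net):
    """J = F W with F = diag(f'(net)) and f = tanh, as in equation 2.5."""
    slopes = 1.0 - np.tanh(net) ** 2  # derivative of tanh at each PE
    return slopes[:, None] * W        # row i of W scaled by slope i

def spectral_radius(A):
    """Largest absolute value among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

rng = np.random.default_rng(0)
N = 100
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
W *= 0.95 / spectral_radius(W)           # rescale W to spectral radius 0.95
w_in = rng.choice([-1.0, 1.0], size=N)

x = np.zeros(N)  # linearize around the zero state (an assumption)
for bias in (0.0, 0.5, 1.0, 2.0):
    net = w_in * bias + W @ x
    print(f"bias={bias:.1f}  spectral radius of J={spectral_radius(jacobian(W, net)):.4f}")
```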

Analysis and Design of Echo State Networks

Imaginary

(E)

1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1

Imaginary

C E

20

1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1

-0.5

1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1

-0.5

1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1

-0.5

(B)

D

40

60

Time

80

100

0

0.5

1

0

0.5

1

0

0.5

1

Real

(D)

Imaginary

Imaginary

(C)

1 0.8 0.6 0.4 0.2 0 B -0.2 -0.4 -0.6 -0.8 -1 0

-0.5

0

Real

0.5

1

(F)

Imaginary

Amplitude

(A)

121

-0.5

0

Real

0.5

1

Real

Real

Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of sinusoidal signal with a period of 100. (B–E) The positions of poles of the linearized systems when the input values are at B, C, D, and E in Figure 5A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. An ESN with more states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems, when compared to their linear counterpart.

122

M. Ozturk, D. Xu, and J. Pr´ıncipe

2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous values of the ESN states. If the echo states’ instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears to be a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by trajectories.

Renyi’s quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi’s entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi’s entropy with parameter γ for a random variable X with a pdf $f_X(x)$ is given by (Renyi, 1970)

$$H_\gamma(X) = \frac{1}{1-\gamma} \log E\left[ f_X^{\gamma-1}(X) \right].$$

Renyi’s quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon’s entropy is obtained). Given N samples $\{x_1, x_2, \ldots, x_N\}$ drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

$$\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),$$

where $K_\sigma$ is the kernel function with the kernel size σ. Then Renyi’s quadratic entropy can be estimated by (Principe et al., 2000)

$$H_2(X) = -\log\left( \frac{1}{N^2} \sum_{j} \sum_{i} K_\sigma(x_j - x_i) \right). \tag{3.1}$$

The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector $x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$ of an ESN with N internal PEs. Results will be shown with a gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter than the spectral radius for quantifying the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii, and even with the same spectral radius, display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states’ instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes values −0.735, −0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible. Figure 4C shows ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.
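The estimator in equation 3.1 and its time average can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the PEs are assumed to be tanh units driven as x(n + 1) = tanh(Win u(n + 1) + W x(n)), the first 100 steps are discarded as a transient, and the function names (`renyi_quadratic_entropy`, `average_state_entropy`, `run_esn`) are illustrative. The random reservoir differs from the one used in the text, so the printed ASE values need not match those reported above.

```python
import numpy as np


def renyi_quadratic_entropy(samples, sigma):
    """Equation 3.1 with a gaussian kernel of size sigma."""
    x = np.asarray(samples, dtype=float)
    diffs = x[:, None] - x[None, :]   # all N^2 pairwise differences
    kernel = np.exp(-diffs ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(kernel.mean())     # mean = (1/N^2) * double sum


def average_state_entropy(states, kernel_scale=0.3):
    """ASE: instantaneous state entropy averaged over time.

    states has shape (T, N); the kernel size at each step is
    kernel_scale times the standard deviation of that state vector.
    """
    return float(np.mean([
        renyi_quadratic_entropy(x, kernel_scale * np.std(x)) for x in states
    ]))


def run_esn(W, w_in, u):
    """Collect echo states for x(n+1) = tanh(w_in * u(n+1) + W x(n)) (assumed form)."""
    x = np.zeros(W.shape[0])
    states = []
    for u_n in u:
        x = np.tanh(w_in * u_n + W @ x)
        states.append(x)
    return np.array(states)


rng = np.random.default_rng(0)
N = 100
W0 = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W0 /= np.max(np.abs(np.linalg.eigvals(W0)))      # normalize to unit spectral radius
w_in = rng.choice([1.0, -1.0], size=N)
u = np.sin(2 * np.pi * np.arange(300) / 20)

ase = {}
for rho in (0.2, 0.5, 0.8):
    states = run_esn(rho * W0, w_in, u)[100:]    # drop the initial transient
    ase[rho] = average_state_entropy(states)
    print(f"spectral radius {rho}: ASE = {ase[rho]:.3f}")
```

Because the kernel size is tied to the standard deviation of each state vector, a state vector whose entries are all identical would give a zero kernel size; the sketch avoids this by discarding the initial transient rather than guarding against it explicitly.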
Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set in ESNs is created independently of the desired response and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights.

4 Designing Echo State Networks

4.1 Design of the Echo State Recurrent Connections. According to the interpretation of ESNs as coupled linear systems, the design of the internal
connection matrix, W, will be based on the distribution of the poles of the linearized system around zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design scenario, the system dynamics will include uniform coverage of time constants arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system with finite order is achieved when poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should uniformly spread the poles to anticipate good approximation to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles inside the unit circle uniformly. The constraints of a circle as boundary conditions for discrete linear systems and complex conjugate locations are easy to include for the pole distribution (Thogula, 2003). The poles are first initialized at random locations; Renyi’s quadratic entropy is calculated by equation 3.1, and the poles are moved such that the entropy of the new distribution increases over iterations (Erdogmus & Principe, 2002). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles. The system with the uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters of different center frequencies.
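One way to picture the entropy-driven pole placement is the sketch below. It is not the procedure of Erdogmus and Principe (2002); it swaps in plain projected gradient steps on the information potential (the argument of the logarithm in equation 3.1, here with a two-dimensional gaussian kernel), which is an assumption made for brevity. Poles are kept in complex-conjugate pairs by moving only the upper-half-plane member of each pair. The kernel size, step size, iteration count, and all function names are illustrative choices.

```python
import numpy as np


def quadratic_entropy_2d(points, sigma):
    """Renyi's quadratic entropy estimate for points in the complex plane."""
    d2 = np.abs(points[:, None] - points[None, :]) ** 2
    kernel = np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return -np.log(kernel.mean())


def random_half_disk(n_pairs, radius, rng):
    """Draw n_pairs points uniformly from the upper half of a disk."""
    r = radius * np.sqrt(rng.uniform(size=n_pairs))
    theta = rng.uniform(0.0, np.pi, size=n_pairs)
    return r * np.exp(1j * theta)


def spread_poles(upper, radius, sigma=0.2, step=0.02, n_iter=1000):
    """Move conjugate pole pairs apart by descending the information potential."""
    upper = upper.copy()
    for _ in range(n_iter):
        poles = np.concatenate([upper, upper.conj()])
        diff = upper[:, None] - poles[None, :]
        weight = np.exp(-np.abs(diff) ** 2 / (2.0 * sigma ** 2))
        # Each kernel term pushes a pole away from its neighbors and from its conjugate.
        push = (weight * diff).sum(axis=1) / (sigma ** 2 * poles.size)
        upper = upper + step * push
        # Keep each pole in the upper half plane and inside the disk.
        upper = upper.real + 1j * np.maximum(upper.imag, 0.0)
        mag = np.maximum(np.abs(upper), 1e-12)
        upper = np.where(mag > radius, upper * radius / mag, upper)
    return np.concatenate([upper, upper.conj()])


rng = np.random.default_rng(0)
radius, sigma = 0.9, 0.2
start = random_half_disk(15, radius, rng)
initial = np.concatenate([start, start.conj()])
poles = spread_poles(start, radius, sigma)
print("entropy before:", quadratic_entropy_2d(initial, sigma))
print("entropy after: ", quadratic_entropy_2d(poles, sigma))
```

The disk radius plays the role of the target spectral radius. With a finite kernel size, the final configuration depends on sigma, so the result should be read as an approximately even spread rather than an exactly uniform one.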
Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radii of 0.2, 0.5, and 0.8, from top to bottom, respectively, plotted against time. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while for a spectral radius of 0.8, the outputs of the echo states are distributed almost uniformly within their dynamic range. (B) Instantaneous state entropy, calculated using equation 3.1, plotted against time for the three spectral radii. The information contained in the echo states changes over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations (trials) of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.

Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues of W). In principle, we would like to create a sparse
matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

$$W = \begin{bmatrix}
-a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}. \tag{4.1}$$

The characteristic polynomial of W is

$$l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N), \tag{4.2}$$

where the $p_i$’s are the eigenvalues and the $a_i$’s are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form using equation 4.1. We will call the ESN constructed based on the uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by $Q^{-1}WQ$, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with the recurrent weight matrix having the eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with random internal connection weight matrices. We will consider an ESN with 30 states and use our procedure to create the W matrix for ASE-ESN for different spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices with different sparseness constraints. This corresponds to a weight distribution having the values 0, c, and −c with probabilities $p_1$, $(1 - p_1)/2$, and $(1 - p_1)/2$, where $p_1$ defines the sparseness of W and c is a constant that takes a specific value depending on the spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Haas, 2004). Then, for different $W^{in}$ matrices, we run the ASE-ESNs with the sinusoidal input given in section 3 and calculate ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE on average for all spectral radii compared to ESNs with sparse and uniform random connections.
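The construction of W from prescribed poles through equations 4.2 and 4.1 can be sketched as follows. This is a minimal illustration: `canonical_w` is a hypothetical helper name, the six example poles are arbitrary, and the check is done for a small N because recovering roots from polynomial coefficients can become numerically sensitive as N grows.

```python
import numpy as np


def canonical_w(poles):
    """Build W in the canonical form of equation 4.1 from its eigenvalues.

    The poles must be real or come in complex-conjugate pairs so that the
    characteristic polynomial (equation 4.2) has real coefficients.
    """
    coeffs = np.poly(np.asarray(poles, dtype=complex))   # [1, a_1, ..., a_N]
    if not np.allclose(np.imag(coeffs), 0.0):
        raise ValueError("poles are not closed under complex conjugation")
    a = np.real(coeffs[1:])
    n = a.size
    W = np.zeros((n, n))
    W[0, :] = -a                                   # first row: -a_1, ..., -a_N
    W[np.arange(1, n), np.arange(n - 1)] = 1.0     # ones on the subdiagonal
    return W


pairs = np.array([0.9 * np.exp(1j * np.pi / 4), 0.5 * np.exp(2j * np.pi / 3)])
poles = np.concatenate([pairs, pairs.conj(), [0.3, -0.6]])
W = canonical_w(poles)
eig = np.linalg.eigvals(W)

rng = np.random.default_rng(0)
Q = rng.normal(size=W.shape)            # a random Q is nonsingular with probability one
W_dense = np.linalg.solve(Q, W @ Q)     # Q^{-1} W Q without forming the inverse
print("spectral radius of W:", np.max(np.abs(eig)))
```

The last two lines before the print illustrate the similarity transform mentioned above: `W_dense` is a dense matrix that shares the eigenvalues of the sparse canonical form.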
Figure 5: Comparison of ASE values (ASE versus spectral radius) obtained for ASE-ESN having W with uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN with uniformly distributed weights between −1 and 1. Randomly generated weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.

This approach is indeed conceptually similar to Jeffreys’ maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems. Concentrating the poles of the linearized
system in certain regions of the space provides good performance only if the desired response has energy in this part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

4.2 Design of the Adaptive Bias. In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are only input dependent. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. From the linearization analysis that shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that according to the linearization analysis, bias can only reduce the spectral radius. The information for adaptation of bias is the MSE in training, which modulates the spectral radius of the system with the information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated in the definition of the projection space of the ESN. The beauty of this method is that the spectral
radius can be adjusted by a single parameter that is external to the system without changing reservoir weights.

The training of bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be efficiently employed to optimize the bias. Among different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within the prescribed length. In our problem, a bias value is picked according to Fibonacci search. For each value of bias, training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.

Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

$$x(n+1) = f\left(W^{in} u(n+1) + W^{in} b + W x(n)\right).$$

The update equation for b is given by

$$\frac{\partial O(n+1)}{\partial b} = -e \cdot W^{out} \times \frac{\partial x(n+1)}{\partial b} \tag{4.3}$$

$$= -e \cdot W^{out} \times \left[ \dot{f}(net_{n+1}) \cdot \left( W \times \frac{\partial x(n)}{\partial b} + W^{in} \right) \right]. \tag{4.4}$$
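The recursion in equations 4.3 and 4.4 can be sketched and checked numerically as follows. Several assumptions are made here and are not taken from the text: the PEs are tanh units, the per-step cost is taken as O(n) = e(n)²/2 with e(n) = d(n) − Wout x(n), the readout is held fixed while the gradient is accumulated, and the target signal is a hypothetical delayed copy of the input. The function name `cost_and_bias_gradient` is illustrative.

```python
import numpy as np


def cost_and_bias_gradient(W, w_in, w_out, u, d, b):
    """Return 0.5 * sum(e^2) and its derivative with respect to the bias b."""
    x = np.zeros(W.shape[0])
    dx_db = np.zeros(W.shape[0])            # sensitivity dx(n)/db, zero at the start
    cost, grad = 0.0, 0.0
    for u_n, d_n in zip(u, d):
        x = np.tanh(w_in * u_n + w_in * b + W @ x)
        f_dot = 1.0 - x ** 2                # derivative of tanh at net(n+1)
        dx_db = f_dot * (W @ dx_db + w_in)  # bracketed term of equation 4.4
        e = d_n - w_out @ x                 # error with a fixed linear readout
        cost += 0.5 * e ** 2
        grad += -e * (w_out @ dx_db)        # equation 4.3
    return cost, grad


rng = np.random.default_rng(0)
N = 30
W = rng.choice([0.0, 0.35, -0.35], size=(N, N), p=[0.8, 0.1, 0.1])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([1.0, -1.0], size=N)
w_out = rng.normal(scale=0.1, size=N)       # arbitrary fixed readout for the check
u = np.sin(2 * np.pi * np.arange(200) / 25)
d = np.roll(u, 1)                           # hypothetical target: delayed input

b, h = 0.5, 1e-5
cost, grad = cost_and_bias_gradient(W, w_in, w_out, u, d, b)
c_plus, _ = cost_and_bias_gradient(W, w_in, w_out, u, d, b + h)
c_minus, _ = cost_and_bias_gradient(W, w_in, w_out, u, d, b - h)
finite_diff = (c_plus - c_minus) / (2 * h)
print(f"recursive gradient: {grad:.6f}, finite difference: {finite_diff:.6f}")
```

A gradient step would then update the bias as `b -= eta * grad` for some small step size `eta` (also a hypothetical name); in a full design loop the output weights would be recomputed for each new bias value.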

Here, O is the MSE defined previously. This algorithm may suffer from similar problems observed in gradient-based methods in recurrent networks training. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of the two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n − k), for a given delay k. Denoting the optimal output signal $y_k(n)$, the k-delay
STM capacity of a network, $MC_k$, is defined as the squared correlation coefficient between u(n − k) and $y_k(n)$ (Jaeger, 2002a). The STM capacity, MC, of the network is defined as $\sum_{k=1}^{\infty} MC_k$. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an N-unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. ESNs are driven by an i.i.d. random input signal, u(n), that is uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n − 1), . . . , u(n − 40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47, −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding output and $MC_k$ are calculated.
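The k-delay capacity computation for an R-ESN-style network can be sketched as follows. This is a minimal illustration with several departures from the setup above, all of them assumptions: tanh PEs, an ordinary least-squares readout standing in for equation 2.4, an explicit rescaling of W so that its spectral radius is exactly 0.9, and longer training and test signals (2000 samples after the transient) to reduce estimation noise. The helper names `regressors` and `k_delay_capacities` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, max_delay, washout, T = 20, 40, 100, 2100

# R-ESN style reservoir, rescaled explicitly so the spectral radius is exactly 0.9.
W = rng.choice([0.0, 0.47, -0.47], size=(N, N), p=[0.8, 0.1, 0.1])
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([0.1, -0.1], size=N)


def regressors(u):
    """Echo states plus the direct input connection, one row per time step."""
    x = np.zeros(N)
    rows = np.empty((len(u), N + 1))
    for n, u_n in enumerate(u):
        x = np.tanh(w_in * u_n + W @ x)
        rows[n, :N] = x
        rows[n, N] = u_n
    return rows[washout:]                 # drop the initial transient


def k_delay_capacities(u_train, u_test):
    """MC_k for k = 1..max_delay: squared correlation between u(n-k) and y_k(n)."""
    X_train, X_test = regressors(u_train), regressors(u_test)
    mc = np.empty(max_delay)
    for k in range(1, max_delay + 1):
        target_train = u_train[washout - k:len(u_train) - k]
        target_test = u_test[washout - k:len(u_test) - k]
        w_out, *_ = np.linalg.lstsq(X_train, target_train, rcond=None)
        y = X_test @ w_out
        mc[k - 1] = np.corrcoef(y, target_test)[0, 1] ** 2
    return mc


u_train = rng.uniform(-0.5, 0.5, size=T)
u_test = rng.uniform(-0.5, 0.5, size=T)
mc = k_delay_capacities(u_train, u_test)
print(f"MC_1 = {mc[0]:.3f}, MC_40 = {mc[-1]:.3f}, total MC = {mc.sum():.2f}")
```

Summing `mc` over the 40 delays gives a truncated estimate of MC for one realization; averaging over many random realizations would be needed before comparing with the figures quoted in this section.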
Figure 6: The k-delay STM capacity (memory capacity versus delay) of each ESN for delays 1, . . . , 40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for different W and $W^{in}$ matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively.

Figure 6 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1, . . . , 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly generated ESN given in Jaeger (2002a), in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N = 20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9. In fact, for small k, large ASE degrades the performance because the tasks do not need long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN has only slightly better STM capacity than R-ESN, which uses just three different weight values, although U-ESN has many more distinct weight values. It is also significant to note that the MC will be very poor for an ESN with smaller spectral radius even with an adaptive bias, since the problem requires large ASE and bias can only reduce ASE. This experiment demonstrates the
suitability of maximizing ASE in tasks that require a substantial memory length.

5.2 Binary Parity Check. The effect of the adaptive bias was marginal in the previous experiment since the nature of the problem required large ASE values. However, there are tasks in which the optimal solutions require smaller ASE values and smaller spectral radius. Those are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology.

Figure 7: The number of wrong decisions made by each ESN for m = 3, . . . , 8 in the binary parity check problem (wrong decisions versus m). The results are averaged over 100 different realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and $W^{in}$ matrices with the specifications given in the text. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train an ESN to generate the m-bit parity corresponding to the last m bits received, where m is 3, . . . , 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight
matrix is calculated using equation 2.4. For the ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, . . . , 8. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. BASE-ESN performs a lot better than ASE-ESN and slightly better than the R-ESN since the adaptive bias reduces the spectral radius effectively. Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN, since the task requires access to longer input history, which compromises the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In
summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

5.3 System Identification. This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

$$y(n+1) = 0.3\,y(n) + 0.6\,y(n-1) + f(u(n)),$$

where

$$f(u) = 0.6 \sin(\pi u) + 0.3 \sin(3\pi u) + 0.1 \sin(5\pi u).$$

The input to the system is chosen to be sin(2πn/25). We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with 30 internal units and a single input unit. The W matrix of each ESN is scaled such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8, 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0. The optimal output weights are calculated using equation 2.4. The MSE values (averaged over 100 realizations) for R-ESN and ASE-ESN are $1.23 \times 10^{-5}$ and $1.83 \times 10^{-6}$, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from $1.83 \times 10^{-6}$ to $3.27 \times 10^{-9}$.

6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input.
The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with
arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows the fine-tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the design of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.

Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems “at the edge of chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004).
Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with “critical” parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton’s interpretation of edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.

Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same
time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, these results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs but also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.

There are many interesting issues to be researched in this exciting new area. Besides serving as an evaluation tool, ASE may also be utilized to train the ESN’s representation layer in an unsupervised fashion. In fact, using the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), we can easily adapt extra weights linking the outputs of recurrent states to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would continuously fine-tune the parameters of the ESN among different inputs in an unsupervised manner. However, it goes against the idea of a fixed ESN reservoir.

The reservoir of recurrent PEs can be thought of as a new form of a time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs to the peaks of the input, which is very appealing for signal processing and seems to be used in biology.
However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout. Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of solving this problem.

Finally, we would like to briefly comment on the implications of these models to neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (output of the biological system) needs to be generated, this simple computation to
read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESN. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally lowpass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESN with sigmoid PEs.

Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliveira, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received December 28, 2004; accepted June 1, 2006.

LETTER

Communicated by Ralph M. Siegel

Invariant Global Motion Recognition in the Dorsal Visual System: A Unifying Theory

Edmund T. Rolls [email protected]

Simon M. Stringer [email protected]
Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, Oxford OX1 3UD, England

The motion of an object (such as a wheel rotating) is seen as consistent independent of its position and size on the retina. Neurons in higher cortical visual areas respond to these global motion stimuli invariantly, but neurons in early cortical areas with small receptive fields cannot represent this motion, not only because of the aperture problem but also because they do not have invariant representations. In a unifying hypothesis with the design of the ventral cortical visual system, we propose that the dorsal visual system uses a hierarchical feedforward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. Simulations show that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture. The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, in-plane rotational motion, looming versus receding of the object, and object-based rotation about a principal axis. Thus, the dorsal and ventral visual systems may share some similar computational principles.

Neural Computation 19, 139–169 (2007) © 2006 Massachusetts Institute of Technology

1 Introduction

A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects such as a rotating wheel invariantly with respect to position on the retina, and size. For example, we perceive the wheel shown in Figures 1 and 4a rotating clockwise independent of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite (as indicated in the dashed box in Figure 1). How could this invariance of the visual motion perception of objects arise in the visual system?

Figure 1: A wheel rotating clockwise at different locations on the retina. How can a network learn to represent the clockwise rotation independent of the location of the moving object? The dashed box shows that local motion cues available at the beginning of the visual system are ambiguous about the direction of rotation when the stimulus is seen in different locations. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field and even though the local motion cues may be ambiguous, as shown in the dashed box.

Invariant motion representations are known to be developed in the cortical dorsal visual system. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 degrees at the fovea), and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5 degrees at the fovea) and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice, as little as 55%) move in one direction, or to the overall direction of a
moving plaid, the orthogonal grating components of which have motion at 45 degrees to the overall motion (Movshon, Adelson, Gizzi, & Newsome, 1985; Newsome, Britten, & Movshon, 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or looming with considerable translation invariance (Graziano, Andersen, & Snowden, 1994; Geesaman & Andersen, 1996).

It is known that single neurons in the ventral visual system have translation, size, and even view-invariant representations of stationary objects (Rolls & Deco, 2002; Desimone, 1991; Tanaka, 1996; Logothetis & Sheinberg, 1996; Rolls, 1992, 2000, 2006). A theory that can account for this uses a feature hierarchy network (Fukushima, 1980; Rolls, 1992; Wallis & Rolls, 1997; Riesenhuber & Poggio, 1999) combined with an associative Hebb-like learning rule (in which the synaptic weights increase in proportion to the pre- and postsynaptic firing rates) with a short-term memory of, for example, 1 sec, to enable different instances of the stimulus to be associated together as the visual objects transform continuously from second to second in the world (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Bartlett & Sejnowski, 1998; Rolls & Milward, 2000; Stringer & Rolls, 2000, 2002; Rolls & Deco, 2002). In a unifying hypothesis, we propose here that the analysis of invariant motion in the dorsal visual system uses a similar architecture and learning rule, but in contrast utilizes as its inputs neurons that respond to local motion of the type found in the primary visual cortex, V1 (Wurtz & Kandel, 2000; Duffy, 2004; Bair & Movshon, 2004). A feature of the theory is that motion in the visual field is computed only once in V1 (by processes that take into account luminance changes across short times) and that the representations of motion that develop in the dorsal visual system require no further computation of time-delay-related firing to compute motion. The theory is of interest, for it proposes that some aspects of the computations in parts of the cerebral cortex that appear to be involved in different types of visual function, the dorsal and ventral visual systems, may in fact be performed by some similar organizational and computational principles.

2 The Theory and Its Implementation in a Model

2.1 The Theory. We propose that the general architecture of the dorsal visual system areas we consider is a feedforward feature hierarchy network, the inputs to which are local motion-sensitive neurons of V1 with receptive fields of approximately 1 degree in diameter (see Figure 2). There is convergence from stage to stage, so that a neuron at any one stage need receive only a limited number of inputs from the preceding stage, yet by the end of the network, an effectively global computation that can take into account information derived from different parts of the retina can have been performed. Within each cortical layer of the architecture (or layer of the network), local lateral inhibition implemented by inhibitory feedback neurons implements competition between the neurons, in such a way that fast-firing neurons

Figure 2: Stylized image of hierarchical organization in the dorsal as well as ventral visual system. The architecture is captured by the VisNet model, in which convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina. (Layers labeled in the figure, from top to bottom: Layer 4, Layer 3, Layer 2, Layer 1.)

inhibit other neurons in the vicinity, so that the overall activity within an area is kept within bounds. The competition may be nonlinear, due in part to the threshold nonlinearity of neurons, and this competition, helped by the diluted connectivity (i.e., the fact that only a low proportion of the neurons are connected), enables some neurons to respond to particular combinations of the inputs being received from the preceding area (Rolls & Deco, 2002; Deco & Rolls, 2005). These aspects of the architecture potentially enable single neurons at higher stages of the network to respond to combinations of the local motion inputs from V1 to the first layer of the network. These combinations, helped by the increasingly larger receptive fields, could include global motion to partly randomly moving dots (and to plaids) over
areas as large as 5 degrees in MT (Wurtz & Kandel, 2000; Duffy & Wurtz, 1996). In the architecture shown in Figure 2, layer 1 might correspond to MT; layer 2 to MST, which has receptive fields of 15–65 degrees in diameter; and layers 3 and 4 to areas in the parietal cortex such as 7a and to areas in the cortex in the superior temporal sulcus, which receives from parietal visual areas where view-invariant object-based motion is represented (Hasselmo, Rolls, Baylis, & Nalwa, 1989; Sakata, Shibutani, Ito, & Tsurugai, 1986). The synaptic plasticity between the layers of neurons has a Hebbian associative component in order to enable the system to build reliable representations in which the same neurons are activated by particular stimuli on different occasions in what is effectively a hierarchical multilayer competitive network (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002). Such processes might enable neurons in layer 2 of Figure 4a to respond to, for example, a wheel rotating clockwise in one position on the retina (e.g., neuron A in layer 2). A key issue not addressed by the architecture described so far is how rotation (e.g., of a small wheel rotating clockwise) in one part of the retina activates the same neurons at the end of the network as when it is presented on a different part of the retina (see Figure 1). We propose that an associative synaptic learning rule with a short-term memory trace of neuronal activity is used between the layers to solve this problem. The idea is that if at a high level of the architecture (labeled layer 2/3 in Figure 4a) a wheel rotating clockwise is activating a neuron in one position on the retina, then the activated neurons remain active in a short delay period (of, e.g., 1 s) while the object moves to another location on the retina (e.g., the right position in Figure 4a). 
Then, with the postsynaptic neurons still active from the motion at the left position, the newly active synapses onto the layer 2/3 neuron (C) show associative modification, resulting in neuron C learning in an unsupervised way to respond to the wheel rotating clockwise in either the left or the right position on the retina. The idea is, just as for the ventral visual system (Rolls & Deco, 2002), that whatever the convergence allows to be learned at each stage of the hierarchy will be learned by this invariance algorithm, resulting in neurons higher in the hierarchy having higher- and higher-level invariance properties, including view-invariant object-based motion. More formally, the rule we propose is identical to the one proposed for the ventral visual system (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002), as follows:

$$\delta w_j = \alpha\, \bar{y}^{\tau-1} x_j^{\tau}, \tag{2.1}$$

where the trace $\bar{y}^{\tau}$ is updated according to

$$\bar{y}^{\tau} = (1 - \eta)\, y^{\tau} + \eta\, \bar{y}^{\tau-1}, \tag{2.2}$$

and we have the following definitions:

$x_j$: $j$th input to the neuron
$\bar{y}^{\tau}$: trace value of the output of the neuron at time step $\tau$
$w_j$: synaptic weight between the $j$th input and the neuron
$y$: output from the neuron
$\alpha$: learning rate; annealed between unity and zero
$\eta$: trace value; the optimal value varies with presentation sequence length

The parameter η may be set anywhere in the interval [0, 1], and for the simulations described here, η was set to 0.8, which works well with nine transforms for each object in the stimulus set (Wallis & Rolls, 1997). (A discussion of the good performance of this rule, and its relation to other versions of trace learning rules, including the point that the trace can be implemented in the presynaptic firing, is provided by Rolls & Milward, 2000, and Rolls & Stringer, 2001. We note that in the version of the rule used here (equation 2.1), the trace is calculated from the postsynaptic firing in the preceding time step ($\bar{y}^{\tau-1}$) but not the current time step, but that analogous performance is obtained if the firing in the current time step is also included (Rolls & Milward, 2000; Rolls & Stringer, 2001).)

The temporal trace in the brain could be implemented by a number of processes, as simple as continuing firing of neurons for several hundred ms after a stimulus has disappeared or moved (as shown to be present for at least inferior temporal neurons in masking experiments; Rolls & Tovee, 1994; Rolls, Tovee, Purcell, Stewart, & Azzopardi, 1994), or by the long time constant of NMDA receptors and the resulting entry of calcium to neurons. An important idea here is that the temporal properties of the biologically implemented learning mechanism are such that it is well suited to detecting the relevant continuities in the world of real motion of objects. The system uses the underlying continuity in the world to help itself learn the invariances of, for example, the motions that are typical of objects.

2.2 The Network Architecture. The model we used for the simulations was VisNet, which was developed as a model of hierarchical processing in the ventral visual system that uses a trace learning rule to develop invariant representations of stationary objects (Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Deco, 2002). The simulations performed here utilized the latest version of the VisNet model (VisNet2), with the same model parameters as used by Rolls and Milward (2000) for their investigations of the formation of invariant representations in the ventral visual system. These parameters were kept identical for all the simulations described here. The difference is that instead of using simple cell-like inputs to the model that respond to stationary oriented bars and edges (with four spatial frequencies and four orientations), in the modeling described here we used motion-related

Table 1: VisNet Dimensions.

               Dimensions       Number of Connections   Radius
Layer 4        32 × 32          100                     12
Layer 3        32 × 32          100                     9
Layer 2        32 × 32          100                     6
Layer 1        32 × 32          201                     6
Input layer    128 × 128 × 8    -                       -

Table 2: Lateral Inhibition Parameters.

Layer          1      2      3      4
Radius, σ      1.38   2.7    4.0    6.0
Contrast, δ    1.5    1.5    1.6    1.4

inputs that capture some of the relevant properties of neurons present in V1 as part of the primate magnocellular (M) system (Wurtz & Kandel, 2000; Duffy, 2004; Rolls & Deco, 2002). VisNet is a four-layer feedforward network with unsupervised competitive learning at each layer. For each layer, the forward connections to individual cells are derived from a topologically corresponding region of the preceding layer, with connection probabilities based on a gaussian distribution (see Figure 2). These distributions are defined by a radius that will contain approximately 67% of the connections from the preceding layer. Typical values are given in Table 1. Within each layer there is competition between neurons, which is graded rather than winner-take-all, and is implemented in two stages. First, to implement lateral inhibition, the firing rates of the neurons (calculated as the dot product of the vector of presynaptic firing rates and the synaptic weight vector on a neuron, followed by a linear activation function to produce a firing rate) within a layer are convolved with a spatial filter, I , where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:

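$$I_{a,b} = \begin{cases} -\delta\, e^{-\frac{a^2 + b^2}{\sigma^2}} & \text{if } a \neq 0 \text{ or } b \neq 0, \\ 1 - \sum_{a \neq 0,\, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0. \end{cases} \tag{2.3}$$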
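As an illustration of equation 2.3, the following minimal Python sketch builds the lateral inhibition filter for the layer 1 values of Table 2. The function name, the truncation of the filter to a finite square (`half_width`), and the reading of the summation as running over every off-center tap are assumptions made for this example, not details given in the text.

```python
import math


def lateral_inhibition_filter(sigma, delta, half_width):
    """Build I[a][b] of equation 2.3 on a (2 * half_width + 1)-square grid.

    half_width is a hypothetical truncation of the filter; the text does
    not state the filter extent.
    """
    size = 2 * half_width + 1
    filt = [[0.0] * size for _ in range(size)]
    off_center_sum = 0.0
    for i in range(size):
        for j in range(size):
            a, b = i - half_width, j - half_width
            if a == 0 and b == 0:
                continue  # the center tap is set after the loop
            # Off-center taps: negative (inhibitory) gaussian weights
            filt[i][j] = -delta * math.exp(-(a * a + b * b) / sigma ** 2)
            off_center_sum += filt[i][j]
    # Center tap: one minus the sum of the off-center taps (assumed to mean
    # all taps other than the center)
    filt[half_width][half_width] = 1.0 - off_center_sum
    return filt


# Layer 1 parameters from Table 2
layer1_filter = lateral_inhibition_filter(sigma=1.38, delta=1.5, half_width=4)
total = sum(sum(row) for row in layer1_filter)
```

Under this reading the taps sum to one, so convolving a uniform sheet of firing rates with the filter leaves it unchanged, while a locally strong response suppresses its neighbors.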

Typical lateral inhibition parameters are given in Table 2. Next, contrast enhancement is applied by means of a sigmoid function

$$y = f^{\text{sigmoid}}(r) = \frac{1}{1 + e^{-2\beta(r - \alpha)}}, \tag{2.4}$$

Table 3: Sigmoid Parameters.

Layer         1      2     3     4
Percentile    99.2   98    88    91
Slope β       190    40    75    26

where r is the firing rate after lateral inhibition, y is the firing rate after contrast enhancement, and α and β are the sigmoid threshold and slope, respectively. The parameters α and β are constant within each layer, although α is adjusted to control the sparseness of the firing rates. For example, to set the sparseness to, say, 5%, the threshold is set to the value of the 95th percentile point of the firing rates r within the layer. Typical parameters for the sigmoid function are shown in Table 3.

The trace learning rule (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002) is that shown in equation 2.1 and encourages neurons to develop invariant responses to input patterns that tend to occur close together in time, because these are likely to be from the same moving object.

2.3 The Motion Inputs to the Network. The images presented to the network represent local motion signals with small receptive fields. These local visual motion (or local optic flow) input signals are similar to those of neurons in V1 in that they have small receptive fields and cannot detect global motion because of the aperture problem (Wurtz & Kandel, 2000). At each pixel coordinate in the 128 × 128 image, a direction of local motion/optic flow is defined. The global optic flow patterns used in the different experiments occupied part of this 128 × 128 image, as described for each experiment below. At each coordinate, there are eight cells, where the optimal response is defined by flows 45 degrees apart. That is, the cells are tuned to local optic flow directions of 0, 45, 90, . . ., 315 degrees. The firing rate of each cell is set equal to a gaussian function of the difference between the cell's preferred direction and the actual direction of local optic flow. The standard deviation of this gaussian was 20 degrees.
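The direction tuning just described can be sketched in a few lines of Python. The sketch assumes a peak firing rate of 1, an unnormalized gaussian, and a wrap-around (circular) difference between directions; none of these details is specified in the text, and the function names are hypothetical.

```python
import math

PREFERRED_DIRECTIONS = [0, 45, 90, 135, 180, 225, 270, 315]  # degrees
TUNING_SD = 20.0  # standard deviation of the gaussian, in degrees


def angular_difference(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def motion_cell_rates(flow_direction):
    """Firing rates of the eight cells at one pixel for a given flow direction.

    Assumes a peak rate of 1.0 when the flow matches the preferred direction.
    """
    return [
        math.exp(-angular_difference(pref, flow_direction) ** 2
                 / (2.0 * TUNING_SD ** 2))
        for pref in PREFERRED_DIRECTIONS
    ]


# Local flow at 90 degrees: the 90-degree cell fires maximally and its
# two neighbors (45 and 135 degrees) fire weakly and equally.
rates = motion_cell_rates(90.0)
```

With a 20 degree standard deviation and 45 degree spacing, neighboring cells respond only weakly to a given flow direction, so each pixel carries a fairly sharp direction code.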
The number of inputs from the arrays of motion-sensitive cells to each cell in the first layer of the network is 201, selected probabilistically as a gaussian function of distance as described above and in more detail elsewhere (Rolls & Milward, 2000). The local motion signals are given to the network, and not computed in the simulations, because the aim of the simulations is to test the theory that, given local motion inputs of the kind known to be present in early cortical processing (Wurtz & Kandel, 2000), the trace learning mechanism described can, in a hierarchical network, account for a range of the types of global motion neuron that are found in the dorsal stream visual cortical areas.
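To make the trace learning mechanism of equations 2.1 and 2.2 concrete, here is a minimal Python sketch for a single neuron with two input lines. The two-step stimulus sequence and the output values are hypothetical, and the competition and any weight normalization used in the full model are deliberately left out; η = 0.8 follows the value quoted for the simulations, and α = 0.09 is used only as an example value.

```python
def update_trace(y, previous_trace, eta=0.8):
    """Equation 2.2: blend the current output with the previous trace."""
    return (1.0 - eta) * y + eta * previous_trace


def update_weights(weights, inputs, previous_trace, alpha):
    """Equation 2.1: change each weight by alpha * trace(tau - 1) * x_j(tau)."""
    return [w + alpha * previous_trace * x for w, x in zip(weights, inputs)]


# Hypothetical sequence: the neuron fires to one transform of an object,
# then a second transform of the same object arrives on another input line.
sequence = [
    ([1.0, 0.0], 1.0),  # (inputs at step tau, output y at step tau)
    ([0.0, 1.0], 0.0),
]

weights = [0.0, 0.0]
trace = 0.0
for inputs, y in sequence:
    # The weight change uses the trace from the preceding time step
    weights = update_weights(weights, inputs, trace, alpha=0.09)
    trace = update_trace(y, trace)
```

In this toy run the second input line is strengthened even though the neuron is silent at the second step, because the trace left over from the first step is still nonzero. That is the sense in which inputs occurring close together in time become associated onto the same neuron.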

2.4 Training and Test Procedure. To train the network, each stimulus is presented to VisNet in a randomized sequence of locations or orientations with respect to VisNet's input retina. The different locations were spaced 32 pixels apart on the 128 × 128 retina. At each stimulus presentation, the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training locations or orientations, a new stimulus is chosen at random and the process repeated. The presentation of all the stimuli through all locations or orientations constitutes one epoch of training. In this manner, the network is trained one layer at a time starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1 to 4 were 50, 100, 100, and 75, respectively, as these have been shown in previous work to provide good performance (Wallis & Rolls, 1997; Rolls & Milward, 2000). The learning rates α in equation 2.1 for layers 1 to 4 were 0.09, 0.067, 0.05, and 0.04.

Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that are able to respond with view invariance to individual stimuli or objects (see Rolls & Milward, 2000). A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown independent of view. The measure was the stimulus-specific information or surprise, I(s, R), which is the amount of information the set of responses, R, has about a specific stimulus, s. (The mutual information between the whole set of stimuli S and of responses R is the average across stimuli of this stimulus-specific information.) (Note that r is an individual response from the set of responses R.)

$$I(s, R) = \sum_{r \in R} P(r|s) \log_2 \frac{P(r|s)}{P(r)} \tag{2.5}$$
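Equation 2.5 can be evaluated directly once the responses have been binned. The Python sketch below assumes the conditional probabilities P(r|s) are already available as a table over response bins, and it obtains P(r) by marginalizing over the stimuli; the function name and the four-stimulus example are hypothetical.

```python
import math


def stimulus_specific_information(p_r_given_s, p_s):
    """Return I(s, R) in bits for every stimulus s (equation 2.5).

    p_r_given_s[s][r] is P(r|s) over response bins r; p_s[s] is P(s).
    """
    n_bins = len(p_r_given_s[0])
    # Marginal response probabilities: P(r) = sum over s of P(s) * P(r|s)
    p_r = [
        sum(p_s[s] * p_r_given_s[s][r] for s in range(len(p_s)))
        for r in range(n_bins)
    ]
    info = []
    for row in p_r_given_s:
        # Bins with P(r|s) = 0 contribute nothing to the sum
        info.append(sum(p * math.log2(p / p_r[r])
                        for r, p in enumerate(row) if p > 0.0))
    return info


# Hypothetical cell with two response bins (low, high) and four equiprobable
# stimuli: it always responds "high" to stimulus 0 and "low" to the others.
p_r_given_s = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
p_s = [0.25, 0.25, 0.25, 0.25]
info = stimulus_specific_information(p_r_given_s, p_s)
best = max(info)
```

For this idealized cell the information about stimulus 0 is log2(4) = 2 bits, the most that can be conveyed about one of four equiprobable stimuli, while each of the other stimuli receives only log2(4/3) bits.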

The calculation procedure was identical to that described by Rolls, Treves, Tovee, and Panzeri (1997) with the following exceptions. First, no correction was made for the limited number of trials because, in VisNet, each measurement of a response is exact, with no variation due to sampling on different trials. Second, the binning procedure was to use equispaced rather than equipopulated bins. This small modification was useful because the data provided by VisNet can produce perfectly discriminating responses with little trial-to-trial variability. Because the cells in VisNet can have bimodally distributed responses, equipopulated bins could fail to separate the two modes perfectly. (This is because one of the equipopulated bins might contain responses from both of the modes.) The number of bins used was equal to or less than the number of trials per stimulus, that is, for VisNet the number of positions on the retina (Rolls et al., 1997).

Because VisNet operates as a form of competitive net to perform categorization of the inputs received, good performance of a neuron will be characterized by large responses to one or a few stimuli regardless of their position on the retina (or other transform) and small responses to the other stimuli. We are thus interested in the maximum amount of information that a neuron provides about any of the stimuli rather than the average amount of information it conveys about the whole set S of stimuli (known as the mutual information). Thus, for each cell, the performance measure was the maximum amount of information a cell conveyed about any one stimulus (with a check, in practice always satisfied, that the cell had a large response to that stimulus, as a large response is what a correctly operating competitive net should produce to an identified category). In many of the graphs in this article, the amount of information each of the 50 most informative cells had about any stimulus is shown.

A multiple cell information measure, the average amount of information that is obtained about which stimulus was shown from a single presentation of a stimulus from the responses of all the cells, enabled measurement of whether, across a population of cells, information about every object in the set was provided. Procedures for calculating the multiple cell information measure are given by Rolls, Treves, and Tovee (1997) and Rolls and Milward (2000). The multiple cell information measure is the mutual information I(S, R), that is, the average amount of information that is obtained from a single presentation of a stimulus about the set of stimuli S from the responses of all the cells. For multiple cell analysis, the set of responses, R, consists of response vectors comprising the responses from each cell. Ideally, we would like to calculate

$$I(S, R) = \sum_{s \in S} P(s)\, I(s, R). \tag{2.6}$$
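As a worked check of equations 2.5 and 2.6, consider an idealized case, assumed here purely for illustration: two equiprobable stimuli, $s_1$ and $s_2$, and a cell whose response falls in bin $r_1$ for every transform of $s_1$ and in a different bin $r_2$ for every transform of $s_2$. Then $P(r_1 \mid s_1) = 1$ and $P(r_1) = 1/2$, so

$$I(s_1, R) = P(r_1 \mid s_1) \log_2 \frac{P(r_1 \mid s_1)}{P(r_1)} = \log_2 \frac{1}{1/2} = 1 \text{ bit},$$

and by symmetry $I(s_2, R) = 1$ bit. Averaging over stimuli gives

$$I(S, R) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 1 = 1 \text{ bit},$$

which equals the ceiling of $\log_2 2 = 1$ bit for a two-stimulus set.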

However, the information cannot be measured directly from the probability table P(r, s) embodying the relationship between a stimulus s and the response rate vector r provided by the firing of the set of neurons to a presentation of that stimulus. (Note that “stimulus” refers to an individual object that can occur with different transforms, e.g., translation or size; see Wallis & Rolls, 1997.) This is because the dimensionality of the response vectors is too large to be adequately sampled by trials. Therefore, a decoding procedure is used, in which the stimulus s' that gave rise to the particular firing-rate response vector on each trial is estimated. This involves, for example, maximum likelihood estimation or dot product decoding. For example, given a response vector r to a single presentation of a stimulus, its similarity to the vector of average responses of the neurons to each stimulus is used, through a dot product comparison, to estimate which stimulus was shown. The probabilities of it being each of the stimuli can be estimated in this way. Details are provided by Rolls et al. (1997). A probability table is then constructed of the real stimuli s and the decoded stimuli s'. From this probability table, the mutual information is calculated as

$$I(S, S') = \sum_{s, s'} P(s, s') \log_2 \frac{P(s, s')}{P(s) P(s')}. \tag{2.7}$$
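The decoding step and equation 2.7 can be sketched as follows. This is a simplified illustration rather than the procedure of Rolls et al. (1997): it uses a hard (winner-take-all) dot product decision instead of estimated probabilities, it computes the mean response vectors from the same trials that are decoded (no cross-validation), and the function names `dot_product_decode`, `mutual_information`, and `decoded_information` are hypothetical.

```python
import numpy as np

def dot_product_decode(mean_vectors, response):
    """Index of the stimulus whose mean response vector best matches `response`.

    Similarity is the normalized dot product (cosine of the angle) between the
    single-trial response vector and each stimulus's mean response vector.
    """
    m = np.asarray(mean_vectors, dtype=float)
    r = np.asarray(response, dtype=float)
    norms = np.linalg.norm(m, axis=1) * np.linalg.norm(r)
    scores = (m @ r) / np.where(norms > 0, norms, 1.0)
    return int(np.argmax(scores))

def mutual_information(joint):
    """I(S, S') in bits from a table of joint counts or probabilities P(s, s')."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_s = joint.sum(axis=1, keepdims=True)        # P(s), column vector
    p_s_dec = joint.sum(axis=0, keepdims=True)    # P(s'), row vector
    product = p_s @ p_s_dec                       # P(s) P(s') for every cell
    nz = joint > 0                                # 0 log 0 = 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / product[nz])))

def decoded_information(responses, labels, n_stimuli):
    """Decode every trial, build the (real, decoded) table, return I(S, S')."""
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    means = np.array([responses[labels == s].mean(axis=0) for s in range(n_stimuli)])
    counts = np.zeros((n_stimuli, n_stimuli))
    for r, s in zip(responses, labels):
        counts[s, dot_product_decode(means, r)] += 1
    return mutual_information(counts)
```

Here `responses` would hold one row per stimulus presentation and one column per cell selected for the multiple cell analysis.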

The multiple cell information was calculated using the five cells for each stimulus with high information values for that stimulus. Thus, in this letter, 10 cells were used in the multiple cell information analysis.

3 Simulation Results

We now describe simulations with the neural network model described in section 2 that enabled us to test this theory.

3.1 Experiment 1: Global Planar Motion. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 deg at the fovea) and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). As described in section 1, neurons in MT have larger receptive fields and are able to respond to planar global motion (Movshon et al., 1985; Newsome et al., 1989). Here we show that the hierarchical feature network we propose can solve this global planar motion problem and, moreover, that the performance is improved by using a trace rather than a purely associative synaptic modification rule. Invariance is addressed in later simulations. The network was trained on two 100 × 100 stimuli representing noisy left and right global planar motion (see Figure 3a). During the training, cells developed that responded to either left or right global motion but not to both (see Figure 3), with 1 bit of information representing perfect discrimination of left from right. The untrained network with initial random synaptic weights tested as a control showed much poorer performance, as shown in Figure 3. It might be expected that some global planar motion sensitivity would be developed by a purely Hebbian learning rule, and indeed this has been demonstrated (under somewhat different training conditions) by Sereno (1989) and Sereno and Sereno (1991).
This occurs because on any single trial with one average direction of global motion, neurons at intermediate layers will tend to receive on average inputs that reflect the current average global planar motion and will thus learn to respond optimally to the current inputs that represent that motion direction. We showed that the trace learning rule used here performed better than a Hebb rule (which produced only neurons with 0.0 bits given that the motion stimulus patches presented in our simulations were in nonoverlapping locations, as

[Figure 3 panels. (a) Stimulus 1: global planar motion left; stimulus 2: global planar motion right. (b) VisNet: 2s 9l: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s 9l: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions.]

Figure 3: Experiment 1. (a) The two motion stimuli used in experiment 1 were noisy global planar motion left (left) and noisy global planar motion right (right), present throughout the 128 × 128 retina. Each arrow in this and subsequent figures represents the local direction of optic flow. The size of the optic flow pattern was 100 × 100 pixels, not the 2 × 4 shown in the diagram. The noise was introduced into each image stimulus by inverting the direction of optic flow at a random set of 45% of the image nodes. This meant that it would not be possible to determine the directional bias of the flow field by examining the optic flow over local regions of the retina. Instead, the overall directional bias could be determined only by analyzing the whole image. (b) When trained with the trace rule, equation 2.1, some single cells in layer 4 conveyed 1 bit of information about whether the global motion was left or right, and this is perfect performance. (The single cell information is shown for the 50 most selective cells.) (c) The multiple cell information measures, used to show that different neurons are tuned to different stimuli (see section 2.4), indicate that over a set of neurons, information about the whole stimulus set was present. (The information values for one cell are the average of 10 cells selected from the 50 most selective cells, and hence the value is not exactly 1 bit.)
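The noise procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration: the function name `noisy_planar_flow`, the encoding of flow direction as +1 (right) and -1 (left), and the use of an exact 45% count rather than an independent coin flip at each node are assumptions made here, not details given in the text.

```python
import numpy as np

def noisy_planar_flow(direction, size=100, flip_fraction=0.45, rng=None):
    """Return a (size, size) array of local flow directions: +1 right, -1 left.

    Every node starts with the global direction; the direction is then inverted
    at a random set of `flip_fraction` of the nodes.
    """
    rng = np.random.default_rng() if rng is None else rng
    sign = {"left": -1, "right": 1}[direction]
    flow = np.full(size * size, sign, dtype=int)
    n_flip = int(round(flip_fraction * size * size))
    flipped = rng.choice(size * size, size=n_flip, replace=False)
    flow[flipped] = -sign
    return flow.reshape(size, size)
```

With 45% of the nodes inverted, the mean over the whole field still carries the sign of the global direction, whereas a small local patch can easily contain a majority of inverted nodes, which is the point made in the caption.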

illustrated in Figure 1). A further reason for the better performance of the trace rule is that on successive trials, the average global motion identifiable by a single intermediate-layer neuron from the probabilistic inputs will be a better estimate (a temporal average) of the true global motion, and this will be utilized in the learning. These results show that the network architecture is able to develop global motion representations of the noisy local motion patterns. Indeed, it is emphasized that neurons in the input to VisNet had only local but not global motion information, as shown by the fact that the average amount of information the 50 most selective input cells had about the global motion was 0.0 bits.

3.2 Experiment 2: Rotating Wheel. Neurons in MST, but not MT, are responsive to rotation with considerable translation invariance (Graziano et al., 1994). The aim of this simulation was to determine whether layer 4 cells in our network develop position-invariant representations of wheels rotating clockwise (as shown in Figure 4a) versus anticlockwise. The stimuli consist only of optic flow fields around the rim of a geometric circle with radius 16 unless otherwise stated. The local motion inputs from the wheel in the two positions shown are ambiguous where the wheels are close to each other in Figure 4a. The network was expected to solve the problem as illustrated in Figure 4a. The results in Figures 4b to 4d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, clockwise but not anticlockwise rotation, and did this for each of the nine training positions. Figure 4e shows perfect size invariance for some layer 4 cells when the network was trained with the trace rule with three different radii of the wheels: 10, 16, and 22. These results show that the network architecture is able to develop location- and size-invariant representations of the global, rotating wheel, motion patterns even though the neurons in the input layer receive information from only a small local region of the retina. We note that the position-invariant global motion results shown in Figure 4 were not due to chance mappings of the two stimuli through the network and were a result of the training, in that the position-invariant information about whether the global motion was clockwise or anticlockwise was 0.0 bits for both the single and the multiple cell information in the untrained (“random”) network. Corresponding differences between the trained and the untrained networks were found in all the other experiments described in this article.

3.3 Experiment 3: Looming. Neurons in macaque dorsal stream visual area MSTd respond to looming stimuli with considerable translation

invariance (Graziano et al., 1994; Geesaman & Andersen, 1996). We tested whether the network could learn to respond to small patches of looming versus contracting motion typically generated by objects as they are seen successively on different locations on the retina. The network was trained on two circular flow patterns representing looming toward and looming away, as shown in Figure 5a. The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle and with radius 16 unless otherwise stated. The results shown in Figures 5b to 5d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, looming toward but not movement away, and did this for each of the nine training positions. Simulations were run for various optic flow field diameters to test the robustness of the results, and in all cases tested (which included radii of

Figure 4: Experiment 2. (a) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can be diagnosed only by a global flow computation, and it is shown how the network is expected to solve the problem to produce position-invariant global-motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field. (b) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise versus anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (c) The multiple cell information measures show that small groups of neurons have perfect performance. (d) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the nine positions. (e) Size invariance illustrated for a single cell from layer 4, which after training with three different radii of rotating wheel, responded only to anticlockwise rotation, independent of the size of the rotating wheels. (For the position-invariant simulations, the wheel rims overlapped, but are shown slightly separated in Figure 1 for clarity.) The training grid spacing was 32 pixels, and the radii of the wheels were 16 pixels. This ensured the rims of the wheels in adjacent training grid locations overlapped. One wheel was shown on any one trial. On successive trials, the wheel rotating clockwise was shown in each of the nine locations, allowing the trace learning rule to build location-invariant representations of the wheel rotating in one direction. In the next set of training trials, the wheel was shown rotating in the opposite direction in each of the nine locations. For the size-invariant simulations, the network was trained and tested with the set of clockwise versus anticlockwise rotating wheels presented in three different sizes.

[Figure 4 panels. (a) Schematic of the hierarchy: input layer = V1, V2 (local motion); layer 1 = MT (global planar motion, intermediate receptive field size); layer 2 = MT (rotational motion, large receptive field size); layer 3 = MSTd or higher (rotational motion with invariance, larger receptive field size). (b) VisNet: 2s 9l: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s 9l: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions. (d, e) Visnet: Cell (24,13) Layer 4 and Visnet: Cell (7,11) Layer 4; firing rate versus location index (d) and versus radius (e), for the 'clock' and 'anticlock' stimuli.]

10 and 20 as well as intermediate values), cells developed a transform (location) invariant representation in the output layer. These results show that the network architecture is able to develop invariant representations of the global looming motion patterns, even though the neurons in the input layer receive information from only a small local region of the retina.

3.4 Experiment 4: Rotating Cylinder. Some neurons in the macaque cortex in the anterior part of the superior temporal sulcus (which receives inputs from both the dorsal and ventral visual streams; Ungerleider & Mishkin, 1982; Seltzer & Pandya, 1978; Rolls & Deco, 2002) respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Some neurons in the parietal cortex may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). In experiment 4, we tested whether the network could self-organize to form neurons that represent global motion in an object-based coordinate frame. The network was trained on two stimuli, with four transforms of each. Figure 6a shows stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own vertical axis. The stimuli were presented in a single location, but to solve the problem,

Figure 5: Experiment 3. (a) The two motion stimuli were flow fields looming toward (left) and looming away (right). The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle. Local motion cells near, for example, the intersection of the two stimuli cannot distinguish between the two global motion patterns. Location-invariant representations (for nine different locations) of stimuli looming toward or moving away from the observer were formed if the network was trained with the trace rule but not if it was untrained, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). (d) Position invariance illustrated for a single cell from layer 4, which responded only to moving away, and for every one of the nine positions. (The network was trained and tested with the stimuli presented in a 3 × 3 grid of nine retinal locations, as in experiment 1. The training grid spacing was 32 pixels, and the radii of the circular looming stimuli were 16 pixels. This ensured that the edges of the looming stimuli in adjacent training grid locations overlapped, as shown in the dashed box of Figure 5a.)

[Figure 5 panels. (a) Stimulus 1: looming towards; stimulus 2: moving away. (b) VisNet: 2s 9l: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s 9l: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions. (d) Visnet: Cell (9,7) Layer 4; firing rate versus location index, for the 'towards' and 'away' stimuli.]

the network must form some neurons that respond to the clockwise rotation of the shaded cylinder independent of the four transforms of each, which were upright (0 degrees), 90 degrees, inverted (180 degrees), and 270 degrees. Other neurons should self-organize to respond to view-invariant counterclockwise rotation. For this experiment, additional information about surface luminance must be fed into the first layer of the network in order for the network to be able to distinguish between the clockwise and anticlockwise rotating cylinders. Additional retinal inputs to the first layer of the network came from a 128 × 128 array of luminance-sensitive cells. The cells within the luminance array are maximally activated for the shaded region of the cylinder image. Elsewhere the luminance inputs are zero. The number of inputs from the array of luminance-sensitive cells to each cell in the first layer of the network was 50. The results shown in Figures 6b and 6c show perfect performance for many single cells, and across multiple cells, in representing the direction of rotation of the shaded cylinder about its own axis regardless of which of the four transforms was shown, when trained with the trace rule but not when untrained. Simulations were run for various sizes of the cylinders, including height = 40 and diameter = 20. For all simulations, cells developed a transform (e.g., upright, inverted) invariant representation in the output layer. That is, some cells responded to one of the stimuli in all of its four transformations (i.e., orientations) but not to the other stimulus. These results show that the network architecture is able to develop object-centered view-invariant representations of the global motion patterns representing the two rotating cylinders, even though the neurons in the input layer receive information from only a small, local region of the retina.

Figure 6: Experiment 4. (a) Stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own axis. Invariant representations were formed, with some cells coding for the object rotating clockwise about its own axis and other cells coding for the object rotating anticlockwise, invariantly with respect to which of the four transforms (0 degrees = upright, 90 degrees, 180 degrees = inverted, and 270 degrees) was viewed, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). Because only eight images in one location form the training set, some single cells by chance with the random untrained connectivity had some information about which stimulus was shown, but cells performed the correct mapping only if the network was trained with the trace rule.

[Figure 6 panels. (a) Upright and inverted transforms of stimulus 1 (cylinder rotating clockwise when viewed from shaded end) and stimulus 2 (cylinder rotating anticlockwise when viewed from shaded end). (b) VisNet: 2s 4t: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s 4t: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions.]

3.5 Experiment 5: Optic Flow Analysis of Real Images: Translation Invariance. In experiments 5 and 6, we extend this research by testing the operation of the model when the optic flow inputs to the network are extracted by a motion analysis algorithm operating on the successive images generated by moving objects. The optic flow fields generated by a moving object were calculated as described next and were used to set the firing of the motion-selective cells, the properties of which are described in section 2.3. These optic flow algorithms use an image gradient-based method, which exploits the relationship between the spatial and temporal gradients of intensity, to compute the local optic flow throughout the image. The image flow constraint equation $I_x U + I_y V + I_t = 0$ is approximated at each pixel location by algebraic finite difference approximations in space and time (Horn & Schunck, 1981). Systems of these finite difference equations are then solved for the local image velocity (U, V) within each 4 × 4 pixel block within the image. The images of the rotating objects were generated using OpenGL. In experiment 5, we investigated the learning of translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated in Figure 7a. The network was trained with the two optic flow patterns generated in nine different locations, as in experiments 2 and 3. The flow fields used to train the network were generated by the object rotating through one degree of angle. The single cell information measures (see Figure 7b) and multiple cell information measures (see Figure 7c) (using the same conventions as in Figure 3) show that the maximal information, one bit, was reached by single cells and with the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
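The gradient-based flow computation described in section 3.5 can be sketched as follows. The text does not specify the finite difference scheme or how the per-block system is solved, so the choices here are assumptions: central differences via `np.gradient` on the average of the two frames, a simple frame difference for the temporal gradient, and a least-squares solution of the 16 constraint equations in each 4 × 4 block. The function name `block_optic_flow` is hypothetical.

```python
import numpy as np

def block_optic_flow(frame0, frame1, block=4):
    """Estimate one velocity (U, V) per block from two successive frames.

    Solves I_x U + I_y V + I_t = 0 in the least-squares sense over all pixels
    of each block. U is motion along columns (x), V along rows (y).
    """
    f0 = np.asarray(frame0, dtype=float)
    f1 = np.asarray(frame1, dtype=float)

    # Spatial gradients of the time-averaged image; temporal gradient as a difference.
    Iy, Ix = np.gradient(0.5 * (f0 + f1))  # axis 0 = rows (y), axis 1 = columns (x)
    It = f1 - f0

    n_rows, n_cols = f0.shape[0] // block, f0.shape[1] // block
    U = np.zeros((n_rows, n_cols))
    V = np.zeros((n_rows, n_cols))
    for bi in range(n_rows):
        for bj in range(n_cols):
            sl = (slice(bi * block, (bi + 1) * block),
                  slice(bj * block, (bj + 1) * block))
            # One constraint equation per pixel: [I_x, I_y] . [U, V] = -I_t
            A = np.column_stack([Ix[sl].ravel(), Iy[sl].ravel()])
            b = -It[sl].ravel()
            (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
            U[bi, bj], V[bi, bj] = u, v
    return U, V
```

In a block with no intensity gradient the least-squares solution is zero velocity, and in a block whose gradients all point the same way only the component of motion along the gradient is recovered, which is the aperture problem mentioned in section 3.1.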
This experiment shows that the model can operate well and learn translation-invariant representations with motion flow fields actually extracted from the successive images produced by a rotating object.

3.6 Experiment 6: Optic Flow Analysis of Real Images: Rotation Invariance. In experiment 6 we investigated the learning of rotation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8a. (The algorithm for generating the optic flow field is described in section 3.5.) The radius of the spoked wheel was 50 pixels on the 128 × 128 background. The rotation was in-plane, and the optic flow fields used as an input to the network were extracted from the changing images, each separated by one degree of the object as it rotated through 360 degrees. The single cell information measures (see Figure 8b) and multiple cell information measures (see Figure 8c) (using the same conventions as in Figure 3) show that the maximal information, one bit, was almost reached by single cells and by the multiple cell information measure. The dashed

[Figure 7 panels. (a) Optic flow fields produced by a tetrahedron rotating clockwise or anticlockwise. (b) VisNet: 2s 9l: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s 9l: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions.]

Figure 7: Experiment 5. Translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated. The optic flow field used as an input to the network was extracted from the changing images of the object as it rotated. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and in the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.

line shows the control condition of a network with random untrained connectivity. This experiment shows that the model can operate well and learn rotation-invariant representations with motion flow fields actually extracted from a very large number of the successive images produced by a rotating object. Because of the large number of closely spaced training images used in this simulation, it is likely that the crucial type of learning was continuous transformation learning (Stringer, Perry, Rolls, & Proske, 2006). Consistent with this, the learning rate was set to the lower value of $7.2 \times 10^{-5}$ for all layers for experiment 6 (cf. Stringer et al., 2006).

[Figure 8 panels. (a) Optic flow fields produced by a spoked wheel rotating clockwise or anticlockwise. (b) VisNet: 2s: Single cell analysis; information (bits) versus cell rank, for the trace and random conditions. (c) VisNet: 2s: Multiple cell analysis; information (bits) versus number of cells, for the trace and random conditions.]

Figure 8: Experiment 6. In-plane rotation-invariant representations of the optic flow vector fields generated by a spoked wheel rotating clockwise or anticlockwise. The optic flow field used as an input to the network was extracted from the changing images of the object as it rotated through 360 degrees, each separated by 1 degree. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and in the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.

3.7 Experiment 7: Generalization to Untrained Images. To investigate whether the representations of object-based motion such as circular rotation learned with the approach introduced in this article would generalize usefully to the flow fields generated by other objects moving in the same way, we trained the network on the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8. The training images rotated through 90 degrees in 1 degree steps. We then tested generalization to the new, untrained image shown in Figure 9a. The single and multiple cell information plots in Figure 9b show that information was available about the direction of

[Figure 9 panels. (a) Generalisation: training with rotating spoked wheel, followed by testing with a regular grid rotating clockwise or anticlockwise. (b) VisNet: 2s: Single cell analysis and VisNet: 2s: Multiple cell analysis; information (bits) for the trace and random conditions. (c, d) Responses of a typical cell to the spoked wheel (c) and grid (d), after training with the spoked wheel alone: Visnet: Cell (17,17) Layer 4; firing rate versus orientation (deg), from 0 to 90, for clockwise and anticlockwise rotation.]

Figure 9: Experiment 7. Generalization to untrained images. The network was trained on the optic flow vector fields generated by the spoked wheel stimulus illustrated in Figure 8 rotating clockwise or anticlockwise. (a) Generalization to the new untrained image shown at the top right of the Figure was then tested. (b) The single and multiple cell information plots show that information was available about the direction of rotation (clockwise versus anticlockwise) of the untrained test images. (c) The firing rate of a fourth layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. (d) The firing rate of the same fourth layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a.

rotation (clockwise versus anticlockwise) of the untrained test images. Although the information was not as high as 1 bit, which would have indicated perfect generalization, individual cells did generalize usefully to the new images, as shown in Figures 9c and 9d. For example, Figure 9c shows the firing rate of a fourth layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. Figure 9d shows the firing rate of the same fourth-layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a. The neuron responded correctly to almost all the anticlockwise rotation shifts, and correctly to many of the clockwise rotation shifts, though some noise was evident in the responses of the neuron to the untrained images. Overall, the results demonstrate useful generalization after training with one object to testing with an untrained, different, object on the ability to represent rotation.

4 Discussion

We have presented a hierarchical feature analysis theory of the operation of parts of the dorsal visual system, which provides a computational account for how transform-invariant representations of the flow fields generated by moving objects could be formed in the cerebral cortex. The theory uses a modified Hebb rule with a short-term temporal trace of preceding activity to enable whatever is invariant at any stage of the dorsal motion system across short time intervals to be associated together. The theory can account for many types of invariance and has been tested by simulation for position and size invariance. The simulations show that the network can develop global planar representations from noisy local motion inputs (experiment 1), invariant representations of rotating optic flow fields (experiment 2), invariant representations of looming optic flow fields (experiment 3), and invariant representations of asymmetrical objects rotating about one of their axes (experiment 4). These are fundamental problems in motion analysis, and they have all been studied neurophysiologically, including local versus planar motion (Movshon et al., 1985; Newsome et al., 1989); position-invariant representation of rotating flow fields and looming (Lagae, Maes, Raiguel, Xiao, & Orban, 1994); and object-based rotation (Hasselmo et al., 1989; Sakata et al., 1986). The model thus shows principles by which the different types of motion-related invariant neuronal responses in the dorsal cortical visual system could be produced. The theory is unifying in the sense that the same theory, but with different inputs, can account for invariant representations of objects in the ventral visual system (Rolls, 1992; Wallis & Rolls, 1997; Elliffe, Rolls, & Stringer, 2002; Rolls & Deco, 2002).
It is a strength of the unifying concept introduced in this article that the same hierarchical network that can perform computations of the type important in the ventral visual system can also perform computations of a type important in the dorsal visual system.


Our simulations support the hypothesis that the different response properties of MT and MST neurons from V1 neurons are determined in part by the sizes of their receptive fields, with a larger receptive field needed to analyze some global motion patterns. Similar conclusions were drawn from simulation experiments performed by Sereno (1989) and Sereno and Sereno (1991). This type of self-organization can occur with a Hebbian associative learning rule operating on the feedforward connections to a competitive network. However, experiment 1 showed that even for the computation of planar global motion in intermediate layers such as MT, a trace-based associative learning rule is better than a purely associative Hebbian rule with noisy (probabilistic) local motion inputs, because the trace rule allows temporal averaging to contribute to the learning. In experiments 2 and 3, the trace rule is crucial to the success of the learning, in that the stimuli when presented in different training locations did not overlap, so that the only process by which the different transforms can be linked is by the temporal trace learning rule implemented in the model (Rolls & Milward, 2000; Rolls & Stringer, 2001). (We note that in a new development, it has been shown that if different transforms of the training stimuli do overlap continuously in space, then this overlap can provide a useful learning principle for invariant representations to be formed and requires only associative synaptic modification; Stringer et al., 2006. It would be of interest to extend this concept, which has been applied to the ventral visual system, to the dorsal visual system.) One type of perceptual analysis that can be understood with the theory and simulations described here is how neurons can self-organize to respond to the motion inputs produced by small objects when they are seen on different parts of the retina. 
This is achieved by using memory-trace-based synaptic modification in the type of architecture illustrated in Figure 4a. The crucial stage for this learning is the top layer in Figure 4a labeled Layer 2/3. The forward connections to the neurons in this layer can form the required representation if they use a trace or similar learning rule, and the object motion occurs with some temporospatial continuity. (Temporospatial continuity has been shown to be important in human face invariance learning [Wallis & Bulthoff, 2001], and spatial continuity over continuous transforms may be a useful learning principle [Stringer et al., 2006].) This aspect of the architecture is what is formally similar to the architecture of the ventral visual system, which can learn invariant representations of stationary objects. The only difference required of the networks is that the ventral visual stream network should receive inputs from neurons that respond to stationary features such as lines or edges and that the dorsal visual stream network should receive inputs from neurons that respond to local motion cues. It is this concept that allows us to propose that there is a unifying hypothesis that applies to some of the computations performed by both the ventral and the dorsal visual streams.
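The role of the temporal trace when transforms do not overlap can be illustrated with a toy sketch. It assumes the trace rule in a form common in this line of work, $\bar{y}^{\tau} = (1-\eta)\,y^{\tau} + \eta\,\bar{y}^{\tau-1}$ with weight change $\delta w_j = \alpha\,\bar{y}^{\tau} x_j^{\tau}$; the exact variant used in the simulations is specified in the model description, not here. The four-element inputs, the linear neuron, the trace reset at the start of each sequence, the weight normalization, and all parameter values are illustrative assumptions and do not reproduce the simulated network.

```python
import numpy as np


def train(eta, n_epochs=20, alpha=0.1):
    """Train one linear neuron on two non-overlapping transforms of a stimulus.

    eta = 0 gives a purely associative (Hebbian) rule; eta > 0 adds a trace.
    Returns the neuron's final responses to the two transforms.
    """
    # Hypothetical toy inputs: the same stimulus at two locations, no shared inputs.
    x_a = np.array([1.0, 1.0, 0.0, 0.0])
    x_b = np.array([0.0, 0.0, 1.0, 1.0])
    # Assumed starting point: the neuron already responds to x_a only.
    w = np.array([0.5, 0.5, 0.0, 0.0])
    for _ in range(n_epochs):
        y_bar = 0.0  # trace reset between stimulus sequences (assumption)
        for x in (x_a, x_b):  # transforms seen in close temporal succession
            y = float(w @ x)  # linear firing rate (assumption)
            y_bar = (1.0 - eta) * y + eta * y_bar  # short-term memory trace
            w = w + alpha * y_bar * x  # associative update gated by the trace
            w = w / np.linalg.norm(w)  # keep the weight vector at unit length
    return float(w @ x_a), float(w @ x_b)


if __name__ == "__main__":
    print("Hebbian (eta=0):  ", train(eta=0.0))
    print("Trace   (eta=0.8):", train(eta=0.8))
```

With `eta = 0` the neuron never acquires a response to the second transform, because its firing is zero whenever that transform is shown. With a nonzero trace, activity carried over from the first transform lets the second set of synapses strengthen.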


The way in which position-invariant representations in the model develop is illustrated in Figure 4a, where, in the top layer labeled layer 3, individual neurons receive information from different parts of layer 2, where different neurons can represent the same object motion but in different parts of visual space. In the model, layer 2 can thus be thought of as corresponding to some neurons in area MT, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is not position invariant (Lagae et al., 1994). Layer 3 in the model can in the same way be thought of as corresponding to area MST, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is position invariant for 40% of neurons (Lagae et al., 1994). A further correspondence between the model and the brain is that neurons that respond to global planar motion are found in the brain in area MT (Movshon et al., 1985; Newsome et al., 1989) and in the model in layer 1, whereas neurons in V1 and V2 do not respond to global motion (Movshon et al., 1985; Newsome et al., 1989), and correspond in the model to the input layer of Figure 4a. Another type of perceptual analysis that can be understood with the theory and simulations described here is the object-based view-independent representation of objects, exemplified by the ability to see that an “ended” object is rotating clockwise about one of its axes. It was shown in experiment 4 that these representations can be formed by combining information from both the dorsal visual stream (about global motion) and the ventral visual stream (about object shape and/or luminance features). For these representations to be learned, a trace associative or similar learning rule must be used while the object transforms from one view to another (e.g., from upright to inverted). 
A hierarchical network with the general architecture shown in Figure 2 with separate analyses of form and motion that are combined at a final stage (as in experiment 4) is also useful for biological motion, such as representing a person walking (Giese & Poggio, 2003). However, the network described by Giese and Poggio is not very biologically plausible, in that it performs MAX functions to help with the computational issue of transform invariance and does not self-organize on the basis of the inputs so that it must be largely hand-wired. The issue here is that Giese and Poggio suppose that a MAX function is performed to select the maximally active afferent to a neuron, but there is no account of how afferents of just one type (e.g., a bar with a particular orientation and contrast) are being received by a given neuron. Not only is no principle suggested by which this could be achieved, but also no learning algorithm is given to achieve this. We suggest therefore that it would be of interest to investigate whether the more biologically plausible self-organizing type of network described in this article can learn on the basis of the inputs being received to respond to biological motion. To do this, some form of sequence sensitivity would be useful.


The theory described here is appropriate for the global motion analysis required to analyze the flow fields of objects as they translate, rotate, expand (loom), or contract, as shown in experiments 1 to 3. The theory thus provides a model of some of the computations that appear to occur along the pathway V1–V2–MT–MST, as neurons of these types are generated along this pathway (see section 1). The theory described here can also account for global motion in an object-based coordinate frame as shown in experiment 4. Neurons with these properties have been found in the cortex in the anterior part of the macaque superior temporal sulcus, in which neurons respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Area STPa (the cortex in the anterior part of the macaque superior temporal sulcus) contains neurons that respond to a rotating sphere (Anderson & Siegel, 2005), and as shown in experiment 4, the present theory could account for such neurons. Whether the present model could account for the structure from motion also observed for these neurons is not yet known. The theory could also account for neurons in area 7a of the parietal cortex that may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). Neurons have also been found in the primary motor cortex (M1) that respond similarly to neurons in area 7a when a monkey is solving a visually presented maze (Crowe, Chafee, Averbeck, & Georgopoulos, 2004), but their visual properties are not sufficiently understood to know whether the present model might apply. 
Area LIP contains neurons that perform processing related to saccadic eye movements to visual targets (Andersen, 1997), and the present theory may not apply to this type of processing.

The model of processing utilized here in a series of hierarchically organized competitive networks with convergence at each stage (as illustrated in Figure 2) is intended to capture some of the main anatomical and physiological characteristics of the ventral visual stream of visual cortical areas, and is intended to provide a model for how processing in these areas could operate, as described in detail elsewhere (Rolls & Deco, 2002; Rolls & Treves, 1998). To enable learning along this pathway to result by self-organization in the correct representations being formed, associative learning using a short-term memory trace has been proposed (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002). Another approach used in continuous transformation learning utilizes associative learning without a temporal trace and relies on close exemplars of stimuli being provided during the training (Stringer et al., 2006). What we propose here is that similar connectivity and learning processes in the series of cortical pathways in the dorsal visual stream that includes V1–V2–MT–MST and onward connections to the cortex in the superior temporal sulcus and area 7a could account for the invariant representations of the flow fields produced by moving objects.

In relation to the number of stimuli that could be learned by the system, we note that the network simulated is relatively small and was designed to illustrate the new computational hypotheses introduced here rather than to analyze the capacity of such feature hierarchical systems. We note in particular that the network simulated has 1024 neurons in each layer and 100 inputs to each neuron in layers 2 to 4. In contrast, it has been estimated that perhaps half of the macaque brain is involved in visual processing, and typically each neuron has on the order of $10^4$ inputs. It will be of interest in future work to use much larger simulations to address capacity issues for this class of network. However, we note that because the network can generalize to rotational flow fields generated by untrained stimuli, as shown in experiment 4, separate representations for the flow fields generated by every object may not be required, and this helps to reduce the number of separate representations that the network may be required to learn.

In contrast to some other theories, the theory developed here utilizes a single unified approach to self-organization in the dorsal and ventral visual systems. Predictions of the theory described here include the following. First, use of a trace rule in the dorsal as well as ventral visual system is predicted. (Thus, differences in, for example, the time constants of NMDA receptors, or persistent poststimulus firing, either of which could implement a temporal trace, would not be expected.) Second, a feature hierarchy is a useful way for understanding details of the operation of the ventral visual system, but can now be used as a clarifying concept for how the details of representations in the dorsal visual system may be built.
Third, the theory predicts that neurons specialized for motion detection by using differences in the arrival times of sensory inputs from different retinal locations need occur at only one stage of the system (e.g., in V1) and need not occur elsewhere in the dorsal visual system. These are labeled as local motion neurons in Figure 4a.

Acknowledgments

This research was supported by the Wellcome Trust and by the Medical Research Council.

References

Andersen, R. A. (1997). Multimodal integration for the representation of space in the posterior parietal cortex. Philosophical Transactions of the Royal Society of London B, 352, 1421–1428.
Anderson, K. C., & Siegel, R. M. (2005). Three-dimensional structure-from-motion selectivity in the anterior superior temporal polysensory area, STPa, of the behaving monkey. Cerebral Cortex, 15, 1299–1307.

Bair, W., & Movshon, J. A. (2004). Adaptive temporal integration of motion in direction-selective neurons in macaque visual cortex. Journal of Neuroscience, 24, 7305–7323.
Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9, 399–417.
Crowe, D. A., Chafee, M. V., Averbeck, B. B., & Georgopoulos, A. P. (2004). Participation of primary motor cortical neurons in a distributed network during maze solution: Representation of spatial parameters and time-course comparison with parietal area 7a. Experimental Brain Research, 158, 28–34.
Deco, G., & Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention: A model with spiking neurons. Journal of Neurophysiology, 94, 295–313.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Duffy, C. J. (2004). The cortical analysis of optic flow. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (Vol. 2, pp. 1260–1283). Cambridge, MA: MIT Press.
Duffy, C. J., & Wurtz, R. H. (1996). Optic flow, posture, and the dorsal visual pathway. In T. Ono, B. L. McNaughton, S. Molotchnikoff, E. T. Rolls, & H. Nishijo (Eds.), Perception, memory and emotion: Frontiers in neuroscience (pp. 63–77). Cambridge: Cambridge University Press.
Elliffe, M. C. M., Rolls, E. T., & Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86, 59–71.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Geesaman, B. J., & Andersen, R. A. (1996). The analysis of complex motion patterns by form/cue invariant MSTd neurons. Journal of Neuroscience, 16, 4716–4732.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.
Graziano, M. S. A., Andersen, R. A., & Snowden, R. J. (1994). Tuning of MST neurons to spiral motions. Journal of Neuroscience, 14, 54–67.
Hasselmo, M. E., Rolls, E. T., Baylis, G. C., & Nalwa, V. (1989). Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research, 75, 417–429.
Horn, B. K. P., & Schunk, B. G. (1981). Determining optic flow. Artificial Intelligence, 17, 185–203.
Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. Journal of Neurophysiology, 71, 1597–1626.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). New York: Springer-Verlag.

Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Philosophical Transactions of the Royal Society, 335, 11–21.
Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218.
Rolls, E. T. (2006). The representation of information about faces in the temporal and frontal lobes of primates including humans. Neuropsychologia, PMID = 16797609.
Rolls, E. T., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.
Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12, 2547–2572.
Rolls, E. T., & Stringer, S. M. (2001). Invariant object recognition in the visual system with error correction and temporal difference learning. Network: Computation in Neural Systems, 12, 111–129.
Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex and the neurophysiology of visual masking. Proceedings of the Royal Society, B, 257, 9–15.
Rolls, E. T., Tovee, M. J., Purcell, D. G., Stewart, A. L., & Azzopardi, P. (1994). The responses of neurons in the temporal cortex of primates, and face identification and detection. Experimental Brain Research, 101, 474–484.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 149–162.
Rolls, E. T., Treves, A., Tovee, M., & Panzeri, S. (1997). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Journal of Computational Neuroscience, 4, 309–333.
Sakata, H., Shibutani, H., Ito, Y., & Tsurugai, K. (1986). Parietal cortical neurons responding to rotary movement of visual space stimulus in space. Experimental Brain Research, 61, 658–663.
Seltzer, B., & Pandya, D. N. (1978). Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Research, 149, 1–24.
Sereno, M. I. (1989). Learning the solution to the aperture problem for pattern motion with a Hebb rule. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 468–476). San Mateo, CA: Morgan Kaufmann.
Sereno, M. I., & Sereno, M. E. (1991). Learning to see rotation and dilation with a Hebb rule. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 320–326). San Mateo, CA: Morgan Kaufmann.
Stringer, S. M., Perry, G., Rolls, E. T., & Proske, J. H. (2006). Learning invariant object recognition in the visual system with continuous transformations. Biological Cybernetics, 94, 128–142.

Stringer, S. M., & Rolls, E. T. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Networks, 13, 305–315.
Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14, 2585–2596.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139.
Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge, MA: MIT Press.
Wallis, G., & Bulthoff, H. H. (2001). Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences, 98, 4800–4804.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wurtz, R. H., & Kandel, E. R. (2000). Perception of motion depth and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science, 4th ed. (pp. 548–571). New York: McGraw-Hill.

Received January 11, 2006; accepted May 15, 2006.

LETTER

Communicated by Mike Arnold

Recurrent Cerebellar Loops Simplify Adaptive Control of Redundant and Nonlinear Motor Systems

John Porrill
Paul Dean
Centre for Signal Processing in Neuroimaging and Systems Neuroscience, Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.

We have described elsewhere an adaptive filter model of cerebellar learning in which the cerebellar microcircuit acts to decorrelate motor commands from their sensory consequences (Dean, Porrill, & Stone, 2002). Learning stability required the cerebellar microcircuit to be embedded in a recurrent loop, and this has been shown to lead to a simple and modular adaptive control architecture when applied to the linearized 3D vestibular ocular reflex (Porrill, Dean, & Stone, 2004). Here we investigate the properties of recurrent loop connectivity in the case of redundant and nonlinear motor systems and illustrate them using the example of kinematic control of a simulated two-joint robot arm. We demonstrate that (1) the learning rule does not require unavailable motor error signals or complex neural reference structures to estimate such signals (i.e., it solves the motor error problem) and (2) control of redundant systems is not subject to the nonconvexity problem in which incorrect average motor commands are learned for end-effector positions that can be accessed in more than one arm configuration. These properties suggest a central functional role for the closed cerebellar loops, which have been shown to be ubiquitous in motor systems (e.g., Kelly & Strick, 2003).

Neural Computation 19, 170–193 (2007). © 2006 Massachusetts Institute of Technology

1 Introduction

The grace and economy of animal movement suggest that neural control methods are likely to be of interest to robotics. The neural structure particularly associated with coordinated movement is the cerebellum, whose function seems to be the fine-tuning of motor skills by elaborating incomplete or approximate commands issued by higher levels of the motor system (Brindley, 1964). But although the microcircuitry of cerebellar cortex has inspired models for over 30 years (Albus, 1971; Eccles, Ito, & Szentágothai, 1967; Marr, 1969), the results appear to have produced relatively meager benefits for robotics. As Marr himself commented, “In my own case, the cerebellar study . . . disappointed me, because even if the theory was correct, it did not enlighten one about the motor system—it did not, for example, tell one how to go about programming a mechanical arm” (Marr, 1982, p. 15).

Part of the problem is that the control function of any individual region of the cerebellum depends not only on its internal microcircuitry but also on the way it is connected with other parts of the motor system (Lisberger, 1998), and in many cases the details of this connectivity are still not well understood. However, recent anatomical investigations have suggested that there may be a common feature of cerebellar connectivity: a recurrent architecture. It appears as if individual regions of the cerebellum (1) receive a copy of the low-level motor commands that constitute system output and (2) apply their corrections to the high-level command. These “multiple closed-loop circuits represent a fundamental architectural feature of cerebro-cerebellar interactions,” and it is a challenge to future studies to “determine the computations that are supported by this architecture” (Kelly & Strick, 2003). Determining these computations might also throw light on how biological control methods using the cerebellum might be of use to robotics.

Our initial attempts to address this issue considered adaptation of the angular vestibular-ocular reflex (aVOR), a classic preparation for studying basic cerebellar function (Boyden, Katoh, & Raymond, 2004; Carpenter, 1988; Ito, 1970). In this reflex, a head-rotation signal from vestibular sensors is used to counter-rotate the eye to maintain stability of the retinal image. Visual processing delays of ∼100 ms mean that movement of the retinal image (a retinal slip) is used as an error signal for calibration of the aVOR rather than for online control, and this calibration requires the floccular region of the cerebellum.
In our simulations of the aVOR, the characteristics of the oculomotor plant (eye muscles plus orbital tissue) were altered, and the flocculus (modeled as an adaptive filter; see section 2) was required to learn the appropriate plant compensation. Stable and robust learning was achieved by connecting the filter so that it decorrelated the retinal slip signal from a copy of the motor command sent to the eye muscles (Dean, Porrill, & Stone, 2002; Porrill, Dean, & Stone, 2004) and sent its output to join the vestibular input. This arrangement constitutes an example of the recurrent architecture described above, and ensuring the stability of the decorrelation control algorithm is therefore a candidate for its computational role. Here we extend these findings to derive some theoretical properties of the recurrent cerebellar architecture, which indicate that it may play a central role in simplifying the adaptive control of nonlinear and redundant biological motor systems. These properties will be illustrated by comparison of forward and recurrent architectures in simulated calibration of the inverse kinematic control of a two-degree-of-freedom (2dof) robot arm. Part of this work has been reported in abstract form (Porrill & Dean, 2004).
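For readers who want a concrete picture of the example system before the formal treatment, the following sketch gives forward and inverse kinematics for a planar two-link arm and computes the reaching error of a miscalibrated controller. The angle convention (first angle measured from the $x_1$ axis, second angle relative to the first link), the function names, and the target point are assumptions made for illustration. The link lengths are those quoted in the caption of Figure 2.

```python
import math


def forward_kinematics(m, lengths):
    """Plant: joint angles (m1, m2) -> end-effector position (x1, x2).

    Assumed convention: m1 is measured from the x1 axis, m2 relative to link 1.
    """
    m1, m2 = m
    l1, l2 = lengths
    x1 = l1 * math.cos(m1) + l2 * math.cos(m1 + m2)
    x2 = l1 * math.sin(m1) + l2 * math.sin(m1 + m2)
    return x1, x2


def inverse_kinematics(x, lengths, elbow=+1):
    """One choice of inverse; elbow = +1 or -1 selects between the two solutions."""
    x1, x2 = x
    l1, l2 = lengths
    c2 = (x1 ** 2 + x2 ** 2 - l1 ** 2 - l2 ** 2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target outside the reachable workspace")
    m2 = elbow * math.acos(c2)
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return m1, m2


def reaching_error(xd, controller_lengths, plant_lengths, elbow=+1):
    """Error when a controller calibrated for one arm drives a different arm."""
    m = inverse_kinematics(xd, controller_lengths, elbow)
    x = forward_kinematics(m, plant_lengths)
    return x[0] - xd[0], x[1] - xd[1]


if __name__ == "__main__":
    target = (1.5, 1.5)        # hypothetical target inside both workspaces
    true_arm = (1.0, 2.0)      # actual arm lengths (Figure 2 caption)
    assumed_arm = (1.1, 1.8)   # lengths the fixed controller was calibrated for
    for elbow in (+1, -1):
        print("joint angles:", inverse_kinematics(target, true_arm, elbow))
    print("reaching error:", reaching_error(target, assumed_arm, true_arm))
```

The two joint-angle solutions printed for the same target illustrate why even this simple arm is redundant, and the nonzero reaching error is the miscalibration that an adaptive element would have to remove.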


Figure 1: (a) Schematic representation of the cerebellar microcircuit. MF inputs $u_k$ are analyzed in the granule cell layer (only one MF input is shown; in reality, GCs receive direct input from multiple MFs and indirect input from many more via recurrent connections, not shown here, to PFs), and the GC output signals $p_j$ are distributed along the PFs. The $i$th PC makes contact with many PFs that drive its simple spike output $v_i$. The cell also has a CF input $e_i$, which is assumed in Marr-Albus models to act as a teaching signal for the synaptic weights $w_{ij}$. (b) Interpretation of the cerebellar microcircuit as an adaptive filter. Granule cell processing is modeled as a filter $p_i = G_i(u_1, u_2, \ldots)$, and each PC outputs a weighted sum of its PF inputs. (Figure modified from Dean, Porrill, & Stone, 2004.)

2 The Adaptive Filter Model

We will use what is perhaps the simplest implementation of the Marr-Albus architecture: the adaptive filter (Fujita, 1982). The microcircuit based around a Purkinje cell (PC) and its computational interpretation as an adaptive filter model are shown schematically in Figure 1.
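As an informal companion to the definitions in this section, the sketch below implements an adaptive filter of this kind: an expansion recoding of the inputs, a weighted sum for each output, and a weight update proportional to the product of a teaching signal and the recoded input. The Gaussian basis functions, the teacher used to generate the error signal, and every parameter value are hypothetical choices for illustration and are not taken from the article.

```python
import numpy as np


def granule_expansion(u, centers, width):
    """Expansion recoding p_j = G_j(u); Gaussian bumps are an assumed basis."""
    diff = centers - u
    return np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * width ** 2))


def purkinje_output(W, p):
    """Each output is a weighted sum of the recoded inputs: v_i = sum_j w_ij p_j."""
    return W @ p


def covariance_update(W, e, p, beta):
    """Weight change proportional to -(teaching signal) x (recoded input)."""
    return W - beta * np.outer(e, p)


def demo(n_steps=2000, beta=0.05, seed=0):
    """Train against a hypothetical teacher; return early and late mean squared error."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-1.0, 1.0, size=(25, 2))  # 25 basis functions, 2 inputs
    W_teacher = rng.normal(size=(2, 25))            # hidden weights defining the target
    W = np.zeros((2, 25))
    errors = []
    for _ in range(n_steps):
        u = rng.uniform(-1.0, 1.0, size=2)
        p = granule_expansion(u, centers, width=0.5)
        # Teaching signal: output error relative to the teacher (assumption).
        e = purkinje_output(W, p) - purkinje_output(W_teacher, p)
        errors.append(float(e @ e))
        W = covariance_update(W, e, p, beta)
    return float(np.mean(errors[:20])), float(np.mean(errors[-200:]))


if __name__ == "__main__":
    early, late = demo()
    print("mean squared error, first 20 steps:", early)
    print("mean squared error, last 200 steps:", late)
```

Because the update uses only the error and the recoded input available at each synapse, it is local; the open question addressed in the rest of the article is which error signal is actually available to play the role of `e`.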


The mossy fibers (MFs) carry input signals $(u_1, u_2, \ldots)$ that are processed in the granule cell layer to produce the signals $(p_1, p_2, \ldots)$ carried on the parallel fibers (PFs). This process is interpreted as a massive expansion-recoding of the mossy fiber inputs of the form

$$p_j = G_j(u_1, u_2, \ldots). \tag{2.1}$$

In the nondynamic problems to be considered here, the $G_j$ are functions of the current values of $(u_1, u_2, \ldots)$. Dynamic problems can be tackled by allowing the $p_j$ to encode aspects of the past history of the $u_i$; in that case, the $G_j$ are required to be more general causal functionals such as tapped delay lines.

We are interested in the output of a subset of Purkinje cells that take their inputs from common parallel fibers. The output $v_i$ of the $i$th such PC is modeled as a weighted linear combination of its PF inputs,

$$v_i = \sum_j w_{ij} p_j, \tag{2.2}$$

where the coefficient $w_{ij}$ is the synaptic weight of the $j$th PF on the $i$th PC. The combined transformation from the vector of inputs $u = (u_1, u_2, \ldots)$ to the vector of outputs $v = (v_1, v_2, \ldots)$ can then be written conveniently as a vector equation,

$$v = C(u) = \sum_{i,j} w_{ij} G_{ij}(u), \tag{2.3}$$

where $G_{ij}(u) = (0, \ldots, G_j(u), \ldots, 0)$ is the vector with the $j$th parallel fiber signal as its $i$th entry and zeros elsewhere.

In Marr-Albus models, the climbing fiber input $e_i$ to the Purkinje cell is assumed to act as a teaching signal. The qualitative properties of long-term depression (LTD) and long-term potentiation (LTP) at PF/PC synapses are consistent with the anti-Hebbian heterosynaptic covariance learning rule (Sejnowski, 1977),

$$\delta w_{ij} = -\beta e_i p_j, \tag{2.4}$$

which is identical in form to the least mean square (LMS) learning rule of adaptive control theory (Widrow & Stearns, 1985). Note that in all the above formulas, neural signals are assumed to be coded as firing rates relative to a tonic firing rate, and hence can take both positive and negative values. 3 Example Problem: Learning Inverse Kinematics We will show that recurrent loops can simplify the adaptive control of nonlinear and redundant motor systems. We derive the results for a problem

174

J. Porrill and P. Dean

Figure 2: (a) Geometry of the planar, two-degree-of-freedom robot arm. Motor commands (m1 , m2 ) specify joint angles as shown. The position (x1 , x2 ) of the end effector is specified in Cartesian coordinates in arbitrary units. (b) Example plant compensation problem. The arm controller is initially calibrated for the arm lengths l1 = 1.1, l2 = 1.8 of the gray arm. This arm is shown reaching accurately to a point on the gray polar grid covering the work space. When used with the black arm (lengths l1 = 1, l2 = 2), this approximate controller reaches to points on the distorted (black) grid. The task of the cerebellum is to compensate for this miscalibration; reaching errors (δx1 , δx2 ) are provided in Cartesian coordinates.

presenting both of these difficulties: learning robot inverse kinematics. This problem is of general interest, since it is an equivalent to the generic problem of learning right inverse mappings. The theoretical results obtained in the general case will be illustrated throughout by application to the inverse kinematic problem for the 2dof robot arm shown in Figure 2, where the geometry is particularly easy to intuit. Although this is a very simple system, it exhibits strong nonlinearity and a discrete redundancy. We begin by recalling some terminology. The forward kinematics or plant model of a robot arm is the mapping x = P(m) from motor commands m = (m1 , . . . , m M ) to the end-effector positions x = (x1 , . . . , xS ). An inverse kinematics is a mapping m = Q(x) that calculates the motor commands corresponding to a given end-effector position. Since this implies that x = P(Q(x)), an inverse kinematics is a rightinverse P−1 for P. For redundant systems, a given position can be reached using more than one choice of motor command; in this case, we will use the notation Q = P−1 to denote a particular choice of inverse kinematics. For the 2dof arm, there are only two choices of motor command for a given end-effector position. This is an example of a discrete redundancy. More complex systems can have continuous redundancies in which there is a

Recurrent Cerebellar Loops Simplify Adaptive Control

175

Figure 3: Forward architecture. The desired position input xd produces a motor command m = B(xd ) + C(xd ), which is the sum of contributions from a fixed element B and an adaptive cerebellar element C. This input to the plant P produces the output position x. Training the weights of C requires proximal or motor error δm rather than distal or sensory error δx = x − xd . Hence, in this forward architecture, motor error must be estimated by backpropagation via a reference matrix R ≈ ∂P−1 /∂x. This requires detailed prior knowledge of the motor plant.

continuum of motor commands available for each end-effector position. These are sometimes called redundant degrees of freedom. The cerebellum has been described as a “repair shop,” compensating for miscalibration (due to damage, fatigue, or development, for example) of the motor plant (Robinson, 1974). It is this adaptive plant compensation problem that we will model here; that is, we assume that an approximate inverse kinematics controller B ≈ P−1 is available to the system and that the function of the cerebellum is to supplement this controller to produce more accurate movements. Learning is supervised by making the reaching error,

$$\delta x = x - x_d = P(B(x_d)) - x_d, \tag{3.1}$$

available to the learning system, where xd is the target (desired) position. We call this quantity sensory error since it can be measured by an appropriate sensor and to distinguish it from motor error, which will be defined later. In the 2dof robot arm example shown in Figure 2b, the approximate controller B is the inverse kinematics for a robot with arm lengths that are ±10% in error. Thus, when required to reach to positions on the polar grid shown in Figure 2, the arm actually moves to positions on the overlaid distorted grid.

4 The Forward Learning Architecture

To highlight the properties of recurrent connectivity, we begin by considering the problems encountered in implementing a learning rule for the alternative forward connectivity shown schematically in Figure 3. The motor command to the plant is produced by an open-loop filter that is the sum


B + C of the fixed element B and the adaptive cerebellar component C; this combination will be an inverse kinematics B + C = P−1 if C takes the value

$$C^* = \sum_{ij} w_{ij}^*\, G_{ij} = P^{-1} - B. \tag{4.1}$$
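Equation 4.1 can be evaluated pointwise for the 2dof arm of Figure 2. The following is a minimal sketch, not the authors' implementation: it assumes the standard closed-form inverse kinematics of a planar two-link arm, uses `atan2` in place of the inverse tangent, and the helper names (`fk`, `ik`, `C_star`, `reach_error`) and the target point are hypothetical choices made for illustration.

```python
import math

def fk(m, l1, l2):
    """Forward kinematics of a planar two-link arm with joint angles m = (m1, m2)."""
    m1, m2 = m
    return (l1 * math.cos(m1) + l2 * math.cos(m1 + m2),
            l1 * math.sin(m1) + l2 * math.sin(m1 + m2))

def ik(x, l1, l2, xi=1):
    """Closed-form inverse kinematics; xi = +1 or -1 selects the arm configuration."""
    x1, x2 = x
    c = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    theta = math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error
    m2 = xi * theta
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return (m1, m2)

TRUE_LENGTHS = (1.0, 2.0)        # black arm in Figure 2
CALIBRATED_LENGTHS = (1.1, 1.8)  # gray arm in Figure 2

def P(m):
    return fk(m, *TRUE_LENGTHS)

def P_inv(x):
    return ik(x, *TRUE_LENGTHS)

def B(x):
    return ik(x, *CALIBRATED_LENGTHS)

def C_star(xd):
    """Ideal correction C* = P^-1 - B of equation 4.1, evaluated at one target."""
    p, b = P_inv(xd), B(xd)
    return (p[0] - b[0], p[1] - b[1])

def reach_error(m, xd):
    x = P(m)
    return math.hypot(x[0] - xd[0], x[1] - xd[1])

xd = (1.5, 1.0)  # arbitrary target inside both work spaces (assumption)
m_fixed = B(xd)
c = C_star(xd)
m_corrected = (m_fixed[0] + c[0], m_fixed[1] + c[1])
error_fixed = reach_error(m_fixed, xd)
error_corrected = reach_error(m_corrected, xd)
print("error with B alone:", error_fixed)
print("error with B + C*: ", error_corrected)
```

The sketch computes C∗ directly from the true inverse kinematics, which the learning system does not have; it only illustrates the target function that the weights wij∗ are assumed to represent.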

We assume that the granule cell basis functions Gij satisfy the matching condition, that is, that synaptic weights wij∗ can be found such that equation 4.1 holds for the range of P−1 and B under consideration. To obtain a learning rule similar to the covariance rule (see equation 2.3), we introduce the concept of motor error (Gomi & Kawato, 1992). Motor error δm is the error in motor command responsible for the sensory error δx. Minimizing expected square motor error,

$$E_m = \tfrac{1}{2} \langle |\delta m|^2 \rangle = \tfrac{1}{2} \langle \delta m^t\, \delta m \rangle, \tag{4.2}$$

(where a superscript t denotes the matrix transpose), leads to a simple learning rule because motor error is linearly related to synaptic weight error,

$$\delta m = C(x_d) - C^*(x_d) = \sum_{ij} (w_{ij} - w_{ij}^*)\, G_{ij}(x_d). \tag{4.3}$$

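Using this expression, the gradient of expected square motor error is

$$\frac{\partial E_m}{\partial w_{ij}} = \left\langle \delta m^t\, \frac{\partial\, \delta m}{\partial w_{ij}} \right\rangle = \langle \delta m^t\, G_{ij}(x_d) \rangle = \langle \delta m_i\, p_j \rangle, \tag{4.4}$$

giving the gradient descent learning rule,

$$\delta w_{ij} = -\beta\, \delta m_i\, p_j, \tag{4.5}$$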

where β is a small, positive constant. If we label Purkinje cells such that the ith PC output contributes to the ith component of motor error, then comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal ei provided on the climbing fiber input to the ith Purkinje cell must be the ith component of motor error,

$$e_i = \delta m_i. \tag{4.6}$$

This apparently simple prescription is complicated by the fact that motor error is not in itself an observable quantity. It is a derived quantity given by the equation

$$\delta m = P^{-1}(x) - P^{-1}(x_d). \tag{4.7}$$


This leads to an obvious circularity in that the rule for learning inverse kinematics requires prior knowledge of that same inverse kinematics. This circularity can be circumvented to some extent by supposing that all errors are small so that

$$\delta m \approx \frac{\partial P^{-1}}{\partial x}\, \delta x, \tag{4.8}$$

and then replacing the unknown Jacobian ∂P−1/∂x in this error backpropagation rule by a fixed approximation,

$$R \approx \frac{\partial P^{-1}}{\partial x}. \tag{4.9}$$

If R were exact, then the product JR, where J = ∂P/∂m is the forward Jacobian, would be the identity matrix. To ensure stable learning, the approximate R must estimate motor error correctly up to a strict positive realness (SPR) condition, which in this static case requires that the symmetric part of the matrix JR be positive definite. The hypothetical neural structures required to implement this transformation R and recover motor error from observable sensory error,

$$\delta m \approx R\, \delta x \quad \text{or} \quad \delta m_i \approx \sum_k R_{ik}\, \delta x_k, \tag{4.10}$$

have been called reference structures (Gomi & Kawato, 1992), so we will call R the reference matrix.

5 The Motor Error Problem

We refer to the requirement that the climbing fibers carry the unobservable motor error signal rather than the observed sensory error signal as the motor error problem. Although the forward architecture has been applied successfully to a number of real and simulated control tasks (notably in the form of the feedback error learning architecture; Gomi & Kawato, 1992, 1993), we will argue here that for generic biological control systems, the complexity of the neural reference structures it requires makes the forward architecture implausible. It is clear that the complexity of the reference structure is multiplicative in the dimension of the control and sensor space. For a task in which M muscles control the output of N sensors, there are MN entries in the reference matrix R. For example, in our 2dof robot arm problem, four real numbers must be hard-wired to guarantee learning. For more realistic motor tasks in biological systems (such as reaching while preserving balance), values of 100 or more for MN would not be unreasonable.
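The static SPR condition stated in section 4 (the symmetric part of JR must be positive definite) can be checked numerically for the 2dof arm. The sketch below is illustrative only: the reference point `m_O`, the helper names, and the choice of test points obtained by rotating the first joint are assumptions made here, and the 2 × 2 positive definiteness test uses the leading-minor criterion.

```python
import math

L1, L2 = 1.0, 2.0  # black arm lengths from Figure 2

def jacobian(m1, m2):
    """Forward Jacobian J = dP/dm of the planar two-link arm."""
    s1, c1 = math.sin(m1), math.cos(m1)
    s12, c12 = math.sin(m1 + m2), math.cos(m1 + m2)
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def satisfies_spr(j, r):
    """True if the symmetric part of J R is positive definite (2 x 2 case)."""
    a = matmul(j, r)
    s11, s22 = a[0][0], a[1][1]
    s12 = 0.5 * (a[0][1] + a[1][0])
    return s11 > 0.0 and s11 * s22 - s12 * s12 > 0.0

m_O = (0.3, 1.2)                    # hypothetical reference configuration O
R = inverse_2x2(jacobian(*m_O))     # exact inverse Jacobian at O

for degrees in (0, 45, 135):
    m1 = m_O[0] + math.radians(degrees)
    ok = satisfies_spr(jacobian(m1, m_O[1]), R)
    print(f"first joint rotated by {degrees:3d} deg: SPR satisfied = {ok}")
```

For this particular family of test points, rotating the first joint by an angle φ rotates the Jacobian by φ, so the symmetric part of JR is cos φ times the identity and the condition fails once the rotation reaches 90 degrees. A reference matrix that is exact at one point is therefore only locally valid, which is the issue taken up in the remainder of this section.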


Figure 4: (a) The dots show the arrangement of RBF centers, and the circle shows the receptive field radius for the forward architecture. This configuration leads to an accurate fit if exact motor error is provided to the learning algorithm (not shown). (b) A snapshot of performance during learning that illustrates the need for multiple reference structures in nonlinear problems. The reference structure R is chosen as the exact inverse Jacobian at the grid point O, marked with a small circle. Although the learned (black) grid overlays the exact (gray) grid more accurately in the neighborhood of O (compare with Figure 2b), performance has clearly deteriorated over the remainder of the work space. Learning rate β = 0.0005. The effect of reducing the learning rate is to delay but not abolish the divergence. (c) An illustration of the redundancy or nonconvexity problem. The arm controller is set up to consistently choose the configurations shown by the solid and dotted arms when reaching into the top or bottom half of the work space. However, when reaching to points in the shaded horizontal sector, the arm retains the configuration used for the previous target. Hence, arm configuration is chosen effectively randomly in this sector, and the system fails to learn. Learning rate is as above. (Exact motor errors were used in this part of the simulation.)

In fact, this analysis understates the problem since biological motor systems are often nonlinear, and hence the reference structures are valid only locally. This behavior will be illustrated for the 2dof robot arm calibration problem described above (see Figure 2). Details of the radial basis function (RBF) implementation are given in the appendix. Figure 4b shows a snapshot of performance during learning. In this example, the reference matrix R was chosen to be the exact inverse Jacobian at the point O. Clearly this choice satisfies the SPR condition in a neighborhood of O, and hence in this region, where R provides good estimates of motor error, reaching accuracy initially improves. However, outside this region, the sign of a component of motor error is wrongly estimated, and errors in this component diverge catastrophically. This instability will eventually spread to the point O itself because of overlap between adjacent RBFs. Ensuring stable learning in this example would require at least three reference structures, each valid on a different sector of the work space.


This requires 3 × 4 = 12 prespecified parameters (not including the extra parameters needed to specify the region of validity of each reference structure). For a general inverse kinematics problem, we must specify MNK parameters, where K is the number of regions required to guarantee positive definiteness of JR in each region. Finally, we note that in the dynamic case, learning must be dynamically stable. For example, in the linear case, the required reference structure is a matrix of transfer functions R(iω), and an SPR condition must be satisfied by the matrix transfer function J(iω)R(iω) at each frequency, further increasing the complexity of the motor error problem.

6 The Redundancy Problem

Most artificial and biological motor systems are redundant, that is, different motor commands can produce the same output. Such redundancy leads to a classic supervised learning problem called the nonconvexity problem: if the training data for the learning system associate multiple motor commands with a given position, and if this set of motor commands is nonconvex, then the system will learn an inappropriate weighted average motor command at that position. The forward architecture shown in Figure 3 is subject to the redundancy problem whenever the controller B is allowed to produce different output commands for the same input. This type of behavior is common in motor tasks; for example, a robot arm configuration might be determined by a combination of convenience and movement history. Such behavior is illustrated for the 2dof arm in Figure 4c. In this experiment, one arm configuration is used in the top half of the work space and the opposite configuration in the bottom half. However, when reaching into the central sector, the controller reuses the configuration from the previous position (this kind of behavior is common to avoid work space obstacles), and hence the configuration chosen in the central sector is essentially random.
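The averaging failure behind the nonconvexity problem can be made concrete for the 2dof arm. The sketch below is a hedged illustration rather than a reproduction of the simulation in Figure 4: it assumes the standard two-link inverse kinematics, an arbitrary target, and an equal-weight average of the two valid motor commands; all function names are hypothetical.

```python
import math

def fk(m, l1=1.0, l2=2.0):
    """Forward kinematics of the planar two-link arm."""
    m1, m2 = m
    return (l1 * math.cos(m1) + l2 * math.cos(m1 + m2),
            l1 * math.sin(m1) + l2 * math.sin(m1 + m2))

def ik(x, xi, l1=1.0, l2=2.0):
    """Inverse kinematics for the configuration selected by xi = +1 or -1."""
    x1, x2 = x
    c = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    theta = math.acos(max(-1.0, min(1.0, c)))
    m2 = xi * theta
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return (m1, m2)

def reach_error(m, target):
    x = fk(m)
    return math.hypot(x[0] - target[0], x[1] - target[1])

target = (1.5, 1.0)          # arbitrary reachable target (assumption)
m_a = ik(target, xi=+1)      # first valid motor command
m_b = ik(target, xi=-1)      # second valid motor command
m_avg = ((m_a[0] + m_b[0]) / 2.0, (m_a[1] + m_b[1]) / 2.0)

error_a = reach_error(m_a, target)
error_b = reach_error(m_b, target)
error_avg = reach_error(m_avg, target)
print("configuration +1:", error_a)
print("configuration -1:", error_b)
print("averaged command:", error_avg)
```

Each configuration reaches the target, but their average has a second joint angle of zero, so the arm is fully extended and overshoots the target. This is the kind of inappropriate averaged command that a learner trained on both configurations would tend toward.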
While learning succeeds in the top and bottom sectors of the work space, the failure to learn in the central sector is evident from the distorted (black) grid of learned positions. This nonconvexity problem can be avoided by providing auxiliary variables ξ to the learning component such that the combination (x, ξ) unambiguously specifies the motor state of the system (although identifying such variables in practice can be nontrivial). For example, the discrete redundancy found in the 2dof arm requires a discrete variable ξ = ±1 to identify the particular arm configuration to be used. This solution is not particularly satisfactory since it breaks modularity, forcing a controller whose task is simply to reach to a given position to take responsibility for choosing the required arm configuration. More interesting from our point of view, this solution also increases the complexity of the reference structure, since the number K of reference


Figure 5: Recurrent architecture. Here the motor command generated by the fixed element B is the input to the adaptive element C. The output of C is then used as a correction to the input to B. This loop implements Brindley’s (1964) influential suggestion that the cerebellum elaborates commands coming from higher-level structures in the context of information about the current state of the organism. In this recurrent architecture, the sensory error δx becomes effectively proximal to C and, as demonstrated in the text, can be used directly as a teaching signal.

matrices required must be further increased to reflect the dependence on the extra parameters ξ. For example, to learn correctly in the 2dof arm example in Figure 4c, the two different arm configurations clearly require different reference matrices; hence, to allow both configurations over the whole work space increases the number of hard-wired parameters by a factor of 2 to 2 × 12 = 24. Even in this simple example, the complexity of the reference structure required to support learning is beginning to approach that of the structure to be learned. Note that the situation is not helped by adding a conventional error feedback loop to the adaptive forward architecture (as in the feedback error learning model). This loop also requires a motor error signal, and since error is available only in sensory coordinates, different reference structures are required for different arm configurations.

7 The Recurrent Architecture

As we noted in section 1, the forward architecture just described ignores a major feature, the recurrent connectivity, of the circuitry in which the cerebellar microcircuit is embedded. An alternative recurrent architecture reflecting this connectivity is shown schematically in Figure 5. Although the analysis of recurrent networks and their learning rules can be very complex (e.g., Pearlmutter, 1995), this architecture is an important exception; in particular, we will show that no backpropagation step is required in the learning rule. The analysis proceeds in two stages. First, a plausible cerebellar learning rule is derived by the familiar method of gradient descent. This derivation does not provide a rigorous proof of convergence because it requires a small-weight-error approximation. Second, a Lyapunov function


for the learning rule is derived and used to demonstrate convergence of the learning rule without the need for the small-weight-error approximation. We are able to simplify the treatment because we deal only with kinematics. Hence, we can idealize the recurrent loop as an algebraic loop in which the output m of the closed loop shown in Figure 5 satisfies the implicit equation,

$$m = B(x_d + C(m)). \tag{7.1}$$

Clearly control is possible only if the fixed element B has some left inverse B−1. Applying this inverse to equation 7.1, we find that the desired position input xd is related to the actual motor command m by the equation

$$x_d = B^{-1}(m) - C(m). \tag{7.2}$$

Again, we assume the matching condition so that weights exist for which C takes the exact value C∗. For this choice of C, the desired position will equal the actual output position, that is,

$$x = P(m) = B^{-1}(m) - C^*(m), \tag{7.3}$$

from which we derive the following expression for the desired cerebellar filter:

$$C^* = B^{-1} - P. \tag{7.4}$$
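Equations 7.1 and 7.4 can be checked numerically for the 2dof arm. The sketch below is an illustration under stated assumptions, not the authors' simulation: the implicit equation is solved by simple fixed-point iteration (a choice made here, with a fixed iteration count), B−1 is taken to be the forward kinematics of the miscalibrated arm on the ξ = +1 branch, and all function names and the target point are hypothetical.

```python
import math

def fk(m, l1, l2):
    m1, m2 = m
    return (l1 * math.cos(m1) + l2 * math.cos(m1 + m2),
            l1 * math.sin(m1) + l2 * math.sin(m1 + m2))

def ik(x, l1, l2, xi=1):
    x1, x2 = x
    c = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    theta = math.acos(max(-1.0, min(1.0, c)))
    m2 = xi * theta
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return (m1, m2)

TRUE_LENGTHS = (1.0, 2.0)        # plant P (black arm in Figure 2)
CALIBRATED_LENGTHS = (1.1, 1.8)  # fixed controller B (gray arm in Figure 2)

def P(m):
    return fk(m, *TRUE_LENGTHS)

def B(x):
    return ik(x, *CALIBRATED_LENGTHS)

def B_inv(m):
    # Left inverse of B on the xi = +1 branch: kinematics of the calibrated arm.
    return fk(m, *CALIBRATED_LENGTHS)

def C_star(m):
    """Ideal recurrent correction C* = B^-1 - P (equation 7.4)."""
    b, p = B_inv(m), P(m)
    return (b[0] - p[0], b[1] - p[1])

def solve_loop(xd, C, iterations=50):
    """Fixed-point iteration for m = B(xd + C(m)) (equation 7.1)."""
    m = B(xd)
    for _ in range(iterations):
        c = C(m)
        m = B((xd[0] + c[0], xd[1] + c[1]))
    return m

def reach_error(m, xd):
    x = P(m)
    return math.hypot(x[0] - xd[0], x[1] - xd[1])

xd = (1.5, 1.0)  # arbitrary target (assumption)
error_open = reach_error(B(xd), xd)
error_loop = reach_error(solve_loop(xd, C_star), xd)
print("error without C:  ", error_open)
print("error with C = C*:", error_loop)
```

The iteration converges here because the miscalibration is small, so P composed with B is close to the identity; for larger miscalibrations a more careful solver might be needed.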

By subtracting equation 7.2 from equation 7.3, we find that δx = x − xd = C(m) − C∗(m), and substituting C = Σ wij Gij, C∗ = Σ wij∗ Gij gives the following simple relationship between sensory error and synaptic weight error:

$$\delta x = P(m) - x_d = C(m) - C^*(m) = \sum_{ij} (w_{ij} - w_{ij}^*)\, G_{ij}(m). \tag{7.5}$$

(This is analogous to equation 4.3 relating motor error and synaptic weight error for the forward architecture.) Although this equation is at first sight linear in the weights wij, this appearance is misleading since the argument m also depends implicitly on the wij. However, the appearance of linearity is close enough to the truth to allow us to derive a simple learning rule. If weight errors are small, the second term in the derivative,

$$\frac{\partial\, \delta x}{\partial w_{ij}} = G_{ij}(m) + \sum_{kl} (w_{kl} - w_{kl}^*)\, \frac{\partial G_{kl}(m)}{\partial w_{ij}}, \tag{7.6}$$


can be neglected to give the approximation

$$\frac{\partial\, \delta x}{\partial w_{ij}} \approx G_{ij}(m). \tag{7.7}$$

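Using this result, we can derive an approximate gradient-descent learning rule by minimizing expected square sensory error (rather than motor error, as in section 4). Defining

$$E_s = \tfrac{1}{2} \langle |\delta x|^2 \rangle = \tfrac{1}{2} \langle \delta x^t\, \delta x \rangle, \tag{7.8}$$

its gradient is

$$\frac{\partial E_s}{\partial w_{ij}} = \left\langle \delta x^t\, \frac{\partial\, \delta x}{\partial w_{ij}} \right\rangle \approx \langle \delta x^t\, G_{ij}(m) \rangle = \langle \delta x_i\, p_j \rangle, \tag{7.9}$$

leading to the approximate gradient-descent learning rule

$$\delta w_{ij} = -\beta\, \delta x_i\, p_j. \tag{7.10}$$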

In this learning rule, no Jacobian appears, and hence no reference structures embodying prior knowledge of plant parameters are required. Comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal required on the climbing fibers in the recurrent architecture is now the sensory error:

$$e_i = \delta x_i. \tag{7.11}$$
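The learning rule of equation 7.10 can be exercised on a deliberately simple scalar example. This toy system is an assumption introduced here for illustration and does not appear in the paper: the plant is P(m) = a·m, the fixed element is B(x) = x/b, and the adaptive element has a single basis function G(m) = m with one weight w, so that C∗ = B−1 − P corresponds to w∗ = b − a.

```python
import random

a, b = 1.0, 1.25      # hypothetical plant gain and controller calibration
w_star = b - a        # ideal weight, from C* = B^-1 - P (equation 7.4)
beta = 0.05           # learning rate (illustrative value)
w = 0.0               # adaptive weight, initially zero

random.seed(0)
errors = []
for trial in range(400):
    xd = random.uniform(0.5, 1.5)   # desired position for this trial
    m = xd / (b - w)                # closed-form solution of m = B(xd + w*m)
    x = a * m                       # plant output
    dx = x - xd                     # sensory error; here dx = (w - w_star) * m
    w -= beta * dx * m              # equation 7.10 with p = G(m) = m
    errors.append(abs(dx))

print("first sensory error:", errors[0])
print("last sensory error: ", errors[-1])
print("weight error:       ", abs(w - w_star))
```

In this toy case each update multiplies the weight error by (1 − β m²), so the weight error shrinks on every trial as long as β m² < 2. The update uses only the observable sensory error δx and the input m to the adaptive element; no estimate of the plant gain a enters the rule.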

Although this local learning rule has been derived as an approximate gradient-descent rule, its properties are more easily determined by a Lyapunov analysis; this analysis is simplified if we work with the continuous update form of the learning rule:

$$\dot{w}_{ij} = -\beta\, \delta x_i\, p_j. \tag{7.12}$$

As a Lyapunov function, we use the sum square synaptic weight error,

$$V = \frac{1}{2} \sum_{ij} (w_{ij} - w_{ij}^*)^2, \tag{7.13}$$

which has time derivative

$$\dot{V} = \sum_{ij} \dot{w}_{ij}\, (w_{ij} - w_{ij}^*) = -\beta \sum_i \delta x_i \sum_j (w_{ij} - w_{ij}^*)\, p_j. \tag{7.14}$$


Substituting the expression for sensory error (equation 7.5) into this expression, we find that

$$\dot{V} = -\beta\, |\delta x|^2. \tag{7.15}$$
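Integrating equation 7.15 over a training period makes the resulting bound explicit. The short derivation below is a sketch that assumes the continuous rule of equation 7.12 holds exactly; the symbols T (training duration) and ε (an error threshold) are introduced here for illustration.

```latex
\begin{align}
V(T) - V(0) &= -\beta \int_0^T |\delta x(t)|^2 \, dt, \\
\int_0^T |\delta x(t)|^2 \, dt &= \frac{V(0) - V(T)}{\beta} \;\le\; \frac{V(0)}{\beta}
  \qquad \text{since } V(T) \ge 0 .
\end{align}
```

In particular, the total time during which |δx| ≥ ε cannot exceed V(0)/(β ε²), whatever the sequence of targets.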

Since its derivative is nonpositive, V is a Lyapunov function, that is, a positive function that decreases over time as learning proceeds. It is unnecessary to appeal to the Lyapunov theorems to determine the behavior of this system sufficiently for practical purposes. Equation 7.15 shows that over a fixed period of time, the sum square synaptic weight error decreases by an amount proportional to the mean square sensory error; hence, it is clear that the system can make RMS sensory errors above a certain magnitude only for a limited time, since V would otherwise become negative, which is impossible.

8 The Motor Error Problem Is Solved

It is clear that this architecture solves the motor error problem in that there is no longer a need for unavailable motor error signals on the climbing fibers or for complex reference structures to estimate them. Figure 6a shows the performance of the recurrent architecture for the 2dof arm problem described in section 3 (see the appendix for implementation details). It can be seen that the arm now recalibrates successfully over the whole work space. Figure 6c (top) shows the decrease in position error during training; this decrease is stochastic in nature, as would be expected from the approximate stochastic gradient-descent rule. In contrast, Figure 6c (bottom) shows the monotonic decrease of sum square weight error V during training predicted by the Lyapunov analysis, with the greatest decreases taking place where large errors are made.

9 The Nonconvexity Problem Is Solved

In the recurrent architecture, the nonconvexity problem is easily solved because it does not arise. The adaptive element C takes motor commands as input, and its task is to learn to associate them with a corrective output. Since the motor command completely determines the current configuration, there is no ambiguity to be resolved.
Although we illustrate this property below for the discrete redundancy of the 2dof arm, the reasoning above clearly applies to both discrete and continuous redundancies. This property confers remarkable modularity on recurrent architecture. It means that the connectivity of a cerebellar controller is determined solely by the task and is independent of low-level details such as the particular algorithm chosen for resolving joint angle redundancy or how, for example, reciprocal innervation allocates the tension in antagonistic muscle pairs.


Figure 6: (a) This panel illustrates successful recalibration by the recurrent architecture. After training, the learned (black) grid overlays the exact (gray) grid over the whole work space (compare with the initial performance in Figure 2). Learning rate β = 0.05. (b) The grid shows the region of motor space corresponding to the robot work space. The dots and the circle indicate RBF centers and the receptive field radius. (c) The two graphs illustrate the stochastic decrease in squared position error (top), same units of length as Figure 2, and the associated monotonic decrease in sum square synaptic weight error (bottom) as predicted by theory. The behavior at the positions marked by arrows illustrates the fact that faster decrease in sum square weight error is associated with larger position error, as predicted by the Lyapunov equation, 7.15.

This property is illustrated in Figure 7a, where the recurrent architecture is applied to the redundant reaching problem described in section 6. It can be seen that learning is now satisfactory over the whole work space. The only modification to the net for the task of Figure 7 is the need for RBF centers covering points in motor space associated with the alternative arm configuration. This grid of RBF centers is shown in Figure 7b. The situation would be only slightly different for a continuous redundancy; in this case, new RBF centers would be needed to cover all points in motor command space accessible by the redundant degrees of freedom.

10 Discussion

As argued in section 1, one reason that cerebellar-inspired models have been of modest use to robotics is that cerebellar connectivity is often poorly understood (Lisberger, 1998). The general idea that identical cerebellar microcircuits can be wired up to perform a wide range of motor and


Figure 7: (a) This panel shows that recurrent architecture solves the redundant reaching problem example described in section 6 for which forward architecture fails (compare with Figure 4). The learned (black) grid overlays the exact (gray) grid accurately over the whole work space, including the horizontal 60 degree sector in which both arm configurations are used. Learning rate β = 0.05. (b) This panel shows the separate grids in motor space associated with the two arm configurations. The grid of dots and the circle indicate the RBF centers and receptive field radius. The dark gray regions highlight motor commands used to generate the arm configurations used consistently in the top and bottom sectors of the work space, and the light gray regions highlight motor commands that generate the two alternative configurations used in the central sector of the work space (the unshaded regions are not used). Since the arm configurations that are ambiguous in task space are represented in separate regions of motor space, they are learned independently in the recurrent architecture.

cognitive functions (Ito, 1984, 1997) is well appreciated, but identifying specific instances has proved difficult. Here we have explored the computational capacities of the cerebellar microcircuit embedded in a recurrent architecture for adaptive feedforward control of nonlinear redundant systems as exemplified by a simulated 2dof robot arm. We have shown that the architecture solves the distal error problem, copes naturally with the redundancy/nonconvexity problem, and gives enhanced modularity. We now compare it with alternative architectures from both computational and biological perspectives.

10.1 Computational Control. The distal error problem arises whenever output errors are used to train internal parameters (Jordan, 1996; Jordan & Wolpert, 2000). It is a fundamental obstacle in neural learning systems, and the consequent lack of biological plausibility of learning rules in neural net supervised learning algorithms has become a cliché. There have been two main previous approaches to solving the distal error problem: (1) continue


to use standard architectures and hypothesize the existence of structures implementing the required error backpropagation schemes, or (2) look for those special architectures in which output errors are themselves sufficient for training. The forward learning architecture (see Figure 3) appears poorly suited for solving the distal error problem. Considerable ingenuity has been expended on the feedback error learning scheme developed by Kawato and coworkers (Gomi & Kawato, 1992, 1993) (for a recent rigorous treatment, see Nakanishi & Schaal, 2004) in order to rescue this architecture, but even so, substantial difficulties remain. In feedback error learning, the adaptive component is embedded in a feedback controller so that the estimated motor error δmest is used as both a training signal and a feedback error term. As we have noted, feedback error learning imposes SPR conditions on the accuracy of the motor error estimate and hence requires complex reference structures for generic redundant and nonlinear systems. There have been theoretical attempts to remove the SPR condition. For example, Miyamura and Kimura (2002) avoid the necessity for SPR at the cost of requiring large gains in the conventional error feedback loop; this is unacceptable in autonomous and biological systems since one of the primary reasons for using adaptive control is to avoid the destabilizing effect of large feedback gains given inevitable feedback delays. Despite the difficulties we have noted here, feedback error learning has been usefully applied to online learning by autonomous robots in numerous contexts (e.g., Dean, Mayhew, Thacker, & Langdon, 1991; Mayhew, Zheng, & Cornell, 1992). It is clear that feedback error learning remains a useful approach for problems in which simplifying features of the motor plant mean that the reference structures are easily estimated. We have presented a general architecture for tracking control in which output error can be used directly for training. 
Other architectures of this type in the literature have been designed for specific control problems. For example, the adaptive scheme for control of robot manipulators proposed by Slotine and Li (1989) relies on special features of the problem of controlling joint angles using joint torques. We note also the adaptive schemes for particular single-input–single-output nonlinear systems considered by Patino and Liu (2000) and Nakanishi and Schaal (2004). Although none of these architectures tackles the generic problems of nonlinearity and redundancy considered here, it is interesting to note that they also use recurrent architectures in an essential way, supporting the idea that recurrent connectivity may play a fundamental role in simplifying biological motor control.

10.2 Biological Plausibility. From a biological perspective, the recurrent architecture appears more plausible in the context of plant compensation than the forward architecture, with its requirements for feedback error learning, for three reasons.


First, as pointed out in section 1, anatomical evidence indicates that the recurrent architecture is a feature of many cerebellar microzones. In addition, where it is available, electrophysiological evidence has specifically identified efferent-copy information as part of the mossy fiber input to particular regions of the cerebellum. Important examples are (1) the primate flocculus and ventral paraflocculus, responsible for adaptation of the vestibulo-ocular reflex (VOR), where extensive recordings have shown that about three-quarters of their mossy fiber inputs carry eye-movement-related signals (Miles, Fuller, Braitman, & Dow, 1980); (2) the oculomotor vermis, responsible for saccadic adaptation, where about 25% of mossy fibers show short-latency burst firing in association with saccades that closely resemble the activity of excitatory burst neurons in the paramedian pontine reticular formation (Ohtsuka & Noda, 1992); and (3) regions of cerebellar cortex associated with control of limb movement by the red nucleus, which receive an efferent copy of rubrospinal output to cerebellum, related to limb position and velocity (Keifer & Houk, 1994). Thus, the defining feature of the recurrent architecture used here appears to be present for all the cerebellar microzones that have been adequately investigated. Second, the recurrent architecture allows use of sensory error signals, that is, the sensory consequences of inaccurate motor commands, which are physically available signals. Previous inability to use such distal error signals has been a fundamental obstacle in neural learning systems, so that architectures such as the one described here, in which distal error can be used directly as a teaching signal, have fundamental importance as basic components of biological learning systems. Hence, if the recurrent architecture were used biologically, we would expect cerebellar climbing fibers to carry sensory information.
In contrast, a central consequence of feedback error learning is the identification of climbing fiber signals with motor error. In the words of Gomi and Kawato (1992), “Our view that the climbing fibers carry control error information, the difference between the instructions and the motor act, is common to most cerebellar motor-learning models; however ours is unique in that this error information is represented in motor-command coordinates.” This view runs into both theoretical and empirical problems. From a theoretical point of view, the use of the motor error signal requires not only new and as yet unidentified neural reference structures to recover motor error from observable sensory errors, but also new and as yet unidentified mechanisms to calibrate these structures. Again in the words of Gomi and Kawato (1992), “The most interesting and challenging theoretical problem [raised by FEL] is setting an appropriate inverse reference model in the feedback controller at the spinal and brainstem levels.” From the point of view of experimental evidence, it appears that many climbing fibers are in fact strongly activated by sensory inputs, such as touch, pain, muscle sense, or, in the case of the VOR, retinal slip (e.g., Apps & Garwicz, 2005; De Zeeuw et al., 1998; Simpson, Wylie, & De Zeeuw, 1996).


In certain cases where nonsensory signals have been identified in climbing-fiber discharge (e.g., Andersson, Garwicz, & Hesslow, 1988; Gibson, Horn, & Pong, 2002), their effect seems to be to emphasize the unpredicted sensory consequences of a movement by gating the expected consequences. Such gated signals are still physically available sensory signals. In other instances, for example, retinal slip in the horizontal VOR, it appears that the effect of nonsensory modulation is to produce a two-valued slip signal that conveys information about the direction of image movement but not its speed (Highstein, Porrill, & Dean, 2005; Simpson, Belton, Suh, & Winkelman, 2002). One line of evidence often used to support feedback error learning concerns ocular following, where externally imposed sliplike movements of the retinal image drive compensatory eye movements. It has been shown that the associated complex spike discharge in the flocculus can be predicted from either slip or eye movement signals (Kobayashi et al., 1998). However, given that a later article states that “only the velocity of the retinal error (retinal slip) was statistically sufficient” (Yamamoto, Kobayashi, Takemura, Kawano, & Kawato, 2002, p. 1558) to reproduce the observed variability in climbing fiber discharge, it is not clear that this evidence decisively supports the existence of a motor error signal, even in these special and ambiguous circumstances. Although we have not considered dynamics here, a further advantage of the recurrent architecture is its capacity to generate long-time constant signals when plant compensation requires integrator-like processes (Dean et al., 2002; Porrill et al., 2004). This desirable feature of the recurrent architecture was pointed out for eye-position control by Galiana and Outerbridge (1984) and has been incorporated in other models of gaze holding (e.g., Glasauer, 2003).
It is unclear how forward architectures could be used to achieve the observed performance of the neural integrator (Robinson, 1974).

10.3 Further Developments. Our previous work on plant compensation in the VOR (Dean et al., 2002; Porrill et al., 2004) established the capabilities of recurrent architecture for dynamic control of linear systems. Here we have extended these results to the kinematic control of redundant and nonlinear systems. It is clearly a priority to extend these results to dynamic nonlinear control problems. It is also important to implement the decorrelation-control scheme in a robot, and this work is currently in progress. We have emphasized the limitations of the feedback-error-learning architecture, but its combination of feedback and feedforward controllers does offer considerable advantages for robust online control. It appears that a possibly similar strategy is used biologically to control gaze stability, combining the optokinetic (feedback) and vestibulo-ocular (feedforward) reflexes (Carpenter, 1988). As we will show elsewhere, the recurrent architecture can also be embedded stably and naturally in a conventional feedback loop.


Finally, although the recurrent architecture does not need forward connections for plant compensation (and so they are not shown in Figure 5), such connections are also ubiquitous for cerebellar microzones. We conjecture that once plant compensation has been achieved, a microzone could in principle use these inputs for a wide range of purposes, including sensory calibration and predictive control.

Appendix: Technical Details

Explicit details of the forward and recurrent algorithms for the robotic application are supplied below. This is a vanilla RBF implementation, since our intention is to concentrate on the nature of the error signal and the learning rule. Both accuracy and learning speed could be greatly improved by optimizing the choice of centers and transforming to an optimal basis of receptive fields. The forward kinematics for the 2-dof robot arm with arm lengths $(l_1, l_2)$ is given by

$$
\begin{aligned}
x_1 &= P_1(m_1, m_2) = l_1 \cos m_1 + l_2 \cos(m_1 + m_2) \\
x_2 &= P_2(m_1, m_2) = l_1 \sin m_1 + l_2 \sin(m_1 + m_2).
\end{aligned}
\tag{A.1}
$$

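The brainstem component of the controller is defined as the exact inverse kinematics for a robot with slightly different arm lengths $(l_1', l_2')$,

$$
\begin{aligned}
m_1 &= B_1(x_1, x_2) = \tan^{-1}\!\left(\frac{x_2}{x_1}\right) - \tan^{-1}\!\left(\frac{l_2' \sin \xi\theta}{l_1' + l_2' \cos \xi\theta}\right) \\
m_2 &= B_2(x_1, x_2) = \xi\theta,
\end{aligned}
\tag{A.2}
$$

where

$$
\theta = \cos^{-1}\!\left(\frac{x_1^2 + x_2^2 - l_1'^2 - l_2'^2}{2 l_1' l_2'}\right),
\tag{A.3}
$$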

and the choice of $\xi = \pm 1$ determines the arm configuration. In the forward architecture, the parallel fiber signals are given by gaussian RBFs

$$
p_j = G_j(x_1, x_2) = \exp\!\left(-\frac{1}{2\sigma^2}\left((x_1 - c_{1j})^2 + (x_2 - c_{2j})^2\right)\right),
\tag{A.4}
$$

with the centers $(c_{1j}, c_{2j})$ chosen on a rectangular grid covering the work space and with $\sigma$ equal to $2^{1/2}$ times the maximum grid spacing (see Figure 4). The forward architecture implies the following expression for

the motor commands:

$$
\begin{aligned}
m_1 &= B_1(x_1, x_2) + \sum_j w_{1j} p_j \\
m_2 &= B_2(x_1, x_2) + \sum_j w_{2j} p_j.
\end{aligned}
\tag{A.5}
$$
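As an informal illustration of equations A.1 to A.3, the following sketch implements the forward kinematics and the brainstem inverse in plain Python. It is not the authors' code: the function names are hypothetical, `math.atan2` is used in place of $\tan^{-1}$ of a ratio so that all quadrants are handled, and the inverse is written with $l_1 + l_2 \cos\xi\theta$ in the denominator, which is the standard two-link form and is assumed here.

```python
import math


def forward_kinematics(m1, m2, l1, l2):
    """Equation A.1: joint angles (m1, m2) -> end-point position (x1, x2)."""
    x1 = l1 * math.cos(m1) + l2 * math.cos(m1 + m2)
    x2 = l1 * math.sin(m1) + l2 * math.sin(m1 + m2)
    return x1, x2


def brainstem_inverse(x1, x2, l1, l2, xi=1):
    """Equations A.2-A.3: exact inverse kinematics for arm lengths (l1, l2).

    xi = +1 or -1 selects the arm configuration.
    """
    cos_theta = (x1 ** 2 + x2 ** 2 - l1 ** 2 - l2 ** 2) / (2.0 * l1 * l2)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    m2 = xi * math.acos(cos_theta)
    # atan2 replaces tan^-1 of a ratio (an implementation choice, not from the text).
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return m1, m2
```

If the brainstem is given the true arm lengths, `brainstem_inverse` undoes `forward_kinematics`. If both assumed lengths are, say, 10% too long, the arm reaches the target scaled by 1/1.1, which is the kind of miscalibration the adaptive component has to remove.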

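The unknown weights $w_{ij}$ were initially set to 0. The two components of sensory error are given by the formula

$$
\begin{aligned}
\delta x_1 &= P_1(m_1, m_2) - x_1 \\
\delta x_2 &= P_2(m_1, m_2) - x_2,
\end{aligned}
\tag{A.6}
$$

from which the two components of motor error are estimated using a $2 \times 2$ reference matrix $R$:

$$
\begin{pmatrix} \delta m_1^{\mathrm{est}} \\ \delta m_2^{\mathrm{est}} \end{pmatrix}
=
\begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}
\begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix}.
\tag{A.7}
$$

RBF weights are updated using the learning rule

$$
w_{ij}^{\mathrm{new}} = w_{ij}^{\mathrm{old}} - \beta\, \delta m_i^{\mathrm{est}}\, p_j.
\tag{A.8}
$$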

In the alternative recurrent architecture, the PF signals are given by RBFs,

$$
p_j = G_j(m_1, m_2) = \exp\!\left(-\frac{1}{2\sigma^2}\left((m_1 - c_{1j})^2 + (m_2 - c_{2j})^2\right)\right),
\tag{A.9}
$$

with centers chosen on a rectangular grid in motor space. The grid was chosen to cover the image of the work space in motor space (see Figure 6b). The recurrent architecture of Figure 5 implies the following equation for motor commands

$$
\begin{aligned}
m_1 &= B_1\Big(x_1 + \sum_j w_{1j} p_j,\; x_2 + \sum_j w_{2j} p_j\Big) \\
m_2 &= B_2\Big(x_1 + \sum_j w_{1j} p_j,\; x_2 + \sum_j w_{2j} p_j\Big).
\end{aligned}
\tag{A.10}
$$

This is an implicit equation for the motor commands since the $p_j$ depend on the $m_i$ via equation A.9. Its solution was obtained at each trial by iterating $\mathbf{m}_{n+1} = B(\mathbf{x} + C(\mathbf{m}_n))$ to convergence (a relative accuracy of $10^{-4}$ required at most 10 iterations in the simulations reported here). This off-line iteration is necessary to allow the arm to make discontinuous movements between randomly chosen gridpoints. If waypoints $\mathbf{x}_n$ are closely sampled from a continuous curve, the more natural alternative online procedure $\mathbf{m}_{n+1} = B(\mathbf{x}_n + C(\mathbf{m}_n))$ can be used.
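The off-line iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the simulations: the function names, the grid of centers, the value of `sigma`, the weights, and the stopping test (change in the motor command relative to its norm) are all hypothetical choices, and the inverse kinematics uses the standard two-link form.

```python
import math


def brainstem_inverse(x1, x2, l1, l2, xi=1):
    """Inverse kinematics B for assumed arm lengths (l1, l2)."""
    cos_theta = (x1 ** 2 + x2 ** 2 - l1 ** 2 - l2 ** 2) / (2.0 * l1 * l2)
    m2 = xi * math.acos(max(-1.0, min(1.0, cos_theta)))
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2), l1 + l2 * math.cos(m2))
    return m1, m2


def rbf_features(m, centers, sigma):
    """Gaussian RBFs p_j evaluated at the motor command m (motor-space centers)."""
    return [math.exp(-((m[0] - c[0]) ** 2 + (m[1] - c[1]) ** 2) / (2.0 * sigma ** 2))
            for c in centers]


def cerebellar_output(m, weights, centers, sigma):
    """C(m): one weighted sum of the p_j per sensory coordinate."""
    p = rbf_features(m, centers, sigma)
    return [sum(w * pj for w, pj in zip(row, p)) for row in weights]


def solve_motor_command(x, weights, centers, sigma, lengths, tol=1e-4, max_iter=50):
    """Iterate m_{n+1} = B(x + C(m_n)) until the relative change is below tol."""
    m = brainstem_inverse(x[0], x[1], lengths[0], lengths[1])  # start with C = 0
    for n_iter in range(1, max_iter + 1):
        c = cerebellar_output(m, weights, centers, sigma)
        m_next = brainstem_inverse(x[0] + c[0], x[1] + c[1], lengths[0], lengths[1])
        change = math.hypot(m_next[0] - m[0], m_next[1] - m[1])
        scale = max(math.hypot(m_next[0], m_next[1]), 1e-12)
        m = m_next
        if change / scale < tol:
            return m, n_iter
    raise RuntimeError("fixed-point iteration did not converge")
```

With all weights at zero the loop returns $B(\mathbf{x})$ after one pass. With small nonzero weights the map is a contraction away from the arm's singular configurations, so a handful of passes suffice; convergence is not guaranteed for large weights or targets near the edge of the work space.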


Again the unknown weights $w_{ij}$ were initially set to 0. Sensory error obtained from equation A.6 above was used directly in the learning rule

$$
w_{ij}^{\mathrm{new}} = w_{ij}^{\mathrm{old}} - \beta\, \delta x_i\, p_j.
\tag{A.11}
$$
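The difference between the two learning rules is small in code but important in principle, and the sketch below puts them side by side. The function names and the list-of-lists weight layout (`weights[i][j]` for output `i` and RBF `j`) are illustrative assumptions. The forward rule first needs a reference matrix `R` to convert sensory error into an estimate of motor error, whereas the recurrent rule uses the sensory error as it stands.

```python
def forward_update(weights, dx, p, R, beta):
    """Forward architecture: estimate motor error with R, then update the weights."""
    dm_est = [R[0][0] * dx[0] + R[0][1] * dx[1],
              R[1][0] * dx[0] + R[1][1] * dx[1]]
    return [[w - beta * dm_est[i] * pj for w, pj in zip(weights[i], p)]
            for i in range(2)]


def recurrent_update(weights, dx, p, beta):
    """Recurrent architecture: the sensory error dx is used directly."""
    return [[w - beta * dx[i] * pj for w, pj in zip(weights[i], p)]
            for i in range(2)]
```

The two updates coincide only when `R` is the identity. Any other choice of `R` encodes knowledge about the plant that the recurrent rule does not require.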

Values for the optimal weight values wij∗ are required to calculate the sum square weight errors plotted in Figure 6. These were obtained by direct batch minimization of sum square reaching error calculated over a subsampled grid of points in the work space. We have primarily investigated convergence in the circumstances in which it is believed to operate biologically, that is, in repair shop mode with changes in plant characteristics of 10% to 15%. However, simulations have indicated that it also converges from initial weights corresponding to grossly degraded performance, although we have not investigated this systematically. Acknowledgments This work was supported by EPSRC grant GR/T10602/01 under their Novel Computation Initiative. References Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61. Andersson, G., Garwicz, M., & Hesslow, G. (1988). Evidence for a GABA-mediated cerebellar inhibition of the inferior olive in the cat. Experimental Brain Research, 72, 450–456. Apps, R., & Garwicz, M. (2005). Anatomical and physiological foundations of cerebellar information processing. Nature Reviews Neuroscience, 6(4), 297–311. Boyden, E. S., Katoh, A., & Raymond, J. L. (2004). Cerebellum-dependent learning: The role of multiple plasticity mechanisms. Annual Review of Neuroscience, 27, 581–609. Brindley, G. S. (1964). The use made by the cerebellum of the information that it receives from sense organs. International Brain Research Organization Bulletin, 3, 80. Carpenter, R. H. S. (1988). Movements of the eyes (2nd ed.). London: Pion. De Zeeuw, C. I., Simpson, J. I., Hoogenraad, C. C., Galjart, N., Koekkoek, S. K. E., & Ruigrok, T. J. H. (1998). Microcircuitry and function of the inferior olive. Trends in Neurosciences, 21(9), 391–400. Dean, P., Mayhew, J. E. W., Thacker, N., & Langdon, P. M. (1991). Saccade control in a simulated robot camera-head system: Neural net architectures for efficient learning of inverse kinematics. 
Biological Cybernetics, 66, 27–36. Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society of London, Series B, 269(1503), 1895–1904.


Dean, P., Porrill, J., & Stone, J. V. (2004). Visual awareness and the cerebellum: Possible role of decorrelation control. Progress in Brain Research, 144, 61–75. Eccles, J. C., Ito, M., & Szent´agothai, J. (1967). The cerebellum as a neuronal machine. Berlin: Springer-Verlag. Fujita, M. (1982). Adaptive filter model of the cerebellum. Biological Cybernetics, 45, 195–206. Galiana, H. L., & Outerbridge, J. S. (1984). A bilateral model for central neural pathways in vestibuloocular reflex. Journal of Neurophysiology, 51(2), 210–241. Gibson, A. R., Horn, K. M., & Pong, M. (2002). Inhibitory control of olivary discharge. Annals of the New York Academy of Sciences, 978, 219–231. Glasauer, S. (2003). Cerebellar contribution to saccades and gaze holding—a modeling approach. Annals of the New York Academy of Sciences, 1004, 206–219. Gomi, H., & Kawato, M. (1992). Adaptive feedback control models of the vestibulocerebellum and spinocerebellum. Biological Cybernetics, 68(2), 105–114. Gomi, H., & Kawato, M. (1993). Neural network control for a closed-loop system using feedback-error-learning. Neural Networks, 6, 933–946. Highstein, S. M., Porrill, J., & Dean, P. (2005). Report on a workshop concerning the cerebellum and motor learning. Held in St Louis October 2004. Cerebellum, 4, 1–11. Ito, M. (1970). Neurophysiological aspects of the cerebellar motor control system. International Journal of Neurology (Montevideo), 7, 162–176. Ito, M. (1984). The cerebellum and neural control. New York: Raven Press. Ito, M. (1997). Cerebellar microcomplexes. International Review of Neurobiology, 41, 475–487. Jordan, M. I. (1996). Computational aspects of motor control and motor learning. In H. Heuer & S. Keele (Eds.), Handbook of perception and action, Vol. 2: Motor skills (pp. 71–120). London: Academic Press. Jordan, M. I., & Wolpert, D. M. (2000). Computational motor control. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 601–618). Cambridge MA: MIT Press. 
Keifer, J., & Houk, J. C. (1994). Motor function of the cerebellorubrospinal system. Physiological Reviews, 74(3), 509–542. Kelly, R. M., & Strick, P. L. (2003). Cerebellar loops with motor cortex and prefrontal cortex of a nonhuman primate. Journal of Neuroscience, 23(23), 8432–8444. Kobayashi, Y., Kawano, K., Takemura, A., Inoue, Y., Kitama, T., Gomi, H., & Kawato, M. (1998). Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys II. Complex spikes. Journal of Neurophysiology, 80(2), 832–848. Lisberger, S. G. (1998). Cerebellar LTD: A molecular mechanism of behavioral learning? Cell, 92(6), 701–704. Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman. Mayhew, J. E. W., Zheng, Y., & Cornell, S. (1992). The adaptive control of a fourdegrees-of-freedom stereo camera head. Philosophical Transactions of the Royal Society of London, Series B, 337, 315–326.


Miles, F. A., Fuller, J. H., Braitman, D. J., & Dow, B. M. (1980). Long-term adaptive changes in primate vestibuloocular reflex. III. Electrophysiological observations in flocculus of normal monkeys. Journal of Neurophysiology, 43, 1437–1476. Miyamura, A., & Kimura, H. (2002). Stability of feedback error learning scheme. Systems and Control Letters, 45, 303–316. Nakanishi, J., & Schaal, S. (2004). Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10), 1453–1465. Ohtsuka, K., & Noda, H. (1992). Burst discharges of mossy fibers in the oculomotor vermis of macaque monkeys during saccadic eye movements. Neuroscience Research, 15(1–2), 102–114. Patino, H. D., & Liu, D. (2000). Neural network–based model reference adaptive control system. IEEE Transactions on Systems, Man and Cybernetics—Part B: Cybernetics, 30, 198–203. Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 6(5), 1212–1228. Porrill, J., & Dean, P. (2004). Recurrent cerebellar loops simplify adaptive control of redundant and nonlinear motor systems. In 2004 Abstract Viewer/Itinerary Planner (pp. Prog. No. 989.984). Washington, DC: Society for Neuroscience. Porrill, J., Dean, P., & Stone, J. V. (2004). Recurrent cerebellar architecture solves the motor error problem. Proceedings of the Royal Society of London, Series B, 271, 789–796. Robinson, D. A. (1974). The effect of cerebellectomy on the cat’s vestibulo-ocular integrator. Brain Research, 71, 195–207. Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321. Simpson, J. I., Belton, T., Suh, M., & Winkelman, B. (2002). Complex spike activity in the flocculus signals more than the eye can see. Annals of the New York Academy of Sciences, 978, 232–236. Simpson, J. I., Wylie, D. R., & De Zeeuw, C. I. (1996). On climbing fiber signals and their consequence(s). 
Behavioral and Brain Sciences, 19(3), 384–398. Slotine, J. J. E., & Li, W. P. (1989). Composite adaptive-control of robot manipulators. Automatica, 25(4), 509–519. Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Upper Saddle River, NJ: Prentice Hall. Yamamoto, K., Kobayashi, Y., Takemura, A., Kawano, K., & Kawato, M. (2002). Computational studies on acquisition and adaptation of ocular following responses based on cerebellar synaptic plasticity. Journal of Neurophysiology, 87(3), 1554–1571.

Received August 8, 2005; accepted May 30, 2006.

LETTER

Communicated by Stephen José Hanson

Free-Lunch Learning: Modeling Spontaneous Recovery of Memory

J. V. Stone
Psychology Department, Sheffield University, Sheffield S10 2TP, England

P. E. Jupp
School of Mathematics and Statistics, St. Andrews University, St. Andrews KY16 9SS, Scotland

Neural Computation 19, 194–217 (2007). © 2006 Massachusetts Institute of Technology

After a language has been learned and then forgotten, relearning some words appears to facilitate spontaneous recovery of other words. More generally, relearning partially forgotten associations induces recovery of other associations in humans, an effect we call free-lunch learning (FLL). Using neural network models, we prove that FLL is a necessary consequence of storing associations as distributed representations. Specifically, we prove that (1) FLL becomes increasingly likely as the number of synapses (connection weights) increases, suggesting that FLL contributes to memory in neurophysiological systems, and (2) the magnitude of FLL is greatest if inactive synapses are removed, suggesting a computational role for synaptic pruning in physiological systems. We also demonstrate that FLL is different from generalization effects conventionally associated with neural network models. As FLL is a generic property of distributed representations, it may constitute an important factor in human memory.

1 Introduction

A popular aphorism states that “there’s no such thing as a free lunch.” However, in the context of learning theory, we propose that there is. In previous work, free-lunch learning (FLL) has been demonstrated using a task in which participants learned the positions of letters on a nonstandard computer keyboard (Stone, Hunkin, & Hornby, 2001). After a period of forgetting, participants relearned a proportion of these letter positions. Crucially, it was found that this relearning induced recovery of the nonrelearned letter positions. Preliminary results suggest that FLL also occurs using face stimuli. If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations, so that relearning


Figure 1: Free-lunch learning protocol. Two subsets of associations $A_1$ and $A_2$ are learned. After partial forgetting (see text), performance error $E_{\mathrm{pre}}$ on subset $A_1$ is measured. Subset $A_2$ is then relearned to preforgetting levels of performance, and performance error $E_{\mathrm{post}}$ on subset $A_1$ is remeasured. If $E_{\mathrm{post}} < E_{\mathrm{pre}}$, then FLL has occurred, and the amount of FLL is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$.

some old and partially forgotten associations affects the integrity of other old associations. Using neural network models, we show that relearning some associations does not disrupt other stored associations but actually restores them. In essence, recovery occurs in neural network models because each association is distributed among all connection weights (synapses) between units (model neurons). After partial forgetting, relearning some of the associations forces all of the weights closer to preforgetting values, resulting in improved performance even on nonrelearned associations.

1.1 The Geometry of Free-Lunch Learning. The protocol used to examine FLL here is as follows (see Figure 1). First, learn a set of $n_1 + n_2$ associations $A = A_1 \cup A_2$ consisting of two intermixed subsets $A_1$ and $A_2$ of $n_1$ and $n_2$ associations, respectively. After all learned associations $A$ have been partially forgotten, measure performance on subset $A_1$. Finally, relearn only subset $A_2$, and then remeasure performance on subset $A_1$. FLL occurs if relearning subset $A_2$ improves performance on $A_1$. Unless stated otherwise, we assume that for a network with $n$ connection weights, $n \ge n_1 + n_2$. For the present, we assume that the network has one output unit and two input units, which implies $n = 2$ connection weights and that $A_1$ and $A_2$ each consist of $n_1 = n_2 = 1$ association, as in Figure 2. Input units are


Figure 2: Geometry of free-lunch learning. (a) A network with two input units and one output unit, with connection weights $w_a$ and $w_b$, defines a weight vector $\mathbf{w} = (w_a, w_b)$. The network learns two associations $A_1$ and $A_2$, where (for example) $A_1$ is the mapping from input vector $\mathbf{x}_1 = (x_{11}, x_{12})$ to desired output value $d_1$; learning consists of adjusting $\mathbf{w}$ until the network output $y_1 = \mathbf{w} \cdot \mathbf{x}_1$ equals $d_1$. (b) Each association $A_1$ and $A_2$ defines a constraint line $L_1$ and $L_2$, respectively. The intersection of $L_1$ and $L_2$ defines a point $\mathbf{w}_0$ that satisfies both constraints, so that zero error on $A_1$ and $A_2$ is obtained if $\mathbf{w} = \mathbf{w}_0$. After partial forgetting, $\mathbf{w}$ is a randomly chosen point $\mathbf{w}_1$ on the circle $C$ with radius $r$, and performance error $E_{\mathrm{pre}}$ on $A_1$ is the squared distance $p^2$. After relearning $A_2$, the weight vector $\mathbf{w}_2$ is in $L_2$, and performance error $E_{\mathrm{post}}$ on $A_1$ is $q^2$. FLL occurs if $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}} > 0$, or equivalently if $Q = p^2 - q^2 > 0$. Relearning $A_2$ has one of three possible effects, depending on the position of $\mathbf{w}_1$ on $C$: (1) if $\mathbf{w}_1$ is under the larger (dashed) arc $C_{\mathrm{FLL}}$ as shown here, then $p^2 > q^2$ ($\delta > 0$) and therefore FLL is observed; (2) if $\mathbf{w}_1$ is under the smaller (dotted) arc, then $p^2 < q^2$ ($\delta < 0$), and therefore negative FLL is observed; and (3) if $\mathbf{w}_1$ is at the critical point $\mathbf{w}_{\mathrm{crit}}$, then $p^2 = q^2$ ($\delta = 0$). Given that $\mathbf{w}_1$ is a randomly chosen point on $C$ and that the length of $C_{\mathrm{FLL}}$ is $S_{\mathrm{FLL}}$, the probability of FLL is $P(\delta > 0) = S_{\mathrm{FLL}}/\pi r$ (i.e., the proportion of $C_{\mathrm{FLL}}$ under the upper semicircle of $C$).

connected to the output unit via weights $w_a$ and $w_b$, which define a weight vector $\mathbf{w} = (w_a, w_b)$. Associations $A_1$ and $A_2$ consist of different mappings from the input vectors $\mathbf{x}_1 = (x_{11}, x_{12})$ and $\mathbf{x}_2 = (x_{21}, x_{22})$ to desired output values $d_1$ and $d_2$, respectively. If a network is presented with input vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, then its output values are $y_1 = \mathbf{w} \cdot \mathbf{x}_1 = w_a x_{11} + w_b x_{12}$ and $y_2 = \mathbf{w} \cdot \mathbf{x}_2 = w_a x_{21} + w_b x_{22}$, respectively. Network performance error for $k = 2$ associations is defined as $E(\mathbf{w}, A) = \sum_{i=1}^{k} (d_i - y_i)^2$.


The weight vector $\mathbf{w}$ defines a point in the $(w_a, w_b)$-plane. For an input vector $\mathbf{x}_1$, there are many different combinations of weight values $w_a$ and $w_b$ that give the desired output $d_1$. These combinations lie on a straight line $L_1$, because the network output is a linear weighted sum of input values. A corresponding constraint line $L_2$ exists for $A_2$. The intersection of $L_1$ and $L_2$ therefore defines the only point $\mathbf{w}_0$ that satisfies both constraints, so that zero error on $A_1$ and $A_2$ is obtained if and only if $\mathbf{w} = \mathbf{w}_0$. Without loss of generality, we define the origin $\mathbf{w}_0$ to be the intersection of $L_1$ and $L_2$.

We now consider the geometric effect of partial forgetting of both associations, followed by relearning $A_2$. This geometric account applies to a network with two weights (see Figure 2) and depends on the following observation: if the length of the input vector is $\|\mathbf{x}_1\| = 1$, then the performance error $E(\mathbf{w}, A_1) = (d_1 - y_1)^2$ of a network with weight vector $\mathbf{w}$ when tested on association $A_1$ is equal to the squared distance between $\mathbf{w}$ and the constraint line $L_1$ (see appendix C). For example, if $\mathbf{w}$ is in $L_1$, then $E(\mathbf{w}, A_1) = 0$, but as the distance between $\mathbf{w}$ and $L_1$ increases, so $E(\mathbf{w}, A_1)$ must increase. For the purposes of this geometric account, we assume that $\|\mathbf{x}_1\| = \|\mathbf{x}_2\| = 1$.

Partial forgetting is induced by adding isotropic noise $\mathbf{v}$ to the weight vector $\mathbf{w} = \mathbf{w}_0$. This effectively moves $\mathbf{w}$ to a randomly chosen point $\mathbf{w}_1 = \mathbf{w}_0 + \mathbf{v}$ on the circle $C$ of radius $r = \|\mathbf{v}\|$, where $r$ represents the amount of forgetting. For a network with $\mathbf{w} = \mathbf{w}_1$, learning $A_2$ moves $\mathbf{w}$ to the nearest point $\mathbf{w}_2$ on $L_2$ (see appendix B), so that $\mathbf{w}_2$ is the orthogonal projection of $\mathbf{w}_1$ on $L_2$. Before relearning $A_2$, the performance error $E_{\mathrm{pre}}$ on $A_1$ is the squared distance $p^2$ between $\mathbf{w}_1$ and its orthogonal projection on $L_1$ (see appendix C). After relearning $A_2$, the performance error $E_{\mathrm{post}}$ is the squared distance $q^2$ between $\mathbf{w}_2$ and its orthogonal projection on $L_1$. The amount of FLL is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$ and, for a network with two weights, is equal to $Q = p^2 - q^2$. The probability $P(\delta > 0)$ of FLL given $L_1$ and $L_2$ is equal to the proportion of points on $C$ for which $\delta > 0$ (or, equivalently, for which $Q > 0$). For example, averaging over all subsets $A_1$ and $A_2$, the probability that relearning $A_2$ induces FLL of $A_1$ is $P(\delta > 0) = 0.68$ (see Figure 5), a probability that increases with the number of weights (see theorem 3).

If we drop the assumption that a network has only two input units, then we can consider subsets $A_1$ and $A_2$ with $n_1 > 1$ and $n_2 > 1$ associations. If the number of connection weights $n \ge \max(n_1, n_2)$, then $A_1$ and $A_2$ define an $(n - n_1)$-dimensional subspace $L_1$ and an $(n - n_2)$-dimensional subspace $L_2$, respectively. The intersection of $L_1$ and $L_2$ corresponds to weight vectors that generate zero error on $A = A_1 \cup A_2$. Finally, we can drop the assumption that a network has only one output unit, because the connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.
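The two-weight geometry lends itself to a quick Monte Carlo check. The sketch below is an illustration written for this account, not the simulation code behind Figure 5: the function names, the number of runs, and the seed are arbitrary, and input vectors are drawn with unit length and uniformly random direction, as assumed in the geometric argument. With $\mathbf{w}_0$ at the origin, both constraint lines pass through the origin, so the error on $A_1$ is simply $(\mathbf{w} \cdot \mathbf{x}_1)^2$.

```python
import math
import random


def fll_trial(rng, r=1.0):
    """One run of the two-weight protocol; returns delta = p^2 - q^2."""
    # Unit-length input vectors with random directions.
    a1 = rng.uniform(0.0, 2.0 * math.pi)
    a2 = rng.uniform(0.0, 2.0 * math.pi)
    x1 = (math.cos(a1), math.sin(a1))
    x2 = (math.cos(a2), math.sin(a2))
    # Forgetting: w1 is a random point on the circle C of radius r around w0 = 0.
    t = rng.uniform(0.0, 2.0 * math.pi)
    w1 = (r * math.cos(t), r * math.sin(t))
    # p: signed distance from w1 to L1 = {w : w . x1 = 0}, so E_pre = p^2.
    p = w1[0] * x1[0] + w1[1] * x1[1]
    # Relearning A2: orthogonal projection of w1 onto L2 = {w : w . x2 = 0}.
    b = w1[0] * x2[0] + w1[1] * x2[1]
    w2 = (w1[0] - b * x2[0], w1[1] - b * x2[1])
    # q: signed distance from w2 to L1, so E_post = q^2.
    q = w2[0] * x1[0] + w2[1] * x1[1]
    return p * p - q * q


def estimate_fll(runs=20000, seed=0):
    """Estimate P(delta > 0) and the mean of delta over random A1, A2, and w1."""
    rng = random.Random(seed)
    deltas = [fll_trial(rng) for _ in range(runs)]
    prob = sum(1 for d in deltas if d > 0) / runs
    mean = sum(deltas) / runs
    return prob, mean
```

Under these assumptions the estimated probability should come out close to the value of 0.68 quoted above, and the mean of $\delta$ close to $r^2/4$.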


2 Methods

Given a network with $n$ input units and one output unit, the set $A$ of associations consisted of $k$ input vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_k)$ and $k$ corresponding desired scalar output values $(d_1, \ldots, d_k)$. Each input vector comprises $n$ elements $\mathbf{x} = (x_1, \ldots, x_n)$. The values of $x_i$ and $d_i$ were chosen from a gaussian distribution with unit variance (i.e., $\sigma_x^2 = \sigma_d^2 = 1$). A network’s output $y_i$ is a weighted sum of input values, $y_i = \mathbf{w} \cdot \mathbf{x}_i = \sum_{j=1}^{n} w_j x_{ij}$, where $x_{ij}$ is the $j$th value of the $i$th input vector $\mathbf{x}_i$, and each weight $w_j$ is one input-output connection.

Given that the network error for a given set of $k$ associations is $E(\mathbf{w}, A) = \sum_{i=1}^{k} (d_i - y_i)^2$, the derivative $\nabla E(\mathbf{w}) = -2 \sum_{i=1}^{k} (d_i - y_i)\mathbf{x}_i$ of $E$ with respect to $\mathbf{w}$ yields the delta learning rule $\mathbf{w}_{\mathrm{new}} = \mathbf{w}_{\mathrm{old}} - \eta \nabla E(\mathbf{w}_{\mathrm{old}})$, where $\eta$ is the learning rate, which is adjusted according to the number of weights. A learning trial consists of presenting the $k$ input vectors to the network and then updating the weights using the delta rule. Learning was stopped when $\|\nabla E(\mathbf{w})\| < k \times 0.001$, where $\|\nabla E(\mathbf{w})\|$ is the magnitude of the gradient.

Initial learning of the $k = n$ associations in $A = A_1 \cup A_2$ was performed by solving a set of $n$ simultaneous equations using a standard method, after which perfect performance on all $n$ associations was obtained. Partial forgetting was induced by adding an isotropic noise vector $\mathbf{v}$ with $r = \|\mathbf{v}\| = 1$. Relearning the $n_2 = n/2$ associations in $A_2$ was implemented with $k = n_2$ using the delta rule.

3 Results

Our four main theorems are summarized here, and proofs are provided in the appendixes. These theorems apply to a network with $n$ weights that learns $n_1 + n_2$ associations $A = A_1 \cup A_2$ and, after partial forgetting, relearns the $n_2$ associations in $A_2$.

Theorem 1. The probability $P(\delta > 0)$ of FLL is greater than 0.5.

Theorem 2. The expected amount of FLL per association in $A_1$ is

$$
E[\delta/n_1] = \frac{n_2}{n^2}\, E[\|\mathbf{x}\|^2]\, E[\|\mathbf{v}\|^2].
\tag{3.1}
$$

For given values of $E[\|\mathbf{x}\|^2]$ and $E[\|\mathbf{v}\|^2]$, the value of $n_2$ that maximizes $E[\delta/n_1]$ (subject to $n_1 + n_2 \le n$) is $n_2 = n - n_1$.

If each input vector $\mathbf{x} = (x_1, \ldots, x_n)$ is chosen from an isotropic (e.g., isotropic gaussian) distribution and the variance of $x_i$ is $\sigma_x^2$, then $E[\|\mathbf{x}\|^2] = n\sigma_x^2$. If $\sigma_x^2$ is the same for all $n$, then a neuron (with a typical sigmoidal transfer function) would be constantly saturated as the number of synapses increases. One way to prevent this saturation is to assume that the efficacy of synapses on a given neuron decreases as the number of synapses increases. If forgetting is caused primarily by learning spurious inputs, then the delta learning rule used here implies that the “amount of forgetting” $\|\mathbf{v}\|$ is approximately independent of $n$. We therefore assume that $\|\mathbf{v}\|$ and $\sigma_x^2$ are constant, and for convenience, we set $\|\mathbf{v}\| = 1$ and $\sigma_x^2 = 1$. Substituting these values into equation 3.1 yields

$$
E[\delta/n_1] = \frac{n_2}{n}.
\tag{3.2}
$$
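Equation 3.2 can be checked numerically without solving any linear systems. The sketch below is illustrative rather than the simulation behind Figure 3, and rests on two facts from the geometric account: after perfect initial learning the error on an association is $((\mathbf{w} - \mathbf{w}_0) \cdot \mathbf{x}_i)^2$, and relearning $A_2$ from $\mathbf{w}_1$ moves the weights to the orthogonal projection of $\mathbf{w}_1$ onto $L_2$. The function names, run counts, and seeds are hypothetical, and the projection is computed by Gram-Schmidt rather than by running the delta rule.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project_out(v, vectors):
    """Return v minus its component in span(vectors), using Gram-Schmidt."""
    basis = []
    for x in vectors:
        u = list(x)
        for e in basis:
            c = dot(u, e)
            u = [ui - c * ei for ui, ei in zip(u, e)]
        norm = math.sqrt(dot(u, u))
        if norm > 1e-12:
            basis.append([ui / norm for ui in u])
    out = list(v)
    for e in basis:
        c = dot(out, e)
        out = [oi - c * ei for oi, ei in zip(out, e)]
    return out


def fll_per_association(n, n1, n2, rng):
    """One run of the protocol; returns delta / n1 (requires n1 + n2 <= n)."""
    x_a1 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n1)]
    x_a2 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n2)]
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(g, g))
    v = [gi / norm for gi in g]                      # isotropic noise with ||v|| = 1
    e_pre = sum(dot(v, x) ** 2 for x in x_a1)        # error on A1 at w1 = w0 + v
    v_after = project_out(v, x_a2)                   # w2 - w0 after relearning A2
    e_post = sum(dot(v_after, x) ** 2 for x in x_a1)
    return (e_pre - e_post) / n1


def mean_fll(n, n1, n2, runs=600, seed=0):
    rng = random.Random(seed)
    return sum(fll_per_association(n, n1, n2, rng) for _ in range(runs)) / runs
```

For example, `mean_fll(20, 10, 10)` should be close to $n_2/n = 0.5$, whereas `mean_fll(40, 10, 10)`, which has 20 redundant weights, should be close to 0.25.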

Using these assumptions, simulations of networks with $n = 2$ and $n = 100$ weights agree with equation 3.2, as shown in Figure 3.

The role of pruning can be demonstrated as follows. Consider a network with 100 input units and one output unit with $n = 100$ weights. If $n_2 = 90$ associations are relearned out of an original set of $n_1 + n_2 = 100$ associations, then $E[\delta/n_1] = n_2/n = 0.90$. However, if $n = 1000$, then $E[\delta/n_1] = 0.09$. In general, as the number $n - (n_1 + n_2)$ of unpruned redundant weights increases, so $E[\delta/n_1]$ decreases. Therefore, $E[\delta/n_1]$ is maximized if $n_1 + n_2 = n$. If $n_1 + n_2 < n$, then the expected amount of FLL is not maximal and can therefore be increased by pruning redundant weights until $n = n_1 + n_2$ (see Figure 4).

Note that for a particular network, performance error $E_{\mathrm{post}}$ on $A_1$ after learning $A_2$ can be zero. For example, if $\mathbf{w} = \mathbf{w}^*$ in Figure 2, then $p = q = 0$, which implies that $\delta/n_1 = E_{\mathrm{post}} = q^2 = 0$.

Theorem 3. The probability $P(\delta > 0)$ of FLL of $A_1$ satisfies

$$
P(\delta > 0) > 1 - \frac{a_0(n, n_1, n_2) + a_1(n, n_2)\, \mathrm{var}(\|\mathbf{x}\|^2)/E[\|\mathbf{x}\|^2]^2}{n_1 n_2 (n + 2)^2},
\tag{3.3}
$$

where

$$
a_0(n, n_1, n_2) = 2\left\{ n_1 (n + 2)(n - n_2) + n(n - n_2) + n(n + 2)(n - 1) \right\}
\tag{3.4}
$$

$$
a_1(n, n_2) = n_2 (2n + n_2 + 6).
\tag{3.5}
$$

Theorem 3 implies that if the numbers ($n_1$ and $n_2$) of associations in $A_1$ and $A_2$ are fixed nonzero proportions of the number $n$ of connection weights ($n_1/n$ and $n_2/n$, respectively) and $\mathrm{var}(\|\mathbf{x}\|^2)/(n E[\|\mathbf{x}\|^2]^2) \to 0$ as $n \to \infty$, then $P(\delta > 0) \to 1$ as $n \to \infty$; and the probability that each of the $n_1$ associations in $A_1$ exhibits FLL is $P(\delta/n_1 > 0) = P(\delta > 0)$ because $\delta > 0$ iff $\delta/n_1 > 0$. For example, if we assume that each input vector $\mathbf{x} = (x_1, \ldots, x_n)$ is chosen from an isotropic (e.g., isotropic gaussian) distribution and the variance of $x_i$ is $\sigma_x^2$, then $\mathrm{var}(\|\mathbf{x}\|^2)/E[\|\mathbf{x}\|^2]^2 = 2/n$. This ensures that $\mathrm{var}(\|\mathbf{x}\|^2)/(n E[\|\mathbf{x}\|^2]^2) \to 0$ as $n \to \infty$, and therefore that $P(\delta > 0) \to 1$ as $n \to \infty$. Using this assumption, an approximation of the right-hand side of equation 3.3 yields

$$
P(\delta > 0) > 1 - \frac{2(1 + \alpha_1 - \alpha_1 \alpha_2)}{n \alpha_1 \alpha_2} - \frac{2(2 + \alpha_2 + 6/n)}{\alpha_1 \alpha_2 (n + 2)^2},
\tag{3.6}
$$

where $\alpha_1 = n_1/n$ and $\alpha_2 = n_2/n$. In this form, it is easy to see that $P(\delta > 0) \to 1$ as $n \to \infty$.

We briefly consider the case $n_1 \ge n$ and $n_2 \ge n$, so that each of $L_1$ and $L_2$ is a single point. If the distance $D$ between these points is much less than $\|\mathbf{v}\|$, then simple geometry shows that performance error $E_{\mathrm{pre}}$ on $A_1$ is large and that relearning $A_2$ reduces this error for any $\mathbf{v}$ (i.e., with probability 1) with $E_{\mathrm{post}} \propto D^2$, even in the absence of initial learning of $A_1$ and $A_2$ (see equation A.18 in appendix A). A similar conclusion is implicit in Atkins and Murre (1998).

Theorem 4. If, instead of relearning $A_2$, the network learns a new subset $A_3$ (drawn from the same distribution as $A_2$), then the expected amount of FLL is less than the expected amount of FLL after relearning subset $A_2$.

Learning $A_3$ is analogous to the control condition used with human participants (Stone et al., 2001), and the finding that the amount of recovery after learning $A_3$ is less than the amount of recovery after relearning $A_2$ is predicted by theorem 4.
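To get a feel for how quickly the bound of theorem 3 tightens, the sketch below evaluates the right-hand side of equation 3.3 directly. It is an illustration under explicit assumptions: the coefficient $a_0$ is taken to be twice the sum of all three of its terms, the ratio $\mathrm{var}(\|\mathbf{x}\|^2)/E[\|\mathbf{x}\|^2]^2$ defaults to the isotropic gaussian value $2/n$, and the result is floored at the value 0.5 supplied by theorem 1. The function names are hypothetical.

```python
def a0(n, n1, n2):
    """Coefficient a_0 of equation 3.4 (outer factor of 2 assumed to span all terms)."""
    return 2.0 * (n1 * (n + 2) * (n - n2) + n * (n - n2) + n * (n + 2) * (n - 1))


def a1(n, n2):
    """Coefficient a_1 of equation 3.5."""
    return n2 * (2 * n + n2 + 6)


def fll_lower_bound(n, n1, n2, ratio=None):
    """Lower bound on P(delta > 0) from theorems 1 and 3.

    ratio is var(||x||^2) / E[||x||^2]^2; it defaults to 2/n (isotropic gaussian).
    """
    if ratio is None:
        ratio = 2.0 / n
    bound = 1.0 - (a0(n, n1, n2) + a1(n, n2) * ratio) / (n1 * n2 * (n + 2) ** 2)
    return max(0.5, bound)  # theorem 1 applies when equation 3.3 is uninformative
```

For very small networks such as $n = 2$, equation 3.3 gives a negative number and only theorem 1 is informative; with $n_1 = n_2 = n/2$ the bound is roughly 0.90 at $n = 100$ and about 0.99 at $n = 1000$.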

Figure 3: Distribution of free-lunch learning. (a) Histogram of amount of FLL $\delta/n_1$ per association, based on 1000 runs, for a network with $n = 2$ weights (see section 2). After learning two association subsets ($\eta = 0.1$), $A_1$ and $A_2$, containing $n_1 = 1$ and $n_2 = 1$ associations (respectively), the network has a weight vector $\mathbf{w}_0$. Forgetting is then induced by adding a noise vector $\mathbf{v}$ with $\|\mathbf{v}\|^2 = 1$ to $\mathbf{w}_0$. One association $A_2$ is then relearned, and the change in performance on $A_1$ is measured as $\delta/n_1$ (see Figure 2). Negative values indicate that performance on $A_1$ decreases after relearning $A_2$. (b) Histogram of amount of FLL $\delta/n_1$ per association for a network with $n = 100$ weights and $\eta = 0.005$, with $A_1$ and $A_2$ each consisting of $n_1 = n_2 = 50$ associations, using the same protocol as in (a). In both (a) and (b), the mean value of $\delta/n_1$ is about 0.5, as predicted by equation 3.2. As the number of associations learned increases, the amount of FLL becomes more tightly clustered around $\delta/n_1 = 0.5$, as demonstrated in these two histograms, and the probability of FLL increases (also see Figure 5).


Figure 4: Effect of pruning on free-lunch learning. Graph of the expected amount of FLL per association E[δ/n1 ] as a function of the total number n1 + n2 of learned associations in A = A1 ∪ A2 , as given in equation 3.2. In this example, the number of connection weights is fixed at n = 100, and the number of associations in A = A1 ∪ A2 increases from n1 + n2 = 2 to n1 + n2 = 100. The number n2 of relearned associations in A2 is a constant proportion (0.5) of the associations in A. If n1 + n2 ≤ n, then the network contains n − (n1 + n2 ) unpruned redundant connections. Thus, pruning effectively increases as n1 + n2 increases because, as the number n1 + n2 of associations grows, so the number of unpruned redundant connections decreases. The expected amount of FLL per association E[δ/n1 ] increases as the amount of pruning increases.

4 Discussion

Theorems 1 to 4 provide the first proof that relearning induces nontransient recovery, where postrecovery error is potentially zero. This contrasts with the usually small and transient recovery that occurs during the initial phase of relearning forgotten associations (Hinton & Plaut, 1987; Atkins & Murre, 1998), and during learning of new associations (Harvey & Stone, 1996). In particular, theorem 2 is predictive inasmuch as it suggests that the amount of FLL in humans should be (1) proportional to the amount of forgetting of $A = A_1 \cup A_2$ and (2) proportional to the proportion $n_2/(n_1 + n_2)$ of associations relearned after partial forgetting of $A$.

We have assumed that the number $n_1 + n_2$ of associations $A = A_1 \cup A_2$ encoded by a given neuron is not greater than the number $n$ of input connections (synapses) to that neuron. Given that each neuron typically has


Figure 5: Probability of free-lunch learning. The probability $P(\delta > 0)$ of FLL of associations $A_1$ as a function of the total number $n_1 + n_2$ of learned associations $A = A_1 \cup A_2$ for networks with $n = n_1 + n_2$ weights. Each of the two subsets of associations $A_1$ and $A_2$ consists of $n_1 = n_2 = n/2$ associations. After learning and then partially forgetting $A$, performance on $A_1$ was measured. $P(\delta > 0)$ is the probability that performance on subset $A_1$ is better after subset $A_2$ has been relearned than it is before $A_2$ has been relearned. Solid line: Empirical estimate of $P(\delta > 0)$. Each data point is based on 10,000 runs, where each run uses input vectors chosen from an isotropic gaussian distribution (see section 2). Dashed line: Theoretical lower bound on the probability of FLL, as given by theorems 1 and 3, assuming that input vectors are chosen from an isotropic (e.g., isotropic gaussian) distribution.

many thousands of synapses (e.g., cerebellar Purkinje cells), it seems likely that this assumption is valid. However, the total amount of FLL is maximal if n1 = n2 = n/2, so that the full potential of FLL can be realized only if n1 + n2 = n. This optimum number of synapses can be achieved if inactive (i.e., redundant) synapses are pruned. Pruning may therefore contribute to FLL in physiological systems (Purves & Lichtman, 1980; Goldin, Segal, & Avignone, 2001). We have also assumed that a delta rule is used to learn associations between inputs and desired outputs. This general type of supervised learning is thought to be implemented by the cerebellum and basal ganglia (Doya, 1999). Models of the cerebellum (Dean, Porrill, & Stone, 2002) use a delta rule to implement learning. Similarly, models of the basal ganglia (Nakahara, Itoh, Kawagoe, Takikawa, & Hikosaka, 2004) use a temporally discounted


form of delta rule, the temporal difference rule. This temporal difference rule has also been used to model learning in humans (Seymour et al., 2004), and (under mild conditions) is equivalent to the standard delta rule (Sutton, 1988). Indeed, from a purely computational perspective, it is difficult to conceive how these forms of associative learning could be implemented without some form of delta rule.

Our analysis is based on the assumption that the network model is linear. Of course, many nonlinear networks can be approximated by linear networks, but it is possible that the results derived here have limited applicability to certain classes of nonlinear networks.

Relation to Task Generalization. It is only natural to ask how FLL relates to tasks that a human might learn. One obvious but vital condition for FLL is that different associations must be encoded by a common set of neuronal connections. Aside from this condition, it might be thought that relearning $A_2$ improves performance on $A_1$ because $A_1$ and $A_2$ are somehow related (as in Hanson & Negishi, 2002; Dienes, Altmann, & Gao, 1999), so that learning $A_2$ generalizes to $A_1$. This form of task generalization can occur if $A_1$ and $A_2$ are related as follows. If the input-output pairs in $A_1$ and $A_2$ are sampled from a sufficiently smooth function $f$ and $n_1 \gg n$ and $n_2 \gg n$, then $A_1$ and $A_2$ are statistically related, and therefore the weights induced by learning $A_1$ are similar to those induced by learning $A_2$. Consequently, the resultant network input-output functions $g_1$ and $g_2$ (respectively) both approximate the function $f$ (i.e., $g_1 \approx g_2 \approx f$). In this case, learning $A_2$ yields good performance on $A_1$. In the context of FLL, if $A_1 \cup A_2$ is learned, forgotten, and then $A_2$ is relearned, performance on $A_1$ will also improve. However, the reason for this improvement is obvious and trivial: it is simply that $A_1$ and $A_2$ are statistically related and large enough (i.e., with $n_1 \gg n$ and $n_2 \gg n$) to induce similar network functions.
In contrast, the effect described in this letter does not depend on statistical similarity between A1 and A2 . Crucial assumptions are that n1 + n2 ≤ n, n1 < n, and n2 < n, so that learning the n2 associations in A2 in a network with n weights is underconstrained. This implies that the network function induced by learning A1 has no particular relation to the network function induced by learning A2 , even if A1 and A2 are sampled from the same function f (provided A1 and A2 are disjoint sets). For example, if A1 and A2 each consists of one association sampled from a linear function f (i.e., a line), then learning A2 in a linear network (as in Figure 2a) induces a linear network function g1 (i.e., a line) that intersects with f but is otherwise unconstrained. Thus, learning A2 does not necessarily yield good performance on A1 . The FLL effect reported here depends on relearning after forgetting. To cite an extreme example, if unicycling and learning French were encoded by a common set of neurons, then, after forgetting both, relearning unicycling could improve your French (although the mechanism involved here is unrelated to that described in Harvey & Stone, 1996). Thus, FLL contrasts


with the task generalization outlined above, where it is obvious that both $A_1$ and $A_2$ induce similar network functions. Motivated by the demonstration that recovery occurs in humans (Stone et al., 2001; Coltheart & Byng, 1989; Weekes & Coltheart, 1996) (but not in all studies; see Atkins, 2001), we have proven that FLL occurs in network models. The analysis presented here suggests that FLL is a necessary and generic consequence of storing information in distributed systems rather than a side effect peculiar to a particular class of artificial neural nets. Moreover, the generic nature of FLL suggests that it is largely independent of the type (i.e., artificial or physiological) of network used to learn associations. FLL appears to be a fundamental property of distributed representations. Given the reliance of neuronal systems on distributed representations, FLL may be a ubiquitous feature of learning and memory. It is likely that any organism that did not take advantage of such a fundamental and ubiquitous effect would be at a severe selective disadvantage.

Appendix A: Analysis of Free-Lunch Learning

We proceed by deriving expressions for $E_{pre}$, $E_{post}$, and $\delta = E_{pre} - E_{post}$. We prove that if $n_1 + n_2 \le n$, then the expected value of $\delta$ is positive. We then prove that if $n_1 + n_2 \le n$, the probability $P(\delta > 0)$ of FLL is greater than 0.5, that its lower bound increases with $n$ (if $n_1/n$ and $n_2/n$ are fixed), and that this bound approaches unity as $n$ increases.

A.1 Definition of Performance Error. For an artificial neural network (ANN) with weight vector $w$, we define the performance error for input vectors $x_1, \ldots, x_c$ and desired outputs $d_1, \ldots, d_c$ to be

$$E(x_1, \ldots, x_c; w, d_1, \ldots, d_c) = \sum_{i=1}^{c} (w \cdot x_i - d_i)^2. \tag{A.1}$$

By putting $X = (x_1, \ldots, x_c)^T$, $d = (d_1, \ldots, d_c)^T$ and $E(X; w, d) = E(x_1, \ldots, x_c; w, d_1, \ldots, d_c)$, we can write equation A.1 succinctly as

$$E(X; w, d) = \|Xw - d\|^2. \tag{A.2}$$
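As a small illustration of equations A.1 and A.2, the following sketch evaluates the performance error for a hand-picked example. The function name `performance_error` and the numerical values are assumptions chosen for illustration only; they do not come from the analysis.

```python
def performance_error(X, w, d):
    """Sum of squared output errors, i.e. ||Xw - d||^2 as in equation A.2."""
    total = 0.0
    for x_i, d_i in zip(X, d):
        # Network output for input x_i is the inner product w . x_i.
        output = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        total += (output - d_i) ** 2
    return total


# Hypothetical example: c = 2 associations in a network with n = 3 weights.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
d = [1.0, 0.5]
w = [1.0, 1.0, 0.5]
error = performance_error(X, w, d)
```

Here the first association contributes $(2 - 1)^2 = 1$ and the second contributes $0$, so the error is 1.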

Given a $c \times n$ matrix $X$ and a $c$-dimensional vector $d$, let $L_{X,d}$ be the affine subspace,

$$L_{X,d} = \{ w : X^T X w = X^T d \},$$

of $\mathbb{R}^n$. Since (i) $\mathrm{rk}(X^T X) \le \mathrm{rk}(X)$ and (ii) $X^T X a = 0 \Rightarrow a^T X^T X a = 0 \Rightarrow Xa = 0$, it follows that $\mathrm{rk}(X^T X) = \mathrm{rk}(X)$ (where rk denotes the rank of a matrix), and so

$$L_{X,d} \text{ is nonempty.} \tag{A.3}$$

If $X$ and $d$ are consistent (i.e., there is a $w$ such that $Xw = d$), then $L_{X,d} = \{w : Xw = d\}$.

A.2 Comparison of Performance Errors. Given weight vectors $w_1$ and $w_2$, a matrix $X$ of input vectors, and a vector $d$ of desired outputs, define $\delta(w_1, w_2; X, d) = E_{pre} - E_{post}$, where $E_{pre} = E(X; w_1, d)$ and $E_{post} = E(X; w_2, d)$. Let $\tilde{w}$ be any element of $L_{X,d}$. Then

$$\begin{aligned}
\delta(w_1, w_2; X, d) &= \|Xw_1 - d\|^2 - \|Xw_2 - d\|^2 \\
&= \|Xw_1\|^2 - \|Xw_2\|^2 - 2 (w_1 - w_2)^T X^T d \\
&= (w_1 - w_2)^T X^T X (w_1 + w_2) - 2 (w_1 - w_2)^T X^T X \tilde{w} \\
&= (w_1 - w_2)^T X^T X (w_1 + w_2 - 2\tilde{w}).
\end{aligned} \tag{A.4}$$

Suppose given $n_i \times n$ matrices $X_i$ and $n_i$-dimensional vectors $d_i$ (for $i = 1, 2$). Put

$$L_i = L_{X_i, d_i} \quad \text{for } i = 1, 2.$$

If $X_i$ has rank $n_i$, then $X_i = T_i Z_i$ for unique $n_i \times n_i$ and $n_i \times n$ matrices $T_i$ and $Z_i$ with $T_i$ upper triangular and $Z_i Z_i^T = I_{n_i}$. Note that the matrix $Z_i^T Z_i$ represents the operator that projects onto the image of $X_i^T X_i$, and so

$$Z_i^T Z_i X_i^T X_i = X_i^T X_i. \tag{A.5}$$


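Let $w_0$ be an element of $L_{X,d}$, where

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad d = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix},$$

that is,

$$\left( X_1^T X_1 + X_2^T X_2 \right) w_0 = X_1^T d_1 + X_2^T d_2. \tag{A.6}$$

(By equation A.3, such a $w_0$ always exists.) Given $v$ in $\mathbb{R}^n$, put $w_1 = w_0 + v$. Let $w_{02}$ and $w_2$ be the orthogonal projections of $w_0$ and $w_1$, respectively, onto $L_2$. Then

$$X_2^T X_2 w_{02} = X_2^T d_2, \qquad w_2 = w_{02} + \left( I_n - Z_2^T Z_2 \right)(w_1 - w_{02}). \tag{A.7}$$

Manipulation gives

$$w_1 - w_2 = Z_2^T Z_2 (v + w_0 - w_{02}), \tag{A.8}$$

and so

$$w_1 + w_2 - 2w_0 = \left( 2I_n - Z_2^T Z_2 \right) v - Z_2^T Z_2 (w_0 - w_{02}). \tag{A.9}$$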

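Let $\tilde{w}$ be any element of $L_{X_1,d_1}$. Then equations A.4, A.6, A.7 to A.9, and A.5 yield

$$\begin{aligned}
\delta(w_1, w_2; X_1, d_1)
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2 - 2\tilde{w}) \\
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2) - 2 (w_1 - w_2)^T X_1^T d_1 \\
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2 - 2w_0) - 2 (w_1 - w_2)^T X_2^T X_2 (w_0 - w_{02}) \\
&= (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 (w_1 + w_2 - 2w_0) \\
&\quad - 2 (v + w_0 - w_{02})^T Z_2^T Z_2 X_2^T X_2 (w_0 - w_{02}) \\
&= (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 \left( 2I_n - Z_2^T Z_2 \right) v \\
&\quad - (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2 (w_0 - w_{02}) \\
&\quad - 2 (v + w_0 - w_{02})^T Z_2^T Z_2 X_2^T X_2 (w_0 - w_{02}) \\
&= v^T Z_2^T Z_2 X_1^T X_1 \left( 2I_n - Z_2^T Z_2 \right) v \\
&\quad - 2 (w_0 - w_{02})^T Z_2^T Z_2 \left\{ X_1^T X_1 \left( I_n - Z_2^T Z_2 \right) - X_2^T X_2 \right\} v \\
&\quad - (w_0 - w_{02})^T Z_2^T Z_2 \left\{ 2X_2^T X_2 + X_1^T X_1 \right\} Z_2^T Z_2 (w_0 - w_{02}) \\
&= v^T Z_2^T Z_2 X_1^T X_1 \left( 2I_n - Z_2^T Z_2 \right) v \\
&\quad - 2 (w_0 - w_{02})^T Z_2^T Z_2 \left\{ X_1^T X_1 \left( I_n - Z_2^T Z_2 \right) - X_2^T X_2 \right\} v \\
&\quad - (w_0 - w_{02})^T \left( 2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2 \right) (w_0 - w_{02}).
\end{aligned} \tag{A.10}$$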

A.3 Moments of Isotropic Distributions. In order to obtain results on the distribution of performance error, it is useful to have some moments of isotropic distributions. Let $u$ be uniformly distributed on $S^{n-1}$, and let $A$ and $B$ be $n \times n$ matrices. The formulas for the second and fourth moments of $u$ given in equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield

$$E\left[ u^T A u \right] = \frac{\mathrm{tr}(A)}{n}, \tag{A.11}$$

$$E\left[ u^T A u \, u^T B u \right] = \frac{\mathrm{tr}(AB) + \mathrm{tr}(AB^T) + \mathrm{tr}(A)\,\mathrm{tr}(B)}{n(n+2)}, \tag{A.12}$$

$$\mathrm{var}\left( u^T A u \right) = \frac{n\,\mathrm{tr}(A^2) + n\,\mathrm{tr}(AA^T) - 2\,\mathrm{tr}(A)^2}{n^2 (n+2)}. \tag{A.13}$$
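Equations A.11 and A.13 can be checked by simulation. The sketch below is a Monte Carlo check under stated assumptions: the matrix `A`, the sample size, and the seed are arbitrary illustrative choices, and `u` is drawn uniformly on the sphere by normalizing a standard gaussian vector.

```python
import math
import random


def random_unit_vector(n, rng):
    """Uniform draw on the unit sphere S^(n-1) via a normalized gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def quadratic_form(u, A):
    n = len(u)
    return sum(u[i] * A[i][j] * u[j] for i in range(n) for j in range(n))


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(A):
    return [list(row) for row in zip(*A)]


def predicted_mean(A):
    """Right-hand side of equation A.11."""
    return trace(A) / len(A)


def predicted_variance(A):
    """Right-hand side of equation A.13."""
    n = len(A)
    numerator = (n * trace(matmul(A, A)) + n * trace(matmul(A, transpose(A)))
                 - 2.0 * trace(A) ** 2)
    return numerator / (n ** 2 * (n + 2))


def sample_moments(A, samples=20000, seed=0):
    """Sample mean and variance of u^T A u over random unit vectors u."""
    rng = random.Random(seed)
    values = [quadratic_form(random_unit_vector(len(A), rng), A)
              for _ in range(samples)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Hypothetical (nonsymmetric) test matrix with n = 3.
A = [[2.0, 1.0, 0.0],
     [0.0, 3.0, 1.0],
     [1.0, 0.0, 1.0]]
mean, variance = sample_moments(A)
```

For this matrix, equation A.11 predicts a mean of $6/3 = 2$ and equation A.13 predicts a variance of $21/45$; the sample estimates should agree with these up to Monte Carlo error.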

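Now let $x$ be isotropically distributed on $\mathbb{R}^n$, that is, $Ux$ has the same distribution as $x$ for all orthogonal $n \times n$ matrices $U$. Then writing $x = \|x\| u$ with $\|u\| = 1$ and using equations A.11 to A.13 gives

$$\begin{aligned}
E\left[ x^T A x \right] &= \frac{E\left[ \|x\|^2 \right] \mathrm{tr}(A)}{n}, \\
E\left[ x^T A x \, x^T B x \right] &= \frac{E\left[ \|x\|^4 \right] \left\{ \mathrm{tr}(AB) + \mathrm{tr}(AB^T) + \mathrm{tr}(A)\,\mathrm{tr}(B) \right\}}{n(n+2)},
\end{aligned} \tag{A.14}$$

$$\mathrm{var}\left( x^T A x \right) = \frac{E\left[ \|x\|^4 \right] \left\{ n\,\mathrm{tr}(A^2) + n\,\mathrm{tr}(AA^T) - 2\,\mathrm{tr}(A)^2 \right\}}{n^2 (n+2)} + \frac{\mathrm{var}\left( \|x\|^2 \right) \mathrm{tr}(A)^2}{n^2}. \tag{A.15}$$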


A.4 Distribution of Performance Error. Now suppose that $X_1$, $d_1$, $X_2$, $d_2$, and $v$ are random and satisfy

$$X_1 \text{ and } v \text{ are independent}, \quad \text{the distribution of } X_1 \text{ is isotropic}, \quad v \text{ has an isotropic distribution}, \tag{A.16}$$

where conditions A.16 mean that $U X_1 V$ has the same distribution as $X_1$ for all orthogonal $n_1 \times n_1$ matrices $U$ and all orthogonal $n \times n$ matrices $V$. Then equation A.10 yields

$$E\left[ \delta(w_1, w_2; X_1, d_1) \mid X_1, X_2 \right] = \frac{E\left[ \|v\|^2 \right]}{n} \mathrm{tr}\left( X_1^T X_1 Z_2^T Z_2 \right) - (w_0 - w_{02})^T \left( 2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2 \right) (w_0 - w_{02}). \tag{A.17}$$

Taking expectations over $X_1$ and $X_2$ in equation A.17 gives the following general result on FLL: $E[\delta(w_1, w_2; X_1, d_1)] > 0$ iff

$$E\left[ \|v\|^2 \right] > \frac{n^2 \, E\left[ (w_0 - w_{02})^T \left( 2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2 \right) (w_0 - w_{02}) \right]}{n_1 n_2}. \tag{A.18}$$

The intuitive interpretation of this result is that if $E[\|v\|^2]$ is large enough, then there is FLL, whereas if $P(w_0 \ne w_{02}) > 0$ then "negative FLL" can occur. In particular, if $n_1 + n_2 \le n$ and $P(v \ne 0) > 0$, then there is FLL.

A.5 The Case $n_1 + n_2 \le n$. In this section we assume that $X_1$, $d_1$, $X_2$ and $d_2$ are random and that

$$(X_1, d_1),\ (X_2, d_2) \text{ and } v \text{ are independent}, \tag{A.19}$$

$$\text{the distribution of } v \text{ is isotropic}. \tag{A.20}$$

We suppose also that $n_1 + n_2 \le n$, and that the distributions of $X_1$, $d_1$, $X_2$, and $d_2$ are continuous. Then, with probability 1, $X_1 w_0 = d_1$ and $X_2 w_0 = d_2$, so that $w_{02} = w_0$ and equation A.10 reduces to

$$\delta(w_1, w_2; X_1, d_1) = v^T Z_2^T Z_2 X_1^T X_1 \left( 2I_n - Z_2^T Z_2 \right) v. \tag{A.21}$$
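Equation A.21 can be checked numerically in the simplest case $n_1 = n_2 = 1$, where $Z_2^T Z_2 = x_2 x_2^T / \|x_2\|^2$ and $X_1^T X_1 = x_1 x_1^T$. The sketch below rests on several assumptions made only for illustration: all vectors are standard gaussian, $w_0$ is drawn first and the targets are then set to $d_i = x_i \cdot w_0$ (so that $w_0$ lies in both $L_1$ and $L_2$), and the function names are hypothetical. It compares $E_{pre} - E_{post}$ computed directly with the right-hand side of A.21, and also records how often $\delta > 0$.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def delta_direct(x1, d1, x2, d2, w1):
    """E_pre - E_post on A1, with w2 the orthogonal projection of w1 onto L2."""
    scale = (dot(x2, w1) - d2) / dot(x2, x2)
    w2 = [a - scale * b for a, b in zip(w1, x2)]
    e_pre = (dot(x1, w1) - d1) ** 2
    e_post = (dot(x1, w2) - d1) ** 2
    return e_pre - e_post


def delta_a21(x1, x2, v):
    """Right-hand side of equation A.21 for n1 = n2 = 1."""
    coeff = dot(x2, v) / dot(x2, x2)
    pv = [coeff * c for c in x2]               # Z2^T Z2 v
    rv = [2.0 * a - b for a, b in zip(v, pv)]  # (2 I_n - Z2^T Z2) v
    return dot(pv, x1) * dot(x1, rv)           # v^T Z2^T Z2 x1 x1^T (2 I_n - Z2^T Z2) v


def simulate(trials=20000, n=3, seed=0):
    """Return (fraction of trials with delta > 0, largest relative mismatch)."""
    rng = random.Random(seed)

    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    positive = 0
    max_gap = 0.0
    for _ in range(trials):
        x1, x2, w0, v = draw(), draw(), draw(), draw()
        d1, d2 = dot(x1, w0), dot(x2, w0)  # w0 satisfies both associations
        w1 = [a + b for a, b in zip(w0, v)]
        direct = delta_direct(x1, d1, x2, d2, w1)
        formula = delta_a21(x1, x2, v)
        max_gap = max(max_gap, abs(direct - formula) / (1.0 + abs(direct)))
        if direct > 0.0:
            positive += 1
    return positive / trials, max_gap


fraction_positive, max_gap = simulate()
```

The two computations of $\delta$ should agree to rounding error, and the fraction of trials with $\delta > 0$ should exceed one half, which is consistent with the result proved in section A.5.1.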


A.5.1 FLL Is More Probable Than Not. Let $w_1^*$ be the reflection of $w_1$ in $L_2$, that is, $w_1^* = w_2 - (w_1 - w_2)$. Consideration of the parallelogram with vertices at $w_0$, $w_1$, $w_1^*$, and $w_1 + w_1^* - w_0$ gives

$$\begin{aligned}
2 \left( \|X_1 (w_1 - w_0)\|^2 + \|X_1 (w_1^* - w_0)\|^2 \right)
&= \|X_1 ([w_1 - w_0] + [w_1^* - w_0])\|^2 + \|X_1 ([w_1 - w_0] - [w_1^* - w_0])\|^2 \\
&= 4 \left( \|X_1 (w_2 - w_0)\|^2 + \|X_1 (w_1 - w_2)\|^2 \right),
\end{aligned}$$

so that (since $d_1 = X_1 w_0$)

$$\begin{aligned}
\delta(w_1, w_2; X_1, d_1) + \delta(w_1^*, w_2; X_1, d_1)
&= \|X_1 (w_1 - w_0)\|^2 + \|X_1 (w_1^* - w_0)\|^2 - 2 \|X_1 (w_2 - w_0)\|^2 \\
&= 2 \|X_1 (w_1 - w_2)\|^2 \ge 0.
\end{aligned}$$

Thus if $\delta(w_1, w_2; X_1, d_1) < 0$, then $\delta(w_1^*, w_2; X_1, d_1) > 0$. If $v$ is distributed isotropically, then $w_1^* - w_0$ is distributed isotropically, so that $\delta(w_1^*, w_2; X_1, d_1)$ has the same distribution (conditionally on $X_1$, $d_1$ and $X_2$) as $\delta(w_1, w_2; X_1, d_1)$, and so

$$P(\delta(w_1, w_2; X_1, d_1) < 0 \mid X_1, d_1, X_2) \le P(\delta(w_1^*, w_2; X_1, d_1) > 0 \mid X_1, d_1, X_2) = P(\delta(w_1, w_2; X_1, d_1) > 0 \mid X_1, d_1, X_2). \tag{A.22}$$

Further, if $v \in L_2 \setminus L_1$, then $w_2 = w_1 = w_1^*$, so that $\delta(w_1, w_2; X_1, d_1) = \delta(w_1^*, w_2; X_1, d_1) > 0$. By continuity of $\delta$, there is a neighborhood of $v$ on which $\delta(w_1, w_2; X_1, d_1) > 0$ and $\delta(w_1^*, w_2; X_1, d_1) > 0$. Thus, if $L_2 \setminus L_1 \ne \emptyset$, then equation A.22 can be refined to

$$P(\delta(w_1, w_2; X_1, d_1) < 0 \mid X_1, d_1, X_2) < P(\delta(w_1^*, w_2; X_1, d_1) > 0 \mid X_1, d_1, X_2). \tag{A.23}$$

Since $P(L_2 \subset L_1) = 0$ and $P(\delta(w_1, w_2; X_1, d_1) < 0 \mid X_1, d_1, X_2)$ is a continuous function of $X_1$, $d_1$ and $X_2$, it follows from equation A.23 that $P(\delta(w_1, w_2; X_1, d_1) < 0) < P(\delta(w_1, w_2; X_1, d_1) > 0)$, which implies the following result.


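Theorem 1. $P(\delta(w_1, w_2; X_1, d_1) > 0) > 0.5$. This implies that the median of $\delta(w_1, w_2; X_1, d_1)$ is positive.

A.5.2 A Lower Bound for $P(\delta > 0)$. Our proof depends on Chebyshev's inequality, which states that for any positive value of $t$,

$$P(|\delta - E[\delta]| \ge t) \le \frac{\mathrm{var}(\delta)}{t^2},$$

where $\mathrm{var}(\delta)$ denotes the variance of $\delta$. If we set $t = E[\delta]$, then (since, by equation A.28, $E[\delta] > 0$)

$$P(\delta \le 0) \le \frac{\mathrm{var}(\delta)}{E[\delta]^2}. \tag{A.24}$$

This provides a lower bound for the probability of FLL. We prove that this bound approaches unity as $n$ approaches infinity. Now we assume (in addition to conditions A.19 and A.20) that

$$\text{the distributions of } X_1 \text{ and } X_2 \text{ are isotropic}. \tag{A.25}$$

It follows from equations A.21, A.14, and A.15 that

$$\begin{aligned}
E\left[ \delta(w_1, w_2; X_1, d_1) \mid Z_2, v \right]
&= v^T Z_2^T Z_2 \left( E\left[ \|x\|^2 \right] \frac{n_1}{n} I_n \right) \left( 2I_n - Z_2^T Z_2 \right) v \\
&= \frac{n_1}{n} E\left[ \|x\|^2 \right] v^T Z_2^T Z_2 v,
\end{aligned} \tag{A.26}$$

where $x$ is the first column of $X_1^T$, and

$$\mathrm{var}\left( \delta(w_1, w_2; X_1, d_1) \mid Z_2, v \right) = n_1 \left\{ \frac{E\left[ \|x\|^4 \right] \left[ (n-2) \|Z_2 v\|^4 + n \|Z_2 v\|^2 \left\| \left( 2I_n - Z_2^T Z_2 \right) v \right\|^2 \right]}{n^2 (n+2)} + \frac{\mathrm{var}\left( \|x\|^2 \right) \|Z_2 v\|^4}{n^2} \right\}. \tag{A.27}$$

Since $v$ has an isotropic distribution, equations A.26, A.11, and A.13 imply that

$$E\left[ \delta(w_1, w_2; X_1, d_1) \mid Z_2, \|v\| \right] = \frac{n_1 n_2}{n^2} E\left[ \|x\|^2 \right] \|v\|^2. \tag{A.28}$$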


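Given that there are $n_1$ associations in the subset $A_1$ that is not relearned, equation A.28 implies the following theorem about the expected amount of recovery per association in $A_1$.

Theorem 2.

$$E\left[ \left. \frac{\delta(w_1, w_2; X_1, d_1)}{n_1} \,\right|\, Z_2, \|v\| \right] = \frac{n_2}{n^2} E\left[ \|x\|^2 \right] \|v\|^2. \tag{A.29}$$

Equations A.26 and A.13 also imply that

$$\begin{aligned}
\mathrm{var}\left( E\left[ \delta(w_1, w_2; X_1, d_1) \mid Z_2, v \right] \mid Z_2, \|v\| \right)
&= \left( \frac{n_1}{n} E\left[ \|x\|^2 \right] \right)^2 \|v\|^4 \, \frac{2 n n_2 - 2 n_2^2}{n^2 (n+2)} \\
&= \frac{2 n_1^2 n_2 (n - n_2) E\left[ \|x\|^2 \right]^2 \|v\|^4}{n^4 (n+2)},
\end{aligned} \tag{A.30}$$

and it follows from equations A.27 and A.12 that

$$\begin{aligned}
E\left[ \mathrm{var}\left( \delta(w_1, w_2; X_1, d_1) \mid Z_2, v \right) \mid Z_2, \|v\| \right]
&= \frac{n_1 \|v\|^4}{n(n+2)} \left\{ \frac{E\left[ \|x\|^4 \right] \left[ (n-2) n_2 (n_2 + 2) + n n_2 (2n - n_2 + 2) \right]}{n^2 (n+2)} + \frac{\mathrm{var}\left( \|x\|^2 \right) n_2 (n_2 + 2)}{n^2} \right\} \\
&= \frac{n_1 n_2 \|v\|^4}{n^3 (n+2)^2} \left\{ 2 E\left[ \|x\|^4 \right] (n^2 + 2n - n_2 - 2) + \mathrm{var}\left( \|x\|^2 \right) (n+2)(n_2 + 2) \right\}.
\end{aligned} \tag{A.31}$$

Then equations A.30 and A.31 give

$$\begin{aligned}
\mathrm{var}\left( \delta(w_1, w_2; X_1, d_1) \mid Z_2, \|v\| \right)
&= \frac{2 n_1^2 n_2 (n - n_2) E\left[ \|x\|^2 \right]^2 \|v\|^4}{n^4 (n+2)} \\
&\quad + \frac{n_1 n_2 \|v\|^4}{n^3 (n+2)^2} \left\{ 2 E\left[ \|x\|^4 \right] (n^2 + 2n - n_2 - 2) + \mathrm{var}\left( \|x\|^2 \right) (n+2)(n_2 + 2) \right\} \\
&= \frac{n_1 n_2 \|v\|^4}{n^4 (n+2)^2} \left\{ 2 \left[ n_1 (n+2)(n - n_2) + n(n - n_2) + n(n+2)(n-1) \right] E\left[ \|x\|^2 \right]^2 + n^2 (2n + n_2 + 6) \, \mathrm{var}\left( \|x\|^2 \right) \right\},
\end{aligned}$$

and so

$$\frac{\mathrm{var}\left( \delta(w_1, w_2; X_1, d_1) \mid Z_2, \|v\| \right)}{E\left[ \delta(w_1, w_2; X_1, d_1) \mid Z_2, \|v\| \right]^2} = \frac{a_0(n, n_1, n_2) + a_1(n, n_2) \gamma(n)}{n_1 n_2 (n+2)^2},$$

where

$$\begin{aligned}
a_0(n, n_1, n_2) &= 2 \left\{ n_1 (n+2)(n - n_2) + n(n - n_2) + n(n+2)(n-1) \right\}, \\
a_1(n, n_2) &= n^2 (2n + n_2 + 6), \\
\gamma(n) &= \frac{\mathrm{var}\left( \|x\|^2 \right)}{E\left[ \|x\|^2 \right]^2}.
\end{aligned}$$

Chebyshev's inequality implies the following theorem.

Theorem 3.

$$P\left( \delta(w_1, w_2; X_1, d_1) \le 0 \mid Z_2, \|v\| \right) \le \frac{a_0(n, n_1, n_2) + a_1(n, n_2) \gamma(n)}{n_1 n_2 (n+2)^2}.$$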

Since the right-hand side does not depend on $Z_2$ or $\|v\|$, this gives the following result. If $\gamma(n)/n \to 0$ and $n_1/n$, $n_2/n$ are bounded away from zero as $n \to \infty$, then

$$P(\delta(w_1, w_2; X_1, d_1) > 0) \to 1, \quad n \to \infty.$$

Example. If $x \sim N(0, \sigma_x^2 I_n)$, then

$$E\left[ \|x\|^2 \right] = n \sigma_x^2, \qquad \mathrm{var}\left( \|x\|^2 \right) = 2n \sigma_x^4, \qquad \gamma(n) = \frac{2}{n},$$

and so

$$P(\delta(w_1, w_2; X_1, d_1) > 0) \to 1, \quad n \to \infty,$$

provided that $n_1/n$ and $n_2/n$ are bounded away from zero.
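To give a feel for how quickly the bound of Theorem 3 shrinks, the following sketch evaluates its right-hand side for the gaussian example. It assumes the reading $a_1(n, n_2) = n^2(2n + n_2 + 6)$, which is the reading consistent with the stated condition $\gamma(n)/n \to 0$; the function names and the chosen values of $n$, $n_1$, and $n_2$ are illustrative only.

```python
def chebyshev_bound(n, n1, n2, gamma):
    """Right-hand side of Theorem 3: an upper bound on P(delta <= 0)."""
    a0 = 2.0 * (n1 * (n + 2) * (n - n2) + n * (n - n2) + n * (n + 2) * (n - 1))
    a1 = n ** 2 * (2 * n + n2 + 6)  # assumed reading of the coefficient a_1(n, n_2)
    return (a0 + a1 * gamma) / (n1 * n2 * (n + 2) ** 2)


def gaussian_gamma(n):
    """gamma(n) = 2/n for x ~ N(0, sigma_x^2 I_n), as in the example."""
    return 2.0 / n


# Hypothetical settings with n1/n = n2/n = 1/4 held fixed.
bound_100 = chebyshev_bound(100, 25, 25, gaussian_gamma(100))
bound_1000 = chebyshev_bound(1000, 250, 250, gaussian_gamma(1000))
```

The corresponding lower bound on $P(\delta > 0)$ is one minus the returned value, so a smaller value means a stronger guarantee of FLL.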


A.5.3 Learning $A_3$ Instead of $A_2$. Now suppose that relearning of $A_2$ is replaced by learning another subset $A_3$ of $n_2$ associations. Let the matrix $X_3$ and vector $d_3$ be such that the subspace $L_3$ corresponding to $A_3$ has the form $L_3 = L_{X_3,d_3}$. Let $w_3$ and $w_{13}$ denote the orthogonal projections of $w_1$ onto $L_3$ and $L_1 \cap L_3$, respectively. Then

$$w_3 = w_{13} + \left( I_n - Z_3^T Z_3 \right)(w_1 - w_{13}), \tag{A.32}$$

and so

$$w_1 = w_3 + Z_3^T Z_3 (w_1 - w_{13}). \tag{A.33}$$

From equation A.4 with $\tilde{w} = w_{13}$, and equations A.33 and A.32, we have

$$\begin{aligned}
\delta(w_1, w_3; X_1, d_1) &= (w_1 - w_3)^T X_1^T X_1 (w_1 + w_3 - 2w_{13}) \\
&= (w_1 - w_{13})^T Z_3^T Z_3 X_1^T X_1 (w_1 + w_3 - 2w_{13}) \\
&= (v - \tilde{\omega})^T Z_3^T Z_3 X_1^T X_1 \left( 2I_n - Z_3^T Z_3 \right)(v - \tilde{\omega}),
\end{aligned} \tag{A.34}$$

where $\tilde{\omega} = w_{13} - w_0$. Since $X_1 w_0 = X_1 w_{13}$, equation A.34 can be expanded as

$$\begin{aligned}
\delta(w_1, w_3; X_1, d_1) &= v^T Z_3^T Z_3 X_1^T X_1 \left( 2I_n - Z_3^T Z_3 \right) v - v^T Z_3^T Z_3 X_1^T X_1 \left( 2I_n - Z_3^T Z_3 \right) \tilde{\omega} \\
&\quad - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 \left( 2I_n - Z_3^T Z_3 \right) v - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 Z_3^T Z_3 \tilde{\omega},
\end{aligned}$$

and so

$$\begin{aligned}
E\left[ \delta(w_1, w_3; X_1, d_1) \mid X_1, d_1, X_2, d_2, X_3, d_3 \right]
&= \frac{E\left[ \|v\|^2 \right]}{n} \mathrm{tr}\left( Z_3^T Z_3 X_1^T X_1 \left( 2I_n - Z_3^T Z_3 \right) \right) - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 Z_3^T Z_3 \tilde{\omega} \\
&= \frac{E\left[ \|v\|^2 \right]}{n} \mathrm{tr}\left( X_1^T X_1 Z_3^T Z_3 \right) - \left\| X_1 Z_3^T Z_3 \tilde{\omega} \right\|^2.
\end{aligned}$$

Now assume that $(X_1, d_1)$, $(X_2, d_2)$, $(X_3, d_3)$ and $v$ are independent, and that the distributions of $X_1$, $X_2$, $X_3$ and $v$ are isotropic. Since

$$E\left[ \frac{E\left[ \|v\|^2 \right]}{n} \mathrm{tr}\left( X_1^T X_1 Z_2^T Z_2 \right) \right] = E\left[ \frac{E\left[ \|v\|^2 \right]}{n} \mathrm{tr}\left( X_1^T X_1 Z_3^T Z_3 \right) \right] = E\left[ \delta(w_1, w_2; X_1, d_1) \right],$$

we have the following theorem.

Theorem 4. $E[\delta(w_1, w_3; X_1, d_1)] \le E[\delta(w_1, w_2; X_1, d_1)]$.

Appendix B: Behavior of the Gradient Algorithm

If $E$ is regarded as a function of $w$, then differentiation of equation A.2 shows that the gradient of $E$ at $w$ is $\nabla E(w) = 2X^T (Xw - d)$. Then for any algorithm that takes an initial $w^{(0)}$ to $w^{(1)}, w^{(2)}, \ldots$ using steps $w^{(t+1)} - w^{(t)}$ in the direction of $\nabla E(w^{(t)})$, $w^{(t)} - w^{(0)}$ is in the image of $X^T X$, and so is orthogonal to $L_{X,d}$. It follows that if $\|Xw^{(t)} - d\|^2 \to \min_w \|Xw - d\|^2$ as $t \to \infty$, then $w^{(t)}$ converges to the orthogonal projection of $w^{(0)}$ onto $L_{X,d}$.

Appendix C: The Geometry of Performance Error When $n_1 = 1$

Given associations $A_1$ and $A_2$, we prove that if $n_1 = 1$ and input vectors have unit length (so that $\|x_1\| = 1$), then the difference $\delta$ in performance errors on association $A_1$ of $w_1$ (i.e., after partial forgetting) and $w_2$ (i.e., after relearning $A_2$) is equal to the difference $Q = p^2 - q^2$. This proof supports the geometric account given in the article and in Figure 2 and does not (in general) apply if $n_1 > 1$. We begin by proving that (if $n_1 = 1$ and $\|x_1\| = 1$) the performance error of an association $A_1$ for an arbitrary weight vector $w_1$ is equal to the squared distance $p^2$ between $w_1$ and its orthogonal projection $w_1'$ onto the affine subspace $L_1$ corresponding to $A_1$. If $n_1 = 1$, then $L_1$ has the form $L_1 = \{w : w \cdot x_1 = d_1\}$ for some $x_1$ and $d_1$. Given an arbitrary weight vector $w_1$, we define the performance error on association $A_1$ as equivalent to

$$E(w_1, A_1) = (w_1 \cdot x_1 - d_1)^2. \tag{C.1}$$
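The convergence claim of Appendix B (gradient descent from $w^{(0)}$ ends at the orthogonal projection of $w^{(0)}$ onto $L_{X,d}$) can be illustrated with a single association, for which $L_{X,d}$ is the hyperplane of equation C.1 with zero error. The sketch below is illustrative only: the fixed step size, iteration count, function names, and numerical values are assumptions, not part of the analysis.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def gradient(X, w, d):
    """Gradient of ||Xw - d||^2, namely 2 X^T (Xw - d)."""
    residual = [dot(row, w) - d_i for row, d_i in zip(X, d)]
    return [2.0 * sum(X[i][j] * residual[i] for i in range(len(X)))
            for j in range(len(w))]


def gradient_descent(X, d, w_start, step=0.1, iterations=500):
    """Plain gradient descent with a fixed (assumed) step size."""
    w = list(w_start)
    for _ in range(iterations):
        g = gradient(X, w, d)
        w = [w_j - step * g_j for w_j, g_j in zip(w, g)]
    return w


def project_onto_hyperplane(w, x, d):
    """Orthogonal projection of w onto {w : w . x = d}."""
    scale = (d - dot(w, x)) / dot(x, x)
    return [w_j + scale * x_j for w_j, x_j in zip(w, x)]


# Hypothetical example: one association in a network with two weights.
X = [[1.0, 1.0]]
d = [2.0]
w_start = [3.0, 0.0]
w_final = gradient_descent(X, d, w_start)
w_projected = project_onto_hyperplane(w_start, X[0], d[0])
```

In this example both `w_final` and `w_projected` are (up to rounding) the point $(2.5, -0.5)$: the descent moves only along $x_1$, so it stops at the point of the solution set nearest to the starting weights.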


The orthogonal projection $w_1'$ of $w_1$ onto $L_1$ is

$$w_1' = w_1 + \frac{d_1 - w_1 \cdot x_1}{\|x_1\|^2} x_1, \tag{C.2}$$

so that

$$d_1 = w_1' \cdot x_1. \tag{C.3}$$

Substituting equation C.3 into C.1 and using C.2 yields

$$E(w_1, A_1) = \|w_1 - w_1'\|^2 \|x_1\|^2 = p^2 \|x_1\|^2. \tag{C.4}$$
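Equation C.4 can be confirmed on a small example. In the sketch below the vectors are arbitrary illustrative choices and the function names are hypothetical; the input vector is deliberately not of unit length, so that the factor $\|x_1\|^2$ is visible.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def project_onto_l1(w1, x1, d1):
    """Orthogonal projection of w1 onto L1 = {w : w . x1 = d1} (equation C.2)."""
    scale = (d1 - dot(w1, x1)) / dot(x1, x1)
    return [w + scale * x for w, x in zip(w1, x1)]


def performance_error(w1, x1, d1):
    """Equation C.1."""
    return (dot(w1, x1) - d1) ** 2


# Hypothetical association with ||x1|| = 5.
x1 = [3.0, 4.0]
d1 = 10.0
w1 = [1.0, 1.0]

w1_projected = project_onto_l1(w1, x1, d1)
p_squared = sum((a - b) ** 2 for a, b in zip(w1, w1_projected))
lhs = performance_error(w1, x1, d1)   # E(w1, A1)
rhs = p_squared * dot(x1, x1)         # p^2 ||x1||^2
```

Here $E(w_1, A_1) = (7 - 10)^2 = 9$, while $p^2 = 0.36$ and $\|x_1\|^2 = 25$, so both sides of equation C.4 equal 9.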

Now suppose that $\|x_1\| = 1$. Then $E(w_1, A_1) = p^2$, that is, the performance error is equal to the squared distance between the weight vectors $w_1$ and $w_1'$. The same line of reasoning can be applied to prove that $E(w_2, A_1) = q^2$. Thus, the difference $\delta$ in performance error on $A_1$ for weight vectors $w_1$ and $w_2$ is

$$\delta = E(w_1, A_1) - E(w_2, A_1) = p^2 - q^2 = Q.$$

Acknowledgments

Thanks to S. Isard for substantial help with the analysis presented here; to R. Lister, S. Eglen, P. Parpia, A. Farthing, P. Warren, K. Gurney, N. Hunkin, and two anonymous referees for comments; and J. Porrill for useful discussions.

References

Atkins, P. (2001). What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychological Research, 65(3), 202–215.
Atkins, P., & Murre, J. (1998). Recovery of unrehearsed items in connectionist models. Connection Science, 10(2), 99–119.
Coltheart, M., & Byng, S. (1989). A treatment for surface dyslexia. In X. Seron (Ed.), Cognitive approaches in neuropsychological rehabilitation. Mahwah, NJ: Erlbaum.
Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings Royal Society (B), 269(1503), 1895–1904.
Dienes, Z., Altmann, G., & Gao, S.-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82.
Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7–8), 961–974.
Goldin, M., Segal, M., & Avignone, E. (2001). Functional plasticity triggers formation and pruning of dendritic spines in cultured hippocampal networks. J. Neuroscience, 21(1), 186–193.
Hanson, S. J., & Negishi, M. (2002). On the emergence of rules in neural networks. Neural Computation, 14, 2245–2268.
Harvey, I., & Stone, J. V. (1996). Unicycling helps your French: Spontaneous recovery of associations by learning unrelated tasks. Neural Computation, 8, 697–704.
Hinton, G., & Plaut, D. (1987). Using fast weights to deblur old memories. In Proceedings Ninth Annual Conference of the Cognitive Science Society, Seattle WA, 177–186.
Mardia, K. V., & Jupp, P. E. (2000). Directional statistics. New York: Wiley.
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41(2), 269–280.
Purves, D., & Lichtman, J. (1980). Elimination of synapses in the developing nervous system. Science, 210, 153–157.
Seymour, B., O'Doherty, J. P., Dayan, P., Koltzenburg, M., Jones, A. K., Dolan, R. J., Friston, K. J., & Frackowiak, R. (2004). Temporal difference models describe higher order learning in humans. Nature, 429, 664–667.
Stone, J. V., Hunkin, N. M., & Hornby, A. (2001). Predicting spontaneous recovery of memory. Nature, 414, 167–168.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Weekes, B., & Coltheart, M. (1996). Surface dyslexia and surface dysgraphia: Treatment studies and their theoretical implications. Cognitive Neuropsychology, 13, 277–315.

Received August 1, 2005; accepted May 19, 2006.

LETTER

Communicated by Mark Girolami

Linear Multilayer ICA Generating Hierarchical Edge Detectors Yoshitatsu Matsuda [email protected]-tokyo.ac.jp

Kazunori Yamaguchi [email protected] Kazunori Yamaguchi Laboratory, Department of General Systems Studies, Graduate School of Arts and Sciences, University of Tokyo, Tokyo, Japan 153-8902

In this letter, a new ICA algorithm, linear multilayer ICA (LMICA), is proposed. There are two phases in each layer of LMICA. One is the mapping phase, where a two-dimensional mapping is formed by moving more highly correlated (nonindependent) signals closer with the stochastic multidimensional scaling network. Another is the local-ICA phase, where each neighbor (namely, highly correlated) pair of signals in the mapping is separated by MaxKurt algorithm. Because in LMICA only a small number of highly correlated pairs have to be separated, it can extract edge detectors efficiently from natural scenes. We conducted numerical experiments and verified that LMICA generates hierarchical edge detectors from large-size natural scenes. 1 Introduction Independent component analysis (ICA) is a recently developed method in the fields of signal processing and artificial neural networks, and it has been shown to be quite useful for the blind separation problem (Jutten & H´erault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996). The linear ICA is formalized as follows. Let s and A be N-dimensional source signals and an N × N mixing matrix. Then the observed signals x are defined as x = As.

(1.1)

The problem to solve is to find out A (or the inverse, W) when only the observed (mixed) signals are given. In other words, ICA blindly extracts the source signals from M samples of the observed signals as follows: Sˆ = WX,

(1.2)

where X is an N × M matrix of the observed signals and Sˆ is the estimate of the source signals. Although this is a typical ill-conditioned problem, ICA Neural Computation 19, 218–230 (2007)

C 2006 Massachusetts Institute of Technology


can solve it if the source signals are generated according to independent and nongaussian probability distributions. Concretely speaking, ICA algorithms find out $W$ by minimizing a criterion (called the contrast function) that is defined by higher-order statistics (e.g., kurtosis) of components of $\hat{S}$, and many methods have been proposed for the minimization, for example, fast ICA (Hyvärinen, 1999) and the relative gradient algorithm (Cardoso & Laheld, 1996). Because all of these previous algorithms try to estimate all the $N^2$ components of $W$ rigorously, their time complexity is $O(N^2)$, which is intractable for large $N$. For actual data such as natural scenes, $W$ is given not randomly but according to some underlying structure in the data. LMICA (linear multilayer ICA), which we proposed previously (Matsuda & Yamaguchi, 2003, 2004, 2005b), utilizes such structure for estimating $W$. By gradually improving an estimate of $W$, it could finally find out a fairly good estimate of $W$ quite efficiently. This letter extends this work with improvement of the algorithm, using the two-dimensional stochastic multidimensional scaling (MDS) network, and exhaustive numerical experiments (generating hierarchical edge detectors). This letter is organized as follows. In section 2, the algorithm is described. In section 3, numerical experiments verify that LMICA can extract edge detectors efficiently from natural scenes, and it can generate hierarchical edge detectors from large-size scenes. This letter is concluded in section 4.

2 Algorithm

2.1 Basic Idea. It is assumed that $X$ is whitened initially. LMICA tries to extract the independent components approximately by repetition of the following two phases: the mapping phase, which brings more highly correlated signals nearer, and the local-ICA phase, where each neighbor pair of signals in the mapping is separated by the MaxKurt algorithm (Cardoso, 1999). Intuitively, the mapping phase finds only a few pairs that are more "important" for estimating $W$, and the local-ICA phase optimizes only the "important" pairs. The mechanism of LMICA is illustrated in Figure 1. Note that this illustration shows the ideal case where $N$ signals are separated in only $O(\log N)$ layers. Although this does not hold for an arbitrary $W$, it will be shown in section 3 that natural scenes can be separated quite effectively by this method with a two-dimensional mapping.

2.2 Mapping Phase. In the mapping phase, current signals $X$ are arranged in a two-dimensional array so that pairs of signals taking higher $\sum_k x_{ik}^2 x_{jk}^2$ are placed nearer. In order to efficiently calculate such an array, the stochastic MDS network in Matsuda and Yamaguchi (2005a) is used. Its main procedure is the repetition of the calculation of the center of gravity and the slide of each signal toward the center in proportion to its value.

Figure 1: LMICA (the ideal case). Each number from 1 to 8 means a source signal. In the first local-ICA phase, each neighbor pair of the completely mixed signals (denoted 1-8) is partially separated into 1-4 and 5-8. Next, the mapping phase rearranges the partially separated signals so that more highly correlated signals are nearer. In consequence, the four 1-4 signals (similarly, 5-8 ones) are brought nearer. Then the local-ICA phase partially separates the pairs of neighbor signals into 1-2, 3-4, 5-6, and 7-8. By repetition of these two phases, LMICA can extract all the sources quite efficiently.

The original stochastic MDS network is given as follows (see Matsuda & Yamaguchi, 2005a, for details):

1. Place given signals $Z = (z_{ik})$ on a two-dimensional array randomly, where each signal $i$ (the $i$th row of $Z$) corresponds to a component of the array through a one-to-one correspondence. The x- and y-coordinates of a signal $i$ on the array are denoted as $m_{i1}$ and $m_{i2}$. Note that $m_{i1}$ and $m_{i2}$ take discrete values.
2. Pick up a column of $Z$ randomly, whose index is denoted as $p$. The column can be regarded as a randomly selected sample $p$ of signals.
3. Calculate the center of gravity for the sample $p$ by

$$m_l^g(Z, p) = \frac{\sum_i z_{ip} m_{il}}{\sum_i z_{ip}}, \tag{2.1}$$

where $l = 1$ or $2$.

Figure 2: Discretized update on the stochastic MDS network (excerpted from Matsuda & Yamaguchi, 2005a, p. 289). It illustrates each step (STEP 1 to STEP 4) of the discretized update rule. Each small square represents a signal on the discrete two-dimensional array (denoted a unit in this figure). In the first step, the destination is calculated by adding the current offset vector to the new coordinates. Second, the signals on the route are detected. Third, the signal to be updated is moved by exchanging the signals on the route one by one. Finally, the fraction under the discretization is preserved as the offset vector.

4. Calculate the new coordinate $m'_{il}$ for each signal $i$ by

$$m'_{il} = m_{il} - \lambda z_{ip} \left( m_{il} - m_l^g \right), \tag{2.2}$$

where $\lambda$ is the step size.
5. Update the coordinates of each signal on the array approximately according to equation 2.2 under the constraints that every coordinate $m_{il}$ has to be on the discrete two-dimensional array. Such discretized updates are conducted by giving an offset vector (which preserves the rounded off fraction under the discretization) to each signal and exchanging the signals one by one on the array. The mechanism is roughly illustrated in Figure 2. The details are omitted in this letter.
6. Terminate the process if a termination condition is satisfied. Otherwise, return to step 2.

It can be shown that this process approximately minimizes an error function,

$$C = \sum_{i,j} \sum_k (z_{ik} z_{jk}) \sum_l (m_{il} - m_{jl})^2, \tag{2.3}$$


under the conditions that the location of each signal $(m_{i1}, m_{i2})$ is bound to a unit in the discrete two-dimensional array through a one-to-one correspondence. Because the minimization of $C$ makes more highly correlated pairs of signals (taking larger $\sum_k (z_{ik} z_{jk})$) be closer (smaller $\sum_l (m_{il} - m_{jl})^2$), it is shown that this process generates a topographic mapping where highly correlated signals are placed near each other. Some modifications are needed in order to make the original stochastic MDS network suitable for LMICA:

- $z_{ip}$ is given as $x_{ip}^2 - 1$, where $p$ is selected randomly from 1 to $M$ at each update.
- Because the convergence of the original network is too slow and tends to drop into a local minimum if the signals $Z$ take continuous values, the following elaboration is utilized. The signals are classified into two groups for each $i$: the positive group $\sigma^+$ consisting of signals satisfying $z_{ip} > 0$ and the negative one $\sigma^-$ of $z_{ip} < 0$. Then equations 2.1 and 2.2 are calculated for each group. Note that each signal of the negative group is moved away from the center of gravity. Numerical experiments (omitted in this article) showed that this improvement gave more accurate results more efficiently than the original network.
- Two learning stages are often used in order to make local neighbor relations more accurate if there are many signals. The first is the usual stage for the whole array. The second is the locally tuning stage where the array is divided into some small areas and the algorithm is applied to each area.

The total procedure of the mapping phase for given $X$, $W$, $A$, and a given two-dimensional array is described as follows:

Mapping Phase

1. Allocate each signal $i$ to a component of the two-dimensional array by a randomly selected one-to-one correspondence.
2. Repeat the following steps over $T$ times with a decreasing step size $\lambda$:
   a. Select $p$ randomly from $\{1, \ldots, M\}$, and let $z_{ip} = x_{ip}^2 - 1$ for each $i$.
   b. Calculate the two centers of gravity:

$$m_l^{g+}(Z, p) = \frac{\sum_{i \in \sigma^+} z_{ip} m_{il}}{\sum_{i \in \sigma^+} z_{ip}} \quad \text{and} \quad m_l^{g-}(Z, p) = \frac{\sum_{i \in \sigma^-} z_{ip} m_{il}}{\sum_{i \in \sigma^-} z_{ip}}. \tag{2.4}$$

   c. Update each $m_{il}$ by

$$m_{il} := \begin{cases} m_{il} - \lambda z_{ip} \left( m_{il} - m_l^{g+} \right) & z_{ip} > 0 \\ m_{il} - \lambda z_{ip} \left( m_{il} - m_l^{g-} \right) & z_{ip} < 0 \end{cases} \tag{2.5}$$

   under the constraints that every $m_{il}$ is on the discrete array.


3. Divide the array into small areas, and apply the above process to each area if there are many signals.
4. Rearrange the rows of $X$ and $W$ and the columns of $A$ according to the generated array.

2.3 Local-ICA Phase. In the local-ICA phase, the following contrast function $\phi(X)$ (the minus sum of kurtoses) is used (which is the same one in the MaxKurt algorithm; Cardoso, 1999):

$$\phi(X) = -\left( \frac{\sum_{i,k} x_{ik}^4}{M} - 3 \right). \tag{2.6}$$

$\phi(X)$ is minimized by "rotating" nearest-neighbor pairs of signals on the array. For each nearest-neighbor pair $(i, j)$, a rotation matrix $R(\theta)$ is given as

$$R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \tag{2.7}$$

Then the optimal angle $\hat{\theta}$ is given as

$$\hat{\theta} = \operatorname{argmin}_{\theta} \, -\sum_k \left( x_{ik}'^{\,4} + x_{jk}'^{\,4} \right), \tag{2.8}$$

where $x'_{ik} = \cos\theta \cdot x_{ik} + \sin\theta \cdot x_{jk}$ and $x'_{jk} = -\sin\theta \cdot x_{ik} + \cos\theta \cdot x_{jk}$. After some tedious transformation of this equation (see Cardoso, 1999), it is shown that $\hat{\theta}$ is determined analytically by the following equations:

$$\sin 4\hat{\theta} = \frac{\alpha_{ij}}{\sqrt{\alpha_{ij}^2 + \beta_{ij}^2}} \quad \text{and} \quad \cos 4\hat{\theta} = \frac{\beta_{ij}}{\sqrt{\alpha_{ij}^2 + \beta_{ij}^2}}, \tag{2.9}$$

where

$$\alpha_{ij} = \sum_k \left( x_{ik}^3 x_{jk} - x_{ik} x_{jk}^3 \right) \quad \text{and} \quad \beta_{ij} = \frac{\sum_k \left( x_{ik}^4 + x_{jk}^4 - 6 x_{ik}^2 x_{jk}^2 \right)}{4}. \tag{2.10}$$

Now, the procedure of the local-ICA phase for given $X$, $W$, $A$, and an array is described as follows:

Local-ICA Phase

1. For every nearest-neighbor pair of signals on the array, $(i, j)$,
   a. Find out the optimal angle $\hat{\theta}$ by equation 2.9.
   b. Rotate the corresponding parts of $X$, $W$, and $A$.
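The rotation step for a single pair can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the two-sided exponential (Laplacian) test sources, the mixing angle of 0.3 rad, the sample size, and the seed are arbitrary, and only the pair of signals in $X$ is rotated (the corresponding updates of $W$ and $A$ are omitted). The angle satisfying equation 2.9 is obtained as one quarter of the two-argument arctangent of $(\alpha_{ij}, \beta_{ij})$.

```python
import math
import random


def optimal_angle(xi, xj):
    """Angle satisfying equation 2.9, using alpha_ij and beta_ij of equation 2.10."""
    alpha = sum(a ** 3 * b - a * b ** 3 for a, b in zip(xi, xj))
    beta = sum(a ** 4 + b ** 4 - 6.0 * a * a * b * b for a, b in zip(xi, xj)) / 4.0
    return 0.25 * math.atan2(alpha, beta)


def rotate(xi, xj, theta):
    """Apply the rotation R(theta) of equation 2.7 to a pair of signals."""
    c, s = math.cos(theta), math.sin(theta)
    new_i = [c * a + s * b for a, b in zip(xi, xj)]
    new_j = [-s * a + c * b for a, b in zip(xi, xj)]
    return new_i, new_j


def fourth_moment_sum(xi, xj):
    """Sum over samples of x_i^4 + x_j^4; equation 2.8 maximizes this quantity."""
    return sum(a ** 4 + b ** 4 for a, b in zip(xi, xj))


# Hypothetical test data: two independent super-gaussian sources,
# mixed by a rotation of 0.3 rad.
rng = random.Random(0)
sources_i = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(2000)]
sources_j = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(2000)]
mixed_i, mixed_j = rotate(sources_i, sources_j, 0.3)

theta_hat = optimal_angle(mixed_i, mixed_j)
separated_i, separated_j = rotate(mixed_i, mixed_j, theta_hat)
best_value = fourth_moment_sum(separated_i, separated_j)
```

Because the fourth-moment sum has period $\pi/2$ in $\theta$, the recovered signals are determined only up to permutation and sign, which is the usual ICA indeterminacy.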


2.4 Complete Algorithm. The complete algorithm of LMICA for any given observed signals $X$ and array fixing the shape of the mapping is given by repeating the mapping phase and the local-ICA phase alternately:

Linear Multilayer ICA Algorithm

1. Initial settings: Let $X$ be a whitened observed signal and $W$ and $A$ be the $N \times N$ identity matrix.
2. Repetition: Do the following two phases alternately over $L$ times.
   a. Do the mapping phase in section 2.2.
   b. Do the local-ICA phase in section 2.3.

2.5 Some Remarks

2.5.1 Relation to MaxKurt Algorithm. Equation 2.9 is the same as in the MaxKurt algorithm (Cardoso, 1999). The crucial difference between LMICA and MaxKurt is that LMICA optimizes just the nearest-neighbor pairs instead of all the $N(N-1)/2$ ones in MaxKurt. In LMICA, the pairs with higher costs (higher $\sum_k x_{ik}^2 x_{jk}^2$) are brought nearer in the mapping phase. Approximations of independent components can be extracted effectively by optimizing just the neighbor pairs.

2.5.2 Prewhitening. Although LMICA is applicable to any prewhitened signals, the selection of the whitening method is actually crucial for its performance. It is shown in section 3.1 that ZCA (zero-phase component analysis) is more suitable than principal component analysis (PCA) if natural scenes are given as the observed signals. The ZCA filter is given as

$$X := \Sigma^{-\frac{1}{2}} X, \tag{2.11}$$

where $\Sigma$ is the covariance matrix of $X$. It is known that the ZCA filter whitens the given signals while preserving the spatial relationship if natural scenes are given as $X$ (Li & Atick, 1994; Bell & Sejnowski, 1997).

3 Results

It is well known that various local edge detectors can be extracted from natural scenes by the standard ICA algorithm (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). Here, LMICA was applied to the same problem.

3.1 Small Natural Scenes. Thirty thousand samples of natural scenes of 12 × 12 pixels were given as the observed signals $X$. That is, $N$ and $M$ were 144 and 30,000. Original natural scenes were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. $X$ is then whitened by

Linear Multilayer ICA Generating Hierarchical Edge Detectors


[Figure 3 plot, "Decrease of Contrast Function": minus kurtosis (vertical axis, −4 to −8) versus times of optimizations for pairs (horizontal axis, 0 to 15,000). Legend: LMICA (a), original topography (b), MaxKurt (c), LMICA for PCA (d).]

Figure 3: Decreasing curves of the contrast function φ along the times of optimizations for pairs of signals. (a) The standard LMICA. (b) LMICA using the identity mapping in the mapping phase, which preserves the original topography for the ZCA filter. (c) MaxKurt. (d) LMICA for the PCA-whitened signals.

ZCA. In the mapping phase, a 12 × 12 array was used, where the learning length T was 100,000 with the step size $\lambda = \frac{100}{100+t}$ (t is the current time, that is, the number of updates so far). The calculation time for 400 layers of LMICA with an Intel 2.8 GHz CPU was about 40 minutes, about 60% of which was for the mapping phase. It shows that the mapping phase is so efficient that its computational costs are not a bottleneck in LMICA. Figure 3 shows the decreasing curves of the contrast function φ along the times of optimizations (rotations) for pairs of signals, where the values of φ are averaged over 10 trials for independently sampled X. For comparison, the following four experiments were done: (1) LMICA; (2) LMICA using not the optimized rearrangement but the simple identity mapping in the mapping phase, which preserves the original 12 × 12 topography of scenes for the ZCA filter; (3) MaxKurt, where all the pairs are optimized one by one in random order; and (4) LMICA for the PCA-whitened observed signals. Note that one layer of LMICA without the mapping phase and one iteration of MaxKurt are equivalent to $12 \times 11 \times 2 = 264$ and $\frac{144 \times 143}{2} = 10{,}296$ pair optimizations, respectively. In Figure 3, the standard LMICA gives the best solution everywhere except for the first few optimizations and the late ones. Though LMICA using the original topography is the best within about the first 1000 optimizations, it rapidly converged to local minima. On the other hand, MaxKurt and LMICA for PCA became superior to the others only after more than 10,000 optimizations.
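As an illustration of the preprocessing used in this experiment, the ZCA filter of equation 2.11 and the two pair counts quoted above can be sketched as follows. This is not the authors' code: the mean removal, the small regularizer `eps`, and the random test data are assumptions of the sketch.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Return Sigma^{-1/2} X for X of shape (N, M): N signals, M samples.

    Centering each signal and adding eps to the eigenvalues are
    assumptions made here for numerical safety.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    sigma = Xc @ Xc.T / Xc.shape[1]            # covariance matrix of X
    vals, vecs = np.linalg.eigh(sigma)         # sigma is symmetric
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return inv_sqrt @ Xc

# Hypothetical correlated test signals (not natural scenes).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2000)) * np.array([[1.0], [2.0], [0.5], [3.0]])
X[1] += 0.8 * X[0]
Z = zca_whiten(X)
cov_Z = Z @ Z.T / Z.shape[1]                   # close to the identity matrix

# Pair counts for a 12 x 12 array: nearest-neighbor pairs versus all pairs.
side = 12
neighbor_pairs = 2 * side * (side - 1)         # horizontal plus vertical neighbors
all_pairs = side**2 * (side**2 - 1) // 2
```

Unlike PCA whitening, the symmetric matrix $\Sigma^{-1/2}$ does not permute or mix the pixel axes arbitrarily, which is consistent with the remark that ZCA preserves the spatial relationship of the signals.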


Figure 4: Edge detectors from natural scenes of 12 × 12 pixels after 5280 optimizations. Each shows 144 edge detectors of 12 × 12 pixels from A. (a) LMICA. (b) LMICA using the original topography (namely, the identity mapping). (c) MaxKurt. (d) Fast ICA (g(u) = tanh(u)).

Figures 4a to 4c show the edge detectors after 5280 optimizations (equivalent to the twentieth layer of LMICA) by LMICA, LMICA using the original topography, and MaxKurt, respectively. For comparison, Figure 4d is the result by fast ICA with the widely used nonlinear function g(u) = tanh(u). Figures 4a and 4d show that LMICA could quite rapidly generate edge detectors similar to those in fast ICA. It is especially surprising that the number of optimizations in LMICA (5280) is about half of the degrees of freedom of the mixing matrix A (10,296). It suggests that LMICA gives an

[Figure 5 panels: (a) at the 0th layer (ZCA); (b) at the 10th layer; (c) at the 50th layer; (d) at the 300th layer.]

Figure 5: Representative edge detectors from large natural scenes at each layer. They show 20 representative edge detectors of A from scenes of 64 × 64 pixels at each layer.

effective model for the ICA processing of natural scenes. There are no clear edges in Figures 4b and 4c. Each detector in these figures is extremely localized and has no orientation preference (or one that is too weak). It shows that the mapping phase of LMICA plays a crucial role in the rapid formation of edge detectors. 3.2 Large Natural Scenes. Here, 100,000 samples of natural scenes of 64 × 64 pixels were given as X. Fast ICA, MaxKurt, and other well-known ICA algorithms are not applicable to such a large-scale problem because they require huge computations. For example, fast ICA based on kurtosis spent about 45 minutes on processing the small images (30,000 samples of 12 × 12 pixels). Under the rough assumption that the calculation time is proportional to the number of samples and the parameters to be estimated, it requires about 2000 hours to process the large images. This estimation is rather optimistic because it assumes that the number of updates for

[Figure 6 panels: histograms of frequency (0 to 4000) versus length of edges (0 to 14), (a) at the 0th layer, (b) at the 10th layer, (c) at the 50th layer, (d) at the 300th layer.]

Figure 6: Histograms of edge lengths at each layer of LMICA for large, natural scenes. The lengths were calculated as the full width at half maximum (FWHM) of gaussian approximations of Hilbert transformation of W by a method similar to that of van Hateren and van der Schaaf (1998).

convergence is constant regardless of the number of parameters. In the mapping phase, a 64 × 64 array was used for T = 1,500,000; then it was divided into 16 arrays of 16 × 16 components, and each array was optimized for T = 500,000. LMICA was carried out in L = 300 layers, and it consumed about 170 hours with an Intel 2.8 GHz CPU. Figure 5 shows some representative edge detectors at the 0th (ZCA filter), 10th, 50th, and 300th layers. The histograms of the lengths of edges at each layer are shown in Figure 6. At the 0th layer (ZCA), there are only pixel detectors of length zero (see Figures 5a and 6a). Many short edge detectors were then generated after just 10 layers (see Figures 5b and 6b). At the 50th layer, the edges were longer (see Figures 5c and 6c). At the final 300th layer (see Figures 5d and 6d), there are some long edges. In addition, it is interesting that some "compound" detectors were observed, where multiple edges seem to be included in a single detector.


4 Conclusion

In this letter, we proposed linear multilayer ICA (LMICA). We carried out some numerical experiments on natural scenes, which verified that LMICA can extract edge detectors quite efficiently. We also showed that LMICA can generate hierarchical edge detectors from large natural scenes, where short edges exist in lower layers and longer edges in higher ones. Although some multilayer models have been employed in ICA (e.g., Hyvärinen & Hoyer, 2000, and Hoyer & Hyvärinen, 2002), their purpose and rationale are quite different from those of LMICA. Their multilayer networks have been proposed for constructing more powerful models of ICA, which include nonlinear connections and allow some dependencies between sources. LMICA, on the other hand, is constructed only for the efficient calculation of the usual linear ICA. Nevertheless, it seems interesting that the structures of their multilayer models (locally connected multilayer networks) are quite similar to that of LMICA. It may be promising to extend LMICA with some nonlinearity. We are planning to apply LMICA to image processing tasks such as image compression and digital watermarking. We are also planning to utilize LMICA for other large-scale problems such as text mining. In addition, we are trying to explore a faster method for the mapping phase. Some batch-learning techniques may be promising. It is known that, in the usual ICA models, edge detectors are not formed at the maximum of kurtoses; a different nonlinearity (e.g., tanh) is needed. So the choice of nonlinearity in the contrast function is quite important and sensitive in the usual ICA models, such as the InfoMax model in Bell and Sejnowski (1997). On the other hand, LMICA can generate hierarchical edge detectors by gradually increasing simple kurtoses.
This suggests that LMICA may be able to extract edge detectors by using more general contrast functions, but further experiments will be needed to test this hypothesis. Finally, LMICA with kurtoses is expected to be as sensitive to outliers as other cumulant-based ICA algorithms are. Some different contrast functions would be needed for noisy signals. In addition, LMICA is available only if the number of sources is the same as that of observed signals, because it utilizes the ZCA filter. In order to apply LMICA to undercomplete cases, a new whitening method would have to be exploited, which can decrease the number of signals while preserving the topography of images. It is more difficult for LMICA to deal with overcomplete cases, because it is based on the simplest contrast function without any generative models of sources. Nevertheless, it seems interesting that LMICA could generate many edge detectors with different lengths at every layer. These detectors appear to be overcomplete bases, though without any theoretical foundations so far. Further analysis of these bases may be promising.


References

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.
Jutten, C., & Hérault, J. (1991). Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Li, Z., & Atick, J. J. (1994). Towards a theory of the striate cortex. Neural Computation, 6, 127–146.
Matsuda, Y., & Yamaguchi, K. (2003). Linear multilayer ICA algorithm integrating small local modules. In Proceedings of ICA2003 (pp. 403–408). Nara, Japan.
Matsuda, Y., & Yamaguchi, K. (2004). Linear multilayer independent component analysis using stochastic gradient algorithm. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind source separation—ICA2004 (pp. 303–310). Berlin: Springer-Verlag.
Matsuda, Y., & Yamaguchi, K. (2005a). An efficient MDS-based topographic mapping algorithm. Neurocomputing, 64, 285–299.
Matsuda, Y., & Yamaguchi, K. (2005b). Linear multilayer independent component analysis for large natural scenes. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 897–904). Cambridge, MA: MIT Press.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London: B, 265, 359–366.

Received December 16, 2005; accepted April 5, 2006.

LETTER

Communicated by Andries P. Engelbrecht

Functional Network Topology Learning and Sensitivity Analysis Based on ANOVA Decomposition

Enrique Castillo
Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla–La Mancha, Spain

Noelia Sánchez-Maroño
Amparo Alonso-Betanzos
Computer Science Department, University of A Coruña, Spain

Carmen Castillo
Department of Civil Engineering, University of Castilla–La Mancha, Spain

A new methodology for learning the topology of a functional network from data, based on the ANOVA decomposition technique, is presented. The method determines sensitivity (importance) indices that allow a decision to be made as to which set of interactions among variables is relevant and which is irrelevant to the problem under study. This immediately suggests the network topology to be used in a given problem. Moreover, local sensitivities to small changes in the data can be easily calculated. In this way, the dual optimization problem gives the local sensitivities. The methods are illustrated by their application to artificial and real examples.

Neural Computation 19, 231–257 (2007) © 2006 Massachusetts Institute of Technology

1 Introduction

Functional networks (FN) have been proposed by E. Castillo (1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998) and have been shown to be successful in solving many physical or engineering problems (see Castillo & Gutiérrez, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 2000). FN are a generalization of neural networks (NN) that combine knowledge about the structure of the problem and data (Castillo et al., 1998). There are important differences between FN and NN. Standard NN usually have a rigid topology (only the number of layers and the number of neurons can be chosen) and fixed neural functions, so that only the weights are learned. In FN, the neural topology is initially free, and the neural functions can be selected from a wide range of families, so that both the topology and the parameters of the neural functions are learned. In addition, FN incorporate knowledge on the network topology or neural functions, or both. This knowledge can come from two main sources: (1) the knowledge the user has about the problem being solved, which can be written in terms of certain properties or characteristics of the model, or (2) the available data, which can be scrutinized to obtain the structure. The authors cited have described how the first type of knowledge can be used to derive the network topology. This is done mainly with the help of functional equations, which both suggest an initial topology and allow this initial topology to be simplified, leading to an equivalent simpler one. Thus, one of the main tools for building the network topology from this type of knowledge is functional equations. Such equations have been extensively described in the literature (see, e.g., Aczél, 1966; Castillo & Ruíz-Cobo, 1992; Castillo, Cobo, Gutiérrez, & Pruneda, 1999; Castillo, Iglesias, & Ruíz-Cobo, 2004). In this way, and although FN can also be used as black boxes, the network is a white (rather than a black) box, where the structure has a knowledge-based foundation. However, no efficient and clean method for deriving the network topology from data has yet been devised. This is one of the aims of this article. Sensitivity analysis is another area of great interest (Saltelli, Chan, & Scott, 2000; Castillo, Gutiérrez, & Hadi, 1997; Chatterjee & Hadi, 1988; Hadi & Nyquist, 2002); the concern is not only with learning a model but with the sensitivity of the model parameters to data. Sensitivity can be understood as local or global.
Local sensitivity aims at discovering how the model changes as small modifications are made to the data and at determining which data values have the greatest influence on the model when changed in small increments. Global sensitivity aims at discovering how sensitive the model is to the inputs and whether some inputs can be removed without a significant loss in output quality. Different studies have been carried out regarding model approximation (Sacks, Mitchell, & Wynn, 1989; Koehler & Owen, 1996; Currin, Mitchell, Morris, & Ylvisaker, 1991). Some of them use regression strategies similar to, but different from, the one described in Jiang and Owen (2001). In this letter, we present a methodology based on ANOVA decomposition that permits the topology of a functional network to be learned from data. The ANOVA decomposition also permits global sensitivity indices to be obtained for each variable and for each interaction between variables. The topology of a functional network is derived from this information, and low- and high-order interactions among variables are easily determined using ANOVA. Finally, a local sensitivity analysis can be carried out too. In this way, the final model can be fully defined. The letter is structured as follows. Section 2 gives a quick introduction to functional networks and describes the ANOVA decomposition, including global sensitivity indices. Section 3 describes the proposed method for


learning the network topology and the local and global sensitivity indices from data. Section 4 presents two illustrative examples, and section 5 contains the conclusions.

2 Background Knowledge

2.1 A Brief Introduction to Functional Networks. Functional networks (FN) are a generalization of neural networks that brings together domain knowledge, to determine the structure of the problem, and data, to estimate the unknown functional neurons (Castillo et al., 1998). In FN, there are two types of learning to deal with this domain and data knowledge:

• Structural learning, which includes the initial topology of the network and its posterior simplification using functional equations, leading to a simpler equivalent structure.
• Parametric learning, concerned with the estimation of the neuron functions. This can be done by considering linear combinations of given functional families and estimating the associated parameters from the available data. Note that this type of learning generalizes the idea of estimating the weights of the connections in a neural network.

In FN, not only are arbitrary neural functions allowed, but they are initially assumed to be multiargument and vector-valued functions; that is, they depend on several arguments and are multivariate. This fact is shown in Figure 1b. In Figure 1, we can also see some relevant differences between neural and functional networks. Note that the FN has no weights, and the parameters to be learned are incorporated into the neural functions $f_i$; $i = 1, 2, 3$. These neural functions are unknown functions from a given family (e.g., the polynomial family) to be estimated during the learning process. For example, the neural functions $f_i$ could be approximated by

$$f_i(x_i, x_j) = \sum_{k_i=0}^{m_i} a_{i k_i} x_i^{k_i} + \sum_{k_j=0}^{m_j} a_{i k_j} x_j^{k_j},$$

and the parameters to be learned will be the coefficients $a_{i k_i}$ and $a_{i k_j}$. As each function $f_i$ is learned, a different function is obtained for each neuron. It can be argued that functional networks require domain knowledge for deriving the functional equations and make assumptions about the form the unknown functions should take. However, like neural networks, functional networks can also be used as a black box.

2.2 The ANOVA Decomposition. The analysis of variance (ANOVA) was developed by Fisher (1918) and later further developed by other authors (Efron & Stein, 1981; Sobol, 1969; Hoeffding, 1948).

[Figure 1 diagrams: (a) a network whose neurons compute f(Σ w5i xi), f(Σ w6i xi), f(Σ w7i xi), and f(Σ w8i xi) on weighted connections; (b) a network whose neurons compute f1(x1, x2), f2(x2, x3), f3(x3, x4), and f4(x5, x6).]

Figure 1: (a) A neural network. (b) A functional network.

According to Sobol (2001), any square integrable function $f(x_1, \ldots, x_n)$ defined on the unit hypercube $[0, 1]^n$ can be written as

$$\begin{aligned}
y = f(x_1, \ldots, x_n) = f_0 &+ \sum_{i_1=1}^{n} f_{i_1}(x_{i_1}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} f_{i_1 i_2}(x_{i_1}, x_{i_2}) \\
&+ \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} f_{i_1 i_2 i_3}(x_{i_1}, x_{i_2}, x_{i_3}) + \cdots \\
&+ \sum_{i_1=1}^{1} \sum_{i_2=2}^{2} \cdots \sum_{i_n=n}^{n} f_{i_1 i_2 \ldots i_n}(x_{i_1}, x_{i_2}, \ldots, x_{i_n}),
\end{aligned} \tag{2.1}$$

where the term $f_0$ is a constant and corresponds to the function with no arguments.


The decomposition, equation 2.1, is called ANOVA iff

$$\int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \ \forall i_1, i_2, \ldots, i_k; \ \forall k, \tag{2.2}$$

where $r$ is an index pointing to any of the arguments $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ of the component. Hence, the functions corresponding to the different summands are orthogonal, that is,

$$\int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, f_{j_1 j_2 \ldots j_\ell}(x_{j_1}, x_{j_2}, \ldots, x_{j_\ell}) \, d\mathbf{x} = 0 \quad \forall (i_1, i_2, \ldots, i_k) \neq (j_1, j_2, \ldots, j_\ell),$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. It is important to notice that the conditions in equation 2.2 are sufficient for the different component functions to be orthogonal. In addition, it can be shown that they are unique, in the sense of $L^2$ equivalence classes. Note that since the above decomposition includes terms with all possible kinds of interactions among variables $x_1, x_2, \ldots, x_n$, it allows these interactions to be determined. The main advantage of this decomposition is that there are explicit formulas for obtaining the different summands or components of $f(x_1, \ldots, x_n)$. Sobol (2001) provides the following expressions:

$$f_0 = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n, \tag{2.3}$$

$$f_i(x_i) = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1; k \neq i}^{n} dx_k - f_0, \tag{2.4}$$

$$f_{ij}(x_i, x_j) = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1; k \neq i,j}^{n} dx_k - f_0 - \sum_{k=i,j} f_k(x_k), \tag{2.5}$$

and so on. In other words, the function $f(x_1, \ldots, x_n)$ can always be written as the sum of $2^n$ orthogonal summands (see equation 2.1). If $f(x_1, \ldots, x_n)$ is square integrable, then all $f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ for all combinations $i_1 i_2 \ldots i_k$ and $k = 1, 2, \ldots, n$ are also square integrable.
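To make equations 2.3 to 2.5 concrete, the following sketch evaluates the components for a hypothetical two-variable function, $f(x_1, x_2) = x_1 + x_1 x_2$, using Gauss-Legendre quadrature on $[0, 1]$. The test function and the quadrature order are assumptions chosen for illustration; for this function the components can also be found by hand: $f_0 = 3/4$, $f_1(x_1) = \tfrac{3}{2} x_1 - \tfrac{3}{4}$, $f_2(x_2) = \tfrac{1}{2} x_2 - \tfrac{1}{4}$, and $f_{12}(x_1, x_2) = (x_1 - \tfrac{1}{2})(x_2 - \tfrac{1}{2})$.

```python
import numpy as np

def f(x1, x2):
    """Hypothetical test function on the unit square."""
    return x1 + x1 * x2

# Gauss-Legendre rule mapped from [-1, 1] to [0, 1].
t, w = np.polynomial.legendre.leggauss(8)
nodes, weights = (t + 1) / 2, w / 2

X1, X2 = np.meshgrid(nodes, nodes, indexing="ij")
F = f(X1, X2)                               # F[a, b] = f(nodes[a], nodes[b])

f0 = weights @ F @ weights                  # equation 2.3
f1 = F @ weights - f0                       # equation 2.4: integrate out x2
f2 = weights @ F - f0                       # equation 2.4: integrate out x1
f12 = F - f0 - f1[:, None] - f2[None, :]    # equation 2.5 with n = 2
```

With n = 2 there is nothing left to integrate in equation 2.5, so the interaction term is simply what remains after the constant and the two main effects are removed.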


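Squaring $f(x_1, \ldots, x_n)$ and integrating over $(0, 1)^n$ gives, since the summands are orthogonal,

$$\begin{aligned}
\int_0^1 \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n = f_0^2 &+ \sum_{i_1=1}^{n} \int f_{i_1}^2(x_{i_1}) \, d\mathbf{x} + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \int f_{i_1 i_2}^2(x_{i_1}, x_{i_2}) \, d\mathbf{x} \\
&+ \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \int f_{i_1 i_2 i_3}^2(x_{i_1}, x_{i_2}, x_{i_3}) \, d\mathbf{x} + \cdots \\
&+ \sum_{i_1=1}^{1} \sum_{i_2=2}^{2} \cdots \sum_{i_n=n}^{n} \int f_{i_1 i_2 \ldots i_n}^2(x_{i_1}, x_{i_2}, \ldots, x_{i_n}) \, d\mathbf{x},
\end{aligned} \tag{2.6}$$

where $\int \cdot \, d\mathbf{x}$ denotes integration over the unit hypercube,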

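and calling

$$D = \int_0^1 \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n - f_0^2, \tag{2.7}$$

$$D_{i_1 i_2 \ldots i_k} = \int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}^2(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, d\mathbf{x}, \tag{2.8}$$

the result is

$$D = \sum_{k=1}^{n} \sum_{i_1 i_2 \ldots i_k} D_{i_1 i_2 \ldots i_k}. \tag{2.9}$$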

The constant $D$ is called the variance, because if $(x_1, x_2, \ldots, x_n)$ is a uniform random variable in the unit hypercube, then $D$ is the variance of $f(x_1, x_2, \ldots, x_n)$. Thus, the following set of global sensitivity indices, adding up to one, can be defined:

$$S_{i_1 i_2 \ldots i_k} = \frac{D_{i_1 i_2 \ldots i_k}}{D} \quad \Leftrightarrow \quad \sum_{i_1 i_2 \ldots i_k} S_{i_1 i_2 \ldots i_k} = 1. \tag{2.10}$$
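The variances and global sensitivity indices of equations 2.7 to 2.10 can be checked numerically on a small case. The sketch below again uses the hypothetical function $f(x_1, x_2) = x_1 + x_1 x_2$ (an assumption made for illustration), for which the exact values are $D = 31/144$, $D_1 = 27/144$, $D_2 = 3/144$, and $D_{12} = 1/144$.

```python
import numpy as np

# Gauss-Legendre rule mapped from [-1, 1] to [0, 1].
t, w = np.polynomial.legendre.leggauss(8)
nodes, weights = (t + 1) / 2, w / 2

# Hypothetical test function f(x1, x2) = x1 + x1*x2 on the quadrature grid.
F = nodes[:, None] + np.outer(nodes, nodes)

# ANOVA components (equations 2.3 to 2.5).
f0 = weights @ F @ weights
f1 = F @ weights - f0
f2 = weights @ F - f0
f12 = F - f0 - f1[:, None] - f2[None, :]

# Total and partial variances (equations 2.7 and 2.8).
D = weights @ F**2 @ weights - f0**2
D1 = weights @ f1**2
D2 = weights @ f2**2
D12 = weights @ f12**2 @ weights

# Global sensitivity indices (equation 2.10).
S = {"1": D1 / D, "2": D2 / D, "12": D12 / D}
```

In this example the main effect of $x_1$ carries 27/31 of the variance and the interaction only 1/31, which is the kind of information later used to decide which components to keep in the network.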

The practical importance of the ANOVA decomposition arises from the following facts:
1. Every square integrable function can be decomposed as the sum of orthogonal functions, including all interaction levels.
2. This decomposition is unique.


3. There are explicit formulas for determining this decomposition in terms of $f(x_1, \ldots, x_n)$.
4. The variance of the initial function can be obtained by summing up the variances of the components, and this permits global sensitivity indices that sum to one to be assigned to the different functional components.

3 Learning Algorithm

In this section, we present a method (denominated the AFN, i.e., ANOVA decomposition and functional networks) for learning the functional components of any general function $f(x_1, \ldots, x_n)$ from data and for calculating local and global sensitivity indices. Consider a data set $\{(x_{1m}, x_{2m}, \ldots, x_{nm}; y_m) \mid m = 1, 2, \ldots, M\}$, that is, a sample of size $M$ of $n$ input variables $(X_1, X_2, \ldots, X_n)$ and one output variable $Y$. The algorithm works as follows.

3.1 Step 1: Select a Set of Approximating Orthonormal Functions. Each functional component $f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ is approximated as

$$f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}), \tag{3.1}$$

where $c^*_{i_1 i_2 \ldots i_k; j}$ are real constants, and

$$\{h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}\} \tag{3.2}$$

is a set of basis functions (e.g., polynomials, sinusoids, exponentials, splines, wavelets), which must be orthonormalized. One possibility consists of taking one of the families of univariate orthogonal functions, for example, Legendre polynomials, forming tensor products with them, and selecting a subset of these products. Although these polynomials are defined with respect to a uniform weighting of $[-1, 1]$, they can be easily mapped to the interval $[0, 1]$. Hermite and Chebychev polynomials, Fourier functions, and Haar wavelets also provide univariate basis functions. Another alternative consists of using a family of functions and the Gram-Schmidt technique, which is implemented in some numerical libraries, to orthonormalize these basis functions.


For the sake of completeness, we give a third alternative. Since they must satisfy the ANOVA constraints, equation 2.2, we must have

$$\int_0^1 \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \ \forall i_1, i_2, \ldots, i_k; \ \forall k \in \{1, 2, \ldots, n\}. \tag{3.3}$$

Note that in spite of the fact that these conditions represent an uncountably infinite number of constraints, only a finite number of them are independent. Thus, only some subset of these constants $\{c^*_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}\}$ remains free. After renaming the free constants $\{c_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}\}$ and orthonormalizing the basis functions, equation 3.1 becomes

$$f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j} \, p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}). \tag{3.4}$$

Note that the initial set of basis functions, equation 3.2, changes to

$$\{p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}\}. \tag{3.5}$$

If one uses the tensor product technique, the multivariate normalized basis functions $p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ can be selected from the set (tensor products):

$$\left\{ \prod_{j=1}^{k} p_{i_j; r_j}(x_{i_j}) \;\middle|\; 1 \le r_j \le k_{i_j} \right\}. \tag{3.6}$$

Note that the size of this set grows exponentially with $k$ and that we may be interested in selecting a small subfamily. For example, if we select as univariate basis functions polynomials of degree $r$, then for the $m$-dimensional basis functions, the tensor product technique leads to polynomials of degree $r \times m$, which is too high. Thus, we can limit the degree of the corresponding $m$-multivariate basis to contain only polynomials of degree $d_m$ or less. This is what we have done with the examples presented in this letter.

Example 1 (Univariate Functions): Consider the basis of univariate polynomial functions,

$$\{h^*_{1;1}(x_1), h^*_{1;2}(x_1), h^*_{1;3}(x_1), h^*_{1;4}(x_1)\} = \{1, x_1, x_1^2, x_1^3\}.$$


After imposing the constraints, equations 2.3 to 2.5, we get the reduced basis

$$\{h_{1;1}(x_1), h_{1;2}(x_1), h_{1;3}(x_1)\} = \{2x_1 - 1, \ 3x_1^2 - 1, \ 4x_1^3 - 1\},$$

which after orthonormalization leads to

$$\{p_{1;1}(x_1), p_{1;2}(x_1), p_{1;3}(x_1)\} = \left\{ \sqrt{3}\,(2x_1 - 1), \ \sqrt{5}\,(6x_1^2 - 6x_1 + 1), \ \sqrt{7}\,(20x_1^3 - 30x_1^2 + 12x_1 - 1) \right\}.$$

Example 2 (Multivariate Functions): If the function is multivariate, the normalized basis can be obtained as the tensor product of univariate basis functions. For example, consider the basis of 16 bivariate functions,

$$\{1, x_2, x_2^2, x_2^3, x_1, x_1 x_2, x_1 x_2^2, x_1 x_2^3, x_1^2, x_1^2 x_2, x_1^2 x_2^2, x_1^2 x_2^3, x_1^3, x_1^3 x_2, x_1^3 x_2^2, x_1^3 x_2^3\},$$

which, after imposing the constraints, equations 2.3 to 2.5, and normalization, leads to the normalized basis:

$$\begin{aligned}
&3\,(-1 + 2x_1)(-1 + 2x_2), \quad \sqrt{15}\,(-1 + 2x_1)(1 - 6x_2 + 6x_2^2), \\
&\sqrt{21}\,(-1 + 2x_1)(-1 + 12x_2 - 30x_2^2 + 20x_2^3), \\
&\sqrt{15}\,(1 - 6x_1 + 6x_1^2)(-1 + 2x_2), \quad 5\,(1 - 6x_1 + 6x_1^2)(1 - 6x_2 + 6x_2^2), \\
&\sqrt{35}\,(1 - 6x_1 + 6x_1^2)(-1 + 12x_2 - 30x_2^2 + 20x_2^3), \\
&\sqrt{21}\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(-1 + 2x_2), \\
&\sqrt{35}\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(1 - 6x_2 + 6x_2^2), \\
&7\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(-1 + 12x_2 - 30x_2^2 + 20x_2^3).
\end{aligned}$$

This family of 9 functions is the family 3.6; that is, it comes from the tensor products of the normalized functions in example 1. Note that these bases can be obtained independently of the data set, which means that they are valid for learning any data set. So once calculated, they can be stored to be used when needed. In addition, since multivariate functions have associated tensor product bases of univariate functions, one needs to calculate and store only the basis associated with univariate functions.
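The properties claimed for the bases of examples 1 and 2 are easy to verify numerically. The following sketch, which assumes Gauss-Legendre quadrature on $[0, 1]$ as the integration rule, checks that the three univariate functions have zero mean (the ANOVA constraint) and unit norm, and that their nine tensor products are orthonormal on the unit square.

```python
import numpy as np

# Orthonormal univariate basis of example 1 (shifted Legendre polynomials).
def p1(x): return np.sqrt(3) * (2 * x - 1)
def p2(x): return np.sqrt(5) * (6 * x**2 - 6 * x + 1)
def p3(x): return np.sqrt(7) * (20 * x**3 - 30 * x**2 + 12 * x - 1)
basis = [p1, p2, p3]

# Gauss-Legendre rule mapped from [-1, 1] to [0, 1]; exact for these polynomials.
t, w = np.polynomial.legendre.leggauss(8)
nodes, weights = (t + 1) / 2, w / 2

P = np.stack([p(nodes) for p in basis])     # shape (3, 8)
means_1d = P @ weights                      # integrals of each p_j over [0, 1]
gram_1d = (P * weights) @ P.T               # matrix of inner products <p_a, p_b>

# Nine bivariate tensor products of example 2, flattened over the 8 x 8 grid.
P2 = np.stack([np.outer(P[a], P[b]).ravel() for a in range(3) for b in range(3)])
W2 = np.outer(weights, weights).ravel()
gram_2d = (P2 * W2) @ P2.T
```

Both Gram matrices come out as identity matrices, which is why only the univariate functions need to be stored: the multivariate basis inherits orthonormality from them.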


3.2 Step 2: Learn the Coefficients by Least Squares and Obtain Local Sensitivity Indices. According to our approximation, we have

$$\begin{aligned}
y_m = f(x_{1m}, \ldots, x_{nm}) = f_0 &+ \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) \\
&+ \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) \\
&+ \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \sum_{j=1}^{k_{i_1 i_2 i_3}} c_{i_1 i_2 i_3; j} \, p_{i_1 i_2 i_3; j}(x_{i_1 m}, x_{i_2 m}, x_{i_3 m}) + \cdots \\
&+ \sum_{i_1=1}^{1} \sum_{i_2=2}^{2} \cdots \sum_{i_n=n}^{n} \sum_{j=1}^{k_{i_1 i_2 \ldots i_n}} c_{i_1 i_2 \ldots i_n; j} \, p_{i_1 i_2 \ldots i_n; j}(x_{i_1 m}, x_{i_2 m}, \ldots, x_{i_n m}).
\end{aligned} \tag{3.7}$$

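The error for the $m$th data value is

$$\begin{aligned}
\epsilon_m(\mathbf{y}, \mathbf{x}, \mathbf{c}) = y_m - f_0 &- \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) \\
&- \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) \\
&- \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \sum_{j=1}^{k_{i_1 i_2 i_3}} c_{i_1 i_2 i_3; j} \, p_{i_1 i_2 i_3; j}(x_{i_1 m}, x_{i_2 m}, x_{i_3 m}) - \cdots \\
&- \sum_{i_1=1}^{1} \sum_{i_2=2}^{2} \cdots \sum_{i_n=n}^{n} \sum_{j=1}^{k_{i_1 i_2 \ldots i_n}} c_{i_1 i_2 \ldots i_n; j} \, p_{i_1 i_2 \ldots i_n; j}(x_{i_1 m}, x_{i_2 m}, \ldots, x_{i_n m}),
\end{aligned} \tag{3.8}$$

where $\mathbf{y}$, $\mathbf{x}$ are the data vectors and $\mathbf{c}$ is the vector including all the unknown coefficients. Hence, to estimate the constants $\mathbf{c}$, that is, all $c_{i_1 i_2 \ldots i_k; j}$; $\forall k, j$, the following minimization problem has to be solved:

$$\underset{f_0, \mathbf{c}}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}, \mathbf{x}, \mathbf{c}). \tag{3.9}$$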

The minimization problem, equation 3.9, leads to a linear system of equations with a unique solution. However, due to its tensor character, its organization in a standard form as $A\mathbf{c} = \mathbf{b}$ requires renumbering of the unknowns $\mathbf{c}$, which is not a trivial task. Alternatively, one can use standard optimization packages, such as GAMS (Brooke, Kendrick, Meeraus, & Raman, 1998), to solve equation 3.9. This is the option used in this letter. However, to obtain the local sensitivities of $Q$ to small changes in the data, the following modified problem, which is equivalent to equation 3.9, can be solved instead:

$$\underset{f_0, \mathbf{c}; \, \mathbf{y}', \mathbf{x}'}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}', \mathbf{x}', \mathbf{c}), \tag{3.10}$$

subject to:

$$\mathbf{y}' = \mathbf{y} : \boldsymbol{\lambda} \tag{3.11}$$

$$\mathbf{x}' = \mathbf{x} : \boldsymbol{\delta}, \tag{3.12}$$

where λ and δ are the vectors of dual variables, which give the local sensitivities of Q with respect to the data values y and x, respectively. This is so because the dual variables associated with any primal constrained optimization problem are known to be the partial derivatives of the optimal value of the objective function with respect to changes in the right-hand-side parameters of the corresponding equality constraints. Since in this artificial optimization problem, equations 3.10 to 3.12, the right-hand sides of the equality constraints 3.11 and 3.12 are the data y and x, respectively, the partial derivatives of Q∗ (the optimal value of Q) with respect to the data, that is, the sensitivities sought, are the values of the corresponding dual variables.

3.3 Step 3: Obtain the Global Sensitivity Indices. Since the resulting basis functions have already been orthonormalized, the global sensitivity indices (importance factors) are the sums of the squares of the coefficients, that is,

$$
S_{i_1 i_2 \ldots i_k} = \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c^2_{i_1 i_2 \ldots i_k;\, j}; \quad \forall (i_1, i_2, \ldots, i_k).
\tag{3.13}
$$
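As a minimal sketch of equation 3.13, the snippet below sums the squared coefficients of each functional component. The coefficient values are an illustrative assumption: they are the exact expansion of the polynomial 1 + x1 + x1² + x1x2 + 2x1x2x3 in orthonormal shifted Legendre polynomials on [0, 1], a basis chosen here for convenience and not necessarily the one used in this letter. The names `coeffs`, `global_indices`, and `normalized`, and the division by the total so that the indices sum to one, are likewise assumptions.

```python
import math

# Hypothetical coefficients c_{i1...ik;j}, keyed by the tuple of input indices.
# They expand f = 1 + x1 + x1^2 + x1*x2 + 2*x1*x2*x3 in orthonormal shifted
# Legendre polynomials on [0, 1] (an assumed basis, chosen for illustration).
coeffs = {
    (1,): [math.sqrt(3) / 2, 1 / (6 * math.sqrt(5))],
    (2,): [1 / (2 * math.sqrt(3))],
    (3,): [1 / (4 * math.sqrt(3))],
    (1, 2): [1 / 6],
    (1, 3): [1 / 12],
    (2, 3): [1 / 12],
    (1, 2, 3): [1 / (12 * math.sqrt(3))],
}


def global_indices(coeffs):
    """Equation 3.13: sum of squared coefficients for each component."""
    return {key: sum(c * c for c in cs) for key, cs in coeffs.items()}


def normalized(indices):
    """Divide by the total so the indices sum to one (an assumption here)."""
    total = sum(indices.values())
    return {key: value / total for key, value in indices.items()}


raw = global_indices(coeffs)
total_variance = sum(raw.values())
shares = normalized(raw)
for key, share in shares.items():
    print(key, round(share, 4))
```

Under these assumptions the total is about 0.9037 and the normalized share of the first input is about 0.836, which agrees with the σ = 0.00 row of Table 1 in section 4.1.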

3.4 Step 4: Return Solution. The algorithm returns the following information:

• An estimation of the coefficients $f_0$ and $\{c_{i_1 i_2 \ldots i_k; j};\ \forall (i_1, i_2, \ldots, i_k; j)\}$, which, together with the basis functions $p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ and equation 3.7, yield an approximation for the solution of the problem being studied.

• Local and global sensitivity indices.
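The local sensitivities with respect to the outputs can be checked numerically on a toy problem. The sketch below is not the GAMS formulation used in this letter: it fits a straight line by ordinary least squares (the helper names `fit_line`, `residuals`, and `q_star` are hypothetical) and compares a finite-difference estimate of ∂Q∗/∂y_m with 2ε_m, the value obtained by differentiating Q with respect to y_m at the optimum.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residuals(xs, ys):
    """Errors eps_m = y_m - (a + b*x_m) at the least-squares optimum."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def q_star(xs, ys):
    """Optimal value Q* of the sum of squared errors."""
    return sum(e * e for e in residuals(xs, ys))


# Illustrative data (made up for this sketch).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.4, 1.3, 2.1, 2.0]

eps = residuals(xs, ys)
analytic = [2.0 * e for e in eps]  # dQ*/dy_m = 2 * eps_m

h = 1e-6
numeric = []
for m in range(len(ys)):
    up = ys[:m] + [ys[m] + h] + ys[m + 1:]
    down = ys[:m] + [ys[m] - h] + ys[m + 1:]
    # Central difference of Q* with respect to the m-th output value.
    numeric.append((q_star(xs, up) - q_star(xs, down)) / (2 * h))

for a_val, n_val in zip(analytic, numeric):
    print(round(a_val, 6), round(n_val, 6))
```

Data points with a large |2ε_m| are those whose perturbation changes Q∗ the most, which is the property exploited when local sensitivities are used to flag outliers.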


For simplicity, only one output variable was considered; however, an extension to the case of several outputs can be immediately obtained by solving the corresponding minimization problems. Thus, the algorithm above allows us to obtain an approximation for a given function f in terms of its functional components. Moreover, global sensitivity indices are derived for each functional component; therefore, the topology of a functional network can be considerably simplified by taking into account only the components with higher sensitivities. Suppose that $\hat{f}(x_1, \ldots, x_n)$ is the approximation of f after removing unimportant interactions, that is, when only the subset of important interactions is included in the final model. Then the mean squared error (assuming a uniform distribution in the unit cube) is

$$
\text{MSE} = \int_0^1 \int_0^1 \cdots \int_0^1 \left( \hat{f}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \right)^2 dx_1 \cdots dx_n.
$$

Sometimes it is more convenient to use the normalized mean squared error (NMSE), which is dimensionless and defined as

$$
\text{NMSE} = \frac{\text{MSE}}{\operatorname{Var}[f(x_1, \ldots, x_n)]}.
$$
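In practice the integral is usually replaced by an average over sample points. The following sketch estimates MSE and NMSE by Monte Carlo sampling in the unit cube; the function names and the two-input test function are assumptions made for illustration only.

```python
import random


def mse(f_hat, f, points):
    """Sample average of the squared difference between f_hat and f."""
    return sum((f_hat(*p) - f(*p)) ** 2 for p in points) / len(points)


def nmse(f_hat, f, points):
    """MSE divided by the sample variance of f over the same points."""
    values = [f(*p) for p in points]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mse(f_hat, f, points) / variance


# Illustrative case: f depends on two inputs, and the reduced model drops
# the x2 component, replacing it by its mean value 0.5.
def f(x1, x2):
    return x1 + x2


def f_hat(x1, x2):
    return x1 + 0.5


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(20000)]
estimate = nmse(f_hat, f, points)
print(round(estimate, 3))
```

For this test function the exact values are MSE = 1/12 and Var[f] = 1/6, so the estimate should be close to 1/2: dropping a component that carries half of the variance costs half of the normalized error.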

4 Application Examples

In this section two application examples are described to illustrate the methodology proposed above.

4.1 Learning a Three-Input Nonlinear Function. The aim of this example is to demonstrate that the global sensitivity indices recovered by the proposed algorithm from data are exactly the same as those that can be obtained directly from the original function. Also, we will show that the resulting global sensitivity indices are barely affected by noise. For illustration, we selected the following function:

$$
y = f(x_1, x_2, x_3) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3.
\tag{4.1}
$$

Table 1: Sensitivity Results with Different Levels of Noise.

| σ | S1 | S2 | S3 | S12 | S13 | S23 | S123 |
|---|---|---|---|---|---|---|---|
| 0.00 | 0.8361 | 0.0922 | 0.0231 | 0.0307 | 0.0077 | 0.0077 | 0.0026 |
| 0.02 | 0.8359 | 0.0922 | 0.0230 | 0.0308 | 0.0077 | 0.0078 | 0.0026 |
| 0.04 | 0.8356 | 0.0921 | 0.0233 | 0.0309 | 0.0076 | 0.0080 | 0.0026 |
| 0.06 | 0.8360 | 0.0917 | 0.0231 | 0.0304 | 0.0081 | 0.0080 | 0.0026 |
| 0.08 | 0.8336 | 0.0924 | 0.0237 | 0.0317 | 0.0081 | 0.0077 | 0.0028 |
| 0.10 | 0.8335 | 0.0918 | 0.0238 | 0.0311 | 0.0086 | 0.0084 | 0.0028 |
| 0.12 | 0.8312 | 0.0931 | 0.0242 | 0.0314 | 0.0088 | 0.0081 | 0.0032 |
| 0.14 | 0.8305 | 0.0940 | 0.0241 | 0.0305 | 0.0089 | 0.0091 | 0.0029 |
| 0.16 | 0.8321 | 0.0901 | 0.0237 | 0.0319 | 0.0095 | 0.0096 | 0.0031 |
| 0.18 | 0.8268 | 0.0925 | 0.0244 | 0.0327 | 0.0110 | 0.0099 | 0.0028 |
| 0.20 | 0.8267 | 0.0930 | 0.0238 | 0.0331 | 0.0103 | 0.0096 | 0.0035 |

Note: Mean values of 100 replications.

First, we calculate the exact ANOVA components using equations 2.3 to 2.5, leading to:

$$
\begin{aligned}
f_0 &= \frac{7}{3}; \qquad f_1(x_1) = -\frac{4}{3} + 2 x_1 + x_1^2; \qquad f_2(x_2) = -\frac{1}{2} + x_2; \\
f_3(x_3) &= -\frac{1}{4} + \frac{x_3}{2}; \qquad f_{12}(x_1, x_2) = \frac{1}{2} - x_1 - x_2 + 2 x_1 x_2; \\
f_{13}(x_1, x_3) &= \frac{1}{4} - \frac{x_1}{2} - \frac{x_3}{2} + x_1 x_3; \qquad f_{23}(x_2, x_3) = \frac{1}{4} - \frac{x_2}{2} - \frac{x_3}{2} + x_2 x_3; \\
f_{123}(x_1, x_2, x_3) &= -\frac{1}{4} + \frac{x_1}{2} + \frac{x_2}{2} - x_1 x_2 + \frac{x_3}{2} - x_1 x_3 - x_2 x_3 + 2 x_1 x_2 x_3,
\end{aligned}
\tag{4.2}
$$

whose sum gives exactly the function in equation 4.1. Next, the exact variance D = 0.9037 of function f and the sensitivity indices for the functional components were calculated using equations 2.7 to 2.10. These are shown in the first row of Table 1, corresponding to σ = 0.

Subsequently, we show that we can learn the function f, its ANOVA decomposition, and the global indices from data. To this end, a sample of size m = 100 was generated for each input variable $\{x_{ik} \mid i = 1, 2, 3;\ k = 1, 2, \ldots, m\}$ from a uniform U(0, 1) random variable, and $\{y_k \mid k = 1, 2, \ldots, m\}$ was calculated using equation 4.1. Next, the AFN method was applied in order to obtain an approximation of the function f in equation 4.1. The algorithm starts by considering a set of basis functions for univariate functions. As the function to be estimated was known (because it is an illustrative example), a suitable set of basis functions was formed by third-degree polynomials. These basis functions were orthonormalized as described in example 1. The multivariate functions, with two and three arguments, were approximated with the tensor product functions obtained from the corresponding third-degree univariate polynomials (see example 2). However, only polynomials of third degree or lower were used in the approximation. After solving the minimization problem, the exact function was recovered by applying the algorithm: $f(x_1, x_2, x_3) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3$.

Next, the global sensitivity indices were calculated from the coefficients using equation 3.13 in step 3 of the algorithm. These indices were the exact ones (i.e., with no error), as can be confirmed in Table 1 in the row labeled "σ = 0.00." Finally, to test the sensitivity of the results to noise, the learning process was repeated for different noise levels, adding to y in equation 4.1 a normal N(0, σ²) noise with σ = 0.00, 0.02, . . . , 0.20. The results obtained are shown in Table 1. It can be observed that the sensitivity indices are barely affected by the noise level.

Once the global sensitivity indices have been calculated, one can decide which interactions must be included in the model. This immediately defines the network topology. Since 98.18% of the variance in the illustrative example can be obtained by the approximation

$$
f(x_1, x_2, x_3) \approx f_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{12}(x_1, x_2),
\tag{4.3}
$$

one can decide to remove all other interactions. The final topology of the network obtained using our method is shown in Figure 2b, whereas Figure 2a shows the network topology used when the proposed methodology was not applied. Comparing both network topologies, it can be observed that the application of the algorithm achieved considerable simplification while practically maintaining the quality of the outputs. To illustrate the performance of the proposed method, Table 2 shows the means and standard deviations (in parentheses) of the MSE (mean squared error) and NMSE (normalized mean squared error) for the training (80% of the sample) and testing (20% of the sample) samples, obtained with 100 replications. Since the values are small, they reveal good performance. In addition, Table 2 includes the errors obtained with increasing noise values (σ).

Figure 2: Network topologies corresponding to the three-input function in the example. (a) Using the available data directly, without applying the proposed methodology (components f1, f2, f3, f12, f13, f23, and f123). (b) Using the knowledge obtained by the proposed methodology to eliminate some interaction terms (components f1, f2, f3, and f12).

4.2 Case Example: A Vertical Breakwater Problem. Breakwaters are constructed to provide sheltered bays for ships and protect harbor facilities. Moreover, in ports open to rough seas, they play a key role in port operations. Since sea waves are enormously powerful, it is not an easy matter to construct structures to mitigate sea power. When designing a breakwater (see Figure 3), one looks for the optimal cross-section that minimizes construction and maintenance costs during the breakwater's useful life and also satisfies reliability constraints guaranteeing that the work is reasonably safe for each failure mode. Optimization of this engineering design is extremely important because of the corresponding reduction in the associated cost. Reliability analysis of the breakwater requires calculating the failure probability and the annual failure rate for each failure mode. This implies determining the pressure produced on the breakwater crownwall by the sea waves. This problem involves nine input variables (listed in Table 3) and four output variables: p1, p2, p3, and pu (the first three pi are, respectively, water pressure at the mean, freeboard, and base levels of the concrete block, and pu is the maximum subpressure value).

Table 2: Simulation Results for a Sample Size n = 100 with Different Levels of Noise.

| σ | MSE tr | NMSE tr | MSE test | NMSE test |
|---|---|---|---|---|
| 0.00 | 0.00000 (0.00000) | 0.00000 (0.00000) | 0.00000 (0.00000) | 0.00000 (0.00000) |
| 0.02 | 0.00035 (0.00006) | 0.00040 (0.00009) | 0.00054 (0.00018) | 0.00061 (0.00022) |
| 0.04 | 0.00140 (0.00023) | 0.00159 (0.00039) | 0.00228 (0.00090) | 0.00258 (0.00111) |
| 0.06 | 0.00312 (0.00052) | 0.00352 (0.00082) | 0.00494 (0.00149) | 0.00556 (0.00183) |
| 0.08 | 0.00570 (0.00095) | 0.00643 (0.00153) | 0.00944 (0.00326) | 0.01067 (0.00420) |
| 0.10 | 0.00891 (0.00162) | 0.00997 (0.00230) | 0.01371 (0.00465) | 0.01546 (0.00589) |
| 0.12 | 0.01257 (0.00248) | 0.01398 (0.00363) | 0.02062 (0.00761) | 0.02304 (0.00975) |
| 0.14 | 0.01681 (0.00300) | 0.01853 (0.00409) | 0.02661 (0.00973) | 0.02926 (0.01096) |
| 0.16 | 0.02203 (0.00389) | 0.02433 (0.00554) | 0.03563 (0.01453) | 0.03956 (0.01772) |
| 0.18 | 0.02789 (0.00489) | 0.03052 (0.00695) | 0.04186 (0.01439) | 0.04571 (0.01657) |
| 0.20 | 0.03490 (0.00623) | 0.03753 (0.00873) | 0.05572 (0.02170) | 0.06034 (0.02652) |

Notes: Mean values of 100 replications. Standard deviations appear in parentheses.

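Figure 3: Typical cross section of a vertical breakwater. (Labeled quantities: p1, p2, p3, pu, hc, h′, d, h, and Bm.)

In the case of the vertical breakwater, approximating formulas for calculating p1, p2, p3, and pu were given by Goda (1972) using his own theoretical and laboratory studies. These were later extended by other authors (Takahashi, 1996), obtaining:

$$
p_1 = 0.5 (1 + \cos\theta)(\alpha_1 + \alpha_4 \cos^2\theta)\, w_0 H_D
\tag{4.4}
$$

$$
p_2 = \alpha_2\, p_1
\tag{4.5}
$$

$$
p_3 = \alpha_3\, p_1
\tag{4.6}
$$

$$
p_u = 0.5 (1 + \cos\theta)\, \alpha_1 \alpha_2\, w_0 H_D,
\tag{4.7}
$$

where the nondimensional coefficients are given by

$$
\alpha_1 = 0.6 + 0.5 \left[ \frac{4\pi h / L}{\sinh(4\pi h / L)} \right]^2
\tag{4.8}
$$

$$
\alpha_2 = 1 - \frac{h'}{h} \left[ 1 - \frac{1}{\cosh(2\pi h / L)} \right]
\tag{4.9}
$$

$$
\alpha_3 = 1 - \min\left( 1, \frac{h_c}{0.75 (1 + \cos\theta) H_D} \right)
\tag{4.10}
$$

$$
\alpha_4 = \max(\alpha_5, \alpha_I)
\tag{4.11}
$$

$$
\alpha_5 = \min\left( (1 - d/h)(H_D/d)^2 / 3,\; 2d/H_D \right)
\tag{4.12}
$$

$$
\alpha_I = \alpha_{I0}\, \alpha_{I1}
\tag{4.13}
$$

$$
\alpha_{I0} = \begin{cases} H_D / d & \text{if } H_D \le 2d \\ 2 & \text{otherwise} \end{cases}
\tag{4.14}
$$

$$
\alpha_{I1} = \begin{cases} \cos\delta_2 / \cosh\delta_1 & \text{if } \delta_2 \le 0 \\ 1 / (\cosh\delta_1\, (\cosh\delta_2)) & \text{otherwise} \end{cases}
\tag{4.15}
$$

$$
\delta_1 = \begin{cases} 20\, \delta_{11} & \text{if } \delta_{11} \le 0 \\ 15\, \delta_{11} & \text{otherwise} \end{cases}
\tag{4.16}
$$

$$
\delta_{11} = 0.93 (B_m / L - 0.12) + 0.36 [(h - d)/h - 0.6]
\tag{4.17}
$$

$$
\delta_2 = \begin{cases} 4.9\, \delta_{22} & \text{if } \delta_{22} \le 0 \\ 3\, \delta_{22} & \text{otherwise} \end{cases}
\tag{4.18}
$$

$$
\delta_{22} = -0.36 (B_m / L - 0.12) + 0.93 [(h - d)/h - 0.6].
\tag{4.19}
$$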
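To make the piecewise structure of equations 4.4 to 4.19 concrete, the following sketch transcribes them as printed into a single function that returns the four pressures divided by w0·HD. The function name, argument names, and the sample values are assumptions for illustration; θ is taken in radians, and the sketch does not guard against inputs (such as θ = π) for which a denominator vanishes.

```python
import math


def breakwater_pressures(theta, HD, h, L, d, h_prime, hc, Bm):
    """Return (p1, p2, p3, pu), each divided by w0*HD (eqs. 4.4 to 4.19)."""
    x = 4 * math.pi * h / L
    alpha1 = 0.6 + 0.5 * (x / math.sinh(x)) ** 2                       # 4.8
    alpha2 = 1 - (h_prime / h) * (1 - 1 / math.cosh(2 * math.pi * h / L))  # 4.9
    alpha3 = 1 - min(1.0, hc / (0.75 * (1 + math.cos(theta)) * HD))   # 4.10
    alpha5 = min((1 - d / h) * (HD / d) ** 2 / 3, 2 * d / HD)          # 4.12

    rel = (h - d) / h - 0.6
    delta11 = 0.93 * (Bm / L - 0.12) + 0.36 * rel                      # 4.17
    delta22 = -0.36 * (Bm / L - 0.12) + 0.93 * rel                     # 4.19
    delta1 = 20 * delta11 if delta11 <= 0 else 15 * delta11            # 4.16
    delta2 = 4.9 * delta22 if delta22 <= 0 else 3 * delta22            # 4.18

    alpha_i0 = HD / d if HD <= 2 * d else 2.0                          # 4.14
    if delta2 <= 0:                                                    # 4.15
        alpha_i1 = math.cos(delta2) / math.cosh(delta1)
    else:
        alpha_i1 = 1 / (math.cosh(delta1) * math.cosh(delta2))
    alpha_i = alpha_i0 * alpha_i1                                      # 4.13
    alpha4 = max(alpha5, alpha_i)                                      # 4.11

    front = 0.5 * (1 + math.cos(theta))
    p1 = front * (alpha1 + alpha4 * math.cos(theta) ** 2)              # 4.4
    p2 = alpha2 * p1                                                   # 4.5
    p3 = alpha3 * p1                                                   # 4.6
    pu = front * alpha1 * alpha2                                       # 4.7
    return p1, p2, p3, pu


# Illustrative geometry (made-up values, lengths in a common unit).
p1, p2, p3, pu = breakwater_pressures(
    theta=0.0, HD=5.0, h=10.0, L=80.0, d=7.0, h_prime=9.0, hc=4.0, Bm=8.0
)
print(p1, p2, p3, pu)
```

The `min`, `max`, and `if` branches are exactly the points where the first derivatives are discontinuous, which is the difficulty for gradient-based solvers discussed in this section.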

Table 3: Input Variables for the Vertical Breakwater Problem.

| Variable | Description |
|---|---|
| θ | incidence angle of waves |
| w0 | water unit weight |
| HD | design wave height |
| h | water depth |
| L | wave length |
| d | submerged depth of the vertical breakwater |
| h′ | submerged depth of the vertical breakwater, including width of the armor layer |
| hc | free depth of the concrete block on the lee side |
| Bm | berm sea side length |

As can be observed, these formulas are very complicated and have continuity problems in the first derivatives because they use different expressions in different intervals that do not have the desired regularity properties at the change points. As a consequence, some optimization program solvers, such as GAMS, fail to obtain the optimal solution in some cases. Therefore, more complex solvers, which are more costly in terms of computational power and time, are needed. To solve this problem, the idea is to generate a random set of input variables, calculate the corresponding outputs using the above formulas, and then use these data to train a network that may solve the problem satisfactorily. This application is very important from a practical point of view because it means that an approximation can be obtained for the formulas in equations 4.4 to 4.19 without continuity or regularity problems. Therefore, the optimization problem can be stated, and the optimal solution can be obtained without any special computational requirement.

Almost all the variables involved in the breakwater problem can be represented as powers of fundamental magnitudes such as length and mass. Therefore, the breakwater problem can be simplified by the application of dimensional analysis, specifically, its main theorem: the Π-theorem (Bridgman, 1922; Buckingham, 1914, 1915). This theorem states that any physical relation in terms of n variables can be rewritten as a new relation involving r fewer variables (r being the number of independent fundamental magnitudes involved). When this theorem is applied, the input and output variables are transformed into dimensionless monomials. Different transformations are possible, but using engineering knowledge on breakwaters allows determining, without knowledge of formulas 4.4 to 4.19, that the dimensionless output variables

$$
\frac{p_1}{w_0 H_D}, \quad \frac{p_2}{w_0 H_D}, \quad \frac{p_3}{w_0 H_D}, \quad \frac{p_u}{w_0 H_D}
\tag{4.20}
$$

can be written in terms of the set of dimensionless input variables

$$
\theta, \quad \frac{h}{L}, \quad \frac{d}{h}, \quad \frac{H_D}{d}, \quad \frac{h'}{h}, \quad \frac{h_c}{H_D}, \quad \frac{B_m}{L}.
\tag{4.21}
$$

Table 4: Monomials Required for Estimating the Variables p1, p2, p3, and pu.

| Output/Input | h/L | h′/h | θ | d/h | HD/d | BM/L | hc/HD |
|---|---|---|---|---|---|---|---|
| p1/(w0 HD) | ✓ | | ✓ | ✓ | ✓ | ✓ | |
| p2/(w0 HD) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| p3/(w0 HD) | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ |
| pu/(w0 HD) | ✓ | ✓ | ✓ | | | | |
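The transformation to the monomials of equations 4.20 and 4.21 is simple enough to sketch directly. The function and key names below are hypothetical; the sample values are made up. The second call rescales every length from meters to centimeters to illustrate that the monomials, being dimensionless, do not change with the unit system.

```python
import math


def dimensionless_inputs(theta, HD, h, L, d, h_prime, hc, Bm):
    """Monomials of equation 4.21 from the raw input variables."""
    return {
        "theta": theta,
        "h/L": h / L,
        "d/h": d / h,
        "HD/d": HD / d,
        "h'/h": h_prime / h,
        "hc/HD": hc / HD,
        "Bm/L": Bm / L,
    }


def dimensionless_outputs(p1, p2, p3, pu, w0, HD):
    """Monomials of equation 4.20: each pressure divided by w0*HD."""
    scale = w0 * HD
    return tuple(p / scale for p in (p1, p2, p3, pu))


lengths_m = dict(HD=5.0, h=10.0, L=80.0, d=7.0, h_prime=9.0, hc=4.0, Bm=8.0)
lengths_cm = {name: 100.0 * value for name, value in lengths_m.items()}

in_meters = dimensionless_inputs(theta=0.3, **lengths_m)
in_centimeters = dimensionless_inputs(theta=0.3, **lengths_cm)

same = all(
    math.isclose(in_meters[key], in_centimeters[key]) for key in in_meters
)
print(same)
```

Nine raw inputs thus collapse to seven dimensionless ones, and w0 disappears from the input side entirely because it only scales the outputs.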

Not all the dimensionless monomials in equation 4.21 are needed for estimating the dimensionless outputs (see equations 4.4 to 4.19). Table 4 summarizes the required ones. In order to choose the input data points to learn these functions, several different criteria can be used; knowledge about their ranges and statistical distribution is required. (For some interesting work on this and related problems, readers are referred to Sacks et al., 1989; Koehler & Owen, 1996; and Currin et al., 1991.) Knowledge about ranges is easy to obtain from experienced engineers. However, given the lack of knowledge about the statistical distribution of the input data values in existing breakwaters, a set S of 2000 input data (1600 for training and 400 for testing) was generated randomly, assuming that each dimensionless variable in equation 4.21 is U(0, 1), that is, normalized to the interval (0, 1) in order to apply the proposed methodology. This is a reasonable assumption covering the actual ranges of the variables involved. Note that the ranges correspond to nondimensional variables.

To show the performance of the proposed methodology (AFN), all of the dimensionless inputs will be considered for estimation of the desired dimensionless outputs. It will be demonstrated that only the relevant input interactions appear in the learned models. In this way, we illustrate how one can use data from a given problem to obtain appropriate knowledge that can assist in deriving a topology for functional networks. To our knowledge, there is no other efficient method for deriving functional network topologies from data. Twenty replications with different starting values for the parameters were performed for each case studied. This example focuses on the estimation of pu/w0HD, but p1/w0HD, p2/w0HD, and p3/w0HD were also learned in a similar way, and the most significant results are presented. From this point on, we will refer to the dimensionless outputs as p1, p2, p3, and pu.
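The sampling scheme just described can be sketched as follows. The function name, the seed, and the choice to split by position are assumptions; how the normalized values map back to physical ranges is not specified here, so that step, and the evaluation of the outputs through equations 4.4 to 4.19, are left out.

```python
import random

# Names of the seven dimensionless inputs of equation 4.21.
NAMES = ["theta", "h/L", "d/h", "HD/d", "h'/h", "hc/HD", "Bm/L"]


def make_dataset(n_samples=2000, n_train=1600, seed=0):
    """Draw every normalized input from U(0, 1) and split train/test."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in NAMES] for _ in range(n_samples)]
    return rows[:n_train], rows[n_train:]


train, test = make_dataset()
print(len(train), len(test))
```

Repeating the call with different seeds would give independent data sets; the twenty replications mentioned above instead vary the starting values of the parameters.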

Table 5: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Data Set.

| Degrees | Mean NMSE (test) | (STD) | Min | Max |
|---|---|---|---|---|
| 2, 1, 1, 1, 1 | 2.3900 × 10^-1 | (3.8184 × 10^-2) | 1.9346 × 10^-1 | 3.2621 × 10^-1 |
| 2, 2, 1, 1, 1 | 7.4135 × 10^-2 | (4.8200 × 10^-3) | 6.5495 × 10^-2 | 8.4480 × 10^-2 |
| 2, 3, 1, 1, 1 | 6.6946 × 10^-2 | (4.0007 × 10^-3) | 5.9426 × 10^-2 | 7.6709 × 10^-2 |
| 2, 4, 1, 1, 1 | 6.7852 × 10^-2 | (4.0342 × 10^-3) | 5.9443 × 10^-2 | 7.7210 × 10^-2 |
| 3, 1, 1, 1, 1 | 9.4133 × 10^-1 | (5.2477 × 10^-2) | 8.3121 × 10^-1 | 1.0367 |
| 3, 3, 3, 1, 1 | 9.8419 × 10^-1 | (5.7424 × 10^-2) | 8.6991 × 10^-1 | 1.0875 |
| 3, 4, 1, 1, 1 | 9.5612 × 10^-1 | (6.0454 × 10^-2) | 8.3115 × 10^-1 | 1.0819 |
| 3, 6, 9, 1, 1 | 9.6607 × 10^-1 | (6.6378 × 10^-2) | 8.6482 × 10^-1 | 1.1104 |
| 2, 4, 3, 4, 5 | 7.4111 × 10^-2 | (6.4231 × 10^-3) | 6.1047 × 10^-2 | 8.5550 × 10^-2 |
| 3, 6, 3, 4, 5 | 0.1044 × 10^1 | (6.9930 × 10^-2) | 9.2294 × 10^-1 | 1.1540 |
| 2, 3, 1 | 6.8237 × 10^-2 | (4.2030 × 10^-3) | 5.8822 × 10^-2 | 7.4263 × 10^-2 |

Notes: Results for the estimations of pu using the proposed methodology. The last row shows the results obtained when the AFN method was applied and a simplified topology is obtained.

The complexity of the proposed method grows exponentially with the number of inputs. To estimate pu, seven input variables are given; therefore, a complex model is derived if all levels of interactions are considered. For this reason, a simpler model was chosen in the first instance, and its complexity was progressively increased. This model is obtained by considering a reduced number of univariate functions and limiting the multivariate functions obtained from the corresponding tensor products. As in the previous example, the polynomial family was selected for estimating the desired output. The first column in Table 5 shows the different degrees taken into account; the first number is the degree for the univariate functions, the second one is the degree for the bivariate functions, and so on. Note that functions of six and seven arguments were not included because there is no improvement in the performance results when more relations are considered.

The global sensitivity indices related to the best performance results and their corresponding total sensitivity indices are shown in Tables 6 and 7, respectively. To be concise, all the monomials or relations between them with index values under 0.0005 were removed from Table 6. It can be seen that the monomials with the highest sensitivity values are the three required (the first three rows in Table 6; see Table 4). Besides, the relation between the necessary variables h/L and h′/h (fourth row in Table 6) is the most important one, although it has a low value compared to the monomials individually. All other variables and relations have lower sensitivity values and thus were not considered.

Table 6: Global Sensitivity Indices for the Univariate and Bivariate Polynomials When Estimating pu.

| Monomials | Rep. 1 | Rep. 2 | Rep. 3 | Rep. 4 | Rep. 5 | Mean |
|---|---|---|---|---|---|---|
| h/L | 0.6644 | 0.6598 | 0.6544 | 0.6671 | 0.6594 | 0.6622 |
| h′/h | 0.2680 | 0.2763 | 0.2785 | 0.2626 | 0.2729 | 0.2728 |
| θ | 0.0169 | 0.0197 | 0.0161 | 0.0179 | 0.0197 | 0.0172 |
| h/L, h′/h | 0.0429 | 0.0362 | 0.0443 | 0.0440 | 0.0396 | 0.0392 |
| h/L, θ | 0.0033 | 0.0040 | 0.0029 | 0.0042 | 0.0046 | 0.0040 |
| h/L, d/h | 0.0002 | 0.0002 | 0.0000 | 0.0005 | 0.0002 | 0.0002 |
| h/L, HD/d | 0.0002 | 0.0002 | 0.0005 | 0.0001 | 0.0005 | 0.0002 |
| h′/h, θ | 0.0022 | 0.0010 | 0.0013 | 0.0011 | 0.0007 | 0.0015 |
| h′/h, d/h | 0.0002 | 0.0006 | 0.0001 | 0.0000 | 0.0001 | 0.0002 |

Notes: Results obtained by the first five replications and mean of the 20 replications. Rows with values under 0.0005 are not included. The last six rows give the global sensitivity indices for the relations between monomials.

Table 7: Total Sensitivity Indices When Estimating pu.

| Variables | Rep. 1 | Rep. 2 | Rep. 3 | Rep. 4 | Rep. 5 | Mean |
|---|---|---|---|---|---|---|
| h/L | 0.7114 | 0.7007 | 0.7024 | 0.7163 | 0.7044 | 0.7061 |
| h′/h | 0.3135 | 0.3143 | 0.3243 | 0.3080 | 0.3138 | 0.3141 |
| θ | 0.0228 | 0.0254 | 0.0207 | 0.0238 | 0.0255 | 0.0234 |
| d/h | 0.0007 | 0.0011 | 0.0007 | 0.0011 | 0.0008 | 0.0010 |
| HD/d | 0.0007 | 0.0012 | 0.0008 | 0.0005 | 0.0012 | 0.0010 |
| BM/L | 0.0007 | 0.0007 | 0.0009 | 0.0012 | 0.0010 | 0.0010 |
| hc/HD | 0.0004 | 0.0004 | 0.0008 | 0.0011 | 0.0009 | 0.0008 |

Note: Results obtained by the first 5 replications and mean of the 20 replications.

After removing the unimportant factors, the topology of the functional network is much simpler. It has only three univariate functions (second-degree polynomials) and a bivariate function relating the variables h/L and h′/h. However, the testing results after training this functional network are very similar to those obtained previously, as expected, because only the nonrelevant factors were removed. These results are presented in the last row of Table 5 to facilitate comparison. Note that the number of parameters is greatly reduced. In fact, there are 9 parameters when the proper dimensionless monomials are employed (3 univariate functions with 2 parameters and 1 bivariate function with 3 parameters), and 77 parameters are used when all the dimensionless monomials are taken into account, considering the same degree for the polynomials (7 univariate functions with 2 parameters and 21 bivariate functions with 3 parameters).
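The parameter counts quoted above follow from simple arithmetic, reproduced in the sketch below. The function name is hypothetical; the per-function parameter counts (2 for a univariate function, 3 for a bivariate one) are those stated in the text, and 21 is the number of unordered pairs of 7 inputs.

```python
from math import comb


def parameter_count(n_univariate, n_bivariate, uni_params=2, bi_params=3):
    """Total parameters for a model with univariate and bivariate terms."""
    return n_univariate * uni_params + n_bivariate * bi_params


# Simplified topology: 3 univariate functions and 1 bivariate function.
reduced = parameter_count(3, 1)

# Full topology: 7 univariate functions and all C(7, 2) = 21 pairs.
full = parameter_count(7, comb(7, 2))

print(reduced, full)  # 9 77
```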

Table 8: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p1, p2, and p3.

| Output | Method | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|
| p1 | AFN | 1.8693 × 10^-1 | (2.6462 × 10^-2) | 1.4714 × 10^-1 | 2.4397 × 10^-1 |
| p1 | FN | 1.9795 × 10^-1 | (2.6624 × 10^-2) | 1.5455 × 10^-1 | 2.4882 × 10^-1 |
| p2 | AFN | 7.7311 × 10^-2 | (7.0881 × 10^-3) | 6.5819 × 10^-2 | 9.0760 × 10^-2 |
| p2 | FN | 9.1324 × 10^-2 | (1.0700 × 10^-2) | 7.3875 × 10^-2 | 1.1522 × 10^-1 |
| p3 | AFN | 7.7058 × 10^-2 | (1.1104 × 10^-2) | 6.2961 × 10^-2 | 1.0444 × 10^-1 |
| p3 | FN | 1.0552 × 10^-1 | (1.7653 × 10^-2) | 7.7863 × 10^-2 | 1.3590 × 10^-1 |

Table 9: Total Sensitivity Indices for the Best Approximation of p1, p2, and p3.

| Input/Output | p1 | p2 | p3 |
|---|---|---|---|
| h/L | 0.550433 | 0.652522 | 0.183044 |
| h′/h | 0.003544 | 0.323328 | 0.001595 |
| θ | 0.220720 | 0.040430 | 0.129206 |
| d/h | 0.107570 | 0.016706 | 0.035828 |
| HD/d | 0.124541 | 0.019912 | 0.041625 |
| BM/L | 0.083042 | 0.013103 | 0.027788 |
| hc/HD | 0.003075 | 0.001099 | 0.636814 |

Similar experiments were carried out for the other output variables (p1, p2, and p3). However, for simplicity and clarity, only the most significant results are included. The best performance results obtained applying the proposed methodology (AFN) are shown in Tables 8 and 9, the latter indicating the total sensitivity index for each input variable. The variables with the lowest values are those not checked in Table 4, and thus the proposed method is also able to discard the irrelevant variables in these cases. Again, considering the topology derived from the application of the proposed method, a functional network is trained for each output, and its performance results are also included in Table 8 (rows entitled FN). Again, it is important to remark that the simplification of the topology means a significant reduction in the number of parameters while maintaining the performance of the approach.

By applying the proposed method, interactions between variables are removed and irrelevant variables are found. This suggests that the AFN method can be applied as a feature selection method, that is, a method that reduces the number of original features or variables by selecting a subset of them. The advantages of reducing the number of inputs have been extensively discussed in the machine learning literature (Kohavi & John, 1997; Guyon & Elisseeff, 2003). Two of the most important ones are that it reduces computational complexity and that it improves performance results. As a feature selection method, the AFN method should indicate how to choose the relevant variables. This information is provided by the total sensitivity indices (TSI), which rank the variables in terms of their contribution to the variance. Besides, a threshold is required to determine the variables to be selected, in such a way that variables with TSIs under the threshold are discarded. The establishment of this threshold is not a trivial issue and will be discussed in a further study.

As a preliminary approach, the AFN method was used as a feature selection method for the breakwater problem. The threshold was established at 1%, that is, variables with TSI over 1% were chosen (see Tables 7 and 9). Then multilayer perceptrons (MLP) were used to solve the breakwater problem in both cases (using all the given inputs and only the inputs selected by the AFN method). The hyperbolic tangent was employed as the transfer function and the MSE as the performance function. The Levenberg-Marquardt learning algorithm was used to train the network (Levenberg, 1944; Marquardt, 1963). Two thousand epochs were employed because they were enough for the MLP to converge. The results are particularly illustrative in the case of pu, because the number of inputs is drastically reduced from 7 to 3. Table 10 shows these results. As can be seen, the reduction in the number of inputs leads to better performance when the same number of neurons in the hidden layer is used, although it involves a smaller set of weights, that is, the network topology is simplified. The best results achieved when estimating p1, p2, and p3 are shown in Table 11, although the variable reduction is not so important. Again, better performance results are achieved.

Table 10: Mean, Standard Deviation, and Minimum and Maximum Values for the Normalized MSE for the Test Set When Estimating pu Considering All the Inputs and Only the Three Relevant Ones.

| Inputs | Neurons | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|
| Seven variables | 5 | 1.9880 × 10^-3 | (1.2739 × 10^-3) | 1.7747 × 10^-4 | 4.8721 × 10^-3 |
| Seven variables | 10 | 4.8390 × 10^-4 | (1.0625 × 10^-3) | 1.7887 × 10^-6 | 3.7790 × 10^-3 |
| Seven variables | 15 | 2.5818 × 10^-4 | (7.5651 × 10^-4) | 1.0661 × 10^-6 | 2.6543 × 10^-3 |
| Seven variables | 20 | 1.5951 × 10^-6 | (2.2556 × 10^-6) | 2.5091 × 10^-7 | 1.0151 × 10^-5 |
| Seven variables | 25 | 1.4011 × 10^-6 | (2.6363 × 10^-6) | 8.3305 × 10^-8 | 1.1963 × 10^-5 |
| Seven variables | 27 | 8.0122 × 10^-7 | (8.1440 × 10^-7) | 1.4289 × 10^-7 | 3.7679 × 10^-6 |
| Seven variables | 30 | 2.5952 × 10^-6 | (6.3535 × 10^-6) | 7.0063 × 10^-8 | 2.4735 × 10^-5 |
| Three relevant variables | 5 | 1.4199 × 10^-3 | (9.9263 × 10^-4) | 1.5737 × 10^-4 | 2.8286 × 10^-3 |
| Three relevant variables | 10 | 2.4766 × 10^-5 | (6.7618 × 10^-5) | 1.8441 × 10^-6 | 3.0382 × 10^-4 |
| Three relevant variables | 15 | 1.1263 × 10^-6 | (9.7697 × 10^-7) | 1.5265 × 10^-7 | 3.8548 × 10^-6 |
| Three relevant variables | 20 | 2.1324 × 10^-7 | (2.7517 × 10^-7) | 3.4903 × 10^-8 | 1.2734 × 10^-6 |
| Three relevant variables | 25 | 1.1236 × 10^-7 | (1.0665 × 10^-7) | 1.8267 × 10^-8 | 4.3665 × 10^-7 |
| Three relevant variables | 27 | 7.2760 × 10^-8 | (4.1195 × 10^-8) | 1.0599 × 10^-8 | 1.9125 × 10^-7 |
| Three relevant variables | 30 | 8.8819 × 10^-8 | (7.7426 × 10^-8) | 1.8998 × 10^-8 | 2.6956 × 10^-7 |

Note: Neurons refers to the number of neurons in the hidden layer.

Table 11: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p1, p2, and p3, Considering All the Inputs (All) and Only the Relevant (Rel.) Ones.

| Output | N | Var. | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|---|
| p1 | 35 | All | 1.4148 × 10^-2 | (2.1045 × 10^-2) | 2.8752 × 10^-3 | 9.8051 × 10^-2 |
| p1 | 37 | Rel. | 3.6129 × 10^-3 | (9.4444 × 10^-4) | 2.2487 × 10^-3 | 5.0922 × 10^-3 |
| p2 | 21 | All | 1.0843 × 10^-2 | (8.4366 × 10^-3) | 4.1416 × 10^-3 | 3.6060 × 10^-2 |
| p2 | 22 | Rel. | 7.3305 × 10^-3 | (4.7671 × 10^-3) | 2.6945 × 10^-3 | 2.2399 × 10^-2 |
| p3 | 37 | All | 6.3983 × 10^-3 | (4.9059 × 10^-3) | 2.4433 × 10^-3 | 2.2456 × 10^-2 |
| p3 | 39 | Rel. | 3.9794 × 10^-3 | (2.0662 × 10^-3) | 2.0258 × 10^-3 | 9.1660 × 10^-3 |

Notes: N refers to the number of neurons in the hidden layer. Var. refers to the set of input variables used (all of them, or only the relevant ones).

5 Conclusions and Future Work

In this article, a new methodology, the AFN method, based on ANOVA decomposition and functional networks, was presented and described. This methodology permits a simplified topology for a functional network to be learned from the data available. As stated in section 1, to our knowledge, there is no other method that permits this to be done. In addition to the advantages inherited from the ANOVA decomposition (uniqueness and orthogonality), the proposed methodology has the following advantages:

• Global sensitivity indices can be obtained from the application of the AFN. These can be used to establish the relevance of each functional component; consequently, they determine the input variables and relations to be selected. If a particular variable has no influence (in isolation or related to others), it can be discarded as an input for the functional or neural network.

• It allows learning and simplifying the topology of a functional or neural network. If the variables required for estimating a specific function are important, then the topology of the functional network should include these relationships.

• All existing multivariate interactions among variables are identified by the global sensitivity indices.

• Local sensitivity indices. Although the letter was focused on the global sensitivity indices for deriving the functional network topology, it is important to note that the proposed methodology also provides local sensitivity indices. These indices could be used to detect outliers in the samples or to perform selective sampling (this will be a future line of research for us).

• Several alternatives for selecting the orthonormal basis functions for the proposed approximation are available. Moreover, a new one has been presented here. As the orthonormalization is easily accomplished by one of these alternatives, the application of the AFN requires only a minimization problem to be solved.
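The first advantage listed above, selecting inputs from sensitivity indices, can be sketched with the 1% threshold used in section 4.2. The total sensitivity indices below are the mean values reported in Table 7 for pu; the function name and the dictionary keys are assumptions made for this illustration.

```python
# Mean total sensitivity indices for pu, copied from Table 7.
tsi_pu = {
    "h/L": 0.7061,
    "h'/h": 0.3141,
    "theta": 0.0234,
    "d/h": 0.0010,
    "HD/d": 0.0010,
    "BM/L": 0.0010,
    "hc/HD": 0.0008,
}


def select_features(tsi, threshold=0.01):
    """Rank variables by TSI and keep those above the threshold."""
    ranked = sorted(tsi.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value > threshold]


selected = select_features(tsi_pu)
print(selected)  # ['h/L', "h'/h", 'theta']
```

With this threshold the seven inputs reduce to the three that equations 4.7 to 4.9 actually involve, matching the reduction from 7 to 3 inputs reported for pu.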

The suitability of the proposed methodology was illustrated by its application to a real engineering problem: the design of a vertical breakwater. The performance results obtained were compared to those obtained using functional and neural networks to solve the same problem. It was demonstrated that although the AFN obtains similar performance results, it also returns a set of sensitivity indices that permits the initial topologies to be simplified (functional networks) or some of the input variables to be eliminated (neural networks). In view of the results obtained, future work will involve adapting the proposed methodology for use as a feature subset selection method. A detailed study will be carried out comparing the proposed methodology with the existing feature selection methods.

Acknowledgments

We are indebted to the Spanish Ministry of Science and Technology with FEDER Funds (Projects DPI2002-04172-C04-02 and TIC-2003-00600), to the Xunta de Galicia (Project PGIDIT04-PXIC10502), and to Iberdrola for partial support of this work.

References

Aczél, J. (1966). Lectures on functional equations and their applications. New York: Academic Press.

Bridgman, P. (1922). Dimensional analysis. New Haven, CT: Yale University Press.

Brooke, A., Kendrick, D., Meeraus, A., & Raman, R. (1998). GAMS: A user's guide. Washington, DC: GAMS Development Corporation.

Buckingham, E. (1914). On physically similar systems: Illustrations of dimensional equations. Phys. Rev., 4, 345–376.

Buckingham, E. (1915). Model experiments and the form of empirical equations. Trans. ASME, 37, 263.

Castillo, E. (1998). Functional networks. Neural Processing Letters, 7, 151–159.

Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, E. (1998). Functional networks with applications. Boston: Kluwer.

Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modeling, 23, 89–107.

Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (2000). Functional networks: A new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15, 90–106.

Castillo, E., & Gutiérrez, J. M. (1998). Nonlinear time series modeling and prediction using functional networks. Extracting information masked by chaos. Physics Letters A, 244, 71–84.

Castillo, E., Gutiérrez, J., & Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26, 412–423.

Castillo, E., Iglesias, A., & Ruíz-Cobo, R. (2004). Functional equations in applied sciences. New York: Elsevier.

Castillo, E., & Ruíz-Cobo, R. (1992). Functional equations in science and engineering. New York: Marcel Dekker.

Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York: Wiley.

Currin, C., Mitchell, T., Morris, M., & Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86, 953–963.

Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.

Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society, 52, 399–433.

Goda, Y. (1972). Laboratory investigation of wave pressure exerted upon vertical and composite walls. Coastal Engineering, 15, 81–90.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Hadi, A., & Nyquist, H. (2002). Sensitivity analysis in statistics. Journal of Statistical Studies: A Special Volume in Honor of Professor Mir Masoom Ali's 65th Birthday, 125–138.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325.

Jiang, T., & Owen, A. (2001). Quasi-regression with shrinkage. Mathematics and Computers in Simulation, 62, 231–241.

Koehler, J., & Owen, A. (1996). Computer experiments: Design and analysis of experiments. In S. Ghosh & C. R. Rao (Eds.), Handbook of statistics, 13 (pp. 261–308). New York: Elsevier Science.

Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.

Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2(2), 164–168.

Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441.

Sacks, J., Mitchell, W. W. T., & Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4(4), 409–435.

Saltelli, A., Chan, K., & Scott, M. (2000). Sensitivity analysis. New York: Wiley.

Sobol, I. M. (1969). Multidimensional quadrature formulas and Haar functions. Moscow: Nauka. (In Russian.)

Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271–280.

Takahashi, S. (1996). Design of vertical breakwaters (Tech. Rep. No. 34). Yokosuka, Japan: Port and Harbour Research Institute, Ministry of Transport.

Received August 12, 2004; accepted June 17, 2006.

LETTER

Communicated by Erin Bredensteiner

Second-Order Cone Programming Formulations for Robust Multiclass Classification

Ping Zhong
College of Science, China Agricultural University, Beijing, 100083, China

Masao Fukushima
Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan

Multiclass classification is an important and ongoing research subject in machine learning. Current support vector methods for multiclass classification implicitly assume that the parameters in the optimization problems are known exactly. However, in practice, the parameters have perturbations since they are estimated from the training data, which are usually subject to measurement noise. In this article, we propose linear and nonlinear robust formulations for multiclass classification based on the M-SVM method. The preliminary numerical experiments confirm the robustness of the proposed method.

1 Introduction

Given $L$ labeled examples known to come from $K$ ($> 2$) classes, $T = \{(x_p, \theta_p)\}_{p=1}^{L} \subset X \times Y$, where $X \subset \mathbb{R}^N$ and $Y = \{1, \dots, K\}$, multiclass classification refers to the construction of a discriminant function from the input space $X$ onto the unordered set of classes $Y$. Support vector machines (SVMs) serve as a useful and popular tool for classification. Recent developments in the study of SVMs show that there are roughly two types of approaches to tackle the multiclass classification problem. One is to construct and fuse several binary classifiers, such as "one-against-all" (Bottou et al., 1994; Vapnik, 1998), "one-against-one" (Hastie & Tibshirani, 1998; Kressel, 1999), directed acyclic graph SVM (DAGSVM; Platt, Cristianini, & Shawe-Taylor, 2000), error-correcting output code (ECOC; Allwein, Schapire, & Singer, 2001; Dietterich & Bakiri, 1995), K-SVCR method (Angulo, Parra, & Català, 2003), and ν-K-SVCR

Neural Computation 19, 258–282 (2007)

© 2006 Massachusetts Institute of Technology

SOCP for Multiclass Classification


method (Zhong & Fukushima, 2006), among others. The other, called "all-together," is to consider all data in one optimization formulation (Bennett & Mangasarian, 1994; Bredensteiner & Bennett, 1999; Guermeur, 2002; Vapnik, 1998; Weston & Watkins, 1998; Yajima, 2005). In this letter, we focus on the second approach.

There are several all-together methods. The method independently proposed by Vapnik (1998) and Weston and Watkins (1998) is similar to one-against-all. It constructs K two-class discriminants where each discriminant separates a single class from all the others. Hence, there are K decision functions, but all are obtained by solving one optimization problem. Bennett and Mangasarian (1994) constructed a piecewise-linear discriminant for the K-class classification by a single linear program. The method called M-SVM (Bredensteiner & Bennett, 1999) extends their method to generate a kernel-based nonlinear K-class discriminant by solving a convex quadratic program. Although the original forms proposed by Vapnik (1998), Weston and Watkins (1998), and Bredensteiner and Bennett (1999) are different, they are not only equivalent to each other, but also equivalent to that proposed by Guermeur (2002). Based on M-SVM, linear programming formulations have been proposed in a low-dimensional feature subspace (Yajima, 2005).

In the methods noted, the parameters in the optimization problems are implicitly assumed to be known exactly. However, in practice, these parameters have perturbations since they are estimated from the training data, which are usually corrupted by measurement noise. As pointed out by Goldfarb and Iyengar (2003), the solutions to the optimization problems are sensitive to parameter perturbations. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. So it will be useful to explore formulations that can yield discriminants robust to such estimation errors.
In this article, we propose a robust formulation of M-SVM, which is represented as a second-order cone program (SOCP). The second-order cone (SOC) in $\mathbb{R}^n$ ($n \ge 1$), also called the Lorentz cone, is the convex cone defined by

$$\mathcal{K}^n = \left\{ \begin{pmatrix} z_0 \\ \bar z \end{pmatrix} : z_0 \in \mathbb{R},\ \bar z \in \mathbb{R}^{n-1},\ \|\bar z\| \le z_0 \right\},$$

where $\|\cdot\|$ denotes the Euclidean norm. The SOCP is a special class of convex optimization problems involving SOC constraints, which can be efficiently solved by interior point methods. The work related to SOCP can be seen, for example, in Alizadeh and Goldfarb (2003), Fukushima, Luo, and Tseng (2002), Hayashi, Yamashita, and Fukushima (2005), and Lobo, Vandenberghe, Boyd, and Lébret (1998).

The letter is organized as follows. We first propose a robust formulation for piecewise-linear M-SVM in section 2 and then construct a robust


P. Zhong and M. Fukushima

classifier based on the dual SOCP formulation in section 3. In section 4, we extend the robust classifier to the piecewise-nonlinear M-SVM case. Section 5 gives numerical results. Section 6 concludes the letter.

2 Robust Piecewise-Linear M-SVM Formulation

For each $i$, let $\mathcal{A}^i$ be a set of examples in the $N$-dimensional real space $\mathbb{R}^N$ with cardinality $l_i$. Let $A^i$ be an $l_i \times N$ matrix whose rows are the examples in $\mathcal{A}^i$. The $p$th example in $\mathcal{A}^i$ and the $p$th row of $A^i$ are both denoted $A^i_p$. Let $e^i$ denote the vector of ones of dimension $l_i$. For each $i$, let $w^i$ be a vector in $\mathbb{R}^N$ and $b^i$ be a real number. The sets $\mathcal{A}^i$, $i = 1, \dots, K$, are called piecewise-linearly separable (Bredensteiner & Bennett, 1999) if there exist $w^i$ and $b^i$, $i = 1, \dots, K$, such that

$$A^i w^i - b^i e^i > A^i w^j - b^j e^i, \quad i, j = 1, \dots, K,\ i \ne j.$$

Piecewise-linear M-SVM can be formulated as follows (Bredensteiner & Bennett, 1999):

$$\begin{aligned}
\min_{w,\,b,\,y}\quad & \nu\left[\frac12\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac12\sum_{i=1}^{K}\|w^i\|^2\right]+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}\\
\text{s.t.}\quad & A^i(w^i-w^j)-(b^i-b^j)e^i+y^{ij}\ge e^i,\quad y^{ij}\ge 0,\quad i,j=1,\dots,K,\ i\ne j,
\end{aligned} \tag{2.1}$$

where $\nu \in (0, 1]$,

$$w = \left[(w^1)^T, (w^2)^T, \dots, (w^K)^T\right]^T \in \mathbb{R}^{KN}, \tag{2.2}$$

$$b = \left[b^1, b^2, \dots, b^K\right]^T \in \mathbb{R}^{K}, \tag{2.3}$$

$$y = \left[(y^{12})^T, \dots, (y^{1K})^T, \dots, (y^{K1})^T, \dots, (y^{K(K-1)})^T\right]^T \in \mathbb{R}^{L(K-1)}, \tag{2.4}$$

and $L = \sum_{i=1}^{K} l_i$. When $\nu = 1$, equation 2.1 is the formulation for the piecewise-linearly separable case. Otherwise, it is the formulation for the piecewise-linearly inseparable case. Figure 1 shows an example of a piecewise-linearly separable M-SVM for three classes in two dimensions.

The training data $A^i$, $i = 1, \dots, K$, used in problem 2.1, are implicitly assumed to be known exactly. However, in practice, training data are often corrupted by measurement noise. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. For example, suppose each example in Figure 1 is allowed to move in a sphere (see Figure 2). The original discriminants cannot separate the training data


[Figure 1 image not reproduced. Labels: classes A1, A2, A3; discriminants (w^1 − w^2, b^1 − b^2), (w^1 − w^3, b^1 − b^3), (w^2 − w^3, b^2 − b^3); margin lines (w^1 − w^2)^T x = (b^1 − b^2) ± 1.]

Figure 1: Three classes separated by piecewise-linear M-SVM discriminants.

[Figure 2 image not reproduced. Labels: classes A1, A2, A3.]

Figure 2: An example of the effect of measurement noises.

sets in the worst case. It will be useful to explore formulations that can yield discriminants robust to such estimation errors. In the following, we discuss such a formulation. We assume

$$\hat A^i_p = A^i_p + \rho^i_p (a^i_p)^T, \tag{2.5}$$

where $\hat A^i_p$ is the actual value of the training data and $\rho^i_p (a^i_p)^T$ is the measurement noise, with $a^i_p \in \mathbb{R}^N$, $\|a^i_p\| = 1$, and $\rho^i_p \ge 0$ being a given constant. Denote the unit sphere in $\mathbb{R}^N$ by $U = \{a \in \mathbb{R}^N : \|a\| = 1\}$. The robust


version of formulation 2.1 can be stated as follows:

$$\begin{aligned}
\min_{w,\,b,\,y}\quad & \nu\left[\frac12\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac12\sum_{i=1}^{K}\|w^i\|^2\right]+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}\\
\text{s.t.}\quad & A^i_p(w^i-w^j)+\rho^i_p(a^i_p)^T(w^i-w^j)-(b^i-b^j)+y^{ij}_p\ge 1,\quad \forall\, a^i_p\in U,\\
& y^{ij}_p\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j.
\end{aligned} \tag{2.6}$$

Since $\min\{\rho^i_p (a^i_p)^T (w^i - w^j) : a^i_p \in U\} = -\rho^i_p \|w^i - w^j\|$, problem 2.6 is equivalent to the following SOCP:

$$\begin{aligned}
\min_{w,\,b,\,y}\quad & \nu\left[\frac12\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac12\sum_{i=1}^{K}\|w^i\|^2\right]+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}\\
\text{s.t.}\quad & A^i_p(w^i-w^j)-\rho^i_p\|w^i-w^j\|-(b^i-b^j)+y^{ij}_p\ge 1,\\
& y^{ij}_p\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j.
\end{aligned} \tag{2.7}$$
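The step from problem 2.6 to problem 2.7 rests on the identity min{ρ a^T v : ‖a‖ = 1} = −ρ‖v‖, which is attained at a = −v/‖v‖. The following is a minimal numerical sketch of that identity, not part of the original method; the function name `worst_case_margin_shift`, the random vector `v` (standing in for w^i − w^j), and the value of `rho` are illustrative assumptions.

```python
import numpy as np

def worst_case_margin_shift(v, rho):
    """Return (value, minimizer) of min rho * a^T v over unit vectors a."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        # Every unit vector gives 0, so any unit vector is a minimizer.
        a = np.zeros_like(v)
        a[0] = 1.0
        return 0.0, a
    a_star = -v / norm_v          # points opposite to v
    return -rho * norm_v, a_star

rng = np.random.default_rng(0)
v = rng.normal(size=5)            # plays the role of w^i - w^j (assumed data)
rho = 0.3                         # assumed noise radius
value, a_star = worst_case_margin_shift(v, rho)

# Randomly sampled unit vectors never beat the closed-form minimizer.
samples = rng.normal(size=(10000, 5))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
sampled_min = (rho * samples @ v).min()
```

Because the worst case has a closed form, the infinitely many constraints indexed by a in U collapse into a single norm constraint per (i, j, p), which is what makes problem 2.7 an SOCP.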

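Denote $Q = (K+1) I_{KN} - \bar I$, with $I_{KN}$ being the identity matrix of order $KN$ and

$$\bar I = \begin{pmatrix} I_N & I_N & \cdots & I_N \\ I_N & I_N & \cdots & I_N \\ \vdots & \vdots & \ddots & \vdots \\ I_N & I_N & \cdots & I_N \end{pmatrix} \in \mathbb{R}^{KN \times KN}.$$

Denote

$$e = [\underbrace{(e^1)^T, \dots, (e^1)^T}_{K-1}, \dots, \underbrace{(e^K)^T, \dots, (e^K)^T}_{K-1}]^T \in \mathbb{R}^{L(K-1)}.$$

The objective function of problem 2.7 can then be expressed compactly as

$$\frac{\nu}{2} w^T Q w + (1-\nu) e^T y. \tag{2.8}$$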


Additionally, $Q$ is a symmetric positive definite matrix, which can be inferred from the following proposition. The proof of the proposition is omitted since it is similar to that given by Yajima (2005).

Proposition 1. Denote

$$C = \sqrt{K+1}\, I_{KN} - \frac{\sqrt{K+1}-1}{K}\, \bar I.$$

Then

1. $Q = C^2$.
2. $C$ is nonsingular, and

$$C^{-1} = \frac{1}{\sqrt{K+1}}\, I_{KN} + \frac{\sqrt{K+1}-1}{K\sqrt{K+1}}\, \bar I.$$
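The compact objective 2.8 and proposition 1 can be checked numerically. The sketch below is an illustration under assumed sizes (K = 4, N = 3) and random data; the helper names `build_Q_and_C` and `regularizer` are hypothetical and do not come from the letter. It builds the block matrix of identity blocks with a Kronecker product and compares both sides of each identity.

```python
import numpy as np

def build_Q_and_C(K, N):
    """Build Q = (K+1) I - Ibar and the matrices C, C^{-1} of proposition 1."""
    I = np.eye(K * N)
    Ibar = np.kron(np.ones((K, K)), np.eye(N))   # K x K grid of I_N blocks
    s = np.sqrt(K + 1.0)
    Q = (K + 1) * I - Ibar
    C = s * I - (s - 1.0) / K * Ibar
    C_inv = I / s + (s - 1.0) / (K * s) * Ibar
    return Q, C, C_inv

def regularizer(W):
    """sum_i sum_{j<i} ||w^i - w^j||^2 + sum_i ||w^i||^2 for rows w^i of W."""
    K = W.shape[0]
    pair = sum(np.sum((W[i] - W[j]) ** 2) for i in range(K) for j in range(i))
    return pair + np.sum(W ** 2)

K, N = 4, 3                       # assumed sizes for the check
rng = np.random.default_rng(1)
W = rng.normal(size=(K, N))       # row i is w^i
w = W.reshape(-1)                 # stacked vector (w^1; ...; w^K)
Q, C, C_inv = build_Q_and_C(K, N)
quad = w @ Q @ w                  # should equal regularizer(W)
```

The eigenvalues of Q are 1 and K + 1, so Q is positive definite, which is what allows the quadratic term to be written as ‖Cw‖².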

Let $H^{ij}$ be the $KN \times N$ matrix with all blocks being $N \times N$ zero matrices except the $i$th block being $I_N$ and the $j$th block being $-I_N$:

$$H^{ij} = [O, \dots, O, I_N, O, \dots, O, -I_N, O, \dots, O]^T. \tag{2.9}$$

Then, by equation 2.2 we get

$$w^i - w^j = (H^{ij})^T w. \tag{2.10}$$

Let $r^{ij}$ be the $K$-dimensional vector with all components being zero except the $i$th component being 1 and the $j$th component being $-1$:

$$r^{ij} = [0, \dots, 0, 1, 0, \dots, 0, -1, 0, \dots, 0]^T. \tag{2.11}$$

Then by equation 2.3 we get

$$b^i - b^j = (r^{ij})^T b. \tag{2.12}$$

Let $h^{ij}_p$ be the $L(K-1)$-dimensional vector with all components being zero except the $\left((K-1)\sum_{k=1}^{i-1} l_k + (j-1) l_i + p\right)$th component being 1:

$$h^{ij}_p = [0, \dots, 0, \dots, 0, 1, 0, \dots, 0, \dots, 0]^T. \tag{2.13}$$

Then by equation 2.4, we get

$$y^{ij}_p = (h^{ij}_p)^T y. \tag{2.14}$$

By equations 2.10, 2.12, and 2.14, the first constraint in problem 2.7 can be rewritten as follows:

$$\rho^i_p \left\|(H^{ij})^T w\right\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1. \tag{2.15}$$


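Therefore, by equations 2.8 and 2.15 and proposition 1, formulation 2.7 can be written as follows:

$$\begin{aligned}
\min_{w,\,b,\,y,\,t}\quad & \nu t + (1-\nu) e^T y\\
\text{s.t.}\quad & \tfrac12 \|Cw\|^2 \le t,\\
& \rho^i_p \left\|(H^{ij})^T w\right\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1,\\
& p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ i \ne j,\quad y \ge 0.
\end{aligned} \tag{2.16}$$

Furthermore, formulation 2.16 can be cast as the following SOCP:

$$\begin{aligned}
\min_{w,\,b,\,y,\,t}\quad & \nu t + (1-\nu) e^T y\\
\text{s.t.}\quad & \left\|\begin{pmatrix}\sqrt2\, Cw \\ 1 - t\end{pmatrix}\right\| \le 1 + t,\\
& \rho^i_p \left\|(H^{ij})^T w\right\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1,\\
& p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ i \ne j,\quad y \ge 0.
\end{aligned} \tag{2.17}$$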
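The passage from 2.16 to 2.17 replaces the quadratic constraint by a cone constraint: squaring ‖(√2 Cw, 1 − t)‖ ≤ 1 + t gives 2‖Cw‖² + (1 − t)² ≤ (1 + t)², that is, ½‖Cw‖² ≤ t. The sketch below checks that the two constraints accept and reject the same points on random samples; the vector `u` stands in for Cw, and the function names and sampling ranges are assumptions made only for this illustration.

```python
import numpy as np

def in_quadratic_form(u, t):
    """Constraint of problem 2.16 with u = C w: 0.5 * ||u||^2 <= t."""
    return 0.5 * (u @ u) <= t

def in_cone_form(u, t):
    """Constraint of problem 2.17: ||(sqrt(2) u, 1 - t)|| <= 1 + t."""
    z = np.append(np.sqrt(2.0) * u, 1.0 - t)
    return np.linalg.norm(z) <= 1.0 + t

rng = np.random.default_rng(2)
trials = 2000
agree = 0
for _ in range(trials):
    u = rng.normal(size=3)            # assumed stand-in for C w
    t = rng.uniform(-1.0, 4.0)        # includes infeasible negative t
    agree += int(in_quadratic_form(u, t) == in_cone_form(u, t))
```

Points lying exactly on the boundary can differ by floating-point rounding, so the comparison is meaningful only for generic (randomly drawn) points.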

3 Robust Piecewise-Linear M-SVM Classifier

In this section, we construct a robust piecewise-linear M-SVM classifier based on the dual formulation of problem 2.17.

3.1 Dual of the Robust Piecewise-Linear M-SVM Formulation. Denote

$$\bar A = \left[B_1^T, B_2^T, \dots, B_K^T\right]^T \in \mathbb{R}^{L(K-1) \times KN} \tag{3.1}$$


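with

$$B_i = \begin{pmatrix}
-A^i & \cdots & O & A^i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -A^i & A^i & O & \cdots & O \\
O & \cdots & O & A^i & -A^i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & A^i & O & \cdots & -A^i
\end{pmatrix} \in \mathbb{R}^{l_i(K-1) \times KN}.$$

Denote

$$\bar H = \left[\bar M_1^T, \bar M_2^T, \dots, \bar M_K^T\right]^T \in \mathbb{R}^{LN(K-1) \times KN}, \tag{3.2}$$

where

$$\bar M_i = \begin{pmatrix}
-M_i & \cdots & O & M_i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -M_i & M_i & O & \cdots & O \\
O & \cdots & O & M_i & -M_i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & M_i & O & \cdots & -M_i
\end{pmatrix} \in \mathbb{R}^{l_i N(K-1) \times KN},$$

with

$$M_i := M_i(\rho) = \left[\rho^i_1 I_N, \dots, \rho^i_{l_i} I_N\right]^T \in \mathbb{R}^{l_i N \times N}, \quad i = 1, \dots, K.$$

Denote

$$\bar E = \left[E_1^T, E_2^T, \dots, E_K^T\right]^T \in \mathbb{R}^{L(K-1) \times K} \tag{3.3}$$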


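with

$$E_i = \begin{pmatrix}
-e^i & \cdots & 0 & e^i & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & -e^i & e^i & 0 & \cdots & 0 \\
0 & \cdots & 0 & e^i & -e^i & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & e^i & 0 & \cdots & -e^i
\end{pmatrix} \in \mathbb{R}^{l_i(K-1) \times K}.$$

We can derive the following dual of problem 2.17 (see appendix A):

$$\begin{aligned}
\max_{\alpha,\,s,\,\sigma,\,\tau}\quad & e^T \alpha - (\sigma + \tau)\\
\text{s.t.}\quad & \bar E^T \alpha = 0,\\
& \alpha \le (1-\nu) e,\\
& \sigma - \tau = \nu,\\
& \left\|\begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right) \\ \tau \end{pmatrix}\right\| \le \sigma,\\
& \|s^{ij}_p\| \le \alpha^{ij}_p,\quad p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ j \ne i,
\end{aligned} \tag{3.4}$$

where

$$\alpha = \left[(\alpha^{12})^T, \dots, (\alpha^{1K})^T, \dots, (\alpha^{K1})^T, \dots, (\alpha^{K(K-1)})^T\right]^T \in \mathbb{R}^{L(K-1)}, \tag{3.5}$$

$$s = \left[(s^{12}_1)^T, \dots, (s^{12}_{l_1})^T, \dots, (s^{1K}_1)^T, \dots, (s^{1K}_{l_1})^T, \dots, (s^{K(K-1)}_1)^T, \dots, (s^{K(K-1)}_{l_K})^T\right]^T \in \mathbb{R}^{LN(K-1)}. \tag{3.6}$$

In addition, we get the following complementary equations at optimality:

$$\begin{pmatrix} \alpha^{ij}_p \\ s^{ij}_p \end{pmatrix}^T \begin{pmatrix} A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1 \\ \rho^i_p (H^{ij})^T w \end{pmatrix} = 0, \quad p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ j \ne i, \tag{3.7}$$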

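$$\begin{pmatrix} \sigma \\ -\frac{1}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right) \\ \tau \end{pmatrix}^T \begin{pmatrix} 1 + t \\ \sqrt2\, Cw \\ 1 - t \end{pmatrix} = 0, \tag{3.8}$$

$$\left((1-\nu) e - \alpha\right)^T y = 0. \tag{3.9}$$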

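3.2 Robust Classifier. From formulation 3.4 we get $\sigma > 0$. In fact, if $\sigma = 0$, then $\tau = 0$. The third constraint of formulation 3.4 becomes $\nu = 0$, which contradicts $\nu > 0$. By the complementary equation 3.8, we have the following implications (see appendix B for the complementary conditions in SOCP, equations B.1 to B.3): If

$$\left\|\begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right) \\ \tau \end{pmatrix}\right\| < \sigma,$$

then

$$\left\|\begin{pmatrix} \sqrt2\, Cw \\ 1 - t \end{pmatrix}\right\| = 1 + t = 0.$$

But this contradicts $t \ge 0$. So we must have

$$\left\|\begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right) \\ \tau \end{pmatrix}\right\| = \sigma.$$

Since $\sigma > 0$, we have

$$\left\|\begin{pmatrix} \sqrt2\, Cw \\ 1 - t \end{pmatrix}\right\| = 1 + t.$$

Hence, there exists $\mu > 0$ such that

$$\sqrt2\, Cw = \frac{\mu}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right) \quad \text{and} \quad 1 - t = -\mu\tau. \tag{3.10}$$

In addition, it is easy to get the following equalities by proposition 1:

$$C^{-1} \bar A^T = \frac{1}{\sqrt{K+1}}\, \bar A^T \quad \text{and} \quad C^{-1} \bar H^T = \frac{1}{\sqrt{K+1}}\, \bar H^T. \tag{3.11}$$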


Hence, by equations 3.10 and 3.11, we get

$$w = \frac{t-1}{2\tau(K+1)}\left(\bar A^T \alpha + \bar H^T s\right).$$

Furthermore, by equations 2.2, 3.1, and 3.2, we get

$$w^i = \frac{t-1}{2\tau(K+1)} \sum_{j=1,\,j\ne i}^{K}\left[\sum_{p=1}^{l_i} \alpha^{ij}_p (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p (A^j_p)^T + \sum_{p=1}^{l_i} \rho^i_p s^{ij}_p - \sum_{p=1}^{l_j} \rho^j_p s^{ji}_p\right].$$

Therefore, the decision functions are given by

$$\begin{aligned}
f_i(x) &= x^T w^i - b^i\\
&= \frac{t-1}{2\tau(K+1)} \sum_{j=1,\,j\ne i}^{K}\left[\sum_{p=1}^{l_i} \alpha^{ij}_p x^T (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p x^T (A^j_p)^T + \sum_{p=1}^{l_i} \rho^i_p x^T s^{ij}_p - \sum_{p=1}^{l_j} \rho^j_p x^T s^{ji}_p\right] - b^i, \quad i = 1, \dots, K.
\end{aligned} \tag{3.12}$$

In particular, if we set $\rho^i_p = 0$, $i = 1, \dots, K$, $p = 1, \dots, l_i$, then equation 3.12 becomes

$$f_i(x) = \frac{t-1}{2\tau(K+1)} \sum_{j=1,\,j\ne i}^{K}\left[\sum_{p=1}^{l_i} \alpha^{ij}_p x^T (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p x^T (A^j_p)^T\right] - b^i, \quad i = 1, \dots, K. \tag{3.13}$$

Since $\rho^i_p = 0$, $p = 1, \dots, l_i$, $i = 1, \dots, K$, means that the parameter perturbations are not considered (cf. equation 2.5), equation 3.13 corresponds to the discriminants for the case of no measurement noise. With these decision functions, an example $x$ is assigned to a class $i$ such that $f_i(x) = \max\{f_1(x), \dots, f_K(x)\}$.

4 Robust Piecewise-Nonlinear M-SVM Classifier

The above discussion is concerned with the piecewise-linear case. In this section, the analysis will be extended to the nonlinear case.


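To construct separating functions in a higher-dimensional feature space, a nonlinear mapping $\psi : X \to F$ is used to transform the original examples into the feature space, which is equipped with the inner product defined by $k(x, x') = \langle \psi(x), \psi(x') \rangle$, where $k(\cdot, \cdot) : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is a function called a kernel. Typical choices of kernels include polynomial kernels $k(x, x') = (x^T x' + 1)^d$ with an integer parameter $d$ and radial basis function (RBF) kernels $k(x, x') = \exp(-\|x - x'\|^2/\kappa)$ with a real parameter $\kappa$.

4.1 Robust Piecewise-Nonlinear M-SVM Formulation. We assume

$$\psi\big((\hat A^i_p)^T\big) = \psi\big((A^i_p)^T\big) + \tilde\rho^i_p\, \tilde a^i_p, \quad \tilde a^i_p \in \tilde U, \tag{4.1}$$

where $\tilde U$ is a unit sphere in the feature space. For the nonlinear case, $\tilde\rho^i_p$ in the feature space associated with a kernel $k(\cdot, \cdot)$ can be computed as

$$\begin{aligned}
\tilde\rho^i_p &= \left\|\psi\big((\hat A^i_p)^T\big) - \psi\big((A^i_p)^T\big)\right\|\\
&= \left(\left\langle \psi\big((\hat A^i_p)^T\big), \psi\big((\hat A^i_p)^T\big)\right\rangle - 2\left\langle \psi\big((\hat A^i_p)^T\big), \psi\big((A^i_p)^T\big)\right\rangle + \left\langle \psi\big((A^i_p)^T\big), \psi\big((A^i_p)^T\big)\right\rangle\right)^{1/2}\\
&= \left(k\big((\hat A^i_p)^T, (\hat A^i_p)^T\big) - 2k\big((\hat A^i_p)^T, (A^i_p)^T\big) + k\big((A^i_p)^T, (A^i_p)^T\big)\right)^{1/2}.
\end{aligned}$$

For example, for RBF kernels, since

$$k\big((\hat A^i_p)^T, (\hat A^i_p)^T\big) = 1, \quad k\big((\hat A^i_p)^T, (A^i_p)^T\big) = \exp\big(-(\rho^i_p)^2/\kappa\big), \quad k\big((A^i_p)^T, (A^i_p)^T\big) = 1,$$

we have

$$\tilde\rho^i_p = \left(2 - 2\exp\big(-(\rho^i_p)^2/\kappa\big)\right)^{1/2}. \tag{4.2}$$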
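Equation 4.2 can be checked against the kernel-only expression for the feature-space distance. The following sketch is an illustration under assumed values (ρ = 0.3, κ = 2, a random point in four dimensions); the function names `rbf_kernel` and `feature_space_radius` are hypothetical.

```python
import numpy as np

def rbf_kernel(x, z, kappa):
    """k(x, z) = exp(-||x - z||^2 / kappa), the RBF kernel of section 4."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-(d @ d) / kappa)

def feature_space_radius(rho, kappa):
    """Equation 4.2: rho_tilde = sqrt(2 - 2 exp(-rho^2 / kappa))."""
    return np.sqrt(2.0 - 2.0 * np.exp(-rho ** 2 / kappa))

rng = np.random.default_rng(3)
rho, kappa = 0.3, 2.0             # assumed noise radius and kernel width
x = rng.normal(size=4)            # a nominal example (assumed data)
a = rng.normal(size=4)
a /= np.linalg.norm(a)            # noise direction on the unit sphere
x_hat = x + rho * a               # perturbed example, as in equation 2.5

# Feature-space distance computed from kernel evaluations only.
dist = np.sqrt(rbf_kernel(x_hat, x_hat, kappa)
               - 2.0 * rbf_kernel(x_hat, x, kappa)
               + rbf_kernel(x, x, kappa))
rho_tilde = feature_space_radius(rho, kappa)
```

For the RBF kernel the radius depends only on ρ and κ, not on the example or the noise direction, and it is always below √2 because all mapped points lie on the unit sphere of the feature space.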

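The robust version of the piecewise-nonlinear M-SVM can be expressed as follows:

$$\begin{aligned}
\min_{w,\,b,\,y}\quad & \nu\left[\frac12\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac12\sum_{i=1}^{K}\|w^i\|^2\right]+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}\\
\text{s.t.}\quad & \big(\psi((A^i_p)^T)\big)^T(w^i-w^j)+\tilde\rho^i_p(\tilde a^i_p)^T(w^i-w^j)-(b^i-b^j)+y^{ij}_p\ge 1,\quad \forall\, \tilde a^i_p\in \tilde U,\\
& y^{ij}_p\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j,
\end{aligned}$$

which can be rewritten as the following SOCP:

$$\begin{aligned}
\min_{w,\,b,\,y}\quad & \nu\left[\frac12\sum_{i=1}^{K}\sum_{j=1}^{i-1}\|w^i-w^j\|^2+\frac12\sum_{i=1}^{K}\|w^i\|^2\right]+(1-\nu)\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}(e^i)^T y^{ij}\\
\text{s.t.}\quad & \big(\psi((A^i_p)^T)\big)^T(w^i-w^j)-\tilde\rho^i_p\|w^i-w^j\|-(b^i-b^j)+y^{ij}_p\ge 1,\\
& y^{ij}_p\ge 0,\quad p=1,\dots,l_i,\ i,j=1,\dots,K,\ i\ne j.
\end{aligned} \tag{4.3}$$

Denote

$$\tilde A = \left[\tilde B_1^T, \tilde B_2^T, \dots, \tilde B_K^T\right]^T,$$

where

$$\tilde B_i = \begin{pmatrix}
-\Psi(A^i) & \cdots & O & \Psi(A^i) & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -\Psi(A^i) & \Psi(A^i) & O & \cdots & O \\
O & \cdots & O & \Psi(A^i) & -\Psi(A^i) & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & \Psi(A^i) & O & \cdots & -\Psi(A^i)
\end{pmatrix}$$

with

$$\Psi(A^i) = \left[\psi\big((A^i_1)^T\big), \dots, \psi\big((A^i_{l_i})^T\big)\right]^T.$$

Denote

$$\tilde H = \left[\tilde M_1^T, \tilde M_2^T, \dots, \tilde M_K^T\right]^T,$$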


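where

$$\tilde M_i = \begin{pmatrix}
-M_i & \cdots & O & M_i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -M_i & M_i & O & \cdots & O \\
O & \cdots & O & M_i & -M_i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & M_i & O & \cdots & -M_i
\end{pmatrix}$$

with

$$M_i := M_i(\tilde\rho) = \left[\tilde\rho^i_1 I_N, \dots, \tilde\rho^i_{l_i} I_N\right]^T. \tag{4.4}$$

In a similar manner to that of getting formulation 3.4, we get the dual of problem 4.3 as follows:

$$\begin{aligned}
\max_{\alpha,\,s,\,\sigma,\,\tau}\quad & e^T \alpha - (\sigma + \tau)\\
\text{s.t.}\quad & \bar E^T \alpha = 0,\\
& \alpha \le (1-\nu) e,\\
& \sigma - \tau = \nu,\\
& \left\|\begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}}\left(\tilde A^T \alpha + \tilde H^T s\right) \\ \tau \end{pmatrix}\right\| \le \sigma,\\
& \|s^{ij}_p\| \le \alpha^{ij}_p,\quad p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ j \ne i.
\end{aligned} \tag{4.5}$$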

4.2 Robust Classifier in a Feature Subspace. In the previous section, we obtained the robust formulation 4.5 in the feature space. However, the feature space $F$ may have an arbitrarily large dimension, possibly infinite. Usually kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998; Yajima, 2005) is used for feature extraction. In this section, we first reduce the feature space $F$ to an $S$-dimensional subspace with $S < L$ by KPCA, and then construct the corresponding robust classifier of piecewise-nonlinear M-SVM in the subspace.

Consider the kernel matrix $G = \big(k((A^i_p)^T, (A^j_q)^T)\big) \in \mathbb{R}^{L \times L}$ associated with a kernel $k(\cdot, \cdot)$. Since $G$ is a symmetric positive semidefinite matrix, there is an orthogonal matrix $V$ such that $G = V \Lambda V^T$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_i \ge 0$, $i = 1, \dots, L$, of $G$, and $v_i$, $i = 1, \dots, L$, the columns of $V$, are the corresponding eigenvectors. Suppose $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_L$. Select the $S$ ($< L$) largest positive eigenvalues and the corresponding eigenvectors. Denote $D_S = [\sqrt{\lambda_1}\, v_1, \sqrt{\lambda_2}\, v_2, \dots, \sqrt{\lambda_S}\, v_S]$, where the components of $v_i$ are written as follows:

$$v_i = \left[v^1_{i,1}, \dots, v^1_{i,l_1}, v^2_{i,1}, \dots, v^2_{i,l_2}, \dots, v^K_{i,1}, \dots, v^K_{i,l_K}\right]^T.$$

Define the vectors

$$u_i := \sum_{j=1}^{K} \sum_{p=1}^{l_j} \frac{v^j_{i,p}\, \psi\big((A^j_p)^T\big)}{\sqrt{\lambda_i}}, \quad i = 1, \dots, S.$$

Then we have

$$u_i^T u_i = \frac{1}{\lambda_i} v_i^T G v_i = 1$$

and

$$u_i^T u_j = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T G v_j = 0, \quad i \ne j.$$

Therefore, $\{u_1, u_2, \dots, u_S\}$ forms an orthonormal basis of an $S$-dimensional subspace of $F$. Let $\psi_S(x)$ be the $S$-dimensional subcoordinate of $\psi(x)$, which is given by

$$\psi_S(x) = \left[\frac{1}{\sqrt{\lambda_1}} \sum_{j=1}^{K} \sum_{p=1}^{l_j} v^j_{1,p}\, k\big(x, (A^j_p)^T\big), \dots, \frac{1}{\sqrt{\lambda_S}} \sum_{j=1}^{K} \sum_{p=1}^{l_j} v^j_{S,p}\, k\big(x, (A^j_p)^T\big)\right]^T. \tag{4.6}$$

Then, similar to equation 3.12, we can get the decision functions associated with the robust formulation of piecewise-nonlinear M-SVM in the feature subspace as follows:

$$\begin{aligned}
f_i(x) = \frac{t-1}{2\tau(K+1)} \sum_{j=1,\,j\ne i}^{K}\Bigg[&\sum_{p=1}^{l_i} \alpha^{ij}_p\, \psi_S(x)^T \psi_S\big((A^i_p)^T\big) - \sum_{p=1}^{l_j} \alpha^{ji}_p\, \psi_S(x)^T \psi_S\big((A^j_p)^T\big)\\
&+ \sum_{p=1}^{l_i} \tilde\rho^i_p\, \psi_S(x)^T s^{ij}_p - \sum_{p=1}^{l_j} \tilde\rho^j_p\, \psi_S(x)^T s^{ji}_p\Bigg] - b^i, \quad i = 1, \dots, K.
\end{aligned} \tag{4.7}$$
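The KPCA reduction of section 4.2 can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: the data are random, the kernel is the RBF kernel with κ = 2, S = 3 is picked arbitrarily, and the helper names `rbf_gram`, `fit_subspace`, and `psi_S` are hypothetical. The training points of all classes are stacked into one array, so the double index (j, p) of equation 4.6 becomes a single row index.

```python
import numpy as np

def rbf_gram(X, Z, kappa):
    """Matrix of k(x, z) = exp(-||x - z||^2 / kappa) for rows of X and Z."""
    sq = (np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / kappa)

def fit_subspace(X, kappa, S):
    """Return the S largest eigenvalues of G and the matching eigenvectors."""
    G = rbf_gram(X, X, kappa)
    lam, V = np.linalg.eigh(G)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1][:S]     # indices of the S largest
    return lam[order], V[:, order]

def psi_S(x_new, X, kappa, lam, V):
    """Equation 4.6: component i is (1/sqrt(lam_i)) * sum_p v_{i,p} k(x, A_p)."""
    k_vec = rbf_gram(np.atleast_2d(x_new), X, kappa)   # shape (m, L)
    return (k_vec @ V) / np.sqrt(lam)

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(12, 3))   # assumed training points, all classes stacked
kappa = 2.0
lam, V = fit_subspace(X, kappa, S=3)
coords = psi_S(X, X, kappa, lam, V)        # row p is psi_S of training point p
G = rbf_gram(X, X, kappa)
gram_of_basis = (V / np.sqrt(lam)).T @ G @ (V / np.sqrt(lam))   # u_i^T u_j
```

On the training points themselves, the subcoordinates reduce to the rows of D_S, since G v_i = λ_i v_i.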


Table 1: Description of Iris, Wine, and Glass Data Sets.

Name    Dimension (N)   Number of Classes (K)   Number of Examples (L)
Iris    4               3                       150
Wine    13              3                       178
Glass   9               6                       214

5 Preliminary Numerical Results

In this section, through numerical experiments, we examine the performance of the robust piecewise-nonlinear M-SVM formulation and the original model for multiclass classification problems. We use the RBF kernel in the experiments. As we described in section 4.2, we first construct an $L \times L$ kernel matrix $G$ associated with the RBF kernel for the training data set. Then we decompose $G$ and select an appropriate number $S$. Using equation 4.6, we obtain the $S$-dimensional subcoordinate of each point.

The problems used in the experiments are the robust model 4.5 and the original model obtained by setting $\tilde\rho = 0$ in equation 4.4. In the latter model, we have $\tilde H = O$. Thus, we may write the problem as follows:

$$\begin{aligned}
\max_{\alpha,\,\sigma,\,\tau}\quad & e^T \alpha - (\sigma + \tau)\\
\text{s.t.}\quad & \bar E^T \alpha = 0,\\
& \alpha \le (1-\nu) e,\\
& \sigma - \tau = \nu,\\
& \left\|\begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}}\, \tilde A^T \alpha \\ \tau \end{pmatrix}\right\| \le \sigma.
\end{aligned} \tag{5.1}$$

The experiments were implemented on a PC (1 GB RAM, 3.00 GHz CPU) using SeDuMi 1.05 (Sturm, 2001) as a solver. This solver was developed by J. Sturm for optimization problems over symmetric cones, including SOCP.

Some experimental results on real-world data sets taken from the UCI machine learning repository (Blake & Merz, 1998) are reported below. Table 1 gives a description of the data sets. In the experiments, the data sets were normalized to lie between −1 and 1. For simplicity, we set all $\rho^i_p$ in equation 2.5 to be a constant $\rho$. The measurement noise $a^i_p$ was generated randomly from the normal distribution and scaled to the unit sphere.

Two experiments were performed. In the first, an appropriate value of $S$ for getting reasonable discriminants was sought. The second experiment was

Table 2: Results for Iris, Wine, and Glass Data Sets with Noise (ρ = 0.3, κ = 2, ν = 0.05).

                 Iris                     Wine                     Glass
Ra     Model     S    Rt       PT(a)      S    Rt       PT(a)      S    Rt       PT(a)
0.5    I         1    0.5364   62.67      5    0.5179   90.0       1    0.5737   35.24
0.5    II        1    0.5364   60.67      5    0.5179   88.89      1    0.5737   31.43
0.6    I         2    0.7950   89.33      9    0.6102   88.89      2    0.6826   66.67
0.6    II        2    0.7950   87.33      9    0.6102   80.0       2    0.6826   32.86
0.7    I         2    0.7950   89.33      16   0.7103   87.78      3    0.7523   66.67
0.7    II        2    0.7950   87.33      16   0.7103   82.22      3    0.7523   38.57
0.8    I         3    0.8836   85.33      -    -        -          4    0.8002   66.67
0.8    II        3    0.8836   84.0       -    -        -          4    0.8002   45.24
0.99   I         12   0.9911   88.0       -    -        -          -    -        -
0.99   II        12   0.9911   86.67      -    -        -          -    -        -

Model: Robust (I), Original (II).
Note: (a) PT: Percentage of tenfold testing correctness on validation set.

Table 3: Percentage of Tenfold Test Correctness for the Data Sets with Noise (κ = 2, ν = 0.05).

                          ρ
Data Set (S)   Model      0       0.1     0.2     0.3     0.4     0.5
Iris (2)       I          94.0    88.67   88.0    89.33   91.33   90.0
Iris (2)       II         94.0    87.33   87.33   87.33   86.67   85.33
Wine (5)       I          98.33   91.11   90.0    90.0    87.78   84.44
Wine (5)       II         98.33   88.89   88.83   88.89   85.56   82.22
Glass (4)      I          59.52   66.19   65.71   66.67   66.67   66.67
Glass (4)      II         59.52   46.19   45.71   45.24   49.05   49.52

Model: Robust (I), Original (II).

conducted on the three data sets with the measurement noise. Tenfold cross validation was used in the experiments.

In order to seek an appropriate value of $S$, a ratio $R_a$ is set. It is chosen from the set $\{0.5, 0.6, 0.7, 0.8, 0.99\}$. For each value of $R_a$, we find the smallest integer $S$ such that $\sum_{i=1}^{S} \lambda_i / \sum_{i=1}^{L} \lambda_i \ge R_a$, and let $R_t := \sum_{i=1}^{S} \lambda_i / \sum_{i=1}^{L} \lambda_i$. At the same time, we test the accuracy on the validation set by computing the percentage of tenfold testing correctness. Table 2 contains these three kinds of results for the robust model and the original model on the Iris, Wine, and Glass data sets with the measurement noise scaled by $\rho = 0.3$. When $R_a$ is large, we were unable to solve the problems for the Wine and Glass data sets because of memory limitations. Nevertheless, it can be seen from Table 2 that values of $R_t$ from around 50% up to 70% yield reasonable discriminants. Moreover, in all cases, $S$ is much smaller than the data size $L$.

Table 3 shows the percentage of tenfold testing correctness for the robust model and the original model on the three data sets with various noise levels $\rho$. In particular, $\rho = 0$ means that there is no noise on the data sets. In this case, the robust model reduces to the original model. It can be observed that the performance of the robust model is consistently better than that of the original model, especially for the Glass data set. In addition, the correctness for the original model on the three data sets with noise is much lower than the results when $\rho = 0$.

For the linear case, we also find that the correctness for the original model on the data sets with noise is lower than the correctness on the data sets without noise. For example, the correctness of the Iris data set, the Wine data set, and the Glass data set without noise is 90.67%, 97.78%, and 57.14%, respectively. However, when $\rho = 0.5$, the corresponding correctness is 87.33%, 78.89%, and 48.57%, respectively. For the robust model, when $\rho = 0.5$, the corresponding correctness is 90.67%, 81.11%, and 56.19%, respectively.
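The rule used in this section for choosing S (the smallest S whose leading eigenvalues capture a fraction R_a of the total) can be sketched as follows. The function name `choose_subspace_dim` and the five-value spectrum are hypothetical and serve only to illustrate the rule; they are not the eigenvalues of any data set in the experiments.

```python
import numpy as np

def choose_subspace_dim(eigenvalues, ratio):
    """Smallest S with (sum of the S largest eigenvalues) / total >= ratio.

    Returns (S, R_t), where R_t is the ratio actually attained.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    lam = np.sort(np.clip(lam, 0.0, None))[::-1]     # descending, nonnegative
    cumulative = np.cumsum(lam) / lam.sum()
    # First index whose cumulative ratio reaches `ratio`, converted to a count.
    S = int(np.searchsorted(cumulative, ratio, side="left")) + 1
    S = min(S, lam.size)                             # guard against rounding
    return S, float(cumulative[S - 1])

# Hypothetical spectrum, chosen only to illustrate the rule.
lam = [5.0, 3.0, 1.0, 0.5, 0.5]
S, R_t = choose_subspace_dim(lam, 0.7)
```

Because S must be an integer, the attained ratio R_t is in general larger than the requested R_a, which is why Table 2 reports both.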

6 Conclusion

In this letter, we have established robust linear and nonlinear formulations for multiclass classification based on the M-SVM method. KPCA has been used to reduce the feature space to an S-dimensional subspace. The preliminary numerical experiments show that the performance of the robust model is better than that of the original model. Unfortunately, the conic convex optimization solver SeDuMi 1.05 (Sturm, 2001) used in our numerical experiments could solve problems only for small data sets. Sequential minimal optimization (SMO) techniques (Platt, 1999) are essential in large-scale implementation of SVM. Future subjects for investigation include developing SMO-based robust algorithms for multiclass classification.


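Appendix A: Dual of Formulation 2.17

In order to get the dual of problem 2.17, we first state a more general primal and dual form of the SOCP. The notations used in section A.1 are independent of those in the other part of the letter.

A.1 A General Primal and Dual Pair. For the SOCP

$$\begin{aligned}
\min_{x,\,y,\,z}\quad & c^T x + d^T y + e^T z\\
\text{s.t.}\quad & A^T x + B^T y + C^T z = f,\\
& y \ge 0,\quad z \in \mathcal{K}^{n_1} \times \dots \times \mathcal{K}^{n_l},
\end{aligned} \tag{A.1}$$

its dual is written as follows:

$$\begin{aligned}
\max_{w,\,u,\,v}\quad & f^T w\\
\text{s.t.}\quad & Aw = c,\\
& Bw + u = d,\\
& Cw + v = e,\\
& u \ge 0,\quad v \in \mathcal{K}^{n_1} \times \dots \times \mathcal{K}^{n_l}.
\end{aligned} \tag{A.2}$$

Now consider the problem

$$\begin{aligned}
\min_{x,\,y,\,z}\quad & c^T x + d^T y\\
\text{s.t.}\quad & \left\|\bar G_i^T x + q_i\right\| \le g_i^T x + h_i^T y + r_i^T z + a_i,\quad i = 1, \dots, m,\\
& y \ge 0.
\end{aligned}$$

This problem can be formulated as follows:

$$\begin{aligned}
\min_{x,\,y,\,z,\,\zeta}\quad & c^T x + d^T y\\
\text{s.t.}\quad & \zeta_i - \begin{pmatrix} g_i^T x + h_i^T y + r_i^T z + a_i \\ \bar G_i^T x + q_i \end{pmatrix} = 0,\quad i = 1, \dots, m,\\
& \zeta_i \in \mathcal{K}^{n_i},\quad i = 1, \dots, m,\\
& y \ge 0,
\end{aligned} \tag{A.3}$$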


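which can be further rewritten as

$$\begin{aligned}
\min_{x,\,y,\,z,\,\zeta}\quad & c^T x + d^T y\\
\text{s.t.}\quad & \begin{pmatrix}
-G_1 & -G_2 & \cdots & -G_m \\
-H_1 & -H_2 & \cdots & -H_m \\
-R_1 & -R_2 & \cdots & -R_m \\
I & O & \cdots & O \\
O & I & \cdots & O \\
\vdots & \vdots & \ddots & \vdots \\
O & O & \cdots & I
\end{pmatrix}^T
\begin{pmatrix} x \\ y \\ z \\ \zeta_1 \\ \zeta_2 \\ \vdots \\ \zeta_m \end{pmatrix}
= \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{pmatrix},\\
& \zeta_i \in \mathcal{K}^{n_i},\quad i = 1, \dots, m,\\
& y \ge 0,
\end{aligned}$$

where $G_i = [g_i, \bar G_i]$, $H_i = [h_i, O]$, $R_i = [r_i, O]$, $\beta_i = [a_i, q_i^T]^T$, $i = 1, \dots, m$. In view of the primal-dual pair A.1 and A.2, we obtain the dual of problem A.3 as follows:

$$\begin{aligned}
\max_{\eta,\,\lambda}\quad & -\sum_{i=1}^{m} \beta_i^T \eta_i\\
\text{s.t.}\quad & \sum_{i=1}^{m} G_i \eta_i = c,\\
& \sum_{i=1}^{m} H_i \eta_i + \lambda = d,\\
& \sum_{i=1}^{m} R_i \eta_i = 0,\\
& \lambda \ge 0,\quad \eta_i \in \mathcal{K}^{n_i},\quad i = 1, \dots, m.
\end{aligned} \tag{A.4}$$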

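A.2 Dual of Problem 2.17. In the following we derive the dual of formulation 2.17. The primal problem 2.17 can be put in the following equivalent form:

$$\begin{aligned}
\min_{x,\,b,\,y}\quad & \left[0^T, \nu\right] x + (1-\nu) e^T y\\
\text{s.t.}\quad & \left\|\begin{pmatrix} \sqrt2\, C & 0 \\ 0^T & -1 \end{pmatrix} x + \begin{pmatrix} 0 \\ 1 \end{pmatrix}\right\| \le \left[0^T, 1\right] x + 1,\\
& \left\|\rho^i_p \left[(H^{ij})^T, 0\right] x\right\| \le \left[A^i_p (H^{ij})^T, 0\right] x + (h^{ij}_p)^T y - (r^{ij})^T b - 1,\\
& p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ i \ne j,\quad y \ge 0,
\end{aligned}$$

where $x = [w^T, t]^T$. Then, by equation A.4, we get the dual of problem 2.17 as follows:

$$\max_{\alpha,\,s,\,\sigma,\,\tau}\quad \sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p - (\sigma + \tau) \tag{A.5}$$

$$\text{s.t.}\quad \sqrt2\, C^T \xi + \sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \left(\alpha^{ij}_p H^{ij} (A^i_p)^T + \rho^i_p H^{ij} s^{ij}_p\right) = 0, \tag{A.6}$$

$$\sigma - \tau = \nu, \tag{A.7}$$

$$\sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p h^{ij}_p + \lambda = (1-\nu) e, \tag{A.8}$$

$$-\sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p r^{ij} = 0, \tag{A.9}$$

$$\left\|\begin{pmatrix} \xi \\ \tau \end{pmatrix}\right\| \le \sigma, \tag{A.10}$$

$$\|s^{ij}_p\| \le \alpha^{ij}_p, \quad p = 1, \dots, l_i,\ i, j = 1, \dots, K,\ i \ne j, \tag{A.11}$$

$$\lambda \ge 0. \tag{A.12}$$

By equations 2.9, 3.1, and 3.5, we get

$$\sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p H^{ij} (A^i_p)^T = \bar A^T \alpha. \tag{A.13}$$

By equations 3.2 and 3.6, we get

$$\sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \rho^i_p H^{ij} s^{ij}_p = \bar H^T s. \tag{A.14}$$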


Hence by equations A.13 and A.14, we can express equation A.6 compactly as follows:

$$\sqrt2\, C^T \xi + \bar A^T \alpha + \bar H^T s = 0. \tag{A.15}$$

By equations A.15 and 3.11, we get the following equation:

$$\xi = -\frac{1}{\sqrt{2(K+1)}}\left(\bar A^T \alpha + \bar H^T s\right). \tag{A.16}$$

By equations 2.13 and 3.5, we have

$$\sum_{i=1}^{K} \sum_{j=1,\,j\ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p h^{ij}_p = \alpha.$$

Hence, equation A.8 can be expressed as follows:

$$(1-\nu) e - \lambda - \alpha = 0. \tag{A.17}$$

By equations 2.11, 3.3, and 3.5, we can rewrite equation A.9 as follows:

$$-\bar E^T \alpha = 0. \tag{A.18}$$

Combining equations A.16 to A.18, the problem given in A.5 to A.12 can be written as problem 3.4.

Appendix B: Complementarity Conditions of SOCP

Let $\operatorname{bd} \mathcal{K}^n$ denote the boundary of $\mathcal{K}^n$:

$$\operatorname{bd} \mathcal{K}^n = \left\{\begin{pmatrix} z_0 \\ \bar z \end{pmatrix} \in \mathcal{K}^n : \|\bar z\| = z_0\right\}.$$

Let $\operatorname{int} \mathcal{K}^n$ denote the interior of $\mathcal{K}^n$:

$$\operatorname{int} \mathcal{K}^n = \left\{\begin{pmatrix} z_0 \\ \bar z \end{pmatrix} \in \mathcal{K}^n : \|\bar z\| < z_0\right\}.$$

For two elements $\begin{pmatrix} z_0 \\ \bar z \end{pmatrix} \in \mathcal{K}^n$ and $\begin{pmatrix} z_0' \\ \bar z' \end{pmatrix} \in \mathcal{K}^n$,

$$\begin{pmatrix} z_0 \\ \bar z \end{pmatrix}^T \begin{pmatrix} z_0' \\ \bar z' \end{pmatrix} = 0$$

if and only if the following conditions are satisfied (Lobo et al., 1998):

$$\begin{pmatrix} z_0 \\ \bar z \end{pmatrix} \in \operatorname{int} \mathcal{K}^n \ \Rightarrow\ \|\bar z'\| = z_0' = 0, \tag{B.1}$$

$$\begin{pmatrix} z_0' \\ \bar z' \end{pmatrix} \in \operatorname{int} \mathcal{K}^n \ \Rightarrow\ \|\bar z\| = z_0 = 0, \tag{B.2}$$

$$\begin{pmatrix} z_0 \\ \bar z \end{pmatrix} \in \operatorname{bd} \mathcal{K}^n \setminus \{0\},\ \begin{pmatrix} z_0' \\ \bar z' \end{pmatrix} \in \operatorname{bd} \mathcal{K}^n \setminus \{0\} \ \Rightarrow\ \begin{pmatrix} z_0 \\ \bar z \end{pmatrix} = \mu \begin{pmatrix} z_0' \\ -\bar z' \end{pmatrix}, \tag{B.3}$$

where $\mu > 0$ is a constant. These three conditions are regarded as a generalization of the complementary slackness conditions in linear programming.

Acknowledgments

P.Z. is supported in part by a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan and the National Science Foundation of China Grant No. 70601033. M.F. is supported in part by the Scientific Research Grant-in-Aid from the Japan Society for the Promotion of Science.

References

Alizadeh, F., & Goldfarb, D. (2003). Second-order cone programming. Math. Program., Ser. B, 95, 3–51.
Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Angulo, C., Parra, X., & Català, A. (2003). K-SVCR: A support vector machine for multi-class classification. Neurocomputing, 55, 57–77.
Bennett, K. P., & Mangasarian, O. L. (1994). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 27–39.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwriting digit recognition. In IAPR (Ed.), Proceedings of the International Conference on Pattern Recognition (pp. 77–82). Piscataway, NJ: IEEE Computer Society Press.

282

P. Zhong and M. Fukushima

Bredensteiner, E. J., & Bennett, K. P. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53–79. Dietterich, T. G., & Bakiri, G. (1995). Solving multi-class learning problems via errorcorrecting output codes. Journal of Artificial Intelligence Research, 2, 263–286. Fukushima, M., Luo, Z. Q., & Tseng, P. (2002). Smoothing functions for second-ordercone complementarity problems. SIAM Journal on Optimization, 12, 436–460. Goldfarb, D., & Iyengar, G. (2003). Robust convex quadratically constrained programs. Mathematical Programming, 97, 495–515. Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications, 5, 168–179. Hastie, T. J., & Tibshirani, R. J. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press. Hayashi, S., Yamashita, N., & Fukushima, M. (2005). A combined smoothing and regularization method for monotone second-order cone complementarity problems. SIAM Journal on Optimization, 15, 593–615. Kressel, U. (1999). Pairwise classification and support vector machines. In B. ¨ Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press. Lobo, M. S., Vandenberghe, L., Boyd, S., & L´ebret, H. (1998). Applications of secondorder cone programming. Linear Algebra and Applications, 284, 193–228. Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training sup¨ port vector machines. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press. Platt, J., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass ¨ classification. In S. A. Solla, T. K. Leen, & K. -R. 
Muller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press. ¨ ¨ Scholkopf, B., Smola, A., & Muller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. Sturm, J. (2001). Using SeDuMi, a Matlab toolbox for optimization over symmetric cones. Department of Econmetrics, Tilburg University, Netherlands. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Weston, J., & Watkins, C. (1998). Multi-class support vector machines (CSD-TR-9804). Egham, UK: Royal Holloway, University of London. Yajima, Y. (2005). Linear programming approaches for multicategory support vector machines. European Journal of Operational Research, 162, 514–531. Zhong, P., & Fukushima, M. (2006). A new multi-class support vector algorithm. Optimization Methods and Software, 21, 359–372.

Received September 19, 2005; accepted May 24, 2006.

LETTER

Communicated by Grace Wahba

Fast Generalized Cross-Validation Algorithm for Sparse Model Learning

S. Sundararajan
Philips Electronics India Ltd., Ulsoor, Bangalore, India

Shirish Shevade
Computer Science and Automation, Indian Institute of Science, Bangalore, India

S. Sathiya Keerthi
Yahoo! Research, 2821 Mission College Blvd., Santa Clara, CA 95054, USA

We propose a fast, incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross-validation approach. The performances on synthetic and real-world data sets are compared with other incremental algorithms such as Tipping and Faul's fast relevance vector machine, Chen et al.'s orthogonal least squares, and Orr's regularized forward selection. The results demonstrate that the proposed algorithm is competitive.

1 Introduction

In recent years, there has been a lot of focus on designing sparse models in machine learning. For example, the support vector machine (SVM) (Cortes & Vapnik, 1995) and the relevance vector machine (RVM) (Tipping, 2001; Tipping & Faul, 2003) have been shown to provide sparse solutions to both regression and classification problems. Some of the earlier successful approaches for regression problems include Chen, Cowan, and Grant's (1991) orthogonal least squares and Orr's (1995b) regularized forward selection algorithms. In this letter, we consider only the regression problem. Often the target function model takes the form

$$y(x) = \sum_{m=1}^{M} w_m \phi_m(x). \tag{1.1}$$

Neural Computation 19, 283–301 (2007)

© 2006 Massachusetts Institute of Technology


For notational convenience, we do not indicate the dependency of y on w, the weight vector. The available choices in selecting the set of basis vectors $\{\phi_m(x), m = 1, \ldots, M\}$ make the model flexible. Given a data set $\{x_n, t_n\}_{n=1}^{N}$, $t_n \in \mathbb{R}\ \forall n$, we can write the target vector $t = (t_1, \ldots, t_N)^T$ as the sum of the approximation vector $y = (y(x_1), \ldots, y(x_N))^T$ and an error vector $\eta$,

$$t = \Phi w + \eta, \tag{1.2}$$

where $\Phi = (\phi_1, \phi_2, \ldots, \phi_M)$ is the design matrix and $\phi_i$ is the ith column vector of $\Phi$, of size N × 1, representing the response of the ith basis vector for all the input samples $x_j$, j = 1, ..., N. One possible way to obtain this design matrix is to use a gaussian kernel with $\Phi_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2z^2)\right)$. The error vector $\eta$ can be modeled as an independent zero-mean gaussian vector with variance $\sigma^2$. In this context, controlling the model complexity to avoid overfitting is an important task. This problem has been addressed in the past by regularization approaches (Bishop, 1995). In the classical approach, the sum squared error is combined with a weight decay regularizer whose single regularization parameter α controls the trade-off between fitting the training data and smoothing the output function. In the local smoothing approach, each weight in w is associated with its own regularization or ridge parameter (Hoerl & Kennard, 1970; Orr, 1995a; Denison & George, 2000). The interested reader can refer to Denison and George (2000) for a discussion on the generalized ridge regression and Bayesian approach. For our discussion, we consider the weight decay regularizer with multiple regularization parameters. In this case, the optimal weight vector is obtained by minimizing the following cost function (for a given set of hyperparameter values, $\alpha$, $\sigma^2$),

$$C(w, \alpha, \sigma^2) = \frac{1}{2\sigma^2} \|t - \Phi w\|^2 + \frac{1}{2} w^T A w, \tag{1.3}$$

where A is a diagonal matrix with elements $\alpha = (\alpha_1, \ldots, \alpha_M)^T$, and the optimal weight vector is given by

$$w = \frac{1}{\sigma^2} S^{-1} \Phi^T t, \tag{1.4}$$

where $S = R + A$ and $R = \Phi^T \Phi / \sigma^2$. Note that this solution depends on $\alpha$ and $\sigma^2$ only through the product $\alpha\sigma^2$. This solution is the same as the maximum a posteriori (MAP) solution obtained by defining a gaussian likelihood and a gaussian prior for the weights in a Bayesian framework (Tipping, 2001).
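As a minimal sketch of equation 1.4, the snippet below solves the equivalent linear system $(\Phi^T \Phi + \sigma^2 A) w = \Phi^T t$, which follows from multiplying $S w = \Phi^T t / \sigma^2$ through by $\sigma^2$. The function names `solve_linear` and `map_weights` are illustrative choices, not part of the original letter, and the small Gaussian-elimination helper assumes the system is nonsingular.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Assumes a square, nonsingular matrix given as a list of rows.
    """
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def map_weights(design, targets, alphas, sigma2):
    """Weights of eq. 1.4, from (Phi^T Phi + sigma2 * A) w = Phi^T t.

    design: N rows, each holding the M basis responses for one sample.
    alphas: the M per-weight regularization parameters (diagonal of A).
    """
    n_basis = len(alphas)
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n_basis)]
            for i in range(n_basis)]
    for m in range(n_basis):
        gram[m][m] += sigma2 * alphas[m]
    rhs = [sum(row[m] * t for row, t in zip(design, targets))
           for m in range(n_basis)]
    return solve_linear(gram, rhs)
```

Calling `map_weights` with `alphas=[1.0, 1.0], sigma2=1.0` and with `alphas=[2.0, 2.0], sigma2=0.5` gives the same weights, which illustrates that only the product $\alpha\sigma^2$ matters.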


The hyperparameters are typically selected using iterative approaches like marginal likelihood maximization (Tipping, 2001). In practice, many of the $\alpha_i$ approach ∞. This results in the removal of the associated basis vectors, thereby making the model sparse. The final model consists of a small number L of basis vectors ($L \ll M$), called relevance vectors; hence the name relevance vector machine (RVM). This procedure is computationally intensive and needs $O(M^3)$ effort, at least for the initial iterations while starting with the full model (M = N, or M = N + 1 if the bias term is included). Hence, it is not suitable for large data sets. This limitation was addressed in Tipping and Faul (2003), where a computationally efficient algorithm was proposed. In this algorithm, basis vectors are added sequentially starting from an empty model. It also allows deleting basis vectors that subsequently become redundant.

There are various other basis vector selection heuristics that can be used to design sparse models (Chen et al., 1991; Orr, 1995b). Chen et al. (1991) proposed an orthogonal least-squares algorithm as a forward regression procedure to select the basis vectors. At each step of the regression, the increment to the explained variance of the desired output is maximized. Orr (1995b) proposed an algorithm that combines regularization and cross-validated selection of basis vectors. Some other promising incremental approaches are the algorithms of Csato and Opper (2002), Lawrence, Seeger, and Herbrich (2003), Seeger, Williams, and Lawrence (2003), and Smola and Bartlett (2000). But they apply to gaussian processes and are not directly related to the problem formulation addressed in this article.

Generalized cross validation (GCV) is another important approach for the selection of hyperparameters and has been shown to exhibit good generalization performance (Sundararajan & Keerthi, 2001; Orr, 1995a). Orr (1995a) used the GCV approach to estimate the multiple smoothing parameters of the full model. This approach, however, is not suitable for large data sets due to its computational complexity. Therefore, there is a need to devise a computationally efficient algorithm based on the GCV approach for handling large data sets.

In this letter, we propose a new fast, incremental GCV algorithm that can be used to design sparse models exhibiting good generalization performance. In the algorithm, we start with an empty model and sequentially add the basis functions to reduce the GCV error. The GCV error can also be reduced by deleting those basis functions that subsequently become redundant. This important feature offsets the inherent greediness exhibited by other sequential algorithms. This algorithm has the same computational complexity as that of the algorithm given in Tipping and Faul (2003) and is suitable for large data sets as well. Preliminary results on synthetic and real-world benchmark data sets indicate that the new approach gains on generalization but at the expense of a moderate increase in the number of basis vectors.


The letter is organized as follows. In section 2, we describe the GCV approach and compare the GCV error function with the marginal likelihood function. Section 3 describes the fast, incremental algorithm, computational complexity, and the numerical issues involved; the update expressions mentioned in this section are detailed in appendixes A to F. In section 4, we present the simulation results. Section 5 concludes the letter.

2 Generalized Cross Validation

The standard techniques that estimate the prediction error of a given model are the leave-one-out (LOO) cross-validation (Stone, 1974) and the closely related GCV described in Golub, Heath, and Wahba (1979) and Orr (1995b). The generalization performance of the GCV approach is quite good, like that of the LOO cross-validation approach. The advantage in using the GCV error is that it takes a much simpler form compared to the LOO error and is given by

$$V(\alpha, \sigma^2) = \frac{N \sum_{i=1}^{N} (t_i - y(x_i))^2}{(\operatorname{tr}(P))^2}, \tag{2.1}$$

where

$$P = I - \frac{\Phi S^{-1} \Phi^T}{\sigma^2}. \tag{2.2}$$

When this GCV error is minimized with respect to the hyperparameters, many of the $\alpha_i$ approach ∞, making the model sparse. Equation 2.1 can be written as

$$V(\alpha, \sigma^2) = \frac{N\, t^T P^2 t}{(\operatorname{tr}(P))^2}. \tag{2.3}$$

Note that P is dependent only on $\zeta = \alpha\sigma^2$. This means that we can get rid of $\sigma^2$ from equations 1.4 and 2.1. Then it is sufficient to work directly with $\zeta$. However, using the optimal $\zeta$ obtained from minimizing the GCV error, the noise level can be computed from

$$\hat{\sigma}^2 = \frac{t^T P^2 t}{\operatorname{tr}(P)}. \tag{2.4}$$
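The GCV error of equation 2.3 and the noise estimate of equation 2.4 can be sketched directly in terms of $\zeta = \alpha\sigma^2$. The code below is an illustrative direct computation, not the fast algorithm of this letter: it uses $t^T P^2 t = \|t - \Phi w\|^2$ and $\operatorname{tr}(P) = N - \operatorname{tr}(M^{-1} \Phi^T \Phi)$ with $M = \Phi^T \Phi + \operatorname{diag}(\zeta)$. The names `solve_linear` and `gcv_score` are assumptions made for this example.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Assumes a square, nonsingular matrix given as a list of rows.
    """
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def gcv_score(design, targets, zetas):
    """Return (V, noise estimate) from eqs. 2.3 and 2.4 for zeta = alpha * sigma^2."""
    n_samples, n_basis = len(design), len(zetas)
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n_basis)]
            for i in range(n_basis)]
    reg = [[gram[i][j] + (zetas[i] if i == j else 0.0) for j in range(n_basis)]
           for i in range(n_basis)]
    rhs = [sum(row[m] * t for row, t in zip(design, targets))
           for m in range(n_basis)]
    weights = solve_linear(reg, rhs)
    # P t is the residual t - Phi w, so t^T P^2 t is the sum of squared errors.
    residual = [t - sum(p * w for p, w in zip(row, weights))
                for row, t in zip(design, targets)]
    sse = sum(r * r for r in residual)
    # tr(P) = N - tr(M^{-1} Phi^T Phi); accumulate the diagonal column by column.
    trace_hat = 0.0
    for m in range(n_basis):
        column = [gram[i][m] for i in range(n_basis)]
        trace_hat += solve_linear(reg, column)[m]
    trace_p = n_samples - trace_hat
    return n_samples * sse / trace_p ** 2, sse / trace_p
```

This direct evaluation costs on the order of $M^3$ operations per call, which is the cost the incremental updates of section 3 are designed to avoid.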

We now discuss the algorithm proposed by Orr (1995a) to determine the optimal set of hyperparameters.


2.1 Orr's Algorithm. Starting with the full model, Orr (1995a) proposed an iterative scheme to minimize the GCV error, equation 2.3, with respect to the hyperparameters. Although this algorithm was originally described in terms of the variable $\zeta_j$, we describe it here using the variables $\alpha_j$ and $\sigma^2$ for convenience. Each $\alpha_j$ is optimized in turn, while the others are held fixed. The minimization thus proceeds by a series of one-dimensional minimizations. This can be achieved by rewriting equation 2.3 using

$$t^T P^2 t = \frac{a_j \Delta_j^2 - 2 b_j \Delta_j + c_j}{\Delta_j^2}, \qquad \operatorname{tr}(P) = \delta_j - \frac{\epsilon_j}{\Delta_j},$$

where

$$a_j = t^T P_j^2 t \tag{2.5}$$

$$b_j = (t^T P_j^2 \phi_j)(t^T P_j \phi_j) \tag{2.6}$$

$$c_j = (\phi_j^T P_j^2 \phi_j)(t^T P_j \phi_j)^2 \tag{2.7}$$

$$\delta_j = \operatorname{tr}(P_j) \tag{2.8}$$

$$\epsilon_j = \phi_j^T P_j^2 \phi_j. \tag{2.9}$$

The above relationships follow from the rank-one update relationship between $P$ and $P_j$,

$$P = P_j - \frac{1}{\Delta_j} P_j \phi_j \phi_j^T P_j, \tag{2.10}$$

where $P_j = I - \Phi_j S_j^{-1} \Phi_j^T / \sigma^2$ and

$$\Delta_j = \phi_j^T P_j \phi_j + \alpha_j \sigma^2. \tag{2.11}$$

Here, $\Phi_j$ denotes the matrix $\Phi$ with the column $\phi_j$ removed. Therefore, $S_j = R_j + A_j$, with the subscript j having the same interpretation as in $\Phi_j$. Note that $S_j$ does not contain $\alpha_j$, and, hence, $P_j$ also does not contain $\alpha_j$. Thus, equation 2.3 can be seen as a ratio of polynomials in $\alpha_j$ alone, with a single minimum in the range [0, ∞]. The details of minimizing this function with respect to $\alpha_j$ are given in appendix A. After one complete cycle in which each parameter is optimized once, the GCV score is calculated and compared with the score at the end of the previous cycle. If a significant decrease has occurred, a new cycle is begun; otherwise, the algorithm


terminates. As detailed in Orr (1995a), the computational complexity of one cycle of the above algorithm is $O(N^3)$, at least during the first few iterations. Although this subsequently reduces to $O(LN^2)$ as the basis vectors are pruned, the algorithm is not suitable for large data sets.

2.2 Comparison of GCV Error with Marginal Likelihood. We now compare the GCV error and marginal likelihood by studying their dependence on $\alpha_j$. In the marginal likelihood method, the optimal $\alpha_j$'s are determined by maximizing the marginal likelihood with respect to $\alpha_j$. The GCV error is minimized to get the optimal $\alpha_j$'s. First, we study the behavior of the GCV error with reference to $\alpha_j$. The denominator of the GCV error contains $\operatorname{tr}(P) = \delta_j - \epsilon_j / \Delta_j$, where $\delta_j$ and $\epsilon_j$ are independent of $\alpha_j$ and P is a positive semidefinite matrix. Further, tr(P) increases monotonically as a function of $\alpha_j$. Therefore, maximizing tr(P) (in order to minimize the GCV error) will prefer the optimal value of $\alpha_j$ to be ∞. Thus, the denominator term in equation 2.3 prefers a simple model. The term $t^T P^2 t$ in the numerator of equation 2.3 is the squared error at the optimal weight vector in equation 1.4. Let $g(\alpha) = t^T P^2 t$. Differentiating $g(\alpha)$ with respect to $\alpha_j$, we get

$$\frac{\partial g(\alpha)}{\partial \alpha_j} = \frac{2\sigma^2}{\Delta_j^2} \left( b_j - \frac{c_j}{\Delta_j} \right),$$

where $b_j$ and $c_j$ are independent of $\alpha_j$ and $c_j, \Delta_j \geq 0$. If $b_j$ is nonpositive, then $g(\alpha_j)$ is a monotonically decreasing function of $\alpha_j$. Minimization of $g(\alpha_j)$ with respect to $\alpha_j$ would thus prefer $\alpha_j$ to be ∞. On the other hand, if $b_j$ is positive, the minimum of $g(\alpha_j)$ would depend on the sign of $\sigma^2 b_j \bar{s}_j - c_j$, where $\bar{s}_j = \phi_j^T C_{-j}^{-1} \phi_j$, $C_{-j}$ is $C$ with the contribution of basis vector j removed, and $C = \sigma^2 I + \Phi A^{-1} \Phi^T$. Note that $P = \sigma^2 C^{-1}$. Therefore, the optimal choice of $\alpha_j$ using the GCV error method depends on the trade-off between the data-dependent term in the numerator and the term in the denominator that prefers $\alpha_j$ to be ∞. The logarithm of the marginal likelihood function is

$$L(\alpha) \stackrel{\text{def}}{=} -\frac{1}{2} \left[ N \log(2\pi) + \log |C| + t^T C^{-1} t \right].$$

Considering the dependence of $L(\alpha)$ on a single hyperparameter $\alpha_j$, the above equation can be rewritten (Tipping & Faul, 2003) as

$$L(\alpha) = L(\alpha_{-j}) + l(\alpha_j),$$

where $l(\alpha_j) = \frac{1}{2} \left[ \log\left( \frac{\alpha_j}{\alpha_j + \bar{s}_j} \right) + \frac{\bar{q}_j^2}{\alpha_j + \bar{s}_j} \right]$. $L(\alpha_{-j})$ is the marginal likelihood with $\phi_j$ excluded and is thus independent of $\alpha_j$. Here, we have defined $\bar{q}_j \stackrel{\text{def}}{=} \phi_j^T C_{-j}^{-1} t$. Note that $\bar{q}_j$ and $\bar{s}_j$ are independent of $\alpha_j$. The second term in $l(\alpha_j)$ comes from the data-dependent term, $t^T C^{-1} t$, in the logarithm of the marginal likelihood function, and maximization of this term prefers $\alpha_j$ to be zero. On the other hand, the first term in $l(\alpha_j)$ comes from the Ockham factor (Tipping, 2001), and maximization of this term chooses $\alpha_j$ to be ∞. So the optimal value of $\alpha_j$ is a trade-off between the data-dependent term and the Ockham factor. Note that $t^T C^{-1} t = \frac{1}{\sigma^2} t^T P^2 t + w^T A w$. The term on the left-hand side of this equation appears in the negative logarithm of the marginal likelihood function, while the first term on the right-hand side appears in the numerator of the GCV error. The key difference in the choice of the basis function is the additional term that is present in the data-dependent term of the marginal likelihood.

For the marginal likelihood method, it has been shown that for a given basis function j, if $\bar{q}_j^2 \leq \bar{s}_j$, then the optimal value of $\alpha_j$ is ∞; otherwise, the optimal value is $\bar{s}_j^2 / (\bar{q}_j^2 - \bar{s}_j)$ (Tipping & Faul, 2003). For the GCV method, the optimal value of $\alpha_j$ depends on $\bar{q}_j$, $\bar{s}_j$, and some other quantities detailed in appendix A. Further, a sufficient condition for a basis function to be irrelevant is $\bar{q}_j^2 \leq \bar{s}_j$ for the marginal likelihood method and $b_j \leq 0$ for the GCV error method. Note that $b_j$ is independent of $\bar{s}_j$ but is dependent on $\bar{q}_j$. In general, a relevant vector obtained using a marginal likelihood (or GCV error) method need not be relevant in the GCV error (or marginal likelihood) method. This fact was also observed in our experiments.

We now discuss the effect of scaling on the GCV error. First, note that the GCV error is a function of P. With the scaling of the output t, there will be an associated scaling of basis functions. However, P is invariant to such scaling (see equation 2.2). This will make the GCV error invariant to scaling. A similar result holds for the log marginal likelihood function.

3 Fast GCV Algorithm

In this section we describe the fast GCV (FGCV) algorithm, which constructs the model sequentially starting from an empty model. The basis vectors are added sequentially, and their weightings are modified to get the maximum reduction in the GCV error. The GCV error can also be decreased by deleting those basis vectors that subsequently become redundant. By maintaining the following set of variables for every basis vector m, we can efficiently find the optimal value of $\alpha_m$ for every basis vector and the corresponding GCV error:

$$r_m = t^T P \phi_m \tag{3.1}$$

$$\gamma_m = \phi_m^T P \phi_m \tag{3.2}$$

$$\xi_m = t^T P^2 \phi_m \tag{3.3}$$

$$u_m = \phi_m^T P^2 \phi_m. \tag{3.4}$$

In addition, we need

$$v = \operatorname{tr}(P) \tag{3.5}$$

$$q = t^T P^2 t. \tag{3.6}$$
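For reference, the quantities in equations 3.1 to 3.6 can be computed afresh from an explicit matrix P, which is useful for checking incremental updates on small problems. This is a hedged sketch of that direct computation only; the function name `fgcv_statistics` and the dictionary layout are assumptions of this example, and P is assumed symmetric so that $t^T P^2 \phi_m = (Pt)^T (P\phi_m)$.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fgcv_statistics(p_matrix, basis_vectors, targets):
    """Direct evaluation of eqs. 3.1 to 3.6 for a symmetric N x N matrix P.

    basis_vectors: list of M vectors phi_m, each of length N.
    Returns (per-basis statistics, v, q).
    """
    p_t = mat_vec(p_matrix, targets)  # P t
    stats = []
    for phi in basis_vectors:
        p_phi = mat_vec(p_matrix, phi)  # P phi_m
        stats.append({
            "r": dot(targets, p_phi),   # eq. 3.1: t^T P phi_m
            "gamma": dot(phi, p_phi),   # eq. 3.2: phi_m^T P phi_m
            "xi": dot(p_t, p_phi),      # eq. 3.3: t^T P^2 phi_m
            "u": dot(p_phi, p_phi),     # eq. 3.4: phi_m^T P^2 phi_m
        })
    v = sum(p_matrix[i][i] for i in range(len(p_matrix)))  # eq. 3.5: tr(P)
    q = dot(p_t, p_t)                                      # eq. 3.6: t^T P^2 t
    return stats, v, q
```

This direct form costs $O(N^2)$ per basis vector, so it is only a reference implementation against which cheaper updates can be compared.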

Further, after every minimization process, these variables can be updated efficiently using the rank-one update given in equation 2.10. We now give the algorithm and discuss the relevant implementation details and storage and computational requirements.

3.1 Algorithm

1. Initialize $\sigma^2$ to some reasonable value (e.g., var(t) × 0.1).
2. Select the first basis vector $\phi_k$ (which forms the initial relevance vector set), and set the corresponding $\alpha_k$ to its optimal value. The remaining α's are notionally set to ∞.
3. Initialize $S^{-1}$, w (which are scalars initially), and the variables given in equations 3.1 to 3.6.
4. $\alpha_k^{old} := \alpha_k$.
5. For all j, find the optimal solution $\alpha_j$, keeping the remaining $\alpha_i$, $i \neq j$, fixed, and the corresponding GCV error. Select the basis vector k for which the reduction in the GCV error is maximum.
6. If $\alpha_k^{old} < \infty$ and $\alpha_k < \infty$, the relevance vector set remains unchanged.
7. If $\alpha_k^{old} = \infty$ and $\alpha_k < \infty$, add $\phi_k$ to the relevance vector set.
8. If $\alpha_k^{old} < \infty$ and $\alpha_k = \infty$, then delete $\phi_k$ from the relevance vector set.
9. Update $S^{-1}$, w, and the variables given in equations 3.1 to 3.6.
10. Estimate the noise level using equation 2.4. This step may be repeated once in, for example, five iterations.
11. If there is no significant change in the values of α and $\sigma^2$, then stop. Otherwise, go to step 4.

3.2 Implementation Details

1. Since we start from the empty model, the basis vector that gives the minimum GCV error is selected as the first basis vector (step 2). The details of this procedure are given in appendix B. The first basis vector can also be selected based on the largest normalized projection onto the target vector, as suggested in Tipping and Faul (2003).
2. The initialization of relevant variables in step 3 is described in appendix C.
3. Appendixes D and A describe the method to estimate the optimal $\alpha_i$ and the corresponding GCV error (step 5).
4. The variables in step 9 are updated using the details given in appendix E for the case in step 6 (reestimation) or step 8 (deletion) of the algorithm. The details corresponding to step 7 (addition) are given in appendix F.
5. In practice, numerical accuracy may degrade as the iterations progress. More specifically, the quantities $\gamma_m$ and $u_m$, which are expected to remain nonnegative, may become negative while updating the variables. When either of these two quantities becomes negative, it is a good idea to compute the quantities in equations 3.1 to 3.6 afresh using direct computations. If the problem still persists (this typically happens when the matrix P becomes ill conditioned, for example, when the width parameter z used in a gaussian kernel is large), we terminate the algorithm.
6. When the noise level is also to be estimated (step 10), all the relevant variables are calculated afresh. This computation is simplified by expanding the matrix P using equation 2.2.
7. In an experiment, some of the $\alpha_j$ may reach 0, as can be seen in appendix A. This may affect the solution. Therefore, it is useful to set such $\alpha_j$ to a relatively small value, $\alpha_{min}$. Setting this value to 1/N was found to work well in practice.

3.3 Storage and Computational Complexity. The storage requirements of the FGCV algorithm are greater than those of the fast RVM (FRVM) algorithm in Tipping and Faul (2003) because of the additional variables $\xi_m$ and $u_m$ to be maintained. However, they are still linear in N. The computational requirements of the proposed algorithm are similar to those of the FRVM algorithm. Step 5 of the algorithm has a computational complexity of O(N). This is possible using the expressions given in appendix D. The computational complexity of the reestimation or deletion of a basis vector is O(LN), while that of the addition of a basis vector is $O(N^2)$. Step 10 of the algorithm, however, has a computational complexity of $O(LN^2)$, as it requires reestimating the relevant variables. We observed that the FGCV algorithm was 1.25 to 1.5 times slower than the FRVM algorithm in our simulations for these main reasons:

• The error function is shallow near the solution, which results in more iterations.
• There is a higher number of relevance vectors at the solution.
• The additional variables, $\xi_m$ and $u_m$, are updated.

Figure 1: Results on the Friedman2 data set. [Box plots of NMSE and of the number of basis vectors against algorithm index, 1 to 4.]

Table 1: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman2 Data Set.

          FGCV       FRVM      RFS
OLS       .47        9.4e-5    .119
RFS       .013       .097
FRVM      1.3e-12
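Steps 6 to 8 of the algorithm in section 3.1, together with the $\alpha_{min} = 1/N$ floor from implementation detail 7, amount to a small piece of bookkeeping that can be sketched as follows. The function names and the returned labels are hypothetical, chosen only for this illustration; an infinite α is represented by `math.inf`.

```python
import math


def clamp_alpha(alpha, n_samples):
    """Floor a finite alpha at alpha_min = 1/N; leave an infinite alpha untouched."""
    if math.isinf(alpha):
        return alpha
    return max(alpha, 1.0 / n_samples)


def relevance_set_action(alpha_old, alpha_new):
    """Map the old and new alpha of the selected basis vector to steps 6 to 8."""
    was_in_model = math.isfinite(alpha_old)
    is_in_model = math.isfinite(alpha_new)
    if was_in_model and is_in_model:
        return "reestimate"  # step 6: relevance vector set unchanged
    if not was_in_model and is_in_model:
        return "add"         # step 7: add the basis vector
    if was_in_model and not is_in_model:
        return "delete"      # step 8: delete the basis vector
    return "skip"            # alpha stays infinite: nothing to update
```

The "skip" branch is not one of the numbered steps; it is included here only so that the function is total.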

4 Simulations

The proposed FGCV algorithm is evaluated on four popular benchmark data sets and compared with the algorithms described in Tipping and Faul (2003), Chen et al. (1991), and Orr (1995b), referred to as fast RVM (FRVM), orthogonal least squares (OLS), and regularized forward selection (RFS), respectively. Two of these data sets were generated as described by Friedman (1991) and are referred to as Friedman2 and Friedman3. The input dimension for each of these data sets was four. For these data sets, the training set consisted of 200 randomly generated examples, while the test set had 1000 noise-free examples, and the experiment was repeated 100 times for different training set examples. We report the normalized mean squared error (NMSE) (normalized with respect to the output variance) on the test set.

Figure 2: Results on the Friedman3 data set. [Box plots of NMSE and of the number of basis vectors against algorithm index, 1 to 4.]

Table 2: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman3 Data Set.

          FGCV       FRVM      RFS
OLS       4.6e-9     .002      .054
RFS       2.1e-6     .139
FRVM      3.0e-6

The third data set used was the Boston housing data set obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). This data set comprises 506 examples with 13 variables. The data were split into 481/25 training/testing splits randomly, and the partitioning was repeated 100 times independently. The fourth data set used was the Abalone data set (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone/). After mapping the gender encoding (male/female/infant) into {(1,0,0), (0,1,0),

(0,0,1)}, the 10-dimensional data were split into 3000/1177 training/testing splits randomly. The partitioning was repeated 10 times independently. (The exact partitions for the last two data sets were obtained from http://www.gatsby.ucl.ac.uk/~chuwei/regression.html.) For all the data sets, a gaussian kernel was used, and the width parameter was chosen by using fivefold cross-validation. For the OLS and RFS algorithms, the readily available Matlab functions (http://www.anc.ed.ac.uk/~mjo/software/rbf2.zip) were used with the default settings. We adhered to the guidelines provided in Tipping and Faul (2003) for FRVM.

Figure 3: Results on the Boston Housing data set. [Box plots of MSE and of the number of basis vectors against algorithm index, 1 to 4.]

Table 3: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Boston Housing Data Set.

          FGCV       FRVM      RFS
OLS       2.6e-5     .096      .156
RFS       .003       .647
FRVM      2.0e-5

The results obtained using the four algorithms (FGCV, algorithm 1; FRVM, algorithm 2; RFS, algorithm 3; OLS, algorithm 4) on these data sets are presented in Figures 1 to 4. From these box plots, it is clear that

the FGCV algorithm generalizes well compared to the other algorithms. However, this happens at the expense of a moderate increase in the number of basis vectors as compared to the FRVM algorithm. It is worth noting that the sparseness of the solution obtained from the FGCV algorithm is still very good. On the Abalone data set, the FRVM algorithm is slightly better than the FGCV algorithm on the average.

Figure 4: Results on the Abalone data set. [Box plots of MSE and of the number of basis vectors against algorithm index, 1 to 4.]

Table 4: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Abalone Data Set.

          FGCV       FRVM      RFS
OLS       9.8e-4     9.8e-4    .403
RFS       .023       .005
FRVM      .462

Box plots in Figures 1 to 4 show that the distribution of MSE is nonsymmetrical. Therefore, we used Wilcoxon matched-pair signed rank tests to compare the four algorithms; the results are given in Tables 1 to 4. Each box in the table compares an algorithm in the column to an algorithm in the row. The null hypothesis is that the two medians of the test error are the same, while the alternative hypothesis is that they are different. The p-value of this hypothesis test is given in the box. The following comparisons are made with respect to the significance level of 0.05. If a p-value is smaller than 0.05, then the algorithm in the column (row) is statistically superior to the algorithm in the row (column). Table 1 suggests that for the Friedman2 data set, the FGCV algorithm is statistically superior to the FRVM and RFS algorithms. But it is not significantly different from the OLS algorithm. The FGCV algorithm is statistically superior to all the other algorithms on the Friedman3 and the Boston Housing data sets, as is evident from Tables 2 and 3. For the Abalone data set, the results in Table 4 show that the performances of the FGCV and the FRVM algorithms are not significantly different. However, these algorithms are statistically superior to the OLS and RFS algorithms.

We also compared the FGCV and FRVM algorithms with the gaussian process regression (GPR) algorithm on the Boston Housing and Abalone data sets (results reported in http://www.gatsby.ucl.ac.uk/~chuwei/regression.html). These comparisons were also done using the Wilcoxon matched-pair signed-rank test with a significance level of .05. For the Boston Housing data set, the performance comparison gave p-values of 3.4e-10 (FGCV) and 1.22e-12 (FRVM). This shows that the GPR algorithm (with all basis vectors) is statistically superior to the FRVM algorithm. A similar comparison on the Abalone data set resulted in p-values of .012 (FGCV) and .095 (FRVM). On this data set, the GPR algorithm is statistically superior to the FGCV algorithm, while it is not statistically superior to the FRVM algorithm.
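The NMSE reported for the Friedman data sets in this section is described only as the mean squared error normalized by the output variance. A minimal sketch under that description follows; the function name is illustrative, and the use of the population variance of the test targets (division by N rather than N − 1) is an assumption of this example, not something the letter specifies.

```python
def normalized_mse(predictions, targets):
    """Mean squared error divided by the variance of the targets.

    Assumption: the variance is the population variance of the test targets.
    """
    n = len(targets)
    mean_target = sum(targets) / n
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    variance = sum((t - mean_target) ** 2 for t in targets) / n
    return mse / variance
```

Under this definition, a predictor that always outputs the mean of the test targets scores 1.0, and a perfect predictor scores 0.0.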

5 Conclusion

We have proposed a fast, incremental GCV algorithm for designing sparse regression models. This algorithm is very efficient and constructs the model sequentially starting from an empty model. In each iteration, it adds, deletes, or reestimates a basis vector, depending on which gives the maximum reduction in the GCV error. The experimental results suggest that, considering the requirements of sparseness, good generalization performance, and computational complexity, the FGCV algorithm is competitive. Clearly, this algorithm is an excellent alternative to the FRVM algorithm of Tipping and Faul (2003).

We mainly compared our approach against OLS, RFS, and FRVM since they are quite directly related to our problem formulation. We also compared against GPR. We did not compare against the other sparse incremental GP algorithms mentioned in Csato and Opper (2002), Lawrence et al. (2003), Seeger et al. (2003), and Smola and Bartlett (2000) since we felt that those methods would be slightly inferior to GPR. But those comparisons could be interesting, especially if we compare at the same levels of sparsity. This will be taken up in future work. It will also be interesting to extend the proposed algorithm to classification problems.

Appendix A: α Estimation Using GCV Approach

Using equations 2.5 to 2.9, the numerator of the derivative of the GCV error with respect to $\alpha_j$ can be shown to take the form $g_j + h_j \alpha_j$, where

$$g_j = (\delta_j b_j - a_j \epsilon_j)\psi_j - (\delta_j c_j - b_j \epsilon_j), \quad h_j = (\delta_j b_j - a_j \epsilon_j)\sigma^2, \quad \text{and} \quad \psi_j = \phi_j^T P_j \phi_j, \tag{A.1}$$

with $\epsilon_j = \phi_j^T P_j^2 \phi_j$ as in equation 2.9. It is easy to verify that the denominator of the derivative of the GCV error with respect to $\alpha_j$ is nonnegative, and noting that $\alpha_j \geq 0$, the solution can be obtained directly using $g_j$ and $h_j$ or using the sign information of the gradient. More specifically, the optimal solution $\alpha_j$ (lying in the interval [0, ∞]) is obtained from one of the following possibilities:

• If $g_j < 0$ and $h_j < 0$, then $\alpha_j = \infty$.
• If $g_j > 0$ and $h_j > 0$, then $\alpha_j = 0$.
• If $g_j < 0$ and $h_j > 0$, then a unique solution exists and is given by $\alpha_j = -g_j / h_j$.
• If $g_j > 0$ and $h_j < 0$, then a unique stationary point exists at $\alpha_j = -g_j / h_j$. But in this case, the derivative changes from a positive to a negative value while crossing zero, so the stationary point is a maximum and the minimum lies at either 0 or ∞. In this case, we can evaluate the function value at 0 and ∞ and choose the smaller one.

When $h_j = 0$, the solution is dependent on the sign of $g_j$.
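The case analysis above can be sketched as a small function of the coefficients in equations 2.5 to 2.9. The sketch writes the GCV error as $V = N (a_j \Delta_j^2 - 2 b_j \Delta_j + c_j) / (\delta_j \Delta_j - \epsilon_j)^2$ with $\Delta_j = \psi_j + \alpha_j \sigma^2$, which follows from substituting the expressions for $t^T P^2 t$ and tr(P) into equation 2.3. The function names, the argument order, and the tie-breaking toward ∞ when both $g_j$ and $h_j$ vanish are assumptions of this example.

```python
import math


def gcv_value(alpha, a, b, c, delta, eps, psi, sigma2, n):
    """GCV error as a function of one alpha, from the coefficients 2.5 to 2.9."""
    if math.isinf(alpha):
        return n * a / delta ** 2  # limit as Delta_j grows without bound
    big_delta = psi + alpha * sigma2
    numerator = a * big_delta ** 2 - 2.0 * b * big_delta + c
    return n * numerator / (delta * big_delta - eps) ** 2


def optimal_alpha(a, b, c, delta, eps, psi, sigma2, n):
    """Minimizer of the GCV error over alpha in [0, inf], following appendix A."""
    slope = delta * b - a * eps
    g = slope * psi - (delta * c - b * eps)
    h = slope * sigma2
    if g < 0 and h > 0:
        return -g / h      # derivative goes from negative to positive: interior minimum
    if g <= 0 and h <= 0:
        return math.inf    # derivative never positive: error keeps decreasing
    if g >= 0 and h >= 0:
        return 0.0         # derivative never negative: error keeps increasing
    # g > 0 and h < 0: the stationary point is a maximum, so compare the endpoints.
    at_zero = gcv_value(0.0, a, b, c, delta, eps, psi, sigma2, n)
    at_inf = gcv_value(math.inf, a, b, c, delta, eps, psi, sigma2, n)
    return 0.0 if at_zero <= at_inf else math.inf
```

For instance, with an empty model ($P_j = I$), N = 2, t = (2, 1), φ = (1, 0), and σ² = 1, the coefficients are a = 5, b = 4, c = 4, δ = 2, ε = 1, ψ = 1, and the sketch returns α = 1/3, where the GCV error equals 1.6.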

Appendix B: Selection of the First Basis Vector

Two quantities that are of interest in finding the GCV error for all the basis vectors are given by

$$t^T P_m^2 t = \frac{a \Delta_m^2 - 2 b_m \Delta_m + c_m}{\Delta_m^2} \tag{B.1}$$

$$\operatorname{tr}(P_m) = N - \frac{\phi_m^T \phi_m}{\phi_m^T \phi_m + \alpha_m \sigma^2}, \tag{B.2}$$

where $P_m = I - \frac{\phi_m S_m^{-1} \phi_m^T}{\sigma^2}$, $S_m = \frac{\phi_m^T \phi_m + \alpha_m \sigma^2}{\sigma^2}$, and $\Delta_m = \sigma^2 S_m$. Then the optimal solution $\alpha_m$ can be obtained, as described in appendix A, from the coefficients 2.5 to 2.9 using $P_j = I$. After substituting the optimal solution $\alpha_m$ into equations B.1 and B.2, we can evaluate the GCV error for a given m.
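As an illustration of equations B.1 and B.2, the sketch below evaluates the GCV error of a model containing a single basis vector, using $a = t^T t$, $b_m = (t^T \phi_m)^2$, and $c_m = (\phi_m^T \phi_m)(t^T \phi_m)^2$, which are the coefficients 2.5 to 2.7 with $P_j = I$. The function names are hypothetical, and the per-candidate optimal α values are assumed to have been computed beforehand as in appendix A.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_basis_gcv(phi, targets, alpha, sigma2):
    """GCV error of the one-vector model {phi}; alpha may be float('inf')."""
    n = len(targets)
    phi_phi = dot(phi, phi)
    t_phi = dot(targets, phi)
    big_delta = phi_phi + alpha * sigma2
    # Eq. B.1, expanded term by term so that an infinite Delta_m is handled.
    sse = (dot(targets, targets)
           - 2.0 * t_phi ** 2 / big_delta
           + t_phi ** 2 * phi_phi / big_delta ** 2)
    trace_p = n - phi_phi / big_delta  # eq. B.2
    return n * sse / trace_p ** 2


def select_first_basis(candidates, targets, alphas, sigma2):
    """Index of the candidate whose one-vector model has the smallest GCV error."""
    errors = [single_basis_gcv(phi, targets, alpha, sigma2)
              for phi, alpha in zip(candidates, alphas)]
    return min(range(len(errors)), key=errors.__getitem__)
```

With t = (2, 1), σ² = 1, and candidates (1, 0) and (0, 1) at their appendix A optima α = 1/3 and α = ∞, the GCV errors are 1.6 and 2.5, so the first candidate is chosen.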


S. Sundararajan, S. Shevade, and S. Keerthi

Finally, the basis vector j is selected as the index for which the GCV error is minimum. Appendix C: Initialization Once the first basis vector, φ j , is selected based on the GCV error with the optimal α j , initialization of all the relevant quantities is done as follows: r m = tT φ m − T φm − γm = φ m

(tT φ j )(φ Tj φ m ) (φ Tj φ m )2

ξm = t T φ m − 2

(C.2)

j

T φm − 2 um = φ m

v=N−

(C.1)

j

(φ Tj φ m )2 j

+

T (φ m φ m )(φ Tj φ m )2

(tT φ j )(φ Tj φ m ) j

(C.3)

2j +

(tT φ j )(φ Tj φ m )(φ Tj φ j ) 2j

φ Tj φ j

(C.5)

j

q = tT t − 2

(C.4)

(tT φ j )2 (φ Tj φ j ) (tT φ j )2 + , j 2j

(C.6)

tT φ j j

σ , S−1 = [ ] and = [φ j ]. Note j

where j = φ Tj φ j + α j σ 2 . Further, w =

2

that S−1 is a single element matrix when there is a single basis vector. Appendix D: Computing α j

Here, we find the set of coefficients required to compute the optimal α j from the set of quantities defined in equations 3.1 to 3.6. All the results given below are obtained using the rank one update relationship between P and P j . With o j =

α j σ 2r j α j σ 2 −t j

and ψ j =

o j jo j j + 2ξ j j j αjσ2 jo j j + b j = o j ξj αjσ2 j

aj =q +

c j = j o 2j ,

α j σ 2tj α j σ 2 −t j

, we have

(D.1) (D.2) (D.3)


where j = ψ j + α j σ 2 and j =

uj 2j . (α j σ 2 )2


Further,

g j = (δ j b j − a j j )ψ j − (δ j c j − b j j )

(D.4)

h j = (δ j b j − a j j )σ ,

(D.5)

2

where δ j = v + jj . For j not in the relevance vector set, we do not have to find the above set of quantities with P j as j = ∞ and P = P j . Therefore, the quantities in equations 2.5 to 2.9 required to compute the optimal α j can be found in a much simpler way. Next, the optimal solution α j is obtained using g j and h j , as mentioned earlier. (See the discussion in the paragraph below equation A.1.) Appendix E: Reestimation and Deletion of a Basis Vector T −1 Recall that the matrix P = I − Sσ 2 . Then, with σ 2 fixed (at least for the iteration under consideration or for few iterations), a change in α j (which is essentially the reestimation process itself) results in a change in S−1 . Let s j denote the jth column and sjj denote the jth diagonal element of S−1 . Note that the computations are similar to the one used for reestimation except for the coefficient K j . In the case of deletion, K j = s1 , and in the case of

jj

reestimation, K j = (sjj + (αˆ j − α j )−1 )−1 . Here, αˆ j denotes the new optimal solution. The final set of computations required for the reestimation or deletion of basis vectors is given below. Defining ρ jm = sTj T φ m , we have rm = rm + K j w j ρ jm γm = γm +

(E.1)

Kj 2 ρ . σ 2 jm

(E.2)

Defining χ jm = sTj AS−1 T φ m , we have um = um + where τ j =

Kj ρ jm (2χ jm + τ j ρ jm ), σ2

Kj T T s s j . σ2 j

(E.3)

Defining κ j = wT As j , we have

ξm = ξm + K j ρ jm κ j + K j w j (χ jm + ρ jm τ j ) v = v + τj

(E.4) (E.5)

q = q + K j σ w j (2κ j + τ j w j ). 2

(E.6)

Finally, w = w − K j w j s j and S−1 = S−1 − K j s j sTj . Although the set of equations given above is common for reestimation and deletion procedures, the


jth row and/or column is to be removed from S−1 , w, and after making the necessary updates in the deletion procedure. Appendix F: Adding a New Basis Vector On adding the new basis vector j, the dimension of changes, and a new finite α j gets defined. Defining l j = σ12 S−1 T φ j and e j = φ j − l j and µ jm = eTj φ m , we get rm = rm − w j µ jm sjj γm = γm − 2 µ2jm , σ where sjj = we have

tj σ2

1 +α j

(F.1) (F.2)

and w j =

sjj r . σ2 j

Next, defining ν jm = µ jm − lTj AS−1 T φ m ,

sjj µ jm (ξ j − w j u j ) − w j ν jm σ2 s sjj jj . um = um + 2 µ jm u µ − 2ν j jm jm σ σ2 ξ m = ξm −

(F.3) (F.4)

Next, sjj uj σ2 q = q + w j (w j u j − 2ξ j ). v=v −

(F.5) (F.6)

Also, S

−1

=

S−1 + sjj l j lTj −sjj lTj

Finally, w =

w − wjlj wj

−sjj l j sjj

.

(F.7)

and = [ φ j ].

References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Chen, S., Cowan, C. F. N., & Grant, P. M. (1991). Orthogonal least squares learning for radial basis function networks. IEEE Trans on Neural Networks, 2, 302–309.
Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273–297.
Csato, L., & Opper, M. (2002). Sparse on-line gaussian processes. Neural Computation, 14(3), 641–668.
Denison, D., & George, E. (2000). Bayesian prediction using adaptive ridge estimators. (Tech Rep.). London: Department of Mathematics, Imperial College. Available online at http://stats.ma.ic.ac.uk/dgtd/public html/Papers/grr.ps.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.
Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Lawrence, N., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 609–616). Cambridge, MA: MIT Press.
Orr, M. J. L. (1995a). Local smoothing of radial basis function networks. In Proceedings of International Symposium on Neural Networks. Hsinchu, Taiwan.
Orr, M. J. L. (1995b). Regularization in the selection of radial basis function centers. Neural Computation, 7(3), 606–623.
Seeger, M., Williams, C., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.
Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 619–625). Cambridge, MA: MIT Press.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of Royal Statistical Society (series B), 36, 111–147.
Sundararajan, S., & Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in gaussian processes. Neural Computation, 13(5), 1103–1118.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Tipping, M. E., & Faul, A. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.

Received August 29, 2005; accepted June 1, 2006.


Addendum

In “Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus” by Dmitri Pervouchine, Theoden Netoff, Horacio Rotstein, John White, Mark Cunningham, Miles Whittington, and Nancy Kopell (Vol. 18, No. 11: 2617–2650), the following paragraph was omitted at the end of Appendix section A.2:

Synaptic gating variable $s_j$ obeys first order kinetics according to the equation

$$\frac{\partial s_j}{\partial t} = \alpha_j (1 - s_j)\left[1 + \tanh\left((v_j - v_{th})/v_{sl}\right)\right] - \beta_j s_j,$$

where $\alpha_j$ and $\beta_j$ are the respective synapse rise and decay rate constants, $v_{th} = 0$, and $v_{sl} = 4$. The decay rate constants were chosen to achieve the desired synapse decay time (for instance, $\beta_j \approx 0.2$ for $\tau = 5$, and $\beta_j \approx 0.05$ for $\tau = 20$). The rise rate constant $\alpha_j = 20$ used in the simulations was not accessible in the experiments.

ARTICLE

Communicated by S. Coombes

The Astrocyte as a Gatekeeper of Synaptic Information Transfer Vladislav Volman [email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel

Eshel Ben-Jacob [email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel, and Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.

Herbert Levine [email protected] Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.

Neural Computation 19, 303–326 (2007). © 2007 Massachusetts Institute of Technology

We present a simple biophysical model for the coupling between synaptic transmission and the local calcium concentration on an enveloping astrocytic domain. This interaction enables the astrocyte to modulate the information flow from presynaptic to postsynaptic cells in a manner dependent on previous activity at this and other nearby synapses. Our model suggests a novel, testable hypothesis for the spike timing statistics measured for rapidly firing cells in culture experiments.

1 Introduction

In recent years, evidence has been mounting regarding the possible role of glial cells in the dynamics of neural tissue (Volterra & Meldolesi, 2005; Haydon, 2001; Newman, 2003; Takano et al., 2006). For astrocytes in particular, the specific association of processes with synapses and the discovery of two-way astrocyte-neuron communication has demonstrated the inadequacy of the previously held view regarding the purely supportive role for these glial cells. Instead, future progress requires rethinking how the dynamics of the coupled neuron-glial network can store, recall, and process information. At the level of cell biophysics, some of the mechanisms underlying the so-called tripartite synapse (Araque, Parpura, Sanzgiri, & Haydon, 1999) are becoming clearer. For example, it is now well established that astrocytic


mGlu receptors detect synaptic activity and respond via activation of the calcium-induced calcium release pathway, leading to elevated Ca 2+ levels. The spread of these levels within a microdomain of one cell can coordinate the activity of disparate synapses that are associated with the same microdomain (Perea & Araque, 2002). Moreover, it might even be possible to transmit information directly from domain to domain and even from astrocyte to astrocyte if the excitation level is strong enough to induce either intracellular or intercellular calcium waves (Cornell-Bell, Finkbeiner, Cooper, & Smith, 1990; Charles, Merrill, Dirksen, & Sanderson, 1991; Cornell-Bell & Finkbeiner, 1991). One sign of the maturity in our understanding is the formulation of semiquantitative models for this aspect of neuron-glial communication (Nadkarni & Jung, 2004; Sneyd, Wetton, Charles, & Sanderson, 1995; Hofer, Venance, & Giaume, 2003). There is also information flow in the opposite direction, from astrocyte to synapse. Direct experimental evidence for this, via the detection of the modulation of synaptic transmission as a function of the state of the glial cells, will be reviewed in more detail below. One of the major goals of this work is to introduce a simple phenomenological model for this interaction. The model will take into account both a deterministic effect of high Ca 2+ in the astrocytic process, namely, the reduction of the postsynaptic response to incoming spikes on the presynaptic axon (Araque, Parpura, Sanzgiri, & Haydon, 1998a), and a stochastic effect, namely, the increase in the frequency of observed miniature postsynaptic current events uncorrelated with any input (Araque, Parpura, Sanzgiri, & Haydon, 1998b). There are also direct NMDA-dependent effects on the postsynaptic neuron of astrocyte-emitted factors (Perea & Araque, 2005), which are not considered here. As we will show, the coupling allows the astrocyte to act as a gatekeeper for the synapse. 
By this, we mean that the amount of data transmitted across the synapse can be modulated by astrocytic dynamics. These dynamics may be controlled mostly by other synapses, in which case the gatekeeping will depend on dynamics external to the specific synapse under consideration. Alternatively, the dynamics may depend mostly on excitation from the selfsame synapse, in which case the behavior of the entire system is determined self-consistently. Here we focus on the latter possibility and leave for future work the discussion of how this mechanism could lead to multisynaptic coupling. Our ideas regarding the role of the astrocyte offer a new explanation for observations regarding firing patterns in cultured neuronal networks. In particular, spontaneous bursting activity in these networks is regulated by a set of rapidly firing neurons, which we refer to as spikers; these neurons exhibit spiking even during long interburst intervals and hence must have some form of self-consistent self-excitation. We model these neurons as containing astrocyte-mediated self-synapses (autapses) (Segal, 1991, 1994; Bekkers & Stevens, 1991) and show that this hypothesis naturally accounts


for the observed unusual interspike interval distribution. Additional tests of this hypothesis are proposed at the end.

2 Experimental Observations

2.1 Cultured Neuronal Networks. The cultured neuronal networks presented here are self-generated from dissociated cultures of mixed cortical neurons and glial cells drawn from 1-day-old Charles River rats. The dissection, cell dissociation, and recording procedures were previously described in detail (Segev et al., 2002). Briefly, following dissection, neurons are dispersed by enzymatic treatment and mechanical dissociation. Then the cells are homogeneously plated on multielectrode arrays (MEA, Multi-Channel Systems), precoated with Poly-L-Lysine. The culture medium was DMEM (Sigma) enriched by serum and was changed every two days. Plated cultures are placed on the MEA board (B-MEA-1060, Multi Channel Systems) for simultaneous long-term noninvasive recordings of neuronal activity from several neurons at a time. Recorded signals are digitized and stored for off-line analysis on a PC via an A-D board (Microstar DAP) and data acquisition software (Alpha-Map, Alpha Omega Engineering). Noninvasive recording of the network activity (action potentials) is possible due to the capacitive coupling that some of the neurons form with some of the electrodes. Since typically one electrode can record signals from several neurons, a specially developed spike-sorting algorithm (Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000) is utilized to reconstruct single neuron-specific spike series.

Although there are no externally provided guiding stimulations or chemical cues, relatively intense dynamical activity is spontaneously generated within several days. The activity is marked by the formation of synchronized bursting events (SBEs): short (∼200 ms) time windows during which most of the recorded neurons participate in relatively rapid firing (Segev & Ben-Jacob, 2001). 
These SBEs are separated by long intervals (several seconds or more) of sporadic neuronal firing of most of the neurons. A few neurons (referred to as spiker neurons) exhibit rapid firing even during the inter-SBE time intervals. These neurons also exhibit much faster firing rates during the SBEs, and their interspike intervals distribution is marked by a long-tail behavior (see Figure 4). 2.2 Interspike Interval (ISI) Increments Distribution. One of the tools used to compare model results with measured spike data concerns the distribution of increments in the spike times, defined as δ(i) = ISI (i + 1) − ISI (i), i ≥ 1. The distribution of δ(i) will have heavy tails if there is a wide range of interspike intervals and rapid transitions from one type of interval to the next. For example, rapid transitions from bursting events to occasional interburst firings will lead to such a tail. Applying this analysis to the recorded spike data of cultured cortical networks, Segev et al. (2002) found


that distributions of neurons’ ISI increments can be well fitted with Levy functions over three decades in time.

3 The Model

In this section we present the mathematical details of the models employed in this work. Readers interested mainly in the conclusions can skip directly to section 4. The basic notion we use is that standard synapse models must be modified to account for the astrocytic modulation, depending, of course, on the calcium level. In turn, the astrocytic calcium level is affected by synaptic activity; for this, we use the Li-Rinzel model where the $IP_3$ concentration parameter governing the excitability is increased on neurotransmitter release. These ingredients suffice to demonstrate what we mean by gatekeeping. Finally, we apply this model to the case of an autaptic oscillator, which requires the introduction of neuronal dynamics. For this, we chose the Morris-Lecar model as a generic example of a type-I firing system. None of our results would be altered with a different choice as long as we retain the tangent-bifurcation structure, which allows for arbitrarily long interspike intervals.

3.1 TUM Synapse Model. To describe the kinetics of a synaptic terminal, we have used the model of an activity-dependent synapse first introduced by Tsodyks, Uziel, and Markram (2000). In this model, the effective synaptic strength evolves according to the following equations:

$$
\begin{aligned}
\dot{x} &= \frac{z}{\tau_{rec}} - u x \delta(t - t_{sp}) \\
\dot{y} &= -\frac{y}{\tau_{in}} + u x \delta(t - t_{sp}) \\
\dot{z} &= \frac{y}{\tau_{in}} - \frac{z}{\tau_{rec}}
\end{aligned}
\tag{3.1}
$$

Here, $x$, $y$, and $z$ are the fractions of synaptic resources in the recovered, active, and inactive states, respectively. For an excitatory glutamatergic synapse, the values attained by these variables can be associated with the dynamics of vesicular glutamate. As an example, the value of $y$ in this formulation will be proportional to the amount of glutamate that is being released during the synaptic event, and the value of $x$ will be proportional to the size of a readily releasable vesicle pool. The time series $t_{sp}$ denotes the arrival times of presynaptic spikes, $\tau_{in}$ is the characteristic time of postsynaptic currents (PSCs) decay, and $\tau_{rec}$ is the recovery time from synaptic depression. Upon arrival of a spike at the presynaptic terminal at time $t_{sp}$, a fraction $u$ of available synaptic resources is transferred from the recovered state to the active state. Once in the active state, synaptic resources rapidly decay to the inactive state, from which they recover within a timescale $\tau_{rec}$. Since the typical times are assumed to satisfy $\tau_{rec} \gg \tau_{in}$, the model predicts onset of short-term synaptic depression after a period of high-frequency repetitive firing. The onset of depression can be controlled by the variable $u$, which describes the effective use of synaptic resources by the incoming spike. In the original TUM model, the variable $u$ is taken to be constant for the excitatory postsynaptic neuron; in what follows, we will set $u = 0.1$. Other parameter choices for these equations as well as for the rest of the model equations are presented in Table 1.

To complete the specification, it is assumed that the resulting PSC, arriving at the model neurons’ soma through the synapse, depends linearly on the fraction of available synaptic resources. Hence, a total synaptic current seen by a neuron is $I_{syn}(t) = A y(t)$, where $A$ stands for an absolute synaptic strength. At this stage, we do not take into account the long-term effects associated with the plasticity of neuronal somata and take the parameter $A$ to be time independent.

Table 1: Parameters Used in Simulations.

Parameter  Value              Parameter  Value                 Parameter  Value
P0         0.5                η_mean     1.2 · 10^−3 µA cm^−2  σ          0.1
τ_Ca       4 sec              κ          0.5 sec^−1            τ_IP3      7 sec
r_IP3      7.2 mM sec^−1      IP3*       0.16 µM               c1         0.185
v1         6 sec^−1           v2         0.11 sec^−1           v3         0.9 µM sec^−1
k3         0.1 µM             d1         0.13 µM               d2         1.049 µM
d3         0.9434 µM          d5         0.08234 µM            a2         0.2 µM^−1 sec^−1
c0         2.0 µM             gCa        1.1 mS cm^−2          gK         2.0 mS cm^−2
gL         0.5 mS cm^−2       VCa        100 mV                VK         −70 mV
VL         −35 mV             V1         −1 mV                 V2         15 mV
V3         10 mV              V4         14.5 mV               φ          0.3
I_base     0.34 µA cm^−2      τ_d        10 msec               τ_rec      100 msec
u          0.1                A          10.00 µA cm^−2

3.2 Astrocyte Response. Astrocytes adjacent to synaptic terminals respond to the neuronal action potentials by binding glutamate to their metabotropic glutamate receptors (Porter & McCarthy, 1996). The activation of these receptors then triggers the production of $IP_3$, which consequently serves to modulate the intracellular concentration of calcium ions; the effective rate of $IP_3$ production depends on the amount of transmitter released during the synaptic event. We therefore assume that the production of intracellular $IP_3$ in the astrocyte is given by

$$\frac{d[IP_3]}{dt} = \frac{IP_3^{*} - IP_3}{\tau_{IP_3}} + r_{IP_3}\, y. \tag{3.2}$$
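As a rough numerical illustration of equations 3.1 and 3.2, the sketch below integrates the synaptic resources with a forward Euler step and applies the delta-function terms as instantaneous jumps at spike times. It is not the authors' code. The function name `simulate_tum_synapse`, the step size, and the identification of $\tau_{in}$ with the 10 msec entry $\tau_d$ of Table 1 are assumptions. The production rate `r_ip3` must be supplied by the caller in units consistent with milliseconds; converting the Table 1 value (7.2 mM sec^−1) is deliberately left open here.

```python
def simulate_tum_synapse(spike_times_ms, t_end_ms, dt_ms=0.1, u=0.1,
                         tau_in_ms=10.0, tau_rec_ms=100.0,
                         ip3_star=0.16, tau_ip3_ms=7000.0, r_ip3=0.0):
    """Forward-Euler sketch of equations 3.1 and 3.2 (time in msec)."""
    n_steps = int(round(t_end_ms / dt_ms))
    # Map each spike time to the index of the step at which it is applied.
    spikes_at = {}
    for t in spike_times_ms:
        k = int(round(t / dt_ms))
        spikes_at[k] = spikes_at.get(k, 0) + 1

    x, y, z = 1.0, 0.0, 0.0      # all resources start in the recovered state
    ip3 = ip3_star               # IP3 starts at its resting value
    releases, trace = [], []
    for k in range(n_steps):
        # Delta-function terms: each spike moves a fraction u of x into y.
        for _ in range(spikes_at.get(k, 0)):
            released = u * x
            x -= released
            y += released
            releases.append(released)
        # Continuous part of equations 3.1 and 3.2.
        dx = z / tau_rec_ms
        dy = -y / tau_in_ms
        dz = y / tau_in_ms - z / tau_rec_ms
        dip3 = (ip3_star - ip3) / tau_ip3_ms + r_ip3 * y
        x += dt_ms * dx
        y += dt_ms * dy
        z += dt_ms * dz
        ip3 += dt_ms * dip3
        trace.append(((k + 1) * dt_ms, x, y, z, ip3))
    return trace, releases
```

Feeding a regular 100 Hz train, for instance `simulate_tum_synapse([10.0 * i for i in range(10)], 200.0)`, shows the release per spike shrinking from its initial value of $u$ as the recovered pool is depleted, which is the short-term depression described in the text.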


Equation 3.2 is similar to the formulation used by Nadkarni and Jung (2004), with some important differences. First, the effective rate of $IP_3$ production depends not on the potential of neuronal membrane, but on the amount of neurotransmitter that is being released into the synaptic cleft. Hence, as the resources of synapse are depleted (due to depression), there will be less transmitter released, and therefore the $IP_3$ will be produced at lower rates, leading eventually to decay of calcium concentration. Second, as the neurotransmitter is released also during spontaneous synaptic events (noise), the latter will also influence the production of $IP_3$ and subsequent calcium oscillations.

3.3 Astrocyte. To model the dynamics of a single astrocytic domain, we use the Li-Rinzel model (Li & Rinzel, 1994; Nadkarni & Jung, 2004), which has been specifically developed to take into account the $IP_3$-dependent dynamical changes in the concentration of cytosolic $Ca^{2+}$. This is based on the theoretical studies of Nadkarni and Jung (2004), where it is decisively demonstrated that astrocytic $Ca^{2+}$ oscillations may account for the spontaneous activity of neurons. The intracellular concentration of $Ca^{2+}$ in the astrocyte is described by the following set of equations:

$$\frac{d[Ca^{2+}]}{dt} = -J_{chan} - J_{pump} - J_{leak} \tag{3.3}$$

$$\frac{dq}{dt} = \alpha_q (1 - q) - \beta_q q. \tag{3.4}$$

Here, $q$ is the fraction of activated $IP_3$ receptors. The fluxes of currents through ER membrane are given in the following expressions:

$$J_{chan} = c_1 v_1 m_\infty^3 n_\infty^3 q^3 \left([Ca^{2+}] - [Ca^{2+}]_{ER}\right) \tag{3.5}$$

$$J_{pump} = \frac{v_3 [Ca^{2+}]^2}{k_3^2 + [Ca^{2+}]^2} \tag{3.6}$$

$$J_{leak} = c_1 v_2 \left([Ca^{2+}] - [Ca^{2+}]_{ER}\right), \tag{3.7}$$

where

$$m_\infty = \frac{[IP_3]}{[IP_3] + d_1} \tag{3.8}$$

$$n_\infty = \frac{[Ca^{2+}]}{[Ca^{2+}] + d_5} \tag{3.9}$$

$$\alpha_q = a_2 d_2 \frac{[IP_3] + d_1}{[IP_3] + d_3} \tag{3.10}$$

$$\beta_q = a_2 [Ca^{2+}]. \tag{3.11}$$

The reversal $Ca^{2+}$ concentration ($[Ca^{2+}]_{ER}$) is obtained after requiring conservation of the overall $Ca^{2+}$ concentration:

$$[Ca^{2+}]_{ER} = \frac{c_0 - [Ca^{2+}]}{c_1}. \tag{3.12}$$
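Equations 3.3 to 3.12 translate directly into a right-hand-side function. The sketch below is one possible transcription, using the Table 1 values with concentrations in µM and time in seconds. The names `PARAMS`, `li_rinzel_rhs`, and `integrate_li_rinzel` are hypothetical, the $IP_3$ level is held fixed for simplicity, and the forward Euler step is an assumption rather than the scheme used for the published simulations.

```python
# Table 1 values for the astrocyte model (concentrations in uM, time in sec).
PARAMS = dict(v1=6.0, v2=0.11, v3=0.9, k3=0.1, d1=0.13, d2=1.049,
              d3=0.9434, d5=0.08234, a2=0.2, c0=2.0, c1=0.185)


def li_rinzel_rhs(ca, q, ip3, p=PARAMS):
    """Return (d[Ca]/dt, dq/dt) following equations 3.3 to 3.12."""
    ca_er = (p["c0"] - ca) / p["c1"]                                  # eq 3.12
    m_inf = ip3 / (ip3 + p["d1"])                                     # eq 3.8
    n_inf = ca / (ca + p["d5"])                                       # eq 3.9
    alpha_q = p["a2"] * p["d2"] * (ip3 + p["d1"]) / (ip3 + p["d3"])   # eq 3.10
    beta_q = p["a2"] * ca                                             # eq 3.11
    j_chan = p["c1"] * p["v1"] * m_inf**3 * n_inf**3 * q**3 * (ca - ca_er)
    j_pump = p["v3"] * ca**2 / (p["k3"]**2 + ca**2)                   # eq 3.6
    j_leak = p["c1"] * p["v2"] * (ca - ca_er)                         # eq 3.7
    dca = -j_chan - j_pump - j_leak                                   # eq 3.3
    dq = alpha_q * (1.0 - q) - beta_q * q                             # eq 3.4
    return dca, dq


def integrate_li_rinzel(ca0, q0, ip3, t_end_s, dt_s=1e-3, p=PARAMS):
    """Forward-Euler integration at a fixed IP3 level (an assumption)."""
    ca, q = ca0, q0
    out = [(0.0, ca, q)]
    for k in range(int(round(t_end_s / dt_s))):
        dca, dq = li_rinzel_rhs(ca, q, ip3, p)
        ca += dt_s * dca
        q += dt_s * dq
        out.append(((k + 1) * dt_s, ca, q))
    return out
```

In a coupled simulation the fixed `ip3` argument would instead be advanced with equation 3.2 at every step.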

3.4 Glia-Synapse Interaction. Astrocytes affect synaptic vesicle release in a calcium-dependent manner. Rather than attempt a complete biophysical model of the complex chain of events leading from calcium rise to vesicle release (Gandhi & Stevens, 2003), we proceed in a phenomenological manner. We define a dynamical variable $f$ that phenomenologically will capture this interaction. When the concentration of calcium in its synapse-associated process exceeds a threshold, we assume that the astrocyte emits a finite amount of neurotransmitter into the perisynaptic space, thus altering the state of a nearby synapse; this interaction occurs via glutamate binding to presynaptic mGlu and NMDA receptors (Zhang et al., 2004). As the internal astrocyte resource of neurotransmitter is finite, we include the saturation term $(1 - f)$ in the dynamical equation for $f$. The final form is

$$\dot{f} = -\frac{f}{\tau_{Ca^{2+}}} + (1 - f)\,\kappa\,\Theta\left([Ca^{2+}] - [Ca^{2+}_{threshold}]\right), \tag{3.13}$$

where $\Theta$ is the Heaviside step function. Given this assumption, equations 3.1 should be modified to take this modulation into account. We assume the following simple form:

$$\dot{x} = \frac{z}{\tau_{rec}} - (1 - f)\, u x \delta(t - t_{sp}) - x \eta(f) \tag{3.14}$$

$$\dot{y} = -\frac{y}{\tau_{in}} + (1 - f)\, u x \delta(t - t_{sp}) + x \eta(f). \tag{3.15}$$

In equations 3.14 and 3.15, $\eta(f)$ represents a noise term modeling the increased occurrence of mini-PSCs. The fact that a noise increase accompanies an amplitude decrease is partially due to competition for synaptic resources between these two release modes (Otsu et al., 2004). Based on experimental observations, we prescribe that the dependence of $\eta(f)$ on $f$ is such that the rate of noise occurrence (the frequency of $\eta(f)$ in a fixed time step) increases with increasing $f$, but the amplitude distribution (modeled here as a gaussian-distributed variable centered around a positive mean) remains unchanged. For the rate of noise occurrence, we chose the following functional dependence:

$$P(f) = P_0 \exp\left[-\left(\frac{1 - f}{\sqrt{2}\,\sigma}\right)^2\right], \tag{3.16}$$


with $P_0$ representing the maximal frequency of $\eta(f)$ in a fixed time step. Note that although both synaptic terminals and astrocytes utilize glutamate for their signaling purposes, we assume the two processes to be independent. In so doing, we rely on existing biophysical experiments demonstrating that whereas a presynaptic terminal releases glutamate in the synaptic cleft, astrocytes selectively target extrasynaptic glutamate receptors (Araque et al., 1998a, 1998b). Hence, synaptic transmission does not interfere with the astrocyte-to-synapse signaling.

3.5 Neuron Model. We describe the neuronal dynamics with a simplified two-component Morris-Lecar model (Morris & Lecar, 1981),

$$\dot{V} = -I_{ion}(V, W) + I_{ext}(t) \tag{3.17}$$

$$\dot{W}(V) = \phi\,\frac{W_\infty(V) - W(V)}{\tau_W(V)}, \tag{3.18}$$

with $I_{ion}(V, W)$ representing the contribution of the internal ionic $Ca^{2+}$, $K^{+}$, and leakage currents with their corresponding channel conductivities $g_{Ca}$, $g_K$, and $g_L$ being constant:

$$I_{ion}(V, W) = g_{Ca}\, m_\infty(V)(V - V_{Ca}) + g_K W(V)(V - V_K) + g_L (V - V_L). \tag{3.19}$$

$I_{ext}$ represents all the external current sources stimulating the neuron, such as signals received through its synapses, glia-derived currents, artificial stimulations, as well as any noise sources. In the absence of any such stimulation, the fraction of open potassium channels, $W(V)$, relaxes toward its limiting curve (nullcline) $W_\infty(V)$, which is described by the sigmoid function,

$$W_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_1}{V_2}\right)\right] \tag{3.20}$$

within a characteristic timescale given by

$$\tau_W(V) = \frac{1}{\cosh\left(\frac{V - V_1}{2 V_2}\right)}. \tag{3.21}$$


In contrast to this, it is assumed in the Morris-Lecar model that calcium channels are activated immediately. Accordingly, the fraction of open $Ca^{2+}$ channels obeys the following equation:

$$m_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_3}{V_4}\right)\right]. \tag{3.22}$$
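A minimal sketch of equations 3.17 to 3.22 follows, using the Table 1 values and the assignment of $V_1$ to $V_4$ exactly as written in the equations above. Several things here are assumptions rather than statements from the text: the names `ML`, `simulate_morris_lecar`, and `i_syn` are hypothetical, the membrane capacitance is taken as 1 µF cm^−2 (so that equation 3.17 needs no prefactor), time is measured in msec, and a forward Euler step is used.

```python
import math

# Table 1 values for the neuron model (mS/cm^2, mV, uA/cm^2).
ML = dict(g_ca=1.1, g_k=2.0, g_l=0.5, v_ca=100.0, v_k=-70.0, v_l=-35.0,
          v1=-1.0, v2=15.0, v3=10.0, v4=14.5, phi=0.3, i_base=0.34)


def w_inf(v, p=ML):
    """Steady-state fraction of open K+ channels, equation 3.20."""
    return 0.5 * (1.0 + math.tanh((v - p["v1"]) / p["v2"]))


def tau_w(v, p=ML):
    """Relaxation timescale of W, equation 3.21."""
    return 1.0 / math.cosh((v - p["v1"]) / (2.0 * p["v2"]))


def m_inf(v, p=ML):
    """Instantaneous fraction of open Ca2+ channels, equation 3.22."""
    return 0.5 * (1.0 + math.tanh((v - p["v3"]) / p["v4"]))


def i_ion(v, w, p=ML):
    """Total ionic current, equation 3.19."""
    return (p["g_ca"] * m_inf(v, p) * (v - p["v_ca"])
            + p["g_k"] * w * (v - p["v_k"])
            + p["g_l"] * (v - p["v_l"]))


def simulate_morris_lecar(t_end, dt=0.05, i_syn=lambda t: 0.0,
                          v0=-35.0, w0=0.0, p=ML):
    """Forward-Euler integration of equations 3.17 and 3.18."""
    v, w = v0, w0
    out = [(0.0, v, w)]
    for k in range(int(round(t_end / dt))):
        t = k * dt
        i_ext = i_syn(t) + p["i_base"]      # synaptic plus background current
        dv = -i_ion(v, w, p) + i_ext
        dw = p["phi"] * (w_inf(v, p) - w) / tau_w(v, p)
        v += dt * dv
        w += dt * dw
        out.append(((k + 1) * dt, v, w))
    return out
```

The `i_syn` callback is where the autaptic current $A y(t)$ of section 3.1 would be supplied; with the default it is zero and only the background current drives the cell.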

For an isolated neuron, rendered with a single autapse, one has Ie xt (t) = Isyn (t) + Iba se , where Isyn (t) is the current arriving through the self-synapse and Iba se is some constant background current. In this work, we assume that Iba se is such that, when acting alone, it causes a neuron to fire at a very low constant rate. Of course, these two terms enter the equation additively, and the dynamics depends on the total external current. Nonetheless, it is important to separate these terms, as only one of them enters through the synapse; it is only this term that is modulated by astrocytic glutamate release and only this term that would be changed by synaptic blockers. As we will note later, the baseline current may also be due to astrocytes, albeit to a direct current directed into the neuronal soma. In anticipation of a better future understanding of this term, we consider it separately from the constant appearing in leak current (g L VL ), although there is clearly some redundancy in the way these two terms set the operating point of the neuron. 4 Results 4.1 Synaptic Model. In simple models of neural networks, the synapse is considered to be a passive element that directly transmits information, in the form of arriving spikes on the presynaptic terminal, to postsynaptic currents. It has been known for a long time that more complex synaptic dynamics can affect this transfer. One such effect concerns the finite reservoir of presynaptic vesicle resources and was modeled by Tsodyks, Uziel, and Markram (TUM) (Tsodyks et al., 2000). Spike trains with too high a frequency will be attenuated by a TUM synapse, as there is insufficient recovery from one arrival to the next. To demonstrate this effect, we fed the TUM synaptic model with an actual spike train recorded from a neuron in a cultured network (shown in Figure 1a); the resulting postsynaptic current (PSC) is shown in Figure 1b. 
As is expected, there is attenuation of the PSC height during time windows with high rates of presynaptic spiking input. 4.2 Effect of Presynaptic Gating. Our goal is to extend the TUM model to include the interaction of the synapse with an astrocytic process imagined to be wrapped around the synaptic cleft. The effects of astrocytes on stimulated synaptic transmission are well established. Araque et al. (1998a) report that astrocyte stimulation reduced the magnitude of

[Figure 1: panels (a)–(d); horizontal axis: time [sec], 0 to 4.]

Figure 1: The generic effect of an astrocyte on the presynaptic depression, as captured by our phenomenological model (see text for details). To illustrate the effect of presynaptic depression and the astrocyte influence, we feed a model synapse with the input of spikes taken from the recorded activity of a cultured neuronal network (see the text and Segev & Ben-Jacob, 2001 for details). (a) The input sequence of spikes that is fed into the model presynaptic terminal. (b) Each spike arriving at the model presynaptic terminal results in the postsynaptic current (PSC). The strength of the PSC depends on the amount of available synaptic resources, and the synaptic depression effect is clearly observable during spike trains with relatively high frequency. (c) The effect of a periodic gating function, $f(t) = 0.5 + f_0 \sin(\omega t)$, shown in (d). The period of the oscillation, $T = 2\pi/\omega = 2$ sec, is taken to be compatible with the typical timescales of variations in the intraglial $Ca^{2+}$ concentration. Note the reduction in the PSC near the maxima of $f$, along with the elevated baseline resulting from the increase in the rate of spontaneous presynaptic transfer.

action-potential-evoked excitatory and inhibitory synaptic currents by decreasing the probability of evoked transmitter release. Specifically, presynaptic metabotropic glutamate receptors (mGluRs) have been shown to affect the stimulated synaptic transmission by regulating presynaptic voltagegated calcium channels, which eventually leads to the reduction of calcium flux during the incoming spike and results in a decrease of amplitude of synaptic transmission. These results are best shown in Figure 8 of their paper, which presents the amplitude of evoked EPSC both before and after stimulation of an associated astrocyte. Note that we are referring here to “faithful" synapses—those that transmit almost all of the incoming spikes. Effects of astrocytic stimulation on highly stochastic synapses, namely, the increase in fidelity (Kang, Jiang, Goldman, & Nedergaard, 1998), are not studied here. In addition, astrocytes were shown to increase the frequency of spontaneous synaptic events. In detail, Araque et al. (1998b) have shown that


astrocyte stimulation increases the frequency of miniature postsynaptic currents (mPSC) without modifying their amplitude distribution, suggesting that astrocytes act to increase the probability of vesicular release from the presynaptic terminal. Although the exact mechanism is unknown, this effect is believed to be mediated by NMDA receptors located at the presynaptic terminal. It is important to note that the two kinds of astrocytic influence on the synapse (decrease of the probability of evoked release and increase in the probability of spontaneous release) do not contradict each other. Evoked transmitter release depends on the calcium influx through calcium channels that can be inhibited by the activation of presynaptic mGluRs. On the other hand, the increase in the probability of spontaneous release follows because of the activation of presynaptic NMDA channels. In addition, spontaneous activity can deplete the vesicle pool (in terms of either number or filling) and hence directly lower triggered release amplitudes (Otsu et al., 2004). We model these effects by two modifications of the TUM model. First, we introduce a gating function f that modulates the stimulated release in a calcium-dependent manner. This term will cause the synapse to turn off at high calcium. This presynaptic gating effect is demonstrated in Figure 1c, where we show the resulting PSC corresponding to a case in which f is chosen to vary periodically with a timescale consistent with astrocytic calcium dynamics. The effect on the recorded spike train data is quite striking. The second effect, the increase of stochastic release in the absence of any input, is included as an f -dependent noise term in the TUM equations. This will be important as we turn to a self-consistent calculation of the synapse coupled to a dynamical astrocyte. 4.3 The Gatekeeping Effect. 
We close the synapse-glia-synapse feedback loop by inclusion of the effect of the presynaptic activity on the intracellular Ca2+ dynamics in the astrocyte that in turn sets the value of the gating function f. Nadkarni and Jung (2004) have argued that the basic calcium phenomenology in the astrocyte, arising via the glutamate-induced production of IP3, can be studied using the Li-Rinzel model. What emerges from their work is that the dynamics of the intra-astrocyte Ca2+ level depends on the intensity of the presynaptic spike train, acting as an information integrator over a timescale on the order of seconds; the level of Ca2+ in the astrocyte increases according to the summation of the synaptic spikes over time. If the total number of spikes is low, the Ca2+ concentration in the astrocyte remains below a self-amplification threshold level and simply decays back to its resting level with some characteristic time. However, things change dramatically when a sufficiently intense set of signals arises across the synapse. Now the Ca2+ concentration overshoots its linear response level, followed by decaying oscillations. Given our results, these high Ca2+ levels in the astrocyte will in fact attenuate spike information that arrives subsequent to strong bursts of activity. We illustrate this time-delayed gatekeeping (TDGK) effect in

[Figure 2: panels (a)–(d); horizontal axis: time [sec], 0 to 20.]

Figure 2: The gatekeeping effect in a glia-gated synapse. (a) The input sequence of spikes, which is composed of several copies of the sequence shown in Figure 1, separated by segments of long quiescent time. The resulting time series may be viewed as bursts of action potentials arriving at the model presynaptic terminal. The first burst of spikes results in the elevation of free astrocyte Ca2+ concentration (b), but this elevation alone is not sufficient to evoke an oscillatory response. An additional elevation of Ca2+, leading to the emergence of oscillation, is provided by the second burst of spikes arriving at the presynaptic terminal. Once the astrocytic Ca2+ crosses a predefined threshold, it starts to exert a modulatory influence back on the presynaptic terminal. In the model, this is manifested by the rising dynamics of the gating function (c). Note that as the decay time of the gating function f is on the order of seconds, the astrocyte influence on the presynaptic terminal persists even after the concentration of astrocyte Ca2+ has fallen. This is best seen from (d), where we show the profile of the PSC. The third burst of spikes arriving at the presynaptic terminal is modulated due to the astrocyte, even though the concentration of Ca2+ is relatively low at that time. This modulation extends also to the fourth burst of spikes, which together with the third burst leads again to the oscillatory response of astrocyte Ca2+. Taken together, these results illustrate a temporally nonlocal gatekeeping effect of glia cells.

Figure 2. We constructed a spike train by placing a time delay in between segments of recorded sequences. As can be seen, since the degree of activity during the first two segments exceeds the threshold level, there is attenuation of the late-arriving segments. Thus, the information passed through the synapse is modulated by previously arriving data.

4.4 Autaptic Excitatory Neurons. Our new view of synaptic dynamics will have broad consequences for making sense of neural circuitry. To illustrate this prospect, we turn to the study of an autaptic oscillator (Seung, Lee, Reis, & Tank, 2000), by which we mean an excitatory neuron
that exhibits repeated spiking driven at least in part by self-synapses (Segal, 1991, 1994; Bekkers & Stevens, 1991; Lubke, Markram, Frotscher, & Sakmann, 1996). By including the coupling of a model neuron to our synapse system, we can investigate both the case of the role of an associated astrocyte with externally imposed temporal behavior and the case where the astrocyte dynamics is itself determined by feedback from this particular synapse. Finally, we should be clear that when we refer to one synapse, we are also dealing implicitly with the case of multiple self-synapses, all coupled to the same astrocytic domain, which in turn is exhibiting correlated dynamics in its processes connecting to these multiple sites. It is important to note that this same modulation can in fact correlate multiple synapses connecting distinct neurons coupled to the same astrocyte. The effect of this new multisynaptic coupling on the spatiotemporal flow of information in a model network will be described elsewhere. We focus on an excitatory neuron modeled with Morris-Lecar dynamics, as described in section 3. We add some external bias current so as to place the neuron in a state slightly beyond the saddle-node bifurcation, where it would spontaneously oscillate at a very low frequency in the absence of any synaptic input. We then assume that this neuron has a self-synapse (autapse). An excitatory self-synapse clearly has the possibility of causing a much higher spiking rate than would otherwise be the case; this behavior without any astrocyte influence is shown in Figure 3. The existence of autaptic neurons was originally demonstrated in cultured networks (Segal, 1991, 1994; Bekkers & Stevens, 1991) but has been detected in intact neocortex as well (Lubke et al., 1996). Importantly, these can be either inhibitory or excitatory. There has been some speculation regarding the role of autapses in memory (Seung et al., 2000), but this is not our concern here. 
Are such neurons observed experimentally? In Figure 4 we show a typical raster plot recorded from cultured neural network grown from a dissociated mixture of glial and neuronal cortical cells taken from 1-day-old Charles River rats (see section 2). The spontaneous activity of the network is marked by synchronized bursting events (SBEs)—short (several 100s of ms) periods during which most of the recorded neurons show relatively rapid firing separated by long (order of seconds) time intervals of sporadic neuronal firing of most of the neurons (Segev & Ben-Jacob, 2001). Only small fractions of special neurons (termed spiker neurons) exhibit rapid firing also during inter-SBEs intervals. These spiker neurons also exhibit much higher firing rates during the SBEs. But the behavior of these rapidly firing neurons does not match that expected of the simple autaptic oscillator. The major differences, as illustrated by comparing Figures 3 and 4, are (1) the existence of long interspike intervals for the spikers, marked by a long-tail (Levy) distribution of the increments of the interspike intervals, and (2) the beating or burstlike rate modulation in the temporal ordering of the spike train.

[Figure 3: panels (a) and (b) share a time axis (time [sec], 0 to 5); panel (c) plots the probability distribution against δ(ISI) [msec] on double-logarithmic axes.]

Figure 3: Activity of a model neuron containing the self-synapse (autapse), as modeled by the classical Tsodyks-Uziel-Markram model of synaptic transmission. In this case, it is possible to recover some of the features of cortical rapidly firing neurons, namely, the relatively high-frequency persistent activity. However, the resulting time series of action potentials for such a model neuron, shown in (a), is almost periodic. Due to the self-synapse, a periodic series of spikes results in the periodic pattern for the postsynaptic current, shown in (b), which closes the self-consistency loop by causing a model neuron to generate a periodic time series of spikes. A further difference between the model neuron and cortical rapidly firing neurons is seen upon comparing the corresponding distributions of ISI increments, plotted on a double-logarithmic scale. These distributions, shown in (c), disclose that, contrary to the cortical rapidly firing neurons, the increments distribution for the model neuron with TUM autapse (diamonds) is Gaussian (seen as a stretched parabola on a double-log scale), pointing to the existence of a characteristic timescale. On the other hand, distributions for cortical neurons (circles) decay algebraically and are much broader. The distribution of the model neuron has been vertically shifted for clarity of comparison.

[Figure 4: (a) raster plot, neuron # versus time [sec]; (b) zoomed raster plot, neuron # versus time [msec]; (c) probability distribution versus δ(ISI) [msec] on double-logarithmic axes.]

Figure 4: Electrical activity of in-vitro cortical networks. These cultured networks are spontaneously formed from a dissociated mixture of cortical neurons and glial cells drawn from 1-day-old Charles River rats. The cells are homogeneously spread over a lithographically specified area of Poly-D-Lysine for attachment to the recording electrodes. The activity of a network is marked by formation of synchronized bursting events (SBEs), short (∼100–400 msec) periods of time during which most of the recorded neurons are active. (a) Raster plot of recorded activity, showing a sample of a few SBEs. The time axis is divided into 10^{-1} s bins. Each row is a binary bar code representation of the activity of an individual neuron; the bars mark detection of spikes. Note that while the majority of the recorded neurons are firing rapidly mostly during SBEs, some neurons are marked by persistent intense activity (e.g., neuron no. 12). This property supports the notion that the activity of these neurons is autonomous and hence self-amplified. (b) A zoomed view of a sample synchronized bursting event. Note that each neuron has its own pattern of activity during the SBE. To assess the differences in activity between ordinary neurons and neurons that show intense firing between the SBEs, for each neuron we constructed the series of increments of interspike intervals (ISI), defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distributions of δ(i), shown in (c), disclose that the dynamics of ordinary neurons (squares) is similar to the dynamics of rapidly firing neurons (circles), up to the timescale of 100 msec, corresponding to the width of a typical SBE. Note that since increments of interspike intervals are analyzed, the increased rate of neurons firing does not necessarily affect the shape of the distribution. Yet above the characteristic time of 100 msec, the distributions diverge, possibly indicating the existence of additional mechanisms governing the activity of rapidly firing neurons on a longer timescale. Note that for normal neurons, there is another peak at typical interburst intervals (> seconds), not shown here.
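
The increment series δ(i) = ISI(i + 1) − ISI(i) defined in the caption above can be computed directly from a list of spike times. The following is a minimal sketch, not the analysis code used for the figures: the function names are hypothetical, and taking the absolute value of δ before logarithmic binning is our assumption about how a signed quantity is placed on a double-logarithmic axis.

```python
import numpy as np

def isi_increments(spike_times_ms):
    """Return delta(i) = ISI(i+1) - ISI(i) for a sorted sequence of spike times."""
    spike_times_ms = np.asarray(spike_times_ms, dtype=float)
    isi = np.diff(spike_times_ms)   # interspike intervals
    return np.diff(isi)             # increments of consecutive intervals

def log_binned_density(increments, n_bins=20):
    """Probability density of |delta| on logarithmically spaced bins.

    Using |delta| is an assumption made so the values fit a log axis.
    """
    mag = np.abs(np.asarray(increments, dtype=float))
    mag = mag[mag > 0]              # zeros cannot be placed on a log axis
    edges = np.geomspace(mag.min(), mag.max(), n_bins + 1)
    density, _ = np.histogram(mag, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    return centers, density

# Toy spike train (msec): intervals are 10, 15, 5, 30.
toy_spikes = [0.0, 10.0, 25.0, 30.0, 60.0]
toy_increments = isi_increments(toy_spikes)
toy_centers, toy_density = log_binned_density(toy_increments, n_bins=4)
```

Because only differences of consecutive intervals enter, a uniform change in firing rate shifts the intervals but largely cancels in δ(i), which is the property the caption relies on.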

Motivated by the above and the glial gatekeeping effect studied earlier, we proceed to test if an autaptic oscillator with a glial-regulated self-synapse will bring the model into better agreement. In Figure 5 we show that the activity of such a modified model does show the additional modulation. The basic mechanism results from the fact that after a period of rapid firing of the neuron, the astrocyte intracellular Ca2+ concentration (shown in Figure 5b) exceeds the critical threshold for time-delayed attenuation. This then stops the activity and gives rise to large interspike intervals. The distributions shown in Figure 5 are a much better match to experimental data for time intervals up to 100 msec.
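
The time-delayed attenuation described above can be illustrated with a toy gating variable. This is only a sketch under assumed forms: we take f to obey a first-order equation df/dt = −f/τ_f + κ(1 − f)Θ(c − c_th), where the threshold c_th, the calcium trace, the value κ = 0.5 sec^{-1}, and the multiplicative gate (1 − f) are all illustrative assumptions rather than the model's actual equations or parameters.

```python
import numpy as np

def gating_trace(calcium, dt, tau_f=4.0, kappa=0.5, c_th=0.2):
    """Forward-Euler integration of an assumed first-order gating equation:
    df/dt = -f/tau_f + kappa*(1 - f) while calcium exceeds c_th.

    tau_f (sec), kappa (1/sec) and c_th are hypothetical values.
    """
    f = np.zeros(len(calcium))
    for k in range(1, len(calcium)):
        above = calcium[k - 1] > c_th
        drive = kappa * (1.0 - f[k - 1]) if above else 0.0
        f[k] = f[k - 1] + dt * (-f[k - 1] / tau_f + drive)
    return f

dt = 0.01                                   # time step in seconds
t = np.arange(0.0, 30.0, dt)
# Hypothetical calcium trace: above threshold only between 5 s and 10 s.
calcium = np.where((t >= 5.0) & (t < 10.0), 0.5, 0.05)
f = gating_trace(calcium, dt)
gate = 1.0 - f                              # assumed multiplicative gate on evoked release
```

In this sketch f keeps a sizable value for several seconds after the calcium trace has dropped below threshold, so spikes arriving in that window would still be attenuated.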

5 Robustness Tests

5.1 Stochastic Li-Rinzel Model. One of the implicit assumptions of our model for astrocyte-synapse interaction is related to the deterministic nature of astrocyte calcium release. It is assumed that in the absence of any IP3 signals from the associated synapses, the astrocyte will stay “silent,” in the sense that there will be no spontaneous Ca2+ events. However, it should be kept in mind that the equations for the calcium channel dynamics used in the context of the Li-Rinzel model in fact describe the collective behavior of large numbers of channels. In reality, experimental evidence indicates that the calcium release channels in astrocytes are spatially organized in small clusters of 20 to 50 channels, the so-called microdomains. These microdomains were found to contain small membrane leaflets (of thickness O(10 nm)), wrapping around the synapses and potentially able to synchronize ensembles of synapses. This finding calls for a new view of astrocytes as cells with multiple functional and structural compartments. The microdomains (within the same astrocyte) have been observed to generate spontaneous Ca2+ signals. As the passage of the calcium ions through a single channel is subject to fluctuations, the stochastic aspects can become important for small clusters of channels. Inclusion of stochastic effects can explain the generation of calcium puffs: fast, localized elevations of calcium concentration. Hence, it is important to test the possible effect of stochastic calcium events on the model’s behavior. We achieve this goal by replacing the deterministic Li-Rinzel model with its stochastic version, obtained using the Langevin approximation, as has been recently described by Shuai and Jung (2003). With the Langevin approach, the equation for the fraction of open calcium channels is modified and takes the following form:

$$\frac{dq}{dt} = \alpha_q (1 - q) - \beta_q q + \xi(t), \tag{5.1}$$

[Figure 5: panels (a)–(d) share a time axis (time [sec], 0 to 20); panel (e) plots the probability distribution against δ(ISI) [msec] on double-logarithmic axes.]

Figure 5: The activity of a model neuron containing a glia-gated autapse. The equations of synaptic transmission for this case have been modified to take into account the influence of the synaptically associated astrocyte, as explained in the text. The resulting spike time series, shown in (a), deviates from periodicity due to the slow modulation of the synapse by the adjacent astrocyte. The relatively intense activity at the presynaptic terminal activates astrocyte receptors, which in turn leads to the production of IP3 and subsequent oscillations of free astrocyte Ca2+ concentration. The period of these oscillations, shown in (b), is much larger than the characteristic time between spikes arriving at the presynaptic terminal. Because Ca2+ dynamics is oscillatory, so also will be the dynamics of the gating function f, as is seen from (c), and the period of oscillations for f will follow the period of Ca2+ oscillations. The periodic behavior of f leads to slow periodic modulation of the PSC pattern (shown in (d)), which closes the self-consistency loop by causing a neuron to fire in a burstlike manner. Additional information is obtained after comparison of distributions for ISI increments, shown in (e). Contrary to results for the model neuron with a simple autapse (see Figure 3c), the distribution for a glia-gated autaptic model neuron (diamonds) now closely follows the distributions of two sample recorded cortical rapidly firing neurons (circles), up to the characteristic time of ∼100 msec, which corresponds to the width of a typical SBE. The heavy tails of the recorded distributions above this characteristic time indicate that network mechanisms are involved in shaping the form of the distribution on longer timescales.

[Figure 6: panels (a)–(d); horizontal axis: time [sec], 0 to 80.]

Figure 6: The dynamical behavior of an astrocyte-gated model autaptic neuron, including the stochastic release of calcium from ER of astrocyte. Shown are the results of the simulation when calcium release from intracellular stores is mediated by a cluster of N = 10 channels. The generic form of the spike time series (shown in (a)) does not differ from those obtained for the deterministic model. Namely, even for the stochastic model, the neuron is still firing in a burstlike manner. Although the temporal profile of astrocyte calcium (b) is irregular, the resulting dynamics of the gating function (c) is relatively smooth, stemming from the choice of the gating function dynamics (being an integration over the calcium profile). As a result, the PSC profile (d) does not differ much from the corresponding PSC profile obtained for the deterministic model.

in which the stochastic term, ξ(t), has the following properties:

$$\langle \xi(t) \rangle = 0, \tag{5.2}$$

$$\langle \xi(t)\,\xi(t') \rangle = \frac{\alpha_q (1 - q) + \beta_q q}{N}\,\delta(t - t'). \tag{5.3}$$

In the limit of very large cluster size, N → ∞, the effect of stochastic Ca2+ release is not significant. In contrast, the dynamics of calcium release are greatly modified for small cluster sizes. A typical spike-time series of a glia-gated autaptic neuron, obtained for the cluster size of N = 10 channels, is shown in Figure 6a. Note that while there appear considerable fluctuations in the concentration of astrocyte calcium (see Figure 6b), the dynamics of the gating function (see Figure 6c) is less irregular. This follows because our choice of the gating function corresponds to the integration of calcium events. We have also checked that the distribution of interspike intervals is practically unchanged (data not shown). All told, our results indicate that including the stochastic nature of the release of calcium from astrocyte ER does not affect the dynamics of our model autaptic neuron in any significant way.
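
The dependence of the fluctuations on cluster size can be seen in a small Euler-Maruyama integration of equation 5.1 with the noise statistics of equations 5.2 and 5.3. This is a sketch under simplifying assumptions: the rates α_q and β_q are held constant (in the full Li-Rinzel model they depend on IP3 and Ca2+), their values are hypothetical, and clipping q to [0, 1] is a safeguard of ours rather than part of the Langevin formulation.

```python
import numpy as np

def simulate_q(alpha, beta, n_channels, dt=1e-3, n_steps=20000, seed=0):
    """Euler-Maruyama integration of dq/dt = alpha*(1-q) - beta*q + xi(t),
    with <xi(t) xi(t')> = [alpha*(1-q) + beta*q] / N * delta(t - t').

    alpha and beta are constant here, which is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    q = np.empty(n_steps)
    q[0] = alpha / (alpha + beta)            # deterministic fixed point
    for k in range(1, n_steps):
        prev = q[k - 1]
        drift = alpha * (1.0 - prev) - beta * prev
        intensity = (alpha * (1.0 - prev) + beta * prev) / n_channels
        noise = np.sqrt(max(intensity, 0.0) * dt) * rng.standard_normal()
        # Clip so that q remains a valid fraction of open channels.
        q[k] = min(max(prev + drift * dt + noise, 0.0), 1.0)
    return q

q_small = simulate_q(alpha=1.0, beta=1.0, n_channels=10)      # small cluster
q_large = simulate_q(alpha=1.0, beta=1.0, n_channels=10000)   # near-deterministic limit
```

Since the noise intensity scales as 1/N, the standard deviation of q is expected to shrink roughly as 1/sqrt(N), which is why the large-cluster run stays close to the deterministic fixed point.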

[Figure 7: probability distribution versus δ(ISI) [msec] on double-logarithmic axes; legend: τ_f = 4 sec, κ = 5*10^-4; τ_f = 40 sec, κ = 1*10^-4.]

Figure 7: Distributions of interspike interval increments for the model of an astrocyte-gated autaptic neuron with slow dynamics of the gating function, as compared with the corresponding distribution for the deterministic Li-Rinzel model. Due to the slow dynamics of the gating function, the transitions between different phases of bursting are blurred, resulting in a weaker tail for the distribution of interspike interval increments.

5.2 The Correlation Time of the Gating Function. Another assumption made in our model concerns the dynamics of the gating function. We have assumed the simple first-order differential equation for the dynamics of our phenomenological gating function and have selected timescales that are believed to be consistent with the influence of astrocytes on synaptic terminals. However, because the exact nature of the underlying processes (and corresponding timescales) is unknown, it is important to test the robustness of the model to variations in the gating function dynamics. To do that, we altered the baseline dynamics of the gating function to have a slower characteristic decay time and a slower rate of accumulation; for example, we can set τ_f = 40 sec and κ = 0.1 sec^{-1}. Simulations show that the only effect is a slight blurring of the transition between different phases of the bursting, as would be expected. This can best be detected by looking at the distribution of interspike interval increments, for the case of slow gating dynamics. The distribution, shown in Figure 7, has a weaker tail as compared to the distribution obtained for the faster gating dynamics. This result follows because for slower gating, the modulation of the postsynaptic current is weaker. Hence, the transitions from intense firing

[Figure 8: panels (a)–(d); horizontal axis: time [sec], 0 to 90.]

Figure 8: The dynamical behavior of an astrocyte-gated model autaptic neuron with slowly oscillating background current. Shown are the results of the simulation when I_base ∝ sin(2πt/T), T = 10 sec. The mean level of I_base is set so as to put a neuron in the quiescent phase for half a period. The resulting spike time series (a) disclose the burstlike firing of a neuron, with the superimposed oscillatory dynamics of a background current. The variations in the concentration of astrocyte calcium (b) are much more temporally localized, and so is the resulting dynamics of the gating function (c). Consequently, the PSC profile (d) strongly reflects the burstlike synaptic transmission efficacy, thus forcing the neuron to fire in a burstlike manner and closing the self-consistency loop.

to low-frequency spiking are less abrupt, resulting in a relatively low proportion of large increments. It is worth remembering that large increments of interspike intervals reflect sudden changes in dynamics, which are eliminated by the blurring. Clearly, the model with fast gating does a better job in fitting the spiker data.

5.3 Time-Dependent Background Current. All of the main results were obtained under the assumption of constant background current feeding into the neuronal soma, such that when acting alone, this current forces the model neuron to fire at a very low frequency. One may justly argue that there is no such thing as constant current. Indeed, if a background current is to reflect biological reality, then it should possess some dynamics. For example, a better match would be to imagine the background current to be associated with the activity in adjacent astrocytes (see, e.g., Angulo, Kozlov, Charpak, & Audinat, 2004). To test this, we simulated a glia-gated autaptic neuron subject to a slowly oscillating (T = 10 sec) background current. For this case, we found that the behavior of the model is generically the same. Yet now the transitions between the bursting phases are sharper (see Figure 8a). This in turn leads to the sharper modulation of postsynaptic currents (shown in Figure 8d). We can confirm this by noting that the distribution of interspike interval increments has a slightly heavier tail, as compared to the distribution obtained for
the case of constant background current (data not shown). On the other hand, replacing the constant current with the oscillating one introduces a typical frequency not seen in the actual spiker data. This artificial problem will presumably disappear when the background current is determined self-consistently as part of the overall network activity. Similarly, the key to extending the increments distributions to longer timescales seems to be getting the network feedback to the spikers to regulate the interburst timing, which at the moment is too regular. This will be presented in a future publication.

6 Discussion

In this article, we have proposed that the regulation of synaptic transmission by astrocytic calcium dynamics is a critical new component of neural circuitry. We have used existing biophysical experiments to construct a coupled synapse-astrocyte model to illustrate this regulation and explore its consequences for an autaptic oscillator, arguably the most elementary neural circuit. Our results can be compared to data taken from cultured neuron networks. This comparison reveals that the glial gatekeeping effect appears to be necessary for an understanding of the interspike interval distribution of observed rapidly firing spiker neurons, for timescales up to about 100 msec. Of course, many aspects of our modeling are quite simplified as compared to the underlying biophysics. We have investigated the sensitivity of our results to the modification of some of the parameters of our model as well as the addition of more complex dynamics for the various parts of our system. Our results with regard to the interspike interval are exceedingly robust. This work should be viewed as a step toward understanding the full dynamical consequences brought about by the strong reciprocal couplings between synapses and the glial processes that envelop them. We have focused on the fact that astrocytic emissions shut down synaptic transmission when the activity becomes too high.
This mechanism appears to be a necessary part of the regulation of spiker activity; without it, spikers would fire too often, too regularly. Related work by S. Nadkarni and P. Jung (private communication, July 2005) focuses on a different aspect: that of increased fidelity of synaptic release (for otherwise highly stochastic synapses) due to glia-mediated increases in presynaptic calcium levels. As our working assumption is that the spikers are most likely to be neurons with “faithful” autapses, this effect does not play a role in our attempt to compare to the experimental data. It will of course be necessary to combine these two different pieces to obtain a more complete picture. The application to spikers is just one way in which our new synaptic dynamics may alter our thinking about neural circuits. This particular application is appealing and informative but must at the moment be

considered an untested hypothesis. Future experimental work must test the assumption that spikers have significant excitatory autaptic coupling, that pharmacological blockage of the synaptic current reverts their firing to low-frequency, almost periodic patterns, and that cutting the feedback loop with the enveloping astrocyte eliminates the heavy-tail increment distribution. Work toward achieving these tests is ongoing. In the experimental system, a purported autaptic neuron is part of an active network and would therefore receive input currents from the other neurons in the network. This more complex input would clearly alter the very-long-time interspike interval distribution, especially given the existence of a new interburst timescale in the problem. Similarly, the current approach of adding a constant background current to the neuron is not realistic; the actual background current, due to such processes as glial-generated currents in the cell soma, would again alter the long-time distribution. Preliminary tests have shown that these effects could extend the range of agreement between autaptic oscillator statistics and experimental measurements. Just as the network provides additional input for the spiker, the spiker provides part of the stimulation that leads to the bursting dynamics. Future work will endeavor to create a fully self-consistent network model to explore the overall activity patterns of this system. One issue that needs investigation concerns the role that glia might have in coordinating the action of neighboring synapses. It is well known that a single astrocytic process might contact thousands of synapses; if the calcium excitation spreads from being a local increase in a specific terminus to being a more widespread phenomenon within the glial cell body, neighboring synapses can become dynamically coupled. The role of this extra complexity in shaping the burst structure and time sequence is as yet unknown.

Acknowledgments

We thank Gerald M. Edelman for insightful conversation about the possible role of glia. Eugene Izhikevich, Peter Jung, Suhita Nadkarni, and Itay Baruchi are acknowledged for useful comments and for the critical reading of an earlier version of this manuscript. V. V. thanks the Center for Theoretical Biological Physics for hospitality. This work has been supported in part by the NSF-sponsored Center for Theoretical Biological Physics (grant numbers PHY-0216576 and PHY-0225630) and by the Maguy-Glass Chair in Physics of Complex Systems.

References

Angulo, M., Kozlov, A., Charpak, S., & Audinat, E. (2004). Glutamate released from glial cells synchronizes neuronal activity in the hippocampus. J. Neurosci., 24(31), 6920–6927.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998a). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. Eur. J. Neurosci., 10(6), 2129–2142.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998b). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. J. Neurosci., 18(17), 6822–6829.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1999). Tripartite synapses: Glia, the unacknowledged partner. Trends in Neurosci., 22(5), 208–215.
Bekkers, J., & Stevens, C. (1991). Excitatory and inhibitory autaptic currents in isolated hippocampal neurons maintained in cell culture. Proc. Nat. Acad. Sci., 88, 7834–7838.
Charles, A., Merrill, J., Dirksen, E., & Sanderson, M. (1991). Inter-cellular signaling in glial cells: Calcium waves and oscillations in response to mechanical stimulation and glutamate. Neuron, 6, 983–992.
Cornell-Bell, A., & Finkbeiner, S. (1991). Ca2+ waves in astrocytes. Cell Calcium, 12, 185–204.
Cornell-Bell, A., Finkbeiner, S., Cooper, M., & Smith, S. (1990). Glutamate induces calcium waves in cultured astrocytes: Long-range glial signaling. Science, 247, 470–473.
Gandhi, A., & Stevens, C. (2003). Three modes of synaptic vesicular release revealed by single-vesicle imaging. Nature, 423, 607–613.
Haydon, P. (2001). Glia: Listening and talking to the synapse. Nat. Rev. Neurosci., 2(3), 185–193.
Hofer, T., Venance, L., & Giaume, C. (2003). Control and plasticity of inter-cellular calcium waves in astrocytes. J. Neurosci., 22, 4850–4859.
Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detection and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640.
Kang, J., Jiang, L., Goldman, S., & Nedergaard, M. (1998). Astrocyte-mediated potentiation of inhibitory synaptic transmission. Nat. Neurosci., 1, 683–692.
Li, Y., & Rinzel, J. (1994). Equations for inositol-triphosphate receptor-mediated calcium oscillations derived from a detailed kinetic model: A Hodgkin-Huxley like formalism. J. Theor. Biol., 166, 461–473.
Lubke, J., Markram, H., Frotscher, M., & Sakmann, B. (1996). Frequency and dendritic distributions of autapses established by layer-5 pyramidal neurons in developing rat cortex. J. Neurosci., 16, 3209–3218.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nadkarni, S., & Jung, P. (2004). Spontaneous oscillations of dressed neurons: A new mechanism for epilepsy? Phys. Rev. Lett., 91(26).
Newman, E. (2003). New roles for astrocytes: Regulation of synaptic transmission. Trends in Neurosci., 26(10), 536–542.
Otsu, Y., Shahrezaei, V., Li, B., Raymond, L., Delaney, K., & Murphy, T. (2004). Competition between phasic and asynchronous release for recovered synaptic vesicles at developing hippocampal autaptic synapses. J. Neurosci., 24(2), 420–433.
Perea, G., & Araque, A. (2002). Communication between astrocytes and neurons: A complex language. J. Physiol., 96, 199–207.
Perea, G., & Araque, A. (2005). Properties of synaptically evoked astrocyte calcium signals reveal synaptic information processing by astrocytes. J. Neurosci., 25, 2192–2203.
Porter, J., & McCarthy, K. (1996). Hippocampal astrocytes in situ respond to glutamate released from synaptic terminals. J. Neurosci., 16(16), 5073–5081.
Segal, M. (1991). Epileptiform activity in microcultures containing one excitatory hippocampal neuron. J. Neuroanat., 65, 761–770.
Segal, M. (1994). Endogenous bursts underlie seizurelike activity in solitary excitatory hippocampal neurons in microculture. J. Neurophysiol., 72, 1874–1884.
Segev, R., & Ben-Jacob, E. (2001). Spontaneous synchronized bursting activity in 2D neural networks. Physica A, 302, 64–69.
Segev, R., Benveniste, M., Hulata, E., Cohen, N., Paleski, A., Kapon, E., Shapira, Y., & Ben-Jacob, E. (2002). Long term behavior of lithographically prepared in-vitro neural networks. Phys. Rev. Lett., 88, 118102.
Seung, H., Lee, D., Reis, B., & Tank, D. (2000). The autapse: A simple illustration of short-term analog memory storage by tuned synaptic feedback. J. Comp. Neurosci., 9, 171–185.
Shuai, J., & Jung, P. (2003). Langevin modeling of intra-cellular calcium dynamics. In M. Falcke & D. Malchow (Eds.), Understanding calcium dynamics: Experiments and theory (pp. 231–252). Berlin: Springer.
Sneyd, J., Wetton, B., Charles, A., & Sanderson, M. (1995). Intercellular calcium waves mediated by diffusion of inositol triphosphate: A two-dimensional model. Am. J. Physiology, 268, C1537–C1545.
Takano, T., Tian, G., Peng, W., Lou, N., Libionka, W., Han, X., & Nedergaard, M. (2006). Astrocyte-mediated control of cerebral blood flow. Nat. Neurosci., 9(2), 260–267.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
Volterra, A., & Meldolesi, J. (2005). Astrocytes, from brain glue to communication elements: The revolution continues. Nat. Neurosci., 6, 626–640.
Zhang, Q., Pangrsic, T., Kreft, M., Krzan, M., Li, N., Sul, J., Halassa, M., van Bockstaele, E., Zorec, R., & Haydon, P. (2004). Fusion-related release of glutamate from astrocytes. J. Biol. Chem., 279, 12724–12733.

Received January 6, 2006; accepted May 25, 2006.

LETTER

Communicated by Gert Cauwenberghs

Thermodynamically Equivalent Silicon Models of Voltage-Dependent Ion Channels

Kai M. Hynna
Kwabena Boahen
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.

We model ion channels in silicon by exploiting similarities between the thermodynamic principles that govern ion channels and those that govern transistors. Using just eight transistors, we replicate, for the first time in silicon, the sigmoidal voltage dependence of activation (or inactivation) and the bell-shaped voltage dependence of its time constant. We derive equations describing the dynamics of our silicon analog and explore its flexibility by varying its parameters. In addition, we validate the design by implementing a channel with a single activation variable. The design’s compactness allows tens of thousands of copies to be built on a single chip, facilitating the study of biologically realistic models of neural computation at the network level in silicon.

1 Neural Models

A key computational component within the neurons of the brain is the ion channel. These channels display a wide range of voltage-dependent responses (Llinas, 1988). Some channels open as the cell depolarizes; others open as the cell hyperpolarizes; a third kind exhibits transient dynamics, opening and then closing in response to changes in the membrane voltage. The voltage dependence of the channel plays a functional role. Cells in the gerbil medial superior olivary complex possess a potassium channel that activates on depolarization and helps in phase-locking the response to incoming auditory information (Svirskis, Kotak, Sanes, & Rinzel, 2002). Thalamic cells possess a hyperpolarization-activated cation current that contributes to rhythmic bursting in thalamic neurons during periods of sleep by depolarizing the cell from hyperpolarized levels (McCormick & Pape, 1990). More ubiquitously, action potential generation is the result of voltage-dependent dynamics of a sodium channel and a delayed rectifier potassium channel (Hodgkin & Huxley, 1952).

Researchers use a variety of techniques to study voltage-dependent channels, each possessing distinct advantages and disadvantages.
Neural Computation 19, 327–350 (2007)

© 2007 Massachusetts Institute of Technology


Neurobiologists perform experiments on real cells, both in vivo and in vitro; however, while working with real cells eliminates the need for justifying assumptions within a model, limitations in technology restrict recordings to tens of neurons. Computational neuroscientists create computer models of real cells, using simulations to test the function of a channel within a single cell or a network of cells. While simulations provide a great deal of flexibility, computational neuroscientists are often limited by the processing power of the computer and must constantly balance the complexity of the model with practical simulation times. For instance, a Sun Fire 480R takes 20 minutes to simulate 1 second of the 4000-neuron network (M. Shelley, personal communication, 2004) of Tao, Shelley, McLaughlin, and Shapley (2004), just 1 mm² (a single hypercolumn) of a sublayer (4Cα) of the primary visual cortex (V1).

An emerging medium for modeling neural circuits is the silicon chip, a technique at the heart of neuromorphic engineering. To model the brain, neuromorphic engineers use the transistor’s physical properties to create silicon analogs of neural circuits. Rather than build abstractions of neural processing, which make gross simplifications of brain function, the neuromorphic engineer designs circuit components, such as ion channels, from which silicon neurons are built. Silicon is an attractive medium because a single chip can have thousands of heterogeneous silicon neurons that operate in real time. Thus, network phenomena can be studied without waiting hours, or days, for a simulation to run. To date, however, neuromorphic models have not captured the voltage dependence of the ion channel’s temporal dynamics, a problem outstanding since 1991, the year Mahowald and Douglas (1991) published their seminal work on silicon neurons.
Neuromorphic circuits are constrained by surface area on the silicon die, limiting their complexity, as more complex circuits translate to fewer silicon neurons on a chip. In the face of this trade-off, previous attempts at designing neuromorphic models of voltage-dependent ion channels (Mahowald & Douglas, 1991; Simoni, Cymbalyuk, Sorensen, Calabrese, & DeWeerth, 2004) sacrificed the time constant’s voltage dependence, keeping the time constant fixed. In some cases, however, this nonlinear property is critical. An example is the low-threshold calcium channel in the thalamus’s relay neurons. The time constant for inactivation can vary over an order of magnitude depending on the membrane voltage. This variation defines the relative lengths of the interburst interval (long) and the burst duration (short) when the cell bursts rhythmically. Sodium channels involved in spike generation also possess a voltage-dependent inactivation time constant that varies from a peak of approximately 8 ms just below spike threshold to as fast as 1 ms at the peak of the voltage spike (Hodgkin & Huxley, 1952). Ignoring this variation by fixing the time constant alters the dynamics that shape the action potential. For example, lowering the maximum (peak) time constant would reduce the size of the voltage spike due to faster inactivation of sodium channels below threshold. Inactivation of these channels is a factor in the failure of action potential propagation in Purkinje cells (Monsivais, Clark, Roth, & Hausser, 2005). On the other hand, increasing the minimum time constant at more depolarized levels (that is, near the peak voltage of the spike) would increase the width of the action potential, as cell repolarization begins once the potassium channels overcome the inactivating sodium channel. A wider spike could influence the cell’s behavior through several mechanisms, such as those triggered by increased calcium entry through voltage-dependent calcium channels.

In this letter, we present a compact circuit that models the nonlinear dynamics of the ion channel’s gating particles. Our circuit is based on linear thermodynamic models of ion channels (Destexhe & Huguenard, 2000), which apply thermodynamic considerations to the gating particle’s movement, due to conformation of the ion channel protein in an electric field. Similar considerations of the transistor make clear that both the ion channel and the transistor operate under similar principles. This observation, originally recognized by Carver Mead (1989), allows us to implement the voltage dependence of the ion channel’s temporal dynamics, while at the same time using fewer transistors than previous neuromorphic models that do not possess these nonlinear dynamics. With a more compact design, we can incorporate a larger number of silicon neurons on a chip without sacrificing biological realism.

The next section provides a brief tutorial on the similarities between the underlying physics of thermodynamic models and that of transistors. In section 3, we derive a circuit that captures the gating particle’s dynamics.
In section 4, we derive the equations defining the dynamics of our variable circuit, and section 5 describes the implementation of an ion channel population with a single activation variable. Finally, we discuss the ramifications of this design in the conclusion.

2 Ion Channels and Transistors

Thermodynamic models of ion channels are founded on Hodgkin and Huxley’s (empirical) model of the ion channel. A channel model consists of a series of independent gating particles whose binary state (open or closed) determines the channel permeability. A Hodgkin-Huxley (HH) variable represents the probability of a particle being in the open state or, with respect to the channel population, the fraction of gating particles that are open. The kinetics of the variable are simply described by

\[
(1-u) \;\underset{\beta(V)}{\overset{\alpha(V)}{\rightleftharpoons}}\; u, \tag{2.1}
\]

where α(V) and β(V) define the voltage-dependent transition rates between the states (indicated by the arrows), u is the HH variable, and (1 − u) represents the closed fraction. α(V) and β(V) define the voltage-dependent dynamics of the two-state gating particle.

At steady state, the total number of gating particles that are opening (that is, the opening flux, which depends on the number of channels closed and the opening rate) is balanced by the total number of gating particles that are closing (the closing flux). Increasing one of the transition rates, through, for example, a shift in the membrane voltage, will increase the respective flow of particles changing state; the system will find a new steady state with new open and closed fractions such that the fluxes again cancel each other out. We can describe these dynamics simply using a differential equation:

\[
\frac{du}{dt} = \alpha(V)\,(1-u) - \beta(V)\,u. \tag{2.2}
\]
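As a minimal numerical sketch of equation 2.2 (not part of the original letter), the Python fragment below integrates the kinetics at a single clamped voltage with forward Euler steps. The rate values, the step size, and the function name `simulate_gate` are illustrative assumptions; the only property taken from the text is that the open fraction settles where the opening and closing fluxes balance.

```python
def simulate_gate(alpha, beta, u0=0.0, dt=1e-3, t_end=5.0):
    """Forward-Euler integration of du/dt = alpha*(1 - u) - beta*u."""
    u = u0
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        opening_flux = alpha * (1.0 - u)  # closed particles that open
        closing_flux = beta * u           # open particles that close
        u += dt * (opening_flux - closing_flux)
    return u


# Hypothetical transition rates (per ms) at one clamped membrane voltage.
alpha, beta = 3.0, 1.0
u_final = simulate_gate(alpha, beta)

# Flux balance alpha*(1 - u) = beta*u predicts the settled open fraction.
u_balance = alpha / (alpha + beta)
print(u_final, u_balance)
```

With these assumed rates the simulated open fraction should approach the flux-balance value of 0.75 regardless of the starting point `u0`.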

The first term represents the opening flux, the product of the opening transition rate α(V) and the fraction of particles closed (1 − u). The second term represents the closing flux. Depending on which flux is larger (opening or closing), the fraction of open channels u will increase or decrease accordingly. Equation 2.2 is often expressed in the following form:

\[
\frac{du}{dt} = -\frac{1}{\tau_u(V)}\,\bigl(u - u_\infty(V)\bigr), \tag{2.3}
\]

where

\[
u_\infty(V) = \frac{\alpha(V)}{\alpha(V) + \beta(V)}, \tag{2.4}
\]

\[
\tau_u(V) = \frac{1}{\alpha(V) + \beta(V)}, \tag{2.5}
\]

represent the steady-state level and time constant (respectively) for u. This form is much more intuitive to use, as it describes, for a given membrane voltage, where u will settle and how fast. In addition, these quantities are much easier for neuroscientists to extract from real cells through voltage-clamp experiments. We will come back to the form of equation 2.3 in section 4. For now, we will focus on the dynamics of the gating particle in terms of transition rates.

Figure 1: Energy diagram of a reaction. The transition rates between two states are dependent on the heights of the energy barriers (ΔGC and ΔGO), the differences in energy between the activated state (G∗) and the initial states (GC or GO). Thus, the time constant depends on the height of the energy barriers, and the steady state depends on the difference in energy between the closed and open states (ΔG).

In thermodynamic models, state changes of a gating particle are related to changes in the conformation of the ion channel protein (Hill & Chen, 1972; Destexhe & Huguenard, 2000, 2001). Each state possesses a certain energy (see Figure 1), dependent on the interactions of the protein molecule with the electric field across the membrane. For a state transition to occur, the molecule must overcome an energy barrier (see Figure 1), defined as the difference in energy between the initial state and an intermediate activated state. The size of the barrier controls the rate of transition between states (Hille, 1992):

\[
\alpha(V) = \alpha_0\, e^{-\Delta G_C(V)/RT} \tag{2.6}
\]

\[
\beta(V) = \beta_0\, e^{-\Delta G_O(V)/RT}, \tag{2.7}
\]

where α0 and β0 are constants representing base transition rates (at zero barrier height), ΔGC(V) and ΔGO(V) are the voltage-dependent energy barriers, R is the gas constant, and T is the temperature in Kelvin. Changes in the membrane voltage of the cell, and thus the electric field across the membrane, influence the energies of the protein’s conformations differently, changing the sizes of the barriers and altering the transition rates between states. Increasing a barrier decreases the respective transition rate, slowing the dynamics, since fewer proteins will have sufficient energy. The steady state depends on the energy difference between the two states. For a difference of zero and equivalent base transition rates, particles are


equally distributed between the two states. Otherwise, the state with lower energy is the preferred one.

The voltage dependence of an energy barrier has many components, both linear and nonlinear. Linear thermodynamic models, as the name implies, assume that the linear voltage dependence dominates. This dependence may be produced by the movement of a monopole or dipole through an electric field (Hill & Chen, 1972; Stevens, 1978). In this situation, the above rate equations simplify to

\[
\alpha(V) = A\, e^{-b_1 (V - V_H)/RT} \tag{2.8}
\]

\[
\beta(V) = A\, e^{-b_2 (V - V_H)/RT}, \tag{2.9}
\]

where VH and A represent the half-activation voltage and rate, while b1 and b2 define the linear relationship between each barrier and the membrane voltage. The magnitude of the linear term depends on such factors as the net movement of charge or net change in charge due to the conformation of the channel protein. Thus, ion channels use structural differences to define different membrane voltage dependencies.

While linear thermodynamic models have simple governing equations, they possess a significant flaw: time constants can reach extremely small values at voltages where either α(V) or β(V) becomes large (see equation 2.5), which is unrealistic since it does not occur in biology. Adding nonlinear terms in the energy expansion of α(V) and β(V) can counter this effect (Destexhe & Huguenard, 2000). Other solutions involve either saturating the transition rate (Willms, Baro, Harris-Warrick, & Guckenheimer, 1999) or using a three-state model (Destexhe & Huguenard, 2001), where the forward and reverse transition rates between two of the states are fixed, effectively setting the maximum transition rate.

Linear models, however, bear the closest resemblance to the MOS transistor, which operates under similar thermodynamic principles. Short for metal oxide semiconductor, the MOS transistor is named for its structure: a metallic gate (today, a polysilicon gate) atop a thin oxide, which insulates the gate from a semiconductor channel. The channel, part of the body or substrate of the transistor, lies between two heavily doped regions called the source and the drain (see Figure 2). There are two types of MOS transistors: negative or n-type (NMOS) and positive or p-type (PMOS). NMOS transistors possess a drain and a source that are negatively doped, areas where the charge carriers are negatively charged electrons. These two areas exist within a p-type substrate, a positively doped area, where the charge carriers are positively charged holes. A PMOS transistor consists of a p-type source and drain within an n-type well. While the rest of this discussion focuses on NMOS transistor operation, the same principles apply to PMOS transistors, except that the charge carrier is of the opposite sign.

Figure 2: MOS transistor. (a) Cross-section of an n-type MOS transistor. The transistor has four terminals: source (S), drain (D), gate (G), and bulk (B), sometimes referred to as the back-gate. (b) Symbols for the two transistor types: NMOS (left) and PMOS (right). The transistor is a symmetric device, and thus the direction of its current (by convention, the flow of positive charges) indicates the drain and the source. In an NMOS, current flows from drain to source, as indicated by the arrow. Conversely, current flows from source to drain in a PMOS.

In the subthreshold regime, charge flows across the channel by diffusion from the source end of the channel, where the density is high, to the drain, where the density is low. Governed by the same laws of thermodynamics that govern protein conformations, the density of charge carriers at the source and drain ends of the channel depends exponentially on the size of the energy barriers there (see Figure 3). These energy barriers exist due to a built-in potential difference, and thus an electric field, between the channel and the source or the drain. Adjusting the voltage at the source, or the drain, changes the charge carriers’ energy level. For the NMOS transistor’s negatively charged electrons, increasing the source voltage decreases the energy level; hence, the barrier height increases. This decreases the charge density at that end of the channel, as fewer electrons have the energy required to overcome the barrier. The voltage applied to the gate, which influences the potential at the surface of the channel, has the opposite effect: increasing it (e.g., from VG to VG1 in Figure 3) decreases the barrier height at both ends of the channel. Factoring in the exponential charge density dependence on barrier height yields the relationship between an NMOS transistor’s channel current and its terminal voltages (Mead, 1989):

\[
I_{ds} = I_{ds0}\left( e^{(\kappa V_{GB} - V_{SB})/U_T} - e^{(\kappa V_{GB} - V_{DB})/U_T} \right), \tag{2.10}
\]

where κ describes the relationship between the gate voltage and the potential at the channel surface. UT is called the thermal voltage (25.4 mV at room temperature), and Ids0 is the baseline diffusion current, defined by the barrier introduced when the oppositely doped regions (p-type and n-type) were fabricated. Note that, for clarity, UT will not appear in the

Figure 3: Energy diagram of a transistor. The vertical dimension represents the energy of negative charge carriers (electrons) within an NMOS transistor, while the horizontal dimension represents location within the transistor. φS and φD are the energy barriers faced by electrons attempting to enter the channel from the source and drain, respectively. VD, VS, and VG are the terminal voltages, designated by their subscripts. During transistor operation, VD > VS, and thus φS < φD. VG1 represents another scenario with a higher gate voltage. (Adapted from Mead, 1989.)

remaining transistor current equations, as all transistor voltages from here on are given in units of UT. When VDB exceeds VSB by 4 UT or more, the drain term becomes negligible and is ignored; the transistor is then said to be in saturation.

The similarities in the underlying physics of ion channels and transistors allow us to use transistors as thermodynamic isomorphs of ion channels. In both, there is a linear relationship between the energy barrier and the controlling voltage. For the ion channel, either isolated charges or dipoles of the channel protein have to overcome the electric field created by the voltage across the membrane. For the transistor, electrons, or holes, have to overcome the electric field created by the voltage difference between the source, or drain, and the transistor channel. In both instances, the transport of charge across the energy barrier is governed by a Boltzmann distribution, which results in an exponential voltage dependence. In the next section, we use these similarities to design an efficient transistor representation of the gating particle dynamics.

3 Variable Circuit

Based on the discussion from the previous section, it is tempting to think we may be able to use a single transistor to model the gating dynamics of a channel particle completely. However, obtaining the transition rates solves only part of the problem. We still need to multiply the rates with the number of gating particles in each state to obtain the opening and closing fluxes, and then integrate the flux difference to update the particle counts

Figure 4: Channel variable circuit. (a) The voltage uV represents the logarithm of the channel variable u. VO and VC are linearly related to the membrane voltage, with slopes of opposite sign. uH and uL are adjustable bias voltages. (b) Two transistors (N3 and N4) are added to saturate the variable’s opening and closing rates; the bias voltages uτH and uτL set the saturation level.

(see equation 2.2). A capacitor can perform the integration if we use charge to represent particle count and current to represent flux. The voltage on the capacitor, which is linearly proportional to its charge, yields the result.

The first sign of trouble appears when we attempt to connect a capacitor to the transistor’s source (or drain) terminal to perform the integration. As the capacitor integrates the current, the voltage changes, and hence the transistor’s barrier height changes. Thus, the barrier height depends on the particle count, which is not the case in biology; gating particles do not (directly) affect the barrier height when they switch state. Our only remaining option, the gate voltage, is unsuitable for defining the barrier height, as it influences the barrier at both ends of the channel identically. α(V) and β(V), however, demonstrate opposite dependencies on the membrane voltage; that is, one increases while the other decreases.

We can resolve this conundrum by connecting two transistors to a single capacitor (see Figure 4a). Each transistor defines an energy barrier for one of the transition rates: transistor N1 uses its source and gate voltages (uL and VC, respectively) to define the closing rate, and transistor N2 uses its drain and gate voltages (uH and VO) to define the opening rate (where uH > uL). We integrate the difference in transistor currents on the capacitor Cu to update the particle count. Notice that neither barrier


depends on the capacitor voltage uV. Thus, uV becomes representative of the fraction of open channels; it increases as particles switch to the open state.

How do we compute the fluxes from the transition rates? If uV directly represented the particle count, we would take the product of uV and the transition rates. However, we can avoid multiplying altogether if uV represents the logarithm of the open fraction rather than the open fraction itself. uV’s dynamics are described by the differential equation

\[
\begin{aligned}
C_u \frac{du_V}{dt} &= I_{ds0}\, e^{\kappa V_O}\left(e^{-u_V} - e^{-u_H}\right) - I_{ds0}\, e^{\kappa V_C}\, e^{-u_L} \\
&= I_{ds0}\, e^{\kappa V_O - u_H}\left(e^{-(u_V - u_H)} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L},
\end{aligned} \tag{3.1}
\]

where Ids0 and κ are transistor parameters (defined in equation 2.10), and VO, VC, uH, uL, and uV are voltages (defined in Figure 4a). We assume N1 remains in saturation during the channel’s operation; that is, uV > uL + 4 UT, making the drain voltage’s influence negligible.

The analogies between equation 3.1 and equation 2.2 become clear when we divide the latter by u. Our barriers (N1’s source-gate for the closing rate and N2’s drain-gate for the opening rate) correspond to β(V) and α(V), respectively, while exp(−(uV − uH)) corresponds to 1/u. Thus, our circuit computes (and integrates) the net flux divided by u, the open fraction. Fortuitously, the net flux scaled by the open fraction is exactly what we need to update the fraction’s logarithm, since d log(u)/dt = (du/dt)/u. Indeed, substituting uV = log u + uH, our log-domain representation of u, into equation 3.1 yields

\[
\begin{aligned}
\frac{Q_u}{u}\frac{du}{dt} &= I_{ds0}\, e^{\kappa V_O - u_H}\left(\frac{1}{u} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L} \\
\Rightarrow\quad \frac{du}{dt} &= \frac{I_{ds0}}{Q_u}\, e^{\kappa V_O - u_H}\,(1 - u) - \frac{I_{ds0}}{Q_u}\, e^{\kappa V_C - u_L}\, u,
\end{aligned} \tag{3.2}
\]

where Qu = Cu UT . If we design VC and VO to be functions of the membrane voltage V, equation 3.2 becomes directly analogous to equation 2.2. In linear thermodynamic models, the transition rates depend exponentially on the membrane voltage. We can realize this by designing VC and VO to be linear functions of V, albeit with slopes of opposite sign. The opposite slopes ensure that as the membrane voltage shifts in one direction, the opening and closing rates will change in opposite directions relative to each other. Thus far in our circuit design, we have not specified whether the variable activates or inactivates as the membrane voltage increases. Recall that for activation, the gating particle opens as the cell depolarizes, whereas for


inactivation, the gating particle opens as the cell hyperpolarizes. In our circuit, whether activation or inactivation occurs depends on how we define VO and VC with respect to V. Increasing VO, and decreasing VC, with V defines an activation variable, as at depolarized levels this results in α(V) > β(V). Conversely, increasing VC, and decreasing VO, with V defines an inactivation variable, as now at depolarized voltages, β(V) > α(V), and the variable will equilibrate in a closed state.

Naturally, our circuit has the same limitation that all two-state linear thermodynamic models have: its time constant approaches zero when either VO or VC grows large, as the transition rates α(V) and β(V) become unrealistically large. This shortcoming is easily rectified by imposing an upper limit on the transition rates, as has been done for other thermodynamic models (Willms et al., 1999). We realize this saturation by placing two additional transistors in series with the original two (see Figure 4b). With these transistors, the transition rates become

\[
\alpha(V) = \frac{I_{ds0}}{Q_u}\, \frac{e^{\kappa V_O - u_H}}{1 + e^{\kappa (V_O - u_{\tau H})}} \tag{3.3}
\]

\[
\beta(V) = \frac{I_{ds0}}{Q_u}\, \frac{e^{\kappa V_C - u_L}}{1 + e^{\kappa (V_C - u_{\tau L})}}, \tag{3.4}
\]

where the single exponentials in equation 3.2 are now scaled by an additional exponential term. The voltages uτH and uτL (fixed biases) set the maximum transition rate for opening and closing, respectively. That is, when VO < uτH − 4 UT, α(V) ∝ exp(κVO), a rate function exponentially dependent on the membrane voltage. But when VO > uτH + 4 UT, α(V) ∝ exp(κuτH), limiting the transition rate and fixing the minimum time constant for channel opening. The behavior of β(V) is similarly defined by VC’s value relative to uτL.

In the following section, we explore how the channel variable computed by our circuit changes with the membrane voltage and how quickly it approaches steady state. To do so, we must relate the steady state and time constant to the opening and closing rates and specify how the circuit’s opening and closing voltages depend on the membrane voltage.

4 Circuit Operation

To help understand the operation of the channel circuit and the influence of various circuit parameters, we will derive u∞(V) and τu(V) for the circuit in Figure 4b using equations 2.4 and 2.5 and the transistor rate equations (equations 3.3 and 3.4), limiting our presentation to the activation version. The derivation, and the influence of various parameters, is similar for the inactivation version.
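Before the derivation, the two regimes of the saturating rate equations 3.3 and 3.4 can be checked numerically. The sketch below is an illustration only: the bias values, the unit prefactor standing in for Ids0/Qu, and the function name `saturating_rate` are assumptions, and all voltages are expressed in units of UT as in the text.

```python
import math

KAPPA = 0.7  # assumed gate coupling coefficient (the letter quotes kappa of about 0.7)


def saturating_rate(v_gate, u_bias, u_tau, kappa=KAPPA, scale=1.0):
    """Transition rate with the functional form of equations 3.3 and 3.4.

    v_gate : opening or closing voltage (VO or VC), in units of U_T
    u_bias : uH (opening) or uL (closing), in units of U_T
    u_tau  : saturation bias (u_tauH or u_tauL), in units of U_T
    scale  : stands in for Ids0/Qu (hypothetical value of 1)
    """
    numerator = math.exp(kappa * v_gate - u_bias)
    denominator = 1.0 + math.exp(kappa * (v_gate - u_tau))
    return scale * numerator / denominator


# Hypothetical biases, in units of U_T.
U_H, U_TAU_H = 20.0, 30.0

# Far below u_tau the rate is a pure exponential: one U_T step scales it by e^kappa.
low_ratio = saturating_rate(11.0, U_H, U_TAU_H) / saturating_rate(10.0, U_H, U_TAU_H)

# Far above u_tau the rate flattens at exp(kappa*u_tau - u_bias).
high_rate = saturating_rate(50.0, U_H, U_TAU_H)
ceiling = math.exp(KAPPA * U_TAU_H - U_H)
print(low_ratio, math.exp(KAPPA), high_rate, ceiling)
```

The ceiling is the reciprocal of the minimum time constant, which is why the saturation biases appear in the expression for the shortest time constant later in this section.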


For the activation version of the channel circuit (see Figure 4b), we define the opening and closing voltages’ dependence on the membrane voltage as:

\[
V_O = \phi_o + \gamma_o V \tag{4.1}
\]

\[
V_C = \phi_c - \gamma_c V, \tag{4.2}
\]

where φo, γo, φc, and γc are positive constants representing the offsets and slopes for the opening and closing voltages. Additional circuitry is required to define these constants; one example is described in the next section. In this section, however, we will leave the definition as such while we derive the equations for the circuit.

Under certain restrictions (see appendix A), u’s steady-state level has a sigmoidal voltage dependence:

\[
u_\infty(V) = \frac{1}{1 + \exp\!\left(-\dfrac{V - V_{\mathrm{mid}\,u}}{V^*_u}\right)}, \tag{4.3}
\]

where

\[
V_{\mathrm{mid}\,u} = \frac{1}{\gamma_o + \gamma_c}\left(\phi_c - \phi_o + \frac{u_H - u_L}{\kappa}\right) \tag{4.4}
\]

\[
V^*_u = \frac{U_T}{\kappa}\,\frac{1}{\gamma_o + \gamma_c}. \tag{4.5}
\]

Figure 5a shows how the sigmoid arises from the transition rates and, through them, its relationship to the voltage biases. The midpoint of the sigmoid, where the open probability equals half, occurs when α (V) = β (V); it will thus shift with any voltage biases (φo , φc , uH , or uL ) that scale either of the transition rate currents. For example, increasing uH reduces α (V), shifting the midpoint to higher voltages. The slope of the sigmoid around the midpoint is defined by the slopes γo and γc of VO (V) and VC (V). To obtain the sigmoid shape, we restricted the effect of saturation. It is assumed that the bias voltages uτ H and uτ L are set such that saturation is negligible in the linear midsegment of the sigmoid (i.e., VO < uτ H − 4 UT and VC < uτ L − 4 UT when V ∼ Vmid u ). That is why α (V) and β (V) appear to be pure exponentials in Figure 5a. This restriction is reasonable as saturation is supposed to occur only for large excursions from Vmid u , where it imposes a lower limit on the time constant. Therefore, under this assumption, the sigmoid lacks any dependence on the biases uτ H and uτ L .
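The relationship between the biases and the sigmoid can be illustrated with a short Python sketch. It evaluates the midpoint and slope expressions of equations 4.4 and 4.5, compares the closed-form sigmoid of equation 4.3 with the ratio of unsaturated exponential rates, and confirms that raising uH moves the midpoint to higher voltages. The bias values below are hypothetical (they are not the chip’s settings), all voltages are in volts, and the function names are assumptions made for this example.

```python
import math

U_T = 0.0254  # thermal voltage in volts (25.4 mV, as quoted in the text)


def sigmoid_params(phi_o, gamma_o, phi_c, gamma_c, u_h, u_l, kappa):
    """Midpoint (equation 4.4) and slope voltage (equation 4.5), in volts."""
    v_mid = (phi_c - phi_o + (u_h - u_l) / kappa) / (gamma_o + gamma_c)
    v_star = U_T / (kappa * (gamma_o + gamma_c))
    return v_mid, v_star


def u_inf_closed_form(v, v_mid, v_star):
    """Sigmoid of equation 4.3."""
    return 1.0 / (1.0 + math.exp(-(v - v_mid) / v_star))


def u_inf_from_rates(v, phi_o, gamma_o, phi_c, gamma_c, u_h, u_l, kappa):
    """alpha/(alpha + beta) using unsaturated exponential rates."""
    v_o = phi_o + gamma_o * v              # equation 4.1
    v_c = phi_c - gamma_c * v              # equation 4.2
    log_alpha = (kappa * v_o - u_h) / U_T  # log of opening rate (common prefactor dropped)
    log_beta = (kappa * v_c - u_l) / U_T   # log of closing rate (common prefactor dropped)
    return 1.0 / (1.0 + math.exp(log_beta - log_alpha))


# Hypothetical bias set (volts); not the values used on the chip.
params = dict(phi_o=0.0, gamma_o=1.0, phi_c=0.5, gamma_c=0.288,
              u_h=0.935, u_l=0.9, kappa=0.7)
v_mid, v_star = sigmoid_params(**params)

# Raising uH lowers alpha(V), so the midpoint should move to higher voltage.
v_mid_shifted, _ = sigmoid_params(**{**params, "u_h": params["u_h"] + 0.05})
print(v_mid, v_star, v_mid_shifted)
```

Because the common prefactor Ids0/Qu cancels in the ratio of the rates, only the exponents are needed to evaluate the steady state.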

Figure 5: Steady state and time constants for channel circuit. (a) The variable’s steady-state value (u∞) changes sigmoidally with membrane voltage (V), dictated by the ratio of the opening and closing rates (dashed lines). The midpoint occurs when the rates are equal, and hence its horizontal location is affected by the bias voltages (uH and uL) applied to the circuit (see Figure 4b). (b) The variable’s time constant (τu) has a bell-shaped dependence on the membrane voltage (V), dictated by the reciprocal of the opening and closing rates (dashed lines). The time constant diverges from these asymptotes at intermediate voltages, where neither rate dominates; it follows the reciprocal of their sum, peaking when the sum is minimized.

Under certain further restrictions (see appendix A), u’s time constant has a bell-shaped voltage dependence:

\[
\tau_u(V) = \tau_{\min}\left(1 + \frac{1}{\exp\!\left(\dfrac{V - V_{1u}}{V^*_{1u}}\right) + \exp\!\left(-\dfrac{V - V_{2u}}{V^*_{2u}}\right)}\right), \tag{4.6}
\]

where

\[
\begin{aligned}
V_{1u} &= (u_{\tau H} - \phi_o)/\gamma_o, & V^*_{1u} &= U_T/(\kappa \gamma_o), \\
V_{2u} &= \bigl(\phi_c - u_{\tau H} + (u_H - u_L)/\kappa\bigr)/\gamma_c, & V^*_{2u} &= U_T/(\kappa \gamma_c),
\end{aligned}
\]

and \(\tau_{\min} = (Q_u/I_{ds0})\, e^{-(\kappa u_{\tau H} - u_H)}\).

Figure 5b shows how the bell shape arises from the transition rates and, through them, its relationship to the voltage biases. For large excursions of the membrane voltage, one transition rate dominates, and the time constant closely follows its inverse. For small excursions, neither rate dominates, and the time constant diverges from the inverses, peaking at the membrane voltage where the sum of the transition rates is minimized.


To obtain the bell shape, we saturated the opening and closing rates at the same level by setting κuτH − uH = κuτL − uL. Though not strictly necessary, this assumption simplifies the expression for τu(V) by matching the minimum time constants at hyperpolarized and depolarized voltages, yielding the result given in equation 4.6. The bell shape also requires this so-called minimum time constant to be smaller than the peak time constant in the absence of saturation. The free parameters within the circuit (φo, γo, φc, γc, uH, uL, uτH, and uτL) allow for much flexibility in designing a channel. Appendix B provides an expanded discussion on the influence of the various parameters in the equations above. In the following section, we present measurements from a simple activation channel designed using this circuit, which was fabricated in a standard 0.25 µm CMOS process.
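To see how closely the bell-shaped expression of equation 4.6 tracks the reciprocal of the summed rates, the sketch below evaluates both on a grid of membrane voltages. Everything numerical here is an assumption chosen for illustration: voltages are in units of UT, time is in units of Qu/Ids0, and the biases are hypothetical values that satisfy the matching condition κuτH − uH = κuτL − uL and keep the two saturation regions well separated.

```python
import math

# Hypothetical parameters (voltages in units of U_T); not the chip's biases.
KAPPA = 0.7
GAMMA_O, PHI_O = 1.0, 0.0
GAMMA_C = KAPPA**2 / (KAPPA + 1.0)
U_H, U_L = 20.0, 13.0
U_TAU_H = 30.0
U_TAU_L = U_TAU_H - (U_H - U_L) / KAPPA  # enforces kappa*u_tauH - uH = kappa*u_tauL - uL
PHI_C = 20.0 + 5.0 * GAMMA_C             # chosen so that V2u works out to 5 U_T


def tau_exact(v):
    """1/(alpha + beta) with the saturating rates of equations 3.3 and 3.4."""
    v_o = PHI_O + GAMMA_O * v
    v_c = PHI_C - GAMMA_C * v
    alpha = math.exp(KAPPA * v_o - U_H) / (1.0 + math.exp(KAPPA * (v_o - U_TAU_H)))
    beta = math.exp(KAPPA * v_c - U_L) / (1.0 + math.exp(KAPPA * (v_c - U_TAU_L)))
    return 1.0 / (alpha + beta)


# Parameters of equation 4.6 (with U_T = 1 in these units).
TAU_MIN = math.exp(-(KAPPA * U_TAU_H - U_H))
V1 = (U_TAU_H - PHI_O) / GAMMA_O
V1_STAR = 1.0 / (KAPPA * GAMMA_O)
V2 = (PHI_C - U_TAU_H + (U_H - U_L) / KAPPA) / GAMMA_C
V2_STAR = 1.0 / (KAPPA * GAMMA_C)


def tau_bell(v):
    """Bell-shaped time constant of equation 4.6."""
    denom = math.exp((v - V1) / V1_STAR) + math.exp(-(v - V2) / V2_STAR)
    return TAU_MIN * (1.0 + 1.0 / denom)


grid = [0.5 * i for i in range(81)]  # 0 to 40 U_T
max_rel_err = max(abs(tau_bell(v) - tau_exact(v)) / tau_exact(v) for v in grid)
print(max_rel_err, tau_bell(20.0) / TAU_MIN)
```

For this parameter set the two curves should agree to within a few percent, with the largest deviation where both rates are within reach of saturation, which is consistent with the statement that equation 4.6 holds only under further restrictions.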

5 A Simple Activation Channel Our goal here is to implement an activating channel to serve as a concrete example and examine its behavior through experiment. We start with the channel variable circuit, which computes the logarithm of the channel variable, and attach its output voltage to the gate of a transistor, which uses the subthreshold regime’s exponential current-voltage relationship to invert the logarithm. The current this transistor produces, which is directly proportional to the variable, can be injected directly into a silicon neuron (Hynna & Boahen, 2001) or can be used to define a conductance (Simoni et al., 2004). The actual choice is irrelevant for the purposes of this article, which demonstrates only the channel variable. In addition to the output transistor, we also need circuitry to compute the opening and closing voltages from the membrane voltage. For the opening voltage (VO ), we simply use a wire to tie it to the membrane voltage (V), which yields a slope of unity (γo = 1) and an intercept of zero (φo = 0). For the closing voltage (VC ), we use four transistors to invert the membrane voltage. The end result is shown in Figure 6. For the voltage inverter circuit we chose, the closing voltage’s intercept φc = κ VC0 (set by a bias voltage VC0 ) and its slope γc = κ 2 /(κ + 1) (set by the transistor parameter defined in equation 2.10). Since κ ≈ 0.7, the closing voltage has a shallower slope than the opening voltage, which makes the closing rate change more gradually, skewing the bell curve in the hyperpolarizing direction as intended for the application in which this circuit was used (Hynna, 2005). This eight-transistor design captures the ion channel’s nonlinear dynamics, which we demonstrated by performing voltage clamp experiments (see Figure 7). As the command voltage (i.e., step size) increases, the output current’s time course and final amplitude both change. 
The clustering and speed at low and high voltages are what we would expect from a sigmoidal steady-state dependence with a bell-shaped time constant. The


Figure 6: A simple activating channel. A voltage inverter (N5-8) produces the closing voltage (VC ); a channel variable circuit (N1-3) implements the variable’s dynamics in the log domain (uV ); and an antilog transistor (N4) produces a current (IT ) proportional to the variable. The opening voltage (VO ) is identical to the membrane voltage (V). The series transistor (N2) sets the minimum time constant at depolarized levels. The same circuit can be used to implement an inactivating channel simply by swapping VO and VC .

Figure 7: Channel circuit’s measured voltage-dependent activation. When the membrane voltage is stepped to increasing levels, from the same starting level, the output current becomes increasingly larger, approaching its steady-state amplitudes at varying speeds.

relationship between this output current ($I_T$) and the activation variable, defined as $u = e^{u_V - u_H}$, has the form

$$I_T = u^{\kappa}\, \bar{I}_T. \tag{5.1}$$


Figure 8: Channel circuit’s measured sigmoid and bell curve. (a) Dependence of activation on membrane voltage in steady state, captured by sweeping the membrane voltage slowly and recording the normalized current output. (b) Dependence of time constant on membrane voltage, extracted from the curves in Figure 7 by fitting exponentials. Fits (solid lines) are of equations 4.3 and 4.6: $V^{mid}_u$ = 423.0 mV, $V^*_u$ = 28.8 mV, $\tau_{\min}$ = 0.0425 ms, $V_{1u}$ = 571.3 mV, $V^*_{1u}$ = 36.9 mV, $V_{2u}$ = 169.8 mV, $V^*_{2u}$ = 71.8 mV.

Its maximum value is $\bar{I}_T = e^{\kappa u_H - u_G}$, and its exponent is $\kappa \approx 0.7$ (the same transistor parameter). It is possible to achieve an exponent of unity, or even a square or a cube, but this requires many more transistors. We measured the sigmoidal change in activation directly, by sweeping the membrane voltage slowly, and its bell-shaped time constant indirectly, by fitting the voltage clamp data in Figure 7 with exponentials. The results are shown in Figure 8; the relationship above was used to obtain $u$ from $I_T$. The solid lines are the fits of equations 4.3 and 4.6, which reasonably capture the behavior in both sets of data. The range in the time constant data is limited due to the experimental protocol used. Since we modulated only the step size from a fixed hyperpolarized position, we need a measurable change in the steady-state output current to be able to measure the temporal dynamics for opening. However, this worked to our benefit as, given the range of the fit, there was no need to modify equation 4.6 to allow the time constant to go to zero at hyperpolarized levels (this circuit omits the second saturation transistor in Figure 4b). All of the extracted parameters from the fit are reasonably close to our expectations (based on equations 4.3 and 4.6 and our applied voltage biases) except for $V^*_{2u}$. For $\kappa \approx 0.7$ and $U_T$ = 25.4 mV, we expected $V^*_{2u} \approx 125$ mV, but our fit yielded $V^*_{2u} \approx 71.8$ mV. There are two possible explanations, not mutually exclusive. First, the fact that equation 4.6 assumes the presence of a saturation transistor, in addition to the limited data along the left side of the bell curve, may have contributed to the underfitting of that value. Second, κ is not constant within the chip but possesses voltage dependence. Overall, however, the analysis matches reasonably well the performance of the circuit.
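The expected values discussed above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. Equations 4.3 to 4.6 are not repeated in this section, so it assumes that each slope parameter has the form $U_T/(\kappa\gamma)$, with $\gamma = \gamma_o + \gamma_c$ for $V^*_u$, $\gamma_o$ for $V^*_{1u}$, and $\gamma_c$ for $V^*_{2u}$. That reading is an assumption, chosen because it reproduces the roughly 125 mV expectation quoted in the text. The function names are hypothetical, and the last helper simply inverts equation 5.1 to recover $u$ from a normalized current.

```python
KAPPA = 0.7   # transistor parameter quoted in the text
U_T = 25.4    # thermal voltage in mV, as quoted in the text


def closing_slope(kappa):
    """Slope of the four-transistor inverter: gamma_c = kappa^2 / (kappa + 1)."""
    return kappa ** 2 / (kappa + 1.0)


def expected_slope_parameter(gamma, kappa=KAPPA, u_t=U_T):
    """ASSUMED form U_T / (kappa * gamma), in mV (not copied from equations 4.3-4.6)."""
    return u_t / (kappa * gamma)


def activation_from_current(i_normalized, kappa=KAPPA):
    """Invert equation 5.1: u = (I_T / I_T_max) ** (1 / kappa)."""
    return i_normalized ** (1.0 / kappa)


if __name__ == "__main__":
    gamma_o = 1.0                      # opening voltage is wired to V
    gamma_c = closing_slope(KAPPA)     # closing voltage comes from the inverter
    print("V*_u  (mV):", expected_slope_parameter(gamma_o + gamma_c))
    print("V*_1u (mV):", expected_slope_parameter(gamma_o))
    print("V*_2u (mV):", expected_slope_parameter(gamma_c))
    print("u at half of maximum current:", activation_from_current(0.5))
```

Under this assumption, the first two values land within about 1 mV of the fitted 28.8 mV and 36.9 mV, while the third comes out near 126 mV, which is the mismatch with the fitted 71.8 mV that the text goes on to explain.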


6 Conclusion

We showed that the transistor is a thermodynamic isomorph of a channel gating particle. The analogy is accomplished by considering the operation of both within the framework of energy models. Both involve the movement of charge within an electric field: for the channel, due to conformations of the ion channel protein; for the transistor, due to charge carriers entering the transistor channel. Using this analogy, we generated a compact channel variable circuit. We demonstrated our variable circuit’s operation by implementing a simple channel with a single activation variable, showing that the steady state is sigmoid and the time constant bell shaped. Our measured results, obtained through voltage clamp experiments, matched our analytical results, derived from knowledge of transistors. Bias voltages applied to the circuit allow us to shift the sigmoid and the bell curve and set the bell curve’s height independently. However, the sigmoid’s slope, and its location relative to the bell curve, which is determined by the slope, cannot be changed (it is set by the MOS transistor’s κ parameter). Our variable circuit is not limited to activation variables: reversing the opening and closing voltages’ linear dependence on the membrane voltage will change the circuit into an inactivation variable. In addition, channels that activate and inactivate are easily modeled by including additional circuitry to multiply the variable circuits’ output currents (Simoni et al., 2004; Delbruck, 1991). The change in temporal dynamics of gating particles plays a critical role in some voltage-gated ion channels. As discussed in section 1, the inactivation time constant of T channels in thalamic relays changes dramatically, defining properties of the relay cell burst response, such as the interburst interval and the length of the burst itself.
Activation time constants are also influential: they can modify the delay with which the channel responds (Zhan, Cox, Rinzel, & Sherman, 1999), an important determinant of a neuron’s temporal precision. Incorporating these nonlinear temporal dynamics into silicon models will yield further insights into neural computation. Equally important in our design, not only were we able to capture the nonlinear dynamics of gating particles, we were able to do so using fewer transistors than previous silicon models. Rather than exploit the parallels between transistors and ion channels, as we did, previous silicon modelers attempted to “linearize” the transistor, to make it approximate a resistor. After designing a circuit to accomplish this, the resistor’s value had to be adjusted dynamically, so more circuitry was added to filter the membrane voltage. The time constant of this filter was kept constant, sacrificing the ion channel’s voltage-dependent nonlinear dynamics for simplicity. We avoided all these complications by recognizing that the transistor is a


thermodynamic isomorph of the ion channel. Thus, we were able to come up with a compact replica. The size of the circuit is an important consideration within silicon models, as smaller circuits translate to more neurons on a silicon die. To illustrate, a simple silicon neuron (Zaghloul & Boahen, 2004), with a single input synapse, possessing the activation channel from section 5, requires about 330 µm² of area. This corresponds to around 30,000 neurons on a silicon die 10 mm² in area. Adding circuits, such as inactivation to the channel, increases the area of the cell design, reducing the size of the population on the chip (assuming, of course, that the total area of the die remains constant). To compensate for larger cell footprints, we can either increase the size of the whole silicon die (which costs money), or we can simply incorporate multiple chips into the system, easily doubling or tripling the network size. And unlike computer simulations, the increase in network size comes with minimal cost in performance or “simulation” time. Of course, like all other modeling media, silicon has its own drawbacks. For one, silicon is not forgiving with respect to design flaws. Once the chip has been fabricated, we are limited to manipulating our models using only external voltage biases within our design. This places a great deal of importance on verification of the final design before submitting it for fabrication; the total time from starting the design to receiving the fabricated chip can be on the order of 6 to 12 months. An additional characteristic of the silicon fabrication process is mismatch, a term referring to the variability among fabricated copies of the same design within a silicon chip (Pavasovic, Andreou, & Westgate, 1994). Within an array of silicon neurons, this translates into heterogeneity within the population.
While we can take steps to reduce the variability within an array, generally at the expense of area, this mismatch can be considered a feature, since biology also needs to deal with variability. When we build silicon models that reproduce biological phenomena, being able to do so lends credence to our models, given their robustness to parameter variability. And when we discover network phenomena within our chips, these are likely to be found in biology as well, as they will be robust to biological heterogeneity. With the ion channel design described in this article and our ability to expand our networks without much cost, we have a great deal of potential in building many of the neural systems within the brain, which consists of numerous layers of cells, each possessing its own distinct characteristics. Not only do we have the opportunity to study the role of an ion channel within an individual cell, we have the potential to study its influence within the dynamics of a population of cells, and hence its role in neural computation.
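The die-area estimate given earlier in this section is simple unit arithmetic, sketched below. The cell and die areas are the ones quoted in the text. The sketch assumes that the whole die is tiled with cells (no pads, routing, or bias circuitry), which is a simplification introduced here, and the function name is hypothetical.

```python
UM2_PER_MM2 = 1_000_000  # (1000 um) squared


def neurons_per_die(die_area_mm2, cell_area_um2):
    """Upper bound on cell count, assuming the entire die is tiled with cells."""
    return int(die_area_mm2 * UM2_PER_MM2 // cell_area_um2)


if __name__ == "__main__":
    single_chip = neurons_per_die(10.0, 330.0)
    print("cells on one 10 mm^2 die:", single_chip)
    # Doubling the cell footprint (for example, by adding inactivation
    # circuitry) roughly halves the population on the same die.
    print("cells with a 660 um^2 footprint:", neurons_per_die(10.0, 660.0))
    # Using several chips scales the network size linearly.
    print("cells on three chips:", 3 * single_chip)
```

The single-chip count is consistent with the figure of around 30,000 neurons quoted in the text.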


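Appendix A: Derivations

We can use the transition rates for our channel circuit (see equations 3.3 and 3.4) to calculate the voltage dependence of its steady state and time constant. Starting with the steady-state equation, equation 2.4,

$$
\begin{aligned}
u_\infty(V) &= \frac{\alpha(V)}{\alpha(V) + \beta(V)} \\
&= \frac{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}}}{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}} + \dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_L}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}} \\
&= \frac{1}{1 + \dfrac{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}\; e^{u_H - u_L}}.
\end{aligned}
$$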

Throughout the linear segment of the sigmoid, we set the voltage biases such that $u_{\tau H} > V_O + 4U_T$ and $u_{\tau L} > V_C + 4U_T$. These restrictions essentially marginalize $u_{\tau H}$ and $u_{\tau L}$. By the time either of the terms with these two biases becomes significant, the steady state will be sufficiently close to either unity or zero, so that their influence is negligible. Therefore, we drop the exponential terms with $u_{\tau H}$ and $u_{\tau L}$; substituting equations 4.1 and 4.2 yields the desired result of equation 4.3. For the time constant, we substitute equations 3.3 and 3.4 into equation 4.1:

$$
\begin{aligned}
\tau_u(V) &= \frac{1}{\alpha(V) + \beta(V)} \\
&= \frac{Q_u}{I_{ds0}}\; \frac{1}{\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}} + \dfrac{e^{-u_L}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}} \\
&= \frac{Q_u}{I_{ds0}}\; \frac{1}{\dfrac{1}{e^{-(\kappa V_O - u_H)} + e^{-(\kappa u_{\tau H} - u_H)}} + \dfrac{1}{e^{-(\kappa V_C - u_L)} + e^{-(\kappa u_{\tau L} - u_L)}}}.
\end{aligned}
$$
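The two derivations above can be checked numerically. The following sketch evaluates the opening and closing rates in the form that appears in the expressions above, then compares $\alpha/(\alpha+\beta)$ with the closed-form steady state. It assumes all voltages are expressed in units of $U_T$, and it uses a hypothetical `rate_scale` argument to stand for $I_{ds0}/Q_u$. The bias values in the example are illustrative choices, not measurements.

```python
import math


def rates(v_o, v_c, u_h, u_l, u_tau_h, u_tau_l, kappa=0.7, rate_scale=1.0):
    """Opening (alpha) and closing (beta) rates; voltages in units of U_T.

    rate_scale is a stand-in for I_ds0 / Q_u.
    """
    alpha = rate_scale * math.exp(-u_h) / (
        math.exp(-kappa * v_o) + math.exp(-kappa * u_tau_h))
    beta = rate_scale * math.exp(-u_l) / (
        math.exp(-kappa * v_c) + math.exp(-kappa * u_tau_l))
    return alpha, beta


def steady_state(v_o, v_c, **bias):
    """u_inf = alpha / (alpha + beta)."""
    alpha, beta = rates(v_o, v_c, **bias)
    return alpha / (alpha + beta)


def steady_state_closed_form(v_o, v_c, u_h, u_l, u_tau_h, u_tau_l, kappa=0.7):
    """Last line of the steady-state derivation."""
    ratio = (math.exp(-kappa * v_o) + math.exp(-kappa * u_tau_h)) / (
        math.exp(-kappa * v_c) + math.exp(-kappa * u_tau_l))
    return 1.0 / (1.0 + ratio * math.exp(u_h - u_l))


def time_constant(v_o, v_c, **bias):
    """tau_u = 1 / (alpha + beta)."""
    alpha, beta = rates(v_o, v_c, **bias)
    return 1.0 / (alpha + beta)


if __name__ == "__main__":
    u_t = 25.4  # mV, used only to convert the illustrative biases below
    bias = dict(u_h=400 / u_t, u_l=50 / u_t, u_tau_h=700 / u_t, u_tau_l=200 / u_t)
    v_o, v_c = 600 / u_t, 100 / u_t
    print(steady_state(v_o, v_c, **bias))
    print(steady_state_closed_form(v_o, v_c, **bias))
    print(time_constant(v_o, v_c, **bias))
```

The first two printed values should agree to floating-point precision, since the closed form is only an algebraic rearrangement of the ratio of rates.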

To equalize the minimum time constant at hyperpolarized and depolarized levels, we establish the following relationship: $\kappa u_{\tau H} - u_H = \kappa u_{\tau L} - u_L$. After additional algebraic manipulation, the time constant becomes

$$
\tau_u(V) = \tau_{\min}\left(1 + \frac{e^{-\kappa (V_O - u_{\tau H})}\, e^{-\kappa (V_C - u_{\tau L})}}{e^{-\kappa (V_O - u_{\tau H})} + e^{-\kappa (V_C - u_{\tau L})} + 2} - \frac{1}{e^{-\kappa (V_O - u_{\tau H})} + e^{-\kappa (V_C - u_{\tau L})} + 2}\right),
$$


where $\tau_{\min} = (Q_u/I_{ds0})\, e^{-(\kappa u_{\tau H} - u_H)}$ is the minimum time constant. To reduce the expression further, we need to understand the relative magnitudes of the various terms. We can drop the constant in both denominators, as one of the exponentials there will always be significantly larger. Since $V_O$ and $V_C$ have opposite signs for their slopes with respect to $V$, the sum of the exponentials in the denominator is smallest at a membrane voltage somewhere within the middle section of its operational voltage range, which is where the time constant peaks. As it happens, this peak is close to the midpoint of the sigmoid (see below), where we have defined $u_{\tau H} > V_O + 4U_T$ and $u_{\tau L} > V_C + 4U_T$. Thus, at the peak, we know the constant is negligible. As the membrane voltage moves away from the peak, in either direction, one of the exponentials will continue to increase while the other decreases. Thus, with the restriction on the bias voltages $u_{\tau H}$ and $u_{\tau L}$, the sum of the exponentials will always be much larger than the constant in the denominator. By the same logic, we can disregard the final term, since the sum of the exponentials will always be substantially larger than the numerator, making the fraction negligible over the whole membrane voltage range. With these assumptions, and substituting the dependence of $V_O$ and $V_C$ on $V$ (see equations 4.1 and 4.2), we obtain the desired result, equation 4.6.

Appendix B: Circuit Flexibility

This section is more theoretical in nature, using the steady-state and time-constant equations derived in appendix A to provide insight into how the various parameters influence the two voltage dependencies (steady state and time constant). This section is likely of interest only to those who wish to use our approach for modeling ion channels. An important consideration in generating these models is defining the location (i.e., the membrane voltage) and magnitude of the peak time constant.
Unlike the minimum time constant, which is determined simply by the difference between the bias voltages $u_{\tau H}$ and $u_H$ (or $u_{\tau L}$ and $u_L$), no bias directly controls the maximum time constant, since it is the point at which the sum of the transition rates is minimized (see Figure 5b). The voltage at which it occurs, however, is easily determined from equation 4.6:

$$
V_{\tau pk} = V^{mid}_u + \frac{U_T}{\kappa(\gamma_o + \gamma_c)} \log\left[\frac{\gamma_c}{\gamma_o}\right]. \tag{B.1}
$$

Thus, where the bell curve lies relative to the sigmoid, whose midpoint lies at $V^{mid}_u$, is determined by the opening and closing voltages’ slopes ($\gamma_o$ and $\gamma_c$). Consequently, changing these slopes is the only way to displace the bell curve relative to the sigmoid. Shifting the bell curve by changing


Figure 9: Varying the closing voltage’s slope (γc ). (a) Changing γc adjusts both the slope and midpoint of the steady-state sigmoid. (b) γc also affects the location and height (relative to the minimum) of the time constant’s bell curve; the change in height (plotted logarithmically) is dramatic. (c, d) Same as in a and b, except that we adjusted the bias voltage uL to compensate for the change in γc , so the sigmoid’s midpoint and the bell curve’s height remain the same. The sigmoid’s slope does change, as it did before, and the bell curve’s location shifts as well, though much less than before. In these plots, φo = 400 mV, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, and γo = 1.0. γc ’s values are 0.5 (thin, solid line), 0.75 (short, dashed line), 1.0 (long, dashed line), and 1.25 (thick, solid line).

a parameter other than $\gamma_o$ or $\gamma_c$ will automatically shift the sigmoid by the same amount. As we change the opening and closing voltages’ slopes ($\gamma_o$ and $\gamma_c$), two things happen. First, due to $V^{mid}_u$’s dependence on these parameters (see equation 4.4), the sigmoid and the bell curve shift together. Second, due to the dependence we just described (see equation B.1), the bell curve shifts relative to the sigmoid. These effects are illustrated in Figures 9a and 9b for $\gamma_c$ ($\gamma_o$ behaves similarly). To eliminate the first effect while preserving the second, we can compensate for the sigmoid’s shift by adjusting the bias voltage $u_L$, which scales the closing rate (see equation 4.2). Consequently, $u_L$ also rescales the left part of the bell curve, where the closing rate is dominant, reducing its height relative to the minimum and canceling its shift due to the first effect. This is demonstrated in Figures 9c and 9d. The sigmoid remains fixed, while the bell curve shifts (slightly) due to the second effect. Although the bias voltage $u_L$ cancels the shift $\gamma_c$ produces in the sigmoid, it does not compensate for the change in the sigmoid’s slope. We can shift the sigmoid


Figure 10: Varying the closing voltage’s intercept (φc ). (a) Changing φc shifts the steady-state sigmoid’s midpoint, leaving its slope unaffected. (b) φc also affects the time constant bell curve’s location and height (relative to the minimum). In these plots, φo = 0 mV, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, γo = 1.0, and γc = 0.5. φc ’s values are 400 mV (thin, solid line), 425 mV (short, dashed line), 450 mV (long, dashed line), and 475 mV (thick, solid line).


Figure 11: Setting the bell curve’s location and height independently. (a) Changing the opening and closing voltages’ intercepts (φo and φc ) together shifts the bell curve’s location without affecting its height. The steady-state sigmoid (not shown) moves with the bell curve (see equation B.1). (b) Changing φo and φc in opposite ways increases the height (relative to the minimum) without affecting the location. The steady-state sigmoid (not shown) stays put as well. In these plots, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, γo = 1.0, and γc = 0.5. In both a and b, φc = 400 mV and φo = 0 mV for the thin, solid line. In a, both φc and φo increment by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line and to the thick, solid line. In b, φc increments and φo decrements by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line, and to the thick, solid line.

and leave its slope unaffected by changing the opening and closing voltages’ intercepts, as shown in Figure 10 for φc (φo behaves similarly). However, the bell curve shifts by the same amount, and its height changes as well, since φc rescales the closing rate. Thus, it is not possible to shift the bell curve relative to the sigmoid without changing the latter’s slope; this is evident from equations 4.5 and B.1.
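Equation B.1 can be checked numerically against the unapproximated time constant from appendix A. The sketch below is a rough illustration under stated assumptions: the opening and closing voltages are taken to be linear in $V$ with generic offsets `a_o` and `a_c` (hypothetical stand-ins for the intercept terms of equations 4.1 and 4.2, which are not reproduced here), $\kappa = 0.7$ and $U_T = 25.4$ mV are the values quoted in section 5, and $u_{\tau L}$ is derived from the equal-saturation constraint. The bias values in the example are illustrative.

```python
import math

KAPPA = 0.7   # transistor parameter quoted in section 5
U_T = 25.4    # thermal voltage in mV


def u_tau_l(p):
    """Closing saturation bias from kappa*u_tauH - u_H = kappa*u_tauL - u_L (mV)."""
    return p["u_tau_h"] - (p["u_h"] - p["u_l"]) / KAPPA


def tau_ratio(v, p):
    """tau / tau_min from the full expression, (A + 1)(B + 1) / (A + B + 2)."""
    v_o = p["a_o"] + p["gamma_o"] * v      # assumed linear opening voltage
    v_c = p["a_c"] - p["gamma_c"] * v      # assumed linear closing voltage
    a = math.exp(-KAPPA * (v_o - p["u_tau_h"]) / U_T)
    b = math.exp(-KAPPA * (v_c - u_tau_l(p)) / U_T)
    return (a + 1.0) * (b + 1.0) / (a + b + 2.0)


def sigmoid_midpoint(p):
    """Voltage where the unsaturated opening and closing rates are equal (mV)."""
    num = KAPPA * (p["a_c"] - p["a_o"]) + p["u_h"] - p["u_l"]
    return num / (KAPPA * (p["gamma_o"] + p["gamma_c"]))


def predicted_peak(p):
    """Equation B.1: midpoint plus U_T * log(gamma_c / gamma_o) / (kappa * (gamma_o + gamma_c))."""
    shift = U_T * math.log(p["gamma_c"] / p["gamma_o"]) / (
        KAPPA * (p["gamma_o"] + p["gamma_c"]))
    return sigmoid_midpoint(p) + shift


def numeric_peak(p, v_min=200.0, v_max=800.0, step=0.1):
    """Grid search for the voltage that maximizes tau / tau_min."""
    n = int(round((v_max - v_min) / step))
    grid = [v_min + step * i for i in range(n + 1)]
    return max(grid, key=lambda v: tau_ratio(v, p))


if __name__ == "__main__":
    params = dict(gamma_o=1.0, gamma_c=0.5, a_o=0.0, a_c=400.0,
                  u_h=400.0, u_l=50.0, u_tau_h=700.0)
    print("sigmoid midpoint (mV):", sigmoid_midpoint(params))
    print("B.1 prediction (mV):  ", predicted_peak(params))
    print("grid-search peak (mV):", numeric_peak(params))
```

Because B.1 follows from the approximate form (equation 4.6), the grid-search peak of the full expression is expected to sit within a few millivolts of the prediction rather than exactly on it. With $\gamma_c < \gamma_o$, both fall below the sigmoid's midpoint.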


We can set the bell curve’s location and height independently if we change the opening and closing voltages’ intercepts by the same amount or by equal and opposite amounts, respectively. Equal changes in the intercepts ($\phi_o$ and $\phi_c$) shift the opening and closing rate curves by the same amount, thus shifting the bell curve (and the sigmoid) without affecting its height (see Figure 11a), whereas equal and opposite changes shift the opening and closing rate curves apart, leaving the point where they cross at the same location while rescaling the value of the rate there. As a result, the bell curve’s height is changed without affecting its location (see Figure 11b). Choosing values for $\gamma_o$ and $\gamma_c$, however, presents a trade-off. These two parameters define the dependence of $V_O$ and $V_C$ on $V$ and are not external biases like the other parameters; rather, they are defined by the fabrication process through the transistor parameter κ. We can define their relationships with κ through the use of different circuits; in our simple activation channel (see section 5), $\gamma_c = \kappa^2/(\kappa + 1)$, as defined by the four transistors that invert the membrane voltage. There are a couple of drawbacks. First, not all values for $\gamma_o$ or $\gamma_c$ are possible using only a few transistors. Expressed another way, a trade-off needs to be made between achieving more precise values for $\gamma_o$ or $\gamma_c$ and using fewer transistors within the design. The other drawback is that after fabrication, $\gamma_o$ and $\gamma_c$ can no longer be modified, as they are defined as functions of the transistor parameter κ. These issues merit special consideration before submitting the chip for fabrication.

References

Delbruck, T. (1991). “Bump” circuits for computing similarity and dissimilarity of analog voltages. In Neural Networks, 1991, IJCNN-91-Seattle International Joint Conference on (Vol. 1, pp. 475–479). Piscataway, NJ: IEEE.

Destexhe, A., & Huguenard, J. R. (2000). Nonlinear thermodynamic models of voltage-dependent currents. J. Comput. Neurosci., 9(3), 259–270.

Destexhe, A., & Huguenard, J. R. (2001). Which formalism to use for voltage-dependent conductances? In R. C. Cannon & E. De Schutter (Eds.), Computational neuroscience: Realistic modeling for experimentalists (pp. 129–157). Boca Raton, FL: CRC.

Hill, T. L., & Chen, Y. (1972). On the theory of ion transport across the nerve membrane. VI. Free energy and activation free energies of conformational change. Proc. Natl. Acad. Sci. U.S.A., 69(7), 1723–1726.

Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer Associates.

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117(4), 500–544.

Hynna, K. (2005). T channel dynamics in a silicon LGN. Unpublished doctoral dissertation, University of Pennsylvania.

Hynna, K., & Boahen, K. (2001). Space-rate coding in an adaptive silicon neuron. Neural Networks, 14(6–7), 645–656.

Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242(4886), 1654–1664.

Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354(6354), 515–518.

McCormick, D. A., & Pape, H. C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.

Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Monsivais, P., Clark, B. A., Roth, A., & Hausser, M. (2005). Determinants of action potential propagation in cerebellar Purkinje cell axons. J. Neurosci., 25(2), 464–472.

Pavasovic, A., Andreou, A. G., & Westgate, C. R. (1994). Characterization of subthreshold MOS mismatch in transistors for VLSI systems. J. VLSI Signal Process. Syst., 8(1), 75–85.

Simoni, M. F., Cymbalyuk, G. S., Sorensen, M. E., Calabrese, R. L., & DeWeerth, S. P. (2004). A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering, 51(2), 342–354.

Stevens, C. F. (1978). Interactions between intrinsic membrane protein and electric field: An approach to studying nerve excitability. Biophys. J., 22(2), 295–306.

Svirskis, G., Kotak, V., Sanes, D. H., & Rinzel, J. (2002). Enhancement of signal-to-noise ratio and phase locking for small inputs by a low-threshold outward current in auditory neurons. J. Neurosci., 22(24), 11019–11025.

Tao, L., Shelley, M., McLaughlin, D., & Shapley, R. (2004). An egalitarian network model for the emergence of simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 101(1), 366–371.

Willms, A. R., Baro, D. J., Harris-Warrick, R. M., & Guckenheimer, J. (1999). An improved parameter estimation method for Hodgkin-Huxley models. Journal of Computational Neuroscience, 6(2), 145–168.

Zaghloul, K. A., & Boahen, K. (2004). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.

Zhan, X. J., Cox, C. L., Rinzel, J., & Sherman, S. M. (1999). Current clamp and modeling studies of low-threshold calcium spikes in cells of the cat’s lateral geniculate nucleus. J. Neurophysiol., 81(5), 2360–2373.

Received October 6, 2005; accepted June 20, 2006.

LETTER

Communicated by Harry Erwin

Spatiotemporal Conversion of Auditory Information for Cochleotopic Mapping

Osamu Hoshino
Department of Intelligent Systems Engineering, Ibaraki University, Hitachi, Ibaraki, 316-8511 Japan

Auditory communication signals such as monkey calls are complex FM vocal sounds and in general induce action potentials at different times in the primary auditory cortex. A delay line scheme is one effective way of detecting such neuronal timing. However, the scheme is not straightforwardly applicable if the time intervals of the signals exceed the latency of the delay lines. In fact, monkey calls often extend over longer time intervals (hundreds of milliseconds to seconds), beyond the latencies observed in the brain (less than several hundred milliseconds). Here, we propose a cochleotopic map similar to the retinotopic map known in vision. We show that information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns of neurons, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. We suggest that the spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, or monkey call identification by higher cortical areas.

1 Introduction

Frequency modulation (FM) is a critical parameter for constructing auditory communication signals in both humans and monkeys. For example, human speech contains FM sounds called formant transitions that are critical for encoding consonant and vowel combinations such as “ga,” “da,” and “ba” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Suga, 1995). Monkeys use complex FM sounds (the so-called monkey calls), which are considered to be a precursor of human speech (Poremba et al., 2004; Holden, 2004), to interact socially with other members of the species (Symmes, Newman, Talmage-Riggs, & Lieblich, 1979; Janik, 2000; Tyack, 2000).
Neural Computation 19, 351–370 (2007) © 2007 Massachusetts Institute of Technology

Auditory FM signals differ in informational structure from other sensory signals in that they are processed in a time-dependent manner (or characterized by time-varying spectral frequencies), while color and orientation

in vision, odorants in olfaction, and tastants in gustation are characterized in most cases in a time-independent manner. Understanding how the auditory cortex encodes and detects the streams of spectral information arising from the temporal structure of FM sounds is one of the most challenging problems in auditory cognitive neuroscience. Auditory signals being sent from receptor neurons enter the primary auditory area (AI). Neurons of the AI are orderly arranged and form the so-called tonotopic map, where tonotopic representation of the cochlea is well preserved (Yost, 1994). Each neuron on the map tends to respond best to its characteristic frequency. Because of the tonotopic organization, these neurons could be activated in a sequential manner and generate action potentials in different timing when stimulated with an FM sound. It is interesting to know how higher cortical areas, to which the AI project, detect the timing of action potentials so that the brain can identify the applied FM sound. One of the effective ways for detecting the timing of action potentials is to use delay lines. Jeffress (1948) proposed a theory for sound localization. The theory is that an interaural time difference, which expresses the azimuthal location of a sound source, could be detected by neurons that receive signals from both ears through distinct delay lines. Suga (1995) proposed a multiple delay line scheme for echo sound detection in bats. Bi and Poo (1999) demonstrated in a cultured hippocampal neuronal network that the timing of action potentials could be detected when relevant delay lines are properly chosen. The timing of action potentials in these processes was relatively short, ranging from submilliseconds to tens of milliseconds, for which the delay line scheme worked well. However, the information about FM sounds contained in monkey calls is in general expressed in longer time intervals of hundreds of milliseconds to seconds. 
The delay line scheme is unlikely to be applicable to them, because the delays observed in the brain are at most several hundred milliseconds (Miller, 1987). This implies that the brain might employ another strategy for identifying individual calls. A speculative cochleotopic map was proposed in relation to a retinotopic map in vision (Rauschecker, 1998). The retinotopic map expresses information about the location of a moving bar in a two-dimensional visual space that is projected onto the retina. As the bar moves in the visual space, the map shows a spatiotemporal firing pattern of neurons. The axes of the map indicate the location of the bar in the two-dimensional visual space. In the cochleotopic mapping scheme, the location of neuronal activation moves along the frequency axis as the frequency of an FM sound sweeps. However, the exact auditory variable for the other axis is still unknown. We propose here a hypothetical two-dimensional (frequency axis, propagation axis) cochleotopic neural network ($N_{CO}$ network) for the AI on which information about FM sounds is mapped in a spatiotemporal manner. The propagation axis is assumed for the unknown axis. Stimulation


with a pure (single-frequency) tone activates a peripheral neuron, whose activity propagates along the propagation axis, or along its isofrequency band. When the $N_{CO}$ network is stimulated with an FM sound, peripheral neurons are sequentially activated, and their activity then propagates along their isofrequency bands. As a consequence, the information about the applied FM sound is expressed as a specific spatiotemporal firing pattern in the $N_{CO}$ network dynamics. Such activity propagation has been reported in the AI. Hess and Scheich (1996) stimulated Mongolian gerbils with pure tones (1 kHz to 16 kHz) and recorded the activity of AI neurons. The researchers found that neuronal activation propagated along isofrequency bands at all frequencies. Taniguchi, Horikawa, Moriyama, and Nasu (1992) stimulated guinea pigs with pure tones (1 kHz to 30 kHz) and recorded the activity of the anterior field (field A) in which tonotopic organization was well preserved. The researchers found that the focal activation beginning in field A propagated in two directions: along isofrequency bands and toward field DC. There has been neurophysiological evidence that FM sounds are precisely detected by auditory neurons. Neurons of the lateral belt (LB) (Rauschecker, 1998), which receives signals from the AI, responded to FM sounds. These neurons were organized in a relatively orderly fashion depending on the sweeping rate (between slow and fast) and direction (upward or downward) of the FMs. Based on these experimental findings, we construct a neural network model ($N_{FM}$) for the LB to which the $N_{CO}$ network projects. A given FM sound evokes a specific spatiotemporal firing pattern in the $N_{CO}$ network, to which a certain group of $N_{FM}$ neurons ($N_{FM}$ column) responds and identifies the applied FM. It is also well known that LB neurons respond to monkey calls (Rauschecker, 1997).
Monkey calls as vocal signatures are complex FM sounds and play an important role in identifying individuals, especially when their visual systems are unavailable, as in a luxuriant forest (Symmes et al., 1979). To detect such complex FM sounds, we construct a higher neural network ($N_{ID}$ network) model for the STGr (rostral portion of the superior temporal gyrus) to which the LB projects. The $N_{ID}$ network receives selective projections from the $N_{FM}$ network. When a monkey call is presented to the $N_{CO}$ network, multiple $N_{FM}$ columns are sequentially activated in a specific order. The $N_{ID}$ network integrates the sequence of the dynamic $N_{FM}$ columns, thereby identifying that call. Based on the proposed cochleotopic mapping scheme, we investigate how FM sound information is encoded and detected. Applying to the $N_{CO}$ network simple (linearly sweeping) and complex (monkey call) FM sounds, we record the activities of neurons. By statistically analyzing them, we try to understand the neuronal mechanisms that underlie FM sound information processing in the auditory cortex.

O. Hoshino

2 Neural Network Model

2.1 Outline of the Model. The NCO network, modeling the AI, is organized in a tonotopic fashion, as shown in Figure 1A. When the frequency of an applied FM sound sweeps upward, the neuronal activation of the periphery (p1) sweeps from f1 to f40 and then moves along the propagation axis (isofrequency bands). The filled circles schematically indicate a neuronal firing pattern induced by a simple (linearly sweeping) upward FM sound at a certain time after the stimulus onset. The gray circles indicate a neuronal firing pattern induced by a downward FM sound. The spatiotemporal firing pattern of the NCO network expresses combinatorial information about the direction and the sweep rate of the applied FM sound. Neurons of the NFM network, modeling the LB, receive convergent projections from the NCO network (solid and dashed lines) and detect the upward (black circles) and downward (gray circles) FM sounds. We made sets of selective convergent projections from the NCO to the NFM network in order to detect specific linearly sweeping FM sounds. Neurons within isofrequency bands are connected by excitatory and inhibitory delay lines, as shown in Figure 1B. The excitatory connections (solid lines) from the periphery to the center are the major driving force for the neuronal activation to move along the propagation axis, and the inhibitory connections (dashed lines) were employed to sharpen the spatiotemporal firing patterns. This specific circuitry was used to express functionally the propagation axis, for which evidence in the AI is accumulating (Hess & Scheich, 1996; Taniguchi et al., 1992). For simplicity, there are no connections between isofrequency bands. Neurons within NFM columns are connected with each other via excitatory synapses, and neurons in different NFM columns are connected via inhibitory synapses. This circuitry enables each NFM column to respond selectively to a specific linearly sweeping FM sound.
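The within-column excitation and between-column inhibition described above can be sketched as a weight table. This is only an illustrative sketch: the column count (40), the 12 neurons per column, and the weights 0.3 and −5.0 are taken from the parameter list in section 2.2, while the flat neuron indexing and the names `column_of` and `build_fm_weights` are assumptions made here for illustration.

```python
# Sketch of the NFM lateral weight pattern (values quoted in section 2.2).
N_COLUMNS = 40            # NFM columns c1..c40
NEURONS_PER_COLUMN = 12   # neurons in each column
W_WITHIN = 0.3            # mutual excitation inside a column
W_BETWEEN = -5.0          # lateral inhibition between columns


def column_of(neuron):
    """0-based column index of a 0-based neuron index (hypothetical layout)."""
    return neuron // NEURONS_PER_COLUMN


def build_fm_weights():
    """Return w[i][j], the weight from NFM neuron j to NFM neuron i."""
    n = N_COLUMNS * NEURONS_PER_COLUMN
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the sum in equation 2.2 skips j = i
            same = column_of(i) == column_of(j)
            w[i][j] = W_WITHIN if same else W_BETWEEN
    return w


weights = build_fm_weights()
```

With this layout, each neuron receives weak excitation from the 11 other neurons of its own column and strong inhibition from every neuron of the other 39 columns, which is the arrangement the text relies on for column selectivity.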
Figures 1C and 1D are schematic drawings of neuronal responses to a linearly sweeping FM sound. When the NCO network is stimulated with an upward FM sound sweeping at a slow (Figure 1C, left), intermediate (Figure 1C, center), or fast rate (Figure 1C, right), the activation area moves from the lower left to the upper right (arrows). When a downward FM sound is presented, the activation area moves from the lower right to the upper left (see Figure 1D). When the activation area reaches a certain position (gray ellipses), the neurons send action potentials to the NFM network via the selective feedforward projections and activate the NFM column corresponding to the applied FM (black ellipses). The neuronal activation of the other NCO regions (dashed ellipses) can also be used for the FM detection. Nevertheless, we chose the firing patterns marked by the gray ellipses, because these patterns appear first in the time courses and have the maximal number of neurons simultaneously activated. This enables the NFM network to respond reliably and rapidly to the applied FM sounds.
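How the sweep rate shapes the moving activation area can be illustrated with a small timing sketch. It assumes the band layout given in section 3.1 (f1 = 8 kHz, 100 Hz per band) and the 10 ms delay of section 2.2, and it idealizes the activity front as advancing exactly one neuron per delay step, ignoring membrane dynamics and stochastic firing. The function names are hypothetical.

```python
# Idealized timing of the activity front for a linear upward sweep from f1.
BAND_STEP_HZ = 100.0   # spacing between neighboring bands (section 3.1)
N_POSITIONS = 40       # neurons p1..p40 along the propagation axis
DELTA_T_MS = 10.0      # delay between neighboring neurons (section 2.2)


def onset_ms(band, rate_hz_per_ms):
    """Time at which the sweep reaches isofrequency band `band` (1-based)."""
    return (band - 1) * BAND_STEP_HZ / rate_hz_per_ms


def front_position(band, t_ms, rate_hz_per_ms):
    """1-based position of the activity front in `band` at time t_ms.

    Returns None if the sweep has not reached the band yet or if the front
    has already passed p40. Assumes one neuron per DELTA_T_MS (idealization).
    """
    elapsed = t_ms - onset_ms(band, rate_hz_per_ms)
    if elapsed < 0:
        return None
    position = 1 + int(elapsed // DELTA_T_MS)
    return position if position <= N_POSITIONS else None


def front_at(t_ms, rate_hz_per_ms, bands=(1, 20, 40)):
    """Front positions in a few sample bands, for comparing sweep rates."""
    return {b: front_position(b, t_ms, rate_hz_per_ms) for b in bands}
```

Under these assumptions, a slower sweep reaches the high-frequency bands later, so at a given time its front is tilted differently across the map than that of a faster sweep. This is the property the selective projections to the NFM columns exploit.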

Spatiotemporal Conversion of Auditory Information

[Figure 1 graphic omitted. Panel labels: (A) NCO map with frequency axis f1–f40 (low to high) and propagation axis p1–p40, input applied at p1, projecting to NFM columns c1–c40 for upward and downward sweeps; (B) delay lines within an isofrequency band; (C) upward and (D) downward sweeps, slow to fast, drawn on frequency and propagation axes for NCO and NFM.]

Figure 1: Neural network model. (A) The NCO network is organized in a tonotopic manner. The NFM network receives selective projections from the NCO network. Among NFM columns, c1–c10 and c31–c40 detect FM sounds that sweep linearly in downward and upward directions, respectively. For clarity, only two sets of firing patterns (black and gray circles) and projections (solid and dashed lines) for an upward and a downward FM sound are depicted. (B) Neuronal connections within isofrequency bands of the NCO network. Neurons are connected via excitatory (solid lines) and inhibitory (dashed lines) delay lines, where $\Delta t$ denotes a signal transmission delay time. (C) Schematic drawings of the spatiotemporal neuronal responses of the NCO network and those of the NFM network to simple (linearly sweeping) upward FM sounds. Activity patterns for FM sounds that sweep at a slow (left), intermediate (center), and fast (right) rate are shown. Arrows indicate the directions of movements of active areas. (D) Schematic drawings as in C for downward FM sounds.

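2.2 Model Description. Dynamic evolutions of the membrane potentials of neurons of the NCO and NFM networks are defined, respectively, by

$$ \tau_{CO} \frac{du^{CO}_{ki}(t)}{dt} = -u^{CO}_{ki}(t) + \sum_{j=1}^{M_D} \left[ w^{ex}_{ki,k(i-j)} S^{CO}_{k(i-j)}(t - j\Delta t) + w^{ih}_{ki,k(i+j)} S^{CO}_{k(i+j)}(t - j\Delta t) \right], \qquad (2.1) $$

$$ \tau_{FM} \frac{du^{FM}_{i}(t)}{dt} = -u^{FM}_{i}(t) + \sum_{k=f1}^{f40} \sum_{j=p1}^{p40} L_{i,kj} S^{CO}_{kj}(t - \Delta t_{FM}) + \sum_{j=1\,(j \neq i)}^{M_{FM}} w^{FM}_{ij} S^{FM}_{j}(t), \qquad (2.2) $$

where

$$ \mathrm{Prob}\left[ S^{Y}_{i}(t) = 1 \right] = f_Y\left[ u^{Y}_{i}(t) \right] \quad (Y = CO, FM), \qquad f_Y[u] = \frac{1}{1 + e^{-\eta_Y (u - \theta_Y)}}. \qquad (2.3) $$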

$u^{CO}_{ki}(t)$ and $u^{FM}_{i}(t)$ are the membrane potential of the ith NCO neuron of the kth (k = f1–f40) isofrequency band and that of the ith NFM neuron at time t, respectively. $\tau_Y$ (Y = CO, FM) is the decay time of the membrane potential of the network $N_Y$. $M_D$ is the number of excitatory or inhibitory input delay lines that a single NCO neuron receives from other neurons within its isofrequency band. $w^{ex}_{ki,k(i-j)}$ and $w^{ih}_{ki,k(i+j)}$ are, respectively, the excitatory and inhibitory synaptic connection strengths from neuron (i − j) to i and from (i + j) to i of the kth isofrequency band. $S^{CO}_{kj}(t - j\Delta t) = 1$ expresses an action potential of the jth NCO neuron of the kth isofrequency band, where $j\Delta t$ denotes a signal transmission delay time (see Figure 1B). (f1–f40, p1–p40) denotes the locations of neurons on the two-dimensional NCO map (see Figure 1A). $L_{i,kj}$ is the strength of the synaptic connection from the jth NCO neuron of the kth isofrequency band to the ith NFM neuron. $\Delta t_{FM}$ is the signal transmission delay time from the NCO to the NFM network. $M_{FM}$ is the number of NFM neurons. $w^{FM}_{ij}$ is the synaptic connection strength from the jth to the ith NFM neuron, and $S^{FM}_{j}(t) = 1$ expresses an action potential of the jth NFM neuron. $\{w^{FM}_{ij}\}$ was set so that neurons within NFM columns (c1–c40; see Figure 1A) are mutually excited and neurons in different NFM columns are laterally inhibited. $\eta_Y$ and $\theta_Y$ are the steepness and the threshold of the sigmoid function $f_Y$, respectively, for neurons of network Y. Equation 2.3 defines the probability of firing of neuron i; that is, the probability of $S^{Y}_{i}(t) = 1$ is given by the function $f_Y$. After firing, the membrane potential of the neuron is reset to 0. FM sound stimuli are applied to the peripheral neurons of the NCO network. Dynamic evolutions of the membrane potentials of these neurons are defined by

$$ \tau_{CO} \frac{du^{CO}_{k1}(t)}{dt} = -u^{CO}_{k1}(t) + \sum_{j=1}^{M_D} w^{ih}_{k1,k(1+j)} S^{CO}_{k(1+j)}(t - j\Delta t) + \alpha I_{k1}(t), \qquad (2.4) $$

where $I_{k1}(t)$ is the input stimulus to the peripheral neuron of the kth isofrequency band, that is, the neuron located at (k, p1; k = f1–f40; see Figure 1A). α is the intensity of the input. Note that the peripheral neurons receive only inhibitory input from the (1 + j)th neuron with a delay of $j\Delta t$ and do not receive any delayed excitatory input. Network parameter values are as follows. The numbers of neurons are 40 (f1–f40) × 40 (p1–p40) and 40 (c1–c40) × 12 for the NCO and NFM networks, respectively. $\tau_{CO}$ = 10 ms, $\tau_{FM}$ = 10 ms, $\theta_Y$ = 0.7, $\eta_Y$ = 10.0, and $M_D$ = 3. $w^{ex}_{ki,k(i-j)}$ = 5.0 and $w^{ih}_{ki,k(i+j)}$ = −0.5. $\Delta t$ = 10 ms and $\Delta t_{FM}$ = 20 ms. $L_{i,kj}$ was selectively set at either 0.1 or 0, as addressed in section 2.1, so that the specific firing patterns induced in the NCO network dynamics can activate their corresponding NFM columns (see Figures 1C and 1D). $w^{FM}_{ij}$ = 0.3 and −5.0 within and between NFM columns, respectively. α = 8.0, and $I_{k1}(t)$ = 1 for an input and 0 for no input.

3 Results

3.1 Tuning Property to Simple FM Sounds. We show here how the information about simple (linearly sweeping) FM sounds could be expressed as spatiotemporal firing patterns in the NCO network dynamics. We also show how the auditory information could be transferred to and detected by specific neuronal columns of the NFM network. Response properties (action potential generation) of NFM neurons are compared with those observed experimentally. An upward FM sound sweeping linearly at 20 Hz per ms induces a specific spatiotemporal firing pattern in the NCO network, in which the neuronal activation moves from the lower left toward the upper right (see Figure 2A). When the activity reaches a certain point (time = 200 ms), the active neurons send action potentials to the NFM network and stimulate the corresponding NFM column (arrow) at time = ∼220 ms. The difference in activation time between the NCO (200 ms) and NFM (220 ms) networks arises from the signal transmission delay between the two networks, $\Delta t_{FM}$ = 20 ms (see equation 2.2).
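The dynamics of a single isofrequency band (equations 2.1, 2.3, and 2.4) can be sketched as a short stochastic simulation. This is a minimal sketch rather than the author's implementation: the parameter values are those listed in section 2.2, but the 1 ms Euler step, the convention that a spike occupies one time step, the per-step random draw, and all function names are assumptions introduced here.

```python
import math
import random

TAU_CO = 10.0   # ms, membrane decay time
THETA = 0.7     # sigmoid threshold
ETA = 10.0      # sigmoid steepness
M_D = 3         # delay lines per neuron
W_EX = 5.0      # excitatory weight, from neuron i-j to i
W_IH = -0.5     # inhibitory weight, from neuron i+j to i
DELTA_T = 10    # ms, delay per step along the band
ALPHA = 8.0     # input intensity
N = 40          # neurons p1..p40 in one band
DT = 1          # ms, Euler step (assumption; not given in the text)


def firing_probability(u):
    """Equation 2.3: sigmoid firing probability for membrane potential u."""
    return 1.0 / (1.0 + math.exp(-ETA * (u - THETA)))


def simulate_band(stimulus, seed=0):
    """Simulate one band; stimulus[t] is the peripheral input (1 or 0).

    Returns raster[t][i] = 1 if neuron i (0-based, 0 = p1) fired at step t.
    """
    rng = random.Random(seed)
    u = [0.0] * N
    raster = []

    def delayed(t, i, j):
        # Spike emitted by neuron i, j * DELTA_T ms before step t.
        past = t - (j * DELTA_T) // DT
        if past < 0 or i < 0 or i >= N:
            return 0
        return raster[past][i]

    for t, stim in enumerate(stimulus):
        drive = []
        for i in range(N):
            total = ALPHA * stim if i == 0 else 0.0
            for j in range(1, M_D + 1):
                total += W_IH * delayed(t, i + j, j)
                if i > 0:  # p1 receives no delayed excitation (eq. 2.4)
                    total += W_EX * delayed(t, i - j, j)
            drive.append(total)
        spikes = []
        for i in range(N):
            u[i] += DT / TAU_CO * (-u[i] + drive[i])
            fired = 1 if rng.random() < firing_probability(u[i]) else 0
            if fired:
                u[i] = 0.0  # reset after firing
            spikes.append(fired)
        raster.append(spikes)
    return raster
```

A brief input such as `simulate_band([1] * 5 + [0] * 195)` drives the peripheral neuron, and the excitatory delay lines then tend to carry the activity toward higher positions. Because firing is probabilistic, individual runs differ unless the seed is fixed.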

[Figure 2 graphic omitted. Panel labels: (A) NCO activity at 100, 200, 300, and 500 ms and NFM activity at 120, 220, 320, and 520 ms; (B) upward and (C) downward FM sounds at 13.3, 20, and 26.7 Hz/ms, with axes spikes/bin, f (kHz), and time (ms).]

Figure 2: Response property of the NCO and NFM networks. (A) Time courses of neuronal activation when stimulated with an upward FM sound sweeping at 20 Hz per ms. The level of neuronal activity is expressed for each neuron as the number of action potentials observed within a time interval (bin = 10 ms). Arrows indicate the responses of NFM neurons to the applied FM sound. (B) Spike counts (top) and raster plots (middle) for the NFM column neurons when stimulated with different upward FM sounds (bottom). Spike count is the number of action potentials of the NFM column neurons observed within a time interval (bin = 10 ms). (C) Spike counts (top) and raster plots (middle) for the NFM column when stimulated with different downward FM sounds (bottom). The NFM column shows specific sensitivity to the upward FM sound sweeping at 20 Hz per ms.

We assumed f1 = 8 kHz to f40 = 11.9 kHz (see Figure 1A), where the isofrequency bands were placed at even intervals of 100 Hz per band. These frequencies are within the range observed in squirrel monkeys (Symmes et al., 1979) and are employed for investigating how complex FM sounds such as monkey calls could be identified, as will be shown in sections 3.2 and 3.3. Figure 2B presents the total spike counts (top) and raster plots (middle) for the neurons of a given NFM column when stimulated with different upward FM sounds (bottom). The columnar neurons show specific sensitivity to the upward FM sound (20 Hz per ms) but less to downward FM sounds (see Figure 2C). This tendency is largely consistent with that observed in macaque monkeys (Rauschecker, 1998). Although the tuning of the NFM columnar neurons to the applied FM sound (sweep rate = 20 Hz per ms; upward) is evident, these neurons also show weak responses to the other upward FM sounds with different sweep rates (see the arrows of Figure 2B). Such weak responsiveness to the "irrelevant" FM sounds is due to the overlapping of NCO firing patterns, as schematically shown in Figure 3. The set of NCO neurons within the solid ellipse, which are to be simultaneously activated by the FM stimulus (sweep rate = 20 Hz per ms; upward), send action potentials to the relevant NFM column and maximally activate the column (top left). However, the subsets of NCO neurons within the overlapping regions (gray and black) could also be activated, respectively, by the irrelevant upward FM sounds (26.7 and 13.3 Hz per ms). These neurons send a relatively small number of action potentials to the same NFM column, which results in weaker neural responses in the column (bottom left and top right).

[Figure 3 graphic omitted. Panel labels: NCO map with frequency (low to high) and propagation axes; spikes/bin versus time (ms) for sweeps at 20, 26.7, and 13.3 Hz/ms.]

Figure 3: Overlapping property of NCO firing patterns. Thin-dashed, solid, and thick-dashed areas schematically depict the firing patterns generated by the FM sounds sweeping at 26.7 Hz per ms, 20 Hz per ms, and 13.3 Hz per ms, respectively. A certain NFM column is activated maximally by its preferred FM sound (20 Hz per ms) (top left). The overlapping regions, or gray and black regions, are activated not only by the upward FM (20 Hz per ms) but also by those sweeping at 26.7 Hz per ms (gray region) and 13.3 Hz per ms (black region). This results in weaker neuronal responses in the same NFM column (bottom left and top right).

3.2 Tuning Property to Complex FM Sounds. We show here how complex FM sounds such as monkey calls could be expressed as specific spatiotemporal firing patterns in the NCO network dynamics. We also show how the information about individual monkey calls can be decomposed into simple (linearly sweeping) FM components. We used artificial isolation peeps (IPs) as monkey calls (Symmes et al., 1979), whose pitch profiles are shown in Figure 4A. A specific spatiotemporal firing pattern is induced in the NCO network when stimulated with the IP of monkey X, which then activates multiple NFM columns in a sequential manner (see Figure 4B). We have observed distinct sequential orders of columnar activation for monkey calls X, Y, and Z. Figure 4C presents the details of the sequences of columnar activation for monkey X (thick solid line), Y (thin solid line), and Z (dashed line), indicating that the IPs are decomposed into specific sequences of simple (linearly sweeping) FM components. In the model, the neurons of a currently active NFM column continue firing, even without any excitatory input, until another dynamic NFM column emerges, that is, until its neurons begin to fire. For example, NFM column c31 is activated at time = 0.22 s (upward arrow) and continues firing (downward arrow) until NFM column c1 begins to fire (at 0.34 s; rightward arrow).

[Figure 4 graphic omitted. Panel labels: (A) frequency (kHz) versus time (s) for monkeys X, Y, and Z; (B) NCO and NFM activity at successive times between 129 and 501 ms; (C) active NFM column (c1–c40) versus time (s) for monkeys X, Y, and Z.]

Figure 4: Spatiotemporal conversion of information about isolation peeps (IPs). (A) Profiles of artificial IPs for monkey calls (X, Y, Z) used in the simulations. (B) Time courses of neuronal activation of the NCO and NFM networks induced by the IP of monkey X. (C) Sequences of dynamic NFM columns induced by the IPs of monkey X (thick solid line), Y (thin solid line), and Z (dashed line). Each NFM column, c31–c40 and c1–c10, responds to a specific linearly sweeping FM component involved in the IPs. NFM columns (c11–c30) were not assigned to detect FM sounds in this simulation.

This self-generative continuous firing could be mediated by mutual excitation within NFM columns. In the next section, we try to identify these monkey calls by an integrative process, to which the persistent neuronal firing effectively contributes.

3.3 Monkey Call Identification. We showed in the previous section that the information about complex FM sounds, or the IPs of monkeys, could be decomposed into simple (linearly sweeping) FM components. It reminds us that early visual systems decompose complex visual objects into simple features such as edge, orientation, and color. Higher visual areas integrate these features so that the visual images of the objects can be reconstructed as unified percepts. To identify the sequences of the dynamic NFM columns, we extended the model, adding a higher (integrative) neural network NID (see Figure 5A). We made selective feedforward projections from the NFM to the NID network (solid lines) in order to integrate the series of FM components. The specific NID column (filled circles; NID) assigned to detect a certain IP receives convergent inputs from the NFM columns (filled circles; NFM) that are to be sequentially activated by the FM components constituting the IP (filled circles; NCO). The NID network may correspond to a lateral belt (LB) area that is known to respond to monkey calls or to the rostral portion of the superior temporal gyrus (STGr), to which the LB projects (Rauschecker, 1998), as we assumed here. Dynamic evolutions of membrane potentials of NID neurons are defined by

$$ \tau_{ID} \frac{du^{ID}_{i}(t)}{dt} = -u^{ID}_{i}(t) + \sum_{j=1}^{M_{FM}} L^{ID}_{ij} S^{FM}_{j}(t - \Delta t_{ID}) + \sum_{j=1\,(j \neq i)}^{M_{ID}} w^{ID}_{ij} S^{ID}_{j}(t), \qquad (3.1) $$

where action potentials are generated according to equation 2.3 and the other parameter values were the same as those for the NFM network (see equation 2.2). As shown in Figure 5B, when the NCO network is stimulated with the IP of monkey X, multiple NFM columns are sequentially activated, which then activate the call-relevant NID column at 367 ms. In this integrative process, the NID column receives consecutive action potentials from the NFM columns, as addressed in the previous section (see Figure 4C). Such continuous activation of the NID column gradually depolarizes its neurons and allows them to fire when the membrane potentials reach a threshold, thereby identifying the applied IP. Figure 5C presents the identification process for monkey Z.
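The gradual depolarization described above can be illustrated with the leaky integration term of equation 3.1 alone. The sketch below assumes a constant feedforward drive, no lateral input, and a start from rest, which are simplifications made here and not statements about the simulations in the text. The threshold 0.7 and the 10 ms decay time are the NFM values the text says were reused. In the model, firing is probabilistic (equation 2.3), so the threshold marks the potential at which the firing probability is 0.5 rather than a hard limit. The function name is hypothetical.

```python
import math

TAU_ID = 10.0   # ms, taken equal to the NFM decay time (see text)
THETA = 0.7     # sigmoid threshold from section 2.2


def time_to_threshold(drive, tau=TAU_ID, theta=THETA):
    """Time (ms) for u, starting at 0, to reach theta under constant drive.

    Solves tau * du/dt = -u + drive, whose solution from u(0) = 0 is
    u(t) = drive * (1 - exp(-t / tau)). Returns None when drive <= theta,
    because u then never reaches theta.
    """
    if drive <= theta:
        return None
    return -tau * math.log(1.0 - theta / drive)
```

A weak drive that stays below the threshold never brings the neuron to it, whereas a stronger or more sustained drive does so sooner. This is one way to read why persistent firing of the NFM columns helps the NID column reach threshold.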

[Figure 5 graphic omitted. Panel labels: (A) network schematic; (B) monkey X and (C) monkey Z, each showing NID, NFM, and NCO activity at successive times (in ms).]

Figure 5: Monkey call identification. (A) A higher (integrative) neural network NID is added to the original model (see Figure 1A), which integrates the specific sequences of dynamic NFM columns induced by individual IPs. (B, C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys X (B) and Z (C).

[Figure 6 graphic omitted. Panel labels: (A) frequency (kHz) versus time (s) for monkeys V and W; (B) monkey V and (C) monkey W, each showing NID, NFM, and NCO activity at successive times (in ms).]

Figure 6: Identification of monkey calls that have similar auditory spectrograms. (A) Profiles of artificial IPs for monkeys V and W (solid lines). The dashed lines denote that of monkey A. (B, C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys V (B) and W (C).

Figure 6 presents how similar monkey calls could be distinguished from each other. The IP of monkey V (see Figure 6A, left) has a spectrogram similar in the first part (time = 0–0.1 s; solid line) to that of monkey A (dashed line). When the NCO network is stimulated with the IP (see Figure 6B), multiple NFM columns are sequentially activated, which then activate the two NID columns corresponding to the IPs of monkeys V and A (at 330 ms). The two dynamic NID columns compete for a while (at 330–446 ms), and the NID column corresponding to monkey V finally prevails (at 500 ms). The neuronal competition between the two dynamic NID columns arises from the lateral inhibition between NID columns. In contrast, when the IP of monkey W, which has a similar spectrogram in the last part (0.3–0.4 s; see Figure 6A, right), is presented, the NID column corresponding to the IP of monkey W is selectively activated (at 338 ms) without competition, as shown in Figure 6C. Note that the time required
to identify monkey W (338 ms; see Figure 6C) is shorter than that for monkey V (500 ms; see Figure 6B), presumably because there is less neuronal competition between dynamic NID columns. These results indicate that the spectrogram of the last part might be useful for further analyses if circumstances require, although it is more time-consuming.

4 Discussion

We have proposed a hypothesis that (1) information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, (2) which can then be decomposed into simple (linearly sweeping) FM components and (3) integrated into unified percepts by higher cortical networks. For the cochleotopic two-dimensional map (hypothesis 1), we assumed activity propagation along the isofrequency axis (bands) in order to produce a distinct spatiotemporal firing pattern for a given monkey call. Imaging studies (Taniguchi et al., 1992; Hess & Scheich, 1996; Song et al., 2005) provided evidence of such activity propagation in the primary auditory cortex (AI). When alternating pure tones (alternation between 1 and 8 kHz) were presented (Hess & Scheich, 1996), the activity propagation was confined to the low and high isofrequency bands. To our knowledge, the exact formation of spatiotemporal firing patterns for FM sound stimulation has not been identified yet, but we simply extended this scheme. Namely, the neuronal activity propagates along multiple isofrequency bands corresponding to the tone frequencies constituting the applied FM sound. Actual spatiotemporal firing patterns induced by monkey calls might be rather complex because of the interaction between isofrequency bands or the influences of other brain regions. Accordingly, the information about individual monkey calls could be encoded more precisely in the auditory cortex.
Nevertheless, the proposed simple cochleotopic map was sufficient to generate distinct spatiotemporal firing patterns and worked well as the foundation for later sound processing, or monkey call identification. The delay line proposed for the propagation axis (see Figure 1A) was our speculation prompted by a recent experimental study (Song et al., 2005). The study demonstrated that an electrical pulse, applied focally within an isofrequency band, triggered activity propagation along the isofrequency band that was similar to tone-evoked activation. When the auditory thalamus was chemically lesioned, the electrically evoked activity in the AI was not affected, but the tone-evoked activity was abolished. Based on these results, it was suggested that intracortical connectivity in the AI enables neuronal activity to propagate along isofrequency bands. The underlying neuronal mechanisms of activity propagation in the AI have not yet been fully understood, but we assumed intracortical connectivity via delay lines (see Figure 1B) to produce the activity propagation. Note that the intracortical delay lines are relatively short (less than tens of milliseconds)
and thus could be neurophysiologically plausible in the brain, as addressed in section 1. For expressing the information about simple (linearly sweeping) FM components (hypothesis 2), we assumed that neurons respond selectively to the sweep rates and directions of FMs. Neurophysiological studies (Rauschecker, 1997, 1998; Tian & Rauschecker, 1998) demonstrated that many neurons of lateral belt areas, to which the primary auditory cortex (AI) projects, responded better to more complex stimuli, such as FMs and band-passed noise, than to pure tones. These neurons were highly selective to the rates and directions of FMs. Neurons of the anterolateral (AL) and caudolateral (CL) belt areas responded better to slower and faster FM sweep rates, respectively. Neurons of the posterior auditory field were highly selective for one direction. The detailed organization of these (rate- and direction-selective) neurons has not clearly been identified yet, but we represented them in a simple and functional manner (see NFM; Figure 1A). To integrate the sequence of linearly sweeping FMs, expressing a specific monkey call, into a unified percept (hypothesis 3), we assumed that neurons respond selectively to the call. Neurophysiological studies (Rauschecker, 1997, 1998) demonstrated that neurons of the lateral belt responded to a certain class of monkey calls. Very few neurons responded to a single call, and most neurons responded to a number of calls. These results imply that the lateral belt is not yet the end stage of monkey call processing. Higher cortical areas such as the rostral portion of the superior temporal gyrus (STGr) or the prefrontal cortex, or both, might be responsible for monkey call identification (Rauschecker, 1998). The exact cortical areas whose neurons respond selectively to individual monkey calls have not clearly been identified yet, but we assumed such "call-selective" neurons in the STGr (see NID; Figure 5A).
The selective projections from the NFM to the NID network (or from the lateral belt to the STGr) were our speculation, introduced so that the model could identify individual monkey calls, that is, detect the sequence of FM components. Coincidence detection based on a delay line scheme, as addressed in section 1, is not applicable to longer signals such as monkey calls (more than 500 ms), because delay lines observed in the brain are at most several hundred milliseconds long. The delay lines proposed here for the propagation axis (NCO; Figure 1B) are shorter ones (tens of milliseconds) that are speculative but neurophysiologically plausible in the brain. The idea of this study is a temporal-to-spatiotemporal conversion of auditory information mediated by shorter (plausible) delay lines (∼tens of milliseconds), not a coincidence detection scheme. In the propagation axis, we assumed delay lines between neighboring neurons ranging from 10 to 30 ms (see section 2.2). This architecture allowed activity in the cochleotopic neural network (NCO) to propagate along isofrequency bands, as observed in the auditory cortex (Taniguchi et al., 1992; Hess & Scheich, 1996). To our knowledge, such delay lines have not been reported
in the auditory cortex. However, Hess and Scheich (1996) pointed out that activity propagation along isofrequency bands might be closely related to the distribution of response latency in the AI. Tian and Rauschecker (1998) found a response-latency distribution (23 ± 12 ms) in the AI. The range of delay lines used for the NCO network as an AI area is within the range observed. We assumed activity propagation along isofrequency bands. However, activity propagation across isofrequency bands has also been reported (Taniguchi et al., 1992; Hess & Scheich, 1996), for which the interaction between different isofrequency bands might be responsible. The neuronal activation propagated in the two (isofrequency and tonotopic-gradient) directions, where the peak activity was shifted along isofrequency bands (Hess & Scheich, 1996). Although the spatiotemporal firing pattern propagating in the two directions might contribute to encoding auditory information, presumably in a more precise manner, we modeled the propagation only along the isofrequency bands. Panchev and Wermter (2004) proposed a neural network model that can detect temporal sequences on timescales from hundreds of milliseconds to several seconds. The network consisted of integrate-and-fire neurons with active dendrites and dynamic synapses. The researchers applied the model to recognizing words such as bat, tab, cab, and cat. Each word was expressed as a sequence of phonemes (for example, c → a → t for cat). A spike train of a single input neuron encoded each phoneme of a word, with a 100 ms delay between the onset times of successive phonemes. After training based on spike-timing-dependent plasticity, a single output neuron was able to detect a particular sequence of phonemes or identify a specific word. Their model could be an alternative, especially for the integration network (NID; Figure 5A), which detects the sequences of simple (linearly sweeping) FMs.
Sargolini and colleagues (2006) found evidence of conjunctive representation of position, direction, and velocity in the entorhinal cortex of rats that explored two-dimensional environments. In the medial entorhinal cortex (MEC), the network of grid cells constituted a spatial coordinate system in which positional information was represented. Head direction cells were responsible for head-directional information representation. Grid cells were co-localized with head direction cells and conjunctive (grid and head direction) cells, and the running speed of the rat modulated these cells. The researchers suggested that the conjunctive cells might update the representation of spatial location by integrating positional and directional and velocity information in the grid cell network during navigation. Such a conjunctive representation may be an alternative for the spatiotemporal representation of monkey calls, in which the information about spectral components, sweep rates and sweep directions and their combinatorial information may be represented by distinct types of cells in the primary auditory cortex.

In humans, it has been suggested that a voice contains information about not only a speech but also an “auditory face” which allows us to identify individuals (Belin, Fecteau, & Bedard, 2004). This is called auditory face perception and is processed based on a neurocognitive scheme similar to that proposed for visual face perception. Among vocal components for human speech processing, formants (Fitch, 1997) and syllables (Belin & Zatorre, 2003) might be candidate components used for identifying individuals. We suggest that monkey call identification may also be a kind of auditory face perception making use of FM components. There has been evidence that the inferior colliculus (IC) encodes spectrotemporal acoustic patterns of species-specific calls. For example, Suta, Kvasnak, Popelar, and Syka (2003) investigated the neuronal representation of specific calls in the IC of guinea pigs. Responses of individual IC neurons of anesthetized guinea pigs to four typical calls (purr, chutter, chirp, and whistle) were recorded. A majority of neurons (55% of 124 units) responded to all calls. A small portion of neurons (3%) responded to only one call or did not respond to any of the calls. A time-reversed version of calls elicited on average a weaker response. The researchers concluded that the IC neurons do not respond selectively to specific calls but encode spectrotemporal acoustic patterns of the calls. Maki and Riquimaroux (2002) recorded responses of IC neurons of gerbils to two distinct FM sounds that have the same spectral components (5–12 kHz) with an opposite sweep direction, or an upward sweep and a downward sweep. The upward FM generated much stronger responses than the downward FM. The researchers suggested that the directional selectivity to the FM sweeps implies that the IC may encode spectrotemporal acoustic patterns of species-specific calls. 
These experiments imply that the encoding of spectrotemporal acoustic images of specific calls takes place, in part, in the IC, which presumably makes the spatiotemporal pattern of neuronal activation in the present cochleotopic map more complex. We hope that the details will be clarified in the near future.

5 Conclusion

In this study, we have proposed a cochleotopic map similar to the retinotopic map in vision. When the cochleotopic (NCO) network was stimulated with a monkey call, the peripheral neurons located on the frequency axis were sequentially activated. The active area moved along the propagation axis, by which the information about the call was mapped as a spatiotemporal firing pattern in the cochleotopic network dynamics. This spatiotemporal conversion was quite effective for the NFM network to decompose the call information into simple (linearly sweeping) FM components, by which the higher network (NID) was able to integrate these components into a unified percept, or to identify the call.


We suggest that the information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. The spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, or monkey call identification by higher cortical areas.

Acknowledgments

I am grateful to Yuishi Iwasaki for productive discussions and to Hiromi Ohta for her encouragement throughout the study. I am also grateful to the reviewers for giving me valuable comments and suggestions on the earlier draft.

References

Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends Cogn. Sci., 8, 129–135.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport, 14, 2105–2109.
Bi, G. Q., & Poo, M. M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Fitch, W. T. (1997). Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. Acoust. Soc. Am., 102, 1213–1222.
Hess, A., & Scheich, H. (1996). Optical and FDG mapping of frequency-specific activity in auditory cortex. Neuroreport, 7, 2643–2677.
Holden, C. (2004). The origin of speech. Science, 303, 1316–1319.
Janik, V. M. (2000). Whistle matching in wild bottlenose dolphins (Tursiops truncatus). Science, 289, 1355–1357.
Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psychol., 41, 35–39.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychol. Rev., 74, 431–461.
Maki, K., & Riquimaroux, H. (2002). Time-frequency distribution of neuronal activity in the gerbil inferior colliculus responding to auditory stimuli. Neurosci. Lett., 331, 1–4.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiol., 15, 241–247.
Panchev, C., & Wermter, S. (2004). Spike-timing-dependent synaptic plasticity: From single spikes to spike trains. Neurocomputing, 58–60, 365–371.
Poremba, A., Malloy, M., Saunders, R. C., Carson, R. E., Herscovitch, P., & Mishkin, M. (2004). Species-specific calls evoke asymmetric activity in the monkey’s temporal poles. Nature, 427, 448–451.
Rauschecker, J. P. (1997). Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol. (Stockh.), 532, 34–38.


Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiol. Neuro-Otol., 3, 86–103.
Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M. B., & Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312, 758–762.
Song, W. J., Kawaguchi, H., Totoki, S., Inoue, Y., Katura, T., Maeda, S., Inagaki, S., Shirasawa, H., & Nishimura, M. (2005). Cortical intrinsic circuits can support activity propagation through an isofrequency strip of the guinea pig primary auditory cortex. Cereb. Cortex, 16, 718–729.
Suga, N. (1995). Processing of auditory information carried by species-specific complex sounds. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 295–313). Cambridge, MA: MIT Press.
Suta, D., Kvasnak, E., Popelar, J., & Syka, J. (2003). Representation of species-specific vocalizations in the inferior colliculus of the guinea pig. J. Neurophysiol., 90, 3794–3808.
Symmes, D., Newman, J. D., Talmage-Riggs, G., & Lieblich, A. K. (1979). Individuality and stability of isolation peeps in squirrel monkeys. Anim. Behav., 27, 1142–1152.
Taniguchi, I., Horikawa, J., Moriyama, T., & Nasu, M. (1992). Spatio-temporal pattern of frequency representation in the auditory cortex of guinea pigs. Neurosci. Lett., 146, 37–40.
Tian, B., & Rauschecker, J. P. (1998). Processing of frequency-modulated sounds in the cat’s posterior auditory field. J. Neurophysiol., 79, 2629–2642.
Tyack, P. L. (2000). Dolphins whistle a signature tune. Science, 289, 1310–1311.
Yost, W. A. (1994). The neural response and the auditory code. In W. A. Yost (Ed.), Fundamentals of hearing (pp. 116–133). San Diego, CA: Academic Press.

Received January 19, 2006; accepted June 9, 2006.

LETTER

Communicated by Gal Chechik

Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity

Sander M. Bohte [email protected] Netherlands Centre for Mathematics and Computer Science (CWI), 1098 SJ Amsterdam, The Netherlands

Michael C. Mozer [email protected] Department of Computer Science, University of Colorado, Boulder, CO, U.S.A.

Neural Computation 19, 371–403 (2007) © 2007 Massachusetts Institute of Technology

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing dependent plasticity (STDP). We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron’s response variability to a given presynaptic input, causing the neuron’s output to become more reliable in the face of noise. Using an objective function that minimizes response variability and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena, including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of unsupervised cortical adaptation.

1 Introduction

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). The dependence of synaptic modulation on the precise timing of the two action potentials, known as spike-timing dependent plasticity (STDP), is depicted in Figure 1. Typically, plasticity is observed only when the presynaptic and postsynaptic spikes occur within a 20 to 30 ms time window, and the transition from potentiation to depression is very rapid. The effects are long lasting and are therefore referred to as long-term potentiation (LTP) and depression (LTD). An important observation is that the relative magnitude of the LTP component of STDP decreases with increased synaptic efficacy between presynaptic and postsynaptic neuron, whereas the magnitude of LTD remains roughly constant (Bi & Poo, 1998). This finding has led to the suggestion that the LTP component of STDP might best be modeled as additive, whereas the LTD component is better modeled as being multiplicative (Kepecs, van Rossum, Song, & Tegner, 2002). For detailed reviews of STDP see Bi and Poo (2001), Roberts and Bell (2002), and Dan and Poo (2004).

Figure 1: (A) Measuring STDP experimentally. Presynaptic and postsynaptic spike pairs are repeatedly induced at a fixed interval $\Delta t_{\text{pre-post}}$, and the resulting change to the strength of the synapse is assessed. (B) Change in synaptic strength after repeated spike pairing as a function of the difference in time between the presynaptic and postsynaptic spikes. A presynaptic before postsynaptic spike induces LTP and postsynaptic before presynaptic LTD (data points were obtained by digitizing figures in Zhang et al., 1998). We have superimposed an exponential fit of LTP and LTD.

Because these intriguing findings appear to describe a fundamental learning mechanism in the brain, a flurry of models has been developed that focus on different aspects of STDP. A number of studies focus on biochemical models that explain the underlying mechanisms giving rise to STDP (Senn, Markram, & Tsodyks, 2000; Bi, 2002; Karmarkar, Najarian, & Buonomano, 2002; Saudargiene, Porr, & Wörgötter, 2004; Porr & Wörgötter, 2003). Many researchers have also focused on models that explore the consequences of STDP-like learning rules in an ensemble of spiking neurons


(Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Izhikevich & Desai, 2003; Abbott & Gerstner, 2004; Burkitt, Meffin, & Grayden, 2004; Shon, Rao, & Sejnowski, 2004; Legenstein, Naeger, & Maass, 2005), and a comprehensive review of the different types and conclusions can be found in Porr and Wörgötter (2003). Finally, a recent trend is to propose models that provide fundamental computational justifications for STDP. This article proposes a novel justification, and we explore the consequences of this justification in detail.

Most commonly, STDP is viewed as a type of asymmetric Hebbian learning with a temporal dimension. However, this perspective is hardly a fundamental computational rationale, and one would hope that such an intuitively sensible learning rule would emerge from a first-principle computational justification. Several researchers have tried to derive a learning rule yielding STDP from first principles. Dayan and Häusser (2004) show that STDP can be viewed as an optimal noise-removal filter for certain noise distributions. However, even a small variation from these noise distributions yields quite different learning rules, and the noise statistics of biological neurons are unknown. Similarly, Porr and Wörgötter (2003) propose an unsupervised learning rule based on the correlation of bandpass-filtered inputs with the derivative of the output and show that the weight change rule is qualitatively similar to STDP. Hopfield and Brody (2004) derive learning rules that implement ongoing network self-repair. In some circumstances, a qualitative similarity to STDP is found, but the shape of the learning rule depends on both network architecture and task. M. Eisele (private communication, April 2004) has shown that an STDP-like learning rule can be derived from the goal of maintaining the relevant connections in a network.

Rao and Sejnowski (1999, 2001) suggest that STDP may be related to prediction, in particular to temporal difference (TD) learning. They argue that STDP emerges when a neuron attempts to predict its membrane potential at some time $t$ from the potential at time $t - \Delta t$. As Dayan (2002) points out, however, temporal difference learning depends on an estimate of the prediction error, which will be very hard to obtain. Rather, a quantity that might be called an activity difference can be computed, and the learning rule is then better characterized as a “correlational learning rule between the stimuli, and the differences in successive outputs” (Dayan, 2002; see also Porr & Wörgötter, 2003, appendix B). Furthermore, Dayan argues that for true prediction, the model has to show that the learning rule works for biologically realistic timescales. The qualitative nature of the modeling makes it unclear whether a quantitative fit can be obtained. Finally, the derived difference rule is inherently unstable, as it does not impose any bounds on synaptic efficacies; also, STDP emerges only for a narrow range of $\Delta t$ values.


Chechik (2003) relates STDP to information theory via maximization of mutual information between input and output spike trains. This approach derives the LTP portion of STDP but fails to yield the LTD portion. Nonetheless, an information-theoretic approach is quite elegant and has proven valuable in explaining other neural learning phenomena (e.g., Linsker, 1989). The account we describe in this article also exploits an information-theoretic approach. We are not the only ones to appreciate the elegance of information-theoretic accounts. In parallel with a preliminary presentation of our work at the NIPS 2004 conference, two quite similar information-theoretic accounts also appeared (Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005). It will be easiest to explain the relationship of these accounts to our own once we have presented ours.

The computational approaches of Chechik (2003), Dayan and Häusser (2004), and Porr and Wörgötter (2003) are all premised on a rate-based neuron model that disregards the relative timing of spikes. It seems quite odd to argue for STDP using neural firing rate: if spike timing is irrelevant to information transmission, then STDP is likely an artifact and is not central to understanding mechanisms of neural computation. Further, as Dayan and Häusser (2004) note, because STDP is not quite additive in the case of multiple input or output spikes that are near in time (Froemke & Dan, 2002), one should consider interpretations that are based on individual spikes, not aggregates over spike trains.

In this letter, we present an alternative theoretical motivation for STDP from a spike-based neuron model that takes the specific times of spikes into account. We conjecture that a fundamental objective of cortical computation is to achieve reliable neural responses, that is, neurons should produce the identical response (in both the number and timing of spikes) given a fixed input spike train.
Reliability is an issue if neurons are affected by noise influences, because noise leads to variability in a neuron’s dynamics and therefore in its response. Minimizing this variability will reduce the effect of noise and will therefore increase the informativeness of the neuron’s output signal. The source of the noise is not important; it could be intrinsic to a neuron (e.g., a time-varying threshold), or it could originate in unmodeled external sources that cause fluctuations in the membrane potential uncorrelated with a particular input. We are not suggesting that increasing neural reliability is the only objective of learning. If it were, a neuron would do well to shut off and give no response regardless of the input. Rather, reliability is but one of many objectives that learning tries to achieve. This form of unsupervised learning must, of course, be complemented by other unsupervised, supervised, and reinforcement learning objectives that allow an organism to achieve its goals and satisfy drives. We return to this issue below and in our conclusions section.


We derive STDP from the following computational principle: synapses adapt so as to minimize the variability in the timing of the spikes of the postsynaptic neuron’s output in response to given presynaptic input spike trains. This variability reduction causes the response of a neuron to become more deterministic and less sensitive to noise, which provides an obvious computational benefit. In our simulations, we follow the methodology of neurophysiological experiments. This approach leads to a detailed fit to key experimental results. We model not only the shape (sign and time course) of the STDP curve, but also the fact that potentiation of a synapse depends on the efficacy of the synapse; it decreases with increased efficacy. In addition to fitting these key STDP phenomena, the model allows us to make predictions regarding the relationship between properties of the neuron and the shape of the STDP curve. The detailed quantitative fit to data makes our work unique among first-principle computational accounts. Before delving into the details of our approach, we give a basic intuition about the approach. Noise in spiking neuron dynamics leads to variability in the number and timing of spikes. Given a particular input, one spike train might be more likely than others, but the output is nondeterministic. By the response variability minimization principle, adaptation should reduce the likelihood of these other possibilities. To be concrete, consider a particular experimental paradigm. In Zhang et al. (1998), a presynaptic neuron is identified with a weak synapse to a postsynaptic neuron, such that this presynaptic input is unlikely to cause the postsynaptic neuron to fire. However, the postsynaptic neuron can be induced to fire via a second presynaptic connection. In a typical trial, the presynaptic neuron is induced to fire a single spike, and with a variable delay, the postsynaptic neuron is also induced to fire (typically) a single spike. 
To increase the likelihood of the observed postsynaptic response, other response possibilities must be suppressed. With presynaptic input preceding the postsynaptic spike, the most likely alternative response is no output spikes at all. Increasing the synaptic connection weight should then reduce the possibility of this alternative response. With presynaptic input following the postsynaptic spike, the most likely alternative response is a second output spike. Decreasing the synaptic connection weight should reduce the possibility of this alternative response. Because both of these alternatives become less likely as the lag between pre- and postsynaptic spikes is increased, one would expect that the magnitude of synaptic plasticity diminishes with the lag, as is observed in the STDP curve.

Our approach to reducing response variability given a particular input pattern involves computing the gradient of synaptic weights with respect to a differentiable model of spiking neuron behavior. We use the spike response model (SRM) of Gerstner (2001) with a stochastic threshold, where the stochastic threshold models fluctuations of the membrane potential or


the threshold outside experimental control. For the stochastic SRM, the response probability is differentiable with respect to the synaptic weights, allowing us to calculate the gradient that reduces response variability with respect to the weights. Learning is presumed to take a gradient step to reduce the response variability. In modeling neurophysiological experiments, we demonstrate that this learning rule yields the typical STDP curve. We can predict the relationship between the exact shape of the STDP curve and physiologically measurable parameters, and we show that our results are robust to the choice of the few free parameters of the model. Many important machine learning algorithms in the literature seek local optimizers. It is often the case that the initial conditions, which determine which local optimizer will be found, can be controlled to avoid unwanted local optimizers. For example, with neural networks, weights are initialized near the origin; large initial weights would lead to degenerate solutions. And K-means has many degenerate and suboptimal solutions; consequently, careful initialization of cluster centers is required. In the case of our model’s learning algorithm, the initial conditions also avoid the degenerate local optimizer. These initial conditions correspond to the original weights of the synaptic connections and are constrained by the specific methodology of the experiments that we model: the subthreshold input must have a small but nonzero connection strength, and the suprathreshold input must have a large connection strength (less than 10%, more than 70% probability of activating the target, respectively). Given these conditions, the local optimizer that our learning algorithm discovers is an extremely good fit to the experimental data. 
In parallel with our work, two other groups of authors have proposed explanations of STDP in terms of neurons maximizing an information-theoretic measure for the spike-response model (Bell & Parra, 2005; Toyoizumi et al., 2005). Toyoizumi et al. (2005) maximize the mutual information of the input and output between a pool of presynaptic neurons and a single postsynaptic output neuron, whereas Bell and Parra (2005) maximize sensitivity between a pool of (possibly correlated) presynaptic neurons and a pool of postsynaptic neurons. Bell and Parra use a causal SRM model and do not obtain the LTD component of STDP. As we will show, when the objective function is minimization of (conditional) response variability, obtaining LTD critically depends on a stochastic neural response. In the derivation of Toyoizumi et al. (2005), LTD, which is very weak in magnitude, is attributed to the refractoriness of the spiking neuron (via the autocorrelation function), where they use questionably strong and enduring refractoriness. In our framework, refractoriness suppresses noise in the neuron after spiking, and we show that in our simulations, strong refraction in fact diminishes the LTD component of STDP. Furthermore, the mathematical derivation of Toyoizumi et al. is valid only for an essentially constant membrane potential with small fluctuations, a condition clearly violated in experimental


conditions studied by neurophysiologists. It is unclear whether the derivation would hold under more realistic conditions. Neither of these approaches thus far succeeds in quantitatively modeling specific experimental data with neurobiologically realistic timing parameters, and neither explains the relative reduction of STDP as the synaptic efficacy increases as we do. Nonetheless, these models make an interesting contrast to ours by suggesting a computational principle of optimization of information transmission, as contrasted with our principle of neural response variability reduction. Experimental tests might be devised to distinguish between these competing theories.

In section 2 we describe the sSRM, and in section 3 we derive the minimal entropy gradient. In section 4 we describe the STDP experiment, which we simulate in section 5. We conclude with section 6.

2 The Stochastic Spike Response Model

The spike response model (SRM), defined by Gerstner (2001), is a generic integrate-and-fire model of a spiking neuron that closely corresponds to the behavior of a biological spiking neuron and is characterized in terms of a small set of easily interpretable parameters (Jolivet, Lewis, & Gerstner, 2003; Paninski, Pillow, & Simoncelli, 2005). The standard SRM formulation describes the temporal evolution of the membrane potential based on past neuronal events, specifically as a weighted sum of postsynaptic potentials (PSPs) modulated by reset and threshold effects of previous postsynaptic spiking events. The general idea is depicted in Figure 2; formally (following Gerstner, 2001), the membrane potential $u_i(t)$ of cell $i$ at time $t$ is defined as

$$u_i(t) = \sum_{f_i \in G_i^t} \eta(t - f_i) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in G_j^t} \varepsilon(t \mid f_j, G_i^t), \tag{2.1}$$

where $\Gamma_i$ is the set of inputs connected to neuron $i$; $G_i^t$ is the set of times prior to $t$ that a neuron $i$ has spiked, with firing times $f_i \in G_i^t$; $w_{ij}$ is the synaptic weight from neuron $j$ to neuron $i$; $\varepsilon(t \mid f_j, G_i^t)$ is the PSP in neuron $i$ due to an input spike from neuron $j$ at time $f_j$ given postsynaptic firing history $G_i^t$; and $\eta(t - f_i)$ is the refractory response due to the postsynaptic spike at time $f_i$.

To model the postsynaptic potential $\varepsilon$ in a leaky-integrate-and-fire neuron, a spike of presynaptic neuron $j$ emitted at time $f_j$ generates a postsynaptic current $\alpha(t)$ for a presynaptic spike arriving at $f_j$ for $t > f_j$. In the absence of postsynaptic firing, this kernel (following Gerstner & Kistler, 2002, eqs 4.62–4.56, pp. 114–115) can be computed as

$$\varepsilon(t \mid f_j) = \int_{f_j}^{t} \exp\left(-\frac{s - f_j}{\tau_m}\right) \alpha(s - f_j)\, ds, \tag{2.2}$$


Figure 2: Membrane potential u(t) of a neuron as a sum of weighted excitatory PSP kernels due to impinging spikes. Arrival of PSPs marked by arrows. Once the membrane potential reaches threshold, it is reset, and a reset function η is added to model the recovery effects of the threshold.

where $\tau_m$ is the decay time of the postsynaptic neuron’s membrane potential. Consider an exponentially decaying postsynaptic current $\alpha(t)$ of the form

$$\alpha(t) = \frac{1}{\tau_s} \exp\left(-\frac{t}{\tau_s}\right) H(t) \tag{2.3}$$

(see Figure 3A), where $\tau_s$ is the decay time of the current and $H(t)$ is the Heaviside function. In the absence of postsynaptic firing, this current contributes a postsynaptic potential of the form

$$\varepsilon(t \mid f_j) = \frac{1}{1 - \tau_s/\tau_m} \left[\exp\left(-\frac{t - f_j}{\tau_m}\right) - \exp\left(-\frac{t - f_j}{\tau_s}\right)\right] H(t - f_j), \tag{2.4}$$

with current decay time constant $\tau_s$ and membrane decay time constant $\tau_m$. When the postsynaptic neuron fires after the presynaptic spike arrives (at some time $\hat{f}_i$ following the presynaptic spike at time $f_j$), the membrane potential is reset, and only the remaining synaptic current $\alpha(t')$ for $t' > \hat{f}_i$ is integrated in equation 2.2. Following Gerstner, 2001 (section 4.4, equation 1.66), the PSP that takes such postsynaptic firing into account can be written as

$$\varepsilon(t \mid f_j, \hat{f}_i) = \begin{cases} \varepsilon(t \mid f_j) & \hat{f}_i < f_j, \\ \exp\left(-\dfrac{f_j - \hat{f}_i}{\tau_s}\right) \varepsilon(t \mid f_j) & \hat{f}_i \ge f_j. \end{cases} \tag{2.5}$$
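The closed form of equation 2.4 can be checked numerically. The sketch below is only an illustration under stated assumptions: it reads the membrane factor in the integral as acting on the elapsed time t − s (the usual leaky-integrator convolution), and the time constants `TAU_M` and `TAU_S` are hypothetical values, not parameters taken from the text.

```python
import math

def alpha(t, tau_s):
    """Exponentially decaying synaptic current of equation 2.3."""
    return math.exp(-t / tau_s) / tau_s if t >= 0.0 else 0.0

def psp_closed_form(t, f_j, tau_m, tau_s):
    """Closed-form PSP of equation 2.4 (no postsynaptic firing)."""
    dt = t - f_j
    if dt < 0.0:
        return 0.0
    return (math.exp(-dt / tau_m) - math.exp(-dt / tau_s)) / (1.0 - tau_s / tau_m)

def psp_numeric(t, f_j, tau_m, tau_s, n_steps=20000):
    """Midpoint-rule integration of the current filtered by the membrane.

    Assumption: the membrane decay acts on the elapsed time t - s.
    """
    if t <= f_j:
        return 0.0
    h = (t - f_j) / n_steps
    total = 0.0
    for k in range(n_steps):
        s = f_j + (k + 0.5) * h
        total += math.exp(-(t - s) / tau_m) * alpha(s - f_j, tau_s)
    return total * h

TAU_M, TAU_S = 10.0, 2.5  # ms; hypothetical values for illustration
for t in (1.0, 5.0, 20.0):
    print(t, psp_closed_form(t, 0.0, TAU_M, TAU_S), psp_numeric(t, 0.0, TAU_M, TAU_S))
```

Under that reading of the integral, the two columns agree to within the discretization error of the midpoint rule.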


Figure 3: (A) α(t) function. Synaptic input modeled as exponentially decaying current. (B) Postsynaptic potential due to a synaptic input in the absence of postsynaptic firing (solid line), and with postsynaptic firing once and twice (dotted resp. dashed lines; postsynaptic spikes indicated by arrow). (C) Reset function η(t). (D) Spike probability ρ(u) as a function of potential u for different values of α and β parameters.

The PSP of equation 2.5 is depicted in Figure 3B, for the cases when a postsynaptic spike occurs both before and after the presynaptic spike. In principle, this formulation can be expanded to include the postsynaptic neuron firing more than once after the onset of the postsynaptic potential. However, for fast current decay times $\tau_s$, it is useful to consider only the residual current input for the first postsynaptic spike after onset and assume that any further postsynaptic spiking is modeled by a postsynaptic potential reset to zero from that point on.

The reset response $\eta(t)$ models two phenomena. First, a neuron can be in a refractory period: it simply cannot spike again for about a millisecond after a spiking event. Second, after the emission of a spike, the threshold of the neuron may initially be elevated and then recover to the original value (Kandel, Schwartz, & Jessell, 2000). The SRM models this behavior as negative contributions to the membrane potential (see equation 2.1): with $s = t - \hat{f}_i$ denoting the time since the postsynaptic spike, the refractory reset function is defined as (Gerstner, 2001):

$$\eta(s) = \begin{cases} U_{abs} & 0 < s < \delta_r, \\ U_{abs} \exp\left(-\dfrac{s + \delta_r}{\tau_r^f}\right) + U_r \exp\left(-\dfrac{s}{\tau_r^s}\right) & s \ge \delta_r, \end{cases} \tag{2.6}$$

where a large negative impulse $U_{abs}$ models the absolute refractory period, with duration $\delta_r$; the absolute refractory contribution smoothly resets via a fast-decaying exponential with time constant $\tau_r^f$. The term $U_r$ models the slow exponential recovery of the elevated threshold with time constant $\tau_r^s$. The function $\eta$ is depicted in Figure 3C. We made a minor modification to the SRM described in Gerstner (2001) by relaxing the constraint that $\tau_r^s = \tau_m$ and also by smoothing the absolute refractory function (such smoothing is mentioned in Gerstner, but is not explicitly defined). In all simulations, we use $\delta_r = 1$ ms, $\tau_r^s = 3$ ms, and $\tau_r^f = 0.25$ ms (in line with estimates for biological neurons; Kandel et al., 2000; the smoothing parameter was chosen to be fast compared to $\tau_r^s$).

The SRM we just described is deterministic. Gerstner (2001) introduces a stochastic variant of the SRM (sSRM) by incorporating the notion of a stochastic firing threshold: given membrane potential $u_i(t)$, the probability of the neuron firing at time $t$ is specified by $\rho(u_i(t))$. Herrmann and Gerstner (2001) find that for a reasonable escape-rate noise model of the integration of current in real neurons, the probability of firing is small and constant for small potentials, but around a threshold $\vartheta$, the probability increases linearly with the potential. In our simulations, we use such a function,

$$\rho(v) = \frac{\beta}{\alpha} \left\{\ln[1 + \exp(\alpha(\vartheta - v))] - \alpha(\vartheta - v)\right\}, \tag{2.7}$$

where $\alpha$ determines the abruptness of the constant-to-linear transition in the neighborhood of threshold $\vartheta$ and $\beta$ determines the slope of the linear increase beyond $\vartheta$. This function is depicted in Figure 3D for several values of $\alpha$ and $\beta$. We also conducted simulation experiments with sigmoidal and exponential density functions and found no qualitative difference in the results.

3 Minimizing Conditional Entropy

We now derive the rule for adjusting the weight from a presynaptic input neuron $j$ to a postsynaptic neuron $i$ so as to minimize the entropy of $i$’s response given a particular spike train from $j$. A spike train is described by the set of all times at which a neuron $i$ emitted spikes within some interval between 0 and $T$, denoted $G_i^T$. We assume the interval is wide enough that the occurrence of spikes outside the interval does not influence the state of a neuron within the interval (e.g., through threshold reset effects). This assumption allows us to treat intervals as independent of each other.

The set of input spikes received by neuron $i$ during this interval is denoted $F_i^T$, which is just the union of all output spike trains of connected presynaptic neurons $j$: $F_i^T = \{G_j^T \ \forall j \in \Gamma_i\}$. Given input spikes $F_i^T$, the stochastic nature of neuron $i$ may lead not only to the observed response $G_i^T$ but also to a range of other possibilities. Denote the set of possible responses $\Omega_i$, where $G_i^T \in \Omega_i$. Further, let binary variable $\sigma(t)$ denote the state of the neuron in the time interval $[t, t + \Delta t)$, where $\sigma(t) = 1$ means the neuron spikes and $\sigma(t) = 0$ means no spike. A response $\xi \in \Omega_i$ is then equivalent to $[\sigma(0), \sigma(\Delta t), \ldots, \sigma(T)]$. Given a probability density $p(\xi)$ over all possible responses $\xi$, the differential entropy of neuron $i$’s response conditional on input $F_i^T$ is then defined as

$$h(\Omega_i \mid F_i^T) = -\int_{\Omega_i} p(\xi) \log p(\xi)\, d\xi. \tag{3.1}$$

According to our hypothesis, a neuron adjusts its weights so as to minimize the conditional response variability. Such an adjustment is obtained by performing gradient descent on the weighted likelihood of the response, which corresponds to the conditional entropy, with respect to the weights,

$$\Delta w_{ij} = -\gamma \frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}}, \tag{3.2}$$

with learning rate $\gamma$. In this section, we compute the right-hand side of equation 3.2 for an sSRM neuron. Substituting the entropy definition of equation 3.1 into equation 3.2, we obtain:

$$\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \int_{\Omega_i} p(\xi) \log(p(\xi))\, d\xi = -\int_{\Omega_i} p(\xi) \frac{\partial \log(p(\xi))}{\partial w_{ij}} \left(\log(p(\xi)) + 1\right) d\xi. \tag{3.3}$$

We closely follow Xie and Seung (2004) to derive $\partial \log(p(\xi)) / \partial w_{ij}$ for a differentiable neuron model firing at times $G_i^T$. First, we factorize $p(\xi)$:

$$p(\xi) = \prod_{t=0}^{T} P\left(\sigma(t) \mid \{\sigma(t'), \forall t' < t\}\right). \tag{3.4}$$


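The states $\sigma(t)$ are conditionally independent as the probability for a neuron $i$ to fire during $[t, t + \Delta t)$ is determined by the spike probability density of the membrane potential:

$$\sigma(t) = \begin{cases} 1 & \text{with probability } p_i = \rho_i(t)\, \Delta t, \\ 0 & \text{with probability } \bar{p}_i = 1 - p_i(t), \end{cases}$$

with $\rho_i(t)$ shorthand for the spike probability density of the membrane potential, $\rho(u_i(t))$; this equation holds for sufficiently small $\Delta t$ (see also Xie & Seung, 2004, for more details). We note further that

$$\frac{\partial \ln(p(\xi))}{\partial w_{ij}} \equiv \frac{1}{p(\xi)}\, \frac{\partial p(\xi)}{\partial w_{ij}} \quad \text{and} \quad \frac{\partial \rho_i(t)}{\partial w_{ij}} = \frac{\partial \rho_i(t)}{\partial u_i(t)}\, \frac{\partial u_i(t)}{\partial w_{ij}}. \tag{3.5}$$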
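The factorization in equation 3.4, combined with the per-bin spike probabilities above, can be sketched as a bin-by-bin accumulation of the log-probability of a response. The function name `log_prob_response` and the example values are illustrative assumptions, not part of the original model code; any reset effect of earlier spikes is assumed to be already reflected in the supplied intensities.

```python
import math

def log_prob_response(sigma, rho, dt):
    """Log-probability of a binned response xi = [sigma(0), sigma(dt), ...].

    sigma : list of 0/1 spike indicators, one per time bin
    rho   : list of spike probability densities rho_i(t), one per bin
            (assumed to already include any reset after earlier spikes)
    dt    : bin width; must be small enough that rho * dt < 1
    """
    if len(sigma) != len(rho):
        raise ValueError("sigma and rho must have the same length")
    logp = 0.0
    for s, r in zip(sigma, rho):
        p_spike = r * dt  # p_i = rho_i(t) * dt
        if not 0.0 < p_spike < 1.0:
            raise ValueError("rho * dt must lie strictly between 0 and 1")
        # Each bin contributes log(p_i) for a spike, log(1 - p_i) otherwise.
        logp += math.log(p_spike) if s == 1 else math.log(1.0 - p_spike)
    return logp
```

For example, `log_prob_response([0, 1, 0], [10.0, 20.0, 10.0], 0.01)` adds the log-probabilities of two silent bins and one spiking bin.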

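It is straightforward to derive:

$$\begin{aligned}
\frac{\partial \log(p(\xi))}{\partial w_{ij}} &= \int_{t=0}^{T} \frac{\partial \rho_i(t)}{\partial u_i(t)}\, \frac{\partial u_i(t)}{\partial w_{ij}}\, \frac{\sum_{f_i \in G_i^T} \delta(t - f_i) - \rho_i(t)}{\rho_i(t)}\, dt \\
&= -\int_{t=0}^{T} \rho'_i(t)\, \epsilon(t \mid f_j, f_i)\, dt + \sum_{f_i \in G_i^T} \frac{\rho'_i(f_i)}{\rho_i(f_i)}\, \epsilon(f_i \mid f_j, f_i),
\end{aligned} \tag{3.6}$$

where $\rho'_i(t) \equiv \partial \rho_i(t) / \partial u_i(t)$ and $\delta(t - f_i)$ is the Dirac delta, and we use that in the sSRM formulation,

$$\frac{\partial u_i(t)}{\partial w_{ij}} = \epsilon(t \mid f_j, f_i).$$

The term $\rho'_i(t)$ in equation 3.6 can be computed for any differentiable spike probability function. In the case of equation 2.7,

$$\rho'_i(t) = \frac{\beta}{1 + \exp(\alpha(\vartheta - u_i(t)))}.$$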
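The derivative above can be checked numerically against a candidate spike probability density. The sketch below assumes the softplus form $\rho(u) = (\beta/\alpha) \ln(1 + \exp[\alpha(u - \vartheta)])$, which is the antiderivative of the stated $\rho'$ that vanishes as $u \to -\infty$; equation 2.7 is not restated in this section, so this form and the function names are assumptions.

```python
import math

def rho(u, alpha, beta, theta):
    """Assumed density: (beta / alpha) * ln(1 + exp(alpha * (u - theta)))."""
    x = alpha * (u - theta)
    # Numerically stable softplus: avoids overflow for large |x|.
    softplus = x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))
    return beta / alpha * softplus

def rho_prime(u, alpha, beta, theta):
    """d rho / d u = beta / (1 + exp(alpha * (theta - u)))."""
    x = alpha * (theta - u)
    if x > 0:
        e = math.exp(-x)          # rewrite to avoid exp overflow
        return beta * e / (1.0 + e)
    return beta / (1.0 + math.exp(x))
```

At threshold, `rho_prime(theta, alpha, beta, theta)` equals `beta / 2`, the midpoint of the sigmoid.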

Reducing the Variability of Neural Responses


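Substituting our model for $\rho_i(t)$, $\rho'_i(t)$ from equation 2.7 into equation 3.6, we obtain

$$\frac{\partial \log(p(\xi))}{\partial w_{ij}} = -\beta \int_{t=0}^{T} \frac{\epsilon(t \mid f_j, f_i)}{1 + \exp[\alpha(\vartheta - u_i(t))]}\, dt + \sum_{f_i \in G_i^T} \frac{\alpha\, \epsilon(f_i \mid f_j, f_i)}{\{\ln(1 + \exp[\alpha(\vartheta - u_i(f_i))]) - \alpha(\vartheta - u_i(f_i))\}\, (1 + \exp[\alpha(\vartheta - u_i(f_i))])}. \tag{3.7}$$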

Equation 3.7 can be substituted into equation 3.3, which, when integrated, provides the gradient-descent weight update that implements conditional entropy minimization (see equation 3.2). The hypothesis under exploration is that this gradient-descent weight update yields STDP. Unfortunately, an analytic solution to equation 3.3 (and hence equation 3.2) is not readily obtained. Nonetheless, numerical methods can be used to obtain a solution. We are not suggesting a neuron performs numerical integration of this sort in real time. It would be preposterous to claim biological realism for an instantaneous integration over all possible responses ξ ∈ i , as specified by equation 3.3. Consequently, we have a dilemma: What use is a computational theory of STDP if the theory demands intensive computations that could not possibly be performed by a neuron in real time? This dilemma can be circumvented in two ways. First, the resulting learning rule might be cached in some form through evolution so that the computation is not necessary. That is, the solution—the STDP curve itself—may be built into a neuron. As such, our computational theory provides an argument for why neurons have evolved to implement the STDP learning rule. Second, the specific response produced by a neuron on a single trial might be considered a sample from the distribution p(ξ ), and the integration in equation 3.3 can be performed by a sampling process over repeated trials; each trial would produce a stochastic gradient step.
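The second option, estimating the integral of equation 3.3 by sampling responses over repeated trials, can be illustrated with a toy calculation. The sketch below is not the sSRM: it assumes independent Bernoulli bins with no reset, and takes the per-bin spike probabilities `p` and their weight derivatives `dp` as given inputs. All names are hypothetical. Under those assumptions the sampled estimate can be compared with a closed form.

```python
import math
import random

def score_and_logp(sigma, p, dp):
    """Return (d log p(xi) / dw, log p(xi)) for independent Bernoulli bins."""
    score, logp = 0.0, 0.0
    for s, pt, dpt in zip(sigma, p, dp):
        if s:
            score += dpt / pt
            logp += math.log(pt)
        else:
            score -= dpt / (1.0 - pt)
            logp += math.log(1.0 - pt)
    return score, logp

def sampled_entropy_gradient(p, dp, n_trials, rng):
    """Monte Carlo estimate of dh/dw = -E[(d log p / dw) * (log p + 1)].

    Each trial draws one response and contributes one stochastic gradient
    sample. Returns (estimate, standard_error).
    """
    total, total_sq = 0.0, 0.0
    for _ in range(n_trials):
        sigma = [1 if rng.random() < pt else 0 for pt in p]
        score, logp = score_and_logp(sigma, p, dp)
        g = -score * (logp + 1.0)
        total += g
        total_sq += g * g
    mean = total / n_trials
    var = max(total_sq / n_trials - mean * mean, 0.0)
    return mean, math.sqrt(var / n_trials)

def exact_entropy_gradient(p, dp):
    """Closed form for independent bins: sum_t dp_t * log((1 - p_t) / p_t)."""
    return sum(dpt * math.log((1.0 - pt) / pt) for pt, dpt in zip(p, dp))
```

With enough trials the sampled estimate should agree with the closed form to within a few standard errors; a real sSRM simulation would have no such closed form because the reset couples the bins.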

3.1 Numerical Computation. In this section, we describe the procedure for numerically evaluating equation 3.2 via Simpson's integration (Hennion, 1962). This integration is performed over the set of possible responses $\Omega_i$ (see equation 3.3) within the time interval $[0 \ldots T]$. The set $\Omega_i$ can be divided into disjoint subsets $\Omega_i^n$, which contain exactly $n$ spikes: $\Omega_i = \bigcup \Omega_i^n \;\forall n$.


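Using this breakdown,

$$\begin{aligned}
\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} &= -\int_{\Omega_i} p(\xi)\, \frac{\partial \log(p(\xi))}{\partial w_{ij}}\, (\log(p(\xi)) + 1)\, d\xi \\
&= -\sum_{n=0}^{\infty} \int_{\Omega_i^n} p(\xi)\, \frac{\partial \log(p(\xi))}{\partial w_{ij}}\, (\log(p(\xi)) + 1)\, d\xi.
\end{aligned} \tag{3.8}$$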

It is illustrative to walk through the alternatives. For n = 0, there is only one response given the input. Assuming the probability of n = 0 spikes is p0 , the n = 0 term of equation 3.8 reads:

$$\frac{\partial h(\Omega_i^0 \mid F_i^T)}{\partial w_{ij}} = -p_0\, (\log(p_0) + 1) \int_{t=0}^{T} -\rho'_i(t)\, \epsilon(t \mid f_j, f_i)\, dt. \tag{3.9}$$

The probability $p_0$ is the probability of the neuron not having fired between $t = 0$ and $t = T$ given inputs $F_i^T$ resulting in membrane potential $u_i(t)$ and hence probability of firing at time $t$ of $\rho(u_i(t))$,

$$p_0 = S[0, T] = \exp\left( -\int_{t=0}^{T} \rho(u_i(t))\, dt \right), \tag{3.10}$$

which is equal to the survival function $S$ for a nonhomogeneous Poisson process with probability density $\rho(u_i(t))$ for $t = [0 \ldots T]$. (We use the inclusive/exclusive notation for $S$: $S(0, T)$ computes the function excluding the end points; $S[0, T]$ is inclusive.) For $n = 1$, we must consider all responses containing exactly one output spike: $G_i^T = \{f_i^1\}$, $f_i^1 \in [0, T]$. Assuming that neuron $i$ fires only at time $f_i^1$ with probability $p_1(f_i^1)$, the $n = 1$ term of equation 3.8 reads

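$$\frac{\partial h(\Omega_i^1 \mid F_i^T)}{\partial w_{ij}} = -\int_{f_i^1 = 0}^{T} p_1(f_i^1)\, \bigl( \log(p_1(f_i^1)) + 1 \bigr) \left[ \int_{t=0}^{T} -\rho'_i(t)\, \epsilon(t \mid f_j, f_i^1)\, dt + \frac{\rho'_i(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1) \right] df_i^1. \tag{3.11}$$

The probability $p_1(f_i^1)$ is computed as

$$p_1(f_i^1) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, T], \tag{3.12}$$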


where the membrane potential now incorporates one reset at $t = f_i^1$:

$$u_i(t) = \eta(t - f_i^1) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \epsilon(t \mid f_j, f_i^1).$$
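The survival-function expressions for $p_0$ and $p_1$ (equations 3.10 and 3.12) can be sketched numerically with a composite Simpson rule. The intensity functions passed in below are placeholders supplied by the caller, and the names `integrate`, `survival`, `p_zero`, and `p_one` are hypothetical. In continuous time the distinction between inclusive and exclusive end points does not change the numerical value, so it is ignored here.

```python
import math

def integrate(f, a, b, n=200):
    """Composite Simpson rule on [a, b] using an even number n of subintervals."""
    if b <= a:
        return 0.0
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def survival(rho_of_t, t0, t1):
    """Probability of no spike between t0 and t1: exp(-integral of rho)."""
    return math.exp(-integrate(rho_of_t, t0, t1))

def p_zero(rho_no_reset, T):
    """Probability of zero output spikes in [0, T] (equation 3.10)."""
    return survival(rho_no_reset, 0.0, T)

def p_one(rho_no_reset, rho_after_reset, f1, T):
    """Density of exactly one output spike, at time f1 (equation 3.12).

    rho_no_reset(t)        : intensity before any output spike
    rho_after_reset(t, f1) : intensity after a spike and reset at f1
    """
    before = survival(rho_no_reset, 0.0, f1)
    after = survival(lambda t: rho_after_reset(t, f1), f1, T)
    return before * rho_no_reset(f1) * after
```

Integrating `p_one` over `f1` in `[0, T]` gives the total probability of a one-spike response. As a sanity check, if the post-spike intensity is set to zero (complete refractoriness), that integral and `p_zero` should sum to one.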

For $n = 2$, we must consider all responses containing exactly two output spikes: $G_i^T = \{f_i^1, f_i^2\}$ for $f_i^1, f_i^2 \in [0, T]$. Assuming that neuron $i$ fires at $f_i^1$ and $f_i^2$ with probability $p_2(f_i^1, f_i^2)$, the $n = 2$ term of equation 3.8 reads:

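$$\begin{aligned}
\frac{\partial h(\Omega_i^2 \mid F_i^T)}{\partial w_{ij}} = -\int_{f_i^1 = 0}^{T} \int_{f_i^2 = f_i^1}^{T} & p_2(f_i^1, f_i^2)\, \bigl( \log(p_2(f_i^1, f_i^2)) + 1 \bigr) \\
& \times \left[ \int_{t=0}^{T} -\rho'_i(t)\, \epsilon(t \mid f_j, f_i^1, f_i^2)\, dt + \frac{\rho'_i(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1, f_i^2) + \frac{\rho'_i(f_i^2)}{\rho_i(f_i^2)}\, \epsilon(f_i^2 \mid f_j, f_i^1, f_i^2) \right] df_i^1\, df_i^2.
\end{aligned} \tag{3.13}$$

The probability $p_2(f_i^1, f_i^2)$ can again be expressed in terms of the survival function,

$$p_2(f_i^1, f_i^2) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, f_i^2)\, \rho_i(f_i^2)\, S(f_i^2, T], \tag{3.14}$$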

with $u_i(t) = \eta(t - f_i^1) + \eta(t - f_i^2) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \epsilon(t \mid f_j, f_i^1, f_i^2)$. This procedure can be extended for $n > 2$ following the pattern above. In our simulation of the STDP experiments, the probability of obtaining zero, one, or two spikes already accounted for 99.9% of all possible responses; adding the responses with three spikes ($n = 3$) brought this number up to ≈ 99.999%, which is close to the accuracy of our numerical computation. In practice, we found that taking $n = 3$ into account made no significant contribution to computing $\Delta w$, and we did not compute higher-order terms as the cumulative probability of these responses was below our numerical precision. For the results we present later, we used only terms $n \le 2$; we demonstrate that this is sufficient in appendix A. In this section, we have replaced an integral over possible spike sequences $\Omega_i$ with an integral over the times of two output spikes, $f_i^1$ and $f_i^2$, which we compute numerically.
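How much probability mass a truncation at $n \le 2$ captures can be gauged with a rough calculation. The sketch below assumes a plain Poisson process with a given integrated intensity over $[0, T]$; it ignores the reset term of the sSRM, so it is only an approximation to the model above, and the function names are hypothetical.

```python
import math

def spike_count_probs(expected_count, n_max):
    """P(n spikes) for n = 0..n_max under a Poisson process whose
    integrated intensity over [0, T] equals expected_count."""
    return [math.exp(-expected_count) * expected_count ** n / math.factorial(n)
            for n in range(n_max + 1)]

def coverage(expected_count, n_max):
    """Cumulative probability of responses with at most n_max spikes."""
    return sum(spike_count_probs(expected_count, n_max))
```

Comparing `coverage(x, 2)` with `coverage(x, 3)` for a plausible integrated intensity `x` indicates how much the $n = 3$ term could add under the Poisson assumption.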


Figure 4: Experimental setup of Zhang et al. (1998).

4 Simulation Methodology

We modeled in detail the experiment of Zhang et al. (1998) involving asynchronous costimulation of convergent inputs. In this experiment, depicted in Figure 4, a postsynaptic neuron is identified that has two neurons projecting to it: one weak (subthreshold) and one strong (suprathreshold). The subthreshold input results in depolarization of the postsynaptic neuron, but the depolarization is not strong enough to cause the postsynaptic neuron to spike. The suprathreshold input is strong enough to induce a spike in the postsynaptic neuron. Plasticity of the synapse between the subthreshold input and the postsynaptic neuron is measured as a function of the timing between subthreshold and postsynaptic neurons' spikes ($\Delta t_{\text{pre-post}}$) by varying the intervals between induced spikes in the subthreshold and the suprathreshold inputs ($\Delta t_{\text{pre-pre}}$). This measurement yields the well-known STDP curve (see Figure 1b). In most experimental studies of STDP, the postsynaptic neuron is induced to spike not via a suprathreshold neuron, but rather by depolarizing current injection directly into the postsynaptic neuron. To model experiments that induce spiking via current injection, additional assumptions must be made in the spike response model framework. Because these assumptions are not well established in the literature, we have focused on the synaptic input technique of Zhang et al. (1998). In section 5.1, we propose a method for modeling a depolarizing current injection in the spike-response model. The Zhang et al. (1998) experiment imposes four constraints on a simulation: (1) the suprathreshold input alone causes spiking more than 70% of the time; (2) the subthreshold input alone causes spiking less than 10% of the time; (3) synchronous firing of suprathreshold or subthreshold inputs causes LTP if and only if the postsynaptic neuron fires; and (4) the time constants of the excitatory PSPs (EPSPs), $\tau_s$ and $\tau_m$ in the sSRM, are


in the range of 1 to 5 ms and 7 to 15 ms, respectively. These constraints remove many free parameters from our simulation. We do not explicitly model the two input cells; instead, we model the EPSPs they produce. The magnitude of these EPSPs is picked to satisfy the experimental constraints: in most simulations, unless reported otherwise, the suprathreshold EPSP ($w_{\text{supra}}$) alone causes a spike in the postsynaptic neuron on 85% of trials, and the subthreshold EPSP ($w_{\text{sub}}$) alone causes a spike on fewer than 0.1% of trials. In our principal re-creation of the experiment (see Figure 5), we added normally distributed variation to $w_{\text{supra}}$ and $w_{\text{sub}}$ to simulate the experimental selection process of finding suitable supra-subthreshold input pairs, according to $w'_{\text{supra}} = w_{\text{supra}} + N(0, \sigma_{\text{supra}})$ and $w'_{\text{sub}} = w_{\text{sub}} + N(0, \sigma_{\text{sub}})$ (we controlled the random variation for conditions outside the specified firing probability ranges). Free parameters of the simulation are $\vartheta$ and $\beta$ in the spike probability function ($\alpha$ can be folded into $\vartheta$) and the magnitude ($u_r^s$, $u_{\text{abs}}$) and time constants ($\tau_r^s$, $\tau_r^f$, $\Delta_{\text{abs}}$) of the reset. We can further investigate how the results depend on the exact strengths of the subthreshold and suprathreshold EPSPs. The independent variable of the simulation is $\Delta t_{\text{pre-pre}}$, and we measure the time of the postsynaptic spike to determine $\Delta t_{\text{pre-post}}$. In the experimental protocol, a pair of inputs is repeatedly stimulated at a specific interval $\Delta t_{\text{pre-pre}}$ at a low frequency of 0.1 Hz. The weight update for a given $\Delta t_{\text{pre-pre}}$ is measured by comparing the size of the EPSC before stimulation and (about) half an hour after stimulation. In terms of our model, this repeated stimulation can be considered as drawing a response $\xi$ from the stochastic conditional response density $p(\xi)$. We estimate the expected weight update for this density $p(\xi)$ for a given $\Delta t_{\text{pre-pre}}$ using equation 3.2 by approximating the integral by a summation over all time-discretized output responses consisting of 0, 1, or 2 spikes.
Note that performing the weight update computation like this implicitly assumes that the synaptic efficacies in the experiment do not change much during repeated stimulation; since long-term synaptic changes require the synthesis of, for example, proteins, this seems a reasonable assumption, also reflected in the half-hour or so that the experimentalists wait after stimulation before measuring the new synaptic efficacy.
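The weight perturbation described in this section, a Gaussian offset that is kept only when the resulting firing probability stays inside the admissible range, can be sketched as rejection sampling. How out-of-range draws were actually handled is not specified in the text, so the redraw strategy is an assumption, and `perturbed_weight` and `toy_p_fire` are hypothetical stand-ins (the latter replaces the sSRM firing probability with a simple logistic curve).

```python
import math
import random

def toy_p_fire(w):
    """Toy stand-in for the probability that an EPSP of weight w alone
    triggers a spike; NOT the sSRM computation."""
    return 1.0 / (1.0 + math.exp(-5.0 * (w - 1.0)))

def perturbed_weight(w, sigma, p_fire, p_min, p_max, rng, max_tries=1000):
    """Draw w' = w + N(0, sigma), redrawing until p_min <= p_fire(w') <= p_max."""
    for _ in range(max_tries):
        w_new = w + rng.gauss(0.0, sigma)
        if p_min <= p_fire(w_new) <= p_max:
            return w_new
    raise RuntimeError("no admissible weight found within max_tries draws")
```

For instance, `perturbed_weight(1.5, 0.5, toy_p_fire, 0.7, 1.0, random.Random(1))` returns a perturbed suprathreshold weight whose toy firing probability is at least 0.7.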

5 Results

Figure 5A shows an STDP curve produced by the model, obtained by plotting the estimated weight update of equation 3.2 against $\Delta t_{\text{pre-post}}$ for fixed supra- and subthreshold inputs. Specifically, we vary the difference in time between subthreshold and suprathreshold inputs (a pre-pre pair), and we compute the expected gradient for the subthreshold input $w_{\text{sub}}$ over all responses of the postsynaptic neuron via equation 3.2. We thus obtain a value for $\Delta w$ for each $\Delta t_{\text{pre-pre}}$ data point; we then compute $\Delta w_{\text{sub}}$ (%) as the


Figure 5: (A) STDP: experimental data (triangles) and model fit (solid line). (B) Added simulation data points with perturbed weights $w'_{\text{supra}}$ and $w'_{\text{sub}}$ (crosses). STDP data redrawn from Zhang et al. (1998). Model parameters: $\tau_s$ = 2.5 ms, $\tau_m$ = 10 ms, sub- and suprathreshold weight perturbation in B: $\sigma_{\text{supra}} = 0.33\, w_{\text{supra}}$, $\sigma_{\text{sub}} = 0.1\, w_{\text{sub}}$. (C) Model fit compared to previous generative models (Chechik, 2003, short dashed line; Toyoizumi et al., 2005, long dashed line; data points and curves were obtained by digitizing figures in original papers). Free parameters of Chechik (2003) and Toyoizumi et al. (2005) were fit to the experimental data as described in appendix B.

relative percentage change of synaptic efficacy: $\Delta w_{\text{sub}}$ (%) $= \Delta w / w_{\text{sub}} \times 100\%$.¹ For each $\Delta t_{\text{pre-pre}}$, the corresponding value $\Delta t_{\text{pre-post}}$ is determined by calculating for each input pair the average time at which the postsynaptic neuron fires relative to the subthreshold input. Together, this results in a set of ($\Delta t_{\text{pre-post}}$, $\Delta w_{\text{sub}}$ (%)) data points. The continuous graph in Figure 5A is obtained by repeating this procedure for fixed supra- and subthreshold weights and connecting the resultant points. In Figure 5B, the supra- and subthreshold weights in the simulation are randomly perturbed for each pre-pre pair, to simulate the fact that in the experiment, different pairs of neurons are selected for each pre-pre pair, leading inevitably to variation in the synaptic strengths. Mild variation of the input weights yields the “scattering” data points of the relative weight changes similar to the experimentally observed data. Clearly, the mild variation we apply is small only relative to the observed in vivo distributions of synaptic weights in the brain (e.g., Song, Sjöström, Reigl, Nelson, & Chklovskii, 2005). However, Zhang et al. (1998) did not sample randomly from synapses in the brain but rather selected synapses that had a particularly narrow range of initial EPSPs to satisfy the criteria for “supra-” and “subthreshold” synapses (see also section 4). Hence, the experimental variance was particularly small (see Figure 1e of Zhang et al., 1998), and our variation of the size of the EPSP is in line with the observed variations in the experimental results of Zhang et al. (1998). The model produces a good quantitative fit to the experimental data points (triangles), especially compared to other related work as discussed in section 1, and robustly obtains the typical LTP and LTD time windows associated with STDP. In Figure 5C, we show our model fit compared to the models of Toyoizumi et al. (2005) and Chechik (2003). Our model obtained the lowest sum squared error (1.25 versus 1.63 and 3.27,² respectively; see appendix B for methods), despite the lack of data in the region $\Delta t_{\text{pre-post}} = 0, \ldots, 10$ ms in the Zhang et al. (1998) experiment, where the difference in LTD behavior is most pronounced. The qualitative shape of the STDP curve is robust to settings of the spiking neuron model's parameters, as we will illustrate shortly.

1. We set the global learning rate γ in equation 3.2 such that the simulation curve is scaled to match the neurophysiological results. In all other experiments where we use relative percentage change $\Delta w_{\text{sub}}$ (%), the same value for γ is used.
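The two quantities used in this comparison, the relative percentage weight change and the sum squared error between a model curve and data points, can be written down directly. The fitting procedure of appendix B is not shown in this section, so the plain unweighted error below is an assumption, and both function names are hypothetical.

```python
def relative_change_percent(delta_w, w_sub):
    """Relative percentage change of synaptic efficacy: delta_w / w_sub * 100."""
    return delta_w / w_sub * 100.0

def sum_squared_error(model_values, data_values):
    """Unweighted sum of squared differences between model predictions and
    data, both evaluated at the same pre-post intervals."""
    if len(model_values) != len(data_values):
        raise ValueError("model and data must have the same number of points")
    return sum((m - d) ** 2 for m, d in zip(model_values, data_values))
```

A weight change of 0.05 on a subthreshold weight of 0.5, for example, corresponds to a 10% change.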
Additionally, we found that the type of spike probability function ρ (exponential, sigmoidal, or linear) is not critical. Our model accounts for an additional finding that has not been explained by alternative theories: the relative magnitude of LTP decreases as the efficacy of the synapse between the subthreshold input and the postsynaptic target neuron increases; in contrast, LTD remains roughly constant (Bi & Poo, 1998). Figure 6A shows this effect in the experiment of Bi and Poo (1998), and Figure 6B shows the corresponding result from our model. We compute the magnitude of LTP and LTD for the peak modulation (i.e., $\Delta t_{\text{pre-post}} = -5$ for LTP and $\Delta t_{\text{pre-post}} = +5$ for LTD) as the amplitude of the subthreshold EPSP is increased. The model's explanation for this phenomenon is simple: as the synaptic weight increases, its effect saturates, and a small change to the weight does little to alter its influence. Consequently, the gradient of the entropy with respect to the weight goes toward zero. Similar saturation effects are observed in gradient-based learning methods with nonlinear response functions such as backpropagation. As we mentioned earlier, other theories have had difficulty reproducing the typical shape of the LTD component of STDP. In Chechik (2003), the shape is predicted to be near uniform, and in Toyoizumi et al. (2005), the shape depends on the autocorrelation. In our stochastic spike response model, this component arises due to the stochastic variation in the neural response: in the specific STDP experiment, reduction of variability is achieved by reducing the probability of multiple output spikes. To argue for this conclusion, we performed simulations that make our neuron model less variable in various ways, and each of these manipulations results in a reduction in the LTD component of STDP. In Figures 7A and 7B, we make the threshold more deterministic by increasing the values of α and β in the spike probability density function. In Figure 7C, we increase the magnitude of the refractory response η, which will prevent spikes following the initial postsynaptic response. And finally, in Figure 7D, we increase the efficacy of the suprathreshold input, which prevents the postsynaptic neuron's potential from hovering in the region where the stochasticity of the threshold can induce a spike. Modulation of all of these variables makes the threshold more deterministic and decreases LTD relative to LTP. Our simulation results are robust to biologically realizable variation in the parameters of the sSRM model. For example, time constants of the EPSPs can be varied with no qualitative effect on the STDP curves.

2. Of note for this comparison is that our spiking neuron model uses a more sophisticated difference of exponentials (see equation 2.4) to describe the EPSP, whereas the spiking neuron models in Toyoizumi et al. (2005) and Chechik (2003) use a single exponential. These other models might be improved using the more sophisticated EPSP function.

Figure 6: Dependence of LTP and LTD magnitude on efficacy of the subthreshold input. (A) Experimental data redrawn from Bi and Poo (1998). (B) Simulation result.


Figure 7: Dependence of relative LTP and LTD on (A) the parameter α of the stochastic threshold function, (B) the parameter β of the stochastic threshold function, (C) the magnitude of refraction, η, and (D) efficacy of the suprathreshold synapse, expressed as p(fire|supra), the probability that the postsynaptic neuron will fire when receiving only the suprathreshold input. Larger values of p(fire|supra) correspond to a stronger suprathreshold synapse. In all graphs, the weight gradient for individual curves is normalized to peak LTP for comparison purposes.

Figures 8A and 8B show the effect of manipulating the membrane potential decay time $\tau_m$ and the EPSP rise time $\tau_s$, respectively. Note that manipulation of these time constants does predict a systematic effect on STDP curves. Increasing $\tau_m$ increases the duration of both the LTP and LTD windows, whereas decreasing $\tau_s$ leads to a faster transition from LTP to LTD. Both predictions could be tested experimentally by correlating time constants of individual neurons studied with the time course of their STDP curves.

5.1 Current Injection. We mentioned earlier that in many STDP experiments, an action potential is induced in the postsynaptic neuron not via a suprathreshold presynaptic input, but via a depolarizing current injection. In order to model experiments using current injection, we must


Figure 8: Influence of time constants of the sSRM model on the shape of the STDP curve: (A) varying the membrane potential time-constant τm and (B) varying the EPSP rise time constant τs . In both figures, the magnitude of LTP and LTD has been normalized to 1 for each curve to allow for easy examination of the effect of the manipulation on temporal characteristics of the STDP curves.

characterize the current function and its effect on the postsynaptic neuron. In this section, we make such a proposal framed in terms of the spike response model and report simulation results using current injection. We model the injected current $I(t)$ as a rectangular step function,

$$I(t) = I_c\, H(t - f_I)\, H([f_I + \Delta_I] - t), \tag{5.1}$$

where $H$ is the Heaviside step function and the current of magnitude $I_c$ is switched on at $t = f_I$ and off at $t = f_I + \Delta_I$. In the Zhang et al. (1998) experiment, $\Delta_I$ is 2 ms, a value we adopted for our simulations as well. The resulting postsynaptic potential, $\epsilon_c$, is

$$\epsilon_c(t) = \int_0^t \exp\left( -\frac{s}{\tau_m} \right) I(s)\, ds. \tag{5.2}$$

In the absence of postsynaptic firing, the membrane potential of an integrate-and-fire neuron in response to a step current is (Gerstner, 2001):

$$\epsilon_c(t \mid f_I) = I_c\, (1 - \exp[-(t - f_I)/\tau_m]). \tag{5.3}$$
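The rectangular current and the no-spike response of equation 5.3 can be sketched as follows. The code assumes that equation 5.3 applies while the current is on and that the potential then decays exponentially with $\tau_m$ once the current is switched off, as described for the solid curve of Figure 9; the function names, the unit-free parameters, and the treatment of the switching instants are illustrative assumptions.

```python
import math

def heaviside(x):
    """Heaviside step function, taken as 1 at x = 0."""
    return 1.0 if x >= 0 else 0.0

def injected_current(t, f_I, delta_I, I_c):
    """Rectangular current: I_c between t = f_I and t = f_I + delta_I, else 0."""
    return I_c * heaviside(t - f_I) * heaviside(f_I + delta_I - t)

def eps_c_no_spike(t, f_I, delta_I, I_c, tau_m):
    """Potential due to the injected current when no postsynaptic spike occurs.

    While the current is on, this follows equation 5.3; afterwards the
    potential reached at switch-off decays with the membrane time constant.
    """
    if t < f_I:
        return 0.0
    if t <= f_I + delta_I:
        return I_c * (1.0 - math.exp(-(t - f_I) / tau_m))
    peak = I_c * (1.0 - math.exp(-delta_I / tau_m))
    return peak * math.exp(-(t - f_I - delta_I) / tau_m)
```

With `delta_I = 2.0` and `tau_m = 10.0` (both in ms), the potential reaches about 18% of `I_c` at switch-off and falls to 1/e of that value 10 ms later.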

In the presence of postsynaptic firing at time $\hat{f}_i$, we assume, as we did previously in equation 2.5, a reset and subsequent integration of the residual current:

$$\epsilon_c(t \mid \hat{f}_i) = H(\hat{f}_i - t) \int_0^t \exp\left( -\frac{s}{\tau_m} \right) I(s)\, ds + H(t - \hat{f}_i) \int_{\hat{f}_i}^t \exp\left( -\frac{s}{\tau_m} \right) I(s)\, ds. \tag{5.4}$$

Figure 9: Voltage response of a spiking neuron for a 2 ms current injection in the spike response model. Solid curve: The postsynaptic neuron produces no spike, and the potential due to the injected current decays with the membrane time constant $\tau_m$. Dotted curve: The postsynaptic neuron spikes while the current is still being applied. Dashed curve: The postsynaptic neuron spikes after application of the current has terminated (moment of postsynaptic spiking indicated by arrows).

These $\epsilon_c$ kernels are depicted in Figure 9 for a postsynaptic spike occurring at various times $\hat{f}_i$. In our simulations, we chose the current magnitude $I_c$ to be large enough to elicit spiking of the target neuron with probability greater than 0.7. Figure 10A shows the STDP curve obtained using the current injection model for the exact same model parameter settings used to produce the result based on a suprathreshold synaptic input (depicted in Figure 5A), superimposed on the experimental STDP data obtained by depolarizing current injection from Zhang et al. (1998). Figure 10B additionally superimposes the earlier result on the current injection result, and the two curves are difficult to distinguish. As in the earlier result, variation of model parameters has little appreciable effect on the model's behavior using the current injection paradigm, suggesting that current injection versus synaptic input makes little difference to the nature of STDP.


Figure 10: (A) STDP curve obtained for SRM with current injection (solid curve) compared with experimental data for depolarizing current injection (circles; redrawn from Zhang et al., 1998). (B) Comparing STDP curves for both current injection (solid curve) and suprathreshold input (dashed curve) models. The same model parameters are used for both curves. Experimental data redrawn from Zhang et al. (1998) for current injection (circles) and suprathreshold input (triangles) paradigms are superimposed.

6 Discussion

In this letter, we explored a fundamental computational principle: that synapses adapt so as to minimize the variability of a neuron's response in the face of noisy inputs, yielding more reliable neural representations. From this principle, instantiated as entropy minimization, we derived the STDP learning curve. Importantly, the simulation methodology we used to derive the curve closely follows the procedure used in neurophysiological experiments (Zhang et al., 1998): assuming variation in sub- and suprathreshold synaptic efficacies from experimental pair to pair even recovers the noisy scattering of efficacy changes. Our simulations furthermore obtain an STDP curve that is robust to model parameters and details of the noise distribution. Our results are critically dependent on the use of Gerstner's stochastic spike response model, whose dynamics are a good approximation to those of a biological spiking neuron. The sSRM has the virtue of being characterized by parameters that are readily related to neural dynamics, and its dynamics are differentiable such that we can derive a gradient-descent learning rule that minimizes the response variability of a postsynaptic neuron given a particular set of input spikes. Our model predicts the shape of the STDP curve and how it relates to properties of a neuron's response function. These predictions may be empirically testable if a diverse population of cells can be studied. The predictions include the following. First, the width of the LTD and LTP windows depends on the (excitatory) PSP time constants (see Figures 8A and 8B). Second, the strength of LTD relative to LTP depends on the degree of noise in the neuron's response; the LTD strength is related to the noise level. Our model also can characterize the nature of the learning curve for experimental situations that deviate from the boundary conditions of Zhang et al. (1998). In Zhang et al., the subthreshold and suprathreshold inputs produced postsynaptic firing with probability less than .10 and greater than .70, respectively. Our model can predict the consequences of violating these conditions. For example, when the subthreshold input is very strong or the suprathreshold input is very weak, our model produces strictly LTD, that is, anti-Hebbian learning. The consequence of a strong subthreshold input is shown in Figure 6B, and the consequence of a weak suprathreshold input is shown in Figure 7D. Intuitively, this simulation result makes sense because, in the first case, the most likely alternative response of the postsynaptic neuron is to produce more than one spike, and, in the second case, the most likely alternative response is no postsynaptic spike at all. In both cases, synaptic depression reduces the probability of the alternative response. We note that such strictly anti-Hebbian learning has been reported in relation to STDP-type experiments (Roberts & Bell, 2002). For very noisy thresholds and for weak suprathreshold inputs, our model produces an LTD dip before LTP (see Figure 7D). This dip is in fact also present in the work of Chechik (2003). We find it intriguing that this dip is also observed in the experimental results of Nishiyama et al. (2000). The explanation for this dip may be along the same lines as the explanation for the LTD window: given the very noisy threshold, the subthreshold input may occasionally cause spiking, and decreasing its weight would decrease response variability.
This may not be offset by the increase due to its contribution to the spike caused by the suprathreshold input, as it is too early to have much influence. With careful consideration of experimental conditions and neuron parameters, it may be possible to reconcile the somewhat discrepant STDP curves obtained in the literature using our model. In our model, the transition from LTP to LTD occurs at a slight offset from $\Delta t_{\text{pre-post}} = 0$: if the subthreshold input fires 1 to 2 ms before the postsynaptic neuron fires (on average), then neither potentiation nor depression occurs. This offset of 1 to 2 ms is attributable to the current decay time constant, $\tau_s$. The neurophysiological data are not sufficiently precise to determine the exact offset of the LTP-LTD transition in real neurons. Unfortunately, few experimental data points are recorded near $\Delta t_{\text{pre-post}} = 0$. However, the STDP curve of our model does pass through the one data point in that region (see Figure 5A), so the offset may be a real phenomenon. The main focus of the simulations in this letter was to replicate the experimental paradigm of Zhang et al. (1998), in which a suprathreshold presynaptic neuron is used to induce the postsynaptic neuron to fire. The Zhang et al. (1998) study is exceptional in that most other experimental studies of STDP use a depolarizing current injection to induce the


postsynaptic neuron to fire. We are not aware of any established model for current injection within the SRM framework. We therefore proposed a model of current injection within the SRM framework in section 5.1. The proposed model is an ideal abstraction of current injection that does not take into account effects like current onset and offset fluctuations inherent in such experimental methods. Even with these limitations in mind, the current injection model produced STDP curves very similar to the ones obtained by the simulation of the suprathreshold input–induced postsynaptic firing. The simulations reported in this letter account for classical STDP experiments in which a single presynaptic spike is paired with a single postsynaptic spike. The same methodology can be applied to model experimental paradigms involving multiple presynaptic or postsynaptic spikes, or both. However, the computation involved becomes nontrivial. We are currently engaged in modeling data from the multispike experiments of Froemke and Dan (2002). We note that one set of simulation results we reported is particularly pertinent for comparing and contrasting our model to the related model of Toyoizumi et al. (2005). The simulations reported in Figure 7 suggest that noise in our model is critical for obtaining the LTD component of STDP and that parameters that reduce noise in the neural response also reduce LTD. We found that increasing the strength of neuronal refraction reduces response variability and therefore diminishes the LTD component of STDP. This notion is also put forward in very recent work by Pfister, Toyoizumi, Barber, and Gerstner (2006), where an STDP-like rule arises from a supervised learning procedure that aims to obtain spikes at times specified by a teacher. The LTD component in this work also depends on the probability of stochastic activity. In sharp contrast, Toyoizumi et al. (2005) suggest that neuronal refraction is responsible for LTD.
Because the two models are quite similar, it seems unlikely that the models make opposite predictions, and the discrepancy may be due to Toyoizumi et al.'s focus on analytical approximations to solve the mathematical problem at hand, limiting the validity of comparisons between that model and biological experiments in the process. It is useful to reflect on the philosophy of choosing reduction of spike train variability as a target function, as it so obviously has the degenerate but energy-efficient solution of emitting no spikes at all. The usefulness of our approach clearly relies on the stochastic gradient reaching a local optimum in the likelihood space that does not always correspond to the degenerate solution. We compute the gradient of the input weights with respect to the conditionally independent sequence of response intervals $[t, t + \Delta t]$. The gradient approach tries to push the probability of the responses in these intervals to either 0 or 1, irrespective of what the response is (not firing or firing). We find that in the sSRM spiking neuron model, this gradient can be toward either state of each response interval, which can be attributed


to the monotonically increasing spike probability density as a function of the membrane potential. This spike probability density allows neurons to become very reliable by firing spikes only at specific times, at least when starting from a set of input weights that, given the input pattern, is likely to induce a spike in the postsynaptic neuron. Because the target function is the reduction of postsynaptic spike train variability, the model does predict that for small inputs that cause only occasional firing of a postsynaptic target, the average weight update would reduce these inputs to zero. We have modeled the experimental studies in some detail, beyond the level of detail achieved by other researchers investigating STDP. Even a model with an entirely heuristic learning rule has value if it obtains a better fit to the data than other models of similar complexity. Our model has a learning rule that goes beyond heuristic: the learning rule is derived from a computational objective. To some, this objective may not be as exciting as more elaborate objectives like information maximization. As it is, our model stands alone among the contenders in providing a first-principles account of STDP that fits experimental data extremely well. Might there be a mathematically sexier model? We certainly hope so, but it has not yet been discovered. We reiterate the point that our learning objective is viewed as but one of many objectives operating in parallel. The question remains as to why neurons would respond in such a highly variable way to fixed input spike trains: a more deterministic threshold would eliminate the need for any minimization of response variability.
We can only speculate that the variability in neuronal responses may also well serve these other objectives, such as exploitation or exploration in reinforcement learning or the exploitation of stochastic resonance phenomena (e.g., Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). It is interesting to note that minimization of conditional response variability corresponds to one part of the equation that maximizes mutual information. The mutual information I between input X and outputs Y is defined as I(X, Y) = H(Y) − H(Y|X). Hence, minimization of the conditional entropy H(Y|X)—our objective—along with the secondary unsupervised objective of maximizing the marginal entropy H(Y) maximize mutual information. The first unsupervised objective is notoriously hard to compute (e.g., see Bell & Parra, 2005, for an extensive discussion) whereas, as we have shown, the second objective—conditional entropy minimization—can be computed relatively easily via stochastic gradient descent. Indeed, in this light, it


Figure 11: STDP graphs for $\Delta w$ computed using the terms n ≤ 2 (solid line) and n ≤ 3 (crossed solid line).

is a virtue of our model that we account for the experimental data with only one component of the mutual information objective (taking the responses in the experimental conditions as the set of responses Y). The relatively simple nature of the experiments that uncovered STDP lacks any interaction with other (output) neurons, and we may speculate that STDP may be the degenerate reflection of information maximization in the absence of such interactions. If subsequent work shows that STDP can be explained by mutual information maximization (without the drawbacks of existing work, such as the rate-based treatment of Chechik, 2003, or the unrealistic autocorrelation function and difficulty of relating to biological parameters of Toyoizumi et al., 2005), this work contributes by helping to tease apart the components of the objective that are necessary and sufficient for explaining the data.

Appendix A: Higher-Order Spike Probabilities

To compute $\Delta w$, we stop at n = 2, because in the experimental conditions that we model, the contribution of n > 2 spikes is vanishingly small. We find that the probability of three spikes occurring is typically $< 10^{-5}$, and the n = 3 term did not contribute significantly, as shown, for example, in Figure 11. Intuitively it seems very unlikely that the gradient of the conditional response entropy is dominated by terms that are highly unlikely. This could be the case only if the gradient on the probability of getting three or more spikes were much larger than the gradient on getting, say, two spikes.


Given the setup of the model with an increasing probability of firing a spike as a function of the membrane potential, it is easy to see that changing a weight will change the probability of obtaining two spikes much more than the probability of obtaining three spikes. Hence, the entropy gradient from components n ≤ 2 will be (in practice, much) larger than the gradient for terms n = 3, n = 4, and so on. As we remarked before, in our simulation of the experimental setup, the probability of obtaining three spikes given the input was computed to be $< 10^{-5}$; the overall probability was computed at up to $10^{-6}$. The probability of n = 4 was below the precision of our simulation.

Appendix B: Sum-Squared-Error Parameter Fitting

To compute the sum-squared error when comparing the different STDP models in section 5, we use linear regression in the free parameters to minimize the sum-squared error between the model curves and the experimental data. For m experimental data points $\{(\Delta t_1, \Delta w_1), \ldots, (\Delta t_m, \Delta w_m)\}$ and model curve $\Delta w = f(\Delta t)$, we report for each model curve the sum-squared error for those values of the free parameters that minimize the sum-squared error $E^2$:

$$E^2 = \min_{\text{free parameters}} \sum_{i=1}^{m} \left(\Delta w_i - f(\Delta t_i)\right)^2 .$$
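As an illustration of the fitting procedure in Appendix B, the following minimal sketch treats the simplest case, in which the model curve is linear in a single scale parameter, f(Δt) = γ g(Δt), so that the minimizing γ has a closed form. The function names, the curve `stdp_shape`, and the data are hypothetical and serve only to exercise the fit; they are not the curves or data used in the article.

```python
import math

def fit_scale(dt, dw, shape):
    """Least-squares scale gamma for the model curve f(dt) = gamma * shape(dt).

    Returns (gamma, e2), where e2 is the minimal sum-squared error E^2.
    """
    g = [shape(t) for t in dt]
    # Setting dE^2/dgamma = 0 gives gamma = sum(dw_i * g_i) / sum(g_i^2).
    gamma = sum(w * gi for w, gi in zip(dw, g)) / sum(gi * gi for gi in g)
    e2 = sum((w - gamma * gi) ** 2 for w, gi in zip(dw, g))
    return gamma, e2

def stdp_shape(t, tau=10.0):
    """Hypothetical unit-scale STDP curve (dt = t_pre - t_post, in ms)."""
    return math.exp(t / tau) if t < 0 else -0.5 * math.exp(-t / tau)

# Synthetic, noise-free data generated from the same curve with scale 2.
dt = [-20.0, -10.0, -5.0, 5.0, 10.0, 20.0]
dw = [2.0 * stdp_shape(t) for t in dt]
gamma, e2 = fit_scale(dt, dw, stdp_shape)
```

With several free parameters that enter linearly, the same idea generalizes to ordinary multiple linear regression; parameters that enter nonlinearly would need a separate search.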

For our model, linear regression is performed in the scaling parameter γ in equation 3.2 that relates the gradient obtained with the model parameters mentioned to the weight change. Where possible for the other models, we set model parameters to correspond to the values observed in the experimental conditions described in section 4. For the model by Chechik (2003), the weight update is computed as the sum of a positive exponential and a negative damping contribution,

$$\Delta w = \gamma \left[ H(-\Delta t) \exp(\Delta t/\tau) - H(\Delta t + \Delta)\, H(\Delta - \Delta t)\, K \right],$$

where $\Delta t$ is computed as $t_{pre} - t_{post}$, K denotes the negative damping contribution that is applied over a time window $\Delta$ before and after the postsynaptic spike, and H(·) denotes the Heaviside function. The time constant τ is related to the decay of the EPSP, and we set this value to the same value we use for our model: 10 ms. Linear regression to find the minimal sum-squared error is performed on the free parameters γ, K, and $\Delta$.


In Toyoizumi et al. (2005), the learning rule is the sum of two terms

w = γ 2 (t) − µ0 (φ 2 )(t) ,

where (t) is the EPSP, modeled as (t) exp(−t/τ ), and µ0 (φ 2 )(t) is a function of the autocorrelation function of a neuron (φ 2 )(t), times the spontaneous neural activity in the absence of input, µ0 . The EPSP decay time constant used in Toyoizumi et al. was already set to τ = 10 ms, and for the two terms in the sum we used the functions described by Figures 2A and 2B in Toyoizumi et al. We performed linear regression to the one free parameter, γ. Note that for this model, we obtain better LTD, and hence a better $E^2$, for values of µ0 larger than those used in Toyoizumi et al. (2005). However, $E^2$ then still remains worse than for the other two models, and the spontaneous neural activity becomes unrealistically large.

Acknowledgments

We thank Tony Bell, Lucas Parra, and Gary Cottrell for insightful comments and encouragement. We also thank the anonymous reviewers for constructive feedback, which allowed us to improve the quality of our work and this article. The work of S.M.B. was supported by the Netherlands Organization for Scientific Research (NWO), TALENT S-62 588 and VENI 639.021.203. The work of M.C.M. was supported by National Science Foundation BCS 0339103 and CSE-SMA 0509521.

References

Abbott, L., & Gerstner, W. (2004). Homeostasis and learning through spike-timing dependent plasticity. In D. Hansel, C. Chow, B. Gutkin, & C. Meunier (Eds.), Methods and models in neurophysics. In Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bell, A., & Parra, L. (2005). Maximising sensitivity in a spiking network. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bi, G.-Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24), 10464–10472.
Bi, G.-Q., & Poo, M.-M. (2001). Synaptic modification by correlated activity: Hebb’s postulate revisited. Ann. Rev. Neurosci., 24, 139–166.


Burkitt, A., Meffin, H., & Grayden, D. (2004). Spike timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed-point. Neural Computation, 16(5), 885–940.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.
Dayan, P. (2002). Matters temporal. Trends in Cognitive Sciences, 6(3), 105–106.
Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neural learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hennion, P. E. (1962). Algorithm 84: Simpson’s integration. Communications of the ACM, 5(4), 208.
Herrmann, A., & Gerstner, W. (2001). Noise and the PSTH response to current transients: I. General theory and application to the integrate-and-fire neuron. J. Comp. Neurosci., 11, 135–151.
Hopfield, J., & Brody, C. (2004). Learning rules and network repair in spike-timing-based computation. PNAS, 101(1), 337–342.
Izhikevich, E., & Desai, N. (2003). Relating STDP to BCM. Neural Computation, 15, 1511–1523.
Jolivet, R., Lewis, T., & Gerstner, W. (2003). The spike response model: A framework to predict neuronal spike trains. In O. Kaynak, E. Alpaydin, E. Oja, & L. Yu (Eds.), Proc. Joint International Conference ICANN/ICONIP 2003 (pp. 846–853). Berlin: Springer.
Kandel, E. R., Schwartz, J., & Jessell, T. M. (2000). Principles of neural science. New York: McGraw-Hill.
Karmarkar, U., Najarian, M., & Buonomano, D. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382.
Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.


Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742.
Kepecs, A., van Rossum, M., Song, S., & Tegner, J. (2002). Spike-timing-dependent plasticity: Common themes and divergent vistas. Biol. Cybern., 87, 446–458.
Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M.-M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Paninski, L., Pillow, J., & Simoncelli, E. (2005). Comparing integrate-and-fire models estimated using intracellular and extracellular data. Neurocomputing, 65–66, 379–385.
Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing. Neural Computation, 18, 1318–1348.
Porr, B., & Wörgötter, F. (2003). Isotropic sequence order learning. Neural Computation, 15(4), 831–864.
Rao, R., & Sejnowski, T. (1999). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Rao, R., & Sejnowski, T. (2001). Spike-timing-dependent plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.
Roberts, P., & Bell, C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.
Saudargiene, A., Porr, B., & Wörgötter, F. (2004). How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Computation, 16, 595–625.
Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67.
Shon, A., Rao, R., & Sejnowski, T. (2004). Motion detection and prediction through spike-timing dependent plasticity. Network: Comput. Neural Syst., 15, 179–198.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Song, S., Sjöström, P. J., Reigl, M., Nelson, S., & Chklovskii, D. B. (2005). Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology, 3(3), e68.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.


van Rossum, R., Bi, G.-Q., & Turrigiano, G. (2000). Stable Hebbian learning from spike time dependent plasticity. J. Neurosci., 20, 8812–8821.
Xie, X., & Seung, H. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909.
Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received April 6, 2005; accepted June 19, 2006.

LETTER

Communicated by Alexandre Pouget

Fast Population Coding

Quentin J. M. Huys
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

Richard S. Zemel
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5

Rama Natarajan
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5

Peter Dayan
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

Uncertainty coming from the noise in its neurons and the ill-posed nature of many tasks plagues neural computations. Maybe surprisingly, many studies show that the brain manipulates these forms of uncertainty in a probabilistically consistent and normative manner, and there is now a rich theoretical literature on the capabilities of populations of neurons to implement computations in the face of uncertainty. However, one major facet of uncertainty has received comparatively little attention: time. In a dynamic, rapidly changing world, data are only temporarily relevant. Here, we analyze the computational consequences of encoding stimulus trajectories in populations of neurons. For the most obvious, simple, instantaneous encoder, the correlations induced by natural, smooth stimuli engender a decoder that requires access to information that is nonlocal both in time and across neurons. This formally amounts to a ruinous representation. We show that there is an alternative encoder that is computationally and representationally powerful in which each spike contributes independent information; it is independently decodable, in other words. We suggest this as an appropriate foundation for understanding time-varying population codes. Furthermore, we show how adaptation to temporal stimulus statistics emerges directly from the demands of simple decoding.

Neural Computation 19, 404–441 (2007). © 2007 Massachusetts Institute of Technology

1 Introduction

From the earliest neurophysiological investigations in the cortex, it became apparent that sensory and motor information is represented in the joint activity of large populations of neurons (Barlow, 1953; Georgopoulos, Schwartz, & Kettner, 1983). There are by now substantial ideas and data about how these representations are formed (Rao, Olshausen, & Lewicki, 2002), how information can be decoded from recordings of this activity (Paradiso, 1988; Snippe & Koenderinck, 1992; Seung & Sompolinsky, 1993), and how various sorts of computations, including uncertainty-sensitive, Bayesian optimal statistical processing, can be performed through the medium of feedforward and recurrent connections among the populations (Pouget, Zhang, Deneve, & Latham, 1998; Deneve, Latham, & Pouget, 2001). Critical issues that have emerged from these analyses are the forms of correlations between neurons in the populations, whether these correlations are significant for decoding and computation, and what sorts of prior information are relevant to computations and can be incorporated by such networks.

However, many theoretical investigations into population coding have so far somewhat neglected a major dimension of coding: time. This is despite the beautiful and influential analyses of circumstances in which individual spikes contribute importantly to the representation of rapidly varying stimuli (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reinagel & Reid, 2000; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Johansson & Birznieks, 2004) and the importance accorded to fast-timescale spiking by some practical investigations into population coding (Wilson & McNaughton, 1993; Schwartz, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998). The assumption is often made that encoded objects do not vary quickly with time and that therefore spike counts in the population suffice. Even some approaches that consider fast decoding (Brunel & Nadal, 1998; Van Rullen & Thorpe, 2001) treat stimuli as being discrete and separate rather than as evolving along whole trajectories.

In this letter, we study the generic computational consequences of population coding in time. We analyze decoding in time as a proxy for computation in time, as it is the most comprehensive computation that can be performed (accessing all information present). Decoding therefore constitutes a canonical test (Brown et al., 1998; Zhang et al., 1998). We consider a regime in which stimuli are not static and create sparse trains of spikes. Decoding trajectory information from these population spike trains is thoroughly ill posed, and prior information about what trajectories are likely


comes to play a critical role. We show that optimal decoding with ecological priors formally couples together the spikes, making trajectory inference computationally very hard. We thus consider the prospects for neural populations to recode the information about the trajectory into new sets of spikes that do support simple computations. Phenomena reminiscent of adaptation emerge as a by-product of the maintenance of a computationally advantageous code.

We analyze the extension of one of the simplest ideas about population codes for static stimuli (Snippe & Koenderinck, 1992) to the case of trajectories. This links a neurally plausible population encoding model with a naturally realistic gaussian process prior. Unlike some previous work on decoding in time (Brown et al., 1998; Zhang et al., 1998; Smith & Brown, 2003), we do not confine ourselves to recursively specifiable priors and can therefore treat smoother cases. It is these smooth priors that render decoding, and likely other computations, hard and inspire an energy-based (product of experts) recoding (Hinton, 1999; Zemel, Huys, Natarajan, & Dayan, 2005), which makes for readier probabilistic computation.

Section 2 starts with a simple encoding model. It introduces the need for priors, their shape, and analytical results for decoding in time. Section 3 shows how priors determine the form in which information is available to downstream neurons. We show that the decoder corresponding to the simple encoder can be extraordinarily complex, meaning that the encoded information is not readily available to downstream neurons. Finally, section 4 proposes a representation that has comparable power but for which decoding requires vastly less downstream computation.

2 A Gaussian Process Prior Approach

As a motivating example, consider tennis. The player returning a serve has to predict the position of the ball based on data acquired in fractions of seconds. Experts compensate for the extraordinarily sparse stimulus information with a very rich temporal prior over ball trajectories and thus make predictions that are accurate enough to guarantee many a winning return. Figure 1 illustrates the setting of this article more formally. It shows an array of neurons with partially overlapping tuning functions that emit spikes in response to a stimulus that varies in time. These could be V1 neurons responding to a target (the tennis ball) as it moves through their receptive fields, or hippocampal neurons with place fields firing as a rat explores an environment. The task is to decode the spikes in time, that is, recover the trajectory of the stimulus (the ball’s position, say) based on the spikes, a knowledge of the neuronal tuning functions (cf. Brown et al., 1998; Zhang et al., 1998, for hippocampal examples), and some knowledge about the temporal characteristics of the stimulus (the prior). In Figure 1, the ordinate represents the stimulus space (here one-dimensional for illustrative

[Figure 1 plot. Ordinate: neuron position/space. Abscissa: time [ms], 0 to 250.]

Figure 1: The problem: Reconstructing the stimulus as a function of time, given the spikes emitted by a population of neurons. When a neuron with preferred stimulus si emits a spike at time t, a black dot is plotted at (t, si ). A few example tuning functions are shown in gray. The ordinate represents stimulus space, with each neuron being positioned according to its preferred stimulus si . The decoding problem is related to fitting a line through these points, which is achievable only if there is prior information about the line to be fitted (e.g., the order of a polynomial fit or the smoothness).

purposes) and the abscissa, time. Neuron i has preferred stimulus $s_i$. If it emits a spike $\xi_t^i$ at time t, a dot is drawn at position $(t, s_i)$. The dots in Figure 1 thus represent the spiking activity of the entire population of neurons over time. Our aim is to find, for each observation time T, a distribution over likely stimulus values $s_T$ given all the spikes previous to T. This is related to fitting a line representing the trajectory of the stimulus through the points. It is a thoroughly ill-posed problem, for instance, because we are not given any information about the stimulus at all between the spikes. To solve this ill-posed problem, we have to bring in additional knowledge in the form of a prior distribution about likely stimulus trajectories. The prior distribution specifies the temporal characteristics of the trajectories (e.g., how smooth they are) and also whether they live within some constrained part of the stimulus space. Subjects are assumed to possess such prior information ahead of time, for instance, from previous exposures to trajectories (a good tennis player will have seen many serves).

To gain analytical insight into the structure of decoding in this temporally rich case, we consider a very simple spiking model $p(\xi_t^i \mid s_t)$ (cf. Snippe & Koenderinck, 1992, for the static case), augmented with a simple prior over stimulus trajectories $p(\mathbf{s})$. We thereafter follow standard approaches (Zhang et al., 1998; Brown et al., 1998) by performing causal decoding and thus recovering $p(s_T \mid \xi)$ over the current stimulus $s_T$ at time T given all the J spikes $\xi \equiv \{\xi^{i}_{t_j}\}_{j=1}^{J}$ at times $0 < \{t_j\}_{j=1}^{J} < T$ in the observation period ([0, T)), emitted by the entire population. Here, i = 1, ..., N designates the neuron that emitted the spike. To state the problem in mathematical terms, we can write (at least for the case that there is no spike at time T itself)

\begin{align}
p(s_T \mid \xi) &\propto p(s_T)\, p(\xi \mid s_T) \tag{2.1}\\
&= p(s_T) \int d\mathbf{s}_T\; p(\xi \mid \mathbf{s}_T)\, p(\mathbf{s}_T \mid s_T), \tag{2.2}
\end{align}

where, being slightly notationally sloppy, we are integrating over stimulus trajectories $\mathbf{s}_T$ up to, but not including, time T, but restricted to just those trajectories that end at $s_T$. Equation 2.2 lays bare the two parts of the definition of the problem. One is the likelihood $p(\xi \mid \mathbf{s}_T)$ of the spikes given the trajectory. This will be assumed to arise from a Poisson-gaussian spiking model. The other is the prior

$$p(s_T)\, p(\mathbf{s}_T \mid s_T) = p(\mathbf{s}) \tag{2.3}$$

over the trajectories. This will be assumed to be a gaussian process.

2.1 Poisson-Gaussian Spiking Model. We first define the spiking model. Let $\phi_i(s)$ be the tuning function of neuron i and assume independent, inhomogeneous, and instantaneous Poisson neurons (Snippe & Koenderinck, 1992; Brown et al., 1998; Barbieri et al., 2004). Let j be an index running over all the spikes in the population, with i(j) reporting the index of the neuron that spiked at time $t_j$. Then, from the basic definition of an inhomogeneous Poisson process, the likelihood of a particular population spike train ξ given the stimulus trajectory $\mathbf{s}_T$ can be written as

\begin{align}
p(\xi \mid \mathbf{s}_T) &= \prod_j \phi_{i(j)}(s_{t_j}) \exp\left(-\sum_i \int_0^T dt\, \phi_i(s_t)\right) \tag{2.4}\\
&\propto \prod_j \phi_{i(j)}(s_{t_j}), \tag{2.5}
\end{align}

assuming that the trajectories are such that we can swap the order of the sum and the integral in the exp(·), that tuning functions are sufficiently dense that the sum spiking rate is constant independent of the location of the stimulus $s_t$, and that no two neurons ever fire together.
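The density assumption behind the step from equation 2.4 to equation 2.5 can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the gaussian tuning functions that the text introduces in the following paragraph, and the population layout (preferred stimuli spaced at half a tuning width), the parameter values, and the function names are all hypothetical.

```python
import math

def tuning(s, s_pref, phi_max=1.0, sigma=0.1):
    """Gaussian tuning function with peak rate phi_max and width sigma."""
    return phi_max * math.exp(-((s - s_pref) ** 2) / (2.0 * sigma ** 2))

def summed_rate(s, centers, phi_max=1.0, sigma=0.1):
    """Total population firing rate at stimulus value s."""
    return sum(tuning(s, c, phi_max, sigma) for c in centers)

# Hypothetical population: preferred stimuli every 0.05 on [-2, 2].
centers = [-2.0 + 0.05 * k for k in range(81)]
# Probe stimuli well inside the covered range, in [-0.5, 0.5].
probe = [-0.5 + 0.01 * k for k in range(101)]
rates = [summed_rate(s, centers) for s in probe]

# Relative variation of the summed rate across probe stimuli.
ripple = (max(rates) - min(rates)) / max(rates)
```

When the spacing of preferred stimuli is small relative to σ, `ripple` is negligible, so the exponential factor in equation 2.4 does not depend on the trajectory and can be absorbed into the proportionality constant. Near the edges of the covered range, or for sparse populations, the approximation degrades.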


Finally, we assume squared-exponential (gaussian) tuning functions,

$$\phi_i(s_{t_j}) = \phi_{\max} \exp\left(-\frac{(s_{t_j} - s_i)^2}{2\sigma^2}\right),$$

where $\phi_{\max}$ is the maximal firing rate of a neuron and $s_i$ the ith neuron’s preferred stimulus. Combining this with our previous assumptions (see equation 2.5) and completing the square implies that

$$p(\xi \mid \mathbf{s}_T) \propto \phi_{\max} \exp\left(-\frac{(\mathbf{s}_\xi - \boldsymbol{\theta})^{\mathsf{T}} (\mathbf{s}_\xi - \boldsymbol{\theta})}{2\sigma^2}\right), \tag{2.6}$$

where the spikes from the entire population have been ordered in time; the jth component of both $\mathbf{s}_\xi$ and $\boldsymbol{\theta}$ corresponds to the jth spike and is, respectively, the stimulus at that spike’s time $t_j$ and the preferred stimulus $s_i$ of the neuron that produced it. Note that time is continuous here.

2.2 Gaussian Process Prior. The prior $p(\mathbf{s})$ defines a distribution over stimulus trajectories that are continuous in time. However, $p(\xi \mid \mathbf{s}_T)$ in equation 2.6 depends on only the times $t_j$ at which neurons in the population spike. Thus, in the integral in equation 2.2, we can formally marginalize or integrate out all the nonspiking times, making the key quantity to be defined by the prior $p(\mathbf{s}_\xi, s_T)$. For a gaussian process (GP), this quantity is a multivariate gaussian, defined by its (J + 1)-dimensional mean vector $\mathbf{m}$ and covariance matrix C, which can in general depend on the times $t_j$. We write the distribution as

$$p(\mathbf{s}_\xi, s_T) \sim \mathcal{N}(\mathbf{m}, C), \qquad C_{t_j t_{j'}} = c \exp\left(-\alpha\, |t_j - t_{j'}|^{\zeta}\right). \tag{2.7}$$
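For ζ = 1, the covariance in equation 2.7 is c exp(−α|t_j − t_j'|), which is the stationary covariance of a first-order autoregressive recursion. The following minimal sketch draws one such trajectory on a regular time grid with zero mean. The function names, the grid, and the parameter values are assumptions chosen for illustration; they are not taken from the article or its MATLAB code.

```python
import math
import random

def ou_step_params(alpha, c, dt):
    """AR(1) coefficient a and innovation variance q such that the
    stationary covariance at lag tau is c * exp(-alpha * |tau|)."""
    a = math.exp(-alpha * dt)
    q = c * (1.0 - a * a)
    return a, q

def sample_ou(n_steps, alpha, c, dt, rng):
    """Draw a zero-mean trajectory sampled every dt time units."""
    a, q = ou_step_params(alpha, c, dt)
    s = rng.gauss(0.0, math.sqrt(c))  # start from the stationary prior N(0, c)
    path = [s]
    for _ in range(n_steps - 1):
        s = a * s + rng.gauss(0.0, math.sqrt(q))
        path.append(s)
    return path

rng = random.Random(0)  # fixed seed so the draw is reproducible
trajectory = sample_ou(n_steps=250, alpha=0.05, c=0.04, dt=1.0, rng=rng)
```

The recursion works because the ζ = 1 process is Markovian: the next value depends on the past only through the current value. Sampling for other values of ζ would instead require drawing from the full multivariate gaussian defined by C.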

The parameter ζ ≥ 0 dictates the smoothness and the correlation structure of the process. If ζ = 0, then the stimulus is assumed to be constant (we sometimes call this the static case). Setting ζ = 1 corresponds to assuming that the stimulus evolves as an Ornstein-Uhlenbeck (OU) or first-order autoregressive process. This is the generative model underlying Kalman filters (Twum-Danso & Brockett, 2001) and generates an autocorrelation with the Fourier spectrum $\sim 1/\omega^2$ often observed experimentally (Atick, 1992; Dong & Atick, 1995; Wang, Liu, Sanchez-Vives, & McCormick, 2003). This can be generalized to nth-order autoregressive processes. Setting ζ = 2 leads to the opposite end of the spectrum, with smooth trajectories that are non-Markovian. The parameter α dictates the temporal extent of the correlations and c their overall size (c also parameterizes the scale of the overall process). Example trajectories drawn from these priors for ζ ∈ {1, 2} are shown in Figure 2. For most of the letter, we will let $\mathbf{m} = 0$. Assuming

[Figure 2 plot. Panel A: Smooth; panel B: OU. Abscissa: time [ms].]

Figure 2: Example trajectories drawn from the prior distribution in equation 2.7. (A) Examples for the smooth covariance matrix with ζ = 2. (B) The OU covariance matrix, ζ = 1.

a GP prior with a particular covariance matrix is exactly equivalent to regularizing the autocorrelation of the trajectory.

2.3 Posterior. Making these assumptions, we can write down the posterior distribution $p(s_T \mid \xi)$ analytically by solving equation 2.2. It is a simple gaussian distribution with mean µ(T) and variance $\nu^2(T)$ given in terms of tuning function widths σ, the vector $\boldsymbol{\theta}$, and the covariance matrix C. All three terms in equation 2.2 are now defined. The conditional distribution $p(\mathbf{s}_\xi \mid s_T)$ is given in terms of the partitioned covariance matrix C,

$$p(\mathbf{s}_\xi \mid s_T) = \mathcal{N}_{\mathbf{s}_\xi}\left(C_{\xi T} C_{TT}^{-1} s_T,\; C_{\xi\xi} - C_{\xi T} C_{TT}^{-1} C_{T\xi}\right),$$

where $C_{\xi\xi}$ is the covariance matrix of the stimulus at all the spike times, $C_{T\xi}$ and $C_{\xi T}$ are vectors with the cross-covariances between the spike times and the observation time T, and $C_{TT}$ is the marginal (static) stimulus prior at the observation time (constant for the stationary processes considered here). The corresponding partitioning of the matrix C is

$$C = \begin{pmatrix} C_{\xi\xi} & C_{\xi T} \\ C_{T\xi} & C_{TT} \end{pmatrix}. \tag{2.8}$$


The remaining two terms in equation 2.2 are given by $p(s_T) = \mathcal{N}_{s_T}(0, C_{TT})$ and equation 2.6. As the integral in equation 2.2 is a convolution of two gaussians, the variances add, and the integral evaluates to

$$p(\xi \mid s_T) = \mathcal{N}_{\boldsymbol{\theta}}\left(C_{\xi T} C_{TT}^{-1} s_T,\; C_{\xi\xi} - C_{\xi T} C_{TT}^{-1} C_{T\xi} + I\sigma^2\right).$$

Finally, taking a product with $p(s_T)$, renormalizing, and applying the matrix inversion lemmas (see appendix A), we get

\begin{align}
\mu(T) &= \mathbf{k}(\xi, T) \cdot \boldsymbol{\theta}(T) \tag{2.9}\\
\nu^2(T) &= C_{TT} - \mathbf{k}(\xi, T) \cdot C_{\xi T} \tag{2.10}\\
\mathbf{k}(\xi, T) &= C_{T\xi}\, (C_{\xi\xi} + I\sigma^2)^{-1}. \tag{2.11}
\end{align}
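Equations 2.9 to 2.11 can be evaluated directly for a handful of spikes. The following is a minimal sketch under stated assumptions: zero prior mean, the covariance function of equation 2.7, and a small hand-written linear solver so that the example needs only the standard library. The function names, spike times, preferred stimuli, and parameter values are hypothetical.

```python
import math

def cov(t1, t2, c, alpha, zeta):
    """Covariance function of equation 2.7."""
    return c * math.exp(-alpha * abs(t1 - t2) ** zeta)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def posterior(spike_times, theta, T, sigma, c, alpha, zeta):
    """Posterior mean, variance, and temporal kernel for s_T (prior mean 0)."""
    J = len(spike_times)
    # C_xixi + I sigma^2
    A = [[cov(spike_times[a], spike_times[b], c, alpha, zeta)
          + (sigma ** 2 if a == b else 0.0) for b in range(J)] for a in range(J)]
    c_T = [cov(T, t, c, alpha, zeta) for t in spike_times]  # C_Txi
    # A is symmetric, so k = C_Txi A^{-1} is obtained by solving A k = C_xiT.
    k = solve(A, c_T)
    mu = sum(ki * th for ki, th in zip(k, theta))                         # eq. 2.9
    var = cov(T, T, c, alpha, zeta) - sum(ki * ci for ki, ci in zip(k, c_T))  # eq. 2.10
    return mu, var, k

# Hypothetical spikes: times in ms and preferred stimuli of the spiking neurons.
mu, var, k = posterior([10.0, 40.0, 90.0], [0.1, 0.2, 0.15],
                       T=100.0, sigma=0.1, c=0.04, alpha=0.02, zeta=1)
```

Note that `var` is computed without reference to `theta`, in line with the observation that the posterior variance depends on when spikes occur but not on which neurons fired them. For long spike trains a Cholesky-based solver from a numerical library would be the more usual choice.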

The mean µ(T) of the gaussian posterior is thus a weighted sum of the preferred stimulus of those neurons that emitted particular spikes. The weights are given by what we term the temporal kernel $\mathbf{k}(\xi, T)$. As we will see, the weight given to each spike will depend strongly on the time at which it occurred. A spike that occurred in the distant past will be given small weight. The posterior variance depends on only C and $\sigma^2$. Remember that C depends on only the times of spikes, not the identities of the neurons that fired them. The posterior variance $\nu^2$, similar to a Kalman filter, depends on only when data are observed, not what data. This depends on the squared exponential nature of the tuning functions φ, and other tuning functions (e.g., with nonzero baselines) may not lead to this quality. However, it will not affect the conclusions reached below. This posterior distribution $p(s_T \mid \xi)$ is well known in the GP literature as the predictive distribution (MacKay, 2003).

2.4 Structure of the Code. The operations needed to evaluate the posterior $p(s_T \mid \xi)$ give us insight into the structure of the code and will be analyzed in section 3 for various priors. If the posterior is a function of combinations of spikes, postsynaptic neurons have to have simultaneous access to all those spikes. This point will be critical in temporal codes, as the spikes to which access is required are spread out in time. Only if spikes are interpretable independently can they be forgotten once they have been used for inference. All information the spikes contribute to some future time $T' > T$ is then contained within $p(s_T \mid \xi)$. If the posterior depends on combinations of spikes (as will be the case for ecological, smooth priors), information that can be extracted from a spike about times $T' > T$ is not entirely contained within $p(s_T \mid \xi)$. As a result, past spikes have to be stored and the posterior recomputed using them, an operation that is nonlocal in time. We will show that under ecological priors, the posterior depends on spike combinations and is thus complex. Decoding for the simple encoder (the spiking model) is thus hard. In section 4, we will illustrate the type


of computations (“recoding”) a network has to perform to access all the information. This will be equivalent to finding a new, complex encoder in time for which decoding is simple.

3 Effect of the Prior

The effect of the prior manifests itself very clearly in the temporal kernels $k(\xi, T)$ from equation 2.11 and in the independence structure of the code. We show this by analyzing a representative set of priors in terms of both the behavior of the temporal kernels and the structure of the code, including priors that generate constant, varyingly rough, and entirely smooth trajectories. (MATLAB example code can be downloaded from http://www.gatsby.ucl.ac.uk/~qhuys/code.html.)

3.1 Constant Stimulus Prior ζ = 0. We first show that our treatment of the time-varying case is an exact generalization of the case in which the stimulus is fixed (does not change relative to the mean $m$), by rederiving classical results for static stimuli. Snippe and Koenderinck (1992) have shown that the posterior mean and variance (under a flat prior) are given by a weighted spike count,

$$\mu(T) = \frac{\sum_i n_i(T)\, s_i}{J(T)}, \qquad \nu^2(T) = \frac{\sigma^2}{J(T)}, \tag{3.1}$$

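where $n_i(T) = \int_0^T dt\, \xi^i_t$ is the $i$th neuron's spike count and $J(T) = \sum_i n_i(T)$ is the total population spike count at time $T$. If we let ζ = 0, the matrix $C_{\xi\xi} = c\,\mathbf{n}\mathbf{n}^T$, where $\mathbf{n}$ is a $J(T) \times 1$ vector of ones. Equations 2.9 to 2.11 can then be solved analytically:

$$\begin{aligned}
\left[(C_{\xi\xi} + I\sigma^2)^{-1}\right]_{ij} &= \frac{(\sigma^2 + cJ(T))\,\delta_{ij} - c}{\sigma^2\,(\sigma^2 + cJ(T))} \\
k(\xi, T) &= \frac{c}{\sigma^2 + cJ(T)}\,\mathbf{n} \\
\mu(T) &= \frac{c\,\sum_i n_i(T)\, s_i}{\sigma^2 + cJ(T)} \\
\nu^2(T) &= \frac{c\,\sigma^2}{\sigma^2 + cJ(T)},
\end{aligned}$$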

which is the exact analog of equation 3.1 for an informative prior (equation 3.1 is recovered in the limit $c \to \infty$). The temporal kernel $k(\xi, T)$ does not decay but is flat, with a magnitude $c/(\sigma^2 + cJ(T))$ that scales as $1/J(T)$ once many spikes have been observed. The contribution of each neuron to the mean $\mu(T)$ is given by its spike count $n_i(T)$. Each spike is given the same weight,

Figure 3: Comparison of static and dynamic inference. Throughout, the posterior density $p(s_T|\xi)$ is indicated by gray shading, the spikes are vertical (gray) lines with dots, and the true stimulus is the line at the top of each plot. (A) Static stimulus, constant temporal kernel. (B) Moving stimulus, constant temporal kernel. (C) Static stimulus, decaying temporal kernel. (D) Moving stimulus, decaying temporal kernel. A and D show that only a match between true stimulus statistics and prior allows the posterior to capture the stimulus well.

which is a sensible approach only if spikes are eternally informative about the stimulus. This is true only if the covariance matrix is flat, which itself implies that the only time-varying component of the stimulus is in the mean $m$ and not in the covariance $C$. If the stimulus is a varying function of time $s(t)$, spikes at time $t$ are informative only about the stimulus at times $t'$ close to $t$, and the influence of each spike on the posterior should fade away with time. This is illustrated in Figure 3. Figure 3A shows the present static case, where the stimulus indeed does not move; over time, the posterior $p(s_T|\xi)$ sharpens up around the true value. However, if the stimulus does move, the posterior ends up at the wrong value (see Figure 3B). If the temporal kernel $k(\xi, T)$ decays, this amounts to downweighting spikes observed in the more distant past. Figure 3C shows that a decaying kernel leads to a posterior that widens in between spikes. This is incorrect if the stimulus is static, but Figure 3D shows how such a decaying temporal kernel would, in contrast to Figure 3B, allow $p(s_T|\xi)$ to track the moving stimulus correctly. In the following, we analyze the behavior of $p(s_T|\xi)$ and the optimal temporal kernel $k(\xi, T)$ for various stimulus autocorrelation functions.
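The closed-form solution for the constant-stimulus prior in section 3.1 can be checked numerically. The sketch below is illustrative only: it assumes a prior mean $m = 0$, takes the posterior mean to be the kernel-weighted sum of the preferred stimuli of the spiking neurons, and uses the standard GP predictive variance $C_{TT} - k\,C_{\xi T}$ in place of equation 2.10, which is not reproduced in this section. The function name and the parameter values are hypothetical.

```python
import numpy as np

def static_posterior(spike_prefs, c, sigma2):
    """Posterior mean, variance, and kernel for a constant-stimulus prior (zeta = 0).

    spike_prefs: preferred stimulus of the neuron behind each observed spike.
    c: prior variance of the stimulus; sigma2: squared tuning width.
    """
    s = np.asarray(spike_prefs, dtype=float)
    J = s.size
    C_xx = c * np.ones((J, J))   # flat covariance between all spike times
    C_Tx = c * np.ones(J)        # covariance between time T and the spike times
    # Temporal kernel k = C_Tx (C_xx + sigma2 I)^{-1}; the matrix is symmetric,
    # so solving the linear system gives the same vector.
    k = np.linalg.solve(C_xx + sigma2 * np.eye(J), C_Tx)
    mu = k @ s                   # weighted sum of preferred stimuli (m = 0 assumed)
    nu2 = c - k @ C_Tx           # assumed GP predictive variance
    return mu, nu2, k

prefs = [0.2, -0.1, 0.4, 0.3]    # hypothetical preferred stimuli, one per spike
c, sigma2 = 1.5, 0.25
mu, nu2, k = static_posterior(prefs, c, sigma2)
J = len(prefs)
print(np.allclose(k, c / (sigma2 + c * J)))
print(np.isclose(mu, c * sum(prefs) / (sigma2 + c * J)))
print(np.isclose(nu2, c * sigma2 / (sigma2 + c * J)))
```

Every entry of the numerically computed kernel equals $c/(\sigma^2 + cJ)$, which is the flat, non-decaying weighting discussed above.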

Figure 4: Posterior distribution $p(s_T|\xi)$ for OU prior (axes: space versus time [s]). Same representation as in Figure 1. The dashed line shows the actual stimulus trajectory used to generate the spikes, the dots are the spikes, the posterior distribution is in gray scale, and the solid line shows the posterior mean. Between spikes, the posterior mean decays exponentially back toward the mean $m$ (here 0), and the variance approaches the static prior variance $C_{TT}$.

3.2 Nonsmooth (Ornstein-Uhlenbeck) Prior ζ = 1. Setting ζ = 1 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a random walk with drift to zero (an OU process):

$$ds = -(1 - e^{-\alpha})\, s(t)\, dt + \sqrt{c\,(1 - e^{-2\alpha})\, dt}\;\, dN(t), \tag{3.2}$$

with gaussian noise $N(t) \sim \mathcal{N}(0, 1)$ and parameters as in equation 2.7. The OU process is the underlying generative process assumed by standard Kalman filters. The simplicity of Kalman-filter-like formulations explains some of their wide applicability and success (Brown et al., 1998; Barbieri et al., 2004). However, as indicated visually by the example trajectories in Figure 2, the rough trajectories this prior favors are not a good model of smooth biological movements (see also section 5). Figure 4 shows a sampled stimulus trajectory, sample spikes generated from it, and the posterior distribution $p(s_T|\xi)$. The mean of the posterior does a good job of tracking the true underlying stimulus trajectory and is never more than two standard deviations away from it. Between spikes, the mean simply moves back to zero (albeit rather slowly given the parameters associated with the figure shown). Figure 5A displays example temporal kernels $k(\xi, T)$ for inference in this process. They are very close to exponentials (note the logarithmic ordinate). This makes intuitive sense, as an OU process is a first-order Markov process (it can be rewritten as a first-order difference equation). In fact, assuming the spikes arrive regularly (replacing each of the interspike intervals (ISIs) by their average value $\Delta = \frac{1}{J}\sum_j (t_j - t_{j-1}) \propto 1/\phi_{\max}$) allows us to write the

Figure 5: OU temporal kernels for ζ = 1 (axes in both panels: kernel size (weight) on a log scale versus $T - t_k$). (A) Example of temporal kernels. The top traces are for lower and the bottom for higher average firing rates. The gray traces show temporal kernels for Poisson spike trains. The components of the vector $k(\xi, T)$ are plotted against the corresponding spike time. The dashed black traces show temporal kernels for regular spike arrivals (metronomic temporal kernels). The true (gray) temporal kernels are relatively tightly bunched around the metronomic temporal kernel. The firing rate affects the slope of the kernel but not its overall scale. (B) The effect of the time since the last spike on the temporal kernel is an overall multiplicative scaling. There is no effect on the slope.

jth component of $k(\xi, T)$ as

$$k_j \approx d_1\, \lambda_1^{\,j-1},$$

where $d_1$ and $\lambda_1$ are constants defined in appendix B. For such metronomic spiking, $k(\xi, T)$ is thus simply a decaying exponential. Somewhat similar expressions can be obtained for the original case of Poisson-distributed ISIs (see appendix B). Figure 5A shows that the metronomic approximation provides a generally good fit, capturing especially the slope of the true temporal kernels, which depends mostly on the correlation length α and the maximal (or average) firing rate $\phi_{\max}$. The remaining quality of the fit is influenced most strongly by the match between the average ISI $\Delta$ and the time since the last spike $T - t_J$ (which takes effect through $C_{T\xi}$ in equations 2.8 and 2.9 to 2.11). This determines the overall scale of the temporal kernel. The factors influencing the slope of the temporal kernel and its height do not interact greatly; that is, $T - t_J$ does not affect the slope (shape) of the temporal kernel, only its magnitude, as shown in Figure 5B (metronomic temporal kernels are used for clarity, but the argument applies equally to the exact kernel). Conversely, $\Delta$ affects mostly the slope. Replacing the true

Figure 6: Comparison between exact and metronomic kernels (axes: space versus time [s]). Same representation as in Figure 4. (A) Exact posterior $p(s_T|\xi)$. (B) Approximate posterior derived by replacing all ISIs by their average $\Delta$ but keeping $T - t_J$. This corresponds to approximating the true kernels with the metronomic kernels in Figure 5. The approximation is very close.

temporal kernels by metronomic temporal kernels, that is, replacing all ISIs by their average $\Delta$ but keeping the time since the last spike $T - t_J$, does not greatly degrade $p(s_T|\xi)$ (cf. Figures 6A and 6B). The dependence in Figure 5B can be understood by writing out the integrand of equation 2.2 in detail for the OU prior. This factorizes over potentials involving duplets of spikes because, as we show in appendix B, $C^{-1}$ is tridiagonal, implying that the elements of $C^{-1}$ involve only two successive spikes:

$$\begin{aligned}
p(s_\xi, s_T) &\propto \exp\left(-\frac{1}{2}\begin{bmatrix} s_\xi \\ s_T \end{bmatrix}^T C^{-1} \begin{bmatrix} s_\xi \\ s_T \end{bmatrix}\right) \\
&= \exp\left(-\frac{1}{2}\left[\sum_{j=1}^{J+1} s_{t_j}^2\, C^{-1}_{t_j t_j} + 2\sum_{j=1}^{J} s_{t_j}\, C^{-1}_{t_j, t_{j+1}}\, s_{t_{j+1}}\right]\right) \\
p(s_\xi, s_T) &= \psi(s_T) \prod_{j=1}^{J} \psi(s_{t_j}, s_{t_{j+1}}),
\end{aligned} \tag{3.3}$$

where tJ stands for the time of the last spike, tJ −1 the time of the penultimate one, and so on, and the observation time T = tJ +1 . Note that the last equality

Figure 7: Natural trajectories are smooth. (A) Position of a rat freely exploring a square environment (both axes: space [cm]). (B) Covariance function of the position along the ordinate (gray, dashed line) and a quadratic approximation (black, solid line), plotted as $\log C(t_i - t_j)$ against $t_i - t_j$ [s]. Note the logarithmic ordinate. The smoothing applied to eliminate artifacts was of a timescale short enough not to interfere with the overall shape of the covariance function.

implies that the determinant also factors over spike pairs. This means that the integrations over each spike in the main equation 2.2 can be written in a recursive form akin to that used in message-passing algorithms (MacKay, 2003) and the exact Kalman filter.

3.3 Smooth Prior ζ = 2. Setting ζ = 2 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a non-Markov random walk. Trajectories with this autocovariance function are smooth (Figure 2A shows some sample trajectories generated from the prior) and infinitely differentiable. The smoothness makes it a more ecologically relevant prior for Bayesian decoding from movement-related trajectories than nonsmooth priors, since natural objects (and limbs) move along smooth trajectories rather than jumping. As an example, Figure 7A shows trajectories of a rat exploring a square environment (data kindly provided by Lever, Wills, Cacucci, Burgess, & O’Keefe, 2002). Not only are these natural trajectories smooth, but Figure 7B also shows that a squared exponential covariance function closely approximates the real covariance function.¹

¹ Only the center of the covariance function is shown here. Due to the small size of the environment, the rat runs back and forth along the entire available length, and there are oscillating flanks to the covariance function for delays larger than those shown.
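The structural difference between the OU prior of section 3.2 and the smooth prior can be illustrated numerically by inverting the prior covariance at a handful of spike times. The sketch below is not the authors' code. It assumes a covariance of the form $c\,e^{-\alpha|t - t'|^{\zeta}}$, which is consistent with the three cases described here but is not reproduced from equation 2.7, and the spike times and helper names are hypothetical.

```python
import numpy as np

def prior_covariance(times, zeta, c=1.0, alpha=1.0):
    """Assumed covariance C(t, t') = c * exp(-alpha * |t - t'|**zeta)."""
    t = np.asarray(times, dtype=float)
    lag = np.abs(t[:, None] - t[None, :])
    return c * np.exp(-alpha * lag**zeta)

def max_beyond_tridiagonal(matrix):
    """Largest absolute entry more than one step away from the diagonal."""
    n = matrix.shape[0]
    i, j = np.indices((n, n))
    far = np.abs(i - j) > 1
    return np.abs(matrix[far]).max()

spike_times = np.array([0.0, 0.7, 1.0, 2.1, 2.6, 4.0])  # arbitrary, irregular
for zeta in (1, 2):
    precision = np.linalg.inv(prior_covariance(spike_times, zeta, alpha=np.log(2)))
    # zeta = 1: entries beyond the tridiagonal band vanish up to rounding error.
    # zeta = 2: the inverse covariance is dense.
    print(zeta, max_beyond_tridiagonal(precision))
```

For ζ = 1 the inverse couples only successive spikes, which is what permits the pairwise factorization of equation 3.3; for ζ = 2 every spike is coupled to every other.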

Figure 8: Posterior distribution $p(s_T|\xi)$ for the smooth prior (axes: space versus time [s]). Same representation as in Figure 4. The arrow highlights where the smooth prior uses spike combinations to constrain higher-order statistics of the process, such as velocity, acceleration, and jerk. While the smooth prior correctly predicts that the stimulus will continue away from the mean before returning, the OU process can predict only a decay back to the mean (see Figure 9). The first spike on the left is the very first spike observed. As the spike history becomes more extensive, the posterior distribution is seen to sharpen up and follow the stimulus accurately.

Figure 8 is the equivalent of Figure 4 for the smooth case and shows the posterior p(sT |ξ ). The main dynamical difference between inference in this smooth case and inference in the OU case is indicated by the arrow in the Figure. While the OU process simply decays back to the mean (here, zero for simplicity), the dynamics of the smooth posterior mean are much richer. In the absence of spikes, the mean continues in its current direction for a while before reversing back. As can be seen, this gives a better fit to the underlying stimulus trajectory (the black dashed line) than would otherwise have been achieved. It arises directly from the fact that the correlations extend essentially beyond the last spike (and into the entire past). For comparison, Figure 9 shows the posterior when the wrong prior is used. The stimulus was generated from the smooth prior, but the OU prior was used to infer the posterior. The arrow indicates where the infelicity of the inaccurate posterior is most apparent, falling back to zero instead of predicting that the stimulus will continue to move farther away from zero. In terms of difference equations, the larger extent of correlations intuitively means that the higher-order derivatives of the process are also “constrained” by the covariance C. The simple exponential temporal kernels observed in the OU process cannot give rise to the reversals observed in the smooth process. Figure 10A shows the temporal kernels for the smooth process, which have a distinctively different flavor from the OU temporal kernels (shown in Figure 5),

Figure 9: Posterior distribution $p(s_T|\xi)$ for smooth stimulus but wrongly assuming an OU prior (axes: space versus time [s]). The posterior is consistently wider than it should be (cf. Figure 8). The arrow points out where the prediction is qualitatively wrong: the OU prior allows for decay back only to zero, unlike the smooth prior. Note also that the beneficial effect of a larger spike history observed in Figure 8 is absent here.

Figure 10: Temporal kernels for the smooth prior (axes in all panels: kernel size (weight) versus $T - t_j$). (A) Exact (gray solid) and metronomic (black dashed) temporal kernels for the smooth prior with ζ = 2. The metronomic kernels again provide a close fit. (B) The metronomic temporal kernels change in a complex manner as the observation time $T$ is moved away from the time of the last spike. Unlike in the OU case, this is not just a recursively implementable multiplication. (C) The same qualitative behavior arises for kernels derived from the empirical covariance function of the rat trajectories.

including oscillating terms multiplying the exponential decay. Most important, the oscillating terms allow the weight assigned to a spike to dip below zero; that is, a spike initially signifies proximity of the stimulus to the neuron’s preferred stimulus but later swaps over, signaling that the stimulus is not there anymore. This feature of the temporal kernels gives rise to the reversals seen in the posterior mean.


As in the OU case, the metronomic temporal kernel based on equal ISIs gives a good description of the temporal kernel, mostly for spikes in the more distant past. Replacing the true temporal kernels by metronomic temporal kernels (but keeping the exact time since the last spike $T - t_J$) again does not affect the posterior strongly. Nevertheless, the Kullback-Leibler divergence between the true posterior and the metronomic posterior is larger in the smooth than in the OU case (data not shown), indicating that the exact timing of spikes is more important in the smooth inference. Unlike in the OU case, there is no simple analytical expression for the metronomic temporal kernel (let alone the true temporal kernel). In particular, Figure 10B shows that changing the time since the last observed spike $T - t_J$ does not simply scale the temporal kernel but also changes its shape (it produces a complicated phase shift of the oscillating component). Again, for clarity, the metronomic kernels are used as an illustration, but the argument also applies to the exact kernels. Local structure has complex global consequences in the smooth case, with a single new spike requiring individual reweighting of all past spikes depending on their precise times. By comparison, for the OU process, the reweighting involves multiplication by a single factor. Figure 10C shows that this temporal kernel complexity is also a feature of the temporal kernel derived from the covariance function of the empirical rat trajectories in Figure 7. The fundamental difference between the OU and the smooth temporal kernels arises from the difference in the factorization properties of the prior. Because the inverse of the covariance matrix for ζ ∉ {0, 1}, and specifically for ζ = 2, is dense, the prior does not factorize over spike combinations and therefore does not allow a recursive form.
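The recursive form available for the OU prior can be sketched as a scalar Kalman-style filter and compared with the batch computation that uses all spikes at once. This is an illustration under stated assumptions rather than the paper's implementation: each spike is treated as a gaussian observation of the stimulus at the preferred value of the neuron that fired it, with variance $\sigma^2$; the OU covariance is taken to be $c\,e^{-\alpha|t - t'|}$; and the prior mean is zero. Function names and parameter values are hypothetical.

```python
import numpy as np

def batch_posterior(times, prefs, T, c, alpha, sigma2):
    """Posterior mean and variance at time T using all spikes at once."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(prefs, dtype=float)
    C_xx = c * np.exp(-alpha * np.abs(t[:, None] - t[None, :]))
    C_Tx = c * np.exp(-alpha * np.abs(T - t))
    k = np.linalg.solve(C_xx + sigma2 * np.eye(t.size), C_Tx)  # temporal kernel
    return k @ y, c - k @ C_Tx

def recursive_posterior(times, prefs, T, c, alpha, sigma2):
    """Same posterior from one pass that stores only a mean and a variance."""
    mean, var = 0.0, c                       # stationary prior before the first spike
    previous = times[0]
    for t, y in zip(times, prefs):           # spike times must be in increasing order
        a = np.exp(-alpha * (t - previous))  # OU decay since the previous spike
        mean, var = a * mean, a**2 * var + c * (1 - a**2)
        gain = var / (var + sigma2)          # weight given to the new spike
        mean, var = mean + gain * (y - mean), (1 - gain) * var
        previous = t
    a = np.exp(-alpha * (T - previous))      # propagate to the observation time T
    return a * mean, a**2 * var + c * (1 - a**2)

times = [0.05, 0.11, 0.20, 0.26]             # hypothetical spike times [s]
prefs = [0.3, 0.5, 0.4, 0.1]                 # preferred stimuli of the spiking neurons
args = dict(T=0.30, c=1.0, alpha=5.0, sigma2=0.1)
print(np.allclose(batch_posterior(times, prefs, **args),
                  recursive_posterior(times, prefs, **args)))
```

Once a spike has been absorbed into the running mean and variance it can be discarded, which is the sense in which the OU code is local in time. No comparable two-number summary exists for the smooth prior.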
To see that a recurrence relation is possible only for the OU prior, which factorizes across duplets of spikes, write

$$p(s_T|\xi) = \int ds_J \int ds_{\setminus J}\; p(s_T, s_J, s_{\setminus J}\,|\,\xi),$$

expanding and integrating over the stimulus $s_J$ at the time $t_J$ of the last spike and over the stimulus $s_{\setminus J}$ at the times of all the spikes apart from the last,

$${}\propto \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J) \int ds_{\setminus J}\; p(s_{\setminus J}, \xi_{\setminus J}\,|\,s_T, s_J),$$

using Bayes' rule and the instantaneity of spiking,

$${}= \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J) \int ds_{\setminus J}\; p(\xi_{\setminus J}|s_{\setminus J})\, p(s_{\setminus J}\,|\,s_T, s_J),$$

again because the spikes are instantaneous,

$${}= \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J)\, m_T(s_T, s_J, \xi_{\setminus J}). \tag{3.4}$$

Were $m_T(s_T, s_J, \xi_{\setminus J})$ independent of $s_T$, this would be exactly like a recursive update equation, with $p(s_T, s_J)$ being the transition probability from the last observed spike to the inference time $T$, $p(\xi_J|s_J)$ being the innovation due to the last observation (the likelihood of the last observed spike), and the message $m_T(s_J, \xi_{\setminus J})$ propagating the uncertainty from all the spikes other than the last to the last one. However, for general priors, $p(s_{\setminus J}|s_T, s_J)$, and therefore also $m_T(s_T, s_J, \xi_{\setminus J})$, do depend on $s_T$, so all spikes have to be used to infer the posterior at each time $T$. To make $m_T$ independent of $s_T$, the prior has to be Markov in individual spike timings, with

$$p(s_{\setminus J}\,|\,s_T, s_J) = p(s_{\setminus J}\,|\,s_J), \tag{3.5}$$

which makes

$$\begin{aligned}
m_T(s_T, s_J, \xi_{\setminus J}) &= \int ds_{\setminus J}\; p(\xi_{\setminus J}|s_{\setminus J})\, p(s_{\setminus J}|s_J) && (3.6) \\
&\equiv m_T(s_J, \xi_{\setminus J}), && (3.7)
\end{aligned}$$

which is indeed independent of $s_T$. So for the OU process, the last message $m_T(s_J, \xi_{\setminus J})$, which summarizes all spikes other than the last, merely needs to be multiplied by the transition probability (see Figure 5B). However, the smooth temporal kernel changes shape in a complex way (corresponding to the dependence of the message $m_T(s_T, s_J, \xi_{\setminus J})$ in equation 3.4 on $s_T$). Again, this means that all spikes have to be kept in memory for full inference. Note, finally, that this conclusion, and the fact that there is a recursive form for the OU process, do not depend on the particular spiking model assumed, verifying the assertion that the choice of squared exponential tuning functions, although mathematically helpful, does not limit our conclusions.

3.4 Intermediate (Autoregressive) Processes. There are cases intermediate to the smooth and the OU process that allow a partially recursive formulation. For instance, the metronomic OU process can be generalized to an autoregressive model of $n$th order by writing

$$s_t = \sum_{i=1}^{n} \beta_i\, s_{t-i} + \sqrt{c}\;\eta_t. \tag{3.8}$$

In this case, the inverse covariance matrix C −1 is (2n + 1)-diagonal (see appendix C), with entries determined directly by the βi . This implies that the posterior factorizes over cliques ψ involving n + 1 spikes (see equation 3.3), and that inference will be Markov in groups of n spikes. Zhang et al. (1998) find that a two-step Bayesian decoder, which is an AR(2) process in our terms, significantly improves decoding hippocampal place cell data.
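The banded structure of the inverse covariance for an autoregressive prior can be illustrated directly from equation 3.8. The sketch below is a simplified construction, not the derivation of appendix C: it assumes that the samples preceding the first one are fixed at zero and that $\eta_t$ is white with unit variance. The coefficient values and helper names are hypothetical.

```python
import numpy as np

def ar_precision(betas, length, c=1.0):
    """Precision matrix of `length` samples of the AR process in equation 3.8.

    Assumes the samples before the first one are fixed at zero (a simplification).
    """
    A = np.eye(length)
    for lag, beta in enumerate(betas, start=1):
        A -= beta * np.eye(length, k=-lag)   # row t carries -beta_lag at column t - lag
    # A @ s = sqrt(c) * eta with white eta, so the precision of s is A.T @ A / c.
    return A.T @ A / c

def bandwidth(matrix, tol=1e-12):
    """Largest |i - j| at which the matrix has a non-negligible entry."""
    i, j = np.nonzero(np.abs(matrix) > tol)
    return int(np.abs(i - j).max())

for betas in ([0.9], [1.2, -0.4], [1.0, -0.5, 0.2]):
    # An AR(n) process gives a bandwidth of n, i.e. a (2n + 1)-diagonal precision.
    print(len(betas), bandwidth(ar_precision(betas, length=12)))
```

The bandwidth equals the order $n$, so each potential in the factorization involves $n + 1$ successive samples, in line with the cliques described above.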

Figure 11: Autoregressive processes of increasing order. (A) Samples from processes of order n = {1, 2, 3, 5, 10} from top to bottom (space versus time [s]). The top process corresponds to an OU process. (B) Metronomic temporal kernels $k(\xi, T)$ corresponding to the processes in A (kernel size (weight) versus $T - t_j$ [s]). The different lines (in varying shades of gray) correspond to increasing the observation time $T$ as in Figures 5 and 10.

Figure 11A shows sample trajectories from such processes of increasing order. The coefficient vector β was set here such that the nth difference of the process evolved as an OU process (see appendix C). The higher the order, the smoother the processes that can be generated, and the more oscillations are apparent in the temporal kernels. The OU and the smooth processes (see section 3.3) are at opposite ends of this spectrum, with tridiagonal and dense inverse matrices, respectively. The higher the order, the greater the complexity of the code. Indeed, the complexity grows exponentially (since groups of n spikes have to be considered and the number of such groups increases exponentially). While natural stimulus trajectories may not be indefinitely differentiable, the exponential increase in complexity implies that any smoothness has great potential to render the code complex.

4 Expert Spikes for Efficient Computation

Complex codes, following, for instance, from the assumption of natural smooth priors, render the information inherent in the spikes hard to extract. Efficient computation in time requires access to all encoded information and


thus requires that the complex temporal structure of the code be taken into account. Here, we show that information present in the complex codes can be re-represented using codes that are straightforward to decode and use in key probabilistic computations. Specifically, we propose to decode each spike independently and multiply together the contributions from all spikes. This corresponds to treating each spike as an independent expert in a product of experts (PoE) setting (Hinton, 1999):

$$\hat{p}(s_T|\xi) = \frac{1}{Z(T)} \exp\left(\sum_i \sum_t g_i(s, t)\, \xi^i_{T-t}\right). \tag{4.1}$$

That is, each time a spike $\xi^i$ occurs, it contributes the same projection kernel $\exp(g_i(s,t))$ to the posterior distribution $\hat{p}(s_T|\xi)$. To put it another way, for each spike, we add the same stereotyped contribution to the log posterior and then renormalize. From the discussion in the preceding sections, it is immediately apparent that the PoE approximation is a better approximation for the OU case than for the smooth case. In the following, we first derive an approximate analytical expression for separable projection kernels $g_i(s,t) = f_i(s)h(t)$ based on metronomic spikes and the OU prior. We then remove any restrictions and derive nonparametric, nonseparable $g_i(s,t)$ for both the OU and the smooth temporal kernel and show that these still perform better for the OU process than for the smooth process. Finally, we infer a new set of spikes ρ such that decoding according to the PoE model produces a posterior distribution $\hat{p}(s_T|\rho)$ that matches the true posterior distribution $p(s_T|\xi)$ well for both OU and smooth priors.

4.1 Approximate Projection Kernels

4.1.1 Metronomic Projection Kernels. Section 3.2 showed that for the OU process, the weight accorded a spike is approximately a decreasing exponential function of the time elapsed since its occurrence and that replacing the true temporal kernels by the metronomic temporal kernels (without fixing the time since the last spike at the average ISI $\Delta$) gives a qualitatively good approximation (see Figure 4). This suggests writing an approximate distribution with spatiotemporally separable projection kernels,

$$\begin{aligned}
\hat{p}(s_T|\xi) &\propto \prod_i \phi_i(s)^{\sum_t \xi^i_{T-t}\, e^{-\beta t}} && (4.2) \\
&= \exp\left(\sum_i \sum_t \log(\phi_i(s))\, e^{-\beta t}\, \xi^i_{T-t}\right) && (4.3)
\end{aligned}$$

Figure 12: Separable projection kernel for the OU process: Comparison of true $p(s_T|\xi)$ (A) and $\hat{p}(s_T|\xi)$ from equation 4.3 (B), both plotted as space versus time [s]. The left arrows indicate where the variance of the approximate distribution diverges toward ∞ as $T - t_J \to \infty$ rather than approaching $C_{TT}$. The right arrows show the effect of this on the mean of the approximate posterior, which returns to the prior mean $m = 0$ more rapidly than the true posterior.

to use exactly the form of equation 4.1. We can thus also write

$$\hat{p}(s_T|A) \propto \prod_i \phi_i(s)^{A_i(T)}, \tag{4.4}$$

where $A_i(T) = \sum_t \xi^i_{T-t}\, e^{-\beta t}$ can be seen as an equivalent “activity” of each neuron. The performance of this approximation is shown in Figure 12 for the OU process (see also Zemel et al., 2005). There are a few differences between Figures 4 and 12. Keeping the $\phi_i(s)$ as before, the variance of this approximation is $\hat{\nu}^2(T) = \sigma^2 / \sum_i A_i(T)$. As the last observed spike recedes into the past, this approaches infinity (left arrows in Figure 12), and the mean returns to zero (right arrows in Figure 12). This is different from the case of exact inference, which approaches the static prior with variance $C_{TT}$. The mean $\hat{\mu}(T) = \sum_i s_i A_i(T) / \sum_j A_j(T)$ is always normalized and returns to zero more slowly than the variance increases. This introduces an inaccuracy, since the true OU temporal kernels (shown in Figure 5) are not normalized, $\sum_t k_t(\xi, T) < 1$, which arises because of the weight given to the spatial prior. For the smooth case, no simple approximation of the form of equation 4.3 is viable. This can be seen, for instance, from the fact that the smooth temporal kernels (see Figure 10) dip below zero (making it tricky to use them in products).
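The separable approximation of equations 4.2 to 4.4 can be sketched on a discretized stimulus axis. The code below is illustrative only. It assumes gaussian tuning functions $\phi_i(s) \propto \exp(-(s - s_i)^2 / 2\sigma^2)$, consistent with the squared exponential tuning mentioned earlier, and it treats each spike individually, so that the activity $A_i(T)$ of a neuron is the sum of the weights of its spikes. The grid, parameter values, and function name are hypothetical.

```python
import numpy as np

def poe_posterior(spike_ages, spike_prefs, beta, sigma2, grid):
    """Separable product-of-experts posterior (equations 4.2 to 4.4) on a grid.

    spike_ages: time elapsed since each spike (T - t).
    spike_prefs: preferred stimulus of the neuron that fired each spike.
    """
    ages = np.asarray(spike_ages, dtype=float)
    prefs = np.asarray(spike_prefs, dtype=float)
    weights = np.exp(-beta * ages)                   # each spike's share of A_i(T)
    log_phi = -(grid[:, None] - prefs[None, :])**2 / (2 * sigma2)
    log_post = log_phi @ weights                     # weighted sum of log experts
    post = np.exp(log_post - log_post.max())         # subtract max for stability
    return post / post.sum(), weights

grid = np.linspace(-3.0, 3.0, 3001)
post, w = poe_posterior([0.01, 0.05, 0.12], [0.2, 0.4, -0.1],
                        beta=10.0, sigma2=0.04, grid=grid)
mean = np.sum(grid * post)
var = np.sum((grid - mean)**2 * post)
# Compare with the closed forms quoted in the text.
print(np.isclose(mean, np.sum(w * np.array([0.2, 0.4, -0.1])) / w.sum(), atol=1e-6))
print(np.isclose(var, 0.04 / w.sum(), atol=1e-6))
```

Because the weights all decay at the same rate, their sum shrinks as the last spike recedes and the variance $\sigma^2 / \sum_i A_i(T)$ grows without bound, which is the divergence marked by the left arrows in Figure 12.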

Figure 13: Projection kernels inferred by equation 4.5 for OU (A) and smooth (B) priors. Stimulus trajectories and corresponding population spike trains ξ were generated until the update equations converged (approximately $2 \cdot 10^4$ spike trains). Both kernels have the shape of a difference of gaussians for $t = 0$ and fall off exponentially with time. There is little nonseparable structure in both cases.

4.1.2 Inferring Full Spatiotemporal Projection Kernels $g_i(s,t)$. To apply expression 4.1 to the smooth case, we inferred $g_i(s,t)$ in a nonparametric way by discretizing the time and space over which the distributions are defined and minimizing the Kullback-Leibler divergence between the discretized versions of $p(s_T|\xi)$ and $\hat{p}(s_T|\xi)$ with respect to the projection kernels,

$$g_i(s,t) \leftarrow g_i(s,t) - \varepsilon\, \nabla_{g_i(s,t)} D_{KL}\big(p(s_T|\xi)\,\|\,\hat{p}(s_T|\xi)\big), \tag{4.5}$$

where $D_{KL}(p(s)\,\|\,q(s)) = \int ds\; p(s) \log\frac{p(s)}{q(s)}$. Given that our approximation 4.1 is related to restricted Boltzmann machines (RBM), it is not surprising that the gradient has a form akin to the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995):

$$\nabla_{g_i(s,t)} D_{KL}\big(p(s_T|\xi)\,\|\,\hat{p}(s_T|\xi)\big) = \sum_T \big[\hat{p}(s_T|\xi) - p(s_T|\xi)\big]\, \xi_i(T - t). \tag{4.6}$$

Figure 13 shows the projection kernels inferred for the OU prior (see Figure 13A) and the smooth prior (see Figure 13B). Both start, for t = 0 with a spatial profile similar to a difference of gaussians (DOG), and then fall off as exponentials of time. The kernels gi (s, t) shown here are for neurons i with si close to 0, the center of the gaussian prior over the trajectories. The projection kernels shown are for the same parameter settings as Figures 4 and 8, and the faster decay of the smooth projection kernels is due to the

Figure 14: Projection kernels are independent of contrast. The left-most panel shows an OU kernel for the same contrast ($\phi_{\max}$) as in Figure 13; the contrast is doubled in the middle and quadrupled in the right panel. All these are off-center kernels with the same parameters as used in the other figures. Despite a slight slant toward the mean, the kernels are approximately separable.

shorter correlation timescale. For the OU process, the kernels for neurons i with si > 0 become slightly slanted toward −1 over time (and the converse holds for those with si < 0) to capture the decay to the mean (zero), which is only a function of the distance from the mean. This effect is noticeable for the OU but very small for the smooth kernels. Figure 14 shows off-center OU kernels inferred for different contrast (by varying φmax ). As can be seen, the kernels are invariant to the contrast, and the slant effect is small. For the parameter range explored here, both projection kernels are approximately separable, indicating that the analytically derived motivation above may be close to optimal and that, in the PoE framework of equation 4.1, separable projection kernels may be the optimal choice even for the smooth prior. However, simply using these projection kernels to interpret the original spikes ξ results in an approximation that is far from perfect, especially in the smooth case. Figure 15 compares the true posterior distribution and that given by the approximation with the above projection kernels. The cost of independent decoding is quantified in Figure 15A using

$$\left\langle \frac{1}{T} \sum_t \frac{D\big(p(s_T|\xi)\,\|\,\hat{p}(s_T|\xi)\big)}{H\big(p(s_T|\xi)\big)} \right\rangle_{p(s,\xi)}, \tag{4.7}$$

where H( p) is the entropy of p and the average is over many stimulus trajectories s ∼ N (0, C) and spikes ξ ∼ p(ξ |s). This quantity can also be interpreted as a percentage information loss. It is larger for the smooth than for the OU process, showing that the OU process suffers much less from the approximation than the smooth prior. Visually, there are no gross differences between p(sT |ξ ) and pˆ (sT |ξ ) for the OU prior (see Figures 15B and 15D). However, for the smooth prior, the arrows in

Figure 15: Comparison of true distribution $p(s_T|\xi)$ and approximate distribution $\hat{p}(s_T|\xi)$ given by equation 4.1 with projection kernels inferred by equation 4.5 and shown in Figure 13 (panel labels: OU, Smooth, Exact, Approx; axes in B to E: space versus time [s]). Organization is the same as in previous figures. (A) $\left\langle \frac{1}{T}\sum_t D(p(s_T|\xi)\,\|\,\hat{p}(s_T|\xi))/H(p(s_T|\xi)) \right\rangle_{p(\xi,s)}$ ± 1 standard deviation for both priors. (B, C) $p(s_T|\xi)$. (D, E) The corresponding $\hat{p}(s_T|\xi)$ for the same spikes. (B, D) A stimulus generated from the OU prior. (C, E) The smooth prior. $\hat{p}(s_T|\xi)$ is a good approximation for the OU prior but fails for the smooth prior. The arrows indicate where the approximation fails fundamentally in a similar way to that shown in Figure 9.

Figures 15C and 15E indicate areas where a large mismatch is introduced by the independent treatment of the spikes, which discards all information contained in spike combinations. This mismatch is entirely to be expected.
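The information-loss measure of equation 4.7 can be sketched for discretized distributions as follows. This is a minimal single-trial illustration: it averages the ratio of divergence to entropy over time steps only, not over stimulus trajectories and spike trains, and it uses the entropy of the discretized distribution, whose value depends on the grid spacing. The function names and the toy distributions are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for two discretized distributions on the same grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Entropy of a discretized distribution (natural logarithm)."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def information_loss(true_posteriors, approx_posteriors):
    """Average over time steps of D_KL(p || p_hat) / H(p), for one trial."""
    ratios = [kl_divergence(p, q) / entropy(p)
              for p, q in zip(true_posteriors, approx_posteriors)]
    return float(np.mean(ratios))

p_true = [np.array([0.1, 0.2, 0.4, 0.3]), np.array([0.25, 0.25, 0.3, 0.2])]
p_flat = [np.full(4, 0.25), np.full(4, 0.25)]
print(information_loss(p_true, p_true))   # identical distributions lose nothing
print(information_loss(p_true, p_flat))   # a mismatched approximation loses information
```

Normalizing by the entropy is what allows the quantity to be read as a fractional loss and compared across priors.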

4.2 Recoding: Finding Expert Spikes. The previous section has shown that an independent interpretation of spikes is more costly with the smooth than with the OU prior. In this section, we show that it is possible to find a new set of “expert” spikes ρ, such that each spike can be interpreted independently and the posterior distribution is matched closely for both the OU and the smooth prior. This recoding thus takes spikes ξ that are redundant in a decoding sense and produces a new set of spikes ρ that can be easily used for efficient neural computation because the decoding redundancy has been eliminated. We first infer real-valued activities aξ and then proceed to infer actual spikes ρ. We use neurally implausible methods to infer the new set of spikes ρ. In a companion paper we will explore the capability of neurally plausible spiking networks to do this recoding and to use the resulting simple code for probabilistic computations in time (see also Zemel, Huys, Natarajan, and Dayan, 2004).


Figure 16: Inferring activities A for the OU prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL}\rangle_T \sim 10^{-5}$ and the entropy $\langle H\rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times ξ. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, but zeroing this affects the match between $p(s_T|\xi)$ and $\hat{p}(s_T|A)$ only marginally.

4.2.1 Activities. Given a set of projection kernels $g_i(s, t)$ from the previous section, we can go back and infer the optimal activities $A \ge 0$ of neurons by writing

$$\hat{p}(s_T|A) \propto \exp\Big(\sum_{i,t} A_i(T - t)\, g_i(s, t)\Big). \tag{4.8}$$

If we let $A_i(T - t) = \exp(B_i(T - t))$ and minimize with respect to B the Kullback-Leibler divergence from the true posterior, we simultaneously enforce $A \ge 0$:

$$B_i(t) \leftarrow B_i(t) - \varepsilon \nabla_{B_i(t)} D_{KL}\big(p(s_T|\xi)\,\|\,\hat{p}(s_T|A)\big). \tag{4.9}$$
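The update in equation 4.9 can be illustrated with a small sketch. Everything specific below is an assumption made for illustration: the stimulus is discretized on a grid, the neuron and time indices (i, t) are flattened into a single index k, the kernels are arbitrary tent-shaped functions, and the "true" posterior is itself generated from known activities so that a good fit exists. For an exponential-family form such as equation 4.8, the gradient of the divergence with respect to $B_k$ is $A_k(\langle g_k\rangle_{\hat{p}} - \langle g_k\rangle_{p})$, which is what the loop uses.

```python
import math

def approx_posterior(B, kernels):
    """p_hat(s) proportional to exp(sum_k exp(B_k) * g_k(s)), cf. equation 4.8."""
    n_s = len(kernels[0])
    log_w = [sum(math.exp(b) * g[s] for b, g in zip(B, kernels)) for s in range(n_s)]
    top = max(log_w)                        # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def fit_activities(p_true, kernels, eps=0.01, n_steps=2000):
    """Plain gradient descent on B (with A = exp(B)), cf. equation 4.9."""
    B = [0.0] * len(kernels)
    for _ in range(n_steps):
        q = approx_posterior(B, kernels)
        grad = []
        for b, g in zip(B, kernels):
            mean_q = sum(qi * gi for qi, gi in zip(q, g))
            mean_p = sum(pi * gi for pi, gi in zip(p_true, g))
            grad.append(math.exp(b) * (mean_q - mean_p))   # dKL/dB_k
        B = [b - eps * gb for b, gb in zip(B, grad)]
    return [math.exp(b) for b in B]                         # activities are positive by construction

# Hypothetical setup: 21 stimulus values, three tent-shaped log kernels.
grid = [-1.0 + 0.1 * i for i in range(21)]
centers = [-0.5, 0.0, 0.5]
kernels = [[-abs(s - c) / 0.5 for s in grid] for c in centers]
a_true = [0.5, 2.0, 1.0]
p_true = approx_posterior([math.log(a) for a in a_true], kernels)

kl_init = kl_divergence(p_true, approx_posterior([0.0, 0.0, 0.0], kernels))
a_fit = fit_activities(p_true, kernels)
kl_fit = kl_divergence(p_true, approx_posterior([math.log(a) for a in a_fit], kernels))
print("KL before:", kl_init, "KL after:", kl_fit)
```

The step size and iteration count are hand-picked for this toy problem; a practical implementation would likely need a line search or an adaptive step.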

The results of this procedure are shown for both the OU process (see Figure 16) and for the smooth process (see Figure 17). Figures 16A and 17A show the true spikes ξ and the corresponding distribution $p(s_T|\xi)$. Figures 16B and 17B show the approximate distributions $\hat{p}(s_T|A)$ defined in equation 4.8 for the optimal activities A inferred with equation 4.9. The continuous nature of the activation functions means that they can contain as much information as the distribution itself, and indeed we find empirically that arbitrarily close matches are possible (exemplified by the two figures; in both cases $\langle D_{KL}\rangle_T \sim 10^{-5}$). Figures 16C and 17C finally show


Figure 17: Inferring activities A for the smooth prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL}\rangle_T \sim 10^{-5}$ and the entropy $\langle H\rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times ξ. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, which allows the approximation $\hat{p}(s_T|A)$ to “bend” between spikes. Unlike in the OU

Communicated by Misha Tsodyks

Mean-Driven and Fluctuation-Driven Persistent Activity in Recurrent Networks

Alfonso Renart∗ [email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain, and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Rubén Moreno-Bote [email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

Xiao-Jing Wang [email protected] Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Néstor Parga [email protected] Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

Spike trains from cortical neurons show a high degree of irregularity, with coefficients of variation (CV) of their interspike interval (ISI) distribution close to or higher than one. It has been suggested that this irregularity might be a reflection of a particular dynamical state of the local cortical circuit in which excitation and inhibition balance each other. In this “balanced” state, the mean current to the neurons is below threshold, and firing is driven by current fluctuations, resulting in irregular Poisson-like spike trains. Recent data show that the degree of irregularity in neuronal spike trains recorded during the delay period of working memory experiments is the same for both low-activity states of a few Hz and for elevated, persistent activity states of a few tens of Hz. Since the difference between these persistent activity states cannot be due to external factors coming from sensory inputs, this suggests that the underlying

∗ Current address: Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102 USA.

Neural Computation 19, 1–46 (2007)

© 2006 Massachusetts Institute of Technology


A. Renart, R. Moreno-Bote, X.-J. Wang, and N. Parga

network dynamics might support coexisting balanced states at different firing rates. We use mean field techniques to study the possible existence of multiple balanced steady states in recurrent networks of current-based leaky integrate-and-fire (LIF) neurons. To assess the degree of balance of a steady state, we extend existing mean-field theories so that not only the firing rate, but also the coefficient of variation of the interspike interval distribution of the neurons, are determined self-consistently. Depending on the connectivity parameters of the network, we find bistable solutions of different types. If the local recurrent connectivity is mainly excitatory, the two stable steady states differ mainly in the mean current to the neurons. In this case, the mean drive in the elevated persistent activity state is suprathreshold and typically characterized by low spiking irregularity. If the local recurrent excitatory and inhibitory drives are both large and nearly balanced, or even dominated by inhibition, two stable states coexist, both with subthreshold current drive. In this case, the spiking variability in both the resting state and the mnemonic persistent state is large, but the balance condition implies parameter fine-tuning. Since the degree of required fine-tuning increases with network size and, on the other hand, the size of the fluctuations in the afferent current to the cells increases for small networks, overall we find that fluctuation-driven persistent activity in the very simplified type of models we analyze is not a robust phenomenon. Possible implications of considering more realistic models are discussed.

1 Introduction

The spike trains of cortical neurons recorded in vivo are irregular and consistent, to a first approximation, with a Poisson process, possessing a roughly exponential interspike interval (ISI) distribution (except at very short intervals) and a coefficient of variation (CV) of the ISI close to one (Softky & Koch, 1993). The possible implications of this fact for the basic principles of cortical organization have been the motivation for a large number of studies during the past 10 years (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998; Zador & Stevens, 1998; Harsch & Robinson, 2000). An important idea analyzed by some of these studies concerns the apparent inconsistency between the irregular spiking of the cortical neuron and its working as an integrator of a very large number of inputs over a relatively long time constant, of the order of 10 to 20 ms. A way out of this inconsistency was to have similar amounts of excitatory and inhibitory drive. In this way, the mean drive to the cell was subthreshold, and spikes were the result of fluctuations, which occur irregularly, thus leading to a high CV (Gerstein & Mandelbrot, 1964). Although the implications of this result were first studied in a feedforward architecture (Shadlen & Newsome, 1994), it was soon discovered that a state

Bistability in Balanced Recurrent Networks


in which excitation and inhibition balance each other, resulting in irregular spiking, was a robust dynamical attractor in recurrent networks (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998); that is, under very general conditions, a recurrent network settles down into a state of this sort. Although the original studies characterizing quantitatively the degree of spiking irregularity in the cortex were done using data from sensory cortices, it has since been shown that neurons in higher-order associative areas like the prefrontal cortex (PFC) also spike irregularly (Shinomoto, Sakai, & Funahashi, 1999; Compte et al., 2003) (see Figure 1). This is interesting because it is well known that cells in the PFC (Fuster & Alexander, 1971; Funahashi, Bruce, & Goldman-Rakic, 1989; Miller, Erickson, & Desimone, 1996; Romo, Brody, Hernández, & Lemus, 1999), as well as those in other associative cortices like the inferotemporal (Miyashita & Chang, 1988) or posterior parietal cortex (Gnadt & Andersen, 1988; Chafee & Goldman-Rakic, 1998), show activity patterns that are selective to stimuli no longer present to the animal and are thus being held in working memory. The activity of these neurons seems to be able to switch, on presentation of an appropriate brief, transient input, from a basal spontaneous activity level to a higher activity state. When the dimensionality of the stimulus to be remembered is low (e.g., the position of an LED on a computer screen or the frequency of a vibrotactile stimulus), the mnemonic activity during the delay period when the stimulus is absent seems to be graded (Funahashi et al., 1989; Romo et al., 1999), whereas when the dimensionality of the stimulus is high (e.g., a complex image), the single neurons seem to choose from a small number of discrete activity states (Miyashita & Chang, 1988; Miller et al., 1996). This last coding scheme is referred to as object working memory.
Since there is no explicit sensory input present during the delay period in a working memory task, the neuronal activity must be a result of the dynamics of the relevant neural circuit. There is a long tradition of modeling studies that have described delay-period activity as a reflection of dynamical attractors in multistable (usually bistable) networks presumed to represent the local cortical environment of the neurons recorded in the neurophysiological experiments (Hopfield, 1982; Amit & Tsodyks, 1991; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Amit & Brunel, 1997b; Brunel, 2000a; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Brunel & Wang, 2001; Hansel & Mato, 2001; Cai, Tao, Shelly, & McLaughlin, 2004). Originally inspired by models and techniques from the statistical mechanics of disordered systems, network models of persistent activity have progressively become more faithful to the biological circuits that they seek to describe. The landmark study (Amit & Brunel, 1997b) provided an extended mean-field description of the activity of a recurrent network of spiking current-based leaky integrate-and-fire (LIF) neurons. One of its main achievements was to use the theory of diffusion processes to provide an intuitive, compact

[Figure 1 plot: histograms of # of cells versus CV (top row) and CV2 (bottom row) for the FIXATION (F), PREFERRED (PR), and NONPREFERRED (NP) conditions, with summary panels comparing F, PR, and NP; p-values shown in the panels: p = 8.5e−15 and p = 1.6e−23.]

Figure 1: CV of the ISI of neurons in monkey prefrontal cortex during a spatial working memory task. The monkey made saccades to remembered locations on a computer screen after a delay period of a few seconds. On each trial, a dot of light (cue stimulus) was briefly shown in one of eight to-be-remembered locations, equidistant from the fixation point but at different angles. After the delay period, starting with the disappearance of the cue stimulus and terminating with the disappearance of the fixation point, the monkey made a saccade to the remembered location. Top and bottom rows correspond, respectively, to the CV and CV2 (CV calculated using only consecutive ISIs to try to compensate for possible slow nonstationarities in the neurons' instantaneous frequency) computed from spike trains of prefrontal cortical neurons recorded from monkeys performing an oculomotor spatial working memory task. Results shown correspond to analysis of the activity during the delay period of the task. The spike trains are irregular (CV ∼ 1), and to a similar extent, both when the data correspond to trials in which the preferred (PR; middle column) positional cue for the cell was held in working memory (higher firing rate during the delay period) and when they correspond to trials with the nonpreferred (NP; right column) positional cue for the particular neuron (lower firing rate during the delay period). See Compte et al. (2003) for details. Adapted with permission from Compte et al. (2003).

description of the spontaneous, low-rate, basal activity state of cortical cells in terms of self-consistency equations that included information about both the mean and the fluctuations of the afferent current to the cell. The theory proposed was both simple and accurate, and matched well the properties of simulated LIF networks. The spontaneous activity state in Amit and Brunel (1997b) is effectively the balanced state described above, in which the recurrent connectivity is


dominated by inhibition and firing is due to the occurrence of positive fluctuations in the drive to the neurons. However, in Amit and Brunel (1997b), this same model was used to describe the coexistence of the spontaneous activity state with a persistent activity state with a physiologically plausible firing rate that would correspond to the spiking observed during the delay period in object working memory tasks, such as seen in, for example, Miyashita and Chang (1988). Although the model, with its large number of subsequent improvements, has been successful in providing a fairly accurate description of simulated spiking networks, no effort has yet been made to study systematically the relationship between multistability and the irregularity of the spike trains, especially in the elevated activity state. As we will show below, the qualitative organization of the connectivity in the recurrent network not only determines the existence of a fluctuation-driven balanced spontaneous activity state in the network, but also the existence of bistability in the network, and whether the elevated activity states are fluctuation driven. In order to perform a systematic analysis of the types of persistent activity that can be obtained in a network of current-based LIF neurons, two steps are important. First, we believe that the scaling of the synaptic connections with the number of afferent synapses per neuron should be made explicit. This approach was taken in the studies of the balanced state (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996), but is not present in the Amit and Brunel (1997b) framework. As we shall see, when the scaling is made explicit and the network is studied in the limit of a large number of connections per cell, the difference between the behavior of alternative circuit organizations (or architectures) becomes qualitative. Second, it would be desirable to be able to check for the spike train irregularity within the theory. 
In Amit and Brunel (1997b), spiking was assumed to be Poisson and, hence, to have a CV equal to 1. Poisson spike trains are completely characterized by a single number, the instantaneous firing probability, so there is nothing more to say about the spike train once its firing rate has been given. A general self-consistent description of the higher-order moments of spiking in a recurrent network of LIF neurons is extremely difficult, as the calculation of the moments of the ISI distribution becomes prohibitively complicated when the input current to a particular cell contains temporal correlations (although see Moreno-Bote & Parga, 2006). However, based on our study of the input-output properties of the LIF neuron under the influence of correlated inputs (Moreno, de la Rocha, Renart, & Parga, 2002), we have constructed a self-consistent description for the first two moments of the current to the neurons in the network, which relaxes the Poisson assumption and which we expect to be valid if the temporal correlations in the spike trains in the network are sufficiently short. Some of the results presented here have already been published in abstract form.


2 Methods

We consider a network of current-based leaky integrate-and-fire neurons. The voltage difference across each neuron’s membrane evolves in time according to the following equation,

$$\frac{dV(t)}{dt} = -\frac{V(t)}{\tau_m} + I(t),$$

with voltages being measured relative to the leak potential of the neuron. When the depolarization reaches a threshold voltage that we set at $V_{th} = 20$ mV, a spike is emitted, and the voltage is clamped at a reset potential $V_r = 10$ mV during a refractory period $\tau_{ref} = 2$ ms, after which the voltage continues to integrate the input current. The membrane time constant is $\tau_m = 10$ ms. When the neuron is inserted in a network, $I(t)$ represents the total synaptic current, which is assumed to be a linear sum of the contributions from each individual presynaptic cell. We consider the simplest description of the synaptic interaction between the pre- and postsynaptic neurons, according to which each presynaptic action potential provokes an instantaneous “kick” in the depolarization of the postsynaptic cell. The network is composed of $N_E$ excitatory and $N_I$ inhibitory cells randomly connected so that each cell receives $C_E$ excitatory and $C_I$ inhibitory contacts, each with an efficacy (“kick” size) $J_{E_j}$ and $J_{I_k}$, respectively ($j = 1, \ldots, C_E$; $k = 1, \ldots, C_I$). The total afferent current into the cell can be represented as

$$I(t) = \sum_{j=1}^{C_E} J_{E_j} s_j(t) - \sum_{k=1}^{C_I} J_{I_k} s_k(t),$$

where $s_{j(k)}(t)$ represents the spike train from the jth excitatory (kth inhibitory) neuron. Since according to this description, the effect of a presynaptic spike on the voltage of the postsynaptic neuron is instantaneous, $s(t)$ is a collection of Dirac delta functions, that is, $s(t) \equiv \sum_j \delta(t - t_j)$, where $t_j$ are the spike times.

2.1 Mean-Field Description. Spike trains in the model are stochastic, with an instantaneous firing rate (i.e., a probability of measuring a spike in $(t, t + dt)$ per unit time) denoted by $\nu(t) = \nu$. The second-order statistics of the process is characterized by its connected two-point correlation function $C(t, t')$, giving the joint probability density (above chance) that two spikes happen at $(t, t + dt)$ and at $(t', t' + dt)$, that is, $C(t, t') \equiv \langle s(t)s(t')\rangle - \langle s(t)\rangle\langle s(t')\rangle$. Stochastic spiking in network models is usually assumed to follow Poisson statistics, which is both a fairly good approximation to what is commonly observed experimentally (see, e.g.,


Softky & Koch, 1993; Compte et al., 2003) and convenient technically since Poisson spike trains lack any temporal correlations. For Poisson spike trains, $C(t, t') = \nu\,\delta(t - t')$, where ν is the instantaneous firing probability. We have previously analyzed the effect of temporal correlations in the afferents to a LIF neuron on its firing rate (Moreno et al., 2002). Temporal correlations measured in vivo are often well fitted by an exponential (Bair, Zohary, & Newsome, 2001). We considered exponential correlations of the form

$$C(t, t') = \nu\left[\delta(t' - t) + (F_\infty - 1)\,\frac{e^{-|t' - t|/\tau_c}}{2\tau_c}\right], \tag{2.1}$$

where $F_\infty$ is the Fano factor of the spike train for infinitely long time windows. The Fano factor in a window of length T is defined as the ratio between the variance and the mean of the spike count on the window. It is illustrative to calculate it for our process,

$$F_T \equiv \frac{\sigma^2_{N(T)}}{\langle N(T)\rangle},$$

where $N(T)$ is the (stochastic) spike count in a window of length T,

$$N(T) = \int_0^T dt\, s(t),$$

so that $\langle N(T)\rangle = \nu T$, and the spike count variance is given by

$$\sigma^2_{N(T)} \equiv \int_0^T\!\!\int_0^T dt\, dt'\, C(t, t') = \nu T + \nu (F_\infty - 1)\left[T - \tau_c\left(1 - e^{-T/\tau_c}\right)\right].$$
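Dividing the variance above by the mean count νT gives $F_T = 1 + (F_\infty - 1)\,[1 - (\tau_c/T)(1 - e^{-T/\tau_c})]$. The short sketch below simply evaluates this expression; the function name and the parameter values are illustrative choices, not taken from the text.

```python
import math

def fano_factor(window, f_inf, tau_c):
    """Fano factor F_T of the exponentially correlated process of equation 2.1."""
    x = window / tau_c
    # (1 - exp(-x)) / x, written with expm1 so that it stays accurate for small x
    shrink = -math.expm1(-x) / x
    return 1.0 + (f_inf - 1.0) * (1.0 - shrink)

tau_c = 0.005   # correlation time constant in seconds (illustrative value)
f_inf = 2.0     # long-window Fano factor (illustrative value)
for window in (1e-5, 0.005, 0.05, 5.0):
    print(f"T = {window:g} s -> F_T = {fano_factor(window, f_inf, tau_c):.4f}")
```

For windows much shorter than τc the result approaches 1, as for a Poisson process, and for windows much longer than τc it approaches $F_\infty$.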

When the time window is long compared to the correlation time constant, that is, $T \gg \tau_c$, then $\sigma^2_{N(T)} \propto F_\infty \nu T$; hence, our use of the factor $F_\infty$ in the definition of the correlation function. An interesting point to note is that for time windows that are long compared to the correlation time constant, the variance of the spike count is linear in time, which is a signature of independence across time, that is, independent variances add up (for the Poisson process, $(\sigma^2_{N(T)})_{\mathrm{Poisson}} = \nu T$, so that $(F_T)_{\mathrm{Poisson}} = 1$). If the characteristic time of the postsynaptic cell integrating this stochastic current (its membrane time constant) is very long compared with $\tau_c$, we expect that the main effect of the deviation from Poisson of the input spike trains will be on the amplitude of the current variance, with the parameter $\tau_c$ playing only a marginal role, as it does not appear in $\sigma^2_{N(T)}$ when $T \gg \tau_c$. As we show


below, a rigorous analysis of the effect of correlations on the mean firing rate of a LIF neuron confirms this intuitive picture. The postsynaptic cell receives many inputs. Recall that the total current is given (we consider for simplicity for this discussion that the cell receives C inputs from a single, large, homogeneous population composed of N neurons) by $I(t) = J\sum_{j=1}^{C} s_j(t)$. Thus, the mean and correlation function of the total afferent current to a given cell are

$$\langle I(t)\rangle = C J \nu$$

$$C_I(t, t') = \langle I(t')I(t)\rangle - \langle I(t)\rangle\langle I(t')\rangle = J^2 \sum_{ij}\left[\langle s_i(t)s_j(t')\rangle - \langle s_i(t)\rangle\langle s_j(t')\rangle\right] = C J^2 C(t, t') + C(C - 1) J^2 C_{cc}(t, t'),$$

where $C(t, t')$ is the (auto)correlation function in equation 2.1 and $C_{cc}(t, t')$ is the cross-correlogram between any two given cells of the pool of presynaptic inputs (which we have assumed to be the same for all pairs). We restrict our analysis to very sparse random networks (networks with $C \ll N$), so that the fraction of synaptic inputs shared by any two given neurons can be assumed to be negligible. In this case, the cross-correlation between the spike trains of the two cells $C_{cc}(t, t')$ will be zero. This approximation simplifies the analysis of the network behavior significantly and allows for a self-consistent solution for the network’s steady states. Thus, the temporal structure of the total current to the cell is described by

$$C_I(t, t') = \sigma_0^2\left[\delta(t - t') + \frac{\alpha}{2\tau_c}\, e^{-|t - t'|/\tau_c}\right] \tag{2.2}$$

with

$$\sigma_0^2 = C J^2 \nu \quad \text{and} \quad \alpha = F_\infty - 1.$$

We have previously calculated the output firing rate of an LIF neuron subject to an exponentially correlated input (Moreno et al., 2002). The calculation is done using the diffusion approximation (Ricciardi, 1977) in which the discontinuous voltage trajectories are approximated by those obtained from an equivalent diffusion process. The diffusion approximation is expected to give accurate results when the overall rate of the input process is high, with the amplitude of each individual event being very small (Ricciardi, 1977). For small but finite τc , the analytic calculation of the firing rate of the cell can be done only when the deviation of the input current from a white noise


process is small, that is, it has to be done assuming that $k \equiv \sqrt{\tau_c/\tau_m} \ll 1$ and that $\alpha \ll 1$. More specifically, we found that if k = 0, then the firing rate can be calculated for arbitrary values of α, but if k is small but finite, then an expression can be found for the case when both k and α are small (see Moreno et al., 2002, for details). If k = 0, then the result one obtains is that the firing rate of the neuron is given by the same expression that one finds for the case of a white noise input, but with an effective variance that takes into account the change in amplitude of the fluctuations due to the non-Poisson nature of the inputs. The effective variance is equal to

$$\sigma_{\mathrm{eff}}^2 = \sigma_0^2 (1 + \alpha) = C J^2 \nu F_\infty,$$

which is exactly the slope of the linear increase with the size of the time window T of the variance in the spike count $N_I(T)$ of the total input current. This result can be understood in terms of the Fano factor calculation outlined above. Assuming k = 0 is equivalent to assuming an infinitely long time window for the calculation of the Fano factor, and in those conditions we also saw that the only effect of the temporal correlations is to renormalize the variance of the spike count with respect to the Poisson case. In order to set up a self-consistent scenario, we have to close the loop, by calculating a measure of the variability of the postsynaptic cell and relating it to the same property in the spike trains of its inputs. To do this, we note that if the spike trains in the model can be described as renewal processes, these processes have a property that relates their spike count variability and their ISI variability, $F_\infty = \mathrm{CV}^2$, if a point process is renewal (Cox, 1962). Renewal point processes are characterized by having independent ISIs, which are not necessarily exponentially distributed. Since we are assuming that the temporal correlations in the spike trains are short anyway, and the firing rates of the cells in the persistent activity states that we are interested in are not very high, then we expect the renewal assumption to be appropriate. The final step is to make sure that the result for the firing rate (the inverse of the mean ISI) in terms of the effective variance also holds for higher moments of the postsynaptic ISI, not only for the first, and this is indeed the case (Renart, 2000); that is, the CV of the ISI when k = 0 is given by the same expression as when the input is a white noise process, but with a renormalized variance equal to $\sigma_{\mathrm{eff}}^2$. Thus, under the assumptions described above, there is a way of computing the output rate and CV of an LIF neuron solely in terms of the rate and CV of its presynaptic inputs.
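The renewal relation between the long-window Fano factor and the squared CV of the ISI can be checked numerically. The sketch below is only an illustration under assumptions of our own: it uses a gamma renewal process with unit mean ISI (for which the squared CV equals 1/shape), finite counting windows of 100 mean ISIs as a stand-in for the infinite-window limit, and a fixed random seed.

```python
import random

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

def fano_and_cv2(shape, n_windows=2000, window=100.0, seed=1):
    """Estimate the Fano factor (long windows) and CV^2 of a gamma renewal process."""
    rng = random.Random(seed)
    scale = 1.0 / shape                       # gives a mean ISI of 1
    counts, isis = [], []
    t_next = rng.gammavariate(shape, scale)   # time of the next spike
    for w in range(n_windows):
        t_end = (w + 1) * window
        n = 0
        while t_next < t_end:                 # count spikes falling in this window
            n += 1
            isi = rng.gammavariate(shape, scale)
            isis.append(isi)
            t_next += isi
        counts.append(n)
    mean_n, var_n = mean_and_variance(counts)
    mean_isi, var_isi = mean_and_variance(isis)
    return var_n / mean_n, var_isi / mean_isi ** 2

for shape in (1.0, 4.0):                      # shape 1 is the Poisson case
    fano, cv2 = fano_and_cv2(shape)
    print(f"shape = {shape}: Fano factor ~ {fano:.3f}, CV^2 ~ {cv2:.3f}")
```

Both estimates should lie close to 1/shape, up to sampling error. Under the same relation, the effective variance above can be written with $\mathrm{CV}^2$ in place of $F_\infty$.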
In the steady states, both input and output


firing rate and CV will be the same, and this provides a couple of equations that determine these quantities self-consistently. In the remainder of the letter, we thus use the common expressions for the mean and CV of the first passage time of the Ornstein-Uhlenbeck (OU) process,

$$\nu^{-1} = \tau_{ref} + \tau_m \sqrt{\pi} \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2}\,[1 + \mathrm{erf}(x)] \tag{2.3}$$

$$\mathrm{CV}^2 = 2\pi\nu^2 \int_{\frac{V_{res} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{th} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2}\,[1 + \mathrm{erf}(y)]^2, \tag{2.4}$$

where $\mu_V$ and $\sigma_V^2$ are the mean and variance of the depolarization of the postsynaptic cell (in the absence of threshold; Ricciardi, 1977). In a stationary situation, they are related to the mean µ and variance $\sigma^2$ of the afferent current to the cell by

$$\mu_V = \tau_m \mu; \qquad \sigma_V^2 = \frac{1}{2}\tau_m \sigma^2.$$

Following the arguments above, the mean and (effective) variance of the current to the cells are given by

$$\mu = C J \nu$$

$$\sigma^2 \equiv \sigma_{\mathrm{eff}}^2 = C J^2 \nu\, \mathrm{CV}^2$$

for the mean and variance of the white noise input current. Finally, it is easy to show that if the presynaptic afferents to the cell come from a set of different statistically homogeneous subpopulations, the previous expressions generalize readily to

$$\mu_i = \sum_j C_{ij} J_{ij} \nu_j \tag{2.5}$$

$$\sigma_i^2 \equiv \sigma_{i,\mathrm{eff}}^2 = \sum_j C_{ij} J_{ij}^2 \nu_j\, \mathrm{CV}_j^2 \tag{2.6}$$

as long as the timescales of the correlations in the spike trains of the neurons in the different subpopulations are all of the same order. Inhibitory subpopulations are characterized by negative connection strengths.

2.2 Dynamics. A detailed characterization of the dynamics of the activity of the network is beyond the scope of this work. Since our main interest is the steady states of the network, we use a simple, effective dynamical


scheme that is consistent with the self-consistent equations that determine the steady states. In particular, we use the subthreshold dynamics of the first two moments of the depolarization in terms of the first two moments of the current (Ricciardi, 1977; Gillespie, 1992):

$$\frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2. \tag{2.7}$$
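As a quick consistency check on equations 2.7, the sketch below integrates them with a forward Euler scheme while holding the current moments µ and σ² fixed. That simplification is ours (in the network they depend on the activity), as are the step size and the input values. With constant inputs the moments should relax to $\mu_V = \tau_m\mu$ and $\sigma_V^2 = \tau_m\sigma^2/2$.

```python
def relax_moments(mu, sigma2, tau_m=0.010, dt=1e-4, t_max=0.2):
    """Forward-Euler integration of equations 2.7 with constant current moments."""
    mu_v, sigma2_v = 0.0, 0.0
    for _ in range(int(round(t_max / dt))):
        d_mu = -mu_v / tau_m + mu
        d_sigma2 = -sigma2_v / (tau_m / 2.0) + sigma2
        mu_v += dt * d_mu
        sigma2_v += dt * d_sigma2
    return mu_v, sigma2_v

# Illustrative inputs: mu in mV/s and sigma2 in mV^2/s.
mu_v, sigma2_v = relax_moments(mu=1500.0, sigma2=5000.0)
print(mu_v, sigma2_v)   # approaches tau_m * mu = 15 mV and tau_m * sigma2 / 2 = 25 mV^2
```

Note that the variance relaxes twice as fast as the mean, with time constant $\tau_m/2$.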

In using these equations, our assumption is that the activity of the population follows instantaneously the distribution of the depolarization. Thus, at every point in time, we use expressions 2.5 and 2.6 for µ and σ appearing in the right-hand side of equations 2.7, which depend on the rate $\nu(\mu_V, \sigma_V)$ and $\mathrm{CV}(\mu_V, \sigma_V)$ as given in equations 2.3 and 2.4. The only dynamical variables are therefore $\mu_V$ and $\sigma_V^2$ (Amit & Brunel, 1997b; Mascaro & Amit, 1999).

2.3 Numerical Analysis of the Analytic Results. The phase plane analysis of the reduced network was done using both custom-made C++ code and the program XPPaut. The integrals in equations 2.3 and 2.4 were calculated analytically for very large and very small values of the limits of integration (using asymptotic expressions for the error function; Abramowitz & Stegun, 1970) and numerically for values of the integration limits of order one. The corresponding C++ code was incorporated into XPPaut through the use of dynamically linked libraries for phase plane analysis. Some of the cusp diagrams were calculated without the use of XPPaut by the direct approach of looking for values of the parameters at which the number of fixed points changed abruptly.
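A rough idea of how equations 2.3 and 2.4 can be evaluated is given by the following sketch, which is not the authors' C++ code. It uses plain trapezoidal quadrature, a three-term asymptotic series for $e^{x^2}[1 + \mathrm{erf}(x)]$ at very negative x, and a truncation of the inner integral six units below the lower limit. All three are choices made here for illustration. The CV prefactor is written as $2\pi(\nu\tau_m)^2$, which is an assumption on our part: it makes the CV dimensionless when ν is expressed in Hz and is equivalent to equation 2.4 if ν is measured in units of $1/\tau_m$. The sketch is only meant for integration limits smaller than about 26 in absolute value, beyond which `math.exp` overflows.

```python
import math

SQRT_PI = math.sqrt(math.pi)

def phi(x):
    """exp(x^2) * [1 + erf(x)], switching to an asymptotic series for x < -5."""
    if x < -5.0:
        inv = 1.0 / (x * x)
        return (1.0 - 0.5 * inv + 0.75 * inv * inv) / (SQRT_PI * -x)
    return math.exp(x * x) * math.erfc(-x)

def lif_rate_cv(mu_v, sigma_v, v_th=20.0, v_res=10.0,
                tau_m=0.010, tau_ref=0.002, n=2000):
    """Firing rate (Hz) and CV of the ISI from equations 2.3 and 2.4."""
    a = (v_res - mu_v) / (sigma_v * math.sqrt(2.0))   # lower integration limit
    b = (v_th - mu_v) / (sigma_v * math.sqrt(2.0))    # upper integration limit
    h = (b - a) / n
    m = int(math.ceil(6.0 / h))     # extra grid steps below a for the inner integral
    xs = [a + (i - m) * h for i in range(m + n + 1)]
    # Inner integrand exp(y^2) [1 + erf(y)]^2, written as exp(-y^2) * phi(y)^2.
    inner = [math.exp(-y * y) * phi(y) ** 2 for y in xs]
    g = [0.0] * len(xs)             # running inner integral up to each grid point
    for i in range(1, len(xs)):
        g[i] = g[i - 1] + 0.5 * h * (inner[i - 1] + inner[i])
    i1 = 0.0                        # integral in equation 2.3
    i2 = 0.0                        # double integral in equation 2.4
    for i in range(m, m + n):
        i1 += 0.5 * h * (phi(xs[i]) + phi(xs[i + 1]))
        f0 = math.exp(xs[i] ** 2) * g[i]
        f1 = math.exp(xs[i + 1] ** 2) * g[i + 1]
        i2 += 0.5 * h * (f0 + f1)
    rate = 1.0 / (tau_ref + tau_m * SQRT_PI * i1)
    cv = math.sqrt(2.0 * math.pi * (rate * tau_m) ** 2 * i2)
    return rate, cv

# Two illustrative operating points (voltages in mV):
print(lif_rate_cv(15.0, 5.0))   # mean below threshold, large fluctuations
print(lif_rate_cv(25.0, 1.0))   # mean above threshold, small fluctuations
```

With the neuron parameters of section 2, the first operating point gives an irregular spike train with a CV of order one, whereas the second gives a much lower CV, in line with the distinction between fluctuation-driven and mean-driven firing.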

2.4 Numerical Simulations. We simulated an identical network to the one used in the mean-field description (see the captions of Figures 12 and 13 for parameters). In the simulation, on every time step dt = 50 µs, it is checked which neurons in the network receive any spikes. The membrane potential of cells that do not receive spikes is integrated analytically. The membrane potential of cells that receive spikes is integrated analytically within that dt, taking into account the synaptic postsynaptic potentials (PSPs) but assuming that there is no threshold. Only at the end of the time step is it checked whether the membrane potential is above threshold. If this is the case, the neuron is said to have produced a spike. This procedure effectively introduces a (very short but nonzero) synaptic integration time constant. Emitted spikes are fed back into the network using a system of queues to account for the synaptic delays (Mattia & Del Giudice, 2000).
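The time-stepping idea can be sketched for a single neuron. This is a deliberately reduced illustration, not the simulation code used in the paper: the recurrent network, the synaptic delays, and the spike queues are replaced by a hypothetical pool of independent Poisson inputs, all kicks of a time step are applied at the end of the step, and inputs arriving during the refractory period are discarded. The input parameters are arbitrary.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method, adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_lif(n_inputs, rate_in, efficacy, t_max=2.0, dt=50e-6,
                 tau_m=0.010, tau_ref=0.002, v_th=20.0, v_reset=10.0, seed=0):
    """Return the spike times (s) of one LIF neuron driven by Poisson kicks."""
    rng = random.Random(seed)
    decay = math.exp(-dt / tau_m)            # exact leak over one time step
    mean_kicks = n_inputs * rate_in * dt     # expected number of input spikes per step
    v = 0.0
    refractory_left = 0.0
    spike_times = []
    for step in range(int(round(t_max / dt))):
        if refractory_left > 0.0:
            refractory_left -= dt
            v = v_reset                      # voltage clamped at reset
            continue
        v = v * decay + efficacy * poisson_sample(rng, mean_kicks)
        if v >= v_th:                        # threshold is checked only at the end of the step
            spike_times.append((step + 1) * dt)
            v = v_reset
            refractory_left = tau_ref
    return spike_times

spikes = simulate_lif(n_inputs=1000, rate_in=5.0, efficacy=0.5)
print("output rate (Hz):", len(spikes) / 2.0)
```

Checking the threshold only at the end of each step means that a neuron can emit at most one spike per step, which is harmless here because the step is much shorter than the refractory period.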


3 Results

3.1 Network Architecture and Scaling Considerations. We first study the issue of how the architecture of the network determines the qualitative nature of the steady-state solutions for the firing rate of the cells in the network. In particular, we are interested in understanding under which conditions there are multiple steady-state solutions (bistability or, in general, multistability) in networks in which cells will fire with a high degree of irregularity. We consider a network for object working memory composed of a number of nonoverlapping subpopulations, or columns, defined by selectivity with respect to a given external stimulus. Each subpopulation contains both excitatory and inhibitory cells. The synaptic properties of neurons within the same column are assumed to be statistically identical. Thus, the column is, in our network, the minimal signaling unit at the average level, that is, all the neurons within a column behave identically on average. As will be shown below, the type of bistability realizable for the column depends critically on its size (more specifically, on the number of connections that a given cell receives from within its own column). Different architectures will thus be considered in which the total number of afferent connections per cell C is constant (and large) but the number of columns in the network varies, effectively varying the number of afferent connections from a given column to a cell. A multicolumnar architecture of this sort is inspired by the anatomical organization of the PFC, in which columnarly organized putative excitatory cells and interneurons show similar response profiles during working memory tasks (Rao, Williams, & Goldman-Rakic, 1999). As noted in section 1, many of the properties of the network can be inferred from a scaling analysis.
In the limit in which the connectivity is very sparse, so that correlations between the spike trains of different cells can be neglected (see section 2), the relevant parameter is the number of afferent connections per neuron C. We will investigate the behavior of the network in the limit $C \to \infty$ (the “extensive” limit) since, in this case, the different scenarios become qualitatively different. Of course, physical neurons receive a finite number of connections, but the rationale is that the physical solution can be considered a small perturbation to the solution found in the $C = \infty$ case, which is much easier to characterize. One should keep in mind that even if C becomes very large, we still need to impose the sparseness condition for our analysis to be valid, which implies that it should always hold that $N \gg C$. When considering current-based scenarios in the extensive limit, one is forced to normalize the connection strengths (the size of the unitary PSPs, which we denote generally by J) by (some power of) the number of connections per cell C, in order to keep the total afferent current within the (presynaptic) dynamic range of the neuron (whose order of magnitude is given by the distance between reset and threshold). As we show below,


different scaling schemes of J with C lead to different relative magnitudes of the mean and fluctuations of the afferent current into the cells in the extensive limit, and this in turn determines the type of steady-state solutions for the network. We thus proceed to analyze the expressions for the mean and variance of the afferent current (see equations 2.5 and 2.6) under different scaling assumptions. We consider multicolumnar networks in which the C afferent connections to a given cell come from $N_c$ different “columns” (each contributing $C_c$ connections, so that $C = N_c C_c$). Each column is composed of an excitatory and an inhibitory subpopulation. The multicolumnar structure of the network is specified by the following scaling relationships,

$$N_c \equiv n_c\, C^{\alpha}, \qquad C_c \equiv n_c^{-1}\, C^{1-\alpha},$$

with 0 ≤ α ≤ 1 and $n_c$ order one, that is, independent of C. The case α = 0 corresponds to a finite number $n_c$ of columns, each contributing a number of connections of order C. The case α = 1 corresponds to an extensive number of columns, each contributing a number of connections of order one, that is, a fixed number as the total number of connections C grows. Although connection strengths between the different subpopulations can all be different, we assume that they can be classified into two types according to their scaling with C: those between cells within the same column, of strength $J^{w}$, and those between cells belonging to different columns, of strength $J^{b}$ (the scaling is assumed to be the same for excitatory and for inhibitory connections). We define

$$J^{w} \equiv j^{w}\, C^{-\alpha_w}, \qquad J^{b} \equiv j^{b}\, C^{-\alpha_b},$$

where $\alpha_w, \alpha_b > 0$ and the j’s are all order one. In these conditions, the afferent current to the excitatory or inhibitory cells (it does not matter which, for this analysis) from their own subpopulation is characterized by

$$\mu_{\text{in}} = C_c \left[ J_E^{w} \nu_E^{\text{in}} - J_I^{w} \nu_I^{\text{in}} \right] = C^{1-\alpha-\alpha_w} \left[ j_E^{w} \nu_E^{\text{in}} - j_I^{w} \nu_I^{\text{in}} \right] n_c^{-1} \equiv C^{1-\alpha-\alpha_w} f_{\mu}^{\text{in}}$$

$$\sigma_{\text{in}}^2 = C_c \left[ (J_E^{w})^2 \nu_E^{\text{in}} (\mathrm{CV}_E^{\text{in}})^2 + (J_I^{w})^2 \nu_I^{\text{in}} (\mathrm{CV}_I^{\text{in}})^2 \right] = C^{1-\alpha-2\alpha_w} \left[ (j_E^{w})^2 \nu_E^{\text{in}} (\mathrm{CV}_E^{\text{in}})^2 + (j_I^{w})^2 \nu_I^{\text{in}} (\mathrm{CV}_I^{\text{in}})^2 \right] n_c^{-1} \equiv C^{1-\alpha-2\alpha_w} f_{\sigma}^{\text{in}},$$

where the f ’s are linear combinations of rates and CVs weighted by factors


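of order one. We proceed by assuming that all other columns are in the same state $\nu_{E,I}^{\text{out}}$, $\mathrm{CV}_{E,I}^{\text{out}}$, so that the current to the cells in the column under focus from the rest of the network is characterized by

$$\mu_{\text{out}} = (N_c - 1)\, C_c \left[ J_E^{b} \nu_E^{\text{out}} - J_I^{b} \nu_I^{\text{out}} \right] = C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ j_E^{b} \nu_E^{\text{out}} - j_I^{b} \nu_I^{\text{out}} \right] \equiv C^{1-\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\mu}^{\text{out}}$$

$$\sigma_{\text{out}}^2 = (N_c - 1)\, C_c \left[ (J_E^{b})^2 \nu_E^{\text{out}} (\mathrm{CV}_E^{\text{out}})^2 + (J_I^{b})^2 \nu_I^{\text{out}} (\mathrm{CV}_I^{\text{out}})^2 \right] = C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) \left[ (j_E^{b})^2 \nu_E^{\text{out}} (\mathrm{CV}_E^{\text{out}})^2 + (j_I^{b})^2 \nu_I^{\text{out}} (\mathrm{CV}_I^{\text{out}})^2 \right] \equiv C^{1-2\alpha_b} \left( 1 - n_c^{-1} C^{-\alpha} \right) f_{\sigma}^{\text{out}}.$$

In the extensive limit, the terms $\left( 1 - n_c^{-1} C^{-\alpha} \right)$ become equal to one if α > 0 and are of order one if α = 0, in which case they can be included as an extra multiplicative factor in the f terms. We thus omit them from now on. In addition to their recurrent inputs, cells receive a similar number of external excitatory inputs as well, but since we are interested in the generation of irregularity by the network, we will assume this external drive to be deterministic, that is, characterized by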

$$\mu_{\text{ext}} = C\, J_{\text{ext}}\, \nu_{\text{ext}} = C^{1-\alpha_{\text{ext}}}\, j_{\text{ext}}\, \nu_{\text{ext}} \equiv C^{1-\alpha_{\text{ext}}} f_{\text{ext}},$$

with $J_{\text{ext}} = j_{\text{ext}} C^{-\alpha_{\text{ext}}}$ and $\alpha_{\text{ext}} > 0$. The scaling with C of the different components of the total afferent current is thus given by

$$\mu_{\text{in}} = C^{1-\alpha-\alpha_w} f_{\mu}^{\text{in}}, \qquad \sigma_{\text{in}}^2 = C^{1-\alpha-2\alpha_w} f_{\sigma}^{\text{in}},$$

$$\mu_{\text{out}} = C^{1-\alpha_b} f_{\mu}^{\text{out}}, \qquad \sigma_{\text{out}}^2 = C^{1-2\alpha_b} f_{\sigma}^{\text{out}},$$

$$\mu_{\text{ext}} = C^{1-\alpha_{\text{ext}}} f_{\text{ext}}.$$

If $\alpha$, $\alpha_b$, $\alpha_w$ are such that the variances vanish as C → ∞, the corresponding networks will consist of regularly spiking neurons. Since we are interested in irregular spiking, we therefore look for solutions in which $\sigma_{\text{in}}^2$, $\sigma_{\text{out}}^2$, or both remain order one in the C → ∞ limit. There are several ways to achieve this.

3.1.1 Scenario 1: Homogeneous Balanced Network. This case is associated with the choice α = 0. In this case, the size of the columns is of the same order as the size of the whole network (i.e., the number of columns, $n_c$, is order one), in which case the in and out quantities become equivalent. A finite variance is achieved by setting $\alpha_w = \alpha_b = 1/2$, that is, $J \propto 1/\sqrt{C}$. This scenario is equivalent to the network studied originally in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998). In such


a network, the mean input from the recurrent network grows as the square root of the number of inputs,

$$\mu_{\text{in}} + \mu_{\text{out}} = \sqrt{C} \left( f_{\mu}^{\text{in}} + f_{\mu}^{\text{out}} \right).$$

This quantity can be positive or negative depending on the excitation-inhibition balance in the network. The overall mean input into the neurons is obtained by adding the external input: $\mu = \mu_{\text{in}} + \mu_{\text{out}} + \mu_{\text{ext}}$. In order not to saturate the dynamic range of the cell in the extensive limit, the overall mean current into the neurons should remain of order one as C → ∞. Hence, we need

$$\mu = \sqrt{C} \left[ f_{\mu}^{\text{in}} + f_{\mu}^{\text{out}} + f_{\text{ext}}\, C^{\frac{1}{2}-\alpha_{\text{ext}}} \right] \sim O(1), \tag{3.1}$$

which is possible only if the term in square brackets vanishes as $1/\sqrt{C}$. If the synapses from the external inputs vanish like 1/C ($\alpha_{\text{ext}} = 1$), then the external input has a negligible contribution to the overall mean input to the cells. In this case, since both $f_{\mu}^{\text{in}}$ and $f_{\mu}^{\text{out}}$ are linear combinations of the firing rates of the neurons inside and outside the column under focus, equation 3.1 effectively becomes, in the extensive limit, a set of linear homogeneous equations for the activity of the different columns (note that although we have, for brevity, written only one, there are four equations like equation 3.1, for the excitatory and inhibitory subpopulations inside and outside the column under focus). Thus, unless the matrix of coefficients of the firing rates in $f_{\mu}^{\text{in}}$ and $f_{\mu}^{\text{out}}$ for the excitatory and inhibitory subpopulations is not full rank, the only solution of equation 3.1 in the extensive limit is given by a silent, zero-rate network (van Vreeswijk & Sompolinsky, 1998). On the other hand, if $\alpha_{\text{ext}} = 1/2$, the linear system defined by equations 3.1 in the extensive limit is no longer homogeneous. Hence, in the general case (except for degeneracies), if $\alpha_{\text{ext}} = 1/2$, there is a single self-consistent solution for the firing rates in the network, in which the activity in each subpopulation is proportional to the external drive $\nu_{\text{ext}}$ (van Vreeswijk & Sompolinsky, 1998). This highlights the importance of a powerful external excitatory drive. When $\alpha_{\text{ext}} = 1/2$, the external drive by itself would drive the cells to saturation if the recurrent connections were inactivated. In the presence of the recurrent connections, the activities in the excitatory and inhibitory subpopulations adjust themselves to compensate for this massive external drive. The firing rates in the self-consistent solution correspond to the only way in which this compensation can occur for all the subpopulations at the same time.
It follows that inasmuch as the different inputs to the neuron combine linearly, unless the connectivity matrix is degenerate, which requires some kind of fine-tuning mechanism, bistability in a large, homogeneous balanced network is not a robust property.
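The uniqueness argument of scenario 1 can be illustrated with a minimal sketch. It assumes a single excitatory and a single inhibitory population (rather than the four equations mentioned in the text) and solves the leading-order balance conditions obtained from equation 3.1 with αext = 1/2. The function name and all coupling values are hypothetical and chosen only for illustration.

```python
def balanced_rates(j_ee, j_ei, j_ie, j_ii, j_e_ext, j_i_ext, nu_ext):
    """Solve the leading-order balance conditions (hypothetical two-population case)

        j_ee*nu_e - j_ei*nu_i + j_e_ext*nu_ext = 0
        j_ie*nu_e - j_ii*nu_i + j_i_ext*nu_ext = 0

    for (nu_e, nu_i) with Cramer's rule.
    """
    det = j_ei * j_ie - j_ee * j_ii
    if abs(det) < 1e-12:
        # Degenerate connectivity: the linear system has no unique solution.
        raise ValueError("coupling matrix is not full rank")
    nu_e = nu_ext * (j_e_ext * j_ii - j_ei * j_i_ext) / det
    nu_i = nu_ext * (j_e_ext * j_ie - j_ee * j_i_ext) / det
    return nu_e, nu_i


# Illustrative (made-up) order-one couplings; rates scale linearly with nu_ext.
print(balanced_rates(1.0, 2.0, 1.0, 1.8, 1.0, 0.8, nu_ext=5.0))
print(balanced_rates(1.0, 2.0, 1.0, 1.8, 1.0, 0.8, nu_ext=10.0))
```

Because the system is linear, doubling the external rate doubles both population rates, and no second solution exists unless the determinant vanishes. A physically meaningful solution additionally requires both rates to be positive, which constrains the couplings.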


3.1.2 Scenario 2: Homogeneous Multicolumnar Balanced Network. Since the linearity condition that results in a unique solution follows from the mean current to the cells from within the column growing with C, we impose that $\mu_{\text{in}} \sim O(1)$, that is, $1 - \alpha - \alpha_w = 0$. If this is the case, the variance coming from within the column goes as $C^{\alpha-1}$. We consider first the case α < 1. In these conditions, the variance coming from the activity inside the column vanishes for large C. To keep the total variance finite, we set $\alpha_b = 1/2$. If we also choose α = 1/2, then $\alpha_w = \alpha_b = 1/2$, so the network is homogeneous in that all the connection strengths scale similarly with C regardless of whether they connect cells belonging to the same or different columns. Since α = 1/2, there are many columns in the network, and the number of connections coming from inside a particular column is a very small fraction (which decreases like $1/\sqrt{C}$) of the total number of afferent connections to the cell. The fact that $\alpha_b = 1/2$ implies that the mean current coming from outside the column will still grow like $\sqrt{C}$. Thus, in order for the cells not to saturate, the excitation and the inhibition from outside the column have to balance precisely; the steady state of the rest of the network becomes, again, a unique, linear function of the external input to the cells (where again we choose $\alpha_{\text{ext}} = 1/2$ to avoid the quiescent state). However, now the mean current coming from inside the column is independent of C, so the steady-state activity inside the column is not determined by a set of linear equations. Instead, it should be determined self-consistently using the nonlinear transfer function in equation 2.3, which, in principle, permits bistability. This scenario is, in fact, equivalent to the one studied in Brunel (2000b), where a systematic analysis of the conditions in which bistability in a network like this can exist has been performed.
(Although no explicit scaling of the synaptic connection strengths with C was assumed in Brunel, 2000b, the essential fact that the total variance to the cells in the subpopulation that supports bistability is constant is considered in that article.) As will be shown in detail below, the fact that the potential multiple steady-state solutions in this scenario differ only in the mean current to the cells, not in their variance (which is fixed by the balance condition on the rates outside the column), leads necessarily to a lower (in general, significantly lower) CV in the activity in the cells in the elevated persistent activity state. Therefore, in a network with $J \propto 1/\sqrt{C}$ scaling, bistability is possible in small subsets of neurons comprising a fraction $\propto 1/\sqrt{C}$ of the total number of connections per cell, but the elevated persistent activity states are characterized by a change in the mean drive to the cells at constant variance, and, as we show below, this leads to a significant decrease in the spiking irregularity in the elevated persistent activity states.

3.1.3 Scenario 3: Heterogeneous Multicolumnar Network. In order for the CV in the elevated persistent activity state to remain close to one, the variance of the afferent current to the cells inside the column should depend


on their own activity. Thus, in addition to the condition $1 - \alpha - \alpha_w = 0$ necessary for bistability, we have to impose that $\sigma_{\text{in}}^2 \propto C^{\alpha-1}$ be independent of C, that is, α = 1, which implies $\alpha_w = 0$. In these conditions, the extensive number of connections per cell come from an extensive number of columns, with the number of connections from each column remaining a finite number. The $\alpha_w = 0$ condition reflects the fact that since cells receive only a finite number of intracolumnar connections, the strength of these connections does not need to scale in any specific way with C. As for the activity outside the network, one could now, in principle, choose either $J^{b} \propto 1/C$ or $J^{b} \propto 1/\sqrt{C}$ (corresponding to $\alpha_b = 1$ and $\alpha_b = 1/2$, respectively), since there is already a finite amount of variance coming from within the column. In the first case, the rest of the network contributes only a noiseless deterministic current whose exact amount has to be determined self-consistently, and in the second it contributes both to the total mean and variance of the afferent current to the neurons in the column. In this last case, as in the previous two scenarios, the $J^{b} \propto 1/\sqrt{C}$ scaling results in the need for balance between the total excitation and inhibition outside the network, which (again, if $\alpha_{\text{ext}} = 1/2$) leads to a unique solution for the activity of the rest of the population linear in the external drive to these neurons. In this scenario, the network is heterogeneous since the strength of the connections from neurons within the same column is larger than those from neurons in other columns. Since the rate and CV of the cells inside the column have to be determined self-consistently in this case, we proceed to do a systematic quantitative analysis of this scenario in the next section. From the scaling considerations described in this section, it is already clear, though, that a potential bistable solution with high CV is possible only in a small network.
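The three scenarios can be compared side by side with a small sketch that tabulates the exponent of C for each component of the afferent current, using the scaling relations of this section (the order-one factors f are ignored). The function and dictionary names are hypothetical; the parameter choices are those stated in the text, with αb = 1/2 picked for scenario 3.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def current_exponents(alpha, alpha_w, alpha_b, alpha_ext):
    """Exponent of C for each component of the afferent current."""
    return {
        "mu_in": 1 - alpha - alpha_w,
        "var_in": 1 - alpha - 2 * alpha_w,
        "mu_out": 1 - alpha_b,
        "var_out": 1 - 2 * alpha_b,
        "mu_ext": 1 - alpha_ext,
    }


# (alpha, alpha_w, alpha_b, alpha_ext) for each scenario in the text.
SCENARIOS = {
    "1: homogeneous balanced": (0, HALF, HALF, HALF),
    "2: homogeneous multicolumnar": (HALF, HALF, HALF, HALF),
    "3: heterogeneous multicolumnar": (1, 0, HALF, HALF),
}

for name, params in SCENARIOS.items():
    exps = current_exponents(*params)
    print(name, {key: str(val) for key, val in exps.items()})
```

An exponent of 0 means the component stays of order one as C grows, a positive exponent means it diverges (and must be balanced), and a negative exponent means it vanishes. Only scenario 3 keeps both the mean and the variance of the intracolumnar current of order one.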

3.2 Types of Bistable Solutions in a Reduced Network. In this section, we consider the network described in scenario 3 in the previous section, with the choice $\alpha_b = 1/2$, and analyze the types of steady-state solutions for the activity in a particular column of finite size. The rest of the network is in a balanced state, and its activity is completely decoupled from the activity of the column, which is too small in size to make a difference in the overall input to the rest of the cells. For our present purposes, all that matters about the afferents outside the column (from both the rest of the network and the external ones) is that they provide a finite net input to the cells in the column. We denote the mean and variance of that fixed external current by $\mu_{E,I}^{\text{ext}}/\tau_m$ and $2(\sigma_{E,I}^{\text{ext}})^2/\tau_m$, where the factors with the membrane time constant $\tau_m$ have been included so that $\mu_{E,I}^{\text{ext}}$ and $(\sigma_{E,I}^{\text{ext}})^2$ represent the contribution to the mean and variance of the postsynaptic depolarization (in the absence of threshold) in the steady states arising from outside the column.


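The net input to the excitatory and inhibitory populations in the column under focus is thus characterized by

$$\mu_E = c_{EE}\, j_{EE}\, \nu_E - c_{EI}\, j_{EI}\, \nu_I + \frac{\mu_E^{\text{ext}}}{\tau_m}$$

$$\mu_I = c_{IE}\, j_{IE}\, \nu_E - c_{II}\, j_{II}\, \nu_I + \frac{\mu_I^{\text{ext}}}{\tau_m}$$

$$\sigma_E^2 = c_{EE}\, j_{EE}^2\, \nu_E\, \mathrm{CV}_E^2 + c_{EI}\, j_{EI}^2\, \nu_I\, \mathrm{CV}_I^2 + \frac{(\sigma_E^{\text{ext}})^2}{\tau_m/2}$$

$$\sigma_I^2 = c_{IE}\, j_{IE}^2\, \nu_E\, \mathrm{CV}_E^2 + c_{II}\, j_{II}^2\, \nu_I\, \mathrm{CV}_I^2 + \frac{(\sigma_I^{\text{ext}})^2}{\tau_m/2},$$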

where the number of connections and the connection strength parameters, c and j, are all order one. We proceed by simplifying this scheme further in order to reduce the dimensionality of the system from four to two, which will allow a systematic exploration of the effect of all the parameters on the type of steady-state solutions of the network. In particular, we make the inputs to the excitatory and inhibitory populations identical,

$$c_{IE} = c_{EE} \equiv c_E, \qquad c_{II} = c_{EI} \equiv c_I, \qquad j_{IE} = j_{EE} \equiv j_E, \qquad j_{II} = j_{EI} \equiv j_I,$$

$$\mu_E^{\text{ext}} = \mu_I^{\text{ext}} \equiv \mu^{\text{ext}}, \qquad (\sigma_E^{\text{ext}})^2 = (\sigma_I^{\text{ext}})^2 \equiv (\sigma^{\text{ext}})^2,$$

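so that the whole column becomes statistically identical: $\nu_E = \nu_I \equiv \nu$ and $\mathrm{CV}_E = \mathrm{CV}_I \equiv \mathrm{CV}$. For simplicity, we also assume that the number of excitatory and inhibitory inputs is the same: $c_E = c_I = c$. Thus, we are left with a system with four parameters,

$$c_\mu = c\,(j_E - j_I); \qquad c_\sigma = \sqrt{c\,(j_E^2 + j_I^2)}; \qquad \mu^{\text{ext}}; \qquad \sigma^{\text{ext}}, \tag{3.2}$$

all with units of mV, and two dynamical variables (from equation 2.7)

$$\frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2,$$

where

$$\mu = c_\mu\, \nu + \frac{\mu^{\text{ext}}}{\tau_m}; \qquad \sigma^2 = c_\sigma^2\, \nu\, \mathrm{CV}^2 + \frac{(\sigma^{\text{ext}})^2}{\tau_m/2} \tag{3.3}$$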

[Figure 2 panels: Firing Rate (Hz) (left), CV (middle), and Firing Rate × CV² (Hz) (right), each plotted over the plane of µ (mV) and σ (mV).]

Figure 2: Mean firing rate ν (left), CV (middle), and product νCV2 (right) of the LIF neuron as a function of the mean and standard deviation of the depolarization. Parameters: Vth = 20 mV, Vres = 10 mV, τm = 10 ms, and τref = 2 ms.

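and

$$\nu(\mu_V, \sigma_V)^{-1} = \tau_{\text{ref}} + \tau_m \sqrt{\pi} \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \left[ 1 + \operatorname{erf}(x) \right] \tag{3.4}$$

$$\mathrm{CV}^2(\mu_V, \sigma_V) = 2\pi \nu^2 \tau_m^2 \int_{\frac{V_{\text{res}} - \mu_V}{\sigma_V \sqrt{2}}}^{\frac{V_{\text{th}} - \mu_V}{\sigma_V \sqrt{2}}} dx\, e^{x^2} \int_{-\infty}^{x} dy\, e^{y^2} \left[ 1 + \operatorname{erf}(y) \right]^2. \tag{3.5}$$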
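Equations 3.4 and 3.5 can be evaluated numerically with a few lines of code. The following is a minimal sketch, not the authors' implementation: it uses trapezoidal integration, the neuron parameters quoted in the caption of Figure 2 as defaults, and writes the prefactor of equation 3.5 as 2πν²τm² so that CV² is dimensionless. The function names `phi` and `lif_rate_cv`, the step size, and the rescaled recursion used to avoid overflow of exp(x²) are all choices made for this illustration.

```python
import math

SQRT_PI = math.sqrt(math.pi)


def phi(x):
    """exp(x**2) * (1 + erf(x)), using an asymptotic series for very negative x."""
    if x < -5.0:
        x2 = x * x
        return (1.0 - 0.5 / x2 + 0.75 / (x2 * x2)) / (SQRT_PI * -x)
    return math.exp(x * x) * math.erfc(-x)


def _advance(k, x, step, p0, p1):
    # k is K(x) = exp(x^2) * int_{-inf}^{x} exp(y^2) [1 + erf(y)]^2 dy.
    # One trapezoid step from x to x + step, written without large exponentials.
    growth = math.exp((2.0 * x + step) * step)
    return growth * (k + 0.5 * step * p0 * p0) + 0.5 * step * p1 * p1


def lif_rate_cv(mu_v, sigma_v, v_th=20.0, v_res=10.0,
                tau_m=0.010, tau_ref=0.002, h=0.005):
    """Rate (Hz) and CV of the LIF neuron; voltages in mV, times in seconds.

    Requires sigma_v > 0. Assumes (v_th - mu_v) / (sigma_v * sqrt(2)) stays
    below about 18; beyond that the rate is negligible and floats overflow.
    """
    a = (v_res - mu_v) / (sigma_v * math.sqrt(2.0))
    b = (v_th - mu_v) / (sigma_v * math.sqrt(2.0))
    # Build K up to the lower limit a, starting where the integrand is negligible.
    x = min(a, 0.0) - 6.0
    n_pre = math.ceil((a - x) / h)
    step = (a - x) / n_pre
    k, p0 = 0.0, phi(x)
    for _ in range(n_pre):
        p1 = phi(x + step)
        k = _advance(k, x, step, p0, p1)
        x, p0 = x + step, p1
    # Integrate phi (equation 3.4) and K (equation 3.5) from a to b.
    n = max(1, math.ceil((b - a) / h))
    step = (b - a) / n
    x, p0 = a, phi(a)
    s_rate = s_cv = 0.0
    for _ in range(n):
        p1 = phi(x + step)
        k_next = _advance(k, x, step, p0, p1)
        s_rate += 0.5 * step * (p0 + p1)
        s_cv += 0.5 * step * (k + k_next)
        x, p0, k = x + step, p1, k_next
    rate = 1.0 / (tau_ref + tau_m * SQRT_PI * s_rate)
    cv = rate * tau_m * math.sqrt(2.0 * math.pi * s_cv)
    return rate, cv


print(lif_rate_cv(25.0, 1.0))  # suprathreshold mean, weak noise: regular firing
print(lif_rate_cv(10.0, 5.0))  # subthreshold mean, strong noise: irregular firing
```

The two calls contrast the mean-driven regime (mean depolarization above threshold, low CV) with the fluctuation-driven regime (mean below threshold, CV near one).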

The parameters $c_\mu$ and $c_\sigma^2$ measure the strength of the feedback that the activity in the column produces on the mean and variance of the current to the cells. $c_\mu$ can be less than, equal to, or greater than zero. A value larger (less) than zero implies that the activity in the column has a net excitatory (inhibitory) effect on the neurons. In general, we assume the positive parameter $c_\sigma^2$ to be independent of $c_\mu$ (implying the recurrent feedback on the mean and on the variance can be manipulated independently). Note, however, that since $j_I/j_E > 0$, $c_\sigma^2$ cannot be arbitrarily small, that is, $c_\sigma^2 > c_\mu^2/c$. Equations 3.4 and 3.5 are plotted as a function of $\mu_V$ and $\sigma_V$ in Figure 2.

3.2.1 Mean- and Fluctuation-Driven Bistability in the Reduced System. The nullclines for the two equations 3.3 are given by

$$\nu(\mu_V, \sigma_V) = \frac{\mu_V - \mu^{\text{ext}}}{\tau_m\, c_\mu}; \qquad \nu(\mu_V, \sigma_V)\, \mathrm{CV}^2(\mu_V, \sigma_V) = \frac{\sigma_V^2 - (\sigma^{\text{ext}})^2}{(\tau_m/2)\, c_\sigma^2}.$$
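The fixed points defined by these nullclines can be located by relaxing equations 3.3 from different initial conditions. The sketch below does this with forward Euler steps, using the parameters quoted in the caption of Figure 4 (µext = 18 mV, σext = 0.65 mV, cµ = 7.2 mV, cσ = 1 mV) and the neuron parameters of Figure 2. The initial conditions, the time step, and the integration scheme are assumptions of this illustration, and the transfer functions are evaluated with a coarse trapezoid rule, taking the prefactor of equation 3.5 as 2πν²τm².

```python
import math

SQRT_PI = math.sqrt(math.pi)
TAU_M, TAU_REF, V_TH, V_RES = 0.010, 0.002, 20.0, 10.0  # s, s, mV, mV


def phi(x):
    """exp(x**2) * (1 + erf(x)), with an asymptotic series for x < -5."""
    if x < -5.0:
        x2 = x * x
        return (1.0 - 0.5 / x2 + 0.75 / (x2 * x2)) / (SQRT_PI * -x)
    return math.exp(x * x) * math.erfc(-x)


def lif_rate_cv(mu_v, sigma_v, h=0.02):
    """Rate (Hz) and CV from equations 3.4 and 3.5 (coarse trapezoid rule)."""
    a = (V_RES - mu_v) / (sigma_v * math.sqrt(2.0))
    b = (V_TH - mu_v) / (sigma_v * math.sqrt(2.0))
    x = min(a, 0.0) - 6.0
    s_rate = s_cv = k = 0.0
    for lo, hi, inside in ((x, a, False), (a, b, True)):
        n = max(1, math.ceil((hi - lo) / h))
        step = (hi - lo) / n
        x, p0 = lo, phi(lo)
        for _ in range(n):
            p1 = phi(x + step)
            # k tracks exp(x^2) * int_{-inf}^{x} exp(y^2) [1 + erf(y)]^2 dy.
            growth = math.exp((2.0 * x + step) * step)
            k_next = growth * (k + 0.5 * step * p0 * p0) + 0.5 * step * p1 * p1
            if inside:
                s_rate += 0.5 * step * (p0 + p1)
                s_cv += 0.5 * step * (k + k_next)
            x, p0, k = x + step, p1, k_next
    rate = 1.0 / (TAU_REF + TAU_M * SQRT_PI * s_rate)
    return rate, rate * TAU_M * math.sqrt(2.0 * math.pi * s_cv)


def relax(mu_v, sigma_v, mu_ext, sigma_ext, c_mu, c_sigma, dt=0.002, t_end=1.0):
    """Forward-Euler integration of equations 3.3; returns the final state."""
    var_v = sigma_v ** 2
    for _ in range(int(round(t_end / dt))):
        rate, cv = lif_rate_cv(mu_v, math.sqrt(var_v))
        mu = c_mu * rate + mu_ext / TAU_M
        sig2 = c_sigma ** 2 * rate * cv ** 2 + sigma_ext ** 2 / (TAU_M / 2.0)
        mu_v += dt * (-mu_v / TAU_M + mu)
        var_v += dt * (-var_v / (TAU_M / 2.0) + sig2)
    rate, cv = lif_rate_cv(mu_v, math.sqrt(var_v))
    return mu_v, math.sqrt(var_v), rate, cv


FIG4 = dict(mu_ext=18.0, sigma_ext=0.65, c_mu=7.2, c_sigma=1.0)
low_state = relax(18.0, 0.65, **FIG4)   # start below threshold
high_state = relax(22.0, 0.65, **FIG4)  # start above threshold
print("low :", low_state)
print("high:", high_state)
```

Starting below threshold, the state should settle on the low-rate, high-CV fixed point, and starting above threshold on the elevated, low-CV one, with nearly the same σV in both, which is the signature of mean-driven bistability.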

The values of $(\mu_V, \sigma_V)$ that satisfy these equations are shown in Figure 3 for several values of the parameters $c_\mu$, $c_\sigma$. The nullclines for the mean (see Figure 3, left) are the projection on the $(\mu_V, \sigma_V)$ plane of the intersection of the surface in Figure 2, left, with a plane parallel to the $\sigma_V$ axis, shifted by $\mu^{\text{ext}}$ and tilted with slope $1/(\tau_m c_\mu)$. Since the mean firing

[Figure 3 panels: “Nullclines for the mean” (left; curves for Cµ < 0, Cµ = 0, and Cµ > 0; µext = 15 mV) and “Nullclines for the Standard Deviation” (right; curves for Cσ = 0 and Cσ > 0; σext = 2 mV). Axes: µ (mV) and σ (mV).]

Figure 3: Nullclines of the equation for the mean µV (left) and standard deviation σV (right). The nullcline for the mean depends on only c µ , and the one for the standard deviation depends on only c σ .

rate as a function of the average depolarization changes curvature (it has an inflection point near threshold, where firing changes from being driven by fluctuations to being driven by the mean), the nullcline for the mean has a “knee” when the net feedback $c_\mu$ is large enough and excitatory. Similarly, the nullclines for the standard deviation of the depolarization (see Figure 3, right) are the projection on the $(\mu_V, \sigma_V)$ plane of the intersection of the surface in Figure 2, right, with a parabolic surface parallel to the $\mu_V$ axis, shifted by $(\sigma^{\text{ext}})^2$ and with a curvature $2/(\tau_m c_\sigma^2)$. Again, this curve can display a “knee” for high enough values of the net strength of the feedback onto the variance $c_\sigma^2$. The fixed points of the system are given by the points of intersection of the two nullclines. We now show, through two particular examples, the main result of this letter: depending on the degree of balance between excitation and inhibition, two types of bistability can exist: mean driven and fluctuation driven. Mean-Driven Bistability. Figure 4 shows a particular example of the type of bistability obtained for low-moderate values of $c_\sigma$ and moderate-high values of $c_\mu$. Figure 4a shows the time evolution of the firing rate (bottom) and CV (top) in the network when the external drive to the cells is transiently elevated. In response to the transient input, the network switches from a low-rate, high-CV basal state to an elevated activity state. For this particular type of bistability, the CV in this state is low. The nullclines for this example are shown in Figure 4b. The $\sigma_V$ nullcline is essentially parallel to the $\mu_V$ axis, and it intercepts the $\mu_V$ nullcline (which has a pronounced “knee”) at three points: one below (stable), one around (unstable), and one above

[Figure 4 panels: (a) CV (top) and Firing Rate (Hz) (bottom) versus Time (s); (b) nullcline for µ and nullcline for σ in the plane of µ (mV) and σ (mV), with the stable fixed points labeled Rate = 1.42 Hz, CV = 0.94 and Rate = 35.5 Hz, CV = 0.24.]

Figure 4: Example of mean-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the mean of the external drive to the neurons was elevated from 18 mV to 19 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is low. (b) Nullclines for this example. The two stable fixed points differ primarily in the mean current that the cells are receiving, with an essentially constant variance. Hence, the CV in the elevated persistent-activity fixed point is low. Parameters: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Dotted line: neuronal threshold.

(stable) threshold. The stable fixed point below threshold corresponds to the state of the system before the external input is transiently elevated in Figure 4a. It is typically characterized by a low rate and a relatively high CV, as subthreshold spiking is fluctuation driven and thus irregular. However, since the CV drops fast for µV > Vth at the values of σV at which the µV


nullcline bends (see Figure 2, middle), the CV in the suprathreshold fixed point is typically low (but see section 4). Fluctuations in the current to the cells play little role in determining the spike times of the cells in this elevated persistent activity state. Qualitatively, this is the type of excitation-driven bistability that has been thoroughly analyzed over the past few years. It is expected to be present in networks in which small populations of excitatory cells can be bistable in the presence of global inhibition by virtue of selective synaptic potentiation. It is also expected to be present in larger populations if the fluctuations arising from the recurrent connections are weak compared to those coming from external afferents, for instance, due to synaptic filtering (Wang, 1999). Fluctuation-Driven Bistability. If the connectivity is such that the mean drive to the neurons is only weakly dependent on the activity, that is, $c_\mu$ is small, but at the same time the activity has a strong effect on the variance, that is, $c_\sigma$ is large, the system can also have two stable fixed points, as shown in the example in Figure 5 (same format as in the previous figure). In this situation, however, the two fixed points are subthreshold, and they differ primarily in the variance of the current to the cells. Hence, spiking in both fixed points is fluctuation driven, and the CV is high in both of them; in particular, it is slightly higher in the elevated activity fixed point (see Figure 5a). This type of bistability can be realized only if there is a precise balance between the net excitatory and inhibitory drive to the cells. Since $c_\sigma$ must be large in order for the $\sigma_V$ nullcline to display a “knee,” both the net excitation and inhibition should be large, and in these conditions, a small $c_\mu$ can be achieved only if the balance between the two is precise.
This suggests that this regime will be sensitive to changes in the parameters determining the connectivity; that is, it will require fine-tuning, a conclusion that is supported by the analysis below. Mean- and fluctuation-driven bistability are not discrete phenomena. Depending on the values of the parameters, the elevated activity fixed point can rotate in the $(\mu_V, \sigma_V)$ plane, spanning intermediate values from those shown in the examples in Figures 4 and 5. We thus now proceed to a systematic characterization of all possible behaviors of the reduced system as a function of its four parameters.

3.2.2 Effect of the External Current. Since $c_\sigma^2 > 0$, the $\sigma_V$ nullcline always bends upward (see Figure 3), that is, the values of $\sigma_V$ in the nullcline are always larger than $\sigma^{\text{ext}}$. Assuming for simplicity that $c_\sigma$ can be arbitrarily low, this implies that no bistability can exist unless the external variance is low enough. In particular, for every value of the external mean $\mu^{\text{ext}}$, there is a critical value $\sigma_{c1}^{\text{ext}}$ defined as the value of $\sigma_V$ at which the first two derivatives of the $\mu_V$ nullcline vanish (see Figure 6, middle). For $\sigma^{\text{ext}} > \sigma_{c1}^{\text{ext}}$, the two nullclines can cross only once, and therefore no bistability of any

[Figure 5 panels: (a) CV (top) and Firing Rate (Hz) (bottom) versus Time (s); (b) nullcline for µ and nullcline for σ in the plane of µ (mV) and σ (mV), with the stable fixed points labeled Rate = 2.4 Hz, CV = 1.02 and Rate = 59.3 Hz, CV = 1.17.]

Figure 5: Example of fluctuation-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the standard deviation of the external drive to the neurons was elevated from 5 mV to 7 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is slightly higher than in the basal state. (b) Nullclines for this example. The two stable fixed points differ primarily in the variance of the current that the cells are receiving, with little change in the mean. Hence, the CV in the elevated persistent-activity fixed point is slightly higher than in the low-activity fixed point; that is, the CV increases with the rate. Parameters: µext = 5 mV, σ ext = 5 mV, c µ = 5 mV, c σ = 20.2 mV. Dotted line: Neuronal threshold.

kind is possible in the reduced system (see Figure 6, left). For values of σ ext only slightly lower than this critical value, the jump between the low- and the high-activity stable fixed points in the (µV , σV ) plane is approximately horizontal, so the type of bistability obtained is mean driven. For lower

[Figure 6 panels: σext > σext_c1 (left), σext = σext_c1 (middle), and σext = σext_c2 (right), each showing the µ nullcline and the σ nullcline in the plane of µ (mV) and σ (mV).]

Figure 6: The external variance determines the existence and types of bistability possible in the network. For $\sigma^{\text{ext}} > \sigma_{c1}^{\text{ext}}$ (left), no bistability is possible. $\sigma^{\text{ext}} = \sigma_{c1}^{\text{ext}}$ marks the onset of bistability (middle). At $\sigma^{\text{ext}} = \sigma_{c2}^{\text{ext}}$ bistability becomes possible in a perfectly balanced network (a network with $c_\mu = 0$) (right).

values of the external variance, a point is eventually reached at which bistability becomes possible in a perfectly balanced network. Again, for each $\mu^{\text{ext}}$, one can define a second critical value $\sigma_{c2}^{\text{ext}}$ in the following way: $\sigma_{c2}^{\text{ext}}$ is the value of $\sigma^{\text{ext}}$ at which the inflection point of the $\sigma_V$ nullcline with infinite derivative occurs at a value of $\mu_V$ equal to $\mu^{\text{ext}}$ (see Figure 6, right). For values of $\sigma^{\text{ext}} < \sigma_{c2}^{\text{ext}}$, bistability is possible in networks in which the net recurrent feedback is inhibitory. Since both critical values of $\sigma^{\text{ext}}$ are functions of the external mean, they define curves in the $(\mu^{\text{ext}}, \sigma^{\text{ext}})$ plane. These curves are plotted in Figure 7. Both are decreasing functions of $\mu^{\text{ext}}$ and meet at threshold, implying that bistability in the reduced network is possible only for subthreshold mean external inputs (see section 4).

3.2.3 Phase Diagrams of the Reduced Network. For each point in the $(\mu^{\text{ext}}, \sigma^{\text{ext}})$ plane, the external current is completely characterized, and the only two parameters left to be specified are $c_\mu$, $c_\sigma$. In particular, in the regions where bistability is possible, it will exist only for appropriate values of $c_\mu$ and $c_\sigma$. The two insets in Figure 7 show phase diagrams in the $(c_\mu, c_\sigma)$ plane showing the regions of bistability at two representative points: one in which $\sigma_{c2}^{\text{ext}} < \sigma^{\text{ext}} < \sigma_{c1}^{\text{ext}}$, in which bistability is possible only in excitation-dominated networks (top-right inset), and one in which $\sigma^{\text{ext}} < \sigma_{c2}^{\text{ext}}$, in which bistability is possible in both excitation- and inhibition-dominated networks (bottom-left inset). In this latter case, the region enclosed by the curve in which bistability can exist stretches to the left, including the region with $c_\mu \le 0$.
We have further characterized the nature of the fixed-point solutions in these two cases by plotting the rate and CV on each point in the (c µ , c σ ) plane on which bistability is possible, as well as the ratio between the rate and CV in the high- versus low-activity states. Instead of showing this in the

[Figure 7: main panel with axes µext (mV) and σext (mV), showing the critical lines σext_c1 and σext_c2 and the region labeled “No Bistability”; two insets with axes cµ (mV) and cσ (mV).]

Figure 7: Phase diagram with the effect of the mean and variance of the external current on the existence and types of bistability in the network. The two insets represent the regions of bistability in the $(c_\mu, c_\sigma)$ plane at the corresponding points in the $(\mu^{\text{ext}}, \sigma^{\text{ext}})$ plane. Fluctuation-driven bistability is possible only near and below the lower critical line $\sigma^{\text{ext}} = \sigma_{c2}^{\text{ext}}(\mu^{\text{ext}})$. Top-right and bottom-left insets correspond to $\mu^{\text{ext}} = 10$ mV; $\sigma^{\text{ext}} = 4$ mV and $\mu^{\text{ext}} = 10$ mV; $\sigma^{\text{ext}} = 3$ mV, respectively.

$(c_\mu, c_\sigma)$ plane, we have inverted equations 3.2 to show (assuming a constant c = 100) the results as a function of the unitary EPSP $j_E$ and of the ratio of the unitary inhibitory to excitatory PSPs $j_I/j_E$, which measures the degree of balance in the network, that is, $j_I/j_E = 1$ implies a perfect balance:

$$j_E = \frac{c_\mu + \sqrt{2 c\, c_\sigma^2 - c_\mu^2}}{2c}; \qquad j_I/j_E = \frac{\sqrt{2 c\, c_\sigma^2 - c_\mu^2} - c_\mu}{\sqrt{2 c\, c_\sigma^2 - c_\mu^2} + c_\mu}. \tag{3.6}$$
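The change of variables between equations 3.2 and 3.6 is easy to check numerically. The sketch below implements both directions and verifies that they are mutual inverses; the function names are hypothetical and the sample values are made up for illustration.

```python
import math


def to_feedback(j_e, j_i, c):
    """Equation 3.2: (c_mu, c_sigma) in mV from unitary PSPs j_e, j_i (mV)."""
    return c * (j_e - j_i), math.sqrt(c * (j_e ** 2 + j_i ** 2))


def to_synaptic(c_mu, c_sigma, c):
    """Equation 3.6: (j_e, j_i / j_e) from the feedback parameters."""
    root = math.sqrt(2.0 * c * c_sigma ** 2 - c_mu ** 2)
    return (c_mu + root) / (2.0 * c), (root - c_mu) / (root + c_mu)


# Round trip with illustrative values: j_e = 0.5 mV, j_i = 0.4 mV, c = 100.
c_mu, c_sigma = to_feedback(0.5, 0.4, 100)
j_e, ratio = to_synaptic(c_mu, c_sigma, 100)
print(c_mu, c_sigma, j_e, ratio)
```

The square root in equation 3.6 equals $c\,(j_E + j_I)$, which is why the constraint $c_\sigma^2 > c_\mu^2/c$ noted earlier guarantees a positive ratio $j_I/j_E$.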

In Figure 8 we show the results for the case where the external variance is higher than $\sigma_{c2}^{\text{ext}}$, so that bistability is possible only if the net recurrent connectivity is excitatory. Overall, the shape of the bistable region in this space is a diagonal to the right. This means that closer to the balanced region, the net excitatory drive (proportional to $j_E$) has to be higher in order for bistable solutions to exist. The low-activity fixed point (left column) is subthreshold, and thus spiking is fluctuation driven, characterized by a high CV. In this case, the high-activity fixed point is suprathreshold, so the CV in this fixed point is in general small (see the bottom right). Of course,

[Figure 8 panels: Rate_down (Hz) (top left), Rate_up/Rate_down (top right), CV_down (bottom left), and CV_up/CV_down (bottom right), each in the plane of jE (mV) and jI/jE; inset with axes Cµ and Cσ.]

Figure 8: Bistability in an excitation-dominated network in the ( j E , j I /j E ) plane. (Top and bottom left) Firing rate and CV in the low-activity fixed point. (Top and bottom right) Ratio of the firing rate and CV between the high- and low-activity fixed points. (Inset) same as bottom-right panel in the (c µ , c σ ) plane. Parameters: c = 100, µext = 10 mV, and σ ext = 4 mV.

very close to the cusp, at the onset of bistability, the CV (and rate) in both fixed points is similar. In Figure 9 the same information is shown for the case where the external variance is lower than $\sigma_{c2}^{\text{ext}}$, so that bistability is possible when the recurrent connectivity is dominated by excitation or inhibition. In this case, the region where the CV in the high- and low-activity fixed points is similar is larger, corresponding to situations in which $j_I/j_E \sim 1$, that is, excitation and inhibition in the network are roughly balanced. Only in this region is the firing rate in the high-activity state not too high, $\lesssim 100$ Hz. In this regime, when excitation dominates, the rate in the high-activity state becomes very high. Note also that the transition between the relatively low-rate, high-CV regime and the very high-rate, low-CV regime at $j_I/j_E \sim 0.9$ is relatively abrupt. Finally, we can use the relations 3.6 to study quantitatively the effect of the number of connections c on the regions of bistability, something we

[Figure 9 panels: Rate_down (Hz) (top left), Rate_up/Rate_down (top right), CV_down (bottom left), and CV_up/CV_down (bottom right), each in the plane of jE (mV) and jI/jE; inset with axes Cµ and Cσ.]

Figure 9: Mean and fluctuation-driven bistability in the ( j E , j I /j E ) plane. Panels as in Figure 8. (Inset) Portion of the bistability region with fluctuation-driven fixed points in the (c µ , c σ ) plane. When the network is approximately balanced, that is, j I /j E ∼ 1, the CV in the high-activity state is high. Parameters: c = 100, µext = 10 mV, and σ ext = 3 mV.

did in section 3.1 at a qualitative level based on scaling considerations. Figure 10 shows the effect of increasing the number of afferent connections per cell on the shape of the region of bistability for σ_ext = 4 mV (left) and for σ_ext = 3 mV (right). The results are clearer when shown in the plane (j_E c, j_I/j_E), where j_E c represents the net mean excitatory drive to the neurons. The range of values of the net excitatory drive in which bistability is allowed in the excitation-dominated regime, where j_I/j_E < 1, does not depend very strongly on c. However, for both σ_ext = 4 mV and σ_ext = 3 mV, when inhibition and excitation become more balanced, a higher net excitatory drive is needed. In particular, when σ_ext = 3 mV, the bistable region always includes the balanced network, j_I/j_E = 1, but the range of values of j_I/j_E ∼ 1 where bistability is possible (being in this case fluctuation driven) shrinks considerably as c grows. Thus, as noted in section 3.1, bistability in a large, balanced network requires a precise balance of excitation and inhibition.
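The tightening of the balance condition with c can be illustrated with a short numerical sketch. It assumes that the efficacies are related to (c_µ, c_σ) by c_µ = c j_E (1 − γ) and c_σ² = c j_E² (1 + γ²), with γ = j_I/j_E. This form is inferred from the parameter values quoted in the captions of Figures 12 and 13, not copied from the relations 3.6, and the function name is hypothetical.

```python
import math

def solve_synaptic_efficacies(c_mu, c_sigma, c):
    """Return (j_E, gamma), gamma = j_I / j_E, for given c_mu, c_sigma (mV) and c.

    Assumed relations (inferred from figure captions, not quoted from eq. 3.6):
        c_mu      = c * j_E * (1 - gamma)
        c_sigma^2 = c * j_E^2 * (1 + gamma^2)
    """
    r = c_mu ** 2 / c_sigma ** 2  # equals c * (1 - gamma)^2 / (1 + gamma^2)
    if r >= c:
        raise ValueError("requires c_mu^2 < c * c_sigma^2")
    # gamma solves (c - r) gamma^2 - 2 c gamma + (c - r) = 0; the roots are reciprocal.
    disc = math.sqrt(c ** 2 - (c - r) ** 2)
    if c_mu >= 0:
        gamma = (c - disc) / (c - r)  # net excitation: gamma <= 1
    else:
        gamma = (c + disc) / (c - r)  # net inhibition: gamma > 1
    j_E = c_sigma / math.sqrt(c * (1.0 + gamma ** 2))
    return j_E, gamma

# Fixed (c_mu, c_sigma) taken from the right column of Figure 11; c is varied.
for c in (100, 1000, 10000):
    j_E, gamma = solve_synaptic_efficacies(0.5, 19.0, c)
    print(f"c = {c:6d}   j_E = {j_E:.4f} mV   j_I/j_E = {gamma:.6f}")
```

Under these assumptions, holding (c_µ, c_σ) fixed while c grows pushes j_I/j_E toward 1, with 1 − j_I/j_E shrinking roughly as 1/√c. This is the numerical counterpart of the precise balance required in large networks.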

Figure 10: Effect of the number of connections c (c = 100, 1000, 10000) on the phase diagram in the (j_E c, j_I/j_E) plane for the case where µ_ext = 10 mV and σ_ext = 4 mV (left) and σ_ext = 3 mV (right). Note that in the right panel, j_I/j_E = 1 is always in the region of bistability, but the range of values of j_I/j_E ∼ 1 in the bistable region decreases significantly with c.

The precise balance between excitation and inhibition required to obtain solutions with fluctuation-driven bistability is also evident when one analyzes the effect of changing the net input to the excitatory and inhibitory subpopulations within a column. In this case, the mean firing rate and CV of the excitatory and inhibitory populations become different. We have chosen to study the effect of different ratios of excitation and inhibition to the excitatory and inhibitory populations. In particular, defining

$$\gamma_E \equiv \frac{j_{EI}}{j_{EE}} \quad \text{and} \quad \gamma_I \equiv \frac{j_{II}}{j_{IE}},$$

we have considered the effect of having γ_E ≠ γ_I while still considering that the excitatory connections to the excitatory and inhibitory populations are equal, that is, j_EE = j_IE ≡ j_E. To proceed with the analysis, we started by specifying a point in the parameter space of the symmetric network in which excitation and inhibition were identical by choosing a value for (µ_ext, σ_ext, c_µ, c_σ). Then, fixing c = 200, we used the relationships 3.6 to solve for j_E and γ and, defining γ_E ≡ γ, we found which values of γ_I resulted in bistable solutions. Correspondingly, when γ_I = γ_E, the two subpopulations within the column become identical again. We performed this analysis for two initial sets of parameters of the symmetric network: one corresponding to mean-driven and the other to fluctuation-driven bistability. The results of this analysis are shown in Figure 11. The type of bistability does not change qualitatively depending on the value of γ_I/γ_E in the mean-driven case (left column). For the right


[Figure 11 panels: firing rate (Hz) and CV of the low- and high-activity fixed points versus γ_I/γ_E, for the mean-driven (left) and fluctuation-driven (right) cases, with the bistable region marked.]

Figure 11: Effect of different levels of input to the excitatory and inhibitory subpopulations. The ratio between the inhibitory and excitatory connection strengths γ was allowed to be different for each subpopulation. Left and right columns correspond to mean- and fluctuation-driven bistability in the corresponding situation for the symmetric network. The network is bistable for values of γ_I/γ_E within the dashed lines. Parameters for the left column: µ_ext = 18 mV, σ_ext = 0.65 mV, c_µ = 7.2 mV, c_σ = 1 mV. Parameters for the right column: µ_ext = 10 mV, σ_ext = 3 mV, c_µ = 0.5 mV, c_σ = 19 mV.

column, however, the original fluctuation-driven regime is quickly abolished as γ_I/γ_E increases, leading to very high activity and low CV in the high-activity fixed point. Note that the size of the bistable region is also much smaller in this case.

3.3 Numerical Simulations. We conducted numerical simulations of our network to investigate whether the two types of bistable states that the mean-field analysis predicts, the mean-driven and the fluctuation-driven regimes, can be realized. Beyond the approximations that we are forced to make in order to construct the mean-field theory itself, the more qualitative and robust result that fluctuation-driven bistable points require large fluctuations in relatively small networks with relatively large synapses raises the question of whether potential fixed points in this very noisy network will indeed be stable. We found that under certain conditions, both types of bistable points can be realized in numerical simulations. In Figure 12 we show an example of a network supporting bistability in the mean-driven regime. In this network, the recurrent connections are dominated by excitation, and the mean of the external drive leaves the neurons very close to threshold, with the fluctuations of this external drive being small. As expected, the irregularity in the elevated-activity fixed point is low, with a CV ∼ 0.2. The mean µ_V in this fixed point is above threshold. The mean-field theory predicts for the same network a rate of 46.7 Hz and a CV of 0.21. An example of another network showing bistability, this time in the fluctuation-driven regime, is shown in Figure 13. In this network, the recurrent connections are dominated by inhibition, and the external drive leaves the membrane potential relatively far from threshold on average but has large fluctuations. Taking into account only consecutive ISIs, the temporal irregularity is still large: CV2 ∼ 0.8. The spike trains in the elevated activity state are quite irregular, partly, but not only, due to large temporal fluctuations in the instantaneous network activity. The mean-field prediction for these parameters gives a rate of 91.5 Hz and a CV of 1.6. In order for the elevated activity states to be stable, in both the mean-driven and, especially, the fluctuation-driven regimes, we needed to use a wide distribution of synaptic delays in the recurrent connections: between 1 and 10 ms for Figure 12 and between 1 and 50 ms for Figure 13. Narrow distributions of delays lead to oscillations, which destabilize the stationary activity states. The emergence and properties of these oscillations in a network similar to the one we study here have been described in Brunel (2000a).
Although such long synaptic delays are not expected to be found in connections between neighboring local cortical neurons, our network is extremely simple and lacks many elements of biological realism that would work in the same direction as the wide distributions of delays. Among these are slow and saturating synaptic interactions (NMDA-mediated excitation; Wang, 1999) and heterogeneity in cellular and synaptic properties. The large and slow temporal fluctuations in the instantaneous rate in Figure 13 are due to the large fluctuations in the nearly balanced external and recurrent drive to the cells and the wide distribution of synaptic delays. These fluctuations lead to high trial-to-trial variability in the activity of the network, as shown in Figure 14. In this figure, we show nine trials with the same parameters as in Figure 13, differing only in the seed of the random number generator. On each panel, the mean instantaneous activity across all nine trials (the thick line) is shown along with the activity in the individual trial. Sometimes the large fluctuations lead to the activity returning to the basal spontaneous state. Other times they provoke long-lasting periods of elevated firing (above average). Nevertheless, on a large fraction of the trials, a memory of the stimulus persists for several seconds.
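The distinction between CV and CV2 used in this section can be sketched numerically. The example below assumes that CV2 denotes the commonly used local measure built from pairs of consecutive ISIs, 2|ISI_{n+1} − ISI_n| / (ISI_{n+1} + ISI_n), averaged over n. The synthetic spike trains and all numerical values are hypothetical and are not taken from the simulations reported here.

```python
import numpy as np

def cv(isis):
    """Coefficient of variation of the ISI distribution (global measure)."""
    isis = np.asarray(isis, dtype=float)
    return isis.std() / isis.mean()

def cv2(isis):
    """Local irregularity from consecutive ISI pairs, averaged over the train."""
    isis = np.asarray(isis, dtype=float)
    a, b = isis[:-1], isis[1:]
    return float(np.mean(2.0 * np.abs(b - a) / (b + a)))

rng = np.random.default_rng(0)

# Stationary Poisson-like train: exponential ISIs with a fixed mean of 30 ms.
stationary = rng.exponential(scale=0.03, size=20000)

# Same local statistics, but the mean ISI switches slowly between 10 and 50 ms
# in blocks of 1000 ISIs, mimicking slow fluctuations of the network rate.
block_means = np.repeat(np.tile([0.01, 0.05], 10), 1000)
modulated = rng.exponential(scale=block_means)

# Perfectly regular train.
regular = np.full(100, 0.02)

for name, isis in [("stationary", stationary), ("modulated", modulated), ("regular", regular)]:
    print(f"{name:10s}  CV = {cv(isis):.2f}   CV2 = {cv2(isis):.2f}")
```

In this sketch the slow rate modulation inflates CV well above 1 while CV2 stays near 1, which is why comparing the two separates irregularity of the spiking itself from nonstationarity of the rate.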


[Figure 12 panels: instantaneous rate (Hz) versus time (ms); histograms across cells of rate (Hz), CV, and CV2. Panel annotations surviving extraction: "=0.2" (near the CV panel) and "=0.21" (near the CV2 panel).]

Figure 12: Numerical simulations of a bistable network in the mean-driven regime. The rate of the external afferents was raised between 200 and 300 ms (vertical bars). (Top) Raster display of the activity of 200 neurons in the network. (Middle) Instantaneous network activity (temporal bin of 10 ms). The dashed line represents the average network activity during the delay period, 53.4 Hz. (Lower panels) Distribution across cells of the rate (left), CV (middle), and CV2 (right) during the delay period. The fact that the CV and CV2 are very similar reflects the stationarity of the instantaneous activity. Single-cell parameters as in the caption to Figure 2. The network consists of two populations of excitatory and inhibitory cells (1000 neurons each) connected at random with 0.1 probability. Delays are uniformly distributed between 1 and 10 ms. External spikes are all excitatory, with PSP size 0.09 mV. The external rate is 19.25 kHz. This leads to µ_ext = 17.325 mV and σ_ext = 0.883 mV. Recurrent EPSPs and IPSPs are 0.138 mV and −0.05 mV, respectively, leading to c_µ = 8.8 mV and c_σ = 1.468 mV.


[Figure 13 panels: instantaneous rate (Hz) versus time (ms); histograms across cells of rate (Hz), CV, and CV2. Panel annotations surviving extraction: "=1.27" (near the CV panel) and "=0.81" (near the CV2 panel).]

Figure 13: Numerical simulations of a bistable network in the fluctuation-driven regime. Panels as in Figure 12. Parameters as in Figure 12 except the distribution of delays, which is uniform between 1 and 50 ms. External spikes are excitatory, with PSP size 1.85 mV and rate 0.78 kHz, and inhibitory, with PSP size −1.85 mV and rate 0.5 kHz. This leads to µ_ext = 5.18 mV and σ_ext = 4.68 mV. Recurrent EPSPs and IPSPs are 1.85 mV and −1.98 mV, respectively, leading to c_µ = −13 mV and c_σ = 27.1 mV.
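The derived quantities quoted in the captions of Figures 12 and 13 can be reproduced from the PSP sizes and rates with a few lines of arithmetic. The sketch below rests on assumptions that are not stated in this section: a membrane time constant of 10 ms, µ_ext = τ Σ J ν, σ_ext² = (τ/2) Σ J² ν, c_µ = c (j_E − j_I), and c_σ² = c (j_E² + j_I²), with c = 1000 × 0.1 = 100 inputs per population. These forms, including the factor 1/2, were chosen because they match the quoted numbers; the function names are hypothetical.

```python
import math

TAU_M = 10.0  # ms; assumed membrane time constant (not stated in this section)
C = 100       # inputs per population: 1000 neurons x 0.1 connection probability

def external_stats(psps_mV, rates_kHz, tau=TAU_M):
    """Mean and s.d. (mV) of the external drive; kHz x ms is a pure count."""
    mu = tau * sum(j * r for j, r in zip(psps_mV, rates_kHz))
    var = 0.5 * tau * sum(j * j * r for j, r in zip(psps_mV, rates_kHz))
    return mu, math.sqrt(var)

def recurrent_stats(j_E, j_I, c=C):
    """c_mu and c_sigma (mV); j_I is passed as a positive magnitude."""
    return c * (j_E - j_I), math.sqrt(c * (j_E ** 2 + j_I ** 2))

# Figure 12 (mean-driven): excitatory external input only.
mu12, sigma12 = external_stats([0.09], [19.25])
cmu12, csig12 = recurrent_stats(0.138, 0.05)

# Figure 13 (fluctuation-driven): excitatory and inhibitory external input.
mu13, sigma13 = external_stats([1.85, -1.85], [0.78, 0.5])
cmu13, csig13 = recurrent_stats(1.85, 1.98)

print(f"Fig. 12: mu_ext={mu12:.3f}  sigma_ext={sigma12:.3f}  c_mu={cmu12:.2f}  c_sigma={csig12:.3f}")
print(f"Fig. 13: mu_ext={mu13:.3f}  sigma_ext={sigma13:.3f}  c_mu={cmu13:.2f}  c_sigma={csig13:.3f}")
```

Under these assumptions the computed values agree with the captions to the quoted precision, which supports reading c_µ as the net recurrent mean drive and c_σ as the recurrent fluctuation gain.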

[Figure 14 panels: nine panels of firing rate (Hz) versus time (s).]

Figure 14: Trial-to-trial variability in the fluctuation-driven regime. Each panel is a different repetition of the same trial, in a network identical to the one described in Figure 13. The thick line represents the average across all nine trials, and the thin line is the instantaneous network activity in the given trial. Vertical bars mark the time during which the rate of the external inputs is elevated.

In the mean-driven regime, the trial-to-trial variability is very low (not shown). We conclude that despite quantitative differences in the rate and CV between the mean-field theory and the simulations, it is possible, albeit difficult, to find both mean-driven and fluctuation-driven bistability in small networks of LIF neurons.

4 Discussion

In this letter, we have aimed at an understanding of the different ways in which a simple network of current-based LIF neurons can be organized in order to support bistability, the coexistence of two steady-state solutions for the activity of the network that can be selected by transient external stimulation. We have shown that in addition to the well-known case in which strong excitatory feedback can lead to bistability, bistability can also be obtained when the recurrent connectivity is nearly balanced, or even when its net effect is inhibitory, provided that an increase in the activity in the network provides a large enough increase in the size of the fluctuations of the current afferent to the cells. When bistability is obtained in this fashion, the CV in both steady states is close to one, as found experimentally (see Figure 1; Compte et al., 2003). We have done a systematic analysis at the mean-field level (and a partial one through numerical simulations) of a reduced network where the activity in the excitatory and the inhibitory subpopulations was equal by construction (implying balance at the level of the output activity) and studied which types of bistable solutions are obtained depending on the level of balance in the currents (the parameter c_µ), that is, balance at the level of the inputs. This simple model allows for a complete understanding of the role of the different model parameters. The first phenomenon, which we have termed mean-driven bistability, can essentially be traced back to the shape of the curve relating the mean firing rate of the cell to the average current it is receiving (at a constant noise level; Brunel, 2000b), that is, the f-I curve. In order for bistability to exist, this curve should be a sigmoid, for which it is enough that the neurons possess a threshold and a refractory period. If, in addition, the low-activity fixed point is to have a nonzero activity (consistent with the fact that cells in the cortex fire spontaneously), then the neuron should display nonzero activity for subthreshold mean currents. This can be achieved if the current is noisy, where the noise is due to the spiking irregularity of the inputs to the cell. When this type of bistability is considered in a network of LIF neurons, the mean current to the cells in the high-activity fixed point is above threshold. Under general assumptions, this leads invariably to fairly regular spiking in this high-activity fixed point.
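The shape of the f-I curve described above can be sketched with the standard first-passage-time expression for the stationary rate of a LIF neuron driven by white noise, ν = [τ_ref + τ_m √π ∫ e^{x²}(1 + erf x) dx]^{-1}, with the integral running from (V_r − µ)/σ to (θ − µ)/σ. The threshold, reset, time constant, and refractory period below are hypothetical placeholders, not the single-cell parameters of this letter, and the integral is evaluated with a plain trapezoid rule.

```python
import math
import numpy as np

def lif_rate(mu, sigma, theta=20.0, v_reset=10.0, tau_m=0.010, tau_ref=0.002, n=4001):
    """Stationary rate (Hz) of a LIF neuron with white-noise input.

    mu, sigma: mean and s.d. of the depolarization (mV); times are in seconds.
    The integrand overflows if |(v_reset - mu) / sigma| exceeds about 26.
    """
    lo = (v_reset - mu) / sigma
    hi = (theta - mu) / sigma
    xs = np.linspace(lo, hi, n)
    # exp(x^2) * (1 + erf(x)), written with erfc to stay accurate for negative x
    ys = np.array([math.exp(x * x) * math.erfc(-x) for x in xs])
    integral = float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))  # trapezoid rule
    return 1.0 / (tau_ref + tau_m * math.sqrt(math.pi) * integral)

means = [10.0, 15.0, 20.0, 25.0, 30.0]  # mean depolarization (mV); threshold is 20 mV
rates_noisy = [lif_rate(mu, sigma=4.0) for mu in means]
rates_quiet = [lif_rate(mu, sigma=1.0) for mu in means]

for mu, r_hi, r_lo in zip(means, rates_noisy, rates_quiet):
    print(f"mu = {mu:4.1f} mV   rate(sigma=4) = {r_hi:8.3f} Hz   rate(sigma=1) = {r_lo:8.3f} Hz")
```

With these placeholder values the rate is bounded above by 1/τ_ref, rises monotonically with the mean, and is appreciable for subthreshold means only when the noise is large. These are the three ingredients the argument relies on.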
Of course, tuning the parameters of the current in such a way that the mean current in the high-activity fixed point is only very marginally suprathreshold will result in only a small decrease of the CV with respect to the low-activity fixed point (e.g., Figure 2 in Brunel & Wang, 2001). On the other hand, in this scenario, it is relatively easy (it does not take much tuning) for the firing rate in the elevated persistent activity state not to be very much higher than that in the low-activity state, for example, below 100 Hz (see Figure 8).

When the recurrent connectivity is balanced, bistable solutions can exist in which both fixed points are subthreshold, so that spiking in both fixed points is fluctuation driven and thus fairly irregular. This can be the case if the fluctuations in the depolarization due to current from outside the column are low enough (see Figure 6). However, in order for these solutions to exist, first, the overall inputs to the excitatory and the inhibitory subpopulations should be close enough (ensuring balance at the level of the firing activity in the network); second, both of these inputs, the one to the excitatory and the one to the inhibitory subpopulation, should themselves also be balanced (be composed of similar amounts of excitation and inhibition); and third, both the net excitatory drive and the inhibitory drive to the cells should be large. This third condition, if the first two are satisfied, results in a high effective fluctuation feedback gain: an increase in the activity of the cells results in a large increase in the size of the fluctuations in the afferent current to the neurons (a large value of the parameter c_σ). However, it also implies that the excitation-inhibition balance condition will be quite stringent; it will require tuning, especially when the network is large. In addition, if this balance is slightly broken, since both excitation and inhibition are large, the corresponding firing rate in the elevated persistent activity state becomes very large, for example, significantly higher than 100 Hz (see Figure 9). In fact, based just on scaling considerations (see section 3.1), one can conclude that this type of bistability can be present only (unless one allows for perfect tuning) in small networks. If the network is large, the excitation-inhibition balance condition has, in general, a single (albeit very robust) solution (van Vreeswijk & Sompolinsky, 1996, 1998). It is intriguing, however, that several lines of evidence in fact suggest a fairly precise balance of local excitation and inhibition in cortical circuits, at both the output level (Rao et al., 1999) and the input level (Anderson, Carandini, & Ferster, 2000; Shu, Hasenstaub, & McCormick, 2003; Marino et al., 2005).

4.1 Limitations of the Present Approach. Most of the results we have presented are based on an exhaustive analysis of the stationary states of a mean-field description of a simple network of LIF neurons. Several limitations of our approach should be noted. First, in order to be able to go beyond the Poisson assumption, we have had to make a number of approximations (discussed in section 2) that are expected to be valid only in limited regions of the large parameter space. Second, we have focused only on the stationary fixed points of the system and have not examined oscillatory solutions. Oscillations in networks of LIF neurons in the high-noise regime have been extensively studied by Brunel and collaborators (see, e.g., Brunel & Hakim, 1999; Brunel, 2000a; Brunel & Wang, 2003). Third, in order to be able to provide an analytical description, we have considered a very simplified network lacking many aspects of biological realism known to affect network dynamics, most importantly, a more realistic description of synaptic dynamics (Wang, 1999; Brunel & Wang, 2001). Finally, the use of a mean-field description based on the diffusion approximation to study small networks with big synaptic PSPs might lead to problems, since the diffusion approximation assumes these PSPs are (infinitely) small. Large PSPs might lead to fluctuations that are too strong, which would destabilize the analytically predicted fixed points. In order to check that the main qualitative conclusion of this study was not an artifact due to the mean-field approach, we simulated the network of LIF neurons used in the mean-field description, adding synaptic delays. Provided the distribution of delays was wide, we observed both types of bistable solutions. However, as expected, the fluctuation-driven persistent activity states show large temporal fluctuations that sometimes are enough to destabilize them.


The evidence we have provided is suggestive, but given the limitations listed above, it is not conclusive. Addressing the limitations of this work will involve using recently developed analytical techniques (Moreno-Bote & Parga, 2006), along with a systematic exploration of the behavior of more realistic recurrent networks through numerical simulations.

4.2 Self-Consistent Second-Order Statistics. The extended mean-field theory we have used builds on the landmark study by Amit and Brunel (1997b), which provided a general-purpose theory for the study of the different types of steady states in recurrent networks of spiking neurons in the presence of noise, while leaving room for different degrees of biophysical realism. Our contribution has been to try to go beyond the Poisson assumption in order to allow a self-consistent solution to the second-order statistics of the spike trains in the network. If the spike trains are assumed to be Poisson, there is only one parameter to be determined self-consistently: the firing rate. Under the approximations made in this letter, the statistics are characterized by two parameters, the firing rate and the CV, which provides information about the degree of irregularity of the spiking activity. In order to go beyond the Poisson assumption, we have assumed the spike trains in the network can be described as renewal processes with a very short correlation time. In these conditions, for time windows large compared to this correlation time, the Fano factor of the process is constant, but instead of being one, as for a Poisson process, it is equal to CV² (the square of the CV). This motivates our strategy of neglecting the temporal aspect of the deviation from Poisson, which is extremely complicated to deal with analytically, and keeping only its effect on the amplitude of the correlations.
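The statement that the Fano factor of a renewal process approaches CV² for long counting windows can be checked with a small simulation. The sketch below uses a gamma renewal process as a stand-in, which is an assumption made for illustration (the letter does not specify the ISI distribution); the shape parameter, window length, and function name are hypothetical.

```python
import numpy as np

def fano_factor(spike_times, window):
    """Variance-to-mean ratio of spike counts in consecutive windows."""
    n_windows = int(spike_times[-1] // window)
    edges = np.arange(n_windows + 1) * window
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts.var() / counts.mean()

rng = np.random.default_rng(1)

shape = 4.0  # gamma ISIs with this shape have CV^2 = 1 / shape = 0.25
isis = rng.gamma(shape, scale=1.0 / shape, size=200_000)  # mean ISI = 1 (arbitrary units)
spike_times = np.cumsum(isis)

cv_squared = isis.var() / isis.mean() ** 2
fano = fano_factor(spike_times, window=100.0)  # window of about 100 mean ISIs

print(f"CV^2 = {cv_squared:.3f}   Fano factor = {fano:.3f}")
```

For windows much longer than the mean ISI, the two numbers should agree up to sampling error, whereas a Poisson train (shape 1) would give a value near one for both.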
We have done this by using the expressions for the rate and CV of the first passage time of the OU process with a renormalized variance that takes into account the CV of the inputs. If the time constant of the autocorrelation of the process is exactly zero, this approximation becomes exact (Moreno et al., 2002), so we have assumed it will still be qualitatively valid if the correlation time constant is small. In this way, we have been able to solve for the CV of the neurons in the steady states self-consistently. It has to be stressed that the fact that the individual inputs to a neuron are considered independent does not imply that the overall input process, made of the sum of each individual component, is Poisson. Informally, in order for the superposition to converge to a (homogeneous) Poisson process of rate λ, two conditions have to be met. Given any set S on the time axis (say, any time interval), call N_i^1 the probability of observing one spike in S from process i and N_i^{≥2} the probability of observing two or more spikes in S from process i. Then the superposition of the i = 1, ..., N processes will converge to a Poisson process if

$$\lim_{N\to\infty} \sum_{i=1}^{N} N_i^{1} = \lambda S \quad \left(\text{with } \max_i\{N_i^{1}\} \to 0 \text{ as } N \to \infty\right)$$

and if

$$\lim_{N\to\infty} \sum_{i=1}^{N} N_i^{\geq 2} = 0$$

(see, e.g., Daley & Vere-Jones, 1988). The autocorrelated renewal processes that we consider in this letter do not meet the second condition, which can also be seen in the fact that the superposition process has an autocorrelation given by equation 2.2, not by a Dirac delta function, as would be the case for a Poisson process. Despite this, it might be the case that if instead of any set S, one considers only a given time window T, both conditions could approximately be met in T, and we could say that the superposition process is locally Poisson in T. Whether this locally Poisson train will have the same effect on the postsynaptic cell as a real Poisson train of the same rate depends on a number of factors and has been studied in detail in Moreno et al. (2002) for the case of exponential autocorrelations. Other types of autocorrelation structures, for instance, regular spike trains, could lead to different results. This is an open problem.

4.3 Current-Based versus Conductance-Based Descriptions. We have analyzed a network of current-based LIF neurons. The motivations for this choice are that current-based LIF neurons are simpler to analyze, especially in the presence of noise, than conductance-based LIF neurons and also that there were a number of unresolved issues raised in the current-based framework that we have made an attempt to clarify. In particular, we were interested in understanding whether the framework of Amit and Brunel (1997b) could be used to produce bistable solutions in balanced networks like those studied in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998) outside the context of multistability in recurrent networks. An important issue has been how different scaling relationships between the connection strengths and the number of afferents relate to the possible types of bistability attainable in large networks, when the number of afferents per cell tends to infinity.
This analysis shows that large, homogeneous networks using the J ∼ 1/√C scaling needed to retain a significant amount of fluctuation at large C do not support bistability in a robust manner, a result already implicitly present in van Vreeswijk and Sompolinsky (1996, 1998). Reasonably robust bistability in homogeneous balanced networks requires that they are small. Does one expect these conclusions to hold qualitatively if one considers the more realistic case of conductance-based synaptic inputs to the cells? The answer to this question is uncertain. In particular, scaling relationships between J and C, absolutely unavoidable in current-based scenarios to keep a finite input to the cells in the extensive C limit, are not necessary when synaptic inputs are assumed to induce a transient change in conductance. In the presence of conductances, the steady-state voltage is automatically independent of C for large C, regardless of the value of the unitary synaptic conductances. In fact, assuming a simple model for a cell having only leak, excitatory, and inhibitory synaptic conductances, the steady-state voltage in the absence of threshold is given by

$$V_{ss} = \frac{g_L}{g_{Tot}} V_L + \frac{g_E}{g_{Tot}} V_E + \frac{g_I}{g_{Tot}} V_I,$$

where V_L, V_E, V_I are the reversal potentials of the respective currents; g_L, g_E, g_I are the total leak, excitatory, and inhibitory conductances in the cell; and g_Tot = g_L + g_E + g_I. (This expression ignores the effect of temporal fluctuations in the conductances, but it is a good approximation, since the mean conductance, being by definition positive, is expected to be much larger than its fluctuations.) The steady-state voltage is just the average of the different reversal potentials, each weighted by the relative amount of conductance that the respective current is carrying. Of course, each of the total synaptic conductances is proportional to the number of inputs, but since the steady-state voltage is just a weighted sum, it does not explode even if C tends to infinity. It might seem that in fact, the infinite-C limit leads to an ill-defined model, as the membrane time constant vanishes in this case as 1/C. If C_m is the membrane capacitance, then

$$\tau_m = \frac{C_m}{g_{Tot}} \sim 1/C,$$

assuming that g_{E,I} ∼ C. We believe, however, that this is an artifact due to an incorrect way of defining the model in the large C limit. It implicitly assumes that the number of synaptic inputs grows at a constant membrane area, thus increasing indefinitely the local channel density. A more appropriate way of taking the large C limit is to fix the relative densities of the different channels per unit area and then assume the area becomes large. In this case, both C_m and the total leak conductance of the cell (proportional to the number of leak channels) will grow with the total area. This way of taking the limit respects the well-known decrease in membrane time constant as the synaptic input to the cell grows, while retaining a well-defined, nonzero membrane time constant in the extensive C limit (in this case, the range of values that τ_m can take is determined by the local differences in channel density, which is independent of the total channel number).

A crucial difference with the current-based cell is the behavior of the variance of the depolarization in the large C limit. A simple estimate of this variance can be obtained by ignoring threshold and considering only the low-pass filtering effect of the membrane (with time constant τ_m) on a gaussian noise current of variance σ_I² and time constant τ_s. It is straightforward to calculate the variance of the depolarization in these conditions, resulting in

$$\sigma_V^2 = \frac{\sigma_I^2}{g_{Tot}^2}\,\frac{\tau_s}{\tau_s + \tau_m}.$$

If the inputs are independent, both the variance of the current and the total conductance of the cell are proportional to C, which implies that σ_V² ∼ 1/C for large C. Therefore, the statistics of the depolarization in conductance-based and current-based neurons show a very different dependence on the number of inputs to the cell. In particular, it is unclear whether the main organizational principle behind the balanced state in the current-based framework, that is, the J ∼ 1/√C scaling that is needed to retain a finite variance in the C → ∞ limit and that leads to the set of linear equations that specify the single solution for the activity in the balanced network, is relevant in a conductance-based framework. A rigorous study of this problem is beyond the scope of this work, but it is one of the outstanding challenges for understanding the basic principles of cortical organization.
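The scaling statements of this subsection can be illustrated numerically. The sketch below evaluates the steady-state voltage, the membrane time constant τ_m = C_m/g_Tot, and the variance estimate σ_V² = (σ_I²/g_Tot²) τ_s/(τ_s + τ_m) as the number of inputs C grows, both at fixed membrane area and with C_m and the leak conductance growing in proportion to C. Every numerical value (reversal potentials, unitary conductances, per-input current variance, reference size) is a hypothetical placeholder.

```python
# All constants below are hypothetical placeholders chosen for illustration.
V_L, V_E, V_I = -70.0, 0.0, -80.0  # reversal potentials (mV)
G_LEAK, C_MEM = 10.0, 200.0        # leak conductance (nS) and capacitance (pF) at C_REF
G_E_UNIT, G_I_UNIT = 0.05, 0.10    # mean conductance contributed per input (nS)
S2_UNIT = 1.0                      # current variance contributed per input (pA^2)
TAU_S = 5.0                        # time constant of the noise current (ms)
C_REF = 1000                       # number of inputs at the reference membrane area

def membrane_stats(C, area_scaling=False):
    """Return (V_ss in mV, tau_m in ms, sigma_V^2 in mV^2) for C inputs per type."""
    scale = C / C_REF if area_scaling else 1.0
    g_L, c_m = G_LEAK * scale, C_MEM * scale  # both grow with area if requested
    g_E, g_I = C * G_E_UNIT, C * G_I_UNIT
    g_tot = g_L + g_E + g_I
    v_ss = (g_L * V_L + g_E * V_E + g_I * V_I) / g_tot
    tau_m = c_m / g_tot        # pF / nS = ms
    var_I = C * S2_UNIT        # independent inputs: variances add
    var_V = var_I / g_tot ** 2 * TAU_S / (TAU_S + tau_m)  # pA^2 / nS^2 = mV^2
    return v_ss, tau_m, var_V

for C in (1_000, 10_000, 100_000):
    fixed = membrane_stats(C)
    scaled = membrane_stats(C, area_scaling=True)
    print(f"C={C:7d}  fixed area: V_ss={fixed[0]:7.2f} tau_m={fixed[1]:.4f} var_V={fixed[2]:.2e}"
          f"   scaled area: tau_m={scaled[1]:.4f} var_V={scaled[2]:.2e}")
```

In this sketch V_ss stays between the reversal potentials for every C, τ_m vanishes as 1/C only at fixed area, and σ_V² falls as 1/C in both cases.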

4.4 Correlations and Synaptic Time Constants. Our mean-field description assumes that the network is in the extreme sparse limit, N ≫ C, in which the fraction of common input shared by two neurons is negligible, leading to vanishing correlations between the afferent current to different cells in the large C, large N limit. This is a crucial assumption, since it causes the variance of the depolarization in the network to be the sum of the variances of the individual spike trains, that is, proportional to C. If the correlation coefficient is finite as C → ∞, the variance is proportional to C² (see, e.g., Moreno et al., 2002). In a current-based network, J ∼ 1/C scaling would lead to a nonvanishing variance in the large C limit without a stringent balance condition, and in a conductance-based network, it would lead to a C-independent variance for large C. This suggests that correlations between the cells in the recurrent network should have a large effect on both their input-output properties (Zohary, Shadlen, & Newsome, 1994; Salinas & Sejnowski, 2000; Moreno et al., 2002) and the network dynamics. The issue is, however, not straightforward, as simulations of irregular spiking networks with realistic connectivity parameters, which do show weak but significant cross-correlations between neurons (Amit & Brunel, 1997a), seem to be well described by the mean-field theory in which correlations are neglected (Amit & Brunel, 1997a; Brunel & Wang, 2001). Noise correlations measured experimentally are small but significant, with normalized correlation coefficients in the range of a few percent to a few tenths (for a review, see, e.g., Salinas & Sejnowski, 2001). It would thus be desirable to be able to extend the current mean-field theory to incorporate the effect of cross-correlations and to understand under which conditions their effect is important. The first steps in this direction have already been taken (Moreno et al., 2002; Moreno-Bote & Parga, 2006).
The arguments of the previous section suggest that a correct treatment of correlations might be especially important in large networks of conductance-based neurons.
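The C versus C² dependence invoked in this subsection follows from a one-line variance calculation, sketched here under the simplifying assumption (ours, for illustration) that the C afferent contributions x_i have a common variance s² and a common pairwise correlation coefficient ρ:

$$\mathrm{Var}\!\left(J \sum_{i=1}^{C} x_i\right) = J^2 \left[\sum_{i=1}^{C} \mathrm{Var}(x_i) + \sum_{i \neq j} \mathrm{Cov}(x_i, x_j)\right] = J^2 s^2 \left[C + C(C-1)\rho\right].$$

For ρ = 0 the variance grows as J² s² C, so J ∼ 1/√C is needed to keep it finite; for any fixed ρ > 0 the second term dominates at large C and the variance grows as J² s² ρ C², so J ∼ 1/C already yields a nonvanishing variance.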


An issue of similar importance for the understanding of the high irregularity of cortical spike trains is the relationship between the time constant (or time constants) of the inputs to the neuron and its (effective) membrane time constant. Indeed, the need for a balance between excitation and inhibition in order to have a high output spiking variability when receiving many irregular inputs exists only if the membrane time constant is relatively long, in particular, long enough that if the afferents are all excitatory, the input fluctuations are averaged out. If the membrane time constant is very short, large, short-lived fluctuations are needed to make the postsynaptic cell fire, and these occur only irregularly, even if all the afferents are excitatory (see, e.g., Figure 1 in Shadlen & Newsome, 1994). These considerations seem relevant since cortical cells receive large numbers of inputs that have spontaneous activity, thus putting the cell into a high-conductance state (see, e.g., Destexhe, Rudolph, & Pare, 2003) in which its effective membrane time constant is short, on the order of only a few milliseconds (Bernander, Douglas, Martin, & Koch, 1991; Softky, 1994). It has also been recognized that when the synaptic time constant is large compared to the membrane time constant, spiking in the subthreshold regime becomes very irregular, and in particular, the distribution of firing rates becomes bimodal. Qualitatively, in these conditions, the depolarization follows the current instead of integrating it. Relative to the timescale of the membrane, fluctuations are long-lived, and this separates two different timescales for spiking (which result in bimodality of the firing-rate distribution) depending on whether the size of a fluctuation is such that the total current is subthreshold (i.e., no spiking, leading to a large peak of the firing-rate histogram at zero) or suprathreshold (leading to a nonzero peak in the firing-rate distribution) (Moreno-Bote & Parga, 2005).
In these conditions, neurons seem “bursty,” and the CV of the ISI is high. Interestingly, recent evidence confirms this bimodality of the firing-rate distribution in spiking activity recorded in vivo in the visual cortex (Carandini, 2004). Increases in the synaptic-to-membrane time constant ratio leading to more irregular spiking can be due to a number of factors: a very short membrane time constant if the neuron is in a high-conductance state, relatively long excitatory synaptic drive if there is a substantial NMDA component in the excitatory EPSPs, or even long-lived dendrosomatic current sources, for instance, due to the existence of “calcium spikes” generated in the dendrites. There is evidence that irregular current applied to the dendrites of pyramidal cells results in higher CVs than the same current applied to the soma (Larkum, Senn, & Luscher, 2004).

4.5 Parameter Fine-Tuning. In order for both stable firing rate states of the networks we have studied to display significant spiking irregularity, the afferent current to the cells in both states needs to be subthreshold. We have shown that this requires a significant amount of parameter fine-tuning, especially when the number of connections per neuron is large. Parameter


fine-tuning is a problem, since biological networks are heterogeneous and cellular and synaptic properties change in time. Regarding this issue, though, some considerations are in order. First, the model we have considered is extremely simple, especially at the single-neuron level. We have already pointed out possible consequences of considering properties such as longer synaptic time constants or some degree of correlations between the spiking activity of different neurons. Another biophysical property that we expect to have a large impact is short-term synaptic plasticity. In the presence of depressing synapses, the postsynaptic current is no longer linear in the presynaptic firing rate, thus acting as an activity-dependent gain control mechanism (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). It remains to be explored to what extent balanced bistability in networks of neurons exhibiting these properties becomes a more robust phenomenon. Second, synaptic weights (as well as intrinsic properties; Desai, Rutherford, & Turrigiano, 1999) can adapt in an activity-dependent manner to keep the overall activity in a recurrent network within an appropriate operational range (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998). Delicate computational tasks, which seem to require fine-tuning, can be rendered robust through the use of these types of activity-dependent homeostatic rules (Renart, Song, & Wang, 2003). It will be interesting to study whether homeostatic plasticity (Turrigiano, 1999) can be used to relax some of the fine-tuning constraints described in this letter.

4.6 Multicolumnar Networks and Hierarchical Organization. The fact that bistability is not a robust property of large, homogeneous balanced networks suggests that the functional units of working memory could correspond to small subpopulations (Rao et al., 1999).
In addition, we have shown that bistability in a small, reduced network is possible only for subthreshold external inputs (see section 3.2.2). At the same time, it is known that a nonzero activity balanced state requires a very large (suprathreshold) excitatory drive (see section 3.1 and van Vreeswijk & Sompolinsky, 1998). This seems to point to a hierarchical organization: large networks receive massive excitation from long-distance projections, and this external excitation sets up a balanced state in the network. Globally, the activity in the large, balanced network follows the external input linearly. This large, balanced network then provides an already balanced (subthreshold) input to smaller subcomponents, which, in these conditions (in particular, if the variance of this subthreshold input is small enough; see figure 7), can display more complex nonlinear behavior such as bistability. From the point of view of the smaller subnetworks, the balanced subthreshold input can be considered external, since the size of this network is too small to make a difference in the global activity of the larger network (despite being recurrently connected, the activities in the large and small networks effectively


decouple). In the cortex, the larger balanced network could correspond to a whole cortical column. Indeed, long-range projections between columns are mostly excitatory (see, e.g., Douglas & Martin, 2004). Within a column, the smaller networks that interact through both excitation and inhibition could anatomically correspond to microcolumns (Rao et al., 1999) or, more generally, define functional assemblies (Hebb, 1949).

5 Summary

General principles of cortical organization (large numbers of active synaptic inputs per neuron) and function (irregular spiking statistics) put strong constraints on working memory models of spiking neurons. We have provided evidence that a network of current-based LIF neurons can exhibit bistability with the high persistent activity driven by either the mean or the fluctuations in the input to the cells. The fluctuation-driven bistability regime requires a strict excitation-inhibition balance that needs parameter tuning. It remains a challenge in future research to analyze systematically what the conditions are under which nonlinear phenomena such as bistability can exist robustly in large networks of more biophysically plausible conductance-based and correlated spiking neurons. It is also conceivable that additional biological mechanisms, such as homeostatic regulation, are important for solving the fine-tuning problem and ensuring a desired excitation-inhibition balance in cortical circuits. Progress in this direction will provide insight into the microcircuit mechanisms of working memory, such as found in the prefrontal cortex.

Acknowledgments

We are indebted to Jaime de la Rocha for providing the code for the numerical simulations and to Albert Compte for providing the data for Figure 1. A.R. thanks N. Brunel for pointing out previous related work, and A. Amarasingham for discussions on point processes. Support was provided by the National Institute of Mental Health (MH62349, DA016455), the A. P. Sloan Foundation and the Swartz Foundation, and the Spanish Ministry of Education and Science (BFM 2003-06242).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224. Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover. Amit, D. J., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.


Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252. Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates II: Low-rate retrieval in symmetric networks. Network, 2, 275–294. Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiol., 84, 909–926. Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: Time scales and relationship to behavior. J. Neurosci., 21, 1676–1697. Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848. Bernander, O., Douglas, R. J., Martin, K. A., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573. Brunel, N. (2000a). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208. Brunel, N. (2000b). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280. Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671. Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85. Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? I. Synaptic dynamics and excitation-inhibition balance. J. Neurophysiol., 90, 415–430. Cai, D., Tao, L., Shelley, M., & McLaughlin, D. W. (2004). An effective kinetic representation of fluctuation-driven networks with application to simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. USA, 101, 7757–7762. Carandini, M. (2004). Amplification of trial-to-trial response variability by neurons in visual cortex. PLOS Biol., 2, E264. Chafee, M. V., & Goldman-Rakic, P. S. (1998). Matching patterns of activity in primate prefrontal area 8a and parietal area 7ip neurons during a spatial working memory task. J. Neurophysiol., 79, 2919–2940. Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923. Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454. Cox, D. R. (1962). Renewal theory. New York: Wiley. Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer.


Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2, 515–520. Destexhe, A., Rudolph, M., & Pare, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci., 4, 739–751. Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Ann. Rev. Neurosci., 27, 419–451. Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiol., 61, 331– 349. Fuster, J. M., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654. Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68. Gillespie, D. T. (1992). Markov processes: An introduction for physical scientists. Orlando, FL: Academic Press. Gnadt, J. W., & Andersen, R. A. (1988). Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res., 70, 216–220. Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178. Harsch, A., & Robinson, H. P. (2000). Postsynaptic variability of firing in rat cortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192. Hebb, D. O. (1949). Organization of behavior. New York: Wiley. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558. Larkum, M. E., Senn, W., & Luscher, H. M. (2004). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cereb. Cortex, 14, 1059–1070. Marino, J., Schummers, J., Lyon, D. C., Schwabe, L., Beck, O., Wiesing, P., & Obermayer, K. (2005). Invariant computations in local cortical networks with balanced excitation and inhibition. Nat. 
Neurosci., 8, 194–201. Mascaro, M., & Amit, D. J. (1999). Effective neural response function for collective population states. Network, 10, 351–373. Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12, 2305– 2329. Miller, E. K., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neurosci., 16, 5154– 5167. Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70. Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89, 288101. Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Phys. Rev. Lett., 94, 088103. Moreno-Bote, R., & Parga, N. (2006). Auto- and cross-correlograms for the spike response of lif neurons with slow synapses. Phys. Rev. Lett., 96, 028101.


Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916. Renart, A. (2000). Multi-modular memory systems. Unpublished doctoral dissertation, Universidad Autónoma de Madrid. Renart, A., Song, P., & Wang, X.-J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485. Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag. Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474. Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20, 6193–6209. Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiol., 4, 569–579. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896. Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Comput., 11, 935–951. Shu, Y., Hasenstaub, A., & McCormick, D. A. (2003). Turning on and off recurrent balanced cortical activity. Nature, 423, 288–293. Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723. Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124. Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neurosci., 22, 221–227. Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371. Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.


Zador, A. M., & Stevens, C. F. (1998). Input synchrony and the irregular firing of cortical neurons. Nat. Neurosci., 1, 210–217. Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received August 22, 2005; accepted May 17, 2006.

LETTER

Communicated by Alain Destexhe

Exact Subthreshold Integration with Continuous Spike Times in Discrete-Time Neural Network Simulations

Abigail Morrison
Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany

Sirko Straube
Computational Neurophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany

Hans Ekkehard Plesser
Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1432 Ås, Norway

Markus Diesmann
Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany

Very large networks of spiking neurons can be simulated efficiently in parallel under the constraint that spike times are bound to an equidistant time grid. Within this scheme, the subthreshold dynamics of a wide class of integrate-and-fire-type neuron models can be integrated exactly from one grid point to the next. However, the loss in accuracy caused by restricting spike times to the grid can have undesirable consequences, which has led to interest in interpolating spike times between the grid points to retrieve an adequate representation of network dynamics. We demonstrate that the exact integration scheme can be combined naturally with off-grid spike events found by interpolation. We show that by exploiting the existence of a minimal synaptic propagation delay, the need for a central event queue is removed, so that the precision of event-driven simulation on the level of single neurons is combined with the efficiency of time-driven global scheduling. Further, for neuron models with linear subthreshold dynamics, even local event queuing can be avoided, resulting in much greater efficiency on the single-neuron level. These ideas are exemplified by two implementations of a widely used neuron model. We present a measure for the efficiency of network simulations in terms of their integration error and show that for a wide range of input spike rates, the novel techniques we present are both more accurate and faster than standard techniques.

Neural Computation 19, 47–79 (2007). © 2006 Massachusetts Institute of Technology

1 Introduction

A major problem in the simulation of cortical neural networks has been their high connectivity. With each neuron receiving input from on the order of $10^4$ other neurons, simulations are demanding in terms of memory as well as simulation time requirements. Techniques to study such highly connected systems of $10^5$ and more neurons, using distributed computing, are now available (see, e.g., Hammarlund & Ekeberg, 1998; Harris et al., 2003; Morrison, Mehring, Geisel, Aertsen, & Diesmann, 2005). The question remains as to whether a time-driven or event-driven simulation algorithm (Fujimoto, 2000; Zeigler, Praehofer, & Kim, 2000; Sloot, Kaandorp, Hoekstra, & Overeinder, 1999; Ferscha, 1996) should be used. At first glance, the choice of event-driven algorithms seems natural, because a neuron can be described as emitting and absorbing point events (spikes) in continuous time. In fact, for neuron models with linear subthreshold dynamics and postsynaptic potentials without rise time, highly efficient algorithms exist (see, e.g., Mattia & Del Giudice, 2000). These exploit the fact that threshold crossings can occur only at the impact times of excitatory events. If more general types of neuron models are considered, the global algorithmic framework becomes much more complicated. For example, each neuron may be required to “look ahead” to determine when it will fire in the absence of new events. The global algorithm then either updates the neuron with the shortest latency or delivers the event with the most imminent arrival time (whichever is shorter) and revises the latency calculations for the neurons receiving the event. (See Marian, Reilly, & Mackey, 2002; Makino, 2003; Rochel & Martinez, 2003; Lytton & Hines, 2005; and Brette, 2006, for refined versions of such an algorithm.)
This decision process clearly comes at a cost and becomes unwieldy for networks of high connectivity: if each neuron is receiving input spikes at a conservative average rate of 1 Hz from each of $10^4$ synapses, it needs to process a spike every 0.1 ms on average, and this limits the characteristic integration step size. Therefore, time-driven algorithms have been found useful for the simulation of large, highly connected networks. Here, each neuron is updated on an equidistant time grid, and the emission and absorption times of spikes are restricted to the grid (see section 3). The temporal spacing of the grid is called the computation time step $h$. Consider a network of $10^5$ neurons as described above. At a computation time step of 0.1 ms, a time-driven algorithm carries out $10^9$ neuron updates per second of simulated time, the same number as required for the event-driven algorithm. In this situation, the time-driven scheme is necessarily faster than the event-driven scheme because the costs of the actual updates are the same and there is no overhead caused by the scheduling of events.
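The update counts quoted in this comparison follow from simple arithmetic, which the short sketch below reproduces. The variable names are illustrative only, and the calculation assumes, as the text does, one neuron update per incoming spike in the event-driven case and one update per neuron per grid step in the time-driven case.

```python
# Back-of-the-envelope check of the update counts quoted in the text.
n_neurons = 10**5    # network size
n_synapses = 10**4   # incoming synapses per neuron
rate_hz = 1.0        # average firing rate of each presynaptic neuron
h_ms = 0.1           # computation time step of the time-driven scheme

# Event-driven: each incoming spike triggers one update of the target neuron.
inputs_per_neuron_per_s = n_synapses * rate_hz
mean_input_interval_ms = 1000.0 / inputs_per_neuron_per_s
event_driven_updates_per_s = n_neurons * inputs_per_neuron_per_s

# Time-driven: every neuron is updated once per grid step.
steps_per_s = round(1000.0 / h_ms)
time_driven_updates_per_s = n_neurons * steps_per_s

print(mean_input_interval_ms, event_driven_updates_per_s, time_driven_updates_per_s)
```

Both schemes arrive at the same number of updates per second of simulated time, so under these assumptions any difference in speed comes from the per-update cost and the scheduling overhead.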


However, in this letter, we criticize this view and argue that in order to arrive at a relevant measure of efficiency, simulation time should be analyzed as a function of the integration error rather than the update interval. Whether a time-driven or event-driven scheme yields a better performance from this perspective depends on the required accuracy of the simulation and the network spike rate, and is not immediately apparent from considerations of complexity. In the time-driven framework, Rotter and Diesmann (1999) showed that for a wide class of neuron models, the linearity of the subthreshold dynamics can be exploited to integrate the neuron state exactly from one grid point to the next by performing a single matrix vector multiplication. Here, the computation time step simultaneously determines the accuracy with which incoming spikes influence the subthreshold dynamics and the timescale at which threshold crossings can be detected. However, Hansel, Mato, Meunier, and Neltner (1998) showed that forcing spikes onto the grid can significantly distort the synchronization dynamics of certain networks. Reducing the computation step ameliorates the problem only slowly, as the integration error declines linearly with $h$ (see section 8.4.1). The problem was solved (Hansel et al., 1998; Shelley & Tao, 2001) by interpolating the membrane potential between grid points to give a better approximation of the time of threshold crossing and evaluating the effect of incoming spikes on the neuronal state in continuous time. In this work, we demonstrate that the techniques developed for the exact integration of the subthreshold dynamics (Rotter & Diesmann, 1999) and for the interpolation of spike times (Hansel et al., 1998; Shelley & Tao, 2001) can be successfully combined.
By requiring that the minimal synaptic propagation delay be at least as large as the computation time step, all events can be queued at their target neurons rather than relying on a central event queue to maintain causality in the network. This reduces the complexity of the global scheduling algorithm—that is, deciding which neuron should be updated, and how far—to the simple time-driven case, whereby each neuron in turn is advanced in time by a fixed amount. Therefore, the global overhead costs are no more than in a traditional discrete-time simulation, and yet on the level of the individual neuron, spikes can be processed and emitted in continuous time with the accuracy of an event-driven algorithm. This approach represents a hybridization of traditional time-driven and event-driven algorithms: the scheme is time driven on the global level to advance the system time but event driven on the level of the individual neurons. The exact integration method is predicated on the linearity of the subthreshold dynamics. We show that this property can be further exploited, as the order of incoming events is not relevant for calculating the neuron state. This completely removes the need for storing and sorting individual events, and therefore also for dynamic data structures, while maintaining the high precision of the event-driven approach.


We illustrate these ideas by comparing three implementations of the same widely used integrate-and-fire neuron model. As in previous work, the scaling behavior of the integration error is considered as a function of the computational resolution. However, in contrast to these works, we also analyze the run time and memory consumption of a large neuronal network model as a function of integration error, thus defining a measure of efficiency that can be applied to any competing model implementations. This analysis reveals that the novel scheme of embedding continuous-time implementations in a discrete-time framework can in many cases result in simulations that are both more accurate and faster than a given purely discrete-time simulation. This is possible because the new scheme achieves the same accuracy at larger computation time steps. Depending on the rate of events to be processed by the neuron, the gain in simulation speed due to an increased step size can more than compensate for the increased complexity of processing continuous-time input events. The scheme presented is well suited for distributed computing. In section 2, we describe the neuron model used as an example in the remainder of the article, and then we review the techniques for integrating the dynamics of a neural network in discrete time steps in section 3. Subsequently, in section 4, we present two implementations solving the single-neuron dynamics between grid points but handling the incoming and emitted spikes in continuous time. The performance of these implementations with respect to integration error, run time, and memory requirements is analyzed in section 5. We show that the choice of which implementation should be used for a given problem depends on a trade-off between these factors. The concepts of time-driven and event-driven simulation of large neural networks are discussed in section 6 in the light of our findings. The numerical techniques underlying the reported results are given in the appendix. The conceptual and algorithmic work described here is a module in our long-term collaborative project to provide the technology for neural systems simulations (Diesmann & Gewaltig, 2002). Preliminary results have been presented in abstract form (Morrison, Hake, Straube, Plesser, & Diesmann, 2005).

2 Example Neuron Model

Although the methods in this letter can be applied to any neuron model reducible to a system of linear differential equations, for clarity, we compare various implementations of one particular physical model: a current-based integrate-and-fire neuron with postsynaptic currents represented as α-functions. The dynamics of the membrane potential $V$ is

$$\dot{V} = -\frac{V}{\tau_m} + \frac{1}{C}\, I,$$


where $\tau_m$ is the membrane time constant, $C$ is the capacitance of the membrane, and $I$ is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The time course of the synaptic current $\iota$ due to one incoming spike is

$$\iota(t) = \hat{\iota}\, \frac{e}{\tau_\alpha}\, t\, e^{-t/\tau_\alpha},$$

where $\hat{\iota}$ is the peak value of the current and $\tau_\alpha$ is the rise time. When the membrane potential reaches a given threshold value $\Theta$, the membrane potential is clamped to zero for an absolute refractory period $\tau_r$. The values for these parameters used in this article are $\tau_m$, 10 ms; $C$, 250 pF; $\Theta$, 20 mV; $\tau_r$, 2 ms; $\hat{\iota}$, 103.4 pA; and $\tau_\alpha$, 0.1 ms.

3 Exact Integration of Subthreshold Dynamics in a Discrete Time Simulation

The dynamics of the neuron model described in section 2 is linear and can therefore be reformulated to give a particularly efficient implementation for a discrete-time simulation (Rotter & Diesmann, 1999). We refer to this traditional, discrete-time approach as the grid-constrained implementation. Making the substitutions

$$y_1 = \frac{d}{dt}\iota + \frac{1}{\tau_\alpha}\iota, \qquad y_2 = \iota, \qquad y_3 = V,$$

where $y_i$ is the $i$th component of the state vector $y$, we arrive at the following system of linear equations:

$$\dot{y} = A y = \begin{pmatrix} -\frac{1}{\tau_\alpha} & 0 & 0 \\ 1 & -\frac{1}{\tau_\alpha} & 0 \\ 0 & \frac{1}{C} & -\frac{1}{\tau_m} \end{pmatrix} y, \qquad y(0) = \begin{pmatrix} \hat{\iota}\, \frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{pmatrix},$$

where $y(0)$ is the initial condition for a postsynaptic potential originating at time $t = 0$. The exact solution of this system is given by $y(t) = P(t)\, y(0)$, where $P(t) = e^{At}$ denotes the matrix exponential of $At$, which is an exact mathematical expression (see, e.g., Golub & van Loan, 1996). For a fixed time step $h$, the state of the system can be propagated from one grid position to the next by $y_{t+h} = P(h)\, y_t$. This is an efficient method because $P(h)$ is constant and has to be calculated only once at the beginning of a simulation. Moreover, $P(h)$ can be obtained in closed form (Diesmann, Gewaltig, Rotter, & Aertsen, 2001), for example, using symbolic algebra software such as Maple (Heck, 2003) or Mathematica (Wolfram, 2003), and can therefore be evaluated by simple expressions in the implementation language (see also section A.3). The complete update for the neuron state may be written as

$$y_{t+h} = P(h)\, y_t + x_{t+h}, \tag{3.1}$$

assuming incoming spikes are constrained to the grid, as the linearity of the system permits the initial conditions for all spikes arriving at a given grid point to be lumped together into one term, $x_{t+h}$:

$$x_{t+h} = \begin{pmatrix} \frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{pmatrix} \sum_{k \in S_{t+h}} \hat{\iota}_k. \tag{3.2}$$

Here $S_{t+h}$ is the set of indices $k \in \{1, \ldots, K\}$ of synapses that deliver a spike to the neuron at time $t + h$, and $\hat{\iota}_k$ represents the “weight” of synapse $k$. Note that the $\hat{\iota}_k$ may be arbitrarily signed and may also vary over the course of the simulation. The new neuron state $y_{t+h}$ is the exact solution to the subthreshold neural dynamics at time $t + h$, including all events that arrive at time $t + h$. This assumes that the neuron does not itself produce a spike in the interval $(t, t + h]$. If the membrane potential $y^{3}_{t+h}$ (the third component of $y_{t+h}$) exceeds the threshold value $\Theta$, the neuron communicates a spike event to the network with a time stamp of $t + h$; the membrane potential is subsequently clamped to 0 in the interval $[t + h, t + h + \tau_r]$ (see Figure 1A). The earliest grid point at which a neuron could produce its next spike is therefore $t + 2h + \tau_r$. Note that for a grid-constrained implementation, $\tau_r$ must be an integer multiple of $h$, because the membrane potential is evaluated only at grid points, and we define $\tau_r$ to be nonzero.

3.1 Computational Requirements. In order to preserve causality, it is necessary that there is a minimal synaptic delay of $h$. Otherwise, if a neuron spiked at time $t + h$ and its synapses had a propagation delay of 0, then this event would seem to arrive at some of its targets at $t + h$ and at some of them at $t + 2h$, depending on the order in which the neurons are updated. In practice, simulations are generally performed with synaptic delays that are greater than the time step $h$, and so some technique must be used to store events that have already been produced by a neuron but are not due to arrive at their targets for several time steps. In a grid-constrained simulation, only delays that are an integer multiple of $h$ can be considered because incoming spikes can be handled only at grid points. Consequently, pending events can be stored in a data structure analogous to a looped tape device (see Morrison, Mehring, et al., 2005).
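The grid-constrained update of equations 3.1 and 3.2, together with the threshold rule and a looped tape for pending input, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class and method names are hypothetical, the closed-form entries of $P(h)$ were worked out here for the matrix $A$ of section 3 (valid only for $\tau_\alpha \neq \tau_m$), the threshold test uses `>=`, and the synaptic current is assumed to keep evolving while the membrane potential is clamped. The tape follows the delivery rule spelled out in the remainder of section 3.1.

```python
import math

# Parameter values quoted in section 2 (units: ms, pF, pA, mV).
TAU_M, C_M, THETA, TAU_R = 10.0, 250.0, 20.0, 2.0
I_HAT, TAU_A = 103.4, 0.1


def propagator(h):
    """Closed-form P(h) = exp(A h) for the lower-triangular A of section 3."""
    ea, em = math.exp(-h / TAU_A), math.exp(-h / TAU_M)
    beta = 1.0 / TAU_A - 1.0 / TAU_M  # assumption: TAU_A != TAU_M
    p31 = (em - ea * (1.0 + beta * h)) / (C_M * beta ** 2)
    p32 = (em - ea) / (C_M * beta)
    return [[ea, 0.0, 0.0], [h * ea, ea, 0.0], [p31, p32, em]]


def matvec(P, y):
    """Plain 3x3 matrix-vector product."""
    return [sum(P[i][j] * y[j] for j in range(3)) for i in range(3)]


class GridNeuron:
    """Hypothetical grid-constrained neuron with one looped tape of pending input."""

    def __init__(self, h, max_delay_steps):
        self.h = h
        self.P = propagator(h)            # computed once, reused at every step
        self.y = [0.0, 0.0, 0.0]          # state (y1, y2 = current, y3 = V)
        self.refractory_steps = 0
        self.tape = [0.0] * (max_delay_steps + 1)
        self.pos = 0                      # segment read during the latest step

    def deliver(self, weight, delay_steps):
        """Queue a spike so that it arrives delay_steps grid points from now."""
        if not 1 <= delay_steps < len(self.tape):
            raise ValueError("delay must be between 1 and max_delay_steps")
        self.tape[(self.pos + delay_steps) % len(self.tape)] += weight

    def step(self):
        """Advance from t to t + h; return True if a spike is emitted at t + h."""
        self.pos = (self.pos + 1) % len(self.tape)
        weight_sum = self.tape[self.pos]
        self.tape[self.pos] = 0.0
        y = matvec(self.P, self.y)              # P(h) y_t, equation 3.1
        y[0] += weight_sum * math.e / TAU_A     # + x_{t+h}, equation 3.2
        spiked = False
        if self.refractory_steps > 0:           # V stays clamped to zero
            y[2] = 0.0
            self.refractory_steps -= 1
        elif y[2] >= THETA:
            spiked = True
            y[2] = 0.0
            self.refractory_steps = round(TAU_R / self.h)
        self.y = y
        return spiked
```

A quick sanity check of the sketch is that a single input of weight $\hat{\iota}$ produces a synaptic current equal to $\hat{\iota}$ one rise time $\tau_\alpha$ after its arrival, as the α-function of section 2 requires.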
If a neuron emits a spike at time t that has a

Figure 1: Spike generation and refractory periods for a grid-constrained (A) and a continuous-time implementation (B). The spike threshold is indicated by the dashed horizontal line. The solid black curve shows the membrane potential time course for a neuron subject to a suprathreshold constant input current leading to a threshold crossing in the interval (t, t + h]. The gray vertical lines indicate the discrete time grid with spacing h. The refractory period τr in this example is set to its minimal value h. Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. (A) In the grid-constrained implementation, the spike is emitted at t + h. During the refractory period τr, the membrane potential is clamped to zero. (B) In the continuous-time implementation, the threshold crossing is found by interpolation (here linear: black dashed line) at time $t_\Theta$. The spike is emitted with the time stamp t + h and with an offset with respect to t of $\delta = t_\Theta - t$. The neuron is refractory from $t_\Theta$ until $t_\Theta + \tau_r$, during which period the membrane potential is clamped to zero. At grid point t + 2h, a finite membrane potential can be observed. [Axes in both panels: membrane potential V, with threshold Θ, versus time on the grid t, t + h, t + 2h.]


delay of d, the simulation algorithm waits until all neurons have completed their updates for the integration step (t − h, t] and then delivers the event to its target(s). The event is placed in the tape device of the target neuron d/h segments on from the current reading position. It will then become visible to the target neuron at the grid point t + d, when the neuron is performing the update for the integration step (t + d − h, t + d]. Recalling that the initial conditions for all events arriving at a given time may be lumped together (see equation 3.2) and that two of the three components of the initial conditions vector are zero, the segments of the looped tape device can be very simple. Each segment contains just one value, which is incremented by the weight of every spike event delivered there. In other words, when the neuron performs the integration step (t, t + h], the segment visible to the reading head contains the first component of $x_{t+h}$ up to the scaling factor of $e/\tau_\alpha$. In terms of memory, the looped tape device needs as many segments as computation steps are required to cover the maximum synaptic delay, plus an additional segment to represent the events arriving at t + h. Therefore, performing a given simulation with a smaller time step will require more memory. The model described in section 2 has only a single synaptic dynamics and so requires only one tape device; models using multiple types of synaptic dynamics can be implemented in this framework by providing them with the corresponding number of tape devices.

4 Continuous-Time Implementation of Exact Integration

4.1 Canonical Implementation. The most obvious way to reconcile exact integration and precise spike timing within a discrete-time simulation is to store the precise times of incoming events. In order to represent this information, an offset must be assigned to each spike event in addition to the time stamp.
This offset is measured from the beginning of the interval in which the spike was produced: a spike generated at time t + δ receives a time stamp of t + h and an offset of δ (see Figure 1B). Given a sorted list of event offsets $\{\delta_1, \delta_2, \ldots, \delta_n\}$ with $\delta_i \leq h$, which become visible to a neuron in the step (t, t + h], exact integration of the subthreshold dynamics can be performed from the beginning of the time step to the beginning of the list:

$$y_{t+\delta_1} = P(\delta_1)\, y_t + x_{\delta_1};$$

then along the list:

$$\begin{aligned} y_{t+\delta_2} &= P(\delta_2 - \delta_1)\, y_{t+\delta_1} + x_{\delta_2} \\ &\;\;\vdots \\ y_{t+\delta_n} &= P(\delta_n - \delta_{n-1})\, y_{t+\delta_{n-1}} + x_{\delta_n}; \end{aligned}$$

and finally from the end of the list to the end of the time step:

$$y_{t+h} = P(h - \delta_n)\, y_{t+\delta_n}.$$

The final term $y_{t+h}$ is the exact solution for the neuron dynamics at time t + h. This sequence is illustrated in Figure 2B. This is assuming that the neuron does not produce a spike or emerge from its refractory period during this interval. These special cases are described in more detail below.

4.1.1 Spike Generation. In the grid-constrained implementation, the neuron state is inspected at the end of each time step to see if it meets its spiking criteria. In the case of the neuron model described in section 3, the criterion is $y^3 \geq \Theta$, where Θ is the threshold. In this implementation, the neuron state can be inspected after every step of the process described in section 4.1. If $y^3_{t+\delta_i} < \Theta$ and $y^3_{t+\delta_{i+1}} \geq \Theta$, then the membrane potential of the neuron reached threshold between $t + \delta_i$ and $t + \delta_{i+1}$. As the dynamics of this model is not invertible, the time $t_\Theta$ of this threshold passing can be determined only by interpolating the membrane potential in the interval $(t + \delta_i, t + \delta_{i+1}]$. For this article, linear, quadratic, and cubic interpolation schemes were investigated. After the threshold crossing, the neuron is refractory for the duration of its refractory period τr. The membrane potential $y^3$ is set to zero and need not be calculated during this period, although the other components of the neuron state continue to be updated as in section 4.1. At the end of the time interval, an event is dispatched with a discrete-time stamp of t + h and an offset of $\delta = t_\Theta - t$ (see Figure 1B).

4.1.2 Emerging from the Refractory Period. The neuron emerges from its refractory period in the time step defined by $t < t_\Theta + \tau_r \leq t + h$ (see Figure 1B). In contrast to the grid-constrained implementation, τr does not have to be an integer multiple of h. For a continuous time τr, a grid position t can always be found such that $t_\Theta + \tau_r$ comes within (t, t + h]. However, the implementation is simpler when τr is a nonzero integer multiple of h. To calculate the neuron state at time t + h exactly, the interval is divided into two subintervals: $(t, t_\Theta + \tau_r]$ and $(t_\Theta + \tau_r, t + h]$. In the first period, the neuron is still refractory, so when performing the exact integration along the incoming events as in section 4.1, $y^3$ is not calculated and remains at zero. At the end of this period, the neuron state $y_{t_\Theta + \tau_r}$ is the exact solution for the dynamics at the end of the refractory period. In the second period, the neuron is no longer refractory, and so the exact integration can be performed as usual (including the calculation of $y^3$). The neuron state $y_{t+h}$ is therefore an exact solution to the neuron dynamics at time t + h.
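The interpolation step of section 4.1.1 can be illustrated with the linear scheme applied to a neuron under constant suprathreshold drive, as in Figure 1. This is only a sketch: the membrane time constant, steady-state potential, and threshold below are assumed illustrative values, the closed-form `membrane` function stands in for the exact integration, and the helper names are hypothetical.

```python
import math

TAU_M = 10.0   # ms, membrane time constant (assumed value)
V_INF = 23.0   # mV, steady-state potential under the constant drive (assumed value)
THETA = 20.0   # mV, spike threshold (assumed value)


def membrane(t):
    """Membrane potential at time t under constant drive, starting from 0 mV."""
    return V_INF * (1.0 - math.exp(-t / TAU_M))


def crossing_offset(v_start, v_end, dt, theta):
    """Linear estimate of the crossing time, measured from the start of the interval."""
    if not (v_start < theta <= v_end):
        raise ValueError("no upward threshold crossing in this interval")
    return dt * (theta - v_start) / (v_end - v_start)


def locate_spike(h):
    """Return (time stamp, offset) of the first threshold crossing on a grid of spacing h."""
    k = 0
    while membrane((k + 1) * h) < THETA:   # inspect the potential at grid points only
        k += 1
    delta = crossing_offset(membrane(k * h), membrane((k + 1) * h), h, THETA)
    return (k + 1) * h, delta              # stamp t + h, offset relative to t


exact = -TAU_M * math.log(1.0 - THETA / V_INF)   # analytic crossing time for this toy case
for h in (1.0, 0.5, 0.25, 0.125):
    stamp, delta = locate_spike(h)
    grid_error = stamp - exact                      # spike forced to the grid point t + h
    interp_error = abs(stamp - h + delta - exact)   # spike placed at t + delta
    print(f"h={h:5.3f}  grid error={grid_error:.2e}  interpolated error={interp_error:.2e}")
```

In this toy case the error of the grid-aligned spike time is bounded by h, whereas the linearly interpolated estimate is bounded by a multiple of h squared, in line with the n + 1 scaling reported for interpolation order n.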


Figure 2: Handling of incoming spikes in the grid-constrained (A), the canonical (B), and the prescient (C) implementations. In each panel, the solid curve represents the excursion of the membrane potential in response to two incoming spikes (gray vertical lines). Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. The gray horizontal arrows beneath each panel indicate the propagation steps performed during the time step (t, t + h]. Dashed arrows in A indicate that in the grid-constrained implementation, input spikes are effectively shifted to the next point on the grid. The observable membrane potentials following spike impact are identical and exact in B and C but differ from those in A.
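The difference between panels A and B can be reproduced numerically. The sketch below assumes a three-component state y = (y1, y2, y3) obeying y1' = −y1/τα, y2' = y1 − y2/τα, y3' = −y3/τm + y2/C, which is one common way of writing a leaky integrator with α-shaped synaptic currents; the propagator entries are derived from these assumed equations, and the parameter values and function names are illustrative rather than taken from the text.

```python
import math

TAU_ALPHA = 0.5  # ms, synaptic time constant (assumed value)
TAU_M = 10.0     # ms, membrane time constant (assumed value)
C_M = 250.0      # pF, membrane capacitance (assumed value)


def propagator(dt):
    """Exact propagator P(dt) of the assumed linear system, as a 3x3 tuple."""
    a, b = 1.0 / TAU_ALPHA, 1.0 / TAU_M
    c = b - a
    ea, eb = math.exp(-a * dt), math.exp(-b * dt)
    p32 = (ea - eb) / (c * C_M)
    p31 = (dt * ea / c - (ea - eb) / (c * c)) / C_M
    return ((ea, 0.0, 0.0), (dt * ea, ea, 0.0), (p31, p32, eb))


def apply(P, y):
    """Matrix-vector product P y."""
    return tuple(sum(P[i][j] * y[j] for j in range(3)) for i in range(3))


def inject(y, weight):
    """Add the initial condition of a spike: only the first component jumps."""
    return (y[0] + math.e / TAU_ALPHA * weight, y[1], y[2])


def canonical_step(y, events, h):
    """Propagate from event to event along the sorted offsets, then to t + h."""
    t_prev = 0.0
    for offset, weight in sorted(events):
        y = inject(apply(propagator(offset - t_prev), y), weight)
        t_prev = offset
    return apply(propagator(h - t_prev), y)


def grid_step(y, events, h):
    """Shift all events to the grid point t + h and lump their weights together."""
    return inject(apply(propagator(h), y), sum(w for _, w in events))


h = 0.1                                  # ms
events = [(0.07, 30.0), (0.03, 30.0)]    # (offset in ms, weight in pA), unsorted
y_canonical = canonical_step((0.0, 0.0, 0.0), events, h)
y_grid = grid_step((0.0, 0.0, 0.0), events, h)
```

Starting from rest, the grid-constrained membrane potential at t + h is still zero because the spikes take effect only at the grid point, whereas the canonical result already reflects the current that flowed during the step.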


4.1.3 Computational Requirements. As in the grid-constrained implementation, a minimum spike propagation delay of h is necessary in order to ensure that all events due at the neuron between t and t + h have arrived by the time the neuron performs its update for that interval. The simple looped event buffer described in section 3.1 must be extended to store the weights and offsets for incoming events separately. As the number of events arriving in an interval cannot be known ahead of time, this structure must be capable of dynamic memory allocation, which reduces its cache effectiveness. However, information about the minimum propagation delay in the network can be utilized to streamline the data structure so that its size does not depend on h, which compensates to some extent for the use of dynamic memory. Finally, as the events may arrive in any order, the buffer must also be capable of sorting the events with respect to increasing offset, which for a general-purpose sorting algorithm has complexity O(n log n). An alternative representation of the timing data is a priority queue; in practice, this was not quite as efficient as the looped tape device. In contrast to the grid-constrained implementation, delays can now be represented in continuous time. A spike arrival time t + δ + d, where t + δ is a continuous spike generation time and d is a continuous time delay, can always be decomposed into a discrete grid point t + d and a continuous offset δ. However, for notational and implementational convenience (see section 6), we assume d to be a nonzero integer multiple of h.

4.2 Prescient Implementation. In the implementation described in section 4.1, the neuron state at the end of a time step is calculated by integrating along a sorted list of events. However, as the subthreshold dynamics is linear, it is not dependent on the order of events.
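The order independence can be checked numerically: integrating along the sorted list gives the same end-of-step state as propagating each event separately to the end of the step and summing the results in any order. The sketch assumes the state equations y1' = −y1/τα, y2' = y1 − y2/τα, y3' = −y3/τm + y2/C; the parameter values, weights, and function names are illustrative assumptions.

```python
import math
import random

TAU_ALPHA = 0.5  # ms, synaptic time constant (assumed value)
TAU_M = 10.0     # ms, membrane time constant (assumed value)
C_M = 250.0      # pF, membrane capacitance (assumed value)


def propagator(dt):
    """Exact propagator P(dt) of the assumed linear system, as a 3x3 tuple."""
    a, b = 1.0 / TAU_ALPHA, 1.0 / TAU_M
    c = b - a
    ea, eb = math.exp(-a * dt), math.exp(-b * dt)
    p32 = (ea - eb) / (c * C_M)
    p31 = (dt * ea / c - (ea - eb) / (c * c)) / C_M
    return ((ea, 0.0, 0.0), (dt * ea, ea, 0.0), (p31, p32, eb))


def apply(P, y):
    """Matrix-vector product P y."""
    return tuple(sum(P[i][j] * y[j] for j in range(3)) for i in range(3))


def sequential(y, events, h):
    """Integrate along the events in order of increasing offset."""
    t_prev = 0.0
    for offset, weight in sorted(events):
        y = apply(propagator(offset - t_prev), y)
        y = (y[0] + math.e / TAU_ALPHA * weight, y[1], y[2])
        t_prev = offset
    return apply(propagator(h - t_prev), y)


def superposed(y, events, h):
    """Sum the separately propagated effect of each event, in the order given."""
    total = apply(propagator(h), y)
    for offset, weight in events:
        x = (math.e / TAU_ALPHA * weight, 0.0, 0.0)
        effect = apply(propagator(h - offset), x)
        total = tuple(p + q for p, q in zip(total, effect))
    return total


h = 0.1
rng = random.Random(1)
events = [(rng.uniform(0.0, h), rng.choice([30.0, -187.5])) for _ in range(5)]
y0 = (100.0, 50.0, 12.0)
reference = sequential(y0, events, h)
rng.shuffle(events)
shuffled = superposed(y0, events, h)
max_diff = max(abs(p - q) for p, q in zip(reference, shuffled))
```

Here `max_diff` stays at the level of floating-point round-off, which is the property that allows the effects of events to be accumulated ahead of time.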
In this section, an implementation is presented that exploits this fact to reduce the computational complexity and dynamic memory requirements of the canonical implementation.

4.2.1 Receiving an Event. Consider a spike event generated during the step (t − h, t] with offset δ and transmitted with delay d. This spike will be processed during the update step (t + d − h, t + d], its effect being observable for the first time at t + d. Since the correct spike arrival time is t + d − h + δ, when the algorithm delivers the spike to the neuron we evolve the effect of the spike input from the arrival time to the end of the interval at t + d using $\tilde{y}_{t+d} = P(h - \delta)\,x$. Therefore, instead of storing the entire event as for the canonical implementation, the three components of its effect on the neuron state can be stored in an event buffer at the position d instead. Due to the linearity of the system,


these components can be summed for all events due to arrive at the neuron in a given time step, regardless of the order in which the algorithm delivers them to the neuron or the order in which they are to become visible to the neuron. As we exploit the fact that the effect of an event on the neuron can be calculated before the event becomes visible to the neuron, we call this the prescient implementation.

4.2.2 Calculating the Subthreshold Dynamics. At the beginning of each time interval (t, t + h], the total effect of all events arriving within that step on the three components of the neuron state at time t + h is already stored in the event buffer. Calculating the neuron state at the end of the time step is therefore simple: $y_{t+h} = P(h)\, y_t + \tilde{y}_{t+h}$. The new neuron state $y_{t+h}$ is the exact solution to the neuron dynamics at time t + h, including all events that arrived within the step (t, t + h]. This is depicted in Figure 2C. As with the canonical implementation, there are two special cases that need to be treated with more care: a time step in which a spike is generated and one in which the neuron emerges from its refractory period.

4.2.3 Spike Generation. The process that generates a spike is very similar to that for the canonical implementation described in section 4.1.1. In this case, as the timing of the incoming events is no longer known, the neuron state can be inspected only at the end of the time step rather than at each incoming event, and so the length of the interpolation interval is h rather than the interspike interval of incoming events.

4.2.4 Emerging from the Refractory Period. As for the canonical implementation, the time step in which the neuron emerges from its refractory period is divided into two subintervals: $(t, t_\Theta + \tau_r]$ and $(t_\Theta + \tau_r, t + h]$. Setting $t_{em} = t_\Theta + \tau_r - t$, the neuron state at the end of the refractory period can be calculated as follows:

$$y_{t+t_{em}} = P(t_{em})\, y_t, \qquad y^3_{t+t_{em}} \leftarrow 0.$$

Having emerged from the refractory period, the membrane potential is no longer clamped to zero and can develop normally during the remainder of the time step (see Figure 1B):

$$y_{t+h} = P(h - t_{em})\, y_{t+t_{em}} + \tilde{y}_{t+h}.$$


However, this overestimates the effect of the events arriving in the time interval (t, t + h] on the membrane potential. The summation of the components of these events was predicated on the assumption of linear dynamics, but as the membrane potential is clamped to zero until $t + t_{em}$, this assumption does not hold. Any events arriving at the neuron before its emergence from its refractory period should have no effect on its membrane potential before this point, yet adding the full value of the third component of $\tilde{y}_{t+h}$ assumes that they do. As a corrective measure, the effect of the new events on the membrane potential can be considered to be linear within the small interval (t, t + h], and the membrane potential can be adjusted accordingly:

$$y^3_{t+h} \leftarrow y^3_{t+h} - \gamma\, \tilde{y}^3_{t+h},$$

with $\gamma = t_{em}/h$.

4.2.5 Computational Requirements. As in the grid-constrained and canonical implementations, a minimum spike propagation delay of h is required to preserve causality. The looped-tape device described in section 3.1 needs to be able to store the three components of the neuron state rather than just the weight of the incoming events. Alternatively, three event buffers can be used, capable of storing one component each. Unlike the buffer devices for the canonical implementation, they need to store only one value per time step rather than one for each incoming spike, so there is no time overhead for sorting the values. Moreover, they do not require dynamic memory allocation and so are more cache effective.

5 Performance

In order to compare error scaling for the different implementations and interpolation orders, a simple single-neuron simulation was chosen. As the system is deterministic and nonchaotic, reducing the computation time step h causes the simulation results to converge to the exact solution, so error measures can be well defined. To investigate the costs incurred by simulating at finer resolutions or using computationally more expensive off-grid neuron implementations, a network simulation was chosen. This is fairer than a single-neuron simulation, as the run-time penalties of applications requiring more memory will come into play only if the application is large enough not to fit easily into the processor’s cache memory. Furthermore, it is only when performing network simulations that the bite of long simulation times per neuron is really felt.

5.1 Single-Neuron Simulations. Each experiment consisted of 40 trials of 500 ms each, during which a neuron of the type described in section 2 was stimulated with a constant excitatory current of 412 pA and unique

Figure 3: Scaling of error in membrane potential as a function of the computational resolution in double logarithmic representation. (A) Canonical implementation. (B) Prescient implementation. No interpolation, circles; linear interpolation, plus signs; quadratic interpolation, diamonds; cubic interpolation, multiplication signs. In both cases, the triangles show the behavior of the grid-constrained implementation, and the gray lines indicate the slopes expected for scaling of orders first to fourth with an arbitrary intercept of the vertical axis. [Axes in both panels: error [mV] versus time step h [ms].]

realizations of an excitatory Poissonian spike train of $1.3 \times 10^4$ Hz and an inhibitory Poissonian spike train of $3 \times 10^3$ Hz. The spike times of the Poissonian input trains were represented in continuous time. Parameters are as in section 2, but the peak value of the current resulting from an inhibitory spike was a factor of 6.25 greater than that of an excitatory spike to ensure a balance between excitation and inhibition. The output firing rate was 12.7 Hz. The experiment was repeated for each implementation with each interpolation order over a wide range of computational resolutions. As the membrane potential and spike times cannot be calculated analytically for this protocol, the canonical implementation with cubic interpolation at the finest resolution ($2^{-13}$ ms ≈ 0.12 µs) was defined to be the reference simulation for each realization of the input spike train. As a measure of the error in calculating the membrane potential, the deviation of the actual membrane potential from the reference membrane potential was sampled every millisecond for all the trials. In Figure 3, the median of these deviations is plotted as a function of the computational resolution in double logarithmic representation. In both the canonical implementation (see Figure 3A) and the prescient implementation (see Figure 3B), the same scaling behavior can be seen: for an interpolation order of n, the error in membrane potential scales with order n + 1 (see section A.4). The error has a lower bound at $10^{-14}$, which can be seen for very fine resolutions using cubic interpolation. This represents the greatest

Figure 4: Scaling of error in spike times as a function of the computational resolution. (A) Canonical implementation. (B) Prescient implementation. Symbols and lines as in Figure 3. [Axes in both panels: error [ms] versus time step h [ms].]

numerical precision possible for this physical quantity using the standard representation of floating-point numbers (see section A.1). Interestingly, the error for the canonical implementation also saturates at coarse resolutions. This is because interpolation is performed between incoming events rather than across the whole time step, as in the case of the prescient implementation. Consequently, the effective computational resolution cannot be any coarser than the average interspike interval of the incoming spike train (in this case, 1/16 ms), and this determines the maximum error of the canonical implementation. Note that the error of the grid-constrained implementation scales in the same way as that of the canonical and prescient implementations with no interpolation. However, due to the fact that incoming spikes are forced to the grid, the absolute error is greater for this implementation. The accuracy of a simulation is not determined by the membrane potential alone; the precision of spike times is of at least as much relevance. The median of the differences between the actual and the reference spike times is shown in Figure 4. As with the error in membrane potential, using an interpolation order of n results in a scaling of order n + 1, and the error has a lower bound that is exhibited at very fine resolution when using cubic interpolation. Furthermore, a similar upper bound on the error is observed for the canonical implementation at coarse resolutions. However, in this case, the grid-constrained implementation exhibits not only the same scaling, but a similar absolute error to the continuous-time implementations with no interpolation. 
Recalling that all the simulations receive identical continuous-time Poisson spike trains, the only difference remaining between the grid-constrained implementation and the continuous-time implementations without interpolation is that the former ignores the offset of the incoming spikes and treats them as if they were produced on the


grid, whereas the latter process the incoming spikes precisely. This reveals that handling incoming spikes precisely confers no real advantage if outgoing spikes are generated without interpolation and thereby forced to the grid. One might therefore conclude that the precise handling of incoming spikes is unnecessary and that the single-neuron integration error could be significantly improved just by performing an appropriate interpolation to generate spikes, while treating incoming spikes as if they were produced on the grid. In fact, this is not the case. If incoming spikes are treated as if on the grid, the error in the membrane potential decreases only linearly with h, thus limiting the accuracy of higher-order methods in determining threshold crossings. This is corroborated by simulations (data not shown). Substantial improvement in the accuracy of single-neuron simulations requires both techniques: precise handling of incoming spikes and interpolation of outgoing spikes.

5.2 Network Simulations. In order to determine the efficiency of the various implementations, a balanced recurrent network was adapted from Brunel (2000). The network contained 10,240 excitatory and 2560 inhibitory neurons and had a connection probability of 0.1, resulting in a total of $15.6 \times 10^6$ synapses. The inhibitory synapses were a factor of 6.25 stronger than the excitatory synapses, and each neuron received a constant excitatory current of 412 pA as its sole external input. Membrane potentials were initialized to values chosen from a uniform random distribution over [−Θ/2, 0.99Θ]. In this configuration, the network fires with approximately 12.7 Hz in the asynchronous irregular regime, which recreates the input statistics used in the single-neuron simulations. The synaptic delay was 1 ms, and the network was simulated for 1 biological second. The simulation time and memory requirements for the network simulation described above are shown in Figure 5.
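The statement that the network recreates the single-neuron input statistics can be checked with a few lines of arithmetic. The sketch assumes that each neuron receives connections from a fraction 0.1 of each population, so that the in-degree equals its expected value; the variable names are illustrative.

```python
N_EXC, N_INH = 10_240, 2_560   # population sizes from the text
P_CONNECT = 0.1                # connection probability from the text
RATE_HZ = 12.7                 # network firing rate from the text

exc_inputs = round(N_EXC * P_CONNECT)    # 1024 excitatory inputs per neuron
inh_inputs = round(N_INH * P_CONNECT)    # 256 inhibitory inputs per neuron

exc_rate_hz = exc_inputs * RATE_HZ       # summed excitatory input rate per neuron
inh_rate_hz = inh_inputs * RATE_HZ       # summed inhibitory input rate per neuron
total_rate_khz = (exc_rate_hz + inh_rate_hz) / 1000.0
mean_isi_ms = 1.0 / total_rate_khz       # mean interval between incoming spikes

n_synapses = (N_EXC + N_INH) * (exc_inputs + inh_inputs)
```

Under this assumption the summed input is roughly 13 kHz excitatory and 3.3 kHz inhibitory, about 16 kHz in total, which corresponds to a mean interval between incoming spikes of about 1/16 ms and is close to the rates used for the single-neuron protocol. The synapse count comes out as 16,384,000, that is, 15.625 × 2^20; this agrees with the quoted figure of 15.6 million only if the prefix is read in binary units, which is a guess rather than something the text states.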
For the simulation time (see Figure 5A), it can be seen that at coarse resolutions, the grid-constrained implementation is significantly faster than the prescient implementation, which in turn is faster than the canonical implementation. This is due to the fact that the cost of processing spikes is essentially independent of the computational resolution and manifests as an implementation-dependent constant contribution to the simulation time, which is particularly dominant at coarse resolutions. The difference in speed between the canonical and prescient implementations results from the use of dynamic as opposed to static data structures, and to a lesser extent from the cost of sorting incoming spikes in the canonical implementation. As the computation time step decreases, the simulation times converge, because the cost of updating the neuron dynamics in the absence of events, which is the same for all implementations, is inversely proportional to the resolution and so manifests as a scaling with exponent −1 at small computation time steps (see Figure 5A). It is clear that in general, at the same computation time step, the continuous-time implementations must be slower than the grid-constrained

Figure 5: Simulation time (A) and memory requirements (B) for a network simulation as functions of computational resolution in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Other interpolation orders for the canonical and the prescient implementations result in practically identical curves and are therefore not shown. For details of the simulation, see the text. [Axes: (A) simulation time [s] and (B) memory [GB], each versus time step h [ms].]

implementation, as the former perform a propagation for every incoming spike and the latter does not. The increased costs concomitant with higher interpolation orders proved to be negligible in a network simulation. An increase in memory requirements can be observed for all implementations (see Figure 5B) as the resolutions become finer. Although all implementations require much the same amount of memory at coarser resolutions, for finer resolutions, the canonical implementation requires the least memory, followed by the grid-constrained implementation, and the prescient implementation requires the most. It is clear that for a wide range of resolutions, the memory required by the rest of the network, specifically the synapses, dominates the total memory requirements (Morrison, Mehring, et al., 2005). As the resolution becomes finer, the memory required for the input buffers for the neurons plays a greater role. The spike buffer for the canonical implementation is independent of the resolution (see section 4.1.3), and so it might seem that it should not require more memory at finer resolution. However, all implementations tested here also have a buffer for a piece-wise constant input current. In addition to this, the grid-constrained implementation has one buffer for the weights of incoming spikes, and the prescient implementation has three buffers—one for each component of the state vector. All of these buffers require memory in inverse proportion to the resolution, thus explaining the ordering of the curves. Generally a smaller application is more cache effective than a larger one, and this may explain why the canonical implementation exhibits slightly lower


simulation times than the other implementations at very fine resolutions (see Figure 5A).

5.3 Conjunction of Integration Error and Run-Time Costs. The considerations in section 5.2 of how the simulation time and memory requirements increase with finer resolutions are of limited practical relevance to a scientist with a particular problem to investigate. More interesting in this case is how much precision bang you get for your simulation time buck. Unlike the single neuron, however, the network described above is a chaotic system (Brunel, 2000). Any deviation at all between simulations will lead, in a short time, to totally different results on the microscopic level, such as the evoked spike patterns. Such a deviation can even be caused by differences in the tiny round-off errors that occur if floating-point numbers are summed in a different order. Because of this, these simulations do not converge on the microscopic level as the single-neuron simulations do, and for that reason the so-called accuracy of a simulation cannot be taken at face value. We therefore relate the cost of network simulations to the accuracy of single-neuron simulations with comparable input statistics. In Figure 6, the simulation time and memory requirements data from Figure 5 are combined with the accuracy of the corresponding single-neuron simulations shown in Figure 4, thus eliminating the computational resolution as a parameter. Figure 6A shows the simulation time as a function of spike time error for the three implementations. This graph can be read in two directions: horizontally and vertically. By reading the graph horizontally, we can determine which implementation will give the best accuracy for a given affordable simulation time. Reading the graph vertically allows us to determine which implementation will result in the shortest simulation time for a given acceptable accuracy. Concentrating on the latter interpretation, it can be seen from the intersection of the lines corresponding to the prescient and grid-constrained implementations (vertical dashed line in Figure 6A) that if an error greater than $2.3 \times 10^{-2}$ ms is acceptable, the grid-constrained implementation is faster. For better accuracy than this, the prescient implementation is more effective. If an appropriate time step is chosen, the prescient implementation can simulate more accurately and more quickly than a given grid-constrained simulation in this regime. Only for very high accuracy can a lower simulation time be achieved using the canonical implementation. Similarly, Figure 6B, which shows the memory requirements as a function of spike-time error in double logarithmic representation, can be read in both directions. This shows qualitatively the same relationship as in Figure 6A, but the point at which one would switch from the prescient implementation to the canonical implementation in order to conserve memory occurs for larger errors. The flatness of the curves for the continuous-time implementations shows that it is possible to increase the accuracy of a simulation considerably without having to worry about the memory requirements.

Figure 6: Analysis of simulation time and memory requirements for a network simulation as functions of spike-time error for the single-neuron simulation and input spike rate. (A) Simulation time as a function of spike-time error in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Data combined from Figure 4 and Figure 5. For errors smaller than $2.3 \times 10^{-2}$ ms (vertical dashed line), a continuous-time implementation with an appropriately chosen computation step size gives better performance. (B) Memory consumption as a function of spike-time error in double logarithmic representation. Symbols as in A. (C) Error in spike time for which the prescient and grid-constrained implementations require the same simulation time (for different appropriate choices of h) as a function of input spike rate. The unfilled circle indicates the equivalence error for a network rate of 12.7 Hz (input rate 16 kHz), that is, the intersection marked by the vertical dashed line in A. The gray line is a linear fit to the data (slope −0.95). (D) Simulation time as a function of input spike rate for h = 0.125 ms, symbols as in A. The gray lines are linear fits to the data (slopes 1.3, 0.8 and 0.3 s per kHz from top to bottom). [Axes: (A) simulation time [s] versus error [ms]; (B) memory [GB] versus error [ms]; (C) equivalence error [ms] versus input rate [kHz], with regions labeled “prescient faster” and “grid-constrained faster”; (D) simulation time [s] versus input rate [kHz].]
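Eliminating the computational resolution, as done for Figure 6A, can be illustrated with a toy cost model. The sketch assumes that the spike-time error scales as a power of h and that the simulation time is a fixed spike-handling cost plus a term inversely proportional to h; every coefficient below is hypothetical and chosen only to show how the faster implementation changes with the acceptable error, so none of the numbers are measurements from the simulations described here.

```python
# name: (scaling order, error at h = 1 ms [ms], fixed cost [s], cost per step [s * ms])
# All four numbers per entry are hypothetical illustration values.
MODELS = {
    "grid-constrained": (1, 1.0, 10.0, 1.0),
    "prescient": (4, 0.1, 30.0, 1.0),
}


def step_for_error(name, target_error):
    """Largest time step h [ms] at which the modeled error equals target_error."""
    order, coeff, _, _ = MODELS[name]
    return (target_error / coeff) ** (1.0 / order)


def time_for_error(name, target_error):
    """Modeled simulation time [s] when h is chosen to just meet target_error."""
    _, _, fixed, per_step = MODELS[name]
    return fixed + per_step / step_for_error(name, target_error)


def fastest(target_error):
    """Name of the implementation with the lowest modeled time for this error."""
    return min(MODELS, key=lambda name: time_for_error(name, target_error))


coarse_choice = fastest(0.1)     # a loose accuracy requirement
fine_choice = fastest(0.001)     # a tight accuracy requirement
```

With these made-up coefficients the grid-constrained model wins when a coarse error is acceptable and the prescient model wins for tighter requirements, which mirrors reading the graph vertically at two different error values.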


In general, the point at which the prescient implementation at an appropriate computational resolution can produce a faster and more accurate simulation than a grid-constrained simulation will depend on the rate of events a neuron has to handle. To investigate this relationship, the network described in section 5.2 was simulated with different input currents and inhibitory synapse strengths to generate a range of different firing rates in the asynchronous irregular regime. The single-neuron integration error at which the prescient- and the grid-constrained implementations require the same simulation time is shown in Figure 6C as a function of the input spike rate. This equivalence error depends linearly on the input spike rate, demonstrating that in the parameter space of useful accuracies and realistic input rates, there is a wide regime where the prescient is faster than the grid-constrained implementation. Underlying the benign nature of the comparative effectiveness analyzed in Figure 6C is the dependence of simulation time on the rate of events. For all implementations, the simulation time increases practically linearly with the input spike rate (see Figure 6D), albeit with different slopes. 5.4 Artificial Synchrony. The previous section has shown that in many situations, continuous-time implementations achieve a desired single-neuron integration error more effectively than the grid-constrained implementation. However, continuous-time implementations have an advantage compared to the grid-constrained implementation beyond the single-neuron integration error. In a network simulation carried out with the grid-constrained implementation, the spikes of all neurons are aligned to the temporal grid defined by the computation time step. This causes artificial synchronization between neurons that may distort measures of synchronization and correlation on the network level. To demonstrate this effect, Hansel et al. 
(1998) investigated a small network of N integrate-and-fire neurons with excitatory all-to-all coupling. Here, we extend their analysis to the three implementations under study and provide a comparison of single-neuron and network-level integration error. In contrast to their study, our network is constructed from the model neuron introduced in section 2. The time constant of the synaptic current τα is adjusted to the rise time of the synaptic current in the original model, which was described by a β-function. The synaptic delay d and the absolute refractory period τr are set to the maximum computation time step h investigated in this section. Note that this choice of d and τr means that these parameters can be represented at all computational resolutions, thus ensuring that all simulations using the grid-constrained implementation are solving the same dynamical system. Figure 7A illustrates the synchrony in the network as a function of synaptic strength. Following Hansel et al. (1998), synchrony is defined as the variance of the population-averaged membrane potential normalized by the population-averaged variance of the membrane

Continuous Spike Times in Exact Discrete-Time Simulations

[Figure 7 (plot not reproduced). Panel A: synchrony versus synaptic strength. Panel B: synchronization error [%] versus time step h [ms]. Panel C: synchronization error [%] versus error [ms].]

Figure 7: Synchronization error in a network simulation. (A) Network synchrony (see equation 3.3) as a function of synaptic strength in a fully connected network of N = 128 neurons ($\tau_\alpha = (3/2) \ln 3$ ms, synaptic delay $d = \tau_r = 0.25$ ms) with excitatory coupling (cf. Hansel et al., 1998). Neurons are driven by a suprathreshold DC $I_0 = 575$ pA, no further external input. The initial membrane potential is $V_i(0) = \frac{\tau_m}{C} I_0 \left[ 1 - \exp\left( -\gamma \, \frac{i-1}{N} \, \frac{T}{\tau_m} \right) \right]$, where $i \in 1, \ldots, N$ is the neuron index, T the period in the absence of coupling, and γ = 0.5 controls the initial coherence. The simulation time is 10 s, and V is recorded in intervals of 1 ms between 5 s and 10 s. Synaptic strength is expressed as the amplitude of a postsynaptic current relative to the rheobase current $I_\infty = (C/\tau_m)\theta = 500$ pA and multiplied by the number of neurons N. Other parameters as in section 2. Canonical implementation as reference ($h = 2^{-10}$ ms, gray curve) and prescient implementation ($h = 2^{-2}$ ms, circles), both with cubic interpolation; grid-constrained implementation ($h = 2^{-5}$ ms, triangles). (B) Synchronization error as a function of the computation time step in double logarithmic representation: grid-constrained implementation, triangles; prescient implementation, circles; canonical implementation, plus signs. (C) Synchronization error as a function of the single-neuron integration error for the grid-constrained and the prescient implementation, same representation as in B. The gray lines are linear fits to the data.


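potential time course,

$$ S = \frac{\left\langle \langle V_i(t) \rangle_i^2 \right\rangle_t - \left\langle \langle V_i(t) \rangle_i \right\rangle_t^2}{\left\langle \langle V_i(t)^2 \rangle_t - \langle V_i(t) \rangle_t^2 \right\rangle_i}, \tag{5.1} $$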

where $\langle \cdot \rangle_i$ indicates averaging over the N neurons and $\langle \cdot \rangle_t$ indicates averaging over time. This is a measure of coherence in the limit of large N with S = 1 for full synchronization and S = 0 for the asynchronous state. The grid-constrained implementation exhibits a considerable error in synchrony, which vanishes in approaching the asynchronous regime. The prescient implementation accurately preserves network synchrony even with a significantly larger computation time step. The error in synchrony is quantified in Figure 7B as the root mean square of the relative deviation of S with respect to the reference solution, estimated over the range of synaptic strength investigated. Note that this includes the asynchronous regime where errors are small in general. The prescient implementation is easily an order of magnitude more accurate than the grid-constrained implementation and is itself outperformed by the canonical implementation. In addition, for the continuous-time implementations, the error in synchrony drops more rapidly with decreasing computational time step h than for the grid-constrained implementation. However, at the same h, different integration methods exhibit a different integration error for the single-neuron dynamics (see Figure 4). Therefore, to accentuate network effects, continuous-time and grid-constrained implementations should be compared at the same single-neuron integration error. To this end, we proceed as follows: the network spike rate is approximately 80 Hz, corresponding to an input spike rate of some 10 kHz. In Figure 7C, the error in network synchrony at a given computational time step h is plotted as a function of the spike time error of a single neuron driven with an input spike rate of approximately 10 kHz, simulated at the same h. For single-neuron errors of $10^{-2}$ ms and above, the grid-constrained implementation results in considerable errors in network synchrony.
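As a minimal sketch of how the synchrony measure of equation 5.1 could be evaluated from recorded membrane potentials, the following NumPy function assumes that the traces are stored as an array with one row per time sample and one column per neuron. The function name `synchrony`, the array layout, and the two sinusoidal test populations are illustrative assumptions and are not taken from the implementation used for Figure 7.

```python
import numpy as np

def synchrony(v):
    """Synchrony S (equation 5.1) for v of shape (n_samples, n_neurons)."""
    v = np.asarray(v, dtype=float)
    pop_mean = v.mean(axis=1)      # <V_i(t)>_i at every time sample
    numerator = pop_mean.var()     # temporal variance of the population average
    denominator = v.var(axis=0).mean()  # temporal variance per neuron, averaged over neurons
    return numerator / denominator

# Hypothetical test signals: 8 "neurons" oscillating at 5 Hz for 1 s.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
# Fully synchronized population: every neuron follows the same trace.
v_sync = np.tile(np.sin(2 * np.pi * 5 * t)[:, None], (1, 8))
# Asynchronous population: phases spread evenly over one cycle.
phases = 2 * np.pi * np.arange(8) / 8
v_async = np.sin(2 * np.pi * 5 * t[:, None] + phases[None, :])

s_sync = synchrony(v_sync)
s_async = synchrony(v_async)
```

With these test signals, `s_sync` is close to 1 and `s_async` is close to 0, which matches the limiting cases stated for S.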
Spike timing errors of $10^{-2}$ ms and below are required for the grid-constrained and the prescient implementations to achieve a synchronization error in the 1% range or better. Interestingly, the grid-constrained implementation exhibits a larger synchronization error than the prescient implementation for identical single-neuron integration errors.

6 Discussion

We have shown that exact integration techniques are compatible with continuous-time handling of spike events within a discrete-time simulation. This combination of techniques achieves arbitrarily high accuracy (up to machine precision) without incurring any extra management costs in the global algorithm, such as a central event queue or looking ahead to see


which neuron will fire next. This is particularly important for the study of large networks with frequent events, as the cost of managing events can become prohibitive (Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003). We introduced a canonical implementation that illustrates the principles of combining these techniques and a prescient implementation that further exploits the linearity of the subthreshold dynamics. The latter implementation simplifies the neuron update algorithm and requires only static data structures and no queuing, leading to a better time and accuracy performance than the canonical implementation. We compared interpolating polynomials of orders 1 to 3 and discovered that the increased numerical complexity of the higher-order interpolations was not reflected in the run time, which is dominated by other factors. Furthermore, it was shown that the highest-order interpolation performed stably. This suggests that the highest-order interpolation should be used, as the greater accuracy is obtained at negligible cost. We have investigated the nature of the trade-off between accuracy and simulation time/memory and demonstrated that for a large range of input spike rates, it is possible to find a combination of continuous-time implementation and computation time step that fulfills a given maximum error requirement both more accurately and faster than a grid-constrained simulation. This measure of efficiency is based on truly large-scale networks (12,800 neurons, 15.6 million synapses). The techniques described here have several possible extensions. First, the canonical implementation places no constraints on the neuron model used beyond the physically plausible requirement that the membrane potential is thrice continuously differentiable. It may therefore be used to implement essentially any kind of neuronal dynamics, including neurons with conductance-based synapses. 
The prescient implementation further requires that the neuron’s dynamics is linear; it may thus be used for a wide range of model neurons with current-based synapses. The neuron model we implemented does not have invertible dynamics, and so the determination of its spike time necessarily involves approximation. For some neuron models, it is possible to determine the precise spike time without recourse to approximation, such as the Lapicque model (Tuckwell, 1988) and its descendants (Mirollo & Strogatz, 1990). Such models can, of course, also be implemented in this framework, but they would have only the same precision as a classical event-based simulation if they were canonically implemented (obviously without interpolation); a prescient implementation would be able to represent the subthreshold dynamics exactly on the grid but would entail the use of approximate methods to determine the spike times. Although we investigated polynomial interpolation, other methods of spike time determination such as Newton’s method can be implemented with no change to the conceptual framework. Second, most constraints imposed in terms of the computational time step h may be relaxed. As indicated in section 4.1.3, the restriction of


delays to nonzero integer multiples of h can be relaxed to any floating-point number ≥ h. When a neuron spikes, the offset of the spike could then be combined on the fly with the delay to create an integer component and a revised offset, thus allowing the spike to be delivered correctly by the global discrete-time algorithm, and processed correctly by the continuous-time neuron model. This relaxation would come at the memory cost of having to store delays as floating-point numbers rather than integers and the computational cost of having to perform the on-the-fly delay decomposition. Furthermore, h is currently a parameter of the entire network, so it is the same for all neurons in a given simulation. Given that the minimum propagation delay already defines natural synchronization points to preserve causality in the simulation, it would be possible to allow h to be chosen individually for each neuron in the network, or even to use variable time steps, while still maintaining consistency with the global discrete-time algorithm. Finally, the techniques are compatible with the distributed computing techniques described in Morrison, Mehring, et al. (2005), requiring only that the spike-time offsets are communicated in addition to the indices of the spiking neurons. This increases the bulk but not the frequency of communication, as it is still sufficient to communicate in intervals determined by the minimum propagation delay. A similar minimum delay principle is used by Lytton and Hines (2005), again suggesting a convergence of time-driven and event-driven approaches. When investigating a particular system, it is worthwhile considering what accuracy is necessary. For the networks described in section 5.2, it would be pointless to simulate with a very small time step, as they are chaotic. As long as the relevant macroscopic measures are preserved, any time step is as good or as bad as any other. 
However, a good rule of thumb is that it should be possible to discriminate spike times an order of magnitude more accurately than the characteristic timescales of the macroscopic phenomena to be observed, such as the temporal structure of cross-correlations. Note that even if the characteristics of the system to be investigated suggest that a grid-constrained implementation is optimal, the availability of equivalent continuous-time implementations is still advantageous: should the suspicion arise that an observed phenomenon is an artifact of the grid constraints, they can be used to test this without altering any other part of the simulation. For much the same reason, it is very useful to be able to modify the time step without having to adjust the rest of the simulation. The network studied in section 5.4 illustrates two more important points. First, it demonstrates that exceedingly small single-neuron integration errors may be required to accurately capture network synchronization. Second, it is clear from Figure 7C that continuous-time implementations are better at rendering macroscopic measures such as synchrony correctly: in conditions where the grid-constrained and prescient implementations


achieve the same single-neuron spike-timing error, the prescient implementation yields a significantly smaller error in network synchrony. Accuracy cannot be improved on indefinitely: Figures 3 and 4 show that the errors in both membrane potential and spike timing saturate at about $10^{-14}$ mV and ms, respectively, for a time step of around $10^{-3}$ ms. The saturation accuracy is close to the maximal precision of the standard floating-point representation of the computer, and so $10^{-3}$ ms represents a lower bound on the range of useful h. An upper bound is determined by the physical properties of the system. First, h may not be larger than the minimum propagation delay in the network or the refractory period of the neurons. Second, using a large h increases the danger that a spike is missed. This can occur if the true trajectory of the membrane potential passes through the threshold within a step but is subthreshold again by the end of the step. This is less of an issue for the canonical implementation, as the check for a suprathreshold membrane potential is not just performed at the end of every step but also at the arrival of every incoming event (see section 4.1.1). There is a common perception that event-driven algorithms are exact and time-driven algorithms are approximate. However, both parts of this perception are generally false. With respect to the first part, event-driven algorithms are not by the nature of the algorithm more exact than time-driven algorithms. It depends on the dynamics of the neuron model whether an event-driven algorithm can find an exact solution, just as it does for time-driven algorithms. For a restricted class of models, the spike times can be calculated exactly through inversion of the dynamics. For other models, approximate methods to determine the spike times need to be employed. With respect to the second part, time-driven algorithms are not necessarily approximate. 
A discrete-time algorithm does not imply that spike times have to be constrained onto the grid, as shown by Hansel et al. (1998) and Shelley and Tao (2001). Moreover, the subthreshold dynamics for a large class of neuron models can be integrated exactly (Rotter & Diesmann, 1999). Here we combine these insights to show that the degree of approximation in a simulation is not determined by whether an event-driven or a time-driven algorithm is used but by the dynamics of the neuron model. A further question is whether the terms time-driven and event-driven should even be used in this mutually exclusive way. In our algorithm, neuron implementations treating incoming and outgoing spikes in continuous time are seamlessly integrated into a global discrete-time algorithm. Should this therefore be considered a time-driven or an event-driven algorithm? We believe that this combination of techniques represents a hybrid algorithm that is globally time driven but locally event driven. Similarly, when designing a distributed simulation algorithm (Morrison, Mehring, et al., 2005), it was shown that a time-driven neuron updating algorithm can be successfully combined with event-driven synapse updating, again suggesting that no dogmatic distinction between the two approaches need


be made. However, although we were able to demonstrate the potential advantages of a hybrid algorithm, these findings do not in principle rule out the existence of pure event-driven or time-driven algorithms with identical universality and better performance for a given set of parameters than the schemes presented here. In closing, we express our hope that this work can help defuse the at times heated debate between advocates of event-driven and time-driven algorithms for the simulation of neuronal networks.

Appendix: Numerical Techniques

In this appendix, we present the numerical techniques employed to achieve the reported accuracy. A.1 Accuracy of Floating-Point Representation of Physical Quantities. The double representation of floating-point numbers (Press, Teukolsky, Vetterling, & Flannery, 1992) used by standard computer hardware limits the accuracy with which physical quantities can be stored. The machine precision ε is the smallest number for which the double representation of 1 + ε is different from the representation of 1. Consequently, the absolute error $\sigma_x$ of a quantity x is limited by the magnitude of x, $\sigma_x \approx 2^{\lfloor \log_2 x \rfloor} \cdot \varepsilon$. In double representation, we have $\varepsilon = 2^{-52} \approx 2.22 \cdot 10^{-16}$. Membrane potential values y are on the order of 20 mV; the lower limit of the integration error in the membrane potential is therefore on the order of $5 \cdot 10^{-15}$ mV. According to the rules of error propagation, the error in the time of threshold crossing σ depends on the error in membrane potential $\sigma_y$ as $\sigma = |1/\dot{y}| \, \sigma_y$. Typical values of the derivative $|\dot{y}|$ of the membrane potential are on the order of 1 mV per ms (single-neuron simulations, data not shown), from which we obtain $5 \cdot 10^{-15}$ ms as a lower bound for the error in spike timing. Therefore, the observed integration errors at which the simulations saturate are close to the limits imposed by the double representation for both physical quantities. A.2 Representation of Spike Times. 
We have seen in section A.1 that the absolute error σx depends on the magnitude of the quantity x. As a consequence, the error of spike times recorded in double representation increases with simulation time. An additional error is introduced if the computation time step h cannot be represented as a double (e.g., 0.1 ms).
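The two effects described in sections A.1 and A.2 can be checked directly with a short sketch. It uses only the Python standard library (`math.ulp` requires Python 3.9 or later). The helper name `abs_resolution` and the example values are illustrative assumptions, not part of the original implementation.

```python
import math
import sys
from fractions import Fraction

# Machine precision of the double representation (2**-52).
eps = sys.float_info.epsilon

def abs_resolution(x):
    """Spacing between the double x and the next larger double."""
    return math.ulp(x)

# The absolute resolution degrades as the stored value grows, for example
# a spike time of 1 ms compared with one of 10 s (= 1e4 ms).
res_early = abs_resolution(1.0)
res_late = abs_resolution(1.0e4)

# Fraction(float) converts the stored binary value exactly, so comparing it
# with the intended rational number reveals any representation error.
h_decimal_exact = Fraction(0.1) == Fraction(1, 10)      # 0.1 ms step
h_binary_exact = Fraction(2.0 ** -3) == Fraction(1, 8)  # 2**-3 ms step
```

Here `res_late` is several thousand times larger than `res_early`, and only the power-of-2 step is stored without error, which motivates recording spike times as a step count plus an offset.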


Therefore, we record spike times as a pair of two values {t + h, δ}. The first one is an integral number in units of h represented as a long int specifying the computation step in which the spike was emitted. The second one is the offset of the spike time in units of ms represented as a double. If h is a power of 2 in units of ms, both values can be represented as doubles without loss of accuracy. A.3 Evaluating the Update Equation. The implementation of the update equation 3.1 in the target programming language requires attention to numerical detail if utmost precision is desired. We were able to reduce membrane potential errors in a nonspiking simulation from some $10^{-12}$ mV to about $10^{-15}$ mV by careful rearrangement of terms. Although details may depend significantly on processor architecture and compiler optimization strategies, we will briefly recount the implementation we found optimal. The matrix-vector multiplication in equation 3.1 describes updates of the form

$$ y \leftarrow (1 - e^{-h/\tau})\, a + e^{-h/\tau}\, y, $$

where a and y are of order unity, while $h \ll \tau$ so that $e^{-h/\tau} \approx 1$. For a time step of $h = 2^{-12}$ ms and a time constant of τ = 10 ms, one has $\gamma = 1 - e^{-h/\tau} \approx 10^{-5}$. The quantity γ can be computed accurately for small values of h/τ using the function expm1(x) provided by current numeric libraries (C standard library; see also Galassi et al., 2001). Using double resolution, γ will have some 15 significant digits, spanning roughly from $10^{-5}$ to $10^{-20}$, and all of these digits are nontrivial. The exponential $e^{-h/\tau}$ may be computed to 15 significant digits using exp(x); the first five of these will be trivial nines, though, leaving just 10 nontrivial digits, which furthermore span down to only $10^{-15}$. We thus rewrite the equation above entirely in terms of γ,

$$ y \leftarrow \gamma a + (1 - \gamma) y, $$

and finally as

$$ y \leftarrow \gamma (a - y) + y. $$

In our experience, this final form yields the most accurate results, as the full precision of γ is retained as long as possible. Note that computing the 1 − γ term above discards the five least significant digits of γ. When several terms need to be added, they should be organized according to their expected magnitude starting with the smallest components. A.4 Polynomial Interpolation of the Membrane Potential. In order to approximate the time of threshold crossing, the membrane potential $y_t^3$


known at grid points t with spacing h can be interpolated with polynomials of different order. For the purpose of this section, we drop the index specifying the membrane potential as the third component of the state vector. Without loss of generality, we assume that the threshold crossing occurs at time $\delta^*$ in the interval (0, h]. The corresponding values of the membrane potential are denoted $y_0 < \theta$ and $y_h \geq \theta$, respectively, where θ is the threshold. The threshold crossings are found using the explicit formulas for the roots of polynomials of order n = 1, 2, and 3 (Weisstein, 1999). In order to constrain the polynomials, we exploit the fact that the derivative of the membrane potential can be easily obtained from the state vector at both sides of the interval. For the grid-constrained (n = 0) simulation and linear (n = 1) interpolation, we demonstrate why the error in spike timing decreases with $h^{n+1}$. A.4.1 Grid-Constrained Simulation. In the variables defined above, the approximate time of threshold crossing δ equals the computation time step h; the spike is reported to occur at the right border of the interval (0, h]. Assuming the membrane potential y(t) to be exact, the error in membrane potential with respect to the value at the exact point of threshold crossing is $\epsilon = y_h - y(\delta^*)$. Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, $y(t) = y_0 + \dot{y}_0 t + O(t^2)$. Considering terms up to first order, we obtain

$$ \epsilon = \{ y_0 + \dot{y}_0 h \} - \{ y_0 + \dot{y}_0 \delta^* \} = \dot{y}_0 (h - \delta^*). $$

Hence, $\epsilon$ reaches its maximum amplitude at $\delta^* = 0$, and we can write $|\epsilon| \leq |\dot{y}_0| h$. The error in spike timing is

$$ \sigma = h - \delta^* = \frac{1}{\dot{y}_0} \epsilon $$

and $|\sigma| \leq h$.


A.4.2 Linear Interpolation. A polynomial of order 1 is naturally constrained by the values of the membrane potential ($y_0$ and $y_h$) at both ends of the interval. With $y_t = a t + b$, the set of equations specifying the coefficients of the polynomial (a and b) is

$$ \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ h & 1 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \end{pmatrix} = \begin{pmatrix} -h^{-1} & h^{-1} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \end{pmatrix} = \begin{pmatrix} (y_h - y_0) h^{-1} \\ y_0 \end{pmatrix}. $$

Thus, in normalized form, we need to solve

$$ 0 = \frac{y_h - y_0}{h} \delta + (y_0 - \theta) \tag{A.1} $$

for δ to find the approximate time of threshold crossing. At the exact point of threshold crossing $\delta^*$, the error in membrane potential is

$$ \epsilon = \frac{y_h - y_0}{h} \delta^* + y_0 - y(\delta^*). \tag{A.2} $$

Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, $y(t) = y_0 + \dot{y}_0 t + \frac{1}{2} \ddot{y}_0 t^2 + O(t^3)$. Considering terms up to second order, we obtain

$$ \epsilon = \left\{ \dot{y}_0 \delta^* + \frac{1}{2} \ddot{y}_0 h \delta^* + y_0 \right\} - \left\{ y_0 + \dot{y}_0 \delta^* + \frac{1}{2} \ddot{y}_0 {\delta^*}^2 \right\} = \frac{1}{2} \ddot{y}_0 (\delta^* h - {\delta^*}^2). $$

The time of threshold crossing is bounded by the interval (0, h]. Hence, $\epsilon$ reaches its maximum amplitude at $\delta^* = \frac{1}{2} h$, and we can write

$$ |\epsilon| \leq \frac{1}{8} |\ddot{y}_0| h^2. $$

Noting that $y(\delta^*) = \theta$ in equation A.2, we have

$$ \epsilon = \frac{y_h - y_0}{h} \delta^* + (y_0 - \theta), \tag{A.3} $$

and subtracting equation A.1 from A.3, we obtain

$$ \epsilon = \frac{y_h - y_0}{h} (\delta^* - \delta). $$

Thus, the error in spike timing is

$$ \sigma = \delta^* - \delta = \frac{h}{y_h - y_0} \epsilon. $$

With the help of the expansion

$$ \frac{h}{y_h - y_0} = \frac{1}{\dot{y}_0} - \frac{1}{2} \frac{\ddot{y}_0}{\dot{y}_0^2} h, $$

we arrive at

$$ |\sigma| \leq \frac{1}{8} \left| \frac{\ddot{y}_0}{\dot{y}_0} \right| h^2. $$
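The quadratic scaling of the spike-timing error for linear interpolation can be illustrated numerically. The following sketch solves equation A.1 for δ and compares the result with the exact crossing time on a hypothetical quadratic test trajectory. The trajectory and its parameters `v` and `a` (slope and curvature at the left border) as well as the function names are assumptions chosen for illustration, not the neuron model of section 2.

```python
import numpy as np

def linear_crossing(y0, yh, theta, h):
    """Solve equation A.1 for the approximate crossing time delta."""
    return h * (theta - y0) / (yh - y0)

def max_timing_error(h, y0=0.0, v=1.0, a=0.5, n=1001):
    """Largest |delta* - delta| over crossings placed throughout (0, h]."""
    def y(t):
        # Hypothetical smooth test trajectory with y'(0) = v and y''(0) = a.
        return y0 + v * t + 0.5 * a * t ** 2
    delta_star = np.linspace(h / n, h, n)  # exact crossing times
    theta = y(delta_star)                  # thresholds crossed exactly at delta_star
    delta = linear_crossing(y0, y(h), theta, h)
    return np.max(np.abs(delta_star - delta))

steps = [2.0 ** -k for k in range(2, 8)]
errors = [max_timing_error(h) for h in steps]
# Bound (1/8) |y''_0 / y'_0| h^2 with the default v = 1.0 and a = 0.5.
bounds = [0.125 * (0.5 / 1.0) * h ** 2 for h in steps]
# Ratio of errors between successive halvings of h.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
```

For this trajectory every entry of `errors` stays below the corresponding entry of `bounds`, and the entries of `ratios` approach 4 as h decreases, as expected for an error proportional to $h^2$.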

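A.4.3 Quadratic Interpolation. Using a polynomial of order 2, we can add an additional constraint to the interpolating function. We decide for the derivative of the membrane potential at the left border of the interval, $\dot{y}_0$. With $y_t = a t^2 + b t + c$, we have

$$ \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ h^2 & h & 1 \\ 0 & 1 & 0 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \end{pmatrix} = \begin{pmatrix} -h^{-2} & h^{-2} & -h^{-1} \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \end{pmatrix} = \begin{pmatrix} (y_h - y_0) h^{-2} - \dot{y}_0 h^{-1} \\ \dot{y}_0 \\ y_0 \end{pmatrix}. $$

Thus, in normalized form, we need to solve

$$ 0 = \delta^2 + \frac{\dot{y}_0 h^2}{y_h - y_0 - \dot{y}_0 h} \delta + \frac{(y_0 - \theta) h^2}{y_h - y_0 - \dot{y}_0 h}. $$

The solution can be obtained by the quadratic formula. The GSL (Galassi et al., 2001) implements an appropriate solver. Generally there are two real solutions: the desired one inside the interval (0, h] and one outside. A.4.4 Cubic Interpolation. A polynomial of order 3 enables us to constrain the interpolation further by the derivative of the membrane potential at the right border of the interval, $\dot{y}_h$. With $y_t = a t^3 + b t^2 + c t + d$, we have

$$ \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 1 \\ h^3 & h^2 & h & 1 \\ 0 & 0 & 1 & 0 \\ 3h^2 & 2h & 1 & 0 \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \\ \dot{y}_h \end{pmatrix} = \begin{pmatrix} 2h^{-3} & -2h^{-3} & h^{-2} & h^{-2} \\ -3h^{-2} & 3h^{-2} & -2h^{-1} & -h^{-1} \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} y_0 \\ y_h \\ \dot{y}_0 \\ \dot{y}_h \end{pmatrix} = \begin{pmatrix} 2(y_0 - y_h) h^{-3} + (\dot{y}_0 + \dot{y}_h) h^{-2} \\ 3(y_h - y_0) h^{-2} - (2\dot{y}_0 + \dot{y}_h) h^{-1} \\ \dot{y}_0 \\ y_0 \end{pmatrix}. $$

Thus, in normalized form, we need to solve

$$ 0 = \delta^3 + \frac{3(y_h - y_0) h - (2\dot{y}_0 + \dot{y}_h) h^2}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h} \delta^2 + \frac{\dot{y}_0 h^3}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h} \delta + \frac{(y_0 - \theta) h^3}{2(y_0 - y_h) + (\dot{y}_0 + \dot{y}_h) h}. $$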
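A minimal sketch of the cubic case is given below. It builds the coefficients a, b, c, d from the border values and derivatives as derived above and returns the left-most root in (0, h]. As a substitute for an explicit cubic formula or the GSL solver, it uses `numpy.roots`, which finds the roots numerically from the eigenvalues of the companion matrix. The function name `cubic_crossing`, the tolerance on the imaginary part, and the test trajectory are illustrative assumptions.

```python
import numpy as np

def cubic_crossing(y0, yh, dy0, dyh, theta, h):
    """Left-most root in (0, h] of the cubic fitted to y and dy/dt at both borders."""
    a = 2.0 * (y0 - yh) / h ** 3 + (dy0 + dyh) / h ** 2
    b = 3.0 * (yh - y0) / h ** 2 - (2.0 * dy0 + dyh) / h
    c = dy0
    d = y0 - theta  # shift so that the threshold crossing is a root
    roots = np.roots([a, b, c, d])
    # Keep roots that are real up to rounding (tolerance is an assumption).
    real = roots[np.abs(roots.imag) < 1e-9].real
    inside = real[(real > 0.0) & (real <= h)]
    return inside.min() if inside.size else None

# Hypothetical test trajectory y(t) = 0.2 t^3 + 0.5 t^2 + t on (0, 0.5],
# with the threshold chosen so that the exact crossing is at t = 0.3.
def y(t):
    return 0.2 * t ** 3 + 0.5 * t ** 2 + t

def dy(t):
    return 0.6 * t ** 2 + t + 1.0

h = 0.5
delta = cubic_crossing(y(0.0), y(h), dy(0.0), dy(h), y(0.3), h)
```

Because the test trajectory is itself a cubic, the interpolating polynomial reproduces it and `delta` agrees with the exact crossing time 0.3 up to rounding. For a general membrane potential the result is only an approximation.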

The solution can be found by the cubic formula. There is at least one real solution in the interval (0, h]. It is convenient to choose a substitution that avoids the intermediate occurrence of complex quantities (e.g., Weisstein, 1999). The GSL (Galassi et al., 2001) implements an appropriate solver. If the interval contains more than one solution, the time of threshold crossing is defined by the left-most root.

Acknowledgments

The new address of A. M. and M. D. is Computational Neuroscience Group, RIKEN Brain Science Institute, Wako, Japan. The new address of S.S. is Human-Neurobiologie, University of Bremen, Germany. We acknowledge Johan Hake for the interesting discussions that started the project.


As always, Stefan Rotter and the members of the NEST collaboration (in particular Marc-Oliver Gewaltig) were very helpful. We thank Jochen Eppler for assisting with the C++ coding. We acknowledge the two anonymous referees whose stimulating questions helped us to improve the letter. This work was partially funded by DAAD 313-PPP-N4-lk, DIP F1.2, BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg, and EU Grant 15879 (FACETS). As of the date this letter is published, the NEST initiative (www.nest-initiative.org) makes available the different implementations of the neuron model used as an example in this article, in source code form under its public license. All simulations were carried out using the parallel computing facilities of the Norwegian University of Life Sciences at Ås.

References

Brette, R. (2006). Exact simulation of integrate-and-fire models with synaptic conductances. Neural Comput., 18, 2004–2027.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis 2001 (pp. 43–70). Göttingen: Gesellschaft für wissenschaftliche Datenverarbeitung.
Diesmann, M., Gewaltig, M.-O., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565–571.
Ferscha, A. (1996). Parallel and distributed simulation of discrete event systems. In A. Y. Zomaya (Ed.), Parallel and distributed computing handbook (pp. 1003–1041). New York: McGraw-Hill.
Fujimoto, R. M. (2000). Parallel and distributed simulation systems. New York: Wiley.
Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., & Rossi, F. (2001). Gnu scientific library: Reference manual. Bristol: Network Theory Limited.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hammarlund, P., & Ekeberg, O. (1998). Large neural network simulations on multiple hardware platforms. J. Comput. Neurosci., 5(4), 443–459.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483.
Harris, J., Baurick, J., Frye, J., King, J., Ballew, M., Goodman, P., & Drewes, R. (2003). A novel parallel hardware and software solution for a large-scale biologically realistic cortical simulation (Tech. Rep.). Las Vegas: University of Nevada.
Heck, A. (2003). Introduction to Maple (3rd ed.). Berlin: Springer-Verlag.
Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17, 903–921.
Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Comput. and Applic., 11, 210–223.


Marian, I., Reilly, R. G., & Mackey, D. (2002). Efficient event-driven simulation of spiking neural networks. In Proceedings of the 3. WSES International Conference on Neural Networks and Applications. Interlaken, Switzerland.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305–2329.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662.
Morrison, A., Hake, J., Straube, S., Plesser, H. E., & Diesmann, M. (2005). Precise spike timing with exact subthreshold integration in discrete time network simulations. Proceedings of the 30th Göttingen Neurobiology Conference. Neuroforum, 1(Suppl.), 205B.
Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830.
Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In ESANN’2003 Proceedings—European Symposium on Artificial Neural Networks (pp. 295–300). Bruges, Belgium: d-side Publications.
Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol. Cybern., 81(5/6), 381–402.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119.
Sloot, A., Kaandorp, J. A., Hoekstra, G., & Overeinder, B. J. (1999). Distributed simulation with cellular automata: Architecture and applications. In J. Pavelka, G. Tel, & M. Bartosek (Eds.), SOFSEM’99, LNCS (pp. 203–248). Berlin: Springer-Verlag.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Weisstein, E. W. (1999). CRC concise encyclopedia of mathematics. Boca Raton, FL: CRC Press.
Wolfram, S. (2003). The mathematica book (5th ed.). Champaign, IL: Wolfram Media Incorporated.
Zeigler, B. P., Praehofer, H., & Kim, T. G. (2000). Theory of modeling and simulation: Integrating discrete event and continuous complex dynamic systems (2nd ed.). Amsterdam: Academic Press.

Received August 12, 2005; accepted May 25, 2006.

LETTER

Communicated by Daniel Amit

The Road to Chaos by Time-Asymmetric Hebbian Learning in Recurrent Neural Networks

Colin Molter [email protected] Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan, and Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

Utku Salihoglu [email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

Hugues Bersini [email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium

This letter studies the impact of iterative Hebbian learning algorithms on the underlying dynamics of recurrent neural networks. First, an iterative supervised learning algorithm is discussed. An essential improvement of this algorithm consists of indexing the attractor information items by means of external stimuli rather than by using only initial conditions, as Hopfield originally proposed. Modifying the stimuli mainly results in a change of the entire internal dynamics, leading to an enlargement of the set of attractors and potential memory bags. The impact of the learning on the network’s dynamics is the following: the more information to be stored as limit cycle attractors of the neural network, the more chaos prevails as the background dynamical regime of the network. In fact, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. Next, we introduce a new form of supervised learning that is more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is no longer specified a priori. Based on its spontaneous dynamics, the network decides “on its own” the dynamical patterns to be associated with the stimuli. Compared with classical supervised learning, huge enhancements in storing capacity and computational cost have been observed. Moreover, this new form of supervised learning, by being more “respectful” of the network’s intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe the traces of the learned attractors in the chaotic regime. This complex but still very informative regime is referred to as “frustrated chaos.”

Neural Computation 19, 80–110 (2007)
© 2006 Massachusetts Institute of Technology

1 Introduction

Synaptic plasticity is now widely accepted as a basic mechanism underlying learning and memory. There is experimental evidence that neuronal activity can affect synaptic strength through both long-term potentiation and long-term depression (Bliss & Lomo, 1973). Inspired by or forecasting this biological fact, a large number of learning “rules,” specifying how activity and training experience change synaptic efficacies, have been proposed (Hebb, 1949; Sejnowski, 1977). Such learning rules have been essential for the construction of most models of associative memory (among others, Amari, 1977; Hopfield, 1982; Amari & Maginu, 1988; Amit, 1995; Brunel, Carusi, & Fusi, 1997; Fusi, 2002; Amit & Mongillo, 2003). In such models, the neural network maps the structure of information contained in the external or internal environment into embedded attractors. Since the precursor works of Amari, Grossberg, and Hopfield (Amari, 1972; Hopfield, 1982; Grossberg, 1992), the privileged regime to code information has been fixed-point attractors. Many theoretical and experimental works have shown and discussed the limited storing capacity of these attractor networks (Amit, Gutfreund, & Sompolinsky, 1987; Gardner, 1987; Gardner & Derrida, 1989; Amit & Fusi, 1994; and Domany, van Hemmen, & Schulten, 1995, for a review). However, many neurophysiological reports (Nicolis & Tsuda, 1985; Skarda & Freeman, 1987; Babloyantz & Lourenço, 1994; Rodriguez et al., 1999; and Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003) tend to indicate that brain dynamics is much more complex than fixed points and is more faithfully characterized by cyclic and weak chaotic regimes.
In line with these results, in this article, we propose to map stimuli to spatiotemporal limit cycle attractors of the network’s dynamics. A learned stimulus is no longer expected to stabilize the network into a steady state (which could in some cases correspond to a minimum of a Lyapunov function). Instead, the stimulus is expected to drive the network into a specific spatiotemporal cyclic trajectory. This cyclic trajectory is still considered an attractor since content addressability is expected: before presentation of the stimulus, the network could follow another trajectory, and the stimulus could be corrupted with noise. By relying on spatiotemporal cyclic attractors, the famous theoretical results on the limited capacity of Hopfield networks no longer apply. In fact, the extension of encoding attractors to cycles potentially boosts this storing capacity. Suppose a network is composed of two neurons that can have only two values: −1 and +1. Without paying attention to noise and generalization, only four fixed-point attractors can be exploited, whereas by adding cycles, this number increases. For instance, the cycles of length two are (+1, +1)(+1, −1); (+1, +1)(−1, +1); (+1, +1)(−1, −1); (+1, −1)(−1, +1); (+1, −1)(−1, −1); and (−1, +1)(−1, −1). In a given network, with a fixed topology and parameterization, the number of cyclic attractors is obviously lower than the number of fixed points (a cycle iterates through a succession of unstable equilibria). An indexing of the memorized attractors restricted to the initial conditions, as classically done in Hopfield networks, would not allow a full exploitation of all of these potential cyclic attractors. This is the reason that in the experimental framework presented here, the indexing is done instead by means of an added external input layer that continuously feeds the network with different external stimuli. Each external stimulus modifies the parameterization of the network and thus the possible dynamical regimes and the set of potential attractors. The goal of this letter is not to calculate the “maximum storage capacity” of these “cyclic attractors networks,”¹ either theoretically² or experimentally. Rather, we intend to discuss the potential dynamical regimes (oscillations and chaos) that allow this new form of information storage. The experimental results we present show how the theoretical limitation of fixed-point attractor networks can easily be overcome by adding the input layer and exploiting the cyclic attractors. By studying small fully connected networks, we have shown previously (Molter & Bersini, 2003) how a randomly generated synaptic matrix allows the exploitation of a huge number of cyclic attractors for storing information. In this letter, in line with our previous results (Molter, Salihoglu, & Bersini, 2005a, 2005b), a time-asymmetric Hebbian rule is proposed to encode the information. This rule is related to experimental observations showing an asymmetrical time window of synaptic plasticity in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).
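The two-neuron counting argument given earlier in this introduction can be checked by direct enumeration. The sketch below is only an illustration of that argument (the variable names are ours, not the authors'): it lists the four binary states and every unordered pair of distinct states, each pair being one candidate cycle of length two.

```python
from itertools import combinations, product

# All states of a network of two binary (+1/-1) neurons.
states = list(product((+1, -1), repeat=2))

# A cycle of length two alternates between two distinct states, so each
# unordered pair of distinct states gives one candidate cycle.
two_cycles = list(combinations(states, 2))

print(len(states), "candidate fixed points")
print(len(two_cycles), "candidate cycles of length two")
for first, second in two_cycles:
    print(first, second)
```

This counts candidate attractors only; whether a given network can actually realize them depends on its weights and on the stimulus.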
¹ To paraphrase Gardner’s article “Maximum Storage Capacity in Neural Networks” (1987).
² The fact that we are not working at the thermodynamic limit, as in population models, would render such analysis very difficult.

Information stored in the network consists of a set of data pairs. Each datum is composed of an external stimulus and a series of patterns through which the network is expected to iterate when the stimulus feeds the network (this series corresponds to the limit cycle attractor). Traditionally, the information to be stored is either installed by means of a supervised mechanism (e.g., Hopfield, 1982) or discovered on the spot by an unsupervised version, revealing some statistical regularities in the data presented to the net (e.g., Amit & Brunel, 1994). In this letter, two different forms of learning are studied and compared. In the first, the information to be learned (the external stimulus and the limit cycle attractor) is prespecified and installed as a result of a classical supervised algorithm. However, such supervised learning has always raised serious problems, both at a biological level, due to its top-down nature, and at a cognitive level. Who would take responsibility to look inside the brain of the learner, that is, to decide the information to associate with the external stimulus and exert supervision during the learning task? To answer the last question, in the second form of learning proposed here, the semantics of the attractors to be associated with the feeding stimulus is left unprescribed: the network is taught to associate external stimuli with original attractors, not specified a priori. This perspective remains in line with the very old philosophical conviction of constructivism, which has been modernized in neural network terms by several authors (among others, Varela, Thompson, & Rosch, 1991; Erdi, 1996; Tsuda, 2001). One operational form has achieved great popularity as a neural net implementation of statistical clustering algorithms (Kohonen, 1982; Grossberg, 1992). To differentiate between the two learning procedures, the first one is called out-supervised, since the information is fully specified from outside. In contrast, the second one is called in-supervised, since the learning maps stimuli to cyclic attractors “derived” from the ones spontaneously proposed by the network.³ Quite naturally, we show that this in-supervised learning leads to increased storing capacity. The aim of this letter is not only to show how stimuli could be mapped to limit cycle attractors of the network’s dynamics. It also focuses on the network’s background dynamics: the dynamics observed when unlearned stimuli feed the network.
More precisely, in line with theoretical investigations of the onset and the nature of chaotic dynamics in deterministic dynamical systems (Eckmann & Ruelle, 1985; Sompolinsky, Crisanti, & Sommers, 1988; van Vreeswijk & Sompolinsky, 1996; Hansel & Sompolinsky, 1996), this letter computes and analyzes the background presence of chaotic dynamics. The presence of chaos in recurrent neural networks (RNNs) and the benefit gained by its presence is still an open question. However, since the seminal paper by Skarda and Freeman (1987) dedicated to chaos in the rabbit brain, many authors have shared the idea that chaos is the ideal regime to store and efficiently retrieve information in neural networks (Freeman, 2002; Guillot & Dauce, 2002; Pasemann, 2002; Kaneko & Tsuda, 2003). Chaos, although very simply produced, inherently possesses an infinite number of cyclic regimes that can be exploited for coding information. Moreover, it randomly wanders around these unstable regimes in a spontaneous way, thus rapidly proposing alternative responses to external stimuli and able to switch easily from one of these potential attractors to another in response to any incoming stimulus. This article maintains this line of thinking by forcing the coding of information in robust cyclic attractors and experimentally showing that the more information is to be stored, the more chaos appears as a background regime, erratically itinerating, that is, moving from place to place, among brief appearances of these attractors. Chaos appears to be the consequence of the learning, not the cause. However, it appears as a helpful consequence that widens the net’s encoding capacity and diminishes the existence of spurious attractors. By comparing the nature of the chaotic dynamics obtained from the two learning procedures, we show that more structure in the background chaotic regimes is obtained when the network “chooses” by itself the limit cycle attractors to encode the stimuli (the in-supervised learning). The nature of these chaotic regimes can be related to the classic intermittent type of chaos (Pomeau & Manneville, 1980) and to its extension to biological networks, originally called frustrated chaos (Bersini, 1998), in reminiscence of the frustration phenomenon occurring in both recurrent neural nets and spin glasses. This chaos is related to chaotic itinerancy (Kaneko, 1992; Tsuda, 1992), which has been suggested to be of biological significance (Tsuda, 2001).

³ This algorithm is still supervised in the sense that the mapped cyclic attractors are only “derived” from the spontaneous network dynamics. External supervision is still partly needed, which raises questions about biological plausibility. However, even if the proposed in-supervised algorithm is improved or if it ends as part of a more biologically plausible learning procedure, we think that the dynamical results obtained here will remain qualitatively valid.

The plan of the letter is as follows. Section 2 describes the model as well as the learning task. Section 3 describes the out-supervised learning procedure, where stimuli are encoded in predefined limit cycle attractors.
Section 4 describes the in-supervised learning procedure, where stimuli are encoded in the limit cycle attractors derived from those spontaneously proposed by the network. Section 5 computes and compares the networks’ encoding capacity, as well as the content addressability of the encoded information. Section 6 compares and discusses the proportion and the nature of the observed chaotic dynamics when the network is presented with unlearned stimuli.

2 Model and Learning Task Descriptions

This section describes the model used in our simulations as well as the learning task.

2.1 The Model. The network is fully connected. Each neuron’s activation is a function of the other neurons’ impact and of an external stimulus. The neurons’ activation is continuous and is updated synchronously in discrete time steps. The mathematical description of such a network is a classic one. The activation value of a neuron $x_i$ at a discrete time step $n + 1$ is

$$x_i(n+1) = f\big(g\,\mathrm{net}_i(n)\big), \qquad \mathrm{net}_i(n) = \sum_{j=1}^{N} w_{ij}\,x_j(n) + \sum_{s=1}^{M} w_{is}\,\iota_s, \tag{2.1}$$

where $N$ is the number of neurons, $M$ is the number of units composing the stimulus, $g$ is the slope parameter, $w_{ij}$ is the weight between the neurons $j$ and $i$, $w_{is}$ is the weight between the external stimulus’s unit $s$ and the neuron $i$, and $\iota_s$ is the unit $s$ of the external stimulus. The saturating activation function $f$ is taken continuous (here tanh) to ease the study of the networks’ dynamical properties. The network’s size, $N$, and the stimulus’s size, $M$, have been set to 25 in this article for legibility. Of course, this size has an impact on both the encoding capacity and the background dynamics. This impact will not be discussed. Another impact not discussed here is that of the slope parameter $g$, set to 3 in the following.⁴ The main purpose of this article is to analyze how the learning of stimuli in spatiotemporal attractors affects the network’s background dynamics. When storing information in fixed-point attractors, the temporal update rule can indifferently be asynchronous or synchronous. This is no longer the case when storing information in cycle attractors, for which the updating must necessarily be synchronous: it is the global activity due to one pattern that generates the next one. To compare the network’s continuous internal states with bit patterns, a filter layer quantizing the internal states, based on the sign function, is added (Omlin, 2001). It defines the output vector $o$:

$$o_i = -1 \iff x_i < 0, \qquad o_i = 1 \iff x_i \ge 0, \tag{2.2}$$

where $x_i$ is the internal state of the neuron $i$ and $o_i$ is its associated output (i.e., its visible value). This filter layer makes it possible to perform symbolic investigations on the dynamical attractors. Figure 1 represents a period 2 sequence unfolding in a network of four neurons. The persistent external stimulus feeding the network appears in Figure 1A. Given that the internal state of neurons is continuous, the internal states (see Figure 1B) are filtered (see Figure 1C) to enable the comparison with the stored data.

⁴ The impact of this parameter has been discussed in Dauce, Quoy, Cessac, Doyon, & Samuelides (1998). They have demonstrated how the slope parameter can be used as a route to chaos.
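The update rule of equation 2.1 and the sign filter of equation 2.2 can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function names `step` and `output`, the toy sizes, and the uniform random weights are all assumptions made for the example; only the tanh activation, the slope g = 3, and the sign convention come from the text.

```python
import math
import random

def step(x, stim, w, w_s, g=3.0):
    """One synchronous update: x_i(n+1) = tanh(g * net_i(n)) (equation 2.1)."""
    new_x = []
    for i in range(len(x)):
        net = sum(w[i][j] * x[j] for j in range(len(x)))
        net += sum(w_s[i][s] * stim[s] for s in range(len(stim)))
        new_x.append(math.tanh(g * net))
    return new_x

def output(x):
    """Sign filter of equation 2.2: -1 if x_i < 0, +1 otherwise."""
    return [-1 if xi < 0 else 1 for xi in x]

# Hypothetical toy setup (sizes and weight range are illustrative only).
rng = random.Random(0)
N, M = 4, 4
w = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
w_s = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(N)]
stim = [rng.choice((-1, 1)) for _ in range(M)]
x = [rng.uniform(-1, 1) for _ in range(N)]

for n in range(5):
    x = step(x, stim, w, w_s)  # all neurons are updated from the same old state
    print(n + 1, output(x))
```

Because every new value is computed from the previous state vector before any neuron is overwritten, the update is synchronous, as cycle attractors require.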


Figure 1: (A) A fully recurrent neural network (N = 4) fed by a persistent external stimulus. (B) Three snapshots of the network’s internal state, taken at successive time steps. (C) After filtering, a cycle of period 2 can be seen.

2.2 The Learning Task. Two different forms of supervised learning are proposed in this article. However, both consist of storing a set of $q$ external stimuli in spatiotemporal cycles of the network’s internal dynamics. The data set is written as

$$D = \{D^1, \ldots, D^q\}, \tag{2.3}$$

where each datum $D^\mu$ is defined by a pair composed of a pattern $\chi^\mu$ corresponding to the external stimulus feeding the network and a sequence of patterns $\varsigma^{\mu,i}$, $i = 1, \ldots, l_\mu$, to store in a dynamical attractor:

$$D^\mu = \big(\chi^\mu, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \qquad \mu = 1, \ldots, q, \tag{2.4}$$

where $l_\mu$ is the period of the sequence $\mu$ and may vary from one datum to another. Each pattern is defined by assigning binary values to all of its units:

$$\chi^\mu = \{\chi_i^\mu,\ i = 1, \ldots, M\} \ \text{ with } \ \chi_i^\mu \in \{-1, 1\}, \qquad \varsigma^{\mu,k} = \{\varsigma_i^{\mu,k},\ i = 1, \ldots, N\} \ \text{ with } \ \varsigma_i^{\mu,k} \in \{-1, 1\}. \tag{2.5}$$
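One possible in-memory representation of the data set of equations 2.3 to 2.5 is sketched below. The class name `Datum`, the helper `random_datum`, and the choice of random periods between 2 and 4 are hypothetical and serve only to make the structure concrete; the sizes N = M = 25 follow section 2.1.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Datum:
    stimulus: List[int]     # chi^mu: M values in {-1, +1}
    cycle: List[List[int]]  # (varsigma^{mu,1}, ..., varsigma^{mu,l_mu}): l_mu patterns of N values

def random_datum(n, m, period, rng):
    """Draw one random datum with the given period (illustrative only)."""
    def pattern(size):
        return [rng.choice((-1, 1)) for _ in range(size)]
    return Datum(pattern(m), [pattern(n) for _ in range(period)])

rng = random.Random(1)
# A data set D of q = 3 data; the period l_mu may differ from one datum to another.
dataset = [random_datum(n=25, m=25, period=rng.randint(2, 4), rng=rng)
           for _ in range(3)]

for mu, datum in enumerate(dataset, start=1):
    print(mu, len(datum.stimulus), len(datum.cycle))
```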


3 The Out-Supervised Learning Algorithm

This first learning task is straightforward and consists of storing a well-defined data set. It means that each datum stored in the network is fully specified a priori: each external stimulus must be associated with a prespecified limit cycle attractor of the network’s dynamics. By suppressing the external stimulus and setting all the sequence periods $l_\mu$ to 1, this task is reduced to the classical learning task originally proposed by Hopfield: storing patterns in fixed-point attractors of the underlying RNN’s dynamics. The learning task described above thus generalizes the one proposed by Hopfield. For ease of reading, when patterns are stored in fixed-point attractors, they are denoted by $\xi^\mu$.

3.1 Introduction: Hopfield’s Autoassociative Model. In the basic Hopfield model (Hopfield, 1982), all connections need to be symmetric, no autoconnection can exist, and the update rule must be asynchronous. Hopfield has proven that these constraints are sufficient to define a Lyapunov function $H$ for the system:⁵

$$H = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\,x_i\,x_j. \tag{3.1}$$

Each state variation produced by the system’s equation entails a nonpositive variation of $H$: $\Delta H \le 0$. The existence of such a decreasing function ensures convergence to fixed-point attractors. Each local minimum of the Lyapunov function represents one fixed point of the dynamics. These local minima can be used to store patterns. This kind of network is akin to a content-addressable memory since any stored item will be retrieved when the network dynamics is initiated with a vector of activation values sufficiently overlapping the stored pattern.⁶ In such a case, the network dynamics is initiated in the desired item’s basin of attraction, spontaneously driving the network dynamics to converge to this specific item. The set of patterns can be stored in the network by using the following Hebbian learning rule, which obviously respects the constraints of the Hopfield model (symmetric connections and no autoconnection):

$$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu\,\xi_j^\mu, \qquad w_{ii} = 0. \tag{3.2}$$

⁵ A Lyapunov function is a lower-bounded function that never increases in time along the trajectories of the system.
⁶ This remains valid only when learning a few uncorrelated patterns. In other cases, the network may converge to an arbitrary fixed-point attractor.
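The Hebbian rule of equation 3.2, the function $H$ of equation 3.1, and content-addressable recall can be illustrated with a short sketch. It assumes discrete ±1 neurons updated asynchronously (one at a time, in random order), as in Hopfield (1982); the function names, the eight-unit pattern, and the number of sweeps are hypothetical choices for the example.

```python
import random

def hebbian_weights(patterns):
    """Equation 3.2: w_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, with w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def energy(w, x):
    """Equation 3.1: H = -1/2 * sum_ij w_ij x_i x_j."""
    n = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def recall(w, x, sweeps=10, seed=0):
    """Asynchronous recall: neurons are updated one at a time in random order."""
    rng = random.Random(seed)
    x = list(x)
    n = len(x)
    for _ in range(sweeps):
        for i in rng.sample(range(n), n):
            field = sum(w[i][j] * x[j] for j in range(n))
            x[i] = 1 if field >= 0 else -1
    return x

# Hypothetical example: one stored pattern, probed with two flipped items.
pattern = [1, -1, 1, 1, -1, -1, 1, -1]
w = hebbian_weights([pattern])
probe = list(pattern)
probe[0], probe[3] = -probe[0], -probe[3]

print(recall(w, probe) == pattern)
print(energy(w, probe), energy(w, pattern))
```

With a single stored pattern the probe lies well inside the basin of attraction, so recall restores the pattern and the stored pattern has the lower value of $H$; with many or correlated patterns this is no longer guaranteed, as footnote 6 notes.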


However, this kind of rule leads to drastic storage limitations. An in-depth analysis of the Hopfield model’s storing capacity has been done by Amit et al. (1987) by relying on a mean-field approach and on replica methods originally developed for spin-glass models. Their theoretical results show that these types of networks, when coupled with this learning rule, are unlikely to store more than $0.14N$ uncorrelated random patterns.

3.2 Iterative Version of the Hebbian Learning Rule

3.2.1 Learning Fixed Points. A better way of storing patterns is given by an iterative version of the Hebbian rule (Gardner, 1987; for a detailed description of this algorithm, see van Hemmen & Kuhn, 1995, and Forrest & Wallace, 1995). The principle of this algorithm is as follows: at each learning iteration, the stability of every nominal pattern $\xi^\mu$ is tested. Whenever one pattern has not yet reached stability, the responsible neuron $i$ sees its connectivity reinforced by adding a Hebbian term to all the synaptic connections impinging on it,

$$w_{ij} \to w_{ij} + \varepsilon_s\,\xi_i^\mu\,\xi_j^\mu, \tag{3.3}$$

where $\varepsilon_s$ defines the learning rate. All patterns to be learned are repeatedly tested for stability, and once all are stable, the learning is complete. This learning algorithm is incremental since new information can be learned while preserving all the information that has already been learned. It has been proved (Gardner, 1987) that by using this procedure, the capacity can be increased up to $2N$ uncorrelated random patterns. In our model, stored cycles are indexed by the use of external stimuli. These external stimuli modify the underlying network’s internal dynamics and, consequently, increase the number of potential attractors, as well as the size of their basins of attraction. The connection weights between the external stimuli and the neurons are learned by adopting the same approach as given in equation 3.3. When one pattern is not yet stable, the responsible neuron $i$ sees its connectivity reinforced by adding a Hebbian term to all of the synaptic connections impinging on it (see equation 3.3), including connections coming from the external stimulus:

$$w_{is} \to w_{is} + \varepsilon_b\,\chi_s^\mu\,\xi_i^\mu, \tag{3.4}$$

where $\varepsilon_b$ defines the learning rate applied to the external stimulus’s connections, which may differ from $\varepsilon_s$.

In order to not only store the patterns but also ensure sufficient content addressability, we must try to “excavate” the basins of attraction. Two approaches are commonly proposed in the literature. The first aims at getting the alignment of the spin of the neuron (+1 or −1) together with its local field to be not just positive (the requirement to ensure stability) but greater than a given minimum bound. The second approach attempts explicitly to enlarge the domains of attraction around each nominal pattern. To do so, the network is trained to associate noisy versions of each nominal pattern with the desired pattern, following a given number of iterations expected to be sufficient for convergence. This second approach is the one adopted in this article. Two noise parameters are introduced to tune noise during the learning phase: the noise imposed on the internal states, $l_{ns}$, and the noise imposed on the external stimulus, $l_{nb}$.⁷

3.2.2 Learning Cycles. The learning rule defined in equation 3.3 naturally leads to asymmetric weight values. It is no longer possible to define a Lyapunov function for this system, the main consequence being to include cycles in the set of “memory bags.” As for fixed points, the network can be trained to converge to such limit cycle attractors by modifying equations 3.3 and 3.4. This time, the weights $w_{ij}$ and $w_{is}$ are modified according to the expected value of neuron $i$ at time $t + 1$ and the expected value of neuron $j$ and of the external stimulus at time $t$:

$$w_{ij} \to w_{ij} + \varepsilon_s\,\varsigma_i^{\mu,\nu+1}\,\varsigma_j^{\mu,\nu}, \qquad w_{is} \to w_{is} + \varepsilon_b\,\varsigma_i^{\mu,\nu+1}\,\chi_s^{\mu}, \tag{3.5}$$

where $\varepsilon_s$ and $\varepsilon_b$, respectively, define the learning rate and the stimulus learning rate. This time-asymmetric Hebbian rule can be related to the asymmetric time window of synaptic plasticity observed in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).

⁷ A noisy pattern $\xi^\mu_{l_{ns}}$ is obtained from a pattern to learn $\xi^\mu$ by choosing a set of $l_{ns}$ items, randomly chosen among all the initial pattern’s items, and by switching their sign. Thus, $d_H(l_{ns})$ defines the Hamming distance between the two patterns:

$$d_H(l_{ns}) = \sum_{i=1}^{N} d_i, \qquad \text{where} \quad d_i = \begin{cases} 0 & \text{if } \xi_i^\mu\,\xi_{i,l_{ns}}^\mu = 1 \ \text{(items equal)}, \\ 1 & \text{if } \xi_i^\mu\,\xi_{i,l_{ns}}^\mu = -1 \ \text{(items different)}. \end{cases}$$

In this article, the Hamming distance is normalized to range in [0, 100].

3.2.3 Adaptation of the Algorithm to Continuous Activation Functions. When working with continuous state neurons, we are no longer working with cyclic attractors but with limit cycle attractors. As a consequence, the algorithm needs to be adapted in order to prevent the learned data from vanishing after a few iterations. A one-step iteration does not guarantee long-term stability of the internal states since observations are performed using a filter layer. The adaptation consists of waiting a certain number of cycles before testing the correctness of the obtained attractor. The halting test for discrete neurons is given by the following equation,

$$\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \to \big(x(1) = \varsigma^{\mu,\nu+1}\big) \ \Rightarrow \ \text{stop}; \tag{3.6}$$

for continuous neurons, it becomes

$$\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \to \big(o(1) = o(l_\mu + 1) = \cdots = o(T\,l_\mu + 1) = \varsigma^{\mu,\nu+1}\big) \ \Rightarrow \ \text{stop}, \tag{3.7}$$

where $T$ is a further parameter of our algorithm (set to 10 in all the experiments) and $o$ is the filtered output vector defined in equation 2.2.

4 In-Supervised Learning Algorithm

As shown in section 5, the encoding capacities of networks learning in the out-supervised way described above are fairly good. However, these results are disappointing compared to the potential capacity observed in random networks (Molter & Bersini, 2003). Moreover, section 6 shows how learning too many cycle attractors in an out-supervised way leads to a blackout catastrophe similar to the ones observed in fixed-point attractor networks (Amit, 1989). Here, the network’s background regime becomes fully chaotic and similar to white noise. Learning prespecified data appears to be too constraining for the network. This section introduces an in-supervised learning algorithm, more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is not specified a priori but is obtained following an internal mechanism. In other words, the information is generated through the learning procedure that assigns a meaning to each external stimulus. There is an important tradition of less supervised learning in neural nets since the seminal work of Kohonen and Grossberg. This tradition resonates with writings in cognitive psychology and constructivist philosophy (among others, Piaget, 1963; Varela et al., 1991; Erdi, 1996; and Tsuda, 2001). The algorithm presented now can be seen as a dynamical extension in the spirit of this preliminary work where the coding scheme relies on cycles instead of single neurons.

4.1 Description of the Learning Task. The main characteristic of this new algorithm lies in the nature of the learned information: only the external stimuli are known before learning. The limit cycle attractor associated with an external stimulus is identified through the learning procedure: the procedure enforces a mapping between each stimulus of the data set and a limit cycle attractor of the network’s inner dynamics, whatever it is. Hence, the aim of the learning procedure is twofold: first, it proposes a dynamical way to code the information (i.e., to associate a meaning with the external stimuli), and then it learns it (through a classical supervised procedure). Before mapping, the data set is defined by $D_{bm}$ (bm standing for “before mapping”):

$$D_{bm} = \{D_{bm}^1, \ldots, D_{bm}^q\}, \qquad D_{bm}^\mu = \chi^\mu, \qquad \mu = 1, \ldots, q. \tag{4.1}$$

After mapping, the data set becomes

$$D_{am} = \{D_{am}^1, \ldots, D_{am}^q\}, \qquad D_{am}^\mu = \big(\chi^\mu, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \qquad \mu = 1, \ldots, q, \tag{4.2}$$

where $l_\mu$ is the period of the learned cycle.

4.2 Description of the Algorithm. Inputs of this algorithm are a data set $D_{bm}$ to learn (see equation 4.1) and a range $[min_{cs}, max_{cs}]$ that defines the bounds of the accepted periods of the limit cycle attractors coding the information. This algorithm can be broken down into three phases that are constantly iterated until convergence:

1. Remapping stimuli into spatiotemporal cyclic attractors. During this phase, the network is presented with an external stimulus that drives it into a temporal attractor $output^\mu$ (which can be chaotic). Since the idea is to constrain the network as little as possible, a meaning is assigned to the stimulus by associating it with a close cyclic version of the attractor $output^\mu$, called $cycle^\mu$, an original⁸ attractor respecting the periodic bounds $[min_{cs}, max_{cs}]$. This step is iterated for all the stimuli of the data set.
2. Learning the information. Once a new attractor $cycle^\mu$ has been proposed for each stimulus, it is tentatively learned by means of a supervised procedure. However, to avoid constraining the network too much, only a limited number of iterations are performed, even if no convergence has been reached.
3. End test. If all stimuli are successfully associated with different cyclic attractors, the in-supervised learning stops; otherwise the whole process is repeated.

⁸ Original means that each pattern composing the limit cycle attractor must be different from all other patterns of all $cycle^\mu$.
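The supervised procedure used in phase 2 is the iterative rule of section 3.2.2. A minimal sketch for discrete ±1 neurons follows: it applies the time-asymmetric update of equation 3.5 to each neuron whose next value is wrong and stops once every transition passes the test of equation 3.6. The function names, the unit learning rates, the zero initial weights, and the two toy data are assumptions made for illustration; the continuous-neuron test of equation 3.7 and learning with noise are not covered.

```python
def sign(v):
    return 1 if v >= 0 else -1

def next_state(w, w_s, state, stim):
    """Synchronous update of discrete neurons under a persistent stimulus."""
    n = len(state)
    return [sign(sum(w[i][j] * state[j] for j in range(n))
                 + sum(w_s[i][s] * stim[s] for s in range(len(stim))))
            for i in range(n)]

def learn_cycles(data, n, m, eps_s=1.0, eps_b=1.0, max_sweeps=1000):
    """data: list of (stimulus, cycle) pairs. Returns (w, w_s, converged)."""
    w = [[0.0] * n for _ in range(n)]
    w_s = [[0.0] * m for _ in range(n)]
    for _ in range(max_sweeps):
        stable = True
        for stim, cycle in data:
            for nu in range(len(cycle)):
                cur, target = cycle[nu], cycle[(nu + 1) % len(cycle)]
                out = next_state(w, w_s, cur, stim)
                for i in range(n):
                    if out[i] != target[i]:  # neuron i is responsible
                        stable = False
                        for j in range(n):   # equation 3.5, recurrent weights
                            w[i][j] += eps_s * target[i] * cur[j]
                        for s in range(m):   # equation 3.5, stimulus weights
                            w_s[i][s] += eps_b * target[i] * stim[s]
        if stable:                           # halting test of equation 3.6
            return w, w_s, True
    return w, w_s, False

# Hypothetical data: two stimuli, each indexing its own cycle of period 2.
data = [
    ([1, 1],  [[1, 1, -1, -1], [1, -1, 1, -1]]),
    ([1, -1], [[-1, 1, 1, -1], [1, 1, 1, 1]]),
]
w, w_s, converged = learn_cycles(data, n=4, m=2)
print(converged)
```

Each neuron is trained only on its own incoming weights, so the resulting matrix is in general asymmetric, which is what allows cycles rather than fixed points to be stored.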


Table 1: Pseudocode of the In-Supervised Algorithm.

A/ Remapping stimuli to spatiotemporal cyclic attractors
1. For every datum $D_{bm}^\mu$ to learn, $\mu = 1, \ldots, q$:
   a. Stimulation of the network
      i. The stimulus is initialized with $\chi^\mu$.
      ii. The states are initialized with $\varsigma^{\mu,1}$, which is obtained from the previous iteration (or random at first).
      iii. To skip the transient, the network is simulated for some steps.
      iv. The states $\varsigma^{\mu,i}$ crossed by the network’s dynamics are stored in $output^\mu$.
   b. Proposal of an attractor code
      i. If $output^\mu$ is not a cycle of period less than or equal to $max_{cs}$ ⇒ compression process (see Table 2): $output^\mu \to cycle^\mu$.
      ii. If $output^\mu$ is not a cycle of period greater than or equal to $min_{cs}$ ⇒ extension process (see Table 2): $output^\mu \to cycle^\mu$.
      iii. If a pattern contained in $cycle^\mu$ is too correlated with any other pattern, this pattern is slightly modified to make it “original.”
2. The data set $D_{temp}$ is created, where $D_{temp}^\mu = (\chi^\mu, cycle^\mu)$.

B/ Learning the information
3. Using an iterative supervised learning algorithm, the data set $D_{temp}$ is tentatively learned for a limited number of time steps.

C/ End test
4. If, for every datum $D_{bm}^\mu$, the network iterates through valid limit cycle attractors ⇒ finished; else go to 1.

The pseudocode of this in-supervised learning algorithm is described in Table 1. The algorithm presented here learns to map external stimuli into the network’s cyclic attractors in a very unconstrained way. Because these attractors are only derived from the ones spontaneously proposed by the network, a form of supervision is still needed (how to create an original cycle and how to know if the proposed cycle is original). However, recent neurophysiological observations have shown that in some cases, synaptic plasticity is effectively guided by a supervised procedure (Gutfreund, Zheng, & Knudsen, 2002; Franosch, Lingenheil, & van Hemmen, 2005). The supervised procedure proposed here to create and test original cycles (see Table 2) has no real biological grounding. This part of the algorithm might be improved in the future to reinforce its biological likelihood. Nevertheless, we think that whatever supervised procedure is chosen, the part of the study concerned with the capacity of the network and the background chaos remains valid.


Table 2: Routines Used When Creating an Attractor Code.

Compression process
Since outputµ = ςµ,1, . . . , ςµ,maxcs, . . . is a cycle of period greater than maxcs (it could even be chaotic):
⇒ cycleµ is generated by truncating outputµ: cycleµ = ςµ,1, . . . , ςµ,ps (the compression period ps ∈ [mincs, maxcs] is another parameter).

Extension process
Since outputµ = ςµ,1, . . . , ςµ,q with q < mincs:
⇒ cycleµ is generated by duplicating outputµ: cycleµ = ςµ,1, . . . , ςµ,q, ςµ,1, . . . such that size(cycleµ) = pe, where pe ∈ [mincs, maxcs] is the extension period, another parameter, randomly given or fixed.
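The two routines of Table 2 can be illustrated with a minimal Python sketch. The function names (`compress`, `extend`, `propose_cycle`) and the `period` argument are assumptions made for illustration; the table does not specify how the period of the output is detected, so it is passed in explicitly here.

```python
def compress(output, p_s):
    """Compression: keep only the first p_s states of a too-long output."""
    return list(output[:p_s])


def extend(output, p_e):
    """Extension: repeat a too-short cycle until it contains p_e states."""
    q = len(output)
    return [output[t % q] for t in range(p_e)]


def propose_cycle(output, period, min_cs, max_cs, p_s, p_e):
    """Return the attractor code proposed for one stimulus.

    `period` is the period detected in `output`, or None when no cycle was
    found (for example, chaotic dynamics). How the period is detected is
    not covered by this sketch.
    """
    if period is None or period > max_cs:
        return compress(output, p_s)          # too long or chaotic
    if period < min_cs:
        return extend(output[:period], p_e)   # too short
    return list(output[:period])              # already acceptable


states = ["a", "b", "c", "d", "e", "f", "g"]
print(propose_cycle(states, None, 3, 5, 4, 4))    # compression to 4 states
print(propose_cycle(states[:2], 2, 3, 5, 4, 5))   # extension to 5 states
```

When pe is not a multiple of q, the last repetition in `extend` is truncated so that the proposed cycle has exactly pe states, as required by size(cycleµ) = pe.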

5 Performance Regarding the Encoding Capacity

Encoding capacities of networks trained with the out-supervised algorithm and of networks trained with the in-supervised algorithm are compared in Figure 2. Here, to ease the comparison, we have forced the in-supervised algorithm to learn mappings to cycles of fixed period. However, because this algorithm enables mapping stimuli to limit cycle attractors of variable period (between mincs and maxcs), better performance could be achieved. The number of iterations required to learn the specified data set is plotted as a function of the size of this data set.9 We have seen in section 3.2 that it is possible to enforce greater content addressability by training the network to associate noisy versions of each nominal pattern with the desired limit cycle attractors. Here, the two parameters enforcing the content addressability (lns and lnb) have been set to 0 in order to display the maximum encoding capacities. As expected, the in-supervised learning procedure, by letting the network decide how to map the stimuli, outperforms its out-supervised counterpart, where all the mappings are specified before the learning.10

Figure 2: Encoding capacities' comparison between out-supervised learning (filled circles) and in-supervised learning (squares). The number of iterations required to learn the specified data set is plotted according to data set size (x-axis: number of stimuli mapped to cycles of period 3, 5, and 10; y-axis: number of iterations). Results obtained from three different types of data sets are represented: data sets composed of stimuli associated with cycles of period 3, 5, and 10. Each value has been obtained from statistics of 100 different data sets.

Robustness to noise is compared in Figure 3. Content addressability has been enforced as much as possible by means of the learning-with-noise procedure. Robustness to noise, measured by the normalized Hamming distance between the obtained attractor and the initially learned attractor, indicates how well the dynamics is able to retrieve the original association when both the external stimulus and the network's internal state are perturbed by noise. The two plots in Figure 3 show the normalized Hamming distance between the stored sequence and the recovered sequence according to the noise injected in the external stimulus and the initial states. This noise is quantified by computing the initial overlaps m0b and m0s.11 The noise injected in the network's internal state is measured by computing the smallest overlap between the internal state and every pattern composing the limit cycle attractor associated with the external stimulus.

In-supervised learning, compared to out-supervised learning, considerably improves robustness. For instance, when learning six period 4 cycles, in the in-supervised case (see Figure 3B), stored sequences are very robust to noise, while in the out-supervised case (see Figure 3A), content addressability is no longer observed: adding a tiny amount of noise drives the correlation between the stored sequence and the recovered one to zero (the normalized Hamming distance is equal to 50%). These figures also show that in the in-supervised case, the external stimulus plays a stronger role in indexing the stored data.

9 Each iteration represents 100 weight modifications defined in equation 3.6.

10 Mappings of stimuli to fixed-point attractors are not compared, since to obtain relevant results, we have to enforce content addressability by learning with noise. See also section 6.1.1.

11 The overlap m^µ between two patterns is given by

$$m^{\mu} = \frac{1}{N} \sum_{i=1}^{N} \varsigma_{i}^{\mu,\mathrm{idcyc}} \, \varsigma_{i,\mathrm{noisy}}^{\mu,\mathrm{idcyc}}.$$

Two patterns that match perfectly have an overlap equal to 1; it equals −1 in the opposite case. The overlap m and the Hamming distance d_h are related by $d_h = N \frac{1 - m}{2}$.
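As a small numerical illustration of the overlap defined in this footnote, the sketch below computes the overlap and the Hamming distance of two patterns with components ±1. The function names and the example patterns are hypothetical. For such ±1 patterns, the number of differing components equals N(1 − m)/2, which the last line checks.

```python
def overlap(a, b):
    """Overlap m = (1/N) * sum_i a_i * b_i for patterns with entries +1/-1."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def hamming_distance(a, b):
    """Number of components in which the two patterns differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


stored = [1, -1, 1, 1, -1, -1, 1, -1]
noisy = [1, -1, -1, 1, -1, 1, 1, -1]    # two components flipped

m = overlap(stored, noisy)
d_h = hamming_distance(stored, noisy)
n = len(stored)

print(m, d_h, d_h / n)                  # m = 0.5, d_h = 2, normalized 0.25
print(d_h == n * (1 - m) / 2)           # True for +1/-1 patterns
```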


Figure 3: Content addressability obtained for the learned networks. The normalized Hamming distance between the expected sequence and the obtained sequence is plotted according to the noise injected in both the initial states (m0s ) and the external stimulus (m0b ). Results for the out-supervised Hebbian algorithm (A) are compared with the ones obtained for the in-supervised Hebbian algorithm (B).

The in-supervised learning mechanism thus implicitly gives the network substantial robustness to noise. This could be explained by the following considerations. First, the coding attractors are derived from the ones spontaneously proposed by the network. Second, they need to have large and stable basins of attraction in order to resist the process of trial, error, and adaptation that characterizes this iterative remapping procedure.

6 Dynamical Analysis

Tests performed here aim at analyzing the so-called background or spontaneous dynamics obtained when the network is presented with external stimuli other than the learned ones. In the first two sections, analyses are performed to quantify the presence and proportion of chaotic dynamics. Section 6.1 uses two measures: the network's mean Lyapunov exponent and the probability of having chaotic dynamics. Section 6.2 uses symbolic analyses12 to quantify the proportion of the different types of symbolic attractors found. It shows how chaotic dynamics help to prevent the proliferation of spurious data. Qualitative analyses of the nature of the chaotic dynamics obtained are performed in the last section (section 6.3) by means of classical tools such as return maps,

12 Symbolic analyses are performed by analyzing the output of the filter layer instead of directly analyzing the network’s internal state.


power spectra, and Lyapunov spectra. Furthermore, an innovative measure is developed to assess the presence of frustrated chaotic dynamics.

6.1 Quantitative Dynamical Analysis: The Background Regime. Quantitative analyses are performed here using two kinds of measures: the mean Lyapunov exponent and the probability of having chaotic dynamics. Both measures come from statistics on 100 learned networks. For each learned network, dynamics obtained from randomly chosen external stimuli and initial states have been tested (1000 different configurations). These tests aim at analyzing the so-called background or spontaneous dynamics obtained by stimulating the network with external stimuli and initial states different from the learned ones.

The computation of the first Lyapunov exponent is done empirically by computing the evolution of a tiny perturbation (renormalized at each time step) performed on an attractor state (for more details, see Wolf, Swift, Swinney, & Vastano, 1984; Albers, Sprott, & Dechert, 1998). This exponent indicates how fast the system's history is lost. While stable dynamics have negative Lyapunov exponents, Lyapunov exponents greater than 0 are the signature of chaotic dynamics, and when the largest Lyapunov exponent is very high, the system's dynamics may be seen as equivalent to a turbulent state. Here, to distinguish chaotic dynamics from quasi-periodic regimes, dynamics is said to be chaotic if the Lyapunov exponent is greater than a given value slightly above zero (in practice, if the Lyapunov exponent is greater than 0.01). The results are made more meaningful by comparing global dynamics of learned networks (indexed with L) with global dynamics of networks obtained randomly (without learning).
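The perturbation-based estimate of the first Lyapunov exponent described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `largest_lyapunov`, `is_chaotic`, and the generic `step` callable (one update of any discrete-time system, mapping a list of floats to a list of floats) are hypothetical names, and the perturbation size and step counts are arbitrary choices. Only the 0.01 threshold comes from the text.

```python
import math


def largest_lyapunov(step, x0, n_steps=2000, n_transient=200, d0=1e-8):
    """Estimate the first Lyapunov exponent of the map x -> step(x).

    A reference trajectory and a copy perturbed by a distance d0 are
    iterated together; after each step the logarithm of the stretching
    factor is accumulated and the perturbation is renormalized to d0.
    """
    x = list(x0)
    for _ in range(n_transient):          # skip the transient
        x = step(x)
    n = len(x)
    y = [xi + d0 / math.sqrt(n) for xi in x]   # perturbed copy at distance d0
    total = 0.0
    for _ in range(n_steps):
        x = step(x)
        y = step(y)
        diff = [yi - xi for xi, yi in zip(x, y)]
        d = math.sqrt(sum(v * v for v in diff))
        d = max(d, 1e-300)                # avoid log(0) if the trajectories coincide
        total += math.log(d / d0)
        # renormalize: keep the direction, reset the distance to d0
        y = [xi + v * (d0 / d) for xi, v in zip(x, diff)]
    return total / n_steps


def is_chaotic(exponent, threshold=0.01):
    """Chaos criterion used in the text: exponent above a small positive value."""
    return exponent > threshold


# Sanity check on a linear contraction, whose exponent is log(0.5).
halve = lambda x: [0.5 * v for v in x]
lam = largest_lyapunov(halve, [1.0, -0.3])
print(lam, is_chaotic(lam))
```

For a network, `step` would wrap one update of the network's state under a fixed external stimulus; the statistics reported in this section would then be obtained by repeating the estimate over many stimuli and initial states.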
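However, to constrain random networks to behave as a kind of "surrogate network" (indexed with S), they must have the same mean µ and the same standard deviation σ for their weight distributions (neuron-neuron and stimulus-neuron):

$$
\begin{aligned}
\mu_{w_{ij}^{L}} &= \mu_{w_{ij}^{S}}, &\quad \mu_{w_{bi}^{L}} &= \mu_{w_{bi}^{S}}, &\quad \mu_{w_{ii}^{L}} &= \mu_{w_{ii}^{S}}, \\
\sigma_{w_{ij}^{L}} &= \sigma_{w_{ij}^{S}}, &\quad \sigma_{w_{bi}^{L}} &= \sigma_{w_{bi}^{S}}, &\quad \sigma_{w_{ii}^{L}} &= \sigma_{w_{ii}^{S}}.
\end{aligned}
\tag{6.1}
$$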
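One way to build such a surrogate network is sketched below, under stated assumptions: the function name `surrogate_weights` is hypothetical, the recurrent weights are given as a square list of lists `w`, and the stimulus-to-neuron weights as a flat list `w_b` (the actual layout of these weights is not specified here). Each of the three weight groups of equation 6.1 is redrawn from a gaussian with the mean and standard deviation measured on the learned network.

```python
import random
import statistics


def surrogate_weights(w, w_b, seed=None):
    """Draw random weights whose three groups match the learned statistics.

    w   : learned recurrent weights, w[i][j] (diagonal = self-connections)
    w_b : learned stimulus-to-neuron weights (flat list; an assumed layout)
    """
    rng = random.Random(seed)
    n = len(w)
    off_diag = [w[i][j] for i in range(n) for j in range(n) if i != j]
    diag = [w[i][i] for i in range(n)]

    def sampler(values):
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        return lambda: rng.gauss(mu, sigma)

    draw_off, draw_diag, draw_b = sampler(off_diag), sampler(diag), sampler(w_b)
    w_s = [[draw_diag() if i == j else draw_off() for j in range(n)]
           for i in range(n)]
    w_b_s = [draw_b() for _ in w_b]
    return w_s, w_b_s


# Example with a small, made-up "learned" network of 4 neurons.
learned_w = [[0.9, 0.1, -0.2, 0.3],
             [0.1, 1.1, 0.0, -0.1],
             [-0.2, 0.0, 1.0, 0.2],
             [0.3, -0.1, 0.2, 0.8]]
learned_w_b = [0.5, -0.4, 0.3, 0.6]
w_s, w_b_s = surrogate_weights(learned_w, learned_w_b, seed=1)
print(len(w_s), len(w_s[0]), len(w_b_s))
```

With only a few weights per group, the sample mean and standard deviation of one surrogate match the learned values only approximately; the match holds at the level of the distributions from which the weights are drawn.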

These surrogate random networks enable the measurement of the weight distribution's impact. A gaussian distribution has been chosen for the random weights.

6.1.1 Network Stabilization Through Hebbian Learning of Static Patterns. Figure 4A shows the mean Lyapunov exponents and probabilities of having chaos for networks with learned data encoded in fixed-point attractors only. To prevent the weight matrix from converging to the identity matrix (if ∀i, j : wii > 0 and wij ≈ 0, all patterns are learned but without robustness), noise has been added during the training period. We can observe that:


Figure 4: Mean Lyapunov exponents and probability of chaos in RNNs, as a function of the size of the learning set. Out-supervised learned networks (filled circles) are compared with in-supervised learned networks (squares). In both cases, comparisons are performed with their surrogate random networks (in gray). Since these results come from statistics, mean and standard deviation are plotted. (A) Network stabilization can be observed after Hebbian learning of static patterns. (B) Chaotic dynamics appear after mapping stimuli to period 3 cycles.

• The encoding capacity (with content addressability) of in-supervised learned networks is greater than the maximum encoding capacities (without content addressability) obtained from theoretical results (Gardner, 1987). The explanation lies in the presence of the input layer, which modifies the network's internal dynamics and enables other attractors to appear.
• Mean Lyapunov exponents of learned networks are always negative. When the learning tasks are made more complex, networks are more and more constrained. The mean Lyapunov exponent increases but still remains below a negative upper bound.
• It is nearly impossible to find chaotic dynamics for spontaneous regimes in learned networks, even after intensive learning of static patterns.
• Surrogate random networks show very different results, clearly indicating that the learned weight distribution is anything but random.
• The same trends are observed with both the out-supervised algorithm and the in-supervised algorithm.

The iterative Hebbian learning algorithm described here is likely to keep all connections approximately symmetric, preserving the stability of learned networks, which is no longer the case for random networks.

6.1.2 Hebbian Learning of Cycles: A New Road to Chaos. Whereas learning data in fixed-point attractors stabilizes the network, learning sequences in limit cycle attractors leads to diametrically opposed results. Figure 4B compares the presence of chaotic dynamics in 25-neuron networks trained with different data sets of period 3 cycles. We can observe that:

• Networks learned through the out-supervised algorithm become more and more chaotic as the learning task is intensified. At the end, the probability of falling into a chaotic dynamics is equal to one, with high mean Lyapunov exponents. This kind of dynamics is ergodic and turns out to be reminiscent of the "blackout catastrophe" observed in fixed-point attractor networks (Amit, 1989; Fusi, 2002). Although the result is the same (the network becomes unable to retrieve any of the memorized patterns), here this blackout arises through a progressive increase of the chaotic dynamics.
• The same trend shows up in the in-supervised case; however, even after intensive periods of learning, the networks do not become fully chaotic. The dynamics never turns into an ergodic one: chaotic attractors and limit-cycle attractors coexist.
• Compared to surrogate random networks, in both cases learning contributes to structure the dynamics (at least, learned networks have to remember the learned data!). However, when the learning task is intensified, the differences between the random and the learned networks tend to vanish in the out-supervised case.

The absence of full chaos in in-supervised learned networks is explained by the fact that this algorithm is based on a process of trial, error, and adaptations, which provides robustness and prevents full chaos. By increasing the data set to learn, learning takes more and more time, but at the same time, the number of remappings increases and forces large basins of stable dynamics. The network is more and more constrained, and complex but not fully chaotic.


Figure 5: Road to chaos observed when learning four cycles of period 7 in a recurrent network of 25 neurons through an iterative supervised Hebbian algorithm (the number of learning iterations is indicated on the x-axis). The network is initialized randomly, and following a given transient, the average state of the network is plotted on the y-axis. The mean Lyapunov exponent demonstrates the growing presence of chaotic dynamics.

Obtaining chaos by learning an increasing number of cycles based on a time-asymmetric Hebbian mechanism can be seen as a new road to chaos. This road to chaos is illustrated in Figure 5. In contrast with the classical roads shaped by the gradual variation of a control parameter, this new road relies on an external mechanism simultaneously modifying a set of parameters to fulfill an encoding task. When learning cycles, the network is prevented from stabilizing in fixed-point attractors. The more cycles there are to learn, the more the network is externally constrained and the more the regime turns out to be spontaneously chaotic.

6.2 Quantitative Dynamical Analysis: Symbolic Analyses. After learning, when the network is presented with a learned stimulus while its initial state is correctly chosen, the expected spatiotemporal attractor is observed. Section 5 showed that noise tolerance is expected: the external stimulus or the network's internal state can be slightly modified without affecting the result. However, in some cases, the network fails to "understand" the stimulus, and an unexpected symbolic attractor appears in the output. The question is whether this attractor should be considered as spurious data. The intuitive idea is that if the attractor is chaotic or if its period is different from that of the learned data, it is easy to recognize it at a glance, and thus to discard it. This becomes more difficult if the observed attractor's period is the same as that of the learned attractors. In this case, it is in fact impossible to know whether this information is relevant without comparing it with all the learned data. As a consequence, we will define an attractor having the same period as the learned data but still different from all of them as spurious data.
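The period-based classification implied by this definition of spurious data can be sketched as follows. All names (`find_period`, `cycle_distance`, `classify`) are hypothetical, symbols are represented as tuples of ±1, and two simplifications are assumptions of the sketch: a sequence with no detectable period up to `max_period` is labeled chaotic, and a tolerance `tol` on the normalized Hamming distance decides whether an observed cycle counts as a learned one.

```python
def find_period(seq, max_period):
    """Smallest p <= max_period such that seq repeats with period p.

    Returns None if no period is seen over at least two full cycles.
    """
    for p in range(1, max_period + 1):
        if len(seq) >= 2 * p and all(
            seq[t] == seq[t + p] for t in range(len(seq) - p)
        ):
            return p
    return None


def normalized_hamming(a, b):
    """Fraction of components in which two symbols differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def cycle_distance(cycle, learned):
    """Mean normalized Hamming distance, minimized over cyclic shifts."""
    p = len(cycle)
    return min(
        sum(normalized_hamming(cycle[(t + r) % p], learned[t]) for t in range(p)) / p
        for r in range(p)
    )


def classify(seq, learned_cycles, max_period, tol=0.0):
    """Label an observed symbolic sequence."""
    p = find_period(seq, max_period)
    if p is None:
        return "chaotic"
    same_period = [c for c in learned_cycles if len(c) == p]
    if not same_period:
        return "out-of-range"
    d = min(cycle_distance(seq[:p], c) for c in same_period)
    return "learned" if d <= tol else "spurious"


learned = [[(1, -1), (-1, 1), (1, 1)]]
print(classify(learned[0] * 4, learned, max_period=6))                  # learned
print(classify([(1, 1), (1, 1), (-1, -1)] * 4, learned, max_period=6))  # spurious
```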


Figure 6: Proportion of the different symbolic attractors obtained during the spontaneous activity of artificial neural networks learned using, respectively, the out-supervised algorithm (A) and the in-supervised algorithm (B). Two types of mappings are analyzed: stimuli mapped to fixed-point attractors and stimuli mapped to spatiotemporal attractors of period 4.

As a result, two classification schemes are used to differentiate the symbolic attractors obtained. The first scheme is based on periods. This criterion enables distinguishing among chaotic attractors, periodic attractors whose periods differ from the learned ones (named out-of-range attractors), and periodic attractors having the same period as the learned ones. The aim of the second classification scheme is to differentiate these attractors, based on the normalized Hamming distance between them and the closest learned data (i.e., the attractors at a distance less than 10%, less than 20%, and so on). Figure 6 shows the proportion of the different types of attractors found as the size of the learned data set is increased. Results obtained with the in-supervised and the out-supervised algorithm while mapping stimuli in spatiotemporal attractors of various periods are compared. For each data set size, statistics are obtained from 100 different learned networks, and each time, 1000 symbolic attractors obtained from random stimuli and internal states have been classified. When stimuli are mapped into fixed-point attractors, the proportion of chaotic attractors and of out-of-range attractors falls rapidly to zero. In contrast, the number of spurious data increases drastically. In fact both


Figure 7: Proportions of the different types of attractors observed in in-supervised learned networks while the noise injected in a previously learned stimulus is varied. Networks' initial states were randomly initialized. (A) Thirty stimuli are mapped to fixed-point attractors. (B) Thirty stimuli are mapped to period 4 cycles. One hundred learned networks, each with 1000 configurations, have been tested.

learning procedures tend to stabilize the network by enforcing symmetric weights and positive auto-connections.13 When stimuli are mapped to cyclic attractors, both learning procedures lead to an increasing number of chaotic attractors. The more the network learns, the more its spontaneous regime tends to be chaotic. Still, we have to differentiate the two learning procedures. Out-supervised learning, due to its very constraining nature, drives the network into a fully chaotic state that prevents the network from learning more than 13 period 4 cycles. The less constraining and more natural in-supervised learning task leads to different behavior. This time, the network does not become fully chaotic, and the storage capacity is enhanced by a factor as large as 4. Unfortunately, the number of spurious data is also increasing, and noticeable proportions of spurious data are visible when the network is fed with random stimuli.

The aim of Figure 7 is to analyze the proportion of the different types of attractors observed when the external stimulus is progressively shifted from a learned stimulus to a random one. Again, the network's initial state is set completely at random. Two types of in-supervised mappings are compared: stimuli mapped to fixed-point attractors and stimuli mapped to attractors of period 4. When noise-free learned stimuli are presented to the network, striking results appear. For fixed-point learned networks, in more than 80% of the observations, we are facing spurious data. Indeed, the distance between the observed attractors and the expected one is larger than 10%, and they have the same period (period 1). By contrast, for period 4 learned networks, the probability of recovering the perfect cycle increases to 56%. Moreover, 62% of the obtained cycles are at a distance less than 10% of the expected ones, and the amount of chaotic and out-of-range attractors is, respectively, equal to 23% and 8%. Space left for spurious data becomes less than 5%. Thus, if we imagine a procedure where the network's states are slightly modified in case a chaotic trajectory is encountered, until a cyclic attractor is found, we are nearly certain of obtaining the correct mapping.

Because of the pervading number of spurious data in fixed-point attractors' learned networks, it becomes difficult to imagine these networks as working memories. By contrast, the presence of chaotic attractors helps to prevent the proliferation of spurious data in cyclic attractors' learned networks, while good storage capacity is possible by relying on in-supervised mappings.

13 If no noise is injected during the learning procedure, the network converges to the identity matrix.

6.3 Qualitative Dynamical Analysis: Nature of Chaos. The preceding section has given quantitative analyses of the presence of chaotic dynamics in learned networks. This section aims at introducing a more qualitative description of the different types of chaotic regimes encountered in these networks. Numerous techniques exist to characterize chaotic dynamics (Eckmann & Ruelle, 1985). In the first part of this section, three well-known tools are used: chaotic dynamics are characterized by means of return maps, power spectra, and analysis of their Lyapunov spectrum. The computation of the Lyapunov spectrum is performed through Gram-Schmidt reorthogonalization of the evolved system's Jacobian matrix (which is estimated at each time step from the system's equations). This method is detailed in Wolf et al. (1984). In the second part of this section, to have a better understanding of the notion of frustrated chaos, a new type of measure is proposed, indicating how chaotic dynamics is built on nearby limit cycle attractors.

6.3.1 Preliminary Analyses. From return maps and power spectra analysis, three types of chaotic regimes have been identified in these networks. Figure 8 shows the power spectra and the return maps of these three generic types. They are labeled here white noise, deep chaos, and informative chaos. For each of them, a return map has been plotted for one particular neuron and the network's mean signal. We observe that:

• White noise. The power spectrum of this type of chaos (see Figure 8A) shows a total lack of structure; all the frequencies are represented nearly equally. The associated return map is completely filled, with a bigger density of points at the edges, indicating the presence of saturation. No useful information can be obtained from such chaos.


Figure 8: Return maps of the network’s mean signal (upper figures), return maps of a particular neuron (center figures), and power spectra of the network’s mean signal (lower figures) for chaotic regimes encountered in random (A) and learned (B, C) networks. The Lyapunov exponent of the corresponding dynamics is given.

• Deep chaos. The power spectrum of this type of chaos shows more structure, but is still very similar to white noise (see Figure 8B). However, the return map is very similar to the previous one and does not seem to provide any useful information.
• Informative chaos. The power spectrum looks very informative (see Figure 8C). Different peaks show up, indicating the presence of nearby limit cycles. The associated return map shows more structure, but still with the presence of saturation.

The most relevant result is the possibility of predicting the type of chaos preferentially encountered by knowing the learning procedure used, as well as the size of the data set to be learned. Chaotic dynamics encountered in surrogate random networks are nearly always similar to white noise. This explains the large mean Lyapunov exponent obtained for these networks.


Figure 9: The first 10 Lyapunov exponents. (A) After out-supervised learning of 10 period 4 cycles. (B) After in-supervised learning of 20 period 4 cycles. Each time, 10 different learning sets have been tested. For each obtained network, 100 chaotic dynamics have been studied.

When learning by means of the out-supervised procedure, depending on the data set's size, different types of chaotic regimes appear. For small data sets, informative chaos and deep chaos coexist. As the data set's size increases, chaotic regimes go from informative chaos to almost white noise. Learning too much information in an out-supervised way leads to networks showing very uninformative deep chaos similar to white noise. In other words, having too many competing limit cycle attractors forces the network's dynamics to become almost random, losing its structure and behaving as white noise. All information about hypothetical nearby limit cycle attractors is lost. No more content addressability can be obtained.

When learning by means of the in-supervised procedure, chaotic dynamics are nearly always of the third type: very informative chaos shows up. By not predefining the internal representations of the network, this type of learning preserves more structure in the chaotic dynamics. The dynamical structure of the informative chaos (Figure 8C) reveals the existence of nearby competing attractors. It shows the phenomenon of frustration obtained when the network hesitates between two (or more) nearby cycles, passing from one to another.

To provide a better understanding of this type of chaos, Figure 9 compares Lyapunov spectra obtained from deep chaos in out-supervised learned networks and from informative chaos in in-supervised learned networks. From this figure, deep chaos can be related to "hyperchaos" (Rössler, 1983). In hyperchaos, the presence of more than one positive Lyapunov exponent is expected, and these exponents are expected to be high. By contrast, Lyapunov spectra obtained in informative chaos after in-supervised learning are characteristic of chaotic itinerancies (Kaneko & Tsuda, 2003). In this type of chaos, the dynamics is attracted to learned memories, which is indicated by negative Lyapunov exponents, while at the same time it escapes from them, which is indicated


Figure 10: Probability of the presence of the nearby limit cycle attractors in chaotic dynamics (x-axis). By slowly shifting the external stimulus from a stimulus previously learned (region a) to another stimulus learned (region b), the network's dynamics goes from the limit cycle attractor associated with the former stimulus to the limit cycle attractor associated with the latter stimulus. The probabilities of the presence of cycle a and cycle b are plotted (black). The Lyapunov exponent of the obtained dynamics is also plotted (gray). (A) After out-supervised learning, hyperchaos is observed in between the learned attractors. (B) The same conditions lead to frustrated chaos after in-supervised learning.

by the presence of at least one positive Lyapunov exponent. This Lyapunov exponent must be only slightly positive in order not to completely erase the system's history and thus keep traces of the learned memories. This regime of frustration is increased by some modes of neutral stability indicated by the presence of many exponents whose values are close to zero.

6.3.2 The Frustrated Chaos. The main difference between the frustrated chaos (Bersini & Sener, 2002) and similar forms of chaos well known in the literature, like intermittency chaos (Pomeau & Manneville, 1980) and chaotic itinerancy (Ikeda, Matsumoto, & Otsuka, 1989; Kaneko, 1992; Tsuda, 1992), lies in the way it is obtained and in the possible characterization of the dynamics in terms of the encoded attractors. Although all of those very structured chaotic regimes are characterized by strong cyclic components among which the dynamics randomly itinerates, the key difference lies in the transparency and the exploitation of those cycles. In the frustrated chaos, those cycles are the basic stages of the road to chaos. It is by forcing those cycles in the network (by tuning the connection parameters) that the chaos finally shows up. One way of forcing these cycles is by the time-asymmetric Hebbian learning adopted in this letter. Frustrated chaos is a dynamical regime that appears in a network when the global structure is such that local connectivity patterns responsible for stable and meaningful oscillatory behaviors are intertwined, leading to mutually competing attractors and unpredictable itinerancy among brief appearances of these attractors.


To have a better understanding of this type of chaos, Figure 10 compares hyperchaos and frustrated chaos through the probability of presence of the nearby limit cycle attractors in chaotic dynamics. In the figure on the left, the network has learned to associate two data with limit cycle attractors of period 10 by using the out-supervised Hebbian algorithm. This algorithm strongly constrains the network, and as a consequence, chaotic dynamics appear very uninformative: when the external stimulus is shifted from one attractor to the other, strong chaos shows up in between (indicated by the Lyapunov exponent), where any information concerning these two limit cycle attractors is lost. In contrast, when mapping four stimuli to period 4 cycles using the in-supervised algorithm (Figure 10, right), the chaos encountered on the road when the external stimulus is shifted from one attractor to another appears much more structured: small Lyapunov exponents and strong presence of the nearby limit cycles are easy to observe, shifting progressively from one attractor to the other one.

7 Conclusion

This letter studies the possibility of encoding information in spatiotemporal cyclic attractors of a network's internal state. Two versions of a time-asymmetric Hebbian learning algorithm are proposed. First is a classical out-supervised version, where the cyclic attractors are predefined and need to be replicated by the internal dynamics of the network. Second is an in-supervised version where the cyclic attractors to be associated with the external stimuli are left unprescribed and derived from the ones spontaneously proposed by the network.

First, experimental results show that the encoding performances obtained in the in-supervised case are greatly enhanced compared to the ones obtained in the out-supervised case. This is intuitively understandable since the in-supervised learning is much less constraining for the network.

Second, experimental results aim at analyzing the background dynamical regime of the network: the dynamics observed when unlearned stimuli are presented to the network. It is empirically shown that the more information the network has to store in its attractors, the more its spontaneous dynamical regime tends to be chaotic. Chaos is in fact the biggest pool of potential cyclic attractors. Asymmetric Hebbian learning can be seen as an alternative road to chaos. Adopting the out-supervised learning and increasing the amount of information to store, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. The reason lies in the constraints imposed by this learning process, which become harder and harder for the network to satisfy. In contrast, in-supervised learning, by being more "respectful" of the network's intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe in the chaotic regime the traces of the learned attractors. Following


our previous considerations, we call this complex but still very structured and informative regime frustrated chaos. This behavior can be related to experimental findings where ongoing cortical activity has been shown to encompass a set of dynamically switching cortical states that corresponds to stimulus-evoked activity (Kenet et al., 2003).

Symbolic investigations have been performed on the spatiotemporal attractors obtained when the network is in a random state and presented with random or noisy stimuli. Different types of attractors are observed in the output. Spurious data have been defined as attractors having the same period as the learned data but still different from all of them. This follows the intuitive idea that the other kinds of attractors (like chaotic attractors) are easily recognizable at a glance and unlikely to be confused with learned data. In the case of spurious data, it is impossible to know if the observed attractors bear useful information without comparing them with all the learned data. Networks where the information is coded in fixed-point attractors are easily corrupted with spurious data, which makes their exploitation very delicate. By contrast, when the information is stored in cyclic attractors using in-supervised learning, chaotic attractors appear as the background regime of the net. They are likely to play a beneficial role by preventing the proliferation of spurious data and helping the recovery of the correct mapping: if a previously learned stimulus is presented to the network, either the network will find the correct mapping, or it will iterate through a chaotic trajectory.

While not as easy to interpret and "engineerize" as its classical supervised counterpart, in-supervised learning makes more sense when adopting both a biological and a cognitive perspective. Biological systems propose their own way to treat external impact by slightly perturbing their inner working. Here the information is still meaningful to the extent that the external impact is associated in an unequivocal way with an attractor. As the famous Colombian neurophysiologist Rodolfo Llinas (2001) likes to say: "A person's waking life is a dream modulated by the senses".

References Albers, D., Sprott, J., & Dechert, W. (1998). Routes to chaos in neural networks with random weights. International Journal of Bifurcation and Chaos, 8, 1463–1478. Amari, S. (1972). Learning pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 21, 1197–1206. Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185. Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63–73. Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.


Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral Brain Science, 18, 617–657. Amit, D., & Brunel, N. (1994). Learning internal representations in an attractor neural network with analogue neurons. Network: Computation in Neural Systems, 6, 359– 388. Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982. Amit, D., Gutfreund, G., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67. Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596. Babloyantz, A., & Loureno (1994). Computation with chaos: A paradigm for cortical activity. Proceedings of National Academy of Sciences, 91, 9027–9031. Bersini, H. (1998). The frustrated and compositional nature of chaos in small Hopfield networks. Neural Networks, 11, 1017–1025. Bersini, H., & Sener, P. (2002). The connections between the frustrated chaos and the intermittency chaos in small Hopfield networks. Neural Netwoks, 15, 1197–1204. Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796. Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232, 331–356. Brunel, N., Carusi, F., & Fusi, S. (1997). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123–152. Dauce, E., Quoy, M., Cessac, B., Doyon, B., & Samuelides, M. (1998). Self-organization and dynamics reduction in recurrent networks: Stimulus presentation and learning. Neural Networks, 11, 521–533. Domany, E., van Hemmen, J., & Schulten, K. (Eds.). (1995). Models of neural networks (2nd ed.). Berlin: Springer. 
Eckmann, J., & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57(3), 617–656. Erdi, P. (1996). The brain as a hermeneutic device. Biosystems, 38, 179–189. Forrest, B., & Wallace, D. (1995). Models of neural netWorks (2nd ed.). Berlin: Springer. Franosch, J.-M., Lingenheil, M., & van Hemmen, J. (2005). How a frog can learn what is where in the dark. Physical Review Letters, 95, 1–4. Freeman, W. (2002). Biocomputing. Norwell, MA: Kluwer. Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics, 87, 459–470. Gardner, E. (1987). Maximum storage capacity in neural networks. Europhysics Letters, 4, 481–485. Gardner, E., & Derrida, B. (1989). Three unfinished works on the optimal storage capacity of networks. J. Physics A: Math. Gen., 22, 1983–1994. Grossberg, S. (1992). Neural networks and natural intelligence. Cambridge, MA: MIT Press . Guillot, A., & Dauce, E. (Eds.). (2002). Approche dynamique de la cognition artificielle. Paris: Herms Science.


Gutfreund, Y., Zheng, W., & Knudsen, E. I. (2002). Gated visual input to the central auditory system. Science, 297, 1556–1559. Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computation Neuroscience, 3, 7–34. Hebb, D. (1949). The organization of behavior. New York: Wiley-Interscience. Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558. Ikeda, K., Matsumoto, K., & Otsuka, K. (1989). Maxwell-Bloch turbulence. Progress of Theoretical Physics, 99(Suppl.), 295–324. Kaneko, K. (1992). Pattern dynamics in spatiotemporal chaos. Physica D, 34, 1– 41. Kaneko, K., & Tsuda, I. (2003). Chaotic itinerancy. Chaos: Focus Issue on Chaotic Itinerancy, 13(3), 926–936. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425, 954– 956. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Levy, W., & Steward, O. (1983). Temporal contiguity requirements for long term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791– 797. Llineas, R. (2001). I of the vortex: From neurons to self. Cambridge, MA: MIT Press. Molter, C., & Bersini, H. (2003). How chaos in small Hopfield networks makes sense of the world. In Proceedings of the International Joint Conference on Neural Networks conference. Piscataway, NJ: IEEE Press. Molter, C., Salihoglu, U., & Bersini, H. (2005a). Introduction of an Hebbian unsupervised learning algorithm to boost the encoding capacity of Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference, Montreal. Los Alamitos, CA: IEEE Computer Society Press. Molter, C., Salihoglu, U., & Bersini, H. (2005b). 
Learning cycles brings chaos in continuous Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference. Los Alamitos, CA: IEEE Computer Society Press. Nicolis, J., & Tsuda, I. (1985). Chaotic dynamics of information processing: The “magic number seven plusminus two” revisited. Bulletin of Mathematical Biology, 47, 343–65. Omlin, C. (2001). Understanding and explaining DRN behavior. In K. Kremer (Ed.), A field guide to dynamical recurrent networks. Piscataway, NJ: IEEE Press. Pasemann, F. (2002). Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, 13(2), 195–216. Piaget, J. (1963). The psychology of intelligence. New York: Routledge. Pomeau, Y., & Manneville, P. (1980). Intermittent transitions to turbulence in dissipative dynamical systems. Comm. Math. Phys., 74, 189–197. Rodriguez, E., George, N., Lachaux, J., Renault, B., Martinerie, J., Reunault, B., & Varela, F. (1999). Perception’s shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.


Rössler, O. E. (1983). The chaotic hierarchy. Zeitschrift für Naturforschung, 38a, 788–801. Sejnowski, T. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321. Skarda, C., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195. Sompolinsky, H., Crisanti, A., & Sommers, H. (1988). Chaos in random neural networks. Physical Review Letters, 61, 258–262. Tsuda, I. (1992). Dynamic link of memory-chaotic memory map in nonequilibrium neural networks. Neural Networks, 5, 313–326. Tsuda, I. (2001). Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioral and Brain Sciences, 24, 793–847. van Hemmen, J., & Kuhn, R. (1995). Models of neural networks (2nd ed.). Berlin: Springer. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. Varela, F., Thompson, E., & Rosch, E. (1991). The embodied mind: Cognitive science and human experience. Cambridge, MA: MIT Press. Wolf, A., Swift, J., Swinney, H., & Vastano, J. (1984). Determining Lyapunov exponents from a time series. Physica, D16, 285–317.

Received April 6, 2005; accepted May 30, 2006.

LETTER

Communicated by Herbert Jaeger

Analysis and Design of Echo State Networks

Mustafa C. Ozturk [email protected]

Dongming Xu [email protected]

José C. Príncipe [email protected]
Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

Neural Computation 19, 111–138 (2007)
© 2006 Massachusetts Institute of Technology

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings


(Principe, 2001). Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995). Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) and contains information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter. The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by $\|\cdot\|$) of the reservoir’s weight matrix ($\|W\| < 1$). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters


relies on the selection of spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure for the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics. In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time on output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).
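The second shortcoming can be made concrete with a small numerical sketch. The Python/NumPy snippet below is only an illustration under stated assumptions: the helper names `spectral_radius` and `random_reservoir`, the random seed, and the rescaling step are not taken from the letter. It draws two sparse random reservoirs and rescales both to the same spectral radius, which shows that the constraint leaves the individual weights essentially free.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def random_reservoir(n_units, rho, rng):
    """Hypothetical helper: sparse random matrix rescaled to spectral radius rho.

    Before rescaling, entries are 0.4, -0.4, and 0 with probabilities
    0.025, 0.025, and 0.95 (the distribution used in the experiment of
    section 2.1). The rescaling itself is an assumption of this sketch.
    """
    W = rng.choice([0.4, -0.4, 0.0], size=(n_units, n_units),
                   p=[0.025, 0.025, 0.95])
    return W * (rho / spectral_radius(W))

rng = np.random.default_rng(0)          # arbitrary seed
W_a = random_reservoir(100, 0.88, rng)
W_b = random_reservoir(100, 0.88, rng)

# Same spectral radius, different weight matrices.
print(spectral_radius(W_a), spectral_radius(W_b))
print(np.allclose(W_a, W_b))
```

Both matrices satisfy the same echo state constraint, yet nothing in that constraint ties their entries, and hence the correlation among the resulting echo states, to each other.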
This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the


time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold. The design principle specifies that one should consider independently the correlation among the basis and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system’s degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal basis. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius integrating the information from the input-output joint space in the ESN bases. 
For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.
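The correlation time mentioned above, estimated by the first zero of the autocorrelation function, can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name `correlation_time`, the mean removal, the biased autocorrelation estimate, and the two test sinusoids are all assumptions.

```python
import numpy as np

def correlation_time(d):
    """Lag of the first non-positive sample of the autocorrelation of d.

    Returns None if the estimated autocorrelation never reaches zero.
    """
    d = np.asarray(d, dtype=float)
    d = d - d.mean()                                   # remove the mean
    acf = np.correlate(d, d, mode="full")[d.size - 1:]  # lags 0 .. n-1
    acf = acf / acf[0]                                 # normalize so acf[0] = 1
    crossings = np.nonzero(acf <= 0.0)[0]
    return int(crossings[0]) if crossings.size else None

n = np.arange(2000)
slow = np.sin(2 * np.pi * n / 200)   # slowly varying desired response
fast = np.sin(2 * np.pi * n / 20)    # rapidly varying desired response

slow_tau = correlation_time(slow)
fast_tau = correlation_time(fast)
print(slow_tau, fast_tau)
```

For a sinusoid, the first zero falls near a quarter of the period, so the slower signal yields the longer correlation time and, following the argument above, would call for a higher spectral radius.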

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with $M$ input units, $N$ internal PEs, and $L$ output units. The values of the input units at time $n$ are $u(n) = [u_1(n), u_2(n), \ldots, u_M(n)]^T$, those of the internal units are $x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$, and those of the output units are $y(n) = [y_1(n), y_2(n), \ldots, y_L(n)]^T$. The connection weights are given in an $N \times M$ weight matrix $W^{in} = (w^{in}_{ij})$ for connections between the input and the internal PEs, in an $N \times N$ matrix $W = (w_{ij})$ for connections between the internal PEs, in an $L \times N$ matrix $W^{out} = (w^{out}_{ij})$ for connections from PEs to the

Figure 1: An echo state network (ESN). [Diagram: the input $u(n)$ enters through the input layer $W^{in}$ into the dynamical reservoir with state $x(n)$ and recurrent weights $W$; the readout $W^{out}$ produces the output $y(n)$, which projects back to the reservoir through $W^{back}$.] ESN is composed of two parts: a fixed-weight ($\|W\| < 1$) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.

output units, and in an $N \times L$ matrix $W^{back} = (w^{back}_{ij})$ for the connections that project back from the output to the internal PEs (Jaeger, 2001). The activation of the internal PEs (echo state) is updated according to

$$x(n+1) = f\big(W^{in} u(n+1) + W x(n) + W^{back} y(n)\big), \tag{2.1}$$

where $f = (f_1, f_2, \ldots, f_N)$ are the internal PEs’ activation functions. Here, all $f_i$’s are hyperbolic tangent functions, $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. The output from the readout network is computed according to

$$y(n+1) = f^{out}\big(W^{out} x(n+1)\big), \tag{2.2}$$

where $f^{out} = (f^{out}_1, f^{out}_2, \ldots, f^{out}_L)$ are the output unit’s nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear so $f^{out}$ is identity. ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine


interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let $h(u(t))$ be a real-valued function of a real-valued vector $u(t) = [u_1(t), u_2(t), \ldots, u_M(t)]^T$. In functional approximation, the goal is to estimate the behavior of $h(u(t))$ as a combination of simpler functions $\varphi_i(t)$, called the basis functionals, such that its approximant, $\hat{h}(u(t))$, is given by

$$\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(t).$$

Here, $a_i$’s are the projections of $h(u(t))$ onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes for a complete set of orthogonal basis, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single hidden layer TDNN with $N$ PEs and a linear output. The hidden-layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,

$$\varphi_i(u(t)) = g\Big(\sum_{j} b_{ij} u_j(t)\Big).$$

The $b_{ij}$’s are the input layer weights, and $g$ is the PE nonlinearity. The approximation produced by the TDNN is then

$$\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(u(t)), \tag{2.3}$$

where $a_i$’s are the weights of the output layer. Notice that the $b_{ij}$’s adapt the bases and the $a_i$’s adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually,


since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the $\varphi_i(u(t))$’s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article. The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through $W^{in}$, $W$, and $W^{back}$. Notice, however, that none of these weight matrices is adapted, that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer for a brute force approach of creating fixed diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation ($y(n+1) = W^{out} x(n+1)$) has the same form as equation 2.3, where the $\varphi_i$’s and $a_i$’s are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN. A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses.
They proposed that “the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands”. The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of $W$ are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and $W^{back}$ is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout. One method

Figure 2: Performances of ESNs for different realizations of $W$ with the same weight distribution. [Plot: MSE for different realizations versus realization index, 0 to 50.] The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from $5.9 \times 10^{-9}$ to $8.9 \times 10^{-5}$. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

to determine the optimal output weight matrix, $W^{out}$, in the mean square error (MSE) sense (where MSE is defined by $O = \frac{1}{2}(d - y)^T(d - y)$) is to use the Wiener solution given by Haykin (2001):

$$W^{out} = E[x x^T]^{-1} E[x d] \cong \Big(\frac{1}{N}\sum_{n} x(n) x(n)^T\Big)^{-1} \Big(\frac{1}{N}\sum_{n} x(n) d(n)\Big). \tag{2.4}$$
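The experiment can be sketched in a few lines of Python/NumPy. This is a hedged illustration rather than the authors' code. It assumes the input is read as $\sin(2\pi n/(10\pi))$, rescales $W$ to a spectral radius of 0.88, discards a 100-step washout, and averages the squared error over the retained samples; all four choices, the seed, and the variable names are assumptions. The readout is obtained with `np.linalg.lstsq` applied to the state matrix, which solves the same normal equations as the sample form of equation 2.4 without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed (assumption)
N, T, washout = 100, 300, 100       # washout length is an assumption

# Sparse random reservoir, rescaled to spectral radius 0.88.
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
W *= 0.88 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([1.0, -1.0], size=N)   # input weights of +1 or -1

n = np.arange(T + 1)
u = np.sin(2 * np.pi * n / (10 * np.pi))  # assumed reading of the input signal
d = u ** 7                                # desired signal: seventh power

# Echo states from equation 2.1 with W_back = 0 and tanh PEs.
X = np.zeros((T + 1, N))
for k in range(T):
    X[k + 1] = np.tanh(w_in * u[k + 1] + W @ X[k])

# Linear readout: least-squares solution of X w = d after the washout.
Xs, ds = X[washout + 1:], d[washout + 1:]
w_out, *_ = np.linalg.lstsq(Xs, ds, rcond=None)

y = Xs @ w_out
mse = 0.5 * np.mean((ds - y) ** 2)  # averaged form of O (assumption)
print(mse)
```

Rerunning the sketch with different seeds gives one way to observe how much the training error varies across reservoirs that share the same spectral radius.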

In equation 2.4, $E[\cdot]$ denotes the expected value operator, and $d$ denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESN. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is $5.9 \times 10^{-9}$, whereas the maximum MSE is $8.9 \times 10^{-5}$. This experiment


demonstrates that a design strategy that is based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice. 2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by x(n + 1) = f(Win u(n + 1) + Wx(n)). Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by

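$$J(n+1) = \begin{bmatrix} \dot{f}(net_1(n))\,w_{11} & \dot{f}(net_1(n))\,w_{12} & \cdots & \dot{f}(net_1(n))\,w_{1N} \\ \dot{f}(net_2(n))\,w_{21} & \dot{f}(net_2(n))\,w_{22} & \cdots & \dot{f}(net_2(n))\,w_{2N} \\ \cdots & \cdots & \cdots & \cdots \\ \dot{f}(net_N(n))\,w_{N1} & \dot{f}(net_N(n))\,w_{N2} & \cdots & \dot{f}(net_N(n))\,w_{NN} \end{bmatrix} = \begin{bmatrix} \dot{f}(net_1(n)) & 0 & \cdots & 0 \\ 0 & \dot{f}(net_2(n)) & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \dot{f}(net_N(n)) \end{bmatrix} \cdot W = F(n) \cdot W. \tag{2.5}$$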
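Equation 2.5 translates directly into code. The sketch below is an illustration under stated assumptions: the function name `jacobian`, the seed, the rescaling to a spectral radius of 0.95, and the choice of evaluating at the zero state are not from the letter. It builds $J = F \cdot W$ for tanh PEs, whose derivative is $1 - \tanh^2$. At the zero state the net input of every PE is $\pm u$, so $F$ is a multiple of the identity and the poles are those of $W$ scaled by $1 - \tanh^2(u)$.

```python
import numpy as np

def jacobian(W, w_in, u_next, x):
    """J = F W for tanh PEs, linearized around state x with next input u_next."""
    net = w_in * u_next + W @ x
    F = np.diag(1.0 - np.tanh(net) ** 2)   # derivative of tanh at each net input
    return F @ W

def spectral_radius(A):
    """Largest pole magnitude of the linearized system."""
    return np.max(np.abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(0)              # arbitrary seed (assumption)
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / spectral_radius(W)              # rescale to spectral radius 0.95
w_in = rng.choice([1.0, -1.0], size=N)
x0 = np.zeros(N)

J_zero = jacobian(W, w_in, 0.0, x0)         # zero input: F is the identity
J_sat = jacobian(W, w_in, 2.0, x0)          # large input: PEs near saturation

rho_zero = spectral_radius(J_zero)
rho_sat = spectral_radius(J_sat)
print(rho_zero, rho_sat)
```

In this special case the pole magnitudes shrink by the common factor $1 - \tanh^2(2)$; for a general state the diagonal entries of $F(n)$ differ from one PE to the next.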

In equation 2.5, $net_i(n)$ is the $i$th entry of the vector $(W^{in} u(n+1) + W x(n))$, and $w_{ij}$ denotes the $(i, j)$th entry of $W$. The poles of the linearized system at time $n+1$ are given by the eigenvalues of the Jacobian matrix $J(n+1)$.¹

¹ The transfer function of a linear system $x(n+1) = Ax(n) + Bu(n)$ is $\frac{X(z)}{U(z)} = (zI - A)^{-1}B = \frac{\mathrm{Adjoint}(zI - A)}{\det(zI - A)}B$. The poles of the transfer function can be obtained by solving $\det(zI - A) = 0$. The solution corresponds to the eigenvalues of $A$.

As the amplitude of each PE changes, the local slope changes, and so the poles of


the linearized system are time varying, although the parameters of ESN are fixed. In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4 and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input. A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. 
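The corollary about a constant bias can be explored numerically. The following sketch is only illustrative: the function name `settled_radius`, the seed, the 200 settling steps, and the list of bias values are assumptions. It drives a fixed reservoir with a constant bias, then evaluates the spectral radius of the linearized system $J = F \cdot W$ at the resulting state.

```python
import numpy as np

rng = np.random.default_rng(1)              # arbitrary seed (assumption)
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.95
w_in = rng.choice([1.0, -1.0], size=N)

def settled_radius(bias, steps=200):
    """Spectral radius of J = F W after driving the ESN with a constant bias."""
    x = np.zeros(N)
    for _ in range(steps):                   # iterate the state update
        x = np.tanh(w_in * bias + W @ x)
    net = w_in * bias + W @ x
    # Row scaling by the tanh derivative is the same as diag(f'(net)) @ W.
    J = (1.0 - np.tanh(net) ** 2)[:, None] * W
    return np.max(np.abs(np.linalg.eigvals(J)))

biases = [0.0, 0.5, 1.0, 2.0]
radii = [settled_radius(b) for b in biases]
for b, r in zip(biases, radii):
    print(f"bias={b:.1f}  spectral radius of J={r:.3f}")
```

With zero bias the state stays at the origin and the linearized spectral radius equals that of $W$; the printed values for nonzero biases show how the operating point changes it while $W$ itself is left untouched.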
The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.[2] The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschläger (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

[2] Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, where $W = PDP^{-1}$, P is the eigenvector matrix, and D is the diagonal matrix of eigenvalues ($D_{ii}$) of W. Since F(n) and D are diagonal, $J(n + 1) = F(n)W = F(n)(PDP^{-1}) = P(F(n)D)P^{-1}$ is the eigendecomposition of J(n + 1). Here, each entry of F(n)D, $f'(net_i(n))D_{ii}$, is an eigenvalue of J. Therefore, $|f'(net_i(n))D_{ii}| \le |D_{ii}|$ since $f'(net_i) \le f'(0)$.

3 Average State Entropy as a Measure of the Richness of ESN Reservoir

Previous research was aware of the influence of diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs. Several metrics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,

Analysis and Design of Echo State Networks

[Figure 3 image omitted. Panel (A): amplitude versus time (0 to 100), with the input values B, C, D, and E marked. Panels (B) to (F): pole locations in the z-plane, imaginary part versus real part.]

Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of sinusoidal signal with a period of 100. (B–E) The positions of poles of the linearized systems when the input values are at B, C, D, and E in Figure 3A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with more states results in a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems.


2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous values of the ESN states. If the echo states’ instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by trajectories.

Renyi’s quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi’s entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi’s entropy with parameter γ for a random variable X with a pdf $f_X(x)$ is given by Renyi (1970):

$$H_\gamma(X) = \frac{1}{1-\gamma} \log E\left[ f_X^{\gamma-1}(X) \right].$$

Renyi’s quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon’s entropy is obtained). Given N samples $\{x_1, x_2, \ldots, x_N\}$ drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

$$f_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),$$

where $K_\sigma$ is the kernel function with the kernel size σ. Then Renyi’s quadratic entropy can be estimated by (Principe et al., 2000)

$$H_2(X) = -\log\left( \frac{1}{N^2} \sum_{j} \sum_{i} K_\sigma(x_j - x_i) \right). \tag{3.1}$$
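As a rough illustration of equation 3.1, the estimator can be written directly as a double sum over sample pairs. The sketch below assumes a Gaussian kernel applied exactly as the equation is written; the function name `renyi_quadratic_entropy`, the test data, and the fixed kernel size of 0.3 are illustrative choices, not values taken from the experiments.

```python
import numpy as np


def renyi_quadratic_entropy(samples, sigma):
    """Estimate H2 from 1-D samples with a Gaussian kernel of size sigma (eq. 3.1)."""
    x = np.asarray(samples, dtype=float)
    diff = x[:, None] - x[None, :]                 # all pairwise differences x_j - x_i
    kernel = np.exp(-diff**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(kernel.mean())                  # mean = (1/N^2) * double sum


rng = np.random.default_rng(0)
spread = rng.uniform(-1.0, 1.0, size=100)          # amplitudes covering the whole range
clustered = rng.choice([-0.5, 0.5], size=100)      # amplitudes concentrated on two values
sigma = 0.3                                        # assumed fixed kernel size for comparison

h_spread = renyi_quadratic_entropy(spread, sigma)
h_clustered = renyi_quadratic_entropy(clustered, sigma)
print("spread:", h_spread, "clustered:", h_clustered)
```

Amplitudes that cover the dynamic range yield a higher estimate than amplitudes concentrated on a few values, which is the behavior the text relies on.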


The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector $x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$ of an ESN with N internal PEs. Results will be shown with a Gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter than the spectral radius for quantifying the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii, and even with the same spectral radius, display different ASEs. Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii (0.2, 0.5, and 0.8), with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states’ instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes values −0.735, −0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible. Figure 4C shows ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.
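A minimal sketch of how ASE might be computed for the three spectral radii is given below. It assumes a tanh state update (equation 2.1 is not reproduced here), the ternary weight distribution of the previous section, and a discarded initial transient of 100 steps; `state_entropy` and `average_state_entropy` are hypothetical names. Because the random realization differs from the one used for Figure 4, the printed values are not expected to reproduce the ASE numbers reported in the text.

```python
import numpy as np


def state_entropy(x):
    """Quadratic Renyi entropy of one state vector; kernel size is 0.3 * std of its entries."""
    sigma = 0.3 * np.std(x)
    diff = x[:, None] - x[None, :]
    kernel = np.exp(-diff**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(kernel.mean())


def average_state_entropy(radius, n_steps=200, n_discard=100, n_units=100, seed=0):
    """Time average of the instantaneous state entropy for input sin(2*pi*n/20)."""
    rng = np.random.default_rng(seed)
    W = rng.choice([0.0, 0.4, -0.4], size=(n_units, n_units), p=[0.9, 0.05, 0.05])
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))   # set the spectral radius
    w_in = rng.choice([1.0, -1.0], size=n_units)
    x = np.zeros(n_units)
    entropies = []
    for n in range(n_discard + n_steps):
        x = np.tanh(w_in * np.sin(2 * np.pi * n / 20) + W @ x)
        if n >= n_discard:                               # skip the initial transient
            entropies.append(state_entropy(x))
    return float(np.mean(entropies))


ase = {radius: average_state_entropy(radius) for radius in (0.2, 0.5, 0.8)}
for radius, value in ase.items():
    print(f"spectral radius {radius}: ASE = {value:.3f}")
```

Calling `average_state_entropy(0.5, seed=s)` for several seeds gives a spread of ASE values at a fixed spectral radius, in the spirit of Figure 4C.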
Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set in ESNs is created independently of the desired response and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights.

4 Designing Echo State Networks

4.1 Design of the Echo State Recurrent Connections. According to the interpretation of ESNs as coupled linear systems, the design of the internal


connection matrix, W, will be based on the distribution of the poles of the linearized system around zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design scenario, the system dynamics will include uniform coverage of time constants arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system with finite order is achieved when poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should uniformly spread the poles to anticipate good approximation to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of a circle as boundary conditions for discrete linear systems and complex conjugate locations are easy to include for the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Renyi’s entropy is calculated by equation 3.1, and the poles are moved such that the entropy of the new distribution is increased over iterations (Erdogmus & Principe, 2002). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles.

The system with the uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters of different center frequencies.
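The entropy-driven pole placement described above can be illustrated with a small projected-gradient sketch. This is not the procedure of the cited references; it is one simple way, under assumptions, to increase the quadratic entropy of a conjugate-symmetric pole set while keeping all poles inside a disk. The kernel size `s`, the step size, the iteration count, and all function names are hypothetical choices.

```python
import numpy as np


def information_potential(points, s):
    """Mean Gaussian-shaped kernel over all pairs; -log of it is the quadratic
    entropy up to an additive constant."""
    diff = points[:, None, :] - points[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2 * s**2)).mean()


def with_conjugates(z):
    """Stack free poles (re, im) with their complex-conjugate mirrors (re, -im)."""
    return np.vstack([z, z * np.array([1.0, -1.0])])


def spread_poles(n_pairs=15, radius=0.9, s=0.3, step=0.1, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    z = radius * rng.uniform(-0.5, 0.5, size=(n_pairs, 2))   # random start inside the disk
    z_init = z.copy()
    for _ in range(n_iter):
        p = with_conjugates(z)
        diff = z[:, None, :] - p[None, :, :]                  # free poles minus all poles
        w = np.exp(-np.sum(diff**2, axis=-1) / (2 * s**2))
        # Gradient of the information potential with respect to the free poles.
        grad = -4.0 * np.sum(w[..., None] * diff, axis=1) / (len(p) ** 2 * s**2)
        z = z - step * grad                                   # lower potential, higher entropy
        norm = np.linalg.norm(z, axis=1, keepdims=True)
        z = z * np.minimum(1.0, radius / np.maximum(norm, 1e-12))  # project onto the disk
    return z_init, z


z_init, z = spread_poles()
h_before = -np.log(information_potential(with_conjugates(z_init), 0.3))
h_after = -np.log(information_potential(with_conjugates(z), 0.3))
poles = np.concatenate([z[:, 0] + 1j * z[:, 1], z[:, 0] - 1j * z[:, 1]])
print("entropy before:", h_before, "after:", h_after)
```

Only one pole of each conjugate pair is a free variable, so the resulting set is conjugate symmetric by construction and can be turned into a real weight matrix. Purely real poles are not produced by this sketch.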
Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues of W). In principle, we would like to create a sparse

Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while for a spectral radius of 0.8, outputs of echo states are almost uniformly distributed within their dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. Information contained in the echo states changes over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.


[Figure 4 image omitted. Panel (A), Echo States: echo state outputs (−1 to 1) versus time (0 to 200) for the three networks. Panel (B), State Entropy: entropy versus time (0 to 200) for spectral radius = 0.2, 0.5, and 0.8. Panel (C), Different ASEs for the same spectral radius: ASE versus trials (1 to 50).]


matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

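$$W = \begin{bmatrix}
-a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}. \tag{4.1}$$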

The characteristic polynomial of W is

$$l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N), \tag{4.2}$$

where the $p_i$’s are the eigenvalues and the $a_i$’s are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form using equation 4.1. We will call the ESN constructed based on the uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by $Q^{-1}WQ$, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with the recurrent weight matrix having the eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with random internal connection weight matrices. We consider an ESN with 30 states and use our procedure to create the W matrix for ASE-ESN for different spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices with different sparseness constraints. This corresponds to a weight distribution having the values 0, c, and −c with probabilities $p_1$, $(1 - p_1)/2$, and $(1 - p_1)/2$, where $p_1$ defines the sparseness of W and c is a constant that takes a specific value depending on the spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Haas, 2004). Then, for different $W^{in}$ matrices, we ran the ASE-ESNs with the sinusoidal input given in section 3 and calculated ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE on average for all spectral radii compared to ESNs with sparse and uniform random connections.
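The construction of W from a set of pole locations via equations 4.2 and 4.1 can be sketched as follows. The function name `canonical_reservoir` and the example poles are hypothetical; the poles must form a conjugate-symmetric set so that the polynomial coefficients are real.

```python
import numpy as np


def canonical_reservoir(poles):
    """Direct canonical form (eq. 4.1) whose eigenvalues are the given poles."""
    coeffs = np.real(np.poly(poles))              # [1, a_1, ..., a_N] from equation 4.2
    n = len(poles)
    W = np.zeros((n, n))
    W[0, :] = -coeffs[1:]                         # first row: -a_1 ... -a_N
    W[np.arange(1, n), np.arange(n - 1)] = 1.0    # ones on the subdiagonal
    return W


# Illustrative conjugate-symmetric pole set inside the unit circle.
poles = np.array([0.5 + 0.3j, 0.5 - 0.3j, -0.4 + 0.6j, -0.4 - 0.6j, 0.8, -0.2])
W = canonical_reservoir(poles)

# Largest distance from any requested pole to the nearest computed eigenvalue.
eig = np.linalg.eigvals(W)
mismatch = np.abs(eig[:, None] - poles[None, :]).min(axis=0).max()
print("nonzero entries:", np.count_nonzero(W), "pole mismatch:", mismatch)
```

The matrix has at most 2N − 1 nonzero entries. For large N, the coefficients of a high-degree polynomial are numerically sensitive, so the eigenvalues of the constructed matrix should be checked against the intended poles, as done above.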
This approach is indeed conceptually similar to Jeffreys’ maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems. Concentrating the poles of the linearized


[Figure 5 image omitted. ASE versus spectral radius (0 to 1) for ASE-ESN, U-ESN, and random ESNs with sparseness = 0.2, 0.1, and 0.07.]

Figure 5: Comparison of ASE values obtained for ASE-ESN having W with uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN with uniformly distributed weights between −1 and 1. Randomly generated weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.

system in certain regions of the space provides good performance only if the desired response has energy in this part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

4.2 Design of the Adaptive Bias. In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are only input dependent. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. From the linearization analysis that shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that according to the linearization analysis, the bias can only reduce the spectral radius. The information for adaptation of the bias is the MSE in training, which modulates the spectral radius of the system with the information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated in the definition of the projection space of the ESN. The beauty of this method is that the spectral


radius can be adjusted by a single parameter that is external to the system without changing reservoir weights. The training of the bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be efficiently employed to optimize the bias. Among different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within the prescribed length. In our problem, a bias value is picked according to the Fibonacci search. For each value of the bias, training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.

Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

$$x(n + 1) = f(W^{in} u(n + 1) + W^{in} b + W x(n)).$$

The update equation for b is given by

$$\frac{\partial O(n + 1)}{\partial b} = -e \cdot W^{out} \times \frac{\partial x(n + 1)}{\partial b} \tag{4.3}$$

$$= -e \cdot W^{out} \times \left[ \dot{f}(net_{n+1}) \cdot \left( W \times \frac{\partial x(n)}{\partial b} + W^{in} \right) \right]. \tag{4.4}$$
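The line-search alternative described in this section can be sketched as follows. The code is a minimal illustration under assumptions: `fibonacci_search` is a textbook-style Fibonacci search, and `training_mse` is a hypothetical toy objective (a small tanh ESN with a least-squares readout on an invented target), not one of the experiments reported here. The search is only guaranteed to locate the minimum if the objective is unimodal on the interval, which is assumed rather than verified for the toy objective.

```python
import numpy as np


def fibonacci_search(objective, lo, hi, n_evals=20):
    """Minimize a unimodal 1-D objective on [lo, hi] with n_evals evaluations."""
    fib = [1, 1]
    while len(fib) <= n_evals:
        fib.append(fib[-1] + fib[-2])
    n = n_evals
    x1 = lo + fib[n - 2] / fib[n] * (hi - lo)
    x2 = lo + fib[n - 1] / fib[n] * (hi - lo)
    f1, f2 = objective(x1), objective(x2)
    for k in range(1, n - 1):
        if f1 > f2:                               # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + fib[n - k - 1] / fib[n - k] * (hi - lo)
            f2 = objective(x2)
        else:                                     # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = lo + fib[n - k - 2] / fib[n - k] * (hi - lo)
            f1 = objective(x1)
    return 0.5 * (lo + hi)                        # midpoint of the final interval


def training_mse(bias, n_units=30, n_samples=400, n_discard=100, seed=0):
    """Training MSE of a small tanh ESN for a given bias (hypothetical toy task)."""
    rng = np.random.default_rng(seed)             # same network and data for every bias
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    w_in = rng.choice([1.0, -1.0], size=n_units)
    u = rng.uniform(-0.5, 0.5, size=n_samples)
    d = np.roll(u, 1) * np.roll(u, 2)             # toy target: product of two past inputs
    x = np.zeros(n_units)
    states = np.empty((n_samples, n_units))
    for n in range(n_samples):
        x = np.tanh(w_in * u[n] + w_in * bias + W @ x)   # bias enters through W_in
        states[n] = x
    X, y = states[n_discard:], d[n_discard:]      # drop the transient (and roll wrap-around)
    w_out, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares readout
    return float(np.mean((y - X @ w_out) ** 2))


best_bias = fibonacci_search(training_mse, 0.0, 3.0)
print("selected bias:", best_bias, "training MSE:", training_mse(best_bias))
```

Each objective evaluation retrains the readout, so the cost of the search is the number of evaluations times the cost of one training pass. In this sketch the last evaluation coincides with an existing point; textbook versions displace it slightly.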

Here, O is the MSE defined previously. This algorithm may suffer from problems similar to those observed in gradient-based training of recurrent networks. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n − k), for a given delay k. Denoting the optimal output signal $y_k(n)$, the k-delay


STM capacity of a network, $MC_k$, is defined as the squared correlation coefficient between u(n − k) and $y_k(n)$ (Jaeger, 2002a). The STM capacity, MC, of the network is defined as $MC = \sum_{k=1}^{\infty} MC_k$. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an N-unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. ESNs are driven by an i.i.d. random input signal, u(n), that is uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n − 1), . . . , u(n − 40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47, −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding output and $MC_k$ are calculated.
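The memory capacity measurement for the R-ESN configuration can be sketched as follows. Equations 2.1 and 2.4 are not reproduced in this section, so the sketch assumes a tanh state update and a least-squares readout; the sum over delays is truncated at 40, as in the experiment. With a single random realization and short records, the printed estimate is noisy and is not expected to match the averaged values reported for Figure 6.

```python
import numpy as np

rng = np.random.default_rng(0)
N, max_delay, n_samples, n_discard = 20, 40, 200, 100

# R-ESN-style reservoir: 0, +0.47, -0.47 with probabilities 0.8, 0.1, 0.1.
W = rng.choice([0.0, 0.47, -0.47], size=(N, N), p=[0.8, 0.1, 0.1])
w_in = rng.choice([0.1, -0.1], size=N)


def run(u):
    """Return [states, input] regressors (direct input-to-output connections allowed)."""
    x = np.zeros(N)
    rows = []
    for u_n in u:
        x = np.tanh(w_in * u_n + W @ x)
        rows.append(np.append(x, u_n))
    return np.array(rows)


def delayed_targets(u):
    """Column k-1 holds u(n - k) for k = 1..max_delay."""
    return np.column_stack([np.roll(u, k) for k in range(1, max_delay + 1)])


u_train = rng.uniform(-0.5, 0.5, size=n_samples)
u_test = rng.uniform(-0.5, 0.5, size=n_samples)
keep = slice(n_discard, None)   # drop the transient; this also removes np.roll wrap-around

# One least-squares readout per delay, trained on the training record.
w_out, *_ = np.linalg.lstsq(run(u_train)[keep], delayed_targets(u_train)[keep], rcond=None)
y_test = run(u_test)[keep] @ w_out
d_test = delayed_targets(u_test)[keep]

# k-delay capacity: squared correlation between u(n - k) and the trained output.
mc_k = np.array([np.corrcoef(d_test[:, k], y_test[:, k])[0, 1] ** 2
                 for k in range(max_delay)])
print("MC estimate (sum over 40 delays):", mc_k.sum())
```

Replacing `W` with a matrix built from uniformly spread poles, with everything else unchanged, would give the corresponding ASE-ESN estimate.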
Figure 6 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1, . . . , 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly generated ESN given in Jaeger (2002a), in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N = 20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9. In fact, for small k, large ASE degrades the performance because the tasks do not need long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN has only slightly better STM than R-ESN, which uses just three different weight values, although U-ESN has many more distinct weight values. It is also significant to note that the MC will be very poor for an ESN with smaller spectral radius even with an adaptive bias, since the problem requires large ASE and bias can only reduce ASE. This experiment demonstrates the


[Figure 6 image omitted. Memory capacity (0 to 1) versus delay (0 to 40) for R-ESN, U-ESN, ASE-ESN, and BASE-ESN.]

Figure 6: The k-delay STM capacity of each ESN for delays 1, . . . , 40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for different W and Win matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively.

suitability of maximizing ASE in tasks that require a substantial memory length.

5.2 Binary Parity Check. The effect of the adaptive bias was marginal in the previous experiment since the nature of the problem required large ASE values. However, there are tasks in which the optimal solutions require smaller ASE values and smaller spectral radius. Those are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology. Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train an ESN to generate the m-bit parity corresponding to the last m bits received, where m is 3, . . . , 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight


[Figure 7 image omitted. Number of wrong decisions (0 to 350) versus m (3 to 8) for ASE-ESN, R-ESN, and BASE-ESN.]

Figure 7: The number of wrong decisions made by each ESN for m = 3, . . . , 8 in the binary parity check problem. The results are averaged over 100 different realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and $W^{in}$ matrices with the specifications given in the text. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699.

matrix is calculated using equation 2.4. For the ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, . . . , 8. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. BASE-ESN performs a lot better than ASE-ESN and slightly better than the R-ESN since the adaptive bias reduces the spectral radius effectively. Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN, since the task requires access to a longer input history, which compromises the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In


summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

5.3 System Identification. This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

$$y(n + 1) = 0.3y(n) + 0.6y(n - 1) + f(u(n)),$$

where

$$f(u) = 0.6 \sin(\pi u) + 0.3 \sin(3\pi u) + 0.1 \sin(5\pi u).$$

The input to the system is chosen to be sin(2πn/25). We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with 30 internal units and a single input unit. The W matrix of each ESN is scaled such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8, 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas $W^{back}$ is set to 0. The optimal output weights are calculated using equation 2.4. The MSE values (averaged over 100 realizations) for R-ESN and ASE-ESN are $1.23 \times 10^{-5}$ and $1.83 \times 10^{-6}$, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from $1.83 \times 10^{-6}$ to $3.27 \times 10^{-9}$.

6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input.
The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with


arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows the fine-tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the design of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.

Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems “at the edge of chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004).
Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with “critical” parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton’s interpretation of edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.

Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same


time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, these results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs but also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.

There are many interesting issues to be researched in this exciting new area. Besides serving as an evaluation tool, ASE may also be utilized to train the ESN’s representation layer in an unsupervised fashion. In fact, with the stochastic information gradient (SIG) described in Erdogmus, Hild, and Principe (2003), we can easily adapt extra weights linking the outputs of recurrent states to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would continuously fine-tune, in an unsupervised manner, the parameters of the ESN among different inputs. However, it goes against the idea of a fixed ESN reservoir.

The reservoir of recurrent PEs can be thought of as a new form of a time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs to the peaks of the input, which is very appealing for signal processing and seems to be used in biology.
However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout. Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of solving this problem.

Finally, we would like to comment briefly on the implications of these models for neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (output of the biological system) needs to be generated, this simple computation to

Analysis and Design of Echo State Networks


read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESN. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally lowpass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESN with sigmoid PEs.

Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.


Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.


Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of IEEE, 78(10), 1550–1560.


Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received December 28, 2004; accepted June 1, 2006.

LETTER

Communicated by Ralph M. Siegel

Invariant Global Motion Recognition in the Dorsal Visual System: A Unifying Theory

Edmund T. Rolls [email protected]

Simon M. Stringer [email protected] Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, Oxford OX1 3UD, England

The motion of an object (such as a wheel rotating) is seen as consistent independent of its position and size on the retina. Neurons in higher cortical visual areas respond to these global motion stimuli invariantly, but neurons in early cortical areas with small receptive fields cannot represent this motion, not only because of the aperture problem but also because they do not have invariant representations. In a unifying hypothesis with the design of the ventral cortical visual system, we propose that the dorsal visual system uses a hierarchical feedforward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. Simulations show that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture. The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, in-plane rotational motion, looming versus receding of the object, and object-based rotation about a principal axis. Thus, the dorsal and ventral visual systems may share some similar computational principles.

1 Introduction

A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects such as a rotating wheel invariantly with respect to position on the retina, and size. For example, we perceive the wheel shown in Figures 1 and 4a rotating clockwise independent of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite (as indicated in the dashed box in Figure 1). How could this invariance of the visual motion perception of objects arise in the visual system? Invariant motion representations are known to be developed in the cortical dorsal visual system.
Motion-sensitive neurons in V1 have small receptive fields

Neural Computation 19, 139–169 (2007)   © 2006 Massachusetts Institute of Technology


E. Rolls and S. Stringer

Figure 1: A wheel rotating clockwise at different locations on the retina. How can a network learn to represent the clockwise rotation independent of the location of the moving object? The dashed box shows that local motion cues available at the beginning of the visual system are ambiguous about the direction of rotation when the stimulus is seen in different locations. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field and even though the local motion cues may be ambiguous, as shown in the dashed box.

(in the range 1–2 degrees at the fovea), and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5 degrees at the fovea) and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice, as little as 55%) move in one direction, or to the overall direction of a

Invariant Global Motion Recognition


moving plaid, the orthogonal grating components of which have motion at 45 degrees to the overall motion (Movshon, Adelson, Gizzi, & Newsome, 1985; Newsome, Britten, & Movshon, 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or looming with considerable translation invariance (Graziano, Andersen, & Snowden, 1994; Geesaman & Andersen, 1996).

It is known that single neurons in the ventral visual system have translation, size, and even view-invariant representations of stationary objects (Rolls & Deco, 2002; Desimone, 1991; Tanaka, 1996; Logothetis & Sheinberg, 1996; Rolls, 1992, 2000, 2006). A theory that can account for this uses a feature hierarchy network (Fukushima, 1980; Rolls, 1992; Wallis & Rolls, 1997; Riesenhuber & Poggio, 1999) combined with an associative Hebb-like learning rule (in which the synaptic weights increase in proportion to the pre- and postsynaptic firing rates) with a short-term memory of, for example, 1 sec, to enable different instances of the stimulus to be associated together as the visual objects transform continuously from second to second in the world (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Bartlett & Sejnowski, 1998; Rolls & Milward, 2000; Stringer & Rolls, 2000, 2002; Rolls & Deco, 2002).

In a unifying hypothesis, we propose here that the analysis of invariant motion in the dorsal visual system uses a similar architecture and learning rule, but in contrast utilizes as its inputs neurons that respond to local motion of the type found in the primary visual cortex, V1 (Wurtz & Kandel, 2000; Duffy, 2004; Bair & Movshon, 2004). A feature of the theory is that motion in the visual field is computed only once in V1 (by processes that take into account luminance changes across short times) and that the representations of motion that develop in the dorsal visual system require no further computation of time-delay-related firing to compute motion.
The theory is of interest, for it proposes that some aspects of the computations in parts of the cerebral cortex that appear to be involved in different types of visual function, the dorsal and ventral visual systems, may in fact be performed by some similar organizational and computational principles.

2 The Theory and Its Implementation in a Model

2.1 The Theory. We propose that the general architecture of the dorsal visual system areas we consider is a feedforward feature hierarchy network, the inputs to which are local motion-sensitive neurons of V1 with receptive fields of approximately 1 degree in diameter (see Figure 2). There is convergence from stage to stage, so that a neuron at any one stage need receive only a limited number of inputs from the preceding stage, yet by the end of the network, an effectively global computation that can take into account information derived from different parts of the retina can have been performed. Within each cortical layer of the architecture (or layer of the network), local lateral inhibition implemented by inhibitory feedback neurons implements competition between the neurons, in such a way that fast-firing neurons


Figure 2: Stylized image of hierarchical organization in the dorsal as well as ventral visual system. The architecture is captured by the VisNet model, in which convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina. [Diagram labels, top to bottom: Layer 4, Layer 3, Layer 2, Layer 1.]

inhibit other neurons in the vicinity, so that the overall activity within an area is kept within bounds. The competition may be nonlinear, due in part to the threshold nonlinearity of neurons, and this competition, helped by the diluted connectivity (i.e., the fact that only a low proportion of the neurons are connected), enables some neurons to respond to particular combinations of the inputs being received from the preceding area (Rolls & Deco, 2002; Deco & Rolls, 2005). These aspects of the architecture potentially enable single neurons at higher stages of the network to respond to combinations of the local motion inputs from V1 to the first layer of the network. These combinations, helped by the increasingly larger receptive fields, could include global motion to partly randomly moving dots (and to plaids) over


areas as large as 5 degrees in MT (Wurtz & Kandel, 2000; Duffy & Wurtz, 1996). In the architecture shown in Figure 2, layer 1 might correspond to MT; layer 2 to MST, which has receptive fields of 15–65 degrees in diameter; and layers 3 and 4 to areas in the parietal cortex such as 7a and to areas in the cortex in the superior temporal sulcus, which receives from parietal visual areas where view-invariant object-based motion is represented (Hasselmo, Rolls, Baylis, & Nalwa, 1989; Sakata, Shibutani, Ito, & Tsurugai, 1986). The synaptic plasticity between the layers of neurons has a Hebbian associative component in order to enable the system to build reliable representations in which the same neurons are activated by particular stimuli on different occasions in what is effectively a hierarchical multilayer competitive network (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002). Such processes might enable neurons in layer 2 of Figure 4a to respond to, for example, a wheel rotating clockwise in one position on the retina (e.g., neuron A in layer 2). A key issue not addressed by the architecture described so far is how rotation (e.g., of a small wheel rotating clockwise) in one part of the retina activates the same neurons at the end of the network as when it is presented on a different part of the retina (see Figure 1). We propose that an associative synaptic learning rule with a short-term memory trace of neuronal activity is used between the layers to solve this problem. The idea is that if at a high level of the architecture (labeled layer 2/3 in Figure 4a) a wheel rotating clockwise is activating a neuron in one position on the retina, then the activated neurons remain active in a short delay period (of, e.g., 1 s) while the object moves to another location on the retina (e.g., the right position in Figure 4a). 
Then, with the postsynaptic neurons still active from the motion at the left position, the newly active synapses onto the layer 2/3 neuron (C) show associative modification, resulting in neuron C learning in an unsupervised way to respond to the wheel rotating clockwise in either the left or the right position on the retina. The idea is, just as for the ventral visual system (Rolls & Deco, 2002), that whatever the convergence allows to be learned at each stage of the hierarchy will be learned by this invariance algorithm, resulting in neurons higher in the hierarchy having higher- and higher-level invariance properties, including view-invariant object-based motion. More formally, the rule we propose is identical to the one proposed for the ventral visual system (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002), as follows:

$$\delta w_j = \alpha\, \bar{y}^{\tau-1} x_j^{\tau}, \tag{2.1}$$

where the trace $\bar{y}^{\tau}$ is updated according to

$$\bar{y}^{\tau} = (1 - \eta)\, y^{\tau} + \eta\, \bar{y}^{\tau-1} \tag{2.2}$$
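As a rough illustration of equations 2.1 and 2.2, the following minimal Python sketch applies the trace rule to a single neuron. It is not the VisNet implementation: the function names, the plain linear activation, the zero initial trace, and the omission of competition and of any weight normalization are all assumptions made for the example. Only the two update equations and the values α = 0.09 and η = 0.8 come from the text.

```python
import numpy as np

def trace_update(y_bar_prev, y, eta=0.8):
    """Equation 2.2: blend the current output y with the previous trace."""
    return (1.0 - eta) * y + eta * y_bar_prev

def trace_learning_step(w, x, y_bar_prev, alpha):
    """Equation 2.1: the weight change uses the trace from the PREVIOUS step."""
    return w + alpha * y_bar_prev * x

def train_on_sequence(w, inputs, alpha=0.09, eta=0.8):
    """Present a sequence of input vectors (hypothetical helper).

    The output is a plain dot product here, which is an assumption made
    for illustration; the model described in the text adds lateral
    inhibition and a sigmoid before the firing rate is obtained.
    """
    y_bar = 0.0  # assumed initial trace
    for x in inputs:
        y = float(w @ x)                              # current output
        w = trace_learning_step(w, x, y_bar, alpha)   # uses previous trace
        y_bar = trace_update(y_bar, y, eta)           # then refresh the trace
    return w, y_bar
```

With an initial weight vector that responds only to the first input, the second input in a sequence still gains weight, because the trace carries activity over from the first presentation. This is the mechanism by which temporally adjacent transforms become associated onto the same neuron.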


The symbols are defined as follows:

$x_j$: jth input to the neuron
$\bar{y}^{\tau}$: trace value of the output of the neuron at time step τ
$w_j$: synaptic weight between jth input and the neuron
$y$: output from the neuron
α: learning rate; annealed between unity and zero
η: trace value; the optimal value varies with presentation sequence length

The parameter η may be set anywhere in the interval [0, 1], and for the simulations described here, η was set to 0.8, which works well with nine transforms for each object in the stimulus set (Wallis & Rolls, 1997). (A discussion of the good performance of this rule, and its relation to other versions of trace learning rules, including the point that the trace can be implemented in the presynaptic firing, is provided by Rolls & Milward, 2000, and Rolls & Stringer, 2001. We note that in the version of the rule used here (equation 2.1), the trace is calculated from the postsynaptic firing in the preceding time step ($\bar{y}^{\tau-1}$) but not the current time step, but that analogous performance is obtained if the firing in the current time step is also included (Rolls & Milward, 2000; Rolls & Stringer, 2001).)

The temporal trace in the brain could be implemented by a number of processes, as simple as continuing firing of neurons for several hundred ms after a stimulus has disappeared or moved (as shown to be present for at least inferior temporal neurons in masking experiments; Rolls & Tovee, 1994; Rolls, Tovee, Purcell, Stewart, & Azzopardi, 1994), or by the long time constant of NMDA receptors and the resulting entry of calcium to neurons. An important idea here is that the temporal properties of the biologically implemented learning mechanism are such that it is well suited to detecting the relevant continuities in the world of real motion of objects. The system uses the underlying continuity in the world to help itself learn the invariances of, for example, the motions that are typical of objects.

2.2 The Network Architecture.
The model we used for the simulations was VisNet, which was developed as a model of hierarchical processing in the ventral visual system that uses trace learning to develop invariant representations of stationary objects (Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Deco, 2002). The simulations performed here utilized the latest version of the VisNet model (VisNet2), with the same model parameters as used by Rolls and Milward (2000) for their investigations of the formation of invariant representations in the ventral visual system. These parameters were kept identical for all the simulations described here. The difference is that instead of using simple cell-like inputs to the model that respond to stationary oriented bars and edges (with four spatial frequencies and four orientations), in the modeling described here we used motion-related


Table 1: VisNet Dimensions.

              Dimensions      Number of Connections   Radius
Layer 4       32 × 32         100                     12
Layer 3       32 × 32         100                     9
Layer 2       32 × 32         100                     6
Layer 1       32 × 32         201                     6
Input layer   128 × 128 × 8   -                       -

Table 2: Lateral Inhibition Parameters.

Layer         1      2      3      4
Radius, σ     1.38   2.7    4.0    6.0
Contrast, δ   1.5    1.5    1.6    1.4

inputs that capture some of the relevant properties of neurons present in V1 as part of the primate magnocellular (M) system (Wurtz & Kandel, 2000; Duffy, 2004; Rolls & Deco, 2002). VisNet is a four-layer feedforward network with unsupervised competitive learning at each layer. For each layer, the forward connections to individual cells are derived from a topologically corresponding region of the preceding layer, with connection probabilities based on a gaussian distribution (see Figure 2). These distributions are defined by a radius that will contain approximately 67% of the connections from the preceding layer. Typical values are given in Table 1. Within each layer there is competition between neurons, which is graded rather than winner-take-all, and is implemented in two stages. First, to implement lateral inhibition, the firing rates of the neurons (calculated as the dot product of the vector of presynaptic firing rates and the synaptic weight vector on a neuron, followed by a linear activation function to produce a firing rate) within a layer are convolved with a spatial filter, I , where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:

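$$
I_{a,b} =
\begin{cases}
-\delta\, e^{-\frac{a^2 + b^2}{\sigma^2}} & \text{if } a \neq 0 \text{ or } b \neq 0, \\
1 - \sum_{a \neq 0,\, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0.
\end{cases}
\tag{2.3}
$$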
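To make the shape of the filter in equation 2.3 concrete, the sketch below builds it on a finite square grid. The function name and the `half_width` truncation are assumptions for illustration, since the text does not state the filter extent. The centre term is computed as one minus the sum of all off-centre entries, which is our reading of the summation in equation 2.3.

```python
import numpy as np

def lateral_inhibition_filter(sigma, delta, half_width):
    """Build the filter I[a, b] of equation 2.3 on a square grid.

    sigma and delta are the radius and contrast of Table 2; half_width
    (an assumed truncation) sets the grid to (2*half_width + 1) squared.
    """
    offsets = np.arange(-half_width, half_width + 1)
    a, b = np.meshgrid(offsets, offsets, indexing="ij")
    # Off-centre entries: negative gaussian surround.
    filt = -delta * np.exp(-(a**2 + b**2) / sigma**2)
    centre = half_width
    filt[centre, centre] = 0.0
    # Centre entry: 1 minus the sum of the off-centre entries.
    filt[centre, centre] = 1.0 - filt.sum()
    return filt

# Layer 1 values from Table 2; the grid size is arbitrary.
layer1_filter = lateral_inhibition_filter(sigma=1.38, delta=1.5, half_width=3)
```

By construction the entries sum to one, so a spatially uniform pattern of firing rates is left unchanged in the interior of a layer, while each neuron is suppressed by active neighbours within roughly σ of it.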

Typical lateral inhibition parameters are given in Table 2. Next, contrast enhancement is applied by means of a sigmoid function

$$y = f^{\text{sigmoid}}(r) = \frac{1}{1 + e^{-2\beta(r - \alpha)}}, \tag{2.4}$$


Table 3: Sigmoid Parameters.

Layer        1      2     3     4
Percentile   99.2   98    88    91
Slope β      190    40    75    26

where r is the firing rate after lateral inhibition, y is the firing rate after contrast enhancement, and α and β are the sigmoid threshold and slope, respectively. The parameters α and β are constant within each layer, although α is adjusted to control the sparseness of the firing rates. For example, to set the sparseness to, say, 5%, the threshold is set to the value of the 95th percentile point of the firing rates r within the layer. Typical parameters for the sigmoid function are shown in Table 3.

The trace learning rule (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002) is that shown in equation 2.1 and encourages neurons to develop invariant responses to input patterns that tend to occur close together in time, because these are likely to be from the same moving object.

2.3 The Motion Inputs to the Network. The images presented to the network represent local motion signals with small receptive fields. These local visual motion (or local optic flow) input signals are similar to those of neurons in V1 in that they have small receptive fields and cannot detect global motion because of the aperture problem (Wurtz & Kandel, 2000). At each pixel coordinate in the 128 × 128 image, a direction of local motion/optic flow is defined. The global optic flow patterns used in the different experiments occupied part of this 128 × 128 image, as described for each experiment below. At each coordinate, there are eight cells, where the optimal response is defined by flows 45 degrees apart. That is, the cells are tuned to local optic flow directions of 0, 45, 90, . . ., 315 degrees. The firing rate of each cell is set equal to a gaussian function of the difference between the cell's preferred direction and the actual direction of local optic flow. The standard deviation of this gaussian was 20 degrees.
The number of inputs from the arrays of motion-sensitive cells to each cell in the first layer of the network is 201, selected probabilistically as a gaussian function of distance as described above and in more detail elsewhere (Rolls & Milward, 2000). The local motion signals are given to the network, and not computed in the simulations, because the aim of the simulations is to test the theory that (given that local motion inputs are known to be present in early cortical processing; Wurtz & Kandel, 2000) the trace learning mechanism described can in a hierarchical network account for a range of the types of global motion neuron that are found in the dorsal stream visual cortical areas.
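The input encoding described in section 2.3 can be sketched as follows for a single pixel. This is an illustrative reconstruction, not the simulation code: the function name, the unit peak rate, and the wrapping of the angular difference into the range −180 to 180 degrees are assumptions. The eight preferred directions and the 20-degree standard deviation are taken from the text.

```python
import numpy as np

# Preferred directions of the eight cells at each pixel: 0, 45, ..., 315 degrees.
PREFERRED_DEG = np.arange(0, 360, 45)

def direction_cell_rates(flow_direction_deg, sd_deg=20.0):
    """Firing rates of the eight direction-tuned cells at one pixel.

    Each rate is a gaussian function of the difference between the cell's
    preferred direction and the local flow direction. The difference is
    wrapped to [-180, 180) degrees and the peak rate is 1; both choices
    are assumptions made for this sketch.
    """
    diff = (flow_direction_deg - PREFERRED_DEG + 180.0) % 360.0 - 180.0
    return np.exp(-diff**2 / (2.0 * sd_deg**2))

# Local flow pointing at 90 degrees: the third cell is maximally active.
rates = direction_cell_rates(90.0)
```

Because the cells are 45 degrees apart and the tuning width is 20 degrees, a flow direction aligned with one cell drives its two neighbours only weakly, so the eight-cell code at each pixel is fairly sparse.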


2.4 Training and Test Procedure. To train the network, each stimulus is presented to VisNet in a randomized sequence of locations or orientations with respect to VisNet's input retina. The different locations were spaced 32 pixels apart on the 128 × 128 retina. At each stimulus presentation, the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training locations or orientations, a new stimulus is chosen at random and the process repeated. The presentation of all the stimuli through all locations or orientations constitutes one epoch of training. In this manner, the network is trained one layer at a time starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1 to 4 were 50, 100, 100, and 75, respectively, as these have been shown in previous work to provide good performance (Wallis & Rolls, 1997; Rolls & Milward, 2000). The learning rates α in equation 2.1 for layers 1 to 4 were 0.09, 0.067, 0.05, and 0.04.

Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that are able to respond with view invariance to individual stimuli or objects (see Rolls & Milward, 2000). A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown independent of view. The measure was the stimulus-specific information or surprise, $I(s, R)$, which is the amount of information the set of responses, R, has about a specific stimulus, s. (The mutual information between the whole set of stimuli S and of responses R is the average across stimuli of this stimulus-specific information.) (Note that r is an individual response from the set of responses R.)

$$I(s, R) = \sum_{r \in R} P(r \mid s) \log_2 \frac{P(r \mid s)}{P(r)} \tag{2.5}$$
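A small numerical sketch may help to fix the meaning of equation 2.5. The function below is hypothetical and assumes that the responses have already been binned, so that each row of `p_r_given_s` holds the probabilities P(r|s) over response bins for one stimulus, and that every stimulus has nonzero probability. Terms with P(r|s) = 0 are skipped, following the usual convention that they contribute nothing to the sum.

```python
import numpy as np

def stimulus_specific_information(p_r_given_s, p_s):
    """Equation 2.5 for every stimulus, in bits.

    p_r_given_s: array of shape (n_stimuli, n_bins); each row sums to 1.
    p_s: array of shape (n_stimuli,), stimulus probabilities, all > 0.
    """
    p_r_given_s = np.asarray(p_r_given_s, dtype=float)
    p_s = np.asarray(p_s, dtype=float)
    p_r = p_s @ p_r_given_s                 # P(r) = sum over s of P(s) P(r|s)
    info = np.zeros(len(p_s))
    for s, row in enumerate(p_r_given_s):
        nz = row > 0                        # bins with P(r|s) = 0 add nothing
        info[s] = np.sum(row[nz] * np.log2(row[nz] / p_r[nz]))
    return info

# Four equiprobable stimuli; the cell fires in the "high" bin (column 1)
# for stimulus 0 only, and in the "low" bin (column 0) for the others.
table = [[0, 1], [1, 0], [1, 0], [1, 0]]
info = stimulus_specific_information(table, [0.25] * 4)
```

In this toy case the cell carries 2 bits about stimulus 0, which is log2 of the number of stimuli and hence the maximum available, and much less about each of the others. Averaging `info` with weights P(s) gives the mutual information between stimuli and responses, as noted in the text.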

The calculation procedure was identical to that described by Rolls, Treves, Tovee, and Panzeri (1997) with the following exceptions. First, no correction was made for the limited number of trials because, in VisNet, each measurement of a response is exact, with no variation due to sampling on different trials. Second, the binning procedure was to use equispaced rather than equipopulated bins. This small modification was useful because the data provided by VisNet can produce perfectly discriminating responses with little trial-to-trial variability. Because the cells in VisNet can have bimodally distributed responses, equipopulated bins could fail to separate the two modes perfectly. (This is because one of the equipopulated bins might contain responses from both of the modes.) The number of bins used was equal to or less than the number of trials per stimulus, that is, for VisNet the number of positions on the retina (Rolls et al., 1997). Because


VisNet operates as a form of competitive net to perform categorization of the inputs received, good performance of a neuron will be characterized by large responses to one or a few stimuli regardless of their position on the retina (or other transform) and small responses to the other stimuli. We are thus interested in the maximum amount of information that a neuron provides about any of the stimuli rather than the average amount of information it conveys about the whole set S of stimuli (known as the mutual information). Thus, for each cell, the performance measure was the maximum amount of information a cell conveyed about any one stimulus (with a check, in practice always satisfied, that the cell had a large response to that stimulus, as a large response is what a correctly operating competitive net should produce to an identified category). In many of the graphs in this article, the amount of information each of the 50 most informative cells had about any stimulus is shown.

A multiple cell information measure, the average amount of information that is obtained about which stimulus was shown from a single presentation of a stimulus from the responses of all the cells, enabled measurement of whether across a population of cells, information about every object in the set was provided. Procedures for calculating the multiple cell information measure are given by Rolls, Treves, and Tovee (1997) and Rolls and Milward (2000). The multiple cell information measure is the mutual information I(S, R), that is, the average amount of information that is obtained from a single presentation of a stimulus about the set of stimuli S from the responses of all the cells. For multiple cell analysis, the set of responses, R, consists of response vectors comprising the responses from each cell. Ideally, we would like to calculate

$$I(S, R) = \sum_{s \in S} P(s)\, I(s, R). \qquad (2.6)$$

However, the information cannot be measured directly from the probability table P(r, s) embodying the relationship between a stimulus s and the response rate vector r provided by the firing of the set of neurons to a presentation of that stimulus. (Note that “stimulus” refers to an individual object that can occur with different transforms, e.g., translation or size; see Wallis & Rolls, 1997.) This is because the dimensionality of the response vectors is too large to be adequately sampled by trials. Therefore, a decoding procedure is used, in which the stimulus s that gave rise to the particular firing-rate response vector on each trial is estimated. This involves, for example, maximum likelihood estimation or dot product decoding. For example, given a response vector r to a single presentation of a stimulus, its similarity to the average response vector of each neuron to each stimulus is used to estimate using a dot product comparison which stimulus was shown. The probabilities of it being each of the stimuli can be estimated in


this way. Details are provided by Rolls et al. (1997). A probability table is then constructed of the real stimuli s and the decoded stimuli s′. From this probability table, the mutual information is calculated as

$$I(S, S') = \sum_{s, s'} P(s, s') \log_2 \frac{P(s, s')}{P(s)\, P(s')}. \qquad (2.7)$$
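The decoding step and equation 2.7 can be sketched in Python as follows. This is a simplified illustration, not the procedure of Rolls et al. (1997): it assigns each trial to the single stimulus with the largest dot product rather than estimating a probability for every stimulus, and the function names and the toy response vectors are our own assumptions.

```python
import math


def dot_product_decode(response, mean_responses):
    """Return the stimulus whose mean response vector has the largest
    dot product with the single-trial response vector."""
    return max(
        mean_responses,
        key=lambda s: sum(a * b for a, b in zip(response, mean_responses[s])),
    )


def mutual_information(joint):
    """I(S, S') in bits from a table {(s, s_decoded): probability}."""
    p_s, p_d = {}, {}
    for (s, d), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_d[d] = p_d.get(d, 0.0) + p
    return sum(
        p * math.log2(p / (p_s[s] * p_d[d]))
        for (s, d), p in joint.items()
        if p > 0
    )


def decoded_information(trials, mean_responses):
    """Build the table of real versus decoded stimuli and return I(S, S').

    trials is a list of (true_stimulus, response_vector) pairs, each
    assumed to be equally probable.
    """
    joint = {}
    for s, response in trials:
        key = (s, dot_product_decode(response, mean_responses))
        joint[key] = joint.get(key, 0.0) + 1.0 / len(trials)
    return mutual_information(joint)


# Toy example with two cells and two stimuli (hypothetical values).
means = {"cw": [1.0, 0.0], "acw": [0.0, 1.0]}
trials = [("cw", [0.9, 0.1]), ("cw", [0.8, 0.0]),
          ("acw", [0.1, 0.7]), ("acw", [0.2, 0.9])]
print(decoded_information(trials, means))
```

When every trial is decoded correctly and the two stimuli are equiprobable, the table is diagonal and the measure reaches 1 bit, the ceiling for a two-stimulus set.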

The multiple cell information was calculated using the five cells for each stimulus with high information values for that stimulus. Thus, in this letter, 10 cells were used in the multiple cell information analysis.

3 Simulation Results

We now describe simulations with the neural network model described in section 2 that enabled us to test this theory.

3.1 Experiment 1: Global Planar Motion. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 deg at the fovea) and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). As described in section 1, neurons in MT have larger receptive fields and are able to respond to planar global motion (Movshon et al., 1985; Newsome et al., 1989). Here we show that the hierarchical feature network we propose can solve this global planar motion problem and, moreover, that the performance is improved by using a trace rather than a purely associative synaptic modification rule. Invariance is addressed in later simulations. The network was trained on two 100 × 100 stimuli representing noisy left and right global planar motion (see Figure 3a). During the training, cells developed that responded to either left or right global motion but not to both (see Figure 3), with 1 bit of information representing perfect discrimination of left from right. The untrained network with initial random synaptic weights tested as a control showed much poorer performance, as shown in Figure 3. It might be expected that some global planar motion sensitivity would be developed by a purely Hebbian learning rule, and indeed this has been demonstrated (under somewhat different training conditions) by Sereno (1989) and Sereno and Sereno (1991).
This occurs because on any single trial with one average direction of global motion, neurons at intermediate layers will tend to receive on average inputs that reflect the current average global planar motion and will thus learn to respond optimally to the current inputs that represent that motion direction. We showed that the trace learning rule used here performed better than a Hebb rule (which produced only neurons with 0.0 bits given that the motion stimulus patches presented in our simulations were in nonoverlapping locations, as

[Figure 3 graphic. (a) Stimulus 1: global planar motion left; Stimulus 2: global planar motion right. (b) VisNet: 2s 9l: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s 9l: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random.]

Figure 3: Experiment 1. (a) The two motion stimuli used in experiment 1 were noisy global planar motion left (left) and noisy global planar motion right (right), present throughout the 128 × 128 retina. Each arrow in this and subsequent figures represents the local direction of optic flow. The size of the optic flow pattern was 100 × 100 pixels, not the 2 × 4 shown in the diagram. The noise was introduced into each image stimulus by inverting the direction of optic flow at a random set of 45% of the image nodes. This meant that it would not be possible to determine the directional bias of the flow field by examining the optic flow over local regions of the retina. Instead, the overall directional bias could be determined only by analyzing the whole image. (b) When trained with the trace rule, equation 2.1, some single cells in layer 4 conveyed 1 bit of information about whether the global motion was left or right, and this is perfect performance. (The single cell information is shown for the 50 most selective cells.) (c) The multiple cell information measures, used to show that different neurons are tuned to different stimuli (see section 2.4), indicate that over a set of neurons, information about the whole stimulus set was present. (The information values for one cell are the average of 10 cells selected from the 50 most selective cells, and hence the value is not exactly 1 bit.)
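The noise procedure described in the caption of Figure 3 can be sketched in Python as follows. This is an illustration only: the function names, the encoding of flow direction as +1 (right) and -1 (left), and the seeded random generator are assumptions of the sketch, not details given in the article.

```python
import random


def noisy_planar_flow(size=100, direction=1, flip_fraction=0.45, seed=0):
    """Return a size x size grid of horizontal flow directions.

    +1 means rightward and -1 means leftward flow (our encoding). The
    direction is inverted at a random set of flip_fraction of the nodes,
    as described for the stimuli of experiment 1.
    """
    rng = random.Random(seed)
    n_nodes = size * size
    flipped = set(rng.sample(range(n_nodes), round(flip_fraction * n_nodes)))
    return [
        [-direction if row * size + col in flipped else direction
         for col in range(size)]
        for row in range(size)
    ]


def global_direction(field):
    """Sign of the summed flow over the whole field: +1, -1, or 0."""
    total = sum(sum(row) for row in field)
    return (total > 0) - (total < 0)


left = noisy_planar_flow(direction=-1)
print(global_direction(left))
```

Because 45% of the nodes are inverted, only a 55% to 45% majority carries the true direction, so a small local patch is an unreliable guide, whereas the sum over the whole 100 × 100 field has the correct sign.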


illustrated in Figure 1). A further reason for the better performance of the trace rule is that on successive trials, the average global motion identifiable by a single intermediate-layer neuron from the probabilistic inputs will be a better estimate (a temporal average) of the true global motion, and this will be utilized in the learning. These results show that the network architecture is able to develop global motion representations of the noisy local motion patterns. Indeed, it is emphasized that neurons in the input to VisNet had only local but not global motion information, as shown by the fact that the average amount of information the 50 most selective input cells had about the global motion was 0.0 bits.

3.2 Experiment 2: Rotating Wheel. Neurons in MST, but not MT, are responsive to rotation with considerable translation invariance (Graziano et al., 1994). The aim of this simulation was to determine whether layer 4 cells in our network develop position-invariant representations of wheels rotating clockwise (as shown in Figure 4a) versus anticlockwise. The stimuli consist only of optic flow fields around the rim of a geometric circle with radius 16 unless otherwise stated. The local motion inputs from the wheel in the two positions shown are ambiguous where the wheels are close to each other in Figure 4a. The network was expected to solve the problem as illustrated in Figure 4a. The results in Figures 4b to 4d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, clockwise but not anticlockwise rotation, and did this for each of the nine training positions. Figure 4e shows perfect size invariance for some layer 4 cells when the network was trained with the trace rule with three different radii of the wheels: 10, 16, and 22.
These results show that the network architecture is able to develop location- and size-invariant representations of the global, rotating wheel, motion patterns even though the neurons in the input layer receive information from only a small local region of the retina. We note that the position-invariant global motion results shown in Figure 4 were not due to chance mappings of the two stimuli through the network and were a result of the training, in that the position-invariant information about whether the global motion was clockwise or anticlockwise was 0.0 bits for both the single and the multiple cell information in the untrained (“random”) network. Corresponding differences between the trained and the untrained networks were found in all the other experiments described in this article.

3.3 Experiment 3: Looming. Neurons in macaque dorsal stream visual area MSTd respond to looming stimuli with considerable translation


invariance (Graziano et al., 1994; Geesaman & Andersen, 1996). We tested whether the network could learn to respond to small patches of looming versus contracting motion typically generated by objects as they are seen successively on different locations on the retina. The network was trained on two circular flow patterns representing looming toward and looming away, as shown in Figure 5a. The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle and with radius 16 unless otherwise stated. The results shown in Figures 5b to 5d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, looming toward but not movement away, and did this for each of the nine training positions. Simulations were run for various optic flow field diameters to test the robustness of the results, and in all cases tested (which included radii of

Figure 4: Experiment 2. (a) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can be diagnosed only by a global flow computation, and it is shown how the network is expected to solve the problem to produce position-invariant global-motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field. (b) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise versus anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (c) The multiple cell information measures show that small groups of neurons have perfect performance. (d) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the nine positions. (e) Size invariance illustrated for a single cell from layer 4, which after training with three different radii of rotating wheel, responded only to anticlockwise rotation, independent of the size of the rotating wheels. (For the position-invariant simulations, the wheel rims overlapped, but are shown slightly separated in Figure 1 for clarity.) The training grid spacing was 32 pixels, and the radii of the wheels were 16 pixels. This ensured the rims of the wheels in adjacent training grid locations overlapped. One wheel was shown on any one trial. On successive trials, the wheel rotating clockwise was shown in each of the nine locations, allowing the trace learning rule to build location-invariant representations of the wheel rotating in one direction. In the next set of training trials, the wheel was shown rotating in the opposite direction in each of the nine locations.
For the size-invariant simulations, the network was trained and tested with the set of clockwise versus anticlockwise rotating wheels presented in three different sizes.

[Figure 4 graphic. (a) Network diagram with wheel positions labeled A, B, and C: Input layer = V1, V2 (local motion); Layer 1 = MT (global planar motion, intermediate receptive field size); Layer 2 = MT (rotational motion, large receptive field size); Layer 3 = MSTd or higher (rotational motion with invariance, larger receptive field size). (b) VisNet: 2s 9l: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s 9l: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random. (d, e) Single-cell plots titled "Visnet: Cell (24,13) Layer 4" and "Visnet: Cell (7,11) Layer 4"; Firing Rate versus Location Index (d) and versus Radius (e); curves: 'clock', 'anticlock'.]

10 and 20 as well as intermediate values), cells developed a transform (location) invariant representation in the output layer. These results show that the network architecture is able to develop invariant representations of the global looming motion patterns, even though the neurons in the input layer receive information from only a small local region of the retina.

3.4 Experiment 4: Rotating Cylinder. Some neurons in the macaque cortex in the anterior part of the superior temporal sulcus (which receives inputs from both the dorsal and ventral visual streams; Ungerleider & Mishkin, 1982; Seltzer & Pandya, 1978; Rolls & Deco, 2002) respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Some neurons in the parietal cortex may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). In experiment 4, we tested whether the network could self-organize to form neurons that represent global motion in an object-based coordinate frame. The network was trained on two stimuli, with four transforms of each. Figure 6a shows stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own vertical axis. The stimuli were presented in a single location, but to solve the problem,

Figure 5: Experiment 3. (a) The two motion stimuli were flow fields looming toward (left) and looming away (right). The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle. Local motion cells near, for example, the intersection of the two stimuli cannot distinguish between the two global motion patterns. Location-invariant representations (for nine different locations) of stimuli looming toward or moving away from the observer were learned if the network was trained with the trace rule but not if it was untrained, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). (d) Position invariance illustrated for a single cell from layer 4, which responded only to moving away, and for every one of the nine positions. (The network was trained and tested with the stimuli presented in a 3 × 3 grid of nine retinal locations, as in experiment 1. The training grid spacing was 32 pixels, and the radii of the circular looming stimuli were 16 pixels. This ensured that the edges of the looming stimuli in adjacent training grid locations overlapped, as shown in the dashed box of Figure 5a.)

[Figure 5 graphic. (a) Stimulus 1: looming towards; Stimulus 2: moving away. (b) VisNet: 2s 9l: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s 9l: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random. (d) Visnet: Cell (9,7) Layer 4; Firing Rate versus Location Index; curves: 'towards', 'away'.]

the network must form some neurons that respond to the clockwise rotation of the shaded cylinder independent of the four transforms of each, which were upright (0 degrees), 90 degrees, inverted (180 degrees) and 270 degrees. Other neurons should self-organize to respond to view-invariant counterclockwise rotation. For this experiment, additional information about surface luminance must be fed into the first layer of the network in order for the network to be able to distinguish between the clockwise and anticlockwise rotating cylinders. Additional retinal inputs to the first layer of the network came from a 128 × 128 array of luminance-sensitive cells. The cells within the luminance array are maximally activated for the shaded region of the cylinder image. Elsewhere the luminance inputs are zero. The number of inputs from the array of luminance-sensitive cells to each cell in the first layer of the network was 50. The results in Figures 6b and 6c show perfect performance for many single cells, and across multiple cells, in representing the direction of rotation of the shaded cylinder about its own axis regardless of which of the four transforms was shown, when trained with the trace rule but not when untrained. Simulations were run for various sizes of the cylinders, including height = 40 and diameter = 20. For all simulations, cells developed a transform (e.g., upright, inverted) invariant representation in the output layer. That is, some cells responded to one of the stimuli in all of its four transformations (i.e., orientations) but not to the other stimulus. These results show that the network architecture is able to develop object-centered view-invariant representations of the global motion patterns representing the two rotating cylinders, even though the neurons in the input layer receive information from only a small, local region of the retina.

Figure 6: Experiment 4. (a) Stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own axis. Invariant representations were formed, with some cells coding for the object rotating clockwise about its own axis and other cells coding for the object rotating anticlockwise, invariantly with respect to which of the four transforms (0 degrees = upright, 90 degrees, 180 degrees = inverted, and 270 degrees) was viewed, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). Because only eight images in one location form the training set, some single cells by chance with the random untrained connectivity had some information about which stimulus was shown, but cells performed the correct mapping only if the network was trained with the trace rule.

[Figure 6 graphic. (a) Upright and inverted transforms of Stimulus 1 (cylinder rotating clockwise when viewed from shaded end) and Stimulus 2 (cylinder rotating anticlockwise when viewed from shaded end). (b) VisNet: 2s 4t: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s 4t: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random.]

3.5 Experiment 5: Optic Flow Analysis of Real Images: Translation Invariance. In experiments 5 and 6, we extend this research by testing the operation of the model when the optic flow inputs to the network are extracted by a motion analysis algorithm operating on the successive images generated by moving objects. The optic flow fields generated by a moving object were calculated as described next and were used to set the firing of the motion-selective cells, the properties of which are described in section 2.3. These optic flow algorithms use an image gradient-based method, which exploits the relationship between the spatial and temporal gradients of intensity, to compute the local optic flow throughout the image. The image flow constraint equation $I_x U + I_y V + I_t = 0$ is approximated at each pixel location by algebraic finite difference approximations in space and time (Horn & Schunk, 1981). Systems of these finite difference equations are then solved for the local image velocity (U, V) within each 4 × 4 pixel block within the image. The images of the rotating objects were generated using OpenGL. In experiment 5, we investigated the learning of translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated in Figure 7a. The network was trained with the two optic flow patterns generated in nine different locations, as in experiments 2 and 3. The flow fields used to train the network were generated by the object rotating through one degree of angle. The single cell information measures (see Figure 7b) and multiple cell information measures (see Figure 7c) (using the same conventions as in Figure 3) show that the maximal information, one bit, was reached by single cells and with the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
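The gradient-based computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the choice of central differences in space and a forward difference in time, and the least-squares solution of the constraint equations within each block are all ours.

```python
def block_optic_flow(frame0, frame1, block=4):
    """Estimate one velocity (U, V) per block x block region.

    frame0 and frame1 are equal-sized 2D lists of intensities (at least
    2 x 2) from successive time steps. For each pixel, the constraint
    Ix*U + Iy*V + It = 0 is formed with finite differences, and the
    equations of a block are solved together by least squares.
    Returns {(block_row, block_col): (U, V)}, or None for a block whose
    gradients do not constrain the velocity (the aperture problem).
    """
    height, width = len(frame0), len(frame0[0])
    flow = {}
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            sxx = sxy = syy = sxt = syt = 0.0
            for y in range(by, by + block):
                for x in range(bx, bx + block):
                    # Central differences, one-sided at the image border.
                    x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
                    y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
                    ix = (frame0[y][x1] - frame0[y][x0]) / (x1 - x0)
                    iy = (frame0[y1][x] - frame0[y0][x]) / (y1 - y0)
                    it = frame1[y][x] - frame0[y][x]
                    sxx += ix * ix
                    sxy += ix * iy
                    syy += iy * iy
                    sxt += ix * it
                    syt += iy * it
            # Normal equations: [[sxx, sxy], [sxy, syy]] [U, V]^T = -[sxt, syt]^T
            det = sxx * syy - sxy * sxy
            key = (by // block, bx // block)
            if abs(det) < 1e-12:
                flow[key] = None
            else:
                u = (sxy * syt - syy * sxt) / det
                v = (sxy * sxt - sxx * syt) / det
                flow[key] = (u, v)
    return flow


# Synthetic check: I = x*y, displaced to first order by (U, V) = (0.5, 0.25).
f0 = [[float(x * y) for x in range(8)] for y in range(8)]
f1 = [[x * y - 0.5 * y - 0.25 * x for x in range(8)] for y in range(8)]
print(block_optic_flow(f0, f1))
```

A block of uniform intensity, or one in which all gradients are parallel, yields a singular system, which is the block-level form of the aperture problem mentioned in section 3.1.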
This experiment shows that the model can operate well and learn translation-invariant representations with motion flow fields actually extracted from the successive images produced by a rotating object.

3.6 Experiment 6: Optic Flow Analysis of Real Images: Rotation Invariance. In experiment 6 we investigated the learning of rotation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8a. (The algorithm for generating the optic flow field is described in section 3.5.) The radius of the spoked wheel was 50 pixels on the 128 × 128 background. The rotation was in-plane, and the optic flow fields used as an input to the network were extracted from the changing images, each separated by one degree of the object as it rotated through 360 degrees. The single cell information measures (see Figure 8b) and multiple cell information measures (see Figure 8c) (using the same conventions as in Figure 3) show that the maximal information, one bit, was almost reached by single cells and by the multiple cell information measure. The dashed

[Figure 7 graphic. (a) Optic flow fields produced by a tetrahedron rotating clockwise or anticlockwise. (b) VisNet: 2s 9l: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s 9l: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random.]

Figure 7: Experiment 5. Translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated. The optic flow field used as an input to the network was extracted from the changing images of the object as it rotated. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and in the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.

line shows the control condition of a network with random untrained connectivity. This experiment shows that the model can operate well and learn rotation-invariant representations with motion flow fields actually extracted from a very large number of the successive images produced by a rotating object. Because of the large number of closely spaced training images used in this simulation, it is likely that the crucial type of learning was continuous transformation learning (Stringer, Perry, Rolls, & Proske, 2006). Consistent with this, the learning rate was set to the lower value of $7.2 \times 10^{-5}$ for all layers for experiment 6 (cf. Stringer et al., 2006).

[Figure 8 graphic. (a) Optic flow fields produced by a spoked wheel rotating clockwise or anticlockwise. (b) VisNet: 2s: Single cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c) VisNet: 2s: Multiple cell analysis; Information (bits) versus Number of Cells; curves: trace, random.]

Figure 8: Experiment 6. In-plane rotation-invariant representations of the optic flow vector fields generated by a spoked wheel rotating clockwise or anticlockwise. The optic flow field used as an input to the network was extracted from the changing images of the object as it rotated through 360 degrees, each separated by 1 degree. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and in the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.

3.7 Experiment 7: Generalization to Untrained Images. To investigate whether the representations of object-based motion such as circular rotation learned with the approach introduced in this article would generalize usefully to the flow fields generated by other objects moving in the same way, we trained the network on the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8. The training images rotated through 90 degrees in 1 degree steps. We then tested generalization to the new, untrained image shown in Figure 9a. The single and multiple cell information plots in Figure 9b show that information was available about the direction of

[Figure 9 graphic. (a) Generalisation: training with rotating spoked wheel, followed by testing with a regular grid rotating clockwise or anticlockwise. (b) VisNet: 2s: Single cell analysis and VisNet: 2s: Multiple cell analysis; Information (bits) versus Cell Rank; curves: trace, random. (c, d) Responses of a typical cell, Visnet: Cell (17,17) Layer 4, to the spoked wheel (c) and the grid (d), after training with the spoked wheel alone; Firing Rate versus Orientation (deg), 0 to 90, shown separately for clockwise and anticlockwise rotation.]

Figure 9: Experiment 7. Generalization to untrained images. The network was trained on the optic flow vector fields generated by the spoked wheel stimulus illustrated in Figure 8 rotating clockwise or anticlockwise. (a) Generalization to the new untrained image shown at the top right of the Figure was then tested. (b) The single and multiple cell information plots show that information was available about the direction of rotation (clockwise versus anticlockwise) of the untrained test images. (c) The firing rate of a fourth layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. (d) The firing rate of the same fourth layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a.


rotation (clockwise versus anticlockwise) of the untrained test images. Although the information was not as high as 1 bit, which would have indicated perfect generalization, individual cells did generalize usefully to the new images, as shown in Figures 9c and 9d. For example, Figure 9c shows the firing rate of a fourth layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. Figure 9d shows the firing rate of the same fourth-layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a. The neuron responded correctly to almost all the anticlockwise rotation shifts, and correctly to many of the clockwise rotation shifts, though some noise was evident in the responses of the neuron to the untrained images. Overall, the results demonstrate useful generalization after training with one object to testing with an untrained, different, object on the ability to represent rotation.

4 Discussion

We have presented a hierarchical feature analysis theory of the operation of parts of the dorsal visual system, which provides a computational account for how transform-invariant representations of the flow fields generated by moving objects could be formed in the cerebral cortex. The theory uses a modified Hebb rule with a short-term temporal trace of preceding activity to enable whatever is invariant at any stage of the dorsal motion system across short time intervals to be associated together. The theory can account for many types of invariance and has been tested by simulation for position and size invariance. The simulations show that the network can develop global planar representations from noisy local motion inputs (experiment 1), invariant representations of rotating optic flow fields (experiment 2), invariant representations of looming optic flow fields (experiment 3), and invariant representations of asymmetrical objects rotating about one of their axes (experiment 4). These are fundamental problems in motion analysis, and they have all been studied neurophysiologically, including local versus planar motion (Movshon et al., 1985; Newsome et al., 1989); position-invariant representation of rotating flow fields and looming (Lagae, Maes, Raiguel, Xiao, & Orban, 1994); and object-based rotation (Hasselmo et al., 1989; Sakata et al., 1986). The model thus shows principles by which the different types of motion-related invariant neuronal responses in the dorsal cortical visual system could be produced. The theory is unifying in the sense that the same theory, but with different inputs, can account for invariant representations of objects in the ventral visual system (Rolls, 1992; Wallis & Rolls, 1997; Elliffe, Rolls, & Stringer, 2002; Rolls & Deco, 2002). It is a strength of the unifying concept introduced in this article that the same hierarchical network that can perform computations of the type important in the ventral visual system can also perform computations of a type important in the dorsal visual system.

Invariant Global Motion Recognition


Our simulations support the hypothesis that the different response properties of MT and MST neurons from V1 neurons are determined in part by the sizes of their receptive fields, with a larger receptive field needed to analyze some global motion patterns. Similar conclusions were drawn from simulation experiments performed by Sereno (1989) and Sereno and Sereno (1991). This type of self-organization can occur with a Hebbian associative learning rule operating on the feedforward connections to a competitive network. However, experiment 1 showed that even for the computation of planar global motion in intermediate layers such as MT, a trace-based associative learning rule is better than a purely associative Hebbian rule with noisy (probabilistic) local motion inputs, because the trace rule allows temporal averaging to contribute to the learning. In experiments 2 and 3, the trace rule is crucial to the success of the learning, in that the stimuli when presented in different training locations did not overlap, so that the only process by which the different transforms can be linked is by the temporal trace learning rule implemented in the model (Rolls & Milward, 2000; Rolls & Stringer, 2001). (We note that in a new development, it has been shown that if different transforms of the training stimuli do overlap continuously in space, then this overlap can provide a useful learning principle for invariant representations to be formed and requires only associative synaptic modification; Stringer et al., 2006. It would be of interest to extend this concept, which has been applied to the ventral visual system, to the dorsal visual system.) One type of perceptual analysis that can be understood with the theory and simulations described here is how neurons can self-organize to respond to the motion inputs produced by small objects when they are seen on different parts of the retina. 
This is achieved by using memory-trace-based synaptic modification in the type of architecture illustrated in Figure 4a. The crucial stage for this learning is the top layer in Figure 4a labeled Layer 2/3. The forward connections to the neurons in this layer can form the required representation if they use a trace or similar learning rule, and the object motion occurs with some temporospatial continuity. (Temporospatial continuity has been shown to be important in human face invariance learning [Wallis & Bülthoff, 2001], and spatial continuity over continuous transforms may be a useful learning principle [Stringer et al., 2006].) This aspect of the architecture is what is formally similar to the architecture of the ventral visual system, which can learn invariant representations of stationary objects. The only difference required of the networks is that the ventral visual stream network should receive inputs from neurons that respond to stationary features such as lines or edges and that the dorsal visual stream network should receive inputs from neurons that respond to local motion cues. It is this concept that allows us to propose that there is a unifying hypothesis that applies to some of the computations performed by both the ventral and the dorsal visual streams.
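The trace-based synaptic modification referred to above can be illustrated with a minimal sketch. The update used here (an exponentially decaying trace of postsynaptic firing multiplied by the current presynaptic input) is one commonly used form of trace rule and is an assumption made for illustration; this section does not give the exact rule, normalization, or parameters used in the simulations. The function name `train`, the learning rate `alpha`, and the trace parameter `eta` are hypothetical.

```python
def train(inputs, weights, alpha=0.1, eta=0.0):
    """Apply a trace-modified Hebb rule over one sequence of input vectors.

    eta = 0 reduces to a plain Hebbian rule; eta > 0 mixes in past activity.
    Weight normalization is omitted to keep the sketch short.
    """
    w = list(weights)
    trace = 0.0  # short-term memory trace of postsynaptic activity
    for x in inputs:
        y = sum(wj * xj for wj, xj in zip(w, x))  # linear firing rate
        trace = (1.0 - eta) * y + eta * trace     # decaying trace of firing
        w = [wj + alpha * trace * xj for wj, xj in zip(w, x)]
    return w


# The same stimulus shown at three non-overlapping positions in succession.
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
w0 = [1.0, 0.0, 0.0]  # the neuron initially responds only at position 0

w_hebb = train(sequence, w0, eta=0.0)   # purely associative rule
w_trace = train(sequence, w0, eta=0.8)  # trace rule
print(w_hebb, w_trace)
```

With the purely associative rule, only the input that is active while the neuron fires is strengthened. With the trace, residual activity from the first position also strengthens the inputs from the later positions, which is how non-overlapping transforms can become linked.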


E. Rolls and S. Stringer

The way in which position-invariant representations in the model develop is illustrated in Figure 4a, where, in the top layer labeled layer 3, individual neurons receive information from different parts of layer 2, where different neurons can represent the same object motion but in different parts of visual space. In the model, layer 2 can thus be thought of as corresponding to some neurons in area MT, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is not position invariant (Lagae et al., 1994). Layer 3 in the model can in the same way be thought of as corresponding to area MST, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is position invariant for 40% of neurons (Lagae et al., 1994). A further correspondence between the model and the brain is that neurons that respond to global planar motion are found in the brain in area MT (Movshon et al., 1985; Newsome et al., 1989) and in the model in layer 1, whereas neurons in V1 and V2 do not respond to global motion (Movshon et al., 1985; Newsome et al., 1989), and correspond in the model to the input layer of Figure 4a. Another type of perceptual analysis that can be understood with the theory and simulations described here is the object-based view-independent representation of objects, exemplified by the ability to see that an “ended” object is rotating clockwise about one of its axes. It was shown in experiment 4 that these representations can be formed by combining information from both the dorsal visual stream (about global motion) and the ventral visual stream (about object shape and/or luminance features). For these representations to be learned, a trace associative or similar learning rule must be used while the object transforms from one view to another (e.g., from upright to inverted). 
A hierarchical network with the general architecture shown in Figure 2 with separate analyses of form and motion that are combined at a final stage (as in experiment 4) is also useful for biological motion, such as representing a person walking (Giese & Poggio, 2003). However, the network described by Giese and Poggio is not very biologically plausible, in that it performs MAX functions to help with the computational issue of transform invariance and does not self-organize on the basis of the inputs so that it must be largely hand-wired. The issue here is that Giese and Poggio suppose that a MAX function is performed to select the maximally active afferent to a neuron, but there is no account of how afferents of just one type (e.g., a bar with a particular orientation and contrast) are being received by a given neuron. Not only is no principle suggested by which this could be achieved, but also no learning algorithm is given to achieve this. We suggest therefore that it would be of interest to investigate whether the more biologically plausible self-organizing type of network described in this article can learn on the basis of the inputs being received to respond to biological motion. To do this, some form of sequence sensitivity would be useful.


The theory described here is appropriate for the global motion analysis required to analyze the flow fields of objects as they translate, rotate, expand (loom), or contract, as shown in experiments 1 to 3. The theory thus provides a model of some of the computations that appear to occur along the pathway V1–V2–MT–MST, as neurons of these types are generated along this pathway (see section 1). The theory described here can also account for global motion in an object-based coordinate frame as shown in experiment 4. Neurons with these properties have been found in the cortex in the anterior part of the macaque superior temporal sulcus, in which neurons respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Area STPa (the cortex in the anterior part of the macaque superior temporal sulcus) contains neurons that respond to a rotating sphere (Anderson & Siegel, 2005), and as shown in experiment 4, the present theory could account for such neurons. Whether the present model could account for the structure from motion also observed for these neurons is not yet known. The theory could also account for neurons in area 7a of the parietal cortex that may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). Neurons have also been found in the primary motor cortex (M1) that respond similarly to neurons in area 7a when a monkey is solving a visually presented maze (Crowe, Chafee, Averbeck, & Georgopoulos, 2004), but their visual properties are not sufficiently understood to know whether the present model might apply. 
Area LIP contains neurons that perform processing related to saccadic eye movements to visual targets (Andersen, 1997), and the present theory may not apply to this type of processing.

The model of processing utilized here in a series of hierarchically organized competitive networks with convergence at each stage (as illustrated in Figure 2) is intended to capture some of the main anatomical and physiological characteristics of the ventral visual stream of visual cortical areas, and is intended to provide a model for how processing in these areas could operate, as described in detail elsewhere (Rolls & Deco, 2002; Rolls & Treves, 1998). To enable learning along this pathway to result by self-organization in the correct representations being formed, associative learning using a short-term memory trace has been proposed (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002). Another approach used in continuous transformation learning utilizes associative learning without a temporal trace and relies on close exemplars of stimuli being provided during the training (Stringer et al., 2006). What we propose here is that similar connectivity and learning processes in the series of cortical pathways in the dorsal visual stream that includes V1–V2–MT–MST and onward connections to the cortex in the superior temporal


sulcus and area 7a could account for the invariant representations of the flow fields produced by moving objects. In relation to the number of stimuli that could be learned by the system, we note that the network simulated is relatively small and was designed to illustrate the new computational hypotheses introduced here rather than to analyze the capacity of such feature hierarchical systems. We note in particular that the network simulated has 1024 neurons in each layer and 100 inputs to each neuron in layers 2 to 4. In contrast, it has been estimated that perhaps half of the macaque brain is involved in visual processing, and typically each neuron has on the order of 10^4 inputs. It will be of interest to use much larger simulations in the future to address capacity issues of this class of network. However, we note that because the network can generalize to rotational flow fields generated by untrained stimuli, as shown in experiment 7, separate representations for the flow fields generated by every object may not be required, and this helps to reduce the number of separate representations that the network may be required to learn. In contrast to some other theories, the theory developed here utilizes a single unified approach to self-organization in the dorsal and ventral visual systems. Predictions of the theory described here include the following. First, use of a trace rule in the dorsal as well as ventral visual system is predicted. (Thus, differences in, for example, the time constants of NMDA receptors, or persistent poststimulus firing, either of which could implement a temporal trace, would not be expected.) Second, a feature hierarchy is a useful way for understanding details of the operation of the ventral visual system, but can now be used as a clarifying concept for how the details of representations in the dorsal visual system may be built. 
Third, the theory predicts that neurons specialized for motion detection by using differences in the arrival times of sensory inputs from different retinal locations need occur at only one stage of the system (e.g., in V1) and need not occur elsewhere in the dorsal visual system. These are labeled as local motion neurons in Figure 4a.

Acknowledgments

This research was supported by the Wellcome Trust and by the Medical Research Council.

References

Andersen, R. A. (1997). Multimodal integration for the representation of space in the posterior parietal cortex. Philosophical Transactions of the Royal Society of London B, 352, 1421–1428.
Anderson, K. C., & Siegel, R. M. (2005). Three-dimensional structure-from-motion selectivity in the anterior superior temporal polysensory area, STPa, of the behaving monkey. Cerebral Cortex, 15, 1299–1307.


Bair, W., & Movshon, J. A. (2004). Adaptive temporal integration of motion in direction-selective neurons in macaque visual cortex. Journal of Neuroscience, 24, 7305–7323.
Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9, 399–417.
Crowe, D. A., Chafee, M. V., Averbeck, B. B., & Georgopoulos, A. P. (2004). Participation of primary motor cortical neurons in a distributed network during maze solution: Representation of spatial parameters and time-course comparison with parietal area 7a. Experimental Brain Research, 158, 28–34.
Deco, G., & Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention: A model with spiking neurons. Journal of Neurophysiology, 94, 295–313.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Duffy, C. J. (2004). The cortical analysis of optic flow. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (Vol. 2, pp. 1260–1283). Cambridge, MA: MIT Press.
Duffy, C. J., & Wurtz, R. H. (1996). Optic flow, posture, and the dorsal visual pathway. In T. Ono, B. L. McNaughton, S. Molotchnikoff, E. T. Rolls, & H. Nishijo (Eds.), Perception, memory and emotion: frontiers in neuroscience (pp. 63–77). Cambridge: Cambridge University Press.
Elliffe, M. C. M., Rolls, E. T., & Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86, 59–71.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Geesaman, B. J., & Andersen, R. A. (1996). The analysis of complex motion patterns by form/cue invariant MSTd neurons. Journal of Neuroscience, 16, 4716–4732.
Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.
Graziano, M. S. A., Andersen, R. A., & Snowden, R. J. (1994). Tuning of MST neurons to spiral motions. Journal of Neuroscience, 14, 54–67.
Hasselmo, M. E., Rolls, E. T., Baylis, G. C., & Nalwa, V. (1989). Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research, 75, 417–429.
Horn, B. K. P., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.
Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. Journal of Neurophysiology, 71, 1597–1626.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). New York: Springer-Verlag.


Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Philosophical Transactions of the Royal Society, 335, 11–21.
Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218.
Rolls, E. T. (2006). The representation of information about faces in the temporal and frontal lobes of primates including humans. Neuropsychologia, PMID = 16797609.
Rolls, E. T., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.
Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12, 2547–2572.
Rolls, E. T., & Stringer, S. M. (2001). Invariant object recognition in the visual system with error correction and temporal difference learning. Network: Computation in Neural Systems, 12, 111–129.
Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex and the neurophysiology of visual masking. Proceedings of the Royal Society, B, 257, 9–15.
Rolls, E. T., Tovee, M. J., Purcell, D. G., Stewart, A. L., & Azzopardi, P. (1994). The responses of neurons in the temporal cortex of primates, and face identification and detection. Experimental Brain Research, 101, 474–484.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 149–162.
Rolls, E. T., Treves, A., Tovee, M., & Panzeri, S. (1997). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Journal of Computational Neuroscience, 4, 309–333.
Sakata, H., Shibutani, H., Ito, Y., & Tsurugai, K. (1986). Parietal cortical neurons responding to rotary movement of visual space stimulus in space. Experimental Brain Research, 61, 658–663.
Seltzer, B., & Pandya, D. N. (1978). Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Research, 149, 1–24.
Sereno, M. I. (1989). Learning the solution to the aperture problem for pattern motion with a Hebb rule. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 468–476). San Mateo, CA: Morgan Kaufmann.
Sereno, M. I., & Sereno, M. E. (1991). Learning to see rotation and dilation with a Hebb rule. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems 3 (pp. 320–326). San Mateo, CA: Morgan Kaufmann.
Stringer, S. M., Perry, G., Rolls, E. T., & Proske, J. H. (2006). Learning invariant object recognition in the visual system with continuous transformations. Biological Cybernetics, 94, 128–142.


Stringer, S. M., & Rolls, E. T. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Networks, 13, 305–315.
Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14, 2585–2596.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139.
Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge, MA: MIT Press.
Wallis, G., & Bülthoff, H. H. (2001). Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences, 98, 4800–4804.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wurtz, R. H., & Kandel, E. R. (2000). Perception of motion depth and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science, 4th ed. (pp. 548–571). New York: McGraw-Hill.

Received January 11, 2006; accepted May 15, 2006.

LETTER

Communicated by Mike Arnold

Recurrent Cerebellar Loops Simplify Adaptive Control of Redundant and Nonlinear Motor Systems

John Porrill
Paul Dean
Centre for Signal Processing in Neuroimaging and Systems Neuroscience, Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.

We have described elsewhere an adaptive filter model of cerebellar learning in which the cerebellar microcircuit acts to decorrelate motor commands from their sensory consequences (Dean, Porrill, & Stone, 2002). Learning stability required the cerebellar microcircuit to be embedded in a recurrent loop, and this has been shown to lead to a simple and modular adaptive control architecture when applied to the linearized 3D vestibular ocular reflex (Porrill, Dean, & Stone, 2004). Here we investigate the properties of recurrent loop connectivity in the case of redundant and nonlinear motor systems and illustrate them using the example of kinematic control of a simulated two-joint robot arm. We demonstrate that (1) the learning rule does not require unavailable motor error signals or complex neural reference structures to estimate such signals (i.e., it solves the motor error problem) and (2) control of redundant systems is not subject to the nonconvexity problem in which incorrect average motor commands are learned for end-effector positions that can be accessed in more than one arm configuration. These properties suggest a central functional role for the closed cerebellar loops, which have been shown to be ubiquitous in motor systems (e.g., Kelly & Strick, 2003).

Neural Computation 19, 170–193 (2007). © 2006 Massachusetts Institute of Technology

1 Introduction

The grace and economy of animal movement suggest that neural control methods are likely to be of interest to robotics. The neural structure particularly associated with coordinated movement is the cerebellum, whose function seems to be the fine-tuning of motor skills by elaborating incomplete or approximate commands issued by higher levels of the motor system (Brindley, 1964). But although the microcircuitry of cerebellar cortex has inspired models for over 30 years (Albus, 1971; Eccles, Ito, & Szentágothai, 1967; Marr, 1969), the results appear to have produced relatively meager benefits for robotics. As Marr himself commented, “In my own case, the cerebellar study . . . disappointed me, because even if the theory was correct, it did not enlighten one about the motor system—it did not, for example, tell one how to go about programming a mechanical arm” (Marr, 1982, p. 15). Part of the problem is that the control function of any individual region of the cerebellum depends not only on its internal microcircuitry but also on the way it is connected with other parts of the motor system (Lisberger, 1998), and in many cases the details of this connectivity are still not well understood. However, recent anatomical investigations have suggested that there may be a common feature of cerebellar connectivity: a recurrent architecture. It appears as if individual regions of the cerebellum (1) receive a copy of the low-level motor commands that constitute system output and (2) apply their corrections to the high-level command. These “multiple closed-loop circuits represent a fundamental architectural feature of cerebro-cerebellar interactions,” and it is a challenge to future studies to “determine the computations that are supported by this architecture” (Kelly & Strick, 2003). Determining these computations might also throw light on how biological control methods using the cerebellum might be of use to robotics.

Our initial attempts to address this issue considered adaptation of the angular vestibular-ocular reflex (aVOR), a classic preparation for studying basic cerebellar function (Boyden, Katoh, & Raymond, 2004; Carpenter, 1988; Ito, 1970). In this reflex, a head-rotation signal from vestibular sensors is used to counter-rotate the eye to maintain stability of the retinal image. Visual processing delays of ∼100 ms mean that movement of the retinal image (a retinal slip) is used as an error signal for calibration of the aVOR rather than for online control, and this calibration requires the floccular region of the cerebellum.
In our simulations of the aVOR, the characteristics of the oculomotor plant (eye muscles plus orbital tissue) were altered, and the flocculus (modeled as an adaptive filter; see section 2) was required to learn the appropriate plant compensation. Stable and robust learning was achieved by connecting the filter so that it decorrelated the retinal slip signal from a copy of the motor command sent to the eye muscles (Dean, Porrill, & Stone, 2002; Porrill, Dean, & Stone, 2004) and sent its output to join the vestibular input. This arrangement constitutes an example of the recurrent architecture described above, and ensuring the stability of the decorrelation control algorithm is therefore a candidate for its computational role. Here we extend these findings to derive some theoretical properties of the recurrent cerebellar architecture, which indicate that it may play a central role in simplifying the adaptive control of nonlinear and redundant biological motor systems. These properties will be illustrated by comparison of forward and recurrent architectures in simulated calibration of the inverse kinematic control of a two-degree-of-freedom (2dof) robot arm. Part of this work has been reported in abstract form (Porrill & Dean, 2004).
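The two-joint planar arm used as the running example can be sketched in a few lines. This is a minimal illustration rather than the authors' simulation code: the link lengths are those quoted in the caption of Figure 2 (true arm 1 and 2, miscalibrated controller 1.1 and 1.8), while the joint-angle convention (second angle measured relative to the first link), the closed-form inverse, the target point, and all function and variable names are assumptions.

```python
import math


def forward(m, l1, l2):
    """Plant P: joint angles (m1, m2) -> end-effector position (x1, x2)."""
    m1, m2 = m
    return (l1 * math.cos(m1) + l2 * math.cos(m1 + m2),
            l1 * math.sin(m1) + l2 * math.sin(m1 + m2))


def inverse(x, l1, l2, elbow=+1):
    """One branch of the inverse kinematics; elbow = +1 or -1 picks the branch."""
    x1, x2 = x
    c2 = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # guard against rounding; target assumed reachable
    m2 = elbow * math.acos(c2)
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2),
                                         l1 + l2 * math.cos(m2))
    return (m1, m2)


TRUE_LENGTHS = (1.0, 2.0)     # actual arm
ASSUMED_LENGTHS = (1.1, 1.8)  # lengths the fixed controller was calibrated for

xd = (1.5, 1.0)                      # desired position (arbitrary units)
m = inverse(xd, *ASSUMED_LENGTHS)    # command from the miscalibrated controller
x = forward(m, *TRUE_LENGTHS)        # where the real arm actually ends up
dx = (x[0] - xd[0], x[1] - xd[1])    # reaching error in Cartesian coordinates
print(dx)
```

Calling `inverse` with `elbow=+1` and `elbow=-1` gives two different joint configurations that reach the same point, and `dx` is the reaching error that the adaptive element would need to remove.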


Figure 1: (a) Schematic representation of the cerebellar microcircuit. MF inputs $u_k$ are analyzed in the granule cell layer (only one MF input is shown; in reality, GCs receive direct input from multiple MFs and indirect input from many more via recurrent connections, not shown here, to PFs), and the GC output signals $p_j$ are distributed along the PFs. The $i$th PC makes contact with many PFs that drive its simple spike output $v_i$. The cell also has a CF input $e_i$, which is assumed in Marr-Albus models to act as a teaching signal for the synaptic weights $w_{ij}$. (b) Interpretation of the cerebellar microcircuit as an adaptive filter. Granule cell processing is modeled as a filter $p_i = G_i(u_1, u_2, \ldots)$, and each PC outputs a weighted sum of its PF inputs. (Figure modified from Dean, Porrill, & Stone, 2004.)

2 The Adaptive Filter Model

We will use what is perhaps the simplest implementation of the Marr-Albus architecture: the adaptive filter (Fujita, 1982). The microcircuit based around a Purkinje cell (PC) and its computational interpretation as an adaptive filter model are shown schematically in Figure 1.
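The adaptive filter of this section (a fixed expansion of the inputs, a weighted sum per Purkinje cell, and the update of equation 2.4) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the class name `AdaptiveFilter`, the polynomial basis, the target mapping, and the value of `beta` are hypothetical, and the teaching signal is simply taken to be the output error, an assumption made only so that the example is self-contained.

```python
class AdaptiveFilter:
    """Fixed basis expansion followed by adaptable linear weights."""

    def __init__(self, basis, n_outputs, beta=0.2):
        self.basis = basis  # granule-cell stage: functions G_j of the input u
        self.beta = beta    # learning rate
        self.w = [[0.0] * len(basis) for _ in range(n_outputs)]  # weights w_ij

    def expand(self, u):
        """Parallel-fiber signals p_j = G_j(u)."""
        return [g(u) for g in self.basis]

    def output(self, u):
        """Purkinje-cell outputs v_i = sum_j w_ij * p_j."""
        p = self.expand(u)
        return [sum(wij * pj for wij, pj in zip(row, p)) for row in self.w]

    def learn(self, u, e):
        """Weight update delta w_ij = -beta * e_i * p_j."""
        p = self.expand(u)
        for i, ei in enumerate(e):
            for j, pj in enumerate(p):
                self.w[i][j] -= self.beta * ei * pj


# Hypothetical basis for a one-dimensional input u = (u_1,).
basis = [lambda u: 1.0, lambda u: u[0], lambda u: u[0] ** 2]
cf = AdaptiveFilter(basis, n_outputs=1)


def target(u):
    """Illustrative mapping that the chosen basis can represent exactly."""
    return 0.5 * u[0] ** 2 - 0.3 * u[0]


grid = [(k / 10.0,) for k in range(-10, 11)]
for _ in range(200):
    for u in grid:
        e = [cf.output(u)[0] - target(u)]  # assumed teaching signal: output error
        cf.learn(u, e)
print(cf.w[0])
```

Because the target lies in the span of the basis, the weights settle close to (0, -0.3, 0.5). How a suitable teaching signal is obtained in a motor control loop is the question taken up in the later sections.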


The mossy fibers (MFs) carry input signals $(u_1, u_2, \ldots)$ that are processed in the granule cell layer to produce the signals $(p_1, p_2, \ldots)$ carried on the parallel fibers (PFs). This process is interpreted as a massive expansion recoding of the mossy fiber inputs of the form

$$p_j = G_j(u_1, u_2, \ldots). \tag{2.1}$$

In the nondynamic problems to be considered here, the $G_j$ are functions of the current values of $(u_1, u_2, \ldots)$. Dynamic problems can be tackled by allowing the $p_j$ to encode aspects of the past history of the $u_i$; in that case, the $G_j$ are required to be more general causal functionals such as tapped delay lines.

We are interested in the output of a subset of Purkinje cells that take their inputs from common parallel fibers. The output $v_i$ of the $i$th such PC is modeled as a weighted linear combination of its PF inputs,

$$v_i = \sum_j w_{ij}\, p_j, \tag{2.2}$$

where the coefficient $w_{ij}$ is the synaptic weight of the $j$th PF on the $i$th PC. The combined transformation from the vector of inputs $u = (u_1, u_2, \ldots)$ to the vector of outputs $v = (v_1, v_2, \ldots)$ can then be written conveniently as a vector equation,

$$v = C(u) = \sum_{i,j} w_{ij}\, G_{ij}(u), \tag{2.3}$$

where $G_{ij}(u) = (0, \ldots, G_j(u), \ldots, 0)$ is the vector with the $j$th parallel fiber signal as its $i$th entry and zeros elsewhere.

In Marr-Albus models, the climbing fiber input $e_i$ to the Purkinje cell is assumed to act as a teaching signal. The qualitative properties of long-term depression (LTD) and long-term potentiation (LTP) at PF/PC synapses are consistent with the anti-Hebbian heterosynaptic covariance learning rule (Sejnowski, 1977),

$$\delta w_{ij} = -\beta\, e_i\, p_j, \tag{2.4}$$

which is identical in form to the least mean square (LMS) learning rule of adaptive control theory (Widrow & Stearns, 1985). Note that in all the above formulas, neural signals are assumed to be coded as firing rates relative to a tonic firing rate, and hence can take both positive and negative values.

3 Example Problem: Learning Inverse Kinematics

We will show that recurrent loops can simplify the adaptive control of nonlinear and redundant motor systems. We derive the results for a problem


Figure 2: (a) Geometry of the planar, two-degree-of-freedom robot arm. Motor commands $(m_1, m_2)$ specify joint angles as shown. The position $(x_1, x_2)$ of the end effector is specified in Cartesian coordinates in arbitrary units. (b) Example plant compensation problem. The arm controller is initially calibrated for the arm lengths $l_1 = 1.1$, $l_2 = 1.8$ of the gray arm. This arm is shown reaching accurately to a point on the gray polar grid covering the work space. When used with the black arm (lengths $l_1 = 1$, $l_2 = 2$), this approximate controller reaches to points on the distorted (black) grid. The task of the cerebellum is to compensate for this miscalibration; reaching errors $(\delta x_1, \delta x_2)$ are provided in Cartesian coordinates.

presenting both of these difficulties: learning robot inverse kinematics. This problem is of general interest, since it is equivalent to the generic problem of learning right inverse mappings. The theoretical results obtained in the general case will be illustrated throughout by application to the inverse kinematic problem for the 2dof robot arm shown in Figure 2, where the geometry is particularly easy to intuit. Although this is a very simple system, it exhibits strong nonlinearity and a discrete redundancy.

We begin by recalling some terminology. The forward kinematics or plant model of a robot arm is the mapping $x = P(m)$ from motor commands $m = (m_1, \ldots, m_M)$ to the end-effector positions $x = (x_1, \ldots, x_S)$. An inverse kinematics is a mapping $m = Q(x)$ that calculates the motor commands corresponding to a given end-effector position. Since this implies that $x = P(Q(x))$, an inverse kinematics is a right inverse $P^{-1}$ for $P$. For redundant systems, a given position can be reached using more than one choice of motor command; in this case, we will use the notation $Q = P^{-1}$ to denote a particular choice of inverse kinematics. For the 2dof arm, there are only two choices of motor command for a given end-effector position. This is an example of a discrete redundancy. More complex systems can have continuous redundancies in which there is a


Figure 3: Forward architecture. The desired position input $x_d$ produces a motor command $m = B(x_d) + C(x_d)$, which is the sum of contributions from a fixed element $B$ and an adaptive cerebellar element $C$. This input to the plant $P$ produces the output position $x$. Training the weights of $C$ requires proximal or motor error $\delta m$ rather than distal or sensory error $\delta x = x - x_d$. Hence, in this forward architecture, motor error must be estimated by backpropagation via a reference matrix $R \approx \partial P^{-1} / \partial x$. This requires detailed prior knowledge of the motor plant.

continuum of motor commands available for each end-effector position. These are sometimes called redundant degrees of freedom. The cerebellum has been described as a “repair shop,” compensating for miscalibration (due to damage, fatigue, or development, for example) of the motor plant (Robinson, 1974). It is this adaptive plant compensation problem that we will model here; that is, we assume that an approximate inverse kinematics controller $B \approx P^{-1}$ is available to the system and that the function of the cerebellum is to supplement this controller to produce more accurate movements. Learning is supervised by making the reaching error,

$$
\delta x = x - x_d = P(B(x_d)) - x_d, \tag{3.1}
$$

available to the learning system, where $x_d$ is the target (desired) position. We call this quantity sensory error since it can be measured by an appropriate sensor and to distinguish it from motor error, which will be defined later. In the 2dof robot arm example shown in Figure 2b, the approximate controller $B$ is the inverse kinematics for a robot with arm lengths that are ±10% in error. Thus, when required to reach to positions on the polar grid shown in Figure 2, the arm actually moves to positions on the overlaid distorted grid.

4 The Forward Learning Architecture

To highlight the properties of recurrent connectivity, we begin by considering the problems encountered in implementing a learning rule for the alternative forward connectivity shown schematically in Figure 3. The motor command to the plant is produced by an open-loop filter that is the sum


J. Porrill and P. Dean

$B + C$ of the fixed element $B$ and the adaptive cerebellar component $C$; this combination will be an inverse kinematics $B + C = P^{-1}$ if $C$ takes the value

$$
C^* = \sum_{ij} w_{ij}^* G_{ij} = P^{-1} - B. \tag{4.1}
$$

We assume that the granule cell basis functions $G_{ij}$ satisfy the matching condition, that is, that synaptic weights $w_{ij}^*$ can be found such that the above equation holds for the range of $P^{-1}$ and $B$ under consideration. To obtain a learning rule similar to the covariance rule (see equation 2.3), we introduce the concept of motor error (Gomi & Kawato, 1992). Motor error $\delta m$ is the error in motor command responsible for the sensory error $\delta x$. Minimizing expected square motor error,

$$
E_m = \frac{1}{2}\left\langle \|\delta m\|^2 \right\rangle = \frac{1}{2}\left\langle \delta m^t\, \delta m \right\rangle, \tag{4.2}
$$

(where a superscript $t$ denotes the matrix transpose), leads to a simple learning rule because motor error is linearly related to synaptic weight error,

$$
\delta m = C(x_d) - C^*(x_d) = \sum_{ij} (w_{ij} - w_{ij}^*)\, G_{ij}(x_d). \tag{4.3}
$$

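Using this expression, the gradient of expected square motor error is

$$
\frac{\partial E_m}{\partial w_{ij}} = \left\langle \delta m^t \frac{\partial\, \delta m}{\partial w_{ij}} \right\rangle = \left\langle \delta m^t\, G_{ij}(x_d) \right\rangle = \left\langle \delta m_i\, p_j \right\rangle, \tag{4.4}
$$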

giving the gradient descent learning rule,

$$
\delta w_{ij} = -\beta \left\langle \delta m_i\, p_j \right\rangle, \tag{4.5}
$$

where $\beta$ is a small, positive constant. If we label Purkinje cells such that the $i$th PC output contributes to the $i$th component of motor error, then comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal $e_i$ provided on the climbing fiber input to the $i$th Purkinje cell must be the $i$th component of motor error,

$$
e_i = \delta m_i. \tag{4.6}
$$

This apparently simple prescription is complicated by the fact that motor error is not in itself an observable quantity. It is a derived quantity given by the equation

$$
\delta m = P^{-1}(x) - P^{-1}(x_d). \tag{4.7}
$$


This leads to an obvious circularity in that the rule for learning inverse kinematics requires prior knowledge of that same inverse kinematics. This circularity can be circumvented to some extent by supposing that all errors are small so that

$$
\delta m \approx \frac{\partial P^{-1}}{\partial x}\, \delta x, \tag{4.8}
$$

and then replacing the unknown Jacobian $\partial P^{-1}/\partial x$ in this error backpropagation rule by a fixed approximation,

$$
R \approx \frac{\partial P^{-1}}{\partial x}. \tag{4.9}
$$

If $R$ were exact, then, if $J$ is the forward Jacobian $J = \partial P/\partial m$, the product $JR$ would be the identity matrix. To ensure stable learning, the approximate $R$ must estimate motor error correctly up to a strict positive realness (SPR) condition, which in this static case requires that the symmetric part of the matrix $JR$ be positive definite. The hypothetical neural structures required to implement this transformation $R$ and recover motor error from observable sensory error,

$$
\delta m \approx R\, \delta x \quad \text{or} \quad \delta m_i \approx \sum_k R_{ik}\, \delta x_k, \tag{4.10}
$$

have been called reference structures (Gomi & Kawato, 1992), so we will call $R$ the reference matrix.

5 The Motor Error Problem

We refer to the requirement that the climbing fibers carry the unobservable motor error signal rather than the observed sensory error signal as the motor error problem. Although the forward architecture has been applied successfully to a number of real and simulated control tasks (notably in the form of the feedback error learning architecture; Gomi & Kawato, 1992, 1993), we will argue here that for generic biological control systems, the complexity of the neural reference structures it requires makes the forward architecture implausible. It is clear that the complexity of the reference structure is multiplicative in the dimension of the control and sensor space. For a task in which $M$ muscles control the output of $N$ sensors, there are $MN$ entries in the reference matrix $R$. For example, in our 2dof robot arm problem, four real numbers must be hard-wired to guarantee learning. For more realistic motor tasks in biological systems (such as reaching while preserving balance), values of 100 or more for $MN$ would not be unreasonable.


Figure 4: (a) The dots show the arrangement of RBF centers, and the circle shows the receptive field radius for the forward architecture. This configuration leads to an accurate fit if exact motor error is provided to the learning algorithm (not shown). (b) A snapshot of performance during learning that illustrates the need for multiple reference structures in nonlinear problems. The reference structure $R$ is chosen as the exact Jacobian at the grid point O marked with a small circle. Although the learned (black) grid overlays the exact (gray) grid more accurately in the neighborhood of O (compare with Figure 2b), performance has clearly deteriorated over the remainder of the work space. Learning rate $\beta = 0.0005$. The effect of reducing the learning rate is to delay but not abolish the divergence. (c) An illustration of the redundancy or nonconvexity problem. The arm controller is set up to consistently choose the configurations shown by the solid and dotted arms when reaching into the top or bottom half of the work space. However, when reaching to points in the shaded horizontal sector, the arm retains the configuration used for the previous target. Hence, arm configuration is chosen effectively randomly in this sector, and the system fails to learn. Learning rate is as above. (Exact motor errors were used in this part of the simulation.)

In fact, this analysis understates the problem since biological motor systems are often nonlinear, and hence the reference structures are valid only locally. This behavior will be illustrated for the 2dof robot arm calibration problem described above (see Figure 2). Details of the radial basis function (RBF) implementation are given in the appendix. Figure 4b shows a snapshot of performance during learning. In this example, the reference matrix $R$ was chosen to be the exact inverse Jacobian at the point O. Clearly this choice satisfies the SPR condition in a neighborhood of O, and hence in this region where $R$ provides good estimates of motor error, reaching accuracy initially improves. However, outside this region, the sign of a component of motor error is wrongly estimated, and errors in this component diverge catastrophically. This instability will eventually spread to the point O itself because of overlap between adjacent RBFs. To ensure stable learning in this example would require at least three reference structures valid on three different sectors of the work space.
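The local validity of a fixed reference matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the arm lengths of the black arm in Figure 2, takes $R$ to be the exact inverse Jacobian at a hypothetical reference posture `O = (0.3, 1.2)`, and tests the static SPR condition by checking that the symmetric part of $JR$ has positive trace and positive determinant. The helper names are not from the original.

```python
import math

L1, L2 = 1.0, 2.0  # lengths of the actual (black) arm in Figure 2


def jacobian(m):
    """Forward Jacobian J = dP/dm of the 2dof arm at joint angles m."""
    m1, m2 = m
    s1, c1 = math.sin(m1), math.cos(m1)
    s12, c12 = math.sin(m1 + m2), math.cos(m1 + m2)
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul_2x2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def spr_holds(m, reference):
    """True if the symmetric part of J(m) R is positive definite."""
    a = matmul_2x2(jacobian(m), reference)
    s00, s11 = a[0][0], a[1][1]
    s01 = 0.5 * (a[0][1] + a[1][0])
    # A symmetric 2x2 matrix is positive definite iff trace > 0 and det > 0.
    return s00 + s11 > 0.0 and s00 * s11 - s01 * s01 > 0.0


O = (0.3, 1.2)                 # hypothetical reference posture
R = inverse_2x2(jacobian(O))   # exact inverse Jacobian at O

for shift in (0.0, 0.5, 2.0):
    print(shift, spr_holds((O[0] + shift, O[1]), R))
```

Rotating the shoulder angle $m_1$ by an angle $a$ rotates the whole arm, so $J$ at the rotated posture is the rotation by $a$ applied to $J$ at O. In this sketch $JR$ is then a pure rotation, and the condition holds along that direction only while $|a| < \pi/2$.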


This requires 3 × 4 = 12 prespecified parameters (not including the extra parameters needed to specify the region of validity of each reference structure). For a general inverse kinematics problem, we must specify $MNK$ parameters, where $K$ is the number of regions required to guarantee positive definiteness of $JR$ in each region. Finally, we note that in the dynamic case, learning must be dynamically stable. For example, in the linear case, the required reference structure is a matrix of transfer functions $R(i\omega)$, and an SPR condition must be satisfied by the matrix transfer function $J(i\omega)R(i\omega)$ at each frequency, further increasing the complexity of the motor error problem.

6 The Redundancy Problem

Most artificial and biological motor systems are redundant, that is, different motor commands can produce the same output. Such redundancy leads to a classic supervised learning problem called the nonconvexity problem: if the training data for the learning system associate multiple motor commands with a given position, and if this set of motor commands is nonconvex, then the system will learn an inappropriate weighted average motor command at that position. The forward architecture shown in Figure 3 is subject to the redundancy problem whenever the controller $B$ is allowed to produce different output commands for the same input. This type of behavior is common in motor tasks; for example, a robot arm configuration might be determined by a combination of convenience and movement history. This type of behavior is illustrated for the 2dof arm in Figure 4c. In this experiment, one arm configuration is used in the top half of the work space and the opposite configuration in the bottom half. However, when reaching into the central sector, the controller reuses the configuration from the previous position (this kind of behavior is common to avoid work space obstacles), and hence the configuration chosen in the central sector is essentially random.
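The effect of averaging over a nonconvex set of motor commands can be seen directly for the 2dof arm. The following sketch assumes the standard two-link kinematics with the lengths of the black arm in Figure 2; the target point and function names are illustrative. It computes the two exact motor commands for one target and then the position reached by their plain average.

```python
import math

L1, L2 = 1.0, 2.0  # lengths of the actual (black) arm in Figure 2


def forward_kinematics(m):
    m1, m2 = m
    return (L1 * math.cos(m1) + L2 * math.cos(m1 + m2),
            L1 * math.sin(m1) + L2 * math.sin(m1 + m2))


def inverse_kinematics(x, xi):
    """Exact inverse; xi = +1 or -1 selects one of the two arm configurations."""
    x1, x2 = x
    cos_theta = (x1 * x1 + x2 * x2 - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    m2 = xi * math.acos(max(-1.0, min(1.0, cos_theta)))
    m1 = math.atan2(x2, x1) - math.atan2(L2 * math.sin(m2),
                                         L1 + L2 * math.cos(m2))
    return (m1, m2)


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


target = (1.5, 1.0)
elbow_a = inverse_kinematics(target, +1)
elbow_b = inverse_kinematics(target, -1)
# What a learner would converge to if both commands were equally likely targets.
averaged = tuple(0.5 * (a + b) for a, b in zip(elbow_a, elbow_b))

for name, m in (("xi=+1", elbow_a), ("xi=-1", elbow_b), ("average", averaged)):
    print(name, distance(forward_kinematics(m), target))
```

In this sketch the two exact solutions are mirror images of each other, so their mean has $m_2 = 0$: a fully extended arm of length $l_1 + l_2$ pointing at the target, which overshoots it by a large margin.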
While learning succeeds in the top and bottom sectors of the work space, the failure to learn in the central sector is evident from the distorted (black) grid of learned positions. This nonconvexity problem can be avoided by providing auxiliary variables $\xi$ to the learning component such that the combination $(x, \xi)$ unambiguously specifies the motor state of the system (although identifying such variables in practice can be nontrivial). For example, the discrete redundancy found in the 2dof arm requires a discrete variable $\xi = \pm 1$ to identify the particular arm configuration to be used. This solution is not particularly satisfactory since it breaks modularity, forcing a controller whose task is simply to reach to a given position to take responsibility for choosing the required arm configuration. More interesting from our point of view, this solution also increases the complexity of the reference structure, since the number $K$ of reference


Figure 5: Recurrent architecture. Here the motor command generated by the fixed element B is the input to the adaptive element C. The output of C is then used as a correction to the input to B. This loop implements Brindley’s (1964) influential suggestion that the cerebellum elaborates commands coming from higher-level structures in the context of information about the current state of the organism. In this recurrent architecture, the sensor error δx becomes effectively proximal to C and, as demonstrated in the text, can be used directly as a teaching signal.

matrices required must be further increased to reflect the dependence on the extra parameters $\xi$. For example, to learn correctly in the 2dof arm example in Figure 4c, the two different arm configurations clearly require different reference matrices; hence, to allow both configurations over the whole work space increases the number of hard-wired parameters by a factor of 2 to 2 × 12 = 24. Even in this simple example, the complexity of the reference structure required to support learning is beginning to approach that of the structure to be learned. Note that the situation is not helped by adding a conventional error feedback loop to the adaptive forward architecture (as in the feedback error learning model). This loop also requires a motor error signal, and since error is available only in sensory coordinates, different reference structures are required for different arm configurations.

7 The Recurrent Architecture

As we noted in section 1, the forward architecture just described ignores a major feature, the recurrent connectivity, of the circuitry in which the cerebellar microcircuit is embedded. An alternative recurrent architecture reflecting this connectivity is shown schematically in Figure 5. Although the analysis of recurrent networks and their learning rules can be very complex (e.g., Pearlmutter, 1995), this architecture is an important exception; in particular, we will show that no backpropagation step is required in the learning rule. The analysis proceeds in two stages. First, a plausible cerebellar learning rule is derived by the familiar method of gradient descent. This derivation does not provide a rigorous proof of convergence because it requires a small-weight-error approximation. Second, a Lyapunov function


for the learning rule is derived and used to demonstrate convergence of the learning rule without the need for the small-weight-error approximation. We are able to simplify the treatment because we deal only with kinematics. Hence, we can idealize the recurrent loop as an algebraic loop in which the output $m$ of the closed loop shown in Figure 5 satisfies the implicit equation,

$$
m = B(x_d + C(m)). \tag{7.1}
$$

Clearly control is possible only if the fixed element $B$ has some left inverse $B^{-1}$. Applying this inverse to equation 7.1, we find that the desired position input $x_d$ is related to the actual motor command $m$ by the equation

$$
x_d = B^{-1}(m) - C(m). \tag{7.2}
$$

Again, we assume the matching condition so that weights exist for which $C$ takes the exact value $C^*$. For this choice of $C$, the desired position will equal the actual output position, that is,

$$
x = P(m) = B^{-1}(m) - C^*(m), \tag{7.3}
$$

from which we derive the following expression for the desired cerebellar filter:

$$
C^* = B^{-1} - P. \tag{7.4}
$$
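Equation 7.4 can be checked numerically for the 2dof arm. In the sketch below, which is an illustration rather than the authors' implementation, $P$ is the arm with lengths (1, 2), $B$ is the exact inverse kinematics for lengths (1.1, 1.8) with one fixed arm configuration, and its left inverse $B^{-1}$ is the forward kinematics for those same lengths. The algebraic loop of equation 7.1 is solved by plain fixed-point iteration, which is an assumed numerical device; the target point and iteration count are arbitrary.

```python
import math

ACTUAL = (1.0, 2.0)    # plant P (black arm in Figure 2)
ASSUMED = (1.1, 1.8)   # lengths built into the fixed element B (gray arm)


def forward(m, lengths):
    m1, m2 = m
    l1, l2 = lengths
    return (l1 * math.cos(m1) + l2 * math.cos(m1 + m2),
            l1 * math.sin(m1) + l2 * math.sin(m1 + m2))


def inverse(x, lengths, xi=1):
    x1, x2 = x
    l1, l2 = lengths
    cos_theta = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    m2 = xi * math.acos(max(-1.0, min(1.0, cos_theta)))
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2),
                                         l1 + l2 * math.cos(m2))
    return (m1, m2)


def plant(m):                   # P
    return forward(m, ACTUAL)


def fixed_element(x):           # B
    return inverse(x, ASSUMED)


def fixed_element_inverse(m):   # B^{-1}, a left inverse of B on this branch
    return forward(m, ASSUMED)


def c_star(m):
    """Exact correction C* = B^{-1} - P (equation 7.4)."""
    b, p = fixed_element_inverse(m), plant(m)
    return (b[0] - p[0], b[1] - p[1])


def no_correction(m):
    return (0.0, 0.0)


def solve_loop(xd, correction, iterations=100):
    """Fixed-point iteration for m = B(xd + C(m)) (equation 7.1)."""
    m = fixed_element(xd)
    for _ in range(iterations):
        c = correction(m)
        m = fixed_element((xd[0] + c[0], xd[1] + c[1]))
    return m


def sensory_error(xd, correction):
    x = plant(solve_loop(xd, correction))
    return math.hypot(x[0] - xd[0], x[1] - xd[1])


target = (1.5, 1.0)
print(sensory_error(target, no_correction), sensory_error(target, c_star))
```

With $C = 0$ the loop reduces to the miscalibrated controller and leaves a visible reaching error; with $C = C^*$ the error vanishes to rounding precision for this target.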

By subtracting equation 7.2 from 7.3, we find that $\delta x = x - x_d = C(m) - C^*(m)$, and substituting $C = \sum w_{ij} G_{ij}$, $C^* = \sum w_{ij}^* G_{ij}$ gives the following simple relationship between sensory error and synaptic weight error:

$$
\delta x = P(m) - x_d = C(m) - C^*(m) = \sum_{ij} (w_{ij} - w_{ij}^*)\, G_{ij}(m). \tag{7.5}
$$

(This is analogous to equation 4.3 relating motor error and synaptic weight error for the forward architecture.) Although this equation is at first sight linear in the weights $w_{ij}$, this appearance is misleading since the argument $m$ also depends implicitly on the $w_{ij}$. However, the appearance of linearity is close enough to the truth to allow us to derive a simple learning rule. If weight errors are small, the second term in the derivative,

$$
\frac{\partial\, \delta x}{\partial w_{ij}} = G_{ij}(m) + \sum_{kl} (w_{kl} - w_{kl}^*)\, \frac{\partial G_{kl}(m)}{\partial w_{ij}}, \tag{7.6}
$$


can be neglected to give the approximation

$$
\frac{\partial\, \delta x}{\partial w_{ij}} \approx G_{ij}(m). \tag{7.7}
$$

Using this result, we can derive an approximate gradient-descent learning rule by minimizing expected square sensory error (rather than motor error, as in the previous section). Defining

$$
E_s = \frac{1}{2}\left\langle \|\delta x\|^2 \right\rangle = \frac{1}{2}\left\langle \delta x^t\, \delta x \right\rangle, \tag{7.8}
$$

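its gradient is

$$
\frac{\partial E_s}{\partial w_{ij}} = \left\langle \delta x^t \frac{\partial\, \delta x}{\partial w_{ij}} \right\rangle \approx \left\langle \delta x^t\, G_{ij}(m) \right\rangle = \left\langle \delta x_i\, p_j \right\rangle, \tag{7.9}
$$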

leading to the approximate gradient-descent learning rule

$$
\delta w_{ij} = -\beta \left\langle \delta x_i\, p_j \right\rangle. \tag{7.10}
$$

In this learning rule, no Jacobian appears, and hence no reference structures embodying prior knowledge of plant parameters are required. Comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal required on the climbing fibers in the recurrent architecture is now the sensory error:

$$
e_i = \delta x_i. \tag{7.11}
$$
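The rule can be tried on a deliberately simple case. The sketch below is not the paper's simulation: it assumes a hypothetical scalar plant $P(m) = g\,m$, a fixed element $B(x) = x/g_0$ with a miscalibrated gain $g_0$, and a single basis function $G(m) = m$, so that $C(m) = w\,m$ and, from $C^* = B^{-1} - P$, the exact weight is $w^* = g_0 - g$. The gains, learning rate, and function names are illustrative. Weights are updated per sample, with the sensory error used directly as the teaching signal.

```python
import random

PLANT_GAIN = 1.0     # g:  P(m) = g * m (hypothetical scalar plant)
ASSUMED_GAIN = 1.2   # g0: B(x) = x / g0, so B^{-1}(m) = g0 * m
W_STAR = ASSUMED_GAIN - PLANT_GAIN   # exact weight from C* = B^{-1} - P
BETA = 0.5           # learning rate (illustrative)


def closed_loop_command(xd, w):
    """Solve m = B(xd + w * m) exactly; valid while w < ASSUMED_GAIN."""
    return xd / (ASSUMED_GAIN - w)


def train(steps=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    weight_errors = [abs(w - W_STAR)]
    for _ in range(steps):
        xd = rng.uniform(-1.0, 1.0)      # random desired position
        m = closed_loop_command(xd, w)   # motor command from the recurrent loop
        dx = PLANT_GAIN * m - xd         # sensory error, the teaching signal
        w -= BETA * dx * m               # delta w = -beta * dx * p, with p = G(m) = m
        weight_errors.append(abs(w - W_STAR))
    return w, weight_errors


w_final, errors = train()
print(w_final, errors[0], errors[-1])
```

In this toy case the closed loop gives $\delta x = (w - w^*)\,m$ exactly, so each update multiplies the weight error by $(1 - \beta m^2)$. With $|m| \le 1$ and $\beta = 0.5$ the weight error can only shrink, and no Jacobian or reference structure enters the update.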

Although this local learning rule has been derived as an approximate gradient-descent rule, its properties are more easily determined by a Lyapunov analysis; this analysis is simplified if we work with the continuous update form of the learning rule:

$$
\dot{w}_{ij} = -\beta\, \delta x_i\, p_j. \tag{7.12}
$$

As a Lyapunov function, we use the sum square synaptic weight error,

$$
V = \frac{1}{2} \sum_{ij} (w_{ij} - w_{ij}^*)^2, \tag{7.13}
$$

which has time derivative

$$
\dot{V} = \sum_{ij} \dot{w}_{ij}\, (w_{ij} - w_{ij}^*) = -\beta \sum_i \delta x_i \sum_j (w_{ij} - w_{ij}^*)\, p_j. \tag{7.14}
$$


Substituting into this expression the expression for sensory error above, we find that

$$
\dot{V} = -\beta\, \|\delta x\|^2. \tag{7.15}
$$
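The substitution step can be written out explicitly. The lines below assume, as the component form $\delta x^t G_{ij}(m) = \delta x_i p_j$ in equation 7.9 implies, that $G_{ij}(m)$ contributes the parallel fiber signal $p_j$ to the $i$th output component only.

```latex
\begin{align*}
\delta x_i &= \sum_j (w_{ij} - w_{ij}^*)\, p_j
  && \text{(component $i$ of equation 7.5)} \\
\dot{V} &= -\beta \sum_i \delta x_i \sum_j (w_{ij} - w_{ij}^*)\, p_j
         = -\beta \sum_i \delta x_i^2
         = -\beta\, \|\delta x\|^2
  && \text{(from equation 7.14)}
\end{align*}
```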

Since its derivative is nonpositive, $V$ is a Lyapunov function, that is, a positive function that decreases over time as learning proceeds. It is unnecessary to appeal to the Lyapunov theorems to determine the behavior of this system sufficiently for practical purposes. The equation above shows that over a fixed period of time, the sum square synaptic weight error decreases by an amount proportional to the mean square sensory error; hence, it is clear that the system can make RMS sensory errors above a certain magnitude only for a limited time, since $V$ would otherwise become negative, which is impossible.

8 The Motor Error Problem Is Solved

It is clear that this architecture solves the motor error problem in that there is no longer a need for unavailable motor error signals on the climbing fibers or for complex reference structures to estimate them. Figure 6a shows the performance of the recurrent architecture for the 2dof arm problem described in section 3 (see the appendix for implementation details). It can be seen that the arm now recalibrates successfully over the whole work space. Figure 6c (top) shows the decrease in position error during training; this decrease is stochastic in nature, as would be expected from the approximate stochastic gradient-descent rule. In contrast, Figure 6c (bottom) shows the monotonic decrease of sum square weight error $V$ during training predicted by the Lyapunov analysis, with the greatest decreases taking place where large errors are made.

9 The Nonconvexity Problem Is Solved

In the recurrent architecture, the nonconvexity problem is easily solved because it does not arise. The adaptive element $C$ takes motor commands as input, and its task is to learn to associate them with a corrective output. Since the motor command completely determines the current configuration, there is no ambiguity to be resolved.
Although we illustrate this property below for the discrete redundancy of the 2dof arm, the reasoning above clearly applies to both discrete and continuous redundancies. This property confers remarkable modularity on recurrent architecture. It means that the connectivity of a cerebellar controller is determined solely by the task and is independent of low-level details such as the particular algorithm chosen for resolving joint angle redundancy or how, for example, reciprocal innervation allocates the tension in antagonistic muscle pairs.


Figure 6: (a) This panel illustrates successful recalibration by the recurrent architecture. After training, the learned (black) grid overlays the exact (gray) grid over the whole work space (compare with the initial performance in Figure 2). Learning rate $\beta = 0.05$. (b) The grid shows the region of motor space corresponding to the robot work space. The dots and the circle indicate RBF centers and the receptive field radius. (c) The two graphs illustrate the stochastic decrease in squared position error (top), same units of length as Figure 2, and the associated monotonic decrease in sum square synaptic weight error (bottom) as predicted by theory. The behavior at the positions marked by arrows illustrates the fact that faster decrease in sum square weight error is associated with larger position error, as predicted by the Lyapunov equation, 7.15.

This property is illustrated in Figure 7a, where the recurrent architecture is applied to the redundant reaching problem described in section 6. It can be seen that learning is now satisfactory over the whole work space. The only modification to the net for the task of Figure 7 is the need for RBF centers covering points in motor space associated with the alternative arm configuration. This grid of RBF centers is shown in Figure 7b. The situation would be only slightly different for a continuous redundancy; in this case, new RBF centers would be needed to cover all points in motor command space accessible by the redundant degrees of freedom.

10 Discussion

As argued in section 1, one reason that cerebellar-inspired models have been of modest use to robotics is that cerebellar connectivity is often poorly understood (Lisberger, 1998). The general idea that identical cerebellar microcircuits can be wired up to perform a wide range of motor and


Figure 7: (a) This panel shows that recurrent architecture solves the redundant reaching problem example described in section 6 for which forward architecture fails (compare with Figure 4). The learned (black) grid overlays the exact (gray) grid accurately over the whole work space, including the horizontal 60 degree sector in which both arm configurations are used. Learning rate β = 0.05. (b) This panel shows the separate grids in motor space associated with the two arm configurations. The grid of dots and the circle indicate the RBF centers and receptive field radius. The dark gray regions highlight motor commands used to generate the arm configurations used consistently in the top and bottom sectors of the work space, and the light gray regions highlight motor commands that generate the two alternative configurations used in the central sector of the work space (the unshaded regions are not used). Since the arm configurations that are ambiguous in task space are represented in separate regions of motor space, they are learned independently in the recurrent architecture.

cognitive functions (Ito, 1984, 1997) is well appreciated, but identifying specific instances has proved difficult. Here we have explored the computational capacities of the cerebellar microcircuit embedded in a recurrent architecture for adaptive feedforward control of nonlinear redundant systems as exemplified by a simulated 2dof robot arm. We have shown that the architecture solves the distal error problem, copes naturally with the redundancy/nonconvexity problem, and gives enhanced modularity. We now compare it with alternative architectures from both computational and biological perspectives.

10.1 Computational Control. The distal error problem arises whenever output errors are used to train internal parameters (Jordan, 1996; Jordan & Wolpert, 2000). It is a fundamental obstacle in neural learning systems, and the consequent lack of biological plausibility of learning rules in neural net supervised learning algorithms has become a cliché. There have been two main previous approaches to solving the distal error problem: (1) continue


to use standard architectures and hypothesize the existence of structures implementing the required error backpropagation schemes, or (2) look for those special architectures in which output errors are themselves sufficient for training. The forward learning architecture (see Figure 3) appears poorly suited for solving the distal error problem. Considerable ingenuity has been expended on the feedback error learning scheme developed by Kawato and coworkers (Gomi & Kawato, 1992, 1993) (for a recent rigorous treatment, see Nakanishi & Schaal, 2004) in order to rescue this architecture, but even so, substantial difficulties remain. In feedback error learning, the adaptive component is embedded in a feedback controller so that the estimated motor error $\delta m_{\mathrm{est}}$ is used as both a training signal and a feedback error term. As we have noted, feedback error learning imposes SPR conditions on the accuracy of the motor error estimate and hence requires complex reference structures for generic redundant and nonlinear systems. There have been theoretical attempts to remove the SPR condition. For example, Miyamura and Kimura (2002) avoid the necessity for SPR at the cost of requiring large gains in the conventional error feedback loop; this is unacceptable in autonomous and biological systems since one of the primary reasons for using adaptive control is to avoid the destabilizing effect of large feedback gains given inevitable feedback delays. Despite the difficulties we have noted here, feedback error learning has been usefully applied to online learning by autonomous robots in numerous contexts (e.g., Dean, Mayhew, Thacker, & Langdon, 1991; Mayhew, Zheng, & Cornell, 1992). It is clear that feedback error learning remains a useful approach for problems in which simplifying features of the motor plant mean that the reference structures are easily estimated. We have presented a general architecture for tracking control in which output error can be used directly for training.
Other architectures of this type in the literature have been designed for specific control problems. For example, the adaptive scheme for control of robot manipulators proposed by Slotine and Li (1989) relies on special features of the problem of controlling joint angles using joint torques. We note also the adaptive schemes for particular single-input–single-output nonlinear systems considered by Patino and Liu (2000) and Nakanishi and Schaal (2004). Although none of these architectures tackles the generic problems of nonlinearity and redundancy considered here, it is interesting to note that they also use recurrent architectures in an essential way, supporting the idea that recurrent connectivity may play a fundamental role in simplifying biological motor control.

10.2 Biological Plausibility. From a biological perspective, the recurrent architecture appears more plausible in the context of plant compensation than the forward architecture, with its requirements for feedback error learning, for three reasons.


First, as pointed out in section 1, anatomical evidence indicates that the recurrent architecture is a feature of many cerebellar microzones. In addition, where it is available, electrophysiological evidence has specifically identified efferent-copy information as part of the mossy fiber input to particular regions of the cerebellum. Important examples are (1) the primate flocculus and ventral paraflocculus, responsible for adaptation of the vestibulo-ocular reflex (VOR), where extensive recordings have shown that about three-quarters of their mossy fiber inputs carry eye-movement-related signals (Miles, Fuller, Braitman, & Dow, 1980); (2) the oculomotor vermis, responsible for saccadic adaptation, where about 25% of mossy fibers show short-latency burst firing in association with saccades that closely resemble the activity of excitatory burst neurons in the paramedian pontine reticular formation (Ohtsuka & Noda, 1992); and (3) regions of cerebellar cortex associated with control of limb movement by the red nucleus, which receive an efferent copy of rubrospinal output to cerebellum, related to limb position and velocity (Keifer & Houk, 1994). Thus, the defining feature of the recurrent architecture used here appears to be present for all the cerebellar microzones that have been adequately investigated. Second, the recurrent architecture allows use of sensory error signals, that is, the sensory consequences of inaccurate motor commands, which are physically available signals. Previous inability to use such distal error signals has been a fundamental obstacle in neural learning systems, so that architectures such as the one described here, in which distal error can be used directly as a teaching signal, have fundamental importance as basic components of biological learning systems. Hence, if the recurrent architecture were used biologically, we would expect cerebellar climbing fibers to carry sensory information.
In contrast, a central consequence of feedback error learning is the identification of climbing fiber signals with motor error. In the words of Gomi and Kawato (1992), “Our view that the climbing fibers carry control error information, the difference between the instructions and the motor act, is common to most cerebellar motor-learning models; however ours is unique in that this error information is represented in motor-command coordinates.” This view runs into both theoretical and empirical problems. From a theoretical point of view, the use of the motor error signal requires not only new and as yet unidentified neural reference structures to recover motor error from observable sensory errors, but also new and as yet unidentified mechanisms to calibrate these structures. Again in the words of Gomi and Kawato (1992), “The most interesting and challenging theoretical problem [raised by FEL] is setting an appropriate inverse reference model in the feedback controller at the spinal and brainstem levels.” From the point of view of experimental evidence, it appears that many climbing fibers are in fact strongly activated by sensory inputs, such as touch, pain, muscle sense, or, in the case of the VOR, retinal slip (e.g., Apps & Garwicz, 2005; De Zeeuw et al., 1998; Simpson, Wylie, & De Zeeuw, 1996).


In certain cases where nonsensory signals have been identified in climbing-fiber discharge (e.g., Andersson, Garwicz, & Hesslow, 1988; Gibson, Horn, & Pong, 2002), their effect seems to be to emphasize the unpredicted sensory consequences of a movement by gating the expected consequences. Such gated signals are still physically available sensory signals. In other instances, for example, retinal slip in the horizontal VOR, it appears that the effect of nonsensory modulation is to produce a two-valued slip signal that conveys information about the direction of image movement but not its speed (Highstein, Porrill, & Dean, 2005; Simpson, Belton, Suh, & Winkelman, 2002). One line of evidence often used to support feedback error learning concerns ocular following, where externally imposed sliplike movements of the retinal image drive compensatory eye movements. It has been shown that the associated complex spike discharge in the flocculus can be predicted from either slip or eye movement signals (Kobayashi et al., 1998). However, given that a later article states that “only the velocity of the retinal error (retinal slip) was statistically sufficient” (Yamamoto, Kobayashi, Takemura, Kawano, & Kawato, 2002, p. 1558) to reproduce the observed variability in climbing fiber discharge, it is not clear that this evidence decisively supports the existence of a motor error signal, even in these special and ambiguous circumstances. Although we have not considered dynamics here, a further advantage of the recurrent architecture is its capacity to generate long-time constant signals when plant compensation requires integrator-like processes (Dean et al., 2002; Porrill et al., 2004). This desirable feature of the recurrent architecture was pointed out for eye-position control by Galiana and Outerbridge (1984) and has been incorporated in other models of gaze holding (e.g., Glasauer, 2003).
It is unclear how forward architectures could be used to achieve the observed performance of the neural integrator (Robinson, 1974).

10.3 Further Developments. Our previous work on plant compensation in the VOR (Dean et al., 2002; Porrill et al., 2004) established the capabilities of the recurrent architecture for dynamic control of linear systems. Here we have extended these results to the kinematic control of redundant and nonlinear systems. It is clearly a priority to extend these results to dynamic nonlinear control problems. It is also important to implement the decorrelation-control scheme in a robot, and this work is currently in progress. We have emphasized the limitations of the feedback-error-learning architecture, but its combination of feedback and feedforward controllers does offer considerable advantages for robust online control. It appears that a possibly similar strategy is used biologically to control gaze stability, combining the optokinetic (feedback) and vestibulo-ocular (feedforward) reflexes (Carpenter, 1988). As we will show elsewhere, the recurrent architecture can also be embedded stably and naturally in a conventional feedback loop.

Recurrent Cerebellar Loops Simplify Adaptive Control


Finally, although the recurrent architecture does not need forward connections for plant compensation (and so they are not shown in Figure 5), such connections are also ubiquitous for cerebellar microzones. We conjecture that once plant compensation has been achieved, a microzone could in principle use these inputs for a wide range of purposes, including sensory calibration and predictive control.

Appendix: Technical Details

Explicit details of the forward and recurrent algorithms for the robotic application are supplied below. This is a vanilla RBF implementation, since our intention is to concentrate on the nature of the error signal and the learning rule. Both accuracy and learning speed could be greatly improved by optimizing the choice of centers and transforming to an optimal basis of receptive fields. The forward kinematics for the 2dof robot arm with arm lengths $(l_1, l_2)$ is given by

$$
\begin{aligned}
x_1 &= P_1(m_1, m_2) = l_1 \cos m_1 + l_2 \cos(m_1 + m_2) \\
x_2 &= P_2(m_1, m_2) = l_1 \sin m_1 + l_2 \sin(m_1 + m_2).
\end{aligned}
\tag{A.1}
$$

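The brainstem component of the controller is defined as the exact inverse kinematics for a robot with slightly different arm lengths $(l_1', l_2')$,

$$
\begin{aligned}
m_1 &= B_1(x_1, x_2) = \tan^{-1}\!\left(\frac{x_2}{x_1}\right) - \tan^{-1}\!\left(\frac{l_2' \sin \xi\theta}{l_1' + l_2' \cos \xi\theta}\right) \\
m_2 &= B_2(x_1, x_2) = \xi\theta,
\end{aligned}
\tag{A.2}
$$

where

$$
\theta = \cos^{-1}\!\left(\frac{x_1^2 + x_2^2 - {l_1'}^2 - {l_2'}^2}{2 l_1' l_2'}\right),
\tag{A.3}
$$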

and the choice of $\xi = \pm 1$ determines the arm configuration. In the forward architecture, the parallel fiber signals are given by gaussian RBFs

$$
p_j = G_j(x_1, x_2) = e^{-\frac{1}{2\sigma^2}\left((x_1 - c_{1j})^2 + (x_2 - c_{2j})^2\right)},
\tag{A.4}
$$

with the centers (c 1 j , c 2 j ) chosen on a rectangular grid covering the work 1 space and with σ equal to 2 /2 times the maximum grid spacing (see Figure 4). The forward architecture implies the following expression for


J. Porrill and P. Dean

the motor commands:

$$
\begin{aligned}
m_1 &= B_1(x_1, x_2) + \sum_j w_{1j}\, p_j \\
m_2 &= B_2(x_1, x_2) + \sum_j w_{2j}\, p_j.
\end{aligned}
\tag{A.5}
$$
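The kinematic maps in equations A.1 to A.3 can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' implementation: the function names, the arm lengths, and the target point are illustrative assumptions. It uses `arctan2` in place of the plain inverse tangent so that the quadrant is preserved, and it clips the argument of the inverse cosine to guard against rounding.

```python
import numpy as np

def forward_kinematics(m, lengths):
    """Equation A.1: joint angles (m1, m2) -> end-point position (x1, x2)."""
    l1, l2 = lengths
    m1, m2 = m
    return np.array([l1 * np.cos(m1) + l2 * np.cos(m1 + m2),
                     l1 * np.sin(m1) + l2 * np.sin(m1 + m2)])

def inverse_kinematics(x, lengths, xi=1.0):
    """Equations A.2 and A.3; xi = +1 or -1 selects the arm configuration."""
    l1, l2 = lengths
    cos_theta = (x[0] ** 2 + x[1] ** 2 - l1 ** 2 - l2 ** 2) / (2.0 * l1 * l2)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l2 * np.sin(xi * theta),
                                             l1 + l2 * np.cos(xi * theta))
    return np.array([m1, xi * theta])

true_lengths = (1.0, 1.0)      # plant arm lengths (hypothetical values)
assumed_lengths = (1.1, 0.9)   # lengths used by the brainstem inverse (hypothetical)
target = np.array([0.9, 0.5])  # a reachable target (hypothetical)

# With matching lengths, the round trip is exact for both configurations.
round_trip = [forward_kinematics(inverse_kinematics(target, true_lengths, xi),
                                 true_lengths)
              for xi in (1.0, -1.0)]

# With mismatched lengths and zero cerebellar weights, a reaching error remains.
m_uncorrected = inverse_kinematics(target, assumed_lengths)
sensory_error = forward_kinematics(m_uncorrected, true_lengths) - target
```

The residual `sensory_error` is the quantity that the adaptive weights are meant to remove; with matching arm lengths it vanishes.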

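The unknown weights $w_{ij}$ were initially set to 0. The two components of sensory error are given by the formula

$$
\begin{aligned}
\delta x_1 &= P_1(m_1, m_2) - x_1 \\
\delta x_2 &= P_2(m_1, m_2) - x_2,
\end{aligned}
\tag{A.6}
$$

from which the two components of motor error are estimated using a $2 \times 2$ reference matrix $R$:

$$
\begin{pmatrix} \delta m_1^{\mathrm{est}} \\ \delta m_2^{\mathrm{est}} \end{pmatrix}
=
\begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}
\begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix}.
\tag{A.7}
$$

RBF weights are updated using the learning rule

$$
w_{ij}^{\mathrm{new}} = w_{ij}^{\mathrm{old}} - \beta\, \delta m_i^{\mathrm{est}}\, p_j.
\tag{A.8}
$$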

In the alternative recurrent architecture, the PF signals are given by RBFs,

$$
p_j = G_j(m_1, m_2) = e^{-\frac{1}{2\sigma^2}\left((m_1 - c_{1j})^2 + (m_2 - c_{2j})^2\right)},
\tag{A.9}
$$

with centers chosen on a rectangular grid in motor space. The grid was chosen to cover the image of the work space in motor space (see Figure 6b). The recurrent architecture of Figure 5 implies the following equation for motor commands

$$
\begin{aligned}
m_1 &= B_1\!\left(x_1 + \sum_j w_{1j}\, p_j,\; x_2 + \sum_j w_{2j}\, p_j\right) \\
m_2 &= B_2\!\left(x_1 + \sum_j w_{1j}\, p_j,\; x_2 + \sum_j w_{2j}\, p_j\right).
\end{aligned}
\tag{A.10}
$$

This is an implicit equation for the motor commands, since the $p_j$ depend on the $m_i$ via equation A.9. Its solution was obtained at each trial by iterating $\mathbf{m}_{n+1} = B(\mathbf{x} + C(\mathbf{m}_n))$ to convergence (a relative accuracy of $10^{-4}$ required at most 10 iterations in the simulations reported here). This off-line iteration is necessary to allow the arm to make discontinuous movements between randomly chosen gridpoints. If waypoints $\mathbf{x}_n$ are closely sampled from a continuous curve, the more natural alternative online procedure $\mathbf{m}_{n+1} = B(\mathbf{x}_n + C(\mathbf{m}_n))$ can be used.


Again the unknown weights $w_{ij}$ were initially set to 0. Sensory error obtained from equation A.6 above was used directly in the learning rule

$$
w_{ij}^{\mathrm{new}} = w_{ij}^{\mathrm{old}} - \beta\, \delta x_i\, p_j.
\tag{A.11}
$$
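The recurrent scheme described above (RBFs over motor space, the fixed-point iteration for the motor command, and the sensory-error learning rule) can be put together in a short simulation. This is only a sketch under stated assumptions, written in Python with NumPy: the arm lengths, the work-space box, the grid limits, the RBF width, the learning rate, the iteration count, and all function names are hypothetical choices made for illustration and are not taken from the simulations reported here.

```python
import numpy as np

L_TRUE = (1.0, 1.0)    # plant arm lengths (hypothetical)
L_BRAIN = (1.1, 0.9)   # lengths assumed by the brainstem inverse (hypothetical)

def plant(m, l=L_TRUE):
    """Forward kinematics of the arm."""
    return np.array([l[0] * np.cos(m[0]) + l[1] * np.cos(m[0] + m[1]),
                     l[0] * np.sin(m[0]) + l[1] * np.sin(m[0] + m[1])])

def brainstem(x, l=L_BRAIN, xi=1.0):
    """Exact inverse kinematics for the assumed lengths; clip guards arccos."""
    c = (x[0] ** 2 + x[1] ** 2 - l[0] ** 2 - l[1] ** 2) / (2.0 * l[0] * l[1])
    theta = xi * np.arccos(np.clip(c, -1.0, 1.0))
    m1 = np.arctan2(x[1], x[0]) - np.arctan2(l[1] * np.sin(theta),
                                             l[0] + l[1] * np.cos(theta))
    return np.array([m1, theta])

# RBF centers on a rectangular grid in motor space (7 x 7, spacing 0.25).
g1, g2 = np.meshgrid(np.linspace(-1.0, 0.5, 7), np.linspace(1.3, 2.8, 7))
CENTERS = np.column_stack([g1.ravel(), g2.ravel()])
SIGMA = 0.25  # set equal to the grid spacing (our choice)

def rbf(m):
    """Parallel-fiber signals p_j as gaussian RBFs of the motor command."""
    return np.exp(-np.sum((m - CENTERS) ** 2, axis=1) / (2.0 * SIGMA ** 2))

def motor_command(x, W, n_iter=20):
    """Iterate m <- B(x + W p(m)), starting from the uncorrected command."""
    m = brainstem(x)
    for _ in range(n_iter):
        m = brainstem(x + W @ rbf(m))
    return m

def rms_error(W, targets):
    errs = [plant(motor_command(x, W)) - x for x in targets]
    return np.sqrt(np.mean(np.sum(np.square(errs), axis=1)))

rng = np.random.default_rng(0)

def sample_targets(k):
    """Random targets in a box that both arms can reach."""
    return np.column_stack([rng.uniform(0.6, 1.2, k), rng.uniform(0.2, 0.8, k)])

W = np.zeros((2, len(CENTERS)))   # weights start at 0
test_set = sample_targets(200)
err_before = rms_error(W, test_set)

beta = 0.1
for x in sample_targets(5000):
    m = motor_command(x, W)
    dx = plant(m) - x                   # sensory error
    W -= beta * np.outer(dx, rbf(m))    # w_ij <- w_ij - beta * dx_i * p_j

err_after = rms_error(W, test_set)
```

Comparing `err_before` with `err_after` on the held-out targets indicates whether the sensory error alone was enough to drive learning in this toy setting. No reference matrix appears anywhere in the loop.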

The optimal weight values $w_{ij}^*$ are required to calculate the sum square weight errors plotted in Figure 6. These were obtained by direct batch minimization of sum square reaching error calculated over a subsampled grid of points in the work space. We have primarily investigated convergence in the circumstances in which it is believed to operate biologically, that is, in repair shop mode with changes in plant characteristics of 10% to 15%. However, simulations have indicated that it also converges from initial weights corresponding to grossly degraded performance, although we have not investigated this systematically.

Acknowledgments

This work was supported by EPSRC grant GR/T10602/01 under their Novel Computation Initiative.

References

Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61. Andersson, G., Garwicz, M., & Hesslow, G. (1988). Evidence for a GABA-mediated cerebellar inhibition of the inferior olive in the cat. Experimental Brain Research, 72, 450–456. Apps, R., & Garwicz, M. (2005). Anatomical and physiological foundations of cerebellar information processing. Nature Reviews Neuroscience, 6(4), 297–311. Boyden, E. S., Katoh, A., & Raymond, J. L. (2004). Cerebellum-dependent learning: The role of multiple plasticity mechanisms. Annual Review of Neuroscience, 27, 581–609. Brindley, G. S. (1964). The use made by the cerebellum of the information that it receives from sense organs. International Brain Research Organization Bulletin, 3, 80. Carpenter, R. H. S. (1988). Movements of the eyes (2nd ed.). London: Pion. De Zeeuw, C. I., Simpson, J. I., Hoogenraad, C. C., Galjart, N., Koekkoek, S. K. E., & Ruigrok, T. J. H. (1998). Microcircuitry and function of the inferior olive. Trends in Neurosciences, 21(9), 391–400. Dean, P., Mayhew, J. E. W., Thacker, N., & Langdon, P. M. (1991). Saccade control in a simulated robot camera-head system: Neural net architectures for efficient learning of inverse kinematics.
Biological Cybernetics, 66, 27–36. Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society of London, Series B, 269(1503), 1895–1904.


Dean, P., Porrill, J., & Stone, J. V. (2004). Visual awareness and the cerebellum: Possible role of decorrelation control. Progress in Brain Research, 144, 61–75. Eccles, J. C., Ito, M., & Szentágothai, J. (1967). The cerebellum as a neuronal machine. Berlin: Springer-Verlag. Fujita, M. (1982). Adaptive filter model of the cerebellum. Biological Cybernetics, 45, 195–206. Galiana, H. L., & Outerbridge, J. S. (1984). A bilateral model for central neural pathways in vestibuloocular reflex. Journal of Neurophysiology, 51(2), 210–241. Gibson, A. R., Horn, K. M., & Pong, M. (2002). Inhibitory control of olivary discharge. Annals of the New York Academy of Sciences, 978, 219–231. Glasauer, S. (2003). Cerebellar contribution to saccades and gaze holding—a modeling approach. Annals of the New York Academy of Sciences, 1004, 206–219. Gomi, H., & Kawato, M. (1992). Adaptive feedback control models of the vestibulocerebellum and spinocerebellum. Biological Cybernetics, 68(2), 105–114. Gomi, H., & Kawato, M. (1993). Neural network control for a closed-loop system using feedback-error-learning. Neural Networks, 6, 933–946. Highstein, S. M., Porrill, J., & Dean, P. (2005). Report on a workshop concerning the cerebellum and motor learning. Held in St Louis October 2004. Cerebellum, 4, 1–11. Ito, M. (1970). Neurophysiological aspects of the cerebellar motor control system. International Journal of Neurology (Montevideo), 7, 162–176. Ito, M. (1984). The cerebellum and neural control. New York: Raven Press. Ito, M. (1997). Cerebellar microcomplexes. International Review of Neurobiology, 41, 475–487. Jordan, M. I. (1996). Computational aspects of motor control and motor learning. In H. Heuer & S. Keele (Eds.), Handbook of perception and action, Vol. 2: Motor skills (pp. 71–120). London: Academic Press. Jordan, M. I., & Wolpert, D. M. (2000). Computational motor control. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 601–618). Cambridge MA: MIT Press.
Keifer, J., & Houk, J. C. (1994). Motor function of the cerebellorubrospinal system. Physiological Reviews, 74(3), 509–542. Kelly, R. M., & Strick, P. L. (2003). Cerebellar loops with motor cortex and prefrontal cortex of a nonhuman primate. Journal of Neuroscience, 23(23), 8432–8444. Kobayashi, Y., Kawano, K., Takemura, A., Inoue, Y., Kitama, T., Gomi, H., & Kawato, M. (1998). Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys II. Complex spikes. Journal of Neurophysiology, 80(2), 832–848. Lisberger, S. G. (1998). Cerebellar LTD: A molecular mechanism of behavioral learning? Cell, 92(6), 701–704. Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman. Mayhew, J. E. W., Zheng, Y., & Cornell, S. (1992). The adaptive control of a four-degrees-of-freedom stereo camera head. Philosophical Transactions of the Royal Society of London, Series B, 337, 315–326.


Miles, F. A., Fuller, J. H., Braitman, D. J., & Dow, B. M. (1980). Long-term adaptive changes in primate vestibuloocular reflex. III. Electrophysiological observations in flocculus of normal monkeys. Journal of Neurophysiology, 43, 1437–1476. Miyamura, A., & Kimura, H. (2002). Stability of feedback error learning scheme. Systems and Control Letters, 45, 303–316. Nakanishi, J., & Schaal, S. (2004). Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10), 1453–1465. Ohtsuka, K., & Noda, H. (1992). Burst discharges of mossy fibers in the oculomotor vermis of macaque monkeys during saccadic eye movements. Neuroscience Research, 15(1–2), 102–114. Patino, H. D., & Liu, D. (2000). Neural network–based model reference adaptive control system. IEEE Transactions on Systems, Man and Cybernetics—Part B: Cybernetics, 30, 198–203. Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 6(5), 1212–1228. Porrill, J., & Dean, P. (2004). Recurrent cerebellar loops simplify adaptive control of redundant and nonlinear motor systems. In 2004 Abstract Viewer/Itinerary Planner (pp. Prog. No. 989.984). Washington, DC: Society for Neuroscience. Porrill, J., Dean, P., & Stone, J. V. (2004). Recurrent cerebellar architecture solves the motor error problem. Proceedings of the Royal Society of London, Series B, 271, 789–796. Robinson, D. A. (1974). The effect of cerebellectomy on the cat’s vestibulo-ocular integrator. Brain Research, 71, 195–207. Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321. Simpson, J. I., Belton, T., Suh, M., & Winkelman, B. (2002). Complex spike activity in the flocculus signals more than the eye can see. Annals of the New York Academy of Sciences, 978, 232–236. Simpson, J. I., Wylie, D. R., & De Zeeuw, C. I. (1996). On climbing fiber signals and their consequence(s). 
Behavioral and Brain Sciences, 19(3), 384–398. Slotine, J. J. E., & Li, W. P. (1989). Composite adaptive-control of robot manipulators. Automatica, 25(4), 509–519. Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Upper Saddle River, NJ: Prentice Hall. Yamamoto, K., Kobayashi, Y., Takemura, A., Kawano, K., & Kawato, M. (2002). Computational studies on acquisition and adaptation of ocular following responses based on cerebellar synaptic plasticity. Journal of Neurophysiology, 87(3), 1554–1571.

Received August 8, 2005; accepted May 30, 2006.

LETTER

Communicated by Stephen José Hanson

Free-Lunch Learning: Modeling Spontaneous Recovery of Memory

J. V. Stone
Psychology Department, Sheffield University, Sheffield S10 2TP, England

P. E. Jupp
School of Mathematics and Statistics, St. Andrews University, St. Andrews KY16 9SS, Scotland

After a language has been learned and then forgotten, relearning some words appears to facilitate spontaneous recovery of other words. More generally, relearning partially forgotten associations induces recovery of other associations in humans, an effect we call free-lunch learning (FLL). Using neural network models, we prove that FLL is a necessary consequence of storing associations as distributed representations. Specifically, we prove that (1) FLL becomes increasingly likely as the number of synapses (connection weights) increases, suggesting that FLL contributes to memory in neurophysiological systems, and (2) the magnitude of FLL is greatest if inactive synapses are removed, suggesting a computational role for synaptic pruning in physiological systems. We also demonstrate that FLL is different from generalization effects conventionally associated with neural network models. As FLL is a generic property of distributed representations, it may constitute an important factor in human memory.

1 Introduction

A popular aphorism states that “there’s no such thing as a free lunch.” However, in the context of learning theory, we propose that there is. In previous work, free-lunch learning (FLL) has been demonstrated using a task in which participants learned the positions of letters on a nonstandard computer keyboard (Stone, Hunkin, & Hornby, 2001). After a period of forgetting, participants relearned a proportion of these letter positions. Crucially, it was found that this relearning induced recovery of the nonrelearned letter positions. Preliminary results suggest that FLL also occurs using face stimuli. If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations, so that relearning

Neural Computation 19, 194–217 (2007)

© 2006 Massachusetts Institute of Technology

Free-Lunch Learning


Figure 1: Free-lunch learning protocol. Two subsets of associations $A_1$ and $A_2$ are learned. After partial forgetting (see text), performance error $E_{\mathrm{pre}}$ on subset $A_1$ is measured. Subset $A_2$ is then relearned to preforgetting levels of performance, and performance error $E_{\mathrm{post}}$ on subset $A_1$ is remeasured. If $E_{\mathrm{post}} < E_{\mathrm{pre}}$, then FLL has occurred, and the amount of FLL is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$.

some old and partially forgotten associations affects the integrity of other old associations. Using neural network models, we show that relearning some associations does not disrupt other stored associations but actually restores them. In essence, recovery occurs in neural network models because each association is distributed among all connection weights (synapses) between units (model neurons). After partial forgetting, relearning some of the associations forces all of the weights closer to preforgetting values, resulting in improved performance even on nonrelearned associations. 1.1 The Geometry of Free-Lunch Learning. The protocol used to examine FLL here is as follows (see Figure 1). First, learn a set of n1 + n2 associations A = A1 ∪ A2 consisting of two intermixed subsets A1 and A2 of n1 and n2 associations, respectively. After all learned associations A have been partially forgotten, measure performance on subset A1 . Finally, relearn only subset A2 , and then remeasure performance on subset A1 . FLL occurs if relearning subset A2 improves performance on A1 . Unless stated otherwise, we assume that for a network with n connection weights, n ≥ n1 + n2 . For the present, we assume that the network has one output unit and two input units, which implies n = 2 connection weights and that A1 and A2 each consist of n1 = n2 = 1 association, as in Figure 2. Input units are



J. Stone and P. Jupp


Figure 2: Geometry of free-lunch learning. (a) A network with two input units and one output unit, with connection weights $w_a$ and $w_b$, defines a weight vector $\mathbf{w} = (w_a, w_b)$. The network learns two associations $A_1$ and $A_2$, where (for example) $A_1$ is the mapping from input vector $\mathbf{x}_1 = (x_{11}, x_{12})$ to desired output value $d_1$; learning consists of adjusting $\mathbf{w}$ until the network output $y_1 = \mathbf{w} \cdot \mathbf{x}_1$ equals $d_1$. (b) Each association $A_1$ and $A_2$ defines a constraint line $L_1$ and $L_2$, respectively. The intersection of $L_1$ and $L_2$ defines a point $\mathbf{w}_0$ that satisfies both constraints, so that zero error on $A_1$ and $A_2$ is obtained if $\mathbf{w} = \mathbf{w}_0$. After partial forgetting, $\mathbf{w}$ is a randomly chosen point $\mathbf{w}_1$ on the circle $C$ with radius $r$, and performance error $E_{\mathrm{pre}}$ on $A_1$ is the squared distance $p^2$. After relearning $A_2$, the weight vector $\mathbf{w}_2$ is in $L_2$, and performance error $E_{\mathrm{post}}$ on $A_1$ is $q^2$. FLL occurs if $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}} > 0$, or equivalently if $Q = p^2 - q^2 > 0$. Relearning $A_2$ has one of three possible effects, depending on the position of $\mathbf{w}_1$ on $C$: (1) if $\mathbf{w}_1$ is under the larger (dashed) arc $C_{FLL}$ as shown here, then $p^2 > q^2$ ($\delta > 0$) and therefore FLL is observed; (2) if $\mathbf{w}_1$ is under the smaller (dotted) arc, then $p^2 < q^2$ ($\delta < 0$), and therefore negative FLL is observed; and (3) if $\mathbf{w}_1$ is at the critical point $\mathbf{w}_{\mathrm{crit}}$, then $p^2 = q^2$ ($\delta = 0$). Given that $\mathbf{w}_1$ is a randomly chosen point on $C$ and that the length of $C_{FLL}$ is $S_{FLL}$, the probability of FLL is $P(\delta > 0) = S_{FLL}/\pi r$ (i.e., the proportion of $C_{FLL}$ under the upper semicircle of $C$).

connected to the output unit via weights $w_a$ and $w_b$, which define a weight vector $\mathbf{w} = (w_a, w_b)$. Associations $A_1$ and $A_2$ consist of different mappings from the input vectors $\mathbf{x}_1 = (x_{11}, x_{12})$ and $\mathbf{x}_2 = (x_{21}, x_{22})$ to desired output values $d_1$ and $d_2$, respectively. If a network is presented with input vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, then its output values are $y_1 = \mathbf{w} \cdot \mathbf{x}_1 = w_a x_{11} + w_b x_{12}$ and $y_2 = \mathbf{w} \cdot \mathbf{x}_2 = w_a x_{21} + w_b x_{22}$, respectively. Network performance error for $k = 2$ associations is defined as $E(\mathbf{w}, A) = \sum_{i=1}^{k} (d_i - y_i)^2$.


The weight vector $\mathbf{w}$ defines a point in the $(w_a, w_b)$-plane. For an input vector $\mathbf{x}_1$, there are many different combinations of weight values $w_a$ and $w_b$ that give the desired output $d_1$. These combinations lie on a straight line $L_1$, because the network output is a linear weighted sum of input values. A corresponding constraint line $L_2$ exists for $A_2$. The intersection of $L_1$ and $L_2$ therefore defines the only point $\mathbf{w}_0$ that satisfies both constraints, so that zero error on $A_1$ and $A_2$ is obtained if and only if $\mathbf{w} = \mathbf{w}_0$. Without loss of generality, we take the origin to be the intersection $\mathbf{w}_0$ of $L_1$ and $L_2$. We now consider the geometric effect of partial forgetting of both associations, followed by relearning $A_2$. This geometric account applies to a network with two weights (see Figure 2) and depends on the following observation: if the length of the input vector is $\|\mathbf{x}_1\| = 1$, then the performance error $E(\mathbf{w}, A_1) = (d_1 - y_1)^2$ of a network with weight vector $\mathbf{w}$ when tested on association $A_1$ is equal to the squared distance between $\mathbf{w}$ and the constraint line $L_1$ (see appendix C). For example, if $\mathbf{w}$ is in $L_1$, then $E(\mathbf{w}, A_1) = 0$, but as the distance between $\mathbf{w}$ and $L_1$ increases, so $E(\mathbf{w}, A_1)$ must increase. For the purposes of this geometric account, we assume that $\|\mathbf{x}_1\| = \|\mathbf{x}_2\| = 1$. Partial forgetting is induced by adding isotropic noise $\mathbf{v}$ to the weight vector $\mathbf{w} = \mathbf{w}_0$. This effectively moves $\mathbf{w}$ to a randomly chosen point $\mathbf{w}_1 = \mathbf{w}_0 + \mathbf{v}$ on the circle $C$ of radius $r = \|\mathbf{v}\|$, where $r$ represents the amount of forgetting. For a network with $\mathbf{w} = \mathbf{w}_1$, learning $A_2$ moves $\mathbf{w}$ to the nearest point $\mathbf{w}_2$ on $L_2$ (see appendix B), so that $\mathbf{w}_2$ is the orthogonal projection of $\mathbf{w}_1$ on $L_2$. Before relearning $A_2$, the performance error $E_{\mathrm{pre}}$ on $A_1$ is the squared distance $p^2$ between $\mathbf{w}_1$ and its orthogonal projection on $L_1$ (see appendix C). After relearning $A_2$, the performance error $E_{\mathrm{post}}$ is the squared distance $q^2$ between $\mathbf{w}_2$ and its orthogonal projection on $L_1$.
The amount of FLL is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$ and, for a network with two weights, is equal to $Q = p^2 - q^2$. The probability $P(\delta > 0)$ of FLL given $L_1$ and $L_2$ is equal to the proportion of points on $C$ for which $\delta > 0$ (or, equivalently, for which $Q > 0$). For example, averaging over all subsets $A_1$ and $A_2$, there is the probability $P(\delta > 0) = 0.68$ that relearning $A_2$ induces FLL of $A_1$ (see Figure 5), a probability that increases with the number of weights (see theorem 3). If we drop the assumption that a network has only two input units, then we can consider subsets $A_1$ and $A_2$ with $n_1 > 1$ and $n_2 > 1$ associations. If the number of connection weights $n \ge \max(n_1, n_2)$, then $A_1$ and $A_2$ define an $(n - n_1)$-dimensional subspace $L_1$ and an $(n - n_2)$-dimensional subspace $L_2$, respectively. The intersection of $L_1$ and $L_2$ corresponds to weight vectors that generate zero error on $A = A_1 \cup A_2$. Finally, we can drop the assumption that a network has only one output unit, because the connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.
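The two-weight geometry can be explored with a small Monte Carlo simulation. The sketch below (Python with NumPy) is illustrative only: the function names, the seed, the number of runs, and the single hand-picked association used to exercise the delta rule are assumptions. It first checks that repeated delta-rule updates on one association end at the orthogonal projection onto its constraint line, and then estimates $P(\delta > 0)$ with $\mathbf{w}_0$ placed at the origin (so both desired outputs are 0), gaussian input vectors, and $\|\mathbf{v}\| = 1$.

```python
import numpy as np

def relearn_delta_rule(w, x, d, eta, n_steps=200):
    """Gradient descent on (d - w.x)^2 for a single association (x, d)."""
    for _ in range(n_steps):
        w = w + 2.0 * eta * (d - w @ x) * x
    return w

def fll_two_weights(n_runs, rng, r=1.0):
    """Return delta = E_pre - E_post for n_runs random two-weight networks."""
    x1 = rng.standard_normal((n_runs, 2))   # input vector of A1
    x2 = rng.standard_normal((n_runs, 2))   # input vector of A2
    phi = rng.uniform(0.0, 2.0 * np.pi, n_runs)
    # With w0 at the origin, the forgotten weights are w1 = v, a point on C.
    w1 = r * np.column_stack([np.cos(phi), np.sin(phi)])
    # Relearning A2: orthogonal projection of w1 onto L2 = {w : w.x2 = 0}.
    coef = np.sum(w1 * x2, axis=1) / np.sum(x2 * x2, axis=1)
    w2 = w1 - coef[:, None] * x2
    e_pre = np.sum(w1 * x1, axis=1) ** 2    # error on A1 before relearning
    e_post = np.sum(w2 * x1, axis=1) ** 2   # error on A1 after relearning
    return e_pre - e_post

# Delta rule versus closed-form projection for one association.
x = np.array([3.0, 4.0])
d = 2.0
w_start = np.array([1.0, -1.0])
w_delta = relearn_delta_rule(w_start, x, d, eta=0.01)
w_proj = w_start - x * (w_start @ x - d) / (x @ x)

rng = np.random.default_rng(1)
delta = fll_two_weights(20000, rng)
p_fll = np.mean(delta > 0)     # estimate of P(delta > 0)
mean_delta = np.mean(delta)    # estimate of E[delta]
```

Because the sign of $\delta$ does not change when an input vector is rescaled, the estimate `p_fll` should be comparable with the value of 0.68 quoted in the text, even though the inputs here are not normalized to unit length.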


2 Methods

Given a network with $n$ input units and one output unit, the set $A$ of associations consisted of $k$ input vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_k)$ and $k$ corresponding desired scalar output values $(d_1, \ldots, d_k)$. Each input vector comprises $n$ elements $\mathbf{x} = (x_1, \ldots, x_n)$. The values of $x_i$ and $d_i$ were chosen from a gaussian distribution with unit variance (i.e., $\sigma_x^2 = \sigma_d^2 = 1$). A network’s output $y_i$ is a weighted sum of input values, $y_i = \mathbf{w} \cdot \mathbf{x}_i = \sum_{j=1}^{n} w_j x_{ij}$, where $x_{ij}$ is the $j$th value of the $i$th input vector $\mathbf{x}_i$, and each weight $w_j$ is one input-output connection. Given that the network error for a given set of $k$ associations is $E(\mathbf{w}, A) = \sum_{i=1}^{k} (d_i - y_i)^2$, the derivative $\nabla E(\mathbf{w}) = -2 \sum_{i=1}^{k} (d_i - y_i)\mathbf{x}_i$ of $E$ with respect to $\mathbf{w}$ yields the delta learning rule $\mathbf{w}_{\mathrm{new}} = \mathbf{w}_{\mathrm{old}} - \eta \nabla E(\mathbf{w}_{\mathrm{old}})$, where $\eta$ is the learning rate, which is adjusted according to the number of weights. A learning trial consists of presenting the $k$ input vectors to the network and then updating the weights using the delta rule. Learning was stopped when $\|\nabla E(\mathbf{w})\| < k \times 0.001$, where $\|\nabla E(\mathbf{w})\|$ is the magnitude of the gradient. Initial learning of the $k = n$ associations in $A = A_1 \cup A_2$ was performed by solving a set of $n$ simultaneous equations using a standard method, after which perfect performance on all $n$ associations was obtained. Partial forgetting was induced by adding an isotropic noise vector $\mathbf{v}$ with $r = \|\mathbf{v}\| = 1$. Relearning the $n_2 = n/2$ associations in $A_2$ was implemented with $k = n_2$ using the delta rule.

3 Results

Our four main theorems are summarized here, and proofs are provided in the appendixes. These theorems apply to a network with $n$ weights that learns $n_1 + n_2$ associations $A = A_1 \cup A_2$ and, after partial forgetting, relearns the $n_2$ associations in $A_2$.

Theorem 1. The probability $P(\delta > 0)$ of FLL is greater than 0.5.

Theorem 2. The expected amount of FLL per association in $A_1$ is

$$
E[\delta/n_1] = \frac{n_2}{n^2}\, E[\|\mathbf{x}\|^2]\, E[\|\mathbf{v}\|^2].
\tag{3.1}
$$

For given values of $E[\|\mathbf{x}\|^2]$ and $E[\|\mathbf{v}\|^2]$, the value of $n_2$ that maximizes $E[\delta/n_1]$ (subject to $n_1 + n_2 \le n$) is $n_2 = n - n_1$. If each input vector $\mathbf{x} = (x_1, \ldots, x_n)$ is chosen from an isotropic (e.g., isotropic gaussian) distribution and the variance of $x_i$ is $\sigma_x^2$, then $E[\|\mathbf{x}\|^2] = n\sigma_x^2$. If $\sigma_x^2$ is the same for all $n$, then the state of a neuron (with a typical


sigmoidal transfer function) would be in a constantly saturated state as the number of synapses increases. One way to prevent this saturation is to assume that the efficacy of synapses on a given neuron decreases as the number of synapses increases. If forgetting is caused primarily by learning spurious inputs, then the delta learning rule used here implies that the “amount of forgetting” $\|\mathbf{v}\|$ is approximately independent of $n$. We therefore assume that $\|\mathbf{v}\|$ and $\sigma_x^2$ are constant, and for convenience, we set $\|\mathbf{v}\| = 1$ and $\sigma_x^2 = 1$. Substituting these values into equation 3.1 yields

$$
E[\delta/n_1] = \frac{n_2}{n}.
\tag{3.2}
$$
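Equation 3.2 can be checked with a short simulation of the full protocol. The following is a minimal sketch in Python with NumPy, not the procedure of section 2: the function name, the network sizes, the seed, and the number of runs are assumptions, and relearning $A_2$ is implemented directly as the orthogonal projection onto the solution set of $A_2$ (via a pseudoinverse) rather than by iterating the delta rule.

```python
import numpy as np

def fll_per_association(n, n1, n2, rng):
    """One run of the FLL protocol; returns delta / n1."""
    X = rng.standard_normal((n1 + n2, n))   # input vectors, sigma_x = 1
    d = rng.standard_normal(n1 + n2)        # desired outputs
    w0 = np.linalg.pinv(X) @ d              # zero-error weights (needs n1 + n2 <= n)
    v = rng.standard_normal(n)
    w1 = w0 + v / np.linalg.norm(v)         # partial forgetting with ||v|| = 1
    X1, d1, X2, d2 = X[:n1], d[:n1], X[n1:], d[n1:]
    # Relearning A2: orthogonal projection of w1 onto {w : X2 w = d2}.
    w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)
    e_pre = np.sum((d1 - X1 @ w1) ** 2)
    e_post = np.sum((d1 - X1 @ w2) ** 2)
    return (e_pre - e_post) / n1

rng = np.random.default_rng(2)
# n1 + n2 = n: equation 3.2 predicts n2 / n = 0.5.
mean_full = np.mean([fll_per_association(20, 10, 10, rng) for _ in range(2000)])
# Same associations with twice as many weights: prediction n2 / n = 0.25.
mean_redundant = np.mean([fll_per_association(40, 10, 10, rng) for _ in range(2000)])
```

The second estimate uses the same numbers of associations with redundant weights, so the expected amount of FLL per association should be roughly halved.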

Using these assumptions, simulations of networks with $n = 2$ and $n = 100$ weights agree with equation 3.2, as shown in Figure 3. The role of pruning can be demonstrated as follows. Consider a network with 100 input units and one output unit with $n = 100$ weights. If $n_2 = 90$ associations are relearned out of an original set of $n_1 + n_2 = 100$ associations, then $E[\delta/n_1] = n_2/n = 0.90$. However, if $n = 1000$, then $E[\delta/n_1] = 0.09$. In general, as the number $n - (n_1 + n_2)$ of unpruned redundant weights increases, so $E[\delta/n_1]$ decreases. Therefore, $E[\delta/n_1]$ is maximized if $n_1 + n_2 = n$. If $n_1 + n_2 < n$, then the expected amount of FLL is not maximal and can therefore be increased by pruning redundant weights until $n = n_1 + n_2$ (see Figure 4). Note that for a particular network, performance error $E_{\mathrm{post}}$ on $A_1$ after learning $A_2$ can be zero. For example, if $\mathbf{w} = \mathbf{w}^*$ in Figure 2, then $p = q = 0$, which implies that $\delta/n_1 = E_{\mathrm{post}} = q^2 = 0$.

Theorem 3. The probability $P(\delta > 0)$ of FLL of $A_1$ satisfies

$$
P(\delta > 0) > 1 - \frac{a_0(n, n_1, n_2) + a_1(n, n_2)\, \mathrm{var}(\|\mathbf{x}\|^2)/E[\|\mathbf{x}\|^2]^2}{n_1 n_2 (n + 2)^2},
\tag{3.3}
$$

where

$$
a_0(n, n_1, n_2) = 2\left\{ n_1 (n + 2)(n - n_2) + n(n - n_2) + n(n + 2)(n - 1) \right\}
\tag{3.4}
$$

$$
a_1(n, n_2) = n_2 (2n + n_2 + 6).
\tag{3.5}
$$

Theorem 3 implies that if the numbers ($n_1$ and $n_2$) of associations in $A_1$ and $A_2$ are fixed nonzero proportions of the number $n$ of connection weights ($n_1/n$ and $n_2/n$, respectively) and $\mathrm{var}(\|\mathbf{x}\|^2)/(n E[\|\mathbf{x}\|^2]^2) \to 0$ as $n \to \infty$, then $P(\delta > 0) \to 1$ as $n \to \infty$; and the probability that each of the $n_1$ associations in $A_1$ exhibits FLL is $P(\delta/n_1 > 0) = P(\delta > 0)$ because $\delta > 0$ iff $\delta/n_1 > 0$. For example, if we assume that each input vector $\mathbf{x} = (x_1, \ldots, x_n)$ is chosen from an isotropic (e.g., isotropic gaussian) distribution and the


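variance of $x_i$ is $\sigma_x^2$, then $\mathrm{var}(\|\mathbf{x}\|^2)/E[\|\mathbf{x}\|^2]^2 = 2/n$. This ensures that $\mathrm{var}(\|\mathbf{x}\|^2)/(n E[\|\mathbf{x}\|^2]^2) \to 0$ as $n \to \infty$, and therefore that $P(\delta > 0) \to 1$ as $n \to \infty$. Using this assumption, an approximation of the right-hand side of equation 3.3 yields

$$
P(\delta > 0) > 1 - \frac{2(1 + \alpha_1 - \alpha_1 \alpha_2)}{n \alpha_1 \alpha_2} - \frac{2(2 + \alpha_2 + 6/n)}{\alpha_1 \alpha_2 (n + 2)^2},
\tag{3.6}
$$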

where $\alpha_1 = n_1/n$ and $\alpha_2 = n_2/n$. In this form, it is easy to see that $P(\delta > 0) \to 1$ as $n \to \infty$. We briefly consider the case $n_1 \ge n$ and $n_2 \ge n$, so that each of $L_1$ and $L_2$ is a single point. If the distance $D$ between these points is much less than $\|\mathbf{v}\|$, then simple geometry shows that performance error $E_{\mathrm{pre}}$ on $A_1$ is large and that relearning $A_2$ reduces this error for any $\mathbf{v}$ (i.e., with probability 1) with $E_{\mathrm{post}} \propto D^2$, even in the absence of initial learning of $A_1$ and $A_2$ (see equation A.18 in appendix A). A similar conclusion is implicit in Atkins and Murre (1998).

Theorem 4. If, instead of relearning $A_2$, the network learns a new subset $A_3$ (drawn from the same distribution as $A_2$), then the expected amount of FLL is less than the expected amount of FLL after relearning subset $A_2$.

Learning $A_3$ is analogous to the control condition used with human participants (Stone et al., 2001), and the finding that the amount of recovery after learning $A_3$ is less than the amount of recovery after relearning $A_2$ is predicted by theorem 4.

Figure 3: Distribution of free-lunch learning. (a) Histogram of amount of FLL $\delta/n_1$ per association, based on 1000 runs, for a network with $n = 2$ weights (see section 2). After learning two association subsets ($\eta = 0.1$), $A_1$ and $A_2$, containing $n_1 = 1$ and $n_2 = 1$ associations (respectively), the network has a weight vector $\mathbf{w}_0$. Forgetting is then induced by adding a noise vector $\mathbf{v}$ with $\|\mathbf{v}\|^2 = 1$ to $\mathbf{w}_0$. One association $A_2$ is then relearned, and the change in performance on $A_1$ is measured as $\delta/n_1$ (see Figure 2). Negative values indicate that performance on $A_1$ decreases after relearning $A_2$. (b) Histogram of amount of FLL $\delta/n_1$ per association for a network with $n = 100$ weights and $\eta = 0.005$, with $A_1$ and $A_2$ each consisting of $n_1 = n_2 = 50$ associations, using the same protocol as in (a). In both (a) and (b), the mean value of $\delta/n_1$ is about 0.5, as predicted by equation 3.2. As the number of associations learned increases, the amount of FLL becomes more tightly clustered around $\delta/n_1 = 0.5$, as demonstrated in these two histograms, and the probability of FLL increases (also see Figure 5).


Figure 4: Effect of pruning on free-lunch learning. Graph of the expected amount of FLL per association E[δ/n1 ] as a function of the total number n1 + n2 of learned associations in A = A1 ∪ A2 , as given in equation 3.2. In this example, the number of connection weights is fixed at n = 100, and the number of associations in A = A1 ∪ A2 increases from n1 + n2 = 2 to n1 + n2 = 100. The number n2 of relearned associations in A2 is a constant proportion (0.5) of the associations in A. If n1 + n2 ≤ n, then the network contains n − (n1 + n2 ) unpruned redundant connections. Thus, pruning effectively increases as n1 + n2 increases because, as the number n1 + n2 of associations grows, so the number of unpruned redundant connections decreases. The expected amount of FLL per association E[δ/n1 ] increases as the amount of pruning increases.

4 Discussion

Theorems 1 to 4 provide the first proof that relearning induces nontransient recovery, where postrecovery error is potentially zero. This contrasts with the usually small and transient recovery that occurs during the initial phase of relearning forgotten associations (Hinton & Plaut, 1987; Atkins & Murre, 1998), and during learning of new associations (Harvey & Stone, 1996). In particular, theorem 2 is predictive inasmuch as it suggests that the amount of FLL in humans should be (1) proportional to the amount of forgetting of $A = A_1 \cup A_2$ and (2) proportional to the proportion $n_2/(n_1 + n_2)$ of associations relearned after partial forgetting of $A$. We have assumed that the number $n_1 + n_2$ of associations $A = A_1 \cup A_2$ encoded by a given neuron is not greater than the number $n$ of input connections (synapses) to that neuron. Given that each neuron typically has


Figure 5: Probability of free-lunch learning. The probability $P(\delta > 0)$ of FLL of associations $A_1$ as a function of the total number $n_1 + n_2$ of learned associations $A = A_1 \cup A_2$ for networks with $n = n_1 + n_2$ weights. Each of the two subsets of associations $A_1$ and $A_2$ consists of $n_1 = n_2 = n/2$ associations. After learning and then partially forgetting $A$, performance on $A_1$ was measured. $P(\delta > 0)$ is the probability that performance on subset $A_1$ is better after subset $A_2$ has been relearned than it is before $A_2$ has been relearned. Solid line: Empirical estimate of $P(\delta > 0)$. Each data point is based on 10,000 runs, where each run uses input vectors chosen from an isotropic gaussian distribution (see section 2). Dashed line: Theoretical lower bound on the probability of FLL, as given by theorems 1 and 3, assuming that input vectors are chosen from an isotropic (e.g., isotropic gaussian) distribution.
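The trend summarized in Figure 5 can be reproduced in outline with a Monte Carlo estimate. This is a rough sketch in Python with NumPy, not the code used for the figure: the function name, the two network sizes, the seed, and the run counts are assumptions, and relearning is again modeled as an orthogonal projection. Weight vectors are measured relative to $\mathbf{w}_0$, so the desired outputs drop out of the calculation.

```python
import numpy as np

def prob_fll(n, n_runs, rng):
    """Monte Carlo estimate of P(delta > 0) with n1 = n2 = n / 2 (n even)."""
    n1 = n2 = n // 2
    hits = 0
    for _ in range(n_runs):
        X1 = rng.standard_normal((n1, n))   # inputs of A1
        X2 = rng.standard_normal((n2, n))   # inputs of A2
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)              # forgetting noise with ||v|| = 1
        # Relative to w0, the weights are v before relearning and v minus its
        # component in the row space of X2 after relearning A2.
        v_post = v - np.linalg.pinv(X2) @ (X2 @ v)
        delta = np.sum((X1 @ v) ** 2) - np.sum((X1 @ v_post) ** 2)
        hits += int(delta > 0)
    return hits / n_runs

rng = np.random.default_rng(3)
p_small = prob_fll(2, 5000, rng)    # two weights, one association per subset
p_large = prob_fll(50, 1000, rng)   # fifty weights, 25 associations per subset
```

Extending the list of network sizes and plotting the estimates against $n$ would give an empirical curve of the kind described by the solid line in the caption.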

many thousands of synapses (e.g., cerebellar Purkinje cells), it seems likely that this assumption is valid. However, the total amount of FLL is maximal if n1 = n2 = n/2, so that the full potential of FLL can be realized only if n1 + n2 = n. This optimum number of synapses can be achieved if inactive (i.e., redundant) synapses are pruned. Pruning may therefore contribute to FLL in physiological systems (Purves & Lichtman, 1980; Goldin, Segal, & Avignone, 2001). We have also assumed that a delta rule is used to learn associations between inputs and desired outputs. This general type of supervised learning is thought to be implemented by the cerebellum and basal ganglia (Doya, 1999). Models of the cerebellum (Dean, Porrill, & Stone, 2002) use a delta rule to implement learning. Similarly, models of the basal ganglia (Nakahara, Itoh, Kawagoe, Takikawa, & Hikosaka, 2004) use a temporally discounted


form of delta rule, the temporal difference rule. This temporal difference rule has also been used to model learning in humans (Seymour et al., 2004), and (under mild conditions) is equivalent to the standard delta rule (Sutton, 1988). Indeed, from a purely computational perspective, it is difficult to conceive how these forms of associative learning could be implemented without some form of delta rule. Our analysis is based on the assumption that the network model is linear. Of course, many nonlinear networks can be approximated by linear networks, but it is possible that the results derived here have limited applicability to certain classes of nonlinear networks. Relation to Task Generalization. It is only natural to ask how FLL relates to tasks that a human might learn. One obvious but vital condition for FLL is that different associations must be encoded by a common set of neuronal connections. Aside from this condition, it might be thought that relearning $A_2$ improves performance on $A_1$ because $A_1$ and $A_2$ are somehow related (as in Hanson & Negishi, 2002; Dienes, Altmann, & Gao, 1999), so that learning $A_2$ generalizes to $A_1$. This form of task generalization can occur if $A_1$ and $A_2$ are related as follows. If the input-output pairs in $A_1$ and $A_2$ are sampled from a sufficiently smooth function $f$ and $n_1 \gg n$ and $n_2 \gg n$, then $A_1$ and $A_2$ are statistically related, and therefore the weights induced by learning $A_1$ are similar to those induced by learning $A_2$. Consequently, the resultant network input-output functions $g_1$ and $g_2$ (respectively) both approximate the function $f$ (i.e., $g_1 \approx g_2 \approx f$). In this case, learning $A_2$ yields good performance on $A_1$. In the context of FLL, if $A_1 \cup A_2$ is learned and forgotten, and then $A_2$ is relearned, performance on $A_1$ will also improve. However, the reason for this improvement is obvious and trivial: it is simply that $A_1$ and $A_2$ are statistically related and large enough (i.e., with $n_1 \gg n$ and $n_2 \gg n$) to induce similar network functions.
In contrast, the effect described in this letter does not depend on statistical similarity between A1 and A2 . Crucial assumptions are that n1 + n2 ≤ n, n1 < n, and n2 < n, so that learning the n2 associations in A2 in a network with n weights is underconstrained. This implies that the network function induced by learning A1 has no particular relation to the network function induced by learning A2 , even if A1 and A2 are sampled from the same function f (provided A1 and A2 are disjoint sets). For example, if A1 and A2 each consists of one association sampled from a linear function f (i.e., a line), then learning A2 in a linear network (as in Figure 2a) induces a linear network function g1 (i.e., a line) that intersects with f but is otherwise unconstrained. Thus, learning A2 does not necessarily yield good performance on A1 . The FLL effect reported here depends on relearning after forgetting. To cite an extreme example, if unicycling and learning French were encoded by a common set of neurons, then, after forgetting both, relearning unicycling could improve your French (although the mechanism involved here is unrelated to that described in Harvey & Stone, 1996). Thus, FLL contrasts


with the task generalization outlined above, where it is obvious that both A1 and A2 induce similar network functions. Motivated by the demonstration that recovery occurs in humans (Stone et al., 2001; Coltheart & Byng, 1989; Weekes & Coltheart, 1996) (but not in all studies—Atkins, 2001), we have proven that FLL occurs in network models. The analysis presented here suggests that FLL is a necessary and generic consequence of storing information in distributed systems rather than a side effect peculiar to a particular class of artificial neural nets. Moreover, the generic nature of FLL suggests that it is largely independent of the type (i.e., artificial or physiological) of network used to learn associations. FLL appears to be a fundamental property of distributed representations. Given the reliance of neuronal systems on distributed representations, FLL may be a ubiquitous feature of learning and memory. It is likely that any organism that did not take advantage of such a fundamental and ubiquitous effect would be at a severe selective disadvantage. Appendix A: Analysis of Free-Lunch Learning We proceed by deriving expressions for E pre , E post , and δ = E pre − E post . We prove that if n1 + n2 ≤ n, then the expected value of δ is positive. We then prove that if n1 + n2 ≤ n, the probability P(δ > 0) of FLL is greater than 0.5, that its lower bound increases with n (if n1 /n and n2 /n are fixed), and that this bound approaches unity as n increases. A.1 Definition of Performance Error. For an artificial neural network (ANN) with weight vector w, we define the performance error for input vectors x1 , . . . , xc and desired outputs d1 , . . . , dc to be E(x1 , . . . , xc ; w, d1 , . . . , dc ) =

$$\sum_{i=1}^{c} (w \cdot x_i - d_i)^2. \qquad \text{(A.1)}$$

By putting $X = (x_1, \ldots, x_c)^T$, $d = (d_1, \ldots, d_c)^T$, and $E(X; w, d) = E(x_1, \ldots, x_c; w, d_1, \ldots, d_c)$, we can write equation A.1 succinctly as

$$E(X; w, d) = \|Xw - d\|^2. \qquad \text{(A.2)}$$

Given a $c \times n$ matrix $X$ and a $c$-dimensional vector $d$, let $L_{X,d}$ be the affine subspace

$$L_{X,d} = \{ w : X^T X w = X^T d \}$$

of $R^n$. Since (i) $\mathrm{rk}(X^T X) \le \mathrm{rk}(X)$ and (ii) $X^T X a = 0 \Rightarrow a^T X^T X a = 0 \Rightarrow Xa = 0$, it follows that $\mathrm{rk}(X^T X) = \mathrm{rk}(X)$ (where rk denotes the rank of a matrix), and so

$$L_{X,d} \text{ is nonempty.} \qquad \text{(A.3)}$$

If $X$ and $d$ are consistent (i.e., there is a $w$ such that $Xw = d$), then $L_{X,d} = \{w : Xw = d\}$.

A.2 Comparison of Performance Errors. Given weight vectors $w_1$ and $w_2$, a matrix $X$ of input vectors, and a vector $d$ of desired outputs, define $\delta(w_1, w_2; X, d) = E_{pre} - E_{post}$, where $E_{pre} = E(X; w_1, d)$ and $E_{post} = E(X; w_2, d)$. Let $\tilde{w}$ be any element of $L_{X,d}$. Then

$$\begin{aligned}
\delta(w_1, w_2; X, d) &= \|Xw_1 - d\|^2 - \|Xw_2 - d\|^2 \\
&= \|Xw_1\|^2 - \|Xw_2\|^2 - 2 (w_1 - w_2)^T X^T d \\
&= (w_1 - w_2)^T X^T X (w_1 + w_2) - 2 (w_1 - w_2)^T X^T X \tilde{w} \\
&= (w_1 - w_2)^T X^T X (w_1 + w_2 - 2\tilde{w}).
\end{aligned} \qquad \text{(A.4)}$$

Suppose we are given $n_i \times n$ matrices $X_i$ and $n_i$-dimensional vectors $d_i$ (for $i = 1, 2$). Put $L_i = L_{X_i, d_i}$ for $i = 1, 2$.

If $X_i$ has rank $n_i$, then $X_i = T_i Z_i$ for unique $n_i \times n_i$ and $n_i \times n$ matrices $T_i$ and $Z_i$ with $T_i$ upper triangular and $Z_i Z_i^T = I_{n_i}$. Note that the matrix $Z_i^T Z_i$ represents the operator that projects onto the image of $X_i^T X_i$, and so

$$Z_i^T Z_i X_i^T X_i = X_i^T X_i. \qquad \text{(A.5)}$$

Let $w_0$ be an element of $L_{X,d}$, where

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad d = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix},$$

that is,

$$(X_1^T X_1 + X_2^T X_2)\, w_0 = X_1^T d_1 + X_2^T d_2. \qquad \text{(A.6)}$$

(By equation A.3, such a $w_0$ always exists.) Given $v$ in $R^n$, put $w_1 = w_0 + v$. Let $w_{02}$ and $w_2$ be the orthogonal projections of $w_0$ and $w_1$, respectively, onto $L_2$. Then

$$X_2^T X_2 w_{02} = X_2^T d_2, \qquad w_2 = w_{02} + (I_n - Z_2^T Z_2)(w_1 - w_{02}). \qquad \text{(A.7)}$$

Manipulation gives

$$w_1 - w_2 = Z_2^T Z_2 (v + w_0 - w_{02}), \qquad \text{(A.8)}$$

and so

$$w_1 + w_2 - 2w_0 = (2I_n - Z_2^T Z_2)\, v - Z_2^T Z_2 (w_0 - w_{02}). \qquad \text{(A.9)}$$

Let $\tilde{w}$ be any element of $L_{X_1,d_1}$. Then equations A.4, A.6, A.7 to A.9, and A.5 yield

$$\begin{aligned}
\delta(w_1, w_2; X_1, d_1)
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2 - 2\tilde{w}) \\
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2) - 2 (w_1 - w_2)^T X_1^T d_1 \\
&= (w_1 - w_2)^T X_1^T X_1 (w_1 + w_2 - 2w_0) - 2 (w_1 - w_2)^T X_2^T X_2 (w_0 - w_{02}) \\
&= (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 (w_1 + w_2 - 2w_0) - 2 (v + w_0 - w_{02})^T Z_2^T Z_2 X_2^T X_2 (w_0 - w_{02}) \\
&= (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 (2I_n - Z_2^T Z_2)\, v \\
&\quad - (v + w_0 - w_{02})^T Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2 (w_0 - w_{02}) \\
&\quad - 2 (v + w_0 - w_{02})^T Z_2^T Z_2 X_2^T X_2 (w_0 - w_{02}) \\
&= v^T Z_2^T Z_2 X_1^T X_1 (2I_n - Z_2^T Z_2)\, v \\
&\quad + 2 (w_0 - w_{02})^T \left[ Z_2^T Z_2 X_1^T X_1 (I_n - Z_2^T Z_2) - X_2^T X_2 \right] v \\
&\quad - (w_0 - w_{02})^T Z_2^T Z_2 (2X_2^T X_2 + X_1^T X_1) Z_2^T Z_2 (w_0 - w_{02}) \\
&= v^T Z_2^T Z_2 X_1^T X_1 (2I_n - Z_2^T Z_2)\, v \\
&\quad + 2 (w_0 - w_{02})^T \left[ Z_2^T Z_2 X_1^T X_1 (I_n - Z_2^T Z_2) - X_2^T X_2 \right] v \\
&\quad - (w_0 - w_{02})^T (2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2)(w_0 - w_{02}).
\end{aligned} \qquad \text{(A.10)}$$

A.3 Moments of Isotropic Distributions. In order to obtain results on the distribution of performance error, it is useful to have some moments of isotropic distributions. Let $u$ be uniformly distributed on $S^{n-1}$, and let $A$ and $B$ be $n \times n$ matrices. The formulas for the second and fourth moments of $u$ given in equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield

$$E[u^T A u] = \frac{\mathrm{tr}(A)}{n}, \qquad \text{(A.11)}$$

$$E[u^T A u\, u^T B u] = \frac{\mathrm{tr}(AB) + \mathrm{tr}(AB^T) + \mathrm{tr}(A)\,\mathrm{tr}(B)}{n(n+2)}, \qquad \text{(A.12)}$$

$$\mathrm{var}(u^T A u) = \frac{n\,\mathrm{tr}(A^2) + n\,\mathrm{tr}(AA^T) - 2\,\mathrm{tr}(A)^2}{n^2 (n+2)}. \qquad \text{(A.13)}$$
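As a quick numerical sanity check of equations A.11 and A.13, the following sketch (an illustration, not part of the original analysis) draws points uniformly on the unit sphere by normalizing gaussian vectors and compares sample moments of $u^T A u$ with the closed forms. The dimension, sample size, seed, and test matrix are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n, samples = 5, 200_000          # illustrative sizes, not taken from the letter
A = rng.standard_normal((n, n))  # arbitrary (nonsymmetric) test matrix

# Normalized gaussian vectors are uniformly distributed on the sphere S^{n-1}.
g = rng.standard_normal((samples, n))
u = g / np.linalg.norm(g, axis=1, keepdims=True)

# Quadratic form u^T A u for every sample.
q = np.einsum("si,ij,sj->s", u, A, u)

# Closed forms from equations A.11 and A.13.
mean_theory = np.trace(A) / n
var_theory = (n * np.trace(A @ A) + n * np.trace(A @ A.T)
              - 2 * np.trace(A) ** 2) / (n ** 2 * (n + 2))

print("mean:", q.mean(), "theory:", mean_theory)
print("var: ", q.var(), "theory:", var_theory)
```

The sample and theoretical values should agree up to Monte Carlo error, which shrinks as the number of samples grows.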

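Now let $x$ be isotropically distributed on $R^n$, that is, $Ux$ has the same distribution as $x$ for all orthogonal $n \times n$ matrices $U$. Then writing $x = \|x\| u$ with $\|u\| = 1$ and using equations A.11 to A.13 gives

$$E[x^T A x] = \frac{E[\|x\|^2]\,\mathrm{tr}(A)}{n}, \qquad E[x^T A x\, x^T B x] = \frac{E[\|x\|^4] \left\{ \mathrm{tr}(AB) + \mathrm{tr}(AB^T) + \mathrm{tr}(A)\,\mathrm{tr}(B) \right\}}{n(n+2)}, \qquad \text{(A.14)}$$

$$\mathrm{var}(x^T A x) = \frac{E[\|x\|^4] \left\{ n\,\mathrm{tr}(A^2) + n\,\mathrm{tr}(AA^T) - 2\,\mathrm{tr}(A)^2 \right\}}{n^2 (n+2)} + \frac{\mathrm{var}(\|x\|^2)\,\mathrm{tr}(A)^2}{n^2}. \qquad \text{(A.15)}$$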


A.4 Distribution of Performance Error. Now suppose that $X_1$, $d_1$, $X_2$, $d_2$, and $v$ are random and satisfy

$$X_1 \text{ and } v \text{ are independent, the distribution of } X_1 \text{ is isotropic, and } v \text{ has an isotropic distribution,} \qquad \text{(A.16)}$$

where isotropy of $X_1$ in conditions A.16 means that $U X_1 V$ has the same distribution as $X_1$ for all orthogonal $n_1 \times n_1$ matrices $U$ and all orthogonal $n \times n$ matrices $V$. Then equation A.10 yields

$$E[\delta(w_1, w_2; X_1, d_1) \,|\, X_1, X_2] = \frac{E[\|v\|^2]}{n}\, \mathrm{tr}(X_1^T X_1 Z_2^T Z_2) - (w_0 - w_{02})^T (2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2)(w_0 - w_{02}). \qquad \text{(A.17)}$$

Taking expectations over $X_1$ and $X_2$ in equation A.17 gives the following general result on FLL: $E[\delta(w_1, w_2; X_1, d_1)] > 0$ iff

$$E[\|v\|^2] > \frac{n^2\, E\left[ (w_0 - w_{02})^T (2X_2^T X_2 + Z_2^T Z_2 X_1^T X_1 Z_2^T Z_2)(w_0 - w_{02}) \right]}{n_1 n_2}. \qquad \text{(A.18)}$$

The intuitive interpretation of this result is that if $E[\|v\|^2]$ is large enough, then there is FLL, whereas if $P(w_0 \ne w_{02}) > 0$, then "negative FLL" can occur. In particular, if $n_1 + n_2 \le n$ and $P(v \ne 0) > 0$, then there is FLL.

A.5 The Case $n_1 + n_2 \le n$. In this section we assume that $X_1$, $d_1$, $X_2$, and $d_2$ are random and that

$$(X_1, d_1),\ (X_2, d_2), \text{ and } v \text{ are independent,} \qquad \text{(A.19)}$$

$$\text{the distribution of } v \text{ is isotropic.} \qquad \text{(A.20)}$$

We suppose also that $n_1 + n_2 \le n$ and that the distributions of $X_1$, $d_1$, $X_2$, and $d_2$ are continuous. Then, with probability 1, $X_1 w_0 = d_1$ and $X_2 w_0 = d_2$, so that $w_{02} = w_0$ and equation A.10 reduces to

$$\delta(w_1, w_2; X_1, d_1) = v^T Z_2^T Z_2 X_1^T X_1 (2I_n - Z_2^T Z_2)\, v. \qquad \text{(A.21)}$$
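Equation A.21 can be checked numerically. The sketch below is an illustration under stated assumptions: gaussian data, small arbitrary sizes with n1 + n2 ≤ n, and pseudoinverses used to build the projections. Here `pinv(X2) @ X2` plays the role of $Z_2^T Z_2$, the orthogonal projector onto the row space of $X_2$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1, n2 = 12, 4, 5                       # arbitrary sizes with n1 + n2 <= n

X1, X2 = rng.standard_normal((n1, n)), rng.standard_normal((n2, n))
d1, d2 = rng.standard_normal(n1), rng.standard_normal(n2)

# w0 learns A1 and A2 exactly (minimum-norm solution of the stacked system).
X, d = np.vstack([X1, X2]), np.concatenate([d1, d2])
w0 = np.linalg.pinv(X) @ d

# Partial forgetting: perturb w0 by v.
v = rng.standard_normal(n)
w1 = w0 + v

# Relearning A2: orthogonal projection of w1 onto L2 = {w : X2 w = d2}.
w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)

def error(Xm, w, dm):
    """Performance error of equation A.2."""
    return np.sum((Xm @ w - dm) ** 2)

# delta computed directly as E_pre - E_post on A1.
delta_direct = error(X1, w1, d1) - error(X1, w2, d1)

# delta computed from equation A.21.
P2 = np.linalg.pinv(X2) @ X2               # projector, equals Z2^T Z2
delta_formula = v @ P2 @ X1.T @ X1 @ (2 * np.eye(n) - P2) @ v

print(delta_direct, delta_formula)
```

The two printed values should coincide up to floating-point rounding, since the identity is exact whenever $X_1 w_0 = d_1$ and $X_2 w_0 = d_2$.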


A.5.1 FLL Is More Probable Than Not. Let $w_1^*$ be the reflection of $w_1$ in $L_2$, that is, $w_1^* = w_2 - (w_1 - w_2)$. Consideration of the parallelogram with vertices at $w_0$, $w_1$, $w_1^*$, and $w_1 + w_1^* - w_0$ gives

$$\begin{aligned}
2\left\{ \|X_1 (w_1 - w_0)\|^2 + \|X_1 (w_1^* - w_0)\|^2 \right\}
&= \|X_1 ([w_1 - w_0] + [w_1^* - w_0])\|^2 + \|X_1 ([w_1 - w_0] - [w_1^* - w_0])\|^2 \\
&= 4\left\{ \|X_1 (w_2 - w_0)\|^2 + \|X_1 (w_1 - w_2)\|^2 \right\},
\end{aligned}$$

so that (since $d_1 = X_1 w_0$)

$$\begin{aligned}
\delta(w_1, w_2; X_1, d_1) + \delta(w_1^*, w_2; X_1, d_1)
&= \|X_1 (w_1 - w_0)\|^2 + \|X_1 (w_1^* - w_0)\|^2 - 2\|X_1 (w_2 - w_0)\|^2 \\
&= 2\|X_1 (w_1 - w_2)\|^2 \ge 0.
\end{aligned}$$

Thus if $\delta(w_1, w_2; X_1, d_1) < 0$, then $\delta(w_1^*, w_2; X_1, d_1) > 0$. If $v$ is distributed isotropically, then $w_1^* - w_0$ is distributed isotropically, so that $\delta(w_1^*, w_2; X_1, d_1)$ has the same distribution (conditionally on $X_1$, $d_1$, and $X_2$) as $\delta(w_1, w_2; X_1, d_1)$, and so

$$P(\delta(w_1, w_2; X_1, d_1) < 0 \,|\, X_1, d_1, X_2) \le P(\delta(w_1^*, w_2; X_1, d_1) > 0 \,|\, X_1, d_1, X_2) = P(\delta(w_1, w_2; X_1, d_1) > 0 \,|\, X_1, d_1, X_2). \qquad \text{(A.22)}$$

Further, if v ∈ L 2 \ L 1 , then w2 = w1 = w∗1 , so that $\delta(w_1, w_2; X_1, d_1) = \delta(w_1^*, w_2; X_1, d_1) > 0$. By continuity of $\delta$, there is a neighborhood of $v$ on which $\delta(w_1, w_2; X_1, d_1) > 0$ and $\delta(w_1^*, w_2; X_1, d_1) > 0$. Thus, if $L_2 \setminus L_1 \ne \emptyset$, then equation A.22 can be refined to

$$P(\delta(w_1, w_2; X_1, d_1) < 0 \,|\, X_1, d_1, X_2) < P(\delta(w_1^*, w_2; X_1, d_1) > 0 \,|\, X_1, d_1, X_2). \qquad \text{(A.23)}$$

Since $P(L_2 \subset L_1) = 0$ and $P(\delta(w_1, w_2; X_1, d_1) < 0 \,|\, X_1, d_1, X_2)$ is a continuous function of $X_1$, $d_1$, and $X_2$, it follows from equation A.23 that $P(\delta(w_1, w_2; X_1, d_1) < 0) < P(\delta(w_1, w_2; X_1, d_1) > 0)$, which implies the following result.

Theorem 1. $P(\delta(w_1, w_2; X_1, d_1) > 0) > 0.5$.

This implies that the median of $\delta(w_1, w_2; X_1, d_1)$ is positive.

A.5.2 A Lower Bound for $P(\delta > 0)$. Our proof depends on Chebyshev's inequality, which states that for any positive value of $t$,

$$P(|\delta - E[\delta]| \ge t) \le \frac{\mathrm{var}(\delta)}{t^2},$$

where $\mathrm{var}(\delta)$ denotes the variance of $\delta$. If we set $t = E[\delta]$, then (since, by equation A.28, $E[\delta] > 0$)

$$P(\delta \le 0) \le \frac{\mathrm{var}(\delta)}{E[\delta]^2}. \qquad \text{(A.24)}$$

This provides a lower bound for the probability of FLL. We prove that this bound approaches unity as n approaches infinity. Now we assume (in addition to conditions A.19 and A.20) that the distributions of X1 and X2 are isotropic.

(A.25)

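It follows from equations A.21, A.14, and A.15 that

$$E[\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, v] = v^T Z_2^T Z_2 \left( \frac{n_1}{n} E[\|x\|^2]\, I_n \right) (2I_n - Z_2^T Z_2)\, v = \frac{n_1}{n} E[\|x\|^2]\, v^T Z_2^T Z_2 v, \qquad \text{(A.26)}$$

where $x$ is the first column of $X_1^T$, and

$$\mathrm{var}(\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, v) = n_1 \left\{ \frac{E[\|x\|^4] \left[ (n-2)\|Z_2 v\|^4 + n \|Z_2 v\|^2 \|(2I_n - Z_2^T Z_2) v\|^2 \right]}{n^2 (n+2)} + \frac{\mathrm{var}(\|x\|^2)\, \|Z_2 v\|^4}{n^2} \right\}. \qquad \text{(A.27)}$$

Since $v$ has an isotropic distribution, equations A.26, A.11, and A.13 imply that

$$E[\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, \|v\|] = \frac{n_1 n_2}{n^2} E[\|x\|^2]\, \|v\|^2. \qquad \text{(A.28)}$$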

Given that there are $n_1$ associations in the subset $A_1$ that is not relearned, equation A.28 implies the following theorem about the expected amount of recovery per association in $A_1$.

Theorem 2.

$$E\left[ \frac{\delta(w_1, w_2; X_1, d_1)}{n_1} \,\Big|\, Z_2, \|v\| \right] = \frac{n_2}{n^2} E[\|x\|^2]\, \|v\|^2. \qquad \text{(A.29)}$$

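Equations A.26 and A.13 also imply that

$$\mathrm{var}\left( E[\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, v] \,\big|\, Z_2, \|v\| \right) = \left( \frac{n_1}{n} E[\|x\|^2] \right)^2 \|v\|^4\, \frac{2 n n_2 - 2 n_2^2}{n^2 (n+2)} = \frac{2 n_1^2 n_2 (n - n_2)\, E[\|x\|^2]^2\, \|v\|^4}{n^4 (n+2)}, \qquad \text{(A.30)}$$

and it follows from equations A.27 and A.12 that

$$\begin{aligned}
E\left[ \mathrm{var}(\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, v) \,\big|\, Z_2, \|v\| \right]
&= \frac{n_1 \|v\|^4}{n(n+2)} \left\{ \frac{E[\|x\|^4] \left[ (n-2) n_2 (n_2+2) + n n_2 (2n - n_2 + 2) \right]}{n^2 (n+2)} + \frac{\mathrm{var}(\|x\|^2)\, n_2 (n_2+2)}{n^2} \right\} \\
&= \frac{n_1 n_2 \|v\|^4}{n^3 (n+2)^2} \left\{ 2 E[\|x\|^4] (n^2 + 2n - n_2 - 2) + \mathrm{var}(\|x\|^2)(n+2)(n_2+2) \right\}.
\end{aligned} \qquad \text{(A.31)}$$

Then equations A.30 and A.31 give

$$\begin{aligned}
\mathrm{var}(\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, \|v\|)
&= \frac{2 n_1^2 n_2 (n - n_2)\, E[\|x\|^2]^2\, \|v\|^4}{n^4 (n+2)} + \frac{n_1 n_2 \|v\|^4}{n^3 (n+2)^2} \left\{ 2 E[\|x\|^4] (n^2 + 2n - n_2 - 2) + \mathrm{var}(\|x\|^2)(n+2)(n_2+2) \right\} \\
&= \frac{n_1 n_2 \|v\|^4}{n^4 (n+2)^2} \left\{ 2 \left[ n_1 (n+2)(n - n_2) + n(n - n_2) + n(n+2)(n-1) \right] E[\|x\|^2]^2 + n^2 (2n + n_2 + 6)\, \mathrm{var}(\|x\|^2) \right\},
\end{aligned}$$

and so

$$\frac{\mathrm{var}(\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, \|v\|)}{E[\delta(w_1, w_2; X_1, d_1) \,|\, Z_2, \|v\|]^2} = \frac{a_0(n, n_1, n_2) + a_1(n, n_2)\, \gamma(n)}{n_1 n_2 (n+2)^2},$$

where

$$a_0(n, n_1, n_2) = 2\{ n_1 (n+2)(n - n_2) + n(n - n_2) + n(n+2)(n-1) \},$$

$$a_1(n, n_2) = n^2 (2n + n_2 + 6),$$

$$\gamma(n) = \frac{\mathrm{var}(\|x\|^2)}{E[\|x\|^2]^2}.$$

Chebyshev's inequality implies the following theorem.

Theorem 3.

$$P(\delta(w_1, w_2; X_1, d_1) \le 0 \,|\, Z_2, \|v\|) \le \frac{a_0(n, n_1, n_2) + a_1(n, n_2)\, \gamma(n)}{n_1 n_2 (n+2)^2}.$$

Since the right-hand side does not depend on $Z_2$ or $\|v\|$, this gives the following result. If $\gamma(n)/n \to 0$ and $n_1/n$, $n_2/n$ are bounded away from zero as $n \to \infty$, then

$$P(\delta(w_1, w_2; X_1, d_1) > 0) \to 1, \qquad n \to \infty.$$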

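Example. If $x \sim N(0, \sigma_x^2 I_n)$, then

$$E[\|x\|^2] = n \sigma_x^2, \qquad \mathrm{var}(\|x\|^2) = 2n \sigma_x^4, \qquad \gamma(n) = \frac{2}{n},$$

and so

$$P(\delta(w_1, w_2; X_1, d_1) > 0) \to 1, \qquad n \to \infty,$$

provided that $n_1/n$ and $n_2/n$ are bounded away from zero.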
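The bound of theorem 3 is easy to evaluate numerically. The sketch below computes $1 - (a_0 + a_1 \gamma)/(n_1 n_2 (n+2)^2)$ with $\gamma(n) = 2/n$ from the gaussian example. The function name and the example sizes are illustrative assumptions, and $a_1$ is read here as $n^2(2n + n_2 + 6)$, the reading that is consistent with equations A.30 and A.31.

```python
def fll_probability_lower_bound(n, n1, n2, gamma):
    """Lower bound on P(delta > 0) implied by theorem 3, clipped at 0."""
    a0 = 2 * (n1 * (n + 2) * (n - n2) + n * (n - n2) + n * (n + 2) * (n - 1))
    a1 = n ** 2 * (2 * n + n2 + 6)   # assumed reading of a_1(n, n2)
    chebyshev = (a0 + a1 * gamma) / (n1 * n2 * (n + 2) ** 2)
    return max(0.0, 1.0 - chebyshev)

# Gaussian inputs: gamma(n) = 2 / n; keep n1 / n = n2 / n = 0.1 fixed.
for n in (100, 1000, 10000):
    n1 = n2 = n // 10
    print(n, fll_probability_lower_bound(n, n1, n2, gamma=2.0 / n))
```

With these ratios the bound carries no information at n = 100 and increases toward 1 as n grows, in line with the limit stated above. Being a Chebyshev bound, it is conservative.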


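A.5.3 Learning $A_3$ Instead of $A_2$. Now suppose that relearning of $A_2$ is replaced by learning another subset $A_3$ of $n_2$ associations. Let the matrix $X_3$ and vector $d_3$ be such that the subspace $L_3$ corresponding to $A_3$ has the form $L_3 = L_{X_3, d_3}$. Let $w_3$ and $w_{13}$ denote the orthogonal projections of $w_1$ onto $L_3$ and $L_1 \cap L_3$, respectively. Then

$$w_3 = w_{13} + (I_n - Z_3^T Z_3)(w_1 - w_{13}), \qquad \text{(A.32)}$$

and so

$$w_1 = w_3 + Z_3^T Z_3 (w_1 - w_{13}). \qquad \text{(A.33)}$$

From equation A.4 with $\tilde{w} = w_{13}$, and equations A.33 and A.32, we have

$$\begin{aligned}
\delta(w_1, w_3; X_1, d_1) &= (w_1 - w_3)^T X_1^T X_1 (w_1 + w_3 - 2w_{13}) \\
&= (w_1 - w_{13})^T Z_3^T Z_3 X_1^T X_1 (w_1 + w_3 - 2w_{13}) \\
&= (v - \tilde{\omega})^T Z_3^T Z_3 X_1^T X_1 (2I_n - Z_3^T Z_3)(v - \tilde{\omega}),
\end{aligned} \qquad \text{(A.34)}$$

where $\tilde{\omega} = w_{13} - w_0$. Since $X_1 w_0 = X_1 w_{13}$, equation A.34 can be expanded as

$$\begin{aligned}
\delta(w_1, w_3; X_1, d_1) &= v^T Z_3^T Z_3 X_1^T X_1 (2I_n - Z_3^T Z_3)\, v - v^T Z_3^T Z_3 X_1^T X_1 (2I_n - Z_3^T Z_3)\, \tilde{\omega} \\
&\quad - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 (2I_n - Z_3^T Z_3)\, v - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 Z_3^T Z_3\, \tilde{\omega},
\end{aligned}$$

and so

$$\begin{aligned}
E[\delta(w_1, w_3; X_1, d_1) \,|\, X_1, d_1, X_2, d_2, X_3, d_3]
&= \frac{E[\|v\|^2]}{n}\, \mathrm{tr}\left( Z_3^T Z_3 X_1^T X_1 (2I_n - Z_3^T Z_3) \right) - \tilde{\omega}^T Z_3^T Z_3 X_1^T X_1 Z_3^T Z_3\, \tilde{\omega} \\
&= \frac{E[\|v\|^2]}{n}\, \mathrm{tr}(X_1^T X_1 Z_3^T Z_3) - \|X_1 Z_3^T Z_3\, \tilde{\omega}\|^2.
\end{aligned}$$

Now assume that $(X_1, d_1)$, $(X_2, d_2)$, $(X_3, d_3)$, and $v$ are independent and that the distributions of $X_1$, $X_2$, $X_3$, and $v$ are isotropic. Since

$$E\left[ \frac{E[\|v\|^2]}{n}\, \mathrm{tr}(X_1^T X_1 Z_2^T Z_2) \right] = E\left[ \frac{E[\|v\|^2]}{n}\, \mathrm{tr}(X_1^T X_1 Z_3^T Z_3) \right] = E[\delta(w_1, w_2; X_1, d_1)],$$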

we have the following theorem.

Theorem 4. $E[\delta(w_1, w_3; X_1, d_1)] \le E[\delta(w_1, w_2; X_1, d_1)]$.

Appendix B: Behavior of the Gradient Algorithm

If $E$ is regarded as a function of $w$, then differentiation of equation A.2 shows that the gradient of $E$ at $w$ is $\nabla E(w) = 2X^T (Xw - d)$. Then for any algorithm that takes an initial $w^{(0)}$ to $w^{(1)}, w^{(2)}, \ldots$ using steps $w^{(t+1)} - w^{(t)}$ in the direction of $\nabla E(w^{(t)})$, $w^{(t)} - w^{(0)}$ is in the image of $X^T X$, and so is orthogonal to $L_{X,d}$. It follows that if $\|Xw^{(t)} - d\|^2 \to \min_w \|Xw - d\|^2$ as $t \to \infty$, then $w^{(t)}$ converges to the orthogonal projection of $w^{(0)}$ onto $L_{X,d}$.

Appendix C: The Geometry of Performance Error When $n_1 = 1$

Given associations $A_1$ and $A_2$, we prove that if $n_1 = 1$ and input vectors have unit length (so that $\|x_1\| = 1$), then the difference $\delta$ in performance errors on association $A_1$ of $w_1$ (i.e., after partial forgetting) and $w_2$ (i.e., after relearning $A_2$) is equal to the difference $Q = p^2 - q^2$. This proof supports the geometric account given in the article and in Figure 2 and does not (in general) apply if $n_1 > 1$.

We begin by proving that (if $n_1 = 1$ and $\|x_1\| = 1$) the performance error of an association $A_1$ for an arbitrary weight vector $w_1$ is equal to the squared distance $p^2$ between $w_1$ and its orthogonal projection $w_1'$ onto the affine subspace $L_1$ corresponding to $A_1$. If $n_1 = 1$, then $L_1$ has the form $L_1 = \{w : w \cdot x_1 = d_1\}$ for some $x_1$ and $d_1$. Given an arbitrary weight vector $w_1$, we define the performance error on association $A_1$ as

$$E(w_1, A_1) = (w_1 \cdot x_1 - d_1)^2. \qquad \text{(C.1)}$$

The orthogonal projection $w_1'$ of $w_1$ onto $L_1$ is

$$w_1' = w_1 + \frac{d_1 - w_1 \cdot x_1}{\|x_1\|^2}\, x_1, \qquad \text{(C.2)}$$

so that

$$d_1 = w_1' \cdot x_1. \qquad \text{(C.3)}$$

Substituting equation C.3 into C.1 and using C.2 yields

$$E(w_1, A_1) = \|w_1 - w_1'\|^2 \|x_1\|^2 = p^2 \|x_1\|^2. \qquad \text{(C.4)}$$
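A small numerical check of equations C.2 to C.4 follows. It is a sketch with arbitrary illustrative values (dimension, target, and seed are assumptions): it projects a random weight vector onto the hyperplane $\{w : w \cdot x_1 = d_1\}$ and compares the performance error with the squared distance scaled by $\|x_1\|^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                   # arbitrary dimension
x1 = rng.standard_normal(n)             # input vector (not unit length here)
d1 = 0.7                                # arbitrary desired output
w1 = rng.standard_normal(n)             # arbitrary weight vector

# Equation C.2: orthogonal projection of w1 onto L1 = {w : w . x1 = d1}.
w1_proj = w1 + (d1 - w1 @ x1) / (x1 @ x1) * x1

p_squared = np.sum((w1 - w1_proj) ** 2) # squared distance p^2
error = (w1 @ x1 - d1) ** 2             # equation C.1

print(w1_proj @ x1, d1)                 # equation C.3
print(error, p_squared * (x1 @ x1))     # equation C.4
```

When $x_1$ is rescaled to unit length, the factor $\|x_1\|^2$ drops out and the error equals $p^2$ itself.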

Now suppose that $\|x_1\| = 1$. Then $E(w_1, A_1) = p^2$, that is, the performance error is equal to the squared distance between the weight vectors $w_1$ and $w_1'$. The same line of reasoning can be applied to prove that $E(w_2, A_1) = q^2$. Thus, the difference $\delta$ in performance error on $A_1$ for weight vectors $w_1$ and $w_2$ is $\delta = E(w_1, A_1) - E(w_2, A_1) = p^2 - q^2 = Q$.

Acknowledgments

Thanks to S. Isard for substantial help with the analysis presented here; to R. Lister, S. Eglen, P. Parpia, A. Farthing, P. Warren, K. Gurney, N. Hunkin, and two anonymous referees for comments; and to J. Porrill for useful discussions.

References

Atkins, P. (2001). What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychological Research, 65(3), 202–215.


Atkins, P., & Murre, J., (1998). Recovery of unrehearsed items in connectionist models. Connection Science, 10(2), 99–119. Coltheart, M., & Byng, S. (1989). A treatment for surface dyslexia. In X. Seron (Ed.), Cognitive approaches in neuropsychological rehabilitation. Mahwah, NJ: Erlbaum. Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings Royal Society (B), 269(1503), 1895–1904. Dienes, Z., Altmann, G., & Gao, S.-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82. Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7–8), 961–974. Goldin, M., Segal, M., & Avignone, E. (2001). Functional plasticity triggers formation and pruning of dendritic spines in cultured hippocampal networks. J. Neuroscience, 21(1), 186–193. Hanson, S. J., & Negishi, M. (2002). On the emergence of rules in neural networks. Neural Computation, 14, 2245–2268. Harvey, I., & Stone, J.V. (1996). Unicycling helps your French: Spontaneous recovery of associations by learning unrelated tasks. Neural Computation, 8, 697–704. Hinton, G., & Plaut, D. (1987). Using fast weights to deblur old memories. In Proceedings Ninth Annual Conference of the Cognitive Science Society, Seattle WA, 177–186. Mardia, K. V., & Jupp, P. E. (2000). Directional statistics. New York: Wiley. Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41(2), 269–280. Purves D., & Lichtman, J. (1980). Elimination of synapses in the developing nervous system. Science, 210, 153–157. Seymour, B., O’Doherty, J. P., Dayan, P., Koltzenburg, M., Jones, A. K., Dolan, R. J., Friston, K. J., & Frackowiak, R. (2004). 
Temporal difference models describe higher order learning in humans. Nature, 429, 664–667. Stone, J. V., Hunkin, N. M., & Hornby, A. (2001). Predicting spontaneous recovery of memory. Nature, 414, 167–168. Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44. Weekes, B., & Coltheart, M. (1996). Surface dyslexia and surface dysgraphia: Treatment studies and their theoretical implications. Cognitive Neuropsychology, 13, 277–315.

Received August 1, 2005; accepted May 19, 2006.

LETTER

Communicated by Mark Girolami

Linear Multilayer ICA Generating Hierarchical Edge Detectors Yoshitatsu Matsuda [email protected]-tokyo.ac.jp

Kazunori Yamaguchi [email protected] Kazunori Yamaguchi Laboratory, Department of General Systems Studies, Graduate School of Arts and Sciences, University of Tokyo, Tokyo, Japan 153-8902

In this letter, a new ICA algorithm, linear multilayer ICA (LMICA), is proposed. There are two phases in each layer of LMICA. One is the mapping phase, where a two-dimensional mapping is formed by moving more highly correlated (nonindependent) signals closer with the stochastic multidimensional scaling network. The other is the local-ICA phase, where each neighbor (namely, highly correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because in LMICA only a small number of highly correlated pairs have to be separated, it can extract edge detectors efficiently from natural scenes. We conducted numerical experiments and verified that LMICA generates hierarchical edge detectors from large-size natural scenes.

1 Introduction

Independent component analysis (ICA) is a recently developed method in the fields of signal processing and artificial neural networks, and it has been shown to be quite useful for the blind separation problem (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996). The linear ICA is formalized as follows. Let s and A be N-dimensional source signals and an N × N mixing matrix. Then the observed signals x are defined as x = As.

(1.1)

The problem is to find A (or its inverse, W) when only the observed (mixed) signals are given. In other words, ICA blindly extracts the source signals from M samples of the observed signals as follows:

$$\hat{S} = WX, \qquad (1.2)$$

where X is an N × M matrix of the observed signals and $\hat{S}$ is the estimate of the source signals. Although this is a typical ill-conditioned problem, ICA can solve it if the source signals are generated according to independent and nongaussian probability distributions. Concretely speaking, ICA algorithms find W by minimizing a criterion (called the contrast function) that is defined by higher-order statistics (e.g., kurtosis) of the components of $\hat{S}$, and many methods have been proposed for the minimization, for example, fast ICA (Hyvärinen, 1999) and the relative gradient algorithm (Cardoso & Laheld, 1996). Because all of these previous algorithms try to estimate all the $N^2$ components of W rigorously, their time complexity is $O(N^2)$, which is intractable for large N. For actual data such as natural scenes, W is given not randomly but according to some underlying structure in the data. LMICA (linear multilayer ICA), which we proposed previously (Matsuda & Yamaguchi, 2003, 2004, 2005b), utilizes such structure for estimating W. By gradually improving an estimate of W, it can finally find a fairly good estimate of W quite efficiently. This letter extends this work with an improvement of the algorithm, using the two-dimensional stochastic multidimensional scaling (MDS) network, and exhaustive numerical experiments (generating hierarchical edge detectors).

This letter is organized as follows. In section 2, the algorithm is described. In section 3, numerical experiments verify that LMICA can extract edge detectors efficiently from natural scenes and that it can generate hierarchical edge detectors from large-size scenes. This letter is concluded in section 4.

Neural Computation 19, 218–230 (2007). © 2006 Massachusetts Institute of Technology

2 Algorithm

2.1 Basic Idea. It is assumed that X is whitened initially. LMICA tries to extract the independent components approximately by repetition of the following two phases: the mapping phase, which brings more highly correlated signals nearer, and the local-ICA phase, where each neighbor pair of signals in the mapping is separated by the MaxKurt algorithm (Cardoso, 1999).
Intuitively, the mapping phase finds only a few pairs that are more "important" for estimating W, and the local-ICA phase optimizes only the "important" pairs. The mechanism of LMICA is illustrated in Figure 1. Note that this illustration shows the ideal case where N signals are separated in only O(log N) layers. Although this does not hold for an arbitrary W, it will be shown in section 3 that natural scenes can be separated quite effectively by this method with a two-dimensional mapping.

2.2 Mapping Phase. In the mapping phase, current signals X are arranged in a two-dimensional array so that pairs of signals taking higher $\sum_k x_{ik}^2 x_{jk}^2$ are placed nearer. In order to efficiently calculate such an array, the stochastic MDS network in Matsuda and Yamaguchi (2005a) is used. Its main procedure is the repetition of the calculation of the center of gravity and the slide of each signal toward the center in proportion to its value.

Figure 1: LMICA (the ideal case). Each number from 1 to 8 means a source signal. In the first local-ICA phase, each neighbor pair of the completely mixed signals (denoted 1-8) is partially separated into 1-4 and 5-8. Next, the mapping phase rearranges the partially separated signals so that more highly correlated signals are nearer. In consequence, the four 1-4 signals (similarly, 5-8 ones) are brought nearer. Then the local-ICA phase partially separates the pairs of neighbor signals into 1-2, 3-4, 5-6, and 7-8. By repetition of these two phases, LMICA can extract all the sources quite efficiently.

The original stochastic MDS network is given as follows (see Matsuda & Yamaguchi, 2005a, for details):

1. Place given signals $Z = (z_{ik})$ on a two-dimensional array randomly, where each signal $i$ (the $i$th row of $Z$) corresponds to a component of the array through a one-to-one correspondence. The x- and y-coordinates of a signal $i$ on the array are denoted as $m_{i1}$ and $m_{i2}$. Note that $m_{i1}$ and $m_{i2}$ take discrete values.
2. Pick up a column of $Z$ randomly, whose index is denoted as $p$. The column can be regarded as a randomly selected sample $p$ of signals.
3. Calculate the center of gravity for the sample $p$ by

$$m_l^g(Z, p) = \frac{\sum_i z_{ip}\, m_{il}}{\sum_i z_{ip}}, \qquad (2.1)$$

where $l = 1$ or $2$.

Figure 2: Discretized update on the stochastic MDS network (excerpted from Matsuda & Yamaguchi, 2005a, p. 289). It illustrates each step (steps 1 to 4) of the discretized update rule. Each small square represents a signal on the discrete two-dimensional array (denoted a unit in this figure). In the first step, the destination is calculated by adding the current offset vector to the new coordinates. Second, the signals on the route are detected. Third, the signal to be updated is moved by exchanging the signals on the route one by one. Finally, the fraction under the discretization is preserved as the offset vector.

4. Calculate the new coordinate $m_{il}'$ for each signal $i$ by

$$m_{il}' = m_{il} - \lambda z_{ip} \left( m_{il} - m_l^g \right), \qquad (2.2)$$

where $\lambda$ is the step size.
5. Update the coordinates of each signal on the array approximately according to equation 2.2 under the constraints that every coordinate $m_{il}$ has to be on the discrete two-dimensional array. Such discretized updates are conducted by giving an offset vector (which preserves the rounded-off fraction under the discretization) to each signal and exchanging the signals one by one on the array. The mechanism is roughly illustrated in Figure 2. The details are omitted in this letter.
6. Terminate the process if a termination condition is satisfied. Otherwise, return to step 2.

It can be shown that this process approximately minimizes an error function,

$$C = \sum_{i,j} \left( \sum_k z_{ik} z_{jk} \right) \sum_l (m_{il} - m_{jl})^2, \qquad (2.3)$$

under the conditions that the location of each signal $(m_{i1}, m_{i2})$ is bound to a unit in the discrete two-dimensional array through a one-to-one correspondence. Because the minimization of $C$ makes more highly correlated pairs of signals (taking larger $\sum_k z_{ik} z_{jk}$) be closer (smaller $\sum_l (m_{il} - m_{jl})^2$), it is shown that this process generates a topographic mapping where highly correlated signals are placed near each other.

Some modifications are needed in order to make the original stochastic MDS network suitable for LMICA:

• $z_{ip}$ is given as $x_{ip}^2 - 1$, where $p$ is selected randomly from 1 to $M$ at each update.
• Because the convergence of the original network is too slow and tends to drop into a local minimum if the signals $Z$ take continuous values, the following elaboration is utilized. The signals are classified into two groups for each sample $p$: the positive group $\sigma^+$ consisting of signals satisfying $z_{ip} > 0$ and the negative one $\sigma^-$ of $z_{ip} < 0$. Then equations 2.1 and 2.2 are calculated for each group. Note that each signal of the negative group is moved away from the center of gravity. Numerical experiments (omitted in this article) showed that this improvement gave more accurate results more efficiently than the original network.
• Two learning stages are often used in order to make local neighbor relations more accurate if there are many signals. The first is the usual stage for the whole array. The second is the locally tuning stage where the array is divided into some small areas and the algorithm is applied to each area.

The total procedure of the mapping phase for given X, W, A, and a given two-dimensional array is described as follows:

Mapping Phase
1. Allocate each signal $i$ to a component of the two-dimensional array by a randomly selected one-to-one correspondence.
2. Repeat the following steps over $T$ times with a decreasing step size $\lambda$:
   a. Select $p$ randomly from $\{1, \ldots, M\}$, and let $z_{ip} = x_{ip}^2 - 1$ for each $i$.
   b. Calculate the two centers of gravity:

$$m_l^{g+}(Z, p) = \frac{\sum_{i \in \sigma^+} z_{ip}\, m_{il}}{\sum_{i \in \sigma^+} z_{ip}} \quad \text{and} \quad m_l^{g-}(Z, p) = \frac{\sum_{i \in \sigma^-} z_{ip}\, m_{il}}{\sum_{i \in \sigma^-} z_{ip}}. \qquad (2.4)$$

   c. Update each $m_{il}$ by

$$m_{il} := \begin{cases} m_{il} - \lambda z_{ip} \left( m_{il} - m_l^{g+} \right) & z_{ip} > 0 \\ m_{il} - \lambda z_{ip} \left( m_{il} - m_l^{g-} \right) & z_{ip} < 0 \end{cases} \qquad (2.5)$$

   under the constraints that every $m_{il}$ is on the discrete array.
3. Divide the array into small areas, and apply the above process to each area if there are many signals.
4. Rearrange the rows of X and W and the columns of A according to the generated array.

2.3 Local-ICA Phase. In the local-ICA phase, the following contrast function $\phi(X)$ (the minus sum of kurtoses) is used (which is the same one as in the MaxKurt algorithm; Cardoso, 1999):

$$\phi(X) = -\left( \frac{\sum_{i,k} x_{ik}^4}{M} - 3 \right). \qquad (2.6)$$

$\phi(X)$ is minimized by "rotating" nearest-neighbor pairs of signals on the array. For each nearest-neighbor pair $(i, j)$, a rotation matrix $R(\theta)$ is given as

$$R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \qquad (2.7)$$

Then the optimal angle $\hat{\theta}$ is given as

$$\hat{\theta} = \operatorname{argmin}_{\theta} \; -\sum_k \left( x_{ik}'^{\,4} + x_{jk}'^{\,4} \right), \qquad (2.8)$$

where $x_{ik}' = \cos\theta \cdot x_{ik} + \sin\theta \cdot x_{jk}$ and $x_{jk}' = -\sin\theta \cdot x_{ik} + \cos\theta \cdot x_{jk}$. After some tedious transformation of this equation (see Cardoso, 1999), it is shown that $\hat{\theta}$ is determined analytically by the following equations:

$$\sin 4\hat{\theta} = \frac{\alpha_{ij}}{\sqrt{\alpha_{ij}^2 + \beta_{ij}^2}} \quad \text{and} \quad \cos 4\hat{\theta} = \frac{\beta_{ij}}{\sqrt{\alpha_{ij}^2 + \beta_{ij}^2}}, \qquad (2.9)$$

where

$$\alpha_{ij} = \sum_k \left( x_{ik}^3 x_{jk} - x_{ik} x_{jk}^3 \right) \quad \text{and} \quad \beta_{ij} = \sum_k \frac{x_{ik}^4 + x_{jk}^4 - 6 x_{ik}^2 x_{jk}^2}{4}. \qquad (2.10)$$

Now, the procedure of the local-ICA phase for given X, W, A, and an array is described as follows: Local-ICA Phase 1. For every nearest-neighbor pair of signals on the array, (i, j), a. Find out the optimal angle θˆ by equation 2.9. b. Rotate the corresponding parts of X, W, and A.
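The pairwise rotation of the local-ICA phase can be sketched as follows. This is a minimal illustration, assuming NumPy and toy Laplacian sources; the function names are hypothetical, and the updates of W and A are omitted. The angle is obtained from equations 2.9 and 2.10 via `arctan2`, which returns the angle whose sine and cosine are proportional to $\alpha_{ij}$ and $\beta_{ij}$.

```python
import numpy as np

def maxkurt_angle(xi, xj):
    """Optimal rotation angle for one pair, from equations 2.9 and 2.10."""
    alpha = np.sum(xi ** 3 * xj - xi * xj ** 3)
    beta = np.sum(xi ** 4 + xj ** 4 - 6 * xi ** 2 * xj ** 2) / 4
    # sin(4*theta) is proportional to alpha and cos(4*theta) to beta.
    return np.arctan2(alpha, beta) / 4

def rotate_pair(xi, xj, theta):
    """Apply the rotation R(theta) of equation 2.7 to the two signals."""
    c, s = np.cos(theta), np.sin(theta)
    return c * xi + s * xj, -s * xi + c * xj

def fourth_moment_sum(xi, xj):
    """Sum of fourth powers, the quantity maximized in equation 2.8."""
    return np.sum(xi ** 4 + xj ** 4)

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))          # two super-gaussian toy sources
xi, xj = rotate_pair(s[0], s[1], 0.3)    # mix them with an arbitrary rotation

theta_hat = maxkurt_angle(xi, xj)
yi, yj = rotate_pair(xi, xj, theta_hat)
print(theta_hat, fourth_moment_sum(xi, xj), fourth_moment_sum(yi, yj))
```

Because the objective is a sinusoid in $4\theta$, the closed-form angle is a global maximizer of the fourth-moment sum for the pair, so no iterative search over $\theta$ is needed.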

2.4 Complete Algorithm. The complete algorithm of LMICA for any given observed signals X and array fixing the shape of the mapping is given by repeating the mapping phase and the local-ICA phase alternately:

Linear Multilayer ICA Algorithm
1. Initial settings: Let X be a whitened observed signal and W and A be the N × N identity matrix.
2. Repetition: Do the following two phases alternately over L times.
   a. Do the mapping phase in section 2.2.
   b. Do the local-ICA phase in section 2.3.

2.5 Some Remarks

2.5.1 Relation to MaxKurt Algorithm. Equation 2.9 is the same as in the MaxKurt algorithm (Cardoso, 1999). The crucial difference between LMICA and MaxKurt is that LMICA optimizes just the nearest-neighbor pairs instead of all the $N(N-1)/2$ ones in MaxKurt. In LMICA, the pairs with higher costs (higher $\sum_k x_{ik}^2 x_{jk}^2$) are brought nearer in the mapping phase. Approximations of independent components can be extracted effectively by optimizing just the neighbor pairs.

2.5.2 Prewhitening. Although LMICA is applicable to any prewhitened signals, the selection of the whitening method is actually crucial for its performance. It is shown in section 3.1 that ZCA (zero-phase component analysis) is more suitable than principal component analysis (PCA) if natural scenes are given as the observed signals. The ZCA filter is given as

$$X := \Sigma^{-1/2} X, \qquad (2.11)$$

where is the covariance matrix of X. It has been known that the ZCA filter whitens the given signals with preserving the spatial relationship if natural scenes are given as X (Li & Atick, 1994; Bell & Sejnowski, 1997). 3 Results It is well known that various local edge detectors can be extracted from natural scenes by the standard ICA algorithm (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). Here, LMICA was applied to the same problem. 3.1 Small Natural Scenes. Thirty thousand samples of natural scenes of 12 × 12 pixels were given as the observed signals X. That is, N and M were 144 and 30,000. Original natural scenes were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. X is then whitened by


[Figure 3 plot: "Decrease of Contrast Function"; vertical axis: minus kurtosis (−8 to −4); horizontal axis: times of optimizations for pairs (0 to 15,000); curves: (a) LMICA, (b) original topography, (c) MaxKurt, (d) LMICA for PCA.]

Figure 3: Decreasing curves of the contrast function φ along the times of optimizations for pairs of signals. (a) The standard LMICA. (b) LMICA using the identity mapping in the mapping phase, which preserves the original topography for the ZCA filter. (c) MaxKurt. (d) LMICA for the PCA-whitened signals.

ZCA. In the mapping phase, a 12 × 12 array was used, where the learning length T was 100,000 with the step size $\lambda = \frac{100}{100+t}$ (t is the current time, that is, the number of updates up to that point). The calculation time for 400 layers of LMICA with an Intel 2.8 GHz CPU was about 40 minutes, about 60% of which was for the mapping phase. It shows that the mapping phase is so efficient that its computational costs are not a bottleneck in LMICA. Figure 3 shows the decreasing curves of the contrast function φ along the times of optimizations (rotations) for pairs of signals, where the values of φ are averaged over 10 trials for independently sampled Xs. For comparison, the following four experiments were done: (1) LMICA; (2) LMICA using not the optimized rearrangement but the simple identity mapping in the mapping phase, which preserves the original 12 × 12 topography of scenes for the ZCA filter; (3) MaxKurt, where all the pairs are optimized one by one in random order; and (4) LMICA for the PCA-whitened observed signals. Note that one layer of LMICA without the mapping phase and one iteration of MaxKurt are equivalent to 12 × 11 × 2 = 264 and 144 × 143 / 2 = 10,296 pair optimizations, respectively. In Figure 3, the standard LMICA gives the best solution everywhere except for the first few optimizations and the late ones. Though LMICA using the original topography is the best within about the first 1000 optimizations, it rapidly converged to local minima. On the other hand, MaxKurt and LMICA for PCA became superior to the others only after more than 10,000 optimizations.


Figure 4: Edge detectors from natural scenes of 12 × 12 pixels after 5280 optimizations. Each shows 144 edge detectors of 12 × 12 pixels from A. (a) LMICA. (b) LMICA using the original topography (namely, the identity mapping). (c) MaxKurt. (d) Fast ICA (g(u) = tanh(u)).

Figures 4a to 4c show the edge detectors after 5280 optimizations (equivalent to the twentieth layer of LMICA) by LMICA, LMICA using the original topography, and MaxKurt, respectively. For comparison, Figure 4d is the result by fast ICA with the widely used nonlinear function g(u) = tanh(u). Figures 4a and 4d show that LMICA could quite rapidly generate edge detectors similar to those in fast ICA. It is especially surprising that the number of optimizations in LMICA (5280) is about half of the degrees of freedom of the mixing matrix A (10,296). It suggests that LMICA gives an

[Figure 5 panels: (a) at the 0th layer (ZCA); (b) at the 10th layer; (c) at the 50th layer; (d) at the 300th layer.]

Figure 5: Representative edge detectors from large natural scenes at each layer. They show 20 representative edge detectors of A from scenes of 64 × 64 pixels at each layer.

effective model for the ICA processing of natural scenes. There are no clear edges in Figures 4b and 4c. Each detector in these figures is extremely localized and has no orientation preference (or one that is too weak). It shows that the mapping phase of LMICA plays a crucial role in the rapid formation of edge detectors. 3.2 Large Natural Scenes. Here, 100,000 samples of natural scenes of 64 × 64 pixels were given as X. Fast ICA, MaxKurt, and other well-known ICA algorithms are not applicable to such a large-scale problem because they require huge computations. For example, fast ICA based on kurtosis spent about 45 minutes on processing the small images (30,000 samples of 12 × 12 pixels). Under the rough assumption that the calculation time is proportional to the number of samples and the parameters to be estimated, it requires about 2000 hours to process the large images. This estimation is rather optimistic because it assumes that the number of updates for

[Figure 6 panels: histograms of frequency (0 to 4000) against length of edges (0 to 14): (a) at the 0th layer; (b) at the 10th layer; (c) at the 50th layer; (d) at the 300th layer.]

Figure 6: Histograms of edge lengths at each layer of LMICA for large, natural scenes. The lengths were calculated as the full width at half maximum (FWHM) of gaussian approximations of Hilbert transformation of W by a method similar to that of van Hateren and van der Schaaf (1998).

convergence is constant regardless of the number of parameters. In the mapping phase, 64 × 64 array was used for T = 1,500,000; then it was divided into 16 arrays of 16 × 16 components, and each array was optimized for T = 500,000. LMICA was carried out in L = 300 layers, and it consumed about 170 hours with Intel 2.8 GHz CPU. Figure 5 shows some representative edge detectors at the 0th (ZCA filter), 10th, 50th, and 300th layers. The histograms of the lengths of edges at each layer are shown in Figure 6. There are pixel detectors of only length zero at the 0th layer (ZCA) in Figures 5a and 6a. Then many short edge detectors were generated after just 10 layers (see Figures 5b and 6b). At the 50th layer, the lengths of edges were longer (see Figures 5c and 6c). At the final 300th layer (see Figures 5d and 6d), there are some long edges. In addition, it is interesting that some “compound” detectors were observed where multiple edges seem to be included in a single detector.


4 Conclusion In this letter, we proposed linear multilayer ICA (LMICA). We carried out some numerical experiments on natural scenes, which verified that LMICA can extract edge detectors quite efficiently. We also showed that LMICA can generate hierarchical edge detectors from large-size natural scenes, where short edges exist in lower layers and longer edges in higher ones. Although some multilayer models have been employed in ICA (e.g., Hyv¨arinen & Hoyer, 2000, and Hoyer & Hyv¨arinen, 2002), the purpose of and rationale for them are quite different from those of LMICA. Their multilayer networks have been proposed for constructing more powerful models of ICA, which include nonlinear connections and allow some dependencies between sources. On the other hand, LMICA is constructed only for the efficient calculation of the usual linear ICA. Nevertheless, it seems interesting that the structures of their multilayer models (locally connected multilayer networks) are quite similar to that of LMICA. It may be promising to extend LMICA with some nonlinearity. We are planning to apply LMICA to some applications in image processing, such as image compression and digital watermarking. We are also planning to utilize LMICA for other large-scale problems such as text mining. In addition, we are trying to explore a faster method for the mapping phase. Some batch-learning techniques may be promising. We are now paying attention to the fact that it has been known that edge detectors are not formed at the maximum of kurtoses. Some different nonlinearity is needed (e.g., tanh). So the choice of nonlinearity in the contrast function is quite important and sensitive in the usual ICA models, such as the InfoMax model in Bell and Sejnowski (1997). On the other hand, LMICA can generate hierarchical edge detectors by gradually increasing simple kurtoses. 
This suggests that LMICA may be able to extract edge detectors by using more general contrast functions, but further experiments will be needed to test this hypothesis. Finally, LMICA with kurtoses is expected to be as sensitive to outliers as other cumulant-based ICA algorithms are. Some different contrast functions would be needed for noisy signals. In addition, LMICA is available only if the number of sources is the same as that of observed signals, because it utilizes the ZCA filter. In order to apply LMICA to undercomplete cases, a new whitening method would have to be exploited, which can decrease the number of signals while preserving the topography of images. It is more difficult for LMICA to deal with overcomplete cases, because it is based on the simplest contrast function without any generative models of sources. Nevertheless, it seems interesting that LMICA could generate many edge detectors with different lengths at every layer. These detectors appear to be overcomplete bases, though without any theoretical foundations so far. Further analysis of these bases may be promising.


References

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.
Jutten, C., & Hérault, J. (1991). Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Li, Z., & Atick, J. J. (1994). Towards a theory of the striate cortex. Neural Computation, 6, 127–146.
Matsuda, Y., & Yamaguchi, K. (2003). Linear multilayer ICA algorithm integrating small local modules. In Proceedings of ICA2003 (pp. 403–408). Nara, Japan.
Matsuda, Y., & Yamaguchi, K. (2004). Linear multilayer independent component analysis using stochastic gradient algorithm. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind source separation: ICA2004 (pp. 303–310). Berlin: Springer-Verlag.
Matsuda, Y., & Yamaguchi, K. (2005a). An efficient MDS-based topographic mapping algorithm. Neurocomputing, 64, 285–299.
Matsuda, Y., & Yamaguchi, K. (2005b). Linear multilayer independent component analysis for large natural scenes. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 897–904). Cambridge, MA: MIT Press.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London: B, 265, 359–366.

Received December 16, 2005; accepted April 5, 2006.

LETTER

Communicated by Andries P. Engelbrecht

Functional Network Topology Learning and Sensitivity Analysis Based on ANOVA Decomposition

Enrique Castillo [email protected]
Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla–La Mancha, Spain

Noelia Sánchez-Maroño [email protected]
Amparo Alonso-Betanzos [email protected]
Computer Science Department, University of A Coruña, Spain

Carmen Castillo [email protected]
Department of Civil Engineering, University of Castilla–La Mancha, Spain

A new methodology for learning the topology of a functional network from data, based on the ANOVA decomposition technique, is presented. The method determines sensitivity (importance) indices that allow a decision to be made as to which set of interactions among variables is relevant and which is irrelevant to the problem under study. This immediately suggests the network topology to be used in a given problem. Moreover, local sensitivities to small changes in the data can be easily calculated. In this way, the dual optimization problem gives the local sensitivities. The methods are illustrated by their application to artificial and real examples.

Neural Computation 19, 231–257 (2007). © 2006 Massachusetts Institute of Technology

1 Introduction

Functional networks (FN) have been proposed by E. Castillo (1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998) and have been shown to be successful in solving many physical or engineering problems (see Castillo & Gutiérrez, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 2000). FN are a generalization of neural networks (NN) that combine knowledge about the structure of the problem and data (Castillo et al., 1998). There are important differences between FN and NN. Standard NN usually have a rigid topology (only the number of layers and the number of neurons can be chosen) and fixed neural functions, so that only the weights are learned. In FN, the neural
topology is initially free, and the neural functions can be selected from a wide range of families, so that both the topology and the parameters of the neural functions are learned. In addition, FN incorporate knowledge on the network topology or neural functions, or both. This knowledge can come from two main sources: (1) the knowledge the user has about the problem being solved, which can be written in terms of certain properties or characteristics of the model, or (2) the available data, which can be scrutinized to obtain the structure. The authors cited have described how the first type of knowledge can be used to derive the network topology. This is done mainly with the help of functional equations, which both suggest an initial topology and allow this initial topology to be simplified, leading to an equivalent simpler one. Thus, one of the main tools for building the network topology from this type of knowledge is functional equations. Such equations have been extensively described in the literature (see, e.g., Acz´el, 1966; Castillo & Ru´ız-Cobo, 1992; Castillo, Cobo, Guti´errez, & Pruneda, 1999; Castillo, Iglesias, & Ru´ız-Cobo, 2004). In this way and although FN can also be used as black boxes, the network is a white (rather than a black) box, where the structure has a knowledge-based foundation. However, no efficient and clean method for deriving the network topology from data has yet been devised. This is one of the aims of this article. Sensitivity analysis is another area of great interest (Saltelli, Chan, & Scott, 2000; Castillo, Guti´errez, & Hadi, 1997; Chatterjee & Hadi, 1988; Hadi & Nyquist, 2002); the concern is not only with learning a model but with the sensitivity of the model parameters to data. Sensitivity can be understood as local or global. 
Local sensitivity aims at discovering how the model changes as small modifications are made to the data and at determining which data values have the greatest influence on the model when changed in small increments. Global sensitivity aims at discovering how sensitive the model is to the inputs and whether some inputs can be removed without a significant loss in output quality. Different studies have been carried out regarding model approximation (Sacks, Mitchell, & Wynn, 1989; Koehler & Owen, 1996; Currin, Mitchell, Morris, & Ylvisaker, 1991). Some of them use regression strategies similar to, but different from, the one described in Jiang and Owen (2001). In this letter, we present a methodology based on ANOVA decomposition that permits the topology of a functional network to be learned from data. The ANOVA decomposition also permits global sensitivity indices to be obtained for each variable and for each interaction between variables. The topology of a functional network is derived from this information, and low- and high-order interactions among variables are easily determined using ANOVA. Finally, a local sensitivity analysis can be carried out too. In this way, the final model can be fully defined. The letter is structured as follows. Section 2 gives a quick introduction to functional networks and describes the ANOVA decomposition, including global sensitivity indices. Section 3 describes the proposed method for


learning the network topology and the local and global sensitivity indices from data. Section 4 presents two illustrative examples, and section 5 contains the conclusions. 2 Background Knowledge 2.1 A Brief Introduction to Functional Networks. Functional networks (FN) are a generalization of neural networks that brings together domain knowledge, to determine the structure of the problem, and data, to estimate the unknown functional neurons (Castillo et al., 1998). In FN, there are two types of learning to deal with this domain and data knowledge:

• Structural learning, which includes the initial topology of the network and its posterior simplification using functional equations, leading to a simpler equivalent structure.
• Parametric learning, concerned with the estimation of the neuron functions. This can be done by considering linear combinations of given functional families and estimating the associated parameters from the available data. Note that this type of learning generalizes the idea of estimating the weights of the connections in a neural network.

In FN, not only are arbitrary neural functions allowed, but they are initially assumed to be multiargument and vector-valued functions; that is, they depend on several arguments and are multivariate. This fact is shown in Figure 1a. In this figure, we can also see some relevant differences between neural and functional networks. Note that the FN has no weights, and the parameters to be learned are incorporated into the neural functions $f_i$; $i = 1, 2, 3$. These neural functions are unknown functions from a given family (e.g., the polynomial family) to be estimated during the learning process. For example, the neural functions $f_i$ could be approximated by

$$f_i(x_i, x_j) = \sum_{k_i=0}^{m_i} a_{i k_i} x_i^{k_i} + \sum_{k_j=0}^{m_j} a_{i k_j} x_j^{k_j},$$

and the parameters to be learned will be the coefficients $a_{i k_i}$ and $a_{i k_j}$. As each function $f_i$ is learned, a different function is obtained for each neuron. It can be argued that functional networks require domain knowledge for deriving the functional equations and make assumptions about the form the unknown functions should take. However, like neural networks, functional networks can also be used as a black box.

2.2 The ANOVA Decomposition. The analysis of variance (ANOVA) was developed by Fisher (1918) and later further developed by other authors (Efron & Stein, 1981; Sobol, 1969; Hoeffding, 1948).

[Figure 1 panels: (a) network with weighted connections $w_{ki}$ and neurons computing $f(\sum_i w_{ki} x_i)$; (b) network with neuron functions $f_1(x_1, x_2)$, $f_2(x_2, x_3)$, $f_3(x_3, x_4)$, and $f_4(x_5, x_6)$.]

Figure 1: (a) A neural network. (b) A functional network.

According to Sobol (2001), any square integrable function $f(x_1, \ldots, x_n)$ defined on the unit hypercube $[0, 1]^n$ can be written as

$$
\begin{aligned}
y = f(x_1, \ldots, x_n) = f_0 &+ \sum_{i_1=1}^{n} f_{i_1}(x_{i_1}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} f_{i_1 i_2}(x_{i_1}, x_{i_2}) \\
&+ \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} f_{i_1 i_2 i_3}(x_{i_1}, x_{i_2}, x_{i_3}) + \cdots \\
&+ \sum_{i_1=1}^{1} \sum_{i_2=2}^{2} \cdots \sum_{i_n=n}^{n} f_{i_1 i_2 \ldots i_n}(x_{i_1}, x_{i_2}, \ldots, x_{i_n}),
\end{aligned}
\tag{2.1}
$$

where the term $f_0$ is a constant and corresponds to the function with no arguments.


The decomposition, equation 2.1, is called ANOVA iff

$$\int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \ \forall i_1, i_2, \ldots, i_k; \ \forall k, \tag{2.2}$$

where k is an index to point to any of the elements in the set $x_1, x_2, \ldots, x_n$. Hence, the functions corresponding to the different summands are orthogonal, that is,

$$\int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, f_{j_1 j_2 \ldots j_\ell}(x_{j_1}, x_{j_2}, \ldots, x_{j_\ell}) \, d\mathbf{x} = 0 \quad \forall (i_1, i_2, \ldots, i_k) \neq (j_1, j_2, \ldots, j_\ell),$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. It is important to notice that the conditions in equation 2.2 are sufficient for the different component functions to be orthogonal. In addition, it can be shown that they are unique, in the sense of $L^2$ equivalence classes. Note that since the above decomposition includes terms with all possible kinds of interactions among variables $x_1, x_2, \ldots, x_n$, it allows these interactions to be determined. The main advantage of this decomposition is that there are explicit formulas for obtaining the different summands or components of $f(x_1, \ldots, x_n)$. Sobol (2001) provides the following expressions:

$$f_0 = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n, \tag{2.3}$$

$$f_i(x_i) = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1;\, k \neq i}^{n} dx_k - f_0, \tag{2.4}$$

$$f_{ij}(x_i, x_j) = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1;\, k \neq i, j}^{n} dx_k - f_0 - \sum_{k=i,j} f_k(x_k), \tag{2.5}$$

and so on. In other words, the $f(x_1, \ldots, x_n)$ function can always be written as the sum of $2^n$ orthogonal summands (see equation 2.1). If $f(x_1, \ldots, x_n)$ is square integrable, then all $f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ for all combinations $i_1 i_2 \ldots i_k$ and $k = 1, 2, \ldots, n$ are also square integrable.
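As a rough numerical illustration of equations 2.3 to 2.5, the sketch below computes the ANOVA components of a two-variable function on a quadrature grid. The test function $f(x_1, x_2) = x_1 + x_1 x_2$ and all variable names are assumptions made for the example; Gauss-Legendre quadrature is used only because it integrates polynomials of this degree exactly.

```python
import numpy as np

# Gauss-Legendre nodes and weights, mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(8)
t = 0.5 * (nodes + 1.0)
w = 0.5 * weights

def f(x1, x2):
    # Hypothetical test function, not taken from the letter.
    return x1 + x1 * x2

X1, X2 = np.meshgrid(t, t, indexing="ij")
F = f(X1, X2)                      # F[a, b] = f(t[a], t[b])

f0 = w @ F @ w                     # equation 2.3: integrate over x1 and x2
f1 = F @ w - f0                    # equation 2.4: integrate out x2
f2 = w @ F - f0                    # equation 2.4: integrate out x1
f12 = F - f0 - f1[:, None] - f2[None, :]   # equation 2.5 for n = 2

# ANOVA constraints (equation 2.2): each component integrates to zero
# with respect to any one of its own arguments.
constraint_residuals = (w @ f1, w @ f2, f12 @ w, w @ f12)
```

For this function the components have closed forms, $f_0 = 3/4$, $f_1 = \tfrac{3}{2}x_1 - \tfrac{3}{4}$, $f_2 = \tfrac{1}{2}x_2 - \tfrac{1}{4}$, and $f_{12} = (x_1 - \tfrac12)(x_2 - \tfrac12)$, which the grid values reproduce.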


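Squaring $f(x_1, \ldots, x_n)$ and integrating over $(0, 1)^n$ gives, by the orthogonality of the summands,

$$
\begin{aligned}
\int_0^1 \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n = f_0^2 &+ \sum_{i_1=1}^{n} \int_0^1 f_{i_1}^2(x_{i_1}) \, dx_{i_1} \\
&+ \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \int_0^1 \int_0^1 f_{i_1 i_2}^2(x_{i_1}, x_{i_2}) \, dx_{i_1} dx_{i_2} \\
&+ \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \int_0^1 \int_0^1 \int_0^1 f_{i_1 i_2 i_3}^2(x_{i_1}, x_{i_2}, x_{i_3}) \, dx_{i_1} dx_{i_2} dx_{i_3} + \cdots \\
&+ \int_0^1 \cdots \int_0^1 f_{1 2 \ldots n}^2(x_1, x_2, \ldots, x_n) \, dx_1 \cdots dx_n,
\end{aligned}
\tag{2.6}
$$

and calling

$$D = \int_0^1 \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n - f_0^2, \tag{2.7}$$

$$D_{i_1 i_2 \ldots i_k} = \int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}^2(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, d\mathbf{x}, \tag{2.8}$$

the result is

$$D = \sum_{k=1}^{n} \sum_{i_1 i_2 \ldots i_k} D_{i_1 i_2 \ldots i_k}. \tag{2.9}$$

The constant D is called the variance, because if $(x_1, x_2, \ldots, x_n)$ is a uniform random variable in the unit hypercube, then D is the variance of $f(x_1, x_2, \ldots, x_n)$. Thus, the following set of global sensitivity indices, adding up to one, can be defined:

$$S_{i_1 i_2 \ldots i_k} = \frac{D_{i_1 i_2 \ldots i_k}}{D} \quad \Leftrightarrow \quad \sum_{i_1 i_2 \ldots i_k} S_{i_1 i_2 \ldots i_k} = 1. \tag{2.10}$$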
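A small worked case may help fix ideas. The function $f(x_1, x_2) = x_1 x_2$ is chosen here purely for illustration (it does not appear in the letter); applying equations 2.3 to 2.10 to it by hand gives:

```latex
\begin{align*}
f_0 &= \int_0^1\!\!\int_0^1 x_1 x_2 \, dx_1 dx_2 = \tfrac14, \\
f_1(x_1) &= \tfrac12 x_1 - \tfrac14, \qquad
f_2(x_2) = \tfrac12 x_2 - \tfrac14, \qquad
f_{12}(x_1, x_2) = \left(x_1 - \tfrac12\right)\left(x_2 - \tfrac12\right), \\
D &= \int_0^1\!\!\int_0^1 x_1^2 x_2^2 \, dx_1 dx_2 - f_0^2 = \tfrac19 - \tfrac1{16} = \tfrac{7}{144}, \\
D_1 &= D_2 = \tfrac14 \int_0^1 \left(x - \tfrac12\right)^2 dx = \tfrac1{48}, \qquad
D_{12} = \left(\tfrac1{12}\right)^2 = \tfrac1{144}, \\
D_1 + D_2 + D_{12} &= \tfrac{3 + 3 + 1}{144} = \tfrac{7}{144} = D, \\
S_1 &= S_2 = \tfrac37, \qquad S_{12} = \tfrac17, \qquad S_1 + S_2 + S_{12} = 1.
\end{align*}
```

In this example the interaction term carries one-seventh of the variance, so dropping it would leave a purely additive model that explains six-sevenths of the variance.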

The practical importance of the ANOVA decomposition arises from the following facts:

1. Every square integrable function can be decomposed as the sum of orthogonal functions, including all interaction levels.
2. This decomposition is unique.
3. There are explicit formulas for determining this decomposition in terms of $f(x_1, \ldots, x_n)$.
4. The variance of the initial function can be obtained by summing up the variances of the components, and this permits global sensitivity indices that sum to one to be assigned to the different functional components.

3 Learning Algorithm

In this section, we present a method (denominated the AFN, i.e., ANOVA decomposition and functional networks) for learning the functional components of any general function $f(x_1, \ldots, x_n)$ from data and for calculating local and global sensitivity indices. Consider a data set $\{(x_{1m}, x_{2m}, \ldots, x_{nm}; y_m) \mid m = 1, 2, \ldots, M\}$, that is, a sample of size M of n input variables $(X_1, X_2, \ldots, X_n)$ and one output variable Y. The algorithm works as follows.

3.1 Step 1: Select a Set of Approximating Orthonormal Functions. Each functional component $f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ is approximated as

$$f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}), \tag{3.1}$$

where $c^*_{i_1 i_2 \ldots i_k; j}$ are real constants, and

$$\{ h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k} \} \tag{3.2}$$

is a set of basis functions (e.g., polynomials, sinusoids, exponentials, splines, wavelets), which must be orthonormalized. One possibility consists of using one of the families of univariate orthogonal functions (for example, Legendre polynomials), forming tensor products with them, and selecting a subset of them. Although these polynomials are defined with respect to a uniform weighting of [−1, 1], they can be easily mapped to the interval [0, 1]. Hermite and Chebychev polynomials, Fourier functions, and Haar wavelets also provide univariate basis functions. Another alternative consists of using a family of functions and the Gram-Schmidt technique, which is implemented in some numerical libraries, to orthonormalize these basis functions.


For the sake of completeness, we give a third alternative. Since they must satisfy the ANOVA constraints, equation 2.2, we must have

$$\int_0^1 \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \ \forall i_1, i_2, \ldots, i_k; \ \forall k \in \{1, 2, \ldots, n\}. \tag{3.3}$$

Note that in spite of the fact that these conditions represent an uncountably infinite number of constraints, only a finite number of them are independent. Thus, only some subset of these constants $\{c^*_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}\}$ remain free. After renaming the free constants $\{c_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}\}$ and orthonormalizing the basis functions, equation 3.1 becomes

$$f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j} \, p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}). \tag{3.4}$$

Note that the initial set of basis functions, equation 3.2, changes to

$$\{ p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k} \}. \tag{3.5}$$

If one uses the tensor product technique, the multivariate normalized basis functions $p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ can be selected from the set (tensor products):

$$\left\{ \prod_{j=1}^{k} p_{i_j, r_j}(x_{i_j}) \;\middle|\; 1 \le r_j \le k_{i_j} \right\}. \tag{3.6}$$

Note that the size of this set grows exponentially with k and that we can be interested in selecting a small subfamily. For example, if we select as univariate basis functions polynomials of degree r, for the m-dimensional basis functions, the tensor product technique leads to polynomials of degree r × m, which is too high. Thus, we can limit the degree of the corresponding m-multivariate basis to contain only polynomials of degree $d_m$ or less. This is what we have done with the examples presented in this letter.

Example 1 (Univariate Functions): Consider the basis of univariate polynomial functions,

$$\{h^*_{1;1}(x_1), h^*_{1;2}(x_1), h^*_{1;3}(x_1), h^*_{1;4}(x_1)\} = \{1, x_1, x_1^2, x_1^3\}.$$

After imposing the constraints, equations 2.3 to 2.5, we get the reduced basis

$$\{h_{1;1}(x_1), h_{1;2}(x_1), h_{1;3}(x_1)\} = \{2x_1 - 1, \; 3x_1^2 - 1, \; 4x_1^3 - 1\},$$

which after normalization leads to

$$\{p_{1;1}(x_1), p_{1;2}(x_1), p_{1;3}(x_1)\} = \left\{\sqrt{3}\,(2x_1 - 1), \; \sqrt{5}\,(6x_1^2 - 6x_1 + 1), \; \sqrt{7}\,(20x_1^3 - 30x_1^2 + 12x_1 - 1)\right\}.$$

Example 2 (Multivariate Functions): If the function is multivariate, the normalized basis can be obtained as the tensor product of univariate basis functions. For example, consider the basis of 16 bivariate functions,

$$\{1, x_2, x_2^2, x_2^3, x_1, x_1 x_2, x_1 x_2^2, x_1 x_2^3, x_1^2, x_1^2 x_2, x_1^2 x_2^2, x_1^2 x_2^3, x_1^3, x_1^3 x_2, x_1^3 x_2^2, x_1^3 x_2^3\},$$

which, after imposing the constraints, equations 2.3 to 2.5, and normalization leads to the normalized basis:

$$
\begin{aligned}
\Big\{ & 3\,(-1 + 2x_1)(-1 + 2x_2), \quad \sqrt{15}\,(-1 + 2x_1)(1 - 6x_2 + 6x_2^2), \\
& \sqrt{21}\,(-1 + 2x_1)(-1 + 12x_2 - 30x_2^2 + 20x_2^3), \\
& \sqrt{15}\,(1 - 6x_1 + 6x_1^2)(-1 + 2x_2), \quad 5\,(1 - 6x_1 + 6x_1^2)(1 - 6x_2 + 6x_2^2), \\
& \sqrt{35}\,(1 - 6x_1 + 6x_1^2)(-1 + 12x_2 - 30x_2^2 + 20x_2^3), \\
& \sqrt{21}\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(-1 + 2x_2), \\
& \sqrt{35}\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(1 - 6x_2 + 6x_2^2), \\
& 7\,(-1 + 12x_1 - 30x_1^2 + 20x_1^3)(-1 + 12x_2 - 30x_2^2 + 20x_2^3) \Big\}.
\end{aligned}
$$

This family of 9 functions is the family 3.6; that is, it comes from the tensor products of the normalized functions in example 1. Note that these bases can be obtained independently of the data set, which means that they are valid for learning any data set. So once calculated, they can be stored to be used when needed. In addition, since multivariate functions have associated tensor product bases of univariate functions, one needs to calculate and store only the basis associated with univariate functions.
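The properties claimed for the bases in examples 1 and 2 can be checked numerically. The sketch below is only a verification aid, with hypothetical names (`p1`, `p2`, `p3`, `gram`, `gram2`); it uses Gauss-Legendre quadrature on [0, 1], which is exact for the polynomial degrees involved, to confirm that the univariate functions integrate to zero and are orthonormal, and that their nine tensor products are orthonormal on the unit square.

```python
import numpy as np

# Gauss-Legendre rule on [0, 1]; 8 points integrate degree <= 15 exactly.
nodes, weights = np.polynomial.legendre.leggauss(8)
t = 0.5 * (nodes + 1.0)
w = 0.5 * weights

# Normalized univariate basis from example 1.
def p1(x): return np.sqrt(3.0) * (2 * x - 1)
def p2(x): return np.sqrt(5.0) * (6 * x**2 - 6 * x + 1)
def p3(x): return np.sqrt(7.0) * (20 * x**3 - 30 * x**2 + 12 * x - 1)

basis = (p1, p2, p3)
P = np.stack([p(t) for p in basis])          # shape (3, 8)

means = P @ w                                # integral of each p over [0, 1]
gram = (P * w) @ P.T                         # matrix of inner products <p_a, p_b>

# Tensor-product basis from example 2 (9 bivariate functions).
T1, T2 = np.meshgrid(t, t, indexing="ij")
W2 = np.outer(w, w)
B = np.stack([pa(T1) * pb(T2) for pa in basis for pb in basis])  # (9, 8, 8)
gram2 = np.einsum("aij,bij,ij->ab", B, B, W2)
```

Because both Gram matrices come out as identities, the squared coefficients of any expansion in these bases can be read directly as variance contributions.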


3.2 Step 2: Learn the Coefficients by Least Squares and Obtain Local Sensitivity Indices. According to our approximation, we have ki 1 n

ym = f (x1m , . . . , xnm ) = f 0 +

c i1 ; j pi1 ; j (xi1 m )

i 1 =1 j=1

+

ki 1 i 2 n n−1

c i1 ; j pi1 i2 ; j (xi1 m , xi2 m )

i 1 =1 i 2 =i 1 +1 j=1

+

(3.7)

ki 1 i 2 i 3 n

n−1 n−2

c i1 i2 i3 ; j pi1 ,i2 ,i3 ; j (xi1 m , xi2 m , xi3 m ) + · · ·

i 1 =1 i 2 =i 1 +1 i 3 =i 2 +1 j=1

+

1 2

...

n ki 1 i 2 ...i n i n =n

i 1 =1 i 2 =2

c i1 i2 ...in ; j pi1 i2 ...in (xi1 m , xi2 m , . . . , xin m ).

j=1

The error for the mth data value is m (y, x, c) = ym − f 0 −

ki 1 n

c i1 ; j pi1 ; j (xi1 m )

i 1 =1 j=1

−

ki 1 i 2 n n−1

c i1 ; j pi1 i2 ; j (xi1 m , xi2 m )

i 1 =1 i 2 =i 1 +1 j=1

−

n−1 n−2

ki 1 i 2 i 3 n

c i1 i2 i3 ; j pi1 ,i2 ,i3 ; j (xi1 m , xi2 m , xi3 m ) − · · ·

i 1 =1 i 2 =i 1 +1 i 3 =i 2 +1 j=1

−

2 1

...

i 1 =1 i 2 =2

n ki 1 i 2 ...i n i n =n

c i1 i2 ,...,in ; j pi1 i2 ,...,in (xi1 m , xi2 m , . . . , xin m ), (3.8)

j=1

where y, x are the data vectors and c is the vector including all the unknown coefficients. Hence, to estimate the constants c, that is, all c i1 i2 ,...,ik ; j ; ∀k, j, the following minimization problem has to be solved: Minimize Q = f 0 ,c

M

m2 (y, x, c).

(3.9)

m=1

The minimization problem, equation 3.9, leads to a linear system of equations with a unique solution. However, due to its tensor character, its organization in a standard form as Ac = b requires renumbering of

Functional Network Topology Learning and Sensitivity Analysis

241

the unknowns c, which is not a trivial task. Alternatively, one can use standard optimization packages, such as GAMS (Brooke, Kendrik, Meeraus, & Raman, 1998), to solve equation 3.9. This is the option used in this letter. However, to obtain the local sensitivities of Q to small changes in the data, the following modified problem, which is equivalent to equation 3.9, can be solved instead:

$$
\underset{f_0,\, c;\ y',\, x'}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(y', x', c),
\tag{3.10}
$$

subject to

$$
y' = y : \lambda
\tag{3.11}
$$

$$
x' = x : \delta,
\tag{3.12}
$$

where λ and δ are the vectors of dual variables, which give the local sensitivities of Q with respect to the data values y and x, respectively. This is so because the dual variables associated with any primal constrained optimization problem are known to be the partial derivatives of the optimal objective function value with respect to changes in the right-hand-side parameters of the corresponding equality constraints. Since in this artificial optimization problem, equations 3.10 to 3.12, the right-hand sides of the equality constraints 3.11 and 3.12 are the data y and x, respectively, the partial derivatives of $Q^*$ (the optimal value of Q) with respect to the data, that is, the sensitivities sought, are the values of the corresponding dual variables.

3.3 Step 3: Obtain the Global Sensitivity Indices. Since the resulting basis functions have already been orthonormalized, the global sensitivity indices (importance factors) are the sums of the squares of the coefficients, that is,

$$
S_{i_1 i_2 \ldots i_k} = \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c^2_{i_1 i_2 \ldots i_k;j}; \quad \forall (i_1, i_2, \ldots, i_k).
\tag{3.13}
$$
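As a minimal sketch of equation 3.13, the snippet below sums the squared coefficients of each functional component. It also aggregates them into a total index per input by adding every component that involves that input; this aggregation is an assumption on our part, chosen because it is consistent with the total sensitivity indices reported later in the letter. The coefficient values and the function names `global_indices` and `total_indices` are hypothetical.

```python
def global_indices(coeffs):
    """Equation 3.13: the index of a component is the sum of its squared coefficients."""
    return {comp: sum(c * c for c in cs) for comp, cs in coeffs.items()}


def total_indices(S, n_inputs):
    """Assumed aggregation: add every component index that involves input i."""
    return {i: sum(v for comp, v in S.items() if i in comp)
            for i in range(1, n_inputs + 1)}


# Hypothetical coefficients, keyed by the tuple of input indices (i1, ..., ik).
coeffs = {(1,): [0.8, 0.1], (2,): [0.3], (1, 2): [0.2, 0.1]}

S = global_indices(coeffs)         # one index per functional component
T = total_indices(S, n_inputs=2)   # one total index per input variable
```

Keying the coefficients by index tuple keeps the univariate, bivariate, and higher-order components in a single structure, which avoids renumbering the unknowns.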

3.4 Step 4: Return Solution. The algorithm returns the following information:

• An estimation of the coefficients $f_0$ and $\{c_{i_1 i_2 \ldots i_k;j}; \forall (i_1, i_2, \ldots, i_k; j)\}$, which, together with the basis functions $p_{i_1 i_2 \ldots i_k;j}(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ and equation 3.7, yield an approximation for the solution of the problem being studied.

• Local and global sensitivity indices.
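The steps above can be sketched for a single input as follows. This is a toy illustration under stated assumptions, not the letter's implementation (which relies on GAMS): the basis is taken to be the orthonormal shifted Legendre polynomials on [0, 1], the target is f(x) = x², and the least-squares problem is solved through the normal equations with a small Gaussian elimination routine. All function names are hypothetical.

```python
import math
import random


def basis(x):
    """Orthonormal shifted Legendre polynomials of degree 1 and 2 on [0, 1]."""
    return [math.sqrt(3) * (2 * x - 1), math.sqrt(5) * (6 * x * x - 6 * x + 1)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def fit(xs, ys):
    """Least-squares estimate of (f0, c1, c2) through the normal equations."""
    rows = [[1.0] + basis(x) for x in xs]
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(A, b)


random.seed(0)
xs = [random.random() for _ in range(100)]   # inputs drawn from U(0, 1)
ys = [x * x for x in xs]                     # noise-free toy target f(x) = x^2

f0, c1, c2 = fit(xs, ys)
S1 = c1 ** 2 + c2 ** 2                       # equation 3.13 for the single input
```

Because the toy target lies in the span of the basis, the fit recovers f0 = 1/3 and the sum of squared coefficients equals the variance of x² under U(0, 1), namely 4/45.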


For simplicity, only one output variable was considered; however, an extension to the case of several outputs can be immediately obtained by solving the corresponding minimization problems. Thus, the algorithm above allows us to obtain an approximation for a given function f in terms of its functional components. Moreover, global sensitivity indices are derived for each functional component; therefore, the topology of a functional network can be considerably simplified by taking into account only the components with higher sensitivities. Suppose that $\hat{f}(x_1, \ldots, x_n)$ is the approximation of f after removing unimportant interactions, that is, when only the subset of important interactions is included in the final model. Then the mean squared error (assuming a uniform distribution in the unit cube) is

$$
\mathrm{MSE} = \int_0^1 \int_0^1 \cdots \int_0^1 \left[ \hat{f}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \right]^2 dx_1 \cdots dx_n.
$$

Sometimes it is more convenient to use the normalized mean squared error (NMSE), which is dimensionless and defined as

$$
\mathrm{NMSE} = \frac{\mathrm{MSE}}{\mathrm{Var}[f(x_1, \ldots, x_n)]}.
$$
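In practice the integral is estimated from a finite sample. The sketch below shows one such estimate, assuming the sample points are drawn uniformly from the unit cube so that sample averages approximate the integrals; the function names and the four data values are hypothetical.

```python
def mse(y_true, y_pred):
    """Sample estimate of the mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def nmse(y_true, y_pred):
    """MSE divided by the sample variance of the target values."""
    mean = sum(y_true) / len(y_true)
    var = sum((t - mean) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / var


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

err = mse(y_true, y_pred)
rel = nmse(y_true, y_pred)
```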

4 Application Examples

In this section two application examples are described to illustrate the methodology proposed above.

4.1 Learning a Three-Input Nonlinear Function. The aim of this example is to demonstrate that the global sensitivity indices recovered by the proposed algorithm from data are exactly the same as those that can be obtained directly from the original function. Also, we will show that the resulting global sensitivity indices are barely affected by noise. For illustration, we selected the following function:

$$
y = f(x_1, x_2, x_3) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3.
\tag{4.1}
$$

Table 1: Sensitivity Results with Different Levels of Noise.

| σ | S1 | S2 | S3 | S12 | S13 | S23 | S123 |
|---|---|---|---|---|---|---|---|
| 0.00 | 0.8361 | 0.0922 | 0.0231 | 0.0307 | 0.0077 | 0.0077 | 0.0026 |
| 0.02 | 0.8359 | 0.0922 | 0.0230 | 0.0308 | 0.0077 | 0.0078 | 0.0026 |
| 0.04 | 0.8356 | 0.0921 | 0.0233 | 0.0309 | 0.0076 | 0.0080 | 0.0026 |
| 0.06 | 0.8360 | 0.0917 | 0.0231 | 0.0304 | 0.0081 | 0.0080 | 0.0026 |
| 0.08 | 0.8336 | 0.0924 | 0.0237 | 0.0317 | 0.0081 | 0.0077 | 0.0028 |
| 0.10 | 0.8335 | 0.0918 | 0.0238 | 0.0311 | 0.0086 | 0.0084 | 0.0028 |
| 0.12 | 0.8312 | 0.0931 | 0.0242 | 0.0314 | 0.0088 | 0.0081 | 0.0032 |
| 0.14 | 0.8305 | 0.0940 | 0.0241 | 0.0305 | 0.0089 | 0.0091 | 0.0029 |
| 0.16 | 0.8321 | 0.0901 | 0.0237 | 0.0319 | 0.0095 | 0.0096 | 0.0031 |
| 0.18 | 0.8268 | 0.0925 | 0.0244 | 0.0327 | 0.0110 | 0.0099 | 0.0028 |
| 0.20 | 0.8267 | 0.0930 | 0.0238 | 0.0331 | 0.0103 | 0.0096 | 0.0035 |

Note: Mean values of 100 replications.

First, we calculate the exact ANOVA components using equations 2.3 to 2.5, leading to

$$
\begin{aligned}
f_0 &= \frac{7}{3}; \quad f_1(x_1) = -\frac{4}{3} + 2 x_1 + x_1^2; \quad f_2(x_2) = -\frac{1}{2} + x_2; \\
f_3(x_3) &= -\frac{1}{4} + \frac{x_3}{2}; \quad f_{12}(x_1, x_2) = \frac{1}{2} - x_1 - x_2 + 2 x_1 x_2; \\
f_{13}(x_1, x_3) &= \frac{1}{4} - \frac{x_1}{2} - \frac{x_3}{2} + x_1 x_3; \quad f_{23}(x_2, x_3) = \frac{1}{4} - \frac{x_2}{2} - \frac{x_3}{2} + x_2 x_3; \\
f_{123}(x_1, x_2, x_3) &= -\frac{1}{4} + \frac{x_1}{2} + \frac{x_2}{2} + \frac{x_3}{2} - x_1 x_2 - x_1 x_3 - x_2 x_3 + 2 x_1 x_2 x_3,
\end{aligned}
\tag{4.2}
$$

whose sum gives exactly the function in equation 4.1. Next, the exact variance D = 0.9037 of function f and the sensitivity indices for the functional components were calculated using equations 2.7 to 2.10. These are shown in the first row of Table 1, corresponding to σ = 0. Subsequently, we show that we can learn the function f , its ANOVA decomposition, and the global indices from data. To this end, a sample of size m = 100 was generated for each input variable {xik |i = 1, 2, 3; k = 1, 2, . . . , m} from a uniform U(0, 1) random variable, and {yk |k = 1, 2, . . . , m} was calculated using equation 4.1. Next, the AFN method was applied in order to obtain an approximation of the function f in equation 4.1. The algorithm starts by considering a set of basis functions for univariate functions. As the function to be estimated was known (because it is an illustrative example), a suitable set of basis functions was formed by third-degree polynomials. These basis functions were orthonormalized as described in example 1. The multivariate functions, with two and three arguments, were approximated with the tensor product functions obtained from the corresponding third-degree polynomials univariate functions (see example 2). However, only polynomials of third degree or lower were used in the approximation. After solving the minimization problem, the exact function


was recovered by applying the algorithm: $f(x_1, \ldots, x_n) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3$. Next, the global sensitivity indices were calculated from the coefficients using equation 3.13 in step 3 of the algorithm. These coefficients were the exact ones (i.e., with no error), as can be confirmed in Table 1 in the row labeled “σ = 0.00.” Finally, to test the sensitivity of the results to noise, the learning process was repeated for different noise levels, adding to y in equation 4.1 a normal $N(0, \sigma^2)$ noise with σ = 0.00, 0.02, . . . , 0.20. The results obtained are shown in Table 1. It can be observed that the sensitivity indices are barely affected by the noise level.

Once the global sensitivity indices have been calculated, one can decide which interactions must be included in the model. This immediately defines the network topology. Since 98.18% of the variance in the illustrative example can be obtained by the approximation

$$
f(x_1, \ldots, x_n) \approx f_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{12}(x_1, x_2),
\tag{4.3}
$$

one can decide to remove all other interactions. The final topology of the network obtained using our method is shown in Figure 2b, whereas Figure 2a shows the network topology used when the methodology proposed was not applied. Comparing both network topologies, it can be observed that the application of the algorithm achieved considerable simplification while practically maintaining the quality of the outputs.

To illustrate the performance of the proposed method, Table 2 shows the means and standard deviations (in parentheses) of MSE (mean squared error) and NMSE (normalized mean squared error) for the training (80% of the sample) and testing (20% of the sample) samples, obtained with 100 replications. Since the values are small, they reveal good performance. In addition, Table 2 includes the errors obtained with increasing noise values (σ).

4.2 Case Example: A Vertical Breakwater Problem. Breakwaters are constructed to provide sheltered bays for ships and protect harbor facilities. Moreover, in ports open to rough seas, they play a key role in port operations. Since sea waves are enormously powerful, it is not an easy matter to construct structures to mitigate sea power. When designing a breakwater (see Figure 3), one looks for the optimal cross-section that minimizes construction and maintenance costs during the breakwater’s useful life and also satisfies reliability constraints guaranteeing that the work is reasonably safe for each failure mode. Optimization of this engineering

Figure 2: Network topologies corresponding to the three-input function in the example. (a) Using the available data directly, without applying the proposed methodology. (b) Using the knowledge obtained by the proposed methodology to eliminate some interaction terms.

design is extremely important because of the corresponding reduction in the associated cost. Analysis of the failure probabilities of the breakwater requires calculating failure probabilities and the annual failure rate for each failure mode. This implies determining the pressure produced on the breakwater crownwall by the sea waves. This problem involves nine input variables (listed in Table 3) and four output variables: p1 , p2 , p3 , and pu (the first three pi are, respectively, water pressure at the mean, freeboard and base levels of the concrete block, and pu is the maximum subpressure value).


Table 2: Simulation Results for a Sample Size n = 100 with Different Levels of Noise.

| σ | MSE tr | NMSE tr | MSE test | NMSE test |
|---|---|---|---|---|
| 0.00 | 0.00000 (0.00000) | 0.00000 (0.00000) | 0.00000 (0.00000) | 0.00000 (0.00000) |
| 0.02 | 0.00035 (0.00006) | 0.00040 (0.00009) | 0.00054 (0.00018) | 0.00061 (0.00022) |
| 0.04 | 0.00140 (0.00023) | 0.00159 (0.00039) | 0.00228 (0.00090) | 0.00258 (0.00111) |
| 0.06 | 0.00312 (0.00052) | 0.00352 (0.00082) | 0.00494 (0.00149) | 0.00556 (0.00183) |
| 0.08 | 0.00570 (0.00095) | 0.00643 (0.00153) | 0.00944 (0.00326) | 0.01067 (0.00420) |
| 0.10 | 0.00891 (0.00162) | 0.00997 (0.00230) | 0.01371 (0.00465) | 0.01546 (0.00589) |
| 0.12 | 0.01257 (0.00248) | 0.01398 (0.00363) | 0.02062 (0.00761) | 0.02304 (0.00975) |
| 0.14 | 0.01681 (0.00300) | 0.01853 (0.00409) | 0.02661 (0.00973) | 0.02926 (0.01096) |
| 0.16 | 0.02203 (0.00389) | 0.02433 (0.00554) | 0.03563 (0.01453) | 0.03956 (0.01772) |
| 0.18 | 0.02789 (0.00489) | 0.03052 (0.00695) | 0.04186 (0.01439) | 0.04571 (0.01657) |
| 0.20 | 0.03490 (0.00623) | 0.03753 (0.00873) | 0.05572 (0.02170) | 0.06034 (0.02652) |

Notes: Mean values of 100 replications. Standard deviations appear in parentheses.
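The noise-free (σ = 0) indices of the three-input example can be cross-checked directly from equation 4.1. The sketch below is our own illustration rather than the AFN algorithm: it stores the polynomial as a dictionary of exponents, integrates variables out exactly over U(0, 1) with rational arithmetic, and obtains each partial variance by subtracting the contributions of all proper subsets. Inputs are indexed from 0, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# f = 1 + x1 + x1^2 + x1*x2 + 2*x1*x2*x3, stored as {exponent tuple: coefficient}.
F = {(0, 0, 0): Fraction(1), (1, 0, 0): Fraction(1), (2, 0, 0): Fraction(1),
     (1, 1, 0): Fraction(1), (1, 1, 1): Fraction(2)}


def marginalize(poly, keep):
    """Integrate out, over U(0, 1), every variable whose index is not in `keep`."""
    out = {}
    for exps, coef in poly.items():
        new_exps = []
        for i, e in enumerate(exps):
            if i in keep:
                new_exps.append(e)
            else:
                coef = coef / (e + 1)      # integral of x^e over [0, 1]
                new_exps.append(0)
        key = tuple(new_exps)
        out[key] = out.get(key, Fraction(0)) + coef
    return out


def mean(poly):
    return marginalize(poly, ()).get((0, 0, 0), Fraction(0))


def square(poly):
    out = {}
    for e1, c1 in poly.items():
        for e2, c2 in poly.items():
            key = tuple(a + b for a, b in zip(e1, e2))
            out[key] = out.get(key, Fraction(0)) + c1 * c2
    return out


def variance(poly):
    return mean(square(poly)) - mean(poly) ** 2


def sensitivity_indices(poly, n=3):
    """Return the total variance D and the normalized index of each subset of inputs."""
    partial = {}
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            v = variance(marginalize(poly, subset))
            for r in range(1, k):          # remove lower-order contributions
                for sub in combinations(subset, r):
                    v -= partial[sub]
            partial[subset] = v
    D = variance(poly)
    return D, {s: v / D for s, v in partial.items()}


D, S = sensitivity_indices(F)
print(float(D))                            # total variance, about 0.9037
for subset, value in S.items():
    print(subset, round(float(value), 4))
```

Exact rational arithmetic avoids Monte Carlo error, so the printed values can be compared digit by digit with the first row of Table 1.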

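In the case of the vertical breakwater, approximating formulas for calculating $p_1$, $p_2$, $p_3$, and $p_u$ were given by Goda (1972) using his own theoretical and laboratory studies. These were later extended by other authors (Takahashi, 1996), obtaining:

$$
\begin{aligned}
p_1 &= 0.5 (1 + \cos\theta)(\alpha_1 + \alpha_4 \cos^2\theta)\, w_0 H_D && (4.4) \\
p_2 &= \alpha_2\, p_1 && (4.5) \\
p_3 &= \alpha_3\, p_1 && (4.6) \\
p_u &= 0.5 (1 + \cos\theta)\, \alpha_1 \alpha_2\, w_0 H_D, && (4.7)
\end{aligned}
$$

where the nondimensional coefficients are given by

$$
\begin{aligned}
\alpha_1 &= 0.6 + 0.5 \left[ \frac{4\pi h / L}{\sinh(4\pi h / L)} \right]^2 && (4.8) \\
\alpha_2 &= 1 - \frac{h'}{h} \left[ 1 - \frac{1}{\cosh(2\pi h / L)} \right] && (4.9) \\
\alpha_3 &= 1 - \min\left\{ 1, \frac{h_c}{0.75 (1 + \cos\theta) H_D} \right\} && (4.10) \\
\alpha_4 &= \max(\alpha_5, \alpha_I) && (4.11) \\
\alpha_5 &= \min\left( (1 - d/h)(H_D/d)^2 / 3,\ 2d/H_D \right) && (4.12) \\
\alpha_I &= \alpha_{I0}\, \alpha_{I1} && (4.13) \\
\alpha_{I0} &= \begin{cases} H_D / d & \text{if } H_D \le 2d \\ 2 & \text{otherwise} \end{cases} && (4.14) \\
\alpha_{I1} &= \begin{cases} \cos\delta_2 / \cosh\delta_1 & \text{if } \delta_2 \le 0 \\ 1 / (\cosh\delta_1 (\cosh\delta_2)) & \text{otherwise} \end{cases} && (4.15) \\
\delta_1 &= \begin{cases} 20\,\delta_{11} & \text{if } \delta_{11} \le 0 \\ 15\,\delta_{11} & \text{otherwise} \end{cases} && (4.16) \\
\delta_{11} &= 0.93 (B_m / L - 0.12) + 0.36 [(h - d)/h - 0.6] && (4.17) \\
\delta_2 &= \begin{cases} 4.9\,\delta_{22} & \text{if } \delta_{22} \le 0 \\ 3\,\delta_{22} & \text{otherwise} \end{cases} && (4.18) \\
\delta_{22} &= -0.36 (B_m / L - 0.12) + 0.93 [(h - d)/h - 0.6]. && (4.19)
\end{aligned}
$$

Figure 3: Typical cross section of a vertical breakwater. (Labeled quantities: $p_1$, $p_2$, $p_3$, $p_u$, $h_c$, $h'$, $d$, $h$, $B_m$.)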
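To make the piecewise structure of equations 4.4 to 4.19 concrete, the following sketch transcribes them into a single function. It follows the formulas as printed above; the function name, the argument names (`h_prime` for h′), and the numerical inputs are hypothetical, and angles are assumed to be in radians.

```python
import math


def breakwater_pressures(theta, w0, HD, h, L, d, h_prime, hc, Bm):
    """Return (p1, p2, p3, pu) following equations 4.4 to 4.19 as printed."""
    k = 4.0 * math.pi * h / L
    alpha1 = 0.6 + 0.5 * (k / math.sinh(k)) ** 2                          # (4.8)
    alpha2 = 1.0 - (h_prime / h) * (1.0 - 1.0 / math.cosh(2.0 * math.pi * h / L))  # (4.9)
    alpha3 = 1.0 - min(1.0, hc / (0.75 * (1.0 + math.cos(theta)) * HD))   # (4.10)
    alpha5 = min((1.0 - d / h) * (HD / d) ** 2 / 3.0, 2.0 * d / HD)       # (4.12)

    delta11 = 0.93 * (Bm / L - 0.12) + 0.36 * ((h - d) / h - 0.6)         # (4.17)
    delta22 = -0.36 * (Bm / L - 0.12) + 0.93 * ((h - d) / h - 0.6)        # (4.19)
    delta1 = 20.0 * delta11 if delta11 <= 0 else 15.0 * delta11           # (4.16)
    delta2 = 4.9 * delta22 if delta22 <= 0 else 3.0 * delta22             # (4.18)

    alphaI0 = HD / d if HD <= 2.0 * d else 2.0                            # (4.14)
    if delta2 <= 0:                                                       # (4.15)
        alphaI1 = math.cos(delta2) / math.cosh(delta1)
    else:
        alphaI1 = 1.0 / (math.cosh(delta1) * math.cosh(delta2))
    alpha4 = max(alpha5, alphaI0 * alphaI1)                               # (4.11), (4.13)

    common = 0.5 * (1.0 + math.cos(theta)) * w0 * HD
    p1 = common * (alpha1 + alpha4 * math.cos(theta) ** 2)                # (4.4)
    p2 = alpha2 * p1                                                      # (4.5)
    p3 = alpha3 * p1                                                      # (4.6)
    pu = common * alpha1 * alpha2                                         # (4.7)
    return p1, p2, p3, pu


# Hypothetical inputs, in consistent units.
p1, p2, p3, pu = breakwater_pressures(theta=0.0, w0=10.0, HD=5.0, h=20.0, L=100.0,
                                      d=15.0, h_prime=17.0, hc=4.0, Bm=10.0)
```

The branches in `alphaI0`, `alphaI1`, `delta1`, and `delta2` are the change points where the first derivatives are discontinuous. Dividing each output by `w0 * HD` gives quantities that do not change when all lengths are rescaled by a common factor.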


Table 3: Input Variables for the Vertical Breakwater Problem.

| Variable | Description |
|---|---|
| $\theta$ | incidence angle of waves |
| $w_0$ | water unit weight |
| $H_D$ | design wave height |
| $h$ | water depth |
| $L$ | wave length |
| $d$ | submerged depth of the vertical breakwater |
| $h'$ | submerged depth of the vertical breakwater, including width of the armor layer |
| $h_c$ | free depth of the concrete block on the lee side |
| $B_m$ | berm sea side length |

As can be observed, these formulas are very complicated and have continuity problems in the first derivatives because they use different expressions in different intervals that do not have the desired regularity properties at the change points. As a consequence, some optimization program solvers, such as GAMS, fail to obtain the optimal solution in some cases. Therefore, more complex solvers, which are more costly in terms of computational power and time, are needed. To solve this problem, the idea is to generate a random set of input variables, calculate the corresponding outputs using the above formulas, and then use these data to train a network that may solve the problem satisfactorily. This application is very important from a practical point of view because it means that an approximation can be obtained for the formulas in equations 4.4 to 4.19 without continuity or regularity problems. Therefore, the optimization problem can be stated, and the optimal solution can be obtained without any special computational requirement.

Almost all the variables involved in the breakwater problem can be represented as powers of fundamental magnitudes such as length and mass. Therefore, the breakwater problem can be simplified by the application of dimensional analysis, specifically, its main theorem: the Π-theorem (Bridgman, 1922; Buckingham, 1914, 1915). This theorem states that any physical relation in terms of n variables can be rewritten as a new relation involving r fewer variables. When this theorem is applied, the input and output variables are transformed into dimensionless monomials. Different transformations are possible, but using engineering knowledge on breakwaters allows determining, without knowledge of formulas 4.4 to 4.19, that the dimensionless output variables

$$
\frac{p_1}{w_0 H_D}, \quad \frac{p_2}{w_0 H_D}, \quad \frac{p_3}{w_0 H_D}, \quad \frac{p_u}{w_0 H_D}
\tag{4.20}
$$

can be written in terms of the set of dimensionless input variables

$$
\theta, \quad \frac{h}{L}, \quad \frac{d}{h}, \quad \frac{H_D}{d}, \quad \frac{h'}{h}, \quad \frac{h_c}{H_D}, \quad \frac{B_m}{L}.
\tag{4.21}
$$

Table 4: Monomials Required for Estimating the Variables $p_1$, $p_2$, $p_3$, and $p_u$.

| Output/Input | $h/L$ | $h'/h$ | $\theta$ | $d/h$ | $H_D/d$ | $B_M/L$ | $h_c/H_D$ |
|---|---|---|---|---|---|---|---|
| $p_1/(w_0 H_D)$ | ✓ | | ✓ | ✓ | ✓ | ✓ | |
| $p_2/(w_0 H_D)$ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| $p_3/(w_0 H_D)$ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ |
| $p_u/(w_0 H_D)$ | ✓ | ✓ | ✓ | | | | |

Not all the dimensionless monomials in equation 4.21 are needed for estimating the dimensionless outputs (see equations 4.4 to 4.19). Table 4 summarizes the required ones. In order to choose the input data points to learn these functions, several different criteria can be used; knowledge about their ranges and statistical distribution is required. (For some interesting work on this and related problems, readers are referred to Sacks et al., 1989; Koehler & Owen, 1996; and Currin et al., 1991.) Knowledge about ranges is easy to obtain from experienced engineers. However, given the lack of knowledge about the statistical distribution of the input data values in existing breakwaters, a set S of 2000 input data (1600 for training and 400 for testing) was generated randomly, assuming that each dimensionless variable in equation 4.21 is U(0, 1), that is, normalized in the interval (0, 1) in order to apply the methodology proposed. This is a reasonable assumption covering the actual ranges of the variables involved. Note that the ranges correspond to nondimensional variables.

To show the performance of the methodology proposed (AFN), all of the dimensionless inputs will be considered for estimation of the desired dimensionless outputs. It will be demonstrated that only the relevant input interactions appear in the learned models. In this way, we illustrate how one can use data from a given problem to obtain appropriate knowledge that can assist in deriving a topology for functional networks. To our knowledge, there is no other efficient method for deriving functional network topologies from data. Twenty replications with different starting values for the parameters were performed for each case studied. This example focuses on the estimation of $p_u/(w_0 H_D)$, but $p_1/(w_0 H_D)$, $p_2/(w_0 H_D)$, and $p_3/(w_0 H_D)$ were also learned in a similar way, and the most significant results are presented. From this point, we will refer to the dimensionless outputs as $p_1$, $p_2$, $p_3$, and $p_u$.
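The sampling scheme described above can be sketched as follows: 2000 points with seven dimensionless inputs, each drawn independently from U(0, 1), split into 1600 training and 400 testing points. The seed, the input labels, and the function name `make_dataset` are assumptions made for illustration; the outputs would then be computed from the formulas in equations 4.4 to 4.19 after mapping each normalized input back to its physical range, which is not shown here.

```python
import random

# Labels for the seven dimensionless inputs of equation 4.21 (hypothetical names).
INPUTS = ["theta", "h/L", "d/h", "HD/d", "h'/h", "hc/HD", "Bm/L"]


def make_dataset(n_samples=2000, n_train=1600, seed=0):
    """Draw every input independently from U(0, 1) and split into train and test."""
    rng = random.Random(seed)
    samples = [[rng.random() for _ in INPUTS] for _ in range(n_samples)]
    return samples[:n_train], samples[n_train:]


train, test = make_dataset()
```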


Table 5: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Data Set.

| Degrees | MSEtest | (STD) | Min | Max |
|---|---|---|---|---|
| 2, 1, 1, 1, 1 | 2.3900 × 10^-1 | (3.8184 × 10^-2) | 1.9346 × 10^-1 | 3.2621 × 10^-1 |
| 2, 2, 1, 1, 1 | 7.4135 × 10^-2 | (4.8200 × 10^-3) | 6.5495 × 10^-2 | 8.4480 × 10^-2 |
| 2, 3, 1, 1, 1 | 6.6946 × 10^-2 | (4.0007 × 10^-3) | 5.9426 × 10^-2 | 7.6709 × 10^-2 |
| 2, 4, 1, 1, 1 | 6.7852 × 10^-2 | (4.0342 × 10^-3) | 5.9443 × 10^-2 | 7.7210 × 10^-2 |
| 3, 1, 1, 1, 1 | 9.4133 × 10^-1 | (5.2477 × 10^-2) | 8.3121 × 10^-1 | 1.0367 |
| 3, 3, 3, 1, 1 | 9.8419 × 10^-1 | (5.7424 × 10^-2) | 8.6991 × 10^-1 | 1.0875 |
| 3, 4, 1, 1, 1 | 9.5612 × 10^-1 | (6.0454 × 10^-2) | 8.3115 × 10^-1 | 1.0819 |
| 3, 6, 9, 1, 1 | 9.6607 × 10^-1 | (6.6378 × 10^-2) | 8.6482 × 10^-1 | 1.1104 |
| 2, 4, 3, 4, 5 | 7.4111 × 10^-2 | (6.4231 × 10^-3) | 6.1047 × 10^-2 | 8.5550 × 10^-2 |
| 3, 6, 3, 4, 5 | 0.1044 × 10^1 | (6.9930 × 10^-2) | 9.2294 × 10^-1 | 1.1540 |
| 2, 3, 1 | 6.8237 × 10^-2 | (4.2030 × 10^-3) | 5.8822 × 10^-2 | 7.4263 × 10^-2 |

Notes: Results for the estimations of $p_u$ using the proposed methodology. The last row shows the results obtained when the AFN method was applied and a simplified topology is obtained.

The complexity of the proposed method grows exponentially with the number of inputs. To estimate $p_u$, seven input variables are given; therefore, a complex model results if all levels of interaction are considered. A simpler model was therefore chosen in the first instance, and its complexity was progressively increased. This model is obtained by considering a reduced number of univariate functions and limiting the multivariate functions obtained from their corresponding tensor products. As in the previous example, the polynomial family was selected for estimating the desired output. The first column in Table 5 shows the different degrees taken into account; the first number is the degree for the univariate functions, the second one is the degree for the bivariate functions, and so on. Note that functions of six and seven arguments were not included because there is no improvement in the performance results when more relations are considered. The global sensitivity indices related to the best performance results and the corresponding total sensitivity indices are shown in Tables 6 and 7, respectively. To be concise, all the monomials or relations between them with index values under 0.0005 were removed from Table 6. It can be seen that the monomials with the highest sensitivity values are the three required (the first three rows in Table 6; see Table 4). Besides, the relation between the necessary variables $h/L$ and $h'/h$ (fourth row in Table 6) is the most important one, although it has a low value compared to the monomials individually. All other variables and relations have lower sensitivity values and thus were not considered. After removing the unimportant factors, the topology of the functional network is much simpler. It has only three univariate functions


Table 6: Global Sensitivity Indices for the Univariate and Bivariate Polynomials When Estimating $p_u$.

| Monomials | Rep. 1 | Rep. 2 | Rep. 3 | Rep. 4 | Rep. 5 | Mean |
|---|---|---|---|---|---|---|
| $h/L$ | 0.6644 | 0.6598 | 0.6544 | 0.6671 | 0.6594 | 0.6622 |
| $h'/h$ | 0.2680 | 0.2763 | 0.2785 | 0.2626 | 0.2729 | 0.2728 |
| $\theta$ | 0.0169 | 0.0197 | 0.0161 | 0.0179 | 0.0197 | 0.0172 |
| $h/L$, $h'/h$ | 0.0429 | 0.0362 | 0.0443 | 0.0440 | 0.0396 | 0.0392 |
| $h/L$, $\theta$ | 0.0033 | 0.0040 | 0.0029 | 0.0042 | 0.0046 | 0.0040 |
| $h/L$, $d/h$ | 0.0002 | 0.0002 | 0.0000 | 0.0005 | 0.0002 | 0.0002 |
| $h/L$, $H_D/d$ | 0.0002 | 0.0002 | 0.0005 | 0.0001 | 0.0005 | 0.0002 |
| $h'/h$, $\theta$ | 0.0022 | 0.0010 | 0.0013 | 0.0011 | 0.0007 | 0.0015 |
| $h'/h$, $d/h$ | 0.0002 | 0.0006 | 0.0001 | 0.0000 | 0.0001 | 0.0002 |

Notes: Results obtained by the first five replications and mean of the 20 replications. Rows with values under 0.0005 are not included. The first three rows are the indices for the individual monomials; the remaining rows are the global sensitivity indices for the relations between monomials.

Table 7: Total Sensitivity Indices When Estimating $p_u$.

| Variables | Rep. 1 | Rep. 2 | Rep. 3 | Rep. 4 | Rep. 5 | Mean |
|---|---|---|---|---|---|---|
| $h/L$ | 0.7114 | 0.7007 | 0.7024 | 0.7163 | 0.7044 | 0.7061 |
| $h'/h$ | 0.3135 | 0.3143 | 0.3243 | 0.3080 | 0.3138 | 0.3141 |
| $\theta$ | 0.0228 | 0.0254 | 0.0207 | 0.0238 | 0.0255 | 0.0234 |
| $d/h$ | 0.0007 | 0.0011 | 0.0007 | 0.0011 | 0.0008 | 0.0010 |
| $H_D/d$ | 0.0007 | 0.0012 | 0.0008 | 0.0005 | 0.0012 | 0.0010 |
| $B_M/L$ | 0.0007 | 0.0007 | 0.0009 | 0.0012 | 0.0010 | 0.0010 |
| $h_c/H_D$ | 0.0004 | 0.0004 | 0.0008 | 0.0011 | 0.0009 | 0.0008 |

Note: Results obtained by the first 5 replications and mean of the 20 replications.

(second-degree polynomials) and a bivariate function relating the variables $h/L$ and $h'/h$. However, the testing results after training this functional network are very similar to those obtained previously, as expected, because only the nonrelevant factors were removed. These results are presented in the last row of Table 5 to facilitate comparison. Note that the number of parameters is drastically reduced. In fact, there are 9 parameters when only the relevant dimensionless monomials are employed (3 univariate functions with 2 parameters each and 1 bivariate function with 3 parameters), whereas 77 parameters are used when all the dimensionless monomials are taken into account with the same polynomial degrees (7 univariate functions with 2 parameters each and 21 bivariate functions with 3 parameters each).
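The parameter counts quoted above follow from simple arithmetic, reproduced in the short sketch below. The function name and its defaults are hypothetical; the per-function parameter counts (2 per univariate function, 3 per bivariate function) are the ones stated in the text.

```python
from math import comb


def n_parameters(n_univariate, n_bivariate, per_univariate=2, per_bivariate=3):
    """Count coefficients for a model with univariate and bivariate components only."""
    return n_univariate * per_univariate + n_bivariate * per_bivariate


# All seven monomials: 7 univariate functions and one bivariate function per pair.
full = n_parameters(7, comb(7, 2))   # 7*2 + 21*3 = 77
# Simplified topology: 3 univariate functions and 1 bivariate function.
reduced = n_parameters(3, 1)         # 3*2 + 1*3 = 9
```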


Table 8: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating $p_1$, $p_2$, and $p_3$.

| Output | Method | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|
| $p_1$ | AFN | 1.8693 × 10^-1 | (2.6462 × 10^-2) | 1.4714 × 10^-1 | 2.4397 × 10^-1 |
| $p_1$ | FN | 1.9795 × 10^-1 | (2.6624 × 10^-2) | 1.5455 × 10^-1 | 2.4882 × 10^-1 |
| $p_2$ | AFN | 7.7311 × 10^-2 | (7.0881 × 10^-3) | 6.5819 × 10^-2 | 9.0760 × 10^-2 |
| $p_2$ | FN | 9.1324 × 10^-2 | (1.0700 × 10^-2) | 7.3875 × 10^-2 | 1.1522 × 10^-1 |
| $p_3$ | AFN | 7.7058 × 10^-2 | (1.1104 × 10^-2) | 6.2961 × 10^-2 | 1.0444 × 10^-1 |
| $p_3$ | FN | 1.0552 × 10^-1 | (1.7653 × 10^-2) | 7.7863 × 10^-2 | 1.3590 × 10^-1 |

Table 9: Total Sensitivity Indices for the Best Approximation of $p_1$, $p_2$, and $p_3$.

| Input/Output | $p_1$ | $p_2$ | $p_3$ |
|---|---|---|---|
| $h/L$ | 0.550433 | 0.652522 | 0.183044 |
| $h'/h$ | 0.003544 | 0.323328 | 0.001595 |
| $\theta$ | 0.220720 | 0.040430 | 0.129206 |
| $d/h$ | 0.107570 | 0.016706 | 0.035828 |
| $H_D/d$ | 0.124541 | 0.019912 | 0.041625 |
| $B_M/L$ | 0.083042 | 0.013103 | 0.027788 |
| $h_c/H_D$ | 0.003075 | 0.001099 | 0.636814 |
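Read as a ranking of the inputs, Table 9 can be turned into a variable selection by applying a cutoff to the total sensitivity indices. The sketch below assumes the 1% threshold that the letter adopts for its preliminary feature selection experiments; the dictionary layout, the input labels, and the function name `select` are hypothetical.

```python
INPUTS = ["h/L", "h'/h", "theta", "d/h", "HD/d", "BM/L", "hc/HD"]

# Total sensitivity indices copied from Table 9, one list per output.
TSI = {
    "p1": [0.550433, 0.003544, 0.220720, 0.107570, 0.124541, 0.083042, 0.003075],
    "p2": [0.652522, 0.323328, 0.040430, 0.016706, 0.019912, 0.013103, 0.001099],
    "p3": [0.183044, 0.001595, 0.129206, 0.035828, 0.041625, 0.027788, 0.636814],
}


def select(tsi, threshold=0.01):
    """Keep the inputs whose total sensitivity index exceeds the threshold."""
    return [name for name, value in zip(INPUTS, tsi) if value > threshold]


selected = {output: select(values) for output, values in TSI.items()}
for output, names in selected.items():
    print(output, names)
```

With this cutoff the discarded inputs are exactly those left unchecked in Table 4 for each output.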

Similar experiments were carried out for the other output variables ( p1 , p2 , and p3 ). However, for simplicity and clarity, only the most significant results are included. The best performance results obtained applying the proposed methodology (AFN) are shown in Tables 8 and 9, indicating the total sensitivity index for each input variable. The variables with the lowest values are those not checked in Table 4, and thus the proposed method is also able to discard the irrelevant variables in these cases. Again, considering the topology derived from the application of the method proposed, a functional network is trained for each output, and its performance results are also included in Table 8 (rows entitled FN). Again, it is important to remark that the simplification of the topology means a significant reduction in the number of parameters while maintaining the performance of the approach. By applying the proposed method, interactions between variables are removed and irrelevant variables are found. This suggests that the AFN method can be applied as a feature selection method, that is, a method that reduces the number of original features or variables by selecting a subset of them. The advantages of reducing the number of inputs have been extensively discussed in the machine learning literature (Kohavi & John, 1997; Guyon & Elisseeff, 2003). Two of the most important ones are that


Table 10: Mean, Standard Deviation, and Minimum and Maximum Values for the Normalized MSE for the Test Set when Estimating $p_u$ Considering All the Inputs and Only the Three Relevant Ones.

| Inputs | Neurons | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|
| Seven variables | 5 | 1.9880 × 10^-3 | (1.2739 × 10^-3) | 1.7747 × 10^-4 | 4.8721 × 10^-3 |
| Seven variables | 10 | 4.8390 × 10^-4 | (1.0625 × 10^-3) | 1.7887 × 10^-6 | 3.7790 × 10^-3 |
| Seven variables | 15 | 2.5818 × 10^-4 | (7.5651 × 10^-4) | 1.0661 × 10^-6 | 2.6543 × 10^-3 |
| Seven variables | 20 | 1.5951 × 10^-6 | (2.2556 × 10^-6) | 2.5091 × 10^-7 | 1.0151 × 10^-5 |
| Seven variables | 25 | 1.4011 × 10^-6 | (2.6363 × 10^-6) | 8.3305 × 10^-8 | 1.1963 × 10^-5 |
| Seven variables | 27 | 8.0122 × 10^-7 | (8.1440 × 10^-7) | 1.4289 × 10^-7 | 3.7679 × 10^-6 |
| Seven variables | 30 | 2.5952 × 10^-6 | (6.3535 × 10^-6) | 7.0063 × 10^-8 | 2.4735 × 10^-5 |
| Three relevant variables | 5 | 1.4199 × 10^-3 | (9.9263 × 10^-4) | 1.5737 × 10^-4 | 2.8286 × 10^-3 |
| Three relevant variables | 10 | 2.4766 × 10^-5 | (6.7618 × 10^-5) | 1.8441 × 10^-6 | 3.0382 × 10^-4 |
| Three relevant variables | 15 | 1.1263 × 10^-6 | (9.7697 × 10^-7) | 1.5265 × 10^-7 | 3.8548 × 10^-6 |
| Three relevant variables | 20 | 2.1324 × 10^-7 | (2.7517 × 10^-7) | 3.4903 × 10^-8 | 1.2734 × 10^-6 |
| Three relevant variables | 25 | 1.1236 × 10^-7 | (1.0665 × 10^-7) | 1.8267 × 10^-8 | 4.3665 × 10^-7 |
| Three relevant variables | 27 | 7.2760 × 10^-8 | (4.1195 × 10^-8) | 1.0599 × 10^-8 | 1.9125 × 10^-7 |
| Three relevant variables | 30 | 8.8819 × 10^-8 | (7.7426 × 10^-8) | 1.8998 × 10^-8 | 2.6956 × 10^-7 |

Note: Neurons refers to the number of neurons in the hidden layer.

it allows reducing computational complexity, and it improves performance results. As a feature selection method, the AFN method should indicate how to choose the relevant variables. This information is provided by the total sensitivity indices (TSI), which rank the variables by their contribution to the variance. In addition, a threshold is required to determine the variables to be selected, so that variables with TSIs under the threshold are discarded. Establishing this threshold is not a trivial issue and will be discussed in a further study. As a preliminary approach, the AFN method was used as a feature selection method for the breakwater problem. The threshold was established at 1%, that is, variables with TSI over 1% were chosen (see Tables 7 and 9). Then multilayer perceptrons (MLP) were used to solve the breakwater problem in both cases (using all the given inputs and only the inputs selected by the AFN method). The hyperbolic tangent was employed as the transfer function and the MSE as the performance function. The Levenberg-Marquardt learning algorithm was used to train the network (Levenberg, 1944; Marquardt, 1963). Two thousand epochs were employed because they were enough for the MLP to converge. The results are very representative in the case of $p_u$, because the number of inputs is drastically reduced from 7 to 3. Table 10 shows these results. As can be seen, the reduction in the number of inputs leads to better performance when the same number of neurons in the hidden layer is used,


Table 11: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating $p_1$, $p_2$, and $p_3$, Considering All the Inputs (All) and Only the Relevant (Rel.) Ones.

| Output | N (a) | Var. (b) | Mean NMSE | (STD) | Min | Max |
|---|---|---|---|---|---|---|
| $p_1$ | 35 | All | 1.4148 × 10^-2 | (2.1045 × 10^-2) | 2.8752 × 10^-3 | 9.8051 × 10^-2 |
| $p_1$ | 37 | Rel. | 3.6129 × 10^-3 | (9.4444 × 10^-4) | 2.2487 × 10^-3 | 5.0922 × 10^-3 |
| $p_2$ | 21 | All | 1.0843 × 10^-2 | (8.4366 × 10^-3) | 4.1416 × 10^-3 | 3.6060 × 10^-2 |
| $p_2$ | 22 | Rel. | 7.3305 × 10^-3 | (4.7671 × 10^-3) | 2.6945 × 10^-3 | 2.2399 × 10^-2 |
| $p_3$ | 37 | All | 6.3983 × 10^-3 | (4.9059 × 10^-3) | 2.4433 × 10^-3 | 2.2456 × 10^-2 |
| $p_3$ | 39 | Rel. | 3.9794 × 10^-3 | (2.0662 × 10^-3) | 2.0258 × 10^-3 | 9.1660 × 10^-3 |

Notes: (a) N refers to the number of neurons in the hidden layer. (b) Var. refers to the number of variables.

although it involves a smaller set of weights, that is, the network topology is simplified. The best results achieved when estimating $p_1$, $p_2$, and $p_3$ are shown in Table 11, although the variable reduction is not so important. Again, better performance results are achieved.

5 Conclusions and Future Work

In this article, a new methodology, the AFN method, based on ANOVA decomposition and functional networks, was presented and described. This methodology permits a simplified topology for a functional network to be learned from the data available. As stated in section 1, to our knowledge, there is no other method that permits this to be done. In addition to the advantages inherited from the ANOVA decomposition (uniqueness and orthogonality), the proposed methodology has the following advantages:

• Global sensitivity indices can be obtained from the application of the AFN. These can be used to establish the relevance of each functional component; consequently, they determine the input variables and relations to be selected. If a particular variable has no influence (in isolation or related to others), it can be discarded as an input for the functional or neural network.

• It allows learning and simplifying the topology of a functional or neural network. If the variables required for estimating a specific function are important, then the topology of the functional network should include these relationships.

• All existing multivariate interactions among variables are identified by the global sensitivity indices.

• Local sensitivity indices. Although the letter was focused on the global sensitivity indices for deriving the functional network topology, it is important to note that the proposed methodology also provides local sensitivity indices. These indices could be used to detect outliers in the samples or to make a selective sampling (this will be a future line of research for us).

• Several alternatives for selecting the orthonormal basis functions for the proposed approximation are available. Moreover, a new one has been presented here. As the orthonormalization decomposition is easily accomplished by one of these alternatives, the application of the AFN would require only a minimization problem to be solved.

The suitability of the proposed methodology was illustrated by its application to a real engineering problem: the design of a vertical breakwater. The performance results obtained were compared to those obtained using functional and neural networks to solve the same problem. It was demonstrated that although the AFN obtains similar performance results, it also returns a set of sensitivity indices that permits the initial topologies to be simplified (functional networks) or some of the input variables to be eliminated (neural networks). In view of the results obtained, future work will involve adapting the proposed methodology for use as a feature subset selection method. A detailed study will be carried out comparing the proposed methodology with the existing feature selection methods. Acknowledgments We are indebted to the Spanish Ministry of Science and Technology with FEDER Funds (Projects DPI2002-04172-C04-02 and TIC-2003-00600), to the Xunta de Galicia (Project PGIDIT04-PXIC10502), and to Iberdrola for partial support of this work. References Acz´el, J. (1966). Lectures on functional equations and their applications. New York: Academic Press. Bridgman, P. (1922). Dimensional analysis. New Haven, CT: Yale University Press. Brooke, A., Kendrik, D., Meeraus, A., & Raman, R. (1998). GAMS: A user’s guide. Washington, DC: Gams Development Corporation. Buckingham, E. (1914). On physically similar systems: Illustrations of dimensional equations. Phys. Rev., 4, 345–376. Buckingham, E. (1915). Model experiments and the form of empirical equations. Trans. ASME, 37, 263. Castillo, E. (1998). Functional networks. Neural Processing Letters, 7, 151–159.

256

˜ A. Alonso-Betanzos, and C. Castillo E. Castillo, N. S´anchez-Marono,

Castillo, E., Cobo, A., Guti´errez, J. M., & Pruneda, E. (1998). Functional networks with applications. Boston: Kluwer. Castillo, E., Cobo, A., Guti´errez, J. M., & Pruneda, R. E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modeling, 23, 89–107. Castillo, E., Cobo, A., Guti´errez, J. M., & Pruneda, R. E. (2000). Functional networks: A new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15, 90–106. Castillo, E., & Guti´errez, J. M. (1998). Nonlinear time series modeling and prediction using functional networks. extracting information masked by chaos. Physics Letters A, 244, 71–84. Castillo, E., Guti´errez, J., & Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26, 412–423. Castillo, E., Iglesias, A., & Ru´ız-Cobo, R. (2004). Functional equations in applied sciences. New York: Elsevier. Castillo, E., & Ru´ız-Cobo, R. (1992). Functional equations in science and engineering. New York: Marcel Dekker. Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York: Wiley. Currin, C., Mitchell, T., Morris, M., & Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86, 953–963. Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596. Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society, 52, 399–433. Goda, Y. (1972). Laboratory investigation of wave pressure exerted upon vertical and composite walls. Coastal Engineering, 15, 81–90. Guyon, I., & Elisseef, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182. Hadi, A., & Nyquist, H. (2002). 
Sensitivity analysis in statistics. Journal of Statistical Studies: A Special Volume in Honor of Professor Mir Masoom. Ali’s 65th Birthday, 125–138. Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325. Jiang, T., & Owen, A. (2001). Quasi-regression with shrinkage. Mathematics and Computers in Simulation, 62, 231–241. Koehler, J., & Owen, A. (1996). Computer experiments: Design and analysis of experiments. In S. Ghosh & C. R. Rao (Eds.), Handbook of statistics, 13 (pp. 261–308). New York: Elsevier Science. Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324. Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quartely Journal of Applied Mathematics, 2(2), 164–168. Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431– 441.


Sacks, J., Mitchell, W. W. T., & Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4(4), 409–435. Saltelli, A., Chan, K., & Scott, M. (2000). Sensitivity analysis. New York: Wiley. Sobol, I. M. (1969). Multidimensional quadrature formulas and Haar functions. Moscow: Nauka. (in Russian) Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271–280. Takahashi, S. (1996). Design of vertical breakwaters (Techn. Rep. No. 34). Yokosuka, Japan: Port and Harbour Research Institute. Ministry of Transport.

Received August 12, 2004; accepted June 17, 2006.

LETTER

Communicated by Erin Bredensteiner

Second-Order Cone Programming Formulations for Robust Multiclass Classification

Ping Zhong [email protected] College of Science, China Agricultural University, Beijing, 100083, China

Masao Fukushima [email protected] Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan

Multiclass classification is an important and ongoing research subject in machine learning. Current support vector methods for multiclass classification implicitly assume that the parameters in the optimization problems are known exactly. However, in practice, the parameters have perturbations since they are estimated from the training data, which are usually subject to measurement noise. In this article, we propose linear and nonlinear robust formulations for multiclass classification based on the M-SVM method. The preliminary numerical experiments confirm the robustness of the proposed method.

Neural Computation 19, 258–282 (2007). © 2006 Massachusetts Institute of Technology

1 Introduction

Given $L$ labeled examples known to come from $K$ ($> 2$) classes, $T = \{(x_p, \theta_p)\}_{p=1}^{L} \subset X \times Y$, where $X \subset \mathbb{R}^N$ and $Y = \{1, \ldots, K\}$, multiclass classification refers to the construction of a discriminant function from the input space $X$ onto the unordered set of classes $Y$.

Support vector machines (SVMs) serve as a useful and popular tool for classification. Recent developments in the study of SVMs show that there are roughly two types of approaches to tackle the multiclass classification problem. One is to construct and fuse several binary classifiers, such as "one-against-all" (Bottou et al., 1994; Vapnik, 1998), "one-against-one" (Hastie & Tibshirani, 1998; Kressel, 1999), directed acyclic graph SVM (DAGSVM; Platt, Cristianini, & Shawe-Taylor, 2000), error-correcting output code (ECOC; Allwein, Schapire, & Singer, 2001; Dietterich & Bakiri, 1995), the K-SVCR method (Angulo, Parra, & Català, 2003), and the ν-K-SVCR method (Zhong & Fukushima, 2006), among others. The other, called "all-together," is to consider all data in one optimization formulation (Bennett & Mangasarian, 1994; Bredensteiner & Bennett, 1999; Guermeur, 2002; Vapnik, 1998; Weston & Watkins, 1998; Yajima, 2005). In this letter, we focus on the second approach.

There are several all-together methods. The method independently proposed by Vapnik (1998) and Weston and Watkins (1998) is similar to one-against-all. It constructs $K$ two-class discriminants where each discriminant separates a single class from all the others. Hence, there are $K$ decision functions, but all are obtained by solving one optimization problem. Bennett and Mangasarian (1994) constructed a piecewise-linear discriminant for the $K$-class classification by a single linear program. The method called M-SVM (Bredensteiner & Bennett, 1999) extends their method to generate a kernel-based nonlinear $K$-class discriminant by solving a convex quadratic program. Although the original forms proposed by Vapnik (1998), Weston and Watkins (1998), and Bredensteiner and Bennett (1999) are different, they are not only equivalent to each other, but also equivalent to that proposed by Guermeur (2002). Based on M-SVM, linear programming formulations in a low-dimensional feature subspace have also been proposed (Yajima, 2005).

In the methods noted, the parameters in the optimization problems are implicitly assumed to be known exactly. However, in practice, these parameters have perturbations since they are estimated from the training data, which are usually corrupted by measurement noise. As pointed out by Goldfarb and Iyengar (2003), the solutions to the optimization problems are sensitive to parameter perturbations. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. So it will be useful to explore formulations that can yield discriminants robust to such estimation errors.
In this article, we propose a robust formulation of M-SVM, which is represented as a second-order cone program (SOCP). The second-order cone (SOC) in $\mathbb{R}^n$ ($n \ge 1$), also called the Lorentz cone, is the convex cone defined by

$$\mathcal{K}^n = \left\{ \begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} : z_0 \in \mathbb{R},\ \bar{z} \in \mathbb{R}^{n-1},\ \|\bar{z}\| \le z_0 \right\},$$

where $\|\cdot\|$ denotes the Euclidean norm. The SOCP is a special class of convex optimization problems involving SOC constraints, which can be efficiently solved by interior point methods. The work related to SOCP can be seen, for example, in Alizadeh and Goldfarb (2003), Fukushima, Luo, and Tseng (2002), Hayashi, Yamashita, and Fukushima (2005), and Lobo, Vandenberghe, Boyd, and Lébret (1998).

The letter is organized as follows. We first propose a robust formulation for piecewise-linear M-SVM in section 2 and then construct a robust classifier based on the dual SOCP formulation in section 3. In section 4, we extend the robust classifier to the piecewise-nonlinear M-SVM case. Section 5 gives numerical results. Section 6 concludes the letter.

2 Robust Piecewise-Linear M-SVM Formulation

For each $i$, let $\mathcal{A}^i$ be a set of examples in the $N$-dimensional real space $\mathbb{R}^N$ with cardinality $l_i$. Let $A^i$ be an $l_i \times N$ matrix whose rows are the examples in $\mathcal{A}^i$. The $p$th example in $\mathcal{A}^i$ and the $p$th row of $A^i$ are both denoted $A^i_p$. Let $e^i$ denote the vector of ones of dimension $l_i$. For each $i$, let $w^i$ be a vector in $\mathbb{R}^N$ and $b^i$ be a real number. The sets $\mathcal{A}^i$, $i = 1, \ldots, K$, are called piecewise-linearly separable (Bredensteiner & Bennett, 1999) if there exist $w^i$ and $b^i$, $i = 1, \ldots, K$, such that

$$A^i w^i - b^i e^i > A^i w^j - b^j e^i, \quad i, j = 1, \ldots, K, \quad i \ne j.$$
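The separability condition above can be checked numerically for a candidate set of $(w^i, b^i)$. The following is a minimal sketch, assuming the classes are given as a list of NumPy arrays; the function name `is_piecewise_linearly_separable` and the toy data are hypothetical and not part of the original letter.

```python
import numpy as np

def is_piecewise_linearly_separable(A_list, W, b):
    """Check A^i w^i - b^i e^i > A^i w^j - b^j e^i for all i != j.

    A_list: list of K arrays; A_list[i] has shape (l_i, N), one example per row
    W:      array of shape (K, N); row i is w^i
    b:      array of shape (K,); entry i is b^i
    """
    K = len(A_list)
    for i in range(K):
        scores = A_list[i] @ W.T - b           # column j holds A^i w^j - b^j
        own = scores[:, [i]]                   # column i, kept 2-D for broadcasting
        others = np.delete(scores, i, axis=1)  # all columns j != i
        if not np.all(own > others):
            return False
    return True

# Hypothetical toy data: three well-separated clusters in the plane.
A_list = [np.array([[2.0, 0.0], [3.0, 0.5]]),
          np.array([[-2.0, 0.0], [-3.0, -0.5]]),
          np.array([[0.0, 3.0], [0.5, 4.0]])]
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
b = np.zeros(3)
separable = is_piecewise_linearly_separable(A_list, W, b)
```

The check only verifies a given candidate; finding such $(w^i, b^i)$ is what the optimization problems in this section do.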

Piecewise-linear M-SVM can be formulated as follows (Bredensteiner & Bennett, 1999):

$$\begin{aligned}
\min_{w,\, b,\, y} \quad & \nu \left[ \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{i-1} \|w^i - w^j\|^2 + \frac{1}{2} \sum_{i=1}^{K} \|w^i\|^2 \right] + (1 - \nu) \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} (e^i)^T y^{ij} \\
\text{s.t.} \quad & A^i (w^i - w^j) - (b^i - b^j) e^i + y^{ij} \ge e^i, \quad y^{ij} \ge 0, \quad i, j = 1, \ldots, K, \quad i \ne j,
\end{aligned} \tag{2.1}$$

where $\nu \in (0, 1]$,

$$w = \left[ (w^1)^T, (w^2)^T, \ldots, (w^K)^T \right]^T \in \mathbb{R}^{KN}, \tag{2.2}$$

$$b = \left[ b^1, b^2, \ldots, b^K \right]^T \in \mathbb{R}^{K}, \tag{2.3}$$

$$y = \left[ (y^{12})^T, \ldots, (y^{1K})^T, \ldots, (y^{K1})^T, \ldots, (y^{K(K-1)})^T \right]^T \in \mathbb{R}^{L(K-1)}, \tag{2.4}$$

and $L = \sum_{i=1}^{K} l_i$. When $\nu = 1$, equation 2.1 is the formulation for the piecewise-linearly separable case. Otherwise, it is the formulation for the piecewise-linearly inseparable case. Figure 1 shows an example of a piecewise-linearly separable M-SVM for three classes in two dimensions.

The training data $A^i$, $i = 1, \ldots, K$, used in problem 2.1, are implicitly assumed to be known exactly. However, in practice, training data are often corrupted by measurement noise. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. For example, suppose each example in Figure 1 is allowed to move in a sphere (see Figure 2). The original discriminants cannot separate the training data

Figure 1: Three classes separated by piecewise-linear M-SVM discriminants.

Figure 2: An example of the effect of measurement noises.

sets in the worst case. It will be useful to explore formulations that can yield discriminants robust to such estimation errors. In the following, we discuss such a formulation. We assume

$$\hat{A}^i_p = A^i_p + \rho^i_p (a^i_p)^T, \tag{2.5}$$

where $\hat{A}^i_p$ is the actual value of the training data and $\rho^i_p (a^i_p)^T$ is the measurement noise, with $a^i_p \in \mathbb{R}^N$, $\|a^i_p\| = 1$, and $\rho^i_p \ge 0$ being a given constant. Denote the unit sphere in $\mathbb{R}^N$ by $U = \{a \in \mathbb{R}^N : \|a\| = 1\}$. The robust version of formulation 2.1 can be stated as follows:

$$\begin{aligned}
\min_{w,\, b,\, y} \quad & \nu \left[ \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{i-1} \|w^i - w^j\|^2 + \frac{1}{2} \sum_{i=1}^{K} \|w^i\|^2 \right] + (1 - \nu) \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} (e^i)^T y^{ij} \\
\text{s.t.} \quad & A^i_p (w^i - w^j) + \rho^i_p (a^i_p)^T (w^i - w^j) - (b^i - b^j) + y^{ij}_p \ge 1, \quad \forall\, a^i_p \in U, \\
& y^{ij}_p \ge 0, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j.
\end{aligned} \tag{2.6}$$

Since $\min\{\rho^i_p (a^i_p)^T (w^i - w^j) : a^i_p \in U\} = -\rho^i_p \|w^i - w^j\|$, problem 2.6 is equivalent to the following SOCP:

$$\begin{aligned}
\min_{w,\, b,\, y} \quad & \nu \left[ \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{i-1} \|w^i - w^j\|^2 + \frac{1}{2} \sum_{i=1}^{K} \|w^i\|^2 \right] + (1 - \nu) \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} (e^i)^T y^{ij} \\
\text{s.t.} \quad & A^i_p (w^i - w^j) - \rho^i_p \|w^i - w^j\| - (b^i - b^j) + y^{ij}_p \ge 1, \\
& y^{ij}_p \ge 0, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j.
\end{aligned} \tag{2.7}$$
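The step from the semi-infinite constraint in problem 2.6 to the norm constraint in problem 2.7 rests on the worst-case noise direction being $a = -(w^i - w^j)/\|w^i - w^j\|$. The sketch below illustrates this numerically; the vector `v` is a hypothetical stand-in for $w^i - w^j$, and the random sampling is only an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 5, 0.3
v = rng.normal(size=N)                 # stands in for w^i - w^j (hypothetical values)

# Closed-form minimizer on the unit sphere: a* = -v / ||v||.
a_star = -v / np.linalg.norm(v)
closed_form = -rho * np.linalg.norm(v)

# Random unit vectors never achieve a smaller value than the closed form.
samples = rng.normal(size=(10000, N))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
sampled_min = (rho * samples @ v).min()
```

Here `rho * a_star @ v` equals `closed_form`, while `sampled_min` stays at or above it, which is the content of the Cauchy-Schwarz inequality used in the text.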

Denote $Q = (K + 1) I_{KN} - \tilde{I}$, with $I_{KN}$ being the identity matrix of order $KN$ and

$$\tilde{I} = \begin{pmatrix}
I_N & I_N & \cdots & I_N \\
I_N & I_N & \cdots & I_N \\
\vdots & \vdots & \ddots & \vdots \\
I_N & I_N & \cdots & I_N
\end{pmatrix} \in \mathbb{R}^{KN \times KN}.$$

Denote

$$e = \Big[ \underbrace{(e^1)^T, \ldots, (e^1)^T}_{K-1}, \ldots, \underbrace{(e^K)^T, \ldots, (e^K)^T}_{K-1} \Big]^T \in \mathbb{R}^{L(K-1)}.$$

The objective function of problem 2.7 can then be expressed compactly as

$$\frac{\nu}{2} w^T Q w + (1 - \nu) e^T y. \tag{2.8}$$

Additionally, $Q$ is a symmetric positive definite matrix, which can be inferred from the following proposition. The proof of the proposition is omitted since it is similar to that given by Yajima (2005).

Proposition 1. Denote

$$C = \sqrt{K + 1}\, I_{KN} - \frac{\sqrt{K + 1} - 1}{K}\, \tilde{I}.$$

Then

1. $Q = C^2$.
2. $C$ is nonsingular, and

$$C^{-1} = \frac{1}{\sqrt{K + 1}}\, I_{KN} + \frac{\sqrt{K + 1} - 1}{K \sqrt{K + 1}}\, \tilde{I}.$$
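Proposition 1 and the compact objective 2.8 are easy to sanity-check numerically for small sizes. The following is a minimal sketch, assuming arbitrary small values of $K$ and $N$; the variable `J` denotes the $KN \times KN$ matrix whose $N \times N$ blocks are all $I_N$, and all names are illustrative only.

```python
import numpy as np

K, N = 4, 3
I = np.eye(K * N)
J = np.kron(np.ones((K, K)), np.eye(N))    # K x K grid of I_N blocks

Q = (K + 1) * I - J
r = np.sqrt(K + 1)
C = r * I - (r - 1) / K * J
C_inv = I / r + (r - 1) / (K * r) * J

# Compare the pairwise form of the objective with (1/2) w^T Q w
# for a random stacked vector w = ((w^1)^T, ..., (w^K)^T)^T.
rng = np.random.default_rng(1)
W = rng.normal(size=(K, N))                # row i is w^i
w = W.reshape(-1)
pairwise = sum(np.sum((W[i] - W[j]) ** 2) for i in range(K) for j in range(i))
lhs = 0.5 * pairwise + 0.5 * np.sum(W ** 2)
rhs = 0.5 * w @ Q @ w
```

With these definitions `C @ C` reproduces `Q`, `C @ C_inv` reproduces the identity, and `lhs` agrees with `rhs` up to rounding. The smallest eigenvalue of `Q` is 1, consistent with positive definiteness.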

Let $H^{ij}$ be the $KN \times N$ matrix with all blocks being $N \times N$ zero matrices except the $i$th block being $I_N$ and the $j$th block being $-I_N$:

$$H^{ij} = [O, \ldots, O, I_N, O, \ldots, O, -I_N, O, \ldots, O]^T. \tag{2.9}$$

Then, by equation 2.2 we get

$$w^i - w^j = (H^{ij})^T w. \tag{2.10}$$

Let $r^{ij}$ be the $K$-dimensional vector with all components being zero except the $i$th component being 1 and the $j$th component being $-1$:

$$r^{ij} = [0, \ldots, 0, 1, 0, \ldots, 0, -1, 0, \ldots, 0]^T. \tag{2.11}$$

Then by equation 2.3 we get

$$b^i - b^j = (r^{ij})^T b. \tag{2.12}$$

Let $h^{ij}_p$ be the $L(K - 1)$-dimensional vector with all components being zero except the $\big((K - 1) \sum_{k=1}^{i-1} l_k + (j - 1) l_i + p\big)$th component being 1:

$$h^{ij}_p = [0, \ldots, 0, \ldots, 0, 1, 0, \ldots, 0, \ldots, 0]^T. \tag{2.13}$$

Then by equation 2.4, we get

$$y^{ij}_p = (h^{ij}_p)^T y. \tag{2.14}$$

By equations 2.10, 2.12, and 2.14, the first constraint in problem 2.7 can be rewritten as follows:

$$\rho^i_p \|(H^{ij})^T w\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1. \tag{2.15}$$

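Therefore, by equations 2.8 and 2.15 and proposition 1, formulation 2.7 can be written as follows:

$$\begin{aligned}
\min_{w,\, b,\, y,\, t} \quad & \nu t + (1 - \nu) e^T y \\
\text{s.t.} \quad & \frac{1}{2} \|C w\|^2 \le t, \\
& \rho^i_p \|(H^{ij})^T w\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1, \\
& p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j, \quad y \ge 0.
\end{aligned} \tag{2.16}$$

Furthermore, formulation 2.16 can be cast as the following SOCP:

$$\begin{aligned}
\min_{w,\, b,\, y,\, t} \quad & \nu t + (1 - \nu) e^T y \\
\text{s.t.} \quad & \left\| \begin{pmatrix} \sqrt{2}\, C w \\ 1 - t \end{pmatrix} \right\| \le 1 + t, \\
& \rho^i_p \|(H^{ij})^T w\| \le A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1, \\
& p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j, \quad y \ge 0.
\end{aligned} \tag{2.17}$$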
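The passage from the quadratic constraint in 2.16 to the cone constraint in 2.17 follows from $(1 + t)^2 - (1 - t)^2 = 4t$. The sketch below tests that the two constraints accept and reject the same random points; the function names are hypothetical, and `Cw` simply stands for the vector $Cw$.

```python
import numpy as np

def in_quadratic_form(Cw, t):
    # (1/2) ||C w||^2 <= t, the first constraint of formulation 2.16
    return 0.5 * Cw @ Cw <= t

def in_soc_form(Cw, t):
    # || (sqrt(2) C w, 1 - t) || <= 1 + t, the first constraint of formulation 2.17
    z = np.append(np.sqrt(2.0) * Cw, 1.0 - t)
    return np.linalg.norm(z) <= 1.0 + t

rng = np.random.default_rng(2)
agree = True
for _ in range(1000):
    Cw = rng.normal(size=4)
    t = rng.uniform(-1.0, 6.0)
    agree = agree and bool(in_quadratic_form(Cw, t) == in_soc_form(Cw, t))
```

Squaring both sides of the cone constraint (valid because $1 + t \ge 0$ on the feasible set) gives $2\|Cw\|^2 + (1 - t)^2 \le (1 + t)^2$, which simplifies to the quadratic form.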

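3 Robust Piecewise-Linear M-SVM Classifier

In this section, we construct a robust piecewise-linear M-SVM classifier based on the dual formulation of problem 2.17.

3.1 Dual of the Robust Piecewise-Linear M-SVM Formulation. Denote

$$\bar{A} = \left[ B_1^T, B_2^T, \ldots, B_K^T \right]^T \in \mathbb{R}^{L(K-1) \times KN} \tag{3.1}$$

with

$$B_i = \begin{pmatrix}
-A^i & \cdots & O & A^i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -A^i & A^i & O & \cdots & O \\
O & \cdots & O & A^i & -A^i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & A^i & O & \cdots & -A^i
\end{pmatrix} \in \mathbb{R}^{l_i (K-1) \times KN},$$

where the blocks $A^i$ occupy the $i$th block column. Denote

$$\bar{H} = \left[ \bar{M}_1^T, \bar{M}_2^T, \ldots, \bar{M}_K^T \right]^T \in \mathbb{R}^{LN(K-1) \times KN}, \tag{3.2}$$

where

$$\bar{M}_i = \begin{pmatrix}
-M_i & \cdots & O & M_i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -M_i & M_i & O & \cdots & O \\
O & \cdots & O & M_i & -M_i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & M_i & O & \cdots & -M_i
\end{pmatrix} \in \mathbb{R}^{l_i N (K-1) \times KN},$$

with

$$M_i := M_i(\rho) = \left[ \rho^i_1 I_N, \ldots, \rho^i_{l_i} I_N \right]^T \in \mathbb{R}^{l_i N \times N}, \quad i = 1, \ldots, K.$$

Denote

$$\bar{E} = \left[ E_1^T, E_2^T, \ldots, E_K^T \right]^T \in \mathbb{R}^{L(K-1) \times K} \tag{3.3}$$

with

$$E_i = \begin{pmatrix}
-e^i & \cdots & 0 & e^i & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & -e^i & e^i & 0 & \cdots & 0 \\
0 & \cdots & 0 & e^i & -e^i & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & e^i & 0 & \cdots & -e^i
\end{pmatrix} \in \mathbb{R}^{l_i (K-1) \times K}.$$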

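We can derive the following dual of problem 2.17 (see appendix A):

$$\begin{aligned}
\max_{\alpha,\, s,\, \sigma,\, \tau} \quad & e^T \alpha - (\sigma + \tau) \\
\text{s.t.} \quad & \bar{E}^T \alpha = 0, \\
& \alpha \le (1 - \nu) e, \\
& \sigma - \tau = \nu, \\
& \left\| \begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}} (\bar{A}^T \alpha + \bar{H}^T s) \\ \tau \end{pmatrix} \right\| \le \sigma, \\
& \|s^{ij}_p\| \le \alpha^{ij}_p, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad j \ne i,
\end{aligned} \tag{3.4}$$

where

$$\alpha = \left[ (\alpha^{12})^T, \ldots, (\alpha^{1K})^T, \ldots, (\alpha^{K1})^T, \ldots, (\alpha^{K(K-1)})^T \right]^T \in \mathbb{R}^{L(K-1)}, \tag{3.5}$$

$$s = \left[ (s^{12}_1)^T, \ldots, (s^{12}_{l_1})^T, \ldots, (s^{1K}_1)^T, \ldots, (s^{1K}_{l_1})^T, \ldots, (s^{K(K-1)}_1)^T, \ldots, (s^{K(K-1)}_{l_K})^T \right]^T \in \mathbb{R}^{LN(K-1)}. \tag{3.6}$$

In addition, we get the following complementary equations at optimality:

$$\begin{pmatrix} \alpha^{ij}_p \\ s^{ij}_p \end{pmatrix}^T \begin{pmatrix} A^i_p (H^{ij})^T w - (r^{ij})^T b + (h^{ij}_p)^T y - 1 \\ \rho^i_p (H^{ij})^T w \end{pmatrix} = 0, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad j \ne i, \tag{3.7}$$

$$\begin{pmatrix} \sigma \\ -\frac{1}{\sqrt{2(K+1)}} (\bar{A}^T \alpha + \bar{H}^T s) \\ \tau \end{pmatrix}^T \begin{pmatrix} 1 + t \\ \sqrt{2}\, C w \\ 1 - t \end{pmatrix} = 0, \tag{3.8}$$

$$((1 - \nu) e - \alpha)^T y = 0. \tag{3.9}$$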

3.2 Robust Classifier. From formulation 3.4 we get $\sigma > 0$. In fact, if $\sigma = 0$, then $\tau = 0$ by the fourth constraint. The third constraint of formulation 3.4 then becomes $\nu = 0$, which contradicts $\nu > 0$. By the complementary equation 3.8, we have the following implications (see appendix B for the complementary conditions in SOCP, equations B.1 to B.3): If

$$\left\| \begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}} (\bar{A}^T \alpha + \bar{H}^T s) \\ \tau \end{pmatrix} \right\| < \sigma,$$

then

$$\left\| \begin{pmatrix} \sqrt{2}\, C w \\ 1 - t \end{pmatrix} \right\| = 1 + t = 0.$$

But this contradicts $t \ge 0$. So we must have

$$\left\| \begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}} (\bar{A}^T \alpha + \bar{H}^T s) \\ \tau \end{pmatrix} \right\| = \sigma.$$

Since $\sigma > 0$, we have

$$\left\| \begin{pmatrix} \sqrt{2}\, C w \\ 1 - t \end{pmatrix} \right\| = 1 + t.$$

Hence, there exists $\mu > 0$ such that

$$\sqrt{2}\, C w = \frac{\mu}{\sqrt{2(K + 1)}} (\bar{A}^T \alpha + \bar{H}^T s) \quad \text{and} \quad 1 - t = -\mu \tau. \tag{3.10}$$

In addition, it is easy to get the following equalities by proposition 1:

$$C^{-1} \bar{A}^T = \frac{1}{\sqrt{K + 1}} \bar{A}^T \quad \text{and} \quad C^{-1} \bar{H}^T = \frac{1}{\sqrt{K + 1}} \bar{H}^T. \tag{3.11}$$

Hence, by equations 3.10 and 3.11, we get

$$w = \frac{t - 1}{2 \tau (K + 1)} (\bar{A}^T \alpha + \bar{H}^T s).$$

Furthermore, by equations 2.2, 3.1, and 3.2, we get

$$w^i = \frac{t - 1}{2 \tau (K + 1)} \sum_{j=1,\, j \ne i}^{K} \left[ \sum_{p=1}^{l_i} \alpha^{ij}_p (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p (A^j_p)^T + \sum_{p=1}^{l_i} \rho^i_p s^{ij}_p - \sum_{p=1}^{l_j} \rho^j_p s^{ji}_p \right].$$

Therefore, the decision functions are given by

$$f_i(x) = x^T w^i - b^i = \frac{t - 1}{2 \tau (K + 1)} \sum_{j=1,\, j \ne i}^{K} \left[ \sum_{p=1}^{l_i} \alpha^{ij}_p x^T (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p x^T (A^j_p)^T + \sum_{p=1}^{l_i} \rho^i_p x^T s^{ij}_p - \sum_{p=1}^{l_j} \rho^j_p x^T s^{ji}_p \right] - b^i, \quad i = 1, \ldots, K. \tag{3.12}$$

In particular, if we set $\rho^i_p = 0$, $i = 1, \ldots, K$, $p = 1, \ldots, l_i$, then equation 3.12 becomes

$$f_i(x) = \frac{t - 1}{2 \tau (K + 1)} \sum_{j=1,\, j \ne i}^{K} \left[ \sum_{p=1}^{l_i} \alpha^{ij}_p x^T (A^i_p)^T - \sum_{p=1}^{l_j} \alpha^{ji}_p x^T (A^j_p)^T \right] - b^i, \quad i = 1, \ldots, K. \tag{3.13}$$

Since $\rho^i_p = 0$, $p = 1, \ldots, l_i$, $i = 1, \ldots, K$, implies that the parameter perturbations are not considered (cf. equation 2.5), equation 3.13 corresponds to the discriminants for the case of no measurement noise. With these decision functions, an example $x$ is assigned to a class $i$ such that $f_i(x) = \max\{f_1(x), \ldots, f_K(x)\}$.

4 Robust Piecewise-Nonlinear M-SVM Classifier

The above discussion is concerned with the piecewise-linear case. In this section, the analysis will be extended to the nonlinear case.

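To construct separating functions in a higher-dimensional feature space, a nonlinear mapping $\psi : X \to F$ is used to transform the original examples into the feature space, which is equipped with the inner product defined by $k(x, x') = \langle \psi(x), \psi(x') \rangle$, where $k(\cdot, \cdot) : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is a function called a kernel. Typical choices of kernels include polynomial kernels $k(x, x') = (x^T x' + 1)^d$ with an integer parameter $d$ and radial basis function (RBF) kernels $k(x, x') = \exp(-\|x - x'\|^2 / \kappa)$ with a real parameter $\kappa$.

4.1 Robust Piecewise-Nonlinear M-SVM Formulation. We assume

$$\psi\big( (\hat{A}^i_p)^T \big) = \psi\big( (A^i_p)^T \big) + \tilde{\rho}^i_p\, \tilde{a}^i_p, \quad \tilde{a}^i_p \in \tilde{U}, \tag{4.1}$$

where $\tilde{U}$ is a unit sphere in the feature space. For the nonlinear case, $\tilde{\rho}^i_p$ in the feature space associated with a kernel $k(\cdot, \cdot)$ can be computed as

$$\begin{aligned}
\tilde{\rho}^i_p &= \big\| \psi\big( (\hat{A}^i_p)^T \big) - \psi\big( (A^i_p)^T \big) \big\| \\
&= \Big( \big\langle \psi( (\hat{A}^i_p)^T ), \psi( (\hat{A}^i_p)^T ) \big\rangle - 2 \big\langle \psi( (\hat{A}^i_p)^T ), \psi( (A^i_p)^T ) \big\rangle + \big\langle \psi( (A^i_p)^T ), \psi( (A^i_p)^T ) \big\rangle \Big)^{1/2} \\
&= \Big( k\big( (\hat{A}^i_p)^T, (\hat{A}^i_p)^T \big) - 2 k\big( (\hat{A}^i_p)^T, (A^i_p)^T \big) + k\big( (A^i_p)^T, (A^i_p)^T \big) \Big)^{1/2}.
\end{aligned}$$

For example, for RBF kernels, since

$$k\big( (\hat{A}^i_p)^T, (\hat{A}^i_p)^T \big) = 1, \quad k\big( (\hat{A}^i_p)^T, (A^i_p)^T \big) = \exp\big( -(\rho^i_p)^2 / \kappa \big), \quad k\big( (A^i_p)^T, (A^i_p)^T \big) = 1,$$

we have

$$\tilde{\rho}^i_p = \Big( 2 - 2 \exp\big( -(\rho^i_p)^2 / \kappa \big) \Big)^{1/2}. \tag{4.2}$$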
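Equation 4.2 can be verified against the general kernel expression for the feature-space radius. The following sketch, with hypothetical helper names `rbf` and `feature_space_radius`, perturbs a random point by a unit vector scaled by $\rho$ and compares both formulas; the values $\kappa = 2$ and $\rho = 0.3$ match those reported in the experiments of section 5.

```python
import numpy as np

def rbf(x, y, kappa):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / kappa)
    return np.exp(-np.sum((x - y) ** 2) / kappa)

def feature_space_radius(x_hat, x, kernel):
    # || psi(x_hat) - psi(x) || computed through kernel evaluations only
    return np.sqrt(kernel(x_hat, x_hat) - 2.0 * kernel(x_hat, x) + kernel(x, x))

rng = np.random.default_rng(3)
kappa, rho = 2.0, 0.3
x = rng.normal(size=4)
a = rng.normal(size=4)
a /= np.linalg.norm(a)                  # unit-norm noise direction
x_hat = x + rho * a                     # perturbed example, as in equation 2.5

general = feature_space_radius(x_hat, x, lambda u, v: rbf(u, v, kappa))
closed = np.sqrt(2.0 - 2.0 * np.exp(-rho ** 2 / kappa))
```

For the RBF kernel the radius depends only on $\rho$ and $\kappa$, not on the point or the noise direction, so `general` and `closed` coincide up to rounding.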

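The robust version of the piecewise-nonlinear M-SVM can be expressed as follows:

$$\begin{aligned}
\min_{w,\, b,\, y} \quad & \nu \left[ \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{i-1} \|w^i - w^j\|^2 + \frac{1}{2} \sum_{i=1}^{K} \|w^i\|^2 \right] + (1 - \nu) \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} (e^i)^T y^{ij} \\
\text{s.t.} \quad & \big( \psi( (A^i_p)^T ) \big)^T (w^i - w^j) + \tilde{\rho}^i_p (\tilde{a}^i_p)^T (w^i - w^j) - (b^i - b^j) + y^{ij}_p \ge 1, \quad \forall\, \tilde{a}^i_p \in \tilde{U}, \\
& y^{ij}_p \ge 0, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j,
\end{aligned}$$

which can be rewritten as the following SOCP:

$$\begin{aligned}
\min_{w,\, b,\, y} \quad & \nu \left[ \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{i-1} \|w^i - w^j\|^2 + \frac{1}{2} \sum_{i=1}^{K} \|w^i\|^2 \right] + (1 - \nu) \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} (e^i)^T y^{ij} \\
\text{s.t.} \quad & \big( \psi( (A^i_p)^T ) \big)^T (w^i - w^j) - \tilde{\rho}^i_p \|w^i - w^j\| - (b^i - b^j) + y^{ij}_p \ge 1, \\
& y^{ij}_p \ge 0, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j.
\end{aligned} \tag{4.3}$$

Denote

$$\tilde{A} = \left[ \tilde{B}_1^T, \tilde{B}_2^T, \ldots, \tilde{B}_K^T \right]^T,$$

where

$$\tilde{B}_i = \begin{pmatrix}
-\Psi(A^i) & \cdots & O & \Psi(A^i) & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -\Psi(A^i) & \Psi(A^i) & O & \cdots & O \\
O & \cdots & O & \Psi(A^i) & -\Psi(A^i) & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & \Psi(A^i) & O & \cdots & -\Psi(A^i)
\end{pmatrix}$$

with

$$\Psi(A^i) = \left[ \psi( (A^i_1)^T ), \ldots, \psi( (A^i_{l_i})^T ) \right]^T.$$

Denote

$$\tilde{H} = \left[ \tilde{M}_1^T, \tilde{M}_2^T, \ldots, \tilde{M}_K^T \right]^T,$$

where

$$\tilde{M}_i = \begin{pmatrix}
-M_i & \cdots & O & M_i & O & \cdots & O \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
O & \cdots & -M_i & M_i & O & \cdots & O \\
O & \cdots & O & M_i & -M_i & \cdots & O \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
O & \cdots & O & M_i & O & \cdots & -M_i
\end{pmatrix}$$

with

$$M_i := M_i(\tilde{\rho}) = \left[ \tilde{\rho}^i_1 I_N, \ldots, \tilde{\rho}^i_{l_i} I_N \right]^T. \tag{4.4}$$

In a similar manner to that of getting formulation 3.4, we get the dual of problem 4.3 as follows:

$$\begin{aligned}
\max_{\alpha,\, s,\, \sigma,\, \tau} \quad & e^T \alpha - (\sigma + \tau) \\
\text{s.t.} \quad & \bar{E}^T \alpha = 0, \\
& \alpha \le (1 - \nu) e, \\
& \sigma - \tau = \nu, \\
& \left\| \begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}} (\tilde{A}^T \alpha + \tilde{H}^T s) \\ \tau \end{pmatrix} \right\| \le \sigma, \\
& \|s^{ij}_p\| \le \alpha^{ij}_p, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad j \ne i.
\end{aligned} \tag{4.5}$$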

4.2 Robust Classifier in a Feature Subspace. In the previous section, we obtained the robust formulation 4.5 in the feature space. However, the feature space $F$ may have an arbitrarily large dimension, possibly infinite. Usually kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998; Yajima, 2005) is used for feature extraction. In this section, we first reduce the feature space $F$ to an $S$-dimensional subspace with $S < L$ by KPCA, and then construct the corresponding robust classifier of piecewise-nonlinear M-SVM in the subspace.

Consider the kernel matrix $G = \big( k( (A^i_p)^T, (A^j_q)^T ) \big) \in \mathbb{R}^{L \times L}$ associated with a kernel $k(\cdot, \cdot)$. Since $G$ is a symmetric positive semidefinite matrix, there is an orthogonal matrix $V$ such that $G = V \Lambda V^T$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_i \ge 0$, $i = 1, \ldots, L$, of $G$, and $v_i$, $i = 1, \ldots, L$, the columns of $V$, are the corresponding eigenvectors. Suppose $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_L$. Select the $S$ ($< L$) largest positive eigenvalues and the corresponding eigenvectors. Denote $D_S = [\sqrt{\lambda_1}\, v_1, \sqrt{\lambda_2}\, v_2, \ldots, \sqrt{\lambda_S}\, v_S]$, where the components of $v_i$ are written as follows:

$$v_i = \left[ v^1_{i,1}, \ldots, v^1_{i,l_1}, v^2_{i,1}, \ldots, v^2_{i,l_2}, \ldots, v^K_{i,1}, \ldots, v^K_{i,l_K} \right]^T.$$

Define the vectors

$$u_i := \sum_{j=1}^{K} \sum_{p=1}^{l_j} \frac{v^j_{i,p}\, \psi( (A^j_p)^T )}{\sqrt{\lambda_i}}, \quad i = 1, \ldots, S.$$

Then we have

$$u_i^T u_i = \frac{1}{\lambda_i} v_i^T G v_i = 1$$

and

$$u_i^T u_j = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T G v_j = 0, \quad i \ne j.$$

Therefore, $\{u_1, u_2, \ldots, u_S\}$ forms an orthonormal basis of an $S$-dimensional subspace of $F$. Let $\psi_S(x)$ be the $S$-dimensional subcoordinate of $\psi(x)$, which is given by

$$\psi_S(x) = \left[ \frac{1}{\sqrt{\lambda_1}} \sum_{j=1}^{K} \sum_{p=1}^{l_j} v^j_{1,p}\, k\big( x, (A^j_p)^T \big), \ldots, \frac{1}{\sqrt{\lambda_S}} \sum_{j=1}^{K} \sum_{p=1}^{l_j} v^j_{S,p}\, k\big( x, (A^j_p)^T \big) \right]^T. \tag{4.6}$$

Then, similar to equation 3.12, we can get the decision functions associated with the robust formulation of piecewise-nonlinear M-SVM in the feature subspace as follows:

$$f_i(x) = \frac{t - 1}{2 \tau (K + 1)} \sum_{j=1,\, j \ne i}^{K} \left[ \sum_{p=1}^{l_i} \alpha^{ij}_p\, \psi_S(x)^T \psi_S( (A^i_p)^T ) - \sum_{p=1}^{l_j} \alpha^{ji}_p\, \psi_S(x)^T \psi_S( (A^j_p)^T ) + \sum_{p=1}^{l_i} \tilde{\rho}^i_p\, \psi_S(x)^T s^{ij}_p - \sum_{p=1}^{l_j} \tilde{\rho}^j_p\, \psi_S(x)^T s^{ji}_p \right] - b^i, \quad i = 1, \ldots, K. \tag{4.7}$$
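The subcoordinate map of equation 4.6 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `rbf_gram`, `fit_subspace`, and `psi_S` are hypothetical, the training examples of all classes are assumed to be stacked row-wise into one array `X`, and the RBF kernel of section 4 is used.

```python
import numpy as np

def rbf_gram(X, Y, kappa):
    # Matrix of k(x, y) = exp(-||x - y||^2 / kappa) for rows x of X and y of Y
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / kappa)

def fit_subspace(X, kappa, S):
    # Eigendecompose G = V Lambda V^T and keep the S largest eigenpairs
    G = rbf_gram(X, X, kappa)
    lam, V = np.linalg.eigh(G)             # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:S]
    return lam[order], V[:, order]

def psi_S(x_new, X, lam, V, kappa):
    # Equation 4.6: component i is (1 / sqrt(lambda_i)) * sum_q v_{i,q} k(x, x_q)
    k = rbf_gram(np.atleast_2d(x_new), X, kappa)
    return k @ V / np.sqrt(lam)

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 2))                # hypothetical stacked training examples
lam, V = fit_subspace(X, kappa=2.0, S=3)
Z = psi_S(X, X, lam, V, kappa=2.0)         # subcoordinates of the training points
```

For the training points themselves, `Z` coincides with the matrix $D_S$ defined above, since $G v_i = \lambda_i v_i$.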

Table 1: Description of Iris, Wine, and Glass Data Sets.

Name    Dimension (N)   Number of Classes (K)   Number of Examples (L)
Iris    4               3                       150
Wine    13              3                       178
Glass   9               6                       214

5 Preliminary Numerical Results

In this section, through numerical experiments, we examine the performance of the robust piecewise-nonlinear M-SVM formulation and the original model for multiclass classification problems. We use the RBF kernel in the experiments. As we described in section 4.2, we first construct an $L \times L$ kernel matrix $G$ associated with the RBF kernel for the training data set. Then we decompose $G$ and select an appropriate number $S$. Using equation 4.6, we obtain the $S$-dimensional subcoordinate of each point. The problems used in the experiments are the robust model 4.5 and the original model obtained by setting $\tilde{\rho} = 0$ in equation 4.4. In the latter model, we have $\tilde{H} = O$. Thus, we may write the problem as follows:

$$\begin{aligned}
\max_{\alpha,\, \sigma,\, \tau} \quad & e^T \alpha - (\sigma + \tau) \\
\text{s.t.} \quad & \bar{E}^T \alpha = 0, \\
& \alpha \le (1 - \nu) e, \\
& \sigma - \tau = \nu, \\
& \left\| \begin{pmatrix} -\frac{1}{\sqrt{2(K+1)}} \tilde{A}^T \alpha \\ \tau \end{pmatrix} \right\| \le \sigma.
\end{aligned} \tag{5.1}$$

The experiments were implemented on a PC (1GB RAM, CPU 3.00GHz) using SeDuMi1.05 (Sturm, 2001) as a solver. This solver was developed by J. Sturm for optimization problems over symmetric cones, including SOCP.

Some experimental results on real-world data sets taken from the UCI machine learning repository (Blake & Merz, 1998) are reported below. Table 1 gives a description of the data sets. In the experiments, the data sets were normalized to lie between $-1$ and 1. For simplicity, we set all $\rho^i_p$ in equation 2.5 to be a constant $\rho$. The measurement noise $a^i_p$ was generated randomly from the normal distribution and scaled to lie on the unit sphere.

Two experiments were performed. In the first, an appropriate value of $S$ for getting reasonable discriminants was sought. The second experiment was conducted on the three data sets with the measurement noise. Tenfold cross validation was used in the experiments.

In order to seek an appropriate value of $S$, a ratio $R_a$ is set. It is chosen from the set $\{0.5, 0.6, 0.7, 0.8, 0.99\}$. For each value of $R_a$, we find the smallest integer $S$ such that $\sum_{i=1}^{S} \lambda_i / \sum_{i=1}^{L} \lambda_i \ge R_a$, and let $R_t := \sum_{i=1}^{S} \lambda_i / \sum_{i=1}^{L} \lambda_i$. At the same time, we test the accuracy on the validation set by computing the percentage of tenfold testing correctness. Table 2 contains these three kinds of results for the robust model and the original model on the Iris, Wine, and Glass data sets with the measurement noise scaled by $\rho = 0.3$. When $R_a$ is large, we were unable to solve the problems for the Wine and Glass data sets because of memory limitations. Nevertheless, it can be seen from Table 2 that values of $R_t$ from around 50% up to 70% yield reasonable discriminants. Moreover, in all cases, $S$ is much smaller than the data size $L$.

Table 2: Results for Iris, Wine, and Glass Data Sets with Noise (ρ = 0.3, κ = 2, ν = 0.05).

              Iris                   Wine                   Glass
Ra    Model   S    Rt      PT        S    Rt      PT        S    Rt      PT
0.5   I       1    0.5364  62.67     5    0.5179  90.0      1    0.5737  35.24
0.5   II      1    0.5364  60.67     5    0.5179  88.89     1    0.5737  31.43
0.6   I       2    0.7950  89.33     9    0.6102  88.89     2    0.6826  66.67
0.6   II      2    0.7950  87.33     9    0.6102  80.0      2    0.6826  32.86
0.7   I       2    0.7950  89.33     16   0.7103  87.78     3    0.7523  66.67
0.7   II      2    0.7950  87.33     16   0.7103  82.22     3    0.7523  38.57
0.8   I       3    0.8836  85.33     -    -       -         4    0.8002  66.67
0.8   II      3    0.8836  84.0      -    -       -         4    0.8002  45.24
0.99  I       12   0.9911  88.0      -    -       -         -    -       -
0.99  II      12   0.9911  86.67     -    -       -         -    -       -

Note: Model I is the robust model; model II is the original model. PT: Percentage of tenfold testing correctness on validation set. A hyphen marks problems that could not be solved.

Table 3 shows the percentage of tenfold testing correctness for the robust model and the original model on the three data sets with various noise levels $\rho$. In particular, $\rho = 0$ means that there is no noise on the data sets. In this case, the robust model reduces to the original model. It can be observed that, for every noise level $\rho > 0$, the performance of the robust model is better than that of the original model, especially for the Glass data set. In addition, the correctness for the original model on the three data sets with noise is much lower than the results when $\rho = 0$.

Table 3: Percentage of Tenfold Test Correctness for the Data Sets with Noise (κ = 2, ν = 0.05).

Data Set (S)   Model   ρ = 0    ρ = 0.1   ρ = 0.2   ρ = 0.3   ρ = 0.4   ρ = 0.5
Iris (2)       I       94.0     88.67     88.0      89.33     91.33     90.0
Iris (2)       II      94.0     87.33     87.33     87.33     86.67     85.33
Wine (5)       I       98.33    91.11     90.0      90.0      87.78     84.44
Wine (5)       II      98.33    88.89     88.83     88.89     85.56     82.22
Glass (4)      I       59.52    66.19     65.71     66.67     66.67     66.67
Glass (4)      II      59.52    46.19     45.71     45.24     49.05     49.52

Note: Model I is the robust model; model II is the original model.

For the linear case, we also find that the correctness for the original model on the data sets with noise is lower than the correctness on the data sets without noise. For example, the correctness of the Iris data set, the Wine data set, and the Glass data set without noise is 90.67%, 97.78%, and 57.14%, respectively. However, when $\rho = 0.5$, the corresponding correctness is 87.33%, 78.89%, and 48.57%, respectively. For the robust model, when $\rho = 0.5$, the corresponding correctness is 90.67%, 81.11%, and 56.19%, respectively.
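The rule for choosing $S$ from a target ratio $R_a$ can be sketched as follows. The function name `choose_S` and the example spectrum are hypothetical; the function returns both $S$ and the attained ratio $R_t$ defined in the text.

```python
import numpy as np

def choose_S(eigenvalues, Ra):
    """Smallest S whose leading eigenvalues capture at least a fraction Ra of the total."""
    lam = np.sort(np.clip(eigenvalues, 0.0, None))[::-1]   # descending, negatives clipped
    ratios = np.cumsum(lam) / np.sum(lam)
    idx = int(np.searchsorted(ratios, Ra, side="left"))    # first index with ratios[idx] >= Ra
    S = min(idx + 1, len(lam))                             # guard against rounding at Ra near 1
    return S, float(ratios[S - 1])

# Hypothetical spectrum with cumulative ratios 0.5, 0.8, 0.9, 1.0
lam_example = np.array([5.0, 3.0, 1.0, 1.0])
S_06, Rt_06 = choose_S(lam_example, 0.6)
```

With this example spectrum and $R_a = 0.6$, the rule selects $S = 2$ with $R_t = 0.8$, which mirrors how $R_t$ in Table 2 can exceed the requested $R_a$.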

6 Conclusion

In this letter, we have established robust linear and nonlinear formulations for multiclass classification based on the M-SVM method. KPCA has been used to reduce the feature space to an $S$-dimensional subspace. The preliminary numerical experiments show that the performance of the robust model is better than that of the original model. Unfortunately, the conic convex optimization solver SeDuMi1.05 (Sturm, 2001) used in our numerical experiments could solve problems only for small data sets. Sequential minimal optimization (SMO) techniques (Platt, 1999) are essential in large-scale implementation of SVM. Future subjects for investigation include developing SMO-based robust algorithms for multiclass classification.

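Appendix A: Dual of Formulation 2.17

In order to get the dual of problem 2.17, we first state a more general primal and dual form of the SOCP. The notations used in section A.1 are independent of those in the other part of the letter.

A.1 A General Primal and Dual Pair. For the SOCP

$$\begin{aligned}
\min_{x,\, y,\, z} \quad & c^T x + d^T y + e^T z \\
\text{s.t.} \quad & A^T x + B^T y + C^T z = f, \\
& y \ge 0, \quad z \in \mathcal{K}^{n_1} \times \cdots \times \mathcal{K}^{n_l},
\end{aligned} \tag{A.1}$$

its dual is written as follows:

$$\begin{aligned}
\max_{w,\, u,\, v} \quad & f^T w \\
\text{s.t.} \quad & A w = c, \\
& B w + u = d, \\
& C w + v = e, \\
& u \ge 0, \quad v \in \mathcal{K}^{n_1} \times \cdots \times \mathcal{K}^{n_l}.
\end{aligned} \tag{A.2}$$

Now consider the problem

$$\begin{aligned}
\min_{x,\, y,\, z} \quad & c^T x + d^T y \\
\text{s.t.} \quad & \|\bar{G}_i^T x + q_i\| \le g_i^T x + h_i^T y + r_i^T z + a_i, \quad i = 1, \ldots, m, \\
& y \ge 0.
\end{aligned}$$

This problem can be formulated as follows:

$$\begin{aligned}
\min_{x,\, y,\, z,\, \zeta} \quad & c^T x + d^T y \\
\text{s.t.} \quad & \zeta_i - \begin{pmatrix} g_i^T x + h_i^T y + r_i^T z + a_i \\ \bar{G}_i^T x + q_i \end{pmatrix} = 0, \quad i = 1, \ldots, m, \\
& \zeta_i \in \mathcal{K}^{n_i}, \quad i = 1, \ldots, m, \quad y \ge 0,
\end{aligned} \tag{A.3}$$

which can be further rewritten as

$$\begin{aligned}
\min_{x,\, y,\, z,\, \zeta} \quad & c^T x + d^T y \\
\text{s.t.} \quad & \begin{pmatrix}
-G_1 & -G_2 & \cdots & -G_m \\
-H_1 & -H_2 & \cdots & -H_m \\
-R_1 & -R_2 & \cdots & -R_m \\
I & O & \cdots & O \\
O & I & \cdots & O \\
\vdots & \vdots & \ddots & \vdots \\
O & O & \cdots & I
\end{pmatrix}^T \begin{pmatrix} x \\ y \\ z \\ \zeta_1 \\ \zeta_2 \\ \vdots \\ \zeta_m \end{pmatrix} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{pmatrix}, \\
& \zeta_i \in \mathcal{K}^{n_i}, \quad i = 1, \ldots, m, \quad y \ge 0,
\end{aligned}$$

where $G_i = [g_i, \bar{G}_i]$, $H_i = [h_i, O]$, $R_i = [r_i, O]$, $\beta_i = [a_i, q_i^T]^T$, $i = 1, \ldots, m$. In view of the primal-dual pair A.1 and A.2, we obtain the dual of problem A.3 as follows:

$$\begin{aligned}
\max_{\eta,\, \lambda} \quad & -\sum_{i=1}^{m} \beta_i^T \eta_i \\
\text{s.t.} \quad & \sum_{i=1}^{m} G_i \eta_i = c, \\
& \sum_{i=1}^{m} H_i \eta_i + \lambda = d, \\
& \sum_{i=1}^{m} R_i \eta_i = 0, \\
& \lambda \ge 0, \quad \eta_i \in \mathcal{K}^{n_i}, \quad i = 1, \ldots, m.
\end{aligned} \tag{A.4}$$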

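A.2 Dual of Problem 2.17. In the following we derive the dual of formulation 2.17. The primal problem 2.17 can be put in the following equivalent form:

$$\begin{aligned}
\min_{x,\, b,\, y} \quad & [0^T, \nu]\, x + (1 - \nu) e^T y \\
\text{s.t.} \quad & \left\| \begin{pmatrix} \sqrt{2}\, C & 0 \\ 0^T & -1 \end{pmatrix} x + \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\| \le [0^T, 1]\, x + 1, \\
& \big\| [\rho^i_p (H^{ij})^T, 0]\, x \big\| \le [A^i_p (H^{ij})^T, 0]\, x + (h^{ij}_p)^T y - (r^{ij})^T b - 1, \\
& p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j, \quad y \ge 0,
\end{aligned}$$

where $x = [w^T, t]^T$. Then, by equation A.4, we get the dual of problem 2.17 as follows:

$$\max_{\alpha,\, s,\, \sigma,\, \tau} \quad \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p - (\sigma + \tau) \tag{A.5}$$

$$\text{s.t.} \quad \sqrt{2}\, C^T \xi + \sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \left( \alpha^{ij}_p H^{ij} (A^i_p)^T + \rho^i_p H^{ij} s^{ij}_p \right) = 0, \tag{A.6}$$

$$\sigma - \tau = \nu, \tag{A.7}$$

$$\sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p h^{ij}_p + \lambda = (1 - \nu) e, \tag{A.8}$$

$$-\sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p r^{ij} = 0, \tag{A.9}$$

$$\left\| \begin{pmatrix} \xi \\ \tau \end{pmatrix} \right\| \le \sigma, \tag{A.10}$$

$$\|s^{ij}_p\| \le \alpha^{ij}_p, \quad p = 1, \ldots, l_i, \quad i, j = 1, \ldots, K, \quad i \ne j, \tag{A.11}$$

$$\lambda \ge 0. \tag{A.12}$$

By equations 2.9, 3.1, and 3.5, we get

$$\sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p H^{ij} (A^i_p)^T = \bar{A}^T \alpha. \tag{A.13}$$

By equations 3.2 and 3.6, we get

$$\sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \rho^i_p H^{ij} s^{ij}_p = \bar{H}^T s. \tag{A.14}$$

Hence by equations A.13 and A.14, we can express equation A.6 compactly as follows:

$$\sqrt{2}\, C^T \xi + \bar{A}^T \alpha + \bar{H}^T s = 0. \tag{A.15}$$

By equations A.15 and 3.11, we get the following equation:

$$\xi = -\frac{1}{\sqrt{2(K + 1)}} (\bar{A}^T \alpha + \bar{H}^T s). \tag{A.16}$$

By equations 2.13 and 3.5, we have

$$\sum_{i=1}^{K} \sum_{j=1,\, j \ne i}^{K} \sum_{p=1}^{l_i} \alpha^{ij}_p h^{ij}_p = \alpha.$$

Hence, equation A.8 can be expressed as follows:

$$(1 - \nu) e - \lambda - \alpha = 0. \tag{A.17}$$

By equations 2.11, 3.3, and 3.5, we can rewrite equation A.9 as follows:

$$-\bar{E}^T \alpha = 0. \tag{A.18}$$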

Combining equations A.16 to A.18, the problem given in A.5 to A.12 can be written as problem 3.4.

Appendix B: Complementarity Conditions of SOCP

Let $\mathrm{bd}\, \mathcal{K}^n$ denote the boundary of $\mathcal{K}^n$:

$$\mathrm{bd}\, \mathcal{K}^n = \left\{ \begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} \in \mathcal{K}^n : \|\bar{z}\| = z_0 \right\}.$$

Let $\mathrm{int}\, \mathcal{K}^n$ denote the interior of $\mathcal{K}^n$:

$$\mathrm{int}\, \mathcal{K}^n = \left\{ \begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} \in \mathcal{K}^n : \|\bar{z}\| < z_0 \right\}.$$

For two elements

$$\begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} \in \mathcal{K}^n \quad \text{and} \quad \begin{pmatrix} z_0' \\ \bar{z}' \end{pmatrix} \in \mathcal{K}^n,$$

$$\begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix}^T \begin{pmatrix} z_0' \\ \bar{z}' \end{pmatrix} = 0$$

if and only if the following conditions are satisfied (Lobo et al., 1998):

$$\begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} \in \mathrm{int}\, \mathcal{K}^n \ \Rightarrow\ \bar{z}' = 0,\ z_0' = 0, \tag{B.1}$$

$$\begin{pmatrix} z_0' \\ \bar{z}' \end{pmatrix} \in \mathrm{int}\, \mathcal{K}^n \ \Rightarrow\ \bar{z} = 0,\ z_0 = 0, \tag{B.2}$$

$$\begin{pmatrix} z_0 \\ \bar{z} \end{pmatrix} \in \mathrm{bd}\, \mathcal{K}^n \setminus \{0\},\ \begin{pmatrix} z_0' \\ \bar{z}' \end{pmatrix} \in \mathrm{bd}\, \mathcal{K}^n \setminus \{0\} \ \Rightarrow\ \begin{pmatrix} z_0' \\ \bar{z}' \end{pmatrix} = \mu \begin{pmatrix} z_0 \\ -\bar{z} \end{pmatrix}, \tag{B.3}$$

where $\mu > 0$ is a constant. These three conditions are regarded as a generalization of the complementary slackness conditions in linear programming.

Acknowledgments

P.Z. is supported in part by a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan and the National Science Foundation of China Grant No. 70601033. M.F. is supported in part by the Scientific Research Grant-in-Aid from the Japan Society for the Promotion of Science.

References

Alizadeh, F., & Goldfarb, D. (2003). Second-order cone programming. Math. Program., Ser. B, 95, 3–51.
Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Angulo, C., Parra, X., & Català, A. (2003). K-SVCR: A support vector machine for multi-class classification. Neurocomputing, 55, 57–77.
Bennett, K. P., & Mangasarian, O. L. (1994). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 27–39.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Sackinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwriting digit recognition. In IAPR (Ed.), Proceedings of the International Conference on Pattern Recognition (pp. 77–82). Piscataway, NJ: IEEE Computer Society Press.

282

P. Zhong and M. Fukushima

Bredensteiner, E. J., & Bennett, K. P. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53–79. Dietterich, T. G., & Bakiri, G. (1995). Solving multi-class learning problems via errorcorrecting output codes. Journal of Artificial Intelligence Research, 2, 263–286. Fukushima, M., Luo, Z. Q., & Tseng, P. (2002). Smoothing functions for second-ordercone complementarity problems. SIAM Journal on Optimization, 12, 436–460. Goldfarb, D., & Iyengar, G. (2003). Robust convex quadratically constrained programs. Mathematical Programming, 97, 495–515. Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications, 5, 168–179. Hastie, T. J., & Tibshirani, R. J. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press. Hayashi, S., Yamashita, N., & Fukushima, M. (2005). A combined smoothing and regularization method for monotone second-order cone complementarity problems. SIAM Journal on Optimization, 15, 593–615. Kressel, U. (1999). Pairwise classification and support vector machines. In B. ¨ Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press. Lobo, M. S., Vandenberghe, L., Boyd, S., & L´ebret, H. (1998). Applications of secondorder cone programming. Linear Algebra and Applications, 284, 193–228. Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training sup¨ port vector machines. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press. Platt, J., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass ¨ classification. In S. A. Solla, T. K. Leen, & K. -R. 
Muller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press. ¨ ¨ Scholkopf, B., Smola, A., & Muller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. Sturm, J. (2001). Using SeDuMi, a Matlab toolbox for optimization over symmetric cones. Department of Econmetrics, Tilburg University, Netherlands. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Weston, J., & Watkins, C. (1998). Multi-class support vector machines (CSD-TR-9804). Egham, UK: Royal Holloway, University of London. Yajima, Y. (2005). Linear programming approaches for multicategory support vector machines. European Journal of Operational Research, 162, 514–531. Zhong, P., & Fukushima, M. (2006). A new multi-class support vector algorithm. Optimization Methods and Software, 21, 359–372.

Received September 19, 2005; accepted May 24, 2006.

LETTER

Communicated by Grace Wahba

Fast Generalized Cross-Validation Algorithm for Sparse Model Learning

S. Sundararajan
Philips Electronics India Ltd., Ulsoor, Bangalore, India

Shirish Shevade
Computer Science and Automation, Indian Institute of Science, Bangalore, India

S. Sathiya Keerthi
Yahoo! Research, 2821 Mission College Blvd., Santa Clara, CA 95054, USA

We propose a fast, incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross-validation approach. The performances on synthetic and real-world data sets are compared with other incremental algorithms such as Tipping and Faul's fast relevance vector machine, Chen et al.'s orthogonal least squares, and Orr's regularized forward selection. The results demonstrate that the proposed algorithm is competitive.

1 Introduction

In recent years, there has been a lot of focus on designing sparse models in machine learning. For example, the support vector machine (SVM) (Cortes & Vapnik, 1995) and the relevance vector machine (RVM) (Tipping, 2001; Tipping & Faul, 2003) have been proven to provide sparse solutions to both regression and classification problems. Some of the earlier successful approaches for regression problems include Chen, Cowan, and Grant's (1991) orthogonal least squares and Orr's (1995b) regularized forward selection algorithms. In this letter, we consider only the regression problem. Often the target function model takes the form

$$ y(x) = \sum_{m=1}^{M} w_m \phi_m(x). \tag{1.1} $$

Neural Computation 19, 283–301 (2007)

© 2006 Massachusetts Institute of Technology


For notational convenience, we do not indicate the dependency of $y$ on $w$, the weight vector. The available choices in selecting the set of basis vectors $\{\phi_m(x), m = 1, \ldots, M\}$ make the model flexible. Given a data set $\{x_n, t_n\}_{n=1}^{N}$, $t_n \in \mathbb{R}\ \forall n$, we can write the target vector $t = (t_1, \ldots, t_N)^T$ as the sum of the approximation vector $y(x) = (y(x_1), \ldots, y(x_N))^T$ and an error vector $\eta$,

$$ t = \Phi w + \eta, \tag{1.2} $$

where $\Phi = (\phi_1\ \phi_2\ \cdots\ \phi_M)$ is the design matrix and $\phi_i$ is the $i$th column vector of $\Phi$ of size $N \times 1$ representing the response of the $i$th basis vector for all the input samples $x_j$, $j = 1, \ldots, N$. One possible way to obtain this design matrix is to use a gaussian kernel with $\Phi_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2z^2}\right)$. The error vector $\eta$ can be modeled as an independent zero-mean gaussian vector with variance $\sigma^2$.

In this context, controlling the model complexity to avoid overfitting is an important task. This problem has been addressed in the past by regularization approaches (Bishop, 1995). In the classical approach, the sum squared error with weight decay regularization having a single regularization parameter $\alpha$ controls the trade-off between fitting the training data and smoothing the output function. In the local smoothing approach, each weight in $w$ is associated with a regularization or ridge parameter (Hoerl & Kennard, 1970; Orr, 1995a; Denison & George, 2000). The interested reader can refer to Denison and George (2000) for a discussion on the generalized ridge regression and Bayesian approach. For our discussion, we consider the weight decay regularizer with multiple regularization parameters. In this case, the optimal weight vector is obtained by minimizing the following cost function (for a given set of hyperparameter values, $\alpha$, $\sigma^2$),

$$ C(w, \alpha, \sigma^2) = \frac{1}{2\sigma^2} \|t - \Phi w\|^2 + \frac{1}{2} w^T A w, \tag{1.3} $$

where $A$ is a diagonal matrix with elements $\alpha = (\alpha_1, \ldots, \alpha_M)^T$, and the optimal weight vector is given by

$$ w = \frac{1}{\sigma^2} S^{-1} \Phi^T t, \tag{1.4} $$

where $S = R + A$ and $R = \frac{\Phi^T \Phi}{\sigma^2}$. Note that this solution depends on the product $\alpha\sigma^2$. This solution is the same as the maximum a posteriori (MAP) solution obtainable from defining the gaussian likelihood and gaussian prior for the weights in a Bayesian framework (Tipping, 2001).
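As a rough illustration of equations 1.2 to 1.4, the following sketch builds a gaussian design matrix and solves for the regularized weight vector. It uses only the Python standard library; the toy inputs, the kernel width, the hyperparameter values, and the helper names (`gaussian_design_matrix`, `solve`, `map_weights`) are assumptions made for illustration and are not taken from the letter.

```python
import math

def gaussian_design_matrix(xs, z):
    """Phi[i][j] = exp(-||x_i - x_j||^2 / (2 z^2)); one basis vector per sample."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2.0 * z * z))
             for xj in xs] for xi in xs]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def map_weights(phi, t, alpha, sigma2):
    """w = (1/sigma^2) S^{-1} Phi^T t with S = Phi^T Phi / sigma^2 + diag(alpha)."""
    n, m = len(phi), len(phi[0])
    s = [[sum(phi[k][i] * phi[k][j] for k in range(n)) / sigma2
          + (alpha[i] if i == j else 0.0) for j in range(m)] for i in range(m)]
    rhs = [sum(phi[k][i] * t[k] for k in range(n)) / sigma2 for i in range(m)]
    return solve(s, rhs)

# Hypothetical one-dimensional toy data (not from the letter).
xs = [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,)]
t = [0.0, 0.48, 0.84, 1.0, 0.91]
phi = gaussian_design_matrix(xs, z=1.0)
alpha = [0.1] * len(xs)   # one ridge parameter per basis vector
sigma2 = 0.01
w = map_weights(phi, t, alpha, sigma2)
```

At the returned `w`, the gradient of the cost in equation 1.3 vanishes, that is, $\Phi^T (t - \Phi w)/\sigma^2 = A w$ up to rounding.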


The hyperparameters are typically selected using iterative approaches like marginal likelihood maximization (Tipping, 2001). In practice, many of the $\alpha_i$ approach $\infty$. This results in the removal of associated basis vectors, thereby making the model sparse. The final model consists of a small number of basis vectors $L$ ($L \ll M$), called relevance vectors, and, hence, known by the name relevance vector machine (RVM). This procedure is computationally intensive and needs $O(M^3)$ effort at least for the initial iterations while starting with the full model ($M = N$ or $M = N + 1$ if the bias term is included). Hence, it is not suitable for large data sets. This limitation was addressed in Tipping and Faul (2003), where a computationally efficient algorithm was proposed. In this algorithm, basis vectors are added sequentially starting from an empty model. It also allows deleting the basis vectors, which may subsequently become redundant.

There are various other basis vector selection heuristics that can be used to design sparse models (Chen et al., 1991; Orr, 1995b). Chen et al. (1991) proposed an orthogonal least-squares algorithm as a forward regression procedure to select the basis vectors. At each step of the regression, the increment to the explained variance of the desired output is maximized. Orr (1995b) proposed an algorithm that combines regularization and cross-validated selection of basis vectors. Some other promising incremental approaches are the algorithms of Csato and Opper (2002), Lawrence, Seeger, and Herbrich (2003), Seeger, Williams, and Lawrence (2003), and Smola and Bartlett (2000). But they apply to gaussian processes and are not directly related to the problem formulation addressed in this article.

Generalized cross validation (GCV) is another important approach for the selection of hyperparameters and has been shown to exhibit good generalization performance (Sundararajan & Keerthi, 2001; Orr, 1995a).
Orr (1995a) used the GCV approach to estimate the multiple smoothing parameters of the full model. This approach, however, is not suitable for large data sets due to its computational complexity. Therefore, there is a need to devise a computationally efficient algorithm based on the GCV approach for handling large data sets. In this letter, we propose a new fast, incremental GCV algorithm that can be used to design sparse models exhibiting good generalization performance. In the algorithm, we start with an empty model and sequentially add the basis functions to reduce the GCV error. The GCV error can also be reduced by deleting those basis functions that subsequently become redundant. This important feature offsets the inherent greediness exhibited by other sequential algorithms. This algorithm has the same computational complexity as that of the algorithm given in Tipping and Faul (2003) and is suitable for large data sets as well. Preliminary results on synthetic and real-world benchmark data sets indicate that the new approach gains on generalization but at the expense of a moderate increase in the number of basis vectors.


The letter is organized as follows. In section 2, we describe the GCV approach and compare the GCV error function with marginal likelihood function. Section 3 describes the fast, incremental algorithm, computational complexity, and the numerical issues involved; the update expressions mentioned in this section are detailed in appendixes A to F. In section 4, we present the simulation results. Section 5 concludes the letter.

2 Generalized Cross Validation

The standard techniques that estimate the prediction error of a given model are the leave-one-out (LOO) cross-validation (Stone, 1974) and the closely related GCV described in Golub, Heath, and Wahba (1979) and Orr (1995b). The generalization performance of the GCV approach is quite good, like that of the LOO cross-validation approach. The advantage in using the GCV error is that it takes a much simpler form compared to the LOO error and is given by

$$ V(\alpha, \sigma^2) = N \, \frac{\sum_{i=1}^{N} (t_i - y(x_i))^2}{(\operatorname{tr}(P))^2}, \tag{2.1} $$

where

$$ P = I - \frac{\Phi S^{-1} \Phi^T}{\sigma^2}. \tag{2.2} $$

When this GCV error is minimized with respect to the hyperparameters, many of the $\alpha_i$ approach $\infty$, making the model sparse. Equation 2.1 can be written as

$$ V(\alpha, \sigma^2) = N \, \frac{t^T P^2 t}{(\operatorname{tr}(P))^2}. \tag{2.3} $$

Note that $P$ is dependent only on $\zeta = \alpha\sigma^2$. This means that we can get rid of $\sigma^2$ from equations 1.4 and 2.1. Then it is sufficient to work directly with $\zeta$. However, using the optimal $\zeta$ obtained from minimizing the GCV error, the noise level can be computed from

$$ \hat\sigma^2 = \frac{t^T P^2 t}{\operatorname{tr}(P)}. \tag{2.4} $$
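The quantities in equations 2.2 to 2.4 can be evaluated directly for a small problem. The sketch below is a naive reference computation under stated assumptions, not the fast algorithm of this letter: it works with $\zeta = \alpha\sigma^2$, forms $P = I - \Phi(\Phi^T\Phi + \operatorname{diag}(\zeta))^{-1}\Phi^T$ explicitly, and uses a hypothetical two-column design matrix and target vector. The helper names are illustrative.

```python
def inverse(a):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def projection_matrix(phi, zeta):
    """P = I - Phi (Phi^T Phi + diag(zeta))^{-1} Phi^T, with zeta = alpha * sigma^2."""
    n, m = len(phi), len(phi[0])
    gram = [[sum(phi[k][i] * phi[k][j] for k in range(n)) + (zeta[i] if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    g_inv = inverse(gram)
    return [[(1.0 if r == c else 0.0)
             - sum(phi[r][i] * g_inv[i][j] * phi[c][j] for i in range(m) for j in range(m))
             for c in range(n)] for r in range(n)]

def gcv_and_noise(phi, t, zeta):
    """Return (V, sigma_hat^2) from equations 2.3 and 2.4."""
    n = len(t)
    p = projection_matrix(phi, zeta)
    p_t = [sum(p[r][c] * t[c] for c in range(n)) for r in range(n)]
    sq_err = sum(v * v for v in p_t)      # t^T P^2 t, because P is symmetric
    trace = sum(p[r][r] for r in range(n))
    return n * sq_err / trace ** 2, sq_err / trace

# Hypothetical data: a bias column and a linear column.
phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [0.1, 0.9, 2.1, 2.9]
gcv_fit, noise_fit = gcv_and_noise(phi, t, [1e-6, 1e-6])    # both vectors effectively kept
gcv_empty, noise_empty = gcv_and_noise(phi, t, [1e12, 1e12])  # both vectors effectively pruned
```

With very large $\zeta$ the matrix $P$ approaches the identity, so the GCV error approaches $t^T t / N$; for this toy data the nearly unregularized fit has a much smaller GCV error.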

We now discuss the algorithm proposed by Orr to determine the optimal set of hyperparameters.


2.1 Orr's Algorithm. Starting with the full model, Orr (1995a) proposed an iterative scheme to minimize the GCV error, equation 2.3, with respect to the hyperparameters. Although this algorithm was originally described in terms of the variable $\zeta_j$, we describe it here using the variables $\alpha_j$ and $\sigma^2$ for convenience. Each $\alpha_j$ is optimized in turn, while the others are held fixed. The minimization thus proceeds by a series of one-dimensional minimizations. This can be achieved by rewriting equation 2.3 using

$$ t^T P^2 t = \frac{a_j \Delta_j^2 - 2 b_j \Delta_j + c_j}{\Delta_j^2}, \qquad \operatorname{tr}(P) = \delta_j - \frac{\epsilon_j}{\Delta_j}, $$

where

$$ a_j = t^T P_j^2 t \tag{2.5} $$

$$ b_j = (t^T P_j^2 \phi_j)(t^T P_j \phi_j) \tag{2.6} $$

$$ c_j = (\phi_j^T P_j^2 \phi_j)(t^T P_j \phi_j)^2 \tag{2.7} $$

$$ \delta_j = \operatorname{tr}(P_j) \tag{2.8} $$

$$ \epsilon_j = \phi_j^T P_j^2 \phi_j. \tag{2.9} $$

The above relationships follow from the rank-one update relationship between $P$ and $P_j$,

$$ P = P_j - \frac{1}{\Delta_j} P_j \phi_j \phi_j^T P_j, \tag{2.10} $$

where $P_j = I - \frac{\Phi_j S_j^{-1} \Phi_j^T}{\sigma^2}$ and

$$ \Delta_j = \phi_j^T P_j \phi_j + \alpha_j \sigma^2. \tag{2.11} $$

Here, $\Phi_j$ denotes the matrix $\Phi$ with the column $\phi_j$ removed. Therefore, $S_j = R_j + A_j$, with the subscript $j$ having the same interpretation as in $\Phi_j$. Note that $S_j$ does not contain $\alpha_j$, and, hence, $P_j$ also does not contain $\alpha_j$. Thus, equation 2.3 can be seen as a rational polynomial in $\alpha_j$ alone, with a single minimum in the range $[0, \infty]$. The details of minimizing this polynomial with respect to $\alpha_j$ are given in appendix A. After one complete cycle in which each parameter is optimized once, the GCV score is calculated and compared with the score at the end of the previous cycle. If significant decrease has occurred, a new cycle is begun; otherwise, the algorithm


terminates. As detailed in Orr (1995a), the computational complexity of one cycle of the above algorithm is $O(N^3)$, at least during the first few iterations. Although this will consequently reduce to $O(LN^2)$ as the basis vectors are pruned, this algorithm is not suitable for large data sets.

2.2 Comparison of GCV Error with Marginal Likelihood. We now compare the GCV error and marginal likelihood by studying their dependence on $\alpha_j$. In the marginal likelihood method, the optimal $\alpha_j$'s are determined by maximizing the marginal likelihood with respect to $\alpha_j$. The GCV error is minimized to get the optimal $\alpha_j$'s.

First, we study the behavior of the GCV error with reference to $\alpha_j$. The term in the denominator of the GCV error has $\operatorname{tr}(P) = \delta_j - \frac{\epsilon_j}{\Delta_j}$, where $\delta_j$ and $\epsilon_j$ are independent of $\alpha_j$ and $P$ is a positive semidefinite matrix. Further, $\operatorname{tr}(P)$ increases monotonically as a function of $\alpha_j$. Therefore, maximizing $\operatorname{tr}(P)$ (in order to minimize the GCV error) will prefer the optimal value of $\alpha_j$ to be $\infty$. Thus, the denominator term in equation 2.3 prefers a simple model. The term $t^T P^2 t$ in the numerator of equation 2.3 is the squared error at the optimal weight vector in equation 1.4. Let $g(\alpha) = t^T P^2 t$. Differentiating $g(\alpha)$ with respect to $\alpha_j$, we get

$$ \frac{\partial g(\alpha)}{\partial \alpha_j} = \frac{2\sigma^2}{\Delta_j^2} \left( b_j - \frac{c_j}{\Delta_j} \right), $$

where $b_j$ and $c_j$ are independent of $\alpha_j$ and $c_j, \Delta_j \ge 0$. If $b_j$ is nonpositive, then $g(\alpha_j)$ is a monotonically decreasing function of $\alpha_j$. Minimization of $g(\alpha_j)$ with respect to $\alpha_j$ would thus prefer $\alpha_j$ to be $\infty$. On the other hand, if $b_j$ is positive, the minimum of $g(\alpha_j)$ would depend on the sign of $\sigma^2 b_j \bar s_j - c_j$, where $\bar s_j = \phi_j^T \Sigma_{-j}^{-1} \phi_j$, $\Sigma_{-j}$ is $\Sigma$ with the contribution of basis vector $j$ removed, and $\Sigma = \sigma^2 I + \Phi A^{-1} \Phi^T$. Note that $P = \sigma^2 \Sigma^{-1}$. Therefore, the optimal choice of $\alpha_j$ using the GCV error method depends on the trade-off between the data-dependent term in the numerator and the term in the denominator that prefers $\alpha_j$ to be $\infty$.

The logarithm of marginal likelihood function is

$$ L(\alpha) \stackrel{\text{def}}{=} -\frac{1}{2} \left[ N \log(2\pi) + \log|\Sigma| + t^T \Sigma^{-1} t \right]. $$
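The dependence of the squared-error term on a single $\alpha_j$ can be checked numerically. The sketch below evaluates $g = a_j - 2b_j/\Delta_j + c_j/\Delta_j^2$ with $\Delta_j = \psi_j + \alpha_j\sigma^2$ and the slope $\frac{2\sigma^2}{\Delta_j^2}(b_j - c_j/\Delta_j)$. The coefficient values are hypothetical numbers chosen only for illustration (with $b_j \le 0$ and $c_j \ge 0$), and the function names are not from the letter.

```python
def squared_error(alpha_j, a, b, c, psi, sigma2):
    """g = t^T P^2 t as a function of alpha_j, using Delta_j = psi_j + alpha_j * sigma^2."""
    big_delta = psi + alpha_j * sigma2
    return a - 2.0 * b / big_delta + c / big_delta ** 2

def squared_error_slope(alpha_j, b, c, psi, sigma2):
    """Analytic derivative of g with respect to alpha_j."""
    big_delta = psi + alpha_j * sigma2
    return 2.0 * sigma2 / big_delta ** 2 * (b - c / big_delta)

# Hypothetical coefficients with b_j <= 0, so g should decrease monotonically.
a, b, c, psi, sigma2 = 10.0, -1.0, 0.5, 2.0, 0.1

step = 1e-6
numeric_slope = (squared_error(1.0 + step, a, b, c, psi, sigma2)
                 - squared_error(1.0 - step, a, b, c, psi, sigma2)) / (2.0 * step)
analytic_slope = squared_error_slope(1.0, b, c, psi, sigma2)
values = [squared_error(x, a, b, c, psi, sigma2) for x in (0.0, 1.0, 10.0, 100.0)]
```

For these values the finite-difference slope agrees with the analytic one, and `values` is strictly decreasing, in line with the statement that a nonpositive $b_j$ makes the numerator prefer $\alpha_j = \infty$.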

Considering the dependence of $L(\alpha)$ on a single hyperparameter $\alpha_j$, the above equation can be rewritten (Tipping & Faul, 2003) as

$$ L(\alpha) = L(\alpha_{-j}) + l(\alpha_j), $$

where $l(\alpha_j) = \frac{1}{2} \left[ \log\left( \frac{\alpha_j}{\alpha_j + \bar s_j} \right) + \frac{\bar q_j^2}{\alpha_j + \bar s_j} \right]$. $L(\alpha_{-j})$ is the marginal likelihood with $\phi_j$ excluded and is thus independent of $\alpha_j$. Here, we have defined $\bar q_j \stackrel{\text{def}}{=} \phi_j^T \Sigma_{-j}^{-1} t$. Note that $\bar q_j$ and $\bar s_j$ are independent of $\alpha_j$. The second term in $l(\alpha_j)$ comes from the data-dependent term, $t^T \Sigma^{-1} t$, in the logarithm of the marginal likelihood function, and maximization of this term prefers $\alpha_j$ to be zero. On the other hand, the first term in $l(\alpha_j)$ comes from the Ockham factor (Tipping, 2001), and maximization of this term chooses $\alpha_j$ to be $\infty$. So the optimal value of $\alpha_j$ is a trade-off between the data-dependent term and the Ockham factor.

Note that $t^T \Sigma^{-1} t = \frac{1}{\sigma^2} t^T P^2 t + w^T A w$. The term on the left-hand side of this equation appears in the negative logarithm of marginal likelihood function, while the first term on the right-hand side appears in the numerator of the GCV error. The key difference in the choice of the basis function is the additional term that is present in the data-dependent term of the marginal likelihood. For the marginal likelihood method, it has been shown that for a given basis function $j$, if $\bar q_j^2 \le \bar s_j$, then the optimal value of $\alpha_j$ is $\infty$; otherwise, the optimal value is $\frac{\bar s_j^2}{\bar q_j^2 - \bar s_j}$ (Tipping and Faul, 2003). For the GCV method, the optimal value of $\alpha_j$ depends on $\bar q_j$, $\bar s_j$, and some other quantities detailed in appendix A. Further, the sufficient condition for a basis function to be not relevant for the marginal likelihood method is $\bar q_j^2 \le \bar s_j$ and that for the GCV error method is $b_j \le 0$. Note that $b_j$ is independent of $\bar s_j$ but is dependent on $\bar q_j$. In general, a relevant vector obtained using a marginal likelihood (or GCV error) method need not be relevant in the GCV error (or marginal likelihood) method. This fact was also observed in our experiments.

We now discuss the effect of scaling on the GCV error. First, note that the GCV error is a function of $P$. With the scaling of the output $t$, there will be an associated scaling of basis functions. However, $P$ is invariant to such scaling (see equation 2.2). This will make the GCV error invariant to scaling. Also, a similar result holds for the log marginal likelihood function.

3 Fast GCV Algorithm

In this section we describe the fast GCV (FGCV) algorithm, which constructs the model sequentially starting from an empty model. The basis vectors are added sequentially, and their weightings are modified to get the maximum reduction in the GCV error. The GCV error can also be decreased by deleting those basis vectors that subsequently become redundant. By maintaining the following set of variables for every basis vector, $m$, we can find the optimal value of $\alpha_m$ for every basis vector and the corresponding GCV error efficiently:

$$ r_m = t^T P \phi_m \tag{3.1} $$

$$ \gamma_m = \phi_m^T P \phi_m \tag{3.2} $$

$$ \xi_m = t^T P^2 \phi_m \tag{3.3} $$

$$ u_m = \phi_m^T P^2 \phi_m. \tag{3.4} $$

In addition, we need

$$ v = \operatorname{tr}(P) \tag{3.5} $$

$$ q = t^T P^2 t. \tag{3.6} $$
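For a model with a single relevance vector $\phi_k$, the matrix is $P = I - \phi_k \phi_k^T / \Delta_k$ with $\Delta_k = \phi_k^T \phi_k + \alpha_k \sigma^2$, and the variables in equations 3.1 to 3.6 can be computed directly without forming $P$. The following minimal sketch does this for hypothetical three-dimensional vectors; the function names and the numbers are illustrative assumptions, and the direct computation is meant only as a reference against which incremental updates could be checked.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_p(x, phi_k, big_delta_k):
    """Return P x for P = I - phi_k phi_k^T / Delta_k (single relevance vector)."""
    scale = dot(phi_k, x) / big_delta_k
    return [xi - scale * pk for xi, pk in zip(x, phi_k)]

def per_basis_variables(t, phi_m, phi_k, big_delta_k):
    """r_m, gamma_m, xi_m, u_m of equations 3.1 to 3.4 (P is symmetric)."""
    p_t = apply_p(t, phi_k, big_delta_k)
    p_phi = apply_p(phi_m, phi_k, big_delta_k)
    return {
        "r": dot(p_t, phi_m),        # t^T P phi_m
        "gamma": dot(phi_m, p_phi),  # phi_m^T P phi_m
        "xi": dot(p_t, p_phi),       # t^T P^2 phi_m
        "u": dot(p_phi, p_phi),      # phi_m^T P^2 phi_m
    }

def global_variables(t, phi_k, big_delta_k):
    """v = tr(P) and q = t^T P^2 t of equations 3.5 and 3.6."""
    p_t = apply_p(t, phi_k, big_delta_k)
    return len(t) - dot(phi_k, phi_k) / big_delta_k, dot(p_t, p_t)

# Hypothetical vectors; alpha_k * sigma^2 = 0.5 is an arbitrary choice.
t = [1.0, 2.0, 3.0]
phi_k = [1.0, 0.0, 1.0]
phi_m = [0.0, 1.0, 1.0]
big_delta_k = dot(phi_k, phi_k) + 0.5
stats = per_basis_variables(t, phi_m, phi_k, big_delta_k)
v, q = global_variables(t, phi_k, big_delta_k)
```

Each variable costs $O(N)$ to evaluate in this single-vector case, which is why only vector inner products appear in the sketch.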

Further, after every minimization process, these variables can be updated efficiently using the rank-one update given in equation 2.10. We now give the algorithm and discuss the relevant implementation details and storage and computational requirements.

3.1 Algorithm

1. Initialize $\sigma^2$ to some reasonable value (e.g., var($t$) × .1).
2. Select the first basis vector $\phi_k$ (which forms the initial relevance vector set), and set the corresponding $\alpha_k$ to its optimal value. The remaining $\alpha$'s are notionally set to $\infty$.
3. Initialize $S^{-1}$, $w$ (which are scalars initially), and the variables given in equations 3.1 to 3.6.
4. $\alpha_k^{\text{old}} := \alpha_k$.
5. For all $j$, find the optimal solution $\alpha_j$, keeping the remaining $\alpha_i$, $i \ne j$, fixed and the corresponding GCV error. Select the basis vector $k$ for which reduction in the GCV error is maximum.
6. If $\alpha_k^{\text{old}} < \infty$ and $\alpha_k < \infty$, the relevance vector set remains unchanged.
7. If $\alpha_k^{\text{old}} = \infty$ and $\alpha_k < \infty$, add $\phi_k$ to the relevance vector set.
8. If $\alpha_k^{\text{old}} < \infty$ and $\alpha_k = \infty$, then delete $\phi_k$ from the relevance vector set.
9. Update $S^{-1}$, $w$, and the variables given in equations 3.1 to 3.6.
10. Estimate the noise level using equation 2.4. This step may be repeated once in, for example, five iterations.
11. If there is no significant change in the values of $\alpha$ and $\sigma^2$, then stop. Otherwise, go to step 4.

3.2 Implementation Details

1. Since we start from the empty model, the basis vector that gives the minimum GCV error is selected as the first basis vector (step 2). The details of this procedure are given in appendix B. The first basis vector can also be selected based on the largest normalized projection onto the target vector, as suggested in Tipping and Faul (2003).
2. The initialization of relevant variables in step 3 is described in appendix C.
3. Appendixes D and A describe the method to estimate the optimal $\alpha_i$ and the corresponding GCV error (step 5).
4. The variables in step 9 are updated using the details given in appendix E for the case in step 6 (reestimation) or step 8 (deletion) of the algorithm. The details corresponding to step 7 (addition) are given in appendix F.
5. In practice, numerical accuracies may be affected as the iterations progress. More specifically, the quantities $\gamma_m$ and $u_m$, which are expected to remain nonnegative, may become negative while updating the variables. When any of these two quantities becomes negative, it is a good idea to compute the quantities in equations 3.1 to 3.6 afresh using direct computations. If the problem still persists (this typically happens when the matrix $P$ becomes ill conditioned, for example, when the width parameter $z$ used in a gaussian kernel is large), we terminate the algorithm.
6. When the noise level is also to be estimated (step 10), all the relevant variables are calculated afresh. This computation is simplified by expanding the matrix $P$ using equation 2.2.
7. In an experiment, some of the $\alpha_j$ may reach 0, as can be seen in appendix A. This may affect the solution. Therefore, it is useful to set such $\alpha_j$ to a relatively small value, $\alpha_{\min}$. Setting this value to $1/N$ was found to work well in practice.

3.3 Storage and Computational Complexity. The storage requirements of the FGCV algorithm are more than that of the fast RVM (FRVM) algorithm in Tipping and Faul (2003) and arise from the additional variables $\xi_m$ and $u_m$ to be maintained. However, they are still linear in $N$. The computational requirements of the proposed algorithm are similar to those of the FRVM algorithm. Step 5 of the algorithm has the computational complexity of $O(N)$. This is possible using the expressions given in appendix D. The computational complexity of the reestimation or deletion of a basis vector is $O(LN)$, while that of the addition of a basis vector is $O(N^2)$. Step 10 of the algorithm, however, has a computational complexity of $O(LN^2)$ as it requires reestimating the relevant variables. We observed that the FGCV algorithm was 1.25 to 1.5 times slower than the FRVM algorithm in our simulations for these main reasons:

• The error function is shallow near the solution, which results in more iterations.
• There is a higher number of relevance vectors at the solution.
• The additional variables, $\xi_m$ and $u_m$, are updated.

Figure 1: Results on the Friedman2 data set. [Box plots of NMSE and of the number of basis vectors for algorithms 1 to 4.]

Table 1: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman2 Data Set.

        FGCV      FRVM      RFS
OLS     .47       9.4e-5    .119
RFS     .013      .097
FRVM    1.3e-12
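Steps 6 to 8 of the algorithm in section 3.1, together with the $\alpha_{\min} = 1/N$ safeguard of section 3.2, amount to simple bookkeeping on the relevance vector set. The sketch below is one possible way to express that bookkeeping; the function names and the use of a Python `set` and `math.inf` are illustrative choices, not part of the letter.

```python
import math

def clamp_alpha(alpha, n_samples):
    """Replace a zero (or very small) alpha by alpha_min = 1/N; infinity is left unchanged."""
    return max(alpha, 1.0 / n_samples)

def update_relevance_set(relevant, k, alpha_old, alpha_new):
    """Apply steps 6 to 8 for basis vector k and report which update step 9 needs."""
    was_in = not math.isinf(alpha_old)
    stays_in = not math.isinf(alpha_new)
    if was_in and stays_in:
        return "reestimate"          # step 6: set unchanged
    if not was_in and stays_in:
        relevant.add(k)              # step 7: addition
        return "add"
    if was_in and not stays_in:
        relevant.discard(k)          # step 8: deletion
        return "delete"
    return "none"                    # vector was out and stays out

relevant = {0}
first = update_relevance_set(relevant, 3, math.inf, clamp_alpha(0.0, 200))
second = update_relevance_set(relevant, 0, 2.5, math.inf)
third = update_relevance_set(relevant, 3, 0.005, 0.7)
```

After these three calls the set contains only index 3: it was added with $\alpha$ clamped to $1/200$, index 0 was deleted, and the last call only reestimates.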

4 Simulations

The proposed FGCV algorithm is evaluated on four popular benchmark data sets and compared with the algorithms described in Tipping and Faul (2003), Chen et al. (1991), and Orr (1995b) and referred to as fast RVM (FRVM), orthogonal least squares (OLS), and regularized forward selection (RFS), respectively. Two of these data sets were generated, as described by Friedman (1991), and are referred to as Friedman2 and Friedman3. The input

Figure 2: Results on the Friedman3 data set. [Box plots of NMSE and of the number of basis vectors for algorithms 1 to 4.]

Table 2: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman3 Data Set.

        FGCV      FRVM      RFS
OLS     4.6e-9    .002      .054
RFS     2.1e-6    .139
FRVM    3.0e-6

dimension for each of these data sets was four. For these data sets, the training set consisted of 200 randomly generated examples, while the test set had 1000 noise-free examples and the experiment was repeated 100 times for different training set examples. We report the normalized mean squared error (NMSE) (normalized with respect to the output variance) on the test set. The third data set used was the Boston housing data set obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). This data set comprises 506 examples with 13 variables. The data were split into 481/25 training/testing splits randomly, and the partitioning was repeated 100 times independently. The fourth data set used was the Abalone data set (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone/). After mapping the gender encoding (male/female/infant) into {(1,0,0), (0,1,0),

Figure 3: Results on the Boston Housing data set. [Box plots of MSE and of the number of basis vectors for algorithms 1 to 4.]

Table 3: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Boston Housing Data Set.

        FGCV      FRVM      RFS
OLS     2.6e-5    .096      .156
RFS     .003      .647
FRVM    2.0e-5

(0,0,1)}, the 10-dimensional data were split into 3000/1177 training/testing splits randomly. The partitioning was repeated 10 times independently. (The exact partitions for the last two data sets were obtained from http://www.gatsby.ucl.ac.uk/~chuwei/regression.html.) For all the data sets, a gaussian kernel was used, and the width parameter was chosen by using fivefold cross-validation. For the OLS and RFS algorithms, the readily available Matlab functions (http://www.anc.ed.ac.uk/~mjo/software/rbf2.zip) were used with the default settings. We adhered to the guidelines provided in Tipping and Faul (2003) for FRVM. The results obtained using the four algorithms (FGCV, algorithm 1; FRVM, algorithm 2; RFS, algorithm 3; OLS, algorithm 4) on these data sets are presented in Figures 1 to 4. From these box plots, it is clear that

Figure 4: Results on the Abalone data set. [Box plots of MSE and of the number of basis vectors for algorithms 1 to 4.]

Table 4: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Abalone Data Set.

        FGCV      FRVM      RFS
OLS     9.8e-4    9.8e-4    .403
RFS     .023      .005
FRVM    .462

the FGCV algorithm generalizes well compared to the other algorithms. However, this happens at the expense of a moderate increase in the number of basis vectors as compared to the FRVM algorithm. It is worth noting that the sparseness of the solution obtained from the FGCV algorithm is still very good. On the Abalone data set, the FRVM algorithm is slightly better than the FGCV algorithm on the average. Box plots in Figures 1 to 4 show that the distribution of MSE is nonsymmetrical. Therefore, we used Wilcoxon matched-pair signed rank tests to compare the four algorithms; the results are given in Tables 1 to 4. Each box in the table compares an algorithm in the column to an algorithm in the row. The null hypothesis is that the two medians of the test error are the


same, while the alternate hypothesis is that they are different. The p-value of this hypothesis test is given in the box. The following comparisons are made with respect to the significance level of 0.05. If a p-value is smaller than 0.05, then the algorithm in the column (row) is statistically superior to the algorithm in the row (column). Table 1 suggests that for the Friedman2 data set, the FGCV algorithm is statistically superior to the FRVM and RFS algorithms. But it is not significantly different from the OLS algorithm. The FGCV algorithm is statistically superior to all the other algorithms on the Friedman3 and the Boston Housing data sets, as is evident from Tables 2 and 3. For the Abalone data set, the results in Table 4 show that the performances of the FGCV and the FRVM algorithms are not significantly different. However, these algorithms are statistically superior to the OLS and RFS algorithms. We also compared the FGCV and FRVM algorithms with the gaussian process regression (GPR) algorithm on the Boston Housing and Abalone data sets (results reported in http://www.gatsby.ucl.ac.uk/˜chuwei/ regression.html). These comparisons were also done using Wilcoxon matched-pair signed-rank test with a significance level of .05. For the Boston Housing data set, the performance comparison gave p-values of 3.4e-10 (FGCV) and 1.22e-12 (FRVM). This shows that the GPR algorithm (with all basis vectors) is statistically superior to the FRVM algorithm. A similar comparison on the Abalone data set resulted in the p-values of .012 (FGCV) and .095 (FRVM). On this data set, the GPR algorithm is statistically superior to the FGCV algorithm, while it is not statistically superior to the FRVM algorithm.
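The paired comparisons above can be illustrated with a small signed-rank computation. The sketch below is an assumption-laden stand-in, not the procedure used to produce Tables 1 to 4: it drops zero differences, assigns average ranks to ties, and uses the normal approximation without continuity or tie correction, so its p-values can differ from those of an exact test. The error values are made up for illustration.

```python
import math

def wilcoxon_signed_rank(errors_a, errors_b):
    """Return (W_plus, two-sided p-value) for paired samples via the normal approximation."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]  # drop zero differences
    n = len(diffs)  # assumed to be at least 1
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0   # ties share the average of their ranks
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical test errors of two algorithms over ten repetitions.
errors_a = [float(i + 1) for i in range(10)]   # consistently larger errors
errors_b = [0.0] * 10
w_clear, p_clear = wilcoxon_signed_rank(errors_a, errors_b)

# A balanced case: positive and negative differences cancel.
w_tied, p_tied = wilcoxon_signed_rank([1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0])
```

In the first case every difference is positive, so the null hypothesis of equal medians is rejected at the .05 level; in the balanced case the statistic equals its mean and the p-value is 1.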

5 Conclusion

We have proposed a fast, incremental GCV algorithm for designing sparse regression models. This algorithm is very efficient and constructs the model sequentially starting from an empty model. In each iteration, it adds or deletes or reestimates the basis vectors depending on the maximum reduction in the GCV error. The experimental results suggest that, considering the requirements of sparseness, good generalization performance, and computational complexity, the FGCV algorithm is competitive. Clearly, this algorithm is an excellent alternative to the FRVM algorithm of Tipping and Faul (2003). We mainly compared our approach against OLS, RFS, and FRVM since they were quite directly related to our problem formulation. We also compared against GPR. We did not compare against the other sparse incremental GP algorithms mentioned in Csato and Opper (2002), Lawrence et al. (2003), Seeger et al. (2003), and Smola and Bartlett (2000) since we felt that those methods will be slightly inferior to GPR. But those comparisons could be interesting, especially if we compare at the same levels of sparsity. This will


be taken up in future work. It will also be interesting to extend the proposed algorithm to classification problems.

Appendix A: α Estimation Using GCV Approach

Using equations 2.5 to 2.9, the numerator of the derivative of GCV error with respect to $\alpha_j$ can be shown to take the form $g_j + h_j \alpha_j$, where

$$ g_j = (\delta_j b_j - a_j \epsilon_j)\psi_j - (\delta_j c_j - b_j \epsilon_j), \qquad h_j = (\delta_j b_j - a_j \epsilon_j)\sigma^2, \qquad \text{and} \qquad \psi_j = \phi_j^T P_j \phi_j. \tag{A.1} $$

It is easy to verify that the denominator of the derivative of GCV error with respect to $\alpha_j$ is nonnegative, and noting that $\alpha_j \ge 0$, the solution can be obtained directly using $g_j$ and $h_j$ or using the sign information of the gradient. More specifically, the optimal solution $\alpha_j$ (lying in the interval $[0, \infty)$) is obtained from one of the following possibilities:

• If $\{g_j, h_j\} < 0$, then $\alpha_j = \infty$.
• If $\{g_j, h_j\} > 0$, then $\alpha_j = 0$.
• If $g_j < 0$, $h_j > 0$, then a unique solution exists and is given by $\alpha_j = -\frac{g_j}{h_j}$.
• If $g_j > 0$, $h_j < 0$, then a unique solution exists and is given by $\alpha_j = -\frac{g_j}{h_j}$. But in this case, the derivative changes from a positive to a negative value while crossing zero. Therefore, it is possible to have solution at either 0 or $\infty$. In this case, we can evaluate the function value at 0 and $\infty$ and choose the right one.
• When $h_j = 0$, the solution is dependent on the sign of $g_j$.
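The case analysis above can be written as a short function. The sketch below is one reading of these rules under stated assumptions: it takes the coefficients $a_j, b_j, c_j, \delta_j, \epsilon_j, \psi_j$ as given, assumes $\psi_j > 0$ so that the GCV error is defined at $\alpha_j = 0$, treats the boundary cases $g_j = 0$ by the sign of the slope for $\alpha_j > 0$, and returns `math.inf` for an excluded basis vector. The test coefficients correspond to a hypothetical empty model ($P_j = I$) with $t = (1, 2, 3)^T$ and $\phi_j = (1, 0, 1)^T$.

```python
import math

def gcv_value(alpha_j, n, a, b, c, delta, eps, psi, sigma2):
    """GCV error as a function of alpha_j alone; alpha_j may be math.inf."""
    if math.isinf(alpha_j):
        return n * a / delta ** 2             # P = P_j when the vector is excluded
    big_delta = psi + alpha_j * sigma2
    squared_error = a - 2.0 * b / big_delta + c / big_delta ** 2
    return n * squared_error / (delta - eps / big_delta) ** 2

def optimal_alpha(n, a, b, c, delta, eps, psi, sigma2):
    """Pick alpha_j in [0, inf] from the sign pattern of g_j + h_j * alpha_j."""
    g = (delta * b - a * eps) * psi - (delta * c - b * eps)
    h = (delta * b - a * eps) * sigma2
    if h == 0.0:
        return 0.0 if g > 0.0 else math.inf   # slope has the sign of g everywhere
    if g <= 0.0 and h < 0.0:
        return math.inf                       # GCV error keeps decreasing
    if g >= 0.0 and h > 0.0:
        return 0.0                            # GCV error keeps increasing
    if g < 0.0 and h > 0.0:
        return -g / h                         # interior minimum
    # g > 0, h < 0: the stationary point is a maximum, so compare the two ends.
    at_zero = gcv_value(0.0, n, a, b, c, delta, eps, psi, sigma2)
    at_inf = gcv_value(math.inf, n, a, b, c, delta, eps, psi, sigma2)
    return 0.0 if at_zero <= at_inf else math.inf

# Hypothetical coefficients: a = t^T t, b = (t^T phi)^2, c = (phi^T phi)(t^T phi)^2.
coeffs = dict(n=3, a=14.0, b=16.0, c=32.0, delta=3.0, eps=2.0, psi=2.0, sigma2=1.0)
best = optimal_alpha(**coeffs)

# A basis vector orthogonal to t gives b = c = 0 and is excluded.
excluded = optimal_alpha(n=3, a=14.0, b=0.0, c=0.0, delta=3.0, eps=2.0, psi=2.0, sigma2=1.0)
```

For the first set of coefficients $g_j = -24$ and $h_j = 20$, so the interior solution $\alpha_j = 1.2$ is returned, and the GCV error there is lower than at $\alpha_j = 0$, at $\alpha_j = \infty$, and at nearby values.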

Appendix B: Selection of the First Basis Vector

Two quantities that are of interest in finding the GCV error for all the basis vectors are given by

$$ t^T P_m^2 t = \frac{a \Delta_m^2 - 2 b_m \Delta_m + c_m}{\Delta_m^2} \tag{B.1} $$

$$ \operatorname{tr}(P_m) = N - \frac{\phi_m^T \phi_m}{\phi_m^T \phi_m + \alpha_m \sigma^2}, \tag{B.2} $$

where $P_m = I - \frac{\phi_m S_m^{-1} \phi_m^T}{\sigma^2}$, $S_m = \frac{\phi_m^T \phi_m + \alpha_m \sigma^2}{\sigma^2}$, and $\Delta_m = \sigma^2 S_m$. Then the optimal solution $\alpha_m$ can be obtained, as described in appendix A, from the coefficients 2.5 to 2.9 using $P_j = I$. After substituting the optimal solution $\alpha_m$ into equations B.1 and B.2, we can evaluate the GCV error for a given $m$.


Finally, the basis vector j is selected as the index for which the GCV error is minimum. Appendix C: Initialization Once the first basis vector, φ j , is selected based on the GCV error with the optimal α j , initialization of all the relevant quantities is done as follows: r m = tT φ m − T φm − γm = φ m

(tT φ j )(φ Tj φ m ) (φ Tj φ m )2

ξm = t T φ m − 2

(C.2)

j

T φm − 2 um = φ m

v=N−

(C.1)

j

(φ Tj φ m )2 j

+

T (φ m φ m )(φ Tj φ m )2

(tT φ j )(φ Tj φ m ) j

(C.3)

2j +

(tT φ j )(φ Tj φ m )(φ Tj φ j ) 2j

φ Tj φ j

(C.5)

j

q = tT t − 2

(C.4)

(tT φ j )2 (φ Tj φ j ) (tT φ j )2 + , j 2j

(C.6)

tT φ j j

σ , S−1 = [ ] and = [φ j ]. Note j

where j = φ Tj φ j + α j σ 2 . Further, w =

2

that S−1 is a single element matrix when there is a single basis vector. Appendix D: Computing α j

Here, we find the set of coefficients required to compute the optimal α j from the set of quantities defined in equations 3.1 to 3.6. All the results given below are obtained using the rank one update relationship between P and P j . With o j =

α j σ 2r j α j σ 2 −t j

and ψ j =

o j jo j j + 2ξ j j j αjσ2 jo j j + b j = o j ξj αjσ2 j

aj =q +

c j = j o 2j ,

α j σ 2tj α j σ 2 −t j

, we have

(D.1) (D.2) (D.3)


where j = ψ j + α j σ 2 and j =

uj 2j . (α j σ 2 )2


Further,

g j = (δ j b j − a j j )ψ j − (δ j c j − b j j )

(D.4)

h j = (δ j b j − a j j )σ ,

(D.5)

2

where δ j = v + jj . For j not in the relevance vector set, we do not have to find the above set of quantities with P j as j = ∞ and P = P j . Therefore, the quantities in equations 2.5 to 2.9 required to compute the optimal α j can be found in a much simpler way. Next, the optimal solution α j is obtained using g j and h j , as mentioned earlier. (See the discussion in the paragraph below equation A.1.) Appendix E: Reestimation and Deletion of a Basis Vector T −1 Recall that the matrix P = I − Sσ 2 . Then, with σ 2 fixed (at least for the iteration under consideration or for few iterations), a change in α j (which is essentially the reestimation process itself) results in a change in S−1 . Let s j denote the jth column and sjj denote the jth diagonal element of S−1 . Note that the computations are similar to the one used for reestimation except for the coefficient K j . In the case of deletion, K j = s1 , and in the case of

jj

reestimation, K j = (sjj + (αˆ j − α j )−1 )−1 . Here, αˆ j denotes the new optimal solution. The final set of computations required for the reestimation or deletion of basis vectors is given below. Defining ρ jm = sTj T φ m , we have rm = rm + K j w j ρ jm γm = γm +

(E.1)

Kj 2 ρ . σ 2 jm

(E.2)

Defining χ jm = sTj AS−1 T φ m , we have um = um + where τ j =

Kj ρ jm (2χ jm + τ j ρ jm ), σ2

Kj T T s s j . σ2 j

(E.3)

Defining κ j = wT As j , we have

ξm = ξm + K j ρ jm κ j + K j w j (χ jm + ρ jm τ j ) v = v + τj

(E.4) (E.5)

q = q + K j σ w j (2κ j + τ j w j ). 2

(E.6)

Finally, w = w − K j w j s j and S−1 = S−1 − K j s j sTj . Although the set of equations given above is common for reestimation and deletion procedures, the


jth row and/or column is to be removed from S−1 , w, and after making the necessary updates in the deletion procedure. Appendix F: Adding a New Basis Vector On adding the new basis vector j, the dimension of changes, and a new finite α j gets defined. Defining l j = σ12 S−1 T φ j and e j = φ j − l j and µ jm = eTj φ m , we get rm = rm − w j µ jm sjj γm = γm − 2 µ2jm , σ where sjj = we have

tj σ2

1 +α j

(F.1) (F.2)

and w j =

sjj r . σ2 j

Next, defining ν jm = µ jm − lTj AS−1 T φ m ,

sjj µ jm (ξ j − w j u j ) − w j ν jm σ2 s sjj jj . um = um + 2 µ jm u µ − 2ν j jm jm σ σ2 ξ m = ξm −

(F.3) (F.4)

Next, sjj uj σ2 q = q + w j (w j u j − 2ξ j ). v=v −

(F.5) (F.6)

Also, S

−1

=

S−1 + sjj l j lTj −sjj lTj

Finally, w =

w − wjlj wj

−sjj l j sjj

.

(F.7)

and = [ φ j ].

References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Chen, S., Cowan, C. F. N., & Grant, P. M. (1991). Orthogonal least squares learning for radial basis function networks. IEEE Trans on Neural Networks, 2, 302–309.
Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273–297.
Csato, L., & Opper, M. (2002). Sparse on-line gaussian processes. Neural Computation, 14(3), 641–668.
Denison, D., & George, E. (2000). Bayesian prediction using adaptive ridge estimators. (Tech Rep.). London: Department of Mathematics, Imperial College. Available online at http://stats.ma.ic.ac.uk/dgtd/public html/Papers/grr.ps.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.
Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Lawrence, N., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 609–616). Cambridge, MA: MIT Press.
Orr, M. J. L. (1995a). Local smoothing of radial basis function networks. In Proceedings of International Symposium on Neural Networks. Hsinchu, Taiwan.
Orr, M. J. L. (1995b). Regularization in the selection of radial basis function centers. Neural Computation, 7(3), 606–623.
Seeger, M., Williams, C., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.
Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 619–625). Cambridge, MA: MIT Press.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of Royal Statistical Society (series B), 36, 111–147.
Sundararajan, S., & Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in gaussian processes. Neural Computation, 13(5), 1103–1118.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Tipping, M. E., & Faul, A. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.

Received August 29, 2005; accepted June 1, 2006.


Addendum

In “Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus” by Dmitri Pervouchine, Theoden Netoff, Horacio Rotstein, John White, Mark Cunningham, Miles Whittington, and Nancy Kopell (Vol. 18, No. 11: 2617–2650), the following paragraph was omitted at the end of Appendix section A.2:

Synaptic gating variable $s_j$ obeys first order kinetics according to the equation

$$
\frac{\partial s_j}{\partial t} = \alpha_j (1 - s_j)\left[1 + \tanh\big((v_j - v_{th})/v_{sl}\big)\right] - \beta_j s_j ,
$$

where $\alpha_j$ and $\beta_j$ are the respective synapse rise and decay rate constants, $v_{th} = 0$, and $v_{sl} = 4$. The decay rate constants were chosen to achieve the desired synapse decay time (for instance, $\beta_j \approx 0.2$ for $\tau = 5$, and $\beta_j \approx 0.05$ for $\tau = 20$). The rise rate constant $\alpha_j = 20$ used in the simulations was not accessible in the experiments.
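The link between the decay rate constant and the decay time can be checked with a short sketch of the gating equation. The function names `ds_dt` and `steady_state` are illustrative assumptions, and the bracketing of the tanh term follows the usual form of such gating kinetics. When the presynaptic voltage is far below $v_{th}$, the drive term vanishes and $s_j$ decays at rate $\beta_j$, that is, with time constant $1/\beta_j$, which matches $\beta_j$ of about 0.2 for $\tau = 5$.

```python
import math

V_TH, V_SL = 0.0, 4.0  # values stated in the addendum


def ds_dt(s, v, alpha=20.0, beta=0.2):
    """Right-hand side of the gating equation for presynaptic voltage v."""
    drive = 1.0 + math.tanh((v - V_TH) / V_SL)
    return alpha * (1.0 - s) * drive - beta * s


def steady_state(v, alpha=20.0, beta=0.2):
    """Value of s at which ds/dt = 0 for a fixed presynaptic voltage v."""
    rise = alpha * (1.0 + math.tanh((v - V_TH) / V_SL))
    return rise / (rise + beta)
```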

ARTICLE

Communicated by S. Coombes

The Astrocyte as a Gatekeeper of Synaptic Information Transfer Vladislav Volman [email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel

Eshel Ben-Jacob [email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel, and Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.

Herbert Levine [email protected] Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.

Neural Computation 19, 303–326 (2007) © 2007 Massachusetts Institute of Technology

We present a simple biophysical model for the coupling between synaptic transmission and the local calcium concentration on an enveloping astrocytic domain. This interaction enables the astrocyte to modulate the information flow from presynaptic to postsynaptic cells in a manner dependent on previous activity at this and other nearby synapses. Our model suggests a novel, testable hypothesis for the spike timing statistics measured for rapidly firing cells in culture experiments.

1 Introduction

In recent years, evidence has been mounting regarding the possible role of glial cells in the dynamics of neural tissue (Volterra & Meldolesi, 2005; Haydon, 2001; Newman, 2003; Takano et al., 2006). For astrocytes in particular, the specific association of processes with synapses and the discovery of two-way astrocyte-neuron communication has demonstrated the inadequacy of the previously held view regarding the purely supportive role for these glial cells. Instead, future progress requires rethinking how the dynamics of the coupled neuron-glial network can store, recall, and process information. At the level of cell biophysics, some of the mechanisms underlying the so-called tripartite synapse (Araque, Parpura, Sanzgiri, & Haydon, 1999) are becoming clearer. For example, it is now well established that astrocytic

mGlu receptors detect synaptic activity and respond via activation of the calcium-induced calcium release pathway, leading to elevated Ca 2+ levels. The spread of these levels within a microdomain of one cell can coordinate the activity of disparate synapses that are associated with the same microdomain (Perea & Araque, 2002). Moreover, it might even be possible to transmit information directly from domain to domain and even from astrocyte to astrocyte if the excitation level is strong enough to induce either intracellular or intercellular calcium waves (Cornell-Bell, Finkbeiner, Cooper, & Smith, 1990; Charles, Merrill, Dirksen, & Sanderson, 1991; Cornell-Bell & Finkbeiner, 1991). One sign of the maturity in our understanding is the formulation of semiquantitative models for this aspect of neuron-glial communication (Nadkarni & Jung, 2004; Sneyd, Wetton, Charles, & Sanderson, 1995; Hofer, Venance, & Giaume, 2003). There is also information flow in the opposite direction, from astrocyte to synapse. Direct experimental evidence for this, via the detection of the modulation of synaptic transmission as a function of the state of the glial cells, will be reviewed in more detail below. One of the major goals of this work is to introduce a simple phenomenological model for this interaction. The model will take into account both a deterministic effect of high Ca 2+ in the astrocytic process, namely, the reduction of the postsynaptic response to incoming spikes on the presynaptic axon (Araque, Parpura, Sanzgiri, & Haydon, 1998a), and a stochastic effect, namely, the increase in the frequency of observed miniature postsynaptic current events uncorrelated with any input (Araque, Parpura, Sanzgiri, & Haydon, 1998b). There are also direct NMDA-dependent effects on the postsynaptic neuron of astrocyte-emitted factors (Perea & Araque, 2005), which are not considered here. As we will show, the coupling allows the astrocyte to act as a gatekeeper for the synapse. 
By this, we mean that the amount of data transmitted across the synapse can be modulated by astrocytic dynamics. These dynamics may be controlled mostly by other synapses, in which case the gatekeeping will depend on dynamics external to the specific synapse under consideration. Alternatively, the dynamics may depend mostly on excitation from the selfsame synapse, in which case the behavior of the entire system is determined self-consistently. Here we focus on the latter possibility and leave for future work the discussion of how this mechanism could lead to multisynaptic coupling. Our ideas regarding the role of the astrocyte offer a new explanation for observations regarding firing patterns in cultured neuronal networks. In particular, spontaneous bursting activity in these networks is regulated by a set of rapidly firing neurons, which we refer to as spikers; these neurons exhibit spiking even during long interburst intervals and hence must have some form of self-consistent self-excitation. We model these neurons as containing astrocyte-mediated self-synapses (autapses) (Segal, 1991, 1994; Bekkers & Stevens, 1991) and show that this hypothesis naturally accounts


for the observed unusual interspike interval distribution. Additional tests of this hypothesis are proposed at the end.

2 Experimental Observations

2.1 Cultured Neuronal Networks. The cultured neuronal networks presented here are self-generated from dissociated cultures of mixed cortical neurons and glial cells drawn from 1-day-old Charles River rats. The dissection, cell dissociation, and recording procedures were previously described in detail (Segev et al., 2002). Briefly, following dissection, neurons are dispersed by enzymatic treatment and mechanical dissociation. The cells are then homogeneously plated on multielectrode arrays (MEA, Multi-Channel Systems) precoated with poly-L-lysine. The culture medium was DMEM (Sigma) enriched with serum and changed every two days. Plated cultures are placed on the MEA board (B-MEA-1060, Multi Channel Systems) for simultaneous long-term noninvasive recordings of neuronal activity from several neurons at a time. Recorded signals are digitized and stored for off-line analysis on a PC via an A-D board (Microstar DAP) and data acquisition software (Alpha-Map, Alpha Omega Engineering). Noninvasive recording of the network's activity (action potentials) is possible due to the capacitive coupling that some of the neurons form with some of the electrodes. Since typically one electrode can record signals from several neurons, a specially developed spike-sorting algorithm (Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000) is utilized to reconstruct single neuron-specific spike series. Although there are no externally provided guiding stimulations or chemical cues, relatively intense dynamical activity is spontaneously generated within several days. The activity is marked by the formation of synchronized bursting events (SBEs): short (∼200 ms) time windows during which most of the recorded neurons participate in relatively rapid firing (Segev & Ben-Jacob, 2001).
These SBEs are separated by long intervals (several seconds or more) of sporadic neuronal firing of most of the neurons. A few neurons (referred to as spiker neurons) exhibit rapid firing even during the inter-SBE time intervals. These neurons also exhibit much faster firing rates during the SBEs, and their interspike intervals distribution is marked by a long-tail behavior (see Figure 4). 2.2 Interspike Interval (ISI) Increments Distribution. One of the tools used to compare model results with measured spike data concerns the distribution of increments in the spike times, defined as δ(i) = ISI (i + 1) − ISI (i), i ≥ 1. The distribution of δ(i) will have heavy tails if there is a wide range of interspike intervals and rapid transitions from one type of interval to the next. For example, rapid transitions from bursting events to occasional interburst firings will lead to such a tail. Applying this analysis to the recorded spike data of cultured cortical networks, Segev et al. (2002) found


that distributions of neurons’ ISI increments can be well fitted with Levy functions over three decades in time.

3 The Model

In this section we present the mathematical details of the models employed in this work. Readers interested mainly in the conclusions can skip directly to section 4. The basic notion we use is that standard synapse models must be modified to account for the astrocytic modulation, depending, of course, on the calcium level. In turn, the astrocytic calcium level is affected by synaptic activity; for this, we use the Li-Rinzel model where the IP$_3$ concentration parameter governing the excitability is increased on neurotransmitter release. These ingredients suffice to demonstrate what we mean by gatekeeping. Finally, we apply this model to the case of an autaptic oscillator, which requires the introduction of neuronal dynamics. For this, we chose the Morris-Lecar model as a generic example of a type-I firing system. None of our results would be altered with a different choice as long as we retain the tangent-bifurcation structure, which allows for arbitrarily long interspike intervals.

3.1 TUM Synapse Model. To describe the kinetics of a synaptic terminal, we have used the model of an activity-dependent synapse first introduced by Tsodyks, Uziel, and Markram (2000). In this model, the effective synaptic strength evolves according to the following equations:

$$
\begin{aligned}
\dot{x} &= \frac{z}{\tau_{rec}} - u x\, \delta(t - t_{sp}) \\
\dot{y} &= -\frac{y}{\tau_{in}} + u x\, \delta(t - t_{sp}) \\
\dot{z} &= \frac{y}{\tau_{in}} - \frac{z}{\tau_{rec}}
\end{aligned}
\tag{3.1}
$$

Here, x, y, and z are the fractions of synaptic resources in the recovered, active, and inactive states, respectively. For an excitatory glutamatergic synapse, the values attained by these variables can be associated with the dynamics of vesicular glutamate. As an example, the value of y in this formulation will be proportional to the amount of glutamate that is being released during the synaptic event, and the value of x will be proportional to the size of a readily releasable vesicle pool. The time series $t_{sp}$ denotes the arrival times of presynaptic spikes, $\tau_{in}$ is the characteristic time of postsynaptic currents (PSCs) decay, and $\tau_{rec}$ is the recovery time from synaptic depression. Upon arrival of a spike at the presynaptic terminal at time $t_{sp}$, a fraction u of available synaptic resources is transferred from the recovered


Table 1: Parameters Used in Simulations.

Parameter   Value                  Parameter   Value                  Parameter   Value
P0          0.5                    η_mean      1.2 · 10⁻³ µA cm⁻²     σ           0.1
τ_ca        4 sec                  κ           0.5 sec⁻¹              τ_IP3       7 sec
r_IP3       7.2 mM sec⁻¹           IP3*        0.16 µM                c1          0.185
v1          6 sec⁻¹                v2          0.11 sec⁻¹             v3          0.9 µM sec⁻¹
k3          0.1 µM                 d1          0.13 µM                d2          1.049 µM
d3          0.9434 µM              d5          0.08234 µM             a2          0.2 µM⁻¹ sec⁻¹
c0          2.0 µM                 g_Ca        1.1 mS cm⁻²            g_K         2.0 mS cm⁻²
g_L         0.5 mS cm⁻²            V_Ca        100 mV                 V_K         −70 mV
V_L         −35 mV                 V1          −1 mV                  V2          15 mV
V3          10 mV                  V4          14.5 mV                φ           0.3
I_base      0.34 µA cm⁻²           τ_d         10 msec                τ_rec       100 msec
u           0.1                    A           10.00 µA cm⁻²

state to the active state. Once in the active state, synaptic resources rapidly decay to the inactive state, from which they recover within a timescale $\tau_{rec}$. Since the typical times are assumed to satisfy $\tau_{rec} \gg \tau_{in}$, the model predicts onset of short-term synaptic depression after a period of high-frequency repetitive firing. The onset of depression can be controlled by the variable u, which describes the effective use of synaptic resources by the incoming spike. In the original TUM model, the variable u is taken to be constant for the excitatory postsynaptic neuron; in what follows, we will set u = 0.1. Other parameter choices for these equations as well as for the rest of the model equations are presented in Table 1. To complete the specification, it is assumed that the resulting PSC, arriving at the model neurons’ soma through the synapse, depends linearly on the fraction of available synaptic resources. Hence, a total synaptic current seen by a neuron is $I_{syn}(t) = A y(t)$, where A stands for an absolute synaptic strength. At this stage, we do not take into account the long-term effects associated with the plasticity of neuronal somata and take the parameter A to be time independent.

3.2 Astrocyte Response. Astrocytes adjacent to synaptic terminals respond to the neuronal action potentials by binding glutamate to their metabotropic glutamate receptors (Porter & McCarthy, 1996). The activation of these receptors then triggers the production of IP$_3$, which consequently serves to modulate the intracellular concentration of calcium ions; the effective rate of IP$_3$ production depends on the amount of transmitter released during the synaptic event. We therefore assume that the production of intracellular IP$_3$ in the astrocyte is given by

$$
\frac{d[IP_3]}{dt} = \frac{IP_3^{*} - IP_3}{\tau_{ip3}} + r_{ip3}\, y.
\tag{3.2}
$$
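The synapse equations 3.1 and the IP$_3$ production rule 3.2 can be sketched with a plain forward-Euler loop. This is a minimal sketch under stated assumptions: the function names are illustrative, time is measured in seconds, $\tau_{in}$ is taken to be the 10 msec entry listed as $\tau_d$ in Table 1, and the numerical value of $r_{ip3}$ is used as printed in Table 1 without any unit conversion. The delta-function terms are applied as instantaneous jumps at spike times.

```python
def simulate_tum(spike_times, t_end, dt=1e-4, u=0.1, tau_rec=0.1, tau_in=0.01):
    """Integrate x, y, z of equations 3.1; return per-spike release and final state."""
    x, y, z = 1.0, 0.0, 0.0          # all resources start in the recovered state
    spikes = sorted(spike_times)
    released = []                    # fraction u*x moved to the active state per spike
    next_spike = 0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        # Delta-function terms: instantaneous transfer from recovered to active.
        while next_spike < len(spikes) and spikes[next_spike] <= t + 0.5 * dt:
            jump = u * x
            x -= jump
            y += jump
            released.append(jump)
            next_spike += 1
        # Continuous part of equations 3.1 (forward Euler).
        dx = z / tau_rec
        dy = -y / tau_in
        dz = y / tau_in - z / tau_rec
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    return released, (x, y, z)


def ip3_rhs(ip3, y, ip3_star=0.16, tau_ip3=7.0, r_ip3=7.2):
    """Right-hand side of equation 3.2 (values as printed in Table 1)."""
    return (ip3_star - ip3) / tau_ip3 + r_ip3 * y
```

Feeding a regular 50 Hz train, for example `simulate_tum([0.02 * k for k in range(10)], 0.2)`, gives a released fraction per spike that falls below its initial value of u, which is the short-term depression described in the text; the postsynaptic current would then be A times y.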


Equation 3.2 is similar to the formulation used by Nadkarni and Jung (2004), with some important differences. First, the effective rate of IP$_3$ production depends not on the potential of neuronal membrane, but on the amount of neurotransmitter that is being released into the synaptic cleft. Hence, as the resources of synapse are depleted (due to depression), there will be less transmitter released, and therefore the IP$_3$ will be produced at lower rates, leading eventually to decay of calcium concentration. Second, as the neurotransmitter is released also during spontaneous synaptic events (noise), the latter will also influence the production of IP$_3$ and subsequent calcium oscillations.

3.3 Astrocyte. To model the dynamics of a single astrocytic domain, we use the Li-Rinzel model (Li & Rinzel, 1994; Nadkarni & Jung, 2004), which has been specifically developed to take into account the IP$_3$-dependent dynamical changes in the concentration of cytosolic Ca$^{2+}$. This is based on the theoretical studies of Nadkarni and Jung (2004), where it is decisively demonstrated that astrocytic Ca$^{2+}$ oscillations may account for the spontaneous activity of neurons. The intracellular concentration of Ca$^{2+}$ in the astrocyte is described by the following set of equations:

$$
\frac{d[Ca^{2+}]}{dt} = -J_{chan} - J_{pump} - J_{leak}
\tag{3.3}
$$

$$
\frac{dq}{dt} = \alpha_q (1 - q) - \beta_q q .
\tag{3.4}
$$

Here, q is the fraction of activated IP$_3$ receptors. The fluxes of currents through ER membrane are given in the following expressions:

$$
J_{chan} = c_1 v_1 m_\infty^3 n_\infty^3 q^3 \left([Ca^{2+}] - [Ca^{2+}]_{ER}\right)
\tag{3.5}
$$

$$
J_{pump} = \frac{v_3 [Ca^{2+}]^2}{k_3^2 + [Ca^{2+}]^2}
\tag{3.6}
$$

$$
J_{leak} = c_1 v_2 \left([Ca^{2+}] - [Ca^{2+}]_{ER}\right),
\tag{3.7}
$$

where

$$
m_\infty = \frac{[IP_3]}{[IP_3] + d_1}
\tag{3.8}
$$

$$
n_\infty = \frac{[Ca^{2+}]}{[Ca^{2+}] + d_5}
\tag{3.9}
$$

$$
\alpha_q = a_2 d_2 \frac{[IP_3] + d_1}{[IP_3] + d_3}
\tag{3.10}
$$

$$
\beta_q = a_2 [Ca^{2+}] .
\tag{3.11}
$$

The reversal Ca$^{2+}$ concentration ($[Ca^{2+}]_{ER}$) is obtained after requiring conservation of the overall Ca$^{2+}$ concentration:

$$
[Ca^{2+}]_{ER} = \frac{c_0 - [Ca^{2+}]}{c_1} .
\tag{3.12}
$$
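Equations 3.3 to 3.12 translate directly into a right-hand-side function. The sketch below uses the Table 1 values, with concentrations in µM and time in seconds; the dictionary `LR` and the function names are illustrative assumptions rather than anything defined in the paper.

```python
# Table 1 values for the astrocyte model (concentrations in uM, time in sec).
LR = dict(c0=2.0, c1=0.185, v1=6.0, v2=0.11, v3=0.9, k3=0.1,
          d1=0.13, d2=1.049, d3=0.9434, d5=0.08234, a2=0.2)


def fluxes(ca, q, ip3):
    """Return (J_chan, J_pump, J_leak) from equations 3.5 to 3.7."""
    ca_er = (LR["c0"] - ca) / LR["c1"]                 # equation 3.12
    m_inf = ip3 / (ip3 + LR["d1"])                     # equation 3.8
    n_inf = ca / (ca + LR["d5"])                       # equation 3.9
    j_chan = LR["c1"] * LR["v1"] * m_inf**3 * n_inf**3 * q**3 * (ca - ca_er)
    j_pump = LR["v3"] * ca**2 / (LR["k3"]**2 + ca**2)
    j_leak = LR["c1"] * LR["v2"] * (ca - ca_er)
    return j_chan, j_pump, j_leak


def li_rinzel_rhs(ca, q, ip3):
    """Return (d[Ca]/dt, dq/dt) from equations 3.3 and 3.4."""
    j_chan, j_pump, j_leak = fluxes(ca, q, ip3)
    alpha_q = LR["a2"] * LR["d2"] * (ip3 + LR["d1"]) / (ip3 + LR["d3"])  # 3.10
    beta_q = LR["a2"] * ca                                               # 3.11
    d_ca = -j_chan - j_pump - j_leak
    d_q = alpha_q * (1.0 - q) - beta_q * q
    return d_ca, d_q
```

Because the cytosolic concentration is normally below the ER concentration, J_chan and J_leak are negative and act as sources of cytosolic calcium, while J_pump removes it. With no IP$_3$ the channel term vanishes, and at a cytosolic level of 0.1 µM the pump outweighs the leak, so calcium falls.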

3.4 Glia-Synapse Interaction. Astrocytes affect synaptic vesicle release in a calcium-dependent manner. Rather than attempt a complete biophysical model of the complex chain of events leading from calcium rise to vesicle release (Gandhi & Stevens, 2003), we proceed in a phenomenological manner. We define a dynamical variable f that phenomenologically will capture this interaction. When the concentration of calcium in its synapse-associated process exceeds a threshold, we assume that the astrocyte emits a finite amount of neurotransmitter into the perisynaptic space, thus altering the state of a nearby synapse; this interaction occurs via glutamate binding to presynaptic mGlu and NMDA receptors (Zhang et al., 2004). As the internal astrocyte resource of neurotransmitter is finite, we include the saturation term (1 − f) in the dynamical equation for f. The final form is

$$
\dot{f} = \frac{-f}{\tau_{Ca^{2+}}} + (1 - f)\,\kappa\,\Theta\!\left([Ca^{2+}] - [Ca^{2+}_{threshold}]\right).
\tag{3.13}
$$
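A minimal sketch of the dynamics of f follows, reading the threshold term as a step that switches on when astrocytic calcium exceeds the threshold, consistent with the description above. The function name `f_rhs` is illustrative, and the threshold is a required argument because Table 1 lists no value for it; any number passed in is an assumption of the caller.

```python
def f_rhs(f, ca, ca_threshold, tau_ca=4.0, kappa=0.5):
    """Right-hand side of the gating equation for f (rates in 1/sec).

    f decays with time constant tau_ca and is driven toward 1 at rate
    kappa only while the calcium level ca exceeds ca_threshold.
    """
    above = 1.0 if ca > ca_threshold else 0.0
    return -f / tau_ca + (1.0 - f) * kappa * above
```

With the Table 1 values, calcium held above threshold drives f toward κ / (κ + 1/τ_ca) = 2/3 rather than 1, so in this sketch the evoked release is never fully switched off by a sustained calcium elevation.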

Given this assumption, equations 3.1 should be modified to take this modulation into account. We assume the following simple form:

$$
\dot{x} = \frac{z}{\tau_{rec}} - (1 - f)\, u x\, \delta(t - t_{sp}) - x\,\eta(f)
\tag{3.14}
$$

$$
\dot{y} = \frac{-y}{\tau_{in}} + (1 - f)\, u x\, \delta(t - t_{sp}) + x\,\eta(f).
\tag{3.15}
$$
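The effect of the gating variable on evoked release can be seen from the spike-triggered jump alone. The sketch below applies only the delta-function terms of the modified synapse equations; the noise term η(f) is deliberately left out, and the function name `evoked_release` is an illustrative assumption.

```python
def evoked_release(x, y, f, u=0.1):
    """Apply one presynaptic spike to the gated synapse.

    A fraction (1 - f) * u of the recovered resources x moves to the
    active state y; f = 0 recovers the original TUM jump, f = 1 blocks it.
    """
    jump = (1.0 - f) * u * x
    return x - jump, y + jump
```

For a fully recovered synapse, `evoked_release(1.0, 0.0, 0.5)` moves 0.05 of the resources to the active state, half of the ungated value.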

In equations 3.14 and 3.15, η(f) represents a noise term modeling the increased occurrence of mini-PSCs. The fact that a noise increase accompanies an amplitude decrease is partially due to competition for synaptic resources between these two release modes (Otsu et al., 2004). Based on experimental observations, we prescribe that the dependence of η(f) on f is such that the rate of noise occurrence (the frequency of η(f) in a fixed time step) increases with increasing f, but the amplitude distribution (modeled here as a gaussian-distributed variable centered around positive mean) remains unchanged. For the rate of noise occurrence, we chose the following functional dependence:

$$
P(f) = P_0 \exp\left[-\left(\frac{1 - f}{\sqrt{2}\,\sigma}\right)^{2}\right],
\tag{3.16}
$$


with $P_0$ representing the maximal frequency of η(f) in a fixed time step. Note that although both synaptic terminals and astrocytes utilize glutamate for their signaling purposes, we assume the two processes to be independent. In so doing, we rely on existing biophysical experiments demonstrating that whereas a presynaptic terminal releases glutamate in the synaptic cleft, astrocytes selectively target extrasynaptic glutamate receptors (Araque et al., 1998a, 1998b). Hence, synaptic transmission does not interfere with the astrocyte-to-synapse signaling.

3.5 Neuron Model. We describe the neuronal dynamics with a simplified two-component Morris-Lecar model (Morris & Lecar, 1981),

$$
\dot{V} = -I_{ion}(V, W) + I_{ext}(t)
\tag{3.17}
$$

$$
\dot{W}(V) = \phi\, \frac{W_\infty(V) - W(V)}{\tau_W(V)},
\tag{3.18}
$$

with $I_{ion}(V, W)$ representing the contribution of the internal ionic Ca$^{2+}$, K$^{+}$, and leakage currents with their corresponding channel conductivities $g_{Ca}$, $g_K$, and $g_L$ being constant:

$$
I_{ion}(V, W) = g_{Ca}\, m_\infty(V)(V - V_{Ca}) + g_K W(V)(V - V_K) + g_L (V - V_L).
\tag{3.19}
$$

$I_{ext}$ represents all the external current sources stimulating the neuron, such as signals received through its synapses, glia-derived currents, artificial stimulations, as well as any noise sources. In the absence of any such stimulation, the fraction of open potassium channels, W(V), relaxes toward its limiting curve (nullcline) $W_\infty(V)$, which is described by the sigmoid function,

$$
W_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_1}{V_2}\right)\right]
\tag{3.20}
$$

within a characteristic timescale given by

$$
\tau_W(V) = \frac{1}{\cosh\left(\dfrac{V - V_1}{2 V_2}\right)} .
\tag{3.21}
$$

In contrast to this, it is assumed in the Morris-Lecar model that calcium channels are activated immediately. Accordingly, the fraction of open Ca$^{2+}$ channels obeys the following equation:

$$
m_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_3}{V_4}\right)\right].
\tag{3.22}
$$
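The neuron equations 3.17 to 3.22 can be written as a small set of functions. This is a sketch under stated assumptions: the dictionary `ML` and the function names are illustrative, the parameter values are those of Table 1, and the half-activation constants are assigned exactly as in equations 3.20 and 3.22 ($V_1$, $V_2$ for the potassium gate and $V_3$, $V_4$ for the calcium gate).

```python
import math

# Table 1 values (conductances in mS/cm^2, voltages in mV).
ML = dict(g_ca=1.1, g_k=2.0, g_l=0.5, v_ca=100.0, v_k=-70.0, v_l=-35.0,
          v1=-1.0, v2=15.0, v3=10.0, v4=14.5, phi=0.3)


def w_inf(v):
    """Limiting fraction of open potassium channels, equation 3.20."""
    return 0.5 * (1.0 + math.tanh((v - ML["v1"]) / ML["v2"]))


def tau_w(v):
    """Relaxation timescale of W, equation 3.21."""
    return 1.0 / math.cosh((v - ML["v1"]) / (2.0 * ML["v2"]))


def m_inf(v):
    """Instantaneous fraction of open calcium channels, equation 3.22."""
    return 0.5 * (1.0 + math.tanh((v - ML["v3"]) / ML["v4"]))


def i_ion(v, w):
    """Total ionic current, equation 3.19."""
    return (ML["g_ca"] * m_inf(v) * (v - ML["v_ca"])
            + ML["g_k"] * w * (v - ML["v_k"])
            + ML["g_l"] * (v - ML["v_l"]))


def morris_lecar_rhs(v, w, i_ext):
    """Return (dV/dt, dW/dt) from equations 3.17 and 3.18."""
    d_v = -i_ion(v, w) + i_ext
    d_w = ML["phi"] * (w_inf(v) - w) / tau_w(v)
    return d_v, d_w
```

Keeping the external current as an explicit argument mirrors the separation the model makes between the synaptic current and the constant background current, which are summed before they enter the voltage equation.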

For an isolated neuron, rendered with a single autapse, one has $I_{ext}(t) = I_{syn}(t) + I_{base}$, where $I_{syn}(t)$ is the current arriving through the self-synapse and $I_{base}$ is some constant background current. In this work, we assume that $I_{base}$ is such that, when acting alone, it causes a neuron to fire at a very low constant rate. Of course, these two terms enter the equation additively, and the dynamics depends on the total external current. Nonetheless, it is important to separate these terms, as only one of them enters through the synapse; it is only this term that is modulated by astrocytic glutamate release and only this term that would be changed by synaptic blockers. As we will note later, the baseline current may also be due to astrocytes, albeit to a direct current directed into the neuronal soma. In anticipation of a better future understanding of this term, we consider it separately from the constant appearing in leak current ($g_L V_L$), although there is clearly some redundancy in the way these two terms set the operating point of the neuron.

4 Results

4.1 Synaptic Model. In simple models of neural networks, the synapse is considered to be a passive element that directly transmits information, in the form of arriving spikes on the presynaptic terminal, to postsynaptic currents. It has been known for a long time that more complex synaptic dynamics can affect this transfer. One such effect concerns the finite reservoir of presynaptic vesicle resources and was modeled by Tsodyks, Uziel, and Markram (TUM) (Tsodyks et al., 2000). Spike trains with too high a frequency will be attenuated by a TUM synapse, as there is insufficient recovery from one arrival to the next. To demonstrate this effect, we fed the TUM synaptic model with an actual spike train recorded from a neuron in a cultured network (shown in Figure 1a); the resulting postsynaptic current (PSC) is shown in Figure 1b.
As is expected, there is attenuation of the PSC height during time windows with high rates of presynaptic spiking input. 4.2 Effect of Presynaptic Gating. Our goal is to extend the TUM model to include the interaction of the synapse with an astrocytic process imagined to be wrapped around the synaptic cleft. The effects of astrocytes on stimulated synaptic transmission are well established. Araque et al. (1998a) report that astrocyte stimulation reduced the magnitude of

[Figure 1 image: panels (a) to (d); horizontal axis: time, 0 to 4 sec.]

Figure 1: The generic effect of an astrocyte on the presynaptic depression, as captured by our phenomenological model (see text for details). To illustrate the effect of presynaptic depression and the astrocyte influence, we feed a model synapse with the input of spikes taken from the recorded activity of a cultured neuronal network (see the text and Segev & Ben-Jacob, 2001 for details). (a) The input sequence of spikes that is fed into the model presynaptic terminal. (b) Each spike arriving at the model presynaptic terminal results in the postsynaptic current (PSC). The strength of the PSC depends on the amount of available synaptic resources, and the synaptic depression effect is clearly observable during spike trains with relatively high frequency. (c) The effect of a periodic gating function, $f(t) = 0.5 + f_0 \sin(\omega t)$, shown in (d). The period of the oscillation, $T = 2\pi/\omega = 2$ sec, is taken to be compatible with the typical timescales of variations in the intraglial Ca$^{2+}$ concentration. Note the reduction in the PSC near the maxima of f, along with the elevated baseline resulting from the increase in the rate of spontaneous presynaptic transfer.

action-potential-evoked excitatory and inhibitory synaptic currents by decreasing the probability of evoked transmitter release. Specifically, presynaptic metabotropic glutamate receptors (mGluRs) have been shown to affect the stimulated synaptic transmission by regulating presynaptic voltagegated calcium channels, which eventually leads to the reduction of calcium flux during the incoming spike and results in a decrease of amplitude of synaptic transmission. These results are best shown in Figure 8 of their paper, which presents the amplitude of evoked EPSC both before and after stimulation of an associated astrocyte. Note that we are referring here to “faithful" synapses—those that transmit almost all of the incoming spikes. Effects of astrocytic stimulation on highly stochastic synapses, namely, the increase in fidelity (Kang, Jiang, Goldman, & Nedergaard, 1998), are not studied here. In addition, astrocytes were shown to increase the frequency of spontaneous synaptic events. In detail, Araque et al. (1998b) have shown that


astrocyte stimulation increases the frequency of miniature postsynaptic currents (mPSC) without modifying their amplitude distribution, suggesting that astrocytes act to increase the probability of vesicular release from the presynaptic terminal. Although the exact mechanism is unknown, this effect is believed to be mediated by NMDA receptors located at the presynaptic terminal. It is important to note that the two kinds of astrocytic influence on the synapse (decrease of the probability of evoked release and increase in the probability of spontaneous release) do not contradict each other. Evoked transmitter release depends on the calcium influx through calcium channels that can be inhibited by the activation of presynaptic mGluRs. On the other hand, the increase in the probability of spontaneous release follows because of the activation of presynaptic NMDA channels. In addition, spontaneous activity can deplete the vesicle pool (in terms of either number or filling) and hence directly lower triggered release amplitudes (Otsu et al., 2004). We model these effects by two modifications of the TUM model. First, we introduce a gating function f that modulates the stimulated release in a calcium-dependent manner. This term will cause the synapse to turn off at high calcium. This presynaptic gating effect is demonstrated in Figure 1c, where we show the resulting PSC corresponding to a case in which f is chosen to vary periodically with a timescale consistent with astrocytic calcium dynamics. The effect on the recorded spike train data is quite striking. The second effect, the increase of stochastic release in the absence of any input, is included as an f -dependent noise term in the TUM equations. This will be important as we turn to a self-consistent calculation of the synapse coupled to a dynamical astrocyte. 4.3 The Gatekeeping Effect. 
We close the synapse-glia-synapse feedback loop by inclusion of the effect of the presynaptic activity on the intracellular Ca2+ dynamics in the astrocyte that in turn sets the value of the gating function f. Nadkarni and Jung (2004) have argued that the basic calcium phenomenology in the astrocyte, arising via the glutamate-induced production of IP3, can be studied using the Li-Rinzel model. What emerges from their work is that the dynamics of the intra-astrocyte Ca2+ level depends on the intensity of the presynaptic spike train, acting as an information integrator over a timescale on the order of seconds; the level of Ca2+ in the astrocyte increases according to the summation of the synaptic spikes over time. If the total number of spikes is low, the Ca2+ concentration in the astrocyte remains below a self-amplification threshold level and simply decays back to its resting level with some characteristic time. However, things change dramatically when a sufficiently intense set of signals arises across the synapse. Now the Ca2+ concentration overshoots its linear response level, followed by decaying oscillations. Given our results, these high Ca2+ levels in the astrocyte will in fact attenuate spike information that arrives subsequent to strong bursts of activity. We illustrate this time-delayed gatekeeping (TDGK) effect in

V. Volman, E. Ben-Jacob, and H. Levine

[Figure 2: panels (a) to (d); horizontal axis: time [sec], 0 to 20.]

Figure 2: The gatekeeping effect in a glia-gated synapse. (a) The input sequence of spikes, which is composed of several copies of the sequence shown in Figure 1, separated by segments of long quiescent time. The resulting time series may be viewed as bursts of action potentials arriving at the model presynaptic terminal. The first burst of spikes results in the elevation of free astrocyte Ca2+ concentration (b), but this elevation alone is not sufficient to evoke an oscillatory response. An additional elevation of Ca2+, leading to the emergence of oscillation, is provided by the second burst of spikes arriving at the presynaptic terminal. Once the astrocytic Ca2+ crosses a predefined threshold, it starts to exert a modulatory influence back on the presynaptic terminal. In the model, this is manifested by the rising dynamics of the gating function (c). Note that as the decay time of the gating function f is on the order of seconds, the astrocyte influence on the presynaptic terminal persists even after the concentration of astrocyte Ca2+ has fallen. This is best seen from (d), where we show the profile of the PSC. The third burst of spikes arriving at the presynaptic terminal is modulated due to the astrocyte, even though the concentration of Ca2+ is relatively low at that time. This modulation extends also to the fourth burst of spikes, which together with the third burst leads again to the oscillatory response of astrocyte Ca2+. Taken together, these results illustrate a temporally nonlocal gatekeeping effect of glia cells.

Figure 2. We constructed a spike train by placing a time delay in between segments of recorded sequences. As can be seen, since the degree of activity during the first two segments exceeds the threshold level, there is attenuation of the late-arriving segments. Thus, the information passed through the synapse is modulated by previously arriving data.

4.4 Autaptic Excitatory Neurons. Our new view of synaptic dynamics will have broad consequences for making sense of neural circuitry. To illustrate this prospect, we turn to the study of an autaptic oscillator (Seung, Lee, Reis, & Tank, 2000), by which we mean an excitatory neuron


that exhibits repeated spiking driven at least in part by self-synapses (Segal, 1991, 1994; Bekkers & Stevens, 1991; Lubke, Markram, Frotscher, & Sakmann, 1996). By including the coupling of a model neuron to our synapse system, we can investigate both the case of the role of an associated astrocyte with externally imposed temporal behavior and the case where the astrocyte dynamics is itself determined by feedback from this particular synapse. Finally, we should be clear that when we refer to one synapse, we are also dealing implicitly with the case of multiple self-synapses, all coupled to the same astrocytic domain, which in turn is exhibiting correlated dynamics in its processes connecting to these multiple sites. It is important to note that this same modulation can in fact correlate multiple synapses connecting distinct neurons coupled to the same astrocyte. The effect of this new multisynaptic coupling on the spatiotemporal flow of information in a model network will be described elsewhere. We focus on an excitatory neuron modeled with Morris-Lecar dynamics, as described in section 3. We add some external bias current so as to place the neuron in a state slightly beyond the saddle-node bifurcation, where it would spontaneously oscillate at a very low frequency in the absence of any synaptic input. We then assume that this neuron has a self-synapse (autapse). An excitatory self-synapse clearly has the possibility of causing a much higher spiking rate than would otherwise be the case; this behavior without any astrocyte influence is shown in Figure 3. The existence of autaptic neurons was originally demonstrated in cultured networks (Segal, 1991, 1994; Bekkers & Stevens, 1991) but has been detected in intact neocortex as well (Lubke et al., 1996). Importantly, these can be either inhibitory or excitatory. There has been some speculation regarding the role of autapses in memory (Seung et al., 2000), but this is not our concern here. 
Are such neurons observed experimentally? In Figure 4 we show a typical raster plot recorded from a cultured neural network grown from a dissociated mixture of glial and neuronal cortical cells taken from 1-day-old Charles River rats (see section 2). The spontaneous activity of the network is marked by synchronized bursting events (SBEs): short (several hundred msec) periods during which most of the recorded neurons show relatively rapid firing, separated by long (order of seconds) time intervals of sporadic neuronal firing of most of the neurons (Segev & Ben-Jacob, 2001). Only a small fraction of special neurons (termed spiker neurons) exhibit rapid firing also during inter-SBE intervals. These spiker neurons also exhibit much higher firing rates during the SBEs. But the behavior of these rapidly firing neurons does not match that expected of the simple autaptic oscillator. The major differences, as illustrated by comparing Figures 3 and 4, are (1) the existence of long interspike intervals for the spikers, marked by a long-tail (Lévy) distribution of the increments of the interspike intervals, and (2) the beating or burstlike rate modulation in the temporal ordering of the spike train.

[Figure 3: panels (a) to (c). Panels (a) and (b): horizontal axis time [sec], 0 to 5. Panel (c): probability distribution versus δ(ISI) [msec], double-logarithmic axes.]

Figure 3: Activity of a model neuron containing the self-synapse (autapse), as modeled by the classical Tsodyks-Uziel-Markram model of synaptic transmission. In this case, it is possible to recover some of the features of cortical rapidly firing neurons, namely, the relatively high-frequency persistent activity. However, the resulting time series of action potentials for such a model neuron, shown in (a), is almost periodic. Due to the self-synapse, a periodic series of spikes results in the periodic pattern for the postsynaptic current, shown in (b), which closes the self-consistency loop by causing a model neuron to generate a periodic time series of spikes. A further difference between the model neuron and cortical rapidly firing neurons is seen upon comparing the corresponding distributions of ISI increments, plotted on a double-logarithmic scale. These distributions, shown in (c), disclose that, contrary to the cortical rapidly firing neurons, the increments distribution for the model neuron with a TUM autapse (diamonds) is gaussian (seen as a stretched parabola on a double-log scale), pointing to the existence of a characteristic timescale. On the other hand, distributions for cortical neurons (circles) decay algebraically and are much broader. The distribution of the model neuron has been vertically shifted for clarity of comparison.

[Figure 4: panels (a) to (c). Panel (a): raster plot, neuron # (1 to 60) versus time [sec], 0 to 90. Panel (b): raster plot, neuron # (1 to 60) versus time [msec], 0 to 800. Panel (c): probability distribution versus δ(ISI) [msec], double-logarithmic axes.]

Figure 4: Electrical activity of in-vitro cortical networks. These cultured networks are spontaneously formed from a dissociated mixture of cortical neurons and glial cells drawn from 1-day-old Charles River rats. The cells are homogeneously spread over a lithographically specified area of Poly-D-Lysine for attachment to the recording electrodes. The activity of a network is marked by formation of synchronized bursting events (SBEs), short (∼100–400 msec) periods of time during which most of the recorded neurons are active. (a) Raster plot of recorded activity, showing a sample of a few SBEs. The time axis is divided into $10^{-1}$ s bins. Each row is a binary bar code representation of the activity of an individual neuron; the bars mark detection of spikes. Note that while the majority of the recorded neurons are firing rapidly mostly during SBEs, some neurons are marked by persistent intense activity (e.g., neuron no. 12). This property supports the notion that the activity of these neurons is autonomous and hence self-amplified. (b) A zoomed view of a sample synchronized bursting event. Note that each neuron has its own pattern of activity during the SBE. To assess the differences in activity between ordinary neurons and neurons that show intense firing between the SBEs, for each neuron we constructed the series of increments of interspike intervals (ISI), defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distributions of δ(i), shown in (c), disclose that the dynamics of ordinary neurons (squares) is similar to the dynamics of rapidly firing neurons (circles), up to the timescale of 100 msec, corresponding to the width of a typical SBE. Note that since increments of interspike intervals are analyzed, the increased rate of neurons firing does not necessarily affect the shape of the distribution. Yet above the characteristic time of 100 msec, the distributions diverge, possibly indicating the existence of additional mechanisms governing the activity of rapidly firing neurons on a longer timescale. Note that for normal neurons, there is another peak at typical interburst intervals (> seconds), not shown here.
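The increment series δ(i) = ISI(i + 1) − ISI(i) defined in the caption above can be computed directly from a list of spike times. The following is a minimal sketch, not the analysis code used for the figures: the function names, the use of the absolute value of the increments (needed to place them on logarithmic axes), and the logarithmic binning are all assumptions made for illustration.

```python
import numpy as np

def isi_increments(spike_times_ms):
    """Return delta(i) = ISI(i+1) - ISI(i) for a sequence of spike times (msec)."""
    times = np.sort(np.asarray(spike_times_ms, dtype=float))
    isi = np.diff(times)   # interspike intervals
    return np.diff(isi)    # increments of consecutive intervals

def log_binned_density(increments, n_bins=20):
    """Estimate the density of |delta| on logarithmically spaced bins.

    Assumption: magnitudes are used, and zero increments are dropped,
    because a double-logarithmic plot needs strictly positive values.
    """
    mags = np.abs(np.asarray(increments, dtype=float))
    mags = mags[mags > 0]
    edges = np.logspace(np.log10(mags.min()), np.log10(mags.max()), n_bins + 1)
    density, _ = np.histogram(mags, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
    return centers, density

# Hypothetical spike train: intervals of 10, 20, and 5 msec.
print(isi_increments([0.0, 10.0, 30.0, 35.0]))  # increments are 10 and -15 msec
```

For a strictly periodic spike train every increment is zero, which is why this statistic is sensitive to sudden changes in firing rather than to the mean rate itself.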


Motivated by the above and the glial gatekeeping effect studied earlier, we proceed to test if an autaptic oscillator with a glial-regulated self-synapse will bring the model into better agreement. In Figure 5 we show that the activity of such a modified model does show the additional modulation. The basic mechanism results from the fact that after a period of rapid firing of the neuron, the astrocyte intracellular Ca2+ concentration (shown in Figure 5b) exceeds the critical threshold for time-delayed attenuation. This then stops the activity and gives rise to large interspike intervals. The distributions shown in Figure 5 are a much better match to experimental data for time intervals up to 100 msec.

5 Robustness Tests

5.1 Stochastic Li-Rinzel Model. One of the implicit assumptions of our model for astrocyte-synapse interaction is related to the deterministic nature of astrocyte calcium release. It is assumed that in the absence of any IP3 signals from the associated synapses, the astrocyte will stay “silent,” in the sense that there will be no spontaneous Ca2+ events. However, it should be kept in mind that the equations for the calcium channel dynamics used in the context of the Li-Rinzel model in fact describe the collective behavior of large numbers of channels. In reality, experimental evidence indicates that the calcium release channels in astrocytes are spatially organized in small clusters of 20 to 50 channels, the so-called microdomains. These microdomains were found to contain small membrane leaflets (O(10 nm) thick), wrapping around the synapses and potentially able to synchronize ensembles of synapses. This finding calls for a new view of astrocytes as cells with multiple functional and structural compartments. The microdomains (within the same astrocyte) have been observed to generate spontaneous Ca2+ signals. As the passage of the calcium ions through a single channel is subject to fluctuations, the stochastic aspects can become important for small clusters of channels. Inclusion of stochastic effects can explain the generation of calcium puffs: fast, localized elevations of calcium concentration. Hence, it is important to test the possible effect of stochastic calcium events on the model’s behavior. We achieve this goal by replacing the deterministic Li-Rinzel model with its stochastic version, obtained using the Langevin approximation, as has been recently described by Shuai and Jung (2003). With the Langevin approach, the equation for the fraction of open calcium channels is modified and takes the following form:

$$\frac{dq}{dt} = \alpha_q (1 - q) - \beta_q q + \xi(t), \tag{5.1}$$

[Figure 5: panels (a) to (e). Panels (a) to (d): horizontal axis time [sec], 0 to 20. Panel (e): probability distribution versus δ(ISI) [msec], double-logarithmic axes.]

Figure 5: The activity of a model neuron containing a glia-gated autapse. The equations of synaptic transmission for this case have been modified to take into account the influence of the synaptically associated astrocyte, as explained in the text. The resulting spike time series, shown in (a), deviates from periodicity due to the slow modulation of the synapse by the adjacent astrocyte. The relatively intense activity at the presynaptic terminal activates astrocyte receptors, which in turn leads to the production of IP3 and subsequent oscillations of free astrocyte Ca2+ concentration. The period of these oscillations, shown in (b), is much larger than the characteristic time between spikes arriving at the presynaptic terminal. Because Ca2+ dynamics is oscillatory, so also will be the dynamics of the gating function f, as is seen from (c), and the period of oscillations for f will follow the period of Ca2+ oscillations. The periodic behavior of f leads to slow periodic modulation of the PSC pattern (shown in (d)), which closes the self-consistency loop by causing a neuron to fire in a burstlike manner. Additional information is obtained after comparison of distributions for ISI increments, shown in (e). Contrary to results for the model neuron with a simple autapse (see Figure 3c), the distribution for a glia-gated autaptic model neuron (diamonds) now closely follows the distributions of two sample recorded cortical rapidly firing neurons (circles), up to the characteristic time of ∼100 msec, which corresponds to the width of a typical SBE. The heavy tails of the recorded distributions above this characteristic time indicate that network mechanisms are involved in shaping the form of the distribution on longer timescales.

[Figure 6: panels (a) to (d); horizontal axis: time [sec], 0 to 80.]

Figure 6: The dynamical behavior of an astrocyte-gated model autaptic neuron, including the stochastic release of calcium from the ER of the astrocyte. Shown are the results of the simulation when calcium release from intracellular stores is mediated by a cluster of N = 10 channels. The generic form of the spike time series (shown in (a)) does not differ from that obtained for the deterministic model. Namely, even for the stochastic model, the neuron is still firing in a burstlike manner. Although the temporal profile of astrocyte calcium (b) is irregular, the resulting dynamics of the gating function (c) is relatively smooth, stemming from the choice of the gating function dynamics (being an integration over the calcium profile). As a result, the PSC profile (d) does not differ much from the corresponding PSC profile obtained for the deterministic model.

in which the stochastic term, ξ(t), has the following properties:

$$\langle \xi(t) \rangle = 0, \tag{5.2}$$

$$\langle \xi(t)\,\xi(t') \rangle = \frac{\alpha_q (1 - q) + \beta_q q}{N}\,\delta(t - t'). \tag{5.3}$$

In the limit of very large cluster size, N → ∞, the effect of stochastic Ca2+ release is not significant. On the contrary, the dynamics of calcium release are greatly modified for small cluster sizes. A typical spike-time series of a glia-gated autaptic neuron, obtained for the cluster size of N = 10 channels, is shown in Figure 6a. Note that while there appear considerable fluctuations in the concentration of astrocyte calcium (see Figure 6b), the dynamics of the gating function (see Figure 6c) is less irregular. This follows because our choice of the gating function corresponds to the integration of calcium events. We have also checked that the distribution of interspike intervals is practically unchanged (data not shown). All told, our results indicate that including the stochastic nature of the release of calcium from the astrocyte ER does not affect the dynamics of our model autaptic neuron in any significant way.
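The Langevin equation for the open-channel fraction q, with noise strength scaling as 1/N, can be integrated with a simple Euler-Maruyama step. The sketch below is only an illustration under stated assumptions: the rates are held constant (in the Li-Rinzel model they depend on the Ca2+ and IP3 concentrations), q is clipped to the interval [0, 1] as a numerical safeguard, and the function name and parameter values are hypothetical rather than taken from the article.

```python
import numpy as np

def simulate_open_fraction(alpha, beta, n_channels, dt, n_steps, q0=0.5, seed=0):
    """Euler-Maruyama integration of dq/dt = alpha*(1-q) - beta*q + xi(t).

    xi(t) is zero-mean white noise with
    <xi(t) xi(t')> = (alpha*(1-q) + beta*q) / n_channels * delta(t - t').
    Assumption: alpha and beta are constants here.
    """
    rng = np.random.default_rng(seed)
    q = np.empty(n_steps + 1)
    q[0] = q0
    for k in range(n_steps):
        drift = alpha * (1.0 - q[k]) - beta * q[k]
        variance = (alpha * (1.0 - q[k]) + beta * q[k]) / n_channels
        noise = np.sqrt(max(variance, 0.0) * dt) * rng.standard_normal()
        # Clipping keeps q a valid fraction; this is an added safeguard.
        q[k + 1] = np.clip(q[k] + drift * dt + noise, 0.0, 1.0)
    return q

# Hypothetical rates (1/sec); compare a small and a large channel cluster.
q_small = simulate_open_fraction(1.0, 1.0, n_channels=10, dt=0.01, n_steps=5000)
q_large = simulate_open_fraction(1.0, 1.0, n_channels=1000, dt=0.01, n_steps=5000)
print(q_small[1000:].std(), q_large[1000:].std())
```

With these settings the fluctuations of q shrink as the cluster size grows, in line with the statement that stochastic release becomes insignificant as N → ∞.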

[Figure 7: probability distribution versus δ(ISI) [msec], double-logarithmic axes. Legend: $\tau_f = 4$ sec, $\kappa = 5 \times 10^{-4}$; $\tau_f = 40$ sec, $\kappa = 1 \times 10^{-4}$.]

Figure 7: Distributions of interspike interval increments for the model of an astrocyte-gated autaptic neuron with slow dynamics of the gating function, as compared with the corresponding distribution for the deterministic Li-Rinzel model. Due to the slow dynamics of the gating function, the transitions between different phases of bursting are blurred, resulting in a weaker tail for the distribution of interspike interval increments.
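To make the contrast between fast and slow gating concrete, a first-order gating variable driven by astrocytic calcium can be sketched as follows. The specific form used here, $df/dt = -f/\tau_f + \kappa (1 - f)\,\Theta(c - c_{th})$, is an assumption chosen to be consistent with the description of a decay time $\tau_f$, an accumulation rate $\kappa$, and a calcium threshold; the threshold value, the square calcium pulse, and the function name are hypothetical, and the parameter values are only of the order quoted in the text.

```python
import numpy as np

def simulate_gating(ca, dt, tau_f, kappa, ca_threshold=0.2):
    """Forward-Euler integration of an assumed first-order gating equation.

    df/dt = -f / tau_f + kappa * (1 - f) * H(ca - ca_threshold),
    where H is the Heaviside step. f accumulates while calcium is above
    threshold and decays with time constant tau_f otherwise.
    """
    f = np.zeros(len(ca))
    for k in range(1, len(ca)):
        drive = kappa if ca[k - 1] > ca_threshold else 0.0
        f[k] = f[k - 1] + dt * (-f[k - 1] / tau_f + (1.0 - f[k - 1]) * drive)
    return f

# Hypothetical calcium trace (arbitrary units): elevated between 5 and 10 sec.
dt = 0.01
t = np.arange(0.0, 30.0, dt)
ca = np.where((t >= 5.0) & (t < 10.0), 0.5, 0.05)

f_fast = simulate_gating(ca, dt, tau_f=4.0, kappa=0.5)    # faster gating
f_slow = simulate_gating(ca, dt, tau_f=40.0, kappa=0.1)   # slower gating
print(f_fast.max(), f_slow.max())
```

In this sketch the slow variant rises less during the calcium pulse but persists far longer after it, which is one way to picture why slower gating blurs the transitions between bursting phases.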

5.2 The Correlation Time of the Gating Function. Another assumption made in our model concerns the dynamics of the gating function. We have assumed the simple first-order differential equation for the dynamics of our phenomenological gating function and have selected timescales that are believed to be consistent with the influence of astrocytes on synaptic terminals. However, because the exact nature of the underlying processes (and corresponding timescales) is unknown, it is important to test the robustness of the model to variations in the gating function dynamics. To do that, we altered the baseline dynamics of the gating function to have a slower characteristic decay time and a slower rate of accumulation; for example, we can set $\tau_f = 40$ sec and $\kappa = 0.1$ sec$^{-1}$. Simulations show that the only effect is a slight blurring of the transition between different phases of the bursting, as would be expected. This can best be detected by looking at the distribution of interspike interval increments, for the case of slow gating dynamics. The distribution, shown in Figure 7, has a weaker tail as compared to the distribution obtained for the faster gating dynamics. This result follows because for slower gating, the modulation of the postsynaptic current is weaker. Hence, the transitions from intense firing

[Figure 8: panels (a) to (d); horizontal axis: time [sec], 0 to 90.]

Figure 8: The dynamical behavior of an astrocyte-gated model autaptic neuron with slowly oscillating background current. Shown are the results of the simulation when $I_{base} \propto \sin(2\pi t / T)$, T = 10 sec. The mean level of $I_{base}$ is set so as to put a neuron in the quiescent phase for half a period. The resulting spike time series (a) disclose the burstlike firing of a neuron, with the superimposed oscillatory dynamics of a background current. The variations in the concentration of astrocyte calcium (b) are much more temporally localized, and so is the resulting dynamics of the gating function (c). Consequently, the PSC profile (d) strongly reflects the burstlike synaptic transmission efficacy, thus forcing the neuron to fire in a burstlike manner and closing the self-consistency loop.

to low-frequency spiking are less abrupt, resulting in a relatively low proportion of large increments. It is worth remembering that large increments of interspike intervals reflect sudden changes in dynamics, which are eliminated by the blurring. Clearly, the model with fast gating does a better job in fitting the spiker data.

5.3 Time-Dependent Background Current. All of the main results were obtained under the assumption of constant background current feeding into the neuronal soma, such that when acting alone, this current forces the model neuron to fire at a very low frequency. One may justly argue that there is no such thing as constant current. Indeed, if a background current has to do with the biological reality, then it should possess some dynamics. For example, a better match would be to imagine the background current to be associated with the activity in adjacent astrocytes (see, e.g., Angulo, Kozlov, Charpak, & Audinat, 2004). To test this, we simulated a glia-gated autaptic neuron subject to a slowly oscillating (T = 10 sec) background current. For this case, we found that the behavior of the model is generically the same. Yet now the transitions between the bursting phases are sharper (see Figure 8a). This in turn leads to the sharper modulation of postsynaptic currents (shown in Figure 8d). We can confirm this by noting that the distribution of interspike interval increments has a slightly heavier tail, as compared to the distribution obtained for


the case of constant background current (data not shown). On the other hand, replacing the constant current with the oscillating one introduces a typical frequency not seen in the actual spiker data. This artificial problem will presumably disappear when the background current is determined self-consistently as part of the overall network activity. Similarly, the key to extending the increments distributions to longer timescales seems to be getting the network feedback to the spikers to regulate the interburst timing, which at the moment is too regular. This will be presented in a future publication.

6 Discussion

In this article, we have proposed that the regulation of synaptic transmission by astrocytic calcium dynamics is a critical new component of neural circuitry. We have used existing biophysical experiments to construct a coupled synapse-astrocyte model to illustrate this regulation and explore its consequences for an autaptic oscillator, arguably the most elementary neural circuit. Our results can be compared to data taken from cultured neuron networks. This comparison reveals that the glial gatekeeping effect appears to be necessary for an understanding of the interspike interval distribution of observed rapidly firing spiker neurons, for timescales up to about 100 msec. Of course, many aspects of our modeling are quite simplified as compared to the underlying biophysics. We have investigated the sensitivity of our results to the modification of some of the parameters of our model as well as the addition of more complex dynamics for the various parts of our system. Our results with regard to the interspike interval are exceedingly robust. This work should be viewed as a step toward understanding the full dynamical consequences brought about by the strong reciprocal couplings between synapses and the glial processes that envelop them. We have focused on the fact that astrocytic emissions shut down synaptic transmission when the activity becomes too high.
This mechanism appears to be a necessary part of the regulation of spiker activity; without it, spikers would fire too often, too regularly. Related work by S. Nadkarni and P. Jung (private communication, July 2005) focuses on a different aspect: that of increased fidelity of synaptic release (for otherwise highly stochastic synapses) due to glia-mediated increases in presynaptic calcium levels. As our working assumption is that the spikers are most likely to be neurons with “faithful” autapses, this effect does not play a role in our attempt to compare to the experimental data. It will of course be necessary to combine these two different pieces to obtain a more complete picture. The application to spikers is just one way in which our new synaptic dynamics may alter our thinking about neural circuits. This particular application is appealing and informative but must at the moment be


considered an untested hypothesis. Future experimental work must test the assumption that spikers have significant excitatory autaptic coupling, that pharmacological blockage of the synaptic current reverts their firing to low-frequency, almost periodic patterns, and that cutting the feedback loop with the enveloping astrocyte eliminates the heavy-tail increment distribution. Work toward achieving these tests is ongoing. In the experimental system, a purported autaptic neuron is part of an active network and would therefore receive input currents from the other neurons in the network. This more complex input would clearly alter the very-long-time interspike interval distribution, especially given the existence of a new interburst timescale in the problem. Similarly, the current approach of adding a constant background current to the neuron is not realistic; the actual background current, due to such processes as glial-generated currents in the cell soma, would again alter the long-time distribution. Preliminary tests have shown that these effects could extend the range of agreement between autaptic oscillator statistics and experimental measurements. Just as the network provides additional input for the spiker, the spiker provides part of the stimulation that leads to the bursting dynamics. Future work will endeavor to create a fully self-consistent network model to explore the overall activity patterns of this system. One issue that needs investigation concerns the role that glia might have in coordinating the action of neighboring synapses. It is well known that a single astrocytic process might contact thousands of synapses; if the calcium excitation spreads from being a local increase in a specific terminus to being a more widespread phenomenon within the glial cell body, neighboring synapses can become dynamically coupled. The role of this extra complexity in shaping the burst structure and time sequence is as yet unknown.

Acknowledgments

We thank Gerald M.
Edelman for insightful conversation about the possible role of glia. Eugene Izhikevich, Peter Jung, Suhita Nadkarni, and Itay Baruchi are acknowledged for useful comments and for the critical reading of an earlier version of this manuscript. V. V. thanks the Center for Theoretical Biological Physics for hospitality. This work has been supported in part by the NSF-sponsored Center for Theoretical Biological Physics (grant numbers PHY-0216576 and PHY-0225630) and by the Maguy-Glass Chair in Physics of Complex Systems.

References

Angulo, M., Kozlov, A., Charpak, S., & Audinat, E. (2004). Glutamate released from glial cells synchronizes neuronal activity in the hippocampus. J. Neurosci., 24(31), 6920–6927.


Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998a). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. Eur. J. Neurosci., 10(6), 2129–2142.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998b). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. J. Neurosci., 18(17), 6822–6829.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1999). Tripartite synapses: Glia, the unacknowledged partner. Trends in Neurosci., 22(5), 208–215.
Bekkers, J., & Stevens, C. (1991). Excitatory and inhibitory autaptic currents in isolated hippocampal neurons maintained in cell culture. Proc. Nat. Acad. Sci., 88, 7834–7838.
Charles, A., Merrill, J., Dirksen, E., & Sanderson, M. (1991). Inter-cellular signaling in glial cells: Calcium waves and oscillations in response to mechanical stimulation and glutamate. Neuron, 6, 983–992.
Cornell-Bell, A., & Finkbeiner, S. (1991). Ca2+ waves in astrocytes. Cell Calcium, 12, 185–204.
Cornell-Bell, A., Finkbeiner, S., Cooper, M., & Smith, S. (1990). Glutamate induces calcium waves in cultured astrocytes: Long-range glial signaling. Science, 247, 470–473.
Gandhi, A., & Stevens, C. (2003). Three modes of synaptic vesicular release revealed by single-vesicle imaging. Nature, 423, 607–613.
Haydon, P. (2001). Glia: Listening and talking to the synapse. Nat. Rev. Neurosci., 2(3), 185–193.
Hofer, T., Venance, L., & Giaume, C. (2003). Control and plasticity of inter-cellular calcium waves in astrocytes. J. Neurosci., 22, 4850–4859.
Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detection and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640.
Kang, J., Jiang, L., Goldman, S., & Nedergaard, M. (1998). Astrocyte-mediated potentiation of inhibitory synaptic transmission. Nat. Neurosci., 1, 683–692.
Li, Y., & Rinzel, J. (1994). Equations for inositol-triphosphate receptor-mediated calcium oscillations derived from a detailed kinetic model: A Hodgkin-Huxley like formalism. J. Theor. Biol., 166, 461–473.
Lubke, J., Markram, H., Frotscher, M., & Sakmann, B. (1996). Frequency and dendritic distributions of autapses established by layer-5 pyramidal neurons in developing rat cortex. J. Neurosci., 16, 3209–3218.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nadkarni, S., & Jung, P. (2004). Spontaneous oscillations of dressed neurons: A new mechanism for epilepsy? Phys. Rev. Lett., 91(26).
Newman, E. (2003). New roles for astrocytes: Regulation of synaptic transmission. Trends in Neurosci., 26(10), 536–542.
Otsu, Y., Shahrezaei, V., Li, B., Raymond, L., Delaney, K., & Murphy, T. (2004). Competition between phasic and asynchronous release for recovered synaptic vesicles at developing hippocampal autaptic synapses. J. Neurosci., 24(2), 420–433.
Perea, G., & Araque, A. (2002). Communication between astrocytes and neurons: A complex language. J. Physiol., 96, 199–207.


Perea, G., & Araque, A. (2005). Properties of synaptically evoked astrocyte calcium signals reveal synaptic information processing by astrocytes. J. Neurosci., 25, 2192–2203.
Porter, J., & McCarthy, K. (1996). Hippocampal astrocytes in situ respond to glutamate released from synaptic terminals. J. Neurosci., 16(16), 5073–5081.
Segal, M. (1991). Epileptiform activity in microcultures containing one excitatory hippocampal neuron. J. Neuroanat., 65, 761–770.
Segal, M. (1994). Endogenous bursts underlie seizurelike activity in solitary excitatory hippocampal neurons in microculture. J. Neurophysiol., 72, 1874–1884.
Segev, R., & Ben-Jacob, E. (2001). Spontaneous synchronized bursting activity in 2D neural networks. Physica A, 302, 64–69.
Segev, R., Benveniste, M., Hulata, E., Cohen, N., Paleski, A., Kapon, E., Shapira, Y., & Ben-Jacob, E. (2002). Long term behavior of lithographically prepared in-vitro neural networks. Phys. Rev. Lett., 88, 118102.
Seung, H., Lee, D., Reis, B., & Tank, D. (2000). The autapse: A simple illustration of short-term analog memory storage by tuned synaptic feedback. J. Comp. Neurosci., 9, 171–185.
Shuai, J., & Jung, P. (2003). Langevin modeling of intra-cellular calcium dynamics. In M. Falcke & D. Malchow (Eds.), Understanding calcium dynamics: Experiments and theory (pp. 231–252). Berlin: Springer.
Sneyd, J., Wetton, B., Charles, A., & Sanderson, M. (1995). Intercellular calcium waves mediated by diffusion of inositol triphosphate: A two-dimensional model. Am. J. Physiology, 268, C1537–C1545.
Takano, T., Tian, G., Peng, W., Lou, N., Libionka, W., Han, X., & Nedergaard, M. (2006). Astrocyte-mediated control of cerebral blood flow. Nat. Neurosci., 9(2), 260–267.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
Volterra, A., & Meldolesi, J. (2005). Astrocytes, from brain glue to communication elements: The revolution continues. Nat. Neurosci., 6, 626–640.
Zhang, Q., Pangrsic, T., Kreft, M., Krzan, M., Li, N., Sul, J., Halassa, M., van Bockstaele, E., Zorec, R., & Haydon, P. (2004). Fusion-related release of glutamate from astrocytes. J. Biol. Chem., 279, 12724–12733.

Received January 6, 2006; accepted May 25, 2006.

LETTER

Communicated by Gert Cauwenberghs

Thermodynamically Equivalent Silicon Models of Voltage-Dependent Ion Channels

Kai M. Hynna

Kwabena Boahen
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.

We model ion channels in silicon by exploiting similarities between the thermodynamic principles that govern ion channels and those that govern transistors. Using just eight transistors, we replicate, for the first time in silicon, the sigmoidal voltage dependence of activation (or inactivation) and the bell-shaped voltage dependence of its time constant. We derive equations describing the dynamics of our silicon analog and explore its flexibility by varying its parameters. In addition, we validate the design by implementing a channel with a single activation variable. The design’s compactness allows tens of thousands of copies to be built on a single chip, facilitating the study of biologically realistic models of neural computation at the network level in silicon.

1 Neural Models

A key computational component within the neurons of the brain is the ion channel. These channels display a wide range of voltage-dependent responses (Llinas, 1988). Some channels open as the cell depolarizes; others open as the cell hyperpolarizes; a third kind exhibits transient dynamics, opening and then closing in response to changes in the membrane voltage. The voltage dependence of the channel plays a functional role. Cells in the gerbil medial superior olivary complex possess a potassium channel that activates on depolarization and helps in phase-locking the response to incoming auditory information (Svirskis, Kotak, Sanes, & Rinzel, 2002). Thalamic cells possess a hyperpolarization-activated cation current that contributes to rhythmic bursting in thalamic neurons during periods of sleep by depolarizing the cell from hyperpolarized levels (McCormick & Pape, 1990). More ubiquitously, action potential generation is the result of voltage-dependent dynamics of a sodium channel and a delayed rectifier potassium channel (Hodgkin & Huxley, 1952).

Researchers use a variety of techniques to study voltage-dependent channels, each possessing distinct advantages and disadvantages.
Neural Computation 19, 327–350 (2007)

© 2007 Massachusetts Institute of Technology


K. Hynna and K. Boahen

Neurobiologists perform experiments on real cells, both in vivo and in vitro; however, while working with real cells eliminates the need for justifying assumptions within a model, limitations in technology restrict recordings to tens of neurons. Computational neuroscientists create computer models of real cells, using simulations to test the function of a channel within a single cell or a network of cells. While simulations provide a great deal of flexibility, computational neuroscientists are often limited by the processing power of the computer and must constantly balance the complexity of the model with practical simulation times. For instance, a Sun Fire 480R takes 20 minutes to simulate 1 second of the 4000-neuron network (M. Shelley, personal communication, 2004) of Tao, Shelley, McLaughlin, and Shapley (2004), just 1 mm² (a single hypercolumn) of a sublayer (4Cα) of the primary visual cortex (V1).

An emerging medium for modeling neural circuits is the silicon chip, a technique at the heart of neuromorphic engineering. To model the brain, neuromorphic engineers use the transistor’s physical properties to create silicon analogs of neural circuits. Rather than build abstractions of neural processing, which make gross simplifications of brain function, the neuromorphic engineer designs circuit components, such as ion channels, from which silicon neurons are built. Silicon is an attractive medium because a single chip can have thousands of heterogeneous silicon neurons that operate in real time. Thus, network phenomena can be studied without waiting hours, or days, for a simulation to run. To date, however, neuromorphic models have not captured the voltage dependence of the ion channel’s temporal dynamics, a problem outstanding since 1991, the year Mahowald and Douglas (1991) published their seminal work on silicon neurons.
Neuromorphic circuits are constrained by surface area on the silicon die, limiting their complexity, as more complex circuits translate to fewer silicon neurons on a chip. In the face of this trade-off, previous attempts at designing neuromorphic models of voltage-dependent ion channels (Mahowald & Douglas, 1991; Simoni, Cymbalyuk, Sorensen, Calabrese, & DeWeerth, 2004) sacrificed the time constant’s voltage dependence, keeping the time constant fixed. In some cases, however, this nonlinear property is critical. An example is the low-threshold calcium channel in the thalamus’s relay neurons. The time constant for inactivation can vary over an order of magnitude depending on the membrane voltage. This variation defines the relative lengths of the interburst interval (long) and the burst duration (short) when the cell bursts rhythmically. Sodium channels involved in spike generation also possess a voltage-dependent inactivation time constant that varies from a peak of approximately 8 ms just below spike threshold to as fast as 1 ms at the peak of the voltage spike (Hodgkin & Huxley, 1952). Ignoring this variation by fixing the time constant alters the dynamics that shape the action potential. For example, lowering the maximum (peak) time constant would reduce the

Voltage-Dependent Silicon Ion Channel Models


size of the voltage spike due to faster inactivation of sodium channels below threshold. Inactivation of these channels is a factor in the failure of action potential propagation in Purkinje cells (Monsivais, Clark, Roth, & Hausser, 2005). On the other hand, increasing the minimum time constant at more depolarized levels—that is, near the peak voltage of the spike—would increase the width of the action potential, as cell repolarization begins once the potassium channels overcome the inactivating sodium channel. A wider spike could influence the cell’s behavior through several mechanisms, such as those triggered by increased calcium entry through voltage-dependent calcium channels. In this letter, we present a compact circuit that models the nonlinear dynamics of the ion channel’s gating particles. Our circuit is based on linear thermodynamic models of ion channels (Destexhe & Huguenard, 2000), which apply thermodynamic considerations to the gating particle’s movement, due to conformation of the ion channel protein in an electric field. Similar considerations of the transistor make clear that both the ion channel and the transistor operate under similar principles. This observation, originally recognized by Carver Mead (1989), allows us to implement the voltage dependence of the ion channel’s temporal dynamics, while at the same time using fewer transistors than previous neuromorphic models that do not possess these nonlinear dynamics. With a more compact design, we can incorporate a larger number of silicon neurons on a chip without sacrificing biological realism. The next section provides a brief tutorial on the similarities between the underlying physics of thermodynamic models and that of transistors. In section 3, we derive a circuit that captures the gating particle’s dynamics. 
In section 4, we derive the equations defining the dynamics of our variable circuit, and section 5 describes the implementation of an ion channel population with a single activation variable. Finally, we discuss the ramifications of this design in the conclusion.

2 Ion Channels and Transistors

Thermodynamic models of ion channels are founded on Hodgkin and Huxley’s (empirical) model of the ion channel. A channel model consists of a series of independent gating particles whose binary state (open or closed) determines the channel permeability. A Hodgkin-Huxley (HH) variable represents the probability of a particle being in the open state or, with respect to the channel population, the fraction of gating particles that are open. The kinetics of the variable are simply described by

$$(1 - u) \underset{\beta(V)}{\overset{\alpha(V)}{\rightleftharpoons}} u, \tag{2.1}$$


where α(V) and β(V) define the voltage-dependent transition rates between the states (indicated by the arrows), u is the HH variable, and (1 − u) represents the closed fraction. α(V) and β(V) define the voltage-dependent dynamics of the two-state gating particle. At steady state, the opening flux (the total number of gating particles that are opening, which depends on the number of channels closed and the opening rate) is balanced by the closing flux (the total number of gating particles that are closing). Increasing one of the transition rates (through, for example, a shift in the membrane voltage) will increase the respective flow of particles changing state; the system will find a new steady state with new open and closed fractions such that the fluxes again cancel each other out. We can describe these dynamics simply using a differential equation:

$$\frac{du}{dt} = \alpha(V)\,(1 - u) - \beta(V)\,u. \tag{2.2}$$
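As an informal check of equation 2.2, the following minimal Python sketch integrates it with forward Euler at a single clamped voltage, so that both rates are constant. The rate values, step size, and function name are illustrative assumptions, not quantities taken from this letter. The open fraction should settle where the opening and closing fluxes cancel, that is, at α/(α + β).

```python
def simulate_gate(alpha, beta, u0=0.0, dt=1e-3, t_end=5.0):
    """Forward-Euler integration of du/dt = alpha*(1 - u) - beta*u (equation 2.2)."""
    u = u0
    for _ in range(int(round(t_end / dt))):
        opening_flux = alpha * (1.0 - u)   # opening rate times closed fraction
        closing_flux = beta * u            # closing rate times open fraction
        u += dt * (opening_flux - closing_flux)
    return u

# Hypothetical rates (per ms) at one clamped voltage; not values from the letter.
ALPHA, BETA = 3.0, 1.0
u_final = simulate_gate(ALPHA, BETA)
u_expected = ALPHA / (ALPHA + BETA)        # fraction at which the two fluxes cancel
print(u_final, u_expected)
```

Starting from a fully open population (`u0=1.0`) should lead to the same settled value, since the balance point does not depend on the initial condition.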

The first term of equation 2.2 represents the opening flux, the product of the opening transition rate α(V) and the fraction of particles closed (1 − u). The second term represents the closing flux. Depending on which flux is larger (opening or closing), the fraction of open channels u will increase or decrease accordingly. Equation 2.2 is often expressed in the following form:

$$\frac{du}{dt} = -\frac{1}{\tau_u(V)}\,\bigl(u - u_\infty(V)\bigr), \tag{2.3}$$

where

$$u_\infty(V) = \frac{\alpha(V)}{\alpha(V) + \beta(V)}, \tag{2.4}$$

$$\tau_u(V) = \frac{1}{\alpha(V) + \beta(V)}, \tag{2.5}$$

represent the steady-state level and time constant (respectively) for u. This form is much more intuitive to use, as it describes, for a given membrane voltage, where u will settle and how fast. In addition, these quantities are much easier for neuroscientists to extract from real cells through voltage-clamp experiments. We will come back to the form of equation 2.3 in section 4. For now, we will focus on the dynamics of the gating particle in terms of transition rates.

In thermodynamic models, state changes of a gating particle are related to changes in the conformation of the ion channel protein (Hill & Chen, 1972; Destexhe & Huguenard, 2000, 2001). Each state possesses a certain energy

Figure 1: Energy diagram of a reaction (energy versus state: closed, activated, open). The transition rates between two states are dependent on the heights of the energy barriers (ΔGC and ΔGO), the differences in energy between the activated state (G∗) and the initial states (GC or GO). Thus, the time constant depends on the height of the energy barriers, and the steady state depends on the difference in energy between the closed and open states (ΔG).

(see Figure 1), dependent on the interactions of the protein molecule with the electric field across the membrane. For a state transition to occur, the molecule must overcome an energy barrier (see Figure 1), defined as the difference in energy between the initial state and an intermediate activated state. The size of the barrier controls the rate of transition between states (Hille, 1992):

$$\alpha(V) = \alpha_0\, e^{-\Delta G_C(V)/RT} \tag{2.6}$$

$$\beta(V) = \beta_0\, e^{-\Delta G_O(V)/RT}, \tag{2.7}$$

where α0 and β0 are constants representing base transition rates (at zero barrier height), ΔGC(V) and ΔGO(V) are the voltage-dependent energy barriers, R is the gas constant, and T is the temperature in Kelvin. Changes in the membrane voltage of the cell, and thus the electric field across the membrane, influence the energies of the protein’s conformations differently, changing the sizes of the barriers and altering the transition rates between states. Increasing a barrier decreases the respective transition rate, slowing the dynamics, since fewer proteins will have sufficient energy. The steady state depends on the energy difference between the two states. For a difference of zero and equivalent base transition rates, particles are


equally distributed between the two states. Otherwise, the state with lower energy is the preferred one.

The voltage dependence of an energy barrier has many components, both linear and nonlinear. Linear thermodynamic models, as the name implies, assume that the linear voltage dependence dominates. This dependence may be produced by the movement of a monopole or dipole through an electric field (Hill & Chen, 1972; Stevens, 1978). In this situation, the above rate equations simplify to

$$\alpha(V) = A\, e^{-b_1 (V - V_H)/RT} \tag{2.8}$$

$$\beta(V) = A\, e^{-b_2 (V - V_H)/RT}, \tag{2.9}$$

where VH and A represent the half-activation voltage and rate, while b1 and b2 define the linear relationship between each barrier and the membrane voltage. The magnitude of the linear term depends on such factors as the net movement of charge or net change in charge due to the conformation of the channel protein. Thus, ion channels use structural differences to define different membrane voltage dependencies.

While linear thermodynamic models have simple governing equations, they possess a significant flaw: time constants can reach extremely small values at voltages where either α(V) or β(V) becomes large (see equation 2.5), which is unrealistic since it does not occur in biology. Adding nonlinear terms in the energy expansion of α(V) and β(V) can counter this effect (Destexhe & Huguenard, 2000). Other solutions involve either saturating the transition rate (Willms, Baro, Harris-Warrick, & Guckenheimer, 1999) or using a three-state model (Destexhe & Huguenard, 2001), where the forward and reverse transition rates between two of the states are fixed, effectively setting the maximum transition rate.

Linear models, however, bear the closest resemblance to the MOS transistor, which operates under similar thermodynamic principles. Short for metal oxide semiconductor, the MOS transistor is named for its structure: a metallic gate (today, a polysilicon gate) atop a thin oxide, which insulates the gate from a semiconductor channel. The channel, part of the body or substrate of the transistor, lies between two heavily doped regions called the source and the drain (see Figure 2). There are two types of MOS transistors: negative or n-type (NMOS) and positive or p-type (PMOS). NMOS transistors possess a drain and a source that are negatively doped, that is, areas where the charge carriers are negatively charged electrons. These two areas exist within a p-type substrate, a positively doped area, where the charge carriers are positively charged holes. A PMOS transistor consists of a p-type source and drain within an n-type well. While the rest of this discussion focuses on NMOS transistor operation, the same principles apply to PMOS transistors, except that the charge carrier is of the opposite sign.
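Returning briefly to the linear thermodynamic model, a small Python sketch may help make its flaw concrete. It evaluates equations 2.8 and 2.9 and feeds them into equations 2.4 and 2.5. The parameter values (`A`, `B1`, `B2`, `V_HALF`, and the energy scale `RT`) are hypothetical and dimensionless, chosen only so that the opening rate grows with depolarization; they are not taken from this letter.

```python
import math

RT = 25.0                   # hypothetical energy scale, same units as b*(V - V_H)
A = 1.0                     # hypothetical rate at the half-activation voltage
B1, B2 = -1.0, 1.0          # hypothetical linear barrier coefficients (opposite signs)
V_HALF = -40.0              # hypothetical half-activation voltage

def rates(v):
    """Equations 2.8 and 2.9: exponential rates with linear barriers."""
    alpha = A * math.exp(-B1 * (v - V_HALF) / RT)
    beta = A * math.exp(-B2 * (v - V_HALF) / RT)
    return alpha, beta

def steady_state_and_tau(v):
    """Equations 2.4 and 2.5 applied to the linear-model rates."""
    alpha, beta = rates(v)
    return alpha / (alpha + beta), 1.0 / (alpha + beta)

for v in (-140.0, -40.0, 60.0):
    u_inf, tau = steady_state_and_tau(v)
    print(f"V={v:7.1f}  u_inf={u_inf:.4f}  tau={tau:.4f}")
```

With these assumed values the steady state is one-half at `V_HALF`, and the time constant shrinks steadily on both sides of it without any lower bound, which is the behavior that saturating rates or a three-state model are meant to prevent.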

Figure 2: MOS transistor. (a) Cross-section of an n-type MOS transistor. The transistor has four terminals: source (S), drain (D), gate (G) and bulk (B), sometimes referred to as the back-gate. (b) Symbols for the two transistor types: NMOS (left) and PMOS (right). The transistor is a symmetric device, and thus the direction of its current (by convention, the flow of positive charges) indicates the drain and the source. In an NMOS, current flows from drain to source, as indicated by the arrow. Conversely, current flows from source to drain in a PMOS.

In the subthreshold regime, charge flows across the channel by diffusion from the source end of the channel, where the density is high, to the drain, where the density is low. Governed by the same laws of thermodynamics that govern protein conformations, the density of charge carriers at the source and drain ends of the channel depends exponentially on the size of the energy barriers there (see Figure 3). These energy barriers exist due to a built-in potential difference, and thus an electric field, between the channel and the source or the drain. Adjusting the voltage at the source, or the drain, changes the charge carriers’ energy level. For the NMOS transistor’s negatively charged electrons, increasing the source voltage decreases the energy level; hence, the barrier height increases. This decreases the charge density at that end of the channel, as fewer electrons have the energy required to overcome the barrier. The voltage applied to the gate, which influences the potential at the surface of the channel, has the opposite effect: increasing it (e.g., from VG to VG1 in Figure 3) decreases the barrier height at both ends of the channel. Factoring in the exponential charge density dependence on barrier height yields the relationship between an NMOS transistor’s channel current and its terminal voltages (Mead, 1989):

$$I_{ds} = I_{ds0}\left(e^{(\kappa V_{GB} - V_{SB})/U_T} - e^{(\kappa V_{GB} - V_{DB})/U_T}\right), \tag{2.10}$$

where κ describes the relationship between the gate voltage and the potential at the channel surface. UT is called the thermal voltage (25.4 mV at room temperature), and Ids0 is the baseline diffusion current, defined by the barrier introduced when the oppositely doped regions (p-type and n-type) were fabricated. Note that, for clarity, UT will not appear in the

Figure 3: Energy diagram of a transistor. The vertical dimension represents the energy of negative charge carriers (electrons) within an NMOS transistor, while the horizontal dimension represents location within the transistor (source, channel, drain). φS and φD are the energy barriers faced by electrons attempting to enter the channel from the source and drain, respectively. VD, VS, and VG are the terminal voltages, designated by their subscripts. During transistor operation, VD > VS, and thus φS < φD. VG1 represents another scenario with a higher gate voltage. (Adapted from Mead, 1989.)

remaining transistor current equations, as all transistor voltages from here on are given in units of UT. When VDB exceeds VSB by 4 UT or more, the drain term becomes negligible and is ignored; the transistor is then said to be in saturation.

The similarities in the underlying physics of ion channels and transistors allow us to use transistors as thermodynamic isomorphs of ion channels. In both, there is a linear relationship between the energy barrier and the controlling voltage. For the ion channel, either isolated charges or dipoles of the channel protein have to overcome the electric field created by the voltage across the membrane. For the transistor, electrons, or holes, have to overcome the electric field created by the voltage difference between the source, or drain, and the transistor channel. In both instances, the transport of charge across the energy barrier is governed by a Boltzmann distribution, which results in an exponential voltage dependence. In the next section, we use these similarities to design an efficient transistor representation of the gating particle dynamics.

3 Variable Circuit

Based on the discussion from the previous section, it is tempting to think we may be able to use a single transistor to model the gating dynamics of a channel particle completely. However, obtaining the transition rates solves only part of the problem. We still need to multiply the rates with the number of gating particles in each state to obtain the opening and closing fluxes, and then integrate the flux difference to update the particle counts

Figure 4: Channel variable circuit. (a) The voltage uV represents the logarithm of the channel variable u. VO and VC are linearly related to the membrane voltage, with slopes of opposite sign. uH and uL are adjustable bias voltages. (b) Two transistors (N3 and N4) are added to saturate the variable’s opening and closing rates; the bias voltages uτH and uτL set the saturation level.

(see equation 2.2). A capacitor can perform the integration if we use charge to represent particle count and current to represent flux. The voltage on the capacitor, which is linearly proportional to its charge, yields the result.

The first sign of trouble appears when we attempt to connect a capacitor to the transistor’s source (or drain) terminal to perform the integration. As the capacitor integrates the current, the voltage changes, and hence the transistor’s barrier height changes. Thus, the barrier height depends on the particle count, which is not the case in biology; gating particles do not (directly) affect the barrier height when they switch state. Our only remaining option, the gate voltage, is unsuitable for defining the barrier height, as it influences the barrier at both ends of the channel identically. α(V) and β(V), however, demonstrate opposite dependencies on the membrane voltage; that is, one increases while the other decreases.

We can resolve this conundrum by connecting two transistors to a single capacitor (see Figure 4a). Each transistor defines an energy barrier for one of the transition rates: transistor N1 uses its source and gate voltages (uL and VC, respectively) to define the closing rate, and transistor N2 uses its drain and gate voltages (uH and VO) to define the opening rate (where uH > uL). We integrate the difference in transistor currents on the capacitor Cu to update the particle count. Notice that neither barrier


depends on the capacitor voltage uV. Thus, uV becomes representative of the fraction of open channels; it increases as particles switch to the open state.

How do we compute the fluxes from the transition rates? If uV directly represented the particle count, we would take the product of uV and the transition rates. However, we can avoid multiplying altogether if uV represents the logarithm of the open fraction rather than the open fraction itself. uV’s dynamics are described by the differential equation

$$\begin{aligned} C_u \frac{du_V}{dt} &= I_{ds0}\, e^{\kappa V_O}\left(e^{-u_V} - e^{-u_H}\right) - I_{ds0}\, e^{\kappa V_C}\, e^{-u_L} \\ &= I_{ds0}\, e^{\kappa V_O - u_H}\left(e^{-(u_V - u_H)} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L}, \end{aligned} \tag{3.1}$$

where Ids0 and κ are transistor parameters (defined in equation 2.10), and VO, VC, uH, uL, and uV are voltages (defined in Figure 4a). We assume N1 remains in saturation during the channel’s operation; that is, uV > uL + 4 UT, making the drain voltage’s influence negligible.

The analogies between equation 3.1 and equation 2.2 become clear when we divide the latter by u. Our barriers (N1’s source-gate for the closing rate and N2’s drain-gate for the opening rate) correspond to β(V) and α(V), respectively, while $e^{-(u_V - u_H)}$ corresponds to $u^{-1}$. Thus, our circuit computes (and integrates) the net flux divided by u, the open fraction. Fortuitously, the net flux scaled by the open fraction is exactly what we need to update the fraction’s logarithm, since d log(u)/dt = (du/dt)/u. Indeed, substituting uV = log u + uH (our log-domain representation of u) into equation 3.1 yields

$$\begin{aligned} \frac{Q_u}{u}\frac{du}{dt} &= I_{ds0}\, e^{\kappa V_O - u_H}\left(\frac{1}{u} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L} \\ \Rightarrow\quad \frac{du}{dt} &= \frac{I_{ds0}}{Q_u}\, e^{\kappa V_O - u_H}\,(1 - u) - \frac{I_{ds0}}{Q_u}\, e^{\kappa V_C - u_L}\, u, \end{aligned} \tag{3.2}$$

where Qu = Cu UT . If we design VC and VO to be functions of the membrane voltage V, equation 3.2 becomes directly analogous to equation 2.2. In linear thermodynamic models, the transition rates depend exponentially on the membrane voltage. We can realize this by designing VC and VO to be linear functions of V, albeit with slopes of opposite sign. The opposite slopes ensure that as the membrane voltage shifts in one direction, the opening and closing rates will change in opposite directions relative to each other. Thus far in our circuit design, we have not specified whether the variable activates or inactivates as the membrane voltage increases. Recall that for activation, the gating particle opens as the cell depolarizes, whereas for


inactivation, the gating particle opens as the cell hyperpolarizes. In our circuit, whether activation or inactivation occurs depends on how we define VO and VC with respect to V. Increasing VO, and decreasing VC, with V defines an activation variable, as at depolarized levels this results in α(V) > β(V). Conversely, increasing VC, and decreasing VO, with V defines an inactivation variable, as now at depolarized voltages, β(V) > α(V), and the variable will equilibrate in a closed state.

Naturally, our circuit has the same limitation that all two-state linear thermodynamic models have: its time constant approaches zero when either VO or VC grows large, as the transition rates α(V) and β(V) become unrealistically large. This shortcoming is easily rectified by imposing an upper limit on the transition rates, as has been done for other thermodynamic models (Willms et al., 1999). We realize this saturation by placing two additional transistors in series with the original two (see Figure 4b). With these transistors, the transition rates become

$$\alpha(V) = \frac{I_{ds0}}{Q_u}\,\frac{e^{\kappa V_O - u_H}}{1 + e^{\kappa (V_O - u_{\tau H})}} \tag{3.3}$$

$$\beta(V) = \frac{I_{ds0}}{Q_u}\,\frac{e^{\kappa V_C - u_L}}{1 + e^{\kappa (V_C - u_{\tau L})}}, \tag{3.4}$$

where the single exponentials in equation 3.2 are now scaled by an additional exponential term. The voltages uτH and uτL (fixed biases) set the maximum transition rate for opening and closing, respectively. That is, when VO < uτH − 4 UT, α(V) ∝ exp(κ VO), a rate function exponentially dependent on the membrane voltage. But when VO > uτH + 4 UT, α(V) ∝ exp(κ uτH), limiting the transition rate and fixing the minimum time constant for channel opening. The behavior of β(V) is similarly defined by VC’s value relative to uτL.

In the following section, we explore how the channel variable computed by our circuit changes with the membrane voltage and how quickly it approaches steady state. To do so, we must relate the steady state and time constant to the opening and closing rates and specify how the circuit’s opening and closing voltages depend on the membrane voltage.

4 Circuit Operation

To help understand the operation of the channel circuit and the influence of various circuit parameters, we will derive u∞(V) and τu(V) for the circuit in Figure 4b using equations 2.4 and 2.5 and the transistor rate equations (equations 3.3 and 3.4), limiting our presentation to the activation version. The derivation, and the influence of various parameters, is similar for the inactivation version.
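Before the derivation, the log-domain argument of section 3 and the saturating rates of equations 3.3 and 3.4 can be illustrated with a short Python sketch. It integrates w = uV − uH = log u directly, using the form of equation 3.1 divided by Qu, and compares the settled open fraction with α/(α + β) from equation 2.4. All voltages are in units of UT; the normalization Ids0/Qu = 1, the bias values, and the function names are assumptions made for illustration only.

```python
import math

KAPPA = 0.7      # gate coupling; the letter quotes kappa of about 0.7
RATE0 = 1.0      # Ids0/Qu, normalized to 1 (assumption)

def sat_rate(v_gate, u_bias, u_tau):
    """Equations 3.3/3.4: exponential rate divided by a saturation term (voltages in U_T)."""
    return RATE0 * math.exp(KAPPA * v_gate - u_bias) / (1.0 + math.exp(KAPPA * (v_gate - u_tau)))

def settle_log_domain(alpha, beta, w0=-5.0, dt=1e-3, t_end=10.0):
    """Forward-Euler integration of dw/dt = alpha*(exp(-w) - 1) - beta, with w = log(u)."""
    w = w0
    for _ in range(int(round(t_end / dt))):
        w += dt * (alpha * (math.exp(-w) - 1.0) - beta)
    return math.exp(w)   # convert the log-domain state back to the open fraction

# Hypothetical operating point and biases, all in units of U_T.
V_O, V_C = 4.0, 1.0
U_H, U_L = 2.0, 0.0
U_TAU_H = U_TAU_L = 10.0

alpha = sat_rate(V_O, U_H, U_TAU_H)
beta = sat_rate(V_C, U_L, U_TAU_L)
u_log = settle_log_domain(alpha, beta)
u_direct = alpha / (alpha + beta)            # equation 2.4
print(u_log, u_direct)
```

Note that the update never multiplies a rate by the open fraction; the factor exp(−w) plays that role, which is the point of the log-domain representation. For gate voltages far above the saturation bias, `sat_rate` levels off at `RATE0 * exp(KAPPA * u_tau - u_bias)`.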


For the activation version of the channel circuit (see Figure 4b), we define the opening and closing voltages’ dependence on the membrane voltage as:

$$V_O = \phi_o + \gamma_o V \tag{4.1}$$

$$V_C = \phi_c - \gamma_c V, \tag{4.2}$$

where φo, γo, φc, and γc are positive constants representing the offsets and slopes for the opening and closing voltages. Additional circuitry is required to define these constants; one example is described in the next section. In this section, however, we will leave the definition as such while we derive the equations for the circuit.

Under certain restrictions (see appendix A), u’s steady-state level has a sigmoidal voltage dependence:

$$u_\infty(V) = \frac{1}{1 + \exp\!\left(-\dfrac{V - V^{\mathrm{mid}}_u}{V^*_u}\right)}, \tag{4.3}$$

where

$$V^{\mathrm{mid}}_u = \frac{1}{\gamma_o + \gamma_c}\left(\phi_c - \phi_o + (u_H - u_L)/\kappa\right) \tag{4.4}$$

$$V^*_u = \frac{1}{\gamma_o + \gamma_c}\,\frac{U_T}{\kappa}. \tag{4.5}$$

Figure 5a shows how the sigmoid arises from the transition rates and, through them, its relationship to the voltage biases. The midpoint of the sigmoid, where the open probability equals half, occurs when α (V) = β (V); it will thus shift with any voltage biases (φo , φc , uH , or uL ) that scale either of the transition rate currents. For example, increasing uH reduces α (V), shifting the midpoint to higher voltages. The slope of the sigmoid around the midpoint is defined by the slopes γo and γc of VO (V) and VC (V). To obtain the sigmoid shape, we restricted the effect of saturation. It is assumed that the bias voltages uτ H and uτ L are set such that saturation is negligible in the linear midsegment of the sigmoid (i.e., VO < uτ H − 4 UT and VC < uτ L − 4 UT when V ∼ Vmid u ). That is why α (V) and β (V) appear to be pure exponentials in Figure 5a. This restriction is reasonable as saturation is supposed to occur only for large excursions from Vmid u , where it imposes a lower limit on the time constant. Therefore, under this assumption, the sigmoid lacks any dependence on the biases uτ H and uτ L .
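The sigmoid of equations 4.3 to 4.5 can be cross-checked against the ratio of the unsaturated rates with a few lines of Python. The sketch below builds u∞ directly from α/(α + β), using equations 4.1 and 4.2 for VO and VC, and compares it with the closed form. The values of UT and κ are those quoted in the letter; the offsets, slopes, and biases (in mV) are hypothetical and serve only as an example.

```python
import math

U_T = 25.4     # thermal voltage in mV (value quoted in the letter)
KAPPA = 0.7    # approximate gate coupling quoted in the letter

# Hypothetical offsets, slopes, and biases (mV); for illustration only.
PHI_O, GAMMA_O = 0.0, 1.0
PHI_C, GAMMA_C = 500.0, 0.3
U_H, U_L = 100.0, 0.0

def u_inf_from_rates(v):
    """alpha/(alpha + beta) with unsaturated rates (saturation assumed negligible)."""
    v_o = PHI_O + GAMMA_O * v                  # equation 4.1
    v_c = PHI_C - GAMMA_C * v                  # equation 4.2
    log_alpha = (KAPPA * v_o - U_H) / U_T      # log of opening rate, up to a shared constant
    log_beta = (KAPPA * v_c - U_L) / U_T       # log of closing rate, same constant
    return 1.0 / (1.0 + math.exp(log_beta - log_alpha))

V_MID = (PHI_C - PHI_O + (U_H - U_L) / KAPPA) / (GAMMA_O + GAMMA_C)   # equation 4.4
V_STAR = U_T / (KAPPA * (GAMMA_O + GAMMA_C))                          # equation 4.5

def u_inf_sigmoid(v):
    """Closed-form sigmoid of equation 4.3."""
    return 1.0 / (1.0 + math.exp(-(v - V_MID) / V_STAR))

for v in (300.0, V_MID, 700.0):
    print(f"V={v:6.1f} mV  rates={u_inf_from_rates(v):.6f}  sigmoid={u_inf_sigmoid(v):.6f}")
```

In this sketch, raising `U_H` moves `V_MID` to higher voltages, in line with the remark that increasing uH reduces α(V).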

Figure 5: Steady state and time constants for channel circuit. (a) The variable’s steady-state value (u∞) changes sigmoidally with membrane voltage (V), dictated by the ratio of the opening and closing rates (dashed lines). The midpoint occurs when the rates are equal, and hence its horizontal location is affected by the bias voltages (uH and uL) applied to the circuit (see Figure 4b). (b) The variable’s time constant (τu) has a bell-shaped dependence on the membrane voltage (V), dictated by the reciprocal of the opening and closing rates (dashed lines). The time constant diverges from these asymptotes at intermediate voltages, where neither rate dominates; it follows the reciprocal of their sum, peaking when the sum is minimized.

Under certain further restrictions (see appendix A), u’s time constant has a bell-shaped voltage dependence:

$$\tau_u(V) = \tau_{\min}\left(1 + \frac{1}{\exp\!\left(\dfrac{V - V_{1u}}{V^*_{1u}}\right) + \exp\!\left(-\dfrac{V - V_{2u}}{V^*_{2u}}\right)}\right), \tag{4.6}$$

where

$$V_{1u} = (u_{\tau H} - \phi_o)/\gamma_o, \qquad V^*_{1u} = U_T/(\kappa\,\gamma_o),$$

$$V_{2u} = \left(\phi_c - u_{\tau H} + (u_H - u_L)/\kappa\right)/\gamma_c, \qquad V^*_{2u} = U_T/(\kappa\,\gamma_c),$$

and $\tau_{\min} = (Q_u/I_{ds0})\, e^{-(\kappa u_{\tau H} - u_H)}$.

Figure 5b shows how the bell shape arises from the transition rates and, through them, its relationship to the voltage biases. For large excursions of the membrane voltage, one transition rate dominates, and the time constant closely follows its inverse. For small excursions, neither rate dominates, and the time constant diverges from the inverses, peaking at the membrane voltage where the sum of the transition rates is minimized.
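A similar numerical comparison can be made for the bell curve. The Python sketch below computes τu/τmin in two ways: from the reciprocal of the summed saturating rates (equations 2.5, 3.3, and 3.4) and from the closed form of equation 4.6. It assumes the two saturation levels are matched, κ uτH − uH = κ uτL − uL, and uses hypothetical bias values in mV. The closed form drops the product of the two saturation exponentials, so the two curves are expected to agree only approximately, within a few percent for the values assumed here.

```python
import math

U_T, KAPPA = 25.4, 0.7     # thermal voltage (mV) and gate coupling quoted in the letter

# Hypothetical offsets, slopes, and biases (mV); for illustration only.
PHI_O, GAMMA_O = 0.0, 1.0
PHI_C, GAMMA_C = 530.0, 0.3
U_H, U_L = 140.0, 0.0
U_TAU_H = 700.0
U_TAU_L = U_TAU_H - (U_H - U_L) / KAPPA   # matches the two saturation levels

def tau_exact(v):
    """tau_u/tau_min as the reciprocal of alpha + beta, rates in units of 1/tau_min."""
    a = math.exp(KAPPA * (PHI_O + GAMMA_O * v - U_TAU_H) / U_T)
    b = math.exp(KAPPA * (PHI_C - GAMMA_C * v - U_TAU_L) / U_T)
    return 1.0 / (a / (1.0 + a) + b / (1.0 + b))

def tau_bell(v):
    """tau_u/tau_min from the closed form of equation 4.6."""
    v1 = (U_TAU_H - PHI_O) / GAMMA_O
    v1_star = U_T / (KAPPA * GAMMA_O)
    v2 = (PHI_C - U_TAU_H + (U_H - U_L) / KAPPA) / GAMMA_C
    v2_star = U_T / (KAPPA * GAMMA_C)
    return 1.0 + 1.0 / (math.exp((v - v1) / v1_star) + math.exp(-(v - v2) / v2_star))

for v in range(0, 901, 150):
    print(f"V={v:4d} mV  exact={tau_exact(v):8.3f}  eq4.6={tau_bell(v):8.3f}")
```

Both curves approach τmin at strongly depolarized voltages and peak at intermediate voltages, where neither rate dominates.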


To obtain the bell shape, we saturated the opening and closing rates at the same level by setting κ uτH − uH = κ uτL − uL. Though not strictly necessary, this assumption simplifies the expression for τu(V) by matching the minimum time constants at hyperpolarized and depolarized voltages, yielding the result given in equation 4.6. The bell shape also requires this so-called minimum time constant to be smaller than the peak time constant in the absence of saturation.

The free parameters within the circuit (φo, γo, φc, γc, uH, uL, uτH, and uτL) allow for much flexibility in designing a channel. Appendix B provides an expanded discussion on the influence of the various parameters in the equations above. In the following section, we present measurements from a simple activation channel designed using this circuit, which was fabricated in a standard 0.25 µm CMOS process.

5 A Simple Activation Channel

Our goal here is to implement an activating channel to serve as a concrete example and examine its behavior through experiment. We start with the channel variable circuit, which computes the logarithm of the channel variable, and attach its output voltage to the gate of a transistor, which uses the subthreshold regime’s exponential current-voltage relationship to invert the logarithm. The current this transistor produces, which is directly proportional to the variable, can be injected directly into a silicon neuron (Hynna & Boahen, 2001) or can be used to define a conductance (Simoni et al., 2004). The actual choice is irrelevant for the purposes of this article, which demonstrates only the channel variable.

In addition to the output transistor, we also need circuitry to compute the opening and closing voltages from the membrane voltage. For the opening voltage (VO), we simply use a wire to tie it to the membrane voltage (V), which yields a slope of unity (γo = 1) and an intercept of zero (φo = 0). For the closing voltage (VC), we use four transistors to invert the membrane voltage. The end result is shown in Figure 6. For the voltage inverter circuit we chose, the closing voltage’s intercept is φc = κ VC0 (set by a bias voltage VC0) and its slope is γc = κ²/(κ + 1) (set by the transistor parameter defined in equation 2.10). Since κ ≈ 0.7, the closing voltage has a shallower slope than the opening voltage, which makes the closing rate change more gradually, skewing the bell curve in the hyperpolarizing direction as intended for the application in which this circuit was used (Hynna, 2005).

This eight-transistor design captures the ion channel’s nonlinear dynamics, which we demonstrated by performing voltage clamp experiments (see Figure 7). As the command voltage (i.e., step size) increases, the output current’s time course and final amplitude both change.
The clustering and speed at low and high voltages are what we would expect from a sigmoidal steady-state dependence with a bell-shaped time constant.

Figure 6: A simple activating channel. A voltage inverter (N5-8) produces the closing voltage ($V_C$); a channel variable circuit (N1-3) implements the variable’s dynamics in the log domain ($u_V$); and an antilog transistor (N4) produces a current ($I_T$) proportional to the variable. The opening voltage ($V_O$) is identical to the membrane voltage ($V$). The series transistor (N2) sets the minimum time constant at depolarized levels. The same circuit can be used to implement an inactivating channel simply by swapping $V_O$ and $V_C$.

Figure 7: Channel circuit’s measured voltage-dependent activation. When the membrane voltage is stepped to increasing levels, from the same starting level, the output current becomes increasingly larger, approaching its steady-state amplitudes at varying speeds.

The relationship between this output current ($I_T$) and the activation variable, defined as $u = e^{u_V - u_H}$, has the form

$$I_T = u^{\kappa}\, \bar{I}_T. \tag{5.1}$$

Figure 8: Channel circuit’s measured sigmoid and bell curve. (a) Dependence of activation ($u_\infty$) on membrane voltage in steady state, captured by sweeping the membrane voltage slowly and recording the normalized current output. (b) Dependence of time constant ($\tau_u$, in ms) on membrane voltage, extracted from the curves in Figure 7 by fitting exponentials. Fits (solid lines) are of equations 4.3 and 4.6: $V^{mid}_u$ = 423.0 mV, $V^*_u$ = 28.8 mV, $\tau_{min}$ = 0.0425 ms, $V_{1u}$ = 571.3 mV, $V^*_{1u}$ = 36.9 mV, $V_{2u}$ = 169.8 mV, $V^*_{2u}$ = 71.8 mV.

Its maximum value is $\bar{I}_T = e^{\kappa u_H - u_G}$ and its exponent is $\kappa \approx 0.7$ (the same transistor parameter). It is possible to achieve an exponent of unity, or even a square or a cube, but this requires a lot more transistors.

We measured the sigmoidal change in activation directly, by sweeping the membrane voltage slowly, and its bell-shaped time constant indirectly, by fitting the voltage clamp data in Figure 7 with exponentials. The results are shown in Figure 8; the relationship above was used to obtain $u$ from $I_T$. The solid lines are the fits of equations 4.3 and 4.6, which reasonably capture the behavior in both sets of data. The range in the time constant data is limited due to the experimental protocol used. Since we modulated only the step size from a fixed hyperpolarized position, we need a measurable change in the steady-state output current to be able to measure the temporal dynamics for opening. However, this worked to our benefit as, given the range of the fit, there was no need to modify equation 4.6 to allow the time constant to go to zero at hyperpolarized levels (this circuit omits the second saturation transistor in Figure 4b).

All of the extracted parameters from the fit are reasonably close to our expectations (based on equations 4.3 and 4.6 and our applied voltage biases) except for $V^*_{2u}$. For $\kappa \approx 0.7$ and $U_T = 25.4$ mV, we expected $V^*_{2u} \approx 125$ mV, but our fit yielded $V^*_{2u} \approx 71.8$ mV. There are two possible explanations, not mutually exclusive. First, the fact that equation 4.6 assumes the presence of a saturation transistor, in addition to the limited data along the left side of the bell curve, may have contributed to the underfitting of that value. Second, $\kappa$ is not constant within the chip but possesses voltage dependence. Overall, however, the analysis matches the performance of the circuit reasonably well.
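As a rough cross-check of the expectations quoted above, the following Python sketch computes the slope factors implied by $\kappa$ and $U_T$. It assumes (this is not spelled out in this section) that substituting linear opening and closing voltages into the exponentials of equations 4.3 and 4.6 gives $V^*_u = U_T/(\kappa(\gamma_o + \gamma_c))$, $V^*_{1u} = U_T/(\kappa\gamma_o)$, and $V^*_{2u} = U_T/(\kappa\gamma_c)$. The function names are illustrative only.

```python
UT_MV = 25.4   # thermal voltage in mV, as quoted in the text
KAPPA = 0.7    # approximate transistor parameter, as quoted in the text


def closing_slope(kappa):
    """Closing-voltage slope of the four-transistor inverter: kappa^2 / (kappa + 1)."""
    return kappa ** 2 / (kappa + 1.0)


def expected_slope_factors(kappa=KAPPA, ut_mv=UT_MV, gamma_o=1.0):
    """Return (V*_u, V*_1u, V*_2u) in mV under the relations assumed in the lead-in."""
    gamma_c = closing_slope(kappa)
    v_star_u = ut_mv / (kappa * (gamma_o + gamma_c))
    v_star_1u = ut_mv / (kappa * gamma_o)
    v_star_2u = ut_mv / (kappa * gamma_c)
    return v_star_u, v_star_1u, v_star_2u


def activation_from_current(i_t, i_t_max, kappa=KAPPA):
    """Invert equation 5.1 (I_T = u**kappa * I_T_max) to recover u from a measured current."""
    return (i_t / i_t_max) ** (1.0 / kappa)


if __name__ == "__main__":
    v_u, v_1u, v_2u = expected_slope_factors()
    print(f"gamma_c = {closing_slope(KAPPA):.3f}")
    print(f"V*_u  = {v_u:.1f} mV (fit: 28.8 mV)")
    print(f"V*_1u = {v_1u:.1f} mV (fit: 36.9 mV)")
    print(f"V*_2u = {v_2u:.1f} mV (fit: 71.8 mV)")
```

With $\kappa = 0.7$ this gives a $V^*_{2u}$ of about 126 mV, in line with the expectation of roughly 125 mV stated above, while the other two values land within about 1 mV of the fitted ones.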


6 Conclusion

We showed that the transistor is a thermodynamic isomorph of a channel gating particle. The analogy is accomplished by considering the operation of both within the framework of energy models. Both involve the movement of charge within an electric field: for the channel, due to conformations of the ion channel protein; for the transistor, due to charge carriers entering the transistor channel. Using this analogy, we generated a compact channel variable circuit. We demonstrated our variable circuit’s operation by implementing a simple channel with a single activation variable, showing that the steady state is sigmoid and the time constant bell shaped. Our measured results, obtained through voltage clamp experiments, matched our analytical results, derived from knowledge of transistors. Bias voltages applied to the circuit allow us to shift the sigmoid and the bell curve and set the bell curve’s height independently. However, the sigmoid’s slope, and its location relative to the bell curve, which is determined by the slope, cannot be changed (it is set by the MOS transistor’s $\kappa$ parameter). Our variable circuit is not limited to activation variables: reversing the opening and closing voltages’ linear dependence on the membrane voltage will change the circuit into an inactivation variable. In addition, channels that activate and inactivate are easily modeled by including additional circuitry to multiply the variable circuits’ output currents (Simoni et al., 2004; Delbruck, 1991).

The change in temporal dynamics of gating particles plays a critical role in some voltage-gated ion channels. As discussed in section 1, the inactivation time constant of T channels in thalamic relays changes dramatically, defining properties of the relay cell burst response, such as the interburst interval and the length of the burst itself.
Activation time constants are also influential: they can modify the delay with which the channel responds (Zhan, Cox, Rinzel, & Sherman, 1999), an important determinant of a neuron’s temporal precision. Incorporating these nonlinear temporal dynamics into silicon models will yield further insights into neural computation.

Equally important in our design, not only were we able to capture the nonlinear dynamics of gating particles, we were able to do so using fewer transistors than previous silicon models. Rather than exploit the parallels between transistors and ion channels, as we did, previous silicon modelers attempted to “linearize” the transistor, to make it approximate a resistor. After designing a circuit to accomplish this, the resistor’s value had to be adjusted dynamically, so more circuitry was added to filter the membrane voltage. The time constant of this filter was kept constant, sacrificing the ion channel’s voltage-dependent nonlinear dynamics for simplicity. We avoided all these complications by recognizing that the transistor is a thermodynamic isomorph of the ion channel. Thus, we were able to come up with a compact replica.

The size of the circuit is an important consideration within silicon models, as smaller circuits translate to more neurons on a silicon die. To illustrate, a simple silicon neuron (Zaghloul & Boahen, 2004), with a single input synapse, possessing the activation channel from section 5, requires about 330 µm² of area. This corresponds to around 30,000 neurons on a silicon die 10 mm² in area. Adding circuits, such as inactivation, to the channel increases the area of the cell design, reducing the size of the population on the chip (assuming, of course, that the total area of the die remains constant). To compensate for larger cell footprints, we can either increase the size of the whole silicon die (which costs money), or we can simply incorporate multiple chips into the system, easily doubling or tripling the network size. And unlike computer simulations, the increase in network sizes comes with minimal cost in performance or “simulation” time.

Of course, like all other modeling media, silicon has its own drawbacks. For one, silicon is not forgiving with respect to design flaws. Once the chip has been fabricated, we are limited to manipulating our models using only external voltage biases within our design. This places a great deal of importance on verification of the final design before submitting it for fabrication; the total time from starting the design to receiving the fabricated chip can be on the order of 6 to 12 months. An additional characteristic of the silicon fabrication process is mismatch, a term referring to the variability among fabricated copies of the same design within a silicon chip (Pavasovic, Andreou, & Westgate, 1994). Within an array of silicon neurons, this translates into heterogeneity within the population.
While we can take steps to reduce the variability within an array, generally at the expense of area, this mismatch can be considered a feature, since biology also needs to deal with variability. When we build silicon models that reproduce biological phenomena, being able to do so lends credence to our models, given their robustness to parameter variability. And when we discover network phenomena within our chips, these are likely to be found in biology as well, as they will be robust to biological heterogeneity. With the ion channel design described in this article and our ability to expand our networks without much cost, we have a great deal of potential in building many of the neural systems within the brain, which consists of numerous layers of cells, each possessing its own distinct characteristics. Not only do we have the opportunity to study the role of an ion channel within an individual cell, we have the potential to study its influence within the dynamics of a population of cells, and hence its role in neural computation.
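The die-area estimate given earlier in this section is simple arithmetic, and a short Python sketch makes the scaling explicit. It is an upper bound under a stated assumption: the whole die is tiled with neurons, ignoring pads, routing, and bias circuitry. The function name and the `n_chips` parameter are illustrative.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm^2 = 10^6 um^2


def neurons_per_die(die_area_mm2, cell_area_um2, n_chips=1):
    """Upper bound on neuron count, ignoring pads, routing, and bias circuitry."""
    per_chip = int(die_area_mm2 * UM2_PER_MM2 // cell_area_um2)
    return per_chip * n_chips


if __name__ == "__main__":
    # Figures quoted in the text: 330 um^2 per neuron on a 10 mm^2 die.
    print(neurons_per_die(10, 330))             # single chip
    print(neurons_per_die(10, 330, n_chips=3))  # three chips in one system
```

A larger cell footprint lowers the per-chip count proportionally, which is why adding chips is the cheaper way to recover network size.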


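Appendix A: Derivations

We can use the transition rates for our channel circuit (see equations 3.3 and 3.4) to calculate the voltage dependence of its steady state and time constant. Starting with the steady-state equation, equation 2.4,

$$
\begin{aligned}
u_\infty(V) &= \frac{\alpha(V)}{\alpha(V) + \beta(V)} \\
&= \frac{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}}}{\dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}} + \dfrac{I_{ds0}}{Q_u}\,\dfrac{e^{-u_L}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}} \\
&= \frac{1}{1 + \dfrac{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}\, e^{u_H - u_L}}.
\end{aligned}
$$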

Throughout the linear segment of the sigmoid, we set the voltage biases such that $u_{\tau H} > V_O + 4U_T$ and $u_{\tau L} > V_C + 4U_T$. These restrictions essentially marginalize $u_{\tau H}$ and $u_{\tau L}$. By the time either of the terms with these two biases becomes significant, the steady state will be sufficiently close to either unity or zero, so that their influence is negligible. Therefore, we drop the exponential terms with $u_{\tau H}$ and $u_{\tau L}$; substituting equations 4.1 and 4.2 yields the desired result of equation 4.3.

For the time constant, we substitute equations 3.3 and 3.4 into equation 4.1:

$$
\begin{aligned}
\tau_u(V) &= \frac{1}{\alpha(V) + \beta(V)} \\
&= \frac{Q_u}{I_{ds0}}\, \frac{1}{\dfrac{e^{-u_H}}{e^{-\kappa V_O} + e^{-\kappa u_{\tau H}}} + \dfrac{e^{-u_L}}{e^{-\kappa V_C} + e^{-\kappa u_{\tau L}}}} \\
&= \frac{Q_u}{I_{ds0}}\, \frac{1}{\dfrac{1}{e^{-(\kappa V_O - u_H)} + e^{-(\kappa u_{\tau H} - u_H)}} + \dfrac{1}{e^{-(\kappa V_C - u_L)} + e^{-(\kappa u_{\tau L} - u_L)}}}.
\end{aligned}
$$
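The closed forms above can be checked numerically against a direct evaluation of $\alpha/(\alpha + \beta)$ and $1/(\alpha + \beta)$. The Python sketch below does this with voltages expressed in units of $U_T$ and the common prefactor $I_{ds0}/Q_u$ set to 1. The bias values and the voltage sweep are arbitrary illustrative choices, not the chip's operating point, and the function names are hypothetical.

```python
import math

KAPPA = 0.7  # transistor parameter quoted in the text


def rates(v_o, v_c, u_h, u_l, u_th, u_tl):
    """Opening (alpha) and closing (beta) rates in units of I_ds0 / Q_u.

    All voltages are in units of U_T, as in the equations above.
    """
    alpha = math.exp(-u_h) / (math.exp(-KAPPA * v_o) + math.exp(-KAPPA * u_th))
    beta = math.exp(-u_l) / (math.exp(-KAPPA * v_c) + math.exp(-KAPPA * u_tl))
    return alpha, beta


def u_inf_closed(v_o, v_c, u_h, u_l, u_th, u_tl):
    """Last line of the steady-state derivation."""
    ratio = (math.exp(-KAPPA * v_o) + math.exp(-KAPPA * u_th)) / (
        math.exp(-KAPPA * v_c) + math.exp(-KAPPA * u_tl))
    return 1.0 / (1.0 + ratio * math.exp(u_h - u_l))


def tau_closed(v_o, v_c, u_h, u_l, u_th, u_tl):
    """Last line of the time-constant derivation, in units of Q_u / I_ds0."""
    open_term = math.exp(-(KAPPA * v_o - u_h)) + math.exp(-(KAPPA * u_th - u_h))
    close_term = math.exp(-(KAPPA * v_c - u_l)) + math.exp(-(KAPPA * u_tl - u_l))
    return 1.0 / (1.0 / open_term + 1.0 / close_term)


def max_relative_error():
    """Largest relative mismatch between direct and closed-form values on a small sweep."""
    biases = dict(u_h=15.0, u_l=2.0, u_th=27.0, u_tl=8.0)  # arbitrary illustrative values
    worst = 0.0
    for i in range(31):
        v_o = float(i)         # opening voltage sweeps upward
        v_c = 16.0 - 0.3 * i   # closing voltage moves the opposite way
        alpha, beta = rates(v_o, v_c, **biases)
        pairs = [
            (alpha / (alpha + beta), u_inf_closed(v_o, v_c, **biases)),
            (1.0 / (alpha + beta), tau_closed(v_o, v_c, **biases)),
        ]
        for direct, closed in pairs:
            worst = max(worst, abs(direct - closed) / abs(direct))
    return worst


if __name__ == "__main__":
    print(f"max relative error: {max_relative_error():.2e}")
```

Any mismatch beyond floating-point rounding would indicate an algebra slip in one of the intermediate lines.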

To equalize the minimum time constant at hyperpolarized and depolarized levels, we establish the following relationship: $\kappa u_{\tau H} - u_H = \kappa u_{\tau L} - u_L$. After additional algebraic manipulation, the time constant becomes

$$
\tau_u(V) = \tau_{min}\left(1 + \frac{e^{-\kappa(V_O - u_{\tau H})}\, e^{-\kappa(V_C - u_{\tau L})}}{e^{-\kappa(V_O - u_{\tau H})} + e^{-\kappa(V_C - u_{\tau L})} + 2} - \frac{1}{e^{-\kappa(V_O - u_{\tau H})} + e^{-\kappa(V_C - u_{\tau L})} + 2}\right),
$$


where $\tau_{min} = (Q_u/I_{ds0})\, e^{-(\kappa u_{\tau H} - u_H)}$ is the minimum time constant. To reduce the expression further, we need to understand the relative magnitudes of the various terms. We can drop the constant in both denominators, as one of the exponentials there will always be significantly larger. Since $V_O$ and $V_C$ have opposite signs for their slopes with respect to $V$, the sum of the exponentials in the denominator peaks at a membrane voltage somewhere within the middle section of its operational voltage range. As it happens, this peak is close to the midpoint of the sigmoid (see below), where we have defined $u_{\tau H} > V_O + 4U_T$ and $u_{\tau L} > V_C + 4U_T$. Thus, at the peak, we know the constant is negligible. As the membrane voltage moves away from the peak, in either direction, one of the exponentials will continue to increase while the other decreases. Thus, with the restriction on the bias voltages $u_{\tau H}$ and $u_{\tau L}$, the sum of the exponentials will always be much larger than the constant in the denominator. By the same logic, we can disregard the final term, since the sum of the exponentials will always be substantially larger than the numerator, making the fraction negligible over the whole membrane voltage range. With these assumptions, and substituting the dependence of $V_O$ and $V_C$ on $V$ (see equations 4.1 and 4.2), we obtain the desired result, equation 4.6.

Appendix B: Circuit Flexibility

This section is more theoretical in nature, using the steady-state and time-constant equations derived in appendix A to provide insight into how the various parameters influence the two voltage dependencies (steady state and time constant). This section is likely of interest only to those who wish to use our approach for modeling ion channels.

An important consideration in generating these models is defining the location (i.e., the membrane voltage) and magnitude of the peak time constant.
Unlike the minimum time constant, which is determined simply by the difference between the bias voltages $u_{\tau H}$ and $u_H$ (or $u_{\tau L}$ and $u_L$), no bias directly controls the maximum time constant, since it is the point at which the sum of the transition rates is minimized (see Figure 5b). The voltage at which it occurs, however, is easily determined from equation 4.6:

$$V^{pk}_{\tau} = V^{mid}_u + \frac{U_T \log[\gamma_c/\gamma_o]}{\kappa(\gamma_o + \gamma_c)}. \tag{B.1}$$

Thus, where the bell curve lies relative to the sigmoid, whose midpoint lies at $V^{mid}_u$, is determined by the opening and closing voltages’ slopes ($\gamma_o$ and $\gamma_c$). Consequently, changing these slopes is the only way to displace the bell curve relative to the sigmoid. Shifting the bell curve by changing a parameter other than $\gamma_o$ or $\gamma_c$ will automatically shift the sigmoid by the same amount. As we change the opening and closing voltages’ slopes ($\gamma_o$ and $\gamma_c$), two things happen. First, due to $V^{mid}_u$’s dependence on these parameters (see equation 4.4), the sigmoid and the bell curve shift together. Second, due to the dependence we just described (see equation B.1), the bell curve shifts relative to the sigmoid. These effects are illustrated in Figures 9a and 9b for $\gamma_c$ ($\gamma_o$ behaves similarly). To eliminate the first effect while preserving the second, we can compensate for the sigmoid’s shift by adjusting the bias voltage $u_L$, which scales the closing rate (see equation 4.2). Consequently, $u_L$ also rescales the left part of the bell curve, where the closing rate is dominant, reducing its height relative to the minimum and canceling its shift due to the first effect. This is demonstrated in Figures 9c and 9d. The sigmoid remains fixed, while the bell curve shifts (slightly) due to the second effect. Although the bias voltage $u_L$ cancels the shift $\gamma_c$ produces in the sigmoid, it does not compensate for the change in the sigmoid’s slope.

Figure 9: Varying the closing voltage’s slope ($\gamma_c$). (a) Changing $\gamma_c$ adjusts both the slope and midpoint of the steady-state sigmoid. (b) $\gamma_c$ also affects the location and height (relative to the minimum) of the time constant’s bell curve; the change in height (plotted logarithmically) is dramatic. (c, d) Same as in a and b, except that we adjusted the bias voltage $u_L$ to compensate for the change in $\gamma_c$, so the sigmoid’s midpoint and the bell curve’s height remain the same. The sigmoid’s slope does change, as it did before, and the bell curve’s location shifts as well, though much less than before. In these plots, $\phi_o$ = 400 mV, $u_H$ = 400 mV, $u_L$ = 50 mV, $u_{\tau H}$ = 700 mV, and $\gamma_o$ = 1.0. $\gamma_c$’s values are 0.5 (thin, solid line), 0.75 (short, dashed line), 1.0 (long, dashed line), and 1.25 (thick, solid line).

Figure 10: Varying the closing voltage’s intercept ($\phi_c$). (a) Changing $\phi_c$ shifts the steady-state sigmoid’s midpoint, leaving its slope unaffected. (b) $\phi_c$ also affects the time constant bell curve’s location and height (relative to the minimum). In these plots, $\phi_o$ = 0 mV, $u_H$ = 400 mV, $u_L$ = 50 mV, $u_{\tau H}$ = 700 mV, $\gamma_o$ = 1.0, and $\gamma_c$ = 0.5. $\phi_c$’s values are 400 mV (thin, solid line), 425 mV (short, dashed line), 450 mV (long, dashed line), and 475 mV (thick, solid line).

Figure 11: Setting the bell curve’s location and height independently. (a) Changing the opening and closing voltages’ intercepts ($\phi_o$ and $\phi_c$) together shifts the bell curve’s location without affecting its height. The steady-state sigmoid (not shown) moves with the bell curve (see equation B.1). (b) Changing $\phi_o$ and $\phi_c$ in opposite ways increases the height (relative to the minimum) without affecting the location. The steady-state sigmoid (not shown) stays put as well. In these plots, $u_H$ = 400 mV, $u_L$ = 50 mV, $u_{\tau H}$ = 700 mV, $\gamma_o$ = 1.0, and $\gamma_c$ = 0.5. In both a and b, $\phi_c$ = 400 mV and $\phi_o$ = 0 mV for the thin, solid line. In a, both $\phi_c$ and $\phi_o$ increment by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line, and to the thick, solid line. In b, $\phi_c$ increments and $\phi_o$ decrements by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line, and to the thick, solid line.

We can shift the sigmoid and leave its slope unaffected by changing the opening and closing voltages’ intercepts, as shown in Figure 10 for $\phi_c$ ($\phi_o$ behaves similarly). However, the bell curve shifts by the same amount, and its height changes as well, since $\phi_c$ rescales the closing rate. Thus, it is not possible to shift the bell curve relative to the sigmoid without changing the latter’s slope; this is evident from equations 4.5 and B.1.
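Equation B.1 can be sanity-checked numerically. The Python sketch below locates the peak of the simplified time constant by a grid search and compares it with the closed-form prediction. It rests on several assumptions that are not taken verbatim from this appendix: linear voltage dependences $V_O = \phi_o + \gamma_o V$ and $V_C = \phi_c - \gamma_c V$ (opposite slope signs, as stated in appendix A); the matched-saturation condition $\kappa u_{\tau H} - u_H = \kappa u_{\tau L} - u_L$; a sigmoid midpoint defined by $\kappa V_O - u_H = \kappa V_C - u_L$; and the simplified form $\tau/\tau_{min} = 1 + 1/(e^{\kappa(V_O - u_{\tau H})/U_T} + e^{\kappa(V_C - u_{\tau L})/U_T})$ obtained from appendix A by dropping the constants. Parameter values follow the Figure 10 caption with $\phi_c$ = 400 mV, and all names are illustrative.

```python
import math

UT = 25.4    # thermal voltage in mV, as quoted in section 5
KAPPA = 0.7  # transistor parameter, as quoted in section 5

PARAMS = dict(phi_o=0.0, phi_c=400.0, gamma_o=1.0, gamma_c=0.5,
              u_h=400.0, u_l=50.0, u_tau_h=700.0)  # all voltages in mV


def u_tau_l(p):
    """Bias u_tauL implied by kappa*u_tauH - u_H = kappa*u_tauL - u_L."""
    return p["u_tau_h"] - (p["u_h"] - p["u_l"]) / KAPPA


def tau_over_tau_min(v, p):
    """Simplified bell curve (assumed form of equation 4.6), relative to tau_min."""
    v_o = p["phi_o"] + p["gamma_o"] * v
    v_c = p["phi_c"] - p["gamma_c"] * v
    a = KAPPA * (v_o - p["u_tau_h"]) / UT
    b = KAPPA * (v_c - u_tau_l(p)) / UT
    return 1.0 + 1.0 / (math.exp(a) + math.exp(b))


def v_mid(p):
    """Sigmoid midpoint, where kappa*V_O - u_H = kappa*V_C - u_L."""
    return (KAPPA * (p["phi_c"] - p["phi_o"]) + p["u_h"] - p["u_l"]) / (
        KAPPA * (p["gamma_o"] + p["gamma_c"]))


def v_peak_b1(p):
    """Peak location predicted by equation B.1."""
    return v_mid(p) + UT * math.log(p["gamma_c"] / p["gamma_o"]) / (
        KAPPA * (p["gamma_o"] + p["gamma_c"]))


def v_peak_numeric(p, lo=200.0, hi=800.0, step=0.01):
    """Brute-force peak location on a uniform voltage grid."""
    n = int(round((hi - lo) / step))
    grid = (lo + i * step for i in range(n + 1))
    return max(grid, key=lambda v: tau_over_tau_min(v, p))


if __name__ == "__main__":
    print(f"V_mid          = {v_mid(PARAMS):.2f} mV")
    print(f"peak (B.1)     = {v_peak_b1(PARAMS):.2f} mV")
    print(f"peak (numeric) = {v_peak_numeric(PARAMS):.2f} mV")
```

For $\gamma_o = 1$ and $\gamma_c = 0.5$ the predicted offset from the midpoint is $U_T \ln(0.5)/(1.5\kappa)$, about −16.8 mV; setting $\gamma_c = \gamma_o$ makes the logarithm vanish, so the peak sits at the midpoint.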


We can set the bell curve’s location and height independently if we change the opening and closing voltages’ intercepts by the same amount or by equal and opposite amounts, respectively. Equal changes in the intercepts ($\phi_o$ and $\phi_c$) shift the opening and closing rate curves by the same amount, thus shifting the bell curve (and the sigmoid) without affecting its height (see Figure 11a), whereas equal and opposite changes shift the opening and closing rate curves apart, leaving the point where they cross at the same location while rescaling the value of the rate there. As a result, the bell curve’s height is changed without affecting its location (see Figure 11b).

Choosing values for $\gamma_o$ and $\gamma_c$, however, presents a trade-off. These two parameters define the dependence of $V_O$ and $V_C$ on $V$ and are not external biases like the other parameters; rather, they are defined by the fabrication process through the transistor parameter $\kappa$. We can define their relationships with $\kappa$ through the use of different circuits; in our simple activation channel (see section 5), $\gamma_c = \kappa^2/(\kappa + 1)$, as defined by the four transistors that invert the membrane voltage. There are a couple of drawbacks. First, not all values for $\gamma_o$ or $\gamma_c$ are possible using only a few transistors. Expressed another way, a trade-off needs to be made between achieving more precise values for $\gamma_o$ or $\gamma_c$ and using fewer transistors within the design. The other drawback is that after fabrication, $\gamma_o$ and $\gamma_c$ can no longer be modified, as they are defined as functions of the transistor parameter $\kappa$. These issues merit special consideration before submitting the chip for fabrication.

References

Delbruck, T. (1991). “Bump” circuits for computing similarity and dissimilarity of analog voltages. In IJCNN-91-Seattle International Joint Conference on Neural Networks (Vol. 1, pp. 475–479). Piscataway, NJ: IEEE.

Destexhe, A., & Huguenard, J. R. (2000). Nonlinear thermodynamic models of voltage-dependent currents. J. Comput. Neurosci., 9(3), 259–270.

Destexhe, A., & Huguenard, J. R. (2001). Which formalism to use for voltage-dependent conductances? In R. C. Cannon & E. De Schutter (Eds.), Computational neuroscience: Realistic modeling for experimentalists (pp. 129–157). Boca Raton, FL: CRC.

Hill, T. L., & Chen, Y. (1972). On the theory of ion transport across the nerve membrane. VI. Free energy and activation free energies of conformational change. Proc. Natl. Acad. Sci. U.S.A., 69(7), 1723–1726.

Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer Associates.

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117(4), 500–544.

Hynna, K. (2005). T channel dynamics in a silicon LGN. Unpublished doctoral dissertation, University of Pennsylvania.


Hynna, K., & Boahen, K. (2001). Space-rate coding in an adaptive silicon neuron. Neural Networks, 14(6–7), 645–656.

Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242(4886), 1654–1664.

Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354(6354), 515–518.

McCormick, D. A., & Pape, H. C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.

Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Monsivais, P., Clark, B. A., Roth, A., & Hausser, M. (2005). Determinants of action potential propagation in cerebellar Purkinje cell axons. J. Neurosci., 25(2), 464–472.

Pavasovic, A., Andreou, A. G., & Westgate, C. R. (1994). Characterization of subthreshold MOS mismatch in transistors for VLSI systems. J. VLSI Signal Process. Syst., 8(1), 75–85.

Simoni, M. F., Cymbalyuk, G. S., Sorensen, M. E., Calabrese, R. L., & DeWeerth, S. P. (2004). A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering, 51(2), 342–354.

Stevens, C. F. (1978). Interactions between intrinsic membrane protein and electric field: An approach to studying nerve excitability. Biophys. J., 22(2), 295–306.

Svirskis, G., Kotak, V., Sanes, D. H., & Rinzel, J. (2002). Enhancement of signal-to-noise ratio and phase locking for small inputs by a low-threshold outward current in auditory neurons. J. Neurosci., 22(24), 11019–11025.

Tao, L., Shelley, M., McLaughlin, D., & Shapley, R. (2004). An egalitarian network model for the emergence of simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 101(1), 366–371.

Willms, A. R., Baro, D. J., Harris-Warrick, R. M., & Guckenheimer, J. (1999). An improved parameter estimation method for Hodgkin-Huxley models. J. Comput. Neurosci., 6(2), 145–168.

Zaghloul, K. A., & Boahen, K. (2004). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.

Zhan, X. J., Cox, C. L., Rinzel, J., & Sherman, S. M. (1999). Current clamp and modeling studies of low-threshold calcium spikes in cells of the cat’s lateral geniculate nucleus. J. Neurophysiol., 81(5), 2360–2373.

Received October 6, 2005; accepted June 20, 2006.

LETTER

Communicated by Harry Erwin

Spatiotemporal Conversion of Auditory Information for Cochleotopic Mapping

Osamu Hoshino
Department of Intelligent Systems Engineering, Ibaraki University, Hitachi, Ibaraki, 316-8511 Japan

Neural Computation 19, 351–370 (2007). © 2007 Massachusetts Institute of Technology

Auditory communication signals such as monkey calls are complex FM vocal sounds and in general induce action potentials with different timing in the primary auditory cortex. The delay line scheme is one effective way of detecting such neuronal timing. However, the scheme is not straightforwardly applicable if the time intervals of signals exceed the latency time of delay lines. In fact, monkey calls are often expressed in longer time intervals (hundreds of milliseconds to seconds) and are beyond the latency times observed in the brain (less than several hundreds of milliseconds). Here, we propose a cochleotopic map, analogous to the retinotopic map in vision. We show that information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns of neurons, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. We suggest that the spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, or monkey call identification by higher cortical areas.

1 Introduction

Frequency modulation (FM) is a critical parameter for constructing auditory communication signals in both humans and monkeys. For example, human speech contains FM sounds called formant transitions that are critical for encoding consonant and vowel combinations such as “ga,” “da,” and “ba” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Suga, 1995). Monkeys use complex FM sounds, the so-called monkey calls, which are considered to be a precursor of human speech (Poremba et al., 2004; Holden, 2004), to interact socially with other members of the species (Symmes, Newman, Talmage-Riggs, & Lieblich, 1979; Janik, 2000; Tyack, 2000).

Auditory FM signals differ in informational structure from other sensory signals in that they are processed in a time-dependent manner (or characterized by time-varying spectral frequencies), while color and orientation in vision, odorants in olfaction, and tastants in gustation are characterized in most cases in a time-independent manner. Understanding how the auditory cortex encodes and detects the streams of spectral information arising from the temporal structure of FM sounds is one of the most challenging problems in auditory cognitive neuroscience.

Auditory signals sent from receptor neurons enter the primary auditory area (AI). Neurons of the AI are arranged in an orderly manner and form the so-called tonotopic map, where tonotopic representation of the cochlea is well preserved (Yost, 1994). Each neuron on the map tends to respond best to its characteristic frequency. Because of the tonotopic organization, these neurons could be activated in a sequential manner and generate action potentials with different timing when stimulated with an FM sound. It is interesting to know how higher cortical areas, to which the AI projects, detect the timing of action potentials so that the brain can identify the applied FM sound.

One effective way of detecting the timing of action potentials is to use delay lines. Jeffress (1948) proposed a theory for sound localization. The theory is that an interaural time difference, which expresses the azimuthal location of a sound source, could be detected by neurons that receive signals from both ears through distinct delay lines. Suga (1995) proposed a multiple delay line scheme for echo sound detection in bats. Bi and Poo (1999) demonstrated in a cultured hippocampal neuronal network that the timing of action potentials could be detected when relevant delay lines are properly chosen. The timing of action potentials in these processes spanned relatively short intervals, ranging from submilliseconds to tens of milliseconds, for which the delay line scheme worked well. However, the information about FM sounds contained in monkey calls is in general expressed in longer time intervals of hundreds of milliseconds to seconds.
The delay line scheme is unlikely to be applicable to them, because the delays observed in the brain last at most several hundred milliseconds (Miller, 1987). This implies that the brain might employ another strategy for identifying individual calls.

A speculative cochleotopic map was proposed in relation to a retinotopic map in vision (Rauschecker, 1998). The retinotopic map expresses information about the location of a moving bar in a two-dimensional visual space that is projected onto the retina. As the bar moves in the visual space, the map shows a spatiotemporal firing pattern of neurons. The axes of the map indicate the location of the bar in the two-dimensional visual space. In the cochleotopic mapping scheme, the location of neuronal activation moves along the frequency axis as the frequency of an FM sound sweeps. However, the exact auditory variable for the other axis is still unknown.

We propose here a hypothetical two-dimensional (frequency axis, propagation axis) cochleotopic neural network ($N_{CO}$ network) for the AI on which information about FM sounds is mapped in a spatiotemporal manner. The propagation axis is assumed for the unknown axis. Stimulation with a pure (single-frequency) tone activates a peripheral neuron, whose activity propagates along the propagation axis, or along its isofrequency band. When the $N_{CO}$ network is stimulated with an FM sound, peripheral neurons are sequentially activated, and their activity then propagates along their isofrequency bands. As a consequence, the information about the applied FM sound is expressed as a specific spatiotemporal firing pattern in the $N_{CO}$ network dynamics.

Such activity propagation has been reported in the AI. Hess and Scheich (1996) stimulated Mongolian gerbils with pure tones (1 kHz to 16 kHz) and recorded the activity of AI neurons. The researchers found that neuronal activation propagated along isofrequency bands at all frequencies. Taniguchi, Horikawa, Moriyama, and Nasu (1992) stimulated guinea pigs with pure tones (1 kHz to 30 kHz) and recorded the activity of the anterior field (field A) in which tonotopic organization was well preserved. The researchers found that the focal activation beginning in field A propagated in two directions: along isofrequency bands and toward field DC.

There has been neurophysiological evidence that FM sounds are precisely detected by auditory neurons. Neurons of the lateral belt (LB), which receives signals from the AI, responded to FM sounds (Rauschecker, 1998). These neurons were organized in a relatively orderly fashion depending on the sweeping rate (between slow and fast) and direction (upward or downward) of the FMs. Based on these experimental findings, we construct a neural network model ($N_{FM}$) for the LB to which the $N_{CO}$ network projects. A given FM sound evokes a specific spatiotemporal firing pattern in the $N_{CO}$ network, to which a certain group of $N_{FM}$ neurons ($N_{FM}$ column) responds and identifies the applied FM. It is also well known that LB neurons respond to monkey calls (Rauschecker, 1997).
Monkey calls as vocal signatures are complex FM sounds and play an important role in identifying individuals, especially when their visual systems are unavailable, as in a luxuriant forest (Symmes et al., 1979). To detect such complex FM sounds, we construct a higher neural network ($N_{ID}$ network) model for the STGr (rostral portion of the superior temporal gyrus) to which the LB projects. The $N_{ID}$ network receives selective projections from the $N_{FM}$ network. When a monkey call is presented to the $N_{CO}$ network, multiple $N_{FM}$ columns are sequentially activated in a specific order. The $N_{ID}$ network integrates the sequence of the dynamic $N_{FM}$ columns, thereby identifying that call.

Based on the proposed cochleotopic mapping scheme, we investigate how FM sound information is encoded and detected. Applying simple (linearly sweeping) and complex (monkey call) FM sounds to the $N_{CO}$ network, we record the activities of neurons. By statistically analyzing them, we try to understand the neuronal mechanisms that underlie FM sound information processing in the auditory cortex.


2 Neural Network Model

2.1 Outline of the Model. The N_CO network, modeling the AI, is organized in a tonotopic fashion as shown in Figure 1A. When the frequency of an applied FM sound sweeps upward, the neuronal activation of the periphery (p1) sweeps from f1 to f40 and then moves along the propagation axis (isofrequency bands). The filled circles schematically indicate a neuronal firing pattern induced by a simple (linearly sweeping) upward FM sound at a certain time after the stimulus onset. The gray circles indicate a neuronal firing pattern induced by a downward FM sound. The spatiotemporal firing pattern of the N_CO network expresses combinatorial information about the direction and the sweep rate of the applied FM sound. Neurons of the N_FM network, modeling the LB, receive convergent projections from the N_CO network (solid and dashed lines) and detect the upward (black circles) and downward (gray circles) FM sounds. We made sets of selective convergent projections from the N_CO to the N_FM network in order to detect specific linearly sweeping FM sounds. Neurons within isofrequency bands are connected by excitatory and inhibitory delay lines, as shown in Figure 1B. The excitatory connections (solid lines) from the periphery to the center are the major driving force that moves the neuronal activation along the propagation axis, and the inhibitory connections (dashed lines) were employed to sharpen the spatiotemporal firing patterns. This specific circuitry was used to express functionally the propagation axis, for which evidence in the AI is accumulating (Hess & Scheich, 1996; Taniguchi et al., 1992). For simplicity, there is no connection between isofrequency bands. Neurons within an N_FM column are connected with each other via excitatory synapses, and neurons in different N_FM columns are connected via inhibitory synapses. This circuitry enables each N_FM column to respond selectively to a specific linearly sweeping FM sound.
Figures 1C and 1D are schematic drawings of neuronal responses to a linearly sweeping FM sound. When the N_CO network is stimulated with an upward FM sound sweeping at a slow (Figure 1C, left), intermediate (Figure 1C, center), or fast rate (Figure 1C, right), the activation area moves from the lower left to the upper right (arrows). When a downward FM sound is presented, the activation area moves from the lower right to the upper left (see Figure 1D). When the activation area reaches a certain position (gray ellipses), the neurons send action potentials to the N_FM network via the selective feedforward projections and activate the N_FM column corresponding to the applied FM (black ellipses). The neuronal activation of the other N_CO regions (dashed ellipses) could also be used for FM detection. Nevertheless, we chose the firing patterns marked by the gray ellipses, because these patterns appear first in the time courses and have the largest number of neurons simultaneously activated. This enables the N_FM network to respond reliably and rapidly to the applied FM sounds.
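The conversion outlined above can be illustrated with a rough Python sketch that computes idealized first-activation times on the N_CO map for a linear sweep. It is not the simulation used in this study. It assumes noise-free firing and an activity front that advances one neuron per delay along the propagation axis, and it borrows the band spacing (100 Hz per band from 8 kHz) and the 10 ms delay from the parameter values given later in the text. The function names are hypothetical.

```python
# Idealized first-activation times on the 40 x 40 N_CO map.
# Assumptions (not the paper's simulation): deterministic firing, and the
# activity front moves one neuron per delay along the propagation axis.

F1_HZ = 8000.0        # frequency of band f1 (from section 3.1)
BAND_STEP_HZ = 100.0  # spacing between neighboring isofrequency bands
N_BANDS = 40          # f1..f40
N_DEPTH = 40          # p1..p40
DELAY_MS = 10.0       # delay between neighboring neurons (section 2.2)


def activation_time_ms(band, depth, rate_hz_per_ms, upward=True):
    """Time (ms after onset) at which neuron (f_band, p_depth) first fires."""
    if not (1 <= band <= N_BANDS and 1 <= depth <= N_DEPTH):
        raise ValueError("band and depth must lie in 1..40")
    # Bands the sweep has crossed before it reaches this band.
    bands_crossed = (band - 1) if upward else (N_BANDS - band)
    peripheral_time = bands_crossed * BAND_STEP_HZ / rate_hz_per_ms
    # Activity then travels from p1 toward the center, one delay per neuron.
    return peripheral_time + (depth - 1) * DELAY_MS


def active_front(t_ms, rate_hz_per_ms, upward=True):
    """All (band, depth) pairs whose idealized activation time equals t_ms."""
    return [(b, d)
            for b in range(1, N_BANDS + 1)
            for d in range(1, N_DEPTH + 1)
            if abs(activation_time_ms(b, d, rate_hz_per_ms, upward) - t_ms) < 1e-9]


if __name__ == "__main__":
    # An upward sweep at 20 Hz per ms reaches band f40 after 39 * 5 ms.
    print(activation_time_ms(40, 1, 20.0))
    # The front is a diagonal line whose slope depends on the sweep rate.
    print(active_front(10.0, 20.0))
```

In this idealization a faster sweep gives a shallower front on the map and a slower sweep a steeper one, which is the property the selective N_CO to N_FM projections exploit.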

[Figure 1 graphic omitted. Panel A: the N_CO map with the frequency axis (f1–f40, low to high), the propagation axis (p1–p40), the input applied at p1, and the N_FM columns (c1–c40) for upward and downward sweeps. Panel B: delay-line connections within an isofrequency band. Panels C and D: N_CO and N_FM activity on the frequency and propagation axes for upward and downward sweeps, from slow to fast.]

Figure 1: Neural network model. (A) The N_CO network is organized in a tonotopic manner. The N_FM network receives selective projections from the N_CO network. Among N_FM columns, c1–c10 and c31–c40 detect FM sounds that sweep linearly in downward and upward directions, respectively. For clarity, only two sets of firing patterns (black and gray circles) and projections (solid and dashed lines) for an upward and a downward FM sound are depicted. (B) Neuronal connections within isofrequency bands of the N_CO network. Neurons are connected via excitatory (solid lines) and inhibitory (dashed lines) delay lines, where Δt denotes a signal transmission delay time. (C) Schematic drawings of the spatiotemporal neuronal responses of the N_CO network and those of the N_FM network to simple (linearly sweeping) upward FM sounds. Activity patterns for FM sounds that sweep at a slow (left), intermediate (center), and fast (right) rate are shown. Arrows indicate the directions of movements of active areas. (D) Schematic drawings as in C for downward FM sounds.


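2.2 Model Description. Dynamic evolutions of the membrane potentials of neurons of the N_CO and N_FM networks are defined, respectively, by

$$\tau_{CO} \frac{du^{CO}_{k,i}(t)}{dt} = -u^{CO}_{k,i}(t) + \sum_{j=1}^{M_D} \left[ w^{ex}_{k,i;\,k,(i-j)}\, S^{CO}_{k,(i-j)}(t - j\Delta t) + w^{ih}_{k,i;\,k,(i+j)}\, S^{CO}_{k,(i+j)}(t - j\Delta t) \right], \tag{2.1}$$

$$\tau_{FM} \frac{du^{FM}_{i}(t)}{dt} = -u^{FM}_{i}(t) + \sum_{k=f1}^{f40} \sum_{j=p1}^{p40} L_{i;\,k,j}\, S^{CO}_{k,j}(t - \Delta t_{FM}) + \sum_{j=1\,(j \neq i)}^{M_{FM}} w^{FM}_{ij}\, S^{FM}_{j}(t), \tag{2.2}$$

where

$$\mathrm{Prob}\left[ S^{Y}_{i}(t) = 1 \right] = f_Y\left[ u^{Y}_{i}(t) \right] \quad (Y = CO, FM), \qquad f_Y[u] = \frac{1}{1 + e^{-\eta_Y (u - \theta_Y)}}. \tag{2.3}$$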

$u^{CO}_{k,i}(t)$ and $u^{FM}_{i}(t)$ are the membrane potentials of the ith N_CO neuron of the kth (k = f1–f40) isofrequency band and of the ith N_FM neuron at time t, respectively. $\tau_Y$ (Y = CO, FM) is the decay time of the membrane potential in network N_Y. $M_D$ is the number of excitatory or inhibitory input delay lines that a single N_CO neuron receives from other neurons within its isofrequency band. $w^{ex}_{k,i;\,k,(i-j)}$ and $w^{ih}_{k,i;\,k,(i+j)}$ are, respectively, the excitatory and inhibitory synaptic connection strengths from neuron (i − j) to neuron i and from neuron (i + j) to neuron i of the kth isofrequency band. $S^{CO}_{k,j}(t - j\Delta t) = 1$ expresses an action potential of the jth N_CO neuron of the kth isofrequency band, where $j\Delta t$ denotes a signal transmission delay time (see Figure 1B). (f1–f40, p1–p40) denotes the locations of neurons on the two-dimensional N_CO map (see Figure 1A). $L_{i;\,k,j}$ is the strength of the synaptic connection from the jth N_CO neuron of the kth isofrequency band to the ith N_FM neuron. $\Delta t_{FM}$ is the signal transmission delay time from the N_CO to the N_FM network. $M_{FM}$ is the number of N_FM neurons. $w^{FM}_{ij}$ is the synaptic connection strength from the jth to the ith N_FM neuron, and $S^{FM}_{j}(t) = 1$ expresses an action potential of the jth N_FM neuron. $\{w^{FM}_{ij}\}$ was set so that neurons within an N_FM column (c1–c40; see Figure 1A) excite each other and neurons in different N_FM columns laterally inhibit each other. $\eta_Y$ and $\theta_Y$ are the steepness and the threshold of the sigmoid function $f_Y$, respectively, for neurons of network N_Y. Equation 2.3 defines the probability of firing of neuron i; that is, the probability of $S^{Y}_{i}(t) = 1$ is given by the function $f_Y$. After firing, the membrane potential of the neuron is reset to 0. FM sound stimuli are applied to the peripheral neurons of the N_CO network. Dynamic evolutions of the membrane potentials of these neurons are defined by

$$\tau_{CO} \frac{du^{CO}_{k,1}(t)}{dt} = -u^{CO}_{k,1}(t) + \sum_{j=1}^{M_D} w^{ih}_{k,1;\,k,(1+j)}\, S^{CO}_{k,(1+j)}(t - j\Delta t) + \alpha I_{k,1}(t), \tag{2.4}$$

where $I_{k,1}(t)$ is the input stimulus to the peripheral neuron of the kth isofrequency band, that is, the neuron located at (k, p1) with k = f1–f40 (see Figure 1A). α is the intensity of the input. Note that the peripheral neurons receive only inhibitory input from the (1 + j)th neuron with a delay of $j\Delta t$ and do not receive any delayed excitatory input. Network parameter values are as follows. The numbers of neurons are 40 (f1–f40) × 40 (p1–p40) and 40 (c1–c40) × 12 for the N_CO and N_FM networks, respectively. $\tau_{CO}$ = 10 ms, $\tau_{FM}$ = 10 ms, $\theta_Y$ = 0.7, $\eta_Y$ = 10.0, and $M_D$ = 3. $w^{ex}_{k,i;\,k,(i-j)}$ = 5.0 and $w^{ih}_{k,i;\,k,(i+j)}$ = −0.5. $\Delta t$ = 10 ms and $\Delta t_{FM}$ = 20 ms. $L_{i;\,k,j}$ was selectively set at either 0.1 or 0, as described in section 2.1, so that the specific firing patterns induced in the N_CO network dynamics can activate their corresponding N_FM columns (see Figures 1C and 1D). $w^{FM}_{ij}$ = 0.3 and −5.0 within and between N_FM columns, respectively. α = 8.0, and $I_{k,1}(t)$ = 1 for an input and 0 for no input.

3 Results

3.1 Tuning Property to Simple FM Sounds. We show here how the information about simple (linearly sweeping) FM sounds could be expressed as spatiotemporal firing patterns in the N_CO network dynamics. We also show how the auditory information could be transferred to and detected by specific neuronal columns of the N_FM network. Response properties (action potential generation) of N_FM neurons are compared with those observed experimentally. An upward FM sound sweeping linearly at 20 Hz per ms induces a specific spatiotemporal firing pattern in the N_CO network, in which the neuronal activation moves from the lower left toward the upper right (see Figure 2A). When the activity reaches a certain point (time = 200 ms), the active neurons send action potentials to the N_FM network and stimulate the corresponding N_FM column (arrow) at approximately 220 ms. The difference in activation time between the N_CO (200 ms) and N_FM (220 ms) networks arises from the signal transmission delay between the two networks, $\Delta t_{FM}$ = 20 ms (see equation 2.2).
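The building blocks of equations 2.1 and 2.3 can be sketched in Python using the parameter values listed in section 2.2. This is a minimal illustration and not the code used for the simulations. The forward-Euler scheme, the 1 ms step, and all function names are assumptions introduced here, since the text does not specify the numerical integration method.

```python
import math

# Parameter values quoted in section 2.2.
ETA = 10.0      # steepness of the sigmoid (eta_Y)
THETA = 0.7     # threshold of the sigmoid (theta_Y)
TAU_MS = 10.0   # membrane decay time (tau_CO = tau_FM = 10 ms)
W_EX = 5.0      # excitatory delay-line weight
W_IH = -0.5     # inhibitory delay-line weight
M_D = 3         # number of delay lines of each sign per neuron


def firing_probability(u, eta=ETA, theta=THETA):
    """Equation 2.3: probability that a neuron with potential u fires."""
    return 1.0 / (1.0 + math.exp(-eta * (u - theta)))


def delay_line_drive(spikes_ex, spikes_ih):
    """Summed recurrent input of equation 2.1 for one N_CO neuron i.

    spikes_ex[j-1] is S of neuron (i - j) at time t - j*dt (0 or 1);
    spikes_ih[j-1] is S of neuron (i + j) at time t - j*dt (0 or 1).
    """
    if len(spikes_ex) != M_D or len(spikes_ih) != M_D:
        raise ValueError("expected M_D entries per delay-line list")
    return W_EX * sum(spikes_ex) + W_IH * sum(spikes_ih)


def euler_step(u, drive, dt_ms=1.0, tau_ms=TAU_MS):
    """One forward-Euler step of tau * du/dt = -u + drive.

    The integration scheme and step size are assumptions of this sketch.
    """
    return u + (dt_ms / tau_ms) * (-u + drive)


if __name__ == "__main__":
    # One delayed excitatory spike and one delayed inhibitory spike.
    drive = delay_line_drive([1, 0, 0], [0, 1, 0])
    u = euler_step(0.0, drive)
    print(drive, u, firing_probability(u))
```

With η = 10 the sigmoid is steep, so a resting neuron (u = 0) fires with a probability below 0.001, while a neuron at u = θ fires with probability 0.5.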

[Figure 2 graphic omitted. Panel A: N_CO activity at 100, 200, 300, and 500 ms with the corresponding N_FM activity at 120, 220, 320, and 520 ms. Panels B and C: spikes per bin (0–10) and stimulus frequency f (8–12 kHz) against time (0–300 ms) for upward and downward sweeps at 13.3, 20, and 26.7 Hz/ms.]

We assumed f1 = 8 kHz to f40 = 11.9 kHz (see Figure 1A), with the isofrequency bands placed at even intervals of 100 Hz per band. These frequencies are within the range observed in squirrel monkeys (Symmes et al., 1979) and were employed for investigating how complex FM sounds such as monkey calls could be identified, as will be shown in sections 3.2 and 3.3. Figure 2B presents the total spike counts (top) and raster plots (middle) for the neurons of a given N_FM column when stimulated with different upward FM sounds (bottom). The columnar neurons show specific sensitivity to the upward FM sound (20 Hz per ms) but respond less to downward FM sounds (see Figure 2C). This tendency is largely consistent with that observed in macaque monkeys (Rauschecker, 1998). Although the tuning of the N_FM columnar neurons to the applied FM sound (sweep rate = 20 Hz per ms; upward) is evident, these neurons also show weak responses to the other upward FM sounds with different sweep rates (see the arrows of Figure 2B). Such weak responsiveness to the "irrelevant" FM sounds is due to the overlapping of N_CO firing patterns, as schematically shown in Figure 3. The set of N_CO neurons within the solid ellipse, which are simultaneously activated by the FM stimulus (sweep rate = 20 Hz per ms; upward), send action potentials to the relevant N_FM column and maximally activate the column (top left). However, the subsets of N_CO neurons within the overlapping regions (gray and black) could also be activated, respectively, by the irrelevant upward FM sounds (26.7 and 13.3 Hz per ms). These neurons send a relatively small number of action potentials to the same N_FM column, which results in weaker neural responses in the column (bottom left and top right).

Figure 2: Response property of the N_CO and N_FM networks. (A) Time courses of neuronal activation when stimulated with an upward FM sound sweeping at 20 Hz per ms. The level of neuronal activity is expressed for each neuron as the number of action potentials observed within a time interval (bin = 10 ms). Arrows indicate the responses of N_FM neurons to the applied FM sound. (B) Spike counts (top) and raster plots (middle) for the N_FM column neurons when stimulated with different upward FM sounds (bottom). Spike count is the number of action potentials of the N_FM column neurons observed within a time interval (bin = 10 ms). (C) Spike counts (top) and raster plots (middle) for the N_FM column when stimulated with different downward FM sounds (bottom). The N_FM column shows specific sensitivity to the upward FM sound sweeping at 20 Hz per ms.

[Figure 3 graphic omitted. Schematic of overlapping N_CO firing patterns on the frequency (low to high) and propagation axes, with N_FM column responses (spikes per bin, 0–10, against time, 0–300 ms) for sweeps at 13.3, 20, and 26.7 Hz/ms.]

Figure 3: Overlapping property of N_CO firing patterns. Thin-dashed, solid, and thick-dashed areas schematically depict the firing patterns generated by the FM sounds sweeping at 26.7 Hz per ms, 20 Hz per ms, and 13.3 Hz per ms, respectively. A certain N_FM column is activated maximally by its preferred FM sound (20 Hz per ms) (top left). The overlapping regions (gray and black) are activated not only by the upward FM (20 Hz per ms) but also by those sweeping at 26.7 Hz per ms (gray region) and 13.3 Hz per ms (black region). This results in weaker neuronal responses in the same N_FM column (bottom left and top right).

3.2 Tuning Property to Complex FM Sounds. We show here how complex FM sounds such as monkey calls could be expressed as specific spatiotemporal firing patterns in the N_CO network dynamics. We also show how the information about individual monkey calls can be decomposed into simple (linearly sweeping) FM components. We used artificial isolation peeps (IPs) as monkey calls (Symmes et al., 1979), whose pitch profiles are shown in Figure 4A. A specific spatiotemporal firing pattern is induced in the N_CO network when it is stimulated with the IP of monkey X, which then activates multiple N_FM columns in a sequential manner (see Figure 4B). We observed distinct sequential orders of columnar activation for the calls of monkeys X, Y, and Z. Figure 4C presents the details of the sequences of columnar activation for monkey X (thick solid line), Y (thin solid line), and Z (dashed line), indicating that the IPs are decomposed into specific sequences of simple (linearly sweeping) FM components. In the model, the neurons of a currently active N_FM column continue firing, even without any excitatory input, until another dynamic N_FM column emerges, that is, until its neurons begin to fire. For example, N_FM column c31 is activated at time = 0.22 s (upward arrow) and continues firing (downward arrow) until N_FM column c1 begins to fire (at 0.34 s; rightward arrow).
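The decomposition of a call into linearly sweeping FM components can be pictured with a small Python sketch that splits a piecewise-linear pitch contour into constant-slope segments. It only illustrates the idea of decomposition. The function name and the example contour are hypothetical and are not the IP profiles of Figure 4A, and the mapping from a sweep rate to a particular N_FM column is deliberately left out because the text does not specify it.

```python
def linear_segments(breakpoints):
    """Split a piecewise-linear pitch contour into constant-slope FM components.

    breakpoints: list of (time_ms, frequency_hz) pairs in increasing time order.
    Returns one dict per segment with its direction, sweep rate in Hz per ms,
    and start and end times in ms.
    """
    segments = []
    for (t0, f0), (t1, f1) in zip(breakpoints, breakpoints[1:]):
        if t1 <= t0:
            raise ValueError("times must be strictly increasing")
        slope = (f1 - f0) / (t1 - t0)
        if slope == 0:
            direction = "flat"
        else:
            direction = "upward" if slope > 0 else "downward"
        segments.append({
            "direction": direction,
            "rate_hz_per_ms": abs(slope),
            "start_ms": t0,
            "end_ms": t1,
        })
    return segments


if __name__ == "__main__":
    # Hypothetical contour: up from 8 to 12 kHz, then down to 9 kHz.
    contour = [(0, 8000), (200, 12000), (400, 9000)]
    for segment in linear_segments(contour):
        print(segment)
```

In the model this role is played by the N_FM columns themselves, each of which responds to one combination of sweep direction and rate.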

[Figure 4 graphic omitted. Panel A: frequency (kHz) against time (s) for the IPs of monkeys X, Y, and Z. Panel B: snapshots of N_CO and N_FM activity between 129 and 501 ms. Panel C: active N_FM column (c1–c40) against time (0–0.7 s) for monkeys X, Y, and Z.]

Figure 4: Spatiotemporal conversion of information about isolation peeps (IPs). (A) Profiles of artificial IPs for monkey calls (X, Y, Z) used in the simulations. (B) Time courses of neuronal activation of the N_CO and N_FM networks induced by the IP of monkey X. (C) Sequences of dynamic N_FM columns induced by the IPs of monkey X (thick solid line), Y (thin solid line), and Z (dashed line). Each N_FM column, c31–c40 and c1–c10, responds to a specific linearly sweeping FM component involved in the IPs. N_FM columns c11–c30 were not assigned to detect FM sounds in this simulation.

This self-generated continuous firing could be mediated by mutual excitation within N_FM columns. In the next section, we try to identify these monkey calls by an integrative process, to which the persistent neuronal firing effectively contributes.

3.3 Monkey Call Identification. We showed in the previous section that the information about complex FM sounds, that is, the IPs of monkeys, could be decomposed into simple (linearly sweeping) FM components. This is reminiscent of early visual systems, which decompose complex visual objects into simple features such as edge, orientation, and color. Higher visual areas integrate these features so that the visual images of the objects can be reconstructed as unified percepts. To identify the sequences of the dynamic N_FM columns, we extended the model by adding a higher (integrative) neural network, N_ID (see Figure 5A). We made selective feedforward projections from the N_FM to the N_ID network (solid lines) in order to integrate the series of FM components. The specific N_ID column (filled circles; N_ID) assigned to detect a certain IP receives convergent inputs from the N_FM columns (filled circles; N_FM) that are sequentially activated by the FM components constituting the IP (filled circles; N_CO). The N_ID network may correspond to a lateral belt (LB) area that is known to respond to monkey calls or, as we assumed here, to the rostral portion of the superior temporal gyrus (STGr), to which the LB projects (Rauschecker, 1998). Dynamic evolutions of the membrane potentials of N_ID neurons are defined by

$$\tau_{ID} \frac{du^{ID}_{i}(t)}{dt} = -u^{ID}_{i}(t) + \sum_{j=1}^{M_{FM}} L^{ID}_{ij}\, S^{FM}_{j}(t - \Delta t_{ID}) + \sum_{j=1\,(j \neq i)}^{M_{ID}} w^{ID}_{ij}\, S^{ID}_{j}(t), \tag{3.1}$$

where action potentials are generated according to equation 2.3 and the other parameter values are the same as those for the N_FM network (see equation 2.2). As shown in Figure 5B, when the N_CO network is stimulated with the IP of monkey X, multiple N_FM columns are sequentially activated, which then activate the call-relevant N_ID column at 367 ms. In this integrative process, the N_ID column receives consecutive action potentials from the N_FM columns, as described in the previous section (see Figure 4C). Such continuous activation of the N_ID column gradually depolarizes its neurons and allows them to fire when the membrane potentials reach a threshold, thereby identifying the applied IP. Figure 5C presents the identification process for monkey Z.
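The gradual depolarization described above can be illustrated with a worked example in Python. Under the simplifying assumption of a constant feedforward drive D and no lateral input, equation 3.1 reduces to τ du/dt = −u + D, whose solution from u(0) = 0 is u(t) = D(1 − e^{−t/τ}), so the potential reaches θ at t* = τ ln(D / (D − θ)) when D > θ. The constant drive, the hard threshold (the model itself fires probabilistically according to equation 2.3), and the function name are assumptions of this sketch.

```python
import math


def time_to_threshold_ms(drive, theta=0.7, tau_ms=10.0):
    """Time for u(t) = drive * (1 - exp(-t / tau)) to reach theta.

    Assumes a constant drive, u(0) = 0, and a hard threshold; returns
    math.inf when the drive is too weak for u ever to reach theta.
    """
    if drive <= theta:
        return math.inf
    return tau_ms * math.log(drive / (drive - theta))


if __name__ == "__main__":
    # Hypothetical drive values; stronger sustained input crosses sooner.
    for drive in (0.5, 0.8, 1.4, 2.8):
        print(drive, time_to_threshold_ms(drive))
```

The sketch shows why sustained firing of the N_FM columns matters: a drive that stays below θ never brings the integrating neuron to threshold, whereas persistent input above θ does so within a few membrane time constants.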

[Figure 5 graphic omitted. Panel A: the model extended with the N_ID network. Panels B and C: snapshots of N_CO, N_FM, and N_ID activity for monkey X (103, 208, 350, and 367 ms) and monkey Z (83, 338, 410, and 509 ms).]

Figure 5: Monkey call identification. (A) A higher (integrative) neural network N_ID is added to the original model (see Figure 1A), which integrates the specific sequences of dynamic N_FM columns induced by individual IPs. (B,C) Time courses of neuronal activation in the N_CO, N_FM, and N_ID networks when stimulated with the IPs of monkeys X (B) and Z (C).

Figure 6 presents how similar monkey calls could be distinguished from each other. The IP of monkey V (see Figure 6A, left) has a similar spectrogram in the first part (time = 0–0.1 s; solid line) to that of monkey A (dashed line). When the N_CO network is stimulated with the IP (see Figure 6B), multiple N_FM columns are sequentially activated, which then activate the two N_ID columns corresponding to the IPs of monkeys V and A (at 330 ms). The two dynamic N_ID columns compete for a while (at 330–446 ms), and the N_ID column corresponding to monkey V finally prevails (at 500 ms). The neuronal competition between the two dynamic N_ID columns arises from the lateral inhibition between N_ID columns. In contrast, when the IP of monkey W, which has a similar spectrogram in the last part (0.3–0.4 s; see Figure 6A, right), is presented, the N_ID column corresponding to the IP of monkey W is selectively activated (at 338 ms) without competition, as shown in Figure 6C. Note that the time required to identify monkey W (338 ms; see Figure 6C) is shorter than that for monkey V (500 ms; see Figure 6B), presumably because there is less neuronal competition between dynamic N_ID columns. These results indicate that the spectrogram of the last part might be useful for further analyses if circumstances require, although it is more time consuming.

[Figure 6 graphic omitted. Panel A: frequency (kHz) against time (s) for the IPs of monkeys V and W. Panels B and C: snapshots of N_CO, N_FM, and N_ID activity for monkey V (250, 330, 446, and 500 ms) and monkey W (83, 338, 380, and 509 ms).]

Figure 6: Identification of monkey calls that have similar auditory spectrograms. (A) Profiles of artificial IPs for monkeys V and W (solid lines). The dashed lines denote that of monkey A. (B,C) Time courses of neuronal activation in the N_CO, N_FM, and N_ID networks when stimulated with the IPs of monkeys V (B) and W (C).

4 Discussion

We have proposed the hypothesis that information about monkey calls (1) could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, (2) can then be decomposed into simple (linearly sweeping) FM components, and (3) is integrated into unified percepts by higher cortical networks. For the cochleotopic two-dimensional map (hypothesis 1), we assumed activity propagation along the isofrequency axis (bands) in order to make a distinct spatiotemporal firing pattern for a given monkey call. Imaging studies (Taniguchi et al., 1992; Hess & Scheich, 1996; Song et al., 2005) provided evidence of such activity propagation in the primary auditory cortex (AI). When alternating pure tones were presented, that is, tones alternating between 1 and 8 kHz (Hess & Scheich, 1996), the activity propagation was confined to the low and high isofrequency bands. To our knowledge, the exact formation of spatiotemporal firing patterns under FM sound stimulation has not been identified yet, and we simply extended this scheme. Namely, the neuronal activity propagates along multiple isofrequency bands corresponding to the tone frequencies constituting the applied FM sound. Actual spatiotemporal firing patterns induced by monkey calls might be rather complex because of the interaction between isofrequency bands or the influences of other brain regions. Accordingly, the information about individual monkey calls could be encoded more precisely in the auditory cortex.
Nevertheless, the proposed simple cochleotopic map was sufficient to generate distinct spatiotemporal firing patterns and worked well as the foundation for later sound processing, that is, monkey call identification. The delay line proposed for the propagation axis (see Figure 1A) was our speculation, prompted by a recent experimental study (Song et al., 2005). The study demonstrated that an electrical pulse, applied focally within an isofrequency band, triggered activity propagation along the isofrequency band that was similar to tone-evoked activation. When the auditory thalamus was chemically lesioned, the electrically evoked activity in the AI was not affected, but the tone-evoked activity was abolished. Based on these results, it was suggested that intracortical connectivity in the AI enables neuronal activity to propagate along isofrequency bands. The underlying neuronal mechanisms of activity propagation in the AI have not been fully understood yet, but we assumed intracortical connectivity via delay lines (see Figure 1B) to produce the activity propagation. Note that the intracortical delay lines are relatively short (tens of milliseconds or less), which could be neurophysiologically plausible in the brain, as discussed in section 1. To express the information about simple (linearly sweeping) FM components (hypothesis 2), we assumed that neurons respond selectively to the sweep rates and directions of FMs. Neurophysiological studies (Rauschecker, 1997, 1998; Tian & Rauschecker, 1998) demonstrated that many neurons of lateral belt areas, to which the primary auditory cortex (AI) projects, responded better to more complex stimuli, such as FMs and band-passed noises, than to pure tones. These neurons were highly selective to the rates and directions of FMs. Neurons of the anterolateral (AL) and caudolateral (CL) belt areas responded better to slower and faster FM sweep rates, respectively. Neurons of the posterior auditory field were highly selective for one direction. The detailed organization of these (rate- and direction-selective) neurons has not been clearly identified yet, but we represented them in a simple and functional manner (see N_FM; Figure 1A). To integrate the sequence of linearly sweeping FMs that expresses a specific monkey call into a unified percept (hypothesis 3), we assumed that neurons respond selectively to the call. Neurophysiological studies (Rauschecker, 1997, 1998) demonstrated that neurons of the lateral belt responded to a certain class of monkey calls. Very few neurons responded to a single call, and most neurons responded to a number of calls. These results imply that the lateral belt is not yet the end stage of monkey call processing. Higher cortical areas such as the rostral portion of the superior temporal gyrus (STGr) or the prefrontal cortex, or both, might be responsible for monkey call identification (Rauschecker, 1998). The exact cortical areas whose neurons respond selectively to individual monkey calls have not been clearly identified yet, but we assumed such "call-selective" neurons in the STGr (see N_ID; Figure 5A).
The selective projections from the N_FM to the N_ID network (or from the lateral belt to the STGr) were our speculation, introduced so that the model could identify individual monkey calls, that is, detect the sequence of FM components. Coincidence detection based on a delay line scheme, as discussed in section 1, is not applicable to longer signals such as monkey calls (more than 500 ms), because delay lines observed in the brain are at most several hundred milliseconds long. The delay lines proposed here for the propagation axis (N_CO; Figure 1B) are shorter (tens of milliseconds); they are speculative but neurophysiologically plausible in the brain. This study focuses on temporal-to-spatiotemporal conversion of auditory information mediated by shorter (plausible) delay lines (∼tens of milliseconds), not on a coincidence detection scheme. In the propagation axis, we assumed delay lines between neurons ranging from 10 to 30 ms (see section 2.2). This architecture allowed activity in the cochleotopic neural network (N_CO) to propagate along isofrequency bands, as observed in the auditory cortex (Taniguchi et al., 1992; Hess & Scheich, 1996). To our knowledge, such delay lines have not been reported in the auditory cortex. However, Hess and Scheich (1996) pointed out that activity propagation along isofrequency bands might be closely related to the distribution of response latency in the AI. Tian and Rauschecker (1998) found a response-latency distribution (23 ± 12 ms) in the AI. The range of delay lines used for the N_CO network as an AI area is within the range observed. We assumed activity propagation along isofrequency bands. However, activity propagation across isofrequency bands has also been reported (Taniguchi et al., 1992; Hess & Scheich, 1996), for which the interaction between different isofrequency bands might be responsible. The neuronal activation propagated in the two (isofrequency and tonotopic-gradient) directions, where the peak activity was shifted along isofrequency bands (Hess & Scheich, 1996). Although the spatiotemporal firing pattern propagating in the two directions might contribute to encoding auditory information, presumably in a more precise manner, we modeled the propagation only along the isofrequency bands. Panchev and Wermter (2004) proposed a neural network model that can detect temporal sequences on timescales from hundreds of milliseconds to several seconds. The network consisted of integrate-and-fire neurons with active dendrites and dynamic synapses. They applied the model to recognizing words such as bat, tab, cab, and cat. Each word was expressed as a sequence of phonemes (for example, c → a → t for cat). A spike train of a single input neuron encoded each phoneme of a word, with a 100 ms delay between the onset times of successive phonemes. After training based on spike-timing-dependent plasticity, a single output neuron was able to detect a particular sequence of phonemes, that is, to identify a specific word. Their model could be an alternative, especially for the integration network (N_ID; Figure 5A), which detects the sequences of simple (linearly sweeping) FMs.
Sargolini and colleagues (2006) found evidence of conjunctive representation of position, direction, and velocity in the entorhinal cortex of rats that explored two-dimensional environments. In the medial entorhinal cortex (MEC), the network of grid cells constituted a spatial coordinate system in which positional information was represented. Head direction cells were responsible for representing head-directional information. Grid cells were co-localized with head direction cells and conjunctive (grid and head direction) cells, and the running speed of the rat modulated these cells. The researchers suggested that the conjunctive cells might update the representation of spatial location by integrating positional, directional, and velocity information in the grid cell network during navigation. Such a conjunctive representation may be an alternative to the spatiotemporal representation of monkey calls, in which the information about spectral components, sweep rates, and sweep directions, together with their combinatorial information, may be represented by distinct types of cells in the primary auditory cortex.


In humans, it has been suggested that a voice contains information not only about speech but also about an "auditory face," which allows us to identify individuals (Belin, Fecteau, & Bedard, 2004). This is called auditory face perception and is processed based on a neurocognitive scheme similar to that proposed for visual face perception. Among vocal components for human speech processing, formants (Fitch, 1997) and syllables (Belin & Zatorre, 2003) might be candidate components used for identifying individuals. We suggest that monkey call identification may also be a kind of auditory face perception that makes use of FM components. There is evidence that the inferior colliculus (IC) encodes spectrotemporal acoustic patterns of species-specific calls. For example, Suta, Kvasnak, Popelar, and Syka (2003) investigated the neuronal representation of specific calls in the IC of guinea pigs. Responses of individual IC neurons of anesthetized guinea pigs to four typical calls (purr, chutter, chirp, and whistle) were recorded. A majority of neurons (55% of 124 units) responded to all calls. A small portion of neurons (3%) responded to only one call or did not respond to any of the calls. A time-reversed version of the calls elicited, on average, a weaker response. The researchers concluded that the IC neurons do not respond selectively to specific calls but encode spectrotemporal acoustic patterns of the calls. Maki and Riquimaroux (2002) recorded responses of IC neurons of gerbils to two distinct FM sounds that have the same spectral components (5–12 kHz) but opposite sweep directions, that is, an upward sweep and a downward sweep. The upward FM generated much stronger responses than the downward FM. The researchers suggested that the directional selectivity to the FM sweeps implies that the IC may encode spectrotemporal acoustic patterns of species-specific calls. These experiments imply that the encoding of spectrotemporal acoustic images of specific calls takes place, in part, in the IC, which presumably makes the spatiotemporal pattern of neuronal activation in the present cochleotopic map more complex. We hope that its details will be clarified in the near future.

5 Conclusion

In this study, we have proposed a cochleotopic map similar to the retinotopic map in vision. When the cochleotopic (N_CO) network was stimulated with a monkey call, the peripheral neurons located on the frequency axis were sequentially activated. The active area moved along the propagation axis, by which the information about the call was mapped as a spatiotemporal firing pattern in the cochleotopic network dynamics. This spatiotemporal conversion was quite effective in allowing the N_FM network to decompose the call information into simple (linearly sweeping) FM components, by which the higher network (N_ID) was able to integrate these components into a unified percept, that is, to identify the call.

We suggest that the information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. The spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, or monkey call identification by higher cortical areas.

Acknowledgments

I am grateful to Yuishi Iwasaki for productive discussions and to Hiromi Ohta for her encouragement throughout the study. I am also grateful to the reviewers for giving me valuable comments and suggestions on the earlier draft.

References

Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends Cogn. Sci., 8, 129–135.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport, 14, 2105–2109.
Bi, G. Q., & Poo, M. M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Fitch, W. T. (1997). Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. Acoust. Soc. Am., 102, 1213–1222.
Hess, A., & Scheich, H. (1996). Optical and FDG mapping of frequency-specific activity in auditory cortex. Neuroreport, 7, 2643–2677.
Holden, C. (2004). The origin of speech. Science, 303, 1316–1319.
Janik, V. M. (2000). Whistle matching in wild bottlenose dolphins (Tursiops truncatus). Science, 289, 1355–1357.
Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psychol., 41, 35–39.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychol. Rev., 74, 431–461.
Maki, K., & Riquimaroux, H. (2002). Time-frequency distribution of neuronal activity in the gerbil inferior colliculus responding to auditory stimuli. Neurosci. Lett., 331, 1–4.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiol., 15, 241–247.
Panchev, C., & Wermter, S. (2004). Spike-timing-dependent synaptic plasticity: From single spikes to spike trains. Neurocomputing, 58–60, 365–371.
Poremba, A., Malloy, M., Saunders, R. C., Carson, R. E., Herscovitch, P., & Mishkin, M. (2004). Species-specific calls evoke asymmetric activity in the monkey’s temporal poles. Nature, 427, 448–451.
Rauschecker, J. P. (1997). Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol. (Stockh.), 532, 34–38.

Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiol. Neuro-Otol., 3, 86–103.
Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M. B., & Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312, 758–762.
Song, W. J., Kawaguchi, H., Totoki, S., Inoue, Y., Katura, T., Maeda, S., Inagaki, S., Shirasawa, H., & Nishimura, M. (2005). Cortical intrinsic circuits can support activity propagation through an isofrequency strip of the guinea pig primary auditory cortex. Cereb. Cortex, 16, 718–729.
Suga, N. (1995). Processing of auditory information carried by species-specific complex sounds. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 295–313). Cambridge, MA: MIT Press.
Suta, D., Kvasnak, E., Popelar, J., & Syka, J. (2003). Representation of species-specific vocalizations in the inferior colliculus of the guinea pig. J. Neurophysiol., 90, 3794–3808.
Symmes, D., Newman, J. D., Talmage-Riggs, G., & Lieblich, A. K. (1979). Individuality and stability of isolation peeps in squirrel monkeys. Anim. Behav., 27, 1142–1152.
Taniguchi, I., Horikawa, J., Moriyama, T., & Nasu, M. (1992). Spatio-temporal pattern of frequency representation in the auditory cortex of guinea pigs. Neurosci. Lett., 146, 37–40.
Tian, B., & Rauschecker, J. P. (1998). Processing of frequency-modulated sounds in the cat’s posterior auditory field. J. Neurophysiol., 79, 2629–2642.
Tyack, P. L. (2000). Dolphins whistle a signature tune. Science, 289, 1310–1311.
Yost, W. A. (1994). The neural response and the auditory code. In W. A. Yost (Ed.), Fundamentals of hearing (pp. 116–133). San Diego, CA: Academic Press.

Received January 19, 2006; accepted June 9, 2006.

LETTER

Communicated by Gal Chechik

Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity

Sander M. Bohte, Netherlands Centre for Mathematics and Computer Science (CWI), 1098 SJ Amsterdam, The Netherlands

Michael C. Mozer, Department of Computer Science, University of Colorado, Boulder, CO, U.S.A.

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing dependent plasticity (STDP). We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron’s response variability to a given presynaptic input, causing the neuron’s output to become more reliable in the face of noise. Using an objective function that minimizes response variability and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena, including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of unsupervised cortical adaptation.

Neural Computation 19, 371–403 (2007). © 2007 Massachusetts Institute of Technology

1 Introduction

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). The dependence of synaptic modulation on the precise timing of the two action potentials, known as spike-timing dependent plasticity (STDP), is depicted in Figure 1. Typically, plasticity is observed only when the presynaptic and postsynaptic spikes occur within a 20 to 30 ms time window, and the transition from potentiation to depression is very rapid. The effects are long lasting and are therefore referred to as long-term potentiation (LTP) and depression (LTD). An important observation is that the relative magnitude of the LTP component of STDP decreases with increased synaptic efficacy between presynaptic and postsynaptic neuron, whereas the magnitude of LTD remains roughly constant (Bi & Poo, 1998). This finding has led to the suggestion that the LTP component of STDP might best be modeled as additive, whereas the LTD component is better modeled as being multiplicative (Kepecs, van Rossum, Song, & Tegner, 2002). For detailed reviews of STDP see Bi and Poo (2001), Roberts and Bell (2002), and Dan and Poo (2004).

Figure 1: (A) Measuring STDP experimentally. Presynaptic and postsynaptic spike pairs are repeatedly induced at a fixed interval $\Delta t_{\text{pre-post}}$, and the resulting change to the strength of the synapse is assessed. (B) Change in synaptic strength after repeated spike pairing as a function of the difference in time between the presynaptic and postsynaptic spikes. A presynaptic before postsynaptic spike induces LTP and postsynaptic before presynaptic LTD (data points were obtained by digitizing figures in Zhang et al., 1998). We have superimposed an exponential fit of LTP and LTD.

Because these intriguing findings appear to describe a fundamental learning mechanism in the brain, a flurry of models has been developed that focus on different aspects of STDP. A number of studies focus on biochemical models that explain the underlying mechanisms giving rise to STDP (Senn, Markram, & Tsodyks, 2000; Bi, 2002; Karmarkar, Najarian, & Buonomano, 2002; Saudargiene, Porr, & Wörgötter, 2004; Porr & Wörgötter, 2003). Many researchers have also focused on models that explore the consequences of STDP-like learning rules in an ensemble of spiking neurons

(Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Izhikevich & Desai, 2003; Abbott & Gerstner, 2004; Burkitt, Meffin, & Grayden, 2004; Shon, Rao, & Sejnowski, 2004; Legenstein, Naeger, & Maass, 2005), and a comprehensive review of the different types and conclusions can be found in Porr and Wörgötter (2003). Finally, a recent trend is to propose models that provide fundamental computational justifications for STDP. This article proposes a novel justification, and we explore the consequences of this justification in detail.

Most commonly, STDP is viewed as a type of asymmetric Hebbian learning with a temporal dimension. However, this perspective is hardly a fundamental computational rationale, and one would hope that such an intuitively sensible learning rule would emerge from a first-principle computational justification. Several researchers have tried to derive a learning rule yielding STDP from first principles. Dayan and Häusser (2004) show that STDP can be viewed as an optimal noise-removal filter for certain noise distributions. However, even a small variation from these noise distributions yields quite different learning rules, and the noise statistics of biological neurons are unknown. Similarly, Porr and Wörgötter (2003) propose an unsupervised learning rule based on the correlation of bandpass-filtered inputs with the derivative of the output and show that the weight change rule is qualitatively similar to STDP. Hopfield and Brody (2004) derive learning rules that implement ongoing network self-repair. In some circumstances, a qualitative similarity to STDP is found, but the shape of the learning rule depends on both network architecture and task. M. Eisele (private communication, April 2004) has shown that an STDP-like learning rule can be derived from the goal of maintaining the relevant connections in a network.

Rao and Sejnowski (1999, 2001) suggest that STDP may be related to prediction, in particular to temporal difference (TD) learning. They argue that STDP emerges when a neuron attempts to predict its membrane potential at some time $t$ from the potential at time $t - \Delta t$. As Dayan (2002) points out, however, temporal difference learning depends on an estimate of the prediction error, which will be very hard to obtain. Rather, a quantity that might be called an activity difference can be computed, and the learning rule is then better characterized as a “correlational learning rule between the stimuli, and the differences in successive outputs” (Dayan, 2002; see also Porr & Wörgötter, 2003, appendix B). Furthermore, Dayan argues that for true prediction, the model has to show that the learning rule works for biologically realistic timescales. The qualitative nature of the modeling makes it unclear whether a quantitative fit can be obtained. Finally, the derived difference rule is inherently unstable, as it does not impose any bounds on synaptic efficacies; also, STDP emerges only for a narrow range of $\Delta t$ values.

Chechik (2003) relates STDP to information theory via maximization of mutual information between input and output spike trains. This approach derives the LTP portion of STDP but fails to yield the LTD portion. Nonetheless, an information-theoretic approach is quite elegant and has proven valuable in explaining other neural learning phenomena (e.g., Linsker, 1989). The account we describe in this article also exploits an information-theoretic approach. We are not the only ones to appreciate the elegance of information-theoretic accounts. In parallel with a preliminary presentation of our work at the NIPS 2004 conference, two quite similar information-theoretic accounts also appeared (Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005). It will be easiest to explain the relationship of these accounts to our own once we have presented ours.

The computational approaches of Chechik (2003), Dayan and Häusser (2004), and Porr and Wörgötter (2003) are all premised on a rate-based neuron model that disregards the relative timing of spikes. It seems quite odd to argue for STDP using neural firing rate: if spike timing is irrelevant to information transmission, then STDP is likely an artifact and is not central to understanding mechanisms of neural computation. Further, as Dayan and Häusser (2004) note, because STDP is not quite additive in the case of multiple input or output spikes that are near in time (Froemke & Dan, 2002), one should consider interpretations that are based on individual spikes, not aggregates over spike trains.

In this letter, we present an alternative theoretical motivation for STDP from a spike-based neuron model that takes the specific times of spikes into account. We conjecture that a fundamental objective of cortical computation is to achieve reliable neural responses, that is, neurons should produce the identical response (in both the number and timing of spikes) given a fixed input spike train.
Reliability is an issue if neurons are affected by noise influences, because noise leads to variability in a neuron’s dynamics and therefore in its response. Minimizing this variability will reduce the effect of noise and will therefore increase the informativeness of the neuron’s output signal. The source of the noise is not important; it could be intrinsic to a neuron (e.g., a time-varying threshold), or it could originate in unmodeled external sources that cause fluctuations in the membrane potential uncorrelated with a particular input. We are not suggesting that increasing neural reliability is the only objective of learning. If it were, a neuron would do well to shut off and give no response regardless of the input. Rather, reliability is but one of many objectives that learning tries to achieve. This form of unsupervised learning must, of course, be complemented by other unsupervised, supervised, and reinforcement learning objectives that allow an organism to achieve its goals and satisfy drives. We return to this issue below and in our conclusions section.

We derive STDP from the following computational principle: synapses adapt so as to minimize the variability in the timing of the spikes of the postsynaptic neuron’s output in response to given presynaptic input spike trains. This variability reduction causes the response of a neuron to become more deterministic and less sensitive to noise, which provides an obvious computational benefit. In our simulations, we follow the methodology of neurophysiological experiments. This approach leads to a detailed fit to key experimental results. We model not only the shape (sign and time course) of the STDP curve, but also the fact that potentiation of a synapse depends on the efficacy of the synapse; it decreases with increased efficacy. In addition to fitting these key STDP phenomena, the model allows us to make predictions regarding the relationship between properties of the neuron and the shape of the STDP curve. The detailed quantitative fit to data makes our work unique among first-principle computational accounts. Before delving into the details of our approach, we give a basic intuition about the approach. Noise in spiking neuron dynamics leads to variability in the number and timing of spikes. Given a particular input, one spike train might be more likely than others, but the output is nondeterministic. By the response variability minimization principle, adaptation should reduce the likelihood of these other possibilities. To be concrete, consider a particular experimental paradigm. In Zhang et al. (1998), a presynaptic neuron is identified with a weak synapse to a postsynaptic neuron, such that this presynaptic input is unlikely to cause the postsynaptic neuron to fire. However, the postsynaptic neuron can be induced to fire via a second presynaptic connection. In a typical trial, the presynaptic neuron is induced to fire a single spike, and with a variable delay, the postsynaptic neuron is also induced to fire (typically) a single spike. 
To increase the likelihood of the observed postsynaptic response, other response possibilities must be suppressed. With presynaptic input preceding the postsynaptic spike, the most likely alternative response is no output spikes at all. Increasing the synaptic connection weight should then reduce the possibility of this alternative response. With presynaptic input following the postsynaptic spike, the most likely alternative response is a second output spike. Decreasing the synaptic connection weight should reduce the possibility of this alternative response. Because both of these alternatives become less likely as the lag between pre- and postsynaptic spikes is increased, one would expect that the magnitude of synaptic plasticity diminishes with the lag, as is observed in the STDP curve.

Our approach to reducing response variability given a particular input pattern involves computing the gradient of synaptic weights with respect to a differentiable model of spiking neuron behavior. We use the spike response model (SRM) of Gerstner (2001) with a stochastic threshold, where the stochastic threshold models fluctuations of the membrane potential or

the threshold outside experimental control. For the stochastic SRM, the response probability is differentiable with respect to the synaptic weights, allowing us to calculate the gradient that reduces response variability with respect to the weights. Learning is presumed to take a gradient step to reduce the response variability. In modeling neurophysiological experiments, we demonstrate that this learning rule yields the typical STDP curve. We can predict the relationship between the exact shape of the STDP curve and physiologically measurable parameters, and we show that our results are robust to the choice of the few free parameters of the model. Many important machine learning algorithms in the literature seek local optimizers. It is often the case that the initial conditions, which determine which local optimizer will be found, can be controlled to avoid unwanted local optimizers. For example, with neural networks, weights are initialized near the origin; large initial weights would lead to degenerate solutions. And K-means has many degenerate and suboptimal solutions; consequently, careful initialization of cluster centers is required. In the case of our model’s learning algorithm, the initial conditions also avoid the degenerate local optimizer. These initial conditions correspond to the original weights of the synaptic connections and are constrained by the specific methodology of the experiments that we model: the subthreshold input must have a small but nonzero connection strength, and the suprathreshold input must have a large connection strength (less than 10%, more than 70% probability of activating the target, respectively). Given these conditions, the local optimizer that our learning algorithm discovers is an extremely good fit to the experimental data. 
In parallel with our work, two other groups of authors have proposed explanations of STDP in terms of neurons maximizing an information-theoretic measure for the spike-response model (Bell & Parra, 2005; Toyoizumi et al., 2005). Toyoizumi et al. (2005) maximize the mutual information of the input and output between a pool of presynaptic neurons and a single postsynaptic output neuron, whereas Bell and Parra (2005) maximize sensitivity between a pool of (possibly correlated) presynaptic neurons and a pool of postsynaptic neurons. Bell and Parra use a causal SRM model and do not obtain the LTD component of STDP. As we will show, when the objective function is minimization of (conditional) response variability, obtaining LTD critically depends on a stochastic neural response. In the derivation of Toyoizumi et al. (2005), LTD, which is very weak in magnitude, is attributed to the refractoriness of the spiking neuron (via the autocorrelation function), where they use questionably strong and enduring refractoriness. In our framework, refractoriness suppresses noise in the neuron after spiking, and we show that in our simulations, strong refraction in fact diminishes the LTD component of STDP. Furthermore, the mathematical derivation of Toyoizumi et al. is valid only for an essentially constant membrane potential with small fluctuations, a condition clearly violated in experimental

conditions studied by neurophysiologists. It is unclear whether the derivation would hold under more realistic conditions. Neither of these approaches thus far succeeds in quantitatively modeling specific experimental data with neurobiologically realistic timing parameters, and neither explains the relative reduction of STDP as the synaptic efficacy increases as we do. Nonetheless, these models make an interesting contrast to ours by suggesting a computational principle of optimization of information transmission, as contrasted with our principle of neural response variability reduction. Experimental tests might be devised to distinguish between these competing theories.

In section 2 we describe the stochastic SRM (sSRM), and in section 3 we derive the minimal entropy gradient. In section 4 we describe the STDP experiment, which we simulate in section 5. We conclude with section 6.

2 The Stochastic Spike Response Model

The spike response model (SRM), defined by Gerstner (2001), is a generic integrate-and-fire model of a spiking neuron that closely corresponds to the behavior of a biological spiking neuron and is characterized in terms of a small set of easily interpretable parameters (Jolivet, Lewis, & Gerstner, 2003; Paninski, Pillow, & Simoncelli, 2005). The standard SRM formulation describes the temporal evolution of the membrane potential based on past neuronal events, specifically as a weighted sum of postsynaptic potentials (PSPs) modulated by reset and threshold effects of previous postsynaptic spiking events. The general idea is depicted in Figure 2; formally (following Gerstner, 2001), the membrane potential $u_i(t)$ of cell $i$ at time $t$ is defined as

$$u_i(t) = \sum_{f_i \in G_i^t} \eta(t - f_i) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in G_j^t} \epsilon(t \mid f_j, G_i^t), \tag{2.1}$$

where $\Gamma_i$ is the set of inputs connected to neuron $i$; $G_i^t$ is the set of times prior to $t$ that a neuron $i$ has spiked, with firing times $f_i \in G_i^t$; $w_{ij}$ is the synaptic weight from neuron $j$ to neuron $i$; $\epsilon(t \mid f_j, G_i^t)$ is the PSP in neuron $i$ due to an input spike from neuron $j$ at time $f_j$ given postsynaptic firing history $G_i^t$; and $\eta(t - f_i)$ is the refractory response due to the postsynaptic spike at time $f_i$.

To model the postsynaptic potential $\epsilon$ in a leaky integrate-and-fire neuron, a spike of presynaptic neuron $j$ emitted at time $f_j$ generates a postsynaptic current $\alpha(t - f_j)$ for $t > f_j$. In the absence of postsynaptic firing, this kernel (following Gerstner & Kistler, 2002, eqs 4.62–4.56, pp. 114–115) can be computed as

$$\epsilon(t \mid f_j) = \int_{f_j}^{t} \exp\left(-\frac{t - s}{\tau_m}\right) \alpha(s - f_j)\, ds, \tag{2.2}$$

Figure 2: Membrane potential u(t) of a neuron as a sum of weighted excitatory PSP kernels due to impinging spikes. Arrival of PSPs marked by arrows. Once the membrane potential reaches threshold, it is reset, and a reset function η is added to model the recovery effects of the threshold.

where $\tau_m$ is the decay time of the postsynaptic neuron’s membrane potential. Consider an exponentially decaying postsynaptic current $\alpha(t)$ of the form

$$\alpha(t) = \frac{1}{\tau_s} \exp\left(-\frac{t}{\tau_s}\right) H(t) \tag{2.3}$$

(see Figure 3A), where $\tau_s$ is the decay time of the current and $H(t)$ is the Heaviside function. In the absence of postsynaptic firing, this current contributes a postsynaptic potential of the form

$$\epsilon(t \mid f_j) = \frac{1}{1 - \tau_s/\tau_m} \left[ \exp\left(-\frac{t - f_j}{\tau_m}\right) - \exp\left(-\frac{t - f_j}{\tau_s}\right) \right] H(t - f_j), \tag{2.4}$$

with current decay time constant $\tau_s$ and membrane decay time constant $\tau_m$. When the postsynaptic neuron fires after the presynaptic spike arrives (at some time $\hat{f}_i$ following the presynaptic spike at time $f_j$), the membrane potential is reset, and only the remaining synaptic current $\alpha(t')$ for $t' > \hat{f}_i$ is integrated in equation 2.2. Following Gerstner, 2001 (section 4.4, equation 1.66), the PSP that takes such postsynaptic firing into account can be written as

$$\epsilon(t \mid f_j, \hat{f}_i) = \begin{cases} \epsilon(t \mid f_j), & \hat{f}_i < f_j, \\ \exp\left(-\dfrac{\hat{f}_i - f_j}{\tau_s}\right) \epsilon(t \mid \hat{f}_i), & \hat{f}_i \ge f_j. \end{cases} \tag{2.5}$$
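The relationship between the integral form of the PSP (equation 2.2), its closed form (equation 2.4), and the reset case (equation 2.5) can be checked numerically. The sketch below is only an illustration: the time constants and spike times are hypothetical values, the function names are ours, and the membrane filter is assumed to be exp(-(t - s)/τm), which is the form that reproduces equation 2.4.

```python
import math

# Hypothetical time constants (ms); not taken from the article.
TAU_M = 10.0  # membrane decay time
TAU_S = 2.5   # synaptic current decay time


def alpha(t):
    """Exponentially decaying postsynaptic current (equation 2.3)."""
    return math.exp(-t / TAU_S) / TAU_S if t >= 0.0 else 0.0


def psp_closed_form(t, f_j):
    """PSP without postsynaptic firing (equation 2.4)."""
    s = t - f_j
    if s < 0.0:
        return 0.0
    return (math.exp(-s / TAU_M) - math.exp(-s / TAU_S)) / (1.0 - TAU_S / TAU_M)


def psp_numeric(t, f_j, lower=None, n=20000):
    """Trapezoidal integration of the filtered current (equation 2.2).

    `lower` restricts the integral to the current remaining after a reset.
    """
    a = f_j if lower is None else lower
    if t <= a:
        return 0.0
    h = (t - a) / n
    total = 0.0
    for k in range(n + 1):
        s = a + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * math.exp(-(t - s) / TAU_M) * alpha(s - f_j)
    return total * h


def psp_after_reset(t, f_j, f_hat):
    """PSP when the postsynaptic neuron fires at f_hat (equation 2.5)."""
    if f_hat < f_j:
        return psp_closed_form(t, f_j)
    return math.exp(-(f_hat - f_j) / TAU_S) * psp_closed_form(t, f_hat)


if __name__ == "__main__":
    t, f_j, f_hat = 12.0, 2.0, 5.0
    print(psp_closed_form(t, f_j), psp_numeric(t, f_j))
    print(psp_after_reset(t, f_j, f_hat), psp_numeric(t, f_j, lower=f_hat))
```

Each printed pair should agree up to the integration error, which illustrates why a reset leaves an attenuated PSP that restarts at the postsynaptic spike time.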

Figure 3: (A) α(t) function. Synaptic input modeled as exponentially decaying current. (B) Postsynaptic potential due to a synaptic input in the absence of postsynaptic firing (solid line), and with postsynaptic firing once and twice (dotted and dashed lines, respectively; postsynaptic spikes indicated by arrows). (C) Reset function η(t). (D) Spike probability ρ(u) as a function of potential u for different values of the α and β parameters.

This function is depicted in Figure 3B, for the cases when a postsynaptic spike occurs both before and after the presynaptic spike. In principle, this formulation can be expanded to include the postsynaptic neuron firing more than once after the onset of the postsynaptic potential. However, for fast current decay times τs , it is useful to consider only the residual current input for the first postsynaptic spike after onset and assume that any further postsynaptic spiking is modeled by a postsynaptic potential reset to zero from that point on. The reset response η(t) models two phenomena. First, a neuron can be in a refractory period: it simply cannot spike again for about a millisecond after a spiking event. Second, after the emission of a spike, the threshold of the neuron may initially be elevated and then recover to the original value (Kandel, Schwartz, & Jessell, 2000). The SRM models this behavior as negative contributions to the membrane potential (see equation 2.1): with s = t − fˆ i denoting the time since the postsynaptic spike, the refractory

reset function is defined as (Gerstner, 2001):

$$\eta(s) = \begin{cases} U_{abs}, & 0 < s < \delta_r, \\ U_{abs} \exp\left(-\dfrac{s + \delta_r}{\tau_r^f}\right) + U_r \exp\left(-\dfrac{s}{\tau_r^s}\right), & s \ge \delta_r, \end{cases} \tag{2.6}$$

where a large negative impulse $U_{abs}$ models the absolute refractory period, with duration $\delta_r$; the absolute refractory contribution smoothly resets via a fast-decaying exponential with time constant $\tau_r^f$. The term $U_r$ models the slow exponential recovery of the elevated threshold with time constant $\tau_r^s$. The function $\eta$ is depicted in Figure 3C. We made a minor modification to the SRM described in Gerstner (2001) by relaxing the constraint that $\tau_r^s = \tau_m$ and also by smoothing the absolute refractory function (such smoothing is mentioned in Gerstner, but is not explicitly defined). In all simulations, we use $\delta_r = 1$ ms, $\tau_r^s = 3$ ms, and $\tau_r^f = 0.25$ ms (in line with estimates for biological neurons; Kandel et al., 2000; the smoothing parameter was chosen to be fast compared to $\tau_r^s$).

The SRM we just described is deterministic. Gerstner (2001) introduces a stochastic variant of the SRM (sSRM) by incorporating the notion of a stochastic firing threshold: given membrane potential $u_i(t)$, the probability of the neuron firing at time $t$ is specified by $\rho(u_i(t))$. Herrmann and Gerstner (2001) find that for a reasonable escape-rate noise model of the integration of current in real neurons, the probability of firing is small and constant for small potentials, but around a threshold $\vartheta$, the probability increases linearly with the potential. In our simulations, we use such a function,

$$\rho(v) = \frac{\beta}{\alpha} \left\{ \ln\left[1 + \exp(\alpha(\vartheta - v))\right] - \alpha(\vartheta - v) \right\}, \tag{2.7}$$

where $\alpha$ determines the abruptness of the constant-to-linear transition in the neighborhood of threshold $\vartheta$ and $\beta$ determines the slope of the linear increase beyond $\vartheta$. This function is depicted in Figure 3D for several values of $\alpha$ and $\beta$. We also conducted simulation experiments with sigmoidal and exponential density functions and found no qualitative difference in the results.

3 Minimizing Conditional Entropy

We now derive the rule for adjusting the weight from a presynaptic input neuron $j$ to a postsynaptic neuron $i$ so as to minimize the entropy of $i$’s response given a particular spike train from $j$. A spike train is described by the set of all times at which a neuron $i$ emitted spikes within some interval between 0 and $T$, denoted $G_i^T$. We assume the interval is wide enough that the occurrence of spikes outside

the interval does not influence the state of a neuron within the interval (e.g., through threshold reset effects). This assumption allows us to treat intervals as independent of each other. The set of input spikes received by neuron $i$ during this interval is denoted $F_i^T$, which is just the union of all output spike trains of connected presynaptic neurons $j$: $F_i^T = \bigcup G_j^T \;\; \forall j \in \Gamma_i$. Given input spikes $F_i^T$, the stochastic nature of neuron $i$ may lead not only to the observed response $G_i^T$ but also to a range of other possibilities. Denote the set of possible responses $\Omega_i$, where $G_i^T \in \Omega_i$. Further, let binary variable $\sigma(t)$ denote the state of the neuron in the time interval $[t, t + \Delta t)$, where $\sigma(t) = 1$ means the neuron spikes and $\sigma(t) = 0$ means no spike. A response $\xi \in \Omega_i$ is then equivalent to $[\sigma(0), \sigma(\Delta t), \ldots, \sigma(T)]$. Given a probability density $p(\xi)$ over all possible responses $\xi$, the differential entropy of neuron $i$’s response conditional on input $F_i^T$ is then defined as

$$h\left(\Omega_i \mid F_i^T\right) = -\int_{\Omega_i} p(\xi) \log p(\xi)\, d\xi. \tag{3.1}$$

According to our hypothesis, a neuron adjusts its weights so as to minimize the conditional response variability. Such an adjustment is obtained by performing gradient descent on the weighted likelihood of the response, which corresponds to the conditional entropy, with respect to the weights,

$$\Delta w_{ij} = -\gamma \frac{\partial h\left(\Omega_i \mid F_i^T\right)}{\partial w_{ij}}, \tag{3.2}$$

with learning rate $\gamma$. In this section, we compute the right-hand side of equation 3.2 for an sSRM neuron. Substituting the entropy definition of equation 3.1 into equation 3.2, we obtain:

$$\frac{\partial h\left(\Omega_i \mid F_i^T\right)}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \int_{\Omega_i} p(\xi) \log(p(\xi))\, d\xi = -\int_{\Omega_i} p(\xi)\, \frac{\partial \log(p(\xi))}{\partial w_{ij}} \left(\log(p(\xi)) + 1\right) d\xi. \tag{3.3}$$

We closely follow Xie and Seung (2004) to derive $\partial \log(p(\xi)) / \partial w_{ij}$ for a differentiable neuron model firing at times $G_i^T$. First, we factorize $p(\xi)$:

$$p(\xi) = \prod_{t=0}^{T} P\left(\sigma(t) \mid \{\sigma(t'), \forall t' < t\}\right). \tag{3.4}$$
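The step between the two forms of equation 3.3 may be easier to follow when written out. The lines below are a sketch of that step for a single response ξ, assuming that differentiation with respect to the weight can be moved inside the integral over responses; they use only the product rule and the log-derivative identity.

```latex
\begin{align*}
\frac{\partial}{\partial w_{ij}} \bigl[ p(\xi) \log p(\xi) \bigr]
  &= \frac{\partial p(\xi)}{\partial w_{ij}} \log p(\xi)
     + p(\xi)\, \frac{1}{p(\xi)} \frac{\partial p(\xi)}{\partial w_{ij}} \\
  &= \frac{\partial p(\xi)}{\partial w_{ij}} \bigl( \log p(\xi) + 1 \bigr) \\
  &= p(\xi)\, \frac{\partial \log p(\xi)}{\partial w_{ij}} \bigl( \log p(\xi) + 1 \bigr).
\end{align*}
```

The last line replaces the derivative of p(ξ) by p(ξ) times the derivative of log p(ξ), so that the integrand is an expectation under p(ξ) of a quantity that depends only on log p(ξ) and its gradient.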

The states σ(t) are conditionally independent as the probability for a neuron i to fire during [t, t + Δt) is determined by the spike probability density of the membrane potential:

$$\sigma(t) = \begin{cases} 1 & \text{with probability } p_i = \rho_i(t)\,\Delta t, \\ 0 & \text{with probability } \bar{p}_i = 1 - p_i(t), \end{cases}$$

with $\rho_i(t)$ shorthand for the spike probability density of the membrane potential, $\rho(u_i(t))$; this equation holds for sufficiently small Δt (see also Xie & Seung, 2004, for more details). We note further that

$$\frac{\partial \ln(p(\xi))}{\partial w_{ij}} \equiv \frac{1}{p(\xi)} \frac{\partial p(\xi)}{\partial w_{ij}} \quad \text{and} \quad \frac{\partial \rho_i(t)}{\partial w_{ij}} = \frac{\partial \rho_i(t)}{\partial u_i(t)} \frac{\partial u_i(t)}{\partial w_{ij}}. \tag{3.5}$$
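The factorization in equation 3.4, combined with the per-bin spike probability ρ_i(t)Δt, gives a simple recipe for the log-probability of a discretized response. The sketch below is only an illustration: the function name `log_likelihood`, the bin width, and the rate values are assumptions made for the example and do not come from the simulations reported here.

```python
import math

def log_likelihood(sigma, rho, dt):
    """log p(xi) for a binary spike train, treating each bin as an
    independent Bernoulli draw with spike probability rho(t) * dt."""
    total = 0.0
    for s, r in zip(sigma, rho):
        p_spike = r * dt
        if not 0.0 < p_spike < 1.0:
            raise ValueError("rho * dt must lie strictly between 0 and 1")
        total += math.log(p_spike) if s == 1 else math.log(1.0 - p_spike)
    return total

# Hypothetical example: four 1 ms bins, one spike in the third bin.
dt = 0.001                          # bin width in seconds (assumed)
rho = [20.0, 50.0, 200.0, 50.0]     # spike probability density per bin (assumed)
sigma = [0, 0, 1, 0]
print(log_likelihood(sigma, rho, dt))
```

Working with log-probabilities rather than the product itself avoids numerical underflow when the response contains many bins.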

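It is straightforward to derive:

$$\frac{\partial \log(p(\xi))}{\partial w_{ij}} = \int_{t=0}^{T} \frac{\partial \rho_i(t)}{\partial u_i(t)} \frac{\partial u_i(t)}{\partial w_{ij}}\, \frac{\sum_{f_i \in F_i^T} \delta(t - f_i) - \rho_i(t)}{\rho_i(t)}\, dt = -\int_{t=0}^{T} \rho_i'(t)\, \epsilon(t \mid f_j, f_i)\, dt + \sum_{f_i \in F_i^T} \frac{\rho_i'(f_i)}{\rho_i(f_i)}\, \epsilon(f_i \mid f_j, f_i), \tag{3.6}$$

where $\rho_i'(t) \equiv \partial \rho_i(t) / \partial u_i(t)$ and $\delta(t - f_i)$ is the Dirac delta, and we use that in the sSRM formulation,

$$\frac{\partial u_i(t)}{\partial w_{ij}} = \epsilon(t \mid f_j, f_i).$$

The term $\rho_i'(t)$ in equation 3.6 can be computed for any differentiable spike probability function. In the case of equation 2.7,

$$\rho_i'(t) = \frac{\beta}{1 + \exp(\alpha(\vartheta - u_i(t)))}.$$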
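The derivative above can be checked numerically against the spike probability function itself. The sketch below assumes that equation 2.7 has the form ρ(u) = (β/α)[ln(1 + exp(α(ϑ − u))) − α(ϑ − u)], which is the form implied by the derivative just given and by the bracketed term in equation 3.7. The function names and the parameter values are hypothetical.

```python
import math

def rho(u, alpha, beta, theta):
    """Assumed form of eq. 2.7: (beta/alpha) * [ln(1 + e^x) - x], x = alpha*(theta - u).
    ln(1 + e^x) - x equals softplus(-x), evaluated here in a numerically stable way."""
    y = alpha * (u - theta)
    softplus = max(y, 0.0) + math.log1p(math.exp(-abs(y)))
    return (beta / alpha) * softplus

def rho_prime(u, alpha, beta, theta):
    """Derivative of rho with respect to the membrane potential u."""
    return beta / (1.0 + math.exp(alpha * (theta - u)))

# Hypothetical parameter values, chosen only for illustration.
alpha, beta, theta = 5.0, 2.0, 1.0
for u in (0.0, 0.5, 1.0, 1.5):
    print(u, rho(u, alpha, beta, theta), rho_prime(u, alpha, beta, theta))
```

Under this assumed form, ρ grows roughly linearly with slope β well above threshold and falls off exponentially below it, so the derivative is a sigmoid in u.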

Reducing the Variability of Neural Responses


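Substituting our model for $\rho_i(t)$ and $\rho_i'(t)$ from equation 2.7 into equation 3.6, we obtain

$$\frac{\partial \log(p(\xi))}{\partial w_{ij}} = -\beta \int_{t=0}^{T} \frac{\epsilon(t \mid f_j, f_i)}{1 + \exp[\alpha(\vartheta - u_i(t))]}\, dt + \sum_{f_i \in G_i^T} \frac{\alpha\, \epsilon(f_i \mid f_j, f_i)}{\{\ln(1 + \exp[\alpha(\vartheta - u_i(f_i))]) - \alpha(\vartheta - u_i(f_i))\}\,(1 + \exp[\alpha(\vartheta - u_i(f_i))])}. \tag{3.7}$$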

Equation 3.7 can be substituted into equation 3.3, which, when integrated, provides the gradient-descent weight update that implements conditional entropy minimization (see equation 3.2). The hypothesis under exploration is that this gradient-descent weight update yields STDP. Unfortunately, an analytic solution to equation 3.3 (and hence equation 3.2) is not readily obtained. Nonetheless, numerical methods can be used to obtain a solution.

We are not suggesting a neuron performs numerical integration of this sort in real time. It would be preposterous to claim biological realism for an instantaneous integration over all possible responses $\xi \in \Omega_i$, as specified by equation 3.3. Consequently, we have a dilemma: What use is a computational theory of STDP if the theory demands intensive computations that could not possibly be performed by a neuron in real time?

This dilemma can be circumvented in two ways. First, the resulting learning rule might be cached in some form through evolution so that the computation is not necessary. That is, the solution (the STDP curve itself) may be built into a neuron. As such, our computational theory provides an argument for why neurons have evolved to implement the STDP learning rule. Second, the specific response produced by a neuron on a single trial might be considered a sample from the distribution p(ξ), and the integration in equation 3.3 can be performed by a sampling process over repeated trials; each trial would produce a stochastic gradient step.
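The second option, replacing an integral over all responses by an average over sampled trials, can be illustrated on a toy problem. The sketch below estimates the (discrete) response entropy −E[log p(ξ)] from sampled responses and compares it with exact enumeration. It is a simplification under stated assumptions: it estimates the entropy rather than its gradient, it uses only three time bins, and the per-bin spike probabilities are hypothetical.

```python
import itertools
import math
import random

def response_prob(sigma, p_bins):
    """Probability of one binary response given independent per-bin spike probabilities."""
    prob = 1.0
    for s, p in zip(sigma, p_bins):
        prob *= p if s == 1 else (1.0 - p)
    return prob

def exact_entropy(p_bins):
    """Entropy obtained by enumerating every possible response."""
    h = 0.0
    for sigma in itertools.product((0, 1), repeat=len(p_bins)):
        p = response_prob(sigma, p_bins)
        h -= p * math.log(p)
    return h

def sampled_entropy(p_bins, n_trials, rng):
    """Monte Carlo estimate: average of -log p(xi) over sampled trials."""
    total = 0.0
    for _ in range(n_trials):
        sigma = [1 if rng.random() < p else 0 for p in p_bins]
        total -= math.log(response_prob(sigma, p_bins))
    return total / n_trials

p_bins = [0.1, 0.5, 0.2]            # hypothetical per-bin spike probabilities
rng = random.Random(0)
print(exact_entropy(p_bins), sampled_entropy(p_bins, 20000, rng))
```

Enumeration scales exponentially with the number of bins, whereas the sampled estimate needs only one response per trial, which is the property the sampling argument relies on.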

3.1 Numerical Computation. In this section, we describe the procedure for numerically evaluating equation 3.2 via Simpson's integration (Hennion, 1962). This integration is performed over the set of possible responses $\Omega_i$ (see equation 3.3) within the time interval [0 ... T]. The set $\Omega_i$ can be divided into disjoint subsets $\Omega_i^n$, which contain exactly n spikes: $\Omega_i = \bigcup_{n} \Omega_i^n$.


Using this breakdown,

$$\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} = -\int_{\Omega_i} p(\xi)\,(\log(p(\xi)) + 1)\, \frac{\partial \log(p(\xi))}{\partial w_{ij}}\, d\xi = -\sum_{n=0}^{\infty} \int_{\Omega_i^n} p(\xi)\,(\log(p(\xi)) + 1)\, \frac{\partial \log(p(\xi))}{\partial w_{ij}}\, d\xi. \tag{3.8}$$

It is illustrative to walk through the alternatives. For n = 0, there is only one response given the input. Assuming the probability of n = 0 spikes is $p_0$, the n = 0 term of equation 3.8 reads:

$$\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} = -p_0\,(\log(p_0) + 1) \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i)\, dt. \tag{3.9}$$

The probability $p_0$ is the probability of the neuron not having fired between t = 0 and t = T given inputs $F_i^T$ resulting in membrane potential $u_i(t)$ and hence probability of firing at time t of $\rho(u_i(t))$,

$$p_0 = S[0, T] = \exp\left(-\int_{t=0}^{T} \rho(u_i(t))\, dt\right), \tag{3.10}$$

which is equal to the survival function S for a nonhomogeneous Poisson process with probability density $\rho(u_i(t))$ for t = [0 ... T]. (We use the inclusive/exclusive notation for S: S(0,T) computes the function excluding the end points; S[0,T] is inclusive.)

For n = 1, we must consider all responses containing exactly one output spike: $G_i^T = \{f_i^1\}$, $f_i^1 \in [0, T]$. Assuming that neuron i fires only at time $f_i^1$ with probability $p_1(f_i^1)$, the n = 1 term of equation 3.8 reads

$$\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} = -\int_{f_i^1=0}^{T} p_1(f_i^1)\,(\log(p_1(f_i^1)) + 1) \left[ \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i^1)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1) \right] df_i^1. \tag{3.11}$$

The probability $p_1(f_i^1)$ is computed as

$$p_1(f_i^1) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, T], \tag{3.12}$$


where the membrane potential now incorporates one reset at $t = f_i^1$:

$$u_i(t) = \eta(t - f_i^1) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \epsilon(t \mid f_j, f_i^1).$$

For n = 2, we must consider all responses containing exactly two output spikes: $G_i^T = \{f_i^1, f_i^2\}$ for $f_i^1, f_i^2 \in [0, T]$. Assuming that neuron i fires at $f_i^1$ and $f_i^2$ with probability $p_2(f_i^1, f_i^2)$, the n = 2 term of equation 3.8 reads:

$$\frac{\partial h(\Omega_i \mid F_i^T)}{\partial w_{ij}} = -\int_{f_i^1=0}^{T} \int_{f_i^2=f_i^1}^{T} p_2(f_i^1, f_i^2)\,(\log(p_2(f_i^1, f_i^2)) + 1) \times \left[ \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i^1, f_i^2)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1, f_i^2) + \frac{\rho_i'(f_i^2)}{\rho_i(f_i^2)}\, \epsilon(f_i^2 \mid f_j, f_i^1, f_i^2) \right] df_i^1\, df_i^2. \tag{3.13}$$

The probability $p_2(f_i^1, f_i^2)$ can again be expressed in terms of the survival function,

$$p_2(f_i^1, f_i^2) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, f_i^2)\, \rho_i(f_i^2)\, S(f_i^2, T], \tag{3.14}$$

with $u_i(t) = \eta(t - f_i^1) + \eta(t - f_i^2) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \epsilon(t \mid f_j, f_i^1, f_i^2)$. This procedure can be extended for n > 2 following the pattern above.

In our simulation of the STDP experiments, the probability of obtaining zero, one, or two spikes already accounted for 99.9% of all possible responses; adding the responses of three spikes (n = 3) brought this number up to ≈ 99.999%, which is close to the accuracy of our numerical computation. In practice, we found that taking into account n = 3 had no significant contribution to computing Δw, and we did not compute higher-order terms as the cumulative probability of these responses was below our numerical precision. For the results we present later, we used only terms n ≤ 2; we demonstrate that this is sufficient in appendix A. In this section, we have replaced an integral over possible spike sequences $\Omega_i$ with an integral over the times of two output spikes, $f_i^1$ and $f_i^2$, which we compute numerically.
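The breakdown by spike count can be illustrated numerically. The sketch below evaluates $p_0$ (equation 3.10) and the integrals of $p_1$ and $p_2$ (equations 3.12 and 3.14) with a composite Simpson's rule, and reports how much probability mass the terms n ≤ 2 cover. It makes two loud simplifications that are not part of the model above: the spike probability density is a hypothetical constant, and the reset η after each spike is ignored, so the survival function depends only on the interval length.

```python
import math

def simpson(f, a, b, n=32):
    """Composite Simpson's rule on [a, b] using an even number n of panels."""
    if n % 2:
        raise ValueError("n must be even")
    if b == a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def survival(rho, a, b):
    """S(a, b) = exp(-integral of rho over [a, b])."""
    return math.exp(-simpson(rho, a, b))

def spike_count_probs(rho, T):
    """Probabilities of exactly 0, 1 and 2 spikes in [0, T] (no reset modeled)."""
    p0 = survival(rho, 0.0, T)

    def p1(f1):
        return survival(rho, 0.0, f1) * rho(f1) * survival(rho, f1, T)

    def p2(f1, f2):
        return (survival(rho, 0.0, f1) * rho(f1) * survival(rho, f1, f2)
                * rho(f2) * survival(rho, f2, T))

    P1 = simpson(p1, 0.0, T)
    P2 = simpson(lambda f1: simpson(lambda f2: p2(f1, f2), f1, T), 0.0, T)
    return p0, P1, P2

rate = lambda t: 0.5                # hypothetical constant density (assumed)
p0, P1, P2 = spike_count_probs(rate, 2.0)
print(p0, P1, P2, "covered by n <= 2:", p0 + P1 + P2)
```

With a constant density the counts follow a Poisson distribution, which gives a closed-form check on the numerical integration; with a membrane-potential-dependent density and reset, the same nesting of integrals applies but the survival terms must be recomputed for each candidate spike time.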


Figure 4: Experimental setup of Zhang et al. (1998).

4 Simulation Methodology

We modeled in detail the experiment of Zhang et al. (1998) involving asynchronous costimulation of convergent inputs. In this experiment, depicted in Figure 4, a postsynaptic neuron is identified that has two neurons projecting to it: one weak (subthreshold) and one strong (suprathreshold). The subthreshold input results in depolarization of the postsynaptic neuron, but the depolarization is not strong enough to cause the postsynaptic neuron to spike. The suprathreshold input is strong enough to induce a spike in the postsynaptic neuron. Plasticity of the synapse between the subthreshold input and the postsynaptic neuron is measured as a function of the timing between the subthreshold and postsynaptic neurons' spikes ($\Delta t_{\text{pre-post}}$) by varying the intervals between induced spikes in the subthreshold and the suprathreshold inputs ($\Delta t_{\text{pre-pre}}$). This measurement yields the well-known STDP curve (see Figure 1b).

In most experimental studies of STDP, the postsynaptic neuron is induced to spike not via a suprathreshold neuron, but rather by depolarizing current injection directly into the postsynaptic neuron. To model experiments that induce spiking via current injection, additional assumptions must be made in the spike response model framework. Because these assumptions are not well established in the literature, we have focused on the synaptic input technique of Zhang et al. (1998). In section 5.1, we propose a method for modeling a depolarizing current injection in the spike response model.

The Zhang et al. (1998) experiment imposes four constraints on a simulation: (1) the suprathreshold input alone causes spiking more than 70% of the time; (2) the subthreshold input alone causes spiking less than 10% of the time; (3) synchronous firing of suprathreshold and subthreshold inputs causes LTP if and only if the postsynaptic neuron fires; and (4) the time constants of the excitatory PSPs (EPSPs), $\tau_s$ and $\tau_m$ in the sSRM, are in the range of 1 to 5 ms and 7 to 15 ms, respectively. These constraints remove many free parameters from our simulation. We do not explicitly model the two input cells; instead, we model the EPSPs they produce. The magnitude of these EPSPs is picked to satisfy the experimental constraints: in most simulations, unless reported otherwise, the suprathreshold EPSP ($w_{supra}$) alone causes a spike in the postsynaptic neuron on 85% of trials, and the subthreshold EPSP ($w_{sub}$) alone causes a spike on fewer than 0.1% of trials. In our principal re-creation of the experiment (see Figure 5), we added normally distributed variation to $w_{supra}$ and $w_{sub}$ to simulate the experimental selection process of finding suitable supra-subthreshold input pairs according to $w'_{supra} = w_{supra} + N(0, \sigma_{supra})$ and $w'_{sub} = w_{sub} + N(0, \sigma_{sub})$ (we controlled the random variation for conditions outside the specified firing probability ranges). Free parameters of the simulation are ϑ and β in the spike probability function (α can be folded into ϑ) and the magnitude ($u_r^s$, $u_{abs}$) and time constants ($\tau_r^s$, $\tau_r^f$, $\tau_{abs}$) of the reset. We can further investigate how the results depend on the exact strengths of the subthreshold and suprathreshold EPSPs.

The independent variable of the simulation is $\Delta t_{\text{pre-pre}}$, and we measure the time of the postsynaptic spike to determine $\Delta t_{\text{pre-post}}$. In the experimental protocol, a pair of inputs is repeatedly stimulated at a specific interval $\Delta t_{\text{pre-pre}}$ at a low frequency of 0.1 Hz. The weight update for a given $\Delta t_{\text{pre-pre}}$ is measured by comparing the size of the EPSC before stimulation and (about) half an hour after stimulation. In terms of our model, this repeated stimulation can be considered as drawing a response ξ from the stochastic conditional response density p(ξ). We estimate the expected weight update for this density p(ξ) for a given $\Delta t_{\text{pre-pre}}$ using equation 3.2 by approximating the integral by a summation over all time-discretized output responses consisting of 0, 1, or 2 spikes.
Note that performing the weight update computation in this way implicitly assumes that the synaptic efficacies in the experiment do not change much during repeated stimulation. Since long-term synaptic changes require the synthesis of, for example, proteins, this seems a reasonable assumption; it is also reflected in the half hour or so that the experimentalists wait after stimulation before measuring the new synaptic efficacy.

5 Results

Figure 5A shows an STDP curve produced by the model, obtained by plotting the estimated weight update of equation 3.2 against $\Delta t_{\text{pre-post}}$ for fixed supra- and subthreshold inputs. Specifically, we vary the difference in time between subthreshold and suprathreshold inputs (a pre-pre pair), and we compute the expected gradient for the subthreshold input $w_{sub}$ over all responses of the postsynaptic neuron via equation 3.2. We thus obtain a value for Δw for each $\Delta t_{\text{pre-pre}}$ data point; we then compute $\Delta w_{sub}(\%)$ as the relative percentage change of synaptic efficacy: $\Delta w_{sub}(\%) = \Delta w / w_{sub} \times 100\%$.¹ For each $\Delta t_{\text{pre-pre}}$, the corresponding value $\Delta t_{\text{pre-post}}$ is determined by calculating for each input pair the average time at which the postsynaptic neuron fires relative to the subthreshold input. Together, this results in a set of ($\Delta t_{\text{pre-post}}$, $\Delta w_{sub}(\%)$) data points. The continuous graph in Figure 5A is obtained by repeating this procedure for fixed supra- and subthreshold weights and connecting the resultant points.

¹ We set the global learning rate γ in equation 3.2 such that the simulation curve is scaled to match the neurophysiological results. In all other experiments where we use relative percentage change $\Delta w_{sub}(\%)$, the same value for γ is used.

Figure 5: (A) STDP: experimental data (triangles) and model fit (solid line). (B) Added simulation data points with perturbed weights $w'_{supra}$ and $w'_{sub}$ (crosses). STDP data redrawn from Zhang et al. (1998). Model parameters: $\tau_s$ = 2.5 ms, $\tau_m$ = 10 ms, sub- and suprathreshold weight perturbation in B: $\sigma_{supra} = 0.33\, w_{supra}$, $\sigma_{sub} = 0.1\, w_{sub}$. (C) Model fit compared to previous generative models (Chechik, 2003, short dashed line; Toyoizumi et al., 2005, long dashed line; data points and curves were obtained by digitizing figures in original papers). Free parameters of Chechik (2003) and Toyoizumi et al. (2005) were fit to the experimental data as described in appendix B.

In Figure 5B, the supra- and subthreshold weights in the simulation are randomly perturbed for each pre-pre pair, to simulate the fact that in the experiment, different pairs of neurons are selected for each pre-pre pair, leading inevitably to variation in the synaptic strengths. Mild variation of the input weights yields the "scattering" data points of the relative weight changes similar to the experimentally observed data. Clearly, the mild variation we apply is small only relative to the observed in vivo distributions of synaptic weights in the brain (e.g., Song, Sjöström, Reigl, Nelson, & Chklovskii, 2005). However, Zhang et al. (1998) did not sample randomly from synapses in the brain but rather selected synapses that had a particularly narrow range of initial EPSPs to satisfy the criteria for "supra-" and "subthreshold" synapses (see also section 4). Hence, the experimental variance was particularly small (see Figure 1e of Zhang et al., 1998), and our variation of the size of the EPSP is in line with the observed variations in the experimental results of Zhang et al. (1998).

The model produces a good quantitative fit to the experimental data points (triangles), especially compared to other related work as discussed in section 1, and robustly obtains the typical LTP and LTD time windows associated with STDP. In Figure 5C, we show our model fit compared to the models of Toyoizumi et al. (2005) and Chechik (2003). Our model obtained the lowest sum squared error (1.25 versus 1.63 and 3.27,² respectively; see appendix B for methods). This is despite the lack of data in the region $\Delta t_{\text{pre-post}}$ = 0, ..., 10 ms in the Zhang et al. (1998) experiment, where the difference in LTD behavior is most pronounced. The qualitative shape of the STDP curve is robust to settings of the spiking neuron model's parameters, as we will illustrate shortly.
² Of note for this comparison is that our spiking neuron model uses a more sophisticated difference of exponentials (see equation 2.4) to describe the EPSP, whereas the spiking neuron models in Toyoizumi et al. (2005) and Chechik (2003) use a single exponential. These other models might be improved using the more sophisticated EPSP function.

Additionally, we found that the type of spike probability function ρ (exponential, sigmoidal, or linear) is not critical. Our model accounts for an additional finding that has not been explained by alternative theories: the relative magnitude of LTP decreases as the efficacy of the synapse between the subthreshold input and the postsynaptic target neuron increases; in contrast, LTD remains roughly constant (Bi & Poo, 1998). Figure 6A shows this effect in the experiment of Bi and Poo (1998), and Figure 6B shows the corresponding result from our model. We compute the magnitude of LTP and LTD for the peak modulation (i.e., $\Delta t_{\text{pre-post}}$ = −5 for LTP and $\Delta t_{\text{pre-post}}$ = +5 for LTD) as the amplitude of the subthreshold EPSP is increased. The model's explanation for this phenomenon is simple: as the synaptic weight increases, its effect saturates, and a small change to the weight does little to alter its influence. Consequently, the gradient of the entropy with respect to the weight goes toward zero. Similar saturation effects are observed in gradient-based learning methods with nonlinear response functions such as backpropagation.

Figure 6: Dependence of LTP and LTD magnitude on efficacy of the subthreshold input. (A) Experimental data redrawn from Bi and Poo (1998). (B) Simulation result.

As we mentioned earlier, other theories have had difficulty reproducing the typical shape of the LTD component of STDP. In Chechik (2003), the shape is predicted to be near uniform, and in Toyoizumi et al. (2005), the shape depends on the autocorrelation. In our stochastic spike response model, this component arises due to the stochastic variation in the neural response: in the specific STDP experiment, reduction of variability is achieved by reducing the probability of multiple output spikes. To argue for this conclusion, we performed simulations that make our neuron model less variable in various ways, and each of these manipulations results in a reduction in the LTD component of STDP. In Figures 7A and 7B, we make the threshold more deterministic by increasing the values of α and β in the spike probability density function. In Figure 7C, we increase the magnitude of the refractory response η, which will prevent spikes following the initial postsynaptic response. And finally, in Figure 7D, we increase the efficacy of the suprathreshold input, which prevents the postsynaptic neuron's potential from hovering in the region where the stochasticity of the threshold can induce a spike. Modulation of all of these variables makes the threshold more deterministic and decreases LTD relative to LTP.

Our simulation results are robust to biologically realizable variation in the parameters of the sSRM model. For example, time constants of the EPSPs can be varied with no qualitative effect on the STDP curves.


Figure 7: Dependence of relative LTP and LTD on (A) the parameter α of the stochastic threshold function, (B) the parameter β of the stochastic threshold function, (C) the magnitude of refraction, η, and (D) efficacy of the suprathreshold synapse, expressed as p(fire|supra), the probability that the postsynaptic neuron will fire when receiving only the suprathreshold input. Larger values of p(fire|supra) correspond to a stronger suprathreshold synapse. In all graphs, the weight gradient for individual curves is normalized to peak LTP for comparison purposes.

Figures 8A and 8B show the effect of manipulating the membrane potential decay time $\tau_m$ and the EPSP rise time $\tau_s$, respectively. Note that manipulation of these time constants does predict a systematic effect on STDP curves. Increasing $\tau_m$ increases the duration of both the LTP and LTD windows, whereas decreasing $\tau_s$ leads to a faster transition from LTP to LTD. Both predictions could be tested experimentally by correlating time constants of individual neurons studied with the time course of their STDP curves.

Figure 8: Influence of time constants of the sSRM model on the shape of the STDP curve: (A) varying the membrane potential time constant $\tau_m$ and (B) varying the EPSP rise time constant $\tau_s$. In both figures, the magnitude of LTP and LTD has been normalized to 1 for each curve to allow for easy examination of the effect of the manipulation on temporal characteristics of the STDP curves.

5.1 Current Injection. We mentioned earlier that in many STDP experiments, an action potential is induced in the postsynaptic neuron not via a suprathreshold presynaptic input, but via a depolarizing current injection. In order to model experiments using current injection, we must characterize the current function and its effect on the postsynaptic neuron. In this section, we make such a proposal framed in terms of the spike response model and report simulation results using current injection. We model the injected current I(t) as a rectangular step function,

$$I(t) = I_c\, H(t - f_I)\, H([f_I + \Delta_I] - t), \tag{5.1}$$

where the current of magnitude $I_c$ is switched on at $t = f_I$ and off at $t = f_I + \Delta_I$. In the Zhang et al. (1998) experiment, $\Delta_I$ is 2 ms, a value we adopted for our simulations as well. The resulting postsynaptic potential, $\epsilon_c$, is

$$\epsilon_c(t) = \int_0^t \exp\left(-\frac{s}{\tau_m}\right) I(s)\, ds. \tag{5.2}$$

In the absence of postsynaptic firing, the membrane potential of an integrate-and-fire neuron in response to a step current is (Gerstner, 2001):

$$\epsilon_c(t \mid f_I) = I_c\,(1 - \exp[-(t - f_I)/\tau_m]). \tag{5.3}$$

In the presence of postsynaptic firing at time $\hat{f}_i$, we assume, as we did previously in equation 2.5, a reset and subsequent integration of the residual current:

$$\epsilon_c(t \mid \hat{f}_i) = H(\hat{f}_i - t) \int_0^t \exp\left(-\frac{s}{\tau_m}\right) I(s)\, ds + H(t - \hat{f}_i) \int_{\hat{f}_i}^{t} \exp\left(-\frac{s}{\tau_m}\right) I(s)\, ds. \tag{5.4}$$

Figure 9: Voltage response of a spiking neuron for a 2 ms current injection in the spike response model. Solid curve: The postsynaptic neuron produces no spike, and the potential due to the injected current decays with the membrane time constant $\tau_m$. Dotted curve: The postsynaptic neuron spikes while the current is still being applied. Dashed curve: The postsynaptic neuron spikes after application of the current has terminated (moment of postsynaptic spiking indicated by arrows).
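The three cases described for the current-injection kernel (no postsynaptic spike, a spike during the pulse, a spike after the pulse) can be sketched in closed form for a rectangular pulse. The code below is an illustration under a stated assumption, not the implementation used for the reported simulations: it takes the kernel to be the usual leaky-integrator convolution exp(−(t − s)/τ_m), which reproduces the shape of equation 5.3 up to a constant factor τ_m and decays after current offset. The function names and parameter values are hypothetical.

```python
import math

def pulse(t, f_I, delta_I, I_c):
    """Rectangular current of eq. 5.1: I_c on [f_I, f_I + delta_I], zero elsewhere."""
    return I_c if f_I <= t <= f_I + delta_I else 0.0

def eps_c(t, f_I, delta_I, I_c, tau_m, f_hat=None):
    """Leaky integration of the rectangular pulse up to time t.

    ASSUMPTION: kernel exp(-(t - s) / tau_m). If the neuron spiked at
    f_hat <= t, only current arriving after f_hat is integrated (reset),
    mirroring the two branches of eq. 5.4.
    """
    lower = 0.0 if f_hat is None or t < f_hat else f_hat
    a = max(lower, f_I)                 # start of the integrated current
    b = min(t, f_I + delta_I)           # end of the integrated current
    if b <= a:
        return 0.0
    return I_c * tau_m * (math.exp(-(t - b) / tau_m) - math.exp(-(t - a) / tau_m))

# Hypothetical values in ms: 2 ms pulse starting at 5 ms, tau_m = 10 ms.
f_I, delta_I, I_c, tau_m = 5.0, 2.0, 1.0, 10.0
for t in (4.0, 6.0, 7.0, 12.0):
    print(t, pulse(t, f_I, delta_I, I_c),
          eps_c(t, f_I, delta_I, I_c, tau_m),
          eps_c(t, f_I, delta_I, I_c, tau_m, f_hat=6.0))
```

A spike during the pulse leaves only the residual current to be integrated, and a spike after the pulse leaves none, which matches the qualitative description of the dotted and dashed curves.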

These $\epsilon_c$ kernels are depicted in Figure 9 for a postsynaptic spike occurring at various times $\hat{f}_i$. In our simulations, we chose the current magnitude $I_c$ to be large enough to elicit spiking of the target neuron with probability greater than 0.7. Figure 10A shows the STDP curve obtained using the current injection model for the exact same model parameter settings used to produce the result based on a suprathreshold synaptic input (depicted in Figure 5A), superimposed on the experimental STDP data obtained by depolarizing current injection from Zhang et al. (1998). Figure 10B additionally superimposes the earlier result on the current injection result, and the two curves are difficult to distinguish. As in the earlier result, variation of model parameters has little appreciable effect on the model's behavior using the current injection paradigm, suggesting that current injection versus synaptic input makes little difference to the nature of STDP.


Figure 10: (A) STDP curve obtained for SRM with current injection (solid curve) compared with experimental data for depolarizing current injection (circles; redrawn from Zhang et al., 1998). (B) Comparing STDP curves for both current injection (solid curve) and suprathreshold input (dashed curve) models. The same model parameters are used for both curves. Experimental data redrawn from Zhang et al. (1998) for current injection (circles) and suprathreshold input (triangles) paradigms are superimposed.

6 Discussion

In this letter, we explored a fundamental computational principle: that synapses adapt so as to minimize the variability of a neuron's response in the face of noisy inputs, yielding more reliable neural representations. From this principle, instantiated as entropy minimization, we derived the STDP learning curve. Importantly, the simulation methodology we used to derive the curve closely follows the procedure used in neurophysiological experiments (Zhang et al., 1998): assuming variation in sub- and suprathreshold synaptic efficacies from one experimental pair to the next even recovers the noisy scattering of efficacy changes. Our simulations furthermore obtain an STDP curve that is robust to model parameters and details of the noise distribution.

Our results are critically dependent on the use of Gerstner's stochastic spike response model, whose dynamics are a good approximation to those of a biological spiking neuron. The sSRM has the virtue of being characterized by parameters that are readily related to neural dynamics, and its dynamics are differentiable such that we can derive a gradient-descent learning rule that minimizes the response variability of a postsynaptic neuron given a particular set of input spikes.

Our model predicts the shape of the STDP curve and how it relates to properties of a neuron's response function. These predictions may be empirically testable if a diverse population of cells can be studied. The predictions include the following. First, the width of the LTD and LTP windows depends on the (excitatory) PSP time constants (see Figures 8A and 8B). Second, the strength of LTD relative to LTP depends on the degree of noise in the neuron's response; the LTD strength increases with the noise level.

Our model also can characterize the nature of the learning curve for experimental situations that deviate from the boundary conditions of Zhang et al. (1998). In Zhang et al., the subthreshold and suprathreshold inputs produced postsynaptic firing with probability less than .10 and greater than .70, respectively. Our model can predict the consequences of violating these conditions. For example, when the subthreshold input is very strong or the suprathreshold input is very weak, our model produces strictly LTD, that is, anti-Hebbian learning. The consequence of a strong subthreshold input is shown in Figure 6B, and the consequence of a weak suprathreshold input is shown in Figure 7D. Intuitively, this simulation result makes sense because, in the first case, the most likely alternative response of the postsynaptic neuron is to produce more than one spike, and, in the second case, the most likely alternative response is no postsynaptic spike at all. In both cases, synaptic depression reduces the probability of the alternative response. We note that such strictly anti-Hebbian learning has been reported in relation to STDP-type experiments (Roberts & Bell, 2002).

For very noisy thresholds and for weak suprathreshold inputs, our model produces an LTD dip before LTP (see Figure 7D). This dip is in fact also present in the work of Chechik (2003). We find it intriguing that this dip is also observed in the experimental results of Nishiyama et al. (2000). The explanation for this dip may be along the same lines as the explanation for the LTD window: given the very noisy threshold, the subthreshold input may occasionally cause spiking, and decreasing its weight would decrease response variability.
This may not be offset by the increase due to its contribution to the spike caused by the suprathreshold input, as it is too early to have much influence. With careful consideration of experimental conditions and neuron parameters, it may be possible to reconcile the somewhat discrepant STDP curves obtained in the literature using our model.

In our model, the transition from LTP to LTD occurs at a slight offset from $\Delta t_{\text{pre-post}}$ = 0: if the subthreshold input fires 1 to 2 ms before the postsynaptic neuron fires (on average), then neither potentiation nor depression occurs. This offset of 1 to 2 ms is attributable to the current decay time constant, $\tau_s$. The neurophysiological data are not sufficiently precise to determine the exact offset of the LTP-LTD transition in real neurons. Unfortunately, few experimental data points are recorded near $\Delta t_{\text{pre-post}}$ = 0. However, the STDP curve of our model does pass through the one data point in that region (see Figure 5A), so the offset may be a real phenomenon.

The main focus of the simulations in this letter was to replicate the experimental paradigm of Zhang et al. (1998), in which a suprathreshold presynaptic neuron is used to induce the postsynaptic neuron to fire. The Zhang et al. (1998) study is exceptional in that most other experimental studies of STDP use a depolarizing current injection to induce the postsynaptic neuron to fire. We are not aware of any established model for current injection within the SRM framework. We therefore proposed a model of current injection within the SRM framework in section 5.1. The proposed model is an ideal abstraction of current injection that does not take into account effects like current onset and offset fluctuations inherent in such experimental methods. Even with these limitations in mind, the current injection model produced STDP curves very similar to the ones obtained by the simulation of the suprathreshold input-induced postsynaptic firing.

The simulations reported in this letter account for classical STDP experiments in which a single presynaptic spike is paired with a single postsynaptic spike. The same methodology can be applied to model experimental paradigms involving multiple presynaptic or postsynaptic spikes, or both. However, the computation involved becomes nontrivial. We are currently engaged in modeling data from the multispike experiments of Froemke and Dan (2002).

We note that one set of simulation results we reported is particularly pertinent for comparing and contrasting our model to the related model of Toyoizumi et al. (2005). The simulations reported in Figure 7 suggest that noise in our model is critical for obtaining the LTD component of STDP and that parameters that reduce noise in the neural response also reduce LTD. We found that increasing the strength of neuronal refraction reduces response variability and therefore diminishes the LTD component of STDP. This notion is also put forward in very recent work by Pfister, Toyoizumi, Barber, and Gerstner (2006), where an STDP-like rule arises from a supervised learning procedure that aims to obtain spikes at times specified by a teacher. The LTD component in this work also depends on the probability of stochastic activity. In sharp contrast, Toyoizumi et al. (2005) suggest that neuronal refraction is responsible for LTD.
Because the two models are quite similar, it seems unlikely that the models make opposite predictions, and the discrepancy may be due to Toyoizumi et al.'s focus on analytical approximations to solve the mathematical problem at hand, limiting the validity of comparisons between that model and biological experiments in the process.

It is useful to reflect on the philosophy of choosing reduction of spike train variability as a target function, as it so obviously has the degenerate but energy-efficient solution of emitting no spikes at all. The usefulness of our approach clearly relies on the stochastic gradient reaching a local optimum in the likelihood space that does not always correspond to the degenerate solution. We compute the gradient of the input weights with respect to the conditionally independent sequence of response intervals [t, t + Δt]. The gradient approach tries to push the probability of the responses in these intervals to either 0 or 1, irrespective of what the response is (not firing or firing). We find that in the sSRM spiking neuron model, this gradient can be toward either state of each response interval, which can be attributed to the monotonically increasing spike probability density as a function of the membrane potential. This spike probability density allows neurons to become very reliable by firing spikes only at specific times, at least when starting from a set of input weights that, given the input pattern, is likely to induce a spike in the postsynaptic neuron. Because the target function is the reduction of postsynaptic spike train variability, the model does predict that when small inputs impinging on a postsynaptic target cause only occasional firing, the average weight update due to this target function would reduce these inputs to zero.

We have modeled the experimental studies in some detail, beyond the level of detail achieved by other researchers investigating STDP. Even a model with an entirely heuristic learning rule has value if it obtains a better fit to the data than other models of similar complexity. Our model has a learning rule that goes beyond heuristic: the learning rule is derived from a computational objective. To some, this objective may not be as exciting as more elaborate objectives like information maximization. As it is, our model stands alone from the contenders in providing a first-principles account of STDP that fits experimental data extremely well. Might there be a mathematically sexier model? We certainly hope so, but it has not yet been discovered.

We reiterate the point that our learning objective is viewed as but one of many objectives operating in parallel. The question remains as to why neurons would respond in such a highly variable way to fixed input spike trains: a more deterministic threshold would eliminate the need for any minimization of response variability.
We can only speculate that the variability in neuronal responses may also well serve these other objectives, such as exploitation or exploration in reinforcement learning or the exploitation of stochastic resonance phenomena (e.g., Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). It is interesting to note that minimization of conditional response variability corresponds to one part of the equation that maximizes mutual information. The mutual information I between input X and outputs Y is defined as I(X, Y) = H(Y) − H(Y|X). Hence, minimization of the conditional entropy H(Y|X)—our objective—along with the secondary unsupervised objective of maximizing the marginal entropy H(Y) maximize mutual information. The first unsupervised objective is notoriously hard to compute (e.g., see Bell & Parra, 2005, for an extensive discussion) whereas, as we have shown, the second objective—conditional entropy minimization—can be computed relatively easily via stochastic gradient descent. Indeed, in this light, it


Figure 11: STDP graphs for Δw computed using the terms n ≤ 2 (solid line) and n ≤ 3 (crossed solid line).

is a virtue of our model that we account for the experimental data with only one component of the mutual information objective (taking the responses in the experimental conditions as the set of responses Y). The relatively simple nature of the experiments that uncovered STDP lacks any interaction with other (output) neurons, and we may speculate that STDP may be the degenerate reflection of information maximization in the absence of such interactions. If subsequent work shows that STDP can be explained by mutual information maximization (without the drawbacks of existing work, such as the rate-based treatment of Chechik, 2003, or the unrealistic autocorrelation function and difficulty of relating to biological parameters of Toyoizumi et al., 2005), this work contributes in helping to tease apart the components of the objective that are necessary and sufficient for explaining the data.

Appendix A: Higher-Order Spike Probabilities

To compute Δw, we stop at n = 2, as in the experimental conditions that we model, the contribution of n > 2 spikes is vanishingly small. We find that the probability of three spikes occurring is typically < 1e−5, and the n = 3 term did not contribute significantly, as shown, for example, in Figure 11. Intuitively it seems very unlikely that the gradient of the conditional response entropy is dominated by terms that are highly unlikely. This could be the case only if the gradient on the probability of getting three or more spikes would be much larger than the gradient on getting, say, two spikes.


Given the setup of the model with an increasing probability of firing a spike as a function of the membrane potential, it is easy to see that changing a weight will change the probability of obtaining two spikes much more than the probability of obtaining three spikes. Hence, the entropy gradient from components n ≤ 2 will be (in practice, much) larger than the gradient for terms n = 3, n = 4, . . . . As we remarked before, in our simulation of the experimental setup, the probability of obtaining three spikes given the input was computed to be < 1e−5; the overall probability was computed at up to 1e−6. The probability of n = 4 was below the precision of our simulation.

Appendix B: Sum-Squared-Error Parameter Fitting

To compute the sum squared error when comparing the different STDP models in section 5, we use linear regression in the free parameters to minimize the sum-squared error between the model curves and the experimental data. For m experimental data points $\{(\Delta t_1, \Delta w_1), \ldots, (\Delta t_m, \Delta w_m)\}$ and model curve $\Delta w = f(\Delta t)$, we report for each model curve the sum-squared error for those values of the free parameters that minimize the sum-squared error $E^2$:

$$E^2 = \min_{\text{free parameters}} \sum_{i=1}^{m} \left(\Delta w_i - f(\Delta t_i)\right)^2 .$$
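For a model with a single multiplicative free parameter, the minimization above has a closed form. The following Python sketch illustrates this under stated assumptions: the model curve has the form γ·g(Δt), the names `fit_scale` and `example_curve` are hypothetical, and the exponential shape is only a stand-in, not any of the curves compared in this article.

```python
import math


def example_curve(dt, tau=10.0):
    """Hypothetical unscaled model curve g(dt): exponential for dt < 0, zero otherwise."""
    return math.exp(dt / tau) if dt < 0 else 0.0


def fit_scale(dt_obs, dw_obs, curve):
    """Least-squares fit of dw ~ gamma * curve(dt) in the single parameter gamma.

    Setting the derivative of sum_i (dw_i - gamma * g_i)^2 to zero gives
    gamma = sum_i dw_i * g_i / sum_i g_i^2.
    Returns (gamma, sum_squared_error).
    """
    g = [curve(dt) for dt in dt_obs]
    denom = sum(gi * gi for gi in g)
    if denom == 0.0:
        raise ValueError("model curve is zero at every data point")
    gamma = sum(wi * gi for wi, gi in zip(dw_obs, g)) / denom
    sse = sum((wi - gamma * gi) ** 2 for wi, gi in zip(dw_obs, g))
    return gamma, sse
```

Calling `fit_scale(dt_obs, dw_obs, example_curve)` returns the best scale and the corresponding E²; models with several free parameters would need a general least-squares solver instead of this one-line formula.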

For our model, linear regression is performed in the scaling parameter γ in equation 3.2 that relates the gradient obtained with the model parameters mentioned to the weight change. Where possible for the other models, we set model parameters to correspond to the values observed in the experimental conditions described in section 4. For the model by Chechik (2003), the weight update is computed as the sum of a positive exponent and a negative damping contribution,

w = γ H(−t) exp(t/τ ) − H(t + )H(− − t)K , where t is computed as tpr e − tpost , K denotes the negative damping contribution that is applied over a time window before and after the postsynaptic spike, and H( ) denotes the Heaviside function. The time constant τ is related to the decay of the EPSP, and we set this value to the same value we use for our model: 10 ms. Linear regression to find the minimal sum-squared error is performed on the free parameters γ , K , .


In Toyoizumi et al. (2005), the learning rule is the sum of two terms

w = γ 2 (t) − µ0 (φ 2 )(t) , where (t) is the EPSP, modeled as (t) exp(−t/τ ), and µ0 (φ 2 )(t) is a function of the autocorrelation function of a neuron (φ 2 )(t), times the spontaneous neural activity in the absence of input, µ0 . The EPSP decay time constant used in Toyoizumi et al. was already set to τ = 10 ms, and for the two terms in the sum we used the functions described by Figures 2A and 2B in Toyoizumi et al. We performed linear regression to the one free parameter, γ . Note that for this model, we obtain better LTD, and hence E 2 , for larger values of µ0 as those used in Toyoizumi et al. (2005). However, then E 2 still remains worse than for the other two models, and the spontaneous neural activity becomes unrealistically large. Acknowledgments We thank Tony Bell, Lucas Parra, and Gary Cottrell for insightful comments and encouragement. We also thank the anonymous reviewers for constructive feedback, which allowed us to improve the quality of our work and this article. The work of S.M.B. was supported by the Netherlands Organization for Scientific Research (NWO), TALENT S-62 588 and VENI 639.021.203. The work of M.C.M. was supported by National Science Foundation BCS 0339103 and CSE-SMA 0509521. References Abbott, L., & Gerstner, W. (2004). Homeostasis and learning through spike-timing dependent plasticity. In D. Hansel, C. Chow, B. Gutkin, & C. Meunier (Eds.), Methods and models in neurophysics. In Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier. Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281. Bell, A., & Parra, L. (2005). Maximising sensitivity in a spiking network. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press. Bi, G.-Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. 
Cybern., 87, 319–332. Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24), 10464–10472. Bi, G.-Q., & Poo, M.-M. (2001). Synaptic modification by correlated activity: Hebb’s postulate revisited. Ann. Rev. Neurosci., 24, 139–166.


Burkitt, A., Meffin, H., & Grayden, D. (2004). Spike timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed-point. Neural Computation, 16(5), 885–940. Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510. Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30. Dayan, P. (2002). Matters temporal. Trends in Cognitive Sciences, 6(3), 105–106. Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247. Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56. Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438. Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier. Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neural learning rule for sub-millisecond temporal coding. Nature, 383, 76–78. Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press. Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951. Hennion, P. E. (1962). Algorithm 84: Simpson's integration. Communications of the ACM, 5(4), 208. Herrmann, A., & Gerstner, W. (2001). 
Noise and the PSTH response to current transients: I. General theory and application to the integrate-and-fire neuron. J. Comp. Neurosci., 11, 135–151. Hopfield, J., & Brody, C. (2004). Learning rules and network repair in spike-timing-based computation. PNAS, 101(1), 337–342. Izhikevich, E., & Desai, N. (2003). Relating STDP to BCM. Neural Computation, 15, 1511–1523. Jolivet, R., Lewis, T., & Gerstner, W. (2003). The spike response model: A framework to predict neuronal spike trains. In O. Kaynak, E. Alpaydin, E. Oja, & L. Yu (Eds.), Proc. Joint International Conference ICANN/ICONIP 2003 (pp. 846–853). Berlin: Springer. Kandel, E. R., Schwartz, J., & Jessell, T. M. (2000). Principles of neural science. New York: McGraw-Hill. Karmarkar, U., Najarian, M., & Buonomano, D. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382. Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.


Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742. Kepecs, A., van Rossum, M., Song, S., & Tegner, J. (2002). Spike-timing-dependent plasticity: Common themes and divergent vistas. Biol. Cybern., 87, 446–458. Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382. Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411. Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M.-M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588. Paninski, L., Pillow, J., & Simoncelli, E. (2005). Comparing integrate-and-fire models estimated using intracellular and extracellular data. Neurocomputing, 65–66, 379–385. Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing. Neural Computation, 18, 1318–1348. Porr, B., & Wörgötter, F. (2003). Isotropic sequence order learning. Neural Computation, 15(4), 831–864. Rao, R., & Sejnowski, T. (1999). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press. Rao, R., & Sejnowski, T. (2001). Spike-timing-dependent plasticity as temporal difference learning. Neural Computation, 13, 2221–2237. Roberts, P., & Bell, C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403. Saudargiene, A., Porr, B., & Wörgötter, F. (2004). 
How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Computation, 16, 595–625. Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67. Shon, A., Rao, R., & Sejnowski, T. (2004). Motion detection and prediction through spike-timing dependent plasticity. Network: Comput. Neural Syst., 15, 179–198. Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164. Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926. Song, S., Sjöström, P. J., Reigl, M., Nelson, S., & Chklovskii, D. B. (2005). Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology, 3(3), e68. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.


van Rossum, R., Bi, G.-Q., & Turrigiano, G. (2000). Stable Hebbian learning from spike time dependent plasticity. J. Neurosci., 20, 8812–8821. Xie, X., & Seung, H. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909. Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received April 6, 2005; accepted June 19, 2006.

LETTER

Communicated by Alexandre Pouget

Fast Population Coding

Quentin J. M. Huys [email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

Richard S. Zemel [email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5

Rama Natarajan [email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5

Peter Dayan [email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

Uncertainty coming from the noise in its neurons and the ill-posed nature of many tasks plagues neural computations. Maybe surprisingly, many studies show that the brain manipulates these forms of uncertainty in a probabilistically consistent and normative manner, and there is now a rich theoretical literature on the capabilities of populations of neurons to implement computations in the face of uncertainty. However, one major facet of uncertainty has received comparatively little attention: time. In a dynamic, rapidly changing world, data are only temporarily relevant. Here, we analyze the computational consequences of encoding stimulus trajectories in populations of neurons. For the most obvious, simple, instantaneous encoder, the correlations induced by natural, smooth stimuli engender a decoder that requires access to information that is nonlocal both in time and across neurons. This formally amounts to a ruinous representation. We show that there is an alternative encoder that is computationally and representationally powerful in which each spike contributes independent information; it is independently decodable, in other words. We suggest this as an appropriate foundation for understanding time-varying population codes. Furthermore, we show how adaptation to temporal stimulus statistics emerges directly from the demands of simple decoding.

Neural Computation 19, 404–441 (2007) © 2007 Massachusetts Institute of Technology

1 Introduction

From the earliest neurophysiological investigations in the cortex, it became apparent that sensory and motor information is represented in the joint activity of large populations of neurons (Barlow, 1953; Georgopoulos, Schwartz, & Kettner, 1983). There are by now substantial ideas and data about how these representations are formed (Rao, Olshausen, & Lewicki, 2002), how information can be decoded from recordings of this activity (Paradiso, 1988; Snippe & Koenderinck, 1992; Seung & Sompolinsky, 1993), and how various sorts of computations, including uncertainty-sensitive, Bayesian optimal statistical processing, can be performed through the medium of feedforward and recurrent connections among the populations (Pouget, Zhang, Deneve, & Latham, 1998; Deneve, Latham, & Pouget, 2001). Critical issues that have emerged from these analyses are the forms of correlations between neurons in the populations, whether these correlations are significant for decoding and computation, and what sorts of prior information are relevant to computations and can be incorporated by such networks. However, many theoretical investigations into population coding have so far somewhat neglected a major dimension of coding: time. This is despite the beautiful and influential analyses of circumstances in which individual spikes contribute importantly to the representation of rapidly varying stimuli (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reinagel & Reid, 2000; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Johansson & Birznieks, 2004) and the importance accorded to fast-timescale spiking by some practical investigations into population coding (Wilson & McNaughton, 1993; Schwartz, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998). The assumption is often made that encoded objects do not vary quickly with time and that therefore spike counts in the population suffice.
Even some approaches that consider fast decoding (Brunel & Nadal, 1998; Van Rullen & Thorpe, 2001) treat stimuli as being discrete and separate rather than as evolving along whole trajectories. In this letter, we study the generic computational consequences of population coding in time. We analyze decoding in time as a proxy for computation in time as it is the most comprehensive computation that can be performed (accessing all information present). Decoding therefore constitutes a canonical test (Brown et al., 1998; Zhang et al., 1998). We consider a regime in which stimuli are not static and create sparse trains of spikes. Decoding trajectory information from these population spike trains is thoroughly ill posed, and prior information about what trajectories are likely


comes to play a critical role. We show that optimal decoding with ecological priors formally couples together the spikes, making trajectory inference computationally very hard. We thus consider the prospects for neural populations to recode the information about the trajectory into new sets of spikes that do support simple computations. Phenomena reminiscent of adaptation emerge as a by-product of the maintenance of a computationally advantageous code. We analyze the extension of one of the simplest ideas about population codes for static stimuli (Snippe & Koenderinck, 1992) to the case of trajectories. This links a neurally plausible population encoding model with a naturally realistic gaussian process prior. Unlike some previous work on decoding in time (Brown et al., 1998; Zhang et al., 1998; Smith & Brown, 2003), we do not confine ourselves to recursively specifiable priors and can therefore treat smoother cases. It is these smooth priors that render decoding, and likely other computations, hard and inspire an energy-based (product of experts) recoding (Hinton, 1999; Zemel, Huys, Natarajan, & Dayan, 2005), which makes for readier probabilistic computation. Section 2 starts with a simple encoding model. It introduces the need for priors, their shape, and analytical results for decoding in time. Section 3 shows how priors determine the form in which information is available to downstream neurons. We show that the decoder corresponding to the simple encoder can be extraordinarily complex, meaning that the encoded information is not readily available to downstream neurons. Finally, section 4 proposes a representation that has comparable power but for which decoding requires vastly less downstream computation. 2 A Gaussian Process Prior Approach As a motivating example, consider tennis. The player returning a serve has to predict the position of the ball based on data acquired in fractions of seconds. 
Experts compensate for the extraordinarily sparse stimulus information with a very rich temporal prior over ball trajectories and thus make predictions that are accurate enough to guarantee many a winning return. Figure 1 illustrates the setting of this article more formally. It shows an array of neurons with partially overlapping tuning functions that emit spikes in response to a stimulus that varies in time. These could be V1 neurons responding to a target (the tennis ball) as it moves through their receptive fields, or hippocampal neurons with place fields firing as a rat explores an environment. The task is to decode the spikes in time, that is, recover the trajectory of the stimulus (the ball’s position, say) based on the spikes, a knowledge of the neuronal tuning functions (cf. Brown et al., 1998; Zhang et al., 1998, for hippocampal examples), and some knowledge about the temporal characteristics of the stimulus (the prior). In Figure 1, the ordinate represents the stimulus space (here one-dimensional for illustrative

Figure 1: The problem: Reconstructing the stimulus as a function of time, given the spikes emitted by a population of neurons. When a neuron with preferred stimulus s_i emits a spike at time t, a black dot is plotted at (t, s_i). A few example tuning functions are shown in gray. The ordinate represents stimulus space, with each neuron being positioned according to its preferred stimulus s_i. The decoding problem is related to fitting a line through these points, which is achievable only if there is prior information about the line to be fitted (e.g., the order of a polynomial fit or the smoothness). (Axes: time [ms] on the abscissa, neuron position/space on the ordinate.)

purposes) and the abscissa, time. Neuron i has preferred stimulus $s_i$. If it emits a spike $\xi_t^i$ at time t, a dot is drawn at position $(t, s_i)$. The dots in Figure 1 thus represent the spiking activity of the entire population of neurons over time. Our aim is to find, for each observation time T, a distribution over likely stimulus values $s_T$ given all the spikes previous to T. This is related to fitting a line representing the trajectory of the stimulus through the points. It is a thoroughly ill-posed problem, for instance, because we are not given any information about the stimulus at all between the spikes. To solve this ill-posed problem, we have to bring in additional knowledge in the form of a prior distribution about likely stimulus trajectories. The prior distribution specifies the temporal characteristics of the trajectories (e.g., how smooth they are) and also whether they live within some constrained part of the stimulus space. Subjects are assumed to possess such prior information ahead of time, for instance, from previous exposures to trajectories (a good tennis player will have seen many serves). To gain analytical insight into the structure of decoding in this temporally rich case, we consider a very simple spiking model $p(\xi_t^i \mid s_t)$ (cf. Snippe & Koenderinck, 1992, for the static case), augmented with a simple prior over stimulus trajectories $p(\mathbf{s})$. We thereafter follow standard approaches (Zhang et al., 1998; Brown et al., 1998) by performing causal decoding and thus recovering $p(s_T \mid \xi)$ over the current stimulus $s_T$ at time T given all the J spikes $\xi \equiv \{\xi_{t_j}^{i}\}_{j=1}^{J}$ at times $0 < \{t_j\}_{j=1}^{J} < T$ in the observation period


([0, T)), emitted by the entire population. Here, i = 1, . . . , N designates the neuron that emitted the spike. To state the problem in mathematical terms, we can write (at least for the case that there is no spike at time T itself)

$$\begin{aligned}
p(s_T \mid \xi) &\propto p(s_T)\, p(\xi \mid s_T) &\qquad (2.1)\\
&= p(s_T) \int d\mathbf{s}_T\; p(\xi \mid \mathbf{s}_T)\, p(\mathbf{s}_T \mid s_T), &\qquad (2.2)
\end{aligned}$$

where, being slightly notationally sloppy, we are integrating over stimulus trajectories $\mathbf{s}_T$ up to, but not including, time T, but restricted to just those trajectories that end at $s_T$. Equation 2.2 lays bare the two parts of the definition of the problem. One is the likelihood $p(\xi \mid \mathbf{s}_T)$ of the spikes given the trajectory. This will be assumed to arise from a Poisson-gaussian spiking model. The other is the prior

$$p(s_T)\, p(\mathbf{s}_T \mid s_T) = p(\mathbf{s}) \qquad (2.3)$$

over the trajectories. This will be assumed to be a gaussian process.

2.1 Poisson-Gaussian Spiking Model. We first define the spiking model. Let $\phi_i(s)$ be the tuning function of neuron i and assume independent, inhomogeneous, and instantaneous Poisson neurons (Snippe & Koenderinck, 1992; Brown et al., 1998; Barbieri et al., 2004). Let j be an index running over all the spikes in the population, with i(j) reporting the index of the neuron that spiked at time $t_j$. Then, from the basic definition of an inhomogeneous Poisson process, the likelihood of a particular population spike train ξ given the stimulus trajectory $\mathbf{s}_T$ can be written as

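$$\begin{aligned}
p(\xi \mid \mathbf{s}_T) &= \prod_j \phi_{i(j)}(s_{t_j}) \exp\left(-\sum_i \int_0^T dt\, \phi_i(s_t)\right) &\qquad (2.4)\\
&\propto \prod_j \phi_{i(j)}(s_{t_j}), &\qquad (2.5)
\end{aligned}$$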

assuming that the trajectories are such that we can swap the order of the sum and the integral in the exp(·), that tuning functions are sufficiently dense that the sum spiking rate is constant independent of the location of the stimulus st , and that no two neurons ever fire together.
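A minimal Python sketch of equation 2.5 follows: under the stated assumptions the log likelihood of a population spike train, up to an additive constant, is the sum of log tuning-function values at the spike times. The gaussian form of the tuning curve, the function names, and the parameter values (`phi_max`, `sigma`) are illustrative assumptions, not values taken from this letter.

```python
import math


def log_tuning(s, s_pref, phi_max=50.0, sigma=0.1):
    """Log of a gaussian tuning curve, log(phi_max) - (s - s_pref)^2 / (2 sigma^2).

    Working in the log domain avoids underflow when s is far from s_pref.
    """
    return math.log(phi_max) - (s - s_pref) ** 2 / (2.0 * sigma ** 2)


def log_likelihood(spikes, trajectory, log_phi=log_tuning):
    """Log of equation 2.5, up to an additive constant.

    spikes     -- list of (t_j, s_pref_j): spike time and the preferred
                  stimulus of the neuron that emitted spike j
    trajectory -- callable mapping time t to the stimulus value s_t
    """
    return sum(log_phi(trajectory(t), s_pref) for t, s_pref in spikes)
```

A trajectory that passes through the preferred stimuli of the spiking neurons at their spike times scores higher than one that does not, which is the sense in which decoding resembles fitting a line through the dots of Figure 1.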


Finally, we assume squared-exponential (gaussian) tuning functions,

$$\phi_i(s_{t_j}) = \phi_{\max} \exp\left(-\frac{(s_{t_j} - s_i)^2}{2\sigma^2}\right),$$

where $\phi_{\max}$ is the maximal firing rate of a neuron and $s_i$ the ith neuron’s preferred stimulus. Combining this with our previous assumptions (see equation 2.5) and completing the square implies that

$$p(\xi \mid \mathbf{s}_T) \propto \phi_{\max} \exp\left(-\frac{(\mathbf{s}_\xi - \boldsymbol{\theta})^T (\mathbf{s}_\xi - \boldsymbol{\theta})}{2\sigma^2}\right), \qquad (2.6)$$

where the spikes from the entire population have been ordered in time; the jth component of both $\mathbf{s}_\xi$ and $\boldsymbol{\theta}$ corresponds to the jth spike and is, respectively, the stimulus at that spike’s time $t_j$ and the preferred stimulus $s_i$ of the neuron that produced it. Note that time is continuous here.

2.2 Gaussian Process Prior. The prior $p(\mathbf{s})$ defines a distribution over stimulus trajectories that are continuous in time. However, $p(\xi \mid \mathbf{s}_T)$ in equation 2.6 depends on only the times $t_j$ at which neurons in the population spike. Thus, in the integral in equation 2.2, we can formally marginalize or integrate out all the nonspiking times, making the key quantity to be defined by the prior to be $p(\mathbf{s}_\xi, s_T)$. For a gaussian process (GP), this quantity is a multivariate gaussian, defined by its (J + 1)-dimensional mean vector $\mathbf{m}$ and covariance matrix C, which can in general depend on the times $t_j$. We write the distribution as

$$p(\mathbf{s}_\xi, s_T) \sim \mathcal{N}(\mathbf{m}, C), \qquad C_{t_j t_{j'}} = c \exp\left(-\alpha\, \lvert t_j - t_{j'} \rvert^{\zeta}\right). \qquad (2.7)$$

The parameter ζ ≥ 0 dictates the smoothness and the correlation structure of the process. If ζ = 0, then the stimulus is assumed to be constant (we sometimes call this the static case). Setting ζ = 1 corresponds to assuming that the stimulus evolves as an Ornstein-Uhlenbeck (OU) or first-order autoregressive process. This is the generative model underlying Kalman filters (Twum-Danso & Brockett, 2001) and generates an autocorrelation with the Fourier spectrum ∼1/ω2 often observed experimentally (Atick, 1992; Dong & Atick, 1995; Wang, Liu, Sanchez-Vives, & McCormick, 2003). This can be generalized to nth-order autoregressive processes. Setting ζ = 2 leads to the opposite end of the spectrum, with smooth trajectories that are non-Markovian. The parameter α dictates the temporal extent of the correlations and c their overall size (c also parameterizes the scale of the overall process). Example trajectories drawn from these priors for ζ = {1, 2} are shown in Figure 2. For most of the letter, we will let m = 0. Assuming

Figure 2: Example trajectories drawn from the prior distribution in equation 2.7. (A) Examples for the smooth covariance matrix with ζ = 2. (B) The OU covariance matrix, ζ = 1. (Abscissa: time [ms].)

a GP prior with a particular covariance matrix is exactly equivalent to regularizing the autocorrelation of the trajectory.

2.3 Posterior. Making these assumptions, we can write down the posterior distribution $p(s_T \mid \xi)$ analytically by solving equation 2.2. It is a simple gaussian distribution with mean $\mu(T)$ and variance $\nu^2(T)$ given in terms of tuning function widths σ, the vector $\boldsymbol{\theta}$, and the covariance matrix C. All three terms in equation 2.2 are now defined. The conditional distribution $p(\mathbf{s}_\xi \mid s_T)$ is given in terms of the partitioned covariance matrix C,

$$p(\mathbf{s}_\xi \mid s_T) = \mathcal{N}_{\mathbf{s}_\xi}\left(C_{\xi T} C_{TT}^{-1} s_T,\; C_{\xi\xi} - C_{\xi T} C_{TT}^{-1} C_{T\xi}\right),$$

where $C_{\xi\xi}$ is the covariance matrix of the stimulus at all the spike times, $C_{T\xi}$ and $C_{\xi T}$ are vectors with the cross-covariances between the spike times and the observation time T, and $C_{TT}$ is the marginal (static) stimulus prior at the observation time (constant for the stationary processes considered here). The corresponding partitioning of the matrix C is

$$C = \begin{pmatrix} C_{\xi\xi} & C_{\xi T} \\ C_{T\xi} & C_{TT} \end{pmatrix}. \qquad (2.8)$$


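The remaining two terms in equation 2.2 are given by $p(s_T) = \mathcal{N}_{s_T}(0, C_{TT})$ and equation 2.6. As the integral in equation 2.2 is a convolution of two gaussians, the variances add, and the integral evaluates to

$$p(\xi \mid s_T) = \mathcal{N}_{\boldsymbol{\theta}}\left(C_{\xi T} C_{TT}^{-1} s_T,\; C_{\xi\xi} - C_{\xi T} C_{TT}^{-1} C_{T\xi} + I\sigma^2\right).$$

Finally, taking a product with $p(s_T)$, renormalizing, and applying the matrix inversion lemmas (see appendix A), we get

$$\begin{aligned}
\mu(T) &= \mathbf{k}(\xi, T) \cdot \boldsymbol{\theta}(T) &\qquad (2.9)\\
\nu^2(T) &= C_{TT} - \mathbf{k}(\xi, T) \cdot C_{\xi T} &\qquad (2.10)\\
\mathbf{k}(\xi, T) &= C_{T\xi}\left(C_{\xi\xi} + I\sigma^2\right)^{-1}. &\qquad (2.11)
\end{aligned}$$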
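As a rough illustration of equations 2.9 to 2.11, the following Python sketch computes the temporal kernel, the posterior mean, and the posterior variance for a handful of spikes under the covariance of equation 2.7 with m = 0. It is not the authors' MATLAB code: the function names (`cov`, `solve`, `posterior`) and the default parameter values are assumptions chosen for illustration, and a small Gaussian-elimination routine stands in for a linear algebra library so that the example stays self-contained.

```python
import math


def cov(t1, t2, c=1.0, alpha=0.01, zeta=2.0):
    """Covariance of equation 2.7: c * exp(-alpha * |t1 - t2|**zeta)."""
    return c * math.exp(-alpha * abs(t1 - t2) ** zeta)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def posterior(spike_times, theta, T, sigma=0.1, c=1.0, alpha=0.01, zeta=2.0):
    """Return (mu, nu2, k) of equations 2.9-2.11 at observation time T.

    spike_times -- times t_j of the J spikes (all before T)
    theta       -- preferred stimuli of the neurons that emitted them
    """
    # C_xixi + I sigma^2
    a = [[cov(ti, tj, c, alpha, zeta) + (sigma ** 2 if i == j else 0.0)
          for j, tj in enumerate(spike_times)]
         for i, ti in enumerate(spike_times)]
    c_xi_T = [cov(t, T, c, alpha, zeta) for t in spike_times]
    # The matrix is symmetric, so k = C_Txi (C_xixi + I sigma^2)^-1
    # is obtained by solving (C_xixi + I sigma^2) k = C_xiT.
    k = solve(a, c_xi_T)
    mu = sum(kj * th for kj, th in zip(k, theta))
    nu2 = cov(T, T, c, alpha, zeta) - sum(kj * cj for kj, cj in zip(k, c_xi_T))
    return mu, nu2, k
```

For a single spike at time t_1 the kernel reduces to c exp(−α|T − t_1|^ζ)/(c + σ²), so the weight of the spike decays with its age; note also that `posterior` never uses `theta` when computing `nu2`, so the variance depends on spike times alone.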

The mean µ(T) of the gaussian posterior is thus a weighted sum of the preferred stimuli of those neurons that emitted particular spikes. The weights are given by what we term the temporal kernel k(ξ, T). As we will see, the weight given to each spike will depend strongly on the time at which it occurred. A spike that occurred in the distant past will be given small weight. The posterior variance depends only on C and σ². Remember that C depends only on the times of spikes, not the identities of the neurons that fired them. The posterior variance ν², similar to a Kalman filter, thus depends only on when data are observed, not what data. This depends on the squared exponential nature of the tuning functions φ, and other tuning functions (e.g., with nonzero baselines) may not lead to this property. However, it will not affect the conclusions reached below. This posterior distribution p(s_T|ξ) is well known in the GP literature as the predictive distribution (MacKay, 2003).

2.4 Structure of the Code. The operations needed to evaluate the posterior p(s_T|ξ) give us insight into the structure of the code and will be analyzed in section 3 for various priors. If the posterior is a function of combinations of spikes, postsynaptic neurons have to have simultaneous access to all those spikes. This point will be critical in temporal codes, as the spikes to which access is required are spread out in time. Only if spikes are interpretable independently can they be forgotten once they have been used for inference. All information the spikes contribute to some future time T′ > T is then contained within p(s_T|ξ). If the posterior depends on combinations of spikes (as will be the case for ecological, smooth priors), information that can be extracted from a spike about times T′ > T is not entirely contained within p(s_T|ξ). As a result, past spikes have to be stored and the posterior recomputed using them, an operation that is nonlocal in time.
We will show that under ecological priors, the posterior depends on spike combinations and is thus complex. Decoding for the simple encoder (the spiking model) is thus hard. In section 4, we will illustrate the type of computations (“recoding”) a network has to perform to access all the information. This will be equivalent to finding a new, complex encoder in time for which decoding is simple.

3 Effect of the Prior

The effect of the prior manifests itself very clearly in the temporal kernels k(ξ, T) from equation 2.11 and the independence structure of the code. We show this by analyzing a representative set of priors in terms of both the behavior of the temporal kernels and the structure of the code, including priors that generate constant, varyingly rough, and entirely smooth trajectories. (MATLAB example code can be downloaded from http://www.gatsby.ucl.ac.uk/~qhuys/code.html.)

3.1 Constant Stimulus Prior ζ = 0. We first show that our treatment of the time-varying case is an exact generalization of the case in which the stimulus is fixed (does not change relative to the mean m), by rederiving classical results for static stimuli. Snippe and Koenderinck (1992) have shown that the posterior mean and variance (under a flat prior) are given by a weighted spike count,

$$\mu(T) = \frac{\sum_i n_i(T)\, s_i}{J(T)}, \qquad \nu^2(T) = \frac{\sigma^2}{J(T)}, \tag{3.1}$$

where $n_i(T) = \int_0^T dt\, \xi_t^i$ is the ith neuron’s spike count and $J(T) = \sum_i n_i(T)$ is the total population spike count at time T. If we let ζ = 0, the matrix $C_{\xi\xi} = c\,\mathbf{n}\mathbf{n}^{\mathsf T}$, where n is a J(T) × 1 vector of ones. Equations 2.9 to 2.11 can then be solved analytically:

$$\begin{aligned}
\left[(C_{\xi\xi} + I\sigma^2)^{-1}\right]_{ij} &= \frac{(\sigma^2 + c J(T))\,\delta_{ij} - c}{\sigma^2\,(\sigma^2 + c J(T))} \\
k(\xi, T) &= \frac{c}{\sigma^2 + c J(T)}\,\mathbf{n}^{\mathsf T} \\
\mu(T) &= \frac{c \sum_i n_i(T)\, s_i}{\sigma^2 + c J(T)} \\
\nu^2(T) &= \frac{c\,\sigma^2}{\sigma^2 + c J(T)},
\end{aligned}$$

which is exactly analogous to equation 3.1 with an informative prior. The temporal kernel k(ξ, T) does not decay but is flat, with a magnitude proportional to 1/J(T). The contribution of each neuron to the mean µ(T) is given by its spike count ni(T). Each spike is given the same weight,

[Figure 3 panel labels: columns “static stim” and “dynamic stim”; rows “dynamic inference” and “static inference”; panels A to D.]

Figure 3: Comparison of static and dynamic inference. Throughout, the posterior density p(sT |ξ ) is indicated by gray shading, the spikes are vertical (gray) lines with dots, and the true stimulus is the line at the top of each plot. (A) Static stimulus, constant temporal kernel. (B) Moving stimulus, constant temporal kernel. (C) Static stimulus, decaying temporal kernel. (D) Moving stimulus, decaying temporal kernel. A and D show that only a match between true stimulus statistics and prior allows the posterior to capture the stimulus well.

which is a sensible approach only if spikes are eternally informative about the stimulus. This is true only if the covariance matrix is flat, which itself implies that the only time-varying component of the stimulus is in the mean m and not the covariance C. If the stimulus is a varying function of time s(t), spikes at time t are informative only about the stimulus at times t′ close to t, and the influence of each spike on the posterior should fade away with time. This is illustrated in Figure 3. Figure 3A shows the present static case, where the stimulus does not move; over time, the posterior p(sT|ξ) sharpens up around the true value. However, if the stimulus does move, the posterior ends up at the wrong value (see Figure 3B). A decaying temporal kernel k(ξ, T) amounts to downweighting spikes observed in the more distant past. In the following, we analyze the behavior of p(sT|ξ) and the optimal temporal kernel k(ξ, T) for various stimulus autocorrelation functions. Figure 3C shows that a decaying kernel leads to a posterior that widens in between spikes. This is incorrect if the stimulus is static, but Figure 3D shows how such a decaying temporal kernel would, in contrast to Figure 3B, allow p(sT|ξ) to track the moving stimulus correctly.
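The closed-form expressions for the constant-stimulus prior can be checked numerically against the general gaussian process formulas. The sketch below is only an illustration: it assumes the posterior variance takes the standard GP predictive form C_TT − k C_ξT (equation 2.10 is not reproduced in this section), and the values of c, σ², and J are arbitrary choices, not values from the paper.

```python
import numpy as np

def gp_kernel_and_variance(C_xx, C_Tx, C_TT, sigma2):
    """Temporal kernel k = C_Tx (C_xx + sigma2 I)^-1 and posterior variance.

    The variance uses the standard GP predictive form C_TT - k . C_xT,
    assumed here to match equation 2.10.
    """
    A = C_xx + sigma2 * np.eye(C_xx.shape[0])
    k = np.linalg.solve(A, C_Tx)   # A is symmetric, so this equals C_Tx A^-1
    var = C_TT - k @ C_Tx
    return k, var

# Illustrative values (assumptions, not taken from the paper).
c, sigma2, J = 2.0, 0.5, 7
n = np.ones(J)

# zeta = 0: every pair of times has covariance c, so C_xx = c n n^T.
k, var = gp_kernel_and_variance(c * np.outer(n, n), c * n, c, sigma2)

# Closed-form results for the static case.
k_closed = c / (sigma2 + c * J) * n
var_closed = c * sigma2 / (sigma2 + c * J)

print(np.allclose(k, k_closed), np.isclose(var, var_closed))
```

Every entry of `k` is identical, which is the flat temporal kernel described in the text.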

[Figure 4 axes: space (−1 to 1) versus time (0.05 to 0.3 s).]

Figure 4: Posterior distribution p(sT |ξ ) for OU prior. Same representation as in Figure 1. The dashed line shows the actual stimulus trajectory used to generate the spikes, the dots are the spikes, the posterior distribution is in gray scale, and the solid line shows the posterior mean. Between spikes, the posterior mean decays exponentially back toward the mean m (here 0), and the variance approaches the static prior variance CT T .

3.2 Nonsmooth (Ornstein-Uhlenbeck) Prior ζ = 1. Setting ζ = 1 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a random walk with drift to zero (an OU process):

$$ds = -(1 - e^{-\alpha})\, s(t)\, dt + \sqrt{c\,(1 - e^{-2\alpha})}\; dt\; dN(t), \tag{3.2}$$

with gaussian noise N(t) ∼ N(0, 1) and parameters as in equation 2.7. The OU process is the underlying generative process assumed by standard Kalman filters. The simplicity of Kalman-filter-like formulations explains some of their wide applicability and success (Brown et al., 1998; Barbieri et al., 2004). However, as indicated visually by the example trajectories in Figure 2, the rough trajectories this prior favors are not a good model of smooth biological movements (see also section 5).

Figure 4 shows a sampled stimulus trajectory, sample spikes generated from it, and the posterior distribution p(sT|ξ). The mean of the posterior does a good job of tracking the true underlying stimulus trajectory and is never more than two standard deviations away from it. Between spikes, the mean simply moves back to zero (albeit rather slowly given the parameters associated with the figure shown). Figure 5A displays example temporal kernels k(ξ, T) for inference in this process. They are very close to exponentials (note the logarithmic ordinate). This makes intuitive sense, as an OU process is a first-order Markov process (it can be rewritten as a first-order difference equation). In fact, assuming the spikes arrive regularly (replacing each of the interspike intervals (ISI) by their average value $\Delta = \frac{1}{J}\sum_j (t_j - t_{j-1}) \propto \frac{1}{\phi_{\max}}$) allows us to write the

[Figure 5 axes, panels A and B: kernel size (weight, log scale) versus T − t_k; the traces in panel B are for increasing observation time T.]

Figure 5: OU temporal kernels for ζ = 1. (A) Example of temporal kernels. The top traces are for lower and the bottom for higher average firing rates. The gray traces show temporal kernels for Poisson spike trains. The components of the vector k(ξ, T) are plotted against the corresponding spike time. The dashed black traces show temporal kernels for regular spike arrivals (metronomic temporal kernels). The true (gray) temporal kernels are relatively tightly bunched around the metronomic temporal kernel. The firing rate affects the slope of the kernel but not its overall scale. (B) The effect of the time since the last spike on the temporal kernel is an overall multiplicative scaling. There is no effect on the slope.

jth component of k(ξ, T) as

$$k_j \approx d_1 \lambda_1^{\,j-1},$$

where d1 and λ1 are constants defined in appendix B. For such metronomic spiking, k(ξ, T) is thus simply a decaying exponential. Somewhat similar expressions can be obtained for the original case of Poisson-distributed ISIs (see appendix B). Figure 5A shows that the metronomic approximation provides a generally good fit, capturing especially the slope of the true temporal kernels, which depends mostly on the correlation length α and the maximal (or average) firing rate φmax. The remaining quality of the fit is influenced most strongly by the match between the average ISI Δ and the time since the last spike T − tJ (which takes its effect through CTξ in equations 2.8 and 2.9 to 2.11). This determines the overall scale of the temporal kernel. The factors influencing the slope of the temporal kernel and its height do not interact greatly; that is, T − tJ does not affect the slope (shape) of the temporal kernel, only its magnitude, as shown in Figure 5B (metronomic temporal kernels are used for clarity, but the argument applies equally to the exact kernel). Conversely, the average ISI Δ affects mostly the slope. Replacing the true

[Figure 6 axes, panels A and B: space (−1 to 1) versus time (0.05 to 0.3 s).]

Figure 6: Comparison between exact and metronomic kernels. Same representation as in Figure 4. (A) Exact posterior p(sT|ξ). (B) Approximate posterior derived by replacing all ISIs by their average Δ but keeping T − tJ. This corresponds to approximating the true kernels with the metronomic kernels in Figure 5. The approximation is very close.

temporal kernels by metronomic temporal kernels, that is, replacing all ISIs by their average Δ but keeping the time since the last spike T − tJ, does not greatly degrade p(sT|ξ) (cf. Figures 6A and 6B). The dependence in Figure 5B can be understood by writing out the integrand of equation 2.2 in detail for the OU prior. This factorizes over potentials involving duplets of spikes because, as we show in appendix B, C⁻¹ is tridiagonal, implying that the elements of C⁻¹ involve only two successive spikes:

$$\begin{aligned}
p(s_\xi, s_T) &\propto \exp\left(-\frac{1}{2}\begin{bmatrix} s_\xi \\ s_T \end{bmatrix}^{\mathsf T} C^{-1} \begin{bmatrix} s_\xi \\ s_T \end{bmatrix}\right) \\
&= \exp\left(-\frac{1}{2}\left[\sum_{j=1}^{J+1} s_{t_j}^2\, C^{-1}_{t_j t_j} + \sum_{j=1}^{J} s_{t_j}\, C^{-1}_{t_j, t_{j+1}}\, s_{t_{j+1}}\right]\right) \\
p(s_\xi, s_T) &= \psi(s_T) \prod_{j=1}^{J} \psi(s_{t_j}, s_{t_{j+1}}),
\end{aligned} \tag{3.3}$$

where tJ stands for the time of the last spike, tJ−1 the time of the penultimate one, and so on, and the observation time T = tJ+1. Note that the last equality

[Figure 7 axes: (A) space [cm] versus space [cm]; (B) log C(t_i − t_j) versus t_i − t_j [s].]

Figure 7: Natural trajectories are smooth. (A) Position of a rat freely exploring a square environment. (B) Covariance function of the position along the ordinate (gray, dashed line) and a quadratic approximation (black, solid line). Note the logarithmic ordinate. The smoothing applied to eliminate artifacts was of a timescale short enough not to interfere with the overall shape of the covariance function.

implies that the determinant also factors over spike pairs. This means that the integrations over each spike in the main equation 2.2 can be written in a recursive form akin to that used in message-passing algorithms (MacKay, 2003) and the exact Kalman filter.

3.3 Smooth Prior ζ = 2. Setting ζ = 2 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a non-Markov random walk. Trajectories with this autocovariance function are smooth (Figure 2A shows some sample trajectories generated from the prior) and infinitely differentiable. The smoothness makes it a more ecologically relevant prior for Bayesian decoding from movement-related trajectories than nonsmooth priors, since natural objects (and limbs) move along smooth trajectories rather than jumping. As an example, Figure 7A shows trajectories of a rat exploring a square environment (data kindly provided by Lever, Wills, Cacucci, Burgess, & O’Keefe, 2002). Not only are these natural trajectories smooth, but Figure 7B also shows that a squared exponential covariance function closely approximates the real covariance function (see note 1).

1. Only the center of the covariance function is shown here. Due to the small size of the environment, the rat runs back and forth along the entire available length, and there are oscillating flanks to the covariance function for delays larger than those shown.

[Figure 8 axes: space (−1 to 1) versus time (0.05 to 0.2 s).]

Figure 8: Posterior distribution p(sT |ξ ) for the smooth prior. Same representation as in Figure 4. The arrow highlights where the smooth prior uses spike combinations to constrain higher-order statistics of the process, such as velocity, acceleration, and jerk. While the smooth prior correctly predicts that the stimulus will continue away from the mean before returning back, the OU process can predict a decay only back to the mean (see Figure 9). The first spike on the left is the very first spike observed. As the spike history becomes more extensive, the posterior distribution is seen to sharpen up and follow the stimulus accurately.

Figure 8 is the equivalent of Figure 4 for the smooth case and shows the posterior p(sT |ξ ). The main dynamical difference between inference in this smooth case and inference in the OU case is indicated by the arrow in the Figure. While the OU process simply decays back to the mean (here, zero for simplicity), the dynamics of the smooth posterior mean are much richer. In the absence of spikes, the mean continues in its current direction for a while before reversing back. As can be seen, this gives a better fit to the underlying stimulus trajectory (the black dashed line) than would otherwise have been achieved. It arises directly from the fact that the correlations extend essentially beyond the last spike (and into the entire past). For comparison, Figure 9 shows the posterior when the wrong prior is used. The stimulus was generated from the smooth prior, but the OU prior was used to infer the posterior. The arrow indicates where the infelicity of the inaccurate posterior is most apparent, falling back to zero instead of predicting that the stimulus will continue to move farther away from zero. In terms of difference equations, the larger extent of correlations intuitively means that the higher-order derivatives of the process are also “constrained” by the covariance C. The simple exponential temporal kernels observed in the OU process cannot give rise to the reversals observed in the smooth process. Figure 10A shows the temporal kernels for the smooth process, which have a distinctively different flavor from the OU temporal kernels (shown in Figure 5),

[Figure 9 axes: space (−0.5 to 0.5) versus time (0.05 to 0.2 s).]

Figure 9: Posterior distribution p(sT |ξ ) for smooth stimulus but wrongly assuming an OU prior. The posterior is consistently wider than it should be (cf. Figure 8). The arrow points out where the prediction is qualitatively wrong: the OU prior allows for decay back only to zero, unlike the smooth prior. Note also that the beneficial effect of a larger spike history observed in Figure 8 is absent here.

[Figure 10 axes, panels A to C: kernel size (weight) versus T − t_j; the traces in panels B and C are for increasing observation time T.]

Figure 10: Temporal kernels for the smooth prior. (A) Exact (gray solid) and metronomic (black dashed) temporal kernels for the smooth prior with ζ = 2. The metronomic kernels again provide a close fit. (B) The metronomic temporal kernels change in a complex manner as the observation time T is moved away from the time of the last spike. Unlike in the OU case, this is not just a recursively implementable multiplication. (C) The same qualitative behavior arises for kernels derived from the empirical covariance function of the rat trajectories.

including oscillating terms multiplying the exponential decay. Most important, the oscillating terms allow the weight assigned to a spike to dip below zero; that is, a spike initially signifies proximity of the stimulus to the neuron’s preferred stimulus but later swaps over, signaling that the stimulus is not there anymore. This feature of the temporal kernels gives rise to the reversals seen in the posterior mean.


As in the OU case, the metronomic temporal kernel based on equal ISIs gives a good description of the temporal kernel, mostly for spikes in the more distant past. Replacing the true temporal kernels by metronomic temporal kernels (but keeping the exact time since the last spike T − tJ) again does not affect the posterior strongly. Nevertheless, the Kullback-Leibler divergence between the true posterior and the metronomic posterior is larger in the smooth than in the OU case (data not shown), indicating that the exact timing of spikes is more important in the smooth inference. Unlike in the OU case, there is no simple analytical expression for the metronomic temporal kernel (let alone the true temporal kernel). In particular, Figure 10B shows that changing the time since the last observed spike T − tJ does not simply scale the temporal kernel but also changes its shape (it produces a complicated phase shift of the oscillating component). Again, for clarity, the metronomic kernels are used as an illustration, but the argument also applies to the exact kernels.

Local structure has complex global consequences in the smooth case, with a single new spike requiring individual reweighting of all past spikes depending on their precise times. By comparison, for the OU process, the reweighting involves multiplication by a single factor. Figure 10C shows that this temporal kernel complexity is also a feature of the temporal kernel derived from the covariance function of the empirical rat trajectories in Figure 7. The fundamental difference between the OU and the smooth temporal kernels arises from the difference in the factorization properties of the prior. Because the inverse of the covariance matrix for ζ ∉ {0, 1}, and specifically for ζ = 2, is dense, it does not factorize over spike combinations and therefore does not allow a recursive form.
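The contrast between the tridiagonal OU precision matrix and the dense smooth one, and its effect on the temporal kernels, can be illustrated with a small numerical sketch. It assumes a prior covariance of the form c·exp(−α|t − t′|^ζ), which is our reading of equation 2.7 (not reproduced in this section); the function names, spike times, and parameter values are all illustrative choices rather than the paper's.

```python
import numpy as np

def prior_cov(t1, t2, c=1.0, alpha=1.0, zeta=1):
    """Assumed prior covariance c * exp(-alpha * |t - t'|**zeta)."""
    dt = np.abs(np.subtract.outer(t1, t2))
    return c * np.exp(-alpha * dt ** zeta)

def temporal_kernel(spike_times, T, sigma2, **prior):
    """k = C_Tx (C_xx + sigma2 I)^-1, one weight per spike."""
    C_xx = prior_cov(spike_times, spike_times, **prior)
    C_Tx = prior_cov(np.array([T]), spike_times, **prior)[0]
    A = C_xx + sigma2 * np.eye(len(spike_times))
    return np.linalg.solve(A, C_Tx)   # A symmetric, so equal to C_Tx A^-1

def max_off_tridiagonal(M):
    """Largest absolute entry more than one step away from the diagonal."""
    i, j = np.indices(M.shape)
    return np.abs(M[np.abs(i - j) > 1]).max()

# Illustrative spike times (s) and parameters; all are assumptions.
t = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
T, sigma2 = 0.30, 0.01
ou = dict(c=1.0, alpha=10.0, zeta=1)
smooth = dict(c=1.0, alpha=100.0, zeta=2)

P_ou = np.linalg.inv(prior_cov(t, t, **ou))          # tridiagonal
P_smooth = np.linalg.inv(prior_cov(t, t, **smooth))  # dense

k_ou = temporal_kernel(t, T, sigma2, **ou)
k_smooth = temporal_kernel(t, T, sigma2, **smooth)

print("OU off-tridiagonal max:    ", max_off_tridiagonal(P_ou))
print("smooth off-tridiagonal max:", max_off_tridiagonal(P_smooth))
print("OU kernel:    ", k_ou)
print("smooth kernel:", k_smooth)
```

In this toy setting the OU weights are all positive, whereas some of the smooth weights are negative, in line with the sign changes described for Figure 10.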
To see that a recurrence relation is possible only for the OU prior, which factorizes across duplets of spikes, write $s_J$ for the stimulus at the time $t_J$ of the last spike and $s_{\setminus J}$ for the stimulus at the times of all the spikes apart from the last (and likewise $\xi_J$ and $\xi_{\setminus J}$ for the spikes themselves). Expanding and integrating over $s_J$ and $s_{\setminus J}$ gives

$$\begin{aligned}
p(s_T|\xi) &= \int ds_J\, ds_{\setminus J}\; p(s_T, s_J, s_{\setminus J}|\xi) \\
&\propto \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J) \int ds_{\setminus J}\; p(s_{\setminus J}, \xi_{\setminus J}|s_T, s_J) \\
&= \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J) \int ds_{\setminus J}\; p(\xi_{\setminus J}|s_{\setminus J})\, p(s_{\setminus J}|s_T, s_J) \\
&= \int ds_J\; p(s_T, s_J)\, p(\xi_J|s_J)\, m_T(s_T, s_J, \xi_{\setminus J}).
\end{aligned} \tag{3.4}$$

The second line uses Bayes rule and the instantaneity of spiking, and the third again uses the fact that the spikes are instantaneous.

Were $m_T(s_T, s_J, \xi_{\setminus J})$ independent of sT, this would be exactly like a recursive update equation, with $p(s_T, s_J)$ being the transition probability from the last observed spike to the inference time T, $p(\xi_J|s_J)$ being the innovation due to the last observation (the likelihood of the last observed spike), and the message $m_T(s_J, \xi_{\setminus J})$ propagating the uncertainty from all the spikes other than the last to the last one. However, for general priors, $p(s_{\setminus J}|s_T, s_J)$, and therefore also $m_T(s_T, s_J, \xi_{\setminus J})$, do depend on sT, so all spikes have to be used to infer the posterior at each time T. To make mT independent of sT, the prior has to be Markov in individual spike timings, with

$$p(s_{\setminus J}|s_T, s_J) = p(s_{\setminus J}|s_J), \tag{3.5}$$

which makes

$$\begin{aligned}
m_T(s_T, s_J, \xi_{\setminus J}) &= \int ds_{\setminus J}\; p(\xi_{\setminus J}|s_{\setminus J})\, p(s_{\setminus J}|s_J) && (3.6) \\
&\equiv m_T(s_J, \xi_{\setminus J}), && (3.7)
\end{aligned}$$

which is indeed independent of sT. So for the OU process, the last message $m_T(s_J, \xi_{\setminus J})$, which summarizes all spikes other than the last, merely needs to be multiplied by the transition probability (see Figure 5B). However, the smooth temporal kernel changes shape in a complex way (corresponding to the dependence of the message $m_T(s_T, s_J, \xi_{\setminus J})$ in equation 3.4 on sT). Again, this means that all spikes have to be kept in memory for full inference. Note, finally, that this conclusion, and the fact that there is a recursive form for the OU process, do not depend on the particular spiking model assumed, verifying the assertion that the choice of squared exponential tuning functions, although mathematically helpful, does not pose limitations on our conclusions.

3.4 Intermediate (Autoregressive) Processes. There are cases intermediate to the smooth and the OU process that allow a partially recursive formulation. For instance, the metronomic OU process can be generalized to an autoregressive model of nth order by writing

$$s_t = \sum_{i=1}^{n} \beta_i\, s_{t-i} + \sqrt{c}\,\eta_t. \tag{3.8}$$

In this case, the inverse covariance matrix C⁻¹ is (2n + 1)-diagonal (see appendix C), with entries determined directly by the βi. This implies that the posterior factorizes over cliques ψ involving n + 1 spikes (see equation 3.3) and that inference will be Markov in groups of n spikes. Zhang et al. (1998) find that a two-step Bayesian decoder, which is an AR(2) process in our terms, significantly improves the decoding of hippocampal place cell data.
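The banded structure of C⁻¹ for an autoregressive process can be seen in a short numerical sketch. As a simplifying assumption (not the construction of appendix C), the process is started from zero initial conditions, so that the recursion of equation 3.8 reads L s = √c η with a banded lower-triangular matrix L; the helper names and the coefficients βi are illustrative.

```python
import numpy as np

def ar_precision(beta, c, M):
    """Precision matrix of M steps of s_t = sum_i beta_i s_{t-i} + sqrt(c) eta_t.

    Assumes zero initial conditions, so that L s = sqrt(c) eta with L unit
    lower triangular and -beta_i on its ith subdiagonal.
    """
    L = np.eye(M)
    for i, b in enumerate(beta, start=1):
        L -= b * np.eye(M, k=-i)
    L_inv = np.linalg.inv(L)
    cov = c * L_inv @ L_inv.T       # covariance of s = L^-1 sqrt(c) eta
    return np.linalg.inv(cov)       # equals L^T L / c up to rounding

def bandwidth(P, tol=1e-8):
    """Largest |i - j| at which P has an entry above tol."""
    i, j = np.nonzero(np.abs(P) > tol)
    return int(np.max(np.abs(i - j)))

# Illustrative coefficient vectors for orders n = 1, 2, 3 (assumptions).
for beta in ([0.6], [0.5, 0.2], [0.5, 0.2, 0.1]):
    P = ar_precision(beta, c=1.0, M=10)
    n = len(beta)
    print(f"order {n}: bandwidth {bandwidth(P)}, i.e. {2 * bandwidth(P) + 1} nonzero diagonals")
```

For each order n, the precision matrix has nonzero entries only within n steps of the diagonal, which is the (2n + 1)-diagonal structure described above.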

[Figure 11 axes: (A) space versus time (0 to 0.5 s), one row per order (1, 2, 3, 5, 10); (B) kernel size (weight) versus T − t_j [s].]

Figure 11: Autoregressive processes of increasing order. (A) Samples from processes of order n = {1, 2, 3, 5, 10} from top to bottom. The top process corresponds to an OU process. (B) Metronomic temporal kernels k(ξ , T) corresponding to the processes in A. The different lines (in varying shades of gray) correspond to increasing the observation time T as in Figures 5 and 10.

Figure 11A shows sample trajectories from such processes of increasing order. The coefficient vector β was set here such that the nth difference of the processes evolved as an OU process (see appendix C). The higher the order, the smoother the processes that can be generated, and the more oscillations are apparent in the temporal kernels. The OU and the smooth processes (see section 3.3) are at opposite ends of this spectrum, with tridiagonal and dense inverse matrices, respectively. The higher the order, the greater the complexity of the code. Indeed, the complexity grows exponentially (since groups of n spikes have to be considered and the number of such groups increases exponentially). While natural stimulus trajectories may not be indefinitely differentiable, the exponential increase in complexity implies that any smoothness has great potential to render the code complex.

4 Expert Spikes for Efficient Computation

Complex codes, following, for instance, from the assumption of natural smooth priors, render the information inherent in the spikes hard to extract. Efficient computation in time requires access to all encoded information and thus requires that the complex temporal structure of the code be taken into account. Here, we show that information present in the complex codes can be re-represented using codes that are straightforward to decode and use in key probabilistic computations. Specifically, we propose to decode each spike independently and multiply together the contributions from all spikes. This corresponds to treating each spike as an independent expert in a product of experts (PoE) setting (Hinton, 1999):

$$\hat p(s_T|\xi) = \frac{1}{Z(T)} \exp\left(\sum_i \sum_t g_i(s, t)\, \xi^i_{T-t}\right). \tag{4.1}$$

That is, each time a spike ξ^i occurs, it contributes the same projection kernel exp(gi(s, t)) to the posterior distribution p̂(sT|ξ). To put it another way, for each spike, we add the same stereotyped contribution to the log posterior and then renormalize. From the discussion in the preceding sections, it is immediately apparent that the PoE approximation is a better approximation for the OU case than for the smooth case. In the following, we first derive an approximate analytical expression for separable projection kernels gi(s, t) = fi(s)h(t) based on metronomic spikes and the OU prior. We then remove any restrictions and derive nonparametric, nonseparable gi(s, t) for both the OU and the smooth temporal kernel and show that these still perform better for the OU process than for the smooth process. Finally, we infer a new set of spikes ρξ such that decoding according to the PoE model produces a posterior distribution p̂(sT|ρξ) that matches the true posterior distribution p(sT|ξ) well for both OU and smooth priors.

4.1 Approximate Projection Kernels

4.1.1 Metronomic Projection Kernels. Section 3.2 showed that for the OU process, the weight accorded a spike is approximately a decreasing exponential function of the time elapsed since its occurrence and that replacing the true temporal kernels by the metronomic temporal kernels (without fixing the time since the last spike at the average ISI Δ) gives a qualitatively good approximation (see Figure 4). This suggests writing an approximate distribution with spatiotemporally separable projective kernels,

$$\begin{aligned}
\hat p(s_T|\xi) &\propto \prod_i \phi_i(s)^{\sum_t \xi^i_{T-t}\, e^{-\beta t}} && (4.2) \\
&= \prod_i \exp\left(\sum_t \log(\phi_i(s))\, e^{-\beta t}\, \xi^i_{T-t}\right) && (4.3)
\end{aligned}$$

[Figure 12 axes, panels A and B: space (−1 to 1) versus time (0.05 to 0.3 s).]

Figure 12: Separable projection kernel for the OU process: Comparison of true p(sT|ξ) (A) and p̂(sT|ξ) from equation 4.3 (B). The left arrows indicate where the variance of the approximate distribution diverges toward ∞ as T − tJ → ∞ rather than approaching CTT. The right arrows show the effect of this on the mean of the approximate posterior, which returns to the prior mean m = 0 more rapidly than the true posterior.

to use exactly the form of equation 4.1. We can thus also write

$$\hat p(s_T|A) \propto \prod_i \phi_i(s)^{A_i(T)}, \tag{4.4}$$

where Ai(T) can be seen as an equivalent “activity” of each neuron. The performance of this approximation is shown in Figure 12 for the OU process (see also Zemel et al., 2005). There are a few differences between Figures 4 and 12. Keeping the φi(s) as before, the variance of this approximation is $\hat\nu^2(T) = \sigma^2 / \sum_i A_i(T)$. As the last observed spike recedes into the past, this approaches infinity (left arrows in Figure 12), and the mean returns to zero (right arrows in Figure 12). This is different from the case of exact inference, which approaches the static prior with variance CTT. The mean $\hat\mu(T) = \sum_i s_i A_i(T) / \sum_j A_j(T)$ is always normalized and returns to zero more slowly than the variance increases. This introduces an inaccuracy, since the true OU temporal kernels (shown in Figure 5) are not normalized, $\sum_t k_t(\xi, T) < 1$, which arises because of the weight given to the spatial prior. For the smooth case, no simple approximation of the form of equation 4.3 is viable. This can be seen, for instance, from the fact that the smooth temporal kernels (see Figure 10) dip below zero (making it tricky to use them in products).
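The variance expression of the separable approximation can be made concrete with a short sketch. It assumes gaussian tuning functions φi(s) ∝ exp(−(s − si)²/(2σ²)) with no baseline, takes the activity to be Ai(T) = Σt ξ^i_{T−t} e^{−βt} as suggested by equation 4.2, and includes no separate prior term; the function name and all numerical values are illustrative.

```python
import numpy as np

def poe_gaussian_moments(spike_times, spike_prefs, T, beta, sigma2):
    """Mean and variance of prod_i phi_i(s)^{A_i(T)} for gaussian phi_i.

    Each past spike contributes a weight exp(-beta * (T - t_spike)); summing
    these weights per neuron gives A_i(T). Needs at least one spike before T.
    """
    spike_times = np.asarray(spike_times, dtype=float)
    spike_prefs = np.asarray(spike_prefs, dtype=float)
    past = spike_times <= T
    w = np.exp(-beta * (T - spike_times[past]))   # one decaying weight per spike
    total = w.sum()                               # equals sum_i A_i(T)
    mean = np.sum(w * spike_prefs[past]) / total
    var = sigma2 / total
    return mean, var

# Illustrative spikes: times in seconds, preferred stimuli of the firing neurons.
times = [0.05, 0.10, 0.12]
prefs = [-0.2, 0.1, 0.3]
beta, sigma2 = 20.0, 0.01

m1, v1 = poe_gaussian_moments(times, prefs, T=0.15, beta=beta, sigma2=sigma2)
m2, v2 = poe_gaussian_moments(times, prefs, T=0.25, beta=beta, sigma2=sigma2)

# Between spikes every weight shrinks by the same factor exp(-beta * dT),
# so the variance grows by exp(beta * dT) without bound.
print(v1, v2, v2 / v1)
```

With no new spikes, the variance in this sketch keeps growing exponentially with the time since the last spike instead of saturating at a prior variance, which is the divergence marked by the left arrows in Figure 12.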


Figure 13: Projection kernels inferred by equation 4.5 for OU (A) and smooth (B) priors. Stimulus trajectories and corresponding population spike trains ξ were generated until the update equations converged (approximately 2 · 10^4 spike trains). Both kernels have the shape of difference of gaussians for t = 0 and fall off exponentially with time. There is little nonseparable structure in both cases.

4.1.2 Inferring Full Spatiotemporal Projection Kernels gi(s, t). To apply expression 4.1 to the smooth case, we inferred gi(s, t) in a nonparametric way by discretizing time and space over which the distributions are defined and minimizing the Kullback-Leibler divergence between the discretized versions of p(sT|ξ) and p̂(sT|ξ) with respect to the projection kernels,

$$g_i(s, t) \leftarrow g_i(s, t) - \varepsilon\, \nabla_{g_i(s,t)} D_{KL}\big(p(s_T|\xi)\,\|\,\hat p(s_T|\xi)\big), \tag{4.5}$$

where $D_{KL}(p(s)\|q(s)) = \int ds\; p(s) \log \frac{p(s)}{q(s)}$. Given that our approximation 4.1 is related to restricted Boltzmann machines (RBM), it is not surprising that the gradient has a form akin to the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995):

$$\nabla_{g_i(s,t)} D_{KL}\big(p(s_T|\xi)\,\|\,\hat p(s_T|\xi)\big) = \sum_T \big[\hat p(s_T|\xi) - p(s_T|\xi)\big]\, \xi_i(T - t). \tag{4.6}$$
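The update of equations 4.5 and 4.6 can be sketched on a toy discretized problem. This is plain batch gradient descent under several assumptions of ours: the target posteriors are generated from a hidden set of kernels `g_true` rather than from exact GP inference, the gradient is averaged over observation times, and the grid sizes, firing probability, and step size are arbitrary.

```python
import numpy as np

def poe_posterior(g, xi, T):
    """Normalized exp(sum_i sum_t g[i, :, t] * xi[i, T - t]) on a stimulus grid."""
    K = g.shape[2]
    window = xi[:, T - np.arange(K)]             # window[i, t] = xi[i, T - t]
    log_p = np.einsum('ist,it->s', g, window)
    log_p -= log_p.max()                         # numerical stability
    p = np.exp(log_p)
    return p / p.sum(), window

def mean_kl(g, xi, times, targets):
    """Average over observation times of KL(target || PoE approximation)."""
    total = 0.0
    for T, p in zip(times, targets):
        q, _ = poe_posterior(g, xi, T)
        total += np.sum(p * (np.log(p) - np.log(q)))
    return total / len(times)

def fit_kernels(xi, times, targets, K, eps=0.3, n_iter=200):
    """Gradient descent on the projection kernels (equations 4.5 and 4.6)."""
    N, S = xi.shape[0], targets[0].shape[0]
    g = np.zeros((N, S, K))
    for _ in range(n_iter):
        grad = np.zeros_like(g)
        for T, p in zip(times, targets):
            q, window = poe_posterior(g, xi, T)
            # [p_hat - p] * xi_i(T - t), accumulated over observation times T
            grad += (q - p)[None, :, None] * window[:, None, :]
        g -= eps * grad / len(times)
    return g

# Toy problem: N neurons, S stimulus bins, K time lags, L time steps (assumptions).
rng = np.random.default_rng(0)
N, S, K, L = 3, 11, 4, 80
xi = (rng.random((N, L)) < 0.2).astype(float)
times = list(range(K - 1, L))
g_true = rng.normal(size=(N, S, K))              # hypothetical generating kernels
targets = [poe_posterior(g_true, xi, T)[0] for T in times]

kl_before = mean_kl(np.zeros((N, S, K)), xi, times, targets)
g_fit = fit_kernels(xi, times, targets, K)
kl_after = mean_kl(g_fit, xi, times, targets)
print(kl_before, kl_after)
```

Because the targets here lie within the PoE family, the divergence can in principle be driven toward zero; for posteriors from a smooth prior, a residual mismatch would remain.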

Figure 13 shows the projection kernels inferred for the OU prior (see Figure 13A) and the smooth prior (see Figure 13B). Both start, for t = 0, with a spatial profile similar to a difference of gaussians (DOG) and then fall off as exponentials of time. The kernels gi(s, t) shown here are for neurons i with si close to 0, the center of the gaussian prior over the trajectories. The projection kernels shown are for the same parameter settings as Figures 4 and 8, and the faster decay of the smooth projection kernels is due to the


Figure 14: Projection kernels are independent of contrast. The left-most panel shows an OU kernel for the same contrast (φmax) as in Figure 13; the contrast is doubled in the middle and quadrupled in the right panel. All these are off-center kernels with the same parameters as used in the other figures. Despite a slight slant toward the mean, the kernels are approximately separable.

shorter correlation timescale. For the OU process, the kernels for neurons i with si > 0 become slightly slanted toward −1 over time (and the converse holds for those with si < 0) to capture the decay to the mean (zero), which is only a function of the distance from the mean. This effect is noticeable for the OU but very small for the smooth kernels. Figure 14 shows off-center OU kernels inferred for different contrast (by varying φmax). As can be seen, the kernels are invariant to the contrast, and the slant effect is small. For the parameter range explored here, both projection kernels are approximately separable, indicating that the analytically derived motivation above may be close to optimal and that, in the PoE framework of equation 4.1, separable projection kernels may be the optimal choice even for the smooth prior.

However, simply using these projection kernels to interpret the original spikes ξ results in an approximation that is far from perfect, especially in the smooth case. Figure 15 compares the true posterior distribution and that given by the approximation with the above projection kernels. The cost of independent decoding is quantified in Figure 15A using

$$\left\langle \frac{1}{T} \sum_t \frac{D\big(p(s_T|\xi)\,\|\,\hat p(s_T|\xi)\big)}{H\big(p(s_T|\xi)\big)} \right\rangle_{p(s,\xi)}, \tag{4.7}$$

where H( p) is the entropy of p and the average is over many stimulus trajectories s ∼ N (0, C) and spikes ξ ∼ p(ξ |s). This quantity can also be interpreted as a percentage information loss. It is larger for the smooth than for the OU process, showing that the OU process suffers much less from the approximation than the smooth prior. Visually, there are no gross differences between p(sT |ξ ) and pˆ (sT |ξ ) for the OU prior (see Figures 15B and 15D). However, for the smooth prior, the arrows in

[Figure 15 layout: (A) relative information loss (0 to 0.12) for OU and smooth priors; remaining panels in columns “OU” and “Smooth” and rows “Exact” and “Approx”, each plotting space (−1 to 1) versus time [s].]

Figure 15: Comparison of true distribution p(sT|ξ) and approximate distribution p̂(sT|ξ) given by equation 4.1 with projection kernels inferred by equation 4.5 and shown in Figure 13. Organization is the same as in previous figures. (A) $\big\langle \frac{1}{T}\sum_t D(p(s_T|\xi)\|\hat p(s_T|\xi)) / H(p(s_T|\xi)) \big\rangle_{p(\xi,s)}$ ± 1 standard deviation for both priors. (B, C) p(sT|ξ). (D, E) The corresponding p̂(sT|ξ) for the same spikes. (B, D) A stimulus generated from the OU prior. (C, E) The smooth prior. p̂(sT|ξ) is a good approximation for the OU prior but fails for the smooth prior. The arrows indicate where the approximation fails fundamentally in a similar way to that shown in Figure 9.

Figures 15C and 15E indicate areas where a large mismatch is introduced by the independent treatment of the spikes, which discards all information contained in spike combinations. This mismatch is entirely to be expected.
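The relative information loss of equation 4.7 is straightforward to compute for discretized distributions. The sketch below evaluates it for a single spike train only (the paper additionally averages over stimulus trajectories and spikes), assumes strictly positive distributions stored as rows of an array, and uses random distributions purely as placeholders.

```python
import numpy as np

def relative_information_loss(p, p_hat):
    """Time average of KL(p || p_hat) / H(p).

    p and p_hat have shape (time steps, stimulus bins); each row is a
    strictly positive distribution that sums to one.
    """
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    kl = np.sum(p * (np.log(p) - np.log(p_hat)), axis=1)
    entropy = -np.sum(p * np.log(p), axis=1)
    return float(np.mean(kl / entropy))

# Placeholder distributions (not posteriors from the model).
rng = np.random.default_rng(1)
p_true = rng.dirichlet(np.ones(20), size=50)
p_approx = rng.dirichlet(np.ones(20), size=50)

loss_same = relative_information_loss(p_true, p_true)
loss_diff = relative_information_loss(p_true, p_approx)
print(loss_same, loss_diff)
```

A value of zero means that the approximation loses nothing relative to the true posterior; larger values correspond to the larger losses reported for the smooth prior.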

4.2 Recoding: Finding Expert Spikes. The previous section has shown that an independent interpretation of spikes is more costly with the smooth than with the OU prior. In this section, we show that it is possible to find a new set of “expert” spikes ρ, such that each spike can be interpreted independently and the posterior distribution is matched closely for both the OU and the smooth prior. This recoding thus takes spikes ξ that are redundant in a decoding sense and produces a new set of spikes ρ that can be easily used for efficient neural computation because the decoding redundancy has been eliminated. We first infer real-valued activities aξ and then proceed to infer actual spikes ρ. We use neurally implausible methods to infer the new set of spikes ρ. In a companion paper we will explore the capability of neurally plausible spiking networks to do this recoding and to use the resulting simple code for probabilistic computations in time (see also Zemel, Huys, Natarajan, and Dayan, 2004).


Figure 16: Inferring activities A for the OU prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL} \rangle_T \sim 10^{-5}$ and the entropy $\langle H \rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times $\xi$. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, but zeroing this affects the match between $p(s_T|\xi)$ and $\hat{p}(s_T|A)$ only marginally.

4.2.1 Activities. Given a set of projection kernels $g_i(s, t)$ from the previous section, we can go back and infer the optimal activities $A \geq 0$ of neurons by writing

$$\hat{p}(s_T|A) \propto \exp\left( \sum_{i,t} A_i(T - t)\, g_i(s, t) \right). \qquad (4.8)$$

If we let $A_i(T - t) = \exp(B_i(T - t))$ and minimize with respect to B the Kullback-Leibler divergence from the true posterior, we simultaneously enforce $A \geq 0$:

$$B_i(t) \leftarrow B_i(t) - \varepsilon \nabla_{B_i(t)} D_{KL}\left( p(s_T|\xi) \,\|\, \hat{p}(s_T|A) \right). \qquad (4.9)$$
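The update in equation 4.9 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the figures: the stimulus is discretized on a grid, the neuron and time indices (i, t) are flattened into a single kernel index k, and the names `model_posterior`, `kl`, and `fit_activities`, the step size `lr`, and the initial value `b0` are hypothetical. For the exponential form of equation 4.8, the gradient of the divergence with respect to an activity is the difference between the expectations of the corresponding kernel under the approximate and the true distribution, and the chain rule through A = exp(B) multiplies this by A.

```python
import numpy as np

def model_posterior(A, g):
    """p_hat(s) proportional to exp(sum_k A[k] * g[k, s]), as in equation 4.8."""
    logits = A @ g
    logits = logits - logits.max()      # stabilize the exponential
    w = np.exp(logits)
    return w / w.sum()

def kl(p, q, eps=1e-12):
    """D_KL(p || q) in nats for discrete distributions."""
    p_c = np.clip(p, eps, None)
    q_c = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p_c) - np.log(q_c))))

def fit_activities(p_true, g, n_steps=5000, lr=0.05, b0=-2.0):
    """Gradient descent on B with A = exp(B), following equation 4.9.

    p_true : (S,) target distribution on the stimulus grid
    g      : (K, S) kernels; row k stands for one flattened (i, t) pair
    """
    B = np.full(g.shape[0], b0, dtype=float)
    for _ in range(n_steps):
        A = np.exp(B)                        # positivity holds by construction
        p_hat = model_posterior(A, g)
        grad_A = g @ (p_hat - p_true)        # E_phat[g_k] - E_p[g_k]
        B -= lr * A * grad_A                 # chain rule: dA/dB = A
    return np.exp(B)
```

A fixed step size is used here for brevity; with kernels of larger magnitude a smaller `lr` or a line search would be needed for stable descent.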

The results of this procedure are shown for both the OU process (see Figure 16) and for the smooth process (see Figure 17). Figures 16A and 17A show the true spikes $\xi$ and the corresponding distribution $p(s_T|\xi)$. Figures 16B and 17B show the approximate distributions $\hat{p}(s_T|A)$ defined in equation 4.8 for the optimal activities A inferred with equation 4.9. The continuous nature of the activation functions means that they can contain as much information as the distribution itself, and indeed we find empirically that arbitrarily close matches are possible (exemplified by the two figures; in both cases $\langle D_{KL} \rangle_T \sim 10^{-5}$). Figures 16C and 17C finally show


Figure 17: Inferring activities A for the smooth prior. (A) True posterior $p(s_T|\xi)$. (B) Approximate posterior $\hat{p}(s_T|A)$, which matches arbitrarily well (for this example, $\langle D_{KL} \rangle_T \sim 10^{-5}$ and the entropy $\langle H \rangle_T \sim 2$, making the information loss $I \sim 10^{-5}$). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times $\xi$. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, which allows the approximation $\hat{p}(s_T|A)$ to “bend” between spikes. Unlike in the OU