Contents
1 Chapter 1: What is cognitive modelling? .................................................................................................................. 5
1.1 The use of models ............................................................................................................................................. 5
1.1.1 Advantages of models ............................................................................................................................... 5
1.1.2 Levels of modelling .................................................................................................................................. 5
Spatial scale .......................................................................................................................................................... 5
Temporal scale ...................................................................................................................................................... 6
Critique on the levels ............................................................................................................................................ 6
1.2 Striving for a goal ............................................................................................................................................. 6
1.2.1 Optimization principle .............................................................................................................................. 6
1.2.2 Minimization .......................................................................................................................... 7
Gradient descent.................................................................................................................................................... 7
Advantages of gradient descent ............................................................................................................................ 7
2 Chapter 2: Decision making ...................................................................................................................................... 7
2.1 Minimization in activation space ...................................................................................................................... 7
2.1.1 Linearity principle ..................................................................................................................................... 8
Biologically plausible? ............................................................................................................................. 8
2.1.2 Minimal energy model .............................................................................................................................. 8
Update rule ............................................................................................................................................................ 9
Typical features of the model................................................................................................................................ 9
2.1.3 Cooperative and competitive interactions in visual word recognition .................................................... 10
2.2 Hopfield model ............................................................................................................................................... 10
2.2.1 Models activation dynamics.................................................................................................................... 10
Update rule .......................................................................................................................................................... 10
2.2.2 Hard and soft constraint .......................................................................................................................... 11
2.2.3 Attractors................................................................................................................................................. 11
2.2.4 Human memory and the Hopfield model ................................................................................................ 11
2.3 Diffusion model .............................................................................................................................................. 12
3 Chapter 3: Hebbian learning ................................................................................................................................... 13
3.1 Energy function ............................................................................................................................................... 13
3.1.1 Learning rule ........................................................................................................................................... 13
Problem with the weights.................................................................................................................................... 14
Optimized ............................................................................................................................................................ 14
Local ................................................................................................................................................................... 14
3.1.2 Biology of Hebbian learning ................................................................................................................... 15
3.2 Hebbian learning in matrix notation ............................................................................................................... 15
3.2.1 Hebbian learning rule .............................................................................................................................. 15
After learning if we present an input vector/pattern x ........................................................................................ 16
Orthogonal vectors (also disadvantage) .............................................................................................................. 16
3.3 Hebbian learning in the Hopfield model ......................................................................................................... 16
3.3.1 Example .................................................................................................................................. 16
3.4 Hebbian learning models in human memory .................................................................................................. 16
4 Chapter 4: The delta rule ......................................................................................................................................... 17
4.1 Previous chapters ............................................................................................................................................ 17
4.2 The delta rule in two-layer networks .............................................................................................................. 17
4.2.1 Linearity principle ................................................................................................................................... 17
4.2.2 Optimization: MNE................................................................................................................................. 18
Update rule: gradient descent .............................................................................................................................. 19
4.3 The geometry of the delta rule ........................................................................................................................ 19
4.3.1 Linear independence & Linear separability ............................................................................................ 19
Why? ................................................................................................................................................................... 19
4.3.2 Soft threshold > hard threshold ............................................................................................................... 20
4.4 Prediction error example ................................................................................................................................. 20
4.4.1 Learning: blocking .................................................................................................................................. 20
4.4.2 Perception................................................................................................................................................ 20
4.4.3 Psycholinguistics..................................................................................................................................... 21
4.5 The rise, fall, and return of the delta rule ........................................................................................................ 22
4.5.1 The rise.................................................................................................................................................... 22
Logical rules........................................................................................................................................................ 22
4.5.2 The fall .................................................................................................................................................... 22
4.5.3 The return ................................................................................................................................................ 22
4.5.4 Exercise ................................................................................................................................................... 23
5 Chapter 5: Multilayer networks .............................................................................................................................. 23
5.1 Recap............................................................................................................................................................... 23
5.2 Geometric intuition of the multilayer model ................................................................................................... 23
5.2.1 Notation................................................................................................................................................... 23
Convex sets ......................................................................................................................................................... 24
5.2.2 4 layers .................................................................................................................................................... 24
5.3 Generalizing the delta rule: backpropagation ................................................................................................. 25
5.3.1 Working this out algebraically ................................................................................................................ 25
Credit assignment................................................................................................................................................ 26
4 layers ................................................................................................................................................................ 26
Where is it used? ................................................................................................................................................. 27
5.3.2 Some drawbacks of backpropagation...................................................................................................... 27
Problem 1: No guarantee you end up at global min ............................................................................................ 27
Problem 2: Catastrophic interference .................................................................................................................. 27
5.4 Varieties of backpropagation .......................................................................................................................... 27
5.4.1 Convolutional networks ............................................................................................................................ 27
5.4.2 Recurrent networks. ................................................................................................................................ 27
5.4.3 Deep music.............................................................................................................................................. 28
6 Chapter 10: Unsupervised learning .................................................................................................................. 28
6.1 Unsupervised Hebbian learning ...................................................................................................................... 28
6.2 Competitive learning ....................................................................................................................................... 29
6.2.1 The winner (hard).................................................................................................................................... 29
6.3 Kohonen learning (soft) .................................................................................................................................. 30
6.3.1 Examples ................................................................................................................................................. 30
7 Chapter 6: Estimating parameters in computational models ................................................................................... 30
7.1 How can a model tell us about the mind? ....................................................................................................... 30
7.1.1 Strategy 1: explore the parameter space.................................................................................................. 31
7.1.2 Strategy 2: Parameter estimation/estimating parameter based on data ................................................... 32
7.2 How can parameters be estimated? ................................................................................................................. 32
7.2.1 Approach 1: Error minimization (or ordinary least square) .................................................................... 32
7.2.2 Approach 2: Maximum Likelihood......................................................................................................... 32
7.2.3 Optimization (LL) ................................................................................................................................... 33
7.2.4 How can parameters be estimated? .......................................................................................................... 33
A. Graphically ..................................................................................................................................................... 33
B. Analytically .................................................................................................................................................... 33
C. Computationally ............................................................................................................................................. 34
A. Grid search .............................................................................................................................................. 34
B. Evolutionary computation ...................................................................................................................... 34
7.3 Issues with parameter estimation ................................................................................................................... 34
7.3.1 Parameter identifiability .......................................................................................................................... 34
7.3.2 Applications ............................................................................................................................................ 35
1. The log-linear model ....................................................................................................................................... 35
2. Diffusion model .............................................................................................................................................. 35
3. Decision making ............................................................................................................................................. 36
8 Chapter 7: Testing and comparing computational models ...................................................................................... 37
8.1 Solutions ......................................................................................................................................................... 37
8.1.1 Akaike Information Criterion (AIC) ....................................................................................................... 37
8.1.2 Other solution: Bayesian Information Criterion (BIC) ........................................................................... 37
8.1.3 Other solution: Cross-validation ............................................................................................................. 37
9 Chapter 9: Reinforcement learning ......................................................................................................................... 38
9.1 Actively engaging with the environment ........................................................................................................ 38
9.1.1 What is reward and why is it important? ................................................................................................ 38
9.1.2 Actions .................................................................................................................................................... 38
9.1.3 States ....................................................................................................................................................... 38
Policy .................................................................................................................................................................. 39
The Markov decision process (or MDP) ............................................................................................................. 39
9.2 Goal ................................................................................................................................................................. 39
9.2.1 Value of a state (V) ................................................................................................................................. 39
Temporal discounting ......................................................................................................................... 40
9.2.2 Value of an action (Q) ............................................................................................................................. 40
9.2.3 Why do we estimate this value (Q or V) ................................................................................................. 40
9.3 Different ways of action value estimation ..................................................................................................... 41
9.3.1 Approach 1: Model-based (e.g., run through a decision tree) ................................................................. 41
9.3.2 Approach 2: Model-free .......................................................................................................................... 41
Rescorla-Wagner value estimation ..................................................................................................................... 41
9.3.3 Approach 3: Temporal differences.......................................................................................................... 42
9.4 Policy updating ............................................................................................................................................... 42
9.5 Exploration-Exploitation................................................................................................................................. 42
9.5.1 Applications ............................................................................................................................................ 42
Value computation in the brain ........................................................................................................................... 42
9.5.2 Temporal difference ................................................................................................................................ 43
Temporal difference in the human brain ............................................................................................................. 43
9.5.3 Exploration-Exploitation in the Brain ..................................................................................................... 43
Computational psychiatry ................................................................................................................................... 43
Example .............................................................................................................................................................. 44
1 Chapter 1: What is cognitive modelling?
1.1 The use of models
• Modelling = a tool to help construct better theories of cognition and behavior
• Conceptual > empirical
Modeling: Making a simple, formal representation of the theory (here, about cognitive processes)
(un)Intuitive theory → then made more precise (into a model) → then empirical test
• Standard practice: first you have an intuitive theory; based on this theory you run an empirical test
• Example: Dual code theory
− Hypothesis: highly imageable words are easier to remember than less imageable words.
→ apple: you can have a picture of an apple in your mind; you can't do this for 'democracy'
A highly imageable word generates 2 codes, namely a mental picture of an apple and a verbal code of the word
'apple' → 2 kinds of codes → such words are easier to remember
− But what is a picture code and what is a verbal code exactly? How are they represented in the brain? Which
words are highly/less imageable, and how do they interact? And what extra predictions can we make from this
theory? This theory is basically a reformulation of the empirical finding that highly imageable words are
easier to remember
− Just having this theory that is so closely tied to the experimental data doesn’t give us traction for making novel
predictions.
• Example: Gestalt laws. Not much room for novel predictions that do not come straight from the theory
→ the problem is that this theory doesn't really go beyond the data themselves, and therefore doesn't leave a lot
of room for predictions
1.1.1 Advantages of models
1. Making novel predictions that don’t obviously follow from theory
→ sometimes: only by making formalizations can we come up with predictions (slide 13)
− We gain a lot from this theory just by making it more precise
− Only by making this formalization could we come up with this prediction
2. Allow integration/understanding of existing data in well-organized conceptual framework
→ embodying very simple assumptions into models can lead to very surprising consequences that we would not have
come up with by just inspecting the assumptions. And we would not have come up with these predictions if we had
run immediately to the experimental lab rather than first thinking about it a little and formalizing it.
1.1.2 Levels of modelling
Spatial scale
1. Social level: between individuals or between entire
groups of individuals
E.g., the Lotka-Volterra model: it considers interactions
between predators and prey. Both populations will
oscillate across time, alternating between periods with
more and fewer animals. The oscillations of the two
populations are coupled: prey ↑, predators also ↑.
An example of how simple models with simple assumptions
lead to non-intuitive predictions (see the sketch after this list)
2. Within a single person (e.g., motor control)
3. Interaction between brain areas. E.g., a model of
cognitive control in the Stroop task. Could be that:
one module processes words,
one module processes colors,
one module processes verbal/manual actions
→ anatomical connectivity structure between the modules for response and word > anatomical connectivity
structure between the modules for response and color
because word naming is more common and more practiced
E.g., we take a brain and simplify it (retina → optic nerve → thalamus → this projects into the visual cortex, which
projects into another visual area, …), and in this way we can simplify it even further into a
connectionist model: this is the type of model that we will be using a lot in this course
4. Interaction between single neurons (slide 22)
Neurons communicate via ion channels (sodium, potassium)
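A minimal Python sketch of the Lotka-Volterra idea from level 1 above (a toy Euler simulation; the rate parameters and starting populations are made-up illustrative values, not numbers from the course):

# Lotka-Volterra predator-prey model, integrated with simple Euler steps.
# dx/dt = a*x - b*x*y   (prey grow, and are eaten when meeting predators)
# dy/dt = c*x*y - d*y   (predators grow by eating prey, and die off otherwise)

a, b, c, d = 1.0, 0.1, 0.075, 1.5   # illustrative rate parameters
x, y = 10.0, 5.0                    # initial prey and predator populations
dt = 0.01                           # Euler step size

history = []
for step in range(20000):
    dx = (a * x - b * x * y) * dt
    dy = (c * x * y - d * y) * dt
    x, y = x + dx, y + dy
    history.append((x, y))

# Printing a few snapshots shows both populations rising and falling in
# coupled cycles: the non-intuitive oscillation described above.
for i in range(0, 20000, 2000):
    print(i, round(history[i][0], 2), round(history[i][1], 2))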
Temporal scale
• Phenomena that go very fast, like neural phenomena. Neurons spike at certain frequencies, so if you want to
understand this very fast timescale of processing, you model spike firing.
• We can also be interested in phenomena at a slightly slower timescale, like learning: learning does not
occur at a timescale of milliseconds, but at a timescale of minutes, days, or years.
• We can even be interested in an even slower timescale, of years or hundreds of years, like the development of a
language
1. Fastest: cognitive processing and human behavior.
E.g., the Stroop task
2. Knowledge acquisition. New info or skills are acquired
3. Cultural change
4. Slowest: genetic changes
Critique on the levels
• What matters for level n may or may not be relevant for level n − 1 or for level n + 1!
• A model may not be biologically plausible, but whether a model is biologically plausible has a different meaning and
relevance at each of these levels
• It is precisely because some biological details are ignored that it is at all possible to formulate useful models
→ we need a level of useful abstraction that allows models to integrate and understand data at that specific
level and to derive novel empirical tests.
→ each level has its own findings and assumptions that are relevant for it.
• Sometimes interested in several levels at the same time
− Must be cognitively plausible
− Account for behavioral data (RT, ACC in behavioral experiments)
− Neuronally plausible: consistent with systems-level neuroscience
1.2 Striving for a goal
• The function of cognition is to optimize an agent’s interaction with the world
• Reason why organisms have a brain? To adapt to the world around it
• People act because they think it is useful for them → they are motivated to act
1.2.1 Optimization principle
• Finding the optimal point (OP) of a function
• “Attempting to reach a goal” can be captured with the standard mathematical principle of optimization
• Behavioral and neural dynamics are consequences of attempting to reach that goal
• Different ways for this optimization
1. Graphically: draw a graph and search for min/max point
2. Analytically: derivative
We can calculate the derivative of this function. We want to find the point where the
derivative is 0, because at the minimal point the derivative is 0 (the tangent is a
horizontal line)
→ not easy for complicated functions
3. Computationally: use a model, optimizing the function by taking tiny steps along the
function (a worked example follows below)
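→ For example (an illustration, not from the slides): to minimize f(x) = (x − 2)², the analytical route sets the derivative
2(x − 2) equal to 0, which gives x = 2; the computational route takes small downhill steps toward that same point, as
worked out in the gradient descent subsection below.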
• Is this optimization principle true?
1. People are a product of years of adaptation to the world
2. Extensive mathematical and statistical apparatus has been developed for optimizing functions
1.2.2 Minimization
• Optimization principle: to maximize/minimize a function of some variable
Gradient descent
• Only continuous activation values
• Iterative optimization via gradient descent
− Iterative means small steps
− We use gradient descent for this
− We are going to explain this with a graph
• We take a random start position on the graph
• Say that we start on the right side of the minimum. The OP is
on the left
• So we have to take steps to the left, meaning the delta must be negative: then Xn − Xn−1 is
negative (Xn is smaller than Xn−1)
• The change in x between 2 steps is delta x:
Δx = −α · df(x)/dx
• If delta x is positive, then Xn − Xn−1 is larger than 0, so Xn is larger than Xn−1; e.g., if we start from −0.5 and go
to 0.0 → we jump to the right
• We do this until delta x becomes zero, meaning the derivative is also 0
• You also have gradient ascent: Δx = +α · df(x)/dx
• Explanation of the formula
− When the derivative df(x)/dx is approximately 0, then the steps Δx will also be approximately 0
− Parameter α: determines the step size, with − or + determining minimization or maximization
− α close to 0: small values lead to small steps → slow processing / slow time scale
− Larger α: bigger steps in parameter space → faster processing
• Example:
− 4 steps are displayed, with α = 0.2
− The derivative and step size at the end must be close to 0 (the algorithm has
almost converged)
− We started at an x bigger than the optimal x → we have to
move to the left, toward a smaller x value, which is the Optimal Point.
This way the steps are negative
− But the starting point is randomly chosen, so if you start at a smaller x
than the OP, the step sizes will be positive (with each step you will go
to a bigger x value until the OP is reached)
− BUT: the closer the algorithm comes to the OP, the smaller the step sizes become
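A minimal Python sketch of this procedure (a hypothetical example function f(x) = (x − 2)² with derivative 2(x − 2); α and the starting point are arbitrary illustrative choices):

# Gradient descent on a simple example function f(x) = (x - 2)^2.
# Its derivative is df/dx = 2 * (x - 2); the minimum lies at x = 2.

def f(x):
    return (x - 2) ** 2

def df(x):
    return 2 * (x - 2)

alpha = 0.2   # step-size parameter
x = 4.0       # arbitrary starting point, to the right of the optimum

for step in range(20):
    delta_x = -alpha * df(x)   # gradient descent: step against the derivative
    x = x + delta_x
    print(step, round(x, 4), round(f(x), 4))

# As described above, the steps shrink as x approaches the optimal point:
# delta_x (and the derivative) go to 0 near x = 2.

Starting to the left of the optimum (e.g., x = 0.0) makes the steps positive instead, exactly as described in the example above.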
Advantages of gradient descent
• This algorithm works for any differentiable function
• One can formalize the cognitive process of attempting to reach a goal via optimizing a function. The gradient
descent is an efficient (but not the only) method to do so
2 Chapter 2: Decision making
• Decision making: popular theme in modern cognitive neuroscience
• Example: dividing objects or animals into distinct classes. Why: because we are interested in human categorization and
child development (a target of interest for cognitive modeling)
2.1 Minimization in activation space
• The model says: when people classify cats vs dogs, they look at these 3 features.
Based on these features they decide whether it is a cat or a dog.
• Take cats vs dogs: a set of feature detectors. Each detector responds when some feature in the environment is
present. These could detect low-level perceptual properties or high-level ones
1. Detector 1: picture on FB
2. Detector 2: 4 legs
3. Detector 3: bites visitors
→ detectors off: 0 or on: 1 (values between 0 and 1 are also possible, but here they are binary)
• Weight these 3 units by the connections from the input units to the cat and dog units
• Activation of the cat unit and activation of the dog unit
The model has a cat detector: it receives a linear combination of the first 3 detectors as input and has activation level ycat
→ this activation can be read as indicating the probability of there being a cat in the environment
• The model also has a dog detector: it receives a linear combination of the first 3 detectors as input and has activation level ydog
2.1.1 Linearity principle
• How does one go from the feature detectors to the pet detectors?
− Key assumption: the activation levels of the feature detectors are combined in a linear way. Then they are sent
to the next level of processing. In this example: pet detector
− Concretely:
incat = wcat,1 · x1 + wcat,2 · x2 + wcat,3 · x3 → on/off: x = 0/1; w: strength of the relation with the cat/dog
detector → the weights have to be set
to the most plausible values
• Example: if there is a very high likelihood that pictures of cats end up on FB, then the connection between the FB unit
and the cat unit is strong.
• This linear combination of input units and weights determines how much activation streams into the cat unit
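A tiny Python sketch of this linearity principle (the feature values and weights below are made-up illustrations, not values from the course):

import numpy as np

# Binary feature detectors: [picture on FB, has 4 legs, bites visitors]
x = np.array([1, 1, 0])

# Hypothetical weights from each feature to the cat and dog units
w_cat = np.array([0.8, 0.5, -0.3])
w_dog = np.array([0.1, 0.5, 0.7])

# Linearity principle: the input to each pet detector is a weighted sum of the features
in_cat = np.dot(w_cat, x)   # 0.8*1 + 0.5*1 + (-0.3)*0 = 1.3
in_dog = np.dot(w_dog, x)   # 0.1*1 + 0.5*1 + 0.7*0 = 0.6

print(in_cat, in_dog)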
Biologically plausible?
• Plausible model of how humans detect dogs and cats?
• The answer to this question depends on the domain of interest
• Behavior of single neurons: it makes no sense to have neurons with a binary activation profile
• High-level domain, like categorizing pets: we don't need very high levels of biological realism
2.1.2 Minimal energy model
• The pet detector gets a linear combination of the feature detectors. Now what should the pet detector do with this
info?
• How do the y-values (the activation of the cat and the dog units)
change over time? Look at this by means of an energy model
• Dynamics of the model: how the activation values of its detectors change across time.
• They allow us to make predictions about ACC, RT, neural activation, … (and other dependent variables of interest)
− To define the model dynamics you first have to define a goal function (the energy function E)
− Then you have to assume that the aim is to minimize this E function
What does minimizing energy function E mean?
A. If the input to the cat detector, incat, is high, then it would be good to make ycat high as well, because if they both
have large positive values, E becomes more negative
B. Because we want to minimize E, ycat and ydog can't both be high, because that would give E a more positive, higher
value (higher energy). This is not what we want, so the (negative) weight w takes care of this problem: it is unlikely that
the two units are active together because they inhibit each other
• So: Large values of incat should make ycat large
Large values of indog should make ydog large
But ycat and ydog should not be active at the same time
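→ One energy function that captures these three requirements (a sketch written to be consistent with the update rule
below; the slides may write it slightly differently) is
E = −incat · ycat − indog · ydog − w · ycat · ydog, with an inhibitory (negative) weight w between the cat and dog units.
The first two terms make E more negative when a unit with a large input becomes active; because w is negative, the last
term adds energy whenever both units are active at the same time.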
• Minimization via gradient descent: we are going to apply this principle to the energy function E, because gradient
descent is something you apply to a function. We then find that the steps we have to make for the cat unit take one
shape (Δycat), and the steps we take for the dog unit take another shape (Δydog)
• Take the derivative of E with respect to ycat or ydog
Update rule
• Apply gradient descent to obtain the steps you have to make for the cat
and dog units
• The equation Δycat = α · (incat + w · ydog) says: if there is a lot of input to the
cat unit → make big steps for the activation of the cat unit, except if the
dog unit is very active too; then the activation of the
cat unit is suppressed (w is negative)
• Same for the dog unit, Δydog = α · (indog + w · ycat): if there is a lot of input to the dog unit → activate, make ydog very big, except if
the cat unit is very active; then do not activate the dog unit
• Delta is a step: the same quantity at two different time points, subtracted from each other,
Δy = y(t) − y(t−1) → we move the y(t−1) term to the other side of the equation and add noise
→ then you get: ycat(t) = ycat(t−1) + α · (incat + w · ydog(t−1)) + N(t)
ydog(t) = ydog(t−1) + α · (indog + w · ycat(t−1)) + N(t)
• The equations depend on the parameters wij
− The equations depend on the quantities incat and indog, and these inij themselves depend on the parameters wij
• At each time step t, a noise variable N(t) with a normal distribution (mean = 0, some standard deviation) is added to
each y
− If we run the equation many times, we get a response time distribution that looks like this.
− This distribution gives us something that we can match with empirical data: we
can give cat and dog pictures to human participants, ask them to respond, say,
1000 times, plot their RT distribution, and compare the subjects' RT
distribution with the model's RT distribution. Then we can see if this is a good
model of cat and dog categorization → this is how we can make predictions about
response times.
• If activation reaches a specific threshold, that time is the response time, which we put
in the frequency diagram on the right. We can do that for every trial and then we
end up with the RT distribution.
− You can see that both detectors start with some activation, but the further into the
trial, the more the cat detector goes down and the dog detector goes up. On this trial the dog detector
has 'won'
→ So the point is that we generate RT distributions (when RT gets lower after a lot of trials, the model has learned), and
the final goal is to compare the empirical data and the model: if those 2 distributions are similar, we can say that we have
a good model.
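A minimal Python simulation of this race between the two units (a sketch under assumed values: the inputs, w, α, noise sd, and threshold below are illustrative choices, not numbers from the course):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter choices (assumptions, not course values)
in_cat, in_dog = 1.0, 0.6     # evidence for cat vs dog on each trial
w = -0.5                      # inhibitory weight between the two pet units
alpha = 0.05                  # step size
sd = 0.1                      # standard deviation of the noise N(t)
threshold = 1.0               # activation level that triggers a response
n_trials = 1000

rts, choices = [], []
for trial in range(n_trials):
    y_cat = y_dog = 0.0
    t = 0
    while max(y_cat, y_dog) < threshold and t < 2000:
        t += 1
        # Update rule: previous activation + alpha * (input + inhibition) + noise
        new_cat = y_cat + alpha * (in_cat + w * y_dog) + rng.normal(0, sd)
        new_dog = y_dog + alpha * (in_dog + w * y_cat) + rng.normal(0, sd)
        y_cat, y_dog = new_cat, new_dog
    rts.append(t)   # time step at which the threshold was reached
    choices.append("cat" if y_cat >= y_dog else "dog")

print("mean RT (time steps):", np.mean(rts))
print("proportion 'cat' responses:", choices.count("cat") / n_trials)

A histogram of rts gives a positively skewed RT distribution like the one described in the next subsection, and moving in_dog closer to in_cat slows the model down and lowers accuracy (the distance effect).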
Typical features of the model
• Typical positively skewed distribution: a type of distribution in which most values are clustered around the left tail
of the distribution while the right tail of the distribution is longer (the RT distribution).
If we run the equations many times, we get such a response time distribution
• Distance effect: if the animals resemble each other, then the evidence for the cat and dog detectors becomes similar
(indog ≈ incat). As a result: RT becomes slower and ACC decreases
E.g., it is easier to decide which of 2 persons is smaller if they differ a lot in height (= Weber's law)
→ discrimination between 2 large quantities is harder than discrimination between 2 small quantities (with the same
absolute difference)
→ any quantitative difference between incat and indog gives rise to a distance effect
Careful: this shows that it is very hard to draw strong conclusions about the spatial nature of representations
from the distance effect alone