Behavioral Data Science
Introduction and Theory
Behavioral Data Science is a multidisciplinary scientific field that aims
to facilitate understanding, prediction, and change of human behavior
through the analysis of behaviorally defined variables as they arise in
large datasets ("Big Data"), typically gathered using modern digital
technology (e.g., online or through mobile devices) and analyzed with
techniques for detecting patterns in high-dimensional data (e.g.,
machine learning).
- Understanding: Construction of psychological theories to explain
behavior
- Prediction: Application of statistical models to predict behavior
- Change: Development of interventions to change behavior
The complexities of human behavior:
- Human behavior is at the root of many of the most central
problems of our time:
o COVID-19 spread and climate change but also war and famine
have important behavioral components
o Human behavior “is possibly the most difficult subject ever
submitted to scientific analysis” (Skinner)
- Yet standard methods to study it are remarkably simple:
Questionnaires, tests, and small-scale experiments
o However, recently, new sources of data are being mined and
these offer new ways of approaching old questions
Urgency due to communication networks:
- Due to communication networks, the world has shrunk
o Dense and fast communication causes polarization
o Polarization: Division into two sharply contrasting groups
or sets of opinions or beliefs
- E.g., Twitter (users polarize into opposing groups, e.g., conservatives and
progressives)
Data
Data: Representations of observations
o E.g., Pete correctly solved IQ test item 36
- Representation: The row that represents Pete has a 1 in the
column that represents the IQ item
o Tied to a time or place (e.g., Pete solved item 36 on
January 1st)
- Typically, data are structured in rows and columns (i.e., in a
spreadsheet)
o Rows represent cases, while columns represent
features/properties/attributes
o The values in a column represent a variable
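As a minimal sketch of this row-and-column structure, the observation about Pete can be stored as one row of a small table. The column names (`person`, `item_36`, `date`) and the second case ("Ann") are hypothetical and only serve as illustration.

```python
# Each dict is one row (a case); each key is a column (a feature).
# Column names and the second case ("Ann") are made up for illustration.
rows = [
    {"person": "Pete", "item_36": 1, "date": "January 1st"},
    {"person": "Ann", "item_36": 0, "date": "January 1st"},
]

# A variable corresponds to the values found in one column.
item_36 = [row["item_36"] for row in rows]

# Look up the observation "Pete correctly solved IQ test item 36".
pete = next(row for row in rows if row["person"] == "Pete")
print(pete["item_36"])  # 1 means the item was solved correctly
```

The `date` column carries the time information that ties the observation to a specific occasion.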
Phenomena: Robust (abstract) features of the world
- They are general (not tied to a time or place; the pattern always
holds)
o E.g., the positive manifold of intelligence, the robust
correlation between insomnia and depression, the effect of
time pressure on accuracy
- Important: Phenomena are not themselves data
o Rather, phenomena are evidenced by patterns found in the
data
- Because psychology is very complex, we often need advanced
statistical models to “see” the patterns
o Statistics is a tool that allows us to observe (i.e., to make sense of
human behavior) and to understand the collected data
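To illustrate how a phenomenon shows up as a pattern in data rather than as a single data point, the sketch below simulates scores on four hypothetical tests and checks whether all pairwise correlations are positive, as in the positive manifold. The data are simulated, and generating them from a shared component is only a convenient assumption for producing correlated scores, not a claim about what causes the positive manifold.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)           # fixed seed so the run is reproducible
n_people, n_tests = 500, 4       # hypothetical sample size and test count
scores = [[] for _ in range(n_tests)]  # one column of scores per test

for _ in range(n_people):
    shared = rng.gauss(0, 1)     # assumed shared component (illustration only)
    for test in range(n_tests):
        scores[test].append(0.7 * shared + rng.gauss(0, 1))

# The phenomenon is the pattern across all test pairs, not any single row.
correlations = {
    (i, j): pearson(scores[i], scores[j])
    for i, j in combinations(range(n_tests), 2)
}
all_positive = all(r > 0 for r in correlations.values())
print(all_positive)
```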
Theories: A theory describes a world in which the phenomena would follow “as
a matter of course” (a = b)
- There are many kinds of theories, but we are often interested in
explanatory theories
o An explanatory theory is a set of principles that aims to explain
phenomena
- Coming up with a good theory is a creative act, but it can be
systematized and practiced
o Ideally, in behavioral data science we are after
mathematically formulated models
It is hard to ‘reason through’ theories; models therefore
‘simplify’ theories in a way that everyone can use
The speed-accuracy trade-off (in-depth example)
The Lexical Decision Task
- Participants must decide whether a letter string is a word (e.g.,
tango) or a nonword (e.g., drapa)
o Participants usually decide by pressing a keyboard key with
their index finger
- Participants may judge hundreds or even thousands of letter
strings in a single session
o Usually, the stimulus set contains 50% words
- Performance on this task is supposed to measure the ease with
which lexical representations are activated from memory
o For instance, performance is better for high-frequency
words (e.g., cat) than for low-frequency words (e.g., feline);
this is the frequency effect
- Participants are usually told to do this as quickly and accurately as
possible
o Key dependent variables of interest are response time (RT)
and accuracy (proportion of correct responses)
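The two dependent variables can be computed per condition as sketched below. The trials and their values are made up for illustration, and the tuple layout and the function name `summarize` are assumptions. Mean RT is taken over all trials here for simplicity; restricting it to correct trials would be an alternative choice.

```python
from statistics import mean

# Hypothetical trials: (stimulus, frequency class, RT in seconds, correct?)
trials = [
    ("cat", "high", 0.50, True),
    ("dog", "high", 0.54, True),
    ("feline", "low", 0.70, True),
    ("canine", "low", 0.74, False),
]

def summarize(trials, frequency):
    """Return (mean RT, proportion correct) for one frequency class."""
    selected = [t for t in trials if t[1] == frequency]
    mean_rt = mean(t[2] for t in selected)
    accuracy = sum(t[3] for t in selected) / len(selected)
    return mean_rt, accuracy

for frequency in ("high", "low"):
    mean_rt, accuracy = summarize(trials, frequency)
    print(frequency, round(mean_rt, 2), accuracy)
```

In this toy data set the high-frequency words are answered faster and more accurately, which is the direction of the frequency effect described above.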
Used to study phenomenon: Global slowing
- Older adults are generally slower than young adults
o This decrease in response speed is explained by the ‘general
slowing’ hypothesis, which says that all cognitive processes
operate more slowly in older adults
Perhaps age-related demyelination reduces basic neural
transmission speed?
Problems with the Standard Analysis:
- No account of the ubiquitous trade-off between RT and accuracy
o Giving up speed (longer RT) to be more accurate, or the other way around
- No process model (no account of how people generate the
observed data)
o And thus, no decomposition of underlying processes
- Solution: Use a process model that allows you to estimate latent
psychological processes
o One prominent candidate: Ratcliff’s diffusion model
A model that describes how noisy evidence is accumulated over
time
The deterministic or signal component (the predictable part) of
this noisy process is called the drift rate (v)
o The drift rate indicates how quickly and in what direction
the predictable part of the process changes (shown by the
arrow in the figure)
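The idea of noisy evidence accumulation with a drift rate can be sketched as a simple simulation. This is a rough discrete-time approximation and not the full model: only the drift rate `v` comes from the notes above, while the boundary separation `a`, the noise scale `s`, the time step `dt`, the unbiased starting point `a / 2`, and all numeric values are assumptions made for illustration.

```python
import random

def simulate_trial(v, a=1.0, s=1.0, dt=0.001, rng=random):
    """Simulate one decision; return (decision_time, reached_upper)."""
    x = a / 2.0  # evidence starts halfway between the boundaries 0 and a
    t = 0.0
    while 0.0 < x < a:
        # Signal part (drift v) plus noise part (Gaussian, scaled by s).
        x += v * dt + s * dt ** 0.5 * rng.gauss(0.0, 1.0)
        t += dt
    return t, x >= a

def summarize(v, n_trials=500, seed=1):
    """Return (mean decision time, proportion ending at the upper boundary)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    results = [simulate_trial(v, rng=rng) for _ in range(n_trials)]
    mean_time = sum(t for t, _ in results) / n_trials
    p_upper = sum(upper for _, upper in results) / n_trials
    return mean_time, p_upper

# Compare a weak and a strong drift toward the upper boundary.
time_low, p_low = summarize(v=0.5)
time_high, p_high = summarize(v=2.0)
print(p_low, p_high)
```

Under these assumptions, a larger positive drift rate should send more trials to the upper boundary, which is one way to read the drift rate as the strength and direction of the signal.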