Summary: Computational Communication Science
Lecture 1
When communication meets computation: opportunities, challenges, and pitfalls in
computational communication science – Van Atteveldt & Peng (2018)
The role of computational methods in communication science
The recent acceleration in the promise and use of computational methods for communication science
is primarily fueled by the confluence of at least three developments:
1. A deluge of digitally available data, ranging from social media messages and other “digital
traces” to web archives and newly digitized newspaper and other historical archives (e.g.,
Weber, 2018);
2. Improved tools to analyze these data, including network analysis methods (e.g., Lungeanu,
Carter, DeChurch, & Contractor, 2018; Barabási, 2016) and automatic text analysis methods
such as supervised text classification (Boumans & Trilling, 2016; Collingwood & Wilkerson,
2012; Odijk, Burscher, Vliegenthart, & de Rijke, 2013), topic modelling (Maier et al., 2018; Blei,
Ng, & Jordan, 2003; Jacobi, Van Atteveldt, & Welbers, 2016; Roberts et al., 2014), word
embeddings (e.g., Rudkovsky et al., 2018), and syntactic methods (Van Atteveldt, Sheafer,
Shenhav, & Fogel-Dror, 2017) (see the classification sketch after this list); and
3. The emergence of powerful and cheap processing power, and easy-to-use computing
infrastructure for processing these data, including scientific and commercial cloud computing,
sharing platforms such as GitHub and Dataverse, and crowd coding platforms such as Amazon
MTurk and Crowdflower.
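To make the supervised text classification named in point 2 concrete, here is a minimal sketch using scikit-learn. The training texts, labels, and categories are hypothetical stand-ins for hand-coded content analysis data, not part of the original lecture.

# Minimal supervised text classification sketch (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-coded training data: texts with topic labels.
train_texts = [
    "Parliament votes on the new budget",
    "Coalition talks continue after the election",
    "Team wins the cup final in extra time",
    "Star striker injured ahead of the derby",
]
train_labels = ["politics", "politics", "sports", "sports"]

# Turn texts into TF-IDF features and fit a classifier on the coded examples.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Predict the category of an unseen document.
print(model.predict(["The minister proposed a budget amendment"]))

In practice the training set would consist of hundreds or thousands of manually coded documents, and the classifier would be validated against held-out human codings.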
Like “big data,” the concept of computational methods makes intuitive sense but is hard to define.
Sheer size is not a necessary criterion to define big data (Monroe, Pan, Roberts, Sen, & Sinclair, 2015),
and the fact that a method is executed on a computer does not make it a “computational method”—
communication scholars have used computers to help in their studies for over half a century (e.g., Nie,
Bent, & Hull, 1970; Stone, Dunphy, Smith, & Ogilvie, 1966).
Adapting the criteria given by Shah et al. (2015), we can give an ideal-typical definition by
stating that computational communication science studies generally involve:
1. Large and complex data sets;
2. Consisting of digital traces and other “naturally occurring” data;
3. Requiring algorithmic solutions to analyze; and
4. Allowing the study of human communication by applying and testing communication theory.
Computational methods are an expansion and enhancement of the existing methodological toolbox,
while traditional methods can also contribute to the development, calibration, and validation of
computational methods. Moreover, the distinction between “classical” and “computational” methods
is often one of degree rather than of kind, and the boundaries between approaches are fuzzy.
Opportunities offered by computational methods
We argue that computational methods allow us to analyze social behavior and communication in ways
that were not possible before and have the potential to radically change our discipline in at least four
ways.
From self-report to real behavior
Digital traces of online social behavior can function as a new behavioral lab available for
communication researchers. These data allow us to measure actual behavior in an unobtrusive way
rather than self-reported attitudes or intentions. This can help overcome social desirability problems
and, more importantly, does not rely on people's imperfect estimates of their own desires and
intentions.
With voluminous time-stamped data on social media, it is methodologically viable to unravel
the dynamics underlying human communication and disentangle the interdependent relationships
between multiple communication processes. This can also help overcome the problems of linking
content data to survey data. Such linking is a mainstay of media effects research but is problematic because of
bias in media self-reports (Kobayashi & Boase, 2012; Scharkow & Bachl, 2017) and because news
consumers nowadays often cherry-pick articles from multiple sites, rather than relying on a single
source of news (Costera Meijer & Groot Kormelink, 2015).
From lab experiments to studies of the actual environment
A second advantage is that we can observe people's reactions to stimuli in their actual
environment rather than in an artificial lab setting. In their daily lives, people are exposed to a
multitude of stimuli simultaneously, and their reactions are also conditioned by how a stimulus fits
into their overall perceptions and daily routines. Moreover, we are mostly interested in social
behavior, and how people act depends strongly on (their perception of) the actions and attitudes in
their social network (Barabási, 2016).
However, implementing experimental designs on social media is not an easy task. Social
media companies are very selective about their collaborators and about research topics. The coordination
of experiments on social media can also be extremely time-consuming. Furthermore, adequately
addressing the ethical concerns involved in online experiments has become a pressing issue in the
scientific community.
From small-N to large-N
Simply increasing the scale of measurement can also enable us to study more subtle relations or effects
in smaller subpopulations than is possible with the sample sizes normally available in communication
research (Monroe et al., 2015). Similarly, by measuring messages and behavior in real time rather than
in daily or weekly (or yearly) surveys, much more fine-grained time series can be constructed,
alleviating the problem of simultaneous correlation and making a stronger case for identifying causal
mechanisms.
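To illustrate, the snippet below (with simulated timestamps standing in for real message data) aggregates time-stamped messages into hourly counts, a far finer-grained series than daily or weekly survey waves would yield.

# Sketch: turning time-stamped messages into a fine-grained time series.
# Timestamps are simulated; real data would come from, e.g., social media.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
offsets = rng.uniform(0, 7 * 24 * 3600, size=5000)  # seconds within one week
stamps = pd.to_datetime("2018-01-01") + pd.to_timedelta(offsets, unit="s")

# One row per message; resampling yields hourly counts, far finer than
# the daily or weekly waves of a typical survey design.
messages = pd.Series(1, index=stamps).sort_index()
hourly_counts = messages.resample("60min").sum()
print(hourly_counts.head())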
In order to leverage the more complex models afforded by larger data sets, we need to change
the way we build and test our models:
- Penalized (lasso) regression
- Cross-validation
→ These are aimed at out-of-sample prediction rather than within-sample explanation (see the sketch after this list).
- Exponential Random Graph Modeling (ERGM)
- Relational Event Modeling
→ These can dynamically model network and group dynamics.
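As a hedged illustration of the first two items, the sketch below fits a lasso regression and scores it with 5-fold cross-validation on held-out data; the simulated predictors and outcome are hypothetical stand-ins (e.g., message features predicting an engagement measure). ERGMs and relational event models require specialized packages and are not sketched here.

# Sketch: penalized (lasso) regression scored out-of-sample via cross-validation.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # 20 candidate predictors
y = 2.0 * X[:, 0] + rng.normal(size=500)  # only the first one truly matters

# The L1 penalty shrinks irrelevant coefficients toward zero, and 5-fold
# cross-validation evaluates prediction on held-out data rather than fit
# on the training sample.
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # average out-of-sample R^2

Reporting the cross-validated score keeps the focus on how well the model predicts new cases rather than on within-sample fit.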
From solitary to collaborative research
Digital data and computational tools make it easier to share and reuse resources. The increased scale
and complexity also make it almost necessary to do so: it is very hard for any individual researcher to
possess the skills and resources needed to do all the steps of computational research him or herself
(Hesse, Moser, & Riley, 2015). An increased focus on sharing data and tools will also force us to be
more rigorous in defining operationalizations and documenting the data and analysis process,
furthering transparency and reproducibility of research.
A second way in which computational methods can change the way we do research is by
fostering the interdisciplinary collaboration needed to deal with larger data sets and more complex
computational techniques (Wallach, 2016). By offering a chance to zoom in from the macro level down
to the individual data points, digital methods can also bring quantitative and qualitative research closer
together, allowing qualitative research to improve our understanding of data and build theory, while
keeping the link to large-scale quantitative research to test the resulting hypotheses (O’Brien, 2016).
Challenges and pitfalls in computational methods
Using these new methods and data sets, however, creates a new set of challenges and pitfalls, some
of which will be reviewed below. Most of these challenges do not have a single answer and require
continued discussion about the advantages, disadvantages, and best practices in computational
communication science.
How do we keep research datasets accessible?
Privileged access to big data for a small group of researchers will let those with access
"enjoy an unfair amount of attention at the expense of equally talented researchers without these
connections" (Huberman, 2012, p. 308).
Samples of big data from social media are made accessible to the public either in their original form
(e.g., data collected via the public Twitter API) or in aggregate form (e.g., data from Google Trends).
Moreover, as explicated by Matthew Weber (2018), external parties also create accessible archives of
web data. However, the sampling, aggregation, and other transformations imposed on the released
data are a black box, which makes it very hard for communication researchers to evaluate the quality
and representativeness of the data and then assess the external validity of findings derived from
such data.
- Where possible, datasets should be fully open and published on platforms such as Dataverse;
where this is not possible for privacy or copyright reasons, the data should be securely stored
but accessible under clear conditions.
- Additionally, we should work with funding agencies and data providers, such as newspaper
publishers and social media platforms, to make standardized data sets available to all
researchers.
Is “big” data always a good idea?
Do communication researchers need to bother with small-sample survey data when it is easy and
cheap to get “big data” from social media? Big data, however, is not a panacea for all methodological
problems in empirical research, and it has its obvious limitations despite its widely touted advantages.
1. Big data is "found" while survey data is "made" (Taylor, 2013). Most big data are
secondary, collected for primary purposes that have little relevance to academic
research. The gap between the primary purpose the data were collected for and the secondary
purpose they are put to poses a threat to the validity of design, measurement, and analysis
in computational communication research;
2. That data is "big" does not mean that it is representative of a given population. People do
not randomly select into social media platforms, and communication researchers have very
limited information with which to assess the (un)representativeness of big data retrieved from
social media;
3. This also means that p-values are less meaningful as a measure of validity. Especially for very
large data sets, where representativeness and selection and measurement biases are a much
greater threat to validity than small sample sizes, p-values are not a very meaningful indicator
of effect (Hofmann, 2015).
→ So, focus on effect size and validity (e.g., via confidence intervals, simulations, or bootstrapping).
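To illustrate that recommendation, here is a minimal sketch (with simulated data) that bootstraps a confidence interval around a difference in means; the group sizes and the true effect size are hypothetical.

# Sketch: bootstrapped confidence interval for an effect size, with
# simulated data illustrating a tiny but "significant" effect in a large sample.
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(0.00, 1.0, size=10_000)
group_b = rng.normal(0.05, 1.0, size=10_000)  # true difference is only 0.05

# Resample both groups with replacement and record the difference in means.
diffs = []
for _ in range(2000):
    a = rng.choice(group_a, size=group_a.size, replace=True)
    b = rng.choice(group_b, size=group_b.size, replace=True)
    diffs.append(b.mean() - a.mean())

# The 95% percentile interval will typically exclude zero here, but its
# location shows how small the effect is in substantive terms.
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"effect = {np.mean(diffs):.3f}, 95% CI [{low:.3f}, {high:.3f}]")

With samples this large, nearly any difference is statistically significant; the bootstrapped interval instead shows where the effect lies and how small it is.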