Computational analysis of digital communication
Week 1 – Introduction
What is computational social science and why should we care?
Example: surprising sources of information
• In 2009, Blumenstock and colleagues (2015, Science) wanted to study wealth and poverty in
Rwanda.
• They conducted a survey with a random sample of 1,000 customers of the largest mobile phone
provider
• They collected demographics, social, and economic characteristics
• Traditional social science survey, right?
• The authors also had access to complete call records from 1.5 million people
• Combining both data sources, they used the survey data to “train” a machine learning model to
predict a person’s wealth based on their call records.
• They also estimated the places of residence based on the geographic information embedded in call
records.
Blumenstock, Cadamuro, & On, 2015
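The train-then-predict logic of this example can be sketched in a few lines. This is only an illustration under strong assumptions, not the authors' method: all data are synthetic, the single feature `calls_per_day` is hypothetical, and ordinary least squares stands in for whatever model the study actually used.

```python
import random

random.seed(42)

# Synthetic population: one hypothetical call-record feature per person.
# All numbers are invented for illustration.
population = [
    {"id": i, "calls_per_day": random.uniform(0, 20)} for i in range(10_000)
]

# Hypothetical "true" wealth index. In a real study this is unknown
# for everyone who did not take part in the survey.
for person in population:
    person["wealth"] = 2.0 + 0.5 * person["calls_per_day"] + random.gauss(0, 1)

# Step 1: survey a random sample; wealth is observed only for these people.
survey = random.sample(population, 1_000)


# Step 2: "train" a model on the survey data (least squares, one feature).
def fit_line(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(
    [p["calls_per_day"] for p in survey],
    [p["wealth"] for p in survey],
)

# Step 3: predict wealth for the whole population from call records alone.
for person in population:
    person["predicted_wealth"] = intercept + slope * person["calls_per_day"]

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The key point is the division of labor: the small survey supplies the outcome (wealth), while the large call-record data set supplies the predictors for everyone.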
What is computational social science
- Field of Social Science that uses algorithmic tools and large/unstructured data to understand
human and social behavior
- Computational methods as “microscope”: Methods are not the goal, but contribute to theoretical
development and/or data generation
- Complements rather than replaces traditional methodologies
- Includes methods such as:
o Advanced data wrangling/data science
o Combining of different data sets
o Automated Text Analysis
o Machine Learning (supervised and unsupervised)
o Actor-based modelling
o Simulations
o …
Typical workflow
Why is this important now?
- In the past, collecting data was expensive (surveys, observations…)
- In the digital age, the behaviors of billions of people are recorded, stored, and therefore analyzable
- Every time you click on a website, make a call on your mobile phone, or pay for something with
your credit card, a digital record of your behavior is created and stored
- Because (meta-)data are a byproduct of people’s everyday actions, they are often called digital
traces
- Large-scale records of persons or businesses are often called big data.
10 characteristics of big data
Characteristic: Description
1. Big: The scale or volume of some current datasets is often impressive. However, big datasets are not an end in themselves, but they can enable certain kinds of research, including the study of rare events, the estimation of heterogeneity, and the detection of small differences.
2. Always-on: Many big data systems are constantly collecting data and thus enable the study of unexpected events and allow for real-time measurement.
3. Nonreactive: Participants are generally not aware that their data are being captured, or they have become so accustomed to this data collection that it no longer changes their behavior.
4. Incomplete: Most big data sources are incomplete, in the sense that they don’t have the information that you will want for your research. This is a common feature of data that were created for purposes other than research.
5. Inaccessible: Data held by companies and governments are difficult for researchers to access.
6. Nonrepresentative: Despite their size, most big datasets are not representative of certain populations. Out-of-sample generalizations are hence difficult or impossible.
7. Drifting: Many big data systems are changing constantly, thus making it difficult to study long-term trends.
8. Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.
9. Dirty: Big data often include a lot of noise (e.g., junk, spam, spurious data points…).
10. Sensitive: Some of the information that companies and governments have is sensitive.
(Salganik, 2017, chap 2.3)
Example data
Smartphone log data (Masur, 2018)
- Incredibly detailed log of each person’s smartphone use
- Big data?
• BIG: Thousands of rows per person, but not many columns
• ALWAYS-ON: Recorded smartphone use at all times
• INCOMPLETE: Did not record the use of apps with higher privacy standards (e.g., Signal)
• DIRTY: Depending on what you want to study, lots of noise (e.g., phone on/off)
Typical computational research strategies
1. Counting things
In the age of big data, researchers can “count” more than ever
- How often do people use their smartphone per day?
- About which topics do news websites write most often?
2. Forecasting and nowcasting
Big data allow for more accurate predictions both in the present and in the future
- Investigate when people disclose themselves in computer-mediated communication
- Crime prediction
3. Approximating experiments
Computational methods provide opportunities to conduct “natural experiments”
• Compare smartphone log data of people who use their smartphone naturally vs. those who abstain
from certain apps (e.g., social media apps)
• Investigate the potential of nudges to make users select certain news
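The first strategy, counting things, is often the simplest to implement. The sketch below counts how often each person uses their smartphone per day. It assumes a hypothetical log format with one row per unlock event (person id and timestamp); the rows are invented and do not come from any of the studies cited here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log: one row per screen-unlock event (person id, timestamp).
log = [
    ("p1", "2018-03-01 08:15:00"),
    ("p1", "2018-03-01 12:40:00"),
    ("p1", "2018-03-02 09:05:00"),
    ("p2", "2018-03-01 07:55:00"),
    ("p2", "2018-03-01 07:58:00"),
    ("p2", "2018-03-01 22:10:00"),
]

# Count events per person and calendar day.
sessions = Counter()
for person, timestamp in log:
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
    sessions[(person, day)] += 1


# Average number of sessions per observed day for one person.
def mean_sessions_per_day(person):
    counts = [n for (p, _), n in sessions.items() if p == person]
    return sum(counts) / len(counts)


for person in sorted({p for p, _ in log}):
    print(person, mean_sessions_per_day(person))
```

Note that days without any logged event are not counted as zero-use days here; whether that is appropriate depends on how the log was recorded.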
Advantages and disadvantages
Advantages of Computational Methods
- Actual behavior vs. self-report
- Social context vs. lab setting
- Small N to large N
Disadvantages of Computational Methods
- Techniques often complicated
- Data often proprietary
- Samples often biased
- Insufficient metadata
Computational communication science. Why computational methods are important for (future)
communication research
Definition
“Computational Communication Science (CCS) is the label applied to the emerging subfield that
investigates the use of computational algorithms to gather and analyze big and often semi- or unstructured
data sets to develop and test communication science theories”
Van Atteveldt & Peng, 2018
Promises
The recent acceleration in the use of computational methods for communication science is primarily fueled
by the confluence of at least three developments:
- vast amounts of digitally available data, ranging from social media messages and other digital
traces to web archives and newly digitized newspaper and other historical archives
- improved tools to analyze this data, including network analysis methods and automatic text
analysis methods such as supervised text classification, topic modeling, word embeddings,
and syntactic methods
- powerful and cheap processing power, and easy-to-use computing infrastructure for processing
these data, including scientific and commercial cloud computing, sharing platforms such as GitHub
and Dataverse, and crowd coding platforms such as Amazon MTurk and Crowdflower
Example 1: simulating search queries
- Numbers of drug-overdose deaths have been increasing in the United States
- Google spotlights counselling services as helpful resources when users query for suicide-related
search terms
- However, the search engine does so at varying display rates, depending on terms used
- Display rates in the drug-overdose deaths domain are unknown
- Haim and colleagues (2021) emulated suicide-related potentially harmful searches at large scale
across the U.S. to explore Google’s response to search queries including or excluding additional
drug-related terms
- They conducted 215,999 search requests with varying combinations of search terms
- Counseling services were displayed at high rates after suicide-related potentially harmful search
queries (e.g., “how to commit suicide”)
- Display rates were substantially lower when drug-related terms, indicative of users’ suicidal
overdosing tendencies, were added (e.g., “how to commit suicide fentanyl”)
Haim, Scherr, & Arendt, 2021
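The central quantity in this example is the display rate: the share of search requests for a given type of query that returned the counseling box. A minimal sketch of that computation follows. The records and the labels `suicide-only` and `suicide+drug` are invented for illustration and are not the study's data or coding scheme.

```python
from collections import defaultdict

# Hypothetical results: (query type, whether the counseling box was shown).
results = [
    ("suicide-only", True),
    ("suicide-only", True),
    ("suicide-only", False),
    ("suicide+drug", True),
    ("suicide+drug", False),
    ("suicide+drug", False),
    ("suicide+drug", False),
]

shown = defaultdict(int)  # requests where the box was displayed
total = defaultdict(int)  # all requests per query type
for query_type, box_displayed in results:
    total[query_type] += 1
    shown[query_type] += int(box_displayed)

# Display rate = displayed / total, computed separately per query type.
display_rate = {q: shown[q] / total[q] for q in total}

for query_type, rate in display_rate.items():
    print(f"{query_type}: {rate:.2f}")
```

Comparing the rates across query types is what reveals the gap described above.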
Example 2: analyzing news coverage
- Jacobi and colleagues (2016) analyzed the coverage of nuclear technology from 1945 to the present
in the New York Times
- Analysis of 51,528 news stories (headline and lead): Way too much for human coding!
- Used “topic modeling” to extract latent topics and analyzed their occurrence over time