Lecture 1 Big Data Characteristics
What are big data?
Big data = Early definitions described it as data too large to be loaded onto a single
machine. Today the term also covers applying the tools of artificial intelligence, such as
machine learning, to extend the usability of data beyond what standard databases capture.
In short, it is applying tools to extract meaning from data.
Digital traces = stored records of some behavior, such as a click on a website, a phone
call or a credit card purchase.
There has been an explosion in the amount of data created and, alongside it, in the
computing power available to analyse this data.
Data always consist of variables (columns) and observations (rows). The total number of
variables is indicated by p and the number of observations is indicated by n. Data can be big
in two different ways:
Tall data: n >> p, many observations and relatively few variables
Wide data: n << p, few observations and many variables
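The tall/wide distinction can be sketched as a small check on n and p. This is only an illustration: the function name `classify_shape` and the factor of 10 used to stand in for the informal ">>" are assumptions, not part of the lecture.

```python
def classify_shape(n_observations, n_variables, ratio=10):
    """Label a dataset as tall, wide, or neither.

    The ratio of 10 is an arbitrary illustrative stand-in for ">>".
    """
    if n_observations >= ratio * n_variables:
        return "tall"   # n >> p: many observations, few variables
    if n_variables >= ratio * n_observations:
        return "wide"   # n << p: few observations, many variables
    return "neither"


# Hypothetical datasets, described only by their dimensions.
print(classify_shape(1_000_000, 20))  # e.g. a log of website clicks
print(classify_shape(200, 20_000))    # e.g. a few customers, many text features
```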
Primary versus secondary data
Primary data = data collected specifically to answer a specific research question.
Secondary data = data collected for a non-research purpose which needs to be repurposed in
order to answer a research question. Most big data is secondary data, but not all.
Business data
Figure 1 gives an overview of the four types of business data, which follow from two questions:
Is it easily organized for analysis?
Structured: data with a clear scale such that it can immediately be analyzed,
such as 1-to-5-star reviews.
Unstructured: data without a clear scale which still needs to be extracted
from the source. It is not directly quantifiable. 80% of the data firms use is
unstructured.
Where is the data generated?
Internal: Data created within the firm
External: Data created outside of the firm such as social media
Uses of big data
Businesses, governments and others can use big data for different purposes:
1. Personalization: Netflix goes through the big data of customer preferences, like
watch history, to give automated recommendations.
2. Boosting engagement: Facebook tested whether people were more likely to like a post
when it showed the number of other likes or when it showed the friends who had also
liked the post. With this primary data they wanted to decide on the best layout to boost
engagement.
3. New product development: by looking at social media, online forums, reviews,
etcetera, a firm can determine which products to make in the near future.
4. Reducing customer churn: customer churn means that a customer quits a service.
You can reduce churn by using past data to estimate a model that predicts the
probability of churn for current customers. The predictors can be the length of time
as a customer, the number of services subscribed to or demographics. Firms can
then intervene with the customers most likely to churn.
5. Public policy and economy: the government could for example look at Google Maps
data to see how the corona lockdowns affected how many people visited locations
such as parks, train stations or workplaces.
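For the churn use case (item 4 above), a minimal sketch of "estimate a model on past data, then score current customers" could look as follows. The lecture does not name a model; logistic regression fitted by plain gradient descent is an assumption chosen for illustration, and all customer records, names and numbers below are made up.

```python
import math

# Hypothetical past customers: (tenure in years, number of services, churned? 1/0)
history = [
    (0.5, 1, 1), (1.0, 1, 1), (0.8, 2, 1), (1.5, 1, 1),
    (3.0, 3, 0), (4.0, 2, 0), (5.0, 4, 0), (2.5, 3, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, lr=0.1, epochs=2000):
    """Fit an intercept and two slopes by batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for tenure, services, churned in rows:
            x = (1.0, tenure, services)
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - churned
            for j in range(3):
                grad[j] += error * x[j]
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def churn_probability(w, tenure, services):
    """Predicted probability that a customer with these traits churns."""
    return sigmoid(w[0] + w[1] * tenure + w[2] * services)


weights = fit_logistic(history)

# Hypothetical current customers: name -> (tenure, services)
current_customers = {"new_single_service": (0.5, 1), "long_multi_service": (5.0, 4)}
ranked = sorted(
    current_customers,
    key=lambda name: churn_probability(weights, *current_customers[name]),
    reverse=True,
)
for name in ranked:
    p = churn_probability(weights, *current_customers[name])
    print(f"{name}: predicted churn probability {p:.2f}")
```

The ranked list is what the firm would act on: the customers at the top are the ones to target with an intervention first.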
10 characteristics of big data
Big data has 10 important characteristics. The first three characteristics are advantages of
big data and the other seven are disadvantages:
1. Big
Big data is big, which means that the data contains a lot of variables and/or
observations. This has multiple advantages:
o Better estimations when the event is rare or the effect is small. Suppose you’re
running an A/B test and estimated with a small dataset that the CTR for
advertisement A is 0.35% and for B 0.40%. The difference of 0.05 percentage
points could mean a lot of extra revenue but can easily be estimated wrong because
of the small dataset. Big data gives a narrower confidence interval, so it can better
estimate the real difference between advertisement A and B.
o Big data is better with heterogeneity. With a small dataset we can only conclude
that seeing advertisement B increases the CTR by 0.05 percentage points on average.
However, when we have a larger dataset, we can for example say that
advertisement B increases the CTR by 0.10 percentage points for young people and
by 0 for older people. A distinction can thus be made between the
different types of customers, which means that heterogeneity is taken into
account.
o Big data is better when the relationship is complex. Figure 2 gives the
relationship between certain drivers and chance of customer churn. With a
small dataset, a firm can conclude that the chance of customer churn is low
below the green line in figure 2 and high above the green line. However,
when the data is big, the firm may be able to predict better when a customer
is likely to quit by designing a heat map.
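The A/B test example under characteristic 1 can be checked numerically. The sketch below uses the standard normal approximation for the difference between two proportions; the sample sizes (10,000 and 10,000,000 impressions per advertisement) are assumptions picked for illustration, as the lecture gives only the two CTRs.

```python
import math


def diff_confidence_interval(ctr_a, ctr_b, n_per_ad, z=1.96):
    """Approximate 95% confidence interval for CTR_B - CTR_A.

    Uses the normal approximation with the same number of
    impressions (n_per_ad) for each advertisement.
    """
    diff = ctr_b - ctr_a
    se = math.sqrt(
        ctr_a * (1 - ctr_a) / n_per_ad + ctr_b * (1 - ctr_b) / n_per_ad
    )
    return diff - z * se, diff + z * se


# CTRs from the text: A = 0.35%, B = 0.40%.
for n in (10_000, 10_000_000):
    low, high = diff_confidence_interval(0.0035, 0.0040, n)
    print(f"n = {n:>10,} per ad: [{low:+.4%}, {high:+.4%}]")
```

With the small sample the interval includes zero, so the test cannot tell the two advertisements apart. With the large sample the interval is narrow and lies entirely above zero.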
2. Always-on
Big data is data that is being collected 24/7. Collecting data in real time is much faster
than collecting data via, for example, surveys and is important when we need to know
and respond to the answers quickly. Responding quickly is for example important when
monitoring competition, spotting trends, resolving a product-harm crisis, adjusting
marketing response, etc.
3. Nonreactive
People usually change their behavior when they know they are being observed.
However, with big data users are typically not aware they are being recorded which
makes big data nonreactive.
4. Incomplete
Big data records what happened, but not why. It states for example which customers
use the service less, which makes them more likely to churn. It does however not tell
us what causes them to use the service less.
5. Inaccessible
From outside the organization there can be inaccessibility because of legal, business
or ethical barriers to giving outside researchers access to data. From inside the
organization there can be inaccessibility if databases are not integrated, if variables
to match on are lacking or if there are different coding schemes. A customer can
interact with a company through multiple touchpoints such as email, the web or social
media. It can be difficult to link the data from different databases that come from
the same user.
6. Nonrepresentative
Representativeness means that you can draw conclusions about the population based on
your sample: the characteristics of the sample estimate the characteristics of the
population. Big data, however, is often nonrepresentative. There is for example a
study about the representativeness of online opinions. As shown in figure 3, the
frequency of online opinions is low when the product satisfaction is low, even lower
when the product satisfaction is medium and high when the product satisfaction is
high. Because of this, there are almost only positive opinions on the internet, which
gives a wrong picture of the product quality. If we surveyed every consumer of
that specific product, the opinions on the product quality would be a normal
Figure 2