Week 1
The evolution of data science
Why is data becoming more and more common?
The ubiquity of data opportunities in the digital era is the result of the convergence of two
interrelated phenomena:
1. The possibility of data collection in every aspect of business
a. Operations, manufacturing, supply-chain management
b. Consumer behavior, marketing, advertising, customer relationships
2. Ongoing technological development
a. More powerful computers, networks, and algorithms to do analysis with
The main concepts within data environments
1. Big data
Because these interrelated phenomena make data more and more common in
today’s digital era, data is now generated in huge quantities, leading to extremely
large datasets known as big data. Big data essentially stands for datasets that
are too large for traditional processing systems and require new technologies.
In general, three characteristics explain why big data has become increasingly relevant
over time.
1. Volume: first, the volume of data has increased because the quantity of generated
and stored data is much greater nowadays
2. Variety: second, the variety of data has increased since many more types of data are
available, such as audio, video, photos, text, etc.
3. Velocity: last, the velocity has increased over time, meaning that data is generated
at a higher speed nowadays (shorter periods between new data)
In addition, there are three different types of big data that represent different
stages in the evolution and development of big data technologies and applications.
1. Big data 1.0 (internal): this phase represents the initial phase of big data in which
firms were busy with enabling the adoption, storage and processing of standard data
that they generated themselves (i.e., internal data sources)
2. Big data 2.0 (external): this phase represents the more mature stage in which
firms had become capable of processing massive data and started to explore
opportunities beyond their own internal data sources, such as integrating and
analyzing external data (e.g., social media data, web data, etc.)
3. Big data 3.0 (combination): this phase represents the current stage in the data
evolution in which firms are more and more focused on the recombination of diverse
datasets, both internal and external (i.e., data fusion, data blending).
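The recombination of internal and external data in Big data 3.0 can be illustrated with a
minimal sketch. All names and values below (internal_sales, external_mentions, the week
keys) are made up for illustration and do not come from the course material; the join
shown is just one simple way of blending two sources.

```python
# Internal data: weekly sales from the firm's own systems (made-up values).
internal_sales = {"2024-W01": 120, "2024-W02": 95, "2024-W03": 140}

# External data: weekly social media mentions from an outside source (made-up values).
external_mentions = {"2024-W01": 30, "2024-W03": 55, "2024-W04": 12}

# Blend the two sources: keep only weeks that appear in both (an inner join on week).
shared_weeks = sorted(internal_sales.keys() & external_mentions.keys())
blended = [
    {
        "week": week,
        "sales": internal_sales[week],
        "mentions": external_mentions[week],
    }
    for week in shared_weeks
]

for row in blended:
    print(row)
```

Weeks that exist in only one source are dropped here; whether to keep them (with missing
values) is a choice that depends on the analysis.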
Information versus data
Finally, it is important to note that not all data is information. Data by itself does not
have any meaning, whereas information does have a certain meaning. Hence, data only
becomes information once we impose some kind of interpretation on it (such data is also
known as informative data).
2. Data science & data mining
Data science stands for the full set of principles, processes and techniques for
understanding and using data, as shown in the figure below:
Important to note, however, is that the focus of this course will be on the data mining
stage, which stands for the data analysis step and its interaction with the business
understanding step in the middle of the overall data science process. That is, the focus
of this course will mainly be on the extraction of (business) knowledge from data.
What is the challenge of using this data science process?
The main challenge in data science is to separate the actual information from the random
noise. The point here is that there will always be some descriptive patterns or findings
within a dataset, but these findings do not necessarily generalize beyond that particular
dataset (risk of overfitting).
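The risk of overfitting can be illustrated with a small sketch. It assumes a deliberately
extreme setup that is not from the course material: the labels are pure coin flips, so
there is nothing real to learn, and the "model" is a 1-nearest-neighbour rule that simply
memorizes the training data. The function names are illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_noise_data(n):
    """Return n (feature, label) pairs where the label is a coin flip,
    so there is no real pattern linking feature and label."""
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]


def predict_1nn(train, x):
    """Predict the label of the training point whose feature is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Share of points in data whose label the 1-NN rule predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)


train = make_noise_data(200)
holdout = make_noise_data(200)

train_acc = accuracy(train, train)      # evaluated on the memorized points
holdout_acc = accuracy(train, holdout)  # evaluated on new data from the same process

print(f"training accuracy: {train_acc:.2f}")
print(f"holdout accuracy:  {holdout_acc:.2f}")
```

The model looks perfect on the data it was built from, while on new data it can do no
better than guessing. This is why findings should be checked on data that was not used
to produce them.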
3. Data-driven decision-making
Because this course is mainly focused on the data mining stage, looking into data analysis
in order to extract business knowledge, it is mainly interested in practicing data-driven
decision-making.
Data-driven decision-making stands for making business decisions based on the knowledge
derived from data analysis rather than just on intuition. It is already being used within
departments such as:
- Marketing
o Online advertising
o Recommendations for cross-selling
o Customer relationship management
- Finance
o Credit scoring and trading
o Fraud detection
- Retail
o Supply chain management
o Store designs
Data analytics
Data analytics stands for the process of examining datasets in order to draw conclusions
about the useful information they may contain (related to the data analysis step).
In general, there are three different types of data analytics:
1. Descriptive analytics: what has happened? (charts, dashboards, diagrams)
2. Predictive analytics: what could happen? (regression, classification)
3. Prescriptive analytics: what should we do? (advanced techniques and causality)
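The difference between the first two types can be sketched on a toy example. The monthly
sales figures below are made up, and an ordinary least-squares line is used as just one
simple stand-in for a predictive (regression) model. Prescriptive analytics would
additionally need a decision rule or causal model and is not sketched here.

```python
# Made-up monthly sales figures for five consecutive months.
months = [1, 2, 3, 4, 5]
sales = [10, 13, 13, 16, 18]

# Descriptive analytics: what has happened? Summarize the observed data.
mean_sales = sum(sales) / len(sales)
best_month = months[sales.index(max(sales))]
print(f"average sales: {mean_sales}, best month so far: {best_month}")

# Predictive analytics: what could happen? Fit a least-squares line
# sales = intercept + slope * month and extrapolate one month ahead.
mean_month = sum(months) / len(months)
sxy = sum((m - mean_month) * (s - mean_sales) for m, s in zip(months, sales))
sxx = sum((m - mean_month) ** 2 for m in months)
slope = sxy / sxx
intercept = mean_sales - slope * mean_month

forecast = intercept + slope * 6
print(f"forecast for month 6: {forecast:.1f}")
```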
What type of decisions/questions do we want to solve with data analytics?
In general, there are two main types of decisions/questions of interest within data-driven
decision-making:
1. Discovery questions
a. E.g., Walmart, prior to a hurricane, looked at data on stock and purchases
during prior hurricanes to discover changes in demand and found out that
water was more in demand. Hence, they increased their water stock this time
in order to outcompete their competitors.
2. Repetitive decisions
a. E.g., telecom providers would like to predict whether a client is going to
churn (switch to another provider) in order to receive a better offer, because
in that situation the current provider can retain the client by making them
an even better or similar offer.
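A repetitive decision such as the churn example can be sketched as a rule that is applied
to every client in the same way. Everything specific below is hypothetical and only for
illustration: the feature names, the weights, the intercept and the 0.5 threshold. In
practice the weights would be estimated from historical data.

```python
import math

# Hypothetical model weights (not estimated from real data).
WEIGHTS = {"months_left_on_contract": -0.4, "complaints_last_year": 0.8}
INTERCEPT = -0.5
OFFER_THRESHOLD = 0.5  # hypothetical cut-off for making a retention offer


def churn_probability(customer):
    """Logistic score: a higher value means the client is more likely to churn."""
    score = INTERCEPT + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1 / (1 + math.exp(-score))


# Made-up clients.
customers = [
    {"id": "A", "months_left_on_contract": 10, "complaints_last_year": 0},
    {"id": "B", "months_left_on_contract": 1, "complaints_last_year": 3},
]

# The same decision rule is applied to every client.
offers = [c["id"] for c in customers if churn_probability(c) >= OFFER_THRESHOLD]

for c in customers:
    print(c["id"], round(churn_probability(c), 3))
print("retention offer for:", offers)
```

Because the decision is made many times, even a small improvement in the accuracy of the
prediction can pay off across the whole client base.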
The data mining process
Because this course is mainly focused on data analysis and business understanding (data
mining) within the overall data science process, this course will be using the CRISP-DM
data mining process. CRISP-DM is a cross-industry standard process for data
mining/analytics and looks as follows:
1. Business understanding & data understanding
In this first step, data scientists and business stakeholders initially need to decompose a
business problem into solvable subtasks. This is the part where creativity plays a key
role in casting the business problem into one or more data science
(sub)problems for which methods and data are available/collectable.
The following questions are important for this translation process:
- What is the goal of the data science task?
- What is the business context?
- What data is available (or collectable)?
- What is the appropriate method to reach the goal?
- How can the method be applied to the data?
In other words, this stage typically involves structuring the business problem such that one
or more subproblems involve building models for classification, regression or probability
estimation, in order to convert the problem into data mining objectives.
Important to note is that in this phase it is crucial to understand the strengths & limitations
of the data, because there rarely is a direct match between the data and the business problem.
2. Data preparation