Chapter 1. Introduction: Data Analytic Thinking
"The widest applications of data-mining techniques are in marketing for tasks such as targeted
marketing, online advertising, and recommendations for cross-selling"
"A data perspective will provide you with structure and principles, and this will give you a
framework to systematically analyze such problems"
Data science: "Set of fundamental principles that guide the extraction of knowledge from data"
● Data mining: "The extraction of knowledge from data, via technologies that incorporate
these principles"
Churn: "Customers switching from one company to another"
● "Customer retention has been a major use of data mining technologies–especially in
telecommunications and finance businesses"
Data Science, Engineering, and Data-Driven Decision Making
Ultimate goal: Improve decision making
"Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of
data, rather than purely on intuition"
● Benefits:
○ "The more data-driven a firm is, the more productive it is"
■ "One standard deviation higher on the DDD scale is associated with a
4%-6% increase in productivity"
○ "Correlated with higher return on assets, return on equity, asset utilization, and
market value"
2 types of decisions to consider:
1. Decisions for which "discoveries" need to be made within data
a. E.g. Walmart and Target
2. Decisions that repeat, especially at massive scale, and so decision-making can benefit
from even small increases in decision-making accuracy based on data analytics
a. MegaTelCo churn example
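A back-of-the-envelope sketch of why small accuracy gains matter for decisions that repeat at scale. All figures and the function name are hypothetical illustrations, not numbers from the book.

```python
def annual_gain(decisions_per_year, accuracy_gain, value_per_correct_decision):
    """Extra value per year when a larger share of repeated decisions is made correctly."""
    return decisions_per_year * accuracy_gain * value_per_correct_decision

# Hypothetical: 20 million repeated decisions per year, accuracy improves by
# 1 percentage point, and each additional correct decision is worth $50.
gain = annual_gain(20_000_000, 0.01, 50.0)
print(f"${gain:,.0f} per year")
```

Under these assumed numbers a one-point improvement is worth about $10 million per year; the same improvement on a decision made only a few times would be negligible.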
"A predictive model abstracts away most of the complexity of the world, focusing in on a particular set
of indicators that correlate in some way w/ a quantity of interest"
Increasing use in business of automated decision making by computer systems
Data Processing and "Big Data"
Data engineering and processing ≠ data science
● "Data science needs access to data and it often benefits from sophisticated data
engineering that data processing technologies may facilitate, but these technologies
are not data science technologies per se"; they do support data science
● Data processing technologies also matter for data-oriented business tasks that do not
involve the extraction of knowledge or data-driven decision making
Big data: "Datasets that are too large for traditional data processing systems, and therefore
require new processing technologies"
● "Using big data technologies is associated w/ significant additional productivity growth"
○ 1 standard deviation of higher utilization = 1%-3% higher productivity
From Big Data 1.0 to Big Data 2.0
● "We should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become
capable of processing massive data in a flexible fashion, they should begin asking:
“What can I now do that I couldn’t do before, or do better than I could do before?”"
Data and Data Science Capability as a Strategic Asset
Fundamental principle: "Data, and the capability to extract useful knowledge from data, should
be regarded as key strategic assets"
● Viewing them as assets helps in thinking about how much one should invest in them
● They are complementary strategic assets
"Predictive performance continues to improve as more data are used"
● Companies with more data may have an important strategic advantage
Data-Analytic Thinking
Data-analytic thinking may help "to envision opportunities for improving data-driven
decision-making, or to see data-oriented competitive threats"
Fundamental concept:
● "Extracting useful knowledge from data to solve business problems can be treated
systematically by following a process with reasonably well-defined stages"
● "From a large mass of data, information technology can be used to find informative
descriptive attributes of entities of interest"
● "If you look too hard at a set of data, you will find something—but it might not generalize
beyond the data you’re looking at"
● "Formulating data mining solutions and evaluating the results involves thinking carefully
about the context in which they will be used"
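The warning about looking too hard at a set of data can be illustrated with a small simulation. This is a sketch under assumptions: the dataset sizes, the seed, and the use of Pearson correlation as the "pattern" being searched for are all choices made for illustration, not part of the book.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)         # fixed seed so the run is repeatable
n_rows, n_features = 30, 500   # hypothetical sizes

# Pure noise: no attribute has any real relationship with the target.
target = [rng.gauss(0, 1) for _ in range(2 * n_rows)]
features = [[rng.gauss(0, 1) for _ in range(2 * n_rows)] for _ in range(n_features)]

searched, held_out = slice(0, n_rows), slice(n_rows, None)

# "Look hard": pick the attribute most correlated with the target in the searched half.
best = max(
    range(n_features),
    key=lambda j: abs(pearson(features[j][searched], target[searched])),
)

in_sample = pearson(features[best][searched], target[searched])
out_of_sample = pearson(features[best][held_out], target[held_out])
print(f"searched data: r = {in_sample:+.2f}")
print(f"held-out data: r = {out_of_sample:+.2f}")
```

Because every attribute is noise, any strong correlation found in the searched half is a product of the search itself. The held-out half gives a more honest estimate, which would typically be much closer to zero.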
Chapter 2. Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process;
Supervised versus unsupervised data mining.
Data science as a process with stages
From Business Problems to Data Mining Tasks
"A critical skill in data science is the ability to decompose a data-analytics problem into pieces
such that each piece matches a known task for which tools are available"
● Recognizing problems and their solutions helps to not waste time and resources
○ Focus attention on the parts that require human involvement
Types of tasks:
1. Classification and class probability estimation: Predict, for each individual, which of a
(small) set of classes it belongs to
a. Classes are normally mutually exclusive
b. Closely related tasks: Scoring, class probability estimation
c. Outcome: Model
2. Regression ("value estimation"): "Estimate or predict… the numerical value of some
variable" per individual
a. "Informally, classification predicts whether something will happen, whereas
regression predicts how much something will happen"
3. Similarity matching: "Identify similar individuals based on data known about them"
a. E.g. product recommendations
4. Clustering: "Group individuals in a population together by their similarity, but not driven
by any specific purpose"
a. Useful in preliminary domain exploration for natural groups
5. (Association Rule Mining) Co-occurrence grouping (aka frequent itemset mining,
association rule discovery, and market-basket analysis): "Find associations between
entities based on transactions involving them"
a. "While clustering looks at similarit[ies] between objects based on the objects’
attributes, co-occurrence grouping considers similarity of objects based on
their appearing together in transactions"
b. Examples:
i. Put items next to each other for ease of finding
ii. Promote the items as a package
iii. Place items far apart from each other so that the customer has to walk
the aisles to search for them, and by doing so potentially see and buy other
items
c. Generic Rule: X ⇒ Y [S%, C%]
i. X,Y: Products and/or services
ii. X: Left-hand-side (LHS) and Y: Right-hand-side (RHS)
iii. S: Support: How often X and Y occur together (share of all transactions
containing both)
iv. C: Confidence: How often Y occurs in the transactions that contain X
v. E.g.: {Laptop Computer, Antivirus Software} ⇒ {Extended Service
Plan} [30%, 70%]
6. Profiling (aka behavior description): "Attempts to characterize the typical behavior of an
individual, group, or population"
a. Establish behavioral norms for anomaly detection applications
b. E.g. fraud detection, monitoring for intrusions to computer systems
7. Link prediction: "Attempts to predict connections between data items, usually by
suggesting that a link should exist, and possibly also estimating the strength of the link"
8. Data reduction: "Attempts to take a large set of data and replace it with a smaller set of
data that contains much of the important information in the larger set"
9. Causal modeling: "Attempts to help us understand what events or actions actually
influence others"
a. Take into account the assumptions being made
i. "When undertaking causal modeling, a business needs to weigh the
trade-off of increasing investment to reduce the assumptions made,
versus deciding that the conclusions are good enough given the
assumptions"
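The support and confidence measures from the co-occurrence grouping item above can be computed directly from transaction data. A minimal sketch: the ten baskets and the function names are made up for illustration, and the toy data give 30% support and 75% confidence rather than reproducing the 70% in the notes' example.

```python
def support(transactions, itemset):
    """Share of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Share of transactions containing `lhs` that also contain `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(lhs <= t for t in transactions)
    n_both = sum((lhs | rhs) <= t for t in transactions)
    return n_both / n_lhs if n_lhs else 0.0

# Hypothetical baskets, one set of purchased items per transaction.
baskets = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan", "mouse"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"mouse"},
    {"antivirus"},
    {"service_plan", "mouse"},
    {"laptop", "mouse"},
    {"mouse", "keyboard"},
]

lhs, rhs = {"laptop", "antivirus"}, {"service_plan"}
rule_support = support(baskets, lhs | rhs)        # X and Y together: 3 of 10 baskets
rule_confidence = confidence(baskets, lhs, rhs)   # Y given X: 3 of the 4 baskets with X
print(f"{sorted(lhs)} => {sorted(rhs)} [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Support is measured against all transactions, while confidence is measured only against the transactions that contain the left-hand side, which is why the two numbers can differ widely.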
Supervised Versus Unsupervised Methods
Unsupervised data mining problems have no specific target defined; instead, conclusions are
drawn about what the examples have in common
● "Supervised tasks require different techniques than unsupervised tasks do, and the
results often are much more useful"
● E.g. for unsupervised: Clustering, co-occurrence grouping and profiling
● E.g. for supervised: Classification (Categorical [often binary] target), regression (numeric
target) and causal modeling
● Either, depending on use: Similarity matching, link prediction and data reduction
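A toy sketch of the distinction, assuming a made-up customer table. The supervised functions use a one-nearest-neighbor lookup, and the unsupervised function splits sorted values at the largest gap (equivalent to single-linkage clustering into two groups in one dimension). These techniques were picked for brevity and are not the methods the book prescribes.

```python
# Hypothetical customers: (monthly_minutes, churn_label, monthly_spend)
customers = [
    (120, "stay", 20.0),
    (150, "stay", 25.0),
    (170, "stay", 27.0),
    (600, "churn", 80.0),
    (650, "churn", 85.0),
    (700, "churn", 95.0),
]

def nearest(minutes):
    """Return the known customer whose usage is closest to `minutes`."""
    return min(customers, key=lambda c: abs(c[0] - minutes))

def classify(minutes):
    """Supervised, categorical target: predict the churn label."""
    return nearest(minutes)[1]

def regress(minutes):
    """Supervised, numeric target: predict monthly spend."""
    return nearest(minutes)[2]

def two_groups(values):
    """Unsupervised: no target at all, just split the values at the largest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

print(classify(140))                             # uses the label column
print(regress(640))                              # uses the spend column
print(two_groups([c[0] for c in customers]))     # uses neither target column
```

The first two functions cannot work without a recorded target for past customers; the last one finds groups from the attribute values alone and leaves their interpretation to the analyst.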
Important condition for supervised: There must be data on the target; acquiring this data is
often a key data science investment
● Individual's label: "The value for the target variable for an individual"
In the early stages it is important:
1. To decide whether the line of attack will be supervised or unsupervised
2. If supervised, to produce a precise definition of a target variable
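What a "precise definition of a target variable" might look like in practice can be sketched for the churn case. The 30-day grace period, the field names, and the function name are hypothetical assumptions chosen for illustration; a real project would settle these with the business.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=30)  # hypothetical business rule

def churn_label(contract_end: date, renewal_date: Optional[date]) -> int:
    """Return 1 if the customer did not renew within the grace period, else 0."""
    if renewal_date is None:
        return 1  # never renewed
    return int(renewal_date > contract_end + GRACE_PERIOD)

print(churn_label(date(2024, 1, 31), None))               # no renewal on record
print(churn_label(date(2024, 1, 31), date(2024, 2, 15)))  # renewed within the grace period
print(churn_label(date(2024, 1, 31), date(2024, 4, 1)))   # renewed too late under this rule
```

Writing the definition as a rule forces the ambiguous cases, such as late renewals, to be decided explicitly before any model is built.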
Data Mining and Its Results
"Difference between (1) mining the data to find patterns and build models, and (2) using the
results of data mining"
Iteration is the rule rather than the exception
● Importance of "reasonable consistency, repeatability, and objectiveness"
The CRISP-DM data mining process
1. Business understanding
a. Understand the problem to be solved and its use scenario: "The initial
formulation may not be complete or optimal so multiple iterations may be
necessary for an acceptable solution formulation to appear"
b. Analysts' creativity plays a large role
c. Problems can be structured/engineered into one or more subproblems that
involve building models
2. Data understanding