Lecture Slides Week 1:
CHAPTER 1: Introduction: Data-Analytic Thinking
The past fifteen years have seen extensive investments in business infrastructure, which
have improved the ability to collect data throughout the enterprise. Virtually every aspect of
business is now open to data collection and often even instrumented for data collection:
operations, manufacturing, supply-chain management, customer behavior, marketing
campaign performance, workflow procedures, and so on.
At the same time, information is now widely available on external events such as
market trends, industry news, and competitors’ movements.
This broad availability of data has led to increasing interest in methods for extracting
useful information and knowledge from data—the realm of data science.
From Data Warehouse to knowledge:
- Raw data comes into the Data Warehouse -> (selection & cleaning) -> Target data
-> (transformation) -> Transformed data -> (data mining) -> Patterns and rules
-> (interpretation & evaluation) -> Knowledge, which the analyst must then
understand and act on.
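As a rough sketch of the flow above, each stage can be written as a function and chained together. The function names and the toy records are my own illustration, not from the lecture:

```python
# A toy sketch of the warehouse-to-knowledge pipeline described above.
# Function names and data are illustrative placeholders.

def select_and_clean(raw_rows):
    """Selection & cleaning: drop incomplete records -> target data."""
    return [r for r in raw_rows if r.get("amount") is not None]

def transform(target_rows):
    """Transformation: derive features the mining step can use."""
    return [{"customer": r["customer"], "spend": float(r["amount"])}
            for r in target_rows]

def mine_patterns(transformed):
    """Data mining: a trivial 'pattern' -- customers spending above average."""
    avg = sum(r["spend"] for r in transformed) / len(transformed)
    return [r["customer"] for r in transformed if r["spend"] > avg]

def interpret(patterns):
    """Interpretation & evaluation: turn patterns into a statement."""
    return f"High-spend customers: {sorted(patterns)}"

raw = [{"customer": "ann", "amount": 120},
       {"customer": "bob", "amount": None},   # dropped in cleaning
       {"customer": "cat", "amount": 40}]
knowledge = interpret(mine_patterns(transform(select_and_clean(raw))))
print(knowledge)
```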
With vast amounts of data now available, companies in almost every industry are focused on
exploiting data for competitive advantage.
Data mining is used for general customer relationship management to analyze customer
behavior in order to manage attrition and maximize expected customer value.
A data perspective will provide you with structure and principles, and this will give you a
framework to systematically analyze such problems.
Data science is a set of fundamental principles that guide the extraction of knowledge from
data. Data mining is the extraction of knowledge from data, via technologies that
incorporate these principles.
Example: how a hurricane can cause some products to be in higher or lower demand.
It would be more valuable to discover patterns due to the hurricane that were not obvious. To
do this, analysts might examine the huge volume of Wal-Mart data from prior, similar
situations to identify unusual local demand for products. From such patterns, the company
might be able to anticipate unusual demand for products and rush stock to the stores ahead
of the hurricane’s landfall.
Example: Predicting Customer Churn
Think carefully about what data you might use and how they would be used. Specifically,
how should MegaTelCo choose a set of customers to receive their offer in order to best
reduce churn for a particular incentive budget?
Data science involves Statistics, Machine Learning, Data Mining, and Knowledge Discovery.
Data Mining & Data Warehousing / Storage
Data warehouses coalesce data from across an enterprise, often from multiple
transaction-processing systems.
Querying / Reporting (SQL, Excel, QBE, other GUI-based querying)
It is a very flexible interface for asking factual questions about data
There is no modeling or sophisticated pattern finding
Most of the cool visualizations are here.
OLAP - On-line Analytical Processing
OLAP provides an easy-to-use GUI for exploring large data collections
Exploration is manual; no modeling
The dimensions of analysis are preprogrammed into the OLAP system
Figure 1-1
Figure 1-1 places data science in the context of various other closely related, data-oriented
processes in the organization. It distinguishes data science from other aspects of data
processing that are gaining increasing attention in business.
Data science involves principles, processes, and techniques for understanding phenomena
via the (automated) analysis of data.
Data-driven decision-making (DDD) refers to the practice of basing decisions on
the analysis of data, rather than purely on intuition.
They show that, statistically, the more data-driven a firm is, the more productive it
is, even controlling for a wide range of possible confounding factors.
DDD also is correlated with higher return on assets, return on equity, asset
utilization, and market value, and the relationship seems to be causal.
The diagram in Figure 1-1 shows data science supporting data-driven decision-making, but
also overlapping with data-driven decision-making. This highlights the often overlooked fact
that, increasingly, business decisions are being made automatically by computer systems.
Data science needs access to data and it often benefits from sophisticated data engineering
that data processing technologies may facilitate, but these technologies are not data science
technologies per se. They support data science, as shown in Figure 1-1, but they are useful
for much more.
Data processing technologies are very important for many data-oriented business tasks
that do not involve extracting knowledge or data-driven decision-making, such as efficient
transaction processing, modern web system processing, and online advertising campaign
management.
Big data essentially means datasets that are too large for traditional data processing
systems, and therefore require new processing technologies.
However, much more often the well-known big data technologies are used for data
processing in support of the data mining techniques and other data science activities,
as represented in Figure 1-1.
In Web 1.0, businesses busied themselves with getting the basic internet technologies in
place, so that they could establish a web presence, build electronic commerce capability,
and improve the efficiency of their operations.
In Web 2.0, new systems and companies began taking advantage of the interactive
nature of the Web. The changes brought on by this shift in thinking are pervasive;
the most obvious are the incorporation of social networking components
and the rise of the “voice” of the individual consumer.
The prior sections suggest one of the fundamental principles of data science:
data, and the capability to extract useful knowledge from data, should be regarded as
key strategic assets.
Too many businesses regard data analytics as pertaining mainly to realizing value from
some existing data, often without careful regard to whether the business has the
appropriate analytical talent.
Viewing these as assets allows us to think explicitly about the extent to which one
should invest in them.
Often, we don’t have exactly the right data to best make decisions and/or the right
talent to best support making decisions from the data.
Further, thinking of these as assets should lead us to the realization that they are
complementary.
Banks with bigger data assets may have an important strategic advantage over their
smaller competitors. If these trends generalize, and the banks are able to apply
sophisticated analytics, banks with bigger data assets should be better able to identify the
best customers for individual products. The net result will be either increased adoption of the
bank’s products, decreased cost of customer acquisition, or both.
The idea of data as a strategic asset is certainly not limited to Capital One, nor even to the
banking industry. Amazon was able to gather data on its online customers early, which has
created significant switching costs.
It is important to understand data science even if you never intend to do it yourself, because
data analysis is now so critical to business strategy.
Businesses increasingly are driven by data analytics, so there is great professional
advantage in being able to interact competently with and within such businesses.
Firms in many traditional industries are exploiting new and existing data resources for
competitive advantage.
On a scale less grand, but probably more common, data analytics projects reach into all
business units. Employees throughout these units must interact with the data science team.
If these employees do not have a fundamental grounding in the principles of
data-analytic thinking, they will not really understand what is happening in the business.
This lack of understanding is much more damaging in data science projects than in
other technical projects, because the data science is supporting improved decision
making.
This requires a close interaction between the data scientists and the business people
responsible for the decision-making. Firms where the business people do not
understand what the data scientists are doing are at a substantial disadvantage,
because they waste time and effort or, worse, because they ultimately make wrong
decisions.
This book is about the extraction of useful information and knowledge from large volumes of
data, in order to improve business decision-making.
Success in today’s data-oriented business environment requires being able to think about
how these fundamental concepts apply to particular business problems—to think data
analytically.
For example, in this chapter we discussed the principle that data should be thought
of as a business asset, and once we are thinking in this direction we start to ask
whether (and how much) we should invest in data.
Thus, an understanding of these fundamental concepts is important not only for data
scientists themselves, but for anyone working with data scientists, employing data
scientists, investing in data-heavy ventures, or directing the application of analytics in
an organization.
There is convincing evidence that data-driven decision-making and big data technologies
substantially improve business performance.
Data science supports data-driven decision-making—and sometimes conducts such
decision-making automatically—and depends upon technologies for “big data”
storage and engineering, but its principles are separate.
CHAPTER 2: Business Problems and Data Science Solutions
An important principle of data science is that data mining is a process with fairly
well-understood stages.
Some involve the application of information technology, such as the automated
discovery and evaluation of patterns from data, while others mostly require an
analyst’s creativity, business knowledge, and common sense.
Since the data mining process breaks up the overall task of finding patterns from data into
a set of well-defined subtasks, it is also useful for structuring discussions about data science.
Each data-driven business decision-making problem is unique, comprising its own
combination of goals, desires, constraints, and even personalities. In collaboration with
business stakeholders, data scientists decompose a business problem into subtasks. The
solutions to the subtasks can then be composed to solve the overall problem.
In many business analytics projects, we want to find “correlations” between a particular
variable describing an individual and other variables. For example, in historical data we may
know which customers left the company after their contracts expired. We may want to find
out which other variables correlate with a customer leaving in the near future. Finding such
correlations is among the most basic examples of classification and regression tasks.
1. Classification and class probability estimation attempt to predict, for each
individual in a population, which of a (small) set of classes this individual belongs to.
An example classification question would be: “Among all the customers of MegaTelCo,
which are likely to respond to a given offer?” In this
example the two classes could be called will respond and will not respond.
For a classification task, a data mining procedure produces a model that, given a new
individual, determines which class that individual belongs to.
- A closely related task is scoring or class probability estimation.
A scoring model applied to an individual produces, instead of a class prediction, a score
representing the probability (or some other quantification of likelihood) that that individual
belongs to each class.
- In our customer response scenario, a scoring model would be able to evaluate each
individual customer and produce a score of how likely each is to respond to the offer
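As one hedged illustration of the difference between classification and scoring, the sketch below uses a simple k-nearest-neighbors rule on made-up customer data; the features, labels, and the choice of k = 3 are all assumptions for the example, not the book's method:

```python
# Classification vs. class probability estimation (scoring) with a tiny
# k-nearest-neighbors rule. All data here are invented for illustration.

# (feature vector, label) pairs: e.g. (monthly_minutes, support_calls)
history = [((200.0, 1.0), "will respond"),
           ((180.0, 2.0), "will respond"),
           ((50.0, 8.0), "will not respond"),
           ((60.0, 7.0), "will not respond"),
           ((70.0, 6.0), "will not respond")]

def neighbors(x, k=3):
    """The k nearest historical individuals by Euclidean distance."""
    dist = lambda a: sum((ai - xi) ** 2 for ai, xi in zip(a, x)) ** 0.5
    return sorted(history, key=lambda pair: dist(pair[0]))[:k]

def classify(x, k=3):
    """Classification: assign the majority class among the k neighbors."""
    labels = [label for _, label in neighbors(x, k)]
    return max(set(labels), key=labels.count)

def score(x, k=3):
    """Scoring: estimate P(will respond) as the neighbor fraction."""
    labels = [label for _, label in neighbors(x, k)]
    return labels.count("will respond") / k

new_customer = (190.0, 2.0)
print(classify(new_customer))   # a class prediction
print(score(new_customer))      # a probability-like score instead
```

The same neighbors drive both outputs; the scoring version simply keeps the class proportion instead of collapsing it to a single label.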
2. Regression (“value estimation”) attempts to estimate or predict, for each
individual, the numerical value of some variable for that individual.
An example regression question would be: “How much will a given customer use the
service?” The property (variable) to be predicted
here is service usage, and a model could be generated by looking at other, similar
individuals in the population and their historical usage.
Informally, classification predicts whether something will happen, whereas regression
predicts how much something will happen.
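A minimal regression sketch, fitting a line by ordinary least squares to predict service usage from tenure; the numbers are invented and the single-feature setup is a simplification:

```python
# Regression ("value estimation"): predict a numeric target (usage) from
# tenure via ordinary least squares. Data are made up for illustration.

tenure = [1.0, 2.0, 3.0, 4.0, 5.0]            # months as a customer
usage = [110.0, 125.0, 140.0, 155.0, 170.0]   # minutes used per month

n = len(tenure)
mean_x = sum(tenure) / n
mean_y = sum(usage) / n
# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(tenure, usage))
         / sum((x - mean_x) ** 2 for x in tenure))
intercept = mean_y - slope * mean_x

def predict_usage(months):
    """Estimate the numeric target for a new individual."""
    return intercept + slope * months

print(predict_usage(6.0))  # -> 185.0 for this perfectly linear toy data
```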
3. Similarity matching attempts to identify similar individuals based on data known
about them. Similarity matching can be used directly to find similar entities.
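One simple way to operationalize similarity matching is Euclidean distance over feature vectors; the customers, features, and names below are illustrative assumptions:

```python
# Similarity matching: find the stored customer most similar to a new
# profile, using Euclidean distance as a simple similarity measure.

customers = {
    "ann": (200.0, 1.0, 30.0),   # e.g. minutes, support calls, texts
    "bob": (60.0, 7.0, 5.0),
    "cat": (150.0, 3.0, 20.0),
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def most_similar(profile):
    """Name of the stored customer closest to `profile`."""
    return min(customers, key=lambda name: distance(customers[name], profile))

print(most_similar((198.0, 1.2, 30.0)))
```

Other similarity measures (cosine similarity, Jaccard overlap) plug into the same structure; only `distance` changes.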
4. Clustering attempts to group individuals in a population together by their similarity,
but not driven by any specific purpose. An example clustering question would be: “Do
our customers form natural groups or segments?” Clustering is useful in preliminary
domain exploration to see which natural groups exist because these groups in turn
may suggest other data mining tasks or approaches.
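The segmentation question above can be sketched with a tiny k-means clustering in one dimension; the spend values, the choice of k = 2, and the initial centers are all assumptions for the example:

```python
# Clustering (unsupervised): group customers by monthly spend with a
# minimal 1-D k-means. Data and parameters are illustrative.

spend = [10.0, 12.0, 11.0, 95.0, 100.0, 98.0]

def kmeans_1d(values, centers, rounds=10):
    """Lloyd's algorithm in one dimension with fixed initial centers."""
    for _ in range(rounds):
        # assignment step: each value joins its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # update step: move each center to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers)   # a low-spend and a high-spend segment emerge
```

No target variable was used; the "segments" are whatever the similarity structure of the data yields, which is exactly why their usefulness is not guaranteed.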
5. Co-occurrence grouping (also known as frequent itemset mining, association rule
discovery, and market-basket analysis) attempts to find associations between entities
based on transactions involving them.
An example co-occurrence question would be: “What items are commonly purchased
together?” While clustering looks at similarity
between objects based on the objects’ attributes, co-occurrence grouping considers
similarity of objects based on their appearing together in transactions.
For example, analyzing purchase records from a supermarket may uncover that
ground meat is purchased together with hot sauce much more frequently than we
might expect.
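The "more frequently than we might expect" part is usually quantified as lift: how much more often two items co-occur than they would if purchased independently. The baskets below are invented to echo the example:

```python
# Co-occurrence grouping: count item pairs across transactions and
# compute lift against the independence baseline. Baskets are made up.
from collections import Counter
from itertools import combinations

baskets = [
    {"ground meat", "hot sauce", "buns"},
    {"ground meat", "hot sauce"},
    {"milk", "buns"},
    {"ground meat", "hot sauce", "milk"},
    {"milk", "eggs"},
]

n = len(baskets)
item_count = Counter(item for b in baskets for item in b)
pair_count = Counter(frozenset(p)
                     for b in baskets for p in combinations(sorted(b), 2))

def lift(a, b):
    """How much more often a and b co-occur than if bought independently."""
    p_a, p_b = item_count[a] / n, item_count[b] / n
    p_ab = pair_count[frozenset((a, b))] / n
    return p_ab / (p_a * p_b)

print(lift("ground meat", "hot sauce"))  # > 1 means "more than expected"
```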
6. Profiling (also known as behavior description) attempts to characterize the typical
behavior of an individual, group, or population. An example profiling question would
be: “What is the typical cell phone usage of this customer segment?”
Profiling is often used to establish behavioral norms for anomaly detection applications
such as fraud detection and monitoring for intrusions to computer systems.
7. Link prediction attempts to predict connections between data items, usually by
suggesting that a link should exist, and possibly also estimating the strength of the
link. Link prediction is common in social networking systems: “Since you and Karen
share 10 friends, maybe you’d like to be Karen’s friend?”
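The "shared friends" heuristic in that example is the common-neighbors score for link prediction; the tiny network below is an invented illustration:

```python
# Link prediction by common neighbors: score a candidate friendship by
# the number of shared friends. The network is illustrative.

friends = {
    "you":   {"ann", "bob", "cat", "dan"},
    "karen": {"ann", "bob", "cat", "eve"},
    "liam":  {"eve"},
}

def common_friends(a, b):
    """Predicted link strength = size of the shared neighborhood."""
    return len(friends[a] & friends[b])

# Suggest the strongest not-yet-connected candidate for "you"
candidates = [p for p in friends
              if p != "you" and p not in friends["you"]]
best = max(candidates, key=lambda p: common_friends("you", p))
print(best, common_friends("you", best))
```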
8. Data reduction attempts to take a large set of data and replace it with a smaller set
of data that contains much of the important information in the larger set. The smaller
dataset may be easier to deal with or to process. Moreover, the smaller dataset may
better reveal the information.
9. Causal modeling attempts to help us understand what events or actions actually
influence others.
For example, consider that we use predictive modeling to target advertisements to
consumers, and we observe that indeed the targeted consumers purchase at a higher rate
subsequent to having been targeted. Was this because the advertisements influenced the
consumers to purchase? Or did the predictive models simply do a good job of identifying
those consumers who would have purchased anyway?
Techniques for causal modeling include those involving a substantial investment in data,
such as randomized controlled experiments (e.g., so-called “A/B tests”), as well as
sophisticated methods for drawing causal conclusions from observational data.
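The core logic of an A/B test can be sketched in a few lines: because exposure is randomized, the difference in outcome rates estimates the causal effect of the ad rather than the targeting model's skill. The counts below are invented for illustration:

```python
# A/B test sketch for causal modeling: compare purchase rates between a
# randomly targeted group and a held-out control group. Counts invented.

targeted_purchases, targeted_n = 120, 1000   # saw the advertisement
control_purchases, control_n = 80, 1000      # randomly held out

rate_targeted = targeted_purchases / targeted_n
rate_control = control_purchases / control_n

# Random assignment means the groups differ only in ad exposure, so the
# rate difference estimates the ad's causal effect on purchasing.
estimated_lift = rate_targeted - rate_control
print(estimated_lift)
```

A real analysis would add a significance test on this difference; the point here is only the randomized-comparison structure.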
There are three types of Machine Learning:
1. Supervised Learning -> With a target.
2. Unsupervised Learning -> No target.
3. Reinforcement Learning -> No fixed target; a model learns from rewards obtained by
interacting with an environment.
Consider two similar questions we might ask about a customer population. The first is: “Do
our customers naturally fall into different groups?” Here no specific purpose or target has
been specified for the grouping.
When there is no such target, the data mining problem is referred to as
unsupervised.
Contrast this with a slightly different question: “Can we find groups of customers who have
particularly high likelihoods of canceling their service soon after their contracts expire?” Here
there is a specific target defined: will a customer leave when her contract expires?
In this case, segmentation is being done for a specific reason: to take action based
on likelihood of churn. This is called a supervised data mining problem.
If a specific target can be provided, the problem can be phrased as a supervised
one.
Clustering, an unsupervised task, produces groupings based on similarities, but there is no
guarantee that these similarities are meaningful or will be useful for any particular purpose.