Chapter 1 Introduction
1.1 What is Business Analytics?
Business analytics (BA) is the practice of bringing quantitative data to bear on decision-making. It includes a range of data analysis methods, from simple reporting to sophisticated statistical and machine-learning techniques.
Business intelligence (BI) refers to data visualization and reporting for understanding the past and the
present. It has evolved into effective tools and practices, such as creating interactive dashboards that
allow the user not only to access real-time data but also to directly interact with it.
Business analytics now typically includes BI as well as sophisticated data analysis methods used for
exploring relationships between measurements, predicting new records, and forecasting future values.
1.2 What is Data Mining?
Data mining refers to the statistical and machine-learning methods within business analytics that inform decision making, often in automated fashion. Prediction is typically an important component, often at the individual level.
1.3 Data Mining and Related Terms
Data mining stands at the confluence of the fields of statistics and machine learning (also known as
artificial intelligence). However, classical statistics developed when computing was costly and data were scarce, whereas in data mining applications both data and computing power are plentiful.
Another major difference is the focus in statistics on inference from a sample to the population. In
contrast, the focus in machine learning is on predicting individual records.
Data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available
sample of data that it describes not merely structural characteristics of the data but random
peculiarities as well.
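As a rough illustration of this danger, the following sketch (Python with NumPy; the data are simulated) fits polynomials of two different degrees to the same noisy sample and compares the error on the sample with the error on fresh data:

import numpy as np

# Simulated sample: the true structure is linear, plus random noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=x.size)
x_new = np.linspace(0, 1, 20)
y_new = 2 * x_new + rng.normal(scale=0.3, size=x_new.size)

for degree in (1, 9):
    coefs = np.polyfit(x, y, degree)                            # fit to the sample
    fit_err = np.mean((np.polyval(coefs, x) - y) ** 2)          # error on the sample
    new_err = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)  # error on fresh data
    print(degree, round(fit_err, 3), round(new_err, 3))
# The degree-9 fit hugs the sample (low fit_err) but typically does worse on
# fresh data, because it has modeled the random peculiarities as well.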
We use the term machine learning to refer to algorithms that learn directly from data, especially
local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to
methods that apply global structure to the data.
1.4 Big Data
Big data is a relative term: the amount of data is judged by reference to the past and to the methods and devices available to deal with them. The challenge big data presents is often characterized by the four V’s:
- Volume refers to the amount of data.
- Velocity refers to the speed at which it is being generated and changed.
- Variety refers to the different types of data being generated.
- Veracity refers to the fact that data are generated by organic, distributed processes and are not subject to the controls or quality checks that apply to data collected for a study.
Most large organizations face both the challenge and the opportunity of big data because most
routine data processes now generate data that can be stored and, possibly, analyzed.
1.5 Data Science
Data science is a mix of skills in the areas of statistics, machine learning, math, programming,
business, and IT. However, it is a rare individual who combines deep skills in all the constituent areas.
This book focuses on developing the statistical and machine learning models that will eventually be
plugged into a deployed system.
1.6 Why Are There So Many Different Methods?
The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis.
Different methods can lead to different results, and their performance can vary. It is therefore
customary in data mining to apply several different methods and select the one that appears most
useful for the goal at hand.
Chapter 2 Overview of the Data Mining Process
2.1 Introduction
This book focuses on predictive analytics, the tasks of classification and prediction as well as pattern
discovery. Not covered are OLAP (online analytical processing) and SQL (structured query language), since they do not involve statistical modeling or automated algorithmic methods.
2.2 Core Ideas in Data Mining
Classification
A common task in data mining is to predict the value of a categorical variable (e.g. the recipient of an
offer can respond or not respond). Similar data where the classification is known are used to develop
rules, which are then applied to the data with the unknown classification.
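A minimal sketch of this train-then-apply pattern, assuming scikit-learn and made-up offer-response data:

from sklearn.linear_model import LogisticRegression

# Made-up offer-response data: predictors are [age, past purchases],
# outcome is 1 = responded, 0 = did not respond.
X_train = [[25, 1], [40, 6], [33, 2], [52, 8], [29, 0], [47, 5]]
y_train = [0, 1, 0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X_train, y_train)            # learn classification rules from labeled records

X_new = [[38, 4]]                    # a record whose class is unknown
print(clf.predict(X_new))            # predicted class, 0 or 1
print(clf.predict_proba(X_new))      # estimated class probabilities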
Prediction
Here, we are trying to predict the value of a numerical variable (e.g. amount of purchase).
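The same pattern applies to numerical outcomes; a minimal sketch, again assuming scikit-learn and made-up purchase data:

from sklearn.linear_model import LinearRegression

# Made-up data: predictors are [income, store visits per month],
# outcome is the purchase amount in dollars.
X_train = [[40, 2], [55, 4], [70, 3], [30, 1], [90, 6]]
y_train = [120.0, 210.0, 180.0, 80.0, 340.0]

reg = LinearRegression().fit(X_train, y_train)
print(reg.predict([[60, 3]]))        # predicted purchase amount for a new record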
Association Rules and Recommendation Systems
Association rule analysis, or affinity analysis, is designed to find general association patterns between items in large databases (“what goes with what”). For example, grocery stores can use such information for product bundling, and the same idea can help predict future symptoms for returning patients.
Online recommendation systems (e.g. Netflix) use collaborative filtering, which is a method that
generates “what goes with what” at the individual user level. Recommendation systems aim to
deliver personalized recommendations to users with a wide range of preferences.
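At its core, affinity analysis counts which items co-occur across many transactions. The following plain-Python sketch illustrates the idea on made-up basket data; real systems use dedicated algorithms such as Apriori to scale this to large databases:

from itertools import combinations
from collections import Counter

# Made-up market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", round(count / n, 2))   # fraction of baskets with the pair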
Predictive Analytics
Classification, prediction, association rules, and collaborative filtering constitute the analytical methods employed in predictive analytics.
Data Reduction and Dimension Reduction
The performance of data mining algorithms is often improved when the number of variables is
limited, and when large numbers of records can be grouped into homogeneous groups. The process
of consolidating a large number of records into a smaller set is termed data reduction. Methods for
reducing the number of cases are often called clustering.
Reducing the number of variables is typically called dimension reduction, which improves predictive
power, manageability, and interpretability.
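Both forms of reduction can be sketched with scikit-learn on simulated data: k-means clustering groups records into homogeneous clusters (data reduction), and principal components analysis replaces many variables with a few new ones (dimension reduction):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # simulated: 200 records, 10 variables

# Data reduction: group the 200 records into 3 homogeneous clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])           # cluster membership of the first 10 records

# Dimension reduction: replace the 10 variables with 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)               # (200, 2)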
Data Exploration and Visualization
Exploration is used for data cleaning and manipulation as well as for visual discovery and hypothesis
generation.
Exploration by creating charts and dashboards is called data visualization or visual analytics. For
numerical variables we use histograms and boxplots to learn about the distribution of their values, to
detect outliers, and to find other information that is relevant to the analysis. Similarly, for categorical
variables we use bar charts.
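A minimal sketch of these chart types, assuming pandas and matplotlib and a made-up dataset with one numerical and one categorical variable:

import pandas as pd
import matplotlib.pyplot as plt

# Made-up dataset with one numerical and one categorical variable.
df = pd.DataFrame({
    "purchase": [120, 80, 95, 300, 150, 110, 90, 500, 130, 105],
    "segment":  ["A", "B", "A", "C", "B", "A", "B", "C", "A", "B"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["purchase"].plot.hist(ax=axes[0], title="Histogram")        # distribution of values
df["purchase"].plot.box(ax=axes[1], title="Boxplot")           # spot outliers
df["segment"].value_counts().plot.bar(ax=axes[2], title="Bar chart")
plt.tight_layout()
plt.show()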
Supervised and Unsupervised Learning
For supervised learning algorithms we must have data available in which the value of the outcome of
interest is known. These training data are the data from which the algorithm “learns” or is “trained”
about the relationship between predictor variables and the outcome variable. The algorithm is then
applied to the validation data where the outcome is known, to see how well it does in comparison to
other models. It is prudent to save a third sample with known outcomes (the test data) to assess how well the finally selected model will perform. The model can then be
used to classify or predict the outcome of interest in new cases where the outcome is unknown.
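A common way to create the three partitions is to split the data twice; a minimal sketch assuming scikit-learn, with simulated data standing in for real records:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # simulated predictors
y = rng.integers(0, 2, size=1000)         # simulated known outcomes

# First set aside the test data, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_valid), len(X_test))   # 600 200 200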
Unsupervised learning algorithms are those used where there is no outcome variable to predict or
classify. Association rules, dimension reduction methods, and clustering techniques are all
unsupervised learning methods.
2.3 The Steps in Data Mining
Here is a list of the steps taken in a typical data mining effort:
1. Develop an understanding of the purpose of the data-mining project. The most serious errors in
analytics projects result from a poor understanding of the problem.
2. Obtain the dataset to be used in the analysis. This often involves random sampling from a large
database. It may also involve pulling together data from different databases or sources. The
databases could be internal (e.g. past purchases made by customers) or external (e.g. credit
ratings).
3. Explore, clean, and preprocess the data. This step involves verifying that the data are in reasonable condition (e.g. checking for missing data, reasonable ranges of values, outliers, and consistency in the definitions of fields); steps 2 and 3 are illustrated in the sketch after this list.
4. Reduce the data dimension, if necessary. Dimension reduction can involve operations such as
eliminating unneeded variables, transforming variables, and creating new variables.
5. Determine the data-mining task. This involves translating the general question or problem into a
more specific data-mining question.
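As noted in step 3, here is a minimal pandas sketch of steps 2 and 3; the file name, sample size, and the 'age' range check are hypothetical:

import pandas as pd

# Hypothetical customer file; in practice the data would come from one or
# more internal or external databases.
df = pd.read_csv("customers.csv")

# Step 2: random sampling from a large dataset.
sample = df.sample(n=1000, random_state=1)

# Step 3: basic exploration and cleaning checks.
print(sample.isna().sum())                        # missing values per field
print(sample.describe())                          # ranges, to spot impossible values
sample = sample.drop_duplicates()                 # remove duplicate records
sample = sample[sample["age"].between(0, 120)]    # hypothetical range check on an "age" field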