Summary of book "Data Mining for Business Analytics - concepts, techniques and applications in PYTHON" (GALIT SHMUELI, PETER C. BRUCE, PETER GEDECK, NITIN R. PATEL). ISBN: 9840.
It consists of all chapters for the 2019/2020 exam: 1, 2, 3, 5, 6, 7, 8, 9, 14, and 15.
Business Analytics is an emerging discipline helping us ride the new wave of data. It requires
individuals (1) to be grounded in the fundamentals of business so they know the right questions to
ask, (2) to have the ability to harness, store, and optimally process vast datasets from a variety of
structured and unstructured sources, and (3) to be able to uncover new insights for decision making.
Chapter 1: Introduction
1.1 What is Business Analytics (BA)?
BA (or more generically: analytics) is the practice and art of bringing quantitative data to bear on
decision making. It includes a range of data analysis methods. The next level is termed business
intelligence (BI): data visualization and reporting for understanding ‘what happened and what is
happening’. This is done with charts, tables, and dashboards that display, examine, and
explore data. BI offers user-friendly, effective tools and practices, such as interactive dashboards.
Effective dashboards tie directly into company data, and give managers a tool to quickly see what
might not readily be apparent in a large complex database.
BA now typically includes BI as well as sophisticated data analysis methods, such as statistical models
and data mining algorithms for exploring data, quantifying and explaining relationships between
measurements, and predicting new records. Methods like regression models are used to describe
and quantify ‘on average’ relationships, to predict new records, and to forecast future values. Today,
BI refers to data visualization and reporting, whereas BA denotes advanced analytics.
A successful use of analytics and data mining requires both (1) an understanding of the business
context where value is to be captured and (2) an understanding of what the data mining methods do.
1.2 What is data mining?
Data mining refers to business analytics methods that go beyond counts, descriptive techniques,
reporting, and methods based on business rules. Data visualization is commonly the first step into
more advanced analytics, but we focus more on statistical and machine-learning methods that
inform decision making, often in automated fashion. Prediction is an important component. Big data
has accelerated the use of data mining. Data mining methods are powerful and largely automated:
they can cope with huge amounts of data and extract value from it.
1.3 Data mining and related terms
The field of analytics is growing rapidly, both in the breadth of applications and in the number of
organizations using it. This has led to overlapping and inconsistent definitions.
Data mining stands at the confluence of the fields of statistics and machine learning (also known as
AI). However, the core tenets of classical statistics – computing is difficult and data are scarce – do
not apply in data mining applications where both data and computing power are plentiful. Daryl
Pregibon describes data mining as: “statistics at scale and speed”. Another difference between
statistics and data mining is the focus: statistics concentrates on inference from a sample to the
population regarding an ‘average effect’, whereas machine learning concentrates on predicting
individual records. As a result, the general approach to data mining is vulnerable to the danger of
overfitting, where a model is fit so closely to the available sample of data that it describes not
merely structural characteristics of the data but random peculiarities as well. The model is fitting
the noise, not just the signal.
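The overfitting idea above can be made concrete with a small sketch. Everything in it is an assumption chosen for illustration rather than something taken from the book: the data are simulated (signal y = 2x plus random noise), a least-squares line stands in for a model with simple global structure, and a 1-nearest-neighbour "memorizer" stands in for a model that fits the training sample as closely as possible.

```python
import random

def fit_line(xs, ys):
    """Least-squares line y = a + b*x: a model with simple global structure."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_memorizer(xs, ys):
    """1-nearest neighbour: returns the y of the closest training x, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a set of records."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n):
    """Hypothetical data: signal y = 2x plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 3) for x in xs]
    return xs, ys

rng = random.Random(1)                    # fixed seed so runs are repeatable
train_x, train_y = make_data(rng, 30)     # sample used to fit the models
hold_x, hold_y = make_data(rng, 30)       # new records, not used in fitting

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "training MSE:", round(mse(model, train_x, train_y), 2),
          "holdout MSE:", round(mse(model, hold_x, hold_y), 2))
```

The memorizer reproduces the training sample exactly, so its training error is zero, yet its error on the new records is not: the noise it memorized does not recur. Comparing the two rows of output shows why performance on the fitting sample alone is a poor guide.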
The term machine learning will be used to refer to algorithms that learn directly from data,
especially local patterns. The term statistical models is used to refer to methods applying global
structure to the data.
1.4 Big data
The challenge big data presents is characterized by the 4 V’s: volume (the amount of data), velocity
(the flow rate – the speed at which it’s being generated and changed), variety (the different types of
data), and veracity. Veracity refers to the fact that data is being generated by organic distributed
processes and not subject to the controls or quality checks that apply to data collected for a study.
Most large organizations face both the challenge and the opportunity of big data, because most
routine data processes now generate data to be stored and possibly analysed. Some valuable tasks
weren’t even feasible before the era of big data.
1.5 Data science
The ubiquity, size, value, and importance of big data have given rise to the data scientist. Data
science is a mix of skills in the areas of statistics, machine learning, math, programming, business,
and IT. The skillset of most data scientists resembles a “T”: deep in one area, and shallower in the
others. Although big data is the motivating power behind the growth of data science, most data
scientists do not actually spend most of their time working with terabyte-size or larger data. Data of
that size would be involved at the deployment stage of a model.
1.6 Why are there so many different methods?
There are many different methods for prediction and classification, each having its advantages and
disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the
types of patterns existing in the data, whether the data meet some underlying assumptions of the
method, how noisy the data are, and the particular goal of the analysis. Different methods lead to
different results, and their performance can vary. Therefore, it’s customary in data mining to apply
several different methods and select the one that appears most useful for the goal at hand.
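As a rough sketch of "apply several methods and keep the most useful one", the example below fits three simple classifiers to the same training records and compares their accuracy on separate validation records. The data, the three candidate methods, and the use of accuracy as the yardstick are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records: (x, class). The training record (4, 1) is deliberate noise.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
valid = [(1.5, 0), (2.5, 0), (3.8, 0), (4.4, 0),
         (5.6, 1), (6.5, 1), (7.5, 1), (8.5, 1)]

def fit_majority(records):
    """Naive rule: always predict the most frequent training class."""
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    return lambda x: majority

def fit_threshold(records):
    """Cut halfway between the two class means (assumes class 1 has the larger x)."""
    xs0 = [x for x, label in records if label == 0]
    xs1 = [x for x, label in records if label == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > cut else 0

def fit_nearest(records):
    """1-nearest neighbour: copy the class of the closest training record."""
    return lambda x: min(records, key=lambda r: abs(r[0] - x))[1]

def accuracy(model, records):
    """Share of records whose class the model predicts correctly."""
    return sum(model(x) == label for x, label in records) / len(records)

candidates = {
    "naive majority rule": fit_majority,
    "threshold rule": fit_threshold,
    "1-nearest neighbour": fit_nearest,
}

# Fit every candidate on the training records, score it on the validation records.
scores = {name: accuracy(fit(train), valid) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)

print(scores)
print("selected:", best_name)
```

On these particular records the threshold rule wins because the nearest-neighbour method follows the noisy training record; with other data or another goal, a different method could come out ahead.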
1.7 Terminology and notation
Algorithm A specific procedure used to implement a particular data mining technique: classification
tree, discriminant analysis, etc.
Holdout data (or holdout set) A sample of data not used in fitting a model, but instead used to assess the
performance of that model. Synonyms: validation set, test set.
Model An algorithm as applied to a dataset, complete with its settings.
Observation Unit of analysis on which the measurements are taken. Synonyms: instance, sample,
example, case, record, pattern, or row.
Profile A set of measurements on an observation (height, weight, age).
Prediction The prediction of the numerical value of a continuous output variable. Synonym: estimation.
Predictor A variable (X) used as an input into a predictive model. Synonyms: feature, input variable,
independent variable (IV), or field.
Response A variable (Y) which is the variable being predicted in supervised learning. Synonyms: DV,
output variable, target variable, or outcome variable.
Sample Collection of observations. In the machine learning community: a single observation.
Score A predicted value or class. “Scoring new data” means using a model developed with training
data to predict output values in new data.
Success class The class of interest in a binary outcome.
Supervised learning Process of providing an algorithm with records in which an output variable of interest is
known and the algorithm ‘learns’ how to predict this value with new records where the
output is unknown.
Test data The portion of the data used only at the end of the model building and selection process to
assess how well the final model might perform on new data. Synonym: test set.
Training data The portion of the data used to fit a model. Synonym: training set.
Unsupervised learning An analysis in which one attempts to learn patterns in the data other than predicting an
output value of interest.
Validation data The portion of the data used to assess how well the model fits, to adjust models, and to
select the best model from among those that have been tried. Synonym: validation set.
Variable Any measurement on the records, including both input (X) variables and the output (Y)
variable.
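The training, validation, and test data defined in the list above are usually obtained by randomly partitioning the records. A minimal sketch follows; the function name `partition`, the 50/30/20 proportions, and the seed are assumptions for illustration, not values prescribed by the book.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test data.

    The proportions are illustrative defaults; whatever remains after the
    training and validation shares becomes the test data.
    """
    shuffled = list(records)                 # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                     # used to fit models
    valid = shuffled[n_train:n_train + n_valid]    # used to compare and tune models
    test = shuffled[n_train + n_valid:]            # used once, on the final model
    return train, valid, test

records = list(range(100))                   # stand-in for 100 observations
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the test data stay untouched until the final model has been chosen.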
Chapter 2: Overview of the data mining process
The general steps involved in data mining are as follows:
2.1 Introduction
The core of this book focuses on what has come to be called predictive analytics: “the tasks of
classification and prediction as well as pattern discovery, which have become key elements of a
‘business analytics’ function in most large firms.” Sometimes considered to be data mining
techniques, but not covered in this book, are OLAP (Online Analytical Processing) and SQL. OLAP and
SQL searches on databases are descriptive in nature and based on business rules set by users: they
do not involve statistical modelling or automated algorithmic methods. However, SQL queries are
often used to obtain the data in data mining.
2.2 Core ideas in data mining
Classification and prediction
Classification is perhaps the most basic form of data analysis: a transaction can be normal or
fraudulent, a bus is available for service or unavailable. A common task in data mining is to examine
data where the classification is unknown or will occur in the future, with the goal of predicting what
that classification is or will be. Data with a known classification are used to develop rules, which are
then applied to the data with the unknown classification. Prediction (or estimation/regression) is
similar to classification, but we are trying to predict the value of a numerical (or continuous)
variable, rather than a class.
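The parallel between classification and prediction can be sketched with one simple method used both ways. The example below is an illustrative stand-in, not the book's code: the records are made up, and k-nearest neighbours is picked only because the same neighbours can either vote on a class or be averaged into a numerical value.

```python
from collections import Counter

# Hypothetical records with known outcomes: (income, class, spending).
known = [
    (20, "nonowner", 100), (30, "nonowner", 150), (40, "nonowner", 210),
    (60, "owner", 300), (70, "owner", 340), (80, "owner", 410),
]

def nearest(income, k=3):
    """The k known records whose income is closest to the new record's income."""
    return sorted(known, key=lambda r: abs(r[0] - income))[:k]

def classify(income, k=3):
    """Classification: the outcome is a class, decided by majority vote."""
    votes = Counter(r[1] for r in nearest(income, k))
    return votes.most_common(1)[0][0]

def predict(income, k=3):
    """Prediction: the outcome is a number, estimated by averaging."""
    neighbours = nearest(income, k)
    return sum(r[2] for r in neighbours) / len(neighbours)

# A new record whose class and spending are unknown.
print(classify(65))   # a class label
print(predict(65))    # a numerical value
```

Both functions learn from records with known outcomes and apply that to a record with an unknown outcome; only the type of the outcome differs.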
Association rules and recommendation systems
Large datasets lend themselves to analysis of associations among items: what goes with what.
Association rule analysis (or affinity analysis) is designed to find such general association patterns
between items in large databases. The rules can then be used in a variety of ways. Online
recommendation systems, such as those used by Netflix or Amazon, use collaborative filtering: a
method that uses an individual user’s preferences and tastes, as revealed by their historic purchases,
ratings, or any other measurable behaviour indicative of preference, together with other users’
histories. Whereas association rules apply to an entire population, collaborative filtering generates
‘what goes with what’ at the individual user level, and is used mainly to deliver personalized
recommendations.
Classification, prediction, and (to some extent) association rules and collaborative filtering constitute
the analytical methods employed in predictive analytics.
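To make ‘what goes with what’ concrete, the sketch below scores a candidate association rule on a handful of made-up transactions. Support and confidence are the standard measures for such rules, but they are not named in the text above, so treat their use here, along with the data and the function names, as assumptions for illustration.

```python
# Hypothetical market-basket data: each transaction is the set of items bought.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Share of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the share that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)

# Rule under consideration: "if bread, then milk".
print(support({"bread", "milk"}))        # 0.6
print(confidence({"bread"}, {"milk"}))   # 0.75
```

The rule is computed over all transactions at once, which is what makes it a population-level pattern rather than a recommendation tailored to one user.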