Data mining for business analytics
Chapter 1
Business analytics (BA) is the practice and art of bringing quantitative data to bear on
decision making. It includes a range of data analysis methods.
The next level of analytics is business intelligence (BI). It refers to data visualization and
reporting for understanding what happened and what is happening.
Data mining refers to business analytics methods that go beyond counts, descriptive
techniques, reporting and methods based on business rules. Data mining methods can cope
with huge amounts of (big) data and extract value from it. Terms often used as synonyms for
data mining: predictive analytics, predictive modelling and machine learning.
Machine learning vs. statistics: they are not the same. Statistics focuses on inference about
the ‘average effect’ in a population, while machine learning focuses on predicting individual
records. With data mining there is a risk of overfitting: the model fits the sample so closely
that it captures noise as well as the underlying pattern.
Definition of machine learning in this book: algorithms that learn directly from data.
Definition of statistical models: methods that apply global structure to data. Many
practitioners use machine learning to refer to all the methods from this book.
Big data is a relative term. Its challenges are often described by four V’s: volume, velocity
(speed), variety, and veracity (the data is generated organically, so it is not subject to the
quality checks applied to data collected for a study).
Data science is a mix of skills in the areas of business, statistics, machine learning,
math, programming and IT. A data scientist is a rare individual who combines deep skills in all
constituent areas.
Chapter 2
The core of the book focuses on what is called predictive analytics: the tasks of
classification and prediction as well as pattern discovery, which have become key elements
of a business analytics function.
Core ideas in data mining: classification is perhaps the most basic form of business
analytics. A person pays or does not pay, responds or does not respond, etc. A common data
mining task is to examine data where the classification is unknown or will occur in the
future, with the goal of predicting that classification. Prediction is similar, except that we
are trying to predict the value of a numerical variable rather than a class (yes or no). In
these notes, prediction refers to predicting the value of a continuous variable.
Association rules, or affinity analysis, are designed to find general association patterns
between items in large databases.
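
As an illustration of what an association pattern looks like, here is a minimal sketch in
Python. The transactions are made up, and support and confidence are used as the usual
measures of rule strength; neither the data nor the function names come from the notes.

```python
from itertools import combinations

# Hypothetical market-basket transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Evaluate every rule "if x then y" between pairs of single items.
items = sorted(set().union(*transactions))
rules = {}
for a, b in combinations(items, 2):
    for x, y in ((a, b), (b, a)):
        rules[(x, y)] = (support({x, y}), confidence({x}, {y}))

for (x, y), (sup, conf) in sorted(rules.items()):
    print(f"if {x} then {y}: support={sup:.2f}, confidence={conf:.2f}")
```

In this toy data every basket with butter also contains bread, so the rule "if butter then
bread" has the highest confidence. Real applications work on far larger databases and
usually prune candidate itemsets by a minimum support threshold.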
Online recommendation systems (Amazon & Netflix) use collaborative filtering, a
method that uses an individual user’s preferences, inferred from history, behaviour, etc.,
together with the preferences of other users to generate recommendations.
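
A rough sketch of the idea, assuming a user-based variant with cosine similarity (one
common choice, not necessarily what Amazon or Netflix use). The users, items and ratings
are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 5, "B": 5, "D": 4},
    "cara": {"A": 1, "B": 2, "C": 5},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Items rated by the most similar other user that `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(user, u))
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend("ann"))
```

Here ann's ratings resemble bob's more than cara's, so the item bob rated that ann has not
seen becomes the recommendation. Production systems typically combine many neighbours
and weight them by similarity.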
Classification, prediction, and, to some extent, association rules and collaborative
filtering constitute the analytical methods employed in predictive analytics.
The process of consolidating a large number of records (or cases) into a smaller set is
called data reduction. Methods for reducing the number of cases are often called clustering.
Reducing the number of variables is called dimension reduction, which is a common step
before deploying supervised learning methods on the data.
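
To make the idea of reducing the number of cases concrete, here is a small sketch that
uses k-means, one possible clustering technique chosen for illustration only. The records,
the starting centroids and the helper names are all assumptions.

```python
# Hypothetical records: (annual spend, number of visits); values are made up.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Start from two of the records as initial centroids (an arbitrary choice).
centroids, clusters = k_means(records, centroids=[records[0], records[3]])
print(centroids)
```

The six records are summarized by two centroids, which is the sense in which clustering
reduces the number of cases.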
Exploration is one of the earliest stages of engaging with the data and is about
understanding the global landscape of the data and detecting unusual values. Methods are:
looking at different aggregations, checking individual values and the relationships between
them, and creating charts and dashboards (data visualization or visual analytics).
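
A minimal sketch of two of these exploration steps, aggregation and spotting unusual
values, using only the Python standard library. The sales records are invented, and the
"more than 5 median absolute deviations from the median" rule is just one hypothetical
rule of thumb.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sales records (region, amount); 950 is a planted unusual value.
sales = [("north", 100), ("north", 120), ("south", 90),
         ("south", 110), ("north", 105), ("south", 950)]

# Aggregation: average amount per region.
by_region = defaultdict(list)
for region, amount in sales:
    by_region[region].append(amount)
region_means = {region: mean(values) for region, values in by_region.items()}

# Unusual values: far from the median, measured in median absolute deviations.
amounts = [amount for _, amount in sales]
centre = median(amounts)
mad = median(abs(a - centre) for a in amounts)
unusual = [a for a in amounts if abs(a - centre) > 5 * mad]

print(region_means)
print(unusual)
```

The inflated average for the south region is a hint that something is off, and the
individual-value check points to the record responsible.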
Fundamental distinction among data mining techniques: supervised learning
algorithms are those used in classification and prediction. You need training data in which
the outcome is known, so the algorithm can ‘train’ and learn from it. Then you need
validation data to benchmark the model against other models, and after that you can apply
the model to cases where the outcome is unknown (example: a simple linear regression
model). Unsupervised learning algorithms are those used where there is no outcome variable
to predict or classify. Association rules, dimension reduction methods and clustering
techniques are examples of unsupervised methods.
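
The supervised workflow described above can be sketched with a simple linear regression
fitted by least squares. The data is simulated, and the 60/40 split, the seed and the
function names are assumptions made for this example.

```python
import random
from math import sqrt

# Hypothetical data: outcome y depends roughly linearly on a single predictor x.
random.seed(1)
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 1)) for x in range(40)]

# Partition: shuffle, then hold out 40% of the records as validation data.
random.shuffle(data)
cut = int(0.6 * len(data))
train, valid = data[:cut], data[cut:]

def fit_line(records):
    """Least-squares estimates of intercept and slope for y = b0 + b1 * x."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def rmse(records, b0, b1):
    """Root-mean-squared prediction error on the given records."""
    return sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in records) / len(records))

b0, b1 = fit_line(train)           # the algorithm "trains" on training data only
valid_error = rmse(valid, b0, b1)  # benchmark on records the model has not seen
new_prediction = b0 + b1 * 50      # score a new case whose outcome is unknown

print(f"slope={b1:.2f}, validation RMSE={valid_error:.2f}")
```

Comparing the validation error of several candidate models, rather than their training
error, is what makes the benchmark fair.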
List of steps to be taken in a typical data mining effort:
1. Develop an understanding of the purpose of the data mining project
2. Obtain the data set to be used in the analysis
3. Explore, clean and preprocess the data
4. Reduce the data dimension, if necessary
5. Determine the data mining task (classification, prediction, clustering etc.)
6. Partition the data (for supervised tasks)
7. Choose the data mining techniques to be used
8. Use algorithms to perform the tasks (iterative process)
9. Interpret the results of the algorithms
10. Deploy the model
These steps encompass the steps in the SEMMA methodology, developed by SAS: