Data mining reading material
Chapter 1: Introduction
Data mining is the process of automatically discovering useful information in large data repositories.
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information. This process consists of a series of steps, from
data preprocessing to postprocessing of data mining results. The purpose of preprocessing is to
transform the raw input data into an appropriate format for subsequent analysis. An example of
postprocessing is visualization, which allows analysts to explore the data and the data mining results
from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing
to eliminate spurious data mining results.
Specific challenges that motivated the development of data mining:
- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis
Data mining researchers draw upon ideas such as (1) sampling, estimation, and hypothesis testing
from statistics and (2) search algorithms, modelling techniques, and learning theories from artificial
intelligence, pattern recognition, and machine learning.
Data mining tasks are generally divided into two major categories:
- Predictive tasks
o The objective of these tasks is to predict the value of a particular attribute based on
the values of other attributes. The attribute to be predicted is commonly known as
the target or dependent variable, while the attributes used for making the prediction
are known as the explanatory or independent variables.
- Descriptive tasks
o Here, the objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data
mining tasks are often exploratory in nature and frequently require postprocessing
techniques to validate and explain the results.
Predictive modelling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modelling tasks:
- Classification – used for discrete target variables
- Regression – used for continuous target variables
The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
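As a rough illustration of what "error" can mean for each task, the sketch below computes a misclassification rate for a discrete target and a mean squared error for a continuous target. The function names, the choice of these two error measures, and the toy values are assumptions made for this example; they do not come from the reading material, and other error measures are possible.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of objects whose predicted class differs from the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("inputs must have the same length")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between predicted and true values."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(true_values)


# Hypothetical toy data, invented for illustration only.
# Classification: discrete target, one of four predictions is wrong.
class_error = misclassification_rate(
    ["yes", "no", "yes", "no"],
    ["yes", "no", "no", "no"],
)

# Regression: continuous target, the third prediction is off by 2.
regression_error = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

print(class_error)       # share of misclassified objects
print(regression_error)  # mean of the squared differences
```

In both cases a lower value means the model's predictions are closer to the true values of the target variable.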
Association analysis is used to discover patterns that describe strongly associated features in the
data.
Cluster analysis seeks to find groups of closely related observations so that observations that belong
to the same cluster are more similar to each other than to observations that belong to other clusters.
Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The goal of
an anomaly detection algorithm is to discover the real anomalies and avoid falsely labelling normal
objects as anomalous.
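One very simple way to make this concrete is a z-score rule: flag a value when it lies more than a chosen number of standard deviations from the mean. This is only a sketch under stated assumptions; the z-score rule, the function name flag_anomalies, the threshold of 2.0, and the sample readings are all chosen for illustration and are not taken from the reading material.

```python
from statistics import mean, pstdev


def flag_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    The z-score rule and the default threshold are assumptions for this sketch.
    """
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical measurements: six similar values and one very different one.
readings = [10, 11, 9, 10, 12, 10, 50]
anomalies = flag_anomalies(readings)
print(anomalies)
```

The threshold controls the trade-off mentioned above: a lower threshold catches more real anomalies but also labels more normal objects as anomalous.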
Chapter 2: Data
The Type of Data: Data sets differ in a number of ways. The type of data determines which tools and
techniques can be used to analyse the data.
The Quality of the Data: Data is often far from perfect. Data quality issues that often need to be
addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and
data that is biased or, in some other way, unrepresentative of the phenomenon or population that
the data is supposed to describe.
Preprocessing Steps to Make the Data More Suitable for Data Mining: Often, the raw data must be
processed in order to make it suitable for analysis.
Analysing Data in Terms of Its Relationships: One approach to data analysis is to find relationships
among the data objects and then perform the remaining analysis using these relationships rather
than the data objects themselves.
2.1 Types of Data
A data set can often be viewed as a collection of data objects. In turn, data objects are described by a
number of attributes that capture the characteristics of an object.
An attribute is a property or characteristic of an object that can vary, either from one object to
another or from one time to another. A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object. Formally, the process of measurement is
the application of a measurement scale to associate a value with a particular attribute of a specific
object.
It is common to refer to the type of an attribute as the type of a measurement scale.
The following properties (operations) of numbers are typically used to describe attributes:
- Distinctness: = and ≠
- Order: <, ≤, >, and ≥
- Addition: + and -
- Multiplication: × and /
Given these properties, we can define four types of attributes:
- Categorical (qualitative)
o Nominal
The values of a nominal attribute are just different names.
They provide only enough information to distinguish one object from another (distinctness).
Transformation: any one-to-one mapping.
o Ordinal
The values provide enough information to order objects (order).
Transformation: an order-preserving change of values.
- Numeric (quantitative)
o Interval
The differences between values are meaningful, i.e., a unit of measurement
exists (addition).
Transformation: new_value = a × old_value + b, where a and b are constants.
o Ratio
Both differences and ratios are meaningful (multiplication).
Transformation: new_value = a × old_value, where a is a constant.
Each attribute type possesses all of the properties and operations of the attribute types above it.
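The permissible transformations can be checked numerically. The sketch below is an illustration under assumptions: it uses temperature in Celsius as an interval attribute and length in metres as a ratio attribute, and the function names and sample values are invented for the example. It shows that a transformation of the form a × old_value + b changes ratios of values but keeps ratios of differences, while a transformation of the form a × old_value keeps ratios of values.

```python
def interval_transform(value, a, b):
    """new_value = a * old_value + b (allowed for interval attributes)."""
    return a * value + b


def ratio_transform(value, a):
    """new_value = a * old_value (allowed for ratio attributes)."""
    return a * value


# Interval attribute (assumed example): Celsius to Fahrenheit, a = 1.8, b = 32.
c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (interval_transform(c, 1.8, 32.0) for c in (c1, c2, c3))

ratio_celsius = c2 / c1        # 20 / 10
ratio_fahrenheit = f2 / f1     # 68 / 50, so the ratio of values is not preserved

# Ratios of differences survive the transformation.
diff_ratio_celsius = (c3 - c2) / (c2 - c1)
diff_ratio_fahrenheit = (f3 - f2) / (f3 - f2 if False else (f2 - f1))

# Ratio attribute (assumed example): metres to centimetres, a = 100.
m1, m2 = 2.0, 5.0
cm1, cm2 = ratio_transform(m1, 100.0), ratio_transform(m2, 100.0)
ratio_metres = m2 / m1
ratio_centimetres = cm2 / cm1  # the ratio of values is preserved

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
print(ratio_metres, ratio_centimetres)
```

This is why a statement such as "twice as long" is meaningful for a length, whereas "twice as hot" is not meaningful for a temperature measured in Celsius or Fahrenheit.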
An independent way of distinguishing between attributes is by the number of values they can take.
- Discrete – a discrete attribute has a finite or countably infinite set of values.
- Binary – binary attributes are a special case of discrete attributes and assume only two values,
e.g., true/false, yes/no, male/female, or 0/1.
- Continuous – a continuous attribute is one whose values are real numbers. Practically, real
values can be measured and represented only with limited precision.
Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are
continuous. However, count attributes, which are discrete, are also ratio attributes.
For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
It is also possible to have discrete or continuous asymmetric features.
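To see why the distinction matters, consider comparing two objects described by asymmetric binary attributes. The sketch below contrasts a measure that counts every agreement, including 0-0 matches, with one that looks only at presences (the Jaccard coefficient). The choice of these two measures, the function names, and the two example vectors are assumptions made for illustration; the reading material does not prescribe them at this point.

```python
def simple_matching(x, y):
    """Fraction of positions where the two binary vectors agree (0-0 matches count)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def presence_only_similarity(x, y):
    """Shared presences divided by positions where at least one vector is non-zero.

    0-0 matches are ignored; this is the Jaccard coefficient.
    """
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


# Hypothetical objects: ten binary attributes each, mostly zero.
x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

all_matches = simple_matching(x, y)             # dominated by shared zeros
presence_only = presence_only_similarity(x, y)  # only one shared presence

print(all_matches, presence_only)
```

When most values are 0, counting 0-0 agreements makes almost any two objects look similar, which is the practical reason for treating only non-zero values as important.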
Types of data sets
For convenience, we have grouped the types of data sets into three groups: record data, graph-based
data, and ordered data.
Before providing details of specific kinds of data sets, we discuss three characteristics that apply to
many data sets and have a significant impact on the data mining techniques that are used:
- Dimensionality
o The number of attributes that the objects in the data set possess. Analysing data with
a small number of dimensions tends to be qualitatively different from analysing
moderate or high-dimensional data. Indeed, the difficulties associated with the
analysis of high-dimensional data are sometimes referred to as the curse of
dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.
- Distribution
o The frequency of occurrence of various values or sets of values for the attributes
comprising data objects. For example, suppose a categorical attribute is used as a
class variable, where one of the categories occurs 95% of the time, while the other
categories together occur only 5% of the time. This skewness in the distribution can
make classification difficult. A special case of skewed data is sparsity. For sparse
binary, count or continuous data, most attributes of an object have values of 0. In
many cases, fewer than 1% of the values are non-zero. In practical terms, sparsity is
an advantage because usually only the non-zero values need to be stored and
manipulated.
- Resolution