Summary Data Mining - All reading material

Course
Data Mining (NWIIBI008)

Institution
Radboud Universiteit Nijmegen (RU)

This is a summary of the book you need to study for the exam. It covers all of the material.

Uploaded on 19 October 2024
Number of pages: 23
Written in 2023/2024
Type: Summary

Uploaded by donjaschipper

Data mining reading material

Chapter 1: Introduction

Data mining is the process of automatically discovering useful information in large data repositories.
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information. This process consists of a series of steps, from
data preprocessing to postprocessing of data mining results. The purpose of preprocessing is to
transform the raw input data into an appropriate format for subsequent analysis. An example of
postprocessing is visualization, which allows analysts to explore the data and the data mining results
from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing
to eliminate spurious data mining results.

Specific challenges that motivated the development of data mining

- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis

Data mining researchers draw upon ideas, such as (1) sampling, estimation, and hypothesis testing
from statistics and (2) search algorithms, modelling techniques, and learning theories from artificial
intelligence, pattern recognition, and machine learning.

Data mining tasks are generally divided into two major categories:

- Predictive tasks
o The objective of these tasks is to predict the value of a particular attribute based on
the values of other attributes. The attribute to be predicted is commonly known as
the target or dependent variable, while the attributes used for making the prediction
are known as the explanatory or independent variables.
- Descriptive tasks
o Here, the objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data
mining tasks are often exploratory in nature and frequently require postprocessing
techniques to validate and explain the results.

Predictive modelling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modelling tasks:

- Classification – used for discrete target variables
- Regression – used for continuous target variables

The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
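
As a minimal sketch of what "error between the predicted and true values" can mean, the snippet below uses two common measures: the misclassification rate for a discrete target and the mean squared error for a continuous one. The choice of these two measures and the function names are assumptions made for illustration; the summary does not prescribe a specific error function.

```python
def misclassification_rate(y_true, y_pred):
    """Classification error: fraction of predicted labels that differ from the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Regression error: average squared difference between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Discrete target (classification): one of four labels is predicted wrongly.
labels_true = ["spam", "ham", "ham", "spam"]
labels_pred = ["spam", "ham", "spam", "spam"]
print(misclassification_rate(labels_true, labels_pred))

# Continuous target (regression): only the last prediction is off, by 2.0.
values_true = [1.0, 2.0, 3.0]
values_pred = [1.0, 2.0, 5.0]
print(mean_squared_error(values_true, values_pred))
```

A learning algorithm would search for the model whose predictions make such a measure as small as possible on the available data.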

Association analysis is used to discover patterns that describe strongly associated features in the
data.

Cluster analysis seeks to find groups of closely related observations so that observations that belong
to the same cluster are more similar to each other than observations that belong to other clusters.

Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The goal of
an anomaly detection algorithm is to discover the real anomalies and avoid falsely labelling normal
objects as anomalous.
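
To make the idea concrete, here is a small sketch that flags values lying far from the mean, measured in standard deviations (a z-score rule). This is only one simple technique chosen for illustration, not a method named in the summary, and the threshold of 2.5 is an arbitrary assumption.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=2.5):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_anomalies(readings))
```

The threshold controls the trade-off mentioned above: a lower value finds more true anomalies but also labels more normal objects as anomalous. Note that an extreme value inflates the mean and standard deviation itself, which is one reason more robust methods exist.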

Chapter 2: Data

The Type of Data: Data sets differ in a number of ways. The type of data determines which tools and
techniques can be used to analyse the data.
The Quality of the Data: Data is often far from perfect. Data quality issues that often need to be
addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and
data that is biased or, in some other way, unrepresentative of the phenomenon or population that
the data is supposed to describe.
Preprocessing Steps to Make the Data More Suitable for Data Mining: Often, the raw data must be
processed in order to make it suitable for analysis.
Analysing Data in Terms of Its Relationships: One approach to data analysis is to find relationships
among the data objects and then perform the remaining analysis using these relationships rather
than the data objects themselves.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. In turn, data objects are described by a
number of attributes that capture the characteristics of an object.

An attribute is a property or characteristic of an object that can vary, either from one object to
another or from one time to another. A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object. Formally, the process of measurement is
the application of a measurement scale to associate a value with a particular attribute of a specific
object.

It is common to refer to the type of an attribute as the type of a measurement scale.

The following properties (operations) of numbers are typically used to describe attributes:

- Distinctness: = and ≠
- Order: <, ≤, >, and ≥
- Addition: + and −
- Multiplication: × and /

Given these properties, we can define four types of attributes:

- Categorical (qualitative)
o Nominal
▪ The values of a nominal attribute are just different names.
▪ Provides only enough information to distinguish objects from one another.
▪ Transformation: any one-to-one mapping.
o Ordinal
▪ Provides enough information to order objects.
▪ Transformation: an order-preserving change of values.
- Numeric (quantitative)
o Interval
▪ The differences between values are meaningful, i.e., a unit of measurement exists (addition and subtraction).
▪ Transformation: new_value = a × old_value + b, where a and b are constants.
o Ratio
▪ Both differences and ratios are meaningful (multiplication and division).
▪ Transformation: new_value = a × old_value.

Each attribute type possesses all of the properties and operations of the attribute types above it.
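
The permissible transformations listed above can be checked numerically. The sketch below uses temperature in degrees Celsius as an interval attribute and length in metres as a ratio attribute; both examples and the function names are assumptions chosen for illustration, and the metre-to-foot factor is approximate.

```python
def celsius_to_fahrenheit(c):
    """Interval-scale transformation: new_value = a * old_value + b with a = 9/5, b = 32."""
    return 9 / 5 * c + 32


def metres_to_feet(m):
    """Ratio-scale transformation: new_value = a * old_value with a of roughly 3.28084."""
    return 3.28084 * m


# Interval attribute: ratios of values are NOT preserved by the transformation...
print(20 / 10)                                                # 2.0 in Celsius
print(celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10))  # about 1.36 in Fahrenheit

# ...but ratios of differences are preserved, so differences stay meaningful.
diff_ratio_c = (30 - 20) / (20 - 10)
diff_ratio_f = (celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)) / (
    celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
)
print(diff_ratio_c, diff_ratio_f)

# Ratio attribute: ratios of values survive a change of unit.
print(4 / 2, metres_to_feet(4) / metres_to_feet(2))
```

This is why a statement such as "twice as long" is meaningful for a ratio attribute, whereas "twice as hot" in degrees Celsius is not.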

An independent way of distinguishing between attributes is by the number of values they can take.

- Discrete – a discrete attribute has a finite or countably infinite set of values.
- Binary – binary attributes are a special case of discrete attributes and assume only two values,
e.g., true/false, yes/no, male/female, or 0/1.
- Continuous – a continuous attribute is one whose values are real numbers. Practically, real
values can be measured and represented only with limited precision.

Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are
continuous. However, count attributes, which are discrete, are also ratio attributes.

For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
It is also possible to have discrete or continuous asymmetric features.

Types of data sets

For convenience, we have grouped the types of data sets into three groups: record data, graph-based
data, and ordered data.
Before providing details of specific kinds of data sets, we discuss three characteristics that apply to
many data sets and have a significant impact on the data mining techniques that are used:

- Dimensionality
o The number of attributes that the objects in the data set possess. Analysing data with
a small number of dimensions tends to be qualitatively different from analysing
moderate or high-dimensional data. Indeed, the difficulties associated with the
analysis of high-dimensional data are sometimes referred to as the curse of
dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.
- Distribution
o The frequency of occurrence of various values or sets of values for the attributes
comprising data objects. For example, suppose a categorical attribute is used as a
class variable, where one of the categories occurs 95% of the time, while the other
categories together occur only 5% of the time. This skewness in the distribution can
make classification difficult. A special case of skewed data is sparsity. For sparse
binary, count or continuous data, most attributes of an object have values of 0. In
many cases, fewer than 1% of the values are non-zero. In practical terms, sparsity is
an advantage because usually only the non-zero values need to be stored and
manipulated.
- Resolution
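
Returning to the sparsity point under Distribution: the sketch below shows one way to store only the non-zero values of an object and to compute on them directly. The dictionary-based representation and the helper names are assumptions for illustration; real libraries use more compact formats.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as an {index: value} mapping."""
    return {i: v for i, v in enumerate(row) if v != 0}


def sparse_dot(a, b):
    """Dot product of two sparse rows; only indices that are non-zero in both contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter mapping
    return sum(v * b[i] for i, v in a.items() if i in b)


dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
sparse_row = to_sparse(dense_row)
print(sparse_row)       # two stored entries instead of ten
print(len(sparse_row))

x = to_sparse([0, 2, 0, 3])
y = to_sparse([1, 4, 0, 5])
print(sparse_dot(x, y))
```

The work done by `sparse_dot` depends on the number of non-zero entries rather than on the full dimensionality, which is the practical advantage described above.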
