Data Mining for Business and Governance


Lecture 1: Introduction to Data Mining
  Pattern classification
  Missing values
  Feature scaling
  Feature interaction
  Encoding strategies
  Class imbalance
  Practical session 1
  Formative quiz 1

Lecture 2: Pattern Classification
  Rule-based learning
  Bayesian learning
  Lazy learning
  Ensemble learning
  Practical session 2
  Formative quiz 2

Lecture 3: Evaluation and Model Selection
  Splitting the data
  Hyperparameter tuning
  Evaluation measures
  Theoretical concepts
  Practical session 3
  Formative quiz 3

Lecture 4: Explainable Artificial Intelligence
  Terminology
  Intrinsically interpretable models: white-box
  Post-hoc explanation methods: black-box
  Model-agnostic post-hoc methods
  Model-specific post-hoc methods
  Evaluation and measures
  Practical session 4
  Formative quiz 4

Lecture 5: Dimensionality Reduction Methods
  Visualization
  Role of dimensions
  Dimensionality reduction
  Feature selection
  Feature extraction
  Principal component analysis
  Deep neural networks
  Practical session 5
  Formative quiz 5

Lecture 6: Cluster Analysis for Data Mining
  Centroid-based clustering
  The k-means algorithm (hard clustering)
  The fuzzy c-means algorithm (soft clustering)
  Hierarchical clustering
  Spectral clustering
  Evaluation measures
  Practical session 6
  Formative quiz 6

Lecture 7: Association Rule for Data Mining
  Association rules
  Support and confidence
  Mining association rules
  The apriori algorithm
  Itemset taxonomy
  Practical session 7
  Formative quiz 7




Lecture 1: Introduction to Data Mining
Pattern classification
In this problem, we have three numerical variables (features) that are used to predict the outcome
or target (the decision class). The features are X1, X2, and X3, and the decision class is Y. Y is
always a category rather than a number: for example, dogs and cats, or different food types.
This problem is multi-class since we have three possible outcomes.
The goal of pattern classification is to build a model able to generalize well beyond the historical
training data.
How many features can a pattern classification model have? There is no limit; each classification
problem can have a different number of features. Likewise, the decision class can have any
number of categories.
Concerning the rows, the values are referred to as instances or observations.


Suppose we need to build a classification model based on this table. The instance marked '?' is
not contained in the data we build from, and we do not know its target value. The goal is to
create a classifier from the data we have that predicts the decision class for X1=0.6, X2=0.8,
and X3=0.2.
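As a minimal sketch of this workflow (assuming scikit-learn, with hypothetical training data since the slide's table is not reproduced in the text):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training instances with features X1, X2, X3 and class Y
# (the actual table from the slides is not reproduced in the text).
X_train = [[0.2, 0.9, 0.1],
           [0.7, 0.3, 0.8],
           [0.5, 0.6, 0.4],
           [0.1, 0.8, 0.3]]
y_train = ["cat", "dog", "bird", "cat"]

# Fit a classifier and predict the decision class for the '?' instance.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0.6, 0.8, 0.2]]))
```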





Missing values
Sometimes, instances have missing values for some features. In this example, the first column
X1 is complete and does not present any missing values, while X2 and X3 have many missing
values.
It is of paramount importance to deal with this situation before building any machine learning
or data mining model, as most models cannot handle missing values by themselves.
Missing values might result from fields that are not always applicable, incomplete
measurements, or lost values.




Imputation strategies for missing values
1. The simplest strategy would be to remove the features containing missing values (here,
   removing the X2 and X3 columns, which would not solve the problem). This strategy is
   recommended when the majority of the instances (observations) have missing values for
   that feature. However, there are situations in which we have few features or the feature
   we want to remove is deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
   instances having missing values. This is possible when we have large amounts of
   instances. However, there are situations in which we have a limited number of instances.
3. The third strategy is the most popular. It consists of replacing the missing values of a
   given feature with a representative value such as the mean, the median, or the mode of
   that feature (see the sketch after this list). However, we need to be aware that we are
   introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning model
   trained on the non-missing information.
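A minimal sketch of strategies 1-3 with pandas and scikit-learn, using a hypothetical toy table (the slide's data is not reproduced in the text):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table: X1 complete, X2 and X3 with missing values.
df = pd.DataFrame({
    "X1": [0.1, 0.4, 0.6, 0.9],
    "X2": [0.8, np.nan, 0.5, np.nan],
    "X3": [np.nan, 0.2, np.nan, 0.7],
})

dropped_features  = df.dropna(axis=1)  # strategy 1: drop columns with NaNs
dropped_instances = df.dropna(axis=0)  # strategy 2: drop rows with NaNs

# Strategy 3: replace NaNs with a representative value (here the mean).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```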


Autoencoders to impute missing values
Autoencoders are deep neural networks that involve two neural blocks named encoder and
decoder. The encoder reduces the problem dimensionality while the decoder completes the
pattern. They use unsupervised learning to adjust the weights that connect the neurons.
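A minimal sketch of such an encoder/decoder pair (assuming Keras; the layer sizes are hypothetical, not the course's architecture):

```python
import numpy as np
from tensorflow import keras

n_features = 10  # hypothetical input dimensionality

# Encoder compresses the input; decoder reconstructs (completes) it.
inputs  = keras.Input(shape=(n_features,))
encoded = keras.layers.Dense(3, activation="relu")(inputs)
decoded = keras.layers.Dense(n_features, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Unsupervised training: the target is the input itself.
X = np.random.rand(200, n_features)  # hypothetical data scaled to [0, 1]
autoencoder.fit(X, X, epochs=10, verbose=0)

# To impute, fill missing entries with a placeholder (e.g., the mean)
# and replace them with the autoencoder's reconstruction.
X_reconstructed = autoencoder.predict(X, verbose=0)
```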


Missing values and recommender systems
The input presents three possible states: the person can like the movie, dislike the movie, or has
not watched it/has not expressed interest (which are the missing values). This neural network
takes the information we know (like or dislike) and is able to provide later on the value for the
missing value, providing a recommendation for viewers.





Feature scaling
Normalization
- Different features might encode different measurements and scales (e.g., the age and height of a person).
- Normalization encodes all numeric features on the [0,1] scale.
- We subtract the feature's minimum from the value to be transformed and divide the result by the feature's range.


Standardization
- This transformation is similar to normalization, but the transformed values might not lie in the [0,1] interval.
- We subtract the mean from the value to be transformed and divide the result by the standard deviation.
- Normalization and standardization might lead to different scaling results.
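In formula form (standard definitions matching the verbal descriptions above; the slide's equations are not reproduced in the text):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \quad \text{(normalization)} \qquad z = \frac{x - \mu}{\sigma} \quad \text{(standardization)}$$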


Similarities and differences
Both methods are applied to a whole column of a dataset.
Both methods can be used to put every feature on the same scale.
Neither method changes the properties of the data, only the scale.


Normalization always produces values on the [0,1] scale: in picture (b), the values are confined
to the unit square from 0 to 1.
Standardization does not necessarily produce values on a [0,1] scale: in picture (c), the values
range from -1 to +2.
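Both transformations are available in scikit-learn; a short sketch with hypothetical age/height data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: age (years) and height (cm).
X = np.array([[25, 180.0],
              [40, 165.0],
              [33, 172.0]])

X_norm = MinMaxScaler().fit_transform(X)    # each column mapped to [0, 1]
X_std  = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
```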



Feature interaction
Correlation between two numerical variables
Sometimes, we need to measure the correlation between numerical features describing a
certain problem domain. For example, what is the correlation between age and income in
country X?




Pearson’s correlation
- It is used when we want to determine the correlation between two numerical variables given k observations.
- It is intended for numerical variables only, and its value lies in [-1,1].
- The order of the variables does not matter since the coefficient is symmetric.
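In symbols, the standard definition given k observations (the slide's formula is not reproduced in the text):

$$r = \frac{\sum_{i=1}^{k}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{k}(y_i - \bar{y})^2}}$$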




Correlation between age and glucose levels




Example from the first column (with $\bar{x} = 41.16$ and $\bar{y} = 81$):
$(x_i - \bar{x})(y_i - \bar{y}) = (43 - 41.16)(99 - 81) = 33$
$(x_i - \bar{x})^2 = (43 - 41.16)^2 = 3.36$
$(y_i - \bar{y})^2 = (99 - 81)^2 = 324$

Pearson's correlation here is r = 0.53, which indicates a medium positive correlation (medium
because it is well below 1, the perfect positive correlation).
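The same result can be reproduced with scipy. The values below are the classic textbook age/glucose example and are consistent with the numbers quoted above (first pair 43/99, means 41.16 and 81), but the full table is an assumption since it is not reproduced in the text:

```python
from scipy.stats import pearsonr

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r, p_value = pearsonr(age, glucose)
print(round(r, 2))  # 0.53
```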


Association between two categorical variables
Sometimes, we need to measure the association degree between two categorical (ordinal or
nominal) variables.


The χ² association measure (chi-square statistic)
- It is used when we want to measure the association between two categorical variables given k observations.
- We should compare the frequencies of values appearing together with their individual frequencies.
- The first step in that regard is to create a contingency table.



- Let us assume that a categorical variable X involves m possible categories while Y involves n categories.
- The observed value gives how many times each combination was found.
- The expected value is the product of the individual frequencies divided by the number of observations.
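In symbols (the standard chi-square statistic, matching the description above; the slide's formula is not reproduced in the text):

$$\chi^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{k},$$

where $O_{ij}$ is the observed frequency of combination $(i, j)$, $R_i$ and $C_j$ are the row and column totals, and $k$ is the total number of observations.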




Association between gender and eye color
If we have some data, the first step is to build a contingency table containing the two categorical
features and their categories. The numbers inside the table are frequency counts; the numbers
outside the table are the sums of the columns and rows. There are two categorical variables,
such that the first one has n=2 categories and the second has m=3 categories.




We have 26 males, of whom 6 have blue eyes, 8 have green eyes, and 12 have brown eyes. The
number of people with blue, green, and brown eyes is 15, 13, and 22, respectively.










We have 24 females, of whom 9 have blue eyes, 5 have green eyes, and 10 have brown eyes.
The number of people with blue, green, and brown eyes is 15, 13, and 22, respectively.
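The full test can be run with scipy on the observed table given above; the expected counts follow the row-total × column-total / k rule:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table from the example:
# rows = gender (male, female), columns = eye color (blue, green, brown).
observed = np.array([[6, 8, 12],
                     [9, 5, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)
# expected[0, 0] = 26 * 15 / 50 = 7.8 (males with blue eyes), etc.
```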









Encoding strategies
Encoding categorical features
We have different types of features: numerical and categorical. Numerical features can all be
expressed on the same scale (with either normalization or standardization). Categorical features,
however, cannot be fed directly to a machine learning algorithm; we need to encode them as
numerical quantities.


The first strategy, referred to as label encoding, consists of assigning an integer to each category.
It only makes sense when there is an ordinal relationship among the categories, for example
weekdays, months, star-based hotel ratings, or income categories.


One-hot encoding
It is used to encode nominal features that lack an ordinal relationship.
Each category of the categorical feature is transformed into a binary feature, such that a one
marks the category.
This strategy often increases the problem dimensionality notably, since each feature is encoded
as a binary vector.
Suppose we have three instances of a problem aimed at classifying animals given a set of
features (not shown for simplicity); a sketch of both encoding strategies follows.
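A minimal sketch of both encodings with pandas, using hypothetical animal categories:

```python
import pandas as pd

# Hypothetical instances of the animal-classification example.
df = pd.DataFrame({"animal": ["cat", "dog", "bird"]})

# Label encoding: one integer per category (sensible only for ordinal data).
df["animal_label"] = df["animal"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")
```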




