Lecture 1
Pattern classification
- In this problem, we have 3 numerical variables (features) to be
used to predict the outcome (decision class).
- It is a multi-class problem since there are 3 possible outcomes.
The goal in pattern classification is to build a model able to generalize
well beyond the historical training data.
In this lecture we cover 3 main things:
1. How to deal with missing values
2. How to compute the correlation/association between two
features
3. Methods to encode categorical features and handle class imbalance
Missing values
Missing values might result from fields that are not always applicable, incomplete
measurements, or lost values.
Imputation strategies for missing values:
1. Simplest strategy → remove the feature containing missing values.
➢ Recommended when the majority of the instances (observations) have missing
values for that feature.
➢ However, this is not an option when we have only a few features or when the
feature we would remove is deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
instances having missing values.
3. Most popular → replacing the missing values for a given feature with a
representative value such as the mean, the median or the mode of that feature.
➢ However, we need to be aware that we are introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning
model trained on the non-missing information.
5. Autoencoders are deep neural networks that involve two neural blocks named
encoder and decoder. The encoder reduces the problem dimensionality while the
decoder reconstructs the pattern, which makes it possible to fill in the
missing values.
➢ They use unsupervised learning to adjust the weights that connect the
neurons.
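The most popular strategy (3, replacing missing values with a representative value) can be sketched in a few lines of pure Python. The function name `impute` and the example ages are my own illustration, not from the lecture:

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with a representative value of the feature."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical feature with two missing entries:
ages = [23, None, 31, 27, None, 35]
print(impute(ages, "mean"))  # missing ages replaced by 29
```

Note that every imputed instance receives the same fill value, which is exactly the noise the lecture warns about: the imputed feature's variance shrinks artificially.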
Feature scaling
1. Normalization
➢ Different features might encode different measurements
and scales (the age and height of a person)
➢ Normalization allows encoding all numeric features in the
[0,1] scale
➢ We subtract the minimum from the value to be
transformed and divide the result by the feature range.
2. Standardization
➢ This transformation method is similar to the
normalization, but the transformed values might not be in
the [0,1] interval
➢ We subtract the mean from the value to be transformed
and divide the result by the standard deviation.
➢ Normalization and standardization might lead to different
scaling results.
Normalization vs. standardization
- These feature scaling approaches might be affected by extreme values (outliers); normalization is especially sensitive since it relies on the feature's minimum and maximum.
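Both transformations can be written directly from the formulas above. This is a minimal pure-Python sketch (the function names and the sample heights are mine); `pstdev` is the population standard deviation:

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: map each value into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std. dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

heights = [150, 160, 170, 180, 190]
print(normalize(heights))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights))  # zero mean, but values fall outside [0, 1]
```

Running both on the same data shows the point from the slide: the normalized values land in [0, 1], while the standardized ones do not.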
Feature interaction
1. Correlation between two numerical variables → Sometimes, we need to measure the
correlation between numerical features describing a certain problem domain.
➢ For example, what is the correlation between age and income in Sweden?
2. Pearson’s correlation → it is used when we want to determine the correlation
between two numerical variables given k observations.
➢ It is intended for numerical variables only and its value lies in [-1, 1]
➢ The order of variables does not matter since the coefficient is symmetric.
Example: correlation between age and glucose levels
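Pearson's coefficient can be computed from its definition: the covariance of the two variables divided by the product of their standard deviations. The sketch below uses made-up age/glucose observations purely for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two numerical variables."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative (made-up) observations for k = 6 people:
age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
print(round(pearson(age, glucose), 2))  # → 0.53
```

The result lies in [-1, 1], and swapping the arguments gives the same value, matching the symmetry property mentioned above.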
The terminology differs: we use correlation when we are working with numerical
data, and we use association when we are working with categorical data.
3. Association between two categorical variables → sometimes, we need to measure
the association degree between two categorical (ordinal or nominal) variables.
➢ For example, what is the association between gender and eye color?
4. The χ² (chi-square) association measure → it is used when we want to measure the
association between two categorical variables given k observations.
➢ We should compare the frequencies of values appearing together with their
individual frequencies
➢ The first step in that regard would be to create a contingency table.
➢ Let us assume that a categorical variable X involves m possible categories while Y
involves n categories.
➢ The observed value gives how many times each combination was found.
➢ The expected value of a cell is the product of the corresponding row and column
totals divided by the total number of observations.
Association between gender and eye color
An example like this is very likely to appear in the exam.
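The χ² computation from a contingency table can be sketched as follows. The gender × eye-color counts are hypothetical numbers I made up for illustration:

```python
def chi_square(table):
    """Chi-square statistic for an m x n contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected = (row total * column total) / number of observations
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts; rows: male, female; columns: brown, blue, green:
table = [[30, 10, 10],
         [20, 20, 10]]
print(round(chi_square(table), 3))  # → 5.333
```

A value of 0 means the observed counts exactly match the expected ones (no association); larger values indicate a stronger association between the two variables.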
Encoding strategies
Encoding categorical features → some machine learning, data mining algorithms or
platforms cannot operate with categorical features. Therefore, we need to encode these
features as numerical quantities.
1. Label encoding → consists of assigning integer numbers to each category. It only
makes sense if there is an ordinal relationship among the categories.
➢ E.g., weekdays, months, star-based hotel ratings, income
categories.
2. One-hot encoding → is used to encode nominal features that
lack an ordinal relationship. Each category of the categorical
feature is transformed into a binary feature such that one
marks the category.
➢ This strategy often increases the problem dimensionality
notably since each feature is encoded as a binary vector.
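Both encoding strategies are straightforward to sketch in pure Python (the function names and sample categories are mine):

```python
def label_encode(values, order):
    """Map ordinal categories to integers according to a given order."""
    index = {cat: i for i, cat in enumerate(order)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Turn each nominal category into its own binary feature."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Ordinal feature: the order low < medium < high is meaningful.
ratings = ["low", "high", "medium", "low"]
print(label_encode(ratings, ["low", "medium", "high"]))  # [0, 2, 1, 0]

# Nominal feature: no order, so each category becomes a binary column.
colors = ["red", "green", "blue"]
print(one_hot_encode(colors))
```

The one-hot output makes the dimensionality cost visible: a single feature with n categories becomes n binary features.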
Class imbalance
Sometimes we face problems with many more instances belonging to one decision class than
to the other classes.
- In this example, we have more instances labelled with the
negative decision class than the positive one.
Classifiers are tempted to recognize the majority decision class only.
Simple strategies:
1. Undersampling → select only a subset of the instances from the majority
decision class, provided we retain enough instances.
2. Oversampling → create new instances belonging to the minority class
(e.g., random copies of existing minority instances).
These strategies are applied to the data when building the model.
SMOTE → synthetic minority oversampling technique. It is a popular
strategy to deal with class imbalance.
- Creates synthetic instances in the neighborhoods of instances
belonging to the minority class.
- Caution is advised since the classifier is forced to learn from
artificial instances, which might induce noise.
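The core SMOTE idea, interpolating between a minority instance and a nearby minority neighbour, can be sketched as below. This is a simplified illustration, not the full published algorithm (which samples among the k nearest neighbours); the function name and the toy 2-D points are mine:

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create synthetic minority instances by interpolating between a
    minority instance and its nearest minority neighbour (SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # Nearest neighbour of a among the other minority instances
        # (squared Euclidean distance).
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        t = rng.random()  # random point on the segment between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (2.0, 2.5), (1.5, 1.0)]
print(smote_like(minority, 4))  # 4 synthetic points near the minority class
```

Because each synthetic point lies on a segment between two real minority instances, it stays inside the region the minority class already occupies; the noise risk mentioned above comes from the fact that these points were never actually observed.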
Lecture 2
Classification problem
In this problem, we have four categorical (ordinal and nominal) features to be
used to predict the outcome.
We have only two possible outcomes or decision classes (binary problem).
The goal in pattern classification is to build a model to generalize well beyond
the historical training data.
Rule-based learning: in this approach, the classification problem is modelled as
a set of rules involving features and their values in the antecedent of such rules
and decision classes in the consequent.
- Algorithm → decision trees are perhaps the most popular algorithm of this
paradigm.