Data Mining Summary

Lecture 1: Introduction and preliminaries
Pattern classification
- Figure 1.1 is our dataset; we have to organise it through features and instances.
  o Features: the variables describing the problem. In Figure 1.1, X1, X2 and X3 are the features describing the problem. Y is a special feature, which we call the decision feature or the target.
  o Instances (examples): placed by row. For each instance we have a set of features, or variables, describing the instance, together with the value of the target.
  (Figure 1.1: Pattern classification table)
- This is why we call it a supervised classification problem.
  o It is supervised because we have knowledge about the target: the variables are related to a target.
  o We need to build a model such that, if we provide X1 = 0.5, X2 = 0.9 and X3 = 0.5, the model is able to predict/produce c1, the value of Y (see the sketch after this list).
- Multi-class classification: multi-class because the target can take more than two possible values (decision classes), for example c1, c2 and c3.
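To make this concrete, here is a minimal sketch in Python, assuming a small made-up table in the spirit of Figure 1.1 and using scikit-learn's DecisionTreeClassifier as an arbitrary choice of model (the lecture does not prescribe one):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset in the spirit of Figure 1.1: three features and a target Y
data = pd.DataFrame({
    "X1": [0.5, 0.1, 0.9, 0.4, 0.8],
    "X2": [0.9, 0.2, 0.7, 0.8, 0.1],
    "X3": [0.5, 0.3, 0.6, 0.4, 0.2],
    "Y":  ["c1", "c2", "c1", "c1", "c3"],  # three decision classes -> multi-class
})

model = DecisionTreeClassifier(random_state=0)
model.fit(data[["X1", "X2", "X3"]], data["Y"])

# Predict the class for the instance X1 = 0.5, X2 = 0.9, X3 = 0.5;
# since this point appears in the training data with label c1, the tree returns c1.
print(model.predict(pd.DataFrame([[0.5, 0.9, 0.5]], columns=["X1", "X2", "X3"])))
```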
Missing values
- In Figure 1.2 we have missing values; we denote these with "?".
- The reasons can differ:
  o An error when measuring the data.
  o The information is not applicable to this particular case.
  o Something else went wrong when the data was collected.
- Different strategies to cope with missing data:
  o Removing the feature containing the missing values. If we do this in Figure 1.2, only X1 will remain.
  o Removing the instances with missing values. Also not advised, because you can end up with no features, a limited number of features, or you can miss relevant information.
  (Figure 1.2: Missing-values dataset)
  o Replacing the missing values with the most popular value per column, i.e. per feature. If the feature is numerical, this can simply be the average; if it is categorical, it can be the mode (the value that appears most often). This is the most popular of the replacement strategies (see the sketch at the end of this section).
  o These strategies, however, can induce noise: we are completing the data with information that we do not actually know. There are fancier strategies to fill missing data.
  o Using neural networks (a type of machine-learning model inspired by the way the human brain works):
    ▪ An autoencoder is a neural network consisting of two blocks, the encoder and the decoder.
    ▪ Figure 1.3 shows the general architecture: first, the input layer with neurons. Neurons are the neural processing entities; this input layer captures the information.
  (Figure 1.3: The general architecture)
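Below is a minimal sketch of the most popular replacement strategy, assuming a pandas DataFrame with hypothetical columns X1, X2 (numerical) and X3 (categorical); the names and values are made up for illustration, with NaN playing the role of the "?" marker:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (np.nan plays the role of "?")
df = pd.DataFrame({
    "X1": [0.5, np.nan, 0.9, 0.4],
    "X2": [0.9, 0.2, np.nan, 0.8],
    "X3": ["a", "b", "b", np.nan],
})

# Numerical features: replace missing values with the column average
for col in ["X1", "X2"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical features: replace missing values with the mode (most frequent value)
df["X3"] = df["X3"].fillna(df["X3"].mode().iloc[0])

print(df)
```

And, purely as an architectural illustration of the fancier neural-network route, a minimal encoder/decoder pair in PyTorch (the layer sizes are assumptions, not taken from Figure 1.3):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Two blocks as in Figure 1.3: an encoder that compresses the input
    features and a decoder that reconstructs them (layer sizes are made up)."""
    def __init__(self, n_features: int = 3, n_hidden: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
reconstruction = model(torch.randn(4, 3))  # 4 instances, 3 features
```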



Feature scaling
- Why do we need feature scaling?
  o Feature scaling is only applicable to numerical data. In a dataset with a number of numerical features, it is very unlikely that all those features are expressed in the same interval, in the same domain.
  o For example, feature X1 can take values between 1 and 5, while X2 takes values between 1 and 1000. They are expressed on different scales, and comparing or combining them directly will give misleading results.
  o This is the reason why we first need to standardise or normalise, to ensure that all features are expressed on the same scale.

- Two strategies for feature scaling:

o Normalisation
o Standardisation

- Normalisation (Figure 1.4: Normalisation formula)
  o Normalisation is applied feature by feature. We take every value x in a column and subtract the minimum value observed in that column; then we divide by the denominator, which is the maximum value observed in the column minus the minimum:

    x' = (x - min(x)) / (max(x) - min(x))
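A minimal sketch of this min-max normalisation with NumPy (the column values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])            # one feature / column (made-up values)
x_norm = (x - x.min()) / (x.max() - x.min())  # subtract the minimum, divide by (max - min)
print(x_norm)                                 # every value now lies between 0 and 1
```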
- Standardisation (Figure 1.5: Standardisation formula)
  o Similar to normalisation, but we use the mean and the standard deviation computed per column:

    z = (x - mean(x)) / std(x)




(Figure 1.6: Normalization versus standardization)

- In the original data, the first feature (the x axis) has a domain between 0 and 3, while the second feature (the y axis) has a domain between 0 and 2.
- With normalisation, everything is enclosed in one unit: the values produced by the normalisation strategy will always be between 0 and 1.
- With standardisation, we can have negative values; this is one of the main differences between the two strategies.
- But we are not changing the properties of the data, just the scale in which the data is represented.
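A minimal sketch contrasting the two strategies on the same made-up column: the normalised values stay in [0, 1], while the standardised values are centred around 0 and can be negative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])            # made-up column

x_norm = (x - x.min()) / (x.max() - x.min())  # normalisation -> values in [0, 1]
x_std = (x - x.mean()) / x.std()              # standardisation -> mean 0, can be negative

print(x_norm)  # [0.   0.25 0.5  1.  ]
print(x_std)   # negative values for entries below the mean
```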
Feature interaction
- An important step is discovering the interactions between features.
- Pearson's correlation can be used to describe the relationship between two numerical features; both features must be numerical, though!
- An example: the correlation between gender (encoded as a numerical feature) and income in Sweden.
  (Figure 1.7: Three ways of correlation between two numerical variables)
- The closer the dots are to the line, the stronger the correlation.
- Pearson's correlation takes values between -1 and 1.
Pearson's correlation
- x_i = each of the values associated with a feature / a column.
- x̄ (the x with the line above it) = the mean, so the average value of x.
- y_i = each of the values of the second feature, i.e. the second column.
- ȳ = the mean, so the average value of y.
- The numerator is the part above the line; the denominator (the part under the line) is built from the same components:

    r = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)² · Σ (y_i - ȳ)² )

  (Figure 1.8: Pearson's correlation)
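A minimal sketch that computes Pearson's correlation directly from this formula and checks it against NumPy's built-in np.corrcoef (the two columns are made-up values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # first feature (made-up)
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])   # second feature (made-up)

# Numerator: sum of the products of the deviations from the means
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: square root of the product of the summed squared deviations
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

r = num / den
print(r)                        # lies between -1 and 1
print(np.corrcoef(x, y)[0, 1])  # should match the manual computation
```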
